{"count": 4211, "next": null, "previous": null, "results": [{"id": 36166, "uid": "b194c41cbdf4c2b779d2615875139971", "name": "TRM-VLA: Temporal-Aware Chain-of-Thought Reasoning and Memorization for Vision-Language-Action Models", "authors": [{"id": 158989, "fullname": "LI XIANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/158989?format=json", "institution": "Tsinghua University"}, {"id": 88509, "fullname": "Yali Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88509?format=json", "institution": "Tsinghua University"}, {"id": 102550, "fullname": "Yuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/102550?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 88500, "fullname": "Shengjin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88500?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Vision-Language-Action (VLA) models have emerged as a powerful paradigm for general robotic manipulation. However, existing approaches typically omit intermediate reasoning steps and directly regress actions, limiting reasoning interpretability and performance in long-horizon or compositional tasks. Although recent studies introduce Chain-of-Thought (CoT) reasoning into VLA models, their effectiveness remains suboptimal due to two key issues: (1) generating a full reasoning trajectory at every timestep introduces substantial redundancy, thereby hinders real-time deployment and (2) reasoning is performed independently, neglecting temporal consistency, which leads to planning conflicts.We propose TRM-VLA, a temporal-aware reasoning and memorization framework that integrates explicit temporal modeling into the VLA reasoning process. TRM-VLA consists of two core components: (1) Keyframe-Triggered Reasoning (KTR), which identifies task progress and performs hierarchical CoT reasoning only at key decision points to reduce redundant inference; and (2) Granularity-adaptable Context Memory (GCM), which dynamically stores and retrieves historical reasoning trajectories to maintain inter-frame coherence and global context. Built upon a dual-system architecture\u2014combining a multimodal foundation model for slow reasoning (System 2) with a diffusion-based policy for fast execution (System 1)\u2014TRM-VLA learns to plan and act efficiently in a unified manner. Extensive experiments on LIBERO-90, SIMPLER, and four real-world robotic tasks demonstrate that TRM-VLA achieves state-of-the-art performance while improving reasoning efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36166", "url": null, "sourceid": 33473, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36169, "uid": "364154911ada592eb2cdbb776033ebce", "name": "Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping", "authors": [{"id": 181987, "fullname": "Tianxiang Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/181987?format=json", "institution": "Peking University"}, {"id": 181881, "fullname": "Hulingxiao He", "url": "http://cvpr.thecvf.com/api/miniconf/users/181881?format=json", "institution": "Peking University"}, {"id": 87023, "fullname": "Yuxin Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87023?format=json", "institution": "Peking University"}], "abstract": "The widespread use of smartphones has made photography ubiquitous, yet a clear gap remains between ordinary users and professional photographers, who can identify aesthetic issues and provide actionable shooting guidance during capture. We define this capability as aesthetic guidance (AG) \u2014 an essential but largely underexplored domain in computational aesthetics. Existing multimodal large language models (MLLMs) primarily offer overly positive feedback, failing to identify issues or provide actionable guidance. Without AG capability, they cannot effectively identify distracting regions or optimize compositional balance, thus also struggling in aesthetic cropping, which aims to refine photo composition through reframing after capture. To address this, we introduce AesGuide, the first large-scale AG dataset and benchmark with 10,748 photos annotated with aesthetic scores, analyses, and guidance. Building upon it, we propose Venus, a two-stage framework that first empowers MLLMs with AG capability through progressively complex aesthetic questions and then activates their aesthetic cropping power via CoT-based rationales. Extensive experiments show that Venus substantially improves AG capability and achieves state-of-the-art (SOTA) performance in aesthetic cropping, enabling interpretable and interactive aesthetic refinement across both stages of photo creation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36169", "url": null, "sourceid": 37717, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36170, "uid": "b2e65e738c327d1a8c3c27092d00b6c1", "name": "CAPT : Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment", "authors": [{"id": 180670, "fullname": "Maoyuan Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180670?format=json", "institution": "Minzu University of China"}, {"id": 184321, "fullname": "Yutong Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184321?format=json", "institution": "Minzu University of China"}, {"id": 184322, "fullname": "Xinyang Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184322?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 184323, "fullname": "Lijuan Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/184323?format=json", "institution": "National Library of China"}, {"id": 130585, "fullname": "Guoshun Nan", "url": "http://cvpr.thecvf.com/api/miniconf/users/130585?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 184324, "fullname": "Chuang Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184324?format=json", "institution": "Beijing University of Posts and Telecommunications"}], "abstract": "Vision-language models like CLIP have achieved remarkable progress in cross-modal representation learning, yet suffer from systematic misclassifications among visually and semantically similar categories.  We observe that such confusion patterns are not random but persistently occur between specific category pairs, revealing the model\u2019s intrinsic bias and limited fine-grained discriminative ability.  To address this, we propose **CAPT**, a **C**onfusion-**A**ware **P**rompt **T**uning framework that enables models to learn from their own misalignment.  Specifically, we construct a Confusion Bank to explicitly model stable confusion relationships across categories and misclassified samples.  On this basis, we introduce a Semantic Confusion Miner (SEM) to capture global inter-class confusion through semantic difference and commonality prompts, and a Sample Confusion Miner (SAM) to retrieve representative misclassified instances from the bank and capture sample-level cues through a Diff-Manner Adapter that integrates global and local contexts.  To further unify confusion information across different granularities, a Multi-Granularity Difference Expert (MGDE) module is designed to jointly leverage semantic- and sample-level experts for more robust confusion-aware reasoning.  Extensive experiments on 11 benchmark datasets demonstrate that our method significantly reduces confusion-induced errors while enhancing the discriminability and generalization of both base and novel classes, successfully resolving 50.72% of confusable sample pairs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36170", "url": null, "sourceid": 30657, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36175, "uid": "7a2c5f44c553096c00bd05e62c4f3771", "name": "Olbedo: An Albedo and Shading Aerial Dataset for Large-Scale Outdoor Environments", "authors": [{"id": 145523, "fullname": "Shuang Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/145523?format=json", "institution": "The Ohio State University"}, {"id": 184335, "fullname": "Debao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184335?format=json", "institution": "Ohio State University, Columbus"}, {"id": 184336, "fullname": "Deyan Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184336?format=json", "institution": "Ohio State University, Columbus"}, {"id": 184337, "fullname": "Haolin Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/184337?format=json", "institution": "University of Southern California"}, {"id": 184338, "fullname": "Yang Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184338?format=json", "institution": "Ohio State University, Columbus"}, {"id": 128277, "fullname": "Yajie Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/128277?format=json", "institution": "University of Southern California"}, {"id": 75794, "fullname": "Rongjun Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/75794?format=json", "institution": "The Ohio State University"}], "abstract": "Intrinsic image decomposition (IID) of outdoor scenes is crucial for relighting, editing, and understanding large-scale environments, but progress has been limited by the lack of real-world datasets with reliable albedo and shading supervision. We introduce \\textit{Olbedo}, a large-scale aerial dataset for outdoor albedo--shading decomposition in the wild. \\textit{Olbedo} contains 5,664 UAV images captured across four landscape types, multiple years, and diverse illumination conditions. Each view is accompanied by multi-view consistent albedo and shading maps, metric depth, surface normals, sun and sky shading components, camera poses, and, for recent flights, measured HDR sky domes. These annotations are derived from an inverse-rendering refinement pipeline over multi-view stereo reconstructions and calibrated sky illumination, together with per-pixel confidence masks. We demonstrate that \\textit{Olbedo} enables state-of-the-art diffusion-based IID models, originally trained on synthetic indoor data, to generalize to real outdoor imagery: fine-tuning on \\textit{Olbedo} significantly improves single-view outdoor albedo prediction on the MatrixCity benchmark. We further illustrate applications of \\textit{Olbedo}-trained models to multi-view consistent relighting of 3D assets, material editing, and scene change analysis for urban digital twins. We release the dataset, baseline models, and an evaluation protocol to support future research in outdoor intrinsic decomposition and illumination-aware aerial vision.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36175", "url": null, "sourceid": 44157, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36176, "uid": "6c814714356d2058b2b2445291147fea", "name": "Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers", "authors": [{"id": 176005, "fullname": "jian ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/176005?format=json", "institution": "OPPO"}, {"id": 176039, "fullname": "Qirong Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/176039?format=json", "institution": "OPPO"}, {"id": 175912, "fullname": "Xujie Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175912?format=json", "institution": "Sun Yat-sen University"}, {"id": 175931, "fullname": "Peixing Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/175931?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 184339, "fullname": "Chen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184339?format=json", "institution": "OPPO AI Center"}, {"id": 154487, "fullname": "Haonan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154487?format=json", "institution": "OPPO Guangdong Mobile Telecommunications Co., Ltd."}], "abstract": "Diffusion Transformers (DiTs) have shown exceptional performance in image generation, yet their large parameter counts incur high computational costs, impeding deployment in resource-constrained settings. To address this, we propose Pluggable Pruning with Contiguous Layer Distillation (PPCL), a flexible structured pruning framework specifically designed for DiT architectures. First, we identify redundant layer intervals through a linear probing mechanism combined with the first-order differential trend analysis of similarity metrics. Subsequently, we propose a plug-and-play teacher-student alternating distillation scheme tailored to integrate depth-wise and width-wise pruning within a single training phase. This distillation framework enables flexible knowledge transfer across diverse pruning ratios, eliminating the need for per-configuration retraining. Extensive experiments on multiple Multi-Modal Diffusion Transformer architecture models demonstrate that PPCL achieves a 50\\% reduction in parameter count compared to the full model, with less than 3\\% degradation in key objective metrics. Notably, our method maintains high-quality image generation capabilities while achieving higher compression ratios, rendering it well-suited for resource-constrained environments.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36176", "url": null, "sourceid": 34925, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36180, "uid": "4c5637327b218329fb37b0c97223fff2", "name": "Scal3R: Scalable Test-Time Training for Feed-forward Large-Scale 3D Reconstruction", "authors": [{"id": 159903, "fullname": "Tao Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/159903?format=json", "institution": "Zhejiang University"}, {"id": 158268, "fullname": "Peishan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158268?format=json", "institution": "Zhejiang University"}, {"id": 153853, "fullname": "Yudong Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/153853?format=json", "institution": "Zhejiang University"}, {"id": 184346, "fullname": "Yingfeng Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/184346?format=json", "institution": null}, {"id": 88382, "fullname": "Wei Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/88382?format=json", "institution": " Shenzhen DJI Sciences and Technologies Ltd."}, {"id": 176618, "fullname": "Weiqiang Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/176618?format=json", "institution": "Horizon Robotics"}, {"id": 184347, "fullname": "Qian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184347?format=json", "institution": "Horizon Robotics"}, {"id": 91241, "fullname": "Wei Hua", "url": "http://cvpr.thecvf.com/api/miniconf/users/91241?format=json", "institution": "Zhejiang Lab"}, {"id": 76570, "fullname": "Sida Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/76570?format=json", "institution": "Zhejiang University"}, {"id": 184348, "fullname": "Xiaoyang Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/184348?format=json", "institution": "Horizon Robotics"}, {"id": 76363, "fullname": "Xiaowei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/76363?format=json", "institution": "Zhejiang University"}], "abstract": "This paper addresses the task of large-scale 3D scene reconstruction from long video sequences. Recent feed-forward reconstruction models have shown promising results by directly regressing 3D geometry from RGB images without explicit 3D priors or geometric constraints. However, these methods often struggle to maintain reconstruction accuracy and consistency over long sequences due to limited memory capacity and the inability to effectively capture global contextual cues. In contrast, humans can naturally exploit the global understanding of the scene to inform local perception. Motivated by this, we propose a novel neural global context representation that efficiently compresses and retains long-range scene information, enabling the model to leverage extensive contextual cues for enhanced reconstruction accuracy and consistency. The context representation is realized through a set of lightweight neural sub-networks that are rapidly adapted during test time via self-supervised objectives, which substantially increases memory capacity without incurring significant computational overhead. The experiments on multiple large-scale benchmarks, including the KITTI Odometry~\\cite{Schops_2019_CVPR} and Oxford Spires~\\cite{tao2025spires} datasets, demonstrate the effectiveness of our approach in handling ultra-large scenes, achieving state-of-the-art performance in both pose estimation and 3D reconstruction accuracy while maintaining efficiency.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36180", "url": null, "sourceid": 30659, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36200, "uid": "c5136a36b0bdea61cf049154a776ecc2", "name": "RAID: Retrieval-Augmented Anomaly Detection", "authors": [{"id": 183559, "fullname": "Mingxiu Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/183559?format=json", "institution": "Northeastern University"}, {"id": 184430, "fullname": "Zhe Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184430?format=json", "institution": "Northeastern University"}, {"id": 75946, "fullname": "Gaochang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75946?format=json", "institution": "Northeastern University"}, {"id": 184431, "fullname": "Tianyou Chai", "url": "http://cvpr.thecvf.com/api/miniconf/users/184431?format=json", "institution": null}, {"id": 75928, "fullname": "Xiatian Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75928?format=json", "institution": "University of Surrey"}], "abstract": "Unsupervised Anomaly Detection (UAD) aims to identify abnormal regions by establishing correspondences between test images and normal templates. Existing methods primarily rely on image reconstruction or template retrieval but face a fundamental challenge: matching between test images and normal templates inevitably introduces noise due to intra-class variations, imperfect correspondences, and limited templates. Observing that Retrieval-Augmented Generation (RAG) leverages retrieved samples directly in the generation process, we reinterpret UAD through this lens and introduce RAID, a retrieval-augmented UAD framework designed for noise-resilient anomaly detection and localization. Unlike standard RAG that enriches context or knowledge, we focus on using retrieved normal samples to guide noise suppression in anomaly map generation. RAID retrieves class-, semantic-, and instance-level representations from a hierarchical vector database, forming a coarse-to-fine pipeline. A matching cost volume correlates the input with retrieved exemplars, followed by a guided Mixture-of-Experts (MoE) network that leverages the retrieved samples to adaptively suppress matching noise and produce fine-grained anomaly maps. RAID achieves state-of-the-art performance across full-shot, few-shot, and multi-dataset settings on MVTec, VisA, MPDD, and BTAD benchmarks. Code and models will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36200", "url": null, "sourceid": 36690, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36189, "uid": "93c6c0bd09a280606df12f66752d7b76", "name": "Learning to Track Instance from Single Nature Language Description", "authors": [{"id": 155926, "fullname": "Yaozong Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/155926?format=json", "institution": "Guangxi Normal University"}, {"id": 128975, "fullname": "Bineng Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/128975?format=json", "institution": "Guangxi Normal University"}, {"id": 155925, "fullname": "Qihua Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155925?format=json", "institution": "Guangxi Normal University"}, {"id": 184384, "fullname": "Shuimu Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184384?format=json", "institution": null}, {"id": 184385, "fullname": "Haiying Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/184385?format=json", "institution": "Guangxi Normal University"}, {"id": 129683, "fullname": "Shuxiang Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/129683?format=json", "institution": "Guangxi Normal University"}], "abstract": "How to achieve vision-language (VL) tracking using natural language descriptions from a video sequence \\textbf{without relying on any bounding-box ground truth}? In this work, we achieve this goal by tackling \\textit{self-supervised VL tracking}, which aims to evaluate tracking capabilities guided by natural language descriptions. We introduce \\textbf{\\tracker}, a novel self-supervised VL tracker that is capable of tracking any referred object by a language description. Unlike traditional methods that equally fuse all language and visual tokens, we propose an efficient Dynamic Token Aggregation Module, which treats each visual token \\textbf{unequally}. The module consists of three main steps: i) Based on an anchor token, it selects multiple important target tokens from the template frame. ii) The selected target tokens are merged according to their attention scores and aggregated into the language tokens, thereby eliminating redundant visual token noise and enhancing semantic alignment. iii) Finally, the fused language tokens serve as guiding signals to extract potential target tokens from the search frame and propagate them to subsequent frames, enhancing temporal prompts and encouraging the tracker to autonomously learn instance tracking from unlabeled videos. This new modeling approach enables the effective self-supervised learning of language-guided tracking representations without the need for large-scale bounding box annotations. Extensive experiments on VL tracking benchmarks show that {\\tracker} surpasses SOTA self-supervised methods, achieving an improvement of more than 11.2\\%, 5\\%, and 3.3\\% in AUC score on the OTB99, LaSOT, and TNL2K datasets, respectively.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36189", "url": null, "sourceid": 31907, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36224, "uid": "a48bae986dd278e99e16e47f3b324ea8", "name": "FEAT: Fashion Editing and Try-On from Any Design", "authors": [{"id": 148638, "fullname": "Soye Kwon", "url": "http://cvpr.thecvf.com/api/miniconf/users/148638?format=json", "institution": "Kookmin University"}, {"id": 184488, "fullname": "Keonyoung Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/184488?format=json", "institution": "Soongsil University"}, {"id": 184489, "fullname": "Dahuin Jung", "url": "http://cvpr.thecvf.com/api/miniconf/users/184489?format=json", "institution": "Soongsil University"}, {"id": 184490, "fullname": "Jaekoo Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/184490?format=json", "institution": "Kookmin University"}], "abstract": "Fashion design aims to express a designer\u2019s creative intent and to depict how garments interact with the human body. Recent generative approaches condition on multimodal inputs to support garment editing and enable virtual try-on. However, existing methods still (i) confine design to garment-related images, excluding creative design sources such as artwork, abstract imagery, and natural photographs, and (ii) cannot support complete outfits, including accessories. We present FEAT Fashion Editing and Try-On from Any Design, a method that enables editing and try-on across both garments and accessories using diverse design sources. To achieve this, we introduce Disentangled Dual Injection (DDI). It takes both apparel and non-apparel design sources and selectively injects design cues via content and style disentanglement. Furthermore, we propose Orthogonal-Guided Noise Fusion (OGNF), a training-free mechanism that removes residual garments via orthogonal projection and applies region-specific noise strategies to enable virtual try-on for both garments and accessories. Extensive experiments demonstrate that FEAT achieves state-of-the-art performance in design flexibility, prompt consistency, and visual realism.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36224", "url": null, "sourceid": 42758, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36191, "uid": "e88ee0686126767dc6a2abbe746c7b60", "name": "Uncertainty-driven 3D Gaussian Splatting Active Mapping via Anisotropic Visibility Field", "authors": [{"id": 131914, "fullname": "Shangjie Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/131914?format=json", "institution": "Georgia Institute of Technology"}, {"id": 137772, "fullname": "Jesse Dill", "url": "http://cvpr.thecvf.com/api/miniconf/users/137772?format=json", "institution": null}, {"id": 183502, "fullname": "Dhruv Ahuja", "url": "http://cvpr.thecvf.com/api/miniconf/users/183502?format=json", "institution": "Georgia Institute of Technology"}, {"id": 131892, "fullname": "Frank Dellaert", "url": "http://cvpr.thecvf.com/api/miniconf/users/131892?format=json", "institution": "Georgia Tech"}, {"id": 131907, "fullname": "Panagiotis Tsiotras", "url": "http://cvpr.thecvf.com/api/miniconf/users/131907?format=json", "institution": "Georgia Institute of Technology"}, {"id": 131889, "fullname": "Danfei Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131889?format=json", "institution": "Georgia Institute of Technology"}], "abstract": "We present Gaussian Splatting Anisotropic Visibility Field (GAVIS), a novel framework for uncertainty quantification and active mapping in 3DGS. Our key insight is that regions unseen from the training views yield unreliable predictions from the 3DGS. To address this, we introduce a principled and efficient method for quantifying the visibility field in 3DGS, defined as the anisotropic visibility of each particle with respect to the training views, and represented using spherical harmonics. The resulting visibility field is integrated into a Bayesian Network\u2013based uncertainty-aware volume rendering process, enabling real-time (200 FPS) uncertainty quantification for synthesized views. Active mapping is further performed within a maximum information gain framework building on this formulation.Extensive experiments across diverse environments demonstrate that GAVIS consistently and significantly outperforms prior approaches in both accuracy and efficiency. Moreover, beyond standalone use, our method can be applied post-hoc to improve the performance of existing approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36191", "url": null, "sourceid": 36795, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36193, "uid": "5d9e37d3db09f8460ba8ef65b16596ca", "name": "Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation", "authors": [{"id": 184400, "fullname": "Ziyue Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184400?format=json", "institution": "University of Hong Kong"}, {"id": 184401, "fullname": "Jiahe Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184401?format=json", "institution": "Shenyang Institute of Automation, Chinese Academy of Sciences"}, {"id": 184402, "fullname": "Xia Hongyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184402?format=json", "institution": "University of Hong Kong"}, {"id": 184403, "fullname": "Xinrui Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/184403?format=json", "institution": null}, {"id": 134974, "fullname": "Feifei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/134974?format=json", "institution": "University of Hong Kong"}, {"id": 75508, "fullname": "Yuyin Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/75508?format=json", "institution": "UC Santa Cruz"}, {"id": 184404, "fullname": "Wei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184404?format=json", "institution": "Shenyang Institute of Automation, Chinese Academy of Sciences"}, {"id": 184405, "fullname": "Jiawei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184405?format=json", "institution": "Shenyang Institute of Automation, Chinese Academy of Sciences"}, {"id": 99759, "fullname": "Liangqiong Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/99759?format=json", "institution": "The University of Hong Kong"}], "abstract": "We propose Decoupled Residual Denoising Diffusion models (DRDD) for unified and data-efficient image-to-image (I2I) translation. While diffusion models have advanced I2I translation in terms of quality and diversity, we uncover a previously under-explored property in diffusion models. Crucially, beyond its conventional role of manifold lifting (i.e., moving data off low-dimensional manifolds), injecting Gaussian noise facilitates domain harmonization by implicitly aligning feature distributions across domains, a property particularly advantageous for unified I2I translation. However, existing diffusion models prematurely erode this harmonization effect, as noise and residuals are simultaneously removed in a single coupled diffusion process. To address this, DRDD decouples the diffusion process into two sequential and independent diffusion stages: (1) a stochastic noise diffusion for domain harmonization and manifold lifting, and (2) a deterministic residual diffusion that learns the core semantic mapping entirely within the fixed-noise domain.  This decoupling preserves harmonization and manifold lifting effects throughout the transformation, substantially simplifying the learning of unified mappings across diverse tasks and domains. Notably, the noise diffusion stage is trained exclusively on abundant, unpaired target-domain images, greatly improving data efficiency. Comprehensive theoretical and empirical analysis demonstrates that DRDD is broadly compatible with mainstream diffusion models and consistently delivers robust, unified I2I translation, even under limited paired data. Code is released to promote further research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36193", "url": null, "sourceid": 42443, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36194, "uid": "581b87f8c38532b6e3cbb05a43836400", "name": "AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition", "authors": [{"id": 184406, "fullname": "Zichuan Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184406?format=json", "institution": null}, {"id": 182509, "fullname": "Yicheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182509?format=json", "institution": "Harbin Institute of Technology"}, {"id": 184407, "fullname": "Yang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184407?format=json", "institution": "Tencent AI Lab"}, {"id": 184408, "fullname": "Lvfang Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184408?format=json", "institution": "Tencent HY"}, {"id": 184409, "fullname": "Deheng Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/184409?format=json", "institution": "Tencent; Tencent"}], "abstract": "Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches reduce visual tokens through fixed-ratio compression, they operate passively and lack the ability to adapt to varying task requirements. This motivates a fundamental question: Can VLMs autonomously determine the minimum number of visual tokens required for each sample? Inspired by human active vision mechanisms, we introduce AdaptVision, an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach. Our model initially processes compressed visual tokens from low-resolution images and selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary. We train AdaptVision using a reinforcement learning framework that carefully balances accuracy and efficiency. Central to our approach is Decoupled Turn Policy Optimization (DTPO), which decouples the learning objective into two components: (1) tool learning, which optimizes correct tool utilization, and (2) accuracy improvement, which refines the generated responses to improve answer correctness. Based on this formulation, we further decouple advantage estimation by computing separate advantages for tokens associated with each objective. This formulation enables more effective optimization for AdaptVision compared to vanilla GRPO. Comprehensive experiments across multiple VQA benchmarks demonstrate that AdaptVision achieves superior performance while consuming substantially fewer visual tokens than state-of-the-art efficient VLM methods.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36194", "url": null, "sourceid": 43535, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36195, "uid": "b41b72cb176c62f458414fca1cc9183a", "name": "Beyond Endpoints: Path-Centric Reasoning for Vectorized Off-Road Network Extraction", "authors": [{"id": 180762, "fullname": "wenfei guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/180762?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 91006, "fullname": "Jilin Mei", "url": "http://cvpr.thecvf.com/api/miniconf/users/91006?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 184410, "fullname": "Tong Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184410?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 184411, "fullname": "Xumin Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184411?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 184412, "fullname": "Shuo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184412?format=json", "institution": "Institute of Computing Technology, CAS"}, {"id": 184413, "fullname": "Chen Min", "url": "http://cvpr.thecvf.com/api/miniconf/users/184413?format=json", "institution": "Chinese Academy of Sciences"}, {"id": 76192, "fullname": "Yu Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76192?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}], "abstract": "Deep learning has advanced vectorized road extraction in urban settings, yet off-road environments remain underexplored and challenging. A significant  domain gap causes advanced models to fail in wild terrains due to two key issues: lack of large-scale vectorized datasets and structural weakness in prevailing methods. Models such as SAM-Road employ a node-centric paradigm that reasons at sparse endpoints, making them fragile to occlusions and ambiguous junctions in off-road scenes, leading to topological errors.This work addresses these limitations in two complementary ways. First, we release WildRoad, a gloabal off-road road network dataset constructed efficiently with a dedicated interactive annotation tool tailored for road-network labeling. Second, we introduce MaGRoad (Mask-aware Geodesic Road network extractor), a path-centric framework that aggregates multi-scale visual evidence along candidate paths to infer connectivity robustly.Extensive experiments show that MaGRoad achieves state-of-the-art performance on our challenging WildRoad benchmark while generalizing well to urban datasets. A streamlined pipeline also yields roughly 2.5x faster inference, improving practical applicability. Together, the dataset and path-centric paradigm provide a stronger foundation for mapping roads in the wild.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36195", "url": null, "sourceid": 37228, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36196, "uid": "9874e7748b87210dab60a15f1757d02f", "name": "Transform to Transfer: Boosting Adversarial Attack Transferability on Vision-Language Pre-training Models", "authors": [{"id": 181306, "fullname": "Yang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181306?format=json", "institution": "Fuzhou University"}, {"id": 184414, "fullname": "Jia-Li Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184414?format=json", "institution": "Fuzhou University"}, {"id": 184415, "fullname": "Luojun Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184415?format=json", "institution": "Fuzhou University"}, {"id": 184416, "fullname": "Wei Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184416?format=json", "institution": "Fujian University of Technology"}], "abstract": "Visual-Language Pre-training (VLP) models, while achieving state-of-the-art performance on various multimodal tasks, exhibit significant vulnerability to multimodal adversarial examples. In black-box attack scenarios of VLP models, a key challenge lies in the limited transferability of these adversarial examples. Existing methods to enhance transferability often suffer from an excessive dependence on the source model and a reliance on limited and fixed transformation techniques. To overcome these limitations, We propose a novel Transform to Transfer Attack (TTA) method. Our approach introduces a learnable transformation mechanism that adaptively selects optimal combinations of transformations to maximize input diversity, and incorporates integrated gradients to mitigate over-fitting on the source model, thereby refining the attack optimization process. Extensive experiments demonstrate that TTA achieves outstanding attack performance in downstream tasks, outperforming current state-of-the-art attack methods across different VLP architectures.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36196", "url": null, "sourceid": 40206, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36198, "uid": "3385a4e712e62153e3a3c7bd8f7e5329", "name": "Concept-Aware Batch Sampling Improves Language-Image Pretraining", "authors": [{"id": 180321, "fullname": "Adhiraj Ghosh", "url": "http://cvpr.thecvf.com/api/miniconf/users/180321?format=json", "institution": "T\u00fcbingen AI Centre"}, {"id": 154680, "fullname": "Vishaal Udandarao", "url": "http://cvpr.thecvf.com/api/miniconf/users/154680?format=json", "institution": "ELLIS | University of Cambridge | University of Tuebingen"}, {"id": 140528, "fullname": "Thao Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/140528?format=json", "institution": "University of Washington, Seattle"}, {"id": 88373, "fullname": "Matteo Farina", "url": "http://cvpr.thecvf.com/api/miniconf/users/88373?format=json", "institution": "ELLIS, University of Trento &amp; University of T\u00fcbingen"}, {"id": 184422, "fullname": "Mehdi Cherti", "url": "http://cvpr.thecvf.com/api/miniconf/users/184422?format=json", "institution": "Forschungszentrum J\u00fclich"}, {"id": 85307, "fullname": "Jenia Jitsev", "url": "http://cvpr.thecvf.com/api/miniconf/users/85307?format=json", "institution": "Juelich Supercomputing Center, Research Center Juelich; LAION"}, {"id": 85745, "fullname": "Sewoong Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/85745?format=json", "institution": "University of Washington"}, {"id": 75841, "fullname": "Elisa Ricci", "url": "http://cvpr.thecvf.com/api/miniconf/users/75841?format=json", "institution": "University of Trento"}, {"id": 85320, "fullname": "Ludwig Schmidt", "url": "http://cvpr.thecvf.com/api/miniconf/users/85320?format=json", "institution": "University of Washington"}, {"id": 154684, "fullname": "Matthias Bethge", "url": "http://cvpr.thecvf.com/api/miniconf/users/154684?format=json", "institution": "University of Tuebingen"}], "abstract": "What data should a vision-language model be trained on? To answer this question, many data curation efforts center on the quality of a dataset.  However, most of these existing methods are (i) offline, i.e. they produce a static dataset from a set of predetermined filtering criteria, and (ii) concept-agnostic, i.e. they use model-based filters which induce additional dataset bias. In this work, we go beyond such offline, concept-agnostic methods and advocate for more flexible, task-adaptive online concept-based curation. Our first contribution is DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DataConcept, we introduce CABS (Concept-Aware Batch Sampling), a simple yet effective batch-sampling framework that flexibly constructs batches on-the-fly based on specific target distributions. We propose two variants: (i) Diversity Maximization, to curate batches with the broadest coverage of available concepts, and (ii) CABS-FM (Frequency Maximization), to curate batches with maximal object multiplicity. Through extensive evaluations with four visual backbones and a suite of 28 benchmarks, we demonstrate that CABS significantly benefits Language-Image Pretraining (LIP) and yields highly performant models on long-tailed evaluations. Overall, CABS represents a strong open-source alternative to proprietary online curation algorithms, enabling practitioners to define custom concept distributions that optimize for specific downstream tasks. Both DataConcept and the source code for CABS will be made public.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36198", "url": null, "sourceid": 36666, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36207, "uid": "ea229359ad9a1b62eda0169bf1530e8c", "name": "Dual-Agent Reinforcement Learning for Adaptive and Cost-Aware Visual\u2013Inertial Odometry", "authors": [{"id": 179909, "fullname": "Feiyang Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/179909?format=json", "institution": "Southeast University"}, {"id": 184451, "fullname": "Shenghe Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184451?format=json", "institution": "Harbin Institute of Technology"}, {"id": 184452, "fullname": "Chunyan Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184452?format=json", "institution": "Southeast University"}, {"id": 184453, "fullname": "Guangbin Dou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184453?format=json", "institution": "Southeast University"}], "abstract": "Visual-Inertial Odometry (VIO) is a critical component for robust ego-motion estimation, enabling foundational capabilities such as autonomous navigation in robotics and real-time 6-DoF tracking for augmented reality.Existing methods face a well-known trade-off: filter-based approaches are efficient but prone to drift, while optimization-based methods, though accurate, rely on computationally prohibitive Visual-Inertial Bundle Adjustment (VIBA) that is difficult to run on resource-constrained platforms.Rather than removing VIBA altogether, we aim to reduce how often and how heavily it must be invoked. To this end, we cast two key design choices in modern VIO, when to run the visual frontend and how strongly to trust its output, as sequential decision problems, and solve them with lightweight reinforcement learning (RL) agents. Our framework introduces a lightweight, dual-pronged RL policy that serves as our core contribution: (1) a Select Agent intelligently gates the entire VO pipeline based only on high-frequency IMU data; and (2) a composite Fusion Agent that first estimates a robust velocity state via a supervised network , before an RL policy adaptively fuses the full (p, v, q) state.Experiments on the EuRoC MAV and TUM-VI datasets show that, in our unified evaluation, the proposed method achieves a more favorable accuracy\u2013efficiency\u2013memory trade-off than prior GPU-based VO/VIO systems: it attains the best average ATE while running up to $1.77\\times$ faster and using less GPU memory. Compared to classical optimization-based VIO systems, our approach maintains competitive trajectory accuracy while substantially reducing computational load.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36207", "url": null, "sourceid": 43529, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36208, "uid": "238e625ff1f20b656f8986f42a76092f", "name": "When Do Models Actually Decide? Mapping the Layer-Wise Decision Timeline in Pretrained Neural Networks", "authors": [{"id": 181227, "fullname": "Minhyeok Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/181227?format=json", "institution": "Chung-Ang University"}], "abstract": "Neural networks are often treated as monolithic black boxes that process all inputs uniformly through all layers. However, researchers intuitively wonder: do simple images require all 50 layers of ResNet-50, or is the prediction effectively decided much earlier? We investigate when pretrained models make up their minds during a forward pass by training linear probes at each layer of ResNet variants on ImageNet, without modifying the base model. Our findings reveal substantial computational heterogeneity across architectures: ResNet-50 and ResNet-101 exhibit mean decision depths of 5.5--5.6 layers (k=2 stability), while ResNet-18 requires deeper relative processing at 7.4 layers. We discover pronounced bimodal patterns with distinct populations of early and late deciders, where 39--43\\% of samples in deeper ResNets achieve stability within the first third of the network, while 39--54\\% require processing beyond 70\\% depth. The decision layer is highly sensitive to stability criteria, with mean depths increasing from 2.6--4.1 (k=1) to 9.0--10.0 (k=4). Linear probe accuracy exhibits sharp jumps in final residual stages, reaching 73--75\\% for ResNet-50/101 and 65\\% for ResNet-18, indicating that semantic consolidation occurs late. These findings expose computational heterogeneity in standard inference and provide actionable guidance for early exit strategies.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36208", "url": null, "sourceid": 30909, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36206, "uid": "696e8c935c8ce98badc28242fad73dfb", "name": "Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving", "authors": [{"id": 175480, "fullname": "Zehao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/175480?format=json", "institution": "University of California, Riverside"}, {"id": 184448, "fullname": "Huaide Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184448?format=json", "institution": "University of California, Riverside; Southern University of Science and Technology"}, {"id": 184449, "fullname": "Shuaiwu Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/184449?format=json", "institution": "Wenzhou-Kean University"}, {"id": 179883, "fullname": "Yuping Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/179883?format=json", "institution": "University of Michigan"}, {"id": 184450, "fullname": "Hang Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184450?format=json", "institution": "University of California, Riverside"}, {"id": 156898, "fullname": "Jiachen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/156898?format=json", "institution": "University of California, Riverside"}], "abstract": "Human driving behavior is inherently personal, which is shaped by long-term habits and influenced by short-term intentions. Individuals differ in how they accelerate, brake, merge, yield, and overtake across diverse situations. However, existing end-to-end autonomous driving systems either optimize for generic objectives or rely on fixed driving modes, lacking the ability to adapt to individual preferences or interpret natural language intent. To address this gap, we propose Drive My Way (DMW), a personalized Vision-Language-Action (VLA) driving framework that aligns with users' long-term driving habits and adapts to real-time user instructions. DMW learns a user embedding from our personalized driving dataset collected across multiple real drivers and conditions the policy on this embedding during planning, while natural language instructions provide additional short-term guidance. Closed-loop evaluation on the Bench2Drive benchmark demonstrates that DMW improves style instruction adaptation, and user studies show that its generated behaviors are recognizable as each driver\u2019s own style, highlighting personalization as a key capability for human-centered autonomous driving.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36206", "url": null, "sourceid": 34500, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36213, "uid": "5985e72b3752e4749926885db1b45be4", "name": "FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision", "authors": [{"id": 84643, "fullname": "Tobias Kirschstein", "url": "http://cvpr.thecvf.com/api/miniconf/users/84643?format=json", "institution": "Department of Informatics, Technische Universit\u00e4t M\u00fcnchen"}, {"id": 84635, "fullname": "Simon Giebenhain", "url": "http://cvpr.thecvf.com/api/miniconf/users/84635?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}, {"id": 75895, "fullname": "Matthias Nie\u00dfner", "url": "http://cvpr.thecvf.com/api/miniconf/users/75895?format=json", "institution": "Technical University of Munich"}], "abstract": "We introduce FlexAvatar, a method for creating high-quality and complete 3D head avatars from a single image. A core challenge lies in the limited availability of multi-view data and the tendency of monocular training to yield incomplete 3D head reconstructions. We identify the root cause of this issue as the entanglement between driving signal and target viewpoint when learning from monocular videos. To address this, we propose a transformer-based 3D portrait animation model with learnable data source tokens, so-called bias sinks, which enables unified training across monocular and multi-view datasets. This design leverages the strengths of both data sources during inference: strong generalization from monocular data and full 3D completeness from multi-view supervision. Furthermore, our training procedure yields a smooth latent avatar space that facilitates identity interpolation and flexible fitting to an arbitrary number of input observations. In extensive evaluations on single-view, few-shot, and monocular avatar creation tasks, we verify the efficacy of FlexAvatar. Many existing methods struggle with view extrapolation while FlexAvatar generates complete 3D head avatars with realistic facial animations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36213", "url": null, "sourceid": 40548, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36214, "uid": "a8fbc8578b551fd7eababa4a2b1b3fc1", "name": "Cross-Axis Feature Fusion with Joint-Wise Motion Difference Prediction for Text-Based 3D Human Motion Editing", "authors": [{"id": 76494, "fullname": "Gyojin Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/76494?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 85168, "fullname": "Junmo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/85168?format=json", "institution": "Korea Advanced Institute of Science and Technology"}], "abstract": "We address text-based 3D human motion editing, where the goal is to preserve the style and structure of a source motion while applying edits described in natural language. The release of the MotionFix dataset has spurred active research into training-based diffusion models that directly generate an edited motion from a source motion and a text instruction. While previous works have focused primarily on learning when an edit should occur temporally, our goal is to create a model that understands not only this temporal aspect but also which specific joints are responsible for the change. Targeting this, we propose a novel architecture and a complementary auxiliary task to aid its training. Our architecture consists of two axis-anchored transformers, which extract distinct features along the joint and time dimensions respectively, and a cross-axis fusion block that integrates these representations. We further introduce an auxiliary task that trains the joint-anchored transformer to regress the Soft-DTW distance between source and target joint rotations. This objective teaches the module to understand which joints to modify and which to preserve. Through comprehensive experiments on the MotionFix dataset, we demonstrate that our method significantly improves semantic alignment with both the text instruction and the source motion, as well as the overall fidelity of the generated motion, achieving state-of-the-art results.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36214", "url": null, "sourceid": 33274, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36217, "uid": "f27f39a7867b92c1c69d38f49fdd3144", "name": "Personalized Image Descriptions from Attention Sequences", "authors": [{"id": 135763, "fullname": "Ruoyu Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/135763?format=json", "institution": "State University of New York at Stony Brook"}, {"id": 76444, "fullname": "Hieu Le", "url": "http://cvpr.thecvf.com/api/miniconf/users/76444?format=json", "institution": "EPFL"}, {"id": 85167, "fullname": "Jingyi Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85167?format=json", "institution": "State University of New York at Stony Brook"}, {"id": 87419, "fullname": "Sounak Mondal", "url": "http://cvpr.thecvf.com/api/miniconf/users/87419?format=json", "institution": "State University of New York, Stony Brook"}, {"id": 184474, "fullname": "Abe Leite", "url": "http://cvpr.thecvf.com/api/miniconf/users/184474?format=json", "institution": "State University of New York at Stony Brook"}, {"id": 87396, "fullname": "Gregory Zelinsky", "url": "http://cvpr.thecvf.com/api/miniconf/users/87396?format=json", "institution": "State University of New York at Stony Brook"}, {"id": 136690, "fullname": "Minh Nguyen Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/136690?format=json", "institution": "Qualcomm Inc, QualComm; University of Adelaide"}, {"id": 85146, "fullname": "Dimitris Samaras", "url": "http://cvpr.thecvf.com/api/miniconf/users/85146?format=json", "institution": "Stony Brook University"}], "abstract": "People can view the same image differently: they focus on different regions, objects, and details in varying orders and describe them in distinct linguistic styles. This leads to substantial variability in image descriptions. However, existing models for personalized image description focus on linguistic style alone, with no prior work leveraging individual viewing patterns. We address this gap by explicitly modeling personalized viewing behavior as a core factor in description generation. Our method, DEPER (DEscription\u2013PERception persona encoder), learns a subject embedding that captures both linguistic style and viewing behavior, guided by an auxiliary attention-prediction task. A lightweight adapter aligns these embeddings with a frozen vision\u2013language model, enabling few-shot personalization without retraining. Across four datasets spanning diverse viewing tasks and both short and detailed descriptions, DEPER achieves a 24% average improvement, showing that modeling personalized attention produces more human-aligned and high-quality descriptions. We posit that understanding how people see helps predict what they say; modeling human diversity in perception can improve both performance and human alignment in multi-modal systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36217", "url": null, "sourceid": 45715, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36220, "uid": "3e1c3ef46aaa8b57458d2df424f99f1e", "name": "OSA: Echocardiography Video Segmentation via Orthogonalized State Update and Anatomical Prior-aware Feature Enhancement", "authors": [{"id": 180410, "fullname": "Rui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180410?format=json", "institution": "Shenzhen University"}, {"id": 127066, "fullname": "Huisi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127066?format=json", "institution": "Shenzhen University"}, {"id": 127069, "fullname": "Jing Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/127069?format=json", "institution": "Hong Kong Polytechnic University"}], "abstract": "Accurate segmentation of cardiac chambers in echocardiography videos is essential for quantitative cardiac assessment. However, ultrasound noise, artifacts, and cardiac motion pose significant challenges to robust spatiotemporal modeling. Recent approaches such as Transformers, linear attention, and state-space models improve accuracy, yet Transformers often remain computationally expensive, whereas linear attention and state-space models typically lack geometric regularization, leading to unstable spatiotemporal interactions under complex cardiac motion. We introduce OSA, a lightweight linear sequence architecture designed for stable and efficient cardiac video segmentation. OSA incorporates an Anatomical Prior-aware Feature Enhancement (APFE) module that decouples and fuses complementary anatomical components to strengthen boundary\u2013region discrimination. Orthogonalized State Update (OSU) enforces spectral-norm and orthogonality constraints during recurrent transitions, preserving spatiotemporal coherence. Evaluated on the CAMUS and EchoNet-Dynamic datasets, OSA consistently outperforms state-of-the-art methods in segmentation accuracy and temporal consistency, while maintaining real-time inference efficiency. This framework offers a principled and efficient solution for dynamic cardiac analysis in echocardiography. The code will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36220", "url": null, "sourceid": 44297, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36222, "uid": "8d6bd2dad761e54d64308b604bb23ba1", "name": "MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization", "authors": [{"id": 155463, "fullname": "Chengyue Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155463?format=json", "institution": "Georgia Institute of Technology"}, {"id": 164256, "fullname": "Mellon Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/164256?format=json", "institution": "Georgia Institute of Technology"}, {"id": 184483, "fullname": "Robert Azarcon", "url": "http://cvpr.thecvf.com/api/miniconf/users/184483?format=json", "institution": "Georgia Institute of Technology"}, {"id": 184484, "fullname": "Glen Chou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184484?format=json", "institution": "Georgia Institute of Technology"}, {"id": 88875, "fullname": "Zsolt Kira", "url": "http://cvpr.thecvf.com/api/miniconf/users/88875?format=json", "institution": "Georgia Institute of Technology"}], "abstract": "Vision-Language-Action (VLA) models inherit strong priors from pretrained Vision-Language Models (VLMs), but na\u00efve fine-tuning often disrupts these representations and harms generalization. Existing fixes -- freezing modules or applying uniform regularization -- either overconstrain adaptation or ignore the differing roles of VLA components. We present MAPS (Module-Wise Proximity Scheduling), the first robust fine-tuning framework for VLAs. Through systematic analysis, we uncover an empirical order in which proximity constraints should be relaxed to balance stability and flexibility. MAPS linearly schedules this relaxation, enabling visual encoders to stay close to their pretrained priors while action-oriented language layers adapt more freely. MAPS is parameter-free, data-free, and plug-and-play with existing architectures. Across MiniVLA-VQ, MiniVLA-OFT, OpenVLA-OFT, and benchmarks like LIBERO, CALVIN, and SimplerEnv, MAPS boosts both in- and out-of-distribution performance (up to +25\\%). Our findings highlight empirically guided proximity to pretrained VLMs as a simple yet powerful principle for scalable VLA adaptation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36222", "url": null, "sourceid": 43818, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36233, "uid": "8c22d6f99007d77ad122b1de7e5ce6c3", "name": "Counterfactual VLA: Self-Reflective Vision-Language-Action Model with Adaptive Reasoning", "authors": [{"id": 153088, "fullname": "Zhenghao Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/153088?format=json", "institution": "University of California, Los Angeles"}, {"id": 154133, "fullname": "Wenhao Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/154133?format=json", "institution": "NVIDIA Research"}, {"id": 98591, "fullname": "Yurong You", "url": "http://cvpr.thecvf.com/api/miniconf/users/98591?format=json", "institution": "NVIDIA"}, {"id": 138722, "fullname": "Yuxiao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/138722?format=json", "institution": "California Institute of Technology"}, {"id": 184524, "fullname": "Wenjie Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/184524?format=json", "institution": "NVIDIA"}, {"id": 184525, "fullname": "Thomas Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/184525?format=json", "institution": null}, {"id": 184526, "fullname": "Yulong Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184526?format=json", "institution": "NVIDIA"}, {"id": 177858, "fullname": "Apoorva Sharma", "url": "http://cvpr.thecvf.com/api/miniconf/users/177858?format=json", "institution": "NVIDIA Research"}, {"id": 131889, "fullname": "Danfei Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131889?format=json", "institution": "Georgia Institute of Technology"}, {"id": 127258, "fullname": "Boris Ivanovic", "url": "http://cvpr.thecvf.com/api/miniconf/users/127258?format=json", "institution": "NVIDIA"}, {"id": 105890, "fullname": "Boyi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/105890?format=json", "institution": "UC Berkeley / NVIDIA"}, {"id": 184527, "fullname": "Yan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184527?format=json", "institution": "NVIDIA"}, {"id": 88162, "fullname": "Marco Pavone", "url": "http://cvpr.thecvf.com/api/miniconf/users/88162?format=json", "institution": "NVIDIA"}], "abstract": "Recent reasoning-augmented Vision-Language-Action (VLA) models have improved the interpretability of end-to-end autonomous driving by generating intermediate reasoning traces. Yet these models primarily describe what they perceive and intend to do, rarely questioning whether their planned actions are safe or appropriate. This work introduces Counterfactual VLA (CF-VLA), a self-reflective VLA framework that enables the model to reason about and revise its planned actions before execution. CF-VLA first generates time-segmented meta-actions that summarize driving intent, then performs a counterfactual reasoning pass conditioned on both the meta-actions and the visual. This step simulates potential outcomes, identifies unsafe behaviors, and outputs corrected meta-actions that guide the final trajectory generation. To efficiently obtain such self-reflection capabilities, we propose a rollout\u2013filter\u2013label pipeline that mines high-value scenes from a base (non-counterfactual) VLA's rollouts and labels counterfactual reasoning traces for subsequent counterfactual training rounds. Experiments on large-scale driving datasets show that CF-VLA improves trajectory accuracy by up to 17.6\\%, enhances safety metrics, and exhibits adaptive thinking: it only enables counterfactual reasoning in challenging scenarios. By transforming reasoning traces from one-shot descriptions to causal self-correction signals, CF-VLA takes a step toward self-reflective autonomous driving agents that learn to think before they act.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36233", "url": null, "sourceid": 39888, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36293, "uid": "0a55079ea003674ae2145b02f03ff27d", "name": "Self-Consistency for LLM-based Motion Trajectory Generation and Verification", "authors": [{"id": 183836, "fullname": "Jiaju Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/183836?format=json", "institution": "Stanford University"}, {"id": 163879, "fullname": "R. Kenny Jones", "url": "http://cvpr.thecvf.com/api/miniconf/users/163879?format=json", "institution": "Stanford University"}, {"id": 84531, "fullname": "Jiajun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84531?format=json", "institution": "Stanford University"}, {"id": 73046, "fullname": "Maneesh Agrawala", "url": "http://cvpr.thecvf.com/api/miniconf/users/73046?format=json", "institution": "Stanford University"}], "abstract": "Self-consistency has proven to be an effective technique for improving LLM performance on natural language reasoning tasks in a lightweight, unsupervised manner. In this work, we study how to adapt self-consistency to visual domains; specifically, we consider the generation and verification of LLM-produced motion graphics trajectories. Given a prompt (e.g., ``Move the circle in a spiral path''), we first sample diverse motion trajectories from an LLM, and then identify groups of consistent trajectories via clustering.  Our key insight is to model the family of shapes associated with a prompt as a prototype trajectory paired with a group of geometric transformations (e.g., rigid, similarity, affine).  Two trajectories can then be considered consistent if one can be transformed into the other under the warps allowable by the transformation group. We propose an algorithm that automatically recovers a shape family, using  hierarchical relationships between a set of candidate transformation groups.  Our approach improves the accuracy of LLM-based trajectory generation by 4\u20136\\%. We further extend our method to support verification, observing 11\\% precision gains over VLM baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36293", "url": null, "sourceid": 44525, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36235, "uid": "98df4e0ce8b3e458dddb60c61fe5a3b2", "name": "HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics", "authors": [{"id": 127735, "fullname": "Masatoshi Tateno", "url": "http://cvpr.thecvf.com/api/miniconf/users/127735?format=json", "institution": "Institute of Industrial Science, the University of Tokyo"}, {"id": 184534, "fullname": "Gido Kato", "url": "http://cvpr.thecvf.com/api/miniconf/users/184534?format=json", "institution": "Waseda University; AIST, National Institute of Advanced Industrial Science and Technology"}, {"id": 87986, "fullname": "Hirokatsu Kataoka", "url": "http://cvpr.thecvf.com/api/miniconf/users/87986?format=json", "institution": "National Institute of Advanced Industrial Science and Technology (AIST)"}, {"id": 69170, "fullname": "Yoichi Sato", "url": "http://cvpr.thecvf.com/api/miniconf/users/69170?format=json", "institution": "University of Tokyo"}, {"id": 95886, "fullname": "Takuma Yagi", "url": "http://cvpr.thecvf.com/api/miniconf/users/95886?format=json", "institution": "AIST, National Institute of Advanced Industrial Science and Technology"}], "abstract": "Hand\u2013object interaction (HOI) inherently involves dynamics where human manipulations produce distinct spatio-temporal effects on objects.However, existing semantic HOI benchmarks focused either on manipulation or on the resulting effects at a coarse level, lacking fine-grained spatio-temporal reasoning to capture the underlying dynamics in HOI.We introduce HanDyVQA, a fine-grained video question-answering benchmark that comprehensively covers both the manipulation and effect aspects of HOI. HanDyVQA comprises six complementary question types (Action, Process, Objects, Location, State Change, and Object Parts), totalling 11.1K multiple-choice QA pairs.Collected QA pairs recognizing manipulation styles, hand/object motions, and part-level state changes.HanDyVQA also includes 10.3K segmentation masks for Objects and Object Parts questions, enabling the evaluation of object/part-level reasoning in video object segmentation.We evaluated recent video foundation models on our benchmark and found that even the best-performing model, Gemini-2.5-Pro, reached only 73% average accuracy, which is far from human performance (97%). Further analysis shows the remaining challenges in spatial relationship, motion, and part-level geometric understanding.We also found that integrating explicit HOI-related cues into visual features improves performance, offering insights for developing future models with a deeper understanding of HOI dynamics.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36235", "url": null, "sourceid": 35238, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36237, "uid": "6eaeb92aa84721ed6e66f4c77ccfa308", "name": "Time-Aware One Step Diffusion Network for Real-World Image Super-Resolution", "authors": [{"id": 180882, "fullname": "Tianyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180882?format=json", "institution": "Nankai University"}, {"id": 184536, "fullname": "Zheng-Peng Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184536?format=json", "institution": "Nankai University"}, {"id": 76506, "fullname": "Chun-Le Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/76506?format=json", "institution": "Nankai University"}, {"id": 126680, "fullname": "Peng-Tao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126680?format=json", "institution": "vivo Mobile Communication (Hangzhou) Co., Ltd."}, {"id": 87210, "fullname": "Bo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/87210?format=json", "institution": "vivo Mobile Communication Co.,Ltd."}, {"id": 90540, "fullname": "Ming-Ming Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90540?format=json", "institution": "Nankai University, Tsinghua University"}, {"id": 73507, "fullname": "Chongyi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/73507?format=json", "institution": "Nankai University"}], "abstract": "Diffusion-based real-world image super-resolution (Real-ISR) methods have demonstrated impressive performance. To achieve efficient Real-ISR, many works employ Variational Score Distillation (VSD) to distill pre-trained stable-diffusion (SD) model for one-step SR with a fixed timestep. However, since SD will perform different generative priors at different timesteps, a fixed timestep is difficult for these methods to fully leverage the generative priors in SD, leading to suboptimal performance. To address this, we propose a Time-Aware one-step Diffusion Network for Real-ISR (TADSR). We first introduce a Time-Aware VAE Encoder, which projects the same image into different latent features based on timesteps. Through joint dynamic variation of timesteps and latent features, the student model can better align with the input pattern distribution of the pre-trained SD, thereby enabling more effective utilization of SD's generative capabilities. To better activate the generative prior of SD at different timesteps, we propose a Time-Aware VSD loss that bridges the timesteps of the student model and those of the teacher model, thereby producing more consistent generative prior guidance conditioned on timesteps. Additionally, though utilizing the generative prior in SD at different timesteps, our method can naturally achieve controllable trade-offs between fidelity and realism by changing the timestep. Experimental results demonstrate that our method achieves both state-of-the-art performance and controllable SR results with only a single step.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36237", "url": null, "sourceid": 46566, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36238, "uid": "585f71bfcabaf958eea2dff2971b53f3", "name": "SD-FSMIS: Adapting Stable Diffusion for Few-Shot Medical Image Segmentation", "authors": [{"id": 180841, "fullname": "Meihua Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180841?format=json", "institution": "Shenzhen University"}, {"id": 72624, "fullname": "Yang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/72624?format=json", "institution": "Shenzhen University"}, {"id": 107480, "fullname": "Weizhao He", "url": "http://cvpr.thecvf.com/api/miniconf/users/107480?format=json", "institution": null}, {"id": 184537, "fullname": "Hu Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184537?format=json", "institution": "Shenzhen University"}, {"id": 184538, "fullname": "Yisong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184538?format=json", "institution": "Shenzhen University"}], "abstract": "Few-Shot Medical Image Segmentation (FSMIS) aims to segment novel object classes in medical images using only minimal annotated examples, addressing the critical challenges of data scarcity and domain shifts prevalent in medical imaging. While Diffusion Models (DM) excel in visual tasks, their potential for FSMIS remains largely unexplored. We propose that the rich visual priors learned by large-scale DMs offer a powerful foundation for a more robust and data-efficient segmentation approach. In this paper, we introduce SD-FSMIS, a novel framework designed to effectively adapt the powerful pre-trained Stable Diffusion (SD) model for the FSMIS task. Our approach repurposes its conditional generative architecture by introducing two key components: a Support-Query Interaction (SQI) and a Visual-to-Textual Condition Translator (VTCT). Specifically, SQI provides a straightforward yet powerful means of adapting SD to the FSMIS paradigm. The VTCT module translates visual cues from the support set into an implicit textual embedding that guides the diffusion model, enabling precise conditioning of the generation process. Extensive experiments demonstrate that SD-FSMIS achieves competitive results compared to state-of-the-art methods in standard settings. Surprisingly, it also demonstrated excellent generalization ability in more challenging cross-domain scenarios. These findings highlight the immense potential of adapting large-scale generative models to advance data-efficient and robust medical image segmentation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36238", "url": null, "sourceid": 46243, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36243, "uid": "60eefdd92947076e89f98d7f4ddec6c9", "name": "VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes", "authors": [{"id": 91828, "fullname": "Paul Gavrikov", "url": "http://cvpr.thecvf.com/api/miniconf/users/91828?format=json", "institution": "Independent Researcher"}, {"id": 135282, "fullname": "Wei Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/135282?format=json", "institution": "ELLIS Unit &amp; LIT AI Lab, JKU"}, {"id": 150900, "fullname": "Muhammad Jehanzeb Mirza", "url": "http://cvpr.thecvf.com/api/miniconf/users/150900?format=json", "institution": "MIT"}, {"id": 184561, "fullname": "Soumya Jahagirdar", "url": "http://cvpr.thecvf.com/api/miniconf/users/184561?format=json", "institution": "Eberhard-Karls-Universit\u00e4t T\u00fcbingen"}, {"id": 182346, "fullname": "Muhammad Huzaifa", "url": "http://cvpr.thecvf.com/api/miniconf/users/182346?format=json", "institution": "MBZUAI"}, {"id": 89691, "fullname": "Sivan Doveh", "url": "http://cvpr.thecvf.com/api/miniconf/users/89691?format=json", "institution": "IBM, Weizmann Institute of Science"}, {"id": 131403, "fullname": "James Glass", "url": "http://cvpr.thecvf.com/api/miniconf/users/131403?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 69178, "fullname": "Serena Yeung", "url": "http://cvpr.thecvf.com/api/miniconf/users/69178?format=json", "institution": "Stanford"}, {"id": 150894, "fullname": "Hilde Kuehne", "url": "http://cvpr.thecvf.com/api/miniconf/users/150894?format=json", "institution": "Eberhard-Karls-Universit\u00e4t T\u00fcbingen"}], "abstract": "Is basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question\u2013answer pairs, with privately held ground-truth responses. Unlike prior VQA datasets that typically focus on near global image understanding, VisualOverload challenges models to perform simple, knowledge-free vision tasks in densely populated (or, overloaded) scenes. Our dataset consists of high-resolution scans of public-domain paintings that are populated with multiple figures, actions, and unfolding subplots set against elaborately detailed backdrops. We manually annotated these images with questions across six task categories to probe for a thorough understanding of the scene. We hypothesize that current benchmarks overestimate the performance of VLMs, and encoding and reasoning over details is still a challenging task for them, especially if they are confronted with densely populated scenes. Indeed, we observe that even the best model (o3) out of 37 tested models only achieves 19.8 % accuracy on our hardest test split and overall 69.5 % accuracy on all questions. Beyond a thorough evaluation, we complement our benchmark with an error analysis that reveals multiple failure modes, including a lack of counting skills, failure in OCR, and striking logical inconsistencies under complex tasks. Altogether, VisualOverload exposes a critical gap in current vision models and offers a crucial resource for the community to develop better models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36243", "url": null, "sourceid": 42959, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36245, "uid": "77fb534c73679ddb543437d0b3db77c4", "name": "LaRP: Efficient Multi-View Inpainting with Latent Reprojection Priors", "authors": [{"id": 184564, "fullname": "Gaoyang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184564?format=json", "institution": "Zhejiang University"}, {"id": 184565, "fullname": "Xinguo Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184565?format=json", "institution": "Zhejiang University"}], "abstract": "The task of multi-view inpainting necessitates 3D consistency in the inpainted images. Most prior methods first employ single-view 2D inpainting and then enforce multi-view consistency in a post-hoc 3D optimization stage, which leads to undesirable artifacts and lengthy optimization times. The existing single-stage method, MVInpainter, uses video priors and is pose-free, making it less suitable for inputs beyond video sequences. In this paper, we propose a framework that trains an inpainting model to condition on the explicit and reliable multi-view correspondences from a 3D foundation model. Central to our framework is a cross-view conditioning architecture, LaRP, carefully designed to utilize both the generative prior of a pretrained diffusion inpainting model and the reprojected cross-view appearance latents. We additionally propose a scalable data pipeline for stable training of LaRP. Extensive experiments demonstrate that LaRP outperforms prior methods in 3D consistency and novel view synthesis quality competitive with the state-of-the-art, while being \u223c50x faster.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36245", "url": null, "sourceid": 41025, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36327, "uid": "11c3bc83e4a0ecfb01176e866d97435f", "name": "Federated Active Learning Under Extreme Non-IID and Global Class Imbalance", "authors": [{"id": 151755, "fullname": "Chen-Chen Zong", "url": "http://cvpr.thecvf.com/api/miniconf/users/151755?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 156156, "fullname": "Sheng-Jun Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156156?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}], "abstract": "Federated active learning (FAL) seeks to reduce annotation cost under privacy constraints, yet its effectiveness degrades in realistic settings with severe global class imbalance and highly heterogeneous clients. We conduct a systematic study of query-model selection in FAL and uncover a central insight: the model that achieves more class-balanced sampling, especially for minority classes, consistently leads to better final performance. Moreover, global-model querying is beneficial only when the global distribution is highly imbalanced and client data are relatively homogeneous; otherwise, the local model is preferable.Based on these findings, we propose FairFAL, an adaptive class-fair FAL framework. FairFAL (1) infers global imbalance and local\u2013global divergence via lightweight prediction discrepancy, enabling adaptive selection between global and local query models; (2) performs prototype-guided pseudo-labeling using global features to promote class-aware querying; and (3) applies a two-stage uncertainty\u2013diversity balanced sampling strategy with $k$-center refinement. Experiments on five benchmarks show that FairFAL consistently outperforms state-of-the-art approaches under challenging long-tailed and non-IID settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36327", "url": null, "sourceid": 39861, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36247, "uid": "5da9cf9625826af0d8fd2452c33ad396", "name": "Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion", "authors": [{"id": 181024, "fullname": "Laura Dodds", "url": "http://cvpr.thecvf.com/api/miniconf/users/181024?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 184572, "fullname": "Maisy Lam", "url": "http://cvpr.thecvf.com/api/miniconf/users/184572?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 184573, "fullname": "Waleed Akbar", "url": "http://cvpr.thecvf.com/api/miniconf/users/184573?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 184574, "fullname": "Yibo Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184574?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 184575, "fullname": "Fadel Adib", "url": "http://cvpr.thecvf.com/api/miniconf/users/184575?format=json", "institution": "MIT &amp; Cartesian"}], "abstract": "We present Wave-Former, a novel method capable of high-accuracy 3D shape reconstruction for completely occluded, diverse, everyday objects. This capability can open new applications spanning robotics, augmented reality, and logistics. Our approach leverages millimeter-wave (mmWave) wireless signals, which can penetrate common occlusions and reflect off hidden objects. In contrast to past mmWave reconstruction methods, which suffer from limited coverage and high noise, Wave-Former introduces a physics-aware shape completion model capable of inferring full 3D geometry. At the heart of Wave-Former's design is a novel three-stage pipeline which bridges raw wireless signals with recent advancements in vision-based shape completion by incorporating physical properties of mmWave signals. The pipeline proposes candidate geometric surfaces, employs a transformer-based shape completion model designed specifically for mmWave signals, and finally performs entropy-guided surface selection. This enables Wave-Former to be trained using entirely synthetic point-clouds, while demonstrating impressive generalization to real-world data. In head-to-head comparisons with state-of-the-art baselines,  Wave-Former raises recall from 54% to 72% while maintaining a high precision of 85%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36247", "url": null, "sourceid": 46757, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36252, "uid": "348c32d6b35abb039ee53da3c36b6a63", "name": "ORPO: Orthogonal Panel-Relative Operators for Panel-Aware In-Context Image Generation", "authors": [{"id": 183055, "fullname": "Sanghyeon Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/183055?format=json", "institution": "\ub098"}, {"id": 184583, "fullname": "Minwoo Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/184583?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 184584, "fullname": "Euijin Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184584?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 184585, "fullname": "Kangyeol Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/184585?format=json", "institution": null}, {"id": 183673, "fullname": "Seunghwan Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/183673?format=json", "institution": "KAIST"}, {"id": 87936, "fullname": "Jaegul Choo", "url": "http://cvpr.thecvf.com/api/miniconf/users/87936?format=json", "institution": "Korea Advanced Institute of Science and Technology"}], "abstract": "We introduce the Orthogonal Panel-Relative Operator (OPRO), a novel parameter-efficient adaptation method for tiled-panel In-Context Generation (ICG) that utilizes the pre-trained Diffusion Transformers (DiTs). OPRO works by composing learnable, panel-specific orthogonal operators onto the backbone's frozen positional encodings. This design provides two properties: 1) Isometry, which maintains feature geometry to promote stable fine-tuning, and 2) Same-Panel Invariance, which perfectly preserves the model's powerful pre-trained intra-panel synthesis capabilities. We conduct a controlled analysis demonstrating that OPRO's effectiveness is not limited to RoPE but consistently enhances performance across various positional encodings that satisfy orthogonality. By enabling effective panel-relative learning while simultaneously protecting the backbone's core synthesis power, OPRO consistently improves ICG-based instructional image editing methods, including state-of-the-art methods ICEdit.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36252", "url": null, "sourceid": 44237, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36253, "uid": "0941839a257545ab919cd01ae3e568b5", "name": "MLLM-HWSI: A Multimodal Large Language Model for Hierarchical Whole Slide Image Understanding", "authors": [{"id": 155661, "fullname": "Basit Alawode", "url": "http://cvpr.thecvf.com/api/miniconf/users/155661?format=json", "institution": "Khalifa University of Science, Technology and Research"}, {"id": 126160, "fullname": "Arif Mahmood", "url": "http://cvpr.thecvf.com/api/miniconf/users/126160?format=json", "institution": "Information Technology University, Lahore"}, {"id": 184586, "fullname": "Muaz Radi", "url": "http://cvpr.thecvf.com/api/miniconf/users/184586?format=json", "institution": "Khalifa University of Science, Technology and Research"}, {"id": 141578, "fullname": "Shahad Albastaki", "url": "http://cvpr.thecvf.com/api/miniconf/users/141578?format=json", "institution": "Khalifa University of Science, Technology and Research"}, {"id": 149156, "fullname": "Asim Khan", "url": "http://cvpr.thecvf.com/api/miniconf/users/149156?format=json", "institution": "Khalifa University, Abu Dhabi"}, {"id": 184587, "fullname": "Muhammad Bilal", "url": "http://cvpr.thecvf.com/api/miniconf/users/184587?format=json", "institution": "King Abdul Aziz University"}, {"id": 184588, "fullname": "Moshira Abdalla", "url": "http://cvpr.thecvf.com/api/miniconf/users/184588?format=json", "institution": "Khalifa University of Science, Technology and Research"}, {"id": 88320, "fullname": "Mohammed Bennamoun", "url": "http://cvpr.thecvf.com/api/miniconf/users/88320?format=json", "institution": "University of Western Australia"}, {"id": 70835, "fullname": "Sajid Javed", "url": "http://cvpr.thecvf.com/api/miniconf/users/70835?format=json", "institution": "Khalifa University of Science and Technology"}], "abstract": "Whole Slide Images (WSIs) exhibit hierarchical structure, where diagnostic cues arise from cellular morphology, regional tissue organization, and global context. Existing Computational Pathology (CPath) Multimodal Large Language Models (MLLMs) typically compress an entire WSI into a single embedding, which hinders fine-grained grounding and ignores how pathologists synthesize evidence across different scales.  We introduce \\textbf{MLLM-HWSI}, a Hierarchical WSI-level MLLM that aligns visual features with pathology language at four distinct scales\u2014cell as word, patch as phrase, region as sentence, and WSI as paragraph\u2014to support interpretable, evidence-grounded reasoning.  MLLM-HWSI decomposes each WSI into multi-scale embeddings with scale-specific V$\\rightarrow$L projectors and jointly enforces (i) a hierarchical contrastive objective and (ii) a cross-scale consistency loss, preserving semantic coherence from cells to the WSI. To make gigapixel processing tractable and clinically meaningful, we compute diagnostically relevant tokens and aggregate segmented cell embeddings into a compact cellular token per-patch using a lightweight \\textit{Cell\u2013Cell Attention Fusion (CCAF)} transformer. The projected multi-scale tokens are fused with text tokens and fed to an instruction-tuned LLM for open-ended reasoning, VQA, report, and caption generation tasks.  Trained in three stages, MLLM-HWSI achieves new SOTA results on 13 WSI-level benchmarks across six CPath tasks. By grounding language in calibrated, multi-scale visual evidence, HMLLM provides accurate, interpretable outputs that mirror expert diagnostic workflows and advance holistic WSI understanding. Code will be released upon the publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36253", "url": null, "sourceid": 44336, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36255, "uid": "4d2fe2e8601f7a8018594d98f28706f2", "name": "Enhancing Out-of-Distribution Detection with Extended Logit Normalization", "authors": [{"id": 180287, "fullname": "Yifan Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/180287?format=json", "institution": "Link\u00f6ping University"}, {"id": 72703, "fullname": "Xixi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/72703?format=json", "institution": "Imperial College London"}, {"id": 184591, "fullname": "Jonas Unger", "url": "http://cvpr.thecvf.com/api/miniconf/users/184591?format=json", "institution": "Link\u00f6ping University"}, {"id": 184592, "fullname": "Gabriel Eilertsen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184592?format=json", "institution": "Link\u00f6ping University"}], "abstract": "Out-of-distribution (OOD) detection is essential for the safe deployment of machine learning models. Extensive work has focused on devising various scoring functions for detecting OOD samples, while only a few studies focus on training neural networks using certain model calibration objectives, which often lead to a compromise in predictive accuracy and support only limited choices of scoring functions. In this work, we first identify the feature collapse phenomena in Logit Normalization (LogitNorm), then propose a novel hyperparameter-free formulation that significantly benefits a wide range of post-hoc detection methods. To be specific, we devise a feature distance-awareness loss term in addition to LogitNorm, termed $\\textbf{ELogitNorm}$, which enables improved OOD detection and in-distribution (ID) confidence calibration. Extensive experiments across standard benchmarks demonstrate that our approach outperforms state-of-the-art training-time methods in OOD detection while maintaining strong ID classification accuracy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36255", "url": null, "sourceid": 30847, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36256, "uid": "e2a25d092213dd181de2501aa0d8b987", "name": "Layered 4D-Rotor Gaussian Splatting: A Compressed Representation for Long Dynamic Scenes", "authors": [{"id": 184593, "fullname": "Hanjie Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184593?format=json", "institution": "Peking University"}, {"id": 184594, "fullname": "Yuanxing Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184594?format=json", "institution": "Galbot"}, {"id": 76909, "fullname": "Qiyu Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/76909?format=json", "institution": "Peking University"}, {"id": 85492, "fullname": "Ge Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/85492?format=json", "institution": "Peking University Shenzhen Graduate School"}, {"id": 88739, "fullname": "Baoquan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88739?format=json", "institution": "Peking University"}, {"id": 74064, "fullname": "He Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/74064?format=json", "institution": "Galbot"}], "abstract": "We address the challenge of reconstructing long dynamic scenes from multi-view videos in a storage-efficient manner. Recent advances in Gaussian Splatting and its extensions to dynamic scenes have demonstrated impressive visual quality, but remain limited to short duration (<10 s), large storage size (>500 MB), and high GPU VRAM usage.To overcome these limitations, we introduce Layered 4D-Rotor Gaussian Splatting (L4DRotorGS), a novel compressed representation designed for long dynamic scenes. Our approach integrates a layered 4D representation, efficient training, and effective compression into a unified framework. Specifically, 4D Gaussians are first organized into layers based on their temporal extents and then partitioned into discrete temporal buckets. This structure allows for selective access and rendering of only the necessary subsets of 4D Gaussians, substantially reducing GPU memory requirements.To further compress the representation, we apply a series of techniques, Factorized Covariance Quantization, Layered Compression, and Residual Codebook Quantization, achieving a compression ratio of up to 22.3\u00d7 while preserving high visual fidelity.We implement a highly optimized C++/CUDA framework for efficient training, compression, and real-time rendering, achieving over 500 FPS on an RTX 3090 GPU. Extensive experiments demonstrate the superior storage efficiency, visual quality, and rendering speed of L4DRotorGS, consistently outperforming prior methods in both quantitative metrics and perceptual quality on real-world long dynamic scenes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36256", "url": null, "sourceid": 34898, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36257, "uid": "6d6ab8b0d3dfb637e7ce3d5841be0115", "name": "PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models", "authors": [{"id": 149656, "fullname": "Wonyong Seo", "url": "http://cvpr.thecvf.com/api/miniconf/users/149656?format=json", "institution": "KAIST"}, {"id": 107109, "fullname": "Jaeho Moon", "url": "http://cvpr.thecvf.com/api/miniconf/users/107109?format=json", "institution": "KAIST"}, {"id": 155915, "fullname": "Jaehyup Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/155915?format=json", "institution": "Kyungpook National University"}, {"id": 86366, "fullname": "Soo Ye Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/86366?format=json", "institution": "Adobe Research"}, {"id": 87114, "fullname": "Munchurl Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/87114?format=json", "institution": "Korea Advanced Institute of Science and Technology"}], "abstract": "Propagation-based video editing enables precise user control by propagating a single edited frame into following frames while maintaining the original context such as motion and structures.However, training such models requires large-scale, paired (source and edited) video datasets, which are costly and complex to acquire.Hence, we propose the PropFly, a training pipeline for Propagation-based video editing, relying on on-the-Fly supervision from pre-trained video diffusion models (VDMs) instead of requiring off-the-shelf or precomputed paired video editing datasets.Specifically, our PropFly leverages one-step clean latent estimations from intermediate noised latents with varying Classifier-Free Guidance (CFG) scales to synthesize diverse pairs of 'source' (low-CFG) and 'edited' (high-CFG) latents on-the-fly. The source latent serves as structural information of the video, while the edited latent provides the target transformation for learning propagation.Our pipeline enables an additional adapter attached to the pre-trained VDM to learn to propagate edits via Guidance-Modulated Flow Matching (GMFM) loss, which guides the model to replicate the target transformation.Our on-the-fly supervision ensures the model to learn temporally consistent and dynamic transformations.Extensive experiments demonstrate that our PropFly significantly outperforms the state-of-the-art methods on various video editing tasks, producing high-quality editing results.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36257", "url": null, "sourceid": 45480, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36259, "uid": "9700420bcba0097e5526de0467c0f74e", "name": "RARE: Learn to RAnk and REtrieve for Monocular 3D Object Detection", "authors": [{"id": 155199, "fullname": "Hyeonjeong Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/155199?format=json", "institution": "University of Illinois at Chicago"}, {"id": 155200, "fullname": "Peixi Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/155200?format=json", "institution": "Intel Labs"}, {"id": 135777, "fullname": "Xiaoqian Ruan", "url": "http://cvpr.thecvf.com/api/miniconf/users/135777?format=json", "institution": "University of Illinois at Chicago"}, {"id": 143152, "fullname": "Dian Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/143152?format=json", "institution": "University of Illinois at Chicago"}, {"id": 155198, "fullname": "Pei Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155198?format=json", "institution": "Microsoft"}, {"id": 88199, "fullname": "Wei Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88199?format=json", "institution": "University of Illinois, Chicago"}], "abstract": "Monocular 3D object detection from a single RGB image remains challenging due to two fundamental challenges: the ill-posed nature of 3D localization, where multiple plausible configurations can correspond to the same 2D observation, and unreliable confidence estimation that fails to reflect true localization accuracy. Existing methods predict deterministic 3D boxes that often collapse to implausible mean estimates and rely on absolute confidence scores that are highly sensitive to localization errors. This paper introduces RARE, a unified framework that addresses both challenges through learning to rank and retrieve. RARE formulates confidence estimation as a ranking problem, learning to order detections by their relative quality rather than regressing absolute values. It provides more robust and stable confidence estimates that are less sensitive to localization uncertainty. Building on this improved confidence estimator, RARE learns to construct a query set for each object that predicts multiple diverse and plausible 3D configurations, and retrieves the top-ranked prediction. It explicitly models the multimodal nature of monocular 3D perception and produces more plausible localizations. Extensive experiments demonstrate the effectiveness of RARE. We will make the code publicly available.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36259", "url": null, "sourceid": 38394, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36266, "uid": "5fdb81013e74b3bb0c0e0ce50249c0ca", "name": "Joint Learning of General and Diverse Patterns with Mixture of Memory Experts for Weakly-Supervised Video Anomaly Detection", "authors": [{"id": 181980, "fullname": "Bo Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/181980?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 107232, "fullname": "Junxi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/107232?format=json", "institution": null}, {"id": 126527, "fullname": "Zhe Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126527?format=json", "institution": "Pengcheng Laboratory"}, {"id": 129700, "fullname": "Feng Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/129700?format=json", "institution": "Peking University"}, {"id": 184623, "fullname": "Fan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184623?format=json", "institution": "Peking University"}, {"id": 126309, "fullname": "Li Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/126309?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 87840, "fullname": "Yaowei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87840?format=json", "institution": "Pengcheng Laboratory"}], "abstract": "Weakly-supervised Video Anomaly Detection (wVAD) aims to detect abnormal events using only binary labels,   making it challenging to capture both the diversity of anomalies and their shared semantic cues.   Existing methods either focus on a generic anomaly pattern, achieving strong generalization but weak discrimination,   or rely on class-level diversity modeling, which ignores shared semantics and suffers from limited generalization.   To overcome these limitations, we propose the Mixture of Memory Experts (MoME),    a unified framework that jointly learns general and diverse patterns.   Each expert in MoME possesses an internal memory for fine-grained specialization    and shares an external memory for general knowledge aggregation.   To enhance semantic diversity and improve generalization beyond coarse class-level supervision,   we introduce an Anomaly Prototype Router that leverages large language models    to construct generalized anomaly prototypes for semantically guided expert routing.   Moreover, the regularization loss for APR ensures balanced routing, the distinctiveness loss    for experts encourages diversity, and reconstruction together with memory tasks enhance pattern discriminability.   Extensive experiments on UCF-Crime and XD-Violence demonstrate that our approach achieves    state-of-the-art performance, validating the effectiveness of jointly modeling generality and diversity    for robust anomaly detection under weak supervision.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36266", "url": null, "sourceid": 36078, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36270, "uid": "030ed9e0b2210f2207c3d0b9638898c3", "name": "HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models", "authors": [{"id": 184634, "fullname": "Haiyan Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184634?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 177975, "fullname": "Deyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177975?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 184635, "fullname": "dongdong weng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184635?format=json", "institution": null}, {"id": 175674, "fullname": "Weitao Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/175674?format=json", "institution": "Beijing Institute of Technology"}, {"id": 184636, "fullname": "Henry Duh", "url": "http://cvpr.thecvf.com/api/miniconf/users/184636?format=json", "institution": "Hong Kong Polytechnic University"}], "abstract": "The 3D layout generation and editing play a crucial role in Embodied AI and immersive VR interaction. However, manual creation requires extensive and tedious labor, while data-driven generation often lacks diversity. The emergence of large models introduces new possibilities for automatic 3D scene synthesis. We present HOG-Layout that enables text-driven hierarchical scene generation, optimization and real-time scene editing with large language models (LLMs) and vision-language models (VLMs). HOG-Layout improves scene semantic consistency and plausibility through retrieval-augmented generation (RAG) technology, incorporates an optimization module to enhance physical consistency, and adopts a hierarchical representation to enhance inference and optimization, achieving real-time editing. Experimental results demonstrate that HOG-Layout produces more reasonable environments compared with existing baselines, while supporting fast and intuitive scene editing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36270", "url": null, "sourceid": 43902, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36274, "uid": "be628b567b4ebb5b4221ff54edb8d561", "name": "Protect to Adapt: Subspace-Constrained Adaptation with Ranked Negative Prompt Feedback for Few-Shot Action Recognition", "authors": [{"id": 182042, "fullname": "Hantao Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/182042?format=json", "institution": "Xiamen University"}, {"id": 184648, "fullname": "Yan Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184648?format=json", "institution": "Xiamen University"}, {"id": 184649, "fullname": "Junlong Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184649?format=json", "institution": "Xiamen University"}, {"id": 153844, "fullname": "Hanzi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153844?format=json", "institution": "Xiamen University"}], "abstract": "Adapting Vision\u2013Language Models (VLMs) to few-shot action recognition (FSAR) often trades accuracy for stability: task-specific gains can trigger catastrophic forgetting of domain-general knowledge and reduce inter-class margins. In few-shot episodes, each query is contrasted with only one positive class and a few negatives, so the text encoder sees limited prompt diversity and rarely observes hard counter-examples near decision boundaries. We propose Protect-to-Adapt (P2A), a parameter-efficient fine-tuning method with two complementary modules. Orthogonal Subspace Control (OSC) estimates a principal semantic subspace of the pre-trained backbone and constrains low-rank updates to its orthogonal complement, preserving domain-general semantics while allowing task-specific adaptation. Ranked Negative-prompt Curriculum (RNC) uses a large language model to generate verifier-filtered negative prompts with increasing difficulty. These class-specific hard counter-examples enlarge margins and sharpen decision boundaries under few-shot conditions. With only 2\\% of backbone parameters trainable, P2A achieves state-of-the-art performance on five FSAR benchmarks and substantially reduces catastrophic forgetting in a cross-dataset continual-learning setting where the model is adapted sequentially to multiple video datasets without replay.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36274", "url": null, "sourceid": 38577, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36279, "uid": "1306c03a55fa10ba20fa30bc3b098338", "name": "Hilbert Curve-Based Attention Enabling Topology-Preserving Image Tensor Representation for Semantic Segmentation Network", "authors": [{"id": 174141, "fullname": "Linkang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/174141?format=json", "institution": "TongJi University"}, {"id": 157043, "fullname": "Gang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/157043?format=json", "institution": "Tongji University"}, {"id": 175320, "fullname": "Yue Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/175320?format=json", "institution": "Tongji University"}, {"id": 184663, "fullname": "Xiangxin Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/184663?format=json", "institution": "Tongji University"}], "abstract": "Drone-based building defect segmentation remains challenging due to complex surface textures and illumination variations. We propose TPSegformer, a topology-preserving segmentation framework that mitigates mis-segmentation in such scenarios. Its decoder incorporates a Hilbert curve\u2013based topology-preserving mechanism to maintain spatial continuity and boundary precision during category layer computation. A lightweight multi-scale fusion module enhances semantic representation, while global context modeling strengthens holistic perception. Experiments on the building defect dataset show that TPSegformer outperforms existing segmentation methods, achieving 80.77\\% mIoU and 90.22\\% Acc. On the Dacl10k dataset, it maintains strong generalization, reaching 44.27\\% mIoU and 60.32\\% Acc across diverse materials and defect types.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36279", "url": null, "sourceid": 43845, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36281, "uid": "f5ab797de5386f2868f35ff96a052cd9", "name": "Debiased Sample Selection for Learning with Noisy Labels", "authors": [{"id": 180826, "fullname": "Weiran Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/180826?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 130050, "fullname": "Wei Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/130050?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 184669, "fullname": "Wenfeng xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/184669?format=json", "institution": "Pingan Technology"}], "abstract": "Existing methods for learning with noisy labels (LNL) predominantly rely on the small-loss trick, assuming that low-loss samples are more likely to be correctly labeled. While effective, this strategy suffers from two overlooked confirmation biases: (1) Class-level confirmation bias: samples from easy-to-learn classes tend to have lower losses, leading to over-selection of easy samples while ignore hard ones; (2) Instance-level confirmation bias: mislabeled samples with spuriously low loss are mistakenly treated as clean, forcing the model to memorize wrong labels. Both biases accumulate over training and degrade performance. To mitigate these issues, we propose Marginal Distribution Adjustment (MDA) and Candidate Class Selection (CCS). MDA dynamically reshapes the model\u2019s predicted class distribution toward uniformity, ensuring more fair sample selection across classes. CCS leverages training dynamics to identify likely true labels and removes them from the classification task, preventing memorization of incorrect annotations while converting weakly related labels into useful supervision. Both MDA and CCS are plug-and-play modules. Extensive experiments show that integrating MDA and CCS into either existing sample selectors or advanced LNL pipeline consistently enhances performance on both CIFAR-10/100 with synthetic noise and real-world datasets (CIFAR-N, Clothing1M, WebVision), demonstrating their broad applicability in LNL methods. Our code will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36281", "url": null, "sourceid": 46550, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36320, "uid": "03114c0c3fae310726281dd79793bcc3", "name": "Modeling Spatiotemporal Neural Frames for High Resolution Brain Dynamic", "authors": [{"id": 180154, "fullname": "Wanying Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180154?format=json", "institution": "Fudan University"}, {"id": 149525, "fullname": "Jianxiong Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/149525?format=json", "institution": "Fudan University"}, {"id": 184766, "fullname": "Wei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184766?format=json", "institution": "Southern University of Science and Technology"}, {"id": 73907, "fullname": "Yanwei Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73907?format=json", "institution": "Fudan University"}], "abstract": "Capturing dynamic spatiotemporal neural activity is essential for understanding large-scale brain mechanisms. Functional magnetic resonance imaging (fMRI) provides high-resolution cortical representations that form a strong basis for characterizing fine-grained brain activity patterns. The high acquisition cost of fMRI limits large-scale applications, therefore making high-quality fMRI reconstruction a crucial task. Electroencephalography (EEG) offers millisecond-level temporal cues that complement fMRI. Leveraging this complementarity, we present an EEG-conditioned framework for reconstructing dynamic fMRI as continuous neural sequences with high spatial fidelity and strong temporal coherence at the cortical-vertex level. To address sampling irregularities common in real fMRI acquisitions, we incorporate a null-space intermediate-frame reconstruction, enabling measurement-consistent completion of arbitrary intermediate frames and improving sequence continuity and practical applicability. Experiments on the CineBrain dataset demonstrate superior voxel-wise reconstruction quality and robust temporal consistency across whole-brain and functionally specific regions. The reconstructed fMRI also preserves essential functional information, supporting downstream visual decoding tasks. This work provides a new pathway for estimating high-resolution fMRI dynamics from EEG and advances multimodal neuroimaging toward more dynamic brain activity modeling.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36320", "url": null, "sourceid": 46034, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36283, "uid": "339d3905144e063a03c4f0fe26912927", "name": "SPDMark: Selective Parameter Displacement for Robust Video Watermarking", "authors": [{"id": 107194, "fullname": "Samar Fares", "url": "http://cvpr.thecvf.com/api/miniconf/users/107194?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 76414, "fullname": "Nurbek Tastan", "url": "http://cvpr.thecvf.com/api/miniconf/users/76414?format=json", "institution": "MBZUAI"}, {"id": 153190, "fullname": "Karthik Nandakumar", "url": "http://cvpr.thecvf.com/api/miniconf/users/153190?format=json", "institution": "Michigan State University"}], "abstract": "The advent of high-quality video generation models has amplified the need for robust watermarking schemes that can be used to reliably detect and track the provenance of generated videos. Existing video watermarking methods based on both post-hoc and in-generation approaches fail to simultaneously achieve imperceptibility, robustness, and computational efficiency. This work introduces a novel framework for in-generation video watermarking called \\textbf{\\texttt{SPDMark}} (pronounced `SpeedMark') based on \\textbf{selective parameter displacement} of a video diffusion model. Watermarks are embedded into the generated videos by modifying a subset of parameters in the generative model. To make the problem tractable, the displacement is modeled as an additive composition of layer-wise basis shifts, where the final composition is indexed by the watermarking key. For parameter efficiency, this work specifically leverages low-rank adaptation (LoRA) to implement the basis shifts. During the training phase, the basis shifts and the watermark extractor are jointly learned by minimizing a combination of message recovery, perceptual similarity, and temporal consistency losses. To detect and localize temporal modifications in the watermarked videos, we use a cryptographic hashing function to derive frame-specific watermark messages from the given base watermarking key. During watermark extraction, maximum bipartite matching is applied to recover the correct frame order, even from temporally tampered videos. Evaluations on both text-to-video and image-to-video generation models demonstrate the ability of \\textbf{\\texttt{SPDMark}} to generate imperceptible watermarks that can be recovered with high accuracy and also establish its robustness against a variety of common video modifications.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36283", "url": null, "sourceid": 37666, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36284, "uid": "5e3196bdf1782fa1b4485fdd167ce6f3", "name": "DNF-SR: Dual-Input and Negative-Aware Feature Fine-Tuning for Real-World Image Super-Resolution", "authors": [{"id": 145730, "fullname": "Shuhao Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/145730?format=json", "institution": "Nankai University"}, {"id": 184675, "fullname": "Wenjie Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184675?format=json", "institution": null}, {"id": 133689, "fullname": "Haotian Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/133689?format=json", "institution": "ByteDance"}, {"id": 184676, "fullname": "Hang Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/184676?format=json", "institution": "ByteDance"}, {"id": 184677, "fullname": "Rui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184677?format=json", "institution": "ByteDance Inc."}, {"id": 76506, "fullname": "Chun-Le Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/76506?format=json", "institution": "Nankai University"}, {"id": 73507, "fullname": "Chongyi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/73507?format=json", "institution": "Nankai University"}], "abstract": "Benefiting from the powerful generative priors of diffusion models, diffusion-based real-world image super-resolution (Real-ISR) methods have demonstrated impressive performance.To achieve efficient Real-ISR, several recent works have designed one-step diffusion-based models.Howerver, unmediatedly feeding LR into a diffusion model creates a distributional gap with the model's original input.A straightforward approach to reduce the distribution gap is to introduce noise to the LR latents. However, directly adding noise inevitably corrupts the content of the LR images.In this study, we propose \\textbf{DNF\u2011SR}, a \\textbf{D}ual\u2011input and \\textbf{N}egative\u2011aware \\textbf{F}eature fine\u2011tuning method for Real-ISR.Specifically, we use a \\textbf{dual-input} strategy that concatenates the original LR image with the noisy LR input and feeds them into a diffusion-based image editing model, ensuring both high-fidelity one-step super-resolution and improved perceptual and content consistency.Additionally, the noise present in the noisy LR input introduces randomness and diversity into the outputs. We exploit this property and propose a post-training optimization method, Negative-aware Feature Fine-Tuning (NF\u00b2T), which guides the model toward producing higher-quality results.NF$^2$T classifies multiple outputs into positive and negative subsets and then defines implicit policy improvement directions in both the image and feature spaces, thereby further enhancing the stability of the optimization.Extensive experiments show that DNF-SR outperforms other methods.Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36284", "url": null, "sourceid": 43755, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36285, "uid": "86ff09548a5c6eff1ec764a28b6c8112", "name": "Any4D: Unified Feed-Forward Metric 4D Reconstruction", "authors": [{"id": 72237, "fullname": "Jay Karhade", "url": "http://cvpr.thecvf.com/api/miniconf/users/72237?format=json", "institution": "Carnegie Mellon University"}, {"id": 129424, "fullname": "Nikhil Keetha", "url": "http://cvpr.thecvf.com/api/miniconf/users/129424?format=json", "institution": "Carnegie Mellon University"}, {"id": 184678, "fullname": "Yuchen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184678?format=json", "institution": null}, {"id": 184679, "fullname": "Tanisha Gupta", "url": "http://cvpr.thecvf.com/api/miniconf/users/184679?format=json", "institution": "Carnegie Mellon University"}, {"id": 184680, "fullname": "Akash Sharma", "url": "http://cvpr.thecvf.com/api/miniconf/users/184680?format=json", "institution": "Amazon; Carnegie Mellon University"}, {"id": 73029, "fullname": "Sebastian Scherer", "url": "http://cvpr.thecvf.com/api/miniconf/users/73029?format=json", "institution": "Carnegie Mellon University"}, {"id": 75455, "fullname": "Deva Ramanan", "url": "http://cvpr.thecvf.com/api/miniconf/users/75455?format=json", "institution": "Carnegie Mellon University"}], "abstract": "We present Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. Any4D directly generates per-pixel motion and geometry predictions for N frames, in contrast to prior work that typically focuses on either 2-view dense scene flow or sparse 3D point tracking. Moreover, unlike other recent methods for 4D reconstruction from monocular RGB videos, Any4D can process additional modalities and sensors such as RGB-D frames, IMU-based egomotion, and Radar Doppler measurements, when available. One of the key innovations that allows for such a flexible framework  is a modular representation of a 4D scene; specifically, per-view 4D predictions are encoded using a variety of egocentric factors (depthmaps and camera intrinsics) represented in local camera coordinates, and allocentric factors (camera extrinsics and scene flow) represented in global world coordinates. We achieve superior performance across diverse setups - both in terms of accuracy (2-3X lower error) and compute efficiency (15X faster) - opening avenues for multiple downstream applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36285", "url": null, "sourceid": 34199, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36288, "uid": "23fdc2b96c1afdf6d8ab932f150ef29a", "name": "OSMO: Open-vocabulary Self-eMOtion Tracking", "authors": [{"id": 107400, "fullname": "Mohamed Abdelfattah", "url": "http://cvpr.thecvf.com/api/miniconf/users/107400?format=json", "institution": "EPFL, VITA"}, {"id": 129498, "fullname": "Bugra Tekin", "url": "http://cvpr.thecvf.com/api/miniconf/users/129498?format=json", "institution": "Meta"}, {"id": 164461, "fullname": "Fadime Sener", "url": "http://cvpr.thecvf.com/api/miniconf/users/164461?format=json", "institution": "Meta"}, {"id": 184696, "fullname": "Necati Cihan Camgoz", "url": "http://cvpr.thecvf.com/api/miniconf/users/184696?format=json", "institution": "Meta"}, {"id": 129505, "fullname": "Eric Sauser", "url": "http://cvpr.thecvf.com/api/miniconf/users/129505?format=json", "institution": "Meta"}, {"id": 140328, "fullname": "Shugao Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/140328?format=json", "institution": "Meta"}, {"id": 97325, "fullname": "Alex Alahi", "url": "http://cvpr.thecvf.com/api/miniconf/users/97325?format=json", "institution": "EPFL"}, {"id": 151053, "fullname": "Edoardo Remelli", "url": "http://cvpr.thecvf.com/api/miniconf/users/151053?format=json", "institution": "Facebook"}], "abstract": "We introduce the novel task of egocentric self-emotion tracking, which aims to infer an individual's evolving emotions from egocentric multimodal streams such as voice, visual surroundings, semantic subtext, and eye-tracking signals. To establish this research direction, we present: (1) OSMO dataset, a large-scale annotation effort on 110 hours of existing bilingual smart-glasses recordings, establishing the largest egocentric emotion dataset and the first with subject-wise emotion timelines; (2) OSMO benchmark, a suite of five tasks (emotion recognition, sentiment, intensity, localization, and reasoning), that redefine emotion understanding as a continuous, context-aware process rather than discrete classification of trimmed videos; (3) OSIRIS, a large multimodal model that tracks emotions over time by reasoning over the user's personal emotion history, current expressions, and egocentric observations. Extensive evaluations show that OSIRIS achieves a state-of-the-art performance, delivering, for the first time, coherent emotion timelines from egocentric data. Dataset, model, and codes will be fully open-sourced upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36288", "url": null, "sourceid": 39901, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36290, "uid": "6106c564e568969d698ddd9c43632195", "name": "Task-Aware Image Signal Processor for Advanced Visual Perception", "authors": [{"id": 180244, "fullname": "CHEN KAI", "url": "http://cvpr.thecvf.com/api/miniconf/users/180244?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 184705, "fullname": "Jin Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184705?format=json", "institution": "ByteDance Inc."}, {"id": 99636, "fullname": "Leheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/99636?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 107533, "fullname": "Kexuan Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/107533?format=json", "institution": "Xidian University"}, {"id": 86553, "fullname": "Shuhang Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86553?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "In recent years, there has been a growing trend in computer vision towards exploiting RAW sensor data, which preserves richer information compared to conventional low-bit RGB images. Early studies mainly focused on enhancing visual quality, while more recent efforts aim to leverage the abundant information in RAW data to improve the performance of visual perception tasks such as object detection and segmentation. However, existing approaches still face two key limitations: large-scale ISP networks impose heavy computational overhead, while methods based on tuning traditional ISP pipelines are restricted by limited representational capacity.To address these issues, we propose Task-Aware Image Signal Processor (TA-ISP), a compact RAW-to-RGB framework that produces task-oriented representations for pretrained vision models. Instead of heavy dense convolutional pipelines, TA-ISP predicts a small set of lightweight, multi-scale modulation operators that act at global, regional, and pixel scales to reshape image statistics across different spatial extents. This factorized control significantly expands the range of spatially varying transforms that can be represented while keeping memory usage, computation, and latency tightly constrained. Evaluated on several RAW-domain detection and segmentation benchmarks under both daytime and nighttime conditions, TA-ISP consistently improves downstream accuracy while markedly reducing parameter count and inference time, making it well suited for deployment on resource-constrained devices.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36290", "url": null, "sourceid": 46048, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36291, "uid": "2d92499c5520dcade5254e16888eefb8", "name": "DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution", "authors": [{"id": 130504, "fullname": "Zhengyao Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/130504?format=json", "institution": "University of Hong Kong"}, {"id": 88163, "fullname": "Menghan Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/88163?format=json", "institution": "Tencent AI Lab"}, {"id": 75722, "fullname": "Xintao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75722?format=json", "institution": "Tencent"}, {"id": 88247, "fullname": "Kwan-Yee K. Wong", "url": "http://cvpr.thecvf.com/api/miniconf/users/88247?format=json", "institution": "The University of Hong Kong"}], "abstract": "Diffusion-based video super-resolution (VSR) achieves remarkable fidelity but suffers from prohibitive sampling cost. While distribution matching distillation (DMD) accelerates diffusion models to one-step generation, directly applying it to VSR leads to training instability and degraded, insufficient supervision.To address these issues, we propose \\textbf{DUO-VSR}, a three-stage framework centered on a \\textbf{DU}al-Stream Distillation strategy that integrates distribution matching and adversarial supervision for \\textbf{O}ne-step VSR.We first adopt a Progressive Guided Distillation Initialization to stabilize subsequent training through trajectory-preserving distillation.We then introduce a Dual-Stream Distillation Strategy to jointly optimize DMD and Real\u2013Fake Score Feature GAN (RFS-GAN) streams, with the latter providing complementary adversarial supervision using features from both real and fake score models.Finally, a Preference-Guided Refinement aligns the student with perceptual quality preferences.Comprehensive experiments demonstrate that DUO-VSR achieves superior visual quality and efficiency over previous one-step VSR methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36291", "url": null, "sourceid": 37265, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36294, "uid": "663aa1f4a67d9cf37f2d4f7061e06769", "name": "\u0394ynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos", "authors": [{"id": 180792, "fullname": "Chia-Hsiang Kao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180792?format=json", "institution": "Cornell University"}, {"id": 88646, "fullname": "Cong Phuoc Huynh", "url": "http://cvpr.thecvf.com/api/miniconf/users/88646?format=json", "institution": "Amazon"}, {"id": 129838, "fullname": "Chien-Yi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129838?format=json", "institution": "NVIDIA"}, {"id": 184708, "fullname": "Noranart Vesdapunt", "url": "http://cvpr.thecvf.com/api/miniconf/users/184708?format=json", "institution": "Amazon"}, {"id": 90905, "fullname": "Stefan Stojanov", "url": "http://cvpr.thecvf.com/api/miniconf/users/90905?format=json", "institution": "Georgia Institute of Technology"}, {"id": 89201, "fullname": "Bharath Hariharan", "url": "http://cvpr.thecvf.com/api/miniconf/users/89201?format=json", "institution": "Cornell University"}, {"id": 184709, "fullname": "Oleksandr Obiednikov", "url": "http://cvpr.thecvf.com/api/miniconf/users/184709?format=json", "institution": "Amazon"}, {"id": 140562, "fullname": "Ning Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/140562?format=json", "institution": "Amazon"}], "abstract": "Inferring rigid-body physical states and properties from monocular videos is a fundamental step toward physics-based perception and simulation. Existing approaches assume specific underlying physical systems, object types, and camera poses, which are unable to generalize to complex real-world settings. We introduce \u0394ynamics, a vision-language framework that uses language as a unified representation of rigid-body dynamics. Instead of directly predicting parameters, \u0394ynamics generates scene configurations in a structured text format for physics simulation. We enhance the model's generalization by integrating natural language motion reasoning and leveraging optical flow as a semantic-agnostic input. On the CLEVRER dataset, \u0394ynamics achieves a segmentation IoU of $0.30$, a $7\\times$ improvement over leading VLMs (InternVL3-8B, Qwen2.5-VL-7B and Claude-4-Sonnet). Further, test-time sampling and evolutionary search further boost performance by 27% and 120% in segmentation IoU, respectively. Finally, we demonstrate strong transfer to a new dataset of $235$ real-world rigid-body videos, highlighting the potential of language-driven physics inference for bridging perception and simulation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36294", "url": null, "sourceid": 31827, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36302, "uid": "e89b40e45012370acb2fc278b3cb64fd", "name": "GenColorBench: A Color Evaluation Benchmark for Text-to-Image Generation", "authors": [{"id": 106595, "fullname": "Muhammad Atif Butt", "url": "http://cvpr.thecvf.com/api/miniconf/users/106595?format=json", "institution": "Universitat Autonoma De Barcelona"}, {"id": 153001, "fullname": "Alexandra Gomez-Villa", "url": "http://cvpr.thecvf.com/api/miniconf/users/153001?format=json", "institution": "Universitat Aut\u00f3noma de Barcelona"}, {"id": 147952, "fullname": "Tao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/147952?format=json", "institution": "COMPUTER VISION CENTER"}, {"id": 92830, "fullname": "Javier Vazquez-Corral", "url": "http://cvpr.thecvf.com/api/miniconf/users/92830?format=json", "institution": "Computer Vision Center / Autonomous University of Barcelona"}, {"id": 90564, "fullname": "Joost van de Weijer", "url": "http://cvpr.thecvf.com/api/miniconf/users/90564?format=json", "institution": "Computer Vision Center Barcelona"}, {"id": 184736, "fullname": "Kai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184736?format=json", "institution": "City University of Hong Kong (Dongguan)"}], "abstract": "Recent years have seen impressive advances in text-to-image generation, with image generative or unified models, generating high-quality images from text. Yet these models still struggle with fine-grained color controllability, often failing to accurately match colors specified in text prompts. While existing benchmarks evaluate compositional reasoning and prompt adherence, none systematically assess the color precision. Color is fundamental to human visual perception and communication, critical for applications from art to design workflows requiring brand consistency. However, current benchmarks either neglect color or rely on coarse assessments, missing key capabilities like interpreting RGB values or aligning with human expectations. To this end, we propose GenColorBench, the first comprehensive benchmark for T2I color generation, grounded in color systems like ISCC-NBS and CSS3/X11, including numerical colors which are absent elsewhere. With 44K color-focused prompts covering 400+ colors, it reveals models\u2019 true capabilities via perceptual and automated assessments. Evaluations of popular T2I models using GenColorBench show performance variations, highlighting which color conventions models understand best and identifying failure modes. Our GenColorBench assessments will allow to guide improvements in precise color generation. The benchmark will be made public upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36302", "url": null, "sourceid": 42003, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36305, "uid": "033c253e16760c3b1de8b9c807e10bf7", "name": "Language-guided Frequency Modulation for Large Vision-Language Models", "authors": [{"id": 180740, "fullname": "Shuyi Ouyang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180740?format=json", "institution": "Zhejiang University"}, {"id": 76886, "fullname": "Gongfan Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76886?format=json", "institution": "National University of Singapore"}, {"id": 90356, "fullname": "Xinyin Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/90356?format=json", "institution": "National University of Singapore"}, {"id": 89287, "fullname": "Yen-Wei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89287?format=json", "institution": "Ritsumeikan University"}, {"id": 89269, "fullname": "Lanfen Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/89269?format=json", "institution": "Zhejiang University"}, {"id": 87323, "fullname": "Xinchao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87323?format=json", "institution": "National University of Singapore"}], "abstract": "Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in visual reasoning across diverse tasks. These tasks place different demands on visual representations: some prioritize high-level global context, while others emphasize fine-grained local details. However, most existing methods operate on visual representations primarily in the spatial domain, lacking an explicit mechanism for distinguishing between high-frequency local details and low-frequency global context. This limitation hinders fine-grained control of visual representations and complicates their hierarchical alignment with language. To address this issue, we introduce Language-guided Frequency Modulation (LFM), a plug-and-play approach that adaptively refines visual signals in the frequency domain under linguistic guidance. By selectively enhancing critical regions and details, LFM enables more structured and precise visual processing. Crucially, it adds no extra training parameters, relying solely on a lightweight learnable projector to refine visual tokens before integration into the LLM, thereby ensuring minimal computational overhead. Extensive experiments across diverse vision-language benchmarks highlight LFM\u2019s scalability, effectiveness, and broad applicability to LVLMs. The code will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36305", "url": null, "sourceid": 44805, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36307, "uid": "263253aaa403d4555a972b520d9f03eb", "name": "The Geometry of Robustness: Optimizing Loss Landscape Curvature and Feature Manifold Alignment for Robust Finetuning of Vision-Language Models", "authors": [{"id": 155465, "fullname": "Shivang Chopra", "url": "http://cvpr.thecvf.com/api/miniconf/users/155465?format=json", "institution": "Georgia Institute of Technology"}, {"id": 139417, "fullname": "Shaunak Halbe", "url": "http://cvpr.thecvf.com/api/miniconf/users/139417?format=json", "institution": "Georgia Institute of Technology"}, {"id": 155463, "fullname": "Chengyue Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155463?format=json", "institution": "Georgia Institute of Technology"}, {"id": 155464, "fullname": "Brisa Maneechotesuwan", "url": "http://cvpr.thecvf.com/api/miniconf/users/155464?format=json", "institution": "Georgia Institute of Technology"}, {"id": 88875, "fullname": "Zsolt Kira", "url": "http://cvpr.thecvf.com/api/miniconf/users/88875?format=json", "institution": "Georgia Institute of Technology"}], "abstract": "Fine-tuning approaches for Vision-Language Models (VLMs) face a critical three-way trade-off between In-Distribution (ID) accuracy, Out-of-Distribution (OOD) generalization, and adversarial robustness. Existing robust fine-tuning strategies resolve at most two axes of this trade-off. Generalization-preserving methods retain ID/OOD performance but leave models vulnerable to adversarial attacks, while adversarial training improves robustness to targeted attacks but degrades ID/OOD accuracy. Our key insight is that the robustness trade-off stems from two geometric failures: sharp, anisotropic minima in parameter space and unstable feature representations that deform under perturbation. To address this, we propose GRACE (Gram-aligned Robustness via Adaptive Curvature Estimation), a unified fine-tuning framework that jointly regularizes the parameter-space curvature and feature-space invariance for VLMs. Grounded in Robust PAC-Bayes theory, GRACE employs adaptive weight perturbations scaled by local curvature to promote flatter minima, combined with a feature alignment loss that maintains representation consistency across clean, adversarial, and OOD inputs. On ImageNet fine-tuning of CLIP models, GRACE simultaneously improves ID accuracy by 10.8%, and adversarial accuracy by 8.9% while maintaining 57.0% OOD accuracy (vs. 57.4% zero-shot baseline). Geometric analysis confirms that GRACE converges to flatter minima without feature distortion across distribution shifts, providing a principled step toward generalized robustness in foundation VLMs.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36307", "url": null, "sourceid": 36301, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36308, "uid": "7b5533abaa9399471b8c5d1cac9d4eb9", "name": "Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?", "authors": [{"id": 172516, "fullname": "Tilemachos Aravanis", "url": "http://cvpr.thecvf.com/api/miniconf/users/172516?format=json", "institution": "Visual Recognition Group, CTU in Prague"}, {"id": 126774, "fullname": "Vladan Stojni\u0107", "url": "http://cvpr.thecvf.com/api/miniconf/users/126774?format=json", "institution": "Czech Technical University in Prague"}, {"id": 183910, "fullname": "Vasileios Psomas", "url": "http://cvpr.thecvf.com/api/miniconf/users/183910?format=json", "institution": "Czech technical university in Prague, Faculty of Eletrical Engineering"}, {"id": 130246, "fullname": "Nikos Komodakis", "url": "http://cvpr.thecvf.com/api/miniconf/users/130246?format=json", "institution": "University of Crete"}, {"id": 73917, "fullname": "Giorgos Tolias", "url": "http://cvpr.thecvf.com/api/miniconf/users/73917?format=json", "institution": "CTU in Prague"}], "abstract": "Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision\u2013language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and adapts seamlessly to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36308", "url": null, "sourceid": 38543, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36311, "uid": "d53b6413363588d99b45a8cdfa5acad3", "name": "OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models", "authors": [{"id": 157978, "fullname": "Keda Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/157978?format=json", "institution": "Westlake University"}, {"id": 184746, "fullname": "Kele Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184746?format=json", "institution": "Westlake University"}, {"id": 184747, "fullname": "Bohan Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184747?format=json", "institution": "Alibaba Group"}, {"id": 90268, "fullname": "Weiqiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90268?format=json", "institution": "University of Southern California"}, {"id": 131732, "fullname": "Jian liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131732?format=json", "institution": "AntGroup"}, {"id": 87566, "fullname": "Huan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87566?format=json", "institution": "Northeastern University"}], "abstract": "Omnimodal large language models (OmniLLMs) have attracted increasing research attention of late towards unified audio-video understanding, wherein processing audio-video token sequences creates a significant computational bottleneck, however. Existing token compression methods have yet to accommodate this emerging need of jointly compressing multimodal tokens. To bridge this gap, we present OmniZip, a training-free, audio-guided audio-visual token-compression framework that optimizes multimodal token representation and accelerates inference. Specifically, OmniZip first identifies salient audio tokens, then computes an audio retention score for each time group to capture information density, thereby dynamically guiding video token pruning and preserving cues from audio anchors enhanced by cross-modal similarity. For each time window, OmniZip compresses the video tokens using an interleaved spatio-temporal scheme. Extensive empirical results demonstrate the merits of OmniZip - it achieves 3.42$\\times$ inference speedup and $1.4\\times$ memory reduction over other top-performing counterparts, while maintaining performance with no training.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36311", "url": null, "sourceid": 34699, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36313, "uid": "611a80e9562b4c0738021647c910846f", "name": "CARD: Correlation Aware Restoration with Diffusion", "authors": [{"id": 183373, "fullname": "Niki Nezakati", "url": "http://cvpr.thecvf.com/api/miniconf/users/183373?format=json", "institution": "University of California, Riverside"}, {"id": 183616, "fullname": "Arnab Ghosh", "url": "http://cvpr.thecvf.com/api/miniconf/users/183616?format=json", "institution": "University of California,RIverside"}, {"id": 163965, "fullname": "Amit Roy-Chowdhury", "url": "http://cvpr.thecvf.com/api/miniconf/users/163965?format=json", "institution": "University of California, Riverside"}, {"id": 131774, "fullname": "Vishwanath Saragadam", "url": "http://cvpr.thecvf.com/api/miniconf/users/131774?format=json", "institution": "Rice University"}], "abstract": "Denoising diffusion models have achieved state-of-the-art performance in image restoration by modeling the process as sequential denoising steps. However, most approaches assume independent and identically distributed (i.i.d.) Gaussian noise, while real-world sensors often exhibit spatially correlated noise due to readout mechanisms, limiting their practical effectiveness. We introduce Correlation Aware Restoration with Diffusion (CARD), a training-free extension of DDRM that explicitly handles correlated Gaussian noise. CARD first whitens the noisy observation, which converts the noise into an i.i.d. form. Then, the diffusion restoration steps are replaced with noise-whitened updates, which inherits DDRM's closed-form sampling efficiency while now being able to handle correlated noise. To emphasize the importance of addressing correlated noise, we contribute CIN-D, a novel correlated noise dataset captured across diverse illumination conditions to evaluate restoration methods on real rolling-shutter sensor noise. This dataset fills a critical gap in the literature for experimental evaluation with real-world correlated noise. Experiments on standard benchmarks with synthetic correlated noise and on CIN-D demonstrate that CARD consistently outperforms existing methods across denoising, deblurring, and super-resolution tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36313", "url": null, "sourceid": 45558, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36316, "uid": "fe11fd4030d827a13e7e5593851e0040", "name": "AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation", "authors": [{"id": 184757, "fullname": "Sharath Girish", "url": "http://cvpr.thecvf.com/api/miniconf/users/184757?format=json", "institution": null}, {"id": 184758, "fullname": "Viacheslav Ivanov", "url": "http://cvpr.thecvf.com/api/miniconf/users/184758?format=json", "institution": "Snap Inc."}, {"id": 159921, "fullname": "Tsai-Shien Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/159921?format=json", "institution": "UC Merced"}, {"id": 184759, "fullname": "Hao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184759?format=json", "institution": "Snap Inc."}, {"id": 184760, "fullname": "Aliaksandr Siarohin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184760?format=json", "institution": "Snap Inc.; Snap Inc."}, {"id": 85389, "fullname": "Sergey Tulyakov", "url": "http://cvpr.thecvf.com/api/miniconf/users/85389?format=json", "institution": "Snap Inc."}], "abstract": "Recent advances in subject-driven video generation with large diffusion models have enabled personalized content synthesis conditioned on user-provided subjects.However, existing methods lack fine-grained temporal control over subject appearance and disappearance, which are essential for applications such as compositional video synthesis, storyboarding, and controllable animation.We propose AlcheMinT, a unified framework that introduces explicit timestamps conditioning for subject-driven video generation.Our approach introduces a novel positional encoding mechanism that unlocks the encoding of temporal intervals, associated in our case with subject identities, while seamlessly integrating with the pretrained video generation model positional embeddings.Additionally, we incorporate subject-descriptive text tokens to strengthen binding between visual identity and video captions, mitigating ambiguity during generation.Through token-wise concatenation, AlcheMinT avoids any additional cross-attention modules and incurs negligible parameter overhead.We establish a benchmark evaluating multiple subject identity preservation, video fidelity, and temporal adherence.Experimental results demonstrate that AlcheMinT achieves visual quality matching state-of-the-art video personalization methods, while, for the first time, enabling precise temporal control over multi-subject generation within videos.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36316", "url": null, "sourceid": 45726, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36317, "uid": "dbaebce9c842f6aa7482517597c75c8c", "name": "Frequency Switching Mechanism for Parameter-Efficient Multi-Task Learning", "authors": [{"id": 184761, "fullname": "Shih-Wen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184761?format=json", "institution": "National Cheng Kung University"}, {"id": 184762, "fullname": "Yen-Chang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184762?format=json", "institution": "National Cheng Kung University"}, {"id": 183028, "fullname": "Wei-Ta Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183028?format=json", "institution": "National Cheng Kung University"}, {"id": 140039, "fullname": "Fu-En Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/140039?format=json", "institution": "NVIDIA"}, {"id": 89818, "fullname": "Yu-Chiang Frank Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89818?format=json", "institution": "NVIDIA"}], "abstract": "Multi-task learning (MTL) aims to equip a single model with the ability to solve multiple tasks efficiently; however, current parameter-efficient fine-tuning (PEFT) methods remain largely limited to single-task adaptation. We introduce Free Sinewich, a parameter-efficient multi-task learning framework that achieves efficient weight reuse through frequency switching. A lightweight Clock Net first determines task-dependent frequency with negligible overhead (Free). These frequencies modulate a Sine-AWB (Sinewich) layer, where low-rank factors and convolutional priors are combined into a single kernel and transformed via an elementwise sinusoidal transformation to produce task-specialized weights. Theoretically, sine modulation enhances the rank of low-rank adapters, while frequency separation decorrelates the weights of different tasks. On dense prediction benchmarks, Free Sinewich achieves state-of-the-art performance-efficiency trade-offs (e.g., up to +5.39% improvement over single-task fine-tuning with only 6.53M trainable parameters), offering a compact and scalable paradigm based on frequency-based parameter sharing. Our code is publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36317", "url": null, "sourceid": 31738, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36322, "uid": "2fba9547f91cd8d8bf0601fb2cb61dff", "name": "COPYLENS: Towards Copyrighted Characters Infringement Detection via Copyright-Aware Prompt Learning", "authors": [{"id": 184771, "fullname": "Yaoyu Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184771?format=json", "institution": "Northeastern University"}, {"id": 184772, "fullname": "Xiaochun Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184772?format=json", "institution": "Northeastern University"}, {"id": 87508, "fullname": "Hong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87508?format=json", "institution": "National Institute of Informatics"}, {"id": 184773, "fullname": "Leixia Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184773?format=json", "institution": "Northeastern University"}, {"id": 184774, "fullname": "Jian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184774?format=json", "institution": null}, {"id": 184775, "fullname": "Rui Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/184775?format=json", "institution": "Northeastern University"}, {"id": 184776, "fullname": "Bin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184776?format=json", "institution": "Northeastern University"}], "abstract": "Recent advances in text-to-image (T2I) generation can produce highly resembling images of copyrighted characters, often indistinguishable from official depictions, raising serious concerns about intellectual property infringement. Consequently, robust detection of copyright character infringement is urgently needed. Yet, existing methods exhibit limited alignment with human judgments regarding the likelihood of infringement. To bridge this gap, we propose \\textsc{CopyLens}, a novel prompt optimization framework that automatically refines textual prompts for vision-language model-based detectors to better match human infringement judgments.  Our approach establishes a closed-loop refinement process between a large vision-language model (LVLM) and a large language model (LLM): the LVLM assesses generated images for copyright detection, while the LLM iteratively optimizes detection prompts via meta-prompting, guided by feedback signals derived from human annotation consistency. To facilitate the assessment of prompt-human alignment, we introduce \\textsc{CopyChars}, a new large-scale dataset of over 7,000 AI-generated images spanning more than 100 popular copyrighted characters, along with detailed human annotations on potential infringement. Extensive experiments on \\textsc{CopyChars} show that the proposed \\textsc{CopyLens} can improve detection performance by 5\\% to 10\\% compared to recent state-of-the-art methods. This work offers a scalable and automated solution for visual copyright protection and highlights the critical role of prompt engineering.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36322", "url": null, "sourceid": 37732, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36323, "uid": "009aa24babc6c78308c9c3513061331f", "name": "Reclaiming Lost Text Layers for Source-Free Cross-Domain Few-Shot Learning", "authors": [{"id": 163084, "fullname": "ZHENYU ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/163084?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 181535, "fullname": "Guangyao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/181535?format=json", "institution": "Cornell University"}, {"id": 131641, "fullname": "Yixiong Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/131641?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 131642, "fullname": "Yuhua Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/131642?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 86547, "fullname": "Ruixuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86547?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL) focuses on fine-tuning with limited training data from target domains (e.g., medical or satellite images), where CLIP has recently shown promising results due to its generalizability to downstream tasks. Current works indicate CLIP's text encoder is more suitable for cross-domain tasks, however, we find that \\textbf{removing certain middle layers of the text encoder can effectively improve performance in SF-CDFSL}, which we call the Lost Layers. In this paper, we delve into this phenomenon for a deeper understanding. We discover that instead of being harmful for the SF-CDFSL task, the information in these layers is actually beneficial, but visual gaps prevent this useful information from being fully utilized, making these layers seem redundant.Based on this understanding, unlike current works that simply remove these layers, we propose a method to teachs the model to \\textbf{re-utilize} information in these lost layersat both the layer and encoder levels, guiding the re-learning of the visual branch under domain shifts. Our approach effectively addresses the issue of underutilized information in the text encoder. Extensive experiments across various settings, backbones (CLIP, SigLip, PE-Core), and tasks (4 CDFSL datasets and 10 Meta-dataset datasets) demonstrate the effectiveness of our method. We will release the code.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36323", "url": null, "sourceid": 45335, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36324, "uid": "a7308ceb6bfe23ec59af4c75cd8885ce", "name": "Reevaluating the Intra-modal Misalignment Hypothesis in CLIP", "authors": [{"id": 103286, "fullname": "Jonas Herzog", "url": "http://cvpr.thecvf.com/api/miniconf/users/103286?format=json", "institution": "Zhejiang University"}, {"id": 184777, "fullname": "Yue Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184777?format=json", "institution": "Zhejiang University"}], "abstract": "Recent research has indicated that the embeddings generated by contrastive language-image training like CLIP may not be ideal for image-only tasks. The main theory is that the inter-modal (language-image) alignment loss ignores intra-modal (image-image) alignment, leading to poorly calibrated similarities between images.In this study, we question this intra-modal misalignment hypothesis.We reexamine the theoretical arguments and techniques that seek to demonstrate the misalignment.Our findings reveal that neither the distribution of cosine similarities nor few-shot or retrieval metrics serve as reliable indicators of misalignment.In fact, these metrics yield similar results for language-image trained models (CLIP, SigLIP) and image-image trained models (DINO, SigLIP2), which indicates there is no intra-modal misalignment stemming from contrastive language-image training.We argue the observed phenomena can be explained without assuming a fundamental flaw in the image embedding space.Experiments on the commonly studied intra-modal tasks retrieval and few-shot classification confirm that addressing supposed misalignment is unnecessary for achieving strong performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36324", "url": null, "sourceid": 45039, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36331, "uid": "50328aa4473c97fbb908c38276efb703", "name": "MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation", "authors": [{"id": 151828, "fullname": "Taha Koleilat", "url": "http://cvpr.thecvf.com/api/miniconf/users/151828?format=json", "institution": "Concordia University"}, {"id": 156537, "fullname": "Hojat Asgariandehkordi", "url": "http://cvpr.thecvf.com/api/miniconf/users/156537?format=json", "institution": "Concordia University"}, {"id": 175762, "fullname": "Omid Nejatimanzari", "url": "http://cvpr.thecvf.com/api/miniconf/users/175762?format=json", "institution": "Concordia University"}, {"id": 184786, "fullname": "Berardino Barile", "url": "http://cvpr.thecvf.com/api/miniconf/users/184786?format=json", "institution": "Concordia University"}, {"id": 156539, "fullname": "Yiming Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/156539?format=json", "institution": "Concordia University"}, {"id": 156538, "fullname": "Hassan Rivaz", "url": "http://cvpr.thecvf.com/api/miniconf/users/156538?format=json", "institution": "Concordia University"}], "abstract": "Medical image segmentation remains challenging due to limited annotations for training, ambiguous anatomical features, and domain shifts. While vision-language models such as CLIP offer strong cross-modal representations, their potential for dense, text-guided medical image segmentation remains underexplored. We present MedCLIPSeg, a novel framework that adapts CLIP for robust, data-efficient, and uncertainty-aware medical image segmentation. Our approach leverages patch-level CLIP embeddings through probabilistic cross-modal attention, enabling bidirectional interaction between image and text tokens and explicit modeling of predictive uncertainty. Together with a soft patch-level contrastive loss that encourages more nuanced semantic learning across diverse textual prompts, MedCLIPSeg effectively improves data efficiency and domain generalizability. Extensive experiments across 16 datasets spanning five imaging modalities and six organs demonstrate that MedCLIPSeg outperforms prior methods in accuracy, efficiency, and robustness, while providing interpretable uncertainty maps that highlight local reliability of segmentation results. This work demonstrates the potential of probabilistic vision-language modeling for text-driven medical image segmentation. Code and text prompts will be made publicly available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36331", "url": null, "sourceid": 42911, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36332, "uid": "a3acb1a52dcaf379f44c93dc844acd62", "name": "GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving", "authors": [{"id": 158029, "fullname": "Lin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/158029?format=json", "institution": "Beijing Jiaotong University"}, {"id": 158028, "fullname": "Caiyan Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/158028?format=json", "institution": "Beijing Jiaotong University"}, {"id": 184787, "fullname": "Guanyi Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184787?format=json", "institution": "QCraft Inc."}, {"id": 151958, "fullname": "Ziying Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/151958?format=json", "institution": "Beijing Jiaotong University"}, {"id": 184788, "fullname": "Junqiao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184788?format=json", "institution": "QCraft Inc."}, {"id": 184789, "fullname": "Feiyang Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/184789?format=json", "institution": "Beijing Jiaotong University"}, {"id": 184790, "fullname": "Peiliang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184790?format=json", "institution": "Yanshan University"}, {"id": 156448, "fullname": "Xiaoshuai Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/156448?format=json", "institution": "Beijing Academy of Artificial Intelligence(BAAl)"}, {"id": 158034, "fullname": "Yadan Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/158034?format=json", "institution": "The University of Queensland"}], "abstract": "Driving planning is a critical component of end-to-end (E2E) autonomous driving. However, prevailing Imitative E2E Planners often suffer from multimodal trajectory mode collapse, failing to produce diverse trajectory proposals. Meanwhile, Generative E2E Planners struggle to incorporate crucial safety and physical constraints directly into the generative process, necessitating an additional optimization stage to refine their outputs. In this paper, we propose GuideFlow, a novel planning framework that leverages Constrained Flow Matching. Concretely, GuideFlow explicitly models the flow matching process, which inherently mitigates mode collapse and allows for flexible guidance from various conditioning signals. Our core contribution lies in directly enforcing explicit constraints within the flow matching generation process, rather than relying on implicit constraint encoding. Crucially, GuideFlow unifies the training of the flow matching with the Energy-Based Model (EBM) to enhance the model's autonomous optimization capability to robustly satisfy physical constraints. Secondly, GuideFlow parameterizes driving aggressiveness as a control signal during generation, enabling precise manipulation of trajectory style. Extensive evaluations on major driving benchmarks (Bench2Drive, NuScenes, NavSim and ADV-NuScenes) validate the effectiveness of GuideFlow. Notably, on the NavSim test hard split (Navhard), GuideFlow achieved SOTA with an EPDMS score of 43.0. The code will be released.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36332", "url": null, "sourceid": 32283, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36334, "uid": "bf86ac1ef70c8d7151cf993de222d66f", "name": "Alert-CLIP: Abnormality-aware Latent-Enhanced Representation Tuning of CLIP for Video Anomaly Detection", "authors": [{"id": 181848, "fullname": "Yiyan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181848?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 126475, "fullname": "Menghao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126475?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 153469, "fullname": "Haifeng Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/153469?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 126479, "fullname": "Pengfei Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/126479?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 184799, "fullname": "Xianao Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184799?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 184800, "fullname": "Chenye Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184800?format=json", "institution": "Beijing University of Posts and Telecommunications; Queen Mary, University of London"}, {"id": 184801, "fullname": "Hong Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184801?format=json", "institution": "Communication University of China"}, {"id": 184802, "fullname": "Jinghan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184802?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 126480, "fullname": "Qi Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/126480?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 126469, "fullname": "Jingyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126469?format=json", "institution": "Beijing University of Post and Telecommunication, Tsinghua University"}], "abstract": "With the rise of pre-trained vision\u2013language models such as CLIP, performing video anomaly detection (VAD) through cross-modal reasoning has become an emerging trend. However, we observe that CLIP still suffers from weak abnormality awareness: normal and abnormal descriptions are highly entangled in the text embedding space, causing video features to assign nearly indistinguishable similarity scores to both types of prompts. To address this issue, we propose \\textbf{Alert-CLIP}, an abnormality-aware latent\u2013enhanced tuning framework that tailors CLIP for VAD. Alert-CLIP introduces a multi-level alignment strategy: (1) \\textit{video\u2013label alignment}, which reshapes the semantic space to establish a coarse-level foundation for abnormality awareness; (2) \\textit{region\u2013text alignment}, which explicitly associates anomaly-related regions with their detailed descriptions to strengthen fine-grained perception; and (3) \\textit{region\u2013semantic alignment}, which further contrasts anomalous regions against multiple hard negative samples, enhancing abnormality-aware discrimination.Extensive experiments on four benchmarks demonstrate that \\textbf{Alert-CLIP} consistently surpasses vanilla CLIP across supervised, zero-shot, and open-vocabulary settings, providing a solid foundation for future CLIP-based VAD research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36334", "url": null, "sourceid": 43539, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36338, "uid": "8346a1ed91b8f4eb1a4a26f59b2f751f", "name": "An Efficient Token Compression Framework for Visual Object Tracking", "authors": [{"id": 184811, "fullname": "Weijing Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184811?format=json", "institution": "Guangxi Normal University"}, {"id": 155925, "fullname": "Qihua Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155925?format=json", "institution": "Guangxi Normal University"}, {"id": 128975, "fullname": "Bineng Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/128975?format=json", "institution": "Guangxi Normal University"}, {"id": 184385, "fullname": "Haiying Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/184385?format=json", "institution": "Guangxi Normal University"}, {"id": 129690, "fullname": "Zhiyi Mo", "url": "http://cvpr.thecvf.com/api/miniconf/users/129690?format=json", "institution": "Wuzhou university"}, {"id": 129683, "fullname": "Shuxiang Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/129683?format=json", "institution": "Guangxi Normal University"}], "abstract": "Refining visual representations by eliminating their internal feature-level redundancy is crucial for simultaneously optimizing the performance and computational cost of models in visual tracking. To enhance their performance, many contemporary Transformer-based trackers leverage a larger number of historical template frames to capture richer spatio-temporal cues. However, this strategy leads to a massive number of input visual tokens. This creates two critical issues: it imposes a quadratic computational burden and can also degrade the tracker's overall performance. To bridge this gap, we propose a compress-then-interact tracking framework, ETCTrack, that learns to efficiently compress template tokens from historical template frames into a robust target representation, moving beyond handcrafted rules. Our method first employs the Adaptive Token Compressor to dynamically construct compact yet highly discriminative template tokens by filtering out redundant visual tokens. These refined tokens are then processed by our Hierarchical Interaction Encoder to achieve a deep, adaptive interaction with the search features. This fusion is performed through a cascade of collaborative stages, where each stage executes a structured process of template enrichment via search context, unified feature learning, and search feature refinement to ensure precise target localization. Experiments on seven benchmarks demonstrate that our method significantly outperforms current state-of-the-art trackers.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36338", "url": null, "sourceid": 36173, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36342, "uid": "7457ec459cfecce749022e0f48a89bfa", "name": "Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding", "authors": [{"id": 184820, "fullname": "Keliang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184820?format=json", "institution": "Fudan University"}, {"id": 184821, "fullname": "Zizhi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184821?format=json", "institution": "Fudan University"}, {"id": 91010, "fullname": "Mingcheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/91010?format=json", "institution": "Fudan University"}, {"id": 126844, "fullname": "Jingqun Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126844?format=json", "institution": "Bytedance"}, {"id": 91003, "fullname": "Dingkang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91003?format=json", "institution": "Fudan University"}, {"id": 91007, "fullname": "Lihua Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91007?format=json", "institution": "Fudan University"}], "abstract": "Document understanding is a long-standing practical task. Vision-Language Models (VLMs) have gradually become a primary approach in this domain, demonstrating effective performance on single-page tasks. However, their effectiveness diminishes when handling long documents. In such scenarios, clues are often scattered across multiple pages and modalities, and redundancy from lengthy inputs can impair the model's judgment. While retrieval-augmented generation mitigates this issue by filtering for question-relevant content, the retrieved results still contain substantial redundancy. To address these limitations, we propose SLEUTH, a multi-agent framework. Concretely, SLEUTH orchestrates a retriever and four collaborative agents in a coarse-to-fine process. The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy. It ultimately synthesizes a distilled, evidence-dense multimodal context to generate the final prediction. The SLEUTH framework is model-agnostic and scalable. When paired with advanced VLM backbones, it consistently improves performance on multiple long-document benchmarks, achieving SOTA results. Ablation studies verify each module\u2019s effectiveness and confirm the benefits of our hierarchical refinement paradigm.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36342", "url": null, "sourceid": 41253, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36345, "uid": "42e63e1b4d6aa921fd566bdef26e5ef4", "name": "End-to-End Hyper-Relational Information Extraction for Engineering Diagrams via Dynamically Tokenized Relation Transformer", "authors": [{"id": 183251, "fullname": "Tianyou Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/183251?format=json", "institution": "\u4e2d\u56fd\u79d1\u5b66\u9662\u81ea\u52a8\u5316\u7814\u7a76\u6240"}, {"id": 184827, "fullname": "Yan-Ming Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184827?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 184828, "fullname": "Zixiang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184828?format=json", "institution": "ShanghaiTech University"}, {"id": 184829, "fullname": "Jibin Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184829?format=json", "institution": "Dalian Institute of Chemical Physics, Chinese Academy of Sciences"}, {"id": 152669, "fullname": "Fei Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/152669?format=json", "institution": ", Institute of automation, Chinese academy of science"}, {"id": 86336, "fullname": "Cheng-Lin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86336?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}], "abstract": "Engineering diagrams are the core carriers of technical information in industrial contexts, where the pressing demand for their digitization from industrial sectors has driven great advancements in related research domains. However, existing research still suffers from three limitations. Firstly, the detection of symbols, lines, and texts typically involves multiple independent models, resulting in cumbersome workflows. In addition, high-resolution diagrams often impose an excessive computational cost on existing models. Moreover, parsing frameworks solely based on object detection can merely localize component positions, yet fail to capture the topological connection semantics and structured knowledge among components, thus offering limited convenience for industrial applications. To address these issues, we propose an end-to-end information extraction framework based on the Dynamically Tokenized Relation Transformer (DTRT), which can dynamically reduce received image tokens, filter redundant information, and efficiently extract structural knowledge to construct hyper-relational knowledge graphs. We practiced our model on piping and instrumentation diagrams (P&IDs) and electrical diagrams (EDs): the former are widely used in chemical engineering enterprises, while the latter are employed to describe circuit systems. DTRT achieves an R@1000 accuracy of 94.84% on PIDs and R@200 accuracy of 92.52% on EDs with a significantly reduced computational cost.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36345", "url": null, "sourceid": 45057, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36346, "uid": "94d5f5ee8c271ca277bd8f34b62fcdf9", "name": "NS-Diff: Fluid Navier\u2013Stokes Guided Video Diffusion via Reinforcement Learning", "authors": [{"id": 182101, "fullname": "Zijun Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/182101?format=json", "institution": "Peking University"}, {"id": 87023, "fullname": "Yuxin Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87023?format=json", "institution": "Peking University"}], "abstract": "While recent video generation models achieve impressive visual quality, generating physically plausible videos remains challenging, especially for fluid dynamics and rigid-body motions. To address this, we present **NS-Diff**, a physics-guided reinforcement learning framework for video diffusion. First, we design a noise-robust physical dynamics detector that distinguishes rigid and fluid regions by analyzing motion in noisy latent frames. Second, we introduce a Physics-Conditioned Latent Injection module, which encodes velocity fields, deformation gradients, and material masks, and injects them into the DiT denoiser via cross-attention.  Third, we introduce a reinforcement learning optimization module that enforces simplified Navier-Stokes constraints on fluid dynamics and minimum-jerk principles on rigid bodies through policy gradients. Experiments on PhysVideoBench, UCF, and MSR-VTT show that our approach reduces jerk errors by 43\\%, decreases fluid divergence by 33\\%, and improves FVD by 22.7\\%, achieving higher physical plausibility and visual quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36346", "url": null, "sourceid": 38453, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36353, "uid": "8e576e753c2932a38d6fb13a6bf5b573", "name": "Online Data Curation for Object Detection via Marginal Contributions to Dataset-level Average Precision", "authors": [{"id": 180121, "fullname": "Zitang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/180121?format=json", "institution": "Sony Group Corporation"}, {"id": 89595, "fullname": "Masakazu Yoshimura", "url": "http://cvpr.thecvf.com/api/miniconf/users/89595?format=json", "institution": "Sony Group Corporation"}, {"id": 89589, "fullname": "Junji Otsuka", "url": "http://cvpr.thecvf.com/api/miniconf/users/89589?format=json", "institution": "Sony Group Corporation"}, {"id": 89580, "fullname": "Atsushi Irie", "url": "http://cvpr.thecvf.com/api/miniconf/users/89580?format=json", "institution": "Sony Group Corporation"}, {"id": 89606, "fullname": "Takeshi Ohashi", "url": "http://cvpr.thecvf.com/api/miniconf/users/89606?format=json", "institution": "Sony Group Corporation"}], "abstract": "High-quality data has become a primary driver of progress under scale laws, with curated datasets often outperforming much larger unfiltered ones at lower cost. Online data curation extends this idea by dynamically selecting training samples based on the model\u2019s evolving state. While effective in classification and multimodal learning, existing online sampling strategies rarely extend to object detection because of its structural complexity and domain gaps. We introduce DetGain, an online data curation method specifically for object detection that estimates the marginal perturbation of each image to dataset-level Average Precision (AP) based on its prediction quality. By modeling global score distributions, DetGain efficiently estimates the global AP change and computes teacher-student contribution gaps to select informative samples at each iteration. The method is architecture-agnostic and minimally intrusive, enabling straightforward integration into diverse object detection architectures. Experiments on the COCO dataset with multiple representative detectors show consistent improvements in accuracy. DetGain also demonstrates strong robustness under low-quality data and can be effectively combined with knowledge distillation techniques to further enhance performance, highlighting its potential as a general and complementary strategy for data-efficient object detection.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36353", "url": null, "sourceid": 34563, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36355, "uid": "4d815e0dd3244bf08f3acc60d9498027", "name": "CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration", "authors": [{"id": 70008, "fullname": "Xiefan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/70008?format=json", "institution": "Beihang University"}, {"id": 159457, "fullname": "Xinzhu Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/159457?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 71241, "fullname": "Haiyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/71241?format=json", "institution": "BUAA"}, {"id": 87605, "fullname": "Di Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87605?format=json", "institution": "Beihang University"}], "abstract": "Recent advancements in text-to-image synthesis have been largely propelled by diffusion-based models, yet achieving precise alignment between text prompts and generated images remains a persistent challenge. We find that this difficulty arises primarily from the limitations of conventional diffusion loss, which provides only implicit supervision for modeling fine-grained text-image correspondence. In this paper, we introduce Cross-Timestep Self-Calibration (CTCal), founded on the supporting observation that establishing accurate text-image alignment within diffusion models becomes progressively more difficult as the timestep increases. CTCal leverages the reliable text-image alignment (i.e., cross-attention maps) formed at smaller timesteps with less noise to calibrate the representation learning at larger timesteps with more noise, thereby providing explicit supervision during training. We further propose a timestep-aware adaptive weighting to achieve a harmonious integration of CTCal and diffusion loss. CTCal is model-agnostic and can be seamlessly integrated into existing text-to-image diffusion models, encompassing both diffusion-based (e.g., Stable Diffusion 2.1) and flow-based approaches (e.g., Stable Diffusion 3). Extensive experiments on T2I-Compbench++ and GenEval benchmarks demonstrate the effectiveness and generalizability of the proposed CTCal.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36355", "url": null, "sourceid": 38360, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36358, "uid": "9b784280afc36c74b271d8af5ec9e534", "name": "From Pixel to Precision: Enhancing Handwritten Mathematical Expression Recognition with Image-Level Reward", "authors": [{"id": 173570, "fullname": "Ze Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/173570?format=json", "institution": "University of Science and Technology of China"}, {"id": 184868, "fullname": "Kai Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184868?format=json", "institution": "University of Science and Technology of China"}, {"id": 184869, "fullname": "Xianquan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184869?format=json", "institution": "University of Science and Technology of China"}, {"id": 184870, "fullname": "Shuochen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184870?format=json", "institution": "University of Science and Technology of China"}, {"id": 184871, "fullname": "Jiaxian Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184871?format=json", "institution": "University of Science and Technology of China"}, {"id": 183312, "fullname": "Yupeng Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/183312?format=json", "institution": "University of Science and Technology of China"}, {"id": 88604, "fullname": "Qi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88604?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Handwritten mathematical expression recognition is hindered by a fundamental misalignment between the dual representations of LaTeX formulas: the symbolic text and the rendered visual image. This discrepancy means that textually distinct LaTeX sequences can produce visually identical outputs, while minor textual errors can cause catastrophic rendering failures. As a result, text-level reward mechanisms cannot perfectly assess the quality of model predictions, failing to effectively guide the model towards optimal performance during training. To overcome this limitation, we introduce the Image Matching Score (IMS), a lightweight yet effective reward based on the structural edit distance of column-wise image projections, which robustly quantifies the visual fidelity between rendered formulas. Leveraging IMS, we then propose Image-Matching driven Policy Optimization (IMPO), a training framework built upon Group Relative Policy Optimization (GRPO). This approach facilitates stable policy learning directly from our sequence-level visual reward, notably without the need for a separate value function network. Extensive experiments demonstrate that IMPO yields consistent performance gains across various backbone models on the challenging CROHME, HME100K, and M$^2$E datasets. Our model-agnostic framework establishes new state-of-the-art results, improving the Expression Recognition Rate by an average of 1.1% and up to 1.37% over strong prior methods. The code can be found in the supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36358", "url": null, "sourceid": 34805, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36362, "uid": "cd4aa78eef2c2a097afdf7efae7bc1d9", "name": "VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation", "authors": [{"id": 91375, "fullname": "Walid Bousselham", "url": "http://cvpr.thecvf.com/api/miniconf/users/91375?format=json", "institution": "Johann Wolfgang Goethe Universit\u00e4t Frankfurt am Main"}, {"id": 150894, "fullname": "Hilde Kuehne", "url": "http://cvpr.thecvf.com/api/miniconf/users/150894?format=json", "institution": "Eberhard-Karls-Universit\u00e4t T\u00fcbingen"}, {"id": 69185, "fullname": "Cordelia Schmid", "url": "http://cvpr.thecvf.com/api/miniconf/users/69185?format=json", "institution": "Inria / Google"}], "abstract": "Training vision-language models (VLMs) for complex reasoning remains a challenging task, i.a. due to the scarcity of high-quality image-text reasoning data. Conversely, text-based reasoning resources are abundant and scalable, but it is still an open question how to leveraging them for VLM reasoning. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. To this end, VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student reasoning traces to be guided by the teacher model, resulting in a significant gain over using GRPO alone.We further show that a cold-start alignment is essential for an effective transfer during the online training phase in this scenario and that without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance.We evaluate VOLD across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista, showing that VOLD outperforms the baseline model significantly and improves over the state of the art by a margin.Our ablation shows the importance of a cold-start alignment via SFT for on-policy distillation with a text-only teacher", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36362", "url": null, "sourceid": 44925, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36361, "uid": "e1f82bcef895e584aa522328ae983f08", "name": "Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding", "authors": [{"id": 161151, "fullname": "Weikai Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/161151?format=json", "institution": "University of Washington"}, {"id": 90298, "fullname": "Jieyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90298?format=json", "institution": "Department of Computer Science, University of Washington"}, {"id": 184878, "fullname": "Taoyang jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/184878?format=json", "institution": "Department of Computer Science, University of Washington"}, {"id": 152947, "fullname": "Chenhao Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/152947?format=json", "institution": "Department of Computer Science, University of Washington"}, {"id": 184879, "fullname": "Ziqi Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184879?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 85306, "fullname": "Jae Sung Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/85306?format=json", "institution": "University of Washington"}, {"id": 84558, "fullname": "Ranjay Krishna", "url": "http://cvpr.thecvf.com/api/miniconf/users/84558?format=json", "institution": "University of Washington"}], "abstract": "Visual grouping\u2014operationalized through tasks such as instance segmentation, visual grounding, and object detection\u2014enables applications ranging from robotic perception to photo editing. These fundamental problems in computer vision are powered by large-scale, painstakingly annotated datasets. Despite their impact, these datasets are costly to build, biased in coverage, and difficult to scale. Synthetic datasets offer a promising alternative but struggle with flexibility, accuracy, and compositional diversity. We introduce **Synthetic Object Compositions (SOC)**, an accurate and scalable data synthesis pipeline via a novel object-centric composition strategy. It composes high-quality synthetic object segments into new images using 3D geometric layout augmentation and camera configuration augmentation with generative harmonization and mask-area-weighted blending, yielding accurate and diverse masks, boxes, and referring expressions. Models trained on just **100K** of our synthetic images outperform those trained on larger real datasets (GRIT 20M, V3Det 200K) and synthetic pipelines (Copy-Paste, X-Paste, SynGround, SegGen) by **+24\u201336%**\u2014achieving **+10.9 AP** on LVIS and **+8.4 $N_{\\text{Acc}}$** on gRefCOCO. Beyond the general open-vocabulary setup, SOC also enables controllable dataset construction for different use cases and boosts performance in both low-data and closed-vocabulary scenarios. Augmenting LVIS and COCO with synthetic object segments delivers strong performance across different real-data scales and yields even greater improvements under extremely limited real-data conditions, including **+6.59 AP** on a **1% COCO** data setup. Furthermore, this controllability enables targeted data generation for **intra-class referring**, a diagnostic grounding task we propose that requires fine-grained attribute discrimination.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36361", "url": null, "sourceid": 44527, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36365, "uid": "89b2d96d7e05597d4f5cb5a8a0953c9d", "name": "WildPose: A Unified Framework for Robust Pose Estimation in the Wild", "authors": [{"id": 151552, "fullname": "Jianhao Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/151552?format=json", "institution": "Stanford University"}, {"id": 107128, "fullname": "Liyuan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107128?format=json", "institution": "Stanford University"}, {"id": 126325, "fullname": "Zihan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126325?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 106415, "fullname": "Iro Armeni", "url": "http://cvpr.thecvf.com/api/miniconf/users/106415?format=json", "institution": "Stanford University"}], "abstract": "Estimating camera pose in dynamic environments is a critical challenge, as most visual SLAM and SfM methods assume inputs from static environments. While recent dynamic-aware methods exist, they are often not unified: semantic-based approaches are brittle, per-sequence optimization methods fail on short sequences, and other learned models sometimes perform badly on static-only scenes. We present Wildpose, a unified monocular pose estimation framework that is robust in dynamic environments while maintaining state-of-the-art performance on static and low-ego-motion datasets. Our key insight is to connect the two powerful paradigms in modern 3D vision: the rich perceptual frontend of feed-forward models and the end-to-end optimization of differentiable bundle adjustment (BA). We achieve this by enhancing the differentiable BA pipeline in two ways. First, we introduce a new 3D-aware update operator by integrating a frozen, pre-trained MASt3R feature backbone and training the operator's subsequent layers on a diverse curriculum of static and dynamic data. Second, we propose a high-capacity motion mask detector that leverages rich, multi-level 3D-aware features from the same frozen backbone. Extensive experiments show Wildpose consistently outperforms prior methods across a wide variety of benchmarks, including dynamic (Wild-SLAM, Bonn), static (TUM, 7-Scenes), and low-ego-motion (Sintel) datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36365", "url": null, "sourceid": 43384, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36377, "uid": "bca050a896b3070dba891b0a4404aacd", "name": "Beyond Strict Pairing: Arbitrarily Paired Training for High-Performance Infrared and Visible Image Fusion", "authors": [{"id": 184908, "fullname": "Yanglin Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184908?format=json", "institution": "Jiangnan University"}, {"id": 75854, "fullname": "Tianyang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75854?format=json", "institution": "Jiangnan University"}, {"id": 151477, "fullname": "Chunyang Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/151477?format=json", "institution": "Jiangnan University"}, {"id": 157157, "fullname": "Hui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/157157?format=json", "institution": "School of Artificial Intelligence and Computer Science"}, {"id": 129533, "fullname": "Xiaojun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129533?format=json", "institution": "Jiangnan University"}, {"id": 154654, "fullname": "Josef Kittler", "url": "http://cvpr.thecvf.com/api/miniconf/users/154654?format=json", "institution": "University of Surrey"}], "abstract": "Infrared and visible image fusion (IVIF) aims to synthesise complementary information from the two source modalities while preserving natural textures and salient thermal signatures simultaneously. Existing solutions predominantly rely on extensive sets of rigidly aligned image pairs for training. However, acquiring such data is often impractical due to the costly and labour-intensive alignment process. Besides, maintaining a rigid pairing setting during training restricts the volume of cross-modal relationships, thereby limiting the generalisation performance. To this end, this work challenges the necessity of Strictly Paired Training Paradigm (SPTP) by systematically investigating UnPaired and Arbitrarily Paired Training Paradigms (UPTP and APTP) for high-performance IVIF. We establish a theoretical objective of APTP, reflecting the complementary nature between UPTP and SPTP. More importantly, we develop a practical framework capable of significantly enriching cross-modal relationships even with severely limited and unaligned training data. To validate our propositions, three end-to-end lightweight baselines, alongside a set of innovative loss functions, are designed to cover three classic frameworks (CNN, Transformer, GAN). Comprehensive experiments demonstrate that the proposed APTP and UPTP are feasible and capable of training models on a severely limited and content-inconsistent infrared and visible dataset, achieving performance comparable to that of a dataset 100$\\times$ larger in SPTP. This finding fundamentally alleviates the cost and difficulty of data collection while enhancing model robustness from the data perspective, delivering a feasible solution for IVIF studies.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36377", "url": null, "sourceid": 41296, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36368, "uid": "a7ad710353b08b3d1e7ee6daca9679cc", "name": "Rank-Guided Pseudo-Bias Learning for Robust Black-Box Adaptation", "authors": [{"id": 184891, "fullname": "Rajeev Dwivedi", "url": "http://cvpr.thecvf.com/api/miniconf/users/184891?format=json", "institution": "Indian Institute of Science Education and Research Bhopal"}, {"id": 184892, "fullname": "Anshuman Dangwal", "url": "http://cvpr.thecvf.com/api/miniconf/users/184892?format=json", "institution": null}, {"id": 184893, "fullname": "Vinod Kurmi", "url": "http://cvpr.thecvf.com/api/miniconf/users/184893?format=json", "institution": "IISER Bhopal"}], "abstract": "Pretrained vision encoders are widely used as frozen, black-box feature extractors, yet they often inherit spurious correlations that disproportionately harm underrepresented groups. We introduce \\textbf{PLD-Debias}, a fully black-box debiasing framework that requires neither access to backbone parameters nor demographic annotations. Our method integrates three components: (1) \\emph{Rank-Regularized Amplification}, a lightweight adapter that exaggerates latent spurious directions; (2) \\emph{Unsupervised Pseudo-Bias Induction}, which clusters amplified features to infer high-fidelity proxy bias labels; and (3) \\emph{Bias-Guided Refinement}, combining supervised contrastive alignment with cluster-aware adaptive margins to purify representations and equalize decision boundaries. We theoretically show that these components jointly tighten a worst-group risk bound under spurious correlations. Empirically, PLD-Debias achieves state-of-the-art worst-group accuracy across CelebA, Waterbirds, and CMNIST, improving performance by 3--5 points over prior black-box methods while maintaining average accuracy. Remarkably, our pseudo-bias labels align with ground-truth bias annotations at over 90\\% fidelity, enabling oracle-level robustness without demographic supervision. Our results demonstrate that fairness and utility can be achieved through a plug-and-play classifier adapter for any frozen foundation model.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36368", "url": null, "sourceid": 41501, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36370, "uid": "53e9395cbdea2c88641cd83a2e364bb9", "name": "ForeHOI: Feed-forward 3D Object Reconstruction from Daily Hand-Object Interaction Videos", "authors": [{"id": 181877, "fullname": "Yuantao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/181877?format=json", "institution": "The Chinese University of Hong Kong,Shenzhen"}, {"id": 184898, "fullname": "Jiahao Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184898?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 90336, "fullname": "Chongjie Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/90336?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 184899, "fullname": "Chaoran Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184899?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 184900, "fullname": "Zhaojie Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184900?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 102232, "fullname": "Chenghong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/102232?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 88683, "fullname": "Xiaoguang Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/88683?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}], "abstract": "The ubiquity of monocular videos capturing daily hand-object interactions presents a valuable resource for embodied intelligence. While 3D hand reconstruction from in-the-wild videos has seen significant progress, reconstructing the involved objects remains challenging due to severe occlusions and the complex, coupled motion of the camera, hands, and object. In this paper, we introduce ForeHOI, a novel feed-forward model that directly reconstructs 3D object geometry from monocular hand-object interaction videos within one minute of inference time, eliminating the need for any pre-processing steps. Our key insight is that, the joint prediction of 2D mask inpainting and 3D shape completion in a feed-forward framework can effectively address the problem of severe occlusion in monocular hand-held object videos, thereby achieving results that outperform the performance of  optimization-based methods. The information exchanges between the 2D and 3D shape completion boosts the overall reconstruction quality, enabling the framework to effectively handle severe hand-object occlusion. Furthermore, to support the training of our model, we contribute the first large-scale, high-fidelity synthetic dataset of hand-object interactions with comprehensive annotations. Extensive experiments demonstrate that ForeHOI achieves state-of-the-art performance in object reconstruction, significantly outperforming previous methods with around a 100x speedup.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36370", "url": null, "sourceid": 43010, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36372, "uid": "039043c90e0bec4f2947ad9778dc35c0", "name": "ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology", "authors": [{"id": 154647, "fullname": "Srikumar Sastry", "url": "http://cvpr.thecvf.com/api/miniconf/users/154647?format=json", "institution": "Washington University in St Louis"}, {"id": 107003, "fullname": "Subash Khanal", "url": "http://cvpr.thecvf.com/api/miniconf/users/107003?format=json", "institution": "Washington University in St Louis"}, {"id": 133667, "fullname": "Aayush Dhakal", "url": "http://cvpr.thecvf.com/api/miniconf/users/133667?format=json", "institution": "Washington University in St. Louis"}, {"id": 184903, "fullname": "Jiayu Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184903?format=json", "institution": "Washington University, Saint Louis"}, {"id": 164794, "fullname": "Daniel Cher", "url": "http://cvpr.thecvf.com/api/miniconf/users/164794?format=json", "institution": "Washington University, Saint Louis"}, {"id": 184904, "fullname": "Phoenix Jarosz", "url": "http://cvpr.thecvf.com/api/miniconf/users/184904?format=json", "institution": "Washington University, Saint Louis"}, {"id": 75557, "fullname": "Nathan Jacobs", "url": "http://cvpr.thecvf.com/api/miniconf/users/75557?format=json", "institution": "Washington University in St. Louis"}], "abstract": "We introduce ProM3E, a probabilistic masked multimodal embedding model for any-to-any generation of multimodal representations for ecology. ProM3E is based on masked modality reconstruction in the embedding space, learning to infer missing modalities given a few context modalities. By design, our model supports modality inversion in the embedding space. The probabilistic nature of our model allows us to analyse the feasibility of fusing various modalities for given downstream tasks, essentially learning what to fuse. Using these features of our model, we propose a novel cross-modal retrieval approach that mixes inter-modal and intra-modal similarities to achieve superior performance across all retrieval tasks. We further leverage the hidden representation from our model to perform linear probing tasks and demonstrate the superior representation learning capability of our model. All our code, datasets and model will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36372", "url": null, "sourceid": 44901, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36375, "uid": "99cc8493c2392e2b0f5d61bb0220b474", "name": "Guiding Diffusion Models with Semantically Degraded Conditions", "authors": [{"id": 180521, "fullname": "shilong han", "url": "http://cvpr.thecvf.com/api/miniconf/users/180521?format=json", "institution": "National University of Defense Technology"}, {"id": 184906, "fullname": "Yuming Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184906?format=json", "institution": "National University of Defense Technology"}, {"id": 84848, "fullname": "Hongxia Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84848?format=json", "institution": "National University of Defense Technology"}], "abstract": "Classifier-Free Guidance (CFG) is a cornerstone of modern text-to-image models, yet its reliance on a semantically vacuous null prompt ($\\varnothing$) generates a guidance signal prone to geometric entanglement. This is a key factor limiting its precision, leading to well-documented failures in complex compositional tasks. We propose Condition-Degradation Guidance (CDG), a novel paradigm that replaces the null prompt with a strategically degraded condition, $c_{deg}$ . This reframes guidance from a coarse \"good vs. null\" contrast to a more refined \"good vs. almost good\" discrimination, thereby compelling the model to capture fine-grained semantic distinctions. To synthesize $\\boldsymbol{c}_{\\text{deg}}$ adaptively, our method models the self-attention mechanism as a graph and employs Weighted PageRank to identify and degrade the most semantically salient tokens. Validated on state-of-the-art models like Stable Diffusion 3, CDG markedly improves compositional accuracy and text-image alignment, addressing key failure modes of the baseline. As a lightweight, plug-and-play module, it achieves this with negligible computational overhead. Our work challenges the reliance on static, information-sparse negative samples and establishes a new principle for diffusion guidance: the construction of adaptive, semantically-aware negative samples is critical to achieving precise semantic control.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36375", "url": null, "sourceid": 40104, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36373, "uid": "1900820441f58be72e092e531f54adc1", "name": "DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction", "authors": [{"id": 130633, "fullname": "Yufu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130633?format=json", "institution": "University of Pennsylvania"}, {"id": 131439, "fullname": "Evonne Ng", "url": "http://cvpr.thecvf.com/api/miniconf/users/131439?format=json", "institution": "University of California, Berkeley"}, {"id": 119973, "fullname": "Soyong Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/119973?format=json", "institution": "Carnegie Mellon University"}, {"id": 127197, "fullname": "Rawal Khirodkar", "url": "http://cvpr.thecvf.com/api/miniconf/users/127197?format=json", "institution": "Meta"}, {"id": 96465, "fullname": "Yuan Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/96465?format=json", "institution": "Meta"}, {"id": 133394, "fullname": "Zhaoen Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/133394?format=json", "institution": "Facebook"}, {"id": 89967, "fullname": "Jinhyung Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/89967?format=json", "institution": "Carnegie Mellon University"}, {"id": 88213, "fullname": "Kris Kitani", "url": "http://cvpr.thecvf.com/api/miniconf/users/88213?format=json", "institution": "Carnegie Mellon University"}, {"id": 85833, "fullname": "Alexander Richard", "url": "http://cvpr.thecvf.com/api/miniconf/users/85833?format=json", "institution": "Reality Labs Research, Meta"}, {"id": 159435, "fullname": "Fabian Prada", "url": "http://cvpr.thecvf.com/api/miniconf/users/159435?format=json", "institution": "Meta"}, {"id": 127050, "fullname": "Michael Zollhoefer", "url": "http://cvpr.thecvf.com/api/miniconf/users/127050?format=json", "institution": "Meta"}], "abstract": "We present DuoMo, a generative method that recovers human motion in world-space coordinates from unconstrained videos with noisy or incomplete observations. Reconstructing such motion requires solving a fundamental trade-off: generalizing from diverse and noisy video inputs while maintaining global motion consistency. Our approach addresses this problem by factorizing motion learning into two diffusion models. The camera-space model first estimates motion from videos in camera coordinates. The world-space model then lifts this initial estimate into world coordinates and refines it to be globally consistent. Together, the two models can reconstruct motion across diverse scenes and trajectories, even from highly noisy or incomplete observations. Moreover, our formulation is general, generating the motion of mesh vertices directly (bypassing parametric models). DuoMo achieves state-of-the-art performance. On EMDB, our method obtains a 16% reduction in world-space MPJPE error while maintaining low foot skating. On RICH, it obtains a 30% reduction in world-space MPJPE error.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36373", "url": null, "sourceid": 44556, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36381, "uid": "1fdfb6f89419a6d2e520339bfea2d0bb", "name": "Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy", "authors": [{"id": 70799, "fullname": "Shuo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/70799?format=json", "institution": "Zhejiang University"}, {"id": 155974, "fullname": "Yijin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155974?format=json", "institution": "Avolution AI"}, {"id": 184918, "fullname": "Xi Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184918?format=json", "institution": "Zhejiang University"}, {"id": 84995, "fullname": "Guofeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84995?format=json", "institution": "Zhejiang University"}], "abstract": "The 3D characterization of microstructures is crucial for understanding and designing functional materials. However, the scanning electron microscope (SEM), widely used in scientific research, captures only 2D electron intensity distributions. Existing SEM 3D reconstruction methods struggle with textureless regions, shadowing artifacts, and calibration dependencies, whereas advanced learning-based approaches fail to generalize to microscopic SEM domains due to the lack of physical priors and domain-specific data. To address these challenges, we introduce NFH-SEM, a neural field-based hybrid reconstruction framework that recovers high-fidelity 3D surfaces from multi-view, multi-detector SEM images. NFH-SEM integrates coarse multi-view geometry with photometric stereo cues from detector signals through a continuous neural field, incorporating a learnable forward model that embeds SEM imaging physics for self-calibrated, shadow-robust reconstruction. NFH-SEM achieves precise recovery across diverse specimens, revealing 478 nm layered features in two-photon lithography samples, 782 nm surface textures on pollen grains, and 1.559 \u03bcm fracture steps on silicon carbide particles, demonstrating its accuracy and broad applicability.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36381", "url": null, "sourceid": 46455, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40257?format=json"], "related_events_ids": [40257]}, {"id": 36380, "uid": "ae845eb437371efc105ff790cfe62bb9", "name": "Deep Feature Deformation Weights", "authors": [{"id": 86489, "fullname": "Richard Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86489?format=json", "institution": "University of Chicago"}, {"id": 72635, "fullname": "Itai Lang", "url": "http://cvpr.thecvf.com/api/miniconf/users/72635?format=json", "institution": "University of Chicago &amp; Tel Aviv University"}, {"id": 86488, "fullname": "Rana Hanocka", "url": "http://cvpr.thecvf.com/api/miniconf/users/86488?format=json", "institution": "University of Chicago"}], "abstract": "Handle-based mesh deformation has been a long-standing paradigm in computer graphics, enabling intuitive shape edits from sparse controls. Classic techniques offer precise and rapid deformation control. However, they solve an optimization problem with constraints defined by the choice of control handles, requiring a user to know apriori the ideal distribution of handles on the shape to accomplish the desired edit. The mapping from handle set to deformation behavior is often unintuitive and, importantly, non-semantic. Modern data-driven methods, on the other hand, leverage the data prior to obtain semantic edits, at the cost of fine-grained control and speed. We propose a technique that achieves the best of both worlds by leveraging the semantic prior of data and the precise control and speed of traditional frameworks. Our approach is surprisingly simple yet effective: deep feature proximity makes for smooth and semantic deformation weights, with no need for additional regularization. Importantly, these weights can be computed in real-time for any surface point, whereas all prior methods require optimization of these weights. Moreover, the semantic prior from deep features enables co-deformation of semantic parts. We introduce an improved feature distillation pipeline, barycentric feature distillation, which leverages the full visual signal from shape renders to make the compute cost robust to mesh resolution. This allows deep feature weights to be computed for even high resolution meshes in under a minute, in contrast to potentially hours for both classical and neural methods. We preserve and extend existing functionality of classical methods through feature space constraints and locality weighting.Our field representation allows for automatic detection of semantic symmetries, which we use to produce symmetry-preserving deformations. We show a proof-of-concept application which can produce deformations for meshes up to 1 million faces in real-time on a consumer-grade machine.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36380", "url": "https://threedle.github.io/dfd/", "sourceid": 44936, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40257, "uid": "1fdfb6f89419a6d2e520339bfea2d0bb", "name": "Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy", "authors": [{"id": 70799, "fullname": "Shuo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/70799?format=json", "institution": "Zhejiang University"}, {"id": 155974, "fullname": "Yijin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155974?format=json", "institution": "Avolution AI"}, {"id": 184918, "fullname": "Xi Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184918?format=json", "institution": "Zhejiang University"}, {"id": 84995, "fullname": "Guofeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84995?format=json", "institution": "Zhejiang University"}], "abstract": "The 3D characterization of microstructures is crucial for understanding and designing functional materials. However, the scanning electron microscope (SEM), widely used in scientific research, captures only 2D electron intensity distributions. Existing SEM 3D reconstruction methods struggle with textureless regions, shadowing artifacts, and calibration dependencies, whereas advanced learning-based approaches fail to generalize to microscopic SEM domains due to the lack of physical priors and domain-specific data. To address these challenges, we introduce NFH-SEM, a neural field-based hybrid reconstruction framework that recovers high-fidelity 3D surfaces from multi-view, multi-detector SEM images. NFH-SEM integrates coarse multi-view geometry with photometric stereo cues from detector signals through a continuous neural field, incorporating a learnable forward model that embeds SEM imaging physics for self-calibrated, shadow-robust reconstruction. NFH-SEM achieves precise recovery across diverse specimens, revealing 478 nm layered features in two-photon lithography samples, 782 nm surface textures on pollen grains, and 1.559 \u03bcm fracture steps on silicon carbide particles, demonstrating its accuracy and broad applicability.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40257", "url": null, "sourceid": -46455, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/36381?format=json"], "related_events_ids": [36381]}, {"id": 36387, "uid": "3a55a38f5f96deb7a6064d9dac177151", "name": "Open the Motion Door: Atomic Motion Decomposition and Recomposition for Open-Vocabulary Motion Generation", "authors": [{"id": 126431, "fullname": "Ke Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/126431?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 88656, "fullname": "Jiangning Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88656?format=json", "institution": "Tencent Youtu Lab"}, {"id": 88671, "fullname": "Ran Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/88671?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 156478, "fullname": "Jingyu Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/156478?format=json", "institution": "East China Normal University"}, {"id": 86953, "fullname": "Yabiao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86953?format=json", "institution": "Zhejiang University"}, {"id": 147923, "fullname": "yating wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/147923?format=json", "institution": "shanghai jiaotong university"}, {"id": 86818, "fullname": "Xin Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/86818?format=json", "institution": "East China Normal University"}, {"id": 86912, "fullname": "Chengjie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86912?format=json", "institution": "Tencent Youtu Lab; Shanghai Jiao Tong University"}, {"id": 89127, "fullname": "Lizhuang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/89127?format=json", "institution": "Dept. of Computer Sci. &amp; Eng., Shanghai Jiao Tong University"}], "abstract": "Text-to-motion generation is a fundamental task in computer vision, aiming to synthesize 3D human motion sequences from natural language descriptions. However, due to the limited scale and diversity of existing datasets, models trained to directly map raw text to motion often struggle to generalize to out-of-domain textual inputs. We observe that although high-level motion semantics vary widely, many motions share a common set of underlying atomic motions\u2014that is, simple, reusable body-part movements. Building on this insight, we introduce an **Atomic Motion Decomposition and Recomposition** framework for open-vocabulary text-to-motion generation. Our approach consists of two key components: a **Textual Decomposition** module that parses out-of-domain descriptions into atomic motion units, and an **Atomic Recomposition** module that integrates these units to produce the final motion sequence. Our model achieves a competitive performance on the in-domain HumanML3D dataset, and extensive experiments on two out-of-domain datasets (IDEA400 and Mixamo) demonstrate that our method substantially outperforms state-of-the-art approaches in open-vocabulary motion generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36387", "url": null, "sourceid": 39948, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36389, "uid": "74e2945e57b290b98755a5f3f4468cb3", "name": "Bridging RGB and Hematoxylin Components: An Interleaved Guidance and Fusion Framework for Point Supervised Nuclei Segmentation", "authors": [{"id": 180809, "fullname": "Zihan Huan", "url": "http://cvpr.thecvf.com/api/miniconf/users/180809?format=json", "institution": "Guilin University of Electronic Technology"}, {"id": 184930, "fullname": "Xipeng Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184930?format=json", "institution": "Guilin University of Electronic Technology"}, {"id": 184931, "fullname": "Hualong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184931?format=json", "institution": "Guilin University of Electronic Technology"}, {"id": 184932, "fullname": "Siyang Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184932?format=json", "institution": "Guilin University of Electronic Technology"}, {"id": 184933, "fullname": "Rushi Lan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184933?format=json", "institution": "Guilin University of Electronic Technology"}, {"id": 184934, "fullname": "Huadeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184934?format=json", "institution": "Guilin University of Electronic Technology"}, {"id": 184935, "fullname": "Haoxiang Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184935?format=json", "institution": "Guilin University of Electronic Technology"}, {"id": 184936, "fullname": "Zhenbing Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184936?format=json", "institution": null}], "abstract": "Nuclei instance segmentation in histopathology images is essential for diagnostic accuracy and downstream computational tasks, yet this task relies heavily on expensive pixel level annotations. Although point level annotations substantially reduce the annotation burden for pathologists, many existing methods utilize only a single type of image and overlook the complementary information contained in alternative representations. To address this limitation, we propose DFGNet, a weakly supervised framework that utilizes dual-representation complementary fusion and interleaved guidance learning by jointly modeling RGB images and their corresponding Hematoxylin components. From the complementary fusion perspective, we propose a Reciprocal Cross-scale Dynamic Fusion Module. (RCDF) and an Entropy Confidence Aggregation Unit (ECAU) to integrate multi-scale complementary cues and adaptively combine the outputs of the dual branches. In terms of interleaved guidance, we also propose an Interleaved point-Guided Attention (IGA) that enables bidirectional refinement between the segmentation task and the kernel prediction task. Extensive experiments on three benchmark datasets show that DFGNet achieves state-of-the-art performance across multiple metrics and significantly outperforms existing approaches. DFGNet also demonstrates strong generalization ability across different tissue types and exhibits remarkable robustness to annotation shifts, providing a low-cost and scalable solution for practical clinical applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36389", "url": null, "sourceid": 37686, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36393, "uid": "276441f6460de74a1b4099238d203427", "name": "$A^2$GC: $A$symmetric $A$ggregation with Geometric Constraints for Locally Aggregated Descriptors", "authors": [{"id": 143406, "fullname": "Zhenyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/143406?format=json", "institution": "Qilu University of Technology"}, {"id": 183221, "fullname": "Tianyi Shang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183221?format=json", "institution": "Shandong Institute of Mechanical Design and Research"}], "abstract": "Visual Place Recognition (VPR) aims to match query images against a database using visual cues. State-of-the-art methods aggregate features from deep backbones to form global descriptors. Optimal transport-based aggregation methods reformulate feature-to-cluster assignment as a transport problem, but the standard Sinkhorn algorithm symmetrically treats source and target marginals, limiting effectiveness when image features and cluster centers exhibit substantially different distributions. We propose an asymmetric aggregation VPR method with geometric constraints for locally aggregated descriptors, called $A^2$GC-VPR. Our method employs row-column normalization averaging with separate marginal calibration, enabling asymmetric matching that adapts to distributional discrepancies in visual place recognition. Geometric constraints are incorporated through learnable coordinate embeddings, computing compatibility scores fused with feature similarities, thereby promoting spatially proximal features to the same cluster and enhancing spatial awareness. Experimental results on MSLS, NordLand, and Pittsburgh datasets demonstrate superior performance, validating the effectiveness of our approach in improving matching accuracy and robustness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36393", "url": null, "sourceid": 38014, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36394, "uid": "8db1c5244f04213e7178c188cf975960", "name": "Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models", "authors": [{"id": 182916, "fullname": "Shengchao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/182916?format=json", "institution": "Tencent ARC Lab; University of Hong Kong"}, {"id": 155774, "fullname": "Yuxin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/155774?format=json", "institution": "Tencent ARC Lab"}, {"id": 86827, "fullname": "Yuying Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/86827?format=json", "institution": "University of Hong Kong"}, {"id": 156644, "fullname": "Wei Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156644?format=json", "institution": "University of Hong Kong"}, {"id": 129558, "fullname": "Jiehong Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/129558?format=json", "institution": "South China University of Technology"}, {"id": 84809, "fullname": "Ying Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/84809?format=json", "institution": "Tencent"}, {"id": 88776, "fullname": "Xiaojuan Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/88776?format=json", "institution": "University of Oxford"}], "abstract": "Vision-language models (VLM) excel at general understanding yet remain weak at dynamic spatial reasoning (**DSR**), i.e., reasoning about the evolvement of object geometry and relationship in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across aspects of dataset, benchmark and model, we introduce **DSR Suite**. First, we propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of **DSR-Train** for learning and further human-refined **DSR-Bench** for evaluation. Compared with previous works, our data emphasize (i) in-the-wild video sources, (ii) object- and scene-level 3D requirements, (iii) viewpoint transformations, (iv) multi-object interactions, and (v) fine-grained, procedural answers. Beyond data, we propose a lightweight Geometry Selection Module (**GSM**) to seamlessly integrate geometric priors into VLMs, which condenses question semantics and extracts question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens. This targeted extraction avoids overwhelming the model with irrelevant knowledge. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability, while maintaining accuracy on general video understanding benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36394", "url": null, "sourceid": 46024, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36398, "uid": "721a427b22b69757b37c150e2fe24caa", "name": "LiveGesture: Streamable Co-Speech Gesture Generation Model", "authors": [{"id": 183719, "fullname": "Muhammad Usama Saleem", "url": "http://cvpr.thecvf.com/api/miniconf/users/183719?format=json", "institution": "Amazon.com"}, {"id": 180015, "fullname": "Mayur Jagdishbhai Patel", "url": "http://cvpr.thecvf.com/api/miniconf/users/180015?format=json", "institution": "University of North Carolina at Charlotte"}, {"id": 133207, "fullname": "Ekkasit Pinyoanuntapong", "url": "http://cvpr.thecvf.com/api/miniconf/users/133207?format=json", "institution": "University of North Carolina at Charlotte"}, {"id": 184948, "fullname": "Zhongxing Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184948?format=json", "institution": null}, {"id": 158161, "fullname": "Li Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158161?format=json", "institution": "University of North Carolina at Charlotte"}, {"id": 135247, "fullname": "Hongfei Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/135247?format=json", "institution": "University of North Carolina at Charlotte"}, {"id": 184949, "fullname": "Ahmed Helmy", "url": "http://cvpr.thecvf.com/api/miniconf/users/184949?format=json", "institution": "University of North Carolina at Charlotte"}, {"id": 73542, "fullname": "Chen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/73542?format=json", "institution": "University of Central Florida"}, {"id": 127740, "fullname": "Pu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127740?format=json", "institution": "University of North Carolina at Charlotte"}], "abstract": "We propose LiveGesture, the first fully streamable, speech-driven full-body gesture generation framework that operates with zero look-ahead and supports arbitrary sequence length. Unlike existing co-speech gesture methods\u2014which are designed for offline generation and either treat body regions independently or entangle all joints within a single model\u2014LiveGesture is built from the ground up for causal, region-coordinated motion generation. \\emph{LiveGesture} consists of two main modules: the Streamable Vector-Quantized Motion Tokenizer (SVQ) and the Hierarchical Autoregressive Transformer (HAR). The SVQ tokenizer converts the motion sequence of each body region into causal, discrete motion tokens, enabling real-time, streamable token decoding. On top of SVQ, HAR employs region-eXpert autoregressive (xAR) transformers to model expressive, fine-grained motion dynamics for each body region. A causal spatio-temporal fusion module (xAR-Fusion) then captures and integrates correlated motion dynamics across regions. Both xAR and xAR-Fusion are conditioned on live, continuously arriving audio signals encoded by a streamable causal audio encoder. To enhance robustness under streaming noise and prediction errors, we introduce autoregressive masking training, which leverages uncertainty-guided token masking and random region masking to expose the model to imperfect, partially erroneous histories during training. Experiments on the BEAT2 dataset demonstrate that LiveGesture produces coherent, diverse, and beat-synchronous full-body  gestures in real time, matching or surpassing state-of-the-art offline methods under true zero\u2013look-ahead conditions", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36398", "url": null, "sourceid": 45146, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36403, "uid": "04aba162f63281798d13df8b256e3a99", "name": "Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization", "authors": [{"id": 180360, "fullname": "Zhuohan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180360?format=json", "institution": "Fudan University"}, {"id": 107251, "fullname": "Wujian Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/107251?format=json", "institution": "Fudan University"}, {"id": 148321, "fullname": "Yitong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/148321?format=json", "institution": "Fudan University"}, {"id": 74132, "fullname": "Zuxuan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/74132?format=json", "institution": "Fudan University"}], "abstract": "Despite the rapid progress of text-to-image (T2I) models, generating images that accurately reflect complex compositional prompts (covering attribute bindings, object relationships, counting) still remains challenging. To address this, we propose BIDPO, a framework to enhance T2I model's capability of compositional text-to-image generation. We begin by introducing an carefully designed pipeline to construct a large-scale preference dataset, BICOMP, with strictly quality control. Then, we extend Diffusion DPO to jointly optimize image and text preferences, which is shown to greatly effective in improving the models to follow complex text prompt in generation. To further enhance the models for fine-grained alignment, we employ a region-level guidance method to focus on regions relevant to compositional concepts. Experimental results demonstrate that our BIDPO substantially improves compositional fidelity, consistently outperforming prior methods across multiple benchmarks. Our approach highlights the potential of preference-based fine-tuning for complex text-to-image tasks, offering a flexible and scalable alternative to existing techniques.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36403", "url": null, "sourceid": 30745, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36404, "uid": "ac42634b27b4530eac740d6b72fcb713", "name": "It Takes Two: A Duet of Periodicity and Directionality for Burst Flicker Removal", "authors": [{"id": 145694, "fullname": "lishen qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/145694?format=json", "institution": "Nankai University"}, {"id": 127878, "fullname": "Shihao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/127878?format=json", "institution": "Nankai University"}, {"id": 132718, "fullname": "Jie Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/132718?format=json", "institution": "OPPO"}, {"id": 88749, "fullname": "Hui Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/88749?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 88043, "fullname": "Lei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88043?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 76168, "fullname": "Jufeng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76168?format=json", "institution": "Nankai University"}], "abstract": "Flicker artifacts, arising from unstable illumination and row-wise exposure inconsistencies, pose a significant challenge in short-exposure photography, severely degrading image quality. Unlike typical artifacts, e.g., noise and low-light, flicker is a structured degradation with specific spatial-temporal patterns, which are not accounted for in current generic restoration frameworks, leading to suboptimal flicker suppression and ghosting artifacts. In this work, we reveal that flicker artifacts exhibit two intrinsic characteristics, periodicity and directionality, and propose Flickerformer, a transformer-based architecture that effectively removes flicker without introducing ghosting.Specifically, Flickerformer comprises three key components: a phase-based fusion module (PFM), an autocorrelation feed-forward network (AFFN), and a wavelet-based directional attention module (WDAM). Based on the periodicity, PFM performs inter-frame phase correlation to adaptively aggregate burst features, while AFFN exploits intra-frame structural regularities through autocorrelation, jointly enhancing the network\u2019s ability to perceive spatially recurring patterns. Moreover, motivated by the directionality of flicker artifacts, WDAM leverages high-frequency variations in the wavelet domain to guide the restoration of low-frequency dark regions, yielding precise localization of flicker artifacts. Extensive experiments demonstrate that Flickerformer outperforms state-of-the-art approaches in both quantitative metrics and visual quality. The source code is available in the supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36404", "url": null, "sourceid": 35633, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36407, "uid": "33f57f79648ec1697ef41b85f0706428", "name": "Foundry: Distilling 3D Foundation Models for the Edge", "authors": [{"id": 182439, "fullname": "Guillaume Letellier", "url": "http://cvpr.thecvf.com/api/miniconf/users/182439?format=json", "institution": "University of Caen Normandy"}, {"id": 131613, "fullname": "Siddharth Srivastava", "url": "http://cvpr.thecvf.com/api/miniconf/users/131613?format=json", "institution": "TensorTour Inc"}, {"id": 167001, "fullname": "Frederic Jurie", "url": "http://cvpr.thecvf.com/api/miniconf/users/167001?format=json", "institution": "GREYC Laboratory - ENSICAEN, UNICAEN"}, {"id": 131594, "fullname": "Gaurav Sharma", "url": "http://cvpr.thecvf.com/api/miniconf/users/131594?format=json", "institution": "TensorTour Inc."}], "abstract": "Foundation models pre-trained with self-supervised learning (SSL) on large-scale datasets have become powerful general-purpose feature extractors. However, their immense size and computational cost make them prohibitive for deployment on edge devices such as robots and AR/VR headsets. Existing compression techniques like standard knowledge distillation create efficient `specialist' models but sacrifice the crucial, downstream-agnostic generality that makes foundation models so valuable.In this paper, we introduce Foundation Model Distillation (FMD), a new paradigm for compressing large SSL models into compact, efficient, and faithful proxies that retain their general-purpose representational power. We present Foundry, the first implementation of FMD for 3D point clouds. Our approach, Foundry, trains a student to learn a compressed set of SuperTokens that reconstruct the teacher\u2019s token-level representations, capturing a compact basis of its latent space. A single distilled model maintains strong transferability across diverse downstream tasks\u2014classification, part segmentation, and few-shot scenarios\u2014approaching full foundation-model performance while using significantly fewer tokens and FLOPs, making such models more practical for deployment on resource-constrained hardware.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36407", "url": null, "sourceid": 45259, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36408, "uid": "525a12095ee281d7f23e073acee75040", "name": "Revisiting Learning with Noisy Labels: Active Forgetting and Noise Suppression", "authors": [{"id": 175960, "fullname": "Mengmeng Sheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/175960?format=json", "institution": "Nanjing University of Science and Technology "}, {"id": 129610, "fullname": "Zeren Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/129610?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 107127, "fullname": "Tao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/107127?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 87633, "fullname": "Jinshan Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87633?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 129605, "fullname": "Yazhou Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/129605?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 72996, "fullname": "Fumin Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/72996?format=json", "institution": "UESTC"}], "abstract": "Learning with noisy labels (LNL) has received growing attention, with most prior work following the paradigm of clean-sample reliance (e.g., sample selection). However, this reliance also imposes intrinsic limitations, as overfitting to even a few noisy samples is inevitable, creating a major bottleneck for further improvement. This limitation motivates us to go beyond mere clean-sample reliance and explore how to actively forget corrupted knowledge already internalized by models while suppressing further noise assimilation. To this end, we propose FINE, a fundamentally novel perspective for LNL that unifies active ForgettIng via machine unlearning (MU) and Noise supprEssion via negative learning (NL) within a cohesive framework. Specifically, we first reveal two key stages of noise fitting: early-stage generalized learning and later-stage noise overfitting. To actively forget early-stage noise accumulation, we introduce an MU-based module that employs a negative cross-entropy loss to erase corrupted knowledge, while an NL-based module leveraging complementary labels suppresses later-stage overfitting and mitigates reliance on noisy supervision. These modules act synergistically as plug-and-play regularizers, seamlessly integrating into existing baselines. Finally, extensive experiments on both synthetic and real-world noisy benchmarks demonstrate that our FINE consistently boosts robustness and generalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36408", "url": null, "sourceid": 43139, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36411, "uid": "8a010f0312373c02e0d15cdfc56ea416", "name": "ReGenHOI: Unifying Reconstruction and Generation for 3D Human\u2013Object Interaction Understanding", "authors": [{"id": 182210, "fullname": "miao xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182210?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 76305, "fullname": "Xiangyu Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76305?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 184974, "fullname": "Zidu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184974?format=json", "institution": "Institute of automation, Chinese Academy of Sciences"}, {"id": 182070, "fullname": "XUSHENG LIANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/182070?format=json", "institution": "City University of Hong Kong"}, {"id": 184975, "fullname": "Bao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184975?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 184329, "fullname": "Jinlin Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184329?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 155613, "fullname": "Zelin Zang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155613?format=json", "institution": "Westlake University, Zhejiang University, National University of Singapore"}, {"id": 89292, "fullname": "Zhen Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/89292?format=json", "institution": "Institute of Automation,  Chinese Academy of Sciences"}], "abstract": "Understanding 3D human\u2013object interaction (HOI) involves two highly-related abilities: reconstruction, which perceives observed geometry, and generation, which imagines plausible future interactions. However, most existing methods treat these abilities as separate tasks, limiting their capacity to capture the unified nature of human spatial reasoning. To address this, we propose a unified framework that bridges reconstruction and generation through a shared semantic\u2013geometric reasoning space. Specifically, a 3D Contact Reasoning mechanism enables direct reasoning in 3D space, jointly modeling geometric structure and semantic relationships, while a Reasoning Trace Refinement module iteratively refines contact predictions by integrating geometric and semantic cues. The framework builds a unified latent representation via explicit reasoning on human\u2013object contact regions. To further enhance realism and physical plausibility when generating the outputs of reconstruction and generation, we modify and adapt the Gravity-Field Based Diffusion Bridge to refine fine-grained contact geometry and ensure smooth, physically consistent human\u2013object engagement. Extensive experiments demonstrate that our unified framework significantly improves both reconstruction accuracy and generative interaction quality, establishing a cohesive and interpretable paradigm for 3D HOI understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36411", "url": null, "sourceid": 34265, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36409, "uid": "16ecd261ac5088aee91078bf5225abd9", "name": "InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization", "authors": [{"id": 150464, "fullname": "Daniel Gilo", "url": "http://cvpr.thecvf.com/api/miniconf/users/150464?format=json", "institution": "Technion"}, {"id": 87005, "fullname": "Or Litany", "url": "http://cvpr.thecvf.com/api/miniconf/users/87005?format=json", "institution": "NVIDIA / Technion"}], "abstract": "We address the task of multi-view image editing from sparse input views, where the inputs can be seen as a mix of images capturing the scene from different viewpoints. The goal is to modify the scene according to a textual instruction while preserving consistency across all views.Existing methods, based on per-scene neural fields or temporal attention mechanisms, struggle in this setting, often producing artifacts and incoherent edits. We propose InstructMix2Mix (I-Mix2Mix), a framework that distills the editing capabilities of a 2D diffusion model into a pretrained multi-view diffusion model, leveraging its data-driven 3D prior for cross-view consistency. A key contribution is replacing the conventional neural field consolidator in Score Distillation Sampling (SDS) with a multi-view diffusion student, which requires novel adaptations: incremental student updates across timesteps, a specialized teacher noise scheduler to prevent degeneration, and an attention modification that enhances cross-view coherence without additional cost. Experiments demonstrate that I-Mix2Mix significantly improves multi-view consistency while maintaining high per-frame edit quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36409", "url": null, "sourceid": 39996, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36419, "uid": "39b24aca0c6550e0aadc339909990afc", "name": "UniSER: A Foundation Model for Unified Soft Effects Removal", "authors": [{"id": 136067, "fullname": "Jingdong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/136067?format=json", "institution": "Texas A&amp;M University - College Station"}, {"id": 153323, "fullname": "Lingzhi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153323?format=json", "institution": "Adobe Systems"}, {"id": 84712, "fullname": "Qing Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84712?format=json", "institution": "Adobe Systems"}, {"id": 183074, "fullname": "Mang Tik Chiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183074?format=json", "institution": "Adobe Inc."}, {"id": 88938, "fullname": "Connelly Barnes", "url": "http://cvpr.thecvf.com/api/miniconf/users/88938?format=json", "institution": "Adobe Systems"}, {"id": 129569, "fullname": "Yizhou Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129569?format=json", "institution": "Northeastern University"}, {"id": 85156, "fullname": "Haoran You", "url": "http://cvpr.thecvf.com/api/miniconf/users/85156?format=json", "institution": "Georgia Institute of Technology"}, {"id": 156847, "fullname": "Xiaoyang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156847?format=json", "institution": "Adobe"}, {"id": 88967, "fullname": "Yuqian Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/88967?format=json", "institution": "University of Illinois, Urbana-Champaign"}, {"id": 85199, "fullname": "Zhe Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/85199?format=json", "institution": "Adobe Research"}, {"id": 75717, "fullname": "Eli Shechtman", "url": "http://cvpr.thecvf.com/api/miniconf/users/75717?format=json", "institution": "Adobe Research, US"}, {"id": 88957, "fullname": "Sohrab Amirghodsi", "url": "http://cvpr.thecvf.com/api/miniconf/users/88957?format=json", "institution": "Adobe"}, {"id": 73998, "fullname": "Xin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/73998?format=json", "institution": "Texas A&amp;M University - College Station"}, {"id": 87784, "fullname": "Wenping Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87784?format=json", "institution": "Texas A&amp;M University - College Station"}, {"id": 91059, "fullname": "Xiaohang Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/91059?format=json", "institution": "Tencent"}], "abstract": "Digital images are often degraded by soft effects such as lens flare, haze, shadows, and reflections, which reduce aesthetics even though the underlying pixels remain partially visible. The prevailing works address these degradations in isolation, developing highly specialized, specialist models that lack scalability and fail to exploit the shared underlying essences of these restoration problems. While specialist models are limited, recent large-scale pretrained generalist models offer powerful, text-driven image editing capabilities. while recent general-purpose systems (e.g., GPT-4o, Flux Kontext, Nano Banana) require detailed prompts and often fail to achieve robust removal on these fine-grained tasks or preserve identity of the scene. Leveraging the common essence of soft effects, i.e., semi-transparent occlusions, we introduce a foundational versatile model UniSER, capable of addressing diverse degradations caused by soft effects within a single framework. Our methodology centers on curating a massive 3.8M-pair dataset to ensure robustness and generalization, which includes novel, physically-plausible data to fill critical gaps in public benchmarks, and a tailored training pipeline that fine-tunes a Diffusion Transformer to learn robust restoration priors from this diverse data, integrating fine-grained mask and strength controls. This synergistic approach allows UniSER to significantly outperform both specialist and generalist models, achieving robust, high-fidelity restoration in the wild.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36419", "url": null, "sourceid": 37911, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36416, "uid": "ff342ebc4c38bf532854050e89acf973", "name": "Seeing Conversations: Communication Context Identification in Egocentric Video", "authors": [{"id": 135598, "fullname": "Tobias Dorszewski", "url": "http://cvpr.thecvf.com/api/miniconf/users/135598?format=json", "institution": "Technical University of Denmark"}, {"id": 184987, "fullname": "Jens Hjortkj\u00e6r", "url": "http://cvpr.thecvf.com/api/miniconf/users/184987?format=json", "institution": "Technical University of Denmark"}], "abstract": "In everyday conversations, humans effortlessly recognize communication partners using visual cues such as gaze or head orientation. Replicating this social reasoning in computer vision is challenging, especially in dynamic, multi-person settings. We introduce Communication Context Identification (CCI) in egocentric vision: Given a first-person video sequence, determine which individuals are engaged in communication with the camera wearer. To support CCI, we collected a challenging large-scale dataset comprising 68.9 hours of egocentric video captured across diverse multi-person, multi-conversation scenarios.We propose CoCoNet, a temporal interaction model for CCI that tracks social dynamics via attention across individuals over long time scales. CoCoNet flexibly handles varying group sizes, maintains predictions through occlusions, and performs robustly even with limited temporal input. Leveraging long temporal contexts, it achieves 96\\% balanced accuracy on CCI. Performance varies with group size and spatial scene layout, highlighting the importance of dataset diversity. Our work advances vision-based conversational awareness, enabling applications in assistive hearing that use egocentric video to enhance individuals in the user\u2019s conversation group.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36416", "url": null, "sourceid": 38615, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36425, "uid": "d64c331c34bb99731a626c0002467c65", "name": "2D-LFM: Lifting Foundation Model without 3D supervision", "authors": [{"id": 70318, "fullname": "Mosam Dabhi", "url": "http://cvpr.thecvf.com/api/miniconf/users/70318?format=json", "institution": "Carnegie Mellon University"}, {"id": 185008, "fullname": "Irhas Gill", "url": "http://cvpr.thecvf.com/api/miniconf/users/185008?format=json", "institution": "University of Adelaide"}, {"id": 76711, "fullname": "Laszlo Jeni", "url": "http://cvpr.thecvf.com/api/miniconf/users/76711?format=json", "institution": "Carnegie Mellon University"}, {"id": 86945, "fullname": "Simon Lucey", "url": "http://cvpr.thecvf.com/api/miniconf/users/86945?format=json", "institution": "University of Adelaide"}], "abstract": "Recent vision foundation models give the impression that 3D reconstruction from RGB is largely solved. Yet these systems struggle with object-specific 3D structure: the fine-grained geometry implied by an object\u2019s landmarks or skeleton. In this paper, we show that when a model is given only 2D landmarks, it can recover more accurate 3D structure than state-of-the-art depth-from-RGB foundation models. Classical lifting approaches such as PAUL demonstrate this principle but do not scale beyond single categories, while methods like 3D-LFM scale but require extensive 3D supervision. We present the first lifting foundation model that learns object-specific 3D geometry using only 2D supervision. The key idea is to inject correspondence structure into the model via a positional encoding inspired by classical structure-from-motion. This simple inductive bias enables robust, object-agnostic 3D lifting that rivals or exceeds recent 3D-supervised approaches, revealing that landmark-based lifting remains a powerful and under-exploited paradigm for 3D understanding.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36425", "url": null, "sourceid": 31401, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36423, "uid": "80e37ed836e4af7a5f41eb4ed703bfef", "name": "Temporal Imbalance of Positive and Negative Supervision in Class-Incremental Learning", "authors": [{"id": 136248, "fullname": "Jinge Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/136248?format=json", "institution": "Purdue University"}, {"id": 126795, "fullname": "Fengqing Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126795?format=json", "institution": "Purdue University, Purdue University"}], "abstract": "With the widespread adoption of deep learning in visual tasks, Class-Incremental Learning (CIL) has become an important paradigm for handling dynamically evolving data distributions. However, CIL faces the core challenge of *catastrophic forgetting*, often manifested as a prediction bias toward new classes. Existing methods mainly attribute this bias to intra-task class imbalance and focus on corrections at the classifier head. In this paper, we highlight an overlooked factor\u2014*temporal imbalance*\u2014as a key cause of this bias. Earlier classes receive stronger negative supervision toward the end of training, leading to asymmetric precision and recall. We establish a temporal supervision model, formally define temporal imbalance, and propose the Temporal-Adjusted Loss (TAL), which uses a temporal decay kernel to construct a supervision strength vector and dynamically reweight the negative supervision in cross-entropy loss. Theoretical analysis shows that TAL degenerates to standard cross-entropy under balanced conditions and effectively mitigates prediction bias under imbalance. Extensive experiments demonstrate that TAL significantly reduces forgetting and improves performance on multiple CIL benchmarks, underscoring the importance of temporal modeling for stable long-term learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36423", "url": null, "sourceid": 42706, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36421, "uid": "3eb98498c491436f425cd830b3f447fa", "name": "CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space", "authors": [{"id": 174987, "fullname": "Sohwi Lim", "url": "http://cvpr.thecvf.com/api/miniconf/users/174987?format=json", "institution": "KAIST"}, {"id": 184992, "fullname": "Lee Hyoseok", "url": "http://cvpr.thecvf.com/api/miniconf/users/184992?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 184993, "fullname": "Jungjoon Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/184993?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 152617, "fullname": "Tae-Hyun Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/152617?format=json", "institution": "KAIST"}], "abstract": "Human perception of visual similarity is inherently adaptive and subjective, depending on the users\u2019 interests and focus. However, most image retrieval systems fail to reflect this flexibility, relying on a fixed, monolithic metric that cannot incorporate multiple conditions simultaneously. To address this, we propose CLAY, an adaptive similarity computation method that reframes the embedding space of pretrained Vision-Language Models (VLMs) as a text-conditional similarity space without additional training. This design separates the textual conditioning process and visual feature extraction, allowing highly efficient and multi-conditioned retrieval with fixed visual embeddings. We also construct a synthetic evaluation dataset CLAY-EVAL, for comprehensive assessment under diverse conditioned retrieval settings. Experiments on standard datasets and our proposed dataset show that CLAY achieves state-of-the art retrieval accuracy and notable computational efficiency compared to previous works.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36421", "url": null, "sourceid": 37005, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36437, "uid": "024f19fc85e7662509e586c3f73273cd", "name": "MatE: Material Extraction from Single-Image via Geometric Prior", "authors": [{"id": 182407, "fullname": "Zeyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182407?format=json", "institution": "University of Science and Technology of China"}, {"id": 86247, "fullname": "Wei Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/86247?format=json", "institution": "University of Science and Technology of China"}, {"id": 155522, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155522?format=json", "institution": "University of Science and Technology of China"}, {"id": 86250, "fullname": "Yang Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86250?format=json", "institution": "University of Science and Technology of China"}], "abstract": "The creation of high-fidelity, physically-based rendering (PBR) materials remains a bottleneck in many graphics pipelines, typically requiring specialized equipment and expert-driven post-processing. To democratize this process, we present MatE, a novel method for generating tileable PBR materials from a single image taken under unconstrained, real-world conditions. Given an image and a user-provided mask, MatE first performs coarse rectification using an estimated depth map as a geometric prior, and then employs a dual-branch diffusion model. Leveraging a learned consistency from rotation-aligned and scale-aligned training data, this model further rectify residual distortions from the coarse result and translate it into a complete set of material maps, including albedo, normal, roughness and height. Our framework achieves invariance to the unknown illumination and perspective of the input image, allowing for the recovery of intrinsic material properties from casual captures. Through comprehensive experiments on both synthetic and real-world data, we demonstrate the efficacy and robustness of our approach, enabling users to create realistic materials from real-world image.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36437", "url": null, "sourceid": 44023, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36433, "uid": "67012a7770c47eba3bc1b811ca28ecc6", "name": "Real-Time Neural Video Compression with Unified Intra and Inter Coding", "authors": [{"id": 173417, "fullname": "Hui Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/173417?format=json", "institution": "University of Science and Technology of China"}, {"id": 143057, "fullname": "Yifan Bian", "url": "http://cvpr.thecvf.com/api/miniconf/users/143057?format=json", "institution": "University of Science and Technology of China"}, {"id": 90450, "fullname": "Li Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/90450?format=json", "institution": "University of Science and Technology of China"}, {"id": 185035, "fullname": "Jingran Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185035?format=json", "institution": "Tencent Shannon Lab"}, {"id": 185036, "fullname": "Xianguo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185036?format=json", "institution": "Tencent Shannon Lab"}, {"id": 87804, "fullname": "Dong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87804?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Neural video compression (NVC) technologies have advanced rapidly in recent years, yielding state-of-the-art schemes such as DCVC-RT that offer superior compression efficiency to H.266/VVC and real-time encoding/decoding capabilities. Nonetheless, existing NVC schemes have several limitations, including inefficiency in dealing with disocclusion and new content, interframe error propagation and accumulation, among others. To eliminate these limitations, we borrow the idea from classic video coding schemes, which allow intra coding within inter-coded frames. With the intra coding tool enabled, disocclusion and new content are properly handled, and interframe error propagation is naturally intercepted without the need for manual refresh mechanisms. We present an NVC framework with unified intra and inter coding, where every frame is processed by a single model that is trained to perform intra/inter coding adaptively. Moreover, we propose a simultaneous two-frame compression design to exploit interframe redundancy not only forwardly but also backwardly. Experimental results show that our scheme outperforms DCVC-RT by an average of 12.1\\% BD-rate reduction, delivers more stable bitrate and quality per frame, and retains real-time encoding/decoding performances. Code and models will be released.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36433", "url": null, "sourceid": 34753, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36434, "uid": "3985c3f6f10fa559cb7403cd0121d5c1", "name": "Mirror Illusion Art", "authors": [{"id": 182683, "fullname": "Xiaopei Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182683?format=json", "institution": "Tsinghua University"}, {"id": 185037, "fullname": "Zeyuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185037?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 86599, "fullname": "Jun Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86599?format=json", "institution": "Tsinghua University"}, {"id": 89698, "fullname": "Xiaolin Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89698?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Mirror Illusion Art is a novel reflection-conditioned 3D illusion where one object yields two target appearances (front and mirror). The task is formulated as inverse design from two target 2D images (front and mirror) to a printable 3D object with geometry and texture. Prior topology-driven and shadow-based approaches demand substantial manual effort, optimize shape only, and often yield non-smooth or incomplete geometry. To address these challenges, we propose AutoMIA, an automated Mirror Illusion Art design pipeline that jointly optimizes shape and color. To stabilize optimization and suppress artifacts, four mechanisms are introduced: (1) projection-alignment component (PAC) selection to reduce surface noise, (2) position-weighted adaptive (PWA) suppression for background noise,  (3) internal voxel preservation (IVP) to prevent internal fractures, and (4) shape-color decoupled (SCD) optimization that balance shape and color optimization. AutoMIA generate diverse smooth Mirror Illusion artworks successfully both in the digital and physical world, with only around 76s design time and 2.6 GB memory on average using a single RTX 3090, advancing inverse graphics and computational design.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36434", "url": null, "sourceid": 32516, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36435, "uid": "9f0f4ce8b671b61e661861ee16f16d93", "name": "UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying", "authors": [{"id": 183908, "fullname": "Bai Chengyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183908?format=json", "institution": "Peking University"}, {"id": 156765, "fullname": "Jintao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/156765?format=json", "institution": "Peking University"}, {"id": 185038, "fullname": "Xiang Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/185038?format=json", "institution": "Peking University"}, {"id": 185039, "fullname": "Yilong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185039?format=json", "institution": "Peking University"}, {"id": 185040, "fullname": "Qi She", "url": "http://cvpr.thecvf.com/api/miniconf/users/185040?format=json", "institution": "Bytedance AI Lab"}, {"id": 89661, "fullname": "Ming Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89661?format=json", "institution": "Intel Labs China"}, {"id": 91956, "fullname": "Shanghang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91956?format=json", "institution": "Peking University"}], "abstract": "Recent advances in diffusion models and vision-language models (VLMs) have significantly enhanced the controllability of image editing. Methods like FlowEdit enable step-by-step editing along a visible, noise-free trajectory, where each intermediate result is a clear image, eliminating the need for full noise inversion. However, these approaches still operate in pixel space or VAE latent space, where intermediate outputs often suffer from visual artifacts, distortions, or unrealistic details\u2014making reliable semantic evaluation difficult. Furthermore, they remain open-loop systems, applying static edits without feedback to guide or correct the editing process adaptively. We propose UniEdit-I, the first training-free, closed-loop image editing framework that operates entirely within the semantic latent space of a unified VLM by introducing an Understanding\u2013Editing\u2013Verifying (UEV) loop:(1) Understanding: parses the source image and editing instruction into a structured source prompt and a minimal target specification;(2) Editing: applies dynamic semantic offsets, with a configurable feedback weighting mechanism that adaptively modulates editing intensity based on real-time alignment feedback;(3) Verifying: leverages the VLM\u2019s own multimodal reasoning capability to evaluate the intermediate output along multiple semantic dimensions and trigger early stopping or refinement. By transforming the VLM from a post-hoc evaluator into an in-process conductor, UniEdit-I establishes the first semantics-driven, self-correcting closed-loop image editing pipeline. Evaluated on GEdit-Bench, UniEdit-I achieves state-of-the-art performance without any fine-tuning or architectural modifications, and even surpasses several large-scale pre-trained editors.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36435", "url": null, "sourceid": 43569, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36439, "uid": "8fd2aa96c03c97513ead09138475efd9", "name": "InterPrior: A Scalable Motion Prior for Physics-Based Human-Object Interactions", "authors": [{"id": 140805, "fullname": "Sirui Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/140805?format=json", "institution": "University of Illinois Urbana-Champaign"}, {"id": 73495, "fullname": "Samuel Schulter", "url": "http://cvpr.thecvf.com/api/miniconf/users/73495?format=json", "institution": "NEC Laboratories America"}, {"id": 185050, "fullname": "Morteza Ziyadi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185050?format=json", "institution": "Amazon"}, {"id": 185051, "fullname": "Xialin He", "url": "http://cvpr.thecvf.com/api/miniconf/users/185051?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 185052, "fullname": "Xiaohan Fei", "url": "http://cvpr.thecvf.com/api/miniconf/users/185052?format=json", "institution": "Amazon"}, {"id": 73909, "fullname": "Yu-Xiong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73909?format=json", "institution": "School of Computer Science, Carnegie Mellon University"}, {"id": 91538, "fullname": "Liangyan Gui", "url": "http://cvpr.thecvf.com/api/miniconf/users/91538?format=json", "institution": "UIUC"}], "abstract": "Humans rarely plan whole-body interactions with objects at the level of explicit whole-body movements. High-level intentions, such as affordance, define the goal, while coordinated balance, contact, and manipulation can emerge naturally from underlying physical and motor priors. Scaling such priors is key to enabling humanoids to compose and generalize loco-manipulation skills across diverse contexts while maintaining physically coherent whole-body coordination. To this end, we introduce InterPrior, a scalable framework that learns a unified control policy, i.e., interaction motion prior through large-scale imitation pretraining and post-training by reinforcement learning. InterPrior first distills a full-reference imitation expert into a versatile, goal-conditioned variational policy that reconstructs motion from multi-modal and partially specified goal cues. A targeted diversity process, combining data augmentation and physical perturbations, broadens exposure to varied contact and object conditions, producing a motion prior that generalizes beyond the training data. To address the vast configuration space of large-scale human-object interaction, a reinforcement learning finetuning enhances unseen goal competence, enabling recovery from unsuccessful grasp. The resulting policy acts as a reusable motion prior that can absorb new behaviors, including interactions with unseen objects. We also show its effectiveness in user-interactive control and across different embodiments.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36439", "url": null, "sourceid": 33873, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36440, "uid": "7c9d8efa5f0fc84385730c20b6a569e3", "name": "Mapping Networks", "authors": [{"id": 177112, "fullname": "Lord Sen", "url": "http://cvpr.thecvf.com/api/miniconf/users/177112?format=json", "institution": "National Institute of Technology, Rourkela"}, {"id": 177110, "fullname": "Shyamapada Mukherjee", "url": "http://cvpr.thecvf.com/api/miniconf/users/177110?format=json", "institution": "National Institute of Technology Rourkela"}], "abstract": "The escalating parameter counts in modern deep learningmodels pose a fundamental challenge to efficient trainingand resolution of overfitting. We address this by introducingthe Mapping Networks which replace the high dimensionalweight space by a compact, trainable latent vector based onthe hypothesis that the trained parameters of large networksreside on smooth, low-dimensional manifolds. Henceforth,the Mapping Theorem enforced by a dedicated MappingLoss, shows the existence of a mapping from this latentspace to the target weight space both theoretically and inpractice. Mapping Networks significantly reduce overfittingand achieve comparable to better performance than target network across complex vision and sequence tasks, including Image Classification, Deepfake Detection etc., with99.5%, i.e., around 500\u00d7 reduction in trainable parameters.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36440", "url": null, "sourceid": 30604, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40260?format=json"], "related_events_ids": [40260]}, {"id": 40260, "uid": "7c9d8efa5f0fc84385730c20b6a569e3", "name": "Mapping Networks", "authors": [{"id": 177112, "fullname": "Lord Sen", "url": "http://cvpr.thecvf.com/api/miniconf/users/177112?format=json", "institution": "National Institute of Technology, Rourkela"}, {"id": 177110, "fullname": "Shyamapada Mukherjee", "url": "http://cvpr.thecvf.com/api/miniconf/users/177110?format=json", "institution": "National Institute of Technology Rourkela"}], "abstract": "The escalating parameter counts in modern deep learningmodels pose a fundamental challenge to efficient trainingand resolution of overfitting. We address this by introducingthe Mapping Networks which replace the high dimensionalweight space by a compact, trainable latent vector based onthe hypothesis that the trained parameters of large networksreside on smooth, low-dimensional manifolds. Henceforth,the Mapping Theorem enforced by a dedicated MappingLoss, shows the existence of a mapping from this latentspace to the target weight space both theoretically and inpractice. Mapping Networks significantly reduce overfittingand achieve comparable to better performance than target network across complex vision and sequence tasks, including Image Classification, Deepfake Detection etc., with99.5%, i.e., around 500\u00d7 reduction in trainable parameters.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40260", "url": null, "sourceid": -30604, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/36440?format=json"], "related_events_ids": [36440]}, {"id": 36442, "uid": "3f20872cfee4278dba7b3c0d18d370f8", "name": "Narrative Weaver:  Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning", "authors": [{"id": 180876, "fullname": "Zhengjian Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180876?format=json", "institution": "Peking University"}, {"id": 180766, "fullname": "Yongzhi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180766?format=json", "institution": "Kuaishou"}, {"id": 76387, "fullname": "Xinyuan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/76387?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 185056, "fullname": "Quan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185056?format=json", "institution": "Kuaishou"}, {"id": 185057, "fullname": "Peng Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185057?format=json", "institution": "Kuaishou Technology"}, {"id": 128636, "fullname": "Yanye Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128636?format=json", "institution": "Peking University"}], "abstract": "We present Narrative Weaver, a novel framework that addresses a fundamental challenge in generative AI: achieving controllable, long-range, and consistent visual content generation. While existing models excel at generating high-fidelity short-form visual content, they struggle to maintain narrative coherence and visual consistency across extended sequences\u2014a critical limitation for real-world applications such as filmmaking and e-commerce advertising.Narrative Weaver introduces the first holistic solution that seamlessly integrates three essential capabilities: fine-grained control, automatic narrative planning, and long-range coherence. Our architecture combines a Multimodal Large Language Model (MLLM) for high-level narrative planning with a novel fine-grained control module featuring a dynamic Memory Bank that prevents visual drift.To enable practical deployment, we develop a progressive, multi-stage training strategy that efficiently leverages existing pre-trained models, achieving state-of-the-art performance even with limited training data. Recognizing the absence of suitable evaluation benchmarks, we construct and release the E-commerce Advertising Video Storyboard Dataset (EAVSD)\u2014the first comprehensive dataset for this task, containing over 330K high-quality images with rich narrative annotations.Through extensive experiments across three distinct scenarios (controllable multi-scene generation, autonomous storytelling, and e-commerce advertising), we demonstrate our method\u2019s superiority while opening new possibilities for AI-driven content creation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36442", "url": null, "sourceid": 36047, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36444, "uid": "f3945e1de291464abf90ae242e534786", "name": "Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization", "authors": [{"id": 181913, "fullname": "Hanchao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181913?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 185061, "fullname": "Fang-Lue Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185061?format=json", "institution": "Victoria University of Wellington"}, {"id": 185062, "fullname": "Shining Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185062?format=json", "institution": "Tsinghua University"}, {"id": 89605, "fullname": "Tai-Jiang Mu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89605?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 89603, "fullname": "Shi-Min Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89603?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Generating human motion that satisfies customized zero-shot goal functions, enabling applications such as controllable character animation and behavior synthesis for virtual agents, is a critical capability. While current approaches handle many unseen constraints, they fail on tasks with very challenging spatiotemporal restrictions, such as severe spatial obstacles or specified numbers of walking steps. To equip motion generators for these highly constrained tasks, we present a retrieval-guided method built on the training-free diffusion noise optimization framework. The key idea is to search within large motion datasets for guidance that can potentially satisfy difficult constraints. We introduce relational task parsing to group target constraints and identify the difficult ones to be handled by retrieved reference. A better initialization for diffusion noise is then obtained via a reward-guided mask that combines random noise with retrieved noise. By optimizing diffusion noise from this improved initialization, we successfully solve highly constrained generation tasks. By leveraging LLM for relational task parsing, the whole framework is further enabled to automatically reason for what to retrieve, improving the intelligence of moving agents under a training-free optimization scheme. Code will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36444", "url": null, "sourceid": 46259, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36447, "uid": "5001c2f5718ea51b03f9bac94edbe5b8", "name": "DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers", "authors": [{"id": 131306, "fullname": "Mengping Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131306?format=json", "institution": "East China University of Science and Technology"}, {"id": 127467, "fullname": "Stewart Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/127467?format=json", "institution": "Alibaba DAMO Academy"}, {"id": 185075, "fullname": "Binglei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185075?format=json", "institution": "Shanghai Innovation Institute; Fudan University"}, {"id": 185076, "fullname": "Xiaomeng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185076?format=json", "institution": "Shanghai Academy of Artificial Intelligence for Science"}, {"id": 185077, "fullname": "Hesen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185077?format=json", "institution": "Fudan University; Shanghai Academy of Artificial Intelligence for Science"}, {"id": 99574, "fullname": "Hao li", "url": "http://cvpr.thecvf.com/api/miniconf/users/99574?format=json", "institution": "Fudan University"}], "abstract": "Recent breakthroughs in Diffusion Transformers (DiTs) have revolutionized the field of visual synthesis due to their superior scalability. To facilitate DiTs' capability of capturing meaningful internal representations, recent works such as REPA incorporate external pretrained encoders for representation alignment. However, the underlying mechanisms governing representation learning within DiTs are not well understood. To this end, we first systematically investigate the representation dynamics of DiTs. Through analyzing the evolution and influence of internal representations under various settings, we reveal that representation diversity across blocks is a crucial factor for effective learning. Based on this key insight, we propose DiverseDiT, a novel framework that explicitly promotes representation diversity.  DiverseDiT incorporates long residual connections to diversify input representations across blocks and a representation diversity loss to encourage blocks to learn distinct features. Extensive experiments on ImageNet 256 $\\times$ 256 and 512 $\\times$ 512 demonstrate that our DiverseDiT yields consistent performance gains and convergence acceleration when applied to different backbones with various sizes, even when tested on the challenging one-step generation setting. Furthermore, we show that DiverseDiT is complementary to existing representation learning techniques, leading to further performance gains. Our work provides valuable insights into the representation learning dynamics of DiTs and offers a practical approach for enhancing their performance. Our code and models will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36447", "url": null, "sourceid": 45067, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36450, "uid": "e75d0b169ffeb90d4b805790ce68a239", "name": "$L^{2}DGS$: Low-Light Dynamic Gaussian Splatting", "authors": [{"id": 157067, "fullname": "Ashish Kumar", "url": "http://cvpr.thecvf.com/api/miniconf/users/157067?format=json", "institution": "Indian Institute of Technology, Madras"}, {"id": 85247, "fullname": "A. N. Rajagopalan", "url": "http://cvpr.thecvf.com/api/miniconf/users/85247?format=json", "institution": "Indian Institute of Technology Madras"}], "abstract": "Synthesizing novel spatiotemporal views of dynamic scenes is inherently challenging due to both object and camera motion, as well as sparsity of observations. Recent advances in Neural Radiance Fields (NeRFs) and Gaussian Splatting (GS) have enabled 4D dynamic scene reconstruction, but predominantly from well-lit images or videos. Some works address the problem of reconstructing a well-lit scene from low-light input, but these are limited to static scenes. Moreover, prior methods primarily emphasize improving illumination, while overlooking the underlying scene characteristics. Reconstructing well-lit dynamic scenes from inputs captured under low-light conditions is particularly challenging due to shadows, occlusions, and disocclusions caused by object motion, which makes the problem highly ambiguous and ill-posed. We propose $L^{2}DGS$ (Low-Light Dynamic Gaussian Splatting), a self-supervised 4D GS framework for directly reconstructing well-lit dynamic scenes from low-light inputs. The proposed method decomposes each scene into two complementary components: illumination, which varies across both view and time, and reflectance, which remains invariant to these factors. To achieve this, we introduce several key innovations. First, the proposed Occlusion-Disocclusion Network (OCD-Net) models time-varying intensity across frames. Next, we propose Brightness Attenuation Features (BAFs), when complemented by the BAF Enhancement Network (BAFE-Net), enable geometry- and photometry-aware transformation between well-lit and low-light scenes for self-supervision. Together, these components allow $L^{2}DGS$ to maximize signal strength and suppress noise inherent in low-light inputs, leading to enhanced spatial fidelity and temporal consistency under challenging illumination conditions. Our method operates on standard sRGB inputs without requiring camera metadata (e.g., exposure settings), ensuring compatibility with consumer-grade imaging devices. We evaluate $L^{2}DGS$ on both simulated and real-world Low-Light Dynamic Video ($L^{2}DyV$) datasets, demonstrating superior qualitative and quantitative performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36450", "url": null, "sourceid": 40986, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40261, "uid": "33c7c478e181b849b1a65eef4ba8d414", "name": "GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport", "authors": [{"id": 107602, "fullname": "Youngju Na", "url": "http://cvpr.thecvf.com/api/miniconf/users/107602?format=json", "institution": "KAIST"}, {"id": 135036, "fullname": "Jaeseong Yun", "url": "http://cvpr.thecvf.com/api/miniconf/users/135036?format=json", "institution": "NAVER LABS"}, {"id": 185089, "fullname": "Soohyun Ryu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185089?format=json", "institution": "Naver Labs"}, {"id": 185090, "fullname": "Hyunsu Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/185090?format=json", "institution": "Naver Labs"}, {"id": 89458, "fullname": "Sung-Eui Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/89458?format=json", "institution": "KAIST"}, {"id": 155592, "fullname": "Suyong Yeon", "url": "http://cvpr.thecvf.com/api/miniconf/users/155592?format=json", "institution": "NAVER LABS"}], "abstract": "While 3D Gaussian splatting has emerged as a powerful paradigm, it fundamentally fails to model transparency such as glass panels, which are prevalent in everyday environments. The core challenge lies in decoupling the intertwined radiance contributions from transparent interfaces and the transmitted geometry observed through the glass. We present GLINT, a framework that models scene-scale transparency through explicit decomposed Gaussian representation. GLINT reconstructs the primary interface and separates outgoing radiance into reflection and transmission components according to its optical properties, enabling coherent Gaussian radiance transport. During the optimization, GLINT bootstraps transparency localization by utilizing geometry separation cues that emerge from our decomposition with the geometry and material priors from a pre-trained video relighting model. Extensive experiments demonstrate that GLINT achieves state-of-the-art performance in 3D reconstruction of complex transparent scenes.Our code will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40261", "url": null, "sourceid": -45476, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/36452?format=json"], "related_events_ids": [36452]}, {"id": 40264, "uid": "de217bb1ab483987c651cf5a5e868018", "name": "QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition", "authors": [{"id": 184166, "fullname": "Daniel Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184166?format=json", "institution": "University of Minnesota Twin Cities"}, {"id": 127889, "fullname": "Gilad Lerman", "url": "http://cvpr.thecvf.com/api/miniconf/users/127889?format=json", "institution": "University of Minnesota, Minneapolis"}, {"id": 185121, "fullname": "Joe Kileel", "url": "http://cvpr.thecvf.com/api/miniconf/users/185121?format=json", "institution": "University of Texas, Austin"}], "abstract": "In structure from motion, quadrifocal tensors capture more information than their pairwise counterparts (essential matrices), yet they have often been thought of as impractical and only of theoretical interest. In this work, we challenge such beliefs by providing a new framework to recover $n$ cameras from the corresponding collection of quadrifocal tensors.  We form the block quadrifocal tensor and show that it admits a Tucker decomposition whose factor matrices are the stacked camera matrices, and which thus has a multilinear rank of (4,4,4,4) independent of $n$. We develop the first synchronization algorithm for quadrifocal tensors, using Tucker decomposition, alternating direction method of multipliers, and iteratively reweighted least squares.  We further establish relationships between the block quadrifocal, trifocal, and bifocal tensors, and introduce an algorithm that jointly synchronizes these three entities. Numerical experiments demonstrate the effectiveness of our methods on modern datasets, indicating the potential and importance of using higher-order information in synchronization.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40264", "url": null, "sourceid": -39559, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/36469?format=json"], "related_events_ids": [36469]}, {"id": 36454, "uid": "8e5c5ec9bcdcb17075d4ba66da4008f5", "name": "Exemplar-Free Continual Learning for State Space Models", "authors": [{"id": 174195, "fullname": "ISAAC NING LEE", "url": "http://cvpr.thecvf.com/api/miniconf/users/174195?format=json", "institution": "Monash University"}, {"id": 185095, "fullname": "Leila Mahmoodi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185095?format=json", "institution": null}, {"id": 128675, "fullname": "Trung Le", "url": "http://cvpr.thecvf.com/api/miniconf/users/128675?format=json", "institution": "Monash University"}, {"id": 87144, "fullname": "Mehrtash Harandi", "url": "http://cvpr.thecvf.com/api/miniconf/users/87144?format=json", "institution": "Monash University"}], "abstract": "State-Space Models (SSMs) excel at capturing long-range dependencies with structured recurrence, making them well-suited for sequence modeling. However, their evolving internal states pose unique challenges in Continual Learning (CL). Without access to the full distribution of previous tasks, updates to the state-space dynamics become unconstrained, leading to catastrophic forgetting. To address this, we propose $\\textbf{Inf-SSM}$, a geometry-aware regularization framework for CL in SSMs. It constrains state evolution via the infinite-dimensional Grassmannian of SSM observability subspaces, without requiring any exemplars from past tasks. Unlike classical CL methods that restrict weight updates, Inf-SSM directly regularizes the infinite-horizon state evolution encoded by the extended observability subspace of the SSM. We show that enforcing this regularization requires solving a matrix equation known as the Sylvester equation, which typically incurs $\\mathcal{O}(n^3)$ complexity. Thus, we develop a $\\mathcal{O}(n^2)$ solution by exploiting the structure and properties of SSMs. This leads to an efficient regularization mechanism that can be seamlessly integrated into existing CL methods. Comprehensive experiments on challenging benchmarks of ImageNet-R, CIFAR-100, and Caltech-256 demonstrate a significant reduction in forgetting while improving accuracy across sequential tasks.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36454", "url": null, "sourceid": 34232, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36455, "uid": "e4df76144e798cb0077f3ce85d0a4e3e", "name": "Grid Distillation: Compositional Image Distillation via Structured Generative Grids", "authors": [{"id": 153192, "fullname": "Biplab Das", "url": "http://cvpr.thecvf.com/api/miniconf/users/153192?format=json", "institution": "International Institute of Information Technology, Bangalore"}, {"id": 185096, "fullname": "Shouvik Das", "url": "http://cvpr.thecvf.com/api/miniconf/users/185096?format=json", "institution": "Samsung"}, {"id": 151837, "fullname": "Viswanath Gopalakrishnan", "url": "http://cvpr.thecvf.com/api/miniconf/users/151837?format=json", "institution": "IIIT Bangalore"}], "abstract": "We present \\textbf{Grid Distillation}, a generative dataset distillation framework that compresses large-scale datasets into a compact set of informative synthetic samples. Our method constructs high-resolution compositional grids via \\textbf{spectral submodular optimization}, which injects \\textit{world knowledge} from CLIP representations to maximize semantic coverage and diversity. These grids are then downsampled into low-resolution distilled images optimized for diversity and representational efficiency. During training, a single-step diffusion reconstruction (based on Stable Diffusion Turbo) restores fine-grained spatial details from diffusion priors, bridging the gap between compact representations and natural image statistics. A grid-aware cropping strategy further enhances discriminability by probabilistically aligning crops with grid boundaries, maintaining compatibility with standard $224{\\times}224$ inference inputs. Experiments on ImageWoof, ImageNette, ImageIDC, and ImageNet-1K demonstrate consistent improvements over existing dataset distillation methods across multiple IPC settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36455", "url": null, "sourceid": 36404, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36456, "uid": "46b58cc60d2381f5329ccaef74edea93", "name": "Towards Training-free Scene Text Editing", "authors": [{"id": 154559, "fullname": "Yubo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/154559?format=json", "institution": "INSTITUTE OF INFORMATION ENGINEERING"}, {"id": 154555, "fullname": "Xugong Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/154555?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 154556, "fullname": "peng zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154556?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 157531, "fullname": "Hailun Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/157531?format=json", "institution": "Institute of Information Engineering, CAS"}, {"id": 154558, "fullname": "Gangyan Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/154558?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 185097, "fullname": "Kexin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185097?format=json", "institution": "Nanjing University of Science and Technology"}], "abstract": "Scene text editing seeks to modify textual content in natural images while maintaining visual realism and semantic consistency. Existing methods often require task-specific training or paired data, limiting their scalability and adaptability. In this paper, we propose TextFlow, a training-free scene text editing framework that integrates the strengths of Attention Boost (AttnBoost) and Flow Manifold Steering (FMS) to enable flexible, high-fidelity text manipulation without additional training. Specifically, FMS preserves the structural and style consistency by modeling the visual flow of characters and background regions, while AttnBoost enhances the rendering of textual content through attention-based guidance. By jointly leveraging these complementary modules, our approach performs end-to-end text editing through semantic alignment and spatial refinement in a plug-and-play manner. Extensive experiments demonstrate that our framework achieves visual quality and text accuracy comparable to or superior to those of training-based counterparts, generalizing well across diverse scenes and languages. This study advances scene text editing toward a more efficient, generalizable, and training-free paradigm.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36456", "url": null, "sourceid": 44013, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36458, "uid": "a79f5158d38f5d570c22deb545a8fabd", "name": "SketchDeco: Training-Free Latent Composition for Precise Sketch Colourisation", "authors": [{"id": 181894, "fullname": "Chaitat Utintu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181894?format=json", "institution": "University of Surrey"}, {"id": 76978, "fullname": "Yi-Zhe Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/76978?format=json", "institution": "University of Surrey"}], "abstract": "We introduce SketchDeco, a training-free approach to sketch colourisation that bridges the gap between professional design needs and intuitive, region-based control. Our method empowers artists to use simple masks and colour palettes for precise spatial and chromatic specification, avoiding both the tediousness of manual assignment and the ambiguity of text-based prompts. We reformulate this task as a novel, training-free composition problem. Our core technical contribution is a guided latent-space blending process: we first leverage diffusion inversion to precisely ``paint'' user-defined colours into specified regions, and then use a custom self-attention mechanism to harmoniously blend these local edits with a globally consistent base image. This ensures both local colour fidelity and global harmony without requiring any model fine-tuning. Our system produces high-quality results in 15--20 inference steps on consumer GPUs, making professional-quality, controllable colourisation accessible.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36458", "url": "https://chaitron.github.io/SketchDeco/", "sourceid": 43259, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36463, "uid": "22e003fcd294b552ff9da4aa1b9f85de", "name": "MusicInfuser: Making Video Diffusion Listen and Dance", "authors": [{"id": 152982, "fullname": "Susung Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/152982?format=json", "institution": "University of Washington"}, {"id": 89867, "fullname": "Ira Kemelmacher-Shlizerman", "url": "http://cvpr.thecvf.com/api/miniconf/users/89867?format=json", "institution": "UW + Google"}, {"id": 85699, "fullname": "Brian Curless", "url": "http://cvpr.thecvf.com/api/miniconf/users/85699?format=json", "institution": "University of Washington"}, {"id": 126276, "fullname": "Steve Seitz", "url": "http://cvpr.thecvf.com/api/miniconf/users/126276?format=json", "institution": "University of Washington"}], "abstract": "We introduce MusicInfuser, an approach that aligns pre-trained text-to-video diffusion models to generate high-quality dance videos synchronized with specified music tracks. Rather than training a multimodal audio-video or audio-motion model from scratch, our method demonstrates how existing video diffusion models can be efficiently adapted to align with musical inputs. We propose a novel layer-wise adaptability criterion based on a guidance-inspired constructive influence function to select adaptable layers, significantly reducing training costs while preserving rich prior knowledge, even with limited, specialized datasets. Experiments show that MusicInfuser effectively bridges the gap between music and video, generating novel and diverse dance movements that respond dynamically to music. Furthermore, our framework generalizes well to unseen music tracks, longer video sequences, and unconventional subjects, outperforming baseline models in consistency and synchronization. All of this is achieved without requiring motion data, with training completed on a single GPU within a day.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36463", "url": null, "sourceid": 44254, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36465, "uid": "ce943be8824807573e3d363d01058edb", "name": "INSID3: Training-Free In-Context Segmentation with DINOv3", "authors": [{"id": 129915, "fullname": "Claudia Cuttano", "url": "http://cvpr.thecvf.com/api/miniconf/users/129915?format=json", "institution": "Polytechnic Institute of Turin"}, {"id": 95391, "fullname": "Gabriele Trivigno", "url": "http://cvpr.thecvf.com/api/miniconf/users/95391?format=json", "institution": "Politecnico di Torino"}, {"id": 136627, "fullname": "Christoph Reich", "url": "http://cvpr.thecvf.com/api/miniconf/users/136627?format=json", "institution": "Technische Universit\u00e4t Darmstadt"}, {"id": 84985, "fullname": "Daniel Cremers", "url": "http://cvpr.thecvf.com/api/miniconf/users/84985?format=json", "institution": "Technical University Munich"}, {"id": 128244, "fullname": "Carlo Masone", "url": "http://cvpr.thecvf.com/api/miniconf/users/128244?format=json", "institution": "Politecnico di Torino"}, {"id": 73884, "fullname": "Stefan Roth", "url": "http://cvpr.thecvf.com/api/miniconf/users/73884?format=json", "institution": "Technische Universit\u00e4t Darmstadt"}], "abstract": "In-context segmentation (ICS) aims to segment arbitrary concepts, objects, parts, or personalized instances given a few annotated visual examples. Existing work relies on (i) fine-tuning vision foundation models (VFMs), which improves in-domain results but limits generalization, or (ii) combines multiple frozen VFMs, which preserves generalization but yields architectural complexity and fixed segmentation granularities. We revisit ICS from a minimalist perspective and ask: Can a single self-supervised backbone support both semantic matching and segmentation, without any supervision or auxiliary models? We show that scaled-up dense self-supervised features from DINOv3 exhibit strong spatial structure and semantic correspondence. We introduce INSID3, a training-free approach that segments concept at varying granularities only from frozen DINOv3 features, given an in-context example. INSID3 achieves state-of-the-art results across one-shot semantic, part, and personalized segmentation, outperforming previous work by +6.1 % mIoU, while using 3x fewer parameters and without any mask or category-level supervision.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36465", "url": null, "sourceid": 41053, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40263?format=json"], "related_events_ids": [40263]}, {"id": 40263, "uid": "ce943be8824807573e3d363d01058edb", "name": "INSID3: Training-Free In-Context Segmentation with DINOv3", "authors": [{"id": 129915, "fullname": "Claudia Cuttano", "url": "http://cvpr.thecvf.com/api/miniconf/users/129915?format=json", "institution": "Polytechnic Institute of Turin"}, {"id": 95391, "fullname": "Gabriele Trivigno", "url": "http://cvpr.thecvf.com/api/miniconf/users/95391?format=json", "institution": "Politecnico di Torino"}, {"id": 136627, "fullname": "Christoph Reich", "url": "http://cvpr.thecvf.com/api/miniconf/users/136627?format=json", "institution": "Technische Universit\u00e4t Darmstadt"}, {"id": 84985, "fullname": "Daniel Cremers", "url": "http://cvpr.thecvf.com/api/miniconf/users/84985?format=json", "institution": "Technical University Munich"}, {"id": 128244, "fullname": "Carlo Masone", "url": "http://cvpr.thecvf.com/api/miniconf/users/128244?format=json", "institution": "Politecnico di Torino"}, {"id": 73884, "fullname": "Stefan Roth", "url": "http://cvpr.thecvf.com/api/miniconf/users/73884?format=json", "institution": "Technische Universit\u00e4t Darmstadt"}], "abstract": "In-context segmentation (ICS) aims to segment arbitrary concepts, objects, parts, or personalized instances given a few annotated visual examples. Existing work relies on (i) fine-tuning vision foundation models (VFMs), which improves in-domain results but limits generalization, or (ii) combines multiple frozen VFMs, which preserves generalization but yields architectural complexity and fixed segmentation granularities. We revisit ICS from a minimalist perspective and ask: Can a single self-supervised backbone support both semantic matching and segmentation, without any supervision or auxiliary models? We show that scaled-up dense self-supervised features from DINOv3 exhibit strong spatial structure and semantic correspondence. We introduce INSID3, a training-free approach that segments concept at varying granularities only from frozen DINOv3 features, given an in-context example. INSID3 achieves state-of-the-art results across one-shot semantic, part, and personalized segmentation, outperforming previous work by +6.1 % mIoU, while using 3x fewer parameters and without any mask or category-level supervision.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40263", "url": null, "sourceid": -41053, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/36465?format=json"], "related_events_ids": [36465]}, {"id": 36466, "uid": "5c126a6cc68cc505013ca40816d40532", "name": "SparseSplat: Towards Applicable Feed-Forward 3D Gaussian Splatting with Pixel-Unaligned Prediction", "authors": [{"id": 182504, "fullname": "Zicheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182504?format=json", "institution": "Fudan University"}, {"id": 185115, "fullname": "Xiangting Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185115?format=json", "institution": "ShanghaiTech University"}, {"id": 185116, "fullname": "Ke Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185116?format=json", "institution": "Fudan University"}, {"id": 185117, "fullname": "Wenchao Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/185117?format=json", "institution": "Fudan University; Tars Robotics Ltd."}], "abstract": "Recent progress in feed-forward 3D Gaussian Splatting (3DGS) has notably improved rendering quality. However, the spatially uniform and highly redundant 3DGS map generated by previous feed-forward 3DGS methods limits their integration into downstream reconstruction tasks. We propose SparseSplat, the first feed-forward 3DGS model that adaptively adjusts Gaussian density according to scene structure and information richness of local regions, yielding highly compact 3DGS maps. To achieve this, we propose entropy-based probabilistic sampling, generating large, sparse Gaussians in textureless areas and assigning small, dense Gaussians to regions with rich information. Additionally, we designed a specialized point cloud network that efficiently encodes local context and decodes it into 3DGS attributes, addressing the receptive field mismatch between the general 3DGS optimization pipeline and feed-forward models. Extensive experimental results demonstrate that SparseSplat can achieve state-of-the-art rendering quality with only 22\\% of the Gaussians and maintain reasonable rendering quality with only 1.5\\% of the Gaussians.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36466", "url": null, "sourceid": 33485, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36467, "uid": "6d09e77eaed6a2162027587f5026efe7", "name": "Improving Controllable Generation: Faster Training and Better Performance via x0-Supervision", "authors": [{"id": 163690, "fullname": "Amadou S. SANGARE", "url": "http://cvpr.thecvf.com/api/miniconf/users/163690?format=json", "institution": "CEA-List, Paris-Saclay University"}, {"id": 166475, "fullname": "Adrien Maglo", "url": "http://cvpr.thecvf.com/api/miniconf/users/166475?format=json", "institution": "CEA-List"}, {"id": 166477, "fullname": "Mohamed Chaouch", "url": "http://cvpr.thecvf.com/api/miniconf/users/166477?format=json", "institution": "CEA-List"}, {"id": 185118, "fullname": "Bertrand Luvison", "url": "http://cvpr.thecvf.com/api/miniconf/users/185118?format=json", "institution": "CEA"}], "abstract": "Text-to-Image (T2I) diffusion/flow models have recently achieved remarkable progress in visual fidelity and text alignment. However, they remain limited when users need to precisely control image layouts, something that natural language alone cannot reliably express. Controllable generation methods augment the initial T2I model with additional conditions that more easily describe the scene. Prior works straightforwardly train the augmented network with the same loss as the initial network. Although natural at first glance, this can lead to very long training times in some cases before convergence. In this work, we revisit the training objective of controllable diffusion models through a detailed analysis of their denoising dynamics. We show that direct supervision on the clean target image, dubbed $x_0$-supervision, or an equivalent re-weighting of the diffusion loss, yields faster convergence. Experiments on multiple control settings demonstrate that our formulation accelerates convergence by up to 2x according to our novel metric (mean Area Under the Convergence Curve - mAUCC), while also improving both visual quality and conditioning accuracy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36467", "url": null, "sourceid": 40507, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36469, "uid": "de217bb1ab483987c651cf5a5e868018", "name": "QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition", "authors": [{"id": 184166, "fullname": "Daniel Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184166?format=json", "institution": "University of Minnesota Twin Cities"}, {"id": 127889, "fullname": "Gilad Lerman", "url": "http://cvpr.thecvf.com/api/miniconf/users/127889?format=json", "institution": "University of Minnesota, Minneapolis"}, {"id": 185121, "fullname": "Joe Kileel", "url": "http://cvpr.thecvf.com/api/miniconf/users/185121?format=json", "institution": "University of Texas, Austin"}], "abstract": "In structure from motion, quadrifocal tensors capture more information than their pairwise counterparts (essential matrices), yet they have often been thought of as impractical and only of theoretical interest. In this work, we challenge such beliefs by providing a new framework to recover $n$ cameras from the corresponding collection of quadrifocal tensors.  We form the block quadrifocal tensor and show that it admits a Tucker decomposition whose factor matrices are the stacked camera matrices, and which thus has a multilinear rank of (4,4,4,4) independent of $n$. We develop the first synchronization algorithm for quadrifocal tensors, using Tucker decomposition, alternating direction method of multipliers, and iteratively reweighted least squares.  We further establish relationships between the block quadrifocal, trifocal, and bifocal tensors, and introduce an algorithm that jointly synchronizes these three entities. Numerical experiments demonstrate the effectiveness of our methods on modern datasets, indicating the potential and importance of using higher-order information in synchronization.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36469", "url": null, "sourceid": 39559, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40264?format=json"], "related_events_ids": [40264]}, {"id": 36470, "uid": "9b4db5a70a498cb5022c8702b13e7956", "name": "ExpPortrait: Expressive Portrait Generation via Personalized Representation", "authors": [{"id": 183689, "fullname": "Junyi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183689?format=json", "institution": "University of Science and Technology of China"}, {"id": 133700, "fullname": "Yudong Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/133700?format=json", "institution": "University of Science and Technology of China"}, {"id": 185122, "fullname": "Boyang Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185122?format=json", "institution": "University of Science and Technology of China"}, {"id": 185123, "fullname": "Shengming Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185123?format=json", "institution": "University of Science and Technology of China"}, {"id": 127405, "fullname": "Juyong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127405?format=json", "institution": "University of Science and Technology of China"}], "abstract": "While diffusion models have shown great potential in portrait generation, generating expressive, coherent, and controllable cinematic portrait videos remains a significant challenge. Existing intermediate signals for portrait generation, such as 2D landmarks and parametric models, have limited disentanglement capabilities and cannot express personalized details due to their sparse or low-rank representation. Therefore, existing methods based on these models struggle to accurately preserve subject identity and expressions, hindering the generation of highly expressive portrait videos. To overcome these limitations, we propose a high-fidelity personalized head representation that more effectively disentangles expression and identity. This representation captures both static, subject-specific global geometry and dynamic, expression-related details. Furthermore, we introduce an expression transfer module to achieve personalized transfer of head pose and expression details between different identities. We use this sophisticated and highly expressive head model as a conditional signal to train a diffusion transformer (DiT)-based generator to synthesize richly detailed portrait videos. Extensive experiments on self- and cross-reenactment tasks demonstrate that our method outperforms previous models in terms of identity preservation, expression accuracy, and temporal stability, particularly in capturing fine-grained details of complex motion.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36470", "url": null, "sourceid": 37908, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36472, "uid": "bde4a681eceb6f2c6d01c533b80a7a6e", "name": "Delta Rectified Flow Sampling for Text-to-Image Editing", "authors": [{"id": 182057, "fullname": "Gaspard Beaudouin", "url": "http://cvpr.thecvf.com/api/miniconf/users/182057?format=json", "institution": "Harvard AI and Robotics Lab"}, {"id": 77015, "fullname": "Minghan LI", "url": "http://cvpr.thecvf.com/api/miniconf/users/77015?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 185126, "fullname": "Jaeyeon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/185126?format=json", "institution": "Harvard University"}, {"id": 185127, "fullname": "Sung-Hoon Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/185127?format=json", "institution": "Daegu Gyeongbuk Institute of Science and Technology"}, {"id": 131867, "fullname": "Mengyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131867?format=json", "institution": "Harvard University"}], "abstract": "We propose Delta Rectified Flow Sampling (DRFS), a novel inversion-free, path-aware editing framework within rectified flow models for text-to-image editing. DRFS is a distillation-based method that explicitly models the discrepancy between the source and target velocity fields in order to mitigate over-smoothing artifacts rampant in prior distillation sampling approaches. We further introduce a time-dependent shift term to push noisy latents closer to the target trajectory, enhancing the alignment with the target distribution. We theoretically demonstrate that when this shift is disabled, DRFS reduces to Delta Denoising Score, thereby bridging score-based diffusion optimization and velocity-based rectified-flow optimization. Moreover, when the shift term follows a linear schedule under rectified-flow dynamics, DRFS generalizes the Inversion-free method FlowEdit and provides a principled theoretical interpretation for it. We conduct an analysis to guide the design of our shift term, and experimental results on the widely used PIE Benchmark indicate that DRFS achieves superior editing quality, fidelity, and controllability while requiring no architectural modifications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36472", "url": null, "sourceid": 32404, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36480, "uid": "d0b9ad3d3ca9c79694e2ce99aee06382", "name": "SketchFaceGS: Real-Time Sketch-Driven Face Editing and Generation with Gaussian Splatting", "authors": [{"id": 73178, "fullname": "Bo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/73178?format=json", "institution": "Nanchang Hangkong University"}, {"id": 175016, "fullname": "Jiahao Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/175016?format=json", "institution": "Nanchang Hangkong University"}, {"id": 185151, "fullname": "Yubo Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/185151?format=json", "institution": "Nanchang Hangkong University"}, {"id": 128158, "fullname": "Feng-Lin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128158?format=json", "institution": "Chinese Academy of Sciences"}, {"id": 185152, "fullname": "Bin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185152?format=json", "institution": null}, {"id": 185061, "fullname": "Fang-Lue Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185061?format=json", "institution": "Victoria University of Wellington"}, {"id": 76495, "fullname": "Lin Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/76495?format=json", "institution": "University of Chinese Academy of Sciences"}], "abstract": "3D Gaussian representations have emerged as a powerful paradigm for digital head modeling, achieving photorealistic quality with real-time rendering. However, intuitive and interactive creation or editing of 3D Gaussian head models remains challenging. Although 2D sketches provide an ideal interaction modality for fast, intuitive conceptual design, they are sparse, depth-ambiguous, and lack high-frequency appearance cues, making it difficult to infer dense, geometrically consistent 3D Gaussian structures from strokes\u2014especially under real-time constraints. To address these challenges, we propose SketchFaceGS, the first sketch-driven framework for real-time generation and editing of photorealistic 3D Gaussian head models from 2D sketches. Our method uses a feed-forward, coarse-to-fine architecture. A Transformer-based UV feature-prediction module first reconstructs a coarse but geometrically consistent UV feature map from the input sketch, and a 3D UV feature enhancement module refines it with high-frequency, photorealistic detail to produce a high-fidelity 3D head. For editing, we introduce a UV Mask Fusion technique combined with a layer-by-layer feature-fusion strategy, enabling precise, real-time, free-viewpoint modifications. Extensive experiments show that SketchFaceGS outperforms existing methods in both generation fidelity and editing flexibility, producing high-quality, editable 3D heads from sketches in a single forward pass.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36480", "url": null, "sourceid": 42281, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36477, "uid": "878c04f03e333f1bc8707c8f0ad6a3e9", "name": "Sketch2CT: Multimodal Diffusion for Structure-Aware 3D Medical Volume Generation", "authors": [{"id": 180696, "fullname": "Delin An", "url": "http://cvpr.thecvf.com/api/miniconf/users/180696?format=json", "institution": "University of Notre Dame"}, {"id": 185145, "fullname": "Chaoli Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185145?format=json", "institution": "University of Notre Dame"}], "abstract": "Diffusion probabilistic models have demonstrated significant potential in generating high-quality, realistic medical images, providing a promising solution to the persistent challenge of data scarcity in the medical field. Nevertheless, producing 3D medical volumes with anatomically consistent structures under multimodal conditions remains a complex and unresolved problem. We introduce Sketch2CT, a multimodal diffusion framework for structure-aware 3D medical volume generation, jointly guided by a user-provided 2D sketch and a textual description that captures 3D geometric semantics. The framework initially generates 3D segmentation masks of the target organ from random noise, conditioned on both modalities. To effectively align and fuse these inputs, we propose two key modules that refine sketch features with localized textual cues and integrate global sketch-text representations. Built upon a capsule-attention backbone, these modules leverage the complementary strengths of sketches and text to produce anatomically accurate organ shapes. The synthesized segmentation masks subsequently guide a latent diffusion model for 3D CT volume synthesis, enabling realistic reconstruction of organ appearances that are consistent with user-defined sketches and descriptions. Extensive experiments on public CT datasets demonstrate that Sketch2CT achieves superior performance in generating multimodal medical volumes. Its controllable, low-cost generation pipeline enables principled, efficient augmentation of medical datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36477", "url": null, "sourceid": 44116, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36481, "uid": "159f02cdca45838bdfe5ecdeadb520be", "name": "AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization", "authors": [{"id": 163537, "fullname": "Jiawei Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/163537?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 185153, "fullname": "Wanrong Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185153?format=json", "institution": "Adobe Research"}, {"id": 185154, "fullname": "Vlad I Morariu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185154?format=json", "institution": "Adobe"}, {"id": 179929, "fullname": "Christopher Tensmeyer", "url": "http://cvpr.thecvf.com/api/miniconf/users/179929?format=json", "institution": "Adobe"}], "abstract": "Document generation has emerged as a crucial task for automating the creation of visually appealing and well-structured content across diverse domains. Existing methods in this field, however, suffer from some limitations in terms of application scope, document representation and dataset coverage, which greatly restricts the capabilities of document generation models. To address these challenges, we propose OmniDoc, a framework that introduces HTML/CSS as a novel document representation given its inherent advantages in hierarchical structure modeling. Leveraging HTML/CSS, OmniDoc establishes a scalable data synthesis pipeline to curate DocHTML, a large-scale document dataset containing 265,206 high-quality samples. Each document in DocHTML includes complete metadata annotations, structured HTML/CSS source code, synthesized visual assets, and rendered screenshots, spanning diverse categories, styles, and complexity levels to ensure comprehensive coverage. OmniDoc then utilizes DocHTML to fine-tune the multimodal large language models, empowering them remarkable document generation capabilities on three practical tasks: intention-to-document, document derendering, and element-to-document. To address the content overflow issues found in the fine-tuned models, we incorporate a height-aware post-training method within OmniDoc based on Group Relative Policy Optimization. By carefully designing the reward function to measure the alignment between predicted and target document heights, OmniDoc effectively alleviates the overflow problem, further enhancing model performance. Qualitative and quantitative results demonstrate the superiority of OmniDoc over baseline models across all three tasks. Extensive ablation studies manifest the effectiveness of the HTML/CSS representation, curated dataset, and height-aware reinforcement optimization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36481", "url": null, "sourceid": 30822, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36482, "uid": "8e7ad9f076740e3652a62ae9e328c53f", "name": "MeteorPred: A Meteorological Multimodal Large Model and Dataset for Severe Weather Event Prediction", "authors": [{"id": 180981, "fullname": "Shuo Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180981?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 185155, "fullname": "Jian Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185155?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 173539, "fullname": "Jiadong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/173539?format=json", "institution": "Institute of Automation,Chinese Academy of Sciences"}, {"id": 185156, "fullname": "yi chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185156?format=json", "institution": "Institute of automation, Chinese academy of science"}, {"id": 185157, "fullname": "Qizhao Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185157?format=json", "institution": "China Meteorological Administration"}, {"id": 185158, "fullname": "Lingdong Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185158?format=json", "institution": "Peking University"}, {"id": 86336, "fullname": "Cheng-Lin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86336?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 89642, "fullname": "Shiming Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89642?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}], "abstract": "Timely and accurate forecasts of severe weather events are essential for early warning and for constraining downstream analysis and decision-making. Since severe weather events prediction still depends on subjective, time-consuming expert interpretation, end-to-end \u201cAI weather station\u201d systems are emerging but face three major challenges: (1) scarcity of severe weather event samples; (2) imperfect alignment between high-dimensional meteorological data and textual warnings; (3) current multimodal language models cannot effectively process high-dimensional meteorological inputs or capture their complex spatiotemporal dependencies. To address these challenges, we introduce MP-Bench, the first large-scale multimodal dataset for severe weather events prediction, comprising 421,363 pairs of raw multi-year meteorological data and corresponding  text caption,  covering a wide range of severe weather scenarios. On top of this dataset, we develop a Meteorology Multimodal Large Model (MMLM) that directly ingests 4D meteorological inputs. In addition, it is designed to accommodate the unique characteristics of 4D meteorological data flow, incorporating three plug-and-play adaptive fusion modules that enable dynamic feature extraction and integration across temporal sequences, vertical pressure layers, and spatial dimensions. Extensive experiments on MP-Bench show that MMLM achieves strong performance across multiple tasks, demonstrating effective severe weather understanding and representing a key step toward automated, AI-driven severe weather events forecasting systems. Our source code and dataset will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36482", "url": null, "sourceid": 40622, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36484, "uid": "0160134f4214329e47c0dadcb73f646f", "name": "Agile Deliberation: Concept Deliberation for Subjective Visual Classification", "authors": [{"id": 180896, "fullname": "Leijie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180896?format=json", "institution": "University of Washington"}, {"id": 130005, "fullname": "Otilia Stretcu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130005?format=json", "institution": "Google Research"}, {"id": 185165, "fullname": "Wei Qiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185165?format=json", "institution": "Google"}, {"id": 185166, "fullname": "Thomas Denby", "url": "http://cvpr.thecvf.com/api/miniconf/users/185166?format=json", "institution": "Google"}, {"id": 130003, "fullname": "Krishnamurthy Viswanathan", "url": "http://cvpr.thecvf.com/api/miniconf/users/130003?format=json", "institution": "Google"}, {"id": 129995, "fullname": "Enming Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/129995?format=json", "institution": "Google"}, {"id": 105229, "fullname": "Chun-Ta Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/105229?format=json", "institution": "Google Research"}, {"id": 185167, "fullname": "Tushar Dogra", "url": "http://cvpr.thecvf.com/api/miniconf/users/185167?format=json", "institution": "Google"}, {"id": 84558, "fullname": "Ranjay Krishna", "url": "http://cvpr.thecvf.com/api/miniconf/users/84558?format=json", "institution": "University of Washington"}, {"id": 129992, "fullname": "Ariel Fuxman", "url": "http://cvpr.thecvf.com/api/miniconf/users/129992?format=json", "institution": "Google"}], "abstract": "From content moderation to content curation, applications requiring vision classifiers for visual concepts are rapidly expanding.Existing human-in-the-loop approaches typically assume users begin with a clear, stable concept understanding to be able to provide high-quality supervision. In reality, users often start with a vague idea and must iteratively refine it through \"concept deliberation\", a practice we uncovered through structured interviews with content moderation experts. We operationalize the common strategies in deliberation used by real content moderators into a human-in-the-loop framework called Agile Deliberation that explicitly supports evolving and subjective concepts. The system supports users in defining the concept for themselves by exposing them to borderline cases. The system does this with two deliberation stages: (1) concept scoping, which decomposes the initial concept into a structured hierarchy of sub-concepts, and (2) concept iteration, which surfaces semantically borderline examples for user reflection and feedback to iteratively align an image classifier with the user\u2019s evolving intent. Since concept deliberation is inherently subjective and interactive, we painstakingly evaluate the framework through 18 user sessions, each 1.5h long, rather than standard benchmarking datasets. We find that Agile Deliberation achieves 7.5\\% higher $F_1$ scores than automated decomposition baselines and more than 3\\% higher than manual deliberation, while participants reported clearer conceptual understanding and lower cognitive effort.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36484", "url": null, "sourceid": 41098, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36486, "uid": "9f187d77d73d175445b0a0252b75efe3", "name": "Flowception: Temporally Expansive Flow Matching for Video Generation", "authors": [{"id": 107647, "fullname": "Tariq Berrada", "url": "http://cvpr.thecvf.com/api/miniconf/users/107647?format=json", "institution": "Meta"}, {"id": 185169, "fullname": "John Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185169?format=json", "institution": "Facebook"}, {"id": 86016, "fullname": "Karteek Alahari", "url": "http://cvpr.thecvf.com/api/miniconf/users/86016?format=json", "institution": "Inria"}, {"id": 138110, "fullname": "Jakob Verbeek", "url": "http://cvpr.thecvf.com/api/miniconf/users/138110?format=json", "institution": "Meta"}, {"id": 185170, "fullname": "Ricky T. Q. Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185170?format=json", "institution": "FAIR Labs, Meta AI"}], "abstract": "We present Flowception, a novel non-autoregressive and variable-length video generation framework. Flowception learns a probability path that interleaves discrete frame insertions with continuous frame denoising. Compared to autoregressive methods, Flowception alleviates error accumulation/drift as the frame insertion mechanism during sampling serves as an efficient compression mechanism to handle long-term context. Compared to full-sequence flows, our method reduces FLOPs for training three-fold, while also being more amenable to local attention variants, and allowing to learn the length of videos jointly with their content. Quantitative experimental results show improved FVD and VBench metrics over autoregressive and full-sequence baselines, which is further validated with qualitative results. Finally, by learning to insert and denoise frames in a sequence, Flowception seamlessly integrates different tasks such as image-to-video generation and video interpolation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36486", "url": null, "sourceid": 44667, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36494, "uid": "665abc2f6f55c9be9952b327c7f52baa", "name": "Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation", "authors": [{"id": 153353, "fullname": "Tim Engelbracht", "url": "http://cvpr.thecvf.com/api/miniconf/users/153353?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 185191, "fullname": "Ren\u00e9 Zurbr\u00fcgg", "url": "http://cvpr.thecvf.com/api/miniconf/users/185191?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 185192, "fullname": "Matteo Wohlrapp", "url": "http://cvpr.thecvf.com/api/miniconf/users/185192?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}, {"id": 91397, "fullname": "Martin B\u00fcchner", "url": "http://cvpr.thecvf.com/api/miniconf/users/91397?format=json", "institution": "Albert-Ludwigs-Universit\u00e4t Freiburg"}, {"id": 86578, "fullname": "Abhinav Valada", "url": "http://cvpr.thecvf.com/api/miniconf/users/86578?format=json", "institution": "Universit\u00e4t Freiburg"}, {"id": 73915, "fullname": "Marc Pollefeys", "url": "http://cvpr.thecvf.com/api/miniconf/users/73915?format=json", "institution": "ETH Zurich / Microsoft"}, {"id": 139149, "fullname": "Hermann Blum", "url": "http://cvpr.thecvf.com/api/miniconf/users/139149?format=json", "institution": "Uni Bonn                Lamarr Institute"}, {"id": 151077, "fullname": "Zuria Bauer", "url": "http://cvpr.thecvf.com/api/miniconf/users/151077?format=json", "institution": "ETH Zurich"}], "abstract": "We present a dataset for force-grounded, cross-view articulated manipulation that couples what is seen with what is done and what is felt during real human interaction. The dataset contains 3048 sequences across 381 articulated objects in 38 environments. Each object is operated under four embodiments - (i) human hand, (ii) human hand with a wrist-mounted camera, (iii) handheld UMI gripper, and (iv) a custom Hoi! gripper - where the tool embodiment provide synchronized end-effector forces and tactile sensing. Our dataset offers a holistic view of interaction understanding from video, enabling researchers to evaluate how well methods transfer between human and robotic viewpoints, but also investigate underexplored modalities such as force sensing and prediction.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36494", "url": null, "sourceid": 37159, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36497, "uid": "19101d2e639b7f9d2d85022ce1fece21", "name": "Exact-GS: Mathematically Rigorous and Accurate 3D Gaussian Splatting for 3D X-ray Reconstruction", "authors": [{"id": 172487, "fullname": "Guangpu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/172487?format=json", "institution": "Universit\u00e4t Stuttgart"}, {"id": 185201, "fullname": "Steffen Kie\u00df", "url": "http://cvpr.thecvf.com/api/miniconf/users/185201?format=json", "institution": "University of Stuttgart"}, {"id": 185202, "fullname": "Hanxiang Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185202?format=json", "institution": "Universit\u00e4t Stuttgart"}, {"id": 185203, "fullname": "Xingyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185203?format=json", "institution": "Universit\u00e4t Stuttgart"}, {"id": 185204, "fullname": "Sven Simon", "url": "http://cvpr.thecvf.com/api/miniconf/users/185204?format=json", "institution": "Universit\u00e4t Stuttgart"}], "abstract": "We propose Exact-GS, a novel mathematically rigorous and accurate 3D Gaussian Splatting model designed to perform 3D X-ray computed tomography (CT) reconstruction and novel view synthesis. Recently, 3D Gaussian Splatting achieved considerable progress at 3D representation. Unfortunately, due to the affine approximation of the projective transformation, previous 3DGS-based methods inevitably suffer from artifacts and projection inconsistencies. To address this problem, some ray tracing based methods perform integration along the ray across Gaussians. However, these methods are computationally inefficient on the forward and backward pass. We introduce a novel closed-form splatting solution for this problem with mathematically rigorous derivation. Our model is the first to achieve the same exact rendering quality as ray tracing based methods without any approximation under a splatting-based formulation, enabling fast CUDA-based hardware rasterization.Additionally, we present a precise Gaussian-tile intersection algorithm, enabling faster and efficient rendering.We demonstrate the performance gains by reconstruction and novel view synthesis through different synthetic and real-world datasets. Our method also contributes to the visible light scene representation, where the density accumulation (X-ray attenuation coefficient) in our model can be replaced by the integral of the opacity at alpha blending.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36497", "url": null, "sourceid": 46125, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36496, "uid": "437aa62e304ca50b7793a093360e4186", "name": "Computational Speckle Pattern Interferometry", "authors": [{"id": 174190, "fullname": "Shengxi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/174190?format=json", "institution": "Carnegie Mellon University"}, {"id": 182010, "fullname": "Sophia Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182010?format=json", "institution": "Department of Computer Science, University of Toronto"}, {"id": 185200, "fullname": "Dorian Chan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185200?format=json", "institution": "Apple"}, {"id": 147109, "fullname": "Matthew O\u2019Toole", "url": "http://cvpr.thecvf.com/api/miniconf/users/147109?format=json", "institution": "Carnegie Mellon University"}], "abstract": "Visually imperceptible surface deformations encode rich information---from the mechanical properties of an object to the acoustic vibrations present in the surrounding environment. Existing optical techniques reveal these subtle motions by employing coherent illumination and capturing multiple measurements over time. In this paper, we introduce Computational Speckle Pattern Interferometry (CSPI), a novel single-shot approach that estimates per-pixel displacement and motion by leveraging a phasor-based image formation model and an optical-flow-inspired reconstruction algorithm. Our key insight is that the image formation process can be factorized to jointly recover spatial coefficients and temporal dynamics. Unlike traditional interferometric methods, CSPI requires no precision instrumentation to perform phase stepping. We demonstrate its effectiveness by measuring per-pixel displacements and motions at sub-micrometer scales, visualizing high-frequency vibrations of a tuning fork and a Chladni plate, and recovering sound indirectly from these vibrations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36496", "url": null, "sourceid": 30750, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36500, "uid": "f0a546375c2be84f83f3d1e3405ad9d2", "name": "CountGD++: Generalized Prompting for Open-World Counting", "authors": [{"id": 103047, "fullname": "Niki Amini-Naieni", "url": "http://cvpr.thecvf.com/api/miniconf/users/103047?format=json", "institution": "University of Oxford"}, {"id": 75512, "fullname": "Andrew Zisserman", "url": "http://cvpr.thecvf.com/api/miniconf/users/75512?format=json", "institution": "University of Oxford"}], "abstract": "The flexibility and accuracy of methods for automatically counting objects in images and videos are limited by the way the object can be specified. While existing methods allow users to describe the target object with text and visual examples, the visual examples must be manually annotated inside the image, and there is no way to specify what not to count. To address these gaps, we introduce novel capabilities that expand how the target object can be specified. Specifically, we extend the prompt to enable what not to count to be described with text and/or visual examples, introduce the concept of `pseudo-exemplars' that automate the annotation of visual examples at inference, and extend counting models to accept visual examples from both natural and synthetic external images. We also use our new counting model, CountGD++, as a vision expert agent for an LLM. Together, these contributions expand the prompt flexibility of multi-modal open-world counting and lead to significant improvements in accuracy, efficiency, and generalization across multiple datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36500", "url": null, "sourceid": 36558, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36549, "uid": "5a68cb47058c98fe6d6e4971aedb0480", "name": "Black-Box Domain Adaptation for Object Detection with Retention-Driven Knowledge Compression", "authors": [{"id": 183878, "fullname": "Yuwu Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183878?format=json", "institution": "South China Normal University"}, {"id": 185322, "fullname": "Chunzhi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185322?format=json", "institution": "Tongji University"}], "abstract": "Black-Box Domain Adaptation (BBDA) is a highly practical yet challenging strategy that enables the deployment of pre-trained detectors to new unlabeled target domains without accessing source data or models. Compared to previous domain adaptation studies, BBDA not only provides stronger data privacy protection but also offers greater portability. Despite growing interest, existing BBDA strategies remain difficult to apply directly to object detection, as most prior works focus on classification and segmentation tasks that do not involve bounding box localization and rely on different learning mechanisms. In this paper, inspired by lifelong learning, we propose Retention-Driven Knowledge Compression (RDKC), which applies a brain-inspired continual learning process to BBDA for object detection. Specifically, RDKC consists of two key components: Memory Retention (MR) and Scene Compression (SC). MR is designed specifically for object detection under the BBDA setting, where it performs memorized contrastive learning on partitioned regions to better utilize informative cues from reliable areas while filtering out potential noise from noise prediction labels. SC introduces a contrastive mechanism between near- and far-view regions, which enables the model to better learn from far-view regions under the guidance of near-view cues. Experimental results demonstrate that under the BBDA setting, RDKC outperforms previous SOTA methods across all evaluated benchmarks, achieving superior performance improvements.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36549", "url": null, "sourceid": 46599, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36503, "uid": "a8d2884db49e0769ad4a0b8dce2e143c", "name": "TDATR: Improving End-to-End Table Recognition via Table Detail-Aware Learning and Cell-Level Visual Alignment", "authors": [{"id": 180099, "fullname": "Qin Chunxia", "url": "http://cvpr.thecvf.com/api/miniconf/users/180099?format=json", "institution": "University of Science and Technology of China"}, {"id": 185224, "fullname": "Chenyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185224?format=json", "institution": "University of Science and Technology of China"}, {"id": 185225, "fullname": "Pengcheng Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/185225?format=json", "institution": "IFLYTEK CO.LTD."}, {"id": 131090, "fullname": "Jun Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/131090?format=json", "institution": "University of Science and Technology of China"}, {"id": 185226, "fullname": "Baocai Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185226?format=json", "institution": "iFLYTEK Research"}, {"id": 153578, "fullname": "Bing Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/153578?format=json", "institution": "iFLYTEK"}, {"id": 90080, "fullname": "Cong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90080?format=json", "institution": "iFLYTEK"}], "abstract": "Tables are pervasive in diverse documents, making table recognition (TR) a fundamental task in document analysis. Existing modular TR pipelines separately model table structure and content, leading to suboptimal integration and complex workflows.End-to-end approaches rely heavily on large-scale TR data and struggle in data-constrained scenarios.To address these issues, we propose TDATR (Table Detail-Aware Table Recognition) improves end-to-end TR through table detail-aware learning and cell-level visual alignment.TDATR adopts a \u201cperceive-then-fuse\u201d strategy. The model first performs table detail-aware learning to jointly perceive table structure and content through multiple structure understanding and content recognition tasks designed under a language modeling paradigm. These tasks can naturally leverage document data from diverse scenarios to enhance model robustness.The model then integrates implicit table details to generate structured HTML outputs, enabling more efficient TR modeling when trained with limited data.Furthermore, we design a structure-guided cell localization module integrated into the end-to-end TR framework, which efficiently locates cell and strengthens vision\u2013language alignment. It enhances the interpretability and accuracy of TR.We achieve state-of-the-art or highly competitive performance on seven benchmarks without dataset-specific fine-tuning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36503", "url": null, "sourceid": 42881, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36505, "uid": "ed3551789fc0376ff8938b6827b16eae", "name": "SoC: Semantic Orthogonal Calibration for Test-Time Prompt Tuning", "authors": [{"id": 180311, "fullname": "Leo Fillioux", "url": "http://cvpr.thecvf.com/api/miniconf/users/180311?format=json", "institution": "CentraleSupelec"}, {"id": 181113, "fullname": "Omprakash Chakraborty", "url": "http://cvpr.thecvf.com/api/miniconf/users/181113?format=json", "institution": "ETS Montreal"}, {"id": 77361, "fullname": "Ismail Ben Ayed", "url": "http://cvpr.thecvf.com/api/miniconf/users/77361?format=json", "institution": "ETS Montreal"}, {"id": 185231, "fullname": "Paul-Henry Courn\u00e8de", "url": "http://cvpr.thecvf.com/api/miniconf/users/185231?format=json", "institution": "Paris-Saclay University"}, {"id": 185232, "fullname": "Stergios Christodoulidis", "url": "http://cvpr.thecvf.com/api/miniconf/users/185232?format=json", "institution": "CentraleSup\u00e9lec"}, {"id": 129973, "fullname": "Maria Vakalopoulou", "url": "http://cvpr.thecvf.com/api/miniconf/users/129973?format=json", "institution": "CentraleSupelec"}, {"id": 84856, "fullname": "Jose Dolz", "url": "http://cvpr.thecvf.com/api/miniconf/users/84856?format=json", "institution": "\u00c9cole de technologie sup\u00e9rieure"}], "abstract": "With the increasing adoption of vision-language models (VLMs) in critical decision-making systems such as healthcare or autonomous driving, the calibration of their uncertainty estimates becomes paramount. Yet, this dimension has been largely underexplored in the VLM test-time prompt-tuning (TPT) literature, which has predominantly focused on improving their discriminative performance. Recent state-of-the-art advocates for enforcing full orthogonality over pairs of text prompt embeddings to enhance separability, and therefore calibration. Nevertheless, as we theoretically show in this work, the inherent gradients from fully orthogonal constraints will strongly push semantically related classes away, ultimately making the model overconfident. Based on our findings, we propose Semantic Orthogonal Calibration (SoC), a Huber\u2011based regularizer that enforces smooth prototype separation while preserving semantic proximity, thereby improving calibration compared to prior orthogonality\u2011based approaches. Across a comprehensive empirical validation, we demonstrate that SoC consistently improves calibration performance, while also maintaining competitive discriminative capabilities.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36505", "url": null, "sourceid": 31766, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36510, "uid": "77dcd1d395862e2d2eef39b3afc939cc", "name": "GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views", "authors": [{"id": 159771, "fullname": "tianyu chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/159771?format=json", "institution": "La Trobe University"}, {"id": 91374, "fullname": "Wei Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91374?format=json", "institution": "La Trobe University"}, {"id": 156020, "fullname": "Kang Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/156020?format=json", "institution": "La Trobe University"}, {"id": 185238, "fullname": "Lu Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185238?format=json", "institution": "La Trobe University"}, {"id": 184154, "fullname": "Di Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184154?format=json", "institution": "La Trobe University"}, {"id": 92855, "fullname": "Gaowen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/92855?format=json", "institution": "Cisco Systems"}, {"id": 129129, "fullname": "Ramana Kompella", "url": "http://cvpr.thecvf.com/api/miniconf/users/129129?format=json", "institution": "Cisco"}], "abstract": "Feed-forward 3D reconstruction offers substantial runtime advantages over per-scene optimization, which remains slow at inference and often fragile under sparse views. However, existing feed-forward methods still have potential for further performance gains, especially for out-of-domain data, and struggle to retain second-level inference time once a generative prior is introduced. These limitations stem from the one-shot prediction paradigm in existing feed-forward pipeline: models are strictly bounded by capacity, lack inference-time refinement, and are ill-suited for continuously injecting generative priors. We introduce GIFSplat, a purely feed-forward iterative refinement framework for 3D Gaussian Splatting from sparse unposed views. A small number of forward-only residual updates progressively refine current 3D scene using rendering evidence, achieve favorable balance between efficiency and quality.  Furthermore, we distill a frozen diffusion prior into Gaussian-level cues from enhanced novel renderings without gradient backpropagation or ever-increasing view-set expansion, thereby enabling per-scene adaptation with generative prior while preserving feed-forward efficiency. Across DL3DV, RealEstate10K, and DTU, GIFSplat consistently outperforms state-of-the-art feed-forward baselines, improving PSNR by up to +2.1 dB, and it maintains second-scale inference time without requiring camera poses or any test-time gradient optimization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36510", "url": null, "sourceid": 33501, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36512, "uid": "852be1ce1d25bb877395346423eb292b", "name": "Reinforcing Structured Chain-of-Thought for Video Understanding", "authors": [{"id": 181983, "fullname": "Peiyao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181983?format=json", "institution": "Stony Brook University"}, {"id": 139004, "fullname": "Haotian Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/139004?format=json", "institution": "Amazon"}, {"id": 184708, "fullname": "Noranart Vesdapunt", "url": "http://cvpr.thecvf.com/api/miniconf/users/184708?format=json", "institution": "Amazon"}, {"id": 185240, "fullname": "Rui Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185240?format=json", "institution": "Amazon"}, {"id": 185241, "fullname": "Jingyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185241?format=json", "institution": "Amazon"}, {"id": 92055, "fullname": "Haibin Ling", "url": "http://cvpr.thecvf.com/api/miniconf/users/92055?format=json", "institution": "State University of New York, Stony Brook"}, {"id": 184709, "fullname": "Oleksandr Obiednikov", "url": "http://cvpr.thecvf.com/api/miniconf/users/184709?format=json", "institution": "Amazon"}, {"id": 140562, "fullname": "Ning Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/140562?format=json", "institution": "Amazon"}, {"id": 138144, "fullname": "Kah Fu Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/138144?format=json", "institution": "Amazon"}], "abstract": "Multi-modal Large Language Models (MLLMs) show promise in video understanding. However, their reasoning often suffers from thinking drift and weak temporal comprehension, even when enhanced by Reinforcement Learning (RL) techniques like Group Relative Policy Optimization (GRPO). Moreover, existing RL methods often depend on Supervised Fine-Tuning (SFT), which requires costly Chain-of-Thought (CoT) annotation and multi-stage training, and enforces fixed reasoning paths, limiting MLLM\u2019s ability to generalize and potentially inducing bias. To overcome these limitations, we introduce Summary-Driven Reinforcement Learning (SDRL), a novel single-stage RL framework that obviates the need for SFT by utilizing a Structured CoT format: Summarize \u2192Think \u2192Answer. SDRL introduces two self-supervised mechanisms integrated into GRPO objective: 1) Consistency of Vision Knowledge (CVK) enforces factual grounding by reducing KL divergence among generated summaries; and 2) Dynamic Variety of Reasoning (DVR) promotes exploration by dynamically modulating thinking diversity based on group accuracy. This novel integration effectively balances alignment and exploration, supervising both the final answer and the reasoning process. Our method achieves state-of-the-art performance on seven public VideoQA datasets. Additionally, we construct and will release an 80K VideoQA dataset with temporal annotations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36512", "url": null, "sourceid": 43467, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36513, "uid": "6fb475e33fd1df9c376f2c7256458e9f", "name": "FastRef: Fast Prototype Refinement for Few-shot Industrial Anomaly Detection", "authors": [{"id": 183690, "fullname": "Yufei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183690?format=json", "institution": "\u897f\u5b89\u7535\u5b50\u79d1\u6280\u5927\u5b66"}, {"id": 185242, "fullname": "Long Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/185242?format=json", "institution": null}, {"id": 185243, "fullname": "Yuyang Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/185243?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 185244, "fullname": "Wenchao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185244?format=json", "institution": "Xidian University"}, {"id": 155786, "fullname": "Liang Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/155786?format=json", "institution": "Xidian Unversity"}, {"id": 185245, "fullname": "Xiyang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185245?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}], "abstract": "Few-shot industrial anomaly detection (FS-IAD) presents a critical challenge for practical automated inspection systems operating in data-scarce environments. While existing approaches predominantly focus on obtaining prototypes from limited normal images, they neglect to systematically incorporate statistics of query image to enhance prototype representativeness. To address this issue, we propose FastRef, a novel and efficient prototype refinement framework for FS-IAD. Our method operates through an iterative two-stage process during inference: (1) characteristic transfer from query features to the enhanced prototypes, and (2) anomaly suppression by aligning the enhanced prototypes with their normal counterparts. The characteristic transfer is achieved through linear reconstruction of query features from prototypes with an optimizable transport matrix, while the anomaly suppression addresses a key observation in FS-IAD that unlike conventional IAD with abundant normal prototypes, the limited-sample setting makes anomaly reconstruction more probable in characteristic transfer. Therefore, we employ optimal transport to measure while minimize the gap between prototypes and their enhanced counterparts for anomaly suppression. For comprehensive evaluation, we integrate FastRef with three competitive prototype-based FS-IAD methods: PatchCore, WinCLIP, and AnomalyDINO. Extensive experiments across four benchmark datasets of MVTec, ViSA, MPDD and RealIAD demonstrate both the effectiveness and efficiency of our approach under 1/2/4-shots.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36513", "url": null, "sourceid": 42395, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36516, "uid": "6f69fa86accf2dca7fb4e3e12b3d29b4", "name": "Defending Unauthorized Model Merging via Dual-Stage Weight Protection", "authors": [{"id": 185252, "fullname": "Wei-Jia Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185252?format=json", "institution": null}, {"id": 185253, "fullname": "Min-Yan Tsai", "url": "http://cvpr.thecvf.com/api/miniconf/users/185253?format=json", "institution": null}, {"id": 176504, "fullname": "Cheng-Yi Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/176504?format=json", "institution": "Academia Sinica"}, {"id": 183072, "fullname": "Chia-Mu Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183072?format=json", "institution": "National Yang Ming Chiao Tung University"}], "abstract": "Traditional multi-task learning often relies on separately fine-tuned models for each task, leading to high training costs and inefficiency. Recent advances in model merging alleviate this issue by linearly combining parameters from multiple task-specific models to create new multi-task models. Such approaches can match or even surpass fine-tuning performance while greatly reducing computational overhead. However, the increasing openness of model-sharing platforms also introduces intellectual property risks. Malicious users can easily merge publicly available models to build new commercial systems without authorization, undermining the rights of original developers. To address this emerging threat, we propose MergeGuard, a two-stage preprocessing mechanism that protects models against unauthorized merging. MergeGuard subtly adjusts a model\u2019s internal parameter structure to maintain its original task performance while degrading the performance of any merged derivatives. The key challenge lies in defending against unpredictable merging behaviors, as the attacker\u2019s chosen models, strategies, and tasks remain unknown. MergeGuard effectively achieves this balance\u2014ensuring normal functionality before merging but significant performance degradation afterward\u2014to safeguard model ownership in open AI ecosystems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36516", "url": null, "sourceid": 36899, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36517, "uid": "fc77eeae6b54251137390500c50c9172", "name": "SpatialDiff: 3D-Aware Object Movement via Implicit Spatial Modeling", "authors": [{"id": 183495, "fullname": "Zheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183495?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 72293, "fullname": "Zijian He", "url": "http://cvpr.thecvf.com/api/miniconf/users/72293?format=json", "institution": "Sun Yat-sen University"}, {"id": 185254, "fullname": "Huiguo He", "url": "http://cvpr.thecvf.com/api/miniconf/users/185254?format=json", "institution": "South China University of Technology"}, {"id": 185255, "fullname": "Weizhi Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/185255?format=json", "institution": "University of Hong Kong"}, {"id": 185256, "fullname": "Yejun Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185256?format=json", "institution": null}, {"id": 129998, "fullname": "Huan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129998?format=json", "institution": "Kuaishou"}, {"id": 156268, "fullname": "Kun Gai", "url": "http://cvpr.thecvf.com/api/miniconf/users/156268?format=json", "institution": "Kuaishou- \u5feb\u624b\u79d1\u6280"}, {"id": 185257, "fullname": "Guanbin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185257?format=json", "institution": null}], "abstract": "Recent advances in image editing allow impressive manipulation of objects, existing methods still struggle to handle spatial manipulation in complex scenes, such as objects span different depth layers or are partially occluded.Most image editing methods focus solely on 2D datasets prior information, emphasizing planar features while lacking support for spatial positional structures. Even approaches that incorporate explicit positional information fail to capture true 3D spatial relationships, thus limiting accurate object movement in complex scenes.In this paper, we present $\\textbf{SpatialDiff}$, a method that effectively captures 3D spatial structures, enabling precise and consistent object movements in complex scenes.Our core innovations are twofold: (1) $\\textbf{Implicit 3D Spatial Modeling}$, which introduces 3D prior knowledge and enables the model to internally build a comprehensive understanding of the three-dimensional spatial structure; and (2) $\\textbf{Global Spatial Supervision}$, which constrains the latent spatial features to enable the model to perceive changes in object spatial positions caused by editing operations.Experimental results demonstrate that our method significantly improves the accuracy and fidelity of spatial manipulation in complex scenes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36517", "url": null, "sourceid": 46745, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36521, "uid": "8d2dc7dfd445269bda25fb479ff8b3d9", "name": "Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering", "authors": [{"id": 181252, "fullname": "Yura Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/181252?format=json", "institution": "Imperial college london"}, {"id": 91821, "fullname": "Roy Miles", "url": "http://cvpr.thecvf.com/api/miniconf/users/91821?format=json", "institution": "Imperial College London"}, {"id": 91772, "fullname": "Rolandos Alexandros Potamias", "url": "http://cvpr.thecvf.com/api/miniconf/users/91772?format=json", "institution": "Imperial College London"}, {"id": 89676, "fullname": "Ismail Elezi", "url": "http://cvpr.thecvf.com/api/miniconf/users/89676?format=json", "institution": "Huawei Noah&#x27;s Ark"}, {"id": 74045, "fullname": "Jiankang Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/74045?format=json", "institution": "Imperial College London"}, {"id": 86877, "fullname": "Stefanos Zafeiriou", "url": "http://cvpr.thecvf.com/api/miniconf/users/86877?format=json", "institution": "Imperial College London"}], "abstract": "Understanding and answering questions based on a user\u2019s pointing gesture is essential for next-generation egocentric AI assistants. However, current Multimodal Large Language Models (MLLMs) struggle with such tasks due to the lack of gesture-rich data and their limited ability to infer fine-grained pointing intent from egocentric video.To address this, we introduce EgoPointVQA, a dataset and benchmark for gesture-grounded egocentric question answering, comprising 4000 synthetic and 400 real-world videos across multiple deictic reasoning tasks.Built upon it, we further propose Hand Intent Tokens (HINT), which encode tokens derived from 3D hand keypoints using an off-the-shelf reconstruction model and interleaves them with the model input to provide explicit spatial and temporal context for interpreting pointing intent.We show that our model outperforms others in different backbones and model sizes.In particular, HINT-14Bachieves 68.1\\% accuracy, on average over 6 tasks, surpassing the state-of-the-art, InternVL3-14B, by 6.6\\%.To further facilitate the open research, we will release the code, model, and dataset.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36521", "url": null, "sourceid": 46058, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36524, "uid": "31f97ff19cc0aa5d7001a6b187dd2fa7", "name": "TopoMesh: High-Fidelity Mesh Autoencoding via Topological Unification", "authors": [{"id": 130380, "fullname": "Guan Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/130380?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 157911, "fullname": "Xiu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/157911?format=json", "institution": "Bytedance"}, {"id": 152112, "fullname": "Rui Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/152112?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 185269, "fullname": "Xuanyu Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185269?format=json", "institution": "ByteDance Inc."}, {"id": 89549, "fullname": "Jing Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/89549?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 171335, "fullname": "Chia-Hao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/171335?format=json", "institution": ", Tsinghua University"}, {"id": 185270, "fullname": "Jiahang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185270?format=json", "institution": "ByteDance Inc."}, {"id": 89456, "fullname": "Song-Hai Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89456?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 70631, "fullname": "Jianfeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70631?format=json", "institution": "NUS"}], "abstract": "The dominant paradigm for high-fidelity 3D generation relies on a VAE-Diffusion pipeline, where the VAE's reconstruction capability sets a firm upper bound on generation quality. A fundamental challenge limiting existing VAEs is the \\textit{representation mismatch} between ground-truth meshes and network predictions: GT meshes have arbitrary, variable topology, while VAEs typically predict fixed-structure implicit fields (\\eg, SDF on regular grids). This inherent misalignment prevents establishing explicit mesh-level correspondences, forcing prior work to rely on indirect supervision signals such as SDF or rendering losses. Consequently, fine geometric details, particularly sharp features, are poorly preserved during reconstruction. To address this, we introduce TopoMesh, a sparse voxel-based VAE that unifies both GT and predicted meshes under a shared Dual Marching Cubes (DMC) topological framework. Specifically, we convert arbitrary input meshes into DMC-compliant representations via a remeshing algorithm that preserves sharp edges using an L$\\infty$ distance metric. Our decoder outputs meshes in the same DMC format, ensuring that both predicted and target meshes share identical topological structures.  This establishes explicit correspondences at the vertex and face level, allowing us to derive explicit mesh-level supervision signals for topology, vertex positions, and face orientations with clear gradients. Our sparse VAE architecture employs this unified framework and is trained with Teacher Forcing and progressive resolution training for stable and efficient convergence. Extensive experiments demonstrate that TopoMesh significantly outperforms existing VAEs in reconstruction fidelity, achieving superior preservation of sharp features and geometric details, as shown in Fig. 1.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36524", "url": null, "sourceid": 44115, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36526, "uid": "922d5cb28ca0586bbe2db57477143dbb", "name": "DMGD: Train-Free Dataset Distillation with Semantic-Distribution Matching in Diffusion Models", "authors": [{"id": 155318, "fullname": "Qichao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155318?format=json", "institution": "Zhejiang University"}, {"id": 152076, "fullname": "Yunhong Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152076?format=json", "institution": "Zhejiang University"}, {"id": 155319, "fullname": "Hengyuan Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/155319?format=json", "institution": "Zhejiang University"}, {"id": 145019, "fullname": "Junyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/145019?format=json", "institution": "Zhejiang University"}, {"id": 155322, "fullname": "Min Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155322?format=json", "institution": "Zhejiang University"}], "abstract": "Dataset distillation enables efficient training by distilling the information of large-scale datasets into significantly smaller synthetic datasets. Diffusion based paradigms have emerged in recent years, offering novel perspectives for dataset distillation. However, they typically necessitate additional fine-tuning stages, and effective guidance mechanisms remain underexplored. To address these limitations, we rethink diffusion based dataset distillation and propose a Dual Matching Guided Diffusion (DMGD) framework, centered on efficient training-free guidance. We propose a pioneering theoretical framework for guidance design, proving that optimizing distributional distance under semantic alignment equivalently tightens the upper bound of dataset distillation objectives. Therefore, we first establish  **Semantic Matching** via conditional likelihood optimization, eliminating the need for auxiliary classifiers. Furthermore, we propose a dynamic guidance mechanism that enhances the diversity of synthetic data while maintaining semantic alignment. Simultaneously, we introduce an optimal transport (OT) based **Distribution Matching** approach to further align with the target distribution structure. To ensure efficiency, we develop two enhanced strategies for diffusion based framework: Distribution Approximate Matching and Greedy Progressive Matching. These strategies enable effective distribution matching guidance with minimal computational overhead. Experimental results on ImageNet-Woof, ImageNet-Nette, and ImageNet-1K demonstrate that our training-free approach achieves significant improvements, outperforming state-of-the-art (SOTA) methods requiring additional fine-tuning by average accuracy gains of $2.1$%, $5.4$%, and $2.4$%, respectively.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36526", "url": null, "sourceid": 42789, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36527, "uid": "67f3a8a7347963d3c9d0622417ad0885", "name": "Vision-Speech Models: Teaching Speech Models to Converse about Images", "authors": [{"id": 182846, "fullname": "Amelie Royer", "url": "http://cvpr.thecvf.com/api/miniconf/users/182846?format=json", "institution": "Kyutai"}, {"id": 185274, "fullname": "Moritz B\u00f6hle", "url": "http://cvpr.thecvf.com/api/miniconf/users/185274?format=json", "institution": null}, {"id": 185275, "fullname": "Laurent Mazar\u00e9", "url": "http://cvpr.thecvf.com/api/miniconf/users/185275?format=json", "institution": "Kyutai"}, {"id": 185276, "fullname": "Neil Zeghidour", "url": "http://cvpr.thecvf.com/api/miniconf/users/185276?format=json", "institution": "Kyutai"}, {"id": 185277, "fullname": "Alexandre D\u00e9fossez", "url": "http://cvpr.thecvf.com/api/miniconf/users/185277?format=json", "institution": "Kyutai"}, {"id": 185278, "fullname": "Patrick Perez", "url": "http://cvpr.thecvf.com/api/miniconf/users/185278?format=json", "institution": "Kyutai"}], "abstract": "The recent successes of Vision-Language models raise the question of how to equivalently imbue a pretrained speech model with vision understanding, an important milestone towards building a multimodal speech model able to freely converse about images. Building such a conversational Vision-Speech model brings its unique challenges: (i) paired image-speech datasets are much scarcer than their image-text counterparts, (ii) ensuring real-time latency at inference is crucial thus bringing compute and memory constraints, and (iii) the model should preserve prosodic features (e.g., speaker tone) which cannot be inferred from text alone. In this work, we introduce MoshiVis, augmenting a recent dialogue speech LLM, Moshi, with visual inputs through lightweight adaptation modules. An additional dynamic gating mechanism enables the model to more easily switch between the visual inputs and unrelated conversation topics. To reduce training costs, we design a simple one-stage, parameter-efficient fine-tuning pipeline in which we leverage a mixture of image-text (i.e., \"speechless\") and image-speech samples. We evaluate the model on downstream visual understanding tasks with both audio and text prompts, and report qualitative samples of interactions with MoshiVis. Our inference code will be made available, as well as the image-speech data used for audio evaluation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36527", "url": null, "sourceid": 30734, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36530, "uid": "c2ad3ba220066148c8f08d2caa94bf1f", "name": "CRAFT-LoRA: Content-Style Personalization via Rank-Constrained Adaptation and Training-Free Fusion", "authors": [{"id": 185291, "fullname": "Yu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185291?format=json", "institution": "George Washington University"}, {"id": 153126, "fullname": "Yujun Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/153126?format=json", "institution": "The University of Queensland"}, {"id": 153810, "fullname": "Chi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153810?format=json", "institution": "Westlake University"}], "abstract": "Personalized image generation requires effectively balancing content fidelity with stylistic consistency when synthesizing images based on text and reference examples. Low-Rank Adaptation (LoRA) offers an efficient personalization approach, with potential for precise control through combining LoRA weights on different concepts. However, existing combination techniques face persistent challenges: entanglement between content and style representations, insufficient guidance for controlling elements' influence, and unstable weight fusion that often require additional training. We address these limitations through CRAFT-LoRA, with complementary components: (1) rank-constrained backbone fine-tuning that injects low-rank projection residuals to encourage learning decoupled content and style subspaces; (2) a prompt-guided approach featuring an expert encoder with specialized branches that enables semantic extension and precise control through selective adapter aggregation; and (3) a training-free, timestep-dependent classifier-free guidance scheme that enhances generation stability by strategically adjusting noise predictions across diffusion steps. Our method significantly improves content-style disentanglement, enables flexible semantic control over LoRA module combinations, and achieves high-fidelity generation without additional retraining overhead.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36530", "url": null, "sourceid": 44417, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36531, "uid": "0401ed6796f1f9b637d18a4ba337e1d6", "name": "UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression", "authors": [{"id": 143012, "fullname": "Yuan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/143012?format=json", "institution": "Dalian University of Technology"}, {"id": 185292, "fullname": "Youwei Pang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185292?format=json", "institution": "Nanyang Technological University"}, {"id": 127233, "fullname": "Lihe Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127233?format=json", "institution": "Dalian University of Technology"}, {"id": 127225, "fullname": "Hanqi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127225?format=json", "institution": "Ohio State University, Columbus"}, {"id": 127271, "fullname": "Jiaming Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/127271?format=json", "institution": "University of Southern California"}, {"id": 87510, "fullname": "Huchuan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87510?format=json", "institution": "Dalian University of Technology"}, {"id": 185293, "fullname": "Xiaoqi Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185293?format=json", "institution": "Yale University"}], "abstract": "Existing anomaly detection  methods often treat  the  modality and class as independent factors. Although this paradigm has enriched the development of AD research branches and produced many specialized models, it has also led to fragmented solutions and excessive memory overhead. Moreover, reconstruction-based multi-class approaches typically rely on shared decoding paths, which struggle to handle large variations across domains, resulting in distorted normality boundaries, domain interference, and high false alarm rates.To address these limitations, we propose UniMMAD, a unified framework for  multi-modal and multi-class anomaly detection. At the core of UniMMAD is a Mixture-of-Experts (MoE)-driven feature decompression mechanism, which enables adaptive and disentangled reconstruction tailored to specific domains.This process is guided by a ``general \u2192 specific'' paradigm. In the encoding stage, multi-modal inputs of varying combinations are compressed into compact, general-purpose features. The encoder incorporates a feature compression module to suppress latent anomalies, encourage cross-modal interaction, and avoid shortcut learning.In the decoding stage, the general features are decompressed into  modality-specific  and class-specific forms via a sparsely-gated cross MoE, which dynamically selects expert pathways based on input modality and class. To further improve efficiency, we design a grouped dynamic filtering mechanism and an MoE-in-MoE structure, reducing MoE parameter usage by approximately 75\\% while maintaining sparse activation and fast inference. UniMMAD achieves state-of-the-art performance on 9 anomaly detection  datasets, spanning 3 fields, 12 modalities, and 66 classes. The code will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36531", "url": null, "sourceid": 33561, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36535, "uid": "2fefdc64cea4bfeacce657d0465e6eb5", "name": "SpiderCam: Low-Power Snapshot Depth from Differential Defocus", "authors": [{"id": 169102, "fullname": "Marcos Ferreira", "url": "http://cvpr.thecvf.com/api/miniconf/users/169102?format=json", "institution": "Northwestern University"}, {"id": 159842, "fullname": "Tianao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/159842?format=json", "institution": "Northwestern University"}, {"id": 154563, "fullname": "John Mamish", "url": "http://cvpr.thecvf.com/api/miniconf/users/154563?format=json", "institution": "Georgia Institute of Technology"}, {"id": 154566, "fullname": "Josiah Hester", "url": "http://cvpr.thecvf.com/api/miniconf/users/154566?format=json", "institution": "Georgia Institute of Technology"}, {"id": 185298, "fullname": "Yaman Sangar", "url": "http://cvpr.thecvf.com/api/miniconf/users/185298?format=json", "institution": "Georgia Institute of Technology"}, {"id": 126346, "fullname": "Qi Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/126346?format=json", "institution": "Purdue University"}, {"id": 88885, "fullname": "Emma Alexander", "url": "http://cvpr.thecvf.com/api/miniconf/users/88885?format=json", "institution": "Northwestern University"}], "abstract": "We introduce SpiderCam, an FPGA-based snapshot depth-from-defocus camera which produces 480x400 sparse depth maps in real-time at 32.5 FPS over a working range of 52 cm while consuming 611 mW of power in total. SpiderCam comprises a custom camera which simultaneously captures two differently focused images of the same scene, processed with a SystemVerilog implementation of depth from differential defocus (DfDD) on a low-power FPGA. To achieve state-of-the-art power consumption, we present algorithmic improvements to DfDD that overcome challenges caused by low-power sensors, and design a memory-local implementation for streaming depth computation on a device that is too small to store even a single image pair. We report the first sub-Watt total power measurement for passive FPGA-based 3D cameras in the literature.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36535", "url": null, "sourceid": 43581, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36537, "uid": "39f4bdc74fe2fa9312f7eff5335c00c5", "name": "Ultra-Fast Neural Video Compression", "authors": [{"id": 132590, "fullname": "Jiahao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/132590?format=json", "institution": "Microsoft Research Asia"}, {"id": 88230, "fullname": "Wenxuan Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/88230?format=json", "institution": "Microsoft Research Asia"}, {"id": 131172, "fullname": "Zhaoyang Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/131172?format=json", "institution": "University of Science and Technology of China"}, {"id": 88028, "fullname": "Bin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88028?format=json", "institution": "Microsoft"}, {"id": 185301, "fullname": "Zongyu Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185301?format=json", "institution": "Microsoft Research"}, {"id": 88233, "fullname": "Xiaoyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88233?format=json", "institution": "Research, Microsoft"}, {"id": 87597, "fullname": "Yan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87597?format=json", "institution": "Microsoft Research Asia"}], "abstract": "While neural video codecs (NVCs) have demonstrated superior compression ratio, their prohibitive computational complexity remains a critical barrier to real-world deployment. This paper introduces a chunk-based coding framework designed to significantly improve the rate-distortion-complexity trade-off. Instead of processing frames sequentially, our approach encodes a chunk of multiple frames into a single compact latent representation and decodes them simultaneously. This is enabled by cross-frame interaction modules for joint spatial-temporal modeling and frame-specific decoders for parallel reconstruction. This paradigm not only dramatically enhances coding throughput but also facilitates more effective modeling of long-term temporal correlations. To further boost speed, we propose a streamlined entropy coding mechanism that consolidates bit-stream interactions into a single step, substantially reducing decoding overhead. Building on these innovations, we present DCVC-UF (Ultra-Fast), a new NVC that sets a new SOTA in performance. Our experiments show that DCVC-UF can achieve ultra-fast encoding and decoding speeds, significantly outperforming previous leading codecs. DCVC-UF serves as a notable landmark in the journey of NVC evolution. Both training and testing codes will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36537", "url": null, "sourceid": 31703, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36538, "uid": "5ee8bb7c3c285851db9b969e956afc36", "name": "Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation", "authors": [{"id": 185302, "fullname": "Tzu-Ling Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185302?format=json", "institution": "University of Saskatchewan"}, {"id": 166933, "fullname": "Ian Stavness", "url": "http://cvpr.thecvf.com/api/miniconf/users/166933?format=json", "institution": "University of Saskatchewan"}, {"id": 183708, "fullname": "Mrigank Rochan", "url": "http://cvpr.thecvf.com/api/miniconf/users/183708?format=json", "institution": "University of Saskatchewan, Canada"}], "abstract": "Video Unsupervised Domain Adaptation (VUDA) poses a significant challenge in action recognition, requiring the adaptation of a model from a labeled source domain to an unlabeled target domain. Despite recent advances, existing VUDA methods often fall short of fully supervised performance, a key reason being the prevalence of static and uninformative backgrounds that exacerbate domain shifts. Additionally, prior approaches largely overlook computational efficiency, limiting real-world adoption. To address these issues, we propose Learnable Motion-Focused Tokenization (LMFT) for VUDA. LMFT tokenizes video frames into patch tokens and learns to discard low-motion, redundant tokens, primarily corresponding to background regions, while retaining motion-rich, action-relevant tokens for adaptation. Extensive experiments on three standard VUDA benchmarks across 21 domain adaptation settings show that our VUDA framework with LMFT achieves state-of-the-art performance while significantly reducing computational overhead. LMFT thus enables VUDA that is both effective and computationally efficient.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36538", "url": null, "sourceid": 31950, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36539, "uid": "009ff90f082aaa9fcd6e14caf65c7cc6", "name": "Cycle-Consistent Tuning for Layered Image Decomposition", "authors": [{"id": 181691, "fullname": "Zheng Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181691?format=json", "institution": "Shenzhen University"}, {"id": 149200, "fullname": "Min Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/149200?format=json", "institution": "Shenzhen University"}, {"id": 147257, "fullname": "Zhida Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/147257?format=json", "institution": "Shenzhen University"}, {"id": 88441, "fullname": "Dani Lischinski", "url": "http://cvpr.thecvf.com/api/miniconf/users/88441?format=json", "institution": "The Hebrew University of Jerusalem, Israel"}, {"id": 87616, "fullname": "Daniel Cohen-Or", "url": "http://cvpr.thecvf.com/api/miniconf/users/87616?format=json", "institution": "Google"}, {"id": 85724, "fullname": "Hui Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85724?format=json", "institution": "Shenzhen University"}], "abstract": "Disentangling visual layers in real-world images is a persistent challenge in vision and graphics, as such layers often involve non-linear and globally coupled interactions, including shading, reflection, and perspective distortion. In this work, we present an in-context image decomposition framework that leverages large diffusion foundation models for layered separation. We focus on the challenging case of logo-object decomposition, where the goal is to disentangle a logo from the surface on which it appears while faithfully preserving both layers. Our method fine-tunes a pretrained diffusion model via lightweight LoRA adaptation and introduces a cycle-consistent tuning strategy that jointly trains decomposition and composition models, enforcing reconstruction consistency between decomposed and recomposed images. This bidirectional supervision substantially enhances robustness in cases where the layers exhibit complex interactions. Furthermore, we introduce a progressive self-improving process, which iteratively augments the training set with high-quality model-generated examples to refine performance. Extensive experiments demonstrate that our approach achieves accurate and coherent decompositions and also generalizes effectively across other decomposition types, suggesting its potential as a unified framework for layered image decomposition.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36539", "url": null, "sourceid": 44052, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36540, "uid": "124c6149f09717e388e1f286163b130b", "name": "Free-Grained Hierarchical Visual Recognition", "authors": [{"id": 135118, "fullname": "Seulki Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/135118?format=json", "institution": "University of Michigan - Ann Arbor"}, {"id": 157804, "fullname": "Zilin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157804?format=json", "institution": "University of Michigan - Ann Arbor"}, {"id": 73935, "fullname": "Stella X. Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73935?format=json", "institution": "University of Michigan - Ann Arbor"}], "abstract": "Hierarchical image recognition predicts labels across a semantic taxonomy, but existing methods typically assume complete, fine-grained labels, an assumption rarely met in practice. Real-world annotations vary in granularity due to image quality, annotator expertise, and task goals; a distant bird may be labeled \"Bird'', while a close-up reveals \"Bank Swallow''. We formalize this realistic setting as free-grain learning, where each image may be labeled at any taxonomy level, while the model must still learn the full hierarchical path. To study this problem, we build diverse benchmarks that provide labels at varying semantic granularity, including a new three-level ImageNet-F and mixed-granularity variants of datasets. We further develop strong baselines that improve learning under mixed supervision through (1) semantic guidance from vision\u2013language models and (2) visual guidance via semi-supervised learning. Together, our benchmarks and methods advance hierarchical recognition under real-world constraints.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36540", "url": null, "sourceid": 32212, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36541, "uid": "9fae7572ff8bae93fc21468e873cb4ac", "name": "FlashCap: Millisecond-Accurate Human Motion  Capture via Flashing LEDs and Event-Based Vision", "authors": [{"id": 185303, "fullname": "Zekai Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185303?format=json", "institution": "Xiamen University"}, {"id": 129247, "fullname": "Shuqi Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/129247?format=json", "institution": "Xiamen University"}, {"id": 160148, "fullname": "Mengyin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/160148?format=json", "institution": "Xiamen University"}, {"id": 154306, "fullname": "Yuhua Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/154306?format=json", "institution": "Xiamen University"}, {"id": 129253, "fullname": "Xincheng Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/129253?format=json", "institution": "Xiamen University"}, {"id": 77069, "fullname": "Ming Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/77069?format=json", "institution": "Xiamen University"}, {"id": 185304, "fullname": "Junhao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185304?format=json", "institution": "Xiamen University"}, {"id": 185305, "fullname": "Xiuhong Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185305?format=json", "institution": "Xiamen University"}, {"id": 132350, "fullname": "Yuexin Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/132350?format=json", "institution": "ShanghaiTech University"}, {"id": 86653, "fullname": "Chenglu Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/86653?format=json", "institution": "Xiamen University"}, {"id": 73999, "fullname": "Lan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73999?format=json", "institution": "ShanghaiTech University"}, {"id": 86709, "fullname": "Siqi Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/86709?format=json", "institution": "Xiamen University"}, {"id": 86652, "fullname": "Cheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86652?format=json", "institution": "Xiamen University"}], "abstract": "Precise motion timing (PMT) is crucial for swift motion analysis. A millisecond difference may determine victory or defeat in sports competitions. Despite substantial progress in human pose estimation (HPE), PMT remains largely overlooked by the HPE community due to the limited availability of high-temporal-resolution labeled datasets. Today, PMT is achieved using high-speed RGB cameras in specialized scenarios such as the Olympic Games; however, their high costs, light sensitivity, bandwidth, and computational complexity limit their feasibility for daily use. We developed FlashCap, the first flashing LED-based MoCap system for PMT. With FlashCap, we collect a millisecond-resolution human motion dataset, FlashMotion, comprising the event, RGB, LiDAR, and IMU modalities, and demonstrate its high quality through rigorous validation. To evaluate the merits of FlashMotion, we perform two tasks: precise motion timing and high-temporal-resolution HPE. For these tasks, we propose ResPose, a simple yet effective baseline that learns residual poses based on events and RGBs. Experimental results show that ResPose reduces pose estimation errors by ~40% and achieves millisecond-level timing accuracy, enabling new research opportunities. The dataset and code will be shared with the community.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36541", "url": null, "sourceid": 45509, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36545, "uid": "8698d181e9fe0f253656bcf4419ce43d", "name": "SceneTok: A Compressed, Diffusable Token Space for 3D Scenes", "authors": [{"id": 149832, "fullname": "Mohammad Asim", "url": "http://cvpr.thecvf.com/api/miniconf/users/149832?format=json", "institution": "Max Planck Institute for Informatics"}, {"id": 126927, "fullname": "Christopher Wewer", "url": "http://cvpr.thecvf.com/api/miniconf/users/126927?format=json", "institution": "Max Planck Institute for Informatics, Saarland Informatics Campus"}, {"id": 126719, "fullname": "Jan Lenssen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126719?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}], "abstract": "We present SceneTok, a novel tokenizer for encoding view sets of scenes into a compressed and diffusable set of unstructured tokens. Existing approaches for 3D scene representation and generation commonly use 3D data structures or view-aligned fields. In contrast, we introduce the first method that encodes scene information into a small set of permutation invariant tokens that is disentangled from the spatial grid. The scene tokens are predicted by a multi-view tokenizer given many context views and rendered into novel views by employing a light-weight rectified flow decoder. A diffusion transformer enables scene generation on the compressed token space. We show that the compression is two orders of magnitude stronger than for other representations while still reaching state-of-the-art reconstruction quality. Further, our representation can be rendered from novel trajectories, including ones deviating from the input trajectory, and we show that the decoder gracefully handles uncertainty. Finally, the highly-compressed set of unstructured latent scene tokens enables simple and efficient scene generation in 8 seconds, achieving a much better quality-speed tradeoff than previous paradigms.Our code and trained models will be released upon acceptance of the paper.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36545", "url": null, "sourceid": 30710, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36548, "uid": "53e30d90815c9c1a590e70d0120525f8", "name": "OmniFood8K: Single-Image Nutrition Estimation via Hierarchical Frequency-Aligned Fusion", "authors": [{"id": 183514, "fullname": "Dongjian Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183514?format=json", "institution": "Yunnan University"}, {"id": 185318, "fullname": "Weiqing Min", "url": "http://cvpr.thecvf.com/api/miniconf/users/185318?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 185319, "fullname": "Qian Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185319?format=json", "institution": "Yunnan University"}, {"id": 185320, "fullname": "Xing Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185320?format=json", "institution": "Yunnan University"}, {"id": 185321, "fullname": "Xin Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185321?format=json", "institution": null}, {"id": 85108, "fullname": "Shuqiang Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85108?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}], "abstract": "Accurate estimation of food nutrition plays a vital role in promoting healthy dietary habits and personalized diet management. Most existing food datasets focus on Western cuisines, with limited coverage of Chinese dishes, leading to limitations in accurate nutritional estimation for Chinese meals. Moreover, many state-of-the-art nutrition prediction methods rely on depth sensors, restricting their applicability in daily scenarios. To address these limitations, we introduce OmniFood8K, a comprehensive multimodal dataset comprising 8,036 food scenes with detailed nutritional annotations and multi-view images for each scene. In addition, to enhance models\u2019 capability in nutritional prediction, we construct NutritionSynth-115K, a large-scale synthetic dataset that introduces compositional variations while preserving precise nutritional labels. Moreover, we propose an end-to-end framework to predict nutritional information from a single RGB image. We first predict a depth map from a single RGB image, then refine it using our Scale-Shift Residual Adapter (SSRA), which enforces global scale consistency and preserves local structural details. Second, the Frequency-Aligned Fusion Module (FAFM) hierarchically fuses RGB and adapted depth features, aligning multi-modal representations in the frequency domain across layers. Third, the Mask-based Prediction Head (MPH) emphasizes key ingredient regions via dynamic channel selection, improving prediction accuracy. Extensive experiments on multiple datasets demonstrate that our method outperforms existing approaches, providing a practical solution for daily dietary assessment.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36548", "url": null, "sourceid": 40189, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36551, "uid": "f37ef71e570a8c693cac4cfca7654f8e", "name": "Hear you are: Teaching LLMs Spatial Reasoning with Vision and Spatial Sound", "authors": [{"id": 90970, "fullname": "Hyeonggon Ryu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90970?format=json", "institution": "KAIST"}, {"id": 126430, "fullname": "Joon Chung", "url": "http://cvpr.thecvf.com/api/miniconf/users/126430?format=json", "institution": "KAIST"}, {"id": 127636, "fullname": "David Harwath", "url": "http://cvpr.thecvf.com/api/miniconf/users/127636?format=json", "institution": "University of Texas, Austin"}], "abstract": "Many audio-visual learning methods have focused on aligning audio and visual information, either through semantic or temporal correspondence. However, most of these works have utilized monaural audio, which does not contain information about the spatial location of the sound source. In contrast, humans and other animals utilize binaural hearing to perceive this spatial information. Combining spatial sound and visual perception enables powerful high-level reasoning: for example, a person looking for their phone may hear the ringing sound coming from a backpack sitting on a table, and quickly infer that the missing phone is inside the backpack. In this paper, we investigate the problem of Audio-Visual Spatial Reasoning. We design a spatial audio-visual question answering dataset to cover scenarios where semantic correspondence between audio and visual signals is absent but spatial alignment exists, as well as cases with multiple audio-visual semantic correspondences that require spatial reasoning to disambiguate. We propose a model that learns spatial comprehension across the audio and vision modalities by connecting them with a large language model and experimentally demonstrate that spatial sound perception is an essential part of our task.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36551", "url": null, "sourceid": 37485, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36552, "uid": "2a0231d2d461c00f311124486fc4d8f8", "name": "CoRiM: Conflict-driven Risk Minimization for Dynamic Multimodal Fusion", "authors": [{"id": 144571, "fullname": "shihao Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/144571?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 130050, "fullname": "Wei Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/130050?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Dynamic multimodal fusion methods lack robust theoretical guidance for handling modal conflicts and inconsistent data quality. While recent theory-based works correlate weights with indirect scalar proxies (e.g., loss or confidence), this paradigm struggles to comprehensively capture the risk driven by direct distribution inconsistencies. In this paper, we propose a  **Co**nflict-driven **Ri**sk **M**inimization (**CoRiM**) dynamic fusion paradigm. Specifically, we redefine dynamic fusion as a principled, per-sample, direct risk minimization task. To this end, we first design a novel, differentiable Modality Conflict Risk (MCR) function, $\\mathcal{R}(w)$, which quantifies risk by directly modeling fused uncertainty and inter-modal consistency. Second, we identify that minimizing $\\mathcal{R}(w)$ is fundamentally a non-convex constrained optimization problem over the probabilistic simplex. To efficiently solve this specific challenge, we innovatively introduce the projection free Frank-Wolfe (FW) algorithm, as it is perfectly suited for optimization on the simplex.We prove that our designed $\\mathcal{R}(w)$ possesses L-smoothness, which provides theoretical guarantees for the convergence of the FW algorithm on our non-convex objective. Extensive experiments on multiple benchmark datasets demonstrate that CoRiM outperforms current state-of-the-art methods in high-conflict and noisy environments, validating the robustness of our method.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36552", "url": null, "sourceid": 40333, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36557, "uid": "bf3e0df27b04abb4c107ab4df9955b29", "name": "SegMo: Co-Designing Content-Aware Sparsity and Locally-Cohesive Segment Parallelism for Efficient VLM Inference", "authors": [{"id": 180445, "fullname": "Haojuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180445?format=json", "institution": "Shanghai Innovation Institute"}, {"id": 185332, "fullname": "Tang Ruohan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185332?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 185333, "fullname": "Dongzhou Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185333?format=json", "institution": "Shanghai Innovation Institute; Southeast University"}, {"id": 185334, "fullname": "Zongpu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185334?format=json", "institution": "Purdue University"}, {"id": 176966, "fullname": "Jian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/176966?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 77217, "fullname": "Jiaqi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77217?format=json", "institution": "Shanghai AI Laboratory"}], "abstract": "Video Large Language Models (VideoLLMs) face a fundamental performance bottleneck: the token explosion intrinsic to video inputs. The resulting $O(N^2)$ prefill cost makes conventional Transformer inference prohibitively expensive at scale. Existing attempts fall into a hard accuracy\u2013latency dilemma: naive sparsification risks losing essential temporal\u2013spatial context, whereas naive parallelization introduces substantial communication and memory overhead.To overcome this impasse, we argue that algorithm\u2013system co-design is not optional but necessary, jointly optimizing what to compute (sparsification) and how to compute it (parallelism). We introduce SegMo, a unified framework that instantiates this co-design principle and enables efficient, accurate VideoLLM inference at scale.SegMo is driven by the empirical insight that VideoLLM attention exhibits Local Cohesion. Our system implements this via two integrated components:(1) Content-Aware Sparsification (CAS): A lightweight, hierarchical algorithm that first employs Query Relevance for scene-level assessment, and then uses Temporal Redundancy for intra-scene static redundancy pruning, to generate a precise, non-uniform computation load, ensuring accuracy.(2) Locally-Cohesive Segment Parallelism (LSP): A novel paradigm that leverages attention locality to partition the video at scene boundaries, using a lightweight Global Context Injection mechanism to replace the massive communication and memory overheads of global attention.SegMo was validated across LVBench, LongVideoBench, and Video-MME. Our CAS module improved accuracy by up to 12.00%. When integrated with LSP, the full system (CAS + LSP) achieved a peak prefill acceleration of 3.55x, while still maintaining a significant accuracy gain of up to 8.31%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36557", "url": null, "sourceid": 43420, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36558, "uid": "2d5421eb54f4118e3b933e87baa6cc1c", "name": "DIMOS: Disentangling Instance-level Moving Object Segmentation", "authors": [{"id": 180425, "fullname": "Hongxiang HUANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/180425?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 185335, "fullname": "Hongwei Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/185335?format=json", "institution": "Harbin Institute of Technology"}, {"id": 185336, "fullname": "Xiaopeng LIN", "url": "http://cvpr.thecvf.com/api/miniconf/users/185336?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 130328, "fullname": "Yulong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130328?format=json", "institution": "Central South University"}, {"id": 73868, "fullname": "Zeke Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/73868?format=json", "institution": "Baidu Research"}, {"id": 130278, "fullname": "Bojun Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/130278?format=json", "institution": "Hong Kong University of Science and Technology (Guanggzhou)"}], "abstract": "Moving instance segmentation (MIS) attracts increasing attention due to its broad applications in traffic surveillance, autonomous driving, and animal tracking. Event cameras record asynchronous brightness changes, providing high temporal resolution and dynamic range, which makes them highly sensitive to motion information. By fusing event and image features, motion cues from events can complement spatial details from images, enhancing the performance of MIS. However, current multimodal MIS methods still struggle to segment small moving instances, as event cameras often yield sparse features under limited resolution. Moreover, event features entangle appearance attributes with motion cues, which further restricts effective cross-modal fusion. To address these challenges, we first propose a dual-disentangling feature extraction framework that separates and extracts appearance and motion information within both image and event modalities, thereby improving feature density. Subsequently, a multi-granularity cross-modal alignment is introduced to align distributionally and semantically consistent features across modalities, enabling more effective fusion with rich spatial and temporal details. The experiment results demonstrate that our method achieves state-of-the-art performance in multimodal MIS, especially for small instances under challenging conditions such as fast motion and low-light settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36558", "url": null, "sourceid": 30900, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36565, "uid": "922e012726a1c14d3321f51798dda751", "name": "Beyond Single Images: A Comprehensive Benchmark for Album-Level Vision-Language Understanding", "authors": [{"id": 185365, "fullname": "Shawn Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185365?format=json", "institution": "Brigham Young University"}, {"id": 84672, "fullname": "Brian Price", "url": "http://cvpr.thecvf.com/api/miniconf/users/84672?format=json", "institution": "Adobe Research"}, {"id": 96587, "fullname": "Yifei Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/96587?format=json", "institution": "Adobe Systems"}, {"id": 69363, "fullname": "Bryan Morse", "url": "http://cvpr.thecvf.com/api/miniconf/users/69363?format=json", "institution": "Brigham Young University"}], "abstract": "Automatic album organization has been studied extensively over the past decades due to significant progress in digital photography. Recent vision-language models (VLMs) have shown strong performance on multi-image understanding, making them natural candidates for automating album organization workflows. While VLMs' abilities in multi-image understanding have been widely studied, their performance on album organization remains underexplored. To bridge this gap, we introduce AlbumBench, the first comprehensive benchmark for automatic album organization. Specifically, we (1) define album organization tasks as photo selection for album-specific user objectives, photo rating according to how well user intents are fulfilled, and album-specific photo grouping given a user query which requires contextual understanding of the album; (2) establish AlbumBench, a benchmark dataset containing 27051 images across 641 albums with 5 annotations per image; and (3) evaluate mainstream open-source and proprietary VLMs on AlbumBench. We show that AlbumBench presents unique challenges compared to traditional multi-image understanding benchmarks due to its requirement for understanding album context and user intent. Our findings reveal a significant performance gap between open-source and proprietary VLMs on album organization tasks. Despite this gap, even the best-performing proprietary models sometimes struggle with tasks that humans find relatively easy. We hope that AlbumBench can serve as a foundation for unifying album organization research and motivate improvements in VLMs' performance on these tasks.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36565", "url": null, "sourceid": 38140, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36567, "uid": "2cf7926aeec52fbe4f1a6ae2a1770329", "name": "UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in RL", "authors": [{"id": 185368, "fullname": "Rui Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/185368?format=json", "institution": "Apple; Fudan University"}, {"id": 88896, "fullname": "Mingfei Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88896?format=json", "institution": "Apple"}, {"id": 140628, "fullname": "Haiming Gang", "url": "http://cvpr.thecvf.com/api/miniconf/users/140628?format=json", "institution": "Apple"}, {"id": 150987, "fullname": "Jiasen Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/150987?format=json", "institution": "Apple"}, {"id": 87966, "fullname": "Zhe Gan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87966?format=json", "institution": "Apple"}, {"id": 84650, "fullname": "Yinfei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84650?format=json", "institution": "Apple"}, {"id": 74132, "fullname": "Zuxuan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/74132?format=json", "institution": "Fudan University"}, {"id": 151014, "fullname": "Afshin Dehghan", "url": "http://cvpr.thecvf.com/api/miniconf/users/151014?format=json", "institution": "Apple"}], "abstract": "We present UniGen-1.5, a unified multimodal large language model (MLLM) for advanced image understanding, generation and editing. Building upon UniGen, we comprehensively enhance the model architecture and training pipeline to strengthen the image understanding and generation capabilities while unlocking strong image editing ability. Especially, we propose a unified Reinforcement Learning (RL) strategy that improves both image generation and image editing jointly via shared reward models. To further enhance image editing performance, we propose a light Edit Instruction Alignment stage that significantly improves the editing instruction comprehension that is essential for the success of the RL training. Experimental results show that UniGen-1.5 demonstrates competitive understanding and generation performance. Specifically, UniGen-1.5 achieves 0.89 and 4.31 overall scores on GenEval and ImgEdit benchmarks that surpass the state-of-the-art models such as BAGEL and reaching performance comparable to proprietary models such as GPT-Image-1.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36567", "url": null, "sourceid": 31302, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36573, "uid": "51323239e69063c691d0dcfdb046b1b4", "name": "SSM-Aware Token-Efficient VMamba via Adaptive Patch Pruning and Merging for Person Re-Identification", "authors": [{"id": 181148, "fullname": "HUANG HUIYUAN", "url": "http://cvpr.thecvf.com/api/miniconf/users/181148?format=json", "institution": "Kookmin University"}, {"id": 183741, "fullname": "SANG MIN YOON", "url": "http://cvpr.thecvf.com/api/miniconf/users/183741?format=json", "institution": "Kookmin University"}], "abstract": "Person re-identification (Re-ID) requires a balance between discriminative capability and computational efficiency for real-world deployment. However, even the Visual State Space Model (SSM), despite its linear complexity, suffers from redundant computation due to dense token processing. We propose SSM-aware Token-Efficient VMamba (TE-VMamba), which integrates adaptive patch pruning and merging modules to reduce redundant tokens while preserving identity-discriminative cues. The layer-adaptive pruning strategy removes low-importance tokens in shallow layers to enhance efficiency, whereas the depth-aware merging strategy consolidates semantically similar tokens in deeper layers to improve representation compactness. Learnable layer-wise thresholds dynamically balance accuracy and computational cost across the network. On the Market-1501 benchmark, TE-VMamba reduces FLOPs by over 60\\%, achieving significant computational savings while maintaining competitive accuracy. These results highlight the potential of structured token reduction in state-space models for efficient and powerful person re-identification.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36573", "url": null, "sourceid": 34833, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36574, "uid": "c230257fc8994f9835dc50c0b267db0e", "name": "AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects", "authors": [{"id": 127472, "fullname": "Danrui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/127472?format=json", "institution": "Rutgers University"}, {"id": 69924, "fullname": "Jiahao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/69924?format=json", "institution": "The Australian National University"}, {"id": 75545, "fullname": "Bernhard Egger", "url": "http://cvpr.thecvf.com/api/miniconf/users/75545?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 126514, "fullname": "Moitreya Chatterjee", "url": "http://cvpr.thecvf.com/api/miniconf/users/126514?format=json", "institution": "Mitsubishi Electric Research Labs"}, {"id": 88004, "fullname": "Suhas Lohit", "url": "http://cvpr.thecvf.com/api/miniconf/users/88004?format=json", "institution": "Mitsubishi Electric Research Labs"}, {"id": 98153, "fullname": "Tim Marks", "url": "http://cvpr.thecvf.com/api/miniconf/users/98153?format=json", "institution": "Mitsubishi Electric Research Labs (MERL)"}, {"id": 76845, "fullname": "Anoop Cherian", "url": "http://cvpr.thecvf.com/api/miniconf/users/76845?format=json", "institution": "Mitsubishi Electric Research Labs (MERL)"}], "abstract": "Assembling objects from parts requires understanding multimodal instructions, linking them to 3D components, and predicting physically plausible 6-DoF motions for each assembly step. Existing datasets focus on simplified scenarios, overlooking shape complexities and assembly trajectories in industrial assemblies. We introduce AssemblyBench, a synthetic dataset of 2,789 industrial objects with multimodal instruction manuals, corresponding 3D part models, and part assembly trajectories. We also propose a transformer-based model, AssemblyDyno, which uses the instructional manual and the 3D shape of each part to jointly predict assembly order and part assembly trajectories. AssemblyDyno outperforms prior works in both assembly pose estimation and trajectory feasibility, where the latter is evaluated by our physics-based simulations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36574", "url": null, "sourceid": 31178, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36575, "uid": "9008bcab767d69a3d185b4936687940e", "name": "Cross-modal Fuzzy Alignment Network for Text-Aerial Person Retrieval and A Large-scale Benchmark", "authors": [{"id": 144489, "fullname": "Yifei Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/144489?format=json", "institution": "Anhui University; Anhui University"}, {"id": 183391, "fullname": "Chenglong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183391?format=json", "institution": "Anhui University"}, {"id": 185385, "fullname": "YUYANG ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/185385?format=json", "institution": "University of Hong Kong"}, {"id": 185386, "fullname": "Guyue Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185386?format=json", "institution": "Anhui University"}, {"id": 126842, "fullname": "Jin Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126842?format=json", "institution": "Anhui University"}], "abstract": "Text-aerial person retrieval aims to identify targets in UAV-captured images from eyewitness descriptions, supporting intelligent transportation and public security applications. Compared to ground-view text\u2013image person retrieval, UAV-captured images often suffer from degraded visual information due to drastic variations in viewing angles and flight altitudes, making semantic alignment with textual descriptions very challenging.    To address this issue, we propose a novel Cross-modal Fuzzy Alignment Network (CFANet), which quantifies the token-level reliability by fuzzy logic to achieve accurate fine-grained alignment and incorporates ground-view images as a bridge agent to further mitigate the gap between aerial images and text descriptions, for text\u2013aerial person retrieval. In particular, we design the Fuzzy Token Alignment  module that employs the fuzzy membership function to dynamically model token-level association strength and suppress the influence of unobservable or noisy tokens. It can alleviate the semantic inconsistencies caused by missing visual cues and significantly enhance the robustness of token-level semantic alignment. Moreover, to further mitigate the gap between aerial images and text descriptions, we design a Context-Aware Dynamic Alignment module to incorporates the ground-view agent as a bridge in text\u2013aerial alignment and adaptively combine direct alignment and agent-assisted alignment to improve the robustness. In addition, we construct a large-scale benchmark dataset called AERI-PEDES by using a chain-of-thought to decompose text generation into attribute parsing, initial captioning, and refinement, thus boosting textual accuracy and semantic consistency. Experiments on AERI-PEDES and TBAPR demonstrate the superiority of our method. The code and dataset will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36575", "url": null, "sourceid": 39449, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36576, "uid": "29f30bcecaff4c40c0b778b6eb7cb1b4", "name": "Concept-Guided Fine-tuning: Steering ViTs away from Spurious Correlations to Improve Robustness", "authors": [{"id": 180959, "fullname": "Yehonatan Elisha", "url": "http://cvpr.thecvf.com/api/miniconf/users/180959?format=json", "institution": "Tel Aviv University"}, {"id": 185387, "fullname": "Oren Barkan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185387?format=json", "institution": "Open University of Israel"}, {"id": 185388, "fullname": "Noam Koenigstein", "url": "http://cvpr.thecvf.com/api/miniconf/users/185388?format=json", "institution": "Tel Aviv University"}], "abstract": "Vision Transformers (ViTs) often fail under distribution shifts because they learn spurious correlations, such as background cues, rather than semantically meaningful features. Existing regularization methods, typically relying on simple foreground/background masks, overlook the fine-grained semantic concepts that truly define an object (e.g., \"long beak\" and \"wings\" for a \"bird\"). To address this, we introduce a novel finetuning framework that steers model reasoning toward concept-level semantics. Our approach optimizes the model's internal relevance maps (via AttnLRP) to align with spatially-grounded concept masks. These guidance masks are generated automatically and without manual annotation: class-relevant concepts are first proposed using an LLM-driven, label-free method, and then segmented using a Vision-Language Model (GroundingSAM). The finetuning objective aligns relevance with these concept regions while simultaneously suppressing focus on spurious background areas and preserving classifier confidence via a dedicated loss term. This process requires only a minimal set of images and uses half of the dataset classes. Extensive experiments on five out-of-distribution (OOD) benchmarks, show that our method significantly enhances model robustness across multiple ViT-based models and an additional CNN model. Furthermore, we validate that the resulting relevance maps exhibit improved alignment with semantic object parts, providing a scalable path toward more robust and interpretable vision models. Finally, we confirm that concept-guided masks provide more effective guidance for model robustness than conventional segmentation maps, validating our hypothesis.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36576", "url": "https://yonisgit.github.io/concept-ft/", "sourceid": 36674, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36582, "uid": "6bd22a86b1f7a3a11de928d301f86d67", "name": "S$^2$AM3D: Scale-controllable Part Segmentation of 3D Point Clouds", "authors": [{"id": 185409, "fullname": "Han Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/185409?format=json", "institution": "Harbin Institute of Technology; Harbin Institute of Technology"}, {"id": 129853, "fullname": "Tianyu Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129853?format=json", "institution": "Harbin Institute of Technology &amp; City University of Hong Kong"}, {"id": 185410, "fullname": "Zichen Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185410?format=json", "institution": "Harbin Institute of Technology"}, {"id": 89739, "fullname": "Xiaohe Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89739?format=json", "institution": "Harbin Institute of Technology"}, {"id": 84797, "fullname": "Wangmeng Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/84797?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Part-level point cloud segmentation has recently attracted significant attention in 3D computer vision.Nevertheless, existing research is constrained by two major challenges: native 3D models lack generalization due to data scarcity, while introducing 2D pre-trained knowledge often leads to inconsistent segmentation results across different views.To address these challenges, we propose S$^2$AM3D, which incorporates 2D segmentation priors with 3D consistent supervision. We design a point-consistent part encoder that aggregates multi-view 2D features through native 3D contrastive learning, producing globally consistent point features. A scale-aware prompt decoder is then proposed to enable real-time adjustment of segmentation granularity via continuous scale signals. Simultaneously, we introduce a large-scale, high-quality part-level point cloud dataset with more than 100k samples, providing ample supervision signals for model training.Extensive experiments demonstrate that S$^2$AM3D achieves leading performance across multiple evaluation settings, exhibiting exceptional robustness and controllability when handling complex structures and parts with significant size variations.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36582", "url": null, "sourceid": 39120, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40267?format=json"], "related_events_ids": [40267]}, {"id": 40267, "uid": "6bd22a86b1f7a3a11de928d301f86d67", "name": "S$^2$AM3D: Scale-controllable Part Segmentation of 3D Point Clouds", "authors": [{"id": 185409, "fullname": "Han Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/185409?format=json", "institution": "Harbin Institute of Technology; Harbin Institute of Technology"}, {"id": 129853, "fullname": "Tianyu Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129853?format=json", "institution": "Harbin Institute of Technology &amp; City University of Hong Kong"}, {"id": 185410, "fullname": "Zichen Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185410?format=json", "institution": "Harbin Institute of Technology"}, {"id": 89739, "fullname": "Xiaohe Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89739?format=json", "institution": "Harbin Institute of Technology"}, {"id": 84797, "fullname": "Wangmeng Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/84797?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Part-level point cloud segmentation has recently attracted significant attention in 3D computer vision.Nevertheless, existing research is constrained by two major challenges: native 3D models lack generalization due to data scarcity, while introducing 2D pre-trained knowledge often leads to inconsistent segmentation results across different views.To address these challenges, we propose S$^2$AM3D, which incorporates 2D segmentation priors with 3D consistent supervision. We design a point-consistent part encoder that aggregates multi-view 2D features through native 3D contrastive learning, producing globally consistent point features. A scale-aware prompt decoder is then proposed to enable real-time adjustment of segmentation granularity via continuous scale signals. Simultaneously, we introduce a large-scale, high-quality part-level point cloud dataset with more than 100k samples, providing ample supervision signals for model training.Extensive experiments demonstrate that S$^2$AM3D achieves leading performance across multiple evaluation settings, exhibiting exceptional robustness and controllability when handling complex structures and parts with significant size variations.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40267", "url": null, "sourceid": -39120, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/36582?format=json"], "related_events_ids": [36582]}, {"id": 36586, "uid": "ba288e94194f1abd5c2cb5f9313905a5", "name": "Tracking by Predicting 3-D Gaussians Over Time", "authors": [{"id": 168526, "fullname": "Tanish Baranwal", "url": "http://cvpr.thecvf.com/api/miniconf/users/168526?format=json", "institution": "UC Berkeley"}, {"id": 140649, "fullname": "Himanshu Singh Singh", "url": "http://cvpr.thecvf.com/api/miniconf/users/140649?format=json", "institution": "University of California, Berkeley"}, {"id": 87446, "fullname": "Jathushan Rajasegaran", "url": "http://cvpr.thecvf.com/api/miniconf/users/87446?format=json", "institution": "University of California Berkeley"}, {"id": 75437, "fullname": "Jitendra Malik", "url": "http://cvpr.thecvf.com/api/miniconf/users/75437?format=json", "institution": "University of California at Berkeley"}], "abstract": "We propose Video Gaussian Masked Autoencoders (Video-GMAE), a self-supervised approach for representation learning  that encodes a sequence of images into a set of Gaussian splats moving over time. Representing a video as a set of Gaussians enforces a reasonable inductive bias: that 2-D videos are often consistent projections of a dynamic 3-D scene. We find that tracking emerges when pre-training a network with this architecture. Mapping the trajectory of the learnt Gaussians onto the image plane gives zero-shot tracking performance comparable to state-of-the-art. With small-scale finetuning, our models achieve 34.6% improvement on Kinetics, and 13.1% on Kubric datasets, surpassing existing self-supervised video approaches.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36586", "url": null, "sourceid": 44783, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36587, "uid": "4e997cbe9d93884f8ce686aa0094bf58", "name": "Match-and-Fuse: Consistent Generation from Unstructured Image Sets", "authors": [{"id": 181436, "fullname": "Kate Feingold", "url": "http://cvpr.thecvf.com/api/miniconf/users/181436?format=json", "institution": "Weizmann Institute of Science"}, {"id": 152957, "fullname": "Omri Kaduri", "url": "http://cvpr.thecvf.com/api/miniconf/users/152957?format=json", "institution": "Weizmann Institute of Science"}, {"id": 86757, "fullname": "Tali Dekel", "url": "http://cvpr.thecvf.com/api/miniconf/users/86757?format=json", "institution": "Weizmann Institute of Science"}], "abstract": "We present Match-and-Fuse - a zero-short, training-free method for consistent controlled generation of unstructured image sets -- collections that share a common visual element, yet differ in viewpoint, time of capture, and surrounding content. Unlike existing methods that operate on individual images or densely sampled videos, our framework performs set-to-set generation: given a source set and user prompts, it produces a new set that preserves cross-image consistency of shared content. Our key idea is to model the task as a graph, where each node corresponds to an image and each edge triggers a joint generation of image pairs. This formulation consolidates all pairwise generations into a unified framework, enforcing their local consistency while achieving global coherence across the entire set. This is achieved by fusing internal features across image pairs, guided by dense input correspondences, without requiring masks or manual supervision. This also allows us to leverage an emergent prior in text\u2011to\u2011image models that encourages coherent generation when multiple views share a single canvas. Match-and-Fuse achieves state-of-the-art consistency and visual quality, and unlocks new capabilities for content creation from image collections.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36587", "url": null, "sourceid": 45961, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36588, "uid": "83dc5dd35ff0d85a1efb042a7d1e6892", "name": "EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation", "authors": [{"id": 180440, "fullname": "Tianwei Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/180440?format=json", "institution": "University of Hong Kong; ByteDance Inc."}, {"id": 127568, "fullname": "Jun Hao Liew", "url": "http://cvpr.thecvf.com/api/miniconf/users/127568?format=json", "institution": "ByteDance"}, {"id": 155910, "fullname": "Zilong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155910?format=json", "institution": "Bytedance"}, {"id": 87202, "fullname": "Zhijie Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/87202?format=json", "institution": "Zhejiang University"}, {"id": 86968, "fullname": "Jiashi Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86968?format=json", "institution": "ByteDance"}, {"id": 86697, "fullname": "Xihui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86697?format=json", "institution": "The University of Hong Kong"}], "abstract": "Autoregressive (AR) video generative models rely on video tokenizers that compress pixels into discrete token sequences. The length of these token sequences is crucial for balancing reconstruction quality against downstream generation computational cost. Traditional video tokenizers apply a uniform token assignment across temporal blocks of different videos, often wasting tokens on simple, static, or repetitive segments while underserving dynamic or complex ones. To address this inefficiency, we introduce **EVATok**, a framework to produce **E**fficient **V**ideo **A**daptive  **Tok**enizers. Our framework estimates optimal token assignments for each video to achieve the best quality-cost trade-off, develops lightweight routers for fast prediction of these optimal assignments, and trains adaptive tokenizers that encode videos based on the assignments predicted by routers. We demonstrate that EVATok delivers substantial improvements in efficiency and overall quality for video reconstruction and downstream AR generation. Enhanced by our advanced training recipe that integrates video semantic encoders, EVATok achieves superior reconstruction and state-of-the-art class-to-video generation on UCF-101, with at least 24.4\\% savings in average token usage compared to the prior state-of-the-art LARP and our fixed-length baseline.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36588", "url": null, "sourceid": 46646, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36590, "uid": "dde2cbd066f964bf63dcc323945aecae", "name": "PhotoFramer: Multi-modal Image Composition Instruction", "authors": [{"id": 152361, "fullname": "Zhiyuan You", "url": "http://cvpr.thecvf.com/api/miniconf/users/152361?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 86862, "fullname": "Ke Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86862?format=json", "institution": "Electrical Engineering and Computer Sciences, University of California, Berkeley"}, {"id": 86848, "fullname": "He Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86848?format=json", "institution": "Adobe Systems"}, {"id": 152360, "fullname": "Xin Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/152360?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 179609, "fullname": "Jinjin Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/179609?format=json", "institution": "INSAIT, Sofia University"}, {"id": 87471, "fullname": "Tianfan Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/87471?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 75721, "fullname": "Chao Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/75721?format=json", "institution": "SIAT"}, {"id": 129904, "fullname": "Zhoutong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129904?format=json", "institution": "Adobe Systems"}], "abstract": "Composition matters during the photo-taking process, yet many casual users struggle to frame well-composed images. To provide composition guidance, we introduce PhotoFramer, a multi-modal composition instruction framework. Given a poorly composed image, PhotoFramer first describes how to improve the composition in natural language and then generates a well-composed example image. To train such a model, we curate a large-scale dataset. Inspired by how humans take photos, we organize composition guidance into a hierarchy of sub-tasks: shift, zoom-in, and view-change tasks. Shift and zoom-in data are sampled from existing cropping datasets, while view-change data are obtained via a two-stage pipeline. First, we sample pairs with varying viewpoints from multi-view datasets, and train a degradation model to transform well-composed photos into poorly composed ones. Second, we apply this degradation model to expert-taken photos to synthesize poor images to form training pairs. Using this dataset, we finetune a model that jointly processes and generates both text and images, enabling actionable textual guidance with illustrative examples. Extensive experiments demonstrate that textual instructions effectively steer image composition, and coupling them with exemplars yields consistent improvements over exemplar-only baselines. PhotoFramer offers a practical step toward composition assistants that make expert photographic priors accessible to everyday users.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36590", "url": null, "sourceid": 32052, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36596, "uid": "0d898488000a8e9d85ceefedf10725ca", "name": "Trainable Log-linear Sparse Attention for Efficient Diffusion Transformers", "authors": [{"id": 126863, "fullname": "Yifan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/126863?format=json", "institution": "Nanyang Technological University"}, {"id": 156465, "fullname": "Zeqi Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/156465?format=json", "institution": "Nanyang Technological University"}, {"id": 185432, "fullname": "Tianyi Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/185432?format=json", "institution": "Nanyang Technological University"}, {"id": 90417, "fullname": "Shuai Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90417?format=json", "institution": "Nanyang Technological University"}, {"id": 98319, "fullname": "Xingang Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/98319?format=json", "institution": "Nanyang Technological University"}], "abstract": "Diffusion Transformers (DiTs) set the state of the art in visual generation, yet their quadratic self-attention cost fundamentally limits scaling to long token sequences. Recent Top-$K$ sparse attention approaches reduce the computation of DiTs by compressing tokens into block-wise representation and selecting a small set of relevant key blocks, but still suffer from (i) quadratic selection cost on compressed tokens and (ii) increasing $K$ required to maintain model quality as sequences grow. We identify that their inefficiency is due to the single-level design, as a single coarse level is insufficient to represent the global structure.In this paper, we introduce Log-linear Sparse Attention (LLSA), a trainable sparse attention mechanism for extremely long token sequences that reduces both selection and attention costs from quadratic to log-linear complexity by utilizing a hierarchical structure. LLSA performs hierarchical Top-$K$ selection, progressively adopting sparse Top-$K$ selection with the indices found at the previous level,and introduces a Hierarchical KV Enrichment mechanism that preserves global context while using fewer tokens of different granularity during attention computation. To support efficient training, we develop a high-performance GPU implementation that uses only sparse indices for both the forward and backward passes, eliminating the need for dense attention masks.We evaluate LLSA on high-resolution pixel-space image generation without using patchification and VAE encoding. LLSA accelerates attention inference by $ 28.27 \\times$ and DiT training by $ 6.09 \\times$ on $256 \\times 256$ pixel token sequences, while maintaining generation quality. The results demonstrate that LLSA offers a promising direction for training long-sequence DiTs efficiently.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36596", "url": null, "sourceid": 44819, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36601, "uid": "9092509bfb07bff91803b6a2db1dc642", "name": "The Surprising Effectiveness of Noise Pretraining for Implicit Neural Representations", "authors": [{"id": 165338, "fullname": "Kushal Vyas", "url": "http://cvpr.thecvf.com/api/miniconf/users/165338?format=json", "institution": "Rice University"}, {"id": 154850, "fullname": "Alper Kayabasi", "url": "http://cvpr.thecvf.com/api/miniconf/users/154850?format=json", "institution": "University of California, Riverside"}, {"id": 185441, "fullname": "Daniel Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/185441?format=json", "institution": "Rice University"}, {"id": 131774, "fullname": "Vishwanath Saragadam", "url": "http://cvpr.thecvf.com/api/miniconf/users/131774?format=json", "institution": "Rice University"}, {"id": 85552, "fullname": "Ashok Veeraraghavan", "url": "http://cvpr.thecvf.com/api/miniconf/users/85552?format=json", "institution": "William Marsh Rice University"}, {"id": 131368, "fullname": "Guha Balakrishnan", "url": "http://cvpr.thecvf.com/api/miniconf/users/131368?format=json", "institution": "Rice University"}], "abstract": "The approximation and convergence properties of implicit neural representations (INRs) are known to be highly sensitive to parameter initialization strategies. Several data-driven INR parameter initialization methods demonstrate significant improvement over standard random sampling, but the reason for their successes -- whether they encode classical statistical signal priors or something more sophisticated -- is not well understood.  In this study, we explore this topic with a series of experimental analyses leveraging noise pretraining. In particular, we pretrain INRs on noise signals of different classes (e.g., Gaussian, Dead Leaves, Spectral), and measure their abilities at both fitting unseen signals and encoding priors for an inverse imaging task (denoising). Our analyses on image and video data reveal the highly surprising finding that simply pretraining on unstructured noise (Uniform, Gaussian) results in a dramatic improvement in signal fitting capacity compared to all other baselines. However, unstructured noise also yields poor deep image priors for denoising. In contrast, noise with the classic $1/|f^\\alpha|$ spectral structure of natural images yields an excellent balance of both signal fitting and inverse imaging capabilities on par with the best data-driven initialization methods. This finding can enable more efficient training of INRs in applications without sufficient prior domain-specific data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36601", "url": null, "sourceid": 46122, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36602, "uid": "f6f256d9b3c807754c99925e78035b3f", "name": "MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation", "authors": [{"id": 180045, "fullname": "Changli Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180045?format=json", "institution": "Xiamen University"}, {"id": 185442, "fullname": "Haodong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185442?format=json", "institution": "Xiamen University"}, {"id": 131760, "fullname": "Jiayi Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/131760?format=json", "institution": "Xiamen University"}, {"id": 185443, "fullname": "Yutian Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185443?format=json", "institution": "Tianjin University of Science and Technology"}, {"id": 185444, "fullname": "Chunsai Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/185444?format=json", "institution": "ByteDance Inc."}, {"id": 185445, "fullname": "Jihua Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185445?format=json", "institution": "ByteDance Inc.; Fudan University"}, {"id": 73907, "fullname": "Yanwei Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73907?format=json", "institution": "Fudan University"}, {"id": 88415, "fullname": "Liujuan Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88415?format=json", "institution": "Xiamen University"}], "abstract": "Most existing 3D referring expression segmentation (3DRES) methods rely on dense, high-quality point clouds, while real-world agents such as robots and mobile phones operate with only a few sparse RGB views and strict latency constraints. We introduce Multi-view 3D Referring Expression Segmentation (MV-3DRES), where the model must recover scene structure and segment the referred object directly from sparse multi-view images. Traditional two-stage pipelines, which first reconstruct a point cloud and then perform segmentation, often yield low-quality geometry, produce coarse or degraded target regions, and run slowly. We propose the Multimodal Visual Geometry Grounded Transformer (MVGGT), an efficient end-to-end framework that integrates language information into sparse-view geometric reasoning through a dual-branch design. Training in this setting exposes a critical optimization barrier, termed Foreground Gradient Dilution (FGD), where sparse 3D signals lead to weak supervision. To resolve this, we introduce Per-view No-target Suppression Optimization (PVSO), which provides stronger and more balanced gradients across views, enabling stable and efficient learning. To support consistent evaluation, we build MVRefer, a benchmark that defines standardized settings and metrics for MV-3DRES. Experiments show that MVGGT establishes the first strong baseline and achieves both high accuracy and fast inference, outperforming existing alternatives.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36602", "url": null, "sourceid": 31258, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36603, "uid": "d8405e1681f4d9e9e21d26be210caa7f", "name": "NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning", "authors": [{"id": 184191, "fullname": "Ishaan Singh Rawal", "url": "http://cvpr.thecvf.com/api/miniconf/users/184191?format=json", "institution": "Applied Intuition /        Texas A&amp;M University"}, {"id": 185446, "fullname": "Shubh Gupta", "url": "http://cvpr.thecvf.com/api/miniconf/users/185446?format=json", "institution": "Applied Intuition, Inc."}, {"id": 96453, "fullname": "Yihan Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/96453?format=json", "institution": "Applied Intuition"}, {"id": 88181, "fullname": "Wei Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88181?format=json", "institution": "University of California Berkeley"}], "abstract": "Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. Current VLAs face two challenges: (1) they require extensive datasets annotated with reasoning traces, and (2) these traces greatly increase token counts, inflating training and inference costs. We propose NoRD (No Reasoning for Driving), a data- and inference-efficient VLA that addresses both. Compared to existing VLAs, NoRD achieves competitive performance while being fine-tuned on atleast <60% of the data and no reasoning annotations,  resulting in 3x fewer tokens. Our approach applies Reinforcement Learning (RL) to fine-tune a Supervised fine-tuning (SFT) policy trained on a small, reasoning-free dataset. However, we observe that the standard RL algorithm, Group Relative Policy Optimization (GRPO), fails to yield significant improvements over this data-efficient SFT policy. We find that this limitation stems from difficulty bias, which disproportionately penalizes reward signals from scenarios that produce high-variance rollouts within GRPO. NoRD overcomes this limitation by incorporating Dr.GRPO, a recent algorithm designed to mitigate difficulty bias in LLMs. As a result, NoRD achieves competitive performance on Waymo and NAVSIM without large datasets, reasoning or additional inputs, enabling scalable, data-efficient training, and fast inference.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36603", "url": null, "sourceid": 42132, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36606, "uid": "c2dd0b60035ad00b08f81244a20b4860", "name": "High-Fidelity Mobile Avatars with Pruned Local Blendshapes", "authors": [{"id": 151803, "fullname": "Youyi Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/151803?format=json", "institution": "Zhejiang University"}, {"id": 107376, "fullname": "He Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107376?format=json", "institution": "University College London, University of London"}, {"id": 131917, "fullname": "Tianjia Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/131917?format=json", "institution": "Zhejiang University"}, {"id": 85731, "fullname": "Kun Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/85731?format=json", "institution": "Zhejiang University"}], "abstract": "We propose a method to reconstruct high-fidelity human avatars from multi\u2011view video that can run on mobile devices. Many works can model high\u2011quality Gaussian-based full-body avatars from multi\u2011view video. However, these methods require heavy computation to obtain pose\u2011dependent appearance, making deployment on mobile devices very difficult. Recent methods distill from pretrained models and model pose\u2011dependent nonlinear Gaussian attributes by linearly combining global pose features with blendshapes. Although they can run on mobile devices, they suffer some loss of detail. We observe that nearby Gaussians are often highly correlated within a local region of the body, and can be linearly modeled with less error. Therefore, we use local linear blendshapes in small body parts to capture global nonlinear changes of Gaussian attributes. To further reduce computation and model size, we propose to remove blendshapes for Gaussians whose attributes change little, yielding a minimal blendshape representation. Our method is an end-to-end training method without a pretrained model. To make it running on multiple devices, we implement our method using WebGPU. Experiments show that our method can render high\u2011quality human avatars with better details, and can reach 120 FPS at 2K resolution on mobile devices.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36606", "url": null, "sourceid": 39546, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36615, "uid": "e8791c81f0dbb5e99c8abe851ec1900b", "name": "Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer", "authors": [{"id": 152559, "fullname": "Mohsen Ghafoorian", "url": "http://cvpr.thecvf.com/api/miniconf/users/152559?format=json", "institution": "Qualcomm"}, {"id": 185473, "fullname": "Denis Korzhenkov", "url": "http://cvpr.thecvf.com/api/miniconf/users/185473?format=json", "institution": "Qualcomm Inc, QualComm"}, {"id": 106397, "fullname": "Amirhossein Habibian", "url": "http://cvpr.thecvf.com/api/miniconf/users/106397?format=json", "institution": "Qualcomm AI Research"}], "abstract": "Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention, making long sequences and high resolutions computationally expensive. While linear attention offers sub-quadratic complexity, previous approaches have failed to match the expressiveness of softmax attention unless retrained at significant computational cost. We introduce Attention Surgery, an efficient framework that enables linear or hybrid attention in pretrained VDMs, eliminating the need for training from scratch. Inspired by recent advances in language models, our method combines a novel hybrid attention mechanism\u2014mixing softmax and linear tokens\u2014with a lightweight distillation and fine-tuning pipeline requiring only a few GPU-days. Additionally, we incorporate a cost-aware block-rate strategy to balance expressiveness and efficiency across layers. Applied to Wan2.1 1.3B, a state-of-the-art efficient transformer VDM and evaluated on VBench, VBench2.0 and a human preference study, Attention Surgery achieves competitive results. Furthermore, measurements of on-mobile latency, memory usage, and FLOPs demonstrate notable improvements in scaling behavior for longer videos.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36615", "url": null, "sourceid": 46107, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36617, "uid": "33a5d9297143c62a93170d9dc2cf5a52", "name": "Vinedresser3D: Towards Agentic Text-guided 3D Editing", "authors": [{"id": 180179, "fullname": "Yankuan Chi", "url": "http://cvpr.thecvf.com/api/miniconf/users/180179?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 155531, "fullname": "Xiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155531?format=json", "institution": "University of Illinois, Urbana Champaign"}, {"id": 77107, "fullname": "Zixuan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77107?format=json", "institution": "University of Illinois Urbana-Champaign"}, {"id": 95904, "fullname": "James Rehg", "url": "http://cvpr.thecvf.com/api/miniconf/users/95904?format=json", "institution": "University of Illinois at Urbana-Champaign"}], "abstract": "Text-guided 3D editing aims to modify existing 3D assets using natural-language instructions. Current methods struggle to jointly understand complex prompts, automatically localize edits in 3D, and preserve unedited content. We introduce Vinedresser3D, an agentic framework for high-quality text-guided 3D editing that operates directly in the latent space of a native 3D generative model. Given a 3D asset and an editing prompt, Vinedresser3D uses a multimodal large language model to infer rich descriptions of the original asset, identify the parts to edit and the edit type (addition, modification, deletion), and generate decomposed structural and appearance-level text guidance. The agent then selects an informative view and applies an image editing model to obtain visual guidance. Finally, an inversion-based rectified-flow inpainting pipeline with an interleaved sampling module performs editing in the 3D latent space, enforcing prompt alignment while maintaining 3D coherence and unedited regions. Experiments on diverse 3D edits demonstrate that Vinedresser3D outperforms prior baselines in both automatic metrics and human preference studies, while enabling precise, coherent, and mask-free 3D editing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36617", "url": null, "sourceid": 38736, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36620, "uid": "e67eecf0ca844bd510ca62420eb53769", "name": "AURA: Multi-modal Shared Autonomy for Urban Navigation", "authors": [{"id": 185485, "fullname": "Yukai Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/185485?format=json", "institution": "Zhejiang University"}, {"id": 185486, "fullname": "Honglin He", "url": "http://cvpr.thecvf.com/api/miniconf/users/185486?format=json", "institution": "UCLA Computer Science Department, University of California, Los Angeles"}, {"id": 185487, "fullname": "Selina Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/185487?format=json", "institution": "University of California, Los Angeles"}, {"id": 77359, "fullname": "Wayne Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/77359?format=json", "institution": "University of California, Los Angeles"}, {"id": 89955, "fullname": "Bolei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/89955?format=json", "institution": "University of California, Los Angeles"}], "abstract": "Long-horizon navigation in complex urban environments still relies heavily on continuous human operation, which leads to fatigue, reduced efficiency, and safety concerns. Shared autonomy, where a Vision-Language AI agent and a human operator collaborate on maneuvering the mobile machine, presents a promising solution to address these issues. However, existing shared autonomy methods often require humans and AI to operate in the same action space, resulting in high cognitive overhead. We present Assistive Urban Robot Autonomy (AURA), a new multi-modal framework that decomposes urban navigation into high-level human instruction and low-level AI control. AURA incorporates a Spatial-Aware Instruction Encoder to align human instructions with visual and spatial context. To facilitate training, we construct UrbanWalks, a large-scale dataset composed of teleoperation and vision-language description data. Experiments in simulation and the real world demonstrate that AURA effectively follows human instructions, reduces manual operation effort, and improves navigation stability, while enabling online adaptation and continuous learning.Moreover, under similar takeover conditions, our hierarchical shared autonomy framework reduces human operation Frequency by over 75%. Code and data will be made available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36620", "url": null, "sourceid": 42437, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36625, "uid": "24fd82394ab5c4074297d17f7847b17a", "name": "BAgger: Backwards Aggregation for Mitigating Drift in Autoregressive Video Diffusion Models", "authors": [{"id": 91273, "fullname": "Ryan Po", "url": "http://cvpr.thecvf.com/api/miniconf/users/91273?format=json", "institution": "Stanford University"}, {"id": 86845, "fullname": "Eric Ryan Chan", "url": "http://cvpr.thecvf.com/api/miniconf/users/86845?format=json", "institution": "Stanford University"}, {"id": 133214, "fullname": "Changan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/133214?format=json", "institution": "Stanford University"}, {"id": 85845, "fullname": "Gordon Wetzstein", "url": "http://cvpr.thecvf.com/api/miniconf/users/85845?format=json", "institution": "Stanford University"}], "abstract": "Autoregressive video models are promising for world modeling via next-frame prediction, but they suffer from exposure bias: a mismatch between training on clean contexts and inference on self-generated frames, causing errors to compound and quality to drift over time. We introduce Backwards Aggregation (BAgger), a self-supervised scheme that constructs corrective trajectories from the model\u2019s own rollouts, teaching it to recover from its mistakes. Unlike prior approaches that rely on few-step distillation and distribution-matching losses --which can hurt quality and diversity -- BAgger trains with standard score or flow matching objectives, avoiding large teachers and long-chain backpropagation through time. We instantiate BAgger on causal diffusion transformers and evaluate on text-to-video, video extension, and multi-prompt generation, observing more stable long-horizon motion and better visual consistency with reduced drift.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36625", "url": null, "sourceid": 31029, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36627, "uid": "953e2ad25261d78a67bc9b4da4016994", "name": "Diffusion-Based Makeup Transfer with Facial Region-Aware Makeup Features", "authors": [{"id": 128282, "fullname": "Zheng Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/128282?format=json", "institution": "Queen Mary, University of London"}, {"id": 185503, "fullname": "Debin Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185503?format=json", "institution": "Queen Mary University of London"}, {"id": 153657, "fullname": "Yunqi Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153657?format=json", "institution": "Huawei London Research Center"}, {"id": 128891, "fullname": "Zhensong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128891?format=json", "institution": "Huawei Noah&#x27;s Ark Lab"}, {"id": 84723, "fullname": "Songcen Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84723?format=json", "institution": "Huawei Noah&#x27;s Ark Lab"}, {"id": 75872, "fullname": "Ioannis Patras", "url": "http://cvpr.thecvf.com/api/miniconf/users/75872?format=json", "institution": "Queen Mary University of London"}, {"id": 86145, "fullname": "Jifei Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/86145?format=json", "institution": "Huawei Technologies Ltd."}], "abstract": "Current diffusion-based makeup transfer methods commonly use the makeup information encoded by off-the-shelf foundation models (e.g., CLIP) as condition to preserve the makeup style of reference image in the generation. Although effective, these works mainly have two limitations: (1) foundation models pre-trained for generic tasks struggle to capture makeup styles; (2) the makeup features of reference image are injected to the diffusion denoising model as a whole for global makeup transfer, overlooking the facial region-aware makeup features (i.e., eyes, mouth, etc) and limiting the regional controllability for region-specific makeup transfer. To address these, in this work, we propose Facial Region-Aware Makeup features (FRAM), which has two stages: (1) makeup CLIP fine-tuning; (2) identity and facial region-aware makeup injection. For makeup CLIP fine-tuning, unlike prior works using off-the-shelf CLIP, we synthesize annotated makeup style data using GPT-o3 and text-driven image editing model, and then use the data to train a makeup CLIP encoder through self-supervised and image-text contrastive learning. For identity and facial region-aware makeup injection, we construct before-and-after makeup image pairs from the edited images in stage 1 and then use them to learn to inject identity of source image and makeup of reference image to the diffusion denoising model for makeup transfer. Specifically, we use learnable tokens to query the makeup CLIP encoder to extract facial region-aware makeup features for makeup injection, which is learned via an attention loss to enable regional control. As for identity injection, we use a ControlNet Union to encode source image and its 3D mesh simultaneously. The experimental results verify the superiority of our regional controllability and our makeup transfer performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36627", "url": null, "sourceid": 40249, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36632, "uid": "91a8f8847ee097f0be9c70c3255afad8", "name": "BiFM: Bidirectional Flow Matching for Few-Step Image Editing and Generation", "authors": [{"id": 143882, "fullname": "Yasong Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/143882?format=json", "institution": "The Australian National university; Data 61, CSIRO"}, {"id": 106229, "fullname": "Zeeshan Hayder", "url": "http://cvpr.thecvf.com/api/miniconf/users/106229?format=json", "institution": "Google"}, {"id": 127311, "fullname": "David Ahmedt-Aristizabal", "url": "http://cvpr.thecvf.com/api/miniconf/users/127311?format=json", "institution": "CSIRO"}, {"id": 92749, "fullname": "Hongdong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/92749?format=json", "institution": "Australian National University"}], "abstract": "Recent diffusion and flow matching models have demonstrated strong capabilities in image generation and editing by progressively removing noise through iterative sampling. While this enables flexible inversion for semantic-preserving edits, few-step sampling regimes suffer from poor forward process approximation, leading to degraded editing quality. Existing few-step inversion methods often rely on pretrained generators and auxiliary modules, limiting scalability and generalization across different architectures.To address these limitations, we propose BiFM (Bidirectional Flow Matching), a unified framework that jointly learns generation and inversion within a single model. BiFM directly estimates average velocity fields in both ''image $\\to$ noise'' and ''noise $\\to$ image'' directions, constrained by a shared instantaneous velocity field derived from either predefined schedules or pretrained multi-step diffusion models. Additionally, BiFM introduces a novel training strategy using continuous time-interval supervision, stabilized by a bidirectional consistency objective and a lightweight time-interval embedding. This bidirectional formulation also enables one-step inversion and can integrate seamlessly into popular diffusion and flow matching backbones.Across diverse image editing and generation tasks, BiFM consistently outperforms existing few-step approaches, achieving superior performance and editability. Our code and models will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36632", "url": null, "sourceid": 46451, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36634, "uid": "c84cb4b5c23ac35411414b9300dd323c", "name": "Diff4Splat: Repurposing Video Diffusion Models  for  Dynamic Scene Generation", "authors": [{"id": 181092, "fullname": "Panwang Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181092?format=json", "institution": "ByteDance"}, {"id": 147620, "fullname": "Chenguo Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/147620?format=json", "institution": "Peking University"}, {"id": 152583, "fullname": "Chenxin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152583?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 185518, "fullname": "Jingjing Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185518?format=json", "institution": null}, {"id": 185519, "fullname": "Yuchen Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185519?format=json", "institution": "Peking University"}, {"id": 185520, "fullname": "Haopeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185520?format=json", "institution": "ByteDance Inc."}, {"id": 144220, "fullname": "yunlong lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/144220?format=json", "institution": "Xiamen university"}, {"id": 156997, "fullname": "Kairun Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/156997?format=json", "institution": "Xiamen University"}, {"id": 89467, "fullname": "Yixuan Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/89467?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 89566, "fullname": "Yadong Mu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89566?format=json", "institution": "Peking University"}], "abstract": "We introduce Diff4Splat, a feed-forward framework for dynamic scene generation from a single image. Our method synergizes the powerful generative priors of video diffusion models with geometric and motion constraints learned from a large-scale 4D dataset. Given a single image, a camera trajectory, and an optional text prompt, our model directly predicts a dynamic scene represented by a deformable 3D Gaussian field. This approach captures appearance, geometry, and motion in a single pass, eliminating the need for test-time optimization or post-hoc processing. At the core of our framework is a video latent transformer that enhances existing video diffusion models, enabling them to jointly model spatio-temporal dependencies and predict 3D Gaussian Primitives over time. Supervised by objectives targeting appearance fidelity, geometric accuracy, and motion consistency, Diff4Splat generates high-fidelity dynamic scenes within 30 seconds. We demonstrate the effectiveness of Diff4Splat across video generation, novel view synthesis, and geometry extraction, where it matches or surpasses optimization-based methods for dynamic scene synthesis while being significantly more efficient.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36634", "url": null, "sourceid": 37198, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36635, "uid": "56eb2fff9adeac35191fd615fff3efee", "name": "PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning", "authors": [{"id": 159324, "fullname": "Jianqi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/159324?format=json", "institution": "KAUST"}, {"id": 185521, "fullname": "Biao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185521?format=json", "institution": "KAUST"}, {"id": 150519, "fullname": "Xiangjun Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150519?format=json", "institution": "KAUST"}, {"id": 75925, "fullname": "Peter Wonka", "url": "http://cvpr.thecvf.com/api/miniconf/users/75925?format=json", "institution": "KAUST"}], "abstract": "6D object pose estimation, which predicts the transformation of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images, eliminating the need for explicit matching. Built upon recent multi-view-based foundation model architectures, the method integrates object geometry information through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. In addition, we construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions to enhance robustness and generalization. Extensive evaluations across multiple benchmarks demonstrate our state-of-the-art performance, yielding an average AR improvement of 5.1% over prior methods and achieving up to 17.6% gains on individual datasets, indicating strong generalization to unseen objects.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36635", "url": null, "sourceid": 32263, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40268?format=json"], "related_events_ids": [40268]}, {"id": 36633, "uid": "b45e759c405f0c30ca3b26dbae3f0424", "name": "Learning by Analogy: A Causal Framework for Compositional Generalization", "authors": [{"id": 76186, "fullname": "Lingjing Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76186?format=json", "institution": "Computer Science Department, School of Computer Science"}, {"id": 85808, "fullname": "Shaoan Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/85808?format=json", "institution": "Carnegie Mellon University"}, {"id": 87482, "fullname": "Yang Jiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/87482?format=json", "institution": "Amazon"}, {"id": 185516, "fullname": "Yetian Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185516?format=json", "institution": "Amazon"}, {"id": 135758, "fullname": "Yanhui Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/135758?format=json", "institution": "McMaster University"}, {"id": 185517, "fullname": "Simone Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185517?format=json", "institution": "Amazon"}, {"id": 87481, "fullname": "Yan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/87481?format=json", "institution": "Amazon"}, {"id": 76475, "fullname": "Guangyi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76475?format=json", "institution": "MBZUAI, CMU"}, {"id": 85796, "fullname": "Kun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85796?format=json", "institution": "Carnegie Mellon University"}], "abstract": "Compositional generalization -- the ability to understand and generate novel combinations of learned concepts -- enables models to extend their capabilities beyond limited experiences.    While effective, the data structures and principles that enable this crucial capability remain poorly understood.    We propose that compositional generalization fundamentally requires decomposing high-level concepts into basic, low-level concepts that can be recombined across similar contexts, similar to how humans draw analogies between concepts. For example, someone who has never seen a peacock eating rice can envision this scene by relating it to their previous observations of a chicken eating rice.In this work, we formalize these intuitive processes using principles of causal modularity and minimal changes.     We introduce a hierarchical data-generating process that naturally encodes different levels of concepts and their interaction mechanisms.    Theoretically, we demonstrate that this approach enables compositional generalization supporting complex relations between composed concepts, advancing beyond prior work that assumes simpler interactions like additive effects.    Critically, we also prove that this latent hierarchical structure is provably recoverable (identifiable) from observable data like text-image pairs, a necessary step for learning such a generative process. To validate our theory, we apply insights from our theoretical framework and achieve significant improvements on benchmark datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36633", "url": null, "sourceid": 40877, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40268, "uid": "56eb2fff9adeac35191fd615fff3efee", "name": "PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning", "authors": [{"id": 159324, "fullname": "Jianqi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/159324?format=json", "institution": "KAUST"}, {"id": 185521, "fullname": "Biao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185521?format=json", "institution": "KAUST"}, {"id": 150519, "fullname": "Xiangjun Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150519?format=json", "institution": "KAUST"}, {"id": 75925, "fullname": "Peter Wonka", "url": "http://cvpr.thecvf.com/api/miniconf/users/75925?format=json", "institution": "KAUST"}], "abstract": "6D object pose estimation, which predicts the transformation of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images, eliminating the need for explicit matching. Built upon recent multi-view-based foundation model architectures, the method integrates object geometry information through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. In addition, we construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions to enhance robustness and generalization. Extensive evaluations across multiple benchmarks demonstrate our state-of-the-art performance, yielding an average AR improvement of 5.1% over prior methods and achieving up to 17.6% gains on individual datasets, indicating strong generalization to unseen objects.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40268", "url": null, "sourceid": -32263, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/36635?format=json"], "related_events_ids": [36635]}, {"id": 36637, "uid": "df3d0fecd6147a1622fa34857b8c934d", "name": "MAPo : Motion-Aware Partitioning of Deformable 3D Gaussian Splatting for High-Fidelity Dynamic Scene Reconstruction", "authors": [{"id": 130992, "fullname": "Han Jiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/130992?format=json", "institution": "Zhejiang University"}, {"id": 72089, "fullname": "Jiakai Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/72089?format=json", "institution": "Zhejiang University"}, {"id": 152054, "fullname": "Yexing Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152054?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 129888, "fullname": "Lei Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/129888?format=json", "institution": "Zhejiang University"}, {"id": 129895, "fullname": "Wei Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/129895?format=json", "institution": "Zhejiang University"}, {"id": 185524, "fullname": "Huaizhong Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185524?format=json", "institution": "Zhejiang University"}], "abstract": "3D Gaussian Splatting, known for enabling high-quality static scene reconstruction with fast rendering, is increasingly being applied to multi-view dynamic scene reconstruction. A common strategy involves learning a deformation field to model the temporal changes of a canonical set of 3D Gaussians. However, these deformation-based methods often produce blurred renderings and lose fine motion details in highly dynamic regions due to the inherent limitations of a single, unified model in representing diverse motion patterns. To address these challenges, we introduce Motion-Aware Partitioning of Deformable 3D Gaussian Splatting (MAPo), a novel framework for high-fidelity dynamic scene reconstruction. Its core is a dynamic score-based partitioning strategy that distinguishes between high- and low-dynamic 3D Gaussians. For high-dynamic 3D Gaussians, we recursively partition them temporally and duplicate their deformation networks for each new temporal segment, enabling specialized modeling to capture intricate motion details. Concurrently, low-dynamic 3DGs are treated as static to reduce computational costs. However, this temporal partitioning strategy for high-dynamic 3DGs can introduce visual discontinuities across frames at the partition boundaries. To address this, we introduce a cross-frame consistency loss, which not only ensures visual continuity but also further enhances rendering quality. Extensive experiments demonstrate that MAPo achieves superior rendering quality compared to baselines while maintaining comparable computational costs, particularly in regions with complex or rapid motions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36637", "url": null, "sourceid": 46083, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36638, "uid": "939aba147bd1077514a5d2022505783f", "name": "Clair Obscur: an Illumination-Aware Method for Real-World Image Vectorization", "authors": [{"id": 180568, "fullname": "Xingyue Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/180568?format=json", "institution": "Peking University"}, {"id": 185525, "fullname": "Shuai Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185525?format=json", "institution": "Peking University"}, {"id": 185526, "fullname": "Xiangyu Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/185526?format=json", "institution": "Peking University"}, {"id": 185527, "fullname": "Jianhua Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185527?format=json", "institution": "Peking University"}, {"id": 147001, "fullname": "Yuxuan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/147001?format=json", "institution": "Peking University"}, {"id": 88986, "fullname": "Liangcai Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88986?format=json", "institution": "Peking University"}], "abstract": "Image vectorization aims to convert raster images into editable, scalable vector representations while preserving visual fidelity. Existing vectorization methods struggle to represent complex real-world images, often producing fragmented shapes at the cost of semantic conciseness. In this paper, we propose COVec, an illumination-aware vectorization framework inspired by the Clair-Obscur principle of light\u2013shade contrast. COVec is the first to introduce intrinsic image decomposition in the vector domain, separating an image into albedo, shade, and light layers in a unified vector representation. A semantic-guided initialization and two-stage optimization refine these layers with differentiable rendering. Experiments on various datasets demonstrate that COVec achieves higher visual fidelity and significantly improved editability compared to existing methods.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36638", "url": null, "sourceid": 34489, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36640, "uid": "a7b96247649539e6c32d09543bf68a46", "name": "Detecting Compressed AI-Generated Images via Phase Spectrum Robustness", "authors": [{"id": 102328, "fullname": "Kai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/102328?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 75641, "fullname": "Wenqi Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/75641?format=json", "institution": "Sun Yat-Sen University"}, {"id": 185532, "fullname": "Wei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185532?format=json", "institution": "SUN YAT-SEN UNIVERSITY; SUN YAT-SEN UNIVERSITY"}, {"id": 126239, "fullname": "Xiaochun Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/126239?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "This paper aims to present a robust AI-generated image detection framework designed to address performance degradation caused by image compression in online social networks. The key challenges are twofold: 1) compression destroys fragile artifacts that are crucial to existing methods, and 2) it introduces new compression artifacts that interfere with detection. Existing methods typically enhance the compression robustness by collecting original-compression pairs and compression labels. However, the collection and annotation process is highly resource-intensive. To address these issues, we propose a Compression-Robust Phase-Harmonized Transformer, motivated by the observation that phase spectrum remains stable under compression. The framework consists of a phase-harmonized cross-modal interaction module that leverages phase spectrum information for feature fusion, enhancing compression robustness, and a multi-domain modulation adapter that further refines fused features while enabling parameter-efficient fine-tuning. In particular, the framework operates without requiring compression-original data pairs and compression labels. When limited compression labels are available, we introduce a difficulty-aware consistency loss to maximize their utility by prioritizing hard compressed samples during training, further boosting robustness. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches, exhibiting superior robustness against image compression.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36640", "url": null, "sourceid": 37106, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36647, "uid": "eb2f0fc72f358a3663182913cbcdb32f", "name": "UniPercept: A Unified Diffusion Model for Generalizable Visual Perception", "authors": [{"id": 180428, "fullname": "Zuyan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180428?format=json", "institution": "\u4e2d\u56fd\u79d1\u5b66\u9662\u8ba1\u7b97\u6280\u672f\u7814\u7a76\u6240"}, {"id": 182763, "fullname": "Zhenliang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/182763?format=json", "institution": "\u4e2d\u56fd\u79d1\u5b66\u9662\u8ba1\u7b97\u6280\u672f\u7814\u7a76\u6240"}, {"id": 127370, "fullname": "Meina Kan", "url": "http://cvpr.thecvf.com/api/miniconf/users/127370?format=json", "institution": "Institute of Computing Technoloy, Chinese Academy of Sciences"}, {"id": 86312, "fullname": "Shiguang Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/86312?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 76990, "fullname": "Xilin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76990?format=json", "institution": "Institute of Computing Technology"}], "abstract": "Diffusion models have shown impressive performance in generative tasks, demonstrating their ability to capture detailed structural and semantic information. Recently, these capabilities have been extended to visual understanding, with studies employing diffusion models as the backbone for various perception tasks. However, existing diffusion-based perception models are generally restricted to a single task or a fixed set of predefined tasks, lacking an efficient mechanism to generalize to novel tasks. To overcome this limitation, we propose a unified DiT-based perception framework called UniPercept, which introduces a novel foundation\u2013adapter paradigm for general visual perception. In this framework, a shared diffusion-based foundation model is trained to capture common and generalizable visual knowledge across diverse perception tasks, with task-specific adapters integrated for each individual task. Leveraging its superior generalization ability, the foundation model can be efficiently adapted to novel domains through lightweight adapters, requiring as few as **1,000** training samples and less than **1%** of trainable parameters. Furthermore, UniPercept demonstrates strong performance across various perception tasks, outperforming state-of-the-art generalist models in most cases and achieving accuracy comparable to specialist models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36647", "url": null, "sourceid": 42326, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36648, "uid": "a2fa09f7e331c73ec8ba3b64b846d7dd", "name": "Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning", "authors": [{"id": 182119, "fullname": "Haonan Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/182119?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 88113, "fullname": "Shichao Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/88113?format=json", "institution": "Megvii Technology Inc."}, {"id": 130319, "fullname": "Xin Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/130319?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 185558, "fullname": "Zenghui Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/185558?format=json", "institution": "Alibaba Group"}, {"id": 156655, "fullname": "Jin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156655?format=json", "institution": "The University of Hong Kong"}, {"id": 157234, "fullname": "Jinsong Lan", "url": "http://cvpr.thecvf.com/api/miniconf/users/157234?format=json", "institution": "Alibaba Group"}, {"id": 185559, "fullname": "Xiaoyong Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185559?format=json", "institution": "Alibaba Group; Microsoft"}, {"id": 152921, "fullname": "Bo Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/152921?format=json", "institution": "Alibaba Group"}, {"id": 156949, "fullname": "Kaifu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156949?format=json", "institution": "Alibaba Group"}], "abstract": "Large Vision-Language Models (LVLMs) often omit or misrepresent critical visual content in generated image captions. Minimizing such information loss will force LVLMs to focus on image details to generate precise descriptions. However, measuring information loss during modality conversion is inherently challenging due to the modal gap between visual content and text output. In this paper, we argue that the quality of an image caption is positively correlated with the similarity between images retrieved via text search using that caption. Based on this insight, we further propose Cross-modal Identity Mapping (CIM), a reinforcement learning framework that enhances image captioning without requiring additional annotations. Specifically, the method quantitatively evaluates the information loss from two perspectives: Gallery Representation Consistency and Query-gallery Image Relevance. Supervised under these metrics, LVLM minimizes information loss and aims to achieve identity mapping from images to captions. The experimental results demonstrate the superior performance of our method in image captioning, even when compared with Supervised Fine-Tuning. Particularly, on the COCO-LN500 benchmark, CIM achieves a 20\\% improvement in relation reasoning on Qwen2.5-VL-7B.The code will be released when the paper is accepted.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36648", "url": null, "sourceid": 41268, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36649, "uid": "20004d9171f2c39562975a104b3b9d7d", "name": "SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving", "authors": [{"id": 182330, "fullname": "Felix Embacher", "url": "http://cvpr.thecvf.com/api/miniconf/users/182330?format=json", "institution": "Mercedes-Benz AG"}, {"id": 92422, "fullname": "Jonas Uhrig", "url": "http://cvpr.thecvf.com/api/miniconf/users/92422?format=json", "institution": "Mercedes Benz AG"}, {"id": 129125, "fullname": "Marius Cordts", "url": "http://cvpr.thecvf.com/api/miniconf/users/129125?format=json", "institution": "Mercedes-Benz"}, {"id": 75703, "fullname": "Markus Enzweiler", "url": "http://cvpr.thecvf.com/api/miniconf/users/75703?format=json", "institution": "Esslingen University of Applied Sciences"}], "abstract": "Retrieving rare and safety-critical driving scenarios from large-scale datasets is essential for building robust autonomous driving (AD) systems. As dataset sizes continue to grow, the key challenge shifts from collecting more data to efficiently identifying the most relevant samples.We introduce SearchAD, a large-scale rare image retrieval dataset for AD containing over 423k frames drawn from 11 established datasets. SearchAD provides high-quality manual annotations of more than 386k bounding boxes covering 64 rare categories. It specifically targets the \u201cneedle-in-a-haystack\u201d problem of locating extremely rare classes, with some appearing fewer than 50 times across the entire dataset. Unlike existing benchmarks, which focused on instance-level retrieval, SearchAD emphasizes semantic image retrieval with a well-defined data split, enabling text-to-image and image-to-image retrieval, few-shot learning, and fine-tuning of multi-modal retrieval models.Comprehensive zero-shot evaluations show that text-based methods outperform image-based ones due to stronger inherent semantic grounding. Models that directly align spatial visual features with language achieve the best relative results, yet none demonstrate satisfactory retrieval capability in absolute terms. With a held-out test set on a public benchmark server, SearchAD establishes the first large-scale benchmark for retrieval-driven data curation and long-tail perception research in AD.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36649", "url": null, "sourceid": 33323, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36654, "uid": "3ee61d1c8258496260a72d01e76480ba", "name": "HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning", "authors": [{"id": 144065, "fullname": "Hongji Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/144065?format=json", "institution": "University of Macau"}, {"id": 185569, "fullname": "Yucheng Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185569?format=json", "institution": "University of Macau"}, {"id": 89411, "fullname": "Wencheng Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/89411?format=json", "institution": "University of Macau"}, {"id": 89897, "fullname": "Runzhou Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/89897?format=json", "institution": "Beijing Institute of Technology"}, {"id": 185570, "fullname": "Zhongying Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185570?format=json", "institution": "ZEEKR"}, {"id": 185571, "fullname": "Jianfei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185571?format=json", "institution": "Shanghai ZEEKR Blue New Energy Technology Co., Ltd."}, {"id": 126757, "fullname": "Jianbing Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126757?format=json", "institution": "University of Macau"}], "abstract": "Recent advances in diffusion models have demonstrated impressive capability in generating high-quality images for simple prompts. However, when confronted with complex prompts involving multiple objects and hierarchical structures, existing models struggle to accurately follow instructions, leading to issues such as concept omission, confusion, and poor compositionality. To address these limitations, we propose a Hierarchical Compositional Generative framework (HiCoGen) built upon a novel Chain of Synthesis (CoS) paradigm. Instead of monolithic generation, HiCoGen first leverages a Large Language Model (LLM) to decompose complex prompts into minimal semantic units. It then synthesizes these units iteratively, where the image generated in each step provides crucial visual context for the next, ensuring all textual concepts are faithfully constructed into the final scene. To further optimize this process, we introduce a reinforcement learning (RL) framework. Crucially, we identify that the limited exploration of standard diffusion samplers hinders effective RL. We theoretically prove that sample diversity is maximized by concentrating stochasticity in the early generation stages and, based on this insight, propose a novel Decaying Stochasticity Schedule to enhance exploration. Our RL algorithm is then guided by a hierarchical reward mechanism that jointly evaluates the image at the global, subject, and relationship levels. We also construct HiCoPrompt, a new text-to-image benchmark with hierarchical prompts for rigorous evaluation. Experiments show our approach significantly outperforms existing methods in both concept coverage and compositional accuracy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36654", "url": null, "sourceid": 36426, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36661, "uid": "ff19aaaec6a0ea4ed365576f4902cefa", "name": "Efficient unrolled networks for large-scale 3D inverse problems", "authors": [{"id": 182338, "fullname": "Romain Vo", "url": "http://cvpr.thecvf.com/api/miniconf/users/182338?format=json", "institution": "CNRS, ENS de Lyon, France"}, {"id": 128042, "fullname": "Juli\u00e1n Tachella", "url": "http://cvpr.thecvf.com/api/miniconf/users/128042?format=json", "institution": "CNRS"}], "abstract": "Deep learning-based methods have revolutionized the field of imaging inverse problems, yielding state-of-the-art performance across various imaging domains. The best performing networks incorporate the imaging operator within the network architecture, typically in the form of deep unrolling. However, in large-scale problems, such as 3D imaging, most existing methods fail to incorporate the operator in the architecture due to the prohibitive amount of memory required by global forward operators, which hinder typical patching strategies. In this work, we present a domain partitioning strategy and normal operator approximations that enable the training of end-to-end reconstruction models incorporating forward operators of arbitrarily large problems into their architecture. The proposed method achieves state-of-the-art performance on 3D X-ray cone-beam tomography and 3D multi-coil accelerated MRI, while requiring only a single GPU for both training and inference.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36661", "url": null, "sourceid": 33698, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40269?format=json"], "related_events_ids": [40269]}, {"id": 36663, "uid": "132e0a7afc6812c80743111c6237f12e", "name": "MoBind: Motion Binding for Fine-Grained IMU\u2013Video Pose Alignment", "authors": [{"id": 185584, "fullname": "Duc Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185584?format=json", "institution": "University of Adelaide"}, {"id": 98366, "fullname": "Tat-Jun Chin", "url": "http://cvpr.thecvf.com/api/miniconf/users/98366?format=json", "institution": "The University of Adelaide"}, {"id": 136690, "fullname": "Minh Nguyen Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/136690?format=json", "institution": "Qualcomm Inc, QualComm; University of Adelaide"}], "abstract": "We aim to learn a joint representation between inertial measurement unit (IMU) signals and 2D pose sequences extracted from video, enabling accurate cross-modal retrieval, temporal synchronization, subject and body-part localization, and action recognition. To this end, we introduce MoBind, a hierarchical contrastive learning framework designed to address three challenges: (1) filtering out irrelevant visual background, (2) modeling structured multi-sensor IMU configurations, and (3) achieving fine-grained, sub-second temporal alignment. To isolate motion-relevant cues, MoBind aligns IMU signals with skeletal motion sequences rather than raw pixels. We further decompose full-body motion into local body-part trajectories, pairing each with its corresponding IMU to enable semantically grounded multi-sensor alignment. To capture detailed temporal correspondence, MoBind employs a hierarchical contrastive strategy that first aligns token-level temporal segments, then fuses local (body-part) alignment with global (body-wide) motion aggregation. Evaluated on mRi, TotalCapture, and EgoHumans, MoBind consistently outperforms strong baselines across all four tasks, demonstrating robust fine-grained temporal alignment while preserving coarse semantic consistency across modalities.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36663", "url": null, "sourceid": 36931, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36674, "uid": "8d661ad44b835f29a6494c184c21a463", "name": "OMG-Avatar: One-shot Multi-LOD Gaussian Head Avatar", "authors": [{"id": 86729, "fullname": "Jianqiang Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/86729?format=json", "institution": "Alibaba Group"}, {"id": 185619, "fullname": "Lin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185619?format=json", "institution": "Alibaba Group"}, {"id": 185620, "fullname": "Steven Hoi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185620?format=json", "institution": "Alibaba Group; Singapore Management University"}], "abstract": "We propose OMG-Avatar, a novel One-shot method that leverages a Multi-LOD (Level-of-Detail) Gaussian representation for animatable 3D head reconstruction from a single image in 0.2s. Our method enables LOD head avatar modeling using a unified model that accommodates diverse hardware capabilities and inference speed requirements. To capture both global and local facial characteristics, we employ a transformer-based architecture for global feature extraction and projection-based sampling for local feature acquisition. These features are effectively fused under the guidance of a depth buffer, ensuring occlusion plausibility. We further introduce a coarse-to-fine learning paradigm to support Level-of-Detail functionality and enhance the perception of hierarchical details. To address the limitations of 3DMMs in modeling non-head regions such as the shoulders, we introduce a multi-region decomposition scheme in which the head and shoulders are predicted separately and then integrated through cross-region combination. Extensive experiments demonstrate that OMG-Avatar outperforms state-of-the-art methods in reconstruction quality, reenactment performance, and computational efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36674", "url": null, "sourceid": 40456, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40269, "uid": "ff19aaaec6a0ea4ed365576f4902cefa", "name": "Efficient unrolled networks for large-scale 3D inverse problems", "authors": [{"id": 182338, "fullname": "Romain Vo", "url": "http://cvpr.thecvf.com/api/miniconf/users/182338?format=json", "institution": "CNRS, ENS de Lyon, France"}, {"id": 128042, "fullname": "Juli\u00e1n Tachella", "url": "http://cvpr.thecvf.com/api/miniconf/users/128042?format=json", "institution": "CNRS"}], "abstract": "Deep learning-based methods have revolutionized the field of imaging inverse problems, yielding state-of-the-art performance across various imaging domains. The best performing networks incorporate the imaging operator within the network architecture, typically in the form of deep unrolling. However, in large-scale problems, such as 3D imaging, most existing methods fail to incorporate the operator in the architecture due to the prohibitive amount of memory required by global forward operators, which hinder typical patching strategies. In this work, we present a domain partitioning strategy and normal operator approximations that enable the training of end-to-end reconstruction models incorporating forward operators of arbitrarily large problems into their architecture. The proposed method achieves state-of-the-art performance on 3D X-ray cone-beam tomography and 3D multi-coil accelerated MRI, while requiring only a single GPU for both training and inference.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40269", "url": null, "sourceid": -33698, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/36661?format=json"], "related_events_ids": [36661]}, {"id": 36662, "uid": "83fe595a07a8da38ace7d7f82b749d4d", "name": "How Much 3D Do Video Foundation Models Encode?", "authors": [{"id": 77107, "fullname": "Zixuan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77107?format=json", "institution": "University of Illinois Urbana-Champaign"}, {"id": 155531, "fullname": "Xiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155531?format=json", "institution": "University of Illinois, Urbana Champaign"}, {"id": 99138, "fullname": "Zhaoyang Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/99138?format=json", "institution": "Impossible, Inc."}, {"id": 95904, "fullname": "James Rehg", "url": "http://cvpr.thecvf.com/api/miniconf/users/95904?format=json", "institution": "University of Illinois at Urbana-Champaign"}], "abstract": "Videos are continuous 2D projections of 3D worlds. After training on large video data, will global 3D understanding naturally emerge? We study this by quantifying the 3D understanding of existing Video Foundation Models (VidFMs) pretrained on vast video data. We propose the first model-agnostic framework that measures the 3D awareness of various VidFMs by estimating multiple 3D properties from their features via shallow read-outs. Our study presents meaningful findings regarding the 3D awareness of VidFMs on multiple axes. In particular, we show that state-of-the-art video generation models exhibit a strong understanding of 3D objects and scenes, despite not being trained on any 3D data. Such understanding can even surpass that of large expert models specifically trained for 3D tasks. Our findings, together with the 3D benchmarking of major VidFMs, provide valuable observations for building scalable 3D models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36662", "url": null, "sourceid": 43355, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36665, "uid": "75096e37873d85045640e9f3fcfc182a", "name": "Transition Matching Distillation for Fast Video Generation", "authors": [{"id": 162987, "fullname": "Weili Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/162987?format=json", "institution": "NVIDIA"}, {"id": 154411, "fullname": "Julius Berner", "url": "http://cvpr.thecvf.com/api/miniconf/users/154411?format=json", "institution": "NVIDIA"}, {"id": 153407, "fullname": "Nanye Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/153407?format=json", "institution": "New York University"}, {"id": 152133, "fullname": "Chao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152133?format=json", "institution": "NVIDIA"}, {"id": 90241, "fullname": "Saining Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/90241?format=json", "institution": "Facebook"}, {"id": 87737, "fullname": "Arash Vahdat", "url": "http://cvpr.thecvf.com/api/miniconf/users/87737?format=json", "institution": "NVIDIA Research"}], "abstract": "Large video diffusion and flow models have achieved remarkable success in high-quality video generation, but their use in real-time interactive applications remains limited due to their inefficient multi-step sampling process. In this work, we present Transition Matching Distillation (TMD), a novel framework for distilling video diffusion models into efficient few-step generators. The central idea of TMD is to match the multi-step denoising trajectory of a diffusion model with a few-step probability transition process, where each transition is modeled as a lightweight conditional flow. To enable efficient distillation, we decompose the original diffusion backbone into two components: (1) a main backbone, comprising the majority of early layers, that extracts semantic representations at each outer transition step; and (2) a flow head, consisting of the last few layers, that leverages these representations to perform multiple inner flow updates. Given a pretrained video flow model, we first introduce a flow head to the model, and adapt it into a conditional flow map. We then apply distribution matching distillation to the student model with flow head rollout in each transition step. Extensive experiments on distilling Wan2.1 1.3B and 14B text-to-video models demonstrate that TMD provides a flexible and strong trade-off between generation speed and visual quality. In particular, TMD outperforms existing distilled models under comparable inference costs in terms of visual fidelity and prompt adherence.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36665", "url": null, "sourceid": 44733, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36669, "uid": "14fab9101cde27b123722210d1a1836f", "name": "Multi-View Hierarchical Alignment Learning for Spatial Transcriptomics", "authors": [{"id": 184024, "fullname": "Zhengzhong Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184024?format=json", "institution": "Sichuan University"}, {"id": 183000, "fullname": "Liangjin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183000?format=json", "institution": "Department of Computer Science, Sichuan University"}, {"id": 185608, "fullname": "Pei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185608?format=json", "institution": "Sichuan University"}, {"id": 185609, "fullname": "Shiquan min", "url": "http://cvpr.thecvf.com/api/miniconf/users/185609?format=json", "institution": "Sichuan University"}, {"id": 185610, "fullname": "Jiangping Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185610?format=json", "institution": "Sichuan University"}], "abstract": "Spatial transcriptomics provides both spatial coordinates and gene expression profiles, enabling the study of tissue organization and cellular heterogeneity. Despite recent progress, current spatial clustering methods still face two major limitations. First, representations learned from spatial and expression views often differ due to view-specific noise and incomplete structural information. Without enforcing sample-level cross-view consistency, embeddings from the two views may not correspond to the same biological identity, reducing discriminative capability. Second, existing approaches lack effective semantic-level supervision. Although node embeddings capture local neighborhood patterns, they do not explicitly reflect high-level semantic structures. Prototype-based modeling can provide such semantic abstraction, yet current methods seldom align prototypes with node representations, leading to weak semantic consistency. To overcome these issues, we propose a Multi-View Hierarchical Alignment Learning for Spatial Transcriptomics (MHAL). At the sample level, MHAL introduce positive sample alignment to enforce consistency between spatial and expression embeddings. At the semantic level, MHAL design prototype level contrastive learning, where prototypes act as semantic anchors and guide the formation of coherent cluster structures. Together, these two alignment mechanisms progressively ensure both local consistency and global semantic discrimination. Extensive experimental results demonstrate that the proposed hierarchical contrastive multi-view clustering method achieves competitive performance in spatial domain identification compared to other state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36669", "url": null, "sourceid": 41180, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36672, "uid": "4c08bbaa43016917c824f6bdbb3af9d8", "name": "Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation", "authors": [{"id": 128995, "fullname": "Su Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/128995?format=json", "institution": "Purdue University"}, {"id": 99408, "fullname": "Cheng Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/99408?format=json", "institution": "Bosch Research"}, {"id": 99045, "fullname": "Himangi Mittal", "url": "http://cvpr.thecvf.com/api/miniconf/users/99045?format=json", "institution": "Carnegie Mellon University"}, {"id": 77002, "fullname": "Gaurav Mittal", "url": "http://cvpr.thecvf.com/api/miniconf/users/77002?format=json", "institution": "Microsoft"}, {"id": 153120, "fullname": "Rohith Kukkala", "url": "http://cvpr.thecvf.com/api/miniconf/users/153120?format=json", "institution": "Microsoft"}, {"id": 89241, "fullname": "Yingjie Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89241?format=json", "institution": "Purdue University"}, {"id": 172531, "fullname": "Mei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/172531?format=json", "institution": "Dolby"}], "abstract": "Generating dynamic 4D objects from sparse inputs is difficult because it demands joint preservation of appearance and motion coherence across views and time while suppressing artifacts and temporal drift. We hypothesize that the view discrepancy arises from supervision limited to pixel- or latent-space video-diffusion losses, which lack explicitly temporally aware, feature-level tracking guidance.We present \\emph{Track4DGen}, a two-stage framework that couples a multi-view video diffusion model with a foundation point tracker and a hybrid 4D Gaussian Splatting (4D-GS) reconstructor. The central idea is to explicitly inject tracker-derived motion priors into intermediate feature representations for both multi-view video generation and 4D-GS. In Stage One, we enforce dense, feature-level point correspondences inside the diffusion generator, producing temporally consistent features that curb appearance drift and enhance cross-view coherence. In Stage Two, we reconstruct a dynamic 4D-GS using a hybrid motion encoding that concatenates co-located diffusion features (carrying Stage-One tracking priors) with Hex-plane features, and augment them with 4D Spherical Harmonics for higher-fidelity dynamics modeling.\\emph{Track4DGen} surpasses baselines on both multi-view video generation and 4D generation benchmarks, yielding temporally stable, text-editable 4D assets. Lastly, we curate \\emph{Sketchfab28}, a high-quality dataset for benchmarking object-centric 4D generation and fostering future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36672", "url": null, "sourceid": 39711, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36675, "uid": "af8e600dc7b9d186560061e3e73cbe57", "name": "Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation", "authors": [{"id": 176762, "fullname": "Yunkai Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176762?format=json", "institution": "Sun Yat-sen University"}, {"id": 92566, "fullname": "Yudong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/92566?format=json", "institution": "Tencent; Tsinghua University"}, {"id": 175965, "fullname": "Kunquan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/175965?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 131056, "fullname": "Jinxiao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131056?format=json", "institution": "Tsinghua University"}, {"id": 176821, "fullname": "Xinying Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/176821?format=json", "institution": "Beijing Institute of Technology"}, {"id": 131048, "fullname": "Haohuan Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131048?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 180060, "fullname": "Runmin Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/180060?format=json", "institution": "Sun Yat-sen University,"}], "abstract": "With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of synthetic data in downstream semantic segmentation tasks. To address these challenges, we propose a task-oriented data synthesis framework (TODSynth), including a Multimodal Diffusion Transformer (MM-DiT) with unified triple attention and a plug-and-play sampling strategy guided by task feedback. Built upon the powerful DiT-based generative foundation model, we systematically evaluate different control schemes, showing that a text\u2013image\u2013mask joint attention scheme combined with full fine-tuning of the image and mask branches significantly enhances the effectiveness of RS semantic segmentation data synthesis, particularly in few-shot and complex-scene scenarios. Furthermore, we propose a control-rectify flow matching (CRFM) method, which dynamically adjusts sampling directions guided by semantic loss during the early high-plasticity stage, mitigating the instability of generated images and bridging the gap between synthetic data and downstream segmentation tasks. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art controllable generation methods, producing more stable and task-oriented synthetic data for RS semantic segmentation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36675", "url": null, "sourceid": 35813, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36678, "uid": "a003d3c58568435ab3c440f97f1e953e", "name": "TIGER: A Unified Framework for Time, Images and Geo-location Retrieval", "authors": [{"id": 136031, "fullname": "David Shatwell", "url": "http://cvpr.thecvf.com/api/miniconf/users/136031?format=json", "institution": "University of Central Florida"}, {"id": 151034, "fullname": "Swetha Sirnam", "url": "http://cvpr.thecvf.com/api/miniconf/users/151034?format=json", "institution": "University of Central Florida"}, {"id": 73977, "fullname": "Mubarak Shah", "url": "http://cvpr.thecvf.com/api/miniconf/users/73977?format=json", "institution": "Amazon"}], "abstract": "Many real-world applications in digital forensics, urban monitoring, and environmental analysis require jointly reasoning about visual appearance, geolocation, and time. Beyond standard geo-localization and time-of-capture prediction, these applications increasingly demand more complex capabilities, such as retrieving an image captured at the same location as a query image but at a specified target time. We formalize this problem as Geo-Time Aware Image Retrieval and curate a diverse benchmark of 4.5M paired image\u2013location\u2013time triplets for training and 85k high-quality triplets for evaluation. We then propose TIGeR, a multi-modal-transformer\u2013based model that maps image, geolocation, and time into a unified geo-temporal embedding space. TIGeR supports flexible input configurations (single-modality and multi-modality queries) and uses the same representation to perform (i) geo-localization, (ii) time-of-capture prediction, and (iii) geo-time\u2013aware retrieval. By better preserving underlying location identity under large appearance changes, TIGeR enables retrieval based on where and when a scene is, rather than purely on visual similarity. Extensive experiments show that TIGeR consistently outperforms strong baselines and state-of-the-art methods by up to 16% on time-of-year, 8% time-of-day prediction, and 14% in geo-time aware retrieval recall, highlighting the benefits of unified geo-temporal modeling.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36678", "url": null, "sourceid": 37541, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36682, "uid": "f29d0a82cf0a101d8992195bd1514477", "name": "Photo-Guided Tooth Segmentation on 3D Oral Scan Model", "authors": [{"id": 185636, "fullname": "Shaojie Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185636?format=json", "institution": "Shandong University"}, {"id": 158569, "fullname": "Guangshun Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/158569?format=json", "institution": "Shandong University"}, {"id": 185637, "fullname": "Jiangxin He", "url": "http://cvpr.thecvf.com/api/miniconf/users/185637?format=json", "institution": "Shandong University"}, {"id": 158572, "fullname": "Yuanfeng Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/158572?format=json", "institution": "Shandong University"}], "abstract": "Accurate 3D tooth segmentation is fundamental for digital dentistry, orthodontic analysis, and clinical simulation. Intraoral scan (IOS) models often suffer from incomplete or unreliable texture information, making it difficult to delineate fine boundaries between teeth and gingiva, while 2D intraoral images provide rich semantic and chromatic information that can complement 3D geometry. Thus, we propose a novel Photo-guided 3D Model Tooth Segmentation framework, PMTSeg, that enhances 3D tooth segmentation by integrating texture cues from intraoral photos. Our framework introduces three key components: a Camera Alignment Module (CAM) for accurate image-model registration, a Feature Filtering Gate (FFG) for adaptive multi-view feature selection, and a Consistent Feature Learning (CFL) mechanism for learning texture-geometry correspondence. Our method supports arbitrary numbers and views of intraoral photos. Experiments show significant improvements in distinguishing adjacent teeth and tooth\u2013gingiva boundaries, demonstrating that intraoral photographs serve as an efficient, semantically rich supplement to 3D scans for precise dental segmentation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36682", "url": null, "sourceid": 45278, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36685, "uid": "09c678ee8120d2953194b54f37d8a8a5", "name": "VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale", "authors": [{"id": 152873, "fullname": "Sven Elflein", "url": "http://cvpr.thecvf.com/api/miniconf/users/152873?format=json", "institution": "University of Toronto"}, {"id": 185641, "fullname": "Ruilong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185641?format=json", "institution": null}, {"id": 185642, "fullname": "S\u00e9rgio Agostinho", "url": "http://cvpr.thecvf.com/api/miniconf/users/185642?format=json", "institution": "NVIDIA"}, {"id": 86828, "fullname": "\u017dan Goj\u010di\u010d", "url": "http://cvpr.thecvf.com/api/miniconf/users/86828?format=json", "institution": "NVIDIA"}, {"id": 128661, "fullname": "Laura Leal-Taixe", "url": "http://cvpr.thecvf.com/api/miniconf/users/128661?format=json", "institution": "NVIDIA"}, {"id": 135558, "fullname": "Qunjie Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/135558?format=json", "institution": "NVIDIA"}, {"id": 90782, "fullname": "Aljo\u0161a O\u0161ep", "url": "http://cvpr.thecvf.com/api/miniconf/users/90782?format=json", "institution": "Carnegie Mellon University"}], "abstract": "We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training.VGG-T$^3$ ($\\mathbf{V}$isual $\\mathbf{G}$eometry $\\mathbf{G}$rounded $\\mathbf{T}$est $\\mathbf{T}$ime $\\mathbf{T}$raining) scales linearly w.r.t. the number of input views, similar to online models, and achieves a $11.6\\times$ speed-up over baselines that rely on softmax attention for reconstructing a $1k$ image collection in just $54$ seconds. Because our method retains global scene aggregation capability, our resulting point map reconstruction error is comparable to VGGT.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36685", "url": null, "sourceid": 38626, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36688, "uid": "0e74b02d920c9bcf962fbbec0c2a3423", "name": "LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction", "authors": [{"id": 180407, "fullname": "Tianye Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/180407?format=json", "institution": "Northeastern University"}, {"id": 155435, "fullname": "Yiming Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/155435?format=json", "institution": "Northeastern University"}, {"id": 185647, "fullname": "Yiqing Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185647?format=json", "institution": "Brown University"}, {"id": 126514, "fullname": "Moitreya Chatterjee", "url": "http://cvpr.thecvf.com/api/miniconf/users/126514?format=json", "institution": "Mitsubishi Electric Research Labs"}, {"id": 97201, "fullname": "Pedro Miraldo", "url": "http://cvpr.thecvf.com/api/miniconf/users/97201?format=json", "institution": "Mitsubishi Electric Research Laboratories"}, {"id": 127396, "fullname": "Huaizu Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127396?format=json", "institution": "Northeastern University"}], "abstract": "Recent feed-forward reconstruction models like VGGT and $\\pi^3$ achieve impressive reconstruction quality but cannot process streaming videos due to quadratic memory complexity, limiting their practical deployment. While existing streaming methods address this through learned memory mechanisms or causal attention, they require extensive retraining and may not fully leverage the strong geometric priors of state-of-the-art offline models. We propose LASER, a training-free framework that converts an offline reconstruction model into a streaming system byaligning predictions across consecutive temporal windows. We observe that simple similarity transformation ($Sim(3)$) alignment fails due to layer depth misalignment: monocular scale ambiguity causes relative depth scales of different scene layers to vary inconsistently between windows. To address this, we introduce layer-wise scale alignment, which segments depth predictions into discrete layers, computes per-layer scale factors, and propagates them across both adjacent windows and timestamps.Extensive experiments show that LASER achieves state-of-the-art performance on camera pose estimation and point map reconstruction while operating at 14 FPS with 6 GB peak memory on a RTX A6000 GPU, enabling practical deployment for kilometer-scale streaming videos.Our code will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36688", "url": null, "sourceid": 36486, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36689, "uid": "be3c7fe586ab8b69887e6b0f52bfe9b9", "name": "DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers", "authors": [{"id": 151039, "fullname": "Dahye Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/151039?format=json", "institution": "Boston University"}, {"id": 106968, "fullname": "Deepti Ghadiyaram", "url": "http://cvpr.thecvf.com/api/miniconf/users/106968?format=json", "institution": "Boston University"}, {"id": 169376, "fullname": "Raghudeep Gadde", "url": "http://cvpr.thecvf.com/api/miniconf/users/169376?format=json", "institution": "Amazon"}], "abstract": "Diffusion Transformers (DiTs) have achieved state-of-the-art performance in image and video generation, but their success comes at the cost of heavy computation. This inefficiency is largely due to the fixed tokenization process, which uses constant-sized patches throughout the entire denoising phase, regardless of the content's complexity.We propose dynamic tokenization, an efficient test-time strategy that varies patch sizes based on content complexity and the denoising timestep. Our key insight is that early timesteps only require coarser patches to model global structure, while later iterations demand finer (smaller-sized) patches to refine local details. During inference, our method dynamically reallocates patch sizes across denoising steps for image and video generation and substantially reduces cost while preserving perceptual generation quality. Extensive experiments demonstrate the effectiveness of our approach: it achieves up to $3.52\\times$ and $3.2\\times$ speedup on FLUX-1.Dev and Wan $2.1$, respectively, without compromising the generation quality and prompt adherence.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36689", "url": null, "sourceid": 41806, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36698, "uid": "0e9c14002f9268601c57837930261014", "name": "MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts", "authors": [{"id": 179861, "fullname": "Jingnan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/179861?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 104460, "fullname": "Zhe Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/104460?format=json", "institution": "Alibaba"}, {"id": 185670, "fullname": "Xianze Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185670?format=json", "institution": "Alibaba Group"}, {"id": 76724, "fullname": "Xingyu Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/76724?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 129691, "fullname": "Zhuo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/129691?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 185671, "fullname": "Shengqi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185671?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 88818, "fullname": "Yuhao Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/88818?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 104646, "fullname": "Jiangjing Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/104646?format=json", "institution": "Alibaba Group"}, {"id": 86696, "fullname": "Xiaokang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86696?format=json", "institution": "Shanghai Jiao Tong University, China"}, {"id": 91466, "fullname": "Yichao Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/91466?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Recent advances in language and vision have demonstrated that scaling up model capacity consistently improves performance across diverse tasks.In 3D visual geometry reconstruction, large-scale training has likewise proven effective for learning versatile representations.However, further scaling of 3D models is challenging due to the complexity of geometric supervision and the diversity of 3D data. To overcome these limitations, we propose MoRE, a dense 3D visual foundation model based on a Mixture-of-Experts (MoE) architecture that dynamically routes features to task-specific experts, allowing them to specialize in complementary data aspects and enhance both scalability and adaptability.Aiming to improve robustness under real-world conditions, MoRE incorporates a confidence-based depth refinement module that stabilizes and refines geometric estimation.In addition, it integrates dense semantic features with globally aligned 3D backbone representations for high-fidelity surface normal prediction.MoRE is further optimized with tailored loss functions to ensure robust learning across diverse inputs and multiple geometric tasks.Extensive experiments demonstrate that MoRE achieves state-of-the-art performance across multiple benchmarks and supports effective downstream applications without extra computation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36698", "url": "https://g-1nonly.github.io/MoRE_Website/", "sourceid": 36736, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36692, "uid": "16a359f1f323bca3b0448e6c8ca5bec5", "name": "Fuel Gauge: Estimating Chain-of-Thought Length Ahead of Time in Large Multimodal Models", "authors": [{"id": 76205, "fullname": "Yuedong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76205?format=json", "institution": "University of Texas at Austin"}, {"id": 182940, "fullname": "Xiwen Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/182940?format=json", "institution": "University of Texas at Austin"}, {"id": 95377, "fullname": "Mustafa Munir", "url": "http://cvpr.thecvf.com/api/miniconf/users/95377?format=json", "institution": "The University of Texas at Austin"}, {"id": 85871, "fullname": "Radu Marculescu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85871?format=json", "institution": "University of Texas, Austin"}], "abstract": "Reasoning Large Multi-modality Models (LMMs) have become the de facto choice for many applications.However, these models rely on a Chain-of-Thought (CoT) process that is lengthy and unpredictable at runtime, often resulting in inefficient use of computational resources (due to memory fragmentation) and sub-optimal accuracy (due to under- and over-thinking).We observe empirically that the CoT process follows a Bernoulli process, whose behavior is independent of the specific generated samples. This suggests that the CoT length can be estimated ahead of time based on a hidden parameter representing the amount of \"fuel\" available to support the reasoning process.Based on this insight, we propose **Fuel Gauge, the first method which extracts this hidden signal and predicts CoT length ahead of time**. We demonstrate the utility on the Fuel Gauge on two downstream tasks: predictive KV cache allocation, which addresses memory fragmentation in LMM serving systems, and CoT length modulation, which mitigates under-thinking and over-thinking.Extensive experiments on LMMs across text-only, image-text, and video-text question answering benchmarks demonstrate the effectiveness, generalizability, and practical value of our Fuel Gauge. For example, on the GPQA-Diamond benchmark, our Fuel Gauge achieves less than half the CoT length prediction error compared to the baseline; this translates into a 13.37$\\times$ reduction in the memory allocation frequency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36692", "url": null, "sourceid": 31739, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36699, "uid": "59ddf60f6efb7a194b24a392215fb222", "name": "Geometry-Aligned and Anomaly-Aware Reconstruction for 3D Anomaly Detection", "authors": [{"id": 104386, "fullname": "linchun wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/104386?format=json", "institution": "Wuhan University"}, {"id": 88855, "fullname": "Qin Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/88855?format=json", "institution": "Wuhan University"}, {"id": 185672, "fullname": "Yuanhao Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/185672?format=json", "institution": null}, {"id": 86262, "fullname": "Zhongyuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86262?format=json", "institution": "Wuhan University"}], "abstract": "Point cloud anomaly detection is crucial in automated manufacturing, with reconstruction-based diffusion methods emerging as a mainstream solution. However, these approaches still face two major challenges: (1) geometry violation, where random noise perturbations deviate from local surface normals, causing structural distortion; and (2) undistinguished reference regions, where uniformly applied coarse anomaly embeddings during denoising blur normal details and impede accurate anomaly recovery. To address these issues, we propose AARD, a geometry-aligned and anomaly-aware diffusion reconstruction framework. We argue that high-fidelity anomaly detection requires a principled reformulation of the diffusion process: noise should align with geometry to preserve structures, and reconstruction can be better guided by anomaly-aware references to discriminatively recover normal details while correcting defects. AARD progressively aligns noise directions with vertex normals while maintaining vertex-graph consistency, and employs an adaptive transformer that assigns normal references to anomalous regions and input references to normal areas. Experiments on Anomaly-ShapeNet and Real3D-AD show that AARD consistently outperforms state-of-the-art approaches, achieving superior geometric fidelity and robust anomaly localization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36699", "url": null, "sourceid": 39786, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36696, "uid": "2f68c2e0cd8964d59dbbcfa48196b499", "name": "Pip-Stereo: Progressive Iterations Pruner for Iterative Optimization based Stereo Matching", "authors": [{"id": 185665, "fullname": "Jintu Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185665?format=json", "institution": "XPeng Motors"}, {"id": 185666, "fullname": "Qizhe Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185666?format=json", "institution": "Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences"}, {"id": 185667, "fullname": "Huangxin Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185667?format=json", "institution": "Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences; Shenzhen University"}, {"id": 185668, "fullname": "zhuojie Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185668?format=json", "institution": "\u5e7f\u5dde\u6c47\u5929\u98de\u884c\u6c7d\u8f66\u5236\u9020\u6709\u9650\u516c\u53f8"}], "abstract": "While iterative stereo matching achieves high accuracy, its dependence on Recurrent Neural Networks (RNN) hinders edge deployment, a challenge underexplored in existing researches. We analyze iterative refinement and reveal that disparity updates are spatially sparse and temporally redundant. First, we introduce a progressive iteration pruning strategy that suppresses redundant update steps, effectively collapsing the recursive computation into a near-single-pass inference. Second, we propose a collaborative monocular prior transfer framework that implicitly embeds depth priors without requiring a dedicated monocular encoder, thereby eliminating its associated computational burden. Third, we develop FlashGRU, a hardware-aware RNN operator leveraging structured sparsity and I/O-conscious design, achieving a 7.28$\\times$ speedup, 76.6\\% memory peak reduction and 80.9\\% global memory requests reduction over natvie ConvGRUs under 2K resolution. Our PipStereo enables real-time, high-fidelity stereo matching on edge hardware: it processes 320$\\times$640 frames in just 75ms on an NVIDIA Jetson Orin NX (FP16) and 19ms on RTX 4090, matching the accuracy of large iterative based models, and our generalization ability and accuracy far exceeds that of existing real-time methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36696", "url": null, "sourceid": 43050, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36703, "uid": "294590f8ea96849dda853bf178cab105", "name": "Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection", "authors": [{"id": 97522, "fullname": "Shan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/97522?format=json", "institution": "ANU;CSIRO"}, {"id": 73955, "fullname": "Maying Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/73955?format=json", "institution": "NVIDIA"}, {"id": 150901, "fullname": "Nadine Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150901?format=json", "institution": "NVIDIA"}, {"id": 74135, "fullname": "Chuong Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/74135?format=json", "institution": "CSIRO"}, {"id": 92749, "fullname": "Hongdong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/92749?format=json", "institution": "Australian National University"}, {"id": 73959, "fullname": "Jose M. Alvarez", "url": "http://cvpr.thecvf.com/api/miniconf/users/73959?format=json", "institution": "NVIDIA"}], "abstract": "Multimodal large language models achieve strong performance across diverse tasks but remain prone to hallucinations, where outputs are not grounded in visual inputs. This issue can be attributed to two main biases: text\u2013visual bias, the overreliance on prompts and prior outputs, and co-occurrence bias, spurious correlations between frequently paired objects. We propose Gradient-based Influence-Aware Constrained Decoding (GACD), an inference-based method, that addresses both biases without auxiliary models, and is readily applicable to existing models without finetuning. The core of our approach is bias estimation, which uses first-order Taylor gradients to understand the contribution of individual tokens\u2014visual features and text tokens\u2014to the current output. Based on this analysis, GACD mitigates hallucinations through two components: (1) suppressing spurious visual features correlated with the output objects, and (2) rebalancing cross-modal contributions by strengthening visual features relative to text. Experiments across multiple benchmarks demonstrate that GACD effectively reduces hallucinations and improves the visual grounding of MLLM outputs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36703", "url": null, "sourceid": 35874, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36705, "uid": "ea6800b170ae74a500d50a77c1cd2b0c", "name": "STAR-R1: Multi-View Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs", "authors": [{"id": 185687, "fullname": "Zongzhao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185687?format=json", "institution": "Renmin University of China"}, {"id": 88244, "fullname": "Zongyang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/88244?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 185688, "fullname": "Mingze Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185688?format=json", "institution": "Renmin University of China"}, {"id": 185689, "fullname": "Songyou Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185689?format=json", "institution": "Renmin University of China"}, {"id": 185690, "fullname": "Yu Rong", "url": "http://cvpr.thecvf.com/api/miniconf/users/185690?format=json", "institution": "Alibaba Group"}, {"id": 185691, "fullname": "Tingyang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185691?format=json", "institution": "Alibaba Group"}, {"id": 76750, "fullname": "Ziqi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76750?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 85411, "fullname": "Deli Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/85411?format=json", "institution": "Alibaba Group"}, {"id": 88061, "fullname": "Wenbing Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88061?format=json", "institution": "Renmin University of China"}], "abstract": "Multimodal Large Language Models (MLLMs) remain far from human-level performance in multi-view spatial reasoning, where models must establish object correspondences across view and infer coherent scene semantics. We analyze this limitation through the Transformation-Driven Visual Reasoning (TVR) task and find that Supervised Fine-Tuning (SFT) fails to capture cross-view consistency, whereas reinforcement learning (RL) fails to reliably identify key referential objects. To bridge this gap, we introduce multi-View Spatial TrAnsformation Reasoning (STAR-R1), a two-stage framework that combines process-supervised SFT with a referential-aware RL paradigm. STAR-R1 first learns structured spatial reasoning trajectories from high-quality CoTs and then uses fine-grained rewards on referential selection and answer correctness to encourage effective exploration and robust scene interpretation. Despite using only a small amount of high-quality training data, STAR-R1 surpasses state-of-the-art models with far more training data on the multi-view spatial understanding benchmarks TVR, MMSI-Bench, MindCube-Bench, and SPAR-Bench. Our study reveals the overlooked potential of RL in multi-view spatial understanding and points a way toward potentially achieving more human-like spatial reasoning in MLLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36705", "url": null, "sourceid": 35496, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36713, "uid": "def4bf8b332ec684f60d57af1c8a8f43", "name": "Back to Basics: Let Denoising Generative Models Denoise", "authors": [{"id": 89938, "fullname": "Tianhong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/89938?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 150920, "fullname": "Kaiming He", "url": "http://cvpr.thecvf.com/api/miniconf/users/150920?format=json", "institution": "Massachusetts Institute of Technology"}], "abstract": "Today's denoising diffusion models do not \"denoise\" in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than \"**Just image Transformers**\", or **JiT**, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With our networks mapping back to the basics of the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36713", "url": null, "sourceid": 39246, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36714, "uid": "01657d0923fcfe6b240ac1d182f998ea", "name": "Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video", "authors": [{"id": 126492, "fullname": "Zeren Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126492?format=json", "institution": "University of Oxford"}, {"id": 180810, "fullname": "Chuanxia Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180810?format=json", "institution": "University of Oxford"}, {"id": 85825, "fullname": "Iro Laina", "url": "http://cvpr.thecvf.com/api/miniconf/users/85825?format=json", "institution": "University of Oxford"}, {"id": 73729, "fullname": "Diane Larlus", "url": "http://cvpr.thecvf.com/api/miniconf/users/73729?format=json", "institution": "NAVER LABS Europe"}, {"id": 85624, "fullname": "Andrea Vedaldi", "url": "http://cvpr.thecvf.com/api/miniconf/users/85624?format=json", "institution": "University of Oxford"}], "abstract": "We propose Mesh4D, a feed-forward model for monocular 4D mesh reconstruction. Given a monocular video of a dynamic object, our model reconstructs the object\u2019s complete 3D shape and motion, represented as a deformation field. Our key contribution is a compact latent space that encodes the entire animation sequence in a single pass. This latent space is learned by an autoencoder that, during training, is guided by the skeletal structure of the training objects, providing strong priors on plausible deformations. Crucially, skeletal information is not required at inference time. The encoder employs spatio-temporal attention, yielding a more stable representation of the object\u2019s overall deformation. Building on this representation, we train a latent diffusion model that, conditioned on the input video and the mesh reconstructed from the first frame, predicts the full animation in one shot. We evaluate Mesh4D on reconstruction and novel view synthesis benchmarks, outperforming prior methods in recovering accurate 3D shape and deformation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36714", "url": null, "sourceid": 40685, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36715, "uid": "23174474f31785ce939641039a212de4", "name": "AsymLoc: Towards Asymmetric Feature Matching for Efficient Visual Localization", "authors": [{"id": 156904, "fullname": "Mohammad Omama", "url": "http://cvpr.thecvf.com/api/miniconf/users/156904?format=json", "institution": "University of Texas at Austin"}, {"id": 163745, "fullname": "Gabriele Berton", "url": "http://cvpr.thecvf.com/api/miniconf/users/163745?format=json", "institution": "Amazon"}, {"id": 165234, "fullname": "Eric Foxlin", "url": "http://cvpr.thecvf.com/api/miniconf/users/165234?format=json", "institution": null}, {"id": 185712, "fullname": "Yelin Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/185712?format=json", "institution": "Amazon"}], "abstract": "Precise and real-time visual localization is critical for applications like AR/VR and robotics, especially on resource-constrained edge devices such as smart glasses, where battery life and heat dissipation can be primary concerns. While many efficient models exist, further reducing compute without sacrificing accuracy is essential for practical deployment. To address this, we propose asymmetric visual localization: a large Teacher model processes pre-mapped database images offline, while a lightweight Student model processes the query image online. This creates a challenge in matching features from two different models without resorting to heavy, learned matchers.We introduce AsymLoc, a novel distillation framework that aligns a Student to its Teacher through a combination of a geometry-driven matching objective and a joint detector-descriptor distillation objective, enabling fast, parameter-less nearest-neighbor matching.Extensive experiments on HPatches, ScanNet, IMC2022, and Aachen show that AsymLoc achieves up to 95% of the teacher's localization accuracy using an order of magnitude smaller models, significantly outperforming existing baselines and establishing a new state-of-the-art efficiency-accuracy trade-off.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36715", "url": null, "sourceid": 33244, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36723, "uid": "e782e89abdb62740b34d162d21510f6c", "name": "Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection", "authors": [{"id": 77402, "fullname": "Xu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77402?format=json", "institution": "University of Sydney"}, {"id": 185728, "fullname": "Zhe Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185728?format=json", "institution": "La Trobe University"}, {"id": 91732, "fullname": "Jing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91732?format=json", "institution": "The University of Sydney"}, {"id": 73994, "fullname": "Dacheng Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/73994?format=json", "institution": "Nanyang Technological University"}], "abstract": "Most referring object detection (ROD) models, especially the modern grounding detectors, are designed for data-rich conditions, yet many practical deployments, such as robotics, augmented reality, and other specialized domains, would face severe label scarcity. In such regimes, end-to-end grounding detectors need to learn spatial and semantic structure from scratch, wasting precious samples. We ask a simple question: Can explicit reasoning priors help models learn more efficiently when data is scarce? To explore this, we first introduce a Data-efficient Referring Object Detection (De-ROD) task, which is a benchmark protocol for measuring ROD performance in low-data and few-shot settings. We then propose the HeROD (Heuristic-inspired ROD), a lightweight, model-agnostic framework that injects explicit, heuristic-inspired spatial and semantic reasoning priors, which are interpretable signals derived based on the referring phrase, into 3 stages of a modern DETR-style pipeline: proposal ranking, prediction fusion, and Hungarian matching. By biasing both training and inference toward plausible candidates, these priors promise to improve label efficiency and convergence performance. On RefCOCO, RefCOCO+, and RefCOCOg, HeROD consistently outperforms strong grounding baselines in scarce-label regimes. More broadly, our results suggest that integrating simple, interpretable reasoning priors provides a practical and extensible path toward better data-efficient vision\u2013language understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36723", "url": null, "sourceid": 42210, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36725, "uid": "9f86fcd5de31459d5f88ce83ea43d509", "name": "Iris: Bringing Real-World Priors into Diffusion Model for Monocular Depth Estimation", "authors": [{"id": 129602, "fullname": "Xinhao Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/129602?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 181698, "fullname": "Gensheng Pei", "url": "http://cvpr.thecvf.com/api/miniconf/users/181698?format=json", "institution": "Sungkyunkwan University"}, {"id": 129610, "fullname": "Zeren Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/129610?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 129605, "fullname": "Yazhou Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/129605?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 72996, "fullname": "Fumin Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/72996?format=json", "institution": "UESTC"}, {"id": 86315, "fullname": "Wenguan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86315?format=json", "institution": "Zhejiang University"}], "abstract": "In this paper, we propose \\textbf{Iris}, a deterministic framework for Monocular Depth Estimation (MDE) that integrates real-world priors into the diffusion model.  Conventional feed-forward methods rely on massive training data, yet still miss details. Previous diffusion-based methods leverage rich generative priors yet struggle with synthetic-to-real domain transfer. Iris, in contrast, preserves fine details, generalizes strongly from synthetic to real scenes, and remains efficient with limited training data. To this end, we introduce a two-stage Priors-to-Geometry Deterministic (PGD) schedule: the prior stage uses Spectral-Gated Distillation (SGD) to transfer low-frequency real priors while leaving high-frequency details unconstrained, and the geometry stage applies Spectral-Gated Consistency (SGC) to enforce high-frequency fidelity while refining with synthetic ground truth. The two stages share weights and are executed with a high-to-low timestep schedule. Extensive experimental results confirm that Iris achieves significant improvements in MDE performance with strong in-the-wild generalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36725", "url": null, "sourceid": 37760, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36726, "uid": "9493def6e3654e282ae77a93d6f784de", "name": "Few-Shot Hybrid Incremental Learning\uff1aContinually Learning under Data Scarcity and Task Uncertainty", "authors": [{"id": 182041, "fullname": "Yan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182041?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 185729, "fullname": "Yuzhu Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185729?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 185730, "fullname": "Kan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185730?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 185731, "fullname": "Shu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185731?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 185732, "fullname": "Diqi He", "url": "http://cvpr.thecvf.com/api/miniconf/users/185732?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 129116, "fullname": "Dingwen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129116?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 86299, "fullname": "Junwei Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/86299?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}], "abstract": "The increasing complexity of real-world deployment requires intelligent agents to effectively adapt to non-stationary data streams with stochastic increments under data scarcity. We formally define this challenge as the Few-Shot Hybrid Incremental Learning (FSHIL) paradigm, which reveals a critical stability-plasticity dilemma. Existing strategies struggle to address this dilemma: representation freezing in few-shot incremental learning can mitigate overfitting under data scarcity but leads to insufficient representation plasticity, while architecture expansion in hybrid incremental learning provides plasticity for adaptation but results in overfitting under few-shot conditions. To address this, we propose the Conditional Meta-Expanding Mixture-of-Experts (CME-MoE), which balances feature-level stability-plasticity trade-off through conditional expert reuse and meta-expansion mechanism. Furthermore, recognizing the multi-domain manifestation in the latent space, we introduce the Self-Expanding Prototype Classifier (SEPC), which on-demand expands classification to model complex domain-shifted decision boundaries. The proposed method outperforms existing state-of-the-art methods in three few-shot incremental learning settings across five mainstream datasets, effectively addressing data scarcity and task uncertainty, and providing a robust solution for real-world continual learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36726", "url": null, "sourceid": 43633, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36729, "uid": "224400c2e26393536a7a306a636f5287", "name": "MultiCrafter: High-Fidelity Multi-Subject Generation via Disentangled Attention and Identity-Aware Preference Alignment", "authors": [{"id": 181551, "fullname": "Tao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181551?format=json", "institution": "Zhejiang University"}, {"id": 185739, "fullname": "Yibo Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185739?format=json", "institution": "Zhejiang University"}, {"id": 126916, "fullname": "Yehao Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126916?format=json", "institution": "Zhejiang University"}, {"id": 154148, "fullname": "Zhizhong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154148?format=json", "institution": "Zhejiang University"}, {"id": 157906, "fullname": "Zeyi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157906?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 185740, "fullname": "Zequn Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185740?format=json", "institution": "Zhejiang University"}, {"id": 86317, "fullname": "Xi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86317?format=json", "institution": "Zhejiang University"}], "abstract": "Multi-subject image generation aims to synthesize user-provided subjects in a single image while preserving subject fidelity, ensuring prompt consistency, and aligning with human aesthetic preferences. Existing In-Context-Learning based methods are limited by their highly coupled training paradigm. These methods attempt to achieve both high subject fidelity and multi-dimensional human preference alignment within a single training stage, relying on a single, indirect reconstruction loss, which is difficult to simultaneously satisfy both these goals. To address this, we propose MultiCrafter, a framework that decouples this task into two distinct training stages. First, in a pre-training stage, we introduce an explicit positional supervision mechanism that effectively resolves attention bleeding and drastically enhances subject fidelity. Second, in a post-training stage, we propose Identity-Preserving Preference Optimization, a novel online reinforcement learning framework. We feature a scoring mechanism to accurately assess multi-subject fidelity based on the Hungarian matching algorithm, which allows the model to optimize for aesthetics and prompt alignment while ensuring subject fidelity achieved in the first stage. Experiments validate that our decoupling framework significantly improves subject fidelity while aligning with human preferences better.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36729", "url": null, "sourceid": 45404, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36732, "uid": "513fadbd92b2c516bce1831c8a118694", "name": "GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings", "authors": [{"id": 185745, "fullname": "Angel Daruna", "url": "http://cvpr.thecvf.com/api/miniconf/users/185745?format=json", "institution": "SRI International"}, {"id": 183805, "fullname": "Nicholas Meegan", "url": "http://cvpr.thecvf.com/api/miniconf/users/183805?format=json", "institution": "SRI International"}, {"id": 95658, "fullname": "Han-Pang Chiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/95658?format=json", "institution": "SRI International"}, {"id": 73125, "fullname": "Supun Samarasekera", "url": "http://cvpr.thecvf.com/api/miniconf/users/73125?format=json", "institution": "SRI International"}, {"id": 73976, "fullname": "Rakesh \u201cTeddy\u201d Kumar", "url": "http://cvpr.thecvf.com/api/miniconf/users/73976?format=json", "institution": "SRI International"}], "abstract": "Worldwide visual geo-localization seeks to determine the geographic location of an image anywhere on Earth using only its visual content. Learned representations of geography for visual geo-localization remain an active research topic despite much progress. We formulate geo-localization as aligning the visual representation of the query image with a learned geographic representation. Unlike prior work, our geographic representation explicitly models the world as a hierarchy of learned geographic embeddings. Additionally, we introduce an approach to efficiently fuse the appearance features of the query image with its semantic segmentation map, forming a robust visual representation. Our main experiments demonstrate improved all-time bests in 22 out of 25 metrics measured across five benchmark datasets compared to prior state-of-the-art (SOTA) methods. Additional ablation studies support the claim that these gains are primarily driven by the combination of geographic and visual representations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36732", "url": null, "sourceid": 33415, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36734, "uid": "de3236d15ff4ded7ef475b2f04b0bba3", "name": "Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling", "authors": [{"id": 96405, "fullname": "sanghyeok chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/96405?format=json", "institution": "seoul national university"}, {"id": 75885, "fullname": "Pyunghwan Ahn", "url": "http://cvpr.thecvf.com/api/miniconf/users/75885?format=json", "institution": "LG AI Research"}, {"id": 185748, "fullname": "Gwangmo Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/185748?format=json", "institution": "Seoul National University; LG AI Research"}, {"id": 185749, "fullname": "Seung Hwan Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/185749?format=json", "institution": "LG AI Research"}, {"id": 135184, "fullname": "Honglak Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/135184?format=json", "institution": "LG AI Research"}, {"id": 75881, "fullname": "Bohyung Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/75881?format=json", "institution": "Seoul National University"}], "abstract": "Sparse upcycling provides an efficient way to initialize a Mixture-of-Experts (MoE) model from a pretrained dense checkpoint instead of training from scratch.However, since all experts start from identical weights and the router is randomly initialized, the model suffers from expert symmetry and limited early specialization.We propose Cluster-aware Upcycling, a strategy that embeds semantic structure into MoE initialization.The method clusters the dense model\u2019s input activations to identify latent subspaces, initializes each expert using a data-aware truncated SVD of the dense weights within its cluster, and initializes the router with the corresponding cluster centroids.This cluster-aware initialization breaks expert symmetry and encourages early specialization aligned with the data structure.In addition, we introduce an Expert-Ensemble Self-Distillation loss that regularizes training by guiding uncertain routing with stable predictions from an ensemble teacher.Applied to CLIP ViT-B/16 and ViT-B/32 models, Cluster-aware Upcycling achieves consistent improvements over standard upcycling across zero-shot and few-shot benchmarks, and produces more diverse and disentangled expert representations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36734", "url": null, "sourceid": 40704, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36737, "uid": "454ff46780a58d90242d4a987af54633", "name": "Agentic Learner with Grow-and-Refine Multimodal Semantic Memory", "authors": [{"id": 185761, "fullname": "Weihao Bo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185761?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 181336, "fullname": "Shan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181336?format=json", "institution": "Australian Institute for Machine Learning (AIML)"}, {"id": 128187, "fullname": "Yanpeng Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/128187?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 144628, "fullname": "Jingjing Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/144628?format=json", "institution": "Baidu"}, {"id": 185762, "fullname": "Qunyi Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/185762?format=json", "institution": "Baidu"}, {"id": 87785, "fullname": "Xiao Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87785?format=json", "institution": "Baidu"}, {"id": 185763, "fullname": "KunbinChen KunbinChen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185763?format=json", "institution": "Baidu"}, {"id": 185764, "fullname": "Wei He", "url": "http://cvpr.thecvf.com/api/miniconf/users/185764?format=json", "institution": "Baidu"}, {"id": 157000, "fullname": "Xiaofan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/157000?format=json", "institution": "BAIDU Inc,"}, {"id": 135934, "fullname": "Na Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/135934?format=json", "institution": "Singapore University of Technology and Design"}, {"id": 126338, "fullname": "Jingdong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126338?format=json", "institution": "Baidu"}, {"id": 181619, "fullname": "Zechao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181619?format=json", "institution": "Nanjing University of Science and Techonolgy"}], "abstract": "MLLMs exhibit strong reasoning on isolated queries, yet they operate de novo\u2014solving each problem independently and often repeating the same mistakes. Existing memory-augmented agents mainly store past trajectories for reuse. However, trajectory-based memory suffers from brevity bias, gradually losing essential domain knowledge. More critically, even in truly multimodal problem-solving settings, it records only a **single-modality** trace of past behavior, failing to preserve how visual attention and logical reasoning jointly contributed to the solution. This is fundamentally misaligned with human cognition: semantic memory is both **multimodal and integrated**, preserving visual and abstract knowledge through coordinated but distinct representational streams. We thus introduce **ViLoMem**, a dual-stream memory framework that constructs compact, schema-based memory. It separately encodes visual distraction patterns and logical reasoning errors, enabling MLLMs to learn from their successful and failed experiences. Following a grow-and-refine principle, the system incrementally accumulates and updates multimodal semantic knowledge\u2014preserving stable, generalizable strategies while avoiding catastrophic forgetting. Across nine multimodal benchmarks, **ViLoMem** consistently improves pass@1 accuracy and substantially reduces repeated visual and logical errors. Ablations confirm the necessity of dual-stream memory with explicit distraction\u2013hallucination separation, demonstrating the value of error-aware multimodal memory for lifelong and cross-domain agentic learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36737", "url": null, "sourceid": 32358, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36738, "uid": "7216fbeed52c416a070d900a7c17f89f", "name": "Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory", "authors": [{"id": 181971, "fullname": "Ce Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181971?format=json", "institution": "Carnegie Mellon University"}, {"id": 181981, "fullname": "Jinxi He", "url": "http://cvpr.thecvf.com/api/miniconf/users/181981?format=json", "institution": "CMU, Carnegie Mellon University"}, {"id": 176272, "fullname": "Junyi He", "url": "http://cvpr.thecvf.com/api/miniconf/users/176272?format=json", "institution": "Carnegie Mellon University"}, {"id": 127777, "fullname": "Katia Sycara", "url": "http://cvpr.thecvf.com/api/miniconf/users/127777?format=json", "institution": "Carnegie Mellon University"}, {"id": 127791, "fullname": "Yaqi Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/127791?format=json", "institution": "CMU"}], "abstract": "Multi-modal Large Language Models (MLLMs) have achieved remarkable performance across a wide range of visual reasoning tasks, yet their vulnerability to safety risks remains a pressing concern. While prior research primarily focuses on jailbreak defenses that detect and refuse explicitly unsafe inputs, such approaches often overlook contextual safety, which requires models to distinguish subtle contextual differences between scenarios that may appear similar but diverge significantly in safety intent. In this work, we present MM-SafetyBench++, a carefully curated benchmark designed for contextual safety evaluation. Specifically, for each unsafe image\u2013text pair, we construct a corresponding safe counterpart through minimal modifications that flip the user intent while preserving the underlying contextual meaning, enabling controlled evaluation of whether models can adapt their safety behaviors based on contextual understanding. Further, we introduce EchoSafe, a training-free framework that maintains a self-reflective memory bank to accumulate and retrieve safety insights from prior interactions. By integrating relevant past experiences into current prompts, EchoSafe enables context-aware reasoning and continual evolution of safety behavior during inference. Extensive experiments on various multi-modal safety benchmarks demonstrate that EchoSafe consistently achieves superior performance, establishing a strong baseline for advancing contextual safety in MLLMs. All benchmark data and code will be made publicly accessible upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36738", "url": null, "sourceid": 39429, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36739, "uid": "46f6215956d7c46255d00263c715c9d4", "name": "TokenSplat: Token-aligned 3D Gaussian Splatting for Feed-forward Pose-free Reconstruction", "authors": [{"id": 180730, "fullname": "Yihui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180730?format=json", "institution": "Beihang University"}, {"id": 185765, "fullname": "Chengxin Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/185765?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 144690, "fullname": "Zichen Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/144690?format=json", "institution": "Beihang University"}, {"id": 76464, "fullname": "Hongyu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76464?format=json", "institution": "Beihang University"}, {"id": 87605, "fullname": "Di Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87605?format=json", "institution": "Beihang University"}], "abstract": "We present **TokenSplat**, a feed-forward framework for joint 3D Gaussian reconstruction and camera pose estimation from unposed multi-view images.At its core, TokenSplat introduces a **Token-aligned Gaussian Prediction** module that aligns semantically corresponding information across views directly in the feature space.Guided by coarse token positions and fusion confidence, it aggregates multi-scale contextual features to enable long-range cross-view reasoning and reduce redundancy from overlapping Gaussians.To further enhance pose robustness and disentangle viewpoint cues from scene semantics, TokenSplat employs learnable camera tokens and an **Asymmetric Dual-Flow Decoder (ADF-Decoder)** that enforces directionally constrained communication between camera and image tokens. This maintains clean factorization within a feed-forward architecture, enabling coherent reconstruction and stable pose estimation without iterative refinement.Extensive experiments demonstrate that TokenSplat achieves higher reconstruction fidelity and novel-view synthesis quality in pose-free settings, and significantly improves pose estimation accuracy compared to prior pose-free methods.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36739", "url": null, "sourceid": 32448, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36740, "uid": "3e3fa756bedd8599c326b868a83216a1", "name": "EcoSplat: Efficiency-controllable Feed-forward 3D Gaussian Splatting from Multi-view Images", "authors": [{"id": 154838, "fullname": "Minh-Quan Viet Bui", "url": "http://cvpr.thecvf.com/api/miniconf/users/154838?format=json", "institution": "KAIST"}, {"id": 154837, "fullname": "Jongmin Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/154837?format=json", "institution": "KAIST"}, {"id": 154839, "fullname": "Juan Luis Gonzalez Bello", "url": "http://cvpr.thecvf.com/api/miniconf/users/154839?format=json", "institution": "Flawless AI"}, {"id": 107109, "fullname": "Jaeho Moon", "url": "http://cvpr.thecvf.com/api/miniconf/users/107109?format=json", "institution": "KAIST"}, {"id": 131019, "fullname": "Jihyong Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/131019?format=json", "institution": "Chung-Ang University"}, {"id": 87114, "fullname": "Munchurl Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/87114?format=json", "institution": "Korea Advanced Institute of Science and Technology"}], "abstract": "Feed-forward 3D Gaussian Splatting (3DGS) enables efficient one-pass scene reconstruction, providing 3D representations for novel view synthesis without per-scene optimization. However, existing methods typically predict pixel-aligned primitives per-view, producing an excessive number of primitives in dense-view settings and offering no explicit control over the number of predicted Gaussians. To address this, we propose EcoSplat, the first efficiency-controllable feed-forward 3DGS framework that adaptively predicts the 3D representation for any given target primitive count at inference time. EcoSplat adopts a two-stage optimization process. The first stage is Pixel-aligned Gaussian Training (PGT) where our model learns initial primitive prediction. The second stage is Importance-aware Gaussian Finetuning (IGF) stage where our model learns rank primitives and adaptively adjust their parameters based on the target primitive count. Extensive experiments across multiple dense-view settings show that EcoSplat is robust and outperforms state-of-the-art methods under strict primitive-count constraints, making it well-suited for flexible downstream rendering tasks. Code and project page will be released.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36740", "url": null, "sourceid": 38847, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36741, "uid": "aefa72019c21020d838fad6cdbf155e8", "name": "Simple but Effective Triplet-Based Compression Strategies for Compact Visual Localization", "authors": [{"id": 73911, "fullname": "Torsten Sattler", "url": "http://cvpr.thecvf.com/api/miniconf/users/73911?format=json", "institution": "Czech Technical University in Prague"}, {"id": 86329, "fullname": "Zuzana Kukelova", "url": "http://cvpr.thecvf.com/api/miniconf/users/86329?format=json", "institution": "Czech Technical University in Prague"}], "abstract": "Visual localization, i.e., the problem of estimating the camera pose from which an image was taken, is an important part of applications such as augmented reality and autonomous robots. Many of these applications require a compact memory footprint. Thus, a considerable amount of work has been spent on designing memory-efficient scene representations for visual localization. In this paper, we focus on compressing the 3D structure of the scene by selecting a subset of points from a Structure-from-Motion (SfM) point cloud. In contrast to prior work, which aims to solve (complex) optimization problems, we propose a simple strategy that is almost trivial to implement. Our compression strategy is based on the idea of selecting triplets of points such that the camera pose of each database image (used to build the SfM point cloud) can be accurately estimated from these triplets. Despite its simplicity, our strategy performs similarly to or better than current state-of-the-art structure compression approaches. Combined with standard product quantization approaches to compress feature descriptors, our approach compares favorably with recent learning-based approaches for compact visual localization.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36741", "url": null, "sourceid": 36000, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36742, "uid": "f085547b5f91b3ab87bad8b4e0d7674a", "name": "Diffusion Mental Averages", "authors": [{"id": 183254, "fullname": "Phonphrm Thawatdamrongkit", "url": "http://cvpr.thecvf.com/api/miniconf/users/183254?format=json", "institution": "Vidyasirimedhi Institute of Science and Technology"}, {"id": 183252, "fullname": "Sukit Seripanitkarn", "url": "http://cvpr.thecvf.com/api/miniconf/users/183252?format=json", "institution": "Vidyasirimedhi Institute of Science and Technology"}, {"id": 89322, "fullname": "Supasorn Suwajanakorn", "url": "http://cvpr.thecvf.com/api/miniconf/users/89322?format=json", "institution": "Vidyasirimedhi Institute of Science and Technology"}], "abstract": "Can a diffusion model produce its own \u201cmental average\u201d of a concept\u2014one that is as sharp and realistic as a typical sample? We introduce Diffusion Mental Averages (DMA), a model-centric answer to this question. While prior methods aim to average image collections, they produce blurry results when applied to diffusion samples from the same prompt.  These data-centric techniques operate outside the model, ignoring the generative process. In contrast, DMA averages within the diffusion model\u2019s semantic space, as discovered by recent studies.  Since this space evolves across timesteps and lacks a direct decoder, we cast averaging as trajectory alignment: optimize multiple noise latents so their denoising trajectories progressively converge toward shared coarse-to-fine semantics, yielding a single sharp prototype. We extend our approach to multimodal concepts (e.g., dogs with many breeds) by clustering samples in semantically-rich spaces such as CLIP and applying Textual Inversion or LoRA to bridge CLIP clusters into diffusion space. This is, to our knowledge, the first approach that delivers consistent, realistic averages, even for abstract concepts, serving as a concrete visual summary and a lens into model biases and concept representation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36742", "url": null, "sourceid": 44889, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36743, "uid": "8353b46aaebf9ec0e115fe2077e5c48d", "name": "Rethinking 2D-3D Registration: A Novel Network for High-Value Zone Selection and Representation Consistency Alignment", "authors": [{"id": 148100, "fullname": "Zhixin Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/148100?format=json", "institution": "University of Science and Technology of China"}, {"id": 185766, "fullname": "Bohao Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185766?format=json", "institution": "University of Science and Technology of China"}, {"id": 73571, "fullname": "Jiacheng Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/73571?format=json", "institution": "University of Science and Technology of China"}, {"id": 185767, "fullname": "Xiaotian Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185767?format=json", "institution": "University of Science and Technology of China"}, {"id": 104337, "fullname": "Xinjun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/104337?format=json", "institution": "University of Science and Technology of China"}, {"id": 185768, "fullname": "Yujia Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185768?format=json", "institution": "University of Science and Technology of China"}, {"id": 156266, "fullname": "Baoqun Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/156266?format=json", "institution": "University of Science and Technology of China"}, {"id": 85977, "fullname": "Tianzhu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85977?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Both detection-then-match and detection-free methods have been extensively studied for image-to-point cloud registration, yet they still face significant challenges. The detection-then-match approach emphasizes high-quality correspondences but is limited by the availability of repeatable keypoints, making it susceptible to errors from incorrect matches. In contrast, detection-free methods aim for dense correspondences using a coarse-to-fine strategy to mitigate matching errors. However, non-overlapping regions and low-quality matches still introduce inaccuracies, and the differences between image texture and point cloud structure cause inconsistent region representations, increasing the likelihood of incorrect matches.To address these challenges, we propose two innovative modules: the High-Value Zone Reinforced Selection Module (HZRS) and the Zone Representation Consistency Alignment Module (ZRCA). HZRS employs reinforcement learning to resolve the non-differentiable issue of selecting high-value matching regions, while ZRCA improves region alignment through three stages: understand, coordinate, and accelerate.Extensive experiments and ablation studies on RGB-D Scenes v2 and 7-Scenes demonstrate the superiority of our network, establishing it as the state-of-the-art for image-to-point cloud registration.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36743", "url": null, "sourceid": 42364, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36747, "uid": "ba6be4a2b398df63f57ae0f7e5f330d8", "name": "Stable Mean Flow: Lyapunov-Inspired One-Step Flow Matching", "authors": [{"id": 174538, "fullname": "Guangxun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174538?format=json", "institution": "New York University"}, {"id": 177089, "fullname": "Mason Haberle", "url": "http://cvpr.thecvf.com/api/miniconf/users/177089?format=json", "institution": "New York University"}, {"id": 185777, "fullname": "Davi Geiger", "url": "http://cvpr.thecvf.com/api/miniconf/users/185777?format=json", "institution": "New York University"}], "abstract": "The Mean Flow Matching algorithm is the state-of-the-art for one-step generative models.  Building on this idea, we propose the Stable Mean Flow algorithm and introduce a Lyapunov-inspired stability regularizer that enforces local non-expansivity of the single-step transport map. This design guarantees uniqueness of characteristics and bounds trajectory drift. We conduct experiments that show improved output quality and convergence speed over Mean Flow. Moreover, we establish explicit upper bounds on error growth for both one-step and multi-step generation.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36747", "url": null, "sourceid": 46588, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36750, "uid": "09a66d2b77b4ea8c659311f7ea5add01", "name": "PPISP: Physically-Plausible Compensation and Control of Photometric Variations in Radiance Field Reconstruction", "authors": [{"id": 182238, "fullname": "Isaac Deutsch", "url": "http://cvpr.thecvf.com/api/miniconf/users/182238?format=json", "institution": "NVIDIA"}, {"id": 155601, "fullname": "Nicolas Mo\u00ebnne-Loccoz", "url": "http://cvpr.thecvf.com/api/miniconf/users/155601?format=json", "institution": "NVIDIA"}, {"id": 185784, "fullname": "Gavriel State", "url": "http://cvpr.thecvf.com/api/miniconf/users/185784?format=json", "institution": "NVIDIA"}, {"id": 86828, "fullname": "\u017dan Goj\u010di\u010d", "url": "http://cvpr.thecvf.com/api/miniconf/users/86828?format=json", "institution": "NVIDIA"}], "abstract": "Multi-view 3D reconstruction methods remain highly sensitive to photometric inconsistencies arising from camera optical characteristics and variations in image signal processing (ISP). Existing mitigation strategies such as per-frame latent variables or affine color corrections lack physical grounding and generalize poorly to novel views. We propose the Physically-Plausible ISP (PPISP) correction module, which disentangles camera-intrinsic and capture-dependent effects through physically based and interpretable transformations. A dedicated PPISP controller, trained on the input views, predicts ISP parameters for novel viewpoints, analogous to auto exposure and auto white balance in real cameras. This design enables realistic and fair evaluation on novel views without access to ground-truth images. PPISP achieves SoTA performance on standard benchmarks, while providing intuitive control and supporting the integration of metadata when available.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36750", "url": null, "sourceid": 40005, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40274?format=json"], "related_events_ids": [40274]}, {"id": 40274, "uid": "09a66d2b77b4ea8c659311f7ea5add01", "name": "PPISP: Physically-Plausible Compensation and Control of Photometric Variations in Radiance Field Reconstruction", "authors": [{"id": 182238, "fullname": "Isaac Deutsch", "url": "http://cvpr.thecvf.com/api/miniconf/users/182238?format=json", "institution": "NVIDIA"}, {"id": 155601, "fullname": "Nicolas Mo\u00ebnne-Loccoz", "url": "http://cvpr.thecvf.com/api/miniconf/users/155601?format=json", "institution": "NVIDIA"}, {"id": 185784, "fullname": "Gavriel State", "url": "http://cvpr.thecvf.com/api/miniconf/users/185784?format=json", "institution": "NVIDIA"}, {"id": 86828, "fullname": "\u017dan Goj\u010di\u010d", "url": "http://cvpr.thecvf.com/api/miniconf/users/86828?format=json", "institution": "NVIDIA"}], "abstract": "Multi-view 3D reconstruction methods remain highly sensitive to photometric inconsistencies arising from camera optical characteristics and variations in image signal processing (ISP). Existing mitigation strategies such as per-frame latent variables or affine color corrections lack physical grounding and generalize poorly to novel views. We propose the Physically-Plausible ISP (PPISP) correction module, which disentangles camera-intrinsic and capture-dependent effects through physically based and interpretable transformations. A dedicated PPISP controller, trained on the input views, predicts ISP parameters for novel viewpoints, analogous to auto exposure and auto white balance in real cameras. This design enables realistic and fair evaluation on novel views without access to ground-truth images. PPISP achieves SoTA performance on standard benchmarks, while providing intuitive control and supporting the integration of metadata when available.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40274", "url": null, "sourceid": -40005, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/36750?format=json"], "related_events_ids": [36750]}, {"id": 36751, "uid": "60bc551ca678c042256508c5a0f46689", "name": "LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds", "authors": [{"id": 182397, "fullname": "Jaehun Bang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182397?format=json", "institution": "UNIST"}, {"id": 185785, "fullname": "Jinhyeok Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/185785?format=json", "institution": "Pohang University of Science and Technology"}, {"id": 185786, "fullname": "Minji Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/185786?format=json", "institution": "Ulsan National Institute of Science and Technology"}, {"id": 185787, "fullname": "Seungheon Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/185787?format=json", "institution": "Ulsan National Institute of Science and Technology"}, {"id": 77309, "fullname": "Kyungdon Joo", "url": "http://cvpr.thecvf.com/api/miniconf/users/77309?format=json", "institution": "Ulsan National Institute of Science and Technology"}], "abstract": "Open-vocabulary 3D scene understanding enables users to segment novel objects in complex 3D environments through natural language. However, existing approaches remain impractically slow, memory-intensive, and overly complex due to iterative optimization and dense feature assignments for every Gaussian. To address these limitations, we propose LightSplat, a fast and memory-efficient training-free framework that injects compact 2-byte semantic indices into 3D representations from multi-view images. By assigning semantics only to salient regions and managing them with a lightweight index-feature mapping, LightSplat eliminates costly feature optimization and storage overhead. To further streamline inference and ensure semantic consistency, we cluster Gaussians in a single step by linking geometrically and semantically related masks in 3D. In evaluation, we assess our method on diverse benchmarks, including DL3DV-OVS with large and complex indoor-outdoor scenes. As a result, LightSplat achieves state-of-the-art performance with up to 50\u00d7 faster speed and 64\u00d7 lower memory, offering a scalable foundation for real-time language-driven 3D understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36751", "url": null, "sourceid": 34316, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36752, "uid": "1f7f61828a697b28059f813c0f512154", "name": "Beyond Heuristic Prompting: A Concept-Guided Bayesian Framework for Zero-Shot Image Recognition", "authors": [{"id": 152186, "fullname": "Hui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152186?format=json", "institution": "City University of Hong Kong"}, {"id": 185788, "fullname": "Kecheng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185788?format=json", "institution": "City University of Hong Kong"}, {"id": 185789, "fullname": "Jialiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185789?format=json", "institution": "Harbin Institute of Technology"}, {"id": 87539, "fullname": "Xianming Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87539?format=json", "institution": "Harbin Institute of Technology"}, {"id": 185790, "fullname": "Wenya Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185790?format=json", "institution": "Nanyang Technological University"}, {"id": 152190, "fullname": "Haoliang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152190?format=json", "institution": "City University of Hong Kong"}], "abstract": "Vision-Language Models (VLMs), such as CLIP, have significantly advanced zero-shot image recognition. However, their performance remains limited by suboptimal prompt engineering and poor adaptability to target classes. While recent methods attempt to improve prompts through diverse class descriptions, they often rely on heuristic designs, lack versatility, and are vulnerable to outlier prompts. This paper enhances prompt by incorporating class-specific concepts. By treating concepts as latent variables, we rethink zero-shot image classification from a Bayesian perspective, casting prediction as marginalization over the concept space, where each concept is weighted by a prior and a test-image conditioned likelihood. This formulation underscores the importance of both a well-structured concept proposal distribution and the refinement of concept priors. To construct an expressive and efficient proposal distribution, we introduce a multi-stage concept synthesis pipeline driven by LLMs to generate discriminative and compositional concepts, followed by a Determinantal Point Process to enforce diversity. To mitigate the influence of outlier concepts, we propose a training-free, adaptive soft-trim likelihood, which attenuates their impact in a single forward pass. We further provide robustness guarantees and derive multi-class excess risk bounds for our framework. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches, validating its effectiveness  in zero-shot image classification.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36752", "url": null, "sourceid": 46293, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36753, "uid": "eb7cd84a106d9862149ce49358dccf98", "name": "Restore Text First, Enhance Image Later: Two-Stage Scene Text Image Super-Resolution with Glyph Structure Guidance", "authors": [{"id": 179956, "fullname": "Minxing Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/179956?format=json", "institution": "Nankai University"}, {"id": 185791, "fullname": "Linlong Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185791?format=json", "institution": "Vivo"}, {"id": 185792, "fullname": "Qiushi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185792?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 185793, "fullname": "Ge Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185793?format=json", "institution": "Nankai University"}, {"id": 176350, "fullname": "Yiyan Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/176350?format=json", "institution": "Vivo"}, {"id": 185794, "fullname": "Yuhang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185794?format=json", "institution": "Shandong University"}, {"id": 126675, "fullname": "Jinwei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126675?format=json", "institution": "vivo Mobile Communication Co., Ltd."}, {"id": 91119, "fullname": "Yaxing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91119?format=json", "institution": "Nankai University"}, {"id": 88767, "fullname": "Qingnan Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88767?format=json", "institution": "Tencent AI Lab"}, {"id": 86573, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86573?format=json", "institution": "Nankai University"}], "abstract": "Current image super-resolution methods show strong performance on natural images but distort text, creating a fundamental trade-off between image quality and textual readability. To address this, we introduce **TIGER** (**T**ext\u2013**I**mage **G**uided sup**E**r-**R**esolution), a novel two-stage framework that breaks this trade-off through a *\"text-first, image-later\"* paradigm. TIGER explicitly decouples glyph restoration from image enhancement: it first reconstructs precise text structures and uses them to guide full-image super-resolution. This ensures high fidelity and readability. To support comprehensive training and evaluation, we present the UZ-ST (UltraZoom-ST) dataset, the first Chinese scene text dataset with extreme zoom. Extensive experiments show TIGER achieves state-of-the-art performance, enhancing readability and image quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36753", "url": null, "sourceid": 39233, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36754, "uid": "a730bf57002ababb7d6c15f3846c19e3", "name": "Perceptual Neural Video Compression with Color Separation and Rank Chain", "authors": [{"id": 180864, "fullname": "xiongzhuang liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180864?format=json", "institution": "University Of Science And Technology Of China"}, {"id": 156502, "fullname": "Chuanbo Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156502?format=json", "institution": "University of Science and Technology of China"}, {"id": 156503, "fullname": "Zhuoyuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/156503?format=json", "institution": "University of Science and Technology of China"}, {"id": 90450, "fullname": "Li Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/90450?format=json", "institution": "University of Science and Technology of China"}, {"id": 87804, "fullname": "Dong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87804?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Neural video compression (NVC) has achieved significant progress in recent years. The state-of-the-art (SOTA) NVC schemes, exemplified by the Deep Conditional Video Coding (DCVC) series, have focused on pursuing higher fidelity (e.g., PSNR), but lack sufficient exploitation of deep networks' advantages for better perceptual quality. We fill in this gap with two new techniques. First, we propose a color-separation-based framework, termed PNVC-C, which decouples luminance and chrominance processing to better align with human visual perception. This framework enables explicit and adaptive allocation of computation and bitrate budgets between luminance and chrominance components.Second, within this framework, we introduce the perceptual optimization scheme Rc-GAN, which leverages a bitrate-based rank chain loss to link variable-rate coding with perceptual quality ranking, enforcing consistent quality ordering and improving perceptual fidelity.Built upon these designs, we establish the PNVC-C framework with two variants: PNVC-C-Base, optimized for objective fidelity, and PNVC-CR, a perceptual variant that applies the Rc-GAN. Experimental results demonstrate that PNVC-C-Base achieves SOTA objective performance in YUV PSNR, while PNVC-CR attains SOTA perceptual quality on LPIPS, DISTS, KID, and FID metrics.Code and models will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36754", "url": null, "sourceid": 32852, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36757, "uid": "a2c7d084cee77af51e427133bf7a200c", "name": "LRHDR: Learning Representation-enhanced HDR Video Reconstruction", "authors": [{"id": 181189, "fullname": "Chenzhuo Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/181189?format=json", "institution": "Nanjing university"}, {"id": 185807, "fullname": "Xin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185807?format=json", "institution": null}, {"id": 185808, "fullname": "Bingchen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185808?format=json", "institution": "nanjing university"}, {"id": 104111, "fullname": "Yu Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/104111?format=json", "institution": "Nanjing University"}, {"id": 128951, "fullname": "Tao Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/128951?format=json", "institution": "Nanjing University"}, {"id": 128956, "fullname": "Xuemei Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128956?format=json", "institution": "Nanjing University"}], "abstract": "Reconstructing High Dynamic Range (HDR) video from alternately exposed Low Dynamic Range (LDR) frames is challenged by large motion, exposure-induced photometric inconsistency, and information loss in saturated or under-exposed regions. Prior HDR video pipelines typically follow an alignment\u2013reconstruction paradigm, which is limited by the precision of alignment and the performance of the fusion module. We propose a new reconstruction framework called Learning Representation-enhanced HDR Video Reconstruction (LRHDR), which built around two novel components: an Amalgamated Cross-exposure Consistent Representation (ACCR) network and an Adaptive Pixel-wise Sparse Weighted Fusion (APSWF).The ACCR includes an Exposure-aware Interleaved Context (EIC) encoder and a Representation Mapper (RM).The EIC couples a large-field path with a high-fidelity sub-pixel path and an exposure gate to produce exposure-aware features. The RM avoids geometric warping by mapping features from different exposures into a unified representation via per-pixel, per-channel linear modulation and decoded into calibrated linear HDR domain. The APSWF treats fusion as pixel-wise candidate selection, producing sparse weighted masks to form a normalized fusion in the linear HDR domain, thereby suppressing artifacts.Extensive experiments on standard benchmarks demonstrate that our LRHDR outperforms previous methods.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36757", "url": null, "sourceid": 37491, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36758, "uid": "dc5cd6ec5b2df75ead63c98d5963e732", "name": "MatchED: Crisp Edge Detection Using End-to-End, Matching-based Supervision", "authors": [{"id": 126969, "fullname": "bedrettin cetinkaya", "url": "http://cvpr.thecvf.com/api/miniconf/users/126969?format=json", "institution": "Middle East Technical University"}, {"id": 107138, "fullname": "Sinan Kalkan", "url": "http://cvpr.thecvf.com/api/miniconf/users/107138?format=json", "institution": "Middle East Technical University"}, {"id": 134258, "fullname": "Emre Akbas", "url": "http://cvpr.thecvf.com/api/miniconf/users/134258?format=json", "institution": "METU"}], "abstract": "Generating crisp, i.e. one-pixel-wide, edge maps remains one of the fundamental challenges in edge detection, affecting both traditional and learning-based methods. To obtain crisp edges, most existing approaches rely on two hand-crafted post-processing algorithms, Non-Maximum Suppression (NMS) and skeleton-based thinning, which are non-differentiable and hinder end-to-end optimization. Moreover, all existing crisp edge detection methods still depend on such post-processing to achieve satisfactory results.To address this limitation, we propose \\MethodLPP, a lightweight, only $\\sim$21K additional parameters, and plug-and-play matching-based supervision module that can be appended to any edge detection model for joint end-to-end learning of crisp edges. At each training iteration, \\MethodLPP performs one-to-one matching between predicted and ground-truth edges based on spatial distance and confidence, ensuring consistency between training and testing protocols. Extensive experiments on four popular datasets demonstrate that integrating \\MethodLPP substantially improves the performance of existing edge detection models. In particular, \\MethodLPP increases the Average Crispness (AC) metric by up to 2\u20134\u00d7 compared to baseline models. Under the crispness-emphasized evaluation (CEval), \\MethodLPP further boosts baseline performance by up to 20\u201335\\% in ODS and achieves similar gains in OIS and AP, achieving SOTA performance that matches or surpasses standard post-processing for the first time. Code and models will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36758", "url": null, "sourceid": 41076, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36761, "uid": "9f057b46713528b1c695484fc412e1cf", "name": "UniDAC: Universal Metric Depth Estimation for Any Camera", "authors": [{"id": 148360, "fullname": "Girish Ganesan Ganesan", "url": "http://cvpr.thecvf.com/api/miniconf/users/148360?format=json", "institution": "Michigan State University"}, {"id": 96697, "fullname": "Yuliang Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/96697?format=json", "institution": "Bosch US Research"}, {"id": 84539, "fullname": "Liu Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/84539?format=json", "institution": "Bosch Research"}, {"id": 73926, "fullname": "Xiaoming Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73926?format=json", "institution": "Michigan State University"}], "abstract": "Monocular metric depth estimation (MMDE) is a core challenge in computer vision, playing a pivotal role in real-world applications that demand accurate spatial understanding. Although prior works have shown promising zero-shot performance in MMDE, they often struggle with generalization across diverse camera types, such as fisheye and $360^\\circ$ cameras. Recent advances have addressed this through unified camera representations or canonical representation spaces, but they require either including large FoV camera data during training or separately trained models for different domains. We propose UniDAC, an MMDE framework that presents universal robustness in all domains and generalizes across diverse cameras using a single model. We achieve this by decoupling metric depth estimation into relative depth prediction and spatially varying scale estimation, enabling robust performance across different domains. We propose a lightweight Depth-Guided Scale Estimation module that upsamples a coarse scale map to high resolution using the relative depth map as guidance to account for local scale variations. Furthermore, we introduce RoPE-$\\phi$, a distortion-aware positional embedding that respects the spatial warping in Equi-Rectangular Projections (ERP) via latitude-aware weighting. UniDAC achieves state-of-the-art (SoTA) in cross-camera generalization by consistently outperforming prior methods across all datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36761", "url": null, "sourceid": 37906, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36762, "uid": "ca904c1dfe5c1e4415ce964959278c45", "name": "Diffusion Probe: Generated Image Result Prediction Using CNN Probes", "authors": [{"id": 181137, "fullname": "Bukun Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181137?format=json", "institution": "Zhejiang Gongshang University"}, {"id": 156578, "fullname": "Benlei Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/156578?format=json", "institution": "Alibaba Group"}, {"id": 185818, "fullname": "Zhizeng Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/185818?format=json", "institution": "Zhejiang Gongshang University"}, {"id": 185819, "fullname": "Xuemei Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/185819?format=json", "institution": null}, {"id": 101073, "fullname": "Tuo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/101073?format=json", "institution": "Southeast University"}, {"id": 90252, "fullname": "Hui Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/90252?format=json", "institution": "Zhejiang University, Tsinghua University"}, {"id": 91003, "fullname": "Dingkang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91003?format=json", "institution": "Fudan University"}, {"id": 157341, "fullname": "Longtao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157341?format=json", "institution": "Alibaba Group"}, {"id": 129579, "fullname": "Haiwen Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/129579?format=json", "institution": "Alibaba Group"}, {"id": 126844, "fullname": "Jingqun Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126844?format=json", "institution": "Bytedance"}], "abstract": "Text-to-image (T2I) diffusion models currently lack an efficient mechanism for early quality assessment, forcing costly random trial-and-error in scenarios requiring multiple generations (e.g., iterating on prompts, agent-based image generation, flow-grpo). To address this, we first reveal a strong correlation between the attention distribution in the early diffusion process and the final image quality. Building upon this insight, we introduce **Diffusion Probe**, a pioneering framework that leverages the model\u2019s internal cross-attention maps as a predictive signal. We propose a lightweight predictor, trained to establish a direct mapping from statistical properties of these nascent cross-attention distributions\u2014extracted from the initial denoising steps\u2014to the final image\u2019s comprehensive quality. This allows our probe to accurately forecast various aspects of image quality, regardless of the specific ground-truth quality metric, long before full synthesis is complete.We empirically validate the reliability and generalizability of Diffusion Probe through its consistently strong predictive accuracy across a wide spectrum of conditions. On diverse T2I models (e.g., SDXL, FLUX, Qwen-Image), throughout broad early-denoising windows, across various resolutions, and with different quality metrics, it achieves **high correlation (PCC > 0.7)** and **classification performance (AUC-ROC > 0.9)**. This intrinsic reliability is further demonstrated in practice by successfully optimizing T2I workflows that benefit from early, quality-guided decisions, such as **Prompt Optimization**, **Seed Selection**, and **Accelerated RL Training**. In these applications, the probe's early signal enables more targeted sampling strategies, preempting costly computations on low-potential paths. This yields a dual benefit: a significant reduction in computational overhead and a simultaneous improvement in final outcome quality, establishing Diffusion Probe as a model-agnostic and broadly applicable tool poised to revolutionize T2I efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36762", "url": null, "sourceid": 41907, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36763, "uid": "33acaba956e30e1494c5b84d48694e0e", "name": "Seeing without Pixels: Perception from Camera Trajectories", "authors": [{"id": 76373, "fullname": "Zihui Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/76373?format=json", "institution": "University of Texas, Austin"}, {"id": 69188, "fullname": "Kristen Grauman", "url": "http://cvpr.thecvf.com/api/miniconf/users/69188?format=json", "institution": "University of Texas at Austin"}, {"id": 73556, "fullname": "Dima Damen", "url": "http://cvpr.thecvf.com/api/miniconf/users/73556?format=json", "institution": "University of Bristol and Google DeepMind"}, {"id": 75512, "fullname": "Andrew Zisserman", "url": "http://cvpr.thecvf.com/api/miniconf/users/75512?format=json", "institution": "University of Oxford"}, {"id": 89189, "fullname": "Tengda Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/89189?format=json", "institution": "University of Oxford, University of Oxford"}], "abstract": "Can one perceive a video's content without seeing its pixels, just from the camera trajectory\u2014the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. Towards this end, we propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space, aligning them with natural language. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal to uncover video content. In other words, \"how you move\" can indeed reveal \"what you are doing\" (egocentric) or \"observing\" (exocentric). We demonstrate the versatility of our learned CamFormer embeddings on a diverse suite of downstream tasks, ranging from cross-modal alignment to classification and temporal analysis. Importantly, our representations are robust across diverse camera pose estimation methods, including both high-fidelity multi-sensored and standard RGB-only estimators. Our findings establish camera trajectory as a lightweight, robust, and versatile modality for perceiving video content.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36763", "url": null, "sourceid": 34896, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36765, "uid": "ae0873455e7c9c44de2b0eb8aff4e258", "name": "CaliTex: Geometry-Calibrated Attention for View-Coherent 3D Texture Generation", "authors": [{"id": 181147, "fullname": "Chenyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181147?format=json", "institution": "Peking University"}, {"id": 143931, "fullname": "Hongze CHEN", "url": "http://cvpr.thecvf.com/api/miniconf/users/143931?format=json", "institution": "HKUST"}, {"id": 185826, "fullname": "Jingzhi Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185826?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 76217, "fullname": "Lingting Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76217?format=json", "institution": "The University of Hong Kong"}, {"id": 185827, "fullname": "Runze Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185827?format=json", "institution": "Tencent"}, {"id": 89410, "fullname": "Weikai Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89410?format=json", "institution": "Tencent America"}, {"id": 153580, "fullname": "Zeyu HU", "url": "http://cvpr.thecvf.com/api/miniconf/users/153580?format=json", "institution": "Tencent Lightspeed Studios, Singapore"}, {"id": 76921, "fullname": "Yingda Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/76921?format=json", "institution": "Peking University"}, {"id": 185828, "fullname": "Keyang Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185828?format=json", "institution": "Tencent LightSpeed Studio"}, {"id": 153582, "fullname": "Xin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153582?format=json", "institution": "LightSpeed Studios"}], "abstract": "Despite major advances brought by diffusion-based models, current 3D texture generation systems remain hindered by cross-view inconsistency -- textures that appear convincing from one viewpoint often fail to align across others. We find that this issue arises from attention ambiguity, where unstructured full attention is applied indiscriminately across tokens and modalities, causing geometric confusion and unstable appearance-structure coupling.To address this, we introduce CaliTex, a framework of geometry-calibrated attention that explicitly aligns attention with 3D structure.It introduces two modules: Part-Aligned Attention that enforces spatial alignment across semantically matched parts, and Condition-Routed Attention which routes appearance information through geometry-conditioned pathways to maintain spatial fidelity.Coupled with a two-stage diffusion transformer, CaliTex makes geometric coherence an inherent behavior of the network rather than a byproduct of optimization.Empirically, CaliTex produces seamless and view-consistent textures and outperforms both open-source and commercial baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36765", "url": null, "sourceid": 31246, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36766, "uid": "1df134a4343e125344de1e920a01ff80", "name": "Enhancing Continual Learning of Vision-Language Models via Dynamic Prefix Weighting", "authors": [{"id": 182527, "fullname": "Hyeonseo Jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182527?format=json", "institution": "Yonsei University"}, {"id": 138853, "fullname": "Hyuk Kwon", "url": "http://cvpr.thecvf.com/api/miniconf/users/138853?format=json", "institution": "Yonsei University"}, {"id": 138997, "fullname": "Kibok Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/138997?format=json", "institution": "Yonsei University"}], "abstract": "We investigate the recently introduced domain-class incremental learning scenarios for vision-language models (VLMs), where both domain and class distributions change across tasks. Recent works address this challenge using parameter-efficient methods such as prefix-tuning or adapters, which facilitate model adaptation to downstream tasks by incorporating task-specific information into input tokens through additive vectors. However, previous approaches often normalize the weights of these vectors, disregarding the fact that different input tokens require different degrees of adjustment. To overcome this issue, we propose Dynamic Prefix Weighting (DPW), a framework that dynamically assigns weights to prefixes, complemented by adapters. DPW consists of 1) a gating module that adjusts the weights of each prefix based on the importance of the corresponding input token, and 2) a weighting mechanism that derives adapter output weights as a residual of prefix-tuning weights, ensuring adapters are utilized only when necessary. Experimental results demonstrate that our method achieves state-of-the-art performance in domain-class incremental learning scenarios for VLMs. The code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36766", "url": null, "sourceid": 35388, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36767, "uid": "456fc8595a04b9c7743188df7df2a22f", "name": "Let VLMs Grade Their Own Thoughts: A Self-Quantification Approach to Reasoning-Aware Reward Modeling", "authors": [{"id": 159621, "fullname": "Xing Xi", "url": "http://cvpr.thecvf.com/api/miniconf/users/159621?format=json", "institution": "South China University of Technology"}, {"id": 153043, "fullname": "Yu Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153043?format=json", "institution": "South China University of Technology"}, {"id": 147665, "fullname": "Ronghua Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/147665?format=json", "institution": "South China University of Technology"}, {"id": 185829, "fullname": "Peixian Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185829?format=json", "institution": "Alibaba Group"}, {"id": 185830, "fullname": "peilin tong", "url": "http://cvpr.thecvf.com/api/miniconf/users/185830?format=json", "institution": null}], "abstract": "Chain-of-Thought (CoT) is a key technique for enhancing the reasoning capabilities of Vision Language Models (VLMs). Existing methods often employ Reinforcement Learning (RL) with external constraints to align the model's reasoning process with human cognitive patterns. However, we argue that the model's intrinsic reasoning paths may differ from human cognition, and that forcing such alignment can constrain the model's potential and even degrade its performance. To address this, we propose leveraging the model's intrinsic self-evaluation to guide its optimization. We hypothesize that a model's self-generated confidence scores are effective indicators of its reasoning quality. Based on this evaluation metric, we design two novel reward functions: (1) Sequential Confidence Rigorous Evaluation (SCRE) for challenging problems that demand strict logical reasoning, and (2) Intra-group Score Re-ranking (IGSR) for general-purpose, open-ended scenarios. We name our method Video-RAISE (Reasoning Alignment through Intrinsic Self-Evaluation). Comprehensive experiments on six video understanding benchmarks demonstrate that Video-RAISE achieves state-of-the-art (SOTA) performance, significantly outperforming previous methods and even proprietary models, e,g. GPT-4o and Gemini 1.5 pro. For instance, on the VideoMMMU benchmark, our Video-RAISE achieves a new SOTA accuracy of 52.8\\%, outperforming the previous best model by a significant 3.0\\%. In addition, our method achieves a reasoning path consistency of 90\\%, which is double that of the Qwen2.5-VL-Instruct and even surpasses the performance of supervised fine-tuning. Code will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36767", "url": null, "sourceid": 31514, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36768, "uid": "7b6ad2297d3beb569ddf3ee1ce22ffa8", "name": "Parameter-Efficient Semantic Augmentation for  Enhancing Open-Vocabulary Object Detection", "authors": [{"id": 172217, "fullname": "Weihao Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/172217?format=json", "institution": "Beijing Jiaotong university"}, {"id": 156264, "fullname": "Runqi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156264?format=json", "institution": "Beijing Jiaotong University"}, {"id": 155550, "fullname": "Xiaoyue Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/155550?format=json", "institution": "WeChat AI, Tencent Inc"}, {"id": 155554, "fullname": "Jinchao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155554?format=json", "institution": "WeChat AI"}, {"id": 185831, "fullname": "Ang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185831?format=json", "institution": "Beijing Jiaotong University"}, {"id": 159471, "fullname": "Liping Jing", "url": "http://cvpr.thecvf.com/api/miniconf/users/159471?format=json", "institution": "Beijing Jiaotong University"}], "abstract": "Open-vocabulary object detection (OVOD) enables models to detect any object category, including unseen ones. Benefiting from large-scale pre-training, existing OVOD methods achieve strong detection performance on general scenarios (e.g., OV-COCO) but suffer severe performance drops when transferred to downstream tasks with substantial domain shifts.This degradation stems from the scarcity and weak semantics of category labels in domain-specific task, as well as the inability of existing models to capture auxiliary semantics beyond coarse-grained category label.To address these issues, we propose HSA-DINO, a parameter-efficient semantic augmentation framework for enhancing open-vocabulary object detection. Specifically, we propose a multi-scale prompt bank that leverages image feature pyramids to capture hierarchical semantics and select domain-specific local semantic prompts, progressively enriching textual representations from coarse to fine-grained levels.Furthermore, we introduce a semantic-aware router that dynamically selects the appropriate semantic augmentation strategy during inference, thereby preventing parameter updates from degrading the generalization ability of the pre-trained OVOD model.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36768", "url": null, "sourceid": 43567, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36773, "uid": "af536dee281164c88c729bd08be02043", "name": "DiffuView: Multi-View Diffusion Pretraining for 3D Aware Robotic Manipulation", "authors": [{"id": 184264, "fullname": "Kaizhao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184264?format=json", "institution": "Fudan university"}, {"id": 185843, "fullname": "Tian Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185843?format=json", "institution": "Fudan University"}, {"id": 185844, "fullname": "Tianyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185844?format=json", "institution": "Fudan University"}, {"id": 185845, "fullname": "Chenen Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185845?format=json", "institution": "Fudan University"}, {"id": 185846, "fullname": "Zijun Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185846?format=json", "institution": "Fudan University"}, {"id": 185847, "fullname": "Qingda Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185847?format=json", "institution": "Fudan University"}, {"id": 185117, "fullname": "Wenchao Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/185117?format=json", "institution": "Fudan University; Tars Robotics Ltd."}], "abstract": "Robotic manipulation from visual observations remains challenging due to the lack of 3D consistent representations that can generalize across diverse viewpoints and sensor configurations. Existing approaches often rely on masked autoencoders or neural scene representations, which fail to capture cross view correspondences. Crucially, while multi-view diffusion models have recently shown tremendous success in 3D aware generative synthesis, their powerful representations offer a promising direction for achieving viewpoint robust visuomotor control. In this paper, we introduce DiffuView, a novel framework that learns unified 3D aware representations through multi-view diffusion pretraining and deploys them for imitation learning. Specifically, DiffuView models the conditional generation of target views given source observations within a diffusion framework, enabling the network to implicitly recover scene geometry and enforce view consistency. The pretrained diffusion network is then utilized as a powerful visual backbone for an action policy, allowing robust control under varying viewpoints and visual conditions. We evaluate DiffuView on two challenging benchmarks, MetaWorld and Libero. Extensive experiments in both simulation and realworld scenarios demonstrate that DiffuView achieves superior generalization, improving success rates under viewpoint shifts by nearly 20\\% compared with existing methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36773", "url": null, "sourceid": 35661, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36774, "uid": "a71f9f47d27e10623154025319152a82", "name": "Learning Eigenstructures of Unstructured Data Manifolds", "authors": [{"id": 158674, "fullname": "Roy Velich", "url": "http://cvpr.thecvf.com/api/miniconf/users/158674?format=json", "institution": "Technion - Israel Institute of Technology, Technion - Israel Institute of Technology"}, {"id": 185848, "fullname": "Arkadi Piven", "url": "http://cvpr.thecvf.com/api/miniconf/users/185848?format=json", "institution": "Technion - Israel Institute of Technology, Technion"}, {"id": 158675, "fullname": "David Bensaid", "url": "http://cvpr.thecvf.com/api/miniconf/users/158675?format=json", "institution": "Computer Science Department, Technion-Israel Institute of Technology"}, {"id": 84985, "fullname": "Daniel Cremers", "url": "http://cvpr.thecvf.com/api/miniconf/users/84985?format=json", "institution": "Technical University Munich"}, {"id": 182484, "fullname": "Thomas Dag\u00e8s", "url": "http://cvpr.thecvf.com/api/miniconf/users/182484?format=json", "institution": "TUM"}, {"id": 149347, "fullname": "Ron Kimmel", "url": "http://cvpr.thecvf.com/api/miniconf/users/149347?format=json", "institution": "Technion"}], "abstract": "We introduce a novel framework that directly learns a spectral basis for shape and manifold analysis from unstructured data, eliminating the need for traditional operator selection, discretization, and eigensolvers.Grounded in optimal-approximation theory, we train a network to decompose an implicit approximation operator by minimizing the reconstruction error in the learned basis over a chosen distribution of probe functions. For suitable distributions, they can be seen as an approximation of the Laplacian operator and its eigendecomposition, which are fundamental in geometry processing.  Furthermore, our method recovers in a unified manner not only the spectral basis, but also the implicit metric's sampling density and the eigenvalues of the underlying operator. Notably, our unsupervised method makes no assumption on the data manifold, such as meshing or manifold dimensionality, allowing it to scale to arbitrary datasets of any dimension.On point clouds lying on surfaces in 3D and high-dimensional image manifolds, our approach yields meaningful spectral bases, that can resemble those of the Laplacian, without explicit construction of an operator. By replacing the traditional operator selection, construction, and eigendecomposition with a learning-based approach, our framework offers a principled, data-driven alternative to conventional pipelines. This opens new possibilities in geometry processing for unstructured data, particularly in high-dimensional spaces.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36774", "url": null, "sourceid": 31607, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40276?format=json"], "related_events_ids": [40276]}, {"id": 40276, "uid": "a71f9f47d27e10623154025319152a82", "name": "Learning Eigenstructures of Unstructured Data Manifolds", "authors": [{"id": 158674, "fullname": "Roy Velich", "url": "http://cvpr.thecvf.com/api/miniconf/users/158674?format=json", "institution": "Technion - Israel Institute of Technology, Technion - Israel Institute of Technology"}, {"id": 185848, "fullname": "Arkadi Piven", "url": "http://cvpr.thecvf.com/api/miniconf/users/185848?format=json", "institution": "Technion - Israel Institute of Technology, Technion"}, {"id": 158675, "fullname": "David Bensaid", "url": "http://cvpr.thecvf.com/api/miniconf/users/158675?format=json", "institution": "Computer Science Department, Technion-Israel Institute of Technology"}, {"id": 84985, "fullname": "Daniel Cremers", "url": "http://cvpr.thecvf.com/api/miniconf/users/84985?format=json", "institution": "Technical University Munich"}, {"id": 182484, "fullname": "Thomas Dag\u00e8s", "url": "http://cvpr.thecvf.com/api/miniconf/users/182484?format=json", "institution": "TUM"}, {"id": 149347, "fullname": "Ron Kimmel", "url": "http://cvpr.thecvf.com/api/miniconf/users/149347?format=json", "institution": "Technion"}], "abstract": "We introduce a novel framework that directly learns a spectral basis for shape and manifold analysis from unstructured data, eliminating the need for traditional operator selection, discretization, and eigensolvers.Grounded in optimal-approximation theory, we train a network to decompose an implicit approximation operator by minimizing the reconstruction error in the learned basis over a chosen distribution of probe functions. For suitable distributions, they can be seen as an approximation of the Laplacian operator and its eigendecomposition, which are fundamental in geometry processing.  Furthermore, our method recovers in a unified manner not only the spectral basis, but also the implicit metric's sampling density and the eigenvalues of the underlying operator. Notably, our unsupervised method makes no assumption on the data manifold, such as meshing or manifold dimensionality, allowing it to scale to arbitrary datasets of any dimension.On point clouds lying on surfaces in 3D and high-dimensional image manifolds, our approach yields meaningful spectral bases, that can resemble those of the Laplacian, without explicit construction of an operator. By replacing the traditional operator selection, construction, and eigendecomposition with a learning-based approach, our framework offers a principled, data-driven alternative to conventional pipelines. This opens new possibilities in geometry processing for unstructured data, particularly in high-dimensional spaces.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40276", "url": null, "sourceid": -31607, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/36774?format=json"], "related_events_ids": [36774]}, {"id": 36782, "uid": "06b059534e43b2938117a83912c62f3d", "name": "EMR-Diff: Edge-aware Multimodal Residual Diffusion Model for Hyperspectral Image Super-resolution", "authors": [{"id": 158835, "fullname": "Tao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158835?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 182457, "fullname": "Shengtao Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/182457?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 185861, "fullname": "Rong Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185861?format=json", "institution": null}, {"id": 129692, "fullname": "Zunjie Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129692?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 133089, "fullname": "Bolun Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/133089?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 185862, "fullname": "Yaoqi Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/185862?format=json", "institution": "Lishui University"}, {"id": 73966, "fullname": "Ying Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73966?format=json", "institution": "Beijing Institute of Technology"}, {"id": 89584, "fullname": "Chenggang Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/89584?format=json", "institution": "Hangzhou Dianzi University, Tsinghua University"}], "abstract": "Hardware constraints make it challenging to simultaneously acquire hyperspectral images (HSIs) with both high spatial and high spectral resolutions. A promising solution is to fuse low-resolution HSI (LR-HSI) with high-resolution multispectral images (HR-MSI) to generate high-resolution HSI (HR-HSI). Recently, diffusion models have introduced possibilities for HSI super-resolution, but suffer from low-efficiency sampling, detail-limited generation, and insufficient denoising. To address these issues, we propose an Edge-aware Multimodal Residual Diffusion Model (EMR-Diff). Specifically, multimodal residual mechanism is introduced to facilitate efficient information transfer among HR-MSI, LR-HSI, and HR-HSI, significantly improving the fusion efficiency. Edge-aware noise strategy is designed by exploiting the edge information of HR-MSI, which guides the model to prioritize high-frequency detail reconstruction by applying stronger noise perturbations to edge regions. In addition, we propose a Bilateral Attention Fusion UNet and design a multi-scale supervision mechanism to enable progressive reconstruction and collaborative optimization of spectral and spatial features. Extensive experiments demonstrate that our method achieves superior performance over existing approaches in both quantitative metrics and visual quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36782", "url": null, "sourceid": 40261, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36788, "uid": "2bd19fb4009ffc10edd430e6191f8421", "name": "Geometrically-Constrained Agent for Spatial Reasoning", "authors": [{"id": 87024, "fullname": "Zeren Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87024?format=json", "institution": "Beihang University"}, {"id": 185873, "fullname": "Xiaoya Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185873?format=json", "institution": "Shanghai Jiaotong University; Shanghai Artificial Intelligence Laboratory"}, {"id": 185874, "fullname": "Zhijie Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185874?format=json", "institution": "Shanghai Artificial Intelligence Laboratory; Beihang University"}, {"id": 185875, "fullname": "Pengrui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185875?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 144667, "fullname": "Lehan He", "url": "http://cvpr.thecvf.com/api/miniconf/users/144667?format=json", "institution": "Beihang University"}, {"id": 185876, "fullname": "Yijin Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185876?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 127187, "fullname": "Jing Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/127187?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 128740, "fullname": "Bohan Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128740?format=json", "institution": "Monash University"}, {"id": 86998, "fullname": "Lu Sheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86998?format=json", "institution": "Beihang University"}], "abstract": "Vision Language Models (VLMs) exhibit a fundamental semantic-to-geometric gap in spatial reasoning: they excel at qualitative semantic inference but their reasoning operates within a lossy semantic space, misaligned with high-fidelity geometry. Current paradigms fail to bridge this gap. Training-based methods suffer from an ``oracle paradox,'' learning flawed spatial logic from imperfect oracles. Tool-integrated methods constrain the final computation but critically leave the VLM's planning process unconstrained, resulting in geometrically flawed plans. In this work, we propose Geometrically-Constrained Agent (GCA), a training-free agentic paradigm that resolves this gap by introducing a formal task constraint. Specifically, we strategically decouples the VLM's role into two stages. First, acting as a semantic analyst, the VLM translates the user's ambiguous query into the formal, verifiable task constraint, which defines the reference frame and objective. Second, acting as a task solver, the VLM generates and executes tool calls strictly within the deterministic bounds defined by the constraint. This geometrically-constrained strategy successfully resolve the semantic-to-geometric gap, yielding a robust and verifiable reasoning pathway for complex spatial tasks. Comprehensive experiments demonstrate that GCA achieves SOTA performance on multiple spatial reasoning benchmarks, surpassing existing training-based and tool-integrated methods by ~27%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36788", "url": null, "sourceid": 33938, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36790, "uid": "5c9f93b2fdb3614e052123bba6301e64", "name": "From Feature Learning to Spectral Basis Learning: A Unifying and Flexible Framework for Efficient and Robust Shape Matching", "authors": [{"id": 159250, "fullname": "Feifan Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/159250?format=json", "institution": "Zhejiang University"}, {"id": 185880, "fullname": "Hongyang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185880?format=json", "institution": null}], "abstract": "Shape matching is a fundamental task in computer graphics and vision, with deep functional map methods emerging as a preferred solution. However, existing approaches primarily focus on learning informative feature representations by constraining both pointwise and functional maps, while overlooking the optimization of a crucial component: the spectral basis, which plays a key role in the (deep) functional maps pipeline. This oversight leads to suboptimal matching performance. Furthermore, these approaches mostly rely on conventional functional map techniques, such as time-consuming functional map solvers, which incur substantial computational overhead. To address those, we introduce Advanced Functional Maps, which generalizes standard functional maps from fixed basis functions to learnable basis functions, supported by rigorous theoretical guarantees. In this framework, the spectral basis is optimized by learning a set of inhibition functions. Building on this foundation, we propose the first unsupervised spectral basis learning method for efficient and robust non-rigid 3D shape matching, simultaneously optimizing feature extraction and basis functions in an end-to-end manner. A novel heat diffusion module and a new unsupervised loss function are introduced for basis learning, along with a simple yet efficient architecture that eliminates the need for computationally expensive functional map solvers and multiple auxiliary losses. Extensive experiments demonstrate that our method significantly outperforms current state-of-the-art feature-learning-based functional map approaches, especially in challenging non-isometric and topological noise matching scenarios, all while maintaining high computational efficiency. Finally, we demonstrate that optimizing basis functions is equivalent to spectral convolution, with inhibition functions acting as filters. This insight enables enhanced spectral basis representations by designing novel inhibition functions inspired by spectral graph/manifold convolutional networks, opening new avenues for future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36790", "url": null, "sourceid": 36759, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36795, "uid": "84449a62ec1b494ca3c37418cf304b12", "name": "Bridging Privacy and Provenance: Traceable Virtual Identity Generation", "authors": [{"id": 185887, "fullname": "Xianhan Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185887?format=json", "institution": "Fudan University"}, {"id": 185888, "fullname": "Xiaoxiao Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185888?format=json", "institution": "Fudan University"}, {"id": 128849, "fullname": "Sheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/128849?format=json", "institution": "Fudan University"}, {"id": 128859, "fullname": "Zhenxing Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/128859?format=json", "institution": "Fudan University"}, {"id": 128852, "fullname": "Xinpeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128852?format=json", "institution": "Fudan University"}], "abstract": "Recent advances in generative models have enabled the creation of high-fidelity human faces, yet constructing reliable virtual identities that preserve user privacy while supporting consistent and verifiable identity assignment remains challenging. In this paper, we propose a diffusion-based framework for generating traceable virtual identities with stable identity semantics, pose and expression preservation. Our framework couples a virtual identity sampler that generates diverse but consistent identity embeddings with a 3D geometry and expression conditioning module that preserves the pose and non-identity characteristics of the input face. In addition, we incorporate a lightweight latent watermarking mechanism that embeds an imperceptible identity signature during generation, enabling a user to verify ownership of the resulting virtual identity through a secure token without revealing their real facial appearance. Quantitative evaluations demonstrate that our method achieves high identity consistency across repeated sampling, strong pose and expression fidelity, and improved anonymity compared with prior work.  These results validate the effectiveness of integrating virtual identity sampling, geometric conditioning, and latent watermarking into a single generative framework, and highlight the practical potential of our solution for constructing privacy-aware virtual identities.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36795", "url": null, "sourceid": 37388, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36799, "uid": "1474f000f1dbb955cbd136f915b8c318", "name": "Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision", "authors": [{"id": 159429, "fullname": "Aadarsh Sahoo", "url": "http://cvpr.thecvf.com/api/miniconf/users/159429?format=json", "institution": "California Institute of Technology"}, {"id": 86982, "fullname": "Georgia Gkioxari", "url": "http://cvpr.thecvf.com/api/miniconf/users/86982?format=json", "institution": "California Institute of Technology"}], "abstract": "Conversational image segmentation grounds abstract, intent-driven concepts into pixel-accurate masks. Prior work on referring image grounding focuses on categorical and spatial queries (\\eg, \u201cleft-most apple\u201d) and overlooks functional and physical reasoning (\\eg, \u201cwhere can I safely store the knife?\u201d). We address this gap and introduce Conversational Image Segmentation (CIS) and ConvSeg, a benchmark spanning entities, spatial relations, intent, affordances, functions, safety, and physical reasoning. We also present ConvSeg-Net, which fuses strong segmentation priors with language understanding, and an AI-powered data engine that generates prompt\u2013mask pairs without human supervision. We show that current language-guided segmentation models are inadequate for CIS, while ConvSeg-Net trained on our data engine achieves significant gains on ConvSeg and maintains strong performance on existing language-guided segmentation benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36799", "url": null, "sourceid": 34679, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36804, "uid": "c8c9131c2d65e2fa10ac1685cd399f32", "name": "Too Vivid to Be Real? Benchmarking and Calibrating Generative Color Fidelity", "authors": [{"id": 157986, "fullname": "Zhengyao Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157986?format=json", "institution": "Harbin Institute of Technology"}, {"id": 180062, "fullname": "Zexi Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/180062?format=json", "institution": "Tencent"}, {"id": 185905, "fullname": "Yijia Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/185905?format=json", "institution": "Fudan University"}, {"id": 181779, "fullname": "Pengcheng Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/181779?format=json", "institution": "\u817e\u8baf\u79d1\u6280\uff08\u5317\u4eac\uff09\u6709\u9650\u516c\u53f8"}, {"id": 155554, "fullname": "Jinchao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155554?format=json", "institution": "WeChat AI"}, {"id": 130758, "fullname": "Guangming Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130758?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 88598, "fullname": "Jun Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88598?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 130781, "fullname": "Wenjie Pei", "url": "http://cvpr.thecvf.com/api/miniconf/users/130781?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Recent advances in text-to-image (T2I) generation have greatly improved visual quality, yet producing images that appear visually authentic to real-world photography remains challenging. This is partly due to biases in existing evaluation paradigms: human ratings and preference-trained metrics often favor visually vivid images with exaggerated saturation and contrast, which make generations often $\\textit{too vivid to be real}$ even when prompted for realistic-style images.To address this issue, we present $\\textbf{Color Fidelity Dataset (CFD)}$ and $\\textbf{Color Fidelity Metric (CFM)}$ for objective evaluation of color fidelity in realistic-style generations. CFD contains over 1.3M real and synthetic images with ordered levels of color realism, while CFM employs a multimodal encoder to learn perceptual color fidelity. In addition, we propose a training-free $\\textbf{Color Fidelity Refinement (CFR)}$ that adaptively modulates spatial\u2013temporal guidance scale in generation, thereby enhancing color authenticity.Together, CFD supports CFM for assessment, whose learned attention further guides CFR to refine T2I fidelity, forming a progressive framework for assessing and improving color fidelity in realistic-style T2I generation. All datasets and code will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36804", "url": null, "sourceid": 35870, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36806, "uid": "5c6ccef01fc04bcec1297ec5a752c9ad", "name": "FedRG: Unleashing the Representation Geometry for Federated Learning with Noisy Clients", "authors": [{"id": 180350, "fullname": "Tian Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/180350?format=json", "institution": "the Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 152621, "fullname": "Zhiqin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152621?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 185918, "fullname": "Yonggang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185918?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 185919, "fullname": "Xuefeng Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185919?format=json", "institution": "Chinese Academy of Sciences"}, {"id": 185920, "fullname": "Hao Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185920?format=json", "institution": "Beihang University"}, {"id": 185921, "fullname": "Yuwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185921?format=json", "institution": "Institute of Computing Technology Chinese Academy of Sciences"}, {"id": 84821, "fullname": "Bo Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/84821?format=json", "institution": "HKBU"}], "abstract": "Federated learning (FL) suffers from performance degradation due to the inevitable presence of noisy annotations in distributed scenarios. Existing approaches have advanced in distinguishing noisy samples from the dataset for label correction by leveraging loss values. However, noisy samples recognition relying on scalar loss lacks reliability for FL under heterogeneous scenarios. In this paper, we rethink this paradigm from a representation perspective and propose FedRG(**Fed**erated under **R**epresentation **G**emometry), which follows **''the principle of ``representation geometry priority''** to recognize noisy labels. Firstly, FedRG creates label-agnostic spherical representations by using self-supervision. It then iteratively fits a spherical von Mises-Fisher (vMF) mixture model to this geometry using previously identified clean samples to capture semantic clusters. This geometric evidence is integrated with a semantic-label soft mapping mechanism to derive a distribution divergence between the label-free and annotated label-conditioned feature space, which robustly identifies noisy samples and updates the vMF mixture model with the newly separated clean dataset. Lastly, we employ an additional personalized noise absorption matrix on noisy labels to achieve robust optimization. Extensive experimental results demonstrate that FedRG significantly outperforms state-of-the-art methods for FL  with data heterogeneity under diverse noisy client scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36806", "url": null, "sourceid": 42633, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36807, "uid": "a4c4a62d6155855a5d075003ff696238", "name": "PHAC: Promptable Human Amodal Completion", "authors": [{"id": 183726, "fullname": "Seung Young Noh", "url": "http://cvpr.thecvf.com/api/miniconf/users/183726?format=json", "institution": "Kwangwoon University"}, {"id": 140970, "fullname": "Ju Yong Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/140970?format=json", "institution": "Kwangwoon University"}], "abstract": "Conditional image generation methods are increasingly used in human-centric applications, yet existing human amodal completion (HAC) models offer users limited control over the completed content. Given an occluded person image, they hallucinate invisible regions while preserving visible ones, but cannot reliably incorporate user-specified constraints such as a desired pose or spatial extent. As a result, users often resort to repeatedly sampling the model to obtain a satisfactory output. Pose-guided person image synthesis (PGPIS) methods allow explicit pose conditioning, but frequently fail to preserve the instance-specific visible appearance and tend to be biased toward the training distribution, even when built on strong diffusion model priors. To address these limitations, we introduce promptable human amodal completion (PHAC), a new task that completes occluded human images while satisfying both visible appearance constraints and multiple user prompts. Users provide simple point-based prompts, such as additional joints for the target pose or bounding boxes for desired regions; these prompts are encoded using ControlNet modules specialized for each prompt type. These modules inject the prompt signals into a pre-trained diffusion model, and we fine-tune only the cross-attention blocks to obtain strong prompt alignment without degrading the underlying generative prior. To further preserve visible content, we propose an inpainting-based refinement module that starts from a slightly noised coarse completion, faithfully preserves the visible regions, and ensures seamless blending at occlusion boundaries. Extensive experiments on standard HAC and PGPIS benchmarks show that our approach produces more physically plausible, higher-quality completions with significantly improved prompt alignment compared to existing amodal completion and pose-guided synthesis methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36807", "url": null, "sourceid": 37475, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36816, "uid": "f8c00b149cb2143e3a8dd8c86d8c258b", "name": "A Mixed Diet Makes DINO an Omnivorous Vision Encoder", "authors": [{"id": 185941, "fullname": "Rishabh Kabra", "url": "http://cvpr.thecvf.com/api/miniconf/users/185941?format=json", "institution": "DeepMind; University College London, University of London"}, {"id": 85376, "fullname": "Maks Ovsjanikov", "url": "http://cvpr.thecvf.com/api/miniconf/users/85376?format=json", "institution": "Ecole Polytechnique, France"}, {"id": 128540, "fullname": "Drew Hudson", "url": "http://cvpr.thecvf.com/api/miniconf/users/128540?format=json", "institution": "Google DeepMind"}, {"id": 158236, "fullname": "Ye Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/158236?format=json", "institution": "Google"}, {"id": 128673, "fullname": "Skanda Koppula", "url": "http://cvpr.thecvf.com/api/miniconf/users/128673?format=json", "institution": "Google Deepmind"}, {"id": 87736, "fullname": "Andr\u00e9 Araujo", "url": "http://cvpr.thecvf.com/api/miniconf/users/87736?format=json", "institution": "Google Research"}, {"id": 75596, "fullname": "Joao Carreira", "url": "http://cvpr.thecvf.com/api/miniconf/users/75596?format=json", "institution": "DeepMind"}, {"id": 85661, "fullname": "Niloy J. Mitra", "url": "http://cvpr.thecvf.com/api/miniconf/users/85661?format=json", "institution": "University College London"}], "abstract": "Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature representations are poorly aligned across different modalities. For instance, the feature embedding for an RGB image and its corresponding depth map of the same scene exhibit a cosine similarity that is nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a novel framework that learns a modality-agnostic feature space. We train the encoder with a dual objective: first, to maximize the feature alignment between different modalities of the same scene; and second, a distillation objective that anchors the learned representations to the output of a fully frozen teacher such as DINOv2. The resulting student encoder becomes ``omnivorous'' by producing a consistent, powerful embedding for a given scene, regardless of the input modality (RGB, Depth, Segmentation, etc.). This approach enables robust cross-modal understanding while retaining the discriminative semantics of the original foundation model.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36816", "url": null, "sourceid": 34238, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36814, "uid": "017c86c4ff3d405e6521becab0fb3732", "name": "Retrieve-to-Restore: Efficient All-in-One Image Restoration with a Retrieval-Based Degradation Bank", "authors": [{"id": 104982, "fullname": "Chenxu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/104982?format=json", "institution": "nanjing university"}, {"id": 158831, "fullname": "Kai Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158831?format=json", "institution": "Nanjing University"}, {"id": 152125, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152125?format=json", "institution": "nanjing university"}], "abstract": "All-in-one image restoration aims to recover clean images from heterogeneous degradations with a single model, but joint training on multiple degradations with a shared backbone often induces cross-task interference and unstable optimization, making it hard to maintain strong performance across all tasks. To address this, we propose Retrieve-to-Restore (R2R), a lightweight framework that decouples degradation adaptation from backbone computation through a retrieval-based degradation bank. Specifically, R2R externalizes degradation knowledge as unified, degradation-specific priors stored in a compact Degradation Bank. A Degradation Amalgamator aggregates GT-guided intra-class features into task-level clean priors during training, while Degradation Matching retrieves the most relevant prior at inference to modulate backbone features for restoration. This retrieval-guided design explicitly separates degradation cues from shared reconstruction capacity, enabling stable multi-degradation training and straightforward scaling to additional degradation types. Extensive comparisons on benchmarks with one, three, and five degradations show that R2R achieves PSNR on par with state-of-the-art all-in-one methods while using about 91\\% fewer MACs. Our code and models will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36814", "url": null, "sourceid": 40522, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36819, "uid": "cd7353ce107dc4dd111bb68c0f211638", "name": "TopoCL: Topological Contrastive Learning for Medical Imaging", "authors": [{"id": 101768, "fullname": "Guangyu Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/101768?format=json", "institution": "University of Notre Dame"}, {"id": 181605, "fullname": "Pengfei Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181605?format=json", "institution": "University of Texas Rio Grande Valley"}, {"id": 185949, "fullname": "Peixian Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185949?format=json", "institution": "University of Pennsylvania, University of Pennsylvania"}, {"id": 185950, "fullname": "John P. Lalor", "url": "http://cvpr.thecvf.com/api/miniconf/users/185950?format=json", "institution": "University of Notre Dame"}, {"id": 174927, "fullname": "Erin Chambers", "url": "http://cvpr.thecvf.com/api/miniconf/users/174927?format=json", "institution": "University of Notre Dame"}, {"id": 157914, "fullname": "Danny Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/157914?format=json", "institution": "University of Notre Dame, USA"}], "abstract": "Contrastive learning (CL) has become a powerful approach for learning representations from unlabeled images. However, existing CL methods focus predominantly on visual appearance features while neglecting topological characteristics (e.g., connectivity patterns, boundary configurations, cavity formations) that provide valuable cues for medical image analysis. To address this limitation, we propose a new topological CL framework (TopoCL) that explicitly exploits topological structures during contrastive learning for medical imaging. Specifically, we first introduce topology-aware augmentations that control topological perturbations using a relative bottleneck distance between persistence diagrams, preserving medically relevant topological properties while enabling controlled structural variations. We then design a Hierarchical Topology Encoder that captures topological features through self-attention and cross-attention mechanisms. Finally, we develop an adaptive mixture-of-experts (MoE) module to dynamically integrate visual and topological representations. TopoCL can be seamlessly integrated with existing CL methods. We evaluate TopoCL on five representative CL methods (SimCLR, MoCo-v3, BYOL, DINO, and Barlow Twins) and five diverse medical image classification datasets. The experimental results show that TopoCL achieves consistent improvements: an average gain of +3.26% in linear probe classification accuracy with strong statistical significance, verifying its effectiveness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36819", "url": null, "sourceid": 41393, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36820, "uid": "22551a288cde5b41c4816769be4037ea", "name": "Mitigating Objectness Bias and Region-to-Text Misalignment for Open-Vocabulary Panoptic Segmentation", "authors": [{"id": 176637, "fullname": "Nikolay Kormushev", "url": "http://cvpr.thecvf.com/api/miniconf/users/176637?format=json", "institution": "University of Ljubljana, ETH"}, {"id": 185951, "fullname": "Josip \u0160ari\u0107", "url": "http://cvpr.thecvf.com/api/miniconf/users/185951?format=json", "institution": "UniZg-FER, University of Zagreb"}, {"id": 129786, "fullname": "Matej Kristan", "url": "http://cvpr.thecvf.com/api/miniconf/users/129786?format=json", "institution": "University of Ljubljana"}], "abstract": "Open-vocabulary panoptic segmentation remains hindered by two coupled issues: (i) mask selection bias, where objectness heads trained on closed vocabularies suppress masks of categories not observed in training, and (ii) limited regional understanding in vision\u2013language models such as CLIP, which were optimized for global image classification rather than localized segmentation. We introduce OVRCOAT, a simple, modular framework that tackles both. First, a CLIP-conditioned objectness adjustment (COAT) updates background/foreground probabilities, preserving high-quality masks for out-of-vocabulary objects. Second, an open-vocabulary mask-to-text refinement (OVR) strengthens CLIP\u2019s region-level alignment to improve classification of both seen and unseen classes with markedly lower memory cost than prior fine-tuning schemes. The two components combine to jointly improve objectness estimation and mask recognition, yielding consistent panoptic gains. Despite its simplicity, OVRCOAT sets a new state of the art on ADE20K (+5.5\\% PQ) and delivers clear gains on Mapillary Vistas and Cityscapes (+7.1\\% and +3\\% PQ, respectively). The code will be available here.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36820", "url": "https://nickormushev.github.io/OVRCOAT/", "sourceid": 39270, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36823, "uid": "c18171305e73faabfa61d61c41a420ae", "name": "Memory-Efficient Fine-Tuning Diffusion Transformer via Dynamic Patch Sampling and Block Skipping", "authors": [{"id": 135749, "fullname": "Sunghyun Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/135749?format=json", "institution": "Qualcomm Inc, QualComm"}, {"id": 128119, "fullname": "Jeongho Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/128119?format=json", "institution": "KAIST"}, {"id": 185962, "fullname": "Hyoungwoo Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/185962?format=json", "institution": "Qualcomm Inc, QualComm"}, {"id": 88442, "fullname": "Debasmit Das", "url": "http://cvpr.thecvf.com/api/miniconf/users/88442?format=json", "institution": "Qualcomm Inc."}, {"id": 90623, "fullname": "Sungrack Yun", "url": "http://cvpr.thecvf.com/api/miniconf/users/90623?format=json", "institution": "Qualcomm AI Research"}, {"id": 85738, "fullname": "Munawar Hayat", "url": "http://cvpr.thecvf.com/api/miniconf/users/85738?format=json", "institution": "Monash University"}, {"id": 87936, "fullname": "Jaegul Choo", "url": "http://cvpr.thecvf.com/api/miniconf/users/87936?format=json", "institution": "Korea Advanced Institute of Science and Technology"}, {"id": 85634, "fullname": "Fatih Porikli", "url": "http://cvpr.thecvf.com/api/miniconf/users/85634?format=json", "institution": "QualComm"}, {"id": 90619, "fullname": "Seokeon Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/90619?format=json", "institution": "Qualcomm AI Research"}], "abstract": "Diffusion Transformers (DiTs) have significantly enhanced text-to-image (T2I) generation quality, enabling high-quality personalized content creation.   However, fine-tuning these models requires substantial computational complexity and memory, limiting practical deployment under resource constraints.  To tackle these challenges, we propose a memory-efficient fine-tuning framework called DiT-BlockSkip, integrating timestep-aware dynamic patch sampling and block skipping by precomputing residual features.  Our dynamic patch sampling strategy adjusts patch sizes based on the diffusion timestep, then resizes the cropped patches to a fixed lower resolution.   This approach reduces forward \\& backward memory usage while allowing the model to capture global structures at higher timesteps and fine-grained details at lower timesteps.  The block skipping mechanism selectively fine-tunes essential transformer blocks and precomputes residual features for the skipped blocks, significantly reducing training memory.  To identify vital blocks for personalization, we introduce a block selection strategy based on cross-attention masking.  Evaluations demonstrate that our approach achieves competitive personalization performance qualitatively and quantitatively, while reducing memory usage substantially, moving toward on-device feasibility (e.g., smartphones, IoT devices) for large-scale diffusion transformers.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36823", "url": null, "sourceid": 40183, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36825, "uid": "6c27f03d7e05fafe06f225bcbeb42d3a", "name": "HUMAPS-4D : A Multimodal Dataset for HUman Motion Analysis with Physiological and Semantic informations", "authors": [{"id": 183341, "fullname": "Matthieu Dabrowski", "url": "http://cvpr.thecvf.com/api/miniconf/users/183341?format=json", "institution": "IMT Nord Europe"}, {"id": 185969, "fullname": "Ouala JEMAA", "url": "http://cvpr.thecvf.com/api/miniconf/users/185969?format=json", "institution": "IMT NORD EUROPE RESEARCH CENTRES"}, {"id": 183272, "fullname": "Benjamin Allaert", "url": "http://cvpr.thecvf.com/api/miniconf/users/183272?format=json", "institution": "IMT Nord Europe"}], "abstract": "Current advancements in human motion understanding are strongly reliant on video data. Nevertheless, privacy regulations and operational constraints increasingly restrict the use of visual data in real-world scenarios. Inferring posture through wearable sensors, such as instrumented insoles measuring plantar activation, presents itself as a promising alternative. However, the absence of large-scale multimodal datasets hinders the rigorous benchmarking of these methodologies. We introduce HUMAPS-4D, a novel multimodal dataset designed for human motion analysis, effectively bridging computer vision and biomechanics. This dataset integrates synchronized motion capture, multi-view video, IMUs, plantar pressure signals, sEMG activation patterns, and high-level semantic annotations. The data was collected from 32 subjects performing 30 actions over a total duration of 14 hours. Participants demonstrate substantial anthropometric variability (age, body proportions, and morphology), which supports robust generalization across diverse body types. Distinct from existing resources, this collection offers a unique pairing of low-level physiological signals and high-level human motor descriptors. This capability enables the development of generative and inference models conditioned by both physical and semantic constraints, while simultaneously reducing the reliance on personally identifiable visual data. We establish benchmark tasks specifically targeting posture reconstruction from plantar pressure, semantic motion segmentation, physics-informed motricity analysis, and multimodal fusion under privacy-preserving conditions. The dataset, along with its associated annotation tools and visualization utilities, is scheduled for online release soon.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36825", "url": null, "sourceid": 35365, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36826, "uid": "6fe0fa22eca4dea0f85b883b75c76a34", "name": "DeX-Portrait: Disentangled and Expressive Portrait Animation via Explicit and Latent Motion Representations", "authors": [{"id": 174996, "fullname": "Yuxiang Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/174996?format=json", "institution": "University of Science and Technology of China"}, {"id": 133565, "fullname": "Zhe Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/133565?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 178374, "fullname": "Yanwen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/178374?format=json", "institution": "Nanjing University"}, {"id": 153839, "fullname": "Hao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153839?format=json", "institution": "Nanjing University"}, {"id": 85035, "fullname": "Xun Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/85035?format=json", "institution": "Nanjing University"}, {"id": 85084, "fullname": "Ligang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85084?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Portrait animation from a single source image and a driving video is a long-standing problem.Recent approaches tend to adopt diffusion-based image/video generation models for realistic and expressive animation.However, none of these diffusion models realizes high-fidelity disentangled control between the head pose and facial expression, hindering applications like expression-only or pose-only editing and animation.To address this, we propose DeX-Portrait, a novel approach capable of generating expressive portrait animation driven by disentangled pose and expression signals.Specifically, we represent the pose as an explicit global transformation and the expression as an implicit latent code.First, we design a powerful motion trainer to learn both pose and expression encoders for extracting precise and decomposed driving signals.Then we propose to inject the pose transformation into the diffusion model through a dual-branch conditioning mechanism, and the expression latent through cross attention.Finally, we design a progressive hybrid classifier-free guidance for more faithful identity consistency.Experiments show that our method outperforms state-of-the-art baselines on both animation quality and disentangled controllability.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36826", "url": null, "sourceid": 39542, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36830, "uid": "b91f491a5ad27382b54abe58f8dd31a3", "name": "Cross-Instance Gaussian Splatting Registration via Geometry-Aware Feature-Guided Alignment", "authors": [{"id": 181875, "fullname": "Roy Amoyal", "url": "http://cvpr.thecvf.com/api/miniconf/users/181875?format=json", "institution": "Ben Gurion University of the Negev"}, {"id": 185976, "fullname": "Oren Freifeld", "url": "http://cvpr.thecvf.com/api/miniconf/users/185976?format=json", "institution": "Ben-Gurion University"}, {"id": 154453, "fullname": "Chaim Baskin", "url": "http://cvpr.thecvf.com/api/miniconf/users/154453?format=json", "institution": "Ben Gurion University of the Negev"}], "abstract": "We present Gaussian Splatting Alignment (GSA), a novel method for aligning two  independent 3D Gaussian Splatting (3DGS) models via a similarity transformation (rotation; translation; scale), even when they are of different objects in the same category (e.g, different cars). In contrast, existing methods can only align 3DGS models of the same object (e.g, the same car) and often must be given true scale as input, while we estimate it successfully. Our approach leverages viewpoint-guided spherical map features to obtain robust correspondences and introduces a two-step optimization framework that aligns models while keeping the 3DGS models fixed. First, we perform an iterative, feature-guided coarse registration that is robust to extremely poor initialization (e.g, 180\u00b0 misalignment or a 10\u00d7 scale gap), followed by a fine registration step enforcing multi-view feature consistency, inspired by inverse radiance-field formulations.The first step already achieves state-of-the-art performance, and the second further improves results. In the same-object case, GSA outperforms prior works, often by a large margin, even when the other methods are given the true scale. In the harder case of different objects in the same category, GSA vastly surpasses them, providing the first effective solution for category-level 3DGS registration and unlocking new applications. Code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36830", "url": null, "sourceid": 32512, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36832, "uid": "c0732202ac6bc89333d0aba5ab2399a1", "name": "WhisperNet: A Scalable Solution for Bandwidth-Efficient Collaboration", "authors": [{"id": 146701, "fullname": "Gong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/146701?format=json", "institution": "tianjin university"}, {"id": 148789, "fullname": "Chaokun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/148789?format=json", "institution": "College of Intelligence and Computing, Tianjin University"}, {"id": 177142, "fullname": "Xinyan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/177142?format=json", "institution": "Tianjin Univerity"}], "abstract": "Collaborative perception is vital for autonomous driving yet remains constrained by tight communication budgets. Earlier work reduced bandwidth by compressing full feature maps with fixed-rate encoders, which adapts poorly to a changing environment, and it further evolved into spatial selection methods that improve efficiency by focusing on salient regions, but this object-centric approach often sacrifices global context, weakening holistic scene understanding. To overcome these limitations, we introduce WhisperNet, a bandwidth-aware framework that proposes a novel, receiver-centric paradigm for global coordination across agents. Senders generate lightweight saliency metadata, while the receiver formulates a global request plan that dynamically budgets feature contributions across agents and features, retrieving only the most informative features. A collaborative feature routing module then aligns related messages before fusion to ensure structural consistency.Extensive experiments show that WhisperNet achieves state-of-the-art performance, improving AP@0.7 on OPV2V by 2.4\\% with only 0.5\\% of the communication cost. As a plug-and-play component, it boosts strong baselines with merely 5\\% of full bandwidth while maintaining robustness under localization noise. These results demonstrate that globally-coordinated allocation across what and where to share is the key to achieving efficient collaborative perception.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36832", "url": null, "sourceid": 37567, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36834, "uid": "cce119502b39f2cd9f69219bb6dbd241", "name": "Towards Intrinsic-Aware Monocular 3D Object Detection", "authors": [{"id": 181516, "fullname": "Zhihao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181516?format=json", "institution": "Michigan State University"}, {"id": 89429, "fullname": "Abhinav Kumar", "url": "http://cvpr.thecvf.com/api/miniconf/users/89429?format=json", "institution": "Michigan State University"}, {"id": 73926, "fullname": "Xiaoming Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73926?format=json", "institution": "Michigan State University"}], "abstract": "Monocular 3D object detection (Mono3D) aims to infer object locations and dimensions in 3D space from a single RGB image.Despite recent progress, existing methods remain highly sensitive to camera intrinsics and struggle to generalize across diverse settings, since intrinsic changes reshape how 3D scenes are projected onto the image plane.We propose MonoIA, a unified intrinsic-aware framework that models and adapts to intrinsic variation through a language-grounded representation.The key insight is that intrinsic variation is not a numeric difference but a perceptual transformation that alters apparent scale, perspective, and spatial geometry.To capture this effect, MonoIA employs large language models and vision\u2013language models to generate intrinsic embeddings that encode the visual and geometric implications of camera parameters.These embeddings are hierarchically integrated into the detection network via an Intrinsic Adaptation Module, allowing the model to modulate its feature representations according to camera-specific configurations and maintain consistent 3D detection across intrinsics.This shifts intrinsic modeling from numeric conditioning to semantic representation, enabling robust and unified perception across cameras.Extensive experiments show that MonoIA achieves new state-of-the-art (SoTA) results on standard benchmarks including KITTI, Waymo, and nuScenes (e.g., +1.0% on the KITTI leaderboard), and further improves performance under multi-dataset training (e.g., +4.46% on KITTI Val).", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36834", "url": null, "sourceid": 45254, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36835, "uid": "146777c25fa9755117e2a841be9f1c92", "name": "E$^2$-SCI: Elastic Edge\u2013Cloud Speculative Decoding via Credit Inertia", "authors": [{"id": 148880, "fullname": "Senyao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/148880?format=json", "institution": "School of Computer Science and Technology, Huazhong University of Science and Technology"}, {"id": 85300, "fullname": "Haozhao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85300?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 185981, "fullname": "Zhaobai Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185981?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 185982, "fullname": "Zhanbo Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185982?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 185983, "fullname": "Hao Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185983?format=json", "institution": null}, {"id": 86547, "fullname": "Ruixuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86547?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "In edge\u2013cloud environments, the efficiency of speculative decoding is heavily constrained by uplink transmission and cloud-side verification. In this work, we identify a phenomenon we term credit inertia, where the acceptance rates of adjacent token windows exhibit strong temporal consistency. Tokens following recently well-performing windows are likely to pass verification, whereas tokens following poorly performing windows are likely to fail. Motivated by this observation, we propose E$^2$-SCI, an elastic edge\u2013cloud speculative decoding framework that dynamically adjusts draft token verification thresholds based on recent historical performance. This adaptive mechanism allows the system to be more permissive for windows with strong historical performance and stricter for windows with weak performance, effectively leveraging temporal consistency to reduce overall latency. We further introduce Progressive Lookahead Concurrency (PLC), which pipelines draft generation and verification asynchronously to hide latency. Experiments across multiple benchmarks show that E$^2$-SCI achieves over $9.4$ tokens/s on DeepSeek-R1-Distill-Qwen (1.5B/32B), delivering an 88.5\\% speed improvement over the FSD baseline while maintaining accuracy. Notably, E$^2$-SCI integrates seamlessly with existing frameworks (e.g., EAGLE-3), demonstrating broad applicability and superior efficiency\u2013quality trade-offs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36835", "url": null, "sourceid": 44536, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36837, "uid": "0bb0846327772451045bd30dd347821b", "name": "MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis", "authors": [{"id": 132655, "fullname": "Xiangyu Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/132655?format=json", "institution": "Northeastern University"}, {"id": 185985, "fullname": "He Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185985?format=json", "institution": "Department of Computer Science, University of Oxford"}, {"id": 183066, "fullname": "Bishoy Galoaa", "url": "http://cvpr.thecvf.com/api/miniconf/users/183066?format=json", "institution": "Northeastern University"}, {"id": 185986, "fullname": "Utsav Nandi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185986?format=json", "institution": "Northeastern University"}, {"id": 179552, "fullname": "Shayda Moezzi", "url": "http://cvpr.thecvf.com/api/miniconf/users/179552?format=json", "institution": "Northeastern University"}, {"id": 185987, "fullname": "Yuhang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/185987?format=json", "institution": "Research, Microsoft"}, {"id": 179540, "fullname": "Sarah Ostadabbas", "url": "http://cvpr.thecvf.com/api/miniconf/users/179540?format=json", "institution": "Northeastern University"}], "abstract": "While text-to-video (T2V) generation has achieved remarkable progress in photorealism, generating intent-aligned videos that faithfully obey physics principles remains a core challenge. In this work, we systematically study Newtonian motion-controlled text-to-video generation and evaluation, emphasizing physical precision and motion coherence. We introduce MoReGen, a motion-aware, physics-grounded T2V framework that integrates multi-agent LLMs, physics simulators, and renderers to generate reproducible, physically accurate videos from text prompts in the code domain. To quantitatively assess physical validity, we propose object-trajectory correspondence as a direct evaluation metric and present MoReSet, a benchmark of 1,275 human-annotated videos spanning nine classes of Newtonian phenomena with scene descriptions, spatiotemporal relations, and ground-truth trajectories. Using MoReSet, we conduct experiments on existing T2V models, evaluating their physical validity through both our MoRe metrics and existing physics-based evaluators. Our results reveal that state-of-the-art models struggle to maintain physical validity, while MoReGen establishes a principled direction toward physically coherent video synthesis.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36837", "url": null, "sourceid": 32587, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36838, "uid": "c4539e880b908ad22a401caf9046d152", "name": "FlowFM: Advancing Dark Optical Flow Estimation with Flow Matching", "authors": [{"id": 180703, "fullname": "Fengyuan Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/180703?format=json", "institution": "Xi\u2019an University of Technology"}, {"id": 180779, "fullname": "Haiyan Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/180779?format=json", "institution": "Xi&#x27;an University of Technology"}, {"id": 185988, "fullname": "Yuanlin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185988?format=json", "institution": "Xi&#x27;an University of Technology"}, {"id": 185989, "fullname": "Zhaolin Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185989?format=json", "institution": "Xi&#x27;an University of Technology"}, {"id": 185990, "fullname": "Bin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185990?format=json", "institution": "Xi&#x27;an University of Technology"}, {"id": 185991, "fullname": "MU YUERONG", "url": "http://cvpr.thecvf.com/api/miniconf/users/185991?format=json", "institution": "Xi&#x27;an University of Technology"}], "abstract": "Dark optical flow estimation (DOFE) faces critical challenges: discriminative models are less robust to noise and struggle with weakened motion patterns, while diffusion models suffer from discontinuous flow fields and low efficiency. Flow matching (FM), though efficient, remains underexplored for conditional generation in DOFE. In this paper, we propose FlowFM, the first flow matching model tailored to DOFE tasks. Instead of conventional vector field regression, FlowFM suggests estimating the global transformation path constrained by the ground truth optical flow. It generates noisy flow by mixing Gaussian noise with ground truth, then performs a one-step denoising process conditioned on the initial flow field, cost volume, and contextual features for optimal accuracy and efficiency. FlowFM incorporates an implicit Fourier denoising decoder (IFDD) for reliable motion understanding. By leveraging Fourier transform, IFDD uses amplitude to characterize motion intensity and phase to encode target spatial relationships within flow fields, then directly enhances amplitude to restore dark-caused motion information loss. Experiments show that FlowFM significantly outperforms state-of-the-art methods on the FCDN and VBOF benchmarks, setting a new performance record for DOFE.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36838", "url": null, "sourceid": 43275, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36840, "uid": "ef6454c46da56ae242cfbcd023d6d61f", "name": "X-WIN: Building Chest Radiograph World Model via Predictive Sensing", "authors": [{"id": 169648, "fullname": "Zefan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/169648?format=json", "institution": "Rensselaer Polytechnic Institute"}, {"id": 186000, "fullname": "Ge Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186000?format=json", "institution": "Rensselaer Polytechnic Institute"}, {"id": 186001, "fullname": "James Hendler", "url": "http://cvpr.thecvf.com/api/miniconf/users/186001?format=json", "institution": "Rensselaer Polytechnic Institute"}, {"id": 186002, "fullname": "Mannudeep Kalra", "url": "http://cvpr.thecvf.com/api/miniconf/users/186002?format=json", "institution": "Harvard University"}, {"id": 133302, "fullname": "Pingkun Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/133302?format=json", "institution": "Rensselaer Polytechnic Institute"}], "abstract": "Chest X-ray radiography (CXR) is an essential medical imaging technique for disease diagnosis. However, as 2D projectional images, CXRs are limited by structural superposition and hence fail to capture 3D anatomies. This limitation makes representation learning and disease diagnosis challenging. To address this challenge, we propose a novel CXR world model named X-WIN, which distills volumetric knowledge from chest computed tomography (CT) by learning to predict its 2D projections in latent space. The core idea is that a world model with internalized knowledge of 3D anatomical structure can predict CXRs under various transformations in 3D space. During projection prediction, we introduce an affinity-guided contrastive alignment loss that leverages mutual similarities to capture rich, correlated information across projections from the same volume. To improve model adaptability, we incorporate real CXRs into training through masked image modeling and employ a domain classifier to encourage statistically similar representations for real and simulated CXRs. Comprehensive experiments show that X-WIN outperforms existing foundation models on diverse downstream tasks using linear probing and few-shot fine-tuning. X-WIN also demonstrates the ability to render 2D projections for reconstructing a 3D CT volume.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36840", "url": null, "sourceid": 30998, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36841, "uid": "13df48a47c642c3bd1548d2a8226035d", "name": "Unified Number-Free Text-to-Motion Generation Via Flow Matching", "authors": [{"id": 144098, "fullname": "Guanhe Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/144098?format=json", "institution": "King&#x27;s College London"}, {"id": 164949, "fullname": "Oya Celiktutan", "url": "http://cvpr.thecvf.com/api/miniconf/users/164949?format=json", "institution": "King&#x27;s College London"}], "abstract": "Generative models excel at motion synthesis for a fixed number of agents but struggle to generalize with variable agents. Based on limited, domain-specific data, existing methods employ autoregressive models to generate motion recursively, which suffer from inefficiency and error accumulation. We propose Unified Motion Flow (UMF), which consists of Pyramid Motion Flow (P-Flow) and Semi-Noise Motion Flow (S-Flow). UMF decomposes the number-free motion generation into a single-pass motion prior generation stage and multi-pass reaction generation stages. Specifically, UMF utilizes a unified latent space to bridge the distribution gap between heterogeneous motion datasets, enabling effective unified training. For motion prior generation, P-Flow operates on hierarchical resolutions conditioned on different noise levels, thereby mitigating computational overheads. For reaction generation, S-Flow learns a joint probabilistic path that adaptively performs reaction transformation and context reconstruction, alleviating error accumulation. Extensive results and user studies demonstrate UMF\u2019s effectiveness as a generalist model for multi-person motion generation from text. We will release the code.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36841", "url": null, "sourceid": 42024, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36842, "uid": "2836af9c512d8f61a77e1cc36eb7c0c1", "name": "Towards Hierarchical 3D Spatial Understanding in Vision-Language Models", "authors": [{"id": 174607, "fullname": "Huizhi Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174607?format=json", "institution": "Tsinghua University"}, {"id": 186003, "fullname": "Yichao Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186003?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 89100, "fullname": "Yu Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/89100?format=json", "institution": "Xiaobing.ai"}, {"id": 153304, "fullname": "Sicheng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153304?format=json", "institution": "Microsoft"}, {"id": 186004, "fullname": "ZhiYuan Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186004?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 85748, "fullname": "Tong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85748?format=json", "institution": "EPFL / University of Chinese Academy of Sciences"}, {"id": 157498, "fullname": "Yaobo Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157498?format=json", "institution": "Microsoft"}, {"id": 129072, "fullname": "Jiaolong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129072?format=json", "institution": "Microsoft Research"}], "abstract": "Achieving human-like spatial intelligence for vision-language models (VLMs) requires inferring 3D structures from 2D observations, recognizing object properties and relations in 3D space, and performing high-level spatial reasoning. In this paper, we propose a principled hierarchical framework that decomposes the learning of 3D spatial understanding in VLMs into four progressively complex stages, from geometric perception to abstract spatial reasoning. Guided by this framework, we construct an automated pipeline that generates over 1 billion 3D spatial VQA pairs across diverse tasks and scenes for VLM supervised finetuning. We also develop an RGB-D VLM that incorporates metric-scale point maps as auxiliary inputs to further enhance spatial understanding. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on multiple spatial understanding and reasoning benchmarks, surpassing specialized spatial models and large proprietary systems such as Gemini-2.5-pro and GPT-5. Moreover, our analysis reveals clear dependencies among hierarchical task levels, offering new insights into how multi-level task design facilitates the emergence of 3D spatial intelligence in future VLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36842", "url": null, "sourceid": 39202, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36844, "uid": "7052dac9f266e7843faf319350765a98", "name": "DCoAR: Deep Concept Injection into Unified Autoregressive Models for Personalized Text-to-Image Generation", "authors": [{"id": 181946, "fullname": "Fangtai Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181946?format=json", "institution": "Zhejiang University"}, {"id": 157822, "fullname": "Mushui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157822?format=json", "institution": "Zhejiang University"}, {"id": 186010, "fullname": "Weijie He", "url": "http://cvpr.thecvf.com/api/miniconf/users/186010?format=json", "institution": "Zhejiang University"}, {"id": 154185, "fullname": "Zhao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154185?format=json", "institution": "Zhejiang University"}, {"id": 89332, "fullname": "Yunlong Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89332?format=json", "institution": "Zhejiang University"}], "abstract": "The unified autoregressive (AR) model excels at multimodal understanding and generation. However, its full potential in the domain of customized image generation has yet to be fully realized.Existing customization approaches for unified AR models face a fundamental dilemma: adaptation-based methods suffer from overfitting and scalability bottlenecks, while concept-injection paradigms are constrained by a shallow injection strategy that leads to poor visual fidelity and impaired re-contextualization.To address this, we propose DCoAR, a novel deep concept injection framework that maintains a completely frozen pre-trained model. DCoAR deeply integrates new concepts through a Layer-wise Multimodal Context Learning (LMCL) strategy, which is stabilized by a multi-faceted regularization scheme: a Dual Prior Preservation (DPP) loss to mitigate semantic drift and a Context-Aware Self-Regularization (CASR) loss to enhance re-contextualization. The framework also enables training-free subject customization in user-provided styles.Experiments demonstrate that DCoAR significantly outperforms previous injection-based methods and achieves performance competitive with adaptation-based approaches while requiring substantially fewer trainable parameters.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36844", "url": null, "sourceid": 35460, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36845, "uid": "4eabe528a09da1820d18efcd9160dbd4", "name": "PointThinker: Point-Incentivized Parallel Thinking for Multimodal Large Language Model", "authors": [{"id": 186011, "fullname": "Zhengdong Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186011?format=json", "institution": "University of Technology Sydney"}, {"id": 158012, "fullname": "Chao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158012?format=json", "institution": "University of Technology Sydney"}, {"id": 126882, "fullname": "Fengyun Rao", "url": "http://cvpr.thecvf.com/api/miniconf/users/126882?format=json", "institution": "WeChat, Tencent Inc."}, {"id": 185164, "fullname": "Jing LYU", "url": "http://cvpr.thecvf.com/api/miniconf/users/185164?format=json", "institution": "Wechat Team"}, {"id": 163978, "fullname": "Hehe Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/163978?format=json", "institution": "Zhejiang University"}, {"id": 86325, "fullname": "Yi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86325?format=json", "institution": "Zhejiang University"}], "abstract": "This paper explores parallel thinking for Multi-modal Large Language Models (MLLMs), aiming to improve Chain-of-Thought (CoT) through multiple diverse reasoning paths. We guide the model to list multiple visual key points and develop an independent reasoning path for each. Therefore, we term this method PointThinker, which is characterized by starting each thinking path with a point. PointThinker offers two key advantages. (1) It amplifies the benefits of parallel thinking. While parallel thinking naturally benefits from multiple reasoning paths, explicitly listing key points further amplifies these benefits by eliminating redundancy and promoting path diversity, enabling the model to explore problems from more varied perspectives. (2) It uses a novel dense (point-wise) reward for reinforcement learning. We observe that during parallel thinking, some points are helpful while others are invalid, yet popular methods assign them the same rewards. Therefore, we propose allocating differentiated rewards to different points within the same chain-of-thought. This is implemented via a self-verification mechanism called Group Points Policy Optimization (GPPO), which combines rollout-level and point-level validation for reward assignment. On challenging benchmarks such as HallusionBench, PointThinker achieves 58.7% accuracy, improving reasoning quality and answer accuracy. Experimental results demonstrate that parallel thinking with point improves performance, and GPPO further contributes non-trivial gains.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36845", "url": null, "sourceid": 33262, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36849, "uid": "daad2c59031ec67bcf6d5520e5da7af3", "name": "SIMSPINE: A Biomechanics-Aware Simulation Framework for 3D Spine Motion Annotation and Benchmarking", "authors": [{"id": 106293, "fullname": "Muhammad Saif Ullah Khan", "url": "http://cvpr.thecvf.com/api/miniconf/users/106293?format=json", "institution": "German Research Center for Artificial Intelligence (DFKI)"}, {"id": 89910, "fullname": "Didier Stricker", "url": "http://cvpr.thecvf.com/api/miniconf/users/89910?format=json", "institution": "Universit\u00e4t Kaiserslautern"}], "abstract": "Modeling spinal motion is fundamental to understanding human biomechanics, yet remains underexplored in computer vision due to the spine\u2019s complex multi-joint kinematics and the lack of large-scale 3D annotations. We present a  biomechanics-aware keypoint simulation framework that augments existing human pose datasets with anatomically consistent 3D spinal keypoints derived from musculoskeletal modeling. Using this framework, we create the first open dataset, named SIMSPINE, which provides sparse vertebra-level 3D spinal annotations for natural full\u2011body motions in indoor multi\u2011camera capture without external restraints. With 2.14 million frames, this enables data-driven learning of vertebral kinematics from subtle posture variations and bridges the gap between musculoskeletal simulation and computer vision. In addition, we release pretrained baselines covering fine-tuned 2D detectors, monocular 3D pose lifting models, and multi-view reconstruction pipelines, establishing a unified benchmark for biomechanically valid spine motion estimation. Specifically, our 2D spine baselines improve the state-of-the-art from 0.63 to 0.80 AUC in controlled environments, and from 0.91 to 0.93 AP for in-the-wild spine tracking.Together, the simulation framework and SIMSPINE dataset advance research in vision-based biomechanics, motion analysis, and digital human modeling by enabling reproducible, anatomically grounded 3D spine estimation under natural conditions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36849", "url": null, "sourceid": 44426, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36855, "uid": "e1db696dd96f55544d1a322402e33f52", "name": "Linear Image Generation by Synthesizing Exposure Brackets", "authors": [{"id": 72028, "fullname": "Yuekun Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/72028?format=json", "institution": "Nanyang Technological University"}, {"id": 129904, "fullname": "Zhoutong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129904?format=json", "institution": "Adobe Systems"}, {"id": 75939, "fullname": "Shangchen Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/75939?format=json", "institution": "Nanyang Technological University"}, {"id": 88458, "fullname": "Nanxuan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88458?format=json", "institution": "Adobe Research"}], "abstract": "The life of a photo begins with photons striking the sensor, whose signals are passed through a sophisticated image signal processing  (ISP) pipeline to produce a display-referred image. However, such images are no longer faithful to the incident light, being compressed in dynamic range and stylized by subjective preferences. In contrast, RAW images record direct sensor signals before non-linear tone mapping. After camera response curve correction and demosaicing, they can be converted into linear images, which are scene-referred representations that directly reflect true irradiance and are invariant to sensor-specific factors. Since image sensors have better dynamic range and bit depth, linear images contain richer information than display-referred ones, leaving users more room for editing during post-processing. Despite this advantage, current generative models mainly synthesize display-referred images, which inherently limits downstream editing. Generating linear images, however, is quite challenging. Pre-trained VAEs in latent diffusion models struggle to reconstruct linear images due to their higher dynamic range and bit depth, where extreme highlights and shadows cannot be simultaneously preserved. To this end, we represent a linear image as a sequence of exposure brackets\u2014linear sub-images, each capturing a specific portion of the overall dynamic range. Based on this representation, we propose a new DiT-based flow-matching architecture to generate exposure brackets, which can be post-processed to produce a high-quality linear image. We further demonstrate that our approach enables downstream applications such as linear image editing and conditional linear image generation through ControlNet guidance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36855", "url": null, "sourceid": 39651, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36856, "uid": "88759fc0e73600b1a941d8defae4d6b8", "name": "Cinematic Audio Source Separation Using Visual Cues", "authors": [{"id": 186030, "fullname": "Kang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186030?format=json", "institution": "Korea Advanced Institute of Science and Technology"}, {"id": 182013, "fullname": "Suyeon Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/182013?format=json", "institution": "KAIST"}, {"id": 87269, "fullname": "Arda Senocak", "url": "http://cvpr.thecvf.com/api/miniconf/users/87269?format=json", "institution": "KAIST"}, {"id": 126430, "fullname": "Joon Chung", "url": "http://cvpr.thecvf.com/api/miniconf/users/126430?format=json", "institution": "KAIST"}], "abstract": "Cinematic Audio Source Separation (CASS) aims to decompose mixed film audio into speech, music, and sound effects, enabling applications like dubbing and remastering. Existing CASS approaches are audio-only, overlooking the inherent audio-visual nature of films, where sounds often align with visual cues. We present the first framework for audio-visual CASS (AV-CASS), leveraging visual context to enhance separation quality. Our method formulates CASS as a conditional generative modeling problem using conditional flow matching, enabling multimodal audio source separation. To address the lack of cinematic datasets with isolated sound tracks, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, scene videos for effects) and design a dedicated visual encoder for this dual-stream setup. Trained entirely on synthetic data, our model generalizes effectively to real-world cinematic content and achieves strong performance on synthetic, real-world, and audio-only CASS benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36856", "url": null, "sourceid": 44460, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36858, "uid": "248210163abcc765cc4b37cf971ef792", "name": "Energy-GS: Image Energy-guided Pose Alignment Gaussian Splatting with redesigned pose gradient flow", "authors": [{"id": 155455, "fullname": "Yu Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/155455?format=json", "institution": "Beijing Institute of Technology"}, {"id": 186035, "fullname": "Su Lutong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186035?format=json", "institution": "Beijing Institute of Technology"}, {"id": 186036, "fullname": "Ruixiang Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186036?format=json", "institution": "Beijing Institute of Technology"}, {"id": 177355, "fullname": "Tianji Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177355?format=json", "institution": "Beijing Institute of Technology"}, {"id": 140737, "fullname": "Jiadong Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/140737?format=json", "institution": "Beijing Institute of Technology"}, {"id": 155458, "fullname": "Yufeng Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/155458?format=json", "institution": "Beijing Institute of Technology"}, {"id": 155459, "fullname": "Yi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155459?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "High-quality 3D scene representation in radiance fields relies on accurate camera poses which are often difficult to acquire in real-world scenarios. An effective solution is to use RGB images for the joint optimization of radiance fields and camera poses, an approach that has been well explored in NeRF series methods. However, unlike NeRF, joint optimization in 3D Gaussian Splatting (3DGS) often requires additional regularization or prior spatial knowledge to reach comparable performance. To eliminate these dependencies, we introduce Energy-GS, a pose-aware Gaussian splatting framework that jointly optimizes scene representation and camera poses using only RGB images. We observe that pose gradients in joint optimization are unstable due to the point-based rendering mechanism. Furthermore, unlike NeRF\u2019s spatial sampling framework that enables coarse-to-fine pose alignment, rasterization-based 3DGS lacks controllable sampling and thus cannot support progressive pose refinement. To address these challenges, we redesign the optimization strategy of Gaussian primitives and introduce an image-energy-guided constraint that encourages progressive alignment of camera poses. Experiments on both synthetic and real-world datasets show that Energy-GS can effectively optimize the scene reconstruction and resolve camera pose misalignment at the same time. Benefiting from reliance on only RGB images, we believe this work provides promising insights for visual localization and dense mapping applications such as SLAM.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36858", "url": null, "sourceid": 36113, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40277?format=json"], "related_events_ids": [40277]}, {"id": 40277, "uid": "248210163abcc765cc4b37cf971ef792", "name": "Energy-GS: Image Energy-guided Pose Alignment Gaussian Splatting with redesigned pose gradient flow", "authors": [{"id": 155455, "fullname": "Yu Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/155455?format=json", "institution": "Beijing Institute of Technology"}, {"id": 186035, "fullname": "Su Lutong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186035?format=json", "institution": "Beijing Institute of Technology"}, {"id": 186036, "fullname": "Ruixiang Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186036?format=json", "institution": "Beijing Institute of Technology"}, {"id": 177355, "fullname": "Tianji Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177355?format=json", "institution": "Beijing Institute of Technology"}, {"id": 140737, "fullname": "Jiadong Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/140737?format=json", "institution": "Beijing Institute of Technology"}, {"id": 155458, "fullname": "Yufeng Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/155458?format=json", "institution": "Beijing Institute of Technology"}, {"id": 155459, "fullname": "Yi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155459?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "High-quality 3D scene representation in radiance fields relies on accurate camera poses which are often difficult to acquire in real-world scenarios. An effective solution is to use RGB images for the joint optimization of radiance fields and camera poses, an approach that has been well explored in NeRF series methods. However, unlike NeRF, joint optimization in 3D Gaussian Splatting (3DGS) often requires additional regularization or prior spatial knowledge to reach comparable performance. To eliminate these dependencies, we introduce Energy-GS, a pose-aware Gaussian splatting framework that jointly optimizes scene representation and camera poses using only RGB images. We observe that pose gradients in joint optimization are unstable due to the point-based rendering mechanism. Furthermore, unlike NeRF\u2019s spatial sampling framework that enables coarse-to-fine pose alignment, rasterization-based 3DGS lacks controllable sampling and thus cannot support progressive pose refinement. To address these challenges, we redesign the optimization strategy of Gaussian primitives and introduce an image-energy-guided constraint that encourages progressive alignment of camera poses. Experiments on both synthetic and real-world datasets show that Energy-GS can effectively optimize the scene reconstruction and resolve camera pose misalignment at the same time. Benefiting from reliance on only RGB images, we believe this work provides promising insights for visual localization and dense mapping applications such as SLAM.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40277", "url": null, "sourceid": -36113, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/36858?format=json"], "related_events_ids": [36858]}, {"id": 36860, "uid": "d94e6cd8cf7e612bd8fd4096156eab2f", "name": "Geometric-Photometric Event-based 3D Gaussian Ray Tracing", "authors": [{"id": 134001, "fullname": "Kai Kohyama", "url": "http://cvpr.thecvf.com/api/miniconf/users/134001?format=json", "institution": "Keio University"}, {"id": 90274, "fullname": "Yoshimitsu Aoki", "url": "http://cvpr.thecvf.com/api/miniconf/users/90274?format=json", "institution": "Keio University"}, {"id": 73480, "fullname": "Guillermo Gallego", "url": "http://cvpr.thecvf.com/api/miniconf/users/73480?format=json", "institution": "TU Berlin"}, {"id": 71425, "fullname": "Shintaro Shiba", "url": "http://cvpr.thecvf.com/api/miniconf/users/71425?format=json", "institution": "The University of Tokyo / Keio University"}], "abstract": "Event cameras offer a high temporal resolution over traditional frame-based cameras, which makes them suitable for motion and structure estimation. However, it has been unclear how event-based 3D Gaussian Splatting (3DGS) approaches could leverage fine-grained temporal information of sparse events. This work proposes a framework to address the trade-off between accuracy and temporal resolution in the event-based 3DGS. Our key idea is to decouple the rendering into two branches: event-by-event geometry (depth) rendering and snapshot-based radiance (intensity) rendering, by using ray-tracing and the image of warped events. The extensive evaluation shows that our method achieves the state-of-the-art performance on the real-world datasets and competitive performance on the synthetic datasets. Also, the proposed method works without prior information (e.g., pretrained image reconstruction models) or COLMAP-based initialization, is more flexible in the event accumulation size, and achieves sharp reconstruction on scene edges. We hope that this work deepens our understanding of the sparse nature of events for 3D reconstruction. We will release the code upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36860", "url": "https://e3ai.github.io/gpert/", "sourceid": 34731, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36861, "uid": "df05cafdf19af074c204ab7d6f544119", "name": "CAST: Context-Aware Dynamic Latent Space Transformation for Interactive Text-to-Image Retrieval", "authors": [{"id": 186041, "fullname": "Xuanzuo Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186041?format=json", "institution": "Zhejiang Gongshang University"}, {"id": 186042, "fullname": "Min Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186042?format=json", "institution": null}, {"id": 186043, "fullname": "Daizong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186043?format=json", "institution": "Wuhan University"}, {"id": 186044, "fullname": "Zhiwen Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186044?format=json", "institution": "Zhejiang Gongshang University"}, {"id": 86375, "fullname": "Xun Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86375?format=json", "institution": "University of Science and Technology of China"}, {"id": 177056, "fullname": "Changting Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/177056?format=json", "institution": "Gentel"}, {"id": 184744, "fullname": "Xun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184744?format=json", "institution": "Zhejiang Gongshang University"}, {"id": 180225, "fullname": "Jianfeng Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/180225?format=json", "institution": "Zhejiang Gongshang University"}], "abstract": "Interactive Text-to-Image Retrieval (I-TIR) aims to refine image retrieval results through natural language dialogues, which allows users to progressively supplement or correct their search intention across multiple rounds, enabling a more precise and user-aligned visual search experience.However, existing methods perform cross-modal retrieval within a fixed multimodal feature space, mapping all dialogue text and images onto the same static embedding manifold.Such static formulation easily causes semantic vagueness, making it difficult to capture subtle embedding shifts in the user's updated intention for fine-grained retrieval.To address this limitation, we propose Context-Aware Latent Space Transformation (CAST), a lightweight framework that dynamically transforms the common latent space of both textual and visual representations according to the specific evolving user's search intention, enabling fine-grained and adaptive semantic alignment.The core of CAST is the Context-Aware Space Regulator (CASR), a crucial space transformation module composed of two key components: (1) the Context-Aware Low-Rank Projector (CLP), which learns to predict the projection direction of embedding space based on the intent's context;and (2) the Context-Guided Modulator (CGM), which adaptively determines appropriate projection strength. CASR is highly lightweight, adding negligible parameters and computational overhead, and can be seamlessly integrated into diverse I-TIR frameworks.Extensive experiments demonstrate the effectiveness of our proposed framework, indicating that it can serve as a general, plug-and-play solution for efficient and scalable interactive text-to-image retrieval. Our source code is provided in the supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36861", "url": null, "sourceid": 45120, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36863, "uid": "2cd6afecab6cf4f86b40d7b9ded667e0", "name": "A Self-Conditioned Representation Guided Diffusion Model for Realistic Text-to-LiDAR Scene Generation", "authors": [{"id": 132821, "fullname": "Wentao Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/132821?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 105964, "fullname": "Guofeng Mei", "url": "http://cvpr.thecvf.com/api/miniconf/users/105964?format=json", "institution": "Fondazione Bruno Kessler"}, {"id": 88510, "fullname": "Yang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88510?format=json", "institution": "Nanjing University of Information Science and Technology"}, {"id": 130609, "fullname": "Yongshun Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/130609?format=json", "institution": "Shandong University"}, {"id": 132889, "fullname": "Xiaoshui Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/132889?format=json", "institution": "Shanghai AI Lab"}, {"id": 130298, "fullname": "Liang Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/130298?format=json", "institution": "Nanjing University of Science and Technology"}], "abstract": "Text-to-LiDAR generation can customize 3D data with rich structures and diverse scenes for downstream tasks. However, the scarcity of Text-LiDAR pairs often causes insufficient training priors, generating overly smooth 3D scenes. Moreover, low-quality text descriptions may degrade generation quality and controllability. In this paper, we propose a Text-to-LiDAR Diffusion Model for scene generation, named T2LDM, with a Self-Conditioned Representation Guidance (SCRG). Specifically, SCRG, by aligning to the real representations, provides the soft supervision with reconstruction details for the Denoising Network (DN) in training, while decoupled in inference. In this way, T2LDM can perceive rich geometric structures from data distribution, generating detailed objects in scenes. Meanwhile, we construct a content-composable Text-LiDAR benchmark, T2nuScenes, along with a controllability metric. Based on this, we analyze the effects of different text prompts for LiDAR generation quality and controllability, providing practical prompt paradigms and insights. Furthermore, a directional position prior is designed to mitigate street distortion,  further improving scene fidelity. Additionally, by learning a conditional encoder via frozen DN, T2LDM can support multiple conditional tasks, including Sparse-to-Dense, Dense-to-Sparse, and Semantic-to-LiDAR generation. Extensive experiments in unconditional and conditional generation demonstrate that T2LDM outperforms existing methods, achieving state-of-the-art scene generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36863", "url": null, "sourceid": 35197, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36868, "uid": "1d10712905e2faf91de6700424d443f6", "name": "Accelerating Streaming Video Understanding via Hierarchical Token Compression", "authors": [{"id": 186061, "fullname": "Yiyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186061?format=json", "institution": "Shanghai University of Science and Technology"}, {"id": 186062, "fullname": "Xuyang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186062?format=json", "institution": "Sichuan University"}, {"id": 186063, "fullname": "Xiyan Gui", "url": "http://cvpr.thecvf.com/api/miniconf/users/186063?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 186064, "fullname": "Xinying Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186064?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 186065, "fullname": "Boxue Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186065?format=json", "institution": "Shanghai Jiao Tong University; Shanghai Jiao Tong University"}, {"id": 186066, "fullname": "Chenfei Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186066?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 184671, "fullname": "Tailai Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184671?format=json", "institution": "China Agricultural University"}, {"id": 87643, "fullname": "Linfeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87643?format=json", "institution": ", Tsinghua University"}], "abstract": "Streaming Video Large Language Models (VideoLLMs) have demonstrated impressive performance across various video understanding tasks, but they face significant challenges in real-time deployment due to the high computational cost of processing dense visual tokens from continuous video streams. In streaming video scenarios, the primary bottleneck lies in the Vision Transformer (ViT) encoding stage, where redundant processing of temporally similar frames leads to inefficiency. Additionally, inflated token sequences during LLM pre-filling further exacerbate latency and memory overhead. To address these challenges, we propose \\textbf{S}treaming \\textbf{T}oken \\textbf{C}ompression (\\textbf{STC}), a plug-and-play hierarchical framework that seamlessly integrates into existing streaming VideoLLMs, optimizing both ViT encoding and LLM pre-filling stages to accelerate processing. STC introduces two token-level accelerators: \\textbf{STC-Cacher}, which reduces ViT encoding overhead by caching and reusing features from temporally similar frames, and \\textbf{STC-Pruner}, which compresses the visual token sequence before it enters the LLM, preserving only the most salient tokens based on both spatial and temporal relevance. Extensive experiments on four baseline streaming VideoLLMs across five benchmarks demonstrate that STC outperforms other compression methods. Notably, STC retains up to \\textbf{99\\%} of accuracy on the ReKV framework while reducing ViT encoding latency and LLM pre-filling latency by \\textbf{24.5\\%} and \\textbf{45.3\\%}.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36868", "url": null, "sourceid": 34486, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36897, "uid": "456c9caa8ade0846892f33f04e9b466a", "name": "MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label", "authors": [{"id": 186142, "fullname": "Junyoung Jung", "url": "http://cvpr.thecvf.com/api/miniconf/users/186142?format=json", "institution": "Kyung Hee University; Kyung Hee University"}, {"id": 186143, "fullname": "Seokwon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/186143?format=json", "institution": "Kyung Hee University"}, {"id": 129674, "fullname": "Jung Uk Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/129674?format=json", "institution": "Kyung Hee University"}], "abstract": "Monocular 3D Object Detection has achieved impressive performance on densely annotated datasets. However, it struggles when only a fraction of objects are labeled due to the high cost of 3D annotation. This sparsely-annotated setting is common in real-world scenarios where annotating every object is impractical.To address this, we propose a novel framework for sparsely-annotated monocular 3D object detection with two key modules.First, we propose Road-Aware Patch Augmentation (RAPA), which leverages sparse annotations by augmenting segmented object patches onto road regions while preserving 3D geometric consistency. Second, we propose Prototype-Based Filtering (PBF), which generates high-quality pseudo-labels by filtering predictions through prototype similarity and depth uncertainty. PBF maintains global 2D RoI feature prototypes and selects pseudo-labels that are both feature-consistent with learned prototypes and have reliable depth estimates.Our training strategy combines geometry-preserving augmentation with prototype-guided pseudo-labeling to achieve robust detection under sparse supervision.Extensive results demonstrate the effectiveness of the proposed method. The source code will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36897", "url": null, "sourceid": 37801, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36869, "uid": "6840b145d767fd6d53ef6a0595784bdd", "name": "Seeing as Experts Do: A Knowledge-Augmented Agent for Open-Set Fine-Grained Visual Understanding", "authors": [{"id": 181178, "fullname": "Junhan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/181178?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 181286, "fullname": "Zilu Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/181286?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 70331, "fullname": "Yujun Tong", "url": "http://cvpr.thecvf.com/api/miniconf/users/70331?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 131116, "fullname": "Dongliang Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131116?format=json", "institution": "Tsinghua University"}, {"id": 181184, "fullname": "Yitao Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/181184?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 90236, "fullname": "Zhanyu Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/90236?format=json", "institution": "Beijing University of Post and Telecommunication"}], "abstract": "Fine-grained visual understanding is shifting from static classification to knowledge-augmented reasoning, where models must justify as well as recognise. Existing approaches remain limited by closed-set taxonomies and single-label prediction, leading to significant degradation under open-set or context-dependent conditions.We present the Knowledge-Augmented Fine-Grained Reasoning Agent (KFRA), a unified framework that transforms fine-grained perception into evidence-driven reasoning. KFRA operates through a three-stage closed reasoning loop that emulates expert analysis. It first performs open-vocabulary detection and web-scale retrieval to generate category hypotheses. It then conducts discriminative regions localisation by aligning textual knowledge with visual evidence through a global-to-local focusing mechanism. Finally, it integrates all multimodal evidence within a large multimodal model to perform interpretable reasoning.Unlike existing agents that treat retrieval and reasoning as independent processes, KFRA establishes a retrieval\u2013grounding coupling that converts retrieved knowledge into spatially grounded evidence for verification. This design enables factual, interpretable, and task-agnostic reasoning across diverse fine-grained scenarios. To evaluate this capability, we construct FGExpertBench, a benchmark designed to assess reasoning depth and cross-task generalisation across six knowledge dimensions. Extensive experiments demonstrate that KFRA consistently surpasses both standalone large multimodal models and current agent frameworks, achieving up to 19 percent improvement in reasoning accuracy and delivering evidence-grounded interpretability in open-set fine-grained visual understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36869", "url": null, "sourceid": 45947, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36871, "uid": "196a347f6fbe48ecb3c53a261119729e", "name": "DFD-HR: Generalizable Deepfake Detection via Hierarchical Routing Learning", "authors": [{"id": 180242, "fullname": "JIAMU SUN", "url": "http://cvpr.thecvf.com/api/miniconf/users/180242?format=json", "institution": "Tencent"}, {"id": 131427, "fullname": "Zhiyuan Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/131427?format=json", "institution": "Peking University"}, {"id": 91026, "fullname": "Ke-Yue Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91026?format=json", "institution": "Tencent"}, {"id": 90624, "fullname": "Taiping Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/90624?format=json", "institution": "Tencent Youtu Lab"}, {"id": 87241, "fullname": "Shouhong Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/87241?format=json", "institution": "Tencent Youtu Lab"}], "abstract": "Developing generalizable deepfake detectors has become increasingly important with the rapid advancement of generative models. Adapting visual foundation models (VFMs), e.g., CLIP, through parameter-efficient finetuning (PEFT), with only a small subset of parameters updated, has been proven highly effective for generalizable detection. However, the success of \u201cfewer-parameters\u201d training raises an important question: although only a few parameters are tuned, have existing PEFT-based detectors truly exploited the most informative ones while eliminating redundant parameters for better generalization? In this work, we move beyond standard PEFT by proposing a joint optimization strategy that operates at both the layer and token levels. Since latent features across layers capture different semantic abstractions and tokens within the same layer convey varied forgery cues, we propose integrating both layer-level and token-level routing to maximize representational synergy. Specifically, at the layer level, we introduce \"Early Layer Pruning\", an adaptive truncation mechanism that enables the model to adaptively learn distinct forward depths for different types of instances. At the token level, \"Token Selection\" is guided by the Spearman rank loss to filter tokens irrelevant to forgery learning, enabling the model to focus on the most discriminative cues. Furthermore, a unified MoE architecture is applied that encourages diversity and thus reduces the potential model's overfitting to specific forgery types. Extensive benchmarking results demonstrate the effectiveness of our designs and show the superior performance of our method over existing state-of-the-arts.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36871", "url": null, "sourceid": 34892, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36872, "uid": "c08f0c801bd312c3d2358c1ed0bc1bea", "name": "PackUV: Packed Gaussian UV Maps for 4D Volumetric Video", "authors": [{"id": 154361, "fullname": "Aashish Rai", "url": "http://cvpr.thecvf.com/api/miniconf/users/154361?format=json", "institution": "Brown University"}, {"id": 127302, "fullname": "Angela Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/127302?format=json", "institution": "Brown University"}, {"id": 164587, "fullname": "Anushka Agarwal", "url": "http://cvpr.thecvf.com/api/miniconf/users/164587?format=json", "institution": "University of Massachusetts at Amherst"}, {"id": 181515, "fullname": "Xiaoyan Cong", "url": "http://cvpr.thecvf.com/api/miniconf/users/181515?format=json", "institution": "Brown University"}, {"id": 91650, "fullname": "Zekun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/91650?format=json", "institution": "Tencent AI Lab"}, {"id": 155693, "fullname": "Tao Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155693?format=json", "institution": "Brown University"}, {"id": 167176, "fullname": "Aayush Prakash", "url": "http://cvpr.thecvf.com/api/miniconf/users/167176?format=json", "institution": "Facebook"}, {"id": 76158, "fullname": "Srinath Sridhar", "url": "http://cvpr.thecvf.com/api/miniconf/users/76158?format=json", "institution": "Brown University"}], "abstract": "Volumetric videos offer immersive 4D experiences, but remain difficult to reconstruct, store, and stream at scale. Existing Gaussian Splatting based methods achieve high-quality reconstruction but break down on long sequences, temporal inconsistency, and fail under large motions and disocclusions. Moreover, their outputs are typically incompatible with conventional video coding pipelines, preventing practical applications. We introduce PackUV, a novel 4D Gaussian representation that maps all Gaussian attributes into a sequence of structured, multi-scale UV atlas, enabling compact, image-native storage. To fit this representation from multi-view videos, we propose PackUV-GS, a temporally consistent fitting method that directly optimizes Gaussian parameters in the UV domain. A flow-guided Gaussian labeling and video keyframing module identifies dynamic Gaussians, stabilizes static regions, and preserves temporal coherence even under large motions and disocclusions. The resulting UV atlas format is the first unified volumetric video representation compatible with standard video codecs (e.g., HEVC, FFV1) without quality loss, enabling efficient streaming within existing multimedia infrastructure. To evaluate long-duration volumetric capture, we present PackUV-2B, the largest multi-view 4D dataset to date, featuring more than 50 synchronized 360 cameras, substantial motion, and frequent disocclusions across 100 sequences and 2B (billion) frames. Extensive experiments demonstrate that our method surpasses existing baselines in rendering fidelity while scaling to sequences up to 30 minutes with consistent quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36872", "url": null, "sourceid": 33855, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36878, "uid": "85c19375f0c12c6793bf66b4e2666dc4", "name": "SGAD-SLAM: Splatting Gaussians at Adjusted Depth for Better Radiance Fields in RGBD SLAM", "authors": [{"id": 179942, "fullname": "Pengchong Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/179942?format=json", "institution": "Wayne State University"}, {"id": 88625, "fullname": "Zhizhong Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/88625?format=json", "institution": "Wayne State University"}], "abstract": "3D Gaussian Splatting (3DGS) has made huge progress in RGBD SLAM. Current methods usually use 3D Gaussians or view-tied 3D Gaussians to represent radiance fields in tracking and mapping. However, these Gaussians are either too flexible or too limited in movements, resulting in slow convergence or limited rendering quality. To resolve this issue, we adopt pixel-aligned Gaussians but allow each Gaussian to adjust its position along its ray to maximize the rendering quality, even if Gaussians are simplified for improving scalability. To speed up the tracking, we model the depth distribution around each pixel as a Gaussian function, and then use these points to align each frame to the 3D scene quickly. We report our evaluations on widely used benchmarks, justify our designs, and show advantages over the latest methods in view rendering, camera tracking, runtime, and storage complexity.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36878", "url": null, "sourceid": 31173, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36880, "uid": "1ca24fd40f486a3bc48e371b2f98a41e", "name": "Free-Lunch Long Video Generation via Layer-Adaptive O.O.D Correction", "authors": [{"id": 175623, "fullname": "Jiahao Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/175623?format=json", "institution": "Westlake University"}, {"id": 182531, "fullname": "Chenxi Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/182531?format=json", "institution": "Westlake University"}, {"id": 88826, "fullname": "Wei Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/88826?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 153810, "fullname": "Chi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153810?format=json", "institution": "Westlake University"}], "abstract": "Generating long videos using pre-trained video diffusion models, which are typically trained on short clips, presents a significant challenge. Directly applying these models for long-video inference often leads to a notable degradation in visual quality. This paper identifies that this issue primarily stems from two out-of-distribution (O.O.D) problems: frame-level relative position O.O.D and context-length O.O.D. To address these challenges, we propose a novel training-free, layer-adaptive framework. The core of our approach is the observation that different layers within the model exhibit varying sensitivities to these two O.O.D. issues. We first introduce a systematic probing procedure to quantify each layer's sensitivity. Based on the results, we apply a tailored, layer-wise strategy. For layers sensitive to relative positions, we propose a novel multi-granularity video-based relative position re-encoding (VRPR) scheme. For layers sensitive to context length, we utilize a tiered sparse attention (TSA) mechanism combined with an attention sink. Extensive experiments show that our method achieves state-of-the-art performance in long video generation. Importantly, our framework can be seamlessly integrated into various leading video diffusion models without any additional training.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36880", "url": null, "sourceid": 43914, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36882, "uid": "332d62c00e4e7218915d19a6f084d0ae", "name": "Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens", "authors": [{"id": 135900, "fullname": "Xinxuan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/135900?format=json", "institution": "University of California, Irvine"}, {"id": 84831, "fullname": "Charless Fowlkes", "url": "http://cvpr.thecvf.com/api/miniconf/users/84831?format=json", "institution": "University of California, Irvine"}, {"id": 96762, "fullname": "Alex Berg", "url": "http://cvpr.thecvf.com/api/miniconf/users/96762?format=json", "institution": "Donald Bren School of Information and Computer Sciences, University of California, Irvine"}], "abstract": "Current text-to-image models struggle to express precise camera control using natural language alone. In this work, we present a framework for precise camera control with global scene understanding in text-to-image generation by learning parametric camera tokens. We fine-tune a unified multimodal model for viewpoint-conditioned text-to-image generation on a curated dataset that combines high-volume rendered images for geometric supervision with low-volume photorealistic augmentations for appearance diversity. Qualitative and quantitative experiments demonstrate that our method achieves state-of-the-art accuracy across all camera parameters while preserving image quality and prompt fidelity. Unlike prior methods that overfit to object-specific appearance correlations, our viewpoint tokens learn factorized geometric representations that transfer to unseen object categories. Our work shows that text-vision latent spaces can be endowed with explicit 3D camera structure, offering a pathway toward geometrically-aware prompts for text-to-image generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36882", "url": null, "sourceid": 43645, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36883, "uid": "9cf7f7608bc4393fef791f39b1b179b8", "name": "MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator", "authors": [{"id": 106685, "fullname": "Peiqing Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/106685?format=json", "institution": "S-Lab, Nanyang Technological University"}, {"id": 75939, "fullname": "Shangchen Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/75939?format=json", "institution": "Nanyang Technological University"}, {"id": 186100, "fullname": "Kai Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186100?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 153089, "fullname": "Qingyi Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153089?format=json", "institution": "Sensetime"}], "abstract": "Video matting remains limited by the scale and realism of existing datasets. While leveraging segmentation data can enhance semantic stability, the lack of effective boundary supervision often leads to segmentation-like mattes lacking fine details. To this end, we introduce a learned Quality Evaluator (QE) that assesses semantic and boundary quality of alpha mattes without ground truth. It produces a pixel-wise evaluation map that identifies reliable and erroneous regions, enabling fine-grained quality assessment. The QE scales up video matting in two ways: (1) as an online matting-quality feedback during training to suppress erroneous regions, providing comprehensive supervision, and (2) as an offline selection module for data curation, improving annotation quality by combining the strengths of leading video and image matting models. This process allows us to build a large-scale real-world video matting dataset, VMReal, containing 28K clips and 2.4M frames. To handle large appearance variations in long videos, we introduce a reference-frame training strategy that incorporates long-range frames beyond the local window for effective training. Our MatAnyone 2 achieves state-of-the-art performance on both synthetic and real-world benchmarks, surpassing prior methods across all metrics.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36883", "url": null, "sourceid": 37578, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36885, "uid": "bc3e6127ca2263a0a9375a2efab8dae4", "name": "Unified Multimodal Models as Auto-Encoders", "authors": [{"id": 131427, "fullname": "Zhiyuan Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/131427?format=json", "institution": "Peking University"}, {"id": 186102, "fullname": "Kaiqing Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186102?format=json", "institution": "Shenzhen University"}, {"id": 186103, "fullname": "Zongjian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186103?format=json", "institution": "Peking University"}, {"id": 129844, "fullname": "Junyan Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/129844?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 150500, "fullname": "Hui Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/150500?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 89673, "fullname": "Haochen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89673?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 89104, "fullname": "Zhendong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89104?format=json", "institution": "University of Science and Technology of China"}, {"id": 153157, "fullname": "Bin Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/153157?format=json", "institution": "Peking University"}, {"id": 156080, "fullname": "Li Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/156080?format=json", "institution": "Peking University"}, {"id": 186104, "fullname": "Xinyan Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186104?format=json", "institution": "Baidu"}, {"id": 126338, "fullname": "Jingdong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126338?format=json", "institution": "Baidu"}, {"id": 84919, "fullname": "Haifeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84919?format=json", "institution": "Baidu"}, {"id": 186105, "fullname": "Li Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186105?format=json", "institution": "Peking University"}], "abstract": "Image-to-text (I2T) understanding and text-to-image (T2I) generation are two fundamental, important yet traditionally isolated multimodal tasks.  Despite their intrinsic connection, existing approaches typically optimize them independently, missing the opportunity for mutual enhancement. In this paper, we argue that both tasks can be connected under a shared Auto-Encoder perspective, where text serves as the intermediate latent representation, bridging the two directions \u2014 encoding images into textual semantics (I2T) and decoding text back into images (T2I). Our key insight is that *if the encoder truly \"understands\" the image, it should capture all essential structure, and if the decoder truly \"understands\" the text, it should recover that structure faithfully.* Building upon this principle, we propose Unified-GRPO, a post-training method based on reinforcement learning that jointly optimizes both modules through reconstructive rewards, maximizing the semantic consistency between the input and the generated images. Under this reconstruction objective, the encoder is encouraged to extract as much accurate and comprehensive semantic information from the input image to maximize reconstruction quality, while the decoder is simultaneously optimized to generate conditioned on the encoder's prior, enabling a self-evolving improvement. Empirically, we find that using text as the intermediate representation and training under a reconstructive RL paradigm effectively benefits both I2T and T2I.The I2T module gains stronger fine-grained visual perception, such as small-object recognition, grounding, etc, while its dense embeddings and language priors, in turn, provide richer semantic signals that improve T2I fidelity and complex instruction following. These results demonstrate that the reconstructive RL establishes a mutually reinforcing cross-modal synergy within the auto-encoding framework.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36885", "url": null, "sourceid": 34533, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36889, "uid": "9442a913d4ec91707fa1fac69ec73e0b", "name": "CD-Buffer: Complementary Dual-Buffer Framework for Test-Time Adaptation in Adverse Weather Object Detection", "authors": [{"id": 147281, "fullname": "Youngjun Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/147281?format=json", "institution": "Yonsei University"}, {"id": 72183, "fullname": "Hyeongyu Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/72183?format=json", "institution": "Yonsei University"}, {"id": 91230, "fullname": "Dosik Hwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91230?format=json", "institution": "Yonsei University"}], "abstract": "Test-time adaptation (TTA) enables real-time adaptation to domain shifts without offline retraining. Recent TTA methods have predominantly explored additive approaches that introduce lightweight modules for feature refinement. Very recently, a subtractive approach that removes domain-sensitive channels has emerged as an alternative direction. We observe that these paradigms exhibit complementary effectiveness patterns: subtractive methods excel under severe shifts by removing corrupted features, while additive methods are effective under moderate shifts requiring refinement. However, each paradigm operates effectively only within limited shift severity ranges, failing to generalize across diverse corruption levels. This motivates a fundamental question: can we adaptively balance both strategies based on measured feature-level domain shift?We propose CD-Buffer, a novel complementary dual-buffer framework where subtractive and additive mechanisms operate in opposite yet coordinated directions driven by a unified discrepancy metric. Our key innovation lies in the discrepancy-driven coupling: Our framework couples removal and refinement through a unified discrepancy metric, automatically balancing both strategies based on feature-level shift severity. This establishes automatic channel-wise balancing that adapts differentiated treatment to heterogeneous shift magnitudes without manual tuning. Extensive experiments on KITTI, Cityscapes, and ACDC datasets demonstrate state-of-the-art performance, consistently achieving superior results across diverse weather conditions and severity levels.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36889", "url": null, "sourceid": 40176, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36894, "uid": "efe2d4536fbb724f90ef5135b2899251", "name": "DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models", "authors": [{"id": 179927, "fullname": "Yangfu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/179927?format=json", "institution": "East China Normal University"}, {"id": 186135, "fullname": "Hongjian Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186135?format=json", "institution": "East China Normal University"}, {"id": 186136, "fullname": "Jiawei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186136?format=json", "institution": "East China Normal University"}, {"id": 184104, "fullname": "YUNING GONG", "url": "http://cvpr.thecvf.com/api/miniconf/users/184104?format=json", "institution": "National University of Singapore; Sichuan University; Shanghai Artificial Intelligence Laboratory"}, {"id": 186137, "fullname": "Qi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186137?format=json", "institution": "East China Normal University"}, {"id": 186138, "fullname": "Yue Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186138?format=json", "institution": "East China Normal University"}], "abstract": "Humans can robustly localize visual evidence and provide grounded answers even in noisy environments by identifying critical cues and then relating them to the full context in a bottom-up manner. Inspired by this, we propose DeepScan, a training-free framework that combines Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning for visually grounded reasoning in Large Vision\u2013Language Models (LVLMs). Unlike existing methods that pursue one-shot localization of complete evidence, Hierarchical Scanning performs local cue exploration and multi-scale evidence extraction to recover evidence in a bottom-up manner, effectively mitigating the impacts of distractive context. Refocusing then optimizes the localized evidence view through collaboration of LVLMs and visual experts. Finally, Evidence-Enhanced Reasoning aggregates multi-granular views via a hybrid evidence memory and yields accurate and interpretable answers. Experimental results demonstrate that DeepScan significantly boosts LVLMs in diverse visual tasks, especially in fine-grained visual understanding. It achieves 90.6\\% overall accuracy on V* when integrated with Qwen2.5-VL-7B. Moreover, DeepScan provides consistent improvements for LVLMs across various architectures and model scales without additional adaptation cost. The code will be open-source soon.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36894", "url": "https://github.com/YChenL/DeepScan", "sourceid": 32813, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36898, "uid": "b147bd07b59bf6b845e2e4238bc73dbb", "name": "MR-RAG: Multimodal Relevance-Aware Retrieval-Augmented Generation for Medical Visual Question Answering", "authors": [{"id": 180542, "fullname": "Xuze Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180542?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 85300, "fullname": "Haozhao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85300?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 186144, "fullname": "Zhenyu Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186144?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 186145, "fullname": "Zhongxu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186145?format=json", "institution": "Alibaba Group; Huazhong University of Science and Technology"}, {"id": 186146, "fullname": "Zhang Jinghua", "url": "http://cvpr.thecvf.com/api/miniconf/users/186146?format=json", "institution": null}, {"id": 86547, "fullname": "Ruixuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86547?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Large Vision Language Models (LVLMs) with retrieval-augmented generation (RAG) are emerging as a main paradigm for processing vision-language medical tasks due to their promising achievements. However, existing approaches exhibit two significant limitations in the two key stages\u2014retrieval and generation. First, during the retrieval stage, most methods typically rely on a single similarity signal to estimate document relevance, ignoring the rich information available in multimodal data, which may fail to accurately retrieve matching content. Second, in the generation stage, retrieved documents are integrated directly and uniformly into the input for LVLMs, without taking into account their varying relevance to the question, which may result in the dilution of crucial information and exacerbate the negative impact of irrelevant content. To address these limitations, we propose MR-RAG, a dual-stage RAG enhancement framework by considering multimodal relevance in both retrieval and generation phases. Specifically, we first introduce a Multimodal Cooperative Retrieval (MCR) module that leverages both intra-modal and cross-modal signals to jointly retrieve semantically aligned documents. Then, we design an Importance-Aware Information Flow Augmentation (IFA) mechanism that augments attention paths based on the fused multimodal relevance, enabling more precise control over the information flow during answer generation. By coherently bridging retrieval and generation via multimodal signals, our method significantly enhances factual accuracy and robustness. Experiments on three medical datasets demonstrate that our method outperforms state-of-the-art baselines, achieving up to 6.4% accuracy improvement.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36898", "url": null, "sourceid": 42516, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36902, "uid": "d82cc168f5c589abedc86f5bea6ac5e0", "name": "HamiPose: Hamiltonian Optimization for Unsupervised Domain Adaptive Pose Estimation", "authors": [{"id": 181083, "fullname": "Jiawen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181083?format=json", "institution": "Chongqing Academy of Science and Technology"}, {"id": 186159, "fullname": "Fei Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186159?format=json", "institution": "East China Normal University"}, {"id": 156245, "fullname": "Dandan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156245?format=json", "institution": "East China Normal University"}, {"id": 90438, "fullname": "Aimin Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/90438?format=json", "institution": "East China Normal University, Tsinghua University"}], "abstract": "Unsupervised domain adaptation (UDA) for pose estimation promises transfer from synthetic to real domains but often suffers instability under domain shift. Prior work attributes this deterioration to gradient interference between source supervision and target consistency. This conflict is distinct in pose estimation, where sparse and heterogeneous supervision signals cause gradients to be highly sensitive to small localization errors and lead to unstable updates. To address these challenges, we propose HamiPose, a Hamiltonian optimization framework that transports decoupled and confidence-calibrated gradients within a unified geometry to mitigate instability. HamiPose first refines gradient interaction through keypointwise geometry decomposition, orthogonally projecting target gradients to preserve nonconflicting component. Channelwise gated alignment then calibrates the parallel component with confidence and alignment, producing decoupled, confidence-calibrated gradients. These gradients are advanced by a Hamiltonian optimizer with a symplectic integrator, providing controlled momentum that stabilizes updates. Extensive experiments demonstrate that HamiPose achieves state-of-the-art performance in UDA pose estimation while maintains strong performance under domain generalization settings.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36902", "url": null, "sourceid": 32933, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36904, "uid": "d28bab21a85b3f51bb54f9f78c9b01ed", "name": "Score2Instruct: Scaling Up Video Quality-Centric Instructions via Automated Dimension Scoring", "authors": [{"id": 101001, "fullname": "Qizhi Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/101001?format=json", "institution": "Tsinghua University"}, {"id": 89500, "fullname": "Kun Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/89500?format=json", "institution": "Kuaishou Technology"}, {"id": 101948, "fullname": "Yunpeng Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/101948?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 130119, "fullname": "Jiachao Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/130119?format=json", "institution": "Beijing Kuaishou "}, {"id": 186164, "fullname": "Mingda Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186164?format=json", "institution": "Kuaishou- \u5feb\u624b\u79d1\u6280"}, {"id": 89506, "fullname": "Ming Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/89506?format=json", "institution": "Kuaishou Tech"}, {"id": 127516, "fullname": "Chao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/127516?format=json", "institution": "kuaishou"}, {"id": 186165, "fullname": "Jihong Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186165?format=json", "institution": "Tsinghua University"}], "abstract": "Classical video quality assessment (VQA) methods generate a numerical score to judge a video's perceived visual fidelity and clarity. Yet, a score fails to describe the video's complex quality dimensions (e.g., noise), restricting its applicability. Benefiting from the human-friendly linguistic output, adapting video large multimodal models (LMMs) to VQA via instruction tuning has the potential to address this issue. The core of the approach lies in the video quality-centric instruction data. Previous explorations mainly focus on the image domain, and their data generation processes heavily rely on human quality annotations and proprietary systems (e.g., GPT-4), limiting data scalability and effectiveness. To address these challenges, we propose the Score-based Instruction Generation (SIG) pipeline. Specifically, SIG first scores multiple quality dimensions of an unlabeled video and maps scores to text-defined levels. It then explicitly incorporates a hierarchical Chain-of-Thought (CoT) to model the correlation between specific dimensions and overall quality, mimicking the human visual system's (HVS) reasoning process. The automated pipeline eliminates the reliance on expert-written quality descriptions and proprietary systems, ensuring data scalability and generation efficiency. To this end, the resulting Score2Instruct (S2I) dataset contains over 320K diverse instruction-response pairs, laying the basis for instruction tuning. Moreover, to advance video LMMs' quality scoring and justification abilities simultaneously, we devise a progressive tuning strategy to fully unleash the power of S2I. Built upon SIG, we further curate a benchmark termed S2I-Bench with 400 open-ended questions to better evaluate the quality justification capacity of video LMMs. Experimental results on the S2I-Bench and existing benchmarks indicate that our method consistently improves quality scoring and justification capabilities across multiple video LMMs. The code and dataset will be publicly available for future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36904", "url": null, "sourceid": 36272, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36907, "uid": "99a56ec8f7a32788295776041bac69b8", "name": "The Missing Point in Vision Transformers for Universal Image Segmentation", "authors": [{"id": 186179, "fullname": "Sajjad Shahabodini", "url": "http://cvpr.thecvf.com/api/miniconf/users/186179?format=json", "institution": "Concordia University"}, {"id": 186180, "fullname": "Mobina Mansoori", "url": "http://cvpr.thecvf.com/api/miniconf/users/186180?format=json", "institution": "Concordia University"}, {"id": 186181, "fullname": "Farnoush Bayatmakou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186181?format=json", "institution": "Concordia University"}, {"id": 186182, "fullname": "Jamshid Abouei", "url": "http://cvpr.thecvf.com/api/miniconf/users/186182?format=json", "institution": "Yazd University"}, {"id": 157156, "fullname": "Konstantinos N. Plataniotis", "url": "http://cvpr.thecvf.com/api/miniconf/users/157156?format=json", "institution": "University of Toronto"}, {"id": 186183, "fullname": "Arash Mohammadi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186183?format=json", "institution": "Concordia University"}], "abstract": "Image segmentation remains a challenging task in computer vision, demanding robust mask generation and precise classification. Recent mask-based approaches yield high-quality masks by capturing global context. However, accurately classifying these masks, especially in the presence of ambiguous boundaries and imbalanced class distributions, remains an open challenge. In this work, we introduce ViT-P, a novel two-stage segmentation framework that decouples mask generation from classification. The first stage employs a proposal generator to produce class-agnostic mask proposals, while the second stage utilizes a point-based classification model built on the Vision Transformer (ViT) to refine predictions. ViT-P serves as a pre-training-free adapter, allowing the integration of various pre-trained vision transformers without modifying their architecture, ensuring adaptability to dense prediction tasks. Furthermore, we demonstrate that coarse and bounding box annotations can effectively enhance classification without requiring additional training on fine annotation datasets, reducing annotation costs while maintaining strong performance. Extensive experiments across COCO, ADE20K, and Cityscapes datasets validate the effectiveness of ViT-P, achieving state-of-the-art results with 54.0 PQ on ADE20K panoptic segmentation, 87.4 mIoU on Cityscapes semantic segmentation, and 63.6 mIoU on ADE20K semantic segmentation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36907", "url": null, "sourceid": 41693, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36911, "uid": "ef44470a9aec6c5a2086afebd72c4192", "name": "Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration", "authors": [{"id": 186196, "fullname": "Amirhossein Kazerouni", "url": "http://cvpr.thecvf.com/api/miniconf/users/186196?format=json", "institution": "Samsung; University of Toronto"}, {"id": 186197, "fullname": "Maitreya Suin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186197?format=json", "institution": "Samsung"}, {"id": 156296, "fullname": "Tristan T Aumentado-Armstrong", "url": "http://cvpr.thecvf.com/api/miniconf/users/156296?format=json", "institution": "Samsung"}, {"id": 186198, "fullname": "Sina Honari", "url": "http://cvpr.thecvf.com/api/miniconf/users/186198?format=json", "institution": "Samsung"}, {"id": 91301, "fullname": "Amanpreet Walia", "url": "http://cvpr.thecvf.com/api/miniconf/users/91301?format=json", "institution": "Samsung Research America"}, {"id": 91653, "fullname": "Iqbal Mohomed", "url": "http://cvpr.thecvf.com/api/miniconf/users/91653?format=json", "institution": "Samsung AI Centre Toronto"}, {"id": 69250, "fullname": "Kosta Derpanis", "url": "http://cvpr.thecvf.com/api/miniconf/users/69250?format=json", "institution": "York University"}, {"id": 73714, "fullname": "Babak TAATI", "url": "http://cvpr.thecvf.com/api/miniconf/users/73714?format=json", "institution": "Kite Research Institute, Toronto Rehab, University Health Network; University of Toronto"}, {"id": 76844, "fullname": "Alex Levinshtein", "url": "http://cvpr.thecvf.com/api/miniconf/users/76844?format=json", "institution": "Samsung"}], "abstract": "Recent advances in image restoration have enabled high-fidelity recovery of faces from degraded inputs using reference-based face restoration models (Ref-FR). However, such methods focus solely on facial regions, neglecting degradation across the full scene, including body and background, which limits practical usability. Meanwhile, full-scene restorers often ignore degradation cues entirely, leading to underdetermined predictions and visual artifacts. In this work, we propose **Face2Scene**, a two-stage restoration framework that leverages the face as a perceptual oracle to estimate degradation and guide the restoration of the entire image. Given a degraded image and one or more identity references, we first apply a Ref-FR model to reconstruct high-quality facial details. From the restored\u2013degraded face pair, we extract a face-derived degradation code that captures degradation attributes (e.g., noise, blur, compression), which is then transformed into multi-scale degradation-aware tokens. These tokens condition a diffusion model to restore the full scene in a single step, including the body and background. Extensive experiments demonstrate the superior effectiveness of the proposed method compared to state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36911", "url": null, "sourceid": 36950, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36913, "uid": "4c28db6cc0ca101caa574083cf238466", "name": "Towards Robust Multimodal Large Language Models Against Jailbreak Attacks", "authors": [{"id": 182380, "fullname": "ZIYI YIN", "url": "http://cvpr.thecvf.com/api/miniconf/users/182380?format=json", "institution": "Penn State University"}, {"id": 186205, "fullname": "Yuanpu Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186205?format=json", "institution": "Pennsylvania State University"}, {"id": 186206, "fullname": "Han Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186206?format=json", "institution": "Dalian University of Technology"}, {"id": 186207, "fullname": "Ting Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186207?format=json", "institution": "State University of New York at Stony Brook"}, {"id": 139157, "fullname": "Jinghui Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/139157?format=json", "institution": "Penn State University"}, {"id": 186208, "fullname": "Fenglong Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/186208?format=json", "institution": "Pennsylvania State University"}], "abstract": "While multimodal large language models (MLLMs) have achieved remarkable success in recent advancements, their susceptibility to jailbreak attacks has come to light. In such attacks, adversaries exploit carefully crafted prompts to coerce models into generating harmful or undesirable content. Existing defense mechanisms often rely on external inference steps or safety alignment training, both of which are less effective and impractical when facing sophisticated adversarial perturbations in white-box scenarios. To address these challenges and bolster MLLM robustness, we introduce SAFEMLLM by adopting an adversarial training framework that alternates between an attack step for generating adversarial noise and a model updating step. At the attack step, SAFEMLLM generates adversarial perturbations through a newly proposed contrastive embedding attack (CoE-Attack), which optimizes token embeddings under a contrastive objective. SAFEMLLM then updates model parameters to neutralize the perturbation effects while preserving model utility on benign inputs. We evaluate SAFEMLLM across multiple MLLMs and six jailbreak methods spanning multiple modalities. Experimental results show that SAFEMLLM effectively defends against diverse attacks, maintaining robust performance and utilities.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36913", "url": null, "sourceid": 37713, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36918, "uid": "b2ec3778d61b7fbba65a900aebc41c1a", "name": "Dual-Level Confidence based Implicit Self-Refinement for Medical Visual Question Answering", "authors": [{"id": 177294, "fullname": "Meihong Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/177294?format=json", "institution": "Westlake University"}, {"id": 158689, "fullname": "Yefeng Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/158689?format=json", "institution": "Westlake University"}], "abstract": "Medical Visual Question Answering models often face potential train-test distribution shifts that hinder generalization across unseen imaging and linguistic patterns. To address this challenge, we propose a dual-level confidence based framework (DuCoR) that achieves implicit self-refinement through iterative pseudo-supervised optimization. Instead of relying on fixed pseudo answers, the model progressively refines its predictions by estimating their reliability from two complementary perspectives. A loss-level confidence captures the reliability of supervision by modeling clean and noisy loss distributions, while a feature-level confidence measures the semantic coherence between sample representations and their pseudo-answer conditioned prototypes. Since these two confidences originate from distinct information sources, including the supervision signal and the input semantics, they provide mutually corrective cues. They are adaptively fused to derive per-sample reliability weights that guide pseudo-supervised optimization toward better alignment with the target distribution. Extensive experiments on multiple Med-VQA benchmarks show that our method achieves superior performance and exhibits improved cross-domain generalization over fully supervised baseline.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36918", "url": null, "sourceid": 41896, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36920, "uid": "275451c8e61e2f192cd7c00e36f80a42", "name": "Revisiting Pose Sensitivity in Splat-based Computed Tomography under Sparse-view Reconstruction", "authors": [{"id": 145095, "fullname": "Kiseok Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/145095?format=json", "institution": "KAIST"}, {"id": 183739, "fullname": "Hyeongjun Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/183739?format=json", "institution": "KAIST"}, {"id": 90841, "fullname": "Inchul Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/90841?format=json", "institution": "KAIST"}, {"id": 76517, "fullname": "Min H. Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/76517?format=json", "institution": "KAIST"}], "abstract": "X-ray computed tomography (CT) reconstructs volumetric representations of objects from projection images obtained by transmitting X-rays through a target. Recent splat-based tomography, which represents a volume as a continuous distribution of 3D Gaussians, has demonstrated both high reconstruction quality and fast convergence in cone-beam sparse-view CT. However, when deployed in real CT systems with limited and non-uniform view distributions, we observe distinctive streak and strip artifacts that are far more pronounced than in conventional reconstruction methods. Through detailed analysis, we show that these artifacts primarily originate from pose inaccuracies in the acquisition geometry rather than from view sparsity itself. We revisit pose sensitivity in the splatting formulation and derive a stable gradient-based framework that jointly refines geometric parameters during reconstruction. Our study not only identifies how pose perturbations propagate through the differentiable projection operator but also reveals why splat-based CT is particularly vulnerable to geometric misalignment. The resulting formulation remains lightweight and easily integrable into existing pipelines while substantially improving reconstruction fidelity under real-world sparse-view conditions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36920", "url": null, "sourceid": 31117, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36921, "uid": "d1d8b9bbeefe2d37fc12e8aa4abf7e86", "name": "SAG-GNN: Semantic-Aware Guided GNN for Descriptor-Free 2D-3D Matching", "authors": [{"id": 129134, "fullname": "Shihua Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129134?format=json", "institution": "Wuhan University"}, {"id": 186224, "fullname": "Tianhao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186224?format=json", "institution": null}, {"id": 129146, "fullname": "Zizhuo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129146?format=json", "institution": "Wuhan University"}, {"id": 102504, "fullname": "Qing Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/102504?format=json", "institution": "Harbin Institute of Technology"}, {"id": 86222, "fullname": "Jiayi Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/86222?format=json", "institution": "Wuhan University"}], "abstract": "Image-to-point cloud matching (2D-3D matching) establishes accurate correspondences between image keypoints and 3D points for 6-DoF camera pose estimation. Existing methods either suffer from poor generalization due to scene-specific coordinate regression requiring per-scene retraining, or incur high storage and maintenance costs from descriptor-based matching that relies on large descriptor sets. Consequently, descriptor-free approaches have gained attention by avoiding heavy storage while improving generalizability; however, most rely only on low-level geometric cues, which limits performance. Leveraging the benefits of semantics in providing context, resolving ambiguities, and enhancing robustness in challenging scenes, we propose the Semantic-Aware Guided Graph Neural Network (SAG-GNN), integrating high-level semantics into descriptor-free 2D-3D matching. Specifically, we design a compact semantic extraction scheme encoding each 3D point as a low-dimensional semantic probability distribution, offering effective guidance with minimal storage. A bidirectionally-aligned fusion block merges geometric features with semantic context for more unified and consistent representations. Additionally, semantic priors guide the 2D-3D information exchange within the interaction framework from a high-level semantic perspective. Extensive indoor and outdoor experiments validate that SAG-GNN achieves state-of-the-art results in descriptor-free 2D-3D matching and visual localization, with low storage and strong generalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36921", "url": null, "sourceid": 33057, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36923, "uid": "062eb81d4674705d10c8ecb848358cb3", "name": "Towards Robust Vision Transformers: Path Dependency Analysis and a Simple Two-Stage Adversarial Training", "authors": [{"id": 147325, "fullname": "Seongmin Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/147325?format=json", "institution": "Inha University"}, {"id": 156437, "fullname": "Byung Cheol Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/156437?format=json", "institution": "Inha University"}], "abstract": "The Vision Transformer (ViT) has surpassed Convolutional Neural Networks (CNNs) in performance, becoming the de facto architecture in modern computer vision. However, despite its superior representational capacity, research on the adversarial robustness of ViTs remains limited, with most studies still biased toward CNN-based models. This work aims to address this architectural bias and conduct an in-depth analysis of the interaction between ViTs and adversarial training (AT).We first show that ViTs can identify semantic components of objects through their class attention maps, indicating that adversarially trained ViTs inherently encode strong semantic priors. Next, using the proposed Gradient Path Masking (GPM) analysis, we examine the internal information flow of ViTs and verify that the residual path serves as a major bottleneck that provides advantageous information to adversaries. Furthermore, our inter-patch relation analysis reveals that adversarially trained ViTs tend to rely more on global than local relationships in early layers\u2014a novel observation suggesting a %fundamentalpotential incompatibility between ViTs and hybrid architectures that inject CNN-style inductive biases.Building upon these findings, we design a simple yet effective two-stage AT scheme to mitigate this structural incompatibility, achieving simultaneous improvements in robustness and generalization across various ViT variants and training methods. The proposed method is compatible with a wide range of AT frameworks and models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36923", "url": null, "sourceid": 44180, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36929, "uid": "bc2a6d0560ca625e58ccb534b54a3435", "name": "Direction-aware 3D Large Multimodal Models", "authors": [{"id": 180935, "fullname": "QUAN LIU", "url": "http://cvpr.thecvf.com/api/miniconf/users/180935?format=json", "institution": "Nanyang Technological University"}, {"id": 90462, "fullname": "Weihao Xuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/90462?format=json", "institution": "Waseda University"}, {"id": 186241, "fullname": "Junjue Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186241?format=json", "institution": "The University of Tokyo"}, {"id": 152106, "fullname": "Naoto Yokoya", "url": "http://cvpr.thecvf.com/api/miniconf/users/152106?format=json", "institution": "The University of Tokyo"}, {"id": 87230, "fullname": "Ling Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/87230?format=json", "institution": "Inception Institute of Artificial Intelligence"}, {"id": 87301, "fullname": "Shijian Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87301?format=json", "institution": "Nanyang Technological University"}], "abstract": "3D large multimodal models (3D LMMs) rely heavily on ego poses for enabling directional question-answering and spatial reasoning. However, most existing point cloud benchmarks contain rich directional queries but lack the corresponding ego poses, making them inherently ill-posed in 3D large multimodal modelling. In this work, we redefine a new and rigorous paradigm that enables direction-aware 3D LMMs by identifying and supplementing ego poses into point cloud benchmarks and transforming the corresponding point cloud data according to the identified ego poses. We enable direction-aware 3D LMM with two novel designs. The first is PoseRecover, a fully automatic pose recovery pipeline that matches questions with ego poses from RGB-D video extrinsics via object\u2013frustum intersection and visibility check with Z-buffers. The second is PoseAlign that transforms the point cloud data to be aligned with the identified ego poses instead of either injecting ego poses into textual prompts or introducing pose-encoded features in the projection layers. Extensive experiments show that our designs yield consistent improvements across multiple 3D LMM backbones such as LL3DA, LL3DA-SONATA, Chat-Scene, and 3D-LLAVA, improving ScanRefer mIoU by 30.0% and Scan2Cap LLM-as-judge accuracy by 11.7%. In addition, our approach is simple, generic, and training-efficient, requiring only instruction tuning while establishing a strong baseline for direction-aware 3D-LMMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36929", "url": null, "sourceid": 31148, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36925, "uid": "dcdfdb7d3c7c0ae743b6a79eb5dd70c0", "name": "X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection", "authors": [{"id": 180517, "fullname": "Youngseo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/180517?format=json", "institution": "KAIST"}, {"id": 127075, "fullname": "Kwan Yun", "url": "http://cvpr.thecvf.com/api/miniconf/users/127075?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 155842, "fullname": "Seokhyeon Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/155842?format=json", "institution": "KAIST, Visual Media Lab"}, {"id": 102258, "fullname": "Sihun Cha", "url": "http://cvpr.thecvf.com/api/miniconf/users/102258?format=json", "institution": "Korea Advanced Institute of Science and Technology"}, {"id": 186233, "fullname": "Colette Koo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186233?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 127106, "fullname": "Junyong Noh", "url": "http://cvpr.thecvf.com/api/miniconf/users/127106?format=json", "institution": "Korea Advanced Institute of Science and Technology"}], "abstract": "The surge of highly realistic synthetic videos produced by contemporary generative systems has significantly increased the risk of malicious use, challenging both humans and existing detectors. Against this backdrop, we take a generator-side view and observe that internal cross-attention mechanisms in these models encode fine-grained speech\u2013motion alignment, offering useful correspondence cues for forgery detection. Building on this insight, we propose $\\textbf{X-AVDT}$, a robust and generalizable deepfake detector that probes generator-internal audio-visual signals accessed via DDIM inversion to expose these cues. X-AVDT extracts two complementary signals: (i) a video composite capturing inversion-induced discrepancies, and (ii) audio\u2013visual cross-attention feature reflecting modality alignment enforced during generation. To enable faithful, cross-generator evaluation, we further introduce $\\textbf{MMDF}$, a new multi-modal deepfake dataset spanning diverse manipulation types and rapidly evolving synthesis paradigms, including GANs, diffusion, and flow-matching. Extensive experiments demonstrate that X-AVDT achieves leading performance on MMDF and generalizes strongly to external benchmarks and unseen generators, outperforming existing methods with accuracy improved by  $\\textbf{+13.1}$%. Our findings highlight the importance of leveraging internal audio\u2013visual consistency cues for robustness to future generators in deepfake detection. The code and dataset will soon be available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36925", "url": null, "sourceid": 30719, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36931, "uid": "41560f2d678088aba524f1e8b04dc4a3", "name": "RealBirdID: Benchmarking Bird Species Identification in the Era of MLLMs", "authors": [{"id": 186245, "fullname": "Logan Lawrence", "url": "http://cvpr.thecvf.com/api/miniconf/users/186245?format=json", "institution": "University of Massachusetts at Amherst"}, {"id": 131800, "fullname": "Oindrila Saha", "url": "http://cvpr.thecvf.com/api/miniconf/users/131800?format=json", "institution": "University of Massachusetts at Amherst"}, {"id": 127812, "fullname": "Rangel Daroya", "url": "http://cvpr.thecvf.com/api/miniconf/users/127812?format=json", "institution": "University of Massachusetts Amherst"}, {"id": 133987, "fullname": "Mustafa Chasmai", "url": "http://cvpr.thecvf.com/api/miniconf/users/133987?format=json", "institution": "University of Massachusetts at Amherst"}, {"id": 186246, "fullname": "Wuao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186246?format=json", "institution": "University of Massachusetts Amherst"}, {"id": 186247, "fullname": "Max Hamilton", "url": "http://cvpr.thecvf.com/api/miniconf/users/186247?format=json", "institution": "University of Massachusetts at Amherst"}, {"id": 127799, "fullname": "Aaron Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/127799?format=json", "institution": "University of Massachusetts Amherst"}, {"id": 186248, "fullname": "Seoyun Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186248?format=json", "institution": "University of Massachusetts at Amherst"}, {"id": 147185, "fullname": "Fabien Delattre", "url": "http://cvpr.thecvf.com/api/miniconf/users/147185?format=json", "institution": "University of Massachusetts, Amherst"}, {"id": 75679, "fullname": "Subhransu Maji", "url": "http://cvpr.thecvf.com/api/miniconf/users/75679?format=json", "institution": "University of Massachusetts, Amherst"}, {"id": 131793, "fullname": "Grant Horn", "url": "http://cvpr.thecvf.com/api/miniconf/users/131793?format=json", "institution": "University of Massachusetts at Amherst"}], "abstract": "Fine-grained bird species identification in the wild is frequently unanswerable from a single image: key cues may be non-visual (vocalization, range, season), or obscured due to occlusion, camera angle, or low resolution. Yet today\u2019s multimodal systems are typically judged on answerable, in-schema cases, encouraging confident guesses rather than principled abstention. We propose the RealBirdID benchmark: given an image of a bird, a system should either answer with a species or abstain with a concrete, evidence-based rational (e.g., \u201crequires vocalization,\u201d \u201cout of range,\u201d \u201cview obstructed\u201d). For each genus, the dataset includes a validation split composed of curated unanswerable examples with labeled rationales, paired with a companion set of clearly answerable instances. We find that (1) the species identification on the answerable set is challenging for a variety of open-source and proprietary models ($\\leq 17\\%$ accuracy including GPT-5 and Gemini-2.5 Pro), (2) models with greater classification ability are not necessarily more calibrated to abstain from unanswerable examples, and (3) that MLLMs generally fail at providing correct reasons even when they do abstain. RealBirdID establishes a focused target for abstention-aware fine-grained recognition and a recipe for measuring progress.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36931", "url": null, "sourceid": 35292, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36934, "uid": "23a28e6d57e4c5d8eb0bff70ae01ed09", "name": "FB-CLIP: Fine-Grained Zero-Shot Anomaly Detection with Foreground-Background Disentanglement", "authors": [{"id": 145924, "fullname": "Ming Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/145924?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 186261, "fullname": "Yongsheng Huo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186261?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 167648, "fullname": "Mingyu Dou", "url": "http://cvpr.thecvf.com/api/miniconf/users/167648?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 148813, "fullname": "Jianfu Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/148813?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 186262, "fullname": "Peng Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186262?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 186263, "fullname": "Yao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186263?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 186264, "fullname": "Cong Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186264?format=json", "institution": "Wuhan University"}, {"id": 154895, "fullname": "Bingliang Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154895?format=json", "institution": "Chinese Academy of Sciences"}, {"id": 154896, "fullname": "Quan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154896?format=json", "institution": "Xi&#x27;an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences"}], "abstract": "Occlusion occurs when one object partially or fully blocks another in a scene, making it difficult for an occlusion machine vision system to detect or track objects accurately. In zero-shot anomaly detection (ZSAD), the system needs to detect unseen defects without relying on labeled anomalous samples, which is critical for applications such as industrial inspection and medical imaging. However, normal features in images often occlude anomalous features, leading to coarse localization and limited discriminability.  To address this challenge, we propose **FB-CLIP**, which enhances foreground features while suppressing irrelevant background interference to improve anomaly detection performance. Unlike existing CLIP-based methods that typically rely on a single textual feature, FB-CLIP introduces **Multi-Strategy Text Feature Fusion (MSTFF)**, combining End-of-Text, global pooling, and attention-weighted features to generate rich, task-aware text embeddings. Furthermore, FB-CLIP employs **Multi-View Foreground-Background Enhancement (MVFBE)**, **Background Suppression (BS)**, and **Semantic Consistency Regularization (SCR)** to achieve foreground reinforcement, background interference mitigation, and reliable visual-text alignment, respectively.  Experiments on multiple public industrial and medical datasets show that FB-CLIP effectively captures fine-grained anomalies and outperforms existing zero-shot methods. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36934", "url": null, "sourceid": 39954, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36936, "uid": "a14ad032813ed4d8c564f5fdb3ba8eb8", "name": "FISHuman: Fine-grained Single-image 3D Human Reconstruction via Multi-view 4D Remeshing", "authors": [{"id": 130830, "fullname": "Hanxi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130830?format=json", "institution": "Peking University"}, {"id": 70754, "fullname": "Yifang Men", "url": "http://cvpr.thecvf.com/api/miniconf/users/70754?format=json", "institution": "Alibaba Group"}, {"id": 88921, "fullname": "Zhouhui Lian", "url": "http://cvpr.thecvf.com/api/miniconf/users/88921?format=json", "institution": "Peking University"}], "abstract": "Single-image 3D human reconstruction holds significant promise due to its convenience and high demand in various applications. Previous methods have garnered tremendous progress by employing 2D multi-view diffusion models to generate auxiliary views as reconstruction priors, but they struggle with 3D inconsistencies and limited generalization capabilities. In this paper, we present FISHuman, which aims to generate fine-grained, high-fidelity, and content-wise diverse 3D humans from a single-view input, providing production-ready 3D assets. We propose an elaborately designed workflow that reconstructs dynamic 3D meshes from multi-view inconsistent guidance. Specifically, we adapt a dual-stream transformer-based video diffusion model to generate cross-modally aligned multi-view RGB and normal sequences. We find that naively employing static 3D reconstruction can lead to geometric distortions and texture blurriness, due to the lack of 3D awareness within the generated frames. To address this, we introduce a novel 4D remeshing module that explicitly disentangles the learning of the globally shared canonical mesh and transient variations by tracking per-vertex deformations under different viewpoints. The topological consistency of the deformed meshes inherently enables the optimization of a unified UV representation that effectively integrates appearance attributes across frames. Both qualitative and quantitative experimental results demonstrate the superiority of our method over prior works in terms of appearance realism, geometric fineness, and generalization diversity.  We also showcase the applicability of our reconstructed avatars for downstream applications including animation and 3D editing.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36936", "url": null, "sourceid": 35010, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36938, "uid": "c6996dd5e0f7a6af62c2a503ab1961cc", "name": "SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation", "authors": [{"id": 144201, "fullname": "Phuc Pham", "url": "http://cvpr.thecvf.com/api/miniconf/users/144201?format=json", "institution": "University of Science, VNU-HCM"}, {"id": 186271, "fullname": "Uy Tran", "url": "http://cvpr.thecvf.com/api/miniconf/users/186271?format=json", "institution": "Qualcomm Inc"}, {"id": 88080, "fullname": "Binh-Son Hua", "url": "http://cvpr.thecvf.com/api/miniconf/users/88080?format=json", "institution": "Trinity College Dublin"}, {"id": 76220, "fullname": "Phong Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76220?format=json", "institution": "VinAI"}], "abstract": "Realistic and efficient 3D garment generation remains a longstanding challenge in computer vision and digital fashion. Existing methods typically rely on large vision-language models to produce serialized representations of 2D sewing patterns, which are then transformed into simulation-ready 3D meshes using garment modeling framework such as GarmentCode. Although these approaches yield high-quality results, they often suffer from slow inference times, ranging from 30 seconds to a minute. In this work, we introduce SwiftTailor, a novel two-stage framework that unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation. SwiftTailor comprises two lightweight modules: PatternMaker, an efficient vision-language model that predicts sewing patterns from diverse input modalities, and GarmentSewer, an efficient dense prediction transformer that converts these patterns into a novel Garment Geometry Image, encoding the 3D surface of all garment panels in a unified UV space. The final 3D mesh is reconstructed via an efficient inverse mapping process, incorporating remeshing and dynamic stitching algorithms, thereby eliminating the need for physical re-simulation. Extensive experiments on the Multimodal GarmentCodeData benchmark demonstrate that SwiftTailor achieves state-of-the-art accuracy and visual fidelity while significantly reducing inference time. This work offers a scalable, interpretable, and high-performance solution for next-generation 3D garment generation.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36938", "url": null, "sourceid": 36011, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36942, "uid": "47db93bccb52a63d5da2e2f7460173d4", "name": "Elastic Weight Consolidation Done Right for Continual Learning", "authors": [{"id": 144486, "fullname": "Xuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/144486?format=json", "institution": "Sun Yat-sen University"}, {"id": 69926, "fullname": "Xiaobin Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/69926?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Weight regularization methods in continual learning (CL) alleviate catastrophic forgetting by assessing and penalizing changes to important model weights. Elastic Weight Consolidation (EWC) is a foundational and widely used approach within this framework that estimates weight importance based on gradients.However, it has consistently shown suboptimal performance.In this paper, we conduct a systematic analysis of importance estimation in EWC from a gradient-based perspective.For the first time, we find that EWC\u2019s reliance on the Fisher Information Matrix (FIM) results in gradient vanishing and inaccurate importance estimation in certain scenarios.Our analysis also reveals that Memory Aware Synapses (MAS), a variant of EWC, imposes unnecessary constraints on parameters irrelevant to prior tasks, termed the redundant protection.Consequently, both EWC and its variant exhibit fundamental misalignments in estimating the importance of weights, leading to inferior performance.To tackle these issues, we propose the Logits Reversal (LR) operation, a simple yet effective modification that rectifies the importance estimation of EWC.Specifically, reversing the logit values during the calculation of the FIM can effectively prevent both the gradient vanishing and the redundant protection.Extensive experiments across various CL tasks and datasets show that the proposed method significantly outperforms existing EWC and its variants. Therefore, we refer to it as EWC Done Right (EWC-DR).", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36942", "url": null, "sourceid": 38589, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36944, "uid": "df3f5bc6f8359d51858bb8ffeb11fe98", "name": "OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness", "authors": [{"id": 131871, "fullname": "Phuc Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/131871?format=json", "institution": "VinAI Research"}, {"id": 186276, "fullname": "Anh N Nhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186276?format=json", "institution": "University of Maryland"}, {"id": 150899, "fullname": "Ming Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/150899?format=json", "institution": "University of Maryland, College Park"}], "abstract": "We introduce OpenVO, a novel framework for Open-world Visual Odometry (VO) with temporal awareness under limited input conditions. OpenVO effectively estimates real-world\u2013scale ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras, enabling robust trajectory dataset construction from rare driving events recorded in dashcam.Existing VO methods are trained on fixed observation frequency (e.g., 10Hz or 12Hz), completely overlooking temporal dynamics information. Many prior methods also require calibrated cameras with known intrinsic parameters. Consequently, their performance degrades when (1) deployed under unseen observation frequencies or (2) applied to uncalibrated cameras. These significantly limit their generalizability to many downstream tasks, such as extracting trajectories from dashcam footage.To address these challenges, OpenVO (1) explicitly encodes temporal dynamics information within a two-frame pose regression framework and (2) leverages 3D geometric priors derived from foundation models. We validate our method on three major autonomous-driving benchmarks \u2013 KITTI, nuScenes, and Argoverse 2 \u2013 achieving more than 20% performance improvement over state-of-the-art approaches. Under varying observation rate settings, our method is significantly more robust, achieving 46%\u201392% lower errors across all metrics.These results demonstrate the versatility of OpenVO for real-world 3D reconstruction and diverse downstream applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36944", "url": null, "sourceid": 38161, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36945, "uid": "e6aa1b56ad0ea50842827adfee2b6646", "name": "Exploring Conditions for Diffusion models in Robotic Control", "authors": [{"id": 179969, "fullname": "Heeseong Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/179969?format=json", "institution": "KAIST"}, {"id": 159012, "fullname": "Byeongho Heo", "url": "http://cvpr.thecvf.com/api/miniconf/users/159012?format=json", "institution": "NAVER AI Lab"}, {"id": 90646, "fullname": "Dongyoon Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/90646?format=json", "institution": "NAVER Corp, CLOVA AI."}, {"id": 153109, "fullname": "Seungryong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/153109?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 159013, "fullname": "Taekyung Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/159013?format=json", "institution": "NAVER AI Lab"}], "abstract": "While pre-trained visual representations have significantly advanced imitation learning, they are often task-agnostic as they remain frozen during policy learning. In this work, we explore leveraging pre-trained text-to-image diffusion models to obtain task-adaptive visual representations for robotic control, without fine-tuning the model itself. However, we find that naively applying textual conditions\u2014a successful strategy in other vision domains\u2014yields minimal or even negative gains in control tasks. We attribute this to the domain gap between the diffusion model's training data and robotic control environments, leading us to argue for conditions that consider the specific, dynamic visual information required for control. To this end, we propose ORCA, which introduces learnable task prompts that adapt to the control environment and visual prompts that capture fine-grained, frame-specific details. Through facilitating task-adaptive representations with our newly devised conditions, our approach achieves state-of-the-art performance on various robotic control benchmarks, significantly surpassing prior methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36945", "url": null, "sourceid": 36066, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36950, "uid": "c316b1caa7c585393187d1711e789178", "name": "SkillSight: Efficient First-Person Skill Assessment with Gaze", "authors": [{"id": 182391, "fullname": "Chi Hsuan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182391?format=json", "institution": "The University of Texas at Austin"}, {"id": 76703, "fullname": "Kumar Ashutosh", "url": "http://cvpr.thecvf.com/api/miniconf/users/76703?format=json", "institution": "UT Austin &amp; FAIR, Meta"}, {"id": 69188, "fullname": "Kristen Grauman", "url": "http://cvpr.thecvf.com/api/miniconf/users/69188?format=json", "institution": "University of Texas at Austin"}], "abstract": "Egocentric perception on smart glasses could transform how we learn new skills in the physical world, but automatic skill assessment remains a fundamental technical challenge. We introduce SkillSight for power-efficient skill assessment from first-person data. Central to our approach is the hypothesis that skill level is evident not only in how a person performs an activity (video), but also in how they directtheir attention when doing so (gaze). Our two-stage framework first learns to jointly model gaze and egocentric video when predicting skill level, then distills a gaze-only student model. At inference, the student model requires only gaze input, drastically reducing power consumption by eliminating continuous video processing. Experiments on three datasets spanning cooking, music, and sports establish, for the first time, the valuable role of gaze in skill understanding across diverse real-world settings. Our SkillSight teacher model achieves state-of-the-art performance, while our gaze-only student variant maintains high accuracy using 73\u00d7 less power than competing methods. These results pave the way for in-the-wild AI-supported skill learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36950", "url": null, "sourceid": 44325, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36953, "uid": "c872b22bc2ece6de7deafc6b5982b960", "name": "GeoRK2: Geometry-Guided Runge\u2013Kutta Integration for Diffusion Transformer Acceleration", "authors": [{"id": 186300, "fullname": "Chaoqun Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/186300?format=json", "institution": "Fudan University"}, {"id": 186301, "fullname": "Zongjing Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186301?format=json", "institution": "Fudan University"}, {"id": 186302, "fullname": "Powei Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186302?format=json", "institution": "Bilibili"}, {"id": 182321, "fullname": "Jinpeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182321?format=json", "institution": "Bilibili Inc."}, {"id": 186303, "fullname": "JianXiang Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186303?format=json", "institution": null}, {"id": 186304, "fullname": "Yukang Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186304?format=json", "institution": null}, {"id": 186305, "fullname": "Chenyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186305?format=json", "institution": "Bilibili"}], "abstract": "Diffusion transformer models deliver state-of-the-art image synthesis quality but suffer from prohibitively slow iterative sampling. Fewer sampling steps accelerate inference but inevitably distort intermediate features and degrade visual fidelity, while offering little relief in computational cost. To address these limitations, we present GeoRK2, a training-free framework that bridges numerical analysis and information geometry. GeoRK2 couples second-order Runge\u2013Kutta (RK2) integration with a curvature-aware geometric flow derived from the model's noise predictions, establishing provably stable feature evolution dynamics under manifold-aware integration. By leveraging an empirical feature covariance\u2013induced metric estimated from gradient covariances to capture intrinsic feature geometry and applying parallel transport along the manifold connection, GeoRK2 constrains error propagation under large-step integration, ensuring both numerical stability and structural fidelity. As a fully plug-and-play method, GeoRK2 requires no retraining and is compatible with mainstream pretrained diffusion transformers. Comprehensive experiments on image generation and super-resolution tasks across representative diffusion backbones (e.g., DiT-XL, HunyuanVideo, and FLUX.1-dev) demonstrate that GeoRK2 achieves 4\u20135\u00d7 faster inference than baseline frameworks (FORA, TaylorSeer) with only marginal perceptual differences (\u2206FID \u2248 0.81), confirming its effectiveness and generality. All implementation details and code are provided in the supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36953", "url": null, "sourceid": 31115, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36954, "uid": "2be35d4cea950664a8b8f314b5a177c3", "name": "Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities", "authors": [{"id": 174201, "fullname": "Peibo Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/174201?format=json", "institution": "Shandong University"}, {"id": 186306, "fullname": "Xiaotian Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/186306?format=json", "institution": "The University of Tokyo"}, {"id": 174925, "fullname": "Jinshuo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174925?format=json", "institution": "Shandong University"}, {"id": 175185, "fullname": "zihao wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/175185?format=json", "institution": "Shandong University"}, {"id": 186307, "fullname": "Jinhua liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186307?format=json", "institution": null}, {"id": 186308, "fullname": "Shujun Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186308?format=json", "institution": "Shandong University"}, {"id": 175189, "fullname": "Fangxun Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/175189?format=json", "institution": "Shandong University"}, {"id": 155562, "fullname": "Si Yong Yeo", "url": "http://cvpr.thecvf.com/api/miniconf/users/155562?format=json", "institution": "Nanyang Technological University"}], "abstract": "Multimodal MRI offers complementary information for brain tumor segmentation, but clinical scans often lack one or more modalities, which degrades segmentation performance. In this paper, we propose UniME (Uni-Encoder Meets Multi-Encoders), a two-stage heterogeneous method for brain tumor segmentation with missing modalities that reconciles the trade-offs among fine- grained structure capture, cross-modal complementarity modeling, and effective exploitation of available modalities. The key idea is to decouple representation learning from segmentation via a two-stage heterogeneous architecture. Stage 1 pretrains a single ViT Uni-Encoder with masked image modeling to establish a unified representation robust to missing modalities. Stage 2 adds modality-specific CNN Multi-Encoders to extract high-resolution, multi-scale, fine-grained features. We fuse these features with the global representation to produce precise segmentations. Experiments on BraTS 2023 and BraTS 2024 show that UniME outperforms previous methods for brain tumor segmentation with missing modalities. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36954", "url": null, "sourceid": 44759, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36955, "uid": "c5d1419c9b70973e907db79a55cb339e", "name": "Statistical Characteristic-Guided Denoising for Rapid High-Resolution Transmission Electron Microscopy Imaging", "authors": [{"id": 158832, "fullname": "Hesong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/158832?format=json", "institution": "Beijing Institute of Technology"}, {"id": 158833, "fullname": "Ziqi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/158833?format=json", "institution": "Beijing Institute of Technology"}, {"id": 158834, "fullname": "Ruiwen Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/158834?format=json", "institution": "Beijing Institute of Technology"}, {"id": 73966, "fullname": "Ying Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73966?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "High-Resolution Transmission Electron Microscopy (HRTEM) enables atomic-scale observation of nucleation dynamics, which boosts the studies of advanced solid materials. Nonetheless, due to the millisecond-scale rapid change of nucleation, it requires short-exposure rapid imaging, leading to severe noise that obscures atomic positions. In this work, we propose a statistical characteristic-guided denoising network, which utilizes statistical characteristics to guide the denoising process in both spatial and frequency domains. In the spatial domain, we present spatial deviation-guided weighting to select appropriate convolution operations for each spatial position based on deviation characteristic. In the frequency domain, we present frequency band-guided weighting to enhance signals and suppress noise based on band characteristics. We also develop an HRTEM-specific noise calibration method and generate a dataset with disordered structures and realistic HRTEM image noises. It can ensure the denoising performance of models on real images for nucleation observation. Experiments on synthetic and real data show our method outperforms the state-of-the-art methods in HRTEM image denoising, with effectiveness in the localization downstream task.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36955", "url": null, "sourceid": 40911, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36956, "uid": "9f78e8aa1530b26c85f555017d89e745", "name": "SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting", "authors": [{"id": 180028, "fullname": "Pranav Asthana", "url": "http://cvpr.thecvf.com/api/miniconf/users/180028?format=json", "institution": "University of Maryland"}, {"id": 153558, "fullname": "Alex Hanson", "url": "http://cvpr.thecvf.com/api/miniconf/users/153558?format=json", "institution": "University of Maryland College Park"}, {"id": 153559, "fullname": "Allen Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153559?format=json", "institution": "Department of Computer Science             University of Maryland, College Park"}, {"id": 85114, "fullname": "Tom Goldstein", "url": "http://cvpr.thecvf.com/api/miniconf/users/85114?format=json", "institution": "University of Maryland, College Park"}, {"id": 86740, "fullname": "Matthias Zwicker", "url": "http://cvpr.thecvf.com/api/miniconf/users/86740?format=json", "institution": "University of Maryland, College Park"}, {"id": 186309, "fullname": "Amitabh Varshney", "url": "http://cvpr.thecvf.com/api/miniconf/users/186309?format=json", "institution": "University of Maryland, College Park"}], "abstract": "3D Gaussian Splatting (3DGS) enables high-quality novel view synthesis, motivating interest in generating higher-resolution renders than those available during training. A natural strategy is to apply super-resolution (SR) to low-resolution (LR) input views, but independently enhancing each image introduces multi-view inconsistencies, leading to blurry renders. Prior methods attempt to mitigate these inconsistencies through learned neural components, temporally consistent video priors, or joint optimization on LR and SR views, but all uniformly apply SR across every image. In contrast, our key insight is that close-up LR views may contain high-frequency information for regions also captured in more distant views, and that we can use the camera pose relative to scene geometry to inform where to add SR content. Building from this insight, we propose SplatSuRe, a method that selectively applies SR content only in undersampled regions lacking high-frequency supervision, yielding sharper and more consistent results. Across Tanks \\& Temples, Deep Blending and Mip-NeRF 360, our approach surpasses baselines in both fidelity and perceptual quality. Notably, our gains are most significant in localized foreground regions where higher detail is desired.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36956", "url": null, "sourceid": 34745, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36957, "uid": "0566a7d2b6403531a6b808ca3fff0f12", "name": "Concept Regions Matter: Benchmarking CLIP with a New Cluster-Importance Approach", "authors": [{"id": 154783, "fullname": "Aishwarya Agarwal", "url": "http://cvpr.thecvf.com/api/miniconf/users/154783?format=json", "institution": "International Institute of Information Technology Hyderabad, &amp; Adobe Research"}, {"id": 73469, "fullname": "Srikrishna Karanam", "url": "http://cvpr.thecvf.com/api/miniconf/users/73469?format=json", "institution": "Adobe Research"}, {"id": 152901, "fullname": "Vineet Gandhi", "url": "http://cvpr.thecvf.com/api/miniconf/users/152901?format=json", "institution": "International Institute of Information Technology Hyderabad"}], "abstract": "Contrastive vision\u2013language models (VLMs) such as CLIP achieve strong zero-shot recognition yet remain vulnerable to spurious correlations, particularly background over-reliance. We introduce Cluster-based Concept Importance (CCI), a novel interpretability method that uses CLIP\u2019s own patch embeddings to group spatial patches into semantically coherent clusters, masking them, and evaluating relative changes in model predictions. CCI sets a new state of the art on faithfulness benchmarks, surpassing prior methods by large margins; for example, it yields more than a twofold improvement on the deletion-AUC metric for MS COCO retrieval. We further propose that CCI when combined with GroundedSAM, automatically categorizes predictions as foreground or background-driven, providing a crucial diagnostic ability. Existing benchmarks such as CounterAnimals, however, rely solely on accuracy and implicitly attribute all performance degradation to background correlations. Our analysis shows this assumption to be incomplete, since many errors arise from viewpoint variation, scale shifts, and fine-grained object confusions. To disentangle these effects, we introduce COVAR, a benchmark that systematically varies object foregrounds and backgrounds. Leveraging CCI with COVAR, we conduct a comprehensive evaluation of eighteen CLIP variants, providing both methodological advances and empirical evidence that chart a path toward more robust vision\u2013language models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36957", "url": null, "sourceid": 36009, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36959, "uid": "6d4c6e964bc0078884cb7b22838b827b", "name": "Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization", "authors": [{"id": 157209, "fullname": "Tahira Kazimi", "url": "http://cvpr.thecvf.com/api/miniconf/users/157209?format=json", "institution": "Virginia Polytechnic Institute and State University"}, {"id": 162607, "fullname": "Connor Dunlop", "url": "http://cvpr.thecvf.com/api/miniconf/users/162607?format=json", "institution": "Virginia Tech"}, {"id": 133116, "fullname": "Pinar Yanardag", "url": "http://cvpr.thecvf.com/api/miniconf/users/133116?format=json", "institution": "Virginia Tech"}], "abstract": "While recent text-to-video (T2V) diffusion models  have achieved impressive quality and prompt alignment, they often produce low-diversity outputs when sampling multiple videos from a single text prompt. We tackle this challenge by formulating it as a set-level policy optimization problem, with the goal of training a policy that can cover the diverse range of plausible outcomes for a given prompt. To address this, we introduce  DPP-GRPO, a novel   framework for diverse video generation generation that combines Determinantal Point Processes (DPPs) and Group Relative Policy Optimization (GRPO) theories to enforce explicit reward on diverse generations.  Our objective turns diversity into an explicit signal by imposing diminishing returns on redundant samples (via DPP) while supplies groupwise feedback over candidate sets (via GRPO). Our framework is plug-and-play and model-agnostic, and  encourages diverse generations across visual appearance,  camera motions, and scene structure without sacrificing prompt fidelity or perceptual quality. We implement our method on WAN and CogVideoX, and show that our method consistently improves video diversity on state-of-the-art benchmarks such as VBench, VideoScore, and human preference studies. Moreover, we release our code and a new benchmark dataset of 30,000 diverse prompts to support future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36959", "url": "https://diverse-video.github.io/", "sourceid": 31678, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36960, "uid": "0a16224c296f72af9037875a027f94a6", "name": "TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation", "authors": [{"id": 154622, "fullname": "Yan Shu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154622?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 186317, "fullname": "Bin Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/186317?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 166548, "fullname": "Zhitong Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/166548?format=json", "institution": "Technical University of Munich"}, {"id": 152189, "fullname": "Xiao Xiang Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152189?format=json", "institution": "Technical University Munich"}, {"id": 138068, "fullname": "Beg\u00fcm Demir", "url": "http://cvpr.thecvf.com/api/miniconf/users/138068?format=json", "institution": "Technische Universit\u00e4t Berlin"}, {"id": 75973, "fullname": "Nicu Sebe", "url": "http://cvpr.thecvf.com/api/miniconf/users/75973?format=json", "institution": "University of Trento"}, {"id": 75430, "fullname": "Paolo Rota", "url": "http://cvpr.thecvf.com/api/miniconf/users/75430?format=json", "institution": "University of Trento"}], "abstract": "Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million  samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning with six sub-tasks that evaluates both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36960", "url": null, "sourceid": 31921, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36961, "uid": "8f118ab2a5c0d99362fb67e29856acab", "name": "SymphoMotion: Joint Control of Camera Motion and Object Dynamics for Coherent Video Generation", "authors": [{"id": 133941, "fullname": "Guiyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/133941?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 90144, "fullname": "Yabo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/90144?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 186318, "fullname": "Xunzhi Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186318?format=json", "institution": "Nanjing University"}, {"id": 186319, "fullname": "Junchao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186319?format=json", "institution": "The Chinese University of Hong Kong (Shenzhen); Didi International Business Group; Tianjin University"}, {"id": 181393, "fullname": "Zhongyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181393?format=json", "institution": "Beihang University"}, {"id": 133263, "fullname": "Li Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/133263?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}], "abstract": "Controlling both camera motion and object dynamics is essential for coherent and expressive video generation, yet current methods typically handle only one motion type or rely on ambiguous 2D cues that entangle camera-induced parallax with true object movement. We present SymphoMotion, a unified motion-control framework that jointly governs camera trajectories and object dynamics within a single model. SymphoMotion features a Camera Trajectory Control mechanism that integrates explicit camera paths with geometry-aware cues to ensure stable, structurally consistent viewpoint transitions, and an Object Dynamics Control mechanism that combines 2D visual guidance with 3D trajectory embeddings to enable depth-aware, spatially coherent object manipulation. To support large-scale training and evaluation, we further construct RealCOD-25K, a comprehensive real-world dataset containing paired camera poses and object-level 3D trajectories across diverse indoor and outdoor scenes, addressing a key data gap in unified motion control. Extensive experiments and user studies show that SymphoMotion significantly outperforms existing methods in visual fidelity, camera controllability, and object-motion accuracy, establishing a new benchmark for unified motion control in video generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36961", "url": null, "sourceid": 41332, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36968, "uid": "a6a3d3ea3652a05113b94fc5ced94215", "name": "Yume1.5:  A Text-Controlled Interactive World Generation Model", "authors": [{"id": 159511, "fullname": "Xiaofeng Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/159511?format=json", "institution": "Fudan University"}, {"id": 89906, "fullname": "Zhen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/89906?format=json", "institution": "Beijing Institute of Technology"}, {"id": 184555, "fullname": "Chuanhao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184555?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 152732, "fullname": "Xiaojie Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152732?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 100911, "fullname": "Kaining Ying", "url": "http://cvpr.thecvf.com/api/miniconf/users/100911?format=json", "institution": "Zhejiang University of Technology"}, {"id": 184560, "fullname": "Kaipeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184560?format=json", "institution": "Shanda AI Research"}], "abstract": "Recent approaches have demonstrated the promise of using diffusion models to generate interactive and explorable worlds. However, most of these methods face critical challenges such as excessively large parameter sizes, reliance on lengthy inference steps, and rapidly growing historical context, which severely limit real-time performance and lack text-controlled generation capabilities.To address these challenges, we propose Yume1.5, a novel framework designed to generate realistic, interactive, and continuous worlds from a single image or text prompt. Yume1.5 achieves this through a carefully designed framework that supports keyboard-based exploration of the generated worlds. The framework comprises three core components: (1) a long-video generation method combining unified context compression and linear attention; (2) a context compression-based bidirectional attention distillation approach with an enhanced text embedding scheme for real-time streaming video generation. Yume1.5 achieves an average generation speed of 12 fps at 540p resolution using only a single A100 GPU; (3) a text-controlled method for generating world events. We have provided the codebase in the supplementary material.  The model weights and full codebase will be made public.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36968", "url": null, "sourceid": 36740, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36964, "uid": "c95ab906679d4d28d81a420aa76d02df", "name": "PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery", "authors": [{"id": 174762, "fullname": "Yijing Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/174762?format=json", "institution": "ShanghaiTech University"}, {"id": 186326, "fullname": "Mengjun Chao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186326?format=json", "institution": "ShanghaiTech University; China University of Mining Technology - Beijing"}, {"id": 186327, "fullname": "Luo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186327?format=json", "institution": "ShanghaiTech University"}, {"id": 186328, "fullname": "Tianyang Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186328?format=json", "institution": "ShanghaiTech University"}, {"id": 157517, "fullname": "Haizhao Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/157517?format=json", "institution": "ShanghaiTech University"}, {"id": 186329, "fullname": "Yingliang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186329?format=json", "institution": "DGene Digital Technology Co., Ltd."}, {"id": 75945, "fullname": "Jingyi Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75945?format=json", "institution": "Shanghai Tech University"}, {"id": 133585, "fullname": "Yujiao Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/133585?format=json", "institution": "ShanghaiTech University"}], "abstract": "Panoramic imagery offers a full $360^\\circ$ field of view and is increasingly common in consumer devices. However, it introduces non-pinhole distortions that challenge joint pose estimation and 3D reconstruction. Existing feed-forward models, built for perspective cameras, generalize poorly to this setting.We propose PanoVGGT, a permutation-equivariant Transformer framework that jointly predicts camera poses, depth maps, and 3D point clouds from one or multiple panoramas in a single forward pass. The model incorporates spherical-aware positional embeddings and a panorama-specific three-axis SO(3) rotation augmentation, enabling effective geometric reasoning in the spherical domain. To resolve inherent global-frame ambiguity, we further introduce a stochastic anchoring strategy during training. In addition, we contribute PanoCity, a large-scale outdoor panoramic dataset with dense depth and 6-DoF pose annotations. Extensive experiments on PanoCity and standard benchmarks demonstrate that PanoVGGT achieves competitive accuracy, strong robustness, and improved cross-domain generalization. Code and dataset will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36964", "url": null, "sourceid": 40239, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36965, "uid": "4699df1b3d138637154b348ac946c963", "name": "EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing", "authors": [{"id": 182034, "fullname": "YANG FU", "url": "http://cvpr.thecvf.com/api/miniconf/users/182034?format=json", "institution": "Fudan University"}, {"id": 186330, "fullname": "Yike Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186330?format=json", "institution": "Fudan University"}, {"id": 161184, "fullname": "Ziyun Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/161184?format=json", "institution": "Shanghai University"}, {"id": 76198, "fullname": "Henghui Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/76198?format=json", "institution": "Fudan University"}], "abstract": "Video object removal aims to eliminate dynamic target objects and their visual effects, such as deformation, shadows, and reflections, while restoring seamless backgrounds. Recent diffusion-based video inpainting and object removal methods can remove the objects but often struggle to erase these effects and to synthesize coherent backgrounds. Beyond method limitations, progress is further hampered by the lack of a comprehensive dataset that systematically captures common object effects across varied environments for training and evaluation. To address this, we introduce **VOR** (**V**ideo **O**bject **R**emoval), a large-scale dataset that provides diverse paired videos, each consisting of one video where the target object is present with its effects and a counterpart where the object and effects are absent, with corresponding object masks. VOR contains 60k high-quality video pairs from captured and synthetic sources, covers five effects types, and spans a wide range of object categories as well as complex, dynamic multi-object scenes. Building on VOR, we propose ***EffectErase***, an effect-aware video object removal method that treats video object insertion as the inverse auxiliary task within a reciprocal learning scheme. The model includes task-aware region guidance that focuses learning on affected areas and enables flexible task switching. Then, an insertion\u2013removal consistency objective that encourages complementary behaviors and shared localization of effect regions and structural cues. Trained on VOR, EffectErase achieves superior performance in extensive experiments, delivering high-quality video object effect erasing across diverse scenarios.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36965", "url": null, "sourceid": 32609, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36967, "uid": "a0471c86437c16f083bb739ef8b5d1e2", "name": "From Sketch to Fresco: Efficient Diffusion Transformer with Progressive Resolution", "authors": [{"id": 181164, "fullname": "Shikang Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/181164?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 181170, "fullname": "Guantao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/181170?format=json", "institution": "\u4e0a\u6d77\u4ea4\u901a\u5927\u5b66"}, {"id": 186335, "fullname": "Landis He", "url": "http://cvpr.thecvf.com/api/miniconf/users/186335?format=json", "institution": "Tsinghua University"}, {"id": 186336, "fullname": "Jiacheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186336?format=json", "institution": "Shanghai Jiaotong University; Tencent; Shandong University"}, {"id": 186337, "fullname": "Yuqi Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186337?format=json", "institution": "Jilin University"}, {"id": 179949, "fullname": "Chang Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/179949?format=json", "institution": "Shanghai Jiao Tong University &amp; Tencent Hunyuan"}, {"id": 87643, "fullname": "Linfeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87643?format=json", "institution": ", Tsinghua University"}], "abstract": "Diffusion Transformers achieve impressive generative quality but remain computationally expensive due to iterative sampling.  Recently, dynamic resolution sampling has emerged as a promising acceleration technique by reducing the resolution of early sampling steps. However, existing methods rely on heuristic re-noising at every resolution transition, injecting noise that breaks cross-stage consistency and forces the model to relearn global structure. In addition, these methods indiscriminately upsample the entire latent space at once without checking which regions have actually converged, causing accumulated errors, and visible artifacts. Therefore, we propose \\textbf{Fresco}, a dynamic resolution framework that unifies re-noise and global structure across stages with progressive upsampling, preserving both the efficiency of low-resolution drafting and the fidelity of high-resolution refinement, with all stages aligned toward the same final target. Fresco achieves near-lossless acceleration across diverse domains and models, including 10$\\times$ speedup on FLUX, and 5$\\times$ on HunyuanVideo, while remaining orthogonal to distillation, quantization and feature caching, reaching 22$\\times$ speedup when combined with distilled models. Our code is in supplementary material and will be released on Github.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36967", "url": null, "sourceid": 41205, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36974, "uid": "8cc5ef53fc584eb0a0597b052507fe6d", "name": "SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation", "authors": [{"id": 183957, "fullname": "Naomi Kombol", "url": "http://cvpr.thecvf.com/api/miniconf/users/183957?format=json", "institution": "University of Zagreb Faculty of Electrical Engineering and Computing"}, {"id": 186359, "fullname": "Ivan Martinovi\u0107", "url": "http://cvpr.thecvf.com/api/miniconf/users/186359?format=json", "institution": "Faculty of Electrical Engineering and Computing"}, {"id": 85110, "fullname": "Sini\u0161a \u0160egvi\u0107", "url": "http://cvpr.thecvf.com/api/miniconf/users/85110?format=json", "institution": "UniZg-FER"}, {"id": 73917, "fullname": "Giorgos Tolias", "url": "http://cvpr.thecvf.com/api/miniconf/users/73917?format=json", "institution": "CTU in Prague"}], "abstract": "Foundational Vision Transformers (ViTs) have limited effectiveness in tasks requiring fine-grained spatial understanding, due to their fixed pre-training resolution and inherently coarse patch-level representations. These challenges are especially pronounced in dense prediction scenarios, such as open-vocabulary segmentation with ViT-based vision-language models, where high-resolution inputs are essential for accurate pixel-level reasoning. Existing approaches typically process large-resolution images using a sliding-window strategy at the pre-training resolution. While this improves accuracy through finer strides, it comes at a significant computational cost. We introduce SPAR: Single-Pass Any-Resolution ViT, a resolution-agnostic dense feature extractor designed for efficient high-resolution inference. We distill the spatial reasoning capabilities of a finely-strided, sliding-window teacher into a single-pass student using a feature regression loss, without requiring architectural changes or pixel-level supervision. Applied to open-vocabulary segmentation, SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, demonstrating effectiveness in efficient, high-resolution reasoning. We will release code and models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36974", "url": null, "sourceid": 37526, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36979, "uid": "684c02747b08547490f9fca80f2e5cf1", "name": "UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits", "authors": [{"id": 180023, "fullname": "Keming Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/180023?format=json", "institution": "Zhejiang University"}, {"id": 180924, "fullname": "Zhipeng Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180924?format=json", "institution": "Shenzhen Tencent Computer Systems Co., Ltd."}, {"id": 154726, "fullname": "Canmiao Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154726?format=json", "institution": "Wechat AI"}, {"id": 107568, "fullname": "Qingyang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107568?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 186370, "fullname": "Jiani Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/186370?format=json", "institution": "Zhejiang Lab; xinjiang university"}, {"id": 128369, "fullname": "Zheqi Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/128369?format=json", "institution": "Zhejiang University"}, {"id": 86641, "fullname": "Chen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86641?format=json", "institution": "WeChat, Tencent"}, {"id": 185164, "fullname": "Jing LYU", "url": "http://cvpr.thecvf.com/api/miniconf/users/185164?format=json", "institution": "Wechat Team"}, {"id": 84818, "fullname": "Zhou Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84818?format=json", "institution": "Zhejiang University, Tsinghua University"}, {"id": 84791, "fullname": "Shengyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84791?format=json", "institution": "Zhejiang University"}], "abstract": "With the rapid advances of powerful multimodal models such as GPT-4o, Nano Banana, and Seedream 4.0 in Image Editing, the performance gap between closed-source and open-source models is widening, primarily due to the scarcity of large-scale, high-quality training data and comprehensive benchmarks capable of diagnosing model weaknesses across diverse editing behaviors. Existing data construction methods face a scale-quality trade-off: human annotations are high-quality but not scalable, while automated pipelines suffer from error propagation and noise. To address this, we introduce a lightweight data pipeline that replaces multi-toolchains with an end-to-end model and a unified post-verification stage. For scalable quality control, we train a 7B dual-task expert model, \\textbf{Qwen-Verify}, for efficient failure detection and instruction recaptioning. This pipeline yields \\textbf{UnicEdit-10M}, a 10M-scale dataset spanning diverse basic and complex editing tasks. We also propose \\textbf{UnicBench}, a general benchmark that extends beyond basic edits to explicitly assess spatial and knowledge-driven reasoning. To enable fine-grained diagnosis, we introduce novel metrics, including \\textit{Non-edit Consistency} and \\textit{Reasoning Accuracy}. Our analysis of mainstream models on UnicBench reveals their limitations and provides clear directions for future research. The dataset, benchmark, and code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36979", "url": null, "sourceid": 33480, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36982, "uid": "8f012eb115c6b37cd310b1643497d6d6", "name": "Not All Birds Look The Same: Identity-Preserving Generation For Birds", "authors": [{"id": 127799, "fullname": "Aaron Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/127799?format=json", "institution": "University of Massachusetts Amherst"}, {"id": 131800, "fullname": "Oindrila Saha", "url": "http://cvpr.thecvf.com/api/miniconf/users/131800?format=json", "institution": "University of Massachusetts at Amherst"}, {"id": 75679, "fullname": "Subhransu Maji", "url": "http://cvpr.thecvf.com/api/miniconf/users/75679?format=json", "institution": "University of Massachusetts, Amherst"}], "abstract": "Since the advent of controllable image generation, increasingly rich modes of control have enabled greater customization and accessibility for everyday users.Zero-shot, identity-preserving models such as Insert Anything and OminiControl now support applications like virtual try-on without requiring additional fine-tuning.While these models may be fitting for humans and rigid everyday objects, they still have limitations for non-rigid or fine-grained categories. These domains often lack accessible, high-quality data\u2014especially videos or multi-view observations of the same subject\u2014making them difficult both to evaluate and to improve upon. Yet, such domains are essential for moving beyond content creation toward applications that demand accuracy and fine detail.Birds are an excellent domain for this task: they exhibit high diversity, require fine-grained cues for identification, and come in a wide variety of poses. We introduce the NABirds Look-Alikes (NABLA) dataset, consisting of 4,759 expert-curated image pairs. Together with 1,073 pairs collected from multi-image observations on iNaturalist and a small set of videos, this forms a benchmark for evaluating identity-preserving generation of birds.We show that state-of-the-art baselines fail to maintain identity on this dataset, and we demonstrate that training on images grouped by species, age, and sex---used as a proxy for identity---substantially improves performance on both seen and unseen species.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36982", "url": null, "sourceid": 42064, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36984, "uid": "3a5a64a567a75090add692e2a4377e8d", "name": "Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers", "authors": [{"id": 182333, "fullname": "Cris Claessens", "url": "http://cvpr.thecvf.com/api/miniconf/users/182333?format=json", "institution": "Eindhoven University of Technology"}, {"id": 183350, "fullname": "Christiaan Viviers", "url": "http://cvpr.thecvf.com/api/miniconf/users/183350?format=json", "institution": "Eindhoven University of Technology"}, {"id": 93410, "fullname": "Giacomo D&amp;#x27;Amicantonio", "url": "http://cvpr.thecvf.com/api/miniconf/users/93410?format=json", "institution": "Eindhoven University of Technology"}, {"id": 156805, "fullname": "Egor Bondarev", "url": "http://cvpr.thecvf.com/api/miniconf/users/156805?format=json", "institution": "Eindhoven University of Technology"}, {"id": 186379, "fullname": "Fons van der Sommen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186379?format=json", "institution": "Eindhoven University of Technology"}], "abstract": "We introduce SPECTRE, a fully transformer-based foundation model for volumetric computed tomography (CT). Our Self-Supervised & Cross-Modal Pretraining for CT Representation Extraction (SPECTRE) approach utilizes scalable 3D Vision Transformer architectures and modern self-supervised and vision\u2013language pretraining strategies to learn general-purpose CT representations. Volumetric CT poses unique challenges, such as extreme token scaling, geometric anisotropy, and weak or noisy clinical supervision, that make standard transformer and contrastive learning recipes ineffective out of the box. The framework jointly optimizes a local transformer for high-resolution volumetric feature extraction and a global transformer for whole-scan context modeling, making large-scale 3D attention computationally tractable. Notably, SPECTRE is trained exclusively on openly available CT datasets, demonstrating that high-performing, generalizable representations can be achieved without relying on private data. Pretraining combines DINO-style self-distillation with SigLIP-based vision\u2013language alignment using paired radiology reports, yielding features that are both geometrically consistent and clinically meaningful. Across multiple CT benchmarks, SPECTRE consistently outperforms prior CT foundation models in both zero-shot and fine-tuned settings, establishing SPECTRE as a scalable, open, and fully transformer-based foundation model for 3D medical imaging.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36984", "url": null, "sourceid": 30643, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36985, "uid": "a999280d94b62f32e136b75017a67a85", "name": "ProxyFL: A Proxy-Guided Framework for Federated Semi-Supervised Learning", "authors": [{"id": 71397, "fullname": "Duowen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/71397?format=json", "institution": "East China Normal University"}, {"id": 90671, "fullname": "Yan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90671?format=json", "institution": "East China Normal University"}], "abstract": "Federated Semi-Supervised Learning (FSSL) aims to collaboratively train a global model by leveraging unlabeled data and limited labeled data across clients in a privacy-preserving manner. In FSSL, data heterogeneity is a challenging issue, which exists both across clients (external heterogeneity) and within clients (internal heterogeneity). Most FSSL methods typically design fixed or dynamic parameter aggregation strategies to collect client knowledge on the server (for external) and / or filter out low-confidence unlabeled samples directly by an empirical threshold to reduce mistakes in local client (for internal). But, the former is hard to precisely fit the ideal global category distribution due to external heterogeneity, and the latter results in fewer training participation of available samples in FL. To address these issues, we propose a proxy-guided framework called ProxyFL that focuses on simultaneously mitigating external and internal heterogeneity via a unified proxy. I.e., we consider the learnable weights of classifier as proxy to simulate the category distribution both locally and globally. For external, we explicitly optimize global proxy to better fit the category distribution across clients; for internal, we re-include the discarded samples together with other samples into training based upon a positive-negative proxy pool rather than compromise on wrong pseudo-labels. Insight experiments \\& theoretical analysis show that ProxyFL significantly boost the FSSL performance and convergence.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36985", "url": null, "sourceid": 33102, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36987, "uid": "5328c8721bf63d50f836beb8da7329b1", "name": "Anatomical Domain Shifts: Test-time Heterogeneous Adaptation for 3D Human Pose Prediction", "authors": [{"id": 126685, "fullname": "Qiongjie Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/126685?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 126500, "fullname": "Pan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/126500?format=json", "institution": "Sea Group"}, {"id": 89124, "fullname": "Jingjing Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89124?format=json", "institution": "Fudan University"}, {"id": 135934, "fullname": "Na Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/135934?format=json", "institution": "Singapore University of Technology and Design"}], "abstract": "The research frontier in human pose prediction (HPP) is advancing toward continual test-time adaptation (TTA), where models must self-adapt to dynamic test distributions. To date, the homeostatic continual TTA remains the sole viable solution, which isolates the model parameters and update domain-sensitive ones. Despite mitigating full-body domain gaps, human anatomical heterogeneity (domain shifts often localize to specific regions) is ignored. This anatomical-agnostic approach forces uniform parameter adaptation across kinematically distinct segments, causing: over-adaptation of stable regions and under-adaptation of shift-prone articulations. To address it, we introduce TT-HA, a novel Test-Time Heterogeneous Adaptation that implicitly estimates domain changes for anatomical segments, and adapt the corresponding parameters. Building on human anatomy, TT-HA partitions parameters into five anatomical subsets using fisher information matrix-based parameters uncertainty analysis. During testing, TT-HA uses the instance normalization statistics and Earth Mover's Distance (EMD) to quantify segment-wise domain changes, dynamically determining which segment-specific parameters to adapt and to what extent. When substantial domain shifts are detected, TT-HA restores only affected segments to source-trained values, ensuring robust adaptation without full parameter resetting; minor shifts trigger the fine-tuning of corresponding parameters while preserving remaining ones. Experiments show TT-HA's superior full-body accuracy with greater limb error decrease than prior methods, proving its anatomically-targeted efficacy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36987", "url": null, "sourceid": 44345, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36991, "uid": "31577311e2012c8a369e46365f56891a", "name": "CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation", "authors": [{"id": 183827, "fullname": "Xia Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/183827?format=json", "institution": "University of Washington"}, {"id": 186398, "fullname": "Ruiqi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186398?format=json", "institution": "University of Washington"}, {"id": 135668, "fullname": "Benlin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/135668?format=json", "institution": "Department of Computer Science, University of Washington"}, {"id": 133071, "fullname": "Jingwei Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/133071?format=json", "institution": "University of Washington; University of Washington"}, {"id": 186399, "fullname": "Zonglin Di", "url": "http://cvpr.thecvf.com/api/miniconf/users/186399?format=json", "institution": "University of California, Santa Cruz"}, {"id": 84558, "fullname": "Ranjay Krishna", "url": "http://cvpr.thecvf.com/api/miniconf/users/84558?format=json", "institution": "University of Washington"}, {"id": 186400, "fullname": "Jon Froehlich", "url": "http://cvpr.thecvf.com/api/miniconf/users/186400?format=json", "institution": "University of Washington"}], "abstract": "Vision-Language Models (VLMs) have shown remarkable progress in Vision-Language Navigation (VLN), offering new possibilities for navigation decision-making that could benefit both robotic platforms and human users. However, real-world navigation is inherently conditioned by the agent\u2019s mobility constraints. For example, a sweeping robot cannot traverse stairs while a quadruped can. We introduce Capability-Conditioned Navigation (CapNav), a benchmark designed to evaluate how well VLMs can navigate complex indoor spaces given an agent\u2019s specific physical and operational capabilities. CapNav defines five representative human and robot agents, each described with physical dimensions, mobility capabilities, and environmental interaction abilities. CapNav provides 45 real-world indoor scenes, 473 navigation tasks, and 2365 QA pairs to test if VLMs can traverse indoor environments based on agent capabilities. We evaluate 13 modern VLMs and find that current VLM's navigation performance drops sharply as mobility constraints tighten, and that even state-of-the-art models struggle with obstacle types that require reasoning on spatial dimensions. We close by discussing the implications for capability-aware navigation and the opportunities for advancing embodied spatial reasoning in future VLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36991", "url": null, "sourceid": 36339, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36994, "uid": "114d5395ef320585cb88d71948aa5e40", "name": "Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation", "authors": [{"id": 186414, "fullname": "Divyanshu Daiya", "url": "http://cvpr.thecvf.com/api/miniconf/users/186414?format=json", "institution": "Purdue University"}, {"id": 128106, "fullname": "Aniket Bera", "url": "http://cvpr.thecvf.com/api/miniconf/users/128106?format=json", "institution": "Purdue University"}], "abstract": "We present \\emph{Sketch2Colab}, which turns storyboard-style 2D sketches into coherent, object-aware 3D multi-human motion with fine-grained control over agents, joints, timing, and contacts. Conventional diffusion-based motion generators have advanced realism; however, achieving precise adherence to rich interaction constraints typically demands extensive training and/or costly posterior guidance, and performance can degrade under strong multi-entity conditioning. Sketch2Colab instead first learns a sketch-driven diffusion prior and then distills it into an efficient rectified-flow student operating in latent space for fast, stable sampling. Differentiable energies over keyframes, trajectories, and physics-based constraints directly shape the student\u2019s transport field, steering samples toward motions that faithfully satisfy the storyboard while remaining physically plausible. To capture coordinated interaction, we augment the continuous flow with a continuous-time Markov chain (CTMC) planner that schedules discrete events such as touches, grasps, and handoffs, modulating the dynamics to produce crisp, well-phased human\u2013object\u2013human collaborations. Experiments on CORE4D and InterHuman show that Sketch2Colab achieves state-of-the-art constraint adherence and perceptual quality while offering significantly faster inference than diffusion-only baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36994", "url": null, "sourceid": 31970, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36995, "uid": "5d5ef971a832156872b8ae6732280d0a", "name": "Real-Time Generation of Streamable Talking Portrait Video with Reference-Guided Deep Compression VAEs", "authors": [{"id": 153304, "fullname": "Sicheng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153304?format=json", "institution": "Microsoft"}, {"id": 89100, "fullname": "Yu Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/89100?format=json", "institution": "Xiaobing.ai"}, {"id": 129209, "fullname": "Shoukang Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129209?format=json", "institution": "Sony AI"}, {"id": 186415, "fullname": "Yichuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186415?format=json", "institution": "Microsoft"}, {"id": 157497, "fullname": "Yizhong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157497?format=json", "institution": "Microsoft"}, {"id": 186416, "fullname": "Zhan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186416?format=json", "institution": null}, {"id": 129072, "fullname": "Jiaolong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129072?format=json", "institution": "Microsoft Research"}, {"id": 85139, "fullname": "Baining Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/85139?format=json", "institution": "Microsoft Research"}], "abstract": "Video diffusion models have significantly advanced portrait video generation, yet their high computational demands limit their use in interactive applications. This work presents a framework for streamable talking portrait video generation conditioned on speech audio and reference images. Designed meticulously for streaming scenarios, it features a causal video VAE for deep latent compression and an auto-regressive latent denoising model. Our causal VAE integrates a variable number of reference images as guidance, allowing the network to focus on dynamic information rather than static appearance, thereby enhancing compression efficacy and reconstruction quality. Additionally, we extend the residual auto-encoding paradigm to improve spatial-temporal causality handling in our VAE. The generator is based on a Rectified Flow Transformer architecture and produces video latents in a blockwise auto-regressive manner. Our method enables the real-time generation of high-quality talking portrait videos, achieving speeds significantly faster than baseline models.  Furthermore, comprehensive experiments demonstrate that it is on par with or even outperforms these large models in realism, vividness, and video quality.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36995", "url": null, "sourceid": 31261, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36997, "uid": "2708ccf204c47df56b5469327e900581", "name": "Towards Decompositional Human Motion Generation with Energy-Based Diffusion Models", "authors": [{"id": 157721, "fullname": "Jianrong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157721?format=json", "institution": "University of Technology Sydney"}, {"id": 163978, "fullname": "Hehe Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/163978?format=json", "institution": "Zhejiang University"}, {"id": 86325, "fullname": "Yi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86325?format=json", "institution": "Zhejiang University"}], "abstract": "Human motions are compositional: complex behaviors can be described as combinations of simpler primitives. However, existing approaches primarily focus on forward modeling, e.g., learning holistic mappings from text to motion or composing a complex motion from a set of motion concepts.In this paper, we consider the inverse perspective: decomposing a holistic motion into semantically meaningful sub-components. We propose DeMoGen, a compositional training paradigm for decompositional learning that employs an energy-based diffusion model. This energy formulation directly captures the composed distribution of multiple motion concepts, enabling the model to discover them without relying on ground-truth motions for individual concepts.Within this paradigm, we introduce three training variants to encourage a decompositional understanding of motion: 1. DeMoGen-Exp explicitly trains on decomposed text prompts; 2. DeMoGen-OSS performs orthogonal self-supervised decomposition; 3. DeMoGen-SC enforces semantic consistency between original and decomposed text embeddings.These variants enable our approach to disentangle reusable motion primitives from complex motion sequences. We also demonstrate that the decomposed motion concepts can be flexibly recombined to generate diverse and novel motions, generalizing beyond the training distribution. Additionally, we construct a text-decomposed dataset to support compositional training, serving as an extended resource to facilitate text-to-motion generation and motion composition. Our implementation will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36997", "url": null, "sourceid": 41159, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37005, "uid": "f9145ca6d153a22f2c344f15ce8034f6", "name": "DA-Mamba: Learning Domain-Aware State Space Model for Global-Local Alignment in Domain Adaptive Object Detection", "authors": [{"id": 147136, "fullname": "Haochen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/147136?format=json", "institution": "Institute of Software, Chinese Academy of Sciences"}, {"id": 77537, "fullname": "Rui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77537?format=json", "institution": "Institute of Computing Technology, CAS"}, {"id": 154434, "fullname": "Hantao Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/154434?format=json", "institution": "University of Science and Technology of China"}, {"id": 157524, "fullname": "Xin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157524?format=json", "institution": "Chinese Academy of Sciences"}, {"id": 157525, "fullname": "Yifan Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/157525?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 157527, "fullname": "Shaohui Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/157527?format=json", "institution": "Chinese Academy of Sciences"}, {"id": 157528, "fullname": "Yongwei Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/157528?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 157530, "fullname": "Ling Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/157530?format=json", "institution": "Institute of Software, CAS"}], "abstract": "Domain Adaptive Object Detection (DAOD) aims to transfer detectors from a labeled source domain to an unlabeled target domain.Existing DAOD methods employ multi-granularity feature alignment to learn domain-invariant representations.However, the local connectivity of their CNN-based backbone and detection head restricts alignment to local regions, failing to extract global domain-invariant features.Although transformer-based DAOD methods capture global dependencies via attention mechanisms, their quadratic computational cost hinders practical deployment. To solve this, we propose DA-Mamba, a hybrid CNN-State Space Models (SSMs)  architecture that combines the efficiency of CNNs with the linear-time long-range modeling capability of State Space Models (SSMs) to capture both global and local domain-invariant features.Specifically, we introduce two novel modules: Image-Aware SSM (IA-SSM) and Object-Aware SSM (OA-SSM).IA-SSM is integrated into the backbone to enhance global domain awareness, enabling image-level global and local alignment.OA-SSM is inserted into the detection head to model spatial and semantic dependencies among objects, enhancing instance-level alignment.Comprehensive experiments demonstrate that the proposed method can efficiently improve the cross-domain performance of the object detector.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37005", "url": null, "sourceid": 46023, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37007, "uid": "9ae0e1661f21a790e485c6d9b7299180", "name": "Erasing Thousands of Concepts: Towards Scalable and Practical Concept Erasure for Text-to-Image Diffusion Models", "authors": [{"id": 147333, "fullname": "Hoigi Seo", "url": "http://cvpr.thecvf.com/api/miniconf/users/147333?format=json", "institution": "Seoul National University"}, {"id": 87682, "fullname": "Byung Hyun Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/87682?format=json", "institution": "Seoul National University"}, {"id": 183773, "fullname": "Jaehyun Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/183773?format=json", "institution": "Seoul National University"}, {"id": 147068, "fullname": "Sungjin Lim", "url": "http://cvpr.thecvf.com/api/miniconf/users/147068?format=json", "institution": "Seoul National University"}, {"id": 87674, "fullname": "Se Young Chun", "url": "http://cvpr.thecvf.com/api/miniconf/users/87674?format=json", "institution": "Seoul National University"}], "abstract": "Large-scale text-to-image (T2I) diffusion models deliver remarkable visual fidelity but pose  safety risks due to their capacity to reproduce undesirable content, such as copyrighted ones. Concept erasure has emerged as a mitigation strategy, yet existing approaches struggle to balance scalability, precision, and robustness, which restricts their applicability to erasing only a few hundred concepts. To address these limitations, we present Erasing Thousands of Concepts (ETC), a scalable  framework capable of erasing thousands of concepts while preserving generation quality. Our method first models low-rank concept distributions via a Student\u2019s t-distribution Mixture Model (tMM). It enables pin-point erasure of target concepts via affine optimal transport while preserving others by anchoring the boundaries of target concept distributions without pre-defined anchor concepts. We then train a Mixture-of-Experts (MoE)\u2013based module, termed MoEraser, which removes target embeddings while preserving the anchor embeddings. By injecting noise into the text embedding projector and fine-tuning MoEraser for recovery, our framework achieves robustness to white-box attack such as module removal. Extensive experiments on over 2,000 concepts across heterogeneous domains and diffusion models demonstrate state-of-the-art scalability and precision in large-scale concept erasure.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37007", "url": null, "sourceid": 41311, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37009, "uid": "bab0f742373b1c7c3c62aeda2e7dd8bf", "name": "PDD: Manifold-Prior Diverse Distillation for Medical Anomaly Detection", "authors": [{"id": 177783, "fullname": "Xijun Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/177783?format=json", "institution": "Tianjin University"}, {"id": 153901, "fullname": "Hongying Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153901?format=json", "institution": "Tianjin University"}, {"id": 106136, "fullname": "Fanhua Shang", "url": "http://cvpr.thecvf.com/api/miniconf/users/106136?format=json", "institution": "Tianjin University"}, {"id": 180928, "fullname": "Yanming hui", "url": "http://cvpr.thecvf.com/api/miniconf/users/180928?format=json", "institution": "Tianjin University"}, {"id": 153902, "fullname": "Liang Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153902?format=json", "institution": "Tianjin University"}], "abstract": "Medical image anomaly detection faces unique challenges due to subtle, heterogeneous anomalies embedded in complex anatomical structures. Through systematic Grad-CAM analysis, we reveal that discriminative activation maps fail on medical data, unlike their success on industrial datasets, motivating the need for manifold-level modeling. We propose \\textbf{PDD} (Manifold-Prior Diverse Distillation), a framework that unifies dual-teacher priors into a shared high-dimensional manifold and distills this knowledge into dual students with complementary behaviors. Specifically, frozen VMamba-Tiny and wide-ResNet50 encoders provide global contextual and local structural priors, respectively. Their features are unified through a \\textbf{Manifold Matching and Unification (MMU)} module, while an \\textbf{Intra-Backbone Attention (InA)} module enriches intermediate representations. The unified manifold is distilled into two students: one performs layer-wise distillation via \\textbf{InA} for local consistency, while the other receives skip-projected representations through a \\textbf{Manifold Prior Affine (MPA)} module to capture cross-layer dependencies. A diversity loss prevents representation collapse while maintaining detection sensitivity. Extensive experiments on multiple medical datasets demonstrate that \\textbf{PDD} significantly outperforms existing state-of-the-art methods, achieving improvements of up to 11.8\\%, 5.1\\%, and 2.9\\% in AUROC on HeadCT, BrainMRI, and ZhangLab datasets, respectively, and 3.4\\% in F1 max on the Uni-Medical dataset, establishing new state-of-the-art performance in medical image anomaly detection.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37009", "url": null, "sourceid": 38337, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37010, "uid": "603c32f40317b3179730bb4d8a032f84", "name": "Hugging Visual Prompt and Segmentation Tokens: Consistency Learning for Fine-Grained Visual Understanding in MLLMs", "authors": [{"id": 148455, "fullname": "jing yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/148455?format=json", "institution": "tongji university"}, {"id": 186465, "fullname": "Sen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186465?format=json", "institution": "Baidu"}, {"id": 149871, "fullname": "Boqiang Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/149871?format=json", "institution": "Baidu"}, {"id": 186466, "fullname": "Ming Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/186466?format=json", "institution": "Southeast University"}, {"id": 90626, "fullname": "Wei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90626?format=json", "institution": "Baidu"}, {"id": 87785, "fullname": "Xiao Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87785?format=json", "institution": "Baidu"}, {"id": 185763, "fullname": "KunbinChen KunbinChen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185763?format=json", "institution": "Baidu"}, {"id": 185764, "fullname": "Wei He", "url": "http://cvpr.thecvf.com/api/miniconf/users/185764?format=json", "institution": "Baidu"}, {"id": 126338, "fullname": "Jingdong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126338?format=json", "institution": "Baidu"}, {"id": 127684, "fullname": "Hanli Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127684?format=json", "institution": "Tongji University"}], "abstract": "Recently, multimodal large language models (MLLMs) have achieved remarkable success in general multimodal tasks. Increasing attention has been given to leveraging MLLMs for fine-grained visual understanding, such as region-level captioning and pixel-level grounding.However, most existing approaches are task-specific, and some recent unified approaches attempt to handle both types simultaneously; they still fall short of deeply exploring the underlying associations across tasks. To bridge this gap, we propose a multimodal large language model designed to jointly support $\\textbf{Fine-grained}$ visual understanding through $\\textbf{Consistency Learning}$ (FCLM).  The central idea of this work is that pixel-level captioning and grounding are mutually beneficial and complementary tasks, each enhancing the other in achieving a fine-grained understanding of visual content.Specifically, FCLM analyzes the representation features -- visual prompt and segmentation tokens -- required for the two types of visual tasks, and achieves advanced reasoning and perception through a novel-designed consistency learning loss and a two-stage training framework. Moreover, we design a Hybrid Region Extractor to enhance the quality of visual prompt embeddings, thereby obtaining more semantically discriminative representations for detailed caption generation. Additionally, to verify the MLLM\u2019s ability to localize accurate targets from detailed textual descriptions, we introduce a novel task called Detailed Localized Referring Expression Segmentation (DL-RES).We conduct extensive experiments on seven visual understanding tasks, demonstrating the strong performance and generalization ability of FCLM.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37010", "url": null, "sourceid": 38201, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37013, "uid": "475238fb0e20e291f52044dfd7c0330e", "name": "TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting", "authors": [{"id": 154139, "fullname": "Hyeseong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/154139?format=json", "institution": "Yonsei University"}, {"id": 156512, "fullname": "Geonhui Son", "url": "http://cvpr.thecvf.com/api/miniconf/users/156512?format=json", "institution": "Yonsei University"}, {"id": 154141, "fullname": "Deukhee Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/154141?format=json", "institution": "Korea Institute of Science and Technology"}, {"id": 91230, "fullname": "Dosik Hwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91230?format=json", "institution": "Yonsei University"}], "abstract": "Novel view synthesis from sparse-view inputs poses a significant challenge in 3D computer vision, particularly for achieving high-quality scene reconstructions with limited viewpoints. We introduce TWINGS, a framework that enhances 3D Gaussian Splatting (3DGS) by directly addressing point sparsity. We employ Thin Plate Splines (TPS), a smooth non-rigid deformation model that minimizes bending energy to estimate a globally coherent warp from control-point correspondences, to align backprojected points from estimated depth with triangulated 3D control points, yielding calibrated backprojected points. By sampling these calibrated points near the control points, TWINGS provides a fast and geometrically accurate initialization for 3DGS, ultimately improving structural detail preservation and color fidelity in reconstructed scenes. Extensive experiments on DTU, LLFF, and Mip-NeRF360 demonstrate that TWINGS consistently outperforms existing methods, delivering detailed and accurate reconstructions under sparse-view scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37013", "url": null, "sourceid": 39184, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37014, "uid": "2e5cbd954b383fc43f19deb0e2d1e3da", "name": "ZINA: Multimodal Fine-grained Hallucination Detection and Editing", "authors": [{"id": 102956, "fullname": "Yuiga Wada", "url": "http://cvpr.thecvf.com/api/miniconf/users/102956?format=json", "institution": "Keio University"}, {"id": 186490, "fullname": "Kazuki Matsuda", "url": "http://cvpr.thecvf.com/api/miniconf/users/186490?format=json", "institution": "Keio University"}, {"id": 92433, "fullname": "Komei Sugiura", "url": "http://cvpr.thecvf.com/api/miniconf/users/92433?format=json", "institution": "Keio University"}, {"id": 86722, "fullname": "Graham Neubig", "url": "http://cvpr.thecvf.com/api/miniconf/users/86722?format=json", "institution": "Carnegie Mellon University"}], "abstract": "Multimodal Large Language Models (MLLMs) often generate hallucinations, where the output deviates from the visual content. Given that these hallucinations can take diverse forms, detecting hallucinations at a fine-grained level is essential for comprehensive evaluation and analysis. To this end, we propose a novel task of multimodal fine-grained hallucination detection and editing for MLLMs. Moreover, we propose ZINA, a novel method that identifies hallucinated spans at a fine-grained level, classifies their error types into six categories, and suggests appropriate refinements. To train and evaluate models for this task, we constructed VisionHall, a dataset comprising 6.9k outputs from twelve MLLMs manually annotated by 211 annotators, and 20k synthetic samples generated using a graph-based method that captures dependencies among error types. We demonstrated that ZINA outperformed existing methods, including GPT-4o and Llama-3.2, in both detection and editing tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37014", "url": null, "sourceid": 44807, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37016, "uid": "6087889ef641d40c5804fa52689c3398", "name": "Lifting Unlabeled Internet-scale Data for 3D Scene Understanding", "authors": [{"id": 75772, "fullname": "Yixin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/75772?format=json", "institution": "BIGAI"}, {"id": 186492, "fullname": "Yaowei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186492?format=json", "institution": null}, {"id": 155010, "fullname": "Huangyue Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155010?format=json", "institution": "Beijing Institute for General Artificial Intelligence"}, {"id": 186493, "fullname": "Junchao He", "url": "http://cvpr.thecvf.com/api/miniconf/users/186493?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 155532, "fullname": "Yan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155532?format=json", "institution": "Beijing Institute for General Artificial Intelligence"}, {"id": 158467, "fullname": "Jiangyong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158467?format=json", "institution": "Peking University"}, {"id": 186494, "fullname": "Hongyu Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186494?format=json", "institution": "Beijing Institute of Technology"}, {"id": 156707, "fullname": "Junfeng Ni", "url": "http://cvpr.thecvf.com/api/miniconf/users/156707?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 179635, "fullname": "Shaofei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/179635?format=json", "institution": "Department of Computer Science, Swiss Federal Institute of Technology"}, {"id": 88079, "fullname": "Baoxiong Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/88079?format=json", "institution": "BIGAI"}, {"id": 150957, "fullname": "Song-Chun Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/150957?format=json", "institution": "Beijing Institute for General Artificial Intelligence"}, {"id": 75767, "fullname": "Siyuan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75767?format=json", "institution": "Beijing Institute of General Artificial Intelligence"}], "abstract": "Annotated 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available on the internet. In this paper, we demonstrate that carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data, to facilitate end-to-end models in 3D scene understanding alongside human-annotated datasets. We systematically identify and analyze bottlenecks in automated data generation, revealing critical factors that determine the efficiency and effectiveness of learning from unlabeled data. To validate our approach across different perception granularities, we evaluate on three tasks spanning low-level perception, i.e., 3D object detection and instance segmentation, to high-level reasoning, i.e., 3D spatial Visual Question Answering (VQA) and Vision-Lanugage Navigation (VLN). Models trained on our generated data demonstrate strong zero-shot performance and show further improvement after finetuning. This demonstrates the viability of leveraging readily available web data as a path toward more capable scene understanding systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37016", "url": null, "sourceid": 40115, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37017, "uid": "73e948cae9bef7907288257056ad33a0", "name": "EasyV2V: A High-quality Instruction-based Video Editing Framework", "authors": [{"id": 75847, "fullname": "Jinjie Mai", "url": "http://cvpr.thecvf.com/api/miniconf/users/75847?format=json", "institution": "KAUST"}, {"id": 106929, "fullname": "Chaoyang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/106929?format=json", "institution": "Snap Inc"}, {"id": 160337, "fullname": "Guocheng Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/160337?format=json", "institution": "Snap Inc."}, {"id": 88383, "fullname": "Willi Menapace", "url": "http://cvpr.thecvf.com/api/miniconf/users/88383?format=json", "institution": "University of Trento"}, {"id": 85389, "fullname": "Sergey Tulyakov", "url": "http://cvpr.thecvf.com/api/miniconf/users/85389?format=json", "institution": "Snap Inc."}, {"id": 75441, "fullname": "Bernard Ghanem", "url": "http://cvpr.thecvf.com/api/miniconf/users/75441?format=json", "institution": "KAUST"}, {"id": 75925, "fullname": "Peter Wonka", "url": "http://cvpr.thecvf.com/api/miniconf/users/75925?format=json", "institution": "KAUST"}, {"id": 186495, "fullname": "Ashkan Mirzaei", "url": "http://cvpr.thecvf.com/api/miniconf/users/186495?format=json", "institution": "NVIDIA; Snap Inc."}], "abstract": "While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization.We study the design space of data, architecture, and control, and introduce EasyV2V, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold.On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model.For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images.Overall, EasyV2V works with flexible inputs, e.g., video+text, video+mask+text, video+mask+reference+text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems. Code and data will be released upon approval.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37017", "url": null, "sourceid": 43122, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37018, "uid": "504ddbd9312d445a43a48d737b906d58", "name": "AMusE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding", "authors": [{"id": 73217, "fullname": "Sanjoy Chowdhury", "url": "http://cvpr.thecvf.com/api/miniconf/users/73217?format=json", "institution": "University of Maryland, College Park"}, {"id": 186496, "fullname": "Karren Dai Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186496?format=json", "institution": "Apple"}, {"id": 186497, "fullname": "Xudong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186497?format=json", "institution": "Apple"}, {"id": 97323, "fullname": "Fartash Faghri", "url": "http://cvpr.thecvf.com/api/miniconf/users/97323?format=json", "institution": "Apple"}, {"id": 132386, "fullname": "Pavan Kumar Anasosalu Vasu", "url": "http://cvpr.thecvf.com/api/miniconf/users/132386?format=json", "institution": "Apple"}, {"id": 90013, "fullname": "Oncel Tuzel", "url": "http://cvpr.thecvf.com/api/miniconf/users/90013?format=json", "institution": "Apple"}, {"id": 85839, "fullname": "Dinesh Manocha", "url": "http://cvpr.thecvf.com/api/miniconf/users/85839?format=json", "institution": "University of Maryland, College Park"}, {"id": 84603, "fullname": "Chun-Liang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/84603?format=json", "institution": "Google"}, {"id": 94509, "fullname": "Raviteja Vemulapalli", "url": "http://cvpr.thecvf.com/api/miniconf/users/94509?format=json", "institution": "Apple"}], "abstract": "Recent multimodal large language models (MLLMs) such as GPT-4o and Qwen3-Omni show strong perception but struggle in multi-speaker, dialogue-centric settings that demand agentic reasoning tracking who speaks, maintaining roles, and grounding events across time. These scenarios are central to multimodal audio-video understanding, where models must jointly reason over audio and visual streams in applications such as conversational video assistants and meeting analytics. We introduce $AMusE$, a benchmark designed around tasks that are inherently agentic, requiring models to decompose complex audio-visual interactions into planning, grounding, and reflection steps. It evaluates MLLMs across three modes zero-shot, guided, and agentic and six task families, including spatio-temporal speaker grounding and multimodal dialogue summarization. Across all modes, current models exhibit weak multi-speaker reasoning and inconsistent behavior under both non-agentic and agentic evaluation. Motivated by the inherently agentic nature of these tasks and recent advances in LLM agents, we propose $RAFT$, a data-efficient agentic alignment framework that integrates reward optimization with intrinsic multimodal self-evaluation as reward and selective parameter adaptation for data and parameter efficient updates. Using $RAFT$, we achieve up to $39.52\\%$ relative improvement in accuracy on our benchmark. Together, $AMusE$ and $RAFT$ provide a practical platform for examining agentic reasoning in multimodal models and improving their capabilities. To facilitate further research we will publicly release our code and benchmark.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37018", "url": null, "sourceid": 46119, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37019, "uid": "d2a9c67619d9d37284ab7322ea242ee9", "name": "$\\textit{4DSurf}$: High-Fidelity Dynamic Scene Surface Reconstruction", "authors": [{"id": 144556, "fullname": "Renjie Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/144556?format=json", "institution": "Australian National University"}, {"id": 92749, "fullname": "Hongdong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/92749?format=json", "institution": "Australian National University"}, {"id": 73959, "fullname": "Jose M. Alvarez", "url": "http://cvpr.thecvf.com/api/miniconf/users/73959?format=json", "institution": "NVIDIA"}, {"id": 87488, "fullname": "Miaomiao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87488?format=json", "institution": "Australian National University"}], "abstract": "This paper addresses the problem of dynamic scene surface reconstruction using Gaussian Splatting (GS), aiming to recover temporally consistent geometry. While existing GS-based dynamic surface reconstruction methods can yield superior reconstruction, they are typically limited to either a single object or objects with only small deformations, struggling to maintain temporally consistent surface reconstruction of large deformations over time. We propose ``4DSurf'', a novel and unified framework for generic dynamic surface reconstruction that does not require specifying the number or types of objects in the scene, can handle large surface deformations and temporal inconsistency in reconstruction. The key innovation of our framework is the introduction of Gaussian deformations induced Signed Distance Function Flow Regularization that constrains the motion of Gaussians to align with the evolving surface. To handle large deformations, we introduce an Overlapping Segment Partitioning strategy that divides the sequence into overlapping segments with small deformations and incrementally passes geometric information across segments through the shared overlapping timestep. Experiments on two challenging dynamic scene datasets, Hi4D and CMU Panoptic, demonstrate that our method outperforms state-of-the-art surface reconstruction methods by 49\\% and 19\\% in Chamfer distance, respectively, and achieves superior temporal consistency under sparse-view settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37019", "url": null, "sourceid": 34870, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37026, "uid": "b5403b2d202b8fe1db69b68b2c0c5e2b", "name": "G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval", "authors": [{"id": 182108, "fullname": "jiyoung lim", "url": "http://cvpr.thecvf.com/api/miniconf/users/182108?format=json", "institution": "Sungkyunkwan University"}, {"id": 182111, "fullname": "Heejae Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182111?format=json", "institution": "Sungkyunkwan University"}, {"id": 129882, "fullname": "Jee-Hyong Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/129882?format=json", "institution": "Sungkyunkwan University"}], "abstract": "Composed Image Retrieval (CIR) aims to retrieve target images by integrating a reference image with a corresponding modification text. CIR requires jointly considering the explicit semantics specified in the query and the implicit semantics embedded within its bi-modal composition. Recent training-free zero-shot CIR (ZS-CIR) methods leverage Multimodal Large Language Models (MLLMs) to generate detailed target descriptions, converting the implicit information into explicit textual expressions. However, these methods rely heavily on the textual modality and fail to capture the fuzzy retrieval nature that requires considering diverse combinations of candidates. This leads to reduced diversity and accuracy in retrieval results. To address this limitation, we propose a novel training-free method, Geodesic Mixup-based Implicit semantic eXpansion and Explicit semantic Re-ranking for ZS-CIR (G-MIXER). G-MIXER constructs composed query features that reflect the implicit semantics of reference image-text pairs through geodesic mixup over a range of mixup ratios, and builds a diverse candidate set. The generated candidates are then re-ranked using explicit semantics derived from MLLMs, improving both retrieval diversity and accuracy. Our proposed G-MIXER achieves state-of-the-art performance across multiple ZS-CIR benchmarks, effectively handling both implicit and explicit semantics without additional training.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37026", "url": "https://github.com/maya0395/gmixer", "sourceid": 30978, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37025, "uid": "ec1e28295c9fe4e013bbc4c45e2ed0ea", "name": "C-GenReg: Training-Free 3D Point Cloud Registration by Multi-View-Consistent Geometry-to-Image Generation with Probabilistic Modalities Fusion", "authors": [{"id": 127794, "fullname": "Yuval Haitman", "url": "http://cvpr.thecvf.com/api/miniconf/users/127794?format=json", "institution": "Ben-Gurion University of the Negev"}, {"id": 186518, "fullname": "Amit Efraim", "url": "http://cvpr.thecvf.com/api/miniconf/users/186518?format=json", "institution": "Sami Shamoon College of Engineering"}, {"id": 186519, "fullname": "Joseph M. Francos", "url": "http://cvpr.thecvf.com/api/miniconf/users/186519?format=json", "institution": "Ben Gurion University of the Negev"}], "abstract": "We introduce C-GenReg, a training-free framework for 3D point cloud registration that leverages the complementary strengths of world-scale generative priors and registration-oriented Vision Foundation Models (VFMs). Current learning-based 3D  point cloud registration methods struggle to generalize across sensing modalities, sampling differences, and environments. Hence, C-GenReg augments the geometric point cloud registration branch by transferring the matching problem into an auxiliary image domain, where VFMs excel, using a World Foundation Model to synthesize multi-view-consistent RGB representations from the input geometry. This generative transfer,  preserves spatial coherence across source and target views without any fine-tuning. From these generated views, a VFM pretrained for finding dense correspondences extracts matches. The resulting pixel correspondences are lifted back to 3D via the original depth maps. To further enhance robustness, we introduce a \u201cMatch-then-Fuse\u201d probabilistic cold-fusion scheme that combines two independent correspondence posteriors, that of the generated-RGB branch with that of the raw geometric branch. This principled fusion preserves each modality inductive bias and provides calibrated confidence without any additional learning. C-GenReg is zero-shot and plug-and-play: all modules are pretrained and operate without fine-tuning. Extensive experiments on indoor (3DMatch, ScanNet) and outdoor (Waymo) benchmarks demonstrate strong zero-shot performance and superior cross-domain generalization. For the first time, we demonstrate a generative registration framework that operates successfully on real outdoor LiDAR data, where no imagery data is available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37025", "url": null, "sourceid": 44627, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37028, "uid": "5f795ee135dd0eab5014a56ff3e47df7", "name": "WarpTracker: Tracking by Warping instead of Correlation", "authors": [{"id": 98911, "fullname": "Zihang Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/98911?format=json", "institution": "University of Oxford"}, {"id": 186521, "fullname": "Eldar Insafutdinov", "url": "http://cvpr.thecvf.com/api/miniconf/users/186521?format=json", "institution": "Amazon"}, {"id": 186522, "fullname": "Edgar Sucar", "url": "http://cvpr.thecvf.com/api/miniconf/users/186522?format=json", "institution": "University of Oxford"}, {"id": 85624, "fullname": "Andrea Vedaldi", "url": "http://cvpr.thecvf.com/api/miniconf/users/85624?format=json", "institution": "University of Oxford"}], "abstract": "Dense point tracking is a fundamental problem in computer vision, with applications ranging from video analysis to robotic manipulation. State-of-the-art trackers typically rely on cost volumes to match features across frames, but this approach incurs quadratic complexity in spatial resolution, limiting its scalability and efficiency. In this paper, we propose WarpTracker, a novel dense point tracker that eschews cost volumes in favor of warping. Inspired by recent advances in optical flow, our approach iteratively refines track estimates by warping features from the target frame to the query frame based on the current estimate. Combined with a transformer architecture that performs joint spatiotemporal reasoning across all tracks, our design established long-range correspondences without computing feature correlation. Our model is simple and achieves state-of-the-art performance on standard dense point tracking benchmarks, including TAP-Vid-DAVIS, TAP-Vid-Kinetics, and Robo-TAP. Remarkably, the model also excels at optical flow, sometimes outperforming specialized methods on the Sintel, KITTI, and Spring benchmarks. These results suggest that warping-based architectures can unify dense point tracking and optical flow estimation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37028", "url": null, "sourceid": 31356, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37039, "uid": "1acc40b7524b3a02423b562ffbd13ff6", "name": "Physically Inspired Gaussian Splatting for HDR Novel View Synthesis", "authors": [{"id": 155911, "fullname": "Huimin Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/155911?format=json", "institution": "University of Science and Technology of China"}, {"id": 87602, "fullname": "Yue Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/87602?format=json", "institution": "Northeastern University"}, {"id": 169884, "fullname": "hailing wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/169884?format=json", "institution": "Northeastern University"}, {"id": 86434, "fullname": "Yun Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86434?format=json", "institution": "Northeastern University"}], "abstract": "High dynamic range novel view synthesis (HDR-NVS) reconstructs scenes with dynamic details by fusing multi-exposure low dynamic range (LDR) views, yet it struggles to capture ambient illumination-dependent appearance. Implicitly supervising HDR content by constraining tone-mapped results fails in correcting abnormal HDR values, and results in limited gradients for Gaussians in under/over-exposed regions. To this end,  we introduce PhysHDR-GS,  a physically inspired HDR-NVS framework that models scene appearance via intrinsic reflectance and adjustable ambient illumination.   PhysHDR-GS employs a complementary image-exposure (IE) branch and Gaussian-illumination (GI) branch to faithfully reproduce standard camera observations and capture illumination-dependent appearance changes, respectively. During training, the proposed cross-branch HDR consistency loss provides explicit supervision for HDR content, while an illumination-guided gradient scaling strategy mitigates exposure-biased gradient starvation and reduces under-densified representations.  Experimental results across realistic and synthetic datasets demonstrate our superiority in reconstructing HDR details (e.g., a PSNR gain of 2.04 dB over HDR-GS), while maintaining real-time rendering speed (up to 76 FPS).", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37039", "url": null, "sourceid": 46380, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37040, "uid": "bb36fd8f142e7fea7041c49f61f75f37", "name": "Progressive mask distillation for self-supervised video representation", "authors": [{"id": 86473, "fullname": "Kewei Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86473?format=json", "institution": "Hefei University of Technology"}, {"id": 181316, "fullname": "Chong Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181316?format=json", "institution": "Hefei University of Technology"}, {"id": 86509, "fullname": "Zhao Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/86509?format=json", "institution": "Hefei University of Technology"}, {"id": 128503, "fullname": "Dan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/128503?format=json", "institution": "Hefei University of Technology"}], "abstract": "Masked visual modeling is a self-supervised learning task that does not use visual annotations. It aims to learn discriminative representations via a mask-reconstruction task. A single mask ratio in reconstruction may fail to capture complex semantics, which motivates dynamic masking strategies. In this work, we propose Progressive Mask Distillation (PMD), which utilizes dynamic mask ratios to facilitate progressive semantic learning from easy to hard. PMD integrates three key components: a progressive student distiller, a difficulty-aware region enhancer, and a cross-layer feature aligner. First, to capture dynamic visual semantics, we design a progressive student distiller that trains multiple student models with progressively increasing mask ratios. The early-phase student (with a low mask ratio) learns easy, low-level semantics from more visible tokens. This learned knowledge then guides the next-phase student (with a higher mask ratio) to capture hard, high-level semantics from fewer visible tokens. This progressive distillation mechanism enhances detail reconstruction at a high mask ratio. Second, to alleviate insufficient learning of semantic regions, we design a difficulty-aware region enhancer. We first smooth the region reconstruction loss to reduce large fluctuations across training epochs. The smoothed loss is then used to learn region-level weights, prioritizing accurate learning of regions with large reconstruction losses. Third, to further bridge the semantic gap across network layers, we design cross-layer feature alignment. This module aligns features across shallow, middle, and deep encoder layers, ensuring that shallow-layer features incorporate semantic information from deeper layers. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the Something-Something V2, Kinetics-400, UCF-101, and HMDB-51 datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37040", "url": null, "sourceid": 43763, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37034, "uid": "a1c20553d015bee2ac59f48832572f7d", "name": "Semantic Context Matters: Improving Conditioning for Autoregressive Models", "authors": [{"id": 143311, "fullname": "Dongyang Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/143311?format=json", "institution": "Southern University of Science and Technology"}, {"id": 186535, "fullname": "Ryan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186535?format=json", "institution": "University of North Carolina at Chapel Hill; Alibaba Group"}, {"id": 103451, "fullname": "Jianhao Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/103451?format=json", "institution": "Tianjin University"}, {"id": 186536, "fullname": "Rui Lan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186536?format=json", "institution": null}, {"id": 186537, "fullname": "Yancheng Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/186537?format=json", "institution": "Alibaba Group"}, {"id": 185770, "fullname": "Lei Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/185770?format=json", "institution": "Alibaba Group"}, {"id": 88278, "fullname": "Xiangxiang Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88278?format=json", "institution": "MeiTuan"}], "abstract": "Recently, autoregressive (AR) models have shown strong potential in image generation, offering better scalability and easier integration with unified multi-modal models compared to diffusion methods.However, extending AR models to controllable image editing remains challenging due to weak and inefficient conditioning strategies, which often lead to suboptimal semantic alignment and visual quality.To address this limitation, we present SCAR, a Semantic-Context-driven method for AutoregRessive models.SCAR introduces Compressed Semantic Prefilling and Semantic Alignment Guidance that jointly enhance contextual understanding and generation coherence. Unlike prior methods that rely on sparse visual tokens or decoding stage injection, SCAR enables strong semantic guidance from the input stage, while remaining model-agnostic and applicable to both next-token and next-scale AR paradigms.Extensive experiments on instruction editing and controllable generation demonstrate that our method significantly improves visual fidelity and semantic alignment, outperforming existing AR-based methods while maintaining controllability.All the code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37034", "url": null, "sourceid": 39716, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37035, "uid": "016a5507d1496235246e6a91b1dc3587", "name": "LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis", "authors": [{"id": 131236, "fullname": "Stanislaw Szymanowicz", "url": "http://cvpr.thecvf.com/api/miniconf/users/131236?format=json", "institution": "University of Oxford, University of Oxford"}, {"id": 128085, "fullname": "Minghao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/128085?format=json", "institution": "University of Oxford"}, {"id": 160682, "fullname": "Jianyuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/160682?format=json", "institution": "Oxford VGG"}, {"id": 129663, "fullname": "Christian Rupprecht", "url": "http://cvpr.thecvf.com/api/miniconf/users/129663?format=json", "institution": "University of Oxford"}, {"id": 85624, "fullname": "Andrea Vedaldi", "url": "http://cvpr.thecvf.com/api/miniconf/users/85624?format=json", "institution": "University of Oxford"}], "abstract": "Novel View Synthesis has often relied on explicit 3D representations, which inject a strong 3D bias in the process; however, recent work has shown that network-based rendering can work better despite lacking 3D inductive biases. In this paper, we show that much better quality can be obtained by leveraging a strong 3D bias without a 3D representation. To do so, we introduce LagerNVS, an encoder-decoder network that uses 3D-aware features as a latent scene encoding. The encoder is initialized from a 3D reconstruction network, paired with a lightweight decoder, and trained end-to-end with photometric losses. LagerNVS achieves state-of-the-art deterministic feed-forward Novel View Synthesis results (including 31.1 PSNR on Re10k), with and without known cameras, renders in real-time, generalizes to in-the-wild data without known cameras, and can be paired with a diffusion decoder for generative completions.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37035", "url": null, "sourceid": 35080, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37036, "uid": "939b9fed93c76ce9339b8aa1b2d5c57c", "name": "AceTone: Bridging Words and Colors for Conditional Image Grading", "authors": [{"id": 180338, "fullname": "Tianren Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/180338?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 181705, "fullname": "Mingxiang Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/181705?format=json", "institution": "ByteDance Limited"}, {"id": 89646, "fullname": "Xijin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89646?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 87065, "fullname": "Qixiang Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/87065?format=json", "institution": "University of Chinese Academy of Sciences"}], "abstract": "Color affects how we interpret image style and emotion. Previous color grading methods rely on patch-wise recoloring or fixed filter banks, struggling to generalize across creative intents or align with human aesthetic preferences. In this study, we propose **AceTone**, the first approach that supports multimodal conditioned color grading within a unified framework. AceTone formulates grading as a generative color transformation task, where a model directly produces 3D-LUTs conditioned on text prompts or reference images. We develop a VQ-VAE-based tokenizer which compresses a $3\\times32^3$ LUT vector to 64 discrete tokens with $\\Delta \\text{E}<2$ fidelity. We further build a large-scale dataset, AceTone-800K, and train a vision-language model to predict LUT tokens, followed by reinforcement learning to align outputs with perceptual fidelity and aesthetics. Experiments show that AceTone achieves state-of-the-art performance on both text-guided and reference-guided grading tasks, improving LPIPS by up to **50%** over existing methods. Human evaluations confirm that AceTone\u2019s results are visually pleasing and stylistically coherent, demonstrating a new pathway toward language-driven, aesthetic-aligned color grading. The models and datasets will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37036", "url": null, "sourceid": 33020, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37038, "uid": "9a2bdfc68ce772a6ea10a7e6fb71632d", "name": "Ultra-Low Bitrate Perceptual Image Compression with Shallow Encoder", "authors": [{"id": 181225, "fullname": "Tianyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181225?format=json", "institution": "University of Science and Technology of China"}, {"id": 87804, "fullname": "Dong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87804?format=json", "institution": "University of Science and Technology of China"}, {"id": 72164, "fullname": "Chang-Wen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/72164?format=json", "institution": "The Hong Kong Polytechnic University"}], "abstract": "Ultra-low bitrate image compression (below 0.05 bits per pixel) is increasingly critical for bandwidth-constrained and computation-limited encoding scenarios such as edge devices. Existing frameworks typically rely on large pretrained encoders (e.g., VAEs or tokenizer-based models) and perform transform coding within their generative latent space. While these approaches achieve impressive perceptual fidelity, their reliance on heavy encoder networks makes them unsuitable for deployment on weak sender devices. In this work, we explore the feasibility of applying shallow encoders for ultra-low bitrate compression and propose a novel Asymmetric Extreme Image Compression (AEIC) framework that pursues simultaneously encoding simplicity and decoding quality. Specifically, AEIC employs moderate or even shallow encoder networks, while leveraging an one-step diffusion decoder to maintain high-fidelity and high-realism reconstructions under extreme bitrates. To further enhance the efficiency of shallow encoders, we design a dual-side feature distillation scheme that transfers knowledge from AEIC with moderate encoders to its shallow encoder variants. Experiments demonstrate that AEIC not only outperforms existing methods on rate-distortion-perception performance at ultra-low bitrates, but also delivers exceptional encoding efficiency for 35.8 FPS on 1080P input images, while maintaining competitive decoding speed compared to existing methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37038", "url": null, "sourceid": 44142, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37041, "uid": "0a0750e9f4fb9e9bdc4c9ef6b75a532a", "name": "MotionCrafter: Repurposing Video Generators for Dense Geometry and Motion Reconstruction", "authors": [{"id": 181160, "fullname": "Ruijie Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181160?format=json", "institution": "Nanyang Technological University"}, {"id": 186540, "fullname": "Jiahao Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186540?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 104852, "fullname": "Wenbo Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/104852?format=json", "institution": "Tencent ARC Lab"}, {"id": 88683, "fullname": "Xiaoguang Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/88683?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 126993, "fullname": "Jianfei Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/126993?format=json", "institution": "Monash University"}, {"id": 84809, "fullname": "Ying Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/84809?format=json", "institution": "Tencent"}, {"id": 180810, "fullname": "Chuanxia Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180810?format=json", "institution": "University of Oxford"}], "abstract": "We introduce MotionCrafter, the first video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense motion from a monocular video. To represent them effectively in latent space, we propose a 4D VAE that encodes point maps and scene flows as a unified latent compatible with pretrained video generators. Unlike prior work that forces the 3D value and latents to align strictly with RGB VAE latents\u2014despite their fundamentally different distributions\u2014we show that such alignment is unnecessary and leads to suboptimal performance. Instead, we introduce a new data normalization and VAE training strategy that better transfers diffusion priors and greatly improves reconstruction quality. Extensive experiments across multiple datasets demonstrate that MotionCrafter achieves state-of-the-art performance in joint 4D geometry reconstruction and dense scene flow estimation, delivering 38.64\\% and 25.0\\% improvements in geometry and motion reconstruction, respectively, all without any post-optimization.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37041", "url": null, "sourceid": 45347, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37043, "uid": "03f632e8f60a478dfc4f8c8c82f5e8bb", "name": "Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation", "authors": [{"id": 157688, "fullname": "Ruoyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/157688?format=json", "institution": "Chinese Academy of Sciences"}, {"id": 186546, "fullname": "Xiaoqing Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186546?format=json", "institution": "Hong Kong Baptist University"}, {"id": 186547, "fullname": "Kangwei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186547?format=json", "institution": "Chinese Academy of Sciences"}, {"id": 126234, "fullname": "Siyuan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126234?format=json", "institution": "National University of Singapore"}, {"id": 157689, "fullname": "Shiming Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157689?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 186548, "fullname": "Qunli Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186548?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 186549, "fullname": "Laiyuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186549?format=json", "institution": null}, {"id": 157692, "fullname": "Hua Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157692?format=json", "institution": "Chinese Academy of Sciences"}, {"id": 126239, "fullname": "Xiaochun Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/126239?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood, limiting interpretability and reliability. In this work, we present EAGLE, a lightweight black-box framework for explaining autoregressive token generation in MLLMs. EAGLE attributes any selected tokens to compact perceptual regions while quantifying the relative influence of language priors and perceptual evidence. The framework introduces an objective function that unifies sufficiency (insight score) and indispensability (necessity score), optimized via greedy search over sparsified image regions for faithful and efficient attribution. Beyond spatial attribution, EAGLE performs modality-aware analysis that disentangles what tokens rely on, providing fine-grained interpretability of model decisions. Extensive experiments across open-source MLLMs show that EAGLE consistently outperforms existing methods in faithfulness, localization, and hallucination diagnosis, while requiring substantially less GPU memory. These results highlight its effectiveness and practicality for advancing the interpretability of MLLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37043", "url": null, "sourceid": 40629, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37044, "uid": "d89775f1bee30df5043cf5673a197ce0", "name": "Controllable Federated Prompt Learning at Test Time", "authors": [{"id": 178166, "fullname": "Rui Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/178166?format=json", "institution": "National University of Defense Technology"}, {"id": 186550, "fullname": "Liang Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/186550?format=json", "institution": "National University of Defense Technology"}, {"id": 186551, "fullname": "Yanming Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186551?format=json", "institution": "National University of Defense Technology"}, {"id": 186552, "fullname": "Yirun Ruan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186552?format=json", "institution": "National University of Defense Technology"}, {"id": 186553, "fullname": "Tianyuan Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186553?format=json", "institution": "National University of Defense Technology"}, {"id": 77203, "fullname": "Zhihe Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/77203?format=json", "institution": "Hamad Bin Khalifa University"}], "abstract": "Federated Prompt Learning (FPL) has recently attracted increasing attention for its ability to leverage large-scale vision-language models such as CLIP within federated learning frameworks. While existing studies have advanced FPL through personalization strategies to enhance client-specific performance, personalized models often suffer severe degradation when deployed across unseen domains due to distribution shifts.In this paper, we take the first step toward exploring Test-Time FPL (TTFPL), aiming to bridge the cross-domain performance gap with minimal effort, requiring only unlabeled target-domain data. We propose COTE, a tri-prompt controllable TTFPL framework that dynamically balances three complementary prompts: the global prompt from standard FPL, the local prompt from personalized FPL, and the frozen CLIP prompt.Specifically, we introduce a novel confidence-guided Model-Data Alignment (MoDA) metric in COTE that quantifies alignment at both macro and micro levels, capturing the consistency between model predictions and data distributions. By integrating MoDA with model confidence, COTE adaptively adjusts the contribution of each prompt at test time, enabling robust generalization across heterogeneous clients and unseen domains without requiring labeled data.Extensive experiments on multiple benchmark datasets demonstrate that our method consistently improves target-domain performance, setting a new direction for adaptive FPL.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37044", "url": null, "sourceid": 32364, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37045, "uid": "8e8d89872be72005d16f13bc59c81296", "name": "Query2Uncertainty: Robust Uncertainty Quantification and Calibration for 3D Object Detection under Distribution Shift", "authors": [{"id": 181686, "fullname": "Till Beemelmanns", "url": "http://cvpr.thecvf.com/api/miniconf/users/181686?format=json", "institution": "Rheinisch Westf\u00e4lische Technische Hochschule Aachen"}, {"id": 154034, "fullname": "Alexey Nekrasov", "url": "http://cvpr.thecvf.com/api/miniconf/users/154034?format=json", "institution": "RWTH Aachen University"}, {"id": 175309, "fullname": "Stefan Vilceanu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175309?format=json", "institution": "RWTH Aachen University"}, {"id": 186554, "fullname": "Jonas Steinhaus", "url": "http://cvpr.thecvf.com/api/miniconf/users/186554?format=json", "institution": "Rheinisch Westf\u00e4lische Technische Hochschule Aachen"}, {"id": 186555, "fullname": "Timo Woopen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186555?format=json", "institution": "Institute for Automotive Engineering (ika) - RWTH Aachen University"}, {"id": 75750, "fullname": "Bastian Leibe", "url": "http://cvpr.thecvf.com/api/miniconf/users/75750?format=json", "institution": "RWTH Aachen University"}, {"id": 186556, "fullname": "Lutz Eckstein", "url": "http://cvpr.thecvf.com/api/miniconf/users/186556?format=json", "institution": "Rheinisch Westf\u00e4lische Technische Hochschule Aachen"}], "abstract": "Reliable uncertainty estimation for 3D object detection is critical for deploying safe autonomous systems, yet modern detectors remain poorly calibrated, especially under distribution shifts.Although post-hoc calibration methods address this issue and provide improved calibration for in-distribution tests, they fail to adapt in distribution-shifted scenarios.In this work, we address this issue and introduce a density-aware calibration method that couples post-hoc calibrators with the feature density of latent object queries from DETR-style 3D object detectors.These queries form a compact, location and class-aware feature, ideal for density estimation, allowing our approach to adjust model confidences in distribution-shift scenarios.By fitting a density estimator on these query features, our approach jointly recalibrates both classification and bounding box regression uncertainties.On both a multi-view camera and LiDAR-based detector, our approach consistently outperforms standard post-hoc methods in both in-distribution and distribution-shifted scenarios.Our code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37045", "url": null, "sourceid": 45525, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37046, "uid": "79ce0b052347c6b2e676fbffa784c873", "name": "See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding", "authors": [{"id": 72927, "fullname": "Bo-Yuan Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/72927?format=json", "institution": "Nankai University"}, {"id": 131112, "fullname": "Bo-Wen Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/131112?format=json", "institution": "Nankai University"}, {"id": 150559, "fullname": "Yuan-Ming Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/150559?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 154073, "fullname": "Xihan Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/154073?format=json", "institution": "Alibaba Group"}, {"id": 90664, "fullname": "Qibin Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/90664?format=json", "institution": "Nankai University"}], "abstract": "We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding solely from textual prompts. Unlike existing approaches that require explicit visual prompts, such as masks or points, SWIM leverages mask supervision only during training to guide cross-modal attention, allowing the model to automatically attend to the user-specified object at inference.Our cross-attention analysis of pretrained multimodal large language models (MLLMs) reveals a systematic discrepancy: Attribute words produce sharp, localized activations in the visual modality, whereas object nouns yield diffuse and scattered patterns due to semantic reference bias and distributed high-level representations. To address this misalignment, we construct NL-Refer, an enriched dataset, in which each object mask is paired with a precise natural language referring expression. SWIM extracts multi-layer cross-attention maps from object nouns and enforces spatial consistency with ground-truth masks.  Experimental results demonstrate that SWIM substantially improves text\u2013visual alignment and achieves superior performance over visual-prompt-based methods on fine-grained object understanding benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37046", "url": null, "sourceid": 37941, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37056, "uid": "50cd562760a5819b07a9da2239046ef0", "name": "Semantic Derivative Flow: Graph-Guided Diffusion for Controllable Instance Interactions", "authors": [{"id": 156677, "fullname": "Shibin Mei", "url": "http://cvpr.thecvf.com/api/miniconf/users/156677?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 91038, "fullname": "Hang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91038?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 86840, "fullname": "Bingbing Ni", "url": "http://cvpr.thecvf.com/api/miniconf/users/86840?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Despite remarkable progress in text-to-image diffusion models, controlling the semantic and spatial relationships between interacting instances remains a fundamental challenge. Current methods that inject spatial constraints often fail to model the intrinsic functional dependencies between entities, leading to implausible interactions. In this paper, we introduce Semantic Derivative Flow (SDF), a novel graph-guided framework that structures the diffusion process within a directed acyclic interaction graph. Our core innovation is a theoretically-motivated derivative attention mechanism, which explicitly enforces the semantic representation of a predicate to be derived from its subject, and the object from the predicate, formalizing a differentiable semantic graph. This principled approach compels the generative process to adhere to the logical chain of interaction. We further integrate a global context node and a real-time regional refinement module to ground the graph in the visual domain holistically. Extensive experiments demonstrate that our model, an instantiation of SDF, establishes a new state-of-the-art in fidelity and controllability on the HICODet benchmark. We complement our empirical results with a theoretical analysis, framing our method as structured message passing on interaction graphs, which provides a rigorous justification for its efficacy and generalization benefits.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37056", "url": null, "sourceid": 46479, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37059, "uid": "3e1a626a9e0c39eb548886c253bf9644", "name": "HDR-VLM: HDR-Domain Adaptation of VLMs and Preference-Aligned Quality Assessment for HDR Video Color Grading", "authors": [{"id": 147813, "fullname": "Hao Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/147813?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 182915, "fullname": "Jiabin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182915?format=json", "institution": "Alibaba"}, {"id": 186581, "fullname": "Yajing Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186581?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 186582, "fullname": "Ruixuan Pang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186582?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 183393, "fullname": "Jing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183393?format=json", "institution": "Alibaba Group"}], "abstract": "Color grading is central to High Dynamic Range (HDR) video production, shaping the perceptual tone, contrast, and luminance of content across diverse displays. However, evaluating HDR color grading quality is particularly difficult due to its semantic, content-dependent nature and the lack of large-scale annotated data. While pre-trained Vision\u2013Language Models (VLMs) offer strong semantic priors and generalization ability, their exposure is limited to Standard Dynamic Range (SDR) data, making them poorly equipped to handle HDR photometry and perceptual nuances. We propose HDR-VLM, the first method to adapt a VLM to the HDR domain for perceptual quality assessment. Specifically, HDR-VLM employs a two-stage design: it first bridges the domain gap using a unified HLG-based encoding and progressive adaptation; then it aligns model assessments with noisy, multi-scale human preferences via reinforcement learning with curriculum-inspired rewards. Experiments on a real-world, production-sourced HDR dataset show that HDR-VLM not only outperforms existing quality assessment methods but also produces interpretable attribution rationales. These rationales offer actionable guidance for content creators, enhancing the reliability and transparency of automated HDR quality evaluation.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37059", "url": null, "sourceid": 36873, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37062, "uid": "8fe69eac5027c59a1f7e4fba73cee0db", "name": "Decouple Your Discovery and Memory in Continual Generalized Category Discovery", "authors": [{"id": 186590, "fullname": "Jiawei Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186590?format=json", "institution": "National University of Defense Technology"}, {"id": 154699, "fullname": "Zijian Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/154699?format=json", "institution": "National University of Defense Technology"}, {"id": 154700, "fullname": "Xingxing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154700?format=json", "institution": "Tsinghua University"}, {"id": 186591, "fullname": "Xuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186591?format=json", "institution": "Hunan University"}, {"id": 154705, "fullname": "Huaimin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154705?format=json", "institution": "National University of Defense Technology"}, {"id": 153556, "fullname": "Kele Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153556?format=json", "institution": "National University of Defense Technology"}], "abstract": "Continual Generalized Category Discovery (C-GCD) seeks to incrementally discover new categories from unlabeled data and memorize old categories\u2019 knowledge, fostering model adaptability in real-world scenarios. Especially, the unlabeled data is from both old and new classes, requiring the model to recognize previously learned classes while discovering. In response, recent efforts focus on devising specific frameworks and various anti-forgetting strategies, striving for a typical stability-plasticity trade-off. Unlike previous studies, in this work, we first revisit these methods and identify that most of these methods over-protect old classes, hampering the accurate discovery of novel ones. To address this challenge, we introduce the Decouple Your Discovery and Memory (DYDM), a dual-branch architecture that decouples the discovery of new classes and the memorization of old classes. The discovery branch is focused on accurately recognizing new classes, while the memory branch consolidates all identified categories in a recursive manner and functions as the inference branch. Importantly, benefiting from the strong knowledge retention ability of the memory branch, the discovery branch can facilitate the recognition of novel classes from the unlabeled data, achieving a win-win outcome between plasticity and stability. Extensive experiments on various datasets and settings demonstrate the superiority of our approach, achieving leads of up to 9.87%, 7.30%, 3.18%, and 8.25%. Furthermore, our framework can integrate with existing approaches, consistently enhancing their performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37062", "url": null, "sourceid": 31408, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37066, "uid": "127449db06658be5e1bc1cd51bde8b78", "name": "AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models", "authors": [{"id": 133782, "fullname": "Zhen Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/133782?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy"}, {"id": 96134, "fullname": "Xian Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/96134?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 103229, "fullname": "Xiaoyi Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/103229?format=json", "institution": "CASIA"}, {"id": 186595, "fullname": "Dingrong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186595?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 156345, "fullname": "ShiChen Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156345?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 156347, "fullname": "Zhengtao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156347?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 89943, "fullname": "Xingang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89943?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}], "abstract": "Large multimodal models (LMMs) exhibit strong task-generalization capabilities, offering new opportunities for zero-shot visual anomaly segmentation (ZSAS). However, existing LMM-based segmentation still faces fundamental limitations: anomaly semantics are scarce and unstructured, and the weak alignment between textual prompts and visual features makes accurate anomaly localization difficult.To address these challenges, we present AG-VAS (Anchor-Guided Visual Anomaly Segmentation), a new framework that expands the LMM vocabulary with three learnable semantic anchors\u2014[SEG], [NOR], and [ANO]\u2014and introduces a unified anchor-guided segmentation paradigm. Specifically, [SEG] functions as an absolute semantic anchor that injects pixel-level structural priors into LMMs, while [NOR] and [ANO] serve as relative semantic anchors that encode the contrastive semantics between normality and abnormality across categories. To further enhance alignment, we introduce a Semantic-Pixel Alignment Module (SPAM) that bridges the gap between the LMM semantic space and high-resolution visual features, and design an Anchor-Guided Mask Decoder (AGMD) that performs anchor-consistent querying for precise anomaly localization.In addition, we construct Anomaly-Instruct20K, a large-scale instruction dataset that provides structured anomaly knowledge\u2014including appearance, shape, and spatial attributes\u2014to help LMMs effectively learn and integrate the proposed semantic anchors. Extensive experiments on six industrial and medical benchmarks demonstrate that AG-VAS achieves consistent state-of-the-art performance in the zero-shot setting. Code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37066", "url": null, "sourceid": 31176, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37067, "uid": "860e39bc33e7c4ca4c26ec67979cc290", "name": "Point4Cast: Streaming Dynamic Scene Reconstruction and Forecasting", "authors": [{"id": 91898, "fullname": "Xinhang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/91898?format=json", "institution": "HKUST"}, {"id": 97201, "fullname": "Pedro Miraldo", "url": "http://cvpr.thecvf.com/api/miniconf/users/97201?format=json", "institution": "Mitsubishi Electric Research Laboratories"}, {"id": 88004, "fullname": "Suhas Lohit", "url": "http://cvpr.thecvf.com/api/miniconf/users/88004?format=json", "institution": "Mitsubishi Electric Research Labs"}, {"id": 127396, "fullname": "Huaizu Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127396?format=json", "institution": "Northeastern University"}, {"id": 186596, "fullname": "Naoko Sawada", "url": "http://cvpr.thecvf.com/api/miniconf/users/186596?format=json", "institution": "Mitsubishi Electric Corporation"}, {"id": 87547, "fullname": "Yu-Wing Tai", "url": "http://cvpr.thecvf.com/api/miniconf/users/87547?format=json", "institution": "Dartmouth College"}, {"id": 88735, "fullname": "Chi-Keung Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88735?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 126514, "fullname": "Moitreya Chatterjee", "url": "http://cvpr.thecvf.com/api/miniconf/users/126514?format=json", "institution": "Mitsubishi Electric Research Labs"}], "abstract": "Understanding how the 3D world evolves over time is a fundamental task in computer vision, essential for embodied settings, autonomous driving, etc. It requires not only the reconstruction of the observed scene but also the anticipation of how the scene dynamics will unfold in the future. While the area of 3D reconstruction has progressed rapidly with the advent of recent feed-forward neural networks, forecasting future dynamics in 3D, given the 2D frames of a video remains unexplored. We present Point4Cast, a unified framework that processes streaming 2D frame sequences of a video to estimate the past, present, and future of the underlying dynamic scene, in 3D. At the core of our approach lies a persistently evolving latent \\emph{spacetime representation} that models the environment\u2019s evolution across time. Upon receiving a new 2D frame, an update operation integrates the incoming evidence to refine the latent spacetime representation. When queried for any time instant, whether before, at, or beyond the timestamp of the last update.  A readout procedure predicts temporally conditioned point maps and camera parameters describing the scene geometry at the queried time. Unlike prior approaches for online dynamic scene reconstruction that estimate each frame\u2019s point map solely at the timestamp of the last observed frame, Point4Cast achieves coherent reconstruction across any queried time. Empirical evaluations show that \\emph{Point4Cast} achieves state-of-the-art performance on streaming dynamic scene reconstruction and forecasting benchmarks, across multiple challenging datasets, while providing scene flow estimation and forecasting for free. The code will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37067", "url": null, "sourceid": 42776, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37068, "uid": "f1ee743dc0a992a08cf2e192c586168c", "name": "Masking Teacher and Reinforcing Student for Distilling Vision-Language Models", "authors": [{"id": 184229, "fullname": "Byung-Kwan Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/184229?format=json", "institution": "NVIDIA"}, {"id": 89818, "fullname": "Yu-Chiang Frank Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89818?format=json", "institution": "NVIDIA"}, {"id": 86491, "fullname": "Ryo Hachiuma", "url": "http://cvpr.thecvf.com/api/miniconf/users/86491?format=json", "institution": "Konica Minolta, Inc."}], "abstract": "Large-scale vision\u2013language models (VLMs) have recently achieved remarkable multimodal understanding, but their massive size makes them impractical for deployment on mobile or edge devices. This raises the need for compact yet capable VLMs that can efficiently learn from powerful large teacher. However, distilling knowledge from large teacher to small student remains challenging due to their large size gap: the student often fails to reproduce the teacher's complex, high-dimensional representations, leading to unstable learning and degraded performance. To address this, we propose Masters (Masking teacher and reinforcing student), a mask-progressive reinforcement learning (RL) distillation framework. Masters first masks and non-dominant weights of the teacher to reduce unnecessary complexity, then progressively restores the teacher from mask to gradually increase the teacher capacity during training. This strategy allows the student to learn richer representations of teacher in a smooth and stable manner. To further refine knowledge transfer, Masters integrates an offline RL stage with two complementary rewards: an accuracy reward that measures the correctness of the generated responses, and a distillation reward that quantifies the ease of their responses' transferability from teacher to student. Unlike online think\u2013answer RL paradigms that are computationally expensive and generate lengthy responses, our offline RL leverages pre-generated responses from masked teachers. These provide rich yet efficient guidance, enabling the students to achieve strong performance without requiring the think\u2013answer process. Extensive experiments across diverse VLM benchmarks demonstrate that Masters outperforms existing compact VLMs and partially surpasses large ones, while being far more efficient. Moreover, gradually increasing the teacher sizes during distillation (e.g., from 14B to 38B) yields smoother convergence and stronger generalization than one-shot distillation (e.g., 38B), revealing a scalable path toward efficient and deployable VLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37068", "url": null, "sourceid": 34230, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37070, "uid": "9304e840061f69214ea586ac8c0c7c65", "name": "AToken: A Unified Tokenizer for Vision", "authors": [{"id": 150987, "fullname": "Jiasen Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/150987?format=json", "institution": "Apple"}, {"id": 162151, "fullname": "Liangchen Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/162151?format=json", "institution": "Apple"}, {"id": 165732, "fullname": "Mingze Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/165732?format=json", "institution": "Adobe Firefly"}, {"id": 136043, "fullname": "Byeongjoo Ahn", "url": "http://cvpr.thecvf.com/api/miniconf/users/136043?format=json", "institution": "Apple"}, {"id": 132966, "fullname": "Yanjun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/132966?format=json", "institution": "Apple"}, {"id": 186598, "fullname": "Chen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186598?format=json", "institution": "Apple"}, {"id": 151014, "fullname": "Afshin Dehghan", "url": "http://cvpr.thecvf.com/api/miniconf/users/151014?format=json", "institution": "Apple"}, {"id": 84650, "fullname": "Yinfei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84650?format=json", "institution": "Apple"}], "abstract": "We present AToken, the first unified visual tokenizer that achieves both high-fidelity reconstruction and semantic understanding across images, videos, and 3D assets. Unlike existing tokenizers that specialize in either reconstruction or understanding for single modalities, AToken encodes these diverse visual inputs into a shared 4D latent space, unifying both tasks and modalities in a single framework. Specifically, we introduce a pure transformer architecture with 4D rotary position embeddings to process visual inputs of arbitrary resolutions and temporal durations. To ensure stable training, we introduce an adversarial-free training objective that combines perceptual and Gram matrix losses, achieving state-of-the-art reconstruction quality. By employing a progressive training curriculum, AToken gradually expands from single images, videos, and 3D, and supports both continuous and discrete latent tokens. AToken achieves 0.21 rFID with 82.2% ImageNet accuracy for images, 3.01 rFVD with 40.2% MSRVTT retrieval for videos, and 28.28 PSNR with 90.9% classification accuracy for 3D.. In downstream applications, AToken enables both visual generation tasks (e.g., image generation with continuous and discrete tokens, text-to-video generation, image-to-3D synthesis) and understanding tasks (e.g., multimodal LLMs), achieving competitive performance across all benchmarks. These results shed light on the next-generation multimodal AI systems built upon unified visual tokenization.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37070", "url": null, "sourceid": 43578, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40282?format=json"], "related_events_ids": [40282]}, {"id": 40282, "uid": "9304e840061f69214ea586ac8c0c7c65", "name": "AToken: A Unified Tokenizer for Vision", "authors": [{"id": 150987, "fullname": "Jiasen Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/150987?format=json", "institution": "Apple"}, {"id": 162151, "fullname": "Liangchen Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/162151?format=json", "institution": "Apple"}, {"id": 165732, "fullname": "Mingze Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/165732?format=json", "institution": "Adobe Firefly"}, {"id": 136043, "fullname": "Byeongjoo Ahn", "url": "http://cvpr.thecvf.com/api/miniconf/users/136043?format=json", "institution": "Apple"}, {"id": 132966, "fullname": "Yanjun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/132966?format=json", "institution": "Apple"}, {"id": 186598, "fullname": "Chen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186598?format=json", "institution": "Apple"}, {"id": 151014, "fullname": "Afshin Dehghan", "url": "http://cvpr.thecvf.com/api/miniconf/users/151014?format=json", "institution": "Apple"}, {"id": 84650, "fullname": "Yinfei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84650?format=json", "institution": "Apple"}], "abstract": "We present AToken, the first unified visual tokenizer that achieves both high-fidelity reconstruction and semantic understanding across images, videos, and 3D assets. Unlike existing tokenizers that specialize in either reconstruction or understanding for single modalities, AToken encodes these diverse visual inputs into a shared 4D latent space, unifying both tasks and modalities in a single framework. Specifically, we introduce a pure transformer architecture with 4D rotary position embeddings to process visual inputs of arbitrary resolutions and temporal durations. To ensure stable training, we introduce an adversarial-free training objective that combines perceptual and Gram matrix losses, achieving state-of-the-art reconstruction quality. By employing a progressive training curriculum, AToken gradually expands from single images, videos, and 3D, and supports both continuous and discrete latent tokens. AToken achieves 0.21 rFID with 82.2% ImageNet accuracy for images, 3.01 rFVD with 40.2% MSRVTT retrieval for videos, and 28.28 PSNR with 90.9% classification accuracy for 3D.. In downstream applications, AToken enables both visual generation tasks (e.g., image generation with continuous and discrete tokens, text-to-video generation, image-to-3D synthesis) and understanding tasks (e.g., multimodal LLMs), achieving competitive performance across all benchmarks. These results shed light on the next-generation multimodal AI systems built upon unified visual tokenization.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40282", "url": null, "sourceid": -43578, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37070?format=json"], "related_events_ids": [37070]}, {"id": 37071, "uid": "1c59a3c21537275e625bc9fb5d3338d7", "name": "IDESplat: Iterative Depth Probability Estimation for Generalizable 3D Gaussian Splatting", "authors": [{"id": 103351, "fullname": "Wei Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/103351?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 154038, "fullname": "Haifeng Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154038?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 143243, "fullname": "SHIYIN JIANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/143243?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 138953, "fullname": "Jinhua Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/138953?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 186599, "fullname": "Xinchun Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/186599?format=json", "institution": "Chinese Academy of Sciences"}, {"id": 86553, "fullname": "Shuhang Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86553?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Generalizable 3D Gaussian Splatting aims to directly predict Gaussian parameters using a feed-forward network for scene reconstruction. Among these parameters, Gaussian means are particularly difficult to predict, so depth is usually estimated first and then unprojected to obtain the Gaussian sphere centers. Existing methods typically rely solely on a single warp to estimate depth probability, which hinders their ability to fully leverage cross-view geometric cues, resulting in unstable and coarse depth maps. To address this limitation, we propose IDESplat, which iteratively applies warp operations to boost depth probability estimation for accurate Gaussian mean prediction. First, to eliminate the inherent instability of a single warp, we introduce a Depth Probability Boosting Unit (DPBU) that integrates multi-level epipolar attention maps in a multiplicative manner. Next, we construct an iterative depth estimation process by stacking multiple DPBUs, progressively identifying potential depth candidates with high likelihood. As IDESplat iteratively updates the depth candidates and boosts probability estimation, the depth map is refined, resulting in accurate Gaussian means. Finally, for the other Gaussian parameters, we design a Gaussian Focused Module (GFM) to determine the most relevant Gaussian tokens for feature interaction. We conduct experiments on RealEstate10K and ACID. IDESplat achieves outstanding reconstruction quality and state-of-the-art performance with real-time efficiency. On RE10K, it outperforms DepthSplat by 0.33 dB in PSNR, using only 10.7% of the parameters and 70% of the memory. Additionally, IDESplat improves PSNR by 2.95 dB over DepthSplat on the DTU dataset in cross-dataset experiments, demonstrating its strong generalization ability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37071", "url": null, "sourceid": 39522, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37073, "uid": "8e7aec754122752b363723f8ea82b77d", "name": "EcoAlign: An Economically Rational Framework for Efficient LVLM Alignment", "authors": [{"id": 172882, "fullname": "Ruoxi Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/172882?format=json", "institution": "Alibaba Group"}, {"id": 186600, "fullname": "Hao-Xuan Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/186600?format=json", "institution": "Nanjing University"}, {"id": 186601, "fullname": "Teng Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/186601?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 186602, "fullname": "Hongyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186602?format=json", "institution": "Nanyang Technological University"}], "abstract": "Large Vision-Language Models (LVLMs) exhibit powerful reasoning capabilities but suffer sophisticated jailbreak vulnerabilities. Fundamentally, aligning LVLMs is not just a safety challenge but a problem of economic efficiency. Current alignment methods struggle with the trade-off between safety, utility, and operational costs. Critically, a focus solely on final outputs (process-blindness) wastes significant computational budget on unsafe deliberation. This flaw allows harmful reasoning to be disguised with benign justifications, thereby circumventing simple additive safety scores.To address this, we propose \\textbf{EcoAlign}, an inference-time framework that reframes alignment as an economically rational search by treating the LVLM as a boundedly rational agent. EcoAlign incrementally expands a thought graph and scores actions using a forward-looking function (analogous to net present value) that dynamically weighs expected safety, utility, and cost against the remaining budget. To prevent deception, path safety is enforced via the weakest-link principle. Extensive experiments across 3 closed-source and 2 open-source models on 6 datasets show that EcoAlign matches or surpasses state-of-the-art safety and utility at a lower computational cost, thereby offering a principled, economical pathway to robust LVLM alignment.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37073", "url": null, "sourceid": 44108, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37082, "uid": "a7c90d1a4af8f934c002ec3883e18a40", "name": "Fed-ADE: Adaptive Learning Rate for Federated Post-adaptation under Distribution Shift", "authors": [{"id": 181275, "fullname": "Heewon Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/181275?format=json", "institution": "Sungkyunkwan University"}, {"id": 186621, "fullname": "Mugon Joe", "url": "http://cvpr.thecvf.com/api/miniconf/users/186621?format=json", "institution": "Soongsil University"}, {"id": 186622, "fullname": "Miru Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/186622?format=json", "institution": "Soongsil University"}, {"id": 186623, "fullname": "Kyungjin Im", "url": "http://cvpr.thecvf.com/api/miniconf/users/186623?format=json", "institution": "Soongsil University"}, {"id": 186624, "fullname": "Minhae Kwon", "url": "http://cvpr.thecvf.com/api/miniconf/users/186624?format=json", "institution": "Sung Kyun Kwan University; Soongsil University"}], "abstract": "Federated learning (FL) in post-deployment settings must adapt to non-stationary data streams across heterogeneous clients without access to ground-truth labels. A major challenge is learning rate selection under client-specific, time-varying distribution shifts, where fixed learning rates often lead to underfitting or divergence. We propose Fed-ADE (Federated Adaptation with Distribution Shift Estimation), an unsupervised federated adaptation framework that leverages lightweight estimators of distribution dynamics. Specifically, Fed-ADE employs uncertainty dynamics estimation to capture changes in predictive uncertainty and representation dynamics estimation to detect covariate-level feature drift, combining them into a per-client, per-timestep adaptive learning rate. We provide theoretical analyses showing that our dynamics estimation approximates the underlying distribution shift and yields dynamic regret and convergence guarantees. Experiments on image and text benchmarks under diverse distribution shifts (label, covariate, and concept) demonstrate consistent improvements over strong baselines. These results highlight that distribution shift-aware adaptation enables effective and robust federated post-adaptation under real-world non-stationarity.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37082", "url": null, "sourceid": 45635, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37078, "uid": "417ac9043b0b1805c3e5e5e9605fe400", "name": "UniGame: Turning a Unified Multimodal Model Into Its Own Adversary", "authors": [{"id": 181037, "fullname": "Zhaolong Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/181037?format=json", "institution": "College of William and Mary"}, {"id": 186616, "fullname": "Wang Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186616?format=json", "institution": "Tsinghua University"}, {"id": 134015, "fullname": "Hao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/134015?format=json", "institution": "CMU, Carnegie Mellon University"}, {"id": 74221, "fullname": "Yixuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/74221?format=json", "institution": "University of Wisconsin Madison"}, {"id": 152510, "fullname": "Jindong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152510?format=json", "institution": "William &amp; Mary"}], "abstract": "Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. However, UMMs still exhibit a fundamental inconsistency: understanding favors compact embeddings, whereas generation favors reconstruction-rich representations. This structural trade-off produces misaligned decision boundaries, degraded cross-modal coherence, and heightened vulnerability under distributional and adversarial shifts. In this paper, we present UniGame, a self-adversarial post-training framework that directly targets the inconsistencies. By applying a lightweight perturber at the shared token interface, UniGame enables the generation branch to actively seek and challenge fragile understanding, turning the model itself into its own adversary. Experiments demonstrate that UniGame significantly improves the consistency (+4.6%). Moreover, it also achieves substantial improvements in understanding (+3.6%), generation (+0.02), out-of-distribution and adversarial robustness (+4.8% and +6.2% on NaturalBench and AdVQA). The framework is architecture-agnostic, introduces <1% additional parameters, and is complementary to existing post-training methods. These results position adversarial self-play as a general and effective principle for enhancing the coherence, stability, and unified competence of future multimodal foundation models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37078", "url": null, "sourceid": 43123, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37079, "uid": "57362c74575e8bb32e618ce073f4d8e6", "name": "Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings", "authors": [{"id": 140563, "fullname": "Yunxiang Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/140563?format=json", "institution": "University of Delaware"}, {"id": 91048, "fullname": "Mengmeng Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/91048?format=json", "institution": "University of Delaware"}, {"id": 153046, "fullname": "Ziyu Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153046?format=json", "institution": "George Mason University"}, {"id": 76931, "fullname": "Xi Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/76931?format=json", "institution": "University of Virginia, Charlottesville"}], "abstract": "Reliable generalization metrics are fundamental to both the development and evaluation of machine learning models. Especially in high-stakes applications where labeled target data are scarce, evaluation of models' generalization performance under distribution shift is a pressing need. We focus on two practical scenarios: (1) Before deployment, how to select the best model for unlabeled target data? (2) After deployment, how to monitor model performance under distribution shift? The central need in both cases is a reliable, label-free proxy metric. Yet existing proxy metrics, such as model confidence or accuracy-on-the-line, are often unreliable as they only assess model outputs while ignoring the internal mechanisms that produce them. We address this limitation by introducing a new perspective: using a model\u2019s inner working, i.e. circuits, as a predictive metric of generalization performance. Leveraging circuit discovery, we extract the causal interactions between internal representations as a circuit, from which we derive two metrics tailored to the two practical scenarios. (1) Before deployment, we introduce Dependency Depth Bias, which measures different models' generalization capability on target data. (2) After deployment, we propose Circuit Shift Score, which predicts a model's generalization under different distribution shifts. Across diverse tasks, both metrics demonstrate significantly improved correlation with generalization performance, outperforming existing proxies by an average of 11.0% and 45.3%, respectively.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37079", "url": null, "sourceid": 38345, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37080, "uid": "48e081d443097598b231fe0f53e95fdd", "name": "Interpretable and Steerable Concept Bottleneck Sparse Autoencoders", "authors": [{"id": 153224, "fullname": "Akshay R. Kulkarni", "url": "http://cvpr.thecvf.com/api/miniconf/users/153224?format=json", "institution": "University of California, San Diego"}, {"id": 153228, "fullname": "Tsui-Wei Weng", "url": "http://cvpr.thecvf.com/api/miniconf/users/153228?format=json", "institution": "University of California, San Diego"}, {"id": 182867, "fullname": "Vivek Narayanaswamy", "url": "http://cvpr.thecvf.com/api/miniconf/users/182867?format=json", "institution": "Lawrence Livermore National Laboratory"}, {"id": 89404, "fullname": "Shusen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89404?format=json", "institution": "Lawrence Livermore National Labs"}, {"id": 186617, "fullname": "Wesam A. Sakla", "url": "http://cvpr.thecvf.com/api/miniconf/users/186617?format=json", "institution": "Lawrence Livermore National Laboratory"}, {"id": 167965, "fullname": "Kowshik Thopalli", "url": "http://cvpr.thecvf.com/api/miniconf/users/167965?format=json", "institution": "Lawrence Livermore National Labs"}], "abstract": "Sparse autoencoders (SAEs) promise a unified approach for mechanistic interpretability, concept discovery, and model steering in LLMs and LVLMs. However, realizing this potential requires that the learned features be both interpretable and steerable. To that end, we introduce two new computationally inexpensive interpretability and steerability metrics and conduct a systematic analysis on LVLMs. Our analysis uncovers two observations; (i) a majority of SAE neurons exhibit either low interpretability or low steerability or both, rendering them ineffective for downstream use; and (ii) due to the unsupervised nature of SAEs, user-desired concepts are often absent in the learned dictionary, thus limiting their practical utility. To address these limitations, we propose Concept Bottleneck Sparse Autoencoders (CB-SAE)\u2014a novel post-hoc framework that prunes low-utility neurons and augments the latent space with a lightweight concept bottleneck aligned to a user-defined concept set. The resulting CB-SAE improves interpretability by +32.1% and steerability by +14.5% across LVLMs and image generation tasks. We will make our code and model weights available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37080", "url": null, "sourceid": 40081, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37081, "uid": "8423dd87983400be28badfcfaed92b99", "name": "SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time", "authors": [{"id": 70050, "fullname": "Zhening Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70050?format=json", "institution": "University of Cambridge"}, {"id": 186618, "fullname": "Hyeonho Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186618?format=json", "institution": "Adobe"}, {"id": 88780, "fullname": "Xuelin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88780?format=json", "institution": "Adobe Research"}, {"id": 186619, "fullname": "Yulia Gryaditskaya", "url": "http://cvpr.thecvf.com/api/miniconf/users/186619?format=json", "institution": "Adobe Research"}, {"id": 71652, "fullname": "Tuanfeng Y. Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/71652?format=json", "institution": "Adobe Systems"}, {"id": 186620, "fullname": "Joan Lasenby", "url": "http://cvpr.thecvf.com/api/miniconf/users/186620?format=json", "institution": "University of Cambridge"}, {"id": 86072, "fullname": "Chun-Hao P. Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86072?format=json", "institution": "Adobe Systems"}], "abstract": "We present SpaceTimePilot, a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, SpaceTimePilot can independently alter both the camera viewpoint and the motion sequence within the generative process, re-rendering the scene for continuous and arbitrary exploration across space and time. To achieve this, we introduce an effective animation time-embedding mechanism in the diffusion process, allowing explicit control of the output video\u2019s motion sequence with respect to that of the source video. As no datasets provide paired videos of the same dynamic scene with continuous temporal variations, we propose a temporal-warping training scheme that repurposes existing multi-view datasets to mimic temporal differences. This simple yet crucial strategy enables the model to learn temporal control, directly producing the observed space\u2013time disentanglement effects.To further enhance the precision of dual control, we introduce two additional components: an improved camera-conditioning mechanism that allows altering the camera from the first frame, and CamxTime, the first synthetic Space and Time full-coverage rendering dataset that provides fully free space\u2013time video trajectories within a scene. Joint training on the temporal-warping scheme and the CamxTime dataset yields more precise temporal control. We evaluate SpaceTimePilot on both real-world and synthetic data, demonstrating clear space\u2013time disentanglement and strong results compared to prior arts.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37081", "url": null, "sourceid": 32547, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37083, "uid": "015de601ee1e62453e29e4fee1be359a", "name": "Omni-Attribute: Open-vocabulary Image Attribute Encoder for Visual Disentanglement and Composition", "authors": [{"id": 159921, "fullname": "Tsai-Shien Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/159921?format=json", "institution": "UC Merced"}, {"id": 184760, "fullname": "Aliaksandr Siarohin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184760?format=json", "institution": "Snap Inc.; Snap Inc."}, {"id": 160337, "fullname": "Guocheng Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/160337?format=json", "institution": "Snap Inc."}, {"id": 158585, "fullname": "Kuan-Chieh Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158585?format=json", "institution": "Snap Inc."}, {"id": 157734, "fullname": "Egor Nemchinov", "url": "http://cvpr.thecvf.com/api/miniconf/users/157734?format=json", "institution": "Snap Inc."}, {"id": 131383, "fullname": "Moayed Haji Ali", "url": "http://cvpr.thecvf.com/api/miniconf/users/131383?format=json", "institution": "Rice University"}, {"id": 88968, "fullname": "Riza Alp Guler", "url": "http://cvpr.thecvf.com/api/miniconf/users/88968?format=json", "institution": "Snap Inc."}, {"id": 88383, "fullname": "Willi Menapace", "url": "http://cvpr.thecvf.com/api/miniconf/users/88383?format=json", "institution": "University of Trento"}, {"id": 87280, "fullname": "Ivan Skorokhodov", "url": "http://cvpr.thecvf.com/api/miniconf/users/87280?format=json", "institution": "KAUST"}, {"id": 127013, "fullname": "Anil Kag", "url": "http://cvpr.thecvf.com/api/miniconf/users/127013?format=json", "institution": "Snap Inc."}, {"id": 75570, "fullname": "Jun-Yan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75570?format=json", "institution": "Carnegie Mellon University"}, {"id": 85389, "fullname": "Sergey Tulyakov", "url": "http://cvpr.thecvf.com/api/miniconf/users/85389?format=json", "institution": "Snap Inc."}], "abstract": "Visual concept personalization aims to transfer only specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it difficult to isolate a single attribute. This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model: (i) we curate semantically linked image pairs annotated with positive and negative attributes to explicitly teach the encoder what to preserve or suppress; and (ii) we adopt a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37083", "url": null, "sourceid": 33721, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37085, "uid": "1e5bef26e8dfbf7d72ab4979d14faea1", "name": "Intrinsic Image Fusion for Multi-View 3D Material Reconstruction", "authors": [{"id": 107633, "fullname": "Peter Kocsis", "url": "http://cvpr.thecvf.com/api/miniconf/users/107633?format=json", "institution": "TU Munich"}, {"id": 107618, "fullname": "Lukas H\u00f6llein", "url": "http://cvpr.thecvf.com/api/miniconf/users/107618?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}, {"id": 75895, "fullname": "Matthias Nie\u00dfner", "url": "http://cvpr.thecvf.com/api/miniconf/users/75895?format=json", "institution": "Technical University of Munich"}], "abstract": "We introduce Intrinsic Image Fusion, a method that reconstructs high-quality physically based materials from multi-view images.Material reconstruction is highly underconstrained and typically relies on analysis-by-synthesis, which requires expensive and noisy path tracing. To better constrain the optimization, we incorporate single-view priors into the reconstruction process. We leverage a diffusion-based material estimator that produces multiple, but often inconsistent, candidate decompositions per view.To reduce the inconsistency, we fit an explicit low-dimensional parametric function to the predictions.We then propose a robust optimization framework using soft per-view prediction selection together with confidence-based soft multi-view inlier set to fuse the most consistent predictions of the most confident views into a consistent parametric material space. Finally, we use inverse path tracing to optimize for the low-dimensional parameters. Our results outperform state-of-the-art methods in material disentanglement on both synthetic and real scenes, producing sharp and clean reconstructions suitable for high-quality relighting.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37085", "url": null, "sourceid": 45320, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37086, "uid": "08bdd4ccdfefb4a7ad7757fba080aa28", "name": "Relightful Video Portrait Harmonization", "authors": [{"id": 144242, "fullname": "Jun Myeong Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/144242?format=json", "institution": "University of North Carolina at Chapel Hill"}, {"id": 157252, "fullname": "Jae Shin Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/157252?format=json", "institution": "Adobe Systems"}, {"id": 102719, "fullname": "Luchao Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/102719?format=json", "institution": "University of North Carolina at Chapel Hill"}, {"id": 105982, "fullname": "Roni Sengupta", "url": "http://cvpr.thecvf.com/api/miniconf/users/105982?format=json", "institution": "UNC Chapel Hill"}, {"id": 180187, "fullname": "Joon-Young Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/180187?format=json", "institution": "Adobe Research"}], "abstract": "We present a method for harmonizing the lighting of a foreground video to match a target background scene, adjusting shadows, color tone, and illumination intensity (relightful harmonization). Unlike images, acquiring labeled data for videos, where identical motions are recorded under different lighting conditions, is practically infeasible and non-scalable. While one way to create such paired data is to apply existing image-based harmonization models frame by frame to a video, the resulting outputs often suffer from significant temporal jitters. We overcome this problem by introducing a novel lighting deflickering model that can stabilize the global and local lighting flickering artifacts. Our video diffusion model learns from these upgraded deflickered data with a volume of real and synthetic videos to generate high-quality video harmonization results. We further propose an asymmetric alpha mask conditioning technique to learn the clean boundaries from real videos. Experiments demonstrate that our model achieves strong temporal coherence, naturalness, cleaner boundaries, and physically meaningful lighting behavior, while maintaining strong relighting expressiveness compared to prior image-based and video-based harmonization methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37086", "url": null, "sourceid": 40608, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37087, "uid": "e3f37a80937016c28f4b687370f9783e", "name": "Efficient Real-Time Raw-to-Raw Denoising for Extreme Low-Light Ultra HD Video on Mobile Devices", "authors": [{"id": 177093, "fullname": "Charantej Reddy Pochimireddy", "url": "http://cvpr.thecvf.com/api/miniconf/users/177093?format=json", "institution": "Samsung Research India Bangalore"}, {"id": 177050, "fullname": "Subhasmita Sahoo", "url": "http://cvpr.thecvf.com/api/miniconf/users/177050?format=json", "institution": "Samsung"}, {"id": 164824, "fullname": "Apoorva Verma", "url": "http://cvpr.thecvf.com/api/miniconf/users/164824?format=json", "institution": "Samsung R&amp;D Institute, Bangalore"}, {"id": 186628, "fullname": "Palavalli Shyam", "url": "http://cvpr.thecvf.com/api/miniconf/users/186628?format=json", "institution": "Samsung"}, {"id": 177264, "fullname": "Swapnil Malviya", "url": "http://cvpr.thecvf.com/api/miniconf/users/177264?format=json", "institution": "Samsung research institute banglore"}, {"id": 186629, "fullname": "Sarvesh Sarvesh", "url": "http://cvpr.thecvf.com/api/miniconf/users/186629?format=json", "institution": null}, {"id": 177241, "fullname": "Raj Narayana Gadde", "url": "http://cvpr.thecvf.com/api/miniconf/users/177241?format=json", "institution": "Samsung R&amp;D Institute India-Bangalore Private Limited"}], "abstract": "Recent advancements in deep neural networks (DNNs) have significantly improved visual quality of camera captures under low-light ($<$10lx) conditions, yet visual quality in extreme low-light ($<$1lx) remains inadequate. Existing DNN models are computationally intensive and suffer from large processing times, making them impractical for real-time enhancement of high-resolution video. Consequently, Ultra HD (UHD) videos (4K/8K) captured in extreme low-light environments exhibit elevated noise and diminished detail. Developing DNN-based solutions for UHD video enhancement faces challenges including paired dataset creation, temporal consistency, and efficient deployment under strict latency ($<$33ms) and power constraints ($<$250mA for 30fps video).We present a \\textit{comprehensive methodology} for developing a real-time raw to raw denoising solution for UHD video in extreme low-light, designed for seamless integration into existing ISP pipelines. Unlike ISP-replacement approaches, our solution enhances commercial camera stacks across sensor platforms. Our framework comprises: (1) Diverse dataset creation methodology; (2) A low-complexity model architecture optimized for mobile compute elements; (3) Efficient training and post-training optimizations (reparameterization, restructuring, quantization) to meet latency constraints while ensuring high-quality output. The result is a power-efficient real-time raw to raw video denoiser that improves extreme low-light video quality while preserving downstream ISP behavior.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37087", "url": null, "sourceid": 41158, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37089, "uid": "dc72b9133bb5273495f16a9bd44009e6", "name": "Describe Anything Anywhere At Any Moment", "authors": [{"id": 177972, "fullname": "Nicolas Gorlo", "url": "http://cvpr.thecvf.com/api/miniconf/users/177972?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 186631, "fullname": "Lukas Schmid", "url": "http://cvpr.thecvf.com/api/miniconf/users/186631?format=json", "institution": "University of Technology Nuremberg UTN"}, {"id": 75802, "fullname": "Luca Carlone", "url": "http://cvpr.thecvf.com/api/miniconf/users/75802?format=json", "institution": "Massachusetts Institute of Technology"}], "abstract": "Computer vision and robotics applications ranging from agumented reality to  robot autonomy in large-scale environments require spatio-temporal memory frameworks that capture both geometric structure for accurate language-grounding as well as semantic detail. Existing methods face a tradeoff, where producing rich open-vocabulary descriptions comes at the expense of real-time performance when these descriptions have to be grounded in 3D.To address these challenges, we propose Describe Anything, Anywhere, at Any Moment (DAAAM), a novel spatio-temporal memory framework for large-scale and real-time 4D scene understanding.DAAAM introduces a novel optimization-based frontend to infer detailed semantic descriptions from localized captioning models, such as the Describe Anything Model (DAM), leveraging batch processing to speed up inference by an order of magnitude for online processing.It leverages such semantic understanding to build a hierarchical 4D scene graph (SG), which acts as an effective globally spatially and temporally consistent memory representation.DAAAM constructs 4D SGs with detailed, geometrically grounded descriptions while maintaining real-time performance. We show that DAAAM's 4D SG interfaces well with a tool-calling agent for inference and reasoning.We thoroughly evaluate DAAAM in the complex task of spatio-temporal question answering (SQA) on the NaVQA benchmark and show its generalization capabilities for sequential task grounding on the SG3D benchmark. We further curate an extended OC-NaVQA benchmark for large-scale and long-time evaluations.DAAAM achieves state-of-the-art results in both tasks, improving OC-NaVQA question accuracy by 53.6\\%, position errors by 21.9\\%, temporal errors by 21.6\\%, and SG3D task grounding accuracy by 27.8\\% over the most competitive baselines, respectively.We release our data and code open-source.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37089", "url": null, "sourceid": 44839, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37093, "uid": "f3cfc5cb5f8aadbfd8b7e9328fc0f0d1", "name": "Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos", "authors": [{"id": 183045, "fullname": "Mengmeng Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/183045?format=json", "institution": "AMD"}, {"id": 152487, "fullname": "Takashi Isobe", "url": "http://cvpr.thecvf.com/api/miniconf/users/152487?format=json", "institution": "Advanced Micro Devices"}, {"id": 87542, "fullname": "Xu Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/87542?format=json", "institution": "Dalian University of Technology"}, {"id": 88724, "fullname": "Yanan Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/88724?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 86259, "fullname": "Zetong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86259?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 186638, "fullname": "Weinong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186638?format=json", "institution": null}, {"id": 152489, "fullname": "Dong Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/152489?format=json", "institution": "Advanced Micro Devices"}, {"id": 152490, "fullname": "Dong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152490?format=json", "institution": "Advanced Micro Devices"}, {"id": 87510, "fullname": "Huchuan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87510?format=json", "institution": "Dalian University of Technology"}, {"id": 149601, "fullname": "Emad Barsoum", "url": "http://cvpr.thecvf.com/api/miniconf/users/149601?format=json", "institution": "AMD"}], "abstract": "Understanding physical transformation processes is crucial for both human cognition and artificial intelligence systems, particularly from an egocentric perspective, which serves as a key bridge between humans and machines in action modeling. We define this modeling process as Egocentric Instructed Visual State Transition (EIVST), which involves generating intermediate frames that depict object transformations between initial and target states under a brief action instruction. EIVST poses two challenges for current generative models: (1) understanding the visual scenes of the initial and target states and reasoning about transformation steps from an egocentric view, and (2) generating a consistent intermediate transition that follows the given instruction while preserving object appearance across the two visual states. To address these challenges, we propose the EgoIn framework. It first infers the multi-step transition process between two given states using TransitionVLM, fine-tuned on our curated dataset to better adapt to this task and reduce hallucinated information. It then generates a sequence of frames based on transition conditions produced by the proposed Transition Conditioning module. Additionally, we introduce Object-aware Auxiliary Supervision to preserve consistent object appearance throughout the transition. Extensive experiments on human-object and robot-object interaction datasets demonstrate EgoIn's superior performance in generating semantically meaningful and visually coherent transformation sequences.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37093", "url": null, "sourceid": 32625, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37096, "uid": "696da8718f4b20ac408558e7dfd9f7ee", "name": "Cross-modal Representation Learning for Diffusion-generated Image Detection", "authors": [{"id": 131741, "fullname": "Tao Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/131741?format=json", "institution": "University of Science and Technology of China"}, {"id": 186643, "fullname": "Dayong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186643?format=json", "institution": "University of Science and Technology of China"}, {"id": 131736, "fullname": "Qi Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131736?format=json", "institution": "University of Science and Technology of China"}, {"id": 107620, "fullname": "Bin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107620?format=json", "institution": "University of Science and Technology of China"}, {"id": 90580, "fullname": "Nenghai Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90580?format=json", "institution": "University of Science and Technology of China"}], "abstract": "The astonishing proficiency and unprecedented level of realism of diffusion models in creating and manipulating images have undoubtedly drawn concerns.Many methods have been proposed to detect generated images. Typically, they usually take RGB images as input, and use backbones like ResNet, CLIP visual encoder to extract features. Even though these backbones are capable to detect fake images, they are mainly designed to extract the high-level semantic information, rather than inherently designed for fake image detection.  To this end, in this paper, we want to optimize the embedding space tailored for detecting fake images via representation learning. We notice that Neighboring Pixel Relationships (NPR) is capable to capture the intrinsic forgery clues, which means that NPR may be a good input to perform representation learning that aims at learning the embedding space tailored for detecting fake images.Therefore, we leverage features from both RGB modality and NPR modality to perform two proposed representation learning methods, Cross-Modal Contrastive Learning (CMCL) and Cross-Modal Mutual Distillation (CMMD), in order to learn the forgery-aware embedding space. The CMCL boosts the discrimination of features between real and fake images, while the CMMD simultaneously transfers the learned knowledge between two modalities, being able to learn compact features within the intra-class. CMCL and CMMD work collaboratively so that each modality learns a more comprehensive forgery-aware representation to distinguish real and fake images.Extensive experiments on GenImage, DRCT-2M, and Co-Spy-Bench datasets show that our method achieves state-of-the-art results.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37096", "url": null, "sourceid": 36271, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37097, "uid": "3969214cf75af11c1a013e4386787c54", "name": "HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles", "authors": [{"id": 143910, "fullname": "Yifan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/143910?format=json", "institution": "State University of New York at Stony Brook"}, {"id": 98658, "fullname": "Francesco Pittaluga", "url": "http://cvpr.thecvf.com/api/miniconf/users/98658?format=json", "institution": "NEC Laboratories America"}, {"id": 186644, "fullname": "Zaid Tasneem", "url": "http://cvpr.thecvf.com/api/miniconf/users/186644?format=json", "institution": "NEC"}, {"id": 150964, "fullname": "Chenyu You", "url": "http://cvpr.thecvf.com/api/miniconf/users/150964?format=json", "institution": "State University of New York at Stony Brook"}, {"id": 75820, "fullname": "Manmohan Chandraker", "url": "http://cvpr.thecvf.com/api/miniconf/users/75820?format=json", "institution": "UC San Diego"}, {"id": 127828, "fullname": "Ziyu Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127828?format=json", "institution": "Texas A&amp;M"}], "abstract": "Controllable driving scene generation is critical for realistic and scalable autonomous driving simulation, yet existing approaches struggle to jointly achieve photorealism and precise control. We introduce \\textbf{HorizonForge}, a unified framework that reconstructs scenes as editable Gaussian Splats and Meshes, enabling fine-grained 3D manipulation and language-driven vehicle insertion. Edits are rendered through a noise-aware video diffusion process that enforces spatial and temporal consistency, producing diverse scene variations in a single feed-forward pass without per-trajectory optimization. To standardize evaluation, we further propose \\textbf{HorizonSuite}, a comprehensive benchmark spanning ego- and agent-level editing tasks such as trajectory modifications and object manipulation. Extensive experiments show that Gaussian Splatting delivers substantially higher fidelity than alternative 3D representations, and that temporal priors from video diffusion are essential for coherent synthesis. Combining these finding, \\textbf{HorizonSuite} establishes a simple yet powerful paradigm for photorealistic, controllable driving simulation. achieving an 83.4\\% user-preference gain and a 25.19\\% FID improvement over the second best state-of-the-art method.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37097", "url": null, "sourceid": 43826, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37098, "uid": "2931063739d4a2ea969474ff753bf05e", "name": "CA-LoRA: Concept-Aware LoRA for Domain-Aligned Segmentation Dataset Generation", "authors": [{"id": 128128, "fullname": "Minho Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/128128?format=json", "institution": "KAIST"}, {"id": 135749, "fullname": "Sunghyun Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/135749?format=json", "institution": "Qualcomm Inc, QualComm"}, {"id": 154425, "fullname": "Jungsoo Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/154425?format=json", "institution": "Qualcomm AI Research"}, {"id": 91319, "fullname": "Hyojin Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/91319?format=json", "institution": "Qualcomm Inc, QualComm"}, {"id": 154426, "fullname": "Kyuwoong Hwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154426?format=json", "institution": "Qualcomm Inc, QualComm"}, {"id": 85634, "fullname": "Fatih Porikli", "url": "http://cvpr.thecvf.com/api/miniconf/users/85634?format=json", "institution": "QualComm"}, {"id": 87936, "fullname": "Jaegul Choo", "url": "http://cvpr.thecvf.com/api/miniconf/users/87936?format=json", "institution": "Korea Advanced Institute of Science and Technology"}, {"id": 186645, "fullname": "Sungha Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186645?format=json", "institution": "Qualcomm AI Research"}], "abstract": "This paper addresses the challenge of data scarcity in semantic segmentation by generating datasets through text-to-image (T2I) generation models, reducing image acquisition and labeling costs. Segmentation dataset generation faces two key challenges: 1) aligning generated samples with the target domain and 2) producing informative samples beyond the training data. Fine-tuning T2I models can help generate samples aligned with the target domain. However, it often overfits and memorizes training data, limiting their ability to generate diverse and well-aligned samples. To overcome these issues, we propose Concept-Aware LoRA (CA-LoRA), a novel fine-tuning approach that selectively identifies and updates only the weights associated with necessary concepts (e.g., style or viewpoint) for domain alignment while preserving the pretrained knowledge of the T2I model to produce informative samples. We demonstrate its effectiveness in generating datasets for urban-scene segmentation, outperforming baseline and state-of-the-art methods in in-domain (few-shot and fully-supervised) settings, as well as in domain generalization tasks, especially under challenging conditions such as adverse weather and varying illumination, further highlighting its superiority.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37098", "url": null, "sourceid": 34254, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37102, "uid": "a72fbdc03fde56aced63b34a97e6df0c", "name": "KnowVal: A Knowledge-Augmented and Value-Guided Autonomous Driving System", "authors": [{"id": 128607, "fullname": "Zhongyu Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/128607?format=json", "institution": "Peking University"}, {"id": 186661, "fullname": "Wenhao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186661?format=json", "institution": "Peking University"}, {"id": 86049, "fullname": "Yongtao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86049?format=json", "institution": "Peking University"}, {"id": 75715, "fullname": "Ming-Hsuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75715?format=json", "institution": "University of California at Merced"}], "abstract": "Visual\u2013language reasoning, driving knowledge, and value alignment are essential for advanced autonomous driving systems. However, existing approaches largely rely on data-driven learning, making it difficult to capture the complex logic underlying decision-making through imitation or limited reinforcement rewards. To address this, we propose KnowVal, a new autonomous driving system that enables visual\u2013language reasoning through the synergistic integration of open-world perception and knowledge retrieval. Specifically, we construct a comprehensive driving knowledge graph that encodes traffic laws, defensive driving principles, and ethical norms, complemented by an efficient LLM-based retrieval mechanism tailored for driving scenarios. Furthermore, we develop a human-preference dataset and train a Value Model to guide interpretable, value-aligned trajectory assessment. Experimental results show that our method substantially improves planning performance while remaining compatible with existing architectures. Notably, KnowVal achieves the lowest collision rate on nuScenes and state-of-the-art results on Bench2Drive.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37102", "url": null, "sourceid": 41706, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37104, "uid": "25d55f01734adca809111ca46b5344c0", "name": "StreamReady: Learning *What* to Answer and *When* in Long Streaming Videos", "authors": [{"id": 103910, "fullname": "Shehreen Azad", "url": "http://cvpr.thecvf.com/api/miniconf/users/103910?format=json", "institution": "University of Central Florida"}, {"id": 85397, "fullname": "Vibhav Vineet", "url": "http://cvpr.thecvf.com/api/miniconf/users/85397?format=json", "institution": "Microsoft"}, {"id": 135103, "fullname": "Yogesh Rawat", "url": "http://cvpr.thecvf.com/api/miniconf/users/135103?format=json", "institution": "University of Central Florida"}], "abstract": "Streaming video understanding often involves time-sensitive scenarios where models need to answer exactly when the supporting visual evidence appears: answering before the evidence reflects speculation, answering after it has passed reduces real-time utility. To capture this behavior, we introduce a readiness-aware formulation of streaming video understanding with the **Answer Readiness Score (ARS)**, a timing-aware objective with asymmetric early and late penalties. When combined with correctness, ARS defines an effective accuracy that measures not just whether a model is right, but whether it answers at the appropriate moment. Building on this formulation, we introduce **StreamReady**, a framework to unify temporal reasoning with on-time answering through  a lightweight readiness mechanism that decides if sufficient evidence has been observed before responding. To evaluate this capability, we further introduce **ProReady-QA**, a benchmark with annotated answer evidence windows and proactive multi-turn questions across local and global contexts. StreamReady achieves superior performance on ProReady-QA, and consistently outperforms prior methods across eight additional streaming and offline long-video benchmarks, demonstrating robust and broadly generalizable video understanding capability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37104", "url": null, "sourceid": 31444, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37111, "uid": "81182f336f1fdfa1544cf54d2718efaa", "name": "Training-free Motion Factorization for Compositional Video Generation", "authors": [{"id": 158836, "fullname": "Zixuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158836?format=json", "institution": "Sichuan University"}, {"id": 69746, "fullname": "Ziqin Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/69746?format=json", "institution": "Canva"}, {"id": 158837, "fullname": "Feng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/158837?format=json", "institution": "The University of Adelaide"}, {"id": 100744, "fullname": "DUO PENG", "url": "http://cvpr.thecvf.com/api/miniconf/users/100744?format=json", "institution": "Singapore University of Technology and Design"}, {"id": 184937, "fullname": "Yixin Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184937?format=json", "institution": "Sichuan University"}, {"id": 105376, "fullname": "Changsheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/105376?format=json", "institution": "Beijing Institute of Technology"}, {"id": 76644, "fullname": "Yinjie Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/76644?format=json", "institution": "Sichuan University"}], "abstract": "Compositional video generation aims to synthesize multiple instances with diverse appearance and motion, which is widely applicable in real-world scenarios. However, current approaches mainly focus on binding semantics, neglecting to understand diverse motion categories specified in prompts. In this paper, we propose a \\textbf{motion factorization} framework that decomposes complex motion into three primary categories: motionlessness, rigid motion, and non-rigid motion. Specifically, our framework follows a \\textit{planning before generation} paradigm. (1) During planning, we reason about  motion laws on the motion graph to obtain frame-wise changes in the shape and position of each instance. This alleviates semantic ambiguities in the user prompt by organizing it into a structured representation of instances and their interactions. (2) During generation, we modulate the synthesis of distinct motion categories in a disentangled manner. Conditioned on the motion cues, guidance branches stabilize appearance in motionless regions, preserve rigid-body geometry, and regularize local non-rigid deformations. Crucially, our two modules are model-agnostic, which can be seamlessly incorporated into various diffusion model architectures. Extensive experiments demonstrate that our framework achieves impressive performance in motion synthesis on real-world benchmarks. \\textcolor[RGB]{237,0, 140}{Our code will be released soon}.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37111", "url": null, "sourceid": 45310, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37113, "uid": "035470355c525490c9db87b4f0a48b51", "name": "Rosetta Stone For Unified MLLMs: A unified tokenizer to decipher understanding and generation", "authors": [{"id": 181688, "fullname": "Wenyu Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/181688?format=json", "institution": "Alibaba Group"}, {"id": 186694, "fullname": "Hufei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186694?format=json", "institution": "Alibaba Group"}, {"id": 186695, "fullname": "Ruijin Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186695?format=json", "institution": null}, {"id": 186696, "fullname": "Xiangheng Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186696?format=json", "institution": null}, {"id": 90189, "fullname": "Yuning Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90189?format=json", "institution": "University of Science and Technology of China, Tsinghua University"}], "abstract": "Major state-of-the-art unified tokenizers predominantly adopt pixel reconstruction and feature alignment as pretext tasks, they leave key domains largely unexplored such as architecture, supervised objectives and tasks interaction, potentially resulting in limited performance. We systematically investigate the critical factors of a unified visual tokenizer and propose a novel framework that strengthens synergy between understanding and generation in various aspects. Our initial analysis focus on properties of frontier vision models, confirming inherent conflict in contrastive learning style models for unifying generation and understanding, and demonstrate distinct convergence behavior of codebooks. To address the above bottleneck, we hierarchically decouple the conflicting proxy tasks, enriching the diversity of semantic features supervision to enhance thesemantic and low-level capabilities. Subsequently, we further introduce attention-prioritized mapping strategy, which guides fine-grained generation with powerful semantic prior.  Our method achieves rFID of 0.33 and zero-shot accuracy of 80.9\\% on ImageNet at 256$\\times$256 resolution, surpassing VILA-U by 7.6\\% and outperforms continuous embedding of SigLIP. When applied to discrete unified MLLMs, our 7B model exceeds TokenFlow-13B by 3.1\\% in understanding and achieve SOTA performance in GenAI-Bench and MJHQ-30K.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37113", "url": null, "sourceid": 40874, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37114, "uid": "e4e2602e040333a9d03277cf9312e1a7", "name": "Let it Snow! Animating 3D Gaussian Scenes with Dynamic Weather Effects via Physics-Guided Score Distillation", "authors": [{"id": 182153, "fullname": "Gal Fiebelman", "url": "http://cvpr.thecvf.com/api/miniconf/users/182153?format=json", "institution": "Hebrew University of Jerusalem"}, {"id": 156128, "fullname": "Hadar Averbuch-Elor", "url": "http://cvpr.thecvf.com/api/miniconf/users/156128?format=json", "institution": "Department of Computer Science, Cornell University"}, {"id": 97922, "fullname": "Sagie Benaim", "url": "http://cvpr.thecvf.com/api/miniconf/users/97922?format=json", "institution": "Hebrew University of Jerusalem"}], "abstract": "3D Gaussian Splatting has recently enabled fast and photorealistic reconstruction of static 3D scenes. However, dynamic editing of such scenes remains a significant challenge. We introduce a novel framework, Physics-Guided Score Distillation, to address a fundamental conflict: physics simulation provides a strong motion prior that is insufficient for photorealism , while video-based Score Distillation Sampling (SDS) alone cannot generate coherent motion for complex, multi-particle scenarios. We resolve this through a unified optimization framework where physics simulation guides Score Distillation to jointly refine the motion prior for photorealism while simultaneously optimizing appearance. Specifically, we learn a neural dynamics model that predicts particle motion and appearance, optimized end-to-end via a combined loss integrating Video-SDS for photorealism with our physics-guidance prior. This allows for photorealistic refinements while ensuring the dynamics remain plausible. Our framework enables scene-wide dynamic weather effects, including snowfall, rainfall, fog, and sandstorms, with physically plausible motion. Experiments demonstrate our physics-guided approach significantly outperforms baselines, with ablations confirming this joint refinement is essential for generating coherent, high-fidelity dynamics.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37114", "url": null, "sourceid": 35647, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37116, "uid": "197e931b3ee852997d554b092e9455cb", "name": "VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments", "authors": [{"id": 181937, "fullname": "Zelai Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181937?format=json", "institution": "Tsinghua University"}, {"id": 176166, "fullname": "Zhexuan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/176166?format=json", "institution": "University of science and technology of China"}, {"id": 186698, "fullname": "Xiangmin Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186698?format=json", "institution": "Tsinghua University"}, {"id": 186699, "fullname": "Huining Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186699?format=json", "institution": "Tsinghua University"}, {"id": 186357, "fullname": "Mo Guang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186357?format=json", "institution": null}, {"id": 186358, "fullname": "Kaiwen Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/186358?format=json", "institution": "Li Auto Inc."}, {"id": 149284, "fullname": "Xinlei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/149284?format=json", "institution": "Tsinghua University"}, {"id": 186700, "fullname": "Yi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186700?format=json", "institution": "Tsinghua University"}, {"id": 186701, "fullname": "Chao Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186701?format=json", "institution": "Tsinghua University"}, {"id": 130817, "fullname": "Yu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130817?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single-agent or text-only environments. In contrast, real-world scenarios often involve multiple agents interacting within rich visual and textual contexts, posing challenges with both multimodal observations and strategic interactions. To bridge this gap, we introduce Visual Strategic Bench (VS-Bench), a multimodal benchmark that evaluates VLMs for strategic abilities in multi-agent environments. VS-Bench comprises ten vision-grounded environments that cover cooperative, competitive, and mixed-motive interactions. The performance of VLM agents is evaluated across three dimensions: perception measured by element recognition accuracy; strategic reasoning measured by next-action prediction accuracy; and decision-making measured by normalized episode return. Extensive experiments on fifteen leading VLMs show that, although current models exhibit strong perception abilities, there remains a significant gap to optimal performance in reasoning and decision-making, with the best-performing model attaining 46.6% prediction accuracy and 31.4% normalized return. We further analyze the key factors influencing performance, conduct human studies, and examine failure modes to provide a deeper understanding of VLMs' strategic abilities. By standardizing the evaluation and highlighting the limitations of existing models, we envision VS-Bench as a foundation for future research on strategic multimodal agents.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37116", "url": null, "sourceid": 43381, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40284?format=json"], "related_events_ids": [40284]}, {"id": 40284, "uid": "197e931b3ee852997d554b092e9455cb", "name": "VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments", "authors": [{"id": 181937, "fullname": "Zelai Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181937?format=json", "institution": "Tsinghua University"}, {"id": 176166, "fullname": "Zhexuan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/176166?format=json", "institution": "University of science and technology of China"}, {"id": 186698, "fullname": "Xiangmin Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186698?format=json", "institution": "Tsinghua University"}, {"id": 186699, "fullname": "Huining Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186699?format=json", "institution": "Tsinghua University"}, {"id": 186357, "fullname": "Mo Guang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186357?format=json", "institution": null}, {"id": 186358, "fullname": "Kaiwen Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/186358?format=json", "institution": "Li Auto Inc."}, {"id": 149284, "fullname": "Xinlei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/149284?format=json", "institution": "Tsinghua University"}, {"id": 186700, "fullname": "Yi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186700?format=json", "institution": "Tsinghua University"}, {"id": 186701, "fullname": "Chao Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186701?format=json", "institution": "Tsinghua University"}, {"id": 130817, "fullname": "Yu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130817?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single-agent or text-only environments. In contrast, real-world scenarios often involve multiple agents interacting within rich visual and textual contexts, posing challenges with both multimodal observations and strategic interactions. To bridge this gap, we introduce Visual Strategic Bench (VS-Bench), a multimodal benchmark that evaluates VLMs for strategic abilities in multi-agent environments. VS-Bench comprises ten vision-grounded environments that cover cooperative, competitive, and mixed-motive interactions. The performance of VLM agents is evaluated across three dimensions: perception measured by element recognition accuracy; strategic reasoning measured by next-action prediction accuracy; and decision-making measured by normalized episode return. Extensive experiments on fifteen leading VLMs show that, although current models exhibit strong perception abilities, there remains a significant gap to optimal performance in reasoning and decision-making, with the best-performing model attaining 46.6% prediction accuracy and 31.4% normalized return. We further analyze the key factors influencing performance, conduct human studies, and examine failure modes to provide a deeper understanding of VLMs' strategic abilities. By standardizing the evaluation and highlighting the limitations of existing models, we envision VS-Bench as a foundation for future research on strategic multimodal agents.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40284", "url": null, "sourceid": -43381, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37116?format=json"], "related_events_ids": [37116]}, {"id": 37118, "uid": "f20ad2475827b60cbfd9efe5e98d78ae", "name": "Spatial-Frequency Collaborative Learning for Occluded Visible-Infrared Person Re-Identification", "authors": [{"id": 177005, "fullname": "JIan Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/177005?format=json", "institution": "Jiangsu University Of Technology"}, {"id": 180141, "fullname": "Yujian Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180141?format=json", "institution": "Jiangsu University of Technology"}, {"id": 186703, "fullname": "Shuai You", "url": "http://cvpr.thecvf.com/api/miniconf/users/186703?format=json", "institution": null}, {"id": 186704, "fullname": "Zhongkai Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186704?format=json", "institution": null}, {"id": 186705, "fullname": "Fei Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186705?format=json", "institution": "Nanjing University of Posts and Telecommunications"}, {"id": 186706, "fullname": "Zhengjun Jing", "url": "http://cvpr.thecvf.com/api/miniconf/users/186706?format=json", "institution": "Jiangsu University of Technology"}, {"id": 186707, "fullname": "Yimu Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/186707?format=json", "institution": "Nanjing University of Posts and Telecommunications"}], "abstract": "Occluded visible\u2013infrared person re-identification (Occluded VI-ReID) remains difficult due to modality heterogeneity and occlusions, both of which break structural consistency and weaken cross-modality feature alignment. Existing methods rely mainly on spatial-domain cues (such as local body parts and salient patches), but their discriminability degrades severely under varying imaging conditions or partial visibility. To address these issues, we introduce a spatial-frequency collaborative perspective that offers global perception and cross-location consistency. Specifically, we propose a Spatial-Frequency Collaborative Learning (SFCL) framework that uses frequency information to complement spatial representations. SFCL comprises a Cross-Modality Frequency Alignment Module (CFAM), a Spatial-Frequency Interaction Module (SFIM), and a Frequency-Aware Discriminative (FAD) loss. The CFAM models the spectral features of visible/infrared images in the frequency domain, establishing modality-consistent spectral priors. The SFIM injects these priors into spatial features, promoting dual-domain interaction and complementary representations of spatial and frequency semantics. In addition, the FAD loss jointly enforces cross-modality frequency alignment and semantic consistency, thus enhancing robustness and discriminability under occlusions. For real-occlusion evaluation, we construct two occluded datasets, Occ-SYSU-MM01 and Occ-RegDB, on which SFCL outperforms the state-of-the-art.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37118", "url": null, "sourceid": 40659, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37119, "uid": "c888f742e4c25846c0733b03607e6078", "name": "Investigating Self-Supervised Representations for Audio-Visual Deepfake Detection", "authors": [{"id": 158896, "fullname": "Dragos-Alexandru Boldisor", "url": "http://cvpr.thecvf.com/api/miniconf/users/158896?format=json", "institution": "Bitdefender"}, {"id": 160334, "fullname": "Stefan Smeu", "url": "http://cvpr.thecvf.com/api/miniconf/users/160334?format=json", "institution": "Bitdefender"}, {"id": 158897, "fullname": "Dan Oneata", "url": "http://cvpr.thecvf.com/api/miniconf/users/158897?format=json", "institution": "Politehnica Bucharest"}, {"id": 158898, "fullname": "Elisabeta Oneata", "url": "http://cvpr.thecvf.com/api/miniconf/users/158898?format=json", "institution": "Bitdefender"}], "abstract": "Self-supervised representations excel at many vision and speech tasks, but their potential for audio-visual deepfake detection remains underexplored. Unlike prior work that uses these features in isolation or buried within complex architectures, we systematically evaluate them across modalities (audio, video, multimodal) and domains (lip movements, generic visual content). We assess three key dimensions: detection effectiveness, interpretability of encoded information, and cross-modal complementarity. We find that most self-supervised features capture deepfake-relevant information, and that this information is complementary. Moreover, the models attend to semantically meaningful regions rather than spurious artifacts. Yet none generalize reliably across datasets. This generalization failure likely stems from dataset characteristics, not from the features themselves latching onto superficial patterns. These results expose both the promise and fundamental challenges of self-supervised representations for deepfake detection: while they learn meaningful patterns, achieving robust cross-domain performance remains elusive.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37119", "url": null, "sourceid": 43343, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37123, "uid": "607ad71bd13d4b0cef54dfaf6995c650", "name": "CoFiDA-M: Concept-Aware Feature Modulation for Cross-Domain Adaptation with Image-Only Inference", "authors": [{"id": 145918, "fullname": "Nurjahan Sultana", "url": "http://cvpr.thecvf.com/api/miniconf/users/145918?format=json", "institution": "Manchester Metropolitan University"}, {"id": 85957, "fullname": "Moi Hoon Yap", "url": "http://cvpr.thecvf.com/api/miniconf/users/85957?format=json", "institution": "The Manchester Metropolitan University"}, {"id": 76934, "fullname": "Xinqi Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/76934?format=json", "institution": "City University of Hong Kong"}, {"id": 156937, "fullname": "Wenqi Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156937?format=json", "institution": "The Manchester Metropolitan University"}], "abstract": "Models for AI-based skin cancer screening suffer a severe performance drop when shifting from expert dermoscopic (source) images to consumer-grade clinical (target) photos, hindering real-world deployment. Existing domain adaptation methods often ignore crucial semantic invariants, such as clinical concepts. While new foundation models like MONET can provide this semantic information as dense, probabilistic scores, this metadata is unavailable at test time, creating a deployment paradox for practical image-only screening tools. We address this gap by proposing CoFiDA-M, a privileged information framework that learns from concepts at training time but deploys as an image-only model. Our method trains a teacher network that uses MONET concept probabilities to guide a FiLM modulator, transforming visual features into a semantically \"edited\" feature space. A lightweight, image-only student is then trained to reproduce this edited representation, not just the teacher's final predictions. This distillation \"bakes\" the clinical reasoning into the student's weights.On a challenging multi-dataset benchmark, our image-only student significantly outperforms state-of-the-art approaches, especially in melanoma recall. Our work provides a practical and generalizable framework for leveraging noisy, probabilistic metadata as privileged information, demonstrating strong cross-dataset robustness and potential for real-world deployment beyond dermatology. The code will be made publicly available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37123", "url": null, "sourceid": 42894, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37126, "uid": "6b52f0c01e26aec069e5bd6701dabf8b", "name": "Consensus vs. Controversy: Mapping the Decision Space Where Architectures Diverge", "authors": [{"id": 181227, "fullname": "Minhyeok Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/181227?format=json", "institution": "Chung-Ang University"}], "abstract": "Modern computer vision models from different architecture families--CNNs, Vision Transformers, and MLP-Mixers--achieve remarkably similar aggregate performance on standard benchmarks, masking potential systematic differences in how they process visual information. We introduce a simple yet revealing framework to identify where architectural inductive biases truly matter: by systematically mapping controversial images where pretrained models strongly disagree versus consensus images where all models agree. Analyzing 12 pretrained models spanning three architecture families on ImageNet validation set, we discover that controversial images exhibit approximately 4.5$\\times$ higher disagreement than consensus images (Controversy Score: 4.46). Despite mean accuracy around 80\\%, models show structured disagreement patterns: within-family agreement exceeds cross-family agreement, with CNNs and ViTs forming distinct clusters while MLPs show lower overall alignment. Crucially, only the top 10\\% most controversial images drive the majority of architectural divergence, constituting a small but informationally dense subset that reveals fundamental differences masked by aggregate metrics. Our analysis demonstrates that architectural choice matters most on this concentrated controversy space, providing researchers with actionable guidance for model selection and ensemble construction.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37126", "url": null, "sourceid": 30728, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37128, "uid": "151a8f24e6f9888bcb8745330c3dd7d9", "name": "Human Interaction-Aware 3D Reconstruction from a Single Image", "authors": [{"id": 87624, "fullname": "Gwanghyun Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/87624?format=json", "institution": "Seoul National University"}, {"id": 180148, "fullname": "Junghun James Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/180148?format=json", "institution": "Seoul National University"}, {"id": 129823, "fullname": "Suh Yoon Jeon", "url": "http://cvpr.thecvf.com/api/miniconf/users/129823?format=json", "institution": "Seoul National University"}, {"id": 183774, "fullname": "Jason Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/183774?format=json", "institution": "Seoul National University"}, {"id": 87674, "fullname": "Se Young Chun", "url": "http://cvpr.thecvf.com/api/miniconf/users/87674?format=json", "institution": "Seoul National University"}], "abstract": "Reconstructing textured 3D human models from a single image is fundamental for AR/VR and digital human applications. However, existing methods mostly focus on single individuals and thus fail in multi-human scenes, where naive composition of individual reconstructions often leads to artifacts such as unrealistic overlaps, missing geometry in occluded regions, and distorted interactions. These limitations highlight the need for approaches that incorporate group-level context and interaction priors. We introduce a holistic method that explicitly models both group- and instance-level information. To mitigate perspective-induced geometric distortions, we first transform the input into a canonical orthographic space. Our primary component, Human Group-Instance Multi-View Diffusion (HUG-MVD), then generates complete multi-view normals and images by jointly modeling individuals and group context to resolve occlusions and proximity. Subsequently, the Human Group-Instance Geometric Reconstruction (HUG-GR) module optimizes the geometry by leveraging explicit, physics-based interaction priors to enforce physical plausibility and accurately model inter-human contact. Finally, the multi-view images are fused into a high-fidelity texture. Together, these components form our complete framework, HUG3D.Extensive experiments show that HUG3D significantly outperforms both single-human and existing multi-human methods, producing physically plausible, high-fidelity 3D reconstructions of interacting people from a single image.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37128", "url": null, "sourceid": 41286, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37130, "uid": "5f8f8b4ab96b842515bfe342cf196929", "name": "Stitch-a-Demo: Creating Video Demonstrations from Multistep Descriptions", "authors": [{"id": 182391, "fullname": "Chi Hsuan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182391?format=json", "institution": "The University of Texas at Austin"}, {"id": 76703, "fullname": "Kumar Ashutosh", "url": "http://cvpr.thecvf.com/api/miniconf/users/76703?format=json", "institution": "UT Austin &amp; FAIR, Meta"}, {"id": 69188, "fullname": "Kristen Grauman", "url": "http://cvpr.thecvf.com/api/miniconf/users/69188?format=json", "institution": "University of Texas at Austin"}], "abstract": "When obtaining visual illustrations from text descriptions, today\u2019s methods take a description with a single text context\u2014a caption, or an action description\u2014and retrieve or generate the matching visual context. However, prior work does not permit visual illustration of multistep descriptions, e.g. a cooking recipe or a gardening instruction manual, and simply handling each step description in isolation would result in an incoherent demonstration. We propose Stitch-a-Demo, a novel retrieval-based method to assemble a video demonstration from a multistep description. The resulting video contains clips, possibly from different sources, that accurately reflect all the step descriptions, while being visually coherent. We formulate a training pipeline that creates large-scale weakly supervised data containing diverse procedures and injects hard negatives that promote both correctness and coherence. Validated on in-the-wild instructional videos, Stitch-a-Demo achieves state-of-the-art performance, with gains up to 29% as well as dramatic wins in a human preference study.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37130", "url": null, "sourceid": 33745, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37131, "uid": "49734ad17577047cc5239bbaa7a4800f", "name": "Smoothing the Score Function to Enhance Generalization in Diffusion Models", "authors": [{"id": 182080, "fullname": "Xinyu Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/182080?format=json", "institution": "University of Wisconsin Madison"}, {"id": 186737, "fullname": "Jiawei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186737?format=json", "institution": "University of Wisconsin - Madison"}, {"id": 186738, "fullname": "Stephen Wright", "url": "http://cvpr.thecvf.com/api/miniconf/users/186738?format=json", "institution": "University of Wisconsin-Madison"}], "abstract": "Diffusion models achieve remarkable generation quality, yet face a fundamental challenge known as **memorization**, where generated samples can replicate training samples exactly. We develop a theoretical framework to explain this phenomenon by showing that the empirical score function (the score function corresponding to the empirical distribution) is a weighted sum of the score functions of Gaussian distributions, in which the weights are sharp softmax functions.  This structure causes individual training samples to dominate the score function, resulting in sampling collapse. In practice, approximating the empirical score function with a neural network can partially alleviate this issue and improve generalization. Our theoretical framework explains why: In training, the neural network learns a smoother approximation of the weighted sum, allowing the sampling process to be influenced by local manifolds rather than single points. Leveraging this insight, we propose two novel methods to further enhance generalization: (1)**Noise Unconditioning** enables each training sample to adaptively determine its score function weight to increase the effect of more training samples, thereby preventing single-point dominance and mitigating collapse. (2) **Temperature Smoothing** introduces an explicit parameter to control the smoothness. By increasing temperature in the softmax weights, we naturally reduce the dominance of any single training sample and mitigate memorization. Experiments across multiple datasets validate our analysis and demonstrate the effectiveness of both methods in improving generalization while maintaining high generation quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37131", "url": null, "sourceid": 35194, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37132, "uid": "ca1a4a8c9b1cf2281c552bec03dfb2c4", "name": "RenderFlow: Single-Step Neural Rendering via Flow Matching", "authors": [{"id": 186739, "fullname": "Shenghao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186739?format=json", "institution": null}, {"id": 153949, "fullname": "Runtao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153949?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 85061, "fullname": "Christopher Schroers", "url": "http://cvpr.thecvf.com/api/miniconf/users/85061?format=json", "institution": "Disney Research|Studios, Disney"}, {"id": 156730, "fullname": "Yang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156730?format=json", "institution": "Disney Research Studios, The Walt Disney Company"}], "abstract": "Conventional physically-based rendering (PBR) pipelines generate photorealistic images through computationally expensive light transport simulations. Although recent deep learning approaches leverage diffusion model priors with geometry buffers (G-buffers) to produce visually compelling results without explicit scene geometry or light simulation, they remain constrained by two major limitations. First, the iterative nature of the diffusion process introduces substantial latency. Second, the inherent stochasticity of these generative models compromises physical accuracy and temporal consistency. In response to these challenges, we propose a novel, end-to-end, deterministic single-step neural rendering framework \\textit{RenderFlow} built upon a flow matching paradigm. To further strengthen both rendering quality and generalization, we propose an efficient and effective module for sparse keyframe guidance. Our method significantly accelerates the rendering process and, by optionally incorporating sparsely rendered keyframes as guidance, enhances both the physical plausibility and overall visual quality of the output. The resulting pipeline achieves near real-time performance with photorealistic rendering quality, effectively bridging the gap between the efficiency of modern generative models and the precision of traditional physically based rendering. Furthermore, we demonstrate the versatility of our framework by introducing a lightweight, adapter-based module that efficiently repurposes the pretrained forward model for the inverse rendering task of intrinsic decomposition.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37132", "url": null, "sourceid": 37734, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37134, "uid": "fcc3d4757401a955a260255ff217a10d", "name": "GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics", "authors": [{"id": 186741, "fullname": "Modi Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186741?format=json", "institution": "Nankai University"}, {"id": 186742, "fullname": "Yiming Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186742?format=json", "institution": "Nankai University"}, {"id": 72927, "fullname": "Bo-Yuan Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/72927?format=json", "institution": "Nankai University"}, {"id": 129116, "fullname": "Dingwen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129116?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 90540, "fullname": "Ming-Ming Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90540?format=json", "institution": "Nankai University, Tsinghua University"}, {"id": 90664, "fullname": "Qibin Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/90664?format=json", "institution": "Nankai University"}], "abstract": "This paper presents GeoAgent, a model capable of reasoning closely with humans and deriving fine-grained address conclusions. Previous RL-based methods have achieved breakthroughs in performance and interpretability but still remain concerns because of their reliance on AI-generated chain-of-thought (CoT) data and training strategies, which conflict with geographic characteristics. To address these issues, we first introduce GeoSeek, a new geolocation dataset comprising CoT data annotated by geographic experts and professional players. We further thoroughly explore the inherent characteristics of geographic tasks and propose a geo-similarity reward and a consistency reward assessed by  a consistency agent to assist training. This encourages the model to converge towards correct answers from a geographic perspective while ensuring the integrity and consistency of its reasoning process. Experimental results show that GeoAgent outperforms existing methods and a series of general VLLMs across multiple grains, while generating reasoning that closely aligns with humans. Pretrained model and data will be openly available.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37134", "url": null, "sourceid": 34830, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37137, "uid": "e304076961ce84eeec9e5d066edd87b5", "name": "MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark", "authors": [{"id": 186746, "fullname": "Shaden Shaar", "url": "http://cvpr.thecvf.com/api/miniconf/users/186746?format=json", "institution": "Cornell University"}, {"id": 186747, "fullname": "Bradon Michael Thymes", "url": "http://cvpr.thecvf.com/api/miniconf/users/186747?format=json", "institution": "Department of Computer Science, Cornell University"}, {"id": 186748, "fullname": "Sirawut Chaixanien", "url": "http://cvpr.thecvf.com/api/miniconf/users/186748?format=json", "institution": null}, {"id": 186749, "fullname": "Claire Cardie", "url": "http://cvpr.thecvf.com/api/miniconf/users/186749?format=json", "institution": "Cornell University"}, {"id": 89201, "fullname": "Bharath Hariharan", "url": "http://cvpr.thecvf.com/api/miniconf/users/89201?format=json", "institution": "Cornell University"}], "abstract": "Understanding real-world videos such as movies requires integrating visual and dialogue cues to answer complex questions. Yet existing VideoQA benchmarks struggle to capture this multimodal reasoning and are largely not open-ended, given the difficulty of evaluating free-form answers.In this paper, we introduce a novel open-ended multi-modal VideoQA benchmark, **MovieRecaps** created using movie recap videos\u2014a distinctive type of YouTube content that summarizes a film by presenting its key events through synchronized visual (recap video) and textual (recap summary) modalities.Using the recap summary, we generate 8.2K question-answer (QA) pairs (aligned with movie-subtitles) and provide the necessary  \"facts\" needed to verify an answer in a reference-free manner.To our knowledge, this is the first open-ended VideoQA benchmark that supplies explicit textual context of the input (video and/or text); which we use for evaluation.Our benchmark provides videos of multiple lengths (i.e., recap-segments, movie-segments) and categorizations of questions (by modality and type) to enable fine-grained analysis.We evaluate the performance of seven state-of-the-art MLLMs using our benchmark and observe that: 1) visual-only questions remain the most challenging; 2) models default to textual inputs whenever available; 3) extracting factually accurate information from video content is still difficult for all models; and 4) proprietary and open-source models perform comparably on video-dependent questions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37137", "url": null, "sourceid": 34991, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37139, "uid": "683ebb557d4e37fcc017bcf793aa67f3", "name": "DetectSCI: Toward Object-Guided ROI Reconstruction for High-Resolution Video Snapshot Compressive Imaging", "authors": [{"id": 180559, "fullname": "Xingjian Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180559?format=json", "institution": "Westlake University"}, {"id": 186752, "fullname": "Lishun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186752?format=json", "institution": "Chengdu Institute of Biology"}, {"id": 127898, "fullname": "Ping Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127898?format=json", "institution": "Westlake University"}, {"id": 88684, "fullname": "Xin Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88684?format=json", "institution": "Westlake University"}], "abstract": "Video snapshot compressive imaging (SCI) offers a promising alternative to high-speed cameras by encoding multiple frames into a single 2D measurement. However, SCI requires algorithms to reconstruct the high-speed video and as resolution increases, reconstruction becomes computationally expensive and memory-intensive. Much of resource is wasted on recovering large background regions that contain little useful information, highlighting the need for selective, object-driven reconstruction. Existing object detectors struggle to perform accurately on SCI measurements due to the spatial\u2013temporal aliasing introduced by coded exposure. To address this challenge, we proposes DetectSCI, the first framework enabling object-guided region-of-interest (ROI) reconstruction for high-resolution SCI. The inside detector comprises two key components: an encoder built from weight-sharing Mamba-Implicit Modules (MIM) for progressive feature refinement, and a Frequency Mamba (FM) module dedicated to frequency-aware query selection. MIM enhances features via multi-scale dilated convolutions and implicit representations, while FM restores discriminative details by decomposing and reweighting frequency bands. Experiments on the SportsMOT dataset show that DetectSCI achieves 80.9 Average Precision (AP), surpassing the best CNN-based detector by at least 2.8 AP and the best Transformer-based detector by at least 4.1 AP, while maintaining comparable efficiency. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37139", "url": null, "sourceid": 35132, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37141, "uid": "2a07e653170f3679b2fc660d3ade0d0f", "name": "Geometric Neural Distance Fields for Learning Human Motion Priors", "authors": [{"id": 88348, "fullname": "Zhengdi Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88348?format=json", "institution": "Imperial College London"}, {"id": 182061, "fullname": "Simone Foti", "url": "http://cvpr.thecvf.com/api/miniconf/users/182061?format=json", "institution": "Imperial College London"}, {"id": 156809, "fullname": "Linguang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156809?format=json", "institution": "Meta Reality Labs"}, {"id": 139332, "fullname": "Amy Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/139332?format=json", "institution": "Meta Reality Labs"}, {"id": 88527, "fullname": "Cem Keskin", "url": "http://cvpr.thecvf.com/api/miniconf/users/88527?format=json", "institution": "Facebook"}, {"id": 86877, "fullname": "Stefanos Zafeiriou", "url": "http://cvpr.thecvf.com/api/miniconf/users/86877?format=json", "institution": "Imperial College London"}, {"id": 73518, "fullname": "Tolga Birdal", "url": "http://cvpr.thecvf.com/api/miniconf/users/73518?format=json", "institution": "Imperial College London"}], "abstract": "We introduce Neural Riemannian Motion Fields (NRMF), a novel 3D generative human motion prior that enables robust, temporally consistent, and physically plausible 3D motion recovery. Unlike existing VAE or diffusion-based methods, our higher-order motion prior explicitly models the human motion in the zero level set of a collection of neural distance fields (NDFs) corresponding to pose, transition (velocity), and acceleration dynamics. Our framework is rigorous in the sense that our NDFs are constructed on the product space of joint rotations, their angular velocities, and angular accelerations, respecting the geometry of the underlying articulations. We further introduce: (i) a novel adaptive-step hybrid algorithm for projecting onto the set of plausible motions, and (ii) a novel geometric integrator to \u201croll out\u201d realistic motion trajectories during test-time-optimization and generation. Our experiments show significant and consistent gains: trained on the AMASS dataset, NRMF remarkably generalizes across multiple input modalities and to diverse tasks ranging from denoising to motion in-betweening and fitting to partial 2D / 3D observations.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37141", "url": null, "sourceid": 36270, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37142, "uid": "d43ec839a158dc7efa4cf0520781925b", "name": "Edges Compete for Trust: Group Relative Edge Optimization for Building Reconstruction from Point Clouds", "authors": [{"id": 155876, "fullname": "Yujun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155876?format=json", "institution": "Shenzhen University"}, {"id": 150979, "fullname": "Ruisheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150979?format=json", "institution": "University of Calgary"}, {"id": 178462, "fullname": "Xiang Ao", "url": "http://cvpr.thecvf.com/api/miniconf/users/178462?format=json", "institution": null}, {"id": 186759, "fullname": "Haoyuan Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186759?format=json", "institution": "Shenzhen University"}, {"id": 186760, "fullname": "Kuihao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186760?format=json", "institution": "Shenzhen University"}, {"id": 186403, "fullname": "Kun Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186403?format=json", "institution": "Shenzhen University"}, {"id": 186761, "fullname": "Qingquan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186761?format=json", "institution": "Shenzhen University"}], "abstract": "Building reconstruction aims to extract compact wireframes  from point clouds. Recent edge-based methods achieve impressive results but suffer from sparse supervision from one-to-one matching, which leaves most edge proposals under-optimized. In this paper, we present  Group Relative Edge Optimization (GREO), the first attempt to incentivize dense supervision across edges proposals through reinforcement learning-style optimization in wireframe reconstruction. Specifically, GREO computes edge-level rewards based on  geometric alignment quality and transforms them into target confidence distributions via group-wise normalization. In addition, we incorporate entropy regularization to maintain distributional stability and prevent confidence collapse. This joint optimization enables dense and discriminative supervision across all edge proposals through cross-entropy minimization. Experiments on the large-scale Building3D dataset demonstrate that our powerful and versatile GREO integrates seamlessly into existing edge-based methods as a plug-and-play training strategy, achieving state-of-the-art performance on both the Entry-level and Tallinn benchmarks while adding no inference overhead.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37142", "url": null, "sourceid": 43184, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37145, "uid": "35850f0af0027e582a418ab9e9298a9c", "name": "Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild", "authors": [{"id": 93104, "fullname": "Jiin Im", "url": "http://cvpr.thecvf.com/api/miniconf/users/93104?format=json", "institution": "Hanyang University"}, {"id": 76524, "fullname": "Sisung Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76524?format=json", "institution": "Hanyang University"}, {"id": 76091, "fullname": "Je Hyeong Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76091?format=json", "institution": "Hanyang University"}], "abstract": "Establishing semantic correspondence without supervision is essential for handling diverse in-the-wild images where annotations are scarce.While recent 2D foundation models offer powerful features, adapting them for unsupervised learning via nearest-neighbor pseudo-labels has key limitations: it operates locally, ignoring structural relationships, and consequently its reliance on 2D appearance fails to resolve geometric ambiguities arising from symmetries or repetitive features.In this work, we address this by reformulating pseudo-label generation as a Fused Gromov-Wasserstein (FGW) problem, which jointly optimizes inter-feature similarity and intra-structural consistency. Our framework, Shape-of-You (SoY), leverages a 3D foundation model to define this intra-structure in the geometric space, resolving abovementioned ambiguity. However, since FGW is a computationally prohibitive quadratic problem, we approximate it through anchor-based linearization.The resulting probabilistic transport plan provides a structurally consistent, yet noisy, supervisory signal.We introduce a soft-target loss, which dynamically blends guidance from this plan with the network's current predictions, to build a learning framework robust to this noise.SoY achieves state-of-the-art performance on the SPair-71k and AP-10k datasets, establishing a new benchmark in unsupervised semantic correspondence.Code is in the supplement.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37145", "url": null, "sourceid": 37241, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37148, "uid": "de3358cacdad583d84eb857858d15771", "name": "Enhancing Descriptive Captions with Visual Attributes for Multimodal Perception", "authors": [{"id": 128187, "fullname": "Yanpeng Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/128187?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 182651, "fullname": "JING HAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/182651?format=json", "institution": "University of Hong Kong"}, {"id": 128000, "fullname": "Ke Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128000?format=json", "institution": "Nanjing University"}, {"id": 154836, "fullname": "Jiang-Jiang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154836?format=json", "institution": "Baidu"}, {"id": 157000, "fullname": "Xiaofan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/157000?format=json", "institution": "BAIDU Inc,"}, {"id": 135934, "fullname": "Na Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/135934?format=json", "institution": "Singapore University of Technology and Design"}, {"id": 181619, "fullname": "Zechao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181619?format=json", "institution": "Nanjing University of Science and Techonolgy"}, {"id": 126338, "fullname": "Jingdong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126338?format=json", "institution": "Baidu"}], "abstract": "Training Large Multimodality Models (LMMs) relies on descriptive image caption that connects image and language. Existing methods for generating such captions often rely on distilling the captions from pretrained LMMs, constructing them from publicly available internet images, or even generating them through human annotation. However, these strategies can fall short in terms of precision and granularity, particularly when dealing with complex visual reasoning tasks. In this paper, we propose to leverage off-the-shelf visual specialists, which were trained from annotated images initially not for image captioning, for enhancing the image caption. Our approach, named EDC, explores object low-level and fine-grained attributes (e.g., depth, emotion and fine-grained categories) and object relations (e.g., relative location and human-object-interaction (HOI)), and combine the attributes into the descriptive caption. By systematically integrating these rich attributes into the generated captions, EDC significantly improves the descriptive quality of the captions, providing a deeper and more nuanced understanding of the visual content. Experiments demonstrate that such visual specialists are able to improve the performance for visual understanding tasks as well as reasoning that benefits from more accurate visual understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37148", "url": null, "sourceid": 37890, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37155, "uid": "1741c0f8d90a180b893a1776ae281820", "name": "Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator", "authors": [{"id": 77281, "fullname": "Gyeongsik Moon", "url": "http://cvpr.thecvf.com/api/miniconf/users/77281?format=json", "institution": "Korea University"}], "abstract": "Accurately recovering hand poses within the body context remains a major challenge in 3D whole-body pose estimation. This difficulty arises from a fundamental supervision gap: whole-body pose estimators are trained on full-body datasets with limited hand diversity, while hand-only estimators, trained on hand-centric datasets, excel at detailed finger articulation but lack global body awareness. To address this, we propose WholeBody++, a modular framework that leverages the strengths of both pre-trained whole-body and hand pose estimators. We introduce CHAM (Conditional Hands Modulator), a lightweight module that modulates the whole-body feature stream using hand-specific features extracted from a pre-trained hand pose estimator. This modulation enables the whole-body model to predict wrist orientations that are both accurate and coherent with the upper-body kinematic structure, without retraining the full-body model. In parallel, we directly incorporate finger articulations and hand shapes predicted by the hand pose estimator, aligning them to the full-body mesh via differentiable rigid alignment. This design allows WholeBody++ to combine globally consistent body reasoning with fine-grained hand detail. Extensive experiments demonstrate that WholeBody++ substantially improves hand accuracy and enhances overall full-body pose quality. Code and pretrained models will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37155", "url": null, "sourceid": 33004, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37156, "uid": "ec1859db271276d4081d7ca1f767258d", "name": "GeoFree-CoSeg: Unsupervised Point Cloud-Image Cross-Modal Co-Segmentation Without Geometric Alignment", "authors": [{"id": 181897, "fullname": "Xin Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181897?format=json", "institution": "Beijing Institute of Technology"}, {"id": 88777, "fullname": "Xiabi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88777?format=json", "institution": "Beijing Institute of Technology"}, {"id": 88760, "fullname": "Liyuan Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88760?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "Co-segmentation aims to identify and segment common objects across a set of point clouds or images. Existing methods focus on single-modal co-segmentation. However, the limited semantics of a single modality restrict the discovery of common objects, leading to costly and labor-intensive segmentation masks. In contrast, cross-modal co-segmentation leverages both modalities, offering two key advantages: (i) additional semantic cues compensate for the absence of segmentation masks; and (ii) complementary modalities provide richer common semantics beyond the limitations of single-modality approaches. Motivated by these challenges, we introduce a novel task: unsupervised point cloud-image cross-modal co-segmentation. We tackle this problem using a coarse-to-fine approach. First, the 3D and 2D branches extract coarse common semantics from each modality, respectively. Then, a cross-modal common semantic graph purifies these features into fine-grained common semantics. Finally, 3D and 2D common semantic features are fused and mutually enhanced, without requiring geometric alignment. Experiments on two standard point cloud benchmarks and two corresponding image co-segmentation datasets demonstrate our superior performance compared to existing unsupervised state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37156", "url": null, "sourceid": 45584, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37159, "uid": "49edbffc1939be2cd6ce7055ade63f21", "name": "ID-Sim: An Identity-Focused Similarity Metric", "authors": [{"id": 136072, "fullname": "Julia Chae", "url": "http://cvpr.thecvf.com/api/miniconf/users/136072?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 126940, "fullname": "Nicholas Kolkin", "url": "http://cvpr.thecvf.com/api/miniconf/users/126940?format=json", "institution": "Adobe Systems"}, {"id": 91153, "fullname": "Jui-Hsien Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91153?format=json", "institution": "Adobe Systems"}, {"id": 89223, "fullname": "Richard Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89223?format=json", "institution": "Adobe Systems"}, {"id": 98284, "fullname": "Sara Beery", "url": "http://cvpr.thecvf.com/api/miniconf/users/98284?format=json", "institution": "MIT"}, {"id": 106697, "fullname": "Cusuh Ham", "url": "http://cvpr.thecvf.com/api/miniconf/users/106697?format=json", "institution": "Adobe Research"}], "abstract": "Humans have remarkable selective sensitivity to identities--they easily distinguish between highly-similar identities, even across significantly different contexts such as diverse viewpoints or lighting. Vision models have struggled to match this capability, and progress towards identity-focused tasks such as personalized image generation is slowed by a lack of identity-focused evaluation metrics. To help facilitate progress, we propose ID-Sim, a feed-forward metric designed to faithfully reflect human selective sensitivity. To build ID-Sim, we curate a high-quality training set of images spanning diverse real-world domains, augmented with generative synthetic data that provides controlled, fine-grained identity and contextual variations. We evaluate our metric on a new unified evaluation benchmark for assessing consistency with human annotations across identity-focused recognition, retrieval, and generative tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37159", "url": null, "sourceid": 33953, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37160, "uid": "41fa78d83871255df97eca42c516d7a6", "name": "SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model", "authors": [{"id": 176374, "fullname": "Jiayuan Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/176374?format=json", "institution": "Tongji University"}, {"id": 186805, "fullname": "Yiming Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186805?format=json", "institution": "Li Auto Inc."}, {"id": 186806, "fullname": "Zhenglong Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186806?format=json", "institution": "Li Auto Inc."}, {"id": 186807, "fullname": "Yong Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186807?format=json", "institution": "Li Auto Inc."}, {"id": 186808, "fullname": "Wenbo Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186808?format=json", "institution": "Li Auto Inc."}, {"id": 185069, "fullname": "Zhihui Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185069?format=json", "institution": "Li Auto Inc."}, {"id": 153218, "fullname": "Kun Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153218?format=json", "institution": "LiAuto"}, {"id": 126809, "fullname": "Qijun Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126809?format=json", "institution": "Tongji University"}], "abstract": "This paper introduces a novel architecture for trajectory-conditioned forecasting of future 3D scene occupancy. In contrast to methods that rely on variational autoencoders (VAEs) to generate discrete occupancy tokens, which inherently limit representational capacity, our approach predicts multi-frame future occupancy in an end-to-end manner directly from raw image features. Inspired by the success of attention-based transformer architectures in foundational vision and language models such as GPT and VGGT, we employ a sparse occupancy representation that bypasses the intermediate bird\u2019s eye view (BEV) projection and its explicit geometric priors. This design allows the transformer to capture spatiotemporal dependencies more effectively. By avoiding both the finite-capacity constraint of discrete tokenization and the structural limitations of BEV representations, our method achieves state-of-the-art performance on the nuScenes benchmark for 1\u20123 second occupancy forecasting, outperforming existing approaches by a significant margin. Furthermore, it demonstrates robust scene dynamics understanding, consistently delivering high accuracy under arbitrary future trajectory conditioning.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37160", "url": null, "sourceid": 31973, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40286?format=json"], "related_events_ids": [40286]}, {"id": 40286, "uid": "41fa78d83871255df97eca42c516d7a6", "name": "SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model", "authors": [{"id": 176374, "fullname": "Jiayuan Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/176374?format=json", "institution": "Tongji University"}, {"id": 186805, "fullname": "Yiming Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186805?format=json", "institution": "Li Auto Inc."}, {"id": 186806, "fullname": "Zhenglong Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186806?format=json", "institution": "Li Auto Inc."}, {"id": 186807, "fullname": "Yong Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186807?format=json", "institution": "Li Auto Inc."}, {"id": 186808, "fullname": "Wenbo Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186808?format=json", "institution": "Li Auto Inc."}, {"id": 185069, "fullname": "Zhihui Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185069?format=json", "institution": "Li Auto Inc."}, {"id": 153218, "fullname": "Kun Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153218?format=json", "institution": "LiAuto"}, {"id": 126809, "fullname": "Qijun Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126809?format=json", "institution": "Tongji University"}], "abstract": "This paper introduces a novel architecture for trajectory-conditioned forecasting of future 3D scene occupancy. In contrast to methods that rely on variational autoencoders (VAEs) to generate discrete occupancy tokens, which inherently limit representational capacity, our approach predicts multi-frame future occupancy in an end-to-end manner directly from raw image features. Inspired by the success of attention-based transformer architectures in foundational vision and language models such as GPT and VGGT, we employ a sparse occupancy representation that bypasses the intermediate bird\u2019s eye view (BEV) projection and its explicit geometric priors. This design allows the transformer to capture spatiotemporal dependencies more effectively. By avoiding both the finite-capacity constraint of discrete tokenization and the structural limitations of BEV representations, our method achieves state-of-the-art performance on the nuScenes benchmark for 1\u20123 second occupancy forecasting, outperforming existing approaches by a significant margin. Furthermore, it demonstrates robust scene dynamics understanding, consistently delivering high accuracy under arbitrary future trajectory conditioning.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40286", "url": null, "sourceid": -31973, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37160?format=json"], "related_events_ids": [37160]}, {"id": 37161, "uid": "a87a1dd7a916d6d29a3eead260040e87", "name": "PACT: Phase-Like Transition Constraints in Adapter-Based Continual Learning of Vision-Language Models", "authors": [{"id": 171843, "fullname": "Xuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/171843?format=json", "institution": "Tsinghua University"}, {"id": 90686, "fullname": "Guiguang Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/90686?format=json", "institution": "Tsinghua University"}, {"id": 86015, "fullname": "Jungong Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/86015?format=json", "institution": "Aberystwyth University"}], "abstract": "Continual Learning (CL) enables Vision-Language Models (VLMs) to acquire new capabilities while retaining prior knowledge, for example, by employing task\u2011specific adapters. Existing CL approaches typically optimize these adapters to convergence, often with (near-)orthogonality constraints to reduce interference; however, isolating adapters in orthogonal subspaces can suppress cross\u2011task transfer and sharing. To address this problem, we provide a new perspective based on PAC-Bayesian analysis: once the per\u2011task optimization has converged, adapters should be further shaped to satisfy \\underline{P}hase\u2011like tr\\underline{A}nsition \\underline{C}ons\\underline{T}raints (PACT) -- a two-part formulation that (i) specifies a phase\u2011like transition relation among adapters and (ii) imposes explicit constraints that enforce this relation. Under PACT, adapter dynamics resemble the phase transition of water: the system gravitates toward either a \u201cfrozen\u201d (history\u2011preserving, tightly constrained) or a \u201cmelted\u201d (task\u2011adaptive, free) regime, while moving between them smoothly rather than via hard thresholds. We operationalize PACT by coupling stability and plasticity regularizers within a two\u2011branch Vision Transformer (ViT), seeding adapters with a Stable Adapter Initialization (SAI), and introducing a Prior Anchoring (PA) mechanism, thereby inducing phase\u2011like adapter dynamics. Across diverse CL settings, PACT surpasses state\u2011of\u2011the\u2011art methods while reducing the number of trainable parameters by $36.96\\%$ relative to standard adapter\u2011based baselines. Our code will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37161", "url": null, "sourceid": 31576, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37164, "uid": "bafcc0f3168a2d49326da2ad2b1dce7f", "name": "GaussianFluent: Gaussian Simulation for Dynamic Scenes with Mixed Materials", "authors": [{"id": 183556, "fullname": "Bei Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183556?format=json", "institution": "Peking University"}, {"id": 75772, "fullname": "Yixin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/75772?format=json", "institution": "BIGAI"}, {"id": 156709, "fullname": "Ruijie Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156709?format=json", "institution": "Peking University"}, {"id": 126215, "fullname": "Gang Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/126215?format=json", "institution": "Peking University"}, {"id": 128617, "fullname": "Hongbin Zha", "url": "http://cvpr.thecvf.com/api/miniconf/users/128617?format=json", "institution": "Peking University"}, {"id": 186818, "fullname": "Yuru Pei", "url": "http://cvpr.thecvf.com/api/miniconf/users/186818?format=json", "institution": "Peking University"}, {"id": 75767, "fullname": "Siyuan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75767?format=json", "institution": "Beijing Institute of General Artificial Intelligence"}], "abstract": "3D Gaussian Splatting (3DGS) has emerged as a prominent 3D representation for high-fidelity and real-time rendering. Prior work has coupled physics simulation with Gaussians, but predominantly targets soft, deformable materials, leaving brittle fracture largely unresolved. This stems from two key obstacles: the lack of volumetric interiors with coherent textures in GS representation, and the absence of fracture-aware simulation methods for Gaussians. To address these challenges, we introduce GaussianFluent, a unified framework for realistic simulation and rendering of dynamic object states. First, it synthesizes photorealistic interiors by densifying internal Gaussians guided by generative models. Second, it integrates an optimized Continuum Damage Material Point Method (CD-MPM) to enable brittle fracture simulation at remarkably high speed. Our approach handles complex scenarios including mixed-material objects and multi-stage fracture propagation, achieving results infeasible with previous methods. Experiments clearly demonstrate GaussianFluent's capability for photo-realistic, real-time rendering with structurally consistent interiors, highlighting its potential for downstream application, such as VR and Robotics.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37164", "url": null, "sourceid": 37474, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40287?format=json"], "related_events_ids": [40287]}, {"id": 40287, "uid": "bafcc0f3168a2d49326da2ad2b1dce7f", "name": "GaussianFluent: Gaussian Simulation for Dynamic Scenes with Mixed Materials", "authors": [{"id": 183556, "fullname": "Bei Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183556?format=json", "institution": "Peking University"}, {"id": 75772, "fullname": "Yixin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/75772?format=json", "institution": "BIGAI"}, {"id": 156709, "fullname": "Ruijie Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156709?format=json", "institution": "Peking University"}, {"id": 126215, "fullname": "Gang Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/126215?format=json", "institution": "Peking University"}, {"id": 128617, "fullname": "Hongbin Zha", "url": "http://cvpr.thecvf.com/api/miniconf/users/128617?format=json", "institution": "Peking University"}, {"id": 186818, "fullname": "Yuru Pei", "url": "http://cvpr.thecvf.com/api/miniconf/users/186818?format=json", "institution": "Peking University"}, {"id": 75767, "fullname": "Siyuan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75767?format=json", "institution": "Beijing Institute of General Artificial Intelligence"}], "abstract": "3D Gaussian Splatting (3DGS) has emerged as a prominent 3D representation for high-fidelity and real-time rendering. Prior work has coupled physics simulation with Gaussians, but predominantly targets soft, deformable materials, leaving brittle fracture largely unresolved. This stems from two key obstacles: the lack of volumetric interiors with coherent textures in GS representation, and the absence of fracture-aware simulation methods for Gaussians. To address these challenges, we introduce GaussianFluent, a unified framework for realistic simulation and rendering of dynamic object states. First, it synthesizes photorealistic interiors by densifying internal Gaussians guided by generative models. Second, it integrates an optimized Continuum Damage Material Point Method (CD-MPM) to enable brittle fracture simulation at remarkably high speed. Our approach handles complex scenarios including mixed-material objects and multi-stage fracture propagation, achieving results infeasible with previous methods. Experiments clearly demonstrate GaussianFluent's capability for photo-realistic, real-time rendering with structurally consistent interiors, highlighting its potential for downstream application, such as VR and Robotics.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40287", "url": null, "sourceid": -37474, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37164?format=json"], "related_events_ids": [37164]}, {"id": 37166, "uid": "687e0d2bafc7e6ec43af9c3f65b45508", "name": "Scalable Multi-View Subspace Clustering with Tensorized Anchor Guidance", "authors": [{"id": 174290, "fullname": "Miao Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/174290?format=json", "institution": "National University of Defense Technology"}, {"id": 156894, "fullname": "Xingchen Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156894?format=json", "institution": "National University of Defense Technology"}, {"id": 155808, "fullname": "Jiyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155808?format=json", "institution": "National University of Defense Technology"}, {"id": 90222, "fullname": "Siwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90222?format=json", "institution": "Academy of Military Sciences"}, {"id": 186820, "fullname": "Min Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186820?format=json", "institution": "National University of Singapore; National University of Defense Technology"}, {"id": 186821, "fullname": "Zijian Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186821?format=json", "institution": "Power China Zhongnan Engineering Corporation Limited"}], "abstract": "Anchor-based multi-view clustering methods have gained significant attention for their effectiveness of handling large-scale datasets in recent years. The performance of these method is highly dependent on anchor quality.However, current methods neglect the interactive relationships among cross-view anchors, failing to effectively discover and exploit  consistent and complementary information, leading to noisy or suboptimal anchor representations. In this paper, we propose a novel scalable tensorized anchor guidance for multi-view subspace clustering, which directly couples anchors across views to improve clustering performance. Specifically, we construct a third-order anchor tensor from view-specific anchors in a low-dimensional latent space. By imposing a tensor Schatten p-norm constraint on the anchor tensor, we can explicitly capture cross-view low-rank structure and jointly exploit consistency and complementarity information  among anchors. Moreover, the tensorized anchor regularizer is independent of the number of samples, which reduces both time and space complexity.  Experimental results on seven datasets  demonstrate that SMVS-TAG achieves superior effectiveness and stability compared to state-of-the-art large-scale MVC methods.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37166", "url": null, "sourceid": 31050, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40288?format=json"], "related_events_ids": [40288]}, {"id": 40288, "uid": "687e0d2bafc7e6ec43af9c3f65b45508", "name": "Scalable Multi-View Subspace Clustering with Tensorized Anchor Guidance", "authors": [{"id": 174290, "fullname": "Miao Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/174290?format=json", "institution": "National University of Defense Technology"}, {"id": 156894, "fullname": "Xingchen Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156894?format=json", "institution": "National University of Defense Technology"}, {"id": 155808, "fullname": "Jiyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155808?format=json", "institution": "National University of Defense Technology"}, {"id": 90222, "fullname": "Siwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90222?format=json", "institution": "Academy of Military Sciences"}, {"id": 186820, "fullname": "Min Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186820?format=json", "institution": "National University of Singapore; National University of Defense Technology"}, {"id": 186821, "fullname": "Zijian Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186821?format=json", "institution": "Power China Zhongnan Engineering Corporation Limited"}], "abstract": "Anchor-based multi-view clustering methods have gained significant attention for their effectiveness of handling large-scale datasets in recent years. The performance of these method is highly dependent on anchor quality.However, current methods neglect the interactive relationships among cross-view anchors, failing to effectively discover and exploit  consistent and complementary information, leading to noisy or suboptimal anchor representations. In this paper, we propose a novel scalable tensorized anchor guidance for multi-view subspace clustering, which directly couples anchors across views to improve clustering performance. Specifically, we construct a third-order anchor tensor from view-specific anchors in a low-dimensional latent space. By imposing a tensor Schatten p-norm constraint on the anchor tensor, we can explicitly capture cross-view low-rank structure and jointly exploit consistency and complementarity information  among anchors. Moreover, the tensorized anchor regularizer is independent of the number of samples, which reduces both time and space complexity.  Experimental results on seven datasets  demonstrate that SMVS-TAG achieves superior effectiveness and stability compared to state-of-the-art large-scale MVC methods.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40288", "url": null, "sourceid": -31050, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37166?format=json"], "related_events_ids": [37166]}, {"id": 37169, "uid": "e911087c91c63c9d3e7ee83e372d10f7", "name": "Lipschitz Optimization for Formal Verification of Homographies", "authors": [{"id": 181761, "fullname": "Jean-Guillaume Durand", "url": "http://cvpr.thecvf.com/api/miniconf/users/181761?format=json", "institution": "Joby Aviation"}, {"id": 186826, "fullname": "Panagiotis Kouvaros", "url": "http://cvpr.thecvf.com/api/miniconf/users/186826?format=json", "institution": null}, {"id": 186827, "fullname": "Maxime Gariel", "url": "http://cvpr.thecvf.com/api/miniconf/users/186827?format=json", "institution": "Joby Aviation"}, {"id": 85623, "fullname": "Alessio Lomuscio", "url": "http://cvpr.thecvf.com/api/miniconf/users/85623?format=json", "institution": "Imperial College London Safe Intelligence"}], "abstract": "The adoption of vision neural networks in regulated industries requires formal robustness guarantees, especially in safety-critical domains such as healthcare, aerospace, and autonomous vehicles. However, current approaches are confined to incomplete statistical verification, or robustness to $\\ell_p$-norm or affine transforms which represent a limited subset of perturbations to the image formation process.In this paper, we present a formal verification approach when the capturing camera undergoes 3D motion perturbations. We first establish a closed-form mapping from camera pose to pixel values. By analyzing the continuity properties of the resulting homographies, we show that recent work on Lipschitz optimization and piecewise continuity can be extended to derive tight linear bounds on perturbed pixel values. While our formulae are grounded in the vision-based landing problem, they generalize to other scenes with predominantly planar features (e.g., augmented reality, traffic signs). This enables formal verification against a broad class of projective geometry transformations, without requiring simulation or complex modeling of image formation.We first validate our implementation, and show up to 89\\% speedup and 7\\% tighter bounds than the latest work. We then evaluate our method on established benchmarks from the VNN Competition, and highlight key model vulnerabilities to 3D transforms. Finally, we perform the first formal verification of a vision-based landing system under 3D perturbations, addressing a key challenge in the regulatory certification of learned models for real-world systems. The data and code used for this paper are publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37169", "url": null, "sourceid": 40341, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37171, "uid": "c662dcc3767481450092bfb1e6ab9419", "name": "ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body", "authors": [{"id": 165062, "fullname": "Juze Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/165062?format=json", "institution": "Stanford University"}, {"id": 133214, "fullname": "Changan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/133214?format=json", "institution": "Stanford University"}, {"id": 89781, "fullname": "Xin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89781?format=json", "institution": "University of Chinese Academy of Sciences, ShanghaiTech University"}, {"id": 176502, "fullname": "Heng Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/176502?format=json", "institution": "Stanford University"}, {"id": 89277, "fullname": "Tiange Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89277?format=json", "institution": "Stanford University"}, {"id": 186835, "fullname": "Ali Sartaz Khan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186835?format=json", "institution": "Stanford University"}, {"id": 157447, "fullname": "Shrinidhi Kowshika Lakshmikanth", "url": "http://cvpr.thecvf.com/api/miniconf/users/157447?format=json", "institution": "Stanford University"}, {"id": 75810, "fullname": "Ehsan Adeli", "url": "http://cvpr.thecvf.com/api/miniconf/users/75810?format=json", "institution": "Stanford University"}], "abstract": "Human communication is inherently multimodal and social: words, prosody, and body language jointly carry intent. Yet most prior systems model human behavior as a translation task\u2014co-speech gesture or text-to-motion that maps a fixed utterance to motion clips\u2014without requiring agentic decision-making about when to move, what to do, or how to adapt across multi-turn dialogue. This leads to brittle timing, weak social grounding, and fragmented stacks where speech, text, and motion are trained or inferred in isolation. We introduce ViBES (Voice in Behavioral Expression and Synchrony), a conversational 3D agent that jointly plans language and movement and executes dialogue-conditioned body actions. Concretely, ViBES is a speech-language-behavior (SLB) model with a mixture-of-modality-experts (MoME) backbone: modality-partitioned transformer experts for speech, facial expression, and body motion. The model processes interleaved multimodal token streams with hard routing by modality (parameters are split per expert), while sharing information through cross-expert attention. By leveraging strong pretrained speech-language models, the agent supports mixed-initiative interaction: users can speak, type, or issue body-action directives mid-conversation, and the system exposes controllable behavior hooks for streaming responses. We further benchmark on multi-turn conversation with automatic metrics of dialogue\u2013motion alignment and behavior quality, and observe consistent gains over strong co-speech and text-to-motion baselines. ViBES goes beyond \u201cspeech-conditioned motion generation\u201d toward agentic virtual bodies where language, prosody, and movement are jointly generated, enabling controllable, socially competent 3D interaction.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37171", "url": null, "sourceid": 43587, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37172, "uid": "93de42b079dbd3509c11c6ce85d823c1", "name": "EpiAgent: An Agent-Centric System for Ancient Inscription Restoration", "authors": [{"id": 186836, "fullname": "Shipeng Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186836?format=json", "institution": null}, {"id": 182116, "fullname": "Ang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/182116?format=json", "institution": "Southeast University"}, {"id": 186837, "fullname": "Na Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/186837?format=json", "institution": "Nanjing university; Nanjing university"}, {"id": 186838, "fullname": "Pengfei Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186838?format=json", "institution": "Southeast University"}, {"id": 87706, "fullname": "Min-Ling Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87706?format=json", "institution": "Southeast University"}, {"id": 186839, "fullname": "Hui Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/186839?format=json", "institution": "Southeast University"}], "abstract": "Ancient inscriptions, as repositories of cultural memory, have suffered centuries of environmental and human-induced degradation. Restoring their intertwined visual and textual integrity poses one of the most demanding challenges in digital heritage preservation. However, existing AI-based approaches often rely on rigid pipelines, struggling to generalize across such complex and heterogeneous real-world degradations.Inspired by the skill-coordinated workflow of human epigraphers, we propose EpiAgent, an agent-centric system that formalizes inscription restoration as a hierarchical planning problem. Following an Observe\u2013Conceive\u2013Execute\u2013Reevaluate paradigm,  an LLM-based central planner orchestrates collaboration among multimodal analysis, historical experience, specialized restoration tools, and iterative self-refinement. This agent-centric coordination enables a flexible and adaptive restoration process beyond conventional single-pass methods.Across real-world degraded inscriptions, EpiAgent delivers superior restoration quality and stronger generalization compared to existing methods. Our work marks a pivotal step toward expert-level agent-driven restoration of cultural heritage. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37172", "url": null, "sourceid": 46565, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37176, "uid": "c0715b919f21f722a004a90a42f2c790", "name": "Lumosaic: Hyperspectral Video via Active Illumination and Coded-Exposure Pixels", "authors": [{"id": 180702, "fullname": "Dhruv Verma", "url": "http://cvpr.thecvf.com/api/miniconf/users/180702?format=json", "institution": "University of Toronto"}, {"id": 186850, "fullname": "Andrew Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186850?format=json", "institution": "Pinterest, Inc."}, {"id": 186851, "fullname": "Roberto Rangel", "url": "http://cvpr.thecvf.com/api/miniconf/users/186851?format=json", "institution": "University of Toronto; Universidade de S\u00e3o Paulo; Universidade do Estado do Rio de Janeiro"}, {"id": 186852, "fullname": "Ayandev Barman", "url": "http://cvpr.thecvf.com/api/miniconf/users/186852?format=json", "institution": "University of Toronto"}, {"id": 186853, "fullname": "Hao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186853?format=json", "institution": "University of Toronto"}, {"id": 186854, "fullname": "Chenjia Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186854?format=json", "institution": "University of Toronto, University of Toronto"}, {"id": 186855, "fullname": "Fengqi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186855?format=json", "institution": "University of Toronto"}, {"id": 98042, "fullname": "Roman Genov", "url": "http://cvpr.thecvf.com/api/miniconf/users/98042?format=json", "institution": "University of Toronto"}, {"id": 77223, "fullname": "David B. Lindell", "url": "http://cvpr.thecvf.com/api/miniconf/users/77223?format=json", "institution": "University of Toronto"}, {"id": 93592, "fullname": "Kiriakos Kutulakos", "url": "http://cvpr.thecvf.com/api/miniconf/users/93592?format=json", "institution": "University of Toronto"}, {"id": 186856, "fullname": "Alex Mariakakis", "url": "http://cvpr.thecvf.com/api/miniconf/users/186856?format=json", "institution": "University of Toronto"}], "abstract": "We present Lumosaic, a compact active hyperspectral video system designed for real-time capture of dynamic scenes. Our approach combines a narrowband LED array with a coded-exposure-pixel (CEP) camera capable of high-speed, per-pixel exposure control, enabling joint encoding of scene information across space, time, and wavelength within each video frame. Unlike passive snapshot systems that divide light across multiple spectral channels simultaneously and assume no motion during a frame\u2019s exposure, Lumosaic actively synchronizes illumination and pixel-wise exposure, improving photon utilization and preserving spectral fidelity under motion. A learning-based reconstruction pipeline then recovers 31-channel hyperspectral (400\u2013700 nm) video at 30 fps and VGA resolution, producing temporally coherent and spectrally accurate reconstructions. Experiments on synthetic and real data demonstrate that Lumosaic significantly improves reconstruction fidelity and temporal stability over existing snapshot hyperspectral imaging systems, enabling robust hyperspectral video across diverse materials and motion conditions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37176", "url": null, "sourceid": 39732, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37177, "uid": "648cbc29675eff7e6f337ffc66c907de", "name": "Proxy-GS: Unified Occlusion Priors for Training and Inference in Structured 3D Gaussian Splatting", "authors": [{"id": 186857, "fullname": "Yuanyuan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186857?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 184104, "fullname": "YUNING GONG", "url": "http://cvpr.thecvf.com/api/miniconf/users/184104?format=json", "institution": "National University of Singapore; Sichuan University; Shanghai Artificial Intelligence Laboratory"}, {"id": 155848, "fullname": "Yifei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155848?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 129357, "fullname": "Li Jingfeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/129357?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 88296, "fullname": "Dan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88296?format=json", "institution": "CSE, HKUST"}, {"id": 186858, "fullname": "Yanci Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186858?format=json", "institution": "Sichuan University"}, {"id": 129116, "fullname": "Dingwen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129116?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 186859, "fullname": "Xiao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/186859?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 126700, "fullname": "Zhihang Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/126700?format=json", "institution": "Shanghai AI Lab"}], "abstract": "3D Gaussian Splatting (3DGS) has emerged as an efficient approach for achieving photorealistic rendering. Recent MLP-based variants further improve visual fidelity but introduce substantial decoding overhead during rendering. To alleviate computation cost, several pruning strategies and level-of-detail (LOD) techniques have been introduced, aiming to effectively reduce the number of Gaussian primitives in large-scale scenes. However, our analysis reveals that significant redundancy still remains due to the lack of occlusion awareness. In this work, we propose Proxy-GS, a novel pipeline that exploits a proxy to introduce Gaussian occlusion awareness from any view.At the core of our approach is a fast proxy system capable of producing precise occlusion depth maps at resolution 1000$\\times$1000 under 1 ms. This proxy serves two roles: first, it guides the culling of anchors and Gaussians to accelerate rendering speed. Second, it guides the densification towards surfaces during training, avoiding inconsistencies in occluded regions, and improving the rendering quality. In heavily occluded scenarios, such as the MatrixCity Streets dataset, Proxy-GS not only equips MLP-based Gaussian splatting with stronger rendering capability but also achieves faster rendering speed than the original 3DGS. Specifically, it achieves more than $2.5\\times$ speedup over Octree-GS, and consistently delivers substantially higher rendering quality.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37177", "url": null, "sourceid": 44754, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40289?format=json"], "related_events_ids": [40289]}, {"id": 37179, "uid": "e6bebc499c445570ecbe7829ae23b881", "name": "Vision-Language Model Guided Source-Free Domain Adaptation via Optimal Transport", "authors": [{"id": 174861, "fullname": "Shuo Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/174861?format=json", "institution": "Xidian University"}, {"id": 186861, "fullname": "Xu Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186861?format=json", "institution": "Xidian University"}, {"id": 149259, "fullname": "Jingjing Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/149259?format=json", "institution": "Xidian University"}, {"id": 186862, "fullname": "Xiangrong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186862?format=json", "institution": null}], "abstract": "Unsupervised domain adaptation transfers knowledge from a labeled source domain to an unlabeled target domain. When source data cannot be accessed, source-free domain adaptation (SFDA) becomes a practical alternative. However, existing SFDA methods mainly rely on pseudo-label based self-training, which often accumulates noise and bias under large domain gaps. We propose VSFOT, a framework that leverages a pretrained Vision-Language Model (VLM) to guide optimal transport (OT) alignment between target features and source prototypes. Instead of relying on unreliable pseudo-labels, VSFOT employs VLM-derived semantic priors and an OT-based matching strategy to achieve stable and reliable adaptation. To further enhance domain alignment, VSFOT incorporates a bidirectional distillation mechanism in which the model learns semantic consistency from the VLM, while the VLM is refined using task-specific cues from the model. These two stages alternate during training. By combining the generalization ability of the VLM with the discriminative power of the task model, VSFOT achieves robust, source-free adaptation and consistently outperforms existing SFDA methods on four benchmark datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37179", "url": null, "sourceid": 34963, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40289, "uid": "648cbc29675eff7e6f337ffc66c907de", "name": "Proxy-GS: Unified Occlusion Priors for Training and Inference in Structured 3D Gaussian Splatting", "authors": [{"id": 186857, "fullname": "Yuanyuan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186857?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 184104, "fullname": "YUNING GONG", "url": "http://cvpr.thecvf.com/api/miniconf/users/184104?format=json", "institution": "National University of Singapore; Sichuan University; Shanghai Artificial Intelligence Laboratory"}, {"id": 155848, "fullname": "Yifei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155848?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 129357, "fullname": "Li Jingfeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/129357?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 88296, "fullname": "Dan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88296?format=json", "institution": "CSE, HKUST"}, {"id": 186858, "fullname": "Yanci Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186858?format=json", "institution": "Sichuan University"}, {"id": 129116, "fullname": "Dingwen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129116?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 186859, "fullname": "Xiao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/186859?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 126700, "fullname": "Zhihang Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/126700?format=json", "institution": "Shanghai AI Lab"}], "abstract": "3D Gaussian Splatting (3DGS) has emerged as an efficient approach for achieving photorealistic rendering. Recent MLP-based variants further improve visual fidelity but introduce substantial decoding overhead during rendering. To alleviate computation cost, several pruning strategies and level-of-detail (LOD) techniques have been introduced, aiming to effectively reduce the number of Gaussian primitives in large-scale scenes. However, our analysis reveals that significant redundancy still remains due to the lack of occlusion awareness. In this work, we propose Proxy-GS, a novel pipeline that exploits a proxy to introduce Gaussian occlusion awareness from any view.At the core of our approach is a fast proxy system capable of producing precise occlusion depth maps at resolution 1000$\\times$1000 under 1 ms. This proxy serves two roles: first, it guides the culling of anchors and Gaussians to accelerate rendering speed. Second, it guides the densification towards surfaces during training, avoiding inconsistencies in occluded regions, and improving the rendering quality. In heavily occluded scenarios, such as the MatrixCity Streets dataset, Proxy-GS not only equips MLP-based Gaussian splatting with stronger rendering capability but also achieves faster rendering speed than the original 3DGS. Specifically, it achieves more than $2.5\\times$ speedup over Octree-GS, and consistently delivers substantially higher rendering quality.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40289", "url": null, "sourceid": -44754, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37177?format=json"], "related_events_ids": [37177]}, {"id": 37181, "uid": "f30e13dda67413e38d9dec087e89d3ed", "name": "UniVerse: A Unified Modulation Framework for Segmentation-Free, Disentangled Multi-Concept Personalization", "authors": [{"id": 128130, "fullname": "Quynh Phung", "url": "http://cvpr.thecvf.com/api/miniconf/users/128130?format=json", "institution": "University of Maryland, College Park"}, {"id": 129262, "fullname": "Sandesh Ghimire", "url": "http://cvpr.thecvf.com/api/miniconf/users/129262?format=json", "institution": "QualComm"}, {"id": 175761, "fullname": "Minsi Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175761?format=json", "institution": "University of Maryland"}, {"id": 129227, "fullname": "Charles Tsai", "url": "http://cvpr.thecvf.com/api/miniconf/users/129227?format=json", "institution": "Qualcomm Inc, QualComm"}, {"id": 88945, "fullname": "Jia-Bin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88945?format=json", "institution": "University of Maryland, College Park"}], "abstract": "Personalized visual understanding has advanced significantly, yet existing approaches struggle to localize and extract specific concepts when input images contain multiple objects. Many prior methods rely heavily on segmentation-based supervision or exhibit poor compositional generalization, limiting their ability to accurately disentangle and manipulate individual concepts. In this work, we propose UniVerse, a Unified Modulation Framework for segmentation-free, disentangled multi-concept personalization in diffusion transformers. Our method allows for composable and decomposable concept extraction, enabling fine-grained localization and representation of target objects without explicit segmentation masks. UniVerse learns to decompose complex scenes into concept-specific representations and then compose them in a unified manner, enabling robust personalization across diverse visual contexts. Through extensive experiments on multiple benchmarks, we demonstrate that UniVerse significantly outperforms state-of-the-art baselines in both localization accuracy and visual fidelity. Qualitative and quantitative results show that our approach can precisely extract target concepts in cluttered scenes, paving the way for more flexible, interpretable, and personalized visual generation and understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37181", "url": null, "sourceid": 40910, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37183, "uid": "443ecbd48bc81e881b05d044e4376b6f", "name": "MeToM: Metadata-Guided Token Merging for Efficient Video LLMs", "authors": [{"id": 181243, "fullname": "Zhuojie Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181243?format=json", "institution": "The University of Queensland"}, {"id": 88175, "fullname": "Shijie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88175?format=json", "institution": "University of Queensland"}, {"id": 183016, "fullname": "Xin Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183016?format=json", "institution": "Adelaide University"}], "abstract": "Video Large Language Models (VLLMs) encounter significant computational challenges due to the large volume of visual tokens generated from multiple frames.Existing visual token pruning methods fail to account for the uneven spatiotemporal information density, thus squandering scarce token budgets on regions with low information density.In this paper, we propose a training-free \\textbf{Me}tadata-guided \\textbf{To}ken \\textbf{M}erging framework (\\textbf{MeToM}) that leverages intrinsic video metadata to adaptively allocate budgets and merge visual tokens based on content complexity.Specifically, MeToM exploits residual from the metadata as spatial information density cues.It merges less informative regions during tokenization, avoiding redundant encoding and improving the efficiency of the visual encoder.Additionally, MeToM captures temporal variations in information density by utilizing the average Group of Pictures (GoP) size to represent scene complexity.This mechanism enables dynamic per-frame token allocation that adaptively adjusts token budgets across time, assigning more tokens to content-complex frames and fewer to simple ones.Finally, inside the LLM, we merge low-contribution visual tokens via multi-layer attention to compact the prefill FLOPs and visual KV cache.Extensive experimental results demonstrate that MeToM outperforms the prior SoTA counterparts, achieving $2.65\\times$ inference speedup against the baseline VLLM, while still improving the performance, without training.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37183", "url": null, "sourceid": 41285, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37185, "uid": "dba76338497e2ebab41d96388c47ff26", "name": "ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion", "authors": [{"id": 155873, "fullname": "Remy Sabathier", "url": "http://cvpr.thecvf.com/api/miniconf/users/155873?format=json", "institution": "Department of Computer Science, University College London, University of London"}, {"id": 152120, "fullname": "David Novotny", "url": "http://cvpr.thecvf.com/api/miniconf/users/152120?format=json", "institution": "Meta"}, {"id": 85661, "fullname": "Niloy J. Mitra", "url": "http://cvpr.thecvf.com/api/miniconf/users/85661?format=json", "institution": "University College London"}, {"id": 152119, "fullname": "Tom Monnier", "url": "http://cvpr.thecvf.com/api/miniconf/users/152119?format=json", "institution": "Facebook"}], "abstract": "Generating animated 3D objects is at the heart of many applications, yet most advanced works are typically difficult to apply in practice because of their limited setup, their long runtime, or their limited quality. We introduce ActionMesh, a generative model that predicts production-ready 3D meshes \"in action\" in a feed-forward manner. Drawing inspiration from early video models, our key insight is to modify existing 3D diffusion models to include a temporal axis, resulting in a framework we dubbed \"temporal 3D diffusion\". Specifically, we first adapt the 3D diffusion stage to generate a sequence of synchronized latents representing time-varying and independent 3D shapes. Second, we design a temporal 3D autoencoder that translates a sequence of independent shapes into the corresponding deformations of a pre-defined reference shape, allowing us to build an animation. Combining these two components, ActionMesh generates animated 3D meshes from different inputs like a monocular video, a text description, or even a 3D mesh with a text prompt describing its animation. Besides, compared to previous approaches, our method is fast and produces results that are rig-free and topology consistent, hence enabling rapid iteration and seamless applications like texturing and retargeting. We evaluate our model on standard video-to-4D benchmarks (Consistent4D, Objaverse) and report state-of-the-art performances on both geometric accuracy and temporal consistency, demonstrating that our model can deliver animated 3D meshes with unprecedented speed and quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37185", "url": null, "sourceid": 31368, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37186, "uid": "bd3f1eb5c94a9ef8a93e0205ade42ff2", "name": "CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation", "authors": [{"id": 156490, "fullname": "Kaiyi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156490?format=json", "institution": "University of Hong Kong"}, {"id": 88129, "fullname": "Yukun Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88129?format=json", "institution": "University of Science and Technology of China"}, {"id": 156311, "fullname": "Yu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/156311?format=json", "institution": "Tsinghua University"}, {"id": 186875, "fullname": "Jianhong Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/186875?format=json", "institution": "Zhejiang University"}, {"id": 75722, "fullname": "Xintao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75722?format=json", "institution": "Tencent"}, {"id": 130820, "fullname": "Zinan Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/130820?format=json", "institution": "Microsoft Research"}, {"id": 130816, "fullname": "Xuefei Ning", "url": "http://cvpr.thecvf.com/api/miniconf/users/130816?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 158011, "fullname": "Jiwen Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/158011?format=json", "institution": "University of Hong Kong"}, {"id": 130817, "fullname": "Yu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130817?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 86697, "fullname": "Xihui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86697?format=json", "institution": "The University of Hong Kong"}], "abstract": "Cinematic video production requires control over scene-subject composition and camera movement, but live-action shooting remains costly due to the need for constructing physical sets. To address this, we introduce the task of cinematic video generation with decoupled scene context: given multiple images of a static environment, the goal is to synthesize high-quality videos featuring dynamic subject while preserving the underlying scene consistency and following a user-specified camera trajectory. We present CineScene, a framework that leverages implicit 3D-aware scene representation for cinematic video generation. Our key innovation is a novel context conditioning mechanism that injects 3D-aware features in an implicit way: By encoding scene images into visual representations through VGGT, CineScene injects spatial priors into a pretrained text-to-video generation model by additional context concatenation, enabling camera-controlled video synthesis with consistent scenes and dynamic subjects.  To further enhance the model's robustness, we introduce a simple yet effective random-shuffling strategy for the input scene images during training. To address the lack of training data, we construct a scene-decoupled dataset with Unreal Engine 5, containing paired videos of scenes with and without dynamic subjects, panoramic images representing the underlying static scene, along with their camera trajectories. Experiments show that CineScene achieves state-of-the-art performance in scene-consistent cinematic video generation, handling large camera movements and demonstrating generalization across diverse environments.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37186", "url": null, "sourceid": 40901, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37189, "uid": "7c4049f0668f21344eb1c0de0b3bd6b0", "name": "Human-Centric Multi-Exposure Fusion: Benchmark and Bi-level Cognition Distillation Framework", "authors": [{"id": 180446, "fullname": "Jingjie Shang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180446?format=json", "institution": "Peking University"}, {"id": 151421, "fullname": "Tengyu Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/151421?format=json", "institution": "Dalian University of Technology"}, {"id": 186879, "fullname": "Heng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186879?format=json", "institution": "Dalian University of Technology"}, {"id": 152576, "fullname": "Jinyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152576?format=json", "institution": "Dalian University of Technology"}, {"id": 131737, "fullname": "Risheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131737?format=json", "institution": "Dalian University of Technology"}, {"id": 149573, "fullname": "Yuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/149573?format=json", "institution": "Peking University"}, {"id": 186880, "fullname": "Xiaochen Bo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186880?format=json", "institution": "Academy of Military Medical Sciences"}], "abstract": "Multi-Exposure Fusion (MEF) seeks to generate a single high-quality image from multiple inputs captured at different exposure levels. Despite substantial progress, most existing approaches depend on statistical metrics that poorly reflect human perceptual preferences. Electroencephalography (EEG) provides a direct physiological window into human cognition, yet its use in low-level vision remains limited due to scarce paired data and the absence of bio-signals during inference. We address these challenges through two key contributions. First, we introduce Cog-Expo, the first dataset capturing human cognitive responses to multi-exposure stimuli, establishing a bridge between neuroscience and computational photography. Second, we propose a bi-level coupled learning framework that leverages this cognitive information without requiring it during inference. A Mental Integrated Transformer serves as the Teacher, incorporating cognitive priors to guide visual feature learning, while a lightweight Student is trained to approximate these cues using only image inputs. Through bi-level optimization, the Teacher learns inherently distillable representations, enabling the Student to emulate cognitive guidance efficiently. Extensive experiments confirm that our method achieves state-of-the-art fusion performance and aligns more closely with human perception.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37189", "url": null, "sourceid": 43640, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37190, "uid": "2fcb2666e77de32af114840ce33d23b8", "name": "Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning", "authors": [{"id": 183852, "fullname": "Subin Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/183852?format=json", "institution": "Kyung Hee University"}, {"id": 129674, "fullname": "Jung Uk Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/129674?format=json", "institution": "Kyung Hee University"}], "abstract": "Audio\u2013Visual Sound Source Localization (SSL) aims to identify the locations of sound-emitting objects by leveraging correlations between audio and visual modalities. Existing SSL methods often rely on contrastive learning\u2013based feature matching but lack explicit reasoning and verification stages, limiting their effectiveness in complex acoustic scenes. Inspired by human metacognitive processes, we propose a training-free SSL framework that exploits the intrinsic reasoning capabilities of Multimodal Large Language Models (MLLMs). Our Generation-Analysis-Refinement (GAR) pipeline consists of three stages: Generation produces initial bounding boxes and audio classifications; Analysis quantifies audio\u2013visual consistency via open-set role tagging and anchor voting; and Refinement applies adaptive gating to prevent unnecessary adjustments. Extensive experiments on single-source (VGGSound-Single, MUSIC-Solo) and multi-source (VGGSound-Duet, MUSIC-Duet) benchmarks demonstrate competitive performance. The source code will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37190", "url": null, "sourceid": 35160, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37191, "uid": "d6d5125f2d5e36115d2fe90d1a4d4225", "name": "Evidential Transformation Network: Turning Pretrained Models into Evidential Models for Uncertainty Estimation", "authors": [{"id": 184042, "fullname": "Yongchan Chun", "url": "http://cvpr.thecvf.com/api/miniconf/users/184042?format=json", "institution": "LG AI Research"}, {"id": 159364, "fullname": "Chanhee Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/159364?format=json", "institution": "Korea University"}, {"id": 186881, "fullname": "Jeongho Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/186881?format=json", "institution": "Korea University"}, {"id": 186269, "fullname": "Jaehyung Seo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186269?format=json", "institution": "Konkuk University"}, {"id": 186270, "fullname": "Heuiseok Lim", "url": "http://cvpr.thecvf.com/api/miniconf/users/186270?format=json", "institution": "Korea University"}], "abstract": "Pretrained models have become standard in both vision and language, yet they typically do not provide reliable measures of confidence. Existing uncertainty estimation methods\u2014such as deep ensembles and MC dropout\u2014are often too computationally expensive to deploy in practice. Evidential Deep Learning (EDL) offers a more efficient alternative, but it requires models to be trained to output evidential quantities from the start, which is rarely true for pretrained networks.To enable EDL-style uncertainty estimation in pretrained models, we propose the Evidential Transformation Network (ETN), a lightweight post-hoc module that converts a pretrained predictor into an evidential model. ETN operates in logit space: it learns a sample-dependent affine transformation of the logits and interprets the transformed outputs as parameters of a Dirichlet distribution for uncertainty estimation.We evaluate ETN on image classification and large language model question-answering benchmarks, under both in-distribution and out-of-distribution settings. ETN consistently improves uncertainty estimation over post-hoc baselines, while preserving accuracy and adding only minimal computational overhead.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37191", "url": null, "sourceid": 34465, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37192, "uid": "66376c393ba5faa7b07d3d4a0c7e3d02", "name": "AE2VID: Event-based Video Reconstruction via Aperture Modulation", "authors": [{"id": 132957, "fullname": "Chenxu Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/132957?format=json", "institution": "Peking University"}, {"id": 126252, "fullname": "Boyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/126252?format=json", "institution": "Peking University"}, {"id": 70351, "fullname": "Peiqi Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/70351?format=json", "institution": "Peking University"}, {"id": 98502, "fullname": "xinyu zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/98502?format=json", "institution": "Peking University"}, {"id": 90744, "fullname": "Hanyue Lou", "url": "http://cvpr.thecvf.com/api/miniconf/users/90744?format=json", "institution": "Peking University"}, {"id": 76401, "fullname": "Boxin Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/76401?format=json", "institution": "Peking University"}], "abstract": "Event-based video reconstruction seeks to recover high-speed, high-dynamic-range videos from event streams. While existing approaches rely exclusively on motion-triggered events, these events are inherently sparse and primarily capture dynamic regions. Therefore, they often suffer from error accumulation and degraded quality in regions with few events. In this work, we introduce aperture-modulation-triggered events as a complementary mechanism to enrich the captured scene information. Specifically, we periodically modulate the aperture to actively generate dense event signals, thereby encoding intensity cues even in static or low-motion regions. Building upon this idea, we design an AE2VID framework that jointly leverages aperture-modulation-triggered and motion-triggered events to enhance the fidelity of predictions. The proposed framework consists of two subnetworks for the dedicated processing of both event types. We further collect a real dataset and validate the effectiveness of our method. Extensive experiments show our superiority over state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37192", "url": null, "sourceid": 45993, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37194, "uid": "0fd0233ea8f93dc84bef4f6b1d4799f3", "name": "Leveraging Class Distributions in CLIP for Weakly Supervised Semantic Segmentation", "authors": [{"id": 107428, "fullname": "Ziqian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107428?format=json", "institution": "Xi&#x27;an Jiaotong-Liverpool University"}, {"id": 107328, "fullname": "Xinqiao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/107328?format=json", "institution": "Xi\u2019an Jiaotong-Liverpool University"}, {"id": 153442, "fullname": "Xiaolei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153442?format=json", "institution": "University of Liverpool"}, {"id": 149721, "fullname": "Quan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/149721?format=json", "institution": "Xi&#x27;an Jiaotong-Liverpool University"}, {"id": 89348, "fullname": "Jimin Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/89348?format=json", "institution": "Xi&#x27;an Jiaotong-Liverpool University"}], "abstract": "Image-level Weakly Supervised Semantic Segmentation (WSSS) typically leverages Class Activation Maps (CAMs) for pixel-wise localization. However, existing CLIP-based methods often yield under-activated CAMs, primarily due to the inaccurate semantic relationships in the affinity-based refinement. In this work, we propose a novel framework, CD-CLIP (Class Distribution based CLIP), which addresses this issue by introducing a Class Distribution Aware (CDA) module. The CDA module captures richer semantic relationships by modeling patch-wise distributions across all classes using Jensen-Shannon divergence, thereby enhancing the completeness of CAMs. While this significantly improves the coverage of the foreground class, the over-activation at class boundaries might also exist due to the comprehensive integration of relationships between inter target classes. To mitigate this adverse effect on segmentation supervision, we introduce a Super-class Boundary Exploration (SBE) module, which leverages structural knowledge of DINO to generate boundary-aware super-class prototype CAMs. By employing the boundary-enhanced loss, our SBE module effectively provides accurate boundary supervision for the final segmentation. Our proposed CD-CLIP framework achieves state-of-the-art performance on both PASCAL VOC and MS COCO benchmarks. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37194", "url": null, "sourceid": 36670, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37199, "uid": "46e8db5f3c47012f108a81f53d640c82", "name": "Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos", "authors": [{"id": 180070, "fullname": "Yicheng Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180070?format=json", "institution": "Peking University"}, {"id": 186905, "fullname": "Wanpeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186905?format=json", "institution": "Peking University"}, {"id": 186906, "fullname": "Ye Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186906?format=json", "institution": "Renmin University of China"}, {"id": 180059, "fullname": "Hao Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/180059?format=json", "institution": "Peking University"}, {"id": 186907, "fullname": "Haoqi Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186907?format=json", "institution": "Peking University"}, {"id": 186908, "fullname": "Sipeng Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186908?format=json", "institution": "BeingBeyond"}, {"id": 87087, "fullname": "Zongqing Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87087?format=json", "institution": "Peking University"}], "abstract": "Vision-Language-Action (VLA) models provide a promising paradigm for robot learning by integrating visual perception with language-guided policy learning. However, most existing approaches rely on 2D visual inputs to perform actions in 3D physical environments, creating a significant gap between perception and action grounding. To bridge this gap, we propose a Spatial-Aware VLA Pretraining paradigm that enables models to acquire 3D spatial understanding before robot policy learning. Starting from pretrained vision-language models, we leverage large-scale human demonstration videos to extract 3D visual and 3D action annotations, forming a new source of supervision that aligns 2D visual observations with 3D spatial reasoning. We instantiate this paradigm with VIPA-VLA, a dual-encoder architecture that incorporates a 3D visual encoder to augment semantic visual representations with 3D-aware features, and aligns the two through visual-physical alignment pretraining. When adapted to downstream robot tasks, VIPA-VLA achieves significantly improved grounding between 2D vision and 3D action, resulting in more robust and generalizable robotic policies.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37199", "url": null, "sourceid": 45931, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37200, "uid": "8c4c4c79d1ee6b6a5ca9ed4b4fa72bf0", "name": "Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers", "authors": [{"id": 103300, "fullname": "Ruidong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/103300?format=json", "institution": "Tianjin University"}, {"id": 186537, "fullname": "Yancheng Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/186537?format=json", "institution": "Alibaba Group"}, {"id": 151630, "fullname": "Xuanpu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151630?format=json", "institution": "Tianjin University"}, {"id": 103451, "fullname": "Jianhao Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/103451?format=json", "institution": "Tianjin University"}, {"id": 130491, "fullname": "Lanjun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130491?format=json", "institution": "Tianjin University"}, {"id": 89585, "fullname": "Dan Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/89585?format=json", "institution": "Tianjin University"}, {"id": 185770, "fullname": "Lei Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/185770?format=json", "institution": "Alibaba Group"}, {"id": 88278, "fullname": "Xiangxiang Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88278?format=json", "institution": "MeiTuan"}, {"id": 126662, "fullname": "An-An Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126662?format=json", "institution": "Tianjin University"}], "abstract": "Region-instructed layout control in text-to-image generation is highly practical, yet existing methods suffer from limitations: (i) training-based approaches inherit data bias and often degrade image quality, and (ii) current techniques struggle with occlusion order, limiting real-world usability. To address these issues, we propose LayerBind. By modeling regional generation as distinct layers and binding them during the generation, our method enables precise regional and occlusion controllability. Our motivation stems from the observation that spatial layout and occlusion are established at a very early denoising stage, suggesting that rearranging the early latent structure is sufficient to modify the final output. Building on this, we structure the scheme into two phases: instance initialization and subsequent semantic nursing. (1) First, leveraging the contextual sharing mechanism in multimodal joint attention, Layer-wise Instance Initialization creates per-instance branches that attend to their own regions while anchoring to the shared background.  At a designated early step, these branches are fused according to the layer order to form a unified latent with a pre-established layout. (2) Then, Layer-wise Semantic Nursing reinforces regional details and maintains the occlusion order via a layer-wise attention enhancement. Specifically, a sequential layered attention path operates alongside the standard global path, with updates composited under a layer-transparency scheduler. LayerBind is training-free and plug-and-play, serving as a regional and occlusion controller across Diffusion Transformers. Beyond generation, it natively supports editable workflows, allowing for flexible modifications like changing instances or rearranging visible orders. Both qualitative and quantitative results demonstrate LayerBind's effectiveness, highlighting its strong potential for creative applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37200", "url": null, "sourceid": 44085, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37202, "uid": "678ef2366ccbd2b30add8cc799fe8a4e", "name": "FloVerse: Floor Plan-Guided Multi-Modal Navigation", "authors": [{"id": 181986, "fullname": "weiqi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181986?format=json", "institution": "Beijing Institute of Technology"}, {"id": 186912, "fullname": "Shuangyi Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186912?format=json", "institution": "Beijing Institute of Technology"}, {"id": 155012, "fullname": "Jiaxin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155012?format=json", "institution": "Beijing Institute of Technology"}, {"id": 186913, "fullname": "YifeiGuo YifeiGuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186913?format=json", "institution": "Beijing Institute of Technology"}, {"id": 76729, "fullname": "Zan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76729?format=json", "institution": "Beijing Institute of Technology"}, {"id": 87190, "fullname": "Wei Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87190?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "Floor plans encapsulate compact spatial priors, enabling agents to navigate unseen scenes more efficiently. While prior work has explored floor plan\u2013guided navigation, it has focused mainly on PointNav and a limited set of environments. To bridge this gap, we introduce FloVerse, a new task for floor plan\u2013guided embodied navigation that unifies PointNav, ObjectNav, and ImageNav. To support this FloVerse, we assemble FloVerse-1.6K, a large-scale dataset of 1.6K scenes from HM3D and Gibson $4$+, paired with corresponding floor plans, comprising 240K expert trajectories and 12M RGBD frames.  We further propose ThreeDiff, a two-stage imitation learning policy consisting of a planner\u2014a diffusion-based multimodal goal-reasoning module trained via masked-modality modeling\u2014and a refiner, a depth-based trajectory refinement module for safe execution. Extensive experiments show that (1) floor plan priors consistently improve navigation performance across all goal modalities, and (2) ThreeDiff implicitly learns to infer goal locations from diverse goal representations through spatial reasoning. These results highlight the effectiveness of structured spatial priors and our unified approach for floor plan\u2013guided embodied navigation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37202", "url": null, "sourceid": 46747, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37203, "uid": "152830abb65020d0f6949cb584760162", "name": "Sparse\u2013View Localization via Online Neural 3D Regression", "authors": [{"id": 181805, "fullname": "Ludvig Dill\u00e9n", "url": "http://cvpr.thecvf.com/api/miniconf/users/181805?format=json", "institution": "Lund University"}, {"id": 131605, "fullname": "Magnus Oskarsson", "url": "http://cvpr.thecvf.com/api/miniconf/users/131605?format=json", "institution": "Lund University"}, {"id": 84646, "fullname": "Viktor Larsson", "url": "http://cvpr.thecvf.com/api/miniconf/users/84646?format=json", "institution": "Lund University"}], "abstract": "We present ON3R, an online-trained neural regressor addressing sparse-view structureless localization, where database images have limited visual overlap and no prebuilt 3D map. Given any sparse matches between a query and a $K$-tuple of posed database views, ON3R predicts 3D coordinates for matched query keypoints, supervised by database reprojection residuals and a monocular depth prior. Afterwards, the absolute pose of the query is estimated via P3P-RANSAC and refined with lightweight bundle adjustment. Across MegaDepth, Cambridge Landmarks, and a sparsified version of Aachen Day-Night, ON3R outperforms existing methods. ON3R is particularly effective when the data is extremely sparse -- we focus on $K\\leq10$ database images. The code, data splits, and SfM models will be made available for full reproducibility.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37203", "url": null, "sourceid": 43669, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37212, "uid": "cb8261616ad295d9ddbf014b8a610148", "name": "PvP: Data-Efficient Humanoid Robot Learning with Proprioceptive-Privileged Contrastive Representations", "authors": [{"id": 177335, "fullname": "Mingqi Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/177335?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 71334, "fullname": "Tao Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/71334?format=json", "institution": "LimX Dynamics"}, {"id": 186934, "fullname": "Haolin Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/186934?format=json", "institution": "University of Science and Technology of China"}, {"id": 85911, "fullname": "Bo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/85911?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 132631, "fullname": "Xin Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/132631?format=json", "institution": "Eastern Institute of Technology, Ningbo"}, {"id": 186935, "fullname": "Hua Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186935?format=json", "institution": "Zhejiang University"}, {"id": 133538, "fullname": "Wenjun Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/133538?format=json", "institution": "Eastern Institute of Technology, Ningbo"}], "abstract": "Achieving efficient and robust whole-body control (WBC) is essential for enabling humanoid robots to perform complex tasks in dynamic environments. Despite the success of reinforcement learning (RL) in this domain, its sample inefficiency remains a significant challenge due to the intricate dynamics and partial observability of humanoid robots. To address this limitation, we propose **PvP**, a **P**roprioceptive-**P**rivileged contrastive learning framework that leverages the intrinsic complementarity between proprioceptive and privileged states. PvP learns compact and task-relevant latent representations without requiring hand-crafted data augmentations, enabling faster and more stable policy learning. To support systematic evaluation, we develop **SRL4Humanoid**, the first unified and modular framework that provides high-quality implementations of representative state representation learning (SRL) methods for humanoid robot learning. Extensive experiments on the LimX Oli robot across velocity tracking and motion imitation tasks demonstrate that PvP significantly improves sample efficiency and final performance compared to baseline SRL methods. Our study further provides practical insights into integrating SRL with RL for humanoid WBC, offering valuable guidance for data-efficient humanoid robot learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37212", "url": null, "sourceid": 36490, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37216, "uid": "0ed92efc84e18342604f8d9e2b1b2496", "name": "Dexterous World Models", "authors": [{"id": 132210, "fullname": "Byungjun Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/132210?format=json", "institution": "Seoul National University"}, {"id": 129941, "fullname": "Taeksoo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/129941?format=json", "institution": "Seoul National University"}, {"id": 181522, "fullname": "Junyoung Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/181522?format=json", "institution": "Seoul National University"}, {"id": 98572, "fullname": "Hanbyul Joo", "url": "http://cvpr.thecvf.com/api/miniconf/users/98572?format=json", "institution": "Seoul National University"}], "abstract": "Recent progress in 3D reconstruction has made it easy to create realistic digital twins from everyday environments. However, current digital twins remain largely static\u2014limited to navigation and view synthesis without embodied interactivity. To bridge this gap, we introduce Dexterous World Model (DWM), an scene-action-conditioned video diffusion model enabling embodied interaction within static 3D scenes. Given a static 3D scene rendering and an egocentric hand motion sequence, DWM generates temporally coherent videos depicting plausible human\u2013scene interactions. Our approach conditions video generation on (1) static scene renderings following a specified camera trajectory to ensure spatial consistency, and (2) egocentric hand mesh renderings that encode both geometry and motion cues in the egocentric view to model action-conditioned dynamics directly. We train our model on a synthetic human\u2013scene interaction dataset and real-world object manipulation dataset, then evaluate it across both synthetic and real-world egocentric benchmarks. Experiments demonstrate that DWM enables realistic, physically grounded interactions, such as grasping, opening, or moving objects, while maintaining camera and scene consistency. This framework establishes the first step toward video diffusion-based interactive digital twins, enabling embodied simulation and 3D scene interactivity from egocentric actions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37216", "url": null, "sourceid": 38425, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37218, "uid": "4eec603b40c78b0390abbd2bec778563", "name": "Recurrent Video Masked Autoencoders", "authors": [{"id": 137195, "fullname": "Daniel Zoran", "url": "http://cvpr.thecvf.com/api/miniconf/users/137195?format=json", "institution": "Google DeepMind"}, {"id": 158935, "fullname": "Nikhil Parthasarathy", "url": "http://cvpr.thecvf.com/api/miniconf/users/158935?format=json", "institution": "Google DeepMind"}, {"id": 129367, "fullname": "Yi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129367?format=json", "institution": "DeepMind"}, {"id": 128540, "fullname": "Drew Hudson", "url": "http://cvpr.thecvf.com/api/miniconf/users/128540?format=json", "institution": "Google DeepMind"}, {"id": 75596, "fullname": "Joao Carreira", "url": "http://cvpr.thecvf.com/api/miniconf/users/75596?format=json", "institution": "DeepMind"}, {"id": 75512, "fullname": "Andrew Zisserman", "url": "http://cvpr.thecvf.com/api/miniconf/users/75512?format=json", "institution": "University of Oxford"}], "abstract": "We present Recurrent Video Masked-Autoencoders (RVM): a novel approach to video representation learning that leverages recurrent computation to model the temporal structure of video data. RVM couples an asymmetric masking objective with a transformer-based recurrent neural network to aggregate information over time, training solely on a simple pixel reconstruction loss. This design yields a highly efficient \"generalist\" encoder: RVM achieves competitive performance with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks like action classification, and point and object tracking, while matching or exceeding the performance of image models (e.g. DINOv2) on tasks that require strong geometric and dense spatial features. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to $30\\times$ greater parameter efficiency than competing video masked autoencoders. Finally, we demonstrate that \\model's recurrent nature allows for stable feature propagation over long temporal horizons with linear computational cost, overcoming some of the limitations of standard spatio-temporal attention-based video models. Ablation studies further highlight the factors driving the model's success, with qualitative results showing that RVM learns rich representations of scene semantics, structure, and motion.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37218", "url": null, "sourceid": 31807, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37219, "uid": "7fbe7a8221c2e4270468a07cbfa77cf6", "name": "HFR and HDR Video from Multi-Attenuated Spikes Using a Rapidly Rotating SpokeND Filter", "authors": [{"id": 77413, "fullname": "Yakun Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77413?format=json", "institution": null}, {"id": 102064, "fullname": "Zhaojun Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/102064?format=json", "institution": "Peking University"}, {"id": 89138, "fullname": "Siqi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89138?format=json", "institution": "Peking University"}, {"id": 129436, "fullname": "Yeliduosi Xiaokaiti", "url": "http://cvpr.thecvf.com/api/miniconf/users/129436?format=json", "institution": "Peking University"}, {"id": 88555, "fullname": "Shikui Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/88555?format=json", "institution": "Beijing jiaotong university"}, {"id": 88385, "fullname": "Yao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88385?format=json", "institution": "Beijing Jiaotong University"}, {"id": 88774, "fullname": "Tiejun Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88774?format=json", "institution": "Peking University"}, {"id": 76401, "fullname": "Boxin Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/76401?format=json", "institution": "Peking University"}], "abstract": "Capturing scenes with both high dynamic range (HDR) and high-speed motion remains challenging for conventional cameras. Existing alternating-exposure approaches exacerbate temporal resolution loss, making them unsuitable for high-speed scenes. Consequently, current solutions typically compromise either spatial resolution through fixed spatial-varying attenuation levels or employ multi-sensor configurations to maintain temporal resolution. In this paper, we leverage an ultra-high speed spike camera to enable spatial and temporal attenuation of incident light, thereby reconstructing high frame rate (HFR) and HDR video with a single sensor. We achieve this by placing a rapidly rotating spoke-pattern neutral density (SpokeND) filter in front of the spike camera, enabling each pixel to periodically capture multi-attenuated spikes while maintaining full spatial resolution. Building on these multi-attenuated spikes, we propose ReST-Net, which comprises the ReGain and ReFine modules. The ReGain module reconstructs spatially consistent frames by learning to recover relative gain from the multi-attenuated spikes, and the ReFine module removes temporal fluctuations to produce temporally consistent HDR videos. Extensive experiments on synthetic and real-world data demonstrate that our method can reconstruct HDR video at up to 2000 FPS.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37219", "url": null, "sourceid": 38176, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37220, "uid": "dcfed44a960873147935b7b25f64f373", "name": "ShowUI-\u03c0: Flow-based Generative Models as GUI Dexterous Hands", "authors": [{"id": 186949, "fullname": "Siyuan Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186949?format=json", "institution": "Nanyang Technological University"}, {"id": 88436, "fullname": "Kevin Qinghong Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/88436?format=json", "institution": "national university of singaore, National University of Singapore"}, {"id": 73557, "fullname": "Mike Zheng Shou", "url": "http://cvpr.thecvf.com/api/miniconf/users/73557?format=json", "institution": "National University of Singapore"}], "abstract": "Building intelligent agents capable of dexterous manipulation is essential for achieving human-like automation in both robotics and digital environments. However, existing GUI agents rely on discrete click predictions (x,y), which prohibits free-form, closed-loop trajectories (e.g. dragging a progress bar) that require continuous, on-the-fly perception and adjustment. In this work, we develop ShowUI-\u03c0, the first flow-based generative model as GUI dexterous hand, featuring the following designs:(i) Unified Discrete\u2013Continuous Actions, integrating discrete clicks and continuous drags within a shared model, enabling flexible adaptation across diverse interaction modes;(ii) Flow-based Action Generation for drag modeling, which predicts incremental cursor adjustments from continuous visual observations via a lightweight action expert, ensuring smooth and stable trajectories;(iii) Drag Training data and Benchmark, where we manually collect and synthesize 20K drag trajectories across five domains (e.g. PowerPoint, Adobe Premiere Pro), and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing GUI agents\u2019 drag capabilities.Our experiments show that proprietary GUI agents still struggle on ScreenDrag (e.g. Operator scores 13.27, and the best Gemini-2.5-CUA reaches 22.18). In contrast, ShowUI-\u03c0 achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach. We hope this work advances GUI agents toward human-like dexterous control in digital world.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37220", "url": null, "sourceid": 45943, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37221, "uid": "1d76ffc82aa2886416809b12d5e65f5b", "name": "AdapTok: Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space", "authors": [{"id": 181488, "fullname": "Yan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181488?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 98340, "fullname": "Changyao TIAN", "url": "http://cvpr.thecvf.com/api/miniconf/users/98340?format=json", "institution": "The Chinese University of Hong Kong, The Chinese University of Hong Kong"}, {"id": 128191, "fullname": "Renqiu Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/128191?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 186950, "fullname": "Ning Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186950?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 186951, "fullname": "Weiwei Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186951?format=json", "institution": "Tongji University"}, {"id": 86526, "fullname": "Hongsheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86526?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 87572, "fullname": "Jifeng Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/87572?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 159452, "fullname": "Hao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/159452?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 128276, "fullname": "Xue Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128276?format=json", "institution": "Shanghai AI Laboratory"}], "abstract": "We propose AdapTok, an adaptive temporal causal video tokenizer that can flexibly allocate tokens for different frames based on video content. AdapTok is equipped with a block-wise masking strategy that randomly drops tail tokens of each block during training, and a block causal scorer to predict the reconstruction quality of video frames using different numbers of tokens. During inference, an adaptive token allocation strategy based on integer linear programming is further proposed to adjust token usage given predicted scores. Such design allows for sample-wise, content-aware, and temporally dynamic token allocation under a controllable overall budget. Extensive experiments for video reconstruction and generation on UCF-101 and Kinetics-600 demonstrate the effectiveness of our approach. Without additional image data, AdapTok consistently improves reconstruction quality and generation performance under different token budgets, allowing for more scalable and token-efficient generative video modeling.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37221", "url": null, "sourceid": 41808, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37222, "uid": "2ad8c5f5cf1914eef5cb64d16d685a73", "name": "PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention", "authors": [{"id": 180115, "fullname": "Hefei Mei", "url": "http://cvpr.thecvf.com/api/miniconf/users/180115?format=json", "institution": "City University of Hong Kong"}, {"id": 155934, "fullname": "Zirui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155934?format=json", "institution": "Dalian University of Technology"}, {"id": 87489, "fullname": "Chang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87489?format=json", "institution": "University of Sydney"}, {"id": 91041, "fullname": "Jianyuan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/91041?format=json", "institution": "University of Sydney"}, {"id": 126313, "fullname": "Minjing Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/126313?format=json", "institution": "City University of Hong Kong"}], "abstract": "Large Vision\u2013Language Models (LVLMs) are foundational to modern multimodal applications, yet their susceptibility to adversarial attacks remains a critical concern. Prior white-box attacks rarely generalize across tasks, and black-box methods depend on expensive transfer, which limits efficiency. The vision encoder, standardized and often shared across LVLMs, provides a stable gray-box pivot with strong cross-model transfer. Building on this premise, we introduce PA-Attack (Prototype-Anchored Attentive Attack). PA-Attack begins with a prototype-anchored guidance that provides a stable attack direction towards a general and dissimilar prototype, tackling the attribute-restricted issue and limited task generalization of vanilla attacks. Building on this, we propose a two-stage attention enhancement mechanism: (i) leverage token\u2011level attention scores to concentrate perturbations on critical visual tokens, and (ii) adaptively recalibrate attention weights to track the evolving attention during the adversarial process. Extensive experiments across diverse downstream tasks and LVLM architectures show that PA\u2011Attack achieves an average 75.1% score reduction rate (SRR), demonstrating strong attack effectiveness, efficiency, and task generalization in LVLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37222", "url": null, "sourceid": 33315, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37223, "uid": "d4a938257c3ae3e323c94797f75a0524", "name": "SAR2Net: Learning Spatially Anchored Representations for Retrieval-Guided Cross-Stain Alignment", "authors": [{"id": 181874, "fullname": "Tianle Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/181874?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 186952, "fullname": "Fang Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186952?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 91449, "fullname": "Xiaofan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91449?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Achieving spatial alignment between whole-slide images (WSIs) across stains remains highly challenging due to extreme resolution, tissue fragmentation, and large nonlinear deformations. Conventional registration pipelines depend on global pre-alignment and spatial consistency, which often collapse under such distortions. We present SAR2Net, a framework that learns spatially anchored representations and reformulates cross-stain alignment as a region-level feature retrieval problem. Instead of estimating explicit transformations, SAR2Net learns pointwise representations encoding the relative spatial relationships to tissue landmarks. Given landmarks and arbitrary coordinates, it predicts spatially anchored features that serve as deformation-invariant descriptors of tissue topology. A multi-stage retrieval framework then establishes correspondences between slides, even when global alignment is infeasible. Experiments on biopsy-oriented HE-IHC datasets show that SAR2Net achieves robust region-level alignment under severe tissue distortions, outperforming previous registration methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37223", "url": null, "sourceid": 39583, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37224, "uid": "3a1d84f752947c47e87f7cfdc42a63b0", "name": "POLAR: A Portrait OLAT Dataset and Generative Framework for Illumination-Aware Face Modeling", "authors": [{"id": 129691, "fullname": "Zhuo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/129691?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 181106, "fullname": "Chengqun Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181106?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 86439, "fullname": "Zhuo Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/86439?format=json", "institution": "ByteDance"}, {"id": 152756, "fullname": "Zheng Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/152756?format=json", "institution": null}, {"id": 179861, "fullname": "Jingnan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/179861?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 186953, "fullname": "Xiaoyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186953?format=json", "institution": "ByteDance Inc."}, {"id": 86696, "fullname": "Xiaokang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86696?format=json", "institution": "Shanghai Jiao Tong University, China"}, {"id": 91466, "fullname": "Yichao Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/91466?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Face relighting aims to synthesize realistic portraits under novel illumination while preserving identity and geometry. However, progress remains constrained by the limited availability of large-scale, physically consistent illumination data. To address this, we introduce POLAR, a large-scale and physically calibrated One-Light-at-a-Time (OLAT) dataset containing over 200 subjects captured under 156 lighting directions, multiple views, and diverse expressions. Building upon POLAR, we develop a flow-based generative model POLARNet that predicts per-light OLAT responses from a single portrait, capturing fine-grained and direction-aware illumination effects while preserving facial identity. Unlike diffusion or background-conditioned methods that rely on statistical or contextual cues, our formulation models illumination as a continuous, physically interpretable transformation between lighting states, enabling scalable and controllable relighting. Together, POLAR and POLARNet form a unified illumination learning framework that links real data, generative synthesis, and physically grounded relighting, establishing a self-sustaining \u201cchicken-and-egg\u2019\u2019 cycle for scalable and reproducible portrait illumination.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37224", "url": null, "sourceid": 32189, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40290?format=json"], "related_events_ids": [40290]}, {"id": 40290, "uid": "3a1d84f752947c47e87f7cfdc42a63b0", "name": "POLAR: A Portrait OLAT Dataset and Generative Framework for Illumination-Aware Face Modeling", "authors": [{"id": 129691, "fullname": "Zhuo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/129691?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 181106, "fullname": "Chengqun Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181106?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 86439, "fullname": "Zhuo Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/86439?format=json", "institution": "ByteDance"}, {"id": 152756, "fullname": "Zheng Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/152756?format=json", "institution": null}, {"id": 179861, "fullname": "Jingnan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/179861?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 186953, "fullname": "Xiaoyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186953?format=json", "institution": "ByteDance Inc."}, {"id": 86696, "fullname": "Xiaokang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86696?format=json", "institution": "Shanghai Jiao Tong University, China"}, {"id": 91466, "fullname": "Yichao Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/91466?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Face relighting aims to synthesize realistic portraits under novel illumination while preserving identity and geometry. However, progress remains constrained by the limited availability of large-scale, physically consistent illumination data. To address this, we introduce POLAR, a large-scale and physically calibrated One-Light-at-a-Time (OLAT) dataset containing over 200 subjects captured under 156 lighting directions, multiple views, and diverse expressions. Building upon POLAR, we develop a flow-based generative model POLARNet that predicts per-light OLAT responses from a single portrait, capturing fine-grained and direction-aware illumination effects while preserving facial identity. Unlike diffusion or background-conditioned methods that rely on statistical or contextual cues, our formulation models illumination as a continuous, physically interpretable transformation between lighting states, enabling scalable and controllable relighting. Together, POLAR and POLARNet form a unified illumination learning framework that links real data, generative synthesis, and physically grounded relighting, establishing a self-sustaining \u201cchicken-and-egg\u2019\u2019 cycle for scalable and reproducible portrait illumination.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40290", "url": null, "sourceid": -32189, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37224?format=json"], "related_events_ids": [37224]}, {"id": 37228, "uid": "428ccdcc8eaf556ba79854c22772135c", "name": "Prospective Dynamic 3D MRI Reconstruction via Latent-Space Motion Tracking from Single Measurement", "authors": [{"id": 183598, "fullname": "Lixuan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/183598?format=json", "institution": "University of Michigan - Ann Arbor"}, {"id": 186956, "fullname": "Zhongnan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186956?format=json", "institution": "University of Michigan - Ann Arbor"}, {"id": 176870, "fullname": "Jesse Hamilton", "url": "http://cvpr.thecvf.com/api/miniconf/users/176870?format=json", "institution": "University of Michigan"}, {"id": 176770, "fullname": "James Balter", "url": "http://cvpr.thecvf.com/api/miniconf/users/176770?format=json", "institution": "University of Michigan"}, {"id": 85852, "fullname": "Jeong Joon Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/85852?format=json", "institution": "Stanford University"}, {"id": 186957, "fullname": "Liyue Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186957?format=json", "institution": "University of Michigan - Ann Arbor"}], "abstract": "Prospective reconstruction is crucial in many clinical applications such as MRI-guided radiotherapy, which demands accurate image reconstruction and fast motion estimation from currently acquired measurements. However, prospective reconstruction remains challenging due to ultra-sparse sampling and stringent latency requirements. In this work, we propose PDMR, an Prospective Dynamic 3D MRI Reconstruction framework with latent-space motion tracking. Our core idea is to learn an efficient and generalizable latent manifold of motion fields offline, enabling rapid online adaptation for prospective reconstruction. Specifically, we parameterize the deformation vector fields (DVFs) on a low-dimensional manifold, effectively reducing the search space for fast online adaptation, and employ a tri-plane representation to achieve geometry-aware and memory-efficient encoding of 3D motion. Experiments on both XCAT digital phantoms and in-house abdominal MRI datasets demonstrate that PDMR achieves high-fidelity and temporally consistent reconstruction across multiple prospective scenarios (Immediate and After-2min), outperforming state-of-the-art retrospective and online methods. Our results suggest a promising pathway toward ultra-fast, motion-aware prospective MRI reconstruction in clinical practice.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37228", "url": null, "sourceid": 33650, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37230, "uid": "5606c4382f50a2c86fa0108c46c9fd32", "name": "Domain-Skewed Federated Learning with Feature Decoupling and Calibration", "authors": [{"id": 180316, "fullname": "Huan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180316?format=json", "institution": "University of Wollongong"}, {"id": 149020, "fullname": "Jun Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/149020?format=json", "institution": "University of Wollongong"}, {"id": 186961, "fullname": "Jun Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186961?format=json", "institution": "University of Wollongong"}, {"id": 89120, "fullname": "Guansong Pang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89120?format=json", "institution": "Singapore Management University"}], "abstract": "Federated learning (FL) allows distributed clients to collaboratively train a global model in a privacy-preserving manner. However, one major challenge is domain skew, where clients' data originating from diverse domains may hinder the aggregated global model from learning a consistent representation space, resulting in poor generalizable ability in multiple domains. In this paper, we argue that the domain skew is reflected in the domain-specific biased features of each client, causing the local model's representations to collapse into a narrow low-dimensional subspace. We then propose Federated Feature Decoupling and Calibration (**$F^2$DC**), which liberates valuable class-relevant information by calibrating the domain-specific biased features, enabling more consistent representations across domains. A novel component, Domain Feature Decoupler (DFD), is first introduced in $F^2$DC to determine the robustness of each feature unit, thereby separating the local features into domain-robust features and domain-related features. A Domain Feature Corrector (DFC) is further proposed to calibrate these domain-related features by explicitly linking discriminative signals, capturing additional class-relevant clues that complement the domain-robust features. Finally, a domain-aware aggregation of the local models is performed to promote consensus among clients. Empirical results on three popular multi-domain datasets demonstrate the effectiveness of the proposed $F^2$DC and the contributions of its two modules.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37230", "url": null, "sourceid": 38997, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37235, "uid": "21dd2bb7a4d988443cb303e42afb8316", "name": "Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models", "authors": [{"id": 166210, "fullname": "Yuehao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/166210?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 186981, "fullname": "Shanyan Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186981?format=json", "institution": "vivo"}, {"id": 180356, "fullname": "Weijia Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180356?format=json", "institution": "Shanghai Jiao Tong University / Alibaba"}, {"id": 186982, "fullname": "Xuanming Shang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186982?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 186983, "fullname": "Yanhao Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/186983?format=json", "institution": "Future Imaging Area"}, {"id": 186984, "fullname": "Wei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186984?format=json", "institution": "vivo; Huawei Technologies Ltd."}, {"id": 86334, "fullname": "Chao Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/86334?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Continual learning in multimodal large language models (MLLMs) aims to sequentially acquire knowledge while mitigating catastrophic forgetting, yet existing methods face inherent limitations: architecture-based approaches incur additional computational overhead and often generalize poorly to new tasks, rehearsal-based methods rely on storing historical data, raising privacy and storage concerns, and conventional regularization-based strategies alone are insufficient to fully prevent parameter interference. We propose \\our, a two-stage continual learning framework based on History-Free Gradient Orthogonalization (HiFGO), which enforces gradient-level orthogonality without historical task data. The two-stage finetuning strategy decouples task adaptation from regularization, achieving a principled balance between plasticity and stability. Experiments on UCIT~\\cite{guo2025hide} show that \\our~establishes state-of-the-art performance, surpassing prior SOTA by 2.14\\% and 6.82\\% in terms of \\textit{Avg} and \\textit{Last}.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37235", "url": null, "sourceid": 42020, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37237, "uid": "97c216cb25ce4c47de15d030c76fed39", "name": "Next-Scale Prediction: A Self-Supervised Approach for Real-World Image Denoising", "authors": [{"id": 186987, "fullname": "Yiwen Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186987?format=json", "institution": "Sichuan University"}, {"id": 76353, "fullname": "Haiyu Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/76353?format=json", "institution": "Sichuan University"}, {"id": 86154, "fullname": "Peng Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86154?format=json", "institution": "Sichuan University"}, {"id": 86197, "fullname": "Xi Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86197?format=json", "institution": "Sichuan University"}, {"id": 76359, "fullname": "Yuanbiao Gou", "url": "http://cvpr.thecvf.com/api/miniconf/users/76359?format=json", "institution": "Sichuan University"}], "abstract": "Self-supervised real-world image denoising remains a fundamental challenge, arising from the antagonistic trade-off between decorrelating spatially structured noise and preserving high-frequency details. Existing blind-spot network (BSN) methods rely on pixel-shuffle downsampling (PD) to decorrelate noise, but aggressive downsampling fragments fine structures, while milder downsampling fails to remove correlated noise. To address this, we introduce Next-Scale Prediction (NSP), a novel self-supervised paradigm that decouples noise decorrelation from detail preservation. NSP constructs cross-scale training pairs, where BSN takes low-resolution, fully decorrelated sub-images as input to predict high-resolution targets that retain fine details. As a by-product, NSP naturally supports super-resolution of noisy images without retraining or modification. Extensive experiments demonstrate that NSP achieves state-of-the-art self-supervised denoising performance on real-world benchmarks, significantly alleviating the long-standing conflict between noise decorrelation and detail preservation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37237", "url": null, "sourceid": 38731, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37240, "uid": "f5e2e5bc65bce4f8d1c7c67acc428f8f", "name": "MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding", "authors": [{"id": 181477, "fullname": "Fan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181477?format=json", "institution": "Harbin Institute of Technology"}, {"id": 89415, "fullname": "Xingping Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/89415?format=json", "institution": "Inception Institute of Artificial Intelligence"}, {"id": 183016, "fullname": "Xin Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183016?format=json", "institution": "Adelaide University"}, {"id": 86370, "fullname": "Wenhan Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/86370?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 86172, "fullname": "Wei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86172?format=json", "institution": "Tencent AI Lab"}, {"id": 86350, "fullname": "Kaihao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86350?format=json", "institution": "Australian National University"}], "abstract": "Understanding high-resolution images remains a significant challenge for multimodal large language models (MLLMs). Recent study address this issue by dividing the image into smaller crops and computing the semantic similarity between each crop and a query using a pretrained retrieval-augmented generation (RAG) model. The most relevant crops are then selected to localize the target object and suppress irrelevant information. However, such crop-based processing can fragment complete objects across multiple crops, thereby disrupting the computation of semantic similarity. In our experiments, we find that image crops of objects with different sizes are better handled at different resolutions. Based on this observation, we propose \\textbf{\\textit{Multi-resolution Retrieval-Detection (MRD)}}, a  training-free framework for high-resolution image understanding. To address the issue of semantic similarity bias caused by objects being split across different image crops, we propose a multi-resolution semantic fusion method, which integrates semantic similarity maps obtained at different resolutions to produce more accurate semantic information and preserve the integrity of target objects. Furthermore, to achieve direct localization of target objects at a global scale, we introduce an open-vocalbulary object detection (OVD) model that identifies object regions using a sliding-window approach.Experiments on high-resolution image understanding benchmarks using different MLLMs demonstrate the effectiveness of our approach.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37240", "url": null, "sourceid": 33013, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37239, "uid": "fae30afa0073712435e3302fa439ad4c", "name": "BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections", "authors": [{"id": 102554, "fullname": "Subin Varghese", "url": "http://cvpr.thecvf.com/api/miniconf/users/102554?format=json", "institution": "University of Houston"}, {"id": 186992, "fullname": "Joshua Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186992?format=json", "institution": "University of Houston"}, {"id": 176299, "fullname": "Asad Ur Rahman", "url": "http://cvpr.thecvf.com/api/miniconf/users/176299?format=json", "institution": "University of Houston"}, {"id": 186993, "fullname": "Vedhus Hoskere", "url": "http://cvpr.thecvf.com/api/miniconf/users/186993?format=json", "institution": "University of Houston"}], "abstract": "Deploying embodied agents that can answer questions about their surroundings in realistic real-world settings remains difficult, partly due to the scarcity of benchmarks that faithfully capture practical operating conditions. We propose infrastructure inspection as a compelling domain for open-vocabulary Embodied Question Answering (EQA): it naturally demands multi-scale reasoning, long-range spatial understanding, and complex semantic relationships, while offering unique evaluation advantages via standardized National Bridge Inventory (NBI) condition ratings (0-9), professional inspection reports, and egocentric imagery.We introduce BridgeEQA, a benchmark of 2,200 open-vocabulary question-answer pairs (in the style of OpenEQA) grounded in professional inspection reports across 200 real-world bridge scenes with 47.93 images on average per scene. Questions require synthesizing visual evidence across multiple images and aligning responses with NBI condition ratings. We further propose a new EQA metric Image Citation Relevance to evaluate the ability of a model to cite relevant images.Evaluations of state-of-the-art vision-language models reveal substantial performance gaps under episodic memory EQA settings. To address this, we propose Embodied Memory Visual Reasoning (EMVR), which formulates inspection as sequential navigation over an image-based scene graph: images are nodes, and an agent takes actions to traverse views, compare evidence, and reason within a Markov decision process. EMVR shows strong performance over the baselines. We publicly release both the dataset and code.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37239", "url": null, "sourceid": 35107, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37241, "uid": "f8b146b1264be1257e65f518de82a372", "name": "E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction", "authors": [{"id": 182567, "fullname": "Yunsoo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/182567?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 130030, "fullname": "Changki Sung", "url": "http://cvpr.thecvf.com/api/miniconf/users/130030?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 155447, "fullname": "Dasol Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/155447?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 130034, "fullname": "Hyun Myung", "url": "http://cvpr.thecvf.com/api/miniconf/users/130034?format=json", "institution": "KAIST"}], "abstract": "The emergence of neural radiance fields (NeRF) and 3D Gaussian splatting (3DGS) has advanced novel view synthesis (NVS). These methods, however, require high-quality RGB inputs and accurate corresponding poses, limiting robustness under real-world conditions such as fast camera motion or adverse lighting. Event cameras, which capture brightness changes at each pixel with high temporal resolution and wide dynamic range, enable precise sensing of dynamic scenes and offer a promising solution. However, existing event-based NVS methods still rely on known poses or depend on depth estimation models and auxiliary modalities such as RGB-D. We present E2EGS, a pose-free framework operating solely on event streams. Our key insight is that edge information provides rich structural cues essential for accurate trajectory estimation and high-quality NVS. To extract edges from noisy event streams, we exploit the distinct spatio-temporal characteristics of edges and non-edge regions. The event camera's movement induces consistent events along edges, while non-edge regions produce sparse noise. We leverage this through a patch-based temporal coherence analysis that measures local variance to extract edges while robustly suppressing noise. The extracted edges guide structure-aware Gaussian initialization and enable edge-weighted losses throughout initialization, tracking, and bundle adjustment. Extensive experiments on both synthetic and real datasets demonstrate that E2EGS achieves superior reconstruction quality and trajectory accuracy, establishing a fully pose-free paradigm for event-based 3D reconstruction.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37241", "url": null, "sourceid": 36384, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37242, "uid": "85387a78160ec385df3dc4a589d62dd7", "name": "Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods", "authors": [{"id": 176710, "fullname": "Omer Ben Hayun", "url": "http://cvpr.thecvf.com/api/miniconf/users/176710?format=json", "institution": "Technion - Israel Institute of Technology"}, {"id": 180248, "fullname": "Roy Betser", "url": "http://cvpr.thecvf.com/api/miniconf/users/180248?format=json", "institution": "Technion - Israel Institute of Technology, Technion; Fujitsu Research of Europe"}, {"id": 186994, "fullname": "Meir Yossef Levi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186994?format=json", "institution": "Technion - Israel Institute of Technology, Technion - Israel Institute of Technology"}, {"id": 186995, "fullname": "Levi Kassel", "url": "http://cvpr.thecvf.com/api/miniconf/users/186995?format=json", "institution": "Technion - Israel Institute of Technology, Technion - Israel Institute of Technology"}, {"id": 77507, "fullname": "Guy Gilboa", "url": "http://cvpr.thecvf.com/api/miniconf/users/77507?format=json", "institution": "Technion"}], "abstract": "Following major advances in text and image generation, the video domain has surged, producing highly realistic and controllable sequences that transform creative workflows.Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial.Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors generalize poorly to unseen generators, a critical drawback given the rapid emergence of new models.These challenges motivate zero-shot approaches, which avoid synthetic data and instead score content against real-data statistics, enabling training-free, model-agnostic detection.We introduce STALL, a simple, training-free, theoretically justified detector that provides likelihood-based scoring for videos, jointly modeling spatial and temporal evidence within a probabilistic framework.Across two public benchmarks including 20 generative models, STALL consistently outperforms prior image- and video-based baselines.To further test generalization, we curate ComGenVid, a new benchmark featuring state-of-the-art models (Sora and Veo-3), on which STALL demonstrates consistent and robust results.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37242", "url": null, "sourceid": 40753, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40324, "uid": "c029d9d4b124cef75291bea08bd390c5", "name": "Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes", "authors": [{"id": 180903, "fullname": "Changqing Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/180903?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 145274, "fullname": "Yueru Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/145274?format=json", "institution": "Beijing University of Post and Telecommunications; The Chinese University of Hong Kong"}, {"id": 188912, "fullname": "Han Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188912?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 188913, "fullname": "Zeyu Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188913?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 187943, "fullname": "Changhao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187943?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}], "abstract": "Open-vocabulary 3D occupancy is vital for embodied agents, which need to understand complex indoor environments where semantic categories are abundant and evolve beyond fixed taxonomies. While recent work has explored open-vocabulary occupancy in outdoor driving scenarios, such methods transfer poorly indoors, where geometry is denser, layouts are more intricate, and semantics are far more fine-grained. To address these challenges, we adopt a geometry-only supervision paradigm that uses only binary occupancy labels (occupied vs. free). Our framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding. On the geometry side, we find that existing Gaussian-to-Occupancy operators fail to converge under such weak supervision, and we introduce an opacity-aware, Poisson-based approach that stabilizes volumetric aggregation. On the semantic side, direct alignment between rendered features and open-vocabulary segmentation features suffers from feature mixing; we therefore propose a Progressive Temperature Decay schedule that gradually sharpens opacities during splatting, strengthening Gaussian\u2013language alignment. On Occ-ScanNet, our framework achieves 59.50 IoU and 21.05 mIoU in the open-vocabulary setting, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by a large margin in mIoU. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40324", "url": null, "sourceid": -40417, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38046?format=json"], "related_events_ids": [38046]}, {"id": 37249, "uid": "c04ee68bad31e814e67c402f57dbf1d2", "name": "Vocabulary Scaling Law : Tuning Open-vocabulary Predictors for Their Openness", "authors": [{"id": 73559, "fullname": "Ziliang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/73559?format=json", "institution": "Pengcheng Laboratory"}, {"id": 187009, "fullname": "Yulu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187009?format=json", "institution": null}, {"id": 187010, "fullname": "Liangda Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187010?format=json", "institution": "Jinan University"}, {"id": 147015, "fullname": "jusheng zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/147015?format=json", "institution": "National University of Singapore; SUN YAT-SEN UNIVERSITY"}, {"id": 153473, "fullname": "Yongsen Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/153473?format=json", "institution": "Nanyang Technological University"}, {"id": 154194, "fullname": "Quanlong Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/154194?format=json", "institution": "Jinan University"}, {"id": 187011, "fullname": "Xipeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187011?format=json", "institution": "Pengcheng Laboratory"}], "abstract": "Open-vocabulary learning on CLIP provides remarkable generalization on diverse concepts, however, falters under the realistic streaming open-world evaluations for Stability against distractor classes and Extensibility to novel classes. Current fine-tuning methods often fail these tests since they are mainly designed for closed-set conditions, leading to the performance gaps while the target vocabulary progressively scales. We formalize a ``vocabulary scaling law'' showing that these openness measures can be lower-bounded by performance on the full class-name universe, implying that robust fine-tuning should: (i) account for the entire vocabulary, (ii) tune class-name embeddings rather than context, and (iii) enforce orthogonality between prompt embeddings including training and open-set class names. Guided by our analysis, we propose Submodular-Vocabulary Fine-tuning (SVFT), a bi-level optimization framework that approximates the intractable objective of tuning all class name embedding by greedily selecting a small, informative subset of class names via constrained submodular maximization, thus, allows the employment of efficient greedy algorithm for the near-optimal class-name subset selection to fine-tune CLIP instead of using all open classes. Across extensive experiments, SVFT consistently improves both stability and extensibility, advancing the openness and practical robustness of CLIP-based vision\u2013language models.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37249", "url": null, "sourceid": 33975, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37250, "uid": "4125b4e94852e1a68b609205afc1f5f7", "name": "ActiveAD: Planning-Oriented Active Learning for End-to-End Autonomous Driving", "authors": [{"id": 88143, "fullname": "Han Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88143?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 75577, "fullname": "Xiaosong Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/75577?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 88188, "fullname": "Yichen Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/88188?format=json", "institution": "University of California, Berkeley"}, {"id": 187012, "fullname": "Siyu Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/187012?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 153208, "fullname": "Wenlong Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153208?format=json", "institution": "COWAROBOT"}, {"id": 86696, "fullname": "Xiaokang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86696?format=json", "institution": "Shanghai Jiao Tong University, China"}, {"id": 88195, "fullname": "Junchi Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88195?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "End-to-end differentiable learning has emerged as a prominent paradigm in autonomous driving (AD). A significant bottleneck in this approach is its substantial demand for high-quality labeled data, such as 3D bounding boxes and semantic segmentation, which are especially expensive to annotate manually. This challenge is exacerbated by the long tailed distribution in AD datasets, where a substantial portion of the collected data might be trivial (e.g. simply driving straight on a straight road) and only a minority of instances are critical to safety. In this paper, we propose ActiveAD, a planning-oriented active learning strategy designed to enhance sampling and labeling efficiency in end-to-end autonomous driving. ActiveAD progressively annotates parts of collected raw data based on our newly developed metrics. We design innovative diversity metrics to enhance initial sample selection, addressing the cold-start problem. Furthermore, we develop uncertainty metrics to select valuable samples for the ultimate purpose of route planning during subsequent batch selection. Empirical results demonstrate that our approach significantly surpasses traditional active learning methods. Remarkably, our method achieves comparable results to state-of-the-art end-to-end AD methods - by using only 30% data in both open-loop nuScenes and closed-loop CARLA evaluation.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37250", "url": null, "sourceid": 40251, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37251, "uid": "46aced6c94e6edd349cd1e5ce4435c15", "name": "Missing No More: Dictionary-Guided Cross-Modal Image Fusion under Missing Infrared", "authors": [{"id": 127590, "fullname": "Yafei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127590?format=json", "institution": "Kunmimg University of Science and Technology"}, {"id": 182660, "fullname": "Meng Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/182660?format=json", "institution": "Kunmimg University of Science and Technology"}, {"id": 127563, "fullname": "Huafeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/127563?format=json", "institution": "Kunmimg University of Science and Technology"}, {"id": 187013, "fullname": "Yu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187013?format=json", "institution": "Hefei University of Technology"}], "abstract": "Infrared\u2013visible (IR\u2013VIS) image fusion is vital for perception and security, yet most methods rely on the availability of both modalities during training and inference. When the infrared modality is absent, pixel-space generative substitutes become hard to control and inherently lack interpretability. We address missing-IR fusion by proposing a dictionary-guided, coefficient-domain framework built upon a shared convolutional dictionary. The pipeline comprises three key components: (1) Joint Shared-dictionary Representation Learning (JSRL) learns a unified and interpretable atom space shared by both IR and VIS modalities; (2) VIS-Guided IR Inference (VGII) transfers VIS coefficients to pseudo-IR coefficients in the coefficient domain and performs a one-step closed-loop refinement guided by a frozen large language model as a weak semantic prior; and (3) Adaptive Fusion via Representation Inference (AFRI) merges VIS structures and inferred IR cues at the atom level through window attention and convolutional mixing, followed by reconstruction with the shared dictionary. This  \\emph{encode$\\rightarrow$transfer$\\rightarrow$fuse$\\rightarrow$reconstruct} pipeline avoids uncontrolled pixel-space generation while ensuring prior preservation within interpretable dictionary\u2013coefficient representation. Experiments under missing-IR settings demonstrate consistent improvements in perceptual quality and downstream detection performance. To our knowledge, this represents the first framework that jointly learns a shared dictionary and performs coefficient-domain inference\u2013fusion to tackle missing-IR fusion.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37251", "url": null, "sourceid": 31916, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37252, "uid": "92ef030b5b129817c1b812c22f4ce721", "name": "SimLBR: Learning to Detect Fake Images by Learning to Detect Real Images", "authors": [{"id": 133667, "fullname": "Aayush Dhakal", "url": "http://cvpr.thecvf.com/api/miniconf/users/133667?format=json", "institution": "Washington University in St. Louis"}, {"id": 107003, "fullname": "Subash Khanal", "url": "http://cvpr.thecvf.com/api/miniconf/users/107003?format=json", "institution": "Washington University in St Louis"}, {"id": 154647, "fullname": "Srikumar Sastry", "url": "http://cvpr.thecvf.com/api/miniconf/users/154647?format=json", "institution": "Washington University in St Louis"}, {"id": 176816, "fullname": "Jacob Arndt", "url": "http://cvpr.thecvf.com/api/miniconf/users/176816?format=json", "institution": "Oak Ridge National Laboratory"}, {"id": 140667, "fullname": "Philipe Ambrozio Dias", "url": "http://cvpr.thecvf.com/api/miniconf/users/140667?format=json", "institution": "Oak Ridge National Laboratory"}, {"id": 106955, "fullname": "Dalton Lunga", "url": "http://cvpr.thecvf.com/api/miniconf/users/106955?format=json", "institution": "Oak Ridge National Laboratory"}, {"id": 75557, "fullname": "Nathan Jacobs", "url": "http://cvpr.thecvf.com/api/miniconf/users/75557?format=json", "institution": "Washington University in St. Louis"}], "abstract": "The rapid advancement of generative models has made the detection of AI-generated images a critical challenge for both research and society. Recent works have shown that most state-of-the-art fake image detection methods overfit to their training data and catastrophically fail when evaluated on curated hard test sets with strong distribution shifts. In this work, we argue that it is more principled to learn a tight decision boundary around the real image distribution and treat the fake category as a sink class. To this end, we propose SimLBR, a simple and efficient framework for fake image detection using Latent Blending Regularization (LBR). Our method significantly improves cross-generator generalization, achieving up to +24.85\\% accuracy and +69.62\\% recall on the challenging Chameleon benchmark. SimLBR is also highly efficient, training orders of magnitude faster than existing approaches. Furthermore, we emphasize the need for reliability-oriented evaluation in fake image detection, introducing risk-adjusted metrics and worst-case estimates to better assess model robustness. All code and models will be released on HuggingFace and GitHub.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37252", "url": null, "sourceid": 38838, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37254, "uid": "63989c8f73f5b989efb7af037283c179", "name": "Perception Characteristics Distance: Measuring Stability and Robustness of Perception System in Dynamic Conditions under a Certain Decision Rule", "authors": [{"id": 179923, "fullname": "Boyu Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/179923?format=json", "institution": "Virginia Tech"}, {"id": 134291, "fullname": "Liang Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/134291?format=json", "institution": "Virginia Polytechnic Institute and State University"}, {"id": 187016, "fullname": "Zhengzhi Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/187016?format=json", "institution": "Digiworld Ventures"}, {"id": 187017, "fullname": "Lanxin Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187017?format=json", "institution": "Virginia Polytechnic Institute and State University"}, {"id": 178506, "fullname": "Loren Stowe", "url": "http://cvpr.thecvf.com/api/miniconf/users/178506?format=json", "institution": "Virginia Tech Transportation Institute"}, {"id": 136109, "fullname": "Feng Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/136109?format=json", "institution": "Virginia Tech"}], "abstract": "The safety of autonomous driving systems (ADS) depends on accurate perception across distance and driving conditions. The outputs of AI perception algorithm are stochastic, which have a major impact on decision making and safety outcomes, including time-to-collision estimation. However, current perception evaluation metrics do not reflect the stochastic nature of perception algorithms. We introduce the Perception Characteristics Distance (PCD), a novel metric incorporating model output uncertainty as represented by the farthest distance at which an object can be reliably detected. To represent a system\u2019s overall perception capability in terms of reliable detection distance, we used the averaging PCD values across multiple detection quality and probabilistic thresholds produces the average PCD (aPCD). For empirical validation, we present the SensorRainFall dataset, collected on the Virginia Smart Road using a sensor-equipped vehicle (cameras, radar, and LiDAR) controlled under different weather (clear and rainy) and illumination conditions (daylight, streetlight, and nighttime). The dataset includes ground-truth distances, bounding boxes, and segmentation masks for target objects. Experiments with state-of-the-art models show that aPCD captures meaningful differences across weather, daylight, and illumination conditions, which traditional evaluation metrics fail to reflect. PCD provides an uncertainty-aware measure of perception performance, supporting safer and more robust ADS operation, while the SensorRainFall dataset offers a valuable benchmark for evaluation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37254", "url": null, "sourceid": 41441, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37257, "uid": "776174b25a2867e6384f5cb3e00fd8c9", "name": "Beyond Scanpaths: Graph-Based Gaze Simulation in Dynamic Scenes", "authors": [{"id": 183951, "fullname": "Luke Palmer", "url": "http://cvpr.thecvf.com/api/miniconf/users/183951?format=json", "institution": "GlimspeML"}, {"id": 176071, "fullname": "Petar Palasek", "url": "http://cvpr.thecvf.com/api/miniconf/users/176071?format=json", "institution": "GlimpseML"}, {"id": 180250, "fullname": "Hazem Abdelkawy", "url": "http://cvpr.thecvf.com/api/miniconf/users/180250?format=json", "institution": "Toyota Motors Europe"}], "abstract": "Accurately modelling human attention is essential for numerous computer vision applications, particularly in the domain of automotive safety. Existing methods often collapse gaze into scanpaths or saliency maps, overlooking the dynamics of natural eye movements and introducing artefacts into training data. We instead propose a dynamical systems approach that treats gaze as an active agent interacting with its environment, enabling the simulation of raw, continuous gaze trajectories. In our approach, driving scenes are represented as gaze-centric spatiotemporal graphs processed by the Affinity Relation Transformer (ART), a heterogeneous graph transformer that models interactions between driver gaze and surrounding traffic objects and road structure. We further introduce an Object Density Network (ODN) to predict next-step gaze distributions, capturing the stochastic, object-centric nature of attentional shifts in complex environments. To support this research, we also present Focus100, a new dataset of gaze recordings from 30 participants viewing ego-centric driving footage. Trained directly on raw gaze, without any fixation filtering, our unified approach produces more natural gaze timeseries, scanpath dynamics, and saliency maps than existing attention estimation methods, offering valuable insights for the temporal modelling of human attention and automotive safety.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37257", "url": null, "sourceid": 43737, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38050, "uid": "2d1838bd5b731e64a54439dac82b3a4e", "name": "TF-CADE: Foreground-Concentrated Text-Video Alignment for Zero-Shot Temporal Action Detection", "authors": [{"id": 147197, "fullname": "Yearang Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/147197?format=json", "institution": "Korea University"}, {"id": 107331, "fullname": "Ho-Joong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/107331?format=json", "institution": "Korea University"}, {"id": 130510, "fullname": "Seong-Whan Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/130510?format=json", "institution": "Korea University"}], "abstract": "Zero-Shot Temporal Action Detection (ZSTAD) aims to localize and recognize action instances from unseen action categories in untrimmed videos. Although existing methods have shown effectiveness by advancing architectural text-video alignment, they still struggle with capturing semantic distinctions between action classes, resulting in text-irrelevant predictions.To address this issue, we propose a Text-Foreground Concentrated Alignment for zero-shot temporal action DEtector (TF-CADE) that explicitly aligns textual information with action-relevant foreground regions.Specifically, we introduce Action Concentrate Aggregation (ACA), which extracts action concentrate scores to aggregate temporally informative video segments into a foreground-weighted video embedding.This foreground concentrated alignment enhances the semantic consistency between text and video features and improves inter-class discriminability.In addition, a Certainty-based Confidence Re-weighting (CCR) strategy refines per-snippet confidence scores by leveraging foreground-aware similarity, effectively suppressing irrelevant action classes during inference.Extensive evaluations show that our TF-CADE not only achieves state-of-the-art performance under in-distribution settings but also excels in cross-dataset generalization to unseen action classes.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38050", "url": null, "sourceid": 31232, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38051, "uid": "98229e47efcf293ba35164fa38faba05", "name": "Intervention-Aware Multiscale Representation Learning from Imaging Phenomics and Perturbation Transcriptomics", "authors": [{"id": 183456, "fullname": "Jiayuan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/183456?format=json", "institution": "The Ohio State University"}, {"id": 188930, "fullname": "Ruoqi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188930?format=json", "institution": "Stanford University"}, {"id": 188931, "fullname": "Zishan Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188931?format=json", "institution": "University of California, Berkeley"}, {"id": 180339, "fullname": "Ping Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180339?format=json", "institution": "The Ohio State University"}], "abstract": "Microscopy-based phenotypic profiling is scalable for drug discovery but lacks the mechanistic depth of transcriptomics, which remains costly and scarce. Existing multimodal approaches either use images to support other modalities or naively align representations by sample identity, ignoring cell-type and dose variations in weakly paired data-limiting generalization to unseen interventions. In this paper, we introduce an intervention-aware distillation framework that leverages perturbational transcriptomics to guide image representation learning. A transcriptome-conditioned teacher integrates gene expression and intervention metadata to produce soft distributions over a chemistry-aware codebook organized by drug similarity. The teacher employs a fine-tuned single-cell foundation model to encode cell-type context and disentangle dose effects. An image-only student learns to predict these distributions from microscopy alone, distilling mechanistic knowledge while operating independently at test time. This design emphasizes intervention semantics rather than identity alignment and explicitly handles dose and cell-type mismatches.We provide theoretical guarantees showing that transcriptomic guidance tightens the risk bound for image-based prediction. On Cell Painting and RxRx datasets paired with L1000, our method significantly improves one-shot transfer to unseen interventions and drug-target gene discovery compared to self-supervised and alignment baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38051", "url": null, "sourceid": 37280, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37266, "uid": "485f1d0ae76a063a524f0c6315e93e74", "name": "FoSS: Modeling Long-Range Dependencies and Multimodal Uncertainty in Trajectory Prediction via Fourier\u2013State Space Integration", "authors": [{"id": 142776, "fullname": "Yizhou Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/142776?format=json", "institution": "Brunel University of London"}, {"id": 187049, "fullname": "Genze Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187049?format=json", "institution": "Brunel University of London"}, {"id": 128486, "fullname": "Yihua Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/128486?format=json", "institution": "University of Birmingham"}, {"id": 154488, "fullname": "Kezhi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154488?format=json", "institution": "Brunel University of London"}], "abstract": "Accurate trajectory prediction is vital for safe autonomous driving, yet existing approaches struggle to balance modeling power and computational efficiency. Attention-based architectures incur quadratic complexity with increasing agents, while recurrent models struggle to capture long-range dependencies and fine-grained local dynamics. Building upon this, we present FoSS, a dual-branch framework that unifies frequency-domain reasoning with linear-time sequence modeling. The frequency-domain branch performs a discrete Fourier transform to decompose trajectories into amplitude components encoding global intent and phase components capturing local variations, followed by a progressive helix reordering module that preserves spectral order; two selective state-space submodules, Coarse2Fine-SSM and SpecEvolve-SSM, refine spectral features with $\\mathcal{O}(N)$ complexity. In parallel, a time-domain dynamic selective SSM reconstructs self-attention behavior in linear time to retain long-range temporal context. A cross-attention layer fuses temporal and spectral representations, while learnable queries generate multiple candidate trajectories, and a weighted fusion head expresses motion uncertainty. Experiments on Argoverse 1 and Argoverse 2 benchmarks demonstrate that FoSS achieves state-of-the-art accuracy while reducing computation by 22.5\\% and parameters by over 40\\%. Comprehensive ablations confirm the necessity of each component.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37266", "url": null, "sourceid": 31101, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37270, "uid": "ce69714025385b379247c4d0e0444606", "name": "ARES: Unifying Asymmetric RGB-Event Stereo for Probabilistic Scene Flow Estimation", "authors": [{"id": 102757, "fullname": "Jie Long Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/102757?format=json", "institution": null}, {"id": 86340, "fullname": "Gim Hee Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/86340?format=json", "institution": "National University of Singapore"}], "abstract": "Estimating dense three dimensional motion in dynamic high speed scenes remains challenging due to motion blur, illumination variation, and the limited temporal resolution of conventional cameras. We introduce ARES, a unified framework for Asymmetric RGB-Event Stereo that addresses these issues through a hybrid setup where an event camera captures fine grained temporal dynamics and an RGB camera provides rich spatial structure. To integrate these heterogeneous modalities, we propose Multimodal Contextual Attention, a transformer based fusion mechanism that attends to spatial and temporal contexts under cross view constraints and forms a unified correspondence space for disparity and optical flow estimation. Building on this shared representation, we introduce Temporal Disparity Posterior Fusion, a probabilistic framework that models the evolution of disparity posteriors to infer disparity change and recover metrically coherent scene flow. Trained with sparse supervision and dense self consistency cues, our ARES achieves geometrically consistent and temporally stable three dimensional motion estimation across diverse driving scenarios. Experiments show that ARES attains state of the art performance in scene flow estimation, establishing a principled path toward unified asymmetric multimodal stereo sensing. Our code will be released upon paper acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37270", "url": null, "sourceid": 36190, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37272, "uid": "82005b5333cfe3b63634cf1afaac86af", "name": "LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration", "authors": [{"id": 180784, "fullname": "Peiliang Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/180784?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 186336, "fullname": "Jiacheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186336?format=json", "institution": "Shanghai Jiaotong University; Tencent; Shandong University"}, {"id": 187060, "fullname": "Haowen Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187060?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 181329, "fullname": "Xinyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181329?format=json", "institution": "Tsinghua University"}, {"id": 179949, "fullname": "Chang Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/179949?format=json", "institution": "Shanghai Jiao Tong University &amp; Tencent Hunyuan"}, {"id": 87643, "fullname": "Linfeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87643?format=json", "institution": ", Tsinghua University"}], "abstract": "Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to their practical deployment. While feature caching is a promising acceleration strategy, existing methods based on simple reusing or training-free forecasting struggle to adapt to the complex, stage-dependent dynamics of the diffusion process, often resulting in quality degradation and failing to maintain consistency with the standard denoising process. To address this, we propose a \\textbf{LE}arnable \\textbf{S}tage-\\textbf{A}ware (\\textbf{LESA}) predictor framework based on two-stage training. Our approach leverages a Kolmogorov\u2013Arnold Network (KAN) to accurately learn temporal feature mappings from data. We further introduce a multi-stage, multi-expert architecture that assigns specialized predictors to different noise-level stages, enabling more precise and robust feature forecasting. Extensive experiments show our method achieves significant acceleration while maintaining high-fidelity generation. Experiments demonstrate 5.00$\\times$ acceleration on FLUX.1-dev with minimal quality degradation (1.0\\% drop), 6.25$\\times$ speedup on Qwen-Image with a 20.2\\% quality improvement over the previous SOTA (TaylorSeer), and 5.00$\\times$ acceleration on HunyuanVideo with a 24.7\\% PSNR improvement over TaylorSeer. State-of-the-art performance on both text-to-image and text-to-video synthesis validates the effectiveness and generalization capability of our training-based framework across different models. Our code is included in the supplementary materials and will be released on GitHub.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37272", "url": null, "sourceid": 37968, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37275, "uid": "e0652a0045dbc0b14d016619158789ce", "name": "Spectral Mixture-of-Experts for Continual Learning", "authors": [{"id": 187065, "fullname": "Chen Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/187065?format=json", "institution": "Anhui University"}, {"id": 129216, "fullname": "Xingbo Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/129216?format=json", "institution": "Anhui University"}, {"id": 187066, "fullname": "Xuelin Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187066?format=json", "institution": "GUANGMING Laboratory"}, {"id": 129224, "fullname": "Zhe Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/129224?format=json", "institution": "Anhui University"}], "abstract": "While Parameter-Efficient Fine-Tuning using Mixture-of-Experts (MoE) is a promising solution for continual learning (CL), it suffers from two critical failure modes: structural interference, where expert updates interfere, and compositional forgetting, where the model\u2019s routing policy drifts. To address these issues, we introduce Spectral MoE, a novel framework built for CL from three core components. First, Spectral Experts are parameterized using unique, disjoint spectral masks to confine their learnable parameters to distinct frequency subspaces, ensuring a priori orthogonal updates that prevent structural interference. Second, a Dual-Router mechanism decouples online routing that learns new tasks from an offline memory that archives historical expert importance. Finally, this offline memory enables a Dynamic Consistency Projection, a geometric constraint that suppresses router drift and adaptively shields experts based on their past contributions, mitigating compositional forgetting. Validated on a strict cross-domain CL benchmark, our framework significantly outperforms existing methods, demonstrating superior knowledge retention and plasticity for new tasks. Code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37275", "url": null, "sourceid": 33891, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37276, "uid": "1accc1c92278a9bc2bbc6a8002026b25", "name": "GeoRelight: Learning Joint Geometrical Reconstruction and Relighting with Flexible Multi-Modal Diffusion Transformers", "authors": [{"id": 187067, "fullname": "Yuxuan Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/187067?format=json", "institution": "Eberhard-Karls-Universit\u00e4t T\u00fcbingen"}, {"id": 95045, "fullname": "Ruofan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/95045?format=json", "institution": "University of Toronto"}, {"id": 87620, "fullname": "Egor Zakharov", "url": "http://cvpr.thecvf.com/api/miniconf/users/87620?format=json", "institution": "Meta Reality Labs"}, {"id": 89940, "fullname": "Timur Bagautdinov", "url": "http://cvpr.thecvf.com/api/miniconf/users/89940?format=json", "institution": "Reality Labs Research"}, {"id": 89875, "fullname": "Chen Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/89875?format=json", "institution": "Facebook"}, {"id": 159460, "fullname": "Giljoo Nam", "url": "http://cvpr.thecvf.com/api/miniconf/users/159460?format=json", "institution": "Meta"}, {"id": 89935, "fullname": "Shunsuke Saito", "url": "http://cvpr.thecvf.com/api/miniconf/users/89935?format=json", "institution": "Reality Labs Research"}, {"id": 75975, "fullname": "Gerard Pons-Moll", "url": "http://cvpr.thecvf.com/api/miniconf/users/75975?format=json", "institution": "University of T\u00fcbingen"}, {"id": 95337, "fullname": "Javier Romero", "url": "http://cvpr.thecvf.com/api/miniconf/users/95337?format=json", "institution": "Meta"}], "abstract": "Relighting a person from a single photo is an attractive but ill-posed task, as a 2D image ambiguously entangles 3D geometry, intrinsic appearance, and illumination. Current methods either use sequential pipelines that suffer from error accumulation, or they do not explicitly leverage 3D geometry during relighting, which limits physical consistency. Since relighting and estimation of 3D geometry are mutually beneficial tasks, we propose a unified Multi-Modal Diffusion Transformer (DiT) that jointly solves for both: **GeoRelight**. We make this possible through two key technical contributions: isotropic NDC-Orthographic Depth (iNOD), a distortion-free 3D representation compatible with latent diffusion models; and a strategic mixed-data training method that combines synthetic and auto-labeled real data. By solving geometry and relighting jointly, GeoRelight achieves better performance than both sequential models and previous systems that ignored geometry.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37276", "url": null, "sourceid": 45409, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37310, "uid": "28e268a8abf4b64614d94f8bba977246", "name": "Disco-GS: Gaussian Splatting in Dynamic Color Lighting", "authors": [{"id": 157067, "fullname": "Ashish Kumar", "url": "http://cvpr.thecvf.com/api/miniconf/users/157067?format=json", "institution": "Indian Institute of Technology, Madras"}, {"id": 85247, "fullname": "A. N. Rajagopalan", "url": "http://cvpr.thecvf.com/api/miniconf/users/85247?format=json", "institution": "Indian Institute of Technology Madras"}], "abstract": "Recent advances in Gaussian Splatting (GS) have significantly improved 3D scene reconstruction and novel view synthesis. However, most existing methods typically assume that training inputs are captured under stable lighting conditions and achromatic light. In contrast, scenes recorded under temporally varying color light, as in \u201cdisco lights\u201d commonly seen in events, performances, and decorative settings, introduce severe ambiguities in both scene photometry and geometry. We propose Disco-GS, a framework that leverages GS for reconstructing the 3D scene while simultaneously recovering the underlying canonical appearance from videos captured under dynamic lighting conditions. Disco-GS estimates the effective per-pixel transient light, which, when applied to the canonical image, results in the observed color image of the scene, thereby enabling self-supervised learning. Disco-GS is an end-to-end framework that does not rely on any prior knowledge, such as color values, ambient lighting conditions, or scene properties. It effectively handles both global and spatially localized transient color variations. It also enables controllable brightness manipulation of the canonical scene, facilitating applications such as simulating low-light and well-lit scene conditions. To the best of our knowledge, Disco-GS is the first method to simultaneously perform 3D scene reconstruction and canonical appearance recovery from inputs captured under artificially varying, disco-style colored light. To enable quantitative and qualitative evaluations, we also introduce the Disco dataset, a collection of 25 videos of real-world scenes exhibiting diverse and random color variations. The dataset will be released. Extensive experiments demonstrate the robustness and fidelity of Disco-GS.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37310", "url": null, "sourceid": 42399, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37277, "uid": "5b18af25d34fc1f0670d8d09354bbb1b", "name": "Event Stream Filtering via Probability Flux Estimation", "authors": [{"id": 166654, "fullname": "Jinze Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/166654?format=json", "institution": "University of Science and Technology of China"}, {"id": 86247, "fullname": "Wei Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/86247?format=json", "institution": "University of Science and Technology of China"}, {"id": 86250, "fullname": "Yang Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86250?format=json", "institution": "University of Science and Technology of China"}, {"id": 88935, "fullname": "Bin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88935?format=json", "institution": "University of Science and Technology of China"}, {"id": 86637, "fullname": "Zheng-Jun Zha", "url": "http://cvpr.thecvf.com/api/miniconf/users/86637?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Event cameras asynchronously capture brightness changes with microsecond latency, offering exceptional temporal precision but suffering from severe noise and signal inconsistencies. Unlike conventional signals, events carry state information through polarities and process information through inter-event time intervals. However, existing event filters often ignore the latter, producing outputs that are sparser than the raw input and limiting the reconstruction of continuous irradiance dynamics. We propose the Event Density Flow Filter (EDFilter), a framework that models event generation as threshold-crossing probability fluxes arising from the stochastic diffusion of irradiance trajectories. EDFilter performs nonparametric, kernel-based estimation of probability flux and reconstructs the continuous event density flow using an O(1) recursive solver, enabling real-time processing. The Rotary Event Dataset (RED), featuring microsecond-resolution ground-truth irradiance flow under controlled illumination is also presented for event quality evaluation. Experiments demonstrate that EDFilter achieves high-fidelity, physically interpretable event denoising and motion reconstruction.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37277", "url": null, "sourceid": 38939, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37279, "uid": "f3314823b3e76de2f99ab14830e4e2e1", "name": "ELiC: Efficient LiDAR Geometry Compression via Cross-Bit-depth Feature Propagation and Bag-of-Encoders", "authors": [{"id": 183214, "fullname": "Junsik Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/183214?format=json", "institution": "Electronics and Telecommunications Research Institute (ETRI)"}, {"id": 181293, "fullname": "Gun Bang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181293?format=json", "institution": "ETRI"}, {"id": 183215, "fullname": "Soowoong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/183215?format=json", "institution": "Electronics and Telecommunications Research Institute (ETRI)"}], "abstract": "Hierarchical LiDAR geometry compression encodes voxel occupancies from low to high bit-depths, yet prior methods treat each depth independently and re-estimate local context from coordinates at every level, limiting compression efficiency. We present ELiC, a real-time framework that combines cross-bit-depth feature propagation, a Bag-of-Encoders (BoE) selection scheme, and a Morton-order-preserving hierarchy. Cross-bit-depth propagation reuses features extracted at denser, lower depths to support prediction at sparser, higher depths. BoE selects, per depth, the most suitable coding network from a small pool, adapting capacity to observed occupancy statistics without training a separate model for each level. The Morton hierarchy maintains global Z-order across depth transitions, eliminating per-level sorting and reducing latency. Together these components improve entropy modeling and computation efficiency, yielding state-of-the-art compression at real-time throughput on Ford and SemanticKITTI.Code and models will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37279", "url": null, "sourceid": 37869, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37278, "uid": "b0784fd78e6211f2c634896ef544cfe0", "name": "MSCD-GS: Motion-Separated Cooperative Deblurring Dynamic Reconstruction via Gaussian Splatting", "authors": [{"id": 144845, "fullname": "yongjian liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/144845?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 127460, "fullname": "Xu Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/127460?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 145288, "fullname": "Wenjun Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/145288?format=json", "institution": "Guangdong University of Technology"}, {"id": 187068, "fullname": "Huixuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187068?format=json", "institution": "Guangdong University of Technology"}, {"id": 187069, "fullname": "Xiaoen Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/187069?format=json", "institution": "Guangdong University of Technology"}, {"id": 187070, "fullname": "Chunxi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187070?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 187071, "fullname": "Shixiang Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187071?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 187072, "fullname": "Gang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187072?format=json", "institution": "China Mobile Communications Company Limited Research Institute"}, {"id": 107125, "fullname": "Jiahuan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/107125?format=json", "institution": "Peking University"}, {"id": 127455, "fullname": "Sheng Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/127455?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 90454, "fullname": "Luxin Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/90454?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Although 4D reconstruction based on Gaussian Splatting has achieved many impressive results, reconstructing real-world images captured by a casual monocular camera remains a significant challenge. In dynamic scenes, as the camera and objects move during the exposure time, these input images inevitably contain a considerable amount of motion blur, which severely compromises the quality of reconstruction and new viewpoint synthesis. The existing deblurring 3D Gaussian models still cannot handle motion blur issues in real dynamic scenes. To address these challenges, we propose MSCD-GS\u2014a novel method for motion-separated collaborative deblurring 4D reconstruction via Gaussian Splatting, capable of effectively handling motion-blurred inputs. Specifically, due to the distinct motion characteristics of static and dynamic Gaussians, we perform separate motion modeling to achieve dynamic scene reconstruction. To predict Gaussian changes during the exposure time, we designed motion-aware networks for static and dynamic Gaussians, thereby synthesizing virtual blurred images. Finally, we utilize the results from the deblurring network and the synthesized images to supervise 4D reconstruction collaboratively. Extensive experiments demonstrate that MSCD-GS can effectively reconstruct high-quality dynamic scenes from blurred image inputs, with performance surpassing existing methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37278", "url": null, "sourceid": 31438, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37283, "uid": "0a8b2799af5e9b658b8fdb0558fc99c8", "name": "3D Gaussian Splatting with Self-Constrained Prior for High Fidelity Surface Reconstruction", "authors": [{"id": 153055, "fullname": "Takeshi Noda", "url": "http://cvpr.thecvf.com/api/miniconf/users/153055?format=json", "institution": "Tsinghua University"}, {"id": 76426, "fullname": "Yu-Shen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76426?format=json", "institution": "Tsinghua University"}, {"id": 88625, "fullname": "Zhizhong Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/88625?format=json", "institution": "Wayne State University"}], "abstract": "Rendering 3D surfaces has been revolutionized within the modeling of radiance fields through either 3DGS or NeRF. Although 3DGS has shown advantages over NeRF in terms of rendering quality or speed, there is still room for improvement in recovering high fidelity surfaces through 3DGS. To resolve this issue, we propose a self-constrained prior to constraining the movement of 3D Gaussians, aiming for more accurate depth rendering. Our self-constrained prior is a TSDF grid fused by the rendered depth during the learning of 3D Gaussians. The prior measures a band on both sides of the estimated surface for imposing more specific constraints on the right 3D Gaussians, such as removing 3D Gaussians outside the band, encouraging larger opacity for Gaussians near the center of the band or smaller opacity for Gaussians near the boundary of the band. We regularly update the prior by fusing more recent depth images which are usually more accurate, and progressively narrow the band to tighten the constraint on Gaussian movements. We justify our idea and report our superiority over the state-of-the-art methods in evaluations on widely used benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37283", "url": null, "sourceid": 37703, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37288, "uid": "e68d60edfb709b5834cb5e9286b4ce4b", "name": "Voxify3D: Pixel Art Meets Volumetric Rendering", "authors": [{"id": 149275, "fullname": "Yichuan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/149275?format=json", "institution": "NYCU"}, {"id": 175250, "fullname": "Jiewen Chan", "url": "http://cvpr.thecvf.com/api/miniconf/users/175250?format=json", "institution": "National Yang Ming Chiao Tung University"}, {"id": 178816, "fullname": "Hao-Jen Chien", "url": "http://cvpr.thecvf.com/api/miniconf/users/178816?format=json", "institution": "National Yang Ming Chiao Tung University"}, {"id": 127300, "fullname": "Yu-Lun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127300?format=json", "institution": "National Yang Ming Chiao Tung University"}], "abstract": "Voxel art is a distinctive stylization widely used in games and digital media, yet automated generation from 3D meshes remains challenging due to conflicting requirements of geometric abstraction, semantic preservation, and discrete color coherence. Existing methods either over-simplify geometry or fail to achieve the pixel-precise, palette-constrained aesthetics of voxel art. We introduce Voxify3D, a differentiable two-stage framework bridging 3D mesh optimization with 2D pixel art supervision. Our core innovation lies in the synergistic integration of three components: (1) orthographic pixel art supervision that eliminates perspective distortion for precise voxel-pixel alignment; (2) patch-based CLIP alignment that preserves semantics across discretization levels; (3) palette-constrained Gumbel-Softmax quantization enabling differentiable optimization over discrete color spaces with controllable palette strategies. This integration addresses fundamental challenges: semantic preservation under extreme discretization, pixel-art aesthetics through volumetric rendering, and end-to-end discrete optimization. Experiments show superior performance (37.12 CLIP-IQA, 77.90% user preference) across diverse characters and controllable abstraction (2-8 colors, 20\u00b3-50\u00b3 resolutions).", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37288", "url": null, "sourceid": 30907, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37284, "uid": "63d196328512c582293ce6c845521bb6", "name": "SelecTKD: Selective Token-Weighted Knowledge Distillation for LLMs", "authors": [{"id": 181408, "fullname": "Haiduo Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181408?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 187078, "fullname": "Jiangcheng Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/187078?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 187079, "fullname": "Yadong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187079?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 185131, "fullname": "Pengju Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/185131?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "Knowledge distillation (KD) is a standard route to compress Large Language Models (LLMs) into compact students, yet most pipelines uniformly apply token-wise loss regardless of teacher confidence. This indiscriminate supervision amplifies noisy, high-entropy signals and is especially harmful under large teacher-student capacity gaps. We introduce SelecTKD, a plug-and-play Selective Token-Weighted distillation framework that shifts the focus from \"how to measure divergence'' to \"where to apply learning\". At each step, the student proposes tokens that are verified by the teacher through a robust propose-and-verify procedure with two variants: greedy Top-$k$ and non-greedy Spec-$k$. Accepted tokens receive full loss, while rejected tokens are masked or down-weighted. This objective-agnostic design works with on- and off-policy data, induces an implicit curriculum quantified by Token Acceptance Rate (TAR), and stabilizes optimization. Across instruction following, mathematical reasoning, code generation, and a VLM setting, SelecTKD consistently improves strong baselines and achieves state-of-the-art results for small models\u2014without architectural changes or extra reference models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37284", "url": null, "sourceid": 35445, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37291, "uid": "1768bc8c31a3cd81b33098e5ec2e868f", "name": "FlexTraj: Image-to-Video Generation with Flexible Point Trajectory Control", "authors": [{"id": 147415, "fullname": "Zhiyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/147415?format=json", "institution": "City University of Hong Kong"}, {"id": 181441, "fullname": "Can Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181441?format=json", "institution": null}, {"id": 86337, "fullname": "Dongdong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/86337?format=json", "institution": "Microsoft GenAI"}, {"id": 86410, "fullname": "Jing Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86410?format=json", "institution": "City University of Hong Kong"}], "abstract": "We present FlexTraj, a framework for image-to-video generation with flexible point trajectory control. FlexTraj introduces a unified point-based motion representation that encodes each point with a  temporally consistent trajectory ID, a segmentation ID, and an optional color channel for appearance cues, enabling both dense and sparse trajectory control. Instead of injecting trajectory conditions into the video generator through token concatenation or ControlNet, FlexTraj employs an efficient sequence-concatenation scheme that achieves faster convergence, stronger controllability, and more efficient inference, while maintaining robustness under unaligned conditions. To train such a unified point trajectory-controlled video generator, FlexTraj adopts an annealing training strategy that gradually reduces reliance on complete supervision and aligned condition. Experimental results demonstrate that FlexTraj enables multi-granularity, alignment-agnostic trajectory control for video generation, supporting various applications such as motion cloning, drag-based image-to-video, motion interpolation, camera redirection, flexible action control and mesh animations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37291", "url": null, "sourceid": 42913, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37298, "uid": "cc4d91edae41488c825cc05a61fc4452", "name": "Learning Hierarchical Hyperbolic Mixture Model for Part-aware 3D Generation", "authors": [{"id": 154031, "fullname": "Qitong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154031?format=json", "institution": "Xidian University"}, {"id": 90799, "fullname": "Mingtao Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90799?format=json", "institution": "Xidian University"}, {"id": 154032, "fullname": "Zijie Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154032?format=json", "institution": "Xidian University"}, {"id": 187120, "fullname": "Huixin Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187120?format=json", "institution": "Rocket Force University of Engineering"}, {"id": 89279, "fullname": "Weisheng Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/89279?format=json", "institution": "Xidian University"}, {"id": 154033, "fullname": "Yaonan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154033?format=json", "institution": "Hunan University"}, {"id": 90780, "fullname": "Ajmal Mian", "url": "http://cvpr.thecvf.com/api/miniconf/users/90780?format=json", "institution": "University of Western Australia"}], "abstract": "3D shape generation has become increasingly important for graphics and vision applications. Current part-aware 3D generation usually overlooks hierarchical part relations or inefficiently encodes multi-level semantics in Euclidean space. Thus we propose a novel framework for hierarchical and efficient part-aware 3D generation in hyperbolic space. Our contributions are three-fold: (1) Hierarchical Hyperbolic Mixture Model (H$^2$MM): We propose part-aware semantic representation of objects within a hyperbolic manifold, providing a high-fidelity hierarchical part-aware representation of object details and semantics. (2) Hyperbolic Semantically Consistent Diffusion Model: We design the geodesic diffusion process that preserves the hierarchical and semantic structure of H$^{2}$MM, and progressively generates semantics from conditions and generates object under their joint guidance. We use an adaptive tree-structured neural network to loosen the constraint of jointly generating nodes and edges in previous hyperbolic diffusion. (3) Hyperbolic Diffusion Model Solver: We leverage higher-order Riemannian gradient on hyperbolic manifolds for designing a fast dedicated high-order solver for diffusion ODEs with the convergence order guarantee. Extensive experiments demonstrate that our method achieves superior quality and efficiency. Code will be public.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37298", "url": null, "sourceid": 35709, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37299, "uid": "df1764dfa48b244868c24a2992715607", "name": "MTA: Multimodal Task Alignment for BEV Perception and Captioning", "authors": [{"id": 93157, "fullname": "Yunsheng Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/93157?format=json", "institution": "Purdue University"}, {"id": 154294, "fullname": "Burhan Yaman", "url": "http://cvpr.thecvf.com/api/miniconf/users/154294?format=json", "institution": "Uber"}, {"id": 150932, "fullname": "Xin Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/150932?format=json", "institution": "Bosch"}, {"id": 171060, "fullname": "Jingru Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/171060?format=json", "institution": "Bosch Research Center"}, {"id": 187121, "fullname": "Feng Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187121?format=json", "institution": "Robert Bosch"}, {"id": 126502, "fullname": "Abhirup Mallik", "url": "http://cvpr.thecvf.com/api/miniconf/users/126502?format=json", "institution": "Bosch"}, {"id": 187122, "fullname": "Ziran Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187122?format=json", "institution": "Purdue University"}, {"id": 84539, "fullname": "Liu Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/84539?format=json", "institution": "Bosch Research"}], "abstract": "Bird's eye view (BEV)-based 3D perception plays a crucial role in autonomous driving applications. The rise of large language models has spurred interest in BEV-based captioning to understand object behavior in the surrounding environment. However, existing approaches treat perception and captioning as separate tasks, focusing on the performance of only one task and overlooking the potential benefits of multimodal alignment. To bridge this gap between modalities, we introduce MTA, a novel multimodal task alignment framework that boosts both BEV perception and captioning. MTA consists of two key components: (1) BEV-Language Alignment (BLA), a contextual learning mechanism that aligns the BEV scene representations with ground-truth language representations, and (2) Detection-Captioning Alignment (DCA), a cross-modal prompting mechanism that aligns detection and captioning outputs. MTA seamlessly integrates into state-of-the-art baselines during training, adding no extra computational complexity at runtime. Extensive experiments on the nuScenes and TOD3Cap datasets show that MTA significantly outperforms state-of-the-art baselines in both tasks, achieving a 10.7% improvement in challenging rare perception scenarios and a 9.2% improvement in captioning. These results underscore the effectiveness of unified alignment in reconciling BEV-based perception and captioning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37299", "url": null, "sourceid": 38308, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37300, "uid": "493c8b3821e768713a4d1c5b1e7f5ad4", "name": "POGA: Paraphrased and Oppositional Graph Alignment for Fine-Grained Cross-Modal Retrieval", "authors": [{"id": 182674, "fullname": "Junfeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182674?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 130364, "fullname": "Zhe Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/130364?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 85042, "fullname": "Yuankai Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/85042?format=json", "institution": "The University of Adelaide"}, {"id": 187123, "fullname": "Junping Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/187123?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 187124, "fullname": "Xiangyang Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187124?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 187125, "fullname": "Yishuo Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187125?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 187126, "fullname": "Amin Beheshti", "url": "http://cvpr.thecvf.com/api/miniconf/users/187126?format=json", "institution": "Macquarie University"}, {"id": 153914, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153914?format=json", "institution": "Macquarie University"}, {"id": 88134, "fullname": "Anton van den Hengel", "url": "http://cvpr.thecvf.com/api/miniconf/users/88134?format=json", "institution": "University of Adelaide"}, {"id": 75715, "fullname": "Ming-Hsuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75715?format=json", "institution": "University of California at Merced"}], "abstract": "Most of the models used to generate embeddings for retrieval are not trained for the purpose which leads them to focus on coarse semantic alignment rather than particular object attributes or arrangements. This limits their performance, particularly on challenging problems such as cross-modal fine-grained retrieval. Furthermore, their training objectives lack the discriminative ability required to distinguish between descriptions that are semantically similar but factually different. To address these challenges, we propose POGA (Paraphrased and Oppositional Graph Alignment), a novel framework for fine-grained cross-modal alignment. POGA comprises two core innovations: (1) Multi-source Graph Augmentation (MSGA), which not only generates paraphrased positives and oppositional negatives, but also parses the image and all text variants into structured graphs to provide difference-rich supervisory signals; (2) Hybrid Multi-granularity Alignment (HMA), which defines a composite training objective that jointly optimizes the model at four distinct granularities: including robust dual global alignment, and precise matching at three fine-grained levels: node, relation, and focal disproving. Experiments on benchmarks such as DCI and DOCCI demonstrate that POGA performs favorably against several state-of-the-art methods in long-text understanding and complex relation discrimination.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37300", "url": null, "sourceid": 31624, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37301, "uid": "1f0643149dd5c86751215fd0040675aa", "name": "Explaining Object Detectors via Collective Contribution of Pixels", "authors": [{"id": 183709, "fullname": "Toshinori Yamauchi", "url": "http://cvpr.thecvf.com/api/miniconf/users/183709?format=json", "institution": "Japan"}, {"id": 127476, "fullname": "Hiroshi Kera", "url": "http://cvpr.thecvf.com/api/miniconf/users/127476?format=json", "institution": "Chiba University"}, {"id": 127481, "fullname": "Kazuhiko Kawamoto", "url": "http://cvpr.thecvf.com/api/miniconf/users/127481?format=json", "institution": "Chiba University"}], "abstract": "Visual explanations for object detectors are crucial for enhancing their reliability. Object detectors identify and localize instances by assessing multiple visual features collectively. When generating explanations, overlooking these collective influences in detections may lead to missing compositional cues or capturing spurious correlations. However, existing methods typically focus solely on individual pixel contributions, neglecting the collective contribution of multiple pixels. To address this limitation, we propose a game-theoretic method based on Shapley values and interactions to explicitly capture both individual and collective pixel contributions. Our method provides explanations for both bounding box localization and class determination, highlighting regions crucial for detection. Extensive experiments demonstrate that the proposed method identifies important regions more accurately than state-of-the-art methods. The code will be publicly available soon.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37301", "url": null, "sourceid": 32126, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37302, "uid": "d589dd26d0c0260c8a2001f8db379c14", "name": "ViHOI: Human-Object Interaction Synthesis with Visual Priors", "authors": [{"id": 175486, "fullname": "Songjin Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/175486?format=json", "institution": "South China University of Technology"}, {"id": 187127, "fullname": "Linjie Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187127?format=json", "institution": "South China University of Technology"}, {"id": 153440, "fullname": "Ling Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/153440?format=json", "institution": "South China University of Technology"}, {"id": 88971, "fullname": "Changxing Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/88971?format=json", "institution": "South China University of Technology"}], "abstract": "Generating realistic and physically plausible 3D Human-Object Interactions (HOI) remains a key challenge in motion generation. One primary reason is that describing these physical constraints with words alone is difficult. To address this limitation, we propose a new paradigm: extracting rich interaction priors from easily accessible 2D images. Specifically, we introduce ViHOI, a novel framework that enables diffusion-based generative models to leverage rich, task-specific priors from 2D images to enhance generation quality. We utilize a large Vision-Language Model (VLM) as a powerful prior-extraction engine and adopt a layer-decoupled strategy to obtain visual and textual priors. Concurrently, we design a Q-Former-based adapter that compresses the VLM\u2019s high-dimensional features into compact prior tokens, which significantly facilitates the conditional training of our diffusion model. Our framework is trained on motion-rendered images from the dataset to ensure strict semantic alignment between visual inputs and motion sequences. During inference, it leverages reference images synthesized by a text-to-image generation model to improve generalization to unseen objects and interaction categories. Experimental results demonstrate that ViHOI achieves state-of-the-art performance, outperforming existing methods across multiple benchmarks and demonstrating superior generalization. The code for this work will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37302", "url": null, "sourceid": 40712, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37303, "uid": "e7b5b8a1ae60d4207a0837248daf1599", "name": "ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery", "authors": [{"id": 174439, "fullname": "Weiqin Jiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/174439?format=json", "institution": "University of Twente"}, {"id": 133794, "fullname": "Hao Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/133794?format=json", "institution": "University of Twente"}, {"id": 72228, "fullname": "George Vosselman", "url": "http://cvpr.thecvf.com/api/miniconf/users/72228?format=json", "institution": "University of Twente"}, {"id": 177371, "fullname": "Claudio Persello", "url": "http://cvpr.thecvf.com/api/miniconf/users/177371?format=json", "institution": "University of Twente"}], "abstract": "We tackle the problem of generating a complete vector map representation from aerial imagery in a single run: producing polygons for all land-cover classes with shared boundaries and no gaps or overlaps. Existing polygonization methods are typically class-specific; extending them to multiple classes via per-class runs commonly leads to topological inconsistencies, such as duplicated edges, gaps, and overlaps. We formalize this new task as All-Class Polygonal Vectorization (ACPV) and release the first public benchmark, Deventer-512, with standardized metrics jointly evaluating semantic fidelity, geometric accuracy, vertex efficiency, per-class topological fidelity and global topological consistency. To realize ACPV, we propose ACPV-Net, a unified framework introducing a novel Semantically Supervised Conditioning (SSC) mechanism coupling semantic perception with geometric primitive generation, along with a topological reconstruction that guarantees shared-edge consistency by design. While enforcing such strict topological constraints, ACPV-Net surpasses all class-specific baselines in polygon quality across classes on Deventer-512, e.g., compared to TopDiG on vegetation, +9.9 IoU (semantic fidelity), -45% PoLiS (geometric error), -59% N-ratio (vertex redundancy). It also applies to single-class polygonal vectorization without any architectural modification, achieving the best-reported results on WHU-Building. Data, code, and models will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37303", "url": null, "sourceid": 41370, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37304, "uid": "e3fd458d02cc75934dd2c30f896ff698", "name": "PAF: Perturbation-Aware Filtering for Open-Set Semi-Supervised Learning", "authors": [{"id": 174521, "fullname": "Yinan Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/174521?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 166137, "fullname": "Qing-Yuan Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/166137?format=json", "institution": "Nanjing University of Science and Technology"}], "abstract": "Open-set semi-supervised learning (OSSL) has achieved notable progress in exploiting unlabeled data, yet most existing methods overlook the distinct sensitivities of in-distribution (ID) and out-of-distribution (OOD) samples to semantic-preserving perturbations, resulting in unreliable OOD sample filtering. We address this gap by leveraging the behavioral difference between ID and OOD samples under perturbations and extend it into a representation-level signal for reliable OOD filtering. Specifically, we propose a novel filtering strategy, Perturbation-Aware Filtering (PAF), which identifies OOD samples by measuring the representation instability under semantic-preserving perturbations. We then integrate PAF into a carefully designed two-stage training framework, allowing the model to exploit abundant unlabeled data in the first stage and gradually adapt to the open-set setting with limited labels in the second stage. Extensive experimental results on widely-used OSSL benchmarks demonstrate that our proposed PAF approach achieves superior performance compared to state-of-the-art OSSL methods. The source code will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37304", "url": null, "sourceid": 42537, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37312, "uid": "a1f822aee4a7a8bad18385af7c7b420f", "name": "T2SGrid: Temporal-to-Spatial Gridification for Video Temporal Grounding", "authors": [{"id": 182024, "fullname": "Chaohong Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/182024?format=json", "institution": "South China University of Technology"}, {"id": 187135, "fullname": "Yihan He", "url": "http://cvpr.thecvf.com/api/miniconf/users/187135?format=json", "institution": "South China University of Technology"}, {"id": 86825, "fullname": "Yongwei Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/86825?format=json", "institution": "South China University of Technology"}, {"id": 187136, "fullname": "Fei Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/187136?format=json", "institution": "Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), also known as Guangming Laboratory"}, {"id": 89048, "fullname": "Xuemiao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89048?format=json", "institution": "South China University of Technology"}, {"id": 86787, "fullname": "Chengjiang Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/86787?format=json", "institution": "ByteDance Inc."}], "abstract": "Video Temporal Grounding (VTG) aims to localize the video segment that corresponds to a natural language query, which requires a comprehensive understanding of complex temporal dynamics. Existing Vision-LMMs typically perceive temporal dynamics via positional encoding, text-based timestamps, or visual frame numbering. However, these approaches exhibit notable limitations: assigning each frame a text-based timestamp token introduces additional computational overhead and leads to sparsity in visual attention, positional encoding struggles to capture absolute temporal information, and visual frame numbering often compromises spatial detail. To address these issues, we propose Temporal to Spatial Gridification (T2SGrid), a novel framework that reformulates video temporal understanding as a spatial understanding task. The core idea of T2SGrid is to process video content in clips rather than individual frames. we employ a overlapping sliding windows mechanism to segment the video into temporal clips. Within each window, frames are arranged chronologically in a row-major order into a composite grid image, effectively transforming temporal sequences into structured 2D layouts. The gridification not only encodes temporal information but also enhances local attention within each grid. Furthermore, T2SGrid enables the use of composite text timestamps to establish global temporal awareness. Experiments on standard VTG benchmarks demonstrate that T2SGrid achieves superior performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37312", "url": null, "sourceid": 39549, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37316, "uid": "8d5679179e529f9ed2dfd332a526f51f", "name": "SonoWorld: From One Image to a 3D Audio-Visual Scene", "authors": [{"id": 140758, "fullname": "Derong Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/140758?format=json", "institution": "UMD"}, {"id": 182783, "fullname": "Xiyi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/182783?format=json", "institution": "University of Maryland, College Park"}, {"id": 150899, "fullname": "Ming Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/150899?format=json", "institution": "University of Maryland, College Park"}, {"id": 135743, "fullname": "Ruohan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/135743?format=json", "institution": "University of Maryland, College Park | Meta"}], "abstract": "Tremendous progress in visual scene generation now turns a single image into an explorable 3D world, yet immersion remains incomplete without sound. We introduce Image2AVScene, the task of generating a 3D audio-visual scene from a single image, and present SonoWorld, the first framework to tackle this challenge. From one image, our pipeline outpaints a 360\u00b0 panorama, lifts it into a navigable 3D scene, places language-guided sound anchors, and renders ambisonics for point, areal, and ambient sources, yielding spatial audio aligned with scene geometry and semantics. Quantitative evaluations on a newly curated real-world dataset and a controlled user study confirm the effectiveness of our approach. Beyond free-viewpoint audio-visual rendering, we also demonstrate applications to one-shot acoustic learning and audio-visual spatial source separation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37316", "url": null, "sourceid": 45860, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37317, "uid": "39e367521684bd5cf70773f610e2c5c0", "name": "PosterReward: Unlocking Accurate Evaluation for High-Quality Graphic Design Generation", "authors": [{"id": 151772, "fullname": "Jianyu LAI", "url": "http://cvpr.thecvf.com/api/miniconf/users/151772?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 152539, "fullname": "Sixiang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/152539?format=json", "institution": "HKUST(GZ)"}, {"id": 159498, "fullname": "Jialin Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/159498?format=json", "institution": "Meituan"}, {"id": 187140, "fullname": "Hengyu Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/187140?format=json", "institution": "Meituan"}, {"id": 187141, "fullname": "Zhongying Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187141?format=json", "institution": "Meituan; Fudan University"}, {"id": 187142, "fullname": "Fuxiang Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/187142?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 154365, "fullname": "Junfeng Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/154365?format=json", "institution": "Meituan"}, {"id": 84905, "fullname": "Xiaoming Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/84905?format=json", "institution": "Meituan"}, {"id": 187143, "fullname": "Lujia Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187143?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 187144, "fullname": "Lei Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187144?format=json", "institution": "The Hong Kong University of Science and Technology"}], "abstract": "Recent advancements in the text-rendering capabilities of image generation models have made the end-to-end creation of graphic design content, such as posters, increasingly feasible. However, existing reward models fail to accurately assess the quality of graphic design. They primarily focus on global image aesthetics and lack the capacity to evaluate two other core elements of graphic design: typography and layout. Furthermore, current text-to-image preference datasets suffer from a scarcity of data related to graphic design, which hinders the further development of generative models in this domain.To address this gap, we have designed an automated pipeline to construct a high-quality dataset of 70k poster preferences. Subsequently, we have developed, a reward model capable of accurately assessing the quality of generated posters, by leveraging a cascaded, multi-stage training pipeline. We also provide multiple variants of the model to cater to different application scenarios. Finally, we introduce and  to evaluate the performance of existing reward models in poster assessment and the capabilities of current image generation models in poster creation, respectively.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37317", "url": null, "sourceid": 44427, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37318, "uid": "92ee5fcb44fc70d5432a7130b72f2f3d", "name": "ST4R-Splat: Spatio-Temporal Referring Segmentation in 4D Gaussian Splatting", "authors": [{"id": 172303, "fullname": "Yuming Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/172303?format=json", "institution": "Peking University"}, {"id": 107274, "fullname": "Dong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107274?format=json", "institution": "Peking University"}, {"id": 128617, "fullname": "Hongbin Zha", "url": "http://cvpr.thecvf.com/api/miniconf/users/128617?format=json", "institution": "Peking University"}], "abstract": "Understanding and segmenting objects in dynamic 4D environments from natural language is crucial yet underexplored. Existing works either perform referring segmentation in static 3D scenes or build open-vocabulary 4D language fields, but none of them supports grounding complex spatio-temporal referring descriptions in explicit 4D reconstructions. Based on 4D Gaussian Splatting(4DGS), We formalize this missing setting as Spatio-Temporal Referring Segmentation in 4D Gaussian Splatting (STRS-4DGS): given a 4DGS representation of a dynamic scene and a referring expression, the goal is to identify the target object and segment it across both space and time, resolving where the described instance is and when it exhibits the queried state. To tackle this challenge, we propose ST4R-Splat, the first framework for STRS-4DGS. ST4R-Splat builds on deformable 4D Gaussians and introduces an Instance-Aware 4D Referring Field that assigns each Gaussian a time-invariant embedding, enabling robust instance-level grounding for both time-agnostic and time-sensitive referring queries. On top of this, an Instance-level Temporal State Mapping module models a view-independent mapping from instance identity and time to semantic states directly in feature space. To obtain rich supervision without manual annotation, we design a task-adaptive captioning pipeline that uses multimodal large language models to generate complementary frame-level descriptive captions and time-aware state captions for each object. We construct a new benchmark on dynamic 4D reconstructions with spatio-temporally grounded referring expressions and adapt state-of-the-art 3D/4D language grounding methods as baselines.Extensive experiments show that ST4R-Splat significantly outperforms baselines on both spatial (time-agnostic) and temporal (time-sensitive) metrics, establishing a strong foundation for fine-grained, language-driven understanding of dynamic 4D scenes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37318", "url": null, "sourceid": 45469, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37320, "uid": "ad5ba4274138d6f039cc4cc1c14867cf", "name": "Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation", "authors": [{"id": 187145, "fullname": "Guangkai Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187145?format=json", "institution": "Zhejiang University"}, {"id": 185381, "fullname": "Hua Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185381?format=json", "institution": "Zhejiang University"}, {"id": 187146, "fullname": "Huanyi Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187146?format=json", "institution": "Zhejiang University"}, {"id": 177198, "fullname": "Songyi Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/177198?format=json", "institution": "Zhejiang University"}, {"id": 187147, "fullname": "Yanlong Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/187147?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 185384, "fullname": "Hao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185384?format=json", "institution": "Zhejiang University"}, {"id": 86450, "fullname": "Chunhua Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/86450?format=json", "institution": "Zhejiang University"}], "abstract": "Recent advancements in feed-forward architectures for visual geometry estimation have achieved significant progress. Interestingly, per-frame visual geometry estimation approaches typically exhibit weaker multi-frame consistency but demonstrate superior per-frame accuracy compared to multi-frame algorithms. This observation motivates our systematic investigation into the critical factors driving model performance through rigorous ablation studies, which reveals three key insights: 1) Scaling up data diversity and quality unlocks further performance gains even in state-of-the-art visual geometry estimation methods; 2) Commonly adopted confidence-aware loss and gradient-based loss mechanisms may unintentionally hinder performance; 3) Joint supervision through both per-sequence and per-frame alignment improves results, while local region alignment surprisingly degrades performance. Furthermore, we introduce two enhancements to integrate the advantages of optimization-based methods and high-resolution inputs: a consistency loss function that enforces alignment between depth maps, camera parameters, and point maps, and an efficient architectural design that enables high-resolution geometry estimation. These contributions are integrated into CFG, a model that simultaneously generates precise and coherent geometric representations from diverse input perspectives at high resolutions. Comprehensive testing across multiple benchmarks for point cloud reconstruction, video depth estimation, and camera pose/intrinsic parameter estimation confirms CFG's superior performance, establishing it as a state-of-the-art solution for visual geometry tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37320", "url": null, "sourceid": 39204, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37324, "uid": "34dc21dee723e965b9a5e5ee9c8965b1", "name": "SelfHVD: Self-Supervised Handheld Video Deblurring", "authors": [{"id": 144938, "fullname": "Honglei Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/144938?format=json", "institution": "Harbin Institute of Technology"}, {"id": 88919, "fullname": "Zhilu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88919?format=json", "institution": "Harbin Institute of Technology"}, {"id": 187164, "fullname": "Junjie Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187164?format=json", "institution": "School of Mathematics, HIT"}, {"id": 89739, "fullname": "Xiaohe Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89739?format=json", "institution": "Harbin Institute of Technology"}, {"id": 84797, "fullname": "Wangmeng Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/84797?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Shooting video with handheld shooting devices often results in blurry frames due to shaking hands and other instability factors. Although previous video deblurring methods have achieved impressive progress, they still struggle to perform satisfactorily on real-world handheld video due to the blur domain gap between training and testing data. To address the issue, we propose a self-supervised method for handheld video deblurring, which is driven by sharp clues in the video. First, to train the deblurring model, we extract the sharp clues from the video and take them as misalignment labels of neighboring blurry frames. Second, to improve the deblurring ability of the model, we propose a novel Self-Enhanced Video Deblurring (SEVD) method to create higher-quality paired video data. Third, we propose a Self-Constrained Spatial Consistency Maintenance (SCSCM) method to regularize the model, preventing position shifts between the output and input frames. Moreover, we construct synthetic and real-world handheld video datasets for handheld video deblurring. Extensive experiments on these and other common real-world datasets demonstrate that our method significantly outperforms existing self-supervised ones. The code and datasets will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37324", "url": null, "sourceid": 32213, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37325, "uid": "4b1e14f32e85dc7b48a2ef9bb1cac0a4", "name": "Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling", "authors": [{"id": 180500, "fullname": "Euisoo Jung", "url": "http://cvpr.thecvf.com/api/miniconf/users/180500?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 187165, "fullname": "Byunghyun Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/187165?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 187166, "fullname": "Hyunjin Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/187166?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 187167, "fullname": "Seonghye Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/187167?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 131183, "fullname": "Jae-Gil Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/131183?format=json", "institution": "Korea Advanced Institute of Science and Technology"}], "abstract": "Diffusion models have achieved remarkable progress in high-fidelity image, video, and audio generation, yet inference remains computationally expensive. Nevertheless, current diffusion acceleration methods based on distributed parallelism suffer from noticeable generation artifacts and fail to achieve substantial acceleration proportional to the number of GPUs. Therefore, we propose a hybrid parallelism framework that combines a novel data parallel strategy, condition-based partitioning, with an optimal pipeline scheduling method, adaptive parallelism switching, to reduce generation latency and achieve high generation quality in conditional diffusion models. The key ideas are to (i) leverage the conditional and unconditional denoising paths as a new data-partitioning perspective and (ii) adaptively enable optimal pipeline parallelism according to the denoising discrepancy between these two paths. Our framework achieves $2.31\\times$ and $2.07\\times$ latency reductions on SDXL and SD3, respectively, using two NVIDIA RTX~3090 GPUs, while preserving image quality. This result confirms the generality of our approach across U-Net-based diffusion models and DiT-based flow-matching architectures. Our approach also outperforms existing methods in acceleration under high-resolution synthesis settings. Code is available at https://anonymous.4open.science/r/hybrid-diffusion/.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37325", "url": null, "sourceid": 34152, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37326, "uid": "bc1f015249c3cf61633174c93943648b", "name": "FAAR: Efficient Frequency-Aware Multi-Task Fine-Tuning via Automatic Rank Selection", "authors": [{"id": 174605, "fullname": "Maxime Fontana", "url": "http://cvpr.thecvf.com/api/miniconf/users/174605?format=json", "institution": "King&#x27;s College London"}, {"id": 137990, "fullname": "Michael Spratling", "url": "http://cvpr.thecvf.com/api/miniconf/users/137990?format=json", "institution": "University of Luxembourg and King&#x27;s College London"}, {"id": 126766, "fullname": "Miaojing Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/126766?format=json", "institution": "King&#x27;s College London"}], "abstract": "Adapting models pre-trained on large-scale datasets is a proven way to reach strong performance quickly for downstream tasks. However, the constant growth of state-of-the-art models makes traditional full fine-tuning unsuitable and difficult, especially for multi-task learning (MTL) where cost scales with the number of tasks. As a result, recent studies investigate parameter-efficient fine-tuning (PEFT) using low-rank adaptation to significantly reduce the number of trainable parameters. However, these existing methods use a single, fixed rank, which may not be optimal for different tasks or positions in the MTL architecture. Moreover, these methods fail to learn spatial information that captures inter-task relationships and helps to improve diverse task predictions. This paper introduces Frequency-Aware and Automatic Rank (FAAR) for efficient MTL fine-tuning. Our method introduces Performance-Driven Rank Shrinking (PDRS) to allocate the optimal rank per adapter location and per task. Moreover, by analyzing the image frequency spectrum, FAAR proposes a Task-Spectral Pyramidal Decoder (TS-PD) that injects input-specific context into spatial bias learning to better reflect cross-task relationships. Experiments performed on dense visual task benchmarks show the superiority of our method in terms of both accuracy and efficiency compared to other PEFT methods in MTL. FAAR reduces the number of parameters by up to 10.3 times compared to traditional MTL fine-tuning whilst boosting performance on all tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37326", "url": null, "sourceid": 33325, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37328, "uid": "c24de509a2932ca70c2474f44802a76e", "name": "Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models", "authors": [{"id": 130671, "fullname": "Dailan He", "url": "http://cvpr.thecvf.com/api/miniconf/users/130671?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 187172, "fullname": "Guanlin Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187172?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 130712, "fullname": "Xingtong Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/130712?format=json", "institution": "Beijing Institute of Technology"}, {"id": 187173, "fullname": "Yazhe Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187173?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 88965, "fullname": "Yi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88965?format=json", "institution": "VIVIX.AI"}, {"id": 187174, "fullname": "Bingqi Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/187174?format=json", "institution": "Sensetime Group Limmited"}, {"id": 153375, "fullname": "Guanglu Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/153375?format=json", "institution": "Sensetime"}, {"id": 86566, "fullname": "Yu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86566?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 86526, "fullname": "Hongsheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86526?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "Group Relative Policy Optimization (GRPO) has shown promise in aligning image and video generative models with human preferences. However, applying it to modern flow matching models is challenging because of its deterministic sampling paradigm. Current methods address this issue by converting Ordinary Differential Equations (ODEs) to Stochastic Differential Equations (SDEs), which introduce stochasticity. However, this SDE-based GRPO suffers from issues of inefficient credit assignment and incompatibility with high-order solvers for fewer-step sampling. In this paper, we first reinterpret existing SDE-based GRPO methods from a distance optimization perspective, revealing their underlying mechanism as a form of contrastive learning. Based on this insight, we propose Neighbor GRPO, a novel alignment algorithm that completely bypasses the need for SDEs. Neighbor GRPO generates a diverse set of candidate trajectories by perturbing the initial noise conditions of the ODE and optimizes the model using a softmax distance-based surrogate leaping policy. We establish a theoretical connection between this distance-based objective and policy gradient optimization, rigorously integrating our approach into the GRPO framework. Our method fully preserves the advantages of deterministic ODE sampling, including efficiency and compatibility with high-order solvers. We further introduce symmetric anchor sampling for computational efficiency and group-wise quasi-norm reweighting to address reward flattening. Extensive experiments demonstrate that Neighbor GRPO significantly outperforms SDE-based counterparts in terms of training cost, convergence speed, and generation quality.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37328", "url": null, "sourceid": 35359, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37329, "uid": "b5faafe944dadb86aee82ebe006a3e0f", "name": "Residual Primitive Fitting of 3D Shapes with SuperFrusta", "authors": [{"id": 152922, "fullname": "Aditya Ganeshan", "url": "http://cvpr.thecvf.com/api/miniconf/users/152922?format=json", "institution": "Brown University"}, {"id": 106544, "fullname": "Matheus Gadelha", "url": "http://cvpr.thecvf.com/api/miniconf/users/106544?format=json", "institution": "Adobe Systems"}, {"id": 126403, "fullname": "Thibault Groueix", "url": "http://cvpr.thecvf.com/api/miniconf/users/126403?format=json", "institution": "Adobe Systems"}, {"id": 89985, "fullname": "Zhiqin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89985?format=json", "institution": "Simon Fraser University"}, {"id": 85692, "fullname": "Siddhartha Chaudhuri", "url": "http://cvpr.thecvf.com/api/miniconf/users/85692?format=json", "institution": "Adobe Systems"}, {"id": 86499, "fullname": "Vladimir G. Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/86499?format=json", "institution": "Adobe Systems"}, {"id": 159462, "fullname": "Wang Yifan", "url": "http://cvpr.thecvf.com/api/miniconf/users/159462?format=json", "institution": "Adobe Systems"}, {"id": 84678, "fullname": "Daniel Ritchie", "url": "http://cvpr.thecvf.com/api/miniconf/users/84678?format=json", "institution": "Brown University"}], "abstract": "We introduce a framework for converting 3D shapes into compact and editable assemblies of analytic primitives, directly addressing the persistent trade-off between reconstruction fidelity and parsimony. Our approach combines two key contributions: a novel primitive, termed SuperFrustum, and an iterative inference algorithm, Residual Primitive Fitting (ResFit). SuperFrustum is a analytical primitive that is simultaneously (1) expressive, being able to express various common solids such as cylinders, spheres, cones & their tapered and bent forms, (2) editable, being compactly parameterized with 8 parameters, and (3) optimizable, with a sign distance field differentiable w.r.t. its parameters almost everywhere. ResFit is an unsupervised procedure that interleaves global shape analysis with local optimization, iteratively fitting primitives to the unexplained residual of a shape to discover a parsimonious yet accurate decompositions for each input shape. On diverse 3D benchmarks, our method achieves state-of-the-art results, improving IoU by over 9 points while using nearly half as many primitives as prior work. The resulting assemblies bridge the gap between dense 3D data and human-controllable design, producing high-fidelity and editable shape programs.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37329", "url": null, "sourceid": 42799, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40293?format=json"], "related_events_ids": [40293]}, {"id": 40293, "uid": "b5faafe944dadb86aee82ebe006a3e0f", "name": "Residual Primitive Fitting of 3D Shapes with SuperFrusta", "authors": [{"id": 152922, "fullname": "Aditya Ganeshan", "url": "http://cvpr.thecvf.com/api/miniconf/users/152922?format=json", "institution": "Brown University"}, {"id": 106544, "fullname": "Matheus Gadelha", "url": "http://cvpr.thecvf.com/api/miniconf/users/106544?format=json", "institution": "Adobe Systems"}, {"id": 126403, "fullname": "Thibault Groueix", "url": "http://cvpr.thecvf.com/api/miniconf/users/126403?format=json", "institution": "Adobe Systems"}, {"id": 89985, "fullname": "Zhiqin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89985?format=json", "institution": "Simon Fraser University"}, {"id": 85692, "fullname": "Siddhartha Chaudhuri", "url": "http://cvpr.thecvf.com/api/miniconf/users/85692?format=json", "institution": "Adobe Systems"}, {"id": 86499, "fullname": "Vladimir G. Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/86499?format=json", "institution": "Adobe Systems"}, {"id": 159462, "fullname": "Wang Yifan", "url": "http://cvpr.thecvf.com/api/miniconf/users/159462?format=json", "institution": "Adobe Systems"}, {"id": 84678, "fullname": "Daniel Ritchie", "url": "http://cvpr.thecvf.com/api/miniconf/users/84678?format=json", "institution": "Brown University"}], "abstract": "We introduce a framework for converting 3D shapes into compact and editable assemblies of analytic primitives, directly addressing the persistent trade-off between reconstruction fidelity and parsimony. Our approach combines two key contributions: a novel primitive, termed SuperFrustum, and an iterative inference algorithm, Residual Primitive Fitting (ResFit). SuperFrustum is a analytical primitive that is simultaneously (1) expressive, being able to express various common solids such as cylinders, spheres, cones & their tapered and bent forms, (2) editable, being compactly parameterized with 8 parameters, and (3) optimizable, with a sign distance field differentiable w.r.t. its parameters almost everywhere. ResFit is an unsupervised procedure that interleaves global shape analysis with local optimization, iteratively fitting primitives to the unexplained residual of a shape to discover a parsimonious yet accurate decompositions for each input shape. On diverse 3D benchmarks, our method achieves state-of-the-art results, improving IoU by over 9 points while using nearly half as many primitives as prior work. The resulting assemblies bridge the gap between dense 3D data and human-controllable design, producing high-fidelity and editable shape programs.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40293", "url": "https://bardofcodes.github.io/superfit/", "sourceid": -42799, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37329?format=json"], "related_events_ids": [37329]}, {"id": 37332, "uid": "696fb322493073d42be4fafa2a7b6f77", "name": "Erasing Invisible Watermarks via Novel View Synthesis", "authors": [{"id": 153189, "fullname": "Fahad Shamshad", "url": "http://cvpr.thecvf.com/api/miniconf/users/153189?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 187184, "fullname": "Nils Lukas", "url": "http://cvpr.thecvf.com/api/miniconf/users/187184?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 153190, "fullname": "Karthik Nandakumar", "url": "http://cvpr.thecvf.com/api/miniconf/users/153190?format=json", "institution": "Michigan State University"}], "abstract": "Invisible watermarking has become a critical mechanism for authenticating AI-generated image content, with major platforms deploying watermarking schemes at scale. However, evaluating the vulnerability of these schemes against sophisticated removal attacks remains essential to assess their reliability and guide robust design. In this work, we expose a fundamental vulnerability in invisible watermarks by reformulating watermark removal as a view synthesis problem. Our key insight is that generating a perceptually consistent alternative ``view\" of the same semantic content, akin to re-observing a scene from a shifted perspective, naturally removes the embedded watermark while preserving visual fidelity. This reveals a critical gap: watermarks robust to pixel-space and frequency-domain attacks remain vulnerable to semantic-preserving viewpoint transformations. We introduce a zero-shot diffusion-based framework that applies controlled geometric transformations in latent space, augmented with view-guided correspondence attention to maintain structural consistency during reconstruction. Operating on frozen pre-trained models without detector access or watermark knowledge, our method achieves state-of-the-art watermark suppression across 15 watermarking methods--outperforming 14 baseline attacks while maintaining superior perceptual quality across multiple datasets.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37332", "url": null, "sourceid": 42518, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40294?format=json"], "related_events_ids": [40294]}, {"id": 40294, "uid": "696fb322493073d42be4fafa2a7b6f77", "name": "Erasing Invisible Watermarks via Novel View Synthesis", "authors": [{"id": 153189, "fullname": "Fahad Shamshad", "url": "http://cvpr.thecvf.com/api/miniconf/users/153189?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 187184, "fullname": "Nils Lukas", "url": "http://cvpr.thecvf.com/api/miniconf/users/187184?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 153190, "fullname": "Karthik Nandakumar", "url": "http://cvpr.thecvf.com/api/miniconf/users/153190?format=json", "institution": "Michigan State University"}], "abstract": "Invisible watermarking has become a critical mechanism for authenticating AI-generated image content, with major platforms deploying watermarking schemes at scale. However, evaluating the vulnerability of these schemes against sophisticated removal attacks remains essential to assess their reliability and guide robust design. In this work, we expose a fundamental vulnerability in invisible watermarks by reformulating watermark removal as a view synthesis problem. Our key insight is that generating a perceptually consistent alternative ``view\" of the same semantic content, akin to re-observing a scene from a shifted perspective, naturally removes the embedded watermark while preserving visual fidelity. This reveals a critical gap: watermarks robust to pixel-space and frequency-domain attacks remain vulnerable to semantic-preserving viewpoint transformations. We introduce a zero-shot diffusion-based framework that applies controlled geometric transformations in latent space, augmented with view-guided correspondence attention to maintain structural consistency during reconstruction. Operating on frozen pre-trained models without detector access or watermark knowledge, our method achieves state-of-the-art watermark suppression across 15 watermarking methods--outperforming 14 baseline attacks while maintaining superior perceptual quality across multiple datasets.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40294", "url": null, "sourceid": -42518, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37332?format=json"], "related_events_ids": [37332]}, {"id": 37337, "uid": "63b2bcd96a3981153de8111edeeca175", "name": "DreamSR: Towards Ultra-High-Resolution Image Super-Resolution via a Receptive-Field Enhanced Diffusion Transformer", "authors": [{"id": 180056, "fullname": "Qingji Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/180056?format=json", "institution": "Bytedance Inc."}, {"id": 184676, "fullname": "Hang Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/184676?format=json", "institution": "ByteDance"}, {"id": 177720, "fullname": "Mingqin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/177720?format=json", "institution": "ByteDance Inc."}, {"id": 184677, "fullname": "Rui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184677?format=json", "institution": "ByteDance Inc."}, {"id": 87905, "fullname": "Yitong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87905?format=json", "institution": "ByteDance Inc"}], "abstract": "Large-scale pre-trained diffusion models have been extensively adopted for real-world image Super-Resolution because of their powerful generative priors through textual guidance. However, when super-resolving high-resolution images with patch-wise inference strategy, most existing diffusion-based SR methods tend to suffer from over-generation, due to the misalignment between the global prompt from LR image and the incomplete semantic information of local patches during each inference step. On the other hand, most existing methods also failed to generate detailed texture in local patches due to the overemphasis on global generation capabilities in network designs and training strategies.To address this issue, we present DreamSR, a novel SR model that suppresses local over-generation and improves fine-detail synthesis, thereby achieving visually faithful results with ultra-high-quality details. Specifically, we propose a dual-branch MM-ControlNet, where the ControlNet generates local textual feature with patch-level prompts while the pre-trained DiT provides global textual feature with global prompts, thereby mitigating over-generation and ensuring semantic consistency across patches.We also design a comprehensive training strategy with stage-specific data processing pipelines and a Receptive-Field Enhancement strategy, enhancing the model\u2019s capability to capture patch information and effectively restore local textures.Extensive experiments demonstrate that DreamSR outperforms state-of-the-art methods, providing high-quality SR results.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37337", "url": null, "sourceid": 40665, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37336, "uid": "ff63a1e61dc77ef6d2ac4aed28de3ccc", "name": "White-Balance First, Adjust Later: Cross-Camera Color Constancy via Vision-Language Evaluation", "authors": [{"id": 126275, "fullname": "Shuwei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/126275?format=json", "institution": "national university of singaore, National University of Singapore"}, {"id": 152911, "fullname": "Lei Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/152911?format=json", "institution": "National University of Singapore"}, {"id": 85616, "fullname": "Robby T. Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/85616?format=json", "institution": "National University of Singapore"}], "abstract": "Color constancy aims to keep object colors consistent under varying illumination. Cross-camera generalization in color constancy remains challenging because learning-based models often overfit to the color response characteristics of the training camera, resulting in degraded performance on images captured by other cameras.We propose VLM-CC, a vision-language model (VLM)-guided framework that formulates color constancy as an iterative refinement process.Instead of directly estimating the illuminant from raw input, VLM-CC performs iterative correction driven by VLM-based evaluation.At each iteration, the image is white-balanced using the current estimate and converted to pseudo-sRGB. A lightweight LoRA-tuned VLM then assesses the corrected image, identifying the dominant residual color cast and providing qualitative feedback.This feedback is mapped to a residual illumination direction (red, green, or blue) and used to update the illuminant estimate until convergence.Our key idea is to reframe color constancy as an iterative perceptual feedback problem, leveraging VLM evaluation instead of direct RGB regression.By replacing direct RGB estimation with VLM-guided perceptual feedback, VLM-CC achieves state-of-the-art robustness in cross-camera color constancy across multiple datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37336", "url": null, "sourceid": 41765, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37341, "uid": "1295bf698192e4a35ead87d31113fb06", "name": "Leveraging Multispectral Sensors for Color Correction in Mobile Cameras", "authors": [{"id": 179945, "fullname": "Luca Cogo", "url": "http://cvpr.thecvf.com/api/miniconf/users/179945?format=json", "institution": "University of Milano-Bicocca"}, {"id": 187201, "fullname": "Marco Buzzelli", "url": "http://cvpr.thecvf.com/api/miniconf/users/187201?format=json", "institution": "University of Milano-Bicocca"}, {"id": 187202, "fullname": "Simone Bianco", "url": "http://cvpr.thecvf.com/api/miniconf/users/187202?format=json", "institution": "University of Milan-Bicocca"}, {"id": 92830, "fullname": "Javier Vazquez-Corral", "url": "http://cvpr.thecvf.com/api/miniconf/users/92830?format=json", "institution": "Computer Vision Center / Autonomous University of Barcelona"}, {"id": 187203, "fullname": "Raimondo Schettini", "url": "http://cvpr.thecvf.com/api/miniconf/users/187203?format=json", "institution": "University of Milan - Bicocca"}], "abstract": "Recent advances in snapshot multispectral (MS) imaging have enabled compact, low-cost spectral sensors for consumer and mobile devices. By capturing richer spectral information than conventional RGB sensors, these systems can enhance key imaging tasks, including color correction. However, most existing methods treat the color correction pipeline in separate stages, often discarding MS data early in the process. We propose a unified, learning-based framework that (i) performs end-to-end color correction and (ii) jointly leverages data from a high-resolution RGB sensor and an auxiliary low-resolution MS sensor. Our approach integrates the full pipeline within a single model, producing coherent and color-accurate outputs. We demonstrate the flexibility and generality of our framework by refactoring two different state-of-the-art image-to-image architectures. To support training and evaluation, we construct a dedicated dataset by aggregating and repurposing publicly available spectral datasets, rendering under multiple RGB camera sensitivities. Extensive experiments show that our approach improves color accuracy and stability, reducing error by up to 50\\% compared to RGB-only and MS-driven baselines. Datasets, code, and models will be made available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37341", "url": null, "sourceid": 39942, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37343, "uid": "2a155efbdf7faa8784ee3b922e9062c0", "name": "Streamlined Open-Vocabulary Human-Object Interaction Detection", "authors": [{"id": 175429, "fullname": "Chang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/175429?format=json", "institution": "South China University of Technology"}, {"id": 187206, "fullname": "LiaoDongliang LiaoDongliang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187206?format=json", "institution": "South China University of Technology"}, {"id": 88971, "fullname": "Changxing Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/88971?format=json", "institution": "South China University of Technology"}], "abstract": "Open-vocabulary human-object interaction (HOI) detection aims to localize and recognize all human-object interactions in an image, including those unseen during training.Existing approaches usually rely on the collaboration between a conventional HOI detector and a Vision-Language Model (VLM) to recognize unseen HOI categories.However, feature fusion in this paradigm is challenging due to significant gaps in cross-model representations.To address this issue, we introduce **SL-HOI**, a **S**tream**L**ined open-vocabulary **HOI** detection framework based solely on the powerful DINOv3 model.Our design leverages the complementary strengths of DINOv3's components: its backbone for fine-grained localization and its text-aligned vision head for open-vocabulary interaction classification.Moreover, to facilitate smooth cross-attention between the interaction queries and the vision head's output, we propose first feeding both the interaction queries and the backbone patch tokens into the vision head, effectively bridging their representation gaps. All DINOv3 parameters in our approach are frozen, with only a small number of learnable parameters added, allowing a fast adaptation to the HOI detection task.Extensive experiments show that SL-HOI achieves state-of-the-art performance on both the SWiG-HOI and HICO-DET benchmarks, demonstrating the effectiveness of our streamlined model architecture.The code of this work will be released soon.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37343", "url": null, "sourceid": 36221, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37347, "uid": "4a551692a9f6dd50294ebb1c41b92e33", "name": "Explicit Recovery Behaivor for Diffusion Policies", "authors": [{"id": 182026, "fullname": "zundong Ke", "url": "http://cvpr.thecvf.com/api/miniconf/users/182026?format=json", "institution": "ShanghaiTech University"}, {"id": 187221, "fullname": "Junlin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187221?format=json", "institution": "ShanghaiTech University"}, {"id": 187222, "fullname": "Jiayi Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187222?format=json", "institution": "ShanghaiTech University"}, {"id": 187223, "fullname": "Kuanhao Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/187223?format=json", "institution": "ShanghaiTech University"}, {"id": 154716, "fullname": "Jiayuan Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154716?format=json", "institution": "ShanghaiTech University"}, {"id": 175153, "fullname": "boyi zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/175153?format=json", "institution": "ShanghaiTech University"}], "abstract": "Diffusion policies have emerged as a powerful paradigm for robot learning, but their inherent multi-modality can lead to a diverse set of plausible\u2014though not always optimal\u2014actions from a single observation. We posit that for a given task, an optimal action exists within this distribution. Inspired by negative prompting in generative models, we introduce a novel method that leverages an error detector to identify out-of-distribution (OOD) execution histories and uses them to construct negative action prompts. This allows our policy to steer away from suboptimal behaviors and converge towards higher-performance actions. We present a comprehensive ablation study demonstrating the effectiveness of positive, and negative prompts, and validate our approach on a suite of simulated benchmarks and real-world robotic tasks. Our results show that the proposed Negative-Prompt-guided Diffusion Policy achieves significant improvement in task performance by effectively filtering undesirable action modes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37347", "url": null, "sourceid": 43216, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37349, "uid": "f6bb36049c3b42a62dfa88c46e0f79d2", "name": "PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing", "authors": [{"id": 178729, "fullname": "Antonio Oroz", "url": "http://cvpr.thecvf.com/api/miniconf/users/178729?format=json", "institution": "Technical University of Munich"}, {"id": 75895, "fullname": "Matthias Nie\u00dfner", "url": "http://cvpr.thecvf.com/api/miniconf/users/75895?format=json", "institution": "Technical University of Munich"}, {"id": 84643, "fullname": "Tobias Kirschstein", "url": "http://cvpr.thecvf.com/api/miniconf/users/84643?format=json", "institution": "Department of Informatics, Technische Universit\u00e4t M\u00fcnchen"}], "abstract": "We present PercHead, a model for single-image 3D head reconstruction and disentangled 3D editing - two tasks that are inherently challenging due to ambiguity in plausible explanations for the same input. At the heart of our approach lies our novel perceptual loss based on DINOv2 and SAM 2.1. Unlike widely-adopted low-level losses like LPIPS, SSIM or L1, we rely on deep visual understanding of images and the resulting generalized supervision signals. We show that our new loss can be a drop-in replacement for standard losses and used to improve visual quality in high-frequency areas. We base our model architecture on Vision Transformers (ViTs), allowing us to decouple the 3D representation from the 2D input. We train our method on multi-view images for view-consistency and in-the-wild images for strong transferability to new environments. Our model achieves state-of-the-art performance in novel-view synthesis and, furthermore, exhibits exceptional robustness to extreme viewing angles. We also extend our base model to disentangled 3D editing by swapping the encoder and fine-tuning the network. A segmentation map controls geometry and either a text prompt or a reference image specifies appearance. We highlight the intuitive and powerful 3D editing capabilities through an interactive GUI.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37349", "url": null, "sourceid": 31993, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37350, "uid": "3a93c1c593e624d2d0a6ea4c55e9cfd2", "name": "Agentic Retoucher for Text-To-Image Generation", "authors": [{"id": 174465, "fullname": "Shaocheng Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/174465?format=json", "institution": "Shanghai JiaoTong University"}, {"id": 142395, "fullname": "Jianfeng Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/142395?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 154581, "fullname": "Chunlei Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/154581?format=json", "institution": "Bilibili Inc"}, {"id": 187225, "fullname": "Cong Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187225?format=json", "institution": "JIUTIAN Research"}, {"id": 154576, "fullname": "Huiyu Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/154576?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 90103, "fullname": "Xiaoyun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90103?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 130391, "fullname": "Qiang Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130391?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 86659, "fullname": "Guangtao Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/86659?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Text-to-image (T2I) diffusion models such as SDXL and FLUX have achieved impressive photorealism, yet small-scale distortions remain pervasive in limbs, face, text and so on. Existing refinement approaches either perform costly iterative re-generation or rely on vision\u2013language models (VLMs) with weak spatial grounding, leading to semantic drift and unreliable local edits. To close this gap, we propose **Agentic Retoucher**, a hierarchical decision-driven framework that reformulates post-generation correction as a human-like *perception\u2013reasoning\u2013action* loop.Specifically, we design (1) a **perception agent** that learns contextual saliency for fine-grained distortion localization under text\u2013image consistency cues, (2) a **reasoning agent** that performs human-aligned inferential diagnosis via progressive preference alignment, and (3) an **action agent** that adaptively plans localized inpainting guided by user preference. This design integrates perceptual evidence, linguistic reasoning, and controllable correction into a unified, self-corrective decision process. To enable fine-grained supervision and quantitative evaluation, we further construct **GenBlemish-27K**, a dataset of 6K T2I images with 27K annotated artifact regions across 12 categories.Extensive experiments demonstrate that Agentic Retoucher consistently outperforms state-of-the-art methods in perceptual quality, distortion localization and human preference alignment, establishing a new paradigm for self-corrective and perceptually reliable T2I generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37350", "url": null, "sourceid": 35320, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37355, "uid": "fb387072ad549c098928fcb1e0011533", "name": "Adaptive Capacity Autoregressive Visual Tracking", "authors": [{"id": 176475, "fullname": "Tong Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/176475?format=json", "institution": "Xi\u2019an Jiaotong University"}, {"id": 187241, "fullname": "Yifan Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/187241?format=json", "institution": "Alibaba Group"}, {"id": 158140, "fullname": "Shiyi Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158140?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 187242, "fullname": "Ruigang Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187242?format=json", "institution": "DAMO, Alibaba Group"}, {"id": 76677, "fullname": "Xing Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/76677?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "We present \\textbf{ARTrack-AC}, a new step in the autoregressive tracking paradigm that introduces adaptive capacity inference to achieve both temporal consistency and dynamic efficiency. While existing autoregressive trackers predict object states sequentially with fixed inference capacity, they fail to accommodate the fluctuating temporal difficulty of real videos. ARTrack-AC addresses this limitation by equipping the tracker with the ability to \\textbf{modulate its inference capacity over time}. A diffusion-based difficulty estimator anticipates the stability of upcoming segments, guiding a controller to switch between an \\textbf{accurate} (high-capacity) and an \\textbf{efficient} (low-capacity) mode while maintaining autoregressive consistency. This system-level autoregression extends conventional sequence modeling beyond \u201cwhat to predict\u201d toward \u201chow to predict,\u201d forming a self-regulated tracking process that aligns inference cost with temporal complexity. Despite its simplicity, ARTrack-AC achieves state-of-the-art accuracy\u2013speed trade-off on major benchmarks\u201466.7\\% AUC on LaSOT and 47.5\\% AUC on LaSOText\u2014running 2.9$\\times$ faster than its predecessor.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37355", "url": null, "sourceid": 38644, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37356, "uid": "c1014dccbadfa3bd223e055e26e65527", "name": "GuardTrace-VL: Detecting Unsafe Multimodel Reasoning via Iterative Safety Supervision", "authors": [{"id": 175249, "fullname": "Yuxiao Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/175249?format=json", "institution": "University of Science &amp; Technology of China"}, {"id": 187243, "fullname": "Junchi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187243?format=json", "institution": "Alibaba Group; University of Science and Technology of China"}, {"id": 187244, "fullname": "Zhenchao Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/187244?format=json", "institution": "University of Hong Kong"}, {"id": 183238, "fullname": "Changtao Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/183238?format=json", "institution": "Ant Group"}, {"id": 187245, "fullname": "Haojie Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187245?format=json", "institution": "University of Science and Technology of China"}, {"id": 131736, "fullname": "Qi Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131736?format=json", "institution": "University of Science and Technology of China"}, {"id": 131741, "fullname": "Tao Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/131741?format=json", "institution": "University of Science and Technology of China"}, {"id": 90580, "fullname": "Nenghai Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90580?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Multimodal large reasoning models (MLRMs) are increasingly deployed for vision-language tasks that produce explicit intermediate rationales. However, reasoning traces can contain unsafe content even when the final answer is non-harmful, creating deployment risks. Existing multimodal safety guards primarily evaluate only the input question and the final answer, neglecting the intermediate reasoning process. This oversight allows undetected harm, such as biased inferences or policy-violating use of visual context, to emerge during reasoning. We introduce GuardTrace-VL, a vision-aware safety auditor that monitors the full Question\u2013Thinking\u2013Answer (QTA) pipeline via joint image\u2013text analysis, enabling detection of unsafe content as it emerges in the reasoning stage. To support training and evaluation, we construct the GuardTrace dataset, which is generated through diverse prompting strategies and refined via a MLRM- and human-based voting and verification pipeline. Furthermore, we propose a three-stage progressive training scheme combined with the data refinement process, enabling the model to learn nuanced and context-dependent safety preferences according to different risk levels. On our proposed test set covering both in-domain and out-of-domain scenarios, GuardTrace-VL model achieves an F1 score of 93.1\\% on unsafe reasoning detection tasks, representing a 13.5\\% improvement in F1 score compared to the previous strongest multimodal safety defense methods.The codes will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37356", "url": null, "sourceid": 33348, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37357, "uid": "ab330135f213dedd5653800ce7703d35", "name": "Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation", "authors": [{"id": 183752, "fullname": "Joonhyung Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/183752?format=json", "institution": "KAIST"}, {"id": 179986, "fullname": "Hyeongwon Jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/179986?format=json", "institution": "KAIST"}, {"id": 156325, "fullname": "Joowon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/156325?format=json", "institution": "Korea Advanced Institute of Science and Technology (KAIST)"}, {"id": 85226, "fullname": "Eunho Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85226?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}], "abstract": "Recent visual autoregressive (AR) models have shown promising capabilities in text-to-image generation, operating in a manner similar to large language models. While test-time computation scaling has brought remarkable success in enabling reasoning-enhanced outputs for challenging natural language tasks, its adaptation to visual AR models remains unexplored and poses unique challenges. Naively applying test-time scaling strategies such as Best-of-$N$ can be suboptimal: they consume full-length computation on erroneous generation trajectories, while the raster-scan decoding scheme lacks a blueprint of the entire canvas, limiting scaling benefits as only a few prompt-aligned candidates are generated. To address these, we introduce GridAR, a test-time scaling framework designed to elicit the best possible results from visual AR models. GridAR employs a grid-partitioned progressive generation scheme in which multiple partial candidates for the same position are generated within a canvas, infeasible ones are pruned early, and viable ones are fixed as anchors to guide subsequent decoding. Coupled with this, we present a layout-specified prompt reformulation strategy that inspects partial views to infer a feasible layout for satisfying the prompt. The reformulated prompt then guides subsequent image generation to mitigate the blueprint deficiency. Together, GridAR achieves higher-quality results under limited test-time scaling: with $N{=}4$, it even outperforms Best-of-$N$ ($N{=}8$) by 14.4% on T2I-CompBench++ while reducing cost by 25.6%. It also generalizes to autoregressive image editing, showing comparable edit quality and a 13.9% gain in semantic preservation on PIE-Bench over larger-$N$ baselines. The source code will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37357", "url": null, "sourceid": 31560, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37359, "uid": "ecc3ebf4a557b4a5f1f8655ccd8897f3", "name": "Multi-speaker Attention Alignment for Multimodal Social Interaction", "authors": [{"id": 159927, "fullname": "LIANGYANG OUYANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/159927?format=json", "institution": "The University of Tokyo"}, {"id": 187251, "fullname": "Yifei Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187251?format=json", "institution": "The University of Tokyo"}, {"id": 76883, "fullname": "Mingfang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76883?format=json", "institution": "The University of Tokyo"}, {"id": 77362, "fullname": "Caixin Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77362?format=json", "institution": "The University of Tokyo"}, {"id": 187252, "fullname": "Ryosuke Furuta", "url": "http://cvpr.thecvf.com/api/miniconf/users/187252?format=json", "institution": "Mantra Inc."}, {"id": 69170, "fullname": "Yoichi Sato", "url": "http://cvpr.thecvf.com/api/miniconf/users/69170?format=json", "institution": "University of Tokyo"}], "abstract": "Understanding social interaction in video requires reasoning over a dynamic interplay of verbal and non-verbal cues: who is speaking, to whom, and with what gaze or gestures.While Multimodal Large Language Models (MLLMs) are natural candidates, simply adding visual inputs yields surprisingly inconsistent gains on social tasks. Our quantitative analysis of cross-modal attention inside state-of-the-art MLLMs reveals a core failure mode: in multi-speaker scenes, visual and textual tokens lack speaker-consistent alignment, exhibiting substantially weaker cross-modal attention than in object-centric images.To address this, we propose a multimodal multi-speaker attention alignment method that can be integrated into existing MLLMs. First, we introduce dynamic cross-modal head selection to identify attention heads most responsible for grounding. Then, an adaptive social-aware attention bias, computed from existing attention patterns and speaker locations, is injected into the attention mechanism. This bias reinforces alignment between a speaker\u2019s visual representation and their utterances without introducing trainable parameters or architectural changes.We integrate our method into three distinct MLLMs (LLaVA-NeXT-Video, Qwen2.5-VL, and InternVL3) and evaluate on three benchmarks (TVQA+, MMSI, OnlineMMSI). Across four social tasks, results demonstrate that our approach improves the ability of MLLMs and achieves state-of-the-art results.Attention visualizations confirm our method successfully focuses the model on speaker-relevant regions, enabling more robust multi-party social reasoning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37359", "url": null, "sourceid": 33312, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37361, "uid": "64efc780f9e9d573f623c9c0718a7b9a", "name": "ALLNet: Multi-task Dense Prediction for Degraded Images", "authors": [{"id": 154437, "fullname": "Weiran Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154437?format=json", "institution": "Xidian University"}, {"id": 182632, "fullname": "Jialing Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182632?format=json", "institution": "Xidian University"}, {"id": 187256, "fullname": "Yaqi Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187256?format=json", "institution": "China Telecom"}, {"id": 154436, "fullname": "Gang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/154436?format=json", "institution": "Xidian University"}, {"id": 187257, "fullname": "Li Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187257?format=json", "institution": "Xidian University"}, {"id": 187258, "fullname": "Chang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187258?format=json", "institution": "Xidian University"}, {"id": 73148, "fullname": "Yunsong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/73148?format=json", "institution": "Xidian University"}], "abstract": "Multi-task dense prediction aims to simultaneously address multiple pixel-level tasks through a unified network for visual scene understanding. However, adverse environmental conditions limit the generalization and practicality of such tasks. To address this, we propose ALLNet, a novel framework that effectively explores degradation patterns and integrates multi-task collaborative information. Specifically, we design a MoE-based Mixture of Adaptive Experts (MaE) restoration component network that enhances degradation features through dynamic routing and guides task-specific feature extraction. Furthermore, we formulate a Task-aware Collaborative Refinement (TCR) module to capture global semantic correlations and cross-task dependencies, facilitating bidirectional collaboration between restoration and task-specific features on degraded images. To the best of our knowledge, this represents the first attempt at multi-task dense prediction under image degradation. Experimental results on degraded NYUD-v2 and PASCAL-Context benchmarks demonstrate that our architecture significantly outperforms existing methods in degraded scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37361", "url": null, "sourceid": 34815, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37365, "uid": "e1ad6ffc6195076ffef9077d6f57e4ed", "name": "Velox: Learning Representations of 4D Geometry and Appearance", "authors": [{"id": 160630, "fullname": "Anagh Malik", "url": "http://cvpr.thecvf.com/api/miniconf/users/160630?format=json", "institution": "University of Toronto"}, {"id": 185200, "fullname": "Dorian Chan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185200?format=json", "institution": "Apple"}, {"id": 151171, "fullname": "Xiaoming Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/151171?format=json", "institution": "Apple MLR"}, {"id": 77223, "fullname": "David B. Lindell", "url": "http://cvpr.thecvf.com/api/miniconf/users/77223?format=json", "institution": "University of Toronto"}, {"id": 90013, "fullname": "Oncel Tuzel", "url": "http://cvpr.thecvf.com/api/miniconf/users/90013?format=json", "institution": "Apple"}, {"id": 167821, "fullname": "Rick Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/167821?format=json", "institution": "Apple"}], "abstract": "We introduce a framework for learning latent representations of 4D objects which are descriptive, faithfully capturing object geometry and appearance; compressive, aiding in downstream efficiency; and accessible, requiring minimal input, i.e., an unstructured dynamic point cloud, to construct. Specifically, Velox trains an encoder to compress spatiotemporal color point clouds into a set of *dynamic shape tokens*. These tokens are supervised using two complementary decoders: a 4D surface decoder, which models the time-varying surface distribution capturing the geometry; and a Gaussian decoder, which maps the tokens to 3D Gaussians, helping learn appearance.To demonstrate the utility of our representation, we evaluate it across three downstream tasks\u2014video-to-4D generation, 3D tracking, and cloth simulation via image-to-4D generation\u2014and observe strong performances in all settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37365", "url": null, "sourceid": 40746, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37370, "uid": "f3e900a7f5a48ea8b91bb7df49c6895c", "name": "Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout", "authors": [{"id": 131338, "fullname": "Hidir Yesiltepe", "url": "http://cvpr.thecvf.com/api/miniconf/users/131338?format=json", "institution": "Virginia Polytechnic Institute and State University"}, {"id": 132088, "fullname": "Tuna Han Salih Meral", "url": "http://cvpr.thecvf.com/api/miniconf/users/132088?format=json", "institution": "Virginia Tech"}, {"id": 187270, "fullname": "Kaan Akan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187270?format=json", "institution": null}, {"id": 181126, "fullname": "Kaan Oktay", "url": "http://cvpr.thecvf.com/api/miniconf/users/181126?format=json", "institution": "fal"}, {"id": 133116, "fullname": "Pinar Yanardag", "url": "http://cvpr.thecvf.com/api/miniconf/users/133116?format=json", "institution": "Virginia Tech"}], "abstract": "Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model's 3D Rotary Positional Embedding (3D-RoPE), (ii) slow prompt responsiveness in maintaining fine-grained action control during long-form rollouts, and (iii) the inability to realize discontinuous cinematic transitions within a single generation stream. We introduce $\\infty$-RoPE, a unified inference-time framework that addresses all three limitations through three interconnected components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame, where each newly generated latent block is rotated relative to the base model's maximum frame horizon while earlier blocks are rotated backward to preserve relative temporal geometry. This relativistic formulation eliminates fixed temporal positions, enabling continuous video generation far beyond the base positional limits. To obtain fine-grained action control without re-encoding, KV Flush renews the KV cache by retaining only two anchor tokens, the global sink and the last generated latent frame, thereby ensuring immediate prompt responsiveness. Finally, RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates, enabling multi-cut scene transitions within a single continuous rollout. Together, these components establish $\\infty$-RoPE as a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion. Comprehensive experiments show that $\\infty$-RoPE consistently surpasses previous autoregressive models in overall VBench scores while requiring only a fraction of their KV cache budget.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37370", "url": null, "sourceid": 45441, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37372, "uid": "a0e278e882f1eb24248fe91b3411e42a", "name": "One-Step Diffusion Transformer for Controllable Real-World Image Super-Resolution", "authors": [{"id": 187276, "fullname": "Yushun Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187276?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 187277, "fullname": "Yuxiang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187277?format=json", "institution": "Xiaohongshu"}, {"id": 187278, "fullname": "Shibo Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/187278?format=json", "institution": null}, {"id": 130391, "fullname": "Qiang Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130391?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 86992, "fullname": "Jiangchao Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86992?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 76651, "fullname": "Ya Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76651?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 90103, "fullname": "Xiaoyun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90103?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 126281, "fullname": "Yanfeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126281?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Recent advances in diffusion-based real-world image super-resolution (Real-ISR) have demonstrated remarkable perceptual quality, yet the balance between fidelity and controllability remains a problem: multi-step diffusion-based methods suffer from generative diversity and randomness, resulting in low fidelity, while one-step methods lose control flexibility due to fidelity-specific finetuning. In this paper, we present ODTSR, a one-step diffusion transformer based on Qwen-Image that performs Real-ISR considering fidelity and controllability simultaneously: a newly introduced visual stream receives low-quality images (LQ) with adjustable noise (Control Noise), and the original visual stream receives LQs with consistent noise (Prior Noise), forming the Noise-hybrid Visual Stream (NVS) design. ODTSR further employs Fidelity-aware Adversarial Training (FAA) to enhance controllability and achieve one-step inference. Extensive experiments demonstrate that ODTSR not only achieves state-of-the-art (SOTA) performance on generic Real-ISR, but also enables prompt controllability on challenging scenarios such as real-world scene text image super-resolution (STISR) of Chinese characters without training on specific datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37372", "url": null, "sourceid": 32125, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37373, "uid": "08d64daa10339a98d9d9fa9145b7cca8", "name": "RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video", "authors": [{"id": 155036, "fullname": "Haiyang Mei", "url": "http://cvpr.thecvf.com/api/miniconf/users/155036?format=json", "institution": "National University of Singapore"}, {"id": 183863, "fullname": "Qiming Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183863?format=json", "institution": "National University of Singapore"}, {"id": 89491, "fullname": "Hai Ci", "url": "http://cvpr.thecvf.com/api/miniconf/users/89491?format=json", "institution": "Beijing Institute for General Artificial Intelligence"}, {"id": 73557, "fullname": "Mike Zheng Shou", "url": "http://cvpr.thecvf.com/api/miniconf/users/73557?format=json", "institution": "National University of Singapore"}], "abstract": "Accurate robot segmentation is a fundamental capability for robotic perception. It enables precise construction of digital twins and world models for robotic applications, supports robot-centric data augmentation, and provides reliable cues for extracting robot actions and poses. Despite the strong capabilities of modern segmentation models, surprisingly it remains challenging to segment robots. This is due to robot embodiment diversity, appearance ambiguity, structural complexity, and rapid shape changes. Embracing these challenges, we introduce RobotSeg, a foundation model for robot segmentation in image and video. RobotSeg is built upon the versatile SAM 2 foundation model but addresses its three limitations for robot segmentation, namely the lack of adaptation to articulated robots, reliance on manual prompts, and the need for per-frame training mask annotations, by introducing a structure-enhanced memory associator, a robot prompt generator, and a label-efficient training strategy. These innovations collectively enable a structure-aware, automatic, and label-efficient solution. We further construct the video robot segmentation (VRS) dataset comprising over 2.8k videos (138k frames) with diverse robot embodiments and environments. Extensive experiments demonstrate that RobotSeg achieves state-of-the-art performance on both images and videos, establishing a strong foundation for future advances in robot perception.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37373", "url": null, "sourceid": 32944, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40297?format=json"], "related_events_ids": [40297]}, {"id": 40297, "uid": "08d64daa10339a98d9d9fa9145b7cca8", "name": "RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video", "authors": [{"id": 155036, "fullname": "Haiyang Mei", "url": "http://cvpr.thecvf.com/api/miniconf/users/155036?format=json", "institution": "National University of Singapore"}, {"id": 183863, "fullname": "Qiming Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183863?format=json", "institution": "National University of Singapore"}, {"id": 89491, "fullname": "Hai Ci", "url": "http://cvpr.thecvf.com/api/miniconf/users/89491?format=json", "institution": "Beijing Institute for General Artificial Intelligence"}, {"id": 73557, "fullname": "Mike Zheng Shou", "url": "http://cvpr.thecvf.com/api/miniconf/users/73557?format=json", "institution": "National University of Singapore"}], "abstract": "Accurate robot segmentation is a fundamental capability for robotic perception. It enables precise construction of digital twins and world models for robotic applications, supports robot-centric data augmentation, and provides reliable cues for extracting robot actions and poses. Despite the strong capabilities of modern segmentation models, surprisingly it remains challenging to segment robots. This is due to robot embodiment diversity, appearance ambiguity, structural complexity, and rapid shape changes. Embracing these challenges, we introduce RobotSeg, a foundation model for robot segmentation in image and video. RobotSeg is built upon the versatile SAM 2 foundation model but addresses its three limitations for robot segmentation, namely the lack of adaptation to articulated robots, reliance on manual prompts, and the need for per-frame training mask annotations, by introducing a structure-enhanced memory associator, a robot prompt generator, and a label-efficient training strategy. These innovations collectively enable a structure-aware, automatic, and label-efficient solution. We further construct the video robot segmentation (VRS) dataset comprising over 2.8k videos (138k frames) with diverse robot embodiments and environments. Extensive experiments demonstrate that RobotSeg achieves state-of-the-art performance on both images and videos, establishing a strong foundation for future advances in robot perception.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40297", "url": null, "sourceid": -32944, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37373?format=json"], "related_events_ids": [37373]}, {"id": 37374, "uid": "3196ea2ae5a340440a166eafdd183a33", "name": "Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes", "authors": [{"id": 129354, "fullname": "Jing Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/129354?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 87380, "fullname": "Zhaoyang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87380?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 187279, "fullname": "Yantao Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187279?format=json", "institution": "Amazon"}, {"id": 127609, "fullname": "Jiarui Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/127609?format=json", "institution": "Amazon AWS AI"}, {"id": 187280, "fullname": "Shuo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187280?format=json", "institution": "Amazon"}, {"id": 84531, "fullname": "Jiajun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84531?format=json", "institution": "Stanford University"}, {"id": 187281, "fullname": "Wei Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/187281?format=json", "institution": "Amazon"}, {"id": 84901, "fullname": "Zhuowen Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84901?format=json", "institution": "University of California, San Diego"}, {"id": 130843, "fullname": "Stefano Soatto", "url": "http://cvpr.thecvf.com/api/miniconf/users/130843?format=json", "institution": "University of California, Los Angeles"}], "abstract": "We introduce Talk2Move, a reinforcement learning (RL) based diffusion framework for text-instructed spatial transformation of objects within scenes. Spatially manipulating objects in a scene through natural language poses a challenge for multimodal generation systems. While existing text-based manipulation methods can adjust appearance or style, they struggle to perform object-level geometric transformations\u2014such as translating, rotating, or resizing objects\u2014due to scarce paired supervision and pixel-level optimization limits. Talk2Move employs Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts generated from input images and lightweight textual variations, removing the need for costly paired data. A spatial reward guided model aligns geometric transformations with linguistic description, while off-policy step evaluation and active step sampling improve learning efficiency by focusing on informative transformation stages. Furthermore, we design object-centric spatial rewards that evaluate displacement, rotation, and scaling behaviors directly, enabling interpretable and coherent transformations.Experiments on curated benchmarks demonstrate that Talk2Move achieves precise, consistent, and semantically faithful object transformations, outperforming existing text-guided editing approaches in both spatial accuracy and scene coherence.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37374", "url": null, "sourceid": 42772, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37375, "uid": "d66d7531860ea50930a6d1280ae9f5e3", "name": "Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression", "authors": [{"id": 143243, "fullname": "SHIYIN JIANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/143243?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 103351, "fullname": "Wei Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/103351?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 187282, "fullname": "Minghao Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/187282?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 187283, "fullname": "Zhenghao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187283?format=json", "institution": "University of Newcastle"}, {"id": 128611, "fullname": "Ce Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128611?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 86553, "fullname": "Shuhang Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86553?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "The proliferation of visual data under tight storage and bandwidth budgets makes extremely low\u2013bitrate generative image compression increasingly important. Vector quantization (VQ) is compelling in this regime because codebooks encode cross-channel correlations and dataset-level semantics, enabling perceptually faithful reconstructions when bits are scarce. We propose RDVQ, a vector-quantization (VQ) based generative image compression method designed for extremely low bitrates. While end-to-end learned image codecs rely on a differentiable rate term for rate\u2013distortion (RD) optimization, however, a key challenge is that na\u00efvely integrating VQ introduces non-differentiability and is not directly compatible with entropy modeling, forcing prior work to regulate bitrate only indirectly. We resolve this by defining a distance-aware soft posterior over codebook indices and training a conditional autoregressive entropy model to predict it. Therefore the cross-entropy between the approximate and predicted posteriors yields a differentiable rate loss, restoring a gradient pathway from rate to the encoder via codeword distances. Such predicted codebook index distribution enables prefix-only transmission at inference, with the model imputing the rest of the indices, delivering retraining-free bitrate control over a practical range. Our end-to-end RD optimized RDVQ outperforms all baseline methods in terms of DISTS and CLIPIQA, which reflect superior structural restoration and better alignment with human visual perception on the Kodak, DIV2K and CLIC2020 datasets.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37375", "url": null, "sourceid": 33293, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40298?format=json"], "related_events_ids": [40298]}, {"id": 40298, "uid": "d66d7531860ea50930a6d1280ae9f5e3", "name": "Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression", "authors": [{"id": 143243, "fullname": "SHIYIN JIANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/143243?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 103351, "fullname": "Wei Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/103351?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 187282, "fullname": "Minghao Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/187282?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 187283, "fullname": "Zhenghao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187283?format=json", "institution": "University of Newcastle"}, {"id": 128611, "fullname": "Ce Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128611?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 86553, "fullname": "Shuhang Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86553?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "The proliferation of visual data under tight storage and bandwidth budgets makes extremely low\u2013bitrate generative image compression increasingly important. Vector quantization (VQ) is compelling in this regime because codebooks encode cross-channel correlations and dataset-level semantics, enabling perceptually faithful reconstructions when bits are scarce. We propose RDVQ, a vector-quantization (VQ) based generative image compression method designed for extremely low bitrates. While end-to-end learned image codecs rely on a differentiable rate term for rate\u2013distortion (RD) optimization, however, a key challenge is that na\u00efvely integrating VQ introduces non-differentiability and is not directly compatible with entropy modeling, forcing prior work to regulate bitrate only indirectly. We resolve this by defining a distance-aware soft posterior over codebook indices and training a conditional autoregressive entropy model to predict it. Therefore the cross-entropy between the approximate and predicted posteriors yields a differentiable rate loss, restoring a gradient pathway from rate to the encoder via codeword distances. Such predicted codebook index distribution enables prefix-only transmission at inference, with the model imputing the rest of the indices, delivering retraining-free bitrate control over a practical range. Our end-to-end RD optimized RDVQ outperforms all baseline methods in terms of DISTS and CLIPIQA, which reflect superior structural restoration and better alignment with human visual perception on the Kodak, DIV2K and CLIC2020 datasets.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40298", "url": null, "sourceid": -33293, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37375?format=json"], "related_events_ids": [37375]}, {"id": 37376, "uid": "380145bb084aa454fe34abc4cad8c357", "name": "Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning", "authors": [{"id": 135405, "fullname": "SARA GHAZANFARI", "url": "http://cvpr.thecvf.com/api/miniconf/users/135405?format=json", "institution": "New York University (NYU)"}, {"id": 75522, "fullname": "Francesco Croce", "url": "http://cvpr.thecvf.com/api/miniconf/users/75522?format=json", "institution": "University of T\u00fcbingen"}, {"id": 187284, "fullname": "Nicolas Flammarion", "url": "http://cvpr.thecvf.com/api/miniconf/users/187284?format=json", "institution": "Swiss Federal Institute of Technology Lausanne"}, {"id": 149482, "fullname": "Prashanth Krishnamurthy", "url": "http://cvpr.thecvf.com/api/miniconf/users/149482?format=json", "institution": "New York University"}, {"id": 149062, "fullname": "Farshad Khorrami", "url": "http://cvpr.thecvf.com/api/miniconf/users/149062?format=json", "institution": "New York University"}, {"id": 187285, "fullname": "Siddharth Garg", "url": "http://cvpr.thecvf.com/api/miniconf/users/187285?format=json", "institution": null}], "abstract": "Recent work has shown that eliciting Large Language Models (LLMs) to generate reasoning traces in natural language before answering the user's request can significantly improve their performance across tasks.This approach has been extended to multimodal LLMs, where the models can produce chains-of-thoughts (CoT) about the content of input images and videos. For video inputs, prior works use complex multi-step pipelines that extract and include relevant frames from videos in the CoT, or produce simpler single-stage reasoning traces at the expense of poor temporal grounding.Here, we propose the first video LLMs with single-stage reasoning that includes explicit references to relevant frames, thereby reducing temporal inconsistencies in the reasoning process. Our approach is simple, unified, and self-contained, employing a single-stage inference to handle complex video understanding tasks without relying on auxiliary modules for frame selection or caption generation.For this, we first create CoF-Data, a large dataset of diverse questions, answers, and corresponding frame-grounded reasoning traces from both natural and synthetic videos, spanning various topics and tasks.Our models, obtained by fine-tuning video LLMs on this chain-of-frames (CoF) data, generate reasoning traces that accurately identify key frames to answer given questions.In turn, this consistently improves performance across multiple video understanding benchmarks. Surprisingly, we find that synthetic data alone, despite being out-of-distribution with respect to these real-world benchmarks, provides a significant boost in model accuracy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37376", "url": null, "sourceid": 41704, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37377, "uid": "d692938bafade3b6eea0fdbde76d79ff", "name": "Multi-Scale Gradient-Guided Unrolling Architecture with Adaptive Mamba for Compressive Sensing", "authors": [{"id": 182408, "fullname": "Le Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182408?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 87165, "fullname": "Hongping Gan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87165?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}], "abstract": "In the field of Compressive Sensing (CS), deep unrolling networks (DUNs) have demonstrated exceptional performance and interpretability by integrating traditional optimization solvers with deep networks. However, existing DUNs suffer from homogenization in cross-stage feature extraction and insufficient integration of gradient-guided information. Additionally, the feature extraction module struggles to balance the global receptive field and computational efficiency, which limits improvements in image reconstruction details. To address these challenges, we propose a multi-scale gradient-guided unrolling architecture with adaptive Mamba for CS, named MambaCS.  Specifically, we utilize our customized Adaptive State-Space Block (A-SSB) to unroll the well-known Proximal Gradient Descent (PGD) algorithm across multiple feature levels to extract comprehensive image features while maintaining computational efficiency. Moreover, we design a High-Dimensional Gradient Fusion (HDGF) that ensures the persistent and stable injection of gradient-guided information across various scales and dimensions, while effectively eliminating information bottlenecks. Finally, we develop a Feature-Adaptive Proximal Operator (FAPO), using A-SSB as an extension of the sparse basis associated with the PGD proximal operator, which enhances sensitivity to multi-scale features and improves detail reconstruction. Extensive experiments demonstrate the signifcant advantages of our proposed MambaCS over the current SOTA methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37377", "url": null, "sourceid": 38680, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37378, "uid": "fd80c4b06025c38f9d6958ebe4f14532", "name": "DVGT: Visual Geometry Transformer for Autonomous Driving", "authors": [{"id": 153230, "fullname": "Sicheng Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/153230?format=json", "institution": "Tsinghua University"}, {"id": 187286, "fullname": "Zixun Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/187286?format=json", "institution": "Central South University"}, {"id": 130710, "fullname": "Wenzhao Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/130710?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 144260, "fullname": "Shaoqing Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/144260?format=json", "institution": "University of Macau"}, {"id": 187287, "fullname": "Fang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187287?format=json", "institution": "University of Macau"}, {"id": 187288, "fullname": "Shengyin Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187288?format=json", "institution": null}, {"id": 158846, "fullname": "Long Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/158846?format=json", "institution": "Wayve"}, {"id": 149023, "fullname": "Zhixin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/149023?format=json", "institution": "University of Macau"}, {"id": 88597, "fullname": "Jiwen Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88597?format=json", "institution": "Tsinghua University"}], "abstract": "Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, it still lacks a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations. To bridge this gap, we propose a Visual Geometry Transformer specifically designed for autonomous Driving (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. Finally, we use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego pose for each frame. Our DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets, including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms the other geometry prediction models on various scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37378", "url": null, "sourceid": 31415, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37389, "uid": "614484cb68d6a51ba0e36f9794ee38fc", "name": "Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models", "authors": [{"id": 181962, "fullname": "Zitong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181962?format=json", "institution": "Harbin Institute of Technology"}, {"id": 187321, "fullname": "Kaidong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187321?format=json", "institution": "University of Science and Technology of China"}, {"id": 187322, "fullname": "Yukang Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/187322?format=json", "institution": "Alibaba Group"}, {"id": 187323, "fullname": "Chao Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187323?format=json", "institution": "Alibaba Group"}, {"id": 187324, "fullname": "Rui Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/187324?format=json", "institution": null}, {"id": 157235, "fullname": "Ying Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/157235?format=json", "institution": "Taobao, Alibaba Group"}, {"id": 84797, "fullname": "Wangmeng Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/84797?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Otimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which is inefficient and often yields ambiguous global supervision. To address these limitations, we propose \\textbf{LocalDPO}, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline to efficiently collect preference pair data that generates preference pairs with \\textbf{a single inference} per prompt, eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and inpainting only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to corrupted areas for rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37389", "url": null, "sourceid": 37099, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37388, "uid": "171061605226364f27c0a15445307397", "name": "Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification", "authors": [{"id": 182474, "fullname": "William Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182474?format=json", "institution": "Princeton University"}, {"id": 90813, "fullname": "Xindi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90813?format=json", "institution": "Princeton University"}, {"id": 187319, "fullname": "Zhiwei Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187319?format=json", "institution": "Google DeepMind"}, {"id": 187320, "fullname": "Esin Tureci", "url": "http://cvpr.thecvf.com/api/miniconf/users/187320?format=json", "institution": "Princeton University"}, {"id": 150975, "fullname": "Olga Russakovsky", "url": "http://cvpr.thecvf.com/api/miniconf/users/150975?format=json", "institution": "Princeton University"}], "abstract": "Text-to-image (T2I) models are increasingly used for synthetic dataset generation, but generating synthetic training data to improve fine-grained classification performance remains challenging. Fine-tuning the T2I model with a few real examples can help generate more appropriate synthetic training data; however, this fine-tuning may also introduce overfitting and reduce diversity in the generated samples. We propose a fine-tuning strategy BOB (Beyond OBjects) for mitigating these concerns. Given a small set of real examples, we first describe them using class-agnostic attributes such as scene background and object pose. We then explicitly condition on these attributes during fine-tuning of the T2I model and marginalize them out during generation. This design mitigates overfitting, thus preserving the T2I model\u2019s generative prior and reducing estimation errors, and further minimizes unintended inter-class associations. Extensive experiments across multiple T2I models, backbones, and datasets demonstrate state-of-the-art performance in low-shot fine-grained classification when augmented with synthetic data. Concretely, BOB outperforms DataDream by 7.4% on the Aircraft dataset (from 50.0% to 57.4% when fine-tuning a CLIP classifier with 5 real images augmented with 100 synthetic images). Additionally, in three of the four datasets, the fine-tuning downstream models with synthetic data generated from BOB and five real images achieves better performance than fine-tuning with 10 real images. Collectively, BOB outperforms prior art in 18 of 24 experimental settings, with over 2% accuracy improvements in 14 of these settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37388", "url": null, "sourceid": 36933, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37385, "uid": "e0054c962f184b100dff364a2e32fac0", "name": "VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control", "authors": [{"id": 158602, "fullname": "Sixiao Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/158602?format=json", "institution": "Fudan University"}, {"id": 128777, "fullname": "Minghao Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/128777?format=json", "institution": "The University of Hong Kong"}, {"id": 104852, "fullname": "Wenbo Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/104852?format=json", "institution": "Tencent ARC Lab"}, {"id": 85467, "fullname": "Xiaoyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/85467?format=json", "institution": "Tencent ARC Lab"}, {"id": 84809, "fullname": "Ying Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/84809?format=json", "institution": "Tencent"}, {"id": 73907, "fullname": "Yanwei Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73907?format=json", "institution": "Fudan University"}], "abstract": "Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, as videos inherently operate dynamics in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object's path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos that precisely adhere to the specified dynamics. Unfortunately, another major challenge lies in the scarcity of large-scale training data with explicit 4D annotations. We address this by developing an automatic data engine that extracts the required 4D controls from in-the-wild videos, allowing us to train our model on a massive and diverse dataset. We will release our code, model, and dataset to support research in realistic and controllable video world models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37385", "url": null, "sourceid": 42227, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37391, "uid": "e07472e70e7971af8747adb73ebeac05", "name": "Efficient All-Pairs Correlation Volume Sampling for Optical Flow Estimation", "authors": [{"id": 182826, "fullname": "Karlis Martins Briedis", "url": "http://cvpr.thecvf.com/api/miniconf/users/182826?format=json", "institution": "DisneyResearch|Studios"}, {"id": 86229, "fullname": "Markus Gross", "url": "http://cvpr.thecvf.com/api/miniconf/users/86229?format=json", "institution": "Disney Research, Disney"}, {"id": 85061, "fullname": "Christopher Schroers", "url": "http://cvpr.thecvf.com/api/miniconf/users/85061?format=json", "institution": "Disney Research|Studios, Disney"}], "abstract": "Recent optical flow estimation methods often employ local cost sampling from a dense all-pairs correlation volume. This results in quadratic computational and memory complexity in the number of pixels. Although an alternative memory-efficient implementation with on-demand cost computation exists, this is significantly slower in practice and therefore many prior methods process images at downsampled resolutions, missing fine-grained details.To address this, we propose an algorithm for both memory and compute-efficient implementation of the all-pairs correlation volume sampling, still matching the exact mathematical operator as defined by RAFT. Our approach outperforms on-demand sampling by up to 92% while maintaining equally low memory usage, and performs at least on par with the default implementation with up to 99% lower memory usage. As cost sampling makes up a significant portion of the overall runtime, this can translate to up to 63% savings for the total end-to-end model inference on high-resolution inputs. Our evaluation of existing methods includes an 8K ultra-high-resolution dataset and an inference-time extension of the SEA-RAFT method. With this, we achieve state-of-the-art results at high resolutions both in accuracy and runtime.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37391", "url": null, "sourceid": 34190, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37395, "uid": "3cd13cea091d84211559098a33a38ce9", "name": "Graph Attention Prototypical Network for Robust Few-Shot Classification", "authors": [{"id": 181289, "fullname": "Tingyun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181289?format=json", "institution": "Hunan University"}, {"id": 187338, "fullname": "Licheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187338?format=json", "institution": "Hunan University"}, {"id": 187339, "fullname": "Qibin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187339?format=json", "institution": "Hunan University"}, {"id": 187340, "fullname": "Qiying Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187340?format=json", "institution": "Guangzhou University"}, {"id": 155062, "fullname": "C.L.Philip Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/155062?format=json", "institution": "South China University of Technology"}], "abstract": "Few-shot learning has attracted extensive attention, with metric-based approaches such as Prototypical Networks establishing strong baselines. These methods construct class prototypes from support samples and classify query samples via distance metrics, but their performance is highly sensitive to label noise. To tackle this challenge, we propose a novel graph attention prototypical network (GAPNet) for robust few-shot classification. GAPNet first extracts local and global features via a classic CNN backbone and a group attention broad learning module, respectively. To mitigate the impact of label noise, the intra-class and inter-class relationships between support and query samples are explicitly modeled via a pseudo-label guided graph constructor, and then processed by an edge-aware graph attention module to capture topological correlations. Furthermore, an adaptive noise-robust prototype generator is introduced to dynamically suppress the contributions of noisy samples, substantially improving the reliability of class prototypes. Extensive experiments demonstrate the effectiveness and robustness of GAPNet to label noise. Compared to state-of-the-art approaches, GAPNet improves accuracy in the 5-way 5-shot setting by $3\\% \\sim 8\\%$ on three general image benchmarks and one fine-grained classification dataset.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37395", "url": null, "sourceid": 44902, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37398, "uid": "92d19e9626e3e23c23bfa1f6dcdc1837", "name": "Gaussian-Mixture Latent Flow for Stochastic 3D Human Motion Prediction", "authors": [{"id": 146478, "fullname": "Yue Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/146478?format=json", "institution": "Beihang University"}, {"id": 187348, "fullname": "Frederick W. B. Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187348?format=json", "institution": "Durham University"}, {"id": 130473, "fullname": "Xiaohui Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130473?format=json", "institution": "Zhongguancun Laboratory"}], "abstract": "Stochastic human motion prediction aims to forecast future motion distributions. Although recent studies have achieved strong performance in terms of accuracy and diversity, they often overlook plausibility (e.g., resulting in physically unrealistic predictions) and uncertainty quantification, which is essential for real-world applications and downstream tasks. To address these issues, we propose a latent flow-based model equipped with a data-driven Gaussian mixture prior that more effectively disentangles diverse human behaviors than conventional single-modal priors. This prior is derived from patterns in the training data without requiring additional annotations. Furthermore, the fully invertible nature of our model enables natural uncertainty quantification through tractable likelihood computation. Experiments on the Human3.6M and AMASS datasets demonstrate that our approach achieves state-of-the-art performance in both accuracy and plausibility, while also providing reliable uncertainty estimates.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37398", "url": null, "sourceid": 36849, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37399, "uid": "32c92c5a9c391d8a2a2a05770f1a3395", "name": "Convexity-Aware Noise Calibration: A Self-Supervised Framework for Noise-Level-Unknown Image Denoising", "authors": [{"id": 180567, "fullname": "Zhan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180567?format=json", "institution": "China University of Petroleum (East China)"}, {"id": 145287, "fullname": "Wang Leiquan", "url": "http://cvpr.thecvf.com/api/miniconf/users/145287?format=json", "institution": "China University of Petroleum"}, {"id": 149251, "fullname": "Chunlei Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/149251?format=json", "institution": "China University of Petroleum \uff08East China)"}, {"id": 187349, "fullname": "Yu Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187349?format=json", "institution": "China University of Petroleum"}], "abstract": "Image denoising is a fundamental task in computer vision aimed at recovering clean images from noise-corrupted observations. While supervised deep learning methods achieve remarkable performance when trained on paired data with known noise levels, their real-world applicability is limited as noise characteristics are often unknown. Existing unsupervised techniques, such as blind-spot networks or methods based on statistical estimation, either compromise performance due to information loss or suffer from inaccuracies in noise level estimation. To address these challenges, we propose a novel two-stage self-supervised denoising framework that first accurately estimates the noise level directly from noisy images, without requiring clean references or prior noise knowledge. Building upon theoretical insights from Noisier2Noise, we rigorously derive a relationship between the noise level and the variance of the denoised image, enabling robust estimation via a deep learning model and a ternary search strategy. The estimated noise level is then used to synthesize training pairs for supervised denoising. Experiments demonstrate that our method outperforms existing unsupervised approaches and traditional noise estimation techniques, achieving performance competitive with\u2014and in some cases surpassing\u2014supervised methods trained with known noise levels. The proposed framework effectively overcomes the training data pair limitations of supervised approaches for unknown additive white Gaussian noise. Our code will be available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37399", "url": null, "sourceid": 32614, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37400, "uid": "e52da5a31de788599378924f0e639557", "name": "FG-portrait:  3D Flow Guided Editable Portrait Animation", "authors": [{"id": 181426, "fullname": "Yating Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181426?format=json", "institution": null}, {"id": 153657, "fullname": "Yunqi Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153657?format=json", "institution": "Huawei London Research Center"}, {"id": 164785, "fullname": "Evangelos Ververas", "url": "http://cvpr.thecvf.com/api/miniconf/users/164785?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 74045, "fullname": "Jiankang Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/74045?format=json", "institution": "Imperial College London"}, {"id": 86145, "fullname": "Jifei Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/86145?format=json", "institution": "Huawei Technologies Ltd."}], "abstract": "Motion transfer from the driving to the source portrait remains a key challenge in the portrait animation. Current diffusion-based approaches condition only on the driving motion, which fails to capture source-to-driving correspondences and consequently yields suboptimal motion transfer.  Although flow estimation provides an alternative, predicting dense correspondences from 2D input is ill-posed and often yields inaccurate animation. We address this problem by introducing 3D flows, a learning-free and geometry-driven motion correspondence directly computed from parametric 3D head models. To integrate this 3D prior into diffusion model, we introduce 3D flow encoding to query potential 3D flows for each target pixel to indicate its displacement back to the source location. To obtain 3D flows aligned with 2D motion changes, we further propose depth-guided sampling to accurately locate the corresponding 3D points for each pixel. Beyond high-fidelity portrait animation, our model further supports user-specified editing of facial expression and head pose. Extensive experiments demonstrate the superiority of our method on consistent driving motion transfer as well as faithful source identity preservation. The source code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37400", "url": null, "sourceid": 35737, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37401, "uid": "9838dd08963f61611712f456f5bdd255", "name": "TVHighlights: LLM-Guided Human-Free Collaborative Training for Video Highlight Detection in Movies and TV Dramas", "authors": [{"id": 181796, "fullname": "QIU QI", "url": "http://cvpr.thecvf.com/api/miniconf/users/181796?format=json", "institution": "Alibaba Group"}, {"id": 187350, "fullname": "Xuan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187350?format=json", "institution": null}, {"id": 126428, "fullname": "Jiawei Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/126428?format=json", "institution": "Southeast University"}, {"id": 187351, "fullname": "Yuan Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187351?format=json", "institution": null}, {"id": 126445, "fullname": "Xu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126445?format=json", "institution": "Southeast University"}, {"id": 187352, "fullname": "Yanlong Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/187352?format=json", "institution": "Alibaba Group"}], "abstract": "Video highlight detection aims to identify the most engaging segments in long-form videos, supporting content editing and recommendation, especially for movies and TV dramas. However, existing methods are ill-suited to cinematic content due to its narrative complexity, while the scarcity of annotated data and the high cost of manual labeling further hinder progress. To bridge this gap, we introduce **TVHighlights**, the first large-scale dataset tailored for video highlight detection in movies and TV dramas, with 1,721 carefully curated videos covering diverse genres. Built on community-driven behaviors, it provides realistic and diverse annotations without human labeling. Based on TVHighlights, we propose **LTV-HD**: a LLM-guided, human-free collaborative training framework for video highlight detection in cinematic content. LTV-HD operates in two stages: (1) weakly supervised pre-training of a lightweight model using video-level labels, followed by (2) iterative refinement through collaboration between large language models (LLMs) and the lightweight model. LLMs generate noisy clip-level pseudo-labels, which the lightweight model learns from under a noise-robust strategy, and its high-confidence predictions are then fed back to guide the LLM in distilling genre-specific highlight patterns through a self-improving loop. Experiments demonstrate that LTV-HD achieves state-of-the-art performance on TVHighlights, validating its effectiveness in real-world, annotation-free scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37401", "url": null, "sourceid": 44940, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37406, "uid": "7148b36cd53dca251b2311494f5732af", "name": "BiGain: Unified Token Compression for Joint Generation and Classification", "authors": [{"id": 183522, "fullname": "Jiacheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183522?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 85024, "fullname": "Shengkun Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85024?format=json", "institution": "North Carolina State University"}, {"id": 146194, "fullname": "Jiacheng Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/146194?format=json", "institution": "MBZUAI(Mohamed bin Zayed University of Artificial Intelligence)"}, {"id": 84989, "fullname": "Dongkuan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84989?format=json", "institution": "North Carolina State University"}, {"id": 135097, "fullname": "Zhiqiang Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/135097?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}], "abstract": "Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize for synthesis quality under reduced compute, yet they often ignore the model's latent discriminative capacity. We revisit token compression with a joint objective and present **BiGain**, a training-free, plug-and-play framework that preserves generation quality while markedly improving classification in accelerated diffusion models. Our key insight is frequency separation: mapping feature-space signals into a frequency-aware representation disentangles fine detail from global semantics, enabling compression that respects both generative fidelity and discriminative utility. BiGain reflects this principle with two frequency-aware operators: (1) *Laplacian-gated token merging*, which encourages merges among spectrally smooth tokens while discouraging merges of high-contrast tokens, thereby retaining edges and textures; and (2) *Interpolate\u2013Extrapolate KV Downsampling*, which downsamples keys/values via a controllable interextrapolation between nearest and average pooling while keeping queries intact, thereby conserving attention precision without retraining. Across DiT- and U-Net\u2013based backbones and multiple datasets of ImageNet-1K, ImageNet-100, Oxford-IIIT Pets, and COCO-2017, our proposed operators consistently improve the speed\u2013accuracy trade-off for diffusion-based classification, while maintaining, sometimes even enhancing generation quality under comparable acceleration. For instance, on ImageNet-1K, with a token merging ratio of 70% on Stable Diffusion 2.0, BiGain improves classification accuracy by **7.15%** while also reducing FID for generation by 0.34 (**1.85%**). Our comprehensive analyses indicate that balanced spectral retention, preserving high-frequency detail alongside low/mid-frequency semantic content is a reliable design rule for token compression in diffusion models. To our knowledge, BiGain is the first framework to jointly study and advance both generation and classification under accelerated diffusion, offering a practical way to deployable, dual-purpose generative systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37406", "url": null, "sourceid": 39484, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37407, "uid": "0012e5ac66e6927d61c675cef8548bee", "name": "Think Visually, Reason Textually: Vision-Language Synergy in Abstract Reasoning", "authors": [{"id": 102035, "fullname": "Beichen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/102035?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 131138, "fullname": "Yuhang Zang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131138?format=json", "institution": "Nanyang Technological University"}, {"id": 90594, "fullname": "Xiaoyi Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/90594?format=json", "institution": "Microsoft"}, {"id": 152680, "fullname": "Yuhang Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/152680?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 153062, "fullname": "Haodong Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153062?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 84911, "fullname": "Dahua Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/84911?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 77217, "fullname": "Jiaqi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77217?format=json", "institution": "Shanghai AI Laboratory"}], "abstract": "Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4.These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence.The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks.Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles.However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution.This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages:vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution.Building on this insight, we introduce two synergistic strategies:(1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and(2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction.Extensive experiments demonstrate that our approach yields up to a 4.33\\% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks.Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence in future foundation models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37407", "url": null, "sourceid": 38411, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37413, "uid": "4dbccb7943944f9e433288413ca88828", "name": "Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection", "authors": [{"id": 181567, "fullname": "Jinhu Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181567?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 87134, "fullname": "Yihang Lou", "url": "http://cvpr.thecvf.com/api/miniconf/users/87134?format=json", "institution": "Peking University"}, {"id": 187383, "fullname": "Qingyi Si", "url": "http://cvpr.thecvf.com/api/miniconf/users/187383?format=json", "institution": "JD.com"}, {"id": 187384, "fullname": "Shudong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187384?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 187385, "fullname": "Sen Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/187385?format=json", "institution": "Beijing University of Posts and Telecommunications"}], "abstract": "Large Vision-Language Models (LVLMs) have achieved impressive performance across multimodal understanding and reasoning tasks, yet their internal safety mechanisms remain opaque and poorly controlled. In this work, we present a comprehensive framework for diagnosing and repairing unsafe channels within LVLMs. We first perform causal mediation analysis to identify neurons and layers that are causally responsible for unsafe behaviors. Based on these findings, we introduce a dual-modal safety subspace projection method that learns generalized safety subspaces for both visual and textual modalities through generalized eigen-decomposition between benign and malicious activations. During inference, activations are dynamically projected toward these safety subspaces via a hybrid fusion mechanism that adaptively balances visual and textual corrections, effectively suppressing unsafe features while preserving semantic fidelity.Extensive experiments on multiple LVLM safety benchmarks demonstrate that our causal\u2013subspace repair framework significantly enhances safety robustness without degrading general multimodal capabilities, outperforming prior activation steering and alignment-based baselines. Moreover, our approach demonstrates robust transferability, effectively defending against unseen and adaptive attacks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37413", "url": null, "sourceid": 42135, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37419, "uid": "fcd6ee20a31612b7f2f886ed16348750", "name": "Otil: Accelerating Diffusion Model Inference via Communication-Efficient Multi-GPU Parallelism", "authors": [{"id": 172721, "fullname": "Xin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/172721?format=json", "institution": "Zhejiang University"}, {"id": 187401, "fullname": "Shujun Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/187401?format=json", "institution": "Hangzhou Hi-tech Innovation Group"}, {"id": 187402, "fullname": "Tao Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187402?format=json", "institution": "National University of Singapore"}, {"id": 187403, "fullname": "Han Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187403?format=json", "institution": "Zhejiang University"}, {"id": 187404, "fullname": "Zonghui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187404?format=json", "institution": "Zhejiang University"}, {"id": 187405, "fullname": "Wenzhi CHEN", "url": "http://cvpr.thecvf.com/api/miniconf/users/187405?format=json", "institution": "Zhejiang University"}], "abstract": "Diffusion models (DMs) have recently achieved remarkable success across diverse modalities, including high-fidelity image and video synthesis.However, their inherent step sequential denoising process introduces substantial cumulative latency, which significantly degrades user experience. While existing multi-GPU parallelization motheds can alleviate latency, they often incur prohibitive GPU-GPU communication overhead, offsetting much of the performance gain.We present Otil (Only Transmit Informative Latents), a communication-efficient parallel framework for accelerating diffusion inference.% Otil can minimizes redundant data exchange across GPUs while preserving generation quality.Our key insight is that latent activations change only marginally between consecutive denoising process. Leveraging this property, Otil identifies and synchronizes only the most informative latent sub-blocks and introduces a dynamic polling mechanism that periodically revisits all spatial regions, ensuring complete coverage without unnecessary communication. The framework is fully plug-and-play and remains compatible with fast sample and architectural acceleration algorithms, without requiring any retraining or architectural modification.Otil reduces GPU\u2013GPU communication up to 87.5\\% compared with SOTA parallelism methods, achieving $1.8\\times$ speedup on two GPUs with Stable Diffusion v1.5 and $2.6\\times$ on four GPUs with Stable Diffusion XL. When combined with few-step samplers ($30$ steps) and LoRA models, the acceleration further increases to $2.46\\times$\u2013$2.84\\times$ on 2 GPUs. These demonstrate the strong potential of Otil for scalable and efficient multi-GPU diffusion inference while preserving generation fidelity.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37419", "url": null, "sourceid": 40160, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37421, "uid": "75d2fc965b6f80e29b2e9621c16af23b", "name": "SG-LoRA: Semantic-guided LoRA Parameters Generation", "authors": [{"id": 181278, "fullname": "Miaoge Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181278?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 157491, "fullname": "Yang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/157491?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 187411, "fullname": "Zhijie Rao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187411?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 187412, "fullname": "Can Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187412?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 187413, "fullname": "Kang Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/187413?format=json", "institution": "Southeast University"}, {"id": 85166, "fullname": "Jingcai Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/85166?format=json", "institution": "The Hong Kong Polytechnic University"}], "abstract": "Generating new Low-Rank Adaptation (LoRA) weights from pre-trained LoRAs has demonstrated strong generalization capabilities across various tasks, enabling the efficient transfer of AI models, particularly on resource-constrained edges. However, previous studies either merge base LoRAs via weighting coefficients or train a generative model under the closed-world assumption, limiting their efficiency and flexibility in complex edge user cases. This challenge may further increase when there are significant domain shifts between training and deployment. To this end, we propose $S$emantic-$G$uided $LoRA$ Parameter Generation ($SG$-$LoRA$), a tuning-free generative framework to efficiently produce task-specific parameters for unseen tasks in a semantic-to-LoRA pipeline. Concretely, SG-LoRA uses task descriptions as the semantic bridge, measuring their proximity to a set of known expert tasks in a shared embedding space. Based on this semantic guidance, it models the target task's LoRA parameter distribution to generate high-performing parameters for novel tasks. SG-LoRA enables the real-time construction of LoRA models aligned with individual intents by distilling knowledge from prominent LoRA experts, while also offering a privacy-preserving solution for personalized model adaptation in a novel zero-shot open-world setting proposed in this work.  Extensive experiments on multiple challenging tasks confirm the superior performance and remarkable adaptability of SG-LoRA. The code is attached in the supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37421", "url": null, "sourceid": 37919, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37423, "uid": "b3f7bb40292f61fa966cb2a8a4cd339d", "name": "RAG-TP: A General Framework for Vehicle Trajectory Prediction via Retrieval-Augmented Generation", "authors": [{"id": 187417, "fullname": "Ziyi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187417?format=json", "institution": "National University of Defense Technology"}, {"id": 187418, "fullname": "ZhangYang ZhangYang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187418?format=json", "institution": null}, {"id": 176170, "fullname": "Guijian Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176170?format=json", "institution": "National University of Defense Technology"}, {"id": 187419, "fullname": "Chao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187419?format=json", "institution": "National University of Defense Technology"}, {"id": 177412, "fullname": "Shibo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177412?format=json", "institution": "National University of Defense Technology"}, {"id": 187420, "fullname": "Xueqiong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187420?format=json", "institution": "National University of Defense Technology"}, {"id": 187421, "fullname": "Shaowu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187421?format=json", "institution": "National University of Defense Technology"}], "abstract": "Vehicle trajectory prediction is a critical technology for safe and efficient autonomous driving.However, its generalization and scalability have long been hindered by a heavy reliance on real-time, online priors.To break this bottleneck, we introduce RAG-TP, a general framework that reframes the problem from relying on uncertain online perception to retrieving from a large-scale, structured, offline knowledge base.The core of RAG-TP is to enhance predictions at inference time by dynamically querying a pre-built, heterogeneous knowledge base rich with scene topologies and motion patterns, using the retrieved historical experiences as priors.We further design a dynamic fusion module based on a learnable Mixture-of-Experts (MoE), which intelligently weights and integrates the multi-source retrieved knowledge via cross-attention to generate a high-density context for the final multi-modal trajectory decoding.By decoupling online inference from offline knowledge, this retrieval-augmented approach grounds predictions in a vast structured database, thereby mitigating model hallucination and compensating for unreliable priors to significantly enhance robustness and domain adaptation.Extensive experiments demonstrate that RAG-TP achieves excellent performance in both map-based and map-free settings, surpassing existing map-free methods while achieving performance comparable to state-of-the-art (SOTA) map-based models.It demonstrates significant advantages, particularly in cross-domain and zero-shot generalization tasks.Our work provides a promising and effective technical pathway toward building more scalable and robust prediction systems for autonomous driving.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37423", "url": null, "sourceid": 39958, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37426, "uid": "ad2b5e729bc747066e6422cb6e1fa5da", "name": "Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long-Video Understanding", "authors": [{"id": 187427, "fullname": "Pengfei Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187427?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 70644, "fullname": "Meng Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/70644?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 187428, "fullname": "Yingyao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187428?format=json", "institution": null}, {"id": 86626, "fullname": "Yi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86626?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 181696, "fullname": "Jiahua Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/181696?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 152920, "fullname": "Jun Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/152920?format=json", "institution": "Alibaba Group"}, {"id": 185404, "fullname": "YuCheng YuCheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185404?format=json", "institution": null}, {"id": 152921, "fullname": "Bo Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/152921?format=json", "institution": "Alibaba Group"}, {"id": 69930, "fullname": "Xiaodan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/69930?format=json", "institution": "Sun Yat-sen University"}], "abstract": "Long video understanding is essential for human-like intelligence, enabling coherent perception and reasoning over extended temporal contexts. While the emerging thinking-with-frames paradigm\u2014which alternates between global temporal reasoning and local frame examination\u2014has advanced the reasoning capabilities of video multi-modal large language models (MLLMs), it suffers from a significant efficiency bottleneck due to the progressively growing and redundant multi-modal context. To address this, we propose SpecTemp, a reinforcement learning-based Speculative Temporal reasoning framework that decouples temporal perception from reasoning via a cooperative dual-model design. In SpecTemp, a lightweight draft MLLM rapidly explores and proposes salient frames from densely sampled temporal regions, while a powerful target MLLM focuses on temporal reasoning and verifies the draft\u2019s proposals, iteratively refining its attention until convergence. This design mirrors the collaborative pathways of the human brain, balancing efficiency with accuracy. To support training, we construct the SpecTemp-80K dataset, featuring synchronized dual-level annotations for coarse evidence spans and fine-grained frame-level evidence. Experiments across multiple video understanding benchmarks demonstrate that SpecTemp not only maintains competitive accuracy but also significantly accelerates inference compared with existing thinking-with-frames methods.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37426", "url": null, "sourceid": 35599, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40300?format=json"], "related_events_ids": [40300]}, {"id": 37428, "uid": "a55703897794aad1d95721677a93c826", "name": "Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models", "authors": [{"id": 183284, "fullname": "Yu Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183284?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 135363, "fullname": "Hanwen Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/135363?format=json", "institution": "Adobe Systems"}, {"id": 187430, "fullname": "Ahmed Abdelkader", "url": "http://cvpr.thecvf.com/api/miniconf/users/187430?format=json", "institution": "Google LLC"}, {"id": 88961, "fullname": "Wen-Sheng Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88961?format=json", "institution": "Google Research"}, {"id": 128449, "fullname": "Brandon Y. Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/128449?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 75636, "fullname": "Zhangyang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75636?format=json", "institution": "University of Texas at Austin"}, {"id": 91087, "fullname": "Qixing Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91087?format=json", "institution": "University of Texas at Austin"}], "abstract": "With the emergence of 3D foundation models, such as DUSt3R, VGGT, and their variants, there is a growing interest in fine-tuning them for various downstream tasks, where using LoRA is the dominant fine-tuning paradigm.  As 3D datasets exhibit distinct variations in geometry, texture, camera motion, and lighting, there are interesting fundamental questions: 1) Are there LoRA sub-spaces associated with each type of variation? 2) Are these sub-spaces disentangled (i.e., orthogonal to each other)? 3) How do we compute them effectively? This paper provides answers to all these questions.  We introduce a robust approach that generates synthetic datasets with controlled variations, fine-tunes a LoRA adapter on each dataset, and extracts a LoRA sub-space associated with each type of variation.  We show that these sub-spaces are approximately disentangled. Integrating them leads to a reduced LoRA sub-space that enables efficient LoRA fine-tuning with improved prediction accuracy for downstream tasks.  In particular, we show that such a reduced LoRA sub-space, despite derived entirely from synthetic data, generalizes to real datasets.  An ablation study validates the effectiveness of the choices in our approach.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37428", "url": null, "sourceid": 41809, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37430, "uid": "6c9ca5f434f0fec54a8cb4073a3e2938", "name": "SIMPLEPOSTER: A SIMPLE BASELINE FOR PRODUCT POSTER GENERATION", "authors": [{"id": 156578, "fullname": "Benlei Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/156578?format=json", "institution": "Alibaba Group"}, {"id": 89119, "fullname": "Fangao Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/89119?format=json", "institution": "Megvii Technology Inc."}, {"id": 187431, "fullname": "Weitao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187431?format=json", "institution": null}, {"id": 183682, "fullname": "Yuwen Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/183682?format=json", "institution": "Zhejiang Tmall Technology Co., Ltd."}, {"id": 129579, "fullname": "Haiwen Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/129579?format=json", "institution": "Alibaba Group"}, {"id": 157341, "fullname": "Longtao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157341?format=json", "institution": "Alibaba Group"}, {"id": 90252, "fullname": "Hui Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/90252?format=json", "institution": "Zhejiang University, Tsinghua University"}, {"id": 156579, "fullname": "Wenxiang Shang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156579?format=json", "institution": "Taotian, Alibaba Group"}, {"id": 185968, "fullname": "Pipei Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185968?format=json", "institution": null}], "abstract": "Product poster generation presents unique challenges beyond general-purpose de-sign: it demands not only aesthetic composition and accurate text rendering, butalso strict preservation of the product subject and precise control over dense,multi-line text layouts. While general image editing models struggle with text lay-out control and subject consistency, existing specialized approaches\u2014often builtupon inpainting frameworks\u2014still suffer from unintended subject extension andinaccurate text synthesis. A common solution involves integrating auxiliary mod-ules such as ControlNet to condition on subject structure and text layout, but theseapproaches introduce significant architectural complexity and training overhead.In this work, we challenge the necessity of such complexity and demonstrate thatminimalist adaptation is sufficient. We introduce SimplePoster, a minimalist yetpowerful inpainting-based framework that enables faithful subject preservationand position-controllable text rendering\u2014entirely without external controllers likeControlNet. SimplePoster rests on two key insights: (1) full-parameter fine-tuningalone effectively suppresses subject extension by aligning the model\u2019s internalrepresentations with domain-specific priors; and (2) a lightweight character-levelposition encoding strategy enables end-to-end, spatially grounded text generation.Experiments show that SimplePoster achieves near-perfect subject preservation(98.7% of cases with strict subject preservation), significantly outperforming boththe state-of-the-art editing model SeedEdit3.0 (55.2%) and the specialized ap-proach PosterMaker (85.3%). It further demonstrates superior text rendering ac-curacy, even in challenging scenarios with complex multi-line layouts. We believeSimplePoster establishes a simple yet strong baseline for product poster genera-tion. We question the necessity of such complexity and demonstrate that minimalist designs suffice. We propose SimplePoster, a simple yet effective inpainting-based framework that achieves faithful subject preservation and position-controllable text rendering without relying on external controllers like ControlNet. SimplePoster is based on two key insights: (1) full-parameter fine-tuning effectively suppresses subject extension; and (2) a training-free character-level position encoding strategy enables end-to-end, geometry-aware text generation. Remarkably, SimplePoster achieves a near-perfect subject preservation rate ($98.7\\%$), significantly outperforming SOTA models SeedEdit 3.0 ($55.2\\%$) and PosterMaker ($85.3\\%$). It also excels in text rendering accuracy. We believe SimplePoster establishes a simple yet strong baseline for product poster generation. Code, models and benchmark will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37430", "url": null, "sourceid": 33939, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37431, "uid": "0c9ab9755db29d773177576d8032354f", "name": "Mind the Discriminability Trap in Source-Free Cross-domain Few-shot Learning", "authors": [{"id": 163084, "fullname": "ZHENYU ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/163084?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 131641, "fullname": "Yixiong Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/131641?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 131642, "fullname": "Yuhua Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/131642?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 86547, "fullname": "Ruixuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86547?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 181535, "fullname": "Guangyao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/181535?format=json", "institution": "Cornell University"}], "abstract": "Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL) focuses on fine-tuning with limited training data from target domains (e.g., medical or satellite images), where Vision-Language Models (VLMs) such as CLIP and SigLIP have shown promising results. Current works in traditional visual models suggest that improving visual discriminability enhances performance. However, in VLM-based SF-CDFSL tasks, we find that \\textbf{strengthening visual-modal discriminability actually suppresses VLMs\u2019 performance}. In this paper, we aim to delve into this phenomenon for an interpretation and a solution.By both theoretical and experimental proofs, our study reveals that fine-tuning with the typical cross-entropy loss ($L_{vlm}$) inherently includes a visual learning part and a cross-modal learning part, where the cross-modal part is crucial for rectifying the heavily disrupted modality misalignment in SF-CDFSL.However, we find that the visual learning essentially acts as a shortcut that encourages the model to reduce $L_{vlm}$ without considering the cross-modal part, therefore hindering the cross-modal alignment and harming the performance.Based on this interpretation, we further propose an approach to address this problem: first, we perturb the visual learning to guide the model to focus on the cross-modal alignment. Then, we use the visual-text semantic relationships to gradually align the visual and textual modalities during the fine-tuning. Extensive experiments on various settings, backbones (CLIP, SigLip, PE-Core), and tasks (4 CDFSL datasets and 11 FSL datasets) show that we consistently set new state-of-the-art results. We will release the code.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37431", "url": null, "sourceid": 36412, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37432, "uid": "988351ce43f13c91eff27470890c8376", "name": "MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy", "authors": [{"id": 187432, "fullname": "Albert Dominguez Mantes", "url": "http://cvpr.thecvf.com/api/miniconf/users/187432?format=json", "institution": "Swiss Federal Institute of Technology (EPFL)"}, {"id": 187433, "fullname": "Gioele Manno", "url": "http://cvpr.thecvf.com/api/miniconf/users/187433?format=json", "institution": "EPFL - EPF Lausanne"}, {"id": 187434, "fullname": "Martin Weigert", "url": "http://cvpr.thecvf.com/api/miniconf/users/187434?format=json", "institution": "Technische Universit\u00e4t Dresden"}], "abstract": "Modern microscopy routinely produces gigapixel images that contain structures across multiple spatial scales, from fine cellular morphology to broader tissue organization. Many analysis tasks require combining these scales, yet most vision models operate at a single resolution or derive multi-scale features from one view, limiting their ability to exploit the inherently multi-resolution nature of microscopy data. We introduce MuViT, a transformer architecture built to fuse true multi-resolution observations from the same underlying image. MuViT embeds all patches into a shared world-coordinate system and extends rotary positional embeddings to these coordinates, enabling attention to integrate wide-field context with high-resolution detail within a single encoder. Across synthetic benchmarks, kidney histopathology, and high-resolution mouse-brain microscopy, MuViT delivers consistent improvements over strong ViT and CNN baselines. Multi-resolution MAE pretraining further produces scale-consistent representations that enhance downstream tasks. These results demonstrate that explicit world-coordinate modeling provides a simple yet powerful mechanism for leveraging multi-resolution information in large-scale microscopy analysis.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37432", "url": "https://github.com/weigertlab/muvit", "sourceid": 37281, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37436, "uid": "ca8f0604222d01681e211a903ea53063", "name": "Interactive Episodic Memory with User Feedback", "authors": [{"id": 139461, "fullname": "Nikesh Subedi", "url": "http://cvpr.thecvf.com/api/miniconf/users/139461?format=json", "institution": "University of Utah"}, {"id": 93522, "fullname": "Loris Bazzani", "url": "http://cvpr.thecvf.com/api/miniconf/users/93522?format=json", "institution": "Amazon"}, {"id": 75952, "fullname": "Ziad Al-Halah", "url": "http://cvpr.thecvf.com/api/miniconf/users/75952?format=json", "institution": "University of Utah"}], "abstract": "Human memory is often unreliable. We forget where we placed objects, overlook small details, and struggle to recall past events accurately. Episodic Memory with Natural Language Query (EM-NLQ) seeks to overcome these limitations by allowing users to search their past visual experiences, captured through egocentric videos, using natural language questions. While recent models focus on addressing challenges in EM-NLQ like noisy input videos and efficiency, they overlook a key aspect of this task: interactivity. In real scenarios, users have the ability to refine their queries and provide feedback when a model's response is off-target, yet current EM-NLQ methods cannot incorporate or benefit from such feedback. To address this gap, we introduce the first \\textit{interactive} EM-NLQ framework, featuring a plug-and-play Feedback ALignment Module (FALM) that empowers existing models to efficiently incorporate user feedback and refine their predictions. Additionally, we introduce the Episodic Memory with Questions and Feedback task (EM-QnF) and new datasets tailored for feedback-based interaction and a lightweight training scheme that eliminates the need for expensive sequential optimization. Our approach, dubbed ReFocus, combines FALM with state-of-the-art EM-NLQ methods to achieve state-of-the-art results on three challenging benchmarks and demonstrates significant improvements in human-based feedback evaluation, bringing EM-NLQ closer to truly interactive and adaptive visual memory systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37436", "url": null, "sourceid": 37284, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37438, "uid": "62be10fcc6aa7b1ac58ba2fedbb6ff2f", "name": "VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation", "authors": [{"id": 169707, "fullname": "Longteng Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/169707?format=json", "institution": "Beijing Electronic Science and Technology Institute"}, {"id": 149225, "fullname": "DanDan Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/149225?format=json", "institution": "Alibaba Group"}, {"id": 187450, "fullname": "Qianqian Qiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187450?format=json", "institution": "Nanjing University"}, {"id": 177030, "fullname": "Heng Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177030?format=json", "institution": "University of Science and Technology of China"}, {"id": 187451, "fullname": "Huaye Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187451?format=json", "institution": "Alibaba Group; Beijing Electronic Science and Technology Institute"}, {"id": 187452, "fullname": "Yihang Bo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187452?format=json", "institution": null}, {"id": 187453, "fullname": "Bao Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187453?format=json", "institution": null}, {"id": 149274, "fullname": "Jingdong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/149274?format=json", "institution": "Ant Group"}, {"id": 131054, "fullname": "JUN ZHOU", "url": "http://cvpr.thecvf.com/api/miniconf/users/131054?format=json", "institution": "Ant Group"}, {"id": 130629, "fullname": "Xin Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/130629?format=json", "institution": "Beijing Electronic Science and Technology Institute"}], "abstract": "The rapid advancement of AIGC-based video generation has underscored the critical need for comprehensive evaluation frameworks that go beyond traditional generation quality metrics to encompass aesthetic appeal. However, existing benchmarks remain largely focused on technical fidelity, leaving a significant gap in holistic assessment\u2014particularly with respect to perceptual and artistic qualities. To address this limitation, we introduce VGA-Bench, a unified benchmark for joint evaluation of video generation quality and aesthetic quality. VGA-Bench is built upon a principled three-tier taxonomy: Aesthetic Quality, Aesthetic Tagging, and Generation Quality, each decomposed into multiple fine-grained sub-dimensions to enable systematic assessment. Guided by this taxonomy, we design 1,016 diverse prompts and generate a large-scale dataset of over 60,000 videos using 12 video generation models, ensuring broad coverage across content, style, and artifacts. To enable scalable and automated evaluation, we annotate a subset of the dataset via human labeling and develop three dedicated multi-task neural assessors: VAQA-Net for aesthetic quality prediction, VTag-Net for automatic aesthetic tagging, and VGQA-Net for generation and basic quality attributes. Extensive experiments demonstrate that our models achieve reliable alignment with human judgments, offering both accuracy and efficiency. We release VGA-Bench as a public benchmark to foster research in AIGC evaluation, with applications in content moderation, model debugging, and generative model optimization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37438", "url": null, "sourceid": 43164, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37437, "uid": "3e190a437c17f0b3deac567b7ca509bf", "name": "Alternative Reprogramming for Service Models", "authors": [{"id": 136638, "fullname": "Yunbei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/136638?format=json", "institution": "Tulane University"}, {"id": 187448, "fullname": "Chengyi Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/187448?format=json", "institution": "University of Melbourne"}, {"id": 153389, "fullname": "Feng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153389?format=json", "institution": "University of Melbourne"}, {"id": 187449, "fullname": "Jihun Hamm", "url": "http://cvpr.thecvf.com/api/miniconf/users/187449?format=json", "institution": "Tulane University"}], "abstract": "Adapting closed-box service models (i.e., APIs) for target tasks typically relies on reprogramming via Zeroth-Order Optimization (ZOO). However, this standard strategy is known for extensive, costly API calls and often suffers from slow, unstable optimization. Furthermore, we observe that this paradigm faces new challenges with modern APIs (e.g., GPT-4o). These models can be less sensitive to the input perturbations ZOO relies on, thereby hindering performance gains. To address these limitations, we propose an Alternative efficient Reprogramming approach for Service models (AReS). Instead of direct, continuous closed-box optimization, AReS initiates a single-pass interaction with the service API to prime an amenable local pre-trained encoder. This priming stage trains only a lightweight layer on top of the local encoder, making it highly receptive to the subsequent glass-box (white-box) reprogramming stage performed directly on the local model. Consequently, all subsequent adaptation and inference rely solely on this local proxy, eliminating all further API costs. Experiments demonstrate AReS's effectiveness where prior ZOO-based methods struggle: on GPT-4o, AReS achieves a +27.8\\% gain over the zero-shot baseline, a task where ZOO-based methods provide little to no improvement. Broadly, across ten diverse datasets, AReS outperforms state-of-the-art methods (+2.5\\% for VLMs, +15.6\\% for standard VMs) while reducing API calls by over 99.99\\%. AReS thus provides a robust and practical solution for adapting modern closed-box models.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37437", "url": "https://github.com/yunbeizhang/AReS", "sourceid": 42208, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37440, "uid": "fb51bdc4b99e2bd881862c8f76d404d4", "name": "A Polynomial Chaos Framework for Causal Discovery in Nonlinear Uncertain Systems", "authors": [{"id": 183372, "fullname": "Liang Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/183372?format=json", "institution": "University of British Columbia"}], "abstract": "In safety-critical industrial applications, accurately identifying causal relationships and quantifying uncertainty is essential for tasks such as root cause analysis, feature selection, and process optimization. Traditional causal discovery methods inadequately handle nonlinearities and complex uncertainties prevalent in industrial sensor data. To address this, we introduce a novel causal discovery framework that integrates Polynomial Chaos Expansion (PCE) representations of stochastic noise into structural equations. This method effectively captures complex nonlinear couplings and arbitrary noise distributions characteristic of industrial data. We rigorously prove the identifiability of causal structures under mild sparsity conditions on the chaos coefficients, significantly extending classical linear non-Gaussian acyclic model (LiNGAM) identifiability results. Extensive experiments on real-world industrial dataset demonstrate superior accuracy, robustness under extreme non-Gaussian noise conditions, and practical uncertainty quantification. This framework presents a principled, interpretable, and computationally feasible approach to causal analysis in nonlinear uncertain industrial environments.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37440", "url": null, "sourceid": 34781, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37442, "uid": "879076927f643b14ada250ec07ef4a88", "name": "SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval", "authors": [{"id": 101305, "fullname": "Ruixiang Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/101305?format=json", "institution": "Renmin University of China"}, {"id": 187456, "fullname": "Zhihao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187456?format=json", "institution": "Renmin University of China"}, {"id": 126116, "fullname": "Bangxiang Lan", "url": "http://cvpr.thecvf.com/api/miniconf/users/126116?format=json", "institution": "Renmin University of China"}, {"id": 170737, "fullname": "Zijie Xin", "url": "http://cvpr.thecvf.com/api/miniconf/users/170737?format=json", "institution": "Renmin University of China"}, {"id": 187457, "fullname": "Jingyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187457?format=json", "institution": "Renmin University of China"}, {"id": 107408, "fullname": "Xirong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/107408?format=json", "institution": "Renmin University of China"}], "abstract": "For video-text retrieval, the use of CLIP has been a *de facto* standard. However, as CLIP provides only image and text encoders, this consensus has led to a biased paradigm that entirely ignores the sound track of videos. While several attempts have been made to reintroduce audio -- typically by incorporating an audio encoder and fusing its output with visual features -- these methods face two challenges: **ineffective representation of speech content** and **suboptimal vision-audio fusion**. To address these issues jointly, we propose **SAVE**, a **S**peech **A**ware **V**ideo r**E**presentation learning  method. SAVE improves upon AVIGATE, a SOTA audiovisual method, with a dedicated speech branch for more effective speech embedding. Furthermore, we introduce soft-ALBEF for early vision-audio alignment that facilitates fusion. Extensive experiments on five benchmarks show that SAVE compares favorably against the SOTA, outperforming AVIGATE by +4.1% on MSRVTT-1k, +1.9% on MSRVTT-3k, +2.5% on VATEX, +9.8% on Charades, and +2.1% on LSMDC, in light of the SumR metric.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37442", "url": null, "sourceid": 43996, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37447, "uid": "a05feb5ea711fae86b501c3dcc77c00c", "name": "No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection", "authors": [{"id": 180628, "fullname": "Zunkai Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/180628?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 88809, "fullname": "Ke Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88809?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 187469, "fullname": "JIAJIA LIU", "url": "http://cvpr.thecvf.com/api/miniconf/users/187469?format=json", "institution": null}, {"id": 187470, "fullname": "Jie Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187470?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 187471, "fullname": "Yuanyuan Qiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187471?format=json", "institution": "Beijing University of Posts and Telecommunications"}], "abstract": "The collection and detection of video anomaly data has long been a challenging problem due to its rare occurrence and spatio-temporal scarcity. Existing video anomaly detection (VAD) methods under perform in open-world scenarios. Key contributing factors include limited dataset diversity, and inadequate understanding of context-dependent anomalous semantics. To address these issues, i) we propose LAVIDA, an end-to-end zero-shot video anomaly detection framework. ii) LAVIDA employs an Anomaly Exposure Sampler that transforms segmented objects into pseudo-anomalies to enhance model adaptability to unseen anomaly categories. It further integrates a Multimodal Large Language Model (MLLM) to bolster semantic comprehension capabilities. Additionally, iii) we design a token compression approach based on reverse attention to handle the spatio-temporal scarcity of anomalous patterns and decrease computational cost. The training process is conducted solely on pseudo anomalies without any VAD data. Evaluations across four benchmark VAD datasets demonstrate that LAVIDA achieves SOTA performance in both frame-level and pixel-level anomaly detection under the zero-shot setting. Our code is available in the supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37447", "url": null, "sourceid": 45041, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37448, "uid": "6fec8c1ba9b9aa79bd26e09fc8aae3eb", "name": "ProFocus: Proactive Perception and Focused Reasoning in Vision-and-Language Navigation", "authors": [{"id": 174715, "fullname": "Wei Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/174715?format=json", "institution": "Fudan university"}, {"id": 91010, "fullname": "Mingcheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/91010?format=json", "institution": "Fudan University"}, {"id": 153724, "fullname": "Xuecheng Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153724?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 126844, "fullname": "Jingqun Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126844?format=json", "institution": "Bytedance"}, {"id": 91003, "fullname": "Dingkang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91003?format=json", "institution": "Fudan University"}, {"id": 91007, "fullname": "Lihua Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91007?format=json", "institution": "Fudan University"}], "abstract": "Vision-and-Language Navigation (VLN) requires agents to accurately perceive complex visual environments and reason over navigation instructions and histories. However, existing methods passively process redundant visual inputs and treat all historical contexts indiscriminately, resulting in inefficient perception and unfocused reasoning. To address these challenges, we propose ProFocus, a training-free progressive framework that unifies Proactive Perception and Focused Reasoning through collaboration between large language models (LLMs) and vision-language models (VLMs). For proactive perception, ProFocus transforms panoramic observations into structured ego-centric semantic maps, enabling the orchestration agent to identify missing visual information needed for reliable decision-making, and to generate targeted visual queries with corresponding focus regions that guide the perception agent to acquire the required observations. For focused reasoning, we propose Branch-Diverse Monte Carlo Tree Search (BD-MCTS) to identify top-k high-value waypoints from extensive historical candidates. The decision agent focuses reasoning on the historical contexts associated with these waypoints, rather than considering all historical waypoints equally. Extensive experiments validate the effectiveness of ProFocus, achieving state-of-the-art performance among zero-shot methods on R2R and REVERIE benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37448", "url": null, "sourceid": 42078, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37449, "uid": "6f3249aa304055d63828af3bfab778f6", "name": "Envisioning the Future, One Step at a Time", "authors": [{"id": 153587, "fullname": "Stefan Andreas Baumann", "url": "http://cvpr.thecvf.com/api/miniconf/users/153587?format=json", "institution": "CompVis        LMU Munich"}, {"id": 170371, "fullname": "Jannik Wiese", "url": "http://cvpr.thecvf.com/api/miniconf/users/170371?format=json", "institution": "LMU Munich"}, {"id": 187472, "fullname": "Tommaso Martorella", "url": "http://cvpr.thecvf.com/api/miniconf/users/187472?format=json", "institution": "Ludwig-Maximilians-Universit\u00e4t M\u00fcnchen"}, {"id": 95854, "fullname": "Mahdi Kalayeh", "url": "http://cvpr.thecvf.com/api/miniconf/users/95854?format=json", "institution": "Netflix"}, {"id": 85132, "fullname": "Bj\u00f6rn Ommer", "url": "http://cvpr.thecvf.com/api/miniconf/users/85132?format=json", "institution": "University of Munich"}], "abstract": "Accurately anticipating how complex, open-world scenes will evolve requires models that represent uncertainty, simulate along extended interaction chains, and efficiently explore many plausible futures. Yet most existing approaches rely on dense video or latent-space prediction, expending substantial capacity on dense appearance rather than on the underlying sparse trajectories of points in the scene. This makes large-scale exploration of future hypotheses costly and limits performance when long-horizon, multi-modal motion is essential. We address this by formulating the prediction of open-set future scene dynamics as step-wise inference over sparse point trajectories. Our autoregressive diffusion model advances these trajectories through short, locally predictable transitions, explicitly modeling the growth of uncertainty over time. This dynamics-centric representation enables fast rollout of thousands of diverse futures from a single image, optionally guided by initial constraints on motion, while maintaining physical plausibility and long-range coherence.We further introduce OWM, a benchmark for open-world motion prediction based on diverse in-the-wild videos, to evaluate accuracy and variability of predicted trajectory distributions under real-world uncertainty. Our method matches or surpasses dense simulators in predictive accuracy while achieving orders-of-magnitude higher sampling speed, making open-world future prediction both scalable and practical.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37449", "url": null, "sourceid": 31337, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37450, "uid": "5a1c75edbbb57641d5479f233810a798", "name": "From Softmax to Dirichlet: Evidential Learning for Semi-supervised Semantic Segmentation", "authors": [{"id": 70643, "fullname": "Huayu Mai", "url": "http://cvpr.thecvf.com/api/miniconf/users/70643?format=json", "institution": "University of Science and Technology of China"}, {"id": 86490, "fullname": "Rui Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/86490?format=json", "institution": "University of Science and Technology of China"}, {"id": 185768, "fullname": "Yujia Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185768?format=json", "institution": "University of Science and Technology of China"}, {"id": 153463, "fullname": "Wangkai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/153463?format=json", "institution": "University of Science and Technology of China"}, {"id": 187473, "fullname": "Bingzhou Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187473?format=json", "institution": "University of Science and Technology of China"}, {"id": 187474, "fullname": "Aibing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187474?format=json", "institution": "University of Science and Technology of China"}, {"id": 187475, "fullname": "Zhangyu He", "url": "http://cvpr.thecvf.com/api/miniconf/users/187475?format=json", "institution": "University of Science and Technology of China"}, {"id": 87373, "fullname": "Yuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87373?format=json", "institution": "University of Science and Technology of China"}], "abstract": "The critical challenge of semi-supervised semantic segmentation lies in how to fully exploit a large volume of unlabeled data to improve the model's generalization performance for robust segmentation. However, existing softmax scores-based filtering methods tend to be affected by the overconfidence issue in neural networks, leading to the inclusion of incorrect pseudo-labels that negatively impact the training process. In this paper, we propose a novel evidential learning framework to explicitly model the prediction uncertainty for reliable pseudo-label selection. By modeling the distribution of class probabilities using Dirichlet distributions, we obtain principled and improved uncertainty estimates from a distributional perspective. Furthermore, we propose HESS (Hyper-ESS), decoupling the modeling of exclusive and collective evidence for comprehensive evidence perception, to yield more accurate uncertainty estimates. Extensive experiments on three challenging benchmarks demonstrate that integrating HESS into existing semi-supervised semantic segmentation frameworks consistently improves performance, benefiting from more reliable pseudo-label selection. Our work sheds light on the potential of evidential learning in semi-supervised semantic segmentation and opens up new avenues for future research. Code and models will be made available to facilitate future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37450", "url": null, "sourceid": 31168, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37451, "uid": "e8d22590eca2d4cdca8ab4f23a21bb93", "name": "Face-Guided Sentiment Boundary Enhancement for Weakly-Supervised Temporal Sentiment Localization", "authors": [{"id": 187476, "fullname": "CailingHan CailingHan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187476?format=json", "institution": "Hefei University of Technology"}, {"id": 187477, "fullname": "Zhangbin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187477?format=json", "institution": "Hefei University of Technology"}, {"id": 87055, "fullname": "Jinxing Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/87055?format=json", "institution": "Hefei University of Technology"}, {"id": 182656, "fullname": "Wei Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/182656?format=json", "institution": "Hefei University of Technol"}, {"id": 152784, "fullname": "Jingjing Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152784?format=json", "institution": "Hefei University of Technology"}, {"id": 187478, "fullname": "Yanghao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/187478?format=json", "institution": "National University of Singapore"}, {"id": 187480, "fullname": "Zhangling Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187480?format=json", "institution": "Hefei Comprehensive National Science Center"}, {"id": 128503, "fullname": "Dan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/128503?format=json", "institution": "Hefei University of Technology"}], "abstract": "Point-level weakly-supervised temporal sentiment localization (P-WTSL) aims to detect sentiment-relevant segments in untrimmed multimodal videos using timestamp sentiment annotations, which greatly reduces the costly frame-level labeling. To tackle the intrinsic challenges of imprecise sentiment boundaries in P-WTSL, we propose the Face-guided Sentiment Boundary Enhancement Network FSENet, a unified framework that leverages fine-grained facial features to guide sentiment localization. Specifically, our approach first introduces the Face-guided Sentiment Discovery (FSD) module, which integrates facial features into multimodal interaction via dual-branch modeling for effective sentiment stimuli clues; We then propose the Point-aware Sentiment Semantics Contrast (PSSC) strategy to discriminate sentiment semantics of candidate points (frame-level) near annotation points via contrastive learning, thereby enhancing the model's ability to recognize sentiment boundaries. At last, we design the Boundary-aware Sentiment Pseudo-label Generation (BSPG) approach to convert sparse point annotations into temporally smooth supervisory pseudo-labels. Extensive experiments and visualizations on the benchmark demonstrate the effectiveness of our framework, achieving state-of-the-art performance under full supervision, video-level, and point-level weak supervision, thereby showcasing the strong generalization ability of our FSENet across different annotation settings. The code will be open source to the public.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37451", "url": null, "sourceid": 43780, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37456, "uid": "62e81b7815b24e46b69fcfa197aea837", "name": "TrafficAlign: Aligning Large Language Models for Traffic Scenario Generation", "authors": [{"id": 128145, "fullname": "Zhi Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128145?format=json", "institution": "Purdue University"}, {"id": 187494, "fullname": "Liangkun Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187494?format=json", "institution": null}, {"id": 128136, "fullname": "Tianyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128136?format=json", "institution": "Purdue University"}], "abstract": "Recent research has investigated the use of large language models (LLMs) to generate traffic scenarios for autonomous driving. However, pretrained LLMs often fail to align with real-world traffic distributions. In this work, we present TrafficAlign, an automated framework that synthesizes traffic scenarios based on real-world driving videos, performs data validation, and aligns LLMs with the synthesized scenarios. The evaluation shows that traffic scenarios generated by TrafficAlign are highly effective, revealing up to 10.8% more collisions on average across three autonomous driving models than state-of-the-art methods. Furthermore, fine-tuning these driving models with TrafficAlign-generated scenarios significantly reduced collision rates by 36.1% compared with the original models. A qualitative study using traffic datasets from six geographically diverse regions shows that TrafficAlign-generated scenarios exhibit strong alignment with corresponding traffic distributions in these regions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37456", "url": null, "sourceid": 39991, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37460, "uid": "9c7660b8dc616c5d3392c0c1c14c2245", "name": "MRI Contrast Enhancement Kinetics World Model", "authors": [{"id": 187503, "fullname": "Jindi Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187503?format=json", "institution": "Case Western Reserve University"}, {"id": 182546, "fullname": "Yuting He", "url": "http://cvpr.thecvf.com/api/miniconf/users/182546?format=json", "institution": "Case Western Reserve University"}, {"id": 187504, "fullname": "Cong Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/187504?format=json", "institution": "Nanjing Medical University"}, {"id": 187505, "fullname": "YWUSO YWUSO", "url": "http://cvpr.thecvf.com/api/miniconf/users/187505?format=json", "institution": "Southeast University"}, {"id": 90856, "fullname": "Shuo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/90856?format=json", "institution": "Case Western Reserve University"}], "abstract": "Clinical MRI contrast acquisition suffers from inefficient information yield, which presents as a mismatch between the risky and costly acquisition protocol and fixed and sparse acquisition sequence. Applying world models to simulate the contrast enhancement kinetics in human body enables continuous contrast-free dynamics. However, the low temporal resolution in MRI acquisition restricts the training of world models, leading to the sparsely sampled dataset. Directly training a generative model to capture the kinetics leads to two limitations: (a) Due to the absence data on missing time, the model tends to overfit to irrelevant features, leading to content distortion. (b) Due to the lack of continuous temporal supervision, the model fails to learn the continuous kinetics law over time,  causing temporal discontinuities. For the first time, we propose MRI Contrast Enhancement Kinetics World model (MRI CEKWorld) with SpatioTemporal Consistency Learning (STCL). For (a), guided by spatial law that patient-level structures remain consistent during enhancement, we propose Latent Alignment Learning (LAL) that constructs a patient-specific template to constrain contents to align with this template. For (b), guided by the temporal law that the kinetics follows a consistent smooth trend, we propose Latent Difference Learning (LDL) which extends the unobserved intervals by interpolation and constrain smooth variations in the latent space among interpolated sequence. Extensive experiments on two datasets show our MRI CEKWorld achieves better realistic contents and kinetics. Codes will be available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37460", "url": null, "sourceid": 39422, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37461, "uid": "bae38661a27f6228ba38c36e766ed769", "name": "HiDRA: Hierarchical Degradation Representation and Adaptation with Generative Priors for Enhancing Infrared Vision", "authors": [{"id": 180032, "fullname": "Zihang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/180032?format=json", "institution": "Dalian University of Technology"}, {"id": 152574, "fullname": "Zhu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152574?format=json", "institution": "Dalian University of Technology"}, {"id": 187506, "fullname": "Changbo Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187506?format=json", "institution": "Dalian University of Technology"}, {"id": 152576, "fullname": "Jinyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152576?format=json", "institution": "Dalian University of Technology"}, {"id": 131737, "fullname": "Risheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131737?format=json", "institution": "Dalian University of Technology"}], "abstract": "Thermal infrared (TIR) imaging enables robust perception in adverse conditions. However, it often suffers from complex degradations (\\textit{e.g.,} fixed-pattern noise and low-resolution) due to sensor limitations and environmental dynamics. Existing methods, whether traditional or learning-based, easily fail under composite and varying degradation. Pre-trained generative models showcase powerful capabilities for alleviating degradations but lack effective tools to adapt visible generative priors to TIR-specific characteristics. To overcome these challenges, we propose a Hierarchical Degradation Representation and Adaptation (HiDRA) framework to decompose the enhancement procedure into degradation representation estimation and generative model fine-tuning. The degradation representation estimation aims to disentangle TIR degradation patterns, which then guide the parameter adaptation for thermal image enhancement. Additionally, we introduce a hierarchical adaptation solution that aggregates learning across varying degradation levels, further improving robustness under various scenarios. Experiments across diverse types and degrees demonstrate the robustness of our approach and further validate its effectiveness on downstream tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37461", "url": null, "sourceid": 42089, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37462, "uid": "4dcfeaf97d0aec9f35e384d9c3624b39", "name": "Global-Aware Edge Prioritization for Pose Graph Initialization", "authors": [{"id": 181807, "fullname": "Tong Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/181807?format=json", "institution": "Czech technical university in Prague, Faculty of Eletrical Engineering"}, {"id": 73917, "fullname": "Giorgos Tolias", "url": "http://cvpr.thecvf.com/api/miniconf/users/73917?format=json", "institution": "CTU in Prague"}, {"id": 75899, "fullname": "Jiri Matas", "url": "http://cvpr.thecvf.com/api/miniconf/users/75899?format=json", "institution": "Czech Technical University, Prague"}, {"id": 74200, "fullname": "Daniel Barath", "url": "http://cvpr.thecvf.com/api/miniconf/users/74200?format=json", "institution": "ETHZ - ETH Zurich"}], "abstract": "The pose graph is a core component of Structure-from-Motion (SfM), where images act as nodes and edges encode relative poses. Since geometric verification is expensive, SfM pipelines restrict the pose graph to a sparse set of candidate edges, making initialization critical. Existing methods rely on image retrieval to connect each image to its $k$ nearest neighbors, treating pairs independently and ignoring global consistency. We address this limitation through the concept of edge prioritization, ranking candidate edges by their utility for SfM. Our approach has three components: (1) a GNN trained with SfM-derived supervision to predict globally consistent edge reliability; (2) multi-minimal-spanning-tree-based pose graph construction guided by these ranks; and (3) connectivity-aware score modulation that reinforces weak regions and reduces graph diameter. This globally informed initialization yields more reliable and compact pose graphs, improving reconstruction accuracy in sparse and high-speed settings and outperforming SOTA retrieval methods on ambiguous scenes. Code and models will be released.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37462", "url": null, "sourceid": 41144, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40302?format=json"], "related_events_ids": [40302]}, {"id": 40302, "uid": "4dcfeaf97d0aec9f35e384d9c3624b39", "name": "Global-Aware Edge Prioritization for Pose Graph Initialization", "authors": [{"id": 181807, "fullname": "Tong Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/181807?format=json", "institution": "Czech technical university in Prague, Faculty of Eletrical Engineering"}, {"id": 73917, "fullname": "Giorgos Tolias", "url": "http://cvpr.thecvf.com/api/miniconf/users/73917?format=json", "institution": "CTU in Prague"}, {"id": 75899, "fullname": "Jiri Matas", "url": "http://cvpr.thecvf.com/api/miniconf/users/75899?format=json", "institution": "Czech Technical University, Prague"}, {"id": 74200, "fullname": "Daniel Barath", "url": "http://cvpr.thecvf.com/api/miniconf/users/74200?format=json", "institution": "ETHZ - ETH Zurich"}], "abstract": "The pose graph is a core component of Structure-from-Motion (SfM), where images act as nodes and edges encode relative poses. Since geometric verification is expensive, SfM pipelines restrict the pose graph to a sparse set of candidate edges, making initialization critical. Existing methods rely on image retrieval to connect each image to its $k$ nearest neighbors, treating pairs independently and ignoring global consistency. We address this limitation through the concept of edge prioritization, ranking candidate edges by their utility for SfM. Our approach has three components: (1) a GNN trained with SfM-derived supervision to predict globally consistent edge reliability; (2) multi-minimal-spanning-tree-based pose graph construction guided by these ranks; and (3) connectivity-aware score modulation that reinforces weak regions and reduces graph diameter. This globally informed initialization yields more reliable and compact pose graphs, improving reconstruction accuracy in sparse and high-speed settings and outperforming SOTA retrieval methods on ambiguous scenes. Code and models will be released.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40302", "url": null, "sourceid": -41144, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37462?format=json"], "related_events_ids": [37462]}, {"id": 37471, "uid": "0319d8c48c991a68588b363d776d2720", "name": "OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens", "authors": [{"id": 187533, "fullname": "Yiying Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187533?format=json", "institution": "Fudan University"}, {"id": 88826, "fullname": "Wei Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/88826?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 89775, "fullname": "Sijin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89775?format=json", "institution": "Fudan University"}, {"id": 187534, "fullname": "Honghao Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187534?format=json", "institution": "The University of Queensland"}, {"id": 126775, "fullname": "Xianfang Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/126775?format=json", "institution": "Tencent PCG"}, {"id": 153126, "fullname": "Yujun Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/153126?format=json", "institution": "The University of Queensland"}, {"id": 87502, "fullname": "Gang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87502?format=json", "institution": "Tencent"}, {"id": 90424, "fullname": "Xingjun Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/90424?format=json", "institution": "Deakin University"}], "abstract": "OmniLottie is a versatile framework that generates high-quality vector animations from multi-modal instructions, including interleaved texts, images, and videos. To fully parameterize vector animations for flexible motion and visual content control, we seek help from the Lottie representation, which encodes both shapes and animated behaviors in a single JSON file. Building upon a pretrained vision\u2013language model (VLM), OmniLottie produces vivid, semantically aligned vector animations that adhere closely to multi-modal conditions. To avoid the complexity and irregularity of raw JSON structures, we introduce a dedicated Lottie tokenizer that transforms Lottie files into structured sequences of function calls representing shapes, animation commands, and their parameters.  This design enables the model to directly learn the underlying shape and animation priors from data, substantially improving generation stability and controllability. To further advance research in vector animation generation, we curate MMLottie-2M, a large-scale dataset of professionally designed vector animations paired with textual and visual annotations. Leveraging the well-designed tokenizer and our newly established dataset, OmniLottie demonstrates strong multi-modal conditional generation capabilities using a simple next-token prediction objective. For qualitative results, please refer to the generated animations rendered through standard Lottie players on the supplementary website.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37471", "url": null, "sourceid": 36796, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37476, "uid": "a6a6747117519d0d5a91b57b5d63db32", "name": "MIBURI: Towards Expressive Interactive Gesture Synthesis", "authors": [{"id": 89508, "fullname": "M. Hamza Mughal", "url": "http://cvpr.thecvf.com/api/miniconf/users/89508?format=json", "institution": "Max-Planck Institute for Informatics"}, {"id": 89511, "fullname": "Rishabh Dabral", "url": "http://cvpr.thecvf.com/api/miniconf/users/89511?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 154543, "fullname": "Vera Demberg", "url": "http://cvpr.thecvf.com/api/miniconf/users/154543?format=json", "institution": "Saarland University"}, {"id": 75465, "fullname": "Christian Theobalt", "url": "http://cvpr.thecvf.com/api/miniconf/users/75465?format=json", "institution": "MPI Informatik"}], "abstract": "Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Current large language model (LLM)\u2013based conversational agents lack embodiment and the expressive gestures essential for natural interaction. Existing solutions for ECAs either produce rigid, low-diversity motions, that are unsuitable for human-like interaction. Alternatively, generative methods for co-speech gesture synthesis yield natural body gestures but depend on future speech context and require long run-times. To bridge this gap, we present MIBURI, an online, causal framework for generating expressive co-speech gestures and facial expressions synchronized with real-time spoken dialogue. We first employ body-part aware gesture codecs that encode hierarchical motion details into multi-level discrete tokens. These tokens are then autoregressively generated by a two-dimensional causal framework conditioned on LLM-based speech-text embeddings, modeling both temporal dynamics and part-level motion hierarchy in real time. Further, we introduce contrastive objectives to encourage expressive and diverse gestures while preventing convergence to static poses. Comparative evaluations demonstrate that our causal and real-time approach produces natural and contextually aligned gestures against recent baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37476", "url": null, "sourceid": 44826, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37481, "uid": "12967cc2a03871bd9eef46ed6da69398", "name": "UniCorn: Unified Correspondence Transformer Across 2D and 3D", "authors": [{"id": 183989, "fullname": "Prajnan Goswami", "url": "http://cvpr.thecvf.com/api/miniconf/users/183989?format=json", "institution": "Northeastern University"}, {"id": 180407, "fullname": "Tianye Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/180407?format=json", "institution": "Northeastern University"}, {"id": 88403, "fullname": "Feng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88403?format=json", "institution": "Adobe Systems"}, {"id": 127396, "fullname": "Huaizu Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127396?format=json", "institution": "Northeastern University"}], "abstract": "Visual correspondence across image-to-image (2D-2D), image-to-point cloud (2D-3D), and point cloud-to-point cloud (3D-3D) geometric matching forms the foundation for numerous 3D vision tasks. Despite sharing a similar problem structure, current methods use task-specific designs with separate models for each modality combination. We present UniCorn, the first correspondence model with shared weights that unifies geometric matching across all three tasks. Our key insight is that Transformer attention naturally captures cross-modal feature similarity. We propose a dual-stream decoder that maintains separate appearance and positional feature streams. This design enables end-to-end learning through stack-able layers while supporting flexible query-based correspondence estimation across heterogeneous modalities. Our architecture employs modality-specific backbones followed by shared encoder and decoder components, trained jointly on diverse data combining pseudo point clouds from depth maps with real 3D correspondence annotations. UniCorn achieves competitive performance on 2D-2D matching and surpasses prior state-of-the-art by 8\\% on 7Scenes (2D-3D) and 10\\% on 3DLoMatch (3D-3D) in registration recall. Code and model checkpoints will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37481", "url": null, "sourceid": 32118, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37482, "uid": "f00e84df4566ae8d44aa1927f7ab6092", "name": "Stealing Split Learning Bottom Models by Recovering Embedding Geometry", "authors": [{"id": 182482, "fullname": "Qinbo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182482?format=json", "institution": "Stony Brook University"}, {"id": 187561, "fullname": "Yanhang Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/187561?format=json", "institution": "State University of New York at Stony Brook"}, {"id": 187562, "fullname": "Ziyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187562?format=json", "institution": "State University of New York at Stony Brook"}, {"id": 187563, "fullname": "Hao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187563?format=json", "institution": "Stevens Institute of Technology"}, {"id": 152396, "fullname": "Sai Qian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152396?format=json", "institution": "New York University"}, {"id": 187564, "fullname": "Jian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187564?format=json", "institution": "Stony Brook University"}], "abstract": "Vertical federated learning (VFL) trains models by splitting computation across clients and a server that only exchange intermediate embeddings. Recent work shows that a server even if honest-but-curious can steal a client\u2019s bottom model by querying the system and regressing on the returned embeddings, and in response, defenses perturb or decouple the embedding channel. We show these defenses remain vulnerable. We propose VENOM, a geometry-aware stealing attack. VENOM first learns a contrastive space over server-observed embeddings, then builds a neighborhood graph and trains a surrogate bottom model to match targets and respect local geometry via a neighbor-matching loss  alongside pointwise and feature-shape alignment. This strategy preserves the relational structure that defenses fail to erase, effectively recoupling the embeddings produced by multi-branch and noise-based defenses. Across six datasets, VENOM consistently outperforms standard stealing methods under no defense and multiple defenses,  and remains effective with out-of-distribution (OOD) auxiliary data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37482", "url": null, "sourceid": 38634, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37486, "uid": "bc688aa94a48bc92798b48702596db51", "name": "Divide and Conquer: Object Co-occurrence Helps Mitigate Simplicity Bias in OOD Detection", "authors": [{"id": 184252, "fullname": "Boyang Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/184252?format=json", "institution": "The University of Hong Kong"}, {"id": 158540, "fullname": "Chaoqi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/158540?format=json", "institution": "Shenzhen University"}, {"id": 89008, "fullname": "Yizhou Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89008?format=json", "institution": "The University of Hong Kong"}], "abstract": "Out-of-distribution (OOD) detection is crucial for ensuring the reliability of deep learning models. Existing methods mostly focus on regular entangled representations to discriminate in-distribution (ID) and OOD data, neglecting the rich contextual information within images. This issue is particularly challenging for detecting near-OOD, as models with simplicity bias struggle to learn discriminative features in disentangled representations. The human visual system can use the co-occurrence of objects in the natural environment to facilitate scene understanding. Inspired by this, we propose an Object-Centric OOD detection framework that learns to capture Object CO-occurrence (OCO) patterns within images. The proposed method introduces a new OOD detection paradigm that understands object co-occurrence within an image by predicting disentangled representations for the test sample, then adaptively divides patterns into three scenarios based on object co-occurrence patterns observed in ID training data, and finally performs OOD detection in a divide-and-conquer manner. By doing so, OCO can distinguish near-OOD by considering the semantic contextual relationships present in their images, avoiding the tendency to focus solely on simple, easily learnable regions. We evaluate OCO through experiments across challenging and full-spectrum OOD settings, demonstrating competitive results and confirming its ability to address both semantic and covariate shifts.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37486", "url": null, "sourceid": 43961, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37487, "uid": "a673b3cb54c0924a8ba7b141f432f7f3", "name": "Latent Diffusion Inversion Requires Understanding the Latent Space", "authors": [{"id": 148652, "fullname": "Mingxing Mingxing", "url": "http://cvpr.thecvf.com/api/miniconf/users/148652?format=json", "institution": "Vanderbilt University"}, {"id": 187571, "fullname": "Bowen Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187571?format=json", "institution": "Vanderbilt University"}, {"id": 150986, "fullname": "Daniel Moyer", "url": "http://cvpr.thecvf.com/api/miniconf/users/150986?format=json", "institution": "Vanderbilt University"}], "abstract": "The recovery of training data from generative models (``model inversion'') has been extensively studied for diffusion models in the data domain. The encoder/decoder pair and corresponding latent codes have largely been ignored by inversion techniques applied to latent space generative models, e.g., Latent Diffusion models (LDMs). In this work we describe two key findings: (1) The diffusion model exhibits non-uniform memorization across latent codes, tending to overfit samples located in high-distortion regions of the decoder pullback metric. (2) Even within a single latent code, different dimensions contribute unequally to memorization. We introduce a principled method to rank latent dimensions by their per-dimensional contribution to the decoder pullback metric, identifying those most responsible for memorization. Empirically, removing less-memorizing dimensions when computing attack statistics for score-based membership inference attacker significantly improves performance, with average AUROC gains of 2.7\\% and substantial increases in TPR@1\\%FPR (6.42\\%) across diverse datasets including CIFAR-10, CelebA, ImageNet-1K, Pok\u00e9mon, MS-COCO, and Flickr. This indicates stronger confidence in identifying members under extremely low false-positive tolerance. Our results highlight the overlooked influence of the auto-encoder geometry on LDM memorization and provide a new perspective for analyzing privacy risks in diffusion-based generative models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37487", "url": null, "sourceid": 36550, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37485, "uid": "efe655d620d2d3d55ab8b2b6c86a945d", "name": "Masked Auto-Regressive Variational Acceleration: Fast Inference Makes Practical Reinforcement Learning", "authors": [{"id": 187568, "fullname": "Yuxuan Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187568?format=json", "institution": "Peking University"}, {"id": 187569, "fullname": "Weimin Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/187569?format=json", "institution": "Peking University"}, {"id": 187570, "fullname": "Yifei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187570?format=json", "institution": "Rice University"}, {"id": 153196, "fullname": "Weijian Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/153196?format=json", "institution": "Peking University"}, {"id": 73505, "fullname": "He Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/73505?format=json", "institution": "Peking University"}], "abstract": "Masked auto-regressive diffusion models (MAR) benefit from the expressive modeling ability of diffusion models and the flexibility of masked auto-regressive ordering. However, vanilla MAR suffers from slow inference due to its hierarchical inference mechanism: an outer AR unmasking loop and an inner diffusion denoising chain. Such decoupled structure not only harm the generation efficiency but also hinder the practical use of MAR for reinforcement learning (RL), an increasingly critical paradigm for generative model post-training.To address this fundamental issue, we introduce MARVAL (Masked Auto-regressive Variational Acceleration), a distillation-based framework that compresses the diffusion chain into a single AR generation step while preserving the flexible auto-regressive unmasking order. Such a distillation with MARVAL not only yields substantial inference acceleration but, crucially, makes RL post-training with verifiable rewards practical, resulting in scalable yet human-preferred fast generative models. Our contributions are twofold: (1) a novel score-based variational objective for distilling masked auto-regressive diffusion models into a single generation step without sacrificing sample quality; and (2) an efficient RL framework for masked auto-regressive models via MARVAL-RL. On ImageNet 256\u00d7256, MARVAL-Huge achieves an FID of \\textbf{2.00} with more than \\textbf{30 times speedup} compared with MAR-diffusion, and MARVAL-RL yields consistent improvements in CLIP and image-reward scores on ImageNet datasets with entity names. In conclusion, MARVAL demonstrates the first practical path to distillation and RL of masked auto-regressive diffusion models, enabling fast sampling and better preference alignments.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37485", "url": null, "sourceid": 42179, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37492, "uid": "fd537f53e8f93d331a3cf6a0f5f1e748", "name": "Captain Safari: A Real-time World Engine", "authors": [{"id": 181449, "fullname": "Yu-Cheng Chou", "url": "http://cvpr.thecvf.com/api/miniconf/users/181449?format=json", "institution": "NVIDIA / JHU"}, {"id": 155330, "fullname": "Xingrui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155330?format=json", "institution": "Department of Computer Science, Whiting School of Engineering"}, {"id": 187577, "fullname": "Yitong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187577?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 130079, "fullname": "Jiahao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130079?format=json", "institution": "Johns Hopkins University"}, {"id": 187578, "fullname": "Hanting Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187578?format=json", "institution": "Johns Hopkins University"}, {"id": 75526, "fullname": "Cihang Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/75526?format=json", "institution": "University of California, Santa Cruz"}, {"id": 84745, "fullname": "Alan L. Yuille", "url": "http://cvpr.thecvf.com/api/miniconf/users/84745?format=json", "institution": "Johns Hopkins University"}, {"id": 84770, "fullname": "Junfei Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84770?format=json", "institution": "Johns Hopkins University"}], "abstract": "World engines aim to synthesize long, 3D-consistent videos that support interactive exploration of a scene under user-controlled camera motion. However, existing systems struggle under aggressive 6-DoF trajectories and complex outdoor layouts: they lose long-range geometric coherence, deviate from the target path, or collapse into overly conservative motion. To this end, we introduce Captain Safari, a pose-conditioned world engine that generates videos by retrieving from a persistent world memory. Given a camera path, our method maintains a dynamic local memory and uses a retriever to fetch pose-aligned world tokens, which then condition video generation along the trajectory. This design enables the model to maintain stable 3D structure while accurately executing challenging camera maneuvers.To evaluate this setting, we curate OpenSafari, a new in-the-wild FPV dataset containing high-dynamic drone videos with verified camera trajectories, constructed through a multi-stage geometric and kinematic validation pipeline. Across video quality, 3D consistency, and trajectory following, Captain Safari substantially outperforms state-of-the-art camera-controlled generators. It reduces MEt3R from 0.3703 to 0.3690, improves AUC@30 from 0.181 to 0.200, and yields substantially lower FVD than all camera-controlled baselines. In a 50-participant human study, \\textbf{67.6\\%} of preferences favor our method across all axes. Our results demonstrate that pose-conditioned world memory is a powerful mechanism for long-horizon, controllable video generation and provide \\emph{OpenSafari} as a challenging new benchmark for future world-engine research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37492", "url": null, "sourceid": 30889, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37493, "uid": "5b5920d06d8fef2496244d5684bdbf1c", "name": "AXG-Reasoner: Error Detection and Explanation in Long Task Videos with Vision\u2013Language Models", "authors": [{"id": 131961, "fullname": "Shih-Po Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/131961?format=json", "institution": "Northeastern University"}, {"id": 92940, "fullname": "Ehsan Elhamifar", "url": "http://cvpr.thecvf.com/api/miniconf/users/92940?format=json", "institution": "Northeastern University"}], "abstract": "Virtual task assistants must recognize and explain users\u2019 mistakes to provide effective and corrective guidance. In this paper, we address the problem of error reasoning in long task videos, which is to detect and explain errors. Although recent Vision\u2013Language Models (VLMs) demonstrate strong capabilities in visual question answering, they struggle to attend to the sparse spatiotemporal cues associated with errors in long task videos. We introduce an error reasoning framework, AXG-Reasoner, that leverages a frozen VLM in conjunction with a proposed Action eXecution Graph (AXG) and a temporal action segmentation (TAS) model, obtained and learned from normal (error-free) videos. To enable VLMs to attend to the sparse spatiotemporal cues associated with errors, we decompose each action segment of the video, obtained by TAS, into a sequence of fine-grained subactions by aligning it with the AXG. For each subaction segment, we query the VLM using a small number of keyframes and enhanced prompts to detect and explain errors, enabling efficient inference. To avoid costly manual subaction annotations, we develop a method to automatically construct AXG from training videos using foundation models. Extensive experiments on EgoPER and CaptainCook4D show that our method consistently improves over VLM baselines in error explanation by effectively identifying spationtemporal cues and achieves state-of-the-art performance in error detection.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37493", "url": null, "sourceid": 46406, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37499, "uid": "8093f823794b8fa03c379c035300fd0b", "name": "MeshFlow: Efficient Artistic Mesh Generation via MeshVAE and Flow-based DiTs", "authors": [{"id": 73844, "fullname": "Weiyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/73844?format=json", "institution": "Shandong University"}, {"id": 129874, "fullname": "Antoine Toisoul", "url": "http://cvpr.thecvf.com/api/miniconf/users/129874?format=json", "institution": "Meta"}, {"id": 152119, "fullname": "Tom Monnier", "url": "http://cvpr.thecvf.com/api/miniconf/users/152119?format=json", "institution": "Facebook"}, {"id": 85874, "fullname": "Roman Shapovalov", "url": "http://cvpr.thecvf.com/api/miniconf/users/85874?format=json", "institution": "Meta"}, {"id": 73496, "fullname": "Rakesh Ranjan", "url": "http://cvpr.thecvf.com/api/miniconf/users/73496?format=json", "institution": "Meta"}, {"id": 126273, "fullname": "Ping Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/126273?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 85624, "fullname": "Andrea Vedaldi", "url": "http://cvpr.thecvf.com/api/miniconf/users/85624?format=json", "institution": "University of Oxford"}], "abstract": "We present MeshFlow, a new method for compressing and generating artist-like 3D meshes. Current mesh generators often adopt Auto-Regressive (AR) next-token prediction, a natural choice given the discrete nature of mesh connectivity, which, however, scales poorly due to the inference cost being quadratic in mesh size. AR methods also require discretizing the vertex coordinates, which introduces quantization errors and can cause vertex collapse. To address these challenges, we introduce a Variational Autoencoder (VAE) that, supervised with a contrastive loss, represents both continuous vertex positions and discrete connectivity in a continuous latent space.This latent space is significantly more compact than prior token-based mesh representations. We then build a 3D generator based on a Rectified-Flow transformer, which generates all mesh vertices and edges in parallel. This model samples meshes $26\\times$ faster than the fastest AR generator while also achieving state-of-the-art accuracy across standard mesh-generation metrics.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37499", "url": null, "sourceid": 37675, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37505, "uid": "e84d3444422cda735543114cf5df6b95", "name": "ArchSym: Detecting 3D-Grounded Architectural Symmetries in the Wild", "authors": [{"id": 155508, "fullname": "Hanyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/155508?format=json", "institution": "Cornell University"}, {"id": 90264, "fullname": "Ruojin Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/90264?format=json", "institution": "Cornell University"}, {"id": 155571, "fullname": "Steve Marschner", "url": "http://cvpr.thecvf.com/api/miniconf/users/155571?format=json", "institution": "Department of Computer Science, Cornell University"}, {"id": 85450, "fullname": "Noah Snavely", "url": "http://cvpr.thecvf.com/api/miniconf/users/85450?format=json", "institution": "Google / Cornell"}], "abstract": "Symmetry detection is a fundamental problem in computer vision, and symmetries serve as powerful priors for downstream tasks. However, existing learning-based methods for detecting 3D symmetries from single images have been almost exclusively trained and evaluated on object-centric or synthetic datasets, and thus fail to generalize to real-world scenes. Furthermore, due to the inherent scale ambiguity of monocular inputs, which makes localizing the 3D plane an ill-posed problem, many existing works only predict the plane's orientation. In this paper, we address these limitations by presenting the first framework for detecting *3D-grounded reflectional symmetries* from single, in-the-wild RGB images, focusing on architectural landmarks. We introduce two key innovations: (1) a scalable data annotation pipeline to automatically curate a large-scale dataset of architectural symmetries, ArchSym, from SfM reconstructions by leveraging cross-view image matching; and building on the dataset, (2) a single-view symmetry detector that accurately localizes symmetries in 3D by parameterizing them as signed distance maps defined relative to predicted scene geometry. We validate our symmetry annotation pipeline against geometry-based alternatives and demonstrate that our symmetry detector significantly outperforms state-of-the-art baselines on our new benchmark.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37505", "url": null, "sourceid": 43616, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37504, "uid": "d9bb2a8cb67a206ee4c40793f95c8229", "name": "Lenses: Toward Polysemous Vision\u2013Language Understanding", "authors": [{"id": 187598, "fullname": "Hani Alomari", "url": "http://cvpr.thecvf.com/api/miniconf/users/187598?format=json", "institution": "Virginia Polytechnic Institute and State University"}, {"id": 168009, "fullname": "Ali Asgarov", "url": "http://cvpr.thecvf.com/api/miniconf/users/168009?format=json", "institution": "Virginia Polytechnic Institute and State University"}, {"id": 127058, "fullname": "Chris Thomas", "url": "http://cvpr.thecvf.com/api/miniconf/users/127058?format=json", "institution": "Virginia Polytechnic Institute and State University"}], "abstract": "Most vision-language models assume images have a single literal meaning, even though images are polysemous. We propose a retrieval paradigm that models many-to-many relationships between images and text using interpretive lenses and introduce Lenses, a multi-prompt embedding model and dataset for polysemous image-text retrieval. The Lenses dataset contains \\(105,669\\) images and \\(732,405\\) captions, with each image paired with multiple captions and image-side prompts annotated across five categories: Literal, Figurative, Emotional, Abstract, and Background. Building on a multimodal large language model, the Lenses model uses learned lens tokens to extract lens-specific embeddings for every image and caption and compares these using a lens-masking similarity function with a global fallback that prioritizes same-lens matches while retaining a global pathway. Training uses a category-aware multi-positive contrastive loss and intra-set diversity regularization to align corresponding perspectives while preventing semantic collapse across lenses. We further propose lens-aware evaluation protocols, including category-aware ranking, that better reflect how humans match images and text. Experiments on the Lenses dataset and public benchmarks show that our model outperforms baselines on literal and non-literal retrieval and reduces over-reliance on literal cues.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37504", "url": null, "sourceid": 44197, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37507, "uid": "23f10867ae6acbd27fdbaa5e92065732", "name": "MCHDoc: A Comprehensive Benchmark for Reading Multi-Carrier Chinese Historical Documents", "authors": [{"id": 180082, "fullname": "YiJun Sheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180082?format=json", "institution": "Southeast Community College Area"}, {"id": 186836, "fullname": "Shipeng Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186836?format=json", "institution": null}, {"id": 187599, "fullname": "Ruijia Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187599?format=json", "institution": "Southeast University"}, {"id": 186837, "fullname": "Na Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/186837?format=json", "institution": "Nanjing university; Nanjing university"}, {"id": 186839, "fullname": "Hui Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/186839?format=json", "institution": "Southeast University"}], "abstract": "Chinese historical documents are essential carriers for the inheritance and dissemination of traditional Chinese culture.However, traditional manual digitization of different types of historical carriers is not only time-consuming and labor-intensive but also heavily reliant on experts with specialized knowledge of the specific carrier domains.In the past,experts read the Chinese historical documents relying on the recognition of the documents and consulted a large number of professional books for citation and correction .With the emergence of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), we see new opportunities for uniformly reading different types of carriers. Nevertheless, existing studies mainly focus on evaluating the OCR capabilities of MLLMs, without incorporating citation or retrieval functionalities, and are restricted to a single type of carrier.To address this,we introduce MCHDoc,a comprehensive benchmark for reading multi-carrier Chinese historical documents.This benchmark consists of 15,723 documents and covers six types of carriers, including Inscription, AncientBook, Calligraphy, Oracle Bone, Silk, and JianDu(bamboo slip).Based on this benchmark,we evaluate various MLLMs and LLMs to test their capacities of reading multi-carrier Chinese historical documents. The results reveal that the top MLLMs and LLMs achieve excellent  performance on some type of carriers. but there is still some place for them to read the multi-type carriers perfectly.Overall,MCHDoc is a standardized and comprehensive benchmark for reading Chinese historical document, providing valuable insights for Chinese cultural study.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37507", "url": null, "sourceid": 39653, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37512, "uid": "abd638b578129dd4a9fa93e484935e58", "name": "GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding", "authors": [{"id": 183049, "fullname": "Peirong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183049?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 187614, "fullname": "Yidan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187614?format=json", "institution": null}, {"id": 187615, "fullname": "Luxiao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187615?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 187616, "fullname": "Jinliang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/187616?format=json", "institution": null}, {"id": 76730, "fullname": "Zonghao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/76730?format=json", "institution": "Tsinghua University"}, {"id": 158611, "fullname": "Fengxiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158611?format=json", "institution": "National University of Defense Technology"}, {"id": 128276, "fullname": "Xue Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128276?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 187617, "fullname": "Kaiwen Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/187617?format=json", "institution": "Chongqing University"}, {"id": 187618, "fullname": "Lei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187618?format=json", "institution": "Chinese Academy of Sciences"}], "abstract": "Recent advances in multimodal large language models (MLLMs) have led to remarkable progress in visual grounding, enabling fine-grained cross-modal alignment between textual queries and image regions. However, transferring such capabilities to remote sensing imagery remains challenging, as targets are often extremely small within kilometer-scale scenes, and queries typically involve intricate geospatial relations such as relative positions, spatial hierarchies, or contextual dependencies across distant objects.To address these challenges, we propose GeoViS, a Geospatially Rewarded Visual Search framework that reformulates remote sensing visual grounding as a progressive search-and-reasoning process. Rather than directly predicting the target location in a single step, GeoViS actively explores the global image through a tree-structured sequence of visual cues, integrating multimodal perception, spatial reasoning, and reward-guided exploration to refine geospatial hypotheses iteratively. This design enables the model to detect subtle small-scale targets while maintaining holistic scene awareness.Extensive experiments on five remote sensing grounding benchmarks demonstrate that GeoViS achieves precise geospatial understanding and consistently surpasses existing methods across key visual grounding metrics, highlighting its strong cross-domain generalization and interpretability.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37512", "url": null, "sourceid": 37376, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40303?format=json"], "related_events_ids": [40303]}, {"id": 40303, "uid": "abd638b578129dd4a9fa93e484935e58", "name": "GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding", "authors": [{"id": 183049, "fullname": "Peirong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183049?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 187614, "fullname": "Yidan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187614?format=json", "institution": null}, {"id": 187615, "fullname": "Luxiao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187615?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 187616, "fullname": "Jinliang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/187616?format=json", "institution": null}, {"id": 76730, "fullname": "Zonghao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/76730?format=json", "institution": "Tsinghua University"}, {"id": 158611, "fullname": "Fengxiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158611?format=json", "institution": "National University of Defense Technology"}, {"id": 128276, "fullname": "Xue Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128276?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 187617, "fullname": "Kaiwen Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/187617?format=json", "institution": "Chongqing University"}, {"id": 187618, "fullname": "Lei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187618?format=json", "institution": "Chinese Academy of Sciences"}], "abstract": "Recent advances in multimodal large language models (MLLMs) have led to remarkable progress in visual grounding, enabling fine-grained cross-modal alignment between textual queries and image regions. However, transferring such capabilities to remote sensing imagery remains challenging, as targets are often extremely small within kilometer-scale scenes, and queries typically involve intricate geospatial relations such as relative positions, spatial hierarchies, or contextual dependencies across distant objects.To address these challenges, we propose GeoViS, a Geospatially Rewarded Visual Search framework that reformulates remote sensing visual grounding as a progressive search-and-reasoning process. Rather than directly predicting the target location in a single step, GeoViS actively explores the global image through a tree-structured sequence of visual cues, integrating multimodal perception, spatial reasoning, and reward-guided exploration to refine geospatial hypotheses iteratively. This design enables the model to detect subtle small-scale targets while maintaining holistic scene awareness.Extensive experiments on five remote sensing grounding benchmarks demonstrate that GeoViS achieves precise geospatial understanding and consistently surpasses existing methods across key visual grounding metrics, highlighting its strong cross-domain generalization and interpretability.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40303", "url": null, "sourceid": -37376, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37512?format=json"], "related_events_ids": [37512]}, {"id": 37514, "uid": "0acecb86d3b3fab2fea045403bedfb1f", "name": "Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers", "authors": [{"id": 128777, "fullname": "Minghao Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/128777?format=json", "institution": "The University of Hong Kong"}, {"id": 104852, "fullname": "Wenbo Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/104852?format=json", "institution": "Tencent ARC Lab"}, {"id": 183964, "fullname": "Jiale Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183964?format=json", "institution": "Tencent"}, {"id": 84809, "fullname": "Ying Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/84809?format=json", "institution": "Tencent"}, {"id": 88212, "fullname": "Kai Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/88212?format=json", "institution": "The University of Hong Kong"}], "abstract": "Recent breakthroughs in 3D generative modeling have yielded remarkable progress in static shape synthesis, yet truly dynamic 4D generation remains elusive, hindered by temporal artifacts and prohibitive computational demand. We present Sculpt4D, a native 4D generative framework that seamlessly integrates efficient temporal modeling into a pretrained 3D Diffusion Transformer (Hunyuan3D 2.1), thereby mitigating the scarcity of 4D training data. At its core lies a Block Sparse Attention mechanism that preserves object identity by anchoring to the initial frame while capturing rich motion dynamics via a time-decaying sparse mask. This design faithfully models complex spatiotemporal dependencies with high fidelity, while sidestepping the quadratic overhead of full attention and reducing network total computation by 56%. Consequently, Sculpt4D establishes a new state-of-the-art in temporally coherent 4D synthesis, charting a path toward efficient and scalable 4D generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37514", "url": null, "sourceid": 32854, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37518, "uid": "fb30d3b7fa58e575c4bf59bb899c314d", "name": "Global Structure-from-Motion Meets Feedforward Reconstruction", "authors": [{"id": 154490, "fullname": "Linfei Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/154490?format=json", "institution": "Department of Computer Science, ETHZ - ETH Zurich"}, {"id": 153646, "fullname": "Johannes Sch\u00f6nberger", "url": "http://cvpr.thecvf.com/api/miniconf/users/153646?format=json", "institution": "Meta"}, {"id": 73915, "fullname": "Marc Pollefeys", "url": "http://cvpr.thecvf.com/api/miniconf/users/73915?format=json", "institution": "ETH Zurich / Microsoft"}], "abstract": "Structure-from-Motion -- the process of simultaneously estimating camera poses and 3D scene structure from a collection of images -- remains a central challenge in computer vision, with many open problems yet to be solved.Recent advances in feedforward 3D reconstruction have made significant strides in overcoming persistent failure cases of classical SfM methods, particularly in scenarios characterized by low texture, limited image overlap, and symmetries.However, while feedforward approaches excel in these challenging conditions, they often face limitations regarding scalability, accuracy, and robustness, and typically fall short of classical methods in standard reconstruction settings.In this work, we systematically analyze these limitations and propose a new state-of-the-art Structure-from-Motion pipeline by combining the respective strengths of classical and feedforward methods.Extensive experiments over a wide range of reconstruction scenarios demonstrate the benefits of our approach by achieving state-of-the-art results across the board.The implementation of our pipeline will be shared as open source software.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37518", "url": null, "sourceid": 34359, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37516, "uid": "d8de81eff6dfe582a05bc6981879f01a", "name": "Gyro-based Deep Video Deblurring", "authors": [{"id": 88365, "fullname": "Jaesung Rim", "url": "http://cvpr.thecvf.com/api/miniconf/users/88365?format=json", "institution": "POSTECH"}, {"id": 103510, "fullname": "Woohyeok Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/103510?format=json", "institution": "POSTECH"}, {"id": 187621, "fullname": "Haeyun Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/187621?format=json", "institution": "Korea University of Technology and Education"}, {"id": 132402, "fullname": "Heemin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/132402?format=json", "institution": "POSTECH"}, {"id": 86862, "fullname": "Ke Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86862?format=json", "institution": "Electrical Engineering and Computer Sciences, University of California, Berkeley"}, {"id": 88380, "fullname": "Sunghyun Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/88380?format=json", "institution": "POSTECH"}], "abstract": "Modern cameras, such as smartphone cameras and DSLRs, are equipped with gyro sensors that measure motion of the camera. While the motion information is valuable for deblurring, gyro-based deblurring has not been widely studied, particularly for video. A few gyro-based video deblurring methods have been proposed, but they exhibit inherent limitations. First, gyro sensors capture only rotational motion, leading these methods to ignore translational motion. Second, their dependence on simplified blur models and deconvolution-based solutions restricts overall performance. To address these limitations, we introduce GyroDVD, the first learning-based framework for gyro-based video deblurring. We propose a novel blur kernel construction scheme that jointly accounts for rotational and translational motion. A video deblurring network then restores sharp videos by exploiting the constructed kernels together with the video frames. For training and evaluation, we introduce the GyroVD dataset, a large-scale and realistic dataset specifically designed for gyro-based deblurring. Extensive experiments demonstrate that our method significantly outperforms prior gyro-based image and video deblurring methods. Code and dataset will be made publicly available on our project page.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37516", "url": null, "sourceid": 44778, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37520, "uid": "ef2e814b4e7cc7e6e4a1f3c8f035275d", "name": "Every Error has Its Magnitude: Asymmetric Mistake Severity Training for Multiclass Multiple Instance Learning", "authors": [{"id": 182967, "fullname": "Sungrae Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/182967?format=json", "institution": "KAIST"}, {"id": 187625, "fullname": "Jiwon Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187625?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 187626, "fullname": "Jisu Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/187626?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 187627, "fullname": "Donghee Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/187627?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 187628, "fullname": "Sol Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/187628?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 187629, "fullname": "Kyungeun Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/187629?format=json", "institution": "Seegene Medical Foundation"}, {"id": 187630, "fullname": "Mun Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/187630?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}], "abstract": "Multiple Instance Learning (MIL) has emerged as a promising paradigm for Whole Slide Image (WSI) diagnosis, offering effective learning with limited annotations. However, existing MIL frameworks overlook diagnostic priorities and fail to differentiate the severity of misclassifications in multiclass, leaving clinically critical errors unaddressed. We propose a mistake\u2013severity\u2013aware training strategy that organizes diagnostic classes into a hierarchical structure, with each level optimized using a severity-weighted cross-entropy loss that penalizes high-severity misclassifications more strongly. Additionally, hierarchical consistency is enforced through probabilistic alignment, a semantic feature remix applied to the instance bag to robustly train class priority and accommodate clinical cases involving multiple symptoms. An asymmetric Mikel\u2019s Wheel-based metric is also introduced to quantify the severity of errors specific to medical fields. Experiments on challenging public and real-world in-house datasets demonstrate that our approach significantly mitigates critical errors in MIL diagnosis compared to existing methods. We present additional experimental results on natural domain data to demonstrate the generalizability of our proposed method beyond medical contexts.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37520", "url": null, "sourceid": 44761, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37521, "uid": "d334c7dfa96b5fd5cc4f23e76e8b4166", "name": "OccAny: Generalized Unconstrained Urban 3D Occupancy", "authors": [{"id": 113787, "fullname": "Anh Quan Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/113787?format=json", "institution": "INRIA"}, {"id": 107207, "fullname": "TUAN-HUNG VU", "url": "http://cvpr.thecvf.com/api/miniconf/users/107207?format=json", "institution": "valeo.ai"}], "abstract": "Relying on in-domain annotations and precise sensor-rig priors, existing 3D occupancy prediction methods are limited in both scalability and out-of-domain generalization.While recent visual geometry foundation models exhibit strong generalization capabilities, they were mainly designed for general purposes and lack one or more key ingredients required for urban occupancy prediction, namely metric prediction, geometry completion in cluttered scenes and adaptation to urban scenarios.We address this gap and present OccAny, the first unconstrained urban 3D occupancy model capable of operating on out-of-domain uncalibrated scenes to predict and complete metric occupancy coupled with segmentation features.OccAny is versatile and can predict occupancy from sequential, monocular, or surround-view images.Our contributions are three-fold: (i) we propose the first generalized 3D occupancy framework with (ii)Segmentation Forcing that improves occupancy quality while enabling mask-level prediction, and (iii) a Novel View Rendering pipeline that infers novel-view geometry to enable test-time view augmentation for geometry completion.Extensive experiments demonstrate that OccAny outperforms all visual geometry baselines on 3D occupancy prediction task, while remaining competitive with in-domain self-supervised methods across three input settings on two established urban occupancy prediction datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37521", "url": null, "sourceid": 40094, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37523, "uid": "d5a91f7c1dee41a945e3f3109633423b", "name": "FlashIn: Fast and Accurate Image Inversion for Real-time Image Editing", "authors": [{"id": 128289, "fullname": "Guangzhi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128289?format=json", "institution": "National University of Singapore"}], "abstract": "Given an image and a descriptive prompt, image inversion seeks to identify the initial noise that, when denoised, accurately reconstructs the original image. This is crucial for applications like image editing, which can be achieved by denoising the inverted noise with an edited prompt. Existing methods often rely on approximations. They often require many steps, leading to inaccuracies, slow processing, and artifacts due to the inherent intractability in the inversion process. To overcome these issues, in this work, we propose FlashIn, a novel algorithm for faster and more accurate image inversion, enabling high-quality, real-time editing. FlashIn offers two main contributions: i) A learnable neural network directly maps an image to its corresponding noise. Trained with a cycle-consistent strategy using generated data and seed noise, this approach yields a more efficient and precise inversion model. ii) Adversarial training aligns noise-reconstructed images with real ones, enhancing inversion accuracy and editing quality. These strategies enable a fast, accurate inversion process in a single step, with further improvements possible through additional steps. Integrated with few-step diffusion models such as Flux.1-Schnell, our method achieves high-quality image editing within one second on a single A100 GPU, facilitating real-time, interactive editing. Extensive experiments demonstrate that FlashIn delivers state-of-the-art inversion precision and impressive editing results across various scenarios and applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37523", "url": null, "sourceid": 46472, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37528, "uid": "e26cee6aa8650fdb4189de69f24432e8", "name": "MARSS: Radar Semantic Segmentation via Modular Attention and State Space Models", "authors": [{"id": 180436, "fullname": "fengyu chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/180436?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 157719, "fullname": "Tiao Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/157719?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 99883, "fullname": "Teng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/99883?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 187651, "fullname": "Yuantian Quan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187651?format=json", "institution": "Tsinghua University"}, {"id": 187652, "fullname": "Qingmin Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187652?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Radar semantic segmentation (RSS) is critical for robust perception in adverse conditions, but poses unique challenges: radar frequency maps are highly anisotropic, multi-scale, sparse and noisy. Conventional CNN or Transformer architectures, designed for camera images, fail to account for these characteristics, leading degraded performance. We propose MARSS (Modular Attention-enhanced Radar Semantic Segmentation), a novel framework that integrates three specialized modules to address radar-specific issues. In the encoder, the RADE module employs lightweight channel self-attention and depthwise convolutions to robustly encode noisy, anisotropic features. In intermediate layers, the RFAF module performs multi-scale feature fusion and region-level attention to isolate salient radar features. The decoder's RADM module combines state space models with axial self-attention to reconstruct segmentation masks with anisotropy and temporality-aware context. These components collectively suppress noise, disentangle range-Doppler features, and enforce spatial-temporal consistency. On the CARRADA dataset, MARSS achieves substantially higher performance than prior RSS methods, especially for small fast-moving targets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37528", "url": null, "sourceid": 44166, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37530, "uid": "66eb81a2a1c2e634f6e2993408674fce", "name": "Guiding Token-Sparse Diffusion Models", "authors": [{"id": 153588, "fullname": "Felix Krause", "url": "http://cvpr.thecvf.com/api/miniconf/users/153588?format=json", "institution": "Ludwig-Maximilians-Universit\u00e4t M\u00fcnchen"}, {"id": 153587, "fullname": "Stefan Andreas Baumann", "url": "http://cvpr.thecvf.com/api/miniconf/users/153587?format=json", "institution": "CompVis        LMU Munich"}, {"id": 153479, "fullname": "Johannes Schusterbauer", "url": "http://cvpr.thecvf.com/api/miniconf/users/153479?format=json", "institution": "CompVis       LMU Munich"}, {"id": 180106, "fullname": "Olga Grebenkova", "url": "http://cvpr.thecvf.com/api/miniconf/users/180106?format=json", "institution": "LMU Munich"}, {"id": 153480, "fullname": "Ming Gui", "url": "http://cvpr.thecvf.com/api/miniconf/users/153480?format=json", "institution": "Ludwig-Maximilians-Universit\u00e4t M\u00fcnchen"}, {"id": 76248, "fullname": "Vincent Tao Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76248?format=json", "institution": "Ludwig-Maximilians-Universit\u00e4t M\u00fcnchen"}, {"id": 85132, "fullname": "Bj\u00f6rn Ommer", "url": "http://cvpr.thecvf.com/api/miniconf/users/85132?format=json", "institution": "University of Munich"}], "abstract": "Diffusion models deliver high quality in image synthesis but remain expensive during training and inference. Recent works have leveraged the inherent redundancy in visual content to make training more affordable by training only on a subset of visual information. While these methods were successful in providing cheaper and more effective training, sparsely trained diffusion models struggle in inference. This is due to their lacking response to Classifier-free Guidance (CFG) leading to underwhelming performance during inference. To overcome this, we propose Sparse Guidance (SG). Instead of using conditional dropout as a signal to guide diffusion models, SG uses token-level sparsity. As a result, SG preserves the high-variance of the conditional prediction better, achieving good quality and high variance outputs. Leveraging token-level sparsity at inference, SG improves fidelity at lower compute, achieving 1.58 FID on the commonly used ImageNet-256 benchmark with 25% fewer FLOPs, and yields up to 58% FLOP savings at matched baseline quality. To demonstrate the effectiveness of Sparse Guidance, we train a 2.5B text-to-image diffusion model using training time sparsity and leverage SG during inference. SG achieves improvements in composition and human preference score while increasing throughput at the same time.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37530", "url": null, "sourceid": 42159, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37569, "uid": "51dd5ac7bbb01546f7f2e3b2bce27834", "name": "Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding", "authors": [{"id": 184159, "fullname": "Jialuo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184159?format=json", "institution": "Georgia Institute of Technology"}, {"id": 88028, "fullname": "Bin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88028?format=json", "institution": "Microsoft"}, {"id": 132590, "fullname": "Jiahao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/132590?format=json", "institution": "Microsoft Research Asia"}, {"id": 87597, "fullname": "Yan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87597?format=json", "institution": "Microsoft Research Asia"}], "abstract": "The application of Large Multimodal Models (LMMs) to long-form video understanding is constrained by limited context lengths and the computationally prohibitive cost of processing dense video tokens. Consequently, recent research has focused on query-aware frame selection, methods that often incur significant computational overhead. This paper challenges the assumption that such complex search mechanisms are universally necessary. We first identify and validate a query typology distinguishing between global query and localized query. We demonstrate that while uniform sampling is both effective and efficient for global queries, localized queries indeed necessitate query-aware selection for optimal performance. Building on this insight, we propose~\\name, a training-free frame selection framework that adapts its strategy based on the query type. Specifically, DIG employs efficient uniform sampling for global queries while activating a specialized pipeline to extract query-relevant frames for localized queries. Experiments on three long-form video understanding benchmarks demonstrate that DIG consistently outperforms existing baselines and robustly improves LMM performance, even when scaling the input frame count to 256.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37569", "url": null, "sourceid": 39320, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37534, "uid": "f7b49030b84b97848504c5f439564b69", "name": "Learning Long-term Motion Embeddings for Efficient Kinematics Generation", "authors": [{"id": 153590, "fullname": "Nick Stracke", "url": "http://cvpr.thecvf.com/api/miniconf/users/153590?format=json", "institution": "CompVis       LMU Munich"}, {"id": 167969, "fullname": "Kolja Bauer", "url": "http://cvpr.thecvf.com/api/miniconf/users/167969?format=json", "institution": "Ludwig-Maximilians-Universit\u00e4t M\u00fcnchen"}, {"id": 153587, "fullname": "Stefan Andreas Baumann", "url": "http://cvpr.thecvf.com/api/miniconf/users/153587?format=json", "institution": "CompVis        LMU Munich"}, {"id": 152428, "fullname": "Miguel \u00c1ngel Bautista", "url": "http://cvpr.thecvf.com/api/miniconf/users/152428?format=json", "institution": "Apple"}, {"id": 148925, "fullname": "Joshua Susskind", "url": "http://cvpr.thecvf.com/api/miniconf/users/148925?format=json", "institution": "Apple"}, {"id": 85132, "fullname": "Bj\u00f6rn Ommer", "url": "http://cvpr.thecvf.com/api/miniconf/users/85132?format=json", "institution": "University of Munich"}], "abstract": "Understanding and predicting motion is a fundamental component of visual intelligence. Although modern video models exhibit strong comprehension of scene dynamics, exploring multiple possible futures through full video synthesis remains prohibitively inefficient. We model scene dynamics orders of magnitude more efficiently by directly operating on a long-term motion embedding that is learned from large-scale trajectories obtained from tracker models.This enables efficient generation of long, realistic motions that fulfill goals specified via text prompts or spatial pokes. To achieve this, we first learn a highly compressed motion embedding with a temporal compression factor of 64x. In this space, we train a conditional flow-matching model to generate motion latents conditioned on task descriptions. The resulting motion distributions outperform those of both state-of-the-art video models and specialized task-specific approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37534", "url": null, "sourceid": 30991, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37536, "uid": "6185e56c2cc78ba47a751d63d62b7488", "name": "CrossAgent: Bridging Cross-level Actions into One Agentic Model via Reinforcement Learning", "authors": [{"id": 180969, "fullname": "Kaichen He", "url": "http://cvpr.thecvf.com/api/miniconf/users/180969?format=json", "institution": "Peking University"}, {"id": 88435, "fullname": "Zihao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88435?format=json", "institution": "Peking University"}, {"id": 187667, "fullname": "Muyao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187667?format=json", "institution": "Peking University"}, {"id": 90828, "fullname": "Anji Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90828?format=json", "institution": "University of California, Los Angeles"}, {"id": 90870, "fullname": "Yitao Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90870?format=json", "institution": "Peking University"}], "abstract": "Autonomous end-to-end agents are increasingly required to operate in environments where actions are not derived directly from the environment's raw actions but instead selected from higher-level action spaces. These actions are then mapped to the corresponding low-level interactions with the environment through controllers. In existing research, the action space is typically predefined. However, in practice, the optimal action space is context-dependent and difficult to determine in advance. For example, in complex domains such as Minecraft, relying solely on low-level raw actions or high-level planning actions is insufficient to handle the wide range of open-ended tasks, which vary in complexity and time horizons. The effective granularity of the control inevitably varies depending on the situation.To address this challenge, we propose CrossAgent, which introduces a novel adaptive action-space selection framework. CrossAgent is built through two stages of reinforcement learning fine-tuning: cold-start single-step reinforcement learning and multi-step reinforcement learning. Within Minecraft, we define three complementary action spaces: motion, grounding, and raw action\u2014each with distinct advantages and limitations. Our framework enables agents to dynamically switch among these spaces and balance task rewards against reasoning costs.Experiments on over 30 diverse tasks in Minecraft demonstrate that CrossAgent exhibits strong long-horizon planning, precise execution, generalization, and efficiency, significantly outperforming fixed-action baselines. These results highlight the critical role of dynamic action-space adaptation in the development of generalist agents capable of tackling open-ended environments.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37536", "url": null, "sourceid": 41754, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37539, "uid": "65448ceadc9d24cfb20915c0a610a4ef", "name": "Revisiting F-measure Optimization in Multi-Label Classification: A Sampling-based Approach", "authors": [{"id": 182187, "fullname": "Zixun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182187?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "The F-measure is a widely used metric in multi-label classification, where multiple labels are predicted simultaneously for a single instance. The optimal prediction rule for F-measure requires estimating $q^2+1$ probabilities, where $q$ is the number of labels. Existing approaches train $q$ multinomial estimators (multi-class classifiers) to directly estimate these probabilities, followed by a matrix multiplication for making predictions. However, this method has two major drawbacks. First, the matrix multiplication incurs a time complexity of $\\mathcal{O}(q^3)$, which becomes computationally expensive for large $q$. Second, training multinomial estimators is challenging due to the sparsity of the underlying distributions, which results from the inherent imbalance in multi-label datasets and is further exacerbated by the label transformation required by the method itself. In this paper, we first demonstrate that matrix multiplication can be reformulated as a series of convolutions by exploiting a special structure in the matrix. These convolutions can then be efficiently computed using the Fast Fourier Transform (FFT), reducing the time complexity to $\\mathcal{O}(q^2\\log q)$. For example, on the \\textit{COCO} dataset, matrix multiplication requires 27 seconds, while FFT takes only 1 seconds, resulting in a 27x speedup. To avoid multinomial label transformation, we propose an indirect sampling-then-estimation approach to estimate the required probabilities. This method trains only $q$ binary estimators instead of multinomial ones, thereby alleviating the sparsity issue, simplifying the training process, and improving performance. We provide theoretical guarantees for the consistency of the proposed sampling-based method and demonstrate its effectiveness through extensive experiments on diverse datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37539", "url": null, "sourceid": 45691, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37541, "uid": "014e4cc0f935b06c12c94cbaeabbaae0", "name": "VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation", "authors": [{"id": 177394, "fullname": "Tan Junwen", "url": "http://cvpr.thecvf.com/api/miniconf/users/177394?format=json", "institution": "South China university of Technology"}, {"id": 187681, "fullname": "Jinglin Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187681?format=json", "institution": "South China University of Technology"}, {"id": 187682, "fullname": "Hongyuan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187682?format=json", "institution": "South China University of Technology"}, {"id": 84875, "fullname": "Shuangping Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84875?format=json", "institution": "South China University of Technology"}], "abstract": "Though rectified flow models have achieved remarkable performance in image, video, and 3D generation, their practical deployments are challenged by slow inference speeds. Previous acceleration methods rely on caching and reusing, neglecting the growing mismatch between static cached values and evolving input, leading to reduced generated content fidelity.This work proposes Velocity Decomposition and Estimation (VDE), a training-free acceleration method that shifts the paradigm from caching-and-reusing to decomposing-and-estimating.VDE periodically anchors the model\u2019s state with a full forward pass and estimates subsequent outputs analytically. VDE first decomposes the model\u2019s velocity output into components parallel and orthogonal to the input, then exploiting the temporal predictability of the components' coefficients and the consistency of the orthogonal direction for precise, input-adaptive estimation at each timestep.Extensive experiments on image and video generation tasks demonstrate that VDE achieves up to 2.04-3.22\u00d7 acceleration with minimal loss in visual quality. For example, in image generation, VDE achieves a 2.21\u00d7 speedup while preserving nearly identical visual quality, outperforming the best baseline by 19.5% in SSIM, 30.3% in PSNR, and reducing LPIPS by 55.4%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37541", "url": null, "sourceid": 38194, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37542, "uid": "438be6a47658f47479deb34f558eba4e", "name": "Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval", "authors": [{"id": 180145, "fullname": "Li Weiqing", "url": "http://cvpr.thecvf.com/api/miniconf/users/180145?format=json", "institution": "Alibaba Cloud Computing"}, {"id": 177020, "fullname": "Jinyue Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/177020?format=json", "institution": "Institute of Automation Chinese Academy of Sciences"}, {"id": 187683, "fullname": "Yaqi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187683?format=json", "institution": "Nankai University"}, {"id": 182844, "fullname": "HAIYANG XIAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/182844?format=json", "institution": "Alibaba Cloud Computing"}, {"id": 187684, "fullname": "Yuewei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187684?format=json", "institution": "Alibaba Group"}, {"id": 187685, "fullname": "Guohua Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187685?format=json", "institution": "Tsinghua University; Alibaba Group"}, {"id": 187686, "fullname": "Hao Henry Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187686?format=json", "institution": null}], "abstract": "Visual-language models (VLMs) excel at data mappings, but real-world document heterogeneity and unstructuredness disrupt the consistency of cross-modal embeddings. Recent late-interaction methods enhance image-text alignment through multi-vector representations, yet traditional training with limited samples and static strategies cannot adapt to the model\u2019s dynamic evolution, causing cross-modal retrieval confusion. To overcome this, we introduce Evo-Retriever, a retrieval framework featuring an LLM-guided curriculum evolution built upon a novel viewpoint-pathway collaboration. First, we employ multi-view image alignment to enhance fine-grained matching via multi-scale and multi-directional perspectives. Then, a bidirectional contrastive learning strategy generates \"hard queries\" and establishes complementary learning paths for visual and textual disambiguation to rebalance supervision. Finally, the model-state summary in above collaboration scenario is fed into an LLM meta-controller, which adaptively adjusts the training curriculum using expert knowledge to promote the model\u2019s evolution. On ViDoRe V2 and MMEB (VisDoc), Evo-Retriever achieves state-of-the-art performance, with nDCG@5 scores of 65.2 % and 77.1 %, proving the efficacy of our evolutionary approach.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37542", "url": null, "sourceid": 41971, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37543, "uid": "1f3a4f7715b9fd855595b4836101ec30", "name": "VGA:Empowering Aerial-Ground Localization by Visual Geometry Alignment", "authors": [{"id": 101296, "fullname": "Tao Jun Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/101296?format=json", "institution": "Australian National university"}, {"id": 133585, "fullname": "Yujiao Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/133585?format=json", "institution": "ShanghaiTech University"}, {"id": 92749, "fullname": "Hongdong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/92749?format=json", "institution": "Australian National University"}], "abstract": "Aerial-ground visual localization is a challenging task due to the significant differences in scene scale and view point captured between two views. In this work, we explore the practical benefit of jointly learning camera calibration and bird\u2019s-eye-view (BEV) projection for estimating full 6 Degrees-of-freedom relative camera pose between uncalibrated aerial and ground views. We present Visual Geometry Alignment (VGA), a unified framework that jointly learns a global gravity-alignment prior inferred from dense monocular perspective fields, and a planar alignment prior complementing the unobserved azimuth angle through Procrustes alignment in a shared BEV plane. At inference, we jointly refine the relative camera pose by integrating the predicted per-camera gravity alignment and relative planar azimuth angle, yielding improved orientation and translation alignment from visual input with extreme wide base-lines and limited overlap. We evaluate our method on challenging MatrixCity, ACC-NVS1 and ULTRRA ground-aerial pairs, demonstrating that optimizing with learned geometric priors can further improve the camera pose estimation across diverse altitudes and environment.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37543", "url": null, "sourceid": 30662, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37546, "uid": "d32108a27936712222dcc9675c0b7f44", "name": "Towards Motion Turing Test: Evaluating Human-Likeness in Humanoid Robots", "authors": [{"id": 175482, "fullname": "Mingzhe Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/175482?format=json", "institution": "Xiamen University"}, {"id": 160148, "fullname": "Mengyin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/160148?format=json", "institution": "Xiamen University"}, {"id": 185303, "fullname": "Zekai Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185303?format=json", "institution": "Xiamen University"}, {"id": 129253, "fullname": "Xincheng Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/129253?format=json", "institution": "Xiamen University"}, {"id": 177246, "fullname": "Junsheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177246?format=json", "institution": "Xiamen University"}, {"id": 77069, "fullname": "Ming Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/77069?format=json", "institution": "Xiamen University"}, {"id": 187691, "fullname": "Zengye Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/187691?format=json", "institution": "Xiamen University"}, {"id": 187692, "fullname": "Changwang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187692?format=json", "institution": "CCF Theoretical Computer Science Technical Committee; OPPO Research Institute"}, {"id": 86653, "fullname": "Chenglu Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/86653?format=json", "institution": "Xiamen University"}, {"id": 73999, "fullname": "Lan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73999?format=json", "institution": "ShanghaiTech University"}, {"id": 86709, "fullname": "Siqi Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/86709?format=json", "institution": "Xiamen University"}, {"id": 86652, "fullname": "Cheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86652?format=json", "institution": "Xiamen University"}], "abstract": "Humanoid robots have achieved significant progress in motion generation and control, exhibiting movements that appear increasingly natural and human-like. Inspired by the Turing Test, we propose the Motion Turing Test, a framework that evaluates whether human observers can discriminate between humanoid robot and human poses using only kinematic information. To facilitate this evaluation, we present the Human-Humanoid Motion HHMotion dataset, which consists of 1,000 motion sequences spanning 15 action categories, performed by 11 humanoid models and 10 human subjects. All motion sequences are converted into SMPL-X representations to eliminate the influence of visual appearance. We recruited 30 annotators to rate the human-likeness of each pose on a 0\u20135 scale, resulting in over 500 hours of annotation. Analysis of the collected data reveals that humanoid motions still exhibit noticeable deviations from human movements, particularly in dynamic actions such as jumping, boxing, and running. Building on HHMotion, we formulate a human-likeness evaluation task that aims to automatically predict human-likeness scores from motion data. Despite recent progress in multimodal large language models, we find that they remain inadequate for assessing motion human-likeness. To address this gap, we propose a simple baseline model and demonstrate that it outperforms several contemporary LLM-based approaches. The dataset, code, and benchmark will be publicly released to support future research in the community.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37546", "url": null, "sourceid": 36519, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37547, "uid": "447c076ca136c51e3b7446b61f70c9d8", "name": "CausalLens: Sensitivity-Guided Multi-Head Causal Intervention for Hallucination Mitigation in Large Vision-Language Models", "authors": [{"id": 179991, "fullname": "Junyang Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/179991?format=json", "institution": "Tsinghua University"}, {"id": 187693, "fullname": "Qifan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187693?format=json", "institution": "Southern University of Science and Technology"}, {"id": 87900, "fullname": "Wenming Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87900?format=json", "institution": "Tsinghua University,"}, {"id": 85357, "fullname": "Zhihai He", "url": "http://cvpr.thecvf.com/api/miniconf/users/85357?format=json", "institution": "Pengcheng Lab, Shenzhen P R China"}], "abstract": "Recent Large Vision-Language Models (LVLMs) have shown impressive capabilities in multimodal understanding and generation. Despite this progress, they remain prone to *hallucination*, where model outputs conflict with the visual input due to an over-reliance on textual priors. Existing inference-time mitigation approaches frequently depend on multi-pass or contrastive decoding, which increases latency and limits their applicability in real-time settings. To address this limitation, we propose **CausalLens**, a training-free and single-pass intervention that directly adjusts the decoder hidden states to strengthen visual grounding. By decomposing attention heads into visual, textual, and system prompt pathways, CausalLens identifies visually reliable heads using a sensitivity measure and selectively adjusts their mid-layer hidden-state contributions. A projection-aligned correction further stabilizes these adjusted states after multi-head fusion, ensuring that the enhanced visual information is preserved throughout decoding. Extensive experiments across multiple hallucination benchmarks and LVLM architectures demonstrate that CausalLens consistently improves visual fidelity while adding negligible computational overhead. The method requires no fine-tuning or architectural changes, making it well-suited for practical, latency-sensitive applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37547", "url": null, "sourceid": 36936, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37549, "uid": "0d73a4088d495eed48b66afadc4aa1d5", "name": "Forecast the Principal, Stabilize the Residual: Subspace-Aware Feature Caching for Diffusion Transformers", "authors": [{"id": 181170, "fullname": "Guantao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/181170?format=json", "institution": "\u4e0a\u6d77\u4ea4\u901a\u5927\u5b66"}, {"id": 181164, "fullname": "Shikang Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/181164?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 186337, "fullname": "Yuqi Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186337?format=json", "institution": "Jilin University"}, {"id": 87643, "fullname": "Linfeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87643?format=json", "institution": ", Tsinghua University"}], "abstract": "Diffusion Transformer (DiT) models have achieved unprecedented quality in image and video generation, yet their iterative sampling process remains computationally prohibitive. To accelerate inference, feature caching methods have emerged by reusing intermediate representations across timesteps. However, existing caching approaches treat all feature components uniformly. We reveal that DiT feature spaces contain distinct principal and residual subspaces with divergent temporal behavior: the principal subspace evolves smoothly and predictably, while the residual subspace exhibits volatile, low-energy oscillations that resist accurate prediction. Building on this insight, we propose SVD-Cache, a subspace-aware caching framework that decomposes diffusion features via Singular Value Decomposition (SVD), applies exponential moving average (EMA) prediction to the dominant low-rank components, and directly reuses the residual subspace. Extensive experiments demonstrate that SVD-Cache achieves near-lossless across diverse models and methods, including 5.55$\\times$ speedup on FLUX and HunyuanVideo, and compatibility with model acceleration techniques including distillation, quantization and sparse attention. Our code is in supplementary material and will be released on Github.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37549", "url": null, "sourceid": 35548, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37552, "uid": "d7de72bb6751415de803b44e8d6ac63f", "name": "SAT-RRG: LLM-Guided Self-Adaptive Training for Radiology Report Generation with Token-Level Push\u2013Pull Optimization", "authors": [{"id": 154872, "fullname": "YUNYI LIU", "url": "http://cvpr.thecvf.com/api/miniconf/users/154872?format=json", "institution": "University of Sydney"}, {"id": 187704, "fullname": "Yingshu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187704?format=json", "institution": "University of Sydney"}, {"id": 154871, "fullname": "Tong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/154871?format=json", "institution": "University of Sydney, University of Sydney"}, {"id": 77115, "fullname": "Lingqiao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/77115?format=json", "institution": "University of Adelaide"}, {"id": 85317, "fullname": "Lei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85317?format=json", "institution": "University of Wollonong"}, {"id": 85242, "fullname": "Luping Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/85242?format=json", "institution": "University of Sydney"}], "abstract": "Radiology report generators often produce fluent text yet miss crucial details, leading to local semantic conflicts or flipped findings that require stronger penalties. **Cross-entropy (CE) merely increases the probability of the ground-truth token $y^*$ without directly suppressing the model\u2019s current wrong choice $\\hat{y}$**, and treats all positions uniformly, so corrections are not prioritized.  We introduce a **self-adaptive optimization framework** that dynamically adjusts token-level gradients based on semantic discrepancy cues derived from a frozen LLM referee. The LLM itself is not the contribution\u2014it merely provides weak supervision to trigger the adaptive learning process. Within this framework, (i) semantic conflicts between the predicted and reference reports are **automatically localized** and tagged with ``<e>...</e>`` (used only during training), and (ii) **adaptive, stronger penalties** are applied within these sparse but critical spans. Updates follow a *push\u2013pull* scheme: error spans are pushed down, while non-error tokens are reinforced. The update strength is governed by two complementary signals\u2014*normalized entropy* (for uncertainty calibration) and *focal-style confidence* (for handling over- and under-confident predictions). On MIMIC-CXR and IU-Xray, our framework consistently improves both language metrics (BLEU-4, ROUGE-L, CIDEr) and clinical metrics (RadGraph F1, CheXbert), and remains robust to noisy or imperfect error tags.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37552", "url": null, "sourceid": 44135, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37557, "uid": "0e2e8e519b5915aa995fbbdf75d0f1ea", "name": "FabricGen: Microstructure-Aware Woven Fabric Generation", "authors": [{"id": 180911, "fullname": "Yingjie Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180911?format=json", "institution": "Nankai University"}, {"id": 186586, "fullname": "Di Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186586?format=json", "institution": "Nankai University"}, {"id": 187717, "fullname": "Zixiong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187717?format=json", "institution": "Nankai University"}, {"id": 187718, "fullname": "Xiaoli Ling", "url": "http://cvpr.thecvf.com/api/miniconf/users/187718?format=json", "institution": "nanjing university"}, {"id": 86573, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86573?format=json", "institution": "Nankai University"}, {"id": 145863, "fullname": "Beibei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/145863?format=json", "institution": "Nanjing University"}], "abstract": "Woven fabric materials are widely used in rendering applications, yet designing realistic examples typically involves multiple stages, requiring expertise in weaving principles and texture authoring. Recent advances have explored diffusion models to streamline this process; however, pre-trained diffusion models often struggle to generate intricate yarn-level details that conform to weaving rules. To address this, we present FabricGen, an end-to-end framework for generating high-quality woven fabric materials from textual descriptions. A key insight of our method is the decomposition of macro-scale textures and micro-scale weaving patterns. To generate macro-scale textures free from microstructures, we fine-tune pre-trained diffusion models on a collected dataset of microstructure-free fabrics. As for micro-scale weaving patterns, we develop an enhanced procedural geometric model capable of synthesizing natural yarn-level geometry with yarn sliding and flyaway fibers. The procedural model is driven by a specialized large language model, WeavingLLM, which is fine-tuned on an annotated dataset of formatted weaving drafts, and prompt-tuned with domain-specific fabric expertise. Through fine-tuning and prompt tuning, WeavingLLM learns to design weaving drafts and fabric parameters from textual prompts, enabling the procedural model to produce diverse weaving patterns that stick to weaving principles. The generated macro-scale texture, along with the micro-scale geometry, can be used for fabric rendering. Consequently, our framework produces materials with significantly richer detail and realism compared to prior generative models.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37557", "url": null, "sourceid": 38681, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37560, "uid": "0315243c468b238bf15c230c774c9791", "name": "Reliable Clustering Number Estimation for Contrastive Multi-View Clustering", "authors": [{"id": 184024, "fullname": "Zhengzhong Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184024?format=json", "institution": "Sichuan University"}, {"id": 185608, "fullname": "Pei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185608?format=json", "institution": "Sichuan University"}, {"id": 187728, "fullname": "Lanxi Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/187728?format=json", "institution": "Sichuan University"}, {"id": 187729, "fullname": "Li Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187729?format=json", "institution": "Sichuan University"}, {"id": 187730, "fullname": "Jia Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/187730?format=json", "institution": "Sichuan University"}, {"id": 185609, "fullname": "Shiquan min", "url": "http://cvpr.thecvf.com/api/miniconf/users/185609?format=json", "institution": "Sichuan University"}, {"id": 185610, "fullname": "Jiangping Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185610?format=json", "institution": "Sichuan University"}], "abstract": "In recent years, contrastive multi-view clustering has achieved remarkable performance improvements. However, existing methods still face two key challenges: (1) reliance on a predefined number of clusters k, which is often unknown in real-world scenarios; and (2)  contrastive learning might cause representation degeneration when thecollected multiple views inherently have inconsistent semantic information .  To address these issues, we propose a novel framework\u2014Reliable Clustering Number Estimation for Contrastive Multi-View Clustering (RCNMC). RCNMC consists of a Semantics-Aware Contrastive Learning module and a Reinforcement Learning-based Cluster Number Learning module. Specifically, the Semantics-Aware Contrastive Learning module first measures the discrepancy between pairwise representations and adaptively strengthens useful pairwise views while weakening unreliable ones, thereby alleviating representation degeneration. The Reinforcement Learning-based Cluster Number Learning module infers the optimal number of clusters in an unsupervised manner by using intra-cluster and inter-cluster distances as a reward-driven strategy. The two modules complement each other, making RCNMC more suitable for complex multi-view clustering tasks in real-world scenarios. Extensive experiments on multiple benchmark datasets demonstrate that RCNMC significantly outperforms existing state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37560", "url": null, "sourceid": 33209, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37561, "uid": "8f72f7d9931feb4ecedbb2c0722575eb", "name": "Lighting in Motion: Spatiotemporal HDR Lighting Estimation", "authors": [{"id": 148518, "fullname": "Christophe Bolduc", "url": "http://cvpr.thecvf.com/api/miniconf/users/148518?format=json", "institution": "Universit\u00e9 Laval"}, {"id": 88889, "fullname": "Julien Philip", "url": "http://cvpr.thecvf.com/api/miniconf/users/88889?format=json", "institution": "Adobe Systems"}, {"id": 187731, "fullname": "Li Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/187731?format=json", "institution": "Scanline VFX"}, {"id": 153378, "fullname": "Mingming He", "url": "http://cvpr.thecvf.com/api/miniconf/users/153378?format=json", "institution": "Netflix Eyeline Studios"}, {"id": 153383, "fullname": "Paul Debevec", "url": "http://cvpr.thecvf.com/api/miniconf/users/153383?format=json", "institution": "NetFlix"}, {"id": 86795, "fullname": "Jean-Fran\u00e7ois Lalonde", "url": "http://cvpr.thecvf.com/api/miniconf/users/86795?format=json", "institution": "Universit\u00e9 Laval"}], "abstract": "We present LiMo, a diffusion-based approach to spatiotemporal lighting estimation. LiMo targets both realistic high-frequency detail prediction and accurate illuminance estimation. To account for both, we propose generating a set of mirrored and diffuse spheres at different exposures, based on their 3D positions in the input. Making use of diffusion priors, we fine-tune powerful existing diffusion models on a large-scale customized dataset of indoor and outdoor scenes, paired with spatiotemporal light probes. For accurate spatial conditioning, we demonstrate that depth alone is insufficient and we introduce a new geometric condition to provide the relative position of the scene to the target 3D position. Finally, we combine diffuse and mirror predictions at different exposures into a single HDRI map leveraging differentiable rendering. We thoroughly evaluate our method and design choices to establish LiMo as state-of-the-art for both spatial control and prediction accuracy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37561", "url": null, "sourceid": 35905, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37563, "uid": "fc90ef8f569ff0eecfdc2c3860ee64a1", "name": "ORION: ORthonormal Text Encoding for Universal VLM AdaptatION", "authors": [{"id": 181113, "fullname": "Omprakash Chakraborty", "url": "http://cvpr.thecvf.com/api/miniconf/users/181113?format=json", "institution": "ETS Montreal"}, {"id": 84856, "fullname": "Jose Dolz", "url": "http://cvpr.thecvf.com/api/miniconf/users/84856?format=json", "institution": "\u00c9cole de technologie sup\u00e9rieure"}, {"id": 77361, "fullname": "Ismail Ben Ayed", "url": "http://cvpr.thecvf.com/api/miniconf/users/77361?format=json", "institution": "ETS Montreal"}], "abstract": "Vision-language models (VLMs) have demonstrated remarkable generalization across diverse tasks, yet their performance remains constrained by the quality and geometry of the textual prototypes used to represent classes. Standard zero-shot classifiers, derived from frozen text encoders and handcrafted prompts, may yield correlated or weakly separated embeddings that limit task-specific discriminability. We introduce ORION, a text encoder fine-tuning framework that improves pretrained VLMs using only class names. Our method optimizes, via low-rank adaptation, a novel loss integrating two terms, one promoting pairwise orthogonality between the textual representations of the classes of a given task and the other penalizing deviations from the initial class prototypes. Furthermore,  we provide a probabilistic interpretation of our orthogonality penalty, connecting it to the general maximum likelihood estimation (MLE) principle via Huygens\u2019 theorem. We report extensive experiments on 11 benchmarks and three large VLM backbones, showing that the refined textual embeddings yield powerful replacements for the standard CLIP prototypes. Added as plug-and-play module on top of various state-of-the-art methods, and across different prediction settings (zero-shot, few-shot and test-time adaptation), ORION improves the performance consistently and significantly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37563", "url": null, "sourceid": 44547, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37564, "uid": "04c7ec3579a0033ab281960fbd7b84c3", "name": "EReCu: Pseudo-label Evolution Fusion and Refinement with Multi-Cue Learning for Unsupervised Camouflage Detection", "authors": [{"id": 173321, "fullname": "Jiang Shuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/173321?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 187738, "fullname": "Gaojia Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187738?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 181771, "fullname": "Min Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181771?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 180426, "fullname": "Yufei Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/180426?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 87277, "fullname": "Gang Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87277?format=json", "institution": "Zhejiang University"}], "abstract": "Unsupervised Camouflaged Object Detection (UCOD) remains a challenging task due to the high intrinsic similarity between target objects and their surroundings, as well as the reliance on noisy pseudo-labels that hinder fine-grained texture learning. While existing refinement strategies aim to alleviate label noise, they often overlook intrinsic perceptual cues, leading to boundary overflow and structural ambiguity. In contrast, learning without pseudo-label guidance yields coarse features with significant detail loss. To address these issues, we propose a unified UCOD framework that enhances both the reliability of pseudo-labels and the fidelity of features. Our approach introduces the Multi-Cue Native Perception module, which extracts intrinsic visual priors by integrating low-level texture cues with mid-level semantics, enabling precise alignment between masks and native object information. Additionally, Pseudo-Label Evolution Fusion intelligently refines labels through teacher-student interaction and utilizes depthwise separable convolution for efficient semantic denoising. It also incorporates Spectral Tensor Attention Fusion to effectively balance semantic and structural information through compact spectral aggregation across multi-layer attention maps. Finally, Local Pseudo-Label Refinement plays a pivotal role in local detail optimization by leveraging attention diversity to restore fine textures and enhance boundary fidelity. Extensive experiments on multiple UCOD datasets demonstrate that our method achieves state-of-the-art performance, characterized by superior detail perception, robust boundary alignment, and strong generalization under complex camouflage scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37564", "url": null, "sourceid": 44065, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37566, "uid": "5dced4e1ca6b1b9cc943b06909f92e05", "name": "PixelDiT: Pixel Diffusion Transformers for Image Generation", "authors": [{"id": 187743, "fullname": "Yongsheng Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187743?format=json", "institution": "University of Rochester"}, {"id": 137970, "fullname": "Wei Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/137970?format=json", "institution": "NVIDIA"}, {"id": 162987, "fullname": "Weili Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/162987?format=json", "institution": "NVIDIA"}, {"id": 88915, "fullname": "Yichen Sheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/88915?format=json", "institution": "NVIDIA"}, {"id": 187744, "fullname": "Shiqiu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187744?format=json", "institution": "NVIDIA"}, {"id": 85765, "fullname": "Jiebo Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/85765?format=json", "institution": "University of Rochester"}], "abstract": "Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. PixelDiT achieves 1.61 FID on ImageNet 256 and 2.21 FID on ImageNet 512, surpassing existing pixel generative models by a large margin. We further extend PixelDiT to text-to-image generation and  pretrain it at the $1024^{2}$ resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37566", "url": null, "sourceid": 36675, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40304?format=json"], "related_events_ids": [40304]}, {"id": 40304, "uid": "5dced4e1ca6b1b9cc943b06909f92e05", "name": "PixelDiT: Pixel Diffusion Transformers for Image Generation", "authors": [{"id": 187743, "fullname": "Yongsheng Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187743?format=json", "institution": "University of Rochester"}, {"id": 137970, "fullname": "Wei Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/137970?format=json", "institution": "NVIDIA"}, {"id": 162987, "fullname": "Weili Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/162987?format=json", "institution": "NVIDIA"}, {"id": 88915, "fullname": "Yichen Sheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/88915?format=json", "institution": "NVIDIA"}, {"id": 187744, "fullname": "Shiqiu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187744?format=json", "institution": "NVIDIA"}, {"id": 85765, "fullname": "Jiebo Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/85765?format=json", "institution": "University of Rochester"}], "abstract": "Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. PixelDiT achieves 1.61 FID on ImageNet 256 and 2.21 FID on ImageNet 512, surpassing existing pixel generative models by a large margin. We further extend PixelDiT to text-to-image generation and  pretrain it at the $1024^{2}$ resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40304", "url": null, "sourceid": -36675, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37566?format=json"], "related_events_ids": [37566]}, {"id": 37567, "uid": "5420aad7fec3549a85876ba1c529bd84", "name": "ResCa: Residual Caching for Diffusion Transformers Acceleration", "authors": [{"id": 160020, "fullname": "Haipeng Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/160020?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 187745, "fullname": "Yu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187745?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 87043, "fullname": "Fan Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87043?format=json", "institution": "Institute of Computing Technology, CAS"}, {"id": 187746, "fullname": "Yixing Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187746?format=json", "institution": ", Chinese Academy of Sciences"}, {"id": 90790, "fullname": "Juan Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/90790?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 90803, "fullname": "Sheng Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90803?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}], "abstract": "Diffusion transformers have achieved remarkable progress in high-quality image and video generation, but their computational overhead remains a significant challenge. Existing token reduction-based acceleration techniques, such as caching and merging, attempt to reduce this cost from both temporal and spatial perspectives, but often compromise generation quality by introducing non-updated or non-self denosing directions. In this paper, we propose Residual Caching (ResCa), a novel, training-free framework that introduces a proxy denoising perspective to overcome these limitations. ResCa achieves acceleration while maintaining a denoising trajectory that is both self and updated. The core idea is to perform true denoising on only one proxy token within each trajectory-based cluster, and use its computed multi-order residuals to guide the simulated denoising of all other tokens. ResCa can be seamlessly integrated into various diffusion models, including DiT, FLUX, and HunyuanVideo. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method, achieving up to a 5.5 times acceleration in GFLOPs while maintaining near-lossless generation quality on FLUX.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37567", "url": null, "sourceid": 42334, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37570, "uid": "dbd456d85889141a5bca8a0b51abcd7a", "name": "ClipGStream: Clip-Stream Gaussian Splatting for Any Length and Any Motion Multi-View Dynamic Scene Reconstruction", "authors": [{"id": 145699, "fullname": "Jie Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/145699?format=json", "institution": "Peking University"}, {"id": 142884, "fullname": "Jiahao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/142884?format=json", "institution": "Peking University"}, {"id": 157820, "fullname": "Chao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157820?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 155725, "fullname": "Jiayu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155725?format=json", "institution": "Pengcheng Laboratory"}, {"id": 187748, "fullname": "Xiaoyun Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187748?format=json", "institution": "Pengcheng Laboratory"}, {"id": 180184, "fullname": "Kaiqiang Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/180184?format=json", "institution": "Peking University"}, {"id": 187749, "fullname": "Zhanke Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187749?format=json", "institution": "Peking University"}, {"id": 151756, "fullname": "Jinbo Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/151756?format=json", "institution": "Peking University"}, {"id": 129700, "fullname": "Feng Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/129700?format=json", "institution": "Peking University"}, {"id": 86749, "fullname": "Ronggang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86749?format=json", "institution": "Peking University Shenzhen Graduate School"}], "abstract": "Dynamic 3D scene reconstruction is essential for immersive media such as VR, MR, and XR, yet remains challenging for long multi-view sequences with large-scale motion. Existing dynamic Gaussian approaches are either Frame-Stream, offering scalability but poor temporal stability, or Clip, achieving local consistency at the cost of high memory and limited sequence length.We propose ClipGStream, a hybrid reconstruction framework that performs stream optimization at the clip level rather than the frame level. The sequence is divided into short clips, where dynamic motion is modeled using clip-independent spatio-temporal fields and residual anchor compensation to capture local variations efficiently, while inter-clip inherited anchors and decoders maintain structural consistency across clips. This Clip-Stream design enables scalable, flicker-free reconstruction of long dynamic videos with high temporal coherence and reduced memory overhead. Extensive experiments demonstrate that ClipGStream achieves state-of-the-art reconstruction quality and efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37570", "url": null, "sourceid": 37438, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37572, "uid": "ad644e9304513dd34ebab678e9c3840d", "name": "Camouflage-aware Image-Text Retrieval via Expert Collaboration", "authors": [{"id": 177244, "fullname": "Yao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177244?format=json", "institution": "Sichuan University"}, {"id": 176837, "fullname": "Zhongkuan Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/176837?format=json", "institution": "Sichuan University"}, {"id": 177256, "fullname": "xuan wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/177256?format=json", "institution": "sichuan university"}, {"id": 153198, "fullname": "Keren Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153198?format=json", "institution": "Sichuan University"}, {"id": 153199, "fullname": "Qijun Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153199?format=json", "institution": "Sichuan University"}], "abstract": "Camouflaged scene understanding (CSU) has attracted significant attention due to its broad practical implications. However, in this field, robust image-text cross-modal alignment remains under-explored, hindering deeper understanding of camouflaged scenarios and their related applications. To this end, we focus on the typical image-text retrieval task, and formulate a new task dubbed ``camouflage-aware image-text retrieval'' (CA-ITR). We first construct a dedicated camouflage image-text retrieval dataset (CamoIT), comprising $\\sim$10.5K samples with multi-granularity textual annotations. Benchmark results conducted on CamoIT reveal the underlying challenges of CA-ITR for existing cutting-edge retrieval techniques, which are mainly caused by objects' camouflage properties as well as those complex image contents. As a solution, we propose a camouflage-expert collaborative network (CECNet), which features a dual-branch visual encoder: one branch captures holistic image representations, while the other incorporates a dedicated model to inject representations of camouflaged objects. A novel confidence-conditioned graph attention (C\\textsuperscript{2}GA) mechanism is incorporated to exploit the complementarity across branches. Comparative experiments show that CECNet achieves a $\\sim$29\\% CA-ITR accuracy boost, surpassing seven representative retrieval models. Our dataset and code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37572", "url": null, "sourceid": 45821, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37579, "uid": "5d4c7bf4ebab441c07f4fa26c4f0c4c3", "name": "MARCO: Navigating the Unseen Space of Semantic Correspondence", "authors": [{"id": 129915, "fullname": "Claudia Cuttano", "url": "http://cvpr.thecvf.com/api/miniconf/users/129915?format=json", "institution": "Polytechnic Institute of Turin"}, {"id": 95391, "fullname": "Gabriele Trivigno", "url": "http://cvpr.thecvf.com/api/miniconf/users/95391?format=json", "institution": "Politecnico di Torino"}, {"id": 128244, "fullname": "Carlo Masone", "url": "http://cvpr.thecvf.com/api/miniconf/users/128244?format=json", "institution": "Politecnico di Torino"}, {"id": 73884, "fullname": "Stefan Roth", "url": "http://cvpr.thecvf.com/api/miniconf/users/73884?format=json", "institution": "Technische Universit\u00e4t Darmstadt"}], "abstract": "Recent advances in semantic correspondence rely on dual-encoder architectures, combining DINOv2 with diffusion backbones. While accurate, these billion-parameter models generalize poorly beyond training keypoints, revealing a gap between benchmark performance and real-world usability, where queried points rarely match those seen during training.Building upon DINOv2, we introduce MARCO, a unified model for generalizable correspondence driven by a novel training framework that enhances both fine-grained localization and semantic generalization. By coupling a coarse-to-fine objective that refines spatial precision with a self-distillation framework, which extends sparse supervision beyond annotated regions, our approach transforms a handful of keypoints into dense, semantically coherent correspondences.MARCO sets a new state of the art on SPair-71k, AP-10K and PF-PASCAL, with gains that amplify at fine-grained localization thresholds (+10.3 PCK@0.01), strongest generalization to unseen keypoints (+3.8, SPair-U) and categories (+5.6, MP-100), while remaining 3\u00d7 smaller and 10\u00d7 faster than diffusion-based approaches.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37579", "url": null, "sourceid": 33149, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40305?format=json"], "related_events_ids": [40305]}, {"id": 37576, "uid": "fef6255b484a1dc0dac35fd87bb905ae", "name": "RS-SSM: Refining Forgotten Specifics in State Space Model for Video Semantic Segmentation", "authors": [{"id": 174364, "fullname": "Kai Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/174364?format=json", "institution": "Peking University"}, {"id": 179208, "fullname": "Zhenyu Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/179208?format=json", "institution": "Peking University"}, {"id": 184057, "fullname": "Zehua Zang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184057?format=json", "institution": "Institute of Software, Chinese Academy of Sciences"}, {"id": 107125, "fullname": "Jiahuan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/107125?format=json", "institution": "Peking University"}], "abstract": "Recently, state space models have demonstrated efficient video segmentation through linear-complexity state space compression. However, Video Semantic Segmentation requires pixel-level spatiotemporal modeling capabilities to maintain temporal consistency in segmentation of semantic objects. While state space models can preserve common semantic information during state space compression, the fixed-size state space inevitably forgets specific information, which limits the models' capability for pixel-level segmentation. To tackle the above issue, we proposed a Refining Specifics State Space Model approach (RS-SSM) for video semantic segmentation, which performs complementary refining of forgotten spatiotemporal specifics. Specifically, a Channel-wise Amplitude Perceptron (CwAP) is designed to extract and align the distribution characteristics of specific information in the state space. Besides, a Forgetting Gate Information Refiner (FGIR) is proposed to adaptively invert and refine the forgetting gate matrix in the state space model based on the specific information distribution. Consequently, our RS-SSM leverages the inverted forgetting gate to complementarily refine the specific information forgotten during state space compression, thereby enhancing the model's capability for spatiotemporal pixel-level segmentation. Extensive experiments on four VSS benchmarks demonstrate that our RS-SSM achieves state-of-the-art performance while maintaining high computational efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37576", "url": null, "sourceid": 44027, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40305, "uid": "5d4c7bf4ebab441c07f4fa26c4f0c4c3", "name": "MARCO: Navigating the Unseen Space of Semantic Correspondence", "authors": [{"id": 129915, "fullname": "Claudia Cuttano", "url": "http://cvpr.thecvf.com/api/miniconf/users/129915?format=json", "institution": "Polytechnic Institute of Turin"}, {"id": 95391, "fullname": "Gabriele Trivigno", "url": "http://cvpr.thecvf.com/api/miniconf/users/95391?format=json", "institution": "Politecnico di Torino"}, {"id": 128244, "fullname": "Carlo Masone", "url": "http://cvpr.thecvf.com/api/miniconf/users/128244?format=json", "institution": "Politecnico di Torino"}, {"id": 73884, "fullname": "Stefan Roth", "url": "http://cvpr.thecvf.com/api/miniconf/users/73884?format=json", "institution": "Technische Universit\u00e4t Darmstadt"}], "abstract": "Recent advances in semantic correspondence rely on dual-encoder architectures, combining DINOv2 with diffusion backbones. While accurate, these billion-parameter models generalize poorly beyond training keypoints, revealing a gap between benchmark performance and real-world usability, where queried points rarely match those seen during training.Building upon DINOv2, we introduce MARCO, a unified model for generalizable correspondence driven by a novel training framework that enhances both fine-grained localization and semantic generalization. By coupling a coarse-to-fine objective that refines spatial precision with a self-distillation framework, which extends sparse supervision beyond annotated regions, our approach transforms a handful of keypoints into dense, semantically coherent correspondences.MARCO sets a new state of the art on SPair-71k, AP-10K and PF-PASCAL, with gains that amplify at fine-grained localization thresholds (+10.3 PCK@0.01), strongest generalization to unseen keypoints (+3.8, SPair-U) and categories (+5.6, MP-100), while remaining 3\u00d7 smaller and 10\u00d7 faster than diffusion-based approaches.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40305", "url": null, "sourceid": -33149, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37579?format=json"], "related_events_ids": [37579]}, {"id": 37581, "uid": "7faae67aa3cb2e3cbc3d5f3f12fe809c", "name": "A Unified Framework for Knowledge Transfer in Bidirectional Model Scaling", "authors": [{"id": 149140, "fullname": "Jianlu Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/149140?format=json", "institution": "School of Computer Science and Engineering, Southeast University"}, {"id": 151464, "fullname": "Fu Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/151464?format=json", "institution": "Southeast University"}, {"id": 173505, "fullname": "Jiaze Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/173505?format=json", "institution": "Southeast University "}, {"id": 157444, "fullname": "Yucheng Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/157444?format=json", "institution": "Southeast University"}, {"id": 84853, "fullname": "Jiaqi Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/84853?format=json", "institution": "RIKEN"}, {"id": 84884, "fullname": "Xin Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/84884?format=json", "institution": "Southeast University"}], "abstract": "Transferring pre-trained knowledge from a source model to a target model of a different architectural size is a key challenge for flexible and efficient model scaling. However, current parameter-space methods treat Small-to-Large (S2L) and Large-to-Small (L2S) scaling as separate, incompatible problems, focusing on parameter synthesis and selection, respectively.This fragmented perspective has resulted in specialized tools, hindering a unified, bidirectional framework.In this paper, we propose BoT (Bidirectional knowledge Transfer), the first size-agnostic framework to unify S2L and L2S scaling.Our core insight is to treat model weights as continuous signals, where models of different sizes represent distinct discretizations of the transferable knowledge.This multi-resolution perspective directly casts S2L and L2S scaling as the signal processing operations of upsampling and downsampling, naturally leading to the adoption of the Discrete Wavelet Transform (DWT) and its Inverse (IDWT).BoT leverages the recursive nature of wavelets, using the decomposition level as a dynamic scaling factor to bridge disparate model sizes in a parameter-free and computationally efficient manner. Extensive experiments on DeiT, BERT, and GPT demonstrate significant pre-training FLOPs savings (up to 67.1\\% for S2L, 52.8\\% for L2S) and state-of-the-art performance on benchmarks like GLUE and SQuAD.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37581", "url": null, "sourceid": 30696, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37583, "uid": "0c4033c4ab2117e808f1b11a8181ef98", "name": "Turning Pre-Trained Vision Transformers into End-to-End Histopathology Whole Slide Image Models for Survival Prediction", "authors": [{"id": 180318, "fullname": "Jiawen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180318?format=json", "institution": "Tsinghua University"}, {"id": 187770, "fullname": "Jiali Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187770?format=json", "institution": "City University of Hong Kong (Dongguan)"}, {"id": 143860, "fullname": "Xitong Ling", "url": "http://cvpr.thecvf.com/api/miniconf/users/143860?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 187771, "fullname": "Renao Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187771?format=json", "institution": "University of Washington"}, {"id": 101751, "fullname": "Yuxuan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/101751?format=json", "institution": "Tsinghua University"}, {"id": 131575, "fullname": "Tian Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/131575?format=json", "institution": "Graduate School at Shenzhen, Tsinghua University"}, {"id": 131580, "fullname": "Yonghong He", "url": "http://cvpr.thecvf.com/api/miniconf/users/131580?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Conventional whole slide image (WSI) analysis pipelines follow a two-stage process. First, an image encoder, such as a vision transformer (ViT), is used to perform batched offline feature extraction on a series of tiles cropped from the WSI. Second, a multiple instance learning (MIL) model is trained with slide-level labels to obtain task-specific slide embeddings. However, several limitations exist: strong reliance on pre-trained weights of the tile encoder, the absence of receptive fields from the original image, and a lack of task-independent WSI representations. An ideal improvement would be to develop an end-to-end pre-trained WSI model, but training it from scratch will face challenges such as high training costs and computational complexity. In this work, we deconstruct the key steps of ViT-based pathology image representation and propose a conversion strategy called E2E-ViT, which transforms a vanilla ViT into an end-to-end pre-trained WSI model without introducing additional parameters. E2E-ViT directly inputs the entire tissue region in WSIs to efficiently feed image sequences into the transformer backbone, achieving information interaction from the original receptive fields and generating slide features. Through multiple survival prediction tasks, we demonstrate that transformed pre-trained ViTs outperform two-stage MIL models and slide foundation models (SFM). Our work presents a new end-to-end learning paradigm that provides a promising direction for the next generation of computational pathology models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37583", "url": null, "sourceid": 40471, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37584, "uid": "408a9c4a79800232ac656249af3162eb", "name": "Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers", "authors": [{"id": 187772, "fullname": "Zachary Shinnick", "url": "http://cvpr.thecvf.com/api/miniconf/users/187772?format=json", "institution": "University of Adelaide"}, {"id": 155394, "fullname": "Liangze Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155394?format=json", "institution": "EPFL"}, {"id": 127561, "fullname": "Hemanth Saratchandran", "url": "http://cvpr.thecvf.com/api/miniconf/users/127561?format=json", "institution": "University of Adelaide/Australian Institute of Machine Learning"}, {"id": 127840, "fullname": "Damien Teney", "url": "http://cvpr.thecvf.com/api/miniconf/users/127840?format=json", "institution": "Idiap Research Institute"}, {"id": 88134, "fullname": "Anton van den Hengel", "url": "http://cvpr.thecvf.com/api/miniconf/users/88134?format=json", "institution": "University of Adelaide"}], "abstract": "Transformers show remarkable versatility across domains, suggesting the existence of generic inductive biases beneficial across modalities. In this work, we explore a new way to instil such generic biases in vision transformers (ViTs) by pretraining on procedurally generated data devoid of visual or semantic content. We generate this data with simple algorithms such as formal grammars, so the results bear no relationship to either natural or synthetic images. We use this procedurally-generated data to pretrain ViTs in a warm-up phase that bypasses their visual patch embedding mechanisms, thus encouraging the models to internalise abstract computational priors.  When followed by standard image-based training, this warm-up significantly improves data efficiency, convergence speed, and downstream performance. On ImageNet-1k for example, allocating just 1\\% of the training budget to procedural data improves final accuracy by over 1.7\\%.  In terms of its effect on performance, 1\\% procedurally generated data is thus equivalent to 28\\% of the ImageNet-1k data. These findings suggest a promising path toward new data-efficient and domain-agnostic pretraining strategies.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37584", "url": null, "sourceid": 42341, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37586, "uid": "713dc417a38ad80fdf481609ea2ac4b5", "name": "Fast Reasoning Segmentation for Images and Videos", "authors": [{"id": 187776, "fullname": "Yiqing Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187776?format=json", "institution": "Johns Hopkins University"}, {"id": 75511, "fullname": "Mathias Unberath", "url": "http://cvpr.thecvf.com/api/miniconf/users/75511?format=json", "institution": "Johns Hopkins University"}], "abstract": "Reasoning segmentation enables open-set object segmentation via implicit text queries, therefore serving as a foundation for embodied agents that should operate autonomously in real-world environments. However, existing methods for reasoning segmentation require multimodal large language models with billions of parameters that exceed the computational capabilities of edge devices that typically deploy the embodied AI systems. Distillation offers a pathway to compress these models while preserving their capabilities. Yet, existing distillation approaches fail to transfer the multi-step reasoning capabilities that reasoning segmentation demands, as they focus on matching output predictions and intermediate features rather than preserving reasoning chains. The emerging paradigm of reasoning over digital twin representations presents an opportunity for more effective distillation by re-framing the problem. Consequently, we propose FastReasonSeg, which employs digital twin representations that decouple perception from reasoning to enable more effective distillation. Our distillation scheme first relies on supervised fine-tuning on teacher-generated reasoning chains. Then it is followed by reinforcement fine-tuning with joint rewards evaluating both segmentation accuracy and reasoning quality alignment. Experiments on two video (JiTBench, RVTBench) and two image benchmarks (ReasonSeg, LLM-Seg40K) demonstrate that our FastReasonSeg achieves state-of-the-art reasoning segmentation performance. Moreover, the distilled 0.6B variant outperforms models with 20 times more parameters while achieving 7.79 FPS throughput with only 2.1GB memory consumption. This efficiency enables deployment in resource-constrained environments to enable real-time reasoning segmentation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37586", "url": null, "sourceid": 40019, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37590, "uid": "e633a3d4fb3db2c7665c170c80db6717", "name": "SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning", "authors": [{"id": 159468, "fullname": "Ye-Chan Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/159468?format=json", "institution": "Hanyang University"}, {"id": 147634, "fullname": "SeungJu Cha", "url": "http://cvpr.thecvf.com/api/miniconf/users/147634?format=json", "institution": "Hanyang University"}, {"id": 180270, "fullname": "Si-Woo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/180270?format=json", "institution": "Hanyang Univiersity"}, {"id": 132367, "fullname": "minju Jeon", "url": "http://cvpr.thecvf.com/api/miniconf/users/132367?format=json", "institution": "Hanyang University"}, {"id": 187780, "fullname": "HyunGee Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/187780?format=json", "institution": "Hanyang Universty"}, {"id": 69900, "fullname": "Dong-Jin Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/69900?format=json", "institution": "Hanyang University"}], "abstract": "Weakly-Supervised Dense Video Captioning aims to localize and describe events in videos trained only on caption annotations, without temporal boundaries. Prior work introduced an implicit supervision paradigm based on Gaussian masking and complementary captioning. However, existing method focus merely on generating non-overlapping masks without considering their semantic relationship to corresponding events, resulting in simplistic, uniformly distributed masks that fail to capture semantically meaningful regions. Moreover, relying solely on ground-truth captions leads to sub-optimal performance due to the inherent sparsity of existing datasets. In this work, we propose SAIL, which constructs semantically-aware masks through cross-modal alignment. Our similarity-aware training objective guides masks to emphasize video regions with high similarity to their corresponding event captions. Furthermore, to guide more accurate mask generation under sparse annotation settings, we introduce an LLM-based augmentation strategy that generates synthetic captions to provide additional alignment signals. These synthetic captions are incorporated through an inter-mask mechanism, providing auxiliary guidance for precise temporal localization without degrading the main objective. Experiments on ActivityNet Captions and YouCook2 demonstrate state-of-the-art performance on both captioning and localization metrics.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37590", "url": null, "sourceid": 36355, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37591, "uid": "94eedebf6923d613525b288f05d9e3c2", "name": "PhysIR-Splat: Physically Consistent Thermal Infrared Radiative Transfer in  3D Gaussian Splatting", "authors": [{"id": 181233, "fullname": "Jingyuan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/181233?format=json", "institution": "Xidian University"}, {"id": 187781, "fullname": "Yumeng Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187781?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 138940, "fullname": "Fei Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/138940?format=json", "institution": "Hangzhou Institute of Technology, Xidian University"}, {"id": 153048, "fullname": "Mingjin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153048?format=json", "institution": "Xidian University"}], "abstract": "Thermal infrared (TIR) 3D reconstruction provides geometry that is intrinsically coupled to the temperature field, even in low-light, nighttime, and smoke-obscured environments. TIR imaging measures self-emitted thermal radiation driven by object temperature and is largely independent of external illumination; therefore, simply carrying over visible-spectrum assumptions to TIR-based 3D reconstruction and novel view synthesis (NVS) often results in floating artifacts and blurred edges. In addition, radiometric inconsistency and low contrast in TIR weaken structure-from-motion (SfM) initialization, which in turn hinders subsequent 3D Gaussian Splatting (3DGS) optimization. We present PhysIR-Splat, a 3DGS framework that follows infrared radiative transfer: we explicitly model temperature, emissivity, and environmental irradiance on Gaussian primitives and, during rendering, jointly account for thermal emission, the reflected component, and atmospheric transmittance to produce physically consistent thermal synthesis. We also introduce VGGT-IR, a Transformer-based feed-forward initializer that takes TIR input with optional RGB and directly regresses camera poses and initial geometry, providing a modality-aligned and stable starting point for PhysIR-Splat. Extensive experiments demonstrate that our method significantly surpasses existing approaches in thermal reconstruction quality and cross-view consistency, effectively suppressing floating artifacts and enhancing boundary sharpness. The code will be made publicly available upon acceptance of the paper.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37591", "url": null, "sourceid": 38976, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37592, "uid": "b258cda7f6762de012fddbb6477f5190", "name": "Few-Shot Incremental 3D Object Detection in Dynamic Indoor Environments", "authors": [{"id": 156778, "fullname": "Yun Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156778?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 127401, "fullname": "Jianjun Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/127401?format=json", "institution": "Nanjing University of Science and Techonology"}, {"id": 85000, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85000?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 137969, "fullname": "Jin Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/137969?format=json", "institution": "Nanjing University"}, {"id": 135934, "fullname": "Na Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/135934?format=json", "institution": "Singapore University of Technology and Design"}], "abstract": "Incremental 3D object perception is a critical step toward embodied intelligence in dynamic indoor environments. However, existing incremental 3D detection methods rely on extensive annotations of novel classes for satisfactory performance. To address this limitation, we propose FI3Det, a Few-shot Incremental 3D Detection framework that enables efficient 3D perception with only a few novel samples by leveraging vision-language models (VLMs) to learn knowledge of unseen categories. FI3Det introduces a VLM-guided unknown object learning module in the base stage to enhance perception of unseen categories. Specifically, it employs VLMs to mine unknown objects and extract comprehensive representations, including 2D semantic features and class-agnostic 3D bounding boxes. To mitigate noise in these representations, a weighting mechanism is further designed to re-weight the contributions of point- and box-level features based on their spatial locations and feature consistency within each box. Moreover, FI3Det proposes a gated multimodal prototype imprinting module, where category prototypes are constructed from aligned 2D semantic and 3D geometric features to compute classification scores, which are then fused via a multimodal gating mechanism for novel object detection. As the first framework for few-shot incremental 3D object detection, we establish  both batch and sequential evaluation settings on two datasets, ScanNet V2 and SUN RGB-D, where FI3Det achieves strong and consistent improvements over baseline methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37592", "url": null, "sourceid": 34511, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37593, "uid": "1a2556609a024ce314f6c2c4afd261bc", "name": "Convolutional Neural Networks Driven by Content Similarity", "authors": [{"id": 181732, "fullname": "Ligeng Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/181732?format=json", "institution": "Hunan Normal University"}, {"id": 187782, "fullname": "Guihu Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187782?format=json", "institution": "Central South University"}], "abstract": "Although convolutional neural networks (CNNs) have continued to evolve in recent years, Transformers have become increasingly popular in the field of computer vision. In this work, we open a new avenue for CNNs, enabling them to aggregate information based on content similarity\u2014an ability analogous to the self-attention mechanism. We innovatively adopt reverse thinking to transform the feature similarity between tokens into relative positional information: specifically, the closer the positions of two tokens are, the higher their feature similarity. This approach allows convolution operations to be indirectly transformed into an aggregation mode driven by content similarity. Experiments show that our proposed model, named Ego, achieves excellent performance across various tasks, underscoring the untapped potential of CNNs. Code and models will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37593", "url": null, "sourceid": 44411, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37598, "uid": "1083e9b50ae58b27dab4b7c92e7d16d9", "name": "TGSFormer: Scalable Temporal Gaussian Splatting for Embodied Semantic Scene Completion", "authors": [{"id": 142878, "fullname": "Rui Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/142878?format=json", "institution": null}, {"id": 187546, "fullname": "Haozhi Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187546?format=json", "institution": "The University of Tokyo, The University of Tokyo"}, {"id": 69214, "fullname": "Tianchen Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/69214?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 162897, "fullname": "TIANXIN HU", "url": "http://cvpr.thecvf.com/api/miniconf/users/162897?format=json", "institution": "Nanyang Technological University"}, {"id": 187796, "fullname": "Weixiang Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187796?format=json", "institution": "Nanyang Technological University"}, {"id": 131235, "fullname": "Shenghai Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/131235?format=json", "institution": "National Technological University"}, {"id": 72464, "fullname": "Lihua Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/72464?format=json", "institution": "Nanyang Technological University"}], "abstract": "Embodied 3D Semantic Scene Completion (SSC) infers dense geometry and semantics from continuous egocentric observations. Most existing Gaussian-based methods rely on random initialization of many primitives within predefined spatial bounds, resulting in redundancy and poor scalability to unbounded scenes. Recent depth-guided approach alleviates this issue but remains local, suffering from latency and memory overhead as scale increases.To overcome these challenges, we propose TGSFormer, a scalable Temporal Gaussian Splatting framework for embodied SSC. It maintains a persistent Gaussian memory for temporal prediction, without relying on image coherence or frame caches.For temporal fusion, a Dual Temporal Encoder jointly processes current and historical Gaussian features through confidence-aware cross-attention.Subsequently, a Confidence-aware Voxel Fusion module merges overlapping primitives into voxel-aligned representations, regulating density and maintaining compactness.Extensive experiments demonstrate that TGSFormer achieves state-of-the-art results on both local and embodied SSC benchmarks, offering superior accuracy and scalability with significantly fewer primitives while maintaining consistent long-term scene integrity. The code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37598", "url": null, "sourceid": 42296, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37599, "uid": "c55267f7bb895914a9181a94a0d9b4c3", "name": "CaptionQA: Is Your Caption as Useful as the Image Itself?", "authors": [{"id": 183101, "fullname": "Shijia Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183101?format=json", "institution": "AMD"}, {"id": 143990, "fullname": "Yunong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/143990?format=json", "institution": "Stanford University"}, {"id": 187797, "fullname": "Bohan Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/187797?format=json", "institution": "Snowflake"}, {"id": 131723, "fullname": "Ximeng Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/131723?format=json", "institution": "Boston University"}, {"id": 126358, "fullname": "Zicheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126358?format=json", "institution": "Microsoft"}, {"id": 149601, "fullname": "Emad Barsoum", "url": "http://cvpr.thecvf.com/api/miniconf/users/149601?format=json", "institution": "AMD"}, {"id": 150949, "fullname": "Manling Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/150949?format=json", "institution": "Northwestern University"}, {"id": 86399, "fullname": "Chenfeng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86399?format=json", "institution": "University of California Berkeley"}], "abstract": "Image captions serve as efficient surrogates for visual content in multimodal systems such as retrieval, recommendation, multi-step agentic inference pipelines. Yet current evaluation practices miss a fundamental question: Can captions stand-in for images in real downstream tasks? We propose a utility-based benchmark, CaptionQA, to evaluate model-generated captions, where caption quality is measured by how well it supports downstream tasks. CaptionQA is an extensible domain-dependent benchmark covering 4 domains, Natural, Document, E-commerce, and Embodied AI, each with fine-grained taxonomies (25 top-level and 69 subcategories) that identify useful information for domain-specific tasks. CaptionQA builds 33,027 densely annotated multiple-choice questions (50.3 per image on average) that explicitly require visual information to answer, providing a comprehensive probe of caption utility. In our evaluation protocol, an LLM answers these questions using captions alone, directly measuring whether captions preserve image-level utility and are utilizable by a downstream LLM. Evaluating state-of-the-art MLLMs reveals substantial gaps between the image and its caption utility. Notably, models nearly identical on traditional image-QA benchmarks lower by up to 32\\% in caption utility. We release CaptionQA along with an open-source pipeline for extension to new domains.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37599", "url": null, "sourceid": 43865, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37603, "uid": "57e7710feda049f5d5b0c46d6f611852", "name": "GPFlow: Gaussian Prototype Probability Flow for Unsupervised Multi-Modal Anomaly Detection", "authors": [{"id": 182964, "fullname": "YITING LI", "url": "http://cvpr.thecvf.com/api/miniconf/users/182964?format=json", "institution": "Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR)"}, {"id": 107391, "fullname": "Xulei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107391?format=json", "institution": "Institute for Infocomm Research (I2R), A*STAR"}, {"id": 187809, "fullname": "Jingyi Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187809?format=json", "institution": "Nanyang Technological University; A*STAR"}, {"id": 187810, "fullname": "Jing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187810?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 127844, "fullname": "Fayao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127844?format=json", "institution": "Institute for Infocomm Research, A*STAR"}], "abstract": "We address unsupervised multi-modal anomaly detection (MAD) in few-shot regimes, where only a handful of normal exemplars are available per class. Existing approaches struggle with such data scarcity due to their incapacity in capturing the distribution-level information of normal appearance and geometry. To capture diverse and continuous normality variations, we propose GPFlow, a probability flow inspired  framework that embeds diverse normal patterns into a latent space of learnable Gaussian prototypes. At its core, GPFlow uses an analytical Posterior\u2011Mean Path (PMP) router that  iteratively moves features toward prototype\u2011centered high\u2011probability neighborhoods, acting as an explicit information bottleneck to prevent trivial reconstruction of anomalies. To exploit multi-modal cues, GPFlow employs a coupled reconstruction architecture enforces both intra- and cross-modal consistency at the prototype level. Finally, to handle distribution shift between sparse training samples and unseen test samples, GPFlow incorporates inference-aware prototype refinement to dynamically expand the prototypes' coverage to new normal variations during test time. Extensive experiments on MVTec\u20113D\u2011AD and Eyecandies show that GPFlow achieves state\u2011of\u2011the\u2011art performance with only a few normal training samples, while remaining computationally efficient.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37603", "url": null, "sourceid": 42926, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37605, "uid": "399fb867d690d0bb82fae38942bc29ae", "name": "SODA: Sensitivity-Oriented Dynamic Acceleration for Diffusion Transformer", "authors": [{"id": 156705, "fullname": "Tong Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/156705?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 187812, "fullname": "Yusen Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187812?format=json", "institution": "Harbin Institute of Technology"}, {"id": 187813, "fullname": "Guoying Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/187813?format=json", "institution": "Harbin Institute of Technology"}, {"id": 187814, "fullname": "Jingde Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187814?format=json", "institution": "Harbin Institute of Technology"}, {"id": 89574, "fullname": "Zhuotao Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/89574?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 89550, "fullname": "Jingyong Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/89550?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Diffusion Transformers have become a dominant paradigm in visual generation, yet their low inference efficiency remains a key bottleneck hindering further advancement. Among common training-free techniques, caching offers high acceleration efficiency but often compromises fidelity, whereas pruning shows the opposite trade-off. Integrating caching with pruning achieves a balance between acceleration and generation quality. However, existing methods typically employ fixed and heuristic schemes to configure caching and pruning strategies. While they roughly follow the overall sensitivity trend of generation models to acceleration, they fail to capture fine-grained and complex variations, inevitably skipping highly sensitive computations and leading to quality degradation. Furthermore, such manually designed strategies exhibit poor generalization. To address these issues, we propose SODA, a Sensitivity-Oriented Dynamic Acceleration method that adaptively performs caching and pruning based on fine-grained sensitivity. SODA builds an offline sensitivity error modeling framework across timesteps, layers, and modules to capture the sensitivity to different acceleration operations. The cache intervals are optimized via dynamic programming with sensitivity error as the cost function, minimizing the impact of caching on model sensitivity. During pruning and cache reuse, SODA adaptively determines the pruning timing and rate to preserve computations of highly sensitive tokens, significantly enhancing generation fidelity. Extensive experiments on DiT-XL/2, PixArt-$\\alpha$, and OpenSora demonstrate that SODA achieves state-of-the-art generation fidelity under controllable acceleration ratios. The code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37605", "url": null, "sourceid": 42804, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37609, "uid": "232049f5aff15800ce41a9a7c4cf6730", "name": "A combination of noise and bilateral filters achieve supralinear and scalable adversarial robustness in CNNs", "authors": [{"id": 176644, "fullname": "Nicolas Stalder", "url": "http://cvpr.thecvf.com/api/miniconf/users/176644?format=json", "institution": "Ethz"}, {"id": 187847, "fullname": "Benjamin F Grewe", "url": "http://cvpr.thecvf.com/api/miniconf/users/187847?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 187848, "fullname": "Matteo Saponati", "url": "http://cvpr.thecvf.com/api/miniconf/users/187848?format=json", "institution": "Insititute of Neuroinformatics, University of Zurich and ETH Zurich, ETHZ - ETH Zurich"}, {"id": 187849, "fullname": "Pau Vilimelis Aceituno", "url": "http://cvpr.thecvf.com/api/miniconf/users/187849?format=json", "institution": "University of Zurich and ETH Zurich"}], "abstract": "The vulnerability of deep neural networks to adversarial examples poses a significant challenge for real-world deployment. Existing techniques to enhance deep network robustness rely on adversarial training, an approach that is powerful but computationally intensive and typically tailored to specific attack types. To address these limitations, existing works have explored techniques such as adding gaussian noise or filtering images, both of which can boost the network robustness to various adversarial attacks, albeit modestly. Here, we theoretically demonstrate that these two approaches enhance robustness against adversarial attacks through complementary mechanisms, resulting in supralinear robustness when combined.Building on this insight, we experimentally show that a simple preprocessor combining Gaussian noise and bilateral filtering yields supralinear improvements in adversarial robustness with minimal computational cost. Next, we combine our preprocessor with adversarial training and test on RobustBench to assess its supralinear improvement over state-of-the-art defenses. First, this combination ranks second on AutoAttack and third overall, while using only $\\sim$35\\% of the training FLOPs, using a model with 50\\% less parametets, trained with $\\sim$33\\% of the epochs and $\\sim$15\\% the data compared to state-of-the-art defenses. Second, our method scales efficiently, matching the accuracy of competing models with roughly 2\u20138\u00d7 less total compute across $\\sim$3 orders of magnitude. Overall, our approach provides a principled and easily integrable framework for enhancing adversarial robustness, offering negligible computational overhead and a simple yet theoretically grounded design.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37609", "url": null, "sourceid": 42267, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37611, "uid": "1ccae08c176dc746f2dae273210caac8", "name": "CrowdGaussian: Reconstructing High-Fidelity 3D Gaussians for Human Crowd from a Single Image", "authors": [{"id": 181903, "fullname": "Yizheng Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/181903?format=json", "institution": "Nanjing University"}, {"id": 70490, "fullname": "Yiyu Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70490?format=json", "institution": "Nanjing University"}, {"id": 175660, "fullname": "Qipeng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175660?format=json", "institution": "Nanjing University"}, {"id": 187856, "fullname": "Haixiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187856?format=json", "institution": "nanjing university"}, {"id": 187857, "fullname": "Jiahe Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187857?format=json", "institution": "nanjing university"}, {"id": 187858, "fullname": "Jing Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/187858?format=json", "institution": "nanjing university"}, {"id": 154534, "fullname": "Siyu Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154534?format=json", "institution": "Fudan University"}, {"id": 153839, "fullname": "Hao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153839?format=json", "institution": "Nanjing University"}], "abstract": "Single-view 3D human reconstruction has garnered significant attention in recent years. Despite numerous advancements, prior research has concentrated on reconstructing 3D models from clear, close-up images of individual subjects, often yielding subpar results in the more prevalent multi-person scenarios. Reconstructing 3D human crowd models is a highly intricate task, laden with challenges such as: 1) extensive occlusions, 2) low clarity, and 3) numerous and various appearances. To address this task, we propose CrowdGaussian, a unified framework that directly reconstructs multi-person 3D Gaussian Splatting (3DGS) representations from single-image inputs. To handle occlusions, we devise a self-supervised adaptation pipeline that enables the pretrained large human model to reconstruct complete 3D humans with plausible geometry and appearance from heavily occluded inputs.Furthermore, we introduce Self-Calibrated Learning (SCL). This training strategy enables single-step diffusion models to adaptively refine coarse renderings to optimal quality by blending identity-preserving samples with clean/corrupted image pairs. The outputs can be distilled back to enhance the quality of multi-person 3DGS representations. Extensive experiments demonstrate that CrowdGaussian generates photorealistic, geometrically coherent reconstructions of multi-person scenes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37611", "url": null, "sourceid": 41581, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37612, "uid": "3724b218f56fc9192efa744b62f4bd1c", "name": "Pantheon360: Taming Digital Twin Generation via 3D-Aware 360\u00b0 Video Diffusion", "authors": [{"id": 183235, "fullname": "Ting-Hsuan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/183235?format=json", "institution": "University of Southern California"}, {"id": 156011, "fullname": "Ying-Huan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/156011?format=json", "institution": "National Yang Ming Chiao Tung University"}, {"id": 187859, "fullname": "Tao Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187859?format=json", "institution": "Cornell University"}, {"id": 100237, "fullname": "Jie-Ying Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/100237?format=json", "institution": "National Yang Ming Chiao Tung University"}, {"id": 187860, "fullname": "Cho-Ying Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187860?format=json", "institution": "Bosch"}, {"id": 182284, "fullname": "Fangzhou Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/182284?format=json", "institution": "Worcester Polytechnic Institute"}, {"id": 184155, "fullname": "Hengyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184155?format=json", "institution": "Bosch Center for AI"}, {"id": 138764, "fullname": "David Paz", "url": "http://cvpr.thecvf.com/api/miniconf/users/138764?format=json", "institution": "Bosch Center for AI"}, {"id": 129002, "fullname": "Xinyu Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129002?format=json", "institution": "Robert Bosch Research NA"}, {"id": 96697, "fullname": "Yuliang Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/96697?format=json", "institution": "Bosch US Research"}, {"id": 127300, "fullname": "Yu-Lun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127300?format=json", "institution": "National Yang Ming Chiao Tung University"}, {"id": 75626, "fullname": "Yue Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75626?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 84539, "fullname": "Liu Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/84539?format=json", "institution": "Bosch Research"}], "abstract": "Generating complete digital twins from videos requires precise camera control, global scene coverage, and strict spatial\u2013temporal consistency\u2014constraints that remain challenging for perspective video generators due to their limited field of view (FoV). Their narrow FoV forces long or multi-view trajectories, amplifying cross-view inconsistency and temporal drift.We argue that 360\u00b0 video generation offers a natural solution: panoramic coverage simplifies trajectory design and provides strong global context for maintaining coherence. We introduce Pantheon360: Taming Digital Twin Generation via 3D-Aware 360\u00b0 Video Diffusion, a controllable 360\u00b0 video generation framework that synthesizes high-fidelity videos from sparse 360\u00b0 inputs. The key idea is an explicit 3D Cache, reconstructed from the input, which serves as a geometric scaffold for any user-defined camera path. This allows the diffusion model to focus on photorealistic texture refinement while the 3D Cache enforces global geometric consistency.Experiments show that Pantheon360 achieves superior visual quality and unmatched geometric coherence, enabling reliable and flexible 360\u00b0 scene generation for downstream simulation and digital-twin applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37612", "url": null, "sourceid": 41100, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37615, "uid": "735701335a53e5b70d7465c28eed4088", "name": "Dual Graph Regularized Deep Unfolding Network for Guided Depth Map Super-resolution", "authors": [{"id": 155339, "fullname": "Zhiwei Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/155339?format=json", "institution": "City University of Hong Kong"}, {"id": 155340, "fullname": "Peilin CHEN", "url": "http://cvpr.thecvf.com/api/miniconf/users/155340?format=json", "institution": "City University of Hong Kong"}, {"id": 187868, "fullname": "Qiangqiang Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187868?format=json", "institution": "City University of Hong Kong"}, {"id": 87210, "fullname": "Bo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/87210?format=json", "institution": "vivo Mobile Communication Co.,Ltd."}, {"id": 86613, "fullname": "Shiqi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86613?format=json", "institution": "City University of Hong Kong"}], "abstract": "Depth map super-resolution with color guidance is a fundamental task in computer vision that aims to reconstruct high-resolution depth maps by leveraging structural correlations from corresponding guidance images. Recently, with the development of deep learning techniques, the performance of guided depth super-resolution (GDSR) models has been significantly improved. However, most existing approaches rely on black-box architectures that lack theoretical interpretability. Although graph optimization has been explored to integrate model-driven and data-driven frameworks, it remains computationally expensive and struggles to preserve the intrinsic structures of the depth maps. To overcome these limitations, we propose a novel GDSR framework based on a dual graph Laplacian prior, termed LapNet, which efficiently unfolds graph optimization into a deep neural network. Specifically, we first formulate a dual graph Laplacian  prior that separately models structural dependencies along the row and column dimensions of the depth maps. This formulation explicitly enforces piecewise smoothness while reducing computational complexity from $\\mathcal{O}(H^3W^3)$ to $\\mathcal{O}(H^3 + W^3)$ by avoiding the construction of global affinity graph. Furthermore, we develop a deep implicit prior to extract high-frequency structural cues from the guidance image, serving as a complementary component to the manually designed prior. Finally, we integrate these complementary priors into a unified variational optimization framework, which is efficiently solved through alternating minimization and subsequently unfolded into an interpretable multi-stage deep network. Extensive experiments on both synthetic and real-world datasets demonstrate that LapNet achieves state-of-the-art performance while maintaining low computational complexity.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37615", "url": null, "sourceid": 32715, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37621, "uid": "11bd1fa9edc725f8d2a295da79eeb9fb", "name": "PerpetualWonder: Long-horizon Action-conditioned 4D Scene Generation", "authors": [{"id": 187882, "fullname": "Jiahao Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187882?format=json", "institution": "Fudan University"}, {"id": 182076, "fullname": "Zizhang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182076?format=json", "institution": "Stanford University"}, {"id": 85365, "fullname": "Hong-Xing Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85365?format=json", "institution": "Computer Science Department, Stanford University"}, {"id": 84531, "fullname": "Jiajun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84531?format=json", "institution": "Stanford University"}], "abstract": "We introduce PerpetualWonder, a hybrid generative simulator that enables long-horizon, action-conditioned 4D scene generation from a single image. Current works fail at this task because their physical state is decoupled from their visual representation, which prevents generative refinements to update the underlying physics for subsequent interactions. PerpetualWonder solves this by introducing the first true closed-loop system. It features a novel unified representation that creates a bidirectional link between the physical state and visual primitives, allowing generative refinements to correct both the dynamics and appearance. It also introduces a robust update mechanism that gathers supervision from multiple viewpoints to resolve optimization ambiguity. Experiments demonstrate that from a single image, PerpetualWonder can successfully simulate complex, multi-step interactions from long-horizon actions, maintaining physical plausibility and visual consistency.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37621", "url": null, "sourceid": 41596, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37626, "uid": "fedc1c02505c1330544af28c1abe2528", "name": "High-Fidelity Generation of Lane Scenes under Adverse Weather and Lighting Conditions without Re-annotation", "authors": [{"id": 187899, "fullname": "Daichao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187899?format=json", "institution": "Anhui University"}, {"id": 187900, "fullname": "Qiupu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187900?format=json", "institution": "Henan Univeristy"}, {"id": 187901, "fullname": "Feng He", "url": "http://cvpr.thecvf.com/api/miniconf/users/187901?format=json", "institution": "University of Science and Technology of China"}, {"id": 129141, "fullname": "Xin Ning", "url": "http://cvpr.thecvf.com/api/miniconf/users/129141?format=json", "institution": "Institute of Semiconductors, Chinese Academy of Sciences"}, {"id": 182645, "fullname": "Qiankun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182645?format=json", "institution": "Nanyang Technological University"}], "abstract": "Lane detection is a crucial task in autonomous driving, which is conducive to ensuring the safe operation of vehicles. However, current datasets like CULane and TuSimple have relatively limited data under extreme weather conditions, such as rain, snow and fog, which makes detection models unreliable in extreme conditions, potentially leading to serious safety-critical failures on the road. In this direction, we propose \\textbf{\\textit{HG-Lane}}, a \\textbf{H}igh-fidelity \\textbf{G}eneration framework for \\textbf{Lane} Scenes under adverse weather and lighting conditions, without the need for re-annotation and training. Based on our framework, we further propose a benchmark that includes adverse weather and lighting conditions, with 30,000 images. Experiment results demonstrate that our method constantly and significantly improves the detection performance of all the related lane detection networks. Taking the state-of-the-art CLRNet as an example, the overall mF1 on our benchmark increases by 20.87%. The F1@50 for the overall, normal, snow, rain, fog, night, and dusk categories increases by 19.75%, 8.63%, 38.8%, 14.96%, 26.84%, 21.5%, and 12.04%, respectively. Code and dataset are included in the supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37626", "url": null, "sourceid": 43829, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37628, "uid": "6ddc098a344870bdb384e1de45e4c8ea", "name": "PhyOceanCast: Global Ocean Forecasting with Physics-Informed Diffusion", "authors": [{"id": 176144, "fullname": "Qixiu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/176144?format=json", "institution": "National University of Defense Technology, China"}, {"id": 175465, "fullname": "Xiang Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175465?format=json", "institution": "College of meteorology and oceanography, National University of Defense Technology"}, {"id": 187905, "fullname": "Xiaoyong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187905?format=json", "institution": "National University of Defense Technology"}, {"id": 154126, "fullname": "Xiaolong Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154126?format=json", "institution": "Nanjing University of Information Science and Technology"}], "abstract": "Ocean dynamics drive global climate patterns and extreme weather events, making accurate spatiotemporal forecasting essential for climate monitoring and marine operations. Traditional Global Ocean Forecasting Systems (GOFSs) offer high accuracy predictions, yet remain computationally expensive and fail to fully leverage growing historical data. Recent deep learning models have achieved notable success, but still face three fundamental challenges: (1) they homogenize ocean variables despite strong physical coupling via equation-of-state relationships; (2) they neglect spherical geometry, resulting in severe distortions at high latitudes; and (3) they struggle to model multi-scale temporal dynamics. We introduce PhyOceanCast, a physics-informed diffusion model that overcomes these limitations through two key innovations. First, the Spherical Graph Attention Network for Multi-scale Ocean Coupling (SGAN-MOC) preserves spherical topology while enabling cross-variable interactions via heterogeneous encoding and k-hop-constrained attention. Second, the Physics-Informed Wavelet Temporal Coherence (PWTC) module that decomposes ocean dynamics across multiple scales with advection-diffusion constraints. PhyOceanCast forecasts 145 ocean variables, including temperature, salinity, and velocity fields, across 36 depth levels plus sea surface height. Extensive experiments demonstrate superior performance over diffusion, transformer, and hybrid baselines, promising a new paradigm for global ocean canonical variable forecasting. Code is available at supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37628", "url": null, "sourceid": 46597, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37629, "uid": "1ad4dd307e6008f932e93b13dd19de5d", "name": "AdaSVD: Singular Value Decomposition with Adaptive Mechanisms for Large Multimodal Models", "authors": [{"id": 187906, "fullname": "Zhiteng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187906?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 187907, "fullname": "Mingyuan Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/187907?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 181067, "fullname": "JINGYUAN ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/181067?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 187908, "fullname": "Zheng Hui", "url": "http://cvpr.thecvf.com/api/miniconf/users/187908?format=json", "institution": "Baidu"}, {"id": 153355, "fullname": "Haotong Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/153355?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 187909, "fullname": "Linghe Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187909?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 69292, "fullname": "Yulun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/69292?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 86696, "fullname": "Xiaokang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86696?format=json", "institution": "Shanghai Jiao Tong University, China"}], "abstract": "Large Multimodal Models (LMMs) have attained impressive achievements in multimodal processing tasks, yet their massive memory demands pose major obstacles to deployment on resource-limited devices. Singular Value Decomposition (SVD) has emerged as a promising compression technique for LMMs, delivering substantial reductions in memory overhead. However, existing SVD-based methods often struggle to effectively alleviate the errors caused by SVD truncation, resulting in a noticeable performance gap when compared to the original models. Moreover, adopting a uniform compression ratio across all transformer layers fails to consider the varying importance of different layers. To tackle these challenges, we propose AdaSVD, an adaptive SVD-based LMM compression approach. Specifically, AdaSVD introduces adaComp, which adaptively compensates for SVD truncation errors by alternately updating the singular matricesand. Additionally, AdaSVD introduces adaCR, which adaptively assigns layer-specific compression ratios according to the relative importance of each layer. Comprehensive experiments across multiple LMM families show the effectiveness of AdaSVD, achieving better performance while significantly reducing memory requirements. We will make all the code and models of AdaSVD publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37629", "url": null, "sourceid": 44952, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37631, "uid": "8253c7f87a4ec34d392b5fb236aad608", "name": "ActiveGrasp: Information-Guided Active Grasping with Calibrated Energy-based Model", "authors": [{"id": 187912, "fullname": "Boshu Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/187912?format=json", "institution": "University of Pennsylvania, University of Pennsylvania"}, {"id": 187913, "fullname": "Wen Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187913?format=json", "institution": "Waymo"}, {"id": 75501, "fullname": "Kostas Daniilidis", "url": "http://cvpr.thecvf.com/api/miniconf/users/75501?format=json", "institution": "University of Pennsylvania"}], "abstract": "Grasping in a densely cluttered environment is a challenging task for robots. Previous methods tried to solve this problem by actively gathering multiple views before grasp pose generation. However, they either overlooked the importance of the grasp distribution for information gain estimation or relied on the projection of the grasp distribution, which ignores the structure of grasp poses on the SE(3) manifold. To tackle these challenges, we propose a calibrated energy-based model for grasp pose generation and an active view selection method that estimates information gain from grasp distribution. Our energy-based model captures the multi-modality nature of grasp distribution on the SE(3) manifold. The energy level is calibrated to the success rate of grasps so that the predicted distribution aligns with the real distribution.The next best view is selected by estimating the information gain for grasp from the calibrated distribution conditioned on the reconstructed environment, which could efficiently drive the robot to explore affordable parts of the target object. Experiments on simulated environments and real robot setups demonstrate that our model could successfully grasp objects in a cluttered environment with limited view budgets compared to previous state-of-the-art models. Our simulated environment can serve as a reproducible platform for future research on active grasping.The source code of our paper will be made public when the paper is released to the public.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37631", "url": null, "sourceid": 43082, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37632, "uid": "9cfe7097cc936bf5e8ffa224c7231653", "name": "Animator-Centric Skeleton Generation on Objects with Fine-Grained Details", "authors": [{"id": 71203, "fullname": "Mingze Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/71203?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 187914, "fullname": "Cheng Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187914?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 177236, "fullname": "Pei Jiansong", "url": "http://cvpr.thecvf.com/api/miniconf/users/177236?format=json", "institution": "Tsinghua University"}, {"id": 158047, "fullname": "Junhao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/158047?format=json", "institution": "Tsinghua University"}, {"id": 151778, "fullname": "Chaoyue Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/151778?format=json", "institution": "Nanyang Technological University"}, {"id": 176015, "fullname": "Shaohui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176015?format=json", "institution": "Tencent"}, {"id": 187915, "fullname": "Tianyuan Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187915?format=json", "institution": null}, {"id": 187916, "fullname": "Bin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187916?format=json", "institution": null}, {"id": 187917, "fullname": "Zijiao Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187917?format=json", "institution": null}, {"id": 87774, "fullname": "Ruqi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87774?format=json", "institution": "Tsinghua Shenzhen International Graduate School/Tsinghua Berkeley Shenzhen Institute "}], "abstract": "Skeleton generation is essential for animating 3D assets, but current deep learning methods remain limited: they cannot handle the growing structural complexity of modern models and offer minimal controllability, creating a major bottleneck for real-world animation workflows. To address this, we propose an animator-centric SG framework that achieves high-quality skeleton prediction on complex inputs while providing intuitive control handles. Our contributions are threefold. First, we curate a large-scale dataset of 82,633 rigged meshes with diverse and complicated structures. Second, we introduce a novel semantic-aware tokenization scheme for auto-regressive modeling. This scheme effectively complements purely geometric prior methods by subdividing bones into semantically meaningful groups, thereby enhancing robustness to structural complexity and enabling a key control mechanism. Third, we design a learnable density interval module that allows animators to exert soft, direct control over bone density. Extensive experiments demonstrate that our framework not only generates high-quality skeletons for challenging inputs but also successfully fulfills two critical requirements from professional animators. Our work paves the way for more flexible and efficient animation pipelines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37632", "url": null, "sourceid": 38418, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37634, "uid": "d0dcbef5a8ea4e3440294882ca3d7cfb", "name": "HandWorld: Hand-Centric Unified Video Action Generation", "authors": [{"id": 182561, "fullname": "Zhihao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/182561?format=json", "institution": "Fudan University"}, {"id": 146215, "fullname": "Zhiying Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/146215?format=json", "institution": "Fudan University"}, {"id": 85606, "fullname": "Xitong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85606?format=json", "institution": "Meta"}, {"id": 74132, "fullname": "Zuxuan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/74132?format=json", "institution": "Fudan University"}], "abstract": "Hand-object interaction forms the foundation of how humans interact with the world.Understanding the connection between hand action and egocentric video is essential for enabling embodied agents to perceive, simulate, and plan like humans. However, it is challenging to learn and predict across handactions and egocentric videos due to their non-linear relationship. In this work, we introduce HandWorld, a unified generative framework that focuses on hand-object interaction and jointly models egocentric videos and hand actions. HandWorld learns shared cross-domain conditions through a dual-branch condition network that integrates information from both video and action domains. MANO-rendered hand representation is incorporated as an intermediate input to further enhance cross-domain coherence.Conditioned on the shared representation, two decoupled diffusion transformers are trained to predict in their respective domain. A flexible training strategy enables the model to learn across diverse task configurations, including action forecasting and controllable video generation. Experiments on large-scale egocentric HOI datasets demonstrate that HandWorld achieves high-fidelity video synthesis and accurate action prediction, outperforming existing baselines across diverse scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37634", "url": null, "sourceid": 43650, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37638, "uid": "edbac39df611159f89c10012a27d1563", "name": "BOP-ASK: Object-Interaction Reasoning for Vision-Language Models", "authors": [{"id": 166051, "fullname": "Vineet Bhat", "url": "http://cvpr.thecvf.com/api/miniconf/users/166051?format=json", "institution": "New York University"}, {"id": 187925, "fullname": "Sungsu Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/187925?format=json", "institution": "New York University"}, {"id": 87370, "fullname": "Valts Blukis", "url": "http://cvpr.thecvf.com/api/miniconf/users/87370?format=json", "institution": "NVIDIA"}, {"id": 128340, "fullname": "Greg Heinrich", "url": "http://cvpr.thecvf.com/api/miniconf/users/128340?format=json", "institution": "NVIDIA"}, {"id": 149482, "fullname": "Prashanth Krishnamurthy", "url": "http://cvpr.thecvf.com/api/miniconf/users/149482?format=json", "institution": "New York University"}, {"id": 187926, "fullname": "Ramesh Karri", "url": "http://cvpr.thecvf.com/api/miniconf/users/187926?format=json", "institution": "New York University"}, {"id": 159437, "fullname": "Stan Birchfield", "url": "http://cvpr.thecvf.com/api/miniconf/users/159437?format=json", "institution": "NVIDIA"}, {"id": 149062, "fullname": "Farshad Khorrami", "url": "http://cvpr.thecvf.com/api/miniconf/users/149062?format=json", "institution": "New York University"}, {"id": 87339, "fullname": "Jonathan Tremblay", "url": "http://cvpr.thecvf.com/api/miniconf/users/87339?format=json", "institution": "NVIDIA"}], "abstract": "Vision\u2013Language Models (VLMs) have achieved impressive performance on spatial reasoning benchmarks, yet these evaluations mask critical weaknesses in understanding object interactions. Current benchmarks test high-level relationships (\"left of,\" \"behind\", etc.) but ignore fine-grained spatial understanding needed for real-world applications: precise 3D localization, physical compatibility between objects, object affordances and multi-step spatial planning. In this work, we present BOP-ASK, a novel large-scale dataset for object-interaction reasoning for both training and benchmarking. Our data generation pipeline leverages 6D object poses from the Benchmark for Object Pose Estimation (BOP) datasets from which we derive fine-grained annotations such as grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object relationships. BOP-ASK comprises over 150k images and 33M question\u2013answer pairs spanning six tasks (four novel), providing a rich resource for training and evaluating VLMs. We evaluate proprietary and open-sourced VLMs, and conduct human evaluations on BOP-ASK-core, a contributed test benchmark. We also release BOP-ASK-lab, an out-of-distribution benchmark with images not sourced from BOP, enabling testing of generalization. Our experiments demonstrate that models trained on BOP-ASK outperform baselines and exhibit emergent capabilities such as precise object and grasp pose estimation, trajectory planning, and fine-grained object-centric spatial reasoning in cluttered environments. We will publicly release our datasets and dataset generation pipeline.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37638", "url": null, "sourceid": 31924, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37641, "uid": "cdde90d3321aee7ac051cf58d649a5f5", "name": "MambaSIC: Mamba-based Stereo Image Compression with Bi-directional Multi-reference Entropy Model", "authors": [{"id": 181159, "fullname": "Shiyu Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/181159?format=json", "institution": "Tsinghua University"}, {"id": 106651, "fullname": "XINJIE ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/106651?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 183140, "fullname": "Zhening Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183140?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 181347, "fullname": "Jinpeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181347?format=json", "institution": "Harbin Institue of Technology, Shenzhen"}, {"id": 87209, "fullname": "Bin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87209?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 187934, "fullname": "Jiawei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187934?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 187935, "fullname": "Yifan Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/187935?format=json", "institution": null}, {"id": 87242, "fullname": "Shu-Tao Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/87242?format=json", "institution": "Shenzhen International Graduate School, Tsinghua University"}, {"id": 90255, "fullname": "Jun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90255?format=json", "institution": "The Hong Kong University of Science and Technology"}], "abstract": "Stereo image compression (SIC) has become increasingly vital with its applications surging in fields such as 3D reconstruction and autonomous navigation. Previous methods leverage cross-attention to model inter-view redundancy and employ autoregressive entropy models to predict probability distributions, achieving impressive rate-distortion performance. However, they suffer from slow coding speed due to the quadratic complexity of cross-attention mechanisms and the spatial autoregressive iterations of the entropy models. To address these limitations, we propose MambaSIC, which introduces two key innovations. First, we propose a Mamba-based stereo visual state space block (stereo VSSB) that leverages its linear complexity and long-range modeling capabilities to more rapidly and efficiently capture redundancy information between the two views. Second, to accelerate the compression process and enhance the accuracy of probability distribution estimation, we introduce a bi-directional multi-reference entropy model that utilizes a checkerboard partitioning strategy and the stereo VSSB to get rich inter-view priors. Experimental results demonstrate that our MambaSIC outperforms the state-of-the-art methods in both rate-distortion performance and coding efficiency. Moreover, it achieves the smallest inter-view PSNR discrepancy, resulting in more balanced reconstruction quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37641", "url": null, "sourceid": 40980, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37642, "uid": "3a0b86bff1bab2d03baa6b260e578b50", "name": "Object-WIPER: Training-Free Object and Associated Effect Removal in Videos", "authors": [{"id": 136674, "fullname": "Saksham Singh Kushwaha", "url": "http://cvpr.thecvf.com/api/miniconf/users/136674?format=json", "institution": "University of Texas at Dallas"}, {"id": 187437, "fullname": "Sayan Nag", "url": "http://cvpr.thecvf.com/api/miniconf/users/187437?format=json", "institution": "Adobe Research"}, {"id": 87904, "fullname": "Yapeng Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/87904?format=json", "institution": "University of Texas at Dallas"}, {"id": 86085, "fullname": "Kuldeep Kulkarni", "url": "http://cvpr.thecvf.com/api/miniconf/users/86085?format=json", "institution": "Adobe Systems"}], "abstract": "In this paper, we introduce Object-WIPER, a training-free framework for removing dynamic objects and their associated visual effects from videos, and inpainting them with semantically consistent and temporally coherent content. Our approach leverages a pre-trained text-to-video diffusion transformer (DiT). Given an input video, a user-provided object mask, and query tokens describing the target object and its effects, we localize relevant visual tokens via visual-text cross-attention and visual self-attention. This produces an intermediate effect mask that we fuse with the user mask to obtain a final foreground token mask to replace. We first invert the video through the DiT to obtain structured noise, then reinitialize the masked tokens with Gaussian noise while preserving background tokens. During denoising, we copy values for the background tokens saved during inversion to maintain scene fidelity. To address the lack of suitable evaluation, we introduce a new object removal metric that rewards temporal consistency among foreground tokens across consecutive frames, coherence between foreground and background tokens within each frame, and dissimilarity between the input and output foreground tokens. Experiments on DAVIS and a newly curated real-world associated effect benchmark WIPER-Bench show that Object-WIPER surpasses both training-based and training-free baselines in terms of the metric, achieving clean removal and temporally stable reconstruction without any retraining. Our new benchmark, source code, and pre-trained models will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37642", "url": null, "sourceid": 31201, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37644, "uid": "09431f00e08dbe3e72d3d2c5b66ed0b6", "name": "Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision\u2013Language Understanding", "authors": [{"id": 158482, "fullname": "Yutao Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158482?format=json", "institution": "Johns Hopkins University"}, {"id": 99408, "fullname": "Cheng Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/99408?format=json", "institution": "Bosch Research"}, {"id": 77002, "fullname": "Gaurav Mittal", "url": "http://cvpr.thecvf.com/api/miniconf/users/77002?format=json", "institution": "Microsoft"}, {"id": 153120, "fullname": "Rohith Kukkala", "url": "http://cvpr.thecvf.com/api/miniconf/users/153120?format=json", "institution": "Microsoft"}, {"id": 75499, "fullname": "Rama Chellappa", "url": "http://cvpr.thecvf.com/api/miniconf/users/75499?format=json", "institution": "Johns Hopkins University"}, {"id": 187937, "fullname": "Cheng Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187937?format=json", "institution": "School of Data Science, University of Virginia; Mathematical Institute for Data Science (MINDS) at JHU"}, {"id": 172531, "fullname": "Mei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/172531?format=json", "institution": "Dolby"}], "abstract": "Recent advances in 3D vision\u2013language models (VLMs) highlight a strong potential for 3D scene understanding and reasoning.However, effectively tokenizing 3D scenes into holistic scene tokens, and leveraging these tokens across diverse 3D understanding tasks, remain highly challenging. We present NDTokenizer3D, a generalist 3D VLM that performs a wide range of 3D scene understanding tasks while naturally supporting human interactions, thereby bridging language-level reasoning with 3D spatial understanding. The core of our approach is a novel three-stage scene tokenization pipeline built upon a Multi-Scale Normal Distributions Transform (NDT) representation, paired with a Multi-Scale NDT Decoder (MSDec).Specifically, NDTokenizer3D first constructs a multi-scale NDT representation from raw high-resolution point clouds, preserving both global context and fine-grained geometric details. Next, the MSDec progressively fuses cross-scale NDT features, producing holistic scene tokens consumable by LLM endpoints. Beyond tokenization, MSDec is repurposed as a general interface for human-interactive prompting (points, boxes, masks) and segmentation-mask decoding, unifying diverse 3D scene understanding tasks within a single architecture. With this compact and unified design, NDTokenizer3D offers a fine-grained, general-purpose 3D VLM, achieving remarkable improvements in 3D Referring Segmentation, 3D Visual Question Answering, and 3D Dense Captioning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37644", "url": null, "sourceid": 42927, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37645, "uid": "5ef243f333262b761775d558aba4a864", "name": "TUDSR: Twice Upsampling-Diffusion for Higher Super-Resolution", "authors": [{"id": 181483, "fullname": "Zhiqiang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181483?format=json", "institution": "East China Normal University"}, {"id": 187938, "fullname": "Yitong Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187938?format=json", "institution": "Zhejiang University"}, {"id": 128955, "fullname": "Xian Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/128955?format=json", "institution": "Chinese Academy of Sciences"}], "abstract": "Diffusion-based generative models have achieved remarkable success in real-world image super-resolution (SR). With tiled diffusion techniques, these models can produce high-resolution images that exceed their native-supported resolution. However, the quality of such high-resolution (e.g $2048^2$) outputs often remains extremely poor, primarily due to two factors we consider: the image upsampling ratio (e.g $\\times8$) exceeding the model's native-supported upsampling ratio (e.g $\\times4$), and the model's native-supported resolution. In practice, training a native high-resolution model requires larger architectures, which incur significant computational overhead and GPU memory costs, making it hard on limited-resource equipment. Thus, we present \\textbf{TUDSR}, a \\textbf{T}wice \\textbf{U}psampling\u2013\\textbf{D}iffusion framework for higher SR. The TUDSR framework mainly consists of two stages: the first involves training at $R$-resolution, and the second introduces a looped chunk-based training strategy at $NR$-resolution. Each stage adapts a one-step GAN architecture comprising a generator and a discriminator. Based on SD2.1-base, we develop TUDSR-S, which achieves state-of-the-art performance across multiple benchmarks. Extensive experiments further demonstrate that TUDSR-S generates high-quality images at the resolutions of $1024^2$ and even $2048^2$, significantly outperforming existing approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37645", "url": null, "sourceid": 45949, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37646, "uid": "abaae1c5a4f3eb2248bfc782c08ac6b0", "name": "CLEX: Complementary Label Exchange Learning for Noisy Facial Expression Recognition", "authors": [{"id": 183491, "fullname": "Lin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183491?format=json", "institution": "South China University of Technology"}, {"id": 187939, "fullname": "Fang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187939?format=json", "institution": "Guangdong University of Finance"}, {"id": 187940, "fullname": "Xiaofen Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/187940?format=json", "institution": "South China University of Technology"}, {"id": 187941, "fullname": "Kailing Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187941?format=json", "institution": "South China University of Technology"}, {"id": 130259, "fullname": "Xiangmin Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130259?format=json", "institution": "South China University of Technology"}], "abstract": "Facial expression recognition (FER) in the wild is severely hampered by label noise and annotation ambiguity. Existing methods, including sample selection, label ensembling, and consistency regularization, primarily rely on ordinary label supervision and offer limited control over non-target predictions, leading to spurious activations and overfitting to noisy labels. To address this limitation, we propose a novel learning framework, named Complementary Label Exchange Learning (CLEX), enhances robustness by exchanging knowledge from non-target predictions across augmented views. Specifically, CLEX comprises three synergistic components. First, Stochastic Non-Target Logit Exchange randomly swaps a subset of non-target logits between original and augmented views to couple error-prone predictions, creating robust consistency constraints. Second, Scale-Invariant Logit Normalization eliminates magnitude artifacts through $L_p$-norm normalization, ensuring that regularization operates over geometrically meaningful directions rather than being dominated by arbitrary scales. Third, Complementary Suppression Loss selectively penalizes spurious activations over a randomly retained subset of non-target classes, avoiding the uniform shrinkage that hampers discriminative learning. To further stabilize training, we incorporate attention consistency regularization that enforces spatial alignment between augmented views, while retaining auxiliary cross-entropy to preserve semantic localization capability. Extensive experiments across multiple benchmark FER datasets (RAF-DB, FERPlus, and AffectNet) demonstrate that CLEX consistently outperforms existing robust FER learning approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37646", "url": null, "sourceid": 43429, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37647, "uid": "042e830eefee166996d5f71239ee1e4c", "name": "LitePT: Lighter Yet Stronger Point Transformer", "authors": [{"id": 69246, "fullname": "Yuanwen Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/69246?format=json", "institution": "ETH Zurich &amp; University of Oxford"}, {"id": 162612, "fullname": "Damien Robert", "url": "http://cvpr.thecvf.com/api/miniconf/users/162612?format=json", "institution": "University of Zurich"}, {"id": 160682, "fullname": "Jianyuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/160682?format=json", "institution": "Oxford VGG"}, {"id": 187942, "fullname": "Sunghwan Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187942?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 87671, "fullname": "Jan D. Wegner", "url": "http://cvpr.thecvf.com/api/miniconf/users/87671?format=json", "institution": "University of Zurich"}, {"id": 129663, "fullname": "Christian Rupprecht", "url": "http://cvpr.thecvf.com/api/miniconf/users/129663?format=json", "institution": "University of Oxford"}, {"id": 86863, "fullname": "Konrad Schindler", "url": "http://cvpr.thecvf.com/api/miniconf/users/86863?format=json", "institution": "ETH Zurich"}], "abstract": "Modern neural architectures for 3D point cloud processing contain both convolutional layers and attention blocks, but the best way to assemble them remains unclear. We analyse the role of different computational blocks in 3D point cloud networks and find an intuitive behaviour: convolution is adequate to extract low-level geometry at high-resolution in early layers, where attention is expensive without bringing any benefits; attention captures high-level semantics and context in low-resolution, deep layers more efficiently. Guided by this design principle, we propose a new, improved 3D point cloud backbone that employs convolutions in early stages and switches to attention for deeper layers. To avoid the loss of spatial layout information when discarding redundant convolution layers, we introduce a novel, training-free 3D positional encoding, PointROPE. The resulting LitePT model has 3.6\u00d7 fewer parameters, runs 2\u00d7 faster, and uses 2\u00d7 less memory than the state-of-the-art Point Transformer V3, but nonetheless matches or even outperforms it on a range of tasks and datasets.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37647", "url": null, "sourceid": 40192, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37648, "uid": "498a4339f96d1949476f4b857c3729c2", "name": "Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction", "authors": [{"id": 180903, "fullname": "Changqing Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/180903?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 145274, "fullname": "Yueru Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/145274?format=json", "institution": "Beijing University of Post and Telecommunications; The Chinese University of Hong Kong"}, {"id": 187943, "fullname": "Changhao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187943?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}], "abstract": "Accurate 3D scene understanding is essential for embodied intelligence, with occupancy prediction emerging as a key task for reasoning about both objects and free space. Existing approaches largely rely on depth priors (e.g., DepthAnything) but make only limited use of 3D cues, restricting performance and generalization. Recently, visual geometry models such as VGGT have shown strong capability in providing rich 3D priors, but similar to monocular depth foundation models, they still operate at the level of visible surfaces rather than volumetric interiors, motivating us to explore how to more effectively leverage these increasingly powerful geometry priors for 3D occupancy prediction.We present GPOcc, a framework that leverages generalizable visual geometry priors (GPs) for monocular occupancy prediction. Our method extends surface points inward along camera rays to generate volumetric samples, which are represented as Gaussian primitives for probabilistic occupancy inference. To handle streaming input, we further design a training-free incremental update strategy that fuses per-frame Gaussians into a unified global representation. Experiments on Occ-ScanNet and EmbodiedOcc-ScanNet demonstrate significant gains: GPOcc improves mIoU by +9.99 in the monocular setting and +11.79 in the streaming setting over prior state of the art. Under the same depth prior, it achieves +6.73 mIoU while running 2.65 $\\times$ faster. These results highlight that GPOcc leverages geometry priors more effectively and efficiently. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37648", "url": null, "sourceid": 34527, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37649, "uid": "ad0cb903053ea88697227d16d33d7012", "name": "MangoBench: A Benchmark for Multi-Agent Goal-Conditioned Offline Reinforcement Learning", "authors": [{"id": 144045, "fullname": "Yi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/144045?format=json", "institution": "Sun Yat-sen University"}, {"id": 187944, "fullname": "Ningze Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187944?format=json", "institution": null}, {"id": 187945, "fullname": "Zhiheng Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187945?format=json", "institution": "University of Western Australia"}, {"id": 75725, "fullname": "Longguang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75725?format=json", "institution": "National University of Defense Technology"}, {"id": 127818, "fullname": "Ye Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127818?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 87094, "fullname": "Yulan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/87094?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Offline Multi-Agent Reinforcement Learning (MARL) is critical for coordinating multiple agents in costly and unsafe environments, yet existing methods struggle from high sensitivity to reward functions and weak generalization to new goals, limiting its practical impact. Inspired by single-agent Offline Goal-Conditioned RL (OGCRL), we propose the first goal-conditioned offline MARL framework, extending OGCRL to multi-agent settings under both fully decentralized and centralized training with decentralized execution (CTDE) paradigms. To systematically evaluate this setting, we introduce MangoBench, the first fully cooperative multi-goal benchmark for MARL, covering 3 environments, 4 agent types, and 47 tasks, designed to assess joint-control locomotion, synchronous and asynchronous bimanual manipulation, and robustness to high-dimensional inputs. Extensive experiments demonstrate that our baselines achieve strong multi-goal generalization under sparse rewards, yet no method dominates all tasks, revealing both the intrinsic complexity and the unexplored potential of goal-conditioned offline MARL.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37649", "url": null, "sourceid": 32057, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37652, "uid": "c7bf6d5cc46eabd157d7bb7e037b3e00", "name": "FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection", "authors": [{"id": 174570, "fullname": "Mingyu Ouyang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174570?format=json", "institution": null}, {"id": 88436, "fullname": "Kevin Qinghong Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/88436?format=json", "institution": "national university of singaore, National University of Singapore"}, {"id": 73557, "fullname": "Mike Zheng Shou", "url": "http://cvpr.thecvf.com/api/miniconf/users/73557?format=json", "institution": "National University of Singapore"}, {"id": 187951, "fullname": "Hwee Tou Ng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187951?format=json", "institution": "National University of Singapore"}], "abstract": "Vision-Language Models (VLMs) have shown strong performance on User Interface (UI) grounding tasks, driven by their ability to process increasingly high-resolution screenshots. However, screenshots are tokenized into thousands of visual tokens (e.g., about 4,700 for 2K resolution), which incurs significant computational overhead and dilutes attention. In contrast, humans typically focus on regions of interest when interacting with UIs. In this work, we pioneer the task of efficient UI grounding. We propose FocusUI, an efficient UI grounding framework that selects patches most relevant to the instruction while preserving positional continuity for precise grounding. FocusUI addresses two key challenges: (1) eliminating redundant tokens in visual encoding by constructing patch-level supervision that fuses an instruction-conditioned score with a rule-based UI-graph score, down-weighting large homogeneous regions to select distinct and instruction-relevant visual tokens; and (2) preserving positional continuity during visual token selection. We find that general visual token pruning methods suffer from severe accuracy degradation on UI grounding tasks due to broken positional information. To address this, we introduce a PosPad strategy, which compresses each contiguous sequence of dropped visual tokens into a single special marker placed at the sequence\u2019s last index to preserve positional continuity. Comprehensive experiments on four grounding benchmarks demonstrate that FocusUI surpasses GUI-specific baselines. On the ScreenSpot-Pro benchmark, FocusUI-7B achieves a 3.7% performance improvement over GUI-Actor-7B. Even with only 30% visual token retention, the performance of FocusUI-7B drops by just 3.2%, while achieving up to 1.44x faster inference and 17% lower peak GPU memory.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37652", "url": null, "sourceid": 39079, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37656, "uid": "5a9ad32644ebce6e589f4be9634332b9", "name": "Adaptive Anisotropic Gaussian Splatting for Multi-contrast MRI Arbitrary-Scale Super-Resolution with Anatomy Guidance", "authors": [{"id": 181171, "fullname": "Qiuhai Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181171?format=json", "institution": "East China Normal University"}, {"id": 187960, "fullname": "Kang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187960?format=json", "institution": "East China Normal University; Jiangxi Normal University"}, {"id": 187961, "fullname": "Zhengjie Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187961?format=json", "institution": "East China Normal University"}, {"id": 187962, "fullname": "Tingting Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187962?format=json", "institution": "East China Normal University"}, {"id": 153307, "fullname": "Faming Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153307?format=json", "institution": "East China Normal University"}, {"id": 86004, "fullname": "Guixu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86004?format=json", "institution": "East China Normal University"}], "abstract": "Implicit neural representation (INR) based methods learn a continuous mapping from a low-resolution (LR) target magnetic resonance (MR) image and a high-resolution (HR) reference image to achieve arbitrary-scale super-resolution (SR). However, their inherent spectral bias favors learning low-frequency (LF) components, often failing to capture the sharp transitions at anatomical boundaries and resulting in the loss of high-frequency (HF) details. Inspired by 3D Gaussian splatting, we propose GaussM\u00b2ASR (Gaussian Multi-contrast MRI Arbitrary-scale Super-Resolution), which converts the challenging task of HF anatomical reconstruction into a smoother parameter optimization problem by learning the parameters of anisotropic 2D Gaussian kernels. To handle inter-contrast discrepancies, we introduce an anatomy-guided pipeline comprising three core modules: a Structure Prior Modulation Fusion (SPMF) module for feature enhancement; an Anatomy-Guided Dual-Domain Cross Attention (AG-DDCA) module for joint spatial-frequency modeling; and an Anatomy-Guided Gaussian Parametrizer (AGGP) that leverages gradient-based sparse attention to concentrate Gaussian centers on critical anatomical structures. Extensive experiments on multiple datasets demonstrate that GaussM\u00b2ASR surpasses state-of-the-art methods in recovering fine anatomical details. The source code will be made publicly available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37656", "url": null, "sourceid": 45351, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37658, "uid": "de785fbd9c75be03fbd0dcc93a638fae", "name": "LeapAlign: Post-training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories", "authors": [{"id": 154538, "fullname": "Zhanhao Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154538?format=json", "institution": "The Australian National University"}, {"id": 106920, "fullname": "Tao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/106920?format=json", "institution": "Xi&#x27;an JiaoTong University"}, {"id": 89921, "fullname": "Jie Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89921?format=json", "institution": "ByteDance Inc."}, {"id": 88202, "fullname": "Chengjian Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/88202?format=json", "institution": "Meituan Inc."}, {"id": 75496, "fullname": "Liang Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/75496?format=json", "institution": "Australian National University"}], "abstract": "This paper focuses on the alignment of flow-matching models with human preference. A promising way is fine-tuning by directly backpropagating reward signals through the differentiable generation process of flow matching. However, backpropagating through long trajectories results in prohibitive memory costs and gradient explosion. Therefore, direct-gradient methods struggle to update early generation steps, which are crucial for determining the global structure of the final image. To address this issue, we introduce LeapAlign, a fine-tuning method that reduces computational cost and enables direct gradient propagation from reward to early-step latents.Specifically, we shorten the long trajectory into only two steps by designing two consecutive leaps, each skipping multiple ODE sampling steps and predicting future latents in a single step. By randomizing the start and end timesteps of the leaps, LeapAlign leads to efficient and stable model updates at any generation step. To better use such shortened trajectories, we assign higher training weights to those that are more consistent with the long generation path. To further enhance gradient stability, we reduce the weights of terms with large gradients, instead of completely removing them as done in previous works. When fine-tuning the Flux model, LeapAlign consistently outperforms state-of-the-art GRPO-based and direct-gradient methods across various metrics, achieving superior image quality and image\u2013text alignment.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37658", "url": null, "sourceid": 31795, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37659, "uid": "04ff2896ed09b2e8d686c58dcb78cb83", "name": "AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction", "authors": [{"id": 187967, "fullname": "Hanyang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187967?format=json", "institution": "Ohio State University, Columbus"}, {"id": 75794, "fullname": "Rongjun Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/75794?format=json", "institution": "The Ohio State University"}], "abstract": "Recent advances in 4D scene reconstruction have greatly improved dynamic modeling across various domains. However, existing approaches remain limited under aerial conditions with single-view capture, wide spatial range, and dynamic objects of limited spatial footprint and large motion disparity. These challenges cause severe depth ambiguity and unstable motion estimation, making monocular aerial reconstruction inherently ill-posed.To this end, we present AeroDGS, a physics-guided 4D Gaussian splatting framework for monocular UAV videos. AeroDGS introduces a Monocular Geometry Lifting module that reconstructs reliable static and dynamic geometry from a single aerial sequence, providing a robust basis for dynamic estimation. To further resolve monocular ambiguity, we propose a Physics-Guided Optimization module that incorporates differentiable ground-support, upright-stability, and trajectory-smoothness priors, transforming ambiguous image cues into physically consistent motion.The framework jointly refines static backgrounds and dynamic entities with stable geometry and coherent temporal evolution. We additionally build a real-world UAV dataset that spans various altitudes and motion conditions to evaluate dynamic aerial reconstruction. Experiments on synthetic and real UAV scenes demonstrate that AeroDGS outperforms state-of-the-art methods, achieving superior reconstruction fidelity in dynamic aerial environments.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37659", "url": null, "sourceid": 34007, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37662, "uid": "60c8179fad5be6d39b73660ca24c8d65", "name": "MS-Temba: Multi-Scale Temporal Mamba for Understanding Long Untrimmed Videos", "authors": [{"id": 141646, "fullname": "Arkaprava Sinha", "url": "http://cvpr.thecvf.com/api/miniconf/users/141646?format=json", "institution": "University of North Carolina at Charlotte"}, {"id": 187971, "fullname": "Monish Raj", "url": "http://cvpr.thecvf.com/api/miniconf/users/187971?format=json", "institution": null}, {"id": 127740, "fullname": "Pu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127740?format=json", "institution": "University of North Carolina at Charlotte"}, {"id": 184949, "fullname": "Ahmed Helmy", "url": "http://cvpr.thecvf.com/api/miniconf/users/184949?format=json", "institution": "University of North Carolina at Charlotte"}, {"id": 76444, "fullname": "Hieu Le", "url": "http://cvpr.thecvf.com/api/miniconf/users/76444?format=json", "institution": "EPFL"}, {"id": 127579, "fullname": "Srijan Das", "url": "http://cvpr.thecvf.com/api/miniconf/users/127579?format=json", "institution": "University of North Carolina at Charlotte"}], "abstract": "Temporal Action Detection (TAD) in untrimmed videos poses significant challenges, particularly for Activities of Daily Living (ADL) requiring models to (1) process long-duration videos, (2) capture temporal variations in actions, and (3) simultaneously detect dense overlapping actions. Existing CNN and Transformer-based approaches, struggle to jointly capture fine-grained detail and long-range structure at scale. State-space Model (SSM) based Mamba offers powerful long-range modeling, but naive application to TAD collapses fine-grained temporal structure and fails to account for the challenges inherent to TAD. To this end, we propose Multi-Scale Temporal Mamba (MS-Temba), which extends Mamba to TAD with newly introduced dilated SSMs. Each Temba block, comprising dilated SSMs coupled with our proposed additional losses, enables the learning of discriminative representations across temporal scales. A lightweight Multi-scale Mamba Fuser then unifies these multi-scale features via SSM-based aggregation, yielding precise action-boundary localization. With only 17M parameters, MS-Temba achieves state-of-the-art performance on densely labeled ADL benchmarks TSU & Charades, and further generalizes to long-form video summarization, setting new state-of-the-art results on TVSum & SumMe.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37662", "url": null, "sourceid": 35368, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37665, "uid": "d0df036bf29fd4db6e3fadab6b314715", "name": "Vision-Oriented Lightweight Neural Architecture Search with Budget-Adaptive Evaluation", "authors": [{"id": 187974, "fullname": "Yi Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187974?format=json", "institution": null}, {"id": 157868, "fullname": "Yu-Bin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157868?format=json", "institution": "Nanjing University, China"}], "abstract": "In the deep-learning-based computer vision community, Neural Architecture Search (NAS) has become the de-facto tool for acquiring task-optimal network structures. Nevertheless, NAS methods are trapped in a fundamental accuracy-efficiency dilemma: training-based approaches deliver reliable performance but incur prohibitive search costs, whereas training-free strategies are ultra-fast but often yield relatively unreliable rankings. To reconcile this conflict, we propose a vision-oriented lightweight training-based NAS framework. We first design six micro vision tasks whose training time is negligible, yet together they probe a broad spectrum of representational capacities. Built upon these tasks, we introduce a budget-adaptive performance evaluator to produce the most accurate ranking attainable within the limit. Experiments on popular NAS benchmarks show that our method achieves a ranking correlation higher than existing methods. Furthermore, we construct a search space from prevalent neural blocks and run our method at a cost close to training-free methods; the discovered architecture surpasses the current state-of-the-art under identical training recipes. Our code will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37665", "url": null, "sourceid": 40796, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37668, "uid": "db20613d4556a700b9a55120254dfa1b", "name": "FVGen: Scaling 3D Scene Datasets with Certainty-Aware Free-View Generation from Scene Geometry Reconstruction", "authors": [{"id": 84983, "fullname": "Chenhan Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84983?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 187981, "fullname": "Yu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187981?format=json", "institution": null}, {"id": 180890, "fullname": "Qingwen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180890?format=json", "institution": "KTH Royal Institute of Technology"}, {"id": 86145, "fullname": "Jifei Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/86145?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 84723, "fullname": "Songcen Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84723?format=json", "institution": "Huawei Noah&#x27;s Ark Lab"}, {"id": 84988, "fullname": "Dit-Yan Yeung", "url": "http://cvpr.thecvf.com/api/miniconf/users/84988?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 74045, "fullname": "Jiankang Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/74045?format=json", "institution": "Imperial College London"}], "abstract": "The development of generalizable Novel View Synthesis (NVS) models is critically limited by the scarcity of large-scale training data with diverse and accurate camera trajectories. While real-world captures are photorealistic, they are typically sparse and discrete. Conversely, synthetic data scales but suffers from a domain gap and often lacks realistic semantics. We introduce FVGen, a novel framework that leverages the power of scene reconstruction to transform limited real-world image sequences into a scalable source of high-quality training data. Our key insight is that an imperfect reconstructed scene serves as a rich geometric proxy, but naively sampling from it amplifies artifacts. To this end, we propose a certainty-aware free-view sampling strategy that identifies novel viewpoints which are both semantically meaningful and minimally affected by reconstruction errors. We demonstrate FVGen's effectiveness by scaling up the training of feedforward NVS models, achieving a significant improvement of 2.6 dB on challenging out-of-distribution benchmarks. Furthermore, we show that the generated data can actively enhance per-scene 3D Gaussian Splatting optimization, leading to consistent improvements across multiple datasets. Our work provides a practical and powerful data generation engine to overcome a fundamental bottleneck in 3D vision.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37668", "url": null, "sourceid": 41075, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37669, "uid": "ed2224fd1d95ca6f6546b1d7750ad106", "name": "ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes", "authors": [{"id": 176305, "fullname": "Emily Steiner", "url": "http://cvpr.thecvf.com/api/miniconf/users/176305?format=json", "institution": "Stanford University"}, {"id": 151552, "fullname": "Jianhao Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/151552?format=json", "institution": "Stanford University"}, {"id": 140486, "fullname": "Henry Howard-Jenkins", "url": "http://cvpr.thecvf.com/api/miniconf/users/140486?format=json", "institution": "Facebook; Meta"}, {"id": 153447, "fullname": "Chris Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/153447?format=json", "institution": "Meta"}, {"id": 106415, "fullname": "Iro Armeni", "url": "http://cvpr.thecvf.com/api/miniconf/users/106415?format=json", "institution": "Stanford University"}], "abstract": "Indoor environments evolve as objects move, appear, or disappear. Capturing these dynamics requires maintaining consistent instance identities across intermittently captured 3D scans with unobserved change or, equivalently, performing 4D indoor semantic instance segmentation (SIS)---the joint task of segmenting, identifying, and temporally associating object instances. This setting poses a challenge for existing 3DSIS methods, which require a discrete matching step due to their lack of temporal reasoning, and 4D LiDAR approaches, which show limited performance due to their reliance on continuous temporal measurements that is uncommon in indoor environments. We propose ReScene4D, a novel method that adapts 3DSIS architectures for 4DSIS without needing dense observations. It explores temporal fusion strategies to share information across observations, demonstrating that this shared context not only enables consistent instance tracking but also improves standard 3DSIS quality. To rigorously evaluate this task, we define a new metric that extends mAP to reward temporal identity consistency. ReScene4D achieves state-of-the-art performance on the 3RScan dataset, establishing a new benchmark for understanding evolving indoor scenes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37669", "url": null, "sourceid": 43087, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37670, "uid": "867c499e76f186216e70ca0d85fb56c2", "name": "DextER: Language-driven Dexterous Grasp Generation with Embodied Reasoning", "authors": [{"id": 151964, "fullname": "Junha Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/151964?format=json", "institution": "POSTECH"}, {"id": 140610, "fullname": "Eunha Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/140610?format=json", "institution": "POSTECH"}, {"id": 86304, "fullname": "Minsu Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/86304?format=json", "institution": "POSTECH"}], "abstract": "Language-driven dexterous grasp generation requires the models to understand task semantics, 3D geometry, and complex hand-object interactions. While vision-language models have been applied to this problem, existing approaches directly map observations to grasp parameters without intermediate reasoning about physical interactions. We present DextER, Dexterous Grasp Generation with Embodied Reasoning, which introduces contact-based embodied reasoning for multi-finger manipulation. Our key insight is that predicting which hand links contact where on the object surface provides an embodiment-aware intermediate representation bridging task semantics with physical constraints. DextER autoregressively generates embodied contact tokens specifying which finger links contact where on the object surface, followed by grasp tokens encoding the hand configuration. On DexGYS, DextER achieves 67.14\\% success rate, outperforming state-of-the-art by 3.83\\%p with 96.4\\% improvement in intention alignment. We also demonstrate steerable generation through partial contact specification, providing fine-grained control over grasp synthesis.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37670", "url": null, "sourceid": 44851, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37673, "uid": "ff16d86ffe81cfaa7f6e627ed1679c1b", "name": "MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation", "authors": [{"id": 131537, "fullname": "Yiren Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/131537?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 187986, "fullname": "Cheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187986?format=json", "institution": "National University of Singapore"}, {"id": 73557, "fullname": "Mike Zheng Shou", "url": "http://cvpr.thecvf.com/api/miniconf/users/73557?format=json", "institution": "National University of Singapore"}], "abstract": "A hallmark of human intelligence is the ability to create complex artifacts through structured multi-step processes. Generating procedural tutorials with AI is a longstanding but challenging goal, facing three key obstacles: (1) scarcity of multi-task procedural datasets, (2) maintaining logical continuity and visual consistency between steps, and (3) generalizing across multiple domains. To address these challenges, we propose a multi-domain dataset covering 21 tasks with over 24,000 procedural sequences. Building upon this foundation, we introduce MakeAnything, a framework based on the diffusion transformer (DIT), which leverages fine-tuning to activate the in-context capabilities of DIT for generating consistent procedural sequences. We introduce asymmetric low-rank adaptation (LoRA) for image generation, which balances generalization capabilities and task-specific performance by freezing encoder parameters while adaptively tuning decoder layers. Additionally, our ReCraft model enables image-to-process generation through spatiotemporal consistency constraints, allowing static images to be decomposed into plausible creation sequences.  Extensive experiments demonstrate that MakeAnything surpasses existing methods, setting new performance benchmarks for procedural generation tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37673", "url": null, "sourceid": 41841, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37675, "uid": "98e91749e0199da4b939761492530d23", "name": "D2Dewarp: Dual Dimensions Geometric Representation Learning Based Document Image Dewarping", "authors": [{"id": 180933, "fullname": "Heng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180933?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 187990, "fullname": "Xiangping Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187990?format=json", "institution": "Harbin Institute of Technology"}, {"id": 86681, "fullname": "Qingcai Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/86681?format=json", "institution": "Harbin Institute of Technology (Shenzhen)"}], "abstract": "Document image dewarping remains a challenging task in the deep learning era. While existing methods have improved by leveraging text line awareness, they typically focus only on a single horizontal dimension. In this paper, we propose a fine-grained deformation perception model that focuses on $\\textbf{D}$ual $\\textbf{D}$imensions of document horizontal-vertical-lines to improve document $\\textbf{Dewarp}$ing called $\\textit{D2Dewarp}$. It can perceive distortion trends in different directions across document details. To combine the horizontal and vertical granularity features, an effective fusion module based on X and Y coordinate is designed to facilitate interaction and constraint between the two dimensions for feature complementarity. Due to the lack of annotated line features in current public dewarping datasets, we also propose an automatic fine-grained annotation method using public document texture images and automatic rendering engine to build a new large-scale distortion training dataset named $\\textit{DocDewarpHV}$. The code and dataset will be publicly released. On three public Chinese and English benchmarks, both quantitative and qualitative results show that our method achieves better rectification results compared with the state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37675", "url": null, "sourceid": 41453, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37677, "uid": "d37db81b770e9c7f95621cf281a50deb", "name": "From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs", "authors": [{"id": 180546, "fullname": "Mingrui Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180546?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 73247, "fullname": "Zhaozhi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73247?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 77183, "fullname": "Fangjinhua Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77183?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 129072, "fullname": "Jiaolong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129072?format=json", "institution": "Microsoft Research"}, {"id": 73915, "fullname": "Marc Pollefeys", "url": "http://cvpr.thecvf.com/api/miniconf/users/73915?format=json", "institution": "ETH Zurich / Microsoft"}, {"id": 85748, "fullname": "Tong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85748?format=json", "institution": "EPFL / University of Chinese Academy of Sciences"}], "abstract": "While Multimodal Large Language Models (MLLMs) have achieved impressive performance on semantic tasks, their spatial intelligence\u2014crucial for robust and grounded AI systems\u2014remains underdeveloped. Existing benchmarks fall short of diagnosing this limitation: they either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, constrained by the lack of outdoor datasets with verifiable metric ground truth. To bridge this gap, we introduce a large-scale benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions that span a hierarchical spectrum\u2014from qualitative relational reasoning to quantitative metric and kinematic understanding. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings. Further analysis using synthetic abnormal scenes and blinding tests confirms that current MLLMs depend heavily on linguistic priors instead of grounded visual reasoning. Our benchmark thus provides a principled platform for diagnosing these limitations and advancing physically grounded spatial intelligence.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37677", "url": null, "sourceid": 39896, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37678, "uid": "1a80837c8f34ef0c1bf5264b5370963f", "name": "Beyond Euclidean Gossip: KL-Barycentric Consensus on Heterogeneous and Imbalanced Images", "authors": [{"id": 180750, "fullname": "Lu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180750?format=json", "institution": "University of Hong Kong"}, {"id": 87563, "fullname": "Guosheng Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/87563?format=json", "institution": "The University of Hong Kong"}], "abstract": "Fully decentralized deep learning removes global servers and ensures local data privacy. However, Euclidean consensus, averaging weights, gradients or momentum, may degrade under non-i.i.d. data and client size imbalance. We propose a geometry-aware approach based on natural gradient variational inference. Clients communicate in the expectation parameter space of an exponential family, where simple linear mixing yields a forward KL barycenter consensus. The aggregate is the model closest to all client distributions, aligning updates across heterogeneous sites and mitigating distribution shift. We further provide a lightweight decentralized Adam implementation, in which each client maintains a diagonal-Gaussian posterior and both updates and gossips in the expectation space. We prove convergence for convex losses on connected graphs. On CIFAR-100 and a medical image segmentation benchmark, our method\\footnote{All code is included in the supplementary materials and will be publicly released.} substantially outperforms Euclidean-space consensus baselines under severe non-i.i.d. and client-imbalance cases, achieving around $20\\%$ accuracy gain on CIFAR-100, while matching the communication budget and improving training stability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37678", "url": null, "sourceid": 39536, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37691, "uid": "e6401fdedbe7b5eb6113a585d1502e73", "name": "Coded-E2LF: Coded Aperture Light Field Imaging from Events", "authors": [{"id": 179980, "fullname": "Tomoya Tsuchida", "url": "http://cvpr.thecvf.com/api/miniconf/users/179980?format=json", "institution": "Nagoya University"}, {"id": 131447, "fullname": "Keita Takahashi", "url": "http://cvpr.thecvf.com/api/miniconf/users/131447?format=json", "institution": "Nagoya University"}, {"id": 131436, "fullname": "Chihiro Tsutake", "url": "http://cvpr.thecvf.com/api/miniconf/users/131436?format=json", "institution": "Nagoya University"}, {"id": 131454, "fullname": "Toshiaki Fujii", "url": "http://cvpr.thecvf.com/api/miniconf/users/131454?format=json", "institution": "Nagoya University"}, {"id": 87505, "fullname": "Hajime Nagahara", "url": "http://cvpr.thecvf.com/api/miniconf/users/87505?format=json", "institution": "Osaka University"}], "abstract": "We propose Coded-E2LF (coded event to light field), a computational imaging method for acquiring a 4-D light field using a coded aperture and a stationary event-only camera. In a previous work, an imaging system similar to ours was adopted, but both events and intensity images were captured and used for light field reconstruction. In contrast, our method is purely event-based, which relaxes restrictions for hardware implementation. We also introduce several advancements from the previous work that enable us to theoretically support and practically improve light field reconstruction from events alone. In particular, we clarify the key role of a black pattern in aperture coding patterns. We finally implemented our method on real imaging hardware to demonstrate its effectiveness in capturing real 3-D scenes. To the best of our knowledge, we are the first to demonstrate that a 4-D light field with pixel-level accuracy can be reconstructed from events alone. Our software is included in the supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37691", "url": null, "sourceid": 41867, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37681, "uid": "b114906b1c37b67e17100bd6401d859c", "name": "Mimic Human Cognition, Master Multi-Image Reasoning: A Meta-Action Framework for Enhanced Visual Understanding", "authors": [{"id": 145665, "fullname": "Jianghao Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/145665?format=json", "institution": "East China Normal University"}, {"id": 184733, "fullname": "Qingbin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184733?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 181989, "fullname": "KUN SUN", "url": "http://cvpr.thecvf.com/api/miniconf/users/181989?format=json", "institution": "ByteDance Inc."}, {"id": 187999, "fullname": "Cheng Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/187999?format=json", "institution": "East China Normal University"}, {"id": 188000, "fullname": "Jie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188000?format=json", "institution": "ByteDance Inc."}, {"id": 89730, "fullname": "Qin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89730?format=json", "institution": "East China Normal University"}, {"id": 156872, "fullname": "Jie Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/156872?format=json", "institution": "East China Normal University"}, {"id": 188001, "fullname": "Nan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188001?format=json", "institution": "ByteDance Inc."}, {"id": 188002, "fullname": "Changqing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188002?format=json", "institution": null}, {"id": 188003, "fullname": "wupei wupei", "url": "http://cvpr.thecvf.com/api/miniconf/users/188003?format=json", "institution": "ByteDance Inc."}, {"id": 188004, "fullname": "Jian Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188004?format=json", "institution": null}, {"id": 187379, "fullname": "Zheming Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187379?format=json", "institution": "ByteDance Inc."}, {"id": 89725, "fullname": "Liang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/89725?format=json", "institution": "East China Normal University"}], "abstract": "While Multimodal Large Language Models (MLLMs) excel at single-image understanding, they exhibit significantly degraded performance in multi-image reasoning scenarios. Multi-image reasoning presents fundamental challenges including complex inter-relationships between images and scattered critical information across image sets. Inspired by human cognitive processes, we propose the Cognition-Inspired Meta-Action Framework (CINEMA), a novel approach that decomposes multi-image reasoning into five structured meta-actions: Global, Focus, Hint, Think, and Answer which explicitly modeling the sequential cognitive steps humans naturally employ.For cold-start training, we introduce a Retrieval-Based Tree Sampling strategy that generates high-quality meta-action trajectories to bootstrap the model with reasoning patterns. During reinforcement learning, we adopt a two-stage paradigm: an exploration phase with Diversity-Preserving Strategy to avoid entropy collapse, followed by an annealed exploitation phase with DAPO to gradually strengthen exploitation.To train our model, we construct a dataset of 57k cold-start and 58k reinforcement learning instances spanning multi-image, multi-frame, and single-image tasks. We conduct extensive evaluations on multi-image reasoning benchmarks, video understanding benchmarks, and single-image benchmarks, achieving competitive state-of-the-art performance on several key benchmarks. Our model surpasses GPT-4o on the MUIR and MVMath benchmarks and notably outperforms specialized video reasoning models on video understanding benchmarks, demonstrating the effectiveness and generalizability of our human cognition-inspired reasoning framework.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37681", "url": null, "sourceid": 32289, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37683, "uid": "d64e4fd92a28a9d7b691e34732d55cb3", "name": "Learning to Generate Highly Dynamic Videos using Synthetic Motion Data", "authors": [{"id": 130980, "fullname": "Wonjoon Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/130980?format=json", "institution": "Pohang University of Science and Technology"}, {"id": 182104, "fullname": "Jiyun Won", "url": "http://cvpr.thecvf.com/api/miniconf/users/182104?format=json", "institution": "POSTECH"}, {"id": 155790, "fullname": "Janghyeok Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/155790?format=json", "institution": "Pohang University of Science and Technology"}, {"id": 85511, "fullname": "Qi Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/85511?format=json", "institution": "Microsoft Research Asia"}, {"id": 86583, "fullname": "Chong Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/86583?format=json", "institution": "Microsoft Research Asia"}, {"id": 72991, "fullname": "Seung-Hwan Baek", "url": "http://cvpr.thecvf.com/api/miniconf/users/72991?format=json", "institution": "POSTECH"}, {"id": 88380, "fullname": "Sunghyun Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/88380?format=json", "institution": "POSTECH"}], "abstract": "Despite recent progress, video diffusion models still struggle to synthesize realistic videos involving highly dynamic motions or requiring fine-grained motion controllability. A central limitation lies in the scarcity of such examples in commonly used training datasets. To address this, we introduce DynaVid, a video synthesis framework that leverages synthetic motion data in training, which is represented as optical flow and rendered using computer graphics pipelines. This approach offers two key advantages. First, synthetic motion offers diverse motion patterns and precise control signals that are difficult to obtain from real data. Second, unlike rendered videos with artificial appearances, rendered optical flow encodes only motion and is decoupled from appearance, thereby preventing models from reproducing the unnatural look of synthetic videos. Building on this idea, DynaVid adopts a two-stage generation framework: a motion generator first synthesizes motion, and then a motion-guided video generator produces video frames conditioned on that motion. This decoupled formulation enables the model to learn dynamic motion patterns from synthetic data while preserving visual realism from real-world videos. We validate our framework on two challenging scenarios, vigorous human motion generation and extreme camera motion control, where existing datasets are particularly limited. Extensive experiments demonstrate that DynaVid improves the realism and controllability in dynamic motion generation and camera motion control. Codes and datasets will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37683", "url": null, "sourceid": 35578, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37684, "uid": "a0c5216bf6f3fb7bcc2200e78618c778", "name": "ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement", "authors": [{"id": 152097, "fullname": "Zhihang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152097?format=json", "institution": "University of Science and Technology of China"}, {"id": 103229, "fullname": "Xiaoyi Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/103229?format=json", "institution": "CASIA"}, {"id": 155988, "fullname": "Pandeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155988?format=json", "institution": "University of Science and Technology of China"}, {"id": 146339, "fullname": "Junjie Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/146339?format=json", "institution": "Nanjing University"}, {"id": 106510, "fullname": "Zhaohe Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/106510?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 155624, "fullname": "Yefei He", "url": "http://cvpr.thecvf.com/api/miniconf/users/155624?format=json", "institution": "Zhejiang University"}, {"id": 130001, "fullname": "Kaixun Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130001?format=json", "institution": "Fudan University"}, {"id": 85359, "fullname": "Chen-Wei Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/85359?format=json", "institution": "Alibaba Group"}, {"id": 85406, "fullname": "Yun Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/85406?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 85497, "fullname": "Hongtao Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/85497?format=json", "institution": "University of Science and Technology of China"}], "abstract": "While existing generation and unified models excel at general image generation, they struggle with tasks requiring deep reasoning, planning, and precise data-to-visual mapping abilities beyond general scenarios. To push beyond the existing limitations, we introduce a new and challenging task: creative table visualization, requiring the model to generate an infographic that faithfully and aesthetically visualizes the data from a given table. To address this challenge, we propose ShowTable, a pipeline that synergizes MLLMs with diffusion models via a progressive self-correcting process. The MLLM acts as the central orchestrator for reasoning the visual plan and judging visual errors to provide refined instructions, the diffusion execute the commands from MLLM, achieving high-fidelity results. To support this task and our pipeline, we introduce three automated data construction pipelines for training different modules. Furthermore, we introduce TableVisBench, a new benchmark with 800 challenging instances across 5 evaluation dimensions, to assess performance on this task. Experiments demonstrate that our pipeline, instantiated with different models, significantly outperforms baselines, highlighting its effective multi-modal reasoning, generation, and error correction capabilities.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37684", "url": null, "sourceid": 34326, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37686, "uid": "55b41404d256c30aeee0e2c554dc43f6", "name": "R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection", "authors": [{"id": 128607, "fullname": "Zhongyu Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/128607?format=json", "institution": "Peking University"}, {"id": 188012, "fullname": "Yousen Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188012?format=json", "institution": "Tongji University"}, {"id": 86049, "fullname": "Yongtao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86049?format=json", "institution": "Peking University"}, {"id": 188013, "fullname": "Zhifeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188013?format=json", "institution": "Ebtech Inc."}, {"id": 186365, "fullname": "Weijun Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186365?format=json", "institution": "EbTech Co. Ltd."}], "abstract": "4D radar\u2013camera sensing configuration has gained increasing importance in autonomous driving. However, existing 3D object detection methods that fuse 4D Radar and camera data confront several challenges. First, their absolute depth estimation module is not robust and accurate enough, leading to inaccurate 3D localization. Second, the performance of their temporal fusion module will degrade dramatically or even fail when the ego vehicle's pose is missing or inaccurate. Third, for some small objects, the sparse radar point clouds may completely fail to reflect from their surfaces. In such cases, detection must rely solely on visual unimodal priors. To address these limitations, we propose R4Det, which enhances depth estimation quality via the Panoramic Depth Fusion module, enabling mutual reinforcement between absolute and relative depth. For temporal fusion, we design a Deformable Gated Temporal Fusion module that does not rely on the ego vehicle's pose. In addition, we built an Instance-Guided Dynamic Refinement module that extracts semantic prototypes from 2D instance guidance. Experiments show that R4Det achieves state-of-the-art 3D object detection results on the TJ4DRadSet and VoD datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37686", "url": null, "sourceid": 36834, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37693, "uid": "99f463038fbcf182d7988dbb7474e2e0", "name": "Color When It Counts: Grayscale-Guided Online Triggering for Always-On Streaming Video Sensing", "authors": [{"id": 188028, "fullname": "Weitong Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/188028?format=json", "institution": "Queen Mary University of London"}, {"id": 188029, "fullname": "Hang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188029?format=json", "institution": "Cornell University"}, {"id": 188030, "fullname": "Yukai Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188030?format=json", "institution": "Adecco"}, {"id": 188031, "fullname": "Shitong Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/188031?format=json", "institution": "Queen Mary University of London"}, {"id": 74045, "fullname": "Jiankang Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/74045?format=json", "institution": "Imperial College London"}, {"id": 84723, "fullname": "Songcen Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84723?format=json", "institution": "Huawei Noah&#x27;s Ark Lab"}, {"id": 86145, "fullname": "Jifei Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/86145?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 128891, "fullname": "Zhensong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128891?format=json", "institution": "Huawei Noah&#x27;s Ark Lab"}], "abstract": "Always-on sensing is essential for next-generation edge/wearable AI systems, yet continuous high-fidelity RGB video capture remains prohibitively expensive for resource-constrained mobile and edge platforms. We present a new paradigm for efficient streaming video understanding: grayscale-always, color-on-demand. Through preliminary studies, we discover that color is not always necessary. Sparse RGB frames suffice for comparable performance when temporal structure is preserved via continuous grayscale streams. Building on this insight, we propose ColorTrigger, an online training-free trigger that selectively activates color capture based on windowed grayscale affinity analysis. Designed for real-time edge deployment, ColorTrigger uses lightweight quadratic programming to detect chromatic redundancy causally, coupled with credit-budgeted control and dynamic token routing to jointly reduce sensing and inference costs. On streaming video understanding benchmarks, ColorTrigger achieves 91.6% of full-color baseline performance while using only 8.1% RGB frames, demonstrating substantial color redundancy in natural videos and enabling practical always-on video sensing on resource-constrained devices.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37693", "url": null, "sourceid": 34702, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37697, "uid": "1961d93172f8088a077c52e638e31f41", "name": "From Corners to Fiducial Tags: Revisiting Checkerboard Calibration for Event Cameras", "authors": [{"id": 177144, "fullname": "Taehun Ryu", "url": "http://cvpr.thecvf.com/api/miniconf/users/177144?format=json", "institution": "Ulsan National Institute of Science and Technology (UNIST)"}, {"id": 71827, "fullname": "Changwoo Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/71827?format=json", "institution": "UNIST"}, {"id": 77309, "fullname": "Kyungdon Joo", "url": "http://cvpr.thecvf.com/api/miniconf/users/77309?format=json", "institution": "Ulsan National Institute of Science and Technology"}], "abstract": "The conventional checkerboard-based calibration for standard cameras faces fundamental limitations when applied to bio-inspired event cameras. Specifically, this stems from two challenges: (i) Events are triggered asynchronously at different timestamps along motion trajectories. If we accumulate them directly on the image plane, it causes temporal misalignment and produces blurred edges. Directly accumulating them on the image plane causes temporal misalignment and produces blurred edges. (ii) Checkerboard corners on event cameras show near-zero event occurrence at the corner itself. This hinders reliable corner localization and makes calibration difficult. To address these issues, we present a novel calibration framework that directly detects checkerboard corners from a raw event stream. We first mathematically analyze the absence of events at corner points. Based on this fact, we then leverage edge-driven event cues to initialize corner positions. Using the near-zero event occurrence at checkerboard corners, we gradually refine the estimated corner toward low event-density regions, achieving sub-pixel accuracy. Furthermore, we extend the corner detection to fiducial markers such as AprilTags, resulting in reliable detection even under partial visibility or occlusion. Evaluations on self-collected and public data demonstrate reliable checkerboard corner detection and stable camera calibration.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37697", "url": null, "sourceid": 39205, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37700, "uid": "d071aa99b2e94835d08dcae55ae2d128", "name": "Cross-View Distillation and Adaptive Masking for Incomplete Multi-View Multi-Label Classification", "authors": [{"id": 183330, "fullname": "Yadong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183330?format=json", "institution": "Harbin Institute of Technology"}, {"id": 188039, "fullname": "Qiaoqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188039?format=json", "institution": "Beijing Jiaotong University"}, {"id": 181439, "fullname": "Yueying Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181439?format=json", "institution": "Shanghai University"}, {"id": 90259, "fullname": "Lunke Fei", "url": "http://cvpr.thecvf.com/api/miniconf/users/90259?format=json", "institution": "Guangdong University of Technology"}, {"id": 76232, "fullname": "Jie Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76232?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}], "abstract": "While existing incomplete multi-view multi-label learning methods have achieved promising performance, few studies have focused on the issue of multi-view imbalance. Existing methods using gradient modulation or alternating optimization strategies alleviate this problem but often oversimplify the interaction between views, resulting in persistently performance. In response to the challenge, we propose the Cross-view Distillation and Adaptive Masking (CDAM) framework, a novel approach designed to achieve balanced multi-view optimization for the challenging double incomplete multi-view multi-label learning tasks. First, to overcome the performance bottleneck of views, we design a cross-view distillation module. This module aligns low-quality student representations with high-quality teacher representations, thereby effectively mitigating the multi-view imbalance problem. Second, recognizing that distillation may not rectify all low-quality views, we introduce a subsequent adaptive masking module to perform an explicit quality assessment. This module dynamically identifies and masks out any remaining unreliable representations before multi-view fusion, thus preventing low-quality information from corrupting the fused representation. Extensive comparisons with nine state-of-the-art methods on six datasets validate the effectiveness and stability of our method.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37700", "url": null, "sourceid": 45937, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37704, "uid": "2c86a217e06d86e3db130723abd90fff", "name": "StyleDoctor: Towards Specialist Reward Model for Style-centric Generation Tasks", "authors": [{"id": 188056, "fullname": "Xilin He", "url": "http://cvpr.thecvf.com/api/miniconf/users/188056?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 184830, "fullname": "Xiaole Xian", "url": "http://cvpr.thecvf.com/api/miniconf/users/184830?format=json", "institution": "Shenzhen University"}, {"id": 95127, "fullname": "Xiangyu Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/95127?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 76698, "fullname": "Muhammad Haris Khan", "url": "http://cvpr.thecvf.com/api/miniconf/users/76698?format=json", "institution": "Mohamed Bin Zayed University of Artificial Intelligence"}], "abstract": "Style generation has made significant progress through diffusion models. Recent efforts have explored reinforcement learning with human-preference reward models to enhance diffusion models for general downstream applications. However, we identify a critical limitation: existing human-preference reward models struggle to effectively perceive image style, resulting in suboptimal performance after reinforcement fine-tuning. To address this, we first introduce a large-scale style reward modeling dataset comprising 400K paired samples spanning 1,000 diverse style categories, augmented with textual instructions and style reward annotations.We then propose StyleDoctor, a novel style perception reward model capable of jointly evaluating style consistency between paired images and style-text alignment. StyleDoctor outperforms existing style perception models in both style retrieval and generation tasks. Extensive quantitative and qualitative experiments demonstrate the superiority of StyleDoctor over competing approaches, showcasing its efficiency and versatility in style-conditioned generation. Our dataset and code will be made public upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37704", "url": null, "sourceid": 31820, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37705, "uid": "51f7fb04609d5775e4fb7621c753f3bc", "name": "B\u00e9zier Degradation Modeling for LiDAR-based Human Motion Capture", "authors": [{"id": 180562, "fullname": "Xiaoqi An", "url": "http://cvpr.thecvf.com/api/miniconf/users/180562?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 126389, "fullname": "Lin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/126389?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 127403, "fullname": "Jun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/127403?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 84750, "fullname": "Chen Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/84750?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 85000, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85000?format=json", "institution": "Nanjing University of Science and Technology"}], "abstract": "LiDAR-based 3D human motion capture has broad applications in fields such as autonomous driving and robotics, where accurate motion reconstruction is crucial. However, existing methods often struggle with unstable inputs and severe occlusions, leading to jittery or even failed pose predictions. To address these challenges, we propose BMLiCap, a coarse-to-fine framework that models motion using temporally compressible B\u00e9zier curves. By reducing control points through a trajectory-preserving strategy, we obtain a coherent and learning-friendly motion representation. To reconstruct human actions from LiDAR point-cloud cues, we design a progressive motion-reconstruction module. Specifically, a Time-scale Motion Transformer (TMT) is introduced to predict motion curves at multiple temporal scales, and a Multi-level Motion Aggregator (MMA) is utilized to adaptively fuse the multi-scale curves to recover detailed, temporally coherent poses, effectively bridging observation gaps caused by occlusions and noise. Across four mainstream benchmarks LiDARHuman26M, FreeMotion, NoiseMotion, and SLOPER4D, BMLiCap achieves state-of-the-art accuracy and temporal continuity in complex scenes, demonstrating its ability to compensate for severe occlusions and reduce prediction jitter.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37705", "url": null, "sourceid": 45839, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37711, "uid": "0098df6528e402d2ee740ec568c27967", "name": "Can We Build Scene Graphs, Not Classify Them? FlowSG: Progressive Image-Conditioned Scene Graph Generation with Flow Matching", "authors": [{"id": 181638, "fullname": "Xin Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181638?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 186804, "fullname": "Ke Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186804?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 144191, "fullname": "Wen Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/144191?format=json", "institution": "UESTC (University of Electronic Science and Technology of China)"}, {"id": 152754, "fullname": "Yuan-Fang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152754?format=json", "institution": "Monash University"}, {"id": 152147, "fullname": "Ming Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152147?format=json", "institution": "Guangming Laboratory"}, {"id": 156250, "fullname": "Tao He", "url": "http://cvpr.thecvf.com/api/miniconf/users/156250?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Scene Graph Generation (SGG) unifies object localization and visual relationship reasoning by predicting boxes and subject\u2013predicate\u2013object triples. Yet most pipelines treat SGG as a one-shot, deterministic classification instead of a genuine {progressive, generative} task. We propose \\textbf{FlowSG}, which recasts SGG as continuous-time transport on a {hybrid} discrete\u2013continuous state: starting from a {noised graph}, the model progressively grows an image-conditioned scene graph through constraint-aware refinements that jointly synthesize nodes (objects) and edges (predicates). Specifically, we first leverage a VQ-VAE to quantize a {scene graph} (e.g., the continuous visual features) into compact, predictable tokens; a graph Transformer then (i) predicts a conditional velocity field to transport continuous geometry (boxes) and (ii) updates discrete posteriors for {categorical tokens (object features and predicate labels)}, coupling semantics and geometry via flow-conditioned message aggregation. Training combines flow-matching losses for geometry with a discrete-flow objective for tokens, yielding few-step inference and plug-and-play compatibility with standard detectors/segmenters. Extensive experiments on VG and PSG under closed- and open-vocabulary protocols show consistent gains in predicate R/mR and graph-level metrics, validating the mixed discrete\u2013continuous generative formulation over one-shot classification baselines, e.g.,  an average improvement of about 3  points over the SOTA USG-Par.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37711", "url": null, "sourceid": 40614, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37709, "uid": "64f9b199132a3f597af54f875ba0078d", "name": "MVInverse: Feed-forward Multi-view Inverse Rendering in Seconds", "authors": [{"id": 157845, "fullname": "Xiangzuo Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157845?format=json", "institution": "Tsinghua University"}, {"id": 100217, "fullname": "Chengwei Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/100217?format=json", "institution": "Tsinghua University"}, {"id": 152833, "fullname": "Jun Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/152833?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 90909, "fullname": "Xiu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/90909?format=json", "institution": "Tsinghua University"}, {"id": 77389, "fullname": "Yuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/77389?format=json", "institution": "The University of Hong Kong"}], "abstract": "Multi-view inverse rendering aims to recover geometry, materials, and illumination consistently across multiple viewpoints. Existing single-view approaches often ignore cross-view relationships, leading to inconsistent results, while multi-view optimization methods rely on slow differentiable rendering and per-scene refinement, making them computationally expensive and hard to scale. To address these limitations, we introduce a feed-forward multi-view inverse rendering framework that directly predicts spatially varying albedo, metallicity, roughness, diffuse shading, and surface normals from sequences of RGB images. By alternating attention across views, our model captures both intra-view long-range lighting interactions and inter-view material consistency, enabling coherent scene-level reasoning within a single forward pass. Due to the scarcity of real-world training data, models trained on existing synthetic datasets often struggle to generalize to real-world scenes. To overcome this limitation, we propose a consistency-based finetuning strategy that leverages unlabeled real-world videos to enhance both multi-view coherence and robustness under in-the-wild conditions. Extensive experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance in terms of multi-view consistency, material and normal estimation quality, and generalization to real-world imagery.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37709", "url": null, "sourceid": 34657, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37708, "uid": "84ea3a7910c3ca971d4718adf0f707f2", "name": "See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning", "authors": [{"id": 181514, "fullname": "Shuoshuo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181514?format=json", "institution": "Tsinghua University"}, {"id": 188060, "fullname": "Yizhen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188060?format=json", "institution": "Tencent; Tsinghua University, Tsinghua University; Microsoft"}, {"id": 139197, "fullname": "JINGJING FU", "url": "http://cvpr.thecvf.com/api/miniconf/users/139197?format=json", "institution": null}, {"id": 188061, "fullname": "Lei Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/188061?format=json", "institution": "Microsoft"}, {"id": 154740, "fullname": "Jiang Bian", "url": "http://cvpr.thecvf.com/api/miniconf/users/154740?format=json", "institution": "Microsoft Research"}, {"id": 73271, "fullname": "Yujiu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73271?format=json", "institution": "Tsinghua University"}, {"id": 188062, "fullname": "Rui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188062?format=json", "institution": "Microsoft"}], "abstract": "Large vision\u2013language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose $\\textbf{Bi}$-directional $\\textbf{P}$erceptual $\\textbf{S}$haping ($\\textbf{BiPS}$), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2\\% on average and shows strong out-of-domain generalization to unseen datasets and image types.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37708", "url": "https://github.com/zss02/BiPS", "sourceid": 46596, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37713, "uid": "b8d1200c2569eb9ce9c29e1698dbc84e", "name": "JUMP-Hand: Learning Joint-wise Uncertainty to Gate Mixture of View Experts for Multi-View 3D Hand Reconstruction", "authors": [{"id": 184235, "fullname": "Haohong Kuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184235?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 89746, "fullname": "Yang Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/89746?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 174381, "fullname": "Changlong Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174381?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 89720, "fullname": "Jinghong Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/89720?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 188070, "fullname": "Hang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188070?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 146758, "fullname": "Ran Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/146758?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 74017, "fullname": "Zhiguo Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/74017?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 89174, "fullname": "Joey Tianyi Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/89174?format=json", "institution": "National University of Singapore "}], "abstract": "In this paper, JUMP-Hand is proposed as a novel method for multi-view 3D hand reconstruction, which is the first to introduce probabilistic joint-wise uncertainty as an explicit gating mechanism to fuse multi-view information.Existing approaches usually fuse multi-view information by na\u00efve pooling or implicit attention.However, they overlook that each hand joint exhibits varying visibility and reliability across views, which may degrade performance by indiscriminately aggregating noisy or unreliable information.For instance, one joint may be clearly visible in one view, while another joint is occluded in that view but visible in a different view.In contrast, JUMP-Hand addresses this by introducing the core insight of Mixture of Experts (MoE) and regard each 2D view as an expert.The key idea is that the reliability of each view expert is quantified through joint-wise uncertainty modeling, serving as a explicit gating signal to route experts' partial yet complementary clues for each joint in a coarse-to-fine reconstruction paradigm.In this design, uncertainty not only guides the uncertainty-aware triangulation for reliable 3D hand initialization during coarse stage, but also acts as a gating signal during refinement stage to adaptively aggregate multi-scale features from different view experts on a joint-wise basis, enabling robust 3D hand reconstruction.Extensive experiments on DexYCB-MV, HO3D-MV, and OakInk-MV demonstrate that our method achieves state-of-the-art results, validating the effectiveness of the proposed method with joint-wise uncertainty gating for reliable 3D hand reconstruction.The code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37713", "url": null, "sourceid": 31057, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37716, "uid": "07bfe49721e230f6699703eb9d4128d8", "name": "Motion-Aware Animatable Gaussian Avatars Deblurring", "authors": [{"id": 87527, "fullname": "Muyao Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87527?format=json", "institution": "The University of Tokyo"}, {"id": 155849, "fullname": "Yifan Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/155849?format=json", "institution": "University of Tokyo"}, {"id": 188078, "fullname": "Qingtian Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188078?format=json", "institution": "The University of Tokyo"}, {"id": 188079, "fullname": "Zhuoxiao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188079?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 188080, "fullname": "Wei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188080?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 126700, "fullname": "Zhihang Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/126700?format=json", "institution": "Shanghai AI Lab"}, {"id": 186859, "fullname": "Xiao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/186859?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 73929, "fullname": "Yinqiang Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/73929?format=json", "institution": "The University of Tokyo"}], "abstract": "The creation of 3D human avatars from multi-view videos is a significant yet challenging task in computer vision. However, existing techniques rely on high-quality, sharp images as input, which are often impractical to obtain in real-world scenarios due to variations in human motion speed and intensity. This paper introduces a novel method for directly reconstructing sharp 3D human Gaussian avatars from blurry videos. The proposed approach incorporates a 3D-aware, physics-based model of blur formation caused by human motion, together with a 3D human motion model designed to resolve ambiguities in motion-induced blur. This framework enables the joint optimization of the avatar representation and motion parameters from a coarse initialization. Comprehensive benchmarks are established using both a synthetic dataset and a real-world dataset captured with a 360-degree synchronous hybrid-exposure camera system. Extensive evaluations demonstrate the effectiveness and robustness of the model across diverse conditions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37716", "url": null, "sourceid": 41279, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37718, "uid": "1a1d4f57954c3c7f143cbaf3c9996ac4", "name": "3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes  from a Single Image", "authors": [{"id": 69551, "fullname": "Ze-Xin Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/69551?format=json", "institution": "Nankai University"}, {"id": 154826, "fullname": "Liu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154826?format=json", "institution": "Horizon Robotics"}, {"id": 159473, "fullname": "Xinjie wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/159473?format=json", "institution": "Horizon Robotics"}, {"id": 166700, "fullname": "Wei Sui", "url": "http://cvpr.thecvf.com/api/miniconf/users/166700?format=json", "institution": "D-Robotics, China"}, {"id": 164010, "fullname": "Zhizhong Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/164010?format=json", "institution": "Horizon Robotics"}, {"id": 86573, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86573?format=json", "institution": "Nankai University"}, {"id": 137969, "fullname": "Jin Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/137969?format=json", "institution": "Nanjing University"}], "abstract": "We introduce 3D-Fixer, a novel generalizable and efficient scheme for single-image to compositional 3D scene generation. Unlike existing feed-forward frameworks that lack generalization ability in open-set scenarios due to the limited dataset, or divide-and-conquer frameworks that suffer from slow inference or accumulated registration errors during layout alignment, 3D-Fixer extends pre-trained object-level 3D generation priors to perform in-place completion on the single-view estimated geometry, eliminating the need for pose alignment while preserving feed-forward efficiency. At its core, 3D-Fixer introduces a coarse-to-fine scheme to accurately determine the completion boundary and generate high quality completion 3D asset based on the single-view estimated fragmented geometry. Also, we design a dual-branch conditioning network that integrates 2D and 3D contextual information to guide the pre-trained object generation priors for in-place completion. Furthermore, we introduce the Occlusion-Robust Feature Alignment strategy, which employs feature distillation to stabilize the training of the generative priors under occlusion scenarios. Existing scene-level dataset, either suffering from limited scale or lacking accurate per-instance ground truth, severely restricting the development of scene generation approaches. Therefore, we constructed the large-scale scene-level dataset, featuring over 110K diverse scenes and 3M images with complete 3D asset ground truth and accurate placement annotation. Experiments demonstrate that 3D-Fixer achieves state-of-the-art geometric accuracy while maintaining an inference speed comparable  to feed-forward estimation methods, vastly outperforming iterative optimization approaches.  Our dataset and trained models will be publicly available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37718", "url": null, "sourceid": 35954, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37723, "uid": "55dbee9943100076f718829ec0359185", "name": "FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction", "authors": [{"id": 188091, "fullname": "Runqi Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188091?format=json", "institution": "University of Sydney"}, {"id": 188092, "fullname": "Alasdair Paren", "url": "http://cvpr.thecvf.com/api/miniconf/users/188092?format=json", "institution": "Chalmers University of Technology"}, {"id": 188093, "fullname": "Suqin Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188093?format=json", "institution": "The University of Sydney"}, {"id": 182403, "fullname": "Muyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182403?format=json", "institution": "The University of Sydney"}, {"id": 75532, "fullname": "Philip H.S. Torr", "url": "http://cvpr.thecvf.com/api/miniconf/users/75532?format=json", "institution": "University of Oxford"}, {"id": 73798, "fullname": "Adel Bibi", "url": "http://cvpr.thecvf.com/api/miniconf/users/73798?format=json", "institution": "University of Oxford"}, {"id": 84796, "fullname": "Tongliang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84796?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}], "abstract": "The integration of new modalities enhances the capabilities of multimodal large language models (MLLMs) but also introduces additional vulnerabilities. In particular, simple visual jailbreaking attacks can manipulate open-source MLLMs more readily than sophisticated textual attacks. However, these underdeveloped attacks exhibit extremely limited cross-model transferability, failing to reliably identify vulnerabilities in closed-source MLLMs. In this work, we analyse the loss landscape of these jailbreaking attacks and find that the generated attacks tend to reside in high-sharpness regions, whose effectiveness is highly sensitive to even minor parameter changes during transfer. To further explain the high-sharpness localisations, we analyse their feature representations in both the intermediate layers and the spectral domain, revealing an improper reliance on narrow layer representations and semantically poor frequency components. Building on this, we propose a Feature Over-Reliance CorrEction (FORCE) method, which guides the attack to explore broader feasible regions across layer features and rescales the influence of frequency features according to their semantic content. By eliminating non-generalizable reliance on both layer and spectral features, our method discovers flattened feasible regions for visual jailbreaking attacks, thereby improving cross-model transferability. Extensive experiments demonstrate that our approach effectively facilitates visual red-teaming evaluations against closed-source MLLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37723", "url": null, "sourceid": 43052, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37719, "uid": "63bad29d7bd81c84cfee6ab8e3890545", "name": "Recovering Physically Plausible Human-Object Interactions from Monocular Videos", "authors": [{"id": 156474, "fullname": "Dingbang Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156474?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 158598, "fullname": "Etienne Vouga", "url": "http://cvpr.thecvf.com/api/miniconf/users/158598?format=json", "institution": "University of Texas at Austin"}, {"id": 91087, "fullname": "Qixing Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91087?format=json", "institution": "University of Texas at Austin"}, {"id": 128201, "fullname": "Georgios Pavlakos", "url": "http://cvpr.thecvf.com/api/miniconf/users/128201?format=json", "institution": "University of Texas at Austin"}], "abstract": "In this paper, we present a method to reconstruct physically plausible human-object interactions (HOI) from monocular videos. While existing kinematic-based approaches produce visually plausible motion, they often result in physical artifacts such as interpenetration and object floating. To overcome these issues, we introduce a physics-guided reconstruction framework that begins with a kinematic estimate and then refines it through a reinforcement learning (RL) policy trained to reproduce the interaction in a physics simulator. Because kinematic estimates are typically noisy, naive RL training can fail. Therefore, we propose an adaptive sampling strategy with a dual self-updating mechanism that automatically identifies the frames with the most informative and reliable kinematic reconstruction. Our process progressively improves reconstruction quality and yields physically consistent HOI sequences. We demonstrate our approach on two standard benchmarks and achieve clear improvements in physical plausibility metrics over state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37719", "url": null, "sourceid": 38045, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37725, "uid": "d4b8b3b9bc3b112cc8333458d7cbbf19", "name": "Bias at the End of the Score", "authors": [{"id": 91089, "fullname": "Salma Abdel Magid", "url": "http://cvpr.thecvf.com/api/miniconf/users/91089?format=json", "institution": "Harvard University"}, {"id": 188100, "fullname": "Grace Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188100?format=json", "institution": "Harvard University, Harvard University"}, {"id": 187320, "fullname": "Esin Tureci", "url": "http://cvpr.thecvf.com/api/miniconf/users/187320?format=json", "institution": "Princeton University"}, {"id": 133590, "fullname": "Amaya Dharmasiri", "url": "http://cvpr.thecvf.com/api/miniconf/users/133590?format=json", "institution": "Princeton University"}, {"id": 75606, "fullname": "Vikram V. Ramaswamy", "url": "http://cvpr.thecvf.com/api/miniconf/users/75606?format=json", "institution": "Princeton University"}, {"id": 89796, "fullname": "Hanspeter Pfister", "url": "http://cvpr.thecvf.com/api/miniconf/users/89796?format=json", "institution": "Harvard University"}, {"id": 150975, "fullname": "Olga Russakovsky", "url": "http://cvpr.thecvf.com/api/miniconf/users/150975?format=json", "institution": "Princeton University"}], "abstract": "Reward models (RMs) are inherently non-neutral value functions  designed and trained to encode specific objectives, such as human preferences or text-image alignment. RMs have become crucial components of text-to-image (T2I) generation systems where they are used during pretraining, finetuning of models and test-time optimization and post-generation safety and quality filtering of T2I outputs.  While specific problems with the integration of RMs into the T2I pipeline have been studied (e.g. reward hacking or mode collapse during training), their robustness and fairness as scoring functions remains largely unknown. We conduct a large-scale audit of RMs' robustness with respect to demographic biases during T2I model training and generation. We provide quantitative and qualitative evidence that while originally developed as quality measures, RMs encode demographic biases, which cause reward-guided optimization to sexualize female images (especially darker-skinned females), reinforce gender/racial stereotypes, and collapse demographic diversity. These findings highlight the shortcomings of current RMs, challenging their reliability as quality metrics and underscoring the critical need for alternative data collection, training, and optimization procedures to establish more robust scoring.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37725", "url": null, "sourceid": 35882, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37727, "uid": "fc2b7a8a346f15fe5c587c6e1242038d", "name": "Decompose, Mix, Adapt: A Unified Framework for Parameter-Efficient Neural Network Recombination and Compression", "authors": [{"id": 134906, "fullname": "Nazia Tasnim", "url": "http://cvpr.thecvf.com/api/miniconf/users/134906?format=json", "institution": "Boston University"}, {"id": 188107, "fullname": "Shrimai Prabhumoye", "url": "http://cvpr.thecvf.com/api/miniconf/users/188107?format=json", "institution": "Boston University, Boston University; NVIDIA"}, {"id": 77007, "fullname": "Bryan A. Plummer", "url": "http://cvpr.thecvf.com/api/miniconf/users/77007?format=json", "institution": "Boston University"}], "abstract": "Parameter Recombination (PR) methods aim to efficiently compose the weights of a neural network, and encompasses tasks like Parameter-Efficient FineTuning (PEFT) and Model Compression (MC), among others. Most methods typically focus on one application of PR, which can make composing them challenging.  For example, when deploying a large model you may wish to compress the  model and also quickly adapt to new settings.  However, PEFT methods often can still contain millions of parameters. This may be small compared to the original model size, but can be problematic in resource constrained deployments like edge devices, where they take a larger portion of the compressed model's parameters.  To address this, we present Coefficient-gated weight Recombination by Interpolated Shared basis Projections (\\method{}), a general approach that can address multiple PR tasks within the same framework, which can enable seamless integration.  It accomplishes this by using a  factorization process that decomposes pretrained weights into basis matrices and their component projections. Sharing these basis matrices across layers and adjusting its size enables us to perform MC, whereas the small size of the projection weights (fewer than 200 in some experiments) enables \\method{} support PEFT.  Experiments on ViT models show \\method{} outperforms methods from prior work capable of dual-task applications by 4-5\\% while also outperforming the state-of-the-art in PEFT by 1.5\\% and PEFT+MC combinations by almost 1\\%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37727", "url": null, "sourceid": 35161, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37730, "uid": "35a4d59f4afb1fc480e759a20dd385a1", "name": "SE(3)-Equivariance with Geometric and Topological Guidance for Category-Level Object Pose Estimation", "authors": [{"id": 182906, "fullname": "Sheng Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182906?format=json", "institution": "Beijing Institute of Technology"}, {"id": 188115, "fullname": "Di-Hua Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/188115?format=json", "institution": "Beijing Institute of Technology"}, {"id": 188116, "fullname": "Yuanqing Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/188116?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "Object pose estimation is a key task for embodied robots, enabling them to interact with objects effectively. Category-level object pose estimation provides a way for robots to estimate the pose of unknown objects. However, estimating object pose from point clouds alone remains challenging. In this paper, we introduce SEGPose, a novel category-level object pose estimation method based on point clouds. Unlike previous methods, SEGPose leverages geometric, topological information, and SE(3)-equivariance, enhancing the network's accuracy in pose prediction. To utilize geometric and topological features, we propose a constraint-based feature extraction and 3D reconstruction method, enabling effective object shape reconstruction. We also design an SE(3)-equivariance feature prediction network to handle pose transformations consistently across viewpoints, improving pose accuracy. Experimental results on benchmark datasets show that SEGPose outperforms all current category-level pose estimation methods based on point clouds. Additionally, we apply the SEGPose to the robotic grasping tasks in real-world scenarios, and the results indicate that SEGPose exhibits excellent generalization capabilities.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37730", "url": null, "sourceid": 42819, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37731, "uid": "6c8075e295e3f39797ded21a0a94ed08", "name": "StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues", "authors": [{"id": 188117, "fullname": "Zanxi Ruan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188117?format=json", "institution": "University of Verona"}, {"id": 188118, "fullname": "Songqun Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188118?format=json", "institution": "University of Trento"}, {"id": 188119, "fullname": "Qiuyu Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/188119?format=json", "institution": "University of Verona; University of Roma &quot;La Sapienza&quot;"}, {"id": 106509, "fullname": "Yiming Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/106509?format=json", "institution": "Fondazione Bruno Kessler"}, {"id": 154368, "fullname": "Marco Cristani", "url": "http://cvpr.thecvf.com/api/miniconf/users/154368?format=json", "institution": "Universit\u00e0 degli Studi di Verona"}], "abstract": "Edge-based representations are fundamental cues for visual understanding, a principle rooted in early vision research and still central today. We extend this principle to vision-language alignment, showing that isolating and aligning structural cues across modalities can greatly benefit fine-tuning on long, detail-rich captions, with a specific focus on improving cross modal retrieval. We introduce StructXLIP, a fine-tuning alignment paradigm that extracts edge maps (e.g., Canny), treating them as proxies for the visual structure of an image, and filters the corresponding captions to emphasize structural cues, making them \u201cstructure-centric\u201d. Fine-tuning augments the standard alignment loss with three structure-centric losses: (i) aligning edge maps with structural text, (ii) matching local edge regions to textual chunks, and (iii) connecting edge maps to color images to prevent representation drift. From a theoretical standpoint, while standard CLIP maximizes the mutual information between visual and textual embeddings, StructXLIP additionally maximizes the mutual information between multimodal structural representations. This auxiliary optimization is intrinsically harder, guiding the model toward more robust and semantically stable minima, enhancing vision-language alignment. Beyond outperforming current competitors on cross-modal retrieval on both general and specialized domains, our method serves as a general boosting recipe that can be integrated into future approaches in a plug-and-play manner. Code and models will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37731", "url": null, "sourceid": 35451, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37732, "uid": "45b5846f3d142b0748d3dc2ab223ab6b", "name": "SVHalluc: Benchmarking Speech\u2013Vision Hallucination in Audio-Visual Large Language Models", "authors": [{"id": 129062, "fullname": "Chenshuang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129062?format=json", "institution": "Korea Advanced Institute of Science and Technology"}, {"id": 132899, "fullname": "Kyeongseon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/132899?format=json", "institution": "POSTECH"}, {"id": 180054, "fullname": "Chengxin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180054?format=json", "institution": "Korea Advanced Institute of Science and Technology (KAIST)"}, {"id": 152617, "fullname": "Tae-Hyun Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/152617?format=json", "institution": "KAIST"}], "abstract": "Unlike environmental sounds that mainly indicate event occurrence (e.g., dog barking), human speech carries rich semantics and temporal structures. Despite the advancement of audio-visual large-language models (LLMs) in video understanding, it remains unexplored whether current models can accurately align speech contents with corresponding visual signals.  In this work, we show that speech content can induce hallucinations in audio-visual LLMs, where models generate inaccurate or misleading outputs. To systematically study this, we introduce SVHalluc, the first comprehensive benchmark for evaluating speech\u2013vision hallucination in audio-visual LLMs.Our benchmark diagnoses speech\u2013vision hallucinations from two complementary perspectives: semantic and temporal. Experimental results demonstrate that most advanced audio-visual LLMs struggle with aligning speech content with corresponding visual signals.  Our work uncovers a fundamental limitation of current audio-visual LLMs and highlights the need for speech-aware and grounded speech-video perception and comprehension. Code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37732", "url": null, "sourceid": 45992, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37733, "uid": "1fcceea4a8f4e128f39d1fe92d66a0d9", "name": "Designing Instance-Level Sampling Schedules via REINFORCE with James-Stein Shrinkage", "authors": [{"id": 181988, "fullname": "Peiyu Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181988?format=json", "institution": "UCLA"}, {"id": 135485, "fullname": "Suraj Kothawade", "url": "http://cvpr.thecvf.com/api/miniconf/users/135485?format=json", "institution": "Google"}, {"id": 188120, "fullname": "Sirui Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/188120?format=json", "institution": "Google DeepMind"}, {"id": 86898, "fullname": "Ying Nian Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86898?format=json", "institution": "UCLA"}, {"id": 158671, "fullname": "Hongliang Fei", "url": "http://cvpr.thecvf.com/api/miniconf/users/158671?format=json", "institution": "Google"}], "abstract": "Most post-training methods for text-to-image samplers focus on the model weights: either fine-tuning the backbone for alignment or distilling it for few-step efficiency.We take a different route: rescheduling the sampling timeline of a frozen sampler.Instead of a fixed, global schedule, we learn instance-level (prompt- and noise-conditioned) schedules through a single-pass Dirichlet policy. To ensure accurate gradient estimates in high-dimensional policy learning, we introduce a novel reward baseline based on a principled James\u2013Stein estimator; it provably achieves lower estimation errors than commonly used variants and leads to superior results.Our rescheduled samplers consistently improve text\u2013image alignment including text rendering and compositional control across modern Stable Diffusion and Flux model families.Additionally, a 5-step Flux-Dev sampler with our schedules can attain generation quality comparable to deliberately distilled samplers like Flux-Schnell. We thus position our scheduling framework as an emerging model-agnostic post-training lever that unlocks additional generative potential in pretrained samplers.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37733", "url": null, "sourceid": 44153, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37736, "uid": "46f3a1ee6a5611c6ab450dea47b25d15", "name": "ShreddingNet: Coarse-to-Fine Restoration for Multi-Source Shredded Manuscripts", "authors": [{"id": 180016, "fullname": "Haoyang Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/180016?format=json", "institution": "Peking University"}, {"id": 128035, "fullname": "Hao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128035?format=json", "institution": "Peking University"}, {"id": 89566, "fullname": "Yadong Mu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89566?format=json", "institution": "Peking University"}], "abstract": "As an important research task of human cultural heritage, the restoration of artworks and calligraphy is of great significance.Seldom existing works have taken the multi-source (*i.e.*, fragments are not ensured to be from the same piece of artworks) fragment-oriented restoration task into account.We propose ShreddingNet, a coarse-to-fine two-stage pipeline for multi-source manuscript restoration that operates without restrictive conditions.The proposed coarse stage compares the features of each fragment, selecting top-K candidates and clustering fragments by source. This design leverages the key insight that erroneous matches rarely cross source boundaries, enabling high-precision clustering.The proposed fine-grained stage evaluates candidates, yielding matching scores and filters out erroneous matching pairs from the candidate set; producing more precise final matching pairs for global assembly. Experiments conducted on more than 4,000 images from two datasets demonstrate the average reconstruction F1-score achieves 98.37\\%, which is 5.72\\% higher than the current state-of-the-art method, confirming the method\u2019s effectiveness and robustness.Source code is available in the supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37736", "url": null, "sourceid": 31561, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37742, "uid": "c790787d9a7c4289410e43ee1cc27373", "name": "Parse, Search, and Confirmation: Training-Free Aerial Vision-and-Dialog Navigation with Chain-of-Thought Reasoning and Structured Spatial Memory", "authors": [{"id": 181074, "fullname": "Yu Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/181074?format=json", "institution": "Hefei University of Technology"}, {"id": 155705, "fullname": "Hongyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155705?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 90221, "fullname": "Shaofei Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90221?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 88811, "fullname": "Tianrui Hui", "url": "http://cvpr.thecvf.com/api/miniconf/users/88811?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 154334, "fullname": "Yaxiong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154334?format=json", "institution": "Hefei University of Technology"}, {"id": 129355, "fullname": "Lechao Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/129355?format=json", "institution": "Hefei University of Technology"}, {"id": 77038, "fullname": "Zhun Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/77038?format=json", "institution": "University of Nottingham"}, {"id": 75839, "fullname": "Si Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75839?format=json", "institution": "Beihang University"}, {"id": 85089, "fullname": "Meng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85089?format=json", "institution": "Hefei University of Technology"}], "abstract": "In this paper, we present PSC-AVDN, a training-free framework for Aerial Vision-and-Dialog Navigation that integrates a three-stage Parsing-Search-Confirmation reasoning pipeline with a Structured Spatial Memory (SSM) module. The parsing stage converts ambiguous instructions into stable geometric cues, Search-CoT conducts stepwise high-altitude target exploration, and Confirmation-CoT performs fine-grained verification to resolve visual ambiguity and confirm the final target. Meanwhile, SSM integrates multi-scale visual observation, spatial visual memory, and structured geometric memory to provide global spatial context and long-horizon consistency.Extensive experiments on the AVDH and AVDH-Full datasets show that PSC-AVDN sets new state-of-the-art performance in the training-free setting, matching or surpassing several finetuned methods. We believe this framework offers a principled way to combine explicit CoT-style reasoning with structured spatial memory for scalable and generalizable aerial embodied navigation in the future.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37742", "url": null, "sourceid": 41997, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37743, "uid": "6e35148cd8189c67fc462bb214e77f46", "name": "Large-scale Robust Enhanced Ensemble Clustering via Outlier Decoupling", "authors": [{"id": 182529, "fullname": "Jiaxuan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182529?format=json", "institution": "Sichuan University"}, {"id": 188150, "fullname": "Lei Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188150?format=json", "institution": "Sichuan University"}, {"id": 178244, "fullname": "Xinye Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/178244?format=json", "institution": "Tiangong University"}, {"id": 188151, "fullname": "Liang Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/188151?format=json", "institution": "Shanxi University"}], "abstract": "Ensemble clustering aims to derive a consensus partition from multiple base clustering results. Anchor-based methods construct compact similarity representations via anchors, substantially improving computational efficiency. However, when outliers contaminate the data, reconstructing the base clustering results often yields biased anchors. These biased anchors degrade the quality of the anchor similarity matrix and lead to a decline in clustering accuracy. To address this issue, we propose a novel method called large-scale robust enhanced ensemble clustering via outlier decoupling (RANGE). Specifically, RANGE first converts the base clustering results into an initial bipartite graph. To enhance the reliability of this bipartite graph, RANGE designs a high-order fuzzy enhancement strategy (HFES) specifically for initial bipartite graphs. Next, a mapping matrix further filters redundant information from the enhanced bipartite graph. RANGE then reconstructs the mapped bipartite graph via matrix factorization. An anchor matrix is introduced to further enhance computational efficiency. To improve robustness, RANGE incorporates a decoupling term that separates the clean clustering structure and the outlier-contaminated structure in the anchor space. With this decoupling mechanism, RANGE is capable of performing robust ensemble clustering. Moreover, by applying outlier detectors to the decoupled outlier structure, RANGE can be extended to the outlier-detection task. Consequently, RANGE forms a cross-task general framework, and both tasks retain linear time complexity. Extensive cross-domain experiments indicate that RANGE delivers superior performance in both clustering validity and outlier detection. The code is available in the supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37743", "url": null, "sourceid": 35568, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37744, "uid": "c31cd9eb0233c998e5d682c4d826d8c6", "name": "First Logit Boosting: Visual Grounding Method to Mitigate Object Hallucination in Large Vision-Language Models", "authors": [{"id": 177782, "fullname": "Jiwoo Ha", "url": "http://cvpr.thecvf.com/api/miniconf/users/177782?format=json", "institution": "Daegu Gyeongbuk Institute of Science and Technology"}, {"id": 177799, "fullname": "Jongwoo Baek", "url": "http://cvpr.thecvf.com/api/miniconf/users/177799?format=json", "institution": "DGIST"}, {"id": 181326, "fullname": "Jinhyun So", "url": "http://cvpr.thecvf.com/api/miniconf/users/181326?format=json", "institution": "DGIST"}], "abstract": "Recent Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across various multimodal tasks that require understanding both visual and linguistic inputs.However, object hallucination \u2014 the generation of nonexistent objects in answers \u2014 remains a persistent challenge.Although several approaches such as retraining and external grounding methods have been proposed to mitigate this issue, they still suffer from high data costs or structural complexity.Training-free methods such as Contrastive Decoding (CD) are more cost-effective, avoiding additional training or external models, but still suffer from long-term decay, where visual grounding weakens and language priors dominate as the generation progresses.In this paper, we propose First Logit Boosting (FLB), a simple yet effective training-free technique designed to alleviate long-term decay in LVLMs.FLB stores the logit of the first generated token and adds it to subsequent token predictions, effectively mitigating long-term decay of visual information.We observe that FLB (1) sustains the visual information embedded in the first token throughout generation,and (2) suppresses hallucinated words through the stabilizing effect of the \u201cThe\u201d token.Experimental results show that FLB significantly reduces object hallucination on AMBER and CHAIR benchmarks.Notably, it causes negligible inference overhead, making it highly applicable to real-time multimodal systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37744", "url": null, "sourceid": 46770, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37749, "uid": "19df7cd7b27335f2efe6133c69f7688d", "name": "Learning Compact 3D Representations from Feed-Forward Novel View Synthesis", "authors": [{"id": 155738, "fullname": "Honggyu An", "url": "http://cvpr.thecvf.com/api/miniconf/users/155738?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 155741, "fullname": "Jaewoo Jung", "url": "http://cvpr.thecvf.com/api/miniconf/users/155741?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 188158, "fullname": "Mungyeom Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/188158?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 188159, "fullname": "Chaehyun Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/188159?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 188160, "fullname": "Minkyeong Jeon", "url": "http://cvpr.thecvf.com/api/miniconf/users/188160?format=json", "institution": null}, {"id": 155742, "fullname": "Jisang Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/155742?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 188161, "fullname": "Kazumi Fukuda", "url": "http://cvpr.thecvf.com/api/miniconf/users/188161?format=json", "institution": "Sony AI"}, {"id": 188162, "fullname": "Takuya Narihira", "url": "http://cvpr.thecvf.com/api/miniconf/users/188162?format=json", "institution": "Sony AI; Sony AI"}, {"id": 142054, "fullname": "HYUNAH KO", "url": "http://cvpr.thecvf.com/api/miniconf/users/142054?format=json", "institution": "Yonsei University"}, {"id": 188163, "fullname": "Junsu Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/188163?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 187942, "fullname": "Sunghwan Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187942?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 153173, "fullname": "Yuki Mitsufuji", "url": "http://cvpr.thecvf.com/api/miniconf/users/153173?format=json", "institution": "Sony AI"}, {"id": 153109, "fullname": "Seungryong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/153109?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}], "abstract": "Reconstructing and understanding 3D scenes from sparse views in a feed-forward manner remains challenging. While recent approaches use per-pixel 3D Gaussian Splatting for reconstruction and 2D-to-3D feature lifting for scene understanding, they generate excessive redundant Gaussians, causing high memory overhead and sub-optimal multi-view feature aggregation. We propose a feed-forward framework that estimates compact Gaussians only at essential spatial locations, minimizing redundancy while enabling effective feature lifting. We introduce learnable tokens that aggregate multi-view features through self-attention to guide Gaussian generation, ensuring each Gaussian integrates relevant visual features across views. We then exploit the learned attention patterns to efficiently lift features. Extensive experiments on 3D open-vocabulary segmentation and view-invariant feature generation demonstrate our approach's effectiveness. Results show that a compact yet geometrically meaningful representation is sufficient for high-quality scene reconstruction, achieving superior memory efficiency and feature fidelity compared to existing methods. All of our code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37749", "url": null, "sourceid": 43608, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37746, "uid": "4549427ab501fca3449d6450165e060a", "name": "From 2D Alignment to 3D Plausibility: Unifying Heterogeneous 2D Priors and Penetration-Free Diffusion for Occlusion-Robust Two-Hand Reconstruction", "authors": [{"id": 188152, "fullname": "Gaoge Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/188152?format=json", "institution": "MBZUAI"}, {"id": 188153, "fullname": "Yongkang Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188153?format=json", "institution": null}, {"id": 185728, "fullname": "Zhe Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185728?format=json", "institution": "La Trobe University"}, {"id": 87300, "fullname": "Shaoli Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87300?format=json", "institution": "Tencent AI Lab"}, {"id": 84796, "fullname": "Tongliang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84796?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}], "abstract": "Two-hand reconstruction from monocular images is hampered by complex poses and severe occlusions, which often cause interaction misalignment and two\u2013hand penetration. We address this by decoupling the problem into 2D structural alignment and 3D spatial interaction alignment, each handled by a tailored component. For 2D alignment, we pioneer the attempt to unify heterogeneous structural priors (keypoints, segmentation, and depth) from vision foundation models as complementary structured guidance for two-hand recovery. Instead of extracting priors prediction as explicit inputs, we propose a fusion-alignment encoder that absorbs their structural knowledge implicitly, achieving foundation-level guidance without foundation-level cost. For 3D spatial alignment, we propose a two-hand diffusion model that learns a generative mapping from interpenetrated poses to realistic, collision-free configurations. Guided by collision gradients during denoising, the model converges toward the manifold of valid two-hand interactions, preserving geometric and kinematic coherence. This generative formulation approach enables physically credible reconstructions even under occlusion or ambiguous visual input. Extensive experiments on InterHand2.6M, HIC, and FreiHAND show state-of-the-art or leading performance in interaction alignment and penetration suppression. Code will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37746", "url": null, "sourceid": 46159, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37754, "uid": "737e6e42f8a0c420edf0d2889f18a361", "name": "Imbalanced View Contribution Evaluation and Refinement for Deep Incomplete Multi-View Clustering", "authors": [{"id": 182226, "fullname": "Taichun Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/182226?format=json", "institution": "National University of Defense Technology; University of the Chinese Academy of Sciences"}, {"id": 91513, "fullname": "Zhibin Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/91513?format=json", "institution": "National University of Defense Technology"}, {"id": 155810, "fullname": "Hao Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/155810?format=json", "institution": "National University of Defense Technology"}, {"id": 90222, "fullname": "Siwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90222?format=json", "institution": "Academy of Military Sciences"}, {"id": 90226, "fullname": "Xinwang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90226?format=json", "institution": "National University of Defense Technology"}, {"id": 90201, "fullname": "En Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90201?format=json", "institution": "National University of Defense Technology"}, {"id": 127885, "fullname": "Di Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127885?format=json", "institution": "Renmin University of China"}, {"id": 156893, "fullname": "Tianrui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156893?format=json", "institution": "National University of Defense Technology"}, {"id": 149227, "fullname": "chuankun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/149227?format=json", "institution": "North University of China"}, {"id": 188173, "fullname": "Kunlun He", "url": "http://cvpr.thecvf.com/api/miniconf/users/188173?format=json", "institution": null}], "abstract": "In real-world applications, multi-view data often suffer from missing situations due to privacy protection and sensor failure factors. The Incomplete scenarios not only lead to partial information availability but also cause significant imbalance learning among views: certain \u201cstrong views\u201d dominate the fusion process, while \u201cweak views\u201d contribute marginally, thereby undermining cross-view collaboration. Existing incomplete multi-view clustering methods mainly focus on \"how to handle missing data\", yet they largely overlook the imbalance view contribution induced by incompleteness and its profound impact on representation learning and clustering performance. To address these issues, our paper first analyzes the data imbalance caused by missing views and the resulting disparities in view learning quality. Then, we propose a collaborative evaluation and enhancement framework (\\textbf{ICER}) for imbalanced incomplete multi-view clustering . Specifically, we employ shapley values to quantify the marginal contribution of each view, and incorporate imbalanced optimal transport to characterize distributional deviations across views. On this basis, we construct the view contribution imbalance metric to comprehensively evaluate cross-view collaboration and fusion quality, and design a collaboration enhancement module to explicitly reinforce inter-view cooperative optimization and feature fusion. Extensive experiments on multiple benchmark datasets demonstrate that the proposed method outperforms existing incomplete multi-view clustering approaches, validating the effectiveness and necessity of explicitly modeling and mitigating view imbalance in imbalanced incomplete scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37754", "url": null, "sourceid": 44621, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37755, "uid": "83b12c7d5f1bc35f22e866f5fcef9bc3", "name": "Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens", "authors": [{"id": 157170, "fullname": "Yuqing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157170?format=json", "institution": "The University of HongKong"}, {"id": 156035, "fullname": "Chuofan Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/156035?format=json", "institution": "University of Hong Kong"}, {"id": 87202, "fullname": "Zhijie Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/87202?format=json", "institution": "Zhejiang University"}, {"id": 171459, "fullname": "Yao Teng", "url": "http://cvpr.thecvf.com/api/miniconf/users/171459?format=json", "institution": "University of Hong Kong"}, {"id": 158398, "fullname": "Lijun Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/158398?format=json", "institution": "Google DeepMind"}, {"id": 188174, "fullname": "Shuai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188174?format=json", "institution": "Nanjing University"}, {"id": 153261, "fullname": "Jiaming Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/153261?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 86968, "fullname": "Jiashi Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86968?format=json", "institution": "ByteDance"}, {"id": 90635, "fullname": "Yi Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90635?format=json", "institution": "bytedance"}, {"id": 86697, "fullname": "Xihui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86697?format=json", "institution": "The University of Hong Kong"}], "abstract": "Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional VAE tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges.In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. Instead of treating spatial positions atomically, CubiD performs fine-grained masking throughout the high-dimensional discrete representation\u2014any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions through attention, transforming an intractable $O(hwd)$ sequential generation problem into $O(T)$ parallel iterations where $T \\ll hwd$. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37755", "url": null, "sourceid": 37858, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37757, "uid": "ce3c9c905135bab43c25500d3435a7a7", "name": "GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis", "authors": [{"id": 168257, "fullname": "Xuqin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/168257?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}, {"id": 188176, "fullname": "Tao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188176?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 188177, "fullname": "Yanfeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188177?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 188178, "fullname": "Lu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188178?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 188179, "fullname": "mingwei Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/188179?format=json", "institution": "Huawei Technologies Ltd.; Wuhan University"}, {"id": 188180, "fullname": "Yongliang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188180?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 188181, "fullname": "Niclas Zeller", "url": "http://cvpr.thecvf.com/api/miniconf/users/188181?format=json", "institution": "Karlsruhe University of Applied Sciences"}, {"id": 84985, "fullname": "Daniel Cremers", "url": "http://cvpr.thecvf.com/api/miniconf/users/84985?format=json", "institution": "Technical University Munich"}], "abstract": "Recent advances in generative modeling have substantially enhanced novel view synthesis, yet maintaining consistency across viewpoints remains challenging. Diffusion-based models rely on stochastic noise-to-data transitions, which obscure deterministic structures and yield inconsistent view predictions.We propose a Data-to-Data Flow Matching framework that learns deterministic transformations directly between paired views, enhancing view-consistent synthesis through explicit data coupling.To further enhance geometric coherence, we introduce Probability Density Geodesic Flow Matching (PDG-FM), which constrains flow trajectories using geodesic interpolants derived from probability density metrics of pretrained diffusion models. Such alignment with high-density regions of the data manifold promotes more realistic interpolants between samples.Empirically, our method surpasses diffusion-based NVS baselines, demonstrating improved structural coherence and smoother transitions across views. These results highlight the advantages of incorporating data-dependent geometric regularization into deterministic flow matching for consistent novel view generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37757", "url": null, "sourceid": 40751, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37758, "uid": "175e9308ea835facdc5c74c75acc450f", "name": "Self-supervised Dynamic Heterogeneous Degradation Modeling for Unified Zero-Shot Image Restoration", "authors": [{"id": 188182, "fullname": "Xiaowan Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188182?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 181617, "fullname": "Jing Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181617?format=json", "institution": "Beihang University"}, {"id": 188183, "fullname": "Henan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188183?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 188184, "fullname": "Li Huaqiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188184?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 88543, "fullname": "Mai Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88543?format=json", "institution": "Beihang University, Tsinghua University"}], "abstract": "Zero-shot image restoration provides a flexible way to handle diverse degradations without task-specific training. However, existing methods typically rely on stacked layers or pre-trained features to enhance degradation expression, while overlooking physically consistent priors. The insufficient degradation prompts impose the heavy training burden and high sampling costs during zero-shot diffusion. Moreover, the fixed inference trajectory often collapses to suboptimal solutions under complex corruptions. We observe that heterogeneous degradations can be reparameterized into a minimal set of physically coherent parameters for compact representation. Based on this insight, we first propose a unified physical zero-shot image restoration (UP-ZeroIR) framework that explicitly models heterogeneous degradations into a homogeneous all-in-one distribution. The distribution can be optimized directly in the latent space, enabling principled solution exploration and effective prompt adaptation. Besides, we introduce a dynamic quality-refinement strategy that adaptively adjusts the diffusion trajectory for robust globally optimal convergence. Extensive experiments demonstrate that our method achieves state-of-the-art performance across both single and mixed degradations. The code will be publicly available soon.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37758", "url": null, "sourceid": 32731, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37759, "uid": "2c12edb773c206a91798f98be367a590", "name": "Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos", "authors": [{"id": 181746, "fullname": "ZIREN GONG", "url": "http://cvpr.thecvf.com/api/miniconf/users/181746?format=json", "institution": "University of Bologna"}, {"id": 188185, "fullname": "Xiaohan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188185?format=json", "institution": null}, {"id": 75726, "fullname": "Fabio Tosi", "url": "http://cvpr.thecvf.com/api/miniconf/users/75726?format=json", "institution": "University of Bologna"}, {"id": 142332, "fullname": "Jiawei Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/142332?format=json", "institution": "Beijing Institute of Technology"}, {"id": 87188, "fullname": "Stefano Mattoccia", "url": "http://cvpr.thecvf.com/api/miniconf/users/87188?format=json", "institution": "University of Bologna"}, {"id": 126993, "fullname": "Jianfei Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/126993?format=json", "institution": "Monash University"}, {"id": 87171, "fullname": "Matteo Poggi", "url": "http://cvpr.thecvf.com/api/miniconf/users/87171?format=json", "institution": "Universit\u00e0 di Bologna"}], "abstract": "We present Ov3R, a novel framework for open-vocabulary semantic 3D reconstruction from RGB video streams, designed to advance Spatial AI. The system features two key components: CLIP3R, a CLIP-informed 3D reconstruction module that predicts dense point maps from overlapping clips alongside object-level semantics; and 2D\u20133D OVS, a 2D-3D open-vocabulary semantic module that lifts 2D features into 3D by learning fused descriptors integrating spatial, geometric, and semantic cues. Unlike prior methods, Ov3R incorporates CLIP semantics directly into the reconstruction process, enabling globally consistent geometry and fine-grained semantic alignment. Our framework achieves state-of-the-art performance in both dense 3D reconstruction and open-vocabulary 3D segmentation \u2014 marking a step forward toward real-time, semantics-aware Spatial AI.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37759", "url": null, "sourceid": 37383, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37760, "uid": "25ccdad9dedf0b93fa4eed8de42e362f", "name": "Beyond Top Activations: Efficient and Reliable Crowdsourced Evaluation of Automated Interpretability", "authors": [{"id": 153227, "fullname": "Tuomas Oikarinen", "url": "http://cvpr.thecvf.com/api/miniconf/users/153227?format=json", "institution": "University of California, San Diego"}, {"id": 153225, "fullname": "Ge Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153225?format=json", "institution": "University of California, San Diego"}, {"id": 153224, "fullname": "Akshay R. Kulkarni", "url": "http://cvpr.thecvf.com/api/miniconf/users/153224?format=json", "institution": "University of California, San Diego"}, {"id": 153228, "fullname": "Tsui-Wei Weng", "url": "http://cvpr.thecvf.com/api/miniconf/users/153228?format=json", "institution": "University of California, San Diego"}], "abstract": "Interpreting individual neurons or directions in activation space is an important topic in mechanistic interpretability. Numerous automated interpretability methods have been proposed to generate such explanations, but it remains unclear how reliable these explanations are, and which methods produce the most accurate descriptions. While crowd-sourced evaluations are commonly used, existing pipelines are noisy, costly, and typically assess only the highest-activating inputs, leading to unreliable results. In this paper, we introduce two techniques to enable cost-effective and accurate crowdsourced evaluation of automated interpretability methods beyond top activating inputs. First, we propose Model-Guided  Importance Sampling (MG-IS) to select the most informative inputs to show human raters. In our experiments, we show this reduces the number of inputs needed to reach the same evaluation accuracy by $\\sim13\\times$. Second, we address label noise in crowd-sourced ratings through Bayesian Rating Aggregation (BRAgg), which allows us to reduce the number of ratings per input required to overcome noise by $\\sim3\\times$. Together, these techniques reduce the evaluation cost by $\\sim40\\times$, making large-scale evaluation feasible. Finally, we use our methods to conduct a large scale crowd-sourced study comparing recent automated interpretability methods for vision networks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37760", "url": null, "sourceid": 35653, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37761, "uid": "c5ee9f155fdf0ea2ce012c7ba202f373", "name": "Anchoring and Rescaling Attention for Semantically Coherent Inbetweening", "authors": [{"id": 183925, "fullname": "Tae Eun Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/183925?format=json", "institution": "Yonsei University"}, {"id": 188186, "fullname": "Sumin Shim", "url": "http://cvpr.thecvf.com/api/miniconf/users/188186?format=json", "institution": "Yonsei University"}, {"id": 153959, "fullname": "Junhyeok Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/153959?format=json", "institution": "Yonsei University"}, {"id": 107168, "fullname": "Seong Jae Hwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107168?format=json", "institution": "Yonsei University"}], "abstract": "Generative inbetweening (GI) seeks to synthesize realistic intermediate frames between the first and last keyframes beyond mere interpolation.As sequences become sparser and motions larger, previous GI models struggle with inconsistent frames with unstable pacing and semantic misalignment.Since GI involves fixed endpoints and numerous plausible paths, this task requires additional guidance gained from the keyframes and text to specify the intended path.Thus, we give semantic and temporal guidance from the keyframes and text onto each intermediate frame through Keyframe-anchored Attention Bias. We also better enforce frame consistency with Rescaled Temporal RoPE, which allows self-attention to attend to keyframes more faithfully.TGI-Bench, the first benchmark specifically designed for text-conditioned GI evaluation, enables challenge-targeted evaluation to analyze GI models. Without additional training, our method achieves state-of-the-art frame consistency, semantic fidelity, and pace stability for both short and long sequences across diverse challenges.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37761", "url": null, "sourceid": 41527, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37762, "uid": "f191bee63afb73c64730435e063e61b1", "name": "Latent Chain-of-Thought World Modeling for End-to-End Driving", "authors": [{"id": 157164, "fullname": "Shuhan Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/157164?format=json", "institution": "University of Texas at Austin"}, {"id": 188187, "fullname": "Kashyap Chitta", "url": "http://cvpr.thecvf.com/api/miniconf/users/188187?format=json", "institution": "NVIDIA"}, {"id": 138722, "fullname": "Yuxiao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/138722?format=json", "institution": "California Institute of Technology"}, {"id": 184525, "fullname": "Thomas Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/184525?format=json", "institution": null}, {"id": 98591, "fullname": "Yurong You", "url": "http://cvpr.thecvf.com/api/miniconf/users/98591?format=json", "institution": "NVIDIA"}, {"id": 184527, "fullname": "Yan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184527?format=json", "institution": "NVIDIA"}, {"id": 184524, "fullname": "Wenjie Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/184524?format=json", "institution": "NVIDIA"}, {"id": 184526, "fullname": "Yulong Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184526?format=json", "institution": "NVIDIA"}, {"id": 89183, "fullname": "Philipp Kr\u00e4henb\u00fchl", "url": "http://cvpr.thecvf.com/api/miniconf/users/89183?format=json", "institution": "University of Texas at Austin"}, {"id": 88162, "fullname": "Marco Pavone", "url": "http://cvpr.thecvf.com/api/miniconf/users/88162?format=json", "institution": "NVIDIA"}, {"id": 127258, "fullname": "Boris Ivanovic", "url": "http://cvpr.thecvf.com/api/miniconf/users/127258?format=json", "institution": "NVIDIA"}], "abstract": "Recent Vision-Language-Action (VLA) models for autonomous driving explore inference-time reasoning as a way to improve driving performance and safety in challenging scenarios. Most prior work uses natural language to express chain-of-thought (CoT) reasoning before producing driving actions. However, text may not be the most efficient representation for reasoning. In this work, we present Latent-CoT-Drive (LDrive): a model that expresses CoT in a latent language that captures possible outcomes of the driving actions being considered. Our approach unifies CoT reasoning and decision making by representing both in an action-aligned latent space. Instead of natural language, the model reasons by interleaving (1) action-proposal tokens, which use the same vocabulary as the model\u2019s output actions; and (2) world model tokens, which are grounded in a learned latent world model and express future outcomes of these actions. We cold start latent CoT by supervising the model\u2019s action proposals and world model tokens based on ground-truth future rollouts of the scene. We then post-train with closed-loop reinforcement learning to strengthen reasoning capabilities. On a large-scale end-to-end driving benchmark, LDrive achieves faster inference, better trajectory quality, and larger improvements from interactive reinforcement learning compared to both non-reasoning and text-reasoning baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37762", "url": null, "sourceid": 46232, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37766, "uid": "e4c9f7ec8caa4aca38efbbcae59b6472", "name": "Unified Personalized Understanding, Generating and Editing", "authors": [{"id": 180439, "fullname": "Yu Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/180439?format=json", "institution": "\u6d59\u6c5f\u5927\u5b66\u8f6f\u4ef6\u5b66\u9662\uff08\u5b81\u6ce2\uff09\u521b\u65b0\u4e0e\u7ba1\u7406\u4e2d\u5fc3"}, {"id": 188210, "fullname": "Tianwei Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188210?format=json", "institution": "Zhejiang University"}, {"id": 188211, "fullname": "Ruike Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188211?format=json", "institution": "Department of Computer Science in UIUC"}, {"id": 113817, "fullname": "Yuqian Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/113817?format=json", "institution": "Zhejiang University"}, {"id": 188212, "fullname": "Haoyu Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188212?format=json", "institution": "Zhejiang University"}, {"id": 188213, "fullname": "Liang Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188213?format=json", "institution": "Central South UniversityCh"}, {"id": 126912, "fullname": "Wenqiao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126912?format=json", "institution": "National University of Singapore"}, {"id": 154184, "fullname": "Feifei Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/154184?format=json", "institution": "Zhejiang University"}, {"id": 87164, "fullname": "Haoyuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/87164?format=json", "institution": "Zhejiang University"}, {"id": 157825, "fullname": "Wanggui He", "url": "http://cvpr.thecvf.com/api/miniconf/users/157825?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 87163, "fullname": "Hao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87163?format=json", "institution": "Eindhoven University of Technology"}, {"id": 129046, "fullname": "Yueting Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129046?format=json", "institution": "Zhejiang University"}], "abstract": "Unified large multimodal models (LMMs) have achieved remarkable progress in general-purpose multimodal understanding and generation. However, they still operate under a ''one-size-fits-all'' paradigm and struggle to model user-specific concepts (e.g., generate a photo of $\\texttt{\\<maeve>}$) in a consistent and controllable manner. Existing personalization methods typically rely on external retrieval, which is inefficient and poorly integrated into unified multimodal pipelines. Recent personalized unified models introduce learnable soft prompts to encode concept information, yet they either couple understanding and generation or depend on complex multi-stage training, leading to cross-task interference and ultimately to fuzzy or misaligned personalized knowledge.We present $\\textbf{OmniPersona}$, an end-to-end personalization framework for unified LMMs that, for the first time, integrates personalized understanding, generation, and image editing within a single architecture. OmniPersona introduces structurally decoupled concept tokens, allocating dedicated subspaces for different tasks to minimize interference, and incorporates an explicit knowledge replay mechanism that propagates personalized attribute knowledge across tasks, enabling consistent personalized behavior.To systematically evaluate unified personalization, we propose $\\textbf{\\texttt{OmniPBench}}$, extending the public UnifyBench concept set with personalized editing tasks and cross-task evaluation protocols integrating understanding, generation, and editing. Experimental results demonstrate that OmniPersona delivers competitive and robust performance across diverse personalization tasks. We hope OmniPersona will serve as a strong baseline and spur further research on controllable, unified personalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37766", "url": null, "sourceid": 35224, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37767, "uid": "170507b9c14160af6048d41c9d6dcce7", "name": "FoundIR-v2: Optimizing Pre-Training Data Mixtures for Image Restoration Foundation Model", "authors": [{"id": 90859, "fullname": "Xiang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/90859?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 87633, "fullname": "Jinshan Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87633?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 89307, "fullname": "Jiangxin Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/89307?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 85000, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85000?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 89846, "fullname": "Jinhui Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89846?format=json", "institution": "Nanjing University of Science and Technology"}], "abstract": "Recent studies have witnessed significant advances in image restoration foundation models driven by improvements in the scale and quality of pre-training data. In this work, we find that the data mixture proportions from different restoration tasks are also a critical factor directly determining the overall performance of all-in-one image restoration models. To this end, we propose a high-capacity diffusion-based image restoration foundation model, FoundIR-v2, which adopts a data equilibrium scheduling paradigm to dynamically optimize the proportions of mixed training datasets from different tasks. By leveraging the data mixing law, our method ensures a balanced dataset composition, enabling the model to achieve consistent generalization and comprehensive performance across diverse tasks. Furthermore, we introduce an effective Mixture-of-Experts (MoE)-driven scheduler into generative pre-training to flexibly allocate task-adaptive diffusion priors for each restoration task, accounting for the distinct degradation forms and levels exhibited by different tasks. Extensive experiments demonstrate that our method can address over 50 sub-tasks across a broader scope of real-world scenarios and achieves favorable performance against state-of-the-art approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37767", "url": null, "sourceid": 37364, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37769, "uid": "710eec06bc39ec7c6bd28ddf5f3a5668", "name": "Feed-forward Gaussian Registration for Head Avatar Creation and Editing", "authors": [{"id": 71850, "fullname": "Malte Prinzler", "url": "http://cvpr.thecvf.com/api/miniconf/users/71850?format=json", "institution": "Max Planck Institute for Intelligent Systems T\u00fcbingen"}, {"id": 85017, "fullname": "Paulo Gotardo", "url": "http://cvpr.thecvf.com/api/miniconf/users/85017?format=json", "institution": "Google"}, {"id": 69176, "fullname": "Siyu Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/69176?format=json", "institution": "ETH Zurich"}, {"id": 88338, "fullname": "Timo Bolkart", "url": "http://cvpr.thecvf.com/api/miniconf/users/88338?format=json", "institution": "Google"}], "abstract": "We present MATCH (Multi-view Avatars from Topologically Corresponding Heads), a multi-view Gaussian registration method for high-quality head avatar creation and editing. State-of-the-art multi-view head avatars require time-consuming head tracking, which is followed by an expensive avatar optimization, often resulting in a total creation time that exceeds one day. MATCH instead directly predicts Gaussian splat textures in correspondence from calibrated multi-view images in 0.5 seconds per frame. While the learned intra-subject correspondence across frames allows us to quickly build personalized head avatars, correspondence across subjects enables various applications such as expression transfer, optimization-free tracking, semantic editing, and identity interpolation. We learn to establish such correspondences end-to-end, with a transformer-based model that predicts textures of Gaussian splats in the fixed UV layout of a template mesh. To this end, we introduce a novel registration-guided attention block, in which each UV map token attends exclusively to image tokens depicting its corresponding mesh region. MATCH outperforms existing methods for novel-view synthesis, geometry registration, and head avatar generation, the latter being $10\\times$ faster than the qualitatively closest baseline. Code and model weights will be published upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37769", "url": null, "sourceid": 35367, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37782, "uid": "5e7cefa9b606dcd7b0faa082d82cdb1d", "name": "SAGE: Style-Adaptive Generalization for Privacy-Constrained Semantic Segmentation Across Domains", "authors": [{"id": 188255, "fullname": "Qingmei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188255?format=json", "institution": "Tsinghua University"}, {"id": 161757, "fullname": "Yang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/161757?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 176098, "fullname": "peifeng zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176098?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 131048, "fullname": "Haohuan Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131048?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 131057, "fullname": "Juepeng Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/131057?format=json", "institution": "Sun Yat-Sen University"}], "abstract": "Domain generalization for semantic segmentation aims to mitigate the degradation in model performance caused by domain shifts. However, in many real-world scenarios, we are unable to access the model parameters and architectural details due to privacy concerns and security constraints. Traditional fine-tuning or adaptation is hindered, leading to the demand for input-level strategies that can enhance generalization without modifying model weights. To this end, we propose a Style-Adaptive GEneralization framework (SAGE), which improves the generalization of frozen models under privacy constraints. SAGE learns to synthesize visual prompts that implicitly align feature distributions across styles instead of directly fine-tuning the backbone. Specifically, we first utilize style transfer to construct a diverse style representation of the source domain, thereby learning a set of style characteristics that can cover a wide range of visual features. Then, the model adaptively fuses these style cues according to the visual context of each input, forming a dynamic prompt that harmonizes the image appearance without touching the interior of the model. Through this closed-loop design, SAGE effectively bridges the gap between frozen model invariance and the diversity of unseen domains. Extensive experiments on five benchmark datasets demonstrate that SAGE achieves competitive or superior performance compared to state-of-the-art methods under privacy constraints and outperforms full fine-tuning baselines in all settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37782", "url": null, "sourceid": 33451, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37784, "uid": "e702a7e0d9355796847321bc16ffef7f", "name": "MMSD3.0: A Multi-Image Benchmark for Real-World Multimodal Sarcasm Detection", "authors": [{"id": 180359, "fullname": "HAOCHEN ZHAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/180359?format=json", "institution": "\u4e2d\u56fd\u79d1\u5b66\u9662\u4fe1\u606f\u5de5\u7a0b\u7814\u7a76\u6240"}, {"id": 188257, "fullname": "Yuyao Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/188257?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 144535, "fullname": "Yongxiu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/144535?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 154418, "fullname": "Gaopeng Gou", "url": "http://cvpr.thecvf.com/api/miniconf/users/154418?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 188258, "fullname": "Hongbo Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188258?format=json", "institution": "Chinese Academy of Sciences"}, {"id": 188259, "fullname": "Yubin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188259?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 188260, "fullname": "Haoliang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188260?format=json", "institution": null}], "abstract": "Despite progress in multimodal sarcasm detection, existing datasets and methods predominantly focus on single-image scenarios, overlooking potential semantic and affective relations across multiple images. This leaves a gap in modeling cases where sarcasm is triggered by multi-image cues in real-world settings. To bridge this gap, we introduce MMSD3.0, a new benchmark composed entirely of multi-image samples curated from tweets and Amazon reviews. We further propose a Cross-Image Reasoning Model (CIRM), integrating a Dual-Stage Bridge Module and Relevance-Guided Fusion Module to model inter-image dependencies and cross-modal correspondences. Complementarily, we establish a comprehensive suite of strong and representative baselines and conduct extensive experiments, showing that MMSD3.0 is an effective and reliable benchmark that better reflects real-world conditions. Moreover, CIRM demonstrates state-of-the-art performance across MMSD, MMSD2.0, and MMSD3.0, validating its effectiveness in both single-image and multi-image scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37784", "url": null, "sourceid": 39180, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37787, "uid": "4f1780d09a304fd4583f290babc0797d", "name": "Multimodal Distribution Matching for Vision-Language Dataset Distillation", "authors": [{"id": 179884, "fullname": "Jongoh Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/179884?format=json", "institution": "KAIST"}, {"id": 130885, "fullname": "Hoyong Kwon", "url": "http://cvpr.thecvf.com/api/miniconf/users/130885?format=json", "institution": "KAIST"}, {"id": 147668, "fullname": "Minseok Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/147668?format=json", "institution": "KAIST"}, {"id": 76867, "fullname": "Kuk-Jin Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/76867?format=json", "institution": "KAIST"}], "abstract": "Dataset distillation compresses large training sets into compact synthetic datasets while preserving downstream performance. As modern systems increasingly operate on paired vision\u2013language inputs, multimodal distillation must preserve representation quality and cross-modal alignment under tight compute and memory budgets, yet prior methods often require heavy computes and overlook their correlations. To address this, we present Multimodal Distribution Matching (MDM), a geometry-aware framework for efficient and generalizable multimodal distillation. Specifically, MDM integrates complementary components at the data, model, and loss levels. At the data level, it initializes synthetic image\u2013text pairs by sampling from clusters in the joint embedding space. At the model level, it forms a mixed teacher by interpolating independently fine-tuned models in weight space according to their angular deviation from the pretrained anchor. At the loss level, it matches joint distributions on the unit hypersphere using a geometry-aware matching objective that exploits the joint features in the cross-modal alignment and discrepancy directions along with symmetric contrastive learning. Across image\u2013text retrieval benchmarks with cross-architecture evaluation, MDM yields compact synthetic sets that preserve multimodal semantics, substantially reduce distillation cost, and remain robust across architectures.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37787", "url": "andyj1.github.io/mdm", "sourceid": 46478, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65752, "file": "/media/PosterPDFs/CVPR%202026/37787.png", "modified": "2026-04-30T01:35:09.187284-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": true, "sortkey": 0, "is_live_content": false, "detailed_kind": "", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65753, "file": "/media/PosterPDFs/CVPR%202026/37787-thumb.png", "modified": "2026-04-30T01:35:09.367778-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": false, "sortkey": 0, "is_live_content": false, "detailed_kind": "thumb", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65754, "modified": "2026-04-30T01:37:17.504485-07:00", "display_section": 1, "type": "PDF", "name": "Slides", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "/media/cvpr-2026/Slides/37787.pdf", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37789, "uid": "68624a8a1f8dd04e260c0173bad7ee31", "name": "PhyGaP: Physically-Grounded Gaussians with Polarization Cues", "authors": [{"id": 171770, "fullname": "Jiale Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/171770?format=json", "institution": "Zhejiang University"}, {"id": 180504, "fullname": "Xiaoyang Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/180504?format=json", "institution": "The University of Hong Kong"}, {"id": 188271, "fullname": "Zongqi He", "url": "http://cvpr.thecvf.com/api/miniconf/users/188271?format=json", "institution": "University of Hong Kong"}, {"id": 90181, "fullname": "Weiwei Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90181?format=json", "institution": "Zhejiang University"}, {"id": 129481, "fullname": "YIFAN PENG", "url": "http://cvpr.thecvf.com/api/miniconf/users/129481?format=json", "institution": "University of Hong Kong"}], "abstract": "Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated great success in modeling reflective 3D objects and their interaction with the environment via **deferred rendering (DR)**. However, existing methods often struggle with correctly reconstructing physical attributes such as albedo and reflectance, and therefore they do not support high-fidelity relighting. Observing that this limitation stems from the lack of **shape and material** information in RGB images, we present PhyGaP, a physically-grounded 3DGS method that leverages polarization cues to facilitate precise reflection decomposition and visually consistent relighting of reconstructed objects. Specifically, we design a polarimetric deferred rendering (PolarDR) process to model polarization by reflection, and a self-occlusion-aware environment map building technique (GridMap) to resolve indirect lighting of non-convex objects. We validate on multiple synthetic and real-world scenes, including those featuring only partial polarization cues, that PhyGaP not only excels in reconstructing the appearance and surface normal of reflective 3D objects (~2 dB in PSNR and 45.7% in Cosine Distance better than existing RGB-based methods on average), but also achieves state-of-the-art inverse rendering and relighting capability.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37789", "url": null, "sourceid": 38185, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40313?format=json"], "related_events_ids": [40313]}, {"id": 40313, "uid": "68624a8a1f8dd04e260c0173bad7ee31", "name": "PhyGaP: Physically-Grounded Gaussians with Polarization Cues", "authors": [{"id": 171770, "fullname": "Jiale Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/171770?format=json", "institution": "Zhejiang University"}, {"id": 180504, "fullname": "Xiaoyang Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/180504?format=json", "institution": "The University of Hong Kong"}, {"id": 188271, "fullname": "Zongqi He", "url": "http://cvpr.thecvf.com/api/miniconf/users/188271?format=json", "institution": "University of Hong Kong"}, {"id": 90181, "fullname": "Weiwei Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90181?format=json", "institution": "Zhejiang University"}, {"id": 129481, "fullname": "YIFAN PENG", "url": "http://cvpr.thecvf.com/api/miniconf/users/129481?format=json", "institution": "University of Hong Kong"}], "abstract": "Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated great success in modeling reflective 3D objects and their interaction with the environment via **deferred rendering (DR)**. However, existing methods often struggle with correctly reconstructing physical attributes such as albedo and reflectance, and therefore they do not support high-fidelity relighting. Observing that this limitation stems from the lack of **shape and material** information in RGB images, we present PhyGaP, a physically-grounded 3DGS method that leverages polarization cues to facilitate precise reflection decomposition and visually consistent relighting of reconstructed objects. Specifically, we design a polarimetric deferred rendering (PolarDR) process to model polarization by reflection, and a self-occlusion-aware environment map building technique (GridMap) to resolve indirect lighting of non-convex objects. We validate on multiple synthetic and real-world scenes, including those featuring only partial polarization cues, that PhyGaP not only excels in reconstructing the appearance and surface normal of reflective 3D objects (~2 dB in PSNR and 45.7% in Cosine Distance better than existing RGB-based methods on average), but also achieves state-of-the-art inverse rendering and relighting capability.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40313", "url": null, "sourceid": -38185, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37789?format=json"], "related_events_ids": [37789]}, {"id": 37795, "uid": "9c2aef8bf511d208ee623850400a7f9e", "name": "Coupling Liquid Time\u2011Constant Encoders with Modern Hopfield Memory", "authors": [{"id": 181230, "fullname": "Bishal Swain", "url": "http://cvpr.thecvf.com/api/miniconf/users/181230?format=json", "institution": "Kumoh National Institute of Technology"}, {"id": 149252, "fullname": "Kyung Joo Cheoi", "url": "http://cvpr.thecvf.com/api/miniconf/users/149252?format=json", "institution": "Chungbuk National University"}, {"id": 188288, "fullname": "Jaepil Ko", "url": "http://cvpr.thecvf.com/api/miniconf/users/188288?format=json", "institution": "Kumoh National Institute of Technology"}], "abstract": "Continuous-time neural networks provide adaptive dynamics, but rely on a single hidden state to encode both fast input fluctuations and longer-term context. This shared representation forces rapidly changing inputs to overwrite slower contextual signals, causing the model to lose past information as new observations arrive. In contrast, biological perceptual systems maintain stable behaviour under evolving sensory input by integrating ongoing signals with stored associative patterns rather than relying on a single evolving state.Motivated by this distinction, we study a simple coupling of Liquid Time-Constant Networks (LTCs) with a Modern Hopfield Network (MHN) that serves as a content-addressable memory. At each time step, the liquid state is projected into a query, the MHN retrieves a memory vector, and the two representations are concatenated before a readout layer. We analyse this coupling under standard norm and Lipschitz assumptions and show that the combined representation remains bounded. We further show that the retrieval map contracts gradients for parameters upstream of the memory query, which provides a mechanism for reducing curvature in the loss landscape.On public time-series benchmarks, the coupled LTC-MHN model improves mean accuracy by 2.3\\% over competitive recurrent and continuous-time baselines and reduces the estimated Hessian trace by about an order of magnitude relative to a standalone LTC encoder, with the largest gains on classification tasks and competitive performance on a regression task. Qualitative analyses of training curves, loss landscapes, and latent embeddings support the interpretation that Hopfield retrieval smooths optimization and encourages more compact, linearly separable class manifolds. Code will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37795", "url": null, "sourceid": 37402, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37793, "uid": "f0c19e4e1cbcc224b862bb4579a06a7e", "name": "VDOT: Efficient Unified Video Creation via Optimal Transport Distillation", "authors": [{"id": 156098, "fullname": "Yutong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156098?format=json", "institution": "Beijing Institute of Technology"}, {"id": 71241, "fullname": "Haiyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/71241?format=json", "institution": "BUAA"}, {"id": 87471, "fullname": "Tianfan Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/87471?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 86632, "fullname": "Yu Qiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86632?format=json", "institution": "Shanghai Aritifcal Intelligence Laboratory"}, {"id": 126959, "fullname": "Yaohui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126959?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 87489, "fullname": "Chang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87489?format=json", "institution": "University of Sydney"}, {"id": 126954, "fullname": "Xinyuan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126954?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}], "abstract": "The rapid development of generative models has significantly advanced image and video applications. Among these, video creation, aimed at generating videos under various conditions, has gained substantial attention. However, existing video creation models either focus solely on a few specific conditions or suffer from excessively long generation times due to complex model inference, making them impractical for real-world applications. To mitigate these issues, we propose an efficient unified video creation model, named VDOT. Concretely, we model the training process with the distribution matching distillation (DMD) paradigm. Instead of using the Kullback-Leibler (KL) minimization, we additionally employ a novel computational optimal transport (OT) technique to optimize the discrepancy between the real and fake score distributions. The OT distance inherently imposes geometric constraints, mitigating potential zero-forcing or gradient collapse issues that may arise during KL-based distillation within the few-step generation scenario, and thus, enhances the efficiency and stability of the distillation process. Further, we integrate a discriminator to enable the model to perceive real video data, thereby enhancing the quality of generated videos. To support training unified video creation models, we propose a fully automated pipeline for video data annotation and filtering that accommodates multiple video creation tasks. Meanwhile, we curate a unified testing benchmark, UVCBench, to advance the field. Experiments demonstrate that our 4-step VDOT outperforms or matches the performance of other baselines with 100 denoising steps.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37793", "url": null, "sourceid": 31643, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37799, "uid": "eb901025f1ff4d8010f655b42a33210d", "name": "EgoProx: Evaluating MLLMs on Egocentric 3D Proximity Reasoning Across a Cognitive Hierarchy", "authors": [{"id": 157065, "fullname": "Jinzhao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/157065?format=json", "institution": "Tsinghua University"}, {"id": 178688, "fullname": "Yinuo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/178688?format=json", "institution": "Tsinghua University"}, {"id": 188294, "fullname": "Dongxu Piao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188294?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 181092, "fullname": "Panwang Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181092?format=json", "institution": "ByteDance"}, {"id": 153032, "fullname": "Yifan Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153032?format=json", "institution": "PICO, ByteDance"}, {"id": 188295, "fullname": "Dong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188295?format=json", "institution": null}, {"id": 178438, "fullname": "Honglei Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/178438?format=json", "institution": "ByteDance Ltd"}, {"id": 188296, "fullname": "Liang Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/188296?format=json", "institution": null}, {"id": 179635, "fullname": "Shaofei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/179635?format=json", "institution": "Department of Computer Science, Swiss Federal Institute of Technology"}, {"id": 75772, "fullname": "Yixin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/75772?format=json", "institution": "BIGAI"}, {"id": 75767, "fullname": "Siyuan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75767?format=json", "institution": "Beijing Institute of General Artificial Intelligence"}, {"id": 181048, "fullname": "Miao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181048?format=json", "institution": "Tsinghua University"}], "abstract": "Humans constantly reason about 3D proximity, the relations between their body and surrounding objects, to guide perception and action in daily life. Whether multimodal large language models (MLLMs) can perform such embodied 3D reasoning remains unclear. To this end, we introduce EgoProx, a benchmark for egocentric 3D proximity reasoning. We organize our tasks along a cognitive chain, covering intention, exploration, exploitation, and chain-of-actions reasoning. We also design an agent based data engine that produces diverse and consistent QA pairs at scale. We benchmark prevailing MLLMs on EgoProx and conduct additional analyses with dataset specific and task specific instruction tuning. We observe large cross-domain gains, indicating that current MLLMs contain some spatial knowledge; however, they still struggle to effectively leverage it for spatial reasoning VQA.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37799", "url": null, "sourceid": 32151, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37805, "uid": "2f6abd819fde7defc29b69f7fb2d9fb6", "name": "Models as Lego Builders: Assembling Malice from Benign Blocks via Semantic Blueprints", "authors": [{"id": 188309, "fullname": "Chenxi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188309?format=json", "institution": "De Intelligence Technology Co, Ltd"}, {"id": 188310, "fullname": "Xianggan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188310?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 182659, "fullname": "Dake Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/182659?format=json", "institution": "Hainan Degong Artificial Intelligence Technology Co., Ltd."}, {"id": 188311, "fullname": "Yaosong Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/188311?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 188312, "fullname": "Zhibo Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188312?format=json", "institution": "University of Glasgow"}, {"id": 188313, "fullname": "Hao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188313?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 188314, "fullname": "Linyi Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188314?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 188315, "fullname": "Chengwei Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188315?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 188316, "fullname": "Jingzhe Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188316?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 188317, "fullname": "RanYi Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188317?format=json", "institution": "University of Glasgow"}, {"id": 188318, "fullname": "Peiling Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/188318?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 188319, "fullname": "Xiande Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188319?format=json", "institution": "De Artificial Intelligence Technology Co., Ltd; University of Electronic Science and Technology of China"}], "abstract": "Despite the rapid progress of Large Vision-Language Models (LVLMs), the integration of visual modalities introduces new safety vulnerabilities that adversaries can exploit to elicit biased or malicious outputs. In this paper, we demonstrate an underexplored vulnerability via semantic slot filling, where LVLMs complete missing slot values with unsafe content even when the slot types are deliberately crafted to appear benign. Building on this finding, we propose \\ours, a simple yet effective single-query jailbreak framework under black-box settings. \\ours decomposes a harmful query into a central topic and a set of benign-looking slot types, then embeds them as structured visual prompts (e.g., mind maps, tables, or sunburst diagrams) with small random perturbations. Paired with a completion-guided instruction, LVLMs automatically recompose the concealed semantics and generate unsafe outputs without triggering safety mechanisms. Although each slot appears benign in isolation (local benignness), \\ours exploits LVLMs\u2019 reasoning to assemble these slots into coherent harmful semantics. Extensive experiments on multiple models across two widely used benchmarks demonstrate the effectiveness of our proposed \\ours.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37805", "url": null, "sourceid": 42527, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40314, "uid": "03e3685ec5e4518558e64360a570cc34", "name": "Dual Band Video Thermography: Separating Time-Varying Reflection and Emission Near Ambient Conditions", "authors": [{"id": 131212, "fullname": "Sriram Narayanan", "url": "http://cvpr.thecvf.com/api/miniconf/users/131212?format=json", "institution": "Carnegie Mellon University"}, {"id": 131210, "fullname": "Mani Ramanagopal", "url": "http://cvpr.thecvf.com/api/miniconf/users/131210?format=json", "institution": "Carnegie Mellon University"}, {"id": 89406, "fullname": "Srinivasa G. Narasimhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/89406?format=json", "institution": "Carnegie Mellon University"}], "abstract": "Long-wave infrared radiation captured by a thermal camera includes (a) emission from an object governed by its temperature and emissivity, and (b) reflected radiation from the surrounding environment. Separating these components is a long-standing challenge in thermography. Even when using multiple bands, the problem is under-determined without priors on emissivity. This difficulty is amplified in near ambient conditions, where emitted and reflected signals are of comparable magnitude. We present a dual-band video thermography framework that reduces this ambiguity by combining two complementary ideas at a per-pixel level: (i) spectral cues (ratio of emissivity between bands is unknown but fixed), and (ii) temporal cues (object radiation changes smoothly while background radiation changes rapidly). We derive an image formation model and an algorithm to jointly estimate the object's emissivity at each band, and the time-varying object and background temperatures. Experiments with calibrated and uncalibrated emissivities in everyday scenes (e.g., coffee pot heating up, palm print on mirrors dissipating, reflections of moving people), demonstrate robust separation and recovery of temperature fields. We will release code and data upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40314", "url": null, "sourceid": -38951, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37803?format=json"], "related_events_ids": [37803]}, {"id": 37806, "uid": "1e012f8cfabbcde57804cdaafca52a3f", "name": "3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding", "authors": [{"id": 176940, "fullname": "Xiaoye Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176940?format=json", "institution": "University of Cambridge"}, {"id": 151909, "fullname": "Chen Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151909?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 95127, "fullname": "Xiangyu Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/95127?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 153562, "fullname": "Wei-Hong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/153562?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "This paper addresses the challenge of training a single network to jointly perform multiple dense prediction tasks, such as segmentation and depth estimation, i.e., multi-task learning (MTL). Current approaches mainly capture cross-task relations in the 2D image space, often leading to unstructured features lacking 3D-awareness. We argue that 3D-awareness is vital for modeling cross-task correlations essential for comprehensive scene understanding. We propose to address this problem by integrating correlations across views, i.e., cost volume, as geometric consistency in the MTL network. Specifically, we introduce a lightweight Cross-view Module (CvM), shared across tasks, to exchange information across views and capture cross-view correlations, integrated with a feature from MTL encoder for multi-task predictions. This module is architecture-agnostic and can be applied to both single and multi-view data. Extensive results on NYUv2 and PASCAL-Context demonstrate that our method effectively injects geometric consistency into existing MTL methods to improve performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37806", "url": null, "sourceid": 42817, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37808, "uid": "bca0870cfef8e4e835b6ac8cb8b5b9bb", "name": "OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text", "authors": [{"id": 136675, "fullname": "Weiguo Pian", "url": "http://cvpr.thecvf.com/api/miniconf/users/136675?format=json", "institution": "The University of Texas at Dallas"}, {"id": 136674, "fullname": "Saksham Singh Kushwaha", "url": "http://cvpr.thecvf.com/api/miniconf/users/136674?format=json", "institution": "University of Texas at Dallas"}, {"id": 188321, "fullname": "Zhimin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188321?format=json", "institution": null}, {"id": 136608, "fullname": "Shijian Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/136608?format=json", "institution": "The University of Texas at Dallas"}, {"id": 133684, "fullname": "Kai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/133684?format=json", "institution": "University of Toronto"}, {"id": 73934, "fullname": "Yunhui Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/73934?format=json", "institution": "The University of Texas at Dallas"}, {"id": 87904, "fullname": "Yapeng Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/87904?format=json", "institution": "University of Texas at Dallas"}], "abstract": "In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-conditioned audio generation models typically focus on producing on-screen environmental sounds that correspond to visible sounding events, neglecting off-screen auditory events. While recent holistic joint text-video-to-audio generation models aim to produce auditory scenes with both on- and off-screen sound but they are limited to non-speech sounds, lacking the ability to generate or integrate human speech. To overcome these limitations, we introduce OmniSonic, a flow-matching\u2013based diffusion framework jointly conditioned on video and text. It features a TriAttn-DiT architecture that performs three cross-attention operations to process on-screen environmental sound, off-screen environmental sound, and speech conditions simultaneously, with a Mixture-of-Experts (MoE) gating mechanism that adaptively balances their contributions during generation. Furthermore, we construct UniHAGen-Bench, a new benchmark with over one thousand samples covering three representative on/off-screen speech\u2013environment scenarios. Extensive experiments show that OmniSonic consistently outperforms state-of-the-art approaches on both objective metrics and human evaluations, establishing a strong baseline for universal and holistic audio generation. Our data, code, and pre-trained models will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37808", "url": null, "sourceid": 45551, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37809, "uid": "9ea68e97266389f3fe227b9c0e0084f3", "name": "BulletTime: Decoupled Control of Time and Camera Pose for Video Generation", "authors": [{"id": 188322, "fullname": "Yiming Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188322?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 133025, "fullname": "Qihang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/133025?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 71296, "fullname": "Shengqu Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/71296?format=json", "institution": "Stanford University"}, {"id": 76203, "fullname": "Tong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76203?format=json", "institution": "Stanford University"}, {"id": 180269, "fullname": "Jan Ackermann", "url": "http://cvpr.thecvf.com/api/miniconf/users/180269?format=json", "institution": "Google DeepMind"}, {"id": 91739, "fullname": "Zhengfei Kuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91739?format=json", "institution": "Stanford University"}, {"id": 155164, "fullname": "Yang Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/155164?format=json", "institution": "Stanford University"}, {"id": 152848, "fullname": "Frano Raji\u010d", "url": "http://cvpr.thecvf.com/api/miniconf/users/152848?format=json", "institution": "Department of Computer Science, ETHZ - ETH Zurich"}, {"id": 69176, "fullname": "Siyu Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/69176?format=json", "institution": "ETH Zurich"}, {"id": 85845, "fullname": "Gordon Wetzstein", "url": "http://cvpr.thecvf.com/api/miniconf/users/85845?format=json", "institution": "Stanford University"}], "abstract": "Emerging video diffusion models achieve high visual fidelity but fundamentally couple scene dynamics with camera motion, limiting their ability to provide precise spatial and temporal control. We introduce a 4D-controllable video diffusion framework that explicitly decouples scene dynamics from camera pose, enabling fine-grained manipulation of both scene dynamics and camera viewpoint. Our framework takes continuous world-time sequences and camera trajectories as conditioning inputs, injecting them into the video diffusion model through a 4D positional embedding in the attention layer and adaptive normalizations for feature modulation. To train this model, we curate a unique dataset in which temporal and camera variations are independently parameterized; this dataset will be made public. Experiments show that our model achieves robust real-world 4D control across diverse timing patterns and camera trajectories, while preserving high generation quality and outperforming prior work in controllability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37809", "url": null, "sourceid": 34025, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37811, "uid": "4a9a32256809fc17cd9e68e5f6feb3ed", "name": "RehearseVLA: Simulated Post-Training for VLAs with Physically-Consistent World Model", "authors": [{"id": 102024, "fullname": "Junjin Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/102024?format=json", "institution": "School of Computer Science and Engineering, Sun Yat-sen University"}, {"id": 135100, "fullname": "Yandan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/135100?format=json", "institution": "Beijing Institute for General Artificial Intelligence"}, {"id": 158334, "fullname": "Xinyuan Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158334?format=json", "institution": "Alibaba Group"}, {"id": 188328, "fullname": "Ronghan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188328?format=json", "institution": "Alibaba Group"}, {"id": 154901, "fullname": "Feng Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/154901?format=json", "institution": "Alibaba Group"}, {"id": 154906, "fullname": "Mu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154906?format=json", "institution": "Alibaba Group"}, {"id": 86314, "fullname": "Wei-Shi Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86314?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 87176, "fullname": "Qing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87176?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Vision-Language-Action (VLA) models trained via imitation learning suffer from significant performance degradation in data-scarce scenarios due to their reliance on large-scale demonstration datasets. Although reinforcement learning (RL)-based post-training has proven effective in addressing data scarcity, its application to VLA models is hindered by the non-resettable nature of real-world environments. This limitation is particularly critical in high-risk domains such as industrial automation, where interactions often induce state changes that are costly or infeasible to revert. Furthermore, existing VLA approaches lack a reliable mechanism for detecting task completion, leading to redundant actions that reduce overall task success rates. To address these challenges, we propose RehearseVLA, an RL-based post-training framework that replaces physical interaction with a low-cost world model-based virtual simulator. RehearseVLA consists of two key components: (1) a physically-consistent world simulator that generates temporally consistent future visual observations, and (2) a vision-language model (VLM)-guided instant reflector that provides continuous reward signals and predicts action termination. This simulated environment enables VLA models to safely explore and generalize beyond their initial imitation learning distribution. Our method achieves notable performance gains with as few as five expert demonstrations per task. Experiments on complex robotic manipulation tasks demonstrate that RehearseVLA effectively overcomes the data inefficiency, safety constraints, and inefficient execution of conventional VLA models that rely on real-world interaction, offering a practical and scalable solution for post-training in resource-constrained settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37811", "url": null, "sourceid": 38881, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37813, "uid": "fc37c234a9b820027243ea2255da90b7", "name": "Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning", "authors": [{"id": 172104, "fullname": "Ruoran Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/172104?format=json", "institution": "Xi&#x27;an Jiaotong-Liverpool University"}, {"id": 188332, "fullname": "Haoyu Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188332?format=json", "institution": "Xi&#x27;an Jiaotong-Liverpool University"}, {"id": 188333, "fullname": "Bin Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/188333?format=json", "institution": "Ricoh Software Research Center Beijing Co., Ltd."}, {"id": 154499, "fullname": "Qiufeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154499?format=json", "institution": "Xi&#x27;an Jiaotong-Liverpool University"}], "abstract": "Geometric problem solving, as a typical multimodal reasoning problem, has attracted much attention and made great progress recently, however most of works focus on plane geometry while usually fail in solid geometry due to 3D spatial diagrams and complex reasoning. To bridge this gap, we introduce Hilbert-Geo, the first unified formal language framework for solid geometry, including an extensive predicate library and a dedicated theorem bank. Based on this framework, we propose a Parse2Reason method containing two steps of first parsing then reasoning. In the parsing step, we utilize conditional description language (CDL), a formalized language composed of predicates specifically designed to construct geometric conditions, to represent both problem description (natural text) and solid diagrams (visual image).In the reasoning step, we leverage those formal CDL and the theorem bank to perform relational inference and algebraic computation, generating strictly correct, verifiable, and human-readable reasoning processes.Notably, our proposed Hilbert-Geo is also applicable to plane geometry.To advance geometric reasoning, we curate two expert-annotated dataset SolidFGeo2k and PlaneFGeo3k, which are furnished with geometric formal language annotations, solutions and answers. Extensive experiments show that our proposed method achieves the state-of-the-art (SOTA) performance 77.3\\% in SolidFGeo2k and 84.1\\% in MathVerse-Solid (one small subset in MathVerse dedicated to solid geometry), substantially outperforming leading MLLMs, such as Gemini-2.5-pro (54.2\\% on SolidFGeo2k) and GPT-5 (62.9\\% on MathVerse-Solid). In addition, our method achieves the SOTA accuracy 80.2\\% in PlaneFGeo3k, demonstrating the generality of the Hilbert-Geo in geometric reasoning. Our code and datasets will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37813", "url": null, "sourceid": 43246, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37814, "uid": "d40bdee0fd721835b35fd121261946f9", "name": "Open-Vocabulary Domain Generalization in Urban-Scene Segmentation", "authors": [{"id": 84913, "fullname": "Dong Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84913?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 188334, "fullname": "Qi Zang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188334?format=json", "institution": "Hefei University of Technology"}, {"id": 180552, "fullname": "Nan Pu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180552?format=json", "institution": "Hefei University of Technology"}, {"id": 130948, "fullname": "Wenjing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/130948?format=json", "institution": "University of Nottingham"}, {"id": 75973, "fullname": "Nicu Sebe", "url": "http://cvpr.thecvf.com/api/miniconf/users/75973?format=json", "institution": "University of Trento"}, {"id": 77038, "fullname": "Zhun Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/77038?format=json", "institution": "University of Nottingham"}], "abstract": "Domain Generalization in Semantic Segmentation (DG-SS) aims to enable segmentation models to perform robustly in unseen environments. However, conventional DG-SS methods are restricted to a fixed set of known categories, limiting their applicability in open-world scenarios. Recent progress in Vision-Language Models (VLMs) has advanced Open-Vocabulary Semantic Segmentation (OV-SS) by enabling models to recognize a broader range of concepts. Yet, these models remain sensitive to domain shifts and struggle to maintain robustness when deployed in unseen environments, a challenge that is particularly severe in urban-driving scenarios. To bridge this gap, we introduce Open-Vocabulary Domain Generalization in Semantic Segmentation (OVDG-SS), a new setting that jointly addresses unseen domains and unseen categories. We introduce the first benchmark for OVDG-SS in autonomous driving, addressing a previously unexplored problem and covering both synthetic-to-real and real-to-real generalization across diverse unseen domains and unseen categories. In OVDG-SS, we observe that domain shifts often distort text\u2013image correlations in pre-trained VLMs, which hinders the performance of OV-SS models. To tackle this challenge, we propose S$^2$-Corr, a state-space-driven text\u2013image correlation refinement mechanism that can mitigate domain-induced distortions and produce a more  consistent text\u2013image correlation under distribution changes. Extensive experiments on our constructed benchmark demonstrate that the proposed method achieves superior cross-domain performance and efficiency compared to existing OV-SS approaches.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37814", "url": null, "sourceid": 42830, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37815, "uid": "c723579a45101c08f6e4a185e0f2384b", "name": "BDNet:Bio-Inspired dual-backbone Small Object Detection Network", "authors": [{"id": 180978, "fullname": "Wenchao Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/180978?format=json", "institution": "Guangxi University of Science and Technology"}, {"id": 184023, "fullname": "Chuan Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184023?format=json", "institution": "Guangxi University of Science and Technology"}, {"id": 188335, "fullname": "Sihan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188335?format=json", "institution": "Guangxi University of Science and Technology"}, {"id": 188336, "fullname": "Xiongzhen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188336?format=json", "institution": "Guangxi University of Science and Technology"}, {"id": 188337, "fullname": "Xintao Pang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188337?format=json", "institution": "Macao Polytechnic University"}], "abstract": "In remote sensing images, small objects often suffer from low color contrast and blurred edges, resulting in suboptimal feature extraction performance. Physiological studies indicate that the LGN/V1\u2013V2\u2013V4 pathway offers color opponency sensitivity and hierarchical enhancement advantages for the extraction of color information, while the V1\u2013V4 pathway shows strong orientation selectivity in edge information extraction. The integration of these two types of information in the V4 region significantly improves target discrimination.Inspired by this, this paper proposes a dual-backbone network (BDNet) to enhance small object feature extraction. BDNet adopts a dual-backbone parallel structure to capture fine-grained features from color and edge dimensions: the color extraction backbone simulates the color antagonistic mechanism in the LGN/V1 region by designing a Color Antagonism Module (CAM) to amplify color differences, and further mimics the chromatic processing hierarchy in the V2 region with a Visual Cortex Hue-enhancement Module (VCHM) to enrich hue representations. These two components work collaboratively to address the issue of low color contrast.The edge extraction backbone simulates the orientation selectivity of receptive fields in the V1 region by designing an Orientation Selective Module (OrSM) to select and enhance salient edges, thereby mitigating the issue of edge blurring caused by dispersed edge information. Finally, the two types of extracted features are interactively integrated through a Feature Fusion Module (FFM) that emulates the integration mechanism in the V4 region, generating a comprehensive feature representation.Experiments demonstrate that BDNet outperforms state-of-the-art (SOTA) methods on the VisDrone2019, NWPU VHR-10, and AI-TODv2 datasets, thus providing a bio-inspired solution for small object detection in remote sensing images.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37815", "url": null, "sourceid": 39913, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37822, "uid": "6b15a5dc44e3676a5cf62e5ccf0d542e", "name": "CogniVerse: Revolutionizing Multi-modal Retrieval-Augmented Generation with Cognitive Reflection and Geometric Reasoning", "authors": [{"id": 76223, "fullname": "Xiang Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76223?format=json", "institution": "Nanyang Technological University"}, {"id": 188346, "fullname": "Wanlong Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188346?format=json", "institution": "Interdisciplinary Graduate Programme (AI-X) , Nanyang Technological University"}, {"id": 169665, "fullname": "Changshuo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/169665?format=json", "institution": "Nanyang Technological University"}], "abstract": "Multi-modal Retrieval-Augmented Generation (MMRAG) has emerged as a powerful paradigm for enhancing Multimodal Large Language Models (MLLMs) in knowledge-intensive question answering by integrating external visual, textual, and structural knowledge. However, existing MMRAG frameworks suffer from critical limitations, including noisy and irrelevant retrieval, cross-modal semantic misalignment, lack of adaptive reasoning, and incoherent generation across local and global contexts. We introduce \\textbf{CogniVerse}, a novel MMRAG framework that addresses these challenges through a cognitive-inspired, mathematically rigorous approach. Drawing from human-like reasoning, CogniVerse integrates three synergistic components: (1) a Cognitive Reflection Module (CRM) that dynamically assesses retrieval necessity and filters relevant multi-modal content, reducing noise and computational overhead; (2) a Multi-modal Retrieval Module that aligns embeddings in a Riemannian manifold using information geometry and refines knowledge graphs via spectral graph theory, ensuring precise and coherent retrieval; and (3) a Hierarchical Generation Module that employs an optimal transport-based loss to balance token-level accuracy and global semantic coherence. Grounded in advanced theoretical frameworks, including convergence guarantees for geometric alignment and spectral optimization, CogniVerse achieves robust cross-modal integration and adaptive knowledge utilization. Extensive experiments on benchmark multi-modal question answering datasets demonstrate that CogniVerse significantly outperforms state-of-the-art MMRAG systems in both accuracy and coherence, while reducing retrieval latency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37822", "url": null, "sourceid": 40466, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37816, "uid": "7af61e5f524cea0e7dab3a8dfc4c00d6", "name": "DSERT-RoLL: Robust Multi-Modal Perception for Diverse Driving Conditions with Stereo Event-RGB-Thermal Cameras, 4D Radar, and Dual-LiDAR", "authors": [{"id": 76908, "fullname": "Hoonhee Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/76908?format=json", "institution": "KAIST"}, {"id": 152691, "fullname": "Jae-Young Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152691?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 129337, "fullname": "Yuhwan Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/129337?format=json", "institution": "KAIST"}, {"id": 188338, "fullname": "Yunseo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188338?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 176159, "fullname": "Wonyoung Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/176159?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology (KAIST)"}, {"id": 152692, "fullname": "Youngho Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/152692?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 76867, "fullname": "Kuk-Jin Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/76867?format=json", "institution": "KAIST"}], "abstract": "In this paper, we present DSERT-RoLL, a driving dataset that incorporates stereo event, RGB, and thermal cameras together with 4D radar and dual LiDAR, collected across diverse weather and illumination conditions. The dataset provides precise 2D and 3D bounding boxes with track IDs and ego vehicle odometry, enabling fair comparisons within and across sensor combinations. It is designed to alleviate data scarcity for novel sensors such as event cameras and 4D radar and to support systematic studies of their behavior. We establish unified 3D and 2D benchmarks that enable direct comparison of characteristics and strengths across sensor families and within each family. We report baselines for representative single modality and multimodal methods and provide protocols that encourage research on different fusion strategies and sensor combinations. In addition, we propose a fusion framework that integrates sensor specific cues into a unified feature space and improves 3D detection robustness under varied weather and lighting. We will make our code and dataset publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37816", "url": null, "sourceid": 40987, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37818, "uid": "c1c18ff65e38f52cedda5a47acc77320", "name": "Learning to Control Physically-simulated 3D Characters via Generating and Mimicking 2D Motions", "authors": [{"id": 175290, "fullname": "Jianan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/175290?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 107390, "fullname": "Xiao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/107390?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 188341, "fullname": "Tao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188341?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 148810, "fullname": "Tien-Tsin Wong", "url": "http://cvpr.thecvf.com/api/miniconf/users/148810?format=json", "institution": "Monash University"}], "abstract": "Video data is more cost-effective than motion capture data for learning 3D character motion controllers, yet synthesizing realistic and diverse behaviors directly from videos remains challenging. Previous approaches typically rely on off-the-shelf motion reconstruction techniques to obtain 3D trajectories for physics-based imitation. These reconstruction methods struggle with generalizability, as they either require  3D training data (potentially scarce) or fail to produce physically plausible poses, hindering their application to challenging scenarios like human-object interaction (HOI) or non-human characters. We tackle this challenge by introducing *Mimic2DM*, a novel motion imitation framework that learns the control policy directly and solely from widely available 2D keypoint trajectories extracted from videos. By minimizing the reprojection error, we train a general single-view 2D motion tracking policy capable of following arbitrary 2D reference motions in physics simulation, using only 2D motion data. The policy, when trained on diverse 2D motions captured from different or slightly different viewpoints, can further acquire 3D motion tracking capabilities by aggregating multiple views. Moreover, we develop a transformer-based autoregressive 2D motion generator and integrate it into a hierarchical control framework, where the generator produces high-quality 2D reference trajectories to guide the tracking policy. We show that the proposed approach is versatile and can effectively learn to synthesize physically plausible and diverse motions across a range of domains, including dancing, soccer dribbling, and animal movements, without any reliance on explicit 3D motion data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37818", "url": null, "sourceid": 36186, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37820, "uid": "4a2ad15a73d498efa82cc2893a52d08e", "name": "LazyVAR: Accelerating Visual Autoregressive Models via Scale-wise Token Pruning and Parallel Group Decoding", "authors": [{"id": 181513, "fullname": "Rongge Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/181513?format=json", "institution": "\u4e2d\u56fd\u79d1\u5b66\u6280\u672f\u5927\u5b66\u82cf\u5dde\u9ad8\u7b49\u7814\u7a76\u9662"}, {"id": 188344, "fullname": "Chengqi Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/188344?format=json", "institution": "University of Science and Technology of China"}, {"id": 130254, "fullname": "S Kevin Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/130254?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Visual Autoregressive (VAR) modeling introduces a new paradigm for image generation by extending autoregressive mechanisms from next-token prediction to next-scale prediction, achieving remarkable performance. However, as the number of tokens increases rapidly with scale, processing full token maps at high resolution becomes computationally expensive. In addition, the inherently sequential nature of autoregressive modeling prevents parallel inference across scales, which further increases latency.To address these challenges, we propose LazyVAR, a training-free and plug-and-play acceleration method for VAR models. Our key observation is that the similarity of aggregated latent features between adjacent scales progressively increases with the scale index, reaching particularly higher values at larger scales. We treat this similarity as a Scale-Wise Update Index, which serves as the pruning criterion. Consequently, more tokens can be pruned at larger scales to improve efficiency. Furthermore, we propose Parallel Group Decoding, which leverages this high similarity at larger scales to decode tokens from different scales in parallel, further accelerating inference.Experimental results show that the proposed LazyVAR achieves up to a 2.94\u00d7 speedup over FlashAttention-accelerated VAR models with negligible performance loss, allowing the Infinity-2B text-to-image model to generate 1024\u00d71024 resolution images within 0.5 seconds on a single RTX 4090 GPU. Our code will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37820", "url": null, "sourceid": 39987, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37821, "uid": "c184064068f67936a71e38a4e6a9e78e", "name": "Chain-of-Thought Guided Multi-Modal Object Re-Identification", "authors": [{"id": 162907, "fullname": "gao ya", "url": "http://cvpr.thecvf.com/api/miniconf/users/162907?format=json", "institution": "Anhui University"}, {"id": 181116, "fullname": "Shihao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181116?format=json", "institution": "Anhui University"}, {"id": 188345, "fullname": "ZhaoJun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188345?format=json", "institution": "Anhui University"}, {"id": 128149, "fullname": "AIHUA ZHENG", "url": "http://cvpr.thecvf.com/api/miniconf/users/128149?format=json", "institution": "Anhui University"}, {"id": 183391, "fullname": "Chenglong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183391?format=json", "institution": "Anhui University"}, {"id": 126842, "fullname": "Jin Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126842?format=json", "institution": "Anhui University"}], "abstract": "With the rise of visual-language models, multi-modal ReID retrieves specific targets by integrating different spectra and textual descriptions. Existing methods merely adopt descriptive representation learning for image-text, ignoring the relationships among the intrinsic logical hierarchies of semantic features. Since Chain-of-Thought (CoT) can provide textual logical context and enhance semantic perception in large-model reasoning, we propose CoT-ReID, a CoT-guided framework that injects the Multi-modal Large Language Models (MLLMs) reasoning into multi-modal ReID. Specifically, we simulate the joint visual-textual logical decision-making of human reasoning, leveraging CoT textual logical reasoning to guide visual feature learning at the early, late, and decision-making level: At the early level, we embed the semantic reversion of CoT hierarchical reasoning  into visual features to calibrate bottom-level features and emphasize visual hierarchical reasoning. Next, we take CoT hierarchical reasoning text as an anchor condition to constrain the consistency of visual cross-modal semantics. Finally, through the hierarchical reasoning process of CoT, we embed logically reasoned text attribute features into multi-modal decision-making, providing logical support for selecting discriminative identity features. By constructing CoT textual benchmarks and our proposed modules, our framework generates more robust multi-modal features in complex scenarios. Comprehensive experiments on four datasets (RGBNT100, MSVR310, WMVeID863, RGBNT201) demonstrate that our method outperforms existing approaches. Code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37821", "url": null, "sourceid": 31586, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37817, "uid": "1af762c872080b066c4cd5ec1663ba91", "name": "Parallelised Differentiable Straightest Geodesics for 3D Meshes", "authors": [{"id": 188339, "fullname": "Hippolyte Verninas", "url": "http://cvpr.thecvf.com/api/miniconf/users/188339?format=json", "institution": "INRIA"}, {"id": 188340, "fullname": "Caner Korkmaz", "url": "http://cvpr.thecvf.com/api/miniconf/users/188340?format=json", "institution": "Imperial College London"}, {"id": 86877, "fullname": "Stefanos Zafeiriou", "url": "http://cvpr.thecvf.com/api/miniconf/users/86877?format=json", "institution": "Imperial College London"}, {"id": 73518, "fullname": "Tolga Birdal", "url": "http://cvpr.thecvf.com/api/miniconf/users/73518?format=json", "institution": "Imperial College London"}, {"id": 182061, "fullname": "Simone Foti", "url": "http://cvpr.thecvf.com/api/miniconf/users/182061?format=json", "institution": "Imperial College London"}], "abstract": "Machine learning has been progressively generalised to operate within non-Euclidean domains, but geometrically accurate methods for learning on surfaces are still falling behind. The lack of closed-form Riemannian operators, the non-differentiability of their discrete counterparts, and poor parallelisation capabilities have been the main obstacles to the development of the field on meshes. A principled framework to compute the exponential map on Riemannian surfaces discretised as meshes is straightest geodesics, which also allows to trace geodesics and parallel-transport vectors as a by-product. We provide a parallel GPU implementation and derive two different methods for differentiating through the straightest geodesics, one leveraging an extrinsic proxy function and one based upon a geodesic finite differences scheme. After proving our parallelisation performance and accuracy, we demonstrate how our differentiable exponential map can supercharge geometrically-correct learning and optimisation pipelines. In particular, to showcase the versatility of our method, we propose a new geodesic convolutional layer, a new flow matching method for learning on meshes, and a second-order optimiser that we apply to centroidal Voroni tesselation. Our code, pre-trained models, and pip-installable library will be made available upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37817", "url": null, "sourceid": 32370, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37823, "uid": "ee617a1f3bbdbbbc0996f12fca2a3701", "name": "Enhancing Video VLM with Visual-Audio Supersensing", "authors": [{"id": 135921, "fullname": "Xu Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/135921?format=json", "institution": "University of Illinois Urbana-Champaign"}], "abstract": "Current video vision language models (VLMs) process information passively, lacking the ability to dynamically plan their analysis or perform joint reasoning across crucial modalities such as video and audio. To address this, we introduce Visual-Audio Supersensing (VAS), a learning paradigm that shifts the focus from temporal predictive sensing (e.g., Cambrian-S) to cross-modal prediction. The core objective of VAS is to train the model to anticipate audio-caption summarizations from video and vice versa. We present VA-R1, a VLM that operationalizes this paradigm. Instead of passively ingesting all data, VA-R1 actively reasons about its information needs using Chain-of-Thought (CoT). Our training process is twofold: we first finetune VA-R1 with VAS, and then apply a novel contrastive Reinforcement Learning (RL) algorithm, Video-Audio Negative-aware Optimization (VANAO), to optimize this selective co-reasoning process. This approach proves highly effective: despite their significantly smaller size, our VA-R1-7B and VA-R1-8B models achieve competitive performance to massive MLLMs like GPT-4o and Gemini 1.5 Pro on multiple video VQA benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37823", "url": null, "sourceid": 33577, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37825, "uid": "ed2778e8d8cc95fc48592e71a3341841", "name": "PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion", "authors": [{"id": 98566, "fullname": "Phuc Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/98566?format=json", "institution": "VinAI Research"}, {"id": 76220, "fullname": "Phong Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76220?format=json", "institution": "VinAI"}, {"id": 76979, "fullname": "Anh Tran", "url": "http://cvpr.thecvf.com/api/miniconf/users/76979?format=json", "institution": "VinAI Research"}], "abstract": "Pre-trained diffusion models excel at generating high-quality images but remain inherently limited by their native training resolution. Recent training-free approaches have attempted to overcome this constraint by introducing interventions during the denoising process; however, these methods incur substantial computational overhead, often requiring more than five minutes to produce a single 4K image.In this paper, we present PixelRush, the first tuning-free framework for practical high-resolution text-to-image generation. Our method builds upon the established patch-based inference paradigm but eliminates the need for multiple inversion and regeneration cycles. Instead, PixelRush enables efficient patch-based denoising within a low-step regime. To address artifacts introduced by patch blending in few-step generation, we propose a seamless blending strategy. Furthermore, we mitigate over-smoothing effects through a noise injection mechanism. PixelRush delivers exceptional efficiency, generating 4K images in approximately 20 seconds representing a $10\\times$ to $35\\times$ speedup over state-of-the-art methods while maintaining superior visual fidelity. Extensive experiments validate both the performance gains and the quality of outputs achieved by our approach.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37825", "url": null, "sourceid": 46498, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37826, "uid": "4e2250c35e90c65b9ead3841c20fcb28", "name": "PosterOmni: Generalized Artistic Poster Creation via Task Distillation and Unified Reward Feedback", "authors": [{"id": 152539, "fullname": "Sixiang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/152539?format=json", "institution": "HKUST(GZ)"}, {"id": 151772, "fullname": "Jianyu LAI", "url": "http://cvpr.thecvf.com/api/miniconf/users/151772?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 159498, "fullname": "Jialin Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/159498?format=json", "institution": "Meituan"}, {"id": 187140, "fullname": "Hengyu Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/187140?format=json", "institution": "Meituan"}, {"id": 187141, "fullname": "Zhongying Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187141?format=json", "institution": "Meituan; Fudan University"}, {"id": 152540, "fullname": "Tian Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/152540?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 154365, "fullname": "Junfeng Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/154365?format=json", "institution": "Meituan"}, {"id": 84905, "fullname": "Xiaoming Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/84905?format=json", "institution": "Meituan"}, {"id": 187144, "fullname": "Lei Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187144?format=json", "institution": "The Hong Kong University of Science and Technology"}], "abstract": "Image-to-poster generation is a high-demand task requiring not only local adjustments but also high-level design understanding. Models must generate text, layout, style, and visual elements while preserving semantic fidelity and aesthetic coherence. The process spans two regimes: local editing, where ID-driven generation, rescaling, filling, and extending must preserve concrete visual entities; and global creation, where layout- and style-driven tasks rely on understanding abstract design concepts.  These intertwined demands make image-to-poster a multi-dimensional process coupling entity-preserving editing with concept-driven creation under image\u2013prompt control.To address these challenges, we propose PosterOmni, a generalized artistic poster creation framework that unlocks the potential of a base edit model for multi-task image-to-poster generation. PosterOmni integrates the two regimes, namely local editing and global creation, within a single system through an efficient data\u2013distillation\u2013reward pipeline, which includes:(i) constructing multi-scenario image-to-poster datasets covering six task types across entity-based and concept-based creation;(ii) distilling knowledge between local and global experts for supervised fine-tuning; and(iii) applying unified PosterOmni Reward Feedback to jointly align visual entity-preserving and aesthetic preference across all tasks.Additionally, we establish PosterOmni-Bench, a unified benchmark for evaluating both local editing and global creation. Extensive experiments show that PosterOmni significantly enhances reference adherence, global composition quality, and aesthetic harmony, outperforming all open-source baselines and even surpassing several proprietary systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37826", "url": null, "sourceid": 38768, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37828, "uid": "8d2cbe9d22a199026ddcbae5f0f67ef0", "name": "Long-Tail Internet Photo Reconstruction", "authors": [{"id": 128791, "fullname": "Yuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/128791?format=json", "institution": "Zhejiang University"}, {"id": 70327, "fullname": "Yuanbo Xiangli", "url": "http://cvpr.thecvf.com/api/miniconf/users/70327?format=json", "institution": "Cornell University"}, {"id": 156128, "fullname": "Hadar Averbuch-Elor", "url": "http://cvpr.thecvf.com/api/miniconf/users/156128?format=json", "institution": "Department of Computer Science, Cornell University"}, {"id": 85450, "fullname": "Noah Snavely", "url": "http://cvpr.thecvf.com/api/miniconf/users/85450?format=json", "institution": "Google / Cornell"}, {"id": 90264, "fullname": "Ruojin Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/90264?format=json", "institution": "Cornell University"}], "abstract": "Internet photo collections exhibit an extremely long-tailed distribution: a few famous landmarks are densely photographed and easily reconstructed, while most real-world sites contain only sparse, noisy, and uneven imagery that defeats classical and learned 3D methods. Existing 3D foundation models generalize well to curated datasets but collapse under the sparsity, ambiguity, and irregularity of Internet photos. We believe that tackling this long-tail regime represents one of the next frontiers for 3D foundation models. Although reliable supervision from sparse scenes is challenging to acquire, we observe that it can be effectively simulated by sampling sparse subsets from well-reconstructed Internet landmarks. To this end, we introduce MegaDepth-X, a large-scale, clean, and depth-refined dataset, together with a sparse-aware sampling strategy that mimic camera distributions in long-tail scenes. Finetuning 3D foundation models with these components yields robust reconstructions under extreme sparsity, demonstrating emergent symmetry disambiguation while preserving generalization to standard 3D benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37828", "url": null, "sourceid": 32887, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37829, "uid": "699e1538d09a77a2242eb43a19d175b0", "name": "Breaking the Regional Perception Bottleneck of Multimodal Large Language Models via External Reasoning Framework", "authors": [{"id": 153592, "fullname": "Jinrong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153592?format=json", "institution": "Dalian University of Technology"}, {"id": 188350, "fullname": "Zhaoyang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188350?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 188351, "fullname": "Xusheng He", "url": "http://cvpr.thecvf.com/api/miniconf/users/188351?format=json", "institution": "Harbin Institute of Technology"}, {"id": 159580, "fullname": "Xinrui Xinrui", "url": "http://cvpr.thecvf.com/api/miniconf/users/159580?format=json", "institution": "Harbin Institute of Technology (Shenzhen)"}, {"id": 158761, "fullname": "Na Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/158761?format=json", "institution": "National University of Singapore"}, {"id": 84825, "fullname": "Jianlong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84825?format=json", "institution": "Harbin Institute of Technology (Shenzhen)"}], "abstract": "High-quality pixel-level responses remain a major bottleneck for multimodal large language models (MLLMs) in regional perception. Existing approaches generally attach regression decoders to MLLM features, achieving strong grounding performance but compromising end-to-end design and increasing training costs. Researchers have applied parameter and data scaling to improve pure MLLMs\u2019 ability to generate pixel coordinates in natural language, yet the performance gains on grounding tasks remain markedly weaker than those in standard QA tasks. Our analysis shows the primary bottleneck is that conventional scaling fails to effectively enhance the key reasoning stage required for pixel-level regional perception. To address this, we propose R-Ground, a reasoning framework for MLLM-based grounding built upon a multimodal Monte Carlo Tree Search algorithm. R-Ground leverages structured reasoning actions, multimodal feature alignment scoring, and regional feature weighted voting to perform scaling at the designated reasoning stage. Extensive experiments demonstrate that R-Ground achieves effective reasoning scaling, enabling a 7B MLLM to match or even surpass a 72B model on the grounding task. The code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37829", "url": null, "sourceid": 42694, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37830, "uid": "6642c3aa4d190ba07f671b296f8d12f0", "name": "4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video", "authors": [{"id": 155311, "fullname": "Jin Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155311?format=json", "institution": "Southern University of Science and Technology"}, {"id": 155317, "fullname": "Liang An", "url": "http://cvpr.thecvf.com/api/miniconf/users/155317?format=json", "institution": "Dept. of Automation, Tsinghua University"}, {"id": 188352, "fullname": "Pujin Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188352?format=json", "institution": "University of Hong Kong"}, {"id": 75944, "fullname": "Yebin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75944?format=json", "institution": "Tsinghua University"}, {"id": 155316, "fullname": "Xiaoying Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155316?format=json", "institution": "Southern University of Science and Technology"}], "abstract": "4D reconstruction of equine family (e.g. horses) from monocular video is important for animal welfare. Previous mainstream 4D animal reconstruction methods require joint optimization of motion and appearance over a whole video, which is time-consuming and sensitive to incomplete observation. In this work, we propose a novel framework called 4DEquine by disentangling the 4D reconstruction problem into two sub-problems: dynamic motion reconstruction and static appearance reconstruction. For motion, we introduce a simple yet effective spatio-temporal transformer with a post-optimization stage to regress smooth and pixel-aligned pose and shape sequences from video. For appearance, we design a novel feed-forward network that reconstructs a high-fidelity, animatable 3D Gaussian avatar from as few as a single image. To assist training, we create a large-scale synthetic motion dataset, VarenPoser, which features high-quality surface motions and diverse camera trajectories, as well as a synthetic appearance dataset, VarenTex, comprising realistic multi-view images generated through multi-view diffusion. While training only on synthetic datasets, 4DEquine achieves state-of-the-art performance on real-world APT36K and AiM datasets, demonstrating the superiority of 4DEquine and our new datasets for both geometry and appearance reconstruction. Comprehensive ablation studies validate the effectiveness of both the motion and appearance reconstruction network. Code and data will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37830", "url": null, "sourceid": 38319, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37834, "uid": "da9278fb741a9aba56bd8e184933871d", "name": "Improved Mean Flows: On the Challenges of Fastforward Generative Models", "authors": [{"id": 140665, "fullname": "ZHENGYANG GENG", "url": "http://cvpr.thecvf.com/api/miniconf/users/140665?format=json", "institution": "CMU"}, {"id": 180383, "fullname": "Yiyang Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180383?format=json", "institution": "Tsinghua University"}, {"id": 156829, "fullname": "Zongze Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156829?format=json", "institution": "Adobe Research"}, {"id": 75717, "fullname": "Eli Shechtman", "url": "http://cvpr.thecvf.com/api/miniconf/users/75717?format=json", "institution": "Adobe Research, US"}, {"id": 84657, "fullname": "Zico Kolter", "url": "http://cvpr.thecvf.com/api/miniconf/users/84657?format=json", "institution": "Carnegie Mellon University"}, {"id": 150920, "fullname": "Kaiming He", "url": "http://cvpr.thecvf.com/api/miniconf/users/150920?format=json", "institution": "Massachusetts Institute of Technology"}], "abstract": "MeanFlow provides a principled framework for fastforward generative modeling. However, the original MeanFlow has key limitations in both the training objective and the guidance. First, the original MeanFlow prediction depends not only on the noisy state but also explicitly on the noise and data, causing the training target to drift with the network. We reformulate it as velocity prediction, predicting the instantaneous velocity solely from the noisy state and reducing it to the regression problem. Second, on the guidance side, the original MeanFlow fixes the guidance scale during training by directly learning a guided field, achieving 1-NFE sampling but losing the flexibility to adjust the guidance at inference. Instead, we condition the model on guidance scale and train it on a range of guidance scales, enabling flexible guidance as diffusion/flow models in inference while preserving one-step sampling. On ImageNet 256$\\times$256, our improved MeanFlow (iMF) achieves a 1-step FID of 2.74 with a model of 118M parameters, and our largest model further pushes the 1-step FID to 1.72, establishing a new state of the art for one-step generative modeling.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37834", "url": null, "sourceid": 36416, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37836, "uid": "069041db774c7c0baa53b1779ae1e29b", "name": "PersonaLive! Expressive Portrait Image Animation for Live Streaming", "authors": [{"id": 188370, "fullname": "Zhiyuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188370?format=json", "institution": "University of Macau"}, {"id": 87613, "fullname": "Chi-Man Pun", "url": "http://cvpr.thecvf.com/api/miniconf/users/87613?format=json", "institution": "University of Macau"}, {"id": 88294, "fullname": "Chen Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88294?format=json", "institution": "Tencent PCG"}, {"id": 88155, "fullname": "Jue Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88155?format=json", "institution": "Tencent AI Lab"}, {"id": 85466, "fullname": "Xiaodong Cun", "url": "http://cvpr.thecvf.com/api/miniconf/users/85466?format=json", "institution": "Tencent AI Lab"}], "abstract": "Current diffusion-based portrait animation models predominantly focus on enhancing visual quality and expression realism, while overlooking generation latency and real-time performance, which restricts their application range in the live streaming scenario. We propose PersonaLive, a novel diffusion-based framework towards streaming real-time portrait animation with multi-stage training recipes. Specifically, we first adopt hybrid implicit signals, namely implicit facial representations and 3D implicit keypoints, to achieve expressive image-level motion control. Then, a fewer-step appearance distillation strategy is proposed to eliminate appearance redundancy in the denoising process, greatly improving inference efficiency. Finally, we introduce an autoregressive micro-chunk streaming generation paradigm equipped with a sliding training strategy and a historical keyframe mechanism to enable low-latency and stable long-term video generation. Extensive experiments demonstrate that PersonaLive achieves state-of-the-art performance with up to **7-22**$\\times$ speedup over prior diffusion-based portrait animation models. The code will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37836", "url": null, "sourceid": 38725, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37838, "uid": "76cad25a83564610f23e26a317dd1570", "name": "CLIP-like Model as a Foundational Density Ratio Estimator", "authors": [{"id": 182428, "fullname": "Fumiya Uchiyama", "url": "http://cvpr.thecvf.com/api/miniconf/users/182428?format=json", "institution": "The University of Tokyo"}, {"id": 140608, "fullname": "Rintaro Yanagi", "url": "http://cvpr.thecvf.com/api/miniconf/users/140608?format=json", "institution": "AIST, National Institute of Advanced Industrial Science and Technology"}, {"id": 188375, "fullname": "Shohei Taniguchi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188375?format=json", "institution": "The University of Tokyo"}, {"id": 188376, "fullname": "Shota Takashiro", "url": "http://cvpr.thecvf.com/api/miniconf/users/188376?format=json", "institution": "The University of Tokyo"}, {"id": 188377, "fullname": "Masahiro Suzuki", "url": "http://cvpr.thecvf.com/api/miniconf/users/188377?format=json", "institution": "The University of Tokyo, Tokyo Institute of Technology"}, {"id": 87986, "fullname": "Hirokatsu Kataoka", "url": "http://cvpr.thecvf.com/api/miniconf/users/87986?format=json", "institution": "National Institute of Advanced Industrial Science and Technology (AIST)"}, {"id": 152470, "fullname": "Yusuke Iwasawa", "url": "http://cvpr.thecvf.com/api/miniconf/users/152470?format=json", "institution": "The University of Tokyo, The University of Tokyo"}, {"id": 152471, "fullname": "Yutaka Matsuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/152471?format=json", "institution": "The University of Tokyo"}], "abstract": "Density ratio estimation is a core concept in statistical machine learning because it provides a unified mechanism for tasks such as importance weighting, divergence estimation, and likelihood-free inference, but its potential in vision and language models has not been fully explored. Modern vision-language encoders such as CLIP and SigLIP are trained with contrastive objectives that implicitly optimize log density ratios between joint and marginal image\u2013text distributions, which implicitly learn similarity scores proportional to log density ratios. However, prior work has largely focused on their embedding utility, and the density-ratio structure induced by contrastive learning has not been systematically examined or exploited in multimodal applications.To address this gap, we reinterpret CLIP-style models as pretrained and general-purpose density ratio estimators and show that this perspective enables new algorithmic capabilities.We present a unified explanation of how contrastive objectives estimate density ratios and propose two practical applications: Importance Weight Learning and KL divergence estimationOur Importance Weight Learning method requires only a single additional prompt and improves F1 scores by up to 7 points.We further show that CLIP-based density ratios support estimation of KL divergences that quantify how conditioning on an image or text alters the distribution of the other modality. Through qualitative examples and an N-gram analysis of captions, we find that these divergences capture semantic diversity and mode structure in multimodal data. Leveraging this property, we introduce a simple KL-guided data curation method that achieves performance competitive with LAION2B filtering.Our code will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37838", "url": null, "sourceid": 45016, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37842, "uid": "659d33c1b83122ee316c7943d8b091eb", "name": "SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling", "authors": [{"id": 188395, "fullname": "Camile Lendering", "url": "http://cvpr.thecvf.com/api/miniconf/users/188395?format=json", "institution": "Eindhoven University of Technology"}, {"id": 73822, "fullname": "Erkut Akdag", "url": "http://cvpr.thecvf.com/api/miniconf/users/73822?format=json", "institution": "Eindhoven University of Technology"}, {"id": 156805, "fullname": "Egor Bondarev", "url": "http://cvpr.thecvf.com/api/miniconf/users/156805?format=json", "institution": "Eindhoven University of Technology"}], "abstract": "Detecting visual anomalies in industrial inspection often requires training with only a few normal images per category. Recent few-shot methods achieve strong results employing foundation-model features, but typically rely on memory banks, auxiliary datasets, or multi-modal tuning of vision-language models. We therefore question whether such complexity is necessary given the feature representations of vision foundation models. To answer this question, we introduce SubspaceAD, a training-free method, that operates in two simple stages. First, patch-level features are extracted from a small set of normal images by a frozen DINOv2 backbone. Second, a Principal Component Analysis (PCA) model is fit to these features to estimate the low-dimensional subspace of normal variations. At inference, anomalies are detected via the reconstruction residual with respect to this subspace, producing interpretable and statistically grounded anomaly scores. Despite its simplicity, SubspaceAD achieves state-of-the-art performance across one-shot and few-shot settings without training, prompt tuning, or memory banks. In the one-shot anomaly detection setting, SubspaceAD achieves image-level and pixel-level AUROC of 98.0\\% and 97.6\\% on the MVTec-AD dataset, and 93.3\\% and 98.3\\% on the VisA dataset, respectively, surpassing prior state-of-the-art results.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37842", "url": null, "sourceid": 43915, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37844, "uid": "49262d75ddb927676a9e17e9ad932089", "name": "Semi-supervised Echocardiography Video Segmentation via Anchor Semantic Awareness and Continuous Pseudo-label Reforging", "authors": [{"id": 181589, "fullname": "Yunpeng Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181589?format=json", "institution": "Shenzhen University"}, {"id": 188397, "fullname": "Yimu Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/188397?format=json", "institution": "Shenzhen University"}, {"id": 188398, "fullname": "Jingxing Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188398?format=json", "institution": "Shenzhen University"}, {"id": 127066, "fullname": "Huisi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127066?format=json", "institution": "Shenzhen University"}, {"id": 127069, "fullname": "Jing Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/127069?format=json", "institution": "Hong Kong Polytechnic University"}], "abstract": "Automatic and accurate echocardiography video segmentation is essential for efficient and repeatable measurements of key clinical functional indicators for  diagnosis of cardiovascular diseases. However, it is an extremely challenging task to obtain high-quality segmentation results throughout the cardiac cycle owing to (1) the inherent speckle noise in echocardiography videos, (2) the complex dynamic motions of cardiac structures, and (3) the scarcity of annotated data. To comprehensively address these challenges, we propose a novel semi-supervised model, which can achieve accurate and real-time echocardiography video segmentation with very limited annotations. The proposed model has two core innovative technologies. First, we propose a new anchor semantic awareness (ASA) module composed of an anchor recalibration (ARC) scheme and a temporal semantic fusion (TSF) algorithm. The former refines ambiguous feature regions by aligning them with learnable anchors, and the latter propagates structural semantic prototypes across frames to enhance boundary delineation and temporal consistency. Second, based on ASA, we developed a continuous pseudo-label reforging (CPR) module to gradually integrates high-quality pseudo-label through lightweight channel-wise attention, and reforge pseudo labels to provide more robust supervision.We extensively evaluated our method on two benchmarking datasets: CAMUS and EchoNet-Dynamic; experimental results show that our model outperforms SOTAs in segmentation accuracy while maintaining real-time performance. Codes will be publicly available upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37844", "url": null, "sourceid": 46140, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37845, "uid": "4672a4597e331b354e95d558b9a63287", "name": "PS-SR: Pseudo-Single-Step Video Super-Resolution via Speculative Diffusion", "authors": [{"id": 181998, "fullname": "Aiqiu Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181998?format=json", "institution": "University of Science and Technology of China"}, {"id": 85198, "fullname": "Zhaofan Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85198?format=json", "institution": "University of Science and Technology of China"}, {"id": 85029, "fullname": "Ting Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/85029?format=json", "institution": "JD AI Research"}, {"id": 85027, "fullname": "Tao Mei", "url": "http://cvpr.thecvf.com/api/miniconf/users/85027?format=json", "institution": "JD Explore Academy"}], "abstract": "Video Super-Resolution (VSR) fundamentally struggles with a critical trade-off: single-step models offer unmatched efficiency but often lack the high-frequency detail, creativity, and visual quality of their multi-step diffusion counterparts, which are computationally prohibitive for practical use. In this paper, we propose PS-SR, a novel \"pseudo\" single-step VSR framework that transcends this trade-off through a computationally asymmetric sampling pipeline. The key to PS-SR lies in its speculative diffusion mechanism: a powerful base model performs only a single, comprehensive sampling step, establishing the global structure and content fidelity, after which a lightweight draft model, directly augmented by the base model's features, speculatively performs subsequent refinements. Crucially, we further enforce a frequency-domain update rule that constrains these refinements to exclusively inject high-frequency details, preserving the foundational low-frequency content and preventing semantic drift across sampling steps. By doing so, PS-SR creates the \"illusion\" of a single-step model\u2014delivering the similar inference speeds and input-output content consistency\u2014while achieving the visual richness and creativity typically reserved for costly multi-step generative models. We demonstrate that our \"pseudo-single-step\" paradigm achieves state-of-the-art quality with a comparable speed to single-step models, paving the way for real-time, high-fidelity video enhancement.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37845", "url": null, "sourceid": 40215, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37846, "uid": "c1210dd1376c754b78be3d9709965276", "name": "DynFusion: Rethinking Condition Fusion for Adaptive Multi-condition Text-to-Image Generation", "authors": [{"id": 174232, "fullname": "Zheng Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174232?format=json", "institution": "Department of Computer Science, University of Warwick"}, {"id": 188399, "fullname": "Lichuan Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188399?format=json", "institution": "Apple; University of Warwick"}, {"id": 188400, "fullname": "Xu Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/188400?format=json", "institution": "National University of Singapore; SUN YAT-SEN UNIVERSITY"}, {"id": 77364, "fullname": "Bing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77364?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 85913, "fullname": "Bo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85913?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 130189, "fullname": "Hongkai Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/130189?format=json", "institution": "University of Warwick"}], "abstract": "Text-to-image diffusion models have achieved remarkable progress, generating visually realistic and semantically coherent images from textual prompts. However, natural language alone lacks the precision required for design-centric applications that demand strict spatial and structural fidelity\u2014particularly when representing complex concepts that integrate multi-level information, such as product or scene design.To address this limitation, controllable diffusion frameworks introduce auxiliary conditions (e.g., depth, edge, or reference images) to guide the generative process. Models like ControlNet and IP-Adapter effectively inject such priors, improving structural or appearance alignment.Yet, real-world design tasks rarely depend on a single type of condition. They often require simultaneous integration of multiple heterogeneous cues\u2014for instance, preserving spatial layout from depth maps, structural outlines from edge maps, and stylistic attributes from reference images. Current approaches either handle only one condition or naively stack multiple ones, resulting in computational inefficiency and conflicting guidance that degrade generation quality.This multi-condition inconsistency forms a critical bottleneck for applying diffusion models to real-world design workflows, motivating our proposed framework. We propose a data-driven adaptive condition fusion mechanism for multi-conditional diffusion. Our method introduces a novel condition adaptation module that dynamically selects and fuses subsets of conditions based on the diffusion timestep, task characteristics, and feature injection position. This adaptive strategy harmonizes diverse structural and appearance priors, achieving controllable yet flexible generation in complex design scenarios. Experiments demonstrate significant improvements in fidelity, consistency, and controllability across multi-condition tasks, establishing a new direction for practical, detail-preserving diffusion-based design generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37846", "url": null, "sourceid": 33284, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37848, "uid": "8e0f5a9efbf3ad58606380abe0decf95", "name": "Learn to Learn Weight Generation via Local Consistency Diffusion", "authors": [{"id": 188402, "fullname": "Yunchuan Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188402?format=json", "institution": null}, {"id": 188403, "fullname": "Yu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188403?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 188404, "fullname": "Ke Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/188404?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 188405, "fullname": "Zhiqi Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188405?format=json", "institution": "Nanyang Technological University"}, {"id": 93440, "fullname": "Jenq-Neng Hwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/93440?format=json", "institution": "University of Washington, Seattle"}, {"id": 188406, "fullname": "Lei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188406?format=json", "institution": null}], "abstract": "generation. However, existing solutions are limited by two challenges: generalizability and missing local supervision targets. The first challenge stems from the inherent lack of cross-task transferability in existing single-level optimization methods, which limits model performance on new tasks. The latter challenge lies in existing research modeling only global optimal weights, neglecting the supervision signals in local target weights. Furthermore, naively assigning local target weights leads to inconsistency between local and global objectives. To address these issues, we propose Mc-Di, which integrates the diffusion algorithm with meta-learning for better generalizability. Additionally, we extend the vanilla diffusion into a local consistency diffusion algorithm. Our theoretical analysis and experimental results demonstrate that the model can learn from local targets while preserving consistency with the global optimum. We validate Mc-Di's superior accuracy and inference efficiency on tasks that require frequent weight updates, including transfer learning, few-shot learning, domain generalization, and language model fine-tuning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37848", "url": null, "sourceid": 43189, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37849, "uid": "6591c2bacb2f38c5a1da173a284be004", "name": "PEARL: Geometry Aligns Semantics for Training-Free Open-Vocabulary Semantic Segmentation", "authors": [{"id": 181698, "fullname": "Gensheng Pei", "url": "http://cvpr.thecvf.com/api/miniconf/users/181698?format=json", "institution": "Sungkyunkwan University"}, {"id": 188407, "fullname": "Xiruo Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188407?format=json", "institution": null}, {"id": 129602, "fullname": "Xinhao Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/129602?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 107127, "fullname": "Tao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/107127?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 129605, "fullname": "Yazhou Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/129605?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 188408, "fullname": "Byeungwoo Jeon", "url": "http://cvpr.thecvf.com/api/miniconf/users/188408?format=json", "institution": "Sungkyunkwan University"}], "abstract": "Training-free open-vocabulary semantic segmentation (OVSS) promises rapid adaptation to new label sets without retraining. Yet, many methods rely on heavy post-processing or handle text and vision in isolation, leaving cross-modal geometry underutilized. Others introduce auxiliary vision backbones or multi-model pipelines, which increase complexity and latency while compromising design simplicity.We present PEARL, \\textbf{\\underline{P}}rocrust\\textbf{\\underline{e}}s \\textbf{\\underline{a}}lignment with text-awa\\textbf{\\underline{r}}e \\textbf{\\underline{L}}aplacian propagation, a compact two-step inference that follows an align-then-propagate principle. The Procrustes alignment step performs an orthogonal projection inside the last self-attention block, rotating keys toward the query subspace via a stable polar iteration. The text-aware Laplacian propagation then refines per-pixel logits on a small grid through a confidence-weighted, text-guided graph solve: text provides both a data-trust signal and neighbor gating, while image gradients preserve boundaries. In this work, our method is fully training-free, plug-and-play, and uses only fixed constants, adding minimal latency with a small per-head projection and a few conjugate-gradient steps. Our approach, PEARL, sets a new state-of-the-art in training-free OVSS without extra data or auxiliary backbones across standard benchmarks, achieving superior performance under both with-background and without-background protocols.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37849", "url": null, "sourceid": 35935, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37850, "uid": "559fdd4cb3c9fdc4a3fdb0940fe3bb64", "name": "PG-VTON: Single-Pass Training-Free Virtual Try-On via Patch-Guided Reference Alignment", "authors": [{"id": 182192, "fullname": "Guohao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/182192?format=json", "institution": "Peking University"}, {"id": 87023, "fullname": "Yuxin Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87023?format=json", "institution": "Peking University"}], "abstract": "Virtual try-on (VTON) aims to render a target garment onto a person while preserving pose, identity, and fine-grained appearance. Most existing methods rely on supervised paired data, limiting cross-domain generalization, while recent training-free approaches, though more robust, require multiple diffusion calls and complex compositing, making deployment impractical. We propose PG-VTON, a single-pass, training-free framework based on Patch-Guided Reference Alignment. Our key insight is that modern inpainting diffusion models already possess strong in-context completion: given a masked person and a small garment patch, they can synthesize plausible, pose-consistent clothing without task-specific training. PG-VTON exploits this capability with two lightweight components: Patch-Anchored Identity Priming (PIP) injects a localized garment patch only in early denoising steps to anchor garment identity, and Reference-Aware Attention (RAA) strengthens attention from masked-region tokens to garment tokens to enhance detail transfer, all without modifying model weights. With a single diffusion pass, PG-VTON achieves state-of-the-art performance among training-free methods on DressCode and VITON-HD and generalizes effectively to subject insertion.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37850", "url": null, "sourceid": 43891, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37853, "uid": "d9d0f4bc089eb93b509195c4e285cf60", "name": "STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction", "authors": [{"id": 175937, "fullname": "Runze Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/175937?format=json", "institution": "University of Science and Technology of China"}, {"id": 176067, "fullname": "Yuxuan Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/176067?format=json", "institution": "University of Science and Technology of China"}, {"id": 188410, "fullname": "Youcheng Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/188410?format=json", "institution": "University of Science and Technology of China"}, {"id": 85084, "fullname": "Ligang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85084?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Online 3D reconstruction from streaming inputs requires both long-term temporal consistency and efficient memory usage. While causal VGGT transformers address this challenge through key-value (KV) cache mechanism, the linear growth of the cache introduces a significant memory bottleneck. When memory constraints trigger early eviction, reconstruction quality and temporal consistency deteriorate markedly. In this work, we observe that attention patterns in causal transformers for 3D reconstruction exhibit intrinsic spatio-temporal sparsity. Leveraging this insight, we propose **STAC**, a *S*patio-**T**emporally **A**ware **C**ache compression framework specifically designed for streaming 3D reconstruction using large causal transformers. STAC incorporates three key components: a **Working Temporal Token Caching** mechanism that preserves long-term informative tokens based on decayed cumulative attention scores; a **Long-term Spatial Token Caching** scheme that consolidates spatially redundant tokens into voxel-aligned representations for memory-efficient storage; and a **Chunk-based Multi-frame Optimization** strategy that jointly optimizes consecutive frames to enhance temporal coherence and leverage GPU parallelism. Extensive experiments demonstrate that **STAC** achieves state-of-the-art reconstruction quality while reducing memory consumption by 8.5$\\times$ and accelerating inference by a factor of 3.5$\\times$, enabling scalable and real-time 3D reconstruction in streaming settings. The code will be made publicly available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37853", "url": null, "sourceid": 39381, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37852, "uid": "20ded8fd4bb9c3d915be78cf00c35082", "name": "VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM", "authors": [{"id": 180182, "fullname": "Anh Thuan Tran", "url": "http://cvpr.thecvf.com/api/miniconf/users/180182?format=json", "institution": "George Mason University"}, {"id": 164431, "fullname": "Jana Kosecka", "url": "http://cvpr.thecvf.com/api/miniconf/users/164431?format=json", "institution": "George Mason University"}], "abstract": "Simultaneous Localization and Mapping (SLAM) with 3D Gaussian Splatting (3DGS) enables fast, differentiable rendering and high-fidelity reconstruction across diverse real-world scenes. However, existing 3DGS-SLAM approaches handle measurement reliability implicitly, making pose estimation and global alignment susceptible to drift in low-texture regions, transparent surfaces, or areas with complex reflectance properties. To this end, we introduce VarSplat, an uncertainty-aware 3DGS-SLAM system that explicitly learns per-splat appearance variance. By using the law of total variance with alpha compositing, we then compute corresponding differentiable per-pixel uncertainty map. This variance map guides tracking, submap registration, and loop detection toward focusing on reliable regions and contributes to more stable optimization. Experimental results on Replica (synthetic) and TUM-RGBD, ScanNet, and ScanNet++ (real-world) show that VarSplat improves robustness and achieves competitive or superior tracking, mapping, and novel view synthesis rendering compared to existing studies for dense RGB-D SLAM.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37852", "url": "https://anhthuan1999.github.io/varsplat/", "sourceid": 45130, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37855, "uid": "5751693536add9cb4b813590b0fedbf9", "name": "DABO: Difficulty-Aware Bayesian Optimization with Diffusion-Learned Priors", "authors": [{"id": 181167, "fullname": "Mengyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181167?format=json", "institution": "Tianjin Normal University"}, {"id": 188419, "fullname": "Pinlong Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188419?format=json", "institution": "Hangzhou Dianzi University"}], "abstract": "The efficiency of hyperparameter optimization (HPO) is critical for deep learning, yet state-of-the-art methods share a fundamental flaw: they are difficulty-agnostic, treating all hyperparameter configurations homogeneously. This approach leads to inefficient resource allocation, wasting budget in simple regions while under-exploring complex, rugged landscapes, and thereby critically undermining both search efficiency and final performance. To address this universal challenge, we introduce DABO, a framework that pioneers difficulty-aware tuning within the efficient context of Freeze-Thaw Bayesian Optimization. We first model optimization difficulty hierarchically. Then, departing from hand-crafted priors, we train a conditional diffusion model on 120,000 real learning curves, generating synthetic data with 2.3$\\times$ higher fidelity. This data trains our difficulty-aware surrogate model and acquisition function to dynamically adapt the search strategy. Across 75 tasks, DABO reduces regret by 11-18\\% compared to the leading difficulty-agnostic method, ifBO. Our work establishes a new paradigm for HPO, shifting the focus from configuration-centric to difficulty-aware resource allocation to enable more robust and efficient optimization.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37855", "url": null, "sourceid": 35163, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37856, "uid": "92d96da583b3bf0ca7d61ab3b3aba04b", "name": "VIRST: Video-Instructed Reasoning assistant for SpatioTemporal Segmentation", "authors": [{"id": 181615, "fullname": "Jihwan Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/181615?format=json", "institution": "Seoul National University"}, {"id": 87756, "fullname": "Jaeyoung Do", "url": "http://cvpr.thecvf.com/api/miniconf/users/87756?format=json", "institution": "Department of Computer Science, University of Wisconsin - Madison"}], "abstract": "Referring Video Object Segmentation (RVOS) aims to segment target objects in videos based on natural language descriptions. However, CLIP based and keyframe based approaches that couple a vision language model with a separate propagation module often fail to capture rapidly changing spatiotemporal dynamics and to handle queries that require multi step reasoning, which leads to sharp performance drops on motion intensive and reasoning oriented videos beyond static RVOS benchmarks. To address these limitations, we propose VIRST (Video-Instructed Reasoning assistant for Spatio-Temporal Segmentation), an end-to-end framework that unifies global video reasoning and pixel level mask prediction within a single model. VIRST bridges semantic and segmentation representations through the Spatio-Temporal Fusion (STF), which fuses segmentation aware video features into the vision language backbone, and employs the Temporal Dynamic Anchor Updater (TDAU) to maintain dynamically updated anchor frames that provide stable temporal cues under large motion, occlusion, and reappearance. This unified design achieves state-of-the-art results across diverse RVOS benchmarks under realistic and challenging conditions, demonstrating strong generalization to both referring and reasoning oriented settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37856", "url": null, "sourceid": 36493, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37857, "uid": "b496ec6fe0e41e4901e4a8345656a8c1", "name": "ConsistCompose: Unified Multimodal Layout Control for Image Composition", "authors": [{"id": 172845, "fullname": "Xuanke Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/172845?format=json", "institution": null}, {"id": 188420, "fullname": "Boxuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188420?format=json", "institution": "Sensetime"}, {"id": 181569, "fullname": "Xiaoyang Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/181569?format=json", "institution": "SenseTime"}, {"id": 128967, "fullname": "Zhongang Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/128967?format=json", "institution": "Nanyang Technological University"}, {"id": 89773, "fullname": "Lei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89773?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 185515, "fullname": "Quan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185515?format=json", "institution": "SenseTime Group Limited, SenseTime Group Limited"}, {"id": 84911, "fullname": "Dahua Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/84911?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "Unified multimodal models that couple visual understanding with image generation have advanced rapidly, yet most systems still focus on visual grounding\u2014aligning language with image regions\u2014while their generative counterpart, linguistic-embedded layout-grounded generation(LELG) for layout-controllable multi-instance generation, remains underexplored and limits precise compositional control. We present ConsistCompose, a unified multimodal framework that embeds layout coordinates directly into language prompts, enabling layout-controlled multi-instance image generation from Interleaved Image-Text within a single generative interface. We further construct ConsistCompose3M, a 3.4M multi-instance generation dataset  with layout and identity annotations (2.6M text-guided and 0.8M image-guided data pairs) that provides large-scale supervision for layout-conditioned generation. Within this framework, LELG is instantiated through instance\u2013coordinate binding prompts and coordinate-aware classifier-free guidance, which translate linguistic layout cues into precise spatial control without task-specific branches. Experiments on COCO-Position and MS-Bench show that ConsistCompose substantially improves spatial accuracy over layout-controlled baselines while preserving identity fidelity and competitive general multimodal understanding, establishing a unified paradigm for layout-controllable multimodal image generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37857", "url": null, "sourceid": 37192, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37858, "uid": "7eab3cc8b79a0665f796eea7c14b2d90", "name": "FHAvatar: Fast and High-Fidelity Reconstruction of Face-and-Hair Composable 3D Head Avatar from Few Casual Captures", "authors": [{"id": 172273, "fullname": "Yujie Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/172273?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 175366, "fullname": "Zhuoqiang CAI", "url": "http://cvpr.thecvf.com/api/miniconf/users/175366?format=json", "institution": "SJTU"}, {"id": 128373, "fullname": "Chaoyue Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128373?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 151972, "fullname": "Jianchuan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/151972?format=json", "institution": "Alibaba Group"}, {"id": 158099, "fullname": "Zhiwen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/158099?format=json", "institution": "Alibaba Group"}, {"id": 91226, "fullname": "Chengfei Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/91226?format=json", "institution": "Zhejiang University"}, {"id": 149695, "fullname": "Fan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/149695?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "We present FHAvatar, a novel framework for reconstructing 3D Gaussian avatars with composable face and hair components from an arbitrary number of views. Unlike previous approaches that couple facial and hair representations within a unified modeling process, we explicitly decouples two components in texture space by representing the face with planar Gaussians and the hair with strand-based Gaussians. To overcome the limitations of existing methods that rely on dense multi-view captures or costly per-identity optimization, we propose an aggregated transformer backbone to learn geometry-aware cross-view priors and head-hair structural coherence from multi-view datasets, enabling effective and efficient feature extraction and fusion from few casual captures. Extensive quantitative and qualitative experiments demonstrate that FHAvatar achieves state-of-the-art reconstruction quality from only a few observations of new identities within minutes, while supporting real-time animation, convenient hairstyle transfer, and stylized editing, broadening the accessibility and applicability of digital avatar creation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37858", "url": null, "sourceid": 36680, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37859, "uid": "2842ad89a82393cd5b7f62fd3bb7afe9", "name": "CrossHOI: Learning Cross-View Representations for Monocular 3D Human-Object Interaction Reconstruction", "authors": [{"id": 157799, "fullname": "Pei Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/157799?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 157800, "fullname": "Shanshan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157800?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 86573, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86573?format=json", "institution": "Nankai University"}], "abstract": "Reconstructing 3D human-object interaction (HOI) from monocular images is highly challenging especially when human and object are mutually occluded. Existing methods primarily rely on single-view inputs, which fundamentally limit their ability to recover occluded regions and accurately estimate contact areas. To address these challenges, we for the first time, consider to introduce novel-view feature priors to enhance monocular 3D HOI reconstruction. We first design a cross-view generator that learns to infer novel-view image features from a single-view input, enriching spatial geometry at the feature level without requiring extra inputs during inference. Guided by both real and generated view features, a spatial cross-view feature fusion module adaptively aggregates complementary cues to enhance the initial reconstruction of human and object meshes. Built upon this reconstruction, we sample 3D vertex features from both views and introduce a bidirectional cross-view Transformer to integrate multi-view vertex representations for accurate contact estimation. Finally, the predicted contact maps are leveraged to refine human-object meshes, yielding geometrically consistent and physically plausible reconstructions.Experiments on BEHAVE and InterCap show that our proposed CrossHOI surpasses state-of-the-art methods in both reconstruction accuracy and contact prediction, especially under severe occlusions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37859", "url": null, "sourceid": 40959, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37863, "uid": "9fce355be08993f60c48e35a69b300ce", "name": "Lightmover: Towards Precise and Efficient Control for Light Movement", "authors": [{"id": 164920, "fullname": "Gengze Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/164920?format=json", "institution": "The University of Adelaide"}, {"id": 150606, "fullname": "Tianyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150606?format=json", "institution": "Adobe Research"}, {"id": 86366, "fullname": "Soo Ye Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/86366?format=json", "institution": "Adobe Research"}, {"id": 137329, "fullname": "ZHIXIN SHU", "url": "http://cvpr.thecvf.com/api/miniconf/users/137329?format=json", "institution": "Adobe Systems"}, {"id": 90670, "fullname": "Xin Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90670?format=json", "institution": "The University of Hong Kong"}, {"id": 85036, "fullname": "Yannick Hold-Geoffroy", "url": "http://cvpr.thecvf.com/api/miniconf/users/85036?format=json", "institution": "Adobe Research"}, {"id": 156652, "fullname": "Sumit Chaturvedi", "url": "http://cvpr.thecvf.com/api/miniconf/users/156652?format=json", "institution": "Department of Computer Science, Yale University"}, {"id": 75891, "fullname": "Qi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75891?format=json", "institution": "University of Adelaide"}, {"id": 85199, "fullname": "Zhe Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/85199?format=json", "institution": "Adobe Research"}, {"id": 86364, "fullname": "Scott Cohen", "url": "http://cvpr.thecvf.com/api/miniconf/users/86364?format=json", "institution": "Adobe Systems"}], "abstract": "We present LightMover, a framework for controllable light manipulation in single images that leverages video diffusion priors to produce physically plausible illumination changes without re-rendering the scene. We formulate light editing as a sequence-to-sequence prediction problem in visual token space: given an image and light-control tokens, the model adjusts light position, color, and intensity together with resulting reflections, shadows, and falloff from a single view. This unified treatment of spatial (movement) and appearance (color, intensity) controls improves both manipulation and illumination understanding. We further introduce an adaptive token-pruning mechanism that preserves spatially informative tokens while compactly encoding non-spatial attributes, reducing control sequence length by 41\\% while maintaining editing fidelity. For training our framework, we construct a scalable rendering pipeline that can generate large numbers of image pairs across varied light positions, colors, and intensities while keeping the scene content consistent to the original image. \\ours enables precise, independent control over light position, color, and intensity, and achieves high PSNR and strong semantic consistency (DINO, CLIP) across different tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37863", "url": null, "sourceid": 31409, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37866, "uid": "357a689f054e8c8b3ea30d0492c52932", "name": "SATTC: Structure-Aware Label-Free Test-Time Calibration for Cross-Subject EEG-to-Image Retrieval", "authors": [{"id": 180543, "fullname": "Qunjie Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180543?format=json", "institution": "School of Information Science and Engineering, Yunnan University"}, {"id": 188441, "fullname": "Weina Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188441?format=json", "institution": "Yunnan University"}], "abstract": "Cross-subject EEG-to-image retrieval for visual decoding is hampered by subject shift and hubness in the embedding space, which distort similarity geometry and destabilize top-k rankings, making small candidate shortlists unreliable. We introduce SATTC (Structure-Aware Test-Time Calibration), a label-free test-time calibration head that operates directly on the similarity matrix of frozen EEG\u2013image encoders. SATTC combines a geometric expert\u2014subject-adaptive whitening of EEG embeddings with an adaptive variant of Cross-domain Similarity Local Scaling (CSLS)\u2014and a structural expert that leverages mutual nearest neighbors, bidirectional top-k ranks, and class popularity, fused via a simple Product-of-Experts rule. On the THINGS-EEG cross-subject benchmark with a strict leave-one-subject-out protocol, standardizing inference with cosine similarities, \u21132-normalized embeddings, and candidate whitening already yields a strong baseline that improves Top-1 and Top-5 accuracy over the original ATM retrieval setup. Adding SATTC on top of this standardized inference further improves Top-1 and Top-5 accuracy and substantially reduces hubness, yielding more reliable small-k shortlists across multiple EEG encoders and establishing SATTC as a generic test-time calibration head for zero-shot neural decoding from EEG.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37866", "url": null, "sourceid": 41090, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37868, "uid": "acc4ba4eec87758e9b2ca94782c41bb9", "name": "Multi-Prototype Compactness and Boundary-Aware Synthesis for Unsupervised Anomaly Detection", "authors": [{"id": 175972, "fullname": "Liao Kailun", "url": "http://cvpr.thecvf.com/api/miniconf/users/175972?format=json", "institution": "wuhan university"}, {"id": 149758, "fullname": "Jianfeng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/149758?format=json", "institution": "Wuhan University"}, {"id": 188443, "fullname": "Tao Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188443?format=json", "institution": "Wuhan University"}, {"id": 188444, "fullname": "Wu Wenfei", "url": "http://cvpr.thecvf.com/api/miniconf/users/188444?format=json", "institution": "Wuhan University"}, {"id": 188445, "fullname": "Jiaming Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188445?format=json", "institution": "Wuhan University"}, {"id": 72467, "fullname": "Jinsheng Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/72467?format=json", "institution": "Wuhan University"}], "abstract": "Unsupervised Anomaly Detection (UAD) is crucial for industrial quality control.Many existing embedding-based methods rely on a single-prototype assumption, learning, for instance, a compact hypersphere to enclose all normal features. However, this strategy often fails when confronted with significant intra-class variance caused by factors like illumination, pose, and texture. To accommodate all diverse normal samples, the decision boundary of a single prototype must become overly-general and loose, inevitably causing the model to miss subtle anomalies. To overcome this limitation, we propose PGBL (Prototype-Guided Boundary Learning), a framework that synergizes structured representation learning with targeted anomaly synthesis. First, we introduce the Multi-Prototype Compact Learning (MPCL) module, which explicitly models the complex normal feature distribution as a mixture of multiple semantic prototypes. This allows the model to learn tighter, local representations for each normal sub-pattern instead of a single loose, global boundary. Second, inspired by synthesis methods, we design the Boundary Pseudo-Anomaly Synthesis (BPAS) module. Unlike previous \"blind\" synthesis strategies, BPAS is a novel targeted strategy that first identifies feature points on the boundaries of the MPCL-defined clusters and then generates high-difficulty pseudo-anomalies only in these critical regions. Finally, a Discriminative Boundary Refiner (DBR) learns to shape the final decision surface by distinguishing between the compact normal clusters and the synthesized boundary anomalies. Extensive experiments demonstrate that PGBL achieves superior anomaly detection performance, significantly outperforming competitors.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37868", "url": null, "sourceid": 34317, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37870, "uid": "164ec3053b616b5034ece5db18f4faf0", "name": "ChimeraLoRA: Multi-Head LoRA-Guided Synthetic Datasets", "authors": [{"id": 183115, "fullname": "Hoyoung Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/183115?format=json", "institution": "KAIST"}, {"id": 188447, "fullname": "Minwoo Jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188447?format=json", "institution": "Pohang University of Science and Technology"}, {"id": 188448, "fullname": "Jabin Koo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188448?format=json", "institution": "Pohang University of Science and Technology"}, {"id": 129088, "fullname": "Sangdoo Yun", "url": "http://cvpr.thecvf.com/api/miniconf/users/129088?format=json", "institution": "NAVER"}, {"id": 128871, "fullname": "Jungseul Ok", "url": "http://cvpr.thecvf.com/api/miniconf/users/128871?format=json", "institution": "POSTECH"}], "abstract": "Beyond general recognition tasks, specialized domains including privacy-constrained medical applications and fine-grained settings often encounter data scarcity, especially for tail classes. To obtain less biased and more reliable models under such scarcity, practitioners leverage diffusion models to supplement underrepresented regions of real data. Specifically, recent studies fine-tune pretrained diffusion models with LoRA on few-shot real sets to synthesize additional images. While an image-wise LoRA trained on a single image captures fine-grained details yet offers limited diversity, a class-wise LoRA trained over all shots produces diverse images as it encodes class priors yet tends to overlook fine details. To combine both benefits, we separate the adapter into a class-shared LoRA $A$ for class priors and per-image LoRAs $\\mathcal{B}$ for image-specific characteristics. To expose coherent class semantics in the shared LoRA $A$, we propose a semantic boosting by preserving class bounding boxes during training. For generation, we compose $A$ with a mixture of $\\mathcal{B}$ using coefficients drawn from a Dirichlet distribution. Across diverse datasets, our synthesized images are both diverse and detail-rich while closely aligning with the few-shot real distribution, yielding robust gains in downstream classification accuracy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37870", "url": null, "sourceid": 39319, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37871, "uid": "99f4a5fef7b45624924eb900758be690", "name": "ShiftLUT: Spatial Shift Enhanced Look-Up Tables for Efficient Image Restoration", "authors": [{"id": 180375, "fullname": "ZENG XIAOLONG", "url": "http://cvpr.thecvf.com/api/miniconf/users/180375?format=json", "institution": "Tsinghua University"}, {"id": 188449, "fullname": "Yitong Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188449?format=json", "institution": null}, {"id": 188450, "fullname": "Shiyao Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/188450?format=json", "institution": null}, {"id": 127480, "fullname": "Jinhua Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/127480?format=json", "institution": "Kuaishou Tech"}, {"id": 89506, "fullname": "Ming Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/89506?format=json", "institution": "Kuaishou Tech"}, {"id": 127516, "fullname": "Chao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/127516?format=json", "institution": "kuaishou"}, {"id": 127348, "fullname": "Bin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127348?format=json", "institution": "Tsinghua University"}], "abstract": "Look-Up Table based methods have emerged as a promising direction for efficient image restoration tasks. Recent LUT-based methods focus on improving their performance by expanding the receptive field. However, they inevitably introduce extra computational and storage overhead, which hinders their deployment in edge devices.To address this issue, we propose ShiftLUT, a novel framework that attains the largest receptive field among all LUT-based methods while maintaining high efficiency. Our key insight lies in three complementary components. First, Learnable Spatial Shift module (LSS) is introduced to expand the receptive field by applying learnable, channel-wise spatial offsets on feature maps. Second, we propose an asymmetric dual-branch architecture that allocates more computation to the information-dense branch, substantially reducing inference latency without compromising restoration quality.Finally, we incorporate a feature-level LUT compression strategy called Error-bounded Adaptive Sampling (EAS) to minimize the storage overhead.Compared to the previous state-of-the-art method TinyLUT, ShiftLUT achieves a 3.8$\\times$ larger receptive field and improves an average PSNR by over 0.21 dB across multiple standard benchmarks, while maintaining a small storage size and inference time.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37871", "url": null, "sourceid": 31553, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37874, "uid": "507904e87e36bc5849ab6d3198183582", "name": "Homaloidal parametrization for detecting critical two-view configurations", "authors": [{"id": 181458, "fullname": "Rakshith Madhavan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181458?format=json", "institution": "Politecnico di Milano"}, {"id": 188459, "fullname": "Matteo Forlivesi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188459?format=json", "institution": "Purdue University"}, {"id": 188460, "fullname": "Marina Bertolini", "url": "http://cvpr.thecvf.com/api/miniconf/users/188460?format=json", "institution": "University of Milan"}, {"id": 188461, "fullname": "Cristina Turrini", "url": "http://cvpr.thecvf.com/api/miniconf/users/188461?format=json", "institution": "University of Milan"}, {"id": 167108, "fullname": "Federica Arrigoni", "url": "http://cvpr.thecvf.com/api/miniconf/users/167108?format=json", "institution": "Politecnico di Milano"}, {"id": 88371, "fullname": "Luca Magri", "url": "http://cvpr.thecvf.com/api/miniconf/users/88371?format=json", "institution": "Polytechnic Institute of Milan"}], "abstract": "We consider the problem of identifying degenerate configurations while estimating the fundamental matrix from (at least) 8 point correspondences. It is known that such configurations correspond to an ill-posed estimation of the fundamental matrix, so it is important to identify them in practice. So far, a practical degeneracy test is only available for the cases of planar scenes and pure rotation, while the case of the general critical surface (e.g., a hyperboloid/cone/cylinder containing 3D points and camera centres)  is less studied, and the only available method is highly unstable, involving a pre-computed fundamental matrix. In this paper, we propose a novel degeneracy test for detecting points  on the critical surface. By exploiting the geometry of the so-called ``homaloidal net of conics'', we are able to design a simple and very practical test that requires the linear estimation of a quadratic transformation from image correspondences. Our test does not require a fundamental matrix in advance and turns out to be more stable than its closest competitor, as shown in our  experiments on both synthetic and real-world degenerate configurations.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37874", "url": null, "sourceid": 31572, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37875, "uid": "9e96d422fba85185a33829439f5df09d", "name": "FontCrafter: High-Fidelity Element-Driven Artistic Font Creation with Visual In-Context Generation", "authors": [{"id": 181804, "fullname": "Wuyang Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/181804?format=json", "institution": "Dalian University of Technology"}, {"id": 188462, "fullname": "Chengkaitan Chengkaitan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188462?format=json", "institution": "Dalian University of Technology"}, {"id": 188463, "fullname": "Chang Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/188463?format=json", "institution": "Dalian University of Technology"}, {"id": 188464, "fullname": "Binye Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/188464?format=json", "institution": "Dalian University of Technology"}, {"id": 85520, "fullname": "Su Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85520?format=json", "institution": "Fudan University"}, {"id": 188465, "fullname": "Yongjiu Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/188465?format=json", "institution": "Dalian University of Technology"}], "abstract": "Artistic font generation aims to synthesize stylized glyphs based on a reference style. However, existing approaches suffer from limited style diversity and coarse control. In this work, we explore the potential of element-driven artistic font generation. Elements are the fundamental visual units of a font, serving as reference images for the desired style. Conceptually, we categorize elements into object elements (e.g., flowers or stones) with distinct structures and amorphous elements (e.g., flames or clouds) with unstructured textures.We introduce FontCrafter, an element-driven framework for font creation, and construct a large-scale dataset, ElementFont, which comprises a diverse set of element types and high-quality glyph images. However, achieving high-fidelity reconstruction of both the texture and structure of reference elements remains challenging. To address this, we propose an in-context generation strategy that treats element images as visual context and uses an inpainting model to transfer element styles into glyph regions at the pixel level.To further control glyph shapes, we design a lightweight Context-aware Mask Adapter (CMA) that injects shape information while maintaining style consistency. Moreover, a training-free attention redirection mechanism enables region-aware style control and suppresses stroke hallucination. Extensive experiments demonstrate that FontCrafter achieves strong zero-shot generation performance, especially in preserving the structural and textural fidelity, while supporting flexible controls, such as style mixture. The model and dataset will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37875", "url": null, "sourceid": 41245, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37881, "uid": "e72e3a90f8114313a48c5c4a51ddc63a", "name": "Depth Hypothesis Guided Iterative Refinement for Event\u2013Image Monocular Depth Estimation", "authors": [{"id": 144494, "fullname": "Daikun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/144494?format=json", "institution": "Southeast University"}, {"id": 155309, "fullname": "Teng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155309?format=json", "institution": "Southeast University"}, {"id": 155310, "fullname": "Changyin Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/155310?format=json", "institution": "Southeast University"}], "abstract": "Event cameras hold excellent dynamic properties, showing great potential for monocular depth estimation (MDE). However, existing methods mainly improve performance by optimizing contextual features, but still struggle with the ill-posed and nonlinear nature of direct full-depth regression. In this paper, we propose HypoDepth, the first event\u2013image monocular depth iterative refinement framework. By introducing a discrete Depth Hypothesis Volume (DHV), we transform the depth regression problem into a constrained depth search task. Specifically, we construct a 3D cost volume between the DHV features and contextual features and perform a multi-scale correlation search to guide stable residual optimization. This lightweight cost volume enables efficient global-to-local refinement across multi-resolution. Our method outperforms existing approaches on DSEC and MVSEC with state-of-the-art results and strong zero-shot generalization. Meanwhile, our tiny model achieves an excellent balance between accuracy and efficiency, enabling real-time performance on resource-limited devices.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37881", "url": null, "sourceid": 31532, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37882, "uid": "32a16c22a08b5360601d155c4803b7f2", "name": "JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization", "authors": [{"id": 180432, "fullname": "Haolun Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180432?format=json", "institution": "Zhejiang University"}, {"id": 188483, "fullname": "Yu He", "url": "http://cvpr.thecvf.com/api/miniconf/users/188483?format=json", "institution": "Zhejiang University"}, {"id": 188484, "fullname": "Tailun Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188484?format=json", "institution": "Zhejiang University"}, {"id": 70436, "fullname": "Shuo Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/70436?format=json", "institution": "Zhejiang University"}, {"id": 188485, "fullname": "Zhixuan Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188485?format=json", "institution": "Zhejiang University"}, {"id": 188486, "fullname": "Hongbin zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/188486?format=json", "institution": null}, {"id": 188487, "fullname": "Lan Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188487?format=json", "institution": "Alibaba Group"}, {"id": 188488, "fullname": "Zhan Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188488?format=json", "institution": "Zhejiang University"}, {"id": 85239, "fullname": "Kui Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/85239?format=json", "institution": "Zhejiang University"}], "abstract": "Text-to-image (T2I) models such as Stable Diffusion and DALLE remain susceptible to generating harmful or Not-Safe-For-Work (NSFW) content under jailbreak attacks despite deployed safety filters. Existing jailbreak attacks either rely on proxy-loss optimization instead of the true end-to-end objective, or depend on large-scale and costly RL-trained generators. Motivated by these limitations, we propose JANUS  , a lightweight framework that formulates jailbreak as optimizing a structured prompt distribution under a black-box, end-to-end reward from the T2I system and its safety filters. JANUS  replaces a high-capacity generator with a low-dimensional mixing policy over two semantically anchored prompt distributions, enabling efficient exploration while preserving the target semantics. On modern T2I models, we outperform state-of-the-art jailbreak methods, improving ASR-8 from 25.30\\% to 43.15\\% on Stable Diffusion 3.5 Large Turbo with consistently higher CLIP and NSFW scores. JANUS succeeds across both open-source and commercial models. These findings expose structural weaknesses in current T2I safety pipelines and motivate stronger, distribution-aware defenses. Warning: This paper contains model outputs that may be offensive.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37882", "url": null, "sourceid": 46767, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37884, "uid": "95b2a1df756b9e6e67bfacd25c2c6110", "name": "GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training", "authors": [{"id": 180057, "fullname": "Tong Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/180057?format=json", "institution": "Tsinghua University"}, {"id": 128888, "fullname": "Yijun Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128888?format=json", "institution": "University of Technology Sydney"}, {"id": 188491, "fullname": "Changhao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188491?format=json", "institution": "Tsinghua University"}, {"id": 75860, "fullname": "Junliang Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/75860?format=json", "institution": "Tsinghua University"}, {"id": 188492, "fullname": "Yuanchun Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188492?format=json", "institution": ", Tsinghua University"}, {"id": 87087, "fullname": "Zongqing Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87087?format=json", "institution": "Peking University"}, {"id": 184409, "fullname": "Deheng Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/184409?format=json", "institution": "Tencent; Tencent"}], "abstract": "Multi-turn reinforcement learning (RL) for multi-modal agents built upon vision\u2013language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods densify the reward by querying a teacher that provides step-level feedback, e.g., Guided Thought Reinforcement (GTR) and On-Policy Distillation, but rely on costly, often privileged models as the teacher, limiting practicality and reproducibility. We introduce GTR-Turbo, a highly efficient upgrade to GTR, which matches the performance without training or querying an expensive teacher model. Specifically, GTR-Turbo merges the weights of checkpoints produced during the ongoing RL training, and then uses this merged model as a \"free\" teacher to guide the subsequent RL via supervised fine-tuning or soft logit distillation. This design removes dependence on privileged VLMs (e.g., GPT or Gemini), mitigates the \"entropy collapse\" observed in prior work, and keeps training stable. Across diverse visual agentic tasks, GTR-Turbo improves the accuracy of the baseline model by 10\u201330\\% while reducing wall-clock training time by 50% and compute cost by 60% relative to GTR.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37884", "url": null, "sourceid": 34367, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37885, "uid": "1b159dc50cad7253d6c91bc03c2bf33c", "name": "FusionAgent: A Multimodal Agent with Dynamic Model Selection for Human Recognition", "authors": [{"id": 164848, "fullname": "Jie Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/164848?format=json", "institution": "michigan state university"}, {"id": 72501, "fullname": "Xiao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/72501?format=json", "institution": "Michigan State University"}, {"id": 91845, "fullname": "Yiyang Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/91845?format=json", "institution": "Michigan State University"}, {"id": 152821, "fullname": "Anil Kumar Jain", "url": "http://cvpr.thecvf.com/api/miniconf/users/152821?format=json", "institution": "Michigan State University"}, {"id": 73926, "fullname": "Xiaoming Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73926?format=json", "institution": "Michigan State University"}], "abstract": "Systematic human recognition requires integrating multiple biometric traits such as face, gait, and body shape, through specialized models to achieve robustness in unconstrained scenarios. However, existing score-fusion strategies typically adopt a static design, combining all models for every test sample regardless of sample quality. This not only increases unnecessary computation but can degrade performance by incorporating noisy or unreliable modalities. To overcome these limitations, we propose FusionAgent, a novel agentic framework that leverages a Multimodal Large Language Model (MLLM) to perform dynamic, sample-specific model selection. Each model is treated as a tool, and through Reinforcement Fine-Tuning (RFT) with a metric-based reward, the agent learns to adaptively determine the optimal model combination for each test input. To address the model score misalignment and embedding heterogeneity, we introduce Anchor-based Confidence Top-k (ACT) score-fusion, which anchors on the most confident model and integrates complementary predictions in a confidence-aware manner. Extensive experiments on multiple whole-body biometric benchmarks demonstrate that \\ours significantly outperforms SoTA methods, underscoring the critical role of dynamic, explainable, and robust model fusion in real-world recognition systems. The proposed framework is scalable and adaptable to a wide range of multi-modal and multi-model tasks, such as vision-language retrieval, indicating its potential relevance to broader application scenarios. The code and model will be publicly released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37885", "url": null, "sourceid": 32013, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37886, "uid": "1abc4dd8e76bb2a9f79363d66d5d3f61", "name": "GROW: Watermark Generation with Progressive Guidance for Diffusion Models", "authors": [{"id": 181779, "fullname": "Pengcheng Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/181779?format=json", "institution": "\u817e\u8baf\u79d1\u6280\uff08\u5317\u4eac\uff09\u6709\u9650\u516c\u53f8"}, {"id": 180062, "fullname": "Zexi Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/180062?format=json", "institution": "Tencent"}, {"id": 185905, "fullname": "Yijia Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/185905?format=json", "institution": "Fudan University"}, {"id": 155554, "fullname": "Jinchao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155554?format=json", "institution": "WeChat AI"}, {"id": 149440, "fullname": "Jie Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/149440?format=json", "institution": "Tencent Inc"}], "abstract": "Digital watermarking is a cornerstone for copyright protection. With the rapid advancement of generative models like diffusion models, in-generation and training-free watermarking techniques have garnered more attention for their endogeneity and convenience. These methods typically embed a watermark into the initial noise, where watermark extraction relies on Denoising Diffusion Implicit Models (DDIM) inversion. However, the computationally intensive extraction process severely hinders their path toward practical deployment. To overcome this critical bottleneck, we propose GROW, a novel training-free paradigm that reframes watermarking from a one-shot ``embedding''  to a progressive ``growth''. By progressively guiding using frequency-domain gradients, GROW naturally weaves the watermark into the image, which enables inversion-free extraction. Comprehensive experiments on multiple datasets show that GROW not only achieves superior robustness and imperceptibility but also offers a detection speed nearly 100x faster than inversion-based techniques. The code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37886", "url": null, "sourceid": 44497, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37887, "uid": "28a1faa9dd2f69eeef4279da40dcdfe0", "name": "Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification", "authors": [{"id": 90618, "fullname": "Qihao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90618?format=json", "institution": "Johns Hopkins University"}, {"id": 87390, "fullname": "Chengzhi Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/87390?format=json", "institution": "Columbia University"}, {"id": 88964, "fullname": "Yaojie Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88964?format=json", "institution": "Google Research"}, {"id": 84745, "fullname": "Alan L. Yuille", "url": "http://cvpr.thecvf.com/api/miniconf/users/84745?format=json", "institution": "Johns Hopkins University"}, {"id": 88961, "fullname": "Wen-Sheng Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88961?format=json", "institution": "Google Research"}], "abstract": "Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce **AuditDM**, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37887", "url": null, "sourceid": 31019, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37888, "uid": "78d78a6d9952bcd96e3394f8f3c7c89d", "name": "Learning Like Humans: Analogical Concept Learning for Generalized Category Discovery", "authors": [{"id": 154065, "fullname": "Jizhou Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/154065?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 154066, "fullname": "Chenhao Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/154066?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 87254, "fullname": "Yuhang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/87254?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 154063, "fullname": "Qiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154063?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 157443, "fullname": "Shaokun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157443?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 188493, "fullname": "SongLin Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/188493?format=json", "institution": "Shenzhen University of Advanced Technology"}, {"id": 87250, "fullname": "Yihong Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/87250?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "Generalized Category Discovery (GCD) seeks to uncover novel categories in unlabeled data while preserving recognition of known categories, yet prevailing visual-only pipelines and the loose coupling between supervised learning and discovery often yield brittle boundaries on fine-grained, look-alike categories. We introduce the Analogical Textual Concept Generator (ATCG), a plug-and-play module that analogizes from labeled knowledge to new observations, forming textual concepts for unlabeled samples. Fusing these analogical textual concepts with visual features turns discovery into a visual\u2013textual reasoning process, transferring prior knowledge to novel data and sharpening category separation. ATCG attaches to both parametric and clustering style GCD pipelines and requires no changes to their overall design. Across six benchmarks, ATCG consistently improves overall, known-class, and novel-class performance, with the largest gains on fine-grained data.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37888", "url": null, "sourceid": 41341, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40316?format=json"], "related_events_ids": [40316]}, {"id": 40316, "uid": "78d78a6d9952bcd96e3394f8f3c7c89d", "name": "Learning Like Humans: Analogical Concept Learning for Generalized Category Discovery", "authors": [{"id": 154065, "fullname": "Jizhou Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/154065?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 154066, "fullname": "Chenhao Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/154066?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 87254, "fullname": "Yuhang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/87254?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 154063, "fullname": "Qiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154063?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 157443, "fullname": "Shaokun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157443?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 188493, "fullname": "SongLin Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/188493?format=json", "institution": "Shenzhen University of Advanced Technology"}, {"id": 87250, "fullname": "Yihong Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/87250?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "Generalized Category Discovery (GCD) seeks to uncover novel categories in unlabeled data while preserving recognition of known categories, yet prevailing visual-only pipelines and the loose coupling between supervised learning and discovery often yield brittle boundaries on fine-grained, look-alike categories. We introduce the Analogical Textual Concept Generator (ATCG), a plug-and-play module that analogizes from labeled knowledge to new observations, forming textual concepts for unlabeled samples. Fusing these analogical textual concepts with visual features turns discovery into a visual\u2013textual reasoning process, transferring prior knowledge to novel data and sharpening category separation. ATCG attaches to both parametric and clustering style GCD pipelines and requires no changes to their overall design. Across six benchmarks, ATCG consistently improves overall, known-class, and novel-class performance, with the largest gains on fine-grained data.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40316", "url": null, "sourceid": -41341, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37888?format=json"], "related_events_ids": [37888]}, {"id": 37890, "uid": "f8b98fff0e06af830aebadf79232fe74", "name": "Towards Unified Human Perception and Machine Understanding: Token Flow Guided Compression Framework", "authors": [{"id": 187257, "fullname": "Li Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187257?format=json", "institution": "Xidian University"}, {"id": 188496, "fullname": "YingFu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188496?format=json", "institution": "Xidian University"}, {"id": 69415, "fullname": "Kepeng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/69415?format=json", "institution": "Xidian University"}, {"id": 154436, "fullname": "Gang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/154436?format=json", "institution": "Xidian University"}, {"id": 73148, "fullname": "Yunsong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/73148?format=json", "institution": "Xidian University"}], "abstract": "With the rapid rise of Large Vision Language Models (LVLMs) for image understanding, the objective of image compression is gradually shifting from human visual perception to machine-oriented semantic understanding. However, conventional learned compression techniques are optimized for pixel-level fidelity and typically operate at fixed or rigid bitrate points, misaligned with the semantic consistency and flexible bitrate control. This gap becomes critical in ultra-low-bitrate regimes, where latent representations often ignore semantic relevance and struggle to disentangle meaningful content from redundant visual details as the bitrate varies. To address these challenges, we develop a token-based flexible compression framework, Token Flow Guided Compression (TFGC), which unifies human- and machine-oriented objectives. TFGC supports variable bitrate control in ultra-low bitrate regimes and enables LVLMs to directly process compressed token without image reconstruction. Specifically, we explore token flow phenomenon in 1D token sequences and exploit it to design token flow propagation, which predicts missing tokens by propagating contextual information from unmasked tokens. Moreover, token semantic guidance aligns compressed representations with LVLM semantic space, while a progressive semantic alignment training strategy further bridges the gap between perceptual reconstruction and semantic reasoning. Experiments show that our framework achieves state-of-the-art LVLMs understanding at comparable bitrates while maintaining satisfactory perceptual quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37890", "url": null, "sourceid": 45708, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37891, "uid": "aad30043fc2c322d246f3279345448b9", "name": "GaussianDWM:  Driving World Model using  Language-aligned 3D Gaussians for Scene Understanding and Multi-modal Generation", "authors": [{"id": 69214, "fullname": "Tianchen Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/69214?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 188497, "fullname": "Xuefeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188497?format=json", "institution": null}, {"id": 188498, "fullname": "Yi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188498?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 188499, "fullname": "Qu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188499?format=json", "institution": "Megvii Technology Inc."}, {"id": 188500, "fullname": "Yuyao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188500?format=json", "institution": null}, {"id": 188501, "fullname": "Lijin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188501?format=json", "institution": null}, {"id": 188502, "fullname": "Le Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188502?format=json", "institution": null}, {"id": 188503, "fullname": "Yu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188503?format=json", "institution": null}, {"id": 188504, "fullname": "Bo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188504?format=json", "institution": null}, {"id": 188505, "fullname": "Wuxiong.Huang Wuxiong.Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188505?format=json", "institution": null}, {"id": 129735, "fullname": "Hesheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129735?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Driving World Models (DWMs) have been developing rapidly with the advances of generative models. However, existing DWMs lack 3D scene understanding capabilities and can only generate content conditioned on input data, without the ability to interpret or reason about the driving environment. Moreover, current approaches represent 3D spatial information with point cloud or BEV features do not accurately align textual information with the underlying 3D scene. To address these limitations, we propose a novel unified DWM framework based on 3D Gaussian scene representation, which enables both 3D scene understanding and multi-modal scene generation, while also enabling contextual enrichment for understanding and generation tasks. Our approach directly aligns textual information with the 3D scene by embedding rich linguistic features into each Gaussian primitive, thereby achieving early modality alignment. In addition, we design a novel task-aware language-guided token sampling strategy that removes redundant 3D information and injects accurate and compact 3D tokens into textual understanding.Furthermore, we design a dual-condition multi-modal generation model, where the information captured by our vision-language model is leveraged as a high-level language condition in combination with a low-level image condition, jointly guiding the multi-modal generation process. We conduct comprehensive studies on the nuScenes, OmniDrive-nuScenes, and NuInteract datasets to validate the effectiveness of our framework. Our method achieves state-of-the-art performance. We will release the code publicly on GitHub.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37891", "url": null, "sourceid": 40638, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37895, "uid": "03795f5a6244fde28a1abfafed2c1075", "name": "StyleGallery: Training-free and Semantic-aware Personalized Style Transfer from Arbitrary Image References", "authors": [{"id": 182212, "fullname": "Boyu He", "url": "http://cvpr.thecvf.com/api/miniconf/users/182212?format=json", "institution": "National University of Defense Technology"}, {"id": 91018, "fullname": "Yunfan Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/91018?format=json", "institution": "National University of Defense Technology"}, {"id": 143124, "fullname": "Chang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/143124?format=json", "institution": "National University of Defense Technology"}, {"id": 188520, "fullname": "Weishang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188520?format=json", "institution": "National University of Defense Technology"}, {"id": 188521, "fullname": "FANG LIU", "url": "http://cvpr.thecvf.com/api/miniconf/users/188521?format=json", "institution": "Hunan University"}, {"id": 91021, "fullname": "Zhiping Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/91021?format=json", "institution": "National University of Defense Technology"}], "abstract": "Despite the advancements in diffusion-based image style transfer, existing methods are commonly limited by 1) semantic gap: the style reference could miss proper content semantics, causing uncontrollable stylization; 2) reliance on extra constraints (e.g., semantic masks) restricting applicability; 3) rigid feature associations lacking adaptive global-local alignment, failing to balance fine-grained stylization and global content preservation. These limitations, particularly the inability to flexibly leverage style inputs, fundamentally restrict style transfer in terms of personalization, accuracy, and adaptability.To address these, we propose StyleGallery, a training-free and semantic-aware framework that supports arbitrary reference images as input and enables effective personalized customization.It comprises three core stages: semantic region segmentation (adaptive clustering on latent diffusion features to divide regions without extra inputs); clustered region matching (block filtering on extracted features for precise alignment); and style transfer optimization (energy function\u2013guided diffusion sampling with regional style loss to optimize stylization).Experiments on our introduced benchmark demonstrate that StyleGallery outperforms state-of-the-art methods in content structure preservation, regional stylization, interpretability, and personalized customization, particularly when leveraging multiple style references. The source code and dataset have been released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37895", "url": null, "sourceid": 38547, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37897, "uid": "3bc8f7011e08bfe6830c967b497bdf6d", "name": "Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation", "authors": [{"id": 161900, "fullname": "Yongbo He", "url": "http://cvpr.thecvf.com/api/miniconf/users/161900?format=json", "institution": "Zhejiang University"}, {"id": 155582, "fullname": "Zirun Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/155582?format=json", "institution": "Zhejiang University"}, {"id": 152602, "fullname": "Tao Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/152602?format=json", "institution": "Zhejiang University"}], "abstract": "Adapting pretrained multi-modal models to evolving test-time distributions, known as multi-modal test-time adaptation, presents a significant challenge. Existing methods frequently encounter negative transfer in the unbiased modality and catastrophic forgetting in the biased modality. To address these challenges, we propose **D**ecoupling **A**daptation for **S**tability and **P**lasticity (**DASP**), a novel diagnose-then-mitigate framework. Our analysis reveals a critical discrepancy within the unified latent space: the biased modality exhibits substantially higher interdimensional redundancy (i.e., strong correlations across feature dimensions) compared to the unbiased modality. Leveraging this insight, DASP identifies the biased modality and implements an **asymmetric adaptation** strategy. This strategy employs a decoupled architecture where each modality-specific adapter is divided into stable and plastic components. The asymmetric mechanism works as follows: for the biased modality, which requires plasticity, the plastic component is activated and updated to capture domain-specific information, while the stable component remains fixed. Conversely, for the unbiased modality, which requires stability, the plastic component is bypassed, and the stable component is updated using KL divergence regularization to prevent negative transfer. This asymmetric design enables the model to adapt flexibly to new domains while preserving generalizable knowledge. Comprehensive evaluations on diverse multi-modal benchmarks demonstrate that DASP significantly outperforms state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37897", "url": null, "sourceid": 32781, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37899, "uid": "573c979abe8d189246cab7ef50685f0b", "name": "Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design", "authors": [{"id": 172019, "fullname": "Haoxiang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/172019?format=json", "institution": "Sichuan University"}, {"id": 70064, "fullname": "Tao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70064?format=json", "institution": "Sichuan University"}, {"id": 184377, "fullname": "Chenwei Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184377?format=json", "institution": "Sichuan University"}, {"id": 186105, "fullname": "Li Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186105?format=json", "institution": "Peking University"}, {"id": 86144, "fullname": "Jiancheng Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/86144?format=json", "institution": "Sichuan University"}], "abstract": "Following the success of Group Relative Policy Optimization (GRPO) in foundation LLMs, an increasing number of works have sought to adapt GRPO to Visual Large Language Models (VLLMs) for visual perception tasks (e.g., detection and segmentation). However, much of this line of research rests on a long-standing yet unexamined assumption: training paradigms developed for language reasoning can be transferred seamlessly to visual perception. Our experiments show that this assumption is not valid, revealing intrinsic differences between reasoning-oriented and perception-oriented settings. Using reasoning segmentation as a representative case, we surface two overlooked factors: (i) the need for a broader output space, and (ii) the importance of fine-grained, stable rewards. Building on these observations, we propose Dr.Seg, a simple, plug-and-play GRPO-based framework consisting of a Look-to-Confirm mechanism and a Distribution-Ranked Reward module, requiring no architectural modifications and integrating seamlessly with existing GRPO-based VLLMs. Extensive experiments demonstrate that Dr.Seg improves performance in complex visual scenarios while maintaining strong generalization. Code, data, and models will be publicly released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37899", "url": null, "sourceid": 30825, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37903, "uid": "e3dd0f24ae71a2646850db1513dd36ef", "name": "Adaptive Video Distillation: Mitigating Oversaturation and Temporal Collapse in Few-Step Generation", "authors": [{"id": 176172, "fullname": "Yuyang You", "url": "http://cvpr.thecvf.com/api/miniconf/users/176172?format=json", "institution": "Peking University"}, {"id": 180766, "fullname": "Yongzhi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180766?format=json", "institution": "Kuaishou"}, {"id": 188540, "fullname": "Jiahui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188540?format=json", "institution": null}, {"id": 89566, "fullname": "Yadong Mu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89566?format=json", "institution": "Peking University"}, {"id": 185056, "fullname": "Quan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185056?format=json", "institution": "Kuaishou"}, {"id": 185057, "fullname": "Peng Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185057?format=json", "institution": "Kuaishou Technology"}], "abstract": "Video generation has recently emerged as a central task in the field of generative AI. However, the substantial computational cost inherent in video synthesis makes model distillation a critical technique for efficient deployment. Despite its significance, there is a scarcity of methods specifically designed for video diffusion models. Prevailing approaches often directly adapt image distillation techniques, which frequently lead to artifacts such as oversaturation, temporal inconsistency, and mode collapse. To address these challenges, we propose a novel distillation framework tailored specifically for video diffusion models. Its core innovations include: (1) an adaptive regression loss that dynamically adjusts spatial supervision weights to prevent artifacts arising from excessive distribution shifts; (2) a temporal regularization loss to counteract temporal collapse, promoting smooth and physically plausible sampling trajectories; and (3) an inference-time frame interpolation strategy that reduces sampling overhead while preserving perceptual quality. Extensive experiments and ablation studies on the VBench and VBench2 benchmarks demonstrate that our method achieves stable few-step video synthesis, significantly enhancing perceptual fidelity and motion realism. It consistently outperforms existing distillation baselines across multiple metrics.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37903", "url": null, "sourceid": 36553, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37906, "uid": "2a0433fde38b262e5db7db757057c7c0", "name": "PRISM: Prototype-based Reasoning with Inter-modal Semantic Mining for Interpretable Image Recognition", "authors": [{"id": 181409, "fullname": "Anni Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181409?format=json", "institution": "Nanjing University"}, {"id": 157868, "fullname": "Yu-Bin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157868?format=json", "institution": "Nanjing University, China"}], "abstract": "Prototype-based methods enhance interpretability in image recognition by establishing intermediate part prototypes to build interpretable classifiers, enabling transparent reasoning through part-level attention and reference to prototypical examples. However, existing methods typically depend on unimodal visual supervision and constrain prototypes within the visual embedding space, which inherently restricts their semantic alignment with human-interpretable concepts. In this work, we present PRISM (Prototype-based Reasoning with Inter-modal Semantic Mining), an interpretable image recognition framework that leverages natural language as an auxiliary modality to guide the learning of class-specific part prototypes. PRISM introduces an information-theoretic attribution mechanism that identifies semantically salient image regions conditioned on textual descriptions. By aligning these attribution maps with prototype activation patterns, PRISM implicitly anchors visual part prototypes to conceptually meaningful image regions, enhancing interpretability without requiring explicit concept modeling. To further enhance the distinctiveness and localization of prototypes, we introduce a spatial compactness constraint that encourages each prototype to attend to specific, non-overlapping image regions. Extensive experiments on fine-grained benchmarks demonstrate that the proposed PRISM not only improves classification performance but also provides faithful and semantically grounded visual explanations.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37906", "url": null, "sourceid": 42403, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37911, "uid": "bce59b83a986e549f44ed60eb6c9960d", "name": "ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation", "authors": [{"id": 188572, "fullname": "Zichen Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188572?format=json", "institution": "University of Western Australia"}, {"id": 106229, "fullname": "Zeeshan Hayder", "url": "http://cvpr.thecvf.com/api/miniconf/users/106229?format=json", "institution": "Google"}, {"id": 188573, "fullname": "Wei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188573?format=json", "institution": "University of Western Australia"}, {"id": 129735, "fullname": "Hesheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129735?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 90780, "fullname": "Ajmal Mian", "url": "http://cvpr.thecvf.com/api/miniconf/users/90780?format=json", "institution": "University of Western Australia"}], "abstract": "3D human reaction generation faces three main challenges: (1) high motion fidelity, (2) real-time inference, and (3) autoregressive adaptability for online scenarios. Existing methods fail to meet all three simultaneously. We propose ARMFlow, a MeanFlow-based autoregressive framework that models temporal dependencies between actor and reactor motions. It consists of a causal context encoder and an MLP-based velocity predictor. We introduce Bootstrap Contextual Encoding (BSCE) in training, encoding generated history instead of the ground-truth ones, to alleviate error accumulation in autoregressive generation. We further introduce the offline variant {ReMFlow}, achieving state-of-the-art performance with the fastest inference among offline methods. Our ARMFlow addresses key limitations of online settings by: (1) enhancing semantic alignment via a global contextual encoder; (2) achieving high accuracy and low latency in a single-step inference; and (3) reducing accumulated errors through BSCE. Our single-step online generation surpasses existing online methods on InterHuman and InterX by over 40% in FID, while matching offline state-of-the-art performance despite using only partial sequence conditions. Code is available in the supplementary.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37911", "url": null, "sourceid": 42973, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37913, "uid": "d4d2775cb2a95db2d267b98def4c3e6b", "name": "RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations", "authors": [{"id": 154107, "fullname": "I-Hsiang (Aaron) Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/154107?format=json", "institution": "NTU"}, {"id": 154108, "fullname": "Yu-Wei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154108?format=json", "institution": "National Taiwan University"}, {"id": 188576, "fullname": "Tse-Yu Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188576?format=json", "institution": null}, {"id": 188577, "fullname": "Yu-Chien Chiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188577?format=json", "institution": "National Taiwan University"}, {"id": 188578, "fullname": "Jen-Chieh Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188578?format=json", "institution": "National Taiwan University"}, {"id": 126247, "fullname": "Wei-Ting Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126247?format=json", "institution": "National Taiwan University"}], "abstract": "Vision-based Retrieval-Augmented Generation (VisRAG) leverages vision-language models (VLMs) to jointly retrieve relevant visual documents and generate grounded answers based on multimodal evidence. However, existing VisRAG models degrade in performance when visual inputs suffer from distortions such as blur, noise, low light, or shadow, where semantic and degradation factors become entangled within pretrained visual encoders, leading to errors in both retrieval and generation stages. To address this limitation, we introduce RobustVisRAG, a causality-guided dual-path framework that improves VisRAG robustness while preserving efficiency and zero-shot generalization. RobustVisRAG uses a non-causal path to capture degradation signals through unidirectional attention and a causal path to learn purified semantics guided by these signals. Together with the proposed Non-Causal Distortion Modeling and Causal Semantic Alignment objectives, the framework enforces a clear separation between semantics and degradations, enabling stable retrieval and generation under challenging visual conditions. To evaluate robustness under realistic conditions, we introduce the Distortion-VisRAG dataset, a large-scale benchmark containing both synthetic and real-world degraded documents across seven domains, with 12 synthetic and 5 real distortion types that comprehensively reflect practical visual degradations. Experimental results show that RobustVisRAG improves retrieval, generation, and end-to-end performance by 7.35%, 6.35%, and 12.40%, respectively, on real-world degradations, while maintaining comparable accuracy on clean inputs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37913", "url": null, "sourceid": 36281, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37918, "uid": "cd0973da79d301fb821f4aae71f36173", "name": "Beyond Matching to Tiles: Bridging Unaligned Aerial and Satellite Views for Vision-Only UAV Navigation", "authors": [{"id": 174738, "fullname": "Liu Kejia", "url": "http://cvpr.thecvf.com/api/miniconf/users/174738?format=json", "institution": "Zhejiang University"}, {"id": 126936, "fullname": "Haoyang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/126936?format=json", "institution": "Zhejiang University"}, {"id": 188589, "fullname": "Ruoyu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188589?format=json", "institution": "Zhejiang University"}, {"id": 188590, "fullname": "Peicheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188590?format=json", "institution": "Zhejiang University"}, {"id": 85446, "fullname": "Mingli Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/85446?format=json", "institution": "Zhejiang University"}, {"id": 85428, "fullname": "Haofei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85428?format=json", "institution": "Zhejiang University"}], "abstract": "Recent advances in cross-view geo-localization (CVGL) methods have shown strong potential for supporting unmanned aerial vehicle (UAV) navigation in GNSS-denied environments. However, existing work predominantly focuses on matching UAV views to onboard map tiles, which introduces an inherent trade-off between accuracy and storage overhead, and overlooks the importance of the UAV\u2019s heading during navigation. Moreover, the substantial discrepancies and varying overlaps in cross-view scenarios have been insufficiently considered, limiting their generalization to real-world scenarios. In this paper, we present Bearing-UAV, a purely vision-driven cross-view navigation method that jointly predicts UAV absolute location and heading from neighboring features, enabling accurate, lightweight, and robust navigation in the wild. Our method leverages global and local structural features and explicitly encodes relative spatial relationships, making it robust to cross-view variations, misalignment, and feature-sparse conditions. We also present Bearing-UAV-90k, a multi-city benchmark for evaluating cross-view localization and navigation. Extensive experiments show encouraging results that Bearing-UAV yields lower localization error than previous matching/retrieval paradigm across diverse terrains. Our code and dataset will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37918", "url": null, "sourceid": 37750, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37920, "uid": "3238b5fdacca32ae65f7486caf88ee59", "name": "GeCo: Geometry-Consistent Regularization for Domain Generalized Semantic Segmentation", "authors": [{"id": 188334, "fullname": "Qi Zang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188334?format=json", "institution": "Hefei University of Technology"}, {"id": 84913, "fullname": "Dong Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84913?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 180552, "fullname": "Nan Pu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180552?format=json", "institution": "Hefei University of Technology"}, {"id": 130948, "fullname": "Wenjing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/130948?format=json", "institution": "University of Nottingham"}, {"id": 77038, "fullname": "Zhun Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/77038?format=json", "institution": "University of Nottingham"}, {"id": 85089, "fullname": "Meng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85089?format=json", "institution": "Hefei University of Technology"}], "abstract": "Vision Foundation Models (VFMs) provide rich and transferable representations through large-scale pretraining, yet their high-capacity representations remains underutilized when adapted to downstream tasks. In Domain Generalization Semantic Segmentation (DGSS), parameter-efficient fine-tuning (PEFT) often overfits adapters to source-domain statistics and seen-class boundaries, leading to representation degradation manifested as domain bias and semantic rigidity. Existing regularization strategies alleviate this through random perturbations, but such operations disrupt the pretrained geometric structure, causing semantic drift and unstable generalization.We propose Geometry-Consistent Regularization (GeCo), which extrapolates the pretrained representation space toward the target task under structure-respected constraints, thereby preserving the inherent generalization of VFMs while enhancing their task-specific adaptation. GeCo introduces curvature-guided perturbation to modulate feature variation according to local manifold complexity of the pre-trained embedding space, enabling structure-aligned representation expansion. Complementarily, a geodesic-based regularization constrains prediction shifts along smooth, manifold-aligned trajectories, ensuring semantic continuity and stable decision behavior.Extensive experiments demonstrate that GeCo achieves superior generalization across both closed-set and open-set DGSS benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37920", "url": null, "sourceid": 44063, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37921, "uid": "890f3707a3d8af2ca24401484a6ee64a", "name": "Globscope: Toward a Global View of the Loss Landscape", "authors": [{"id": 188593, "fullname": "Mashiat Mustaq", "url": "http://cvpr.thecvf.com/api/miniconf/users/188593?format=json", "institution": "Purdue University"}, {"id": 188594, "fullname": "Xavier Michel Tricoche", "url": "http://cvpr.thecvf.com/api/miniconf/users/188594?format=json", "institution": "Purdue University"}], "abstract": "Understanding the global structure of neural network loss landscapes is important for gaining insight into model merging, hyperparameter selection, generalization, and the relationships between distinct solutions. Visualizing the global structure of loss landscapes is very challenging because of the high dimensionality of the parameter space of neural networks. Prior work has primarily focused on visualizing the loss landscape around one single basin, missing how different minima or basins relate to each other. We introduce Globscope, a framework for providing a global view of the loss landscape across multiple solutions or basins. Globscope learns a low-dimensional non-linear manifold of model parameters using an autoencoder framework, enabling both latent-space visualization and reconstruction of full model weights. Then it summarizes the relations among minima and connecting regions on this manifold through topological data analysis. Our framework produces continuous, interpretable visualizations that reveal global connectivity patterns in the landscape. We compare Globscope with kernel-based methods and demonstrate how it performs in preserving the global structure across diverse solutions. We further show how Globscope can be used to analyze two applications: revealing global low-loss solution pathways between distinct solutions using mode connectivity algorithms, and visualizing permutation symmetries of different solutions using re-basin approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37921", "url": null, "sourceid": 38088, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37923, "uid": "00d424190dde9518f8edc5a2658b230d", "name": "Dual-level Adapter Boosting Prompt-free Curvilinear Structure Segmentation", "authors": [{"id": 181523, "fullname": "Kai Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181523?format=json", "institution": "Wuhan University of Science and Technology"}, {"id": 84930, "fullname": "Li Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/84930?format=json", "institution": "Wuhan University of Science and Technology"}, {"id": 188599, "fullname": "Jun Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188599?format=json", "institution": "Institute for Infocomm Reserach, A*STAR"}], "abstract": "Curvilinear structure segmentation is essential in domains such as medical imaging, remote sensing, and materials science. Existing methods often require extensive domain-specific training and lack generalization to novel domains. To overcome these limitations, we propose the Segment Anything Curve Model (SACM) \u2014 a universal, curvilinear segmentation framework built upon the pretrained Segment Anything Model (SAM). SACM introduces a dual-level adapter architecture that enables both fine-grained and domain-adaptive enhancement: block-level internal adapters refine local structural representations, while external adapters facilitate cross-domain feature alignment. Specifically, the internal adapters are embedded within each Transformer block to locally adapt and refine features for thin and intricate curvilinear patterns, while the external adapters operate across blocks to capture global, multi-layer contextual information and facilitate domain adaptation. Furthermore, SACM introduces a feature fusion mechanism that aggregates multi-layer features from all external adapters and fuses them via a Feed-Forward Network (FFN) module, and a dual-stage refinement process in the mask decoder to enhance topology and connectivity. This design enables prompt-free, data-efficient fine-tuning and achieves robust cross-domain generalization when trained with only 18 annotated images. Extensive experiments across twelve diverse curvilinear datasets validate that SACM achieves state-of-the-art performance.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37923", "url": null, "sourceid": 43599, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40317?format=json"], "related_events_ids": [40317]}, {"id": 40317, "uid": "00d424190dde9518f8edc5a2658b230d", "name": "Dual-level Adapter Boosting Prompt-free Curvilinear Structure Segmentation", "authors": [{"id": 181523, "fullname": "Kai Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181523?format=json", "institution": "Wuhan University of Science and Technology"}, {"id": 84930, "fullname": "Li Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/84930?format=json", "institution": "Wuhan University of Science and Technology"}, {"id": 188599, "fullname": "Jun Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188599?format=json", "institution": "Institute for Infocomm Reserach, A*STAR"}], "abstract": "Curvilinear structure segmentation is essential in domains such as medical imaging, remote sensing, and materials science. Existing methods often require extensive domain-specific training and lack generalization to novel domains. To overcome these limitations, we propose the Segment Anything Curve Model (SACM) \u2014 a universal, curvilinear segmentation framework built upon the pretrained Segment Anything Model (SAM). SACM introduces a dual-level adapter architecture that enables both fine-grained and domain-adaptive enhancement: block-level internal adapters refine local structural representations, while external adapters facilitate cross-domain feature alignment. Specifically, the internal adapters are embedded within each Transformer block to locally adapt and refine features for thin and intricate curvilinear patterns, while the external adapters operate across blocks to capture global, multi-layer contextual information and facilitate domain adaptation. Furthermore, SACM introduces a feature fusion mechanism that aggregates multi-layer features from all external adapters and fuses them via a Feed-Forward Network (FFN) module, and a dual-stage refinement process in the mask decoder to enhance topology and connectivity. This design enables prompt-free, data-efficient fine-tuning and achieves robust cross-domain generalization when trained with only 18 annotated images. Extensive experiments across twelve diverse curvilinear datasets validate that SACM achieves state-of-the-art performance.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40317", "url": null, "sourceid": -43599, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37923?format=json"], "related_events_ids": [37923]}, {"id": 37925, "uid": "dc07e93076253017193100356d788742", "name": "GOR-IS: 3D Gaussian Object Removal In the Intrinsic Space", "authors": [{"id": 180513, "fullname": "Yonghao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180513?format=json", "institution": "Nankai University"}, {"id": 154137, "fullname": "Yupeng Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/154137?format=json", "institution": "nanjing university"}, {"id": 86573, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86573?format=json", "institution": "Nankai University"}, {"id": 137969, "fullname": "Jin Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/137969?format=json", "institution": "Nanjing University"}, {"id": 145863, "fullname": "Beibei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/145863?format=json", "institution": "Nanjing University"}], "abstract": "Recent advances in Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have made it standard practice to reconstruct 3D scenes from multi-view images. Removing objects from such 3D representations is a fundamental editing task that requires complete and seamless inpainting of occluded regions, ensuring consistency in geometry and appearance. Although existing methods have made notable progress in improving inpainting consistency, they often neglect global lighting effects, leading to physically implausible results. Moreover, these methods struggle with view-dependent non-Lambertian surfaces, where appearance varies across viewpoints, leading to unreliable inpainting. In this paper, we present 3D **G**aussian **O**bject **R**emoval in the **I**ntrinsic **S**pace (GOR-IS), a novel framework for physically consistent and visually coherent 3D object removal. Our approach decomposes the scene into intrinsic components and explicitly models light transport to maintain global lighting effects consistency. Furthermore, we introduce an intrinsic-space inpainting module that operates directly in the material and lighting domains, effectively addressing the challenges posed by non-Lambertian surfaces. Extensive experiments on both synthetic and real-world datasets demonstrate that our framework substantially improves the physical consistency and visual coherence of object removal, outperforming existing methods by 13\\% in perceptual similarity (LPIPS) and 2dB in peak signal-to-noise ratio (PSNR).", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37925", "url": null, "sourceid": 33466, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37926, "uid": "dd093b11f9127e9d7d591129be671183", "name": "LAM: Language Articulated Object Modelers", "authors": [{"id": 158579, "fullname": "Yipeng Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/158579?format=json", "institution": "University of Southern California"}, {"id": 132877, "fullname": "Yunhao Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/132877?format=json", "institution": "NVIDIA"}, {"id": 175786, "fullname": "Peilin Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/175786?format=json", "institution": "University of Southern California"}, {"id": 188604, "fullname": "Daniel Seita", "url": "http://cvpr.thecvf.com/api/miniconf/users/188604?format=json", "institution": "University of Southern California"}, {"id": 73133, "fullname": "Laurent Itti", "url": "http://cvpr.thecvf.com/api/miniconf/users/73133?format=json", "institution": "USC"}], "abstract": "We introduce LAM, a system that explores the collaboration of large-language mod-els and vision-language models to generate articulated objects from text prompts.Our approach differs from previous methods that either rely on input visual structure(e.g., an image) or assemble articulated models from pre-built assets. In contrast,we formulate articulated object generation as a unified code generation task, wheregeometry and articulations can be co-designed from scratch. Given an input text,LAM coordinates a team of specialized modules to generate code to represent thedesired articulated object procedurally. The LAM first reasons about the hierarchi-cal structure of parts (links) with Link Designer, then writes code, compiles it, anddebugs it with Geometry & Articulation Coders and self-corrects with Geometry& Articulation Checkers. The code serves as a structured and interpretable bridgebetween individual links, ensuring correct relationships among them. Representingeverything with code allows the system to determine appropriate joint types andcalculate their exact placements more reliably. Experiments demonstrate the powerof leveraging code as a generative medium within an agentic system, showcasingits effectiveness in automatically constructing complex articulated objects.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37926", "url": null, "sourceid": 35310, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37927, "uid": "b71a60c8bdb4d9e8cfe4790de7d18c40", "name": "ART: Articulated Reconstruction Transformer", "authors": [{"id": 182076, "fullname": "Zizhang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182076?format=json", "institution": "Stanford University"}, {"id": 126458, "fullname": "Cheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126458?format=json", "institution": "Facebook"}, {"id": 126494, "fullname": "Zhengqin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/126494?format=json", "institution": "Facebook"}, {"id": 140486, "fullname": "Henry Howard-Jenkins", "url": "http://cvpr.thecvf.com/api/miniconf/users/140486?format=json", "institution": "Facebook; Meta"}, {"id": 99138, "fullname": "Zhaoyang Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/99138?format=json", "institution": "Impossible, Inc."}, {"id": 86204, "fullname": "Chen Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86204?format=json", "institution": "Stanford University"}, {"id": 84531, "fullname": "Jiajun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84531?format=json", "institution": "Stanford University"}, {"id": 127741, "fullname": "Richard Newcombe", "url": "http://cvpr.thecvf.com/api/miniconf/users/127741?format=json", "institution": "Meta, Reality Labs Research"}, {"id": 127728, "fullname": "Jakob Engel", "url": "http://cvpr.thecvf.com/api/miniconf/users/127728?format=json", "institution": "Research, Meta Reality Labs"}, {"id": 126493, "fullname": "Zhao Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/126493?format=json", "institution": "Meta RL Research"}], "abstract": "We introduce ART, Articulated Reconstruction Transformer\u2014a category-agnostic, feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images. Previous methods for articulated object reconstruction either rely on slow optimization with fragile cross-state correspondences or use feed-forward models limited to specific object categories. In contrast, ART treats articulated objects as assemblies of rigid parts, formulating reconstruction as a part-based prediction problem. Our newly designed transformer architecture maps sparse image inputs to a set of learnable part slots, from which ART jointly decodes unified representations for individual parts, including their 3D geometry, texture, and explicit articulation parameters. The resulting reconstructions are physically interpretable and readily exportable to standard simulation formats. Trained on a large-scale, diverse dataset with per-part supervision, and evaluated across diverse benchmarks, ART achieves significant improvements over existing baselines and establishes a new state of the art for articulated object reconstruction from image inputs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37927", "url": null, "sourceid": 46164, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37930, "uid": "0f0671779c244dbf8ee04425ad06a093", "name": "Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models", "authors": [{"id": 188611, "fullname": "Runsen Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188611?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 130164, "fullname": "Weiyao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130164?format=json", "institution": "Facebook"}, {"id": 130151, "fullname": "Hao Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130151?format=json", "institution": "Meta Platforms"}, {"id": 130167, "fullname": "Xingyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/130167?format=json", "institution": "Facebook"}, {"id": 188612, "fullname": "Xiaodong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188612?format=json", "institution": "Meta Platforms, Inc."}, {"id": 85560, "fullname": "Fu-Jen Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85560?format=json", "institution": "Facebook"}, {"id": 75834, "fullname": "Matt Feiszli", "url": "http://cvpr.thecvf.com/api/miniconf/users/75834?format=json", "institution": "Meta AI"}, {"id": 127733, "fullname": "Kevin Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127733?format=json", "institution": "FAIR at Meta"}], "abstract": "Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for physical-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with multi-frame spatial understanding by integrating fundamental spatial skills, including depth perception, visual correspondence, and dynamic perception. We design a novel data pipeline and collect the MultiSPA dataset of more than 27 million samples spanning diverse 3D and 4D scenes to enable training. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable and generalizable multi-frame perception. We further observe multi-task benefits and emergent spatial capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37930", "url": null, "sourceid": 32084, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37931, "uid": "a0946d7385a92c63af1fba1f83cb2238", "name": "MMVIP: A Visible-infrared Paired Dataset for Multi-weather Marine Vision", "authors": [{"id": 176213, "fullname": "Yunpeng Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/176213?format=json", "institution": "Guangdong University of Technology"}, {"id": 177210, "fullname": "Lihan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177210?format=json", "institution": "Guangdong University of Technology"}, {"id": 177188, "fullname": "Zhaoshen He", "url": "http://cvpr.thecvf.com/api/miniconf/users/177188?format=json", "institution": "Guangdong University of Technology"}, {"id": 177185, "fullname": "Xinqiang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/177185?format=json", "institution": "Guangdong University of Technology"}, {"id": 188613, "fullname": "Xingming Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188613?format=json", "institution": "Guangdong University of Technology"}, {"id": 177068, "fullname": "Zhuowei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177068?format=json", "institution": null}, {"id": 188614, "fullname": "Lianglun Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188614?format=json", "institution": "Guangdong University of Technology"}], "abstract": "Maritime multimodal vision faces significant challenges due to the complexity and variability of oceanic weather and environmental conditions. While modern vessels are commonly equipped with visible and infrared imaging systems, the complementary nature of these modalities fundamentally depends on accurate cross-modal registration. However, the absence of paired visible\u2013infrared datasets that realistically capture diverse maritime scenarios has severely hindered progress in this field. To overcome this limitation, we present MMVIP, the first large-scale visible\u2013infrared maritime vision dataset covering a wide spectrum of weather conditions and sea states. The dataset contains 128,100 images and 50 video sequences with precise spatial\u2013temporal alignment. Comprehensive evaluations across image registration, fusion, maritime object detection, and cross-modal image translation tasks demonstrate the dataset\u2019s effectiveness and challenge. Furthermore, MMVIP establishes a new benchmark for advancing multimodal maritime perception. The dataset and corresponding benchmarks are publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37931", "url": null, "sourceid": 36269, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37933, "uid": "d26ae53c9441b24e4286842b32b98637", "name": "The Devil Is in Gradient Entanglement: Energy-Aware Gradient Coordinator for Robust Generalized Category Discovery", "authors": [{"id": 188622, "fullname": "Haiyang Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188622?format=json", "institution": "University of Trento"}, {"id": 180552, "fullname": "Nan Pu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180552?format=json", "institution": "Hefei University of Technology"}, {"id": 188623, "fullname": "Yaqi Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/188623?format=json", "institution": "University of Trento"}, {"id": 188624, "fullname": "Teng Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/188624?format=json", "institution": "University of Amsterdam, University of Amsterdam"}, {"id": 130948, "fullname": "Wenjing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/130948?format=json", "institution": "University of Nottingham"}, {"id": 75973, "fullname": "Nicu Sebe", "url": "http://cvpr.thecvf.com/api/miniconf/users/75973?format=json", "institution": "University of Trento"}, {"id": 77038, "fullname": "Zhun Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/77038?format=json", "institution": "University of Nottingham"}], "abstract": "Generalized Category Discovery (GCD) aims to categorize unlabeled samples that may belong to either known or unknown categories by leveraging the knowledge from labeled data. Most previous methods jointly optimize supervised and unsupervised objectives and achieve promising results. However, inherent optimization interference still limits their ability to improve further. Through quantitative analysis, we identify a key issue, *i.e.*, **gradient entanglement**, which 1) distorts supervised gradients and weakens discrimination among known classes, and 2) induces representation-subspace overlap between known and novel classes, reducing the separability of novel categories. To address this issue, we propose the Energy-Aware Gradient Coordinator (EAGC), a plug-and-play gradient-level module that explicitly regulates the optimization process. EAGC comprises two components: Anchor-based Gradient Alignment (AGA) and Energy-aware Elastic Projection (EEP). AGA introduces a reference model to anchor the gradient directions of labeled samples, preserving the discriminative structure of known classes against the interference of unlabeled gradients. EEP softly projects unlabeled gradients onto the complement of the known-class subspace and derives an energy-based coefficient to adaptively scale the projection for each unlabeled sample according to its degree of alignment with the known subspace, thereby reducing subspace overlap without suppressing unlabeled samples that likely belong to known classes. EAGC can be seamlessly integrated with both parametric and non-parametric GCD methods. Experiments show that EAGC consistently boosts existing approaches and establishes new state-of-the-art results on multiple GCD benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37933", "url": null, "sourceid": 45401, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37936, "uid": "5707abfe502cadf6688e68ddbae96060", "name": "MatSpray: Fusing 2D Material World Knowledge on 3D Geometry", "authors": [{"id": 174246, "fullname": "Philipp Langsteiner", "url": "http://cvpr.thecvf.com/api/miniconf/users/174246?format=json", "institution": "University of T\u00fcbingen"}, {"id": 132185, "fullname": "Jan-Niklas Dihlmann", "url": "http://cvpr.thecvf.com/api/miniconf/users/132185?format=json", "institution": "Eberhard-Karls-Universit\u00e4t T\u00fcbingen"}, {"id": 126217, "fullname": "Hendrik Lensch", "url": "http://cvpr.thecvf.com/api/miniconf/users/126217?format=json", "institution": "University of T\u00fcbingen"}], "abstract": "Manual modeling of material parameters and 3D geometry is a time consuming yet essential task in the gaming and film industries. While recent advances in 3D reconstruction have enabled accurate approximations of scene geometry and appearance, these methods often fall short in relighting scenarios due to the lack of precise, spatially varying material parameters. At the same time, diffusion models operating on 2D images have shown strong performance in predicting physically based rendering (PBR) properties such as albedo, roughness, and metallicity. However, transferring these 2D material maps onto reconstructed 3D geometry remains a significant challenge. We propose a framework for fusing 2D material data into 3D geometry using a combination of novel learning-based and  projection-based approaches. We begin by reconstructing scene geometry via Gaussian Splatting. From the input images, a diffusion model generates 2D maps for albedo, roughness, and metallic parameters. Any existing diffusion model that can convert images or videos to PBR materials can be applied.  The predictions are further integrated into the 3D representation either by optimizing an image-based loss or by directly projecting the material parameters onto the Gaussians using Gaussian ray tracing. To enhance fine-scale accuracy and multi-view consistency, we further introduce a light-weight neural refinement step (Neural Merger), which takes ray-traced material features as input and produces detailed adjustments. Our results demonstrate that the proposed methods outperform existing techniques in both quantitative metrics and perceived visual realism. This enables more accurate, relightable, and photorealistic renderings from reconstructed scenes, significantly improving the realism and efficiency of asset creation workflows in content production pipelines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37936", "url": null, "sourceid": 31791, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37940, "uid": "9fae7f55b59b284b106e7be7c783054c", "name": "FlexiVideo: Variation-Aware Temporal Dynamics Modeling for Efficient Video Understanding", "authors": [{"id": 181190, "fullname": "Da Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/181190?format=json", "institution": "STUDENT"}, {"id": 188640, "fullname": "Xuesong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188640?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 76730, "fullname": "Zonghao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/76730?format=json", "institution": "Tsinghua University"}, {"id": 188641, "fullname": "Yichen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188641?format=json", "institution": "Tsinghua University"}, {"id": 156515, "fullname": "Chi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/156515?format=json", "institution": "Tsinghua University"}, {"id": 187614, "fullname": "Yidan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187614?format=json", "institution": null}, {"id": 106924, "fullname": "Yuan Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/106924?format=json", "institution": "Tsinghua University"}, {"id": 90742, "fullname": "Fang Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/90742?format=json", "institution": ", Chinese Academy of Sciences"}, {"id": 86951, "fullname": "Wei Ke", "url": "http://cvpr.thecvf.com/api/miniconf/users/86951?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 131381, "fullname": "Maosong Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/131381?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Natural videos exhibit heterogeneous temporal dynamics, with certain segments undergoing high-dynamic scene transitions and others dominated by low-dynamic visual changes. However, treating all frames identically, a common practice in most MLLMs, leads to redundant visual encoding, which results in significant computational overhead. The recent state-of-the-art model, i.e., Qwen2.5-VL, adopts a fixed two-frame encoding scheme, but our pilot experiments indicate that it encounters a visual confusion problem under high-dynamic frame pairs. To address this issue, we propose FlexiVideo, an efficient MLLM that models temporal dynamics leveraging visual variation. FlexiVideo first employs an adaptive temporal segmentation module to estimate inter-frame differences, grouping consecutive frames into scene segments with subtle visual changes. Subsequently, a dynamical spatio-temporal embedding module adjusts the temporal window for scene-level encoding. By restructuring scene-level visual representations within a structured temporal organization, our approach models dynamics more effectively and reduces the encoding burden while preserving fine-grained visual variations. Extensive experiments show that FlexiVideo-3B consistently outperforms Qwen2.5-VL-3B across 6 general video benchmarks. Notably, when evaluated on MotionBench at 10 FPS, FlexiVideo-3B reduces visual tokens by 43.5% compared with Qwen2.5-VL-3B while achieving a 1.3% performance gain, striking a significantly better balance between efficiency and effectiveness. Code and checkpoints will be released soon.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37940", "url": null, "sourceid": 35048, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37944, "uid": "b38eb0ae818b10cc35c9a4c7dc19f952", "name": "Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards", "authors": [{"id": 86332, "fullname": "Seungwook Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/86332?format=json", "institution": "POSTECH"}, {"id": 86304, "fullname": "Minsu Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/86304?format=json", "institution": "POSTECH"}], "abstract": "Text-to-image generation powers content creation across design, media, and data augmentation.Post-training of text-to-image generative models is a promising path to better match human preferences, factuality, and improved aesthetics.We introduce ARC (\\textbf{A}daptive \\textbf{R}ewarding by Self-\\textbf{C}onfidence), a post-training framework that replaces external reward supervision with an internal self-confidence signal, obtained by evaluating how accurately the model recovers injected noise under self-denoising probes.ARC converts this intrinsic signal into scalar rewards, enabling fully unsupervised optimization without additional datasets, annotators, or reward models.Empirically, by reinforcing high-confidence generations, ARC delivers consistent gains in compositional generation, text rendering and text-image alignment over the baseline. We also find that integrating ARC with external rewards results in a complementary improvement, with alleviated reward hacking.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37944", "url": null, "sourceid": 40169, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37945, "uid": "75534c35295bfd22689f8f9ce066ee06", "name": "EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions", "authors": [{"id": 172229, "fullname": "Taegyoon Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/172229?format=json", "institution": "Seoul National University"}, {"id": 183418, "fullname": "Yegyu Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/183418?format=json", "institution": "Seoul National University R&amp;DB Foundation"}, {"id": 183419, "fullname": "Seojin Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/183419?format=json", "institution": "Seoul National University R&amp;DB Foundation"}, {"id": 183417, "fullname": "Jaewoo Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/183417?format=json", "institution": "Seoul National University R&amp;DB Foundation"}, {"id": 183416, "fullname": "Sojeong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/183416?format=json", "institution": "Seoul National University R&amp;DB Foundation"}, {"id": 188653, "fullname": "Taein Kwon", "url": "http://cvpr.thecvf.com/api/miniconf/users/188653?format=json", "institution": "Meta"}, {"id": 128077, "fullname": "Hyung-Sin Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/128077?format=json", "institution": "Seoul National University"}], "abstract": "Smart glass is emerging as an useful device since it provides plenty of insights under hands-busy, eyes-on-task situations. To understand the context of the wearer, 6D object pose estimation in egocentric view is becoming essential. However, existing 6D object pose estimation benchmarks fail to capture the challenges of real-world egocentric applications, which are often dominated by severe motion blur, dynamic illumination, and visual obstructions. This discrepancy creates a significant gap between controlled lab data and chaotic real-world application. To bridge this gap, we introduce EgoXtreme, a new large-scale 6D pose estimation dataset captured entirely from an egocentric perspective. EgoXtreme features three challenging scenarios\u2014industrial maintenance, sports, and emergency rescue\u2014designed to introduce severe perceptual ambiguities through extreme lighting, heavy motion blur, and smoke. Evaluations of state-of-the-art generalizable pose estimators on EgoXtreme indicate that their generalization fails to hold in extreme conditions, especially under low light. We further demonstrate that simply applying image restoration (e.g., deblurring) offers no positive improvement for extreme conditions. While performance gain has appeared in tracking-based approach, implying using temporal information in fast-motion scenarios is meaningful. We conclude that EgoXtreme is an essential resource for developing and evaluating the next generation of pose estimation models robust enough for real-world egocentric vision. The dataset will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37945", "url": null, "sourceid": 44704, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37947, "uid": "86c1638ed997351a04bcc55a57dc05cd", "name": "Prompt-Anchored Vision\u2013Text Distillation for Lifelong Person Re-identification", "authors": [{"id": 183012, "fullname": "Wen Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/183012?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 156040, "fullname": "Hao CHEN", "url": "http://cvpr.thecvf.com/api/miniconf/users/156040?format=json", "institution": "Peking University"}, {"id": 89087, "fullname": "Shiliang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89087?format=json", "institution": "Peking University"}], "abstract": "Lifelong person re-identification (LReID) aims to train a generalizable model with sequentially collected data. However, such models often suffer from semantic drift, limited adaptability, and catastrophic forgetting as new domains emerge. Existing exemplar-free approaches mainly focus on visual encoder distillation or parameter regularization, while overlooking the potential of auxiliary modalities, such as text, to preserve semantic stability and enable incremental plasticity. We observe that the frozen text encoder in pretrained vision\u2013language models can serve as a stable semantic anchor, offering consistent guidance throughout lifelong learning. To leverage the synergy between vision and text, we propose Prompt-Anchored vision\u2013text Distillation (PAD), a unified framework that enhances semantic alignment and cross-domain generalization. On the textual side, we distill semantic prompts that maintain vision\u2013text alignment under a fixed semantic coordinate system. On the visual side, an EMA-based teacher performs model distillation assisted by an adaptive prompt pool that allocates new slots for each incoming domain while freezing past ones, achieving both adaptability and memory retention. Extensive experiments demonstrate that our PAD substantially outperforms state-of-the-art methods across multiple LReID benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37947", "url": null, "sourceid": 35848, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37952, "uid": "e326806ec9f754e7df058498a641fe8b", "name": "Spatial Matters: Position-Guided 3D Referring Expression Segmentation", "authors": [{"id": 156724, "fullname": "Yabing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156724?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 89574, "fullname": "Zhuotao Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/89574?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 85375, "fullname": "Le Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85375?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 71727, "fullname": "Zheng Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/71727?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 88174, "fullname": "Sanping Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/88174?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "3D Referring Expression segmentation (3D-RES) is an emerging field that segments 3D objects in point cloud scenes based on given referring expressions. Although existing methods have achieved substantial progress, they primarily focus on semantic cues and often overlook spatial relations, which are essential for segmenting the referred objects in complex 3D scenes, especially those containing multiple visually similar instances. In this paper, we propose Position3D, a novel approach that explicitly incorporates spatial relation modeling into 3D-RES. Specifically, we introduce a spatial-aware query generation module that constructs point proxies by aggregating local context and incorporating spatial relations, from which the most text-relevant are selected as queries. Furthermore, we design a position-guided deformable attention in the decoder, which progressively refines attention to concentrate on the target object under positional relationship guidance.  Extensive experiments on two benchmark datasets, \\ie, ScanRefer, and Multi3DRefer, validate the effectiveness of the proposed method Position3D.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37952", "url": null, "sourceid": 45524, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37955, "uid": "93f52fc53bde4ada81365d7c2acb0735", "name": "CADC: Content Adaptive Diffusion-Based Generative Image Compression", "authors": [{"id": 188678, "fullname": "Xihua Sheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188678?format=json", "institution": "City University of Hong Kong"}, {"id": 145474, "fullname": "lingyu ZHU", "url": "http://cvpr.thecvf.com/api/miniconf/users/145474?format=json", "institution": "City University of Hong Kong"}, {"id": 181225, "fullname": "Tianyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181225?format=json", "institution": "University of Science and Technology of China"}, {"id": 87804, "fullname": "Dong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87804?format=json", "institution": "University of Science and Technology of China"}, {"id": 86613, "fullname": "Shiqi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86613?format=json", "institution": "City University of Hong Kong"}, {"id": 159693, "fullname": "Jing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/159693?format=json", "institution": null}], "abstract": "Diffusion-based generative image compression has demonstrated remarkable potential for achieving realistic reconstruction at ultra-low bitrates. The key to unlocking this potential lies in making the entire compression process content-adaptive, ensuring that the encoder's representation and the decoder's generative prior are dynamically aligned with the semantic and structural characteristics of the input image. However, existing methods suffer from three critical limitations that prevent effective content adaptation. First, isotropic quantization applies a uniform quantization step, failing to adapt to the spatially varying complexity of image content and creating a misalignment with the diffusion model's noise-dependent prior. Second, the information concentration bottleneck---arising from the dimensional mismatch between the high-dimensional noisy latent and the diffusion decoder's fixed input---prevents the model from adaptively preserving essential semantic information in the primary channels. Third, existing textual conditioning strategies either need significant textual bitrate overhead or rely on generic, content-agnostic textual prompts, thereby failing to provide adaptive semantic guidance efficiently. To overcome these limitations, we propose a content-adaptive diffusion-based image codec (CADC) with three technical innovations: 1) an Uncertainty-Guided Adaptive Quantization (UGAQ) method that learns spatial uncertainty maps to adaptively align quantization distortion with content characteristics; 2) an Auxiliary Decoder-Guided Information Concentration (ADGIC) method that uses a lightweight auxiliary decoder to enforce content-aware information preservation in the primary latent channels; and 3) a Bitrate-Free Adaptive Textual Conditioning (BFATC) method that derives content-aware textual descriptions from the auxiliary reconstructed image, enabling semantic guidance without bitrate cost. Comprehensive experimental results show that our codec achieves state-of-the-art perceptual quality at ultra-low bitrates.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37955", "url": null, "sourceid": 31808, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37956, "uid": "ac13ef4bd5a77bb7ad082cc2428ae72d", "name": "Smart Replay: Adaptive Scheduling of Memory Rehearsal for Computational Resource-Aware Incremental Learning", "authors": [{"id": 183445, "fullname": "Jianting CHEN", "url": "http://cvpr.thecvf.com/api/miniconf/users/183445?format=json", "institution": "Tongji University"}, {"id": 157340, "fullname": "Dianzhi Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157340?format=json", "institution": "Chinese University of Hong Kong"}, {"id": 188679, "fullname": "Irwin King", "url": "http://cvpr.thecvf.com/api/miniconf/users/188679?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "Incremental learning (IL) arises from the need to continuously update models under limited data and computational resources. Most existing IL studies focus on data\u2011scarce settings. They often develop complex methods that rely on heavy computation, while overlooking the computational resource constraints common in real\u2011world scenarios. This motivates us to formalize the problem of Computational Resource\u2011Aware Incremental Learning, which explicitly considers the computational budget during model training. To tackle this problem, we propose Smart Replay, an efficient memory rehearsal algorithm that adaptively allocates resources by scheduling the replay ratio across mini\u2011batches. We cast replay\u2011ratio optimization into an optimal control formulation that jointly minimizes new\u2011task and memory losses. We further propose a heuristic Q-function to guide ratio adjustments, adaptively balancing short-term efficiency and long-term stability. Finally, we develop a practical algorithm that periodically updates the replay ratio during training. Experiments on multiple benchmarks validate that Smart Replay consistently outperforms fixed\u2011replay baselines, achieving higher accuracy and lower forgetting under the same computational budget.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37956", "url": null, "sourceid": 36345, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37957, "uid": "ae77ae50016d01f6c9d504e8194b9615", "name": "FINER: MLLMs Hallucinate under Fine-grained Negative Queries", "authors": [{"id": 154718, "fullname": "Rui Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/154718?format=json", "institution": "Technical University of Munich"}, {"id": 154717, "fullname": "Sanghwan Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/154717?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}, {"id": 76276, "fullname": "Yongqin Xian", "url": "http://cvpr.thecvf.com/api/miniconf/users/76276?format=json", "institution": "Google"}, {"id": 154682, "fullname": "Zeynep Akata", "url": "http://cvpr.thecvf.com/api/miniconf/users/154682?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}, {"id": 154719, "fullname": "Stephan Alaniz", "url": "http://cvpr.thecvf.com/api/miniconf/users/154719?format=json", "institution": "T\u00e9l\u00e9com Paris"}], "abstract": "Multimodal large language models (MLLMs) struggle with hallucinations, particularly with fine-grained queries, a challenge underrepresented by existing benchmarks that focus on coarse image-related questions. We introduce **FI**ne-grained **NE**gative que**R**ies (**FINER**), alongside two benchmarks: **FINER-CompreCap** and **FINER-DOCCI**. Using FINER, we analyze hallucinations across four settings: multi-object, multi-attribute, multi-relation, and \u201cwhat\u201d questions. Our benchmarks reveal that MLLMs hallucinate when fine-grained mismatches co-occur with genuinely present elements in the image. To address this, we propose **FINER-Tuning**, leveraging Direct Preference Optimization (DPO) on FINER-inspired data. Finetuning four frontier MLLMs with FINER-Tuning yields up to 24.2% gains (InternVL3.5-14B) on hallucinations from our benchmarks, while simultaneously improving performance on eight existing hallucination suites, and enhancing general multimodal capabilities across six benchmarks. Benchmarks, training data, code and model checkpoints will be released.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37957", "url": null, "sourceid": 43563, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40318?format=json"], "related_events_ids": [40318]}, {"id": 40318, "uid": "ae77ae50016d01f6c9d504e8194b9615", "name": "FINER: MLLMs Hallucinate under Fine-grained Negative Queries", "authors": [{"id": 154718, "fullname": "Rui Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/154718?format=json", "institution": "Technical University of Munich"}, {"id": 154717, "fullname": "Sanghwan Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/154717?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}, {"id": 76276, "fullname": "Yongqin Xian", "url": "http://cvpr.thecvf.com/api/miniconf/users/76276?format=json", "institution": "Google"}, {"id": 154682, "fullname": "Zeynep Akata", "url": "http://cvpr.thecvf.com/api/miniconf/users/154682?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}, {"id": 154719, "fullname": "Stephan Alaniz", "url": "http://cvpr.thecvf.com/api/miniconf/users/154719?format=json", "institution": "T\u00e9l\u00e9com Paris"}], "abstract": "Multimodal large language models (MLLMs) struggle with hallucinations, particularly with fine-grained queries, a challenge underrepresented by existing benchmarks that focus on coarse image-related questions. We introduce **FI**ne-grained **NE**gative que**R**ies (**FINER**), alongside two benchmarks: **FINER-CompreCap** and **FINER-DOCCI**. Using FINER, we analyze hallucinations across four settings: multi-object, multi-attribute, multi-relation, and \u201cwhat\u201d questions. Our benchmarks reveal that MLLMs hallucinate when fine-grained mismatches co-occur with genuinely present elements in the image. To address this, we propose **FINER-Tuning**, leveraging Direct Preference Optimization (DPO) on FINER-inspired data. Finetuning four frontier MLLMs with FINER-Tuning yields up to 24.2% gains (InternVL3.5-14B) on hallucinations from our benchmarks, while simultaneously improving performance on eight existing hallucination suites, and enhancing general multimodal capabilities across six benchmarks. Benchmarks, training data, code and model checkpoints will be released.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40318", "url": null, "sourceid": -43563, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37957?format=json"], "related_events_ids": [37957]}, {"id": 37959, "uid": "01639908adcc6c9f525b328e06566e27", "name": "GFRRN: Explore the Gaps in Single Image Reflection Removal", "authors": [{"id": 177873, "fullname": "Yu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/177873?format=json", "institution": "Zhejiang University"}, {"id": 188687, "fullname": "Zewei He", "url": "http://cvpr.thecvf.com/api/miniconf/users/188687?format=json", "institution": "Zhejiang University"}, {"id": 188688, "fullname": "Xingyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188688?format=json", "institution": "Zhejiang University"}, {"id": 152359, "fullname": "Zixuan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/152359?format=json", "institution": "Zhejiang University"}, {"id": 152362, "fullname": "Zhe-Ming Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152362?format=json", "institution": "Zhejiang University"}], "abstract": "Prior dual-stream methods with the feature interaction mechanism have achieved remarkable performance in single image reflection removal (SIRR). However, they often struggle with (1) semantic understanding gap between the features of pre-trained models and those of reflection removal models, and (2) reflection label inconsistencies between synthetic and real-world training data. In this work, we first adopt the parameter efficient fine-tuning (PEFT) strategy by integrating several learnable Mona layers into the pre-trained model to align the training directions. Then, a label generator is designed to unify the reflection labels for both synthetic and real-world data. In addition, a Gaussian-based Adaptive Frequency Learning Block (G-AFLB) is proposed to adaptively learn and fuse the frequency priors, and a Dynamic Agent Attention (DAA) is employed as an alternative to window-based attention by dynamically modeling the significance levels across windows (inter-) and within an individual window (intra-). These components constitute our proposed Gap-Free Reflection Removal Network (GFRRN). Extensive experiments demonstrate the effectiveness of our GFRRN, achieving superior performance against state-of-the-art SIRR methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37959", "url": null, "sourceid": 45014, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37961, "uid": "600af0c2c15ab477a81bc0c80626f080", "name": "Bridging Fidelity-Reality with Controllable One-Step Diffusion for Image Super-Resolution", "authors": [{"id": 188691, "fullname": "Hao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188691?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 154352, "fullname": "Junyang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/154352?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 87633, "fullname": "Jinshan Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87633?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 89307, "fullname": "Jiangxin Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/89307?format=json", "institution": "Nanjing University of Science and Technology"}], "abstract": "Recent diffusion-based one-step methods have shown remarkable progress in the field of image super-resolution, yet they remain constrained by three critical limitations: (1) inferior fidelity performance caused by the information loss from compression encoding of low-quality (LQ) inputs; (2) insufficient region-discriminative activation of generative priors; (3) misalignment between text prompts and their corresponding semantic regions. To address these limitations, we propose CODSR, a controllable one-step diffusion network for image super-resolution. First, we propose an LQ-guided feature modulation module that leverages original uncompressed information from LQ inputs to provide high-fidelity conditioning for the diffusion process. We then develop a region-adaptive generative prior activation method to effectively enhance perceptual richness without sacrificing local structural fidelity. Finally, we employ a text-matching guidance strategy to fully harness the conditioning potential of text prompts. Extensive experiments demonstrate that CODSR achieves superior perceptual quality and competitive fidelity compared with state-of-the-art methods while maintaining efficient one-step inference.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37961", "url": null, "sourceid": 42269, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37966, "uid": "2b66636acb3cee29eb2990dbc255f3a1", "name": "Dynamic Label Noise Suppression with Optimal Teacher Pool for Facial Expression Recognition", "authors": [{"id": 180855, "fullname": "Yuzhuang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180855?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 180558, "fullname": "Xiaolin Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/180558?format=json", "institution": "Xidian University \uff08\u897f\u5b89\u7535\u5b50\u79d1\u6280\u5927\u5b66\uff09"}, {"id": 188703, "fullname": "Qigong Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/188703?format=json", "institution": "Sensetime"}], "abstract": "Due to the inherent ambiguity of facial expressions and subjectivity in dataset labeling, learning with noisy labels remains a critical challenge in facial expression recognition (FER). The supervisory mechanism of teacher-student network offers a promising approach for noisy-labeled FER. However, this approach is prone to noise accumulation and gradual coupling between the teacher and student parameters during training. We propose an $\\textbf{O}$ptimal $\\textbf{T}$eacher $\\textbf{P}$ool-driven dynamic label $\\textbf{N}$oise $\\textbf{S}$uppression framework for facial expression recognition (OTP-NS). Specifically, we construct an optimal teacher pool architecture that dynamically maintains multiple best teacher models while fusing their predictions, thereby mitigating noise accumulation and coupling of teacher-student parameters via update mechanisms. Furthermore, we develop two sample-level noise suppression parts: (1) Similarity-Aware Label Smoothing (SALS), diverging from the static smoothing strength in traditional label smoothing, automatically modulates the smoothing strength for teacher model based on prediction-label similarity, achieving fine-grained noise suppression. (2) Confidence-Weighted Logits (CWL), adaptively adjusting the classification loss of student model based on sample-to-centroid confidence metrics, alleviates the detrimental effects of noisy samples on model training. Extensive experiments on multiple benchmark datasets demonstrate that our method  outperforms state-of-the-art approaches across various noise levels, validating the effectiveness of our proposed framework in learning robust representations from noisy data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37966", "url": null, "sourceid": 36418, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37968, "uid": "b4a13cb59f877eeee5aa39820330313c", "name": "Breaking Semantic Boundaries: Distribution-Guided Semantic Exploration for Creative Generation", "authors": [{"id": 151464, "fullname": "Fu Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/151464?format=json", "institution": "Southeast University"}, {"id": 157444, "fullname": "Yucheng Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/157444?format=json", "institution": "Southeast University"}, {"id": 188707, "fullname": "Ruixiao Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188707?format=json", "institution": "Southeast University"}, {"id": 126445, "fullname": "Xu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126445?format=json", "institution": "Southeast University"}, {"id": 157445, "fullname": "Jing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157445?format=json", "institution": "Southeast University"}, {"id": 84884, "fullname": "Xin Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/84884?format=json", "institution": "Southeast University"}], "abstract": "Text-to-image (T2I) diffusion models effectively produce semantically aligned images, but their reliance on training distributions constrains their capacity for synthesizing truly novel, out-of-distribution concepts.    Existing methods attempt to enhance creativity through semantic exploration, such as fusing known concept pairs, but the resulting images remain linguistically describable and confined to familiar semantic spaces.    Inspired by the soft probabilistic outputs of classifiers on novel or out-of-distribution inputs, we propose Distribution-Conditional Generation, a paradigm that models novel concepts as image synthesis conditioned on class distributions, enabling controllable yet semantically unconstrained creative generation.    Building on this, we propose DisTok, an encoder\u2013decoder framework that unifies conditional and unconditional creative generation by decoding latent representations\u2014either randomly sampled or mapped from conditions (e.g., class distributions)\u2014into tokens representing novel concepts.    DisTok is trained by iteratively sampling and fusing concept pairs from a dynamic pool to model progressively complex distributions, while enforcing semantic consistency through a vision-language model that aligns the class distributions of generated images with the input distributions.    Extensive experiments demonstrate that DisTok enables efficient and flexible semantic exploration for token-level creative synthesis, achieving state-of-the-art text\u2013image alignment and human preference.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37968", "url": null, "sourceid": 38568, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40319?format=json"], "related_events_ids": [40319]}, {"id": 40319, "uid": "b4a13cb59f877eeee5aa39820330313c", "name": "Breaking Semantic Boundaries: Distribution-Guided Semantic Exploration for Creative Generation", "authors": [{"id": 151464, "fullname": "Fu Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/151464?format=json", "institution": "Southeast University"}, {"id": 157444, "fullname": "Yucheng Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/157444?format=json", "institution": "Southeast University"}, {"id": 188707, "fullname": "Ruixiao Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188707?format=json", "institution": "Southeast University"}, {"id": 126445, "fullname": "Xu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126445?format=json", "institution": "Southeast University"}, {"id": 157445, "fullname": "Jing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157445?format=json", "institution": "Southeast University"}, {"id": 84884, "fullname": "Xin Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/84884?format=json", "institution": "Southeast University"}], "abstract": "Text-to-image (T2I) diffusion models effectively produce semantically aligned images, but their reliance on training distributions constrains their capacity for synthesizing truly novel, out-of-distribution concepts.    Existing methods attempt to enhance creativity through semantic exploration, such as fusing known concept pairs, but the resulting images remain linguistically describable and confined to familiar semantic spaces.    Inspired by the soft probabilistic outputs of classifiers on novel or out-of-distribution inputs, we propose Distribution-Conditional Generation, a paradigm that models novel concepts as image synthesis conditioned on class distributions, enabling controllable yet semantically unconstrained creative generation.    Building on this, we propose DisTok, an encoder\u2013decoder framework that unifies conditional and unconditional creative generation by decoding latent representations\u2014either randomly sampled or mapped from conditions (e.g., class distributions)\u2014into tokens representing novel concepts.    DisTok is trained by iteratively sampling and fusing concept pairs from a dynamic pool to model progressively complex distributions, while enforcing semantic consistency through a vision-language model that aligns the class distributions of generated images with the input distributions.    Extensive experiments demonstrate that DisTok enables efficient and flexible semantic exploration for token-level creative synthesis, achieving state-of-the-art text\u2013image alignment and human preference.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40319", "url": null, "sourceid": -38568, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37968?format=json"], "related_events_ids": [37968]}, {"id": 37970, "uid": "fb5ead1fd337d4ac2581b075b6244ad1", "name": "DetAny4D: Detect Anything 4D Temporally in a Streaming RGB Video", "authors": [{"id": 188710, "fullname": "Jiawei Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/188710?format=json", "institution": "Fudan University"}, {"id": 188712, "fullname": "Shenghao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188712?format=json", "institution": null}, {"id": 181441, "fullname": "Can Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181441?format=json", "institution": null}, {"id": 181691, "fullname": "Zheng Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181691?format=json", "institution": "Shenzhen University"}, {"id": 155745, "fullname": "Yonggen Ling", "url": "http://cvpr.thecvf.com/api/miniconf/users/155745?format=json", "institution": "Tencent Robotics X"}, {"id": 188713, "fullname": "Taiping Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188713?format=json", "institution": "Fudan University"}, {"id": 89233, "fullname": "Xiangyang Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/89233?format=json", "institution": "Fudan University"}, {"id": 113813, "fullname": "Jingbo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/113813?format=json", "institution": "City University of Hong Kong"}], "abstract": "Reliable 4D object detection, which refers to 3D object detection in streaming video, is crucial for perceiving and understanding the real world. Existing open-set 4D object detection methods typically make predictions on a frame-by-frame basis without modeling temporal consistency, or rely on complex multi-stage pipelines that are prone to error propagation across cascaded stages. Progress in this area has been hindered by the lack of large-scale datasets that capture continuous reliable 3D bounding box (b-box) annotations. To overcome these challenges, we first introduce DA4D, a large-scale 4D detection dataset containing over 280k sequences with high-quality b-box annotations collected under diverse conditions. Building on DA4D, we propose DetAny4D, an open-set end-to-end framework that predicts 3D b-boxes directly from sequential inputs. DetAny4D fuses multi-modal features from pre-trained foundational models and designs a geometry-aware spatiotemporal decoder to effectively capture both spatial and temporal dynamics. Furthermore, it adopts a multi-task learning architecture coupled with a dedicated training strategy to maintain global consistency across sequences of varying lengths. Extensive experiments show that DetAny4D achieves competitive detection accuracy and significantly improves temporal stability, effectively addressing long-standing issues of jitter and inconsistency in 4D object detection. Data and code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37970", "url": null, "sourceid": 45486, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37979, "uid": "33bd495470ddcf80911ca403ad6e3dd6", "name": "Conditional Factuality Controlled LLMs with Generalization Certificates via Conformal Sampling", "authors": [{"id": 182713, "fullname": "Kai Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/182713?format=json", "institution": "Case Western Reserve University"}, {"id": 188736, "fullname": "Qingtao Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188736?format=json", "institution": "Case Western Reserve University"}, {"id": 90856, "fullname": "Shuo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/90856?format=json", "institution": "Case Western Reserve University"}], "abstract": "Large language models (LLMs) need reliable test-time control of hallucinations. Existing conformal methods for LLMs typically provide only \\emph{marginal} guarantees and rely on a single global threshold, which can under-cover hard prompts, over-cover easy ones, and produce oversized prediction sets. We propose \\emph{Conditional Factuality Control} (CFC), a black-box conformal framework that returns \\emph{set-valued} outputs with \\emph{conditional} coverage guarantees. CFC learns a continuous, feature-conditional acceptance threshold via augmented quantile regression on a latent ``success'' score (the best score among correct candidates), and uses it to filter samples at inference time. Theoretically, we show that CFC satisfies a conditional coverage guarantee under exchangeability and analyze its \\emph{efficiency}, proving that, under mild assumptions on the score distributions, the conditional rule is strictly more sample-efficient than marginal conformal prediction at the same target coverage. We further derive a PAC-style variant, CFC-PAC, which shrinks the nominal risk level based on a stability bound, yielding a finite-sample certificate that the conditional miscoverage deviates from the target by at most $O(\\sqrt{\\log(1/\\delta)/N})$. Empirically, on synthetic data and real-world reasoning and QA benchmarks, CFC and CFC-PAC consistently attain near-target coverage across difficulty groups while using smaller prediction sets than CP and non-CP baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37979", "url": null, "sourceid": 30599, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37983, "uid": "295ade13c3d8aa09b542b332336a852d", "name": "UniPart: Part-Level 3D Generation with Unified 3D Geom\u2013Seg Latents", "authors": [{"id": 181624, "fullname": "Xufan He", "url": "http://cvpr.thecvf.com/api/miniconf/users/181624?format=json", "institution": null}, {"id": 90360, "fullname": "Yushuang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90360?format=json", "institution": "The Chinese University of Hong Kong (Shenzhen)"}, {"id": 157912, "fullname": "Xiaoyang Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/157912?format=json", "institution": "ByteDance Inc."}, {"id": 90336, "fullname": "Chongjie Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/90336?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 126830, "fullname": "Jiaqing Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/126830?format=json", "institution": "bytedance"}, {"id": 188750, "fullname": "Tianlei Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188750?format=json", "institution": ""}, {"id": 88683, "fullname": "Xiaoguang Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/88683?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 157788, "fullname": "Dong Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/157788?format=json", "institution": "Nanjing University of Science and Technology"}], "abstract": "Part-level 3D generation is essential for applications requiring decomposable and structured 3D synthesis. However, existing methods either rely on implicit part segmentation with limited granularity control or depend on strong external segmenters trained on large annotated datasets. In this work, we observe that part awareness emerges naturally during whole-object geometry learning and propose Geom-Seg VecSet, a unified geometry\u2013segmentation latent representation that jointly encodes object geometry and part-level structure. Building on this representation, we introduce UniPart, a two-stage latent diffusion framework for image-guided part-level 3D generation. The first stage performs joint geometry generation and latent part segmentation, while the second stage conditions part-level diffusion on both whole-object and part-specific latents. A dual-space generation scheme further enhances geometric fidelity by predicting part latents in both global and canonical spaces. Extensive experiments demonstrate that UniPart achieves superior segmentation controllability and part-level geometric quality compared with existing approaches.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37983", "url": null, "sourceid": 43901, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37984, "uid": "9e550bb1034a12dea7d970c623dbd9e6", "name": "Reward Sharpness-Aware Fine-Tuning for Diffusion Models", "authors": [{"id": 182399, "fullname": "Kwanyoung Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/182399?format=json", "institution": "Gwangju Institute of Science and Technology (GIST)"}, {"id": 152103, "fullname": "Byeongsu Sim", "url": "http://cvpr.thecvf.com/api/miniconf/users/152103?format=json", "institution": "Samsung Research"}], "abstract": "Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models with human preferences, inspiring the development of reward-centric diffusion reinforcement learning (RDRL) to achieve similar alignment and controllability. While diffusion models can generate high-quality outputs, RDRL remains susceptible to reward hacking, where the reward score increases without corresponding improvements in perceptual quality. We demonstrate that this vulnerability arises from the non-robustness of reward model gradients, particularly when the reward landscape with respect to the input image is sharp. To mitigate this issue, we introduce methods that exploit gradients from a robustified reward model, without requiring its retraining. Specifically, we employ gradients from a flattened reward model, obtained through parameter perturbations of the diffusion model and perturbations of its generated samples. Empirically, each method independently alleviates reward hacking and improves robustness, while their joint use amplifies these benefits. Our resulting framework, RSA-FT (Reward Sharpness-Aware Fine-Tuning), is simple, broadly compatible, and consistently enhances the reliability of RDRL.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37984", "url": null, "sourceid": 30771, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37986, "uid": "9f855684c47d3fc3ed51ce64cb89708b", "name": "Quantized Residuals to Continuous Prompts for Few-Shot Class Incremental Learning in Vision-Language Models", "authors": [{"id": 182706, "fullname": "Abhishek Kumar Sinha", "url": "http://cvpr.thecvf.com/api/miniconf/users/182706?format=json", "institution": "Indian Institute of Science, Indian institute of science, Bangalore; INDIAN SPACE RESEARCH ORGANIZATION"}, {"id": 188755, "fullname": "Nitant Dube", "url": "http://cvpr.thecvf.com/api/miniconf/users/188755?format=json", "institution": "Space Applications Centre, ISRO"}, {"id": 183196, "fullname": "Soma Biswas", "url": "http://cvpr.thecvf.com/api/miniconf/users/183196?format=json", "institution": "Indian Institute of Science, Bangalore"}], "abstract": "Few-shot Class-Incremental Learning (FSCIL) requires learning new classes from very limited data while preventing catastrophic forgetting. Existing methods rely mainly on visual features and are prone to overfitting, while recent vision\u2013language models (VLMs) offer better transferability but suppress fine-grained information due to contrastive feature decorrelation. Moreover, current FSCIL approaches often use static or fully optimizable prompts, making them either rigid or susceptible to semantic drift in incremental sessions. We introduce QR-Prompt, a residual-driven framework that leverages the visual\u2013textual feature residual of VLMs to recover discriminative fine-grained cues missing from the contrastive space. To ensure stability, we propose Discriminative Subspace Quantization (DSQ), which builds a discrete memory of residual subspaces. To enable plasticity, a Hierarchical Prompt Encoder (HPE) and Prompt Composer (PC) transform these discrete codes into continuous, class-adaptive prompts for novel classes. We derive bounds relating DSQ codebook size to generalization and classification margin, and achieve consistent improvements over state-of-the-art FSCIL methods on CUB200, CIFAR100, and miniImageNet. Our results show that residual-based quantization combined with hierarchical prompt composition yields stable and expressive VLM adaptation for FSCIL.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37986", "url": null, "sourceid": 42618, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37987, "uid": "b19db9e093184a424038e07ab2ce8425", "name": "BiTAgent: A Task-Aware Modular Framework for Bidirectional Coupling between Multimodal Large Language Models and World Models", "authors": [{"id": 188756, "fullname": "Yu-Wei Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188756?format=json", "institution": "Tsinghua University"}, {"id": 77260, "fullname": "Xin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77260?format=json", "institution": "Tsinghua University"}, {"id": 188757, "fullname": "Pengzhe Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188757?format=json", "institution": "Shandong University"}, {"id": 188758, "fullname": "Tongtong Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188758?format=json", "institution": "Tsinghua University"}, {"id": 76409, "fullname": "Ren Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76409?format=json", "institution": "Tsinghua University"}, {"id": 84883, "fullname": "Wenwu Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84883?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Building generalist embodied agents requires a unified system that can interpret multimodal goals, model environment dynamics, and execute reliable actions across diverse real-world tasks. Multimodal large language models (MLLMs) offer strong semantic priors and cross-modal generalization, while world models (WMs) provide actionable latent dynamics for prediction and control. Their combination holds promise for open-ended embodied intelligence, yet introduces two key challenges: (1) establishing a tight coupling between the semantic intent from MLLMs and the dynamic state representations within the WM\u2019s latent space, and (2) achieving task-aware adaptability that supports multi-task learning and cross-environment generalization. To address these limitations, we propose BiTAgent, a task-aware dynamic joint framework that enables bidirectional coupling between MLLMs and WMs. BiTAgent establishes two complementary pathways: a forward path that injects MLLM representations into the WM\u2019s latent space for semantically guided imagination, and a backward path where WM-generated feedback refines the MLLM\u2019s semantic space via dense text-conditioned rewards. This bidirectional interaction is realized through three synergistic components: Task-Aware Dynamic Joint Learning, Task-Aware Behavior Modeling, and MLLM-WM Joint Optimization, which together harmonize semantic reasoning and dynamic prediction. Extensive experiments across multi-task and cross-environment settings demonstrate superior stability and generalization over state-of-the-art baselines, marking a step toward open-ended embodied learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37987", "url": null, "sourceid": 38071, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37985, "uid": "676a5420f160ad3e229d508cd3aefb1f", "name": "Towards Human-Like Robot Handwriting via Contour-Aware Generation", "authors": [{"id": 188751, "fullname": "Yutao Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188751?format=json", "institution": "South China University of Technology"}, {"id": 188752, "fullname": "Gang Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/188752?format=json", "institution": "Guangdong University of Technology"}, {"id": 73774, "fullname": "Yifan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73774?format=json", "institution": "National University of Singapore"}, {"id": 188753, "fullname": "Youwei Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/188753?format=json", "institution": null}, {"id": 188754, "fullname": "Qisheng He", "url": "http://cvpr.thecvf.com/api/miniconf/users/188754?format=json", "institution": "South China University of Technology"}, {"id": 84875, "fullname": "Shuangping Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84875?format=json", "institution": "South China University of Technology"}], "abstract": "Empowering machines to simulate human handwriting is a promising research direction. Most existing methods, however, primarily focus on reproducing the writing trajectory to capture the overall character structure,  while neglecting the critical aspect of stroke contour modeling. Consequently, these methods struggle to generate visually realistic, human-like handwriting, limiting their applicability in scenarios such as calligraphy robots. To address this issue, we propose a new task, called Contour-aware Handwriting Trajectory Reconstruction (CHTR). This task presents two major challenges: 1) Existing handwriting datasets lack stroke contour annotations, making supervised learning difficult; 2) Previous methods are unable to recover stroke contour and preserve the overall character structure jointly. To address the dataset limitation, we present CHTR-110K, a large-scale character dataset with refined stroke contour annotations. To tackle the technical challenge, we propose Graph-based Handwriting Trajectory Reconstruction (G-HTR), a novel method using contour-aware graphs to jointly model stroke contour and character structure. We use a Graph Neural Network to capture structural relationships among nodes and introduce a multi-scale graph learning strategy to encode both fine-grained stroke details and global character structure. Extensive experiments verify the effectiveness of G-HTR, outperforming previous state-of-the-art methods on both our CHTR-110K and the widely-used CASIA-OLHWDB dataset. G-HTR further shows strong real-world results when deployed on robots, confirming its practical value. To support future research, we will release source code and dataset.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37985", "url": null, "sourceid": 39138, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37990, "uid": "ae17a511178b123c2b3c61f0283eeaea", "name": "NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining", "authors": [{"id": 182146, "fullname": "Liang Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/182146?format=json", "institution": "KU Leuven"}, {"id": 166584, "fullname": "Valerio Marsocci", "url": "http://cvpr.thecvf.com/api/miniconf/users/166584?format=json", "institution": "European Space Agency \u03a6-lab"}, {"id": 178398, "fullname": "Wufan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/178398?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 188767, "fullname": "Andrea Nascetti", "url": "http://cvpr.thecvf.com/api/miniconf/users/188767?format=json", "institution": "KTH Royal Institute of Technology"}, {"id": 188768, "fullname": "Maarten Vergauwen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188768?format=json", "institution": "KU Leuven"}], "abstract": "Masked Image Modeling has been one of the most popular self-supervised learning paradigms to learn representations from large-scale, unlabeled Earth Observation images. While incorporating multi-modal and multi-temporal Earth Observation data into Masked Image Modeling has been widely explored, the spatial dependencies between images captured from neighboring areas remains largely overlooked. Since the Earth's surface is continuous, neighboring images are highly related and offer rich contextual information for self-supervised learning. To close this gap, we propose NeighborMAE, which learns spatial dependencies by joint reconstruction of neighboring Earth Observation images. To ensure that the reconstruction remains challenging, we leverage a heuristic strategy to dynamically adjust the mask ratio and the pixel-level loss weight. Since we focus here on geometric (spatial) innovation rather than spectral, we conduct experiment in a RGB setting following previous works. Experimental results across various pretraining datasets and downstream tasks show that NeighborMAE significantly outperforms existing baselines, underscoring the value of neighboring images in Masked Image Modeling for Earth Observation and the efficacy of our designs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37990", "url": null, "sourceid": 32542, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37991, "uid": "10e8bd26bb63fead09767e79b7ee4326", "name": "Demo2Tutorial: From Human Experience to Multimodal Software Tutorials", "authors": [{"id": 129007, "fullname": "Zechen Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/129007?format=json", "institution": "Show Lab, National University of Singapore"}, {"id": 180927, "fullname": "Zhiheng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/180927?format=json", "institution": "National University of Singapore"}, {"id": 150934, "fullname": "Yiqi Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/150934?format=json", "institution": "National University of Singapore"}, {"id": 88436, "fullname": "Kevin Qinghong Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/88436?format=json", "institution": "national university of singaore, National University of Singapore"}, {"id": 77256, "fullname": "Difei Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/77256?format=json", "institution": "NUS"}, {"id": 131811, "fullname": "Xiangwu Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/131811?format=json", "institution": "National University of Singapore"}, {"id": 188769, "fullname": "WANG XIN", "url": "http://cvpr.thecvf.com/api/miniconf/users/188769?format=json", "institution": null}, {"id": 73557, "fullname": "Mike Zheng Shou", "url": "http://cvpr.thecvf.com/api/miniconf/users/73557?format=json", "institution": "National University of Singapore"}], "abstract": "Human experience in digital environments offers a vast, underexplored resource of authentic, untrimmed interactions that contain rich procedural knowledge. We introduce Demo2Tutorial, a framework that transforms this experience captured via screen recordings and interaction logs into structured, multimodal software tutorials for teaching both humans and agents. Demo2Tutorial first collects human experience via a dedicated recorder, then parses raw experience using a multimodal Action Parser to reconstruct perception, action, and intent. A Step Planner then abstracts these steps into hierarchical task graphs representing goals and steps. Finally, a Tutorial Composer transforms the parsed experience into structured, reusable image-text instructions. We evaluate the tutorial generation quality on a new benchmark derived from official software documentation. We further demonstrate that this distilled representation benefits (i) human learning, by automatically generating multimodal tutorials, and (ii) agent learning, by improving downstream GUI-agent planning and generalization. Experiments show Demo2Tutorial produces high-quality tutorials that surpass human-authored ones and significantly outperform baseline methods, while enabling both faster human task completion and improved GUI agent planning, demonstrating that structured tutorials distilled from human experience can serve as effective knowledge representations for advancing both human learning and, ultimately, agent capabilities.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37991", "url": null, "sourceid": 32410, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37998, "uid": "ed4f2064045ca72b7f2a53f052d82c55", "name": "TopoSlide - Topologically-Informed Histopathology Whole Slide Image Representation Learning", "authors": [{"id": 134616, "fullname": "Shahira Abousamra", "url": "http://cvpr.thecvf.com/api/miniconf/users/134616?format=json", "institution": "Stanford University"}, {"id": 178660, "fullname": "Asmita Sood", "url": "http://cvpr.thecvf.com/api/miniconf/users/178660?format=json", "institution": "Stanford University"}, {"id": 188790, "fullname": "Sylvia Plevritis", "url": "http://cvpr.thecvf.com/api/miniconf/users/188790?format=json", "institution": "Stanford University"}], "abstract": "Histopathology whole slide images are massive gigapixel images that present significant challenges in generating effective representations that accurately capture their histological content and the spatial organization of their various components. In this study, we introduce TopoSlide, a novel approach for self-supervised representation learning specifically designed for whole slide histopathology images. Our method leverages topological features of image data to optimize the learning process. We demonstrate that TopoSlide, even when trained on relatively small datasets, achieves comparable or superior performance to existing pathology foundation models across multiple retrieval and linear probing benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37998", "url": null, "sourceid": 45685, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38000, "uid": "f9c99524d8cf6cc6b51eade30dd95437", "name": "GenMatter: Perceiving Physical Objects with Generative Matter Models", "authors": [{"id": 181683, "fullname": "Eric Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181683?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 188795, "fullname": "Arijit Dasgupta", "url": "http://cvpr.thecvf.com/api/miniconf/users/188795?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 188796, "fullname": "Yoni Friedman", "url": "http://cvpr.thecvf.com/api/miniconf/users/188796?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 188797, "fullname": "Mathieu Huot", "url": "http://cvpr.thecvf.com/api/miniconf/users/188797?format=json", "institution": null}, {"id": 188798, "fullname": "Vikash Mansinghka", "url": "http://cvpr.thecvf.com/api/miniconf/users/188798?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 188799, "fullname": "Thomas O&amp;#x27;Connell", "url": "http://cvpr.thecvf.com/api/miniconf/users/188799?format=json", "institution": "MIT"}, {"id": 75516, "fullname": "William Freeman", "url": "http://cvpr.thecvf.com/api/miniconf/users/75516?format=json", "institution": "MIT and Google"}, {"id": 75999, "fullname": "Joshua B. Tenenbaum", "url": "http://cvpr.thecvf.com/api/miniconf/users/75999?format=json", "institution": "MIT"}], "abstract": "Human visual perception offers valuable insights for understanding computational principles of motion-based scene interpretation. Humans robustly detect and segment moving entities that constitute independently moveable chunks of matter, whether observing sparse moving dots, textured surfaces, or naturalistic scenes. In contrast, existing computer vision systems lack a unified approach that works across these diverse settings. Inspired by principles of human perception, we propose a generative model that hierarchically groups low-level motion and appearance features into particles (small Gaussians representing local matter), and groups particles into clusters capturing coherently and independently moveable physical entities. We develop a hardware-accelerated inference algorithm based on parallelized block Gibbs sampling to recover stable particle motion and groupings. Our model operates on different kinds of inputs (random dots, stylized textures, or naturalistic RGB video), enabling it to work across settings where biological vision succeeds but existing computer vision approaches do not. We validate this unified framework across three domains: on 2D random dot kinematograms, our approach captures human object perception including graded uncertainty across ambiguous conditions; on a Gestalt-inspired dataset of camouflaged rotating objects, our approach recovers correct 3D structure from motion and thereby accurate 2D object segmentation; and on naturalistic RGB videos, our model tracks the moving 3D matter that makes up deforming objects, enabling robust object-level scene understanding. This work thus establishes a general framework for motion-based perception grounded in principles of human vision.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38000", "url": null, "sourceid": 39832, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38001, "uid": "cc01090a65650c8670a1d8b13a0a6fc2", "name": "Reinforcing Video Object Segmentation to Think before it Segments", "authors": [{"id": 155018, "fullname": "Sitong Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/155018?format=json", "institution": "Dalian University of Technology"}, {"id": 155019, "fullname": "Yunzhi Zhuge", "url": "http://cvpr.thecvf.com/api/miniconf/users/155019?format=json", "institution": "Dalian University of Technology"}, {"id": 130941, "fullname": "Lu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130941?format=json", "institution": "Dalian University of Technology"}, {"id": 103019, "fullname": "Jiazuo Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/103019?format=json", "institution": "Dalian University of Technology"}, {"id": 128308, "fullname": "Pingping Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128308?format=json", "institution": "Dalian University of Technology"}, {"id": 87542, "fullname": "Xu Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/87542?format=json", "institution": "Dalian University of Technology"}, {"id": 87510, "fullname": "Huchuan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87510?format=json", "institution": "Dalian University of Technology"}], "abstract": "Video reasoning segmentation (VRS) endeavors to delineate referred objects in videos guided by implicit instructions that encapsulate human intent and temporal logic. Previous approaches leverage large vision language models (LVLMs) to encode object semantics into \\SEG tokens for mask prediction. However, this paradigm suffers from limited interpretability during inference and suboptimal performance due to inadequate spatiotemporal reasoning. Drawing inspiration from seminal breakthroughs in reinforcement learning, we introduce Veason-R1, a specialized LVLM for VRS that emphasizes structured reasoning in segmentation.  Veason-R1 is trained through Group Relative Policy Optimization (GRPO) augmented with Chain-of-Thought (CoT) initialization. To begin with, we curate high-quality CoT training data to instill structured reasoning trajectories, bridging video-level semantics and frame-level spatial grounding, yielding the supervised fine-tuned model Veason-SFT. Subsequently, GRPO fine-tuning encourages efficient exploration of the reasoning space by optimizing reasoning chains. To this end, we incorporate a holistic reward mechanism that synergistically enhances spatial alignment and temporal consistency, bolstering keyframe localization and fine-grained grounding.  Comprehensive empirical evaluations demonstrate that Veason-R1 achieves state-of-the-art performance on multiple benchmarks, surpassing prior art by significant margins (\\textit{e.g.}, +1.3 $\\mathcal{J}\\&\\mathcal{F}$ in ReVOS and +10.0 $\\mathcal{J}\\&\\mathcal{F}$ in ReasonVOS), while exhibiting robustness to hallucinations (+8.8 $\\mathcal{R}$).", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38001", "url": null, "sourceid": 33940, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38004, "uid": "02c32194f5d8a18f6d4aa773b988462c", "name": "Continual Distillation of Teachers from Different Domains", "authors": [{"id": 107486, "fullname": "Nicolas Michel", "url": "http://cvpr.thecvf.com/api/miniconf/users/107486?format=json", "institution": "The University of Tokyo"}, {"id": 107372, "fullname": "Maorong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107372?format=json", "institution": "The University of Tokyo"}, {"id": 92188, "fullname": "Jiangpeng He", "url": "http://cvpr.thecvf.com/api/miniconf/users/92188?format=json", "institution": "Purdue University"}, {"id": 92654, "fullname": "Toshihiko Yamasaki", "url": "http://cvpr.thecvf.com/api/miniconf/users/92654?format=json", "institution": "The University of Tokyo"}], "abstract": "Deep learning models continue to scale, with some requiring more storage than numerous datasets. We introduce a new paradigm: Continual Distillation (CD), where a student learns sequentially from a stream of teacher models without retaining earlier teachers. CD faces two challenges: teacher training data is unavailable, and teachers have varying expertise. We show that external unlabeled data enables Unseen Knowledge Transfer (UKT), allowing the student to acquire information from domains not present in the training data, while known to the teacher; but also that sequential distillation causes Unseen Knowledge Forgetting (UKF) when transferred knowledge is lost after training on later teachers. To better trade off between UKT and UKF, we propose Self External Data Distillation (SE2D), a method that preserves logits on external data to stabilize learning across heterogeneous teachers. Experiments on multiple benchmarks show that SE2D reduces UKF and improves cross-domain generalization. The code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38004", "url": null, "sourceid": 44810, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38009, "uid": "2636d8136e53d78459c2718e1bf82d34", "name": "VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression", "authors": [{"id": 133005, "fullname": "Kyle Sargent", "url": "http://cvpr.thecvf.com/api/miniconf/users/133005?format=json", "institution": "Stanford University"}, {"id": 151093, "fullname": "Ruiqi Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/151093?format=json", "institution": "Google"}, {"id": 126134, "fullname": "Philipp Henzler", "url": "http://cvpr.thecvf.com/api/miniconf/users/126134?format=json", "institution": "Google"}, {"id": 87557, "fullname": "Charles Herrmann", "url": "http://cvpr.thecvf.com/api/miniconf/users/87557?format=json", "institution": "Google"}, {"id": 85769, "fullname": "Aleksander Holynski", "url": "http://cvpr.thecvf.com/api/miniconf/users/85769?format=json", "institution": "UC Berkeley &amp; Google Research"}, {"id": 150947, "fullname": "Li Fei-Fei", "url": "http://cvpr.thecvf.com/api/miniconf/users/150947?format=json", "institution": "Stanford University"}, {"id": 84531, "fullname": "Jiajun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84531?format=json", "institution": "Stanford University"}, {"id": 134450, "fullname": "Jason Y. Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/134450?format=json", "institution": "Google"}], "abstract": "Evaluations of image compression performance which include human preferences have generally found that naive distortion functions such as MSE are insufficiently aligned to human perception.In order to align compression models to human perception, prior work has employed differentiable perceptual losses consisting of neural networks calibrated on large-scale datasets of human psycho-visual judgments. We show that, surprisingly, state-of-the-art vision-language models (VLMs) can replicate binary human two-alternative forced choice (2AFC) judgments zero-shot when asked to reason about the differences between pairs of images. Motivated to exploit the powerful zero-shot visual reasoning capabilities of VLMs, we propose Vision Language Models for Image Compression (VLIC), a diffusion-based image compression system designed to be post-trained with binary VLM judgments. VLIC leverages existing techniques for diffusion model post-training with preferences, rather than distilling the VLM judgments into a separate perceptual loss network. We show that calibrating this system on VLM judgments produces competitive or state-of-the-art performance on human-aligned visual compression depending on the dataset, according to perceptual metrics and large-scale user studies. We additionally conduct an extensive analysis of the VLM-based reward design and training procedure and share important insights.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38009", "url": null, "sourceid": 39755, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38010, "uid": "7ea5d3c15c91293b398bc81c9c1d42c4", "name": "SpatialTree: How Spatial Intelligence Branches Out in MLLMs", "authors": [{"id": 107152, "fullname": "Yuxi Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/107152?format=json", "institution": "Zhejiang University"}, {"id": 181647, "fullname": "longfei li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181647?format=json", "institution": "Bytedance"}, {"id": 140057, "fullname": "Shen Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/140057?format=json", "institution": "ByteDance Inc."}, {"id": 91898, "fullname": "Xinhang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/91898?format=json", "institution": "HKUST"}, {"id": 76570, "fullname": "Sida Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/76570?format=json", "institution": "Zhejiang University"}, {"id": 75837, "fullname": "Yunchao Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/75837?format=json", "institution": "Beijing Jiaotong University"}, {"id": 76363, "fullname": "Xiaowei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/76363?format=json", "institution": "Zhejiang University"}, {"id": 128453, "fullname": "Bingyi Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128453?format=json", "institution": "TikTok"}], "abstract": "Spatial Intelligence (SI) has emerged as a critical frontier for MLLMs, encompassing a hierarchy of skills from foundational perception to high level spatial reasoning. However, how these abilities are acquired, emerge, and transferred remains largely unknown. To investigate this, we propose SpatialTree a hierarchical taxonomy that organizes SI into a capability tree\u2014from low level perception (L1), mental mapping (L2), mental simulation (L3), to agentic competence (L4). Building on this, we construct a hierarchical, capability-centric benchmark using our proposed Spatial Engine, annotating each ability according to its level. Guided by the benchmark's correlation analysis, we conduct targeted supervised fine-tuning (SFT) and prompting experiments on key abilities. The results confirm the independence of abilities at the same level, reveal cross-level transfer, and further demonstrate a multi-ability synergy when these abilities are trained jointly. Our work provides a novel framework for analyzing SI in MLLMs, offering a comprehensive methodology to study how foundational abilities emerge and support higher-level competencies.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38010", "url": null, "sourceid": 35859, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38012, "uid": "c0e23406f578063150f4e2951421c995", "name": "UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios", "authors": [{"id": 152540, "fullname": "Tian Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/152540?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 152542, "fullname": "Song Fei", "url": "http://cvpr.thecvf.com/api/miniconf/users/152542?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 187144, "fullname": "Lei Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187144?format=json", "institution": "The Hong Kong University of Science and Technology"}], "abstract": "Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K  across diverse aspect ratios exposes a tightly coupled failure mode spanning positional encoding, VAE compression, and optimization. Tackling any of these factors in isolation leaves substantial quality on the table. We therefore take a data--model co-design view and introduce UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. On the model side, UltraFlux couples (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (iv) a Stage-wise Aesthetic Curriculum Learning strategy that concentrates high-aesthetic supervision on high-noise steps governed by the model prior. Together, these components yield a stable, detail-preserving 4K DiT that generalizes across wide, square, and tall ARs. On the Aesthetic-Eval@4096 benchmark and multi-AR 4K settings, UltraFlux consistently outperforms strong open-source baselines across fidelity, aesthetic, and alignment metrics, and\u2014with a LLM prompt refiner\u2014matches or surpasses the proprietary Seedream 4.0.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38012", "url": null, "sourceid": 37135, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38017, "uid": "2af62a0f91808c286ad2023293b5ac91", "name": "SMVRT: Implicit Human 3D Modeling Using Sparse Multi-view Volumetric Reconstruction with Transformer Fusion", "authors": [{"id": 183997, "fullname": "Chuanmao Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/183997?format=json", "institution": "University of Missouri-Columbia"}, {"id": 169560, "fullname": "Chenxi Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/169560?format=json", "institution": "Clemson University"}, {"id": 139396, "fullname": "Ye Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/139396?format=json", "institution": "Clemson University"}], "abstract": "Recently, the community has witnessed significant progress in human modeling from a single view or multi-views, which often involves \"guessing\" the occluded parts using either generative models or template fitting. In this work, we address these challenges by exploring optimal fusion strategies from sparse views only. We propose an end-to-end implicit 3D reconstruction framework using a sparse multi-view setup. Specifically, we achieve this by exploring fusion blocks at three stages of the network. First, 2D feature encoders carrying out locally and globally, which produce enhanced features. Second, 3D feature grid, formed by attentional fusion of warped multi-view and multi-level 2D features, which follows 3D regularization of feature grids to aggregate spatially coherent multi-view features. Third, attentional 2D3D feature aggregation associated to query point generate  enhanced latent embedding, which is fed into an implicit field decoder for robust occupancy prediction. Evaluation on the THUman 2.1, MultiGarment dataset demonstrates that our system significantly outperforms state-of-the-art methods both qualitatively and quantitatively.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38017", "url": null, "sourceid": 44019, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38018, "uid": "3314dfc68a97404aeecc86a670e50766", "name": "NIL: No-data Imitation Learning", "authors": [{"id": 146436, "fullname": "Mert Albaba", "url": "http://cvpr.thecvf.com/api/miniconf/users/146436?format=json", "institution": "ETH Z\u00fcrich"}, {"id": 188837, "fullname": "Chenhao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188837?format=json", "institution": "ETH Zurich"}, {"id": 93293, "fullname": "Markos Diomataris", "url": "http://cvpr.thecvf.com/api/miniconf/users/93293?format=json", "institution": "Max Planck Institute for Intelligent Systems, Max-Planck Institute"}, {"id": 73811, "fullname": "Omid Taheri", "url": "http://cvpr.thecvf.com/api/miniconf/users/73811?format=json", "institution": "Max Planck Institute for Intelligent Systems, Max-Planck Institute"}, {"id": 188838, "fullname": "Andreas Krause", "url": "http://cvpr.thecvf.com/api/miniconf/users/188838?format=json", "institution": "ETH Zurich"}, {"id": 85690, "fullname": "Michael J. Black", "url": "http://cvpr.thecvf.com/api/miniconf/users/85690?format=json", "institution": "University of T\u00fcbingen"}], "abstract": "Acquiring physically plausible motor skills across diverse and unconventional embodiments, including humanoids and quadrupeds, is essential for advancing character simulation and robotics. Traditional methods, such as reinforcement learning (RL), require extensive reward function engineering. Imitation learning (IL) offers an alternative but relies heavily on curated 3D expert demonstrations, which are scarce and difficult to obtain for non-human morphologies. Video diffusion models, on the other hand, are capable of generating realistic-looking videos of various morphologies, from humans to ants. However, these videos are often not physically plausible, which limits their direct use for skill acquisition. We introduce \"No-data Imitation Learning\" (NIL): an imitation learning framework that replaces curated expert demonstrations with videos generated by a pretrained video diffusion model. Our key insight is that the physics simulator enforces physical constraints, while the video provides visual guidance. NIL learns 3D motor skills in a physics simulator from 2D-generated videos, with generalization capability to unconventional forms. Specifically, NIL computes a discriminator-free imitation reward that combines (i) a video-embedding similarity between generated and simulated videos using a pretrained video vision transformer, and (ii) an image-based similarity term derived from video segmentation masks. We evaluate NIL on locomotion and whole-body control tasks across unique body configurations. Our experiments show that in humanoid locomotion, NIL matches the performance of state-of-the-art IL baselines trained on motion-capture data; and in whole-body manipulation, it exceeds the performance of RL baselines without requiring any curated data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38018", "url": null, "sourceid": 32242, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38020, "uid": "bcc3a047fe706bac8650c8d9533a420a", "name": "MA-Bench: Towards Fine-grained Micro-Action Understanding", "authors": [{"id": 128709, "fullname": "Kun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/128709?format=json", "institution": "Hefei University of Technology"}, {"id": 188841, "fullname": "Jihao Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188841?format=json", "institution": "University College London, University of London"}, {"id": 107609, "fullname": "Fei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107609?format=json", "institution": "Hefei University of Technology"}, {"id": 188842, "fullname": "zhiliang wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188842?format=json", "institution": "Zhejiang University"}, {"id": 163978, "fullname": "Hehe Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/163978?format=json", "institution": "Zhejiang University"}, {"id": 128503, "fullname": "Dan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/128503?format=json", "institution": "Hefei University of Technology"}], "abstract": "With the rapid development of Multimodal Large Language Models (MLLMs), their potential in Micro-Action understanding, a vital role in human emotion analysis, remains unexplored due to the absence of specialized benchmarks. To tackle this issue, we present **MA-Bench, a benchmark comprising 1,000 videos and a three-tier evaluation architecture that progressively examines micro-action perception, relational comprehension, and interpretive reasoning. MA-Bench contains 12,000 structured question\u2013answer pairs, enabling systematic assessment of both recognition accuracy and action interpretation. The results of 23 representative MLLMs reveal that there are significant challenges in capturing motion granularity and fine-grained body-part dynamics. To address these challenges, we further construct **MA-Bench-Train**, a large-scale training corpus with 20.5K videos annotated with structured micro-action captions for fine-tuning MLLMs. The results of Qwen3-VL-8B fine-tuned on MA-Bench-Train show clear performance improvements across micro-action reasoning and explanation tasks. Our work aims to establish a foundation benchmark for advancing MLLMs in understanding subtle micro-action and human-related behaviors.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38020", "url": null, "sourceid": 45664, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38022, "uid": "98af8de5c746356a4b872fa0857e540c", "name": "TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction", "authors": [{"id": 174216, "fullname": "Yukuo Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/174216?format=json", "institution": "Fudan University"}, {"id": 188844, "fullname": "Cong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188844?format=json", "institution": null}, {"id": 77020, "fullname": "Junke Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77020?format=json", "institution": "Fudan University"}, {"id": 188845, "fullname": "Junqi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188845?format=json", "institution": "China Telecom"}, {"id": 87044, "fullname": "Haibin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87044?format=json", "institution": "Kuaishou Technology"}, {"id": 74132, "fullname": "Zuxuan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/74132?format=json", "institution": "Fudan University"}, {"id": 152943, "fullname": "Chi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152943?format=json", "institution": "China Telecom"}, {"id": 185717, "fullname": "Xuelong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185717?format=json", "institution": "China Telecom; Northwestern Polytechnical University"}], "abstract": "We present TempoMaster, a novel framework that formulates long video generation as next-frame-rate prediction. Specifically, we first generate a low-frame-rate clip that serves as a coarse blueprint of the entire video sequence, and then progressively increase the frame rate to refine visual details and motion continuity. During generation, TempoMaster employs bidirectional attention within each frame-rate level while performing autoregression across frame rates, thus achieving long-range temporal coherence while enabling efficient and parallel synthesis.  Extensive experiments demonstrate that TempoMaster establishes a new state-of-the-art in long video generation, excelling in both visual and temporal quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38022", "url": null, "sourceid": 32795, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38074, "uid": "9f035001ce09d0f33f4bf01eb34c15b6", "name": "EMMA: Concept Erasure Benchmark with Comprehensive Semantic Metrics and Diverse Categories", "authors": [{"id": 188992, "fullname": "Lu Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/188992?format=json", "institution": "The University of Osaka"}, {"id": 188993, "fullname": "Yuta Nakashima", "url": "http://cvpr.thecvf.com/api/miniconf/users/188993?format=json", "institution": "The University of Osaka"}, {"id": 89038, "fullname": "Noa Garcia", "url": "http://cvpr.thecvf.com/api/miniconf/users/89038?format=json", "institution": "Osaka University"}], "abstract": "The widespread adoption of text-to-image (T2I) generation has raised concerns about privacy, bias, and copyright violations. Concept erasure techniques offer a promising solution by selectively removing undesired concepts from pre-trained models without requiring full retraining. However, these methods are often evaluated on a limited set of concepts, relying on overly simplistic and direct prompts. To test the boundaries of concept erasure techniques, and assess whether they truly remove targeted concepts from model representations, we introduce EMMA, a benchmark that evaluates five key dimensions of concept erasure over 12 metrics. EMMA goes beyond standard metrics like image quality and time efficiency, testing robustness under challenging conditions, including indirect descriptions, visually similar non-target concepts, and potential gender and ethnicity bias, providing a socially aware analysis of method behavior. Using EMMA, we analyze five concept erasure methods across five domains (objects, celebrities, art styles, NSFW, and copyright).Our results show that existing methods struggle with implicit prompts (i.e. generating the erased concept when it is indirectly referenced) and visually similar non-target concepts (i.e. failing to generated non-targeted concepts resembling the erased one), while some amplify gender and ethnicity bias compared to the original model.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38074", "url": null, "sourceid": 41101, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38028, "uid": "ac8104196de5509e94247b712f18e8f3", "name": "QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy", "authors": [{"id": 103191, "fullname": "Adam Lilja", "url": "http://cvpr.thecvf.com/api/miniconf/users/103191?format=json", "institution": "Zenseact &amp; Chalmers"}, {"id": 161367, "fullname": "Ji Lan", "url": "http://cvpr.thecvf.com/api/miniconf/users/161367?format=json", "institution": "Zenseact AB"}, {"id": 105793, "fullname": "Junsheng Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/105793?format=json", "institution": "Zenseact"}, {"id": 129636, "fullname": "Lars Hammarstrand", "url": "http://cvpr.thecvf.com/api/miniconf/users/129636?format=json", "institution": "Chalmers University of Technology"}], "abstract": "Learning 3D scene geometry and semantics from images is a core challenge in computer vision and a key capability for autonomous driving.Since large-scale 3D annotation is prohibitively expensive, recent work explores self-supervised learning directly from sensor data without manual labels.Existing approaches either rely on 2D rendering consistency, where 3D structure emerges only implicitly, or on discretized voxel grids from accumulated lidar point clouds, limiting spatial precision and scalability.We introduce QueryOcc, a query-based self-supervised framework that learns continuous 3D semantic occupancy directly through independent 4D spatio-temporal queries sampled across adjacent frames.The framework supports supervision from either pseudo-point clouds derived from vision foundation models or raw lidar data.To enable long-range supervision and reasoning under constant memory, we introduce a contractive scene representation that preserves near-field detail while smoothly compressing distant regions.QueryOcc surpasses previous camera-based methods by 26\\% in semantic RayIoU on the self-supervised Occ3D-nuScenes benchmark while running at 11.6 FPS, demonstrating that direct 4D query supervision enables strong self-supervised occupancy learning.Link to code", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38028", "url": null, "sourceid": 43671, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38031, "uid": "2029ecd2552569728dc1a9825542fd40", "name": "FILTR: Extracting Topological Features from Pretrained 3D Models", "authors": [{"id": 183778, "fullname": "Louis Martinez", "url": "http://cvpr.thecvf.com/api/miniconf/users/183778?format=json", "institution": "Ecole Polytechnique"}, {"id": 85376, "fullname": "Maks Ovsjanikov", "url": "http://cvpr.thecvf.com/api/miniconf/users/85376?format=json", "institution": "Ecole Polytechnique, France"}], "abstract": "Recent advances in pretraining 3D point cloud encoders (e.g., Point-BERT, Point-MAE) have produced powerful models, whose abilities are typically evaluated on geometric or semantic tasks. At the same time, topological descriptors have been shown to provide informative summaries of a shape's multiscale structure. In this paper we pose the question whether topological information can be derived from features produced by 3D encoders. To address this question, we first introduce DONUT, a synthetic benchmark with controlled topological complexity, and propose FILTR (Filtration Transformer), a learnable framework to predict persistence diagrams directly from frozen encoders. FILTR adapts a transformer decoder to treat diagram generation as a set prediction task. Our analysis on DONUT reveals that existing encoders retain only limited global topological signals, yet FILTR successfully leverages information produced by these encoders to approximate persistence diagrams. Our approach enables, for the first time, data-driven extraction of persistence diagrams from raw point clouds through an efficient learnable feed-forward mechanism.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38031", "url": "https://filtr-topology.github.io/", "sourceid": 33981, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40323?format=json"], "related_events_ids": [40323]}, {"id": 38033, "uid": "165d6a6256273a3430fa4891413ac2f0", "name": "Iris: Integrating Language into Diffusion-based Monocular Depth Estimation", "authors": [{"id": 101985, "fullname": "Ziyao Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/101985?format=json", "institution": "Yale University"}, {"id": 182545, "fullname": "Jingcheng Ni", "url": "http://cvpr.thecvf.com/api/miniconf/users/182545?format=json", "institution": "Brown University"}, {"id": 130826, "fullname": "Daniel Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130826?format=json", "institution": "Yale University"}, {"id": 156899, "fullname": "Patrick Rim", "url": "http://cvpr.thecvf.com/api/miniconf/users/156899?format=json", "institution": "Yale University / Google"}, {"id": 148491, "fullname": "Younjoon Chung", "url": "http://cvpr.thecvf.com/api/miniconf/users/148491?format=json", "institution": "Carnegie Mellon University"}, {"id": 128507, "fullname": "Fengyu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128507?format=json", "institution": "Yale University"}, {"id": 149177, "fullname": "Byung-Woo Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/149177?format=json", "institution": "Chung-Ang University"}, {"id": 72573, "fullname": "Alex Wong", "url": "http://cvpr.thecvf.com/api/miniconf/users/72573?format=json", "institution": "Yale University"}], "abstract": "Traditional monocular depth estimation suffers from inherent ambiguity and visual nuisances. We demonstrate that language can enhance monocular depth estimation by providing an additional condition (rather than images alone) aligned with plausible 3D scenes, thereby reducing the solution space for depth estimation. This conditional distribution is learned during the text-to-image pre-training of diffusion models. To generate images under various viewpoints and layouts that precisely reflect textual descriptions, the model implicitly models object sizes, shapes, and scales, their spatial relationships, and the overall scene structure. In this paper, Iris, we investigate the benefits of our strategy to integrate text descriptions into training and inference of diffusion-based depth estimation models. We experiment with three different diffusion-based monocular depth estimators (Marigold, Lotus, and E2E-FT) and their variants. By training on HyperSim and Virtual KITTI, and evaluating on NYUv2, KITTI, ETH3D, ScanNet, and DIODE, we find that our strategy improves the overall monocular depth estimation accuracy, especially in small areas. It also improves the model's depth perception of specific regions described in the text. We find that by providing more details in the text, the depth predication can be iteratively refined. Simultaneously, we find that language can act as a constraint to accelerate the convergence of both training and the inference diffusion trajectory. Code and generated text data will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38033", "url": null, "sourceid": 43443, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40323, "uid": "2029ecd2552569728dc1a9825542fd40", "name": "FILTR: Extracting Topological Features from Pretrained 3D Models", "authors": [{"id": 183778, "fullname": "Louis Martinez", "url": "http://cvpr.thecvf.com/api/miniconf/users/183778?format=json", "institution": "Ecole Polytechnique"}, {"id": 85376, "fullname": "Maks Ovsjanikov", "url": "http://cvpr.thecvf.com/api/miniconf/users/85376?format=json", "institution": "Ecole Polytechnique, France"}], "abstract": "Recent advances in pretraining 3D point cloud encoders (e.g., Point-BERT, Point-MAE) have produced powerful models, whose abilities are typically evaluated on geometric or semantic tasks. At the same time, topological descriptors have been shown to provide informative summaries of a shape's multiscale structure. In this paper we pose the question whether topological information can be derived from features produced by 3D encoders. To address this question, we first introduce DONUT, a synthetic benchmark with controlled topological complexity, and propose FILTR (Filtration Transformer), a learnable framework to predict persistence diagrams directly from frozen encoders. FILTR adapts a transformer decoder to treat diagram generation as a set prediction task. Our analysis on DONUT reveals that existing encoders retain only limited global topological signals, yet FILTR successfully leverages information produced by these encoders to approximate persistence diagrams. Our approach enables, for the first time, data-driven extraction of persistence diagrams from raw point clouds through an efficient learnable feed-forward mechanism.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40323", "url": "https://filtr-topology.github.io/", "sourceid": -33981, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38031?format=json"], "related_events_ids": [38031]}, {"id": 38036, "uid": "71bdb6f4ebabe93c8809e94d5db2a8f2", "name": "NEAF: Natural Image Editing with Attention Fusion for Generalizable Tuning-Free Text-Guided Image Editing", "authors": [{"id": 183286, "fullname": "Jisoo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/183286?format=json", "institution": "Hansung University"}, {"id": 181990, "fullname": "Heeseok Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/181990?format=json", "institution": "Hansung University"}], "abstract": "Diffusion-based text-to-image (T2I) models have enabled remarkable generative capabilities, yet precise text-based image editing that preserves the original\u2019s structural and perceptual fidelity remains non-trivial. Existing approaches either rely on retraining with large bespoke datasets, incurring significant computational and curation costs, or adopt lightweight fine-tuning strategies that still require optimization and often fail in fine-grained or semantically complex edits.We propose NEAF (Natural image Editing with Attention Fusion), a novel zero-shot, universal tuning-free framework for arbitrary T2I models, obviating the need for dataset curation or retraining. NEAF introduces a lightweight, learnable XA-Conductor module that dynamically identifies salient cross-attention contributions pertinent to the edit. This module optimizes a weight vector to orchestrate an adaptive fusion of cross-attention maps derived from the source, edited, and reconstruction branches. This triadic-feedback optimization strategy ensures the precise instantiation of user directives while rigorously preserving the fidelity of quiescent regions.Extensive experiments validate NEAF as a flexible and general framework that consistently surpasses existing methods across diverse editing tasks, demonstrating particular dominance in complex, non-rigid editing scenarios where other approaches falter.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38036", "url": null, "sourceid": 38157, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38043, "uid": "419503081d19da6921ec70b5e5ac8dfb", "name": "MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures", "authors": [{"id": 183155, "fullname": "Tim Strohmeyer", "url": "http://cvpr.thecvf.com/api/miniconf/users/183155?format=json", "institution": "International Business Machines (IBM)"}, {"id": 143941, "fullname": "Lucas Morin", "url": "http://cvpr.thecvf.com/api/miniconf/users/143941?format=json", "institution": "IBM Research, ETH Z\u00fcrich"}, {"id": 155354, "fullname": "Gerhard Ingmar Meijer", "url": "http://cvpr.thecvf.com/api/miniconf/users/155354?format=json", "institution": "International Business Machines"}, {"id": 155353, "fullname": "Valery Weber", "url": "http://cvpr.thecvf.com/api/miniconf/users/155353?format=json", "institution": null}, {"id": 144771, "fullname": "Ahmed Nassar", "url": "http://cvpr.thecvf.com/api/miniconf/users/144771?format=json", "institution": "International Business Machines"}, {"id": 155355, "fullname": "Peter W. J. Staar", "url": "http://cvpr.thecvf.com/api/miniconf/users/155355?format=json", "institution": "International Business Machines"}], "abstract": "Automatically extracting chemical structures from documents is essential for information retrieval in the chemistry domain and for creating training datasets for machine learning models. Automatic pipelines have been developed to recognize molecules represented either in figures or in text independently. However, methods for recognizing chemical structures from multimodal descriptions (Markush structures) lag behind in precision and cannot be used for automatic large-scale processing.In this work, we present MarkushGrapher-2, an end-to-end approach for the multimodal recognition of chemical structures in documents. First, our method employs a dedicated OCR model to extract text from chemical images. Second, the text, image, and layout information are jointly encoded through a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder. Finally, the resulting encodings are effectively fused through a two-stage training strategy and used to auto-regressively generate a representation of the Markush structure.To address the lack of training data, we introduce an automatic pipeline for constructing a large-scale dataset of real-world Markush structures. In addition, we present IP5-M-1k, a large manually-annotated benchmark of real-world Markush structures, designed to advance research on this challenging task. Extensive experiments show that our approach substantially outperforms state-of-the-art models in multimodal Markush structure recognition, while maintaining strong performance in molecule structure recognition. Code, models, and datasets will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38043", "url": null, "sourceid": 42298, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38046, "uid": "c029d9d4b124cef75291bea08bd390c5", "name": "Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes", "authors": [{"id": 180903, "fullname": "Changqing Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/180903?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 145274, "fullname": "Yueru Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/145274?format=json", "institution": "Beijing University of Post and Telecommunications; The Chinese University of Hong Kong"}, {"id": 188912, "fullname": "Han Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188912?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 188913, "fullname": "Zeyu Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188913?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 187943, "fullname": "Changhao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187943?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}], "abstract": "Open-vocabulary 3D occupancy is vital for embodied agents, which need to understand complex indoor environments where semantic categories are abundant and evolve beyond fixed taxonomies. While recent work has explored open-vocabulary occupancy in outdoor driving scenarios, such methods transfer poorly indoors, where geometry is denser, layouts are more intricate, and semantics are far more fine-grained. To address these challenges, we adopt a geometry-only supervision paradigm that uses only binary occupancy labels (occupied vs. free). Our framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding. On the geometry side, we find that existing Gaussian-to-Occupancy operators fail to converge under such weak supervision, and we introduce an opacity-aware, Poisson-based approach that stabilizes volumetric aggregation. On the semantic side, direct alignment between rendered features and open-vocabulary segmentation features suffers from feature mixing; we therefore propose a Progressive Temperature Decay schedule that gradually sharpens opacities during splatting, strengthening Gaussian\u2013language alignment. On Occ-ScanNet, our framework achieves 59.50 IoU and 21.05 mIoU in the open-vocabulary setting, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by a large margin in mIoU. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38046", "url": null, "sourceid": 40417, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40324?format=json"], "related_events_ids": [40324]}, {"id": 38094, "uid": "f76d445d6ca1d2d62872bc0ea1c298bf", "name": "GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion", "authors": [{"id": 172227, "fullname": "Enda Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/172227?format=json", "institution": "Beihang University"}, {"id": 126989, "fullname": "Haoxiang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/126989?format=json", "institution": "Beihang University"}, {"id": 159457, "fullname": "Xinzhu Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/159457?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 189043, "fullname": "Zicheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189043?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 87605, "fullname": "Di Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87605?format=json", "institution": "Beihang University"}], "abstract": "This paper focuses on enhancing the grasping precision and generalization of manipulation policies learned via imitation learning. Diffusion-based policy learning methods have recently become the mainstream approach for robotic manipulation tasks. As grasping is a critical subtask in manipulation, the ability of imitation-learned policies to execute precise and generalizable grasps merits particular attention. Existing imitation learning techniques for grasping often suffer from imprecise grasp executions, limited spatial generalization, and poor object generalization. To address these challenges, we incorporate grasp prior knowledge into the diffusion policy framework. In particular, we employ a latent diffusion policy to guide action chunk decoding with grasp pose prior, ensuring that generated motion trajectories adhere closely to feasible grasp configurations. Furthermore, we introduce a self-supervised reconstruction objective during diffusion to embed the graspness prior: at each reverse diffusion step, we reconstruct wrist-camera images back-projected the graspness from the intermediate representations. Both simulation and real robot experiments demonstrate that our approach significantly outperforms baseline methods and exhibits strong dynamic grasping capabilities.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38094", "url": null, "sourceid": 39237, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38054, "uid": "9593a442d5efc4732b3c25e7381d4049", "name": "PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation", "authors": [{"id": 150958, "fullname": "Wenlong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150958?format=json", "institution": "Stanford University"}, {"id": 76010, "fullname": "Yu-Wei Chao", "url": "http://cvpr.thecvf.com/api/miniconf/users/76010?format=json", "institution": "NVIDIA"}, {"id": 136921, "fullname": "Arsalan Mousavian", "url": "http://cvpr.thecvf.com/api/miniconf/users/136921?format=json", "institution": "NVIDIA"}, {"id": 90941, "fullname": "Ming-Yu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90941?format=json", "institution": "NVIDIA"}, {"id": 87330, "fullname": "Dieter Fox", "url": "http://cvpr.thecvf.com/api/miniconf/users/87330?format=json", "institution": "University of Washington"}, {"id": 73488, "fullname": "Kaichun Mo", "url": "http://cvpr.thecvf.com/api/miniconf/users/73488?format=json", "institution": "NVIDIA Research"}, {"id": 150947, "fullname": "Li Fei-Fei", "url": "http://cvpr.thecvf.com/api/miniconf/users/150947?format=json", "institution": "Stanford University"}], "abstract": "Humans anticipate, from a glance and a contemplated action of their bodies, how the 3D world will respond, a capability that is equally vital for robotic manipulation. We introduce PointWorld, a large pre-trained 3D world model that unifies state and action in a shared 3D space as 3D point flows: given one or few RGB-D images and a sequence of low-level robot action commands, PointWorld forecasts per-pixel displacements in 3D that respond to the given actions. By representing actions as 3D point flows instead of embodiment-specific action spaces (e.g., joint positions), this formulation directly conditions on physical geometries of robots, crucial for contact reasoning, while seamlessly integrating learning across embodiments. To train our 3D world model, we curate a large-scale dataset spanning real and simulated robotic manipulation in open-world environments, enabled by recent advances in 3D vision and simulated environments, totaling about 2M trajectories and 500 hours across a single-arm Franka and a bimanual humanoid. Through rigorous, large-scale empirical studies of backbones, action representations, learning objectives, partial observability, data mixtures, domain transfers, and scaling, we distill design principles for large-scale 3D world modeling. With a real-time (0.1s) inference speed, PointWorld can be efficiently integrated in the model-predictive control (MPC) framework for manipulation. We demonstrate that a single pre-trained checkpoint enables a real-world Franka robot to perform rigid-body pushing, deformable and articulated object manipulation, and tool use, without requiring any demonstrations or post-training and all from a single image captured in-the-wild.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38054", "url": null, "sourceid": 36239, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38055, "uid": "bd8581995b520d5df3b7e28a47b427b1", "name": "EDGS: Eliminating Densification for Efficient Convergence of 3DGS", "authors": [{"id": 188939, "fullname": "Dmytro Kotovenko", "url": "http://cvpr.thecvf.com/api/miniconf/users/188939?format=json", "institution": null}, {"id": 180106, "fullname": "Olga Grebenkova", "url": "http://cvpr.thecvf.com/api/miniconf/users/180106?format=json", "institution": "LMU Munich"}, {"id": 85132, "fullname": "Bj\u00f6rn Ommer", "url": "http://cvpr.thecvf.com/api/miniconf/users/85132?format=json", "institution": "University of Munich"}], "abstract": "3D Gaussian Splatting reconstructs scenes by starting from a sparse Structure-from-Motion initialization and refiningunder-reconstructed regions. This process is slow, as it requires multiple densification steps where Gaussians arerepeatedly split and adjusted, following a lengthy optimization path. Moreover, this incremental approach often yieldssuboptimal renderings in high-frequency regions. We propose a fundamentally different approach: eliminate densification with a one-step approximation of scenegeometry using triangulated pixels from dense image correspondences. This dense initialization allows us to estimatethe rough geometry of the scene while preserving rich details from input RGB images, providing each Gaussian withwell-informed color, scale, and position. As a result, we dramatically shorten the optimization path and remove theneed for densification. Unlike methods that rely on sparse keypoints, our dense initialization ensures uniform detailacross the scene, even in high-frequency regions where other methods struggle. Moreover, since all splats are initializedin parallel at the start of optimization, we remove the need to wait for densification to adjust new Gaussians.EDGS reaches LPIPS and SSIM performance of standard 3DGS significantly faster than existing efficiency-focused approaches. When trained further, it exceeds the reconstruction quality of state-of-the-art models aimed at maximizing fidelity. Our method is fully compatible with other acceleration techniques, making it a versatile and efficient solution that can be integrated with existing approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38055", "url": null, "sourceid": 38402, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38059, "uid": "9a3f34a2d6ad7dcd61c116f52e398d81", "name": "Diffusion-Based Native Adversarial Synthesis for Enhanced Medical Segmentation Generalization", "authors": [{"id": 182301, "fullname": "Hongyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182301?format=json", "institution": "Jilin University"}, {"id": 128934, "fullname": "Haipeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/128934?format=json", "institution": "Jilin University"}, {"id": 188943, "fullname": "Zhimin Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188943?format=json", "institution": null}, {"id": 188944, "fullname": "Chengxin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188944?format=json", "institution": "Jilin University; Chongqing University of Technology"}, {"id": 128961, "fullname": "Yingda Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128961?format=json", "institution": "Jilin University"}], "abstract": "Diffusion models (DMs) demonstrate strong capabilities in generating anatomically realistic medical images, enabling promising avenues for improving model generalization via synthetic augmentation. However, bridging the gap between generative prowess (realism) and measurable improvements in downstream generalization (utility) remains a key challenge. This work unifies theory and practice to tackle two central questions: (1) What to synthesize? We identify synthetic adversariality\u2014the expected empirical loss induced by synthetic data\u2014as a key driver of generalization. Crucially, only native adversariality (i.e., hard examples drawn from the DM's distribution) yields consistent improvements, while artificial adversariality from attack-style perturbations degrades performance. (2) How to synthesize? We introduce the Adversariality Miner, a lightweight, plug-and-play module that efficiently selects initial noise to elicit native adversarial samples, without modifying or retraining the DM. Extensive experiments across diverse diffusion backbones and medical benchmarks confirm the effectiveness of our approach, establishing a principled path toward diffusion-driven generalization.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38059", "url": null, "sourceid": 30669, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38062, "uid": "1757e3f18eafbd664e0bc8cd4c2e0e39", "name": "FunFact: Building Probabilistic Functional 3D Scene Graphs via Factor-Graph Reasoning", "authors": [{"id": 188952, "fullname": "Zhengyu Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188952?format=json", "institution": null}, {"id": 185191, "fullname": "Ren\u00e9 Zurbr\u00fcgg", "url": "http://cvpr.thecvf.com/api/miniconf/users/185191?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 188953, "fullname": "Kaixian Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188953?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 73915, "fullname": "Marc Pollefeys", "url": "http://cvpr.thecvf.com/api/miniconf/users/73915?format=json", "institution": "ETH Zurich / Microsoft"}, {"id": 88665, "fullname": "Marco Hutter", "url": "http://cvpr.thecvf.com/api/miniconf/users/88665?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 139149, "fullname": "Hermann Blum", "url": "http://cvpr.thecvf.com/api/miniconf/users/139149?format=json", "institution": "Uni Bonn                Lamarr Institute"}, {"id": 151077, "fullname": "Zuria Bauer", "url": "http://cvpr.thecvf.com/api/miniconf/users/151077?format=json", "institution": "ETH Zurich"}], "abstract": "Recent work in 3D scene understanding has begun to shift from purely spatial analysis to the more complex challenge of functional scene understanding.However, existing methods often consider functional relationships between object pairs in isolation, failing to capture the scene-wide interdependencies that humans use to resolve ambiguity. We introduce FunFact, a framework for constructing probabilistic open-vocabulary functional 3D scene graphs from posed RGB-D images. FunFact first builds an object- and part-centric 3D map and uses foundation models to propose semantically plausible functional relations. These candidates are converted into factor graph variables and constrained by both LLM-derived common-sense priors and geometric priors. This formulation enables joint probabilistic inference over all functional edges and their uncertainties, yielding substantially better-calibrated confidence scores. To benchmark this setting, we also introduce FunThor, a synthetic dataset based on AI2THOR with part-level geometry and systematically-defined rule-based functional annotations. Experiments on SceneFun3D, FunGraph3D, and FunThor show that FunFact improves node and relation discovery recall and significantly reduces calibration error for ambiguous relations, highlighting the benefits of holistic probabilistic modeling for functional scene understanding. We will release the code and dataset to facilitate future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38062", "url": null, "sourceid": 37136, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38068, "uid": "57c50209953cb4fb4c0e9b9631f3802c", "name": "Boosting Reasoning in Large Multimodal Models via Activation Replay", "authors": [{"id": 128445, "fullname": "Yun Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/128445?format=json", "institution": "Nanyang Technological University"}, {"id": 152689, "fullname": "Xiaobin Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152689?format=json", "institution": "Tencent AI Lab"}, {"id": 154787, "fullname": "Qingdong He", "url": "http://cvpr.thecvf.com/api/miniconf/users/154787?format=json", "institution": "Tencent Youtu Lab"}, {"id": 88656, "fullname": "Jiangning Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88656?format=json", "institution": "Tencent Youtu Lab"}, {"id": 106922, "fullname": "Shuicheng Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/106922?format=json", "institution": "National University of Singapore, Department of Electrical and Computer Engineering"}, {"id": 87301, "fullname": "Shijian Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87301?format=json", "institution": "Nanyang Technological University"}, {"id": 86826, "fullname": "Yu-Gang Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86826?format=json", "institution": "Fudan University"}], "abstract": "Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach to incentivizing reasoning capability in Large Multimodal Models (LMMs), while the underlying mechanisms behind this post-training paradigm are poorly understood. We begin by exploring how input activations are affected by RLVR through the perspective of logit lens. Our systematic investigations across multiple post-trained LMMs suggest that RLVR shifts low-entropy activations unexpectedly, while high-entropy ones are less affected. We further demonstrate that such phenomena are associated with LMM reasoning by controlled experiments, suggesting a potentially beneficial role of modulating low-entropy activations. To this end, we propose Activation Replay, a novel simple yet effective training-free approach that boosts multimodal reasoning of post-trained LMMs without requiring expensive policy optimization. Our design involves manipulation of visual tokens at test time, replaying low-entropy activations from the input context of base LMMs to regulating the RLVR counterparts. Activation Replay triggers better reasoning across diverse scenarios, including mathematics, o3-like visual agents, and video reasoning. We further show that Activation Replay boosts Pass@K and mitigates narrower reasoning coverage of RLVR. Our design is compared against alternative choices, such as replaying high-entropy activations instead of low-entropy ones, or direct cross-model intervention instead of manipulating input tokens, demonstrating the superiority of our implementation. Codes will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38068", "url": null, "sourceid": 31621, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38072, "uid": "387ba2ba73ce1e4c81fb09175bf2a1f3", "name": "FusionRegister: Every Infrared and Visible Image Fusion Deserves Registrtaion", "authors": [{"id": 175452, "fullname": "Congcong Bian", "url": "http://cvpr.thecvf.com/api/miniconf/users/175452?format=json", "institution": "Jiangnan University"}, {"id": 188989, "fullname": "HaoLong Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/188989?format=json", "institution": "Jiangnan University"}, {"id": 157157, "fullname": "Hui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/157157?format=json", "institution": "School of Artificial Intelligence and Computer Science"}, {"id": 174148, "fullname": "Zhongwei Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/174148?format=json", "institution": "Suzhou University of Science and Technology"}, {"id": 152476, "fullname": "Xiaoqing Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/152476?format=json", "institution": "Jiangnan University"}, {"id": 188990, "fullname": "Xiaoning Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/188990?format=json", "institution": "Jiangnan University"}, {"id": 129533, "fullname": "Xiaojun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129533?format=json", "institution": "Jiangnan University"}], "abstract": "Spatial registration across different visual modalities is a critical but formidable step in multi-modality image fusion for real-world perception. Although there are several methods are proposed to address this issue, the existing registration joint fusion methods typically require extensive pre-registration operations, limiting their efficiency. To overcome these limitations, a general cross modality registration method guided by visual priors is proposed for multi-modality image fusion task, termed as FusionRegister.Firstly, FusionRegister achieves robustness by learning cross-modality misregistration representations rather than forcing alignment of all differences, ensuring stable outputs even under challenging input conditions.Moreover, FusionRegister demonstrates strong generality by operating directly on fused results, where misregistration is explicitly represented and effectively handled, enabling seamless integration with diverse fusion methods while preserving their intrinsic properties. In addition, its efficiency is further enhanced by serving the backbone fusion method as a natural visual prior provider, which guide the registration process to focus only on regions affected by misregistration, thereby avoiding redundant operation. Extensive experiments on three datasets demonstrate that FusionRegister not only inherits the fusion quality of state-of-the-art methods, but also delivers superior detail alignment, robustness, and adaptability, making it highly suitable for any infrared and visible image fusion method. The code is available in supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38072", "url": null, "sourceid": 41111, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38073, "uid": "9a993917980a1319b27acd7ead4cd93b", "name": "FlashPortrait: 6$\\times$ Faster Infinite Portrait Animation with Adaptive Latent Prediction", "authors": [{"id": 127163, "fullname": "Shuyuan Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127163?format=json", "institution": "Fudan University"}, {"id": 153229, "fullname": "Yueming Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153229?format=json", "institution": "Institute of Artificial Intelligence and Robotics, Xi\u2019an Jiaotong University"}, {"id": 188991, "fullname": "Yinming Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188991?format=json", "institution": "Fudan University"}, {"id": 90390, "fullname": "Xintong Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/90390?format=json", "institution": "Huya Inc"}, {"id": 86636, "fullname": "Zhen Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/86636?format=json", "institution": "Fudan University"}, {"id": 85511, "fullname": "Qi Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/85511?format=json", "institution": "Microsoft Research Asia"}, {"id": 130460, "fullname": "Kai Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130460?format=json", "institution": "Microsoft"}, {"id": 86583, "fullname": "Chong Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/86583?format=json", "institution": "Microsoft Research Asia"}, {"id": 74132, "fullname": "Zuxuan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/74132?format=json", "institution": "Fudan University"}], "abstract": "Current diffusion-based acceleration methods for long-portrait animation struggle to ensure identity (ID) consistency. This paper presents FlashPortrait, an end-to-end video diffusion transformer capable of synthesizing ID-preserving, infinite-length videos while achieving up to 6$\\times$ acceleration in inference speed. In particular, FlashPortrait begins by computing the identity-agnostic facial expression features with an off-the-shelf extractor. It then introduces a Normalized Facial Expression Block to align facial features with diffusion latents by normalizing them with their respective means and variances, thereby improving identity stability in facial modeling. During inference, FlashPortrait adopts a dynamic sliding-window scheme with weighted blending in overlapping areas, ensuring smooth transitions and ID consistency in long animations. In each context window, based on the latent variation rate at particular timesteps and the derivative magnitude ratio among diffusion layers, FlashPortrait utilizes higher-order latent derivatives at the current timestep to directly predict latents at future timesteps, thereby skipping several denoising steps and achieving 6$\\times$ speed acceleration. Experiments on benchmarks show the effectiveness of FlashPortrait both qualitatively and quantitatively.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38073", "url": null, "sourceid": 31563, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38082, "uid": "3fa14c8a59cc0c01c983bb7cbdce16d5", "name": "HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling", "authors": [{"id": 153234, "fullname": "Joungbin An", "url": "http://cvpr.thecvf.com/api/miniconf/users/153234?format=json", "institution": "The University of Texas at Austin"}, {"id": 69188, "fullname": "Kristen Grauman", "url": "http://cvpr.thecvf.com/api/miniconf/users/69188?format=json", "institution": "University of Texas at Austin"}], "abstract": "Video temporal grounding, the task of localizing the start and end times of a natural language query in untrimmed video, requires capturing both global context and fine-grained temporal detail. This challenge is particularly pronounced in long videos, where existing methods often compromise temporal fidelity by over-downsampling or using fixed windows. We present HieraMamba, a hierarchical architecture that preserves temporal structure and semantic richness across scales. At its core are Anchor-MambaPooling (AMP) blocks, which utilize Mamba\u2019s selective scanning to produce compact anchor tokens summarizing video content across scales. We further introduce anchor-conditioned and segment-pooled contrastive losses-two complementary objectives that encourage anchors to retain local detail while remaining globally discriminative. HieraMamba sets a new state-of-the-art on Ego4D-NLQ, MAD, and TACoS, demonstrating precise, temporally faithful localization in long, untrimmed videos.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38082", "url": null, "sourceid": 38635, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38083, "uid": "597d738d76c1e2822a00f65736b20f06", "name": "Decoupled and Reusable Adaptation for Efficient Cross-Modal Transfer", "authors": [{"id": 100778, "fullname": "Yajing Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/100778?format=json", "institution": "Shenyang Institute of Automation, Chinese Academy of Sciences"}, {"id": 189025, "fullname": "Yumeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189025?format=json", "institution": "Shenyang Ligong University"}, {"id": 189026, "fullname": "Yue Si", "url": "http://cvpr.thecvf.com/api/miniconf/users/189026?format=json", "institution": "Shenyang Ligong University"}, {"id": 129300, "fullname": "Baojie Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/129300?format=json", "institution": "Nanjing University of Posts and Telecommunications"}, {"id": 129312, "fullname": "Jiandong Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/129312?format=json", "institution": "The Shenyang Institute of Automation, Chinese Academy of Sciences"}], "abstract": "Cross-modal transfer methods have achieved significant progress in extending RGB-based foundation models to non-RGB modalities. However, existing transfer paradigms are primarily task-oriented, meaning that changing tasks requires re-training and re-storing, leading to substantial redundancy in data, computation and storage. To address this limitation, we propose an efficient cross-modal transfer paradigm that decouples the process into a one-time general modality knowledge transfer and a flexible task knowledge transfer. In Stage 1, we propose a Progressive Self-Supervised Tuning strategy that integrates modality-aware structural reconstruction with semantic discriminative learning, which enables task-agnostic modality knowledge learning using only unlabeled data through a one-time training process, resulting in reusable target-modality LoRAs. In Stage 2, we incorporate the modality LoRAs and further propose a Task-Prompted Mixture-of-Modality Experts module. This design enables lightweight task knowledge injection while effectively balancing task-specific, modality-general and modality-specific knowledge in multimodal fusion process for diverse downstream tasks. Extensive experiments across six cross-modal transfer scenarios, along with analyses of data, computation, and storage efficiency, demonstrate the superiority of our method.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38083", "url": null, "sourceid": 46530, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38085, "uid": "74c156781ba6ffe18400a88689715dd1", "name": "CLEP: Contrastive Language-Pose Pretraining", "authors": [{"id": 158508, "fullname": "Sen Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/158508?format=json", "institution": "University of Washington"}, {"id": 135592, "fullname": "Huayu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/135592?format=json", "institution": "University of Washington"}, {"id": 93202, "fullname": "Hsiang-Wei Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/93202?format=json", "institution": "University of Washington"}, {"id": 113866, "fullname": "Zhaochong An", "url": "http://cvpr.thecvf.com/api/miniconf/users/113866?format=json", "institution": "University of Copenhagen"}, {"id": 93440, "fullname": "Jenq-Neng Hwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/93440?format=json", "institution": "University of Washington, Seattle"}, {"id": 189030, "fullname": "Zhang Huaping", "url": "http://cvpr.thecvf.com/api/miniconf/users/189030?format=json", "institution": "Beijing Institute of Technology"}, {"id": 188406, "fullname": "Lei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188406?format=json", "institution": null}], "abstract": "Aligning natural language descriptions with precise 3D human poses remains a big challenge due to the scarcity of effective pose representation mechanisms and large-scale, semantically rich datasets. To overcome these limitations, we first introduce **CLEP-2M**, the largest 3D pose-language dataset to date, comprising two million high-quality 3D pose-language pairs. This dataset provides a **20-fold** increase in scale and far richer semantic diversity than existing benchmarks. Second, we propose **CLEP**, a novel contrastive pretraining framework. The core of CLEP is HierFormer, a hierarchical pose encoder specifically designed for language alignment. Its key innovation is a Cross-Scale Attention Fusion (CSAF) mechanism  that dynamically integrates features from the joint, limb, and body levels. This enables CLEP to precisely align complex, multi-scale text descriptions with the pose representation. Extensive experimental evaluations on CLEP-2M and PoseScript  demonstrate that our method consistently outperforms existing approaches across a range of downstream tasks. CLEP shows exceptional zero-shot generalization, achieving a 34.8 mRecall on the human-annotated PoseScript-H benchmark\u2014a nearly **6-fold** improvement from the baseline. Furthermore, CLEP demonstrates superior performance on pose generation  and fine-grained pose editing. These results establish CLEP as a strong multimodal foundation model for human-centric understanding and generation tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38085", "url": null, "sourceid": 37270, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38093, "uid": "6c8a12209e4818416c7a0fac9fe555c4", "name": "Choreographing a World of Dynamic Objects", "authors": [{"id": 145035, "fullname": "Yanzhe Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/145035?format=json", "institution": "University of Science and Technology of China"}, {"id": 86204, "fullname": "Chen Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86204?format=json", "institution": "Stanford University"}, {"id": 189042, "fullname": "Karthik Dharmarajan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189042?format=json", "institution": "Computer Science Department, Stanford University"}, {"id": 159444, "fullname": "Yunzhi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/159444?format=json", "institution": "Stanford University"}, {"id": 76025, "fullname": "Hadi Alzayer", "url": "http://cvpr.thecvf.com/api/miniconf/users/76025?format=json", "institution": "University of Maryland"}, {"id": 152816, "fullname": "Shangzhe Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152816?format=json", "institution": "University of Cambridge"}, {"id": 84531, "fullname": "Jiajun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84531?format=json", "institution": "Stanford University"}], "abstract": "Dynamic objects in our physical 4D (3D + time) world are constantly evolving, deforming, and interacting with other objects, leading to diverse 4D scene dynamics. In this paper, we study a universal generative pipeline for synthesizing this type of phenomena. Traditional rule-based graphics pipelines to create these dynamics are based on category-specific heuristics, yet are labor-intensive and not scalable. Recent learning-based methods typically demand large-scale datasets, which may not cover all object categories in interest. Our approach instead inherits the universality from the video generative models by proposing a distillation-based pipeline to extract the rich Lagrangian motion information hidden in the Eulerian representations of 2D videos. Our method is universal, versatile, and category-agnostic. We demonstrate its effectiveness by conducting experiments to generate a diverse range of multi-body 4D dynamics, show its advantage compared to existing methods, and demonstrate its applicability in generating robotics manipulation policies.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38093", "url": null, "sourceid": 36575, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38087, "uid": "b86d9a8e244a8af4610b10875b98230e", "name": "MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters", "authors": [{"id": 175352, "fullname": "Soomin Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/175352?format=json", "institution": "KAIST"}, {"id": 175362, "fullname": "Eunseong Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/175362?format=json", "institution": "Korea Advanced Institute of Science and Technology"}, {"id": 178448, "fullname": "Kwang Bin Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/178448?format=json", "institution": "KAIST"}, {"id": 189033, "fullname": "Sung-Hee Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/189033?format=json", "institution": "Korea Advanced Institute of Science and Technology"}], "abstract": "We present MaskAdapt, a framework for flexible motion adaptation in physics-based humanoid control.The framework follows a two-stage residual learning paradigm.In the first stage, we train a mask-invariant base policy using stochastic body-part masking and a regularization term that enforces consistent action distributions across masking conditions.This yields a robust motion prior that remains stable under missing observations, anticipating later adaptation in those regions.In the second stage, a residual policy is trained atop the frozen base controller to modify only the targeted body parts while preserving the original behaviors elsewhere.We demonstrate the versatility of this design through two applications: (i) motion composition, where varying masks enable multi-part adaptation within a single sequence, and (ii) text-driven partial goal tracking, where designated body parts follow kinematic targets provided by a pre-trained text-conditioned autoregressive motion generator.Through experiments, MaskAdapt demonstrates strong robustness and adaptability, producing diverse behaviors under masked observations and delivering superior targeted motion adaptation compared to prior work.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38087", "url": null, "sourceid": 34080, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38090, "uid": "1d44a1f32c916debdadfa721ebbd864d", "name": "GeneVAR: Causal MeanFlow for Autoregressive Gene-to-WSI Tile Synthesis", "authors": [{"id": 144296, "fullname": "Jianwei Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/144296?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 128933, "fullname": "Fan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128933?format=json", "institution": "AIQ"}, {"id": 128954, "fullname": "XIN LI", "url": "http://cvpr.thecvf.com/api/miniconf/users/128954?format=json", "institution": "G42"}, {"id": 156499, "fullname": "Qiang Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/156499?format=json", "institution": "Sichuan Agricultural University"}, {"id": 154155, "fullname": "Ao Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/154155?format=json", "institution": "Southwest Jiaotong University"}, {"id": 189037, "fullname": "Ziqi Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/189037?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 156393, "fullname": "Zhicheng Jiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/156393?format=json", "institution": "Brown University"}, {"id": 156501, "fullname": "Hong Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/156501?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Understanding how transcriptomic programs shape tissue morphology remains a central challenge in computational pathology. Gene-to-WSI tile synthesis offers a principled generative framework to translate molecular profiles into histological images. However, most existing methods compress RNA-Seq into a single global embedding injected once at initialization, an oversimplified design that weakens transcriptomic signals and induces non-causal associations between gene expression and tissue morphology. We present GeneVAR, an Autoregressive Gene-to-WSI model that reformulates synthesis as an iterative, coarse-to-fine generative process. At its core is a novel Causal MeanFlow module that reinforces transcriptome-informed guidance at multiple stages and mitigates non-causal factors through counterfactual-style interventions, thereby ensuring biological fidelity throughout the generative trajectory. Combined with a $\\beta$-VAE for compact gene embeddings and a multi-scale vector quantizer for discrete morphology representation, GeneVAR generates H\\&E-stained WSI tiles that are both visually realistic and transcriptomically faithful. Extensive experiments across five TCGA cancer benchmarks demonstrate consistent state-of-the-art performance, surpassing prior methods in both generative fidelity and downstream classification accuracy. All models and code will be released to facilitate reproducibility.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38090", "url": null, "sourceid": 36531, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38091, "uid": "5b4ba6e852444462a8e1223fc42e1af8", "name": "LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens", "authors": [{"id": 91650, "fullname": "Zekun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/91650?format=json", "institution": "Tencent AI Lab"}, {"id": 189038, "fullname": "Sizhe An", "url": "http://cvpr.thecvf.com/api/miniconf/users/189038?format=json", "institution": "Meta Reality Labs"}, {"id": 157318, "fullname": "Chengcheng Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157318?format=json", "institution": "Meta"}, {"id": 96456, "fullname": "Chuan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/96456?format=json", "institution": "Meta Reality Lab"}, {"id": 189039, "fullname": "Ivan Shugurov", "url": "http://cvpr.thecvf.com/api/miniconf/users/189039?format=json", "institution": "Facebook"}, {"id": 156809, "fullname": "Linguang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156809?format=json", "institution": "Meta Reality Labs"}, {"id": 139332, "fullname": "Amy Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/139332?format=json", "institution": "Meta Reality Labs"}, {"id": 76158, "fullname": "Srinath Sridhar", "url": "http://cvpr.thecvf.com/api/miniconf/users/76158?format=json", "institution": "Brown University"}, {"id": 189040, "fullname": "Lingling Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189040?format=json", "institution": "Facebook"}, {"id": 91346, "fullname": "Abhay Mittal", "url": "http://cvpr.thecvf.com/api/miniconf/users/91346?format=json", "institution": "Amazon"}], "abstract": "Recent progress in large models has led to significant advances in unified multimodal generation and understanding. However, the development of models that unify motion-language generation and understanding remains largely underexplored. Existing approaches often fine-tune large language models (LLMs) on paired motion-text data, which can result in catastrophic forgetting of linguistic capabilities due to the limited scale of available text-motion pairs.Furthermore, prior methods typically convert motion into discrete representations via quantization to integrate with language models, introducing substantial jitter artifacts from discrete tokenization.To address these challenges, we propose LLaMo, a unified framework that extends pretrained LLMs through a modality-specific Mixture-of-Transformers (MoT) architecture. This design inherently preserves the language understanding of the base model while enabling scalable multimodal adaptation.We encode human motion into a causal continuous latent space and maintain the next-token prediction paradigm in the decoder-only backbone through a lightweight flow-matching head, allowing for streaming motion generation in real-time (>30 FPS).Leveraging the comprehensive language understanding of pretrained LLMs and large-scale motion-text pretraining, our experiments demonstrate that LLaMo achieves high-fidelity text-to-motion generation and motion-to-text captioning in general settings, especially zero-shot motion generation, marking a significant step towards a general unified motion-language large model.We will release the code and model upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38091", "url": null, "sourceid": 38533, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38097, "uid": "c07d8f3be977218f8ba2bcd5e74525ed", "name": "Unsafe2Safe: Controllable Image Anonymization for Downstream Utility", "authors": [{"id": 181822, "fullname": "Minh Dinh", "url": "http://cvpr.thecvf.com/api/miniconf/users/181822?format=json", "institution": "Dartmouth College"}, {"id": 87358, "fullname": "SouYoung Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/87358?format=json", "institution": "Dartmouth College"}], "abstract": "Large-scale image datasets frequently contain identifiable or sensitive content, raising privacy risks when training models that may memorize and leak such information. We present Unsafe2Safe, a fully automated pipeline that detects privacy-prone images and rewrites only their sensitive regions using multimodally guided diffusion editing. Unsafe2Safe operates in two stages. Stage 1 uses a vision--language model to (i) inspect images for privacy risks, (ii) generate paired private and public captions that respectively include and omit sensitive attributes, and (iii) prompt a large language model to produce structured, identity-neutral edit instructions conditioned on the public caption. Stage 2 employs instruction-driven diffusion editors to apply these dual textual prompts, producing privacy-safe images that preserve global structure and task-relevant semantics while neutralizing private content. To measure anonymization quality, we introduce a unified evaluation suite covering Quality, Cheating, Privacy}, and Utility dimensions. Across Caltech101 and MIT Indoor67, Unsafe2Safe reduces face similarity, text similarity, and demographic predictability by large margins, while maintaining downstream model accuracy comparable to training on raw data. Fine-tuning diffusion editors on our automatically generated triplets (private caption, public caption, edit instruction) further improves both privacy protection and semantic fidelity. Unsafe2Safe provides a scalable, principled solution for constructing large, privacy-safe datasets without sacrificing visual consistency or downstream utility.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38097", "url": null, "sourceid": 45727, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38101, "uid": "9377e91beb4b28f3f27a6c7060b8fc2a", "name": "DiffDecompose: Layer-Wise Decomposition of Alpha-Composited Images via Diffusion Transformers", "authors": [{"id": 179930, "fullname": "Hang Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/179930?format=json", "institution": "Jilin University"}, {"id": 189059, "fullname": "Hang Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189059?format=json", "institution": null}, {"id": 188956, "fullname": "Qianyu Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/188956?format=json", "institution": "Jilin University"}, {"id": 156718, "fullname": "Xuequan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156718?format=json", "institution": "The University of Western Australia"}, {"id": 139138, "fullname": "Xiangtai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/139138?format=json", "institution": "ByteDance Inc."}, {"id": 189060, "fullname": "Hao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189060?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 186347, "fullname": "Bo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186347?format=json", "institution": "Jilin University"}, {"id": 131537, "fullname": "Yiren Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/131537?format=json", "institution": "Shanghai Jiaotong University"}], "abstract": "Diffusion models have recently motivated great success in many generation tasks like object removal. Nevertheless, existing image decomposition methods struggle to disentangle semi-transparent or transparent layer occlusions due to mask prior dependencies, static object assumptions, and the lack of datasets. In this paper, we delve into a novel task: Layer-Wise Decomposition of Alpha-Composited Images, aiming to recover constituent layers from single overlapped images under the condition of semi-transparent/transparent alpha layer non-linear occlusion. To address challenges in layer ambiguity, generalization, and data scarcity, we first introduce AlphaBlend, the first large\u2011scale and high-quality dataset for transparent and semi\u2011transparent layer decomposition, containing six subtasks with different characteristics (e.g., translucent flare removal, semi-transparent cell decomposition, glassware decomposition). Building on this dataset, we present DiffDecompose, a diffusion Transformer-based framework that learns the posterior over possible layer decompositions conditioned on the input image, semantic prompts, and blending type. Rather than regressing alpha mattes directly, DiffDecompose performs In\u2011Context Decomposition, enabling the model to predict one or multiple layers without per\u2011layer supervision, and introduces Layer Position Encoding Cloning to maintain pixel\u2011level correspondence across layers. Extensive experiments on the proposed AlphaBlend dataset and public LOGO dataset verify the effectiveness of DiffDecompose. The code and dataset will be available upon paper acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38101", "url": null, "sourceid": 46204, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38105, "uid": "0522f0f077611617c09c361e91503db9", "name": "Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model", "authors": [{"id": 87805, "fullname": "Dongwon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/87805?format=json", "institution": "POSTECH"}, {"id": 151957, "fullname": "Gawon Seo", "url": "http://cvpr.thecvf.com/api/miniconf/users/151957?format=json", "institution": "POSTECH Computer Vision Lab"}, {"id": 189070, "fullname": "Jinsung Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/189070?format=json", "institution": "Pohang University of Science and Technology"}, {"id": 86304, "fullname": "Minsu Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/86304?format=json", "institution": "POSTECH"}, {"id": 87833, "fullname": "Suha Kwak", "url": "http://cvpr.thecvf.com/api/miniconf/users/87833?format=json", "institution": "POSTECH"}], "abstract": "World models provide a powerful framework for simulating environment dynamics conditioned on actions or instructions, enabling downstream tasks such as action planning or policy learning.Recent approaches leverage world models as learned simulators, but its application to decision-time planning remains computationally prohibitive for real-time control.A key bottleneck lies in latent representations: conventional tokenizers encode each observation into hundreds of tokens, making planning both slow and resource-intensive.To address this, we propose CompACT, a discrete tokenizer that compresses each observation into as few as 8 tokens, drastically reducing computational cost while preserving essential information for planning.An action-conditioned world model that occupies CompACT tokenizer achieves competitive planning performance with orders-of-magnitude faster planning, offering a practical step toward real-world deployment of world models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38105", "url": null, "sourceid": 37187, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38106, "uid": "37a02e4012f8ec6fc59da19e05d983e3", "name": "AdaSpot: Spend Resolution Where It Matters for Precise Event Spotting", "authors": [{"id": 133547, "fullname": "Artur Xarles i Esparraguera", "url": "http://cvpr.thecvf.com/api/miniconf/users/133547?format=json", "institution": "Universitat de Barcelona"}, {"id": 129850, "fullname": "Sergio Escalera", "url": "http://cvpr.thecvf.com/api/miniconf/users/129850?format=json", "institution": "Computer Vision Center"}, {"id": 75622, "fullname": "Thomas B. Moeslund", "url": "http://cvpr.thecvf.com/api/miniconf/users/75622?format=json", "institution": "Aalborg University"}, {"id": 166503, "fullname": "Albert Clap\u00e9s", "url": "http://cvpr.thecvf.com/api/miniconf/users/166503?format=json", "institution": "Universitat de Barcelona"}], "abstract": "Precise Event Spotting aims to localize fast-paced actions or events in videos with high temporal precision, a key task for applications in sports analytics, robotics, and autonomous systems. Existing methods typically process all frames uniformly, overlooking the inherent spatio-temporal redundancy in video data. This leads to redundant computation on non-informative regions while limiting overall efficiency. To remain tractable, they often spatially downsample inputs, losing fine-grained details crucial for precise localization. To address these limitations, we propose \\textbf{AdaSpot}, a simple yet effective framework that processes low-resolution videos to extract global task-relevant features while adaptively selecting the most informative region-of-interest in each frame for high-resolution processing. The selection is performed via an unsupervised, task-aware strategy that maintains spatio-temporal consistency across frames and avoids the training instability of learnable alternatives. This design preserves essential fine-grained visual cues with a marginal computational overhead compared to low-resolution-only baselines, while remaining far more efficient than uniform high-resolution processing. Experiments on standard PES benchmarks demonstrate that AdaSpot achieves state-of-the-art performance under strict evaluation metrics (\\eg, $+3.96$ and $+2.26$ mAP$@0$ frames on Tennis and FineDiving), while also maintaining strong results under looser metrics. Code will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38106", "url": null, "sourceid": 44601, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38107, "uid": "78aafe78cef466668ddd95d4d9d7e1ad", "name": "Cross-Architecture Adaptation: Cloud-Edge Continual Test-Time Adaptation with Dynamic Sampling and Heterogeneous Distillation", "authors": [{"id": 189071, "fullname": "Zirui Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189071?format=json", "institution": "Xidian University"}, {"id": 189072, "fullname": "Xianhang Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189072?format=json", "institution": "Xidian University"}, {"id": 189073, "fullname": "Li jiahao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189073?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 126600, "fullname": "Xu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126600?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 88245, "fullname": "Cheng Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/88245?format=json", "institution": "Xidian University"}], "abstract": "Cloud-Edge Continual Test-Time Adaptation (CTTA)\u2014with edge devices processing real-time data and the cloud offering strong computing power\u2014is a critical paradigm for models that adapt to dynamic data distributions in real-world scenarios. However, most existing frameworks assume architectural homogeneity between cloud and edge CNNs, which poses a significant performance bottleneck, particularly given the rapid emergence of Transformer-based models. Current methods fail to bridge this architectural gap, resulting in significant deficiencies in adaptation accuracy and practical applicability. To address this, we propose a novel Cross-Architecture Adaptation (CAA) framework for heterogeneous Cloud-Edge CTTA that enables effective adaptation to shifting data distributions. Specifically, CAA deploys a large Transformer-based teacher model on the cloud for robust feature extraction and prediction, and a lightweight CNN-based student model on edge devices to fit resource constraints. Based on such cloud-edge models, a synergistic edge-to-cloud communication strategy, Multi-criteria Dynamic Cross-domain Sampling, ensures only the most informative, class-balanced samples are uploaded, minimizing communication costs while guaranteeing stable, unbiased adaptation. Moreover, a Multi-level Adaptive Heterogeneous Distillation module is proposed to facilitate effective knowledge transfer across the architecturally disparate models, and improve the learning efficiency of the edge one. Experiments on several benchmarks demonstrate that CAA achieves state-of-the-art performance with low edge resource consumption and minimal edge-to-cloud communication overhead.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38107", "url": null, "sourceid": 35743, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38108, "uid": "c3a93eac6e6d4ec470b1f7f5dc024e20", "name": "LaVR: Latent Space Conditioned Video Re-rendering using Large 4D Reconstruction Models", "authors": [{"id": 133273, "fullname": "Mingyang Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/133273?format=json", "institution": "University of Maryland, College Park"}, {"id": 158211, "fullname": "Numair Khan", "url": "http://cvpr.thecvf.com/api/miniconf/users/158211?format=json", "institution": "Reality Labs, Meta"}, {"id": 152633, "fullname": "Tianfu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152633?format=json", "institution": "University of Maryland, College Park"}, {"id": 92337, "fullname": "Naina Dhingra", "url": "http://cvpr.thecvf.com/api/miniconf/users/92337?format=json", "institution": "Meta Platforms Inc"}, {"id": 87118, "fullname": "Seonghyeon Nam", "url": "http://cvpr.thecvf.com/api/miniconf/users/87118?format=json", "institution": "Facebook"}, {"id": 88405, "fullname": "Haitao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88405?format=json", "institution": "Meta Platforms"}, {"id": 88434, "fullname": "Zhuo Hui", "url": "http://cvpr.thecvf.com/api/miniconf/users/88434?format=json", "institution": "Facebook"}, {"id": 128414, "fullname": "Christopher Metzler", "url": "http://cvpr.thecvf.com/api/miniconf/users/128414?format=json", "institution": "University of Maryland, College Park"}, {"id": 85624, "fullname": "Andrea Vedaldi", "url": "http://cvpr.thecvf.com/api/miniconf/users/85624?format=json", "institution": "University of Oxford"}, {"id": 87440, "fullname": "Hamed Pirsiavash", "url": "http://cvpr.thecvf.com/api/miniconf/users/87440?format=json", "institution": "University of California, Davis"}, {"id": 88412, "fullname": "Lei Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/88412?format=json", "institution": "Meta"}], "abstract": "Given a monocular video, the goal of video re-rendering is to generate views of the scene from a novel camera trajectory. Existing methods face two distinct challenges. Geometrically unconditioned models lack spatial awareness, leading to drift and deformation under viewpoint changes. On the other hand, geometrically-conditioned models depend on estimated depth and explicit reconstruction, making them susceptible to depth inaccuracies and calibration errors.We propose to address these challenges by using the implicit geometric knowledge embedded in the latent space of a large 4D reconstruction model to condition the video generation process. These latents capture scene structure in a continuous space without explicit reconstruction. Therefore, they provide a flexible representation that allows the pretrained diffusion prior to regularize errors more effectively. By jointly conditioning on these latents and source camera poses, we demonstrate that our model achieves state-of-the-art results on the video re-rendering task.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38108", "url": null, "sourceid": 31937, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38110, "uid": "82717311c998510059faacba4518f33d", "name": "Task-Driven Implicit Representations for Automated Design of LiDAR Systems", "authors": [{"id": 96307, "fullname": "Nikhil Behari", "url": "http://cvpr.thecvf.com/api/miniconf/users/96307?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 137593, "fullname": "Aaron Young", "url": "http://cvpr.thecvf.com/api/miniconf/users/137593?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 85604, "fullname": "Tzofi Klinghoffer", "url": "http://cvpr.thecvf.com/api/miniconf/users/85604?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 189077, "fullname": "Akshat Dave", "url": "http://cvpr.thecvf.com/api/miniconf/users/189077?format=json", "institution": ", State University of New York at Stony Brook"}, {"id": 85615, "fullname": "Ramesh Raskar", "url": "http://cvpr.thecvf.com/api/miniconf/users/85615?format=json", "institution": "Massachusetts Institute of Technology"}], "abstract": "Imaging system design is a complex, time-consuming, and largely manual process; LiDAR design, ubiquitous in mobile devices, autonomous vehicles, and aerial imaging platforms, adds further complexity through unique spatial and temporal sampling requirements. In this work, we propose a framework for automated, task-driven LiDAR system design under arbitrary constraints. To achieve this, we represent LiDAR configurations in a continuous six-dimensional design space and learn task-specific implicit densities in this space via flow-based generative modeling. We then synthesize new LiDAR systems by modeling sensors as parametric distributions in 6D space and fitting these distributions to our learned implicit density using expectation-maximization, enabling efficient, constraint-aware LiDAR system design. We validate our method on diverse tasks in 3D vision, enabling automated LiDAR system design across real-world-inspired applications in face scanning, robotic tracking, and object detection.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38110", "url": null, "sourceid": 43238, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38111, "uid": "2244143888b22cced378fcd1c6ff4695", "name": "DeepAlign: Mitigating Modality Conflict through Modality-Specific Alignment", "authors": [{"id": 180458, "fullname": "Shuo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180458?format=json", "institution": "Nanyang Technological University"}, {"id": 184843, "fullname": "Bingchen Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184843?format=json", "institution": "Zhejiang University"}, {"id": 184844, "fullname": "Wendong Bu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184844?format=json", "institution": "Zhejiang University"}, {"id": 88890, "fullname": "Juncheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88890?format=json", "institution": "Zhejiang University"}, {"id": 91500, "fullname": "Hanwang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91500?format=json", "institution": "Nanyang Technological University"}, {"id": 84756, "fullname": "Fei Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84756?format=json", "institution": "Zhejiang University"}], "abstract": "Multimodal Large Language Models (MLLMs) have demonstrated promising advancements in augmenting the capabilities of LLMs to comprehend visual input. However, modality misalignment between vision and text remains a key challenge in MLLM, which can be attributed to two aspects: misalignment of modality-specific representations and depletion of modality-specific details. To address the issue of modality misalignment, we propose DeepAlign, a novel multimodal alignment framework to mitigate modality conflict, which employs representation intervention and structure-induced knowledge distillation to prevent the misalignment and depletion of modality-specific information. Extensive experiments demonstrate that DeepAlign significantly mitigates modality conflicts, leading to substantial performance improvements compared to backbone models across multiple vision-language tasks. It also stimulates some emergent abilities in MLLMs, such as multimodal in-context learning on interleaved text-image sequences.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38111", "url": null, "sourceid": 33146, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38123, "uid": "5f6e016045d06bab7f37a045c0ea0603", "name": "LumiX: Structured and Coherent Text-to-Intrinsic Generation", "authors": [{"id": 133602, "fullname": "Xu Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/133602?format=json", "institution": "Huazhong University Of Science And Technology"}, {"id": 185521, "fullname": "Biao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185521?format=json", "institution": "KAUST"}, {"id": 150519, "fullname": "Xiangjun Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150519?format=json", "institution": "KAUST"}, {"id": 130950, "fullname": "Xianzhi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/130950?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 75925, "fullname": "Peter Wonka", "url": "http://cvpr.thecvf.com/api/miniconf/users/75925?format=json", "institution": "KAUST"}], "abstract": "We present LumiX, a structured diffusion framework for coherent text-to-intrinsic generation. Conditioned on text prompts, LumiX jointly generates a comprehensive set of intrinsic maps (e.g., albedo, irradiance, normal, depth, and final color), providing a structured and physically consistent description of an underlying scene. This is enabled by two key contributions: 1) Query-Broadcast Attention, a mechanism that ensures structural consistency by sharing queries across all maps in each self-attention block. 2) Tensor LoRA, a tensor-based adaptation that parameter-efficiently models cross-map relations for efficient joint training. Together, these designs enable stable joint diffusion training and unified generation of multiple intrinsic properties. Experiments show that LumiX produces coherent and physically meaningful results, achieving 23% higher alignment and a better preference score (0.19 vs. -0.41) compared to the state of the art, and it can also perform image-conditioned intrinsic decomposition within the same framework.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38123", "url": null, "sourceid": 45348, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38113, "uid": "4a9a54fadb6115f95f9c10fab3c1659c", "name": "VL-RouterBench: A Benchmark for Vision\u2013Language Model Routing", "authors": [{"id": 175259, "fullname": "Zhehao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/175259?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 189080, "fullname": "Baijiong Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189080?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 181067, "fullname": "JINGYUAN ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/181067?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 189081, "fullname": "Jingying Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189081?format=json", "institution": null}, {"id": 189082, "fullname": "Yuhang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189082?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 189083, "fullname": "Ning Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189083?format=json", "institution": "Hong Kong University of Science and Technology; Tsinghua University; Korea Advanced Institute of Science &amp; Technology"}, {"id": 189084, "fullname": "Tao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189084?format=json", "institution": "ByteDance Inc."}, {"id": 127908, "fullname": "Xiaolin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127908?format=json", "institution": "Shanghai Jiao Tong University, Tsinghua University"}], "abstract": "Multi-model routing has evolved from an engineering technique into essential infrastructure, yet existing work lacks a systematic, reproducible benchmark for evaluating vision\u2013language models (VLMs). We present **VL-RouterBench** to assess the overall capability of VLM routing systems systematically. The benchmark is grounded in raw inference and scoring logs from VLMs and constructs quality and cost matrices over sample\u2013model pairs. In scale, VL-RouterBench covers 14 datasets across 3 task groups, totaling 30,540 samples, and includes 15 open-source models and 2 API models, yielding 519,180 sample\u2013model pairs and a total input\u2013output token volume of 34,494,977. The evaluation protocol jointly measures average accuracy, average cost, and throughput, and builds a ranking score from the harmonic mean of normalized cost and accuracy to enable comparison across router configurations and cost budgets. On this benchmark, we evaluate 10 routing methods and baselines and observe a significant routability gain, while the best current routers still show a clear gap to the ideal Oracle, indicating considerable room for improvement in router architecture through finer visual cues and modeling of textual structure. We will open-source the complete data construction and evaluation toolchain to promote comparability, reproducibility, and practical deployment in multimodal routing research.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38113", "url": null, "sourceid": 35188, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38114, "uid": "c975f21874ed5bb553d3e61d41244688", "name": "MV3DIS: Multi-View Mask Matching via 3D Guides for Zero-Shot 3D Instance Segmentation", "authors": [{"id": 174334, "fullname": "yibo zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/174334?format=json", "institution": "nankai university"}, {"id": 189085, "fullname": "Yigong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189085?format=json", "institution": "Nankai University"}, {"id": 137969, "fullname": "Jin Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/137969?format=json", "institution": "Nanjing University"}], "abstract": "Conventional 3D instance segmentation methods rely on labor-intensive 3D annotations for supervised training, which limits their scalability and generalization to novel objects. Recent approaches leverage multi-view 2D masks from the Segment Anything Model (SAM) to guide the merging of 3D geometric primitives, thereby enabling zero-shot 3D instance segmentation. However, these methods typically process each frame independently and rely solely on 2D metrics, such as SAM prediction scores, to produce segmentation maps. This design overlooks multi-view correlations and inherent 3D priors, leading to inconsistent 2D masks across views and ultimately fragmented 3D segmentation. In this paper, we propose MV3DIS, a coarse-to-fine framework for zero-shot 3D instance segmentation that explicitly incorporates 3D priors. Specifically, we introduce a 3D-guided mask matching strategy that uses coarse 3D segments as a common reference to match 2D masks across views and consolidates multi-view mask consistency via 3D coverage distributions. Guided by these view-consistent 2D masks, the coarse 3D segments are further refined into precise 3D instances. Additionally, we introduce a depth consistency weighting scheme that quantifies projection reliability to suppress ambiguities from inter-object occlusions, thereby improving the robustness of 3D-to-2D correspondence. Extensive experiments on the ScanNetV2, ScanNet200, ScanNet++, Replica, and Matterport3D datasets demonstrate the effectiveness of MV3DIS, which achieves superior performance over previous methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38114", "url": null, "sourceid": 44914, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38116, "uid": "d9466601bb1cec0a0df354707b5a1c08", "name": "Agentic Video Summarization via Self-Reflecting Multimodal Understanding", "authors": [{"id": 181714, "fullname": "Miaotian Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/181714?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 158339, "fullname": "Shuguang Dou", "url": "http://cvpr.thecvf.com/api/miniconf/users/158339?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 156096, "fullname": "Yin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/156096?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 131693, "fullname": "Aidong Men", "url": "http://cvpr.thecvf.com/api/miniconf/users/131693?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 131145, "fullname": "Dongsheng Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131145?format=json", "institution": "Huawei Technologies Ltd."}], "abstract": "The rise of AI agents powered by large language models (LLMs) has transformed intelligent systems by enabling autonomous tool utilizing, reasoning, and action across diverse tasks. Despite this rapid progress, existing video summarization approaches primarily focus on feature extraction or frame-level importance regression but lack the autonomous reasoning, self-correction, and decision-making capabilities that define true agent-based intelligence. To bridge this gap, we propose AgenticVS\u2014the first agentic workflow for video summarization that leverages multimodal large language models (MLLMs) to complete the summarization\u2013verify\u2013reflection loop in a fully autonomous manner. Rather than designing new architectures for feature extraction or regression, we exploit the understanding and reflective reasoning abilities of MLLMs to build an adaptive summarization framework with a self-reflecting workflow.  Experiments on SumMe and TVSum demonstrate that our agentic workflow outperforms state-of-the-art methods, enhancing interpretability, adaptability, and paving the way for agent-based multimodal video understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38116", "url": null, "sourceid": 30911, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38120, "uid": "f965ce468b3afc578525cd9758975cc2", "name": "Event-based Visual Deformation Measurement", "authors": [{"id": 145465, "fullname": "Yuliang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/145465?format=json", "institution": "university of science and technology of china"}, {"id": 86247, "fullname": "Wei Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/86247?format=json", "institution": "University of Science and Technology of China"}, {"id": 189095, "fullname": "Yuxin Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/189095?format=json", "institution": "University of Science and Technology of China; University of Science and Technology of China"}, {"id": 153911, "fullname": "Tiesong Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153911?format=json", "institution": "Fuzhou University"}, {"id": 86250, "fullname": "Yang Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86250?format=json", "institution": "University of Science and Technology of China"}, {"id": 86637, "fullname": "Zheng-Jun Zha", "url": "http://cvpr.thecvf.com/api/miniconf/users/86637?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Visual Deformation Measurement (VDM) aims to recover dense deformation fields by tracking surface motion from camera observations. Traditional image-based methods rely on minimal inter-frame motion to constrain the correspondence search space, which limits their applicability to highly dynamic scenes or necessitates high-speed cameras at the cost of prohibitive storage and computational overhead.We propose an event-frame fusion framework that exploits events for temporally dense motion cues and frames for spatially dense precise estimation.By revisiting the solid elastic modeling prior, we propose an Affine Invariant Simplicial (AIS) framework that partitions the deformation field into multiple sub-regions and linearize the deformation within each sub-region using a low-parametric representation, effectively mitigating motion ambiguities arising from the sparse and noisy nature of event observations. To speed up parameter searching and reduce error accumulation, a neighborhood-greedy optimization strategy is introduced, enabling well-converged sub-regions to guide their poorly-converged neighbors, effectively suppress local error accumulation in long-term dense tracking.To evaluate the proposed method, a benchmark dataset with temporally aligned event streams and high-frame-rate videos is established, encompassing over 120 sequences spanning diverse deformation scenarios. Experimental results show that the proposed method outperforms the state-of-the-art baseline by 1.6\u00d7 in terms of continuous measurement success rate (survival rate). Remarkably, our approach achieves superior performance while requiring only 18.9\\% of the data storage and processing resources compared to traditional high-speed video-based methods, without compromising accuracy.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38120", "url": null, "sourceid": 34625, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38121, "uid": "5dc8c5a8868c77ddf8a8e1aa840c8884", "name": "Gaussian Splatting-based Low-Rank Tensor Representation for Multi-Dimensional Image Recovery", "authors": [{"id": 173106, "fullname": "Yiming Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/173106?format=json", "institution": " University of Electronic Science and Technology of China"}, {"id": 90811, "fullname": "Xile Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/90811?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 189096, "fullname": "Wei-Hao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189096?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 189097, "fullname": "Teng-Yu Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/189097?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 184783, "fullname": "Chao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184783?format=json", "institution": "Southern University of Science and Technology"}], "abstract": "Tensor singular value decomposition (t-SVD) is a promising tool for multi-dimensional image representation, which decomposes a multi-dimensional image into a latent tensor and an accompanying transform matrix. However, two critical limitations of t-SVD methods persist: (1) the approximation of the latent tensor (e.g., tensor factorizations) is coarse and fails to accurately capture spatial local high-frequency information; (2) The transform matrix is composed of fixed basis atoms (e.g., complex exponential atoms in DFT and cosine atoms in DCT) and cannot precisely capture local high-frequency information along the mode-3 fibers. To address the two limitations, we propose a  Gaussian Splatting-based Low-rank tensor Representation (GSLR) framework, which compactly and continuously represents multi-dimensional image.  Specifically, we leverage tailored 2D Gaussian splatting and 1D Gaussian splatting to generate the latent tensor and transform matrix, respectively. The 2D and 1D Gaussian splatting are indispensable and complementary under this representation framework, which enjoys a powerful representation capability, especially for local high-frequency information. To evaluate the representation ability of the GSLR, we develop an unsupervised GSLR-based  multi-dimensional image recovery model. Extensive experiments on multi-dimensional image recovery demonstrate that GSLR consistently outperforms state-of-the-art methods, particularly in capturing local high-frequency information.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38121", "url": null, "sourceid": 33478, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38127, "uid": "6f3d86720d498a0f707dc24326038c8a", "name": "Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI", "authors": [{"id": 74032, "fullname": "Xinhao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/74032?format=json", "institution": "New York University"}, {"id": 189113, "fullname": "Jiaqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189113?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 189114, "fullname": "Youming Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189114?format=json", "institution": "Department of Computer Science, Cornell University"}, {"id": 189115, "fullname": "Ruxin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189115?format=json", "institution": "New York University"}, {"id": 189116, "fullname": "Yingjia Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189116?format=json", "institution": "New York University"}, {"id": 189117, "fullname": "Yifei.Ma Yifei.Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/189117?format=json", "institution": "New York University"}, {"id": 189118, "fullname": "Li Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189118?format=json", "institution": "New York University Shanghai"}, {"id": 77530, "fullname": "Yiming Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/77530?format=json", "institution": "New York University"}, {"id": 104589, "fullname": "Jing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/104589?format=json", "institution": "New York University"}, {"id": 86335, "fullname": "Chen Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86335?format=json", "institution": "New York University"}], "abstract": "Reproducible closed-loop evaluation remains a major bottleneck in Embodied AI such as visual navigation. A promising path forward is high-fidelity simulation that combines photorealistic sensor rendering with geometrically grounded interaction in complex, open-world urban environments. Although recent video-3DGS methods ease open-world scene capturing, they are still unsuitable for benchmarking due to large visual and geometric sim-to-real gaps. To address these challenges, we introduce Wanderland, a real-to-sim framework that features multi-sensor capture, reliable reconstruction, accurate geometry, and robust view synthesis. Using this pipeline, we curate a diverse dataset of indoor-outdoor urban scenes and systematically demonstrate how image-only pipelines scale poorly, how geometry quality impacts novel view synthesis, and how all of these adversely affect navigation policy learning and evaluation reliability. Beyond serving as a trusted testbed for embodied navigation, Wanderland's rich raw sensor data further allows benchmarking of 3D reconstruction and novel view synthesis models. Our work establishes a new foundation for reproducible research in open-world embodied AI.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38127", "url": null, "sourceid": 32475, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38134, "uid": "194191c15060e323bb610faf147e1700", "name": "Divide, Conquer, and Aggregate: Asymmetric Experts for Class-Imbalanced Semi-Supervised Medical Image Segmentation", "authors": [{"id": 181521, "fullname": "Yajun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181521?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Semi-supervised medical image segmentation (SSMIS) aims to alleviate annotation scarcity, but general methods, often developed on few-class datasets, suffer performance degradation in class-imbalanced multi-organ scenarios. Existing class-imbalanced SSMIS methods also struggle, as their single-decoder architecture is forced to handle vastly different scales with shared parameters. This process is easily dominated by majority classes, fundamentally limiting tail-class segmentation capability.To address this, we propose a \u2018\u2019$\\textbf{D}$ivide, $\\textbf{C}$onquer, and $\\textbf{A}$ggregate\" ($\\textbf{DCA}$) framework, featuring a unified encoder, three expert decoders, and an aggregation decoder. First, we $\\textbf{D}$ivide by applying a Logarithmic Gap Analysis to statically partition foreground classes into stable Head, Medium, and Tail sets, which aligns with anatomical priors. Then, we $\\textbf{C}$onquer by training the three architecturally asymmetric experts independently using a label-split strategy. This fundamentally alleviates the burden on a single decoder. The experts' predictions on unlabeled data are fused via logit stitching to generate high-quality pseudo-labels. Finally, we $\\textbf{A}$ggregate using an aggregation decoder with a Dynamic Feature Aggregation Module (DFAM), which dynamically fuses priors from all three experts to achieve unbiased predictions and fully leverage unlabeled data. Experiments demonstrate that our DCA framework significantly outperforms state-of-the-art general and class-imbalanced SSMIS methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38134", "url": null, "sourceid": 33753, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38131, "uid": "00acdef5b32f4a7089378d3b50b047a3", "name": "Image-to-Point Cloud Feature Back-projection for Multimodal Training of 3D Semantic Segmentation", "authors": [{"id": 142332, "fullname": "Jiawei Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/142332?format=json", "institution": "Beijing Institute of Technology"}, {"id": 87171, "fullname": "Matteo Poggi", "url": "http://cvpr.thecvf.com/api/miniconf/users/87171?format=json", "institution": "Universit\u00e0 di Bologna"}, {"id": 147049, "fullname": "HUAN LI", "url": "http://cvpr.thecvf.com/api/miniconf/users/147049?format=json", "institution": "University of Bologna"}, {"id": 169665, "fullname": "Changshuo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/169665?format=json", "institution": "Nanyang Technological University"}, {"id": 189129, "fullname": "Kaiqi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189129?format=json", "institution": "Beijing Institute of Technology"}, {"id": 189130, "fullname": "Wei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189130?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "The effective integration and utilization of multimodal data acquired from image cameras and LiDAR is of paramount importance for perception systems. This paper proposes **I**mage-to-**P**oint Cloud **F**eature Back-**P**rojection (**IPFP**), a novel method for training multimodal fusion networks that back-projects aggregated image-feature centers (from non-projection-aligned image pixels) into the point-cloud feature set via the estimated depth map. Consequently, image features and point cloud features reside within the same three-dimensional space, enabling the natural enrichment of image information into the point cloud during the network forward pass. This process can be selectively enabled when desired -- for instance, at training time -- and turned off in the absence of multimodal data -- for example, at testing time if only LiDAR sensors are available. Experimental results demonstrate that **IPFP** can consistently improve state-of-the-art 3D semantic segmentation models, while retaining the ability to process LiDAR-only data at inference time.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38131", "url": null, "sourceid": 43947, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38133, "uid": "b822fb166c626a15d168207f96a0a3bc", "name": "Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing", "authors": [{"id": 113809, "fullname": "Yusu Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/113809?format=json", "institution": "Apple"}, {"id": 178523, "fullname": "Eli Bocek-Rivele", "url": "http://cvpr.thecvf.com/api/miniconf/users/178523?format=json", "institution": "Apple"}, {"id": 162151, "fullname": "Liangchen Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/162151?format=json", "institution": "Apple"}, {"id": 189133, "fullname": "Jialing Tong", "url": "http://cvpr.thecvf.com/api/miniconf/users/189133?format=json", "institution": "Apple"}, {"id": 84650, "fullname": "Yinfei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84650?format=json", "institution": "Apple"}, {"id": 150987, "fullname": "Jiasen Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/150987?format=json", "institution": "Apple"}, {"id": 87537, "fullname": "Wenze Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87537?format=json", "institution": "UCLA, University of California, Los Angeles"}, {"id": 87966, "fullname": "Zhe Gan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87966?format=json", "institution": "Apple"}], "abstract": "Recent advances in multimodal models have demonstrated remarkable text-guided image editing capabilities, with systems like GPT-4o and Nano-Banana setting new benchmarks. However, the research community's progress remains constrained by the absence of large-scale, high-quality, and openly accessible datasets built from real images. We introduce Pico-Banana-400K, a comprehensive 400K-image dataset for instruction-based image editing. Our dataset is constructed by leveraging Nano-Banana to generate diverse edit pairs from real photographs in the OpenImages collection. What distinguishes Pico-Banana-400K from previous synthetic datasets is our systematic approach to quality and diversity. We employ a fine-grained image editing taxonomy to ensure comprehensive coverage of edit types while maintaining precise content preservation and instruction faithfulness through MLLM-based quality scoring and careful curation. Beyond single turn editing, Pico-Banana-400K enables research into complex editing scenarios. The dataset includes three specialized subsets: (1) a 72K-example multi-turn collection for studying sequential editing, reasoning, and planning across consecutive modifications; (2) a 56K-example preference subset for alignment research and reward model training; and (3) paired long-short editing instructions for developing instruction rewriting and summarization capabilities. By providing this large-scale, high-quality, and task-rich resource, Pico-Banana-400K establishes a robust foundation for training and benchmarking the next generation of text-guided image editing models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38133", "url": null, "sourceid": 32059, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38135, "uid": "a7c1aed2ae6b6ced5c3e83fb7c74d65d", "name": "Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models", "authors": [{"id": 189134, "fullname": "Zixuan Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/189134?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 189135, "fullname": "Quande Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189135?format=json", "institution": "Kuaishou- \u5feb\u624b\u79d1\u6280"}, {"id": 130281, "fullname": "Cong Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/130281?format=json", "institution": "University of Waterloo"}, {"id": 168185, "fullname": "Yuanxing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/168185?format=json", "institution": "Kuaishou Technology"}, {"id": 75722, "fullname": "Xintao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75722?format=json", "institution": "Tencent"}, {"id": 134947, "fullname": "Pengfei Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/134947?format=json", "institution": "Kuaishou Technology"}, {"id": 156268, "fullname": "Kun Gai", "url": "http://cvpr.thecvf.com/api/miniconf/users/156268?format=json", "institution": "Kuaishou- \u5feb\u624b\u79d1\u6280"}, {"id": 86370, "fullname": "Wenhan Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/86370?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Recently, the introduction of Chain-of-Thought (CoT) has largely improved generation ability of unified models. However, it is observed that the current thinking process during generation mainly focuses on the text consistency with the text prompt, ignoring the visual context consistency with the visual reference images during the multi-modal generation, e.g., multi-reference generation. The lack of such consistency results in the failure in maintaining key visual features (like human ID, object attribute, style). To this end, we integrate the visual context consistency into the reasoning of unified models, explicitly motivating the model to sustain such consistency by 1) Adaptive Visual Planning: generating structured visual check list to figure out the visual element of needed consistency keeping, and 2) Iterative Visual Correction: performing self-reflection with the guidance of check lists and refining the generated result in an iterative manner. To achieve this, we use supervised finetuning to teach the model how to plan the visual checking, conduct self-reflection and self-refinement, and use flow-GRPO to further enhance the visual consistency through a customized visual checking reward. The experiments show that our method outperforms both zero-shot unified models and those with text CoTs in multi-modal generation, demonstrating higher visual context consistency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38135", "url": null, "sourceid": 35771, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38136, "uid": "f702407c8ab99af3b508ed528516b86c", "name": "UniPR: Unified Object-level Real-to-Sim Perception and Reconstruction from a Single Stereo Pair", "authors": [{"id": 181239, "fullname": "Chuanrui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181239?format=json", "institution": "Nanyang Technological University"}, {"id": 152939, "fullname": "Yingshuang Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/152939?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 189136, "fullname": "ZhengXian Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189136?format=json", "institution": "Tsinghua University"}, {"id": 155745, "fullname": "Yonggen Ling", "url": "http://cvpr.thecvf.com/api/miniconf/users/155745?format=json", "institution": "Tencent Robotics X"}, {"id": 189137, "fullname": "Yuxiao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189137?format=json", "institution": null}, {"id": 151631, "fullname": "Ziwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151631?format=json", "institution": "Nanyang Technological University"}], "abstract": "Perceiving and reconstructing objects from images are critical for real-to-sim transfer tasks, which are widely used in the robotics community.Existing methods rely on multiple submodules such as detection, segmentation, shape reconstruction, and pose estimation to complete the pipeline.However, such modular pipelines suffer from inefficiency and cumulative error, as each stage operates on only partial or locally refined information while discarding global context.To address these limitations, we propose UniPR, the first end-to-end object-level real-to-sim perception and reconstruction framework.Operating directly on a single stereo image pair, UniPR leverages geometric constraints to resolve the scale ambiguity.We introduce Pose-Aware Shape Representation to eliminate the need for per-category canonical definitions and to bridge the gap between reconstruction and pose estimation tasks.Furthermore, we construct a large-vocabulary stereo dataset, LVS6D, comprising over 6,300 objects, to facilitate large-scale research in this area.Extensive experiments demonstrate that UniPR reconstructs all objects in a scene in parallel within a single forward pass, achieving significant efficiency gains and preserves true physical proportions across diverse object types, highlighting its potential for practical robotic applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38136", "url": null, "sourceid": 41495, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38137, "uid": "cb70a510d7c83a14b374c56b7ed2ed83", "name": "CLaD: Planning with Grounded Foresight via Cross-Modal Latent Dynamics", "authors": [{"id": 182294, "fullname": "Andrew Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/182294?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 176229, "fullname": "Jaemin Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/176229?format=json", "institution": "Korea Advanced Institute of Science and Technology"}, {"id": 130163, "fullname": "Sebin Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/130163?format=json", "institution": "Korea Advanced Institute of Science and Technology (KAIST)"}, {"id": 89458, "fullname": "Sung-Eui Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/89458?format=json", "institution": "KAIST"}], "abstract": "We propose CLaD (Cross-modal Latent Dynamics), a framework for learning temporally consistent cross-modal representations in robotic manipulation. Our approach models transition dynamics rather than static state correspondences: asymmetric cross-attention enables proprioceptive transitions to query semantic ones, extracting shared dynamics structure that respects the causal ordering imposed by actions. We formalize grounded latent foresight as predictions anchored through EMA-based targets from observed trajectories and auxiliary reconstruction to observable space\u2014preventing collapse to abstract representations. A diffusion policy conditions on these learned foresights via feature modulation, decoupling dynamics learning from control optimization. Evaluated on LIBERO-LONG, our method achieves 94.9\\% success with 0.66B parameters, demonstrating that explicit cross-modal transition modeling enables parameter-efficient planning outperforming larger VLAs.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38137", "url": null, "sourceid": 36400, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38138, "uid": "56fe38b77cb4f52e8f2770e874f57875", "name": "OralGPT-Omni: A Versatile Dental Multimodal Large Language Model", "authors": [{"id": 182651, "fullname": "JING HAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/182651?format=json", "institution": "University of Hong Kong"}, {"id": 189138, "fullname": "Yuci Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189138?format=json", "institution": "Shenzhen University"}, {"id": 189139, "fullname": "Lin Lizhuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189139?format=json", "institution": "University of Hong Kong"}, {"id": 189140, "fullname": "Yuxuan Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189140?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 189141, "fullname": "Wenkai Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/189141?format=json", "institution": "University of Hong Kong"}, {"id": 189142, "fullname": "Kaixin Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189142?format=json", "institution": "University of Hong Kong"}, {"id": 184598, "fullname": "Zanting Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/184598?format=json", "institution": "Southern Medical University"}, {"id": 128187, "fullname": "Yanpeng Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/128187?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 128184, "fullname": "Xinyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128184?format=json", "institution": "The University of Adelaide"}, {"id": 189143, "fullname": "Yanqi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189143?format=json", "institution": "University of Hong Kong"}, {"id": 182645, "fullname": "Qiankun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182645?format=json", "institution": "Nanyang Technological University"}, {"id": 77263, "fullname": "Hao Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77263?format=json", "institution": "ETH Zurich and CMU"}, {"id": 189144, "fullname": "James Tsoi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189144?format=json", "institution": "University of Hong Kong"}, {"id": 76746, "fullname": "Linlin Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76746?format=json", "institution": "Shenzhen University"}, {"id": 189145, "fullname": "Kuo Hung", "url": "http://cvpr.thecvf.com/api/miniconf/users/189145?format=json", "institution": "University of Hong Kong"}], "abstract": "Multimodal Large Language Models (MLLMs) have exhibited immense potential across numerous medical specialties, yet dentistry remains underexplored, in part due to limited domain-specific data, scarce dental expert annotations, insufficient modality-specific modeling, and challenges in reliability. In this paper, we present OralGPT-Omni, the first dental-specialized MLLM designed for comprehensive and trustworthy analysis across diverse dental imaging modalities and clinical tasks. To explicitly capture dentists\u2019 diagnostic reasoning, we construct TRACE-CoT, a clinically grounded chain-of-thought dataset that mirrors dental radiologists\u2019 decision-making processes. This reasoning supervision, combined with our proposed four-stage training paradigm, substantially strengthens the model\u2019s capacity for dental image understanding and analysis.  In parallel, we introduce MMOral-Uni, the first unified multimodal benchmark for dental image analysis. It comprises 2,809 open-ended question\u2013answer pairs spanning five modalities and five tasks, offering a comprehensive evaluation suite to date for MLLMs in digital dentistry. OralGPT-Omni achieves an overall score of 51.84 on the MMOral-Uni benchmark and 45.31 on the MMOral-OPG benchmark, dramatically outperforming the scores of GPT-5. Our work promotes intelligent dentistry and paves the way for future advances in dental image analysis. All code, benchmark, and models will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38138", "url": null, "sourceid": 31753, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38139, "uid": "ece9a91895eaebd3abdc9e9f7175ef5b", "name": "Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge", "authors": [{"id": 153128, "fullname": "Yu Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153128?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 107369, "fullname": "Zelin Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/107369?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 76417, "fullname": "Changsong Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76417?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 86696, "fullname": "Xiaokang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86696?format=json", "institution": "Shanghai Jiao Tong University, China"}, {"id": 76466, "fullname": "Wei Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76466?format=json", "institution": "Shanghai Jiaotong University"}], "abstract": "Affordance segmentation aims to decompose 3D objects into parts that serve distinct functional roles, enabling models to reason about object interactions rather than mere recognition. Existing methods, mostly following the paradigm of 3D semantic segmentation or prompt-based frameworks, struggle when geometric cues are weak or ambiguous, as sparse point clouds provide limited functional information. To overcome this limitation, we leverage the rich semantic knowledge embedded in large-scale 2D Vision Foundation Models (VFMs) to guide 3D representation learning through a cross-modal alignment mechanism. Specifically, we propose Cross-Modal Affinity Transfer (CMAT), a pretraining strategy that compels the 3D encoder to align with the semantic structures induced by lifted 2D features. CMAT is driven by a core affinity alignment objective, supported by two auxiliary losses, geometric reconstruction and feature diversity, which together encourage structured and discriminative feature learning. Built upon the CMAT-pretrained backbone, we employ a lightweight affordance segmentor that injects text or visual prompts into the learned 3D space through an efficient cross-attention interface, enabling dense and prompt-aware affordance prediction while preserving the semantic organization established during pretraining. Extensive experiments demonstrate consistent improvements over previous state-of-the-art methods in both accuracy and efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38139", "url": null, "sourceid": 35718, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38151, "uid": "92e53de4d143543b931fb5a393b44f93", "name": "Semantic Scale Space: A Framework for Controllable Image Abstraction", "authors": [{"id": 181704, "fullname": "Kazu Mishiba", "url": "http://cvpr.thecvf.com/api/miniconf/users/181704?format=json", "institution": "Tottori University"}], "abstract": "Image abstraction, a fundamental component of non-photorealistic rendering (NPR), aims to simplify photographs into stylized depictions while preserving perceptually important structures. A central difficulty is selectivity: removing fine textures while preserving semantically meaningful boundaries. Existing approaches often expose only a few entangled controls, so smoothing strength and structural scale cannot be adjusted independently, which limits intuitive user control.We propose the Semantic Scale Space (SSS), a framework that organizes abstraction on two decoupled axes, abstraction strength and semantic granularity. SSS externalizes the stopping criteria by using a controllable semantic boundary detector to specify which structures act as barriers to smoothing, independently of how strongly homogeneous regions are simplified. We instantiate SSS with Adaptive Granularity Scheduling Smoothing (AGSS), which combines a donor-gated diffusion operator with a fine-to-coarse granularity schedule, and we introduce an effect-matched evaluation protocol based on a Region Homogeneity Index that compares methods at matched smoothing levels. On SBD and DIV2K, AGSS achieves higher boundary preservation and lower geometric drift than strong baselines at the same degree of smoothing, and a user study shows that its abstractions are consistently preferred in downstream NPR pipelines. These results demonstrate that SSS and AGSS provide practical, controllable image abstraction for modern creative applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38151", "url": null, "sourceid": 45059, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38143, "uid": "df1d759af9661d783a0a36ef0ef288e9", "name": "VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer", "authors": [{"id": 172922, "fullname": "Rui Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/172922?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 85144, "fullname": "Chuanming Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85144?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 85161, "fullname": "Huadong Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/85161?format=json", "institution": "Beijing University of Post and Telecommunication, Tsinghua University"}], "abstract": "With the rapid development of pre-training technologies, adapting large-scale Vision-Language Models (VLMs) for video understanding \\emph{\\ie} image-to-video transfer learning has become a dominant paradigm. To achieve superior performance, it raises as an effective strategy among recent advances to employ Mixture-of-Experts (MoE) to enhance VLMs' temporal modeling capabilities. However, conventional MoE designs suffer from expert homogenization, where all experts act as identical generalists, inefficiently learning spatio-temporal features from undifferentiated video streams.To overcome this problem, we propose VidPrism, a novel heterogeneous temporal Mixture-of-Experts framework. VidPrism pioneers a division of labor by deploying functionally specialized experts, each assuming a role ranging from spatial understanding to temporal modeling.To feed these specialists appropriately, we introduce a content-aware, multi-rate sampling module that dynamically generates streams ranging from semantically rich to motion-focused representations, providing specialized inputs for experts.Furthermore, a dynamic, bidirectional fusion mechanism enables synergistic information exchange between these pathways, leading to a comprehensive video representation. Extensive experiments on various video recognition benchmarks demonstrate that VidPrism achieves state-of-the-art performance and effectively fosters expert specialization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38143", "url": null, "sourceid": 39748, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38144, "uid": "917652aa518c7e4f75a9f41bbf03909e", "name": "PatchScene: Patch-based Voxel Diffusion Model for Large-Scale Scene Completion", "authors": [{"id": 189150, "fullname": "Qingdong Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189150?format=json", "institution": "Northeastern University"}, {"id": 189151, "fullname": "Jiajun Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189151?format=json", "institution": null}, {"id": 189152, "fullname": "Shilin Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189152?format=json", "institution": "Northeastern University"}, {"id": 189153, "fullname": "XinJingHe XinJingHe", "url": "http://cvpr.thecvf.com/api/miniconf/users/189153?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 189154, "fullname": "Chao.Lu Chao.Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189154?format=json", "institution": null}, {"id": 189155, "fullname": "wanghuanran wanghuanran", "url": "http://cvpr.thecvf.com/api/miniconf/users/189155?format=json", "institution": "Mach"}, {"id": 155831, "fullname": "Jiyao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155831?format=json", "institution": "Peking University"}], "abstract": "We propose PatchScene, a novel diffusion-based framework for large-scale LiDAR scene completion. Unlike existing methods that rely on global latent representations or dense voxel grids, PatchScene adopts a patch-based voxel diffusion paradigm that explicitly generates fine-grained geometry within localized 3D regions. To ensure coherent reconstruction at both spatial and temporal scales, we introduce a confidence-guided spatio-temporal fusion mechanism that integrates overlapping patches and adjacent frames in a unified generative process. Furthermore, we design an Annular-Flow diffusion strategy that leverages the radial density pattern of LiDAR scans to progressively propagate high-fidelity information from near-range to far-range regions, enabling spatially unbounded scene completion. Extensive experiments on the SemanticKITTI benchmark demonstrate that PatchScene achieves state-of-the-art performance across all standard metrics, surpassing previous approaches in both geometric accuracy and temporal consistency. Remarkably, the model trained on 20 m LiDAR ranges generalizes effectively to 50 m scenes without retraining, highlighting its strong scalability and generalization capability for real-world autonomous driving applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38144", "url": null, "sourceid": 42584, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38146, "uid": "51bc6b9193a6829af6044cdfa8c1a69d", "name": "EchoVDiff: Cardiac-Cycle Echocardiography Video Generation from Arbitrary Frame", "authors": [{"id": 189159, "fullname": "Jiansong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189159?format=json", "institution": "Shenzhen University"}, {"id": 189160, "fullname": "Xiaying Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189160?format=json", "institution": "Fujian Medical University"}, {"id": 189161, "fullname": "Xiaoling Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189161?format=json", "institution": "Shenzhen University"}, {"id": 76746, "fullname": "Linlin Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76746?format=json", "institution": "Shenzhen University"}], "abstract": "Reconstructing a physiologically plausible cardiac video from a single image remains a fundamental challenge in generative modeling, owing to the complex and nonlinear periodic dynamics of echocardiography. Previous image-to-video (I2V) approaches primarily focus on temporal continuity, yet often struggle to capture the intrinsic periodicity of cardiac motion, leading to limited temporal coherence and semantic consistency. We present EchoVDiff, a novel phase-aware diffusion model that reconstructs a full cardiac cycle from any single frame. Instead of direct pixel synthesis, EchoVDiff integrates physiological priors into a diffusion paradigm, learning interpretable mappings between cardiac phase, anatomy, and motion. By jointly modeling temporal rhythm and spatial semantics within a disentangled latent space, it achieves controllable and physiologically consistent generation. Extensive experiments on EchoNet-Dynamic and EchoNet-Pediatric demonstrate that EchoVDiff consistently surpasses state-of-the-art methods in both fidelity and temporal coherence. Remarkably, it enables accurate reconstruction of complete cardiac cycles from arbitrary phases, marking the first demonstration of single-frame-driven echocardiographic video generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38146", "url": null, "sourceid": 43579, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38149, "uid": "5f1a8a8fddbc8a5ff4bf0f111dc69ff4", "name": "Label-Free Cross-Task LoRA Merging with Null-Space Compression", "authors": [{"id": 176159, "fullname": "Wonyoung Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/176159?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology (KAIST)"}, {"id": 129325, "fullname": "Wooseong Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/129325?format=json", "institution": "KAIST"}, {"id": 76867, "fullname": "Kuk-Jin Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/76867?format=json", "institution": "KAIST"}], "abstract": "Model merging constructs a single model by combining multiple independently fine-tuned checkpoints without joint multi-task training. In the era of foundation-model, fine-tuning with Low-Rank Adaptation (LoRA) is prevalent, making LoRA merging a promising target. Existing approaches can work in homogeneous settings where all target tasks are classification but often fail when tasks span classification and regression. Approaches that uses entropy-based surrogates do not apply to regression and are costly for large language models due to long token sequences. We introduce Null-Space Compression (NSC) Merging, a label-free, output-agnostic method that sets merge weights from adapter geometry. Our key observation is that during LoRA finetuning the down-projection factor $A$ in $\\Delta W = BA$ compresses its null space, and the compression correlates with performance. NSC uses this as a optimization signal for merging that can generalize across classification, regression, and sequence generation. NSC achieves state-of-the-art performance across twenty heterogeneous vision tasks with balanced gains where prior methods overfit subsets of tasks. It also outperforms baselines on six NLI benchmarks and on vision\u2013language evaluations for VQA and image captioning, demonstrating scalability and effectiveness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38149", "url": null, "sourceid": 44695, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38152, "uid": "dcd1f29d5cf389f753b87b30f472c6f3", "name": "Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding", "authors": [{"id": 76154, "fullname": "Fatih Ilhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/76154?format=json", "institution": "Georgia Institute of Technology"}, {"id": 92855, "fullname": "Gaowen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/92855?format=json", "institution": "Cisco Systems"}, {"id": 129129, "fullname": "Ramana Kompella", "url": "http://cvpr.thecvf.com/api/miniconf/users/129129?format=json", "institution": "Cisco"}, {"id": 131096, "fullname": "Selim Tekin", "url": "http://cvpr.thecvf.com/api/miniconf/users/131096?format=json", "institution": "College of Computing, Georgia Institute of Technology"}, {"id": 131101, "fullname": "Tiansheng Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131101?format=json", "institution": "Georgia Institute of Technology"}, {"id": 189166, "fullname": "Zachary Yahn", "url": "http://cvpr.thecvf.com/api/miniconf/users/189166?format=json", "institution": "Georgia Institute of Technology"}, {"id": 181811, "fullname": "Yichang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181811?format=json", "institution": "Georgia Institute of Technology"}, {"id": 181432, "fullname": "Ling Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181432?format=json", "institution": "Georgia Institute of Technology"}], "abstract": "Large Vision-Language Models (VLMs) have achieved remarkable success in multi-modal reasoning, but their inference time efficiency remains a significant challenge due to the memory overhead during decoding, especially when the query and answer of VLMs consist of long sequences of visual and text tokens. This paper presents AttentionPack, an adaptive and attention-aware optimization framework tailored for large vision-language models with improving memory-efficiency during decoding, focusing on addressing the challenges due to the increased high number of visual inputs and interactions, particularly in long-context tasks with multiple high-resolution images or videos. AttentionPack is novel in two aspects: (i) We introduce a multi-head attention compaction method for economically storing key and value matrices by exploiting the implicit low-rank structure, and (ii) we develop a token-specific attention-aware decompression mechanism to reduce latency overhead. Experimental results on multiple benchmarks demonstrate that AttentionPack improves memory efficiency by up to 8x, compared to the existing representative decoding optimization methods, enabling higher batch sizes and faster batch inference while preserving the model output quality or longer context lengths for superior retrieval performance. We also report the effectiveness of AttentionPack with eviction, quantization and kernel fusion, showing further memory efficiency gains in resource-limited environments.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38152", "url": null, "sourceid": 40634, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38155, "uid": "6fcd734d28ae00944f8f7c68a219bbc5", "name": "SIGMA: Selective-Interleaved Generation with Multi-Attribute Tokens", "authors": [{"id": 180167, "fullname": "Xiaoyan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180167?format=json", "institution": "University of Michigan"}, {"id": 129007, "fullname": "Zechen Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/129007?format=json", "institution": "Show Lab, National University of Singapore"}, {"id": 126670, "fullname": "Haofan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126670?format=json", "institution": "Xiaohongshu"}, {"id": 131537, "fullname": "Yiren Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/131537?format=json", "institution": "Shanghai Jiaotong University"}], "abstract": "Recent unified models such as Bagel demonstrate that paired image\u2013edit data can effectively align multiple visual tasks within a single diffusion transformer. However, these models remain limited to single-condition inputs and lack the flexibility needed to synthesize results from multiple heterogeneous sources. We present SIGMA (Selective-Interleaved Generation with Multi-Attribute Tokens), a unified post-training framework that enables interleaved multi-condition generation within diffusion transformers. SIGMA introduces selective multi-attribute tokens, including style, content, subject and identity tokens, which allow the model to interpret and compose multiple visual conditions in an interleaved text\u2013image sequence. Through post-training on the Bagel unified backbone with 700K interleaved examples, SIGMA supports compositional editing, selective attribute transfer and fine-grained multimodal alignment. Extensive experiments show that SIGMA improves controllability, cross-condition consistency and visual quality across diverse editing and generation tasks, with substantial gains over Bagel on compositional tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38155", "url": null, "sourceid": 37538, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38156, "uid": "b4049e3fde920ab9e5af9f47b292a6c5", "name": "MatchMask: Mask-Centric Generative Data Augmentation for Label-Scarce Semantic Segmentation", "authors": [{"id": 77449, "fullname": "Yuqi Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/77449?format=json", "institution": "Zhejiang university"}, {"id": 155623, "fullname": "Hao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155623?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 87374, "fullname": "Wenqi Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/87374?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 178534, "fullname": "Shiqu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/178534?format=json", "institution": null}, {"id": 189169, "fullname": "Zhihong Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189169?format=json", "institution": "Beijing Automobile Works"}, {"id": 88183, "fullname": "Wenxiao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88183?format=json", "institution": "Zhejiang University"}, {"id": 135750, "fullname": "Xiaofei He", "url": "http://cvpr.thecvf.com/api/miniconf/users/135750?format=json", "institution": "Zhejiang University"}, {"id": 184560, "fullname": "Kaipeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184560?format=json", "institution": "Shanda AI Research"}], "abstract": "Current semantic segmentation models are very data-hungry and require massive costly pixel-wise human annotations. Generative data augmentation, which scales the train set using generative models, provides a potential remedy. In this paper, we propose MatchMask, a novel mask-centric generative data augmentation approach tailored for label-scarce semantic segmentation. By leveraging a limited set of labeled semantic masks, MatchMask generates diverse, realistic, and well-aligned image-mask pairs, thereby enhancing the performance of semantic segmentation models. Specifically, to adapt existing text-to-image models for semantic image synthesis in the few-shot setting, we first propose a Gradient Probe Method to investigate the role of each layer in the diffusion model. On this basis, a lightweight LoRA-style adapter is designed for critical layers to enable efficient adaptation, coupled with a Layer-adaptive Cross-attention Fusion mechanism. Meanwhile, we present a robust relative filtering principle to suppress incorrectly synthesized regions. Moreover, the proposed approach is extended to MatchMask++ in the semi-supervised setting to take advantage of additional unlabeled data. Experimental results on PASCAL VOC, COCO and ADE20K demonstrate that MatchMask remarkably enhances the performance of segmentation models, surpassing prior data augmentation techniques in various benchmarks, e.g., 67.5%->74.3% mIoU on PASCAL VOC. Our code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38156", "url": null, "sourceid": 38323, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38162, "uid": "0be8394f8877c4ead942672b113f81c2", "name": "Stabilizing Streaming Video Geometry via Dynamic Feature Normalization", "authors": [{"id": 90677, "fullname": "Xiaoyang Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90677?format=json", "institution": "University of Hong Kong"}, {"id": 172662, "fullname": "Muxin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/172662?format=json", "institution": "The University of Hong Kong"}, {"id": 189186, "fullname": "Xiaoshan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189186?format=json", "institution": "University of Hong Kong"}, {"id": 143492, "fullname": "Ruicheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/143492?format=json", "institution": "University of Science and Technology of China"}, {"id": 70960, "fullname": "Yihua Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70960?format=json", "institution": "University of Hong Kong"}, {"id": 102107, "fullname": "Yangtian Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/102107?format=json", "institution": "University of Hong Kong"}, {"id": 86239, "fullname": "Shaoshuai Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/86239?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 88776, "fullname": "Xiaojuan Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/88776?format=json", "institution": "University of Oxford"}], "abstract": "Consistent 3D geometry estimation from streaming RGB input is crucial for real-world applications such as autonomous driving, embodied AI, and large-scale reconstruction. While modern monocular geometry foundation models achieve strong single-image accuracy, they exhibit severe temporal inconsistency on continuous input, notably dominated by scale\u2013shift drifting. Through targeted empirical analysis, we trace this instability to its root cause: fluctuations in latent feature statistics, whose mean and variance directly determine the predicted depth\u2019s scale and shift. Building on this insight, we introduce Dynamic Feature Normalization (DyFN), a lightweight, causal recurrent module that dynamically and robustly modulates feature statistics to maintain stable geometry over time. We adapt powerful pretrained monocular geometry models for streaming by finetuning only DyFN-- a mere 2\\% additional parameters-- while keeping the backbone frozen, thereby achieving temporal consistency without compromising single-image accuracy. Extensive experiments across four benchmarks show that DyFN effectively eliminates temporal artifacts such as disjointed layering and positional jitter, and achieves state-of-the-art temporal stability, improving over prior streaming methods by up to 14\\% and even outperforming heavier non-causal video baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38162", "url": null, "sourceid": 41662, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38166, "uid": "5e2967fb6dd36c6d23eae2230759ccd3", "name": "Resolving the Identity Crisis in Text-to-Image Generation", "authors": [{"id": 75964, "fullname": "Shubhankar Borse", "url": "http://cvpr.thecvf.com/api/miniconf/users/75964?format=json", "institution": "Qualcomm AI Research"}, {"id": 176944, "fullname": "Farzad Farhadzadeh", "url": "http://cvpr.thecvf.com/api/miniconf/users/176944?format=json", "institution": "Qualcomm AI Research"}, {"id": 85738, "fullname": "Munawar Hayat", "url": "http://cvpr.thecvf.com/api/miniconf/users/85738?format=json", "institution": "Monash University"}, {"id": 85634, "fullname": "Fatih Porikli", "url": "http://cvpr.thecvf.com/api/miniconf/users/85634?format=json", "institution": "QualComm"}], "abstract": "State-of-the-art text-to-image models demonstrate impressive realism but suffer from a persistent identity crisis when generating scenes with multiple humans: producing duplicate faces, merging identities, and miscounting individuals. We present DisCo, Reinforcement with DiverSity Constraints, a novel reinforcement learning framework that directly optimizes identity diversity both within images and across groups of generated samples. DisCo fine-tunes flow-matching models using Group-Relative Policy Optimization (GRPO), guided by a compositional reward that: (i) penalizes facial similarity within images, (ii) discourages identity repetition across samples, (iii) enforces accurate person counts, and (iv) preserves visual fidelity via human preference scores. A single-stage curriculum stabilizes training as prompt complexity increases, requiring no additional annotations. On the DiverseHumans Testset, DisCo achieves 98.6\\% Unique Face Accuracy and near-perfect Global Identity Spread, outperforming both open-source and proprietary models (e.g., Gemini, GPT-Image) while maintaining competitive perceptual quality. Our results establish cross-sample diversity as a critical axis for resolving identity collapse in generative models, and position DisCo as a scalable, annotation-free solution for multi-human image synthesis.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38166", "url": null, "sourceid": 38384, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38167, "uid": "8f56c9b214a9d6f78296e2daf6614fb5", "name": "Authorize-on-Demand: Dynamic Authorization with Legality-Aware Intellectual Property Protection for VLMs", "authors": [{"id": 87110, "fullname": "Lianyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87110?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 87112, "fullname": "Meng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87112?format=json", "institution": "Institute of High Performance Computing,  A*STAR, Singapore"}, {"id": 87102, "fullname": "Huazhu Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87102?format=json", "institution": "Institute of High Performance Computing, Singapore, A*STAR"}, {"id": 87100, "fullname": "Daoqiang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87100?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}], "abstract": "The rapid adoption of vision-language models (VLMs) in visual recognition and multimodal reasoning has heightened the demand for robust intellectual property (IP) protection of these high-value pretrained models. Effective IP protection should proactively confine model deployment within authorized domains and prevent unauthorized transfers. However, existing methods rely on predefined and static authorized domain during training, limiting flexibility in dynamic real-world environments. In addition, they often produce opaque and unsafe responses to unauthorized inputs, lacking explicit alerts for illegal usage.To address these limitations, we propose a novel dynamic authorization with legality-aware intellectual property protection (AoD-IP) for VLMs, a framework that supports authorize-on-demand and legality-aware assessment. AoD-IP introduces a lightweight dynamic authorization module that enables flexible, user-controlled authorization, allowing users to actively specify or switch authorized domains on demand at deployment time. This enables the model to adapt seamlessly as application scenarios evolve and provides substantially greater extensibility than existing static-domain approaches. In addition, AoD-IP incorporates a dual-path inference mechanism that jointly predicts input legality-aware and task-specific outputs. Comprehensive experimental results on multiple cross-domain benchmarks demonstrate that AoD-IP maintains strong authorized-domain performance and reliable unauthorized detection, while supporting user-controlled authorization for adaptive deployment in dynamic environments.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38167", "url": null, "sourceid": 42121, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38170, "uid": "186f0892f13de5443c0b6d042a6ddde0", "name": "Your Latent Mask is Wrong: Pixel-Equivalent Latent Compositing for Diffusion Models", "authors": [{"id": 189201, "fullname": "Rowan Bradbury", "url": "http://cvpr.thecvf.com/api/miniconf/users/189201?format=json", "institution": "Bradbury Group; Wand AI"}, {"id": 189202, "fullname": "Elea Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/189202?format=json", "institution": "Bradbury Group; Wand Technologies Inc."}], "abstract": "Linearly interpolating between VAE latents using a downsampled mask field remains a common heuristic for diffusion inpainting. However, this approach systematically violates a key principle: latent compositing must respect pixel equivalence; compositing latents must approximate compositing pixels. Because VAE latents capture global context rather than pixel-local structure, linear interpolation fails this requirement, producing seams, color shifts, and halos that diffusion subsequently amplifies into larger artifacts.We propose Pixel-Equivalent Latent Compositing (PELC) and instantiate it with DecFormer, a 7.7M-parameter transformer that predicts per-channel blend weights and a nonlinear residual to realize mask-consistent latent fusion. DecFormer is trained so that decoding after fusion matches pixel-space alpha compositing, is plug-compatible with existing diffusion pipelines, requires no backbone finetuning and adds only 0.07\\% of FLUX.1-Dev\u2019s parameters and 3.5\\% FLOP overhead.On the FLUX.1 family, DecFormer restores global color consistency, soft-mask support, sharp boundaries, and high-fidelity masking, reducing error metrics around edges by up to 53\\% over standard mask interpolation. Used as an inpainting prior, a lightweight LoRA on FLUX.1-Dev with DecFormer achieves fidelity comparable to FLUX.1-Fill, a fully finetuned inpainting model. While we focus on inpainting, PELC is a general recipe for pixel-equivalent latent editing (e.g., overlays, tone/relighting, warps), as we demonstrate on a complex color-correction task.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38170", "url": null, "sourceid": 40766, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38179, "uid": "319b4401f81f42d06aec44e1fd6edd9a", "name": "Hierarchically Robust Zero-shot Vision-Language Models", "authors": [{"id": 107430, "fullname": "Junhao Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/107430?format=json", "institution": "Nanyang Technological University &amp; A*STAR"}, {"id": 152452, "fullname": "Yifei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152452?format=json", "institution": "Nanyang Technological University"}, {"id": 152451, "fullname": "Hao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152451?format=json", "institution": "CSIRO"}, {"id": 131393, "fullname": "Yew-Soon Ong", "url": "http://cvpr.thecvf.com/api/miniconf/users/131393?format=json", "institution": "Nanyang Technological University"}, {"id": 86028, "fullname": "Piotr Koniusz", "url": "http://cvpr.thecvf.com/api/miniconf/users/86028?format=json", "institution": "Data61/CSIRO + Australian National University"}], "abstract": "Vision-Language Models (VLMs) can perform zero-shot classification but are susceptible to adversarial attacks. While robust fine-tuning improves their  robustness, existing approaches align fixed text embeddings with an image embedding,  sacrificing  natural performance and robustness. A  robustness degradation also occurs when a model faces adversarial attacks targeting superclasses (parent classes, e.g., \\texttt{mammal}) in addition to their base (leaf) classes (e.g., \\texttt{cat}). Thus, to enhance adversarial robustness and leverage the inherent hierarchical properties of class space, we propose a novel adversarial fine-tuning framework based on hierarchical embeddings and several levels of adversarially robust alignment of image-text modalities. Additional mechanisms place visual embeddings at the desired depth of hierarchy, and we provide theoretical connection between the depth of embedding in the hierarchy and the maximum viable margin size. Our model naturally realizes several margin sizes boosting generalization of adversaries for robustification. As various trees with different parent labels can share the same leaf labels, we also consider aligning over multiple trees to boost semantic variety. Experiments across several datasets are performed.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38179", "url": null, "sourceid": 31130, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38175, "uid": "f72f05e91d583cf63ae47e955ca9c312", "name": "Revisiting Multimodal KV Cache Compression:  A Frequency-Domain-Guided Outlier-KV-Aware Approach", "authors": [{"id": 180738, "fullname": "Yaoxin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180738?format=json", "institution": "Fudan University"}, {"id": 126133, "fullname": "Peng Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/126133?format=json", "institution": "Fudan University"}, {"id": 157367, "fullname": "Xudong Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/157367?format=json", "institution": "Fudan University"}, {"id": 189217, "fullname": "Chongjun Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189217?format=json", "institution": "Fudan University"}, {"id": 157364, "fullname": "Maosen Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/157364?format=json", "institution": "Fudan University"}, {"id": 189218, "fullname": "Jia Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189218?format=json", "institution": "Zhangjiang Laboratory"}, {"id": 76400, "fullname": "Tao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76400?format=json", "institution": "Fudan University"}], "abstract": "Multimodal large language models suffer from substantial inference overhead since multimodal KV Cache grows proportionally with the visual input length. Existing multimodal KV cache compression methods mostly rely on attention score to reduce cache size, which makes them are incompatible with established efficient attention kernels (e.g., FlashAttention) and ignores the contribution of value vectors to the attention output. In this work, we revisit multimodal KV cache compression from the perspective of the KV matrices\u2019 distribution. First, we observe that frequency-domain energy of multimodal KV matrices is predominantly concentrated in low-frequency and extract this principal energy via a low-pass filter. Further, we find that removing KV pairs that deviate substantially from this principal energy leads to a pronounced performance drop, which we define as Outlier KVs.  Considering Outlier KVs are more likely to encode features critical for inference, we propose FlashCache, a frequency-domain\u2013guided, Outlier-KV-aware KV Cache compression framework. First, we introduce an Outlier KV Recognition Module that models the principal component of multimodal KV matrices in the frequency domain and preferentially retains KV pairs that significantly deviate from it. Furthermore, Dynamic Budget Allocation Module is designed to adaptively determine the per-layer KV cache size to retain more Outlier KVs. Experiments on multiple MLLMs and benchmarks demonstrate that FlashCache outperforms state-of-the-art multimoal KV compression methods, achieving up to 1.69\u00d7 faster decoding with 80\\% lower KV memory usage while maintaining task performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38175", "url": null, "sourceid": 38752, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38177, "uid": "2cef62389cd33c26dc98735ef4cb5676", "name": "Linking Modality Isolation in Heterogeneous Collaborative Perception", "authors": [{"id": 130339, "fullname": "Changxing Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130339?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 189222, "fullname": "Zichen Chao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189222?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 86708, "fullname": "Siheng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/86708?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Collaborative perception leverages data exchange among multiple agents to enhance overall perception capabilities. However, heterogeneity across agents introduces domain gaps that hinder collaboration, and this is further exacerbated by an underexplored issue: modality isolation. It arises when multiple agents with different modalities never co-occur in any training data frame, enlarging cross-modal domain gaps. Existing alignment methods rely on supervision from spatially overlapping observations, thus fail to handle modality isolation. To address this challenge, we propose CodeAlign, the first efficient, co-occurrence-free alignment framework that smoothly aligns modalities via cross-modal feature-code-feature(FCF) translation. The key idea is to explicitly identify the representation consistency through codebook, and directly learn mappings between modality-specific feature spaces, thereby eliminating the need for spatial correspondence. Codebooks regularize feature spaces into code spaces, providing compact yet expressive representations. With a prepared code space for each modality, CodeAlign learns FCF translations that map features to the corresponding codes of other modalities, which are then decoded back into features in the target code space, enabling effective alignment. Experiments show that, when integrating three modalities, CodeAlign requires only 8% of the training parameters of prior alignment methods, reduces communication load by 1024x, and achieves state-of-the-art perception performance on both OPV2V and DAIR-V2X dataset. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38177", "url": null, "sourceid": 37015, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38180, "uid": "785f9daa6aedd3afc9ef23137a51fe3e", "name": "Global Information Thresholding for Sufficient and Necessary Circuits", "authors": [{"id": 88150, "fullname": "Jegyeong Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/88150?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}], "abstract": "We study the problem of extracting causal circuits-small edge-level subgraphs inside a trained network that are sufficient on their own and necessary to the model\u2019s behavior under explicit error control. Prior work largely optimizes observational rankings or applies ad-hoc sparsification, which can sever paths, ignore inhibitory edges, and admit ``ghost\" components that fail under intervention. We recast circuit discovery as information-constrained selection rather than ranking: a single global threshold chooses edges by their marginal contribution, combined with a null hypothesis-based statistical threshold to control family-wise errors. Edge scores are computed by rank-consistent attribution aligned to the task metric, stabilized with Fisher-diagonal variance normalization, projected to an edge coordinate system that preserves paths, and enforced with hard gates for interventional semantics. We propose an evaluation protocol that prioritizes sufficiency/necessity (CPR, CMD), editability, error rates, and standard ranking metrics. The result is a small, path-faithful circuit with reproducible selection criteria. Our motivation is to replace visually appealing heatmaps with interventional guarantees and explicit error control.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38180", "url": null, "sourceid": 36942, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38183, "uid": "5c733bd63223c2d18f5d66f0c15a88cb", "name": "Any Resolution Any Geometry: From Multi-View To Multi-Patch", "authors": [{"id": 159272, "fullname": "Wenqing Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/159272?format=json", "institution": "King Abdullah University of Science and Technology"}, {"id": 183153, "fullname": "Zhenyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183153?format=json", "institution": "KAUST"}, {"id": 164160, "fullname": "Mykola Lavreniuk", "url": "http://cvpr.thecvf.com/api/miniconf/users/164160?format=json", "institution": "Space Research Institute"}, {"id": 107375, "fullname": "Jian Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/107375?format=json", "institution": "KAUST"}, {"id": 189234, "fullname": "Ramzi Idoughi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189234?format=json", "institution": "King Abdullah University of Science and Technology"}, {"id": 150519, "fullname": "Xiangjun Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150519?format=json", "institution": "KAUST"}, {"id": 75925, "fullname": "Peter Wonka", "url": "http://cvpr.thecvf.com/api/miniconf/users/75925?format=json", "institution": "KAUST"}], "abstract": "Joint estimation of surface normals and depth is essential for holistic 3D scene understanding, yet high-resolution prediction remains difficult due to the trade-off between preserving fine local detail and maintaining global consistency. We address this challenge by adapting the Visual Geometry Grounded Transformer (VGGT) into a unified multi-patch transformer for monocular high-resolution depth--normal estimation. A single high-resolution image is partitioned into patches that are augmented with coarse depth and normal priors from pre-trained models, and jointly processed in a single forward pass to predict refined geometric outputs. Global coherence is enforced through cross-patch attention, which enables long-range geometric reasoning and seamless propagation of information across patches within a shared backbone. To further enhance spatial robustness, we introduce a GridMix patch sampling strategy that probabilistically samples grid configurations during training, improving inter-patch consistency and generalization. Our method achieves state-of-the-art results on UnrealStereo4K, jointly improving depth and normal estimation---reducing AbsRel from 0.0582 to 0.0291, RMSE from 2.17 to 1.31, and lowering mean angular error from 23.36$^\\circ$ to 18.27$^\\circ$\u2014while producing sharper and more stable geometry. The proposed multi-patch framework also demonstrates strong zero-shot and cross-domain generalization and scales effectively to very high resolutions, offering an efficient and extensible solution for high-quality geometry refinement.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38183", "url": null, "sourceid": 35385, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38186, "uid": "9acd09f9c8b6991cfb734e9dbd3acd05", "name": "CRFT: Consistent\u2013Recurrent Feature Flow Transformer for Cross-Modal Image Registration", "authors": [{"id": 174172, "fullname": "Xuecong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/174172?format=json", "institution": "Northeastern University"}, {"id": 189239, "fullname": "Mengzhu Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/189239?format=json", "institution": "Northeastern University at Qinhuangdao"}, {"id": 189240, "fullname": "Zixuan Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/189240?format=json", "institution": "National University of Defense Technology"}, {"id": 189241, "fullname": "Zhang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189241?format=json", "institution": "National University of Defense Technology"}, {"id": 184087, "fullname": "Xichao Teng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184087?format=json", "institution": "National University of Defense Technology"}], "abstract": "We present Consistent\u2013Recurrent Feature Flow Transformer (CRFT), a unified coarse-to-fine framework that learns feature flow for robust cross-modal registration. CRFT learns a modality-consistent feature flow representation within a transformer-based architecture that jointly performs feature alignment and flow estimation. The coarse stage establishes global correspondences through multi-scale feature correlation, while the fine stage refines local details via hierarchical feature fusion and adaptive spatial reasoning. To enhance geometric adaptability, an iterative discrepancy-guided attention mechanism with a Spatial Geometric Transform (SGT) recurrently refines the flow field, progressively capturing subtle spatial inconsistencies and enforcing feature-level consistency. This design enables accurate alignment under large affine and scale variations while maintaining structural coherence across modalities. Extensive experiments on diverse cross-modal datasets demonstrate that CRFT consistently outperforms state-of-the-art registration methods in both accuracy and robustness. Beyond registration, CRFT provides a generalizable paradigm for multimodal spatial correspondence, offering broad applicability to remote sensing, autonomous navigation, and medical imaging.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38186", "url": null, "sourceid": 36816, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38190, "uid": "4f7ef7308c8eb7e5e3730a15a66c0fb3", "name": "Semantic Audio-Visual Navigation in Continuous Environments", "authors": [{"id": 174932, "fullname": "Yichen Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/174932?format=json", "institution": "Wuhan University"}, {"id": 180748, "fullname": "Hebaixu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180748?format=json", "institution": "Zhongguancun Academy, Beijing, China"}, {"id": 154769, "fullname": "Meng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154769?format=json", "institution": "Shandong Jianzhu University"}, {"id": 152555, "fullname": "Yu ZHOU", "url": "http://cvpr.thecvf.com/api/miniconf/users/152555?format=json", "institution": "Nankai University"}, {"id": 189250, "fullname": "Chen Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189250?format=json", "institution": "Tsinghua University"}, {"id": 189251, "fullname": "Kehan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189251?format=json", "institution": "University of the Chinese Academy of Sciences; Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 189252, "fullname": "Gongping Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189252?format=json", "institution": null}], "abstract": "Audio-visual navigation enables embodied agents to navigate toward sound-emitting targets by leveraging both auditory and visual cues. However, most existing approaches depend on precomputed room impulse responses (RIRs) for binaural audio rendering, restricting agents to discrete grid positions and leading to spatially discontinuous observations. To establish a more realistic setting, we introduce Semantic Audio-Visual Navigation in Continuous Environments (SAVN-CE), where agents can move freely in 3D spaces and perceive temporally and spatially coherent audio-visual streams. In this setting, targets may intermittently become silent or stop emitting sound entirely, causing agents to lose goal information. To tackle this challenge, we propose MAGNet, a multimodal transformer-based model that jointly encodes spatial and semantic goal representations and integrates historical context with self-motion cues to enable memory-augmented goal reasoning. Comprehensive experiments demonstrate that MAGNet significantly outperforms state-of-the-art methods, achieving up to a 12.1\\% absolute improvement in success rate. These results also highlight its robustness to short-duration sounds and long-distance navigation scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38190", "url": null, "sourceid": 34240, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38193, "uid": "83aa658c240d1badb5185b3d6fc8c808", "name": "ARCache: Mitigating Error Accumulation for Caching-based Acceleration in Autoregressive Video Diffusion Models", "authors": [{"id": 157478, "fullname": "Kepan Nan", "url": "http://cvpr.thecvf.com/api/miniconf/users/157478?format=json", "institution": "Nanjing University"}, {"id": 129149, "fullname": "Wangbo Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/129149?format=json", "institution": "National University of Singapore"}, {"id": 100053, "fullname": "Penghao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/100053?format=json", "institution": "ByteDance TikTok"}, {"id": 189269, "fullname": "Jun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189269?format=json", "institution": "nanjing university"}, {"id": 157173, "fullname": "Zhenheng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157173?format=json", "institution": "Tiktok"}, {"id": 152125, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152125?format=json", "institution": "nanjing university"}, {"id": 107265, "fullname": "Ying Tai", "url": "http://cvpr.thecvf.com/api/miniconf/users/107265?format=json", "institution": "Nanjing University"}], "abstract": "Caching-based acceleration methods have recently driven significant progress in efficient video generation with diffusion models. However, we identify a critical limitation when directly applying these acceleration techniques to autoregressive video diffusion models, which generate long videos by sequentially synthesizing segments conditioned on historical context. In such settings, any approximation errors introduced by acceleration tend to propagate and accumulate over time, resulting in severe error accumulation and progressive degradation of video quality. To address this challenge, we propose ARCache, the first training-free caching-based acceleration framework specifically designed for autoregressive video diffusion models. ARCache improves both the timing and quality of caching through two key components. First, History-Guided Cache (HGC) leverages historical information to adaptively schedule caching for each segment, enabling more accurate and efficient cache utilization. Second, Enhanced Residual Correction (ERC) adaptively approximates model residuals and refines the residual trajectory for subsequent segments, effectively mitigating error accumulation while simultaneously reducing computational overhead. Extensive experiments on Framepack-F1, SkyReels-V2, and autoregressive world model Matrix-Game demonstrate that ARCache achieves state-of-the-art acceleration and visual fidelity.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38193", "url": null, "sourceid": 36913, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38196, "uid": "d44836f6af3750d0a10bff027133bb73", "name": "TRANSPORTER: Transferring Visual Semantics from VLM Manifolds", "authors": [{"id": 172352, "fullname": "Alexandros Stergiou", "url": "http://cvpr.thecvf.com/api/miniconf/users/172352?format=json", "institution": "University of Twente"}], "abstract": "How do video understanding models acquire their answers? Although current Vision Language Models (VLMs) reason over complex scenes with diverse objects, action performances, and scene dynamics, understanding and controlling their internal processes remains an open challenge. Motivated by recent advancements in text-to-video (T2V) generative models, this paper introduces a logits-to-video (L2V) task alongside a model-independent approach, TRANSPORTER, to generate videos that capture the underlying rules behind VLMs' predictions. Given the high-visual-fidelity produced by T2V models, TRANSPORTER learns an optimal transport coupling to VLM's high-semantic embedding spaces. In turn, logit scores define embedding directions for conditional video generation. TRANSPORTER generates videos that reflect caption changes over diverse object attributes, action adverbs, and scene context. Quantitative and qualitative evaluations across VLMs demonstrate that L2V can provide a fidelity-rich, novel direction for model interpretability that has not been previously explored.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38196", "url": null, "sourceid": 31636, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38198, "uid": "18c7d3e908572f255a619c09e04b461a", "name": "Uni-Hema: Unified Model for Digital Hematopathology", "authors": [{"id": 189288, "fullname": "Abdul Rehman", "url": "http://cvpr.thecvf.com/api/miniconf/users/189288?format=json", "institution": "Information Technology University, Lahore"}, {"id": 189289, "fullname": "Iqra Rasool", "url": "http://cvpr.thecvf.com/api/miniconf/users/189289?format=json", "institution": "Chughtai institute of pathology"}, {"id": 189290, "fullname": "Ayisha Imran", "url": "http://cvpr.thecvf.com/api/miniconf/users/189290?format=json", "institution": "Chughtai Institute of Pathology"}, {"id": 131713, "fullname": "Mohsen Ali", "url": "http://cvpr.thecvf.com/api/miniconf/users/131713?format=json", "institution": "Information Technology University"}, {"id": 97909, "fullname": "Waqas Sultani", "url": "http://cvpr.thecvf.com/api/miniconf/users/97909?format=json", "institution": null}], "abstract": "Digital hematopathology requires cell-level analysis across diverse disease categories, including malignant disorders (e.g., leukemia), infectious conditions (e.g., malaria), and non-malignant red blood cell disorders (e.g., sickle cell disease). Whether single-task, vision-language, WSI-optimized, or single-cell hematology models, these approaches share a key limitation: they cannot provide unified, multi-task, multi-modal reasoning across the complexities of digital hematopathology. To overcome these limitations, we propose \\textbf{Uni-Hema}, a multi-task, unified model for digital hematopathology integrating detection, classification, segmentation, morphology prediction, and reasoning across multiple diseases. Uni-Hema leverages 46 publicly available datasets, encompassing over 700K images and 21K question\u2013answer pairs, and is built upon \\textbf{Hema-Former}, a multimodal module that bridges visual and linguistic representations at the hierarchy level for the different tasks (detection, classification, segmentation, morphology, mask language modeling and visual question answer) at different granularity. Extensive experiments demonstrate that Uni-Hema achieves comparable or superior performance to train on a single-task and single dataset models,  across diverse hematological tasks, while providing interpretable, morphologically relevant insights at the single-cell level. Our framework establishes a new standard for multi-task and multi-modal digital hematopathology. The code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38198", "url": null, "sourceid": 44606, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38199, "uid": "2c90f710b89e811f1368d0a48804d255", "name": "Coupled Diffusion Sampling for Training-free Multi-view Image Editing", "authors": [{"id": 76025, "fullname": "Hadi Alzayer", "url": "http://cvpr.thecvf.com/api/miniconf/users/76025?format=json", "institution": "University of Maryland"}, {"id": 159444, "fullname": "Yunzhi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/159444?format=json", "institution": "Stanford University"}, {"id": 86204, "fullname": "Chen Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86204?format=json", "institution": "Stanford University"}, {"id": 88945, "fullname": "Jia-Bin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88945?format=json", "institution": "University of Maryland, College Park"}, {"id": 84531, "fullname": "Jiajun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84531?format=json", "institution": "Stanford University"}], "abstract": "Given a collection of multi-view images, we perform consistent multi-view editing with a training-free framework using pre-trained 2D editing models and a generative multi-view model.While 2D editing models can independently edit each image in a set of multi-view images of a 3D scene, they do not maintain consistency across views.Existing approaches typically rely on explicit 3D representations to average out the inconsistencies, but they suffer from a lengthy optimization,  instability under sparse view settings, and can produce blurry results.We address the problem from a different lens, where we use the 2D editing model to steer a multi-view generative model in the diffusion sampling process.This is achieved through our novel coupled diffusion sampling process. We concurrently sample two trajectories from both a multi-view image distribution and a 2D edited image distribution, and connect the samples with a coupling term. Effectively, the two models guide each other during sampling, and the resulting sample from the multi-view model remains consistent while satisfying the desired edit.We validate the effectiveness and generality of this framework on three distinct multi-view image editing tasks, and demonstrate its applicability across various model architectures. We further illustrate the effects of coupling on SoTA image and video generation models, highlighting the potential of our method beyond multi-view editing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38199", "url": null, "sourceid": 42251, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38203, "uid": "b043afb8c56c57b7a4015bbc613de3d4", "name": "Seeing through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events", "authors": [{"id": 180967, "fullname": "Yunshan Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/180967?format=json", "institution": "BeiHang University"}, {"id": 133113, "fullname": "Lin Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/133113?format=json", "institution": "Beijing Institute of Technology"}, {"id": 189300, "fullname": "Nan Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189300?format=json", "institution": "Beihang University"}, {"id": 133150, "fullname": "Yifan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/133150?format=json", "institution": "Beihang University"}, {"id": 88093, "fullname": "Jia Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88093?format=json", "institution": "Beihang University"}], "abstract": "Novel view synthesis from low dynamic range (LDR) blurry images, which are common in the wild, struggles to recover high dynamic range (HDR) and sharp 3D representations in extreme lighting conditions. Although existing methods employ event data to address this issue, they ignore the sensor-physics mismatches between the camera output and physical world radiance, resulting in suboptimal HDR and deblurring results. To cope with this problem, we propose a unified sensor-physics grounded NeRF framework for sharp HDR novel view synthesis from single-exposure blurry LDR images and corresponding events. We utilize NeRF to directly represent the actual radiance of the 3D scene in the HDR domain and model raw HDR scene rays hitting the sensor pixels as in the physical world. A pixel-wise RGB mapping field is introduced to align the above NeRF-rendered HDR pixel values with the sensor-recorded LDR pixel values of the input images. A novel event mapping field is also designed to bridge the physical scene dynamics and actual event sensor output. The two mapping fields are jointly optimized with the NeRF network, leveraging the spatial and temporal dynamic information in events to enhance the sharp HDR 3D representation learning. Experiments on the collected and public datasets demonstrate that our method can achieve state-of-the-art deblurring HDR novel view synthesis results from single-exposure blurry LDR images and corresponding events.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38203", "url": null, "sourceid": 34954, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38205, "uid": "dd31058a4e2ad163eb0c08c07dea8dfb", "name": "AERGS-SLAM: Auto-Exposure-Robust Stereo 3D Gaussian Splatting SLAM", "authors": [{"id": 183411, "fullname": "Zhiyu Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/183411?format=json", "institution": "South China University of Technology"}, {"id": 189304, "fullname": "Feng Hui", "url": "http://cvpr.thecvf.com/api/miniconf/users/189304?format=json", "institution": "South China University of Technology"}, {"id": 189305, "fullname": "Yu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189305?format=json", "institution": "South China University of Technology"}], "abstract": "3D Gaussian splatting (3DGS) has emerged as a revolutionary scene representation in simultaneous localization and mapping (SLAM) research. However, existing research on 3DGS-based SLAM fails to accurately address the appearance variations induced by camera auto-exposure in prevalent real-world scenarios, resulting in reduced localization and photorealistic mapping accuracy. To address this issue, we propose a stereo auto-exposure-robust Gaussian splatting SLAM (AERGS-SLAM), a framework robust to such variations and enables both reliable localization and exposure-controlled photorealistic mapping. Our key contributions are two fold. Firstly, we propose a camera exposure network to model the camera exposure process, which we integrate with Gaussian splatting to achieve exposure-controlled novel view synthesis. Secondly, we exploit an illumination-robust geometric feature for localization and Gaussian map initialization, enhancing localization accuracy under exposure-varying scenarios. Extensive experiments on public datasets and our self-collected real-world dataset demonstrate that AERGS-SLAM outperforms baselines in both localization performance and photorealistic mapping quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38205", "url": null, "sourceid": 35626, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38207, "uid": "c87532b4c0c78e92fdfd956ecc21e165", "name": "Efficiency Follows Global-Local Decoupling", "authors": [{"id": 189312, "fullname": "Zhenyu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189312?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 181698, "fullname": "Gensheng Pei", "url": "http://cvpr.thecvf.com/api/miniconf/users/181698?format=json", "institution": "Sungkyunkwan University"}, {"id": 107127, "fullname": "Tao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/107127?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 174175, "fullname": "Yichao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/174175?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 89778, "fullname": "Tianfei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/89778?format=json", "institution": "Swiss Federal Institute of Technology"}, {"id": 129605, "fullname": "Yazhou Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/129605?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 72996, "fullname": "Fumin Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/72996?format=json", "institution": "UESTC"}], "abstract": "Modern vision models must capture image-level context without sacrificing local detail while remaining computationally affordable. We revisit this tradeoff and advance a simple principle: decouple the roles of global reasoning and local representation. To operationalize this principle, we introduce ConvNeur, a two-branch architecture in which a lightweight neural memory branch aggregates global context on a compact set of tokens, and a locality-preserving branch extracts fine structure. A learned gate lets global cues modulate local features without entangling their objectives. This separation yields subquadratic scaling with image size, retains inductive priors associated with local processing, and reduces overhead relative to fully global attention. On standard classification, detection, and segmentation benchmarks, ConvNeur matches or surpasses comparable alternatives at similar or lower compute and offers favorable accuracy versus latency trade-offs at similar budgets. These results support the view that efficiency follows global-local decoupling.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38207", "url": null, "sourceid": 37330, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38208, "uid": "3e2449ebbeb149857a783b0845c671c2", "name": "Ef\ufb01cient and Training-Free Single-Image Diffusion Models", "authors": [{"id": 183747, "fullname": "Haojun Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183747?format=json", "institution": "University of Toronto; Waabi"}, {"id": 93592, "fullname": "Kiriakos Kutulakos", "url": "http://cvpr.thecvf.com/api/miniconf/users/93592?format=json", "institution": "University of Toronto"}, {"id": 77223, "fullname": "David B. Lindell", "url": "http://cvpr.thecvf.com/api/miniconf/users/77223?format=json", "institution": "University of Toronto"}], "abstract": "We consider the problem of generating images whose internal structure---defined by the distribution of patches across multiple scales---matches that of a single reference image. Recent approaches address this problem by training a diffusion model on a single image. But even in this setting, training is computationally expensive and requires hours of optimization. Instead, we model the image using a dataset of its patches at different scales. As this dataset is finite and the dimensionality of its patches is small, the score function for a noisy patch can be computed tractably using an optimal, closed-form denoiser, eliminating the need for neural network training. We integrate this patch-based denoiser into an efficient, training-free image diffusion model, and we describe how our method connects to classical patch-based image restoration techniques. Our approach achieves state-of-the-art generation quality and diversity compared to trained single-image diffusion models, and we demonstrate applications, including unconditional image generation, text-guided stylization, image symmetrization, and retargeting. Further, we show that our approach is compatible with latent space diffusion, and we show multiple additional acceleration techniques to achieve megapixel single-image generation in one second, and gigapixel generation in minutes.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38208", "url": null, "sourceid": 33980, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38215, "uid": "33ca818e77cf91be0c24c27d53b15826", "name": "HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps", "authors": [{"id": 180205, "fullname": "Xuchang Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/180205?format=json", "institution": "Beijing Institute of Technology"}, {"id": 189345, "fullname": "Xu Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189345?format=json", "institution": "Beijing Institute of Technology"}, {"id": 189346, "fullname": "Jinke Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189346?format=json", "institution": "University of Science and Technology of China"}, {"id": 189347, "fullname": "Hao Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189347?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "Visual localization on standard-definition (SD) maps has emerged as a promising low-cost and scalable solution for autonomous driving. However, existing regression-based approaches often overlook inherent geometric priors, resulting in suboptimal training efficiency and limited localization accuracy. In this paper, we propose a novel homography-guided pose estimator network for fine-grained visual localization between multi-view images and standard-definition (SD) maps. We construct input pairs that satisfy a homography constraint by projecting ground-view features into the BEV domain and enforcing semantic alignment with map features. Then we leverage homography relationships to guide feature fusion and restrict the pose outputs to a valid feasible region, which significantly improves training efficiency and localization accuracy compared to prior methods relying on attention-based fusion and direct 3-DoF pose regression. To the best of our knowledge, this is the first work to unify BEV semantic reasoning with homography learning for image-to-map localization. Furthermore, by explicitly modeling homography transformations, the proposed framework naturally supports cross-resolution inputs, enhancing model flexibility. Extensive experiments on the nuScenes dataset demonstrate that our approach significantly outperforms existing state-of-the-art visual localization methods. Code and pretrained models will be publicly released to foster future research.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38215", "url": null, "sourceid": 35972, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38216, "uid": "1128b9108f8997cf4b24e3b20e5ecceb", "name": "Regulating Rather than Constraining: Adaptive Guidance for Complex Spectral Reconstruction in Pansharpening", "authors": [{"id": 180233, "fullname": "Zhuwei Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/180233?format=json", "institution": "The State Key Lab. LIESMARS, Wuhan University, Wuhan, PR China, Wuhan University"}, {"id": 151548, "fullname": "Zimin Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/151548?format=json", "institution": "EPFL - EPF Lausanne"}, {"id": 189348, "fullname": "He Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189348?format=json", "institution": "Wuhan University"}, {"id": 157459, "fullname": "Linwei Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/157459?format=json", "institution": "China University of Geosciences Wuhan"}, {"id": 129140, "fullname": "Xianwei Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/129140?format=json", "institution": "Wuhan University"}], "abstract": "In remote sensing pansharpening, spectrally mixed regions, where the spectral interactions among adjacent land covers lead to highly inconsistent reconstruction patterns, remain the most challenging areas. Due to the complex spatial distribution and heterogeneous spectral characteristics of ground objects, existing methods relying on rigid architectures and physical constraints struggle to learn generalized reconstruction patterns from limited spectral mixing samples, resulting in unstable generalization. To address this limitation, we propose an architecture-agnostic regularization-guided mechanism that adaptively directs the model to focus on learning reliable reconstruction priors for challenging regions. Specifically, we introduce a simple data-level transformation, MixShuffle, which performs random convex combinations across spatial positions and spectral channels to generate training data with richer spatial structures and stronger spectral mixing. In parallel, we propose a hierarchical attention weighting mechanism, a loss-level gradient reallocation strategy at the sample, channel, and pixel levels, enabling the model to emphasize structurally complex regions. Extensive experiments on multiple benchmark datasets (WV3, GF2, QB) and across various network architectures demonstrate the strong generality and effectiveness of the proposed strategies, achieving state-of-the-art performance when integrated into DANet.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38216", "url": null, "sourceid": 36997, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38217, "uid": "71afb186ca8924414be94770dba45137", "name": "Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO", "authors": [{"id": 170550, "fullname": "JUNHAO CHENG", "url": "http://cvpr.thecvf.com/api/miniconf/users/170550?format=json", "institution": "City University of Hong Kong"}, {"id": 129317, "fullname": "Liang Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/129317?format=json", "institution": "Kuaishou Technology"}, {"id": 87541, "fullname": "Xin Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/87541?format=json", "institution": "Kuaishou"}, {"id": 86410, "fullname": "Jing Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86410?format=json", "institution": "City University of Hong Kong"}], "abstract": "While language models have become impactful in many real-world applications, video generation remains largely confined to entertainment. Motivated by video's inherent capacity to demonstrate physical-world information that is difficult to convey through language alone (e.g., imagine teaching someone to tie a tie using only text), we identify an underutilized opportunity to extend video as a new answer modality for Next-Event Prediction (NEP), formalized as **Video-Next-Event Prediction (VNEP)**. While the established NEP task takes a video with a procedural or predictive question as input to predict the next event in text, VNEP requires dynamic video responses. This shift from telling to showing unlocks more intuitive and customized answers for procedural learning and creative exploration. However, this task remains challenging for existing models, as it demands an understanding of multimodal input, instruction-conditioned reasoning, and the generation of video with visual and semantic consistency. To address this, we introduce **VANS**, a model that leverages reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. The core of VANS is our proposed **Joint-GRPO** that orchestrates the VLM and VDM to function as a unit. Driven by a shared reward on their respective output, it optimizes the VLM to produce captions that are both accurate and friendly to visualize, while guiding the VDM to generate videos that are faithful to these captions and the input visual context. To enable this learning, we craft **VANS-Data-100K**, a dedicated dataset for the VNEP task. Experiments on procedural and predictive benchmarks demonstrate that VANS achieves state-of-the-art performance in both video event prediction and visualization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38217", "url": null, "sourceid": 32876, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38222, "uid": "ae2772b21b613743d53e95ec3d6e7041", "name": "Human Geometry Distribution for 3D Animation Generation", "authors": [{"id": 150519, "fullname": "Xiangjun Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150519?format=json", "institution": "KAUST"}, {"id": 185521, "fullname": "Biao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185521?format=json", "institution": "KAUST"}, {"id": 75925, "fullname": "Peter Wonka", "url": "http://cvpr.thecvf.com/api/miniconf/users/75925?format=json", "institution": "KAUST"}], "abstract": "Generating realistic human geometry animations remains a challenging task, as it requires modeling natural clothing dynamics with fine-grained geometric details under limited data. To address these challenges, we propose two novel designs. First, we propose a compact distribution-based latent representation that enables efficient and high-quality geometry generation. We improve upon previous work by establishing a more uniform mapping between SMPL and avatar geometries. Second, we introduce a generative animation model that fully exploits the diversity of limited motion data. We focus on short-term transitions while maintaining long-term consistency through an identity-conditioned design. These two designs formulate our method as a two-stage framework: the first stage learns a latent space, while the second learns to generate animations within this latent space. We conducted experiments on both our latent space and animation model. We demonstrate that our latent space produces high-fidelity human geometry surpassing previous methods (90% lower Chamfer Dist.). The animation model synthesizes diverse animations with detailed and natural dynamics (2.2 x higher user study score), achieving the best results across all evaluation metrics.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38222", "url": null, "sourceid": 41371, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38225, "uid": "97bb6bebe0602bb94df8f8382b7df9c7", "name": "Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition", "authors": [{"id": 181564, "fullname": "Shengming Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/181564?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 189370, "fullname": "Zekai Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189370?format=json", "institution": "Peking University"}, {"id": 189371, "fullname": "Zecheng Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189371?format=json", "institution": "Soochow University"}, {"id": 189372, "fullname": "Kaiyuan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189372?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 189373, "fullname": "Xiao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189373?format=json", "institution": "Alibaba Group"}, {"id": 155387, "fullname": "Kun Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/155387?format=json", "institution": "Beihang University"}, {"id": 152834, "fullname": "Jiahao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152834?format=json", "institution": "Tencent"}, {"id": 189374, "fullname": "Yilei chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189374?format=json", "institution": null}, {"id": 189375, "fullname": "Yuxiang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189375?format=json", "institution": null}, {"id": 88593, "fullname": "Heung-Yeung Shum", "url": "http://cvpr.thecvf.com/api/miniconf/users/88593?format=json", "institution": "Microsoft"}, {"id": 189376, "fullname": "Lionel Ni", "url": "http://cvpr.thecvf.com/api/miniconf/users/189376?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)); Hong Kong University of Science and Technology"}, {"id": 185837, "fullname": "Junyang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185837?format=json", "institution": "Alibaba Group"}, {"id": 189377, "fullname": "Chenfei Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189377?format=json", "institution": "Microsoft"}], "abstract": "Recent visual generative models often struggle with consistency during image editing due to the entangled nature of raster images, where all visual content is fused into a single canvas. In contrast, professional design tools employ layered representations, allowing isolated edits while preserving consistency. Motivated by this, we propose Qwen-Image-Layered, an end-to-end diffusion model that decomposes a single RGB image into multiple semantically disentangled RGBA layers, enabling inherent editability, where each RGBA layer can be independently manipulated without affecting other content. To support variable-length decomposition, we introduce three key components: (1) an RGBA-VAE to unify the latent representations of RGB and RGBA images; (2) a VLD-MMDiT (Variable Layers Decomposition MMDiT) architecture capable of decomposing a variable number of image layers; and (3) a Multi-stage Training strategy to adapt a pretrained image generation model into a multilayer image decomposer. Furthermore, to address the scarcity of high-quality multilayer training images, we build a pipeline to extract and annotate multilayer images from Photoshop documents (PSD). Experiments demonstrate that our method significantly surpasses existing approaches in decomposition quality and establishes a new paradigm for consistent image editing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38225", "url": null, "sourceid": 46420, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38226, "uid": "c6ef94de41b0abd7509a937022950f8b", "name": "DreamingComics: A Story Visualization Pipeline via Subject and Layout Customized Generation using Video Models", "authors": [{"id": 181893, "fullname": "Patrick Kwon", "url": "http://cvpr.thecvf.com/api/miniconf/users/181893?format=json", "institution": "University of Central Florida"}, {"id": 73542, "fullname": "Chen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/73542?format=json", "institution": "University of Central Florida"}], "abstract": "Current story visualization methods tend to position subjects solely by text and face challenges in maintaining artistic consistency. To address these limitations, we introduce DreamingComics, a layout-aware story visualization framework. We build upon a pretrained video diffusion-transformer (DiT) model, leveraging its spatiotemporal priors to enhance identity and style consistency. For layout-based position control, we propose RegionalRoPE, a region-aware positional encoding scheme that re-indexes embeddings based on the target layout. Additionally, we introduce a masked condition loss to further constrain each subject's visual features to their designated region. To infer layouts from natural language scripts, we integrate an LLM-based layout generator trained to produce comic-style layouts, enabling flexible and controllable layout conditioning. We present a comprehensive evaluation of our approach, showing a 29.2\\% increase in character consistency and 36.2\\% increase in style similarity compared to previous methods, while displaying high spatial accuracy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38226", "url": null, "sourceid": 33558, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38228, "uid": "fa59dcd8a934426f4894d1f6d87e698d", "name": "AnimaMimic: Imitating 3D Animation from Video Priors", "authors": [{"id": 132167, "fullname": "Tianyi Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/132167?format=json", "institution": "University of California, Los Angeles"}, {"id": 189381, "fullname": "Yunuo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189381?format=json", "institution": "University of California, Los Angeles"}, {"id": 127624, "fullname": "Yaowei Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/127624?format=json", "institution": "Zhejiang University"}, {"id": 152336, "fullname": "Yin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152336?format=json", "institution": "University of Utah"}, {"id": 89955, "fullname": "Bolei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/89955?format=json", "institution": "University of California, Los Angeles"}, {"id": 106975, "fullname": "Demetri Terzopoulos", "url": "http://cvpr.thecvf.com/api/miniconf/users/106975?format=json", "institution": "University of California, Los Angeles"}, {"id": 136657, "fullname": "Ying Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/136657?format=json", "institution": "University of California, Los Angeles"}, {"id": 150889, "fullname": "Chenfanfu Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150889?format=json", "institution": "University of California, Los Angeles"}], "abstract": "Creating realistic 3D animation remains a time-consuming and expertise-dependent process, requiring manual rigging, keyframing, and fine-tuning of complex motions. Meanwhile, video diffusion models have recently demonstrated remarkable motion imagination in 2D, generating dynamic and visually coherent motion from text or image prompts. However, their results lack explicit 3D structure and cannot be directly used for animation or simulation. We present AnimaMimic, a framework that animates static 3D meshes using motion priors learned from video diffusion models. Starting from an input mesh, AnimaMimic synthesizes a monocular animation video, automatically constructs a skeleton with skinning weights, and refines joint parameters through differentiable rendering and video-based supervision. To further enhance realism, we integrate a differentiable simulation module that refines mesh deformation through physically grounded soft-tissue dynamics. Our method bridges the creativity of video diffusion and the structural control of 3D rigged animation, producing physically plausible, temporally coherent, and artist-editable motion sequences that integrate seamlessly into standard animation pipelines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38228", "url": null, "sourceid": 38348, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38229, "uid": "2d900e04fb7fe731dd78384d431be953", "name": "Unsupervised 3d Motion Estimation Using Event Camera", "authors": [{"id": 145004, "fullname": "Han Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/145004?format=json", "institution": "University of Science and Technology of China"}, {"id": 86247, "fullname": "Wei Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/86247?format=json", "institution": "University of Science and Technology of China"}, {"id": 153911, "fullname": "Tiesong Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153911?format=json", "institution": "Fuzhou University"}, {"id": 88935, "fullname": "Bin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88935?format=json", "institution": "University of Science and Technology of China"}, {"id": 86250, "fullname": "Yang Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86250?format=json", "institution": "University of Science and Technology of China"}, {"id": 86637, "fullname": "Zheng-Jun Zha", "url": "http://cvpr.thecvf.com/api/miniconf/users/86637?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Estimating the 3D motion of scene points from 2D observations, typically parameterized by optical flow and motion in depth, is a fundamental problem in computer vision. Existing learning-based methods usually rely on supervised regression from densely labeled data, but their dependence on annotations and limited use of geometric constraints restricts generalization, motivating unsupervised solutions. Unsupervised 3D motion estimation is challenging because motion along the viewing direction is unobservable, and optical flow and motion in depth are geometrically coupled, making their separation ambiguous. Event cameras capture per-pixel brightness changes asynchronously with microsecond latency, providing high temporal resolution and motion continuity. Projecting event streams along different axes reveals spatiotemporal expansion and contraction patterns that encode depth variation and geometric structure, offering rich cues for unsupervised estimation. Leveraging these properties, we propose an unsupervised event-based 3D motion estimation framework that jointly models optical flow and motion in depth. We first derive an analytical relationship to infer initial motion in depth from estimated flow and further refine it using a directional expansion modulation module that captures horizontal and vertical expansion\u2013contraction patterns in event projections. Finally, motion in depth is incorporated into optical flow warping under a contrast maximization objective. Experiments on the CarlaEvent3D dataset show that our method achieves competitive accuracy and strong generalization, advancing unsupervised 3D motion estimation in the event domain.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38229", "url": null, "sourceid": 31900, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38230, "uid": "91dc5b777ae38db7cb3b26cf7c42294b", "name": "MoVie: Broaden Your Views with Human Motion for Action Detection", "authors": [{"id": 189382, "fullname": "Di Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189382?format=json", "institution": "University of Science and Technology of China"}, {"id": 180295, "fullname": "Mahmoud Ahmed Mohamed ALI", "url": "http://cvpr.thecvf.com/api/miniconf/users/180295?format=json", "institution": "INRIA"}, {"id": 145915, "fullname": "Xuanlong Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/145915?format=json", "institution": "Intellindust"}, {"id": 87658, "fullname": "Xi Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87658?format=json", "institution": "Tencent AI Lab"}, {"id": 85099, "fullname": "Quan Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/85099?format=json", "institution": "Woven by Toyota"}, {"id": 152971, "fullname": "Gianpiero Francesca", "url": "http://cvpr.thecvf.com/api/miniconf/users/152971?format=json", "institution": "Toyota Motor Europe"}, {"id": 76930, "fullname": "Francois Bremond", "url": "http://cvpr.thecvf.com/api/miniconf/users/76930?format=json", "institution": "inria"}], "abstract": "Human action detection in videos requires both semantic recognition and accurate modeling of motion. While recent video foundation models have advanced visual semantics, they still struggle to capture complex and compositional actions due to the limited representation ability of motion. Human skeleton sequences, which explicitly describe the body structure and movement, provide valuable physical and geometric motions that complement RGB videos. However, combining video and skeleton modalities faces two key challenges: (i)  label-driven skeleton features are too coarse to describe fine-grained motion, and (ii) skeleton motion and RGB video lie in heterogeneous feature spaces, so current fusion strategies often cause feature interference. To address these, we propose MoVie, a unified Motion-Video processing framework that uses structured human motion as a bridge between the two signals. We first propose a Structural Motion Projection module that decomposes motion into primitive components using a learnable motion dictionary, to produce fine-grained descriptors. Then, we design a Motion-guided Feature Regularization mechanism that aligns visual features with motion through an orthogonality-based transformation, so that fine-grained motion cues can guide visual representations without collapsing semantic diversity. Extensive evaluations on Toyota Smarthome Untrimmed, Charades, Multi-THUMOS and PKU-MMD datasets demonstrate that MoVie significantly improves state-of-the-art action detection performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38230", "url": null, "sourceid": 33158, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38231, "uid": "893cba398418b8cfd0fe1a08e83f6224", "name": "Align Images Before You Generate", "authors": [{"id": 129134, "fullname": "Shihua Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129134?format=json", "institution": "Wuhan University"}, {"id": 189383, "fullname": "Qiuhong Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189383?format=json", "institution": "National University of Singapore"}, {"id": 87323, "fullname": "Xinchao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87323?format=json", "institution": "National University of Singapore"}], "abstract": "Multi-image diffusion models can generate images like multi-views or videos to describe static or dynamic scenes, yet texture and structure drift persist, severely undermining the spatiotemporal consistency. Addressing this issue remains challenging, especially without any external geometric or semantic priors during the pure generative inference. In this paper, we introduce CorrAdapter, a plug-and-play adapter that discovers and exploits an innate property of the multi-image diffusion itself, aligning all output images before they are in fact generated. Specifically, CorrAdapter designs a bypass branch for transformer blocks in the multi-image diffusion model, encompassing a native correspondence constructor that builds reliable correspondences from the diffusion model's intermediate features, and an aligned area aggregator that integrates messages from only matching regions to avoid ambiguous information interactions. Given the native correspondences as guidance, CorrAdapter can enhance spatiotemporal consistency without any auxiliary inputs, and remains training-free and baseline-agnostic, which enables it to generalize seamlessly to various generation tasks. Additionally, we provide an optional training scheme to explore further-improved possibilities. Experiments on both static multi-view generation and dynamic video generation show that CorrAdapter consistently improves spatiotemporal consistency and perceptual quality over strong baselines, offering a simple yet versatile drop-in approach to geometrically faithful multi-image diffusion.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38231", "url": null, "sourceid": 41084, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38232, "uid": "5429ee7f1bb3ae94e7042acad2375136", "name": "Generalized-CVO: Fast and Correspondence-Free Point Cloud Registration in RKHS with Second Order Riemannian Optimization", "authors": [{"id": 183479, "fullname": "Ray Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183479?format=json", "institution": "Toyota Research Institute"}, {"id": 182694, "fullname": "Carl Greiff", "url": "http://cvpr.thecvf.com/api/miniconf/users/182694?format=json", "institution": "Toyota Research Institute"}, {"id": 189384, "fullname": "Thomas Lew", "url": "http://cvpr.thecvf.com/api/miniconf/users/189384?format=json", "institution": null}, {"id": 189385, "fullname": "John Subosits", "url": "http://cvpr.thecvf.com/api/miniconf/users/189385?format=json", "institution": "Toyota Research Institute"}], "abstract": "We propose a fast and correspondence-free point cloud registration method that leverages local geometric surface structure and reproducing kernel Hilbert space (RKHS) embeddings. The proposed method represents point clouds as continuous functions with point-wise anisotropic kernels that encode local geometry. This formulation improves alignment along surface normals while relaxing alignment along tangential directions. To solve the resulting registration problem, we propose a second-order on-manifold optimization scheme with approximate Riemannian Hessians, achieving a speedup of up to 10x over the first-order methods used in prior correspondence-free RKHS-based methods. We demonstrate improved frame-to-frame LiDAR and RGB-D tracking accuracy across diverse indoor and outdoor datasets. On a LiDAR registration task in the driving domain, we achieve a reduction of $>55\\%$ in both translational and rotational drift in challenging feature-sparse environments.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38232", "url": null, "sourceid": 37178, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38235, "uid": "59a3e5c5684f2219aeba5934fc50e8bb", "name": "AdaDexTrack: Dynamic Modulation for Adaptive and Generalizable Dexterous Manipulation Tracking", "authors": [{"id": 180230, "fullname": "Jianibieke Adalibieke", "url": "http://cvpr.thecvf.com/api/miniconf/users/180230?format=json", "institution": "Shanghai Qi Zhi Institute"}, {"id": 189390, "fullname": "Qianwei Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/189390?format=json", "institution": "Shanghai Qi Zhi Institute"}, {"id": 132599, "fullname": "Xueyi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/132599?format=json", "institution": "Tsinghua University"}, {"id": 128795, "fullname": "Yuzhe Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/128795?format=json", "institution": "University of California, San Diego, University of California, San Diego"}, {"id": 73517, "fullname": "Li Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/73517?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Language is a natural way to command robots, but converting a single instruction into a long-horizon, contact-rich hand\u2013object interaction remains challenging: synthesized references are noisy, human-to-robot retargeting introduces embodiment bias, and fixed-reference tracking lets small errors snowball. We address this with AdaDexTrack, a modulator-in-the-loop framework for language-conditioned manipulation tracking. A distilled generalist tracker serves as the skill carrier, while a tightly aligned modulator performs three feedback corrections: reference modulation (continual adjustment of what to track), object-latent modulation (online adaptation of the object representation to recruit suitable skills), and positional-target modulation (small state-dependent refinements for execution). The tracker is learned via large-scale specialist to generalist distillation on a corpus of language-conditioned hand\u2013object trajectories; the modulator is trained with RL under the same task objective, ensuring tight coupling. Across large-scale evaluations, AdaDexTrack consistently outperforms prior SOTA on unseen-trajectory and unseen-object sets in both average tracking error and success rate, demonstrating robustness and generalization. We further show zero-shot sim-to-real transfer on real hardware, where adding the modulator yields substantial gains over a tracker-only variant. AdaDexTrack reframes language-conditioned dexterous manipulation as modulated tracking, replacing the open-loop, fixed-reference tracking with in-loop modulation that adjusts the reference, object latent, and positional target, yielding drift-resistant execution from noisy text references.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38235", "url": null, "sourceid": 35545, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38236, "uid": "83b63c5362fd392b81c8136c6919ed36", "name": "Dynamic Momentum Recalibration in Online Gradient Learning", "authors": [{"id": 180324, "fullname": "Zhipeng Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180324?format=json", "institution": "Shenyang University of Chemical Technology"}, {"id": 157863, "fullname": "Rui Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157863?format=json", "institution": "University of Louisville"}, {"id": 189391, "fullname": "Guisong Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189391?format=json", "institution": "Northeastern University"}, {"id": 189392, "fullname": "Ying Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189392?format=json", "institution": null}, {"id": 189393, "fullname": "Yu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189393?format=json", "institution": "Shenyang University of Chemical Technology"}, {"id": 180327, "fullname": "Dazhou Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180327?format=json", "institution": "Shenyang University of Chemical Technology"}], "abstract": "Stochastic Gradient Descent (SGD) and its momentum-driven variants form the backbone of deep learning optimization, yet the underlying dynamics of their gradient behavior remain insufficiently understood. In this work, we reinterpret gradient updates through the lens of signal processing and reveal that fixed momentum coefficients inherently distort the balance between bias and variance, leading to skewed or suboptimal parameter updates. To address this, we propose SGDF (SGD with Filter), an optimizer inspired by the principles of Optimal Linear Filtering. SGDF computes an online, time-varying gain to dynamically refine gradient estimation by minimizing the mean-squared error, thereby achieving an optimal trade-off between noise suppression and signal preservation. Furthermore, our approach could extend to adaptive optimizers, enhancing their generalization potential. Extensive experiments across diverse architectures and benchmarks demonstrate that SGDF outperforms conventional momentum-based methods and achieves performance on par with or surpassing state-of-the-art optimizers.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38236", "url": null, "sourceid": 37637, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38237, "uid": "a10765ead0373244d0b92935ed504753", "name": "FUSER: Feed-Forward Multiview 3D Registration Transformer and SE(3)$^N$ Diffusion Refinement", "authors": [{"id": 86987, "fullname": "Haobo Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86987?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 137969, "fullname": "Jin Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/137969?format=json", "institution": "Nanjing University"}, {"id": 86573, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86573?format=json", "institution": "Nankai University"}, {"id": 152259, "fullname": "Liang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152259?format=json", "institution": "Alibaba Group"}, {"id": 152258, "fullname": "Jianmin Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/152258?format=json", "institution": "Nanyang Technological University"}], "abstract": "Registration of multiview point clouds typically depends on extensive pairwise matching to build a pose graph for global synchronization, which is computationally expensive and ill-posed without holistic geometric constraints. In this paper, we propose FUSER, the first feed-forward multi-view registration transformer that processes all scans jointly in a unified, compact latent space to directly predict global poses without any pairwise estimation. To maintain tractability, FUSER employs a sparse 3D CNN to encode each scan into  low-resolution superpoint features preserving absolute translation cues, followed by a Geometric Alternating Attention module for efficient intra- and inter-scan reasoning. Particularly, we transfer 2D attention priors from off-the-shelf foundation models (i.e., $\\pi^3$) to enhance 3D feature attention. Building upon FUSER and its estimates, we further introduce FUSER-DF, an SE(3) diffusion refinement framework to correct  FUSER's estimates through a denoising process over the joint SE(3)$^N$ space. Here, FUSER serves as a surrogate multiview register to model the denoiser, and a prior-conditioned SE(3)$^N$ variational lower bound is derived for denoising supervision. Extensive experiments on 3DMatch and ScanNet confirm the superior registration accuracy and efficiency of our method.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38237", "url": null, "sourceid": 41275, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40330?format=json"], "related_events_ids": [40330]}, {"id": 40330, "uid": "a10765ead0373244d0b92935ed504753", "name": "FUSER: Feed-Forward Multiview 3D Registration Transformer and SE(3)$^N$ Diffusion Refinement", "authors": [{"id": 86987, "fullname": "Haobo Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86987?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 137969, "fullname": "Jin Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/137969?format=json", "institution": "Nanjing University"}, {"id": 86573, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86573?format=json", "institution": "Nankai University"}, {"id": 152259, "fullname": "Liang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152259?format=json", "institution": "Alibaba Group"}, {"id": 152258, "fullname": "Jianmin Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/152258?format=json", "institution": "Nanyang Technological University"}], "abstract": "Registration of multiview point clouds typically depends on extensive pairwise matching to build a pose graph for global synchronization, which is computationally expensive and ill-posed without holistic geometric constraints. In this paper, we propose FUSER, the first feed-forward multi-view registration transformer that processes all scans jointly in a unified, compact latent space to directly predict global poses without any pairwise estimation. To maintain tractability, FUSER employs a sparse 3D CNN to encode each scan into  low-resolution superpoint features preserving absolute translation cues, followed by a Geometric Alternating Attention module for efficient intra- and inter-scan reasoning. Particularly, we transfer 2D attention priors from off-the-shelf foundation models (i.e., $\\pi^3$) to enhance 3D feature attention. Building upon FUSER and its estimates, we further introduce FUSER-DF, an SE(3) diffusion refinement framework to correct  FUSER's estimates through a denoising process over the joint SE(3)$^N$ space. Here, FUSER serves as a surrogate multiview register to model the denoiser, and a prior-conditioned SE(3)$^N$ variational lower bound is derived for denoising supervision. Extensive experiments on 3DMatch and ScanNet confirm the superior registration accuracy and efficiency of our method.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40330", "url": null, "sourceid": -41275, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38237?format=json"], "related_events_ids": [38237]}, {"id": 38239, "uid": "50fb5ad5d9eb3bb648e6c8d18453a7f7", "name": "Image Generation from Contextually-Contradictory Prompts", "authors": [{"id": 182073, "fullname": "Saar Huberman", "url": "http://cvpr.thecvf.com/api/miniconf/users/182073?format=json", "institution": "Tel Aviv University"}, {"id": 88548, "fullname": "Or Patashnik", "url": "http://cvpr.thecvf.com/api/miniconf/users/88548?format=json", "institution": "Tel Aviv University"}, {"id": 189400, "fullname": "Omer Dahary", "url": "http://cvpr.thecvf.com/api/miniconf/users/189400?format=json", "institution": "Snap Inc.; Tel Aviv University"}, {"id": 88328, "fullname": "Ron Mokady", "url": "http://cvpr.thecvf.com/api/miniconf/users/88328?format=json", "institution": "Tel Aviv University"}, {"id": 87616, "fullname": "Daniel Cohen-Or", "url": "http://cvpr.thecvf.com/api/miniconf/users/87616?format=json", "institution": "Google"}], "abstract": "Text-to-image diffusion models excel at generating high-quality, diverse images from natural language prompts. However, they often fail to produce semantically accurate results when the prompt contains concept combinations that contradict their learned priors. We define this failure mode as contextual contradiction, where one concept implicitly negates another due to entangled associations learned during training. To address this, we propose a stage-aware prompt decomposition framework that guides the denoising process using a sequence of proxy prompts. Each proxy prompt is constructed to match the semantic content expected to emerge at a specific stage of denoising, while ensuring contextual coherence. To construct these proxy prompts, we leverage a large language model (LLM) to analyze the target prompt, identify contradictions, and generate alternative expressions that preserve the original intent while resolving contextual conflicts. By aligning prompt information with the denoising progression, our method enables fine-grained semantic control and accurate image generation in the presence of contextual contradictions. Experiments across a variety of challenging prompts show substantial improvements in alignment to the textual prompt.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38239", "url": null, "sourceid": 39757, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38242, "uid": "20fcec0873b39b4df3df34140d77d6e7", "name": "Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers", "authors": [{"id": 161560, "fullname": "Youngjun Jun", "url": "http://cvpr.thecvf.com/api/miniconf/users/161560?format=json", "institution": "Yonsei University"}, {"id": 101322, "fullname": "seil kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/101322?format=json", "institution": "yonsei university"}, {"id": 128752, "fullname": "Woojung Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/128752?format=json", "institution": "Yonsei University"}, {"id": 107168, "fullname": "Seong Jae Hwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107168?format=json", "institution": "Yonsei University"}], "abstract": "Video Diffusion Transformers (DiTs) have been synthesizing high-quality video with high fidelity to text descriptions involving motion.  However, the understanding of how Video DiTs convert motion words into video remains lagging behind. Furthermore, prior studies on interpretable saliency maps primarily target objects, leaving it behind to observe how Video DiTs behave with respect to motion. In this paper, we inquire into concrete motion features that specify which object moves and at what time for a given motion concept. First, to spatially localize, we introduce GramCol, which adaptively renders per-frame saliency maps for any text concept, including both motion and non-motion. Second, we propose an automatic motion-feature selecting algorithm to obtain an Interpretable Motion-Attentive Map (IMAP) that localizes motions spatially and temporally. Our methods discover concept saliency maps without the need for any gradient-based training or parameters. Experimentally, our methods show standout localization capability in the motion localization task and zero-shot video semantic segmentation, providing interpretable and clearer saliency maps for both motion and non-motion concepts.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38242", "url": null, "sourceid": 39837, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38245, "uid": "590823fe0eb3a5ebe69d84a116624394", "name": "Dynamic Important Example Mining for Reinforcement Finetuning", "authors": [{"id": 128162, "fullname": "Haoru Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/128162?format=json", "institution": "HKU"}, {"id": 128180, "fullname": "WU Sitong", "url": "http://cvpr.thecvf.com/api/miniconf/users/128180?format=json", "institution": "Department of Computer Science and Engineering, The Chinese University of Hong Kong"}, {"id": 189414, "fullname": "Yanfeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189414?format=json", "institution": "Tencent"}, {"id": 156034, "fullname": "Shizhen Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/156034?format=json", "institution": "The University of Hong Kong,"}, {"id": 102107, "fullname": "Yangtian Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/102107?format=json", "institution": "University of Hong Kong"}, {"id": 189415, "fullname": "Tianjia Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189415?format=json", "institution": ", University of Hong Kong"}, {"id": 76914, "fullname": "Chirui Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76914?format=json", "institution": "University of Hong Kong"}, {"id": 181212, "fullname": "Shaofeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181212?format=json", "institution": "University of Science and Technology of China"}, {"id": 154025, "fullname": "Xingwu Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/154025?format=json", "institution": "Tencent AI Platform"}, {"id": 184902, "fullname": "Xiuzhe Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184902?format=json", "institution": null}, {"id": 154023, "fullname": "Ruobing Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/154023?format=json", "institution": "Tencent"}, {"id": 88776, "fullname": "Xiaojuan Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/88776?format=json", "institution": "University of Oxford"}], "abstract": "Reinforcement fine-tuning (RFT) is increasingly used to strengthen the reasoning abilities of large models, yet its effectiveness is bounded by how training data are selected and used. Most data-centric RFT methods rely on static or heuristic sample selection, implicitly assuming a sample\u2019s value is fixed over training. This overlooks the non-stationary dynamics of policy learning and can lead to suboptimal updates.We propose **Dynamic Important Example Mining (DIEM)**, a principled and fully automated framework that makes data utilization adaptive throughout RFT. DIEM integrates two components into each optimization step: (i) a gradient-alignment importance estimator that efficiently approximates each sample\u2019s marginal contribution to policy improvement; and (ii) a constrained batch reweighting scheme that maximizes aggregate utility while preserving the update\u2019s gradient magnitude to stabilize optimization. This converts data selection from a one-time preprocessing heuristic into an intrinsic part of the learning algorithm, yielding a self-organizing, curriculum-like training trajectory driven by model dynamics rather than external scores.Across several multimodal reasoning benchmarks, DIEM consistently outperforms strong static and dynamic baselines, providing a significant performance uplift to the base RFT algorithm of approximately **1%** to **6%**, while introducing only a minimal **1.2%** training overhead.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38245", "url": null, "sourceid": 42607, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38247, "uid": "8f90930944ee00a0e740f9f3bdf289c3", "name": "BuildingGPT: Auto-Regressive Building Wireframe Reconstruction Model with Reinforcement Learning", "authors": [{"id": 98504, "fullname": "Yuzhou Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/98504?format=json", "institution": "Institute of automation, Chinese academy of science"}, {"id": 153357, "fullname": "Lingjie Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153357?format=json", "institution": "Cenozoic Robotics"}, {"id": 153358, "fullname": "Hanqiao Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/153358?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 155876, "fullname": "Yujun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155876?format=json", "institution": "Shenzhen University"}, {"id": 106671, "fullname": "Shangfeng Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/106671?format=json", "institution": "University of Calgary"}, {"id": 153359, "fullname": "Xiang Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153359?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 150979, "fullname": "Ruisheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150979?format=json", "institution": "University of Calgary"}, {"id": 129136, "fullname": "Shuhan Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/129136?format=json", "institution": "Institute of automation, Chinese academy of science"}], "abstract": "In this paper, we propose BuildingGPT, a novel auto-regressive model for building wireframe reconstruction from point clouds with reinforcement learning.Unlike prior works based on detection or diffusion models, BuildingGPT reformulates the building wireframe reconstruction task into a sequence prediction problem.Based on the hierarchical building wireframe tokenization, the wireframe sequences are organized in a structurally- and semantically-aware order for the next-token prediction.The point cloud encoder first transforms the input point cloud into a fixed-length latent code  that serves as the starting of the sequence.Then, BuildingGPT auto-regressively predicts tokens conditioned on the latent code and previously generated tokens.With token sequence predicted, the building wireframe is obtained through detokenization.To enhance the model performance, we adopt a two-stage training paradigm including the pre-training and post-training.After the auto-regressive pre-training, Direct Preference Optimization (DPO) is employed as a post-training strategy to align reconstruction results with human preferences.Extensive experiments on the large-scale MunichWF dataset show that BuildingGPT outperforms existing state-of-the-art methods.We commit to release the code and dataset.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38247", "url": null, "sourceid": 36818, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38250, "uid": "eea4b36884e7e6ef7ecfc146a7d8f827", "name": "VES-RFT: Rewarding Visual Evidence Sensitivity to Mitigate Hallucinations in Large Vision\u2013Language Models", "authors": [{"id": 180608, "fullname": "XUEGE HOU", "url": "http://cvpr.thecvf.com/api/miniconf/users/180608?format=json", "institution": "Tsinghua University"}, {"id": 154101, "fullname": "Wenshuo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/154101?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 88509, "fullname": "Yali Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88509?format=json", "institution": "Tsinghua University"}, {"id": 189422, "fullname": "Han Shu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189422?format=json", "institution": "Huawei Technologies"}, {"id": 102550, "fullname": "Yuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/102550?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 154103, "fullname": "Xinghao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/154103?format=json", "institution": "Huawei Noah&#x27;s Ark Lab"}, {"id": 88500, "fullname": "Shengjin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88500?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Vision-Language Models (VLMs) often over-rely on linguistic priors even when images are provided, leading to object hallucinations. We revisit object-wise hallucination from the perspective of how visual evidence shapes the model\u2019s uncertainty. For each input, we measure decision uncertainty with and without the image, and define a Visual Evidence Sensitivity (VES) signal as the image-attributable change in entropy. Building on this signal, we introduce Visual Evidence Sensitivity Reinforcement Fine-Tuning (VES-RFT), a training-time reinforcement fine-tuning method that explicitly rewards reliance on correct visual evidence. We pair this continuous, annotation-free signal with a verifiable reward that enforces factual object correctness by automatically checking generated object mentions against the image, yielding a computable objective without human annotations. We optimize the dual objective using critic-free GRPO with KL regularization, requiring only parallel image and no-image passes during training while preserving single-pass inference. Across multiple VLM families and benchmarks, VES-RFT consistently suppresses hallucinations and improves robustness under ambiguity without degrading general language ability. Specifically, on LLaVA-7B, VES-RFT reduces 12.8 and 1.8 on CHAIR$_S$/CHAIR$_I$ of MS-COCO, and increases POPE accuracy by 4.92\\%. Extensive experiments indicate that turning uncertainty into a learnable reward, paired with verifiable correctness signals, provides a scalable mechanism for training-time hallucination mitigation and stronger visual grounding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38250", "url": null, "sourceid": 39102, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38254, "uid": "fe9fb47fd2333245796f4348f44d60dc", "name": "Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding", "authors": [{"id": 180617, "fullname": "Wang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/180617?format=json", "institution": "Xiamen University"}, {"id": 146415, "fullname": "Yuhui zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/146415?format=json", "institution": "Xiamen University"}, {"id": 153680, "fullname": "Yongdong Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/153680?format=json", "institution": "Xiamen University"}, {"id": 189434, "fullname": "Tianyu Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/189434?format=json", "institution": "Xiamen University"}, {"id": 184415, "fullname": "Luojun Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184415?format=json", "institution": "Fuzhou University"}, {"id": 131760, "fullname": "Jiayi Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/131760?format=json", "institution": "Xiamen University"}, {"id": 152812, "fullname": "Yan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152812?format=json", "institution": "Xiamen University"}, {"id": 128362, "fullname": "Xiawu Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/128362?format=json", "institution": "Xiamen University"}], "abstract": "Frame selectoin is crucial due to high frame redundancy and limited context windows when applying Large Vision-Language Models (LVLMs) to long videos. Current methods typically select frames with high relevance to a given query, resulting a disjointed set of frames that disregard the narrative structure of video. In this paper, we introduce $\\textbf{W}$avelet-based $\\textbf{F}$rame $\\textbf{S}$election by Detecting $\\textbf{S}$emantic $\\textbf{B}$oundary ($\\textbf{WFS-SB}$), a training-free framework that presents a new perspective: effective video understanding hinges not only on high relevance but, more importantly, on capturing semantic shifts\u2014pivotal moments of narrative change that are essential to comprehending the holistic storyline of video. However, a direct detection of abrupt changes in the query-frame similarity signal is often unreliable due to high-frequency noise arising from model uncertainty and transient visual variations.To address this, we leverage the wavelet transform, which provides an ideal solution through its multi-resolution analysis in both time and frequency domains. By applying this transform, we decompose the noisy signal into multiple scales and extract a clean semantic change signal from the coarsest scale. We identify the local extrema of this signal as semantic boundaries, which segment the video into coherent clips. Building on this, WFS-SB comprises a two-stage strategy: first, adaptively allocating a frame budget to each clip based on a composite importance score; and second, within each clip, employing the Maximal Marginal Relevance approach to select a diverse yet relevant set of frames. Extensive experiments show that WFS-SB significantly boosts LVLM performance, e.g., improving accuracy by $\\textbf{5.5\\\\% on VideoMME, 9.5\\\\% on MLVU, and 6.2\\\\% on LongVideoBench}$, consistently outperforming state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38254", "url": null, "sourceid": 44578, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38257, "uid": "862090b94c4637688088941f122041df", "name": "Information-Theoretic Decomposition for Multimodal Interaction Learning", "authors": [{"id": 156831, "fullname": "Zequn Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156831?format=json", "institution": "Renmin University of China"}, {"id": 127912, "fullname": "Yake Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/127912?format=json", "institution": "Renmin University of China"}, {"id": 189439, "fullname": "HaoTian Ni", "url": "http://cvpr.thecvf.com/api/miniconf/users/189439?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 189440, "fullname": "Zhihao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189440?format=json", "institution": "gaotu"}, {"id": 127885, "fullname": "Di Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127885?format=json", "institution": "Renmin University of China"}], "abstract": "Multimodal learning hinges on capturing redundant, unique, and synergistic information across modalities, which collectively constitute multimodal interactions. A critical yet underexplored challenge is that these implicit interactions vary dynamically across samples. In this work, we present the first systematic, information-theoretic analysis highlighting why learning these dynamic, sample-specific interactions is critical for effective multimodal learning. Our analysis further reveals deficits in conventional paradigms at learning these distinct interaction types: modality ensemble approaches struggle to capture synergy, while joint learning paradigms often under-utilize redundant information. This highlights the need for an approach that can adaptively learn from different interaction types on a per-sample basis. To this end, we propose Decomposition-based Multimodal Interaction Learning (DMIL), a novel paradigm that explicitly models and learns from sample-specific interactions. First, we design a variational decomposition architecture to isolate the constituent interaction components. Second, we employ a new learning strategy that leverages these explicit interaction components in a fine-tuning process to achieve comprehensive interaction learning. Extensive experiments across diverse tasks and architectures demonstrate that DMIL consistently achieves superior performance by adapting to holistic sample-specific interactions. Our framework is flexible and broadly applicable, establishing an interaction-centric paradigm for multimodal learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38257", "url": null, "sourceid": 39380, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38259, "uid": "6141f07ed643366dfb6b5346be38b176", "name": "Towards Persistence: Learning Topological Constraints for Event-based Small Object Detection", "authors": [{"id": 183282, "fullname": "Shiman He", "url": "http://cvpr.thecvf.com/api/miniconf/users/183282?format=json", "institution": "College of Electronic Science and Technology, National University of Defense Technology"}, {"id": 91307, "fullname": "Nuo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/91307?format=json", "institution": "National University of Defense Technology"}, {"id": 71947, "fullname": "Xinyi Ying", "url": "http://cvpr.thecvf.com/api/miniconf/users/71947?format=json", "institution": "National University of Defence Technology"}, {"id": 189441, "fullname": "Yihang Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189441?format=json", "institution": "National University of Defense Technology"}, {"id": 189442, "fullname": "Yangsi Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189442?format=json", "institution": "National University of Defense Technology"}, {"id": 151076, "fullname": "Zaiping Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/151076?format=json", "institution": "National University of Defense Technology"}, {"id": 151071, "fullname": "Miao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/151071?format=json", "institution": "National University of Defense Technology"}], "abstract": "Small object detection (SOD) plays a vital role in applications such as anti-UAV tasks,  yet conventional image-based methods struggle in high-speed scenarios due to the limited frame rate. Event cameras offer a promising alternative by capturing spatiotemporal event streams with microsecond-level temporal resolution. To address the inherent sparsity of small objects in event data, existing methods typically formulate the detection task as semantic segmentation on spatiotemporal point clouds to leverage long-term contextual information. However, these methods often fail to enforce effective spatiotemporal consistency constraints, resulting in fragmented object trajectories. To mitigate these problems, we propose a topology-constrained sparse convolutional network (SpTopoNet), which models the topological structure of moving object trajectories in event point clouds. Our network comprises two key components: a Topology Learning Module (TLM) that discriminates local structures to separate genuine targets from noise, and a Spatial Consistency Module (SCM) that captures long-range spatiotemporal dependencies to enhance trajectory continuity. Additionally, we introduce an event topology-aware loss function that leverages topological correlations to guide the network to maintain structural integrity of target event patterns.Experiments on the benchmark dataset demonstrate the superiority of our method in both detection performance and trajectory completeness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38259", "url": null, "sourceid": 43791, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38322, "uid": "14aa62c2835528477b121f5fa0a2d7d7", "name": "Balanced Dataset Distillation via Modeling Multiple Visual Pattern Distribution", "authors": [{"id": 171357, "fullname": "Guanghui Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/171357?format=json", "institution": "Xidian University"}, {"id": 182920, "fullname": "Xuefeng Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182920?format=json", "institution": "Xidian University"}, {"id": 189602, "fullname": "Qixiang Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189602?format=json", "institution": "Xidian University"}], "abstract": "Dataset Distillation (DD) aims to compress large-scale datasets into a small number of condensed Images Per Class (IPC), enabling efficient network training. Previous core-set selection and synthetic-based DD methods achieve reasonable performance. However, our in-depth investigation reveals that existing methods share a common issue: pattern imbalance. Specifically, they either overemphasize class-general patterns representing the majority of each class or focus on fewer marginal patterns critical for model generalization. To address this issue, we propose a novel framework, Balanced Patterns Selection (BPS). Unlike prior methods that assume each class forms a single cluster, BPS models the multiple visual pattern distribution within each class via a hierarchical semantic structure inherent to the dataset. It then selects two complementary subsets in a balanced manner from the center (class-general patterns) and the margins (marginal patterns) of each pattern, producing a pattern-balanced coreset. Theoretically, we prove that the BPS-selected coreset aligns with the original dataset in both distribution and performance. Moreover, its model-agnostic selection nature ensures cross-architecture generalization, while the Optimize-Once-for-All-IPCs property guarantees efficiency. Extensive experiments on four benchmarks demonstrate that BPS significantly outperforms existing state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38322", "url": null, "sourceid": 37933, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38260, "uid": "3715e8a79b3b9622c179617533e5654f", "name": "Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers", "authors": [{"id": 180886, "fullname": "Yuhe Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180886?format=json", "institution": "National University of Singapore"}, {"id": 127173, "fullname": "Zhenxiong Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/127173?format=json", "institution": "National University of Singapore"}, {"id": 189443, "fullname": "Yujia Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189443?format=json", "institution": null}, {"id": 180264, "fullname": "Songhua Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180264?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 87323, "fullname": "Xinchao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87323?format=json", "institution": "National University of Singapore"}], "abstract": "Recent advances in diffusion-based controllable visual generation have led to remarkable improvements in image quality. However, these powerful models are typically deployed on cloud servers due to their large computational demands, raising serious concerns about user data privacy. To enable secure and efficient on-device generation, we explore in this paper controllable diffusion models built upon linear attention architectures, which offer superior scalability and efficiency, even on edge devices. Yet, our experiments reveal that existing controllable generation frameworks, such as ControlNet and OminiControl, either lack the flexibility to support multiple heterogeneous condition types or suffer from slow convergence on such linear-attention models. To address these limitations, we propose a novel controllable diffusion framework tailored for linear attention backbones like SANA. The core of our method lies in a unified gated conditioning module working in a dual-path pipeline, which effectively harmonizes multi-type conditional inputs, such as image, semantic, and spatial cues, while maintaining training stability. Extensive experiments on multiple tasks and benchmarks demonstrate that our approach achieves state-of-the-art controllable generation performance based on linear-attention models, surpassing existing methods in terms of fidelity and controllability. Codes will be available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38260", "url": null, "sourceid": 38463, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38262, "uid": "c96af0661b844eacad759c2acf9e1486", "name": "PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Urban Scenes", "authors": [{"id": 184222, "fullname": "Christina Ourania Tze", "url": "http://cvpr.thecvf.com/api/miniconf/users/184222?format=json", "institution": "University of T\u00fcbingen"}, {"id": 141169, "fullname": "Daniel Dauner", "url": "http://cvpr.thecvf.com/api/miniconf/users/141169?format=json", "institution": "Eberhard-Karls-Universit\u00e4t T\u00fcbingen"}, {"id": 88190, "fullname": "Yiyi Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88190?format=json", "institution": "Zhejiang University"}, {"id": 130185, "fullname": "Dzmitry Tsishkou", "url": "http://cvpr.thecvf.com/api/miniconf/users/130185?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 69174, "fullname": "Andreas Geiger", "url": "http://cvpr.thecvf.com/api/miniconf/users/69174?format=json", "institution": "University of T\u00fcbingen"}], "abstract": "Existing approaches to 3D semantic urban scene generation predominantly rely on voxel-based representations, which are bound by fixed resolution, challenging to edit, and memory-intensive in their dense form. In contrast, we advocate for a primitive-based paradigm where urban scenes are represented using compact, semantically meaningful 3D elements that are easy to manipulate and compose. To this end, we introduce PrITTI, a latent diffusion model that leverages vectorized object primitives and rasterized ground surfaces for generating diverse, controllable, and editable 3D semantic urban scenes. This hybrid representation yields a structured latent space that facilitates object- and ground-level manipulation. Experiments on KITTI-360 show that primitive-based representations unlock the full capabilities of diffusion transformers, achieving state-of-the-art 3D scene generation quality with lower memory requirements, faster inference, and greater editability than voxel-based methods. Beyond generation, PrITTI supports a range of downstream applications, including scene editing, inpainting, outpainting, and photo-realistic street-view synthesis.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38262", "url": null, "sourceid": 37108, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38267, "uid": "241393669823fc88e479a88090df91c6", "name": "Deformable Gaussian Occupancy: Decoupling Rigid and Nonrigid Motion with Factorized Distillation", "authors": [{"id": 183764, "fullname": "Yang Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/183764?format=json", "institution": "EPFL"}, {"id": 189462, "fullname": "Wuyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189462?format=json", "institution": "EPFL - EPF Lausanne"}, {"id": 151737, "fullname": "Po-Chien Luan", "url": "http://cvpr.thecvf.com/api/miniconf/users/151737?format=json", "institution": "EPFL - EPF Lausanne"}, {"id": 97325, "fullname": "Alex Alahi", "url": "http://cvpr.thecvf.com/api/miniconf/users/97325?format=json", "institution": "EPFL"}], "abstract": "Understanding dynamic 3D environments is essential for safe autonomous driving, particularly when reasoning about human-centric, nonrigid agents. However, existing self-supervised occupancy prediction frameworks predominantly assume rigid-body motion and rely on simple frame-to-frame offsets, limiting their ability to capture fine-grained deformations and maintain temporal coherence. To address this issue, we propose DeGO, a deformable Gaussian occupancy framework that unifies decoupled Gaussian deformation with factorized 4D foundation-model distillation. DeGO disentangles rigid and nonrigid motion, enabling each Gaussian primitive to evolve through both deformation and offset-based updates. In parallel, a factorized 4D distillation strategy transfers cross-camera and cross-frame knowledge from the VGGT foundation model, producing foundation-aligned features that enhance temporal consistency. Experiments on the Occ3D-NuScenes benchmark demonstrate that our method achieves state-of-the-art performance under self-supervision, delivering 13.5\\% gains on human-centric instances and 10.9\\% overall improvements. These results highlight the effectiveness of deformation-aware and foundation-guided occupancy modeling for dynamic scene understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38267", "url": null, "sourceid": 34275, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38268, "uid": "eb968139ca829c6e4192e49baa4a0578", "name": "MedFG-VQA: Low-Frequency Memory and Graph Attention for Lightweight Medical VQA", "authors": [{"id": 181728, "fullname": "haowen gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181728?format=json", "institution": "Nanjing University of science and technology"}, {"id": 181698, "fullname": "Gensheng Pei", "url": "http://cvpr.thecvf.com/api/miniconf/users/181698?format=json", "institution": "Sungkyunkwan University"}, {"id": 129610, "fullname": "Zeren Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/129610?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 189463, "fullname": "Mingwu Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/189463?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 157797, "fullname": "Xiangbo Shu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157797?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 129605, "fullname": "Yazhou Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/129605?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 72996, "fullname": "Fumin Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/72996?format=json", "institution": "UESTC"}], "abstract": "Medical Visual Question Answering (Med-VQA) holds significant promise for clinical decision support, yet faces challenges due to limited annotated data and the high computational demands of existing large vision-language models. We propose MedFG-VQA, a lightweight framework that leverages a memory bank to augment DCT-based low-frequency features and employs graph-enhanced cross-attention for effective visual-textual alignment. Specifically, our approach features two key components: Frequency-Memory Fusion (FMF), which enhances low-frequency features by retrieving from a learnable memory bank built on DCT decomposition, and Graph-Aware Cross-Attention (GACA), which aligns visual-textual features via cross-attention and refines them through graph-convolutional aggregation. To address data scarcity, we construct SynMed-VQA, a large-scale synthetic dataset comprising over 2 million question-answer pairs across 9 imaging modalities and 10 major organs, generated with GPT-4o. Extensive experiments on SynMed-VQA and three other standard biomedical VQA benchmarks demonstrate that MedFG-VQA achieves competitive or superior performance compared to much larger models while maintaining significantly lower computational costs, highlighting its efficiency and potential for clinical deployment.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38268", "url": null, "sourceid": 44511, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38272, "uid": "9be7eaf5887dcb7227031f81b6fb4005", "name": "PromptStereo: Zero-Shot Stereo Matching via Structure and Motion Prompts", "authors": [{"id": 132670, "fullname": "Xianqi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/132670?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 189473, "fullname": "Hao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189473?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 189474, "fullname": "Hangtian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189474?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 126544, "fullname": "JunDa Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/126544?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 69595, "fullname": "Gangwei Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/69595?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 181817, "fullname": "Min Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/181817?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 90863, "fullname": "Xin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90863?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Modern stereo matching methods have leveraged monocular depth foundation models to achieve superior zero-shot generalization performance. However, most existing methods primarily focus on extracting robust features for cost volume construction or disparity initialization. At the same time, the iterative refinement stage, which is also crucial for zero-shot generalization, remains underexplored. Some methods treat monocular depth priors as guidance for iteration, but conventional GRU-based architectures struggle to exploit them due to the limited representation capacity. In this paper, we propose Prompt Recurrent Unit (PRU), a novel iterative refinement module based on the decoder of monocular depth foundation models. By integrating monocular structure and stereo motion cues as prompts into the decoder, PRU enriches the latent representations of monocular depth foundation models with absolute stereo-scale information while preserving their inherent monocular depth priors. Experiments demonstrate that our PromptStereo achieves state-of-the-art zero-shot generalization performance across multiple datasets, while maintaining comparable or faster inference speed. Our findings highlight prompt-guided iterative refinement as a promising direction for zero-shot stereo matching.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38272", "url": null, "sourceid": 37436, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38274, "uid": "499d139d7c871d4b41ca6aade4419e42", "name": "Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification", "authors": [{"id": 165377, "fullname": "Zizhao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/165377?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 120030, "fullname": "Ping Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/120030?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 132214, "fullname": "Ziyang Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/132214?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 72271, "fullname": "Huan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/72271?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 189475, "fullname": "Xiangru Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189475?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "As the harm caused by fake news grows, the task of detecting and grounding multi-modal media manipulation (DGM4) is gaining more attention. Existing multimodal methods overlook fine-grained semantic alignment between visual and textual modalities, thereby limiting their ability to detect sophisticated and subtle cross-modal manipulations. To address this challenge, we present MaLSF, a novel Mask-aware Local Semantic Fusion framework that explicitly bridges words and pixels via mask-label pairs, enabling the model to perform precise reasoning over fine-grained cross-modal correspondences. MaLSF captures cross-modal local semantics through two key innovations: 1) A Bidirectional Cross-modal Verification Module (BCV) that identifies semantic conflicts between masked regions and associated labels via a bidirectional query mechanism; 2) A Hierarchical Semantic Aggregation (HSA) Module that adaptively aggregates multi-granularity local semantics into decoupled features for task-specific verification. In addition, to extract fine-grained mask-label pairs, we introduce a set of diverse mask-label pair extraction parsers. The proposed model is evaluated on multiple datasets and achieves state-of-the-art performance on both the DGM4 and multimodal fake news detection tasks. Extensive ablation studies and visualization results further verify its effectiveness and interpretability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38274", "url": null, "sourceid": 31138, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38280, "uid": "7a99ddd64872e5a6b4084867dd14ef4a", "name": "Global Underwater Geolocation from Time-Lapse Polarization Imagery", "authors": [{"id": 181257, "fullname": "Sara Aghajanzadeh", "url": "http://cvpr.thecvf.com/api/miniconf/users/181257?format=json", "institution": "University of Illinois Urbana Champaign"}, {"id": 180504, "fullname": "Xiaoyang Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/180504?format=json", "institution": "The University of Hong Kong"}, {"id": 148666, "fullname": "Zhongmin Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/148666?format=json", "institution": null}, {"id": 69186, "fullname": "David Forsyth", "url": "http://cvpr.thecvf.com/api/miniconf/users/69186?format=json", "institution": "University of Illinois Urbana-Champaign"}, {"id": 189490, "fullname": "Viktor Gruev", "url": "http://cvpr.thecvf.com/api/miniconf/users/189490?format=json", "institution": "University of Illinois at Urbana-Champaign; University of Illinois at Urbana-Champaign"}], "abstract": "It is extremely hard for an underwater agent to know where it is. Satellite signals disappear within centimeters of the surface; acoustic baselines require heavy infrastructure to instrument small regions. The polarization of the sky, visible underwater, reveals the elevation of the sun. The  pattern of elevation over the day reveals location to an agent with a clock.  However, recovering elevation from polarization images is very difficult.  SOTA geolocalization methods can localize well for locations where they have seen data, but accuracy collapses when the data comes from a new location.  Our physics-guided synthesis pipeline expands a huge library of polarization images from a small set of sites to 2.8~million solar-elevation\u2013matched training sequences spanning latitudes, seasons, and water types. A compact two-stage transformer reconstructs the solar-elevation curve and predicts geolocation. Under leave-one-site-out tests, the site averaged median geodesic error is ~500km\u2014about an eightfold improvement over previous deep-learning baselines; with limited target-site data, the median error contracts to single-digit kilometers.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38280", "url": null, "sourceid": 42636, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38279, "uid": "6c1f11f779599fe6d280c41d644ed43b", "name": "NavForesee: A Unified Vision-Language World Model for Hierarchical Planning and Dual-Horizon Navigation Prediction", "authors": [{"id": 177759, "fullname": "Fei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/177759?format=json", "institution": "AMap"}, {"id": 189487, "fullname": "Shichao Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/189487?format=json", "institution": null}, {"id": 189488, "fullname": "Minghua Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189488?format=json", "institution": "Alibaba Group"}, {"id": 150618, "fullname": "Zedong Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/150618?format=json", "institution": null}, {"id": 189489, "fullname": "Junjun Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189489?format=json", "institution": "Alibaba Group"}, {"id": 128317, "fullname": "Xiaolong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128317?format=json", "institution": "Georgia Institute of Technology"}, {"id": 154906, "fullname": "Mu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154906?format=json", "institution": "Alibaba Group"}], "abstract": "Embodied navigation for long-horizon tasks, guided by complex natural language instructions, remains a formidable challenge in artificial intelligence. Existing agents often struggle with robust long-term  planning about unseen environments, leading to high failure rates. To address these limitations, we introduce NavForesee, a novel Vision-Language Model (VLM) that unifies high-level language planning and predictive world model imagination within a single, unified framework.Our approach empowers a single VLM to concurrently perform  planning and predictive foresight. Conditioned on the full instruction and historical observations, the model is trained to understand the navigation instructions by  decomposing the task, tracking its progress, and formulating the subsequent sub-goal. Simultaneously, it functions as a generative world model, providing crucial foresight by predicting short-term environmental dynamics and long-term navigation milestones. The VLM's structured plan guides its targeted prediction, while the imagined future provides rich context to inform the navigation actions, creating a powerful internal feedback loop of perception-planning/prediction-action. We demonstrate through extensive experiments on the R2R-CE and RxR-CE benchmark that NavForesee achieves highly competitive performance in complex scenarios. Our work highlights the immense potential of fusing explicit language planning with implicit spatiotemporal prediction, paving the way for more intelligent and capable embodied agents.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38279", "url": null, "sourceid": 43426, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38282, "uid": "f63882d3bda05f2da7b8127a2b308364", "name": "Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models", "authors": [{"id": 181333, "fullname": "Mark Endo", "url": "http://cvpr.thecvf.com/api/miniconf/users/181333?format=json", "institution": "Stanford University"}, {"id": 69178, "fullname": "Serena Yeung", "url": "http://cvpr.thecvf.com/api/miniconf/users/69178?format=json", "institution": "Stanford"}], "abstract": "Scaling up multimodal models has enabled remarkable advances in visual understanding and reasoning, but practical demands call for smaller, efficient systems. In this work, we conduct a principled analysis of downscaling intelligence in multimodal models, examining how reduced large language model (LLM) capacity affects multimodal capabilities. Our initial findings reveal an interesting trend: LLM downscaling disproportionately affects visual capabilities, rather than abilities inherited from the LLM. We then examine whether this drop mainly reflects the expected decline in visual reasoning or a more fundamental loss of perceptual abilities. Isolating the effect of LLM downscaling on perception, we find performance still drops sharply, often matching or exceeding the impact on reasoning. To address this bottleneck, we introduce visual extraction tuning, which explicitly trains the model to extract instruction-relevant visual details consistently across tasks. With these extracted visual details, we then apply step-by-step reasoning to generate answers. Together, these components form our Extract+Think approach, setting a new standard for efficiency and performance in this space.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38282", "url": null, "sourceid": 38999, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38284, "uid": "21691483feeebcb266442bf36ef25230", "name": "Controllable Stereo Video Conversion with Guided Latent Decoding", "authors": [{"id": 137997, "fullname": "Nando Metzger", "url": "http://cvpr.thecvf.com/api/miniconf/users/137997?format=json", "institution": "ETH Zurich"}, {"id": 152179, "fullname": "Prune Truong", "url": "http://cvpr.thecvf.com/api/miniconf/users/152179?format=json", "institution": "Google"}, {"id": 156123, "fullname": "Goutam Bhat", "url": "http://cvpr.thecvf.com/api/miniconf/users/156123?format=json", "institution": "Google"}, {"id": 86863, "fullname": "Konrad Schindler", "url": "http://cvpr.thecvf.com/api/miniconf/users/86863?format=json", "institution": "ETH Zurich"}, {"id": 87927, "fullname": "Federico Tombari", "url": "http://cvpr.thecvf.com/api/miniconf/users/87927?format=json", "institution": "Google, TUM"}], "abstract": "The growing demand for immersive 3D content calls for automated monocular-to-stereo video conversion. We present a controllable, direct end-to-end method for upgrading a conventional video to a binocular one. Our approach, based on (conditional) latent diffusion, avoids artifacts due to explicit depth estimation and warping. The key to its high-quality stereo video output is a novel, guided VAE decoder that ensures sharp and epipolar-consistent stereo video output. Moreover, our method gives the user control over the strength of the stereo effect (respectively, the disparity range) at inference time, via an intuitive, scalar tuning knob. Experiments on three different datasets of real-world stereo videos show that our method outperforms both traditional warping-based and recent warping-free baselines and sets a new standard for reliable, controllable stereo video conversion.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38284", "url": null, "sourceid": 36251, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38288, "uid": "48d91c10005e38c7562368209ebd9be0", "name": "Geometry-Guided 3D Visual Token Pruning for Video-Language Models", "authors": [{"id": 181056, "fullname": "Han Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181056?format=json", "institution": "Zhongguancun Academy"}, {"id": 90235, "fullname": "Zehao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90235?format=json", "institution": "TuSimple"}, {"id": 155921, "fullname": "Jiahui Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155921?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 90217, "fullname": "Naiyan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90217?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 75839, "fullname": "Si Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75839?format=json", "institution": "Beihang University"}], "abstract": "Multimodal large language models have demonstrated remarkable capabilities in 2D vision, motivating their extension to 3D scene understanding. Recent studies represent 3D scenes as 3D spatial videos composed of image sequences with depth and camera pose information, enabling pre-trained video-language models to perform 3D reasoning tasks. However, the large number of visual tokens in spatial videos remains a major bottleneck for efficient inference. Existing pruning methods overlook the view consistency of spatial videos and the spatial diversity of the remaining tokens, which prevents them from effectively removing inter-frame redundancy and preserving scene completeness. In this paper, we propose Geo3DPruner, a Geometry-Guided 3D Visual Token Pruning framework. Geo3DPruner first models cross-frame relevance through geometry-aware global attention, and then performs a two-stage pruning process. The intra-voxel stage selects representative multi-view features within each voxel, while the inter-voxel stage preserves spatial diversity by selecting a globally distributed subset of voxels. Extensive experiments on multiple 3D scene understanding benchmarks demonstrate that Geo3DPruner retains over 90% of the original performance while pruning 90% of visual tokens, significantly outperforming existing text-guided and vision-guided pruning methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38288", "url": null, "sourceid": 34633, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38293, "uid": "689dbe32366f0b4fa515efd65836bf41", "name": "Anatomica: Localized Control over Geometric and Topological Properties for Anatomical Diffusion Models", "authors": [{"id": 182477, "fullname": "Karim Kadry", "url": "http://cvpr.thecvf.com/api/miniconf/users/182477?format=json", "institution": "MIT"}, {"id": 189524, "fullname": "Abdalla Abdelwahed", "url": "http://cvpr.thecvf.com/api/miniconf/users/189524?format=json", "institution": null}, {"id": 189525, "fullname": "Ajay Manicka", "url": "http://cvpr.thecvf.com/api/miniconf/users/189525?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 189526, "fullname": "Naravich Chutisilp", "url": "http://cvpr.thecvf.com/api/miniconf/users/189526?format=json", "institution": "Tradition SA"}, {"id": 189527, "fullname": "Farhad R. Nezami", "url": "http://cvpr.thecvf.com/api/miniconf/users/189527?format=json", "institution": "Brigham and Women&#x27;s Hospital, Harvard Medical School"}, {"id": 189528, "fullname": "Elazer R Edelman", "url": "http://cvpr.thecvf.com/api/miniconf/users/189528?format=json", "institution": "Massachusetts Institute of Technology"}], "abstract": "We present an inference-time guidance framework for generating 3D multi-class anatomical voxel maps with localized geometric and topological control. During generation, we use cuboidal control domains of varying dimensionality, location, and shape to slice out relevant substructures. These local substructures are used to compute differentiable penalty functions that steer the sample towards target constraints. We penalize geometric features such as size, shape, position, and orientation through voxel-wise moments, while topological features such as connected components, loops, and voids are enforced through persistent homology. Lastly, we implement this guidance framework for latent diffusion models, where a neural field decoder can partially extract substructures, enabling efficient measurement and control of anatomical properties. This formulation unlocks a rich design space, where several constraints can be composed to control complex structures defined over arbitrary dimensions and coordinate systems. We show that Anatomica flexibly applies to a variety of anatomical systems, enabling the rational design of synthetic datasets for virtual simulation trials or machine learning workflows.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38293", "url": null, "sourceid": 37891, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38296, "uid": "90505b01728d89083db9e5b1804b52d4", "name": "Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning", "authors": [{"id": 180875, "fullname": "Chubin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/180875?format=json", "institution": "Tsinghua University"}, {"id": 189536, "fullname": "Sujie Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189536?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 186285, "fullname": "Jiashu Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186285?format=json", "institution": "Alibaba Group"}, {"id": 189537, "fullname": "Meiqi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189537?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 156765, "fullname": "Jintao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/156765?format=json", "institution": "Peking University"}, {"id": 189538, "fullname": "Yanxun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189538?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 146451, "fullname": "Nisha Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/146451?format=json", "institution": "Tsinghua University"}, {"id": 127801, "fullname": "Chengyu Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127801?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 186287, "fullname": "Jiahong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186287?format=json", "institution": "Alibaba Group"}, {"id": 88278, "fullname": "Xiangxiang Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88278?format=json", "institution": "MeiTuan"}, {"id": 90909, "fullname": "Xiu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/90909?format=json", "institution": "Tsinghua University"}], "abstract": "Recent studies have demonstrated significant progress in aligning text-to-image diffusion models with human preference via Reinforcement Learning from Human Feedback. However, while existing methods achieve high scores on automated reward metrics, they often lead to **Preference Mode Collapse (PMC)**-a specific form of reward hacking where models converge on narrow, high-scoring outputs (e.g., images with monolithic styles or pervasive overexposure), severely degrading generative diversity. In this work, we introduce and quantify this phenomenon, proposing **DivGenBench**, a novel benchmark designed to measure the extent of PMC. We posit that this collapse is driven by over-optimization along the reward model's inherent biases. Building on this analysis, we propose **Directional Decoupling Alignment (D$^2$-Align)**, a novel framework that mitigates PMC by directionally correcting the reward signal.Specifically, our method first learns a directional correction within the reward model's embedding space while keeping the model frozen.This correction is then applied to the reward signal during the optimization process, preventing the model from collapsing into specific modes and thereby maintaining diversity. Our comprehensive evaluation, combining qualitative analysis with quantitative metrics for both quality and diversity, reveals that **D$^2$-Align** achieves superior alignment with human preference.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38296", "url": null, "sourceid": 44966, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38298, "uid": "cd2641acda3609cf9eeb2ca8b8fd50a7", "name": "Depth Peeling for High-Fidelity Gaussian-Enhanced Surfel Rendering", "authors": [{"id": 146764, "fullname": "Keyang Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/146764?format=json", "institution": "Zhejiang University"}, {"id": 85722, "fullname": "Hongzhi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85722?format=json", "institution": "Zhejiang University"}, {"id": 85731, "fullname": "Kun Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/85731?format=json", "institution": "Zhejiang University"}], "abstract": "Novel view synthesis has been significantly advanced by NeRFs and 3D Gaussian Splatting (3DGS), which require ordering volumetric samples or primitives for correct color blending. While the recent Gaussian-Enhanced Surfels (GES) enable high-performance, sort-free rendering, they suffer from aliasing artifacts and suboptimal reconstruction. To address these limitations, we propose DP-GES, a novel representation that augments opaque surfels with semi-transparent boundaries and leverages Depth Peeling to establish accurate per-pixel ordering. This design enables sort-free Gaussian splatting with correct transmittance modulation, effectively eliminating aliasing and popping artifacts while facilitating a fully differentiable joint optimization. Extensive experiments demonstrate that our method achieves superior reconstruction quality and compares favorably against state-of-the-art techniques across a wide range of scenes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38298", "url": null, "sourceid": 36524, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38300, "uid": "4428d361dbb6f73f849bf17d85c0aee7", "name": "Multimodal Semantic Bias Mitigation for Diverse Text-To-3D Generation", "authors": [{"id": 157853, "fullname": "Yukuan Min", "url": "http://cvpr.thecvf.com/api/miniconf/users/157853?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 88220, "fullname": "Muli Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88220?format=json", "institution": "A*STAR"}, {"id": 161286, "fullname": "Jinhao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/161286?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 157852, "fullname": "Yuxuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157852?format=json", "institution": "Xidian University"}, {"id": 157854, "fullname": "Yihang Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157854?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 189542, "fullname": "Jiexi Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189542?format=json", "institution": "Xidian University"}, {"id": 88245, "fullname": "Cheng Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/88245?format=json", "institution": "Xidian University"}], "abstract": "The latest progress in text-to-3D generative models makes it possible to generate high-quality 3D content. Recent text-to-3D large model have achieved remarkable breakthroughs in multi-view consistency. However, their effectiveness is often affected by inherent biases, resulting in sensitivity to design settings such as prompt format, leading to difficulty understanding complex prompts. To help text-to-3D generative models understand more diverse prompts, we propose a framework to localize and mitigate the bias in the current text-to-3D large model. Specifically, we first use the existing model to generate 3D content and use the quality evaluation model to identify the cross-modality bias. Then, we use the predicted quality score to quantify the contribution of the prompt text to the bias. Finally, in order to reduce these biases, we construct diverse pairwise examples to help the current text-to-3D large model construct unbiased visual-text connections. The experiment shows that our method has achieved competitive results and can provide higher quality, more diverse 3D content compared to existing methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38300", "url": null, "sourceid": 33496, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38303, "uid": "1d7cc0f1d0b54eac3609c5222246595b", "name": "RT-Splatting: Joint Reflection-Transmission Modeling with Gaussian Splatting", "authors": [{"id": 152518, "fullname": "Ji Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/152518?format=json", "institution": "Peking University"}, {"id": 152511, "fullname": "Xianghua Ying", "url": "http://cvpr.thecvf.com/api/miniconf/users/152511?format=json", "institution": "Peking University"}, {"id": 152516, "fullname": "Bowei Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/152516?format=json", "institution": "Peking University"}, {"id": 151893, "fullname": "Ruohao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/151893?format=json", "institution": "Peking University"}, {"id": 152517, "fullname": "Wenzhen Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/152517?format=json", "institution": "Peking University"}], "abstract": "3D Gaussian Splatting (3DGS) enables real-time novel view synthesis with high visual quality. However, existing methods struggle with semi-transparent specular surfaces that exhibit both complex reflections and clear transmission, often producing blurry reflections or overly occluded transmission. To address this, we present **RT-Splatting**, a framework that disentangles each Gaussian's geometric occupancy from its optical opacity. This factorization yields a unified surface-volume scene representation with a single set of Gaussian primitives. Our hybrid renderer interprets this representation both as a surface to capture high-frequency reflections and as a volume to preserve clear transmission. To mitigate the ambiguity in jointly optimizing reflection and transmission, we introduce Specular-Aware Gradient Gating, which suppresses misleading gradients from highly specular regions into the transmission branch, effectively reducing distracting floaters. Experiments on challenging semi-transparent scenes show that RT-Splatting achieves state-of-the-art performance, delivering high-fidelity reflections and clear transmission with real-time rendering. Moreover, our factorization naturally enables flexible scene editing.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38303", "url": null, "sourceid": 40058, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38304, "uid": "5d156163616be03a50d2a546fa2068b2", "name": "ScenDi: 3D-to-2D Scene Diffusion Cascades for Urban Generation", "authors": [{"id": 181721, "fullname": "Hanlei Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/181721?format=json", "institution": "Zhejiang University"}, {"id": 128422, "fullname": "Jiahao Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/128422?format=json", "institution": "Zhejiang University"}, {"id": 189547, "fullname": "Xinya Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189547?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 189548, "fullname": "Xiyang Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189548?format=json", "institution": "The University of British Columbia"}, {"id": 151764, "fullname": "Sheng Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/151764?format=json", "institution": "Zhejiang University"}, {"id": 88128, "fullname": "Yujun Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88128?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 88190, "fullname": "Yiyi Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88190?format=json", "institution": "Zhejiang University"}], "abstract": "Recent advancements in 3D object generation using diffusion models have achieved remarkable success, but generating realistic 3D urban scenes remains challenging. Existing methods relying solely on 3D diffusion models tend to suffer a degradation in appearance details, while those utilizing only 2D diffusion models typically compromise camera controllability. To overcome this limitation, we propose ScenDi, a method for urban scene generation that integrates both 3D and 2D diffusion models. We first train a 3D latent diffusion model to generate 3D Gaussians, enabling the rendering of images at a relatively low resolution. To enable controllable synthesis, this 3DGS generation process can be optionally conditioned by specifying inputs such as 3d bounding boxes, road maps, or text prompts. Then, we train a 2D video diffusion model to enhance appearance details conditioned on rendered images from the 3D Gaussians. By leveraging the coarse 3D scene as guidance for 2D video diffusion, ScenDi generates desired scenes based on input conditions and successfully adheres to accurate camera trajectories. Experiments on two challenging real-world datasets, Waymo and KITTI-360, demonstrate the effectiveness of our approach.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38304", "url": null, "sourceid": 43768, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38307, "uid": "5ac050d44e6476acecf88969950cf3a2", "name": "R2G\uff1aA Multi-View Circuit Graph Benchmark Suite from RTL to GDSII", "authors": [{"id": 180217, "fullname": "ZEWEI ZHOU", "url": "http://cvpr.thecvf.com/api/miniconf/users/180217?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 189556, "fullname": "Jiajun Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/189556?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 189557, "fullname": "Jiajia Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189557?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 189558, "fullname": "Ao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189558?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 189559, "fullname": "Ruichao He", "url": "http://cvpr.thecvf.com/api/miniconf/users/189559?format=json", "institution": null}, {"id": 189560, "fullname": "Haozheng Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/189560?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 189561, "fullname": "Ao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189561?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 189562, "fullname": "Jiawei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189562?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 189563, "fullname": "Leilei Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189563?format=json", "institution": null}, {"id": 189564, "fullname": "Shan Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189564?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 189565, "fullname": "Daying Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/189565?format=json", "institution": null}], "abstract": "Progress in machine learning for electronic design automation (EDA) is constrained by the lack of open, multi-view graph datasets that coherently represent the same circuits across late physical-design stages. We present R2G (RTL-to-GDSII), a standardized benchmark and framework that converts DEF files into typed, heterogeneous, information-preserving circuit graphs and supports node- and edge-level tasks in placement and routing. R2G provides five stage-aware views with information parity and includes loaders, unified splits, domain-specific metrics, and reproducible baselines\u2014enabling fair cross-view comparison and isolating representation from modeling. In systematic studies with classic GNNs (GIN, GAT, GatedGCN), we show that view choice strongly affects performance, varies with stage and supervision, and that decoder-head depth (3--4 layers) improves accuracy and stability; these findings connect view semantics to objectives and message passing and offer practical guidance. By bridging EDA semantics and graph learning, R2G releases large-scale datasets and an end-to-end pipeline, creating an open testbed for principled representation design.  Datasets, loaders, and evaluation scripts will be released on GitHub.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38307", "url": null, "sourceid": 40958, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38308, "uid": "fc8cfa21bc52ce06040158d584ee1147", "name": "LS-ViT: Least-Squares Hessian Based Block Reconstruction for Low-Bit Post-Training Quantization of Vision Transformers", "authors": [{"id": 180088, "fullname": "Hyunha Hwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180088?format=json", "institution": "Seoul National University"}, {"id": 189566, "fullname": "Xuan Truong Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189566?format=json", "institution": "Seoul National University"}, {"id": 189567, "fullname": "Hyuk-Jae Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/189567?format=json", "institution": "Seoul National University"}], "abstract": "Vision Transformers (ViTs) have achieved state-of-the-art results across various vision tasks. To enable a practical deployment of ViTs on modern hardware systems, post-training quantization (PTQ) has been actively studied in recent years. In particular, Hessian-based block reconstruction approaches have demonstrated promising results in quantizing ViT models to ultra-low bitwidths (e.g., 4-bit). However, finding a representative approximate Hessian, a fundamental step in recent approaches such as APHQ-ViT and FIMA-Q, remains underexplored in terms of the quantization-induced error and estimation cost. To address these shortcomings, we first reveal that the sample independence assumption used in recent works, which ignores the covariance term, can lead to a significant approximation error, especially for sub-four-bits. Inspired by least-squares regression, we propose LS-ViT, a block reconstruction framework that effectively estimates a representative Hessian by explicitly minimizing this approximation error across all samples. Extensive experiments with various ViT models across different vision tasks demonstrate that LS-ViT achieves new state-of-the-art performance. In addition, LS-ViT reduces quantization time compared to prior work, enabling a practical, plug-and-play, quantization-aware deployment for ViTs. The code will be made available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38308", "url": null, "sourceid": 45426, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38313, "uid": "d3a1142c37e1db8545c12aca5e0fc43f", "name": "ShadowDraw: From Any Object to Shadow\u2013Drawing Compositional Art", "authors": [{"id": 151094, "fullname": "Rundong Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/151094?format=json", "institution": "Cornell University"}, {"id": 85450, "fullname": "Noah Snavely", "url": "http://cvpr.thecvf.com/api/miniconf/users/85450?format=json", "institution": "Google / Cornell"}, {"id": 127203, "fullname": "Wei-Chiu Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/127203?format=json", "institution": "Cornell University"}], "abstract": "We introduce *ShadowDraw*, a framework that transforms ordinary 3D objects into shadow\u2013drawing compositional art. Given a 3D object, our system predicts scene parameters---including object pose and lighting---together with a partial line drawing, such that the cast shadow completes the drawing into a recognizable image. To this end, we optimize scene configurations to reveal meaningful shadows, employ shadow contour to guide line drawing generation, and adopt automatic evaluation to enforce shadow-drawing coherence and visual quality. Experiments show that *ShadowDraw* produces compelling results across diverse inputs, from real-world scans and curated datasets to generative assets, and naturally extends to multi-object scenes, animations, and physical deployments. Our work provides a practical pipeline for creating shadow\u2013drawing art and broadens the design space of computational visual art, bridging the gap between algorithmic design and artistic storytelling.For more results and an end-to-end real-world demonstration of our pipeline, please refer to the **project page** in the supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38313", "url": null, "sourceid": 30792, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38311, "uid": "20ea88ff2d4986dc61c779051ca98a10", "name": "BackSplit: The Importance of Sub-dividing the Background in Biomedical Lesion Segmentation", "authors": [{"id": 189571, "fullname": "Rachit Saluja", "url": "http://cvpr.thecvf.com/api/miniconf/users/189571?format=json", "institution": "Cornell University"}, {"id": 189572, "fullname": "Asli Cihangir", "url": "http://cvpr.thecvf.com/api/miniconf/users/189572?format=json", "institution": "Cornell University"}, {"id": 189573, "fullname": "Ruining Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189573?format=json", "institution": "Weill Cornell Medicine, Cornell University"}, {"id": 189574, "fullname": "Johannes C. Paetzold", "url": "http://cvpr.thecvf.com/api/miniconf/users/189574?format=json", "institution": "Weill Cornell Medicine, Cornell University"}, {"id": 129507, "fullname": "Fengbei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129507?format=json", "institution": "Cornell University"}, {"id": 183383, "fullname": "Mert Sabuncu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183383?format=json", "institution": "Cornell"}], "abstract": "Segmenting small lesions in medical images remains notoriously difficult. Most prior work tackles this challenge by either designing better architectures, loss functions, or data augmentation schemes; and collecting more labeled data. We take a different view, arguing that part of the problem lies in how the background is modeled. Common lesion segmentation collapses all non-lesion pixels into a single \u201cbackground\u201d class, ignoring the rich anatomical context in which lesions appear. In reality, the background is highly heterogeneous\u2014composed of tissues, organs, and other structures that can now be labeled manually or inferred automatically using existing segmentation models.In this paper, we argue that training with fine-grained labels that sub-divide the background class, which we call BackSplit, is a simple yet powerful paradigm that can offer a significant performance boost without increasing inference costs. From an information theoretic standpoint, we prove that BackSplit increases the expected Fisher Information relative to conventional binary training, leading to tighter asymptotic bounds and more stable optimization. With extensive experiments across multiple datasets and architectures, we empirically show that BackSplit consistently boosts small-lesion segmentation performance, even when auxiliary labels are generated automatically using pretrained segmentation models. Additionally, we demonstrate that auxiliary labels derived from interactive segmentation frameworks exhibit the same beneficial effect, demonstrating its robustness, simplicity, and broad applicability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38311", "url": null, "sourceid": 39090, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38317, "uid": "2208183e25fd18b7bf374189df696ede", "name": "Ani3DHuman: Photorealistic 3D Human Animation with Self-guided Stochastic Sampling", "authors": [{"id": 179889, "fullname": "Qi Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/179889?format=json", "institution": "City University of Hong Kong"}, {"id": 181441, "fullname": "Can Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181441?format=json", "institution": null}, {"id": 189584, "fullname": "Jiaxiang Shang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189584?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 189585, "fullname": "Yingchun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189585?format=json", "institution": null}, {"id": 86410, "fullname": "Jing Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86410?format=json", "institution": "City University of Hong Kong"}], "abstract": "Current 3D human animation methods fail at photorealism: kinematics-based approaches lack non-rigid dynamics like clothing, while methods reconstructing from generated videos suffer from low-quality artifacts and identity loss. To overcome these limitations, we present Ani3DHuman, a framework that marries kinematics-based animation with video diffusion priors. We first introduce a layered motion representation that disentangles rigid motion and residual non-rigid motion. Then, we use a pretrained video diffusion model to restore a coarse rendering from the mesh-rigged animation, which provides supervision for the motion field. However, this restoration task, based on diffusion sampling, is highly challenging, as the initial renderings are out-of-distribution, causing standard deterministic ODE samplers to fail. Therefore, our core technical contribution is self-guided stochastic sampling, which effectively solves the out-of-distribution problem by combining stochastic sampling (for photorealistic quality) with self-guidance (for identity fidelity). These restored videos provide high-quality supervision, enabling the optimization of a realistic 4D motion field. Ani3DHuman achieves state-of-the-art results, and our ablations validate that both components of our sampler are essential for high-fidelity restoration.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38317", "url": null, "sourceid": 32556, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38320, "uid": "c3981c46bd4ddced63aff5fbe9a54463", "name": "Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment", "authors": [{"id": 177319, "fullname": "Yu Fanqi", "url": "http://cvpr.thecvf.com/api/miniconf/users/177319?format=json", "institution": "Italian Institute of Technology"}, {"id": 189598, "fullname": "Matteo Tiezzi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189598?format=json", "institution": "Istituto Italiano di Tecnologia"}, {"id": 189599, "fullname": "Tommaso Apicella", "url": "http://cvpr.thecvf.com/api/miniconf/users/189599?format=json", "institution": "Istituto Italiano di Tecnologia"}, {"id": 143927, "fullname": "Cigdem Beyan", "url": "http://cvpr.thecvf.com/api/miniconf/users/143927?format=json", "institution": "University of Verona"}, {"id": 91823, "fullname": "Vittorio Murino", "url": "http://cvpr.thecvf.com/api/miniconf/users/91823?format=json", "institution": "University of Verona; Istituto Italiano di Tecnologia"}], "abstract": "We introduce a lifelong imitation learning framework that enables continual policy refinement across sequential tasks under realistic memory and data constraints. Our approach departs from conventional experience replay by operating entirely in a multimodal latent space, where compact representations of visual, linguistic, and robot's state information are stored and reused to support future learning. To further stabilize adaptation, we introduce an incremental feature adjustment mechanism that regularizes the evolution of task embeddings through an angular margin constraint, preserving inter-task distinctiveness.  Our method establishes a new state of the art in the LIBERO benchmarks, achieving 10\u201317 point gains in AUC and up to 65\\% less forgetting compared to previous leading methods. Ablation studies confirm the effectiveness of each component, showing consistent gains over alternative strategies. The code will be made publicly available upon acceptance of the paper.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38320", "url": null, "sourceid": 38929, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38321, "uid": "b47e73a01de2eda1044f574a05f89daa", "name": "PnP-CM: Consistency Models as Plug-and-Play Priors for Inverse Problems", "authors": [{"id": 182470, "fullname": "Merve Gulle", "url": "http://cvpr.thecvf.com/api/miniconf/users/182470?format=json", "institution": "University of Minnesota"}, {"id": 103919, "fullname": "junno yun", "url": "http://cvpr.thecvf.com/api/miniconf/users/103919?format=json", "institution": "University of Minnesota"}, {"id": 189600, "fullname": "Yasar Utku Alcalar", "url": "http://cvpr.thecvf.com/api/miniconf/users/189600?format=json", "institution": "University of Minnesota"}, {"id": 189601, "fullname": "Mehmet Akcakaya", "url": "http://cvpr.thecvf.com/api/miniconf/users/189601?format=json", "institution": "University of Minnesota"}], "abstract": "Diffusion models have found extensive use in solving numerous inverse problems. Such diffusion inverse problem solvers aim to sample from the posterior distribution of data given the measurements, using a combination of the unconditional score function and an approximation of the posterior related to the forward process. Recently, consistency models (CMs) have been proposed to directly predict the final output from any point on the diffusion ODE trajectory, enabling high-quality sampling in just a few NFEs. CMs have also been utilized for inverse problems, but existing CM-based solvers either require additional task-specific training or utilize data fidelity operations with slow convergence, not amenable to large-scale problems. In this work, we reinterpret CMs as proximal operators of a prior, enabling their integration into plug-and-play (PnP) frameworks. We propose a solver based on PnP-ADMM, which enables us to leverage the fast convergence of conjugate gradient method. We further accelerate this with noise injection and momentum, dubbed **PnP-CM**, and show it maintains the convergence properties of the baseline PnP-ADMM. We evaluate our approach on a variety of inverse problems, including inpainting, super-resolution, Gaussian deblurring, and magnetic resonance imaging (MRI) reconstruction. To the best of our knowledge, this is the *first CM trained for MRI* datasets. Our results show that PnP-CM achieves high-quality reconstructions in as few as 4 NFEs, and can produce meaningful results in 2 steps, highlighting its effectiveness in real-world inverse problems while outperforming comparable CM-based approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38321", "url": null, "sourceid": 36938, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38325, "uid": "15afaeb2c3c4446728e3adfbd35d0801", "name": "mmWaveFlow: Unified Enhancement and Generation of mmWave Human Point Clouds", "authors": [{"id": 180725, "fullname": "Chang Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/180725?format=json", "institution": "Institute of Software Chinese Academy of Sciences"}, {"id": 189611, "fullname": "Beihong Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189611?format=json", "institution": "Institute of Software, Chinese Academy of Sciences"}, {"id": 189612, "fullname": "Qiwen Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189612?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 189613, "fullname": "Zhi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189613?format=json", "institution": "University of the Chinese Academy of Sciences"}], "abstract": "Millimeter-wave (mmWave) point clouds have attracted growing interest in human sensing due to their robustness, privacy preservation, and low cost. However, their practical use is hindered by the inherent sparsity of data and the lack of large-scale data. We revisit generative modeling for mmWave point clouds and propose a unified flow-matching framework mmWaveFlow that unifies enhancement and generation by learning an invertible transport between dense and sparse point clouds. We leverage paired data and a latent-alignment module to enforce semantic alignment and bridge the modality gap. We find that condition-free flow matching is more vulnerable to latent path crossings, which impair bidirectional transport. Therefore, we propose Origin-Aware Flow Matching (OA-Flow), which conditioning transport on the origin of the path mitigates ambiguity in bidirectional transport. Results of experiments across multiple datasets demonstrate the effectiveness of mmWaveFlow for mmWave human point clouds generation and enhancement. We also observe consistent gains in downstream tasks, highlighting the promise of our framework for human sensing. We will release the code.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38325", "url": null, "sourceid": 44163, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38328, "uid": "2f1227c50fcdddb18e869bfdeae65a06", "name": "AnyLift: Scaling Motion Reconstruction from Internet Videos via 2D Diffusion", "authors": [{"id": 164178, "fullname": "Hongjie Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/164178?format=json", "institution": "Stanford University"}, {"id": 176502, "fullname": "Heng Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/176502?format=json", "institution": "Stanford University"}, {"id": 87403, "fullname": "Jiaman Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/87403?format=json", "institution": "Stanford University"}, {"id": 85365, "fullname": "Hong-Xing Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85365?format=json", "institution": "Computer Science Department, Stanford University"}, {"id": 75810, "fullname": "Ehsan Adeli", "url": "http://cvpr.thecvf.com/api/miniconf/users/75810?format=json", "institution": "Stanford University"}, {"id": 85573, "fullname": "Karen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85573?format=json", "institution": "Computer Science Department, Stanford University"}, {"id": 84531, "fullname": "Jiajun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84531?format=json", "institution": "Stanford University"}], "abstract": "Reconstructing 3D human motion and human-object interactions (HOI) from Internet videos is a fundamental step toward building large-scale datasets of human behavior. Existing methods struggle to recover globally consistent 3D motion under dynamic cameras, especially for motion types underrepresented in current motion-capture datasets, and face additional difficulty recovering coherent human-object interactions in 3D. We introduce a two-stage framework leveraging 2D diffusion that reconstructs 3D human motion and HOI from Internet videos. In the first stage, we synthesize multi-view 2D motion data for each domain, leveraging 2D keypoints extracted from Internet videos to incorporate human motions that rarely appear in existing MoCap datasets. In the second stage, a camera-conditioned multi-view 2D motion diffusion model is trained on these domain-specific synthetic data to recover 3D human motion and 3D HOI in the world space. We demonstrate the effectiveness of our method on Internet videos featuring challenging motions such as gymnastics, as well as in-the-wild HOI videos, and show that it outperforms prior work in producing realistic human motion and human-object interaction.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38328", "url": null, "sourceid": 38554, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38327, "uid": "18cd6ebfb5567eb228c7ea1e1b792099", "name": "RebRL: Reinforcing Discrete Visual Diffusion Models with Rebalanced Timestep Credits", "authors": [{"id": 180414, "fullname": "Mu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180414?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 180338, "fullname": "Tianren Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/180338?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 153488, "fullname": "Yunfan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153488?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 189618, "fullname": "Kun Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189618?format=json", "institution": "Beihang University"}, {"id": 87065, "fullname": "Qixiang Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/87065?format=json", "institution": "University of Chinese Academy of Sciences"}], "abstract": "Discrete Diffusion Models (DDMs) have shown great potential in image generation, especially when equipped with reinforcement learning (RL) techniques.However, a fundamental yet overlooked limitation is revealed in our experiments: severe imbalance of credit assignment across timesteps during training. As a result, early generation timesteps, which carry higher exploration potential and determine the global structure, provide a smaller contribution to policy optimization.To conquer this, we propose a simple-yet-effective approach, to Re-balance timestep credit of Reinforcement Learning (RebRL) for better exploration-exploitation trade-off and more efficient training of DDMs.RebRL is plug-and-play$\\textemdash$ simply replacing uniform temporal policy with strategic rebalancing along masking stages. RebRL is analytically plausible$\\textemdash$derivation and analysis show that it enjoys a uniform token-level policy gradient, which benefits policy optimization.Experiments on text-to-image generation benchmarks show that RebRL achieves state-of-the-art performance on GenEval and improves human preference score by up to $\\textbf{3.40}$ while effectively reducing training steps by $\\textbf{$\\sim$40\\\\%}$.Code is enclosed in the supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38327", "url": null, "sourceid": 36310, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38334, "uid": "50f7de035c26f1b799c2a27d7e427e76", "name": "Landscape-Awareness for Geometric View Diffusion Model", "authors": [{"id": 182969, "fullname": "Yan-Ting Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/182969?format=json", "institution": "National Taiwan University"}, {"id": 90398, "fullname": "Hao-Wei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/90398?format=json", "institution": "National Tsing Hua University"}, {"id": 139669, "fullname": "Tsu-Ching Hsiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/139669?format=json", "institution": "Woven by Toyota"}, {"id": 184084, "fullname": "Chun-Yi Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/184084?format=json", "institution": "National Taiwan University"}], "abstract": "Accuracy camera viewpoint estimation under sparse-view conditions remains challenging, particularly in two-view scenarios. Recent approaches leverage diffusion models such as Zero123, which synthesize novel views conditioned on relative viewpoint, and have demonstrated promising performance when repurposed for viewpoint estimation via optimization with MSE loss. However, existing methods often suffer from non-convex loss landscape with numerous local minima, which makes them sensitive to initialization and reliant on na\\\"ive multi-start strategies to achieve reasonable results. We analyze these optimization challenges and visualize failure cases, showing that ambiguities in object geometry, such as symmetry and self-similarity, can mislead gradient-based updates toward incorrect viewpoints. To address these limitations, we propose a score-based method that reshapes the optimization landscape to guide updates toward the ground-truth viewpoint, followed by a refinement stage using a viewpoint-conditioned diffusion model. Experiments show that our method improves convergence, reduces reliance on brute-force sampling, and achieves competitive accuracy with higher sample-efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38334", "url": null, "sourceid": 42989, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38335, "uid": "581a4c33889fc7aeca599e03628e37d3", "name": "DGS: Dual Gradient and Semantic-Shift Guided Low-Rank Adaptation for Class Incremental Learning", "authors": [{"id": 180854, "fullname": "KAI LI", "url": "http://cvpr.thecvf.com/api/miniconf/users/180854?format=json", "institution": "East China Normal University"}, {"id": 85644, "fullname": "Jiafeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/85644?format=json", "institution": "East China Normal University"}, {"id": 85653, "fullname": "Lianghua He", "url": "http://cvpr.thecvf.com/api/miniconf/users/85653?format=json", "institution": "Tongji University"}, {"id": 73065, "fullname": "Ying Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/73065?format=json", "institution": "East China Normal University"}], "abstract": "In Class-Incremental Learning (CIL), parameter efficient fine-tuning applied to Pre-trained Models (PTMs) remain vulnerable to catastrophic forgetting as they adapt to new tasks. The prevalent strategy to mitigate catastrophic forgetting is to constrain gradients within the orthogonal subspaces of past tasks, while rigid gradient constraints hinder plasticity. In this paper, we propose a novel CIL framework, Dual Gradient and Semantic-Shift Guided Low-Rank Adaptation (DGS), that balances stability and plasticity via gradient fusion and maintains representation consistency through classifier and patch-token alignment. Specifically, our method introduces the Dual Gradient update strategy that first derives a base subspace projection from the PTMs and then fuses task-specific LoRA gradients with their aligned counterparts through interpolated combination. This design promotes knowledge retention without sacrificing task-specific expressiveness. Furthermore, we employ a Classifier Alignment mechanism with Semantic shift estimation which is based on the calibrated prototype statistics to mitigate classifier shift, and introduce a novel Patch-level Alignment loss to preserve feature consistency across tasks. Extensive experiments on six standard benchmarks demonstrate that our approach consistently outperforms existing CIL methods, highlighting its effectiveness and generalization capability in continual learning scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38335", "url": null, "sourceid": 46079, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38342, "uid": "24963e6f1f40d76dc0b25fe8b11c844e", "name": "VQ-VA World: Towards High-Quality Visual Question-Visual Answering", "authors": [{"id": 126984, "fullname": "Chenhui Gou", "url": "http://cvpr.thecvf.com/api/miniconf/users/126984?format=json", "institution": "Monash University"}, {"id": 128567, "fullname": "Zilong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/128567?format=json", "institution": "Tsinghua University"}, {"id": 84813, "fullname": "Zeyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84813?format=json", "institution": "University of California, Santa Cruz"}, {"id": 84918, "fullname": "Feng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/84918?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 138983, "fullname": "Deyao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/138983?format=json", "institution": "ByteDance Inc."}, {"id": 185854, "fullname": "Zicheng Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185854?format=json", "institution": "University of Adelaide"}, {"id": 107357, "fullname": "Kunchang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/107357?format=json", "institution": "SIAT, UCAS"}, {"id": 168674, "fullname": "Chaorui Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/168674?format=json", "institution": "ByteDance Inc."}, {"id": 189647, "fullname": "Hongyi Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189647?format=json", "institution": null}, {"id": 153751, "fullname": "Haoqi Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153751?format=json", "institution": null}, {"id": 75526, "fullname": "Cihang Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/75526?format=json", "institution": "University of California, Santa Cruz"}, {"id": 126993, "fullname": "Jianfei Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/126993?format=json", "institution": "Monash University"}, {"id": 75933, "fullname": "Hamid Rezatofighi", "url": "http://cvpr.thecvf.com/api/miniconf/users/75933?format=json", "institution": "Monash University"}], "abstract": "This paper studies \\textit{Visual Question\u2013Visual Answering (VQ-VA)}: generating an image, rather than text, in response to a visual question---an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image. To also bring this capability to open-source models, we introduce VQ-VA World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction. Leveraging web-scale deployment, this pipeline crawls a massive amount of $\\sim$1.8M high-quality, interleaved image\u2013text samples for model training. For evaluation, we further release IntelligentBench, a human-curated benchmark that systematically assesses VQ-VA along the aspects of \\textit{world knowledge}, \\textit{design knowledge}, and \\textit{reasoning}. Training with VQ-VA World data yields strong empirical gains: it helps LightFusion attain 53.06 on IntelligentBench, substantially surpassing the best prior open-source baselines (\\emph{i.e.}, 7.78 from vanilla LightFusion; 1.94 from UniWorld-V1), and significantly narrowing the gap toward leading proprietary systems (\\emph{e.g.}, 81.67 from NanoBanana; 82.64 from GPT-Image). By releasing the full suite of model weights, datasets, and pipelines, our work will greatly stimulate future research on VQ-VA.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38342", "url": null, "sourceid": 37362, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38345, "uid": "c5f5688aab21d00610e8cdeae7a56ebf", "name": "Curvature-Aware Captioning: Leveraging Geodesic Attention for 3D Scene Understanding", "authors": [{"id": 181743, "fullname": "Ziyao He", "url": "http://cvpr.thecvf.com/api/miniconf/users/181743?format=json", "institution": "East China Normal University"}, {"id": 189658, "fullname": "Yingjie Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189658?format=json", "institution": null}, {"id": 148828, "fullname": "ZhangYangRui ZhangYangRui", "url": "http://cvpr.thecvf.com/api/miniconf/users/148828?format=json", "institution": "East China Normal University"}, {"id": 128946, "fullname": "Mingsong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/128946?format=json", "institution": "East China Normal University"}, {"id": 189659, "fullname": "Xuan Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189659?format=json", "institution": "East China Normal University"}, {"id": 128955, "fullname": "Xian Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/128955?format=json", "institution": "Chinese Academy of Sciences"}], "abstract": "Accurate 3D scene description is fundamental to robotic navigation and augmented reality, yet current dense captioning methods face significant limitations in processing sparse point cloud data. Existing approaches that apply Euclidean embedding spaces struggle to simultaneously preserve fine-grained local geometric details and model exponentially growing global semantic hierarchies, leading to either inaccurate localization or disjointed, shallow scene descriptions. In this work, we propose a novel \\textbf{\\textsc{Curvature-Aware Captioning}} framework, integrating novel non-Euclidean geodesic attention mechanisms, to resolve the localization-contextualization conflict. Specifically, self-attention within Oblique space enforces dimensional homogeneity while establishing long-range dependencies. Bidirectional geodesic cross-attention within Lorentz space models hierarchical semantic relationships across scene instances, enabling simultaneous precision in object localization and coherence in scene descriptions. Theoretical analysis confirms that the curvature complementarity between the Oblique manifold and Lorentz hyperboloid resolves the Euclidean-hyperbolic conflict, ensuring feature stability via isotropic optimization while preserving inherent hierarchical relationships. Extensive experiments on ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, with significant gains in both localization accuracy and descriptive richness.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38345", "url": null, "sourceid": 38311, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38349, "uid": "7da58332a407cf34d87ea109fdfba52f", "name": "Spatia: Video Generation with Updatable Spatial Memory", "authors": [{"id": 104113, "fullname": "Jinjing Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/104113?format=json", "institution": "The University of Sydney"}, {"id": 77426, "fullname": "Fangyun Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/77426?format=json", "institution": "Microsoft Research"}, {"id": 183140, "fullname": "Zhening Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183140?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 84849, "fullname": "Hongyang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84849?format=json", "institution": "School of Computer Science, University of Waterloo"}, {"id": 87489, "fullname": "Chang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87489?format=json", "institution": "University of Sydney"}, {"id": 87597, "fullname": "Yan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87597?format=json", "institution": "Microsoft Research Asia"}], "abstract": "Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose Spatia, a spatial memory\u2013aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. This dynamic\u2013static disentanglement design enhances spatial consistency throughout the generation process while preserving the model\u2019s ability to produce realistic dynamic entities. Furthermore, Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38349", "url": null, "sourceid": 37677, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38360, "uid": "498b71407ed107b5a3f83951be5b4df4", "name": "VISTA: A Test-Time Self-Improving Video Generation Agent", "authors": [{"id": 189713, "fullname": "Do Xuan Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/189713?format=json", "institution": "National University of Singapore; Google"}, {"id": 189714, "fullname": "Xingchen Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189714?format=json", "institution": "Google DeepMind; Google"}, {"id": 189715, "fullname": "Hootan Nakhost", "url": "http://cvpr.thecvf.com/api/miniconf/users/189715?format=json", "institution": null}, {"id": 77295, "fullname": "Chen-Yu Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/77295?format=json", "institution": "Google"}, {"id": 84645, "fullname": "Tomas Pfister", "url": "http://cvpr.thecvf.com/api/miniconf/users/84645?format=json", "institution": "Google"}, {"id": 189716, "fullname": "Sercan O Arik", "url": "http://cvpr.thecvf.com/api/miniconf/users/189716?format=json", "institution": "Google"}], "abstract": "Despite rapid advances in text-to-video synthesis, generated video quality remains critically dependent on precise user prompts. Existing test-time optimization methods, successful in other domains, struggle with the multi-faceted nature of video. In this work, we introduce VISTA (Video Iterative Self-improvemenT Agent), a novel multi-agent system that autonomously improves video generation through refining prompts in an iterative loop. VISTA first decomposes a user's idea into a structured temporal plan. After generation, the best video is identified through a robust pairwise tournament. This winning video is then critiqued by a trio of specialized agents focusing on visual, audio, and contextual fidelity. Finally, a reasoning agent synthesizes this feedback to introspectively rewrite and enhance the prompt for the next generation cycle. Experiments on single- and multi-scene video generation scenarios show that while prior methods yield inconsistent gains, VISTA consistently improves video quality and alignment with user intent, achieving up to 60% pairwise win rate against state-of-the-art baselines. Human evaluators concur, preferring VISTA's outputs in 66.4% of comparisons.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38360", "url": null, "sourceid": 31119, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38361, "uid": "fa17a68f853853baf9cf1ee222cfe999", "name": "VRCLIP: Multimodal Canonical Correlation Alignment for CLIP-Driven Vision-Radio Person Re-Identification", "authors": [{"id": 180151, "fullname": "Rui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180151?format=json", "institution": "University of Science and Technology of China"}, {"id": 189717, "fullname": "Yaqi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189717?format=json", "institution": "University of Science and Technology of China"}, {"id": 189718, "fullname": "Yadong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189718?format=json", "institution": "University of Washington"}, {"id": 189719, "fullname": "Ruixu Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189719?format=json", "institution": "University of Science and Technology of China"}, {"id": 189720, "fullname": "Jianyang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189720?format=json", "institution": "University of Science and Technology of China"}, {"id": 156091, "fullname": "Qijun Ying", "url": "http://cvpr.thecvf.com/api/miniconf/users/156091?format=json", "institution": "University of Science and Technology of China"}, {"id": 189721, "fullname": "Dongheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189721?format=json", "institution": "University of Science and Technology of China"}, {"id": 189722, "fullname": "Yang Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189722?format=json", "institution": "University of Science and Technology of China"}, {"id": 189723, "fullname": "Yan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189723?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Person re-identification (ReID) is critical for public safety, yet the performance of RGB-based methods is limited under challenging lighting and occlusion conditions. In contrast, low-frequency radio frequency (RF) signals, with their superior penetration capability and illumination invariance, provide ideal complementary information.  However, a key challenge in fusing these heterogeneous modalities lies in the conventional approach that relies heavily on cross\u2011modal distribution matching, which often over\u2011regularizes and weakens the discriminative capacity within each modality. Rather than enforcing direct distribution alignment, canonical correlation analysis (CCA) constructs a shared subspace that maximizes cross\u2011modal correlation, inherently balancing modality specificity and shared semantics. Inspired by this, we reformulate cross-modal alignment as a correlation maximization problem, avoiding direct constraints on feature distributions and guiding the model to harmonize intra\u2011modal discriminative learning with cross\u2011modal alignment. Specifically, VRCLIP first refines CLIP\u2019s visual encoder with illumination\u2011disentangling objectives, then aligns RGB and RF embeddings in a canonical correlation subspace, and finally employs an RF\u2011anchored reliability gate for adaptive fusion. To advance the area,  we will release {VRR}, the first large\u2011scale vision\u2013radio ReID dataset with over 650K paired image\u2013radar samples and position annotations for 31 participants. Extensive experiments show state\u2011of\u2011the\u2011art 93.9\\% mAP and robust generalization across diverse lighting and occlusion conditions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38361", "url": null, "sourceid": 35927, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38366, "uid": "b58face67d2ad0f6ae0de365c1fcd418", "name": "SASNet: Spatially-Adaptive Sinusoidal Networks for INRs", "authors": [{"id": 133725, "fullname": "Haoan Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/133725?format=json", "institution": "University of Maryland, College Park"}, {"id": 152098, "fullname": "Diana Aldana Moreno", "url": "http://cvpr.thecvf.com/api/miniconf/users/152098?format=json", "institution": "IMPA"}, {"id": 159537, "fullname": "Tiago Novello", "url": "http://cvpr.thecvf.com/api/miniconf/users/159537?format=json", "institution": "Instituto Nacional de Matem\u00e1tica Pura e Aplicada - IMPA"}, {"id": 189730, "fullname": "Leila Floriani", "url": "http://cvpr.thecvf.com/api/miniconf/users/189730?format=json", "institution": "University of Maryland, College Park"}], "abstract": "Sinusoidal neural networks (SIRENs) are powerful implicit neural representations (INRs) for low-dimensional signals in vision and graphics. By encoding input coordinates with sinusoidal functions, they enable high-frequency image and surface reconstruction. However, training SIRENs is often unstable and highly sensitive to frequency initialization: small frequencies produce overly smooth reconstructions in detailed regions, whereas large ones introduce spurious high-frequency components that manifest as noise in smooth areas such as image backgrounds. To address these challenges, we propose $\\textbf{SASNet}$, a $\\textit{Spatially-Adaptive Sinusoidal Network}$ that couples a $\\textit{frozen frequency embedding layer}$, which explicitly fixes the network\u2019s frequency support, with $\\textit{jointly learned spatial masks}$ that localize neuron influence across the domain. This pairing stabilizes optimization, sharpens edges, and suppresses noise in smooth areas. Experiments on 2D image and 3D volumetric data fitting as well as signed distance field (SDF) reconstruction benchmarks demonstrate that SASNet achieves faster convergence, superior reconstruction quality, and robust frequency localization--assigning low- and high-frequency neurons to smooth and detailed regions respectively--while maintaining parameter efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38366", "url": null, "sourceid": 31181, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38367, "uid": "48fe118b07a43ce5c04a3789df5f152e", "name": "Spk2VidNet: A Hierarchical Recurrent Architecture for High-Fidelity Video Reconstruction from Long Spike-Camera Streams", "authors": [{"id": 153916, "fullname": "Yuanlin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153916?format=json", "institution": "Peking University"}, {"id": 127198, "fullname": "Ruiqin Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/127198?format=json", "institution": "Peking University"}, {"id": 189731, "fullname": "Jiyu Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/189731?format=json", "institution": null}, {"id": 189732, "fullname": "Zhenkun Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189732?format=json", "institution": "Peking University"}, {"id": 89151, "fullname": "Zhaofei Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89151?format=json", "institution": "Peking University"}, {"id": 128724, "fullname": "Xiaopeng Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/128724?format=json", "institution": "Harbin Institute of Technology"}, {"id": 88774, "fullname": "Tiejun Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88774?format=json", "institution": "Peking University"}], "abstract": "Spike camera is a neuromorphic vision sensor with ultra-high temporal resolution, capable of capturing fast-moving scenes by firing a stream of binary spikes. However, its relatively low spatial resolution limits the acquisition of fine-grained visual details, motivating research on spike camera super resolution (SCSR). Existing SCSR methods typically operate on fixed-length spike sequences, where the accessible information is confined to a local temporal neighborhood. Moreover, spike fluctuations hinder intensity information extraction. Both factors affect the performance of SCSR. To address these issues, we propose a hierarchical recurrent network named Spk2VidNet to reconstruct high-fidelity high resolution image sequences from low resolution spike data. To mitigate fluctuations, Spk2VidNet progressively exploits temporal correlations within spike stream to enhance feature representation by hierarchically enlarging temporal receptive fields. Within recurrent phase, we introduce an alignment module that leverages the motion consistency among multiple frames to jointly estimate and mutually refine inter-frame motions, achieving more accurate temporal alignment. In addition, we propose a fusion module to adaptively integrate neighboring aligned features based on multi-scale similarity for robust feature aggregation. We further propose a segment-wise training with state transfer strategy to efficiently model long-term dependencies with limited GPU memory, thereby leveraging rich subpixel cues for improved super resolution. Experiments on synthetic and real-captured spike data demonstrate that Spk2VidNet achieves state-of-the-art performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38367", "url": null, "sourceid": 40478, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38365, "uid": "d26d49b8e4c08eae95a256b23bee031c", "name": "OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding", "authors": [{"id": 126706, "fullname": "Sheng-Yu Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126706?format=json", "institution": "National Taiwan University"}, {"id": 152616, "fullname": "Jaesung Choe", "url": "http://cvpr.thecvf.com/api/miniconf/users/152616?format=json", "institution": "NVIDIA"}, {"id": 89818, "fullname": "Yu-Chiang Frank Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89818?format=json", "institution": "NVIDIA"}, {"id": 129956, "fullname": "Cheng Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/129956?format=json", "institution": "NVIDIA"}], "abstract": "We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for the open-vocabulary 3D scene understanding tasks. Given the sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, our OpenVoxel is able to produce meaningful groups that describe different objects in the scene. Also, by leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), our OpenVoxel successfully build an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) or referring expression segmentation (RES). Unlike previous methods, our method is training-free and does not introduce embeddings from a CLIP/BERT text encoder. Instead, we directly proceed with text-to-text search using MLLMs. Through extensive experiments, our method demonstrates superior performance compared to recent studies, particularly in complex referring expression segmentation (RES) tasks. The code will be open.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38365", "url": "https://peterjohnsonhuang.github.io/openvoxel-pages/", "sourceid": 39847, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38369, "uid": "709a255d7ae6859551c9cb810d091a7b", "name": "Hyperbolic Busemann Neural Networks", "authors": [{"id": 129542, "fullname": "Ziheng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/129542?format=json", "institution": "University of Trento"}, {"id": 130882, "fullname": "Bernhard Sch\u00f6lkopf", "url": "http://cvpr.thecvf.com/api/miniconf/users/130882?format=json", "institution": "ELLIS Institute"}, {"id": 75973, "fullname": "Nicu Sebe", "url": "http://cvpr.thecvf.com/api/miniconf/users/75973?format=json", "institution": "University of Trento"}], "abstract": "Hyperbolic spaces provide a natural geometry for representing hierarchical and tree-structured data due to their exponential volume growth. To leverage these benefits, neural networks require intrinsic and efficient components that operate directly in hyperbolic space. In this work, we lift two core components of neural networks, Multinomial Logistic Regression (MLR) and Fully Connected (FC) layers, into hyperbolic space via Busemann functions, resulting in Busemann MLR (BMLR) and Busemann FC (BFC) layers with a unified mathematical interpretation. BMLR provides compact parameters, a point-to-horosphere distance interpretation, batch-efficient computation, and a Euclidean limit, while BFC generalizes FC and activation layers with comparable complexity. Experiments on image classification, genome sequence learning, node classification, and link prediction demonstrate improvements in effectiveness and efficiency over prior hyperbolic layers.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38369", "url": null, "sourceid": 40395, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38370, "uid": "28172a8c15bfbc0714c5a2cd0d1ed2a4", "name": "GeCo-SRT: Geometry-aware Continual Adaptation for Cross-Task Sim-to-Real Transfer", "authors": [{"id": 180169, "fullname": "Wenbo Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180169?format=json", "institution": "Beijing Forestry University"}, {"id": 151910, "fullname": "Wenke Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/151910?format=json", "institution": "Renmin University"}, {"id": 189736, "fullname": "Weitao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189736?format=json", "institution": "Renmin University of China"}, {"id": 127885, "fullname": "Di Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127885?format=json", "institution": "Renmin University of China"}], "abstract": "Bridging the sim-to-real gap is important for applying low-cost simulation data to real-world robotic systems. However, previous methods are severely limited by treating each transfer as an isolated endeavor, demanding repeated, costly tuning and wasting prior transfer experience.To move beyond isolated sim-to-real, we build a continual cross-task sim-to-real transfer paradigm centered on knowledge accumulation across iterative transfers, thereby enabling effective and efficient adaptation to novel tasks. Thus, we propose GeCo-SRT, a geometry-aware continual adaptation method. It utilizes domain-invariant and task-invariant knowledge from local geometric features as a transferable foundation to accelerate adaptation during subsequent sim-to-real transfers. This method starts with a geometry-aware mixture-of-experts module, which dynamically activates experts to specialize in distinct geometric knowledge to bridge observation sim-to-real gap. Further, the geometry-expert-guided prioritized experience replay module preferentially samples from underutilized experts, refreshing specialized knowledge to combat forgetting and maintain robust cross-task performance.Leveraging knowledge accumulated during iterative transfer, GeCo-SRT method not only achieves 52\\% average performance improvement over the baseline, but also demonstrates significant data efficiency for new task adaptation with only 1/6 data.We hope this work inspires approaches for efficient, low-cost cross-task sim-to-real transfer.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38370", "url": null, "sourceid": 45899, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38372, "uid": "a4612eea51839d55630892fccd57c03c", "name": "WeDetect: Fast Open-Vocabulary Object Detection as Retrieval", "authors": [{"id": 156202, "fullname": "Shenghao Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156202?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 189739, "fullname": "Yukun Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/189739?format=json", "institution": "Tencent WeChat AI"}, {"id": 126882, "fullname": "Fengyun Rao", "url": "http://cvpr.thecvf.com/api/miniconf/users/126882?format=json", "institution": "WeChat, Tencent Inc."}, {"id": 185164, "fullname": "Jing LYU", "url": "http://cvpr.thecvf.com/api/miniconf/users/185164?format=json", "institution": "Wechat Team"}, {"id": 87660, "fullname": "Xiaohua Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/87660?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 86314, "fullname": "Wei-Shi Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86314?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Open-vocabulary object detection aims to detect arbitrary classes via text prompts. Methods without cross-modal fusion layers (non-fusion) offer faster inference by treating recognition as a retrieval problem, i.e., matching regions to text queries in a shared embedding space. In this work, we fully explore this retrieval philosophy and demonstrate its unique advantages in efficiency and versatility through a model family named WeDetect: (1) State-of-the-art performance. WeDetect is a real-time detector with a dual-tower architecture. We show that, with well-curated data and full training, the non-fusion WeDetect surpasses other fusion models and establishes a strong open-vocabulary foundation. (2) Fast backtrack of historical data. WeDetect-Uni is a universal proposal generator based on WeDetect. We freeze the entire detector and only finetune an objectness prompt to retrieve generic object proposals across categories. Importantly, the proposal embeddings are class-specific and enable a new application, object retrieval, supporting retrieval objects in historical data. (3) Integration with LMMs for referring expression comprehension (REC). We further propose WeDetect-Ref, an LMM-based object classifier to handle complex referring expressions, which retrieves target objects from the proposal list extracted by WeDetect-Uni. It discards next-token prediction and classifies objects in a single forward pass. Together, the WeDetect family unifies detection, proposal generation, object retrieval, and REC under a coherent retrieval framework, achieving state-of-the-art performance across 15 benchmarks with high inference efficiency. We will open-source all models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38372", "url": null, "sourceid": 36378, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38373, "uid": "f4d9387dec63ae41d4b40a146e759a72", "name": "Monet: Reasoning in Latent Visual Space Beyond Image and Language", "authors": [{"id": 152519, "fullname": "Qixun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152519?format=json", "institution": "Peking University"}, {"id": 185851, "fullname": "Yang Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185851?format=json", "institution": "Peking University"}, {"id": 189740, "fullname": "Yifei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189740?format=json", "institution": "Amazon"}, {"id": 168185, "fullname": "Yuanxing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/168185?format=json", "institution": "Kuaishou Technology"}, {"id": 134947, "fullname": "Pengfei Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/134947?format=json", "institution": "Kuaishou Technology"}, {"id": 156268, "fullname": "Kun Gai", "url": "http://cvpr.thecvf.com/api/miniconf/users/156268?format=json", "institution": "Kuaishou- \u5feb\u624b\u79d1\u6280"}, {"id": 152511, "fullname": "Xianghua Ying", "url": "http://cvpr.thecvf.com/api/miniconf/users/152511?format=json", "institution": "Peking University"}, {"id": 84739, "fullname": "Yisen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84739?format=json", "institution": "Peking University"}], "abstract": "Thinking with images has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps. However, existing methods fall short of human-like abstract visual thinking, as their flexibility is fundamentally limited by external tools. In this work, we introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. We identify two core challenges in training MLLMs for latent visual reasoning\u2014high computational cost in latent\u2013vision alignment and insufficient supervision over latent embeddings, and address them with a three-stage distillation-based supervised fine-tuning (SFT) pipeline. We further reveal a limitation of applying GRPO to latent reasoning: it primarily enhances text-based reasoning rather than latent reasoning. To overcome this, we propose VLPO (Visual-latent Policy Optimization), a reinforcement learning method that explicitly incorporates latent embeddings into policy gradient updates.To support SFT, we construct Monet-SFT-125K, a high-quality text\u2013image interleaved CoT dataset containing 125K real-world, chart, OCR, and geometry CoTs. Our model, Monet-7B, shows consistent gains across real-world perception and reasoning benchmarks and exhibits strong out-of-distribution generalization on challenging abstract visual reasoning tasks. We also empirically analyze the role of each training component and discuss our early unsuccessful attempts, providing insights for future developments in visual latent reasoning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38373", "url": null, "sourceid": 38592, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38374, "uid": "918b69ec3f995b4a6f804b03839f704f", "name": "Eulerian Gaussian Splatting using Hashed Probability Pyramids", "authors": [{"id": 105859, "fullname": "Mia Polansky", "url": "http://cvpr.thecvf.com/api/miniconf/users/105859?format=json", "institution": "Harvard University"}, {"id": 141062, "fullname": "George Kopanas", "url": "http://cvpr.thecvf.com/api/miniconf/users/141062?format=json", "institution": "Research, Google"}, {"id": 87361, "fullname": "Stephan J. Garbin", "url": "http://cvpr.thecvf.com/api/miniconf/users/87361?format=json", "institution": "Microsoft"}, {"id": 131429, "fullname": "Todd Zickler", "url": "http://cvpr.thecvf.com/api/miniconf/users/131429?format=json", "institution": "Harvard University"}, {"id": 94620, "fullname": "Dor Verbin", "url": "http://cvpr.thecvf.com/api/miniconf/users/94620?format=json", "institution": "Google"}], "abstract": "We introduce a probabilistic splat-based radiance field framework that retains the fast rasterization and test-time efficiency of 3D Gaussian Splatting (3DGS) while replacing heuristic primitive manipulation with gradient-based optimization of a volumetric probability density. Rather than relocating, splitting, or culling Gaussians via hand-tuned densification (e.g., ADC), we treat primitive locations as samples drawn from a persistent, learnable density. We instantiate this density with a novel, memory-efficient multi-scale hierarchical grid that enables end-to-end gradient-based control over primitive population density. To stabilize stochastic training, we derive an unbiased gradient estimator with control variates that markedly reduces variance. By allowing probability mass to flow to where the loss demands, our method eliminates brittle priors and naturally explores the volume, achieving state-of-the-art reconstruction quality on mip-NeRF360 while preserving 3DGS-level rendering speed.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38374", "url": null, "sourceid": 38565, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38376, "uid": "8ffdcade0d19115e2b5bedb1d94dc878", "name": "Disentangled Textual Priors for Diffusion-based Image Super-Resolution", "authors": [{"id": 176803, "fullname": "Lei Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176803?format=json", "institution": "nju"}, {"id": 156608, "fullname": "Xin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156608?format=json", "institution": "Nanjing University"}, {"id": 189745, "fullname": "Xinze Tong", "url": "http://cvpr.thecvf.com/api/miniconf/users/189745?format=json", "institution": "nanjing university"}, {"id": 176883, "fullname": "Zhiliang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/176883?format=json", "institution": "Nanjing University"}, {"id": 156609, "fullname": "Jie Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156609?format=json", "institution": "Nanjing University"}, {"id": 156610, "fullname": "Jie Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156610?format=json", "institution": "Nanjing University"}, {"id": 86405, "fullname": "Gangshan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86405?format=json", "institution": "Nanjing University"}], "abstract": "Image Super-Resolution (SR) aims to reconstruct high-resolution images from degraded low-resolution inputs. While diffusion-based SR methods offer powerful generative capabilities, their performance heavily depends on how semantic priors are structured and integrated into the generation process. Existing approaches often rely on entangled or coarse-grained priors that mix global layout with local details, or conflate structural and textural cues, thereby limiting semantic controllability and interpretability. In this work, we propose DTPSR, a novel diffusion-based SR framework that introduces disentangled textual priors along two complementary dimensions: spatial hierarchy (global vs. local) and frequency semantics (low- vs. high-frequency). By explicitly separating these priors, DTPSR enables the model to simultaneously capture scene-level structure and object-specific details with frequency-aware semantic guidance. The corresponding embeddings are injected via specialized cross-attention modules, forming a progressive generation pipeline that reflects the semantic granularity of visual content\u2014from global layout to fine-grained textures. To support this paradigm, we construct DisText-SR, a large-scale dataset containing approximately 95,000 image-text pairs with carefully disentangled global, low-frequency, and high-frequency descriptions. To further enhance controllability and consistency, we adopt a multi-branch classifier-free guidance strategy with frequency-aware negative prompts to suppress hallucinations and semantic drift. Extensive experiments on synthetic and real-world benchmarks show that DTPSR achieves high perceptual quality, competitive fidelity, and strong generalization across diverse degradation scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38376", "url": null, "sourceid": 33693, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38378, "uid": "fbae5e569612b58f7ae726cf94a78a8a", "name": "AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation", "authors": [{"id": 107470, "fullname": "Wenxuan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/107470?format=json", "institution": "Tsinghua University"}, {"id": 89364, "fullname": "Xiuwei Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89364?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 189747, "fullname": "Yichen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189747?format=json", "institution": "Tsinghua University"}, {"id": 189748, "fullname": "Xiangyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189748?format=json", "institution": "Tsinghua University"}, {"id": 157710, "fullname": "Hang Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/157710?format=json", "institution": "Tsinghua University"}, {"id": 189749, "fullname": "Huangxing Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189749?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 130710, "fullname": "Wenzhao Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/130710?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 130762, "fullname": "Jianjiang Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/130762?format=json", "institution": "Tsinghua University"}, {"id": 77142, "fullname": "Jie Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/77142?format=json", "institution": "Tsinghua University"}, {"id": 88597, "fullname": "Jiwen Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88597?format=json", "institution": "Tsinghua University"}], "abstract": "Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for end-to-end action prediction, they often lack an explicit and explainable understanding of the relationships between the agent, the instruction, and the scene. Conversely, explicitly building a scene map for heuristic planning is intuitively appealing but relies on additional 3D sensors and hinders large-scale vision-language pre-training. To bridge this gap, we propose AwareVLN, a novel framework that equips the navigation model with a self-aware reasoning mechanism, enabling it to understand the agent's state and task progress in a fully end-to-end and data-driven manner. Our approach features two key innovations: (1) a structural reasoning module that fosters spatial and task-oriented self-awareness, and (2) an automatic data engine with progress division for effective training. Extensive experiments on various datasets in Habitat simulator show our AwareVLN significantly outperforms previous state-of-the-art vision-language navigation methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38378", "url": null, "sourceid": 46027, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38379, "uid": "f648c726ded78a0f14942f50cb519ae1", "name": "Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis", "authors": [{"id": 174407, "fullname": "Yuanzhe Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/174407?format=json", "institution": "Shanghai Shuzhiwei Information Technology Co., Ltd."}, {"id": 181218, "fullname": "Hao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/181218?format=json", "institution": "Hao Chen"}, {"id": 189750, "fullname": "Rui Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189750?format=json", "institution": "Nanjing First Hospital, Nanjing Medical University"}, {"id": 174805, "fullname": "Juyan Ba", "url": "http://cvpr.thecvf.com/api/miniconf/users/174805?format=json", "institution": "Shenzhen university"}, {"id": 85821, "fullname": "Yu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85821?format=json", "institution": "Nanjing University"}, {"id": 189751, "fullname": "Sheng Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189751?format=json", "institution": "Ruijin Hospital"}], "abstract": "Recent vision language models (VLMs) have shown strong generalization and multimodal reasoning abilities in natural domains. However, their application to medical diagnosis remains limited by the lack of comprehensive and structured datasets that capture real clinical workflows. To advance the development of VLMs for clinical applications, particularly in gastric cancer, we introduce Gastric-X, a large-scale multimodal benchmark for gastric cancer analysis. Each case in Gastric-X includes paired resting and dynamic CT scans, a set of structured biochemical indicators, expert-authored diagnostic notes, and bounding box annotations of tumor regions, reflecting realistic clinical conditions. We systematically examine the capability of recent VLMs on five core tasks: Visual Question Answering (VQA), report generation, cross-modal retrieval, disease classification, and lesion localization. These tasks simulate critical stages of clinical workflow, from visual understanding and reasoning to multimodal decision support.  Through this evaluation, we aim not only to assess model performance but also to probe the nature of VLM understanding: Can current VLMs meaningfully correlate biochemical signals with spatial tumor features and textual reports? We envision Gastric-X as a step toward aligning machine intelligence with the cognitive and evidential reasoning processes of physicians, and as a resource to inspire the development of next-generation medical VLMs.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38379", "url": null, "sourceid": 45980, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38377, "uid": "3f8ee098f1300beb0464a8a8288ab931", "name": "PhysGen: Physically Grounded 3D Shape Generation for Industrial Design", "authors": [{"id": 184243, "fullname": "Yingxuan You", "url": "http://cvpr.thecvf.com/api/miniconf/users/184243?format=json", "institution": "EPFL"}, {"id": 127510, "fullname": "Chen Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/127510?format=json", "institution": "EPFL"}, {"id": 189746, "fullname": "Hantao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189746?format=json", "institution": "EPFL"}, {"id": 131126, "fullname": "Ming Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131126?format=json", "institution": "Australian National University"}, {"id": 90581, "fullname": "Pascal Fua", "url": "http://cvpr.thecvf.com/api/miniconf/users/90581?format=json", "institution": "Swiss Federal Institute of Technology Lausanne"}], "abstract": "Existing generative models for 3D shapes can synthesize high-fidelity and visually plausible shapes. For certain classes of shapes that have undergone an engineering design process, the realism of the shape is tightly coupled with the underlying physical properties, e.g., aerodynamic efficiency for automobiles. Since existing methods lack knowledge of such physics, they are unable to use this knowledge to enhance the realism of shape generation. Motivated by this, we propose a unified physics-based 3D shape generation pipeline, with a focus on industrial design applications. Specifically, we introduce a new flow matching model with explicit physical guidance, consisting of an alternating update process. We iteratively perform a velocity-based update and a physics-based refinement, progressively adjusting the latent code to align with the desired 3D shapes and physical properties. We further strengthen physical validity by incorporating a physics-aware regularization term into the velocity-based update step. To support such physics-guided updates, we build a shape-and-physics variational autoencoder (SP-VAE) that jointly encodes shape and physics information into a unified latent space. The experiments on three benchmarks show that this synergistic formulation improves shape realism beyond mere visual plausibility.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38377", "url": null, "sourceid": 32642, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38382, "uid": "8e1d97d44ddafb6025fc1a0c97faff74", "name": "Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping", "authors": [{"id": 132550, "fullname": "Junmyeong Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/132550?format=json", "institution": "Pohang University of Science and Technology"}, {"id": 159336, "fullname": "Hoseung Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/159336?format=json", "institution": "Pohang University of Science &amp; Technology (POSTECH)"}, {"id": 86304, "fullname": "Minsu Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/86304?format=json", "institution": "POSTECH"}], "abstract": "Forecasting dynamic scenes remains a fundamental challenge in computer vision, as limited observations make it difficult to capture coherent object-level motion and long-term temporal evolution.We present Motion Group-aware Gaussian Forecasting (MoGaF), a framework for long-term scene extrapolation built upon the 4D Gaussian Splatting representation.MoGaF introduces motion-aware Gaussian grouping and group-wise optimization to enforce physically consistent motion across both rigid and non-rigid regions, yielding spatially coherent dynamic representations.Leveraging this structured space-time representation, a lightweight forecasting module predicts future motion, enabling realistic and temporally stable scene evolution.Experiments on synthetic and real-world datasets demonstrate that MoGaF consistently outperforms existing baselines in rendering quality, motion plausibility, and long-term forecasting stability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38382", "url": null, "sourceid": 44620, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38384, "uid": "3f5a51db5c7936950261ef1da3b25202", "name": "Discriminative Perception via Anchored Description for Reasoning Segmentation", "authors": [{"id": 189762, "fullname": "Tao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189762?format=json", "institution": "Northwestern Polytechnical University, Northwest Polytechnical University Xi&#x27;an"}, {"id": 155749, "fullname": "Qing Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/155749?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 189763, "fullname": "Yanliang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189763?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 155751, "fullname": "Qi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155751?format=json", "institution": "Northwestern Polytechnical University"}], "abstract": "Reasoning segmentation increasingly employs reinforcement learning to generate explanatory reasoning chains that guide Multimodal Large Language Models. While these geometric rewards are primarily confined to guiding the final localization, they are incapable of discriminating whether the reasoning process remains anchored on the referred region or strays into irrelevant context. Lacking this discriminative guidance, the model's reasoning often devolves into unfocused and verbose chains that ultimately fail to disambiguate and perceive the target in complex scenes. This suggests a need to complement the RL objective with Discriminative Perception, an ability to actively distinguish a target from its context. To realize this, we propose DPAD to compel the model to generate a descriptive caption of the referred object, which is then used to explicitly discriminate by sharply contrasting the caption\u2019s semantic relevance to the referred object against the wider context. By optimizing for this discriminative capability, the model is forced to focus on the unique attributes of the target, leading to a more converged and efficient reasoning chain The descriptive caption also serves as an interpretability rationale that aligns with the segmentation. Experiments on the benchmarks confirm the validity of our approach, delivering significant performance gains, with the cIoU on ResonSeg increasing by 3.09% and the reasoning chain length decreasing by approximately 42%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38384", "url": null, "sourceid": 37027, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38391, "uid": "ab99bdd324780a346267d29dc02230d2", "name": "Grounding Everything in Tokens for Multimodal Large Language Models", "authors": [{"id": 128884, "fullname": "Xiangxuan Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/128884?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 189785, "fullname": "Zhongdao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189785?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 186153, "fullname": "Liping Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186153?format=json", "institution": null}, {"id": 107339, "fullname": "Pin Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107339?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 128866, "fullname": "Guoqing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128866?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 86334, "fullname": "Chao Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/86334?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Multimodal large language models (MLLMs) have made significant advancements in vision understanding and reasoning. However, the autoregressive Transformer architecture used by MLLMs requries tokenization on input images, which limits their ability to accurately ground objects within the 2D image space. This raises an important question: how can image tokenization be improved to better ground objects in 2D spatial space for MLLMs? To address this, we present a spatial representation method for grounding objects, namely GETok, that integrates a specialized vocabulary of learnable tokens into MLLMs. GETok first uses grid tokens to partition the image plane into structured spatial anchors, and then exploits offset tokens to enable precise and iterative refinement of localization predictions. By embedding spatial relationships directly into tokens, GETok significantly advances MLLMs in native 2D space reasoning without modifying the autoregressive Transformer architecture. Extensive experiments demonstrate that GETok achieves superior performance over the state-of-the-art methods across various referring tasks in both supervised and reinforcement learning contexts.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38391", "url": null, "sourceid": 42721, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38385, "uid": "e6265a85e503bb5db629f379330fe08b", "name": "GenBreak: Red Teaming Text-to-Image Generation Using Large Language Models", "authors": [{"id": 183755, "fullname": "Zilong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183755?format=json", "institution": "Fudan University"}, {"id": 189764, "fullname": "Xiang Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189764?format=json", "institution": "City University of Hong Kong"}, {"id": 189765, "fullname": "Xiaosen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189765?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 189766, "fullname": "Bo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189766?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 90424, "fullname": "Xingjun Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/90424?format=json", "institution": "Deakin University"}], "abstract": "Text-to-image (T2I) models such as Stable Diffusion have advanced rapidly and are now widely used in content creation. However, these models can be misused to generate harmful content, including nudity or violence, posing significant safety risks. While most platforms employ content moderation systems, underlying vulnerabilities can still be exploited by deliberate adversaries.Recent research on red-teaming and adversarial attacks against T2I models faces a critical limitation: existing methods struggle to balance prompt stealthiness with high toxicity in generated images. Some studies successfully generate highly toxic images but use adversarial prompts that are easily detected and blocked by safety filters, while others focus on bypassing safety mechanisms but fail to produce genuinely harmful outputs, neglecting the discovery of truly high-risk prompts. Consequently, there remains a lack of reliable tools for evaluating the safety of defended T2I models.To address this gap, we propose GenBreak, a framework that fine-tunes a red-team large language model (LLM) to systematically explore underlying vulnerabilities in T2I generators. Our approach combines supervised fine-tuning on curated datasets with reinforcement learning via interaction with a surrogate T2I model. By integrating multiple reward signals, we guide the LLM to craft adversarial prompts that enhance both evasion capability and image toxicity, while maintaining semantic coherence and diversity. These prompts demonstrate strong effectiveness in black-box attacks against commercial T2I generators, revealing practical safety weaknesses.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38385", "url": null, "sourceid": 38290, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38386, "uid": "2b1f292723cc6f4ca4761f8710141cbb", "name": "Streaming Video Instruction Tuning", "authors": [{"id": 180412, "fullname": "Jiaer Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/180412?format=json", "institution": "Hong Kong Baptist University"}, {"id": 126437, "fullname": "Peixian Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126437?format=json", "institution": "Xiamen University"}, {"id": 126806, "fullname": "Mengdan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126806?format=json", "institution": "Tencent Youtu Lab"}, {"id": 126816, "fullname": "Xing Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/126816?format=json", "institution": "Tencent YouTu Lab"}, {"id": 74166, "fullname": "Kaiyang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/74166?format=json", "institution": "Hong Kong Baptist University"}], "abstract": "We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, making a step toward unified, intelligent video understanding in continuous video streams.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38386", "url": null, "sourceid": 43387, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38387, "uid": "b1494532125308790dff5321c7a762ad", "name": "StableMTL: Repurposing Latent Diffusion Models for Multi-Task Learning from Partially Annotated Synthetic Datasets", "authors": [{"id": 113787, "fullname": "Anh Quan Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/113787?format=json", "institution": "INRIA"}, {"id": 128908, "fullname": "Ivan Lopes", "url": "http://cvpr.thecvf.com/api/miniconf/users/128908?format=json", "institution": "Inria"}, {"id": 126205, "fullname": "Raoul de Charette", "url": "http://cvpr.thecvf.com/api/miniconf/users/126205?format=json", "institution": "Inria"}], "abstract": "Multi-task learning for dense prediction is limited by the need for extensive annotation for every task, although recent works have explored training with partial task labels.Leveraging the generalization power of diffusion models, we extend the partial learning setup to a zero-shot setting, training a multi-task model on multiple synthetic datasets, each labeled for only a subset of tasks. Our method, StableMTL, repurposes an image generator for latent regression. Adapting a denoising framework with task encoding, task conditioning and a tailored training scheme. Instead of per-task losses requiring careful balancing, a unified latent loss is adopted, enabling seamless scaling to more tasks. To encourage inter-task synergy, we introduce a multi-stream model with a task-attention mechanism that converts N-to-N task interactions into efficient N-to-one attention, promoting effective cross-task sharing. StableMTL outperforms baselines on 7 tasks across 8 benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38387", "url": null, "sourceid": 37725, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38392, "uid": "a2db3a7829a3d16e3735781ced544816", "name": "P-Flow: Prompting Visual Effects Generation", "authors": [{"id": 156504, "fullname": "Rui Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/156504?format=json", "institution": "National University of Singapore"}, {"id": 73557, "fullname": "Mike Zheng Shou", "url": "http://cvpr.thecvf.com/api/miniconf/users/73557?format=json", "institution": "National University of Singapore"}], "abstract": "Recent advancements in video generation models have significantly improved their ability to follow text prompts. However, the customization of dynamic visual effects, defined as temporally evolving and appearance-driven visual phenomena like object crushing or explosion, remains underexplored. Prior works on motion customization or control mainly focus on low-level motions of the subject or camera, which can be guided using explicit control signals such as motion trajectories. In contrast, dynamic visual effects involve higher-level semantics that are more naturally suited for control via text prompts. However, it is hard and time-consuming for humans to craft a single prompt that accurately specifies these effects, as they require complex temporal reasoning and iterative refinement over time. To address this challenge, we propose P-Flow, a novel training-free framework for customizing dynamic visual effects in video generation without modifying the underlying model. By leveraging the semantic and temporal reasoning capabilities of vision-language models, P-Flow performs test-time prompt optimization, refining prompts based on the discrepancy between the visual effects of the reference video and the generated output. Through iterative refinement, the prompts evolve to better induce the desired dynamic effect in novel scenes. Experiments demonstrate that P-Flow achieves high-fidelity and diverse visual effect customization and outperforms other models on both text-to-video and image-to-video generation tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38392", "url": null, "sourceid": 43827, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38393, "uid": "a2daf6d006b9b09d13bdb15017a4d714", "name": "Hear What Matters! Text-conditioned Selective Video-to-Audio Generation", "authors": [{"id": 181154, "fullname": "Junwon Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/181154?format=json", "institution": "KAIST (Korea)"}, {"id": 189786, "fullname": "Juhan Nam", "url": "http://cvpr.thecvf.com/api/miniconf/users/189786?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 159489, "fullname": "Jiyoung Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/159489?format=json", "institution": "Assistant Professor, Ewha Womans University"}], "abstract": "This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source for precise editing, mixing, and creative control. However, current approaches generate single source-mixed sounds at once, largely because visual features are entangled, and region cues or prompts often fail to specify the source. We propose SelVA, a novel text-conditioned V2A model that treats the text prompt as an explicit selector of target source and modulates video encoder to distinctly extract prompt-relevant video features. The proposed supplementary tokens promote cross-attention by suppressing text-irrelevant activations with efficient parameter tuning, yielding robust semantic and temporal grounding. SelVA further employs a self-augmentation scheme to overcome the lack of mono audio track supervision. We evaluate SelVA on VGG-MonoAudio, a curated benchmark of clean single-source videos for such a task. Extensive experiments and ablations consistently verify its effectiveness across audio quality, semantic alignment, and temporal synchronization. Code and demo are available in the supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38393", "url": null, "sourceid": 35483, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38396, "uid": "0395b98de9a261e0dbc630e72b6bf183", "name": "Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation", "authors": [{"id": 181651, "fullname": "Won Shik Jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181651?format=json", "institution": "GIST"}, {"id": 157838, "fullname": "Ue-Hwan Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/157838?format=json", "institution": "Gwangju Institute of Science and Technology"}], "abstract": "Text-goal instance navigation (TGIN) asks an agent to resolve a single, free-form description into actions that reach the correct object instance among same-category distractors. We present \\textit{Context-Nav} that elevates long, contextual captions from a local matching cue to a global exploration prior and verifies candidates through 3D spatial reasoning. First, we compute dense text-image alignments for a value map that ranks frontiers---guiding exploration toward regions consistent with the entire description rather than early detections. Second, upon observing a candidate, we perform a viewpoint-aware relation check: the agent samples plausible observer poses, aligns local frames, and accepts a target only if the spatial relations can be satisfied from at least one viewpoint. The pipeline requires no task-specific training or finetuning; we attain state-of-the-art performance on InstanceNav and CoIN-Bench. Ablations show that (i) encoding full captions into the value map avoids wasted motion and (ii) explicit, viewpoint-aware 3D verification prevents semantically plausible but incorrect stops. This supports that geometry-grounded spatial reasoning is a scalable alternative to heavy policy training or human-in-the-loop interaction for fine-grained instance disambiguation in cluttered 3D scenes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38396", "url": null, "sourceid": 39007, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38400, "uid": "ffec991bf0f35f7e7688cf0d39c9f9f4", "name": "Probabilistic Concept Graph Reasoning for Multimodal Misinformation Detection", "authors": [{"id": 182213, "fullname": "Ruichao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182213?format=json", "institution": "USTB"}, {"id": 181460, "fullname": "Wei Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/181460?format=json", "institution": "Singapore Management University"}, {"id": 126280, "fullname": "Xiaobin Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126280?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 155500, "fullname": "Jing Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/155500?format=json", "institution": "Hong Kong Baptist University"}, {"id": 189796, "fullname": "Hongzhan Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189796?format=json", "institution": "National University of Singapore; Hong Kong Baptist University"}, {"id": 155499, "fullname": "Ziyang Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/155499?format=json", "institution": "Salesforce"}, {"id": 189797, "fullname": "Bo-Wen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189797?format=json", "institution": null}, {"id": 87788, "fullname": "Xu-Cheng Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/87788?format=json", "institution": "University of Science and Technology Beijing"}], "abstract": "Multimodal misinformation poses an escalating challenge that often evades traditional detectors, which are opaque black boxes and fragile against new manipulation tactics. We present Probabilistic Concept Graph Reasoning (PCGR), an interpretable, modular, and evolvable framework that reframes multimodal misinformation detection (MMD) as structured, concept-based reasoning. PCGR follows a build-then-infer paradigm, which first constructs a graph of human-understandable concept nodes, including novel high-level concepts automatically discovered and validated by multimodal large language models (MLLMs), and then applies hierarchical attention over this concept graph to infer claim veracity. This design produces interpretable reasoning chains linking evidence to conclusions. Experiments demonstrate that PCGR achieves state-of-the-art MMD accuracy and robustness to emerging manipulation types, outperforming prior methods in both coarse detection and fine-grained manipulation recognition.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38400", "url": null, "sourceid": 37212, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38401, "uid": "7b7b8fe6a7edcce3a60990b689411964", "name": "MatMart: Material Reconstruction of 3D Objects via Diffusion", "authors": [{"id": 172010, "fullname": "Xiuchao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/172010?format=json", "institution": "Zhejiang University &amp; Alibaba Group"}, {"id": 189798, "fullname": "Pengfei Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189798?format=json", "institution": "Nanjing University"}, {"id": 104646, "fullname": "Jiangjing Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/104646?format=json", "institution": "Alibaba Group"}, {"id": 184565, "fullname": "Xinguo Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184565?format=json", "institution": "Zhejiang University"}, {"id": 188326, "fullname": "Jie Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188326?format=json", "institution": "Nanjing university"}, {"id": 90556, "fullname": "Yanwen Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/90556?format=json", "institution": "Nanjing University"}, {"id": 90181, "fullname": "Weiwei Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90181?format=json", "institution": "Zhejiang University"}, {"id": 91226, "fullname": "Chengfei Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/91226?format=json", "institution": "Zhejiang University"}], "abstract": "Applying diffusion models to physically-based material estimation and generation has recently gained prominence. In this paper, we propose MatMart, a novel material reconstruction framework for 3D objects, offering the following advantages. First, MatMart adopts a two-stage reconstruction, starting with accurate material prediction from inputs and followed by prior-guided material generation for unobserved views, yielding high-fidelity results. Second, by utilizing progressive inference alongside the proposed view-material cross-attention (VMCA), MatMart enables reconstruction from an arbitrary number of input images, demonstrating strong scalability and flexibility. Finally, MatMart achieves both material prediction and generation capabilities through end-to-end optimization of a single diffusion model, without relying on additional pre-trained models, thereby exhibiting enhanced stability across various types of objects. Extensive experiments demonstrate that MatMart achieves superior performance in material reconstruction compared to existing methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38401", "url": null, "sourceid": 40359, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38405, "uid": "a4e6a24729a963657bb163b6e0977043", "name": "Towards Generalizable AI-Generated Image Detection via Image-Adaptive Prompt Learning", "authors": [{"id": 158247, "fullname": "Yiheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/158247?format=json", "institution": "Chinese Academy of Sciences"}, {"id": 135744, "fullname": "Zichang Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/135744?format=json", "institution": "Baidu"}, {"id": 189809, "fullname": "Guoqing Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189809?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 89292, "fullname": "Zhen Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/89292?format=json", "institution": "Institute of Automation,  Chinese Academy of Sciences"}, {"id": 153636, "fullname": "Xu Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/153636?format=json", "institution": "Sangfor Technologies Inc."}, {"id": 158248, "fullname": "Yang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158248?format=json", "institution": "Institute of automation, Chinese academy of science"}], "abstract": "In AI-generated image detection, current cutting-edge methods typically adapt pre-trained foundation models through partial-parameter fine-tuning. However, these approaches often struggle to generalize to forgeries from unseen generators, as the fine-tuned models capture only limited patterns from training data and fail to reflect the evolving traits of new ones. To overcome this limitation, we propose Image-Adaptive Prompt Learning (IAPL), a novel paradigm that dynamically adjusts the prompts fed into the encoder according to each testing image, rather than fixing them after training. This design significantly enhances robustness and adaptability to diverse forged images. The dynamic prompts integrate conditional information with test-time adaptive tokens through a lightweight learnable scaling factor. The conditional information is produced by a Conditional Information Learner, which leverages CNN-based feature extractors to model both forgery-specific and general conditions. The test-time adaptive tokens are optimized during inference on a single sample by enforcing prediction consistency across multiple views, ensuring that the parameters align with the current image. For the final decision, the optimal input with the highest prediction confidence is selected. Extensive experiments show that IAPL achieves state-of-the-art performance, with mean accuracies of 95.61% and 96.7% on the widely used UniversalFakeDetect and GenImage datasets, respectively.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38405", "url": null, "sourceid": 44968, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38407, "uid": "5441a1aa8871476da67131c185a73c4d", "name": "SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead", "authors": [{"id": 152620, "fullname": "Chaojun Ni", "url": "http://cvpr.thecvf.com/api/miniconf/users/152620?format=json", "institution": "Peking University"}, {"id": 127834, "fullname": "Chen Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/127834?format=json", "institution": "Nanyang Technological University"}, {"id": 179509, "fullname": "Xiaofeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/179509?format=json", "institution": "Tsinghua University"}, {"id": 75853, "fullname": "Zheng Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75853?format=json", "institution": "Tsinghua University"}, {"id": 130710, "fullname": "Wenzhao Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/130710?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 152619, "fullname": "Boyuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152619?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 89939, "fullname": "Tianrun Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89939?format=json", "institution": "Zhejiang University"}, {"id": 151907, "fullname": "Guosheng Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/151907?format=json", "institution": "University of Chinese Academy Sciences"}, {"id": 189812, "fullname": "Haoyun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189812?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 189813, "fullname": "Zhehao Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/189813?format=json", "institution": "Peking University"}, {"id": 149295, "fullname": "Qiang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/149295?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 189814, "fullname": "Yun Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/189814?format=json", "institution": null}, {"id": 189815, "fullname": "Yang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189815?format=json", "institution": "GigaAI"}, {"id": 152624, "fullname": "Guan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152624?format=json", "institution": "GigaAI"}, {"id": 153221, "fullname": "Wenjun Mei", "url": "http://cvpr.thecvf.com/api/miniconf/users/153221?format=json", "institution": "Peking University"}], "abstract": "Vision\u2013Language\u2013Action (VLA) models built on pretrained Vision\u2013Language Models (VLMs) show strong potential but are limited in practicality due to their large parameter counts. To mitigate this issue, using a lightweight VLM has been explored, but it compromises spatiotemporal reasoning. Although some methods suggest that incorporating additional 3D inputs can help, they usually rely on large VLMs to fuse 3D and 2D inputs and still lack temporal understanding. Therefore, we propose SwiftVLA, an architecture that enhances a compact model with 4D understanding while preserving design efficiency. Specifically, our approach features a pretrained 4D visual geometry transformer with a temporal cache that incrementally extracts 4D features from 2D images. Then, to enhance the VLM\u2019s ability to exploit both 2D images and 4D features, we introduce \\textit{Fusion Tokens}, a set of learnable tokens trained with a future prediction objective to generate unified representations for action generation. Finally, we introduce a mask-and-reconstruct strategy that randomly masks 4D inputs to the VLM and trains the VLA to reconstruct the masked features. This self-reconstruction objective helps learn effective 4D representations, allowing the 4D branch to be dropped at inference with minimal performance loss. Extensive experiments in real and simulated environments show that SwiftVLA outperforms lightweight baselines and rivals VLAs up to $7\\times$ larger. On edge devices, SwiftVLA achieves comparable performance while being $18\\times$ faster than the $\\pi_0$  and reducing the memory footprint by $12\\times$.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38407", "url": null, "sourceid": 37948, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38408, "uid": "199a323c2b7b2b65dae06d56d48b8073", "name": "Expert-Teacher-Student Collaborative Learning for  Domain Adaptive Object Detection", "authors": [{"id": 102394, "fullname": "Yiming Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/102394?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 77213, "fullname": "Liang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/77213?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 189816, "fullname": "Haibing Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189816?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 189817, "fullname": "Yuhan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189817?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 158795, "fullname": "Xichun Sheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/158795?format=json", "institution": "Macao Polytechnic University"}, {"id": 89584, "fullname": "Chenggang Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/89584?format=json", "institution": "Hangzhou Dianzi University, Tsinghua University"}], "abstract": "Domain adaptive object detection (DAOD) aims to generalize an object detector trained on a source domain to a target domain, where the domain gap degrades the adaptability. Recently, large-scale vision foundation models (VFMs), pretrained on web-scale datasets, exhibit such powerful generalization capabilities that many approaches leverage them to bridge the domain gap. However, their generalized knowledge is not tailored to the specific domain, which makes it difficult to offer precise guidance in the target domain. In this paper, we propose an Expert-Teacher-Student collaborative learning (ETS) framework to synergize the generalized knowledge from VFMs with the domain-specific knowledge from the teacher model. Concretely, we first design an Expert-Teacher Collaborative Teaching (ETCT) module, which leverages the complementary knowledge of expert and teacher models to collaboratively generate high-quality pseudo labels for supervising student model learning. Second, we devise an Expert-Teacher Joint Consolidating (ETJC) module, which introduces class-wise prototype alignment among expert, teacher, and student models, to jointly consolidate generalized and domain-specific knowledge within the student model. ETS leverages VFMs as the expert model in a free lunch manner, thus avoiding significant additional training costs. Extensive experiments exhibit that our method outperforms the existing SOTA methods on three benchmarks.Our code is available in the supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38408", "url": null, "sourceid": 32277, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38422, "uid": "28c634381653ecb8ffcf146cd8caf34e", "name": "Solving Minimal Problems Without Matrix Inversion Using FFT-Based Interpolation", "authors": [{"id": 181405, "fullname": "Haidong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181405?format=json", "institution": "University of Oulu"}, {"id": 189839, "fullname": "Snehal Bhayani", "url": "http://cvpr.thecvf.com/api/miniconf/users/189839?format=json", "institution": "University of Oulu"}, {"id": 87938, "fullname": "Janne Heikkil\u00e4", "url": "http://cvpr.thecvf.com/api/miniconf/users/87938?format=json", "institution": "University of Oulu"}], "abstract": "Estimating camera geometry typically involves solving minimal problems formulated as systems of multivariate polynomial equations, which often pose computational challenges when using existing Gr\u00f6bner-basis or resultant-based methods due to matrix inversion needed in the online solver. Here we propose a sampling-based, matrix inversion-free method that constructs the solvers using sparse hidden-variable resultants. The determinant polynomial in the hidden variable is efficiently reconstructed via inverse fast Fourier transform interpolation from sampled evaluations, avoiding symbolic expansion. Solving this polynomial yields the hidden variable, and the remaining unknowns are recovered by identifying rank-1 deficient submatrices and applying Cramer's rule. A greatest common divisor-based criterion ensures robust submatrix identification under noise. Experiments on diverse minimal problems demonstrate that the proposed solver achieves strong numerical stability and competitive runtime, particularly for small-scale problems, providing a practical alternative to traditional Gr\u00f6bner-basis and resultant-based solvers.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38422", "url": null, "sourceid": 41426, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38410, "uid": "7714ab6ab1ea68593e80de97752745e8", "name": "ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions", "authors": [{"id": 181320, "fullname": "Zikai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181320?format=json", "institution": "Harbin Institute of Technology"}, {"id": 88919, "fullname": "Zhilu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88919?format=json", "institution": "Harbin Institute of Technology"}, {"id": 178484, "fullname": "Yiqing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/178484?format=json", "institution": "Harbin Institute of Technology"}, {"id": 189824, "fullname": "Hui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189824?format=json", "institution": "Harbin Institute of Technology"}, {"id": 84797, "fullname": "Wangmeng Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/84797?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Existing hand-object interactions (HOI) methods are largely limited to rigid objects, while 4D reconstruction methods of articulated objects generally require pre-scanning the object or even multi-view videos. It remains an unexplored but significant challenge to reconstruct 4D human-articulated-object interactions from a single monocular RGB video. Fortunately, recent advancements in foundation models present a new opportunity to address this highly ill-posed problem. To this end, we introduce ArtHOI, an optimization-based framework that integrates and refines priors from multiple foundation models. Our key contribution is a suite of novel methodologies designed to resolve the inherent inaccuracies and physical unreality of these priors. In particular, we introduce an Adaptive Sampling Refinement (ASR) method to optimize object's metric scale and pose for grounding its normalized mesh in world space. Furthermore, we propose a Multimodal Large Language Model (MLLM) guided hand-object alignment method, utilizing contact reasoning information as constraints of hand-object mesh composition optimization. To facilitate a comprehensive evaluation, we also contribute two new datasets, ArtHOI-RGBD and ArtHOI-Wild. Extensive experiments validate the robustness and effectiveness of our ArtHOI across diverse objects and interactions. The code and datasets will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38410", "url": null, "sourceid": 33376, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38411, "uid": "1503e6be575c5632a2528736341aaec7", "name": "What Makes Good Synthetic Training Data for Zero-Shot Stereo Matching?", "authors": [{"id": 128197, "fullname": "David Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/128197?format=json", "institution": "Princeton University"}, {"id": 70506, "fullname": "Alexander Raistrick", "url": "http://cvpr.thecvf.com/api/miniconf/users/70506?format=json", "institution": "Princeton University"}, {"id": 90297, "fullname": "Jia Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90297?format=json", "institution": "Princeton University"}], "abstract": "Synthetic datasets are a crucial ingredient for training stereo matching networks, but the question of what makes a stereo dataset effective remains underexplored. We investigate the design space of synthetic datasets by varying the parameters of a procedural dataset generator, and report the effects on zero-shot stereo matching performance using standard benchmarks. We validate our findings by collecting the best settings and creating a large-scale dataset. Training only on this dataset achieves better performance than training on a mixture of widely used datasets, and is competitive with training on the FoundationStereo dataset, with the additional benefit of open-source generation code and an accompanying parameter analysis to enable further research. We open-source our system to enable further research on procedural stereo datasets.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38411", "url": null, "sourceid": 36340, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38413, "uid": "65d7720bddfbf1c6f6b5b4f21b96c7ea", "name": "Rectifying Latent Space for Generative Single-Image Reflection Removal", "authors": [{"id": 152021, "fullname": "Mingjia Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152021?format=json", "institution": "Tianjin University"}, {"id": 189826, "fullname": "Jin Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189826?format=json", "institution": "Tianjin University"}, {"id": 189827, "fullname": "Hainuo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189827?format=json", "institution": "Tianjin University"}, {"id": 154282, "fullname": "Qiming Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154282?format=json", "institution": "Tianjin University"}, {"id": 189828, "fullname": "Jiarui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189828?format=json", "institution": "Tianjin University"}, {"id": 154283, "fullname": "Xiaojie Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/154283?format=json", "institution": "Tianjin University"}], "abstract": "Single-image reflection removal is a highly ill-posed problem, where existing methods struggle to reason about the composition of corrupted regions, causing them to fail at recovery and generalization in the wild. This work reframes an editing-purpose latent diffusion model to effectively perceive and process highly ambiguous, layered image inputs, yielding high-quality outputs. We argue that the challenge of this conversion stems from a critical yet overlooked issue, i.e., the latent space of semantic encoders lacks the inherent structure to interpret a composite image as a linear superposition of its constituent layers.Our approach is built on three synergistic components, including a reflection-equivariant VAE that aligns the latent space with the linear physics of reflection formation, a learnable task-specific text embedding for precise guidance that bypasses ambiguous language, and a depth-guided early-branching sampling strategy to harness generative stochasticity for an optimal result. Extensive experiments reveal that our model achieves new state-of-the-art performance on multiple benchmarks and generalizes well to challenging real-world images. Code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38413", "url": null, "sourceid": 39569, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38414, "uid": "978f19a896d0dbcd035a1ee019510d07", "name": "VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement", "authors": [{"id": 91739, "fullname": "Zhengfei Kuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91739?format=json", "institution": "Stanford University"}, {"id": 189829, "fullname": "Rui Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189829?format=json", "institution": "Google"}, {"id": 128046, "fullname": "Long Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/128046?format=json", "institution": "Google DeepMind"}, {"id": 85845, "fullname": "Gordon Wetzstein", "url": "http://cvpr.thecvf.com/api/miniconf/users/85845?format=json", "institution": "Stanford University"}, {"id": 90241, "fullname": "Saining Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/90241?format=json", "institution": "Facebook"}, {"id": 107365, "fullname": "Sanghyun Woo", "url": "http://cvpr.thecvf.com/api/miniconf/users/107365?format=json", "institution": "New York University"}], "abstract": "Despite the remarkable progress of Multimodal Large Language Models (MLLMs) in 2D vision-language tasks, their application to complex 3D scene manipulation remains underexplored. In this paper, we bridge this critical gap by tackling three key challenges in 3D object arrangement task using MLLMs. First, to address the weak visual grounding of MLLMs, which struggle to link programmatic edits with precise 3D outcomes, we introduce an MCP-based API. This shifts the interaction from brittle raw code manipulation to more robust, function-level updates. Second, we augment the MLLM's 3D scene understanding with a suite of specialized visual tools to analyze scene state, gather spatial information, and validate action outcomes. This perceptual feedback loop is critical for closing the gap between language-based updates and precise 3D-aware manipulation. Third, to manage the iterative, error-prone updates, we propose a collaborative multi-agent framework with designated roles for planning, execution, and verification. This decomposition allows the system to robustly handle multi-step instructions and recover from intermediate errors. We demonstrate the effectiveness of our approach on a diverse set of 25 complex object arrangement tasks, where it significantly outperforms existing baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38414", "url": null, "sourceid": 42107, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38415, "uid": "3bf47241fd8ea682509ba6b7cc875f72", "name": "FlashLips: 100-FPS Mask-Free Latent Lip-Sync using Reconstruction Instead of Diffusion or GANs", "authors": [{"id": 182065, "fullname": "Andreas Zinonos", "url": "http://cvpr.thecvf.com/api/miniconf/users/182065?format=json", "institution": "Imperial College London"}, {"id": 152477, "fullname": "Micha\u0142 Stypu\u0142kowski", "url": "http://cvpr.thecvf.com/api/miniconf/users/152477?format=json", "institution": "Cantina Labs"}, {"id": 130755, "fullname": "Antoni Bigata Casademunt", "url": "http://cvpr.thecvf.com/api/miniconf/users/130755?format=json", "institution": "Imperial College London"}, {"id": 85676, "fullname": "Stavros Petridis", "url": "http://cvpr.thecvf.com/api/miniconf/users/85676?format=json", "institution": "Facebook"}, {"id": 85689, "fullname": "Maja Pantic", "url": "http://cvpr.thecvf.com/api/miniconf/users/85689?format=json", "institution": "Facebook"}, {"id": 107181, "fullname": "Nikita Drobyshev", "url": "http://cvpr.thecvf.com/api/miniconf/users/107181?format=json", "institution": "Meta"}], "abstract": "We present FlashLips, a two-stage, mask-free lip-sync system that decouples lips control from rendering and achieves real-time performance running at over 100 FPS on a single GPU, while matching the visual quality of larger state-of-the-art models. Stage 1 is a compact, one-step latent-space editor that reconstructs an image using a reference identity, a masked target frame, and a low-dimensional lips-pose vector, trained purely with reconstruction losses - no GANs or diffusion. To remove explicit masks at inference, we use self-supervision: we generate mouth-altered variants of the target image, that serve as pseudo ground truth for fine-tuning, teaching the network to localize edits to the lips while preserving the rest. Stage 2 is an audio-to-pose transformer trained with a flow-matching objective to predict lips-poses vectors from speech. Together, these stages form a simple and stable pipeline that combines deterministic reconstruction with robust audio control, delivering high perceptual quality and faster-than-real-time speed.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38415", "url": null, "sourceid": 37317, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38416, "uid": "cbf2da3dc6c42ee8b4e539b31b0e24e1", "name": "OctoT2I: A Self-Evolving Agentic Text-to-Image Router", "authors": [{"id": 101375, "fullname": "Jiang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/101375?format=json", "institution": "Peking University"}, {"id": 151874, "fullname": "Bin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/151874?format=json", "institution": "Peking University"}, {"id": 152984, "fullname": "Gehui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152984?format=json", "institution": "Peking University"}, {"id": 130327, "fullname": "Yule Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/130327?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 86749, "fullname": "Ronggang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86749?format=json", "institution": "Peking University Shenzhen Graduate School"}, {"id": 76749, "fullname": "Jian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76749?format=json", "institution": "Peking University"}], "abstract": "The explosive growth of Text-to-Image (T2I) models, from large-scale versions to lightweight, real-time ones, now faces diminishing marginal returns from single-model scaling. Agentic T2I methods emerged to alleviate this bottleneck by using multiple models. However, existing agentic T2I methods suffer from three key challenges: reliance on expensive handcrafted priors or human annotations, rigid single-path decision mechanisms, and a neglect of inference efficiency. To address these challenges, we introduce OctoT2I, a novel agentic framework that reformulates the T2I task as a joint optimization of generation quality and inference efficiency. OctoT2I implements a stateful, multi-round routing strategy that adaptively selects the most suitable tool based on its knowledge and memory. This strategy is enabled by a knowledge base built from scratch by our novel Self-Evolving Mechanism. This mechanism, which requires no human supervision, first autonomously defines foundational Conceptual Dimensions (e.g., style, color, count) and then intelligently explores their combinations via an iterative \"Propose--Solve--Evaluate--Learn\" (PSEL) loop. The PSEL loop efficiently discovers each tool's capability frontier, driving continuous improvement without external guidance. Extensive experiments demonstrate that OctoT2I achieves competitive performance (0.96) on GenEval while delivering a 90.3% inference speedup and a 56.6% energy-efficiency gain over the leading baseline (Flow-GRPO), striking an exceptional balance between performance and efficiency. Code and models will be made available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38416", "url": null, "sourceid": 38315, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38419, "uid": "30bb20e13b018817fd47172ff321c685", "name": "Learning to Select Visual Tools from Experience", "authors": [{"id": 152776, "fullname": "Zeyi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152776?format=json", "institution": "University of Wisconsin - Madison"}, {"id": 182345, "fullname": "Yuyang Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/182345?format=json", "institution": "Drexel University"}, {"id": 178442, "fullname": "Anirudh Sundara Rajan", "url": "http://cvpr.thecvf.com/api/miniconf/users/178442?format=json", "institution": "University of Wisconsin-Madison"}, {"id": 189833, "fullname": "Zefan Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/189833?format=json", "institution": "University of Wisconsin - Madison"}, {"id": 189834, "fullname": "Wen Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189834?format=json", "institution": "Microsoft"}, {"id": 189835, "fullname": "Haohan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189835?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 131545, "fullname": "Junjie Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131545?format=json", "institution": "University of Wisconsin, Madison"}, {"id": 89480, "fullname": "Yong Jae Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/89480?format=json", "institution": "Professor, UW-Madison and Research Scientist, Adobe"}], "abstract": "We introduce VisualToolAgent (VisTA), a new reinforcement learning framework that empowers visual agents to dynamically explore, select, and compose tools from a diverse library based on empirical performance. Existing methods for tool-augmented visual reasoning either rely on training-free prompting or large-scale supervised fine-tuning; both lack active tool exploration and typically assume limited tool diversity, and fine-tuning methods additionally demand extensive human supervision. In contrast, VisTA leverages end-to-end reinforcement learning to iteratively refine sophisticated, query-specific tool selection strategies, guided solely by task outcomes. Leveraging reinforcement learning with verifiable rewards (RLVR), our framework enables an agent to autonomously discover effective tool-selection pathways without requiring explicit reasoning supervision. Experiments on the ChartQA, Geometry3K, MathVerse, and BlindTest benchmarks demonstrate that VisTA achieves significant performance gains over training-free and fine-tuning baselines, especially on out-of-distribution examples. These results highlight VisTA's ability to enhance generalization, adaptively utilize diverse tools, and pave the way for flexible, experience-driven visual reasoning systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38419", "url": null, "sourceid": 43220, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38421, "uid": "5305891e9f4181f619781432e815dd5e", "name": "FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning", "authors": [{"id": 126442, "fullname": "Weijie Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126442?format=json", "institution": "University of California, Merced"}, {"id": 75715, "fullname": "Ming-Hsuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75715?format=json", "institution": "University of California at Merced"}, {"id": 137329, "fullname": "ZHIXIN SHU", "url": "http://cvpr.thecvf.com/api/miniconf/users/137329?format=json", "institution": "Adobe Systems"}], "abstract": "We introduce FaceCam, a system that generates video under customizable camera trajectories for monocular human portrait video input. Recent camera control approaches based on large video-generation models have shown promising progress but often exhibit geometric distortions and visual artifacts on portrait videos due to scale-ambiguous camera representations or 3D reconstruction errors. To overcome these limitations, we propose a face-tailored scale-aware representation for camera transformations that provides deterministic conditioning without relying on 3D priors. We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies: synthetic camera motion and multi-shot stitching, to exploit stationary training cameras while generalizing to dynamic, continuous camera trajectories at inference time. Experiments on Ava-256 dataset and diverse in-the-wild videos demonstrate that FaceCam achieves superior performance in camera controllability, visual quality, identity and motion preservation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38421", "url": null, "sourceid": 46258, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38427, "uid": "86ebc43444cdf451b149e5de84ff3ef5", "name": "ClusterMark: Towards Robust Watermarking for Autoregressive Image Generators with Visual Token Clustering", "authors": [{"id": 104585, "fullname": "Denis Lukovnikov", "url": "http://cvpr.thecvf.com/api/miniconf/users/104585?format=json", "institution": "Ruhr University Bochum"}, {"id": 159132, "fullname": "Andreas M\u00fcller", "url": "http://cvpr.thecvf.com/api/miniconf/users/159132?format=json", "institution": "Ruhr-Universit\u00e4t Bochum"}, {"id": 158071, "fullname": "Erwin Quiring", "url": "http://cvpr.thecvf.com/api/miniconf/users/158071?format=json", "institution": "fbeta"}, {"id": 127800, "fullname": "Asja Fischer", "url": "http://cvpr.thecvf.com/api/miniconf/users/127800?format=json", "institution": "Ruhr-Universit\u00e4t Bochum"}], "abstract": "In-generation watermarking for latent diffusion models has recently shown high robustness for marking generated images for easier detection and attribution. However, its application to autoregressive (AR) image models is underexplored. Autoregressive models generate images by autoregressively predicting a sequence of visual tokens that are then decoded into pixels using a VQ-VAE decoder. Inspired by KGW watermarking for large language models, we examine token-level watermarking schemes that bias the next-token prediction based on prior tokens. We find that a direct transfer of these schemes works in principle, but the detectability of the watermarks decreases considerably under common image perturbations. As a remedy, we propose a watermarking approach based on visual token clustering, which assigns similar tokens to the same set (red or green). We investigate token clustering in a training-free setting, as well as in combination with a robust fine-tuned token or cluster predictor. Overall, our experiments show that cluster-based watermarks greatly improve robustness against perturbations and regeneration attacks while preserving image quality, outperforming a set of baselines and concurrent works. Moreover, our methods offer fast verification runtime, comparable to lightweight post-hoc watermarking.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38427", "url": null, "sourceid": 45825, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38430, "uid": "bb4968ed0b140b4248317ab001c966c5", "name": "PAI-Bench: A Comprehensive Benchmark For Physical AI", "authors": [{"id": 144906, "fullname": "Fengzhe Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/144906?format=json", "institution": "Georgia Institute of Technology"}, {"id": 189854, "fullname": "Jiannan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189854?format=json", "institution": "Georgia Institute of Technology"}, {"id": 184159, "fullname": "Jialuo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184159?format=json", "institution": "Georgia Institute of Technology"}, {"id": 75455, "fullname": "Deva Ramanan", "url": "http://cvpr.thecvf.com/api/miniconf/users/75455?format=json", "institution": "Carnegie Mellon University"}, {"id": 75967, "fullname": "Humphrey Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/75967?format=json", "institution": "Georgia Tech | UIUC / Oregon | PAIR"}], "abstract": "Physical AI aims to develop models that can perceive and predict real-world dynamics; yet, the extent to which current multi-modal large language models and video generative models support these abilities is insufficiently understood. We introduce Physical AI Bench (PAI-Bench), a unified and comprehensive benchmark that evaluates perception and prediction capabilities across video generation, conditional video generation, and video understanding, comprising 2,808 real-world cases with task-aligned metrics designed to capture physical plausibility and domain-specific reasoning. Our study provides a systematic assessment of recent models and shows that video generative models, despite strong visual fidelity, often struggle to maintain physically coherent dynamics, while multi-modal large language models exhibit limited performance in forecasting and causal interpretation. These observations suggest that current systems are still at an early stage in handling the perceptual and predictive demands of Physical AI. In summary, PAI-Bench establishes a realistic foundation for evaluating Physical AI and highlights key gaps that future systems must address.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38430", "url": null, "sourceid": 45333, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40335?format=json"], "related_events_ids": [40335]}, {"id": 38429, "uid": "8b6acd3795ce607f8acd53d6954ebcaa", "name": "LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes", "authors": [{"id": 95045, "fullname": "Ruofan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/95045?format=json", "institution": "University of Toronto"}, {"id": 189852, "fullname": "Norman M\u00fcller", "url": "http://cvpr.thecvf.com/api/miniconf/users/189852?format=json", "institution": "Meta"}, {"id": 189853, "fullname": "Ethan Weber", "url": "http://cvpr.thecvf.com/api/miniconf/users/189853?format=json", "institution": "Facebook; Massachusetts Institute of Technology"}, {"id": 133976, "fullname": "Duncan Zauss", "url": "http://cvpr.thecvf.com/api/miniconf/users/133976?format=json", "institution": "Meta"}, {"id": 158164, "fullname": "Nandita Vijaykumar", "url": "http://cvpr.thecvf.com/api/miniconf/users/158164?format=json", "institution": "University of Toronto"}, {"id": 87351, "fullname": "Peter Kontschieder", "url": "http://cvpr.thecvf.com/api/miniconf/users/87351?format=json", "institution": "Meta"}, {"id": 75913, "fullname": "Christian Richardt", "url": "http://cvpr.thecvf.com/api/miniconf/users/75913?format=json", "institution": "Codec Avatars Lab, Meta"}], "abstract": "We present a novel approach for interactive light editing in indoor scenes from a single multi-view scene capture. Our method leverages a generative image-based light decomposition model that factorizes complex indoor scene illumination into its constituent light sources. This factorization enables independent manipulation of individual light sources, specifically allowing control over their state (on/off), chromaticity, and intensity. We further introduce multi-view lighting harmonization to ensure consistent propagation of the lighting decomposition across all scene views. This is integrated into a relightable 3D Gaussian splatting representation, providing real-time interactive control over the individual light sources. Our results demonstrate highly photorealistic lighting decomposition and relighting outcomes across diverse indoor scenes. We evaluate our method on both synthetic and real-world datasets and provide a quantitative and qualitative comparison to state-of-the-art techniques.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38429", "url": null, "sourceid": 32254, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40335, "uid": "bb4968ed0b140b4248317ab001c966c5", "name": "PAI-Bench: A Comprehensive Benchmark For Physical AI", "authors": [{"id": 144906, "fullname": "Fengzhe Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/144906?format=json", "institution": "Georgia Institute of Technology"}, {"id": 189854, "fullname": "Jiannan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189854?format=json", "institution": "Georgia Institute of Technology"}, {"id": 184159, "fullname": "Jialuo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184159?format=json", "institution": "Georgia Institute of Technology"}, {"id": 75455, "fullname": "Deva Ramanan", "url": "http://cvpr.thecvf.com/api/miniconf/users/75455?format=json", "institution": "Carnegie Mellon University"}, {"id": 75967, "fullname": "Humphrey Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/75967?format=json", "institution": "Georgia Tech | UIUC / Oregon | PAIR"}], "abstract": "Physical AI aims to develop models that can perceive and predict real-world dynamics; yet, the extent to which current multi-modal large language models and video generative models support these abilities is insufficiently understood. We introduce Physical AI Bench (PAI-Bench), a unified and comprehensive benchmark that evaluates perception and prediction capabilities across video generation, conditional video generation, and video understanding, comprising 2,808 real-world cases with task-aligned metrics designed to capture physical plausibility and domain-specific reasoning. Our study provides a systematic assessment of recent models and shows that video generative models, despite strong visual fidelity, often struggle to maintain physically coherent dynamics, while multi-modal large language models exhibit limited performance in forecasting and causal interpretation. These observations suggest that current systems are still at an early stage in handling the perceptual and predictive demands of Physical AI. In summary, PAI-Bench establishes a realistic foundation for evaluating Physical AI and highlights key gaps that future systems must address.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40335", "url": null, "sourceid": -45333, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38430?format=json"], "related_events_ids": [38430]}, {"id": 38461, "uid": "31a34dfbb1e21d2119711042f6731578", "name": "GM-R$^2$: Generative Matching Learning for Unsupervised Geometric Representation and Registration", "authors": [{"id": 86987, "fullname": "Haobo Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86987?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 152259, "fullname": "Liang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152259?format=json", "institution": "Alibaba Group"}, {"id": 152258, "fullname": "Jianmin Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/152258?format=json", "institution": "Nanyang Technological University"}], "abstract": "This paper proposes GM-R^2, a novel Generative Matching Learning framework for unsupervised geometric descriptor learning and correspondence matching. By reformulating descriptor learning as geometry-conditioned cross-view image generation, GM-R^2 leverages the proxy supervisory signal from structurally aligned view synthesis to implicitly enforce feature consistency across correspondence, enabling robust 3D matching. To instantiate GM-R^2, we introduce Denoising-Agnostic Coupled ControlNet conditioned on depth maps as the required geometry-conditioned cross-view generator. It effectively extends the single-view generation of naive ControlNet to the cross-view via coupled depth-map input design and further remove the latent noise dependency to support geometry-only inference (expected by 3D matching). Moreover, we present Zoomable Equirectangular Projection for intrinsics-free point cloud-to-depth mapping that adaptively zooms into the angular region occupies by the narrow-FOV input for dense range-map acquisition. Extensive experiments on 3DMatch and ScanNet datasets verify the superior precision of our GM-R^2, even surpassing supervised methods.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38461", "url": null, "sourceid": 35643, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38434, "uid": "a306a13c6c1ee387390fdc96c7bdca66", "name": "Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training", "authors": [{"id": 107287, "fullname": "Peng Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/107287?format=json", "institution": "Zhejiang University &amp; Westlake University"}, {"id": 179989, "fullname": "Jun XIE", "url": "http://cvpr.thecvf.com/api/miniconf/users/179989?format=json", "institution": "Shanghai Innovation Institute"}, {"id": 158625, "fullname": "Tao Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/158625?format=json", "institution": "Westlake University"}], "abstract": "Unified Multimodal Models (UMMs) are often constrained by the pre-training of their $\\textbf{visual generation components}$, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for $\\textbf{UMM visual generation}$ and identify these two issues as the major bottlenecks.    To address them, we propose $\\textbf{Image-Only Training for UMMs (IOMM)}$, a data-efficient two-stage training framework.    The first stage pre-trains the visual generative component $\\textbf{exclusively}$ using abundant unlabeled image-only data, thereby removing the dependency on paired data $\\textbf{for this costly phase}$. The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality.    Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art performance.    For example, our IOMM-B (3.6B) model was trained from scratch using only $\\sim\\textbf{1050}$ H800 GPU hours (with the vast majority, $\\textbf{1000}$ hours, dedicated to the efficient $\\textbf{image-only pre-training stage}$). It achieves $\\textbf{0.89}$ on GenEval and $\\textbf{0.55}$ on WISE$\\textemdash$ surpassing strong baselines such as BAGEL-7B (0.82 \\& 0.55) and BLIP3-o-4B (0.84 \\& 0.50).    Code will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38434", "url": null, "sourceid": 31802, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38436, "uid": "a7a89b342fac79af75d36d7fca34fa28", "name": "Emergent Outlier View Rejection in Visual Geometry Grounded Transformers", "authors": [{"id": 155742, "fullname": "Jisang Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/155742?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 187942, "fullname": "Sunghwan Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187942?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 155741, "fullname": "Jaewoo Jung", "url": "http://cvpr.thecvf.com/api/miniconf/users/155741?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 153107, "fullname": "Wooseok Jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153107?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 155738, "fullname": "Honggyu An", "url": "http://cvpr.thecvf.com/api/miniconf/users/155738?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 150923, "fullname": "Qianqian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150923?format=json", "institution": "University of California, Berkeley"}, {"id": 153109, "fullname": "Seungryong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/153109?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 86335, "fullname": "Chen Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86335?format=json", "institution": "New York University"}], "abstract": "Reliable 3D reconstruction from in-the-wild image collections is often hindered by noisy images\u2014irrelevant inputs with little or no view overlap with others. While traditional Structure-from-Motion pipelines handle such cases through geometric verification and outlier rejection, feed-forward 3D reconstruction models lack these explicit mechanisms, leading to degraded performance under in-the-wild conditions. In this paper, we discover that the existing feed-forward reconstruction model, e.g., VGGT, despite lacking explicit outlier-rejection mechanisms or noise-aware training, can inherently distinguish distractor images. Through an in-depth analysis under varying proportions of synthetic distractors, we identify a specific layer that naturally exhibits outlier-suppressing behavior. Further probing reveals that this layer encodes discriminative internal representations that enable an effective noise-filtering capability, which we simply leverage to perform outlier-view rejection in feed-forward 3D reconstruction without any additional fine-tuning or supervision. Extensive experiments on both controlled and in-the-wild datasets demonstrate that this implicit filtering mechanism is consistent and generalizes well across diverse scenarios. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38436", "url": null, "sourceid": 44538, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38442, "uid": "92cb9a26e36b18031e5dad8db4edfddb", "name": "Zero-Shot Depth Completion with Vision-Language Model", "authors": [{"id": 155218, "fullname": "Zhiqiang Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/155218?format=json", "institution": "National University of Singapore"}, {"id": 159631, "fullname": "Yuan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/159631?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 86340, "fullname": "Gim Hee Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/86340?format=json", "institution": "National University of Singapore"}], "abstract": "Vision language models (VLMs) have achieved remarkable success in semantic understanding tasks under language guidance, yet their potential for geometric perception remains largely underexplored. This paper introduces the first VLM-based depth completion framework. With almost no architectural modifications, we propose a sparse depth injection mechanism that extends the capability of VLM toward 3D perception through three key aspects: visual tokenization, textual prompt, and textual supervision. At the visual input side, sparse depth is tokenized to provide absolute scale and accurate geometric cues, alleviating the scale and camera ambiguities of RGB-only inputs. At the textual input side, a binary mask derived from sparse depth serves as a prompt, instructing the model where to complete and where to preserve. At the supervision side, the model is fine-tuned using text labels generated from sparse depth, requiring no ground-truth depth. Benefiting from the strong semantic priors and cross-modal expressiveness of VLM, our framework achieves superior zero-shot performance across diverse sensors, sparsity levels, and scenes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38442", "url": null, "sourceid": 44955, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38443, "uid": "30ea3d69405be8291554115d415c593d", "name": "Hidden Dangers of Compositional Generation: Diagnosing Semantic Safety Failures in Text-to-Image Models", "authors": [{"id": 182838, "fullname": "Haoming Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182838?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 156163, "fullname": "Ke Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/156163?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 182848, "fullname": "ligonf zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182848?format=json", "institution": "UCAS"}, {"id": 130874, "fullname": "Xiaojun Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/130874?format=json", "institution": ", Chinese Academy of Sciences"}, {"id": 156165, "fullname": "Yingfei Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/156165?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 156164, "fullname": "Qianqian Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156164?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 85019, "fullname": "Qingming Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85019?format=json", "institution": "University of Chinese Academy of Sciences"}], "abstract": "\\begin{abstract}Text-to-Image (\\textbf{T2I}) models have achieved significant progress in generating high-quality images, with compositional visual generation emerging as an important capability that enables them to synthesize coherent, natural scenes from multiple discrete concepts. However, this powerful compositionality, while enhancing creativity, also introduces new safety risks: combinations of different concepts can produce high-risk images without explicitly expressing harmful content. Motivated by this, we propose \\textbf{CoRA} (Composable Reassembly Attack): an attack method that preserves the original semantics while bypassing safety filters. Unlike traditional compositional generation approaches that rely on modifying the sampling process, \\textbf{CoRA} operates solely in the text space under a black-box setting, iteratively rewriting and guiding prompts through interactive steps.   Specifically, \\textbf{CoRA} decomposes a potentially harmful intent into a set of fine-grained, superficially benign but semantically complete visual elements, and then uses iterative selection and reassembly to guide the target \\textbf{T2I} model to recombine these elements without triggering safety checks, thereby recovering the original malicious semantics. Experimental results show that \\textbf{CoRA} significantly improves attack success rates (\\textbf{ASR}) across several mainstream open-source and commercial \\textbf{T2I} models, producing higher-risk outputs while maintaining semantic consistency.\\textcolor{red}{\\textbf{Warning:} This paper contains model-generated content that may be considered offensive or disturbing.}", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38443", "url": null, "sourceid": 38507, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38444, "uid": "0ff5e6749c211dec8c7509b33ee8a698", "name": "Parallel Rigidity Matters for Bundle Adjustment", "authors": [{"id": 182269, "fullname": "Lalit Manam", "url": "http://cvpr.thecvf.com/api/miniconf/users/182269?format=json", "institution": "Mitsubishi Electric Research Labs (MERL)"}, {"id": 88519, "fullname": "Venu Madhav Govindu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88519?format=json", "institution": "Indian Institute of Science"}], "abstract": "Bundle adjustment is a long-standing problem in computer vision that solves for camera parameters and 3D point coordinates from 2D image observations. While there has been much work on various aspects, like adaptation to different camera models and sensors, and considerations for solving the optimization problem, in this paper, we deal with a fundamental and distinct aspect of the uniqueness of its solution. In particular, we examine the unique solvability of the 3D reconstruction problem using parallel rigidity theory. We design an algorithm to ensure that the topology of the bipartite graph formed by the camera-3D point relations in bundle adjustment does not result in independent scaling of the edges in its subgraphs. To tackle the generally large-sized bipartite graph, we leverage camera-camera relationships in 3D reconstruction problems for efficiency. We demonstrate the benefits of our analysis on a global structure-from-motion pipeline. Applying our proposed algorithm results in significantly cleaner reconstructions by removing misplaced cameras and 3D points.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38444", "url": null, "sourceid": 46049, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38448, "uid": "3ec501b562c6038d2ffabafcbac37cd3", "name": "From Few-way to Many-way: Rethinking Few-shot Fine-grained Image Classification", "authors": [{"id": 126608, "fullname": "Li-Jun Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/126608?format=json", "institution": "Shandong University"}, {"id": 126605, "fullname": "Zhen-Duo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126605?format=json", "institution": "Shandong University"}, {"id": 126598, "fullname": "Xin Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/126598?format=json", "institution": "Shandong University"}, {"id": 126597, "fullname": "Xin-Shun Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126597?format=json", "institution": "Shandong University"}], "abstract": "Few-shot fine-grained image classification (FSFG) aims to recognize novel fine-grained categories from only a few labeled samples. Existing FSFG methods primarily focus on fine-grained feature extraction and modeling query\u2013support interactions within training episodes containing a small number of classes. Relying on the episodic training strategy, these methods typically assume that the capabilities learned on training samples can directly transfer to evaluation episodes with a few novel classes (few-way). However, in more practical and challenging scenarios involving many novel classes (many-way), existing approaches lack a reliable and global characterization of the feature space, making it difficult for episodic adaptation alone to generalize effectively. In this paper, we pioneer a theoretical analysis of novel class behavior in FSFG and derive a class discriminative index bound. Guided by this analysis, we propose a novel SCEG method that incorporates Self and Collaborative feature extraction as well as Episodic and Global feature space optimization. Extensive experiments demonstrate that our method consistently and significantly outperforms existing methods under both conventional few-way and the new many-way settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38448", "url": null, "sourceid": 32312, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38446, "uid": "87ec54ef27e93908a8397eb3a6bbb45b", "name": "CubeComposer: Spatio-Temporal Autoregressive 4K 360\u00b0 Video Generation from Perspective Video", "authors": [{"id": 99415, "fullname": "Lingen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/99415?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 128289, "fullname": "Guangzhi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128289?format=json", "institution": "National University of Singapore"}, {"id": 85467, "fullname": "Xiaoyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/85467?format=json", "institution": "Tencent ARC Lab"}, {"id": 87380, "fullname": "Zhaoyang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87380?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 75510, "fullname": "Qi Dou", "url": "http://cvpr.thecvf.com/api/miniconf/users/75510?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 75943, "fullname": "Jinwei Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75943?format=json", "institution": "NVIDIA"}, {"id": 87471, "fullname": "Tianfan Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/87471?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 84809, "fullname": "Ying Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/84809?format=json", "institution": "Tencent"}], "abstract": "Generating high-quality 360\u00b0 panoramic videos from perspective input is one of the crucial applications for virtual reality (VR), whereby high-resolution videos are especially important for immersive experience.Existing methods are constrained by computational limitations of vanilla diffusion models, only supporting $\\leq$ 1K resolution native generation and relying on suboptimal post super-resolution to increase resolution.We introduce CubeComposer, a novel spatio-temporal autoregressive diffusion model that natively generates 4K-resolution 360\u00b0 videos. By decomposing videos into cubemap representations with six faces, CubeComposer autoregressively synthesizes content in a well-planned spatio-temporal order, reducing memory demands while enabling high-resolution output. Specifically, to address challenges in multi-dimensional autoregression, we propose: (1) a spatio-temporal autoregressive strategy that orchestrates 360\u00b0 video generation across cube faces and time windows for coherent synthesis; (2) a cube face context management mechanism, equipped with a sparse context attention design to improve efficiency; and (3) continuity-aware techniques, including cube-aware positional encoding, padding, and blending to eliminate boundary seams.Extensive experiments on benchmark datasets demonstrate that CubeComposer outperforms state-of-the-art methods in native resolution and visual quality, supporting practical VR application scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38446", "url": null, "sourceid": 32776, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38450, "uid": "c6d1db3b8c9c598d7c810afee405b57c", "name": "Language-Grounded Decoupled Action Representation for Robotic Manipulation", "authors": [{"id": 174853, "fullname": "WuDing Weng", "url": "http://cvpr.thecvf.com/api/miniconf/users/174853?format=json", "institution": "Tongji University"}, {"id": 189882, "fullname": "Tongshu Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189882?format=json", "institution": "Tongji University"}, {"id": 189883, "fullname": "chen liucheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189883?format=json", "institution": "Tongji University; Tongji University"}, {"id": 189884, "fullname": "Siyu xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/189884?format=json", "institution": "Tongji University"}, {"id": 87727, "fullname": "Zheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87727?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 130166, "fullname": "Xing Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130166?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 89312, "fullname": "Jingkuan Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/89312?format=json", "institution": "University of Electronic Science and Technology of China,"}, {"id": 84847, "fullname": "Heng Tao Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/84847?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "The heterogeneity between high-level vision-language understanding and low-level action control remains a fundamental challenge in robotic manipulation. Although recent methods have advanced task-specific action alignment, they often struggle to generate robust and accurate actions for novel or semantically related tasks. To address this, we propose the \\textbf{La}nguage-Grounded \\textbf{D}ecoupled \\textbf{A}ction Representation (\\textbf{LaDA}) framework, which leverages natural language as a semantic bridge to connect perception and control. LaDA introduces a fine-grained intermediate layer of three interpretable action primitives\u2014translation, rotation, and gripper control\u2014providing explicit semantic structure for low-level actions. It further employs a semantic-guided soft-label contrastive learning objective to align similar action primitives across tasks, enhancing generalization and motion consistency. An adaptive weighting strategy, inspired by curriculum learning, dynamically balances contrastive and imitation objectives for stable and effective training. Extensive experiments on simulated benchmarks (LIBERO and MimicGen) and real-world demonstrations validate that LaDA achieves strong performance and generalizes effectively to unseen or related tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38450", "url": null, "sourceid": 44982, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38451, "uid": "33f9135eb5c0ee9c4d007167acf47439", "name": "MimicTalker: A Multimodal Interactive and Memory-Enhanced Framework for Real-Time Dyadic 3D Head Generation", "authors": [{"id": 158681, "fullname": "Yinuo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158681?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 85471, "fullname": "Yanbo Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/85471?format=json", "institution": "Tencent AI Lab"}, {"id": 85470, "fullname": "Xuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85470?format=json", "institution": "Tencent AI Lab"}, {"id": 157046, "fullname": "Boyao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/157046?format=json", "institution": "Tsinghua University"}, {"id": 87724, "fullname": "Yu Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/87724?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 88128, "fullname": "Yujun Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88128?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 87686, "fullname": "Fei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87686?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "Dyadic interactive head generation aims to synthesize realistic head motions that respond both verbally and non-verbally to an interlocutor in real-time conversation. The existing works often focus on offline scenarios, and struggle with a shallow understanding of the multimodal conversational context while also lacking long-term coherence. To address these limitations, we propose MimicTalker, a novel method for producing real-time, contextually-aware, and long-term consistent interactive head motions. To this end, we propose a Multimodal Interactive Context Extraction (MICE) module to capture both instantaneous and long-term multimodal interactive information from the interlocutor. To enhance in-depth conversational understanding, we propose a Semantic-enhanced Dynamic Interaction (SDI) module to integrate the intentions and topics of the conversation, which are automatically extracted through an LLM-based analyzer. Further, we propose a semantic-guided Motion Style Memory (MSM) mechanism, enabling the long-term motion consistency throughout the conversation. We conduct experiments on both short conversational segments (25 seconds) and extended dialogues (6 minutes), and the comprehensive experiments demonstrate that our method significantly outperforms existing approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38451", "url": null, "sourceid": 31456, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38457, "uid": "45bb6f7fd156d0f5ed675304cf8a978a", "name": "BIT: Matching-based Bi-directional Interaction Transformation Network for Visible-Infrared Person Re-Identification", "authors": [{"id": 180936, "fullname": "Haoxuan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180936?format=json", "institution": "Beihang University"}, {"id": 131299, "fullname": "Guanglin Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131299?format=json", "institution": "Beihang University"}], "abstract": "Visible-Infrared Person Re-Identification (VI-ReID) is a challenging retrieval task due to the substantial modality gap between visible and infrared images. While existing methods attempt to bridge this gap by learning modality-invariant features within a shared embedding space, they often overlook the complex and implicit correlations between modalities. This limitation becomes more severe under distribution shifts, where infrared samples are often far fewer than visible ones. To address these challenges, we propose a novel network termed Bi-directional Interaction Transformation (BIT). Instead of relying on rigid feature alignment, BIT adopts a matching-based strategy that explicitly models the interaction between visible and infrared image pairs. Specifically, BIT employs an encoder-decoder architecture where the encoder extracts preliminary feature representations, and the decoder performs bi-directional feature integration and query aware scoring to enhance cross-modality correspondence. To our best knowledge, BIT is the first to introduce such pairwise matching-driven interaction in VI-ReID. Extensive experiments on several benchmarks demonstrate that our BIT achieves state-of-the-art performance, highlighting its effectiveness in the VI-ReID task.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38457", "url": null, "sourceid": 35714, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38460, "uid": "ff8ee6aac61c11f16443646cdf467146", "name": "Batch Loss Score for Dynamic Data Pruning", "authors": [{"id": 155749, "fullname": "Qing Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/155749?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 189901, "fullname": "Bingxuan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189901?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 189762, "fullname": "Tao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189762?format=json", "institution": "Northwestern Polytechnical University, Northwest Polytechnical University Xi&#x27;an"}, {"id": 151064, "fullname": "Hongyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151064?format=json", "institution": "University of Hong Kong"}, {"id": 155750, "fullname": "Junyu Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/155750?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 155751, "fullname": "Qi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155751?format=json", "institution": "Northwestern Polytechnical University"}], "abstract": "Dynamic data pruning accelerates deep learning by selectively omitting less informative samples during training. While per-sample loss is a common importance metric, obtaining it can be challenging or infeasible for complex models or loss functions, often requiring significant implementation effort. This work proposes the Batch Loss Score (BLS), a computationally efficient alternative using an Exponential Moving Average (EMA) of readily available batch losses to assign scores to individual samples. We frame the batch loss, from the perspective of a single sample, as a noisy measurement of its scaled individual loss, with noise originating from stochastic batch composition. It is formally shown that the EMA mechanism functions as a first-order low-pass filter, attenuating high-frequency batch composition noise. This yields a score approximating the smoothed and persistent contribution of the individual sample to the loss, providing a theoretical grounding for BLS as a proxy for sample importance. BLS demonstrates remarkable code integration simplicity (\\textbf{three-line injection}) and readily adapts existing per-sample loss-based methods (\\textbf{one-line proxy}). Its effectiveness is demonstrated by enhancing two such methods to losslessly prune \\textbf{20%-50%} of samples across \\textit{14 datasets}, \\textit{11 tasks} and \\textit{18 models}, highlighting its utility and broad applicability, especially for complex scenarios where per-sample loss is difficult to access.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38460", "url": null, "sourceid": 45383, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38464, "uid": "a13a06978e18f0c27ffdcd3fe419a1c4", "name": "Weaver: Decoupled Training for Interleaved Multi-modal Generation", "authors": [{"id": 179630, "fullname": "Jinbo Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/179630?format=json", "institution": "Department of Computer Science and Engineering, The Chinese University of Hong Kong"}, {"id": 127102, "fullname": "Zeyinzi Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127102?format=json", "institution": "Alibaba Group"}, {"id": 189907, "fullname": "Yuxiang Tuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189907?format=json", "institution": "Alibaba"}, {"id": 126618, "fullname": "Chaojie Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/126618?format=json", "institution": "Alibaba Group"}, {"id": 189908, "fullname": "Xiaotang Gai", "url": "http://cvpr.thecvf.com/api/miniconf/users/189908?format=json", "institution": "Zhejiang University"}, {"id": 88507, "fullname": "Xi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88507?format=json", "institution": "the University of Hong Kong, University of Hong Kong"}, {"id": 127070, "fullname": "Jingfeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127070?format=json", "institution": "Alibaba Group"}, {"id": 90198, "fullname": "Yulin Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/90198?format=json", "institution": "Alibaba Group, China"}, {"id": 127103, "fullname": "Zhen Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/127103?format=json", "institution": "Alibaba Group"}, {"id": 88802, "fullname": "Jie Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88802?format=json", "institution": "University of Science and Technology of China"}, {"id": 88264, "fullname": "Keyu Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88264?format=json", "institution": "University of Science and Technology of China"}, {"id": 85359, "fullname": "Chen-Wei Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/85359?format=json", "institution": "Alibaba Group"}, {"id": 189909, "fullname": "Chongyang Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/189909?format=json", "institution": "Alibaba Group"}, {"id": 90342, "fullname": "Kai Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90342?format=json", "institution": "University of Science and Technology of China"}, {"id": 189910, "fullname": "Shen Tong", "url": "http://cvpr.thecvf.com/api/miniconf/users/189910?format=json", "institution": "Alibaba Group"}, {"id": 88101, "fullname": "Lianghua Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88101?format=json", "institution": "Alibaba Group"}, {"id": 88124, "fullname": "Yu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88124?format=json", "institution": "Alibaba Group"}, {"id": 73271, "fullname": "Yujiu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73271?format=json", "institution": "Tsinghua University"}], "abstract": "Recent unified multi-modal models have made unprecedented progress in understanding and generation, yet they largely support multi-modal inputs with single-modality outputs, struggling to produce complex interleaved text\u2013image content due to data scarcity and the difficulty of modeling long-range cross-modal context. We introduce Weaver, which frames interleaved generation as an autoregressive planning\u2013visualization process within a unified multi-modal architecture. A planner, i.e., understanding expert, digests rich text\u2013image context to produce visualization triggers and their dense textual guidance except for plain text, while a visualizer, i.e., generation expert, produces images conditioned on the planner\u2019s textual guidance and visual references. This design enables decoupled learning: we train the two experts on large collections of textual planning and reference-guided image data in parallel, yielding powerful interleaved multi-modal generation capability at inference. Moreover, training the planner with datasets from diverse understanding and generation tasks equips the model with automatic task inference. To analyze and evaluate the model from multiple dimensions, we further introduce a benchmark that covers a range of everyday use cases. Extensive experiments show that, even without or with only very limited real interleaved data training, Weaver achieves superior performance on interleaved multi-modal generation.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38464", "url": null, "sourceid": 46614, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40336?format=json"], "related_events_ids": [40336]}, {"id": 38466, "uid": "37f92a98dd57884d64f167d8edb12095", "name": "4D Local Modeling Toward Dynamic Global Perception for Ambiguity-free Rotation-Invariant Point Cloud Analysis", "authors": [{"id": 143957, "fullname": "JIAXUN GUO", "url": "http://cvpr.thecvf.com/api/miniconf/users/143957?format=json", "institution": "CIISE of Concordia University"}, {"id": 189911, "fullname": "Wentao Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189911?format=json", "institution": null}, {"id": 189912, "fullname": "Manar Amayri", "url": "http://cvpr.thecvf.com/api/miniconf/users/189912?format=json", "institution": "Concordia University"}, {"id": 189913, "fullname": "Nizar Bouguila", "url": "http://cvpr.thecvf.com/api/miniconf/users/189913?format=json", "institution": "Concordia University"}], "abstract": "Rotation invariance remains a core challenge in point cloud analysis, where existing methods often struggle with structural ambiguities and insufficient global context. Most rotation-invariant (RI) representations are derived from local coordinate systems, which inherently suffer from point-pair ambiguities and fail to capture discriminative features in symmetric or repetitive structures, while discarding informative global pose cues. To overcome these limitations, we propose Ga4DPF, a novel framework that offers a robust, global-aware RI representation by converting rotation-equivariant geometric representations into invariant ones, while concurrently integrating global pose awareness. Specifically, Ga4DPF introduces a learnable steerable transform that equivariantly lifts point clouds into 4D space, facilitating robust local feature construction and mitigating point-pair ambiguities. Concurrently, we model a dynamic global pose reference using the Bingham distribution, which adaptively estimates a consistent global rotation and enhances global feature discriminability. Extensive experiments on multiple benchmark datasets demonstrate that Ga4DPF achieves state-of-the-art performance with high computational efficiency, offering a new paradigm for rotation-invariant point cloud analysis.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38466", "url": null, "sourceid": 44371, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38467, "uid": "51ece4029cf46868d10fa2e1bef17b3f", "name": "L3DR: 3D-aware LiDAR Diffusion and Rectification", "authors": [{"id": 180935, "fullname": "QUAN LIU", "url": "http://cvpr.thecvf.com/api/miniconf/users/180935?format=json", "institution": "Nanyang Technological University"}, {"id": 89095, "fullname": "Xiaoqin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89095?format=json", "institution": "Wenzhou University"}, {"id": 87230, "fullname": "Ling Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/87230?format=json", "institution": "Inception Institute of Artificial Intelligence"}, {"id": 87301, "fullname": "Shijian Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87301?format=json", "institution": "Nanyang Technological University"}], "abstract": "Range-view (RV) based LiDAR diffusion has recently made huge strides towards 2D photo-realism. However, it neglects 3D geometry realism and often generates various RV artifacts such as depth bleeding and wavy surfaces. We design L3DR, a 3D-aware LiDAR Diffusion and Rectification framework that can regress and cancel RV artifacts in 3D space and restore local geometry accurately. Our theoretical and empirical analysis reveals that 3D models are inherently superior to 2D models in generating sharp and authentic boundaries. Leveraging such analysis, we design a 3D residual regression network that rectifies RV artifacts and achieves superb geometry realism by predicting point-level offsets in 3D space. On top of that, we design a Welsch Loss that helps focus on local geometry and ignore anomalous regions effectively. Extensive experiments over multiple benchmarks including KITTI, KITTI360, nuScenes and Waymo show that the proposed L3DR achieves state-of-the-art generation and superior geometry-realism consistently. In addition, L3DR is generally applicable to different LiDAR diffusion models with little computational overhead. Codes will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38467", "url": null, "sourceid": 44312, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40336, "uid": "a13a06978e18f0c27ffdcd3fe419a1c4", "name": "Weaver: Decoupled Training for Interleaved Multi-modal Generation", "authors": [{"id": 179630, "fullname": "Jinbo Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/179630?format=json", "institution": "Department of Computer Science and Engineering, The Chinese University of Hong Kong"}, {"id": 127102, "fullname": "Zeyinzi Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127102?format=json", "institution": "Alibaba Group"}, {"id": 189907, "fullname": "Yuxiang Tuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189907?format=json", "institution": "Alibaba"}, {"id": 126618, "fullname": "Chaojie Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/126618?format=json", "institution": "Alibaba Group"}, {"id": 189908, "fullname": "Xiaotang Gai", "url": "http://cvpr.thecvf.com/api/miniconf/users/189908?format=json", "institution": "Zhejiang University"}, {"id": 88507, "fullname": "Xi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88507?format=json", "institution": "the University of Hong Kong, University of Hong Kong"}, {"id": 127070, "fullname": "Jingfeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127070?format=json", "institution": "Alibaba Group"}, {"id": 90198, "fullname": "Yulin Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/90198?format=json", "institution": "Alibaba Group, China"}, {"id": 127103, "fullname": "Zhen Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/127103?format=json", "institution": "Alibaba Group"}, {"id": 88802, "fullname": "Jie Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88802?format=json", "institution": "University of Science and Technology of China"}, {"id": 88264, "fullname": "Keyu Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88264?format=json", "institution": "University of Science and Technology of China"}, {"id": 85359, "fullname": "Chen-Wei Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/85359?format=json", "institution": "Alibaba Group"}, {"id": 189909, "fullname": "Chongyang Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/189909?format=json", "institution": "Alibaba Group"}, {"id": 90342, "fullname": "Kai Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90342?format=json", "institution": "University of Science and Technology of China"}, {"id": 189910, "fullname": "Shen Tong", "url": "http://cvpr.thecvf.com/api/miniconf/users/189910?format=json", "institution": "Alibaba Group"}, {"id": 88101, "fullname": "Lianghua Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88101?format=json", "institution": "Alibaba Group"}, {"id": 88124, "fullname": "Yu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88124?format=json", "institution": "Alibaba Group"}, {"id": 73271, "fullname": "Yujiu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73271?format=json", "institution": "Tsinghua University"}], "abstract": "Recent unified multi-modal models have made unprecedented progress in understanding and generation, yet they largely support multi-modal inputs with single-modality outputs, struggling to produce complex interleaved text\u2013image content due to data scarcity and the difficulty of modeling long-range cross-modal context. We introduce Weaver, which frames interleaved generation as an autoregressive planning\u2013visualization process within a unified multi-modal architecture. A planner, i.e., understanding expert, digests rich text\u2013image context to produce visualization triggers and their dense textual guidance except for plain text, while a visualizer, i.e., generation expert, produces images conditioned on the planner\u2019s textual guidance and visual references. This design enables decoupled learning: we train the two experts on large collections of textual planning and reference-guided image data in parallel, yielding powerful interleaved multi-modal generation capability at inference. Moreover, training the planner with datasets from diverse understanding and generation tasks equips the model with automatic task inference. To analyze and evaluate the model from multiple dimensions, we further introduce a benchmark that covers a range of everyday use cases. Extensive experiments show that, even without or with only very limited real interleaved data training, Weaver achieves superior performance on interleaved multi-modal generation.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40336", "url": null, "sourceid": -46614, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38464?format=json"], "related_events_ids": [38464]}, {"id": 38465, "uid": "3750d48ecac4e1137d04a38eee74f4c6", "name": "Splat-Based Metal Artifact Reduction in Cone-Beam CT via Compact Attenuation Modeling", "authors": [{"id": 145095, "fullname": "Kiseok Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/145095?format=json", "institution": "KAIST"}, {"id": 183745, "fullname": "Jaemin Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/183745?format=json", "institution": "KAIST"}, {"id": 90841, "fullname": "Inchul Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/90841?format=json", "institution": "KAIST"}, {"id": 76517, "fullname": "Min H. Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/76517?format=json", "institution": "KAIST"}], "abstract": "X-ray computed tomography (CT) suffers from severe metal artifacts when high-attenuation objects such as dental fillings or orthopedic implants are present. These artifacts originate from the polychromatic nature of X-rays, where attenuation varies strongly with photon energy and material composition, breaking the monochromatic assumption used by conventional reconstruction algorithms. Recent neural rendering approaches attempt to address this mismatch through differentiable polychromatic projection models, but they still struggle with smoothness bias, loss of fine structures, and prohibitive computation when extended to large-scale cone-beam CT. We introduce a splat-based metal artifact reduction framework that incorporates a physically grounded polychromatic forward model into a continuous Gaussian representation for cone-beam CT. Each Gaussian encodes the energy-dependent attenuation of the underlying material using a compact material parameterization, which enables efficient joint optimization of geometric and material properties without relying on a metal mask. This compact attenuation formulation captures the essential variation across biological tissues and metallic implants, allowing our model to explain metal-induced nonlinearity while preserving high-frequency structure. Experiments on simulated and real cone-beam CT scans show that our method converges significantly faster and suppresses metal artifacts more effectively than existing reconstruction and neural field-based approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38465", "url": null, "sourceid": 35997, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38472, "uid": "95d89f1aafa28ac92eb3963212571e04", "name": "Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images", "authors": [{"id": 104400, "fullname": "Xiangyu Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/104400?format=json", "institution": "Sung Kyun Kwan University"}, {"id": 129307, "fullname": "Haoyi Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129307?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 154826, "fullname": "Liu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154826?format=json", "institution": "Horizon Robotics"}, {"id": 159455, "fullname": "Seungtae Nam", "url": "http://cvpr.thecvf.com/api/miniconf/users/159455?format=json", "institution": "Yonsei University"}, {"id": 153532, "fullname": "Gyeongjin Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153532?format=json", "institution": "Sungkyunkwan University"}, {"id": 159473, "fullname": "Xinjie wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/159473?format=json", "institution": "Horizon Robotics"}, {"id": 166700, "fullname": "Wei Sui", "url": "http://cvpr.thecvf.com/api/miniconf/users/166700?format=json", "institution": "D-Robotics, China"}, {"id": 164010, "fullname": "Zhizhong Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/164010?format=json", "institution": "Horizon Robotics"}, {"id": 86175, "fullname": "Wenyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86175?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 86179, "fullname": "Xinggang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86179?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 159432, "fullname": "Eunbyung Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/159432?format=json", "institution": "Yonsei University"}], "abstract": "Reconstructing and semantically interpreting 3D scenes from sparse 2D views remains a fundamental challenge in computer vision. Conventional methods often decouple semantic understanding from reconstruction or necessitate costly per-scene optimization, thereby restricting their scalability and generalizability. In this paper, we introduce a novel feed-forward framework that reconstructs 3D scenes from unposed multi-view images. This unified representation facilitates high-fidelity novel view synthesis, open-vocabulary 3D semantic segmentation, and depth prediction\u2014all within a single, feed-forward pass. Extensive experiments demonstrate this method establishes a new state-of-the-art across multiple benchmarks, including RE10K and ScanNet. Our work signifies a novel paradigm towards generalizable 3D scene reconstruction.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38472", "url": null, "sourceid": 36405, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38477, "uid": "cd8b6b3c5d37cd78b6de2c4b5e80b15d", "name": "DeepProtect: Proactive Face-Swapping Defense using Identity Blending and Attribute Distortion", "authors": [{"id": 189937, "fullname": "Eungi Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/189937?format=json", "institution": null}, {"id": 181612, "fullname": "Seung-hyeok Back", "url": "http://cvpr.thecvf.com/api/miniconf/users/181612?format=json", "institution": "Chonnam national university"}, {"id": 189938, "fullname": "Hyung-Il Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/189938?format=json", "institution": "Chonnam National University"}, {"id": 189939, "fullname": "Seok Bong Yoo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189939?format=json", "institution": "Chonnam National University"}], "abstract": "Face-swapping deepfakes allow realistic identity transfer, which can serve creative purposes but increases the risk of identity abuse. A proactive defense aims to prevent deepfake creation by obstructing identity feature extraction from input images, essential for identity-driven face-swapping. Existing proactive defense approaches aim to protect faces by hindering accurate identity feature extraction, but tend to introduce visible artifacts and fail to degrade the visual quality of the face-swapping deepfakes. This work proposes a proactive face-swapping defense using identity blending and attribute distortion (DeepProtect) that integrates global identity fusion in the latent space and local prompt-driven adversarial watermarking to address these problems. This work dilutes distinct identity representations by channel-wise blending of multiple identities in the latent space and optimizing the generator for visual consistency. The proposed approach distorts facial components in the identity space, directly influencing how faces are reconstructed in deepfakes. This approach applies semantic directions derived from user-provided text prompts to embed imperceptible adversarial watermarks that selectively distort facial attributes, affecting the visual fidelity of deepfake results. The proposed method hinders face-swapping deepfakes while preserving the perceptual quality of the protected images, offering a robust and practical solution for facial privacy protection. The experimental results reveal that DeepProtect effectively defends against face-swapping deepfakes while preserving visual consistency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38477", "url": "https://github.com/BACKAI/DeepProtect", "sourceid": 40229, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65715, "modified": "2026-04-20T18:28:02.540222-07:00", "display_section": 1, "type": "PDF", "name": "Slides", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "/media/cvpr-2026/Slides/38477.pdf", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38484, "uid": "550871828cc1bca1c0d7df8db25f80a6", "name": "Random Wins All: Rethinking Grouping Strategies for Vision Tokens", "authors": [{"id": 128266, "fullname": "Qihang Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/128266?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 189954, "fullname": "Yuang Ai", "url": "http://cvpr.thecvf.com/api/miniconf/users/189954?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 76200, "fullname": "Huaibo Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76200?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 76640, "fullname": "Ran He", "url": "http://cvpr.thecvf.com/api/miniconf/users/76640?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}], "abstract": "Since Transformers are introduced into vision architectures, their quadratic complexity has always been a significant issue that many research efforts aim to address. A representative approach involves grouping tokens, performing self-attention calculations within each group, or pooling the tokens within each group into a single token. To this end, various carefully designed grouping strategies have been proposed to enhance the performance of Vision Transformers. Here, we pose the following questions: \\textbf{Are these carefully designed grouping methods truly necessary? Is there a simpler and more unified token grouping method that can replace these diverse methods?} Therefore, we propose the random grouping strategy, which involves a simple and fast random grouping strategy for vision tokens. We validate this approach on multiple baselines, and experiments show that random grouping almost outperforms all other grouping methods. For example, compared to the classic Swin Transformer, our random grouping strategy achieves improvements of \\textbf{+1.3}, \\textbf{+0.9}, and \\textbf{+0.9} across three model sizes. When transferred to downstream tasks, such as object detection, random grouping demonstrates even more pronounced advantages. In response to this phenomenon, we conduct a detailed analysis of the advantages of random grouping from multiple perspectives and identify several crucial elements for the design of grouping strategies: \\textbf{positional information}, \\textbf{head feature diversity}, \\textbf{global receptive field}, and \\textbf{fixed grouping pattern}. We demonstrate that as long as these four conditions are met, vision tokens require only an extremely simple grouping strategy to efficiently and effectively handle various visual tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38484", "url": null, "sourceid": 36987, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38480, "uid": "be19b693f0394d68da38939bbd0a1cf7", "name": "Write Where It Matters: Policy-Guided Watermarks for 3D Gaussian Splatting", "authors": [{"id": 143966, "fullname": "Nan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/143966?format=json", "institution": "Tianjin University"}, {"id": 189947, "fullname": "Yike Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189947?format=json", "institution": "Tianjin University"}, {"id": 189948, "fullname": "Qian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189948?format=json", "institution": "Tianjin University"}, {"id": 154514, "fullname": "Qi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154514?format=json", "institution": "Tianjin University"}, {"id": 89860, "fullname": "Zhiyi Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/89860?format=json", "institution": "Peking University"}, {"id": 90857, "fullname": "Wei Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90857?format=json", "institution": "Tianjin University"}, {"id": 153902, "fullname": "Liang Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153902?format=json", "institution": "Tianjin University"}], "abstract": "Recent advances in 3D Gaussian Splatting (3DGS) enable photorealistic real-time rendering but also increase the risks of unauthorized copying and redistribution. Existing 3DGS watermarking methods typically rely on handcrafted thresholds or globally fixed hyperparameters to balance invisibility and robustness, making their embedding behavior static and scene-agnostic. We instead formulate 3DGS watermarking as a goal-directed decision process and introduce Write Where It Matters (W2M), the first reinforcement learning-based framework that adaptively learns where and how much to embed. By modeling the embedding process as a Markov Decision Process, W2M uses a lightweight policy network to allocate precise Gaussian updates directly from immediate reward feedback, iteratively. The reward incentivizes both rendering-space invisibility and decoding robustness under various image- and model-level distortions. To achieve efficient control, W2M operates on a structured 3DGS backbone organized around learnable anchors and applies policy-guided per-anchor gradient scaling. Extensive experiments across the Blender, LLFF, and Mip-NeRF 360 datasets demonstrate that W2M achieves state-of-the-art bit accuracy, strong perceptual fidelity, and structural consistency under both standard and adversarial conditions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38480", "url": null, "sourceid": 36630, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38482, "uid": "f1be3ad90345f56fd97fcf282a265902", "name": "MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving", "authors": [{"id": 182207, "fullname": "Lingjun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182207?format=json", "institution": "Alibaba Group"}, {"id": 189949, "fullname": "Yujian Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189949?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 189950, "fullname": "Changjie Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189950?format=json", "institution": null}, {"id": 158334, "fullname": "Xinyuan Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158334?format=json", "institution": "Alibaba Group"}, {"id": 152360, "fullname": "Xin Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/152360?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 189951, "fullname": "Shuang Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189951?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 189952, "fullname": "Linzhe Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189952?format=json", "institution": "Alibaba Group"}, {"id": 189953, "fullname": "Sijin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189953?format=json", "institution": "Alibaba Group"}, {"id": 186047, "fullname": "Hang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186047?format=json", "institution": "Alibaba Group"}, {"id": 154906, "fullname": "Mu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154906?format=json", "institution": "Alibaba Group"}], "abstract": "Vision-Language Models (VLM) exhibit strong reasoning capabilities, showing promise for end-to-end autonomous driving systems. Chain-of-Thought (CoT), as VLM's widely used reasoning strategy, is facing critical challenges. Existing textual CoT has a large gap between text semantic space and trajectory physical space. Although the recent approach utilizes future image to replace text as CoT process, it lacks clear planning-oriented objective guidance to generate images with accurate scene evolution. To address these, we innovatively propose MindDriver, a progressive multimodal reasoning framework that enables VLM to imitate human-like progressive thinking for autonomous driving.MindDriver presents semantic understanding, semantic-to-physical space imagination, and physical-space trajectory planning.To achieve aligned reasoning processes in MindDriver, we develop a feedback-guided automatic data annotation pipeline to generate aligned multimodal reasoning training data. Furthermore, we develop a progressive reinforcement fine-tuningmethod to optimize the alignment through progressive high-level reward-based learning.MindDriver demonstrates superior performance in both nuScences open-loop and Bench2Drive closed-loop evaluation.Our trained model and codes will be released once accepted.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38482", "url": null, "sourceid": 45445, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38486, "uid": "97f672b97cd67c31866ecea0b020147e", "name": "Boosting Self-Supervised Tracking with Contextual Prompts and Noise Learning", "authors": [{"id": 155926, "fullname": "Yaozong Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/155926?format=json", "institution": "Guangxi Normal University"}, {"id": 155925, "fullname": "Qihua Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155925?format=json", "institution": "Guangxi Normal University"}, {"id": 128975, "fullname": "Bineng Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/128975?format=json", "institution": "Guangxi Normal University"}, {"id": 184384, "fullname": "Shuimu Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184384?format=json", "institution": null}, {"id": 155928, "fullname": "Yuanliang Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/155928?format=json", "institution": "Xi\u2019an Research Institute of High Technology"}, {"id": 155927, "fullname": "Ning Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155927?format=json", "institution": "Guangxi Normal University"}, {"id": 129683, "fullname": "Shuxiang Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/129683?format=json", "institution": "Guangxi Normal University"}], "abstract": "Learning robust contextual knowledge from unlabeled videos is essential for advancing self-supervised tracking. However, conventional self-supervised trackers lack effective context modeling, while existing context association methods based on non-semantic queries struggle to adapt to unlabeled tracking scenarios, making it difficult to learn reliable contextual cues. In this work, we propose a novel self-supervised tracking framework, named \\textbf{\\tracker}, which introduces a dual-modal context association mechanism that jointly leverages fine-grained semantic prompts and contextual noise to drive the model toward learning robust tracking representations. Adherent to the easy-to-hard learning principle, our contextual association mechanism operates based on two stages. During early training, instance patch tokens (prompts) are assigned to both forward and backward tracking branches to facilitate the acquisition of tracking knowledge. As training progresses, contextual noise is gradually injected into the model to perturb feature, encouraging the tracker to learn robust tracking representations in a more complex feature space. Thus, this novel contextual association mechanism enables our self-supervised model to learn high-quality tracking representations from unlabeled videos, while being applied exclusively during training to preserve efficient inference. Extensive experiments on multiple tracking benchmarks demonstrate the superiority of our method, achieving SOTA performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38486", "url": null, "sourceid": 38516, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38488, "uid": "2e61866ea98aba76b0b07e7e7728c646", "name": "VMD-FACT: A New Video Dataset and MLLM-based method for Detecting Realistic AI-Generated Video Misinformation", "authors": [{"id": 184028, "fullname": "Yongkang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184028?format=json", "institution": "Zhongguancun Laboratory"}, {"id": 189967, "fullname": "Dongyu She", "url": "http://cvpr.thecvf.com/api/miniconf/users/189967?format=json", "institution": "Zhongguancun Laboratory"}, {"id": 189968, "fullname": "Baiyu Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/189968?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 189969, "fullname": "Qichuan Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189969?format=json", "institution": "Capital Normal University"}, {"id": 149178, "fullname": "Zhong Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/149178?format=json", "institution": "Beihang University"}, {"id": 189970, "fullname": "Yan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189970?format=json", "institution": null}], "abstract": "The rapid evolution of generative AI, exemplified by models such as Sora, has intensified the threat of video misinformation. A critical challenge in detecting these AI-generated video misinformation lies in a fundamental disconnect between existing datasets and practical deception tactics. Current datasets often disrupt cross-modal consistency through editing techniques, resulting in unrealistic and easily detectable artifacts. In stark contrast, generative video misinformation strives for semantic consistency across modalities to remain realism. To address this gap, we introduce RAVM: the first Realistic AI-Generated Video Misinformation Detection Dataset. RAVM contains authentic claim-video pairs, as well as Realistic AI-Generated claim-video pairs. More importantly, unlike existing Video Misinformation Detection (VMD) datasets that are limited to single-source manipulations, RAVM encompasses multiple manipulation sources\u2014Claim, Video, Audio, and Cross-Modal Manipulation\u2014each of which includes multiple manipulation techniques to generate realistic AI-generated video misinformation. Thus, we introduce an AI-generative framework for producing realistic AI-generated video misinformation. Furthermore, we propose IEEG model, which represents multimodal evidence, fact-checking results, and their dependencies as an evidence graph for interpretable AI-generated VMD. Extensive experiments on RAVM demonstrate the vulnerability of general Multimodal Large Language Models (MLLMs) in detecting generative video misinformation, while our IEEG achieves state-of-the-art performance on RAVM.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38488", "url": null, "sourceid": 42807, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38493, "uid": "b6094ce2d81a4ceef4996f3f0ea04635", "name": "Speeding Up the Learning of 3D Gaussians with Much Shorter Gaussian Lists", "authors": [{"id": 172537, "fullname": "Jiaqi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/172537?format=json", "institution": "independent"}, {"id": 88625, "fullname": "Zhizhong Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/88625?format=json", "institution": "Wayne State University"}], "abstract": "3D Gaussian splatting (3DGS) has become a vital tool for learning a radiance field from multiple posed images. Although 3DGS shows great advantages over NeRF in terms of rendering quality and efficiency, it remains a research challenge to further improve the efficiency of learning 3D Gaussians. To overcome this challenge, we propose novel training strategies and losses to shorten each Gaussian list used to render a pixel, which speeds up the splatting by involving fewer Gaussians along a ray. Specifically, we shrink the size of each Gaussian by resetting their scales regularly, encouraging smaller Gaussians to cover fewer nearby pixels, which shortens the Gaussian lists of pixels. Additionally, we introduce an entropy constraint on the alpha blending procedure to sharpen the weight distribution of Gaussians along each ray, which drives dominant weights larger while making minor weights smaller. As a result, each Gaussian becomes more focused on the pixels where it is dominant, which reduces its impact on nearby pixels, leading to even shorter Gaussian lists. Eventually, we integrate our method into a rendering resolution scheduler which further improves efficiency through progressive resolution increase. We evaluate our method by comparing it with state-of-the-art methods on widely used benchmarks. Our results show significant advantages over others in efficiency without sacrificing rendering quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38493", "url": null, "sourceid": 41877, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38494, "uid": "c20cbf08834c32e106350909a16de19e", "name": "MeshWeaver: Sparse-Voxel-Guided Surface Weaving for Autoregressive Mesh Generation", "authors": [{"id": 183964, "fullname": "Jiale Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183964?format=json", "institution": "Tencent"}, {"id": 154762, "fullname": "Wang Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/154762?format=json", "institution": "Tencent ARC Lab"}, {"id": 84809, "fullname": "Ying Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/84809?format=json", "institution": "Tencent"}], "abstract": "Autoregressive mesh generation has gained attention by tokenizing meshes into sequences and training models in a language\u2011modeling fashion. However, existing approaches suffer from two fundamental limitations: (i) low tokenization efficiency, which yields long token sequences and prevents scaling to high\u2011poly meshes, and (ii) absence of geometry\u2011aware guidance, as generation is conditioned only on global shape embeddings rather than local surface cues. We introduce MeshWeaver, an autoregressive framework that treats mesh generation as a surface weaving process by directly predicting the next vertex instead of independent coordinates. At its core is a multi\u2011level sparse\u2011voxel encoder that injects geometric context into the generative process in three complementary ways: providing voxel features as vertex representations, guiding token prediction via cross\u2011attention to voxel features, and serving as a structural scaffold that constrains generation around the input surface. Our hierarchical design enables coarse\u2011to\u2011fine vertex prediction in a single decoding step, while tightly couples the generative model with 3D geometry. Extensive experiments demonstrate that MeshWeaver achieves a state\u2011of\u2011the\u2011art compression ratio of 18%, can generates meshes with up to 16K faces, and significantly improves geometric fidelity over prior approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38494", "url": null, "sourceid": 31594, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38496, "uid": "be4f71c8222e43c0fcb16a77d1226f23", "name": "EXOTIC: External Vision-driven Incomplete Multi-view Classification", "authors": [{"id": 176606, "fullname": "Shilin Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/176606?format=json", "institution": "Sichuan University "}, {"id": 86132, "fullname": "Dezhong Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86132?format=json", "institution": "Sichuan University"}, {"id": 88480, "fullname": "Zhenwen Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/88480?format=json", "institution": "Southwest University Of Science And Technology"}, {"id": 157685, "fullname": "Yuan Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/157685?format=json", "institution": "Sichuan University"}], "abstract": "Due to sensor failures and occlusions during data acquisition, multi-view data often suffer from partial missing samples, thereby producing incomplete multi-view data. Recently,  Incomplete Multi-View Classification (IMVC) has become one of the research hot topics, where numerous IMVC methods have been proposed. Although these methods have achieved promising performance by exploiting internal semantic information from partially observed data, they primarily rely on limited internal supervision for view completion. Clearly,  this largely constrains their performance ceiling. To overcome this limitation, we propose an EXternal visiOn-driven incomplete mulTi-vIew Classification (EXOTIC) paradigm that incorporates external vision knowledge as semantic guidance, thereby assisting in imputing incomplete views. To the best of our knowledge, it is the first work that leverages external vision knowledge as supervision signals, thereby guiding missing-view completion. Specifically, we first introduce an external vision knowledge library based on a pre-trained vision\u2013language model. Then, we design a Knowledge Filtering module to adaptively select task-relevant knowledge. Afterwards, we present a Knowledge Purification module to align external knowledge with internal representations. Finally, we propose External Completion that leverages the refined knowledge to impute missing views, thereby enhancing the classification decision ability. Extensive experiments on multiple incomplete multi-view datasets demonstrate that the proposed EXOTIC consistently outperforms existing methods, especially under high missing rates.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38496", "url": null, "sourceid": 46359, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38498, "uid": "849db8ef492a08f7469cf0bdcecbc22e", "name": "Robust Remote Sensing Image\u2013Text Retrieval with Noisy Correspondence", "authors": [{"id": 189991, "fullname": "qiya song", "url": "http://cvpr.thecvf.com/api/miniconf/users/189991?format=json", "institution": "Hunan Normal University"}, {"id": 183684, "fullname": "Yiqiang Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/183684?format=json", "institution": "Hunan Normal University"}, {"id": 157685, "fullname": "Yuan Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/157685?format=json", "institution": "Sichuan University"}, {"id": 153585, "fullname": "Renwei Dian", "url": "http://cvpr.thecvf.com/api/miniconf/users/153585?format=json", "institution": "Hunan University"}, {"id": 189992, "fullname": "Xudong Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189992?format=json", "institution": "Hunan University"}], "abstract": "As a pivotal task that bridges remote visual and linguistic understanding, Remote Sensing Image-Text Retrieval (RSITR) has attracted considerable research interest in recent years. However, almost all RSITR methods implicitly assume that image-text pairs are matched perfectly. In practice, acquiring a large set of well-aligned data pairs is often prohibitively expensive or even infeasible. Although several studies have acknowledged the presence of noisy pairs, little work has explored how to endow neural networks with robustness against such noise. Based on the above observations, we reveal an important but untouched problem in RSITR, i.e., Noisy Correspondence (NC). To overcome these challenges, we propose a novel Robust Remote Sensing Image\u2013Text Retrieval (RRSITR) paradigm that designs a self-paced learning strategy to mimic human cognitive learning patterns, thereby learning from easy to hard from multi-modal data with NC. Specifically, we first divide all training sample pairs into three categories based on the loss magnitude of each pair, i.e., clean sample pairs, ambiguous sample pairs, and noisy sample pairs. Then, we respectively estimate the reliability of each training pair by assigning a weight to each pair based on the values of the loss. Further, we respectively design a new self-paced function to dynamically regulate the training sequence and weights of the samples, thus establishing a progressive learning process. Finally, for noisy sample pairs, we present an enhanced triplet loss to dynamically adjust the soft margin based on semantic similarity, thereby enhancing the robustness against noise. Extensive experiments on three popular benchmark datasets demonstrate that the proposed RRSITR significantly outperforms the state-of-the-art methods, especially in high noise rates.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38498", "url": null, "sourceid": 42872, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38505, "uid": "c2d66337ca53fe0c1c58a92f2665ef70", "name": "TLMA: Mitigating the Impact of Weakly Labeled Information for Video Anomaly Detection", "authors": [{"id": 182043, "fullname": "Rong Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182043?format=json", "institution": "Beijing Jiaotong University"}, {"id": 156264, "fullname": "Runqi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156264?format=json", "institution": "Beijing Jiaotong University"}, {"id": 190010, "fullname": "Yingjun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190010?format=json", "institution": null}, {"id": 190011, "fullname": "Tao Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190011?format=json", "institution": "Beijing Jiaotong University"}, {"id": 144850, "fullname": "Xiaomeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/144850?format=json", "institution": "Beijing Jiaotong University"}, {"id": 159471, "fullname": "Liping Jing", "url": "http://cvpr.thecvf.com/api/miniconf/users/159471?format=json", "institution": "Beijing Jiaotong University"}], "abstract": "Weakly Supervised Video Anomaly Detection (WSVAD) aims to localize abnormal segments using only video-level labels during training.Although the paradigm significantly reduces annotation costs, the coarse-grained labels fail to precisely describe the full videos, resulting in the introduction of substantial Weakly Labeled Information (WLI) during training. The presence of WLI makes it difficult for the model to accurately learn the boundary between normal and abnormal behaviors, leading to  misclassifications and compromising the precision of anomaly localization.To tackle the challenges posed by WLI, we propose a triplet learning strategy that selects hard segments from normal videos as anchors. By combining contrastive learning with Multiple Instance Learning (MIL) strategy, we increase the projection distance between abnormal segments and anchor samples, to reduce the interference of WLI in anomaly detection.Moreover, considering that anomalies typically occur in dynamic foreground regions, we further design a motion-aware feature enhancement module that extracts dynamic areas within each video segment to emphasize the representation of critical features.This not only improves the accuracy of anchors in triplets, but also enhances the discriminative power of instance features in MIL. Extensive experiments on UCF-Crime, XD-Violence, and MSAD datasets demonstrate the effectiveness of our approach.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38505", "url": null, "sourceid": 35403, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38506, "uid": "3ead394a64cafeb28d8d7a3c9f79d350", "name": "RDF-MIG: A Robust Diffusion Framework for Masked Image Generation to Augment Semantic Segmentation and Change Detection", "authors": [{"id": 144743, "fullname": "Zian Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/144743?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 130050, "fullname": "Wei Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/130050?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 190012, "fullname": "QINGSHAN GAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/190012?format=json", "institution": "Pingan Technology"}, {"id": 190013, "fullname": "Yuanyuanfu Yuanyuanfu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190013?format=json", "institution": "Pingan Technology"}], "abstract": "Change detection and semantic segmentation are key techniques for satellite image analysis in remote sensing. However, acquiring high-quality labeled data is costly and time-consuming. Although recent studies have explored generative models to ease data scarcity, a unified framework supporting both tasks is still lacking, and most methods overlook noise accumulation and cannot generate multispectral images. To address this, we propose the robust diffusion framework for masked image generation (RDF-MIG). RDF-MIG generates bi-temporal change-labeled and single-temporal segmentation-labeled images to enhance downstream change detection and semantic segmentation tasks. Furthermore, to address noise accumulation and improve the quality of generated image\u2013mask pairs, we reformulate the diffusion model training objective by proposing the Maximum Correntropy Robust Diffusion (MCRD) loss, and further design an MSE-consistency calibration that analytically aligns small-error gradients with the MSE objective while preserving robustness to outliers. Experiments indicate that the proposed RDF-MIG framework can generate multispectral image\u2013mask pairs to improve downstream performance, while MCRD loss further enhances the quality of the synthesized data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38506", "url": null, "sourceid": 41077, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38508, "uid": "dedd5db8f760f36dd41fba0d5e94308b", "name": "LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models", "authors": [{"id": 181723, "fullname": "Soumyaratna Debnath", "url": "http://cvpr.thecvf.com/api/miniconf/users/181723?format=json", "institution": "Indian Institute of Technology, Gandhinagar"}, {"id": 190018, "fullname": "Bui Manh", "url": "http://cvpr.thecvf.com/api/miniconf/users/190018?format=json", "institution": "Nanyang Technological University"}, {"id": 175187, "fullname": "Zinan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175187?format=json", "institution": "Nanyang Technological University"}, {"id": 153635, "fullname": "Lin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153635?format=json", "institution": "Nanyang Technological University"}], "abstract": "Vision-Language Models (VLMs) typically assume a uniform spatial fidelity across the entire field of view of visual inputs, dedicating equal precision to even the uninformative regions. By contrast, human vision is neither uniform nor static. It is adaptive, selective, and resource-efficient. In light of this, we present the first systematic analysis of bio-inspired visual representation methods, providing insights for more efficient and adaptive VLMs. We propose LLMind (Looking Like the Mind), a novel training-free framework that mimics foveated encoding and cortical magnification in human vision to achieve adaptive, efficient representations for VLMs under tight pixel budgets. Our key idea is to explore a Bio-inspired Adaptive Sampling Strategy (BASS). This empowers us to design a M\u00f6bius-parameterized module that performs non-uniform sampling while preserving global scene structure. On top of BASS, we introduce closed-loop semantic feedback (CSF) via test-time adaptation to align the perceptual saliency with the textual information from the frozen VLM. We evaluate LLMind against uniform and other sampling baselines across diverse scene-level and region-guided visual question answering (VQA) benchmarks. The results show that ours achieves dramatic gains, with average improvements by +20% on VQAv2, +38% on Seed-Bench, and +37% on A-OKVQA compared to uniform sampling under tight pixel budgets. More surprisingly, results reveal that LLMind can retain up to 82%, 92% and 97% of the full-resolution performance with only 1%, 3% and 5% of the pixels, respectively. Moreover, LLMind is lightweight, plug-and-play, and compatible with existing VLMs without requiring architectural changes.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38508", "url": "https://empactlab.github.io/LLMind-CVPR-2026/", "sourceid": 37047, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38512, "uid": "59951ac68b785c63518ff49cb84b8225", "name": "MLLMSplat: A 2D MLLM-Powered Framework for 3D Gaussian Splatting Understanding, Generation, and Editing", "authors": [{"id": 179916, "fullname": "Jingqiao Xiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/179916?format=json", "institution": "The University of Hong Kong"}, {"id": 181441, "fullname": "Can Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181441?format=json", "institution": null}, {"id": 86789, "fullname": "Dong Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86789?format=json", "institution": "University of Hong Kong"}], "abstract": "3D Gaussian Splatting (3DGS) has emerged as a mainstream representation for 3D scenes, drawing increasing research attention to its understanding, generation, and editing. However, existing studies remain limited to low-level perception, low-quality generation, and low-efficiency editing, lagging far behind their image counterparts in the era of Multimodal Large Language Models (MLLMs). To bridge this gap, we propose MLLMSplat, a novel framework that adapts 2D MLLMs to achieve high-level understanding, high-quality generation, and high-efficiency editing of 3DGS scenes. Specifically, our comprehensive framework consists of three core designs: (1) a 3DGS tokenizer that can be seamlessly integrated into existing MLLMs in a training-free manner; (2) a 3DGS de-tokenizer that non-intrusively extends the 2D latent diffusion model in MLLMs using a dual positional encoding space, while augmenting it with a jointly trained and sampled 3DGS decoder; and (3) a surrogate task that enhances feedforward editing capabilities. Extensive experiments demonstrate that MLLMSplat delivers state-of-the-art performance across 3DGS understanding, generation, and editing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38512", "url": null, "sourceid": 36408, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38516, "uid": "4ff7acc3af759d2b27ba34a457c41932", "name": "Unified Customized Generation by Disentangled Reward Modeling", "authors": [{"id": 190036, "fullname": "Shaojin Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190036?format=json", "institution": "ByteDance Inc."}, {"id": 190037, "fullname": "Mengqi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190037?format=json", "institution": "University of Science and Technology of China"}, {"id": 181297, "fullname": "Yufeng Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/181297?format=json", "institution": "Beijing Zitiao Network Technology Co., Ltd."}, {"id": 190038, "fullname": "wenxu wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190038?format=json", "institution": null}, {"id": 190039, "fullname": "Jiahe Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/190039?format=json", "institution": null}, {"id": 190040, "fullname": "Yiming Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/190040?format=json", "institution": null}, {"id": 190041, "fullname": "Fei Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/190041?format=json", "institution": "Bytedance"}, {"id": 127533, "fullname": "Qian HE", "url": "http://cvpr.thecvf.com/api/miniconf/users/127533?format=json", "institution": "Institute of Remote Sensing Application, Chinese Academic of Sciences"}], "abstract": "Existing literature typically treats various customized generation tasks (e.g., subject-customized generation, style-customized generation) as distinct and disjoint problems, with each task focusing solely on customizing a specific aspect of the reference image. However, we argue that the objectives of these different customization tasks are inherently complementary and can be mutually enhanced within a unified framework, as they fundamentally involve the disentanglement of multiple feature aspects from the reference image. To this end, we introduce **USO**, a **U**nified **S**imultaneous **O**ptimization framework to simultaneously unify different customized tasks (i.e., subject and style). Specifically, USO introduces a cyclical data-model framework that connects these two tasks by a subject-for-style data curation pipeline and a style-for-subject model training pipeline. The subject-for-style data curation pipeline leverages a state-of-the-art subject-customized model to generate high-quality triplet data comprising content images, style images, and their corresponding stylized content images. Building on this foundation, the style-for-subject model training pipeline introduces an auxiliary style reward to simultaneously align style and content features, thereby reinforcing the model\u2019s ability to extract the desired style or content features from the reference image. Extensive experiments demonstrate that USO achieves state-of-the-art performance among open-source models, excelling in both subject consistency and style similarity.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38516", "url": null, "sourceid": 45331, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38520, "uid": "2ee54ad04e0c4ccddb42bd52efcf27e8", "name": "Do Vision-Language Models Leak What They Learn? Adaptive Token-Weighted Model Inversion Attacks", "authors": [{"id": 139147, "fullname": "Ngoc-Bao Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/139147?format=json", "institution": "Singapore University of Technology and Design"}, {"id": 190054, "fullname": "Sy-Tuyen Ho", "url": "http://cvpr.thecvf.com/api/miniconf/users/190054?format=json", "institution": "University of Maryland, College Park"}, {"id": 129847, "fullname": "Koh Jun Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/129847?format=json", "institution": "Singapore University of Technology and Design"}, {"id": 88167, "fullname": "Ngai-Man Cheung", "url": "http://cvpr.thecvf.com/api/miniconf/users/88167?format=json", "institution": "Singapore University of Technology and Design"}], "abstract": "Model inversion (MI) attacks pose significant privacy risks by reconstructing private training data from trained neural networks. While prior studies have primarily examined unimodal deep networks, the vulnerability of vision-language models (VLMs) remains largely unexplored. **In this work, we present the first systematic study of MI attacks on VLMs to understand their susceptibility to leaking private visual training data.**Our work makes two main contributions.First, tailored to the token-generative nature of VLMs,we introduce a suite of token-based and sequence-based model inversion strategies, providing a comprehensive analysis of VLMs' vulnerability under different attack formulations.Second, based on the observation that tokens vary in their visual grounding, and hencetheir gradients differ in informativeness  for image reconstruction, we propose *Sequence-based Model Inversion with Adaptive Token Weighting (SMI-AW)* as a novel MI for VLMs. SMI-AW dynamically reweights each token's loss gradient according to its visual grounding, enabling the optimization to focus on visually informative tokens and more effectively guide the reconstruction of private images.Through extensive experiments and human evaluations on a range of state-of-the-art VLMs across multiple datasets, we show that VLMs are susceptible to training data leakage. Human evaluation of the reconstructed images yields an attack accuracy of 61.21\\%, underscoring the severity of these privacy risks.Notably,  we demonstrate that publicly released VLMs are vulnerable to such attacks. Our study highlights the urgent need for privacy safeguards as VLMs become increasingly deployed in sensitive domains such as healthcare and finance.**Code and additional experiments are provided in Supp.**", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38520", "url": null, "sourceid": 35679, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38528, "uid": "7c0d291483f96b28bcf34828e67b0404", "name": "Revisiting Geometric Obfuscation with Dual Convergent Lines for Privacy-Preserving Image Queries in Visual Localization", "authors": [{"id": 161003, "fullname": "Jeonggon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/161003?format=json", "institution": "Hanyang University"}, {"id": 127845, "fullname": "Heejoon Moon", "url": "http://cvpr.thecvf.com/api/miniconf/users/127845?format=json", "institution": "HANYANG university"}, {"id": 76091, "fullname": "Je Hyeong Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76091?format=json", "institution": "Hanyang University"}], "abstract": "Privacy-Preserving Image Queries (PPIQ) are an emerging mechanism for cloud-based visual localization, enabling pose estimation from obfuscated features instead of private images or raw keypoints.However, the main approaches for PPIQ, primarily geometry-based and segmentation-based obfuscation, both suffer from vulnerabilities to recent privacy attacks.In particular, a fundamental limitation of geometry-based obfuscation is that the spatial distribution of obfuscated neighboring lines still effectively surrounds the original keypoint location, providing exploitable cues for recovering the original points.We revisit this geometric paradigm and introduce Dual Convergent Lines (DCL), a novel keypoint obfuscation method demonstrating strong resilience against such attack.DCL places two fixed anchors on a central partition line and lifts each keypoint to a line originating from one of them, with the active anchor determined by the keypoint's location.This arrangement invalidates the geometry-recovery attack by making its optimization ill-posed:Neighboring lines either misleadingly converge to one anchor, yielding a trivial solution, or become near-parallel at the partition boundary, yielding an unstable high-variance solution. Both outcomes thwart point recovery.DCL is also compatible with an existing line-based solver, enabling deployment in traditional localization pipelines.Experiments on both indoor and large-scale outdoor datasets demonstrate DCL's robustness against privacy attacks, efficiency, and scalability, while achieving practical localization performance.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38528", "url": null, "sourceid": 35546, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40339?format=json"], "related_events_ids": [40339]}, {"id": 40339, "uid": "7c0d291483f96b28bcf34828e67b0404", "name": "Revisiting Geometric Obfuscation with Dual Convergent Lines for Privacy-Preserving Image Queries in Visual Localization", "authors": [{"id": 161003, "fullname": "Jeonggon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/161003?format=json", "institution": "Hanyang University"}, {"id": 127845, "fullname": "Heejoon Moon", "url": "http://cvpr.thecvf.com/api/miniconf/users/127845?format=json", "institution": "HANYANG university"}, {"id": 76091, "fullname": "Je Hyeong Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76091?format=json", "institution": "Hanyang University"}], "abstract": "Privacy-Preserving Image Queries (PPIQ) are an emerging mechanism for cloud-based visual localization, enabling pose estimation from obfuscated features instead of private images or raw keypoints.However, the main approaches for PPIQ, primarily geometry-based and segmentation-based obfuscation, both suffer from vulnerabilities to recent privacy attacks.In particular, a fundamental limitation of geometry-based obfuscation is that the spatial distribution of obfuscated neighboring lines still effectively surrounds the original keypoint location, providing exploitable cues for recovering the original points.We revisit this geometric paradigm and introduce Dual Convergent Lines (DCL), a novel keypoint obfuscation method demonstrating strong resilience against such attack.DCL places two fixed anchors on a central partition line and lifts each keypoint to a line originating from one of them, with the active anchor determined by the keypoint's location.This arrangement invalidates the geometry-recovery attack by making its optimization ill-posed:Neighboring lines either misleadingly converge to one anchor, yielding a trivial solution, or become near-parallel at the partition boundary, yielding an unstable high-variance solution. Both outcomes thwart point recovery.DCL is also compatible with an existing line-based solver, enabling deployment in traditional localization pipelines.Experiments on both indoor and large-scale outdoor datasets demonstrate DCL's robustness against privacy attacks, efficiency, and scalability, while achieving practical localization performance.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40339", "url": null, "sourceid": -35546, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38528?format=json"], "related_events_ids": [38528]}, {"id": 38532, "uid": "75fca39619796228e6b3f3ab2cc0ebab", "name": "LoL: Longer than Longer, Scaling Video Generation to Hour", "authors": [{"id": 134887, "fullname": "Jiaxing Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/134887?format=json", "institution": "UCLA"}, {"id": 89921, "fullname": "Jie Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89921?format=json", "institution": "ByteDance Inc."}, {"id": 190074, "fullname": "Ming Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190074?format=json", "institution": "University of Central Florida"}, {"id": 106920, "fullname": "Tao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/106920?format=json", "institution": "Xi&#x27;an JiaoTong University"}, {"id": 190075, "fullname": "Xiaojie Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190075?format=json", "institution": "Tiktok"}, {"id": 87116, "fullname": "Rui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87116?format=json", "institution": "TikTok"}, {"id": 190076, "fullname": "Andrew Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/190076?format=json", "institution": "University of California, Los Angeles"}, {"id": 148556, "fullname": "Yuanhao Ban", "url": "http://cvpr.thecvf.com/api/miniconf/users/148556?format=json", "institution": "UCLA"}, {"id": 84675, "fullname": "Cho-Jui Hsieh", "url": "http://cvpr.thecvf.com/api/miniconf/users/84675?format=json", "institution": "University of California, Los Angeles"}], "abstract": "Recent research in long-form video generation has shifted from bidirectional to autoregressive models, yet these methods commonly suffer from error accumulation and a loss of long-term coherence. While attention sink frames have been introduced to mitigate this performance decay, they often induce a critical failure mode we term sink-collapse: the generated content repeatedly reverts to the sink frame, resulting in abrupt scene resets and cyclic motion patterns. Our analysis reveals that sink-collapse originates from an inherent conflict between the periodic structure of Rotary Position Embedding (RoPE) and the multi-head attention mechanisms prevalent in current generative models. To address it, we propose a lightweight, training-free approach that effectively suppresses this behavior by introducing multi-head RoPE jitter that breaks inter-head attention homogenization and mitigates long-horizon collapse. Extensive experiments show that our method successfully alleviates sink-collapse while preserving generation quality. To the best of our knowledge, this work achieves the first demonstration of real-time, streaming, and infinite-length video generation with little quality decay. As an illustration of this robustness, we generate continuous videos up to 12 hours in length, which, to our knowledge, is among the longest publicly demonstrated results in streaming video generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38532", "url": null, "sourceid": 37515, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38535, "uid": "5743ccecb63e95b508eb2fab24e9d074", "name": "Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding", "authors": [{"id": 176472, "fullname": "Jiayun Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/176472?format=json", "institution": "Hangzhou City University"}, {"id": 177229, "fullname": "Haolong Chai", "url": "http://cvpr.thecvf.com/api/miniconf/users/177229?format=json", "institution": "Hangzhou City University"}, {"id": 177232, "fullname": "Xueying Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177232?format=json", "institution": "Hangzhou City University"}, {"id": 186546, "fullname": "Xiaoqing Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186546?format=json", "institution": "Hong Kong Baptist University"}, {"id": 190082, "fullname": "Zengwei Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190082?format=json", "institution": "Hangzhou City University"}, {"id": 177313, "fullname": "Zhan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/177313?format=json", "institution": "Zhejiang University"}, {"id": 177314, "fullname": "Junmei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177314?format=json", "institution": "Women\u2019s Hospital, School of Medicine"}, {"id": 177315, "fullname": "Xinyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177315?format=json", "institution": null}, {"id": 89719, "fullname": "Jie Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89719?format=json", "institution": "City University of Hong Kong"}, {"id": 177322, "fullname": "Binbin Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/177322?format=json", "institution": "Hangzhou City University"}], "abstract": "Ultrasound imaging is widely used in clinical diagnostics due to its real-time capability and radiation-free nature. However, existing vision-language pre-training models, such as CLIP, are primarily designed for modalities like CT and MRI, and are difficult to directly apply to ultrasound data, which exhibit heterogeneous anatomical structures and diverse diagnostic attributes.To bridge this gap, we construct US-365K, a large-scale ultrasound image\u2013text dataset containing 365k paired samples across 52 anatomical categories. We establish Ultrasonographic Diagnostic Taxonomy (UDT) containing two hierarchical knowledge frameworks, Ultrasonographic Hierarchical Anatomical Taxonomy (UHAT) and Ultrasonographic Diagnostic Attribute Framework (UDAF). UHAT standardizes anatomical organization, and UDAF formalizes nine diagnostic dimensions, including body system, organ, diagnosis, shape, margins, echogenicity, internal characteristics, posterior acoustic phenomena, and vascularity.Building upon these foundations, we propose Ultrasound-CLIP, a semantic-aware contrastive learning framework that introduces semantic soft labels and semantic loss to refine sample discrimination. Moreover, we construct a heterogeneous graph modality derived from UDAF's textual representations, enabling structured reasoning over lesion\u2013attribute relations.Extensive experiments with patient-level data splitting demonstrate that our approach achieves state-of-the-art performance on classification and retrieval benchmarks built upon US-365K, while also delivering strong generalization to zero-shot, linear probing, and fine-tuning tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38535", "url": null, "sourceid": 43020, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38536, "uid": "b225f92ace98bb942488b1996b2d6c27", "name": "RF4D:Neural Radar Fields for Novel View Synthesis in Outdoor Dynamic Scenes", "authors": [{"id": 180971, "fullname": "Jiarui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180971?format=json", "institution": "Nanyang Technological University"}, {"id": 190083, "fullname": "Zhihao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190083?format=json", "institution": null}, {"id": 89791, "fullname": "Chong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89791?format=json", "institution": "Nanyang Technological University"}, {"id": 127882, "fullname": "Bihan Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/127882?format=json", "institution": "Nanyang Technological University"}], "abstract": "Neural fields (NFs) have achieved remarkable success in scene reconstruction and novel view synthesis. However, existing NF approaches that rely on RGB or LiDAR inputs often struggle under adverse weather conditions, limiting their robustness in real-world outdoor environments such as autonomous driving. In contrast, millimeter-wave radar is inherently resilient to environmental variations, yet its integration with NFs remains largely underexplored. Moreover, outdoor driving scenes frequently involve dynamic objects, making spatiotemporal modeling crucial for temporally consistent novel view synthesis. To address these challenges, we present RF4D, a radar-based neural field framework tailored for novel view synthesis in outdoor dynamic scenes. RF4D explicitly incorporates temporal information into its representation, enabling more accurate modeling of object motion. A dedicated \\textbf{scene flow module} further predicts temporal offsets between adjacent frames, enforcing temporal occupancy coherence during dynamic scene reconstruction. Moreover, we propose a \\textbf{radar-specific power rendering formulation} grounded in radar sensing physics, improving both synthesis accuracy and interpretability. Extensive experiments on public radar datasets demonstrate that RF4D substantially outperforms existing methods in radar measurement synthesis and occupancy estimation accuracy, with particularly strong gains in dynamic outdoor environments.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38536", "url": null, "sourceid": 33006, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38537, "uid": "6dab40e11c3d66094c85b354febf06a6", "name": "Modeling the Brain\u2019s Grammar: ROI-Guided fMRI Pretraining for Transferable and Interpretable Vision Decoding", "authors": [{"id": 154124, "fullname": "Yulong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154124?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 190084, "fullname": "Hua Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190084?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 144873, "fullname": "Yiyang Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/144873?format=json", "institution": "Hong Kong University of Sicence and Technology"}, {"id": 190085, "fullname": "Chunyang Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190085?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 190086, "fullname": "Sirui Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/190086?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 131536, "fullname": "Yike Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/131536?format=json", "institution": "Imperial College London"}], "abstract": "Recent advances in fMRI pretraining have significantly improved visual decoding accuracy by leveraging cross-subject neuroimaging datasets. A prevailing strategy aligns individual fMRI signals into a shared feature space using subject-specific adapters, followed by a shared decoder. However, this unstructured feature space overlooks the redundancy and functional correlations among voxels and fails to incorporate the brain\u2019s intrinsic functional architecture centered on regions of interest (ROIs).To address these limitations, we propose ROITok, an ROI-guided fMRI pretraining framework. Our method introduces Sparse ROI Context Fusion to learn  ROI-level visual representations and captures functional synergy between ROIs from cross-subject data. Inspired by Matryoshka Representation Learning (MRL),  we design an embedding compression scheme that prioritizes the most informative visual components first, with later tokens adding progressively finer but still useful details.    ROITok achieves strong transfer learning performance on the NSD and GOD datasets and shows strong resilience against high-level additive noises, while offering better interpretability and enabling new applications. It allows for quantitative assessment of each brain region\u2019s contribution to decoding tasks. Our analysis shows that ROI-based pretraining can automatically learn the brain\u2019s visual hierarchy. Different ROIs can provide complementary contexts for decoding tasks; combining them improves decoding robustness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38537", "url": null, "sourceid": 44846, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38538, "uid": "33f9679a317534e1f5cf3a9750a7a48c", "name": "ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers", "authors": [{"id": 152559, "fullname": "Mohsen Ghafoorian", "url": "http://cvpr.thecvf.com/api/miniconf/users/152559?format=json", "institution": "Qualcomm"}, {"id": 106397, "fullname": "Amirhossein Habibian", "url": "http://cvpr.thecvf.com/api/miniconf/users/106397?format=json", "institution": "Qualcomm AI Research"}], "abstract": "Recent advances in video diffusion models have shifted towards transformer-based architectures, achieving state-of-the-art video generation but at the cost of quadratic attention complexity, which severely limits scalability for longer sequences. We introduce ReHyAt, a Recurrent Hybrid Attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention, enabling chunk-wise recurrent reformulation and constant memory usage. Unlike the concurrent linear-only SANA Video, ReHyAt\u2019s hybrid design allows efficient distillation from existing softmax-based models, reducing the training cost by two orders of magnitude to $\\sim$160 GPU hours, while being competitive in the quality. Our light-weight distillation and finetuning pipeline  provides a recipe that can be applied to future state-of-the-art bidirectional softmax-based models. Experiments on VBench and VBench-2.0, as well as a human preference study, demonstrate that ReHyAt achieves state-of-the-art video quality while reducing attention cost from quadratic to linear, unlocking practical scalability for long-duration and on-device video generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38538", "url": null, "sourceid": 46665, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38540, "uid": "e6e9d37a0f6a79c25564cade197a8e3c", "name": "Particulate: Feed-Forward 3D Object Articulation", "authors": [{"id": 131804, "fullname": "Ruining Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/131804?format=json", "institution": "University of Oxford"}, {"id": 162237, "fullname": "YUXIN YAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/162237?format=json", "institution": "University of Cambridge"}, {"id": 180810, "fullname": "Chuanxia Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180810?format=json", "institution": "University of Oxford"}, {"id": 129663, "fullname": "Christian Rupprecht", "url": "http://cvpr.thecvf.com/api/miniconf/users/129663?format=json", "institution": "University of Oxford"}, {"id": 186620, "fullname": "Joan Lasenby", "url": "http://cvpr.thecvf.com/api/miniconf/users/186620?format=json", "institution": "University of Cambridge"}, {"id": 152816, "fullname": "Shangzhe Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152816?format=json", "institution": "University of Cambridge"}, {"id": 85624, "fullname": "Andrea Vedaldi", "url": "http://cvpr.thecvf.com/api/miniconf/users/85624?format=json", "institution": "University of Oxford"}], "abstract": "We introduce Particulate, a feed-forward model that, given a single static 3D mesh of an everyday object, predicts its 3D parts, kinematic structure, and articulation parameters.Unlike prior work on articulated 3D object modeling that is limited by costly per-object optimization and small retrieval databases or requires large vision or language foundation models, our approach is based on a flexible, scalable and lightweight transformer architecture.Trained on a diverse collection of articulated 3D assets from public datasets, Particulate accurately infers the articulated structure of  novel objects, including those generated by image-to-3D models, in a single feed-forward pass.We further introduce a benchmark for articulated 3D object estimation curated from high-quality public 3D assets.Quantitative and qualitative results show that Particulate significantly outperforms state-of-the-art approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38540", "url": null, "sourceid": 31361, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38544, "uid": "14d203158e5c5fd54555ecd1bd9ed8e3", "name": "HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition", "authors": [{"id": 180742, "fullname": "Suhan Woo", "url": "http://cvpr.thecvf.com/api/miniconf/users/180742?format=json", "institution": "Yonsei University"}, {"id": 190097, "fullname": "Seongwon Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/190097?format=json", "institution": "Kookmin University"}, {"id": 190098, "fullname": "jinwoo jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190098?format=json", "institution": null}, {"id": 90082, "fullname": "Euntai Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/90082?format=json", "institution": "Yonsei University"}], "abstract": "Visual environments are inherently hierarchical, as a panoramic view naturally encompasses and organizes multiple perspective views within its field. Capturing this hierarchy is crucial for effective perspective-to-equirectangular (P2E) visual place recognition. In this work, we introduce HypeVPR, a hierarchical embedding framework in hyperbolic space specifically designed to address the challenges of P2E matching. HypeVPR leverages the intrinsic ability of hyperbolic space to represent hierarchical structures, allowing panoramic descriptors to encode both broad contextual information and fine-grained local details. To this end, we propose a hierarchical feature aggregation mechanism that organizes local-to-global feature representations within hyperbolic space. Furthermore, HypeVPR\u2019s hierarchical organization inherently enables flexible control over the accuracy\u2013efficiency trade-off without additional training, while maintaining robust matching across different image types. This approach allows HypeVPR to outperform existing methods while significantly accelerating retrieval and reducing database storage requirements. The codes and models are available: TBD.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38544", "url": null, "sourceid": 33513, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38545, "uid": "c5af15be85875ccc7fdd2d6e415347c4", "name": "Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection", "authors": [{"id": 183145, "fullname": "Ahyoung Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/183145?format=json", "institution": "Yonsei University"}, {"id": 182312, "fullname": "Wonseok Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/182312?format=json", "institution": "Yonsei University"}, {"id": 183737, "fullname": "Songkuk Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/183737?format=json", "institution": "Yonsei University"}], "abstract": "Sparse Autoencoders (SAEs) have demonstrated significant success in interpreting Large Language Models (LLMs) by decomposing dense representations into sparse, semantic components. However, their potential for analyzing Vision Transformers (ViTs) remains largely under-explored. In this work, we present the first application of SAEs to the ViT [CLS] token for out-of-distribution (OOD) detection, addressing the limitation of existing methods that rely on entangled feature representations. We propose a novel framework utilizing a Top-k SAE to disentangle the dense [CLS] features into a structured latent space. Through this analysis, we reveal that in-distribution (ID) data exhibits consistent, class-specific activation patterns, which we formalize as Class Activation Profiles (CAPs). Our study uncovers a key structural invariant: while ID samples preserve a stable pattern within CAPs, OOD samples systematically disrupt this structure. Leveraging this insight, we introduce a scoring function based on the divergence of core energy profiles to quantify the deviation from ideal activation profiles. Our method establishes new state-of-the-art results on the FPR95 metric\u2014critical for safety-sensitive applications\u2014across multiple benchmarks, while also achieving competitive AUROC. Overall, our findings demonstrate that the sparse, disentangled features revealed by SAEs can serve as a powerful, interpretable tool for robust OOD detection in vision models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38545", "url": null, "sourceid": 34416, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38546, "uid": "21f201dbabfce0c78e027b5fc9325811", "name": "Designing to Forget: Deep Semi-parametric Models for Unlearning", "authors": [{"id": 132392, "fullname": "Yijia Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/132392?format=json", "institution": "Purdue University"}, {"id": 184182, "fullname": "YU-SHAN TAI", "url": "http://cvpr.thecvf.com/api/miniconf/users/184182?format=json", "institution": "Purdue University; National Taiwan University"}, {"id": 85283, "fullname": "Raymond A. Yeh", "url": "http://cvpr.thecvf.com/api/miniconf/users/85283?format=json", "institution": "Purdue University"}], "abstract": "Recent advances in machine unlearning have focused on developing algorithms to remove specific training samples from a trained model. In contrast, we observe that not all models are equally easy to unlearn. Hence, we introduce a family of deep semi-parametric models (SPMs) that exhibit non-parametric behavior during unlearning. SPMs use a fusion module that aggregates information from each training sample, enabling explicit test-time deletion of selected samples without altering model parameters. Empirically, we demonstrate that SPMs achieve competitive task performance to parametric models in image classification and generation, while being significantly more efficient for unlearning. Notably, on ImageNet classification, SPMs reduce the prediction gap relative to a retrained (oracle) baseline by $11\\\\%$ and achieve over $10\\times$ faster unlearning compared to existing approaches on parametric models.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38546", "url": null, "sourceid": 41401, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38549, "uid": "a9a80dd485d4058159acd75553f0d2f2", "name": "Learning from Itself: Mining Internal Knowledge from Vision Language Models for Continual Learning", "authors": [{"id": 107148, "fullname": "Yizheng Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/107148?format=json", "institution": "Xi&#x27;an Jiaotong-Liverpool University"}, {"id": 128633, "fullname": "Siyue Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128633?format=json", "institution": "Xi&#x27;an Jiaotong-Liverpool University"}, {"id": 190114, "fullname": "Waleed Al-Nuaimy", "url": "http://cvpr.thecvf.com/api/miniconf/users/190114?format=json", "institution": "University of Liverpool"}, {"id": 89348, "fullname": "Jimin Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/89348?format=json", "institution": "Xi&#x27;an Jiaotong-Liverpool University"}], "abstract": "Vision-language models like CLIP excel at zero-shot recognition but struggle with continual learning due to two critical issues: (1) severe distribution gap between pretraining captions and post-training class names, and (2) performance mismatch between vision-only and dual-encoder approaches\u2014vision-only methods achieve 20% higher accuracy on fine-grained tasks while CLIP dominates on natural images. We propose Learning from Itself (LfI), which mines CLIP's internal knowledge to address both challenges. First, we generate pseudo-captions by optimizing learnable tokens to minimize CLIP's contrastive loss, creating auxiliary training signals that bridge the pretraining-finetuning distribution gap without external models. Second, we introduce adaptive mutual distillation that dynamically weights knowledge transfer between CLIP's text encoder and a temporary vision classifier based on their instantaneous performance\u2014stronger branches teach more, weaker ones learn more. At inference, only the original CLIP architecture is used, having absorbed discriminative knowledge from both branches. LfI achieves state-of-the-art results across multiple continual learning benchmarks, demonstrating that CLIP can effectively teach itself to continually learn new tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38549", "url": null, "sourceid": 39808, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38550, "uid": "e1335d95cb3b93d835eb22780be7327f", "name": "OSPO: Object-Centric Self-Improving Preference Optimization for Text-to-Image Generation", "authors": [{"id": 183900, "fullname": "Yoonjin Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/183900?format=json", "institution": "Korea University Research and Business Foundation"}, {"id": 183899, "fullname": "Yongjin Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/183899?format=json", "institution": "Korea University Research and Business Foundation"}, {"id": 145711, "fullname": "Hyomin Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/145711?format=json", "institution": "Korea University"}, {"id": 190115, "fullname": "Donghwan Chi", "url": "http://cvpr.thecvf.com/api/miniconf/users/190115?format=json", "institution": "Korea University; CMU, Carnegie Mellon University"}, {"id": 85255, "fullname": "Sungwoong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/85255?format=json", "institution": "Korea University"}], "abstract": "Recent advances in Multimodal Large Language Models (MLLMs) have enabled unified multimodal understanding and generation. However, they still struggle with fine-grained text\u2013image alignment, often failing to faithfully depict objects with correct attributes such as color, shape, and spatial relations. To mitigate this issue, previous studies have explored preference optimization methods such as DPO and GRPO, but these approaches incur substantial computational cost, both in constructing preference data and in performing optimization. This has motivated self-improving preference optimization approaches, in which the MLLM autonomously generates its own training data, self-estimates preference feedback, and self-optimizes using the resulting self-constructed preference pairs. However, existing self-improving methods still overlook fine-grained, object-level semantics, allowing object hallucination to persist. To tackle this problem, we propose $\\underline{\\text{O}}$bject-centric $\\underline{\\text{S}}$elf-improving $\\underline{\\text{P}}$reference $\\underline{\\text{O}}$ptimization (OSPO), a self-improving framework designed to enhance object-level text\u2013image alignment. OSPO explicitly constructs object-centric preference data without relying on any external data and external models. We also introduce a new approach that leverages attention-based object masks together with an object-weighted SimPO loss to enhance object-specific fidelity. Extensive experiments on three compositional image generation benchmarks demonstrate that OSPO significantly improves fine-grained alignment and reduces object hallucination, outperforming prior self-improving methods and even specialized diffusion-based text-to-image models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38550", "url": null, "sourceid": 41937, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38572, "uid": "68db38cc9fc94a0feff812a13896202f", "name": "iLRM: An Iterative Large 3D Reconstruction Model", "authors": [{"id": 153532, "fullname": "Gyeongjin Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153532?format=json", "institution": "Sungkyunkwan University"}, {"id": 159455, "fullname": "Seungtae Nam", "url": "http://cvpr.thecvf.com/api/miniconf/users/159455?format=json", "institution": "Yonsei University"}, {"id": 183036, "fullname": "Seung kwon Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183036?format=json", "institution": "Yonsei University"}, {"id": 104400, "fullname": "Xiangyu Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/104400?format=json", "institution": "Sung Kyun Kwan University"}, {"id": 190171, "fullname": "Sameh Khamis", "url": "http://cvpr.thecvf.com/api/miniconf/users/190171?format=json", "institution": "Rembrand"}, {"id": 190172, "fullname": "Abdelrahman Mohamed", "url": "http://cvpr.thecvf.com/api/miniconf/users/190172?format=json", "institution": "Facebook"}, {"id": 159432, "fullname": "Eunbyung Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/159432?format=json", "institution": "Yonsei University"}], "abstract": "Feed-forward 3D modeling has emerged as a promising approach for rapid and high-quality 3D reconstruction. In particular, directly generating explicit 3D representations, such as 3D Gaussian splatting, has attracted significant attention due to its fast and high-quality rendering. However, many state-of-the-art methods, primarily based on transformer architectures, suffer from severe scalability issues because they rely on full attention across image tokens from multiple input views, resulting in prohibitive computational costs as the number of views or image resolution increases. Toward a scalable and efficient feed-forward 3D reconstruction, we introduce an iterative Large 3D Reconstruction Model (iLRM) that generates 3D Gaussian representations through an iterative refinement mechanism, guided by three core principles: (1) decoupling the scene representation from input images to enable compact 3D representations; (2) decomposing global multi-view interactions into a two-stage attention scheme to reduce computational costs; and (3) injecting high-resolution information at every layer to achieve high-fidelity reconstruction. Experimental results on widely used datasets, such as RE10K and DL3DV, demonstrate that iLRM outperforms existing methods in both reconstruction quality and speed.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38572", "url": null, "sourceid": 45565, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38552, "uid": "e5c181ee499b852af66070149f78fb02", "name": "B$^3$-Seg: Camera-Free, Training-Free 3DGS Segmentation via Analytic EIG and Beta\u2013Bernoulli Bayesian Updates", "authors": [{"id": 183050, "fullname": "Hiromichi Kamata", "url": "http://cvpr.thecvf.com/api/miniconf/users/183050?format=json", "institution": "Sony Group Corporation"}, {"id": 190117, "fullname": "Samuel Munro", "url": "http://cvpr.thecvf.com/api/miniconf/users/190117?format=json", "institution": "Pixomondo Innovation"}, {"id": 190118, "fullname": "Fuminori Homma", "url": "http://cvpr.thecvf.com/api/miniconf/users/190118?format=json", "institution": "Sony Group Corporation"}], "abstract": "Interactive 3D Gaussian Splatting (3DGS) segmentation is essential for real-time editing of pre-reconstructed assets in film and game production.However, existing methods rely on predefined camera viewpoints, ground-truth labels, or costly retraining, making them impractical for low-latency use.We propose \\textbf{B$^3$-Seg (Beta--Bernoulli Bayesian Segmentation for 3DGS)}, a fast and theoretically grounded method for open-vocabulary 3DGS segmentation under \\textbf{camera-free} and \\textbf{training-free} conditions.Our approach reformulates segmentation as sequential Beta--Bernoulli Bayesian updates and actively selects the next view via analytic Expected Information Gain (EIG).This Bayesian formulation guarantees the adaptive monotonicity and submodularity of EIG, which produces a greedy $(1{-}1/e)$ approximation to the optimal view sampling policy.Experiments on multiple datasets show that B$^3$-Seg achieves competitive results to high-cost supervised methods while operating end-to-end segmentation within a few seconds.The results demonstrate that B$^3$-Seg enables practical, interactive 3DGS segmentation with provable information efficiency.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38552", "url": null, "sourceid": 38587, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38555, "uid": "dcf93600bbba2cf6f214349c39588527", "name": "Taming Noise-Induced Prototype Degradation for Privacy-Preserving Personalized Federated Fine-Tuning", "authors": [{"id": 180001, "fullname": "Yuhua Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180001?format=json", "institution": "Beihang University"}, {"id": 190125, "fullname": "Qinnan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190125?format=json", "institution": "Beihang University"}, {"id": 190126, "fullname": "Xiaodong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190126?format=json", "institution": "Renmin University of China"}, {"id": 190127, "fullname": "Huan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190127?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 190128, "fullname": "Yifan Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/190128?format=json", "institution": "Renmin University of China"}, {"id": 190129, "fullname": "Wangjie Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190129?format=json", "institution": "Beihang University"}, {"id": 190130, "fullname": "Hainan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190130?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 190131, "fullname": "Yongxin Tong", "url": "http://cvpr.thecvf.com/api/miniconf/users/190131?format=json", "institution": "Beihang University"}, {"id": 185685, "fullname": "Zhiming Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185685?format=json", "institution": "Beihang University"}], "abstract": "Prototype-based Personalized Federated Learning (ProtoPFL) enables efficient cross-domain adaptation by communicating compact class prototypes, but directly sharing prototypes raises privacy risks. A common defense involves per-example $\\ell_2$ clipping before prototype computation to limit sensitivity, followed by the addition of isotropic Gaussian noise during upload to enforce Local Differential Privacy (LDP). However, this Isotropic Gaussian Prototype Perturbation (IGPP) often over-perturbs key discriminative dimensions and struggles to balance the clipping threshold with representation fidelity. We propose VPDR, a client-side privacy plug-in that can be seamlessly integrated into existing ProtoPFL frameworks. Motivated by the statistical prior that dimension-wise class variance reflects discriminability, we introduce Variance-adaptive Prototype Perturbation (VPP), which uses groupwise calibration to apply less noise to discriminative subspaces, preserving semantic separability while ensuring privacy. We further design Distillation-guided Clipping Regularization (DCR), which enables feature norms to adaptively concentrate near the predefined clipping threshold while maintaining prediction consistency. Theoretical analysis shows that our groupwise noise provides privacy guarantees no weaker than those of the isotropic mechanism under the same privacy constraints. Extensive experiments on multiple cross-domain benchmarks demonstrate that VPDR achieves a superior privacy-utility trade-off, outperforming IGPP in personalized federated fine-tuning while maintaining strong privacy protection under realistic attack scenarios.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38555", "url": null, "sourceid": 38769, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38559, "uid": "14b823daa1dc97b96140b6201156550a", "name": "Role-SynthCLIP: A Role-Play Driven Diverse Synthetic Data Approach", "authors": [{"id": 190144, "fullname": "Yuanxiang Huangfu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190144?format=json", "institution": null}, {"id": 190145, "fullname": "Chaochao wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190145?format=json", "institution": "patsnap"}, {"id": 190146, "fullname": "weilei wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190146?format=json", "institution": null}], "abstract": "The effectiveness of Contrastive Language-Image Pre-training (CLIP) models critically depends on the semantic diversity and quality of their training data. However, while existing synthetic data generation methods primarily focus on increasing data volume, such emphasis often leads to limited semantic diversity and redundant or shallow captions. To address this limitation, we propose \\textbf{Role-SynthCLIP}, a novel data synthesis framework that leverages multi-perspective role-playing prompts (e.g., a compositional analyst, an interpreter of image context) to guide Multimodal Large Language Models (MLLMs) in generating semantically diverse captions from distinct viewpoints. This mechanism enhances the semantic diversity and fine-grained image-text alignment of synthetic pairs by enriching each image with multiple complementary captions, while keeping the number of training images fixed. Experimental results demonstrate the effectiveness and efficiency of our method. A CLIP-B/16 model trained on only 1 million images achieves a Recall@1 of $\\mathbf{64.1\\%}$ on the MS COCO validation set, surpassing the best existing synthetic data baseline (trained on 5M pairs) by $2.8$ percentage points. The code and trained models will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38559", "url": null, "sourceid": 34498, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38560, "uid": "7fb0e93718cc2bbf2ad75d2dfb497c77", "name": "MVP: Multiple View Prediction improves GUI grounding", "authors": [{"id": 190147, "fullname": "Yunzhu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190147?format=json", "institution": "Zhejiang University"}, {"id": 179962, "fullname": "Zeyu Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/179962?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 187083, "fullname": "Zhengwen Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187083?format=json", "institution": "Ant Group"}, {"id": 187084, "fullname": "Shuheng Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187084?format=json", "institution": null}, {"id": 90247, "fullname": "Changhua Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90247?format=json", "institution": "Nanjing University"}, {"id": 74218, "fullname": "Linchao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/74218?format=json", "institution": "Zhejiang University"}], "abstract": "GUI grounding, which translates natural language instructions into precise pixel coordinates, is essential for developing practical GUI agents. However, we observe that existing grounding models exhibit significant \\textbf{coordinate prediction instability}\u2014minor visual perturbations (e.g., cropping a few pixels) can drastically alter predictions, flipping results between correct and incorrect. This instability severely undermines model performance, especially for samples with high-resolution and small UI elements. To address this issue, we propose Multi-View Prediction (MVP), a training-free framework that enhances grounding performance through multi-view inference. Our key insight is that while single-view predictions may be unstable, aggregating predictions from multiple carefully cropped views can effectively distinguish correct coordinates from outliers. MVP comprises two components: (1) \\textbf{Attention-Guided View Proposal}, which derives diverse views guided by instruction-to-image attention scores, and (2) \\textbf{Multi-Coordinates Clustering}, which ensembles predictions by selecting the centroid of the densest spatial cluster. Extensive experiments demonstrate MVP's effectiveness across various models and benchmarks. Notably, on ScreenSpot-Pro, MVP boosts UI-TARS-1.5-7B to 56.1\\%, GTA1-7B to 61.7\\%, Qwen3VL-8B-Instruct to 65.3\\%, and Qwen3VL-32B-Instruct to 74.0\\%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38560", "url": null, "sourceid": 34678, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38561, "uid": "be1c598ba7f697a4d255b4df13d8ffde", "name": "DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding,  Perception, Prediction and Planning", "authors": [{"id": 169543, "fullname": "Zhe Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/169543?format=json", "institution": "The University of Hong Kong"}, {"id": 154122, "fullname": "Runhui Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154122?format=json", "institution": "University of Hong Kong"}, {"id": 190148, "fullname": "Rui Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190148?format=json", "institution": "The University of HongKong"}, {"id": 190149, "fullname": "Siming Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190149?format=json", "institution": "Yinwang Intelligent Technology Co. Ltd."}, {"id": 190150, "fullname": "Zining Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190150?format=json", "institution": null}, {"id": 128148, "fullname": "Lu Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/128148?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 154517, "fullname": "Di Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/154517?format=json", "institution": "Tianjin University"}, {"id": 85817, "fullname": "Xiang Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/85817?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 87814, "fullname": "Hengshuang Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/87814?format=json", "institution": "The University of Hong Kong"}], "abstract": "Although multimodal large language models (MLLMs) have shown remarkable capabilities across diverse domains, their application in generating fine-grained 3D perception and prediction outputs within a unified framework remains underexplored. In this paper, we propose DrivePI, a novel spatial-aware 4D MLLM that serves as a unified Vision-Language-Action (VLA) framework for autonomous driving, performing spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through joint optimization. We term it 4D MLLM as it outputs both 3D occupancy and flow, capturing fine-grained spatial-temporal dynamics. Specifically, to capture both precise geometric information and rich appearance, our approach integrates point clouds, multi-view images and language instructions within a single MLLM architecture. Remarkably, despite utilizing only a 0.5B Qwen2.5 model as the MLLM, our proposed DrivePI still maintains promising textual scene understanding while achieving competitive performance in 3D perception, prediction, and planning tasks. Moreover, DrivePI even surpasses most specialized vision-based models across these tasks, highlighting the effectiveness of our unified approach. We hope this new VLA framework can inspire future research to enhance autonomous driving systems with improved interpretability and explainable decision-making through language reasoning and fine-grained 3D outputs. To facilitate future research, we will release the code and annotated datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38561", "url": null, "sourceid": 33074, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38564, "uid": "1d87fb03475c5f052588cd435b46656d", "name": "Is Parameter Isolation Better for Prompt-Based Continual Learning?", "authors": [{"id": 181750, "fullname": "Jiangyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181750?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 154066, "fullname": "Chenhao Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/154066?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 188493, "fullname": "SongLin Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/188493?format=json", "institution": "Shenzhen University of Advanced Technology"}, {"id": 154063, "fullname": "Qiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154063?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 190157, "fullname": "Jianchao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190157?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 87254, "fullname": "Yuhang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/87254?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 87250, "fullname": "Yihong Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/87250?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "Prompt-based continual learning methods effectively mitigate catastrophic forgetting. However, most existing methods assign a fixed set of prompts to each task, completely isolating knowledge across tasks and resulting in suboptimal parameter utilization. To address this, we consider the practical needs of continual learning and propose a prompt-sharing framework. This framework constructs a global prompt pool and introduces a task-aware gated routing mechanism that sparsely activates a subset of prompts to achieve dynamic decoupling and collaborative optimization of task-specific feature representations. Furthermore, we introduce a history-aware modulator that leverages cumulative prompt activation statistics to protect frequently used prompts from excessive updates, thereby mitigating inefficient parameter usage and knowledge forgetting. Extensive analysis and empirical results demonstrate that our approach consistently outperforms existing static allocation strategies in effectiveness and efficiency. Code and models will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38564", "url": null, "sourceid": 39552, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38569, "uid": "ef78696cd7010762dd352b66f28acf95", "name": "E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training", "authors": [{"id": 86486, "fullname": "Qitao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86486?format=json", "institution": "CMU"}, {"id": 128744, "fullname": "Hao Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/128744?format=json", "institution": "Adobe Systems"}, {"id": 150923, "fullname": "Qianqian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150923?format=json", "institution": "University of California, Berkeley"}, {"id": 127654, "fullname": "Sai Bi", "url": "http://cvpr.thecvf.com/api/miniconf/users/127654?format=json", "institution": "Adobe Systems"}, {"id": 127646, "fullname": "Kai Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127646?format=json", "institution": "Adobe Systems"}, {"id": 89590, "fullname": "Kalyan Sunkavalli", "url": "http://cvpr.thecvf.com/api/miniconf/users/89590?format=json", "institution": "Adobe Research"}, {"id": 76012, "fullname": "Shubham Tulsiani", "url": "http://cvpr.thecvf.com/api/miniconf/users/76012?format=json", "institution": "Carnegie Mellon University"}, {"id": 135363, "fullname": "Hanwen Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/135363?format=json", "institution": "Adobe Systems"}], "abstract": "Self-supervised pre-training has revolutionized foundation models for language, 2D images and videos, but remains largely unexplored for learning 3D-aware representations from multi-view images. In this paper, we present E-RayZer, a self-supervised large 3D Vision model that learns truly 3D-aware representations directly from unlabeled images. Unlike prior self-supervised methods such as RayZer that infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with explicit geometry. This formulation eliminates shortcut solutions and yields representations that are geometrically grounded. To ensure convergence and scalability, we introduce a novel fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources in an entirely unsupervised manner. Experiments demonstrate that E-RayZer significantly outperforms RayZer on pose estimation, matches or sometimes surpasses fully supervised reconstruction models such as VGGT. Furthermore, its learned representations outperform leading visual pre-training models (e.g., DINOv2, CroCo v2, VideoMAE V2, and RayZer) when transferring to 3D downstream tasks, establishing E-RayZer as a new paradigm for 3D-aware visual pre-training.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38569", "url": null, "sourceid": 41219, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38570, "uid": "365aa59e779547d7e081b220f8fa67ac", "name": "MOS: Mitigating Optical-SAR Modality Gap for Cross-Modal Ship Re-Identification", "authors": [{"id": 144896, "fullname": "Yujian Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/144896?format=json", "institution": "Beihang University"}, {"id": 190170, "fullname": "Hankun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190170?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 131299, "fullname": "Guanglin Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131299?format=json", "institution": "Beihang University"}], "abstract": "Cross-modal ship re-identification (ReID) between optical and synthetic aperture radar (SAR) imagery has recently emerged as a critical yet underexplored task in maritime intelligence and surveillance. However, the substantial modality gap between optical and SAR images poses a major challenge for robust identification. To address this issue, we propose MOS, a novel framework designed to Mitigate the Optical\u2013SAR modality gap and achieve modality-consistent feature learning for optical-SAR cross-modal ship ReID. MOS consists of two core components: (1) Modality-Consistent Representation Learning (MCRL) applies denoise SAR image procession and a class-wise modality alignment loss to align intra-identity feature distributions across modalities. (2) Cross-modal Data Generation and Feature fusion (CDGF) leverages a brownian bridge diffusion model to synthesize cross-modal samples, which are subsequently fused with original features during inference to enhance alignment and discriminability. Extensive experiments on the HOSS ReID dataset demonstrate that MOS significantly surpasses state-of-the-art methods across all evaluation protocols, achieving notable improvements of +3.0\\%, +6.2\\%, and +16.4\\% in R1 accuracy under the ALL to ALL,Optical to SAR, and SAR to Optical settings, respectively. The code and trained models will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38570", "url": null, "sourceid": 33482, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38571, "uid": "b78488f85a5c6a2c19698a4777b06019", "name": "UniLight: A Unified Representation for Lighting", "authors": [{"id": 180801, "fullname": "Zitian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180801?format=json", "institution": "Universit\u00e9 Laval"}, {"id": 127644, "fullname": "Iliyan Georgiev", "url": "http://cvpr.thecvf.com/api/miniconf/users/127644?format=json", "institution": "Adobe"}, {"id": 180018, "fullname": "Michael Fischer", "url": "http://cvpr.thecvf.com/api/miniconf/users/180018?format=json", "institution": "Adobe"}, {"id": 85036, "fullname": "Yannick Hold-Geoffroy", "url": "http://cvpr.thecvf.com/api/miniconf/users/85036?format=json", "institution": "Adobe Research"}, {"id": 86795, "fullname": "Jean-Fran\u00e7ois Lalonde", "url": "http://cvpr.thecvf.com/api/miniconf/users/86795?format=json", "institution": "Universit\u00e9 Laval"}, {"id": 129565, "fullname": "Valentin Deschaintre", "url": "http://cvpr.thecvf.com/api/miniconf/users/129565?format=json", "institution": "Adobe Research"}], "abstract": "Lighting has a strong influence on visual appearance, yet understanding and representing lighting in images remains notoriously difficult. Various lighting representations exist, such as environment maps, irradiance, spherical harmonics, or text, but they are incompatible, which limits cross-modal transfer. We thus propose Unilight, a joint latent space as lighting representation, that unifies multiple modalities within a shared embedding. Modality-specific encoders for text, images, irradiance, and environment maps are trained contrastively to align their representations, with an auxiliary spherical-harmonics prediction task reinforcing directional understanding. Our multi-modal data pipeline enables large-scale training and evaluation across three tasks: lighting-based retrieval, environment-map generation, and lighting control in diffusion-based image synthesis. Experiments show that our representation captures consistent and transferable lighting features, enabling flexible manipulation across modalities.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38571", "url": null, "sourceid": 42200, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38581, "uid": "6299f7b7e197e0a3cfb007c2c0b12a20", "name": "UniPixie: Unified and Probabilistic 3D Physics Learning via Flow Matching", "authors": [{"id": 176231, "fullname": "Qilin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176231?format=json", "institution": "Southern University of Science and Technology"}, {"id": 176908, "fullname": "Quynh Anh Huynh", "url": "http://cvpr.thecvf.com/api/miniconf/users/176908?format=json", "institution": "University of Pennsylvania"}, {"id": 190200, "fullname": "Long Le", "url": "http://cvpr.thecvf.com/api/miniconf/users/190200?format=json", "institution": "University of Pennsylvania, University of Pennsylvania"}, {"id": 70682, "fullname": "Chen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70682?format=json", "institution": "University of Pennsylvania, University of Pennsylvania"}, {"id": 145576, "fullname": "Chuhao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/145576?format=json", "institution": "University of Pennsylvania, University of Pennsylvania"}, {"id": 190201, "fullname": "Ryan Lucas", "url": "http://cvpr.thecvf.com/api/miniconf/users/190201?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 190202, "fullname": "Eric Eaton", "url": "http://cvpr.thecvf.com/api/miniconf/users/190202?format=json", "institution": "University of Pennsylvania"}, {"id": 90047, "fullname": "Lingjie Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90047?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}], "abstract": "Recent progress in 3D reconstruction, such as NeRFs and 3D Gaussian Splatting, has made it easy to recover geometry and appearance from images. However, these static representations remain blind to the physics that govern how objects deform and respond to forces. Building interactive 3D worlds therefore requires predicting not only shape but the underlying material properties. Prior approaches either rely on slow test-time optimization or, more recently, a fast feed-forward predictor such as Pixie. However, these models produce only a single point estimate of physical parameters and are limited to a single simulation backend, restricting both expressiveness and portability. We introduce UniPixie, a generative physics-from-pixels framework that overcomes both limitations. UniPixie predicts a controllable, continuous soft-to-stiff distribution of plausible material properties from a single visual input, capturing inherent physical ambiguity. In addition, UniPixie is the first unified architecture to generate simulation-ready parameters for multiple physics solvers, including Material Point Method (MPM), Linear Blend Skinning (LBS), and Spring-Mass systems. Trained on our new PIXIEMULTIVERSE dataset of annotated material ranges, UniPixie produces diverse, physically consistent dynamics and achieves state-of-the-art accuracy, outperforming deterministic baselines by over 2x while inheriting the fast and generalizable inference from the prior feed-forward work.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38581", "url": null, "sourceid": 39186, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38582, "uid": "cda8323e7b825035a16cce8f275d665c", "name": "NexusFlow: Unifying Disparate Tasks under Partial Supervision via Invertible Flow Networks", "authors": [{"id": 182284, "fullname": "Fangzhou Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/182284?format=json", "institution": "Worcester Polytechnic Institute"}, {"id": 179883, "fullname": "Yuping Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/179883?format=json", "institution": "University of Michigan"}, {"id": 96697, "fullname": "Yuliang Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/96697?format=json", "institution": "Bosch US Research"}, {"id": 150358, "fullname": "Zixun Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150358?format=json", "institution": "Bosch Research North America"}, {"id": 129002, "fullname": "Xinyu Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129002?format=json", "institution": "Robert Bosch Research NA"}, {"id": 190203, "fullname": "Haichong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190203?format=json", "institution": "Worcester Polytechnic Institute"}, {"id": 190204, "fullname": "Kazunori Yamada", "url": "http://cvpr.thecvf.com/api/miniconf/users/190204?format=json", "institution": null}, {"id": 155027, "fullname": "Zhengzhong Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155027?format=json", "institution": "Texas A&amp;M University - College Station"}, {"id": 84539, "fullname": "Liu Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/84539?format=json", "institution": "Bosch Research"}, {"id": 77101, "fullname": "Ziming Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77101?format=json", "institution": "Worcester Polytechnic Institute"}], "abstract": "Partially Supervised Multi-Task Learning (PS-MTL) aims to leverage knowledge across tasks when annotations are incomplete. Existing approaches, however, have largely focused on the simpler setting of homogeneous, dense prediction tasks, leaving the more realistic challenge of learning from structurally diverse tasks unexplored. To this end, we introduce NexusFlow, a novel, lightweight, and plug-and-play framework effective in both settings. NexusFlow introduces a set of surrogate networks with invertible coupling layers to align the latent feature distributions of tasks, creating a unified representation that enables effective knowledge transfer. The coupling layers are bijective, preserving information while mapping features into a shared canonical space. This invertibility avoids representational collapse and enables alignment across structurally different tasks without reducing expressive capacity.We first evaluate NexusFlow on the core challenge of domain-partitioned autonomous driving, where dense map reconstruction and sparse multi-object tracking are supervised in different geographic regions, creating both structural disparity and a strong domain gap. NexusFlow sets a new state-of-the-art result on nuScenes, outperforming strong partially supervised baselines. To demonstrate generality, we further test NexusFlow on NYUv2 using three homogeneous dense prediction tasks, segmentation, depth, and surface normals, as a representative N-task PS-MTL scenario. NexusFlow yields consistent gains across all tasks, confirming its broad applicability. Our code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38582", "url": null, "sourceid": 38501, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38587, "uid": "e562e1bad2cd2a5cc16a8a5108af139e", "name": "Unified Primitive Proxies for Structured Shape Completion", "authors": [{"id": 151979, "fullname": "Zhaiyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/151979?format=json", "institution": "Technical University of Munich"}, {"id": 158344, "fullname": "Yuqing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158344?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}, {"id": 152189, "fullname": "Xiao Xiang Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152189?format=json", "institution": "Technical University Munich"}], "abstract": "Structured shape completion recovers missing geometry as primitives rather than as unstructured points, which enables primitive-based surface reconstruction. Instead of following the prevailing cascade, we rethink how primitives and points should interact, and find it more effective to decode primitives in a dedicated pathway that attends to shared shape features. Following this principle, we present UniCo, which in a single feed-forward pass predicts a set of primitives with complete geometry, semantics, and inlier membership. To drive this unified representation, we introduce primitive proxies, learnable queries that are contextualized to produce assembly-ready outputs. To ensure consistent optimization, our training strategy couples primitives and points with online target updates. Across synthetic and real-world benchmarks with four independent assembly solvers, UniCo consistently outperforms recent baselines, lowering Chamfer distance by up to 50\\% and improving normal consistency by up to 7\\%. These results establish an attractive recipe for structured 3D understanding from incomplete data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38587", "url": null, "sourceid": 41831, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38588, "uid": "67a00e4d31786c71df7d83c4eba5b3cf", "name": "AudioStory: Generating Long-Form Narrative Audio with Large Language Models", "authors": [{"id": 127341, "fullname": "Yuxin Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/127341?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 159447, "fullname": "Teng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/159447?format=json", "institution": "Tencent ARC Lab"}, {"id": 86827, "fullname": "Yuying Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/86827?format=json", "institution": "University of Hong Kong"}, {"id": 102639, "fullname": "Shijie Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/102639?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 76515, "fullname": "Yixiao Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/76515?format=json", "institution": "Tencent"}, {"id": 127332, "fullname": "Wei Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/127332?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}], "abstract": "Recent advances in text-to-audio (TTA) generation excel at synthesizing short audio clips but struggle with long-form narrative audio, which requires temporal coherence and compositional reasoning. To fill this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems to generate structured, long-form audio narratives. AudioStory possesses strong instruction-following reasoning generation capabilities. It employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues, enabling coherent scene transitions and emotional tone consistency. AudioStory has two appealing features: (1) Decoupled bridging mechanism: AudioStory disentangles LLM-diffuser collaboration into two specialized components, i.e., a bridging query for intra-event semantic alignment and a residual query for inter-event coherence preservation. (2) End-to-end training: By unifying instruction comprehension and audio generation within a single end-to-end framework, AudioStory eliminates the need for modular training pipelines while enhancing synergy between components. Furthermore, we establish a benchmark AudioStory-10K, encompassing diverse domains such as animated soundscapes and natural sound narratives. Extensive experiments show the superiority of AudioStory on both single-audio generation and narrative audio generation, surpassing prior TTA baselines in both instruction-following ability and audio fidelity. Our code and dataset will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38588", "url": null, "sourceid": 37048, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38589, "uid": "9012dbd23c4d2e33be5eccb0e5f72517", "name": "ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking", "authors": [{"id": 172736, "fullname": "Lihong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/172736?format=json", "institution": "Jilin University"}, {"id": 190215, "fullname": "Liangqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190215?format=json", "institution": null}, {"id": 89325, "fullname": "Weiwei Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/89325?format=json", "institution": "University of Science and Technology of China"}, {"id": 152964, "fullname": "Jiamin Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152964?format=json", "institution": "The Chinese University of Hong Kong, Shanghai AI Laboratory"}, {"id": 183238, "fullname": "Changtao Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/183238?format=json", "institution": "Ant Group"}, {"id": 158521, "fullname": "Tieru Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/158521?format=json", "institution": "school of AI, Jilin University"}, {"id": 126769, "fullname": "Rui Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/126769?format=json", "institution": "Jilin University"}, {"id": 190216, "fullname": "Bo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190216?format=json", "institution": null}, {"id": 190217, "fullname": "Zhe Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190217?format=json", "institution": "Ant Group"}], "abstract": "CoT has significantly enhanced the reasoning ability of LLMs while it faces challenges when extended to multimodal domains, particularly in mathematical tasks.Existing MLLMs typically perform textual reasoning solely from a single static mathematical image, overlooking dynamic visual acquisition during reasoning.In contrast, humans repeatedly examine visual image and employ step-by-step reasoning to prove intermediate propositions.This strategy of decomposing the problem-solving process into key logical nodes adheres to Miller's Law in cognitive science.Inspired by this insight, we propose a **ViRC** framework for multimodal mathematical tasks, introducing a **Reason Chunking** mechanism that structures multimodal mathematical CoT into consecutive **Critical Reasoning Units (CRUs)** to simulate human expert problem-solving patterns.CRUs ensure intra-unit textual coherence for intermediate proposition verification while integrating visual information across units to generate subsequent propositions and support structured reasoning.To this end, we present **CRUX** dataset by using three visual tools and four reasoning patterns to provide explicitly annotated CRUs across multiple reasoning paths for each mathematical problem.Leveraging the CRUX dataset, we propose a progressive training strategy inspired by human cognitive learning, which includes Instructional SFT, Practice SFT, and Strategic RL, aimed at further strengthening the Reason Chunking ability of the model.The resulting **ViRC-7B** model achieves a 18.8\\% average improvement over baselines across multiple mathematical benchmarks.The codes will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38589", "url": null, "sourceid": 36756, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38596, "uid": "d2d5b139c2d31f6055633cba79228a33", "name": "PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning", "authors": [{"id": 190244, "fullname": "Hee Suk Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/190244?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 190245, "fullname": "Eunseop Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/190245?format=json", "institution": "KAIST"}, {"id": 134042, "fullname": "Ji Woo Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/134042?format=json", "institution": "Korea Advanced Institute of Science and Technology"}, {"id": 190246, "fullname": "SooHwan Eom", "url": "http://cvpr.thecvf.com/api/miniconf/users/190246?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 153239, "fullname": "Gwanhyeong Koo", "url": "http://cvpr.thecvf.com/api/miniconf/users/153239?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology (KAIST)"}, {"id": 190247, "fullname": "Mark A. Hasegawa-Johnson", "url": "http://cvpr.thecvf.com/api/miniconf/users/190247?format=json", "institution": "University of Illinois, Urbana Champaign"}, {"id": 85511, "fullname": "Qi Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/85511?format=json", "institution": "Microsoft Research Asia"}, {"id": 86583, "fullname": "Chong Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/86583?format=json", "institution": "Microsoft Research Asia"}, {"id": 151335, "fullname": "Chang D. Yoo", "url": "http://cvpr.thecvf.com/api/miniconf/users/151335?format=json", "institution": "KAIST"}], "abstract": "Reinforcement Learning with Verifiable Rewards (RLVR) traditionally relies on a sparse, outcome-based signal. Recent work shows that providing a fine-grained, model-intrinsic signal\u2014rewarding the confidence growth in the ground-truth answer\u2014effectively improves language reasoning training by providing step-level guidance without costly external models. While effective for unimodal text, we find that naively applying this global reward to vision-language (V-L) reasoning is a suboptimal strategy, as the task is a heterogeneous mix of sparse visual perception and dense textual reasoning. This global normalization creates mixture-induced signal degradation, where the training signal for visual steps is statistically distorted by the predominant textual steps. We propose Perception-Decomposed Confidence Reward (PDCR), a framework that solves this by aligning the reward structure with the task's heterogeneous nature. PDCR first performs an unsupervised skill decomposition, introducing a model-internal Visual Dependence Score to quantify visual reliance and applying a clustering algorithm to separate perception and reasoning steps. Based on this, PDCR computes a decomposed advantage by normalizing confidence gains within each skill cluster. This intra-cluster normalization provides a stable, correctly-scaled signal for both perception and reasoning. We demonstrate that PDCR outperforms the naive, global-reward formulation and sparse-reward baselines on key V-L reasoning benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38596", "url": null, "sourceid": 38938, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38598, "uid": "1a08d68b5124c82c0131d4e61c85dd8a", "name": "AnyID: Ultra-Fidelity Universal Identity-Preserving Video Generation from Any Visual References", "authors": [{"id": 174417, "fullname": "Jiahao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174417?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 190258, "fullname": "Hualian Sheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190258?format=json", "institution": "Alibaba Group"}, {"id": 190259, "fullname": "Sijia Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/190259?format=json", "institution": null}, {"id": 189137, "fullname": "Yuxiao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189137?format=json", "institution": null}, {"id": 189405, "fullname": "Weizhan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189405?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 190260, "fullname": "Caixia Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190260?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 154152, "fullname": "Bing Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/154152?format=json", "institution": "Alibaba Group"}, {"id": 88210, "fullname": "Jieping Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/88210?format=json", "institution": "Alibaba Group"}], "abstract": "Identity-preserving video generation offers powerful tools for creative expression, allowing users to customize videos featuring their beloved characters. However, prevailing methods are typically designed and optimized for a single identity reference. This underlying assumption introduces two significant limitations: it curtails creative flexibility by poorly accommodating diverse, real-world input formats, and more critically, it compromises identity fidelity. Relying on a single source is an ill-posed setting, and provides an inherently ambiguous foundation, making it difficult for the model to faithfully reproduce an identity across novel contexts. In response, we present AnyID, an ultra-fidelity identity-preservation video generation framework. Our approach makes two core contributions. First, we introduce a scalable omni-referenced architecture that effectively unifies heterogeneous identity inputs (e.g., faces, portraits, and videos) into a cohesive representation. Second, we propose a primary-referenced generation paradigm, which designates one reference as a canonical anchor and uses a novel differential prompt to enable precise, attribute-level controllability. The model is trained on a large-scale, meticulously curated dataset to ensure robustness and high fidelity. In addition, we perform a final fine-tuning stage using reinforcement learning. This process leverages a preference dataset constructed from human evaluations, where annotators performed pairwise comparisons of videos based on two key criteria: identity fidelity and prompt controllability. Extensive evaluations validate that AnyID achieves ultra-high identity fidelity as well as superior attribute-level controllability across different task settings. All the codes, data and models will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38598", "url": "https://johnneywang.github.io/AnyID-webpage/", "sourceid": 31844, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38601, "uid": "41dba7c677fd1dc761ad8717c352da92", "name": "Detect Anything via Next Point Prediction", "authors": [{"id": 129387, "fullname": "Qing Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129387?format=json", "institution": "South China University of Technology"}, {"id": 190265, "fullname": "Junan Huo", "url": "http://cvpr.thecvf.com/api/miniconf/users/190265?format=json", "institution": "South China University of Technology"}, {"id": 91696, "fullname": "Xingyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/91696?format=json", "institution": "Xiaobing.AI"}, {"id": 190266, "fullname": "Yuda Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/190266?format=json", "institution": "International Digital Economy Academy, International Digital Economy Academy"}, {"id": 190267, "fullname": "Zhaoyang Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190267?format=json", "institution": "International Digital Economy Academy, International Digital Economy Academy"}, {"id": 85985, "fullname": "Yihao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/85985?format=json", "institution": "International Digital Economy Academy"}, {"id": 129395, "fullname": "Tianhe Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/129395?format=json", "institution": "The International Digital Economy Academy"}, {"id": 155941, "fullname": "Junzhi Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155941?format=json", "institution": "Peking University"}, {"id": 84971, "fullname": "Lei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84971?format=json", "institution": "International Digital Economy Academy (IDEA)"}], "abstract": "Object detection has long been dominated by traditional coordinate regression-based models, such as YOLO, DETR, and Grounding DINO. Although recent efforts have attempted to leverage MLLMs to tackle this task, they face challenges like low recall rate, duplicate predictions, coordinate misalignment, etc. In this work, we bridge this gap and propose \\textbf{Rex-Omni}, a 3B-scale MLLM that achieves state-of-the-art object perception performance. On benchmarks like COCO and LVIS, Rex-Omni attains performance comparable to or exceeding regression-based models (e.g., DINO, Grounding DINO) in a zero-shot setting. This is enabled by three key designs: \\textbf{1) Task Formulation}: we use special tokens to represent quantized coordinates from 0 to 999, reducing the model's learning difficulty and improving token efficiency for coordinate prediction; \\textbf{2) Data Engines}: we construct multiple data engines to generate high-quality grounding, referring, and pointing data, providing semantically rich supervision for training; \\textbf{3) Training Pipelines}: we employ a two-stage training process, combining supervised fine-tuning on 22 million data with GRPO-based reinforcement post-training. This RL post-training leverages geometry-aware rewards to effectively bridge the discrete-to-continuous coordinate prediction gap, improve box accuracy, and mitigate undesirable behaviors like duplicate predictions that stem from the teacher-guided nature of the initial SFT stage. Beyond conventional detection, Rex-Omni's inherent language understanding enables versatile capabilities such as object referring, pointing, visual prompting, GUI grounding, spatial referring, OCR and key-pointing, all systematically evaluated on dedicated benchmarks. We believe that Rex-Omni paves the way for more versatile and language-aware visual perception systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38601", "url": null, "sourceid": 40489, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38606, "uid": "f0abad49a501340a5608cdf3014a737c", "name": "SuP: Sub-cloud Driven Point Cloud Registration", "authors": [{"id": 190276, "fullname": "Sheldon Fung", "url": "http://cvpr.thecvf.com/api/miniconf/users/190276?format=json", "institution": null}, {"id": 178408, "fullname": "Wei Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/178408?format=json", "institution": "OPT Machine Vision Tech Co.,Ltd Japan"}, {"id": 190277, "fullname": "Ling Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190277?format=json", "institution": "OPT Machine Vision"}, {"id": 131397, "fullname": "Fei Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/131397?format=json", "institution": "Institute of Software, Chinese Academy of Sciences"}, {"id": 190278, "fullname": "Ling Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190278?format=json", "institution": "University of Technology Sydney"}, {"id": 156713, "fullname": "Shasha Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/156713?format=json", "institution": "Xidian University"}, {"id": 92749, "fullname": "Hongdong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/92749?format=json", "institution": "Australian National University"}, {"id": 156718, "fullname": "Xuequan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156718?format=json", "institution": "The University of Western Australia"}], "abstract": "While existing point-cloud-registration methods can well handle high-overlap scenarios of two point clouds,  they often struggle with low-overlap scenarios,  due to inevitable geometric/semantic ambiguities in the non-overlapping regions.  In this paper, we introduce SuP, a novel framework that reformulates low-overlap registration as a high-overlap sub-cloud pairs (anchor pairs) mining problem. Central to SuP is our Dual-phase Sub-cloud Anchor Mining (DSAM) module, which first subdivides the source and target point clouds into multiple sub-clouds, followed by introducing a dual-phase weighting pipeline: 1) an efficient overlap-guided prior-weighting scheme (OPS) that leverages feature salience to identify candidate anchor pairs, and 2) a multi-scale post-weighting network (MPN) that exploits neighborhood feature consensus to further identify anchor pairs. Subsequently, final correspondences are generated through a merge-to-match module using the anchor pairs. To train DSAM, we design an alignment-aware weighting loss that uses on-the-fly alignment errors as supervision. Comprehensive experiments on the color-enhanced 3DMatch and 3DLoMatch demonstrate that SuP significantly outperforms state-of-the-art methods, achieving higher registration recall and more accurate alignment, especially under challenging low-overlap conditions.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38606", "url": null, "sourceid": 42922, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38607, "uid": "f6b35e5217428f5021e113525a55e6f4", "name": "AdaRadar: Rate Adaptive Spectral Compression for Radar-based Perception", "authors": [{"id": 190279, "fullname": "Jinho Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/190279?format=json", "institution": "Columbia University"}, {"id": 87674, "fullname": "Se Young Chun", "url": "http://cvpr.thecvf.com/api/miniconf/users/87674?format=json", "institution": "Seoul National University"}, {"id": 190280, "fullname": "Mingoo Seok", "url": "http://cvpr.thecvf.com/api/miniconf/users/190280?format=json", "institution": "Columbia University"}], "abstract": "Radar is a critical perception modality in autonomous driving systems due to its all-weather characteristics and ability to measure range and Doppler velocity. However, the sheer volume of high-dimensional raw radar data saturates the communication link to the computing engine (e.g., an NPU), which is often a low-bandwidth interface with data rate provisioned only for a few low-resolution range-Doppler frames. A generalized codec for utilizing high-dimensional radar data is notably absent, while existing image-domain approaches are unsuitable, as they typically operate at fixed compression ratios and fail to adapt to varying or adversarial conditions. In light of this, we propose radar data compression with adaptive feedback. It dynamically adjusts the compression ratio by performing gradient descent from the proxy gradient of detection confidence with respect to the compression rate. We employ a zeroth-order gradient approximation as it enables gradient computation even with non-differentiable core operations--pruning and quantization. This also avoids transmitting the gradient tensors over the band-limited link, which, if estimated, would be as large as the original radar data. In addition, we have found that radar feature maps are heavily concentrated on a few frequency components. Thus, we apply the discrete cosine transform to the radar data cubes and selectively prune out the coefficients effectively. We preserve the dynamic range of each radar patch through scaled quantization. Combining those techniques, our proposed online adaptive compression scheme achieves over 100x feature size reduction at minimal performance drop (~1%p). We validate our results on the RADIal, CARRADA, and Radatron datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38607", "url": null, "sourceid": 30751, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38611, "uid": "b3ba081ea31233eb2bbae7c31d752082", "name": "Clay-to-Stone: Phase-wise 3D Gaussian Splatting for Monocular Articulated Hand-Object Manipulation Modeling", "authors": [{"id": 181625, "fullname": "Xingyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181625?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 126479, "fullname": "Pengfei Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/126479?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 126480, "fullname": "Qi Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/126480?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 153469, "fullname": "Haifeng Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/153469?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 126498, "fullname": "Zirui Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126498?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 126466, "fullname": "Jianxin Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/126466?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 126469, "fullname": "Jingyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126469?format=json", "institution": "Beijing University of Post and Telecommunication, Tsinghua University"}], "abstract": "Understanding hand-object interaction from monocular videos is crucial for immersive and dexterous interactions in AR/VR and robotic applications. However, existing monocular reconstruction methods primarily assume rigid grasping and static object geometry. When applied to articulated manipulations, the continuous joint rotations and frequent component deformations introduce a strong coupling between shape and motion, leading to severe ambiguity and instability in articulation optimization under monocular observation. To address this challenge, we propose a Clay-to-Stone dual-phase framework, modeling the articulated manipulation at hierarchical granularities, enabling a progression from flexible semantic exploration to structured articulation recovery. In the CLAY phase, our method performs fine-grained control over geometric deformation, guided by inter-part semantic correlation learning. As semantic and motion priors emerge, the STONE phase enforces rigid constraints to consolidate articulated structures and explicitly estimates motion parameters. Experiments on a real-world manipulation dataset show that our method achieves state-of-the-art reconstruction quality and plausible articulation modeling from monocular videos.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38611", "url": null, "sourceid": 35641, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38615, "uid": "11f2365950119362500e06af693e380b", "name": "DynamicsBoost: Dynamic Plausible Video Generation via Annotation-Free Continuation Preference Optimization", "authors": [{"id": 183942, "fullname": "Jiaxing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183942?format=json", "institution": "Nanyang Technological University"}, {"id": 92035, "fullname": "Jiepeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/92035?format=json", "institution": "China Telecom"}, {"id": 158338, "fullname": "Junyao Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/158338?format=json", "institution": "Tongji University"}, {"id": 106946, "fullname": "Yang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/106946?format=json", "institution": "UC Santa Cruz"}, {"id": 190305, "fullname": "Eric Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190305?format=json", "institution": "Kunlun"}, {"id": 129151, "fullname": "Bo An", "url": "http://cvpr.thecvf.com/api/miniconf/users/129151?format=json", "institution": "Nanyang Technological University"}, {"id": 190306, "fullname": "Hao-Xiang Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/190306?format=json", "institution": "SkyWork AI"}], "abstract": "Despite significant progress in text-to-video generation, current models still suffer from unrealistic dynamics, temporal inconsistency, and unstable semantic alignment. Existing preference alignment approaches rely on costly and often ambiguous human or VLM-based video preference annotation, which has become a major bottleneck for scaling data. To address this challenge, we propose an annotation-free preference alignment method that constructs accurate preference pairs through video continuation.We extend a pretrained video generation model into a continuation model and apply continuation with different amounts of reference frames while keeping the total video length fixed. As generated segments are inferior to ground-truth frames and and fixed-length continuations conditioned on more reference frames contain less generated content, they exhibit higher fidelity than those with fewer references, naturally inducing a preference order.We further introduce Asymmetrical DPO, which computes preference loss on all continuation regions except the shared prefix conditioning frames and normalizes it by their length, preventing spurious preference signals from leaking into the conditioned portion.Experiments across multiple benchmarks show that our method delivers significant improvements in dynamics realism, temporal coherence, and semantic alignment over existing DPO-based approaches, while fully eliminating the need for human preference labeling or auxiliary reward models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38615", "url": null, "sourceid": 32176, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38617, "uid": "b0f47015b6ace4492dd96d63e515e779", "name": "Noise-aware few-shot learning through bi-directional multi-view prompt alignment", "authors": [{"id": 180061, "fullname": "Lu Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180061?format=json", "institution": "\u4e1c\u5357\u5927\u5b66"}, {"id": 144923, "fullname": "Cheng Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/144923?format=json", "institution": "Southeast University"}], "abstract": "Vision-language models offer strong few-shot capability through prompt tuning but remain vulnerable to noisy labels, which can corrupt prompts and degrade cross-modal alignment. Existing approaches struggle because they often lack the ability to model fine-grained semantic cues and to adaptively separate clean from noisy signals. To address these challenges, we propose NA-MVP, a framework for **N**oise-**A**ware few-shot learning through bi-directional **M**ulti-**V**iew **P**rompt alignment. NA-MVP is built upon a key conceptual shift: *robust prompt learning requires moving from global matching to region-aware alignment that explicitly distinguishes clean cues from noisy ones*. To realize this, NA-MVP employs (1) multi-view prompts combined with unbalanced optimal transport to achieve fine-grained patch-to-prompt correspondence while suppressing unreliable regions; (2) a bi-directional prompt design that captures complementary clean-oriented and noise-aware cues, enabling the model to focus on stable semantics; and (3) an alignment-guided selective refinement strategy that uses optimal transport to correct only mislabeled samples while retaining reliable data. Experiments on synthetic and real-world noisy benchmarks demonstrate that NA-MVP consistently outperforms state-of-the-art baselines, confirming its effectiveness in enabling robust few-shot learning under noisy supervision.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38617", "url": null, "sourceid": 34145, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38620, "uid": "e817d8f1c167c64e069ef31d3e8826c2", "name": "Differentially Private 2D Human Pose Estimation", "authors": [{"id": 163450, "fullname": "Kaushik S", "url": "http://cvpr.thecvf.com/api/miniconf/users/163450?format=json", "institution": "UoG"}, {"id": 90382, "fullname": "Paul Henderson", "url": "http://cvpr.thecvf.com/api/miniconf/users/90382?format=json", "institution": "University of Glasgow"}, {"id": 190323, "fullname": "Fani Deligianni", "url": "http://cvpr.thecvf.com/api/miniconf/users/190323?format=json", "institution": "Computing Science; University of Glasgow"}], "abstract": "Human pose estimation (HPE) underpins critical applications in healthcare, activity recognition, and human-computer interaction. However, the privacy implications of processing sensitive visual data present significant deployment barriers in critical domains. %Conventional anonymization techniques offer weak protection and are unquantifiable, while Differential Privacy (DP) provides formal guarantees but often results in steep performance costs.We introduce the first unified framework for differentially private  2D Human Pose Estimation (2D-HPE) that achieves strong privacy-utility trade-offs for structured visual prediction through complementary noise mitigation mechanisms. Our Feature-Projective DP integrates: (1) subspace projection that reduces noise variance by a factor $k/p$ by restricting gradient updates to a $k$-principal subspace within the full $p$-dimensional parameter space, and (2) feature-level privacy, which selectively privatizes sensitive features while retaining public visual cues. Together these mechanisms yield a multiplicative utility gain under formal privacy constraints.%We further propose a feature-projective hybrid that combines both the mechanisms within a single post-processing framework.Extensive experiments on MPII and HumanART datasets across privacy budgets $(\\varepsilon \\in \\{0.2, 0.4, 0.6,  0.8\\})$, clipping thresholds $(C \\in \\ \\{0.01, 0.1, 1.0\\})$ and training strategies demonstrate consistent improvements over vanilla DP-SGD. At $\\varepsilon=0.8$, our method achieves 82.61\\% PCKh@0.5, recovering 73\\% of the privacy induced performance gap. Cross-dataset evaluation on the HumanART confirms generalization (51.6 AP). Our study provides the first rigorous benchmark and a practical blueprint for privacy-preserving pose estimation in sensitive, real-world applications. Our source code will be made public on acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38620", "url": null, "sourceid": 33423, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38621, "uid": "195e3b79a3274c7a35342c0ecf371b20", "name": "M${^2}$SeR: Multimodal Self-Refinement for Lightweight Image Captioning", "authors": [{"id": 88539, "fullname": "Junha Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/88539?format=json", "institution": "KAIST, NAVER AI Lab"}, {"id": 190324, "fullname": "Yongsik Jo", "url": "http://cvpr.thecvf.com/api/miniconf/users/190324?format=json", "institution": null}, {"id": 86756, "fullname": "So Yeon Min", "url": "http://cvpr.thecvf.com/api/miniconf/users/86756?format=json", "institution": "CMU, Carnegie Mellon University"}, {"id": 175450, "fullname": "Quanting Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/175450?format=json", "institution": "Carnegie Mellon University"}, {"id": 98614, "fullname": "Taehwan Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/98614?format=json", "institution": "UNIST"}, {"id": 86774, "fullname": "Yonatan Bisk", "url": "http://cvpr.thecvf.com/api/miniconf/users/86774?format=json", "institution": "Carnegie Mellon University"}, {"id": 87936, "fullname": "Jaegul Choo", "url": "http://cvpr.thecvf.com/api/miniconf/users/87936?format=json", "institution": "Korea Advanced Institute of Science and Technology"}], "abstract": "Systems such as video chatbots and navigation robots often depend on streaming image captioning to interpret visual inputs. Existing approaches typically employ large multimodal language models (MLLMs) for this purpose, but their substantial computational cost hinders practical application.This limitation motivates our development of a lightweight captioning model. Our investigation begins by replacing the large-scale language component in MLLMs with a compact 125M-parameter model.Surprisingly, this compact model, despite a 93x reduction in size, achieves comparable performance to MLLMs, suggesting that factual image captioning does not significantly require the complex reasoning abilities of LLMs. Despite this promising result, our lightweight model still lacks reliability. To address this, we draw inspiration from the human visual process: perceiving a global and coarse understanding of the scene before attending to finer details. Accordingly, we propose a multimodal self-refinement framework that guides the model to utilize features from salient regions, identified by referencing the previous coarse caption, and to produce a refined description. Experimental results demonstrate the superiority of our model in both single-sentence and detailed captioning, extending even to long-range video QA tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38621", "url": null, "sourceid": 38686, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38623, "uid": "bcffac7ceb2fafe3584e66bfc9c53a72", "name": "Semantic Foam: Unifying Spatial and Semantic Scene Decomposition", "authors": [{"id": 190328, "fullname": "Amr Sharafeldin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190328?format=json", "institution": "Simon Fraser University"}, {"id": 107223, "fullname": "Aryan Mikaeili", "url": "http://cvpr.thecvf.com/api/miniconf/users/107223?format=json", "institution": "Simon Fraser University"}, {"id": 152522, "fullname": "Thomas Walker", "url": "http://cvpr.thecvf.com/api/miniconf/users/152522?format=json", "institution": "Edinburgh University, University of Edinburgh"}, {"id": 128436, "fullname": "Shrisudhan Govindarajan", "url": "http://cvpr.thecvf.com/api/miniconf/users/128436?format=json", "institution": "Simon Fraser University"}, {"id": 132581, "fullname": "Daniel Rebain", "url": "http://cvpr.thecvf.com/api/miniconf/users/132581?format=json", "institution": "Wayve; University of British Columbia"}, {"id": 69181, "fullname": "Kwang Moo Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/69181?format=json", "institution": "University Of British Columbia"}, {"id": 126249, "fullname": "Andrea Tagliasacchi", "url": "http://cvpr.thecvf.com/api/miniconf/users/126249?format=json", "institution": "Simon Fraser University, Google Brain"}], "abstract": "Current generation scene reconstruction methods like 3D Gaussian Splatting are capable of producing photo-realistic novel view synthesis at real-time speeds, yet see only limited adoption in many practical graphics applications.One significant contributing factor to this gap is the difficulty of interacting with and editing these representations in comparison to classic human-authored 3D assets.While work has been done to impose semantic decomposition onto these representations, there are still significant limitations in the quality and consistency of these segmentations.We address this by proposing a semantically decomposed variant of the recently introduced Radiant Foam method.Our approach, Semantic Foam, combines the natural spatial volumetric decomposition provided by Radiant Foam's Voronoi mesh with an explicit semantic feature field parameterized on the cells.The explicit mesh structure enables direct spatial regularization that prevents artifacts caused by inconsistent supervision across views or occlusion, which affect similar approaches for other point-based representations.We show that our method achieves superior performance on object-level segmentation compared to Gaussian Grouping and SAGA.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38623", "url": null, "sourceid": 41702, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38627, "uid": "b2bbb828c54d598a0afa0c992b0d9a4b", "name": "PV-Ground: Text-Guided Point-Voxel Interaction for 3D Visual Grounding", "authors": [{"id": 172515, "fullname": "Junpeng Shang", "url": "http://cvpr.thecvf.com/api/miniconf/users/172515?format=json", "institution": "Zhejiang University"}, {"id": 154184, "fullname": "Feifei Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/154184?format=json", "institution": "Zhejiang University"}, {"id": 84768, "fullname": "Jun Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84768?format=json", "institution": "Zhejiang University"}, {"id": 154832, "fullname": "Lin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/154832?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 131756, "fullname": "Hongwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131756?format=json", "institution": "Zhejiang University"}, {"id": 190338, "fullname": "Dongfang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/190338?format=json", "institution": "Zhejiang University"}], "abstract": "3D visual grounding (VG) aims to localize target objects in 3D scenes based on free-form textual descriptions. Existing 3D VG models predominantly employ point-based backbones for point cloud feature extraction. Such methods require aggressive downsampling of the input point cloud, which sacrifices the fine-grained spatial details crucial for precise localization. This paper proposes  PV-Ground, a novel 3D VG architecture based on effective text-guided point-voxel feature interaction. Our method leverages the complementary strengths of both voxels and keypoints: it employs a voxel-based feature extraction backbone to preserve high-resolution spatial details, while utilizing compact keypoints to aggregate these features for efficient, deep interaction with the textual query. Furthermore, we propose a text-guided keypoint sampling module to adaptively concentrate the keypoint distribution around the text-described object, enabling task-specific feature aggregation and significantly boosts model performance. Extensive qualitative and quantitative experiments demonstrate the superiority of our proposed method. Our method achieves a performance improvement of 5.1\\% on the ScanRefer dataset and 5.6\\% on the ReferIt3D dataset, while also achieves over 4\\% improvement in the segmentation task. The code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38627", "url": null, "sourceid": 34058, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38631, "uid": "275a5602cd91a468a0e10c226a03a39c", "name": "SignPR: A Progressive Vector-Quantized Diffusion Framework for Sign Language Production", "authors": [{"id": 181586, "fullname": "Xiao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181586?format=json", "institution": "nanjing university"}, {"id": 107387, "fullname": "Shiwei Gan", "url": "http://cvpr.thecvf.com/api/miniconf/users/107387?format=json", "institution": "Nanjing University"}, {"id": 130214, "fullname": "Yafeng Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/130214?format=json", "institution": "Nanjing University"}, {"id": 190345, "fullname": "Bowen Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/190345?format=json", "institution": "nanjing university"}, {"id": 130219, "fullname": "Zhiwei Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130219?format=json", "institution": "Nanjing University"}, {"id": 190346, "fullname": "Shunmei Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190346?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 130192, "fullname": "Lei Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/130192?format=json", "institution": "Nanjing University"}, {"id": 72976, "fullname": "Sanglu Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/72976?format=json", "institution": "Nanjing University"}], "abstract": "Sign language production aims to generate sign sequences from spoken language, where the generation of sign pose sequences from text is often treated as a significant task. However, due to the differences in grammatical rules and modalities between sign language pose sequences and spoken language text, it is rather challenging to convert text into sign poses (\\ie, Text2Pose), while maintaining semantic consistency, motion accuracy and temporal coherence.In this paper, we focus on the Text2Pose task, and propose SignPR, a progressive diffusion framework that jointly models the structural and temporal properties of signing. Structurally, we perform progressive structural refinement: a structural VQVAE encodes each frame into semantic-aware and region-based discrete representations; the diffusion process first produces semantically consistent poses and then progressively refines motion details under text and semantic conditioning. Temporally, we introduce block-wise causal diffusion, which progressively enforces temporal coherence and enables iterative refinement to earlier generated segments, yielding smoother transitions and reduced jitter. Extensive experiments on widely used datasets demonstrate that SignPR achieves superior results compared with prior T2P methods across multiple metrics, producing pose sequences that are semantically faithful, motion-accurate, and temporally coherent.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38631", "url": null, "sourceid": 41167, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38633, "uid": "1aa49d5029bedbd82075fc78ff26b05d", "name": "Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis", "authors": [{"id": 166086, "fullname": "hongyuan chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/166086?format=json", "institution": "jilin university"}, {"id": 157119, "fullname": "Xingyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/157119?format=json", "institution": "Westlake University"}, {"id": 89245, "fullname": "Zexiang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89245?format=json", "institution": "Hillbot"}, {"id": 190353, "fullname": "Anpei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190353?format=json", "institution": null}], "abstract": "We present Motion 3-to-4, a feed-forward framework for synthesising high-quality 4D dynamic objects from a single monocular video and an optional 3D reference mesh. While recent advances have significantly improved 2D, video, and 3D content generation, 4D synthesis remains difficult due to limited training data and the inherent ambiguity of recovering geometry and motion from a monocular viewpoint. Motion 3-to-4 addresses these challenges by decomposing 4D synthesis into static 3D shape generation and motion reconstruction. Using a canonical reference mesh, our model learns a compact motion latent representation and predicts per-frame vertex trajectories to recover complete, temporally coherent geometry. A scalable frame-wise transformer further enables robustness to varying sequence lengths. Evaluations on both standard benchmarks and a new dataset with accurate ground-truth geometry show that Motion 3-to-4 delivers superior fidelity and spatial consistency compared to prior work.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38633", "url": null, "sourceid": 46531, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38635, "uid": "f3d834a3a299eb01418783d6777d65f0", "name": "Cross-Domain Demo-to-Code via Neurosymbolic Counterfactual Reasoning", "authors": [{"id": 183034, "fullname": "Jooyoung Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/183034?format=json", "institution": "SungKyunKwan University"}, {"id": 183033, "fullname": "Wonje Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/183033?format=json", "institution": "Sung Kyun Kwan University"}, {"id": 190358, "fullname": "Younguk Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/190358?format=json", "institution": "Sung Kyun Kwan University"}, {"id": 129308, "fullname": "Honguk Woo", "url": "http://cvpr.thecvf.com/api/miniconf/users/129308?format=json", "institution": "Sungkyunkwan University"}], "abstract": "Recent advances in Vision-Language Models (VLMs) have enabled video-instructed robotic programming, allowing agents to interpret video demonstrations and generate executable control code. We formulate video-instructed robotic programming as a cross-domain adaptation problem, where perceptual and physical differences between demonstration and deployment induce procedural mismatches. However, current VLMs lack the procedural understanding needed to reformulate causal dependencies and achieve task-compatible behavior under such domain shifts. We introduce NeSyCR, a neurosymbolic counterfactual reasoning framework that enables verifiable adaptation of task procedures,  providing a reliable synthesis of code policies.  NeSyCR abstracts video demonstrations into symbolic trajectories that capture the underlying task procedure. Given deployment observations, it derives counterfactual states that reveal cross-domain incompatibilities. By exploring the symbolic state space with verifiable checks, NeSyCR proposes procedural revisions that restore compatibility with the demonstrated procedure. NeSyCR achieves a 31.14\\% improvement in task success over the strongest baseline Statler, showing robust cross-domain adaptation across both simulated and real-world manipulation tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38635", "url": null, "sourceid": 44683, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38637, "uid": "f47313385d07341062472b571e029e32", "name": "Enhance-then-Balance Modality Collaboration for Robust Multimodal Sentiment Analysis", "authors": [{"id": 180454, "fullname": "Kang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/180454?format=json", "institution": "Wuhan University  &amp; SII"}, {"id": 190362, "fullname": "Yuzhe Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/190362?format=json", "institution": "Wuhan University"}, {"id": 190363, "fullname": "Xinrong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190363?format=json", "institution": "Wuhan University"}, {"id": 190364, "fullname": "Fei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190364?format=json", "institution": "Wuhan University"}, {"id": 190365, "fullname": "Chong Teng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190365?format=json", "institution": "Wuhan University"}, {"id": 190366, "fullname": "Donghong Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/190366?format=json", "institution": "Wuhan University"}], "abstract": "Multimodal sentiment analysis (MSA) seeks to infer human emotions by integrating heterogeneous signals from text, audio, and visual modalities.Although recent approaches attempt to leverage cross-modal complementarity, they often struggle to fully utilize weaker modalities.In practice, the expressive power across modalities is inherently imbalanced: dominant modalities tend to overshadow non-verbal ones, which not only limits their contribution but also induces modality competition during training.This imbalance leads to degraded fusion performance and poor robustness under noisy or missing modalities.To address these challenges, we propose a novel model, Enhance-then-Balance Modality Collaboration framework (EBMC). EBMC first improves representational quality via modality semantic disentanglement (MSD) and cross-modal complementary enhancement (CCE), which strengthens weaker modalities using information from other modalities.To prevent dominant modalities from overwhelming others during joint optimization, EBMC introduces an Energy-guided Modality Coordination (EMC) mechanism that models modality contributions via energy potentials and achieves implicit gradient rebalancing through a differentiable equilibrium objective.Further, an Instance-aware Modality Trust Distillation (IMTD) module estimates sample-level modality reliability and adaptively modulates fusion weights, ensuring robustness against noise and modality incompleteness.Extensive experiments on multiple MSA benchmarks demonstrate that EBMC achieves state-of-the-art or competitive results.Moreover, EBMC maintains strong performance under missing-modality settings, highlighting its effectiveness and robustness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38637", "url": null, "sourceid": 38702, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38638, "uid": "a83840c17c2b2a522f05290db1efbb18", "name": "NOWA: Null-space Optical Watermark for Invisible Capture Fingerprinting and Tamper Localization", "authors": [{"id": 190367, "fullname": "Edwin Vargas", "url": "http://cvpr.thecvf.com/api/miniconf/users/190367?format=json", "institution": "Rice University"}, {"id": 130345, "fullname": "Jhon Lopez", "url": "http://cvpr.thecvf.com/api/miniconf/users/130345?format=json", "institution": "Universidad Industrial de Santander"}, {"id": 95406, "fullname": "Henry Arguello", "url": "http://cvpr.thecvf.com/api/miniconf/users/95406?format=json", "institution": "Universidad Industrial de Santander"}, {"id": 85552, "fullname": "Ashok Veeraraghavan", "url": "http://cvpr.thecvf.com/api/miniconf/users/85552?format=json", "institution": "William Marsh Rice University"}], "abstract": "Ensuring the authenticity and ownership of digital images is increasingly challenging as modern editing tools enable highly realistic forgeries. Existing image protection systems mainly rely on digital watermarking, which is susceptible to sophisticated digital attacks. To address this limitation, we propose a hybrid optical-digital framework that incorporates physical authentication cues during image formation and preserves them through a learned reconstruction process. At the optical level, a phase mask in the camera aperture produces a Null-space Optical Watermark (NOWA) that lies in the Null Space of the imaging operator and therefore remains invisible in the captured image. Then, a Null-Space Network (NSN) performs measurement-consistent reconstruction that delivers high-quality protected images while preserving the NOWA signature.The proposed design enables tamper localization by projecting the image onto the camera's null space and detecting pixel-level inconsistencies. Our design preserves perceptual quality, resists common degradations such as compression, and establishes a structural security asymmetry: without access to the optical or NSN parameters, adversaries cannot forge the NOWA signature. Experiments with simulations and a prototype camera demonstrate competitive performance in terms of image quality preservation and tamper localization accuracy compared to state-of-the-art digital watermarking and learning-based authentication methods.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38638", "url": null, "sourceid": 40551, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40342?format=json"], "related_events_ids": [40342]}, {"id": 38640, "uid": "4965052ffc53f745343bab61a5f8aee7", "name": "GenSplat: Bridging the Generalization Gap in 3DGS Language Comprehension", "authors": [{"id": 107434, "fullname": "Fang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107434?format=json", "institution": "City University of Hong Kong"}, {"id": 90599, "fullname": "Yuhao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90599?format=json", "institution": "City University of Hong Kong"}, {"id": 130651, "fullname": "Ke Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130651?format=json", "institution": "City University of Hong Kong"}, {"id": 94761, "fullname": "Gerhard Hancke", "url": "http://cvpr.thecvf.com/api/miniconf/users/94761?format=json", "institution": "City University of Hong Kong"}, {"id": 86835, "fullname": "Rynson W.H. Lau", "url": "http://cvpr.thecvf.com/api/miniconf/users/86835?format=json", "institution": "City University of Hong Kong"}], "abstract": "In this paper, we propose GenSplat, a novel approach for language comprehension in 3D Gaussian Splatting (3DGS). Unlike previous methods that either achieve cross-scene generalization by being bounded to a predefined vocabulary or handle free-form language by overfitting to individual scenes, GenSplat is robust to free-form language queries and generalizable across 3DGS scene representations. Our key insight for this problem is to formulate a structured learning process to progressively align linguistic concepts with 3D Gaussians. It contains two novel technical contributions. First, we propose a Progressive Language Grounding Curriculum that structurally guides the model through learning category-level semantics to instance-level concepts and free-form language, preventing overfitting by building a generalizable language feature space. Second, we design a Multi-modal Large Language Models (MLLM)-guided Reasoning Module that leverages MLLM\u2019s semantic and spatial priors to enhance 3D localization and reasoning. To further improve spatial alignment and computational efficiency, we introduce a GeometryAware Frame Selector (GAFS), which adaptively selects the most informative views based on Gaussian and textural cues. Extensive cross-task evaluations (including 3D referring segmentation, 3D visual question answering, and 3D open-vocabulary understanding) demonstrate state-of-the-art performances and strong generalization capability of GenSplat. We will release the codes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38640", "url": null, "sourceid": 35657, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38641, "uid": "3887768255fc3ba5063eb8df7046d194", "name": "Action\u2013Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation", "authors": [{"id": 126752, "fullname": "Chongyang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126752?format=json", "institution": "Sichuan University"}, {"id": 107329, "fullname": "Li Haipeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/107329?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 187054, "fullname": "Shen Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187054?format=json", "institution": "Megvii Technology Inc."}, {"id": 128936, "fullname": "Haoqiang Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/128936?format=json", "institution": "Dexmal"}, {"id": 190371, "fullname": "Ziliang Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190371?format=json", "institution": "Sichuan University"}, {"id": 93490, "fullname": "Shuaicheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/93490?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Bimanual manipulation requires policies that can reason about 3D geometry, anticipate how it evolves under action, and generate smooth, coordinated motions. However, existing methods typically rely on 2D features with limited spatial awareness, or require explicit point clouds that are difficult to obtain reliably in real-world settings. At the same time, recent 3D geometric foundation models show that accurate and diverse 3D structure can be reconstructed directly from RGB images in a fast and robust manner.We leverage this opportunity and propose a framework that builds bimanual manipulation directly on a pre-trained 3D geometric foundation model. Our policy fuses geometry-aware latents, 2D semantic features, and proprioception into a unified state representation, and uses diffusion model to jointly predict a future action chunk and a future 3D latent that decodes into a dense pointmap. By explicitly predicting how the 3D scene will evolve together with the action sequence, the policy gains strong spatial understanding and predictive capability using only RGB observations.We evaluate our method both in simulation on the RoboTwin benchmark and in real-world robot executions. Our approach consistently outperforms 2D-based and point-cloud-based baselines, achieving state-of-the-art performance in manipulation success, inter-arm coordination, and 3D spatial prediction accuracy. Code and pretrained weights will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38641", "url": null, "sourceid": 43317, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38642, "uid": "77bb4ff22ca2954777a580f0b66fc384", "name": "Sparse Spectral LoRA: Routed Experts for Medical VLMs", "authors": [{"id": 175762, "fullname": "Omid Nejatimanzari", "url": "http://cvpr.thecvf.com/api/miniconf/users/175762?format=json", "institution": "Concordia University"}, {"id": 156537, "fullname": "Hojat Asgariandehkordi", "url": "http://cvpr.thecvf.com/api/miniconf/users/156537?format=json", "institution": "Concordia University"}, {"id": 151828, "fullname": "Taha Koleilat", "url": "http://cvpr.thecvf.com/api/miniconf/users/151828?format=json", "institution": "Concordia University"}, {"id": 156539, "fullname": "Yiming Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/156539?format=json", "institution": "Concordia University"}, {"id": 156538, "fullname": "Hassan Rivaz", "url": "http://cvpr.thecvf.com/api/miniconf/users/156538?format=json", "institution": "Concordia University"}], "abstract": "Large vision\u2013language models (VLMs) excel on general benchmarks but often lack robustness in medical imaging, where heterogeneous supervision induces cross-dataset interference and sensitivity to data regime (i.e., how the supervisory signals are mixed).  In realistic clinical workflows, data and tasks arrive sequentially, so naive continual training further leads to catastrophic forgetting. To address these challenges, we propose MedQwen, a parameter-efficient medical VLM that couples a spectrally routed Mixture-of-Experts (MoE) with a theoretically grounded scaling rule that aligns low-rank updates with a full-rank, fully fine-tuned MoE, without changing the base architecture. Concretely, we initialize each expert from non-overlapping singular value decomposition (SVD) segments of the pretrained weight and introduce a residual compensation and scaling scheme to enable stable expert specialization and consistent routing under distribution shift.Across 23 medical datasets covering visual question answering, report generation, radiology classification, and hallucination mitigation, MedQwen achieves strong, reliable performance: it approaches full fine-tuning on zero-shot classification with 339$\\times$ fewer trainable parameters, and reduces sequential forgetting to $\\sim$5\\% where strong baselines degrade by $>$20\u201350\\%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38642", "url": null, "sourceid": 42494, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38644, "uid": "96b7e445281a191b3922d814aae420ce", "name": "Mind the Gap: Transferring Labels to Align Object Detection Datasets", "authors": [{"id": 88269, "fullname": "Mikhail Kennerley", "url": "http://cvpr.thecvf.com/api/miniconf/users/88269?format=json", "institution": "National University of Singapore"}, {"id": 157109, "fullname": "Angelica I Aviles-Rivero", "url": "http://cvpr.thecvf.com/api/miniconf/users/157109?format=json", "institution": "Tsinghua University"}, {"id": 85702, "fullname": "Carola-Bibiane Sch\u00f6nlieb", "url": "http://cvpr.thecvf.com/api/miniconf/users/85702?format=json", "institution": "University of Cambridge"}, {"id": 85616, "fullname": "Robby T. Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/85616?format=json", "institution": "National University of Singapore"}], "abstract": "Combining multiple object detection datasets offers a path to improved model generalisation but is hindered by inconsistencies in class semantics and bounding box annotations.cSome methods to address this assume shared label taxonomies and address only spatial inconsistencies; others require manual relabelling, or produce a unified label space, which may be unsuitable when a fixed target label space is required. We propose Label-Aligned Transfer (LAT), a label transfer framework that systematically projects annotations from diverse source datasets into the label space of a target dataset. LAT begins by training dataset-specific detectors to generate pseudo-labels, which are then combined with ground-truth annotations via a Privileged Proposal Generator (PPG) that replaces the region proposal network in two-stage detectors. To further refine region features and address pseudo-label noise, a Semantic Feature Fusion (SFF) module injects class-aware context and features from overlapping proposals using a confidence-weighted attention mechanism. This pipeline preserves dataset-specific annotation granularity while enabling many-to-one label space transfer across heterogeneous datasets, resulting in a semantically and spatially aligned representation suitable for training a downstream detector. LAT thus jointly addresses both class-level misalignments and bounding box inconsistencies without relying on shared label spaces or manual re-annotation. Across multiple benchmarks, LAT demonstrates consistent improvements in detection performance, achieving gains of up to +8.4 AP over baseline methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38644", "url": null, "sourceid": 33597, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38646, "uid": "c733f4f772bfa7b8702ccb81887f8333", "name": "Toward Low-Cost yet Effective Temporal Learning for UAV Tracking", "authors": [{"id": 146149, "fullname": "chaocan xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/146149?format=json", "institution": "Guangxi Normal University"}, {"id": 155925, "fullname": "Qihua Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155925?format=json", "institution": "Guangxi Normal University"}, {"id": 128975, "fullname": "Bineng Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/128975?format=json", "institution": "Guangxi Normal University"}, {"id": 190380, "fullname": "Yanting Zu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190380?format=json", "institution": "Guangxi Normal University"}, {"id": 155928, "fullname": "Yuanliang Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/155928?format=json", "institution": "Xi\u2019an Research Institute of High Technology"}, {"id": 184385, "fullname": "Haiying Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/184385?format=json", "institution": "Guangxi Normal University"}, {"id": 129683, "fullname": "Shuxiang Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/129683?format=json", "institution": "Guangxi Normal University"}], "abstract": "The utilization of temporal information has always been an open topic in the tracking community. However, existing trackers tend to employ more and more inputs or parameters for temporal learning, hindering their deployment in resource-constrained unmanned aerial vehicles (UAVs). More importantly, this raises ambiguity whether the performance gains come from the temporal learning itself, or come from the increased inputs and parameters. In this study, we advocate designing temporal learning components from a more balanced perspective that jointly considers performance gains and computational costs. To achieve this goal, we introduce a new evaluation metric, i.e., precision per FLOPs (PPF). The PPF is introduced to quantify the tracking precision gains achieved by temporal learning components per unit of FLOPs, thus enabling fair and efficiency-aware comparisons among these components and driving them toward more efficient designs. Based on this metric, we propose a low-cost yet effective temporal learning (LETL) approach to efficiently model contextual relationships. This approach continuously propagates and merges representative appearance tokens in video streams, allowing the tracker to efficiently capture the changing patterns of targets with relatively low computational costs. We integrate the LETL approach into existing one-stream frameworks, thereby building a simple yet effective tracker, namely LETrack, for robust UAV tracking. Extensive experimental results on multiple aerial tracking datasets demonstrate the superiority of our LETrack, and show that the proposed LETL approach achieves higher PPF scores, outperforming other temporal learning strategies.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38646", "url": null, "sourceid": 45910, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38648, "uid": "2f9dabd3b7df074505f362da6a52c389", "name": "MFEN: Multi-Frequency Expert Network for Visible-Infrared Person Re-ID", "authors": [{"id": 188716, "fullname": "Xulin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188716?format=json", "institution": "University of Science and Technology of China"}, {"id": 87596, "fullname": "Yan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87596?format=json", "institution": "Shanghai AI lab"}, {"id": 107620, "fullname": "Bin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107620?format=json", "institution": "University of Science and Technology of China"}, {"id": 190383, "fullname": "Qinhong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190383?format=json", "institution": "University of Science and Technology of China"}, {"id": 131736, "fullname": "Qi Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131736?format=json", "institution": "University of Science and Technology of China"}, {"id": 131741, "fullname": "Tao Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/131741?format=json", "institution": "University of Science and Technology of China"}, {"id": 90580, "fullname": "Nenghai Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90580?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Visible-infrared person re-identification (VI-ReID) is a challenging task due to the significant modality discrepancy between visible and infrared images. We contend that the discrepancy primarily arises from varying lighting conditions of the two modality data, including differences in the wavelengths of light and the types of light source. Recently, frequency-based VI-ReID approaches have achieved notable success, since frequency information can more effectively extract contours and details pertinent to identity while excluding irrelevant lighting and color. However, existing methods do not distinguish different frequency bands or focus solely on a particular frequency band, which is insufficient for capturing the inherent variations in frequency under diverse lighting conditions. To perform comprehensive frequency domain learning, we propose a Multi-Frequency Expert Network (MFEN) that enables multi-frequency modulation and adaptively combines different frequencies through a mixture-of-experts method. In addition, we further introduce a Random Frequency Augmentation (RFA) and a Frequency Auxiliary Optimization (FAO) to effectively train the MFEN in mining frequency information. The proposed three frequency modules are complementary to each other and adaptively capture critical frequency domain details to achieve robust representations. Extensive experiments on three VI-ReID datasets demonstrate the effectiveness of our approach.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38648", "url": null, "sourceid": 33038, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38651, "uid": "7e5b73e214f17ecc5684ff7b0ca14b0d", "name": "DiffusionFF: A Diffusion-based Framework for Joint Face Forgery Detection and Fine-Grained Artifact Localization", "authors": [{"id": 152712, "fullname": "Siran Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/152712?format=json", "institution": "Institute of automation, Chinese Academy of Sciences"}, {"id": 180759, "fullname": "Haoyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180759?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 190390, "fullname": "Li Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190390?format=json", "institution": "China Mobile Financial Technology Co., Ltd., China"}, {"id": 126688, "fullname": "Tianshuo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126688?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 76305, "fullname": "Xiangyu Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76305?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 184975, "fullname": "Bao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184975?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 190391, "fullname": "Weisong Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190391?format=json", "institution": "Chinese Academy of Sciences"}, {"id": 89292, "fullname": "Zhen Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/89292?format=json", "institution": "Institute of Automation,  Chinese Academy of Sciences"}], "abstract": "The rapid evolution of deepfake technologies demands robust and reliable face forgery detection algorithms. While determining whether an image has been manipulated remains essential, the ability to precisely localize forgery clues is also important for enhancing model explainability and building user trust. To address this dual challenge, we introduce DiffusionFF, a diffusion-based framework that simultaneously performs face forgery detection and fine-grained artifact localization. Our key idea is to establish a novel encoder\u2013decoder architecture: a pretrained forgery detector serves as a powerful \"artifact encoder\", and a denoising diffusion model is repurposed as an \"artifact decoder\". Conditioned on multi-scale forgery-related features extracted by the encoder, the decoder progressively synthesizes a detailed artifact localization map. We then fuse this fine-grained localization map with high-level semantic features from the forgery detector, leading to substantial improvements in detection capability. Extensive experiments demonstrate that DiffusionFF achieves state-of-the-art (SOTA) performance across multiple benchmarks, underscoring its superior effectiveness, reliability, and explainability. The code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38651", "url": null, "sourceid": 41230, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38656, "uid": "03dbeaa63095e87dfd054aeb6832c72d", "name": "Reliev3R: Relieving Feed-forward 3D Reconstruction from Multi-View Geometric Annotations", "authors": [{"id": 152912, "fullname": "Youyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/152912?format=json", "institution": "Harbin Institute of Technology"}, {"id": 87056, "fullname": "Junjun Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87056?format=json", "institution": "Harbin Institute of Technology"}, {"id": 145274, "fullname": "Yueru Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/145274?format=json", "institution": "Beijing University of Post and Telecommunications; The Chinese University of Hong Kong"}, {"id": 152254, "fullname": "Kui Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152254?format=json", "institution": "Harbin Institute of Technology"}, {"id": 87539, "fullname": "Xianming Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87539?format=json", "institution": "Harbin Institute of Technology"}, {"id": 77320, "fullname": "Xu Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/77320?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 130142, "fullname": "Dave Zhenyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/130142?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}], "abstract": "With recent advances, Feed-forward Reconstruction Models (FFRMs) have demonstrated great potential in reconstruction quality and adaptiveness to multiple downstream tasks. However, the excessive reliance on multi-view geometric annotations, e.g. 3D point maps and camera poses, makes the fully-supervised training scheme of FFRMs difficult to scale up.In this paper, we propose Reliev3R, a weakly-supervised paradigm for training FFRMs from scratch without cost-prohibitive multi-view geometric annotations. Relieving the reliance on geometric sensory data and compute-exhaustive structure-from-motion preprocessing, our method draws 3D knowledge directly from monocular relative depths and image sparse correspondences given by zero-shot predictions of pretrained models.At the core of Reliev3R, we design an ambiguity-aware relative depth loss and a trigonometry-based reprojection loss to facilitate supervision for multi-view geometric consistency.Training from scratch with the less data, Reliev3R catches up with its fully-supervised sibling models, taking a step towards low-cost 3D reconstruction supervisions and scalable FFRMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38656", "url": null, "sourceid": 44590, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38657, "uid": "e5218ff81e22205eb3a32a767378f03f", "name": "Ultra Diffusion Poser: Diffusion-Based Human Motion Tracking from Sparse Inertial Sensors and Ranging-based Between-sensor Distances", "authors": [{"id": 182349, "fullname": "Dominik Hollidt", "url": "http://cvpr.thecvf.com/api/miniconf/users/182349?format=json", "institution": "ETH Zurich"}, {"id": 190400, "fullname": "Tommaso Bendinelli", "url": "http://cvpr.thecvf.com/api/miniconf/users/190400?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 153857, "fullname": "Christian Holz", "url": "http://cvpr.thecvf.com/api/miniconf/users/153857?format=json", "institution": "ETH Z\u00fcrich"}], "abstract": "Methods using inertial measurement units (IMUs) provide a wearable alternative to camera-based motion capture.To mitigate drift from inertial signals, recent sparse inertial pose estimators integrate inter-sensor distances measured by ultra-wideband (UWB) ranging. So far, UWB distances have only been used as an additional input feature, ignoring the physical constraints they impose on sensor positions.However, these distances can also be used to reconstruct the underlying 3D sensor layout, which in turn provides more informative input for pose reconstruction.We propose Ultra Diffusion Poser, a diffusion model that explicitly models these geometric constraints.It includes a Spatial Layout Module that analytically reconstructs the 3D sensor positions from UWB measurements.These sensor positions are used alongside IMU signals and UWB distances as a conditioning signal during diffusion.Still, network predictions can violate inter-sensor distance measurements.To address this, we introduce UWB-Diffusion Guidance, which encourages alignment between predicted poses and measured distances during diffusion sampling.Together, these contributions enable our model to achieve state-of-the-art performance, reducing joint position error by up to 22% over prior work.Code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38657", "url": null, "sourceid": 44637, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38661, "uid": "9b3f19c666b37dcb52bb3bfa51ae0c46", "name": "GDRO: Group-level Reward Post-training Suitable for Diffusion Models", "authors": [{"id": 190415, "fullname": "Yiyang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190415?format=json", "institution": "The University of Hong Kong"}, {"id": 88507, "fullname": "Xi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88507?format=json", "institution": "the University of Hong Kong, University of Hong Kong"}, {"id": 86109, "fullname": "Xiaogang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86109?format=json", "institution": "Zhejiang Lab"}, {"id": 88124, "fullname": "Yu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88124?format=json", "institution": "Alibaba Group"}, {"id": 87814, "fullname": "Hengshuang Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/87814?format=json", "institution": "The University of Hong Kong"}], "abstract": "Recent advancements adopt online reinforcement learning (RL) from LLMs to text-to-image rectified flow diffusion models for reward alignment. The use of group-level rewards successfully aligns the model with the targeted reward. However, it faces challenges including low efficiency, dependency on stochastic samplers, and reward hacking. The problem is that rectified flow models are fundamentally different from LLMs: 1) For efficiency, online image sampling takes much more time and dominates the time of training. 2) For stochasticity, rectified flow is deterministic once the initial noise is fixed. Aiming at these problems and inspired by the effects of group-level rewards from LLMs, we design Group-level Direct Reward Optimization (GDRO). GDRO is a new post-training paradigm for group-level reward alignment that combines the characteristics of rectified flow models. Through rigorous theoretical analysis, we point out that GDRO supports full offline training that saves the large time cost for image rollout sampling. Also, it is diffusion-sampler-independent, which eliminates the need for the ODE-to-SDE approximation to obtain stochasticity. We also empirically study the reward hacking trap that may mislead the evaluation, and involve this factor in the evaluation using a corrected score that not only considers the original evaluation reward but also the trend of reward hacking. Extensive experiments demonstrate that GDRO effectively and efficiently improves the reward score of the diffusion model through group-wise offline optimization across the OCR and GenEval tasks, while demonstrating strong stability and robustness in mitigating reward hacking.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38661", "url": null, "sourceid": 37091, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38662, "uid": "48e9381c73959cfa96f0f580b9b15f9f", "name": "KV-Tracker: Real-Time Pose Tracking with Transformers", "authors": [{"id": 87294, "fullname": "Marwan Taher", "url": "http://cvpr.thecvf.com/api/miniconf/users/87294?format=json", "institution": "The University of Sheffield"}, {"id": 136558, "fullname": "Ignacio Alzugaray", "url": "http://cvpr.thecvf.com/api/miniconf/users/136558?format=json", "institution": "Imperial College London"}, {"id": 127570, "fullname": "Kirill Mazur", "url": "http://cvpr.thecvf.com/api/miniconf/users/127570?format=json", "institution": "Imperial College London"}, {"id": 76118, "fullname": "Xin Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76118?format=json", "institution": "Imperial College London"}, {"id": 87268, "fullname": "Andrew J. Davison", "url": "http://cvpr.thecvf.com/api/miniconf/users/87268?format=json", "institution": "Imperial College London"}], "abstract": "Multi-view 3D geometry networks offer a powerful prior but are prohibitively slow for real-time applications. We propose a novel way to adapt them for online use, enabling real-time 6-DoF pose tracking and online reconstruction of objects and scenes from monocular RGB videos. Our method rapidly selects and manages a set of images as keyframes to map a scene or object via $\\pi^3$~\\cite{wang2025pi3} with full bidirectional attention. We then cache the global self-attention block's key-value (KV) pairs and use them as the sole scene representation for online tracking. This allows for up to $15\\times$ speedup during inference without the fear of drift or catastrophic forgetting. Our caching strategy is model-agnostic and can be applied to other off-the-shelf multi-view networks without retraining.We demonstrate KV-Tracker on both scene-level tracking and the more challenging task of on-the-fly object tracking and reconstruction without depth measurements or object priors. Experiments on the TUM RGB-D, 7-Scenes, Arctic and OnePose datasets show the strong performance of our system while maintaining high frame-rates up to ${\\sim}27$ FPS.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38662", "url": null, "sourceid": 40385, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38665, "uid": "d1cd0a8c9b28f58703a097d5a25534e3", "name": "2-Shots in the Dark: Low-Light Denoising with Minimal Data Acquisition", "authors": [{"id": 184075, "fullname": "Liying Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184075?format=json", "institution": "EPFL"}, {"id": 190422, "fullname": "Raphael Achddou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190422?format=json", "institution": "Ecole Sup\u00e9rieure d&#x27;Ing\u00e9nieurs en Electronique et Electrotechnique"}, {"id": 77165, "fullname": "Sabine S\u00fcsstrunk", "url": "http://cvpr.thecvf.com/api/miniconf/users/77165?format=json", "institution": "EPFL - EPF Lausanne"}], "abstract": "Raw images taken in low-light conditions are very noisy due to low photon count and sensor noise. Learning-based denoisers have the potential to reconstruct high-quality images. For training, however, these denoisers require large paired datasets of clean and noisy images, which are difficult to collect. Noise synthesis is an alternative to large-scale data acquisition: given a clean image, we can synthesize a realistic noisy counterpart. In this work, we propose a general and practical noise synthesis method that requires only $\\textbf{one single noisy image and one single dark frame}$ per ISO setting. We represent signal-dependent noise with a Poisson distribution and introduce a Fourier-domain spectral sampling algorithm to accurately model signal-independent noise. The latter generates diverse noise realizations that maintain the spatial and statistical properties of real sensor noise. As opposed to concurrent approaches, our method neither relies on simplified parametric models nor on large sets of clean-noisy image pairs. It is accurate and practical. Moreover, our synthesis method leads to state-of-the-art performances on multiple low-light denoising benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38665", "url": null, "sourceid": 41498, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38666, "uid": "cea09ffe70b0c2233754dd0f9cb36a52", "name": "Plug-and-Play PDE Optimization for 3D Gaussian Splatting: Toward High-Quality Rendering and Reconstruction", "authors": [{"id": 174321, "fullname": "Yifan Mo", "url": "http://cvpr.thecvf.com/api/miniconf/users/174321?format=json", "institution": "University of Science and Technology of China"}, {"id": 188410, "fullname": "Youcheng Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/188410?format=json", "institution": "University of Science and Technology of China"}, {"id": 85084, "fullname": "Ligang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85084?format=json", "institution": "University of Science and Technology of China"}], "abstract": "3D Gaussian Splatting (3DGS) has revolutionized radiance field reconstruction by achieving high-quality novel view synthesis with fast rendering speed, introducing 3D Gaussian primitives to represent the scene. However, 3DGS encounters blurring and floaters when applied to complex scenes, caused by the reconstruction of redundant and ambiguous geometric structures. We attribute this issue to the unstable optimization of the Gaussians. To address this limitation, we present a plug-and-play PDE-based optimization method that overcomes the optimization constraints of 3DGS-based approaches in various tasks, such as novel view synthesis and surface reconstruction. Firstly, we theoretically derive that the 3DGS optimization procedure can be modeled as a PDE, and introduce a viscous term to ensure stable optimization. Secondly, we use the Material Point Method (MPM) to obtain a stable numerical solution of the PDE, which enhances both global and local constraints. Additionally, an effective Gaussian densification strategy and particle constraints are introduced to ensure fine-grained details. Extensive qualitative and quantitative experiments confirm that our method achieves state-of-the-art rendering and reconstruction quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38666", "url": null, "sourceid": 38873, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38669, "uid": "26e87ce3ffff8cab875cc01616fad7ed", "name": "Towards High-Quality Image Segmentation: Improving Topology Accuracy by Penalizing Neighbor Pixels", "authors": [{"id": 182854, "fullname": "J. Miguel Valverde", "url": "http://cvpr.thecvf.com/api/miniconf/users/182854?format=json", "institution": "Technical University of Denmark"}, {"id": 164954, "fullname": "Dim Papadopoulos", "url": "http://cvpr.thecvf.com/api/miniconf/users/164954?format=json", "institution": "Technical University of Denmark"}, {"id": 190429, "fullname": "Rasmus Larsen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190429?format=json", "institution": "Technical University of Denmark"}, {"id": 93991, "fullname": "Anders Dahl", "url": "http://cvpr.thecvf.com/api/miniconf/users/93991?format=json", "institution": "DTU Compute"}], "abstract": "Standard deep learning models for image segmentation cannot guarantee topology accuracy, failing to preserve the correct number of connected components or structures. This, in turn, affects the quality of the segmentations and compromises the reliability of the subsequent quantification analyses. Previous works have proposed to enhance topology accuracy with specialized frameworks, architectures, and loss functions. However, these methods are often cumbersome to integrate into existing training pipelines, they are computationally very expensive, or they are restricted to structures with tubular morphology. We present SCNP, an efficient method that improves topology accuracy by penalizing the logits with their poorest-classified neighbor, forcing the model to improve the prediction at the pixels' neighbors before allowing it to improve the pixels themselves. We show the effectiveness of SCNP across 13 datasets, covering different structure morphologies and image modalities, and integrate it into three frameworks for semantic and instance segmentation. Additionally, we show that SCNP can be integrated into several loss functions, making them improve topology accuracy. Our code can be found at https://github.com/Anonymous.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38669", "url": null, "sourceid": 43019, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38671, "uid": "4ee3a0ca5b398afe5f6c8610ebf49e39", "name": "RawMetaDiff: Unlocking Extreme Darkness from Dual-Exposure RAW with Meta-Guided Diffusion", "authors": [{"id": 180000, "fullname": "Panjun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180000?format=json", "institution": "University of Science and Technology of China"}, {"id": 190434, "fullname": "Jiyuan Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/190434?format=json", "institution": "University of Science and Technology of China"}, {"id": 181198, "fullname": "YUANSHEN GUAN", "url": "http://cvpr.thecvf.com/api/miniconf/users/181198?format=json", "institution": "University of Science and Technology of China"}, {"id": 91334, "fullname": "Yong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/91334?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 72335, "fullname": "Zhiqiang Lang", "url": "http://cvpr.thecvf.com/api/miniconf/users/72335?format=json", "institution": "Huawei Noah\u2019s Ark Lab"}, {"id": 76716, "fullname": "Ruikang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76716?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 88201, "fullname": "Chang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88201?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 87157, "fullname": "Dehua Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/87157?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 88204, "fullname": "Fenglong Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/88204?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 76559, "fullname": "Zhiwei Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76559?format=json", "institution": "USTC"}], "abstract": "Extreme low-light Raw image restoration remains challenging due to overwhelming noise and severe detail loss.In this paper, we exploit the potential of the dual-exposure setting for this severely ill-posed problem.Existing methods suffer from unreliable cross-exposure alignment, resulting in degraded detail recovery and compromised color fidelity. To address these challenges, we propose RawMetaDiff, a novel generative diffusion framework that restores a high-fidelity Raw image from a short-exposure input, conditioned on a potentially misaligned long-exposure reference under the guidance of Raw metadata.At its core, we propsed two complementary mechanisms: the Meta-Assistant Color Transfer (MACT) enforces color consistency by aligning global color statistics along the channel dimension,while the Meta-Normed Cross Attention (MNCA) leverages Raw metadata to establish robust cross-exposure spatial correspondences and inject shadow details.To support robust diffusion training, we first collect a 1K real-world, dual-exposure Raw dataset, namely DERaw, and then design a realistic degradation model to synthesize data that closely approximates real-world conditions.Extensive experiments on both synthetic and real-world datasets demonstrate that RawMetaDiff significantly outperforms existing methods, justifying an effective new solution for extreme low-light Raw image restoration from the generative perspective.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38671", "url": null, "sourceid": 38275, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38675, "uid": "7527a9a5e8ed7adc1389a59748913cb0", "name": "SAGA: Source Attribution of Generative AI Videos", "authors": [{"id": 156006, "fullname": "Rohit Kundu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156006?format=json", "institution": "Google LLC; University of California Riverside"}, {"id": 156007, "fullname": "Vishal Mohanty", "url": "http://cvpr.thecvf.com/api/miniconf/users/156007?format=json", "institution": "Google"}, {"id": 132125, "fullname": "Hao Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/132125?format=json", "institution": "Google"}, {"id": 190443, "fullname": "Shan Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/190443?format=json", "institution": "Google"}, {"id": 136781, "fullname": "Athula Balachandran", "url": "http://cvpr.thecvf.com/api/miniconf/users/136781?format=json", "institution": "Google"}, {"id": 163965, "fullname": "Amit Roy-Chowdhury", "url": "http://cvpr.thecvf.com/api/miniconf/users/163965?format=json", "institution": "University of California, Riverside"}], "abstract": "The proliferation of generative AI has led to hyper-realistic synthetic videos, escalating misuse risks and outstripping binary real/fake detectors. We introduce $\\textcolor{blue}{\\texttt{SAGA}}$ ($\\underline{S}$ource $\\underline{A}$ttribution of $\\underline{G}$enerative $\\underline{A}$I videos), the first comprehensive framework to address the urgent need for AI-generated $\\textit{video source attribution}$ at a large scale. Unlike traditional detection, $\\textcolor{blue}{\\texttt{SAGA}}$ identifies the specific generative model used. It uniquely provides multi-granular attribution across five levels: authenticity, generation task (e.g., T2V/I2V), model version, development team, and the precise generator, offering far richer forensic insights. Our novel video transformer architecture, leveraging features from a robust vision foundation model, effectively captures spatio-temporal artifacts. Critically, we introduce a data-efficient pretrain-and-attribute strategy, enabling $\\textcolor{blue}{\\texttt{SAGA}}$ to achieve state-of-the-art attribution using only 0.5% of source-labeled data per class, matching fully supervised performance. Furthermore, we propose Temporal Attention Signatures ($\\textcolor{blue}{\\texttt{T-Sig}}$), a novel interpretability method that visualizes learned temporal differences, offering the first explanation for $\\textit{why}$ different video generators are distinguishable. Extensive experiments on public datasets, including cross-domain scenarios, demonstrate that $\\textcolor{blue}{\\texttt{SAGA}}$ sets a new benchmark for synthetic video provenance, providing crucial, interpretable insights for forensic and regulatory applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38675", "url": null, "sourceid": 44657, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38676, "uid": "6fecbb8aacf459d4bef49fd47970b43f", "name": "ImageRAGTurbo: Towards One-step Text-to-Image Generation with Retrieval-Augmented Diffusion Models", "authors": [{"id": 137037, "fullname": "Peijie Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/137037?format=json", "institution": "Washington University in St.Louis"}, {"id": 190444, "fullname": "Hariharan Ramshankar", "url": "http://cvpr.thecvf.com/api/miniconf/users/190444?format=json", "institution": "Amazon"}, {"id": 142284, "fullname": "Arnau Ramisa", "url": "http://cvpr.thecvf.com/api/miniconf/users/142284?format=json", "institution": "Amazon Inc."}, {"id": 138186, "fullname": "Amit C C", "url": "http://cvpr.thecvf.com/api/miniconf/users/138186?format=json", "institution": "Amazon"}, {"id": 131913, "fullname": "Rene Vidal", "url": "http://cvpr.thecvf.com/api/miniconf/users/131913?format=json", "institution": "University of Pennsylvania and Amazon"}, {"id": 177099, "fullname": "Vamsi Salaka", "url": "http://cvpr.thecvf.com/api/miniconf/users/177099?format=json", "institution": "Amazon.com"}, {"id": 190445, "fullname": "Rahul Bhagat", "url": "http://cvpr.thecvf.com/api/miniconf/users/190445?format=json", "institution": "Amazon"}], "abstract": "Diffusion models have emerged as the leading approach for text-to-image generation. However, their iterative sampling process, which gradually morphs random noise into coherent images, introduces significant latency that limits their applicability. While recent few-step diffusion models reduce the number of sampling steps to as few as one to four steps, they often compromise image quality and prompt alignment, especially in one-step generation. Additionally, these models require computationally expensive training procedures. To address these limitations, we propose ImageRAGTurbo, a novel approach to efficiently finetune few-step diffusion models via retrieval augmentation. Given a text prompt, we retrieve relevant text-image pairs from a database and use them to condition the generation process. We argue that such retrieved examples provide rich contextual information to the UNet denoiser that helps reduce the number of denoising steps without compromising image quality. Indeed, our initial investigations show that using the retrieved content to edit the denoiser's latent space ($\\mathcal{H}$-space) without additional finetuning already improves prompt fidelity. To further improve the quality of the generated images, we augment the UNet denoiser with a trainable adapter in the $\\mathcal{H}$-space, which efficiently blends the retrieved content with the target prompt using a cross-attention mechanism. Experimental results on fast text-to-image generation demonstrate that our approach produces high-fidelity images without compromising latency compared to existing methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38676", "url": null, "sourceid": 34088, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38678, "uid": "21fbc36d7ebdf028791fd50c01cffeda", "name": "DriverGaze360: OmniDirectional Driver Attention with Object-Level Guidance", "authors": [{"id": 180095, "fullname": "Shreedhar Govil", "url": "http://cvpr.thecvf.com/api/miniconf/users/180095?format=json", "institution": "Deutsches Forschungszentrum f\u00fcr K\u00fcnstliche Intelligenz GmbH"}, {"id": 89910, "fullname": "Didier Stricker", "url": "http://cvpr.thecvf.com/api/miniconf/users/89910?format=json", "institution": "Universit\u00e4t Kaiserslautern"}, {"id": 97385, "fullname": "Jason Rambach", "url": "http://cvpr.thecvf.com/api/miniconf/users/97385?format=json", "institution": "DFKI"}], "abstract": "Predicting driver attention is a critical problem for developing explainable autonomous driving systems and understanding driver behavior in mixed human-autonomous vehicle traffic scenarios. Although significant progress has been made through large-scale driver attention datasets and deep learning architectures, existing works are constrained by narrow frontal field-of-view and limited driving diversity. Consequently, they fail to capture the full spatial context of driving environments, especially during lane changes, turns, and interactions involving peripheral objects such as pedestrians or cyclists. In this paper, we introduce DriverGaze360, a large-scale 360$^\\circ$ field of view driver attention dataset, containing $\\sim$1 million gaze-labeled frames collected from 19 human drivers, enabling comprehensive omnidirectional modeling of driver gaze behavior. Moreover, our panoramic attention prediction approach, DriverGaze360-Net, jointly learns attention maps and attended objects by employing an auxiliary semantic segmentation head. This improves spatial awareness and attention prediction across wide panoramic inputs. Extensive experiments demonstrate that DriverGaze360-Net achieves state-of-the-art attention prediction performance on multiple metrics on panoramic driving images. Dataset and method will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38678", "url": null, "sourceid": 34284, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38681, "uid": "c2f6ef076c68c04d525d839de995f379", "name": "Learning Personalized Photographic Style from Pairwise User Preferences", "authors": [{"id": 76201, "fullname": "Jinwoo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/76201?format=json", "institution": "Yonsei University"}, {"id": 190453, "fullname": "Jihye Yoo", "url": "http://cvpr.thecvf.com/api/miniconf/users/190453?format=json", "institution": "Ewha Woman&#x27;s University"}, {"id": 89061, "fullname": "Seon Joo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/89061?format=json", "institution": "Yonsei University"}], "abstract": "Photographic style preferences are deeply personal, varying across individuals in color and tonal aesthetics. We introduce Personalized Photographic Style (PPS) learning, where the goal is to capture a user's implicit preferences from comparative judgments and apply them consistently across diverse images. To establish a foundation for this problem, we present three contributions. First, we introduce PPSD, a dataset containing pairwise preference judgments from 767 users, each providing an average of 70 comparisons. To capture diverse style signals, images are sourced from professional edits, device pipelines, and generative models. Second, we explore several baseline models demonstrating the feasibility of adapting style transfer and enhancement approaches for preference learning. Third, we develop a comparative evaluation framework suited to the implicit nature of personal preferences. We will make our dataset publicly available, and hope this work serves as a foundation for advancing research in personalized photographic style learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38681", "url": null, "sourceid": 36511, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38683, "uid": "7cde5fa0ffeb15eeb1cb94724485e2a3", "name": "PoInit-of-View: Poisoning Initialization of Views Transfers Across Multiple 3D Reconstruction Systems", "authors": [{"id": 181249, "fullname": "Weijie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181249?format=json", "institution": "Fondazione Bruno Kessler"}, {"id": 151825, "fullname": "Songlong Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/151825?format=json", "institution": "University of Trento"}, {"id": 129985, "fullname": "Zhengyu Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/129985?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 75973, "fullname": "Nicu Sebe", "url": "http://cvpr.thecvf.com/api/miniconf/users/75973?format=json", "institution": "University of Trento"}, {"id": 190458, "fullname": "Bruno Lepri", "url": "http://cvpr.thecvf.com/api/miniconf/users/190458?format=json", "institution": "Ipazia SpA; Fondazione Bruno Kessler"}], "abstract": "Poisoning input views of 3D reconstruction systems has been recently studied.However, we identify that existing studies simply backpropagate adversarial gradients through the 3D reconstruction pipeline as a whole, without uncovering the new vulnerability rooted in specific modules of the 3D reconstruction pipeline.In this paper, we argue that the structure-from-motion (SfM) initialization, as the geometric core of many widely used reconstruction systems, can be targeted to achieve strong poisoning effects. To this end, we propose PoInit-of-View, which optimizes adversarial perturbations to intentionally introduce cross-view gradient inconsistencies at projections of corresponding 3D points. These inconsistencies disrupt keypoint detection and feature matching, thereby corrupting pose estimation and triangulation within SfM, eventually resulting in low-quality rendered views.  We also provide a theoretical analysis that connects cross-view inconsistency to correspondence collapse.Experimental results demonstrate the effectiveness of our \\name on diverse 3D reconstruction systems and datasets, surpassing the single-view-based method by 25.1% in PSNR and 16.5% in SSIM in black-box transfer settings, such as 3DGS to NeRF.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38683", "url": null, "sourceid": 36372, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38685, "uid": "5a2c8a18227cbb0bc2634e610a3c1746", "name": "Beyond [CLS] Token: Query-Driven Token-Level Forgery Purification for Generalizable Deepfake Detection", "authors": [{"id": 181622, "fullname": "Wang Changshuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/181622?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 181490, "fullname": "Jiangming Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181490?format=json", "institution": "Tencent"}, {"id": 91026, "fullname": "Ke-Yue Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91026?format=json", "institution": "Tencent"}, {"id": 90624, "fullname": "Taiping Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/90624?format=json", "institution": "Tencent Youtu Lab"}, {"id": 87241, "fullname": "Shouhong Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/87241?format=json", "institution": "Tencent Youtu Lab"}, {"id": 190460, "fullname": "Shunli Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190460?format=json", "institution": null}, {"id": 88671, "fullname": "Ran Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/88671?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 89127, "fullname": "Lizhuang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/89127?format=json", "institution": "Dept. of Computer Sci. &amp; Eng., Shanghai Jiao Tong University"}], "abstract": "We revisit the feature learning process of state-of-the-art deepfake detectors that leverage ViT-based vision foundation models and discover that the [\\texttt{CLS}] token, commonly adopted for detection, suffers from the Pre-trained Information Bias (PIB), \\textit{i.e.}, it tends to mainly focus on global semantics due to the knowledge dominated by pre-trained model parameters, while struggling to emphasize subtle local forgery cues. To overcome this limitation, one potential way is incorporating the token-level features to reform a new detection-specific token. To this end, we propose Query-Driven Token-Level Forgery Purification (QTFP) framework enabling the model to better capture local forgery traces without losing useful pre-trained prior. Specifically, we first introduce randomly initialized, learnable query tokens independent of the backbone and prior knowledge, which can effectively aggregate multi-patch evidence into a global token for detection. To make query tokens focusing on meaningful regions, we propose a theoretical fake-likelihood contrastive learning loss, which employs a weighting strategy to highlight significant fake regions while diminishing the impact of real-like patches. Using SNR theory, we verify that the designed weight is both reliable and informative. To further maintain useful authentic information, a real-attention alignment constraint is applied to query tokens. These designs go beyond relying solely on the [\\texttt{CLS}] token by jointly reorganizing real and fake information across all tokens, which successfully enhance detector robustness. Extensive experiments on diverse datasets demonstrate the effectiveness of our method.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38685", "url": null, "sourceid": 43939, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38686, "uid": "0a86d5f4e64277b61f3a780bae2859cb", "name": "VecAttention: Vector-wise Sparse Attention for Accelerating Long Context Inference", "authors": [{"id": 181248, "fullname": "Anmin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181248?format=json", "institution": "Peking University"}, {"id": 181204, "fullname": "Ruixuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181204?format=json", "institution": "Fudan University"}, {"id": 190461, "fullname": "Huiqiang Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190461?format=json", "institution": "Qwen"}, {"id": 190462, "fullname": "Bin Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190462?format=json", "institution": "Alibaba Group"}, {"id": 190463, "fullname": "Minmin Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/190463?format=json", "institution": "Alibaba Group"}, {"id": 177590, "fullname": "Yong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/177590?format=json", "institution": "Alibaba Cloud"}, {"id": 177967, "fullname": "CHEN ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/177967?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 190464, "fullname": "Tao Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/190464?format=json", "institution": "Peking University"}], "abstract": "Long-context video understanding and generation pose a significant computational challenge for Transformer-based video models due to the quadratic complexity of self-attention. While existing sparse attention methods employ coarse-grained patterns to improve efficiency, they typically incur redundant computation and suboptimal performance. To address this issue, in this paper, we propose  VecAttention, a novel vector-wise sparse attention framework that achieves superior accuracy-efficiency trade-offs for video models. We observe that video attention maps exhibit a strong vertical-vector sparse pattern, and further demonstrate that this vertical-vector pattern offers consistently better accuracy\u2013sparsity trade-offs compared with existing coarse-grained sparse patterns. Based on this observation, VecAttention dynamically selects and processes only informative vertical vectors through a lightweight important-vector selection that minimizes memory access overhead and an optimized vector sparse attention kernel. Comprehensive evaluations on video understanding (VideoMME, LongVideoBench, and VCRBench) and generation (VBench) tasks show that VecAttention delivers a 2.65$\\times$ speedup over full attention and a 1.83$\\times$ speedup over state-of-the-art sparse attention methods, with  comparable accuracy to full attention.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38686", "url": null, "sourceid": 42499, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38688, "uid": "6b6bcd09a82ff1f9158681c4f9161c6b", "name": "Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching", "authors": [{"id": 71925, "fullname": "Bowen Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/71925?format=json", "institution": "NVIDIA"}, {"id": 85671, "fullname": "Shaurya Dewan", "url": "http://cvpr.thecvf.com/api/miniconf/users/85671?format=json", "institution": "International Institute of Information Technology, Hyderabad, International Institute of Information Technology Hyderabad"}, {"id": 159437, "fullname": "Stan Birchfield", "url": "http://cvpr.thecvf.com/api/miniconf/users/159437?format=json", "institution": "NVIDIA"}], "abstract": "Stereo foundation models achieve strong zero-shotgeneralization but remain computationally prohibitive forreal-time applications.  Efficient stereo architectures, on the other hand, sacrificerobustness for speed and require costly per-domain fine-tuning.To bridge this gap, we present Fast-FoundationStereo, a family of architectures that achieve, for the first time, strong zero-shot generalization at real-time frame rate. We employ a divide-and-conquer acceleration strategy with three components: (1) knowledge distillation to compress the hybrid backbone into a single efficient student; (2) blockwise neural architecture search for automatically discovering optimal cost filtering designs under latency budgets, reducing search complexity exponentially; and (3) structured pruning for eliminating redundancy in the iterative refinement module. Furthermore, we introduce an automatic pseudo-labeling pipeline used to curate 1.4M in-the-wild stereo pairs to supplement synthetic training data and facilitate knowledge distillation. The resulting model can run over 10\u00d7 faster than FoundationStereo while closely matching its zero-shot accuracy, thus establishing a new state-of-the-art among real-time methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38688", "url": null, "sourceid": 36503, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38690, "uid": "debf71dec0a113d324faef063f958166", "name": "FlashVSR: Towards Real-time Diffusion-Based Streaming Video Super Resolution", "authors": [{"id": 190472, "fullname": "Jun-hao Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190472?format=json", "institution": "Tencent ARC Lab; Tsinghua University, Tsinghua University"}, {"id": 152363, "fullname": "Shi Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/152363?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 152360, "fullname": "Xin Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/152360?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 170633, "fullname": "Xiaohui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/170633?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 77291, "fullname": "Yihao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/77291?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 86975, "fullname": "Chun Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/86975?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 87471, "fullname": "Tianfan Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/87471?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "Diffusion models have recently advanced video restoration, but applying them to real-world and AIGC-generated video super-resolution (VSR) remains challenging due to high latency, prohibitive computation, and poor generalization to ultra-high resolutions. Our goal in this work is to make diffusion-based VSR practical by achieving efficiency, scalability, and near real-time performance. To this end, we propose **FlashVSR**, the first diffusion-based one-step streaming framework for efficient video super-resolution. FlashVSR runs at approximately 17 FPS for 768\u00d71408 videos on a single A100 GPU by combining three complementary innovations: (i) a train-friendly three-stage distillation pipeline that enables streaming super-resolution, (ii) locality-constrained sparse attention that reduces redundant computation while bridging the train\u2013test resolution gap, and (iii) a tiny conditional decoder that accelerates reconstruction without sacrificing quality. To support large-scale training, we also construct **VSR-120K**, a new dataset containing 120K videos and 180K images. Extensive experiments demonstrate that FlashVSR scales reliably to ultra-high resolutions and achieves state-of-the-art performance with up to approximately 12\u00d7 speed-up over prior one-step diffusion-based VSR models. We will release the code, pretrained models, and dataset to foster future research in efficient diffusion-based video super-resolution.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38690", "url": null, "sourceid": 41617, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38692, "uid": "b7dc383be23271e021efa4b0a81c0573", "name": "Adaptive Confidence Regularization for Multimodal Failure Detection", "authors": [{"id": 181176, "fullname": "Moru Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181176?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}, {"id": 155538, "fullname": "Hao Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/155538?format=json", "institution": "ETH Zurich"}, {"id": 73275, "fullname": "Olga Fink", "url": "http://cvpr.thecvf.com/api/miniconf/users/73275?format=json", "institution": "EPFL - EPF Lausanne"}, {"id": 149343, "fullname": "Mario Trapp", "url": "http://cvpr.thecvf.com/api/miniconf/users/149343?format=json", "institution": "Technical University of Munich"}], "abstract": "The deployment of multimodal models in high-stakes domains, such as self-driving vehicles and medical diagnostics, demands not only strong predictive performance but also reliable mechanisms for detecting failures. In this work, we address the largely unexplored problem of failure detection in multimodal contexts. We propose Adaptive Confidence Regularization (ACR), a novel framework specifically designed to detect multimodal failures. Our approach is driven by a key observation: in most failure cases, the confidence of the multimodal prediction is significantly lower than that of at least one unimodal branch, a phenomenon we term confidence degradation. To mitigate this, we introduce an Adaptive Confidence Loss that penalizes such degradations during training. In addition, we propose Multimodal Feature Swapping, a novel outlier synthesis technique that generates challenging, failure-aware training examples. By training with these synthetic failures, ACR learns to more effectively recognize and reject uncertain predictions, thereby improving overall reliability. Extensive experiments across four datasets, three modalities, and multiple evaluation settings demonstrate that ACR achieves consistent and robust gains. The source code will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38692", "url": null, "sourceid": 35071, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38693, "uid": "0ce5eb1682917fc391e592aff20c35af", "name": "Geometric-Aware Hypergraph Reasoning for Novel Class Discovery in Point Cloud Segmentation", "authors": [{"id": 155719, "fullname": "Zihao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155719?format=json", "institution": "Tianjin University"}, {"id": 88250, "fullname": "Aming Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88250?format=json", "institution": "Xidian University"}, {"id": 180103, "fullname": "Li Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180103?format=json", "institution": "Tianjin University"}, {"id": 86180, "fullname": "Yahong Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/86180?format=json", "institution": "Tianjin University"}, {"id": 185284, "fullname": "Jialie Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185284?format=json", "institution": "City St George&#x27;s, University of London"}], "abstract": "Novel Class Discovery in Point Cloud Segmentation is recently proposed, aiming to leverage knowledge from known classes to automatically segment unlabeled classes within point clouds. The core of this task lies in leveraging the geometric and semantic knowledge of multiple known classes to achieve semantic understanding and segmentation of novel classes.However, existing methods overlook the high-order associations between known and novel classes, relying solely on binary associations for class assignment and novel class reasoning, which leads to less precise semantic segmentation.To address these issues, we introduce a hypergraph structure to model high-order associations among classes, enabling collaborative reasoning from known classes to novel classes, extending beyond traditional binary relations.Additionally, existing methods focus excessively on extracting semantic information when processing point cloud data, neglecting the importance of geometric features. To address this, we introduce Geometric-Aware Prototypes, enhancing the model's ability to capture geometric spatial information.By propagating geometric information through hyperedges, our method enhances the understanding of spatial distributions across classes, improving segmentation accuracy.Significant performance improvements achieved on the SemanticKITTI and SemanticPOSS datasets demonstrate the superiority of our method.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38693", "url": null, "sourceid": 34688, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38695, "uid": "b27cc617a091ad4ee6e6d18d7ab85830", "name": "VesMamba: 3D Pulmonary Vessel Segmentation from CT images via Mamba with Structural Perception and Scale-aware Filtering", "authors": [{"id": 181697, "fullname": "Zhipeng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181697?format=json", "institution": "Shenzhen University"}, {"id": 154269, "fullname": "Guilian Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/154269?format=json", "institution": "Shenzhen University"}, {"id": 190481, "fullname": "Zheng Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190481?format=json", "institution": "Shenzhen University"}, {"id": 127066, "fullname": "Huisi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127066?format=json", "institution": "Shenzhen University"}, {"id": 127069, "fullname": "Jing Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/127069?format=json", "institution": "Hong Kong Polytechnic University"}], "abstract": "Automated 3D pulmonary vessel segmentation from CT images is crucial for improving early screening and assessment of pulmonary vessel related diseases. However, it remains an extremely challenging task due to the complex and tree-like structures of vessels, large scale-variation, and the existence of highly similar tissues in the background. Existing segmentation models either cannot sufficiently capture long-range structural dependencies, which are of great importance in vessel segmentation, or are constrained by insufficient computational resources in clinical settings. In this paper, we propose VesMamba, a novel model for 3D pulmonary vessel segmentation that comprehensively addresses these challenges. Specifically, we first devise a spatial-gated structural perception (SSP) module, which employs Mamba to efficiently capture long-range dependencies. In SSP, we design dynamic spatial attention convolutions (DSAC) for dynamically learning the tree-like 3D vessel structures, providing Mamba with the spatial perception capability to better track the complicated topologies of vessels. Second, we propose an innovative bidirectional scale-aware filter (BSF) module to strengthen the representation capability of the encoder, facilitating our model to focus on vessels of different scales under noise. Moreover, we apply a mask-constrained decoder to further improve segmentation consistency and accuracy, which constrains the inference of adjacent low-layer decoders directly by high-layer masks. Extensive experiments on the public dataset Parse22 and the internal dataset Lung79 demonstrate that our method can achieve better performance than SOTAs. Codes will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38695", "url": null, "sourceid": 38381, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38698, "uid": "7bab2dffdde2dc0280f291194aec45b1", "name": "Test-time Ego-Exo-centric Adaptation for Action Anticipation via Multi-Label Prototype Growing and Dual-Clue Consistency", "authors": [{"id": 180581, "fullname": "Zhaofeng Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/180581?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 129509, "fullname": "Heqian Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129509?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 184551, "fullname": "Lanxiao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184551?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 85548, "fullname": "Qingbo Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85548?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 85491, "fullname": "Fanman Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/85491?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 85507, "fullname": "Lili Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/85507?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 85496, "fullname": "Hongliang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/85496?format=json", "institution": "University of Electronic Science and Technology of China, Tsinghua University"}], "abstract": "Efficient adaptation between Egocentric (Ego) and Exocentric (Exo) views is crucial for applications such as human-robot cooperation. However, the success of most existing Ego-Exo adaptation methods relies heavily on target-view data for training, thereby increasing computational and data collection costs. In this paper, we make the first exploration of a Test-time Ego-Exo Adaptation for Action Anticipation (TE$^{2}$A$^{3}$) task, which aims to adjust the source-view-trained model online during test time to anticipate target-view actions. It is challenging for existing Test-Time Adaptation (TTA) methods to address this task due to the multi-action candidates and significant temporal-spatial inter-view gap. Hence, we propose a novel Dual-Clue enhanced Prototype Growing Network (DCPGN), which accumulates multi-label knowledge and integrates cross-modality clues for effective test-time Ego-Exo adaptation and action anticipation. Specifically, we propose a Multi-Label Prototype Growing Module (ML-PGM) to balance multiple positive classes via multi-label assignment and confidence-based reweighting for class-wise memory banks, which are updated by an entropy priority queue strategy. Then, the Dual-Clue Consistency Module (DCCM) introduces a lightweight narrator to generate textual clues indicating action progressions, which complement the visual clues containing various objects. Moreover, we constrain the inferred textual and visual logits to construct dual-clue consistency for temporally and spatially bridging Ego and Exo views. Extensive experiments on the newly proposed EgoMe-anti and the existing EgoExoLearn benchmarks show the effectiveness of our method, which outperforms related state-of-the-art methods by a large margin. The code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38698", "url": null, "sourceid": 43070, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38718, "uid": "a78562b316f0578286ddea6e6eaf2c63", "name": "ORD: Object-Relation Decoupling for Generalized 3D Visual Grounding", "authors": [{"id": 129131, "fullname": "Ronggang Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129131?format=json", "institution": "South China University of Technology"}, {"id": 190521, "fullname": "FanSen Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190521?format=json", "institution": "South China University of Technology"}, {"id": 89026, "fullname": "Huaidong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89026?format=json", "institution": "South China University of Technology"}, {"id": 89048, "fullname": "Xuemiao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89048?format=json", "institution": "South China University of Technology"}], "abstract": "3D visual grounding task aims to accurately identify and grounding target objects in 3D space based on natural language descriptions,where the effective exploitation of relative relations between the target and anchor is crucial.However, in existing methods, relative relations are often tightly entangled with entity semantics. This tight coupling encourages models to rely on semantic shortcuts from entity names, making it difficult to maintain good generalization under multi-view and complex multi-object scenarios.To address this, we propose an object\u2013relation decoupling framework that treats target\u2013anchor relations as first-class geometric and semantic primitives and models them explicitly.First, we construct a scene-level relative geometric representation that encodes the direction and distance between the target and anchor, and introduce a scene-level hyper-object token as a unified prior for scale and viewpoint.Second, we develop a predicate-decoupled cross-modal alignment strategy that preserves only predicates carrying spatial relational semantics while masking out all other tokens, thereby suppressing semantic leakage from entity names.Finally, we design an anchor-guided regression module that predicts auxiliary anchors and samples their features to guide the model in learning entity semantics from text, explicitly injecting target\u2013anchor priors and effectively resolving ambiguities in complex multi-object scenes.Extensive experiments on multiple 3D visual grounding benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches and exhibits strong robustness and generalization under challenging multi-view and relation-intensive settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38718", "url": null, "sourceid": 45370, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38701, "uid": "a42bde7f0a8acc277c4e482a927d86ae", "name": "You Only Erase Once: Erasing Anything without Bringing Unexpected Content", "authors": [{"id": 156228, "fullname": "Yixing Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156228?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 87176, "fullname": "Qing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87176?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 86814, "fullname": "Wenju Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86814?format=json", "institution": "University of Kansas, Lawrence"}, {"id": 86314, "fullname": "Wei-Shi Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86314?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "We present YOEO, an approach for object erasure. Unlike recent diffusion-based methods which struggle to erase target objects without generating unexpected content within the masked regions due to lack of sufficient paired training data and explicit constraint on content generation, our method allows to produce high-quality object erasure results free of unwanted objects or artifacts while faithfully preserving the overall context coherence to the surrounding content. We achieve this goal by training an object erasure diffusion model on unpaired data containing only large-scale real-world images, under the supervision of a sundries detector and a context coherence loss that are built upon an entity segmentation model. To enable more efficient training and inference, a diffusion distillation strategy is employed to train for a few-step erasure diffusion model. Extensive experiments show that our method outperforms the state-of-the-art object erasure methods. Our code and trained model will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38701", "url": null, "sourceid": 39801, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38704, "uid": "d4debfe3d5694f7b8a997233f02f3273", "name": "Vibe Spaces for Creatively Connecting and Expressing Visual Concepts", "authors": [{"id": 127847, "fullname": "Huzheng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127847?format=json", "institution": "University of Pennsylvania, University of Pennsylvania"}, {"id": 94336, "fullname": "Katherine Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/94336?format=json", "institution": "University of Pennsylvania"}, {"id": 190493, "fullname": "Andrew Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190493?format=json", "institution": "University of Pennsylvania, University of Pennsylvania"}, {"id": 190494, "fullname": "Michael Grossberg", "url": "http://cvpr.thecvf.com/api/miniconf/users/190494?format=json", "institution": "CUNY City College of NY"}, {"id": 84812, "fullname": "Yutong Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/84812?format=json", "institution": "Johns Hopkins University"}, {"id": 73237, "fullname": "Jianbo Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/73237?format=json", "institution": null}], "abstract": "Creating new visual concepts often requires connecting distinct ideas through their most relevant shared attributes\u2014their vibe. We introduce Vibe Blending, a novel task for generating coherent and meaningful hybrids that reveals these shared attributes between images. Achieving such blends is challenging for current methods, which struggle to identify and traverse nonlinear paths linking distant concepts in latent space. We propose Vibe Space, a hierarchical graph manifold that learns low-dimensional geodesics in feature spaces like CLIP, enabling smooth and semantically consistent transitions between concepts. To evaluate creative quality, we design a cognitively inspired framework combining human judgments, LLM reasoning, and a geometric path-based difficulty score. We find that Vibe Space produces blends that humans consistently rate as more creative and coherent than current methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38704", "url": null, "sourceid": 32265, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38707, "uid": "2ab5569b4274a31c7f2b7c67cd9ba9e2", "name": "A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning", "authors": [{"id": 181811, "fullname": "Yichang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181811?format=json", "institution": "Georgia Institute of Technology"}, {"id": 92855, "fullname": "Gaowen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/92855?format=json", "institution": "Cisco Systems"}, {"id": 129129, "fullname": "Ramana Kompella", "url": "http://cvpr.thecvf.com/api/miniconf/users/129129?format=json", "institution": "Cisco"}, {"id": 131101, "fullname": "Tiansheng Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131101?format=json", "institution": "Georgia Institute of Technology"}, {"id": 131098, "fullname": "Sihao Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131098?format=json", "institution": "Georgia Institute of Technology"}, {"id": 76154, "fullname": "Fatih Ilhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/76154?format=json", "institution": "Georgia Institute of Technology"}, {"id": 131096, "fullname": "Selim Tekin", "url": "http://cvpr.thecvf.com/api/miniconf/users/131096?format=json", "institution": "College of Computing, Georgia Institute of Technology"}, {"id": 189166, "fullname": "Zachary Yahn", "url": "http://cvpr.thecvf.com/api/miniconf/users/189166?format=json", "institution": "Georgia Institute of Technology"}, {"id": 181432, "fullname": "Ling Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181432?format=json", "institution": "Georgia Institute of Technology"}], "abstract": "This paper presents a multi-agent perception-action exploration alliance, dubbed A4VL, for efficient long-video reasoning. A4VL operates in a multi-round perception\u2013action exploration loop with a selection of VLM agents. In each round,  the team of agents performs video question-answer (VideoQA) via perception exploration followed by action exploration.  During perception exploration, each agent learns to extract query-specific perception clue(s) from a few sampled frames and performs clue-based alignment to find the video block(s) that are most relevant to the query-specific event. During action exploration, A4VL performs video reasoning in three steps: (1) each agent produces its initial answer with rational, (2) all agents collaboratively scores one another through cross-reviews and relevance ranking, and (3) based on whether a satisfactory consensus is reached, the decision is made either to start a new round of perception-action deliberation by pruning (e.g., filtering out the lowest performing agent) and re-staging (e.g., new-clue and matching block based perception-action exploration), or to conclude by producing its final answer. The integration of the multi-agent alliance through multi-round perception-action exploration, coupled with event-driven partitioning and cue-guided block alignment, enables A4VL to effectively scale to real world long videos while preserving high quality video reasoning. Evaluation Results on five popular VideoQA benchmarks show that A4VL outperforms 18 existing representative VLMs and 10 recent methods optimized for long-video reasoning, while achieving significantly lower inference latency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38707", "url": null, "sourceid": 31741, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38710, "uid": "6fe57a576e8deb91aba8f0f46ec9c164", "name": "Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models", "authors": [{"id": 133908, "fullname": "Pablo Ruiz-Ponce", "url": "http://cvpr.thecvf.com/api/miniconf/users/133908?format=json", "institution": "Universidad de Alicante"}, {"id": 129850, "fullname": "Sergio Escalera", "url": "http://cvpr.thecvf.com/api/miniconf/users/129850?format=json", "institution": "Computer Vision Center"}, {"id": 154733, "fullname": "Jose Garcia-Rodriguez", "url": "http://cvpr.thecvf.com/api/miniconf/users/154733?format=json", "institution": "Universidad de Alicante"}, {"id": 74045, "fullname": "Jiankang Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/74045?format=json", "institution": "Imperial College London"}, {"id": 91772, "fullname": "Rolandos Alexandros Potamias", "url": "http://cvpr.thecvf.com/api/miniconf/users/91772?format=json", "institution": "Imperial College London"}], "abstract": "Generating realistic human-human interactions is a challenging task that requires not only high-quality individual body and hand motions, but also coherent coordination among all interactants. Due to limitations in available data and increased learning complexity, previous methods tend to ignore hand motions, limiting the realism and expressivity of the interactions. Additionally, current diffusion-based approaches generate entire motion sequences simultaneously, limiting their ability to capture the reactive and adaptive nature of human interactions. To address these limitations, we introduce Interact2Ar, the first end-to-end text-conditioned autoregressive diffusion model for generating full-body, human-human interactions. Interact2Ar incorporates detailed hand kinematics through dedicated parallel branches, enabling high-fidelity full-body generation. Furthermore, we introduce an autoregressive pipeline coupled with a novel memory technique that facilitates adaptation to the inherent variability of human interactions using efficient large context windows. The adaptability of our model enables a series of downstream applications, including temporal motion composition, real-time adaptation to disturbances, and extension beyond dyadic to multi-person scenarios. To validate the generated motions, we introduce a set of robust evaluators and extended metrics designed specifically for assessing full-body interactions. Through quantitative and qualitative experiments, we demonstrate the state-of-the-art performance of Interact2Ar.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38710", "url": null, "sourceid": 43149, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38713, "uid": "90253ae61c07456e5fa65da0c0f17d9b", "name": "MooCap: A Multi-View Benchmark for Cow-Object-Human Interaction and Behavior Dynamics", "authors": [{"id": 181211, "fullname": "Ian Noronha", "url": "http://cvpr.thecvf.com/api/miniconf/users/181211?format=json", "institution": "Purdue University"}, {"id": 190509, "fullname": "Heather Neave", "url": "http://cvpr.thecvf.com/api/miniconf/users/190509?format=json", "institution": "Purdue University"}, {"id": 190510, "fullname": "Upinder Kaur", "url": "http://cvpr.thecvf.com/api/miniconf/users/190510?format=json", "institution": "Purdue University"}], "abstract": "Understanding animal behavior requires modeling how bodies, objects, and other agents interact over time, not simply detecting isolated actions or estimating pose frame by frame. Existing animal video datasets target pose estimation or coarse, passively observed actions, and rarely provide the structured, multi-entity interaction annotations needed to study behavioral dynamics. We introduce MooCap, a multi-view video benchmark for animal-object-human interaction understanding under controlled experimental protocols.  MooCap contains 42 hours of synchronized multi-camera video from 43 individually tested cows across seven standardized interaction scenarios, including novel environment, novel object, novel human, human approach, unfamiliar conspecifics (restricted and unrestricted) and Dam reunion (restricted and unrestricted). Recordings are densely annotated with 23 fine-grained behaviors, 39 body keypoints across 157 test sessions, 4 spatial zones, and 43 subjects, describing interactions among subjects, objects, humans, and other cattle. We establish three benchmarks on MooCap: (1) dense temporal action segmentation over 1200-1500-second sequences; (2) pose-based behavior and interaction recognition from keypoint trajectories; and (3) longitudinal behavioral classification linking adult behaviors with rearing conditions. Benchmarking results reveal that state-of-the-art temporal segmentation models achieve only 66.4\\% frame accuracy and 30.6\\% F1@0.5, with performance degrading further during interaction-heavy segments. Overall, MooCap bridges multi-view pose estimation, multi-entity tracking, and structured behavioral protocols to enable interaction-aware models for animal behavior analysis.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38713", "url": null, "sourceid": 40272, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38719, "uid": "e0d90dcc491a3323763c77d1dff7248b", "name": "Multi-Hierarchical Contrastive Spectral Fusion for Multi-View Clustering", "authors": [{"id": 106546, "fullname": "Bing Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/106546?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 190522, "fullname": "Xiaoli Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190522?format=json", "institution": "Nanjing Forestry University"}, {"id": 190523, "fullname": "Gui-Fu Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190523?format=json", "institution": "Anhui polytechnic university"}, {"id": 181619, "fullname": "Zechao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181619?format=json", "institution": "Nanjing University of Science and Techonolgy"}], "abstract": "Multi-view contrastive clustering has emerged as a powerful paradigm for learning comprehensive representations from heterogeneous data sources. However, prevailing approaches typically overlook the intrinsic geometric and clustering structures, rendering them structure-agnostic. In this paper, we propose a novel framework that performs Multi-Hierarchical Contrastive Spectral Fusion (MCSF) to address these limitations. MCSF integrates deep spectral embedding into the encoder to preserve local manifold structure, guiding the learned representations to be clustering-friendly. To enhance cross-view consistency, MCSF introduces a multi-hierarchical contrastive loss jointly optimizing (1) view-specific structure preservation, (2) view-consensus alignment, and (3) consensus structure refinement. This mechanism enables the construction of an accurate and semantically consistent consensus representation, effectively fusing multi-view information and uncovering authentic cluster structures. Extensive experiments on benchmarks validate the effectiveness of multi-hierarchical contrastive spectral fusion in clustering accuracy and representation quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38719", "url": null, "sourceid": 37583, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38720, "uid": "18491156799c51080e55a1cc72ac93fd", "name": "YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal", "authors": [{"id": 144055, "fullname": "wu chenyang", "url": "http://cvpr.thecvf.com/api/miniconf/users/144055?format=json", "institution": "Nankai University"}, {"id": 188268, "fullname": "Lina Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/188268?format=json", "institution": "Nankai University"}, {"id": 153255, "fullname": "Fan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/153255?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 76506, "fullname": "Chun-Le Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/76506?format=json", "institution": "Nankai University"}, {"id": 153254, "fullname": "Dehong Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/153254?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 144139, "fullname": "Xinran Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/144139?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 153256, "fullname": "Zhixin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153256?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 90540, "fullname": "Ming-Ming Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90540?format=json", "institution": "Nankai University, Tsinghua University"}, {"id": 73507, "fullname": "Chongyi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/73507?format=json", "institution": "Nankai University"}], "abstract": "Recent advances in Diffusion Transformer (DiT)-based video generation technologies have shown impressive results for video object removal. However, these methods still suffer from substantial inference latency. For instance, although MiniMax Remover achieves state-of-the-art visual quality, it operates at only around 10 FPS, primarily due to dense computations over the entire spatiotemporal token space\u2014even when only a small masked region actually requires processing. In this paper, we present YOSE \u2014 You Only Select Essential Tokens, an efficient fine-tuning framework. YOSE introduces two key components: Batch Variable-length Indexing (BVI) and Diffusion Process Simulator (DiffSim) Module. BVI is a differentiable dynamic indexing operator that adaptively selects essential tokens based on mask information, enabling variable-length token processing across samples. DiffSim provides a diffusion process approximation mechanism for unmasked tokens, which simulates the influence of unmasked regions within DiT self-attention to maintain semantic consistency for masked tokens. With these designs, YOSE achieves mask-aware acceleration, where the inference time scales approximately linearly with the masked regions \u2014 in contrast to full-token diffusion methods whose computation remains constant regardless of the mask size. Extensive experiments demonstrate that YOSE achieves up to 2.5x speedup in 70% of cases while maintaining visual quality comparable to the baseline. The code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38720", "url": null, "sourceid": 35455, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38722, "uid": "e21474d71ab3960b460f6f65ba4763df", "name": "PanDA: Panoptic Domain Adaptation for Multimodal Perception in Autonomous Driving", "authors": [{"id": 131070, "fullname": "Yining Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/131070?format=json", "institution": "Singapore University of Technology and Design"}, {"id": 190527, "fullname": "Shijie Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190527?format=json", "institution": "A*STAR"}, {"id": 181146, "fullname": "Yuchen Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181146?format=json", "institution": "Singapore University of Technology and Design"}, {"id": 107391, "fullname": "Xulei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107391?format=json", "institution": "Institute for Infocomm Research (I2R), A*STAR"}, {"id": 135934, "fullname": "Na Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/135934?format=json", "institution": "Singapore University of Technology and Design"}], "abstract": "This paper presents the first study on Unsupervised Domain Adaptation (UDA) for multimodal 3D panoptic segmentation (mm-3DPS), aiming to improve generalization under domain shifts commonly encountered in real-world autonomous driving. A straightforward solution is to employ a pseudo-labeling strategy, which is widely used in UDA to generate supervision for unlabeled target data, combined with a mm-3DPS backbone. However, existing mm-3DPS methods rely heavily on strong cross-modal complementarity between LiDAR and RGB inputs, making them fragile under domain shifts where one modality degrades (e.g. poor lighting or adverse weather). Moreover, conventional pseudo-labeling typically retains only high-confidence regions, leading to fragmented masks and incomplete object supervision, which are issues particularly detrimental to panoptic segmentation. To address these challenges, we propose **PanDA**, the first UDA framework specifically designed for multimodal 3D panoptic segmentation. To improve robustness against single-sensor degradation, we introduce an asymmetric multimodal augmentation that selectively drops regions to simulate domain shifts and improve robust representation learning. To enhance pseudo-label completeness and reliability, we further develop a dual-expert pseudo-label refinement module that extracts domain-invariant priors from both 2D and 3D modalities. Extensive experiments across diverse domain shifts, spanning time, weather, location, and sensor variations, significantly surpass state-of-the-art UDA baselines for 3D semantic segmentation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38722", "url": null, "sourceid": 36820, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38723, "uid": "cea0fedca8442bd7715832a012c208c1", "name": "FARMER: Flow AutoRegressive Transformer over Pixels", "authors": [{"id": 101039, "fullname": "GuangTing Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/101039?format=json", "institution": "University of Science and Technology of China"}, {"id": 73658, "fullname": "Qinyu Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/73658?format=json", "institution": "ByteDance Inc.; Australian National University"}, {"id": 106920, "fullname": "Tao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/106920?format=json", "institution": "Xi&#x27;an JiaoTong University"}, {"id": 190528, "fullname": "Fei Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190528?format=json", "institution": null}, {"id": 87202, "fullname": "Zhijie Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/87202?format=json", "institution": "Zhejiang University"}, {"id": 89921, "fullname": "Jie Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89921?format=json", "institution": "ByteDance Inc."}, {"id": 186608, "fullname": "Jiajun Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186608?format=json", "institution": "University of Adelaide"}, {"id": 90108, "fullname": "Yanyong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90108?format=json", "institution": "Rutgers University, Newark"}, {"id": 128715, "fullname": "Rui Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128715?format=json", "institution": "Chinese University of Hong Kong (Shenzhen)"}], "abstract": "Directly modeling the explicit likelihood of the raw data distribution is a key topic in the machine learning area, which achieves the scaling success in Large Language Models by autoregressive modeling. However, continuous AR modeling over visual pixel data suffers from extremely long sequences and high-dimensional spaces. In this paper, we present FARMER, a novel end-to-end generative framework that unifies Normalizing Flows (NFs) and Autoregressive (AR) models for tractable likelihood estimation and high-quality image synthesis directly from raw pixels. FARMER employs an invertible autoregressive flow to transform images into latent sequences, whose distribution is implicitly modeled by an autoregressive model. To address the redundancy and complexity in pixel-level modeling, we propose a self-supervised dimension reduction scheme that partitions NF latent channels into informative and redundant groups, enabling more effective and efficient AR modeling. Furthermore, we design a one-step distillation scheme to significantly accelerate inference speed and introduce a resampling-based classifier-free guidance algorithm to boost image generation quality. Extensive experiments demonstrate that FARMER achieves competitive performance compared to existing pixel-based generative models while providing exact likelihoods and scalable training.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38723", "url": null, "sourceid": 46360, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38727, "uid": "acd746ef45bbe504516078ff295ad7a7", "name": "VEMamba: Efficient Isotropic Reconstruction of Volume Electron Microscopy with Axial-Lateral Consistent Mamba", "authors": [{"id": 181839, "fullname": "Longmi Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/181839?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 85380, "fullname": "Pan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/85380?format=json", "institution": "Nanjing University of Aeronautics and Astronautics, Tsinghua University"}], "abstract": "Volume Electron Microscopy (VEM) is crucial for 3D tissue imaging but often produces anisotropic data with poor axial resolution, hindering visualization and downstream analysis. Existing methods for isotropic reconstruction often suffer from neglecting abundant axial information and employing simple downsampling to simulate anisotropic data. To address these limitations, we propose VEMamba, an efficient framework for isotropic reconstruction. The core of VEMamba is a novel \\textbf{3D Dependency Reordering} paradigm, implemented via two key components: an Axial-Lateral Chunking Selective Scan Module (ALCSSM), which intelligently re-maps complex 3D spatial dependencies (both axial and lateral) into optimized 1D sequences for efficient Mamba-based modeling, explicitly enforcing axial-lateral consistency; and a Dynamic Weights Aggregation Module (DWAM) to adaptively aggregate these reordered sequence outputs for enhanced representational power. Furthermore, we introduce a realistic degradation simulation and then leverage Momentum Contrast (MoCo) to integrate this degradation-aware knowledge into the network for superior reconstruction. Extensive experiments on both simulated and real-world anisotropic VEM datasets demonstrate that VEMamba achieves state-of-the-art performance while maintaining a lower computational footprint.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38727", "url": null, "sourceid": 34706, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38729, "uid": "d8bc24756770daf4b6bb9a28959597b0", "name": "APPO: Attention-guided Perception Policy Optimization for Video Reasoning", "authors": [{"id": 152093, "fullname": "Henghui Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/152093?format=json", "institution": "Renmin University of China"}, {"id": 153903, "fullname": "Chang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/153903?format=json", "institution": "Tencent Video AI Center (OVB)"}, {"id": 190534, "fullname": "Xi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190534?format=json", "institution": null}, {"id": 127885, "fullname": "Di Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127885?format=json", "institution": "Renmin University of China"}], "abstract": "Complex video reasoning, actually, relies excessively on fine-grained perception rather than on expert (e.g., Ph.D, Science)-level reasoning.Through extensive empirical observation, we have recognized the critical impact of perception.In particular, when perception ability is almost fixed, enhancing reasoning from Qwen3-8B to OpenAI-o3 yields only 0.7% performance improvement. Conversely, even minimal change in perception model scale (from 7B to 32B) boosts performance by 1.4%, indicating enhancing perception, rather than reasoning, is more critical to improve performance.Therefore, exploring how to enhance perception ability through reasoning without the need for expensive fine-grained annotation information is worthwhile.To achieve this goal, we specially propose APPO, the Attention-guided Perception Policy Optimization algorithm that leverages token-level dense rewards to improve model's fine-grained perception.The core idea behind APPO is to optimize those tokens from different responses that primarily focus on the same crucial video frame (called intra-group perception tokens).Experimental results on diverse video benchmarks and models with different scales (3/7B) demonstrate APPO consistently outperforms GRPO and DAPO (0.5% ~ 4%). We hope our work provides a promising approach to effectively enhance model's perception abilities through reasoning in a low-cost manner, serving diverse scenarios and demands.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38729", "url": null, "sourceid": 46295, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40346, "uid": "cc0ec50248642a1df8306d37126a7241", "name": "RefAV: Towards Planning Centric Scenario Mining", "authors": [{"id": 183586, "fullname": "Cainan Davidson", "url": "http://cvpr.thecvf.com/api/miniconf/users/183586?format=json", "institution": "Carnegie Mellon University"}, {"id": 75455, "fullname": "Deva Ramanan", "url": "http://cvpr.thecvf.com/api/miniconf/users/75455?format=json", "institution": "Carnegie Mellon University"}, {"id": 150974, "fullname": "Neehar Peri", "url": "http://cvpr.thecvf.com/api/miniconf/users/150974?format=json", "institution": "Carnegie Mellon University"}], "abstract": "Autonomous Vehicles (AVs) collect and pseudo-label terabytes of multi-modal data localized to HD maps during normal fleet testing. However, identifying interesting and safety-critical scenarios from uncurated driving logs remains a significant challenge. Traditional scenario mining techniques are error-prone and prohibitively time-consuming, often relying on hand-crafted structured queries. In this work, we revisit spatio-temporal scenario mining through the lens of recent vision-language models (VLMs) to detect whether a described scenario occurs in a driving log and, if so, precisely localize it in both time and space. To address this problem, we introduce RefAV, a large-scale dataset of $10,000$ diverse natural language queries that describe complex multi-agent interactions relevant to motion planning derived from $1000$ driving logs in the Argoverse 2 Sensor dataset. We evaluate several referential multi-object trackers and present an empirical analysis of our baselines. Notably, we find that naively repurposing off-the-shelf VLMs yields poor performance, suggesting that scenario mining presents unique challenges. Lastly, we discuss our recently held competition and share insights from the community.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40346", "url": null, "sourceid": -38740, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38786?format=json"], "related_events_ids": [38786]}, {"id": 38732, "uid": "8072e512102b794c08f3479a856c0796", "name": "Self-Evaluation Unlocks Any-Step Text-to-Image Generation", "authors": [{"id": 90670, "fullname": "Xin Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90670?format=json", "institution": "The University of Hong Kong"}, {"id": 88776, "fullname": "Xiaojuan Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/88776?format=json", "institution": "University of Oxford"}, {"id": 154196, "fullname": "Zhengqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/154196?format=json", "institution": "Google"}, {"id": 127646, "fullname": "Kai Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127646?format=json", "institution": "Adobe Systems"}, {"id": 89223, "fullname": "Richard Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89223?format=json", "institution": "Adobe Systems"}, {"id": 85199, "fullname": "Zhe Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/85199?format=json", "institution": "Adobe Research"}, {"id": 75717, "fullname": "Eli Shechtman", "url": "http://cvpr.thecvf.com/api/miniconf/users/75717?format=json", "institution": "Adobe Research, US"}, {"id": 150606, "fullname": "Tianyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150606?format=json", "institution": "Adobe Research"}, {"id": 90709, "fullname": "Yotam Nitzan", "url": "http://cvpr.thecvf.com/api/miniconf/users/90709?format=json", "institution": "Tel Aviv University"}], "abstract": "We introduce the Self-Evaluating Model (Self-E), a novel, from-scratch training approach for text-to-image generation that supports any-step inference. Self-E learns from data similarly to a Flow Matching model, while simultaneously employing a novel self-evaluation mechanism: it evaluates its own generated samples using its current score estimates, effectively serving as a dynamic self-teacher. Unlike traditional diffusion or flow models, it does not rely solely on local supervision, which typically necessitates many inference steps. Unlike distillation-based approaches, it does not require a pretrained teacher. This combination of instantaneous local learning and self-driven global matching bridges the gap between the two paradigms, enabling the training of a high-quality text-to-image model from scratch that excels even at very low step counts. Extensive experiments on large-scale text-to-image benchmarks show that Self-E not only excels in few-step generation, but is also competitive with state-of-the-art Flow Matching models at 50 steps. We further find that its performance improves monotonically as inference steps increase, enabling both ultra-fast few-step generation and high-quality long-trajectory sampling within a single unified model. To our knowledge, Self-E is the first from-scratch, any-step text-to-image model, offering a unified framework for efficient and scalable generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38732", "url": null, "sourceid": 31358, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38736, "uid": "d28b6a4077491078708c95d4e8f35d68", "name": "VMonarch: Efficient Video Diffusion Transformers with Structured Attention", "authors": [{"id": 148643, "fullname": "Cheng Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/148643?format=json", "institution": "nanjing university"}, {"id": 190549, "fullname": "Haoxian Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190549?format=json", "institution": "nanjing university"}, {"id": 129317, "fullname": "Liang Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/129317?format=json", "institution": "Kuaishou Technology"}, {"id": 130772, "fullname": "Qi Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/130772?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 86405, "fullname": "Gangshan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86405?format=json", "institution": "Nanjing University"}, {"id": 87541, "fullname": "Xin Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/87541?format=json", "institution": "Kuaishou"}, {"id": 86063, "fullname": "Limin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86063?format=json", "institution": "Nanjing University"}], "abstract": "The quadratic complexity of the attention mechanism severely limits the context scalability of Video Diffusion Transformers (DiTs). We find that the highly sparse spatio-temporal attention patterns exhibited in Video DiTs can be naturally represented by the Monarch matrix. It is a class of structured matrices with flexible sparsity, enabling sub-quadratic attention via an alternating minimization algorithm. Accordingly, we propose VMonarch, a novel attention mechanism for Video DiTs that enables efficient computation over the dynamic sparse patterns with structured Monarch matrices. First, we adapt spatio-temporal Monarch factorization to explicitly capture the intra-frame and inter-frame correlations of the video data. Second, we introduce a recomputation strategy to mitigate artifacts arising from instabilities during alternating minimization of Monarch matrices. Third, we propose a novel online entropy algorithm fused into FlashAttention, enabling fast Monarch matrix updates for long sequences. Extensive experiments demonstrate that VMonarch achieves comparable or superior generation quality to full attention on VBench after minimal fine-tuning. It overcomes the attention bottleneck in Video DiTs, reduces attention FLOPs by a factor of  $17.5$, and achieves a speedup of over $5\\times$ in attention computation for long videos, surpassing state-of-the-art sparse attention methods at 90\\% sparsity.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38736", "url": null, "sourceid": 38058, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38737, "uid": "98bedaa8625e9046c5fc16b5293181b8", "name": "Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild", "authors": [{"id": 180059, "fullname": "Hao Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/180059?format=json", "institution": "Peking University"}, {"id": 186906, "fullname": "Ye Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186906?format=json", "institution": "Renmin University of China"}, {"id": 186905, "fullname": "Wanpeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186905?format=json", "institution": "Peking University"}, {"id": 186907, "fullname": "Haoqi Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186907?format=json", "institution": "Peking University"}, {"id": 180070, "fullname": "Yicheng Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180070?format=json", "institution": "Peking University"}, {"id": 190550, "fullname": "Haiweng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190550?format=json", "institution": "Peking University"}, {"id": 186908, "fullname": "Sipeng Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186908?format=json", "institution": "BeingBeyond"}, {"id": 87087, "fullname": "Zongqing Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87087?format=json", "institution": "Peking University"}], "abstract": "Despite progress, Vision-Language-Action models (VLAs) are limited by a scarcity of large-scale, diverse robot data. While human manipulation videos offer a rich alternative, existing methods are forced to choose between small, precisely-labeled datasets and vast in-the-wild footage with unreliable hand tracking labels. We present JALA, a pretraining framework that learns Jointly-Aligned Latent Actions. JALA bypasses full visual dynamic reconstructing, instead learns a predictive action embedding aligned with both inverse dynamics and real actions. This yields a transition-aware, behavior-centric latent space for learning from heterogeneous human data. We scale this this approach with UniHand-Mix, a 7.5M video corpus (>2,000 hours) blending labotatory and in-the-wild footage. Experiments demonstrate that JALA generates more realistic hand motions in both controlled and unconstrained scenarios, significantly improves downstream robot manipulation performance in both simulation and real-world tasks. These results indicate that jointly-aligned latent actions offer a scalable pathway for VLA pretraining from human data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38737", "url": null, "sourceid": 41914, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38738, "uid": "29e48b729552621616d15c0a82f58016", "name": "Cross-Modal Guided Visual Synthesis for Data-Efficient Multimodal Depression Recognition", "authors": [{"id": 190551, "fullname": "Shanliang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190551?format=json", "institution": null}, {"id": 190552, "fullname": "wangxiaoxiao wangxiaoxiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190552?format=json", "institution": "Shandong University of Technology"}], "abstract": "The performance of multimodal learning systems, particularly in high-stakes domains like automated depression recognition, is fundamentally constrained by the challenge of learning robust visual representations from limited and complex clinical data. To overcome this, we introduce Cross-Modal Guided Visual Synthesis (CMG-VS), a novel training framework that internally enhances the learning process by synthesizing new, task-relevant visual features. At its core, CMG-VS leverages the rich context from audio and text modalities to guide a conditional generative model. This model learns the intricate mapping from speech and language to visual expression, generating a diverse manifold of plausible visual behaviors to enrich the training distribution. Crucially, this synthesis is not a separate pre-processing step. Through a task-guided joint optimization scheme, the generative process is dynamically steered by the downstream multimodal recognizer's performance. This closed-loop feedback mechanism ensures the synthesized visual features are optimized to be maximally discriminative for the recognition task, rather than merely realistic. Comprehensive experiments on the widely-used DAIC-WOZ and E-DAIC benchmark datasets demonstrate that CMG-VS significantly outperforms existing state-of-the-art methods across all standard regression and classification metrics. Ablation studies further validate that our task-guided synthesis is the key driver of this performance gain, proving its effectiveness as a new paradigm for robust multimodal representation learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38738", "url": null, "sourceid": 32298, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38739, "uid": "735b35808f01b379a564129d77847dfb", "name": "Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism", "authors": [{"id": 180213, "fullname": "Tao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/180213?format=json", "institution": "Xiamen University"}, {"id": 190553, "fullname": "Kun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190553?format=json", "institution": "Xiamen University"}, {"id": 180846, "fullname": "Qiong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180846?format=json", "institution": "Xiamen University"}, {"id": 190554, "fullname": "Xiao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190554?format=json", "institution": "Minjiang University"}, {"id": 190555, "fullname": "Chao Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190555?format=json", "institution": null}, {"id": 76395, "fullname": "Xiaoshuai Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/76395?format=json", "institution": "Xiamen University"}, {"id": 88619, "fullname": "Yiyi Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/88619?format=json", "institution": "Xiamen University"}, {"id": 86308, "fullname": "Rongrong Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/86308?format=json", "institution": "Xiamen University"}], "abstract": "Long video understanding is a key challenge that plagues the advancement of Multimodal Large language Models (MLLMs). In this paper, we study this problem from the perspective of visual memory mechanism, and proposed a novel and training-free approach, termed Flexible Memory (FlexMem).  In principle, FlexMem aims to mimic human behavior of video watching, i.e., continually watching video content and recalling the most relevant memory fragments to answer the question. In this way, FlexMem can help MLLMs achieve video understanding of infinite lengths, unlike previous methods that process all video information at once and have input upper-limit. Concretely, FlexMem first consider the visual KV caches as the memory sources, and realize the effective memory transfer and writing via a dual-pathway compression design.  Afterwards, FlexMem also explores different memory reading strategies for the diverse video understanding tasks, including the popular streaming one.To validate FlexMem, we apply it to two popular video-MLLMs, and conduct extensive experiments on five long video and one streaming video task. The experimental results show that on a single 3090 GPU, our FlexMem can achieve obvious improvements than existing efficient video understanding methods and process more than 1k frames, which also helps the base MLLMs achieve comparable or even better performance than SOTA MLLMs on some benchmarks, e.g. , GPT-4o and Gemini-1.5 Pro. Our code project is given in the supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38739", "url": null, "sourceid": 39826, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38741, "uid": "2ec4276ab0862e9bc0f7dbb4bd260cac", "name": "SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing", "authors": [{"id": 182143, "fullname": "Han Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/182143?format=json", "institution": "Baidu"}, {"id": 180419, "fullname": "Yan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180419?format=json", "institution": "Baidu"}, {"id": 158619, "fullname": "Ruiqi Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/158619?format=json", "institution": "Baidu"}, {"id": 158617, "fullname": "Cong Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/158617?format=json", "institution": "Baidu"}, {"id": 190558, "fullname": "huangjie huangjie", "url": "http://cvpr.thecvf.com/api/miniconf/users/190558?format=json", "institution": null}, {"id": 158620, "fullname": "Zhan Zhenpeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/158620?format=json", "institution": "Baidu"}], "abstract": "Sketch editing is central to digital illustration, yet existing image editing systems struggle to preserve the sparse, style\u2011sensitive structure of line art while supporting both high\u2011level semantic changes and precise local redrawing. We present SketchAssist, an interactive sketch drawing assistant that accelerates creation by unifying instruction\u2011guided global edits with line\u2011guided region redrawing, while keeping unrelated regions and overall composition intact. To enable this assistant at scale, we introduce a controllable data generation pipeline that (i) constructs attribute\u2011addition sequences from attribute\u2011free base sketches, (ii) forms multi\u2011step edit chains via cross\u2011sequence sampling, and (iii) expands stylistic coverage with a style\u2011preserving attribute\u2011removal model applied to diverse sketches. Building on this data, SketchAssist employs a unified sketch editing framework with minimal changes to DiT\u2011based editors. We repurpose the RGB channels to encode the inputs, enabling seamless switching between  instruction\u2011guided edits and line\u2011guided redrawing within a single input interface. To further specialize behavior across modes, we integrate a task\u2011guided mixture\u2011of\u2011experts into LoRA layers, routing by text and visual cues to improve semantic controllability, structural fidelity, and style preservation.Extensive experiments show state\u2011of\u2011the\u2011art results on both tasks, with superior instruction adherence and style/structure preservation compared to recent baselines. Together, our dataset and SketchAssist provide a practical, controllable assistant for sketch creation and revision", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38741", "url": null, "sourceid": 34985, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38743, "uid": "0566a85f6665677610d6af3482ea8dc4", "name": "SJD++: Accelerating Speculative Jacobi Decoding for Text-to-Image Models via Multi-Drafting and Enhanced Rejection Stability", "authors": [{"id": 190560, "fullname": "Jialiang Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190560?format=json", "institution": "Peking University"}, {"id": 189422, "fullname": "Han Shu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189422?format=json", "institution": "Huawei Technologies"}, {"id": 154101, "fullname": "Wenshuo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/154101?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 190561, "fullname": "Yingjie Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/190561?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 154103, "fullname": "Xinghao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/154103?format=json", "institution": "Huawei Noah&#x27;s Ark Lab"}], "abstract": "Speculative Jacobi Decoding (SJD) provides a compelling, draft-model-free approach to accelerating autoregressive text-to-image synthesis. However, the high-entropy nature of image generation leads to low acceptance rates of draft tokens in high-complexity image regions, where rejections occur frequently. This bottleneck restricts SJD\u2019s practical efficiency and limits overall throughput. To address the bottleneck, we introduce SJD++, an enhanced speculative Jacobi decoding framework. First, SJD++ integrates a well-designed multi-drafting strategy to improve local acceptance rates when generating these challenging regions. Furthermore, we propose an adaptive rejection mechanism that enhances sequence stability by continuing validation instead of reverting to full resampling after an initial mismatch. These key optimizations work in tandem to significantly increase the average accepted token length, boosting overall inference speed while strictly preserving the integrity of the target output distribution. Experiments on text-to-image benchmarks demonstrate that SJD++ achieves 3.8\u00d7 acceleration with lossless image quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38743", "url": null, "sourceid": 45957, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38746, "uid": "33c54daf901630e4e088b9986156a597", "name": "SURF: Signature-retained Fast Video Generation", "authors": [{"id": 179952, "fullname": "Kaixin Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/179952?format=json", "institution": "HKU"}, {"id": 88507, "fullname": "Xi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88507?format=json", "institution": "the University of Hong Kong, University of Hong Kong"}, {"id": 190574, "fullname": "Sihui Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/190574?format=json", "institution": "University of Hong Kong"}, {"id": 187741, "fullname": "Yuan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187741?format=json", "institution": "Kuaishou- \u5feb\u624b\u79d1\u6280"}, {"id": 129317, "fullname": "Liang Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/129317?format=json", "institution": "Kuaishou Technology"}, {"id": 87541, "fullname": "Xin Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/87541?format=json", "institution": "Kuaishou"}, {"id": 87814, "fullname": "Hengshuang Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/87814?format=json", "institution": "The University of Hong Kong"}], "abstract": "The demand for high-resolution video generation is growing rapidly. However, the generation resolution is severely constrained by slow inference speeds. For instance, Wan2.1 requires over 50 minutes to generate a single 720p video. While previous works explore accelerating video generation from various aspects, most of them compromise the distinctive signatures (e.g., layout, semantic, motion) of the original model. In this work, we propose **SURF**, an efficient framework for generating high-resolution videos, while maximally keeping the signatures. Specifically, SURF divides video generation into two stages: First, we leverage the pretrained model to infer at optimal resolution and downsample latent to generate low-resolution previews in fast speed; then we design a Refiner to upscale the preview. In the preview stage, we identify that directly inferring a model (trained with higher resolution) on lower resolution causes severe losses in signatures. So we introduce noise reshifting, a training-free technique that mitigates this issue by conducting initial denoising steps on the original resolution and switching to low resolution in later steps. In the refine stage, we establish a mapping relationship between the preview and the high-resolution target, which significantly reduces the denoising steps. We further integrate shifting windows and carefully design the training paradigm to get a powerful and efficient Refiner. In this way, SURF enables generating high-resolution videos efficiently while maximally closer to the signatures of the given pretrained model. SURF is conceptually simple and could serve as a plug-in that is compatible with various base model and acceleration methods. For example, it achieves 12.5\u00d7 speedup for generating 5-second, 16fps, 720p Wan 2.1 videos and 8.7\u00d7 speedup for generating 5-second, 24fps, 720p HunyuanVideo.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38746", "url": null, "sourceid": 42332, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38750, "uid": "16d40a3be745187190a9683cc8a9e1c0", "name": "Endless World: Real-Time 3D-Aware Long Video Generation", "authors": [{"id": 102379, "fullname": "Ke Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/102379?format=json", "institution": "Johns Hopkins University"}, {"id": 91730, "fullname": "Jiacong Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/91730?format=json", "institution": "Johns Hopkins University"}, {"id": 76137, "fullname": "Yiqun Mei", "url": "http://cvpr.thecvf.com/api/miniconf/users/76137?format=json", "institution": "Johns Hopkins University"}, {"id": 87493, "fullname": "Vishal M. Patel", "url": "http://cvpr.thecvf.com/api/miniconf/users/87493?format=json", "institution": "Johns Hopkins University"}], "abstract": "Producing long, coherent video sequences with stable 3D structure remains a major challenge, particularly in streaming scenarios. Motivated by this, we introduce Endless World, a real-time framework for infinite, 3D-consistent video generation. To support infinite video generation, we introduce a conditional autoregressive training strategy that aligns newly generated content with existing video frames. This design preserves long-range dependencies while remaining computationally efficient, enabling real-time inference on a single GPU without additional training overhead. Moreover, our Endless World integrates global 3D-aware attention to provide continuous geometric guidance across time. Our 3D injection mechanism enforces physical plausibility and geometric consistency throughout extended sequences, addressing key challenges in long-horizon and dynamic scene synthesis. Extensive experiments demonstrate that Endless World produces long, stable, and visually coherent videos, achieving competitive or superior performance to existing methods in both visual fidelity and spatial consistency. Our code will be released after review.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38750", "url": null, "sourceid": 46689, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38749, "uid": "3e15bc5fa2dc63605c64a2afbee14366", "name": "Enhancing Vision Language Models for 4D Perception", "authors": [{"id": 92999, "fullname": "Seokju Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/92999?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 158810, "fullname": "Abhishek Badki", "url": "http://cvpr.thecvf.com/api/miniconf/users/158810?format=json", "institution": "NVIDIA"}, {"id": 99355, "fullname": "Hang Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/99355?format=json", "institution": "NVIDIA"}, {"id": 168092, "fullname": "Jindong Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/168092?format=json", "institution": "NVIDIA"}, {"id": 101985, "fullname": "Ziyao Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/101985?format=json", "institution": "Yale University"}, {"id": 153109, "fullname": "Seungryong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/153109?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 76011, "fullname": "Sifei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76011?format=json", "institution": "NVIDIA"}, {"id": 89205, "fullname": "Orazio Gallo", "url": "http://cvpr.thecvf.com/api/miniconf/users/89205?format=json", "institution": "NVIDIA"}], "abstract": "Despite recent advances, Vision Language Models (VLMs) still struggle to grasp the dynamics of the world. We note that the ability to reason about 3D motion, challenging in itself, is further complicated by two factors. First, VLMs observe motion indirectly via its projection on 2D images. Second, existing datasets fail to disentangle object and camera motion. To address these, we present a QA generation pipeline that focuses on motion-related scene understanding. We take particular care of the entanglement of camera and object motion by casting tracking in both the traditional way and in a novel, fixed reference system, dubbed True-Motion Tracking, which provides an intuitive description of motion. From this pipeline, we generate large-scale 400K training samples and a 2.2K-sample benchmark. Training existing models on our dataset yields performance improvements on an external benchmark, validating the effectiveness of our method.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38749", "url": null, "sourceid": 32989, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38753, "uid": "8d3d7e8ca2dc7c98b0effe01f0b1fccb", "name": "DDT: Decoupled Diffusion Transformer", "authors": [{"id": 188174, "fullname": "Shuai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188174?format=json", "institution": "Nanjing University"}, {"id": 190588, "fullname": "Zhi Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/190588?format=json", "institution": "ByteDance Inc."}, {"id": 190589, "fullname": "Weilin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190589?format=json", "institution": "Alibaba Group"}, {"id": 86063, "fullname": "Limin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86063?format=json", "institution": "Nanjing University"}], "abstract": "Diffusion transformers have demonstrated remarkable generation quality, albeit requiring longer training iterations and numerous inference steps. In each denoising step, diffusion transformers encode the noisy inputs to extract the lower-frequency semantic component and then decode the higher frequency with identical modules. This scheme creates an inherent optimization dilemma: encoding low-frequency semantics necessitates reducing high-frequency components, creating tension between semantic encoding and high-frequency decoding. To resolve this challenge, we propose a new \\textbf{\\color{ddt}D}ecoupled \\textbf{\\color{ddt}D}iffusion \\textbf{\\color{ddt}T}ransformer(\\textbf{\\color{ddt}DDT}), with a decoupled design of a dedicated condition encoder for semantic extraction alongside a specialized velocity decoder. Our experiments reveal that a more substantial encoder yields performance improvements as model size increases. For ImageNet $256\\times256$, Our DDT-XL/2 achieves a new state-of-the-art performance of {1.31 FID}~(nearly $4\\times$ faster training convergence compared to previous diffusion transformers). For ImageNet $512\\times512$, Our DDT-XL/2 achieves a new state-of-the-art FID of 1.28. Additionally, as a beneficial by-product, our decoupled architecture enhances inference speed by enabling the sharing self-condition between adjacent denoising steps. To minimize performance degradation, we propose a novel statistical dynamic programming approach to identify optimal sharing strategies.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38753", "url": null, "sourceid": 32706, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38754, "uid": "9407be1bd4c5f4d01dd9b1a971290160", "name": "Incentivizing Generative Zero-Shot Learning via Outcome-Reward Reinforcement Learning with Visual Cues", "authors": [{"id": 126786, "fullname": "Wenjin Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/126786?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 179716, "fullname": "Xiaoxiao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/179716?format=json", "institution": "Stanford University"}, {"id": 163978, "fullname": "Hehe Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/163978?format=json", "institution": "Zhejiang University"}], "abstract": "Recent advances in zero-shot learning (ZSL) have demonstrated the potential of generative models. Typically, generative ZSL synthesizes visual features conditioned on semantic prototypes to model the data distribution of unseen classes, followed by training a classifier on the synthesized data. However, the synthesized features often remain task-agnostic, leading to degraded performance. Moreover, inferring a faithful distribution from semantic prototypes alone is insufficient for classes that are semantically similar but visually distinct. To address these and advance ZSL, we propose RLVC, an outcome-reward reinforcement learning RL framework with visual cues for generative ZSL. At its core, RL empowers the generative model to self-evolve, implicitly enhancing its generation capability. In particular, RLVC updates the generative model using an outcome-based reward, encouraging the synthesis of task-relevant features. Furthermore, we introduce class-wise visual cues that (i) align synthesized features with visual prototypes and (ii) stabilize the RL training updates.  For the training process, we present a novel cold-start strategy. Comprehensive experiments and analyses on three prevalent ZSL benchmarks demonstrate that RLVC achieves state-of-the-art results with a 4.7% gain. All code, models, and complete features will be open-sourced upon publication to accelerate future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38754", "url": null, "sourceid": 37180, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38755, "uid": "bf441b0f92957832f02987c16a5d7f11", "name": "Why Not Hyperparameter-Friendly Optimisation? A Monotonic Adaptive Norm Rescaling Approach For Long-Tailed Recognition", "authors": [{"id": 172578, "fullname": "Shuo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/172578?format=json", "institution": "University of Oxford"}, {"id": 190590, "fullname": "Chenqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190590?format=json", "institution": "University of Oxford"}, {"id": 182788, "fullname": "Tingting Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182788?format=json", "institution": "University of Oxford"}], "abstract": "Long-tailed recognition poses a significant challenge for deep learning. The two-stage decoupling paradigm, which separates representation learning from classifier retraining, offers a promising solution. During the classifier retraining stage, adaptive norm rescaling is a popular technique. It adjusts the per-class weight norms via parameter regularization, which inevitably introduces hyperparameters. However, many studies report that long-tailed recognition is sensitive to these hyperparameters, as their setup significantly impacts performance. In this paper, we first provide a class-conditional distribution perspective to support norm rescaling methods. Furthermore, we propose a simple but effective approach called Self-Adaptive Monotonic Normalization (SAMN). SAMN avoids the need for parameter regularization. It directly enforces monotonicity on per-class weight norms using the Pool Adjacent Violators Algorithm, making the method hyperparameter-friendly. SAMN is a universal strategy that integrates seamlessly with other methods for enhanced performance. Experiments on benchmark datasets demonstrate that our method significantly boosts long-tailed recognition performance, often achieving state-of-the-art results.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38755", "url": null, "sourceid": 32727, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38759, "uid": "70fc8a17ae8984aaa705b62f3e9ef2df", "name": "TIM: Temporal Decoupling with Iterative Mutual-Refinement Model for Longitudinal Radiology Report Generation", "authors": [{"id": 160434, "fullname": "Yiheng Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/160434?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 190602, "fullname": "Yi Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190602?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 175441, "fullname": "Shilong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/175441?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 190603, "fullname": "Xiyan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190603?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 90863, "fullname": "Xin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90863?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Automatic radiology report generation (RRG) aims to translate medical images into diagnostic text, reducing radiologists' workload and standardizing clinical documentation. Nonetheless, existing approaches mainly focus on single-timepoint analysis and fail to capture temporal disease evolution across longitudinal examinations. While recent longitudinal RRG (LRRG) approaches incorporate historical data, they often combine images from different time points within a single representation space, leading to blurred semantics and inconsistent temporal reasoning. In this work, we propose a Temporal Decoupling with Iterative Mutual-Refinement Model (TIM), a two-stage framework that explicitly decouples spatial pathology from temporal progression and iteratively refines reports through mutual feedback. Stage I performs temporal-decoupled representation learning, separating temporal evolution patterns from disease-specific features and generating radiology reports for both prior and current studies. Stage II introduces a mutual report refinement mechanism that identifies diagnostic inconsistencies within prior reports and iteratively rectifies both prior and current reports through error-sensitive feedback. Experiments on the Longitudinal-MIMIC dataset demonstrate that TIM surpasses existing single-image and longitudinal baselines, achieving new state-of-the-art performance across both language and clinical metrics. Code is available in the supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38759", "url": null, "sourceid": 31446, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38765, "uid": "d22164fcc6ce2e97e0fd2f02c953e1a4", "name": "ReFlow: Self-correction Motion Learning for Dynamic Scene Reconstruction", "authors": [{"id": 190619, "fullname": "Yanzhe Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190619?format=json", "institution": "University of Science and Technology of China"}, {"id": 181160, "fullname": "Ruijie Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181160?format=json", "institution": "Nanyang Technological University"}, {"id": 190620, "fullname": "Hanzhi Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190620?format=json", "institution": "University of Science and Technology of China"}, {"id": 189916, "fullname": "Zhuoyuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189916?format=json", "institution": "University of Science and Technology of China"}, {"id": 186540, "fullname": "Jiahao Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186540?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 85977, "fullname": "Tianzhu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85977?format=json", "institution": "University of Science and Technology of China"}], "abstract": "We present ReFlow, a unified framework for monocular dynamic scene reconstruction that learns 3D motion in a novel self-correction manner from raw video. Existing methods often suffer from incomplete scene initialization for dynamic regions, leading to unstable reconstruction and motion estimation, which often resorts to external dense motion guidance such as pre-computed optical flow to further stabilize and constrain the reconstruction of dynamic components. However, this introduces additional complexity and potential error propagation.To address these issues, ReFlow integrates a Complete Canonical Space Construction module for enhanced initialization of both static and dynamic regions, and a Separation-Based Dynamic Scene Modeling module that decouples static and dynamic components for targeted motion supervision.The core of ReFlow is a novel self-correction flow matching mechanism, consisting of Full Flow Matching to align 3D scene flow with time-varying 2D observations, and Camera Flow Matching to enforce multi-view consistency for static objects. Together, these modules enable robust and accurate dynamic scene reconstruction.Extensive experiments across diverse scenarios demonstrate that ReFlow achieves superior reconstruction quality and robustness, establishing a novel self-correction paradigm for monocular 4D reconstruction.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38765", "url": null, "sourceid": 45500, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38767, "uid": "c0ae985fa56f6403000b7b334c0d8181", "name": "SDGS: Spatial Difference Guided Gaussian Splatting for Simultaneous Localization and 3D Reconstruction", "authors": [{"id": 180768, "fullname": "Yijian Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/180768?format=json", "institution": "Tsinghua University"}, {"id": 190624, "fullname": "Mingtao Ou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190624?format=json", "institution": null}, {"id": 180760, "fullname": "Pan Zijian", "url": "http://cvpr.thecvf.com/api/miniconf/users/180760?format=json", "institution": "Tsinghua University"}, {"id": 190625, "fullname": "Xinglong Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/190625?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "3D Gaussian Splatting (3DGS) has recently emerged as a powerful explicit 3D representation, enabling photorealistic and real-time novel view synthesis. However, most 3DGS pipelines still assume precomputed camera poses and offline optimization, which introduces latency and makes them brittle in fast-motion, real-world scenarios. Existing online 3DGS systems mostly fall into two camps: (1) hybrid systems that rely on a separate traditional SLAM system for camera poses and optimize Gaussians decoupled from tracking, increasing system complexity; and (2) purely Gaussian-based systems that estimate poses from dense photometric errors, requiring repeated rendering of a large number of Gaussians and thus incurring high computational cost. Moreover, current online methods are often sensitive to motion blur and high dynamic range scenes, limiting their applicability in practice.We address these limitations with a sparse, edge-guided online 3DGS framework. Our method represents the scene as an edge-aligned sparse Gaussian map and estimates 6-DoF camera poses by aligning rendered 3D edges with observed 2D edges using a distance transform based objective, yielding roughly 2\u00d7 faster per-iteration pose optimization than existing Gaussian-based systems while recovering clear scene geometry. We further leverage a dual-channel hybrid pixel vision sensor that outputs blur-free, high-frame-rate spatial-difference edge signals alongside RGB images, and use these signals both for robust edge-based tracking and for a mutual supervision scheme that mitigates motion blur in dense 3D reconstruction. Our system maintains stable tracking and high-fidelity geometry under extremely high-speed motion, where existing RGB-only methods fail, while remaining compatible with standard RGB cameras and achieving competitive tracking accuracy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38767", "url": null, "sourceid": 41356, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38770, "uid": "b07237be800d559b7ab9eb2398d4c7d5", "name": "Guiding Diffusion Models with Fine-Grained Conditions and Semantics-Preserving Sampling for One-Shot Federated Learning", "authors": [{"id": 181580, "fullname": "Xiaojun Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/181580?format=json", "institution": "School of Computer Science and Engineering, Sun Yat-sen University"}, {"id": 190629, "fullname": "Tianchi Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190629?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 173781, "fullname": "Zhiyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/173781?format=json", "institution": "Sun Yat-sen University"}, {"id": 190630, "fullname": "Chuan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190630?format=json", "institution": "Sun Yat-Sen University"}, {"id": 105581, "fullname": "Zibin Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/105581?format=json", "institution": "Sun Yat-sen University"}], "abstract": "One-shot Federated Learning (OSFL) has emerged as a promising paradigm to mitigate the high communication overhead of traditional federated learning. However, its effectiveness is often hampered by data heterogeneity across client data. While recent methods leverage pre-trained diffusion models to generate data for OSFL, they often struggle with some practical limitations, including a lack of semantic fidelity in capturing the fine-grained characteristics of local data, and insufficient diversity in the generated data, which collectively degrade the performance of the global model. To address these challenges, we propose \\texttt{Espresso}, a novel framework that enhances both the fidelity and diversity of synthetic data in OSFL. \\texttt{Espresso} consists of two main components: (1) \\textbf{Fine-Grained Condition Learning}, which learns fine-grained conditional embeddings to improve semantic fidelity and diversity by modeling intra-category patterns, and (2) \\textbf{Semantics-Preserving Sampling}, which diversifies the generated data by modeling the initial latent noise distribution and applying a self-reflection sampling strategy. Extensive experiments on benchmark datasets demonstrate that \\texttt{Espresso} can improve the semantic fidelity and diversity of the synthetic data, leading to a enhancement in the performance of the global model compared to state-of-the-art OSFL methods under data heterogeneity.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38770", "url": null, "sourceid": 43161, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38773, "uid": "ace27c5277ecc8da47cd53ff5c82cb4f", "name": "WeatherCity: Urban Scene Reconstruction with Controllable Multi-Weather Transformation", "authors": [{"id": 190635, "fullname": "Wenhua Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190635?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 190636, "fullname": "Huai Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190636?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 130610, "fullname": "Zhe Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130610?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 129735, "fullname": "Hesheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129735?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Editable high-fidelity 4D scenes are crucial for autonomous driving, as they can be applied to end-to-end training and closed-loop simulation. However, existing reconstruction methods are primarily limited to replicating observed scenes and lack the capability for diverse weather simulation. While image-level weather editing methods tend to introduce scene artifacts and offer poor controllability over the weather effects. To address these limitations, we propose \\textbf{WeatherCity}, a novel framework for 4D urban scene reconstruction and weather editing. Specifically, we leverage a text-guided image editing model to achieve flexible editing of image weather backgrounds. To tackle the challenge of multi-weather modeling, we introduce a novel weather Gaussian representation based on shared scene features and dedicated weather-specific decoders. This representation is further enhanced with a content consistency optimization, ensuring coherent modeling across different weather conditions. Additionally, we design a physics-driven model that simulates dynamic weather effects through particles and motion patterns.Extensive experiments on multiple datasets and various scenes demonstrate that WeatherCity achieves flexible controllability, high fidelity, and temporal consistency in 4D reconstruction and weather editing. Our framework not only enables fine-grained control over weather conditions (e.g., light rain and heavy snow) but also supports object-level manipulation within the scene.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38773", "url": null, "sourceid": 36927, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38776, "uid": "f65f78d68a2ceb7d945bbf22399e6886", "name": "3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds", "authors": [{"id": 183239, "fullname": "Ryousuke Yamada", "url": "http://cvpr.thecvf.com/api/miniconf/users/183239?format=json", "institution": "National Institute of Advanced Industrial Science and Technology (AIST)"}, {"id": 183359, "fullname": "Kohsuke Ide", "url": "http://cvpr.thecvf.com/api/miniconf/users/183359?format=json", "institution": "National Institute of Advanced Industrial Science and Technology"}, {"id": 99393, "fullname": "Yoshihiro Fukuhara", "url": "http://cvpr.thecvf.com/api/miniconf/users/99393?format=json", "institution": "AIST, National Institute of Advanced Industrial Science and Technology; ExaWizards Inc."}, {"id": 87986, "fullname": "Hirokatsu Kataoka", "url": "http://cvpr.thecvf.com/api/miniconf/users/87986?format=json", "institution": "National Institute of Advanced Industrial Science and Technology (AIST)"}, {"id": 86003, "fullname": "Gilles Puy", "url": "http://cvpr.thecvf.com/api/miniconf/users/86003?format=json", "institution": "valeo.ai"}, {"id": 87148, "fullname": "Andrei Bursuc", "url": "http://cvpr.thecvf.com/api/miniconf/users/87148?format=json", "institution": "valeo.ai"}, {"id": 189862, "fullname": "Yuki M Asano", "url": "http://cvpr.thecvf.com/api/miniconf/users/189862?format=json", "institution": "University of Technology Nuremberg"}], "abstract": "Despite recent progress in 3D self-supervised learning, collecting large-scale 3D scene scans remains expensive and labor-intensive. In this work, we investigate whether 3D representations can be learned from unlabeled videos recorded without any real 3D sensors. We present Laplacian-Aware Multi-level 3D Clustering with Sinkhorn-Knopp (LAM3C), a self-supervised framework that learns from video-generated point clouds from unlabeled videos. We first introduce \\data, a video-generated point cloud dataset constructed by collecting room-walkthrough videos from the web (e.g., real-estate tours) and generating 49,219 scenes using an off-the-shelf feed-forward reconstruction model.  We also propose a noise-regularized loss that stabilizes representation learning by enforcing local geometric smoothness and ensuring feature stability under noisy point clouds. Remarkably, without using any real 3D scans, LAM3C achieves higher performance than the previous self-supervised methods on indoor semantic and instance segmentation. These results suggest that unlabeled videos represent an abundant source of data for 3D self-supervised learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38776", "url": null, "sourceid": 37391, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40345, "uid": "a5d3526558368a8cda8a30a3c35963cb", "name": "Efficiently Reconstructing Dynamic Scenes one D4RT at a Time", "authors": [{"id": 136125, "fullname": "Chuhan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/136125?format=json", "institution": "Google"}, {"id": 182924, "fullname": "Guillaume LE MOING", "url": "http://cvpr.thecvf.com/api/miniconf/users/182924?format=json", "institution": "Google DeepMind"}, {"id": 128673, "fullname": "Skanda Koppula", "url": "http://cvpr.thecvf.com/api/miniconf/users/128673?format=json", "institution": "Google Deepmind"}, {"id": 86793, "fullname": "Ignacio Rocco", "url": "http://cvpr.thecvf.com/api/miniconf/users/86793?format=json", "institution": "Facebook"}, {"id": 150948, "fullname": "Liliane Momeni", "url": "http://cvpr.thecvf.com/api/miniconf/users/150948?format=json", "institution": "Google"}, {"id": 128081, "fullname": "Junyu Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/128081?format=json", "institution": "University of Oxford"}, {"id": 130779, "fullname": "Shuyang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/130779?format=json", "institution": "University of Oxford"}, {"id": 178259, "fullname": "Rahul Sukthankar", "url": "http://cvpr.thecvf.com/api/miniconf/users/178259?format=json", "institution": "Google DeepMind"}, {"id": 190650, "fullname": "Jo\u00eblle Barral", "url": "http://cvpr.thecvf.com/api/miniconf/users/190650?format=json", "institution": "Google"}, {"id": 190651, "fullname": "Raia Hadsell", "url": "http://cvpr.thecvf.com/api/miniconf/users/190651?format=json", "institution": "DeepMind"}, {"id": 190652, "fullname": "Zoubin Ghahramani", "url": "http://cvpr.thecvf.com/api/miniconf/users/190652?format=json", "institution": "Google DeepMind ; University of Cambridge"}, {"id": 75512, "fullname": "Andrew Zisserman", "url": "http://cvpr.thecvf.com/api/miniconf/users/75512?format=json", "institution": "University of Oxford"}, {"id": 190653, "fullname": "Junlin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190653?format=json", "institution": "Google"}, {"id": 90310, "fullname": "Mehdi S. M. Sajjadi", "url": "http://cvpr.thecvf.com/api/miniconf/users/90310?format=json", "institution": "Google"}], "abstract": "Understanding and reconstructing the complex geometry and motion of dynamic 4D scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward network designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our unified decoding interface allows the model to independently and efficiently probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state-of-the-art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40345", "url": null, "sourceid": -32380, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38780?format=json"], "related_events_ids": [38780]}, {"id": 38785, "uid": "733e46b1d36d27ff88a949833bbe10c0", "name": "Beyond Prompt Degradation: Prototype-guided Dual-pool Prompting for Incremental Object Detection", "authors": [{"id": 173772, "fullname": "Yaoteng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/173772?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 155749, "fullname": "Qing Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/155749?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 155750, "fullname": "Junyu Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/155750?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 155751, "fullname": "Qi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155751?format=json", "institution": "Northwestern Polytechnical University"}], "abstract": "Incremental Object Detection (IOD) aims to continuously learn new object categories without forgetting previously learned ones. Recently, prompt-based methods have gained popularity for their replay-free design and parameter efficiency. However, due to prompt coupling and prompt drift, these methods often suffer from prompt degradation during continual adaptation. To address these issues, we propose a novel prompt-decoupled framework called PDP. PDP innovatively designs a dual-pool prompt decoupling paradigm, which consists of a shared pool used to capture task-general knowledge for forward transfer, and a private pool used to learn task-specific discriminative features. This paradigm explicitly separates task-general and task-specific prompts, preventing interference between prompts and mitigating prompt coupling. In addition, to counteract prompt drift resulting from inconsistent supervision where old foreground objects are treated as background in subsequent tasks, PDP introduces a Prototypical Pseudo-Label Generation (PPG) module. PPG can dynamically update the class prototype space during training and use the class prototypes to further filter valuable pseudo-labels, maintaining supervisory signal consistency throughout the incremental process. PDP achieves state-of-the-art performance on MS-COCO (with a 9.2% AP improvement) and PASCAL VOC (with a 3.3% AP improvement) benchmarks, highlighting its potential in balancing stability and plasticity.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38785", "url": null, "sourceid": 31253, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38786, "uid": "cc0ec50248642a1df8306d37126a7241", "name": "RefAV: Towards Planning Centric Scenario Mining", "authors": [{"id": 183586, "fullname": "Cainan Davidson", "url": "http://cvpr.thecvf.com/api/miniconf/users/183586?format=json", "institution": "Carnegie Mellon University"}, {"id": 75455, "fullname": "Deva Ramanan", "url": "http://cvpr.thecvf.com/api/miniconf/users/75455?format=json", "institution": "Carnegie Mellon University"}, {"id": 150974, "fullname": "Neehar Peri", "url": "http://cvpr.thecvf.com/api/miniconf/users/150974?format=json", "institution": "Carnegie Mellon University"}], "abstract": "Autonomous Vehicles (AVs) collect and pseudo-label terabytes of multi-modal data localized to HD maps during normal fleet testing. However, identifying interesting and safety-critical scenarios from uncurated driving logs remains a significant challenge. Traditional scenario mining techniques are error-prone and prohibitively time-consuming, often relying on hand-crafted structured queries. In this work, we revisit spatio-temporal scenario mining through the lens of recent vision-language models (VLMs) to detect whether a described scenario occurs in a driving log and, if so, precisely localize it in both time and space. To address this problem, we introduce RefAV, a large-scale dataset of $10,000$ diverse natural language queries that describe complex multi-agent interactions relevant to motion planning derived from $1000$ driving logs in the Argoverse 2 Sensor dataset. We evaluate several referential multi-object trackers and present an empirical analysis of our baselines. Notably, we find that naively repurposing off-the-shelf VLMs yields poor performance, suggesting that scenario mining presents unique challenges. Lastly, we discuss our recently held competition and share insights from the community.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38786", "url": null, "sourceid": 38740, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40346?format=json"], "related_events_ids": [40346]}, {"id": 38784, "uid": "6d2442b46334b9e98e27f8bf2b98abb4", "name": "RISE: Single Static Radar-based Indoor Scene Understanding", "authors": [{"id": 166392, "fullname": "Kaichen Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/166392?format=json", "institution": "University of Oxford"}, {"id": 181024, "fullname": "Laura Dodds", "url": "http://cvpr.thecvf.com/api/miniconf/users/181024?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 190674, "fullname": "Sayed Afzal", "url": "http://cvpr.thecvf.com/api/miniconf/users/190674?format=json", "institution": null}, {"id": 184575, "fullname": "Fadel Adib", "url": "http://cvpr.thecvf.com/api/miniconf/users/184575?format=json", "institution": "MIT &amp; Cartesian"}], "abstract": "Robust and privacy-preserving indoor scene understanding remains a fundamental open problem.While optical sensors such as RGB and LiDAR offer high spatial fidelity, they suffer from severe occlusions and introduce privacy risks in indoor environments.In contrast, millimeter-wave (mmWave) radar preserves privacy and penetrates obstacles, but its inherently low spatial resolution makes reliable geometric reasoning difficult.We introduce RISE, the first benchmark and system for single-static-radar indoor scene understanding, jointly targeting layout reconstruction and object detection.RISE is built upon the key insight that multipath reflections\u2014traditionally treated as noise\u2014encode rich geometric cues.To exploit this, we propose a Bi-Angular Multipath Enhancement that explicitly models Angle-of-Arrival and Angle-of-Departure to recover secondary (ghost) reflections and reveal invisible structures.On top of these enhanced observations, a simulation-to-reality Hierarchical Diffusion framework transforms fragmented radar responses into complete layouts reconstruction and object detection.Our benchmark contains 50,000 frames collected across 100 real indoor trajectories, forming the first large-scale dataset dedicated to radar-based indoor scene understanding.Extensive experiments show that RISE reduces the Chamfer Distance by 60\\% (down to 16 cm) compared to the state of the art in layout reconstruction, and delivers the first mmWave-based object detection, achieving 58\\% IoU.These results establish RISE as a new foundation for geometry-aware and privacy-preserving indoor scene understanding using a single static radar.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38784", "url": null, "sourceid": 44879, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38790, "uid": "4fec58181bb416f09f8ef0f69433584f", "name": "MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation", "authors": [{"id": 128680, "fullname": "Xun Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128680?format=json", "institution": "Xiamen University"}, {"id": 128671, "fullname": "Shijia Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/128671?format=json", "institution": "Xiamen University"}, {"id": 183664, "fullname": "Yunxiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183664?format=json", "institution": "Nanyang Technological University, Singapore"}, {"id": 181744, "fullname": "Xin Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181744?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 190680, "fullname": "Wanfa Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190680?format=json", "institution": "Xiamen University"}, {"id": 145156, "fullname": "Rongsheng Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/145156?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 107115, "fullname": "Weixin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/107115?format=json", "institution": "Beihang University"}, {"id": 89528, "fullname": "Yunhong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89528?format=json", "institution": "Beihang University"}, {"id": 86653, "fullname": "Chenglu Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/86653?format=json", "institution": "Xiamen University"}], "abstract": "Embodied navigation is a fundamental capability for robotic agents operating. Real-world deployment requires open vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies.To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relational edges with dynamically assigned images. Built on M3DSG, we propose MSGNav, a zero-shot navigation system that includes a Key Subgraph Selection module for efficient reasoning, an Adaptive Vocabulary Update module for open vocabulary support, and a Closed-Loop Reasoning module for accurate exploration reasoning. Additionally, we further identify the \u201clast mile\u201d problem in zero-shot navigation \u2014 determining the feasible target location with a suitable final viewpoint, and propose a Visibility-based Viewpoint Decision module to explicitly resolve it. Comprehensive experimental results demonstrate that MSGNav achieves state-of-the-art performance on GOAT-Bench and HM3D-OVON datasets.  The open-source code will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38790", "url": null, "sourceid": 46390, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38791, "uid": "273ea7552f2fedc728d1462e7791434b", "name": "TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens", "authors": [{"id": 89768, "fullname": "Jiawei Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/89768?format=json", "institution": "NVIDIA"}, {"id": 183772, "fullname": "Michal Tyszkiewicz", "url": "http://cvpr.thecvf.com/api/miniconf/users/183772?format=json", "institution": "NVIDIA"}, {"id": 71129, "fullname": "Jiahui Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/71129?format=json", "institution": "NVIDIA"}, {"id": 86828, "fullname": "\u017dan Goj\u010di\u010d", "url": "http://cvpr.thecvf.com/api/miniconf/users/86828?format=json", "institution": "NVIDIA"}], "abstract": "In this work, we revisit several key design choices of modern Transformer-based approaches for feed-forward 3D Gaussian Splatting (3DGS) prediction. We argue that the common practice of regressing Gaussian means as depths along camera rays is suboptimal, and instead propose to directly regress 3D mean coordinates using only a self-supervised rendering loss.This formulation allows us to move from the standard encoder-only design to an encoder-decoder architecture with learnable Gaussian tokens, thereby _unbinding_ the number of predicted primitives from input image resolution and number of views. Our resulting method, __TokenGS__, demonstrates improved robustness to pose noise and multiview inconsistencies, while naturally supporting efficient test-time optimization in token space without degrading learned priors. TokenGS achieves state-of-the-art feed-forward reconstruction performance on both static and dynamic scenes, producing more regularized geometry and more balanced 3DGS distribution, while seamlessly recovering emergent scene attributes such as static-dynamic decomposition and scene flow.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38791", "url": null, "sourceid": 44909, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38793, "uid": "9e98b948f4f3c15274939173e5c20799", "name": "Anti-I2V: Safeguarding your photos from malicious image-to-video generation", "authors": [{"id": 183859, "fullname": "Hong Duc Vu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183859?format=json", "institution": "Qualcomm AI Research"}, {"id": 190682, "fullname": "Anh Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190682?format=json", "institution": "QualComm"}, {"id": 190683, "fullname": "Chi Tran", "url": "http://cvpr.thecvf.com/api/miniconf/users/190683?format=json", "institution": "Qualcomm Inc, QualComm"}, {"id": 76979, "fullname": "Anh Tran", "url": "http://cvpr.thecvf.com/api/miniconf/users/76979?format=json", "institution": "VinAI Research"}], "abstract": "Advances in diffusion-based video generation models, while significantly improving human animation, poses threats of misuse through the creation of fake videos from a specific person's photo and text prompts. Recent efforts have focused on adversarial attacks that introduce crafted perturbations to protect images from diffusion-based models. However, most existing approaches target image generation, while relatively few explicitly address image-to-video diffusion models (VDMs), and most primarily focus on UNet-based architectures. Hence, their effectiveness against Diffusion Transformer (DiT) models remains largely under-explored, as these models demonstrate improved feature retention, and stronger temporal consistency due to larger capacity and advanced attention mechanisms. In this work, we introduce Anti-I2V, a novel defense against malicious human image-to-video generation, applicable across diverse diffusion backbones. Instead of restricting noise updates to the RGB space, Anti-I2V operates in both the $L$*$a$*$b$* and frequency domains, improving robustness and concentrating on salient pixels. We then identify the network layers that capture the most distinct semantic features during the denoising process to design appropriate training objectives that maximize degradation of temporal coherence and generation fidelity. Through extensive validation, Anti-I2V demonstrates state-of-the-art defense performance against diverse video diffusion models, offering an effective solution to the problem.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38793", "url": null, "sourceid": 41775, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38807, "uid": "5e86cdd17f928cd2a20e2f946dac5131", "name": "Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models", "authors": [{"id": 180351, "fullname": "Boyang Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/180351?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 77213, "fullname": "Liang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/77213?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 190724, "fullname": "Linpeng Linpeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190724?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 189817, "fullname": "Yuhan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189817?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 158795, "fullname": "Xichun Sheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/158795?format=json", "institution": "Macao Polytechnic University"}, {"id": 89584, "fullname": "Chenggang Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/89584?format=json", "institution": "Hangzhou Dianzi University, Tsinghua University"}], "abstract": "Prompt learning has emerged as an efficient alternative to fine-tuning pre-trained vision-language models (VLMs).Despite its promise, current methods still struggle to maintain tail-class discriminability when adapting to class-imbalanced datasets. In this work, we propose cluster-aware neural collapse prompt tuning (CPT), which enhances the discriminability of tail classes in prompt-tuned VLMs without sacrificing their overall generalization.First, we design a cluster-invariant space by mining semantic assignments from the pre-trained VLM and mapping them to prompt-tuned features.This computes cluster-level boundaries and restricts the constraints to local neighborhoods, which reduces interference with the global semantic structure of the pre-trained VLM.Second, we introduce neural-collapse\u2013driven discriminability optimization with three losses: textual Equiangular Tight Frame (ETF) separation loss, class-wise convergence loss, and rotation stabilization loss.These losses work together to shape intra-cluster geometry for better inter-class separation and intra-class alignment.Extensive experiments on 11 diverse datasets demonstrate that CPT outperforms SOTA methods, with stronger performance on long-tail classes and good generalization to unseen classes.We will release all source code.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38807", "url": null, "sourceid": 39756, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38798, "uid": "3648d0f7a719a39f363f6748ecb90918", "name": "Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval", "authors": [{"id": 183869, "fullname": "Yuxin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183869?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 190694, "fullname": "Yinan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190694?format=json", "institution": "City University of Hong Kong; Xi&#x27;an Jiaotong University"}, {"id": 155774, "fullname": "Yuxin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/155774?format=json", "institution": "Tencent ARC Lab"}, {"id": 76750, "fullname": "Ziqi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76750?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 88244, "fullname": "Zongyang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/88244?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 88203, "fullname": "Chunfeng Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88203?format=json", "institution": ", Institute of automation, Chinese academy of science"}, {"id": 84979, "fullname": "Bing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/84979?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 190695, "fullname": "Jun Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190695?format=json", "institution": "Momo Inc."}, {"id": 85031, "fullname": "Weiming Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85031?format=json", "institution": "Institute of automation, Chinese academy of science"}], "abstract": "Composed Image Retrieval (CIR) has demonstrated significant potential by enabling flexible, multimodal queries that combine a reference image and modification text.However, CIR inherently prioritizes semantic matching, struggling to reliably retrieve a user-specified instance across contexts. In practice, emphasizing concrete instance fidelity over broad semantics is often more consequential.In this work, we propose **O**bject-**A**nchored **C**omposed **I**mage **R**etrieval (**OACIR**), a novel fine-grained retrieval task that mandates strict instance-level consistency.To advance research on this task, we construct **OACIRR** (**OACIR** on **R**eal-world images), the first large-scale, multi-domain benchmark comprising over 160K quadruples and four challenging candidate galleries enriched with hard-negative instance distractors.Each quadruple augments the compositional query with a bounding box that visually anchors the object in the reference image, providing a precise and flexible way to ensure instance preservation.To perform the OACIR task, we propose ***AdaFocal***, a framework featuring a Context-Aware Attention Modulator that adaptively intensifies attention within the specified instance region, dynamically balancing focus between the anchored instance and the broader compositional context.Extensive experiments demonstrate that ***AdaFocal*** substantially outperforms existing compositional retrieval models, particularly in maintaining instance-level fidelity, thereby establishing a robust baseline for this challenging task while opening new directions for more flexible, instance-aware retrieval systems.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38798", "url": null, "sourceid": 38041, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38801, "uid": "dec2236203d220c20de58dc2a0040258", "name": "Dejavu: Towards Experience Feedback Learning for Embodied Intelligence", "authors": [{"id": 182734, "fullname": "Shaokai Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182734?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 190705, "fullname": "Yanbiao Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/190705?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 190706, "fullname": "Qiuchang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190706?format=json", "institution": "Shanghai Jiaotong University; Shanghai Jiaotong University"}, {"id": 190707, "fullname": "Zhiyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190707?format=json", "institution": null}, {"id": 190708, "fullname": "Qichen He", "url": "http://cvpr.thecvf.com/api/miniconf/users/190708?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 190709, "fullname": "Wenyuan XIE", "url": "http://cvpr.thecvf.com/api/miniconf/users/190709?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 173940, "fullname": "Guodong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/173940?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 190710, "fullname": "Bayram Bayramli", "url": "http://cvpr.thecvf.com/api/miniconf/users/190710?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 129180, "fullname": "Yue Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/129180?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 86008, "fullname": "Hongtao Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86008?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Embodied agents face a fundamental limitation: once deployed in real-world environments to perform specific tasks, they are unable to acquire additional knowledge to enhance task performance. In this paper, we propose a general post-deployment learning framework Dejavu, which employs an Experience Feedback Network (EFN) and augments the frozen Vision-Language-Action (VLA) policy with retrieved execution memories. EFN identifies contextually prior action experiences and conditions action prediction on this retrieved guidance. We adopt reinforcement learning with semantic similarity rewards to train EFN, ensuring that the predicted actions align with past behaviors under current observations. During deployment, EFN continually enriches its memory with new trajectories, enabling the agent to exhibit \u201clearning from experience\u201d. Experiments across diverse embodied tasks show that EFN improves adaptability, robustness, and success rates over frozen baselines. We provide code and demo in our supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38801", "url": null, "sourceid": 34231, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38804, "uid": "beef9b74a8201c503948ff6b171d73d4", "name": "DC-Merge: Improving Model Merging with Directional Consistency", "authors": [{"id": 181792, "fullname": "Han-Chen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181792?format=json", "institution": "Southeast University"}, {"id": 148713, "fullname": "Zi-Hao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/148713?format=json", "institution": "Southeast University"}, {"id": 190718, "fullname": "Mao-Lin Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/190718?format=json", "institution": "Southeast University; Southeast University"}, {"id": 190719, "fullname": "Shimin Di", "url": "http://cvpr.thecvf.com/api/miniconf/users/190719?format=json", "institution": "Southeast University"}, {"id": 87706, "fullname": "Min-Ling Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87706?format=json", "institution": "Southeast University"}, {"id": 77070, "fullname": "Tong Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/77070?format=json", "institution": "Southeast University"}], "abstract": "Model merging aims to integrate multiple task-adapted models into a unified model that preserves the knowledge of each task.In this paper, we identify that the key to this knowledge retention lies in maintaining the directional consistency of singular spaces between merged multi-task vector and individual task vectors. However, this consistency is frequently compromised by two issues: i) an imbalanced energy distribution within task vectors, where a small fraction of singular values dominate the total energy, leading to the neglect of semantically important but weaker components upon merging, and ii) the geometric inconsistency of task vectors in parameter space, which causes direct merging to distort their underlying directional geometry. To address these challenges, we propose DC-Merge, a method for directional-consistent model merging. It first balances the energy distribution of each task vector by smoothing its singular values, ensuring all knowledge components are adequately represented. These energy-balanced vectors are then projected onto a shared orthogonal subspace to align their directional geometries with minimal reconstruction error. Finally, the aligned vectors are aggregated in the shared orthogonal subspace and projected back to the original parameter space. Extensive experiments on vision and vision-language benchmarks show that DC-Merge consistently achieves state-of-the-art performance in both full fine-tuning and LoRA settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38804", "url": null, "sourceid": 36227, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38813, "uid": "022d682784c20ccc9176ceff49b82c0b", "name": "Scaling Test-Time Robustness of Vision-Language Models via Self-Critical Inference Framework", "authors": [{"id": 180974, "fullname": "Kaihua Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180974?format=json", "institution": "Tongji University"}, {"id": 183917, "fullname": "JIAXIN QI", "url": "http://cvpr.thecvf.com/api/miniconf/users/183917?format=json", "institution": "Computer Network Information Center"}, {"id": 190751, "fullname": "Jinliou Jinliou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190751?format=json", "institution": null}, {"id": 190752, "fullname": "Yuhua Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190752?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 190753, "fullname": "Jianqiang Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190753?format=json", "institution": "Chinese Academy of Sciences"}], "abstract": "The emergence of Large Language Models (LLMs) has driven rapid progress in multi-modal learning, particularly in the development of Large Vision-Language Models (LVLMs). However, existing LVLM training paradigms place excessive reliance on the LLM component, giving rise to two critical robustness challenges: language bias and language sensitivity. To address both issues simultaneously, we propose a novel Self-Critical Inference (SCI) framework that extends Visual Contrastive Decoding by conducting multi-round counterfactual reasoning through both textual and visual perturbations. This process further introduces a new strategy for improving robustness by scaling the number of counterfactual rounds. Moreover, we also observe that failure cases of LVLMs differ significantly across models, indicating that fixed robustness benchmarks may not be able to capture the true reliability of LVLMs. To this end, we propose the Dynamic Robustness Benchmark (DRBench), a model-specific evaluation framework targeting both language bias and sensitivity issues. Extensive experiments show that SCI consistently outperforms baseline methods on DRBench, and that increasing the number of inference rounds further boosts robustness beyond existing single-step counterfactual reasoning methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38813", "url": null, "sourceid": 44147, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38815, "uid": "bcf2db216a3f04c29273c371e2e4bf7d", "name": "Lighting-grounded Video Generation with Renderer-based Agent Reasoning", "authors": [{"id": 101627, "fullname": "Ziqi Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/101627?format=json", "institution": "Chinese Academy of Sciences &amp; Beijing Jiaotong University"}, {"id": 190755, "fullname": "Taoyu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190755?format=json", "institution": "Peking University"}, {"id": 76616, "fullname": "Zheng Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76616?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 89137, "fullname": "Si Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/89137?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 152935, "fullname": "Han Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152935?format=json", "institution": "OpenBayes Information Technology Co., Ltd."}, {"id": 89455, "fullname": "Shuchen Weng", "url": "http://cvpr.thecvf.com/api/miniconf/users/89455?format=json", "institution": "Peking University"}, {"id": 76401, "fullname": "Boxin Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/76401?format=json", "institution": "Peking University"}], "abstract": "Diffusion models have achieved remarkable progress in video generation, but their controllability remains a major limitation. Key scene factors such as layout, lighting, and camera trajectory are often entangled or only weakly modeled, restricting their applicability in domains like filmmaking and virtual production where explicit scene control is essential. We present LiVER, a diffusion-based framework for scene-controllable video generation. To achieve this, we introduce a novel framework that conditions video synthesis on explicit 3D scene properties, supported by a new large-scale dataset with dense annotations of object layout, lighting, and camera parameters. Our method disentangles these properties by rendering control signals from a unified 3D representation. We propose a lightweight conditioning module and a progressive training strategy to integrate these signals into a foundational video diffusion model, ensuring stable convergence and high fidelity. Our framework enables a wide range of applications, including image-to-video and video-to-video synthesis where the underlying 3D scene is fully editable. To further enhance usability, we develop a scene agent that automatically translates high-level user instructions into the required 3D control signals. Experiments show that LiVER achieves state-of-the-art photorealism and temporal consistency while enabling precise, disentangled control over scene factors, setting a new standard for controllable video generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38815", "url": null, "sourceid": 39020, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38816, "uid": "6f1ae1357113c65a43d47829739fe4be", "name": "Hybrid Agents for Image Restoration", "authors": [{"id": 89755, "fullname": "Bingchen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/89755?format=json", "institution": "University of Science and Technology of China"}, {"id": 158204, "fullname": "Xin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/158204?format=json", "institution": "University of Science and Technology of China"}, {"id": 128862, "fullname": "Yiting Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128862?format=json", "institution": "University of Science and Technology of China"}, {"id": 85129, "fullname": "Zhibo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/85129?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Existing Image Restoration (IR) studies typically focus on task-specific or universal modes individually, relying on the mode selection of users and lacking cooperation between multiple task-specific/universal restoration modes. This leads to insufficient interaction for unprofessional users and limits their restoration capability for complicated real-world applications. In this work, we present HybridAgent, intending to incorporate multiple restoration modes into a unified image restoration model and achieve intelligent and efficient user interaction through our proposed hybrid agents. Concretely, we propose the hybrid rule of fast, slow, and feedback restoration agents. Here, the slow restoration agent optimizes the powerful multimodal large language model (MLLM) with our proposed instruction-tuning dataset to identify degradations within images with ambiguous user prompts and invokes proper restoration tools accordingly. The fast restoration agent is designed based on a lightweight large language model (LLM) via in-context learning to understand the user prompts with simple and clear requirements, which can obviate the unnecessary time/resource costs of MLLM. Moreover, we introduce the mixed distortion removal mode for our HybridAgents, which is crucial but not concerned in previous agent-based works. It can effectively prevent the error propagation of step-by-step image restoration and largely improve the efficiency of the agent system. We validate the effectiveness of HybridAgent with both synthetic and real-world IR tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38816", "url": null, "sourceid": 37647, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38817, "uid": "a64419ad5643858004cdd6867f12acfb", "name": "Gaussian Mapping for Evolving Scenes", "authors": [{"id": 127165, "fullname": "Vladimir Yugay", "url": "http://cvpr.thecvf.com/api/miniconf/users/127165?format=json", "institution": "University of Amsterdam"}, {"id": 190756, "fullname": "Thies Kersten", "url": "http://cvpr.thecvf.com/api/miniconf/users/190756?format=json", "institution": "University of Amsterdam"}, {"id": 75802, "fullname": "Luca Carlone", "url": "http://cvpr.thecvf.com/api/miniconf/users/75802?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 153394, "fullname": "Theo Gevers", "url": "http://cvpr.thecvf.com/api/miniconf/users/153394?format=json", "institution": "University of Amsterdam, University of Amsterdam"}, {"id": 88372, "fullname": "Martin R. Oswald", "url": "http://cvpr.thecvf.com/api/miniconf/users/88372?format=json", "institution": "University of Amsterdam"}, {"id": 186631, "fullname": "Lukas Schmid", "url": "http://cvpr.thecvf.com/api/miniconf/users/186631?format=json", "institution": "University of Technology Nuremberg UTN"}], "abstract": "Mapping systems with novel view synthesis (NVS) capabilities are widely used in computer vision, as well as in various applications, including augmented reality, robotics, and autonomous driving. Most notably, 3D Gaussian Splatting-based systems show high NVS performance; however, many current approaches are limited to static scenes. While recent works have begun addressing short-term dynamics (motion within the camera's view), long-term dynamics (the scene evolving through changes out of view) remain less explored.To overcome this limitation, we introduce a dynamic scene adaptation mechanism that continuously updates the 3D representation to reflect the latest changes. In addition, since maintaining geometric and semantic consistency remains challenging due to stale observations disrupting the reconstruction process, we propose a novel keyframe management mechanism that discards outdated observations while preserving as much information as possible. We thoroughly evaluate Gaussian Mapping for Evolving Scenes (\\ours) on both synthetic and real-world datasets, achieving a 29.7\\% improvement in PSNR and a $ \\times3$-improvement in L1 depth error over the most competitive baseline.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38817", "url": null, "sourceid": 34554, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38818, "uid": "b78f81d5bdfed3168c472b237f37b43a", "name": "ReManNet: A Riemannian Manifold Network for Monocular 3D Lane Detection", "authors": [{"id": 101049, "fullname": "Chengzhi Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/101049?format=json", "institution": "Wuhan University"}, {"id": 152831, "fullname": "Bijun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152831?format=json", "institution": "Wuhan University"}], "abstract": "Monocular 3D lane detection remains challenging due to depth ambiguity and weak geometric constraints. Mainstream methods rely on depth guidance, BEV projection, and anchor- or curve-based heads with simplified physical assumptions, remapping high-dimensional image features while only weakly encoding road geometry. Lacking an invariant geometric\u2013topological coupling between lanes and the underlying road surface, 2D-to-3D lifting is ill-posed and brittle, often degenerating into concavities, bulges, and twists. To address this, we propose the Road-Manifold Assumption: the road is a smooth 2D manifold in $\\mathbb{R}^3$, lanes are embedded 1D submanifolds, and sampled lane points are dense observations, coupling metric and topology across surfaces, curves, and samples. Building on this, we propose ReManNet: it first produces initial lane predictions with an image backbone and detection heads, then encodes geometry as Riemannian Gaussian descriptors on the symmetric positive-definite (SPD) manifold, and fuses these descriptors with visual features via a lightweight gate to maintain coherent 3D reasoning. We also propose the 3D Tunnel Lane IoU (3D-TLIoU) loss, a joint point\u2013curve objective that computes slice-wise overlap of tubular neighborhoods along each lane to improve shape-level alignment. Extensive experiments on standard benchmarks demonstrate that ReManNet achieves state-of-the-art (SOTA) or competitive performance, and on OpenLane it improves F1 by +8.2\\% over the baseline and by +1.8\\% over the previous SOTA, with scenario-level F1 gains of up to +6.6\\%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38818", "url": null, "sourceid": 32147, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38819, "uid": "44437f1b3fd90ae1a003583c7d2b853a", "name": "VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension", "authors": [{"id": 128892, "fullname": "Hyejin Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/128892?format=json", "institution": "Pohang University of Science and Technology (POSTECH)"}, {"id": 190757, "fullname": "Junhyuk Kwon", "url": "http://cvpr.thecvf.com/api/miniconf/users/190757?format=json", "institution": "Pohang University of Science and Technology"}, {"id": 87833, "fullname": "Suha Kwak", "url": "http://cvpr.thecvf.com/api/miniconf/users/87833?format=json", "institution": "POSTECH"}, {"id": 128871, "fullname": "Jungseul Ok", "url": "http://cvpr.thecvf.com/api/miniconf/users/128871?format=json", "institution": "POSTECH"}], "abstract": "Referring Expression Comprehension (REC) aims to localize the image region corresponding to a natural-language query. Recent neuro-symbolic REC approaches leverage large language models (LLMs) and vision-language models (VLMs) to perform compositional reasoning, decomposing queries into structured programs and executing them step-by-step. While such approaches achieve interpretable reasoning and strong zero-shot generalization, they assume that intermediate reasoning steps are accurate.However, this assumption causes cascading errors: false detections and invalid relations propagate through the reasoning chain, yielding high-confidence false positives even when no target is present in the image. To address this limitation, we introduce Verification-Integrated Reasoning Operators (VIRO), a neuro-symbolic framework that embeds lightweight operator-level verifiers within reasoning steps.Each operator executes and validates its output, such as object existence or spatial relationship, thereby allowing the system to robustly handle no-target cases when verification conditions are not met.Our framework achieves state-of-the-art performance, reaching 61.1\\% balanced accuracy across target-present and no-target settings, and demonstrates generalization to real-world egocentric data. Furthermore, VIRO shows superior computational efficiency in terms of throughput, high reliability with a program failure rate of less than 0.3\\%, and scalability through decoupled program generation from execution.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38819", "url": null, "sourceid": 30928, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38821, "uid": "78afc2595242c90f511a52ced9dec893", "name": "REArtGS++: Generalizable Articulation Reconstruction with Temporal Geometry Constraint via Planar Gaussian Splatting", "authors": [{"id": 183009, "fullname": "Di Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183009?format=json", "institution": "Hefei Institutes of Physical Science, Chinese Academy of Sciences"}, {"id": 156603, "fullname": "Liu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156603?format=json", "institution": "Hefei University of Technology"}, {"id": 190759, "fullname": "Anran Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190759?format=json", "institution": null}, {"id": 176151, "fullname": "\u7389\u7814 \u5218", "url": "http://cvpr.thecvf.com/api/miniconf/users/176151?format=json", "institution": "Hefei University of Technology"}, {"id": 189321, "fullname": "Qiaojun Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189321?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 190760, "fullname": "Liu Shaofan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190760?format=json", "institution": "Zhejiang University"}, {"id": 190761, "fullname": "Liangtu Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/190761?format=json", "institution": "hefei institute of Chinese academy of scienses"}, {"id": 86440, "fullname": "Cewu Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86440?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Articulated objects are pervasive in daily environments, such as drawers and refrigerators. Towards their part-level surface reconstruction and joint parameter estimation, REArtGS introduces a category-agnostic approach using multi-view RGB images at two different states. However, we observe that REArtGS still struggles with screw-joint or multi-part objects and lacks geometric constraints for unseen states. In this paper, we propose REArtGS++, a novel method towards generalizable articulated object reconstruction with temporal geometry constraint and planar Gaussian splatting. We first model a decoupled screw motion for each joint without type prior, and jointly optimize part-aware Gaussians with joint parameters through part motion blending. To introduce time-continuous geometric constraint for articulated modeling, we encourage Gaussians to be planar and propose a temporally consistent regularization between planar normal and depth through Taylor first-order expansion. Extensive experiments on both synthetic and real-world articulated objects demonstrate our superiority in generalizable part-level surface reconstruction and joint parameter estimation, compared to existing approaches. Codes can be found in the supplementary material and will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38821", "url": null, "sourceid": 40612, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38826, "uid": "4d3d68218084db39aa3c734ead6aaa31", "name": "GaussFusion: Improving 3D Reconstruction in the Wild with Geometry-Informed Video Generator", "authors": [{"id": 107128, "fullname": "Liyuan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107128?format=json", "institution": "Stanford University"}, {"id": 174642, "fullname": "Manjunath Narayana", "url": "http://cvpr.thecvf.com/api/miniconf/users/174642?format=json", "institution": "Zillow group"}, {"id": 190770, "fullname": "Michal Stary", "url": "http://cvpr.thecvf.com/api/miniconf/users/190770?format=json", "institution": "Zillow Group, Inc."}, {"id": 100302, "fullname": "Will Hutchcroft", "url": "http://cvpr.thecvf.com/api/miniconf/users/100302?format=json", "institution": "Zillow"}, {"id": 85845, "fullname": "Gordon Wetzstein", "url": "http://cvpr.thecvf.com/api/miniconf/users/85845?format=json", "institution": "Stanford University"}, {"id": 106415, "fullname": "Iro Armeni", "url": "http://cvpr.thecvf.com/api/miniconf/users/106415?format=json", "institution": "Stanford University"}], "abstract": "We present GaussFusion, a novel approach for improving 3D Gaussian splatting (3DGS) reconstructions in the wild through geometry-informed video generation. GaussFusion mitigates common 3DGS artifacts, including floaters, flickering, and blur caused by camera pose errors, incomplete coverage, and noisy geometry initialization. Unlike prior RGB-based approaches limited to a single reconstruction pipeline, our method introduces a geometry-informed video-to-video generator that refines 3DGS renderings across both optimization-based and feed-forward methods. Given an existing reconstruction, we render a Gaussian primitive video buffer encoding depth, normals, opacity, and covariance, which the generator refines to produce temporally coherent, artifact-free frames. We further introduce an artifact synthesis pipeline that simulates diverse degradation patterns, ensuring robustness and generalization. GaussFusion achieves state-of-the-art performance on novel-view synthesis benchmarks, and an efficient variant runs in real time at 21 FPS while maintaining similar performance, enabling interactive 3D applications. We plan to release our code and model.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38826", "url": null, "sourceid": 38643, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39714, "uid": "ef000db3b0f2a2c40465c2a8464b726b", "name": "Learning to Drive via Real-World Simulation at Scale", "authors": [{"id": 142811, "fullname": "Haochen Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/142811?format=json", "institution": "CASIA, Xiaomi EV"}, {"id": 185064, "fullname": "Tianyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185064?format=json", "institution": "Fudan University"}, {"id": 98926, "fullname": "Haochen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/98926?format=json", "institution": null}, {"id": 90131, "fullname": "Jiazhi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90131?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 192709, "fullname": "Yihang Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192709?format=json", "institution": "University of Hong Kong"}, {"id": 188438, "fullname": "Guang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188438?format=json", "institution": "Xiaomi Corporation"}, {"id": 145782, "fullname": "junli wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/145782?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 192710, "fullname": "Yinfeng Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192710?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 89150, "fullname": "Zhang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89150?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 87288, "fullname": "Liang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87288?format=json", "institution": "CASIA"}, {"id": 184796, "fullname": "Hangjun Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/184796?format=json", "institution": "Xiaomi Corporation"}, {"id": 158846, "fullname": "Long Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/158846?format=json", "institution": "Wayve"}, {"id": 69669, "fullname": "Hongyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/69669?format=json", "institution": "The University of Hong Kong"}], "abstract": "Achieving fully autonomous driving systems requires learning rational decisions in a wide span of scenarios, including safety-critical and out-of-distribution ones. However, such cases are underrepresented in real-world corpus collected by human experts. To complement for the lack of data diversity, we introduce a novel and scalable simulation framework capable of synthesizing these crucial massive unseen states upon existing driving logs. Our pipeline utilizes advanced neural rendering with a reactive environment to generate high-fidelity multi-view observations controlled by ego trajectory perturbations. Furthermore, we develop a pseudo-expert trajectory generation mechanism to provide feasible action supervision for these newly simulated states to provide action supervision.Upon the synthesized data, we find that a simple co-training strategy on both real-world and simulated samples can lead to significant improvements in both robustness and generalization for various planning methods on challenging real-world benchmarks, up to +6.8 EPDMS on navhard and +2.9 on navtest. More importantly, such policy improvement scales smoothly by increasing simulation data only, even without extra real-world data streaming in. We further reveal crucial findings of such a sim-real paradigm, includingthe design of pseudo-experts and the scaling properties for different policy architectures. Simulation data and code would be released.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39714", "url": null, "sourceid": 34548, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40379?format=json"], "related_events_ids": [40379]}, {"id": 38868, "uid": "af1fe82173715488900ae3f31567b4ab", "name": "CrackSSM: Reviving SSMs for Crack Segmentation via Dynamic Scanning", "authors": [{"id": 152090, "fullname": "Yubin Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152090?format=json", "institution": "Xiamen University"}, {"id": 190882, "fullname": "Boyang Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190882?format=json", "institution": "Xiamen University"}, {"id": 157627, "fullname": "Yuan Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/157627?format=json", "institution": "Xiamen University"}, {"id": 190883, "fullname": "Wenting Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/190883?format=json", "institution": "Xiamen University"}, {"id": 131760, "fullname": "Jiayi Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/131760?format=json", "institution": "Xiamen University"}, {"id": 76395, "fullname": "Xiaoshuai Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/76395?format=json", "institution": "Xiamen University"}], "abstract": "Crack segmentation (CS) is crucial for structural inspection and maintenance in production scenarios. To achieve both high accuracy and efficiency, recent methods have adopted Mamba-based architectures built upon state space models (SSMs), which enable linear-complexity modeling of long-range dependencies. However, existing approaches typically rely on static multi-directional scanning to flatten visual features into sequences. This fixed flattening order disrupts spatial continuity and weakens the SSM\u2019s ability to model irregular crack patterns effectively. To address this limitation, we propose \\textbf{CrackSSM}, a novel crack-aware segmentation framework featuring a dynamic scanning strategy that adapts the token sequence to the underlying structure of each image. Specifically, we compute directional response strength along four orientations from high-level semantic features, and use these values to reorder tokens so that crack-relevant regions remain adjacent in sequence. This alignment improves the causal modeling ability of SSMs while preserving their efficiency and better suits the irregular, fine-grained nature of cracks. Additionally, we design a wavelet-guided decoding mechanism to recover detailed features. It incorporates high-frequency components extracted from the input image and applies them to guide feature refinement and edge-aware fusion, further enhancing segmentation precision. Experiments on three benchmark datasets demonstrate that our method achieves superior segmentation accuracy with fewer parameters and faster inference compared to existing state-of-the-art models. Source code is available in supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38868", "url": null, "sourceid": 45258, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38832, "uid": "2281165f5495a1c5b48a776ab6a8c8df", "name": "FilterGS: Traversal-Free Parallel Filtering and Adaptive Shrinkage for Large-scale LoD 3D Gaussian Splatting", "authors": [{"id": 182823, "fullname": "Yixian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182823?format=json", "institution": "Beijing Institute of Technology"}, {"id": 190789, "fullname": "HaoLin Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190789?format=json", "institution": "Beijing Institute of Technology"}, {"id": 140737, "fullname": "Jiadong Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/140737?format=json", "institution": "Beijing Institute of Technology"}, {"id": 155455, "fullname": "Yu Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/155455?format=json", "institution": "Beijing Institute of Technology"}, {"id": 190790, "fullname": "Xihan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190790?format=json", "institution": "Beijing Institute of Technology"}, {"id": 155458, "fullname": "Yufeng Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/155458?format=json", "institution": "Beijing Institute of Technology"}, {"id": 155459, "fullname": "Yi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155459?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "3D Gaussian Splatting has revolutionized neural rendering with real-time performance. However, scaling this approach to large scenes using Level-of-Detail methods faces critical challenges: inefficient serial traversal consuming over 60\\% of rendering time, and redundant Gaussian-tile pairs that incur unnecessary processing overhead. To address these limitations, we propose FilterGS, featuring a parallel filtering mechanism with two complementary filters that enable efficient selection without tree traversal, coupled with a scene-adaptive Gaussian shrinkage strategy that minimizes redundancy through opacity-based scaling. Extensive experiments demonstrate that FilterGS achieves state-of-the-art rendering speeds while maintaining competitive visual quality across multiple large-scale datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38832", "url": null, "sourceid": 34591, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38834, "uid": "13170ed0153d4bc4d2c8bd6016634aaf", "name": "VLM-PTQ: Efficient Post-Training Quantization for Large Vision-Language Models", "authors": [{"id": 180282, "fullname": "Juncan Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180282?format=json", "institution": "Zhejiang University"}, {"id": 190795, "fullname": "Kejie Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190795?format=json", "institution": "Zhejiang University"}], "abstract": "Post-training quantization (PTQ) emerges as a vital technique for efficiently compressing large-scale models, with weight-compensation methods like GPTQ (symmetric calibration) and GPTAQ (asymmetric calibration) showing remarkable success. However, directly applying these methods to Vision-Language Models (VLMs) reveals two critical limitations: 1) their reliance on standard rounding is suboptimal for the asymmetric objective, failing to account for residual-induced shifts in the optimal quantization target; and 2) they uniformly process input channels across modalities, overlooking the distinct information densities of vision and language tokens. In this paper, we introduce VLM-PTQ, a new PTQ asymmetric framework for VLMs. First, we derive a closed-form correction term for the quantization point, which explicitly accounts for the propagated residual and the corresponding inverse Hessian column, yielding a better local optimum than standard rounding. Second, we propose a modality-aware quantization that differentiates channel importance between vision and language tokens, enabling the quantizer to prioritize salient channels through a lightweight fusion coefficient search. Our method extends GPTAQ with minimal overhead while achieving significant performance improvements in low-bit scenarios. Extensive experiments demonstrate that VLM-PTQ achieves state-of-the-art results, effectively compressing models from 1B to 72B parameters on a single GPU.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38834", "url": null, "sourceid": 42718, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38836, "uid": "0623d6fcf18ec961b45184b98f0f6fdb", "name": "fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding", "authors": [{"id": 184272, "fullname": "Yuxiang Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/184272?format=json", "institution": "Georgia Institute of Technology"}, {"id": 190798, "fullname": "Yanteng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190798?format=json", "institution": "Georgia State University"}, {"id": 169031, "fullname": "Xi Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/169031?format=json", "institution": "Oak Ridge National Laboratory; University of Alabama at Birmingham"}, {"id": 190799, "fullname": "Chengxuan Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/190799?format=json", "institution": "Northwestern University; Texas A&amp;M University - College Station"}, {"id": 190800, "fullname": "Tianyang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190800?format=json", "institution": "University of Alabama at Birmingham"}, {"id": 190801, "fullname": "Vince Calhoun", "url": "http://cvpr.thecvf.com/api/miniconf/users/190801?format=json", "institution": "Georgia State University; Georgia Institute of Technology; Emory University"}], "abstract": "Recent advances in multimodal large language models (LLMs) have enabled unified reasoning across images, audio, and video, but extending such capability to brain imaging remains largely unexplored. Bridging this gap is essential to link neural activity with semantic cognition and to develop cross-modal brain representations. To this end, we present fMRI-LM, a foundational model that bridges functional MRI (fMRI) and language through a three-stage framework. In Stage 1, we learn a neural tokenizer that maps fMRI into discrete tokens embedded in a language-consistent space. In Stage 2, a pretrained LLM is adapted to jointly model fMRI tokens and text, treating brain activity as a sequence that can be temporally predicted and linguistically described. To overcome the lack of natural fMRI\u2013text pairs, we construct a large descriptive corpus that translates diverse imaging-based features into structured textual descriptors, capturing the low-level organization of fMRI signals. In Stage 3, we perform multi-task, multi-paradigm instruction tuning to endow fMRI-LM with high-level semantic understanding, supporting diverse downstream applications. Across various benchmarks, fMRI-LM achieves strong zero-shot and few-shot performance, and adapts efficiently with parameter-efficient tuning (LoRA), establishing a scalable pathway toward a language-aligned, universal model for structural and semantic understanding of fMRI.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38836", "url": null, "sourceid": 34063, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38840, "uid": "8557c4a4fd714e9a350d9e287e345e5a", "name": "POUR: A Provably Optimal Method for Unlearning Representation via Neural Collapse", "authors": [{"id": 183425, "fullname": "Anjie Le", "url": "http://cvpr.thecvf.com/api/miniconf/users/183425?format=json", "institution": "University of Oxford"}, {"id": 157200, "fullname": "Can Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/157200?format=json", "institution": "University of Oxford"}, {"id": 182158, "fullname": "Yuyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182158?format=json", "institution": "University of Oxford"}, {"id": 85906, "fullname": "Alison Noble", "url": "http://cvpr.thecvf.com/api/miniconf/users/85906?format=json", "institution": "University of Oxford"}], "abstract": "In computer vision, machine unlearning aims to remove the influence of specific visual concepts or training images without retraining from scratch. Studies show that existing approaches often modify the classifier while leaving internal representations intact, resulting in incomplete forgetting.In this work, we extend the notion of unlearning to the representation level, deriving a three-term interplay between forgetting efficacy, retention fidelity, and class separation. Building on Neural Collapse theory, we show that the orthogonal projection of a simplex Equiangular Tight Frame (ETF) remains an ETF in a lower-dimensional space, yielding a provably optimal forgetting operator.We further introduce the Representation Unlearning Score (RUS) to quantify representation-level forgetting and retention fidelity. Building on this, we introduce POUR (Provably Optimal Unlearning of Representations), a geometric projection method with closed-form (POUR-P) and feature-level unlearning variants under a distillation scheme (POUR-D).Experiments on CIFAR-10/100 and PathMNIST demonstrate that POUR achieves effective unlearning while preserving retained knowledge, outperforming state-of-the-art unlearning methods on both classification-level and representation-level metrics. Code will be released upon acceptance of the paper.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38840", "url": null, "sourceid": 40148, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38844, "uid": "ecc008a2e2e0a62395ada91aaf5d6236", "name": "Reliable Policy Transfer for Safety-Aware End-to-End Driving with Deep Reinforcement Learning", "authors": [{"id": 179886, "fullname": "Md. Borhan Uddin", "url": "http://cvpr.thecvf.com/api/miniconf/users/179886?format=json", "institution": "Shenzhen University"}, {"id": 190819, "fullname": "Arif Raza", "url": "http://cvpr.thecvf.com/api/miniconf/users/190819?format=json", "institution": "Shenzhen University"}, {"id": 190820, "fullname": "Zhiliang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190820?format=json", "institution": "Shenzhen University"}, {"id": 190821, "fullname": "Lu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190821?format=json", "institution": "Shenzhen University"}, {"id": 131479, "fullname": "Jianqiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/131479?format=json", "institution": "Shenzhen University"}, {"id": 131480, "fullname": "Jie Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/131480?format=json", "institution": "Shenzhen University"}], "abstract": "End-to-End (E2E) Reinforcement Learning (RL) for autonomous driving still struggles with safety and generalization under distribution shift, as perception-heavy encoders, sparse rewards, and ad hoc uncertainty handling often yield brittle closed-loop behavior. This work introduces a unified Deep RL (DRL) framework addressing key gaps: causal ego-centric state design, dense differentiable rewards, joint uncertainty estimation with entropy gating, and control-level policy transfer. An ego-centric relational graph encodes agent influence via uncertainty-weighted attention over kinematics, lane geometry, and semantics, producing a compact control state. A multi-objective differentiable reward stabilizes optimization by shaping safety, progress, and comfort with an uncertainty term. Aleatoric and epistemic uncertainty-captured through per-edge heteroscedastic variance and a critic ensemble-are aggregated into a calibrated confidence signal that modulates policy entropy for risk-aware exploration. A causal-semantic transfer objective aligns actions, attention, and uncertainty statistics across domains, combined with meta-learned initialization for few-shot adaptation. In closed-loop urban driving across varied towns, traffic, and weather, the framework improves success rate, reduces infractions per kilometer, and achieves higher time-to-conflict with lower lateral deviation and comfort cost compared to strong baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38844", "url": "https://github.com/szu-ai/safe-driving-drl/", "sourceid": 38759, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38845, "uid": "a2b4d8fee5c06984ec6a9610196140dd", "name": "What Is the Optimal Ranking Score Between Precision and Recall? We Can Always Find It and It Is Rarely $F_1$", "authors": [{"id": 163915, "fullname": "S\u00e9bastien Pi\u00e9rard", "url": "http://cvpr.thecvf.com/api/miniconf/users/163915?format=json", "institution": "University of Li\u00e8ge"}, {"id": 153974, "fullname": "Adrien Deliege", "url": "http://cvpr.thecvf.com/api/miniconf/users/153974?format=json", "institution": "University of Li\u00e8ge"}, {"id": 71630, "fullname": "Marc Van Droogenbroeck", "url": "http://cvpr.thecvf.com/api/miniconf/users/71630?format=json", "institution": "University of Li\u00e8ge"}], "abstract": "Ranking methods or models based on their performance is of prime importance but is tricky because performance is fundamentally multidimensional. In the case of classification, precision and recall are scores with probabilistic interpretations that are both important to consider and complementary. The rankings induced by these two scores are often in partial contradiction. In practice, therefore, it is extremely useful to establish a compromise between the two views to obtain a single, global ranking. Over the last fifty years or so, it has been proposed to take a weighted harmonic mean, known as the F-score, F-measure, or $F_\\beta$. Generally speaking, by averaging basic scores, we obtain a score that is intermediate in terms of values. However, there is no guarantee that these scores lead to meaningful rankings and no guarantee that the rankings are good tradeoffs between these base scores. Given the ubiquity of $F_\\beta$ scores in the literature, some clarification is in order. Concretely: (1) We establish that $F_\\beta$-induced rankings are meaningful and define a shortest path between precision- and recall-induced rankings. (2) We frame the problem of finding a tradeoff between two scores as an optimization problem expressed with Kendall rank correlations. We show that $F_1$ and its skew-insensitive version are far from being optimal in that regard. (3) We provide theoretical tools and a closed-form expression to find the optimal value for $\\beta$ for any distribution or set of performances, and we illustrate their use on six case studies.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38845", "url": null, "sourceid": 42825, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38848, "uid": "f5da97d441cd42b0b26c10145b71a923", "name": "Scaling Multi-Identity Consistency for Image Customization via Multi-to-Multi Matching Paradigm", "authors": [{"id": 181297, "fullname": "Yufeng Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/181297?format=json", "institution": "Beijing Zitiao Network Technology Co., Ltd."}, {"id": 190038, "fullname": "wenxu wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190038?format=json", "institution": null}, {"id": 190036, "fullname": "Shaojin Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190036?format=json", "institution": "ByteDance Inc."}, {"id": 190037, "fullname": "Mengqi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190037?format=json", "institution": "University of Science and Technology of China"}, {"id": 190041, "fullname": "Fei Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/190041?format=json", "institution": "Bytedance"}, {"id": 127533, "fullname": "Qian HE", "url": "http://cvpr.thecvf.com/api/miniconf/users/127533?format=json", "institution": "Institute of Remote Sensing Application, Chinese Academic of Sciences"}], "abstract": "Recent advancements in image customization exhibit a wide range of application prospects due to stronger customization capabilities. However, since we humans are more sensitive to faces, a significant challenge remains in preserving consistent identity while avoiding identity confusion with multi-reference images, limiting the identity scalability of customization models. To address this, we present UMO, a Unified Multi-identity Optimization framework, designed to maintain high-fidelity identity preservation and alleviate identity confusion with scalability. With multi-to-multi matching paradigm, UMO reformulates multi-identity generation as a global assignment optimization problem and unleashes multi-identity consistency for existing image customization methods generally. To facilitate the training of UMO, we develop a customization dataset with multi-reference images, consisting of both synthesised and real parts. Additionally, we propose a new metric to measure identity confusion. Extensive experiments demonstrate that UMO not only improves identity consistency significantly, but also reduces identity confusion on several image customization methods, setting a new state-of-the-art among open-source methods along the dimension of identity preserving.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38848", "url": null, "sourceid": 45703, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38850, "uid": "ee6a59b36c59499b1bf9ee56b6e73ca1", "name": "BinaryAttention: One-Bit Attention for Vision and Diffusion Transformers", "authors": [{"id": 180682, "fullname": "Chaodong XIAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/180682?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 129145, "fullname": "Zhengqiang ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/129145?format=json", "institution": "The Hong Kong Polytechnic University, Hong Kong Polytechnic University"}, {"id": 88043, "fullname": "Lei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88043?format=json", "institution": "The Hong Kong Polytechnic University"}], "abstract": "Transformers have achieved widespread and remarkable success, while the computational complexity of their attention modules remains a major bottleneck for vision tasks. Existing methods mainly employ 8-bit or 4-bit quantization to balance efficiency and accuracy. In this paper, with theoretical justification, we indicate that binarization of attention preserves the essential similarity relationships, and propose BinaryAttention, an effective method for fast and accurate 1-bit attention. Specifically, we retain only the sign of queries and keys in computing the attention, and replace the floating dot products with bit-wise operations, significantly reducing the computational cost. We mitigate the inherent information loss under 1-bit quantization by incorporating a learnable bias, and enable end-to-end acceleration. To maintain the accuracy of attention, we adopt quantization-aware training and self-distillation techniques, mitigating quantization errors while ensuring sign-aligned similarity. BinaryAttention is more than 2$\\times$ faster than FlashAttention2 on A100 GPUs. Extensive experiments on vision transformer and diffusion transformer benchmarks demonstrate that BinaryAttention matches or even exceeds full-precision attention, validating its effectiveness. Our work provides a highly efficient and effective alternative to full-precision attention, pushing the frontier of low-bit transformers for vision tasks. The codes and models will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38850", "url": null, "sourceid": 45076, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38855, "uid": "8cfebc3e119ed2fe0dd172eb59b0a595", "name": "Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors", "authors": [{"id": 182560, "fullname": "Jiatong Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/182560?format=json", "institution": "Adelaide University"}, {"id": 185854, "fullname": "Zicheng Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185854?format=json", "institution": "University of Adelaide"}, {"id": 88134, "fullname": "Anton van den Hengel", "url": "http://cvpr.thecvf.com/api/miniconf/users/88134?format=json", "institution": "University of Adelaide"}, {"id": 77115, "fullname": "Lingqiao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/77115?format=json", "institution": "University of Adelaide"}], "abstract": "Recent progress in 3D generation has been driven largely by models conditioned on images or text, while readily available 3D priors are still underused. In many real-world scenarios, the visible-region point cloud are easy to obtain\u2014from active sensors such as LiDAR or from feed-forward predictors like VGGT\u2014offering explicit geometric constraints that current methods fail to exploit. In this work, we introduce Points-to-3D, a diffusion-based framework that leverages point cloud priors for geometry-controllable 3D asset and scene generation. Built on a latent 3D diffusion model TRELLIS, Points-to-3D first replaces pure-noise sparse structure latent initialization with a point cloud priors tailored input formulation. A structure inpainting network, trained within the TRELLIS framework on task-specific data designed to learn global structural inpainting, is then used for inference with a staged sampling strategy (structural inpainting followed by boundary refinement), completing the global geometry while preserving the visible regions of the input priors.In practice, Points-to-3D can take either accurate point-cloud priors or VGGT-estimated point clouds from single images as input.Experiments on both single-object and multi-object scenarios consistently demonstrate superior performance over state-of-the-art baselines in terms of rendering quality and geometric fidelity, highlighting the effectiveness of explicitly embedding point-cloud priors for achieving more accurate and structurally controllable 3D generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38855", "url": null, "sourceid": 43031, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38856, "uid": "ec42be71515ac87ff31ac02ad01eeefc", "name": "ReasonX: MLLM-Guided Intrinsic Image Decomposition", "authors": [{"id": 180687, "fullname": "Alara Dirik", "url": "http://cvpr.thecvf.com/api/miniconf/users/180687?format=json", "institution": "Imperial College London"}, {"id": 71652, "fullname": "Tuanfeng Y. Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/71652?format=json", "institution": "Adobe Systems"}, {"id": 86074, "fullname": "Duygu Ceylan", "url": "http://cvpr.thecvf.com/api/miniconf/users/86074?format=json", "institution": "Adobe Systems"}, {"id": 86877, "fullname": "Stefanos Zafeiriou", "url": "http://cvpr.thecvf.com/api/miniconf/users/86877?format=json", "institution": "Imperial College London"}, {"id": 92251, "fullname": "Anna Fr\u00fchst\u00fcck", "url": "http://cvpr.thecvf.com/api/miniconf/users/92251?format=json", "institution": "Adobe Research"}], "abstract": "Intrinsic image decomposition aims to separate images into physical components such as albedo, depth, normals, and illumination. While recent diffusion- and transformer-based models benefit from paired supervision from synthetic datasets, their generalization to diverse, real-world scenarios remains challenging. We propose ReasonX, a novel framework that leverages a multimodal large language model (MLLM) as a perceptual judge providing relative intrinsic comparisons, and uses these comparisons as GRPO rewards for fine-tuning intrinsic decomposition models on unlabeled, in-the-wild images. Unlike RL methods for generative models, our framework aligns conditional intrinsic predictors by rewarding agreement between the judge\u2019s relational assessments and analytically derived relations from the model\u2019s outputs. ReasonX is model-agnostic and can be applied to different intrinsic predictors. Across multiple base architectures and modalities, ReasonX yields significant improvements, including 9\u201325\\% WHDR reduction on IIW albedo and up to 46\\% depth accuracy gains on ETH3D, highlighting the promise of MLLM-guided comparative supervision to bridge low- and high-level vision reasoning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38856", "url": null, "sourceid": 43413, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38861, "uid": "cda28f311200863196d587fb262c0ffa", "name": "UI-Lens: Assessing General MLLMs\u2019 Potential to Automate UI Display Quality Assurance", "authors": [{"id": 190856, "fullname": "Wei Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190856?format=json", "institution": "Zhejiang University"}, {"id": 172852, "fullname": "Yexinrui WU", "url": "http://cvpr.thecvf.com/api/miniconf/users/172852?format=json", "institution": "zhejiang university"}, {"id": 190857, "fullname": "Xinli Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190857?format=json", "institution": "Zhejiang University"}, {"id": 175637, "fullname": "Xinran Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/175637?format=json", "institution": "Zhejiang University"}, {"id": 175624, "fullname": "Shi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/175624?format=json", "institution": "Zhejiang University"}], "abstract": "User Interface (UI) display defect detection poses challenges far beyond UI understanding, requiring fine-grained element boundary understanding, missing-content detection, and reasoning about sequential interface semantic consistency. However, the capabilities of multimodal large language models (MLLMs) and vision-language models (VLMs) for detecting UI defects in realistic, complex interfaces have not been systematically validated. To fill this gap, we present UI-Lens, the first multi-dimensional UI display detection benchmark for Chinese-language UI scenarios. The dataset comprises 4,759 pages meticulously annotated by design experts, covering six core display defect categories. We conduct a systematic evaluation of 10 mainstream models (8 closed-source, 2 open-source). Results show clear shortcomings in current models: for tasks requiring fine-grained element boundary understanding, performance is near random, with task-average F1 scores of 20.36% and 31.21% on Text Overflow and Container Overlap, respectively; for sequential interface semantic consistency (e.g., Text Inconsistency), the task-average F1 score is only 10.61%, indicating severe underperformance. We release UI-Lens to catalyze research toward more robust UI display defect detection with fine-grained boundary awareness in realistic, complex interfaces.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38861", "url": null, "sourceid": 36276, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38864, "uid": "d85f7ffd34969c63e0146c6ac8bc518f", "name": "C$^3$R: Cross-Modal Cycle Consistency Rewards Improve Multimodal Reasoning", "authors": [{"id": 190874, "fullname": "Zirui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190874?format=json", "institution": "Rutgers University"}, {"id": 176241, "fullname": "Haoyu Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/176241?format=json", "institution": null}, {"id": 190875, "fullname": "Kexin Pei", "url": "http://cvpr.thecvf.com/api/miniconf/users/190875?format=json", "institution": "The University of Chicago"}, {"id": 87390, "fullname": "Chengzhi Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/87390?format=json", "institution": "Columbia University"}], "abstract": "Multimodal Large Language Models (MLLMs) suffer from a fundamental \"modality gap\", contradicting themselves on visual versus text views of the same content. This paper argues that this inconsistency is not a failure, but a powerful resource for self-reward multimodal learning. Instead of relying on flawed voting mechanisms that amplify systematic errors when the majority is wrong, we introduce cross-modal cycle consistency as rewards C$^3$R to improve multimodal reasoning. C$^3$R performs backward inference from an answer to a query, switches modalities, and performs forward inference to verify the answer's consistency. This cycle serves as a dense, label-free reward that guides the model to resolve its own internal conflicts, while avoiding majority-is-wrong failures of standard voting methods. On standard benchmarks, C$^3$R mitigates modality-specific biases and improves reasoning accuracy by up to 7.6%. Our results show that robust reasoning emerges not just from scaling data, but from achieving a bidirectional understanding of the multimodal world.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38864", "url": null, "sourceid": 43277, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38874, "uid": "14a2750f09d061e1744e376eeaae4608", "name": "SCAPO: Self-Supervised Category-Level Articulated Pose Estimation from a Single 3D Observation", "authors": [{"id": 151652, "fullname": "Can Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151652?format=json", "institution": "National University of Singapore"}, {"id": 86340, "fullname": "Gim Hee Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/86340?format=json", "institution": "National University of Singapore"}], "abstract": "Existing methods for category-level object articulation from a single 3D observation often rely on dense supervision, multi-frame inputs, or CAD templates, and still struggle to disentangle geometry from articulation or to recover explicit joint parameters. We propose SCAPO , a self-supervised framework that estimates canonical geometry, rigid part segmentation, and joint pivots, axes, and articulation states from a single RGB-D observation without ground-truth labels or category-specific models. Our SCAPO first uses an SE(3)-equivariant vector-neuron autoencoder to factor out global pose and align diverse instances into a shared canonical space. On this aligned shape, a joint-aware blend-skinning module is then designed to model part motion. We learn this representation through cycle reconstruction between observed and canonical shapes and cross-space alignment with a learnable canonical template that decouples shared category geometry from instance-specific residual shape. Experiments on synthetic and real articulated-object datasets show that our SCAPO recovers consistent part structure and accurate articulation parameters and outperforms all self-supervised baselines. Our souce-code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38874", "url": null, "sourceid": 42368, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40349, "uid": "bffa0846b1e371410d49aa3c87d57c29", "name": "Fine-grained Image Aesthetic Assessment: Learning Discriminative Scores from Relative Ranks", "authors": [{"id": 143140, "fullname": "Zhichao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/143140?format=json", "institution": "Xidian University"}, {"id": 190897, "fullname": "Jianjie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190897?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 190898, "fullname": "Zhixianhe Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190898?format=json", "institution": "Xidian University"}, {"id": 190899, "fullname": "Pangu Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/190899?format=json", "institution": "Xidian University"}, {"id": 190900, "fullname": "Xiangfei Sheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190900?format=json", "institution": "Xidian University"}, {"id": 190901, "fullname": "Pengfei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190901?format=json", "institution": "Xidian University"}, {"id": 126996, "fullname": "Leida Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/126996?format=json", "institution": "Xidian University"}], "abstract": "Image aesthetic assessment (IAA) has extensive applications in content creation, album management, and recommendation systems, etc. In such applications, it is commonly needed to pick out the most aesthetically pleasing image from a series of images with subtle aesthetic variations, a topic we refer to as fine-grained IAA. Unfortunately, state-of-the-art IAA models are typically designed for coarse-grained evaluation, where images with notable aesthetic differences are evaluated independently on an absolute scale. These models are inherently limited in discriminating fine-grained aesthetic differences. To address the dilemma, we contribute FGAesthetics, a fine-grained IAA database with 32,217 images organized into 10,028 series, which are sourced from diverse categories including Natural, AIGC, and Cropping. Annotations are collected via pairwise comparisons within each series. We also devise Series Refinement and Rank Calibration to ensure the reliability of data and labels. Based on FGAesthetics, we further propose FGAesQ, a novel IAA framework that learns discriminative aesthetic scores from relative ranks through Difference-preserved Tokenization (DiffToken), Comparative Text-assisted Alignment (CTAlign), and Rank-aware Regression (RankReg). FGAesQ enables accurate aesthetic assessment in fine-grained scenarios while still maintains competitive performance in coarse-grained evaluation. Extensive experiments and comparisons demonstrate the superiority of the proposed method. Data and model will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40349", "url": null, "sourceid": -37904, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38875?format=json"], "related_events_ids": [38875]}, {"id": 38880, "uid": "0b61a4e863c0f5e7e20001aea1c33962", "name": "DEVA: Fine-tuning Multimodal Large Language Models for Visual Perception Tasks", "authors": [{"id": 88442, "fullname": "Debasmit Das", "url": "http://cvpr.thecvf.com/api/miniconf/users/88442?format=json", "institution": "Qualcomm Inc."}, {"id": 85738, "fullname": "Munawar Hayat", "url": "http://cvpr.thecvf.com/api/miniconf/users/85738?format=json", "institution": "Monash University"}, {"id": 85634, "fullname": "Fatih Porikli", "url": "http://cvpr.thecvf.com/api/miniconf/users/85634?format=json", "institution": "QualComm"}], "abstract": "Fine-tuning large language models (LLMs) using reinforcement learning (RL) objectives has gained traction, especially in scenarios where labeled data is limited. Building on its success in the language domain, recent efforts have extended RL-based fine-tuning to multimodal tasks. Visual-RFT, for instance, applied Group Relative Policy Optimization (GRPO) to fine-tune multimodal LLMs (MLLMs) across various visual perception benchmarks, achieving notable improvements over standard supervised fine-tuning (SFT). However, its scope was limited by a narrow evaluation of RL adaptation strategies. In this work, we expand the landscape by introducing new RL-based baselines on the same benchmarks and conducting a deeper analysis of GRPO\u2019s training dynamics. We identify key limitations\u2014such as reduced generation diversity, constrained policy exploration, and suboptimal reward formulation and aggregation. To address these, we propose DEVA: a framework that enhances Diversity via a flow-based training objective, encourages broader policy Exploration through global entropic regularization, and leverages alignment Volume as a non-verifiable reward combined with harmonic Aggregation. Applied to GRPO and other RL methods, DEVA delivers consistent gains in both quantitative (+5 to +13 points) and qualitative metrics. We further provide visualizations, ablations, and analyses to unpack the contributions of each component in our framework.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38880", "url": null, "sourceid": 32660, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38881, "uid": "d2328daaa6a0a581028bda7a8d069510", "name": "EfficientMonoHair: Fast Strand-Level Reconstruction from Monocular Video via Multi-View Direction Fusion", "authors": [{"id": 135945, "fullname": "Da Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/135945?format=json", "institution": "King Abdullah University of Science and Technology"}, {"id": 127020, "fullname": "Dominik Engel", "url": "http://cvpr.thecvf.com/api/miniconf/users/127020?format=json", "institution": "Ulm University"}, {"id": 190908, "fullname": "Deng Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/190908?format=json", "institution": "King Abdullah University of Science and Technology"}, {"id": 137436, "fullname": "Ivan Viola", "url": "http://cvpr.thecvf.com/api/miniconf/users/137436?format=json", "institution": "King Abdullah University of Science and Technology"}], "abstract": "Strand-level hair geometry reconstruction is a fundamental problem in virtual human modeling and the digitization of hairstyles. However, existing methods still suffer from a significant trade-off between accuracy and efficiency. Implicit neural representations can capture the global hair shape but often fail to preserve fine-grained strand details, while explicit optimization-based approaches achieve high-fidelity reconstructions at the cost of heavy computation and poor scalability.To address this issue, we propose EfficientMonoHair, a fast and accurate framework that combines the implicit neural network with multi-view geometric fusion for strand-level reconstruction from monocular video. Our method introduces a fusion-patch-based multi-view optimization that reduces the number of optimization iterations for point cloud direction, as well as a novel parallel hair-growing strategy that relaxes voxel occupancy constraints, allowing large-scale strand tracing to remain stable and robust even under inaccurate or noisy orientation fields.Extensive experiments on various real-world hairstyles demonstrate that our method can robustly reconstruct high-fidelity strand geometries with accuracy. On synthetic benchmarks, our method achieves reconstruction quality comparable to state-of-the-art methods, while improving runtime efficiency by nearly an order of magnitude.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38881", "url": null, "sourceid": 40668, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38883, "uid": "9cfe5db52ec297b4edb475fd437b3530", "name": "Natural Human Motion Recovery by Aligning High-Order Temporal Dynamics from Monocular Videos", "authors": [{"id": 180352, "fullname": "Dingkun Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/180352?format=json", "institution": "Zhejiang University"}, {"id": 76241, "fullname": "Zehong Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76241?format=json", "institution": "Zhejiang University"}, {"id": 144016, "fullname": "Yan Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/144016?format=json", "institution": "Zhejiang University"}, {"id": 88128, "fullname": "Yujun Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88128?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 128201, "fullname": "Georgios Pavlakos", "url": "http://cvpr.thecvf.com/api/miniconf/users/128201?format=json", "institution": "University of Texas at Austin"}, {"id": 76363, "fullname": "Xiaowei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/76363?format=json", "institution": "Zhejiang University"}], "abstract": "Human motion recovered from monocular videos often appears overly smooth or dynamically inconsistent, even when joint positions are numerically accurate. We observe that this limitation stems from the absence of reliable high-order temporal cues\u2014velocity and acceleration\u2014which are essential for reconstructing motion that exhibits realistic momentum, timing, and high-frequency detail.We introduce HTD-Refine, a post-processing framework that augments existing Human Motion Recovery (HMR) pipelines using explicitly estimated high-order temporal dynamics. At the core of our system is PVA-Net, a temporal transformer that infers per-joint 2D positions, velocities, and accelerations directly from a monocular video. These predicted dynamics serve as soft yet informative constraints in a global optimization procedure that refines camera-space and world-space trajectories, significantly reducing jitter, suppressing oversmoothing, and restoring physically plausible motion profiles.Extensive experiments on challenging in-the-wild benchmarks show that HTD-Refine consistently improves state-of-the-art HMR methods, yielding more accurate global trajectories and substantially more natural motion dynamics. Our results highlight the critical role of high-order temporal modeling in advancing monocular human motion recovery.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38883", "url": null, "sourceid": 42243, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40351?format=json"], "related_events_ids": [40351]}, {"id": 38882, "uid": "9c88225457856ad8f19a6952c950ab32", "name": "Balanced Hierarchical Contrastive Learning with Decoupled Queries for Fine-grained Object Detection in Remote Sensing Images", "authors": [{"id": 183782, "fullname": "Jingzhou Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/183782?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 166265, "fullname": "Dexin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/166265?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 162015, "fullname": "Fengchao Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/162015?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 148795, "fullname": "Yuntao Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/148795?format=json", "institution": "Zhejiang University"}, {"id": 130298, "fullname": "Liang Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/130298?format=json", "institution": "Nanjing University of Science and Technology"}], "abstract": "Fine-grained remote sensing datasets often use hierarchical label structures to differentiate objects in a coarse-to-fine manner, with each object annotated across multiple levels. However, embedding this semantic hierarchy into the representation learning space to improve fine-grained detection performance remains challenging. Previous studies have applied supervised contrastive learning at different hierarchical levels to group objects under the same parent class while distinguishing sibling subcategories. Nevertheless, they overlook two critical issues: (1) imbalanced data distribution across the label hierarchy causes high-frequency classes to dominate the learning process, and (2) learning semantic relationships among categories interferes with class-agnostic localization. To address these issues, we propose a balanced hierarchical contrastive loss combined with a decoupled learning strategy within the detection transformer (DETR) framework. The proposed loss introduces learnable class prototypes and equilibrates gradients contributed by different classes at each hierarchical level, ensuring that each hierarchical class contributes equally to the loss computation in every mini-batch. The decoupled strategy separates DETR's object queries into classification and localization sets, enabling task-specific feature extraction and optimization. Experiments on three fine-grained datasets with hierarchical annotations demonstrate that our method outperforms state-of-the-art approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38882", "url": null, "sourceid": 31275, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40351, "uid": "9cfe5db52ec297b4edb475fd437b3530", "name": "Natural Human Motion Recovery by Aligning High-Order Temporal Dynamics from Monocular Videos", "authors": [{"id": 180352, "fullname": "Dingkun Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/180352?format=json", "institution": "Zhejiang University"}, {"id": 76241, "fullname": "Zehong Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76241?format=json", "institution": "Zhejiang University"}, {"id": 144016, "fullname": "Yan Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/144016?format=json", "institution": "Zhejiang University"}, {"id": 88128, "fullname": "Yujun Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88128?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 128201, "fullname": "Georgios Pavlakos", "url": "http://cvpr.thecvf.com/api/miniconf/users/128201?format=json", "institution": "University of Texas at Austin"}, {"id": 76363, "fullname": "Xiaowei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/76363?format=json", "institution": "Zhejiang University"}], "abstract": "Human motion recovered from monocular videos often appears overly smooth or dynamically inconsistent, even when joint positions are numerically accurate. We observe that this limitation stems from the absence of reliable high-order temporal cues\u2014velocity and acceleration\u2014which are essential for reconstructing motion that exhibits realistic momentum, timing, and high-frequency detail.We introduce HTD-Refine, a post-processing framework that augments existing Human Motion Recovery (HMR) pipelines using explicitly estimated high-order temporal dynamics. At the core of our system is PVA-Net, a temporal transformer that infers per-joint 2D positions, velocities, and accelerations directly from a monocular video. These predicted dynamics serve as soft yet informative constraints in a global optimization procedure that refines camera-space and world-space trajectories, significantly reducing jitter, suppressing oversmoothing, and restoring physically plausible motion profiles.Extensive experiments on challenging in-the-wild benchmarks show that HTD-Refine consistently improves state-of-the-art HMR methods, yielding more accurate global trajectories and substantially more natural motion dynamics. Our results highlight the critical role of high-order temporal modeling in advancing monocular human motion recovery.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40351", "url": null, "sourceid": -42243, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38883?format=json"], "related_events_ids": [38883]}, {"id": 38885, "uid": "c52caddded0b8beec90aa67a3b812622", "name": "Geometry-driven OOD Detectors Are Class-Incremental Learners", "authors": [{"id": 144914, "fullname": "Wangwang Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/144914?format=json", "institution": "National University of Defense Technology"}, {"id": 154699, "fullname": "Zijian Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/154699?format=json", "institution": "National University of Defense Technology"}, {"id": 190915, "fullname": "Tianjiao Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190915?format=json", "institution": null}, {"id": 190916, "fullname": "Yuan Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190916?format=json", "institution": "National University of Defense Technology"}, {"id": 154703, "fullname": "Yong Dou", "url": "http://cvpr.thecvf.com/api/miniconf/users/154703?format=json", "institution": null}, {"id": 153556, "fullname": "Kele Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153556?format=json", "institution": "National University of Defense Technology"}], "abstract": "Class-Incremental Learning (CIL) seeks to acquire new classes over time without erasing prior knowledge. While recent methods leverage pre-trained models (PTMs) to curb forgetting, they largely optimize the feature extractor and overlook the crucial classification head. In this work, we advance a simple view: if each task is equipped with a classifier that has the ability to both recognize in-distribution (IND) classes and reject out-of-distribution (OOD) inputs, CIL arises naturally\u2014inputs are accepted only by heads that deem them in-distribution and rejected otherwise. Supported by rigorous theoretical and empirical studies, we find that this ability is characterized by Inter-class Separation and Intra-class Compactness; lacking these, standard linear and cosine-similarity heads remain closed-set and fail to yield a usable OOD signal. To address this, we propose GOD (Geometry-driven OOD Detectors), which unifies IND recognition and OOD rejection in a single geometric space by replacing the learnable head with fixed Equiangular Tight Frame (ETF) anchors; an ETF loss enforces inter-class separation, and an ArcFace loss further tightens intra-class compactness. For efficiency, we further introduce a parameter-efficient hybrid architecture and an efficient inference strategy, thus reducing both parameter footprint and inference cost. Extensive experiments on multiple incremental settings and datasets show that GOD achieves state-of-the-art results.[^1]: Code and datasets are available in the supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38885", "url": null, "sourceid": 45705, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38886, "uid": "a81855f5ad4a94d453bb634487082a90", "name": "LogCD: Local-to-global Consistency Distillation for Few-step Image Generation", "authors": [{"id": 130250, "fullname": "Qingsong Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/130250?format=json", "institution": "OPPO"}, {"id": 190917, "fullname": "Zhenyi Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190917?format=json", "institution": "Guangdong OPPO Mobile Telecommunications Corp.,Ltd."}, {"id": 184339, "fullname": "Chen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184339?format=json", "institution": "OPPO AI Center"}, {"id": 129902, "fullname": "Zhijie Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/129902?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 154487, "fullname": "Haonan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154487?format=json", "institution": "OPPO Guangdong Mobile Telecommunications Co., Ltd."}], "abstract": "Distilling latent diffusion models (LDMs)/rectified flow models (RFMs) into ones that are fast to sample from conditions is attracting huge interest. However, the majority of existing methods either need significant training resources or lead to quality degradation, especially in text-image alignment. To address these challenges, we propose Local-to-global Consistency Distillation (LogCD) to accelerate LDMs/RFMs via two-stage distillation.LogCD first performs local consistency distillation and then executes global consistency distillation to ensure the consistency along inference path.Besides, Latent Learned Perceptual Image Patch Similarity model is exploited to enhance perceptual consistency.Notably, LogCD exhibits high flexibility, allowing a single unified model to operate with 2 to 4 sampling steps. The model's performance improves seamlessly as the number of steps increases within this range.With only 70 A100 GPU hours, LogCD accelerates SDXL to achieve a 33.5 CLIP score with just 3 sampling steps, surpassing state-of-the-art accelerated models using even more steps.   FLUX.1-dev accelerated by LogCD  with 4-step sampling presents comparable performance to  25-step teacher model, with CLIP score of 32.6.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38886", "url": null, "sourceid": 44342, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38887, "uid": "b51887d225c4084c4c75f34ae85ff5e8", "name": "EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors", "authors": [{"id": 152346, "fullname": "Luca Bartolomei", "url": "http://cvpr.thecvf.com/api/miniconf/users/152346?format=json", "institution": "University of Bologna"}, {"id": 75726, "fullname": "Fabio Tosi", "url": "http://cvpr.thecvf.com/api/miniconf/users/75726?format=json", "institution": "University of Bologna"}, {"id": 87171, "fullname": "Matteo Poggi", "url": "http://cvpr.thecvf.com/api/miniconf/users/87171?format=json", "institution": "Universit\u00e0 di Bologna"}, {"id": 87188, "fullname": "Stefano Mattoccia", "url": "http://cvpr.thecvf.com/api/miniconf/users/87188?format=json", "institution": "University of Bologna"}, {"id": 73480, "fullname": "Guillermo Gallego", "url": "http://cvpr.thecvf.com/api/miniconf/users/73480?format=json", "institution": "TU Berlin"}], "abstract": "We propose EventHub, a novel framework for training deep-event stereo networks without ground truth annotations from costly active sensors, relying instead on standard color images. From these images, we derive either proxy annotations and proxy events through state-of-the-art novel view synthesis techniques, or simply proxy annotations when images are already paired with event data. Using the training set generated by our data factory, we repurpose state-of-the-art stereo models from RGB literature to process event data, obtaining new event stereo models with unprecedented generalization capabilities.Experiments on widely used event stereo datasets support the effectiveness of EventHub and show how the same data distillation mechanism can improve the accuracy of RGB stereo foundation models in challenging conditions such as nighttime scenes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38887", "url": null, "sourceid": 45543, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38888, "uid": "cd738e1809b4b8548086df18ff54eef7", "name": "CCF: Complementary Collaborative Fusion for Domain Generalized Multi-Modal 3D Object Detection", "authors": [{"id": 181146, "fullname": "Yuchen Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181146?format=json", "institution": "Singapore University of Technology and Design"}, {"id": 127406, "fullname": "Kun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127406?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 131070, "fullname": "Yining Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/131070?format=json", "institution": "Singapore University of Technology and Design"}, {"id": 135934, "fullname": "Na Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/135934?format=json", "institution": "Singapore University of Technology and Design"}], "abstract": "Multi-modal approaches have emerged as a promising paradigm for accurate 3D object detection. However, performance degrades precipitously when deployed in target domains divergent from the training distribution. In this work, we pinpoint two primary culprits: 1) In certain domains, such as nighttime or rainy conditions, one modality experiences significant degradation;2) The LiDAR branch tends to dominate the detection process, resulting in systematic under-exploitation of visual cues and vulnerability when point clouds are compromised. To surmount these impediments, we propose three synergistic innovations. First, Query-Decoupled Loss imparts independent supervisory signals to 2D-only, 3D-only, and fused queries, ensuring equitable gradient propagation and mitigating the image branch's supervisory starvation. Second, LiDAR-Guided Depth Prior augments 2D queries with instance-aware geometric priors via probabilistic depth distribution fusion from the point cloud, enhancing their spatial reasoning. Third, Inconsistent Cross-Modal Masking applies complementary spatial masks to the image and point cloud, simulating modality-specific failures and compelling queries from both modalities to compete within the fused decoder, thereby promoting adaptive fusion and preventing over-reliance on a single sensor.Extensive experiments reveal substantial gains over state-of-the-art baselines, achieving mAP improvements of 2.8, 1.3, and 3.2 on the Rain, Night, and Boston domains, respectively, while preserving competitive source-domain efficacy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38888", "url": null, "sourceid": 42041, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38889, "uid": "af5ac7432f2b60611a2b2081da85bdc0", "name": "DualMirage: Hunting Stealthy Multimodal LLM Agents via CAPTCHAs with Contour and Adversarial Illusions", "authors": [{"id": 180280, "fullname": "Bei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/180280?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 190918, "fullname": "Gaolei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190918?format=json", "institution": null}, {"id": 190919, "fullname": "Jun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190919?format=json", "institution": null}, {"id": 190920, "fullname": "Jianhua Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190920?format=json", "institution": "Shanghai Jiaotong University"}], "abstract": "The rapid advancement of Multimodal Large Language Models (MLLMs) has given rise to sophisticated autonomous agents capable of performing complex, human-like tasks across the web. However, this also introduces significant security risks, particularly from stealthy MLLM agents that can evade conventional detection mechanisms by mimicking human behavior. In this paper, we propose DualMirage, a novel CAPTCHA framework that proactively counters and identifies stealthy agents by exploiting fundamental disparities between human and machine perception. DualMirage employs a dual-pronged strategy: (1) Contour Illusions, which utilize cognitive principles to generate illusory contours that humans perceive effortlessly yet pose interpretation challenges for MLLMs; and (2) Adversarial Illusions, which embed human-imperceptible perturbations optimized to mislead the visual encoders of target MLLMs and thereby elicit characteristic, identifiable model responses. Evaluations on five state-of-the-art MLLMs demonstrate that DualMirage achieves an average 95.8\\% human success rate while blocking MLLM agents (up to 100\\% agent blocking rate), outperforming existing CAPTCHAs. Furthermore, DualMirage induces models to expose identities actively, achieving 58.8\\% white-box and 21.9\\% black-box attack success rates, proving effective against stealthy multimodal agents.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38889", "url": null, "sourceid": 35715, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38890, "uid": "02c1ffbf4378893da347eb8ec50b2456", "name": "LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models", "authors": [{"id": 190921, "fullname": "HyunsooHan HyunsooHan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190921?format=json", "institution": "Ulsan National Institute of Science and Technology"}, {"id": 183136, "fullname": "Sangyeop Yeo", "url": "http://cvpr.thecvf.com/api/miniconf/users/183136?format=json", "institution": "Ulsan National Institute of Science &amp; Technology"}, {"id": 87514, "fullname": "Jaejun Yoo", "url": "http://cvpr.thecvf.com/api/miniconf/users/87514?format=json", "institution": "Ulsan National Institute of Science and Technology"}], "abstract": "We demonstrate that in knowledge distillation for diffusion models, the teacher network\u2019s highly complex denoising process\u2014stemming from its substantially larger capacity\u2014poses a significant challenge for the student model to faithfully mimic. To address this problem, we propose a coarse-to-fine distillation framework with LInear FiTting-based distillation (LIFT) and Piecewise Local Adaptive Coefficient Estimation (PLACE). First, LIFT decomposes the objective into a ``coarse'' alignment and a ``fine'' refinement. The student is then trained on coarse alignment before proceeding to hard refinement. Second, LInear FiTting-based distillation extends LIFT to address spatially non-uniform errors by partitioning outputs into error-based groups, providing locally adaptive guidance.Our comprehensive experimental results demonstrate that ours, \\core~with \\pick, outperforms previous knowledge distillation on diffusion models based on both U-Net and DiT architectures. Furthermore, as compression rates become exceedingly high, conventional knowledge distillation fails to provide sufficient guidance, thereby preventing lightweight diffusion models from achieving stable training. In contrast, our method demonstrates stable convergence even under such extreme compression ratios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38890", "url": null, "sourceid": 40619, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38893, "uid": "3d015642567b62204c8bce00b2b1d60c", "name": "FaceDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment", "authors": [{"id": 153269, "fullname": "Chaonan Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/153269?format=json", "institution": "Institute for Intelligent Computing\uff0cAlibaba Group"}, {"id": 153268, "fullname": "Jinwei Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/153268?format=json", "institution": "Institute for Intelligent Computing, Alibaba Group"}, {"id": 90173, "fullname": "Sheng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90173?format=json", "institution": "Beihang University"}, {"id": 144115, "fullname": "Peng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/144115?format=json", "institution": "Alibaba Group"}, {"id": 89581, "fullname": "Bang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89581?format=json", "institution": "Alibaba Group"}], "abstract": "Existing facial reenactment methods struggle with a trade-off between expressiveness and fine-grained controllability. Holistic facial reenactment models often sacrifice granular control for expressiveness, while methods designed for control may struggle with fidelity and robust disentanglement. Instead of treating facial motion as a monolithic signal, we explore an alternative compositional perspective. In this paper, we introduce FaceDirector, a novel framework that reframes face reenactment as a hierarchical composition task, achieving high-fidelity and controllable results. We employ a Hierarchical Motion Disentanglement and Composition strategy, deconstructing facial motion into a Spatial Layer for physical movements and a Semantic Layer for emotional content. The Spatial Layer comprises: (i) global head pose, managed via a dedicated representation and injection pathway; (ii) spatially separated local facial expressions, distilled from cropped facial regions and purged of emotional cues via Emotion-Filtering Module leveraging an information bottleneck. The Semantic Layer contains a derived global emotion. The disentangled components are then recomposed into an expressive motion latent.Furthermore, we engineer the framework for real-time performance through a suite of optimizations, including diffusion distillation, causal attention and VAE acceleration. FaceDirector achieves streaming, high-fidelity, controllable 512x512 face reenactment at 20 FPS with a end-to-end 800 ms latency on a single 5090 GPU.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38893", "url": null, "sourceid": 35605, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38894, "uid": "29b9a3c3866896d0f59400dd232a30f2", "name": "U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation", "authors": [{"id": 104146, "fullname": "xiang deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/104146?format=json", "institution": "tsinghua university"}, {"id": 190925, "fullname": "Feng Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190925?format=json", "institution": "Meituan"}, {"id": 85474, "fullname": "Yong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85474?format=json", "institution": "Tencent AI Lab"}, {"id": 151732, "fullname": "Youxin Pang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151732?format=json", "institution": "THU"}, {"id": 129603, "fullname": "Xu Xiaoming", "url": "http://cvpr.thecvf.com/api/miniconf/users/129603?format=json", "institution": "meituan"}, {"id": 189866, "fullname": "Zhuoliang Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189866?format=json", "institution": null}, {"id": 84905, "fullname": "Xiaoming Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/84905?format=json", "institution": "Meituan"}, {"id": 75944, "fullname": "Yebin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75944?format=json", "institution": "Tsinghua University"}], "abstract": "Full-stack multimodal interaction in real-time is a central goal in building intelligent embodied agents capable of natural, dynamic communication. However, existing systems are either limited to unimodal generation or suffer from degraded reasoning and poor cross-modal alignment, preventing coherent and perceptually grounded interactions. In this work, we introduce \\textbf{U-Mind}, the first unified system for high-intelligence multimodal dialogue that supports real-time generation and jointly models language, speech, motion, and video synthesis within a single interactive loop.At its core, U-Mind implements a \\textit{Unified Alignment and Reasoning Framework} that addresses two key challenges: enhancing cross-modal synchronization via a \\textit{segment-wise alignment strategy}, and preserving reasoning abilities through \\textit{Rehearsal-Driven Learning}. During inference, U-Mind adopts a \\textit{text-first decoding pipeline} that performs internal chain-of-thought planning followed by temporally synchronized generation across modalities. To close the loop, we implement a real-time video rendering framework conditioned on pose and speech, enabling expressive and synchronized visual feedback.Extensive experiments demonstrate that U-Mind achieves state-of-the-art performance on a range of multimodal interaction tasks, including question answering, instruction following, and motion generation, paving the way toward intelligent, immersive conversational agents.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38894", "url": null, "sourceid": 43767, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38899, "uid": "987d2f8de201e03eaf666747dafbc659", "name": "SeD-UD: An Influence-Driven and Hierarchically-Decoupled Information Bottleneck for Multimodal Intent Recognition", "authors": [{"id": 181843, "fullname": "Qin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181843?format=json", "institution": "Hunan Technology and Business University"}, {"id": 190932, "fullname": "Wenbo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190932?format=json", "institution": "Hunan University Of Technology and Business"}, {"id": 190933, "fullname": "Limei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190933?format=json", "institution": "Hunan University Of Technology and Business"}, {"id": 148175, "fullname": "Han Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/148175?format=json", "institution": "hutb"}, {"id": 190934, "fullname": "Junfeng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190934?format=json", "institution": "Hunan University Of Technology and Business"}, {"id": 190935, "fullname": "Guanying Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190935?format=json", "institution": null}], "abstract": "Multimodal intent recognition (MIR) is hindered by substantial redundancy and noise originating from text, speech, and visual inputs, which weakens feature distinctiveness and ultimately harms recognition performance. Although recent approaches based on the information bottleneck (IB) principle mitigate this issue via feature compression and reconstruction to obtain compact and noise-reduced representations, they still encounter two major drawbacks. First, conventional IB employs a fixed bottleneck dimension, making it unable to accommodate sample-dependent variations in redundancy and noise. Second, simultaneously handling redundancy and noise within a single compression process leads to incomplete feature purification. In this paper, we propose a novel framework named SeD-UD, which incorporates influence-driven input-adaptive bottleneck (IDAB) modules following a hierarchically-decoupled IB strategy. Given a redundancy/noise influence factor, IDAB dynamically adjusts dimensions and selects the optimal parameters for compression and reconstruction, thereby achieving the best trade-off between information preservation and interference suppression. The IB strategy performs hierarchically-decoupled processing of redundancy and noise via separated de-redundancy and unified denoising based on IDAB modules. Extensive experiments on benchmark datasets show SeD-UD outperforms current state-of-the-art models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38899", "url": null, "sourceid": 44937, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38904, "uid": "d0ed3ce01aa99d5f4101b196e85fba3a", "name": "Render-to-Adapt: Unsupervised Personal Adaptation for Gaze Estimation", "authors": [{"id": 175149, "fullname": "Yangshi Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/175149?format=json", "institution": "Beihang University"}, {"id": 190944, "fullname": "Zheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190944?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 128288, "fullname": "Feng Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128288?format=json", "institution": "Beihang University, Tsinghua University"}], "abstract": "Deep learning-based gaze estimation methods tend to suffer from substantial performance drop in real-world scenarios with varying users and environments. To tackle this issue, most recent approaches employ Unsupervised Domain Adaptation (UDA) to bridge the gap between source and target domains. However, this paradigm is misaligned with real-world scenarios, where the system typically needs to adapt to only a single new user. Therefore, this paper advocates a more practical paradigm: Unsupervised Personal Adaptation (UPA), which calibrates a pre-trained model using a few unlabeled images from a single new user. Conventional UDA methods do not guarantee improvements for every user and often yield lower average performance in this setting. To address this problem, we propose Render-to-Adapt (R2A), a self-supervised framework specifically designed for the UPA task. Given a pretrained gaze model, R2A utilizes a gaze-conditioned renderer to synthesize new images based on the model's gaze predictions, and enforces eye-region consistency as a label-free signal to enhance personalized gaze estimation. We evaluate R2A on a re-designed cross-dataset personal adaptation benchmark. Experimental results show that R2A consistently improves performance across all individuals and significantly outperforms existing SOTA methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38904", "url": null, "sourceid": 44688, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38907, "uid": "74aae646b0b4e8c286a3aa6c17d5387f", "name": "Video Panels for Long Video Understanding", "authors": [{"id": 181656, "fullname": "Lars Doorenbos", "url": "http://cvpr.thecvf.com/api/miniconf/users/181656?format=json", "institution": "University of Bonn - Institute of Computer Science"}, {"id": 144290, "fullname": "Federico Spurio", "url": "http://cvpr.thecvf.com/api/miniconf/users/144290?format=json", "institution": "University of Bonn"}, {"id": 75807, "fullname": "J\u00fcrgen Gall", "url": "http://cvpr.thecvf.com/api/miniconf/users/75807?format=json", "institution": "University of Bonn"}], "abstract": "Recent Video-Language Models (VLMs) achieve promising results on long-video understanding, but their performance still lags behind that achieved on tasks involving images or short videos. This has led to great interest in improving the long context modeling of VLMs by introducing novel modules and additional complexity. In this paper, we take a different approach: rather than fine-tuning VLMs with the limited data available, we attempt to maximize the performance of existing models.To this end, we propose a novel visual prompting strategy specifically designed for long-video understanding. By combining multiple frames as panels into one image, we effectively trade off spatial details for temporal resolution.Our approach is training-free, parameter-free, and model-agnostic, and can be seamlessly integrated into existing VLMs.Extensive experiments on five established benchmarks across a wide range of model architectures, sizes, and context windows confirm the consistency of our approach. For the TimeScope (Long) dataset, which has the longest videos, the accuracy for video question answering is improved by up to 19.4\\%. Overall, our method raises the bar for long video understanding models.   We will make our code available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38907", "url": null, "sourceid": 35991, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38908, "uid": "5e9581188f5fccbff6b350460f621d00", "name": "Phrase-Grounding-Aware Supervised Fine-Tuning for Chart Recognition via Side-Masked Attention", "authors": [{"id": 182483, "fullname": "Koichiro Ito", "url": "http://cvpr.thecvf.com/api/miniconf/users/182483?format=json", "institution": "Japan Aerospace Exploration Agency"}], "abstract": "Recent advances in chart recognition have been driven by the supervised fine-tuning (SFT) of vision-language models (VLMs), which unify multiple related tasks, and by diversifying training corpora. In parallel, research on leveraging large language models (LLMs) for object detection has shown that jointly training phrase grounding alongside SFT enhances a model\u2019s generative capabilities.Inspired by this, we hypothesize that chart recognition can also benefit from phrase grounding, which aligns textual phrases with chart regions\u2014a setting that remains underexplored due to the lack of corresponding datasets.In this work, we introduce a phrase-grounding-aware SFT via a Side-Masked Attention Module (SMAM), which is inserted into each transformer layer of the LLM. SMAM performs masked attention within the annotated region\u2014aligned with the corresponding phrase\u2014to produce an additional logit. We supervise this logit and use it as a reference to guide the LLM\u2019s output prediction during fine-tuning, alongside the standard SFT objective. To enable this approach, we also develop an automated pipeline for generating phrase-to-region alignments, which augments existing datasets. Experiments show that our method effectively incorporates phrase grounding into chart recognition via VLM fine-tuning. Code and datasets will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38908", "url": null, "sourceid": 36258, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38909, "uid": "21ac90c376d8daa6ca4252a894ee1361", "name": "SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models", "authors": [{"id": 126306, "fullname": "Jiwoo Chung", "url": "http://cvpr.thecvf.com/api/miniconf/users/126306?format=json", "institution": "Sungkyunkwan University"}, {"id": 87830, "fullname": "Sangeek Hyun", "url": "http://cvpr.thecvf.com/api/miniconf/users/87830?format=json", "institution": "Sungkyunkwan University"}, {"id": 91826, "fullname": "MinKyu Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/91826?format=json", "institution": "Sungkyunkwan University"}, {"id": 190949, "fullname": "Byeongju Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/190949?format=json", "institution": "NAVER Cloud"}, {"id": 153108, "fullname": "Geonho Cha", "url": "http://cvpr.thecvf.com/api/miniconf/users/153108?format=json", "institution": "NAVER"}, {"id": 91073, "fullname": "Dongyoon Wee", "url": "http://cvpr.thecvf.com/api/miniconf/users/91073?format=json", "institution": "NAVER"}, {"id": 148698, "fullname": "Youngjun Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/148698?format=json", "institution": "NAVER"}, {"id": 87383, "fullname": "Jae-Pil Heo", "url": "http://cvpr.thecvf.com/api/miniconf/users/87383?format=json", "institution": "Sungkyunkwan University"}], "abstract": "Diffusion models are a strong backbone for visual generation, but their inherently sequential denoising process leads to slow inference. Previous methods accelerate sampling by caching and reusing intermediate outputs based on feature distances between adjacent timesteps. However, existing caching strategies typically rely on raw feature differences that entangle content and noise. This design overlooks spectral evolution, where low-frequency structure appears early and high-frequency detail is refined later. We introduce Spectral-Evolution-Aware Cache (SeaCache), a training-free cache schedule that bases reuse decisions on a spectrally aligned representation. Through theoretical and empirical analysis, we derive a Spectral-Evolution-Aware (SEA) filter that preserves content-relevant components while suppressing noise. Employing SEA-filtered input features to estimate redundancy leads to dynamic schedules that adapt to content while respecting the spectral priors of the underlying diffusion model. Extensive experiments on diverse visual generative models and the baselines show that SeaCache achieves state-of-the-art latency-quality trade-offs.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38909", "url": null, "sourceid": 41447, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40353?format=json"], "related_events_ids": [40353]}, {"id": 40353, "uid": "21ac90c376d8daa6ca4252a894ee1361", "name": "SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models", "authors": [{"id": 126306, "fullname": "Jiwoo Chung", "url": "http://cvpr.thecvf.com/api/miniconf/users/126306?format=json", "institution": "Sungkyunkwan University"}, {"id": 87830, "fullname": "Sangeek Hyun", "url": "http://cvpr.thecvf.com/api/miniconf/users/87830?format=json", "institution": "Sungkyunkwan University"}, {"id": 91826, "fullname": "MinKyu Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/91826?format=json", "institution": "Sungkyunkwan University"}, {"id": 190949, "fullname": "Byeongju Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/190949?format=json", "institution": "NAVER Cloud"}, {"id": 153108, "fullname": "Geonho Cha", "url": "http://cvpr.thecvf.com/api/miniconf/users/153108?format=json", "institution": "NAVER"}, {"id": 91073, "fullname": "Dongyoon Wee", "url": "http://cvpr.thecvf.com/api/miniconf/users/91073?format=json", "institution": "NAVER"}, {"id": 148698, "fullname": "Youngjun Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/148698?format=json", "institution": "NAVER"}, {"id": 87383, "fullname": "Jae-Pil Heo", "url": "http://cvpr.thecvf.com/api/miniconf/users/87383?format=json", "institution": "Sungkyunkwan University"}], "abstract": "Diffusion models are a strong backbone for visual generation, but their inherently sequential denoising process leads to slow inference. Previous methods accelerate sampling by caching and reusing intermediate outputs based on feature distances between adjacent timesteps. However, existing caching strategies typically rely on raw feature differences that entangle content and noise. This design overlooks spectral evolution, where low-frequency structure appears early and high-frequency detail is refined later. We introduce Spectral-Evolution-Aware Cache (SeaCache), a training-free cache schedule that bases reuse decisions on a spectrally aligned representation. Through theoretical and empirical analysis, we derive a Spectral-Evolution-Aware (SEA) filter that preserves content-relevant components while suppressing noise. Employing SEA-filtered input features to estimate redundancy leads to dynamic schedules that adapt to content while respecting the spectral priors of the underlying diffusion model. Extensive experiments on diverse visual generative models and the baselines show that SeaCache achieves state-of-the-art latency-quality trade-offs.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40353", "url": null, "sourceid": -41447, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38909?format=json"], "related_events_ids": [38909]}, {"id": 38910, "uid": "03ff5913b0517be4231fee8f421f2699", "name": "Unlocking Pre-trained Weights: Parameter Inheritance for Zero-Shot Initialization", "authors": [{"id": 173505, "fullname": "Jiaze Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/173505?format=json", "institution": "Southeast University "}, {"id": 84874, "fullname": "Shiyu Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/84874?format=json", "institution": "Southeast University"}, {"id": 84853, "fullname": "Jiaqi Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/84853?format=json", "institution": "RIKEN"}, {"id": 84884, "fullname": "Xin Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/84884?format=json", "institution": "Southeast University"}], "abstract": "Appropriate parameter initialization is crucial for reducing the training cost of deep neural networks. Graph HyperNetworks (GHN) have emerged as a promising approach for initializing diverse architectures, with recent methods such as Task-Aware Learngene (TAL) further attempting to leverage pre-trained model knowledge via soft label supervision. However, such indirect supervision fails to fully exploit the rich information encoded in pre-trained weights. We propose **P**arameter **I**nheri**T**ance **H**yperNetwork (**PITH**), which introduces a novel parameter projection mechanism to directly inherit parameters from pre-trained models for initializing target networks of varying configurations. Our method enables initialized networks to directly achieve competitive performance on downstream tasks without any further training, which we term zero-shot initialization. Extensive experiments demonstrate the superiority of PITH: ViT-Base initialized by PITH achieves 53.35\\% zero-shot accuracy on ImageNet-1K, surpassing the previous state-of-the-art by 6.54\\%, with consistent improvements across multiple downstream tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38910", "url": null, "sourceid": 34845, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38911, "uid": "58c0c5010ab781e7e188e88c7270702b", "name": "Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction", "authors": [{"id": 126965, "fullname": "Yujie Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/126965?format=json", "institution": "Fudan University"}, {"id": 190950, "fullname": "Chenglong Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/190950?format=json", "institution": "Peking University; Alibaba Group; Korea Advanced Institute of Science &amp; Technology; Seoul National University; Fudan University"}, {"id": 149525, "fullname": "Jianxiong Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/149525?format=json", "institution": "Fudan University"}, {"id": 105464, "fullname": "Chenhui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/105464?format=json", "institution": "Meituan; Fudan University"}, {"id": 73091, "fullname": "Shiwei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73091?format=json", "institution": "Alibaba Group"}, {"id": 90155, "fullname": "Biao Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/90155?format=json", "institution": "Alibaba Group"}, {"id": 154272, "fullname": "Shuai Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/154272?format=json", "institution": "Antgroup Group"}, {"id": 126537, "fullname": "Hangjie Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/126537?format=json", "institution": "Zhejiang University"}, {"id": 76406, "fullname": "Hongming Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/76406?format=json", "institution": "Fudan University"}], "abstract": "Reconstructing dynamic visual experiences as videos from functional magnetic resonance imaging (fMRI) is pivotal for advancing the understanding of neural processes. However, current fMRI-to-video reconstruction methods are hindered by a semantic gap between noisy fMRI signals and the rich content of videos, stemming from a reliance on incomplete semantic embeddings that neither capture video-specific cues (e.g., actions) nor integrate prior knowledge. To this end, we draw inspiration from the dual-pathway processing mechanism in human brain and introduce $\\textbf{CineNeuron}$, a novel hierarchical framework for semantically enhanced video reconstruction from fMRI signals with two synergistic stages. First, a bottom-up semantic enrichment stage maps fMRI signals to a rich embedding space that comprehensively captures textual semantics, image contents, action concepts, and object categories. Second, a top-down memory integration stage utilizes the proposed Mixture-of-Memories method to dynamically select relevant \"memories\" from previously seen data and fuse them with the fMRI embedding to refine the video reconstruction. Extensive experimental results on two fMRI-to-video benchmarks demonstrate that CineNeuron surpasses state-of-the-art methods across various metrics. Code and models will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38911", "url": null, "sourceid": 33114, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38918, "uid": "beb405c52d26c065acc78d558251e512", "name": "SpiralDiff: Spiral Diffusion with LoRA for RGB-to-RAW Conversion Across Cameras", "authors": [{"id": 155872, "fullname": "Huanjing Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/155872?format=json", "institution": "Tianjin University"}, {"id": 190959, "fullname": "Shangbin Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/190959?format=json", "institution": "Tianjin University"}, {"id": 190960, "fullname": "Cong Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190960?format=json", "institution": "Sensetime"}, {"id": 190961, "fullname": "Qian.Wu Qian.Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190961?format=json", "institution": "Unisoc"}, {"id": 190962, "fullname": "Lei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190962?format=json", "institution": "Image Algorithm"}, {"id": 138544, "fullname": "Zhao Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/138544?format=json", "institution": null}, {"id": 87226, "fullname": "Jingyu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87226?format=json", "institution": "Tianjin University"}], "abstract": "RAW images preserve superior fidelity and rich scene information compared to RGB, making them essential for tasks in challenging imaging conditions. To alleviate the high cost of data collection, recent RGB-to-RAW conversion methods aim to synthesize RAW images from RGB. However, they overlook two key challenges: (i) the reconstruction difficulty varies with pixel intensity, and (ii) multi-camera conversion requires camera-specific adaptation. To address these issues, we propose SpiralDiff, a diffusion-based framework tailored for RGB-to-RAW conversion with a signal-dependent noise weighting strategy that adapts reconstruction fidelity across intensity levels. In addition, we introduce CamLoRA, a camera-aware lightweight adaptation module that enables a unified model to handle diverse ISP characteristics. Extensive experiments on four benchmark datasets demonstrate the superiority of SpiralDiff in RGB-to-RAW conversion quality and its downstream benefits in RAW-based object detection.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38918", "url": null, "sourceid": 40093, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38921, "uid": "e5f5c9b8d407672fc0c83e836adff55d", "name": "EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization", "authors": [{"id": 181885, "fullname": "Haolan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181885?format=json", "institution": "Michigan State University"}, {"id": 134455, "fullname": "Keli Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/134455?format=json", "institution": null}, {"id": 91282, "fullname": "Lei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91282?format=json", "institution": "Qualcomm"}, {"id": 87064, "fullname": "Ning Bi", "url": "http://cvpr.thecvf.com/api/miniconf/users/87064?format=json", "institution": "QualComm"}, {"id": 73926, "fullname": "Xiaoming Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73926?format=json", "institution": "Michigan State University"}], "abstract": "Audio-driven 3D talking head synthesis has advanced rapidly with Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). Few-shot methods enable instant personalization by reconstructing high-fidelity avatars from only a few seconds of video. However, achieving natural talking-head generation further requires strong emotion-aware motion modeling, and existing few-shot approaches exhibit geometric instability and audio-emotion mismatch under expressive facial motion. In this work, we present EmoTaG, a few-shot emotion-aware 3D talking head synthesis framework built on the Pretrain-and-Adapt paradigm. Our key insight is to reformulate motion prediction in a structured FLAME parameter space rather than directly deforming 3D Gaussians, which introduces strong geometric priors for stable and interpretable motion. Building upon this, we propose a Gated Residual Motion Network (GRMN), which can capture emotional prosody from audio while supplementing head pose and upper-face cues absent in audio to enable expressive yet stable motion generation. Extensive experiments demonstrate that EmoTaG achieves state-of-the-art performance in emotional expressiveness, lip synchronization, visual realism, and motion stability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38921", "url": null, "sourceid": 35810, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38917, "uid": "4d4eadbb57d186811c54b6df8b609a9b", "name": "Hier-COS: Making Deep Features Hierarchy-aware via Composition of Orthogonal Subspaces", "authors": [{"id": 184205, "fullname": "Depanshu Sani", "url": "http://cvpr.thecvf.com/api/miniconf/users/184205?format=json", "institution": "Indraprastha Institute of Information Technology, Delhi"}, {"id": 181966, "fullname": "Saket Anand", "url": "http://cvpr.thecvf.com/api/miniconf/users/181966?format=json", "institution": "IIIT-Delhi"}], "abstract": "Traditional classifiers treat all class labels as mutually independent, thereby considering all negative classes to be equally incorrect. This approach fails severely in many real-world scenarios, where a known semantic hierarchy defines a partial order of preferences over negative classes. While hierarchy-aware feature representations have shown promise in mitigating this problem, their performance is typically assessed using metrics like Mistake Severity (MS) and Average Hierarchical Distance (AHD). In this paper, we highlight important shortcomings in existing hierarchical evaluation metrics, demonstrating that they are often incapable of measuring true hierarchical performance. Our analysis reveals that existing methods learn sub-optimal hierarchical representations, despite competitive MS and AHD scores. To counter these issues, we introduce Hierarchical Composition of Orthogonal Subspaces (Hier-COS), a novel framework for unified 'hierarchy-aware fine-grained' and 'hierarchical multi-label' classification. We show that Hier-COS is theoretically guaranteed to be consistent with the given hierarchy tree. Furthermore, our framework implicitly adapts the learning capacity for different classes based on their position within the hierarchy tree \u2014 a vital property absent in existing methods. Finally, to address the limitations of evaluation metrics, we propose Hierarchically Ordered Preference Score (HOPS), a ranking-based metric that demonstrably overcomes the deficiencies of current evaluation standards. We benchmark Hier-COS on four challenging datasets, including the deep and imbalanced tieredImageNet-H (12-level) and iNaturalist-19 (7-level). Through extensive experiments, we demonstrate that Hier-COS achieves state-of-the-art performance across all hierarchical metrics for every dataset, while simultaneously beating the top-1 accuracy in all but one case. Lastly, we show that Hier-COS can effectively learn to transform the frozen features extracted from a pretrained backbone (ViT) to be hierarchy-aware, yielding substantial benefits for hierarchical classification performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38917", "url": null, "sourceid": 37591, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38924, "uid": "50c8932f79e0369ac94e06e9c6bd86e5", "name": "FreqSIC: Frequency-aware Stereo Image Compression with Bi-directional Checkerboard Context Model", "authors": [{"id": 181159, "fullname": "Shiyu Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/181159?format=json", "institution": "Tsinghua University"}, {"id": 190975, "fullname": "Yongkang Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190975?format=json", "institution": "South China University of Technology"}, {"id": 190976, "fullname": "Yimin Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190976?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 187934, "fullname": "Jiawei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187934?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 187935, "fullname": "Yifan Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/187935?format=json", "institution": null}, {"id": 154601, "fullname": "Xue Yuerong", "url": "http://cvpr.thecvf.com/api/miniconf/users/154601?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 87242, "fullname": "Shu-Tao Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/87242?format=json", "institution": "Shenzhen International Graduate School, Tsinghua University"}, {"id": 87209, "fullname": "Bin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87209?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}], "abstract": "Stereo image compression is essential for a wide range of 3D vision. Recent methods have demonstrated strong capabilities in eliminating inter-view redundancy and enabling compact entropy coding via spatial-domain stereo transformation and advanced autoregressive entropy models. However, these approaches often suffer from high-frequency information loss and incur considerable coding latency. To overcome these limitations, we propose a novel frequency stereo context transfer (FSCT) module. Unlike spatial-domain methods, the FSCT module separately captures inter-view redundancy in high- and low-frequency components and dynamically balances their contributions to preserve reconstruction quality. In addition, we replace the conventional autoregressive framework with a checkerboard strategy and integrate the FSCT module to model inter-view priors, enabling faster and more efficient entropy coding. Extensive experiments demonstrate that our method achieves state-of-the-art rate-distortion performance among existing stereo image compression approaches, while also attaining the lowest coding latency.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38924", "url": null, "sourceid": 44669, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38925, "uid": "a6b44d616ce540606027b334ce8c1dd0", "name": "Duala: Dual-level Alignment of Subjects and Stimuli for Cross-Subject fMRI Decoding", "authors": [{"id": 180556, "fullname": "Shumeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180556?format=json", "institution": "Nanjing University"}, {"id": 77205, "fullname": "Jintao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/77205?format=json", "institution": "Nanjing University"}, {"id": 152484, "fullname": "Jian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152484?format=json", "institution": "Nanjing University"}, {"id": 190977, "fullname": "Yulin Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190977?format=json", "institution": "nanjing university"}, {"id": 190978, "fullname": "Luyang Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190978?format=json", "institution": "Nanjing University"}, {"id": 86630, "fullname": "Yinghuan Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/86630?format=json", "institution": "Nanjing University"}], "abstract": "Cross-subject visual decoding aims to reconstruct visual experiences from brain activity across individuals, enabling more scalable and practical brain-computer interfaces. However, existing methods often suffer from degraded performance when adapting to new subjects with limited data, as they struggle to preserve both the semantic consistency of stimuli and the alignment of brain responses. To address these challenges, we propose Duala, a dual-level alignment framework designed to achieve stimulus-level consistency and subject-level alignment in fMRI-based cross-subject visual decoding. (1) At the stimulus level, Duala introduces a semantic alignment and relational consistency strategy that preserves intra-class similarity and inter-class separability, maintaining clear semantic boundaries during adaptation. (2) At the subject level, a distribution-based feature perturbation mechanism is developed to capture both global and subject-specific variations, enabling adaptation to individual neural representations without overfitting. Experiments on the Natural Scenes Dataset (NSD) demonstrate that Daula effectively improves alignment across subjects. Remarkably, even when fine-tuned with only about one hour of fMRI data, Duala achieves over 81.1% image-to-brain retrieval accuracy and consistently outperforms existing fine-tuning strategies in both retrieval and reconstruction.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38925", "url": null, "sourceid": 32891, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38926, "uid": "9f76ca9de39c2906f5e1e3f3ba38e255", "name": "Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation", "authors": [{"id": 181563, "fullname": "Zehao Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/181563?format=json", "institution": "Soochow University"}, {"id": 190375, "fullname": "Tianjie Ju", "url": "http://cvpr.thecvf.com/api/miniconf/users/190375?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 190376, "fullname": "Zheng Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190376?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 106927, "fullname": "Zhuosheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/106927?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 178500, "fullname": "Gongshen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/178500?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "The rapid development of large vision-language model (VLM) has greatly promoted the research of GUI agent. However, GUI agents still face significant challenges in handling long-horizon tasks. First, single-agent models struggle to balance high-level capabilities and low-level execution capability, facing prevalent issues of responsibility coupling and capability conflicts. Second, agents lack awareness of the task state, leading to progress loss in long-horizon tasks. To address these challenges, we propose a staged execution-feedback reinforcement learning algorithm. Unlike training a unified policy model, we focus on training high-level scheduling models. Specifically, we propose and train two agents: a Coordinator, responsible for the strategic planning and task decomposition; and a State Tracker, responsible for context compression and information management to maintain the task's state and coherence. Based on this, we built the Coordinator-Executor-State Tracker (CES) multi-agent framework, which can be integrated with any low-level Executor model, assisting the Executor in solving long-horizon tasks through task scheduling and state management.Experiments on long-horizon task benchmarks demonstrate that CES significantly enhances the system's planning and state management capabilities. Furthermore, analysis confirms that our trained high-level scheduling module is a generalizable, plug-and-play module that significantly enhances the long-horizon capabilities of various Executors.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38926", "url": null, "sourceid": 32835, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38928, "uid": "b374d856eb3d8f78c56820a5eb29629d", "name": "FedRAC: Rolling Submodel Allocation for Collaborative Fairness in Federated Learning", "authors": [{"id": 156462, "fullname": "Zihui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156462?format=json", "institution": "Peng Cheng Laboratory"}, {"id": 182386, "fullname": "Yuhang Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182386?format=json", "institution": "Harbin Engineering University"}, {"id": 190979, "fullname": "Mengmeng Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/190979?format=json", "institution": "Fuyang Normal University"}, {"id": 130576, "fullname": "Zhimin Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/130576?format=json", "institution": "School of Informatics Xiamen University"}, {"id": 190980, "fullname": "Yachen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190980?format=json", "institution": "Guilin University of Electronic Technology"}, {"id": 190981, "fullname": "Weisheng Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190981?format=json", "institution": null}, {"id": 190982, "fullname": "Kaiyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190982?format=json", "institution": "Harbin Engineering University"}, {"id": 151501, "fullname": "Zheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151501?format=json", "institution": "Xiamen University"}], "abstract": "Collaborative fairness in federated learning ensures that clients are rewarded according to their contributions, thereby fostering long-term participation among clients. However, existing methods often under-reward low-contributing clients in the early training stage and neglect critical issues, such as consistency across local models or unequal neuron training frequencies in the aggregated model, both of which lead to degraded performance. To address these issues, we propose FedRAC, a novel Federated learning framework employing Rolling submodel Allocation for Collaborative fairness, without compromising the global model performance. First, we design a dynamic reputation calculation module with a theoretical fairness guarantee to generate reputations matching clients\u2019 contributions. It adjusts their reputations dynamically during training, ensuring low-contributing clients access better models in the early stages for adequate training. Second, we propose a rolling submodel allocation module that assigns high-performance submodels to clients with high reputations. This module prioritizes low-frequency neurons during allocation and is supported by theoretical convergence guarantees, ensuring that all neurons in the global model are fully trained. Extensive experiments are conducted on three public datasets to confirm the advantages of our method in terms of fairness and model accuracy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38928", "url": null, "sourceid": 40217, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38927, "uid": "947501a196966253d01a63be8a17e8cd", "name": "Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning", "authors": [{"id": 87921, "fullname": "Chun-Hsiao Yeh", "url": "http://cvpr.thecvf.com/api/miniconf/users/87921?format=json", "institution": "UC Berkeley"}, {"id": 69656, "fullname": "Shengyi Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/69656?format=json", "institution": "Meta FAIR"}, {"id": 88713, "fullname": "Manchen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88713?format=json", "institution": "Amazon"}, {"id": 106959, "fullname": "Yi Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/106959?format=json", "institution": "UC Berkeley"}, {"id": 88708, "fullname": "Joseph Tighe", "url": "http://cvpr.thecvf.com/api/miniconf/users/88708?format=json", "institution": "Meta"}, {"id": 156801, "fullname": "Fanyi Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/156801?format=json", "institution": "Meta FAIR"}], "abstract": "Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision. We propose GASP (Geometric-Aware Spatial Priors), a framework that injects these priors directly into the LLM\u2019s transformer layers. GASP employs a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective leveraging ground-truth geometry from large-scale video scenes: a contrastive loss on ground-truth point correspondences enforces 2D view-invariance, while a depth consistency supervision resolves 3D geometric ambiguities. Our analysis first provides a diagnostic showing that standard VLMs\u2019 internal correspondence matching accuracy is very low (often below 5%). We then demonstrate that our training substantially improves this behavior, boosting peak layer-wise correspondence to over 70% and maintaining over 85% temporal robustness while baselines remain below 5%. These internal improvements translate to significant gains on downstream spatial benchmarks---including +18.2% on All-Angles Bench and +29.0% on VSI-Bench---all without training on any 3D VQA data. Our findings indicate that learning from fundamental geometric priors is a promising and generalizable pathway towards VLMs with more reliable 3D spatial reasoning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38927", "url": "https://danielchyeh.github.io/GASP/", "sourceid": 35285, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38932, "uid": "280515198e9b8f63e158dbebc95dc28b", "name": "RAAS: LLM Agentic System Architecture Search with GRPO", "authors": [{"id": 190989, "fullname": "Jiayi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190989?format=json", "institution": "Boston University, Boston University; Wuhan University"}, {"id": 190990, "fullname": "Guancheng Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190990?format=json", "institution": "Wuhan University"}, {"id": 190991, "fullname": "Man Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190991?format=json", "institution": "Zhejiang University; Wuhan University"}, {"id": 76422, "fullname": "Mang Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/76422?format=json", "institution": "Wuhan University"}], "abstract": "Large Language Model (LLM) agentic systems solve complex tasks through coordinated workflows, but designing them remains labor-intensive. The \\textbf{Agentic Supernet} paradigm automates this by optimizing a probabilistic architecture space, yet suffers from critical evaluation instabilities: absolute performance scores entangle architectural merit with query difficulty, while single-execution protocols capture execution randomness rather than true capability. These instabilities lead to unreliable search dynamics where simple queries inflate weak designs and challenging queries suppress strong ones. We introduce \\textbf{RAAS} (Robust Architecture Adaptive Search), which establishes stable, fair evaluation through two synergistic mechanisms. \\textbf{Contextual Architecture Orchestration (CAO)} disentangles quality from task difficulty by evaluating cohorts of candidate architectures on identical queries, deriving context-aware merit signals through peer-group comparison. \\textbf{Multi-Trial Assessment Synthesis (MTAS)} eliminates execution variance by aggregating performance across multiple independent trials, producing statistically robust capability estimates. Together, these mechanisms isolate genuine architectural superiority and guide reliable architecture discovery. Extensive experiments across six benchmarks show RAAS significantly outperforms state-of-the-art methods, improving HumanEval pass@1 from 92.23\\% to 96.31\\% and MATH accuracy from 52.08\\% to 60.87\\%, while maintaining practical efficiency, demonstrating the effectiveness of robust evaluation for agentic architecture search.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38932", "url": null, "sourceid": 32947, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38939, "uid": "c1537c9ed39baee3476c6fdd666b5fd8", "name": "Merge3D: Efficient 3D Multimodal LLMs via Joint 2D-3D Token Merging", "authors": [{"id": 146121, "fullname": "Tianbo Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/146121?format=json", "institution": "NUS"}, {"id": 151706, "fullname": "Xingyi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151706?format=json", "institution": "National University of Singapore"}, {"id": 87323, "fullname": "Xinchao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87323?format=json", "institution": "National University of Singapore"}], "abstract": "Multimodal Large Language Models (MLLMs) incorporating 3D geometry demonstrate significant power in 3D scene understanding. Their primary bottleneck, however, is the substantial computational burden associated with processing multi-view, lengthy visual token sequences. To surmount this challenge, we propose \\textbf{Merge3D}, a geometry-aware token merging framework that integrates both 3D geometry and 2D semantic information. Conventional 2D compression methods, which rely solely on semantic signals, prove inadequate for 3D tasks, as they tend to discard spatially critical tokens and damage grounding performance. Merge3D bridges the modalities with a Semantic\u2013Geometric Token Merger (SemGeo Merger): 2D attention is used to select semantically salient dominant tokens, while a hybrid 2D+3D similarity assigns and aggregates contextual tokens from spatially coherent 3D neighborhoods. This preserves 3D structural priors and inter-frame correspondences under aggressive compression. Merge3D achieves up to 70\\% visual token reduction and up to $\\sim$3$\\times$ inference speedup, while retaining strong performance on 3D grounding, captioning, and spatial reasoning benchmarks such as Scan2Cap, CV-Bench, and BLINK.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38939", "url": null, "sourceid": 34126, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38942, "uid": "16a8dbbbdbb3056273cbf39955c6f7b0", "name": "Beyond the Ground Truth: Enhanced Supervision for Image Restoration", "authors": [{"id": 129021, "fullname": "Donghun Ryou", "url": "http://cvpr.thecvf.com/api/miniconf/users/129021?format=json", "institution": "Seoul National University"}, {"id": 100791, "fullname": "Inju Ha", "url": "http://cvpr.thecvf.com/api/miniconf/users/100791?format=json", "institution": "Seoul National University"}, {"id": 96405, "fullname": "sanghyeok chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/96405?format=json", "institution": "seoul national university"}, {"id": 75881, "fullname": "Bohyung Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/75881?format=json", "institution": "Seoul National University"}], "abstract": "Deep learning-based image restoration has achieved significant success. However, when addressing real-world degradations, model performance is limited by the quality of ground-truth images in datasets due to practical constraints in data acquisition. To address this limitation, we propose a novel framework that enhances existing ground truth images to provide higher-quality supervision for real-world restoration. Our framework generates perceptually enhanced ground truth variants using super-resolution, and employs a conditional frequency mask generator to produce adaptive frequency masks. These masks guide the optimal fusion of frequency components from the original ground truth and its super-resolved variants to yield enhanced ground truth images. This frequency-domain mixup preserves the semantic consistency of the original content while selectively enriching perceptual details, preventing hallucinated artifacts that could compromise fidelity. The enhanced ground truth images are used to train a lightweight output refinement network that can be seamlessly integrated with existing restoration models.  Extensive experiments demonstrate that our approach consistently improves the quality of restored images. We further validate the effectiveness of both supervision enhancement and output refinement through user studies. We will publicly release our code, enhanced images and model weights to support reproducibility.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38942", "url": null, "sourceid": 35495, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38944, "uid": "b241600bbbbc8cfbba760be3f23083bc", "name": "GenErase: Generalizable and Semantically-Aware Concept Erasure in Diffusion Models", "authors": [{"id": 176596, "fullname": "Korada Sri Vardhana", "url": "http://cvpr.thecvf.com/api/miniconf/users/176596?format=json", "institution": "Indian Institute of Science, Bengaluru"}, {"id": 183196, "fullname": "Soma Biswas", "url": "http://cvpr.thecvf.com/api/miniconf/users/183196?format=json", "institution": "Indian Institute of Science, Bangalore"}], "abstract": "Text-to-Image (T2I) diffusion models power modern creative tools, but their open-ended generative nature raises safety, ethical, and copyright concerns. Retraining or fine-tuning to remove every unsafe or copyrighted concept is impractical, motivating training-free interventions that suppress specific semantics while preserving general visual quality. Existing guard-railing methods face a core trade-off: they are either rigid, failing to generalize to paraphrased or context-shifted prompts, or coarse, distorting unrelated content and fidelity. We present GenErase (GENeralizable ERAsure with SEmantic Awareness), a training-free, geometry-grounded framework for robust concept removal in diffusion models. GenErase enforces semantic orthogonality in the cross-attention value space via an explicit \\emph{erase-and-replace} operation, guided by a per-token preserve projector and a hard geometric gate. This design enables precise erasure, explicit protection of critical semantics, and stability across layers, paraphrases, and multi-concept cases. Extensive experiments on identity, object, and style erasure, together with a new GenBench-40 benchmark, show that GenErase achieves state-of-the-art erasure fidelity and superior paraphrase-level generalization, establishing it as a practical and principled guard-rail for safe, real-time diffusion deployment. Code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38944", "url": null, "sourceid": 41344, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38948, "uid": "fd577536e9ceee3d9c2b3b2e6d2e2a88", "name": "PatchAlign3D: Local Feature Alignment for Dense 3D Shape understanding", "authors": [{"id": 151829, "fullname": "Souhail Hadgi", "url": "http://cvpr.thecvf.com/api/miniconf/users/151829?format=json", "institution": "Ecole Polytechnique"}, {"id": 153010, "fullname": "Bingchen Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/153010?format=json", "institution": "\u00c9cole Polytechnique"}, {"id": 191033, "fullname": "Ramana Sundararaman", "url": "http://cvpr.thecvf.com/api/miniconf/users/191033?format=json", "institution": null}, {"id": 168736, "fullname": "Emery Pierson", "url": "http://cvpr.thecvf.com/api/miniconf/users/168736?format=json", "institution": "Lix, Polytechnique"}, {"id": 90214, "fullname": "Lei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/90214?format=json", "institution": "University of Virginia"}, {"id": 75925, "fullname": "Peter Wonka", "url": "http://cvpr.thecvf.com/api/miniconf/users/75925?format=json", "institution": "KAUST"}, {"id": 85376, "fullname": "Maks Ovsjanikov", "url": "http://cvpr.thecvf.com/api/miniconf/users/85376?format=json", "institution": "Ecole Polytechnique, France"}], "abstract": "Current foundation models for 3D shapes excel at global tasks (retrieval, classification) but transfer poorly to local part-level reasoning. Recent approaches leverage vision and language foundation models to directly solve dense tasks through multi-view renderings and text queries. While promising, these pipelines require expensive inference over multiple renderings, depend heavily on large language-model (LLM) prompt engineering for captions, and fail to exploit the inherent 3D geometry of shapes. We address this gap by introducing an encoder-only 3D model that produces language-aligned patch-level features directly from point clouds. Our pre-training approach builds on existing data engines that generate part-annotated 3D shapes by pairing multi-view SAM regions with VLM captioning. Using this data, we train a point cloud transformer encoder in two stages: (1) distillation of dense 2D features from visual encoders such as DINOv2 into 3D patches, and (2) alignment of these patch embeddings with part-level text embeddings through a multi-positive contrastive objective. Our 3D encoder achieves zero-shot 3D part segmentation with fast single-pass inference without any test-time multi-view rendering, while significantly outperforming previous rendering-based and feed-forward approaches across several 3D part segmentation benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38948", "url": null, "sourceid": 40602, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38950, "uid": "430a7e19cd27b66c8a4a756b85dbfd85", "name": "SoccerMaster: A Vision Foundation Model for Soccer Understanding", "authors": [{"id": 174245, "fullname": "Haolin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174245?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 152566, "fullname": "Jiayuan Rao", "url": "http://cvpr.thecvf.com/api/miniconf/users/152566?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 107441, "fullname": "Haoning Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107441?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 73937, "fullname": "Weidi Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/73937?format=json", "institution": "Shanghai Jiaotong University"}], "abstract": "Soccer understanding has recently garnered growing research interest due to its domain-specific complexity and unique challenges.However, prior works typically rely on task-specific expert models, which are resource-intensive and hinder a holistic view of the game.This paper aims to propose a unified framework that enables a single model to handle diverse soccer visual understanding tasks, spanning both fine-grained perception (e.g., athlete detection) and semantic reasoning (e.g., event classification).Concretely, we make the following contributions in this paper:(i) we present **SoccerMaster**, the first soccer-specific vision foundation model that unifies comprehensive understanding tasks within a single framework via **supervised multi-task pretraining**;(ii) we consolidate multiple existing soccer video datasets and develop an automated data curation pipeline, termed as **SoccerFactory**, to produce scalable multi-task training annotations;and (iii) we conduct extensive experiments demonstrating that SoccerMaster consistently outperforms task-specific expert models across diverse downstream tasks, underscoring its breadth and superiority.The data, code, and model will be publicly available to the research community.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38950", "url": null, "sourceid": 42564, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40354?format=json"], "related_events_ids": [40354]}, {"id": 40354, "uid": "430a7e19cd27b66c8a4a756b85dbfd85", "name": "SoccerMaster: A Vision Foundation Model for Soccer Understanding", "authors": [{"id": 174245, "fullname": "Haolin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174245?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 152566, "fullname": "Jiayuan Rao", "url": "http://cvpr.thecvf.com/api/miniconf/users/152566?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 107441, "fullname": "Haoning Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107441?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 73937, "fullname": "Weidi Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/73937?format=json", "institution": "Shanghai Jiaotong University"}], "abstract": "Soccer understanding has recently garnered growing research interest due to its domain-specific complexity and unique challenges.However, prior works typically rely on task-specific expert models, which are resource-intensive and hinder a holistic view of the game.This paper aims to propose a unified framework that enables a single model to handle diverse soccer visual understanding tasks, spanning both fine-grained perception (e.g., athlete detection) and semantic reasoning (e.g., event classification).Concretely, we make the following contributions in this paper:(i) we present **SoccerMaster**, the first soccer-specific vision foundation model that unifies comprehensive understanding tasks within a single framework via **supervised multi-task pretraining**;(ii) we consolidate multiple existing soccer video datasets and develop an automated data curation pipeline, termed as **SoccerFactory**, to produce scalable multi-task training annotations;and (iii) we conduct extensive experiments demonstrating that SoccerMaster consistently outperforms task-specific expert models across diverse downstream tasks, underscoring its breadth and superiority.The data, code, and model will be publicly available to the research community.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40354", "url": null, "sourceid": -42564, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38950?format=json"], "related_events_ids": [38950]}, {"id": 38956, "uid": "04c9f83b9f0f69b71c45cfc4aebc1030", "name": "One Token, Two Fates: A Unified Framework via Vision Token Manipulation Against MLLMs Hallucination", "authors": [{"id": 180560, "fullname": "Zhan Fa", "url": "http://cvpr.thecvf.com/api/miniconf/users/180560?format=json", "institution": "Nanjing University"}, {"id": 191057, "fullname": "Yue Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191057?format=json", "institution": "Nanjing University"}, {"id": 152484, "fullname": "Jian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152484?format=json", "institution": "Nanjing University"}, {"id": 86644, "fullname": "Lei Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/86644?format=json", "institution": "Southeast University"}, {"id": 86630, "fullname": "Yinghuan Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/86630?format=json", "institution": "Nanjing University"}], "abstract": "Current training-free methods tackle MLLM hallucination with separate strategies: either enhancing visual signals or suppressing text inertia. However, these separate methods are insufficient due to critical trade-offs: simply enhancing vision often fails against strong language prior, while suppressing language can introduce extra image-irrelevant noise. Moreover, we find their naive combination is also ineffective, necessitating a **unified framework**. We propose such a framework by focusing on the core asset: the vision token. Our design leverages two key insights: (1) augmented images offer complementary visual semantics, and (2) removing vision tokens (information-gap) isolates hallucination tendencies more precisely than distorting images (modality-gap). Based on these, our framework uses vision tokens in two distinct ways, both operating on latent representations: our Synergistic Visual Calibration (SVC) module incorporates augmented tokens to strengthen visual representations, while our Causal Representation Calibration (CRC) module uses pruned tokens to create latent-space negative samples for correcting internal model biases. By harmonizing these two roles, our framework effectively restores the vision-language balance, significantly reducing object hallucinations, improving POPE accuracy by an average of 2% absolute on LLaVA-1.5 across multiple benchmarks with only a 1.06x inference latency overhead. Codes are available in supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38956", "url": null, "sourceid": 37896, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38957, "uid": "40c04db8524c0d1e87ed1966eecf48c2", "name": "FRM: Linear-Time 3D Reconstruction via Test-Time Training", "authors": [{"id": 77510, "fullname": "Haian Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/77510?format=json", "institution": "Zhejiang University"}, {"id": 126137, "fullname": "Rundi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126137?format=json", "institution": "Columbia University"}, {"id": 133380, "fullname": "Tianyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/133380?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 151093, "fullname": "Ruiqi Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/151093?format=json", "institution": "Google"}, {"id": 69179, "fullname": "Jonathan T. Barron", "url": "http://cvpr.thecvf.com/api/miniconf/users/69179?format=json", "institution": "Google"}, {"id": 85450, "fullname": "Noah Snavely", "url": "http://cvpr.thecvf.com/api/miniconf/users/85450?format=json", "institution": "Google / Cornell"}, {"id": 85769, "fullname": "Aleksander Holynski", "url": "http://cvpr.thecvf.com/api/miniconf/users/85769?format=json", "institution": "UC Berkeley &amp; Google Research"}], "abstract": "Feed-forward transformer models such as VGGT and $\\pi^3$ are highly accurate, but their computational cost grows quadratically with the number of input images, making them slow to evaluate on large collections. More efficient approaches ameliorate this cost at the expense of reconstruction quality. We introduce Fast Reconstruction Model, a stateful feed-forward reconstruction model that uses a bidirectional architecture that scales linearly in the number of input views, while matching or surpassing the reconstruction quality of quadratic-time methods. FRM employs test-time training layers to compress images into a compact hidden scene state during a single forward pass, enabling our model to reconstruct 3D scenes at speeds up to 75 FPS on a single H100 GPU---over 20 times faster than SOTA methods such as VGGT. This hidden state also serves as an implicit scene representation which can be queried at real-time rates to produce colored point maps from novel views.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38957", "url": null, "sourceid": 35081, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38963, "uid": "9cf2bd74e4c3eaa1d1eb60c7defe4fb0", "name": "Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image", "authors": [{"id": 72562, "fullname": "Yushi Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/72562?format=json", "institution": "University of Washington"}, {"id": 137286, "fullname": "Reyhane Askari", "url": "http://cvpr.thecvf.com/api/miniconf/users/137286?format=json", "institution": "FAIR"}, {"id": 150972, "fullname": "Melissa Hall", "url": "http://cvpr.thecvf.com/api/miniconf/users/150972?format=json", "institution": "FAIR (Meta)"}, {"id": 189609, "fullname": "Emily Dinan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189609?format=json", "institution": "Facebook"}, {"id": 130921, "fullname": "Luke Zettlemoyer", "url": "http://cvpr.thecvf.com/api/miniconf/users/130921?format=json", "institution": "University of Washington"}, {"id": 189608, "fullname": "Marjan Ghazvininejad", "url": "http://cvpr.thecvf.com/api/miniconf/users/189608?format=json", "institution": "Facebook"}], "abstract": "Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning (\u201cthinking-with-images\u201d), providing 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical but challenging prompts; (2) responses from state-of-the-art models, and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge and models trained with human preferences. Top judges like GPT-5 and Gemini-2.5-Pro reach 66\u201375% accuracy, compared to >90% for humans, and outperform the commonly used GPT-4o (59% accuracy). The best performing open-source model Qwen3-VL-32B achieves similar accuracies as Gemini-2.5-Flash (64%). We also show that MMRB2 performance strongly correlates with downstream task success, and conduct an in-depth analysis that shows key areas to improve the reward models going forward.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38963", "url": null, "sourceid": 45814, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38969, "uid": "2b12c0d47f5821a5adb3bfd973d0f708", "name": "Generative Neural Video Compression via Video Diffusion Prior", "authors": [{"id": 181193, "fullname": "Qi Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/181193?format=json", "institution": "Communication University of China"}, {"id": 191089, "fullname": "Hao Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191089?format=json", "institution": "Communication University of China"}, {"id": 191090, "fullname": "Tinghan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191090?format=json", "institution": "Communication University of China"}, {"id": 191091, "fullname": "Libiao Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/191091?format=json", "institution": "Communication University of China"}, {"id": 90140, "fullname": "Siwei Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/90140?format=json", "institution": "Peking University"}], "abstract": "We present \\textbf{GNVC-VD}, the first DiT-based generative neural video compression framework built upon an advancedvideo generation foundation model, where spatio-temporal latent compression and sequence-level generative refinement are unified within a single codec. Existing perceptual codecs primarily rely on pre-trained \\textbf{image} generative priors to restorehigh-frequency details, but their frame-wise nature lacks temporal modeling and inevitably leads to \\textbf{perceptual flickering}. To address this, GNVC-VD introduces a unified flow-matching latent refinement module that leverages a \\textbf{video diffusion transformer} to jointly enhance intra- and inter-frame latents through sequence-level denoising, ensuring consistent spatio-temporal details. Instead of denoising from pure Gaussian noise as in video generation, GNVC-VD initializes refinement from decoded spatio-temporal latents and learns a correction term that adapts the diffusion prior to compression-induced degradation.  A conditioning adaptor further injects compression-aware cues into intermediate DiT layers, enabling effective artifact removal while maintaining temporal coherenceunder extreme bitrate constraints. Extensive experiments show that GNVC-VD surpasses both traditional and learned codecs in perceptual quality and significantly reduces the flickering artifacts that persist in prior generative approaches, even below 0.01~bpp, highlighting the promise of integrating video-native generative priors into neural codecs for next-generation perceptual video compression.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38969", "url": null, "sourceid": 34370, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38970, "uid": "ce5f48cd6be4756c729c85fb2a204f2a", "name": "PPM-CLIP: Probabilistic Prompt Modeling for Generalizable AI-Generated Image Detection", "authors": [{"id": 175518, "fullname": "WANG XINYUAN", "url": "http://cvpr.thecvf.com/api/miniconf/users/175518?format=json", "institution": "Xiamen University"}, {"id": 157178, "fullname": "Yingxin Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/157178?format=json", "institution": "Xiamen University"}, {"id": 157182, "fullname": "Zhiming Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/157182?format=json", "institution": "Xiamen University"}, {"id": 186232, "fullname": "Zhihui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186232?format=json", "institution": null}], "abstract": "The rapid rise of highly realistic AI-generated images necessitates reliable and generalizable detection methods. However, existing methods are constrained by their discriminative nature: by learning a single static decision boundary, they tend to memorize generator-specific artifacts and consequently fail to generalize to the unseen distributions of new generative models. To overcome this limitation, we propose PPM-CLIP, a new framework that shifts from static classification to conditional generative modeling based on the CLIP vision\u2013language model. Instead of learning a fixed decision boundary, a Probabilistic Prompt Modeling (PPM) module is used as a generator that produces an adaptive distribution of prompts according to the input image. This allows the model to flexibly capture novel artifacts, rather than matching them against fixed templates. In addition, to enhance the visual encoder's sensitivity to subtle artifacts, a Patch-Wise Contrastive Learning (PWCL) strategy is introduced. Extensive experiments on Ojha, GenImage, and DRCT benchmarks demonstrate that our generative paradigm significantly outperforms state-of-the-art methods, especially in cross-domain detection. Code will be released on GitHub.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38970", "url": null, "sourceid": 39294, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38975, "uid": "744a37185fa8c4449a60bd11976f9045", "name": "LLaDA-MedV: Exploring Large Language Diffusion Models for Biomedical Image Understanding", "authors": [{"id": 136810, "fullname": "XUANZHAO DONG", "url": "http://cvpr.thecvf.com/api/miniconf/users/136810?format=json", "institution": "Arizona State University"}, {"id": 133834, "fullname": "Wenhui Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/133834?format=json", "institution": "Arizona State University"}, {"id": 133805, "fullname": "Xiwen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/133805?format=json", "institution": null}, {"id": 137869, "fullname": "Zhipeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/137869?format=json", "institution": "LinkedIn Corporation"}, {"id": 137037, "fullname": "Peijie Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/137037?format=json", "institution": "Washington University in St.Louis"}, {"id": 191103, "fullname": "Shao Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191103?format=json", "institution": "Linkedin"}, {"id": 191104, "fullname": "Xin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191104?format=json", "institution": "Arizona State University"}, {"id": 131533, "fullname": "Yalin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131533?format=json", "institution": "Arizona State University"}], "abstract": "Autoregressive models (ARMs) have long dominated the landscape of biomedical vision-language models (VLMs). Recently, masked diffusion models such as LLaDA have emerged as promising alternatives, yet their application in the biomedical domain remains largely underexplored. To bridge this gap, we introduce \\textbf{LLaDA-MedV}, the first large language diffusion model tailored for biomedical image understanding through vision instruction tuning. LLaDA-MedV achieves relative performance gains of 7.855\\% over LLaVA-Med and 1.867\\% over LLaDA-V in the open-ended biomedical visual conversation task, and sets new state-of-the-art accuracy on the closed-form subset of three VQA benchmarks: 84.93\\% on VQA-RAD, 92.31\\% on SLAKE, and 95.15\\% on PathVQA. Furthermore, a detailed comparison with LLaVA-Med suggests that LLaDA-MedV is capable of generating reasonably longer responses by explicitly controlling response length, which can lead to more informative outputs. We also conduct an in-depth analysis of both the training and inference stages, highlighting the critical roles of initialization weight selection, fine-tuning strategies, and the interplay between sampling steps and response repetition. All code and model weights will be released publicly to support future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38975", "url": null, "sourceid": 39960, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38978, "uid": "77880e0856efb9360da9f3434caeedda", "name": "Curvature-Aware Zeroth-Order Optimization for Memory-Efficient Test-Time Adaptation", "authors": [{"id": 180676, "fullname": "Junming Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180676?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 189016, "fullname": "Shuyu Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189016?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 189017, "fullname": "Peilin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189017?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 191109, "fullname": "Rendong Ying", "url": "http://cvpr.thecvf.com/api/miniconf/users/191109?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 189018, "fullname": "Fei Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189018?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Test-time adaptation (TTA) aims to enhance the cross-domain performance of pre-trained models by adapting to unlabeled test data.While most existing TTA methods rely on backpropagation (BP) for finetuning, BP-free methods such as zeroth-order (ZO) methods are more desired in practical on-device scenarios. ZO methods rely only on forward computation, which can largely reduce the complexity and memory overhead of on-device deployment.However, ZO methods suffer from much higher variance compared with first-order  methods in estimating the gradient.To address this, we propose an improved ZO method to substantially boost the performance of ZO optimization based TTA.First, we provide an observation to reveal the persistent low-rank Hessian structure of the loss during the adaptation process. Based on this insight, we then propose a loss-landscape curvature-aware zeroth-order (CAZO) method, which leverages a sliding-average estimation of the diagonal Hessian to construct a covariance matrix for anisotropic\u200c perturbation sampling. CAZO operates by freezing pretrained weights and optimizing minimal adapter parameters via forward-only passes based gradient estimation, which can substantially reduce the memory overhead compared to BP-based methods. Extensive experiments demonstrate that CAZO significantly outperforms existing TTA methods, achieving state-of-the-art performance while maintaining an excellent balance between accuracy and memory efficiency. Code is provided in supplemental material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38978", "url": null, "sourceid": 42909, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38977, "uid": "ce1542ca94b4c1147ab2c8155fb41578", "name": "RPGFusion: 4D Radar Prior-Guided Multi-Modal Fusion for 3D Detection", "authors": [{"id": 191107, "fullname": "Xin Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191107?format=json", "institution": "Zhejiang University"}, {"id": 191108, "fullname": "Wenjie Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191108?format=json", "institution": "Zhejiang University"}], "abstract": "Accurate 3D object detection in autonomous driving relies on effectively combining complementary information from multiple sensors. 4D millimeter-wave radar provides sparse yet physically reliable measurements, whose potential for enhancing sensor fusion has not been fully utilized. In this work, we propose \\textbf{R}adar \\textbf{P}rior \\textbf{G}uided \\textbf{Fusion} (\\textbf{RPGFusion}), a practical 4D radar\u2013camera fusion framework. We first generate radar prior maps that encode spatial confidence and depth cues. These priors guide image feature sampling while preventing the uneven BEV feature distribution (near-dense, far-sparse) caused by Lift-Splat-Shoot view transformation. To address the sparsity and noise inherent in point clouds, we adopt a hybrid robust encoding and sparse-to-dense feature propagation. We further introduce spatial alignment and semantic fusion modules to reconcile geometric and semantic differences between modalities, yielding more consistent and complementary BEV representations. Extensive experiments on the public View-of-Delft and TJ4DRadSet show that RPGFusion outperforms prior radar\u2013camera fusion methods, achieving \\textbf{SOTA} performance. Our work not only uses 4D radar signals to guide image BEV queries, but also enables robust radar feature encoding and densification for 3D perception, demonstrating the strong potential of 4D radar.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38977", "url": null, "sourceid": 35395, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38984, "uid": "aa954eb9fc47e002ecbf68b60517a3de", "name": "Material Magic Wand: Material-Aware Grouping of 3D Parts in Untextured Meshes", "authors": [{"id": 183665, "fullname": "Umangi Jain", "url": "http://cvpr.thecvf.com/api/miniconf/users/183665?format=json", "institution": "University of Toronto"}, {"id": 86499, "fullname": "Vladimir G. Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/86499?format=json", "institution": "Adobe Systems"}, {"id": 106544, "fullname": "Matheus Gadelha", "url": "http://cvpr.thecvf.com/api/miniconf/users/106544?format=json", "institution": "Adobe Systems"}, {"id": 86662, "fullname": "Igor Gilitschenski", "url": "http://cvpr.thecvf.com/api/miniconf/users/86662?format=json", "institution": "University of Toronto"}, {"id": 89985, "fullname": "Zhiqin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89985?format=json", "institution": "Simon Fraser University"}], "abstract": "We introduce the problem of material-aware part grouping in untextured meshes.Many real-world shapes, such as scales of pinecones or windows of buildings, contain repeated structures that share the same material but exhibit geometric variations.When assigning materials to such meshes, these repeated parts often require piece-by-piece manual identification and selection, which is tedious and time-consuming.To address this, we propose Material Magic Wand, a tool that allows artists to select part groups based on their estimated material properties -- when one part is selected, our algorithm automatically retrieves all other parts likely to share the same material. The key component of our approach is a part encoder that generates a material-aware embedding for each 3D part, accounting for both local geometry and global context.We train our model with a supervised contrastive loss that brings embeddings of material-consistent parts closer while separating those of different materials;therefore, part grouping can be achieved by retrieving embeddings that are close to the embedding of the selected part.To benchmark this task, we introduce a curated dataset of 100 shapes with 241 part-level queries.We verify the effectiveness of our method through extensive experiments and demonstrate its practical value in an interactive material assignment application.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38984", "url": null, "sourceid": 31765, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38985, "uid": "a9d61b085ddc473847e9397f6c0e485a", "name": "ConceptPrism: Concept Disentanglement in Personalized Diffusion Models via Residual Token Optimization", "authors": [{"id": 167168, "fullname": "Minseo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/167168?format=json", "institution": "KAIST"}, {"id": 191122, "fullname": "Minchan Kwon", "url": "http://cvpr.thecvf.com/api/miniconf/users/191122?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 76173, "fullname": "Dongyeun Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/76173?format=json", "institution": "KAIST &amp; Klleon AI Research"}, {"id": 191123, "fullname": "Yunho Jeon", "url": "http://cvpr.thecvf.com/api/miniconf/users/191123?format=json", "institution": "Hanbat National University"}, {"id": 85168, "fullname": "Junmo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/85168?format=json", "institution": "Korea Advanced Institute of Science and Technology"}], "abstract": "Personalized text-to-image generation suffers from concept entanglement, where irrelevant residual information from reference images is captured, leading to a trade-off between concept fidelity and text alignment. Recent disentanglement approaches attempt to solve this utilizing manual guidance, such as linguistic cues or segmentation masks, which limits their applicability and fails to fully articulate the target concept. In this paper, we propose \\textbf{ConceptPrism}, a novel framework that automatically disentangles the shared visual concept from image-specific residuals by comparing images within a set. Our method jointly optimizes a target token and image-wise residual tokens using two complementary objectives: a reconstruction loss to ensure fidelity, and a novel exclusion loss that compels residual tokens to discard the shared concept. This process allows the target token to capture the pure concept without direct supervision. Extensive experiments demonstrate that ConceptPrism effectively resolves concept entanglement, achieving a significantly improved trade-off between fidelity and alignment.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38985", "url": null, "sourceid": 46181, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38986, "uid": "b8213dd6cdc0bfb7d1ca745b9aaf2ba2", "name": "OpenDPR: Open-Vocabulary Change Detection via Vision-Centric Diffusion-Guided Prototype Retrieval for Remote Sensing Imagery", "authors": [{"id": 180963, "fullname": "Qi Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/180963?format=json", "institution": "Wuhan University"}, {"id": 191124, "fullname": "Jue Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191124?format=json", "institution": null}, {"id": 191125, "fullname": "Yinhe Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191125?format=json", "institution": "Wuhan University"}, {"id": 156426, "fullname": "Yanfei Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/156426?format=json", "institution": "Wuhan University"}], "abstract": "Open-vocabulary change detection (OVCD) seeks to recognize arbitrary changes of interest by enabling generalization beyond a fixed set of predefined classes. We reformulate OVCD as a two-stage pipeline: first generate class-agnostic change proposals using visual foundation models (VFMs) such as SAM and DINOv2, and then perform category identification with vision-language models (VLMs) such as CLIP. We reveal that category identification errors are the primary bottleneck of OVCD, mainly due to the limited ability of VLMs based on image-text matching to represent fine-grained land-cover categories. To address this, we propose OpenDPR, a training-free vision-centric diffusion-guided prototype retrieval framework. OpenDPR leverages diffusion models to construct diverse prototypes for target categories offline, and perform similarity retrieval with change proposals in the visual space during inference. The other secondary bottleneck lies in change localization, due to the lack of change priors in VFMs under unsupervised settings. To bridge this gap, we design a spatial-to-change weakly supervised change detection module named S2C to adapt their strong spatial modeling capabilities for binary change localization. Integrating the pretrained S2C into OpenDPR leads to a weakly supervised variant named OpenDPR-W, which further improves OVCD with minimal supervision. Experimental results on four benchmark datasets demonstrate that the proposed methods achieve state-of-the-art performance under both supervision modes. Code is available in the Supplementary Material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38986", "url": null, "sourceid": 36249, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38987, "uid": "86d91c66f12894957b1438e1fc1a95dc", "name": "Learning and Aligning Click-Aware Shape Prior for Interactive Amodal Instance Segmentation", "authors": [{"id": 127774, "fullname": "Junjie Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/127774?format=json", "institution": "Jiangxi University of Finance and Economics"}, {"id": 175491, "fullname": "Junwei Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/175491?format=json", "institution": "Jiangxi University of Finance and Economics"}, {"id": 191126, "fullname": "Ren Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/191126?format=json", "institution": "Jiangxi University of Finance and Economics"}, {"id": 191127, "fullname": "Shengjie Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191127?format=json", "institution": "Jiangxi University of Finance and Economics"}, {"id": 86634, "fullname": "Yuming Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86634?format=json", "institution": "Jiangxi University of Finance and Economics"}, {"id": 191128, "fullname": "Feng Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/191128?format=json", "institution": "Jiangxi University of Finance and Economics"}, {"id": 157118, "fullname": "Yifan Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/157118?format=json", "institution": "Jiangxi University of Finance and Economics"}], "abstract": "Amodal instance segmentation aims to segment both visible and occluded regions of object instance, which are challenging due to lacking inference support under occlusion. Most existing methods employ the prior knowledge about object mask (shape prior) to support the amodal estimation, but the shape prior is not always compatible for object instances in the test stage. In this paper, we explore the task of interactive amodal segmentation, where a few user clicks are available for better segmenting the complete masks of object instances.For this task, we propose a novel framework based on learning and aligning click-aware shape prior. Specifically, we propose to learn click-aware shape prior with triplet loss, which forces the retrieved shape priors to have higher IoU with the ground-truth of target instance and thus could exactly facilitate the prediction. Besides, considering the inevitable mismatch between shape prior and target instance, we propose to adaptively align the shape prior with deformable attention. Overall, our model could make full use of the interactive clicks to retrieve and align shape priors, and thus could estimate more complete masks. Extensive experiments on three benchmark datasets (i.e., KINS, D2SA and COCOA cls) demonstrate the effectiveness of our method.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38987", "url": null, "sourceid": 46194, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38991, "uid": "a7949e3bfe0d90df9ba1700365ac42c9", "name": "3DrawAgent: Teaching LLM to Draw in 3D with early relative experience", "authors": [{"id": 191138, "fullname": "Hongcan Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191138?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 191139, "fullname": "Xinyue Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191139?format=json", "institution": "Jiangnan University"}, {"id": 191140, "fullname": "Yilin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191140?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 191141, "fullname": "Yue Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191141?format=json", "institution": "HaoHan Data"}, {"id": 86631, "fullname": "Yonggang Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/86631?format=json", "institution": "Beijing University of Posts and Telecommunications"}], "abstract": "Sketching in 3D space enables expressive reasoning about shape, structure, and spatial relationships, yet generating 3D sketches through natural language remains a major challenge. In this work, we introduce 3DarwAgent, a training-free, language-driven framework for 3D sketch generation that leverages large language models (LLMs) to sequentially draw 3D B\u00e9zier curves under geometric feedback. Unlike prior 2D sketch agents, our method introduces a relative experience optimization strategy that tailors the recently proposed Generalized Reward Policy Optimization (GRPO) paradigm. Instead of relying on explicit ground-truth supervision, we construct pairwise comparisons among generated sketches, i.e., each pair consisting of a relatively better and worse result based on CLIP-based perceptual rewards and LLM-based fine-grained qualitative assessment. These experiences are then used to iteratively refine the prior knowledge of 3D drawing, enabling black-box reinforcement of the model\u2019s 3D awareness. This design allows our model to self-improve its spatial understanding and drawing quality without parameter updates. Experiments show that 3DarwAgent can generate complex and coherent 3D B\u00e9zier sketches from textual prompts, exhibit emergent geometric reasoning, and generalize to novel shapes, establishing a new paradigm for training-free 3D sketch intelligence.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38991", "url": null, "sourceid": 42736, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38993, "uid": "04a4730468a0d657ebec4e0f1dafbbd4", "name": "Underground Plant Exploration: Non-Destructive 3D Root Assessment with GPR Based on Point Graph Neural Network", "authors": [{"id": 168922, "fullname": "Yuwei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/168922?format=json", "institution": "Rochester Institute of Technology"}, {"id": 73497, "fullname": "Guoyu Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73497?format=json", "institution": "SUNY Binghamton"}], "abstract": "This paper introduces a novel application of machine learning in agriculture for non-destructive 3D root structure reconstruction. Plant roots are critical for providing resources for the entire plant. Ground Penetrating Radar (GPR) is a key tool for identifying subterranean objects with easy and obvious shapes, such as large pipes, but remaining challenging to assess the 3D shapes of roots. In our study, we introduce a novel approach specifically designed based on GPR signal shape priors to detect target signals and perform curve parameter regression based on multiple B-scans from GPR. This process enables the derivation of a precise curve from the detection and regression outcomes. To achieve the reconstruction of a comprehensive 3D root structure, we have developed a shape reconstruction network that processes sparse sliced 3D points through a dedicated point graph network and an upsampling network module. Our method has been rigorously trained and validated using synthetic 3D root datasets and GPR data simulated by gprMax, as well as real GPR data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38993", "url": null, "sourceid": 36477, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39000, "uid": "3af33f51acdcf0b1c4d41c96defa0993", "name": "Lafite : A Generative Latent Field for 3D Native Texturing", "authors": [{"id": 171335, "fullname": "Chia-Hao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/171335?format=json", "institution": ", Tsinghua University"}, {"id": 133812, "fullname": "Yuanchen Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/133812?format=json", "institution": "VAST"}, {"id": 155933, "fullname": "Zi-Xin Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/155933?format=json", "institution": "VAST"}, {"id": 191160, "fullname": "Ze Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191160?format=json", "institution": "University of Hong Kong"}, {"id": 130380, "fullname": "Guan Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/130380?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 88776, "fullname": "Xiaojuan Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/88776?format=json", "institution": "University of Oxford"}, {"id": 126930, "fullname": "Ding Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126930?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 84806, "fullname": "Yan-Pei Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84806?format=json", "institution": "Tencent ARC Lab"}, {"id": 89456, "fullname": "Song-Hai Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89456?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Generating detailed and seamless textures for 3D meshes remains an open challenge. Recent image and video generation models, empowered by large-scale visual priors, are capable of producing highly detailed images and are thus promising for multi-view texture synthesis. However, evaluating texture quality involves multiple dimensions beyond visual fidelity. Multi-view back-projection often introduces seams and inconsistencies between different views or near occluded regions, while direct generation on UV-unwrapped maps suffers from UV distortions and ambiguities.Generating textures directly in 3D space offers an inherent advantage in ensuring continuity and spatial coherence, making it a critical and worthwhile research direction. Therefore, we systematically investigate 3D-native texture generation from the perspectives of representation and generation, and present current best practices for this approach.To this end, we employ a local vector field with a structured latent representation to model the joint distribution of texture and geometry. This design enables texture generation conditioned on high-fidelity geometric features within a unified latent space. Crucially, our approach is inherently free from occlusion artifacts, multi-view inconsistencies, and UV-related distortions caused by fragmented surface parameterizations. Extensive experiments demonstrate that our method produces high-quality, seamless textures and supports flexible downstream tasks such as editing and inpainting, marking a significant step forward in 3D-native texture generation.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39000", "url": null, "sourceid": 38662, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39001, "uid": "828feeb4c80ad1bcb65b2002b6ea581b", "name": "From Global Alignment to Local Semantics: Understanding Visual Representations Structures in Multimodal LLMs", "authors": [{"id": 154478, "fullname": "Yingqi Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/154478?format=json", "institution": "Eastern Institute of Technology, Ningbo"}, {"id": 183538, "fullname": "Junlong Tong", "url": "http://cvpr.thecvf.com/api/miniconf/users/183538?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 191161, "fullname": "Anhao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191161?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 154481, "fullname": "Xiaoyu Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/154481?format=json", "institution": "Eastern Institute of Technology, Ningbo"}], "abstract": "Multimodal LLMs (MLLMs) convert images into visual tokens for language-model processing, yet how these tokens encode semantics remains unclear. In this paper, we identify a consistent token structure across models: visual tokens cluster into sink, dead, and alive groups, with only the alive tokens ($\\approx60$%) carrying meaningful information. Sink and dead tokens can be removed without hurting performance. Using a patch-compression benchmark and our probing tool *EmbedLens*, we show that alive tokens already encode fine-grained cues (objects, colors, OCR) before entering the LLM. Internal visual computation (visual attention and FFNs) are redundant and offers limited benefit for most tasks. This redundancy also extends to the model's depth: Our analysis shows that alive tokens align best with mid-layer LLM representations, while shallow layers contribute little. These findings provide a unified view of visual semantics in MLLMs and motivate architectures that use fewer visual tokens, reduced visual computation, and mid-layer injection for better efficiency and interpretability.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39001", "url": null, "sourceid": 45599, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39003, "uid": "3bfc8356903f8b39ec744b6a0154bcba", "name": "Measuring the (Un)Faithfulness of Concept-Based Explanations", "authors": [{"id": 136023, "fullname": "Shubham Kumar", "url": "http://cvpr.thecvf.com/api/miniconf/users/136023?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 136101, "fullname": "Narendra Ahuja", "url": "http://cvpr.thecvf.com/api/miniconf/users/136101?format=json", "institution": "University of Illinois at Urbana-Champaign"}], "abstract": "Deep vision models perform input-output computations that are hard to interpret. Concept-based explanation methods (CBEMs) increase interpretability by re-expressing parts of the model with human-understandable semantic units, or concepts. Checking if the derived explanations are faithful---that is, they represent the model's internal computation---requires a surrogate that combines concepts to compute the output. Simplifications made for interpretability inevitably reduce faithfulness, resulting in a tradeoff between the two. State-of-the-art unsupervised CBEMs (U-CBEMs) have reported increasingly interpretable concepts, while also being more faithful to the model. However, we observe that the reported improvement in faithfulness artificially results from either (1) using overly complex surrogates, which introduces an unmeasured cost to the explanation's interpretability, or (2) relying on deletion-based approaches that, as we demonstrate, do not properly measure faithfulness. We propose Surrogate Faithfulness (SURF), which (1) replaces prior complex surrogates with a simple, linear surrogate that measures faithfulness without changing the explanation's interpretability and (2) introduces well-motivated metrics that assess loss across all output classes, not just the predicted class. We validate SURF with a measure-over-measure study by proposing a simple sanity check---explanations with random concepts should be less faithful---which prior surrogates fail. SURF enables the first reliable faithfulness benchmark of U-CBEMs, revealing that many visually compelling U-CBEMs are not faithful. Code to be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39003", "url": null, "sourceid": 45855, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39008, "uid": "6d1d663a5fc0fb709ecd336753450cac", "name": "Intrinsic Geometry-Appearance Consistency Optimization for Sparse-View Gaussian Splatting", "authors": [{"id": 180184, "fullname": "Kaiqiang Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/180184?format=json", "institution": "Peking University"}, {"id": 191169, "fullname": "Rui Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191169?format=json", "institution": "Peking University"}, {"id": 142884, "fullname": "Jiahao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/142884?format=json", "institution": "Peking University"}, {"id": 187749, "fullname": "Zhanke Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187749?format=json", "institution": "Peking University"}, {"id": 145699, "fullname": "Jie Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/145699?format=json", "institution": "Peking University"}, {"id": 187748, "fullname": "Xiaoyun Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187748?format=json", "institution": "Pengcheng Laboratory"}, {"id": 129700, "fullname": "Feng Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/129700?format=json", "institution": "Peking University"}, {"id": 86749, "fullname": "Ronggang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86749?format=json", "institution": "Peking University Shenzhen Graduate School"}], "abstract": "3D Gaussian Splatting (3DGS) represents scenes through primitives with coupled intrinsic properties: geometric attributes (position, covariance, opacity) and appearance attributes (view-dependent color). Faithful reconstruction requires intrinsic geometry-appearance consistency, where geometry accurately captures 3D structure while appearance reflects photometry. However, sparse observations lead to appearance overfitting and underconstrained geometry, causing severe novel-view artifacts.We present ICO-GS (Intrinsic Geometry-Appearance Consistency Optimization for 3DGS), a principled framework that enforces this consistency through tightly coupled geometric regularization and appearance learning. Our approach first regularizes geometry via feature-based multi-view photometric constraints by employing pixel-wise top-k selection to handle occlusions and edge-aware smoothness to preserve sharp structures.Then appearance is coupled with geometry through cycle-consistency depth filtering, which identifies reliable regions to synthesize virtual views that propagate geometric correctness into appearance optimization. Experiments on LLFF, DTU, and Blender show ICO-GS substantially improves geometry and photometry, consistently outperforming existing sparse-view baselines, particularly in challenging weakly-textured regions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39008", "url": null, "sourceid": 34824, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39009, "uid": "7cc9dfab96e5f47acc9bb48d36f1cac7", "name": "Point Cloud as a Foreign Language for Multi-modal Large Language Model", "authors": [{"id": 180328, "fullname": "Sneha Paul", "url": "http://cvpr.thecvf.com/api/miniconf/users/180328?format=json", "institution": "Concordia University"}, {"id": 191170, "fullname": "Zachary Patterson", "url": "http://cvpr.thecvf.com/api/miniconf/users/191170?format=json", "institution": "Concordia University"}, {"id": 189913, "fullname": "Nizar Bouguila", "url": "http://cvpr.thecvf.com/api/miniconf/users/189913?format=json", "institution": "Concordia University"}], "abstract": "Multi-modal large language models (MLLMs) have shown remarkable progress in integrating visual and linguistic understanding. Recent efforts have extended these capabilities to 3D understanding through encoder-based architectures that rely on pre-trained 3D encoders to extract geometric features. However, such approaches suffer from semantic misalignment between geometric and linguistic spaces, resolution sensitivity, and substantial computational overhead. In this work, we present SAGE, the first end-to-end 3D MLLM that directly processes raw point clouds without relying on a pre-trained 3D encoder. Our approach introduces a lightweight 3D tokenizer that combines geometric sampling and neighbourhood aggregation with vector quantization to convert point clouds into discrete tokens\u2014treating 3D data as a foreign language that naturally extends the LLM\u2019s vocabulary. Furthermore, to enhance the model\u2019s reasoning capability on complex 3D tasks, we propose a preference optimization training strategy with a semantic alignment\u2013based reward, specifically designed for open-ended 3D question answering where responses are descriptive. Extensive experiments across diverse 3D understanding benchmarks demonstrate that our end-to-end approach outperforms existing encoder-based methods while offering significant advantages in computational efficiency, generalization across LLM backbones, and robustness to input resolution variations. Code is available at: https://anonymous.4open.science/r/SAGE-3D.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39009", "url": null, "sourceid": 46766, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39014, "uid": "d2cd4f2c1f5dd99a13b7c026d366b16f", "name": "Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation", "authors": [{"id": 147491, "fullname": "Pingrui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/147491?format=json", "institution": "Fudan University"}, {"id": 191180, "fullname": "Yifei Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/191180?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 191181, "fullname": "Pengyuan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191181?format=json", "institution": "College of Computer Science and Technology, Zhejiang University; Shanghai Artificial Intelligence Laboratory"}, {"id": 191182, "fullname": "Dong An", "url": "http://cvpr.thecvf.com/api/miniconf/users/191182?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 151498, "fullname": "Li Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151498?format=json", "institution": "University of Science and Technology of China"}, {"id": 91379, "fullname": "Zhigang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91379?format=json", "institution": "Shanghai AI  Lab"}, {"id": 88588, "fullname": "Dong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88588?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 88582, "fullname": "Bin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88582?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}], "abstract": "Vision-and-Language Navigation (VLN) requires the agent to navigate based on natural instructions. This task is challenging due to partial observability, which makes it difficult to align perception with language.Recent methods mitigate this by imagining future scenes, yet they rely on vision-based synthesis, leading to high computational cost and redundant details.To this end, we propose to adaptively imagine key environmental semantics via language form, enabling a more reliable and efficient strategy. Specifically, we introduce Adaptive Text Dreamer (ATD), a dual-branch self-guided imagination policy built upon a large language model (LLM). ATD is designed with a human-like left-right brain architecture, where the left brain focuses on logical integration, and the right brain is responsible for imaginative prediction of future scenes. To achieve this, we fine-tune only the Q-former within both brains to efficiently activate domain-specific knowledge in the LLM, enabling dynamic updates of logical reasoning and imagination during navigation.Furthermore, we propose a cross-interaction mechanism that regularizes the imagined latent-space outputs and integrates them with the navigation expert module via a decoder-free latent interface, thereby enabling ATD to jointly harness the reasoning ability of the LLM and the task-specific knowledge of the navigation model. We conduct extensive experiments across the R2R, REVERIE, and R4R benchmarks, demonstrating that ATD achieves competitive performance with significantly fewer parameters.The code will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39014", "url": null, "sourceid": 45429, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39020, "uid": "359353ccb2ed0977b00c27256b30c365", "name": "MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question Answering", "authors": [{"id": 190448, "fullname": "Junbin Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190448?format=json", "institution": "National University of Singapore"}, {"id": 191191, "fullname": "Jiajun Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191191?format=json", "institution": "national university of singaore, National University of Singapore; Beijing University of Posts and Telecommunications"}, {"id": 191192, "fullname": "Tianxiang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/191192?format=json", "institution": "national university of singaore, National University of Singapore"}, {"id": 86375, "fullname": "Xun Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86375?format=json", "institution": "University of Science and Technology of China"}, {"id": 85773, "fullname": "Angela Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/85773?format=json", "institution": "National University of Singapore"}], "abstract": "Long streaming video QA remains challenging due to growing visual tokens and limited reasoning length of large language models (LLMs). KV-caching stores the Key-Value (KV) of the historical tokens via LLM prefill and enables training-free and more efficient streaming QA.  However, existing methods cache every one or two frames, causing redundant memory usage and losing fine-grained spatial details within frame or temporal contexts across frames. This paper proposes MuKV, a method that features a multi-grained KV cache compression module and a semi-hierarchical retrieval approach to improve both efficiency and accuracy for long streaming VideoQA. For the offline KV cache, MuKV extracts visual representations at patch-, frame-, and segment-levels. The multiple levels of granularity preserve both local cues and global temporal context, while maintaining efficiency with a dual signal token compression mechanism guided by self-attention and frequency. For online QA, MuKV designs a semi-hierarchical retrieval method to retrieve relevant KV caches for answer generation. Experiments on long-streaming VideoQA benchmarks show that MuKV significantly improves answer accuracy, without the sacrifice of memory and online QA efficiency. Moreover, our compression mechanism alone brings consistent benefits across answer accuracy, memory, and QA efficiency over baselines, showcasing highly effective contribution.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39020", "url": null, "sourceid": 42939, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39018, "uid": "e500b7708a865ec27eef36c33953b06e", "name": "HSI-GPT2: A Dual-Granularity Large Motion Reasoning Model with Diffusion Refinement for Human\u2013Scene Interaction", "authors": [{"id": 102550, "fullname": "Yuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/102550?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 158989, "fullname": "LI XIANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/158989?format=json", "institution": "Tsinghua University"}, {"id": 88509, "fullname": "Yali Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88509?format=json", "institution": "Tsinghua University"}, {"id": 180608, "fullname": "XUEGE HOU", "url": "http://cvpr.thecvf.com/api/miniconf/users/180608?format=json", "institution": "Tsinghua University"}, {"id": 88500, "fullname": "Shengjin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88500?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Unified interpreting and synthesizing human behaviors within 3D environments is vital for advancing spatial intelligence and humanoid robotics. Despite recent advancements (\\textit{e.g.}, HSI-GPT), two fundamental capabilities expected of a unified model\u2014understanding and generation\u2014still lag behind specialist models. This is primarily due to 1) single-granularity codebook overemphasizes low-frequency motion details while neglecting motion semantics, 2) limited decoding capacity of the motion detokenizer which restricts the fidelity of human\u2013scene interactions, 3) only relying on supervised fine-tuning (SFT) failing to capture high-level motion semantics and logical reasoning with an end-to-end mapping. To this issue, we develop HSI-GPT2\u2014a reasoning-enhanced, dual-granularity motion-representational large Scene-Motion-Language model, powered by reinforcement learning (RL) with Chain-of-Thought (CoT) reasoning. First, HSI-GPT2 introduces a \\textbf{Dual-granularity Motion Tokenizer}, DMoTok, which jointly preserves both fine-grained motion details and text-aligned motion semantics for various HSI-related tasks. Further, a \\textbf{motion diffusion decoder} functions as a motion detokenizer, translating deep semantics and detailed features of LLMs into physically grounded human motions. Finally, we curate a \\textbf{Motion Chain-of-Thought} (MoCoT) data engine and extend a Group Relative Policy Optimization (GRPO) paradigm to execute long-horizon and compositionallyrich commands. Results on standard HSI benchmarks confirm the clear superiority of HSI-GPT2 in enhancing interaction quality, semantic alignment, behavioral diversity, and generalization to unseen 3D scenes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39018", "url": null, "sourceid": 43541, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39021, "uid": "2e71e53e5d5315b15a4c18f100a15227", "name": "HQC-NBV: A Hybrid Quantum-Classical View Planning Approach", "authors": [{"id": 183395, "fullname": "Xiaotong Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183395?format=json", "institution": "National University of Defense Technology"}, {"id": 72164, "fullname": "Chang-Wen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/72164?format=json", "institution": "The Hong Kong Polytechnic University"}], "abstract": "Efficient view planning is a fundamental challenge in computer vision and robotic perception, critical for tasks ranging from search and rescue operations to autonomous navigation. While classical approaches, including sampling-based and deterministic methods, have shown promise in planning camera viewpoints for scene exploration, they often struggle with computational scalability and solution optimality in complex settings. This study introduces HQC-NBV, a hybrid quantum-classical framework for view planning that leverages quantum properties to efficiently explore the parameter space while maintaining robustness and scalability. We propose a specific Hamiltonian formulation with multi-component cost terms and a parameter-centric variational ansatz with bidirectional alternating entanglement patterns that capture the hierarchical dependencies between viewpoint parameters. Comprehensive experiments demonstrate that quantum-specific components provide measurable performance advantages. Compared to the classical methods, our approach achieves 7.9-49.2\\% higher exploration efficiency across diverse environments. Our analysis of entanglement architecture and coherence-preserving terms provides insights into the mechanisms of quantum advantage in robotic exploration tasks. This work represents a significant advancement in integrating quantum computing into robotic perception systems, offering a paradigm-shifting solution for various robot vision tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39021", "url": null, "sourceid": 39935, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39023, "uid": "3cfc6b2b7432c074eea2bd2ad6b0851d", "name": "Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow", "authors": [{"id": 180054, "fullname": "Chengxin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180054?format=json", "institution": "Korea Advanced Institute of Science and Technology (KAIST)"}, {"id": 191194, "fullname": "Wonseok Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/191194?format=json", "institution": "Pohang University of Science and Technology"}, {"id": 129062, "fullname": "Chenshuang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129062?format=json", "institution": "Korea Advanced Institute of Science and Technology"}, {"id": 152617, "fullname": "Tae-Hyun Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/152617?format=json", "institution": "KAIST"}], "abstract": "Vision-Language Models (VLMs) have demonstrated strong capability in a wide range of tasks such as visual recognition, document parsing, and visual grounding. Nevertheless, recent works show that while VLMs often manage to capture the correct image region corresponding to the question, they do not necessarily produce the correct answers. In this work, we demonstrate that this misalignment could be attributed to suboptimal information flow within VLMs, where text tokens distribute too much attention to irrelevant visual tokens, leading to incorrect answers. Based on the observation, we show that modulating the information flow during inference can improve the perception capability of VLMs. The idea is that text tokens should only be associated with important visual tokens during decoding, eliminating the interference of irrelevant regions. To achieve this, we propose a token dynamics-based method to determine the importance of visual tokens, where visual tokens that exhibit distinct activation patterns during different decoding stages are viewed as important. We apply our approach to representative open-source VLMs and evaluate on various datasets including visual question answering, visual grounding and counting, optical character recognition, and object hallucination. The results show that our approach significantly improves the performance of baselines. The code will be made available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39023", "url": null, "sourceid": 33208, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39024, "uid": "8ed02495f7499c010a3b22c830438ec2", "name": "Action Motifs: Self-Supervised Hierarchical Representation of Human Body Movements", "authors": [{"id": 133363, "fullname": "Genki Kinoshita", "url": "http://cvpr.thecvf.com/api/miniconf/users/133363?format=json", "institution": "Kyoto University"}, {"id": 77132, "fullname": "Shu Nakamura", "url": "http://cvpr.thecvf.com/api/miniconf/users/77132?format=json", "institution": "Kyoto University"}, {"id": 164070, "fullname": "Ryo Kawahara", "url": "http://cvpr.thecvf.com/api/miniconf/users/164070?format=json", "institution": "Kyoto University"}, {"id": 128950, "fullname": "Shohei Nobuhara", "url": "http://cvpr.thecvf.com/api/miniconf/users/128950?format=json", "institution": "Kyoto Institute of Technology"}, {"id": 136346, "fullname": "Yasutomo Kawanishi", "url": "http://cvpr.thecvf.com/api/miniconf/users/136346?format=json", "institution": "Nagoya University; Nara Institute of Science and Technology, Japan; RIKEN"}, {"id": 88265, "fullname": "Ko Nishino", "url": "http://cvpr.thecvf.com/api/miniconf/users/88265?format=json", "institution": "Kyoto University"}], "abstract": "Effective human behavior modeling requires a representation of the human body movement that capitalizes on its compositionality. We propose a hierarchical representation consisting of Action Atoms which capture the atomic joint movements and Action Motifs which are formed by their temporal compositions and encode similar body movements found across different overall human actions. We derive A4Mer, a nested latent Transformer to learn this hierarchical representation from human pose data in a fully self-supervised manner. A4Mer splits a 3D pose sequence into variable-length segments and represents each segment as a single latent token (Action Atoms). Through bottom-up representation learning, temporal patterns composed of these Action Atoms, which capture meaningful temporal spans of reusable, semantic segments of body movements, naturally emerge (Action Motifs). A4Mer achieves this with a unified pretext task of masked token prediction in their respective latent spaces. We also introduce Action Motif Dataset (AMD), a large-scale dataset of multiview human behavior videos with full SMPL annotations. We introduce a novel use of cameras by mounting them on the feet to achieve their frame-wise annotations despite frequent and heavy body occlusions. Experimental results demonstrate the effectiveness of A4Mer for extracting meaningful Action Atoms and Action Motifs that significantly benefit human behavior modeling tasks including action recognition, motion prediction and synthesis.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39024", "url": null, "sourceid": 30618, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39027, "uid": "18bf4a47d5777efe47c421645d19f2f1", "name": "Enhancing Accuracy of Uncertainty Estimation in Appearance-based Gaze Tracking with Probabilistic Evaluation and Calibration", "authors": [{"id": 191198, "fullname": "Qiaojie Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191198?format=json", "institution": "Colorado School of Mines"}, {"id": 183488, "fullname": "Jiucai Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183488?format=json", "institution": "Waste Management"}, {"id": 191199, "fullname": "Amy Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191199?format=json", "institution": null}, {"id": 191200, "fullname": "Xiaoli Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191200?format=json", "institution": "Colorado School of Mines"}], "abstract": "Accurate uncertainty estimation is essential for reliable appearance-based gaze tracking. However, domain shifts between training and testing often lead to incorrect uncertainty estimates, which is a problem overlooked in existing uncertainty-aware gaze tracking models. To overcome this problem efficiently, we formulate uncertainty estimation as a conditional distribution problem and treat the correction process as an output-level conditional distribution matching task. We therefore introduce a data-efficient post-hoc calibration method to align the predicted, high-error conditional distribution with the empirically observed distribution extracted from a small set of calibration samples. To more faithfully assess the accuracy of the resulting uncertainty estimates, we further introduce a new metric, Coverage Probability Error (CPE), to quantify the distribution-level mismatch between prediction and observation. We validate the calibration procedure across four domain shift scenarios to demonstrate improved uncertainty accuracy and its practical benefits.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39027", "url": null, "sourceid": 32734, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39029, "uid": "59c53d894d899733cf74c51da615234c", "name": "Training-free Mixed-Resolution Latent Upsampling for Spatially Accelerated Diffusion Transformers", "authors": [{"id": 100177, "fullname": "Wongi Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/100177?format=json", "institution": "Seoul National University"}, {"id": 155979, "fullname": "Kyungryeol Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/155979?format=json", "institution": "Seoul National University"}, {"id": 147333, "fullname": "Hoigi Seo", "url": "http://cvpr.thecvf.com/api/miniconf/users/147333?format=json", "institution": "Seoul National University"}, {"id": 87674, "fullname": "Se Young Chun", "url": "http://cvpr.thecvf.com/api/miniconf/users/87674?format=json", "institution": "Seoul National University"}], "abstract": "Diffusion transformers (DiTs) offer excellent scalability for high-fidelity generation, but their computational overhead poses a great challenge for practical deployment. Existing acceleration methods primarily exploit the temporal dimension, whereas spatial acceleration remains underexplored. In this work, we investigate spatial acceleration for DiTs via latent upsampling. We found that na\\\"ive latent upsampling for spatial acceleration introduces artifacts, primarily due to aliasing in high-frequency edge regions and mismatching from noise-timestep discrepancies. Then, based on these findings and analyses, we propose a training-free spatial acceleration framework, dubbed Region-Adaptive Latent Upsampling (RALU), to mitigate those artifacts while achieving spatial acceleration of DiTs by our mixed-resolution latent upsampling. RALU achieves artifact-free, efficient acceleration with early upsampling only on artifact-prone edge regions and noise-timestep matching for different latent resolutions, leading to up to 7.0$\\times$ speedup on FLUX-1.dev and 3.0$\\times$ on Stable Diffusion 3 with negligible quality degradation. Furthermore, our RALU is complementarily applicable to existing temporal acceleration methods and timestep-distilled models, leading to up to 15.9$\\times$ speedup.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39029", "url": null, "sourceid": 34797, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39033, "uid": "50265c271897a3399857d05aee3df4f5", "name": "MM-ReCoder: Advancing Chart-to-Code Generation with Reinforcement Learning and Self-Correction", "authors": [{"id": 181967, "fullname": "Zitian Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181967?format=json", "institution": "Brown University"}, {"id": 151035, "fullname": "Xu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151035?format=json", "institution": "Amazon"}, {"id": 88537, "fullname": "Jianbo Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88537?format=json", "institution": "Bytedance"}, {"id": 88963, "fullname": "Yang Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/88963?format=json", "institution": "Amazon"}, {"id": 191213, "fullname": "Varad Gunjal", "url": "http://cvpr.thecvf.com/api/miniconf/users/191213?format=json", "institution": "Amazon Web Services"}, {"id": 191214, "fullname": "Songyao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191214?format=json", "institution": "Amazon"}, {"id": 88703, "fullname": "Davide Modolo", "url": "http://cvpr.thecvf.com/api/miniconf/users/88703?format=json", "institution": "Amazon"}], "abstract": "Multimodal Large Language Models (MLLMs) have recently demonstrated promising capabilities in multimodal coding tasks such as chart-to-code generation. However, existing methods primarily rely on supervised fine-tuning (SFT), which requires the model to learn code patterns through chart-code pairs but does not reveal the model in a code execution environment. Moreover, while self-correction through execution feedback offers a potential route to improve coding quality, even state-of-the-art MLLMs have been shown to struggle with effective self-correction. In this work, we introduce MM-ReCoder, a chart-to-code generation model trained with reinforcement learning (RL) and equipped with self-correction ability. We propose a two-stage multi-turn self-correction RL strategy based on Group Relative Policy Optimization (GRPO) . The first stage enhances the model's self-correction ability via rolling out a shared first turn, while the second stage improves the coding capability with full-trajectory optimization. MM-ReCoder learns to produce more accurate and executable code through the interaction with the environment and by iteratively correcting its own outputs. Our results on three chart-to-code benchmarks demonstrate the state-of-the-art performance of MM-ReCoder.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39033", "url": null, "sourceid": 43785, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39039, "uid": "62b395526e160d2e25c4b910ea419a90", "name": "WonderZoom: Multi-Scale 3D World Generation", "authors": [{"id": 144560, "fullname": "Jin Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/144560?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 85365, "fullname": "Hong-Xing Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85365?format=json", "institution": "Computer Science Department, Stanford University"}, {"id": 84531, "fullname": "Jiajun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84531?format=json", "institution": "Stanford University"}], "abstract": "We present WonderZoom, a novel approach to generating 3D scenes with contents across multiple spatial scales from a single image. Existing 3D world generation models remain limited to single-scale synthesis and cannot produce coherent scene contents at varying granularities. The fundamental challenge is the lack of a scale-aware 3D representation capable of generating and rendering content with largely different spatial sizes. WonderZoom addresses this through two key innovations: (1) scale-adaptive Gaussian surfels for generating and real-time rendering of multi-scale 3D scenes, and (2) a progressive detail synthesizer that iteratively generates finer-scale 3D contents. Our approach enables users to ``zoom into'' a 3D region and auto-regressively synthesize previously non-existent fine details from landscapes to microscopic features. Experiments demonstrate that WonderZoom significantly outperforms state-of-the-art video and 3D models in both quality and alignment, enabling multi-scale 3D world creation from a single image. We show video results and an interactive viewer of generated multi-scale 3D worlds on the website of the supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39039", "url": null, "sourceid": 36911, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39040, "uid": "15c8f2ffbbc0f143001b1edb6b0759b5", "name": "V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties", "authors": [{"id": 104516, "fullname": "Ye Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/104516?format=json", "institution": "Fudan University"}, {"id": 76203, "fullname": "Tong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76203?format=json", "institution": "Stanford University"}, {"id": 129565, "fullname": "Valentin Deschaintre", "url": "http://cvpr.thecvf.com/api/miniconf/users/129565?format=json", "institution": "Adobe Research"}, {"id": 86074, "fullname": "Duygu Ceylan", "url": "http://cvpr.thecvf.com/api/miniconf/users/86074?format=json", "institution": "Adobe Systems"}, {"id": 127644, "fullname": "Iliyan Georgiev", "url": "http://cvpr.thecvf.com/api/miniconf/users/127644?format=json", "institution": "Adobe"}, {"id": 86072, "fullname": "Chun-Hao P. Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86072?format=json", "institution": "Adobe Systems"}, {"id": 153091, "fullname": "Yiwei Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153091?format=json", "institution": "Adobe Systems"}, {"id": 88780, "fullname": "Xuelin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88780?format=json", "institution": "Adobe Research"}, {"id": 71652, "fullname": "Tuanfeng Y. Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/71652?format=json", "institution": "Adobe Systems"}], "abstract": "Large-scale video generation models have shown remarkable potential in modeling photorealistic appearance and lighting interactions in real-world scenes. However, a closed-loop framework that jointly understands intrinsic scene properties (e.g., albedo, normal, material, and irradiance), leverages them for video synthesis, and supports editable intrinsic representations remains unexplored. We present V-RGBX, the first end-to-end framework for intrinsic-aware video editing. V-RGBX unifies three key capabilities: (1) video inverse rendering into intrinsic channels, (2) photorealistic video synthesis from these intrinsic representations, and (3) keyframe-based video editing conditioned on intrinsic channels. At the core of V-RGBX is an interleaved conditioning mechanism that enables intuitive, physically grounded video editing through user-selected keyframes, supporting flexible manipulation of any intrinsic modality. Extensive qualitative and quantitative results show that V-RGBX produces temporally consistent, photorealistic videos while propagating keyframe edits across sequences in a physically plausible manner. We demonstrate its effectiveness in diverse applications, including object appearance editing and scene-level relighting, surpassing the performance of prior methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39040", "url": null, "sourceid": 40607, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39047, "uid": "463ea0572c257e83990249ebddbb8ccb", "name": "Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization", "authors": [{"id": 126583, "fullname": "Liao Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126583?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 191250, "fullname": "Wentao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191250?format=json", "institution": null}, {"id": 191251, "fullname": "Yiran Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191251?format=json", "institution": "Alibaba Group"}, {"id": 191252, "fullname": "Jiahe Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191252?format=json", "institution": "Alibaba Group"}, {"id": 90176, "fullname": "Tiezheng Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/90176?format=json", "institution": "Alibaba Group"}, {"id": 74017, "fullname": "Zhiguo Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/74017?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 152921, "fullname": "Bo Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/152921?format=json", "institution": "Alibaba Group"}], "abstract": "Recent advances in image-to-video (I2V) generation have achieved remarkable progress in synthesizing high-quality, temporally coherent videos from static images. Among all the applications of I2V, human-centric video generation includes a large portion. However, existing I2V models encounter difficulties in maintaining identity consistency between the input human image and the generated video, especially when the person in the video exhibits significant expression changes and movements. This issue becomes critical when the human face occupies merely a small fraction of the image. Since humans are highly sensitive to identity variations, this poses a critical yet under-explored challenge in I2V generation. In this paper, we propose Identity-Preserving Reward-guided Optimization (IPRO), a novel video diffusion framework based on reinforcement learning to enhance identity preservation. Instead of introducing auxiliary modules or altering model architectures, our approach introduces a direct and effective tuning algorithm that optimizes diffusion models using a face identity scorer. To improve performance and accelerate convergence, our method backpropagates the reward signal through the last steps of the sampling chain, enabling richer gradient feedback. We also propose a novel facial scoring mechanism that treats faces in ground-truth videos as facial feature pools, providing multi-angle facial information to enhance generalization. A KL-divergence regularization is further incorporated to stabilize training and prevent overfitting to the reward signal. Extensive experiments on Wan 2.2 I2V model and our in-house I2V model demonstrate the effectiveness of our method. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39047", "url": null, "sourceid": 37694, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39043, "uid": "92df9a438175ad64dbd06312acf13ade", "name": "Test-Time Training for LiDAR Semantic Segmentation under Corruption via Geometric Inlier Discrimination", "authors": [{"id": 87934, "fullname": "Hyeonseong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/87934?format=json", "institution": "KAIST"}, {"id": 70833, "fullname": "Hyun-Kurl Jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70833?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 76867, "fullname": "Kuk-Jin Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/76867?format=json", "institution": "KAIST"}], "abstract": "LiDAR semantic segmentation must remain robust under various sensor and environmental corruptions to be reliable in safety-critical applications.Existing test-time adaptation methods, including approaches based on pseudo-labels and normalization statistics, have shown promising results but can still struggle under severe distribution shifts.To complement these approaches, we propose a geometry-aware test-time training framework that leverages an auxiliary self-supervised objective.Our method is based on geometric inlier discrimination (GeoID), which injects synthetic off-manifold points into the input and trains the model to distinguish geometry-consistent inliers from synthetically displaced outliers, enabling adaptation on unlabeled test data.To further stabilize this process under real corruptions, we introduce bidirectional unreliable point filtering (BiUPF), which uses inlier scores from the source-trained model to filter out unreliable regions on both original and synthetic points, focusing updates on high-confidence samples.Experiments on two large-scale corruption benchmarks, SemanticKITTI-C and nuScenes-C, show that our method consistently outperforms strong test-time adaptation baselines and improves robustness across diverse LiDAR corruptions.Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39043", "url": null, "sourceid": 33246, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39048, "uid": "bc9e8957f418b0f2a0bb86f026534734", "name": "RigMo: Unifying Rig and Motion Learning for Generative Animation", "authors": [{"id": 138169, "fullname": "Hao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/138169?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 191253, "fullname": "Jiahao Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/191253?format=json", "institution": "University of California, Santa Cruz"}, {"id": 191254, "fullname": "Bohui Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191254?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 131817, "fullname": "Yizhou Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/131817?format=json", "institution": "Carnegie Mellon University"}, {"id": 70574, "fullname": "Zongrui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/70574?format=json", "institution": "Nanyang Technological University"}, {"id": 129079, "fullname": "Michael Vasilkovsky", "url": "http://cvpr.thecvf.com/api/miniconf/users/129079?format=json", "institution": "Snap Inc."}, {"id": 106929, "fullname": "Chaoyang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/106929?format=json", "institution": "Snap Inc"}, {"id": 85791, "fullname": "Jian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85791?format=json", "institution": "Snap Inc."}, {"id": 136101, "fullname": "Narendra Ahuja", "url": "http://cvpr.thecvf.com/api/miniconf/users/136101?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 191255, "fullname": "Bing Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191255?format=json", "institution": "Snap Inc."}], "abstract": "Recent progress in 4D generation has advanced the reconstruction of dynamic geometry, yet the modeling of rig and motion, the two core elements of animation, remains disconnected. Existing approaches typically treat rigging and motion generation as independent tasks: auto-rigging methods rely on human-annotated skeletons and skinning weights, while motion-generation models predict dense vertex trajectories without any explicit structure. This separation contradicts the nature of animation itself, which is the coupled outcome of both structure and motion, and it limits scalability, interpretability, and control.We present RigMo, a unified generative framework that jointly learns rig and motion directly from raw mesh sequences without any rig annotations or human priors. RigMo encodes per-vertex deformations into a compact latent space and decodes a set of implicit Gaussian bones, skinning weights, and time-varying transformations that together define an animatable mesh. This design makes the model animatable by construction: a single latent representation yields both an explicit rig structure and temporally coherent motion parameters. Unlike optimization-based auto-rigging methods that overfit to a specific sequence, RigMo generalizes across object categories and motion styles, offering feed-forward inference for arbitrary deformable objects. Experiments on DeformingThings4D, Objaverse-XL, and diverse human and animal datasets demonstrate that RigMo generates smooth, interpretable, and physically consistent rigs, achieving superior reconstruction and generalization compared to existing 4D generative baselines. RigMo establishes a new paradigm for structure-aware, controllable, and scalable 4D generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39048", "url": null, "sourceid": 34641, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39049, "uid": "23d23cfad64cc753b8c3d43b2042a0d6", "name": "LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning", "authors": [{"id": 151917, "fullname": "Zebin You", "url": "http://cvpr.thecvf.com/api/miniconf/users/151917?format=json", "institution": "Renmin University of China"}, {"id": 86612, "fullname": "Shen Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/86612?format=json", "institution": "Renmin University of China"}, {"id": 73422, "fullname": "Xiaolu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73422?format=json", "institution": "Ant Group"}, {"id": 131054, "fullname": "JUN ZHOU", "url": "http://cvpr.thecvf.com/api/miniconf/users/131054?format=json", "institution": "Ant Group"}, {"id": 128491, "fullname": "Zhiwu Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128491?format=json", "institution": "Renmin University of China"}, {"id": 126218, "fullname": "Ji-Rong Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126218?format=json", "institution": "Renmin University of China"}, {"id": 86585, "fullname": "Chongxuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86585?format=json", "institution": "Renmin University of China"}], "abstract": "In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and MLP connector that projects visual features into the language embedding space, leveraging diffusion language models' bidirectional attention to capture spatial relationships in visual data more effectively than causal, sequential processing. Our empirical investigation reveals several intriguing results: First, LLaDA-V demonstrates promising multimodal performance despite its language model being weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive with LLaMA3-V across multimodal tasks with better data scalability. It also narrows the performance gap to Qwen2-VL, suggesting the effectiveness of its architecture for multimodal tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal understanding compared to existing purely diffusion-based MLLMs. Our findings suggest that large language diffusion models show promise in multimodal contexts and warrant further investigation. To facilitate such research, we will open-source the LLaDA-V model along with its training and evaluation code.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39049", "url": null, "sourceid": 31462, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39050, "uid": "a6cdf7d7accdc2a5afc50c9ce763cbf2", "name": "XR-Poser: Accurate Egocentric Human Motion Estimation for AR/VR", "authors": [{"id": 183153, "fullname": "Zhenyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183153?format=json", "institution": "KAUST"}, {"id": 133012, "fullname": "Sai Kumar Dwivedi", "url": "http://cvpr.thecvf.com/api/miniconf/users/133012?format=json", "institution": "Max Planck Institute for Intelligent Systems, Germany"}, {"id": 191256, "fullname": "Filip Maric", "url": "http://cvpr.thecvf.com/api/miniconf/users/191256?format=json", "institution": "Facebook"}, {"id": 149312, "fullname": "Carlos Chac\u00f3n", "url": "http://cvpr.thecvf.com/api/miniconf/users/149312?format=json", "institution": "Meta"}, {"id": 153296, "fullname": "Nadine Bertsch", "url": "http://cvpr.thecvf.com/api/miniconf/users/153296?format=json", "institution": "Facebook"}, {"id": 153298, "fullname": "Filippo Arcadu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153298?format=json", "institution": "Facebook"}, {"id": 88562, "fullname": "Tomas Hodan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88562?format=json", "institution": "Meta"}, {"id": 154995, "fullname": "Michael Ramamonjisoa", "url": "http://cvpr.thecvf.com/api/miniconf/users/154995?format=json", "institution": "Meta"}, {"id": 75925, "fullname": "Peter Wonka", "url": "http://cvpr.thecvf.com/api/miniconf/users/75925?format=json", "institution": "KAUST"}, {"id": 139332, "fullname": "Amy Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/139332?format=json", "institution": "Meta Reality Labs"}, {"id": 191257, "fullname": "Robin Kips", "url": "http://cvpr.thecvf.com/api/miniconf/users/191257?format=json", "institution": "Research, Facebook"}, {"id": 88527, "fullname": "Cem Keskin", "url": "http://cvpr.thecvf.com/api/miniconf/users/88527?format=json", "institution": "Facebook"}, {"id": 191258, "fullname": "Anastasia Tkach", "url": "http://cvpr.thecvf.com/api/miniconf/users/191258?format=json", "institution": "Meta Reality Labs"}, {"id": 191259, "fullname": "Chenhongyi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191259?format=json", "institution": "Facebook"}], "abstract": "Egocentric 3D human motion estimation is essential for AR/VR experiences, yet remains challenging due to limited body coverage from the egocentric viewpoint, frequent occlusions, and scarce labeled data. We present XR-Poser, a method that addresses these challenges through two key contributions: (1) a transformer-based model for temporally consistent and spatially grounded body pose estimation, and (2) an auto-labeling system that enables the use of large unlabeled datasets for training.The proposed model is fully differentiable, introduces identity-conditioned queries, multi-view spatial refinement, causal temporal attention, and supports both keypoints and parametric body representations under a constant compute budget.The proposed auto-labeling system scales learning to tens of millions of unlabeled frames via uncertainty-aware semi-supervised training. The system follows a teacher\u2013student schema to generate pseudo-labels and guide training with uncertainty distillation, enabling the model to generalize to different environments. In experiments on the EgoBody3M benchmark, XR-Poser outperforms two state-of-the-art methods by 12.2% and 19.4% in accuracy, and reduces temporal jitter by 22.2% and 51.7%, respectively. Furthermore, our auto-labeling system additionally improves the wrist MPJPE by 13.1%.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39050", "url": null, "sourceid": 46699, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39051, "uid": "eaeab49ac116abe88580249769561a21", "name": "Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Models", "authors": [{"id": 160339, "fullname": "Shufan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/160339?format=json", "institution": "UCLA Computer Science Department, University of California, Los Angeles"}, {"id": 187059, "fullname": "Jiuxiang Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187059?format=json", "institution": "Adobe Systems"}, {"id": 76043, "fullname": "Kangning Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76043?format=json", "institution": "NEW YORK UNIVERSITY"}, {"id": 85199, "fullname": "Zhe Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/85199?format=json", "institution": "Adobe Research"}, {"id": 191260, "fullname": "Zijun Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/191260?format=json", "institution": "ByteDance Inc."}, {"id": 131036, "fullname": "Aditya Grover", "url": "http://cvpr.thecvf.com/api/miniconf/users/131036?format=json", "institution": "University of California, Los Angeles"}, {"id": 87401, "fullname": "Jason Kuen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87401?format=json", "institution": "Adobe Research"}], "abstract": "Masked Discrete Diffusion Models (MDMs) have achieved strong performance across a wide range of multimodal tasks, including image understanding, generation, and editing. However, their inference speed remains suboptimal due to the need to repeatedly process redundant masked tokens at every sampling step. In this work, we propose Sparse-LaViDa, a novel modeling framework that dynamically truncates unnecessary masked tokens at each inference step to accelerate MDM sampling. To preserve generation quality, we introduce specialized register tokens that serve as sparse representations for the truncated tokens. Furthermore, to ensure consistency between training and inference, we design a specialized attention mask that faithfully matches the truncated sampling procedure during training. Built upon the state-of-the-art unified MDM LaViDa-O, Sparse-LaViDa achieves up to a 2$\\times$ speedup across diverse tasks including text-to-image generation, image editing, and mathematical reasoning, while maintaining generation quality.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39051", "url": null, "sourceid": 33984, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39052, "uid": "6dea2617ffc5cbe1eccbec05d1ab6492", "name": "FBTA: Enabling Single-GPU End-to-End Gigapixel WSI Classification with Feature Bridging and Translation Alignment", "authors": [{"id": 129416, "fullname": "Jiuyang Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/129416?format=json", "institution": "Harbin Institute of Technology"}, {"id": 129414, "fullname": "Jiahan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129414?format=json", "institution": "Harbin Institute of Technology"}, {"id": 87056, "fullname": "Junjun Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87056?format=json", "institution": "Harbin Institute of Technology"}, {"id": 129438, "fullname": "Yongbing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129438?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Whole-slide images (WSIs) in computational pathology contain billions of pixels, making end-to-end training of feature extractors and multi-instance learning (MIL) networks infeasible on a single commodity GPU.Existing methods often freeze the feature extractor and train MIL networks on the resulting frozen features, which introduces a semantic gap that limits downstream performance. To address this issue, we propose FBTA, a Feature Bridging and Translation Alignment framework for WSI classification. FBTA is the first end-to-end MIL framework trainable on a single 24\\,GB GPU, leveraging three complementary feature-bag views: end-to-end features enable joint optimization, frozen features stabilize training, and translated features support practical inference.Experiments on diverse datasets, including TCGA-NSCLC (Shot20/50/100) and TCGA-STAD, demonstrate the effectiveness and generality of FBTA, which consistently improves performance across three MIL architectures and two extractors. For example, with ResNet-50 as the extractor, FBTA improves the accuracy of the classic ABMIL by 13.1\\% and 15.8\\% on the NSCLC-Shot50 and TCGA-STAD datasets, respectively, and further enhances the state-of-the-art MambaMIL by 4.1\\% and 9.2\\% on the same datasets. Moreover, FBTA yields additional gains for MIL models that incorporate self-supervised pretraining strategies and data augmentation techniques.These results suggest FBTA is a feasible and scalable framework for end-to-end MIL on gigapixel WSIs. The code will be available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39052", "url": null, "sourceid": 34651, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39056, "uid": "51edcfc8df66cc438e0903f7a5e8a1f0", "name": "MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model", "authors": [{"id": 129066, "fullname": "Geonmo Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129066?format=json", "institution": "NAVER"}, {"id": 159012, "fullname": "Byeongho Heo", "url": "http://cvpr.thecvf.com/api/miniconf/users/159012?format=json", "institution": "NAVER AI Lab"}, {"id": 185695, "fullname": "Jaemyung Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185695?format=json", "institution": "NAVER AI Lab"}, {"id": 191268, "fullname": "Jaehui Hwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191268?format=json", "institution": "NAVER"}, {"id": 159013, "fullname": "Taekyung Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/159013?format=json", "institution": "NAVER AI Lab"}, {"id": 181823, "fullname": "Sangmin Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/181823?format=json", "institution": "Korea University"}, {"id": 135043, "fullname": "HeeJae Jun", "url": "http://cvpr.thecvf.com/api/miniconf/users/135043?format=json", "institution": "NAVER"}, {"id": 129103, "fullname": "Yoohoon Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129103?format=json", "institution": "NAVER"}, {"id": 129088, "fullname": "Sangdoo Yun", "url": "http://cvpr.thecvf.com/api/miniconf/users/129088?format=json", "institution": "NAVER"}, {"id": 90646, "fullname": "Dongyoon Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/90646?format=json", "institution": "NAVER Corp, CLOVA AI."}], "abstract": "Universal Multimodal embedding models built on Multimodal Large Language Models (MLLMs) have traditionally employed contrastive learning, which aligns representations of query-target pairs across different modalities. Yet, despite its empirical success, they are primarily built on a \"single-turn\" formulation where each query-target pair is treated as an independent data point. This paradigm leads to computational inefficiency when scaling, as it requires a separate forward pass for each pair and overlooks potential contextual relationships between multiple queries that can relate to the same context. In this work, we introduce Multi-Turn Contrastive Learning (MuCo), a dialogue-inspired framework that revisits this process. MuCo leverages the conversational nature of MLLMs to process multiple, related query-target pairs associated with a single image within a single forward pass. This allows us to extract a set of multiple query and target embeddings simultaneously, conditioned on a shared context representation, amplifying the effective batch size and overall training efficiency. Experiments exhibit MuCo with a newly curated 5M multimodal multi-turn dataset (M3T), which yields state-of-the-art retrieval performance on MMEB and M-BEIR benchmarks, while markedly enhancing both training efficiency and representation coherence across modalities.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39056", "url": null, "sourceid": 45145, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39059, "uid": "455c8959885c1b38871319571e9ab72c", "name": "PE3R: Perception-Efficient 3D Reconstruction", "authors": [{"id": 88408, "fullname": "Jie Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88408?format=json", "institution": "Xiamen University"}, {"id": 127162, "fullname": "Shizun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127162?format=json", "institution": "National University of Singapore"}, {"id": 87323, "fullname": "Xinchao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87323?format=json", "institution": "National University of Singapore"}], "abstract": "Recent advances in 2D-to-3D perception have enabled the recovery of 3D scene semantics from unposed images. However, prevailing methods often suffer from limited generalization, reliance on per-scene optimization, and semantic inconsistencies across viewpoints. To address these limitations, we introduce PE3R, a tuning-free framework for efficient and generalizable 3D semantic reconstruction. By integrating multi-view geometry with 2D semantic priors in a feed-forward pipeline, PE3R achieves zero-shot generalization across diverse scenes and object categories without any scene-specific fine-tuning. Extensive evaluations on open-vocabulary segmentation and multi-view depth estimation show that PE3R not only achieves up to 9$\\times$ faster inference but also sets new state-of-the-art accuracy in both semantic and geometric metrics. Our approach paves the way for scalable, language-driven 3D scene understanding. Code is available in supplementary material for reproducibility.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39059", "url": null, "sourceid": 32677, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39061, "uid": "9f26e2ca32a8d8893b0e3c1393b649f2", "name": "Improving Calibration in Test-Time Prompt Tuning for Vision-Language Models via Data-Free Flatness-Aware Prompt Pretraining", "authors": [{"id": 182527, "fullname": "Hyeonseo Jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182527?format=json", "institution": "Yonsei University"}, {"id": 183052, "fullname": "Jaebyeong Jeon", "url": "http://cvpr.thecvf.com/api/miniconf/users/183052?format=json", "institution": "Yonsei University"}, {"id": 164093, "fullname": "Joong-won Hwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/164093?format=json", "institution": "ETRI"}, {"id": 138997, "fullname": "Kibok Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/138997?format=json", "institution": "Yonsei University"}], "abstract": "Test-time prompt tuning (TPT) has emerged as a promising technique for enhancing the adaptability of vision-language models by optimizing textual prompts using unlabeled test data. However, prior studies have revealed that TPT often produces poorly calibrated models, raising concerns about the reliability of their predictions. Recent works address this issue by incorporating additional regularization terms that constrain model outputs, which improve calibration but often degrade performance. In this work, we reveal that these regularization strategies implicitly encourage optimization toward flatter minima, and that the sharpness of the loss landscape around adapted prompts is a key factor governing calibration quality. Motivated by this observation, we introduce Flatness-aware Prompt Pretraining (FPP), a simple yet effective pretraining framework for TPT that initializes prompts within flatter regions of the loss landscape prior to adaptation. We show that simply replacing the initialization in existing TPT pipelines---without modifying any other components---is sufficient to improve both calibration and performance. Notably, FPP requires no labeled data and avoids any additional computational costs during test-time tuning, making it highly practical for real-world deployment. The code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39061", "url": null, "sourceid": 38122, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39068, "uid": "7bc99e67774703e7cfa09c0375b4010f", "name": "Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos", "authors": [{"id": 182088, "fullname": "Matthew Strong", "url": "http://cvpr.thecvf.com/api/miniconf/users/182088?format=json", "institution": "Stanford University"}, {"id": 103419, "fullname": "Wei-Jer Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/103419?format=json", "institution": "University of California, Berkeley"}, {"id": 130161, "fullname": "Quentin HERAU", "url": "http://cvpr.thecvf.com/api/miniconf/users/130161?format=json", "institution": "Huawei/University of Burgundy"}, {"id": 191294, "fullname": "Jiezhi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191294?format=json", "institution": "Applied Intuition, Inc."}, {"id": 96453, "fullname": "Yihan Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/96453?format=json", "institution": "Applied Intuition"}, {"id": 153960, "fullname": "Chensheng Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/153960?format=json", "institution": "University of California, Berkeley"}, {"id": 88181, "fullname": "Wei Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88181?format=json", "institution": "University of California Berkeley"}], "abstract": "Ego-centric driving videos available online provide an abundant source of visual data for autonomous driving, yet their lack of annotations makes it difficult to learn representations that capture both semantic structure and 3D geometry. Recent advances in large feedforward spatial models demonstrate that point maps and ego-motion can be inferred in a single forward pass, suggesting a promising direction for scalable driving perception. We therefore propose a label-free, teacher-guided framework for learning autonomous driving representations directly from unposed videos. Unlike prior self-supervised approaches that focus primarily on frame-to-frame consistency, we posit that safe and reactive driving depends critically on temporal context. To this end, we leverage a feedforward architecture equipped with a lightweight autoregressive module, trained using multi-modal supervisory signals that guide the model to jointly predict current and future point maps, camera poses, semantic layouts, and motion masks. Multi-modal teachers provide sequence-level pseudo-supervision, enabling LFG to learn a unified pseudo-4D representation from raw YouTube videos without poses, labels, or LiDAR. The resulting encoder not only transfers effectively to downstream autonomous driving planning on the NAVSIM benchmark, surpassing multi-camera and LiDAR baselines with only a single monocular camera, but also yields strong performance when evaluated on a range of semantic, geometric, and motion prediction tasks. These geometry- and motion-aware features position LFG as a compelling video-centric foundation model for autonomous driving.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39068", "url": null, "sourceid": 39417, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39072, "uid": "3ef66143a4e5311332f23b364425643e", "name": "Text-Phase Synergy Network with Dual Priors for Unsupervised Cross-Domain Image Retrieval", "authors": [{"id": 181223, "fullname": "Jing Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181223?format=json", "institution": "Southeast University"}, {"id": 186839, "fullname": "Hui Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/186839?format=json", "institution": "Southeast University"}, {"id": 186836, "fullname": "Shipeng Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186836?format=json", "institution": null}, {"id": 186838, "fullname": "Pengfei Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186838?format=json", "institution": "Southeast University"}], "abstract": "This paper studies unsupervised cross-domain image retrieval (UCDIR), which aims to retrieve images of the same category across different domains without relying on labeled data. Existing methods typically utilize pseudo-labels, derived from clustering algorithms, as supervisory signals for intra-domain representation learning and cross-domain feature alignment. However, these discrete pseudo-labels often fail to provide accurate and comprehensive semantic guidance. Moreover, the alignment process frequently overlooks the entanglement between domain-specific and semantic information, leading to semantic degradation in the learned representations and ultimately impairing retrieval performance. This paper addresses the limitations by proposing a Text-Phase Synergy Network with Dual Priors (TPSNet). Specifically, we first employ CLIP to generate a set of class-specific prompts per domain, termed as domain prompt, serving as a text prior that offers more precise semantic supervision. In parallel, we further introduce a phase prior, represented by domain-invariant phase features, which is integrated into the original image representations to bridge the domain distribution gaps while preserving semantic integrity. Leveraging the synergy of these dual priors, TPSNet significantly outperforms state-of-the-art methods on both the Office-Home and DomainNet datasets. Notably, when using ViT-B as the image encoder, TPSNet achieves an average $P@15$ improvement of 24.48\\% on Office-Home and a 13.86\\% improvement in $P@200$ on DomainNet.  The code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39072", "url": null, "sourceid": 36060, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39074, "uid": "26e31ab36807914055cf505c63c05bd1", "name": "GaussianPile: A Unified Sparse Gaussian Splatting Framework for Slice-based Volumetric Reconstruction", "authors": [{"id": 173077, "fullname": "Di Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/173077?format=json", "institution": "Tsinghua University"}, {"id": 184477, "fullname": "Yikai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184477?format=json", "institution": "Tsinghua University"}, {"id": 191307, "fullname": "Wenjie Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/191307?format=json", "institution": null}, {"id": 191308, "fullname": "Yifan Bu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191308?format=json", "institution": "Tsinghua University"}, {"id": 157973, "fullname": "Boya Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157973?format=json", "institution": "Nankai University"}, {"id": 191309, "fullname": "Yuexin Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191309?format=json", "institution": "Nankai University"}, {"id": 191310, "fullname": "Xiawei Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/191310?format=json", "institution": "Nankai University"}, {"id": 191311, "fullname": "Wenbiao Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/191311?format=json", "institution": "Beijing Institute of Technology"}, {"id": 191312, "fullname": "Yiman Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/191312?format=json", "institution": "Beihang University"}, {"id": 191313, "fullname": "Yuwen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191313?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 191314, "fullname": "Cheng Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/191314?format=json", "institution": "Tsinghua University"}], "abstract": "Slice-based volumetric imaging is widely applied and it demands representations that compress aggressively while preserving internal structure for analysis. This paper introduces GaussianPile, unifying 3D Gaussian splatting with an imaging system-aware focus model to address this challenge. Our new method introduces three key innovations: (i) a slice\u2011aware piling strategy that positions anisotropic 3D Gaussians to model through\u2011slice contributions, (ii) a differentiable projection operator that encodes the finite\u2011thickness point spread function of the imaging acquisition system, and (iii) a compact encoding and joint optimization pipeline that simultaneously reconstructs and compresses the Gaussian sets. Our CUDA-based design retains the compression and real\u2011time rendering efficiency of Gaussian primitives while preserving high\u2011frequency internal volumetric detail. Experiments on microscopy and ultrasound datasets demonstrate that our method reduces storage and reconstruction cost, sustains diagnostic fidelity, and enables fast 2D visualization, along with 3D voxelization. In practice, it delivers high-quality results in as few as $3$ minutes\u2014up to $11\\times$ faster than NeRF-based approaches\u2014and achieves consistent $16\\times$ compression over the original voxel grids, offering a practical path to deployable compression and exploration of slice-based volumetric datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39074", "url": null, "sourceid": 39997, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39077, "uid": "e0a0a80333d75c4728e228b869e94579", "name": "Guiding a Diffusion Transformer with the Internal Dynamics of Itself", "authors": [{"id": 101988, "fullname": "Xingyu Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/101988?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 191319, "fullname": "Qifan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191319?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 152689, "fullname": "Xiaobin Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152689?format=json", "institution": "Tencent AI Lab"}, {"id": 149754, "fullname": "Hai Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/149754?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 86553, "fullname": "Shuhang Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86553?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "The diffusion model presents a powerful ability to obtian the entire (conditional) data distribution.However, due to the lack of sufficient training and data to learn, the model will be penalized for failing to cover low-probability areas.To achieve better generation quality, guidance strategies such as classifier free guidance (CFG) can guide the samples to the high-probability areas during the sampling stage.However, the standard CFG often leads to over-simplified or distorted samples. And the alternative line of guiding diffusion model with its bad version is limited by carefully designed degradation strategies, extra training and additional sampling steps. In this paper, we proposed a simple yet effective strategy Internal Guidance (IG), which introduces an auxiliary supervision on the intermediate layer during training process and extrapolates the intermediate and deep layer's outputs to obtain generative results during sampling process.This simple strategy yields significant improvements in both training efficiency and generation quality on DiTs and SiTs.On ImageNet 256\u00d7256, SiT-XL/2+IG achieves FID=5.31 and FID=1.88 which already exceeds the FID of the vanilla SiT-XL and REPA.More impressively, LightningDiT-XL/1+IG achieves FID=1.41 which achieves a large margin between all of these methods.Combined with classifier free guidance, LightningDiT-XL/1+IG achieves the current state-of-the-art FID of 1.23.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39077", "url": "https://zhouxingyu13.github.io/Internal-Guidance/", "sourceid": 43626, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39081, "uid": "f24225f22f15bced3886d8fa30b6cca3", "name": "ReasonMap: Towards Fine-Grained Visual Reasoning from Transit Maps", "authors": [{"id": 180212, "fullname": "Sicheng Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180212?format=json", "institution": "National University of Singapore"}, {"id": 70670, "fullname": "Song Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70670?format=json", "institution": "Zhejiang University"}, {"id": 180740, "fullname": "Shuyi Ouyang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180740?format=json", "institution": "Zhejiang University"}, {"id": 76351, "fullname": "Lingdong Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76351?format=json", "institution": "National University of Singapore"}, {"id": 152183, "fullname": "Zikai Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/152183?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 89309, "fullname": "Jianke Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89309?format=json", "institution": "Zhejiang University"}, {"id": 87566, "fullname": "Huan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87566?format=json", "institution": "Northeastern University"}, {"id": 87323, "fullname": "Xinchao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87323?format=json", "institution": "National University of Singapore"}], "abstract": "Multimodal large language models (MLLMs) have demonstrated significant progress in semantic scene understanding and text-image alignment, with reasoning variants enhancing performance on more complex tasks involving mathematics and logic. However, their proficiency in tasks requiring both fine-grained visual understanding and spatial reasoning remains underexplored.To bridge this gap, we introduce ReasonMap, a novel benchmark specifically designed to evaluate these capabilities. ReasonMap encompasses high-resolution transit maps from 30 cities and includes 1,008 question-answer pairs spanning two question types and three templates. Furthermore, we design a two-level evaluation pipeline that properly assesses answer correctness and quality. Our comprehensive evaluation of 16 popular MLLMs reveals a counterintuitive pattern: among open-source models, base variants outperform their reasoning-tuned counterparts, whereas the opposite trend is observed in closed-source models. Further analysis under the visual-masking setting confirms that strong performance necessitates direct visual grounding, rather than relying solely on language priors. We further establish a training baseline with reinforcement fine-tuning, providing a reference for future exploration. We hope this benchmark study offers new insights into visual reasoning and helps investigate the gap between open- and closed-source models. Code and data samples are in the Supplementary.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39081", "url": null, "sourceid": 43485, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39076, "uid": "89bccf7525bbf3cfae49cef1edd7932a", "name": "Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty", "authors": [{"id": 183731, "fullname": "ManGyu Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/183731?format=json", "institution": "Yonsei University"}, {"id": 191318, "fullname": "Jaewon Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/191318?format=json", "institution": "Yonsei University"}, {"id": 190097, "fullname": "Seongwon Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/190097?format=json", "institution": "Kookmin University"}, {"id": 90082, "fullname": "Euntai Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/90082?format=json", "institution": "Yonsei University"}], "abstract": "3D Gaussian Splatting (3DGS) has recently emerged as a powerful scene representation and is increasingly used for visual localization and pose refinement. However, despite its high-quality differentiable rendering, the robustness of 3DGS-based pose refinement remains highly sensitive to both the initial camera pose and the reconstructed geometry. In this work, we take a closer look at these limitations and identify two major sources of uncertainty: (i) pose prior uncertainty, which often arises from regression or retrieval models that output a single deterministic estimate, and (ii) geometric uncertainty, caused by imperfections in the 3DGS reconstruction that propagate errors into PnP solvers. Such uncertainties can distort reprojection geometry and destabilize optimization, even when the rendered appearance still looks plausible.To address these uncertainties, we introduce a relocalization framework that combines Monte Carlo pose sampling with Fisher Information\u2013based PnP optimization. Our method explicitly accounts for both pose and geometric uncertainty and requires no retraining or additional supervision. Across diverse indoor and outdoor benchmarks, our approach consistently improves localization accuracy and significantly increases stability under pose and depth noise.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39076", "url": null, "sourceid": 37960, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39082, "uid": "54b26778c1725894f48393baf84d3b30", "name": "Boosting Vision-Language Models Towards Cross-Domain Incremental Object Detection", "authors": [{"id": 151704, "fullname": "Xu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151704?format=json", "institution": "University of Science and Technology of China"}, {"id": 157926, "fullname": "Zihan Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/157926?format=json", "institution": "University of Science and Technology of China"}, {"id": 91768, "fullname": "Yixin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91768?format=json", "institution": "University of Science and Technology of China"}, {"id": 88349, "fullname": "Zilei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88349?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Incremental Object Detection (IOD) aims to equip detectors with the ability to handle dynamic environments and emerging object categories, and the rise of vision-language models has substantially advanced this goal. However, existing studies often oversimplify real-world scenarios by assuming the incremental tasks come from a single general domain. To better investigate vision-language models under IOD, it is necessary to explore more generalized scenarios that encompass both novel categories and domains. To this end, we propose Cross-Domain Incremental Object Detection (CDIOD), a new benchmark that assesses the ability to continuously adapt to diverse object detection tasks across domains. CDIOD reveals that existing methods struggle to balance between adaptivity and stability under substantial domain shifts. To tackle this challenge, we propose Dynamic Group Subspace (DGS), a novel framework that dynamically groups tasks by distribution to promote knowledge sharing and prevent task collisions; progressively consolidates adapters to build shared subspaces and control parameter growth; and implements a dynamic training pipeline to maintain a proper stability-adaptivity balance. DGS enables vision-language models to effectively handle task streams of various distribution shifts. Extensive experiments across three benchmarks demonstrate that DGS achieves state-of-the-art performance, highlighting its robustness in diverse incremental learning scenarios.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39082", "url": null, "sourceid": 36246, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39083, "uid": "3367366e0bff001c5cfb5aedd10d8e31", "name": "Curriculum Group Policy Optimization: Adaptive Sampling for Unleashing the Potential of Text-to-Image Generation", "authors": [{"id": 170929, "fullname": "Baoteng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/170929?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 191325, "fullname": "Xianghao Zang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191325?format=json", "institution": "China Telecom"}, {"id": 159569, "fullname": "Xinran Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/159569?format=json", "institution": "Beijing University of Posts and Telecommunications; Beijing University of Posts and Telecommunications"}, {"id": 191326, "fullname": "Xiangyu Na", "url": "http://cvpr.thecvf.com/api/miniconf/users/191326?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 191327, "fullname": "Zhixiang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/191327?format=json", "institution": "China Telecom"}, {"id": 90160, "fullname": "Hao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/90160?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 152943, "fullname": "Chi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152943?format=json", "institution": "China Telecom"}, {"id": 191328, "fullname": "Zhongjiang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/191328?format=json", "institution": "ChinaTelecom"}, {"id": 191329, "fullname": "Tianwei Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191329?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 90447, "fullname": "Kongming Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90447?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 90236, "fullname": "Zhanyu Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/90236?format=json", "institution": "Beijing University of Post and Telecommunication"}], "abstract": "Text-to-Image (T2I) generation technology has achieved remarkable progress in recent years. Concurrently, reinforcement learning methods, particularly those based on Group Relative Policy Optimization (GRPO), have attracted widespread attention and have been successfully applied to T2I tasks. However, the uniform sampling strategy commonly adopted during training often ignores the match between sample difficulty and the model\u2019s current learning capability, leading to low training efficiency. We argue that the key to unleashing the model\u2019s potential lies in continuously providing ``high-value samples'' that match its evolving competence. To this end, we propose Curriculum Group Policy Optimization (CGPO), an adaptive curriculum training framework. During training, each prompt is used to generate a group of images, and a reward model assigns a reward to each image. We use the variance of these rewards as a proxy indicator\u2014higher variance implies the model's understanding of the prompt is still unstable, indicating stronger learnability and thus higher value. CGPO adaptively constructs the curriculum by dynamically identifying and selecting high-value samples for training based on reward variance. Additionally, to address data imbalance in multi-category datasets, we design a category calibration method based on proportional fairness optimization, which balances training difficulty across categories. Experiments on GenEval, T2I-CompBench++, and DPG Bench demonstrate that our framework effectively improves generation performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39083", "url": null, "sourceid": 36179, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39084, "uid": "8f921645a776c79989438af2d2808085", "name": "LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World", "authors": [{"id": 191330, "fullname": "Nan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191330?format=json", "institution": "Meta"}, {"id": 89078, "fullname": "Julian Straub", "url": "http://cvpr.thecvf.com/api/miniconf/users/89078?format=json", "institution": "Meta Reality Labs Research"}, {"id": 137201, "fullname": "Fan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/137201?format=json", "institution": "Facebook"}, {"id": 127741, "fullname": "Richard Newcombe", "url": "http://cvpr.thecvf.com/api/miniconf/users/127741?format=json", "institution": "Meta, Reality Labs Research"}, {"id": 127728, "fullname": "Jakob Engel", "url": "http://cvpr.thecvf.com/api/miniconf/users/127728?format=json", "institution": "Research, Meta Reality Labs"}, {"id": 91423, "fullname": "Lingni Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/91423?format=json", "institution": "Facebook"}], "abstract": "Tracking 3D human motion from egocentric, multi-camera devices is challenged by severe egomotion and partial visibility or occlusions. Existing methods are designed for monocular video often recorded from static or slowly-moving cameras and cannot easily leverage multi-view, calibrated and localized input. This makes them brittle and prone to fail on dynamic egocentric captures.  We propose LAMP ($\\textbf{L}$ocalization $\\textbf{A}$ware $\\textbf{M}$ulti-camera $\\textbf{P}$eople Tracking): a novel, simple framework to solve this via early disentanglement of observer and target motion. LAMP introduces a two-step process: First, we leverage the device's known 6-DoF pose and calibration to convert detected 2D body keypoints from all cameras over a temporal window into a unified 3D world reference frame. Second, an end-to-end-trained Transformer model fits 3D human motion directly to this spatio-temporal ray cloud in world coordinates. This \"lift-then-fit\" approach allows to learn and leverage a natural prior over world-space human motion, as well as providing an elegant framework to flexibly incorporate information from multiple, temporally asynchronous, partially observing, and moving cameras. LAMP achieves state-of-the-art results on monocular benchmarks, while significantly outperforming baselines for our targeted egocentric multi-camera setting.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39084", "url": null, "sourceid": 43859, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39085, "uid": "b292c42b08d209a8baa530adc393671d", "name": "Affine Perspective-Three-Point Problem", "authors": [{"id": 184066, "fullname": "Gaku Nakano", "url": "http://cvpr.thecvf.com/api/miniconf/users/184066?format=json", "institution": "NEC Corporation"}], "abstract": "This paper addresses the Perspective-Three-Point (P3P) problem under affine camera models. We derive direct closed-form solvers for weak perspective and para perspective, which are representative affine camera models. The affine P3P solution reduces to a bi-quadratic equation. Unlike exact P3P solvers that require a cubic or quartic equation, it allows for the simple and stable calculation of real solutions using the quadratic formula. Since affine approximations are valid only when scene depth variation is small, we further propose an iterative correction that upgrades the affine solution to the exact P3P solution. Through extensive comparisons using synthetic data and public datasets, we demonstrate that affine P3P solvers with two upgrade iterations achieve performance substantially comparable to that of the state-of-the-art P3P solver.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39085", "url": null, "sourceid": 40040, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39087, "uid": "9ffd9d410fda83d9d2371d1657f73b50", "name": "StreamingTOM: Streaming Token Compression for Efficient Video Understanding", "authors": [{"id": 176025, "fullname": "Xueyi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/176025?format=json", "institution": "Westlake University"}, {"id": 157978, "fullname": "Keda Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/157978?format=json", "institution": "Westlake University"}, {"id": 184746, "fullname": "Kele Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184746?format=json", "institution": "Westlake University"}, {"id": 87566, "fullname": "Huan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87566?format=json", "institution": "Northeastern University"}], "abstract": "Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation.Causality prevents access to future frames that offline methods exploit, while accumulation causes tokens to grow unbounded, creating efficiency bottlenecks.However, existing approaches only regulate post-LLM kv-cache, leaving costly pre-LLM prefill unchanged.We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks with predictable latency.Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing per-frame prefill cost by processing only a compact subset of visual tokens per frame instead of all visual tokens.Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length.Experiments demonstrate our method achieves $15.7\\times$ kv-cache compression, $1.2\\times$ lower peak memory and $2\\times$ faster TTFT compared to prior SOTA.StreamingTOM maintains state-of-the-art accuracy among training-free methods with an average of $63.8\\%$ on offline benchmarks and $55.8\\%/3.7$ on RVS.These results highlight the practical benefits of our two-stage approach for efficient streaming video understanding with bounded growth.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39087", "url": null, "sourceid": 36656, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39088, "uid": "12d29d5c5e0241bd62b01c442312eb52", "name": "From Rays to Projections: Better Inputs for Feed-Forward View Synthesis", "authors": [{"id": 143524, "fullname": "Zirui Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/143524?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 126492, "fullname": "Zeren Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126492?format=json", "institution": "University of Oxford"}, {"id": 88372, "fullname": "Martin R. Oswald", "url": "http://cvpr.thecvf.com/api/miniconf/users/88372?format=json", "institution": "University of Amsterdam"}, {"id": 96218, "fullname": "Jie Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/96218?format=json", "institution": "ETHZ - ETH Zurich"}], "abstract": "Feed-forward view synthesis models predict a novel view in a single pass with minimal 3D inductive bias. Existing works encode cameras as Pl\u00fccker ray maps, which tie predictions to the arbitrary world coordinate gauge and make them sensitive to small camera transformations, thereby undermining geometric consistency. In this paper, we ask what inputs best condition a model for robust and consistent view synthesis. We propose projective conditioning, which replaces raw camera parameters with a target-view projective cue that provides a stable 2D input. This reframes the task from a brittle geometric regression problem in ray space to a well-conditioned target-view image-to-image translation problem. Additionally, we introduce a masked autoencoding pretraining strategy tailored to this cue, enabling the use of large-scale uncalibrated data for pretraining. Our method shows improved fidelity and stronger cross-view consistency compared to ray-conditioned baselines on our view-consistency benchmark. It also achieves state-of-the-art quality on standard novel view synthesis benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39088", "url": null, "sourceid": 39591, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39089, "uid": "5e285258b2a7b03236d37eb86a562ee5", "name": "Distilling Balanced Knowledge from a Biased Teacher", "authors": [{"id": 180497, "fullname": "Seonghak Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/180497?format=json", "institution": "Agency for Defense Development"}], "abstract": "Conventional knowledge distillation, designed for model compression, fails on long-tailed distributions because the teacher model tends to be biased toward head classes and provides limited supervision for tail classes. We propose Long-Tailed Knowledge Distillation (LTKD), a novel framework that reformulates the conventional objective into two components: a cross-group loss, capturing mismatches in prediction distributions across class groups (head, medium, and tail), and a within-group loss, capturing discrepancies within each group's distribution. This decomposition reveals the specific sources of the teacher's bias. To mitigate the inherited bias, LTKD introduces (1) a rebalanced cross-group loss to calibrates the teacher's group-level predictions and (2) a reweighted within-group loss to ensures equal contribution from all groups. Extensive experiments on CIFAR-100-LT, TinyImageNet-LT, and ImageNet-LT demonstrate that LTKD significantly outperforms existing methods in both overall and tail-class accuracy, thereby proving its ability to distill balanced knowledge from a biased teacher for real-world applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39089", "url": null, "sourceid": 45632, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39093, "uid": "cb2eaafe9f844c32c74a0267257ba8e5", "name": "Flow Map Distillation Without Data", "authors": [{"id": 153408, "fullname": "Shangyuan Tong", "url": "http://cvpr.thecvf.com/api/miniconf/users/153408?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 153407, "fullname": "Nanye Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/153407?format=json", "institution": "New York University"}, {"id": 90241, "fullname": "Saining Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/90241?format=json", "institution": "Facebook"}, {"id": 128313, "fullname": "Tommi Jaakkola", "url": "http://cvpr.thecvf.com/api/miniconf/users/128313?format=json", "institution": "Massachusetts Institute of Technology"}], "abstract": "State-of-the-art flow models achieve remarkable quality but require slow, iterative sampling. To accelerate this, flow maps can be distilled from pre-trained teachers, a procedure that conventionally requires sampling from an external dataset. We argue that this data-dependency introduces a fundamental risk of Teacher-Data Mismatch, as a static dataset may provide an incomplete or even misaligned representation of the teacher's full generative capabilities. This leads us to question whether this reliance on data is truly necessary for successful flow map distillation. In this work, we explore a data-free alternative that samples only from the prior distribution\u2014a distribution the teacher is guaranteed to follow by construction, thereby circumventing the mismatch risk entirely. To demonstrate the practical viability of this philosophy, we introduce a principled framework that learns to predict the teacher's sampling path while actively correcting for its own compounding errors to ensure high fidelity. Our approach surpasses all data-based counterparts and establishes a new state-of-the-art by a significant margin. Specifically, distilling from SiT-XL/2+REPA, our method reaches an impressive FID of $1.45$ on ImageNet $256{\\times}256$, and $1.49$ on ImageNet $512{\\times}512$, both with only 1 sampling step. We hope our work establishes a more robust paradigm for accelerating generative models and motivates the broader adoption of flow map distillation without data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39093", "url": null, "sourceid": 42302, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39096, "uid": "7d1ca441acca88b450e706d65533dbd2", "name": "GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation", "authors": [{"id": 147237, "fullname": "Ken Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/147237?format=json", "institution": null}, {"id": 104366, "fullname": "Yunhan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/104366?format=json", "institution": "The University of Hong Kong"}, {"id": 76470, "fullname": "Jingxiang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/76470?format=json", "institution": "Tsinghua University"}, {"id": 86697, "fullname": "Xihui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86697?format=json", "institution": "The University of Hong Kong"}, {"id": 75944, "fullname": "Yebin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75944?format=json", "institution": "Tsinghua University"}, {"id": 126930, "fullname": "Ding Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126930?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 84806, "fullname": "Yan-Pei Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84806?format=json", "institution": "Tencent ARC Lab"}], "abstract": "We introduce GeoSAM2, a prompt-controllable framework for 3D part segmentation that casts the task as multi-view 2D mask prediction. Given a textureless object, we render normal and point maps from predefined viewpoints and accept simple 2D prompts\u2014clicks or boxes\u2014to guide part selection. These prompts are processed by a shared SAM2 backbone augmented with LoRA and residual geometry fusion, enabling view-specific reasoning while preserving pretrained priors. The predicted masks are back-projected to the object, aggregated across views.Our method enables fine-grained, part-specific control without requiring text prompts, per-shape optimization, or full 3D labels. In contrast to global clustering or scale-based methods, prompts are explicit, spatially grounded, and interpretable. We achieve state-of-the-art class-agnostic performance on PartObjaverse-Tiny and PartNetE, outperforming both slow optimization-based pipelines and fast but coarse feedforward approaches. Our results highlight a new paradigm: aligning the paradigm of 3D segmentation with SAM2, leveraging interactive 2D inputs to unlock controllability and precision in object-level part understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39096", "url": null, "sourceid": 44707, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39099, "uid": "d334ba52f9051d625363551c6dd564e6", "name": "Dynamics-Aware Preference Optimization for Vision-Language Models", "authors": [{"id": 147015, "fullname": "jusheng zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/147015?format=json", "institution": "National University of Singapore; SUN YAT-SEN UNIVERSITY"}, {"id": 191355, "fullname": "Kaitong Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/191355?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 189359, "fullname": "Jing Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189359?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 85791, "fullname": "Jian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85791?format=json", "institution": "Snap Inc."}, {"id": 128912, "fullname": "Keze Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128912?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Preference-based finetuning of vision-language models (VLMs) is notoriously unstable, as trivially wrong negatives inject uninformative gradients that distort optimization and degrade calibration. This work revisits this issue through the lens of learning dynamics and identifies a core pathology, the squeezing effect, where easy negatives retain large, misaligned gradients despite having negligible loss.To address this, we propose Cooling-Weighted Direct Preference Optimization (CW-DPO), a two-stage framework that first smooths and then stabilizes the alignment process. Stage 1 employs a constrained SFT phase with low-weight \u201cgentle negatives\u2019\u2019 to regularize overconfident distributions and flatten the loss landscape. Stage 2 introduces a competence-aware cooling weight that adaptively scales negative gradients according to the model\u2019s average per-token log-probability, suppressing uninformative updates while emphasizing hard, on-policy contrasts. This dynamics-aware weighting effectively mitigates the squeezing effect and enables smoother convergence.Extensive experiments on mainstream benchmarks\u2014including COCO, Flickr30k, NoCaps, MMMU, and MMBench1.1\u2014show that CW-DPO achieves state-of-the-art performance, for example +3.4 CIDEr over PPO and +2.4% absolute accuracy on MMMU, while improving calibration and halving convergence steps. This demonstrates that smoothing before cooling constitutes a simple yet general principle for robust VLM preference optimization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39099", "url": null, "sourceid": 45735, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39100, "uid": "4160090a26bf2b0296821aed16f33f51", "name": "TextOVSR: Text-Guided Real-World Opera Video Super-Resolution", "authors": [{"id": 175692, "fullname": "Hua Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/175692?format=json", "institution": "Wuhan University of Science and Technology"}, {"id": 182893, "fullname": "Xin Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182893?format=json", "institution": "Wuhan University of Science and Technology"}, {"id": 186937, "fullname": "Wei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186937?format=json", "institution": "Wuhan University of Science and Technology"}, {"id": 191356, "fullname": "Jiayi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191356?format=json", "institution": "Wuhan University of Science and Technology"}, {"id": 152254, "fullname": "Kui Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152254?format=json", "institution": "Harbin Institute of Technology"}, {"id": 187136, "fullname": "Fei Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/187136?format=json", "institution": "Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), also known as Guangming Laboratory"}, {"id": 88171, "fullname": "Qi Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/88171?format=json", "institution": "Huawei Technologies Ltd."}], "abstract": "Many classic opera videos exhibit poor visual quality due to the limitations of early filming equipment and long-term degradation during storage. Although real-world video super-resolution (RWVSR) has achieved significant advances in recent years, directly applying existing methods to degraded opera videos remains challenging. The difficulties are twofold. First, accurately modeling real-world degradations is complex: simplistic combinations of classical degradation kernels fail to capture the authentic noise distribution, while methods that extract real noise patches from external datasets are prone to style mismatches that introduce visual artifacts. Second, current RWVSR methods, which rely solely on degraded image features, struggle to reconstruct realistic and detailed textures due to a lack of high-level semantic guidance. To address these issues, we propose a Text-guided Dual-Branch Opera Video Super-Resolution (TextOVSR) network, which introduces two types of textual prompts to guide the super-resolution process. Specifically, degradation-descriptive text, derived from the degradation process, is incorporated into the negative branch to constrain the solution space. Simultaneously, content-descriptive text is incorporated into a positive branch and our proposed Text-Enhanced Discriminator (TED) to provide semantic guidance for enhanced texture reconstruction. Furthermore, we design a Degradation-Robust Feature Fusion (DRF) module to facilitate cross-modal feature fusion while suppressing degradation interference. Extensive experiments on our constructed benchmark for opera videos, OperaLQ, demonstrate that TextOVSR outperforms state-of-the-art methods in both qualitative and quantitative evaluations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39100", "url": null, "sourceid": 42429, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39102, "uid": "7c0d5fe6c602bc990fb88b539bc3a45e", "name": "Affostruction: 3D Affordance Grounding with Generative Reconstruction", "authors": [{"id": 131473, "fullname": "Chunghyun Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/131473?format=json", "institution": "POSTECH"}, {"id": 191360, "fullname": "Seunghyeon Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/191360?format=json", "institution": "Ewha Women&#x27;s University"}, {"id": 86304, "fullname": "Minsu Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/86304?format=json", "institution": "POSTECH"}], "abstract": "This paper addresses the problem of affordance grounding from RGBD images of an object, which aims to localize surface regions corresponding to a text query that describes an action on the object. While existing methods predict affordance regions only on visible surfaces, we propose a unified framework for affordance grounding and reconstruction, dubbed Affostruction, where affordance grounding actively combines with shape generation. In our approach, reconstructing complete geometry from partial observations enables affordance prediction on unobserved regions, while affordance heatmaps guide active view selection to improve reconstruction quality of functional regions. We make three core contributions: generative multi-view reconstruction via sparse voxel fusion that extrapolates unseen geometry while maintaining constant token complexity, flow-based affordance grounding that captures inherent ambiguity in affordance distributions, and affordance-driven active view selection that leverages predicted affordances for intelligent viewpoint sampling.Affostruction achieves 19.1 aIoU on affordance grounding (40.4\\% improvement) and 32.67 IoU for 3D reconstruction (67.7\\% improvement), enabling accurate affordance prediction on complete shapes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39102", "url": null, "sourceid": 31054, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39105, "uid": "8adcbdcc53c6a9e4a177848a8e275dd2", "name": "YOLO-Master: MOE-Accelerated with Specialized Transformers for Enhanced Real-time Detection", "authors": [{"id": 180763, "fullname": "Xu Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/180763?format=json", "institution": "Tencent Technology (Shenzhen) Co.Ltd"}, {"id": 88659, "fullname": "Jinlong Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/88659?format=json", "institution": "Tencent Youtu Lab"}, {"id": 89708, "fullname": "Zhenye Gan", "url": "http://cvpr.thecvf.com/api/miniconf/users/89708?format=json", "institution": "Tencent Youtu Lab"}, {"id": 107179, "fullname": "Jiawen Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107179?format=json", "institution": "Singapore Management University"}, {"id": 89125, "fullname": "Jun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89125?format=json", "institution": "Tencent YouTu Lab"}], "abstract": "Existing Real-Time Object Detection (RTOD) methods commonly adopt YOLO-like architectures for their favorable trade-off between accuracy and speed. However, these models rely on static dense computation that applies uniform processing to all inputs, misallocating representational capacity and computational resources such as over-allocating on trivial scenes while under-serving complex ones. This mismatch results in both computational redundancy and suboptimal detection performance.To overcome this limitation, we propose YOLO-Master, a novel YOLO-like framework that introduces instance-conditional adaptive computation for RTOD. This is achieved through a Efficient Sparse Mixture-of-Experts (ES-MoE) block that dynamically allocates computational resources to each input according to its scene complexity. At its core, a lightweight dynamic routing network guides expert specialization during training through a diversity enhancing objective, encouraging complementary expertise among experts. Additionally, the routing network adaptively learns to activate only the most relevant experts, thereby improving detection performance while minimizing computational overhead during inference.Comprehensive experiments on five large-scale benchmarks demonstrate the superiority of YOLO-Master. On MS COCO, our model achieves 42.4\\% AP with 1.62ms latency, outperforming YOLOv13-N by +0.8\\% mAP and 17.8\\% faster inference. Notably, the gains are most pronounced on challenging dense scenes, while the model preserves efficiency on typical inputs and maintains real-time inference speed. Code will be available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39105", "url": null, "sourceid": 40994, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39110, "uid": "f200537982c3729a0eb7af742d370826", "name": "PhysHO: Physics-Based Dynamic 3D Gaussian Human and Object from Monocular Video", "authors": [{"id": 179981, "fullname": "Suyi Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/179981?format=json", "institution": "National University of Singapore"}, {"id": 86340, "fullname": "Gim Hee Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/86340?format=json", "institution": "National University of Singapore"}], "abstract": "Physically plausible reconstruction of human\u2013object dynamics from a single video remains under-explored in physics-based methods. Most prior approaches omit human-generated internal actuation by assuming motion driven solely by gravity and simple contacts. They also rely on idealized constitutive laws that underfit heterogeneous and anisotropic materials. We introduce PhysHO, which tightly couples SMPL-driven Linear Blend Skinning (LBS) with a Material Point Method (MPM) simulator to address these gaps. Our key insight is to use LBS as an interpretable actuation prior and MPM to propagate those forces through contact under physical constraints. Concretely, we derive targeted actuation with a PD controller guided by LBS trajectories and gate it per particle via a learnable LBS-impact factor so that only particles inside the SMPL volume are directly actuated. We model real materials with residual neural constitutive laws layered on expert elastic\u2013plastic models and conditioned on per particle to capture heterogeneity and anisotropy. We stabilize monocular learning with structure-preserving 3D flow supervision and a progressive and loss-balanced training schedule. Our PhysHO reconstructs observed dynamics with high fidelity, and predicts future motion and simulates outcomes under novel human actions. Experimental results demonstrate robust human-driven interactions beyond gravity-only scenes. Our code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39110", "url": null, "sourceid": 37342, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39112, "uid": "9c21455c626adc99e60f9f4edcfe9a69", "name": "WaDi: Weight Direction-aware Distillation for One-step Image Synthesis", "authors": [{"id": 151904, "fullname": "Lei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151904?format=json", "institution": "Nankai University"}, {"id": 191385, "fullname": "Yang Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191385?format=json", "institution": "Nankai University"}, {"id": 91115, "fullname": "Senmao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/91115?format=json", "institution": "Nankai University"}, {"id": 185793, "fullname": "Ge Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185793?format=json", "institution": "Nankai University"}, {"id": 91119, "fullname": "Yaxing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91119?format=json", "institution": "Nankai University"}, {"id": 86573, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86573?format=json", "institution": "Nankai University"}], "abstract": "Despite the impressive performance of diffusion models such as Stable Diffusion (SD) in image generation, their slow inference limits practical deployment. Recent works accelerate inference by distilling multi-step diffusion into one-step generators. To better understand the distillation mechanism, we analyze U-Net/DiT weight changes between one-step students and their multi-step teacher counterparts. Our analysis reveals that changes in weight direction significantly exceed those in weight norm, highlighting it as the key factor during distillation. Motivated by this insight, we propose the **Lo**w-rank **R**ot**a**tion of weight **D**irection (LoRaD), a parameter-efficient adapter tailored to one-step diffusion distillation. LoRaD is designed to model these structured directional changes using learnable low-rank rotation matrices. We further integrate LoRaD into Variational Score Distillation (VSD), resulting in **W**eight Direction-**a**ware **Di**stillation (WaDi)\u2014a novel one-step distillation framework. WaDi achieves state-of-the-art FID scores on COCO 2014 and COCO 2017 while using only approximately 10\\% of the trainable parameters of the U-Net/DiT. Furthermore, the distilled one-step model demonstrates strong versatility and scalability, generalizing well to various downstream tasks such as controllable generation, relation inversion, and high-resolution synthesis.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39112", "url": null, "sourceid": 36526, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39113, "uid": "681e38a876f8eb9fd9822b8f56545438", "name": "CoT-Edit: Let CoT Guide Instruction Video Editing", "authors": [{"id": 144709, "fullname": "Sen Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/144709?format=json", "institution": null}, {"id": 191386, "fullname": "Fengbin Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191386?format=json", "institution": "University of Science and Technology of China"}, {"id": 187395, "fullname": "Youliang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187395?format=json", "institution": "Tsinghua University"}, {"id": 158204, "fullname": "Xin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/158204?format=json", "institution": "University of Science and Technology of China"}, {"id": 85129, "fullname": "Zhibo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/85129?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Text-driven instruction-based video editing in complex scenes remains challenging: purely textual prompts often fail to capture precise spatial relationships and physical constraints, resulting in target ambiguity and physically implausible outcomes. To address this, we propose a plan--guide--edit framework that explicitly bridges semantic intent and spatial execution. In our framework, a Chain-of-Thought (CoT)-enhanced multimodal large language model (MLLM) serves as a planner, performing structured reasoning over the video and instructions to derive a precise sequence of bounding boxes and attribute-enriched editing directives. These spatial priors then guide a box-conditioned mask generator, transforming ambiguous global retrieval into localized, context-aware refinement and producing masks that more accurately capture object scale, contact relationships, and placement. Building on these spatial and semantic signals, a diffusion-based editor integrates the masks, enriched instructions, and frame features to render high-fidelity edits that remain temporally coherent and spatially well aligned. Trained first in a modular manner and then jointly, our framework achieves superior performance with reduced data requirements, delivering precise localization in scenes with multiple similar objects and physically consistent object additions, and extensive experiments demonstrate state-of-the-art performance over multiple strong baseline methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39113", "url": null, "sourceid": 34696, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39114, "uid": "12fab7a1d259efd5295bbbceadb09f81", "name": "STCDiT: Spatio-Temporally Consistent Diffusion Transformer for High-Quality Video Super-Resolution", "authors": [{"id": 154352, "fullname": "Junyang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/154352?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 89307, "fullname": "Jiangxin Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/89307?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 155769, "fullname": "Long Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/155769?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 191387, "fullname": "Yixin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191387?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 87633, "fullname": "Jinshan Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87633?format=json", "institution": "Nanjing University of Science and Technology"}], "abstract": "We present STCDiT, a video super-resolution framework built upon a pre-trained video diffusion model, aiming to restore structurally faithful and temporally stable videos from degraded inputs, even under complex camera motions. The main challenges lie in maintaining temporal stability during reconstruction and preserving structural fidelity during generation. To address these challenges, we first develop a motion-aware VAE reconstruction method that performs segment-wise reconstruction, with each segment clip exhibiting uniform motion characteristic, thereby effectively handling videos with complex camera motions. Moreover, we observe that the first-frame latent extracted by the VAE encoder in each clip, termed the anchor-frame latent, remains unaffected by temporal compression and retains richer spatial structural information than subsequent frame latents. We further develop an anchor-frame guidance approach that leverages structural information from anchor frames to constrain the generation process and improve structural fidelity of video features. Coupling these two designs enables the video diffusion model to achieve high-quality video super-resolution. Extensive experiments show that STCDiT outperforms state-of-the-art methods in terms of structural fidelity and temporal consistency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39114", "url": null, "sourceid": 45595, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39115, "uid": "a2816f807f2001d46e4d06248790f850", "name": "Obstruction reasoning for robotic grasping", "authors": [{"id": 174204, "fullname": "Runyu Jiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/174204?format=json", "institution": "Fondazione Bruno Kessler &amp; University of Trento"}, {"id": 181574, "fullname": "Matteo Bortolon", "url": "http://cvpr.thecvf.com/api/miniconf/users/181574?format=json", "institution": "Fondazione Bruno Kessler"}, {"id": 153746, "fullname": "Francesco Giuliari", "url": "http://cvpr.thecvf.com/api/miniconf/users/153746?format=json", "institution": "Fondazione Bruno Kessler"}, {"id": 146354, "fullname": "Alice Fasoli", "url": "http://cvpr.thecvf.com/api/miniconf/users/146354?format=json", "institution": "Fondazione Bruno Kessler"}, {"id": 181573, "fullname": "Sergio Povoli", "url": "http://cvpr.thecvf.com/api/miniconf/users/181573?format=json", "institution": "Fondazione Bruno Kessler"}, {"id": 105964, "fullname": "Guofeng Mei", "url": "http://cvpr.thecvf.com/api/miniconf/users/105964?format=json", "institution": "Fondazione Bruno Kessler"}, {"id": 106509, "fullname": "Yiming Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/106509?format=json", "institution": "Fondazione Bruno Kessler"}, {"id": 84839, "fullname": "Fabio Poiesi", "url": "http://cvpr.thecvf.com/api/miniconf/users/84839?format=json", "institution": "Fondazione Bruno Kessler"}], "abstract": "Successful robotic grasping in cluttered environments not only requires a model to visually ground a target object but also to reason about obstructions that must be cleared beforehand. While current vision-language embodied reasoning models show emergent spatial understanding, they remain limited in terms of obstruction reasoning and accessibility planning. To bridge this gap, we present UNOGrasp, a learning-based vision-language model capable of performing visually-grounded obstruction reasoning to infer the sequence of actions needed to unobstruct the path and grasp the target object. We devise a novel multi-step reasoning process based on obstruction paths originated by the target object. We anchor each reasoning step with obstruction-aware visual cues to incentivize reasoning capability. UNOGrasp combines supervised and reinforcement finetuning through verifiable reasoning rewards. Moreover, we construct UNOBench, a large-scale dataset for both training and benchmarking, based on MetaGraspNetV2, with over 100k obstruction paths annotated by humans with obstruction ratios, contact points, and natural-language instructions. Extensive experiments and real-robot evaluations show that UNOGrasp significantly improves obstruction reasoning and grasp success across both synthetic and real-world environments, outperforming generalist and proprietary alternatives.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39115", "url": null, "sourceid": 41274, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39116, "uid": "608e170b60b7fd6f11914a4ec9dfedbd", "name": "Transition Models: Rethinking the Generative Learning Objective", "authors": [{"id": 129494, "fullname": "ZiDong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129494?format=json", "institution": "Department of Automation, Tsinghua University, Tsinghua University"}, {"id": 107255, "fullname": "Yiyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107255?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 191388, "fullname": "Xiaoyu Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/191388?format=json", "institution": "University of Sydney"}, {"id": 95127, "fullname": "Xiangyu Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/95127?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 126956, "fullname": "Yangguang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/126956?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 151075, "fullname": "Wanli Ouyang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151075?format=json", "institution": "Shanghai AI Lab"}, {"id": 87059, "fullname": "Lei Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/87059?format=json", "institution": "Shanghai AI Laboratory"}], "abstract": "A fundamental dilemma in generative modeling persists: iterative diffusion models achieve outstanding fidelity, but at a significant computational cost, while efficient few-step alternatives are constrained by a hard quality ceiling. This conflict between generation steps and output quality arises from restrictive training objectives that focus exclusively on either infinitesimal dynamics (PF-ODEs) or direct endpoint prediction. We address this challenge by introducing an exact, continuous-time dynamics equation that analytically defines state transitions across any finite time interval \\(\\Delta t\\). This leads to a novel generative paradigm, Transition Models (TiM), which adapt to arbitrary-step transitions, seamlessly traversing the generative trajectory from single leaps to fine-grained refinement with more steps.Despite having only 865M parameters, TiM achieves state-of-the-art performance, surpassing leading models such as SD3.5 (8B parameters) and FLUX.1 (12B parameters) across all evaluated step counts. Importantly, unlike previous few-step generators, TiM demonstrates monotonic quality improvement as the sampling budget increases. Additionally, when employing our native-resolution strategy, TiM delivers exceptional fidelity at resolutions up to \\(4096\\times4096\\). All the codes and model checkpoints will be released.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39116", "url": null, "sourceid": 31114, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39119, "uid": "ea46b042960a80da5ba3b374911d10cb", "name": "3D Space as a Scratchpad for Editable Text-to-Image Generation", "authors": [{"id": 131800, "fullname": "Oindrila Saha", "url": "http://cvpr.thecvf.com/api/miniconf/users/131800?format=json", "institution": "University of Massachusetts at Amherst"}, {"id": 191393, "fullname": "Vojtech Krs", "url": "http://cvpr.thecvf.com/api/miniconf/users/191393?format=json", "institution": "Adobe Systems"}, {"id": 130304, "fullname": "Radomir Mech", "url": "http://cvpr.thecvf.com/api/miniconf/users/130304?format=json", "institution": "University of Calgary"}, {"id": 75679, "fullname": "Subhransu Maji", "url": "http://cvpr.thecvf.com/api/miniconf/users/75679?format=json", "institution": "University of Massachusetts, Amherst"}, {"id": 106544, "fullname": "Matheus Gadelha", "url": "http://cvpr.thecvf.com/api/miniconf/users/106544?format=json", "institution": "Adobe Systems"}, {"id": 87942, "fullname": "Kevin Blackburn-Matzen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87942?format=json", "institution": "Adobe Systems"}], "abstract": "Recent progress in large language models (LLMs) has shown that reasoning improves when intermediate thoughts are externalized into explicit workspaces, such as chain-of-thought traces or tool-augmented reasoning.Yet, visual language models (VLMs) lack an analogous mechanism for spatial reasoning, limiting their ability to generate images that accurately reflect geometric relations, object identities, and compositional intent.We introduce the concept of a spatial scratchpad -- a 3D reasoning substrate that bridges linguistic intent and image synthesis.Given a text prompt, our framework parses subjects and background elements, instantiates them as editable 3D meshes, and employs agentic scene planning for placement, orientation, and viewpoint selection.The resulting 3D arrangement is rendered back into the image domain with identity-preserving cues, enabling the VLM to generate spatially consistent and visually coherent outputs.Unlike prior 2D layout-based methods, our approach supports intuitive 3D edits that propagate reliably into final images.Empirically, it achieves a 32% improvement in text alignment on GenAI-Bench, demonstrating the benefit of explicit 3D reasoning for precise, controllable image generation.Our results highlight a new paradigm for vision\u2013language models that deliberate not only in language, but also in space.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39119", "url": null, "sourceid": 30892, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39120, "uid": "6e5ddfaed8eaa6673c987b40fc2019a8", "name": "RegionFuse: Region-Adaptive Pixel Distribution Learning for Infrared and Visible Image Fusion", "authors": [{"id": 180303, "fullname": "Jianghan Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/180303?format=json", "institution": "Beijing Institute of Technology"}, {"id": 149794, "fullname": "Hong Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/149794?format=json", "institution": "Beijing Institute of Technology"}, {"id": 191394, "fullname": "Jinfu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191394?format=json", "institution": "Beijing Institute of Technology"}, {"id": 191395, "fullname": "Yucong Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/191395?format=json", "institution": "Beijing Institute of Technology"}, {"id": 191396, "fullname": "Shihan Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/191396?format=json", "institution": "Beijing Institute of Technology"}, {"id": 191397, "fullname": "Jingfan Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191397?format=json", "institution": "Beijing Institute of Technology"}, {"id": 191398, "fullname": "Danni Ai", "url": "http://cvpr.thecvf.com/api/miniconf/users/191398?format=json", "institution": "Beijing Institute of Technology"}, {"id": 191399, "fullname": "Tianyu Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191399?format=json", "institution": null}, {"id": 191400, "fullname": "Deqiang Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191400?format=json", "institution": "Beijing Institute of Technology"}, {"id": 191401, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191401?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "Infrared and Visible Image Fusion (IVIF) aims to combine complementary information from infrared and visible images to overcome the limitations of a single modality. While existing methods typically employ fixed or sample-adaptive fusion paradigms where fusion weights are static or derived from global pixel distributions, they often overlook spatial inconsistencies in pixel distribution within images, leading to suboptimal performance. To address this issue, we propose RegionFuse, a Region-Adaptive Pixel Distribution Learning Network for IVIF, which dynamically generates fusion weights based on local pixel distributions to construct a region-wise adaptive fusion paradigm. RegionFuse introduces a Mixture of Region Attention (MoRA) mechanism, which assigns each region to several specialized experts, enabling region-level feature interaction tailored to specific local distributions. Furthermore, we design a Region Feature Compression Module (RFCM) and place it after each MoRA to enhance informative regions and suppress redundant ones.  Extensive experiments on various benchmarks demonstrate the superiority and robustness of RegionFuse, especially in handling non-uniform pixel distributions. Evaluations on NIR-VIS and downstream tasks further confirm its generalizability and practical utility.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39120", "url": null, "sourceid": 39250, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39122, "uid": "c1d5dce0983c0cdc77eb9dadd6d6206b", "name": "Event Structural Valley: A Unified Theoretical and Practical Framework for Event Camera Autofocus", "authors": [{"id": 181335, "fullname": "Xijie Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181335?format=json", "institution": "Peking University"}, {"id": 133113, "fullname": "Lin Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/133113?format=json", "institution": "Beijing Institute of Technology"}, {"id": 73079, "fullname": "Wei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73079?format=json", "institution": "Peng Cheng Laboratory"}, {"id": 86326, "fullname": "Yonghong Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/86326?format=json", "institution": "Peking University"}], "abstract": "Autofocus in dynamic environments remains challenging for conventional frame-based sensors, which often fail under fast motion, low light, or high dynamic range conditions. Event cameras, with microsecond temporal resolution and asynchronous brightness detection, offer a promising alternative. However, typical event-based autofocus methods assume that the sharpest focus corresponds to the maximum event rate.In this paper, we reveal a counterintuitive yet consistent phenomenon: the true focus actually corresponds to a local minimum in the event-rate curve. We theoretically derive this behavior from the physics of event generation and show that as defocus blur increases, the event rate first rises and then declines, forming a dual-peak-valley structure across focal distances. Based on this insight, we propose an Event Structural Valley-based Autofocus (ESVA) framework that identifies the valley between two dominant peaks as the true focal position. ESVA integrates structural smoothing, consistency filtering, and a dual-peak constraint to robustly recover the valley under noise and motion disturbances. Extensive experiments on multiple synthetic and real datasets demonstrate that ESVA achieves sub-millisecond focusing accuracy and consistently outperforms existing event-only autofocus methods without any image reconstruction or supervision.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39122", "url": null, "sourceid": 42712, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39124, "uid": "65086f9dd032d3d47bd1a5e6e9db542a", "name": "MotionScale: Reconstructing Appearance, Geometry, and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting", "authors": [{"id": 180961, "fullname": "Haoran Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/180961?format=json", "institution": "national university of singaore, National University of Singapore"}, {"id": 86340, "fullname": "Gim Hee Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/86340?format=json", "institution": "National University of Singapore"}], "abstract": "Realistic reconstruction of dynamic 4D scenes is essential for understanding the physical world.Despite recent progress in monocular view synthesis, existing methods still struggle to recover accurate 3D geometry and temporally consistent motion in complex environments.To address these challenges, we propose MotionScale, a 4D Gaussian Splatting framework that scales efficiently to large scenes and extended sequences, enabling faithful reconstruction of high-fidelity scene structures and coherent motion representation under complex dynamics.To handle motion with arbitrary flexibility and long-term variation, we introduce a scalable motion field built upon cluster-based bases that adaptively grow to capture diverse motion patterns over time.Moreover, we introduce a progressive optimization strategy that extends naturally to unseen frames. This strategy comprises two propagation modules: 1) A background module that adapts to newly appearing objects, refines camera poses, and accounts for shadows; 2) A foreground module that refines motion through a three-stage process.Extensive experiments on challenging real-world datasets demonstrate that our MotionScale achieves superior reconstruction quality and motion consistency that significantly outperform prior 4D scene reconstruction methods.Our code will be open-sourced on paper acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39124", "url": null, "sourceid": 35513, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39125, "uid": "214ace4c93ff3e251c239e9d47d1517b", "name": "SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation", "authors": [{"id": 154272, "fullname": "Shuai Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/154272?format=json", "institution": "Antgroup Group"}, {"id": 90155, "fullname": "Biao Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/90155?format=json", "institution": "Alibaba Group"}, {"id": 126965, "fullname": "Yujie Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/126965?format=json", "institution": "Fudan University"}, {"id": 73091, "fullname": "Shiwei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73091?format=json", "institution": "Alibaba Group"}, {"id": 191405, "fullname": "Zhuoxin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191405?format=json", "institution": "University of Wisconsin - Madison"}, {"id": 191406, "fullname": "Ke Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/191406?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 186123, "fullname": "Yan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186123?format=json", "institution": "Sichuan University"}, {"id": 90363, "fullname": "Kecheng Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90363?format=json", "institution": "Ant Group"}, {"id": 186273, "fullname": "Xing Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186273?format=json", "institution": "Ant Group"}, {"id": 88128, "fullname": "Yujun Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88128?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 87814, "fullname": "Hengshuang Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/87814?format=json", "institution": "The University of Hong Kong"}], "abstract": "Diffusion-based video motion customization facilitates the acquisition of human motion representations from a few video samples, while achieving arbitrary subjects transfer through precise textual conditioning. Existing approaches often rely on semantic-level alignment, expecting the model to learn new motion concepts and combine them with other entities (e.g., cats or dogs) to produce visually appealing results. However, video data involve complex spatio-temporal patterns, and focusing solely on semantics cause the model to overlook the visual complexity of motion. Conversely, tuning only the visual representation leads to semantic confusion in representing the intended action. To address these limitations, we propose SynMotion, a new motion-customized video generation model that jointly leverages semantic guidance and visual adaptation. At the semantic level, we introduce the dual-embedding semantic comprehension mechanism which disentangles subject and motion representations, allowing the model to learn customized motion features while preserving its generative capabilities for diverse subjects. At the visual level, we integrate parameter-efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence. Furthermore, we introduce a new embedding-specific training strategy which alternately optimizes subject and motion embeddings, supported by the manually constructed Subject Prior Video (SPV) training dataset. This strategy promotes motion specificity while preserving generalization across diverse subjects. Lastly, we introduce MotionBench, a newly curated benchmark with diverse motion patterns. Experimental results across both T2V and I2V settings demonstrate that SynMotion outperforms existing baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39125", "url": null, "sourceid": 45465, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39127, "uid": "e7bf1017621f65d4e7858af08b345bed", "name": "DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation", "authors": [{"id": 101795, "fullname": "Zehong Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/101795?format=json", "institution": "Peking University"}, {"id": 126204, "fullname": "Longhui Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/126204?format=json", "institution": "Huawei Cloud Technologies Ltd."}, {"id": 188174, "fullname": "Shuai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188174?format=json", "institution": "Nanjing University"}, {"id": 89087, "fullname": "Shiliang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89087?format=json", "institution": "Peking University"}, {"id": 88171, "fullname": "Qi Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/88171?format=json", "institution": "Huawei Technologies Ltd."}], "abstract": "Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of VAE in the two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-**DeCo**upled pixel diffusion framework. With the intuition to decouple the generation of high and low frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT. This thus frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones. Extensive experiments show that DeCo achieves superior performance among pixel diffusion models, attaining FID of **1.62** (256\u00d7256) and **2.22** (512\u00d7512) on ImageNet, closing the gap with latent diffusion methods. Furthermore, our pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39127", "url": null, "sourceid": 41762, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39128, "uid": "92cfd88945f213d00026de20ba91b717", "name": "Semantic Alignment for Pose-Invariant Identity Preserving Diffusion", "authors": [{"id": 156833, "fullname": "Jiwon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/156833?format=json", "institution": "Korea University"}, {"id": 156832, "fullname": "SeonHwa Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/156832?format=json", "institution": "Korea Univ."}, {"id": 156834, "fullname": "Soobin Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/156834?format=json", "institution": "Sookmyung Women&#x27;s University"}, {"id": 156837, "fullname": "Eunju Cha", "url": "http://cvpr.thecvf.com/api/miniconf/users/156837?format=json", "institution": "Sookmyung Women&#x27;s University"}, {"id": 127571, "fullname": "Kyong Hwan Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/127571?format=json", "institution": "Korea University"}], "abstract": "Recent T2I diffusion models have evolved to control multiple conditions, including structure, appearance, and text prompt. Despite this progress, training-based methods demand heavy computation, whereas training-free methods often 're-imagine' the subject to satisfy given structure, thereby compromising identity preservation and attenuating fine textures.We propose SeAl (Semantic Alignment for Pose-Invariant Identity Preserving Diffusion), a novel training-free framework that addresses the 're-imagining' problem from the perspective of 'infusion'. SeAl integrates structure, appearance, and text prompt with three modules: AnchorAlign pre-aligns spatial discrepancies, Reference-guided Appearance Infusion injects identity via semantic matching, and Delta-Bridge leverages the guidance delta to mediate text\u2013appearance conflicts. We demonstrate that our method faithfully reflects all three control factors and dramatically reduces the identity leakage endemic to prior methods. Notably, SeAl excels on challenging datasets where identity preservation typically fails (e.g., distinctive animal features or complex human attire), establishing a novel paradigm for training-free identity preservation in diffusion models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39128", "url": null, "sourceid": 36954, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39129, "uid": "3efcf88e3453fc6d9ce464e51d3a81d4", "name": "Event-based Motion Deblurring with Unpaired Data", "authors": [{"id": 76908, "fullname": "Hoonhee Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/76908?format=json", "institution": "KAIST"}, {"id": 129337, "fullname": "Yuhwan Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/129337?format=json", "institution": "KAIST"}, {"id": 76867, "fullname": "Kuk-Jin Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/76867?format=json", "institution": "KAIST"}], "abstract": "Event cameras provide high-temporal-resolution, motion-centric measurements that remain reliable under fast motion and challenging illumination, making them a promising sensing modality for motion deblurring. However, existing deblurring methods typically require large-scale paired blur\u2013sharp datasets, which are extremely difficult to obtain in real-world settings, especially when an additional modality such as events is involved. In this work, we introduce EMP, an event-based motion deblurring framework that operates entirely in an unpaired setting, removing the need for aligned blur\u2013sharp supervision. EMP bridges the disjoint blur and sharp domains through event information and leverages two complementary training mechanisms tailored to the unpaired regime: (1) an event-based physical prior with confidence masking that provides reliable self-supervisory signals for blurry inputs, and (2) a generative blur modeling process that extracts blur-related frequency-domain cues from blur\u2013event pairs and transfers them to sharp images to synthesize realistic blur. As a result, these mechanisms enable stable and effective deblurring without requiring paired labels. Extensive experiments on various real-event datasets, including REBlur, EventAid, and HighREV, show that EMP outperforms existing unpaired baselines and achieves performance competitive with paired methods. We will make our code publicly available to the research community.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39129", "url": null, "sourceid": 41853, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39130, "uid": "55dcf135a9a3dd23b624749813f5bbc9", "name": "ExtrinSplat: Decoupling Geometry and Semantics for Open-Vocabulary Understanding in 3D Gaussian Splatting", "authors": [{"id": 180661, "fullname": "Jiayu Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/180661?format=json", "institution": "Peking University"}, {"id": 191411, "fullname": "Xinpeng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191411?format=json", "institution": "Peking University"}, {"id": 89860, "fullname": "Zhiyi Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/89860?format=json", "institution": "Peking University"}, {"id": 191412, "fullname": "Shiqiang Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/191412?format=json", "institution": "Guangdong Bohua UHD Innovation Center Co., Ltd."}, {"id": 85492, "fullname": "Ge Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/85492?format=json", "institution": "Peking University Shenzhen Graduate School"}], "abstract": "Lifting 2D open-vocabulary understanding into 3D Gaussian Splatting (3DGS) scenes is a critical challenge. Mainstream methods, built on an embedding paradigm, suffer from three key flaws: (i) geometry-semantic inconsistency, where points, rather than objects, serve as the semantic basis, limiting semantic fidelity; (ii) semantic bloat from injecting gigabytes of feature data into the geometry; and (iii) semantic rigidity, as one feature per Gaussian struggles to capture rich polysemy. To overcome these limitations, we introduce ExtrinSplat, a framework built on the extrinsic paradigm that decouples geometry from semantics. Instead of embedding features, ExtrinSplat clusters Gaussians into multi-granularity, overlapping 3D object groups. A Vision-Language Model (VLM) then interprets these groups to generate lightweight textual hypotheses, creating an extrinsic index layer that natively supports complex polysemy. By replacing costly feature embedding with lightweight indices, ExtrinSplat reduces scene adaptation time from hours to minutes and lowers storage overhead by several orders of magnitude. On benchmark tasks for open-vocabulary 3D object selection and semantic segmentation, ExtrinSplat outperforms established embedding-based frameworks, validating the efficacy and efficiency of the proposed extrinsic paradigm.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39130", "url": null, "sourceid": 36578, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39131, "uid": "2f4ce765370963b46cc97175da3eae88", "name": "See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection", "authors": [{"id": 191413, "fullname": "Zhiheng Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191413?format=json", "institution": "Baidu"}, {"id": 191414, "fullname": "Tong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191414?format=json", "institution": "Baidu"}, {"id": 182120, "fullname": "Shuning Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182120?format=json", "institution": "Zhejiang University"}, {"id": 191415, "fullname": "Naiming Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191415?format=json", "institution": "Harbin Institute of Technology"}, {"id": 191416, "fullname": "Yumeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191416?format=json", "institution": "Baidu"}], "abstract": "Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including the lack of low-level visual information and effective visual feedback. To address these problems, this paper proposes a unified multimodal interleaved reasoning framework \\textbf{ForeSight}, which enables VLMs to \\textbf{See Further} with low-level visual cues and \\textbf{Think Deeper} with effective visual feedback. First, it introduces a set of low-level visual tools to integrate essential visual information into the reasoning chain, mitigating the neglect of fine-grained visual features. Second, a mask-based visual feedback mechanism is elaborated to incorporate visual reflection into the thinking process, enabling the model to dynamically re-examine and update its answers. Driven by RL, ForeSight learns to autonomously decide on tool invocation and answer verification, with the final answer accuracy as the reward signal. To evaluate the performance of the proposed framework, we construct a new dataset, Character and Grounding SalBench (CG-SalBench), based on the SalBench dataset. Experimental results demonstrate that the ForeSight-7B model significantly outperforms other models with the same parameter scale, and even surpasses the current SOTA closed-source models on certain metrics.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39131", "url": null, "sourceid": 43376, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39132, "uid": "f466571800af70edf099b3cb72b33f5f", "name": "4D Primitive-M\u00e2ch\u00e9: Glueing Primitives for Persistent 4D Scene Reconstruction", "authors": [{"id": 127570, "fullname": "Kirill Mazur", "url": "http://cvpr.thecvf.com/api/miniconf/users/127570?format=json", "institution": "Imperial College London"}, {"id": 87294, "fullname": "Marwan Taher", "url": "http://cvpr.thecvf.com/api/miniconf/users/87294?format=json", "institution": "The University of Sheffield"}, {"id": 87268, "fullname": "Andrew J. Davison", "url": "http://cvpr.thecvf.com/api/miniconf/users/87268?format=json", "institution": "Imperial College London"}], "abstract": "We present a dynamic reconstruction system that receives a casual monocular RGB video as input, and outputs a complete and persistent reconstruction of the scene.  In other words, we reconstruct not only the the currently visible parts of the scene, but also all previously viewed parts, which enables replaying the complete reconstruction across all timesteps.Our method decomposes the scene into a set of rigid 3D primitives, which are assumed to be moving throughout the scene. Using estimated dense 2D correspondences, we jointly infer the rigid motion of these primitives through an optimisation pipeline, yielding a 4D reconstruction of the scene, i.e. providing 3D geometry dynamically moving through time. To achieve this, we also introduce a mechanism to extrapolate motion for objects that become invisible, employing motion-grouping techniques to maintain continuity.The resulting system enables 4D spatio-temporal awareness, offering capabilities such as replayable 3D reconstructions of articulated objects through time, multi-object scanning, and object permanence. On object scanning and multi-object datasets, our system significantly outperforms existing methods both quantitatively and qualitatively.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39132", "url": null, "sourceid": 41171, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40358?format=json"], "related_events_ids": [40358]}, {"id": 40358, "uid": "f466571800af70edf099b3cb72b33f5f", "name": "4D Primitive-M\u00e2ch\u00e9: Glueing Primitives for Persistent 4D Scene Reconstruction", "authors": [{"id": 127570, "fullname": "Kirill Mazur", "url": "http://cvpr.thecvf.com/api/miniconf/users/127570?format=json", "institution": "Imperial College London"}, {"id": 87294, "fullname": "Marwan Taher", "url": "http://cvpr.thecvf.com/api/miniconf/users/87294?format=json", "institution": "The University of Sheffield"}, {"id": 87268, "fullname": "Andrew J. Davison", "url": "http://cvpr.thecvf.com/api/miniconf/users/87268?format=json", "institution": "Imperial College London"}], "abstract": "We present a dynamic reconstruction system that receives a casual monocular RGB video as input, and outputs a complete and persistent reconstruction of the scene.  In other words, we reconstruct not only the the currently visible parts of the scene, but also all previously viewed parts, which enables replaying the complete reconstruction across all timesteps.Our method decomposes the scene into a set of rigid 3D primitives, which are assumed to be moving throughout the scene. Using estimated dense 2D correspondences, we jointly infer the rigid motion of these primitives through an optimisation pipeline, yielding a 4D reconstruction of the scene, i.e. providing 3D geometry dynamically moving through time. To achieve this, we also introduce a mechanism to extrapolate motion for objects that become invisible, employing motion-grouping techniques to maintain continuity.The resulting system enables 4D spatio-temporal awareness, offering capabilities such as replayable 3D reconstructions of articulated objects through time, multi-object scanning, and object permanence. On object scanning and multi-object datasets, our system significantly outperforms existing methods both quantitatively and qualitatively.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40358", "url": null, "sourceid": -41171, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39132?format=json"], "related_events_ids": [39132]}, {"id": 39138, "uid": "e49774d191d2589093dfa6dc07095670", "name": "Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing", "authors": [{"id": 126903, "fullname": "Rishubh Parihar", "url": "http://cvpr.thecvf.com/api/miniconf/users/126903?format=json", "institution": "Indian Institute of Science, Bangalore"}, {"id": 88548, "fullname": "Or Patashnik", "url": "http://cvpr.thecvf.com/api/miniconf/users/88548?format=json", "institution": "Tel Aviv University"}, {"id": 158587, "fullname": "Daniil Ostashev", "url": "http://cvpr.thecvf.com/api/miniconf/users/158587?format=json", "institution": "Snap Inc."}, {"id": 76920, "fullname": "R. Venkatesh Babu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76920?format=json", "institution": "Indian Institute of Science"}, {"id": 87616, "fullname": "Daniel Cohen-Or", "url": "http://cvpr.thecvf.com/api/miniconf/users/87616?format=json", "institution": "Google"}, {"id": 158585, "fullname": "Kuan-Chieh Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158585?format=json", "institution": "Snap Inc."}], "abstract": "Instruction-based image editing offers a powerful and intuitive way to manipulate images through natural language. Yet, relying solely on text instructions limits fine-grained control over the extent of edits. We introduce Kontinuous Kontext, an instruction-driven editing model that provides a new dimension of control over edit strength, enabling users to adjust edits gradually from no change to a fully realized result in a smooth and continuous manner. Kontinuous Kontext extends a state-of-the-art image editing model to accept an additional input, a scalar edit strength which is then paired with the edit instruction, enabling explicit control over the extent of the edit. To inject this scalar information, we train a lightweight projector network that maps the input scalar and the edit instruction to coefficients in the model's modulation space. For training our model, we synthesize a diverse dataset of image-edit-instruction-strength quadruplets using existing generative models, followed by a filtering stage to ensure quality and consistency. Kontinuous Kontext provides a unified approach for fine-grained control over edit strength for instruction driven editing from subtle to strong across diverse operations such as stylization, attribute, material, background, and shape changes, without requiring attribute-specific training.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39138", "url": null, "sourceid": 42590, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39136, "uid": "3f882a4bb95c43e27924205306955021", "name": "CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance", "authors": [{"id": 170058, "fullname": "Hanyang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/170058?format=json", "institution": "Tsinghua University"}, {"id": 191425, "fullname": "Yiyang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191425?format=json", "institution": "Tsinghua University"}, {"id": 154876, "fullname": "Jiawei Chi", "url": "http://cvpr.thecvf.com/api/miniconf/users/154876?format=json", "institution": "Tsinghua University"}, {"id": 132561, "fullname": "Fangfu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/132561?format=json", "institution": "Tsinghua University"}, {"id": 191426, "fullname": "Ran Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/191426?format=json", "institution": "Tsinghua University"}, {"id": 76969, "fullname": "Yueqi Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/76969?format=json", "institution": "Tsinghua University"}], "abstract": "Classifier-Free Guidance (CFG) has emerged as a central approach for enhancing semantic alignment in flow-based diffusion models. In this paper, we explore a unified framework called **CFG-Ctrl**, which reinterprets CFG as a control applied to the first-order continuous-time generative flow, using the conditional\u2013unconditional discrepancy as an error signal to adjust the velocity field. From this perspective, we summarize vanilla CFG as a proportional controller (P-control) with fixed gain, and typical follow-up variants develop extended control-law designs derived from it. However, existing methods mainly rely on linear control, inherently leading to instability, overshooting, and degraded semantic fidelity especially on large guidance scales. To address this, we introduce Sliding Mode Control CFG (**SMC-CFG**), which enforces the generative flow toward a rapidly convergent sliding manifold. Specifically, we define an exponential sliding mode surface over the semantic prediction error and introduce a switching control term to establish nonlinear feedback-guided correction. Moreover, we provide a Lyapunov stability analysis to theoretically support finite-time convergence. Experiments across text-to-image generation models including Stable Diffusion 3.5, Flux, and Qwen-Image demonstrate that SMC-CFG outperforms standard CFG in semantic alignment and enhances robustness across a wide range of guidance scales.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39136", "url": null, "sourceid": 32478, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39140, "uid": "1786aee548304a7a90c0fd42c9bd919b", "name": "High-Fidelity Diffusion Face Swapping with ID-Constrained Facial Conditioning", "authors": [{"id": 130671, "fullname": "Dailan He", "url": "http://cvpr.thecvf.com/api/miniconf/users/130671?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 191432, "fullname": "Xiahong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191432?format=json", "institution": "Sensetime"}, {"id": 191433, "fullname": "Shulun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191433?format=json", "institution": "Sensetime"}, {"id": 71061, "fullname": "Hao Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/71061?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 187174, "fullname": "Bingqi Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/187174?format=json", "institution": "Sensetime Group Limmited"}, {"id": 153375, "fullname": "Guanglu Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/153375?format=json", "institution": "Sensetime"}, {"id": 86566, "fullname": "Yu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86566?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 86526, "fullname": "Hongsheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86526?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "Face swapping aims to seamlessly transfer a source facial identity onto a target while preserving target attributes such as pose and expression. Diffusion models, known for their superior generative capabilities, have recently shown promise in advancing face-swapping quality. This paper addresses two key challenges in diffusion-based face swapping: the prioritized preservation of identity over target attributes and the inherent conflict between identity and attribute conditioning. To tackle these issues, we introduce an identity-constrained attribute-tuning framework for face swapping that first ensures identity preservation and then fine-tunes for attribute alignment, achieved through a decoupled condition injection. We further enhance fidelity by incorporating identity and adversarial losses in a post-training refinement stage. Our proposed identity-constrained diffusion-based face-swapping model outperforms existing methods in both qualitative and quantitative evaluations, demonstrating superior identity similarity and attribute consistency, achieving a new state-of-the-art performance in high-fidelity face swapping.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39140", "url": null, "sourceid": 38382, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39145, "uid": "b32a0e6fda31ee75a103500e772d1ac3", "name": "DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer", "authors": [{"id": 141507, "fullname": "Yuxuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/141507?format=json", "institution": "NVIDIA"}, {"id": 182159, "fullname": "Katarina Tothova", "url": "http://cvpr.thecvf.com/api/miniconf/users/182159?format=json", "institution": "NVIDIA"}, {"id": 86830, "fullname": "Zian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86830?format=json", "institution": "NVIDIA, University of Toronto"}, {"id": 87520, "fullname": "Kangxue Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/87520?format=json", "institution": "NVIDIA"}, {"id": 159505, "fullname": "Haithem Turki", "url": "http://cvpr.thecvf.com/api/miniconf/users/159505?format=json", "institution": "NVIDIA"}, {"id": 191448, "fullname": "Riccardo de Lutio", "url": "http://cvpr.thecvf.com/api/miniconf/users/191448?format=json", "institution": "NVIDIA"}, {"id": 137205, "fullname": "Yen-Yu Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/137205?format=json", "institution": "Department of Computer Science, Cornell University"}, {"id": 87005, "fullname": "Or Litany", "url": "http://cvpr.thecvf.com/api/miniconf/users/87005?format=json", "institution": "NVIDIA / Technion"}, {"id": 151008, "fullname": "Sanja Fidler", "url": "http://cvpr.thecvf.com/api/miniconf/users/151008?format=json", "institution": "Department of Computer Science, University of Toronto"}, {"id": 86828, "fullname": "\u017dan Goj\u010di\u010d", "url": "http://cvpr.thecvf.com/api/miniconf/users/86828?format=json", "institution": "NVIDIA"}], "abstract": "Simulation is essential to the development and evaluation of autonomous robots such as self-driving vehicles. Neural reconstruction is emerging as a promising solution as it enables simulating a wide variety of scenarios from real-world data alone in an automated and scalable way. However, while methods such as NeRF and 3D Gaussian Splatting can produce visually compelling results, they often exhibit artifacts particularly when rendering novel views, and fail to realistically integrate inserted dynamic objects, especially when they were captured from different scenes. To overcome these limitations we introduce DiffusionHarmonizer, an online generative enhancement framework that transforms renderings from such imperfect scenes into photorealistic, temporally consistent outputs. At its core is a single-step temporally-conditioned enhancer that is converted from a pretrained multi-step image diffusion model, capable of running in online simulators on a single GPU. The key to training it effectively, is a custom data curation pipeline that constructs synthetic\u2013real pairs emphasizing appearance harmonization, artifact correction, and lighting realism. Experiments show that DiffusionHarmonizer substantially improves perceptual realism, being chosen by 84.28\\% of users in our comparative study over the second best method. Furthermore, it matches the temporal coherence of state-of-the art video models while maintaining the inference efficiency of single-step image models, offering a scalable and practical solution for photorealistic simulation in both research and production settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39145", "url": null, "sourceid": 32687, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39146, "uid": "c611127b94520ac5dacf8e2102c581ba", "name": "VOSR: A Vision-Only Generative Model for Image Super-Resolution", "authors": [{"id": 132960, "fullname": "Rongyuan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/132960?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 88758, "fullname": "Lingchen Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/88758?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 129145, "fullname": "Zhengqiang ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/129145?format=json", "institution": "The Hong Kong Polytechnic University, Hong Kong Polytechnic University"}, {"id": 131285, "fullname": "Xiangtao Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/131285?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 145175, "fullname": "Jixin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/145175?format=json", "institution": "Nanyang Technological University"}, {"id": 157791, "fullname": "Shihao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157791?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 88043, "fullname": "Lei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88043?format=json", "institution": "The Hong Kong Polytechnic University"}], "abstract": "Large-scale pre-trained text-to-image (T2I) diffusion models, such as Stable Diffusion, can be finetuned for image super-resolution (SR) with highly realistic details. While impressive, pre-training such multi-modal models demands billions of high-quality text-image pairs and substantial computational resources, despite that SR is fundamentally an image-to-image (I2I) task. This raises a critical question: do we truly need multi-modal priors and billion-scale text-image data to solve a purely vision task? In this paper, we propose **VOSR**, a **V**ision-**O**nly **S**uper-**R**esolution framework that eliminates the need for textual priors and multi-modal pretraining. We identify two key limitations in previous image-based, uni-modal diffusion models: limited visual semantic guidance and unstable unconditional training. To this end, we leverage a pretrained vision encoder to inject semantic cues, and introduce a relaxed unconditional objective that partially uses the low-quality condition to stabilize training. To accelerate inference, we adopt a modified shortcut model for one-step SR with minimal quality degradation. VOSR is trained from scratch with significantly less data and a lower computational cost compared to T2I-based diffusion models. However, VOSR achieves comparable or even better performance than state-of-the-art T2I-tuned SR methods on both synthetic and real-world benchmarks, demonstrating its potential as a scalable and competitive alternative for generative SR. Codes and models will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39146", "url": null, "sourceid": 44789, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39147, "uid": "06eed225533224d9b5d32c8f7606a633", "name": "MAD: Motion Appearance Decoupling for efficient Driving World Models", "authors": [{"id": 151466, "fullname": "Ahmad Rahimi", "url": "http://cvpr.thecvf.com/api/miniconf/users/151466?format=json", "institution": "EPFL"}, {"id": 181961, "fullname": "Valentin Gerard", "url": "http://cvpr.thecvf.com/api/miniconf/users/181961?format=json", "institution": "EPFL - EPF Lausanne; Universit\u00e9 de Lorraine"}, {"id": 86061, "fullname": "\u00c9loi Zablocki", "url": "http://cvpr.thecvf.com/api/miniconf/users/86061?format=json", "institution": "Valeo"}, {"id": 77540, "fullname": "Matthieu Cord", "url": "http://cvpr.thecvf.com/api/miniconf/users/77540?format=json", "institution": "Sorbonne Universit\u00e9"}, {"id": 97325, "fullname": "Alex Alahi", "url": "http://cvpr.thecvf.com/api/miniconf/users/97325?format=json", "institution": "EPFL"}], "abstract": "Recent video diffusion models generate photorealistic, temporally coherent videos, yet they fall short as reliable world models for autonomous driving, where structured motion and physically consistent interactions are essential. Adapting these generalist video models to driving domains has shown promise but typically requires massive domain-specific data and costly fine-tuning.We propose an efficient adaptation framework that converts generalist video diffusion models into controllable driving world models with minimal supervision. The key idea is to decouple motion learning from appearance synthesis. First, the model is adapted to predict structured motion in a simplified form: videos of skeletonized agents and scene elements, focusing learning on physical and social plausibility. Then, the same backbone is reused to synthesize realistic RGB videos conditioned on these motion sequences, effectively \u201cdressing\u201d the motion with texture and lighting.This two-stage process mirrors a reasoning-rendering paradigm: first infer dynamics, then render appearance. Our experiments show this decoupled approach is exceptionally efficient: adapting SVD, we match prior SOTA models with less than 6\\% of their compute. Scaling to LTX, our MAD-LTX model outperforms all open-source competitors, and supports a comprehensive suite of text, ego, and object controls. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39147", "url": null, "sourceid": 38109, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39151, "uid": "3de809f0da843c4f73fbff60159632be", "name": "ResiHMR: Residual-Limb Aware Single-Image 3D Human Mesh Recovery for Individuals with Limb Loss", "authors": [{"id": 191455, "fullname": "Jiaying Ying", "url": "http://cvpr.thecvf.com/api/miniconf/users/191455?format=json", "institution": "University of Queensland"}, {"id": 153124, "fullname": "Heming Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/153124?format=json", "institution": "Australian National University"}, {"id": 86350, "fullname": "Kaihao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86350?format=json", "institution": "Australian National University"}, {"id": 191456, "fullname": "Sean Tweedy", "url": "http://cvpr.thecvf.com/api/miniconf/users/191456?format=json", "institution": "The University of Queensland"}, {"id": 183016, "fullname": "Xin Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183016?format=json", "institution": "Adelaide University"}], "abstract": "Single-image human mesh recovery provides a compact 3D, person-centric representation that supports analysis, animation, AR and VR, rehabilitation, and human\u2013computer interaction. However, prevailing systems impose an intact-limb prior and degrade on people with limb loss, because fixed-topology models cannot represent residual limbs.In this work, we present ResiHMR, a residual-limb aware framework for single-image 3D human modeling. ResiHMR adopts residual-limb keypoints and introduces two components: (i) a topology-adaptive Residual Anchor-Factor Optimization module that constrains estimation to the observed kinematic subgraph of anatomically valid structures, and (ii) a geometry-based Residual-Limb Reconstruction module that estimates residual-limb boundaries and convex limb-termination geometry. Together, these modules introduce topology-aware optimization and explicit termination geometry as tools for human mesh recovery under non-standard limb anatomy.Unlike joint-removal methods in a fixed topology, ResiHMR explicitly reconstructs residual-limb surfaces and aligns optimization with limb-loss topology, which better matches prosthetic biomechanics and real-world use. To the best of our knowledge, this is the first single-image HMR system that explicitly reconstructs residual-limb surfaces and performs topology-adaptive optimization for individuals with limb loss. On a curated dataset of real-world images with limb loss, compared with SMPLify-X, ResiHMR reduces intact-joint 2D MPJPE from 41.32 to 37.40, increases mIoU from 0.662 to 0.703, and improves anatomical plausibility in expert ratings.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39151", "url": null, "sourceid": 34818, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39156, "uid": "d1f60334a2e88e320f513b8d40d1b84e", "name": "Efficient Encoder-Free Fourier-based 3D Large Multimodal Model", "authors": [{"id": 105964, "fullname": "Guofeng Mei", "url": "http://cvpr.thecvf.com/api/miniconf/users/105964?format=json", "institution": "Fondazione Bruno Kessler"}, {"id": 135282, "fullname": "Wei Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/135282?format=json", "institution": "ELLIS Unit &amp; LIT AI Lab, JKU"}, {"id": 91165, "fullname": "Luigi Riz", "url": "http://cvpr.thecvf.com/api/miniconf/users/91165?format=json", "institution": "Fondazione Bruno Kessler"}, {"id": 191472, "fullname": "Yujiao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191472?format=json", "institution": "CSIRO"}, {"id": 106509, "fullname": "Yiming Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/106509?format=json", "institution": "Fondazione Bruno Kessler"}, {"id": 84839, "fullname": "Fabio Poiesi", "url": "http://cvpr.thecvf.com/api/miniconf/users/84839?format=json", "institution": "Fondazione Bruno Kessler"}], "abstract": "Large Multimodal Models (LMMs) that process 3D data typically rely on heavy, pre-trained visual encoders to extract geometric features. While recent 2D LMMs have begun to eliminate such encoders for efficiency and scalability, extending this paradigm to 3D remains challenging due to the unordered and large-scale nature of point clouds. This leaves a critical unanswered question: How can we design an LMM that tokenizes unordered 3D data effectively and efficiently without a cumbersome encoder? We introduce Fase3D, the first efficient encoder-free Fourier-based 3D LMM. Fase3D tackles the challenges of scalability and permutation invariance with a novel tokenizer that combines point cloud serialization and the Fast Fourier Transform (FFT) to approximate self-attention. This design enables an effective and computationally minimal architecture, built upon three key innovations: First, we represent large scenes compactly via structured superpoints. Second, our space-filling curve serialization followed by an FFT enables efficient global context modeling and graph-based token merging. Lastly, our Fourier-augmented LoRA adapters inject global frequency-aware interactions into the LLMs at a negligible cost. Fase3D achieves performance comparable to encoder-based 3D LMMs while being significantly more efficient in computation and parameters. We will release code and models publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39156", "url": null, "sourceid": 44949, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39158, "uid": "dbfc43b5a635df63a2448f9c979d9bf5", "name": "From Spots to Pixels: Dense Spatial Gene Expression Prediction from Histology Images", "authors": [{"id": 183225, "fullname": "Ruikun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183225?format=json", "institution": "Beijing Institute of Technology"}, {"id": 70888, "fullname": "Yan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70888?format=json", "institution": "ANU"}, {"id": 88760, "fullname": "Liyuan Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88760?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "Spatial transcriptomics (ST) measures gene expression at fine-grained spatial resolution, offering insights into tissue molecular landscapes. Previous methods for spatial gene expression prediction typically crop spots of interest from histopathology slide images, and train models to map each spot to a corresponding gene expression profile. However, these methods inherently lose the spatial resolution in gene expression:1) each spot often contains multiple cells with distinct gene expression profiles;2) spots are typically defined at fixed spatial resolutions, limiting the ability to predict gene expression at varying scales. To address these limitations, this paper presents PixNet, a dense prediction network capable of predicting spatially resolved gene expression across spots of varying sizes and scales directly from histopathology slide images. Different from previous methods that map individual spots to gene expression values, we generate a spatially dense continuous gene expression map from the histopathology slide image, and aggregate values within spots of interest to predict the gene expression. Our PixNet outperforms state-of-the-art methods on four common ST datasets in multiple spatial scales. The source code will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39158", "url": null, "sourceid": 34218, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39160, "uid": "52238db3e51471b5f923a3481975f65d", "name": "XPaintNet: An eXtreme Lightweight Framework for Stereoscopic Conversion without Inpainting Network", "authors": [{"id": 95262, "fullname": "Kihwan Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/95262?format=json", "institution": "Korea Electronics Technology Institute"}, {"id": 191481, "fullname": "Juyeon Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/191481?format=json", "institution": "BLUEDOT Inc."}, {"id": 174714, "fullname": "Jeongheum Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174714?format=json", "institution": "BLUEDOT"}, {"id": 191482, "fullname": "Sijung Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/191482?format=json", "institution": "Bluedot Inc"}, {"id": 188049, "fullname": "Minyong Jeon", "url": "http://cvpr.thecvf.com/api/miniconf/users/188049?format=json", "institution": "BLUEDOT Inc."}], "abstract": "With the rapid growth of stereoscopic 3D devices, real-time stereoscopic conversion has become increasingly essential. However, most existing approach rely on depth estimation, forward warping, and heavy inpainting network, resulting in high computational cost and artifacts near occlusion boundaries. Diffusion-based models have also been explored, but they suffer from iterative sampling and geometric inconsistency, making them unsuitable for real-time deployment. To address these issues, we propose Bi-Warp, a simple yet effective approach that synthesizes the right view without inpainting network by leveraging warping operations. Our approach estimates backward flow, approximates the corresponding forward flow, and generates two candidate right views via bidirectional warping. A learnable mask adaptively fuses the candidates, preserving left\u2013right geometric consistency. Building on Bi-Warp, we introduce XPaintNet, a lightweight network that achieves comparable visual quality to state-of-the-art methods while maintaining real-time performance over 100 FPS at 2K resolution.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39160", "url": null, "sourceid": 31035, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39163, "uid": "e41e2bf8a60897a711c94976b5df5070", "name": "Hierarchical Concept Embedding & Pursuit for Interpretable Image Classification", "authors": [{"id": 181916, "fullname": "Nghia Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/181916?format=json", "institution": "University of Pennsylvania"}, {"id": 155762, "fullname": "Tianjiao Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/155762?format=json", "institution": "University of Pennsylvania"}, {"id": 131913, "fullname": "Rene Vidal", "url": "http://cvpr.thecvf.com/api/miniconf/users/131913?format=json", "institution": "University of Pennsylvania and Amazon"}], "abstract": "Interpretable-by-design models are gaining traction in computer vision because they provide faithful explanations for their predictions. In image classification, these models typically recover human-interpretable concepts from an image and use them for classification. Sparse concept recovery methods leverage the latent space of vision-language models to represent image embeddings as a sparse combination of concept embeddings. However, because such methods ignore the hierarchical structure of concepts, they can produce correct predictions with explanations that are inconsistent with the hierarchy. In this work, we propose Hierarchical Concept Embedding and Pursuit (HCEP), a framework that induces a hierarchy of concept vectors in the latent space and uses hierarchical sparse coding to recover the concepts present in an image. Given a semantic hierarchy of concepts, we construct a corresponding hierarchy of concept vectors and, assuming the correct concepts for an image form a rooted path in the hierarchy, derive desirable conditions for identifying them in the embedded space. We show that hierarchical sparse coding reliably recovers hierarchical concept vectors, whereas vanilla sparse coding fails. Our experiments demonstrate that HCEP outperforms baselines on real-world datasets in concept precision and recall while maintaining competitive classification accuracy. Moreover, when the number of samples is limited, HCEP achieves superior classification accuracy and concept recovery. These results suggest that incorporating hierarchical structures into sparse coding yields more reliable and interpretable image classification models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39163", "url": null, "sourceid": 39232, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39166, "uid": "4635d9474a5ef94cd03d40e385f4b177", "name": "ProOOD: Prototype-Guided Out-of-Distribution 3D Occupancy Prediction", "authors": [{"id": 184088, "fullname": "Yuheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184088?format=json", "institution": "Hunan University"}, {"id": 154552, "fullname": "Mengfei Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/154552?format=json", "institution": "Hunan University"}, {"id": 70993, "fullname": "Kunyu Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/70993?format=json", "institution": "KIT"}, {"id": 154554, "fullname": "Yuhang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154554?format=json", "institution": "Hunan University"}, {"id": 191493, "fullname": "Di Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191493?format=json", "institution": "Karlsruher Institut f\u00fcr Technologie"}, {"id": 156198, "fullname": "Danda Paudel", "url": "http://cvpr.thecvf.com/api/miniconf/users/156198?format=json", "institution": "INSAIT, Sofia University"}, {"id": 75489, "fullname": "Luc Van Gool", "url": "http://cvpr.thecvf.com/api/miniconf/users/75489?format=json", "institution": "INSAIT, Sofia Un. St. Kliment Ohridski"}, {"id": 89634, "fullname": "Kailun Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89634?format=json", "institution": "Karlsruher Institut f\u00fcr Technologie"}], "abstract": "3D semantic occupancy prediction is central to autonomous driving, yet current methods are vulnerable to long-tailed class bias and out-of-distribution (OOD) inputs, often overconfidently assigning anomalies to rare classes. We present ProOOD, a lightweight, plug-and-play method that couples prototype-guided refinement with training-free OOD scoring. ProOOD comprises (i) prototype-guided semantic imputation that fills occluded regions with class-consistent features, (ii) prototype-guided tail mining that strengthens rare-class representations to curb OOD absorption, and (iii) EchoOOD, which fuses local logit coherence with local and global prototype matching to produce reliable voxel-level OOD scores. Extensive experiments on five datasets demonstrate that ProOOD achieves state-of-the-art performance on both in-distribution 3D occupancy prediction and OOD detection. On SemanticKITTI, it surpasses baselines by +3.57\\% mIoU overall and +24.80\\% tail-class mIoU; on VAA-KITTI, it improves AuPRC$_r$ by +19.34 points, with consistent gains across benchmarks. These improvements yield more calibrated occupancy estimates and more reliable OOD detection in safety-critical urban driving. The source code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39166", "url": null, "sourceid": 31714, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39167, "uid": "67f28743b185754c52095e930d61e36c", "name": "VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion", "authors": [{"id": 129161, "fullname": "Linfeng Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129161?format=json", "institution": "Wuhan University"}, {"id": 191494, "fullname": "Yeda Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191494?format=json", "institution": "\u6b66\u6c49\u5927\u5b66"}, {"id": 191495, "fullname": "Meiqi Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/191495?format=json", "institution": "Wuhan University"}, {"id": 129146, "fullname": "Zizhuo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129146?format=json", "institution": "Wuhan University"}, {"id": 191496, "fullname": "Yuxin Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191496?format=json", "institution": "Wuhan University"}, {"id": 104468, "fullname": "Xunpeng Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/104468?format=json", "institution": "Wuhan University"}, {"id": 191497, "fullname": "Chunyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191497?format=json", "institution": "Wuhan University"}, {"id": 104466, "fullname": "Han Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/104466?format=json", "institution": "Southeast University"}, {"id": 129156, "fullname": "HAO ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/129156?format=json", "institution": "Wuhan University"}, {"id": 86222, "fullname": "Jiayi Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/86222?format=json", "institution": "Wuhan University"}], "abstract": "Compared to images, videos better align with real-world acquisition scenarios and possess valuable temporal cues. However, existing multi-sensor fusion research predominantly integrates complementary context from multiple images rather than videos. This primarily stems from two factors: 1) the scarcity of large-scale multi-sensor video datasets, limiting research in video fusion, and 2) the inherent difficulty of jointly modeling spatial and temporal dependencies in a unified framework. This paper proactively compensates for the dilemmas. First, we construct M3SVD, a benchmark dataset with 220 temporally synchronized and spatially registered infrared-visible videos comprising 153,797 frames, filling the data gap for the video fusion community. Secondly, we propose VideoFusion, a multi-modal video fusion model that fully exploits cross-modal complementarity and temporal dynamics to generate spatio-temporally coherent videos from multi-modal inputs. Specifically, 1) a differential reinforcement module is developed for cross-modal information interaction and enhancement, 2) a complete modality-guided fusion strategy is employed to adaptively integrate multi-modal features, and 3) a bi-temporal co-attention mechanism is devised to dynamically aggregate forward-backward temporal contexts to reinforce cross-frame feature representations. Extensive experiments reveal that VideoFusion outperforms existing image-oriented fusion paradigms in sequences, effectively mitigating temporal inconsistency and interference.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39167", "url": null, "sourceid": 39776, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39171, "uid": "a19fbff55f49a023807bfe3ea9a9946e", "name": "Text-guided Feature Disentanglement for Cross-modal Gait Recognition", "authors": [{"id": 180191, "fullname": "Zhiyang Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180191?format=json", "institution": "Xiamen University"}, {"id": 126436, "fullname": "Ming Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/126436?format=json", "institution": "Xiamen University"}], "abstract": "Gait recognition is a biometric technique that identifies individuals based on their walking patterns, offering advantages in long-range, non-intrusive scenarios. However, real-world scenarios often involve heterogeneous sensing modalities such as LiDAR and RGB cameras, making LiDAR-Camera Cross-modal Gait recognition (LCCGR) a critical yet challenging task due to the substantial modality gap between 2D videos and 3D point cloud sequences. To address this challenge, we propose TCFDNet, a Text-guided Cross-modal Feature Disentanglement Network, which leverages modality-aware textual priors as semantic anchors to guide the learning of disentangled modality-shared representations. Specifically, we construct a Gait Modality Text Dictionary (GMTD) using large language models to generate rich semantic descriptions of gait across modalities and viewpoints. A CLIP-based Multi-grained Feature Encoder then aligns visual and textual features within a unified vision-language space. Furthermore, the Text-guided Feature Disentanglement (TFD) module selects the $top\\text{-}k$ matched textual descriptions to reconstruct modality-specific representations and derive modality-shared features via residual decomposition and orthogonality constraints. To mitigate the fragility of the disentangled shared features, we propose a Feature Stability Enhancement (FSE) module, which models spatial and channel-wise correlations to improve feature robustness. In addition, a cross-modal patch exchange strategy is introduced to further improve generalization. Extensive experiments on SUSTech1K and FreeGait datasets demonstrate that TCFDNet achieves new state-of-the-art results and validate the effectiveness of the proposed modules.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39171", "url": null, "sourceid": 41104, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39175, "uid": "b334cac4ee40f5ccda7dc3d9f1e4f388", "name": "CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering", "authors": [{"id": 76883, "fullname": "Mingfang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76883?format=json", "institution": "The University of Tokyo"}, {"id": 182002, "fullname": "Jingjing Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/182002?format=json", "institution": "Woven by Toyota"}, {"id": 182395, "fullname": "Ashutosh Kumar", "url": "http://cvpr.thecvf.com/api/miniconf/users/182395?format=json", "institution": "Woven by Toyota, Inc."}, {"id": 136252, "fullname": "Rajat Saini", "url": "http://cvpr.thecvf.com/api/miniconf/users/136252?format=json", "institution": "Woven by Toyota"}, {"id": 178401, "fullname": "Mustafa Erdogan", "url": "http://cvpr.thecvf.com/api/miniconf/users/178401?format=json", "institution": "Woven by Toyota Inc."}, {"id": 191509, "fullname": "Hsuan-Kung Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191509?format=json", "institution": "National Tsinghua University; Woven by Toyota"}, {"id": 77362, "fullname": "Caixin Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77362?format=json", "institution": "The University of Tokyo"}, {"id": 187251, "fullname": "Yifei Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187251?format=json", "institution": "The University of Tokyo"}, {"id": 69170, "fullname": "Yoichi Sato", "url": "http://cvpr.thecvf.com/api/miniconf/users/69170?format=json", "institution": "University of Tokyo"}, {"id": 85099, "fullname": "Quan Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/85099?format=json", "institution": "Woven by Toyota"}], "abstract": "Cause-and-effect reasoning in video is a significant challenge for Vision-Language Models (VLMs), as it requires going beyond surface-level perception to a deeper understanding of causal mechanisms. However, existing benchmarks rarely provide the fine-grained, grounded evidence needed to rigorously evaluate this capability. To address this gap, we introduce CaST-Bench, a benchmark for Causal Chain-Grounded Spatio-Temporal Video Reasoning. CaST-Bench presents complex causal questions that require models to identify and localize a chain of multiple spatio-temporal evidences. Through a human-AI collaborative pipeline, we construct a high-quality dataset of 2,066 questions over 1,015 videos, with causal chains annotated by temporal segments and bounding-box tracks. Furthermore, we design a comprehensive evaluation suite with novel metrics that assess not only answer correctness but also the capability for visual evidence grounded reasoning. This grounding is crucial for improving accuracy by mitigating spurious correlations and for enhancing user trust by making models more transparent. Our experiments show that current VLMs struggle with causal questions, largely due to their limited ability to construct precise and grounded causal chains. This highlights an important direction for improving future VLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39175", "url": null, "sourceid": 41537, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39174, "uid": "eb47593d2d06ea177c0fdb7013b1707b", "name": "D-Convexity: A Unified Differentiable Convex Shape Prior via Quasi-Concavity for Data-driven Image Segmentation", "authors": [{"id": 158231, "fullname": "Shengzhe Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/158231?format=json", "institution": "Arizona State University"}, {"id": 179653, "fullname": "Hao Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/179653?format=json", "institution": "Arizona State University"}], "abstract": "Convexity is a fundamental geometric prior that underlies many natural and man-made structures, yet remains challenging to impose effectively in end-to-end trainable segmentation networks. We revisit convexity from a functional perspective and propose a unified, threshold-free convexity prior based on quasi-concavity of the network output mask function $u$. Instead of constraining a single binary segmentation, we require all super-level sets of $u$ to be convex, transforming global shape constraints into local, differentiable inequalities on $u$ and its derivatives. From this principle we derive zero, first, and second-order characterizations, yielding respectively a local midpoint convexification operator, a gradient based condition linked to supporting hyperplanes, and a sufficient second-order inequality expressed by a quadratic form on the tangent plane. The first and second-order formulations produce a compact convolutional loss that can be densely applied across the image without thresholding.  Our quasi-concavity losses integrate seamlessly with modern segmentation networks via the proposed convex gradient projection module (CGPM). They consistently enforce convexity and improve shape regularity across multiple datasets, outperforming networks tailored for retinal segmentation and surpassing prior shape-aware methods. Remarkably, our analysis unifies a wide spectrum of previous convex shape models, from discrete 1\u20130\u20131 line constraints and graph-cuts convexity formulations to curvature or signed distance Laplacian based level-set priors under one continuous, differentiable framework.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39174", "url": null, "sourceid": 30834, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39178, "uid": "bb74717d9aed787435ff5eae508b15f0", "name": "Inter-Edit: First Benchmark for Interactive Instruction-Based Image Editing", "authors": [{"id": 184224, "fullname": "Delong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184224?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 191512, "fullname": "Haotian Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191512?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 152334, "fullname": "Zhaohui Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/152334?format=json", "institution": "Sensetime"}, {"id": 152331, "fullname": "Zhiyuan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152331?format=json", "institution": "Sensetime"}, {"id": 191513, "fullname": "Shihao Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/191513?format=json", "institution": "Sensetime"}, {"id": 152335, "fullname": "Mingjie Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/152335?format=json", "institution": "SenseTime Research"}, {"id": 131585, "fullname": "Zhicheng Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/131585?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 131551, "fullname": "Fei Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/131551?format=json", "institution": "Beijing University of Posts and Telecommunications"}], "abstract": "Precise and controllable image editing remains a significant challenge. Current methods often rely on text prompts, but achieving accurate spatial localization solely through descriptions is inherently difficult. Mask-based approaches, though offering better control, typically require overly precise user annotations, thus increasing user burden and leading to unnatural results. To bridge this gap, we introduce the **I**nteractive **I**nstruction-based **I**mage **E**diting ($I_3E$) task, which generates high-quality edits from a more intuitive combination: concise text instructions and imprecise spatial guidance. To address the critical lack of suitable data, we propose an efficient pipeline to generate Inter-Edit, a new million-scale training dataset that simulates realistic user masks---not strictly segment-aligned. We also present a comprehensive benchmark, featuring a meticulously human-annotated test set that captures diverse, localization-dependent editing scenarios and realistic user interaction patterns. To evaluate this task, we introduce a new suite of position-aware metrics that strongly correlate with human perceptual judgments. Finally, we develop three baseline models trained on Inter-Edit. Extensive experiments demonstrate that our methods significantly enhance $I_3E$ performance, achieving substantial improvements in localization and edit quality, and outperforming existing state-of-the-art models. The Inter-Edit dataset and all related code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39178", "url": null, "sourceid": 46510, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39183, "uid": "208a95b2942b24eb3c259896db223cd7", "name": "No Labels, No Look-Ahead: Unsupervised Online Video Stabilization with Classical Priors", "authors": [{"id": 191523, "fullname": "Kan Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/191523?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 191524, "fullname": "Gang Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191524?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 180386, "fullname": "TAO LIU", "url": "http://cvpr.thecvf.com/api/miniconf/users/180386?format=json", "institution": "Nanjing University of Science and Technology"}], "abstract": "We propose a novel unsupervised framework for online video stabilization. Unlike deep learning-based stabilizers that require paired stable/unstable datasets, our method models the classical three-stage stabilization pipeline and integrates a multithreaded buffering mechanism, effectively addressing three key challenges of end-to-end learning: limited data, poor controllability, and inefficiency on resource-constrained hardware. Existing benchmarks mainly focus on handheld, forward-view, visible-light videos, restricting the application of stabilization in domains such as UAV nighttime remote sensing. To fill this gap, we introduce a new multimodal UAV aerial video dataset (UAV-Test). Experiments show that our approach consistently outperforms state-of-the-art online stabilizers in both quantitative metrics and visual quality, while achieving performance comparable to that of offline methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39183", "url": null, "sourceid": 45170, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39187, "uid": "39469285974d13e4e2a6d1010747820d", "name": "Verifying Neural Network Robustness with Dual Perturbations", "authors": [{"id": 180477, "fullname": "Hai Duong", "url": "http://cvpr.thecvf.com/api/miniconf/users/180477?format=json", "institution": "George Mason University"}, {"id": 191538, "fullname": "Son Vu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191538?format=json", "institution": "Hanoi University of Science and Technology"}, {"id": 180404, "fullname": "Thanh Le", "url": "http://cvpr.thecvf.com/api/miniconf/users/180404?format=json", "institution": "National Institute of Information and Communications Technology (NICT)"}, {"id": 191539, "fullname": "ThanhVu Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191539?format=json", "institution": "George Mason University"}], "abstract": "Safety-critical deep learning systems must be robust against real-world corruptions combining spatially correlated distortions and independent noise.Current deep neural network verification methods handle these perturbations separately, either checking independent pixel-wise perturbations or restricted convolutional transformations using predefined patterns.This gap prevents assessing robustness under realistic conditions where both perturbation types occur simultaneously.To address these limitations, we propose VeriDou, a framework that introduces:(i) universal convolutional perturbations that enable verification across continuous spatial distortion spaces, and(ii) dual perturbations that capture both convolutional distortions and independent pixel-level variations.Our evaluation on a set of diverse benchmarks with 14340 instances shows VeriDou's dual perturbations approach found substantially more adversarial examples on networks that existing methods claimed to be highly robust.This shows that VeriDou is able to explore a broader range of unsafe regions and thus enhances formal assessment of robustness.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39187", "url": null, "sourceid": 46108, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39191, "uid": "54996ced8ec545754b9c7404027969d8", "name": "FastGS: Training 3D Gaussian Splatting in 100 Seconds", "authors": [{"id": 175919, "fullname": "Shiwei Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/175919?format=json", "institution": "NanKai University"}, {"id": 191545, "fullname": "Tianci Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191545?format=json", "institution": "Nankai University"}, {"id": 175967, "fullname": "Yongchun Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/175967?format=json", "institution": "NanKai University"}, {"id": 146696, "fullname": "Biao Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/146696?format=json", "institution": "Nankai University"}], "abstract": "The dominant 3D Gaussian splatting (3DGS) acceleration methods fail to properly regulate the number of Gaussians during training, causing redundant computational time overhead. In this paper, we propose FastGS, a novel, simple, and general acceleration framework that fully considers the importance of each Gaussian based on multi-view consistency, efficiently solving the trade-off between training time and rendering quality. We innovatively design a densification and pruning strategy based on multi-view consistency, dispensing with the budgeting mechanism. Extensive experiments on Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets demonstrate that our method significantly outperforms the state-of-the-art methods in training speed, achieving a 3.29\u00d7 training acceleration and comparable rendering quality compared with DashGaussian on the Mip-NeRF 360 dataset and a 15.45\u00d7 acceleration compared with vanilla 3DGS on the Deep Blending dataset. We demonstrate that FastGS exhibits strong generality, delivering 2-6\u00d7 training acceleration across various tasks, including dynamic scene reconstruction, surface reconstruction, sparse-view reconstruction, large-scale reconstruction, and simultaneous localization and mapping.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39191", "url": null, "sourceid": 39982, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39192, "uid": "aff71375ac1acbe384b7ff4f4952c6ff", "name": "Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation", "authors": [{"id": 75964, "fullname": "Shubhankar Borse", "url": "http://cvpr.thecvf.com/api/miniconf/users/75964?format=json", "institution": "Qualcomm AI Research"}, {"id": 144201, "fullname": "Phuc Pham", "url": "http://cvpr.thecvf.com/api/miniconf/users/144201?format=json", "institution": "University of Science, VNU-HCM"}, {"id": 176944, "fullname": "Farzad Farhadzadeh", "url": "http://cvpr.thecvf.com/api/miniconf/users/176944?format=json", "institution": "Qualcomm AI Research"}, {"id": 90619, "fullname": "Seokeon Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/90619?format=json", "institution": "Qualcomm AI Research"}, {"id": 76220, "fullname": "Phong Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76220?format=json", "institution": "VinAI"}, {"id": 76979, "fullname": "Anh Tran", "url": "http://cvpr.thecvf.com/api/miniconf/users/76979?format=json", "institution": "VinAI Research"}, {"id": 90623, "fullname": "Sungrack Yun", "url": "http://cvpr.thecvf.com/api/miniconf/users/90623?format=json", "institution": "Qualcomm AI Research"}, {"id": 85738, "fullname": "Munawar Hayat", "url": "http://cvpr.thecvf.com/api/miniconf/users/85738?format=json", "institution": "Monash University"}, {"id": 85634, "fullname": "Fatih Porikli", "url": "http://cvpr.thecvf.com/api/miniconf/users/85634?format=json", "institution": "QualComm"}], "abstract": "Despite recent advances in text-to-image generation, existing models consistently fail to produce reliable multi-human scenes, often duplicating faces, merging identities, or miscounting individuals. We present Ar2Can, a novel two-stage framework that disentangles spatial planning from identity rendering for multi-human generation. The Architect module predicts structured layouts, specifying where each person should appear. The Artist module then synthesizes photorealistic images, guided by a spatially-grounded face matching reward that combines Hungarian spatial alignment with ArcFace identity similarity. This approach ensures faces are rendered at correct locations and faithfully preserve reference identities. We develop two Architect variants, seamlessly integrated with our diffusion-based Artist model and optimized via Group Relative Policy Optimization (GRPO) using compositional rewards for count accuracy, image quality, and identity matching. Evaluated on the MultiHuman-Testbench, Ar2Can achieves substantial improvements in both count accuracy and identity preservation, while maintaining high perceptual quality. Notably, our method achieves these results using primarily synthetic data, without requiring real multi-human images.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39192", "url": null, "sourceid": 36268, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39195, "uid": "4d8da35761b21e91e579563043237f3f", "name": "EMGauss: Continuous Slice-to-3D Reconstruction via Dynamic Gaussian Modeling in Volume Electron Microscopy", "authors": [{"id": 168495, "fullname": "Yumeng He", "url": "http://cvpr.thecvf.com/api/miniconf/users/168495?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 71290, "fullname": "Zanwei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/71290?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 191557, "fullname": "Yekun Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191557?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 176202, "fullname": "Chen Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176202?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 180804, "fullname": "Yunbo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180804?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 86696, "fullname": "Xiaokang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86696?format=json", "institution": "Shanghai Jiao Tong University, China"}], "abstract": "Volume electron microscopy (vEM) enables nanoscale 3D imaging of biological structures but remains constrained by acquisition trade-offs, leading to anisotropic volumes with limited axial resolution. Existing deep learning methods seek to restore isotropy by leveraging lateral priors; yet their assumptions break down for morphologically anisotropic structures. We present **EMGauss**, a general framework for 3D reconstruction from planar scanned 2D slices with applications in vEM, which circumvents the inherent limitations of isotropy-based approaches. Our key innovation is to reframe slice-to-3D reconstruction as a 3D dynamic scene rendering problem based on Gaussian splatting, where the progression of axial slices is modeled as the temporal evolution of 2D Gaussian point clouds. To enhance fidelity in data-sparse regimes, we incorporate a **Teacher\u2013Student bootstrapping mechanism** that uses high-confidence predictions on unobserved slices as pseudo-supervisory signals. Compared with diffusion- and GAN-based reconstruction methods, EMGauss substantially improves interpolation quality, enables continuous slice synthesis, and eliminates the need for large-scale pretraining. Beyond vEM, it potentially provides a generalizable slice-to-3D solution across diverse imaging domains.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39195", "url": null, "sourceid": 33421, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39201, "uid": "bb0848134966fc5db45aa411b8933a5d", "name": "Beyond Rule-Based Agents: Active Markov Games for Realistic Multi-Agent Interaction in Autonomous Driving", "authors": [{"id": 183343, "fullname": "Yuan Gui", "url": "http://cvpr.thecvf.com/api/miniconf/users/183343?format=json", "institution": "Northeastern University"}, {"id": 152374, "fullname": "Hongchen Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/152374?format=json", "institution": "Northeastern University"}, {"id": 191571, "fullname": "JiaoWang JiaoWang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191571?format=json", "institution": null}, {"id": 176338, "fullname": "Qu Liqi", "url": "http://cvpr.thecvf.com/api/miniconf/users/176338?format=json", "institution": "Northeastern Universal,China"}], "abstract": "Current research in autonomous driving heavily relies on large-scale driving datasets for model fitting or trial-and-error learning strategies in simulation environments. However, these approaches suffer from limited behavioral diversity and fail to cover complex edge-case interactions. To address these limitations, we model the driving environment as an Active Markov Game (AMG) and introduce a multi-agent co-evolutionary training framework for more realistic interactive learning. The AMG formulation extends traditional Markov games by explicitly making state transitions and rewards dependent on the evolving strategies of the agents, thus capturing the interactive dynamics and strategic coupling between the ego vehicle and surrounding agents. Building on this, our multi-agent co-evolutionary training mechanism jointly optimizes the ego vehicle's policy and a diverse pool of opponent strategies, allowing all agents to adapt to each other's behaviors during training. This game-theoretic approach produces a robust ego agent capable of handling diverse, non-stationary driving strategies, overcoming the \"non-responsive opponent\" limitation found in prior methods. In CARLA simulations of unsignalized intersections and long-tail scenarios, our method performs exceptionally well, achieving near-perfect success rates (98\\%) with minimal collisions (2\\%), and significantly outperforming state-of-the-art baselines such as PPO, DDPG, and IPPO in terms of generalization, safety margins, and control smoothness. These results demonstrate that our approach substantially enhances the robustness, safety, and strategic adaptability of autonomous driving in complex multi-agent environments.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39201", "url": null, "sourceid": 36098, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39202, "uid": "d98d13ba5eca356faee405b3d49d316c", "name": "TrackMAE: Video Representation Learning via Track Mask and Predict", "authors": [{"id": 154595, "fullname": "Renaud Vandeghen", "url": "http://cvpr.thecvf.com/api/miniconf/users/154595?format=json", "institution": "University of Li\u00e8ge"}, {"id": 156275, "fullname": "Fida Mohammad Thoker", "url": "http://cvpr.thecvf.com/api/miniconf/users/156275?format=json", "institution": "King Abdullah University of Science and Technology"}, {"id": 71630, "fullname": "Marc Van Droogenbroeck", "url": "http://cvpr.thecvf.com/api/miniconf/users/71630?format=json", "institution": "University of Li\u00e8ge"}, {"id": 75441, "fullname": "Bernard Ghanem", "url": "http://cvpr.thecvf.com/api/miniconf/users/75441?format=json", "institution": "KAUST"}], "abstract": "Masked video modeling (MVM) has emerged as a simple and scalable self-supervised pretraining paradigm, but only models motion information implicitly, limiting the encoding of temporal dynamics in the learned representations. As a result, such models struggle on motion-centric tasks that require fine-grained motion awareness. To address this, we propose TrackMAE, a simple masked video modeling paradigm that explicitly uses motion information as a reconstruction signal. In TrackMAE, we use an off-the-shelf point tracker to sparsely track points in the input videos generating motion trajectories. Furthermore, we exploit the extracted trajectories to improve the random tube masking with a motion-aware masking strategy.We enhance video representations learned in both pixel and feature semantic reconstruction space by providing a complimentary supervision signal in the form of motion targets. We evaluate on six datasets across diverse downstream settings and find that TrackMAE consistently outperforms the state-of-the-art video SSL baselines, therefore learning more discriminative and generalizable representations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39202", "url": null, "sourceid": 39758, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39209, "uid": "bacd7873c0e40a3b3c722bb0cc5de6bd", "name": "Shoe Style-Invariant and Ground-Aware Learning for Dense Foot Contact Estimation", "authors": [{"id": 96222, "fullname": "Daniel Jung", "url": "http://cvpr.thecvf.com/api/miniconf/users/96222?format=json", "institution": "Seoul National University"}, {"id": 69929, "fullname": "Kyoung Mu Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/69929?format=json", "institution": "Seoul National University"}], "abstract": "Foot contact plays a critical role in human interaction with the world, and thus exploring foot contact can advance our understanding of human movement and physical interaction. Despite its importance, existing methods often approximate foot contact using a zero-velocity constraint and focus on joint-level contact, failing to capture the detailed interaction between the foot and the world. Dense estimation of foot contact is crucial for accurately modeling this interaction, yet predicting dense foot contact from a single RGB image remains largely underexplored. There are two main challenges for learning dense foot contact estimation. First, shoes exhibit highly diverse appearances, making it difficult for models to generalize across different styles. Second, ground often has a monotonous appearance, making it difficult to extract informative features. To tackle these issues, we present a FEet COntact estimation (FECO) framework that learns dense foot contact with shoe style-invariant and ground-aware learning. To overcome the challenge of shoe appearance diversity, our approach incorporates shoe style adversarial training that enforces shoe style-invariant features for contact estimation. To effectively utilize ground information, we introduce a ground feature extractor that captures ground properties based on spatial context. As a result, our proposed method achieves robust foot contact estimation regardless of shoe appearance and effectively leverages ground information. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39209", "url": null, "sourceid": 31122, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39211, "uid": "c034fd0abafb327e741dd9e317af3998", "name": "Diffusion Sampling Path Tells More:  An Efficient Plug-and-Play Strategy for Sample Filtering", "authors": [{"id": 181247, "fullname": "SIXIAN WANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/181247?format=json", "institution": "Shenzhen Research Institute of Big Data"}, {"id": 191594, "fullname": "Zhiwei Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191594?format=json", "institution": "Alibaba Group"}, {"id": 191595, "fullname": "Tsung-Hui Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191595?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}], "abstract": "Diffusion models often exhibit inconsistent sample quality due to stochastic variations inherent in their sampling trajectories. Although training-based fine-tuning and inference-time alignment techniques aim to improve sample fidelity, they typically necessitate full denoising processes and external reward signals. This incurs substantial computational costs, hindering their broader applicability. In this work, we unveil an intriguing phenomenon: a previously unobserved yet exploitable link between sample quality and characteristics of the denoising trajectory during classifier-free guidance (CFG). Specifically, we identify a strong correlation between high-density regions of the sample distribution and the Accumulated Score Differences (ASD)\u2014the cumulative divergence between conditional and unconditional scores. Leveraging this insight, we introduce CFG-Rejection, an efficient, plug-and-play strategy that filters low-quality samples at an early stage of the denoising process, crucially without requiring external reward signals or model retraining. Importantly, our approach necessitates no modifications to model architectures or sampling schedules and maintains full compatibility with existing diffusion frameworks. We validate the effectiveness of CFG-Rejection in image generation through extensive experiments, demonstrating marked improvements on human preference scores (HPSv2, PickScore) and challenging benchmarks (GenEval, DPG-Bench). We anticipate that CFG-Rejection will offer significant advantages for diverse generative modalities beyond images, paving the way for more efficient and reliable high-quality sample generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39211", "url": null, "sourceid": 35067, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39214, "uid": "8f24f9bb371471be344cdb6fbcd99688", "name": "Rotation Invariant and Symmetry Aware Pixel Difference Network for Remote Sensing Object Detection", "authors": [{"id": 191605, "fullname": "Jialei Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191605?format=json", "institution": "National University of Defense Technology"}, {"id": 191606, "fullname": "Li Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191606?format=json", "institution": "National University of Defense Technology"}, {"id": 191607, "fullname": "Jiehua Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191607?format=json", "institution": null}, {"id": 191608, "fullname": "Yuhang Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/191608?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 132148, "fullname": "Yongxiang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/132148?format=json", "institution": "National University of Defense Technology"}, {"id": 191609, "fullname": "Jiangming Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191609?format=json", "institution": "National University of Defense Technology"}, {"id": 90540, "fullname": "Ming-Ming Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90540?format=json", "institution": "Nankai University, Tsinghua University"}], "abstract": "Recent advancements in remote sensing object detection have predominantly focused on oriented bounding box design and small object feature enhancement, while often overlooking the intrinsic geometric properties of remote sensing images, such as rotation invariance and structural symmetry. Many aerial objects appear in multiple orientations and exhibit clear symmetrical patterns, which, if not explicitly modeled, can lead to detection failures and inaccurate localization under geometric variation or partial occlusion. To address this, we propose the Rotation Invariant and Symmetry Aware Pixel Difference Network (RIS-PiDiNet), which introduces a novel convolutional operator called Rotation Invariant and Symmetry Aware Pixel Difference Convolution (RIS-PDC). This operator replaces traditional convolution with a mathematically grounded formulation that encodes rotation group priors and symmetrical constraints. RIS-PDC utilizes pixel differences and symmetry-guided aggregation in the polar harmonic space, enabling the network to infer partially visible structures and deduce occluded symmetrical parts. Besides improving detection accuracy, RIS-PDC enhances model interpretability by embedding geometric principles into the network design. Feature visualizations demonstrate rotation-consistent activations and symmetry-complete responses, revealing how the network captures underlying object structure even under partial visibility or orientation changes. This yields geometrically interpretable detection decisions. To our knowledge, RIS-PiDiNet is the first remote sensing object detection framework that jointly incorporates rotation invariance and symmetry modeling within a unified architecture. Extensive evaluations on standard benchmarks validate its effectiveness, achieving state-of-the-art performance on DOTA-v1.0 (78.53\\% mAP single-scale, 81.81\\% multi-scale), HRSC2016 (98.60\\% mAP), and DIOR-R (67.28\\% mAP), all with acceptable computational overhead and no increase in parameter count.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39214", "url": null, "sourceid": 39795, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39217, "uid": "c04c93a41c414f2801d74c3d5a063a34", "name": "ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions", "authors": [{"id": 150186, "fullname": "Xiaoxue Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/150186?format=json", "institution": "Fudan University"}, {"id": 126954, "fullname": "Xinyuan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126954?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 126959, "fullname": "Yaohui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126959?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 86632, "fullname": "Yu Qiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86632?format=json", "institution": "Shanghai Aritifcal Intelligence Laboratory"}], "abstract": "Shot transitions play a pivotal role in multi-shot video generation, as they determine the overall narrative expression and the directorial design of visual storytelling.However, recent progress has primarily focused on low-level visual consistency across shots, neglecting how transitions are designed and how cinematographic language contributes to coherent narrative expression. This often leads to mere sequential shot changes without intentional film-editing patterns. To address this limitation, we propose ShotDirector, an efficient framework that integrates parameter-level camera control and hierarchical editing-pattern-aware prompting.Specifically, we adopt a camera control module that incorporates 6-DoF poses and intrinsic settings to enable precise camera information injection. In addition, a shot-aware mask mechanism is employed to introduce hierarchical prompts aware of professional editing patterns, allowing fine-grained control over shot content. Through this design, our framework effectively combines parameter-level conditions with high-level semantic guidance, achieving film-like controllable shot transitions.To facilitate training and evaluation, we construct ShotWeaver40K, a dataset that captures the priors of film-like editing patterns, and develop a set of evaluation metrics for controllable multi-shot video generation. Extensive experiments demonstrate the effectiveness of our framework.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39217", "url": null, "sourceid": 46386, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39218, "uid": "7c29440875ef4bda3f5f3e5d8d786786", "name": "SeeGroup: Multi-Layer Depth Estimation of Transparent Surfaces via Self-Determined Grouping", "authors": [{"id": 128203, "fullname": "Hongyu Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/128203?format=json", "institution": "Princeton University"}, {"id": 90297, "fullname": "Jia Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90297?format=json", "institution": "Princeton University"}], "abstract": "Transparent objects are common in daily life, and understanding their multi-layer depth information, including both the transparent surface and the objects behind it, is crucial for real-world applications that interact with transparent materials.However, existing depth methods produce only a single depth map, which is inherently ambiguous for transparent surfaces.In this work, We propose a multi-layer depth estimation method, SeeGroup, consisting of novel recurrent decomposition module design and an intensity-based formulation for multi-layer depth. Experiments demonstrate that our method significantly improves the state of the art of multi-layer depth estimation, improving quadruplet relative depth accuracy on LayeredDepth benchmark from 61.34\\% to 70.67\\%.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39218", "url": null, "sourceid": 31600, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40359?format=json"], "related_events_ids": [40359]}, {"id": 40360, "uid": "a56cdb06e9002dc7485f5969674474a3", "name": "SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching", "authors": [{"id": 154004, "fullname": "Yasaman Haghighi", "url": "http://cvpr.thecvf.com/api/miniconf/users/154004?format=json", "institution": "EPFL - EPF Lausanne"}, {"id": 97325, "fullname": "Alex Alahi", "url": "http://cvpr.thecvf.com/api/miniconf/users/97325?format=json", "institution": "EPFL"}], "abstract": "Diffusion models achieve state-of-the-art video generation but their many sequential denoising steps create a major computational bottleneck. Existing acceleration methods reuse cached model outputs at fixed timesteps chosen through heuristics, requiring heavy tuning and failing to adapt to each sample\u2019s complexity. We address this with a principled, sensitivity-aware caching framework. We first formalize the caching problem by analyzing the network's output sensitivity with respect to changes in its inputs\u2014namely, the noisy latent and the timestep. We demonstrate that this sensitivity is the key indicator of caching error. Building on this insight, we introduce Sensitivity-Aware Caching ($\\text{SenCache}$), a dynamic strategy that adaptively selects which timesteps to cache on a per-sample basis. This allows for less caching on challenging samples and more aggressive acceleration on simpler ones. Our method provides a robust theoretical grounding for adaptive caching, offering an explanation for why previous empirical criteria are partially effective and extending them with a dynamic, sample-specific approach. Experiments on Wan 2.1, CogVideoX and LTX-Video models demonstrate that our method outperforms existing caching strategies in visual quality under similar computational budgets.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40360", "url": null, "sourceid": -40256, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39219?format=json"], "related_events_ids": [39219]}, {"id": 40359, "uid": "7c29440875ef4bda3f5f3e5d8d786786", "name": "SeeGroup: Multi-Layer Depth Estimation of Transparent Surfaces via Self-Determined Grouping", "authors": [{"id": 128203, "fullname": "Hongyu Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/128203?format=json", "institution": "Princeton University"}, {"id": 90297, "fullname": "Jia Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90297?format=json", "institution": "Princeton University"}], "abstract": "Transparent objects are common in daily life, and understanding their multi-layer depth information, including both the transparent surface and the objects behind it, is crucial for real-world applications that interact with transparent materials.However, existing depth methods produce only a single depth map, which is inherently ambiguous for transparent surfaces.In this work, We propose a multi-layer depth estimation method, SeeGroup, consisting of novel recurrent decomposition module design and an intensity-based formulation for multi-layer depth. Experiments demonstrate that our method significantly improves the state of the art of multi-layer depth estimation, improving quadruplet relative depth accuracy on LayeredDepth benchmark from 61.34\\% to 70.67\\%.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40359", "url": null, "sourceid": -31600, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39218?format=json"], "related_events_ids": [39218]}, {"id": 39219, "uid": "a56cdb06e9002dc7485f5969674474a3", "name": "SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching", "authors": [{"id": 154004, "fullname": "Yasaman Haghighi", "url": "http://cvpr.thecvf.com/api/miniconf/users/154004?format=json", "institution": "EPFL - EPF Lausanne"}, {"id": 97325, "fullname": "Alex Alahi", "url": "http://cvpr.thecvf.com/api/miniconf/users/97325?format=json", "institution": "EPFL"}], "abstract": "Diffusion models achieve state-of-the-art video generation but their many sequential denoising steps create a major computational bottleneck. Existing acceleration methods reuse cached model outputs at fixed timesteps chosen through heuristics, requiring heavy tuning and failing to adapt to each sample\u2019s complexity. We address this with a principled, sensitivity-aware caching framework. We first formalize the caching problem by analyzing the network's output sensitivity with respect to changes in its inputs\u2014namely, the noisy latent and the timestep. We demonstrate that this sensitivity is the key indicator of caching error. Building on this insight, we introduce Sensitivity-Aware Caching ($\\text{SenCache}$), a dynamic strategy that adaptively selects which timesteps to cache on a per-sample basis. This allows for less caching on challenging samples and more aggressive acceleration on simpler ones. Our method provides a robust theoretical grounding for adaptive caching, offering an explanation for why previous empirical criteria are partially effective and extending them with a dynamic, sample-specific approach. Experiments on Wan 2.1, CogVideoX and LTX-Video models demonstrate that our method outperforms existing caching strategies in visual quality under similar computational budgets.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39219", "url": null, "sourceid": 40256, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40360?format=json"], "related_events_ids": [40360]}, {"id": 39223, "uid": "84a98e9bea194d59e442e2be756a2e08", "name": "Learning Differentiable Hierarchies in 3D Gaussian Splatting", "authors": [{"id": 103185, "fullname": "Youqi Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/103185?format=json", "institution": "Peking University"}, {"id": 129326, "fullname": "Wugen Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/129326?format=json", "institution": "Peking University"}, {"id": 128617, "fullname": "Hongbin Zha", "url": "http://cvpr.thecvf.com/api/miniconf/users/128617?format=json", "institution": "Peking University"}], "abstract": "Although 3D Gaussian Splatting (3DGS) has achieved impressive performance in real-time rendering, its unordered Gaussians make level-of-detail (LoD) construction and model compression highly challenging, limiting its applicability in customized scenarios.In this work, we propose a learning-based Gaussian hierarchy representation that ranks Gaussians by their contribution to the scene, enabling flexible LoD representations across arbitrary Gaussian counts.We first introduce a unified, continuous formulation and metric for Gaussian hierarchy. Then, we introduce a hierarchy-based modulated rendering method built upon a Differentiable Decreasing Step Function, which enables efficient hierarchy learning while maintaining approximately equivalent rendering. Moreover, we develop a PDF-Guided Active-Region Sampling strategy that encourages the learned hierarchy to become widely distributed within its value range.Our method requires no additional training stages and produces Gaussian hierarchies within training time comparable to classical 3DGS. Experiments on multiple datasets show that our approach achieves performance comparable to or surpassing state-of-the-art methods in both LoD rendering and model pruning.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39223", "url": null, "sourceid": 41357, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39227, "uid": "de39ebd3184a468cd1659f2d8a6717ba", "name": "PhysSkin: Real-Time and Generalizable Physics-Based Animation via Self-Supervised Neural Skinning", "authors": [{"id": 182503, "fullname": "Yuanhang Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/182503?format=json", "institution": "State Key Laboratory of CAD&amp;CG, Zhejiang University"}, {"id": 191643, "fullname": "Tao Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191643?format=json", "institution": "Zhejiang University"}, {"id": 191644, "fullname": "Xingxuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191644?format=json", "institution": "Zhejiang University"}, {"id": 191645, "fullname": "Boming Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191645?format=json", "institution": "Zhejiang University"}, {"id": 75767, "fullname": "Siyuan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75767?format=json", "institution": "Beijing Institute of General Artificial Intelligence"}, {"id": 87066, "fullname": "Ruizhen Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87066?format=json", "institution": "Shenzhen University"}, {"id": 191646, "fullname": "Peter Yichen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191646?format=json", "institution": "MIT"}, {"id": 86219, "fullname": "Hujun Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86219?format=json", "institution": "Zhejiang University"}, {"id": 76752, "fullname": "Zhaopeng Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/76752?format=json", "institution": "Zhejiang University"}], "abstract": "Achieving real-time physics-based animation that generalizes across diverse 3D shapes and discretizations remains a fundamental challenge. We introduce PhysSkin, a physics-informed framework that addresses this challenge. In the spirit of Linear Blend Skinning, we learn continuous skinning fields as basis functions lifting motion subspace coordinates to full-space deformation, with subspace defined by handle transformations. To generate mesh-free, discretization-agnostic, and physically consistent skinning fields that generalize well across diverse 3D shapes, PhysSkin employs a new neural skinning fields autoencoder which consists of a transformer-based encoder and a cross-attention decoder.Furthermore, we also develop a novel physics-informed self-supervised learning strategy that incorporates on-the-fly skinning-field normalization and conflict-aware gradient correction, enabling effective balancing of energy minimization, spatial smoothness, and orthogonality constraints.PhysSkin shows outstanding performance on generalizable neural skinning and enables real-time physics-based animation.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39227", "url": null, "sourceid": 40568, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39234, "uid": "6644da5af8cbc1c19a5c4ffbb088773c", "name": "Polarization State Tracing for Reflection Removal and Color-Consistent Reconstruction", "authors": [{"id": 182497, "fullname": "Dongyue Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182497?format=json", "institution": "Shenyang Institute of Automation Chinese Academy of Science"}, {"id": 191671, "fullname": "Yang Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191671?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 129312, "fullname": "Jiandong Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/129312?format=json", "institution": "The Shenyang Institute of Automation, Chinese Academy of Sciences"}], "abstract": "Colored glass is widely used in everyday settings, yet its reflective and absorptive properties often introduce ghost shadows and color bias in captured images. However, existing methods typically neglect the absorption issue, making it difficult to address color bias caused by colored glass. To address this, we are the first to apply polarization imaging theory to model the light transmission process within glass. Specifically, we propose a novel imaging model, the Polarization State Tracing Model (PSTM), which traces polarized light along multiple propagation paths and accounts for wavelength-selective absorption, enabling joint reflection removal and color-consistent reconstruction. Guided by PSTM, we design a Channel Ring Attention (CRA) mechanism to efficiently capture inter-angle polarization dependencies and enhance feature interaction across polarization channels, ensuring physically consistent recovery. Besides, the recovered polarization information can be directly applied to advanced downstream tasks, such as Shape-from-Polarization (SfP). We construct a real-world dataset, GlassPol, containing a wide range of glass materials, enabling testing under diverse optical conditions. Extensive experiments show that our method outperforms existing state-of-the-art methods, achieving up to a 3dB improvement in PSNR, establishing a new benchmark for polarized reflection removal.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39234", "url": null, "sourceid": 37525, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39237, "uid": "db15b038c3c1d8960907e8ac5691498f", "name": "FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation", "authors": [{"id": 180442, "fullname": "Hanxiao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180442?format=json", "institution": "CASIA"}, {"id": 133812, "fullname": "Yuanchen Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/133812?format=json", "institution": "VAST"}, {"id": 107310, "fullname": "Ying-Tian Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107310?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 155933, "fullname": "Zi-Xin Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/155933?format=json", "institution": "VAST"}, {"id": 185521, "fullname": "Biao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185521?format=json", "institution": "KAUST"}, {"id": 87014, "fullname": "Weize Quan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87014?format=json", "institution": "Institute of automation, Chinese Academy of Sciences"}, {"id": 126930, "fullname": "Ding Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126930?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 84806, "fullname": "Yan-Pei Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84806?format=json", "institution": "Tencent ARC Lab"}, {"id": 86977, "fullname": "Dong-Ming Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/86977?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}], "abstract": "Autoregressive models for 3D mesh generation suffer from a fundamental limitation: they flatten meshes into long vertex-coordinate sequences. This results in prohibitive computational costs, hindering the efficient synthesis of high-fidelity geometry. We argue this bottleneck stems from operating at the wrong semantic level. We introduce FACE, a novel Autoregressive Autoencoder (ARAE) framework that reconceptualizes the task by generating meshes at the face level. Our ``one-face-one-token'' strategy treats each triangle face, the fundamental building block of a mesh, as a single, unified token. This simple yet powerful design reduces the sequence length by a factor of nine, leading to an unprecedented compression ratio of 0.11, halving the previous state-of-the-art. This dramatic efficiency gain does not compromise quality; by pairing our face-level decoder with a powerful VecSet encoder, FACE achieves state-of-the-art reconstruction quality on standard benchmarks. The versatility of the learned latent space is further demonstrated by training a latent diffusion model that achieves high-fidelity, single-image-to-mesh generation. FACE provides a simple, scalable, and powerful paradigm that lowers the barrier to high-quality structured 3D content creation.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39237", "url": null, "sourceid": 44253, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39238, "uid": "0bf5bbf3842e8c8742a4d76148e0ef89", "name": "$\\text{F}^2\\text{HDR}$: Two-Stage HDR Video Reconstruction via Flow Adapter and Physical Motion Modeling", "authors": [{"id": 155872, "fullname": "Huanjing Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/155872?format=json", "institution": "Tianjin University"}, {"id": 174399, "fullname": "Dawei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/174399?format=json", "institution": "Tianjin University"}, {"id": 191684, "fullname": "Shaoxiong Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191684?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 87226, "fullname": "Jingyu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87226?format=json", "institution": "Tianjin University"}], "abstract": "Reconstructing High Dynamic Range (HDR) videos from sequences of alternating-exposure Low Dynamic Range (LDR) frames remains highly challenging, especially under dynamic scenes where cross-exposure inconsistencies and complex motion make inter-frame alignment difficult, leading to ghosting and detail loss. Existing methods often suffer from inaccurate alignment, suboptimal feature aggregation, and degraded reconstruction quality in motion-dominated regions. To address these challenges, we propose $\\text{F}^2\\text{HDR}$, a two-stage HDR video reconstruction framework that robustly perceives inter-frame motion and restores fine details in complex dynamic scenarios. The proposed framework integrates a flow adapter that adapts generic optical flow for robust cross-exposure alignment, a motion mask that identifies salient motion regions to guide ghosting and noise suppression, and a motion-aware refinement network that aggregates complementary information for coherent detail reconstruction. Extensive experiments demonstrate that $\\text{F}^2\\text{HDR}$ achieves state-of-the-art performance on real-world HDR video benchmarks, producing ghost-free and high-fidelity results under large motion and exposure variations. Code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39238", "url": null, "sourceid": 34081, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39239, "uid": "ff40b691c06418347e31890b6fc1e29f", "name": "Personalized Audio-driven Whole-body Talking Avatars", "authors": [{"id": 183032, "fullname": "Seungeun Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/183032?format=json", "institution": "Klleon"}, {"id": 127901, "fullname": "SeungJun Moon", "url": "http://cvpr.thecvf.com/api/miniconf/users/127901?format=json", "institution": "KLleon"}, {"id": 191685, "fullname": "Hah Min Lew", "url": "http://cvpr.thecvf.com/api/miniconf/users/191685?format=json", "institution": "Klleon AI Research"}, {"id": 191686, "fullname": "Ji-Su Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191686?format=json", "institution": "KLleon"}, {"id": 153134, "fullname": "Gyeong-Moon Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/153134?format=json", "institution": "Korea University"}], "abstract": "Prior conversational 3D avatar systems map audio to parametric poses and then render, creating a lossy bottleneck where quantization, retargeting, and tracking errors accumulate. This degrades audio\u2013motion synchronization and suppresses micro-articulations critical for realism\u2014such as bilabial closures, cheek inflation, nasolabial motion, blinks, and fine hand gestures\u2014especially under single-image personalization. We propose an end-to-end framework that builds a full-body, photorealistic 3D conversational avatar from a single image and drives it directly from audio, bypassing intermediate pose prediction. The avatar is modeled as a particle-based deformation field of 3D Gaussian primitives in a canonical space, with an audio-conditioned dynamics module that outputs per-particle trajectories for face, hands, and body, enabling localized high-frequency control with globally coherent motion. A splat-based differentiable renderer preserves identity, texture, and multi-view realism, while feature-level distillation from a large audio-driven video diffusion model and weak supervision from synthetic audio-conditioned clips further improve synchronization and natural expressivity. Joint photometric and temporal objectives shape the audio-conditioned deformation and rendering. Experiments across diverse speakers show improved lip\u2013audio sync, fine facial detail, and conversational gesture naturalness over pose-driven baselines, while preserving identity from a single photo and supporting photorealistic novel-view synthesis.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39239", "url": null, "sourceid": 31881, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39241, "uid": "222768daf052ac21011033265af61211", "name": "InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding", "authors": [{"id": 182395, "fullname": "Ashutosh Kumar", "url": "http://cvpr.thecvf.com/api/miniconf/users/182395?format=json", "institution": "Woven by Toyota, Inc."}, {"id": 136252, "fullname": "Rajat Saini", "url": "http://cvpr.thecvf.com/api/miniconf/users/136252?format=json", "institution": "Woven by Toyota"}, {"id": 182002, "fullname": "Jingjing Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/182002?format=json", "institution": "Woven by Toyota"}, {"id": 178401, "fullname": "Mustafa Erdogan", "url": "http://cvpr.thecvf.com/api/miniconf/users/178401?format=json", "institution": "Woven by Toyota Inc."}, {"id": 76883, "fullname": "Mingfang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76883?format=json", "institution": "The University of Tokyo"}, {"id": 191688, "fullname": "Betty Dem", "url": "http://cvpr.thecvf.com/api/miniconf/users/191688?format=json", "institution": "Toyota Motor Corporation"}, {"id": 181550, "fullname": "Norimasa Kobori", "url": "http://cvpr.thecvf.com/api/miniconf/users/181550?format=json", "institution": "Mercari, Inc."}, {"id": 85099, "fullname": "Quan Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/85099?format=json", "institution": "Woven by Toyota"}], "abstract": "Current vision\u2013language pre-training (VLP) paradigms excel at global scene understanding but struggle with instance-level reasoning due to global-only supervision. We introduce $\\textbf{InstAP}$, an $\\textbf{Inst}$ance-$\\textbf{A}$ware $\\textbf{P}$re-training framework that jointly optimizes global image\u2013text alignment and fine-grained, instance-level contrastive alignment by grounding textual mentions to specific spatial\u2013temporal regions. To support this, we present $\\textbf{InstVL}$, a large-scale dataset ($2$ million images, $50,000$ videos) with dual-granularity annotations: holistic scene captions and dense, grounded instance descriptions. On the InstVL benchmark, InstAP substantially outperforms existing VLP models on instance-level retrieval, and also surpasses a strong VLP baseline trained on the exact same data corpus, isolating the benefit of our instance-aware objective. Moreover, instance-centric pre-training improves global understanding: InstAP achieves competitive zero-shot performance on multiple video benchmarks, including MSR-VTT and DiDeMo. Qualitative visualizations further show that InstAP localizes textual mentions to the correct instances, while global-only models exhibit more diffuse, scene-level attention.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39241", "url": null, "sourceid": 37729, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39245, "uid": "0e2013ea8ef4a1c5845a0b0fdacd16e0", "name": "LRDUN: A Low-Rank Deep Unfolding Network for Efficient Spectral Compressive Imaging", "authors": [{"id": 147725, "fullname": "HE HUANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/147725?format=json", "institution": "Wuhan University"}, {"id": 191695, "fullname": "Yujun Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/191695?format=json", "institution": null}, {"id": 131639, "fullname": "Wei He", "url": "http://cvpr.thecvf.com/api/miniconf/users/131639?format=json", "institution": "Wuhan University"}], "abstract": "Deep unfolding networks (DUNs) have achieved remarkable success and become the mainstream paradigm for spectral compressive imaging (SCI) reconstruction. Existing DUNs are derived from full-HSI imaging models, where each stage operates directly on the high-dimensional HSI, refining the entire data cube based on the single 2D coded measurement. However, this paradigm leads to computational redundancy and suffers from the ill-posed nature of mapping 2D residuals back to 3D space of HSI. In this paper, we propose two novel imaging models corresponding to the spectral basis and subspace image by explicitly integrating low-rank (LR) decomposition with the sensing model. Compared to recovering the full HSI, estimating these compact low-dimensional components significantly mitigates the ill-posedness. Building upon these novel models, we develop the Low-Rank Deep Unfolding Network (LRDUN), which jointly solves the two subproblems within an unfolded proximal gradient descent (PGD) framework. Furthermore, we introduce a Generalized Feature Unfolding Mechanism (GFUM) that decouples the physical rank in the data-fidelity term from the feature dimensionality in the prior module, enhancing the representational capacity and flexibility of the network. Extensive experiments on simulated and real datasets demonstrate that the proposed LRDUN achieves state-of-the-art (SOTA) reconstruction quality with significantly reduced computational cost.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39245", "url": null, "sourceid": 44781, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39247, "uid": "884a06e5988eb41cfbd466142929bffe", "name": "FlashDecoder: Real-Time Latent-to-Pixel Streaming Decoder with Transformers", "authors": [{"id": 89221, "fullname": "Minguk Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89221?format=json", "institution": "POSTECH"}, {"id": 87833, "fullname": "Suha Kwak", "url": "http://cvpr.thecvf.com/api/miniconf/users/87833?format=json", "institution": "POSTECH"}], "abstract": "Recent progress in video generation has shifted large-scale models from convolutional architectures to Diffusion Transformers (DiT), yet latent-to-pixel video decoders remain predominantly convolutional. These decoders rely on heavy 3D convolutions, which slow down streaming generation and require spatial\u2013temporal tiling to handle high-resolution or long-duration outputs. We introduce FlashDecoder, the first Transformer-based latent-to-pixel video decoder designed for streaming. FlashDecoder processes video latents frame-by-frame during both training and inference, applying bidirectional spatial attention within each frame while maintaining causal temporal dependencies through a rolling KV cache. Crucially, causality is enforced by sequential frame processing rather than explicit attention masks, enabling the use of memory-efficient bidirectional attention kernels throughout. This unified streaming approach ensures constant per-frame computation and bounded memory via a fixed-size KV cache with automatic eviction of older frames, enabling stable training at resolutions up to 720p. Integrated into the Wan2.2 video VAE, FlashDecoder matches the reconstruction quality of the convolutional decoder (PSNR 38.38 vs. 38.29; LPIPS 0.046 vs. 0.039) while decoding up to 4x faster\u2014139 FPS at 480p and 69.6 FPS at 720p\u2014achieving real-time high-resolution video decoding on a single H100 GPU.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39247", "url": null, "sourceid": 35759, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39248, "uid": "8309ea7711b015f2b77aabc69bdcd99c", "name": "LEADER: Learning Reliable Local-to-Global Correspondences for LiDAR Relocalization", "authors": [{"id": 183444, "fullname": "Jianshi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183444?format=json", "institution": "Xiamen University"}, {"id": 156856, "fullname": "Minghang Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156856?format=json", "institution": "Xiamen University"}, {"id": 142036, "fullname": "dq Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/142036?format=json", "institution": "xiamen university"}, {"id": 77098, "fullname": "Wen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/77098?format=json", "institution": "schoold of informatics xiamen university"}, {"id": 70825, "fullname": "Sheng Ao", "url": "http://cvpr.thecvf.com/api/miniconf/users/70825?format=json", "institution": "Sun Yat-sen University"}, {"id": 86709, "fullname": "Siqi Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/86709?format=json", "institution": "Xiamen University"}, {"id": 86653, "fullname": "Chenglu Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/86653?format=json", "institution": "Xiamen University"}, {"id": 86652, "fullname": "Cheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86652?format=json", "institution": "Xiamen University"}], "abstract": "LiDAR relocalization has attracted increasing attention as it can deliver accurate 6-DoF pose estimation in complex 3D environments. Recent learning-based regression methods offer efficient solutions by directly predicting global poses without the need for explicit map storage. However, these methods often struggle in challenging scenes due to their equal treatment of all predicted points, which is vulnerable to noise and outliers. In this paper, we propose **LEADER**, a robust LiDAR-based localization framework enhanced by a simple, yet effective geometric encoder. Specifically, a Robust Projection-based Geometric Encoder architecture which captures multi-scale geometric features is first presented to enhance descriptiveness in geometric representation. A Truncated Relative Reliability loss is then formulated to model point-wise ambiguity and mitigate the influence of unreliable predictions. Extensive experiments on the Oxford RobotCar and NCLT datasets demonstrate that PerfectLoc outperforms state-of-the-art methods, achieving 24.1% and 73.9% relative reductions in position error over existing techniques, respectively. Source code will be released soon.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39248", "url": null, "sourceid": 36780, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39251, "uid": "8dd6c8ced32cca354f48b298d317d706", "name": "TAS-LoRA: Transformer Architecture Search with Mixture-of-LoRA Experts", "authors": [{"id": 157242, "fullname": "Jeimin Jeon", "url": "http://cvpr.thecvf.com/api/miniconf/users/157242?format=json", "institution": "Yonsei University"}, {"id": 177817, "fullname": "Hyunju Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/177817?format=json", "institution": "Yonsei University"}, {"id": 128195, "fullname": "Bumsub Ham", "url": "http://cvpr.thecvf.com/api/miniconf/users/128195?format=json", "institution": "Yonsei University"}], "abstract": "Transformer architecture search (TAS) discovers optimal vision transformer (ViT) architectures automatically, reducing human effort to manually design ViTs. However, existing TAS methods suffer from the feature collapse problem, where subnets within a supernet fail to learn subnet-specific features, mainly due to the shared weights in a supernet, limiting the performance of individual subnets. To address this, we propose TAS-LoRA, a novel method that introduces parameter-efficient low-rank adaptation (LoRA) to enable subnet-specific feature learning, while maintaining computational efficiency. TAS-LoRA incorporates a Mixture-of-LoRA-Experts (MoLE) strategy, where a lightweight router dynamically assigns LoRA experts based on subnet architectures, and introduces a group-wise router initialization technique to encourage diverse feature learning across experts early in training. Extensive experiments on ImageNet and several transfer learning benchmarks, including CIFAR-10/100, Flowers, CARS, and INAT-19, demonstrate that TAS-LoRA mitigates feature collapse effectively, improving performance over state-of-the-art TAS methods significantly. Our code will be made publicly available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39251", "url": null, "sourceid": 40180, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39252, "uid": "e5112e4098b5067659fd84a835110ebc", "name": "KASALv2: Fully Automatic 3D Rotational Symmetry Classification and Axis Localization", "authors": [{"id": 191708, "fullname": "Mengxin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191708?format=json", "institution": "Southeast University"}, {"id": 181091, "fullname": "Yulin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181091?format=json", "institution": "Southeast University"}, {"id": 191709, "fullname": "Chen LUO", "url": "http://cvpr.thecvf.com/api/miniconf/users/191709?format=json", "institution": null}, {"id": 191710, "fullname": "Yongzhe Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191710?format=json", "institution": "Southeast University"}, {"id": 191711, "fullname": "Yijun Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191711?format=json", "institution": "Southeast University"}], "abstract": "Rotational symmetry is an important prior in 6D pose estimation, improving pose accuracy and ensuring the consistency of symmetry-aware evaluation metrics. However, current symmetry annotations for 3D objects are still largely manual or semi-automatic, often requiring predefined symmetry types or rotational orders and thus limiting scalability. This work introduces a fully automatic and reference-free framework that performs symmetry-type classification, rotational-order identification, and full-axis localization across all eight canonical 3D rotational symmetry types. The method localizes a dominant high-order axis, infers its rotational order through self-consistency analysis, and reconstructs the complete symmetry structure under a hierarchy-guided geometric formulation. A texture-aware extension further models appearance-induced reductions in rotational order while preserving axis orientations. Extensive experiments on idealized and real-world datasets demonstrate strong accuracy and generalization, achieving 94.75% accuracy on 438 symmetric objects in GSO. Training FoundationPose with these priors improves accuracy by up to 1.0% across five BOP datasets, indicating that automatically estimated rotational priors can provide quantitative gains in downstream 6D pose estimation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39252", "url": null, "sourceid": 37793, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39256, "uid": "3f4f8e5619fbbf55669bece66b224fc5", "name": "Rethinking Intermediate Representation for VLM-based Robot Manipulation", "authors": [{"id": 191723, "fullname": "Weiliang Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191723?format=json", "institution": null}, {"id": 182239, "fullname": "Jialin Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/182239?format=json", "institution": "Department of Computer Science and Engineering, The Chinese University of Hong Kong"}, {"id": 191724, "fullname": "Jia-Hui Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191724?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 191725, "fullname": "Gang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191725?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 75486, "fullname": "Li Erran Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/75486?format=json", "institution": "AWS AI, Amazon"}, {"id": 191726, "fullname": "Yunhui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191726?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 153553, "fullname": "Mingyu Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/153553?format=json", "institution": "Department of Computer Science, University of North Carolina at Chapel Hill"}, {"id": 87709, "fullname": "Pheng-Ann Heng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87709?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 87614, "fullname": "Chi-Wing Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87614?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "Vision-Language Model (VLM) is now an important component to enable robust robot manipulation. Yet, using it to translate human instructions into an action-resolvable intermediate representation often needs a tradeoff between VLM-comprehensibility and generalizability. Inspired by context-free grammar structure, we design the Semantic Assembly representation named SEAM, by decomposing the intermediate representation into vocabulary and grammar. Doing so leads us to a concise vocabulary of semantically-rich operations and a VLM-friendly grammar for handling diverse unseen tasks. Also, we design a novel open-vocabulary segmentation paradigm with an in-context learning strategy to precisely localize fine-grained object parts for manipulation (e.g., cup handle, teapot opening) effectively with the shortest inference time over all state-of-the-art parallel works. We then formulate new metrics for action-generalizability and VLM-comprehensibility to evaluate mainstream representations, demonstrating the strong performance of SEAM on both aspects. Extensive realworld experiments further manifest the SOTA performance of SEAM under varying settings and tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39256", "url": null, "sourceid": 40541, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39257, "uid": "efc6e26ff4b44d54c7760f184ee68506", "name": "HierUQ: Hierarchical Uncertainty Quantification with Adaptive Granularity Reconciliation for Degraded Image Classification", "authors": [{"id": 172503, "fullname": "YANG CHU", "url": "http://cvpr.thecvf.com/api/miniconf/users/172503?format=json", "institution": "Zhejiang University"}, {"id": 191727, "fullname": "Xiaomeng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191727?format=json", "institution": "Zhejiang University"}, {"id": 183135, "fullname": "Keli Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/183135?format=json", "institution": "Zhejiang University"}, {"id": 148795, "fullname": "Yuntao Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/148795?format=json", "institution": "Zhejiang University"}], "abstract": "Hierarchical classification (HC) on degraded images presents challenges due to feature corruption, unreliable confidence estimation, and fine-grained misclassification. Existing methods often struggle to balance semantic consistency and adaptive decision paths under low-quality visual conditions. To address this, we propose HierUQ, a unified framework that integrates uncertainty quantification with adaptive granularity reconciliation. A Vision Transformer backbone extracts global features, which are fused with semantic embeddings via bilinear and semantic-guided cross-attentions. We develop a principled Hierarchical Uncertainty Quantification (HUQ) strategy based on label smoothing and proper scoring rules. When confidence is insufficient, a Confidence-Aware Path Adjustment (CAPA) mechanism adaptively rolls back predictions to higher-level nodes, mitigating overclassification and error propagation, stabilizing the learning trajectory, overcoming degradation-induced interference, and enhancing fine-grained classification accuracy. To enhance learning, we introduce a self-paced joint optimization (MLJO) over multi-level objectives with dynamic loss weighting. Experiments on degraded remote sensing and natural image benchmarks show that HierUQ achieves state-of-the-art performance with strong robustness and adaptability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39257", "url": null, "sourceid": 38519, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39259, "uid": "da7d1d702f88ade45627510b78a887ce", "name": "Unleashing the Power of Chain-of-Prediction for Monocular 3D Object Detection", "authors": [{"id": 181516, "fullname": "Zhihao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181516?format=json", "institution": "Michigan State University"}, {"id": 89429, "fullname": "Abhinav Kumar", "url": "http://cvpr.thecvf.com/api/miniconf/users/89429?format=json", "institution": "Michigan State University"}, {"id": 148360, "fullname": "Girish Ganesan Ganesan", "url": "http://cvpr.thecvf.com/api/miniconf/users/148360?format=json", "institution": "Michigan State University"}, {"id": 73926, "fullname": "Xiaoming Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73926?format=json", "institution": "Michigan State University"}], "abstract": "Monocular 3D detection (Mono3D) aims to infer 3D bounding boxes from a single RGB image.Without auxiliary sensors such as LiDAR, this task is inherently ill-posed since the 3D-to-2D projection introduces depth ambiguity.Previous works often predict 3D attributes (e.g., depth, size, and orientation) in parallel, overlooking that these attributes are inherently correlated through the 3D-to-2D projection.However, simply enforcing such correlations through sequential prediction can propagate errors across attributes, especially when objects are occluded or truncated, where inaccurate size or orientation predictions can further amplify depth errors.Therefore, neither parallel nor sequential prediction is optimal.In this paper, we propose MonoCoP, an adaptive framework that learns when and how to leverage inter-attribute correlations with two complementary designs.A Chain-of-Prediction (CoP) explores inter-attribute correlations through feature-level learning, propagation, and aggregation, while an Uncertainty-Guided Selector (GS) dynamically switches between CoP and parallel paradigms for each object based on the predicted uncertainty.By combining their strengths, MonoCoP achieves state-of-the-art (SOTA) performance on KITTI, nuScenes, and Waymo, significantly improving depth accuracy, particularly for distant objects.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39259", "url": null, "sourceid": 37514, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39261, "uid": "39baa31554d76571e44a1d255504040f", "name": "Spatially Consistent 3D Universal Adversarial Objects for BEV Detectors", "authors": [{"id": 72967, "fullname": "Aixuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/72967?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 90381, "fullname": "Mochu Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90381?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 174657, "fullname": "Bosen Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/174657?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 90394, "fullname": "Zhexiong Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/90394?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 130439, "fullname": "Jing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130439?format=json", "institution": "Australian National University"}, {"id": 87079, "fullname": "Yuchao Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/87079?format=json", "institution": "Northwestern Polytechnical University"}], "abstract": "Adversarial robustness of BEV 3D object detectors is critical for autonomous driving (AD). Existing invasive attacks require altering the target vehicle itself (*e.g.* attaching patches), making them unrealistic and impractical for real-world evaluation. While non-invasive attacks that place adversarial objects in the environment are more practical, current methods still lack the multi-view and temporal consistency needed for physically plausible threats. In this paper, we present the first framework for generating universal, non-invasive, and 3D consistent adversarial objects that expose fundamental vulnerabilities for BEV 3D object detectors. Instead of modifying target vehicles, our method inserts rendered objects into scenes with an occlusion-aware module that enforces physical plausibility across views and time. To maintain attack effectiveness across views and frames, we optimize adversarial object appearance using a BEV spatial feature-guided optimization strategy that attacks the detector's internal representations. Extensive experiments demonstrate that our learned universal adversarial objects can consistently degrade multiple BEV detectors from various viewpoints and distances.More importantly, the new environment-manipulation attack paradigm exposes models' over-reliance on contextual cues and provides a practical pipeline for robustness evaluation in AD systems.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39261", "url": null, "sourceid": 36663, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39262, "uid": "7266d81a711c85f84940326843a21265", "name": "Physical Simulator In-the-Loop Video Generation", "authors": [{"id": 157881, "fullname": "Lin Geng Foo", "url": "http://cvpr.thecvf.com/api/miniconf/users/157881?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute for Informatics"}, {"id": 76219, "fullname": "Mark He Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76219?format=json", "institution": "Singapore University of Technology and Design (SUTD)"}, {"id": 154883, "fullname": "Alexandros Lattas", "url": "http://cvpr.thecvf.com/api/miniconf/users/154883?format=json", "institution": "Google"}, {"id": 186632, "fullname": "Stylianos Moschoglou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186632?format=json", "institution": "Google"}, {"id": 75876, "fullname": "Thabo Beeler", "url": "http://cvpr.thecvf.com/api/miniconf/users/75876?format=json", "institution": "Google"}, {"id": 75465, "fullname": "Christian Theobalt", "url": "http://cvpr.thecvf.com/api/miniconf/users/75465?format=json", "institution": "MPI Informatik"}], "abstract": "Recent advances in diffusion-based video generation have achieved remarkable visual realism but still struggle to obey basic physical laws such as gravity, inertia, and collision. Generated objects often move inconsistently across frames, exhibit implausible dynamics, or violate physical constraints, limiting the realism and reliability of AI-generated videos. We address this gap by introducing Physical Simulator In-the-Loop Video Generation (PSIVG), a novel framework that integrates a physical simulator into the video diffusion process. Starting from a template video generated by a pre-trained diffusion model, PSIVG reconstructs the 4D scene and foreground object meshes, initializes them within a physical simulator, and generates physically consistent trajectories. These simulated trajectories are then used to guide the video generator toward spatio-temporally physically coherent motion. To further improve texture consistency during object movement, we propose a Test-Time Texture Consistency Optimization (TTCO) technique that adapts text and feature embeddings based on pixel correspondences from the simulator. Comprehensive experiments demonstrate that PSIVG produces videos that better adhere to real-world physics while preserving visual quality and diversity.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39262", "url": null, "sourceid": 41996, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39264, "uid": "71f0603a717117b0bc2df3f784862801", "name": "PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation", "authors": [{"id": 180207, "fullname": "shuyan ke", "url": "http://cvpr.thecvf.com/api/miniconf/users/180207?format=json", "institution": "Xiamen University"}, {"id": 176507, "fullname": "Yifan Mei", "url": "http://cvpr.thecvf.com/api/miniconf/users/176507?format=json", "institution": "Xiamen University"}, {"id": 180045, "fullname": "Changli Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180045?format=json", "institution": "Xiamen University"}, {"id": 178334, "fullname": "yonghan zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/178334?format=json", "institution": "Xiamen University"}, {"id": 131760, "fullname": "Jiayi Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/131760?format=json", "institution": "Xiamen University"}, {"id": 88415, "fullname": "Liujuan Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88415?format=json", "institution": "Xiamen University"}, {"id": 86308, "fullname": "Rongrong Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/86308?format=json", "institution": "Xiamen University"}], "abstract": "Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data introduces fundamentally different challenges, including oblique viewpoints, ultra-high resolutions, and extreme scale variations. To address these UAV-specific conditions, we formally define the UAV Reasoning Segmentation task and organize its semantic demands into three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, the first large-scale UAV reasoning segmentation benchmark, containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision covering all three reasoning types. We further propose PixDLM, a pixel-level multimodal language model equipped with a Dual-Path Vision Encoder that preserves fine-grained high-resolution cues while maintaining strong global semantic alignment. Extensive experiments on DRSeg demonstrate that PixDLM achieves superior semantic consistency and spatial localization accuracy compared with existing multimodal models, offering a unified and efficient baseline for UAV reasoning segmentation. All datasets, models, and code will be released.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39264", "url": null, "sourceid": 32934, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39265, "uid": "1458e84a7f1a80984de07508b68237d8", "name": "Test-Time Multi-Prompt Adaptation for Open-Vocabulary Remote Sensing Image Segmentation", "authors": [{"id": 181643, "fullname": "Ting Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181643?format=json", "institution": "Tianjin University"}, {"id": 86158, "fullname": "Qilong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86158?format=json", "institution": "university  of tianjin of china"}, {"id": 90664, "fullname": "Qibin Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/90664?format=json", "institution": "Nankai University"}, {"id": 86159, "fullname": "Qinghua Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86159?format=json", "institution": "Tianjin University"}], "abstract": "The rise of vision-language models (VLMs) has driven the initial exploration of open-vocabulary remote sensing image semantic segmentation (OVRSIS), enabling recognition of unseen categories in complex Earth observation scenes. However, existing methods primarily focus on enhancing visual representations of domain-specific remote sensing images, while overlooking the effect of textual information. In this paper, we argue that there exists a crucial issue of textual ambiguity in OVRSIS task, limiting the final segmentation performance. Therefore, we propose a plug-and-play yet effective Test-time Multi-Prompt Adaptation (TMPA) method to mitigate textual ambiguity in OVRSIS. Specifically, our TMPA first generates a group of diverse, context-aware descriptions for each category instead of the naive class name by executing a large language model with a task-driven prompt, which can effectively avoid some textual ambiguity, i.e., background class has different meanings in various tasks. Furthermore, TMPA develops a visual-guided test-time adaptation strategy for the generated multi-prompts, which adaptively refines the prompt representations of each category with high-confidence visual features for the uncertain predictions with high entropy, making our TMPA better applicable to different scenarios. Particularly, a pixel-level loss with entropy minimization is proposed to optimize the text prompt with a bias during inference, where prompt bias is constructed based on a weighted combination of high-confidence visual features. Our TMPA can be flexibly integrated into existing methods for boosting their performance. Extensive experiments are conducted on 17 remote sensing datasets, and the results show our TMPA can significantly improve its counterparts, while achieving state-of-the-art performance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39265", "url": null, "sourceid": 31055, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39268, "uid": "3d830ca43771c3ac56b1c24d2a9f1779", "name": "MetaSpectra+: A Compact Broadband Metasurface Camera for Snapshot Hyperspectral+ Imaging", "authors": [{"id": 183558, "fullname": "Yuxuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183558?format=json", "institution": "Purdue University"}, {"id": 183560, "fullname": "Wei Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183560?format=json", "institution": "Purdue University"}, {"id": 126346, "fullname": "Qi Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/126346?format=json", "institution": "Purdue University"}], "abstract": "We present MetaSpectra+, a compact multifunctional camera that supports two operating modes: (1) snapshot HDR + hyperspectral or (2) snapshot polarization + hyperspectral imaging. It utilizes a novel metasurface-refractive assembly that splits the incident beam into multiple channels and independently controls each channel\u2019s dispersion, exposure, and polarization. Unlike prior multifunctional metasurface imagers restricted to narrow (10--100 nm) bands, MetaSpectra+ operates over nearly the entire visible spectrum (250 nm). Relative to snapshot hyperspectral imagers, it achieves the shortest total track length and the highest reconstruction accuracy on benchmark datasets. The demonstrated prototype reconstructs high-quality hyperspectral datacubes and either an HDR image or two orthogonal polarization channels from a snapshot measurement.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39268", "url": null, "sourceid": 42273, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40361?format=json"], "related_events_ids": [40361]}, {"id": 39271, "uid": "755945a59ff256394631b079277ab8bc", "name": "Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models", "authors": [{"id": 180381, "fullname": "Huatian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180381?format=json", "institution": "University of Science and Technology of China"}, {"id": 76376, "fullname": "Zhendong Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/76376?format=json", "institution": "University of Science and Technology of China"}, {"id": 128550, "fullname": "Lei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128550?format=json", "institution": "University of Science and Technology of China"}, {"id": 85538, "fullname": "Yongdong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85538?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Direct Preference Optimization (DPO) has proven to be an effective solution for mitigating hallucination in Multimodal Large Language Models (MLLMs) by learning from preference pairs. One of its key challenge lies in how to transfer the sequence-level preference into fine-grained supervision on visual fidelity. To safeguard vision-related tokens that are prone to hallucination, existing methods typically allocate training emphasis according to the model's self-assessed visual sensitivity signals. However, such sensitivity, estimated by a model still under training, introduces self-referential bias: reinforcing already well-learned visual cues while neglecting hard-to-perceive but critical details, thereby limiting deeper alignment. In this work, we propose an Uncertainty-aware Exploratory Direct Preference Optimization (UE-DPO) method for MLLMs, which enables the model to uncover its cognitive deficiencies and actively explore for self-correction, guided by token-level epistemic uncertainty. Specifically, we first quantify the uncertainty from the model's failure to ground token predictions in the given image. Then, based on an uncertainty-aware exploration intensity, we encourage more learning pressure on visually deficient tokens in preferred samples, and alleviate the over-penalization of beneficial knowledge in dispreferred samples. Further, we provide a theoretical justification for our method, and extensive experiments on hallucination benchmarks demonstrate its effectiveness and robustness. All code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39271", "url": null, "sourceid": 30601, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39266, "uid": "31aa9651503a33c7f28f2a6e4d46b6e0", "name": "Frequency-domain Manipulation for Face Obfuscation", "authors": [{"id": 87695, "fullname": "Jintae Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/87695?format=json", "institution": "Korea University"}, {"id": 191734, "fullname": "Keunsoo Ko", "url": "http://cvpr.thecvf.com/api/miniconf/users/191734?format=json", "institution": "The Catholic University of Korea"}, {"id": 87670, "fullname": "Chang-Su Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/87670?format=json", "institution": "Korea University"}], "abstract": "Facial image datasets have become essential resources for various face analysis tasks, but their use raises significant privacy concerns. To address this issue, face obfuscation has emerged as a practical approach to hide identity from humans while retaining cues decipherable by machines. However, existing methods often leave exploitable visual traces, making them vulnerable to reconstruction attacks that restore hidden identity. To address this issue, we propose a frequency-domain manipulation framework, called FreM, which adjusts frequency subbands differently to hide identity, retain machine-decipherable cues, and improve robustness against reconstruction attacks. Specifically, the proposed FreM first decomposes a facial image into frequency subbands and applies subband-adaptive modulation that regulates information according to the characteristics of each subband. The modulation parameters are then refined to yield the reliable obfuscated result. Extensive experiments across multiple face analysis benchmarks demonstrate that FreM achieves superior obfuscation quality and strong robustness against reconstruction attacks. The source code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39266", "url": null, "sourceid": 39877, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39291, "uid": "ef482c2b5df361ebe176e3bade57d833", "name": "Chaining Basic Capabilities for Embodied Task Planning", "authors": [{"id": 180523, "fullname": "Peiran Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180523?format=json", "institution": "Peking University"}, {"id": 191779, "fullname": "Jiaqi Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191779?format=json", "institution": "Peking University"}, {"id": 89566, "fullname": "Yadong Mu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89566?format=json", "institution": "Peking University"}], "abstract": "This paper focuses on embodied task planning, where an agent acquires visual observations from the environment and executes atomic actions to accomplish a given task. Although recent Vision-Language Models (VLMs) have achieved impressive results in multimodal understanding and reasoning, their performance remains limited when applied to embodied planning that involves multi-turn interaction, long-horizon reasoning, and extended context analysis. To bridge this gap, we propose a capability-driven planning pipeline, in which the model actively invokes different sub-capabilities. Each capability maintains its own context, and produces intermediate reasoning results or interacts with the environment according to the query given by a scheduler. This framework decomposes complex planning into a sequence of basic vision-language problems that VLMs can better address, enabling a more transparent and controllable reasoning process. The scheduler and all capabilities are implemented with a single VLM, without relying on external tools. To train this VLM, we adopt a multi-stage paradigm that consists of: (1) behavior cloning with expert plans, (2) DAgger training using trajectories collected by the model, and (3) reinforcement learning guided by an expert policy. Across these stages, we exploit the internal information of the environment simulator to construct high-quality supervision for each capability, and we further introduce augmented and synthetic data to enhance the model\u2019s performance in more diverse scenarios. Extensive experiments on widely used embodied task planning benchmarks validate the effectiveness of the proposed approach.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39291", "url": null, "sourceid": 34756, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40361, "uid": "3d830ca43771c3ac56b1c24d2a9f1779", "name": "MetaSpectra+: A Compact Broadband Metasurface Camera for Snapshot Hyperspectral+ Imaging", "authors": [{"id": 183558, "fullname": "Yuxuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183558?format=json", "institution": "Purdue University"}, {"id": 183560, "fullname": "Wei Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183560?format=json", "institution": "Purdue University"}, {"id": 126346, "fullname": "Qi Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/126346?format=json", "institution": "Purdue University"}], "abstract": "We present MetaSpectra+, a compact multifunctional camera that supports two operating modes: (1) snapshot HDR + hyperspectral or (2) snapshot polarization + hyperspectral imaging. It utilizes a novel metasurface-refractive assembly that splits the incident beam into multiple channels and independently controls each channel\u2019s dispersion, exposure, and polarization. Unlike prior multifunctional metasurface imagers restricted to narrow (10--100 nm) bands, MetaSpectra+ operates over nearly the entire visible spectrum (250 nm). Relative to snapshot hyperspectral imagers, it achieves the shortest total track length and the highest reconstruction accuracy on benchmark datasets. The demonstrated prototype reconstructs high-quality hyperspectral datacubes and either an HDR image or two orthogonal polarization channels from a snapshot measurement.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40361", "url": null, "sourceid": -42273, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39268?format=json"], "related_events_ids": [39268]}, {"id": 39270, "uid": "b3a36c38b2a370995afb25590c8ac245", "name": "PowerCLIP: Powerset Alignment for Fine-Grained Contrastive Pre-Training", "authors": [{"id": 184018, "fullname": "Masaki Kawamura", "url": "http://cvpr.thecvf.com/api/miniconf/users/184018?format=json", "institution": "Institute of Science Tokyo"}, {"id": 90313, "fullname": "Nakamasa Inoue", "url": "http://cvpr.thecvf.com/api/miniconf/users/90313?format=json", "institution": "Tokyo Institute of Technology"}, {"id": 140608, "fullname": "Rintaro Yanagi", "url": "http://cvpr.thecvf.com/api/miniconf/users/140608?format=json", "institution": "AIST, National Institute of Advanced Industrial Science and Technology"}, {"id": 87986, "fullname": "Hirokatsu Kataoka", "url": "http://cvpr.thecvf.com/api/miniconf/users/87986?format=json", "institution": "National Institute of Advanced Industrial Science and Technology (AIST)"}, {"id": 147498, "fullname": "Rio Yokota", "url": "http://cvpr.thecvf.com/api/miniconf/users/147498?format=json", "institution": "Institute of Science Tokyo"}], "abstract": "Contrastive pre-training frameworks such as CLIP have demonstrated impressive zero-shot performance across a range of vision-language tasks. Recent studies have shown that aligning individual text tokens with specific image patches or regions enhances fine-grained compositional understanding. However, it remains challenging to capture compositional semantics spanning multiple image regions.To address this limitation, we propose PowerCLIP, a novelcontrastive pre-training framework enhanced by powerset alignment, which exhaustively optimizes region-to-phrase alignments by minimizing the loss defined between powersets of image regions and textual parse trees.As this approach increases computational complexity exponentially due to the combinatorial explosion in the number of region subsets, we introduce efficient non-linear aggregators (NLAs) that reduce complexity from $\\mathit{\\mathcal{O}(2^{M})}$ to $\\mathit{\\mathcal{O}(M)}$ with respect to the number of regions $M$, provably approximating the exact loss value with arbitrary precision.Our extensive experiments demonstrate that PowerCLIPoutperforms state-of-the-art methodsin zero-shot classification and retrieval tasks, underscoring compositionality and robustness of our approach. Our code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39270", "url": null, "sourceid": 34735, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39272, "uid": "c8651789d418118b810c58696ae5ac18", "name": "Generative Video Motion Editing with 3D Point Tracks", "authors": [{"id": 89238, "fullname": "Yao-Chih Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/89238?format=json", "institution": "University of Maryland College Park"}, {"id": 129904, "fullname": "Zhoutong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129904?format=json", "institution": "Adobe Systems"}, {"id": 91712, "fullname": "Gabriel Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91712?format=json", "institution": "Adobe Research"}, {"id": 91153, "fullname": "Jui-Hsien Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91153?format=json", "institution": "Adobe Systems"}, {"id": 180187, "fullname": "Joon-Young Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/180187?format=json", "institution": "Adobe Research"}, {"id": 88945, "fullname": "Jia-Bin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88945?format=json", "institution": "University of Maryland, College Park"}, {"id": 75717, "fullname": "Eli Shechtman", "url": "http://cvpr.thecvf.com/api/miniconf/users/75717?format=json", "institution": "Adobe Research, US"}, {"id": 154196, "fullname": "Zhengqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/154196?format=json", "institution": "Google"}], "abstract": "Camera and object motions are central to a video's narrative. However, precisely editing these captured motions remains a significant challenge, especially under complex object movements. Current motion-controlled image-to-video (I2V) approaches often lack full-scene context for consistent video editing, while video-to-video (V2V) methods provide viewpoint changes or basic object translation, but offer limited control over fine-grained object motion. We present a track-conditioned V2V framework that enables joint editing of camera and object motion. We achieve this by conditioning a video generation model on a source video and paired 3D point tracks representing source and target motions. These 3D tracks establish sparse correspondences that transfer rich context from the source video to new motions while preserving spatiotemporal coherence. Crucially, compared to 2D tracks, 3D tracks provide explicit depth cues, allowing the model to resolve depth order and handle occlusions for precise motion editing. Trained in two stages on synthetic and real data, our model supports diverse motion edits, including joint camera/object manipulation, motion transfer, and non-rigid deformation, unlocking new creative potential in video editing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39272", "url": null, "sourceid": 37235, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39277, "uid": "5128811422870279d063413608e0bc4b", "name": "Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context", "authors": [{"id": 154562, "fullname": "JiaKui Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154562?format=json", "institution": "Peking University"}, {"id": 130994, "fullname": "Jialun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130994?format=json", "institution": "Baidu"}, {"id": 156875, "fullname": "Liying Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156875?format=json", "institution": "Macau University of Science and Technology"}, {"id": 191751, "fullname": "Xinliang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191751?format=json", "institution": "Peking University"}, {"id": 191752, "fullname": "Kaiwen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191752?format=json", "institution": "Peking University"}, {"id": 159093, "fullname": "Shuang Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/159093?format=json", "institution": "Peking University / Georgia Institute of Technology"}, {"id": 143970, "fullname": "Yuanwei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/143970?format=json", "institution": "Peking University"}, {"id": 87044, "fullname": "Haibin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87044?format=json", "institution": "Kuaishou Technology"}, {"id": 152943, "fullname": "Chi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152943?format=json", "institution": "China Telecom"}, {"id": 128636, "fullname": "Yanye Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128636?format=json", "institution": "Peking University"}], "abstract": "Scene-consistent video generation aims to create videos that explore 3D scenes based on a camera trajectory. Previous methods rely on video generation models with external memory for consistency, or iterative 3D reconstruction and inpainting, which accumulate errors during inference due to incorrect intermediary outputs, non-differentiable processes, and separate models. To overcome these limitations, we introduce ``geometry-as-context\". It iteratively completes the following steps using an autoregressive camera-controlled video generation model: (1) estimates the geometry of the current view necessary for 3D reconstruction, and (2) simulates and restores novel view images rendered by the 3D scene. Under this multi-task framework, we develop the camera gated attention module to enhance the model's capability to effectively leverage camera poses. During the training phase, text contexts are utilized to ascertain whether geometric or RGB images should be generated. To ensure that the model can generate RGB-only outputs during inference, the geometry context is randomly dropped from the interleaved text-image-geometry training sequence. The method has been tested on scene video generation with one-direction and forth-and-back trajectories. The results show its superiority over previous approaches in maintaining scene consistency and camera control.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39277", "url": null, "sourceid": 38028, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39280, "uid": "1fba5ce97f702c38b3e76367546cd227", "name": "R4-CGQA: Retrieval-based Vision Language Models for Computer Graphics Image Quality Assessment", "authors": [{"id": 191756, "fullname": "Zhuangzi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191756?format=json", "institution": "Nanyang Technological University"}, {"id": 191757, "fullname": "Jian Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/191757?format=json", "institution": null}, {"id": 191758, "fullname": "Shilv Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/191758?format=json", "institution": "Nanyang Technological University"}, {"id": 86501, "fullname": "Weisi Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/86501?format=json", "institution": "Nanyang Technological University"}], "abstract": "Immersive Computer Graphics (CGs) rendering has become ubiquitous in modern daily life. However, comprehensively evaluating CG quality remains challenging for two reasons: (1) existing CG datasets lack systematic descriptions of rendering quality; and (2) existing CG quality assessment methods cannot provide reasonable text-based explanations. To address these issues, we first identify six key perceptual dimensions of CG quality from the user perspective and construct a dataset of $3.5\\mathrm{K}$ CG images with corresponding quality descriptions. Each description covers CG style, content, and perceived quality along the selected dimensions. Furthermore, we use a subset of the dataset to build several question\u2013answer benchmarks based on the descriptions in order to evaluate the responses of existing Vision Language Models (VLMs). We find that current VLMs are not sufficiently accurate in judging fine-grained CG quality, but that descriptions of visually similar images can significantly improve a VLM\u2019s understanding of a given CG image. Motivated by this observation, we adopt retrieval-augmented generation and propose a two-stream retrieval framework that effectively enhances the CG quality assessment capabilities of VLMs. Experiments on several representative VLMs demonstrate that our method substantially improves their performance on CG quality assessment. The dataset and code will be publicly released to support future research in this area.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39280", "url": null, "sourceid": 42438, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39281, "uid": "617c81b47e4758942128b6bd2319d9c1", "name": "FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips", "authors": [{"id": 130387, "fullname": "Mengtian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/130387?format=json", "institution": "Shanghai University"}, {"id": 183735, "fullname": "Kunyan Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/183735?format=json", "institution": "Shanghai University"}, {"id": 191759, "fullname": "Yi Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/191759?format=json", "institution": "Shanghai University"}, {"id": 191760, "fullname": "Ruobing Ni", "url": "http://cvpr.thecvf.com/api/miniconf/users/191760?format=json", "institution": "Shanghai University"}, {"id": 191761, "fullname": "Ying Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191761?format=json", "institution": "Shanghai University"}, {"id": 176774, "fullname": "Wenwu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176774?format=json", "institution": "University of Surrey"}, {"id": 90560, "fullname": "Zhifeng Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/90560?format=json", "institution": "shanghai university"}], "abstract": "Foley art plays a pivotal role in enhancing immersive auditory experiences in film, yet manual creation of spatio-temporal aligned audio remains labor-intensive. We propose \\textbf{FoleyDesigner}, a novel framework inspired by professional Foley workflows, integrating film clip analysis, spatio-temporal controllable Foley generation, and professional mixing capabilities.Technically, FoleyDesigner employs a multi-agent architecture for precise spatio-temporal analysis. It achieves spatio-temporal alignment through latent diffusion models trained on spatiotemporal cues extracted from video frames, combined with large language model (LLM)-driven hybrid mechanisms that emulate film industry-grade post-production practices. To address the lack of high-quality stere Foley datasets in film, we introduce \\textbf{FilmStereo}, the first professional stereo Foley dataset containing spatial metadata, precise timestamps, and semantic annotations for eight common Foley categories. For application, the framework supports interactive user control while maintaining seamless integration with professional pipelines, including 5.1-channel Dolby Atmos systems compliant with ITU-R BS.775 standards, thereby offering extensive creative flexibility.Extensive experiments demonstrate that our method achieves superior spatio-temporal alignment compared to existing baselines, with integration validated in film industrial-grade workflows.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39281", "url": null, "sourceid": 42785, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39284, "uid": "bed661249f23a7680d776e668cd73d08", "name": "Revisiting Visual Corruptions in LVLMs:  A Shape\u2013Texture Perspective on Model Failures", "authors": [{"id": 191764, "fullname": "Xinkuan Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191764?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences; Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 127370, "fullname": "Meina Kan", "url": "http://cvpr.thecvf.com/api/miniconf/users/127370?format=json", "institution": "Institute of Computing Technoloy, Chinese Academy of Sciences"}, {"id": 182763, "fullname": "Zhenliang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/182763?format=json", "institution": "\u4e2d\u56fd\u79d1\u5b66\u9662\u8ba1\u7b97\u6280\u672f\u7814\u7a76\u6240"}, {"id": 191765, "fullname": "Yongbin Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191765?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 86312, "fullname": "Shiguang Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/86312?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}], "abstract": "Large vision\u2013language models (LVLMs) are highly vulnerable to visual corruptions, substantially compromising their reliability and limiting real-world deployment. Prior work has attributed this degradation primarily to insufficient visual grounding and overreliance on language priors. However, these explanations often overlook the heterogeneous nature of corruptions, which perturb model perception in fundamentally different ways. We revisit this problem from a corruption-centric perspective and show that diverse corruptions can be organized along two complementary perceptual dimensions\u2014shape and texture\u2014which induce distinct failure modes. To address them, we propose Shape\u2013Texture Dual-Path Contrastive Decoding (ST-CD), a training-free inference framework that constructs complementary contrastive views to diagnose and correct shape- and texture-induced biases through adaptive fusion. Experiments across multiple LVLMs and robustness benchmarks demonstrate that ST-CD consistently improves robustness under heterogeneous corruptions, suggesting that leveraging the complementarity between shape and texture provides a general and effective principle for building robust multimodal models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39284", "url": null, "sourceid": 37706, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39289, "uid": "759cf6a07a81645b6b5dd37a90db63a5", "name": "ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models", "authors": [{"id": 143831, "fullname": "Linqing Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/143831?format=json", "institution": "Beihang University"}, {"id": 89815, "fullname": "Yi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89815?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 191778, "fullname": "Yifei Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/191778?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 178415, "fullname": "Ziyu Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/178415?format=json", "institution": "AgiBot"}, {"id": 75839, "fullname": "Si Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75839?format=json", "institution": "Beihang University"}, {"id": 99906, "fullname": "Guanghui Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/99906?format=json", "institution": "AgiBot"}], "abstract": "Vision-Language-Action (VLA) models have emerged essential generalist robot policies for diverse manipulation tasks, conventionally relying on directly translating multimodal inputs into actions via Vision-Language Model (VLM) embeddings.Recent advancements have introduced explicit intermediary reasoning\u2014such as subtask prediction (language) or goal image synthesis (vision)\u2014to guide action generation. However, these intermediate reasoning are often indirect and inherently limited in their capacity to convey the full, granular information required for precise action execution. Instead, we posit that the most effective form of reasoning is one that deliberates directly in the action space. We introduce Action Chain-of-Thought (ACoT), a paradigm where the reasoning process itself is formulated as a structured sequence of coarse action intents that guide the final policy. In this paper, we propose ACoT-VLA, a novel architecture that materializes the ACoT paradigm. Specifically, we introduce two complementary components: an Explicit Action Reasoner (EAR) and Implicit Action Reasoner (IAR). The former proposes coarse reference trajectories as explicit action-level reasoning steps, while the latter extracts latent action priors from internal representations of multimodal input, co-forming an ACoT that conditions the downstream action head to enable grounded policy learning.Extensive experiments in real-world and simulation environments demonstrate the superiority of our proposed method, which achieves 98.45%, 84.14% and 47.4% on LIBERO, LIBERO-Plus and VLABench, respectively.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39289", "url": null, "sourceid": 39357, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39303, "uid": "580120b43fdd69f78c8d01fa0a6853b8", "name": "DLVP-CLIP: Enhancing Fine-Grained Zero-Shot Anomaly Detection via Dynamic Local Visual Prompting", "authors": [{"id": 181566, "fullname": "Gaowei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181566?format=json", "institution": "Dalian University of Technology"}, {"id": 127233, "fullname": "Lihe Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127233?format=json", "institution": "Dalian University of Technology"}], "abstract": "Zero-shot anomaly detection (ZSAD) aims to utilize auxiliary data to train models for generalized learning of unseen categories, which has important application value in fields such as industrial quality inspection and medical diagnosis. Although methods based on CLIP show potential, their pre-training objective of focusing on overall semantic alignment between images and text makes the model insensitive to local details, which is inherently contradictory to the need for fine-grained local features in anomaly detection. Existing improvement methods rely on predefined text prompt frameworks to perceive local information, but struggle to effectively address the issue of insufficient local perception. To address this, this paper proposes a dynamic local visual prompting method based on CLIP (DLVP-CLIP). DLVP dynamically identifies and extracts local visual features from key regions in images as prompt tokens using the Semantic-Aware Local Feature Selector (SLFS) module, and utilizes the multi-modal local prompt (MLoP) module to jointly optimize representations in both visual and textual spaces, achieving more precise cross-modal alignment. Additionally, the high-low frequency decomposition module (HFD) is introduced to separate and process global structural and local textural information via wavelet transformation, thereby enhancing detail perception. Extensive experiments on 13 anomaly detection datasets demonstrate that DLVP-CLIP achieves outstanding ZSAD performance on datasets from the industrial and medical domains.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39303", "url": null, "sourceid": 46569, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40362, "uid": "5b8fb4dc83626faa47bdf214c6119098", "name": "CoSMo3D: Open-World Promptable 3D Semantic Segmentation through LLM-Guided Canonical Spatial Modeling", "authors": [{"id": 104825, "fullname": "Li Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/104825?format=json", "institution": "Shandong University"}, {"id": 89410, "fullname": "Weikai Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89410?format=json", "institution": "Tencent America"}, {"id": 154392, "fullname": "Yujie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154392?format=json", "institution": "University of North Carolina at Chapel Hill"}, {"id": 76921, "fullname": "Yingda Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/76921?format=json", "institution": "Peking University"}, {"id": 153580, "fullname": "Zeyu HU", "url": "http://cvpr.thecvf.com/api/miniconf/users/153580?format=json", "institution": "Tencent Lightspeed Studios, Singapore"}, {"id": 185827, "fullname": "Runze Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185827?format=json", "institution": "Tencent"}, {"id": 185828, "fullname": "Keyang Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185828?format=json", "institution": "Tencent LightSpeed Studio"}, {"id": 187423, "fullname": "Shengju Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/187423?format=json", "institution": "Tencent"}, {"id": 153582, "fullname": "Xin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153582?format=json", "institution": "LightSpeed Studios"}, {"id": 155002, "fullname": "Xueying Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/155002?format=json", "institution": "Shandong University"}], "abstract": "Open-world promptable 3D semantic segmentation remains brittle as semantics are inferred in the input sensor coordinates. Yet, humans, in contrast, interpret parts via functional roles in a canonical space -- wings extend laterally, handles protrude to the side, and legs support from below. Psychophysical evidence shows that we mentally rotate objects into canonical frames to reveal these roles. To fill this  gap, we propose CoSMo3D, which attains canonical space perception by inducing a latent canonical reference frame  learned directly from data.  By construction, we create a unified canonical dataset through LLM-guided intra- and cross-category alignment, exposing canonical spatial regularities across 200 categories. By induction, we realize canonicality inside the model through a dual-branch architecture with canonical map anchoring and canonical box calibration, collapsing pose variation and symmetry into a stable canonical embedding. This shift from input pose space to canonical representation yields far more stable and transferable part semantics. Experimental results show that CoSMo3D establishes new state of the art in open-world promptable 3D segmentation.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40362", "url": null, "sourceid": -46491, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39290?format=json"], "related_events_ids": [39290]}, {"id": 39292, "uid": "b875668dea0dce49c9d7999107a854cd", "name": "ReMoT: Reinforcement Learning with Motion Contrast Triplets", "authors": [{"id": 180816, "fullname": "Cong Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/180816?format=json", "institution": "Xian Jiaotong University"}, {"id": 191780, "fullname": "Zeyu Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/191780?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 181750, "fullname": "Jiangyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181750?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 188493, "fullname": "SongLin Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/188493?format=json", "institution": "Shenzhen University of Advanced Technology"}, {"id": 187241, "fullname": "Yifan Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/187241?format=json", "institution": "Alibaba Group"}, {"id": 191781, "fullname": "Lin Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191781?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 158940, "fullname": "Zhiheng Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/158940?format=json", "institution": "Shenzhen University of Advanced Technology"}, {"id": 87250, "fullname": "Yihong Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/87250?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "We present ReMoT, a unified training paradigm to systematically address the fundamental shortcomings of VLMs in spatio-temporal consistency\u2014a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (i) A rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based generation. (ii) Group Relative Policy Optimization, which we empirically validate, yields optimal performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning.  We also construct the first benchmark for fine-grained motion contrast triplets to measure a VLM's discrimination of subtle motion attributes (e.g., opposing directions). The resulting model achieves SOTA performance on our new benchmark and multiple standard VLM benchmarks, culminating in a remarkable 25.1 performance leap on spatio-temporal reasoning tasks.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39292", "url": null, "sourceid": 43601, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39293, "uid": "7f59f49965ea6a0f208e543c814b4e91", "name": "Region-Adaptive Sampling for Diffusion Transformers", "authors": [{"id": 182441, "fullname": "Ziming Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182441?format=json", "institution": "National University of Singapore"}, {"id": 191782, "fullname": "Yifan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191782?format=json", "institution": "Microsoft"}, {"id": 191783, "fullname": "Chengruidong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191783?format=json", "institution": "Alibaba Group"}, {"id": 191784, "fullname": "Yiqi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191784?format=json", "institution": "National University of Singapore"}, {"id": 149630, "fullname": "Lili Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/149630?format=json", "institution": "Microsoft Research Asia"}, {"id": 153640, "fullname": "Yang You", "url": "http://cvpr.thecvf.com/api/miniconf/users/153640?format=json", "institution": "National University of Singapore"}, {"id": 89473, "fullname": "Yuqing Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89473?format=json", "institution": "Research, Microsoft"}], "abstract": "Diffusion models (DMs) have become the state-of-the-art for generative tasks across domains, but their reliance on sequential forward passes limits real-time performance. Prior acceleration methods mainly reduce sampling steps or reuse intermediate results. Leveraging the flexibility of Diffusion Transformers (DiTs) to handle variable token counts, we propose RAS, a training-free sampling strategy that dynamically assigns different update ratios to image regions based on model focus. Our key observation is that at each step, DiTs concentrate on semantically meaningful areas, and these regions exhibit strong continuity across consecutive steps. Exploiting this, RAS updates only focused regions while reusing cached noise for others, with focus determined from the previous step\u2019s output. Evaluated on Stable Diffusion 3 and Lumina-Next-T2I, RAS achieves up to 2.36\u00d7 and 2.51\u00d7 speedups, respectively, with minimal quality loss. This demonstrates a practical step toward more efficient diffusion transformers for real-time generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39293", "url": null, "sourceid": 42261, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39294, "uid": "2fedf51adb7da7d37f99cca6c69549b5", "name": "Anchor-Guided Gradient Alignment for Incomplete Multimodal Learning", "authors": [{"id": 180834, "fullname": "Zhi-Hao Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/180834?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 191785, "fullname": "Longfei Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191785?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 191786, "fullname": "Yang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191786?format=json", "institution": "Nanjing University of Science and Technology"}], "abstract": "Vision-language pre-training (VLP) has achieved remarkable performance across diverse multimodal learning (MML) tasks. Recently, many efforts have focused on reconstructing missing modalities to improve the adaptability of VLP models in incomplete MML scenarios. However, these approaches overlook the learning imbalance under severe missing-modality conditions, i.e., the optimization process is dominated by reconstructed samples, thereby weakening complete-sample representations. In this paper, we propose a novel ANchor-guided Gradient Alignment (ANGA) framework to address these issues. Specifically, we first retrieve similar instances to reconstruct the missing modalities, thereby alleviating information deficiency. We then introduce an entropy-driven curriculum that progressively integrates reliable reconstructed samples with complete ones to form an optimization anchor, which guides gradient alignment to mitigate learning imbalance. Furthermore, we design a semantic-enhanced adapter that leverages the retrieved instances to generate dynamic prompts, further enhancing the robustness of the VLP model. Extensive experiments on widely used datasets demonstrate the superiority of ANGA over state-of-the-art (SOTA) baselines across various missing-modality scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39294", "url": null, "sourceid": 31314, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39296, "uid": "fd446d68b55157f52b803e6b6ada53cc", "name": "SGSoft: Learning Fused Semantic-Geometric Features for 3D Shape Correspondence via Template-Guided Soft Signals", "authors": [{"id": 127094, "fullname": "Soyeon Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/127094?format=json", "institution": "KAIST"}, {"id": 132338, "fullname": "Chang Wook Seo", "url": "http://cvpr.thecvf.com/api/miniconf/users/132338?format=json", "institution": "KAIST / Anigma Technologies"}, {"id": 144939, "fullname": "Hyunjung Shim", "url": "http://cvpr.thecvf.com/api/miniconf/users/144939?format=json", "institution": "KAIST"}], "abstract": "Learning dense correspondences across deformable 3D shapes remains a long-standing challenge due to structural variability, non-isometric deformation, and inconsistent topology. Existing methods typically trade off generalization, geometric fidelity, and efficiency.We address this by proposing SGSoft, a unified intrinsic pipeline that (i) constructs a geodesic correspondence field on a canonical template, (ii) learns multimodal dense descriptors guided by pretrained semantic priors with this geodesic correspondence field supervision, (iii) retrieves dense correspondences in a single feed-forward pass via nearest-neighbor search in descriptor space.This formulation enables stable and topology-invariant supervision under large pose variation, structural differences, and remeshing.SGSoft achieves state-of-the-art inter-category generalization while offering the best accuracy\u2013efficiency trade-off among prior methods. It also achieves near real-time inference without pre-alignment, pairwise optimization, or post-refinement. Learned descriptors can be transferred effectively to downstream tasks such as semantic segmentation and deformation transfer, establishing a scalable and deployment-ready paradigm for dense 3D correspondence. Code and pretrained models will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39296", "url": null, "sourceid": 33599, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39302, "uid": "285976ea634eee4a3e2204b519e7e7d9", "name": "CDICS: Delving Into Fine-Grained Attribute for In-Context Segmentation via Compositional Prompts and Phased Decoupling", "authors": [{"id": 178065, "fullname": "Zhiyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/178065?format=json", "institution": "University of Science and Technology of China"}, {"id": 69935, "fullname": "Dianmo Sheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/69935?format=json", "institution": "University of Science and Technology of China"}, {"id": 131736, "fullname": "Qi Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131736?format=json", "institution": "University of Science and Technology of China"}, {"id": 191802, "fullname": "Shilong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191802?format=json", "institution": "University of Science and Technology of China"}, {"id": 131741, "fullname": "Tao Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/131741?format=json", "institution": "University of Science and Technology of China"}, {"id": 191803, "fullname": "Zhou Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/191803?format=json", "institution": "Lingyang Industrial Internet Co., Ltd."}, {"id": 90580, "fullname": "Nenghai Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90580?format=json", "institution": "University of Science and Technology of China"}], "abstract": "In-Context Learning (ICL) has shown great effectiveness in developing generalist image segmentation models. Its significant advantage over text-based descriptions is the ability to convey intricate visual appearance details through simple reference images. However, finding a perfectly matching single example for real-world rare and complex concepts is difficult;and existing methods are largely confined to semantic or instance-level understanding of the reference image, struggle to express more precise segmentation needs through the input. To address this, we propose \\textbf{CDICS}, a novel framework that leverages \\textbf{C}ompositional prompts and phased task \\textbf{D}ecoupling to achieve compositional prompt-controlled \\textbf{I}n-\\textbf{C}ontext \\textbf{S}egmentation. Our method introduces compositional prompts derived from reference prompts, combining semantic, part and color images to dynamically define segmentation targets. To effectively fuse this control information, ensure synergy while suppressing interference, and mitigate feature coupling risks, our decoupled two-stage architecture, which firstly performs coarse-grained semantic localization, then refines the result using compositional appearance prompts to precisely match the specified attributes. This design extends traditional in-context segmentation, enabling it to support compositional prompts. Additionally, we reconstructed two datasets and their benchmarks to acquire data with part-color-specific attributes. Our method demonstrates superior performance on the compositional prompt-controlled in-context segmentation task. It also extends the capabilities of existing in-context segmentation, and makes an attempt toward real-world fine-grained segmentation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39302", "url": null, "sourceid": 45479, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39299, "uid": "f9aceeacd04f4eba8b36012486d63a75", "name": "Locate-Then-Examine: Grounded Region Reasoning Improves Detection of AI-Generated Images", "authors": [{"id": 127549, "fullname": "Yikun Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/127549?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 134399, "fullname": "Yan Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/134399?format=json", "institution": "Alibaba Group"}, {"id": 191794, "fullname": "Bowen Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191794?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 90281, "fullname": "Jun Lan", "url": "http://cvpr.thecvf.com/api/miniconf/users/90281?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 191795, "fullname": "Huijia Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191795?format=json", "institution": "Ant Group"}, {"id": 90268, "fullname": "Weiqiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90268?format=json", "institution": "University of Southern California"}, {"id": 88759, "fullname": "Liqing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88759?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 187996, "fullname": "Jianfu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187996?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "The rapid growth of AI-generated imagery has blurred the boundary between real and synthetic content, raising practical concerns for digital integrity. Vision-language models (VLMs) can provide natural language explanations, but standard one-pass classifiers often miss subtle artifacts in high-quality synthetic images and offer limited grounding in the pixels. We propose \\textbf{Locate-Then-Examine} (LTE), a two-stage VLM-based forensic framework that first localizes suspicious regions and then re-examines these crops together with the full image to refine the real vs. AI-generated verdict and its explanation. LTE explicitly links each decision to localized visual evidence through region proposals and region-aware reasoning. To support training and evaluation, we introduce \\textbf{TRACE}, a dataset of 20,000 real and high-quality synthetic images with region-level annotations and automatically generated forensic explanations, constructed by a VLM-based pipeline with additional consistency checks and quality control. Across TRACE and multiple external benchmarks, LTE achieves competitive accuracy and improved robustness while providing human-understandable, region-grounded explanations suitable for forensic deployment.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39299", "url": null, "sourceid": 41499, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39313, "uid": "e5a55ca0e0c107c9ece3e8c09650a4a1", "name": "Efficient Video Object Segmentation and Tracking with Recurrent Dynamic Submodel", "authors": [{"id": 180161, "fullname": "Weidong Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180161?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 191823, "fullname": "Zhiyuan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191823?format=json", "institution": null}, {"id": 191824, "fullname": "Xinyan Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191824?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 191825, "fullname": "Chen Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191825?format=json", "institution": "College of Computer Science and Technology, Xidian University"}, {"id": 86511, "fullname": "Zhaopan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86511?format=json", "institution": "Harbin Institute of Technology"}, {"id": 155620, "fullname": "Pengfei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/155620?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 90912, "fullname": "Yan Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/90912?format=json", "institution": "University of Science and Technology of China"}, {"id": 153640, "fullname": "Yang You", "url": "http://cvpr.thecvf.com/api/miniconf/users/153640?format=json", "institution": "National University of Singapore"}, {"id": 129149, "fullname": "Wangbo Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/129149?format=json", "institution": "National University of Singapore"}], "abstract": "The large vision foundation model, SAM2, has achieved remarkable performance in video object segmentation and tracking (VOST). However, its effectiveness is hindered by significant computational overhead. While model pruning is a widely used strategy to address this issue, traditional static and input-agnostic pruning approaches fall short in managing the diverse and complex nature of video data effectively. A promising alternative is dynamic networks, yet they often struggle to translate theoretical computational reductions into actual acceleration. Furthermore, both static and dynamic approaches typically focus on visual features of individual frames while neglecting the temporal correlations between them, limiting their performance in handling complex video streams.To address these challenges, we propose Recurrent Dynamic Submodel (RDS), a dynamic architecture that adaptively selects submodel blocks for each frame. Specifically, it has a lightweight Prediction-aware Router (PAR), which leverages both the segmentation mask from the previous frame and the visual features of the current frame to make routing decisions, enabling the submodel to explicitly capture the temporal nature of video data. Additionally, to reduce the cost of adapting the dynamic submodel, we introduce an Importance-aware LoRA (I-LoRA), tuning parameters only in the most critical blocks. Extensive experiments on various benchmarks demonstrate the effectiveness of our approach.For example, it achieves a 1.3\u00d7 speedup on the DAVIS 2017 dataset with less than 1% performance degradation, while introducing only 3% (6.7M) trainable parameters and requiring only 0.003% (6.7k) of the SAM2 training data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39313", "url": null, "sourceid": 40673, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39309, "uid": "5ebfa70416c6e41452ddde4ce2b536ac", "name": "BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection", "authors": [{"id": 184115, "fullname": "Melissa Schween", "url": "http://cvpr.thecvf.com/api/miniconf/users/184115?format=json", "institution": "Leibniz University Hannover"}, {"id": 134970, "fullname": "Mathis Kruse", "url": "http://cvpr.thecvf.com/api/miniconf/users/134970?format=json", "institution": "Leibniz University Hannover"}, {"id": 73701, "fullname": "Bodo Rosenhahn", "url": "http://cvpr.thecvf.com/api/miniconf/users/73701?format=json", "institution": "Leibniz University of Hannover"}], "abstract": "We propose Bijective Universal Scene-Specific Anomalous Relationship Detection (BUSSARD), a normalizing flow-based model for detecting anomalous relations in scene graphs, generated from images.Our work follows a multimodal approach, embedding object and relationship tokens from scene graphs with a language model to leverage semantic knowledge from the real world.A normalizing flow model is used to learn bijective transformations that map object-relation-object triplets from scene graphs to a simple base distribution (typically Gaussian), allowing anomaly detection through likelihood estimation.We evaluate our approach on the SARD dataset containing office and dining room scenes.Our method achieves more than 10\\% better AUROC results compared to the current state-of-the-art model, while simultaneously being five times faster.Through ablation studies, we demonstrate superior robustness and universality, particularly regarding the use of synonyms, with our model maintaining stable performance while the baseline shows 17.5\\% deviation.This work demonstrates the strong potential of learning-based methods for relationship anomaly detection in scene graphs.Our code will be published upon acceptance on https://github.com/author/BUSSARD.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39309", "url": null, "sourceid": 43042, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39315, "uid": "c181c9a2877323fea8d91c6d24f95eb9", "name": "Towards Open Environments: General Vision-Language Navigation via Fast-Slow Interactive Reasoning", "authors": [{"id": 180103, "fullname": "Li Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180103?format=json", "institution": "Tianjin University"}, {"id": 88250, "fullname": "Aming Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88250?format=json", "institution": "Xidian University"}, {"id": 155719, "fullname": "Zihao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155719?format=json", "institution": "Tianjin University"}, {"id": 86180, "fullname": "Yahong Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/86180?format=json", "institution": "Tianjin University"}], "abstract": "Vision-Language Navigation (VLN) aims to enable agents to navigate to a target location based on language instructions. Traditional VLN often follows a close-set assumption, i.e., training and test data share the same style of the input images and instructions. However, the real world is open and filled with various unseen environments, posing enormous difficulties for close-set methods.To this end, we focus on the General Scene Adaptation (GSA-VLN) task, aiming to learn generalized navigation ability by introducing diverse environments and inconsistent intructions.Towards this task, when facing unseen environments and instructions, the challenge mainly lies in how to enable the agent to dynamically produce generalized strategies during the navigation process. Recent research indicates that by means of fast and slow cognition systems, human beings could generate stable policies, which strengthen their adaptation for open world. Inspired by this idea, we propose the slow4fast-VLN, establishing a dynamic interactive fast-slow reasoning framework.The fast-reasoning module, an end-to-end strategy network, outputs actions via real-time input. It accumulates execution records in a history repository to build memory.The slow-reasoning module analyze the memories generated by the fast-reasoning module. Through deep reflection, it extracts experiences that enhance the generalization ability of decision-making. These experiences are structurally stored and used to continuously optimize the fast-reasoning module. Unlike traditional methods that treat fast-slow reasoning as independent mechanisms, our framework enables fast-slow interaction. By leveraging the experiences from slow reasoning, it continually improves the accuracy and generalization ability of fast decisions. This interaction allows the system to continuously adapt and efficiently execute navigation tasks when facing unseen scenarios. Extensive experiments demonstrate the superiorities of our method.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39315", "url": null, "sourceid": 35977, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39321, "uid": "c31de872e4b768853e4180258bb2ab00", "name": "From Detection to Association: Learning Discriminative Object Embeddings for Multi-Object Tracking", "authors": [{"id": 181479, "fullname": "Yuqing Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/181479?format=json", "institution": "East China University of Science and Technology"}, {"id": 89749, "fullname": "Yuchen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89749?format=json", "institution": "Fudan University, Shanghai AI Laboratory"}, {"id": 191844, "fullname": "Rui Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191844?format=json", "institution": null}, {"id": 191845, "fullname": "Weilong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191845?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 142670, "fullname": "Xu Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/142670?format=json", "institution": "Fudan University; Renmin University of China; Virginia Polytechnic Institute and State University; Tsinghua University, Tsinghua University"}, {"id": 185902, "fullname": "Huaicheng Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185902?format=json", "institution": null}, {"id": 188080, "fullname": "Wei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188080?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 186859, "fullname": "Xiao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/186859?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}], "abstract": "End-to-end multi-object tracking (MOT) methods have recently achieved remarkable progress by unifying detection and association within a single framework. Despite their strong detection performance, these methods suffer from relatively low association accuracy. Through detailed analysis, we observe that object embeddings produced by the shared DETR architecture display excessively high inter-object similarity, as it emphasizes only category-level discrimination within single frames. In contrast, tracking requires instance-level distinction across frames with spatial and temporal continuity, for which current end-to-end approaches insufficiently optimize object embeddings. To address this, we introduce FDTA (From Detection to Association), an explicit feature refinement framework that enhances object discriminativeness across three complementary perspectives. Specifically, we introduce a Spatial Adapter (SA) to integrate depth-aware cues for spatial continuity, a Temporal Adapter (TA) to aggregate historical information for temporal dependencies, and an Identity Adapter (IA) to leverage quality-aware contrastive learning for instance-level separability. Extensive experiments demonstrate that FDTA achieves state-of-the-art performance on multiple challenging MOT benchmarks including DanceTrack, SportsMOT, and BFT, highlighting the effectiveness of our proposed discriminative embedding enhancement strategy. The source code will be made publicly available upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39321", "url": null, "sourceid": 41073, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39322, "uid": "4f96fd60ac7a551d3c509e7739878bd1", "name": "3D Gaussian Splatting at Arbitrary Resolution with Compact Proxy Anchors", "authors": [{"id": 180820, "fullname": "Mingyun Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/180820?format=json", "institution": "Hanyang University, Seoul, South Korea"}, {"id": 166495, "fullname": "Seongro Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/166495?format=json", "institution": "Inria"}, {"id": 76930, "fullname": "Francois Bremond", "url": "http://cvpr.thecvf.com/api/miniconf/users/76930?format=json", "institution": "inria"}, {"id": 76718, "fullname": "Donghyeon Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/76718?format=json", "institution": "Hanyang University"}], "abstract": "Despite achieving high-quality rendering, 3D Gaussian Splatting suffers from aliasing when the rendering resolution changes, as it is typically trained at a fixed resolution. To address this limitation, we introduce a method that enables the model to generate resolution-adaptive 3D Gaussians under arbitrary resolution changes. In particular, built upon Scaffold-GS, we enhance the anchor feature representation by incorporating a resolution-embedding to encode continuous resolution information. From these enhanced anchor features, a pixel coverage gate dynamically forms resolution-adaptive 3D Gaussians. Furthermore, we drastically reduce storage requirements by selecting a compact subset of proxy anchors and designing a residual anchor predictor that reconstructs the unselected leaf anchors based on the proxy anchors, enabling faithful scene representation without compromising visual fidelity. As a result, our method provides continuous and alias-free rendering across resolutions while maintaining practical scalability and memory efficiency. Extensive experiments across diverse resolution ranges demonstrate that our approach achieves an optimal balance between fidelity and memory, enabling practical arbitrary-resolution view synthesis even in resource-constrained settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39322", "url": null, "sourceid": 45521, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39326, "uid": "f7660308943d2350de0f14731a7abd7f", "name": "Visual Autoregressive Modeling via Next Focus Prediction", "authors": [{"id": 157000, "fullname": "Xiaofan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/157000?format=json", "institution": "BAIDU Inc,"}, {"id": 98162, "fullname": "Chenming Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/98162?format=json", "institution": "Baidu"}, {"id": 128187, "fullname": "Yanpeng Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/128187?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 69984, "fullname": "Jiaming Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/69984?format=json", "institution": "Sun Yat-Sen University"}, {"id": 86716, "fullname": "Delin Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86716?format=json", "institution": "Fudan University"}, {"id": 156019, "fullname": "Yansong Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156019?format=json", "institution": "Xiamen University"}, {"id": 185761, "fullname": "Weihao Bo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185761?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 90836, "fullname": "Haibao Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90836?format=json", "institution": "University of Hong Kong"}, {"id": 76439, "fullname": "Dingkang Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76439?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Visual autoregressive models achieve remarkable generation quality through next-scale predictions across multi-scale token pyramids. However, the conventional method uses uniform scale downsampling to build these pyramids, leading to aliasing artifacts that compromise fine details and introduce unwanted jaggies and moir\u00e9 patterns. To tackle this issue, we present \\textbf{FVAR}, which reframes the paradigm from \\emph{next-scale prediction} to \\emph{next-focus prediction}, mimicking the natural process of camera focusing from blur to clarity. Our approach introduces three key innovations: 1) \\emph{Next-Focus Prediction Paradigm} that transforms multi-scale autoregression by progressively reducing blur rather than simply downsampling; 2) \\emph{Progressive Refocusing Pyramid Construction} that uses physics-consistent defocus kernels to build clean, alias-free multi-scale representations; and 3) \\emph{High-Frequency Residual Learning} that employs a specialized residual teacher network to effectively incorporate alias information during training while maintaining deployment simplicity. Specifically, we construct optical low-pass views using defocus point spread function (PSF) kernels with decreasing radius, creating smooth blur-to-clarity transitions that eliminate aliasing at its source. To further enhance detail generation, we introduce a High-Frequency Residual Teacher that learns from both clean structure and alias residuals, distilling this knowledge to a vanilla VAR deployment network for seamless inference. Extensive experiments on ImageNet demonstrate that FVAR substantially reduces aliasing artifacts, improves fine detail preservation, and enhances text readability, achieving superior performance with perfect compatibility to existing VAR frameworks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39326", "url": null, "sourceid": 39105, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39328, "uid": "c325a1f2d1226ac996f297052c91a683", "name": "Model Merging in the Essential Subspace", "authors": [{"id": 180706, "fullname": "Longhua Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180706?format=json", "institution": "Southeast University"}, {"id": 86644, "fullname": "Lei Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/86644?format=json", "institution": "Southeast University"}, {"id": 88171, "fullname": "Qi Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/88171?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 84884, "fullname": "Xin Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/84884?format=json", "institution": "Southeast University"}], "abstract": "Model merging aims to integrate multiple task-specific fine-tuned models derived from a shared pre-trained checkpoint into a single multi-task model without additional training. Despite extensive research, task interference remains a major obstacle that often undermines the performance of merged models. In this paper, we propose ESM (Essential Subspace Merging) , a robust framework for effective model merging. We begin by performing Principal Component Analysis (PCA) on feature shifts induced by parameter updates. The resulting principal directions span an essential subspace that dominantly influences feature representations. Each task's parameter update matrix is projected onto its respective essential subspace for low-rank decomposition before merging. This methodology mitigates inter-task interference while preserving core task-specific functionality. Furthermore, we introduce a multi-level polarized scaling strategy that amplifies parameters containing critical knowledge and suppresses redundant ones, preventing essential knowledge from being overwhelmed during fusion. Extensive experiments across multiple task sets and model scales demonstrate that our method achieves state-of-the-art performance in multi-task model merging.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39328", "url": null, "sourceid": 46542, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39331, "uid": "1d150e73a0f97f5a1682cf36e0ceb422", "name": "A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett\u2013Luce Ranking", "authors": [{"id": 148200, "fullname": "chengan che", "url": "http://cvpr.thecvf.com/api/miniconf/users/148200?format=json", "institution": "King&#x27;s College London, University of London"}, {"id": 191341, "fullname": "Chao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191341?format=json", "institution": "King&#x27;s College London, University of London"}, {"id": 191863, "fullname": "Xinyue Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191863?format=json", "institution": "King&#x27;s College London, University of London"}, {"id": 191343, "fullname": "Sophia Tsoka", "url": "http://cvpr.thecvf.com/api/miniconf/users/191343?format=json", "institution": "King&#x27;s College London, University of London"}, {"id": 181202, "fullname": "Luis Carlos Garcia Peraza Herrera", "url": "http://cvpr.thecvf.com/api/miniconf/users/181202?format=json", "institution": "King&#x27;s College London"}], "abstract": "Procedural activities, ranging from routine cooking to complex surgical operations, are highly structured as a set of actions conducted in a specific temporal order. Despite their success on static images and short clips, current self-supervised learning methods often overlook the procedural nature that underpins such activities. We expose the lack of procedural awareness in current SSL methods with a motivating experiment: models pretrained on forward and time-reversed sequences produce highly similar features, confirming that their representations are blind to the underlying procedural order. To address this shortcoming, we propose PL-Stitch, a self-supervised framework that harnesses the inherent temporal order of video frames as a powerful supervisory signal. Our approach integrates twonovel probabilistic objectives based on the Plackett-Luce (PL) model. The primary PL objective trains the model to sort sampled frames chronologically, compelling it to learn the global workflow progression. The secondary objective, a spatio-temporal jigsaw loss, complements the learning by capturing fine-grained, cross-frame object correlations. Our approach consistently achieves superior performance across five surgical and cooking benchmarks. Specifically, PL-Stitch yields significant gains in surgical phase recognition (e.g., +11.4 pp k-NN accuracy on Cholec80) and cooking action segmentation (e.g., +5.7 pp linear probing accuracy on Breakfast), demonstrating its effectiveness for procedural video representation learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39331", "url": null, "sourceid": 36140, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39334, "uid": "84a1f8046f6aa35c11a1391e1b4d9e89", "name": "Multi-view Pyramid Transformer: Look Coarser to See Broader", "authors": [{"id": 153532, "fullname": "Gyeongjin Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153532?format=json", "institution": "Sungkyunkwan University"}, {"id": 183036, "fullname": "Seung kwon Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183036?format=json", "institution": "Yonsei University"}, {"id": 159455, "fullname": "Seungtae Nam", "url": "http://cvpr.thecvf.com/api/miniconf/users/159455?format=json", "institution": "Yonsei University"}, {"id": 153533, "fullname": "Younggeun Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/153533?format=json", "institution": "Sung Kyun Kwan University"}, {"id": 191874, "fullname": "Jungwoo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/191874?format=json", "institution": "Yonsei University"}, {"id": 159432, "fullname": "Eunbyung Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/159432?format=json", "institution": "Yonsei University"}], "abstract": "We propose Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture that directly reconstructs large 3D scenes from tens to hundreds of images in a single forward pass. Drawing on the idea of ``looking broader to see the whole, looking finer to see the details,\" MVP is built on two core design principles: 1) a local-to-global inter-view hierarchy that gradually broadens the model's perspective from local views to groups and ultimately the full scene, and 2) a fine-to-coarse intra-view hierarchy that starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens. This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. We validate MVP on diverse datasets and show that, when coupled with 3D Gaussian Splatting as the underlying 3D representation, it achieves state-of-the-art generalizable reconstruction quality while maintaining high efficiency and scalability across a wide range of view configurations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39334", "url": null, "sourceid": 46265, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39337, "uid": "dc96a8266db03d23c786f970c7ddabc0", "name": "The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment", "authors": [{"id": 154650, "fullname": "Ziheng Ouyang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154650?format=json", "institution": "Nankai University"}, {"id": 131537, "fullname": "Yiren Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/131537?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 191877, "fullname": "Yaoli Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191877?format=json", "institution": "Zhejiang University"}, {"id": 182025, "fullname": "Shihao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182025?format=json", "institution": "Nankai University"}, {"id": 90664, "fullname": "Qibin Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/90664?format=json", "institution": "Nankai University"}, {"id": 90540, "fullname": "Ming-Ming Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90540?format=json", "institution": "Nankai University, Tsinghua University"}, {"id": 73557, "fullname": "Mike Zheng Shou", "url": "http://cvpr.thecvf.com/api/miniconf/users/73557?format=json", "institution": "National University of Singapore"}], "abstract": "Previous works have explored various customized generation tasks given a reference image, but they still face limitations in generating consistent fine-grained details. In this paper, our aim is to solve the inconsistency problem of generated images by applying a reference-guided post-editing approach and present our ImageCritic. We first construct a dataset of reference-degraded-target triplets obtained via VLM-based selection and explicit degradation, which effectively simulates the common inaccuracies or inconsistencies observed in existing generation models. Furthermore, building on a thorough examination of the model's attention mechanisms and intrinsic representations, we accordingly devise an attention alignment loss and a detail encoder to precisely rectify inconsistencies. ImageCritic can be integrated into an agent framework to automatically detect inconsistencies and correct them with multi-round and local editing in complex scenarios. Extensive experiments demonstrate that ImageCritic can effectively resolve detail-related issues in various customized generation scenarios, providing significant improvements over existing methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39337", "url": null, "sourceid": 30936, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39341, "uid": "1074afbd8b5b7eb6576c26ecbf6f7e17", "name": "HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering", "authors": [{"id": 183796, "fullname": "Dan Ben Ami", "url": "http://cvpr.thecvf.com/api/miniconf/users/183796?format=json", "institution": "Ben Gurion University"}, {"id": 191883, "fullname": "Gabriele Serussi", "url": "http://cvpr.thecvf.com/api/miniconf/users/191883?format=json", "institution": "Ben Gurion University of the Negev"}, {"id": 191884, "fullname": "Kobi Cohen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191884?format=json", "institution": "Ben Gurion University of the Negev"}, {"id": 154453, "fullname": "Chaim Baskin", "url": "http://cvpr.thecvf.com/api/miniconf/users/154453?format=json", "institution": "Ben Gurion University of the Negev"}], "abstract": "Video Large Language Models (Video-LLMs) are rapidly improving, yet current Video Question Answering (VideoQA) benchmarks often allow questions to be answered from a single salient cue, under-testing reasoning that must aggregate multiple, temporally separated visual evidence. In this direction, we present HERBench, a VideoQA benchmark purpose-built to assess multi-evidence integration across time. Each question is constructed to require aggregating at least three non-overlapping evidential cues across distinct video segments (so neither language priors nor a single snapshot can suffice). HERBench comprises 26K five-way multiple-choice questions organized into twelve compositional tasks that probe identity binding, cross-entity relations, temporal ordering, co-occurrence verification, and counting. To make evidential demand measurable, we introduce the Minimum Required Frame-Set (MRFS)-the smallest number of frames a model must fuse to answer correctly-and show that HERBench imposes substantially higher demand than prior datasets (mean MRFS $5.5$ vs. $2.6$-$4.2$). Evaluating 13 state-of-the-art Video-LLMs on HERBench reveals pervasive failures: accuracies of 31\u201342% are only slightly above the 20% random-guess rate. We disentangle this failure into two critical bottlenecks: (1) a retrieval deficit, where frame selectors overlook key evidence, and (2) a fusion deficit, where models fail to integrate information even when all necessary evidence is provided.By making cross-time evidence both unavoidable and quantifiable, HERBench establishes a principled target for advancing robust, compositional video understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39341", "url": null, "sourceid": 46197, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39343, "uid": "b923dbd86db34a1294e93af71efb59ad", "name": "Physical Adversarial Clothing Evades Visible-Thermal Detectors via Non-Overlapping RGB-T Pattern", "authors": [{"id": 182683, "fullname": "Xiaopei Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182683?format=json", "institution": "Tsinghua University"}, {"id": 191888, "fullname": "Guanning Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191888?format=json", "institution": "Department of Computer Science and Technology, Tsinghua University"}, {"id": 127213, "fullname": "Zhanhao Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127213?format=json", "institution": "UC Berkeley"}, {"id": 86599, "fullname": "Jun Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86599?format=json", "institution": "Tsinghua University"}, {"id": 89698, "fullname": "Xiaolin Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89698?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Visible-thermal (RGB-T) object detection is a crucial technology for applications such as autonomous driving, where multimodal fusion enhances performance in challenging conditions like low light. However, the security of RGB-T detectors, particularly in the physical world, has been largely overlooked. This paper proposes a novel approach to RGB-T physical attacks using adversarial clothing with a non-overlapping RGB-T pattern (NORP). To simulate full-view (0$^{\\circ}$\u2013360$^{\\circ}$) RGB-T attacks, we construct 3D RGB-T models for human and adversarial clothing. NORP is a new adversarial pattern design using distinct visible and thermal materials without overlap, avoiding the light reduction in overlapping RGB-T patterns (ORP). To optimize the NORP on adversarial clothing, we propose a spatial discrete-continuous optimization (SDCO) method. We systematically evaluated our method on RGB-T detectors with different fusion architectures, demonstrating high attack success rates both in the digital and physical worlds. Additionally, we introduce a fusion-stage ensemble method that enhances the transferability of adversarial attacks across unseen RGB-T detectors with different fusion architectures.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39343", "url": null, "sourceid": 40462, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39348, "uid": "3514f1f7f9a8c61c063aa8cd834c5561", "name": "rPPG-VQA: A Video Quality Assessment Framework for Unsupervised rPPG Training", "authors": [{"id": 182502, "fullname": "Tianyang Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/182502?format=json", "institution": "University of Science and Technology of China"}, {"id": 191900, "fullname": "Ming Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191900?format=json", "institution": "University of Science and Technology of China"}, {"id": 189723, "fullname": "Yan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189723?format=json", "institution": "University of Science and Technology of China"}, {"id": 189722, "fullname": "Yang Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189722?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Unsupervised remote photoplethysmography (rPPG) promises to leverage unlabeled video data, but its potential is hindered by a critical challenge: training on low-quality \"in-the-wild\" videos severely degrades model performance. A critical step missing here is to assess the suitability of the videos for rPPG model learning before using them for the task. Existing video quality assessment (VQA) methods are mainly designed for human perception and not directly applicable to the above purpose. In this work, we propose rPPG-VQA, a novel framework for assessing video suitability for rPPG. We integrate signal-level and scene-level analyses and design a dual-branch assessment architecture. The signal-level branch evaluates the physiological signal quality of the videos via robust signal-to-noise ratio (SNR) estimation with a multi-method consensus mechanism, while the scene-level branch uses a multimodal large language model (MLLM) to identify interferences like motion and unstable lighting. Furthermore, we propose a two-stage adaptive sampling (TAS) strategy that utilizes the quality score to curate optimal training datasets. Experiments show that by training on large-scale, \"in-the-wild\" videos filtered by our framework, we can develop unsupervised rPPG models that achieve a substantial improvement in accuracy on standard benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39348", "url": null, "sourceid": 40844, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39350, "uid": "f9dff983997e5617eb90301227612fee", "name": "Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling", "authors": [{"id": 191902, "fullname": "Minseok Seo", "url": "http://cvpr.thecvf.com/api/miniconf/users/191902?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 191903, "fullname": "Mark Hamilton", "url": "http://cvpr.thecvf.com/api/miniconf/users/191903?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 84981, "fullname": "Changick Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/84981?format=json", "institution": "Korea Advanced Institute of Science and Technology"}], "abstract": "We present \\textbf{Upsample Anything}, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by 14\u00d7/16\u00d7 (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across architectures and modalities, enabling precise high-resolution reconstruction of features, depth, or probability maps. It runs in only $\\approx0.419 \\text{s}$ per 224\u00d7224 image and achieves state-of-the-art performance on semantic segmentation, depth estimation, and both depth and probability map upsampling.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39350", "url": null, "sourceid": 46333, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39351, "uid": "8760d018b44febe1c3bbed05489bf9f2", "name": "VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling", "authors": [{"id": 181692, "fullname": "weiqi li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181692?format=json", "institution": "Sun Yat-sen University"}, {"id": 191904, "fullname": "QuandeZhang QuandeZhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191904?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 174443, "fullname": "ruifeng zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/174443?format=json", "institution": "Sun Yat-sen University"}, {"id": 131863, "fullname": "Liang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/131863?format=json", "institution": "SUN YAT-SEN UNIVERSITY, Tsinghua University"}, {"id": 75467, "fullname": "Guangrun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75467?format=json", "institution": "University of Oxford"}], "abstract": "Vision-language-action (VLA) models achieve strong in-distribution performance but degrade sharply under novel camera viewpoints and visual perturbations. We show that this brittleness primarily arises from misalignment in Spatial Modeling, rather than Physical Modeling.  To address this, we propose a one-shot adaptation framework that recalibrates visual representations through lightweight, learnable updates. Our first method, Feature Token Modulation (FTM), applies a global affine transformation to visual tokens and improves Libero viewpoint accuracy from 48.5\\% to 87.1\\% with only 4K parameters. Building on this, Feature Linear Adaptation (FLA) introduces low-rank updates to the ViT encoder, achieving 90.8\\% success with 4.7M parameters\u2014matching LoRA-scale finetuning at far lower cost.  Together, these results reveal substantial untapped robustness in pretrained VLA models and demonstrate that targeted, minimal visual adaptation is sufficient to restore viewpoint generalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39351", "url": null, "sourceid": 33111, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39353, "uid": "ffc87aa6d02f68087d1978176980b783", "name": "AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend", "authors": [{"id": 76315, "fullname": "Hengyi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76315?format=json", "institution": "University College London"}, {"id": 75982, "fullname": "Lourdes Agapito", "url": "http://cvpr.thecvf.com/api/miniconf/users/75982?format=json", "institution": "University College London"}], "abstract": "We present AMB3R, a multi-view feed-forward model for dense 3D reconstruction on a metric-scale that addresses diverse 3D vision tasks. The key idea is to leverage a sparse, yet compact, volumetric scene representation as our backend, enabling geometric reasoning with spatial compactness. Although trained solely for multi-view reconstruction, we demonstrate that AMB3R can be seamlessly extended to uncalibrated visual odometry (online) or large-scale structure from motion without the need for task-specific fine-tuning or test-time optimization. Compared to prior pointmap-based models, our approach achieves state-of-the-art performance in camera pose, depth, and metric-scale estimation, 3D reconstruction, and even surpasses optimization-based SLAM and SfM methods with dense reconstruction priors on common benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39353", "url": null, "sourceid": 39938, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39356, "uid": "b2fb7865dfca2461177cbad7ec520b0e", "name": "SafeRoPE:Risk-specific Head-wise Embedding Rotation for Safe Generation in Rectified Flow Transformers", "authors": [{"id": 181305, "fullname": "Xiang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181305?format=json", "institution": "Fudan University"}, {"id": 152949, "fullname": "Feifei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152949?format=json", "institution": "Fudan University"}, {"id": 91173, "fullname": "Mi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91173?format=json", "institution": "Fudan University"}, {"id": 191914, "fullname": "Geng Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/191914?format=json", "institution": "Fudan University"}, {"id": 191915, "fullname": "Xiaoyu You", "url": "http://cvpr.thecvf.com/api/miniconf/users/191915?format=json", "institution": "East China University of Science and Technology"}, {"id": 91167, "fullname": "Min Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91167?format=json", "institution": "Fudan University"}], "abstract": "Recent Text-to-Image (T2I) models based on rectified-flow transformers (e.g., SD3, FLUX) achieve high generative fidelity but remain vulnerable to unsafe semantics, especially when triggered by multi-token interactions. Existing mitigation methods largely rely on fine-tuning or attention modulation for concept unlearning; however, their expensive computational overhead and design tailored to U-Net-based denoisers hinder direct adaptation to transformer-based diffusion models (e.g., MMDiT). In this paper, we conduct an in-depth analysis of the attention mechanism in MMDiT and find that unsafe semantics concentrate within interpretable, low-dimensional subspaces at head level, where a finite set of \\textbf{safety-critical heads} is responsible for unsafe feature extraction. We further observe that perturbing the Rotary Positional Embedding (RoPE) applied to the query and key vectors can effectively modify some specific concepts in the generated images.Motivated by these insights, we propose SafeRoPE, a training-free and fine-grained safe generation framework for MMDiT. Specifically, SafeRoPE first constructs head-wise unsafe subspaces by decomposing unsafe embeddings within safety-critical heads, and computes a Latent Risk Score (LRS) for each input vector via projection onto these subspaces. We then introduce head-wise RoPE perturbations that can suppress unsafe semantics without degrading benign content or image quality. SafeRoPE combine both head-wise LRS and RoPE perturbation to perform risk-specific head-wise rotation on embeddings, enabling precise suppression of unsafe outputs while maintaining generation fidelity. Extensive experiments demonstrate that SafeRoPE achieves the SOTA performance in balancing effective harmful content mitigation and utility preservation for safe generation of MMDiT.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39356", "url": null, "sourceid": 41276, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39357, "uid": "0a41cf40123f35dac58d66443fd55e51", "name": "Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners", "authors": [{"id": 106015, "fullname": "Nikita Araslanov", "url": "http://cvpr.thecvf.com/api/miniconf/users/106015?format=json", "institution": "TU Munich"}, {"id": 94931, "fullname": "Martin Sundermeyer", "url": "http://cvpr.thecvf.com/api/miniconf/users/94931?format=json", "institution": "Google"}, {"id": 130344, "fullname": "Hidenobu Matsuki", "url": "http://cvpr.thecvf.com/api/miniconf/users/130344?format=json", "institution": "Google"}, {"id": 87968, "fullname": "David Joseph Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87968?format=json", "institution": "Google"}, {"id": 87927, "fullname": "Federico Tombari", "url": "http://cvpr.thecvf.com/api/miniconf/users/87927?format=json", "institution": "Google, TUM"}], "abstract": "One of the most exciting applications of vision models involve pixel-level reasoning.Despite the abundance of vision foundation models, we still lack representations that effectively embed spatio-temporal properties of visual scenes at the pixel level.Existing frameworks either train on image-based pretext tasks, which do not account for dynamic elements, or on video sequences for action-level reasoning, which does not scale to dense pixel-level prediction.We present a framework that learns pixel-accurate feature descriptors from videos, LILA.The core element of our training framework is linear in-context learning.LILA leverages spatio-temporal cue maps -- depth and motion -- estimated with off-the-shelf networks.Despite the noisy nature of those cues, LILA trains effectively on uncurated video datasets, embedding semantic and geometric properties in a temporally consistent manner.We demonstrate compelling empirical benefits of the learned representation across a diverse suite of vision tasks: video object segmentation, surface normal estimation and semantic segmentation.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39357", "url": null, "sourceid": 35602, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40366?format=json"], "related_events_ids": [40366]}, {"id": 39715, "uid": "ee7f18a35a02dc1e2e9ba29982dedbd3", "name": "TouchDream: 3D Object Completion through Imagined Touch", "authors": [{"id": 154753, "fullname": "Yuanbo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154753?format=json", "institution": "Dalian University of Technology"}, {"id": 192711, "fullname": "Xinning Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192711?format=json", "institution": "Dalian University of Technology"}, {"id": 159472, "fullname": "Zhaoxuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/159472?format=json", "institution": "Nanjing University of Posts and Telecommunications"}, {"id": 192712, "fullname": "Changlong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192712?format=json", "institution": "Dalian University of Technology"}, {"id": 155488, "fullname": "qianchen xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/155488?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 86043, "fullname": "Xiaopeng Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/86043?format=json", "institution": "Dalian University of Technology"}, {"id": 85034, "fullname": "Xin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85034?format=json", "institution": "Dalian University of Technology"}], "abstract": "Point cloud completion is crucial for robust 3D perception but remains challenging due to its ill-posed nature. Coarse-to-fine methods can lead to unconstrained local guesses in the absence of key structures, whereas diffusion-based approaches may introduce geometric inconsistencies. To overcome these limitations, we present TouchDream, a novel framework that leverages a diffusion model to 'dream' of tactile sensing on object surfaces, which reformulates the sensing process as a learnable generative modeling task. Unlike visual cues, tactile data provides rich local geometry that can be directly converted into 3D space for point fusion, offering a powerful guide for detail-aware completion. Specifically, our approach generate compact tactile latent representations conditioned on coarse points and sampled touch poses. A touch-guided refinement module then leverages touch features to optimize coarse points. Extensive experiments show that our TouchDream model achieves the state-of-the-art performance, significantly enhancing the recovery of local details.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39715", "url": null, "sourceid": 44224, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39716, "uid": "a49759174cc1b87887edcced9cd486c7", "name": "Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation", "authors": [{"id": 181267, "fullname": "Chonghua Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/181267?format=json", "institution": "Xidian University"}, {"id": 84913, "fullname": "Dong Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84913?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 84947, "fullname": "Shuang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84947?format=json", "institution": "Xidian University"}, {"id": 91886, "fullname": "Dou Quan", "url": "http://cvpr.thecvf.com/api/miniconf/users/91886?format=json", "institution": "XIDIAN UNIVERSITY"}, {"id": 192713, "fullname": "Ning Huyan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192713?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 75973, "fullname": "Nicu Sebe", "url": "http://cvpr.thecvf.com/api/miniconf/users/75973?format=json", "institution": "University of Trento"}, {"id": 77038, "fullname": "Zhun Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/77038?format=json", "institution": "University of Nottingham"}], "abstract": "Knowledge distillation (KD) has been widely applied in semantic segmentation to compress large models, but conventional approaches primarily preserve in-domain accuracy while neglecting out-of-domain generalization, which is essential under distribution shifts. This limitation becomes more severe with the emergence of vision foundation models (VFMs): although VFMs exhibit strong robustness on unseen data, distilling them with conventional KD often compromises this ability. We propose Generalizable Knowledge Distillation (GKD), a multi-stage framework that explicitly enhances generalization. GKD decouples representation learning from task learning. In the first stage, the student acquires domain-agnostic representations through selective feature distillation, and in the second stage, these representations are frozen for task adaptation, thereby mitigating overfitting to visible domains. To further support transfer, we introduce a query-based soft distillation mechanism, where student features act as queries to teacher representations to selectively retrieve transferable spatial knowledge from VFMs. Extensive experiments on five domain generalization benchmarks demonstrate that GKD consistently outperforms existing KD methods, achieving average gains of +1.9\\% in foundation-to-foundation (F2F) and +10.6\\% in foundation-to-local (F2L) distillation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39716", "url": null, "sourceid": 38580, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39717, "uid": "89d3f7e40882c24e14163199d6c814e0", "name": "MatLat: Material Latent Space for PBR Texture Generation", "authors": [{"id": 179960, "fullname": "Kyeongmin Yeo", "url": "http://cvpr.thecvf.com/api/miniconf/users/179960?format=json", "institution": "KAIST"}, {"id": 192714, "fullname": "Yunhong Min", "url": "http://cvpr.thecvf.com/api/miniconf/users/192714?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 192715, "fullname": "Jaihoon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/192715?format=json", "institution": "KAIST"}, {"id": 75799, "fullname": "Minhyuk Sung", "url": "http://cvpr.thecvf.com/api/miniconf/users/75799?format=json", "institution": "KAIST"}], "abstract": "We propose a generative framework for producing high-quality PBR textures on a given 3D mesh. As large-scale PBR texture datasets are scarce, our approach focuses on effectively leveraging the embedding space and diffusion priors of pretrained latent image generative models while learning a material latent space, **MatLat**, through targeted fine-tuning. Unlike prior methods that freeze the embedding network and thus lead to distribution shifts when encoding additional PBR channels and hinder subsequent diffusion training, we fine-tune the pretrained VAE so that new material channels can be incorporated with minimal latent distribution deviation. We further show that correspondence-aware attention alone is insufficient for cross-view consistency unless the latent-to-image mapping preserves locality. To enforce this locality, we introduce a regularization in the VAE fine-tuning that crops latent patches, decodes them, and aligns the corresponding image regions to maintain strong pixel\u2013latent spatial correspondence. Ablations studies and comparison with previous baselines demonstrate that our framework improves PBR texture fidelity and that each component is critical for achieving state-of-the-art performance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39717", "url": null, "sourceid": 40657, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40366, "uid": "0a41cf40123f35dac58d66443fd55e51", "name": "Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners", "authors": [{"id": 106015, "fullname": "Nikita Araslanov", "url": "http://cvpr.thecvf.com/api/miniconf/users/106015?format=json", "institution": "TU Munich"}, {"id": 94931, "fullname": "Martin Sundermeyer", "url": "http://cvpr.thecvf.com/api/miniconf/users/94931?format=json", "institution": "Google"}, {"id": 130344, "fullname": "Hidenobu Matsuki", "url": "http://cvpr.thecvf.com/api/miniconf/users/130344?format=json", "institution": "Google"}, {"id": 87968, "fullname": "David Joseph Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87968?format=json", "institution": "Google"}, {"id": 87927, "fullname": "Federico Tombari", "url": "http://cvpr.thecvf.com/api/miniconf/users/87927?format=json", "institution": "Google, TUM"}], "abstract": "One of the most exciting applications of vision models involve pixel-level reasoning.Despite the abundance of vision foundation models, we still lack representations that effectively embed spatio-temporal properties of visual scenes at the pixel level.Existing frameworks either train on image-based pretext tasks, which do not account for dynamic elements, or on video sequences for action-level reasoning, which does not scale to dense pixel-level prediction.We present a framework that learns pixel-accurate feature descriptors from videos, LILA.The core element of our training framework is linear in-context learning.LILA leverages spatio-temporal cue maps -- depth and motion -- estimated with off-the-shelf networks.Despite the noisy nature of those cues, LILA trains effectively on uncurated video datasets, embedding semantic and geometric properties in a temporally consistent manner.We demonstrate compelling empirical benefits of the learned representation across a diverse suite of vision tasks: video object segmentation, surface normal estimation and semantic segmentation.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40366", "url": null, "sourceid": -35602, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39357?format=json"], "related_events_ids": [39357]}, {"id": 39362, "uid": "fcf21509b5dd8503aa090c93bbc8eb2c", "name": "Diffusion Guided Chain-of-Vision for Large Autoregressive Vision Models", "authors": [{"id": 182492, "fullname": "Xinyang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182492?format=json", "institution": "Zhejiang University"}, {"id": 90363, "fullname": "Kecheng Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90363?format=json", "institution": "Ant Group"}, {"id": 191927, "fullname": "Minfeng Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191927?format=json", "institution": "Zhejiang University"}, {"id": 155770, "fullname": "Wei Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155770?format=json", "institution": "University of Science and Technology of China"}, {"id": 71162, "fullname": "Fan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/71162?format=json", "institution": "ustc"}, {"id": 86247, "fullname": "Wei Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/86247?format=json", "institution": "University of Science and Technology of China"}, {"id": 131775, "fullname": "Wei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/131775?format=json", "institution": "State key laboratory of CAD&amp;CG"}], "abstract": "Chain-of-Thought (CoT) has recently shown encouraging progress in the vision language model. However, the pure-vision CoT (i.e., chain-of-vision) has been underexplored in visual in-context learning. In this paper, we introduce Diffusion Guided Chain-of-Vision, which integrates an explicit chain-of-thought process into autoregressive vision models through vision prior from pre-trained diffusion models. Concretely, we find that pre-trained diffusion models induce a reliable probability flow in image space, where intermediate images sampled along this flow exhibit visual coherence and serve as task-free, chain-of-vision supervision for pure-vision autoregressive models. Extensive experiments on diverse vision tasks and multi-scale models validate the effectiveness of our proposed method for visual in-context learning. Code and dataset will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39362", "url": null, "sourceid": 44104, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39365, "uid": "f69b857233949c6a79158d4bb7ab5061", "name": "Mixture of Prototypes for Test-time Adaptive Segmentation", "authors": [{"id": 73833, "fullname": "Guangrui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/73833?format=json", "institution": "University of Technology Sydney"}, {"id": 191934, "fullname": "Zhengyu Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191934?format=json", "institution": "Chongqing University"}, {"id": 182663, "fullname": "Yongxin Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/182663?format=json", "institution": "Chongqing University"}], "abstract": "Test-Time Adaptive Segmentation (TTA-Seg) aims to adapt a trained segmentation model to test data under distribution shift in an unsupervised manner. Existing approaches typically utilize class-wise prototypes to capture and transfer the source distribution, but inevitably neglecting the diversity within source samples. In this paper, we propose a new test-time adaptation paradigm based on the mixture-of-experts (MoE), where domain experts are designed to 1) better capture the source distribution, and 2) dynamically adjust their contribution in test case prediction. Specifically, during source training, prototypes are derived as the class-wise average for source pixel features. We then generate multiple experts through clustering these prototypes, providing each class with several experts with enhanced representativeness. At test time, each pixel's prediction is drawn from all experts' knowledge in an adaptive manner, \\ie, a gating network assigns weight according to pixel-expert correlation. To optimize the system, we devise a min-max entropy optimization scheme for the gating network but keeping the rest frozen, minimizing the entropy of model prediction but maximizing the entropy in expert selection. Consequently, the model is urged to derived confident predictions with effective utilization of domain experts, hence promoting the adaptation. Experiments on two scenarios, Test-time Adaptation (TTA) and the more challenging continual TTA, demonstrate that our approach achieves the new state-of-the-art.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39365", "url": null, "sourceid": 35473, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39368, "uid": "30b316f9f309658403dbe13be9cdd839", "name": "LF-BVN: Blind-View Network for Self-Supervised Light Field Denoising", "authors": [{"id": 183490, "fullname": "Longzhao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/183490?format=json", "institution": "Beijing Jiaotong University"}, {"id": 142137, "fullname": "shuo zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/142137?format=json", "institution": "Beijing Jiaotong University"}, {"id": 142968, "fullname": "Chen Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/142968?format=json", "institution": "Beijing Jiaotong University"}, {"id": 191939, "fullname": "Qian Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/191939?format=json", "institution": "Beijing Jiaotong University"}, {"id": 185678, "fullname": "Youfang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185678?format=json", "institution": "Beijing Jiaotong University"}], "abstract": "Recent advances in learning-based Light Field (LF) image denoising have achieved impressive results. However, these methods rely heavily on large-scale noisy-clean image pairs and often fail to generalize to unseen or complex noise.In this work, we observe that the inherent multi-view consistency of LF images makes it highly unlikely for noise to be coherent across views, offering a more reliable supervisory signal for self-supervised denoising.Building on this insight, we extend the blind-spot principle to the LF domain and propose a novel LF Blind-View denoising Network (LF-BVN). We first introduce a geometric invariance mask that leverages angular redundancy for efficient full-view supervision. To enforce cross-view photometric consistency, we further introduce latent representation volumes and enforce consistency between them.Additionally, we exploit focus stacks to extract latent depth cues from noisy observations, providing further guidance.Extensive experiments show that LF-BVN achieves competitive denoising performance while maintaining strong cross-view consistency without requiring clean data or external supervision.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39368", "url": null, "sourceid": 31520, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39370, "uid": "8b48b3ca4e8d5eceaae8e5864b25650f", "name": "PMRNet: Physics-informed Multi-scale Refinement Network for Medical Image Segmentation", "authors": [{"id": 182050, "fullname": "Boce Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182050?format=json", "institution": "Northwestern Polytechnical University"}], "abstract": "Medical image segmentation demands both high accuracy and computational efficiency, yet existing methods face a critical trade-off: CNNs lack global context while transformers incur prohibitive costs for deployment on resource-constrained devices. To address this challenge, we propose $P$hysics-informed $M$ulti-scale $R$efinement Network (PMRNet), integrating symplectic geometry, renormalization group theory, and entropy diffusion to guide feature learning. PMRNet features three innovations: (1) a physics-informed encoder with Enhanced Symplectic Convolution for boundary detection and Renormalization Group-informed Downsampling for information preservation; (2) a Pseudo-Global Receptive Field module achieving near-global context with linear complexity through entropy-driven diffusion; and (3) a boundary-aware decoder for precise delineation. With only $0.87$M parameters and $3.43$ GFLOPs, PMRNet achieves $87.25\\%$ IoU and $92.56\\%$ Dice on the challenging Clinic dataset, outperforming state-of-the-art (SOTA) models with even $100\\times$ more parameters across $12$ medical imaging datasets while maintaining computational efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39370", "url": null, "sourceid": 37767, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39372, "uid": "21e5668c6ee00a2aba949906282b540c", "name": "VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction", "authors": [{"id": 183390, "fullname": "Weitai Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183390?format=json", "institution": "University of Illinois Chicago"}, {"id": 87401, "fullname": "Jason Kuen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87401?format=json", "institution": "Adobe Research"}, {"id": 106348, "fullname": "Mengwei Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/106348?format=json", "institution": "Adobe"}, {"id": 191260, "fullname": "Zijun Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/191260?format=json", "institution": "ByteDance Inc."}, {"id": 150984, "fullname": "Yan Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/150984?format=json", "institution": "University of Illinois Chicago"}, {"id": 76043, "fullname": "Kangning Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76043?format=json", "institution": "NEW YORK UNIVERSITY"}], "abstract": "Current visual grounding models are either based on a Multimodal Large Language Model (MLLM) that performs auto-regressive decoding, which is slow and risks hallucinations, or on re-aligning an LLM with vision features to learn new special or object tokens for grounding, which may undermine the LLM\u2019s pretrained reasoning ability. In contrast, we propose **VGent**, a modular encoder\u2013decoder architecture that explicitly disentangles high-level reasoning and low-level bounding box prediction. Specifically, a frozen MLLM serves as the encoder to provide untouched powerful reasoning capabilities, while a decoder takes high-quality boxes proposed by detectors as queries and selects target box(es) via cross-attending on encoder's hidden states. This design fully leverages advances in both object detection and MLLM, avoids the pitfalls of auto-regressive decoding, and enables fast inference. Moreover, it supports modular upgrades of both the encoder and decoder to benefit the whole system: we introduce (i) **QuadThinker**, an RL-based training paradigm for enhancing multi-target reasoning ability of the encoder; (ii) **mask-aware label** for resolving detection\u2013segmentation ambiguity; and (iii) **global target recognition** to improve the recognition of all the targets which benefits the selection among augmented proposals. Experiments on multi-target visual grounding benchmarks show that VGent achieves a new state-of-the-art with **+20.6%** F1 improvement over prior methods, and further boosts gIoU by **+8.2%** and cIoU by **+5.8%** under visual reference challenges, while maintaining constant, fast inference latency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39372", "url": null, "sourceid": 36592, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39373, "uid": "fa2bb63769b5280b13957feadeea5502", "name": "Scale Space Diffusion", "authors": [{"id": 183078, "fullname": "Soumik Mukhopadhyay", "url": "http://cvpr.thecvf.com/api/miniconf/users/183078?format=json", "institution": "University of Maryland, College Park"}, {"id": 191944, "fullname": "Prateksha Udhayanan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191944?format=json", "institution": "University of Maryland, College Park"}, {"id": 98052, "fullname": "Abhinav Shrivastava", "url": "http://cvpr.thecvf.com/api/miniconf/users/98052?format=json", "institution": "University of Maryland"}], "abstract": "Diffusion models degrade images through noise, and reversing this process reveals an information hierarchy across timesteps. Scale-space theory exhibits a similar hierarchy via low-pass filtering. We formalize this connection and show that highly noisy diffusion states contain no more information than small, downsampled images - raising the question of why they must be processed at full resolution. To address this, we fuse scale spaces into the diffusion process by formulating a family of diffusion models with generalized linear degradations and practical implementations. Using downsampling as the degradation yields our proposed Scale Space Diffusion method. To support Scale-Space Diffusion we introduce FlexiUNet, a UNet variant that performs resolution-preserving and resolution-increasing denoising using only the necessary parts of the network. We evaluate our framework on CelebA and ImageNet and analyze its scaling behavior across resolutions and network depths.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39373", "url": null, "sourceid": 43688, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39378, "uid": "b8f5a378adbebec3b6fb49840d4adb21", "name": "DeRVOS: Decoupling Consistent Trajectory Generation and Multimodal Understanding for Referring Video Object Segmentation", "authors": [{"id": 173509, "fullname": "WENXUAN CHENG", "url": "http://cvpr.thecvf.com/api/miniconf/users/173509?format=json", "institution": "Southeast University"}, {"id": 186466, "fullname": "Ming Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/186466?format=json", "institution": "Southeast University"}, {"id": 191956, "fullname": "Huimin Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191956?format=json", "institution": "Southeast University"}, {"id": 191957, "fullname": "Wankou Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191957?format=json", "institution": "Southeast University"}], "abstract": "Referring video object segmentation (RVOS) aims to segment objects within a video according to natural language expressions. Unlike earlier works focusing on static single-object scenarios, recent studies address more complex motion scenes. Previous methods typically adopt a query-based, logically multi-stage pipeline to handle these scenarios. However, this paradigm learns trajectory consistency modeling and multimodal fusion from scratch, which often leads to trajectory inconsistencies and insufficient multimodal understanding. To address these limitations, we propose DeRVOS, a framework that decouples RVOS into two key branches: consistent trajectory generation and multimodal understanding. We extract temporally consistent object representations using a powerful pretrained instance trajectory generation model and perform cross-modal alignment via a unified multimodal encoder, enabling upstream modeling of trajectory consistency and vision-language understanding. This design reduces RVOS to the task of modeling the relationship between referring expressions and instance trajectories. To connect the two branches and enable efficient motion-aware semantic understanding, we introduce the Trajectory Alignment and Implicit Selection (TAIS) module, which progressively performs cross-frame multimodal alignment and motion-guided implicit trajectory selection. Extensive experiments demonstrate that DeRVOS achieves state-of-the-art results on both traditional RVOS benchmarks and the challenging MeViS dataset, surpassing LVLM-based methods by 4.7%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39378", "url": null, "sourceid": 32680, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39379, "uid": "30b8014701a623705dd4bbc01ee7860b", "name": "Unlocking Motion from Large Vision Models with a Semantic and Kinematic Duality for Gait Recognition", "authors": [{"id": 153302, "fullname": "Zhanbo Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153302?format=json", "institution": "Michigan State University"}, {"id": 191958, "fullname": "Dingqiang Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/191958?format=json", "institution": "Johns Hopkins University"}, {"id": 73926, "fullname": "Xiaoming Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73926?format=json", "institution": "Michigan State University"}, {"id": 76130, "fullname": "Yu Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76130?format=json", "institution": "Michigan State University"}], "abstract": "Existing set-based gait recognition methods achieve remarkable performance by capturing global semantic context.However, their order-invariant nature prevents them from modeling the fine-grained kinematic patterns that unfold over time.To unify the global and process-level representations, we propose GaitMax, a framework that captures both semantic context and kinematic motion.GaitMax leverages attention-based spatiotemporal modeling to dynamically represent detailed part-level trajectories.While this detailed representation is more powerful, it also captures more nuisance factors (e.g., clothing, viewpoint), leading to potential shortcuts.To mitigate this, we introduce CDLoss, a Conditional Decorrelation Loss that explicitly disentangles the gait embeddings from nuisance factors using vision-language supervision.This loss requires high-quality nuisance descriptions. We therefore construct GCaption, a new resource that provides natural language annotations for multiple gait datasets, moving beyond simple categorical labels. GCaption not only enables CDLoss but also serves as a foundation for future context-aware gait analysis.The superiority of GaitMax is validated through extensive experiments on multiple large-scale gait benchmarks. Models, code, and resources will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39379", "url": null, "sourceid": 41009, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39375, "uid": "dbfd2cf7057917dab8d9070d88aa2c2c", "name": "VGGTracker: Fast Spatial Tracking with Visual Geometry Transformer", "authors": [{"id": 191950, "fullname": "Chengjie Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191950?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 168799, "fullname": "GUILE WU", "url": "http://cvpr.thecvf.com/api/miniconf/users/168799?format=json", "institution": "Huawei Technologies Canada CO., LTD."}, {"id": 128417, "fullname": "Dongfeng Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/128417?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 90680, "fullname": "Bingbing Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90680?format=json", "institution": "Huawei Technologies Ltd."}], "abstract": "Existing 3D point tracking methods mostly rely on heuristic designs or scene reconstruction, which incurs significant computational overhead and makes it difficult to meet the demands of real-time applications.To address this problem, in this work, we present VGGTracker, a novel spatial tracker that leverages a feed-forward visual geometry transformer to predict the trajectories of arbitrary query points from monocular videos in real time.Specifically, we employ a query initialization mechanism to maintain and update a global feature vector and a set of frame-level feature vectors for each query point.Then, we propose a new spatial tracking framework, which consists of a visual geometry transformer backbone, a global embedding branch, a frame-level embedding branch, and a tracking head.The key innovation lies in the dual-branch embedding design, where the global embedding branch integrates geometry-grounded features of the entire video into global query features to optimize track information across the entire sequence and the frame-level branch combines geometry-grounded features of each respective frame into frame-level query features to refine fine-grained track coordinate predictions.Furthermore, to facilitate collaboration between the global branch and the frame-level branch, we introduce an interaction module which enables unidirectional or bidirectional information exchange between the global query features and frame-level query features.Extensive experiments on various point tracking benchmark datasets show that our approach achieves significantly fast spatial tracking speed compared with state-of-the-art methods, while maintaining comparable tracking accuracy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39375", "url": null, "sourceid": 44665, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39383, "uid": "1d5acea22de068fb6e18a7e2de543002", "name": "CaricHarmony: Contrastive Diffusion Paths for Identity-Preserving Caricature Synthesis", "authors": [{"id": 183073, "fullname": "Dongyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183073?format=json", "institution": "University of Surrey"}, {"id": 130514, "fullname": "Dar-Yen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/130514?format=json", "institution": "SketchX"}, {"id": 76978, "fullname": "Yi-Zhe Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/76978?format=json", "institution": "University of Surrey"}], "abstract": "Sketch-based caricature synthesis suffers from a fundamental failure mode: when identity and shape conditions are combined in diffusion models, they create destructive interference that causes inevitable collapse toward either bland portraits or unrecognisable distortions. We identify the root cause as \\emph{condition signal contamination} -- competing probability distributions in the denoising trajectory that make balanced generation impossible. We present CaricHarmony, the first training-free method that explicitly resolves this contamination through parallel uncontaminated diffusion paths. During inference, we maintain three paths: $\\mathcal{P}^{\\mathrm{i}}$ (pure identity), $\\mathcal{P}^{\\mathrm{s}}$ (pure shape), and $\\mathcal{P}^{\\mathrm{i+s}}$ (harmonized output). Novel energy functions operating on cross-attention features provide gradient guidance that steers $\\mathcal{P}^{\\mathrm{i+s}}$ toward optimal balance: $\\mathcal{E}\\_{\\mathrm{shape}}$ ensures sketch fidelity through layout and semantic alignment, while $\\mathcal{E}\\_{\\mathrm{id}}$ employs token-level correspondence matching robust to extreme distortions. Unlike DemoCaricature requiring 70 seconds per-identity fine-tuning or CaricatureBooth constrained to Bezier curves, CaricHarmony accepts any sketch format and generates in under 16 seconds. Experiments demonstrate state-of-the-art performance: 0.8615 shape CLIP score (vs. 0.8450) under comparable identity consistency score, with 7.81 overall user preference score (vs. 6.06). Our method fundamentally reconceptualises the ID-shape conflict as conditioning signal contamination for diffusion models, enabling unprecedented creative control while preserving recognition.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39383", "url": null, "sourceid": 32990, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39386, "uid": "4b97f0f09717a033b75c10a27f9f5339", "name": "DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression", "authors": [{"id": 180717, "fullname": "Junqi Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/180717?format=json", "institution": "Nanjing University"}, {"id": 126807, "fullname": "Ming Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126807?format=json", "institution": "Nanjing University"}, {"id": 191970, "fullname": "Xingchen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191970?format=json", "institution": "nanjing university"}, {"id": 191971, "fullname": "Anle Ke", "url": "http://cvpr.thecvf.com/api/miniconf/users/191971?format=json", "institution": "nanjing university"}, {"id": 154214, "fullname": "Ruiqi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154214?format=json", "institution": "Nanjing University"}, {"id": 85046, "fullname": "Zhan Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/85046?format=json", "institution": "Nanjing University"}], "abstract": "Diffusion-based image compression has recently shown outstanding perceptual fidelity, yet its practicality is hindered by prohibitive sampling overhead and high memory usage.Most existing diffusion codecs employ UNet architectures, where hierarchical downsampling forces diffusion to operate in shallow latent spaces (typically with only $8\\times$ spatial downscaling), resulting in excessive computation.In contrast, conventional VAE-based codecs work in much deeper latent domains ($16\\times$\u2013$64\\times$ downscaled), motivating a key question:Can diffusion operate effectively in such compact latent spaces without compromising reconstruction quality?To address this, we introduce DiT-IC\u2014an Aligned Diffusion Transformer for Image Compression\u2014which replaces the UNet with a Diffusion Transformer capable of performing diffusion in latent space entirely at $32\\times$ downscaled resolution.DiT-IC adapts a pretrained text-to-image multi-step DiT into a single-step reconstruction model through three key alignment mechanisms:(1) a variance-guided reconstruction flow that adapts denoising strength to latent uncertainty for efficient reconstruction;(2) a self-distillation alignment that enforces consistency with encoder-defined latent geometry to enable one-step diffusion; and(3) a latent-conditioned guidance that replaces text prompts with semantically aligned latent conditions, enabling text-free inference.With these designs, DiT-IC achieves state-of-the-art perceptual quality while offering up to 30\u00d7 faster decoding and drastically lower memory usage than existing diffusion-based codecs. Remarkably, it can reconstruct $2048\\times2048$ images on a 16 GB laptop GPU. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39386", "url": null, "sourceid": 46624, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39395, "uid": "8d16fcbd56cdafde0fc1ba4e68c4636c", "name": "Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers", "authors": [{"id": 180373, "fullname": "Yiqing Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/180373?format=json", "institution": "Peking University"}, {"id": 131537, "fullname": "Yiren Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/131537?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 73557, "fullname": "Mike Zheng Shou", "url": "http://cvpr.thecvf.com/api/miniconf/users/73557?format=json", "institution": "National University of Singapore"}], "abstract": "Recent advances in diffusion transformers have shown remarkable generalization in visual synthesis, yet most dense perception methods still rely on text-to-image (T2I) generators designed for stochastic generation. We revisit this paradigm and show that image editing diffusion models are inherently image-to-image consistent, providing a more suitable foundation for dense perception task. We introduce Edit2Perceive, a unified diffusion framework that adapts editing models for depth, normal, and matting. Built upon the FLUX.1 Kontext architecture, our approach employs full-parameter fine-tuning and a pixel-space consistency loss to enforce structure-preserving refinement across intermediate denoising states. Moreover, our single-step deterministic inference yields up to faster runtime while training on relatively small datasets.Extensive experiments demonstrate comprehensive state-of-the-art results across all three tasks, revealing the strong potential of editing-oriented diffusion transformers for geometry-aware perception.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39395", "url": null, "sourceid": 31276, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39397, "uid": "4754d8f63266be774e77a7268383769c", "name": "iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception", "authors": [{"id": 155234, "fullname": "Sarthak Mehrotra", "url": "http://cvpr.thecvf.com/api/miniconf/users/155234?format=json", "institution": "Indian institute of Technology Bombay"}, {"id": 181662, "fullname": "Sairam Rebbapragada", "url": "http://cvpr.thecvf.com/api/miniconf/users/181662?format=json", "institution": null}, {"id": 191989, "fullname": "Mani Bonthu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191989?format=json", "institution": "Indian Institute of Technology, Hyderabad"}, {"id": 153478, "fullname": "Vineeth Balasubramanian", "url": "http://cvpr.thecvf.com/api/miniconf/users/153478?format=json", "institution": "Microsoft Research and IIT-Hyderabad"}], "abstract": "Multimodal Large Language Models (MLLMs) show strong potential for interpreting and interacting with complex, pixel-rich Graphical User Interface (GUI) environments. However, building agents that are both efficient for high-level tasks and precise for fine-grained interactions remains challenging. GUI agents must perform routine actions efficiently while also handling tasks that demand exact visual grounding, yet existing approaches struggle when accuracy depends on identifying specific interface elements. These MLLMs also remain large and cannot adapt their reasoning depth to the task at hand. In this work, we introduce iSHIFT: Implicit Slow\u2013fast Hybrid Inference with Flexible Tokens, a lightweight agent that integrates latent thinking (implicit chain-of-thought) with a perception control module. iSHIFT enables an MLLM to switch between a slow mode, which leverages detailed visual grounding for high precision and a fast mode that uses global cues for efficiency. Special perception tokens guide attention to relevant screen regions, allowing the model to decide both how to reason and where to focus. Despite its compact 2.5B size, iSHIFT matches state-of-the-art performance on multiple benchmark datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39397", "url": null, "sourceid": 45045, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39400, "uid": "a1467076d0ad80c80096422188601896", "name": "Rethinking Glyph Spatial Information in Font Generation", "authors": [{"id": 183712, "fullname": "Peng Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/183712?format=json", "institution": "Jilin University"}, {"id": 76396, "fullname": "Xi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76396?format=json", "institution": "Jilin University"}], "abstract": "Few-shot Font Generation (FFG) aims to create a complete font from a limited number of references, offering significant practical value. However, existing methods neglect glyph spatial information, which leads to two critical limitations. At the pipeline level, distorted rendering introduces spatial bias, impairing vectorization and dataset quality, and this problem is compounded by the lack of unified standards, which undermines a unified benchmark. At the model level, the implicit coupling of shape and position hinders fine-grained optimization and generalization. We address these challenges in the context of Chinese font generation, where glyph complexity demands superior model capability. Consequently, we first propose a Spatial-Preserving Rendering (SPR) protocol, which eliminates spatial bias and enables accurate vectorization. Alongside, we release an OFL-licensed Chinese font dataset to establish a unified benchmark. Then, technically, we propose GlyphSpatialNet, a two-stage framework to explicitly model glyph spatial information in pixel space. In first stage, we design a Shape-Position Decoupling (SPD) architecture and a Gradient Broadcasting Module (GBM) to achieve font style transfer in low resolution. In second stage, we design Style Detail Enhancement (SDE), which refines the style details for high resolution outputs. Extensive experiments demonstrate the effectiveness of our approach. Code and dataset are provided in the supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39400", "url": null, "sourceid": 35075, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39401, "uid": "fd3b98e9942a9a54d497b6ff1f7e30c5", "name": "Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search", "authors": [{"id": 180735, "fullname": "Xinlei Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/180735?format=json", "institution": "University of Science and Technology of China"}, {"id": 126523, "fullname": "Xiulian Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/126523?format=json", "institution": "Microsoft Research Asia"}, {"id": 87718, "fullname": "Xiao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/87718?format=json", "institution": "Microsoft Research Asia"}, {"id": 76559, "fullname": "Zhiwei Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76559?format=json", "institution": "USTC"}, {"id": 87597, "fullname": "Yan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87597?format=json", "institution": "Microsoft Research Asia"}], "abstract": "Long video understanding presents significant challenges for vision-language models due to extremely long context windows.Existing solutions rely on naive chunking strategies with retrieval-augmented generation, suffer from information fragmentation and a loss of global coherence. We propose a unified framework that achieves coherent and comprehensive understanding of long videos.Our approach overcome limitations of current solutions by combining audiovisual entity cohesion with hierarchical video indexing and agentic search. First, we preserves semantic consistency by integrating entity-level representations across visual and auditory streams, while organizing content into a structured hierarchy spanning global summary, scene, segment, and entity levels. Then we employ an agentic search mechanism to enable dynamic retrieval and reasoning across these layers, facilitating coherent narrative reconstruction and fine-grained entity tracking. Extensive experiments demonstrate that our method achieves good temporal coherence, entity consistency, and retrieval efficiency, establishing a new state-of-the-art with an overall accuracy of 81.0% on LVBench. Notably, it delivers exceptional performance in the challenging reasoning category (79.6%) and achieves 86.7% in temporal grounding. These results highlight the effectiveness of structured, multimodal reasoning for comprehensive and context-consistent understanding of long-form videos.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39401", "url": null, "sourceid": 35148, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39403, "uid": "3ede8eb28f99ac378ad24129fcfa1bfb", "name": "Diffusion with a Linguistic Compass: Steering the Generation of Clinically Plausible Future sMRI Representations for Early MCI Conversion Prediction", "authors": [{"id": 182316, "fullname": "Zhihao Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182316?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 192002, "fullname": "Chaozhuo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192002?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 192003, "fullname": "Litian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192003?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 192004, "fullname": "Xi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192004?format=json", "institution": "Beijing University of Posts and Telecommunications"}], "abstract": "Early prediction of Mild Cognitive Impairment (MCI) conversion is hampered by a trade-off between immediacy\u2014making fast predictions from a single baseline sMRI\u2014and accuracy\u2014leveraging longitudinal scans to capture disease progression. We propose MCI-Diff, a diffusion-based framework that synthesizes clinically plausible future sMRI representations directly from baseline data, achieving both real-time risk assessment and high predictive performance. First, a multi-task sequence reconstruction strategy trains a shared denoising network on interpolation and extrapolation tasks to handle irregular follow-up sampling and learn robust latent trajectories. Second, an LLM-driven \u201clinguistic compass\u201d is introduced for clinical plausibility sampling: generated feature candidates are quantized, tokenized, and scored by a fine-tuned language model conditioned on expected structural biomarkers, guiding autoregressive generation toward realistic disease patterns. Experiments on ADNI and AIBL cohorts show that MCI-Diff outperforms state-of-the-art baselines, improving early conversion accuracy by 5\u201312\\%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39403", "url": null, "sourceid": 42530, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39406, "uid": "32639db9ccbc0455d0fb654e7bbdfc05", "name": "Pose-guided Enriched Feature Learning for Federated-by-camera Person Re-identification", "authors": [{"id": 163497, "fullname": "Joo Hyung OH", "url": "http://cvpr.thecvf.com/api/miniconf/users/163497?format=json", "institution": "Ulsan National Institute of Science and Technology"}, {"id": 181052, "fullname": "Minyoung Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/181052?format=json", "institution": "Ulsan National Institute of Science and Technology"}, {"id": 192012, "fullname": "Sung Whan Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/192012?format=json", "institution": "Ulsan National Institute of Science and Technology"}, {"id": 179891, "fullname": "Jae-Young Sim", "url": "http://cvpr.thecvf.com/api/miniconf/users/179891?format=json", "institution": "Ulsan National Institute of Science and Technology"}], "abstract": "Extending person re-identification (ReID) to a federated scenario has recently drawn attention due to privacy concerns of individuals, but existing methods mostly assume sufficient diversity in pose variations even within a decentralized client. We focus on a more realistic federated-by-camera scenario, where each client corresponds to a single camera and thus captures only a sparse set of poses. To enrich pose variety, we propose **Pose-guided Enriched Feature Learning (PEFL)** that explicitly augments pose-diverse samples in the federated ReID scenario. Specifically, a Pose-Extraction Module (PEM) disentangles pose-relevant and pose-irrelevant feature components, where Pose-Relationship Knowledge Distillation (PKD) method helps identify the correct pose and Semantic Consistency Maintenance (SCM) method preserves semantics even with pose changes. In addition, a Compatibility Regularization method ensures the PEM to be compatible with the feature space of the global model. By recombining pose-relevant and -irrelevant components across identities via PEM, our PEFL synthesizes pose-swapped features, thereby largely facilitating contrastive learning of ReID models. Extensive experiments on Market1501 and MSMT17 under the federated-by-camera setting demonstrate that PEFL consistently outperforms federated ReID baselines and their conjunctions with the existing feature augmentation methods; thus achieving state-of-the-art federated ReID performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39406", "url": null, "sourceid": 36458, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39408, "uid": "37f653f9417817d75d1d1d4eb1360f3a", "name": "FlowSteer: Guiding Few-Step Image Synthesis with Authentic Trajectories", "authors": [{"id": 156309, "fullname": "Lei Ke", "url": "http://cvpr.thecvf.com/api/miniconf/users/156309?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 131667, "fullname": "Hubery Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/131667?format=json", "institution": "Tencent"}, {"id": 186775, "fullname": "Gongye Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186775?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 130504, "fullname": "Zhengyao Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/130504?format=json", "institution": "University of Hong Kong"}, {"id": 85166, "fullname": "Jingcai Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/85166?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 86641, "fullname": "Chen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86641?format=json", "institution": "WeChat, Tencent"}, {"id": 86370, "fullname": "Wenhan Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/86370?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 73271, "fullname": "Yujiu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73271?format=json", "institution": "Tsinghua University"}, {"id": 185164, "fullname": "Jing LYU", "url": "http://cvpr.thecvf.com/api/miniconf/users/185164?format=json", "institution": "Wechat Team"}], "abstract": "With the success of flow matching in visual generation, sampling efficiency remains a critical bottleneck for its practical application. Among flow models' accelerating methods, ReFlow has been somehow overlooked although it has theoretical consistency with flow matching. This is primarily due to its suboptimal performance in practical scenarios compared to consistency distillation and score distillation. In this work, we investigate this issue within the ReFlow framework and propose FlowSteer, a method unlocks the potential of ReFlow-based distillation by guiding the student along teacher's authentic generation trajectories. We first identify that Piecewised ReFlow's performance is hampered by a critical distribution mismatch during the training and propose Online Trajectory Alignment(OTA) to resolve it. Then, we introduce a adversarial distillation objective applied directly on the ODE trajectory, improving the student's adherence to the teacher's generation trajectory. Furthermore, we find and fix a previously undiscovered flaw in the widely-used $\\texttt{FlowMatchEulerDiscreteScheduler}$ that largely degrades few-step inference quality. Our experiment result on SD3 demonstrates our method's efficacy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39408", "url": null, "sourceid": 43716, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39409, "uid": "d6a7c7617be97c79d7d430463fa6af8f", "name": "MoRe: Motion-aware Feed-forward 4D Reconstruction Transrformer", "authors": [{"id": 149447, "fullname": "Juntong Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/149447?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 187194, "fullname": "Zequn Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187194?format=json", "institution": "Li Auto Inc."}, {"id": 130530, "fullname": "Weiqi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130530?format=json", "institution": "Tsinghua University"}, {"id": 153284, "fullname": "Donglin Di", "url": "http://cvpr.thecvf.com/api/miniconf/users/153284?format=json", "institution": "Harbin Institute of Technology"}, {"id": 191296, "fullname": "Xuancheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191296?format=json", "institution": "Li Auto Inc."}, {"id": 192015, "fullname": "Chengmin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192015?format=json", "institution": "Li Auto Inc."}, {"id": 76426, "fullname": "Yu-Shen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76426?format=json", "institution": "Tsinghua University"}], "abstract": "Reconstructing dynamic 4D scenes remains challenging due to the presence of moving objects that corrupt camera pose estimation. Existing optimization methods alleviate this issue with additional supervision, but they are mostly computationally expensive and impractical in real-time applications. To address these limitations, we propose MoRe, a feedforward 4D reconstruction network that efficiently recovers dynamic 3D scenes from monocular videos. Built upon a strong static reconstruction backbone, MoRe employs an attention-forcing strategy to disentangle dynamic motion from static structure. To further enhance robustness, we fine-tune the model on large-scale, diverse datasets encompassing both dynamic and static scenes. Moreover, our grouped causal attention captures temporal dependencies and adapts to varying token lengths across frames, ensuring temporally coherent geometry reconstruction. Extensive experiments on multiple benchmarks demonstrate that MoRe achieves high-quality dynamic reconstructions with exceptional efficiency.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39409", "url": null, "sourceid": 41818, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39411, "uid": "ea344d40ba4d83dcb5e08328b1789614", "name": "CrossVL: Complexity-Aware Feature Routing and Paired Curriculum for Cross-View Vision-Language Detection", "authors": [{"id": 174425, "fullname": "Zhipeng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/174425?format=json", "institution": "University of Exeter"}, {"id": 192019, "fullname": "Chunbo Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192019?format=json", "institution": "University of Exeter"}], "abstract": "Vision\u2013language models (VLMs) enable text-guided object detection but degrade severely under cross-view scenarios where ground and aerial viewpoints differ in altitude, scale, and spatial layout. These geometric changes introduce systematic complexity variations between viewpoints, e.g., ground view images contain dense and highly occluded structures, while aerial images are sparse and globally organized. Fixed VLM fusion mechanisms cannot handle this discrepancy. We propose \\textbf{CrossVL}, a framework combining \\textbf{Complexity-Aware Pathway Aggregation (CPA)} and \\textbf{Paired Curriculum Learning (PCL)} for enhanced cross-view detection for VLM. CPA estimates scene complexity from multimodal statistics and routes visual features through multiple pathways to obtain view-specific representations. PCL leverages semantic consistency of synchronized ground\u2013aerial pairs to provide stable early supervision and then gradually shifts toward randomized sampling. On MAVREC, CrossVL improves Florence-2\u2019s aerial mAP from 58.66\\% to 61.03\\% and reduces the ground-aerial performance gap from 8.63pp to 6.65pp, while also achieving a 3.3\u00d7 reduction in variance across random seeds. CPA provides stable complexity-aware feature aggregation, and PCL enhances optimization dynamics. Together, they demonstrate that coordinated architectural and training adaptations are crucial for robust cross-view VLM detection.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39411", "url": null, "sourceid": 40709, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39432, "uid": "bb230dffabdb3f62c65202031ce32653", "name": "2nd Match: Finetuning Pruned Diffusion Models via Second-Order Jacobian Matching", "authors": [{"id": 180033, "fullname": "Caleb Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180033?format=json", "institution": "University of Washington"}, {"id": 86657, "fullname": "Eli Shlizerman", "url": "http://cvpr.thecvf.com/api/miniconf/users/86657?format=json", "institution": "University of Washington, University of Washington"}], "abstract": "Diffusion models achieve remarkable performance across diverse generative tasks in computer vision, but their high computational cost remains a major barrier to deployment. Model pruning offers a promising way to reduce inference cost and enable lightweight diffusion models. However, pruning leads to quality degradation due to reduced capacity. A key limitation of existing pruning approaches is that pruned models are finetuned using the same objective as the dense model (denoising score matching). Since the dense model is accessible during finetuning, it warrants a more effective approach for knowledge transfer from the dense to the pruned model. Motivated by this, we propose 2ndMatch (2ndM), a general-purpose finetuning framework that introduces a 2nd-order Jacobian Matching loss inspired by Finite-Time Lyapunov Exponents. 2ndM teaches the pruned model to mimic the sensitivity of the dense teacher, i.e., how to respond to small perturbations over time, through scalable random projections. The framework is architecture-agnostic and applies to both U-Net- and Transformer-based diffusion models. Experiments on CIFAR-10, CelebA, LSUN, ImageNet, and MSCOCO demonstrate that 2ndM reduces the performance gap between pruned and dense models, substantially improving output quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39432", "url": null, "sourceid": 33537, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39418, "uid": "b50c6c685af227009d5e3b4d5a57ee05", "name": "Enhancing Part-Level Point Grounding for Any Open-Source MLLMs", "authors": [{"id": 130488, "fullname": "Jin-Cheng Jhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130488?format=json", "institution": "National Tsing Hua University"}, {"id": 156889, "fullname": "Fu-En Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156889?format=json", "institution": "Amazon"}, {"id": 135461, "fullname": "Xin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/135461?format=json", "institution": "Amazon"}, {"id": 97203, "fullname": "Nan Qiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/97203?format=json", "institution": "Amazon"}, {"id": 98825, "fullname": "Lu Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/98825?format=json", "institution": "Amazon"}, {"id": 93149, "fullname": "Min Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/93149?format=json", "institution": "Amazon/NTHU"}, {"id": 130482, "fullname": "Cheng-Hao Kuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/130482?format=json", "institution": "Amazon"}], "abstract": "Visual grounding aims to associate free-form textual queries with specific regions in an image. While recent Multimodal Large Language Models (MLLMs) have demonstrated promising capabilities in this domain, they primarily excel at object-level grounding and often struggle with part-level grounding\u2014an essential requirement for fine-grained tasks such as robotic manipulation. In this work, we introduce a general approach that equips any open-source MLLMs with accurate 2D part-level point grounding, offering a more flexible alternative to conventional grounding representations. Our method leverages the attention mechanisms inherently present in MLLMs. By synthesizing text-conditioned, grounding-aware queries within intermediate layers via the proposed Q-Synth Module, we extract target-relevant attention patterns and refine them using a lightweight Attention-to-Point Decoder that converts these patterns into a point-centric heatmap for final prediction. Notably, all original MLLM parameters are frozen, ensuring full preservation of their pre-trained capabilities. Experiments show that our design consistently improves part-level grounding accuracy across datasets and can be seamlessly integrated into any open-source MLLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39418", "url": null, "sourceid": 33825, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39419, "uid": "204e6f2de036648c69169f6f95d7ba17", "name": "PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation", "authors": [{"id": 103431, "fullname": "baiqin wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/103431?format=json", "institution": "Institute of automation, Chinese academy of science"}, {"id": 76305, "fullname": "Xiangyu Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76305?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 192042, "fullname": "Fan Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192042?format=json", "institution": "University of Pittsburgh; University of Pittsburgh"}, {"id": 147329, "fullname": "HAO XU", "url": "http://cvpr.thecvf.com/api/miniconf/users/147329?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 89292, "fullname": "Zhen Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/89292?format=json", "institution": "Institute of Automation,  Chinese Academy of Sciences"}], "abstract": "Recent advancements in audio-driven talking face generation have made great progress in lip synchronization. However, current methods often lack sufficient control over talking face, such as speaking style and emotional expression, resulting in uniform facial motion. In this paper, we focus on improving two key factors: lip-audio alignment control(LAC) and emotion control(EMC), to enhance the diversity and user-friendliness of talking videos. Lip-audio alignment control ensures accurate lip-sync across varied speaking styles to simulate different talking habits, whereas emotion control aims to generate realistic emotional expressions with varying intensities and mixed emotional states. To achieve precise facial animation control, we propose a novel and efficient framework, PC-Talk, which enables lip-audio alignment control and emotion control through implicit keypoint deformations. First, our LAC module generates lip-synced talking faces with a specific speaking style, derived from either a video reference or preset options. It also supports lip movement scale adjustment and fine-grained editing of speaking styles for specific articulations. Second, our EMC module produces vivid emotional facial expressions through pure emotional deformation. It further enables precise control over emotion intensity and the compound emotions across different facial regions. Our method demonstrates outstanding control capabilities and achieves state-of-the-art performance on HDTF and MEAD datasets in experiments. The code will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39419", "url": null, "sourceid": 39699, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39420, "uid": "6a1fbf6a6315721b9e8931e69112c21e", "name": "REL-SF4PASS: Panoramic Semantic Segmentation with REL Depth Representation and Spherical Fusion", "authors": [{"id": 90467, "fullname": "Xuewei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/90467?format=json", "institution": "Zhejiang University"}, {"id": 180566, "fullname": "Xinghan Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180566?format=json", "institution": "Shanghai Dianji University"}, {"id": 192043, "fullname": "Zhimin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192043?format=json", "institution": "Shanghai Dainji University"}, {"id": 86317, "fullname": "Xi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86317?format=json", "institution": "Zhejiang University"}], "abstract": "As an important and challenging problem in computer vision, Panoramic Semantic Segmentation (PASS) aims to give complete scene perception based on an ultra-wide angle of view. Most PASS methods often focus on spherical geometry with RGB input or using the depth information in original or HHA format, which does not make full use of panoramic image geometry. To address these shortcomings, we propose REL-SF4PASS with our REL depth representation based on cylindrical coordinate and Spherical-dynamic Multi-Modal Fusion (SMMF). REL is made up of Rectified Depth, Elevation-Gained Vertical Inclination Angle, and Lateral Orientation Angle, which fully represents 3D space in cylindrical coordinate style and the surface normal direction. SMMF aims to ensure the diversity of fusion for different panoramic image regions and reduce the breakage of cylinder side surface expansion in ERP projection, which uses different fusion strategies to match the different regions in panoramic images. Experimental results show that REL-SF4PASS considerably improves performance and robustness on popular benchmark, Stanford2D3D Panoramic datasets. It gains 2.35% average mIoU improvement on all 3 folds and reduces the performance variance by approximately 70% when facing 3D disturbance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39420", "url": null, "sourceid": 30817, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39439, "uid": "28ccee6fd479a8719fa96fc5a959362f", "name": "V-DPM: Video Reconstruction with Dynamic Point Maps", "authors": [{"id": 186522, "fullname": "Edgar Sucar", "url": "http://cvpr.thecvf.com/api/miniconf/users/186522?format=json", "institution": "University of Oxford"}, {"id": 186521, "fullname": "Eldar Insafutdinov", "url": "http://cvpr.thecvf.com/api/miniconf/users/186521?format=json", "institution": "Amazon"}, {"id": 98911, "fullname": "Zihang Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/98911?format=json", "institution": "University of Oxford"}, {"id": 85624, "fullname": "Andrea Vedaldi", "url": "http://cvpr.thecvf.com/api/miniconf/users/85624?format=json", "institution": "University of Oxford"}], "abstract": "New, powerful 3D representations such as DUSt3R\u2019s invariant point maps, which encode 3D shape and camera parameters, have significantly advanced feed-forward 3D reconstruction. While point maps assume static scenes, Dynamic Point Maps (DPMs) extend the concept to dynamic 3D content, also representing 3D scene motion.However, DPMs have so far been limited to image pairs and, like DUSt3R, require post-processing via optimization when more than two views are involved. We argue that DPMs are far more meaningful when applied to videos and introduce V-DPM to demonstrate this.First, we show how to set up DPMs for videos to optimize their representational power, ease of neural prediction, and reuse of pre-trained models. Second, we implement these ideas on top of VGGT, a recent state-of-the-art 3D reconstructor. Although VGGT was trained on static scenes, we show that a small amount of synthetic data suffices to adapt it into an effective V-DPM predictor.This yields state-of-the-art 3D and 4D reconstruction in dynamic settings. In particular, unlike recent dynamic extensions of VGGT such as P3, DPMs reconstruct not only dynamic depth but also the full 3D motion of every point in the scene.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39439", "url": null, "sourceid": 37894, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39423, "uid": "4c859737028fd7781247eaa13a4ea759", "name": "HFedATM: Hierarchical Federated Domain Generalization via Optimal Transport and Regularized Mean Aggregation", "authors": [{"id": 181389, "fullname": "Thinh Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/181389?format=json", "institution": "VinUniversity"}, {"id": 181354, "fullname": "Le Trung Phan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181354?format=json", "institution": "VinUniversity"}, {"id": 192048, "fullname": "Binh Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192048?format=json", "institution": "Ho Chi Minh city University of Science, Vietnam National University; Ho Chi Minh city University of Science, Vietnam National University"}, {"id": 192049, "fullname": "Khoa D Doan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192049?format=json", "institution": "VinUniversity"}, {"id": 107423, "fullname": "KOK SENG WONG", "url": "http://cvpr.thecvf.com/api/miniconf/users/107423?format=json", "institution": "VinUniversity"}], "abstract": "Federated Learning (FL) is a decentralized approach where multiple clients collaboratively train a shared global model without sharing their raw data. Despite its effectiveness, conventional FL faces scalability challenges due to excessive computational and communication demands placed on a single central server as the number of participating devices grows. Hierarchical Federated Learning (HFL) addresses these issues by distributing model aggregation tasks across intermediate nodes (stations), thereby enhancing system scalability and robustness against single points of failure. However, HFL still suffers from a critical yet often overlooked limitation: domain shift, where data distributions vary significantly across different clients and stations, reducing model performance on unseen target domains. While Federated Domain Generalization (FedDG) methods have emerged to improve robustness to domain shifts, their integration into HFL frameworks remains largely unexplored. In this paper, we formally introduce Hierarchical Federated Domain Generalization (HFedDG), a novel scenario designed to investigate domain shift within hierarchical architectures. Specifically, we propose HFedATM, a hierarchical aggregation method that first aligns the convolutional filters of models from different stations through Filter-wise Optimal Transport Alignment and subsequently merges aligned models using a Shrinkage-aware Regularized Mean Aggregation. Our extensive experimental evaluations demonstrate that HFedATM significantly boosts the performance of existing FedDG baselines across multiple datasets and maintains computational and communication efficiency. Moreover, theoretical analyses indicate that HFedATM achieves tighter generalization error bounds compared to standard hierarchical averaging, resulting in faster convergence and stable training behavior.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39423", "url": null, "sourceid": 39331, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39427, "uid": "227aef76fd3e65664c541ae9c04cca44", "name": "TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures", "authors": [{"id": 98813, "fullname": "Hyeongjin Nam", "url": "http://cvpr.thecvf.com/api/miniconf/users/98813?format=json", "institution": "Seoul National University"}, {"id": 96222, "fullname": "Daniel Jung", "url": "http://cvpr.thecvf.com/api/miniconf/users/96222?format=json", "institution": "Seoul National University"}, {"id": 69929, "fullname": "Kyoung Mu Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/69929?format=json", "institution": "Seoul National University"}], "abstract": "Joint reconstruction of 3D human and object from a single image is an active research area, with pivotal applications in robotics and digital content creation. Despite recent advances, existing approaches suffer from two fundamental limitations. First, their reconstructions rely heavily on physical contact information, which inherently cannot capture non-contact human\u2013object interactions, such as gazing at or pointing toward an object. Second, the reconstruction process is primarily driven by local geometric proximity, neglecting the human and object appearances that provide global context crucial for understanding holistic interactions. To address these issues, we introduce TeHOR, a framework built upon two core designs. First, beyond contact information, our framework leverages text descriptions of human\u2013object interactions to enforce semantic alignment between the 3D reconstruction and its textual cues, enabling reasoning over a wider spectrum of interactions, including non-contact cases. Second, we incorporate appearance cues of the 3D human and object into the alignment process to capture holistic contextual information, thereby ensuring visually plausible reconstructions. As a result, our framework produces accurate and semantically coherent reconstructions, achieving state-of-the-art performance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39427", "url": null, "sourceid": 43609, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39429, "uid": "9b06d560889c93933db1586208866eba", "name": "SGS-Intrinsic: Semantic-Invariant Gaussian Splatting for Sparse-View Indoor Inverse Rendering", "authors": [{"id": 161353, "fullname": "jiahao niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/161353?format=json", "institution": "Sun Yat-sen University"}, {"id": 102435, "fullname": "rongjia zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/102435?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 86814, "fullname": "Wenju Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86814?format=json", "institution": "University of Kansas, Lawrence"}, {"id": 86314, "fullname": "Wei-Shi Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86314?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 87176, "fullname": "Qing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87176?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "We presents SGS-Intrinsic, an indoor inverse rendering framework that works well for sparse-view images. Unlike existing 3D Gaussian Splatting (3DGS) based methods that focus on object-centric reconstruction and fail to work under sparse view settings, our method allows to achieve high-quality geometry reconstruction and accurate disentanglement of material and illumination. The core idea is to construct a dense and geometry-consistent Gaussian semantic field guided by semantic and geometric priors, providing a reliable foundation for subsequent inverse rendering. Building upon this, we perform material\u2013illumination disentanglement by combining a hybrid illumination model and material prior to effectively capture illumination\u2013material interactions. To mitigate the impact of cast shadows and enhance the robustness of material recovery, we introduce illumination-invariant material constraint together with a de-shadowing model. Extensive experiments on benchmark datasets show that our method consistently improves both reconstruction fidelity and inverse rendering quality over existing 3DGS-based inverse rendering approaches. Our code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39429", "url": null, "sourceid": 31574, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39431, "uid": "b66974d1cb6b6a61dc2d659ff795410d", "name": "ElasticFormer: Detecting Objects in HRW Shots via Elastic Computing Vision Transformer", "authors": [{"id": 189410, "fullname": "Wenxi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189410?format=json", "institution": "East China Normal University"}, {"id": 180930, "fullname": "Jingchen Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180930?format=json", "institution": "Tsinghua University"}, {"id": 192066, "fullname": "Chenyang Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192066?format=json", "institution": "Alibaba Group"}, {"id": 192067, "fullname": "Mo-Ran Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192067?format=json", "institution": "Tsinghua University"}, {"id": 133385, "fullname": "Haozhe Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/133385?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 90686, "fullname": "Guiguang Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/90686?format=json", "institution": "Tsinghua University"}, {"id": 89370, "fullname": "Yuchen Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/89370?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Recent advances in gigapixel-level imaging have brought High-Resolution Wide shots  to the forefront of research. However, these images present significant challenges: extreme sparsity of foreground, gigapixel-level resolutions and diverse target counts. This makes traditional close-up detectors inaccurate and slow as they are overwhelmed by the background. Although previous research has explored sparse backbones, their fixed sparsity patterns lack the adaptability required to handle diverse target numbers.To address this, we introduce ElasticFormer, a sparse backbone that dynamically allocates computational resources based on foreground proportion. After scoring windows based on variance, proposed ElasticSelector module will predict the foreground proportion for top-k selection. The mechanism guides the model to select target-containing windows, scaling resources in areas where objects are clustered.We introduce a novel loss function combined with the 3-phase training strategy for ElasticSelector, allowing it to function properly when bounding box annotations are missing. A WSOD study is carried on PASCAL VOC 2007 to evaluate its extensibility. Further, ElasticNet is created to verify its backbone-agnostic nature. In experiments on the PANDA gigapixel benchmark, ElasticFormer reduces backbone FLOPs by 80\\% while achieving a significant improvement in AP$_{50}$ when compared to fixed-ratio sparse methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39431", "url": null, "sourceid": 37197, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39434, "uid": "b34157d44b5503bf9ef416fec0912265", "name": "GEM: Generating LiDAR World Model via Deformable Mamba", "authors": [{"id": 88510, "fullname": "Yang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88510?format=json", "institution": "Nanjing University of Information Science and Technology"}, {"id": 175584, "fullname": "Zhaojiang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175584?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 88605, "fullname": "Qiang Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/88605?format=json", "institution": "DiDi Chuxing"}, {"id": 142971, "fullname": "Youquan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/142971?format=json", "institution": "Fudan University"}, {"id": 192069, "fullname": "renliang Weng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192069?format=json", "institution": "Aibee Inc."}, {"id": 127401, "fullname": "Jianjun Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/127401?format=json", "institution": "Nanjing University of Science and Techonology"}, {"id": 85000, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85000?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 137969, "fullname": "Jin Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/137969?format=json", "institution": "Nanjing University"}], "abstract": "World models, which simulate environmental dynamics and generate sensor observations, are gaining increasing attention in autonomous driving. However, progress in LiDAR-based world models has lagged behind those built on camera videos or occupancy data, primarily due to two core challenges: the inherent disorder of point clouds and the difficulty of distinguishing dynamic objects from static structures. To address these issues, we propose **GEM**: a **G**enerative LiDAR world model that leverages d**E**formable **M**amba architecture, significantly improving fidelity and imaginative capability. Specifically, leveraging the structural similarity between sequential laser scanning and Mamba's processing mechanism, we first tokenize LiDAR sweeps into compact representations via a custom LiDAR scene tokenizer. After unsupervised disentanglement of tokenized features via a dynamic-static separator, a tri-path deformable Mamba is introduced to perform selective scanning and adaptive gating fusion over the disentangled features, leading to enhanced spatial-temporal understanding of the world evolution. Optionally, a planner and a BEV layout controller can be integrated to explore the model's capability for autonomous rollout and it's potential to generate \"what-if\" scenarios. Extensive experiments show that GEM achieves state-of-the-art performances across diverse benchmarks and evaluation settings, demonstrating its superiority and effectiveness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39434", "url": null, "sourceid": 45243, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39436, "uid": "c24c47dd1b2e9aac64cab553d94a22d7", "name": "Test-Time Alignment of Text-to-Image Diffusion Models via Null-Text Embedding Optimisation", "authors": [{"id": 136952, "fullname": "Taehoon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/136952?format=json", "institution": "University of Edinburgh"}, {"id": 88748, "fullname": "Henry Gouk", "url": "http://cvpr.thecvf.com/api/miniconf/users/88748?format=json", "institution": "University of Edinburgh"}, {"id": 73910, "fullname": "Timothy Hospedales", "url": "http://cvpr.thecvf.com/api/miniconf/users/73910?format=json", "institution": "University of Edinburgh; Samsung AI Research Centre"}], "abstract": "Test-time alignment (TTA) aims to adapt models to specific rewards during inference. However, existing methods tend to either under-optimise or over-optimise (reward hack) the target reward function.  We propose Null-Text Test-Time Alignment (Null-TTA), which aligns diffusion models by optimising the unconditional embedding in classifier-free guidance, rather than manipulating latent or noise variables. Due to the  structured semantic nature of the text embedding space, this ensures alignment occurs on a semantically coherent manifold and prevents reward hacking (exploiting non-semantic noise patterns to improve the reward).Since the unconditional embedding in classifier-free guidance serves as the anchor for the model's generative distribution, Null-TTA directly steers model's generative distribution towards the target reward rather than just adjusting the samples, even without updating model parameters.Thanks to these desirable properties, we show that Null-TTA achieves state-of-the-art target test-time alignment while maintaining strong cross-reward generalisation. This establishes semantic-space optimisation as an effective and principled novel paradigm for TTA.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39436", "url": null, "sourceid": 39162, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39437, "uid": "bb221b9bbbf25cc108cffe12fe10fbc2", "name": "ViTPrompt: Training-Free Prompt Refinement with Visual Tokens for Open-Vocabulary Detection", "authors": [{"id": 182392, "fullname": "Yitong Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/182392?format=json", "institution": "UESTC"}, {"id": 143939, "fullname": "Lihua Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/143939?format=json", "institution": "Centre for Artificial Intelligence and Robotics Hong Kong Institute of Science &amp; Innovation, Chinese Academy of Sciences"}, {"id": 90845, "fullname": "Jiwei Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/90845?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 159018, "fullname": "Ran Ran", "url": "http://cvpr.thecvf.com/api/miniconf/users/159018?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 192072, "fullname": "Shiyuan He", "url": "http://cvpr.thecvf.com/api/miniconf/users/192072?format=json", "institution": null}, {"id": 192073, "fullname": "Zeyu Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/192073?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 143983, "fullname": "Shuaifeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/143983?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 155494, "fullname": "Nianxin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155494?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 84847, "fullname": "Heng Tao Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/84847?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Test-Time Adaptive Object Detection (TTAOD) aims to maintain detection performance under distribution shifts without retraining. While recent vision-language models enable open-vocabulary detection, existing TTAOD methods\u2014whether closed-set or open-vocabulary\u2014focus exclusively on improving classification confidence and largely overlook the degradation of bounding box localization. To address this critical gap, we propose ViTPrompt (Visual Token-Prompting), a training-free framework that jointly refines both bounding boxes and class scores at test time. Our key insight is to augment the original text prompt with instance-aware visual tokens extracted from high-confidence detections in an initial forward pass; this enriched prompt is then used in a second inference stage, where the cross-modal decoder leverages the enhanced semantic context to produce more accurate box coordinates and classification logits. ViTPrompt requires no backpropagation, parameter updates, or external memory, making it highly efficient for real-time deployment. Experiments on multiple out-of-distribution benchmarks demonstrate that ViTPrompt achieves state-of-the-art performance, delivering consistent improvements in both localization accuracy and classification fidelity , and establishing itself as a holistic solution for open-vocabulary TTAOD.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39437", "url": null, "sourceid": 39411, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39440, "uid": "e9c519a82a563ea4613914e83d06acda", "name": "3D-LATTE: Latent Space 3D Editing from Textual Instructions", "authors": [{"id": 93733, "fullname": "Maria Parelli", "url": "http://cvpr.thecvf.com/api/miniconf/users/93733?format=json", "institution": "University of T\u00fcbingen"}, {"id": 166991, "fullname": "Michael Oechsle", "url": "http://cvpr.thecvf.com/api/miniconf/users/166991?format=json", "institution": "Google"}, {"id": 192078, "fullname": "Michael Niemeyer", "url": "http://cvpr.thecvf.com/api/miniconf/users/192078?format=json", "institution": "Google"}, {"id": 87927, "fullname": "Federico Tombari", "url": "http://cvpr.thecvf.com/api/miniconf/users/87927?format=json", "institution": "Google, TUM"}, {"id": 69174, "fullname": "Andreas Geiger", "url": "http://cvpr.thecvf.com/api/miniconf/users/69174?format=json", "institution": "University of T\u00fcbingen"}], "abstract": "Despite the recent success of multi-view diffusion models for text/image-based 3D asset generation, instruction-based editing of 3D assets lacks surprisingly far behind the quality of generation models. The main reason is that recent approaches using 2D priors suffer from view-inconsistent editing signals.   Going beyond 2D prior distillation methods and multi-view editing strategies, we propose a training-free editing method that operates within the latent space of a native 3D diffusion model, allowing us to directly manipulate 3D geometry. We guide the edit synthesis by blending 3D attention maps from the generation with the source object. Coupled with geometry-aware regularization guidance, a spectral modulation strategy in the Fourier domain and a refinement step for 3D enhancement, our method outperforms previous 3D editing methods enabling high-fidelity and precise edits across a wide range of shapes and semantic manipulations. Code will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39440", "url": null, "sourceid": 40790, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40368?format=json"], "related_events_ids": [40368]}, {"id": 39449, "uid": "165d6d6efc3c235ff0c042d3d8cfa8fd", "name": "Unified Vector Floorplan Generation via Markup Representation", "authors": [{"id": 96749, "fullname": "Kaede Shiohara", "url": "http://cvpr.thecvf.com/api/miniconf/users/96749?format=json", "institution": "The University of Tokyo"}, {"id": 92654, "fullname": "Toshihiko Yamasaki", "url": "http://cvpr.thecvf.com/api/miniconf/users/92654?format=json", "institution": "The University of Tokyo"}], "abstract": "Automatic residential floorplan generation has long been a central challenge bridging architecture and computer graphics, aiming to make spatial design more efficient and accessible. While early methods based on constraint satisfaction or combinatorial optimization ensure feasibility, they lack diversity and flexibility. Recent generative models achieve promising results but struggle to generalize across heterogeneous conditional tasks, such as generation from site boundaries, room adjacency graphs, or partial layouts, due to their suboptimal representations. To address this gap, we introduce Floorplan Markup Language (FML), a general representation that encodes floorplan information within a single structured grammar, which casts the entire floorplan generation problem into a next token prediction task. Leveraging FML, we develop a transformer-based generative model, Floorplan Markup Language Model (FMLM), capable of producing high-fidelity and functional floorplans under diverse conditions. Comprehensive experiments on the RPLAN dataset demonstrate that FMLM, despite being a single model, surpasses the previous task-specific state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39449", "url": null, "sourceid": 38065, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39446, "uid": "a333e28e13fad944153b8375bf5cab8d", "name": "Differentiable Stroke Planning with Dual Parameterization for Efficient and High-Fidelity Painting Creation", "authors": [{"id": 149722, "fullname": "Jinfan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/149722?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 192096, "fullname": "Wuze Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192096?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 127918, "fullname": "Zhangli Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127918?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 192097, "fullname": "Zhehan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192097?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 70256, "fullname": "Ye Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/70256?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 86840, "fullname": "Bingbing Ni", "url": "http://cvpr.thecvf.com/api/miniconf/users/86840?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "In stroke-based rendering, search methods often get trapped in local minima due to discrete stroke placement, while differentiable optimizers lack structural awareness and produce unstructured layouts. To bridge this gap, we propose a dual representation that couples discrete polylines with continuous B\u00e9zier control points via a bidirectional mapping mechanism. This enables collaborative optimization: local gradients refine global stroke structures, while content-aware stroke proposals help escape poor local optima. Our representation further supports Gaussian-splatting-inspired initialization, enabling highly parallel stroke optimization across the image. Experiments show that our approach reduces the number of strokes by 30\u201350\\%, achieves more structurally coherent layouts, and improves reconstruction quality, while cutting optimization time by 30\u201340\\% compared to existing differentiable vectorization methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39446", "url": null, "sourceid": 37702, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39447, "uid": "fb577e980675cae874a4956e5f1937c3", "name": "Bootstrapping Multi-view Learning for Test-time Noisy Correspondence", "authors": [{"id": 154709, "fullname": "Changhao He", "url": "http://cvpr.thecvf.com/api/miniconf/users/154709?format=json", "institution": "Sichuan University"}, {"id": 192098, "fullname": "Di Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/192098?format=json", "institution": null}, {"id": 192099, "fullname": "Shuxian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192099?format=json", "institution": "Sichuan University"}, {"id": 192100, "fullname": "Yanji Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192100?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 86197, "fullname": "Xi Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86197?format=json", "institution": "Sichuan University"}, {"id": 86154, "fullname": "Peng Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86154?format=json", "institution": "Sichuan University"}], "abstract": "Multi-view learning fuses complementary views to improve perception, but real-world deployments often suffer from Test-time Noisy Correspondence (TNC)~\u2014~cross-view misalignment caused by asynchronous sampling, transient network congestion, or other disturbances. Such misalignment introduces semantic inconsistency and significantly degrades performance. Existing remedies typically estimate view-specific reliability from clean, well-aligned training data and then extrapolate to noisy fusion at inference, resulting in a train-test task gap and reduced robustness against TNC. To bridge this gap, we propose \\underline{\\textbf{\\textcolor{red}{B}}}ootstrapping \\underline{\\textbf{\\textcolor{red}{M}}}ulti-view \\underline{\\textbf{\\textcolor{red}{L}}}earning (BML)~\u2014~a plug-and-play framework that explicitly learns to fuse under TNC. Specifically, BML performs in-place TNC bootstrapping to construct a controllable noise-augmented training set that simulates realistic correspondence distortion, thereby eliminating the task gap without external data. Unlike prior uncertainty-based approaches that model reliability in an unsupervised manner, BML presents a reveal-supervised paradigm, wherein a lightweight estimator jointly models intra-view predictive uncertainty (view quality) and inter-view prediction discrepancy (correspondence consistency) to produce calibrated reliability weights guided by both task objectives and bootstrapped supervision. Once deployed, these reliability weights directly modulate fusion, suppressing corrupted views while preserving informative ones. Across 11 benchmarks spanning diverse noise ratios, BML consistently outperforms state-of-the-art baselines and maintains robustness against TNC. Code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39447", "url": null, "sourceid": 32530, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39448, "uid": "b4b6a4686b2ca011ee0f72d0bc1e7b50", "name": "RNN as Linear Transformer: A Closer Investigation into Representational Potentials of Visual Mamba Models", "authors": [{"id": 156907, "fullname": "Timing Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156907?format=json", "institution": "Johns Hopkins University"}, {"id": 156906, "fullname": "Feng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156906?format=json", "institution": "Johns Hopkins University"}, {"id": 156909, "fullname": "Guoyizhe Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/156909?format=json", "institution": "Johns Hopkins University"}], "abstract": "Mamba, originally introduced for language modeling, has recently garnered attention as an effective backbone for vision tasks. However, its underlying mechanism in visual domains remains poorly understood. In this work, we systematically investigate Mamba\u2019s representational properties and make three primary contributions. First, we theoretically analyze Mamba\u2019s relationship to Softmax and Linear Attention, confirming that it can be viewed as a low-rank approximation of Softmax Attention and thereby bridging the representational gap between Softmax and Linear forms. Second, we introduce a novel binary segmentation metric for activation map evaluation, extending qualitative assessments to a quantitative measure that demonstrates Mamba\u2019s capacity to model long-range dependencies. Third, by leveraging DINO for self-supervised pretraining, we obtain clearer activation maps than those produced by standard supervised approaches, highlighting Mamba\u2019s potential for interpretability. Notably, our model also achieves a 78.5\\% linear probing accuracy on ImageNet, underscoring its strong performance. We hope this work can provide valuable insights for future investigations of Mamba-based vision architectures.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39448", "url": null, "sourceid": 37337, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39452, "uid": "c53efb25385f33b9aaddf6bbc3b8bc33", "name": "EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval", "authors": [{"id": 182383, "fullname": "Yuhan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/182383?format=json", "institution": "Sun Yat-sen University"}, {"id": 154561, "fullname": "Pengwen Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/154561?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 192103, "fullname": "Chuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192103?format=json", "institution": "Beijing Jiaotong University"}, {"id": 89667, "fullname": "Dayan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89667?format=json", "institution": "iie,cas"}, {"id": 126239, "fullname": "Xiaochun Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/126239?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Text-video retrieval tasks have seen significant improvements due to the recent development of large-scale vision-language pre-trained models. Traditional methods primarily focus on video representations or cross-modal alignment, while recent works shift toward enriching text expressiveness to better match the rich semantics in videos. However, these methods use only interactions between text and frames/video, and ignore rich interactions among the internal frames within a video, so the final expanded text cannot capture frame contextual information, leading to disparities between text and video. In response, we introduce Energy-Aware Fine-Grained Relationship Learning Network (EagleNet) to generate accurate and context-aware enriched text embeddings. Specifically, the proposed Fine-Grained Relationship Learning mechanism (FRL) first constructs a text-frame graph by the generated text candidates and frames, then learns relationships among texts and frames, which are finally used to aggregate text candidates into an enriched text embedding that incorporates frame contextual information. To further improve fine-grained relationship learning in FRL, we design Energy-Aware Matching (EAM) to model the energy of text-frame interactions and thus accurately capture the distribution of real text-video pairs. Moreover, for more effective cross-modal alignment and stable training, we replace the conventional softmax-based contrastive loss with the sigmoid loss. Extensive experiments have demonstrated the superiority of EagleNet across MSRVTT, DiDeMo, MSVD, and VATEX.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39452", "url": null, "sourceid": 31869, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39453, "uid": "b7a67633eed3d4aad8a6eeb0b88057ca", "name": "CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization", "authors": [{"id": 148321, "fullname": "Yitong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/148321?format=json", "institution": "Fudan University"}, {"id": 74132, "fullname": "Zuxuan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/74132?format=json", "institution": "Fudan University"}, {"id": 130875, "fullname": "Xipeng Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130875?format=json", "institution": "Fudan University"}, {"id": 86826, "fullname": "Yu-Gang Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86826?format=json", "institution": "Fudan University"}], "abstract": "Autoregressive (AR) language models rely on causal tokenization, but extending this paradigm to vision remains non-trivial. Current visual tokenizers either flatten 2D patches into non-causal sequences or enforce heuristic orderings that misalign with the \"next-token prediction\" pattern. Recent diffusion autoencoders similarly fall short: conditioning the decoder on all tokens lacks causality, while applying nested dropout mechanism introduces imbalance. To address these challenges, we present CaTok, a 1D causal image tokenizer with a MeanFlow decoder. By selecting tokens over time intervals and binding them to the MeanFlow objective, CaTok learns causal 1D representations that support both fast one-step generation and high-fidelity multi-step sampling, while naturally capturing diverse visual concepts across token intervals. To further stabilize and accelerate training, we propose a straightforward regularization REPA-A, which aligns encoder features with Vision Foundation Models (VFMs). Experiments demonstrate that CaTok achieves state-of-the-art results on ImageNet reconstruction, reaching 22.72 PSNR and 0.681 SSIM with fewer training epochs, and the AR model attains performance comparable to leading approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39453", "url": null, "sourceid": 40547, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39455, "uid": "bceae4e660518105326a313513671bf9", "name": "Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation", "authors": [{"id": 174205, "fullname": "Haonan Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/174205?format=json", "institution": "Wangxuan Institute of Computer Technology, Peking University"}, {"id": 143877, "fullname": "Yuxuan Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/143877?format=json", "institution": "Wangxuan Institute of Computer Technology, Peking University"}, {"id": 88921, "fullname": "Zhouhui Lian", "url": "http://cvpr.thecvf.com/api/miniconf/users/88921?format=json", "institution": "Peking University"}], "abstract": "Manual font design is an intricate process that transforms a stylistic visual concept into a coherent glyph set. This challenge persists in automated Few-shot Font Generation (FFG), where models often struggle to preserve both the structural integrity and stylistic fidelity from limited references. While autoregressive (AR) models have demonstrated impressive generative capabilities, their application to FFG is constrained by conventional patch-level tokenization, which neglects global dependencies crucial for coherent font synthesis. Moreover, existing FFG methods remain within the image-to-image paradigm, relying solely on visual references and overlooking the role of language in conveying stylistic intent during font design. To address these limitations, we propose GAR-Font, a novel AR framework for multimodal few-shot font generation. GAR-Font introduces a global-aware tokenizer that effectively captures both local structures and global stylistic patterns, a multimodal style encoder offering flexible style control through a lightweight language-style adapter without requiring intensive multimodal pretraining, and a post-refinement pipeline that further enhances structural fidelity and style coherence. Extensive experiments show that GAR-Font outperforms existing FFG methods, excelling in maintaining global style faithfulness and achieving higher-quality results with textual stylistic guidance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39455", "url": null, "sourceid": 34239, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39462, "uid": "f9ed45b93dd4614775806adf05661bfe", "name": "Generative Point Tracking and Trajectory Forecasting", "authors": [{"id": 192122, "fullname": "Xuanchen Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192122?format=json", "institution": "Cornell University"}, {"id": 90602, "fullname": "Ang Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/90602?format=json", "institution": "University of Michigan - Ann Arbor"}, {"id": 192123, "fullname": "Chao Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192123?format=json", "institution": "Cornell University"}, {"id": 192124, "fullname": "Andrew Owens", "url": "http://cvpr.thecvf.com/api/miniconf/users/192124?format=json", "institution": "Cornell University"}], "abstract": "Motion forecasting predicts where points will move in the future, while motion tracking predicts where they are in the present. Despite these conceptual similarities, existing approaches to these two problems are quite different. In this paper, we propose a unified model that can address both tasks. We train a causal, video-conditioned flow matching model to predict point positions. The resulting model can easily toggle between point tracking to forecasting by changing its visual signal. Despite our model's simplicity, we find that it outperforms prior work in point forecasting and obtains performance that is competitive with the state-of-the-art on the TAP-Vid DAVIS benchmark.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39462", "url": null, "sourceid": 30943, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39466, "uid": "8cc517d3c93bf6cc4b072c85794f7613", "name": "Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation", "authors": [{"id": 180549, "fullname": "Filip Wolf", "url": "http://cvpr.thecvf.com/api/miniconf/users/180549?format=json", "institution": "University of Ljubljana"}, {"id": 180481, "fullname": "Blaz Rolih", "url": "http://cvpr.thecvf.com/api/miniconf/users/180481?format=json", "institution": "University of Ljubljana, Slovenia"}, {"id": 183529, "fullname": "Luka Cehovin Zajc", "url": "http://cvpr.thecvf.com/api/miniconf/users/183529?format=json", "institution": "University of Ljubljana"}], "abstract": "Foundation models are transforming Earth Observation (EO), yet the diversity of EO sensors and modalities makes a single universal model unrealistic. Multiple specialized EO foundation models (EOFMs) will likely coexist, making efficient knowledge transfer across modalities essential. Most existing EO pretraining relies on masked image modeling, which emphasizes local reconstruction but provides limited control over global semantic structure. To address this, we propose a dual-teacher contrastive distillation framework for multispectral imagery that aligns the student\u2019s pretraining objective with the contrastive self-distillation paradigm of modern optical vision foundation models (VFMs). Our approach combines a multispectral teacher with an optical VFM teacher, enabling coherent cross-modal representation learning. Experiments across diverse optical and multispectral benchmarks show that our model adapts to multispectral data without compromising performance on optical-only inputs, achieving state-of-the-art results in both settings, with an average improvement of 3.64 percentage points in semantic segmentation, 1.2 in change detection, and 1.31 in classification tasks. This demonstrates that contrastive distillation provides a principled and efficient approach to scalable representation learning across heterogeneous EO data sources.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39466", "url": "https://wolfilip.github.io/DEO/", "sourceid": 33470, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39469, "uid": "2149347baf196b460776a239583cc183", "name": "Preference-Aligned LoRA Merging: Preserving Subspace Coverage and Addressing Directional Anisotropy", "authors": [{"id": 129325, "fullname": "Wooseong Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/129325?format=json", "institution": "KAIST"}, {"id": 176159, "fullname": "Wonyoung Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/176159?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology (KAIST)"}, {"id": 76867, "fullname": "Kuk-Jin Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/76867?format=json", "institution": "KAIST"}], "abstract": "Merging multiple Low-Rank Adaptation (LoRA) modules into a single model is a promising approach for constructing general-purpose systems, but it remains challenging because low-rank update directions introduced by LoRA adapters often span different subspaces and contribute unevenly across directions. When merged naively, such mismatches can weaken the directions most critical to certain task losses while overemphasizing relatively less important ones, ultimately reducing the model\u2019s ability to represent all tasks faithfully. We revisit this problem through two perspectives: subspace coverage, which captures how broadly LoRA directions cover diverse representational directions, and anisotropy, which reflects the imbalance of influence across those directions. We then propose TARA-Merging, short for Task-Rank Anisotropy Alignment. It explicitly incorporates task preferences by aligning the merging weights with a preference-weighted cross-entropy pseudo loss with preserving LoRA directions that encode task-relevant subspaces. This alignment ensures that the merged model maintains broad subspace coverage and accounts for anisotropy via direction-wise reweighting. Across eight vision and six NLI benchmarks, TARA-Merging consistently outperforms vanilla and LoRA-aware baselines, demonstrating strong robustness and generalization, and highlighting the importance of addressing both subspace coverage and anisotropy in LoRA merging.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39469", "url": null, "sourceid": 38477, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39470, "uid": "6670bc8dbbaa81dbf2ccc736bf9722d5", "name": "CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation", "authors": [{"id": 184070, "fullname": "Fengyi Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184070?format=json", "institution": "Tsinghua Shenzhen International Graduate School"}, {"id": 192137, "fullname": "Sicheng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192137?format=json", "institution": null}, {"id": 87900, "fullname": "Wenming Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87900?format=json", "institution": "Tsinghua University,"}], "abstract": "Co-speech gesture generation has significantly advanced human-computer interaction, yet speaker movements remain constrained due to the omission of text-driven non-spontaneous gestures (e.g., bowing while talking). Existing methods face two key challenges: 1) the semantic prior gap due to the lack of descriptive text annotations in gesture datasets, and 2) the difficulty in achieving coordinated multimodal control over gesture generation. To address these challenges, this paper introduces CoordSpeaker, a comprehensive framework that enables coordinated caption-empowered co-speech gesture synthesis. Our approach first bridges the semantic prior gap through a novel gesture captioning framework, leveraging a motion-language model to generate descriptive captions at multiple granularities. Building upon this, we propose a conditional latent diffusion model with unified cross-dataset motion representation and a hierarchically controlled denoiser to achieve highly controlled, coordinated gesture generation. CoordSpeaker pioneers the first exploration of gesture understanding and captioning to tackle the semantic gap in gesture generation while offering a novel perspective of bidirectional gesture-text mapping. Extensive experiments demonstrate that our method produces high-quality gestures that are both rhythmically synchronized with speeches and semantically coherent with arbitrary captions, achieving superior performance with higher efficiency compared to existing approaches. Code and demo video are in the supplementary material and will be released upon paper acceptance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39470", "url": null, "sourceid": 35360, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39473, "uid": "0f0c5117cdf71e86cd21ee67f05f20c2", "name": "Low-Rank Residual Diffusion Models", "authors": [{"id": 175571, "fullname": "Junfu Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/175571?format=json", "institution": "Georgia Institute of Technology"}, {"id": 144246, "fullname": "Jiang Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/144246?format=json", "institution": "Institute of automation, Chinese academy of science; NCEPU"}], "abstract": "Diffusion models have achieved remarkable progress in image generation and restoration. However, most frameworks assume a full-rank residual space, neglecting its inherent low-dimensional structure in near-domain transformations such as deraining and deblurring. We propose the Low-Rank Residual Diffusion Model (LRDM), which performs diffusion within a compact low-rank residual subspace for efficient and structure-preserving restoration. We establish the Low-Rank Residual Assumption, showing that the variational lower bound becomes tighter when residuals lie in a low-rank space. LRDM further introduces an Asymmetric Residual Diffusion Process, constraining the forward process in the low-rank domain while maintaining full-rank flexibility in the reverse direction. An Adaptive Rank Selection mechanism dynamically adjusts the rank across timesteps to capture varying residual complexity. Experiments on deraining, deblurring, and deshading benchmarks show that LRDM surpasses full-rank diffusion baselines and achieves state-of-the-art performance, validating the advantage of modeling diffusion in a low-rank residual space.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39473", "url": null, "sourceid": 34312, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39474, "uid": "4d5b1d7451664f7b4d7359d1f4ea0c88", "name": "Exploring Visual Pretraining for Learning Language Intelligence", "authors": [{"id": 93407, "fullname": "Zhonghan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/93407?format=json", "institution": "Zhejiang University"}, {"id": 107538, "fullname": "Yiming Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107538?format=json", "institution": "University of Science and Technology of China"}, {"id": 73993, "fullname": "Wenwei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73993?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 192143, "fullname": "Haiteng Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192143?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 192144, "fullname": "Xingguang Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/192144?format=json", "institution": "University of Science and Technology of China"}, {"id": 192145, "fullname": "Zhangwei Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192145?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 192146, "fullname": "Kuikun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192146?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 192147, "fullname": "Yuzhe Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192147?format=json", "institution": "Shanghai Jiaotong University; Shanghai Artificial Intelligence Laboratory"}, {"id": 185714, "fullname": "Size Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185714?format=json", "institution": "Nanyang Technological University"}, {"id": 192148, "fullname": "Haian Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192148?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 192149, "fullname": "Jianfei Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192149?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 192150, "fullname": "haijun Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/192150?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 192151, "fullname": "Demin Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/192151?format=json", "institution": null}, {"id": 192152, "fullname": "Yunhua Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192152?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 192153, "fullname": "Qipeng Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192153?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 86591, "fullname": "Gaoang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86591?format=json", "institution": "Zhejiang University"}, {"id": 71116, "fullname": "Kai Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/71116?format=json", "institution": "Shanghai AI Laboratory"}], "abstract": "While the most fundamental pretraining paradigm typically trains modality-specific models on their respective datasets, the Platonic Representation Hypothesis that representations eventually align across modalities as data and model scale suggests an intriguing possibility: large language models (LLMs) could be pretrained on visual corpora to reach parity with text-pretrained models, thereby expanding data sources to break the text-scaling bottlenecks, and leveraging richer visual cues for more comprehensive corpus understanding. This paper makes the first attempt to demonstrate the feasibility of this implication by introducing Masked Autoregressive Pretraining for Learning language intelligencE (MAPLE), a novel visual pretraining paradigm for LLMs that leverages raw document images to improve language intelligence. MAPLE is universal to integrate masked auto-regressive models with various LLM backbones, where the LLMs are incentivized to generate latent hypotheses for the masked regions based on the unmasked regions. We verify MAPLE in the domain of math reasoning with multiple LLM backbones and show that MAPLE consistently surpasses text-only pretraining relatively by at most 40.2\\% on average accuracy across four math reasoning benchmarks. Further analyses show that visually pretrained LLMs learn a shared latent space that aligns document visuals with text and exploits layout and structural cues, supporting visual pretraining as a feasible and scalable route to stronger language models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39474", "url": null, "sourceid": 46521, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39476, "uid": "77415b528feaea76b34180a6f6412e77", "name": "Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance", "authors": [{"id": 182403, "fullname": "Muyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182403?format=json", "institution": "The University of Sydney"}, {"id": 192156, "fullname": "Yucheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192156?format=json", "institution": "Dolby"}, {"id": 192157, "fullname": "Jianbo Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/192157?format=json", "institution": null}, {"id": 192158, "fullname": "Elliot Osborne", "url": "http://cvpr.thecvf.com/api/miniconf/users/192158?format=json", "institution": "Dolby"}, {"id": 84821, "fullname": "Bo Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/84821?format=json", "institution": "HKBU"}, {"id": 84796, "fullname": "Tongliang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84796?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}], "abstract": "Vision-Language Models (VLMs) have enhanced traditional LLMs with visual capabilities through the integration of vision encoders. While recent works have explored various combinations of vision encoders and LLMs, there still lacks a principled understanding of what makes a vision encoder suitable for VLM alignment. In this paper, we systematically investigate this question via comprehensive experiments on a curated collection of 19 pre-trained vision encoders from diverse sources. We first demonstrate that common practices, such as choosing encoders with the largest size or highest zero-shot accuracy, consistently fail to identify optimal models. In fact, these metrics show only weak to moderate correlation with VLM performance. This intriguing finding begs a fundamental question: What factors of vision-encoders matter in VLM? Through comprehensive analysis, we identify that the structural similarity across modalities plays a crucial but previously overlooked role in vision-encoder selection, which we measure using the Gromov-Wasserstein distance as a proxy. From a theoretical perspective, we show that the learnability of cross-modality mapping can be provably associated with the Gromov-Wasserstein distance. Empirical verification on 60+ full VLM training runs shows that our proposed inference-only metric performs significantly better than alternative model selection strategies and exhibits a much stronger correlation with final VLM performance, thereby enabling efficient and effective prediction of VLM performance before full training.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39476", "url": null, "sourceid": 42170, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39477, "uid": "c6c65000e8d245e161471faa4801208c", "name": "Composing Concepts from Images and Videos via Concept-prompt Binding", "authors": [{"id": 180371, "fullname": "Xianghao Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/180371?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 180155, "fullname": "Zeyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180155?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 136279, "fullname": "Yuwei Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/136279?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 151499, "fullname": "Zhuoran ZHAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/151499?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 100858, "fullname": "Songchun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/100858?format=json", "institution": "Zhejiang University"}, {"id": 150941, "fullname": "Anyi Rao", "url": "http://cvpr.thecvf.com/api/miniconf/users/150941?format=json", "institution": "HKUST"}], "abstract": "Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind \\& Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39477", "url": null, "sourceid": 34877, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39481, "uid": "25c913a442ff0c90d4756bbf202ba3dd", "name": "Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement", "authors": [{"id": 182132, "fullname": "Junrong Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/182132?format=json", "institution": "University of Science and Technology of China"}, {"id": 192164, "fullname": "Shancheng Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192164?format=json", "institution": "Shenzhen University"}, {"id": 192165, "fullname": "Yadong Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192165?format=json", "institution": null}, {"id": 85497, "fullname": "Hongtao Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/85497?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Recent advances in Multimodal Large Language Models (MLLMs) have enabled automated generation of structured layouts from natural language descriptions. Existing methods typically follow a code-only paradigm that generates code to represent layouts, which are then rendered by graphic engines to produce final images. However, they are blind to the rendered visual outcome, making it difficult to guarantee readability and aesthetics. In this paper, we identify visual feedback as a critical factor in layout generation and propose Visual Feedback Layout Model (VFLM), a self-improving framework that leverages visual feedback iterative refinement. VFLM is capable of performing adaptive reflective generation, which leverages visual information to reflect on previous issues and iteratively generates outputs until satisfactory quality is achieved. It is achieved through reinforcement learning with a visually grounded reward model that incorporates OCR accuracy. By rewarding only the final generated outcome, we can effectively stimulate the model's iterative and reflective generative capabilities. Experiments across multiple benchmarks show that VFLM consistently outperforms code-only baselines, advanced MLLMs, and existing layout models, establishing visual feedback as critical for design-oriented MLLMs. The code will be publicly available in the future.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39481", "url": null, "sourceid": 36613, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39483, "uid": "03c7f28e71151ecf37efa593de96aeaa", "name": "E-3DPSM: A State Machine for Event-based Egocentric 3D Human Pose Estimation", "authors": [{"id": 174830, "fullname": "Mayur Deshmukh", "url": "http://cvpr.thecvf.com/api/miniconf/users/174830?format=json", "institution": "Max Planck Institute for Informatics"}, {"id": 101980, "fullname": "Hiroyasu Akada", "url": "http://cvpr.thecvf.com/api/miniconf/users/101980?format=json", "institution": "Max Planck Institute for Informatics"}, {"id": 192166, "fullname": "Helge Rhodin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192166?format=json", "institution": "Universit\u00e4t Bielefeld"}, {"id": 75465, "fullname": "Christian Theobalt", "url": "http://cvpr.thecvf.com/api/miniconf/users/75465?format=json", "institution": "MPI Informatik"}, {"id": 75786, "fullname": "Vladislav Golyanik", "url": "http://cvpr.thecvf.com/api/miniconf/users/75786?format=json", "institution": "MPI for Informatics"}], "abstract": "Event cameras offer multiple advantages in monocular egocentric 3D human pose estimation from head-mounted devices, such as millisecond temporal resolution, high dynamic range, and negligible motion blur. Existing methods effectively leverage these properties, but suffer from low 3D estimation accuracy, insufficient in many applications (e.g., immersive VR/AR). This is due to the design not being fully tailored towards event streams (e.g., their asynchronous and continuous nature), leading to high sensitivity to self-occlusions and temporal jitter in the estimates. This paper rethinks the setting and introduces E-3DPSM, an event-driven continuous pose state machine for event-based egocentric 3D human pose estimation. E-3DPSM aligns continuous human motion with fine-grained event dynamics; it evolves latent states and predicts continuous changes in 3D joint positions associated with observed events, which are fused with direct 3D human pose predictions, leading to stable and drift-free final 3D pose reconstructions. E-3DPSM runs in real-time at $80$ Hz on a single workstation and sets a new state of the art in experiments on two benchmarks, improving accuracy by up to $19$% (MPJPE) and temporal stability by up to $2.7\\times$. Our source code will be publicly released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39483", "url": null, "sourceid": 44072, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39484, "uid": "6c5226a2f95f24ed182a70dddd0523fe", "name": "All Vehicles Can Lie: Efficient Adversarial Defense in Fully Untrusted-Vehicle Collaborative Perception via Pseudo-Random Bayesian Inference", "authors": [{"id": 145504, "fullname": "Yi Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/145504?format=json", "institution": "Wuhan University"}, {"id": 73021, "fullname": "Libing Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73021?format=json", "institution": "Wuhan University"}, {"id": 192167, "fullname": "Zhuangzhuang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192167?format=json", "institution": "City University of Hong Kong"}, {"id": 178426, "fullname": "Jing Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/178426?format=json", "institution": "Guangzhou University"}, {"id": 192168, "fullname": "Lijuan Huo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192168?format=json", "institution": "Wuhan University"}, {"id": 192169, "fullname": "Jiaqi Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192169?format=json", "institution": "Wuhan University"}], "abstract": "Collaborative perception (CP) enables multiple vehicles to augment their individual perception capacities through the exchange of feature-level sensory data. However, this fusion mechanism is inherently vulnerable to adversarial attacks, especially in fully untrusted-vehicle environments. Existing defense approaches often assume a trusted ego vehicle as a reference or incorporate additional binary classifiers. These assumptions limit their practicality in real-world deployments due to the questionable trustworthiness of ego vehicles, the requirement for real-time detection, and the need for generalizability across diverse scenarios. To address these challenges, we propose a novel Pseudo-Random Bayesian Inference (PRBI) framework, a first efficient defense method tailored for fully untrusted-vehicle CP. PRBI detects adversarial behavior by leveraging temporal perceptual discrepancies, using the reliable perception from the preceding frame as a dynamic reference. Additionally, it employs a pseudo-random grouping strategy that requires only two verifications per frame, while applying Bayesian inference to estimate both the number and identities of malicious vehicles. Theoretical analysis has proven the convergence and stability of the proposed PRBI framework. Extensive experiments show that PRBI requires only 2.5 verifications per frame on average, outperforming existing methods significantly, and restores detection precision to between 79.4% and 86.9% of pre-attack levels.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39484", "url": null, "sourceid": 36280, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39482, "uid": "f2f88e8e9aabd9bc9d0cd8472bb287d3", "name": "Abstract 3D Perception for Spatial Intelligence in Vision-Language Models", "authors": [{"id": 126413, "fullname": "Yifan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126413?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 179596, "fullname": "Fangneng Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/179596?format=json", "institution": "Harvard University &amp; MIT"}, {"id": 166392, "fullname": "Kaichen Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/166392?format=json", "institution": "University of Oxford"}, {"id": 85755, "fullname": "Yilun Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/85755?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 188104, "fullname": "Paul Pu Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188104?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 89796, "fullname": "Hanspeter Pfister", "url": "http://cvpr.thecvf.com/api/miniconf/users/89796?format=json", "institution": "Harvard University"}], "abstract": "Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding, which are crucial for real-world applications like robotics and embodied agents.We attribute this to a modality gap between the 3D tasks and the 2D training of VLM, which led to inefficient retrieval of 3D information from 2D input.To bridge this gap, we introduce SandboxVLM, a simple yet effective framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLM. Specifically, we design a 3D Sandbox reconstruction and perception pipeline comprising four stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning.Evaluated in zero-shot settings across multiple benchmarks and VLM backbones, our approach consistently improves spatial intelligence, achieving an 8.3\\% gain on SAT Real compared with baseline methods for instance.These results demonstrate that equipping VLMs with a 3D abstraction substantially enhances their 3D reasoning ability without additional training, suggesting new possibilities for general-purpose embodied intelligence.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39482", "url": null, "sourceid": 35323, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39487, "uid": "7c2dccff2cb6e946fb068f6b604ea868", "name": "What Your Features Reveal: Data-Efficient Black-Box Feature Inversion Attack for Split DNNs", "authors": [{"id": 182888, "fullname": "Zhihan Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/182888?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 192175, "fullname": "Lijun He", "url": "http://cvpr.thecvf.com/api/miniconf/users/192175?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 192176, "fullname": "Jiaxi Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192176?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 192177, "fullname": "Xinzhu Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192177?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 183557, "fullname": "Haixia Bi", "url": "http://cvpr.thecvf.com/api/miniconf/users/183557?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 192178, "fullname": "Fan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192178?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "Split DNNs enable edge devices by offloading intensive computation to a cloud server, but this paradigm exposes privacy vulnerabilities, as the intermediate features can be exploited to reconstruct the private inputs via Feature Inversion Attack (FIA). Existing FIA methods often produce limited reconstruction quality, making it difficult to assess the true extent of privacy leakage. To reveal the privacy risk of the leaked features, we introduce FIA-Flow, a $\\textbf{black-box}$ FIA framework that achieves high-fidelity image reconstruction from intermediate features. To exploit the semantic information within intermediate features, we design a Latent Feature Space Alignment Module (LFSAM) to bridge the semantic gap between the intermediate feature space and the latent space. Furthermore, to rectify distributional mismatch, we develop Deterministic Inversion Flow Matching (DIFM), which projects off-manifold features onto the target manifold with $\\textbf{one-step inference}$. This decoupled design simplifies learning and enables effective training with $\\textbf{few image\u2013feature pairs}$. To quantify privacy leakage from a human perspective, we also propose two metrics based on a large vision-language model. Experiments show that FIA-Flow achieves more faithful and semantically aligned feature inversion across various models (AlexNet, ResNet, Swin Transformer, DINO, and YOLO11) and layers, revealing a more severe privacy threat in Split DNNs than previously recognized.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39487", "url": null, "sourceid": 32895, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39489, "uid": "aba37915c93417bfa9d92dd23a38771e", "name": "What Is It Like to Be a Noise? An Entropy-based Gaussian Noise Regularization for Diffusion Models", "authors": [{"id": 152544, "fullname": "Pascal Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152544?format=json", "institution": "ETH Z\u00fcrich / DisneyResearch|Studios"}, {"id": 192188, "fullname": "Kai Lascheit", "url": "http://cvpr.thecvf.com/api/miniconf/users/192188?format=json", "institution": null}, {"id": 152546, "fullname": "Jingwei Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152546?format=json", "institution": "DisneyResearch|Studios"}, {"id": 86229, "fullname": "Markus Gross", "url": "http://cvpr.thecvf.com/api/miniconf/users/86229?format=json", "institution": "Disney Research, Disney"}, {"id": 152547, "fullname": "Vinicius C. Azevedo", "url": "http://cvpr.thecvf.com/api/miniconf/users/152547?format=json", "institution": "Disney Research, Disney Research"}], "abstract": "Optimizing noise latents in diffusion models is powerful for controllable generation, reward-guided sampling, and latent inversion, but the process is notoriously unstable. Without a principled regularizer, optimized latents drift away from the Gaussian prior, collapsing out of the typical set and producing severe artifacts. Existing constraints like norm-matching or simple KL divergence losses are often insufficient, as they fail to capture the full statistical properties of true Gaussian noise. We propose a principled, differentiable regularizer that correctly targets the high-mass typical set rather than the high-probability mode. Our energy function tractably approximates the KL divergence by matching low-order statistics. It combines a 1D marginal term to match the pixel-value histogram and a 2D spatial term to enforce decorrelation. By applying this in a multi-scale pyramid, our method penalizes correlations at all ranges, effectively projecting samples closer onto the true Gaussian typical set. We demonstrate its effectiveness for robust, artifact-free reward-guided generation and model-free latent inversion.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39489", "url": null, "sourceid": 40944, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39490, "uid": "d23fa7359db9f32935fa0a1cc8c59a07", "name": "Content-Aware Dynamic Patchification for Efficient Video Diffusion", "authors": [{"id": 182495, "fullname": "Sheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182495?format=json", "institution": "University of Pittsburgh"}, {"id": 88938, "fullname": "Connelly Barnes", "url": "http://cvpr.thecvf.com/api/miniconf/users/88938?format=json", "institution": "Adobe Systems"}, {"id": 129638, "fullname": "Mamshad Nayeem Rizve", "url": "http://cvpr.thecvf.com/api/miniconf/users/129638?format=json", "institution": "Amazon"}, {"id": 151254, "fullname": "Hongwu Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/151254?format=json", "institution": "University of Connecticut"}, {"id": 131389, "fullname": "Zhengang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/131389?format=json", "institution": "Northeastern University"}, {"id": 192189, "fullname": "Ohi Dibua", "url": "http://cvpr.thecvf.com/api/miniconf/users/192189?format=json", "institution": "Adobe Systems"}, {"id": 107548, "fullname": "Alireza Ganjdanesh", "url": "http://cvpr.thecvf.com/api/miniconf/users/107548?format=json", "institution": "University of Maryland, College Park"}, {"id": 192190, "fullname": "Xulong Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192190?format=json", "institution": "University of Pittsburgh"}, {"id": 92672, "fullname": "Yan Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/92672?format=json", "institution": "Adobe Systems"}, {"id": 135327, "fullname": "Yifan Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/135327?format=json", "institution": "Adobe Systems"}], "abstract": "Diffusion Transformers (DiTs) achieve strong video generation performance but suffer from prohibitive computation cost due to dense spatiotemporal tokenization. Most existing works rely on uniform patchification, tokenizing non-overlapping spatiotemporal with a fixed patch size regardless of the underlying content. This content-agnostic tokenization results in substantial redundant computation, especially in visually simple or static areas. To address this inefficiency while preserving the video generation quality, we propose DynaPatch, a fine-grained dynamic patchification framework that adaptively selects patch sizes for each spatiotemporal region based on content complexity. A lightweight router predicts patch sizes directly from the latents encoded by 3D Variational Autoencoder (VAE), and is jointly optimized with the diffusion model through diffusion loss, an attention-guided saliency alignment loss, and a token-budget regularizer. Learnable patchify/unpatchify layers integrate seamlessly with standard DiT backbones, allowing flexible tokenization without architectural changes. Experiments demonstrate that DynaPatch can effectively reduce redundant computations while preserving fine details, achieving 1.3\u20131.8$\\times$ acceleration with minimal quality degradation. On VBench, DynaPatch attains a Total Score of 83.42 at 30\\% token reduction, significantly outperforming prior patchification and token pruning approaches. These results indicate that content-aware patchification offers an effective direction for efficient and scalable video diffusion.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39490", "url": null, "sourceid": 42753, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39495, "uid": "db5aaf4d715a4bf8247fef10ee7698ca", "name": "Expand and Prune: Maximizing Trajectory Diversity for Effective GRPO in Generative Models", "authors": [{"id": 181510, "fullname": "Shiran Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/181510?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 192198, "fullname": "Chenyi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192198?format=json", "institution": "Peking University"}, {"id": 189954, "fullname": "Yuang Ai", "url": "http://cvpr.thecvf.com/api/miniconf/users/189954?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 128266, "fullname": "Qihang Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/128266?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 76200, "fullname": "Huaibo Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76200?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 76640, "fullname": "Ran He", "url": "http://cvpr.thecvf.com/api/miniconf/users/76640?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}], "abstract": "Group Relative Policy Optimization (GRPO) is a powerful technique for aligning generative models, but its effectiveness is bottlenecked by the conflict between large group sizes and prohibitive computational costs. In this work, we investigate the trade-off through empirical studies, yielding two key observations. First, we discover the reward clustering phenomenon in which many trajectories collapse toward the group-mean reward, offering limited optimization value. Second, we design a heuristic strategy named Optimal Variance Filtering (OVF), and verify that a high- variance subset of trajectories, selected by OVF can outperform the larger, unfiltered group. However, this static, post-sampling OVF approach still necessitates critical computational overhead, as it performs unnecessary sampling for trajectories that are ultimately discarded. To resolve this, we propose Pro-GRPO (Proactive GRPO), a novel dynamic framework that integrates latent feature-based trajectory pruning into the sampling process. Through the early termination of reward-clustered trajectories, Pro-GRPO reduces computational overhead. Leveraging its efficiency, Pro-GRPO employs an \"Expand-and-Prune\" strategy. This strategy first expands the size of initial sampling group to maximize trajectory diversity, then it applies multi-step OVF to the latents, avoiding prohibitive computational costs. Extensive experiments on both diffusion-based and flow-based models demonstrate the generality and effectiveness of our Pro-GRPO framework.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39495", "url": null, "sourceid": 40542, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39496, "uid": "9191b0a3b4c41e6732dbb644bd52d6fc", "name": "MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second", "authors": [{"id": 147620, "fullname": "Chenguo Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/147620?format=json", "institution": "Peking University"}, {"id": 185519, "fullname": "Yuchen Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185519?format=json", "institution": "Peking University"}, {"id": 181092, "fullname": "Panwang Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181092?format=json", "institution": "ByteDance"}, {"id": 153032, "fullname": "Yifan Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153032?format=json", "institution": "PICO, ByteDance"}, {"id": 129719, "fullname": "Tao Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129719?format=json", "institution": "National University of Singapore"}, {"id": 178438, "fullname": "Honglei Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/178438?format=json", "institution": "ByteDance Ltd"}, {"id": 93898, "fullname": "Katerina Fragkiadaki", "url": "http://cvpr.thecvf.com/api/miniconf/users/93898?format=json", "institution": "CMU"}, {"id": 89566, "fullname": "Yadong Mu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89566?format=json", "institution": "Peking University"}], "abstract": "We present MoVieS, a motion-aware view synthesis model that reconstruct 4D dynamic scenes from monocular videos in one second. It represents dynamic 3D scenes with pixel-aligned Gaussian primitives and explicitly supervises their time-varying motions. This allows, for the first time, the unified modeling of appearance, geometry and motion from monocular videos, and enables reconstruction, view synthesis and 3D point tracking within a single learning-based framework. By bridging view synthesis with geometry reconstruction, MoVieS enables large-scale training on diverse datasets with minimal dependence on task-specific supervision. As a result, it also naturally supports a wide range of zero-shot applications, such as scene flow estimation and moving object segmentation. Extensive experiments validate the effectiveness and efficiency of MoVieS across multiple tasks, achieving competitive performance while offering several orders of magnitude speedups.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39496", "url": null, "sourceid": 32421, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39498, "uid": "c8af2c4e697bad9f8923682b361402fc", "name": "CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning", "authors": [{"id": 156839, "fullname": "Jiange Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156839?format=json", "institution": "Nanjing University"}, {"id": 141762, "fullname": "tom tomlinson", "url": "http://cvpr.thecvf.com/api/miniconf/users/141762?format=json", "institution": "University of Science and Technology of China; Shanghai Artificial Intelligence Laboratory"}, {"id": 129379, "fullname": "Haoyi Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129379?format=json", "institution": "University of Science and Technology of China"}, {"id": 155975, "fullname": "Mingyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155975?format=json", "institution": "Zhejiang University"}, {"id": 192203, "fullname": "Kaijing Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/192203?format=json", "institution": "Fudan University; Shanghai Artificial Intelligence Laboratory"}, {"id": 156840, "fullname": "Yating Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156840?format=json", "institution": "Tongji University"}, {"id": 86405, "fullname": "Gangshan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86405?format=json", "institution": "Nanjing University"}, {"id": 70916, "fullname": "Tong He", "url": "http://cvpr.thecvf.com/api/miniconf/users/70916?format=json", "institution": "Shanghai AI Lab"}, {"id": 86063, "fullname": "Limin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86063?format=json", "institution": "Nanjing University"}], "abstract": "Unsupervised learning of latent motion from Internet videos is crucial for building generalist robots. Existing discrete methods generally mitigate the shortcut learning problem caused by extracting excessive static background information through vector quantization with a small codebook size. However, they suffer from information loss and struggle to capture more complex and fine-grained dynamics. Moreover, there is an inherent gap between the distribution of discrete latent motion and continuous robot action, which hinders the joint learning of a unified policy. We propose CoMo, which aims to learn more precise continuous latent motion from internet-scale videos. CoMo employs an early temporal difference (Td) mechanism to increase the difficulty of shortcut learning and explicitly enhance motion cues. Additionally, to ensure that continuous latent motion better captures meaningful foreground information, we further propose a temporal contrastive learning (Tcn) scheme. Specifically, positive pairs are constructed from motion representations with a small future frame temporal offset, while negative pairs are formed by directly reversing the temporal direction. The proposed Td and Tcn work synergistically and effectively ensure that the latent motion focuses better on the foreground and reinforces motion cues. Critically, CoMo exhibits strong zero-shot generalization, enabling it to generate effective pseudo action labels for unseen videos. The shared continuous distribution of robot action and video latent motion also significantly benefits the joint learning of unified policy. Extensive simulated and real-world experiments show that policies co-trained with CoMo pseudo action labels achieve superior performance with both diffusion and autoregressive architectures.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39498", "url": null, "sourceid": 40581, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39500, "uid": "a4d89012e1026d243518a526dacccacf", "name": "Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer", "authors": [{"id": 182161, "fullname": "Haoru Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/182161?format=json", "institution": "UC Berkeley"}, {"id": 182172, "fullname": "Tairan He", "url": "http://cvpr.thecvf.com/api/miniconf/users/182172?format=json", "institution": "NVIDIA, CMU"}, {"id": 165949, "fullname": "Zi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/165949?format=json", "institution": "NVIDIA"}, {"id": 181826, "fullname": "Qingwei Ben", "url": "http://cvpr.thecvf.com/api/miniconf/users/181826?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 192208, "fullname": "Wenli Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192208?format=json", "institution": null}, {"id": 96336, "fullname": "Zhengyi Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/96336?format=json", "institution": "Carnegie Mellon University"}, {"id": 192209, "fullname": "Xingye Da", "url": "http://cvpr.thecvf.com/api/miniconf/users/192209?format=json", "institution": null}, {"id": 192210, "fullname": "Fernando Casta\u00f1eda", "url": "http://cvpr.thecvf.com/api/miniconf/users/192210?format=json", "institution": "NVIDIA"}, {"id": 141900, "fullname": "Guanya Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/141900?format=json", "institution": "CMU, Carnegie Mellon University"}, {"id": 149610, "fullname": "Shankar Sastry", "url": "http://cvpr.thecvf.com/api/miniconf/users/149610?format=json", "institution": "University of California Berkeley"}, {"id": 169493, "fullname": "Linxi Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/169493?format=json", "institution": "NVIDIA"}, {"id": 75460, "fullname": "Yuke Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75460?format=json", "institution": "University of Texas - Austin"}], "abstract": "Recent progress in GPU-accelerated, photorealistic simulation has opened a scalable data-generation path for robot learning, where massive physics and visual randomization allow policies to generalize beyond curated environments. Building on these advances, we develop a teacher\u2013student\u2013bootstrap learning framework for vision-based humanoid loco-manipulation, using articulated-object interaction as a representative high-difficulty benchmark. Our approach introduces a staged-reset exploration strategy that stabilizes long-horizon privileged-policy training, and a GRPO-based fine-tuning procedure designed to mitigate partial observability and improve closed-loop consistency in sim-to-real RL. Trained entirely on synthetic simulation data, the resulting policy achieves robust zero-shot performance across diverse articulated objects\u2014including multiple door types\u2014and outperforms human teleoperators by up to 31.7% in task completion time under the same whole-body control stack. This represents the first humanoid sim-to-real policy capable of diverse articulated loco-manipulation from pure RGB perception.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39500", "url": null, "sourceid": 37616, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39506, "uid": "63fb561c81923bcdbb86140a1801305d", "name": "Color-Encoded Illumination for High-Speed Volumetric Scene Reconstruction", "authors": [{"id": 192224, "fullname": "David Novikov", "url": "http://cvpr.thecvf.com/api/miniconf/users/192224?format=json", "institution": "Weizmann Institute of Science"}, {"id": 192225, "fullname": "Eilon Vaknin Laufer", "url": "http://cvpr.thecvf.com/api/miniconf/users/192225?format=json", "institution": ""}, {"id": 77334, "fullname": "Narek Tumanyan", "url": "http://cvpr.thecvf.com/api/miniconf/users/77334?format=json", "institution": "Weizmann Institute of Science"}, {"id": 89397, "fullname": "Mark Sheinin", "url": "http://cvpr.thecvf.com/api/miniconf/users/89397?format=json", "institution": "Weizmann Institute of Science"}], "abstract": "The task of capturing and rendering 3D dynamic scenes from 2D images has become increasingly popular in recent years.However, most conventional cameras are bandwidth-limited to 30\u201360 FPS, restricting these methods to static or slowly evolving scenes.While overcoming bandwidth limitations is difficult in general scenes, recent years have seen a flurry of computational imaging methods that yield high-speed videos using conventional cameras for specific scenarios (e.g., motion capture and particle image velocimetry).However, most of these methods require modifications to camera optics or the addition of mechanically moving components, limiting them to a single-view high-speed capture. Consequently, these cannot be readily used to capture a 3D representation of rapid scene motion. In this paper, we propose a novel method to capture and reconstruct a volumetric representation of a high-speed scene using only unaugmented low-speed cameras. Instead of modifying the hardware or optics of each individual camera, we encode high-speed scene dynamics by illuminating the scene with a rapid, sequentialcolor sequence. This results in simultaneous multi-view capture of the scene, in which high-speed temporal information is encoded in the images' color. To construct a high-speed volumetric representation of the dynamic scene, we develop a novel dynamic Gaussian Splatting-based approach that decodes the temporal information from the images. We evaluate our approach on simulated scenes and real-world experiments using a multi-camera imaging setup, showing first-of-a-kind high-speed volumetric scene reconstructions.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39506", "url": null, "sourceid": 30773, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39510, "uid": "2ca08d883dbcb8270601ac798440314d", "name": "MMCP-GEN: A Modality-Extensible Diffusion Language Model for Conditional Protein Sequence Generation", "authors": [{"id": 180063, "fullname": "Zeyu An", "url": "http://cvpr.thecvf.com/api/miniconf/users/180063?format=json", "institution": "Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University"}, {"id": 192236, "fullname": "Wanyu Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192236?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 192237, "fullname": "Feng Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192237?format=json", "institution": "Merck KGaA"}, {"id": 156854, "fullname": "Shujun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156854?format=json", "institution": "Hong Kong Polytechnic University"}], "abstract": "Recent advances in diffusion-based language models (DLMs) have shown remarkable potential for de novo protein design. However, enabling controllable protein generation requires integrating diverse biological conditions, such as structure, functions, and chemical interactions, each represented in distinct modalities. Existing approaches often either support a single condition or treat multiple conditions through separate modality-specific encoders. This isolation limits cross-modal interaction, reduces generation quality, and complicates the incorporation of new conditions without retraining or redesigning the backbone. To address these limitations, we introduce $\\textbf{MMCP-GEN}$, a DLM for $\\textbf{M}$ulti-$\\textbf{M}$odal, Multi-$\\textbf{C}$ondition $\\textbf{P}$rotein sequence $\\textbf{GEN}$eration. $\\textbf{MMCP-GEN}$ establishes a new paradigm for controllable protein generation under complex multimodal constraints. Its core is a modality-composable and extensible conditioning mechanism that fuses heterogeneous biological conditions via learnable queries and modality-indicator heads, enabling disentangled, extensible, and cross-modal condition integration without retraining the backbone. A joint generation\u2013and\u2013scoring objective further aligns sequence recovery with structural fidelity. Empirically, $\\textbf{MMCP-GEN}$ achieves state-of-the-art performance across structure-, function-, and ligand-conditioned tasks, improving sequence recovery by up to 5\\% and outperforming attentive baselines in diverse functional annotation tasks. These results establish $\\textbf{MMCP-GEN}$ as a general and high-fidelity framework for controllable protein generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39510", "url": null, "sourceid": 44354, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39511, "uid": "86a9ac7667ec45a86614223897b4f765", "name": "Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models", "authors": [{"id": 181396, "fullname": "Ji Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/181396?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 192238, "fullname": "Wei Suo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192238?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 86484, "fullname": "Peng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86484?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 88282, "fullname": "Yanning Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88282?format=json", "institution": "Northwestern Polytechnical University"}], "abstract": "Multimodal Chain-of-Thought (MCoT) models have demonstrated impressive capability in complex visual reasoning tasks. Unfortunately, recent studies reveal that they suffer from severe hallucination problems due to diminished visual attention during the generation process.However, visual attention decay is a well-studied problem in Large Vision-Language Models (LVLMs). Considering the fundamental differences in reasoning processes between MCoT models and traditional LVLMs, we raise a basic question: Whether MCoT models have unique causes of hallucinations? To answer this question, we systematically investigate the hallucination patterns of MCoT models and find that fabricated texts are primarily generated in associative reasoning steps, which we term divergent thinking. Leveraging these insights, we introduce a simple yet effective strategy that can effectively localize divergent thinking steps and intervene in the decoding process to mitigate hallucinations. Extensive experiments show that our method outperforms existing methods by a large margin. More importantly, our proposed method can conveniently integrate with other hallucination mitigation methods and further boost their performance. The code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39511", "url": null, "sourceid": 33633, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39512, "uid": "0d1bed06d643471985149a14a1eb24d0", "name": "TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs", "authors": [{"id": 192239, "fullname": "Jun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192239?format=json", "institution": "nanjing university"}, {"id": 159447, "fullname": "Teng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/159447?format=json", "institution": "Tencent ARC Lab"}, {"id": 86827, "fullname": "Yuying Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/86827?format=json", "institution": "University of Hong Kong"}, {"id": 76515, "fullname": "Yixiao Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/76515?format=json", "institution": "Tencent"}, {"id": 153861, "fullname": "Xinhao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/153861?format=json", "institution": "Nanjing University"}, {"id": 86063, "fullname": "Limin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86063?format=json", "institution": "Nanjing University"}], "abstract": "This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding.While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored.In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability, along two primary dimensions: data quality and algorithmic design.We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset.Building on our data foundation, we conduct in-depth explorations of algorithmic design principles, yielding a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, a thinking-free Reinforcement Learning with Verifiable Rewards (RLVR) approach as the training paradigm, and carefully designed recipes for RLVR training. These efforts culminate in TimeLens-7B, an MLLM that achieves state-of-the-art performance among open-source models and even surpasses proprietary models such as GPT-4o and GPT-5. All codes, data, and models will be released to facilitate future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39512", "url": null, "sourceid": 37412, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39514, "uid": "023e1de20a1a68aab1eeb0562b018ed4", "name": "No Calibration, No Depth, No Problem: Cross-Sensor View Synthesis with 3D Consistency", "authors": [{"id": 187860, "fullname": "Cho-Ying Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187860?format=json", "institution": "Bosch"}, {"id": 150358, "fullname": "Zixun Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150358?format=json", "institution": "Bosch Research North America"}, {"id": 129002, "fullname": "Xinyu Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129002?format=json", "institution": "Robert Bosch Research NA"}, {"id": 84539, "fullname": "Liu Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/84539?format=json", "institution": "Bosch Research"}], "abstract": "We present the first study of cross-sensor view synthesis across different modalities. We examine a practical, fundamental, yet widely overlooked problem: getting aligned RGB-X data, where most RGB-X prior work assumes such pairs exist and focuses on modality fusion, but it empirically requires huge engineering effort in calibration. We propose a match-densify-consolidate method. First, we perform RGB-X image matching followed by guided point densification. Using the proposed confidence-aware densification and self-matching filtering, we attain better view synthesis and later consolidate them in 3D Gaussian Splatting (3DGS). Our method uses no 3D priors for X-sensor and only assumes nearly no-cost COLMAP for RGB. We aim to remove the cumbersome calibration for various RGB-X sensors and advance the popularity of cross-sensor learning by a scalable solution that breaks through the bottleneck in large-scale real-world RGB-X data collection. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39514", "url": null, "sourceid": 40418, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39516, "uid": "e03b5dacc18a5945d81f8e97be4b64a3", "name": "ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior", "authors": [{"id": 182604, "fullname": "Weikai Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182604?format=json", "institution": "South China University of Technology"}, {"id": 86056, "fullname": "Ziqian Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86056?format=json", "institution": "South China University of Technology"}, {"id": 192247, "fullname": "Kehua Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192247?format=json", "institution": "South China University of Technology"}, {"id": 157602, "fullname": "Haoran Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/157602?format=json", "institution": "HKUST"}, {"id": 86005, "fullname": "Huiping Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86005?format=json", "institution": "South China University of Technology"}, {"id": 192248, "fullname": "Ruidong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192248?format=json", "institution": "Zhejiang Normal University"}, {"id": 192249, "fullname": "Cen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192249?format=json", "institution": "South China University of Technology"}, {"id": 185920, "fullname": "Hao Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185920?format=json", "institution": "Beihang University"}], "abstract": "Multimodal Large Language Models (MLLMs) are increasingly vulnerable to multimodal Indirect Prompt Injection (IPI) attacks, which embed malicious instructions in images, videos, or audio to hijack model behavior. Existing defenses, designed primarily for text-only LLMs, are unsuitable for countering these multimodal threats, as they are easily bypassed, modality-dependent, or generalize poorly. Inspired by activation steering researches, we hypothesize that a robust, general defense independent of modality can be achieved by steering the model's behavior in the representation space. Through extensive experiments, we discover that the instruction-following behavior of MLLMs is encoded in a subspace. Steering along directions within this subspace can enforce adherence to user instructions, forming the basis of a defense. However, we also found that a naive defense direction could be coupled with a utility-degrading direction, and excessive intervention strength harms model performance. To address this, we propose ARGUS, which searches for an optimal defense direction within the safety subspace that decouples from the utility degradation direction, further combining adaptive strength steering to achieve a better safety-utility trade-off. ARGUS also introduces lightweight injection detection stage to to activate the defense on-demand, and a post-filtering stage to verify defense success. Experimental results show that ARGUS can achieve robust defense against multimodal IPI while maximally preserving the MLLM's utility.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39516", "url": null, "sourceid": 44265, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40371?format=json"], "related_events_ids": [40371]}, {"id": 40371, "uid": "e03b5dacc18a5945d81f8e97be4b64a3", "name": "ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior", "authors": [{"id": 182604, "fullname": "Weikai Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182604?format=json", "institution": "South China University of Technology"}, {"id": 86056, "fullname": "Ziqian Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86056?format=json", "institution": "South China University of Technology"}, {"id": 192247, "fullname": "Kehua Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192247?format=json", "institution": "South China University of Technology"}, {"id": 157602, "fullname": "Haoran Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/157602?format=json", "institution": "HKUST"}, {"id": 86005, "fullname": "Huiping Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86005?format=json", "institution": "South China University of Technology"}, {"id": 192248, "fullname": "Ruidong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192248?format=json", "institution": "Zhejiang Normal University"}, {"id": 192249, "fullname": "Cen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192249?format=json", "institution": "South China University of Technology"}, {"id": 185920, "fullname": "Hao Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185920?format=json", "institution": "Beihang University"}], "abstract": "Multimodal Large Language Models (MLLMs) are increasingly vulnerable to multimodal Indirect Prompt Injection (IPI) attacks, which embed malicious instructions in images, videos, or audio to hijack model behavior. Existing defenses, designed primarily for text-only LLMs, are unsuitable for countering these multimodal threats, as they are easily bypassed, modality-dependent, or generalize poorly. Inspired by activation steering researches, we hypothesize that a robust, general defense independent of modality can be achieved by steering the model's behavior in the representation space. Through extensive experiments, we discover that the instruction-following behavior of MLLMs is encoded in a subspace. Steering along directions within this subspace can enforce adherence to user instructions, forming the basis of a defense. However, we also found that a naive defense direction could be coupled with a utility-degrading direction, and excessive intervention strength harms model performance. To address this, we propose ARGUS, which searches for an optimal defense direction within the safety subspace that decouples from the utility degradation direction, further combining adaptive strength steering to achieve a better safety-utility trade-off. ARGUS also introduces lightweight injection detection stage to to activate the defense on-demand, and a post-filtering stage to verify defense success. Experimental results show that ARGUS can achieve robust defense against multimodal IPI while maximally preserving the MLLM's utility.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40371", "url": null, "sourceid": -44265, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39516?format=json"], "related_events_ids": [39516]}, {"id": 39517, "uid": "49324aa17bffaf1a43a8613a2de9db4d", "name": "Evidential Neural Radiance Fields", "authors": [{"id": 146024, "fullname": "Ruxiao Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/146024?format=json", "institution": "Yale University"}, {"id": 72573, "fullname": "Alex Wong", "url": "http://cvpr.thecvf.com/api/miniconf/users/72573?format=json", "institution": "Yale University"}], "abstract": "Understanding sources of uncertainty is fundamental to trustworthy three-dimensional scene modeling. While recent advances in neural radiance fields (NeRFs) achieve impressive accuracy in scene reconstruction and novel view synthesis, the lack of uncertainty estimation significantly limits their deployment in safety-critical settings. Existing uncertainty quantification methods for NeRFs fail to capture both aleatoric and epistemic uncertainty. Among those that do quantify one or the other, many of them either compromise rendering quality or incur significant computational overhead to obtain uncertainty estimates. To address these issues, we introduce Evidential Neural Radiance Fields, a probabilistic approach that seamlessly integrates with the NeRF rendering process and enables direct quantification of both aleatoric and epistemic uncertainty from a single forward pass. We compare multiple uncertainty quantification methods on three standardized benchmarks, where our approach demonstrates state-of-the-art scene reconstruction fidelity and uncertainty estimation quality.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39517", "url": null, "sourceid": 46445, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40372?format=json"], "related_events_ids": [40372]}, {"id": 40372, "uid": "49324aa17bffaf1a43a8613a2de9db4d", "name": "Evidential Neural Radiance Fields", "authors": [{"id": 146024, "fullname": "Ruxiao Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/146024?format=json", "institution": "Yale University"}, {"id": 72573, "fullname": "Alex Wong", "url": "http://cvpr.thecvf.com/api/miniconf/users/72573?format=json", "institution": "Yale University"}], "abstract": "Understanding sources of uncertainty is fundamental to trustworthy three-dimensional scene modeling. While recent advances in neural radiance fields (NeRFs) achieve impressive accuracy in scene reconstruction and novel view synthesis, the lack of uncertainty estimation significantly limits their deployment in safety-critical settings. Existing uncertainty quantification methods for NeRFs fail to capture both aleatoric and epistemic uncertainty. Among those that do quantify one or the other, many of them either compromise rendering quality or incur significant computational overhead to obtain uncertainty estimates. To address these issues, we introduce Evidential Neural Radiance Fields, a probabilistic approach that seamlessly integrates with the NeRF rendering process and enables direct quantification of both aleatoric and epistemic uncertainty from a single forward pass. We compare multiple uncertainty quantification methods on three standardized benchmarks, where our approach demonstrates state-of-the-art scene reconstruction fidelity and uncertainty estimation quality.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40372", "url": null, "sourceid": -46445, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39517?format=json"], "related_events_ids": [39517]}, {"id": 39719, "uid": "870458535281d3dfce64d423354091b1", "name": "PanoEnv: Exploring 3D Spatial Intelligence in Panoramic Environments with Reinforcement Learning", "authors": [{"id": 176777, "fullname": "Zekai Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/176777?format=json", "institution": null}, {"id": 87543, "fullname": "Xu Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87543?format=json", "institution": "HKUST"}], "abstract": "360\u00b0 panoramic images are increasingly used in VR, autonomous driving, and robotics for holistic scene understanding. However, current Vision\u2013Language Models (VLMs) struggle with 3D spatial reasoning on Equirectangular Projection (ERP) images due to geometric distortion and limited 3D supervision. We introduce \\textbf{\\textit{PanoEnv}}, a large-scale VQA benchmark built from synthetic 3D environments, containing 14.8K questions across five categories (e.g., relative position, volume comparison) grounded in accurate 3D annotations\u2014depth, segmentation, and bounding boxes. Benchmarking 14 state-of-the-art VLMs reveals limited 3D understanding, achieving only 49.34\\% overall and 8.36\\% on open-ended (OE) questions. To enhance 3D reasoning, we propose a reinforcement learning post-training framework based on Group Relative Policy Optimization (GRPO) with a ground-truth-guided reward combining five geometry-aware strategies (e.g., distance tolerance, spatial consistency). A two-stage curriculum further mitigates catastrophic forgetting: Stage\\~1 trains on structured tasks (T/F, MCQ), and Stage\\~2 fine-tunes on mixed OE data for generalization. Our 7B model sets a new SoTA performance, improving total accuracy to 52.93\\% (+3.59\\%) and OE accuracy to 14.83\\% while maintaining structured-task performance. It also achieves top semantic scores (Q-Score 6.24, P-Score 5.95), surpassing 32B models. These results demonstrate that PanoEnv-QA and our curriculum-based RL framework effectively instill 3D spatial intelligence in VLMs for omnidirectional perception.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39719", "url": null, "sourceid": 34557, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39721, "uid": "4edac96613950d6c00a872b37ee20e5d", "name": "HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation", "authors": [{"id": 90703, "fullname": "Xiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90703?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 85823, "fullname": "Zhifei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85823?format=json", "institution": "Adobe Research"}, {"id": 86848, "fullname": "He Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86848?format=json", "institution": "Adobe Systems"}, {"id": 85199, "fullname": "Zhe Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/85199?format=json", "institution": "Adobe Research"}, {"id": 88967, "fullname": "Yuqian Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/88967?format=json", "institution": "University of Illinois, Urbana-Champaign"}, {"id": 84712, "fullname": "Qing Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84712?format=json", "institution": "Adobe Systems"}, {"id": 73091, "fullname": "Shiwei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73091?format=json", "institution": "Alibaba Group"}, {"id": 129766, "fullname": "Yijun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129766?format=json", "institution": "Adobe Research"}, {"id": 192722, "fullname": "Shaoteng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192722?format=json", "institution": "Adobe Research"}, {"id": 141485, "fullname": "Haitian Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/141485?format=json", "institution": "Adobe Systems"}, {"id": 87401, "fullname": "Jason Kuen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87401?format=json", "institution": "Adobe Research"}, {"id": 130363, "fullname": "Yuehuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130363?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 91032, "fullname": "Changxin Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/91032?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 91049, "fullname": "Nong Sang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91049?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models), achieving strong multimodal performance. However, recent advanced methods such as BAGEL and LMFusion follow the Mixture-of-Transformers (MoT) paradigm, adopting a symmetric design that mirrors one expert to another for convenient initialization and fusion, which remains suboptimal due to inherent modality discrepancies. In this work, we propose HBridge, an asymmetric H-shaped architecture that enables heterogeneous experts to optimally leverage pretrained priors from their respective Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models), achieving strong multimodal performance. However, recent advanced methods such as BAGEL and LMFusion follow the Mixture-of-Transformers (MoT) paradigm, adopting a symmetric design that mirrors one expert to another for convenient initialization and fusion, which remains suboptimal due to inherent modality discrepancies. modality domains. Unlike prior dense fusion strategies that straightforwardly connect all layers between experts via shared attention, HBridge selectively bridges intermediate layers, reducing over 40% attention sharing, which improves efficiency and enhances generation quality. Shallow and deep layers, which capture modality-specific representations, are decoupled, while mid-layer bridging promotes semantic alignment. To further strengthen cross-modal coherence, we introduce semantic reconstruction tokens that explicitly guide the generative expert to reconstruct visual semantic tokens of the target image. Extensive experiments across multiple benchmarks demonstrate the effectiveness and superior performance of HBridge, establishing a new paradigm for unified multimodal generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39721", "url": null, "sourceid": 46668, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39720, "uid": "f21a3576923d49d665e584551894e33a", "name": "Region-Wise Correspondence Prediction between Manga Line Art Images", "authors": [{"id": 192721, "fullname": "Yingxuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192721?format=json", "institution": "CyberAgent, Inc."}, {"id": 181635, "fullname": "Jiafeng Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/181635?format=json", "institution": "CyberAgent"}, {"id": 181637, "fullname": "Qianru Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181637?format=json", "institution": "CyberAgent"}, {"id": 73889, "fullname": "Yusuke Matsui", "url": "http://cvpr.thecvf.com/api/miniconf/users/73889?format=json", "institution": "The University of Tokyo"}], "abstract": "Understanding region-wise correspondences between manga line art images is fundamental for high-level manga processing, supporting downstream tasks such as line art colorization and in-between frame generation. Unlike natural images that contain rich visual cues, manga line art consists only of sparse black-and-white strokes, making it challenging to determine which regions correspond across images. In this work, we introduce a new task: predicting region-wise correspondence between raw manga line art images without any annotations. To address this problem, we propose a Transformer-based framework trained on large-scale, automatically generated region correspondences. The model learns to suppress noisy matches and strengthen consistent structural relationships, resulting in robust patch-level feature alignment within and across images. During inference, our method segments each line art and establishes coherent region-level correspondences through edge-aware clustering and region matching. We construct manually annotated benchmarks for evaluation, and experiments across multiple datasets demonstrate both high patch-level accuracy and strong region-level correspondence performance, achieving 78.4-84.4% region-level accuracy. These results highlight the potential of our method for real-world manga and animation applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39720", "url": null, "sourceid": 36220, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39537, "uid": "7b7bd53ba5107cf5fb2c2a51dd887c95", "name": "Thermal is Always Wild: Characterizing and Addressing Challenges in Thermal-Only Novel View Synthesis", "authors": [{"id": 148622, "fullname": "M. Kerem Aydin", "url": "http://cvpr.thecvf.com/api/miniconf/users/148622?format=json", "institution": "Northwestern Univeristy"}, {"id": 131774, "fullname": "Vishwanath Saragadam", "url": "http://cvpr.thecvf.com/api/miniconf/users/131774?format=json", "institution": "Rice University"}, {"id": 88885, "fullname": "Emma Alexander", "url": "http://cvpr.thecvf.com/api/miniconf/users/88885?format=json", "institution": "Northwestern University"}], "abstract": "Thermal cameras provide reliable visibility in darkness and adverse conditions, but thermal imagery remains significantly harder to use for novel view synthesis (NVS) than visible-light images. This difficulty stems primarily from two characteristics of affordable thermal sensors. First, thermal images have extremely low dynamic range, which weakens appearance cues and limits the gradients available for optimization. Second, thermal data exhibit rapid frame-to-frame photometric fluctuations together with slow radiometric drift, both of which destabilize correspondence estimation and create high-frequency floater artifacts during view synthesis, particularly when no RGB guidance is available. Guided by these observations, we introduce a lightweight preprocessing and splatting pipeline that expands usable dynamic range and stabilizes per-frame photometry. Our approach achieves state-of-the-art performance across thermal-only NVS benchmarks, without requiring any dataset-specific tuning.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39537", "url": null, "sourceid": 45781, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40373, "uid": "860b989a383593396648518c761c64a5", "name": "CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation", "authors": [{"id": 192254, "fullname": "Pablo Messina", "url": "http://cvpr.thecvf.com/api/miniconf/users/192254?format=json", "institution": "Pontificia Universidad Catolica de Chile"}, {"id": 76856, "fullname": "Andr\u00e9s Villa", "url": "http://cvpr.thecvf.com/api/miniconf/users/76856?format=json", "institution": "King Abdullah University of Science and Technology"}, {"id": 87404, "fullname": "Juan Le\u00f3n Alc\u00e1zar", "url": "http://cvpr.thecvf.com/api/miniconf/users/87404?format=json", "institution": "King Abdullah University of Science and Technology"}, {"id": 151119, "fullname": "Karen Sanchez", "url": "http://cvpr.thecvf.com/api/miniconf/users/151119?format=json", "institution": "King Abdullah University of Science and Technology, Saudi Arabia"}, {"id": 76727, "fullname": "Carlos Hinojosa", "url": "http://cvpr.thecvf.com/api/miniconf/users/76727?format=json", "institution": "KAUST"}, {"id": 192255, "fullname": "Denis Parra", "url": "http://cvpr.thecvf.com/api/miniconf/users/192255?format=json", "institution": "Pontificia Universidad Catolica de Chile"}, {"id": 87388, "fullname": "Alvaro Soto", "url": "http://cvpr.thecvf.com/api/miniconf/users/87388?format=json", "institution": "Universidad Cat\u00f3lica de Chile"}, {"id": 75441, "fullname": "Bernard Ghanem", "url": "http://cvpr.thecvf.com/api/miniconf/users/75441?format=json", "institution": "KAUST"}], "abstract": "Medical vision\u2013language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present \"CURE\", an error-aware curriculum learning framework that improves grounding and report quality without any additional data. CURE tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy-grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance emphasizing on harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.37 IoU, boosts report quality by +0.188 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data-efficient framework that enhances both grounding accuracy and report reliability.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40373", "url": null, "sourceid": -45682, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39519?format=json"], "related_events_ids": [39519]}, {"id": 39520, "uid": "05af5f0beb11c1a6fdf5983c4a243a0e", "name": "RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos", "authors": [{"id": 154435, "fullname": "Lixin Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/154435?format=json", "institution": "Department of Computer Science, ETHZ - ETH Zurich"}, {"id": 151462, "fullname": "Chengwei Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/151462?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 192256, "fullname": "Georgios Paschalidis", "url": "http://cvpr.thecvf.com/api/miniconf/users/192256?format=json", "institution": "University of Amsterdam"}, {"id": 183382, "fullname": "Chen Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/183382?format=json", "institution": "Meta Reality Labs"}, {"id": 90576, "fullname": "Manuel Kaufmann", "url": "http://cvpr.thecvf.com/api/miniconf/users/90576?format=json", "institution": "ETH Zurich"}, {"id": 90571, "fullname": "Juan Jose Zarate", "url": "http://cvpr.thecvf.com/api/miniconf/users/90571?format=json", "institution": "Department of Computer Science, ETHZ - ETH Zurich"}, {"id": 85632, "fullname": "Dimitrios Tzionas", "url": "http://cvpr.thecvf.com/api/miniconf/users/85632?format=json", "institution": "University of Amsterdam"}], "abstract": "Reconstructing people, objects, and their interactions in 3D is a long-standing and fundamental goal for intelligent systems. Often the input is RGB video from a moving camera, making the task ill-posed; depth is ambiguous, humans and objects occlude each other, and camera and object motion entangle to create apparent motion. Most prior work addresses humans or objects in isolation, ignoring their interplay, or assumes known 3D shapes or cameras, which is impractical for real-world applications. We develop RHINO (Reconstructing Human Interactions with Novel Objects), a novel three-step framework that recovers in 3D a human, novel (unseen) manipulated object, and static scene in a common world frame from a monocular RGB video. First, we leverage 3D-aware foundation models to obtain cues that stabilize Structure-from-Motion (SfM) even for low-texture regions; this yields a coarse shape and apparent motion of a manipulated object from foreground pixels, a coarse scene shape and camera motion from background pixels. Second, we estimate a human in the camera frame via an off-the-shelf method, and subtract the camera motion from apparent motion to extract the object motion; this registers the human, object, and coarse scene shapes into a common world frame. Third, we refine shapes using a compositional neural field with per-component signed-distance fields. The latter further enables differentiable contact priors that attract surfaces while penalizing interpenetration, improving the physical plausibility of the final reconstruction. For evaluation, we capture a new dataset of handheld monocular videos synchronized with a volumetric 4D capture stage, providing ground-truth shape and camera motion. RHINO outperforms state-of-the-art baselines on novel-view synthesis and 4D reconstruction. Ablations show that each stage contributes substantially. We will release our code and data to foster future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39520", "url": null, "sourceid": 31573, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39521, "uid": "788e8dcccb8757a5a7f29481ec1d6fc0", "name": "Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models", "authors": [{"id": 168169, "fullname": "Shengli Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/168169?format=json", "institution": "Southern University of Science and Technology"}, {"id": 130623, "fullname": "Minghang Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/130623?format=json", "institution": "Peking University"}, {"id": 86658, "fullname": "Feng Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86658?format=json", "institution": "Southern University of Science and Technology"}, {"id": 89783, "fullname": "Yang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89783?format=json", "institution": "Peking University"}], "abstract": "Spatial reasoning is the process of locating target objects based on spatial relations in 3D scenes, which plays a crucial role in developing intelligent embodied agents. Due to the limited availability of 3D scene-language paired data, it is challenging to train models with strong reasoning ability from scratch. Previous approaches have attempted to inject 3D scene representations into the input space of Large Language Models (LLMs) and leverage the pretrained comprehension and reasoning abilities for spatial reasoning. However, models encoding absolute positions struggle to extract spatial relations from prematurely fused features, while methods explicitly encoding all spatial relations (which is quadratic in the number of objects) as input tokens suffer from poor scalability. To address these limitations, we propose QuatRoPE, a novel positional embedding method with an input length that is linear to the number of objects, and explicitly calculates pairwise spatial relations through the dot product in attention layers. QuatRoPE's holistic vector encoding of 3D coordinates guarantees a high degree of spatial consistency, maintaining fidelity to the scene's geometric integrity. Additionally, we introduce the Isolated Gated RoPE Extension (IGRE), which effectively limits QuatRoPE's influence to object-related tokens, thereby minimizing interference with the LLM's existing positional embeddings and maintaining the LLM's original capabilities. Extensive experiments demonstrate the effectiveness of our approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39521", "url": null, "sourceid": 42176, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39523, "uid": "e8692656a12b4f6719d5aad145c55987", "name": "Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration", "authors": [{"id": 183738, "fullname": "Danil Tokhchukov", "url": "http://cvpr.thecvf.com/api/miniconf/users/183738?format=json", "institution": "MSU"}, {"id": 192258, "fullname": "Aysel Mirzoeva", "url": "http://cvpr.thecvf.com/api/miniconf/users/192258?format=json", "institution": "Lomonosov Moscow State University"}, {"id": 192259, "fullname": "Andrey Kuznetsov", "url": "http://cvpr.thecvf.com/api/miniconf/users/192259?format=json", "institution": "Innopolis University; FusionBrain Lab"}, {"id": 183439, "fullname": "Konstantin Sobolev", "url": "http://cvpr.thecvf.com/api/miniconf/users/183439?format=json", "institution": "FusionBrain Lab, AXXX"}], "abstract": "In this paper, we uncover the hidden potential of Diffusion Transformers (DiTs) to significantly enhance generative tasks. Through an in-depth analysis of the denoising process, we demonstrate that introducing a single learned scaling parameter can significantly improve the performance of DiT blocks. Building on this insight, we propose *Calibri*, a parameter-efficient approach that optimally calibrates DiT components to elevate generative quality. *Calibri* frames DiT calibration as a black-box reward optimization problem, which is efficiently solved using an evolutionary algorithm and modifies just $\\sim 10^2$ parameters. Additionally, *Calibri* introduces an innovative inference-time ensemble scaling strategy to further boost generative performance. Experimental results reveal that despite its lightweight design, *Calibri* consistently improves performance across various text-to-image models. Notably, *Calibri* also reduces the inference steps required for image generation, all while maintaining high-quality outputs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39523", "url": null, "sourceid": 45734, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39526, "uid": "10e64da97df296012bccffbd03474c1e", "name": "TSTM: Temporal Segmentation for Task-related Mask in Visual Reinforcement Learning Generalization", "authors": [{"id": 181540, "fullname": "Weicheng Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/181540?format=json", "institution": "Shandong University"}, {"id": 153289, "fullname": "Wenjia Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/153289?format=json", "institution": "School of Software, Shandong University"}, {"id": 192271, "fullname": "Zhengzhe Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192271?format=json", "institution": "Shandong University"}, {"id": 84891, "fullname": "Yilong Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/84891?format=json", "institution": "Shandong University"}, {"id": 155291, "fullname": "Xiankai Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155291?format=json", "institution": "Shandong University"}], "abstract": "Achieving strong policy generalization to unseen environments remains a core challenge in visual reinforcement learning, and segmenting task-relevant regions to mitigate the influence of irrelevant visual cues has emerged as a promising direction. However, existing methods rely solely on the current observation, lack temporal information, and fail to exploit preceding observations, leaving learned policies susceptible to task-irrelevant background variations and ultimately degrading policy performance. In this paper, we propose temporal segmentation for task-relevant mask in visual reinforcement learning, named TSTM, which extracts task-relevant regions from sequential observations by exploiting temporal information, thereby producing more reliable masks and improving policy generalization. TSTM introduces a temporal segmentation network with an encoder-temporal-decoder architecture, where a convolutional LSTM module captures temporal dependencies across observations. To reduce inference overhead, we further develop a lightweight student network as an efficient substitute for the teacher network. The resulting task-relevant masks are encoded by a CNN-based encoder, and invariant representation learning is employed to improve robustness by enforcing consistency between representations of the original and augmented observation sequences. With these task-relevant representations, we train an actor-critic agent to learn a policy with strong generalization capability. Experimental results demonstrate that TSTM achieves superior generalization performance over existing state-of-the-art methods on most visual RL tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39526", "url": null, "sourceid": 30654, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39529, "uid": "14739f8f2f934dd79f02c43efddf2ff8", "name": "Self-Corrected Image Generation with Explainable Latent Rewards", "authors": [{"id": 180263, "fullname": "Yinyi Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/180263?format=json", "institution": "Carnegie Mellon University"}, {"id": 183874, "fullname": "Hrishikesh Gokhale", "url": "http://cvpr.thecvf.com/api/miniconf/users/183874?format=json", "institution": "Carnegie Mellon University"}, {"id": 192273, "fullname": "Marios Savvides", "url": "http://cvpr.thecvf.com/api/miniconf/users/192273?format=json", "institution": "Carnegie Mellon University"}, {"id": 152510, "fullname": "Jindong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152510?format=json", "institution": "William &amp; Mary"}, {"id": 73862, "fullname": "Shengfeng He", "url": "http://cvpr.thecvf.com/api/miniconf/users/73862?format=json", "institution": "Singapore Management University"}], "abstract": "Despite significant progress in text-to-image generation, aligning outputs with complex prompts remains challenging, particularly for fine-grained semantics and spatial relations. This difficulty stems from the feed-forward nature of generation, which requires anticipating alignment without fully understanding the output. In contrast, evaluating generated images is more tractable. Motivated by this asymmetry, we propose xLARD, a self-correcting framework that uses multimodal large language models to guide generation through Explainable Latent Rewards. xLARD introduces a lightweight corrector that refines latent representations based on structured feedback from model-generated references. A key component is a differentiable mapping from latent edits to interpretable reward signals, enabling continuous latent-level guidance from non-differentiable image-level evaluations. This mechanism allows the model to understand, assess, and correct itself during generation. Experiments across diverse generation and editing tasks show that xLARD improves semantic alignment and visual fidelity while maintaining generative priors, offering a data-efficient and generalizable solution to bridging the gap between comprehension and synthesis.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39529", "url": null, "sourceid": 39509, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39530, "uid": "1ac82f4c978d3fd0c45fbaf5af117278", "name": "MeanFlow Transformers with Representation Autoencoders", "authors": [{"id": 192274, "fullname": "Zheyuan Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192274?format=json", "institution": "Tencent"}, {"id": 181016, "fullname": "Chieh-Hsin Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/181016?format=json", "institution": "Sony AI"}, {"id": 185793, "fullname": "Ge Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185793?format=json", "institution": "Nankai University"}, {"id": 153173, "fullname": "Yuki Mitsufuji", "url": "http://cvpr.thecvf.com/api/miniconf/users/153173?format=json", "institution": "Sony AI"}, {"id": 85600, "fullname": "Stefano Ermon", "url": "http://cvpr.thecvf.com/api/miniconf/users/85600?format=json", "institution": "Stanford University"}], "abstract": "MeanFlow (MF) is a diffusion-motivated generative model that enables efficient few-step generation by learning long jumps directly from noise to data. In practice, it is often used as a latent MF by leveraging the pre-trained Stable Diffusion variational autoencoder (SD-VAE) for high-dimensional data modeling. However, MF training remains computationally demanding and is often unstable. During inference, the SD-VAE decoder dominates the generation cost, and MF depends on complex guidance hyperparameters for class-conditional generation. In this work, we develop an efficient training and sampling scheme for MF in the latent space of a Representation Autoencoder (RAE), where a pre-trained vision encoder (e.g., DINO) provides semantically rich latents paired with a lightweight decoder. We observe that naive MF training in the RAE latent space suffers from severe gradient explosion. To stabilize and accelerate training, we adopt Consistency Mid-Training for trajectory-aware initialization and use a two-stage scheme: distillation from a pre-trained flow matching teacher to speed convergence and reduce variance, followed by an optional bootstrapping stage with a one-point velocity estimator to further reduce deviation from the oracle mean flow. This design removes the need for guidance, simplifies training configurations, and reduces computation in both training and sampling.  Empirically, our method achieves a 1-step FID of 2.03, outperforming vanilla MF\u2019s 3.43, while reducing sampling GFLOPS by 38% and total training cost by 83% on ImageNet 256. We further scale our approach to ImageNet 512, achieving a competitive one-step FID of 3.23 with the lowest GFLOPS among all baselines.Code and proofs are available in the supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39530", "url": null, "sourceid": 40358, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39532, "uid": "4828decbcf77a6591a9264b2e10ee945", "name": "S2D: Selective Spectral Decay for Quantization Friendly Conditioning of Neural Activations", "authors": [{"id": 192278, "fullname": "Arnav Chavan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192278?format=json", "institution": "Amazon"}, {"id": 132292, "fullname": "Nahush Lele", "url": "http://cvpr.thecvf.com/api/miniconf/users/132292?format=json", "institution": "Indian Institute of Technology Dhanbad"}, {"id": 192279, "fullname": "Udbhav Bamba", "url": "http://cvpr.thecvf.com/api/miniconf/users/192279?format=json", "institution": "Amazon"}, {"id": 192280, "fullname": "Sankalp Dayal", "url": "http://cvpr.thecvf.com/api/miniconf/users/192280?format=json", "institution": "Amazon"}, {"id": 84589, "fullname": "Aditi Raghunathan", "url": "http://cvpr.thecvf.com/api/miniconf/users/84589?format=json", "institution": "Carnegie Mellon University"}, {"id": 192281, "fullname": "Deepak Gupta", "url": "http://cvpr.thecvf.com/api/miniconf/users/192281?format=json", "institution": "Amazon"}], "abstract": "Activation outliers in large-scale transformer models pose a fundamental challenge to model quantization, creating excessively large ranges that cause severe accuracy drops during quantization. We empirically observe that outlier severity intensifies with pre-training scale (e.g., progressing from CLIP to the more extensively trained SigLIP and SigLIP2). Through theoretical analysis as well as empirical correlation studies, we establish the direct link between these activation outliers and dominant singular values of the weights. Building on this insight, we propose Selective Spectral Decay ($S^2D$), a geometrically-principled conditioning method that surgically regularizes only the weight components corresponding to the largest singular values during fine-tuning. Through extensive experiments, we demonstrate that $S^2D$ significantly reduces activation outliers and produces well-conditioned representations that are inherently quantization-friendly. Models trained with $S^2D$ achieve up to 7\\% improved PTQ accuracy on ImageNet under W4A4 quantization and 4\\% gains when combined with QAT. These improvements also generalize across downstream tasks and vision-language models, enabling the scaling of increasingly large and rigorously trained models without sacrificing deployment efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39532", "url": null, "sourceid": 36978, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39533, "uid": "9f85e0d133be9efa90e245d4508f5824", "name": "Virtual Immunohistochemistry Staining with Dual-Aligned Multi-Task Feature Guidance", "authors": [{"id": 181183, "fullname": "Shigeng Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/181183?format=json", "institution": "University of Jyv\u00e4skyl\u00e4"}, {"id": 157249, "fullname": "Hongming Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157249?format=json", "institution": "Dalian University of Technology"}, {"id": 192282, "fullname": "Guiyang Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192282?format=json", "institution": "China Medical University Shenyang"}, {"id": 192283, "fullname": "Tuomo Rossi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192283?format=json", "institution": "University of Jyv\u00e4skyl\u00e4"}, {"id": 192284, "fullname": "Tommi K\u00e4rkk\u00e4inen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192284?format=json", "institution": "University of Jyv\u00e4skyl\u00e4"}, {"id": 157248, "fullname": "Fengyu Cong", "url": "http://cvpr.thecvf.com/api/miniconf/users/157248?format=json", "institution": "Dalian University of Technology"}], "abstract": "In hematoxylin-eosin (H\\&E) to virtual immunohistochemistry (IHC) staining, paired images enable supervised learning but suffer from inherent spatial dislocation, limiting pixel-level constraints. Thus, auxiliary tasks have been increasingly employed with paired data to provide complementary supervision. However, existing methods largely overlook the rich semantic information embedded in auxiliary task models. This paper proposes a novel framework for virtual IHC staining guided by dual-aligned multi-task features, which fully explores semantic cues from auxiliary tasks. To realize effective guidance, we address two obstacles: (1) the spatial mismatch between paired H\\&E and IHC feature representations; (2) the task gap between auxiliary task features and virtual staining features. To resolve the spatial mismatch, we generate an alignment matrix that aligns H\\&E and IHC features. Specifically, we first introduce structure-enhanced learning to restore semantic consistency in regions affected by inaccurate staining in virtual IHC images. Then, we separately cluster features from virtual IHC and real IHC images, and establish semantic correspondences using an active-passive matching mechanism. This ensures that only semantically aligned regions are matched, reducing the impact of staining variability on the alignment matrix. To bridge the task gap, we introduce a task-gap alignment module trained under the principle that auxiliary features are considered aligned if they improve the performance of the virtual IHC staining model. Extensive experiments on two public datasets with four biomarkers demonstrate the effectiveness of our framework. Our code will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39533", "url": null, "sourceid": 44061, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39538, "uid": "7f91d39bc9175aa9a5ab21f76159ff3d", "name": "Rethinking the Semantic-based Autoencoder", "authors": [{"id": 190075, "fullname": "Xiaojie Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190075?format=json", "institution": "Tiktok"}, {"id": 192295, "fullname": "Yang Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192295?format=json", "institution": "ByteDance Inc."}, {"id": 190074, "fullname": "Ming Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190074?format=json", "institution": "University of Central Florida"}, {"id": 192296, "fullname": "Yancheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192296?format=json", "institution": "University of Central Florida"}, {"id": 129185, "fullname": "Zonglin Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129185?format=json", "institution": "New York University"}, {"id": 186320, "fullname": "Yunpeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186320?format=json", "institution": "ByteDance Inc."}, {"id": 87116, "fullname": "Rui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87116?format=json", "institution": "TikTok"}, {"id": 90195, "fullname": "Daquan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/90195?format=json", "institution": "National University of Singapore"}], "abstract": "Latent generative modeling has emerged as the dominant paradigm for Diffusion Transformers (DiT), where a pretrained autoencoder compresses image pixels into a latent space to facilitate the diffusion process. Recently, the use of semantic encoders within autoencoders (AEs) has gained attention, yet their influence on image reconstruction and diffusion model training remains insufficiently explored. In this study, we perform an in-depth examination of how semantic encoders shape latent representation learning for the autoencoders. Our findings reveal a fundamental trade-off: while semantic encoders generate latent spaces enriched with visual semantics, their high level of abstraction makes it challenging to capture fine-grained geometric relationships, thereby requiring larger models and longer training for convergence. To address this issue, we build upon recent advances in representation learning that enable the joint modeling of both semantic abstraction and geometric detail. This leads to a Semantic Auto-Encoder (S-AE) that achieves state-of-the-art performance, combining superior reconstruction quality and discriminative capability. Specifically, with S-AE, we are able to provide a unified latent space that achieves 0.06 FID for image reconstruction and  81.9\\% classification accuracy on ImageNet, set a state-of-the-art benchmark. Codes and model weights will be made publicably available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39538", "url": null, "sourceid": 32606, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39548, "uid": "dfeb99cdfcf6f7fc01785f979eff8ea3", "name": "MD2E: Modeling Depth-to-Edge Cues for Monocular Metric Depth Estimation", "authors": [{"id": 181872, "fullname": "Chao Ning", "url": "http://cvpr.thecvf.com/api/miniconf/users/181872?format=json", "institution": "The University of Tokyo, RIKEN"}, {"id": 192325, "fullname": "Minghe Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192325?format=json", "institution": "University College London, University of London"}, {"id": 152106, "fullname": "Naoto Yokoya", "url": "http://cvpr.thecvf.com/api/miniconf/users/152106?format=json", "institution": "The University of Tokyo"}], "abstract": "We study monocular metric depth estimation (MMDE) without camera intrinsics at training or inference. When focal length and scene depth vary together, depth changes are difficult to perceive from image, yet the edge-frequency statistics exhibit systematic, scale-correlated shifts. Building on this observation, we introduce a spectral quantile estimator (SQE) that analyzes the Fourier spectrum of a predicted edge map and outputs a single score used as a proxy for metric scale. We propose MD2E, a method that models depth-to-edge cues by deriving edge targets from depth annotations, calibrating metric scale using the spectral score, and using edge predictions to regularize depth boundaries while producing metric depth. Across diverse cameras and datasets, MD2E achieves state-of-the-art monocular metric depth in both zero-shot and fine-tuning settings without camera metadata.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39548", "url": null, "sourceid": 45065, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39549, "uid": "9be4d22ce19d25b5e31f97d0a69d7d0a", "name": "From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding", "authors": [{"id": 182158, "fullname": "Yuyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182158?format=json", "institution": "University of Oxford"}, {"id": 192326, "fullname": "Yiping Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/192326?format=json", "institution": "CSIRO; University of Adelaide"}, {"id": 183425, "fullname": "Anjie Le", "url": "http://cvpr.thecvf.com/api/miniconf/users/183425?format=json", "institution": "University of Oxford"}, {"id": 132622, "fullname": "Jiayuan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/132622?format=json", "institution": "University of Oxford"}, {"id": 185658, "fullname": "Jiazhen Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185658?format=json", "institution": "Technical University of Munich"}, {"id": 157200, "fullname": "Can Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/157200?format=json", "institution": "University of Oxford"}, {"id": 186608, "fullname": "Jiajun Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186608?format=json", "institution": "University of Adelaide"}, {"id": 129507, "fullname": "Fengbei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129507?format=json", "institution": "Cornell University"}, {"id": 107213, "fullname": "Junde Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107213?format=json", "institution": "University of Oxford"}], "abstract": "Finetuning Large Vision-Language Models with reinforcement learning has emerged as a promising approach to enhance their capability in object-level grounding. However, existing methods, mainly based on GRPO, assign rewards at the response level. Such sparse reward leads to minimal learning signals when all candidate responses are failed in challenging scenarios.In this work, we propose a group-revision optimisation paradigm that enhances learning on hard cases. It begins with a sampled initial response and generates a set of revised candidates to explore improved grounding outcomes. Inspired by reward shaping, we introduce a consolidation process that quantifies each candidate\u2019s improvement over the initial attempt and converts it into informative shaping signals.These signals are used to both refine the reward and modulate the advantage, amplifying the influence of high-quality revisions.Our method achieves consistent gains across referring and reasoning segmentation, REC, and counting benchmarks compared with prior GRPO-based models. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39549", "url": null, "sourceid": 38760, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39551, "uid": "70d5c573c693c4053f908d9d9314ce87", "name": "Weakly Supervised Video Anomaly Detection with Anomaly-Connected Components and Intention Reasoning", "authors": [{"id": 86380, "fullname": "Yu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86380?format=json", "institution": "Tongji University"}, {"id": 167071, "fullname": "Hongli Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/167071?format=json", "institution": "Tongji University, China"}], "abstract": "Weakly supervised video anomaly detection (WS-VAD) involves identifying the temporal intervals that contain anomalous events in untrimmed videos, where only video-level annotations are provided as supervisory signals. However, a key limitation persists in WS-VAD, as dense frame-level annotations are absent, which often leaves existing methods struggling to learn anomaly semantics effectively. To address this issue, we propose a novel framework named LAS-VAD, short for \\textbf{L}earning \\textbf{A}nomaly \\textbf{S}emantics for WS-\\textbf{VAD}, which integrates anomaly-connected component mechanism and intention awareness mechanism. The former is designed to assign video frames into distinct semantic groups within a video, and frame segments within the same group are deemed to share identical semantic information. The latter leverages an intention-aware strategy to distinguish between similar normal and abnormal behaviors (e.g., taking items and stealing). To further model the semantic information of anomalies, as anomaly occurrence is accompanied by distinct characteristic attributes (i.e., explosions are characterized by flames and thick smoke), we additionally incorporate anomaly attribute information to guide accurate detection. Extensive experiments on two benchmark datasets, XD-Violence and UCF-Crime, demonstrate that our LAS-VAD outperforms current state-of-the-art methods with remarkable gains.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39551", "url": null, "sourceid": 33065, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39553, "uid": "eecc0d37eaa4864bb79836e354d49f66", "name": "When CLIP Sees More, It Fights Back Harder: Multi-View Guided Adaptive Counterattacks for Test-Time Adversarial Robustness", "authors": [{"id": 177138, "fullname": "Sunoh Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/177138?format=json", "institution": "Dankook University"}, {"id": 192332, "fullname": "Daeho Um", "url": "http://cvpr.thecvf.com/api/miniconf/users/192332?format=json", "institution": "University of Seoul"}], "abstract": "Vision-language models such as CLIP have achieved remarkable zero-shot recognition capabilities, yet their robustness against adversarial perturbations remains limited. Test-time counterattack (TTC) was recently proposed to improve CLIP's robustness by perturbing an input image to steer it away from a corrupted state during inference. However, TTC remains fragile under strong attacks because its counterattack relies on a directly corrupted original view and employs a noise-driven hard-gating scheme that cannot adapt to varying corruption severity. To address these limitations, we introduce Multi-view guided Adaptive Counterattack (MAC), which performs counterattacks for multi-view with corruption-aware soft weighting. Specifically, MAC first constructs augmented views of an input image to obtain diverse embeddings. It then performs counterattacks to refine corrupted embeddings of views. Next, MAC adaptively scales the counterattack intensity for each view based on its estimated corruption degree. Finally, the adaptively counterattacked views are aggregated to yield a robust final prediction. Extensive experiments across 20 datasets and diverse attack scenarios demonstrate that MAC substantially improves robustness while preserving high inference speed and memory efficiency with its tuning-free design. For reproducibility, we provide our code in the Supplementary Material and will publicly release it on GitHub.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39553", "url": null, "sourceid": 39791, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39555, "uid": "986e781559edbb32e805d4b135780812", "name": "Unlocking Token Rewards via Training-Free Reward Attribution", "authors": [{"id": 128180, "fullname": "WU Sitong", "url": "http://cvpr.thecvf.com/api/miniconf/users/128180?format=json", "institution": "Department of Computer Science and Engineering, The Chinese University of Hong Kong"}, {"id": 128162, "fullname": "Haoru Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/128162?format=json", "institution": "HKU"}, {"id": 87856, "fullname": "Bin Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/87856?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 187643, "fullname": "Xichen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187643?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 77019, "fullname": "Jingyao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/77019?format=json", "institution": "Chinese University of Hong Kong"}, {"id": 181212, "fullname": "Shaofeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181212?format=json", "institution": "University of Science and Technology of China"}, {"id": 88776, "fullname": "Xiaojuan Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/88776?format=json", "institution": "University of Oxford"}, {"id": 129111, "fullname": "Bei Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129111?format=json", "institution": "Department of Computer Science and Engineering, The Chinese University of Hong Kong"}, {"id": 154575, "fullname": "Jiaya Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/154575?format=json", "institution": "Department of Computer Science and Engineering, Hong Kong University of Science and Technology"}], "abstract": "In this paper, we propose an extremely efficient, training-free method to extract token-level reward signals directly from an existing deep reward model. Our core idea is to attribute the overall process reward to individual tokens by estimating each token's influence. This influence is defined as the change in the final macroscopic reward (e.g., the process reward) when a token is replaced with a semantically null token. Naively calculating this influence is computationally infeasible, requiring $N$ forward passes through the PRM for an $N$-token sequence. We overcome this bottleneck by proposing a highly efficient gradient-based estimator. Specifically, we use a first-order Taylor approximation, which simplifies the influence calculation to the inner product of the difference between the token embedding and the null token embedding, and the gradient of the reward with respect to the token embedding. This requires only a single forward and backward pass. The resulting token-level rewards enable standard RL algorithms to perform precise credit assignment without requiring additional reward model training. Experiments on challenging reasoning benchmarks demonstrate that our method substantially improves policy optimization efficiency and enhances the generalization of LLM reasoning capabilities. Our P2T outperforms the outcome reward by +4.9\\% on MathVista for Qwen2.5-VL-7B-Instruct, and +11.5\\% on AIME24 for Qwen2.5-Math-7B, while with a around 4$\\times$ faster convergence.Our results underscore the importance of fine-grained reward shaping and provide a simple, plug-and-play solution to unlock token-level supervision from existing PRMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39555", "url": null, "sourceid": 31744, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39557, "uid": "34af18c1d15951df6e45725ac615e068", "name": "VCP-Attack: Visual-Contrastive Projection for Transferable Black-Box Targeted Attacks on Large Vision-Language Models", "authors": [{"id": 180447, "fullname": "Jiawei Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180447?format=json", "institution": "Southeast University"}, {"id": 192336, "fullname": "Minjie Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/192336?format=json", "institution": "Southeast University"}, {"id": 192337, "fullname": "Zihan Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192337?format=json", "institution": "Southeast University"}, {"id": 192338, "fullname": "Zhuoran Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192338?format=json", "institution": "Southeast University"}, {"id": 192339, "fullname": "LizheXie LizheXie", "url": "http://cvpr.thecvf.com/api/miniconf/users/192339?format=json", "institution": "Nanjing Medical University"}, {"id": 192340, "fullname": "Yining HU", "url": "http://cvpr.thecvf.com/api/miniconf/users/192340?format=json", "institution": "Southeast University"}], "abstract": "Large vision-language models (LVLMs) have achieved impressive performance across a variety of multimodal tasks, yet remain vulnerable to targeted adversarial attacks, particularly in black-box settings. In this paper, we propose \\textbf{VCP-Attack}, a transferable targeted attack framework that combines structured contrastive supervision with subspace-guided perturbation optimization. Specifically, we employ a dynamic PCA-based projection to constrain perturbations within semantically meaningful low-dimensional subspaces, and design a multi-sample contrastive loss to align adversarial features with target semantics while pushing them away from the source semantics. Extensive experiments on seven open-source and three proprietary LVLMs\u2014including GPT-4o, Claude, and Gemini\u2014show that VCP-Attack achieves \\textbf{state-of-the-art} performance in black-box targeted attacks. Under a fixed perturbation budget ($\\epsilon = 16/255$), our method achieves an average attack success rate (ASR) of 94.2\\% on open-source models and 83.1\\% on proprietary models, surpassing the strongest baselines by 23.3\\% and 16.8\\%, respectively. Notably, VCP-Attack achieves a 95.6\\% ASR on GPT-4o. Comprehensive ablation studies and visualizations further validate the effectiveness of the dynamic subspace projection and semantic contrastive supervision. While evaluated on image captioning, our approach is model-agnostic and exhibits strong potential for broader applications to black-box adversarial settings in vision-language tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39557", "url": null, "sourceid": 43674, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39563, "uid": "0c00d0b2133a405d5b410c2d0316f262", "name": "Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts", "authors": [{"id": 172717, "fullname": "Hongkun Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/172717?format=json", "institution": "Zhejiang University"}, {"id": 192355, "fullname": "Yuwei Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192355?format=json", "institution": "Zhejiang University"}, {"id": 192356, "fullname": "Wanyi Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/192356?format=json", "institution": "Zhejiang University"}, {"id": 192357, "fullname": "ShengHui Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192357?format=json", "institution": null}, {"id": 192358, "fullname": "Qitong Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192358?format=json", "institution": "Zhejiang University"}, {"id": 192359, "fullname": "Yi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192359?format=json", "institution": "Zhejiang University"}, {"id": 192360, "fullname": "Rufei Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/192360?format=json", "institution": null}, {"id": 192361, "fullname": "Changju Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192361?format=json", "institution": null}, {"id": 191927, "fullname": "Minfeng Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191927?format=json", "institution": "Zhejiang University"}, {"id": 192362, "fullname": "Dongming Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/192362?format=json", "institution": "Hithink RoyalFlush Information Network Co.,Ltd."}, {"id": 131775, "fullname": "Wei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/131775?format=json", "institution": "State key laboratory of CAD&amp;CG"}], "abstract": "Multimodal large language models (MLLMs) have shown considerable potential in chart understanding and reasoning tasks. However, they still struggle with high information density (HID) charts characterized by multiple subplots, legends, and dense annotations due to three major challenges: (1) limited fine-grained perception results in the omission of critical visual cues; (2) redundant or noisy visual information undermines the performance of multimodal reasoning; (3) lack of adaptive deep reasoning relative to the amount of visual information. To tackle these challenges, we present a novel focus-driven fine-grained chart reasoning model, **Chart-FR1**, to improve perception, focusing efficiency, and adaptive deep reasoning on HID charts. Specifically, we propose **Focus-CoT**, a visual focusing chain-of-thought that enhances fine-grained perception by explicitly linking reasoning steps to key visual cues, such as local image regions and OCR signals. Building on this, we introduce **Focus-GRPO**, a focus-driven reinforcement learning algorithm with an information-efficiency reward that compresses redundant visual information for efficient focusing, and an adaptive KL penalty mechanism that enables flexible control over reasoning depth when more visual cues are discovered. Furthermore, to fill the gap in benchmarks for HID charts, we release **HID-Chart**, a challenging benchmark with an information-density metric designed to evaluate fine-grained chart reasoning capabilities. Extensive experiments on multiple chart benchmarks demonstrate that our Chart-FR1 outperforms state-of-the-art MLLMs in chart understanding and reasoning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39563", "url": null, "sourceid": 43332, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39565, "uid": "e20453c4a33cc42c30d103d6d327503b", "name": "A Debiased Reconstruction-based Framework for Training-Free Detection of AI-Generated Images", "authors": [{"id": 129660, "fullname": "Sungik Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/129660?format=json", "institution": "LG AI Research"}, {"id": 129631, "fullname": "Hankook Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/129631?format=json", "institution": "Sungkyunkwan University"}, {"id": 192363, "fullname": "Jaehoon Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/192363?format=json", "institution": "LG AI RESEARCH"}, {"id": 192364, "fullname": "Robin Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/192364?format=json", "institution": "University of Massachusetts at Amherst"}, {"id": 186292, "fullname": "Stanley Jungkyu Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186292?format=json", "institution": "Language Lab, LG AI Research"}, {"id": 88544, "fullname": "Moontae Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/88544?format=json", "institution": "University of Illinois, Chicago"}], "abstract": "As recent AI models have successfully generated high-resolution photorealistic images, it has also been socially important to detect whether an image is generated by AI. Since training data for the detection task is often not available due to the diversity of generative models, training-free detection approaches have been practically considered. A common approach is to utilize the image-level reconstruction error from the latent diffusion model (LDM). However, we find this score suffers from instance-specific biases, particularly in images with simple backgrounds. To this end, we propose a novel image-level debiasing score function that cancels out background contribution by normalizing the reconstruction error on the augmented images with similar background information. To be specific, we show that rotation and low-pass filtering are effective augmentation strategies. To promote generalization to broader generative models, we newly explore latent-level reconstruction error as an additional training-free signal. However, we observe that the latent-level score also suffers to latent-specific bias. To mitigate this, we introduce a rotation-based latent-level debiasing score based on the normalization of the rotated latent. We unify the aforementioned scores into a single unified debiasing score, RDD, which achieves state-of-the-art training-free detection performance across diverse generative models. Furthermore, our framework can be robust to corruption of the examined images.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39565", "url": null, "sourceid": 40030, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39567, "uid": "a64905bb671079d458594784b79f77db", "name": "Spot The Ball: A Benchmark for Visual Social Inference", "authors": [{"id": 176496, "fullname": "Neha Balamurugan", "url": "http://cvpr.thecvf.com/api/miniconf/users/176496?format=json", "institution": "Stanford University "}, {"id": 192366, "fullname": "Sarah A Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192366?format=json", "institution": "Stanford University"}, {"id": 157286, "fullname": "Cristobal Eyzaguirre", "url": "http://cvpr.thecvf.com/api/miniconf/users/157286?format=json", "institution": "Computer Science Department, Stanford University"}, {"id": 192367, "fullname": "Tobias Gerstenberg", "url": "http://cvpr.thecvf.com/api/miniconf/users/192367?format=json", "institution": "Stanford University"}], "abstract": "Humans excel at visual social inference, the ability to infer hidden elements of a scene from subtle behavioral cues such as other people's gaze, pose, and orientation. This capacity drives everyday social reasoning in humans and is critical for developing more human-like AI agents. We introduce \\stb, a challenging benchmark for evaluating visual social inference in vision\u2013language models (VLMs) using sports as a test domain. The task is to localize a removed sports ball from soccer, basketball, and volleyball images. We present a curated evaluation set with human baselines and a scalable pipeline for generating additional test items. We evaluate four state-of-the-art VLMs (Gemini, GPT, LLaMA, Qwen) using three prompting strategies, finding that humans are consistently two to three times more accurate ($20$\u2013$34$%) than models ($\\leq17$%) across all sports. Our analyses show that models rely on superficial spatial heuristics--such as guessing near the image center or nearby players--while humans leverage social cues like gaze direction and body pose. These findings reveal a persistent human\u2013model gap in visual social reasoning and underscore the need for architectures that explicitly encode structured behavioral cues to achieve robust, human-like inference.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39567", "url": null, "sourceid": 38563, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39575, "uid": "30e34847e6a2bcd7a40f671acae634d5", "name": "Temporal Inversion for Learning Interval Change in Chest X-Rays", "authors": [{"id": 153563, "fullname": "Hanbin Ko", "url": "http://cvpr.thecvf.com/api/miniconf/users/153563?format=json", "institution": "Seoul National University"}, {"id": 192384, "fullname": "Kyungmin Jeon", "url": "http://cvpr.thecvf.com/api/miniconf/users/192384?format=json", "institution": "Seoul National University; Seoul National University Hospital"}, {"id": 192385, "fullname": "Doowoong Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192385?format=json", "institution": "Korea University"}, {"id": 192386, "fullname": "Chang Min Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/192386?format=json", "institution": "Seoul National University Hospital; Seoul National University"}], "abstract": "Recent advances in vision--language pretraining have enabled strong medical foundation models, yet most analyze radiographs in isolation, overlooking the key clinical task of comparing prior and current images to assess interval change. For chest radiographs (CXRs), capturing interval change is essential, as radiologists must evaluate not only the static appearance of findings but also how they evolve over time. We introduce TILA (Temporal Inversion-aware Learning and Alignment), a simple yet effective framework that uses temporal inversion---reversing image pairs---as a supervisory signal for temporal reasoning. TILA integrates inversion-aware objectives across pretraining, fine-tuning, and inference, complementing conventional appearance modeling with explicit learning of directional change. We also propose a unified evaluation protocol to assess order sensitivity and consistency under temporal inversion, and introduce MS-CXR-T_retrieval, a benchmark for progression-aware retrieval. Experiments on public datasets and real-world hospital cohorts demonstrate that TILA consistently improves progression classification and temporal embedding alignment across multiple architectures. Overall, temporal inversion provides a simple and general principle for building order-aware medical vision--language models and supports temporally robust reasoning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39575", "url": null, "sourceid": 44679, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39578, "uid": "bbeb39a6407d964feec2fc19b22e52aa", "name": "GaussianZoom: Progressive Zoom-in Generative 3D Gaussian Splatting with Geometric and Semantic Guidance", "authors": [{"id": 155752, "fullname": "Jiale Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/155752?format=json", "institution": "Zhejiang University"}, {"id": 157405, "fullname": "Jiarui Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157405?format=json", "institution": "Zhejiang University"}, {"id": 88731, "fullname": "Zesong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88731?format=json", "institution": "Zhejiang University"}, {"id": 192395, "fullname": "Kaixuan Luan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192395?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 86219, "fullname": "Hujun Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86219?format=json", "institution": "Zhejiang University"}, {"id": 76752, "fullname": "Zhaopeng Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/76752?format=json", "institution": "Zhejiang University"}], "abstract": "We introduce GaussianZoom, a generative zoom-in 3D reconstruction system with an iterative progressive framework that combines geometry-consistent scene modeling and multi-scale semantic reasoning to enable high-fidelity extreme zoom-in rendering from low-resolution inputs.To achieve this, we develop a novel multi-view consistent super-resolution module with depth-based feature warping and VLM-driven detail synthesis, ensuring accurate multi-view correspondence while enriching fine-scale appearance beyond the observed resolution.To support zooming across large magnification ranges, we further introduce a new expandable continuous Level-of-Detail hierarchy that dynamically modulates Gaussian visibility for smooth, alias-free cross-scale rendering. Experiments on Mip-NeRF360 and Tanks\\&Temples demonstrate that GaussianZoom achieves superior perceptual quality, multi-view consistency, and robustness under extreme magnification, establishing a strong baseline for generative zoom-in 3D scene reconstruction.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39578", "url": null, "sourceid": 37787, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39583, "uid": "c8c3df7ec1fd40eef503adad760f1ab1", "name": "EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses", "authors": [{"id": 157051, "fullname": "Enrico Pallotta", "url": "http://cvpr.thecvf.com/api/miniconf/users/157051?format=json", "institution": "Rheinische Friedrich-Wilhelms Universit\u00e4t Bonn"}, {"id": 73683, "fullname": "Sina Mokhtarzadeh Azar", "url": "http://cvpr.thecvf.com/api/miniconf/users/73683?format=json", "institution": "Rheinische Friedrich-Wilhelms Universit\u00e4t Bonn"}, {"id": 181656, "fullname": "Lars Doorenbos", "url": "http://cvpr.thecvf.com/api/miniconf/users/181656?format=json", "institution": "University of Bonn - Institute of Computer Science"}, {"id": 192408, "fullname": "Serdar Ozsoy", "url": "http://cvpr.thecvf.com/api/miniconf/users/192408?format=json", "institution": "Rheinische Friedrich-Wilhelms Universit\u00e4t Bonn"}, {"id": 70808, "fullname": "Umar Iqbal", "url": "http://cvpr.thecvf.com/api/miniconf/users/70808?format=json", "institution": "NVIDIA"}, {"id": 75807, "fullname": "J\u00fcrgen Gall", "url": "http://cvpr.thecvf.com/api/miniconf/users/75807?format=json", "institution": "University of Bonn"}], "abstract": "Egocentric video generation with fine-grained control through body motion is a key requirement towards embodied AI agents that can simulate, predict, and plan actions. In this work, we propose EgoControl, a pose-controllable video diffusion model trained on egocentric data. We train a video prediction model to condition future frame generation on explicit 3D body pose sequences. To achieve precise motion control, we introduce a novel pose representation that captures both global camera dynamics and articulated body movements, and integrate it through a dedicated control mechanism within the diffusion process. Given a short sequence of observed frames and a sequence of target poses, EgoControl generates temporally coherent and visually realistic future frames that align with the provided pose control. Experimental results demonstrate that EgoControl produces high-quality, pose-consistent egocentric videos, paving the way toward controllable embodied video simulation and understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39583", "url": null, "sourceid": 43970, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39590, "uid": "192541804394679421f61ac9f8b8c194", "name": "Precise Object and Effect Removal with Adaptive Target-Aware Attention", "authors": [{"id": 145175, "fullname": "Jixin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/145175?format=json", "institution": "Nanyang Technological University"}, {"id": 184748, "fullname": "Zhouxia Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184748?format=json", "institution": "Nanyang Technological University"}, {"id": 106685, "fullname": "Peiqing Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/106685?format=json", "institution": "S-Lab, Nanyang Technological University"}, {"id": 75939, "fullname": "Shangchen Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/75939?format=json", "institution": "Nanyang Technological University"}], "abstract": "Object removal requires eliminating not only the target object but also its associated visual effects such as shadows and reflections. However, diffusion-based inpainting and removal methods often introduce artifacts, hallucinate contents, alter background, and struggle to remove object effects accurately. To address these challenges, we propose ObjectClear, a novel framework that decouples foreground removal from background reconstruction via an adaptive target-aware attention mechanism. This design empowers the model to precisely localize and remove both objects and their effects while maintaining high background fidelity. Moreover, the learned attention maps are leveraged for an attention-guided fusion strategy during inference, further enhancing visual consistency. To facilitate the training and evaluation of this framework, we construct OBER, a large-scale dataset for OBject-Effect Removal, which provides paired images with and without object-effects, along with precise masks for both objects and their effects. The dataset comprises high-quality captured and simulated data, covering diverse objects, effects, and complex multi-object scenes. Extensive experiments demonstrate that ObjectClear outperforms prior methods, achieving superior object-effect removal quality and background fidelity, especially in challenging real-world scenarios. Code and dataset will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39590", "url": null, "sourceid": 41582, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39584, "uid": "1227f1221621bd026563d53b9cc7d864", "name": "GardenDesigner: Encoding Aesthetic Principles into Jiangnan Garden Construction via a Chain of Agents", "authors": [{"id": 130387, "fullname": "Mengtian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/130387?format=json", "institution": "Shanghai University"}, {"id": 192409, "fullname": "Fan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192409?format=json", "institution": "Shanghai University"}, {"id": 192410, "fullname": "Ruixue Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/192410?format=json", "institution": "Shanghai University"}, {"id": 192411, "fullname": "Yiyan Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192411?format=json", "institution": "Shanghai University"}, {"id": 90560, "fullname": "Zhifeng Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/90560?format=json", "institution": "shanghai university"}, {"id": 128022, "fullname": "Zeyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128022?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}], "abstract": "Jiangnan gardens, a prominent style of Chinese classical gardens, hold great potential as digital assets for film and game production and digital tourism. However, manual modeling of Jiangnan gardens heavily relies on expert experience for layout design and asset creation, making the process time-consuming. To address this gap, we propose GardenDesigner, a novel framework that encodes aesthetic principles for Jiangnan garden construction and integrates a chain of agents based on procedural modeling. The water-centric terrain and explorative pathway rules are applied by terrain distribution and road generation agents. Selection and spatial layout of garden assets follow the aesthetic and cultural constraints. Consequently, we propose asset selection and layout optimization agents to select and arrange objects for each area in the garden. Additionally, we introduce GardenVerse for Jiangnan garden construction, including expert-annotated garden knowledge to enhance the asset arrangement process. To enable interaction and editing, we develop an interactive interface and tools in Unity, in which non-expert users can construct Jiangnan gardens via text input within one minute. Experiments and human evaluations demonstrate that GardenDesigner can generate diverse and aesthetically pleasing Jiangnan gardens.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39584", "url": null, "sourceid": 35730, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39587, "uid": "6d6072ea730f062537e458a1e7d47e78", "name": "MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning", "authors": [{"id": 192416, "fullname": "Wall Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/192416?format=json", "institution": "Samsung"}, {"id": 177228, "fullname": "Chaeyoung Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/177228?format=json", "institution": "Seoul National University of Science and Technology"}, {"id": 85480, "fullname": "Hanul Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/85480?format=json", "institution": "Seoul National University of Science and Technology"}], "abstract": "Recently, TabPFN has gained attention as a foundation model for tabular data. However, it struggles to integrate heterogeneous modalities such as images and text, which are common in domains like healthcare and marketing, thereby limiting its applicability. To address this, we present the Multi-Modal Prior-data Fitted Network (MMPFN), which extends TabPFN to handle tabular and non-tabular modalities in a unified manner. MMPFN comprises per-modality encoders, modality projectors, and pre-trained foundation models. The modality projectors serve as the critical bridge, transforming non-tabular embeddings into tabular-compatible tokens for unified processing. To this end, we introduce a multi-head gated MLP and a cross-attention sampler that extract richer context from non-tabular inputs while mitigates attention imbalance issue in multimodal learning. Extensive experiments on medical and general-purpose multimodal datasets demonstrate that MMPFN consistently outperforms competitive state-of-the-art methods and effectively exploits non-tabular modalities alongside tabular features. These results highlight the promise of extending prior-data fitted networks to the multimodal setting, offering a scalable and effective framework for heterogeneous data learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39587", "url": null, "sourceid": 32467, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39588, "uid": "d0ab3bfa5ebcd995f488c3a90256291a", "name": "MemFlow: A Lightweight Forward Memorizing Framework for Quick Domain Adaptive Feature Mapping", "authors": [{"id": 158370, "fullname": "Jianming Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/158370?format=json", "institution": "South China University of Technology"}, {"id": 183160, "fullname": "Chengjun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183160?format=json", "institution": "South China University of Technology"}, {"id": 192417, "fullname": "Liang Depin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192417?format=json", "institution": "South China University of Technology"}, {"id": 192418, "fullname": "Qianli Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/192418?format=json", "institution": "South China University of Technology"}, {"id": 184735, "fullname": "Wei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184735?format=json", "institution": "Chinese Academy of Sciences"}, {"id": 129330, "fullname": "Xueqi Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/129330?format=json", "institution": ", Chinese Academy of Sciences"}], "abstract": "Deploying pretrained visual models in real-world environments often suffers from significant performance degradation due to the diversity of testing scenarios. Continuous adaptation of learning models on edge devices via unlabeled data collected from the target domain is highly effective for boosting generalization capability. However, gradient-backpropagation-based optimization of the massive parameters in deep neural networks is vastly more time-consuming than forward inference, rendering online learning infeasible on low-power edge devices. To address this critical challenge, we propose a lightweight gradient-free forward-memorizing framework, namely MemFlow, which leverages a frozen backbone and enables efficient fine-tuning of the mapping between features and predictions. Specifically, MemFlow employs randomly connected neurons to memorize feature-label associations; within the network, spiking signals are propagated, and predictions are generated by associating neuron-stored memories according to their confidence levels. More notably, MemFlow supports reinforced memorization of feature mappings using unlabeled data, thereby enabling rapid adaptation to new domains. Extensive experiments on four real-world cross-domain datasets demonstrate that MemFlow achieves performance improvements of up to 10% while consuming less than 1% of the computational time required by traditional domain adaptation methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39588", "url": null, "sourceid": 40922, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39593, "uid": "e774f82167847c35a60e178a4861f7ac", "name": "SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving", "authors": [{"id": 192429, "fullname": "jingyu li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192429?format=json", "institution": "Fudan University"}, {"id": 192430, "fullname": "Junjie Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192430?format=json", "institution": "Li Auto Inc."}, {"id": 192431, "fullname": "Dongnan Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192431?format=json", "institution": "Tongji University"}, {"id": 192432, "fullname": "Xiangkai Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192432?format=json", "institution": null}, {"id": 192433, "fullname": "Bin Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/192433?format=json", "institution": "Li Auto Inc."}, {"id": 185069, "fullname": "Zhihui Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185069?format=json", "institution": "Li Auto Inc."}, {"id": 153220, "fullname": "XianPeng Lang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153220?format=json", "institution": "LiAuto"}, {"id": 75928, "fullname": "Xiatian Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75928?format=json", "institution": "University of Surrey"}, {"id": 73475, "fullname": "Li Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73475?format=json", "institution": "Fudan University"}], "abstract": "Recent end-to-end autonomous driving approaches have leveraged Vision-Language Models (VLMs) to enhance planning capabilities in complex driving scenarios. However, VLMs are inherently trained as generalist models, lacking specialized understanding of driving-specific reasoning in 3D space and time. When applied to autonomous driving, these models struggle to establish structured spatial-temporal representations that capture geometric relationships, scene context, and motion patterns critical for safe trajectory planning.To address these limitations, we propose SGDrive, a novel framework that explicitly structures the VLM's representation learning around driving-specific knowledge hierarchies. Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition: drivers first perceive the overall environment (scene context), then attend to safety-critical agents and their behaviors, and finally formulate short-term goals before executing actions. This hierarchical decomposition provides the structured spatial-temporal representation that generalist VLMs lack, integrating multi-level information into a compact yet comprehensive format for trajectory planning.Extensive experiments on the NAVSIM benchmark demonstrate that SGDrive achieves state-of-the-art performance among camera-only methods, validating the effectiveness of hierarchical knowledge structuring for adapting generalist VLMs to autonomous driving.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39593", "url": null, "sourceid": 37437, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39601, "uid": "00701bfa411bb151a8516b9730037148", "name": "Event-Based Motion Deblurring Using Task-Oriented 3D Gaussian Event Representations", "authors": [{"id": 192455, "fullname": "Shengdong Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/192455?format=json", "institution": "Beijing University Of Technology"}, {"id": 146939, "fullname": "Haoxiang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/146939?format=json", "institution": "Beijing University Of Technology"}, {"id": 131016, "fullname": "Hao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/131016?format=json", "institution": "Southeast University"}, {"id": 131032, "fullname": "Zhen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131032?format=json", "institution": "Beijing University of Technology"}, {"id": 87377, "fullname": "Yongjian Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87377?format=json", "institution": "Beijing University of Technology"}], "abstract": "Event-based motion deblurring has attracted increasing attention as the high temporal resolution of event cameras provides motion cues unavailable to RGB sensors, enabling stronger deblurring. In real-world scenes, motion blur is often complex and nonlinear, with different regions exhibiting diverse speeds and directions. However, most existing approaches rely on handcrafted event representations that overlook such spatiotemporal motion heterogeneity, resulting in suboptimal deblurring performance. To address this issue, we design a learnable 3D Gaussian event representation module that adaptively selects key spatiotemporal coordinates beneficial for deblurring based on the distributions of the blurred image and event density, and integrates the event stream using a 3D Gaussian weighting kernel, thereby extracting local motion features sensitive to motion direction and velocity. In addition, to fully exploit the motion information aggregated in our event representation, a two-stage fusion strategy is employed. Local motion features are used in the first stage to enhance detail restoration, followed by a bidirectional attention fusion module that leverages the one-dimensional Gaussian-weighted event frames for global position correction, thereby achieving precise alignment of the overall structure. Extensive experiments on synthetic and real-world datasets validate the effectiveness of our approach and yield a substantial improvement over state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39601", "url": null, "sourceid": 44962, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39600, "uid": "bf56d3ff4ea20391eeb73af2dc7e0d07", "name": "TaskForce: Cooperative Multi-agent Reinforcement Learning for Multi-task Optimization", "authors": [{"id": 86694, "fullname": "Wonhyeok Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/86694?format=json", "institution": "Daegu Gyeongbuk Institute of Science and Technology"}, {"id": 185937, "fullname": "Kyumin Hwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185937?format=json", "institution": "Daegu Gyeongbuk Institute of Science and Technology"}, {"id": 157087, "fullname": "Jihun Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/157087?format=json", "institution": "Daegu Gyeongbuk Institute of Science and Technology"}, {"id": 157089, "fullname": "Kyoungmin Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/157089?format=json", "institution": "Daegu Gyeongbuk Institute of Science and Technology"}, {"id": 157090, "fullname": "Seunghun Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/157090?format=json", "institution": "Daegu Gyeongbuk Institute of Science and Technology"}, {"id": 185938, "fullname": "Jaeyeul Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/185938?format=json", "institution": "Daegu Gyeongbuk Institute of Science and Technology"}, {"id": 192454, "fullname": "Minwoo Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192454?format=json", "institution": "Daegu Gyeongbuk Institute of Science and Technology"}, {"id": 86689, "fullname": "Sunghoon Im", "url": "http://cvpr.thecvf.com/api/miniconf/users/86689?format=json", "institution": "DGIST"}], "abstract": "Multi-task learning (MTL) involves the simultaneous optimization of multiple task-specific losses, often leading to gradient conflicts and scale imbalances that result in negative transfer. While existing multi-task optimization methods attempt to mitigate these challenges, they either lack the stochasticity needed to escape poor local minima or fail to explicitly address conflicts at the gradient level. In this work, we propose TaskForce, a novel multi-task optimization framework incorporating cooperative multi-agent reinforcement learning (MARL), where agents learn to find an effective joint optimization strategy based on their respective task gradients and losses. To keep the optimization process compact yet informative, agents observe a summary of the training dynamics that consists of the gradient Gram matrix---capturing both gradient magnitudes and pairwise alignments---and task loss values. Each agent then predicts the balancing parameters that determine the weight of their contribution to the final gradient update. Crucially, we design a hybrid reward function that incorporates both gradient-based signals and loss improvement dynamics, enabling agents to effectively resolve gradient conflicts and avoid poor convergence by considering both direct gradient information and the resulting impact on loss reduction. TaskForce achieves consistent improvements over state-of-the-art MTL baselines on NYU-v2, Cityscapes, and QM9, demonstrating the promise of cooperative MARL in complex multi-task scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39600", "url": null, "sourceid": 33908, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39604, "uid": "eb3e78a6d08bc2aad5ab094608a71c77", "name": "Unpaired Deep Image Deraining Using Reward-Guided Self-Reinforcement Learning", "authors": [{"id": 181996, "fullname": "Yinghao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/181996?format=json", "institution": "National University of Defense Technology"}, {"id": 156998, "fullname": "Yeying Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/156998?format=json", "institution": "Tencent"}, {"id": 90859, "fullname": "Xiang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/90859?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 182512, "fullname": "Yanyan Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/182512?format=json", "institution": "Hefei University of Technology"}, {"id": 144302, "fullname": "Ziyang Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/144302?format=json", "institution": "University of Trento"}, {"id": 131420, "fullname": "Yaowen Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131420?format=json", "institution": "National University of Defense Technology"}], "abstract": "Unsupervised deraining has attracted increasing attention due to its flexible data requirements during model training. Lacking paired supervision makes it challenging for the network to achieve a compact optimization space within complex and diversity rain degradation data. Additionally, some high-quality deraining results produced during the network\u2019s training process are overlooked, despite their potential to constrain the optimization space. To overcome them, we introduce a Reward-Guided Self-reinforcement Unsupervised Image Deraining framework, RGSUD. Our RGSUD consists of two stages: rewards recycling and self-reinforcement (SR) strategy training. For the former, we propose a Vision Language Model (VLM) based dynamic reward recycling mechanism to select the optimal deraining results from outputs during model training. In this way, we can robustly collect high-quality deraining results. For the latter, reward-driven optimization is adopted to construct the connection between the rewards and current deraining result, which constrains the optimization space of RGSUD. Thus, the network can learn deraining knowledge within a more compact optimization space, further enhancing deraining performance. The proposed SR strategy achieves over 1 dB improvement on Rain100L and real-world dataset RealRain1K-L, compared to the baseline. Extensive experiments on multiple datasets demonstrate that our proposed framework performs favorably over state-of-the-art unsupervised deraining methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39604", "url": null, "sourceid": 43229, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39606, "uid": "6fef9896ea55b5fe1a342b38bb464c78", "name": "AdaIAT: Adaptively Increasing Attention to Generated Text to Alleviate Hallucinations in LVLM", "authors": [{"id": 192463, "fullname": "Lian Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/192463?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 192464, "fullname": "Ziqiang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/192464?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 192465, "fullname": "Jibin Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192465?format=json", "institution": "Foshan University"}, {"id": 192466, "fullname": "Jin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192466?format=json", "institution": null}, {"id": 131644, "fullname": "Z. Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131644?format=json", "institution": "University of British Columbia"}, {"id": 180020, "fullname": "xiangui Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180020?format=json", "institution": "Sun Yat-sen University"}], "abstract": "Hallucination has been a significant impediment to the development and application of current Large Vision-Language Models (LVLMs). To mitigate hallucinations, one intuitive and effective way is to directly increase attention weights to image tokens during inference. Although this effectively reduces the hallucination rate, it often induces repetitive descriptions.To address this, we first conduct an analysis of attention patterns and reveal that real object tokens tend to assign higher attention to the generated text than hallucinated ones. This inspires us to leverage the generated text, which contains instruction-related visual information and contextual knowledge, to alleviate hallucinations while maintaining linguistic coherence. We therefore propose Increase Attention to Generated Text (IAT) and demonstrate that it significantly reduces the hallucination rate while avoiding repetitive descriptions.To prevent naive amplification from impairing the inherent prediction capabilities of LVLMs, we further explore Adaptive IAT (AdaIAT) that employs a layer-wise threshold to control intervention time and fine-grained amplification magnitude tailored to the characteristics of each attention head.Both analysis and experiments demonstrate the effectiveness of AdaIAT. Results of several LVLMs show that AdaIAT effectively alleviates hallucination (reducing hallucination rates $C_S$ and $C_I$ on LLaVA-1.5 by 35.8\\% and 37.1\\%, respectively) while preserving linguistic performance and prediction capability, achieving an attractive trade-off.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39606", "url": null, "sourceid": 34849, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39611, "uid": "80cff27b15b30ca4c0186164c64f5a18", "name": "EMR-SM: Explicit Mesh Reconstruction with Dynamic Topology Adaptation", "authors": [{"id": 183780, "fullname": "Chuanjin Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/183780?format=json", "institution": "University of Science and Technology of China"}, {"id": 107403, "fullname": "Lifan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107403?format=json", "institution": "University of Science and Technology of China"}, {"id": 77347, "fullname": "Wenjie Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77347?format=json", "institution": "University of Science and Technology of China"}, {"id": 190620, "fullname": "Hanzhi Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190620?format=json", "institution": "University of Science and Technology of China"}, {"id": 88062, "fullname": "Wenfei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88062?format=json", "institution": "University of Science and Technology of China"}, {"id": 85977, "fullname": "Tianzhu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85977?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Reconstructing surface meshes from multi-view images has remained a core challenge in recent years. Most existing methods, whether implicit or explicit, depend on intermediate representations and post-processing steps like Marching Cubes or TSDF fusion, often resulting in artifacts and fragmented geometry. Directly optimizing explicit meshes is a promising approach. However, it presents two critical challenges. The first is how to adaptively refine mesh topology to capture detail without introducing degenerate faces. The second is how to maintain consistent UV coordinates for high-fidelity texturing as the mesh structure evolves. To overcome these, we propose EMR-SM, a novel framework that directly optimizes explicit meshes by integrating differentiable optimization with discrete topology updates. Specifically, we introduce an adaptive vertex splitting and merging strategy, along with real-time UV maintenance, to enable coarse-to-fine optimization while preserving geometric integrity. To our knowledge, EMR-SM is the first framework that directly optimizes meshes with real-time adaptive topology refinement. Extensive experiments demonstrate that EMR-SM achieves a balance among accuracy, computational efficiency, and mesh conciseness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39611", "url": null, "sourceid": 39534, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39614, "uid": "685677a17902e2dbc239ca05391544e1", "name": "Evaluating Generative Models via One-Dimensional Code Distributions", "authors": [{"id": 180062, "fullname": "Zexi Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/180062?format=json", "institution": "Tencent"}, {"id": 181779, "fullname": "Pengcheng Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/181779?format=json", "institution": "\u817e\u8baf\u79d1\u6280\uff08\u5317\u4eac\uff09\u6709\u9650\u516c\u53f8"}, {"id": 185905, "fullname": "Yijia Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/185905?format=json", "institution": "Fudan University"}, {"id": 155554, "fullname": "Jinchao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155554?format=json", "institution": "WeChat AI"}, {"id": 149440, "fullname": "Jie Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/149440?format=json", "institution": "Tencent Inc"}], "abstract": "Most evaluations of generative models rely on feature-distribution metrics such as FID, which operate on continuous recognition features that are explicitly trained to be invariant to appearance variations, and thus discard cues critical for perceptual quality. We instead evaluate models in the space of \\emph{discrete} visual tokens, where modern 1D image tokenizers compactly encode both semantic and perceptual information and quality manifests as predictable token statistics. We introduce Codebook Histogram Distance(CHD), a training-free distribution metric in token space, and Code Mixture Model Score(CMMS), a no-reference quality metric learned from synthetic degradations of token sequences. To stress-test metrics under broad distribution shifts, we further propose VisForm, a benchmark of 210K images spanning 62 visual forms and 11 generative models with expert annotations. Across AGIQA, HPDv2/3, and VisForm, our token-based metrics achieve state-of-the-art correlation with human judgments, and we will release all code and datasets to facilitate future research.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39614", "url": null, "sourceid": 33524, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39618, "uid": "1b0c7224572324e8772d17a2cc5134a5", "name": "StreamRAG: Enhancing Real-Time Video Understanding with Retrieval Augmentation", "authors": [{"id": 192494, "fullname": "Junlin Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/192494?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 192495, "fullname": "Quanlong Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192495?format=json", "institution": "Guangdong OPPO Mobile Telecommunications Corp.,Ltd."}, {"id": 90982, "fullname": "Ruifei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90982?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 73195, "fullname": "Kuo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73195?format=json", "institution": "Sun Yat-sen University"}, {"id": 87903, "fullname": "Yanhao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87903?format=json", "institution": "Oppo Research Institute"}, {"id": 119982, "fullname": "Jinguo Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/119982?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 154487, "fullname": "Haonan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154487?format=json", "institution": "OPPO Guangdong Mobile Telecommunications Co., Ltd."}, {"id": 73155, "fullname": "Xiang Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/73155?format=json", "institution": "Shenzhen Research Institiute of Big Data"}, {"id": 185257, "fullname": "Guanbin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185257?format=json", "institution": null}], "abstract": "The transition of Retrieval-Augmented Generation (RAG) from offline video analysis to online, streaming scenarios presents a set of critical, unexplored challenges. These include the need for on-the-fly semantic segmentation of continuous video, the inherent tension between low-latency processing and high-quality knowledge extraction, and the demand for query-specific temporal reasoning. We propose StreamRAG, a novel framework designed to overcome these hurdles. StreamRAG is built upon three core technical pillars: (1) a Stream Event Segmentation (SES) module that performs real-time boundary detection to chunk the stream into meaningful units; (2) a Token-Reusing Accelerator that drastically cuts down captioning latency by leveraging computational overlap between consecutive frames; and (3) a Dynamic Retrieval Gate that modulates the retrieval scope and strategy based on the query's temporal sensitivity and contextual similarity. Empirical evaluation confirms that StreamRAG establishes a new state-of-the-art, delivering superior accuracy with minimal latency in streaming video comprehension.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39618", "url": null, "sourceid": 45000, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39621, "uid": "6429389c5e8a4c9555be876f8484331a", "name": "Cluster-aware Anchor Learning for Multi-View Clustering", "authors": [{"id": 180623, "fullname": "Zhe Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/180623?format=json", "institution": "Anhui University of Technology"}, {"id": 192500, "fullname": "Fanhui Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192500?format=json", "institution": "Anhui University of Technology"}, {"id": 75854, "fullname": "Tianyang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75854?format=json", "institution": "Jiangnan University"}, {"id": 129533, "fullname": "Xiaojun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129533?format=json", "institution": "Jiangnan University"}], "abstract": "Anchor-based multi-view clustering is attractive for its efficiency, yet most methods fix the number of anchors a priori, implicitly assuming uniform needs across clusters. In practice, clusters differ in information richness, scale, and intrinsic structure, motivating adaptive per-cluster anchor allocation. We propose Cluster-aware Anchor Learning (CAL), which learns a consensus anchor matrix and organizes its columns into cluster-specific anchor groups. CAL imposes an $\\ell_{2,1}$-norm column-sparsity penalty on each group to suppress redundancy and preserve cluster-discriminative features, thereby automatically determining how many anchors each cluster retains. To further enhance separability, CAL introduces an inter-cluster regularization that constrains relationships among groups, promoting mutual dissimilarity.  This data-driven design learns higher-quality, cluster-aware anchors and yields a more discriminative representation matrix across multiple views.  Extensive experiments on multiple benchmarks show that CAL  outperforms state-of-the-art multi-view clustering methods, demonstrating superior effectiveness, robustness, and adaptability to heterogeneous cluster structures.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39621", "url": null, "sourceid": 45433, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39622, "uid": "32f14bac32a5cce63388a2ec80598c08", "name": "FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention", "authors": [{"id": 183506, "fullname": "Zipeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183506?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 88296, "fullname": "Dan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88296?format=json", "institution": "CSE, HKUST"}], "abstract": "3D reconstruction from multi-view images is a core challenge in computer vision. Recently, feed-forward methods have emerged as efficient and robust alternatives to traditional per-scene optimization techniques. Among them, state-of-the-art models like the Visual Geometry Grounding Transformer (VGGT) leverage full self-attention over all image tokens to capture global relationships. However, this approach suffers from poor scalability due to the quadratic complexity of self-attention and the large number of tokens generated in long image sequences.In this work, we introduce FlashVGGT, an efficient alternative that addresses this bottleneck through a descriptor-based attention mechanism. Instead of applying dense global attention across all tokens, FlashVGGT compresses spatial information from each frame into a compact set of **descriptor tokens**. Global attention is then computed as cross-attention between the full set of image tokens and this smaller descriptor set, significantly reducing computational overhead.Moreover, the compactness of the descriptors enables online inference over long sequences via a chunk-recursive mechanism that reuses cached descriptors from previous chunks. Experimental results show that FlashVGGT achieves reconstruction accuracy competitive with VGGT while reducing inference time to just **9.3%** for 1,000 images, and scaling efficiently to sequences exceeding **3,000** images.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39622", "url": null, "sourceid": 41888, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39624, "uid": "7b3f65a67546eca7a9249e1310a6be4f", "name": "GDPO-SR: Group Direct Preference Optimization for One-Step Generative Image Super-Resolution", "authors": [{"id": 154583, "fullname": "Qiaosi Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/154583?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 129135, "fullname": "Shuai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129135?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 132960, "fullname": "Rongyuan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/132960?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 88758, "fullname": "Lingchen Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/88758?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 129145, "fullname": "Zhengqiang ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/129145?format=json", "institution": "The Hong Kong Polytechnic University, Hong Kong Polytechnic University"}, {"id": 88043, "fullname": "Lei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88043?format=json", "institution": "The Hong Kong Polytechnic University"}], "abstract": "Recently, reinforcement learning (RL) has been employed for improving generative image super-resolution (ISR) performance. However, the current efforts are focused on multi-step generative ISR, while one-step generative ISR remains underexplored due to its limited stochasticity. In addition, RL methods such as Direct Preference Optimization (DPO) require the generation of positive and negative sample pairs offline, leading to a limited number of samples, while Group Relative Policy Optimization (GRPO) only calculates the likelihood of the entire image, ignoring local details that are crucial for ISR.  In this paper, we propose Group Direct Preference Optimization (GDPO), a novel approach to integrate RL into one-step generative ISR model training. First, we introduce a noise-aware one-step diffusion model that can generate diverse ISR outputs. To prevent performance degradation caused by noise injection, we introduce an unequal-timestep strategy to decouple the timestep of noise addition from that of diffusion. We then present the GDPO strategy, which integrates the principle of GRPO into DPO, to calculate the group-relative advantage of each online generated sample for model optimization. Meanwhile, an attribute-aware reward function is designed to dynamically evaluate the score of each sample based on its statistics of smooth and texture areas. Experiments demonstrate the effectiveness of GDPO in enhancing the performance of one-step generative ISR models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39624", "url": null, "sourceid": 39988, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39629, "uid": "db9c81673b36dd9dacb092db793de572", "name": "All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models", "authors": [{"id": 102431, "fullname": "Xinyu Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/102431?format=json", "institution": "Australian National University"}, {"id": 192523, "fullname": "Shu Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192523?format=json", "institution": "Australian National University"}, {"id": 192524, "fullname": "Zhaoyuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192524?format=json", "institution": "General Electric"}, {"id": 164574, "fullname": "Mengqi He", "url": "http://cvpr.thecvf.com/api/miniconf/users/164574?format=json", "institution": "Australia National University"}, {"id": 159453, "fullname": "Peter Henry Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/159453?format=json", "institution": "General Electric"}, {"id": 130439, "fullname": "Jing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130439?format=json", "institution": "Australian National University"}], "abstract": "Recent studies have demonstrated that Reinforcement Learning (RL), notably Group Relative Policy Optimization (GRPO), can intrinsically elicit and enhance the reasoning capabilities of Vision-Language Models (VLMs). However, despite the promise, the underlying mechanisms that drive the effectiveness of RL models as well as their limitations remain underexplored. In this paper, we highlight a fundamental behavioral distinction between RL and base models, where the former engages in deeper yet narrow reasoning, while base models, despite less refined along individual path, exhibit broader and more diverse thinking patterns. Through further analysis of training dynamics, we show that GRPO is prone to diversity collapse, causing models to prematurely converge to a limited subset of reasoning strategies while discarding the majority of potential alternatives, leading to local optima and poor scalability. To address this, we propose Multi-Group Policy Optimization (MUPO), a simple yet effective approach designed to incentivize divergent thinking across multiple solutions, and demonstrate its effectiveness on established benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39629", "url": null, "sourceid": 43562, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39632, "uid": "237a7496de9290fe1e0d7682e6a6633b", "name": "MaxMark: High-Capacity Diffusion-Native Watermarking via Robust and Invertible Latent Embedding", "authors": [{"id": 181476, "fullname": "Xuanhang Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181476?format=json", "institution": "Harbin Institute of Technology"}, {"id": 192529, "fullname": "Zhonghao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192529?format=json", "institution": "East China Normal University"}, {"id": 192530, "fullname": "Cheng Zhuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192530?format=json", "institution": "Zhejiang University"}, {"id": 179964, "fullname": "YU LI", "url": "http://cvpr.thecvf.com/api/miniconf/users/179964?format=json", "institution": "Zhejiang University"}], "abstract": "Diffusion-native watermarking provides a more secure and reliable way to trace images from latent diffusion models (LDMs) by embedding information directly into the generative process. However, existing methods suffer from a fundamental limitation: their embedding capacity is extremely small. We introduce MaxMark, a high-capacity watermarking framework that supports embed rich watermark messages into generated images. MaxMark uses two components: a robust watermark embedding module that enhance the secret message and places them into reliable regions of the latent noise, and a distribution transformation module that maps the watermarked latent back to an approximate Gaussian, ensuring compatibility with the diffusion process and preserving image fidelity. The distribution transformation is implemented with an invertible neural network (INN), whose exactly reversible structure enables precise recovery and efficient training. Experiments show that MaxMark surpasses prior methods in capacity, robustness, and imperceptibility, achieving up to a 46\\% improvement in bit accuracy for large watermark payloads.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39632", "url": null, "sourceid": 44206, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39634, "uid": "fbdbd0b268103ccdc44bf1682d51592f", "name": "Image Diffusion Preview with Consistency Solver", "authors": [{"id": 133160, "fullname": "Fu-Yun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/133160?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 161111, "fullname": "Hao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/161111?format=json", "institution": "Research, Google"}, {"id": 128126, "fullname": "Liangzhe Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/128126?format=json", "institution": "Google"}, {"id": 107365, "fullname": "Sanghyun Woo", "url": "http://cvpr.thecvf.com/api/miniconf/users/107365?format=json", "institution": "New York University"}, {"id": 88081, "fullname": "Boqing Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/88081?format=json", "institution": "Google"}, {"id": 75881, "fullname": "Bohyung Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/75881?format=json", "institution": "Seoul National University"}, {"id": 75715, "fullname": "Ming-Hsuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75715?format=json", "institution": "University of California at Merced"}, {"id": 156398, "fullname": "Han Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156398?format=json", "institution": "Reve AI"}, {"id": 192533, "fullname": "Yukun Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192533?format=json", "institution": "Google"}, {"id": 85747, "fullname": "Ting Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85747?format=json", "institution": "Google DeepMind"}, {"id": 128046, "fullname": "Long Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/128046?format=json", "institution": "Google DeepMind"}], "abstract": "The slow inference process of image diffusion models significantly degrades interactive user experiences. To address this, we introduce Diffusion Preview, a novel paradigm employing rapid, low-step sampling to generate preliminary outputs for user evaluation, deferring full-step refinement until the preview is deemed satisfactory. Existing acceleration methods, including training-free solvers and post-training distillation, struggle to deliver high-quality previews or ensure consistency between previews and final outputs. In this paper, we propose ConsistencySolver derived from general linear multistep methods, a lightweight,  trainable high-order solver optimized via Reinforcement Learning, that enhances preview quality and consistency.Experimental results demonstrate that ConsistencySolver significantly improves generation quality in low-step scenarios, making it ideal for efficient preview-and-refine workflows. Notably, it achieves FID scores on-par with Multistep DPM-Solver using 47% fewer steps, while outperforming distillation baselines. Furthermore, user studies indicate our approach reduces overall user interaction time by nearly 50% while maintaining generation quality.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39634", "url": null, "sourceid": 33937, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39636, "uid": "41768754d777815578bfe2fa95da614d", "name": "ID-Crafter: VLM-Grounded Online RL  for Compositional Multi-Subject Video Generation", "authors": [{"id": 181092, "fullname": "Panwang Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181092?format=json", "institution": "ByteDance"}, {"id": 185518, "fullname": "Jingjing Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185518?format=json", "institution": null}, {"id": 185519, "fullname": "Yuchen Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185519?format=json", "institution": "Peking University"}, {"id": 147620, "fullname": "Chenguo Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/147620?format=json", "institution": "Peking University"}, {"id": 152583, "fullname": "Chenxin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152583?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 152581, "fullname": "Hengyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152581?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 192539, "fullname": "Tingting Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192539?format=json", "institution": null}, {"id": 89566, "fullname": "Yadong Mu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89566?format=json", "institution": "Peking University"}], "abstract": "Significant progress has been achieved in high-fidelity video synthesis, yet current paradigms often fall short in effectively integrating identity information from multiple subjects. This leads to semantic conflicts and suboptimal performance in preserving identities and interactions, limiting controllability and applicability. To tackle this issue, we introduce ID-Crafter, a framework for multi-subject video generation that achieves superior identity preservation and semantic coherence. ID-Crafter integrates three key components: (i) a hierarchical identity-preserving attention mechanism that progressively aggregates features at intra-subject, inter-subject, and cross-modal levels; (ii) a semantic understanding module powered by a pretrained Vision-Language Model (VLM) to provide fine-grained guidance and capture complex inter-subject relationships; and (iii) an online reinforcement learning phase to further refine the model for critical concepts. Furthermore, we construct a new dataset to facilitate robust training and evaluation. Extensive experiments demonstrate that ID-Crafter establishes new state-of-the-art performance on multi-subject video generation benchmarks, excelling in identity preservation, temporal consistency, and overall video quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39636", "url": null, "sourceid": 35230, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39637, "uid": "279b369266fbc65a0f271fe465b9e8e0", "name": "Dual Ascent Diffusion for Inverse Problems", "authors": [{"id": 182952, "fullname": "Minseo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/182952?format=json", "institution": "Stanford University"}, {"id": 192540, "fullname": "Axel Levy", "url": "http://cvpr.thecvf.com/api/miniconf/users/192540?format=json", "institution": null}, {"id": 85845, "fullname": "Gordon Wetzstein", "url": "http://cvpr.thecvf.com/api/miniconf/users/85845?format=json", "institution": "Stanford University"}], "abstract": "Ill-posed inverse problems are fundamental in many domains, ranging from astrophysics to medical imaging. Emerging diffusion models provide a powerful prior for solving these problems. Existing maximum-a-posteriori (MAP) or posterior sampling approaches, however, rely on different computational approximations, leading to inaccurate or suboptimal samples. To address this issue, we introduce a new approach to solving MAP problems with diffusion model priors using a dual ascent optimization framework. Our framework achieves better image quality as measured by various metrics for image restoration problems, it is more robust to high levels of measurement noise, it is faster, and it estimates solutions that represent the observations more faithfully than the state of the art.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39637", "url": null, "sourceid": 40263, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39644, "uid": "99cc80ad462f9441f221e1b991028a86", "name": "OVI-MAP: Open-Vocabulary Instance-Semantic Mapping", "authors": [{"id": 182688, "fullname": "Zilong Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/182688?format=json", "institution": "ETH Zurich"}, {"id": 87927, "fullname": "Federico Tombari", "url": "http://cvpr.thecvf.com/api/miniconf/users/87927?format=json", "institution": "Google, TUM"}, {"id": 73915, "fullname": "Marc Pollefeys", "url": "http://cvpr.thecvf.com/api/miniconf/users/73915?format=json", "institution": "ETH Zurich / Microsoft"}, {"id": 158503, "fullname": "Johanna Wald", "url": "http://cvpr.thecvf.com/api/miniconf/users/158503?format=json", "institution": "Google"}, {"id": 74200, "fullname": "Daniel Barath", "url": "http://cvpr.thecvf.com/api/miniconf/users/74200?format=json", "institution": "ETHZ - ETH Zurich"}], "abstract": "Incremental open-vocabulary 3D instance-semantic mapping is essential for autonomous agents operating in complex everyday environments. However, it remains challenging due to the need for robust instance segmentation, real-time processing, and flexible open-set reasoning. Existing methods often rely on the closed-set assumption or dense per-pixel language fusion, which limits scalability and temporal consistency. We introduce OVI-MAP that decouples instance reconstruction from semantic inference. We propose to build a class-agnostic 3D instance map that is incrementally constructed from RGB-D input, while semantic features are extracted only from a small set of automatically selected views using vision-language models. This design enables stable instance tracking and zero-shot semantic labeling throughout online exploration. Our system operates in real time and outperforms state-of-the-art open-vocabulary mapping baselines on standard benchmarks. The source code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39644", "url": null, "sourceid": 36337, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39646, "uid": "e45e346bebac5fb8db1d1c63f751f3d8", "name": "MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models", "authors": [{"id": 76624, "fullname": "Xincheng Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/76624?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 192558, "fullname": "Zefeng Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/192558?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 192559, "fullname": "Chao Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192559?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 184048, "fullname": "Jiayang Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/184048?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 89431, "fullname": "Chongyang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89431?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "In the progress of industrial anomaly detection, general anomaly detection (GAD) is an emerging trend and also the ultimate goal. Unlike the conventional single- and multi-class AD, general AD aims to train a general AD model that can directly detect anomalies in diverse novel classes without any retraining or fine-tuning on the target data. Recently, Multimodal Large Language Models (MLLMs) have shown great promise in achieving general anomaly detection due to their revolutionary visual understanding and language reasoning capabilities. However, MLLM's general AD ability remains underexplored due to: (1) MLLMs are pretrained on amounts of data sourced from the Web, these data still have significant gaps with the data in AD scenarios. Moreover, the image-text pairs during pretraining are also not specifically for AD tasks. (2) The current mainstream AD datasets are image-based and not yet suitable for post-training  MLLMs. To facilitate MLLM-based general AD research, we present MMR-AD, which is a comprehensive benchmark for both training and evaluating MLLM-based AD models. With MMR-AD, we reveal that the AD performance of current SOTA generalist MLLMs still falls far behind the industrial requirements. Based on MMR-AD, we also propose a baseline model, Anomaly-R1, which is a reasoning-based AD model that learns from the CoT data in MMR-AD and is further enhanced by reinforcement learning. Extensive experiments show that our Anomaly-R1 achieves remarkable improvements over generalist MLLMs in both anomaly detection and localization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39646", "url": null, "sourceid": 34430, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39652, "uid": "56de413a1eb075afd0e1da0472bcbe48", "name": "Learning to Infer Parameterized Representations of Plants from 3D Scans", "authors": [{"id": 183162, "fullname": "Samara Ghrer", "url": "http://cvpr.thecvf.com/api/miniconf/users/183162?format=json", "institution": "Inria"}, {"id": 192570, "fullname": "Christophe Godin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192570?format=json", "institution": "INRIA"}, {"id": 85519, "fullname": "Stefanie Wuhrer", "url": "http://cvpr.thecvf.com/api/miniconf/users/85519?format=json", "institution": "INRIA"}], "abstract": "Plants frequently contain numerous organs, organized in 3D branching systems defining the plant's architecture. Reconstructing the architecture of plants from unstructured observations is challenging because of self-occlusion and spatial proximity between organs, which are often thin structures. To achieve the challenging task, we propose an approach that allows to infer a parameterized representation of the plant's architecture from a given 3D scan of a plant. In addition to the plant's branching structure, this representation contains parametric information for each plant organ, and can therefore be used directly in a variety of tasks. In this data-driven approach, we train a recursive neural network with virtual plants generated using a procedural model. After training, the network allows to infer a parametric tree-like representation based on an input 3D point cloud. Our method is applicable to any plant that can be represented as binary axial tree. We quantitatively evaluate our approach on Chenopodium Album plants on reconstruction, segmentation and skeletonization, which are important problems in plant phenotyping. In addition to carrying out several tasks at once, our method achieves results on-par with strong baselines for each task. We apply our method, trained exclusively on synthetic data, to 3D scans and show that it generalizes well.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39652", "url": null, "sourceid": 38267, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39653, "uid": "84037c86334bb0dc014f73da6bb04dca", "name": "Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving", "authors": [{"id": 174507, "fullname": "Xubo Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/174507?format=json", "institution": "Wuhan University"}, {"id": 129306, "fullname": "Haoyang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129306?format=json", "institution": "Horizon Robotics"}, {"id": 192571, "fullname": "Fei He", "url": "http://cvpr.thecvf.com/api/miniconf/users/192571?format=json", "institution": "Horizon Robotics"}, {"id": 155564, "fullname": "Rui Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155564?format=json", "institution": "Horizon Robotics"}, {"id": 192572, "fullname": "Yanhu Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192572?format=json", "institution": null}, {"id": 76467, "fullname": "Wen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76467?format=json", "institution": "Wuhan University"}, {"id": 88693, "fullname": "Huai Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88693?format=json", "institution": "Wuhan University"}], "abstract": "3D occupancy prediction is crucial for autonomous driving perception, offering comprehensive geometric scene understanding and semantic recognition. However, existing methods struggle with geometric misalignment in view transformation due to lack of pixel-level accurate depth estimation, and severe spatial class imbalance where semantic categories exhibit strong spatial anisotropy. To address these challenges, we propose Dr.Occ, a depth- and region-guided occupancy prediction framework. Specifically, we introduce a depth-guided 2D-to-3D View Transformer (D$^2$-VFormer) that effectively leverages high-quality dense depth cues from MoGe-2 to construct reliable  geometric priors, thereby enabling precise geometric alignment of voxel features. Moreover, inspired by the Mixture-of-Experts (MoE) framework, we propose a region-guided Expert Transformer (R/R$^2$-EFormer) that adaptively allocates region-specific experts to focus on different spatial regions, effectively addressing spatial semantic variations. Thus, the two components make complementary contributions: depth guidance ensures geometric alignment, while region experts enhance semantic learning. Experiments on the Occ3D--nuScenes benchmark demonstrate that Dr.Occ improves the strong baseline BEVDet4D by 7.43\\% mIoU and 3.09\\% IoU under the full vision-only setting. Code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39653", "url": null, "sourceid": 34228, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39654, "uid": "2b9a3dc9e24557b457a54df977f21cd1", "name": "AffordGen: Generating Diverse Demonstrations for Generalizable Object Manipulation with Affordance Correspondence", "authors": [{"id": 180435, "fullname": "Jiawei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180435?format=json", "institution": "Shanghai Qi Zhi Institute"}, {"id": 192573, "fullname": "Kaizhe Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192573?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 192574, "fullname": "Yingqian Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192574?format=json", "institution": "Fudan University"}, {"id": 192575, "fullname": "Yuanchen Ju", "url": "http://cvpr.thecvf.com/api/miniconf/users/192575?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 192576, "fullname": "Zhengrong Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/192576?format=json", "institution": "Tsinghua University"}, {"id": 150896, "fullname": "Huazhe Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/150896?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Despite the recent success of modern imitation learning methods in robot manipulation, their performance is often limited to specific object shapes due to the constrained data diversity. Leveraging powerful 3D generative models and vision foundation models (VFM), the proposed AffordGen framework overcomes this limitation by utilizing the semantic correspondence of meaningful keypoints across large-scale 3D meshes to generate new robot manipulation tra-jectories. This large-scale, affordance-aware dataset is then used to train a robust, closed-loop visuomotor policy, combining the semantic generalizability of affordances with the reactive robustness of end-to-end learning. Experiments in simulation and the real world show that policies trained with AffordGen achieve high success rates and enable zero-shot generalization to truly unseen objects, significantly im-proving data efficiency in robot learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39654", "url": null, "sourceid": 41110, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39656, "uid": "127417dfbda8ba627d66c0dae4aef409", "name": "REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting", "authors": [{"id": 182023, "fullname": "Changyue Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/182023?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 88165, "fullname": "Minghao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88165?format=json", "institution": "Zhejiang University"}, {"id": 192579, "fullname": "Yiping Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192579?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 178557, "fullname": "Chuxiao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/178557?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 192580, "fullname": "Xinyuan Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192580?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 104052, "fullname": "Jiajun Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/104052?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 76382, "fullname": "Zhou Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76382?format=json", "institution": "Hangzhou Dianzi University"}], "abstract": "Bridging the gap between complex human instructions and precise 3D object grounding remains a significant challenge in vision and robotics. Existing 3D segmentation methods often struggle to interpret ambiguous, reasoning-based instructions, while 2D vision-language models that excel at such reasoning lack intrinsic 3D spatial understanding. In this paper, we introduce REALM, an innovative MLLM-agent framework that enables open-world reasoning-based segmentation without requiring extensive 3D-specific post-training. We perform segmentation directly on 3D Gaussian Splatting representations, capitalizing on their ability to render photorealistic novel views that are highly suitable for MLLM comprehension. As directly feeding one or more rendered views to the MLLM can lead to high sensitivity to viewpoint selection, we propose a novel Global-to-Local Spatial Grounding strategy. Specifically, multiple global views are first fed into the MLLM agent in parallel for coarse-level localization, aggregating responses to robustly identify the target object. Then, several close-up novel views of the object are synthesized to perform fine-grained local segmentation, yielding accurate and consistent 3D masks. Extensive experiments show that REALM achieves remarkable performance in interpreting both explicit and implicit instructions across LERF, 3D-OVS, and our newly introduced REALM3D benchmarks. Furthermore, our agent framework seamlessly supports a range of 3D interaction tasks, including object removal, replacement, and style transfer, demonstrating its practical utility and versatility.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39656", "url": null, "sourceid": 45472, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39661, "uid": "01262c816d03a88372753fac1d70f02d", "name": "LATTICE: Democratize High-Fidelity 3D Generation at Scale", "authors": [{"id": 155059, "fullname": "Zeqiang Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/155059?format=json", "institution": "Tencent"}, {"id": 186526, "fullname": "Yunfei Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186526?format=json", "institution": "Tencent Hunyuan"}, {"id": 186525, "fullname": "Zibo Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186525?format=json", "institution": "Tencent"}, {"id": 90339, "fullname": "Haolin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90339?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 192590, "fullname": "Qingxiang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192590?format=json", "institution": "Tencent Hunyuan"}, {"id": 127361, "fullname": "Jingwei Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127361?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 129664, "fullname": "Chunchao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/129664?format=json", "institution": "Tencent"}, {"id": 95127, "fullname": "Xiangyu Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/95127?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "We present LATTICE, a new framework for high-fidelity 3D asset generation that bridges the quality and scalability gap between 3D and 2D generative models. While 2D image synthesis benefits from fixed spatial grids and well-established transformer architectures, 3D generation remains fundamentally more challenging due to the need to predict both spatial structure and detailed geometric surfaces from scratch. These challenges are exacerbated by the computational complexity of existing 3D representations and the lack of structured and scalable 3D asset encoding schemes. To address this, we propose VoxSet, a semi-structured representation that compresses 3D assets into a compact set of latent vectors anchored to a coarse voxel grid, enabling efficient and position-aware generation. VoxSet retains the simplicity and compression advantages of prior VecSet methods while introducing explicit structure into the latent space, allowing positional embeddings to guide generation and enabling strong token-level test-time scaling. Built upon this representation, LATTICE adopts a two-stage pipeline: first generating a sparse voxelized geometry anchor, then producing detailed geometry using a  recitified flow transformer. Our method is simple at its core, but supports arbitrary resolution decoding, low-cost training, and flexible inference schemes, achieving state-of-the-art performance on various aspects, and offering a significant step toward scalable, high-quality 3D asset creation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39661", "url": null, "sourceid": 39745, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39662, "uid": "5fc624523b2074a3440e9312f271d68c", "name": "Frequency-Aware Flow Matching for High-Quality Image Generation", "authors": [{"id": 88421, "fullname": "Sucheng Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/88421?format=json", "institution": "South China University of Technology, Tsinghua University"}, {"id": 88900, "fullname": "Qihang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88900?format=json", "institution": "Johns Hopkins University"}, {"id": 188843, "fullname": "Ju He", "url": "http://cvpr.thecvf.com/api/miniconf/users/188843?format=json", "institution": "Amazon"}, {"id": 91095, "fullname": "Xiaohui Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/91095?format=json", "institution": "ByteDance"}, {"id": 94929, "fullname": "Liang-Chieh Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/94929?format=json", "institution": "Amazon Frontier AI &amp; Robotics"}], "abstract": "Flow matching models have emerged as a powerful framework for realistic image generation by learning to reverse a corruption process that progressively adds Gaussian noise. However, because noise is injected in the latent domain, its impact on different frequency components is non-uniform. As a result, during inference, flow matching models tend to generate low-frequency components (global structure) in the early stages, while high-frequency components (fine details) emerge only later in the reverse process.    Building on this insight, we propose Frequency-Aware Flow Matching (FreqFlow), a novel approach that explicitly incorporates frequency-aware conditioning into the flow matching framework via time-dependent adaptive weighting. We introduce a two-branch architecture: (1) a frequency branch that separately processes low- and high-frequency components to capture global structure and refine textures and edges, and (2) a spatial branch that synthesizes images in the latent domain, guided by the frequency branch's output.    By explicitly integrating frequency information into the generation process, FreqFlow ensures that both large-scale coherence and fine-grained details are effectively modeled\u2014low-frequency conditioning reinforces global structure, while high-frequency conditioning enhances texture fidelity and detail sharpness.    On the class-conditional ImageNet-256 generation benchmark, our method achieves state-of-the-art performance with an FID of 1.38, surpassing the prior diffusion model DiT and flow matching model SiT by 0.79 and 0.58 FID, respectively. Code will be made available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39662", "url": null, "sourceid": 32793, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39665, "uid": "7ea9f068a460ee6c285e7ca7af850c51", "name": "Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation", "authors": [{"id": 156166, "fullname": "Zaijing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/156166?format=json", "institution": "Harbin Institute of Technology"}, {"id": 192596, "fullname": "Bing Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192596?format=json", "institution": "Harbin Institute of Technology"}, {"id": 89896, "fullname": "Rui Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/89896?format=json", "institution": "Harbin Institute of Technology"}, {"id": 132952, "fullname": "Gongwei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/132952?format=json", "institution": "Harbin Institute of Technology"}, {"id": 128613, "fullname": "Dongmei Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128613?format=json", "institution": "Peng Cheng Laboratory"}, {"id": 192597, "fullname": "Pengwei Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/192597?format=json", "institution": "Electronic Engineering, Tsinghua University, Tsinghua University"}, {"id": 84767, "fullname": "Jianye Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84767?format=json", "institution": "Tianjin University"}, {"id": 84777, "fullname": "Liqiang Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/84777?format=json", "institution": "Harbin Institute of Technology (Shenzhen)"}], "abstract": "Hierarchical Vision\u2013Language\u2013Action (VLA) models have rapidly become a dominant paradigm for robotic manipulation. It typically comprising a Vision\u2013Language backbone for perception and understanding, together with a generative policy for action generation. However, its performance is increasingly bottlenecked by the action generation proceess. (i) Low inference efficiency. A pronounced distributional gap between isotropic noise priors and target action distributions, which increases denoising steps and the incidence of infeasible samples. (ii) Poor robustness. Existing policies condition solely on the current observation, neglecting the constraint of history sequence and thus lacking awareness of task progress and temporal consistency. To address these issues, we introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories, thereby shortening the generative path and reducing the umber of function evaluations (NFE). LCM dynamically models executed action sequence to infer task progress and injects a learned consistency constraint that enforces temporal coherence and smoothness of trajectory. Across three simulation benchmarks, OptimusVLA consistently outperforms strong baselines: it achieves 98.6\\% average success rate on LIBERO, improves over $\\pi_{0}$ by 13.5\\% on CALVIN, and attains 38\\% average success rate on RoboTwin 2.0 Hard. In Real-World evaluation, OptimusVLA ranks best on Generalization and Long-horizon suites, surpassing $\\pi_{0}$ by 42.9\\% and 52.4\\%, respectively, while delivering 2.9\u00d7 inference speedup.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39665", "url": null, "sourceid": 39952, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39664, "uid": "ef3217af91c6be2ed9ff90f60f247620", "name": "I2I-Bench: A Comprehensive Benchmark Suite for Image-to-Image Editing Models", "authors": [{"id": 158205, "fullname": "Juntong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158205?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 154577, "fullname": "Wang Jiarui", "url": "http://cvpr.thecvf.com/api/miniconf/users/154577?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 154576, "fullname": "Huiyu Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/154576?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 192595, "fullname": "Jiaxiang Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192595?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 86659, "fullname": "Guangtao Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/86659?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 89522, "fullname": "Xiongkuo Min", "url": "http://cvpr.thecvf.com/api/miniconf/users/89522?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Image editing models are advancing rapidly, yet comprehensive evaluation remains a significant challenge. Existing image editing benchmarks generally suffer from limited task scopes, insufficient evaluation dimensions, and heavy reliance on manual annotations, which significantly constrain their scalability and practical applicability. To address this, we propose I2I-Bench, a comprehensive benchmark for image-to-image editing models, which features (i) diverse tasks, encompassing 10 task categories across both single-image and multi-image editing tasks, (ii) comprehensive evaluation dimensions, including 30 decoupled and fine-grained evaluation dimensions with automated hybrid evaluation methods that integrate specialized tools and large multimodal models (LMMs), and (iii) rigorous alignment validation, justifying the consistency between our benchmark evaluations and human preferences.Using I2I-Bench, we benchmark numerous mainstream image editing models, investigating the gaps and trade-offs between editing models across various dimensions. We will open-source all components of I2I-Bench to facilitate future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39664", "url": null, "sourceid": 46572, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39669, "uid": "39d03a3562c450fec0091a5391c27e11", "name": "Factorize, Reconstruct, Enhance: A Unified Framework for Multimodal Sentiment Analysis", "authors": [{"id": 183327, "fullname": "Zhilu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183327?format=json", "institution": "Imperial College London"}, {"id": 91010, "fullname": "Mingcheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/91010?format=json", "institution": "Fudan University"}], "abstract": "Multimodal Sentiment Analysis (MSA) aims to comprehensively and robustly interpret human emotions by integrating information from verbal, visual and acoustic modalities. However, the performance of existing models is often hampered by two key challenges: insufficient multilayer semantic extraction inherent to modalities and static feature fusion, leading to low performance. Therefore, this paper proposes a Multi-factor Factor-Decoupling and Semantics-enhanced Fusion Framework for accurate multimodal sentiment analysis. First, each modality is decomposed into three orthogonal subspaces based on a multidimensional information separation mechanism, which is regulated by a contrast constraint for subspace separation, an information gain constraint for maximizing the capture of task-relevant features, and a pairwise constraint for ensuring complementary subspaces. Subsequently, a variational purification strategy is introduced to further ensure the semantic integrity of each sentiment representation. Finally, the fusion module computes the adaptive fusion weights in parallel using multiple orthogonal factors such as sample-level modality saliency, global subspace type importance and feature-level internal attention. Extensive experiments on three datasets demonstrate the effectiveness of the proposed method.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39669", "url": null, "sourceid": 44790, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39670, "uid": "d572f68116e37b6e798fd9260c43efe3", "name": "Bulk RNA-seq Guided Multi-modal Detection of Anomalous Regions in Human Cancer via Spatial Transcriptomics", "authors": [{"id": 145130, "fullname": "Hang Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/145130?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 192605, "fullname": "Ruocheng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192605?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 176477, "fullname": "Wenjie You", "url": "http://cvpr.thecvf.com/api/miniconf/users/176477?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 192606, "fullname": "huangzhilin huangzhilin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192606?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 87100, "fullname": "Daoqiang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87100?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 128998, "fullname": "WEI SHAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/128998?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}], "abstract": "Spatial transcriptomics (ST) has emerged as a revolutionary approach in the field of tissue analysis that can offer spatial resolved molecular insights for the identification of anomalous regions (AR) on human cancers. Current ST-based methods for detecting AR focus narrowly on the molecular features of local tissue spots, overlooking the matched bulk RNA-seq data that contains crucial diagnostic information. This oversight limits their effectiveness in identifying subtle or heterogeneous tumors, where accurate detection depends on broader genetic context. Besides the genomic signatures, the pathological images can also provide rich visual information to reflect the morphology of AR. To utilize the patient-level diagnostic knowledge and harness complementary information from both histology images and ST, we develop a Bulk RNA-seq Guided Multi-modal Anomalous Regions Detection method (BRGMAR) for the identification of AR from human tissues. Specifically, to effectively model the  dependencies in ST, we introduce a Dynamic Multi-Relational Graph Learning (DMRGL) module to adaptively capture complex relationships in ST, including both spatial proximity and gene expression similarity. Then, we design an Optimal Transportation-based Gene Module Alignment (OTGMA) approach to align ST data with patient-level bulk RNA-seq data by matching the compositional and functional similarities of their corresponding gene modules. Finally, we combine the learned genomic features with pathological image representations for accurate AR detection. We evaluate our method on three public available ST datasets for the purpose of identifying cancerous regions from normal tissues, and the experimental results demonstrate the advantage of our method in comparison with the existing studies.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39670", "url": null, "sourceid": 34658, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39671, "uid": "72824049a9d187c8848e6ba146b02ed3", "name": "EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing", "authors": [{"id": 183834, "fullname": "Yehonathan Litman", "url": "http://cvpr.thecvf.com/api/miniconf/users/183834?format=json", "institution": "Carnegie Mellon University"}, {"id": 76995, "fullname": "Shikun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76995?format=json", "institution": "Meta AI"}, {"id": 192607, "fullname": "Dario Seyb", "url": "http://cvpr.thecvf.com/api/miniconf/users/192607?format=json", "institution": "Meta"}, {"id": 192608, "fullname": "Nicholas Milef", "url": "http://cvpr.thecvf.com/api/miniconf/users/192608?format=json", "institution": "Facebook"}, {"id": 192609, "fullname": "Yang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192609?format=json", "institution": "Meta Reality Labs Research"}, {"id": 126455, "fullname": "Carl Marshall", "url": "http://cvpr.thecvf.com/api/miniconf/users/126455?format=json", "institution": "Meta Reality Labs Research"}, {"id": 76012, "fullname": "Shubham Tulsiani", "url": "http://cvpr.thecvf.com/api/miniconf/users/76012?format=json", "institution": "Carnegie Mellon University"}, {"id": 167321, "fullname": "Caleb Leak", "url": "http://cvpr.thecvf.com/api/miniconf/users/167321?format=json", "institution": "Meta Platforms, Inc."}], "abstract": "High-fidelity generative video editing has seen significant quality improvements by leveraging pre-trained video foundation models. However, their computational cost is a major bottleneck, as they are often designed to inefficiently process the full video context regardless of the inpainting mask's size, even for sparse, localized edits. In this paper, we introduce EditCtrl, an efficient video inpainting control framework that focuses computation only where it is needed. Our approach features a novel local video context module that operates solely on masked tokens, yielding a computational cost proportional to the edit size. This local-first generation is then guided by a lightweight temporal global context embedder that ensures video-wide context consistency with minimal overhead. Not only is EditCtrl $10\\times$ more compute efficient than state-of-the-art generative editing methods, it even improves editing quality compared to methods designed with full-attention. Finally, we showcase how EditCtrl unlocks new capabilities, including multi-region editing with text prompts and real-time autoregressive content propagation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39671", "url": null, "sourceid": 38377, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39685, "uid": "5960da740b2265cd212a7500a88abb1f", "name": "Opti-NeuS: Neural Reconstruction for Dual-Layered Transparent and Opaque Objects", "authors": [{"id": 182553, "fullname": "Yi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182553?format=json", "institution": "IEEE International Conferenc 2026"}, {"id": 184564, "fullname": "Gaoyang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184564?format=json", "institution": "Zhejiang University"}, {"id": 192645, "fullname": "Jun Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192645?format=json", "institution": "Zhejiang University"}, {"id": 184565, "fullname": "Xinguo Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184565?format=json", "institution": "Zhejiang University"}], "abstract": "3D reconstruction of transparent objects from multiple views has been a long-standing challenge. In contrast to opaque objects, transparent objects exhibit complex refraction that causes serious image distortion, resulting in a highly ill-posed problem. Existing reconstruction methods commonly depend on special capture devices or controlled environments, which provide more priors and simplify the modeling of refraction. More importantly, these methods lack the capability for reconstruction of mixed transparent and opaque objects, being confined to transparent or opaque materials. To address these challenges, we propose Opti-NeuS, a novel method for reconstructing transparent and opaque objects without controlled environments or additional input. Opti-NeuS incorporates a novel IoRNetwork to obtain spatially-varying IoR for tracing the refractive ray paths, which can finally model refractive visual distortion. To deal with dual-layered transparent and opaque objects, we devise a two-stage hierarchical reconstruction strategy that decouples outer and inner geometry, combined with alpha-blending for transparency-aware surface separation. Experiments show that Opti-NeuS achieves practical utility and effectiveness and outperforms prior works.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39685", "url": null, "sourceid": 42554, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39677, "uid": "1d4fec573363bb608caa3008337ea5cb", "name": "Tri-Modal Fusion Transformers for UAV-based Object Detection", "authors": [{"id": 146247, "fullname": "Craig Iaboni", "url": "http://cvpr.thecvf.com/api/miniconf/users/146247?format=json", "institution": "New Jersey Institute of Technology"}, {"id": 182449, "fullname": "Pramod Abichandani", "url": "http://cvpr.thecvf.com/api/miniconf/users/182449?format=json", "institution": "New Jersey Institute of Technology"}], "abstract": "Reliable UAV object detection requires robustness to illumination changes, motion blur, and scene dynamics that suppress RGB cues. Thermal long-wave infrared (LWIR) sensing preserves contrast in low light, and event cameras retain microsecond-level temporal edges, but integrating all three modalities in a unified detector has not been systematically studied. We present a tri-modal framework that processes RGB, thermal, and event data with a dual-stream hierarchical vision transformer. At selected encoder depths, a Modality-Aware Gated Exchange (MAGE) applies inter-sensor channel and spatial gating, and a Bidirectional Token Exchange (BiTE) module performs bidirectional token-level attention with depthwise\u2013pointwise refinement, producing resolution-preserving fused maps for a standard feature pyramid and two-stage detector.We introduce a 10,489-frame UAV dataset with synchronized and pre-aligned RGB\u2013thermal\u2013event streams and 24,223 annotated vehicles across day and night flights. Through 61 controlled ablations, we evaluate fusion placement, mechanism (baseline MAGE+BiTE, CSSA, GAFF), modality subsets, and backbone capacity. Tri-modal fusion improves over all dual-modal baselines, with fusion depth having a significant effect and a lightweight CSSA variant recovering most of the benefit at minimal cost. This work provides the first systematic benchmark and modular backbone for tri-modal UAV-based object detection.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39677", "url": null, "sourceid": 38831, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39689, "uid": "626c8e06612900e985ac01abf663cc6f", "name": "SAGE: Scalable Agentic 3D Scene Generation for Embodied AI", "authors": [{"id": 155916, "fullname": "Hongchi Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/155916?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 102910, "fullname": "Xuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/102910?format=json", "institution": "NVIDIA"}, {"id": 135762, "fullname": "Max Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/135762?format=json", "institution": "NVIDIA Research"}, {"id": 150902, "fullname": "Qianli Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/150902?format=json", "institution": "NVIDIA"}, {"id": 192651, "fullname": "Jiashu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192651?format=json", "institution": "NVIDIA"}, {"id": 90941, "fullname": "Ming-Yu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90941?format=json", "institution": "NVIDIA"}, {"id": 85190, "fullname": "Yin Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/85190?format=json", "institution": "NVIDIA"}, {"id": 90963, "fullname": "Tsung-Yi Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/90963?format=json", "institution": "NVIDIA"}, {"id": 127203, "fullname": "Wei-Chiu Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/127203?format=json", "institution": "Cornell University"}, {"id": 86810, "fullname": "Shenlong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86810?format=json", "institution": "University of Illinois, Urbana Champaign"}, {"id": 165594, "fullname": "Shuran Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/165594?format=json", "institution": "Stanford University"}, {"id": 150892, "fullname": "Fangyin Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/150892?format=json", "institution": "NVIDIA"}], "abstract": "Real-world data collection for embodied agents remains costly and unsafe, calling for scalable, realistic, and simulator-ready 3D environments. However, existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes. We present SAGE, an agentic framework that, given a user-specified embodied task (e.g., \u201cpick up a bowl and place it on the table\u201d), understands the intent and automatically generates simulation-ready environments at scale. The agent couples multiple generators for layout and object composition with critics that evaluate semantic plausibility, visual realism, and physical stability. Through iterative reasoning and adaptive tool selection, it self-refines the scenes until meeting user intent and physical validity. The resulting environments are realistic, diverse, and directly deployable in modern simulators for policy training. Policies trained purely on this data exhibit clear scaling trends and generalize to unseen objects and layouts, demonstrating the promise of simulation-driven scaling for Embodied AI. We will release both 3D scene and action generation code to foster further research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39689", "url": null, "sourceid": 42596, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39691, "uid": "23e18616d10dc96e93805413b83969be", "name": "Data Leakage Detection and De-duplication in Large Scale Geospatial Image Datasets", "authors": [{"id": 180326, "fullname": "Yeshwanth Kumar Adimoolam", "url": "http://cvpr.thecvf.com/api/miniconf/users/180326?format=json", "institution": "CYENS Center of Excellence"}, {"id": 158695, "fullname": "Charalambos Poullis", "url": "http://cvpr.thecvf.com/api/miniconf/users/158695?format=json", "institution": "Concordia University"}, {"id": 75800, "fullname": "Melinos Averkiou", "url": "http://cvpr.thecvf.com/api/miniconf/users/75800?format=json", "institution": "University of Cyprus"}], "abstract": "In our study, we conducted a comprehensive analysis of three widely used datasets in the domain of building footprint extraction using deep neural networks: the INRIA Aerial Image Labelling dataset, SpaceNet 2: Building Detection v2, and the AICrowd Mapping Challenge datasets. Our experiments revealed several issues in the AICrowd Mapping Challenge dataset, where nearly 90% (about 250k) of the training split images had identical copies, indicating a high level of duplicate data. Additionally, we found that approximately 56k of the 60k images in the validation split were also present in the training split, amounting to a 93% data leakage.Furthermore, we present a data validation pipeline to address these issues of duplication and data leakage, which hinder the performance of models trained on such datasets. Employing perceptual hashing techniques, this pipeline is designed for efficient de-duplication and leakage identification. It aims to thoroughly evaluate the quality of datasets before their use, thereby ensuring the reliability and robustness of the trained models.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39691", "url": null, "sourceid": 31769, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40377?format=json"], "related_events_ids": [40377]}, {"id": 40377, "uid": "23e18616d10dc96e93805413b83969be", "name": "Data Leakage Detection and De-duplication in Large Scale Geospatial Image Datasets", "authors": [{"id": 180326, "fullname": "Yeshwanth Kumar Adimoolam", "url": "http://cvpr.thecvf.com/api/miniconf/users/180326?format=json", "institution": "CYENS Center of Excellence"}, {"id": 158695, "fullname": "Charalambos Poullis", "url": "http://cvpr.thecvf.com/api/miniconf/users/158695?format=json", "institution": "Concordia University"}, {"id": 75800, "fullname": "Melinos Averkiou", "url": "http://cvpr.thecvf.com/api/miniconf/users/75800?format=json", "institution": "University of Cyprus"}], "abstract": "In our study, we conducted a comprehensive analysis of three widely used datasets in the domain of building footprint extraction using deep neural networks: the INRIA Aerial Image Labelling dataset, SpaceNet 2: Building Detection v2, and the AICrowd Mapping Challenge datasets. Our experiments revealed several issues in the AICrowd Mapping Challenge dataset, where nearly 90% (about 250k) of the training split images had identical copies, indicating a high level of duplicate data. Additionally, we found that approximately 56k of the 60k images in the validation split were also present in the training split, amounting to a 93% data leakage.Furthermore, we present a data validation pipeline to address these issues of duplication and data leakage, which hinder the performance of models trained on such datasets. Employing perceptual hashing techniques, this pipeline is designed for efficient de-duplication and leakage identification. It aims to thoroughly evaluate the quality of datasets before their use, thereby ensuring the reliability and robustness of the trained models.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40377", "url": null, "sourceid": -31769, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39691?format=json"], "related_events_ids": [39691]}, {"id": 39692, "uid": "8cf4ff2dc2db6902d222b0c7dcc98d04", "name": "Adaptive Action Chunking at Inference-time for Vision-Language-Action Models", "authors": [{"id": 182302, "fullname": "Yuanchang Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182302?format=json", "institution": "Shenzhen University of Advanced Technology"}, {"id": 192654, "fullname": "Xiaobo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192654?format=json", "institution": "Institute of automation, Chinese academy of science"}, {"id": 73838, "fullname": "Kai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73838?format=json", "institution": "national university of singaore, National University of Singapore"}, {"id": 192655, "fullname": "Shuo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192655?format=json", "institution": "Mininglamp"}, {"id": 153639, "fullname": "Xiaojiang Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/153639?format=json", "institution": "Shenzhen Technology University"}, {"id": 71514, "fullname": "Haoyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/71514?format=json", "institution": "City University of Hong Kong"}, {"id": 192656, "fullname": "David Chua", "url": "http://cvpr.thecvf.com/api/miniconf/users/192656?format=json", "institution": "National University of Singapore"}, {"id": 192657, "fullname": "Prahlad Vadakkepat", "url": "http://cvpr.thecvf.com/api/miniconf/users/192657?format=json", "institution": "National University of Singapore"}], "abstract": "In Vision-Language-Action (VLA) models, action chunking (i.e., executing a sequence of actions without intermediate replanning) is a key technique to improve robotic manipulation abilities. However, a large chunk size reduces the model\u2019s responsiveness to new information, while a small one increases the likelihood of mode-jumping, jerky behavior resulting from discontinuities between chunks. Therefore, selecting the optimal chunk size is an urgent demand to balance the model's reactivity and consistency. Unfortunately, a dominant trend in current VLA models is an empirical fixed chunk length at inference-time, hindering their superiority and scalability across diverse manipulation tasks. To address this issue, we propose a novel Adaptive Action Chunking (AAC) strategy, which exploits action entropy as the cue to adaptively determine the chunk size based on current predictions. Extensive experiments on a wide range of simulated and real-world robotic manipulation tasks have demonstrated that our approach substantially improves performance over the state-of-the-art alternatives. The videos and source code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39692", "url": null, "sourceid": 34843, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39694, "uid": "c5adc9d79ed1c94b61694807c491f877", "name": "R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning", "authors": [{"id": 130742, "fullname": "Qi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130742?format=json", "institution": "School of Artificial Intelligence, University of Chinese Academy of Sciences."}, {"id": 192663, "fullname": "Bolin Ni", "url": "http://cvpr.thecvf.com/api/miniconf/users/192663?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 89642, "fullname": "Shiming Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89642?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 192664, "fullname": "Houwen Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192664?format=json", "institution": null}], "abstract": "Multimodal Large Language Models (MLLMs) with explicit step-by-step reasoning have achieved strong performance on complex tasks. However, such reasoning is unnecessary for many simple queries and introduces substantial computational overhead. To address this inefficiency, we present R-4B, an auto-thinking MLLM that dynamically determines whether to invoke the reasoning process based on input complexity.Our key idea is to equip a single model with both thinking and non-thinking capabilities and train it to select the appropriate mode. We first introduce bi-mode annealing, a unified training paradigm that constructs a model competent in both reasoning-intensive and direct-answer settings without requiring explicit complexity annotations. Building on this foundation, we propose Bi-mode Policy Optimization (BPO), a lightweight reinforcement learning algorithm that employs a dual-rollout mechanism: for each input, the model generates both thinking and non-thinking responses. This prevents mode collapse and enables robust learning of an adaptive reasoning policy using only simple, rule-based rewards.Extensive experiments across 25 benchmarks show that R-4B achieves state-of-the-art performance among models of similar scale. It consistently surpasses Qwen2.5-VL-7B and matches or exceeds larger models such as Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive tasks, while reducing computational cost by avoiding redundant reasoning. Our results demonstrate that adaptive auto-thinking offers an effective and scalable pathway toward more efficient multimodal reasoning models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39694", "url": null, "sourceid": 46734, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39697, "uid": "6ba54ed26589b595c540b72dec6346c2", "name": "AstraNav-Memory: Contexts Compression for Long Memory", "authors": [{"id": 189489, "fullname": "Junjun Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189489?format=json", "institution": "Alibaba Group"}, {"id": 192673, "fullname": "Xinda Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/192673?format=json", "institution": "Peking University"}, {"id": 154243, "fullname": "Botao Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/154243?format=json", "institution": "Tsinghua University"}, {"id": 189488, "fullname": "Minghua Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189488?format=json", "institution": "Alibaba Group"}, {"id": 156765, "fullname": "Jintao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/156765?format=json", "institution": "Peking University"}, {"id": 192674, "fullname": "Haochen Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/192674?format=json", "institution": "Alibaba Group"}, {"id": 176773, "fullname": "Liangliang You", "url": "http://cvpr.thecvf.com/api/miniconf/users/176773?format=json", "institution": null}, {"id": 154906, "fullname": "Mu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154906?format=json", "institution": "Alibaba Group"}], "abstract": "Lifelong embodied navigation requires agents to accumulate, retain, and exploit spatial\u2013semantic experience across tasks, enabling efficient exploration in novel environments and rapid goal reaching in familiar ones. While object-centric memory is interpretable, it depends on detection and reconstruction pipelines that limit robustness and scalability. We propose an image-centric memory framework that achieves long-term implicit memory via an efficient visual context compression module end-to-end coupled with a Qwen2.5-VL\u2013based navigation policy. Built atop a ViT backbone with frozen DINOv3 features and lightweight PixelUnshuffle+Conv blocks, our visual tokenizer reduces native vision tokens by roughly 10\u201320\u00d7, representing each image with about 30 tokens and allowing the agent to maintain hundreds of historical frames within a single context. Experimental results on GOAT-Bench and HM3D-OVON show that our method achieves state-of-the-art navigation performance, improving exploration in unfamiliar environments and shortening paths in familiar ones. Ablation studies further reveal that moderate compression provides the best balance between efficiency and accuracy. These findings position compressed image-centric memory as a practical and scalable interface for lifelong embodied agents, enabling them to reason over long visual histories and navigate with human-like efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39697", "url": null, "sourceid": 34091, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39698, "uid": "1685a13066f6f21b02e62ef6cbeb0f5f", "name": "Leveraging Verifier-Based Reinforcement Learning in Image Editing", "authors": [{"id": 192675, "fullname": "Hanzhong Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192675?format=json", "institution": "University of Hong Kong"}, {"id": 89921, "fullname": "Jie Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89921?format=json", "institution": "ByteDance Inc."}, {"id": 186774, "fullname": "Jie Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186774?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 192676, "fullname": "Yu Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192676?format=json", "institution": "Meituan Inc"}, {"id": 134180, "fullname": "Zilyu Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/134180?format=json", "institution": "Westlake University"}, {"id": 192677, "fullname": "Linxiao Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192677?format=json", "institution": "ByteDance Inc."}, {"id": 192678, "fullname": "Xionghui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192678?format=json", "institution": null}, {"id": 89008, "fullname": "Yizhou Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89008?format=json", "institution": "The University of Hong Kong"}, {"id": 190589, "fullname": "Weilin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190589?format=json", "institution": "Alibaba Group"}], "abstract": "While Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm for text-to-image generation, its application to image editing remains largely unexplored. A key bottleneck is the lack of a robust general reward model for all editing tasks. Existing edit reward models usually give overall scores without detailed checks, ignoring different instruction requirements and causing biased rewards. To address this, we propose the Verifier-Based Reasoning Reward Model (RRM), which breaks instructions into verifiable principles, evaluates the edited images against each principle, and aggregates fine-grained scores to reduce hallucinations and provide more interpretable criteria. To address this, we argue the key is to move from a simple scorer to a reasoning verifier. We introduce Edit-R1, a framework to build a chain-of-thought (CoT) verifier-based reasoning reward model (RRM) and leverage it into the downstream editing task. The Edit-RRM breaks instructions into distinct principles, evaluates the edited image against each, and aggregates these checks to provide an interpretable, fine-grained reward. To build such an RRM, we first apply supervised fine-tuning (SFT) as a \u201ccold-start\u201d to generate CoT reward trajectories. Then, we introduce Group Contrastive Preference Optimization (GCPO), a reinforcement learning algorithm that leverages human pairwise preference data to reinforce our pointwise RRM. After building the RRM, we use GRPO to train editing models with this non-differentiable yet powerful reward model. Extensive experiments demonstrate that our Edit-RRM surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model, and we observe a clear scaling trend, with performance consistently improving from 3B to 7B parameters. Moreover, Edit-R1 delivers gains to editing models like FLUX.1-kontext, highlighting its effectiveness in enhancing image editing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39698", "url": null, "sourceid": 42060, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39700, "uid": "5612d8e312285c1879e33a26b40dd227", "name": "Compressed-Domain-Aware Online Video Super-Resolution", "authors": [{"id": 182909, "fullname": "Yuhang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182909?format=json", "institution": "Beijing Institute of Technology"}, {"id": 145103, "fullname": "Hai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/145103?format=json", "institution": "Beijing Institute of Technology"}, {"id": 192679, "fullname": "Shujuan Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192679?format=json", "institution": "Beijing Institute of Technology"}, {"id": 144584, "fullname": "zhetao dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/144584?format=json", "institution": "Beijing Institute of Technology"}, {"id": 192680, "fullname": "yangxiaoyao yangxiaoyao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192680?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "In bandwidth-limited online video streaming, videos are usually downsampled and compressed. Although recent online video super-resolution (online VSR) approaches achieve promising results, they are still compute-intensive and fall short of real-time processing at higher resolutions, due to complex motion estimation for alignment and redundant processing of consecutive frames. To address these issues, we propose a compressed-domain-aware network (CDA-VSR) for online VSR, which utilizes compressed-domain information, including motion vectors, residual maps, and frame types to balance quality and efficiency. Specifically, we propose a motion-vector-guided deformable alignment module that uses motion vectors for coarse warping and learns only local residual offsets for fine-tuned adjustments, thereby maintaining accuracy while reducing computation. Then, we utilize a residual-map-guided gated fusion module to derive spatial weights from residual maps, suppressing mismatched regions and emphasizing reliable details. Further, we design a frame-type-aware reconstruction module for adaptive compute allocation across frame types, balancing accuracy and efficiency. On the REDS4 dataset, our CDA-VSR surpasses the state-of-the-art method TMP, with a maximum PSNR improvement of 0.13 dB while delivering more than double the inference speed.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39700", "url": null, "sourceid": 46417, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39699, "uid": "05428fe4b2f7f9fad3fc1702c51b4c43", "name": "Geo$^\\textbf{2}$: Geometry-Guided Cross-view Geo-Localization and Image Synthesis", "authors": [{"id": 192296, "fullname": "Yancheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192296?format=json", "institution": "University of Central Florida"}, {"id": 180593, "fullname": "Xiaohan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180593?format=json", "institution": "University of Vermont"}, {"id": 135261, "fullname": "Guangyu Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/135261?format=json", "institution": "University of Central Florida"}, {"id": 129185, "fullname": "Zonglin Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129185?format=json", "institution": "New York University"}, {"id": 189325, "fullname": "Safwan Wshah", "url": "http://cvpr.thecvf.com/api/miniconf/users/189325?format=json", "institution": "University of Vermont"}, {"id": 73542, "fullname": "Chen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/73542?format=json", "institution": "University of Central Florida"}], "abstract": "Cross-view geo-spatial learning consists of two important tasks: Cross-View Geo-Localization (CVGL) and Cross-View Image Synthesis (CVIS), both of which rely on establishing geometric correspondences between ground and aerial views. Recent Geometric Foundation Models (GFMs) have demonstrated strong capabilities in extracting generalizable 3D geometric features from images, but their potential in cross-view geo-spatial tasks remains underexplored. In this work, we present Geo^2, a unified framework that leverages Geometric priors from GFMs (e.g., VGGT) to jointly perform geo-spatial tasks, CVGL and bidirectional CVIS. Despite the 3D reconstruction ability of GFMs, directly applying them to CVGL and CVIS remains challenging due to the large viewpoint gap between ground and aerial imagery. We propose GeoMap, which embeds ground and aerial features into a shared 3D-aware latent space, effectively reducing cross-view discrepancies for localization. This shared latent space naturally bridges cross-view image synthesis in both directions. To exploit this, we propose GeoFlow, a flow-matching model conditioned on geometry-aware latent embeddings. We further introduce a consistency loss to enforce latent alignment between the two synthesis directions, ensuring bidirectional coherence. Extensive experiments on standard benchmarks, including CVUSA, CVACT, and VIGOR, demonstrate that Geo^2 achieves state-of-the-art performance in both localization and synthesis, highlighting the effectiveness of 3D geometric priors for cross-view geo-spatial learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39699", "url": null, "sourceid": 36862, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39705, "uid": "8eec7d1b475b38be606935be8e70fccd", "name": "SafeDrive: Fine-Grained Safety Reasoning for End-to-End Driving in a Sparse World", "authors": [{"id": 183152, "fullname": "Jungho Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/183152?format=json", "institution": "Seoul National University"}, {"id": 183374, "fullname": "Jiyong Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/183374?format=json", "institution": "Seoul National University"}, {"id": 182716, "fullname": "Seunghoon Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182716?format=json", "institution": "Seoul National University"}, {"id": 192690, "fullname": "Hongjae Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192690?format=json", "institution": "Seoul National University"}, {"id": 192691, "fullname": "Donghyuk Kwak", "url": "http://cvpr.thecvf.com/api/miniconf/users/192691?format=json", "institution": "Seoul National University"}, {"id": 133243, "fullname": "Jun Won Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/133243?format=json", "institution": "Seoul National University"}], "abstract": "The end-to-end (E2E) paradigm, which maps sensor inputs directly to driving decisions, has recently attracted significant attention due to its unified modeling capability and scalability. However, ensuring safety in this unified framework remains one of the most critical challenges. In this work, we propose SafeDrive, an E2E planning framework designed to perform explicit and interpretable safety reasoning through a trajectory-conditioned Sparse World Model. SafeDrive comprises two complementary networks: the Sparse World Network (SWNet) and the Fine-grained Reasoning Network (FRNet). SWNet constructs trajectory-conditioned sparse worlds that simulate the future behaviors of critical dynamic agents and road entities, providing interaction-centric representations for downstream reasoning. FRNet then evaluates agent-specific collision risks and temporal adherence to drivable regions, enabling precise identification of safety-critical events across future timesteps. SafeDrive achieves state-of-the-art performance on both open-loop and closed-loop benchmarks. On NAVSIM, it records a PDMS of 91.6 and an EPDMS of 87.5, with only 61 collisions out of 12,146 scenarios (0.5%). On Bench2Drive, SafeDrive attains a 66.6% driving score.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39705", "url": null, "sourceid": 36090, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39707, "uid": "89e017e9e34cd057fb363397b02eabee", "name": "WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments", "authors": [{"id": 151402, "fullname": "Xuweiyi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/151402?format=json", "institution": "University of Virginia"}, {"id": 145688, "fullname": "Wentao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/145688?format=json", "institution": "University of Virginia, Charlottesville"}, {"id": 159178, "fullname": "Zezhou Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/159178?format=json", "institution": "University of Virginia, Charlottesville, VA"}], "abstract": "We present **WildRayZer**, a self-supervised framework for novel view synthesis (NVS) in dynamic environments, where both the camera and objects move. Dynamic content breaks the multi-view consistency that static NVS models rely on, causing ghosting, hallucinated geometry, and unstable pose estimation. WildRayZer addresses this by performing an analysis-by-synthesis test: a camera-only static renderer explains rigid structure, and its residuals reveal transient regions. From these residuals, we construct pseudo motion masks, distill a motion estimator, and use it to mask input tokens and gate loss gradients so supervision focuses on cross-view background completion. To enable large-scale training and evaluation, we curate Dynamic RealEstate10K (D-RE10K), a real-world dataset of 15K casually captured dynamic sequences, and D-RE10K-iPhone, a paired transient and clean benchmark for sparse-view transient-aware NVS. Experiments show that WildRayZer consistently outperforms optimization-based and feed-forward baselines in both transient-region removal and full-frame NVS quality with a single feed-forward pass.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39707", "url": null, "sourceid": 35250, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39722, "uid": "e70adf4ec66959998bca02c1c732c0ee", "name": "TopoHR: Hierarchical Centerline Representation for Cyclic Topology Reasoning in Driving Scenes with Point-to-Instance Relations", "authors": [{"id": 172476, "fullname": "Yifeng Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/172476?format=json", "institution": "University of Science and Technology of China"}, {"id": 159934, "fullname": "Zhirong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/159934?format=json", "institution": "Nullmax"}, {"id": 192723, "fullname": "Bo Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/192723?format=json", "institution": "Institute of Intelligent Machines"}, {"id": 131698, "fullname": "Erkang Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/131698?format=json", "institution": "Nullmax"}, {"id": 92055, "fullname": "Haibin Ling", "url": "http://cvpr.thecvf.com/api/miniconf/users/92055?format=json", "institution": "State University of New York, Stony Brook"}], "abstract": "Topology reasoning is crucial for autonomous driving. Current methods primarily focus on instance-level learning for centerline detection, followed by a sequential module for topology reasoning that relies on simplified MLP layers. Moreover, these approaches often neglect the importance of point-to-instance (P2I) relationships in topology reasoning. To address these limitations, we present TopoHR (Topological Hierarchical Representation), a novel end-to-end framework that establishes cyclic interaction between centerline detection and topology reasoning, allowing them to iteratively enhance each other. Specifically, we introduce a hierarchical centerline representation including point queries, instance queries, and semantic representations. These multi-level features are seamlessly integrated and fused within a hierarchical centerline decoder. Furthermore, we design a hierarchical topology reasoning module that captures both fine-grained P2I relationships and global instance-to-instance (I2I) connections within a unified architecture. With these novel components, TopoHR ensures accurate and robust topology reasoning. On the OpenLane-V2 benchmark, TopoHR refreshes state-of-the-art performance with significant improvements. Notably, compared with previous best results, TopoHR achieves +3.8 in $\\mathrm{DET}\\_{\\text{l}}$, +5.4 in $\\mathrm{TOP}\\_{\\text{ll}}$ on subset_A and +11.0 in $\\mathrm{DET}\\_{\\text{l}}$, +7.9 in $\\mathrm{TOP}\\_{\\text{ll}}$ on subset_B, validating the effectiveness of the proposed components. The code will be shared publicly upon paper acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39722", "url": null, "sourceid": 35040, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39725, "uid": "60c6413e69efc9f67452aca5925e6977", "name": "Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs", "authors": [{"id": 76605, "fullname": "Jinqi Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/76605?format=json", "institution": "University of Pennsylvania / Amazon"}, {"id": 136543, "fullname": "Jinyu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/136543?format=json", "institution": "Amazon"}, {"id": 156933, "fullname": "Tal Neiman", "url": "http://cvpr.thecvf.com/api/miniconf/users/156933?format=json", "institution": "Amazon"}, {"id": 126582, "fullname": "Lei Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/126582?format=json", "institution": "Northwestern University"}, {"id": 180688, "fullname": "Bing Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/180688?format=json", "institution": "Amazon"}, {"id": 86946, "fullname": "Son Dinh Tran", "url": "http://cvpr.thecvf.com/api/miniconf/users/86946?format=json", "institution": "Amazon"}, {"id": 73977, "fullname": "Mubarak Shah", "url": "http://cvpr.thecvf.com/api/miniconf/users/73977?format=json", "institution": "Amazon"}, {"id": 131913, "fullname": "Rene Vidal", "url": "http://cvpr.thecvf.com/api/miniconf/users/131913?format=json", "institution": "University of Pennsylvania and Amazon"}], "abstract": "Multimodal Large Language Models (MLLMs) have been shown to be vulnerable to malicious queries that can elicit unsafe responses. Recent works use prompt engineering, response classification, or finetuning to improve MLLM safety. Nevertheless, such approaches are often ineffective against evolving malicious patterns, may require rerunning the query, or demand heavy computational resources. Steering the intermediate activations of a frozen model at inference time has recently emerged as a flexible and effective solution. However, existing steering methods for MLLMs typically handle only a narrow set of safety-related concepts or struggle to adjust specific concepts without affecting others. To address these challenges, we introduce Dictionary-Aligned Concept Control (DACO), a framework that jointly utilizes a curated concept dictionary with a Sparse Autoencoder (SAE) to provide granular control over MLLM activations. First, we curate a dictionary of 15,000 multimodal concepts by retrieving over 400,000 caption-image stimuli (we name the dataset DACO-400K) and summarizing their activations into per-concept directions. Second, we show that the curated dictionary can be directly applied to interventions via sparse coding. Third, we propose a new steering approach that uses our dictionary to initialize the training of an SAE and automatically annotate the semantics of the SAE atoms for safeguarding MLLMs. Experiments on multiple MLLMs (e.g., QwenVL, LLaVA, InternVL) across safety benchmarks (e.g., MM-SafetyBench, JailBreakV) show that DACO significantly improves the MLLM safety while maintaining general-purpose capabilities.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39725", "url": null, "sourceid": 38383, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39726, "uid": "aa4b8d815476f638191851b529731a57", "name": "TAR: Token-Aware Refinement for Fine-grained Generalized Category Discovery", "authors": [{"id": 176468, "fullname": "XingYu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176468?format=json", "institution": "South East University"}, {"id": 192733, "fullname": "Yu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192733?format=json", "institution": "Southeast University"}, {"id": 176512, "fullname": "Siya Mi", "url": "http://cvpr.thecvf.com/api/miniconf/users/176512?format=json", "institution": "Southeast University"}, {"id": 86430, "fullname": "Xiu-Shen Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/86430?format=json", "institution": "Nanjing University of Science and Technology"}], "abstract": "For an unlabelled dataset containing known and unknown categories, Generalized Category Discovery (GCD) aims to classify the known categories exactly while simultaneously discovering the unknown categories. Current GCD methods have achieved significant progress on coarse-grained datasets but still struggle to generalize to fine-grained scenarios. We observe that attention artifacts, a phenomenon where the attention map exhibits abnormally high responses concentrated on a few tokens, significantly interferes with fine-grained GCD. In this paper, we argue that attention artifacts compel the model to overemphasize global semantics, consequently overlooking fine-grained local cues that are crucial for category discrimination. We propose the $\\textbf{T}$oken-$\\textbf{A}$ware $\\textbf{R}$efinement ($\\textbf{TAR}$) framework, which introduces a plug-and-play module to mitigate the impact of attention artifacts and enhances the concentration of local information. TAR departs from the conventional classification paradigm that relies solely on the first token as input to the classifier. Instead, it fully exploits the entire token sequence, thereby significantly enhancing the model's focus on fine-grained local information. Extensive experiments demonstrate the superior performance of TAR across various benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39726", "url": null, "sourceid": 46631, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39727, "uid": "0067035172fc786488ff7c6317ed88c9", "name": "SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection", "authors": [{"id": 192734, "fullname": "Sandro Papais", "url": "http://cvpr.thecvf.com/api/miniconf/users/192734?format=json", "institution": "University of Toronto"}, {"id": 163866, "fullname": "lezhou feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/163866?format=json", "institution": "Zoox"}, {"id": 163868, "fullname": "Charles Cossette", "url": "http://cvpr.thecvf.com/api/miniconf/users/163868?format=json", "institution": "Zoox"}, {"id": 192735, "fullname": "Lingting Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/192735?format=json", "institution": "Zoox"}], "abstract": "Vision Transformers (ViTs) enable strong multi-view 3D detection but are limited by high inference latency from dense token and query processing across multiple views and large 3D regions. Existing sparsity methods, designed mainly for 2D vision, prune or merge image tokens but do not extend to full-model sparsity or address 3D object queries. We introduce SToRe3D, a relevance-aligned sparsity framework that jointly selects 2D image tokens and 3D object queries while storing filtered features for reactivation. Mutual 2D\u20133D relevance heads allocate compute to driving-critical content and preserve other embeddings. Evaluated on nuScenes and our new nuScenes-Relevance benchmark, SToRe3D achieves up to 3x faster inference with marginal accuracy loss, establishing real-time large-scale ViT-based 3D detection while maintaining accuracy on planning-critical agents.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39727", "url": null, "sourceid": 41060, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39729, "uid": "5d9c5ea2d5950e8ea2d01cdc092ffe29", "name": "AnchorFlow: Training-Free 3D Editing via Latent Anchor-Aligned Flows", "authors": [{"id": 77052, "fullname": "Zhenglin Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/77052?format=json", "institution": "Zhejiang University"}, {"id": 107182, "fullname": "Fan Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/107182?format=json", "institution": "Zhejiang University"}, {"id": 192737, "fullname": "Chengzhuo Gui", "url": "http://cvpr.thecvf.com/api/miniconf/users/192737?format=json", "institution": "Zhejiang University"}, {"id": 84803, "fullname": "Xiaobo Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/84803?format=json", "institution": "The University of Sydney"}, {"id": 163978, "fullname": "Hehe Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/163978?format=json", "institution": "Zhejiang University"}, {"id": 86325, "fullname": "Yi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86325?format=json", "institution": "Zhejiang University"}, {"id": 88927, "fullname": "Tat-seng Chua", "url": "http://cvpr.thecvf.com/api/miniconf/users/88927?format=json", "institution": "National University of Singapore"}], "abstract": "Training-free 3D editing aims to modify 3D shapes based on human instructions without model finetuning. It plays a crucial role in 3D content creation. However, existing approaches often struggle to produce strong or geometrically stable edits, largely due to inconsistent latent anchors introduced by timestep-dependent noise during diffusion sampling. To address these limitations, we introduce AnchorFlow, which is built upon the principle of latent anchor consistency. Specifically, AnchorFlow establishes a global latent anchor shared between the source and target trajectories, and enforces coherence using a relaxed anchor-alignment loss together with an anchor-aligned update rule. This design ensures that transformations remain stable and semantically faithful throughout the editing process. By stabilizing the latent reference space, AnchorFlow enables more pronounced semantic modifications. Moreover, AnchorFlow is mask-free. Without mask supervision, it effectively preserves geometric fidelity. Experiments on the Eval3DEdit benchmark show that AnchorFlow consistently delivers semantically aligned and structurally robust edits across diverse editing types. The code and models will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39729", "url": null, "sourceid": 37907, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40380?format=json"], "related_events_ids": [40380]}, {"id": 40380, "uid": "5d9c5ea2d5950e8ea2d01cdc092ffe29", "name": "AnchorFlow: Training-Free 3D Editing via Latent Anchor-Aligned Flows", "authors": [{"id": 77052, "fullname": "Zhenglin Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/77052?format=json", "institution": "Zhejiang University"}, {"id": 107182, "fullname": "Fan Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/107182?format=json", "institution": "Zhejiang University"}, {"id": 192737, "fullname": "Chengzhuo Gui", "url": "http://cvpr.thecvf.com/api/miniconf/users/192737?format=json", "institution": "Zhejiang University"}, {"id": 84803, "fullname": "Xiaobo Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/84803?format=json", "institution": "The University of Sydney"}, {"id": 163978, "fullname": "Hehe Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/163978?format=json", "institution": "Zhejiang University"}, {"id": 86325, "fullname": "Yi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86325?format=json", "institution": "Zhejiang University"}, {"id": 88927, "fullname": "Tat-seng Chua", "url": "http://cvpr.thecvf.com/api/miniconf/users/88927?format=json", "institution": "National University of Singapore"}], "abstract": "Training-free 3D editing aims to modify 3D shapes based on human instructions without model finetuning. It plays a crucial role in 3D content creation. However, existing approaches often struggle to produce strong or geometrically stable edits, largely due to inconsistent latent anchors introduced by timestep-dependent noise during diffusion sampling. To address these limitations, we introduce AnchorFlow, which is built upon the principle of latent anchor consistency. Specifically, AnchorFlow establishes a global latent anchor shared between the source and target trajectories, and enforces coherence using a relaxed anchor-alignment loss together with an anchor-aligned update rule. This design ensures that transformations remain stable and semantically faithful throughout the editing process. By stabilizing the latent reference space, AnchorFlow enables more pronounced semantic modifications. Moreover, AnchorFlow is mask-free. Without mask supervision, it effectively preserves geometric fidelity. Experiments on the Eval3DEdit benchmark show that AnchorFlow consistently delivers semantically aligned and structurally robust edits across diverse editing types. The code and models will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40380", "url": null, "sourceid": -37907, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39729?format=json"], "related_events_ids": [39729]}, {"id": 39731, "uid": "1e60e9a0cdf68820824a93bb3c055ec7", "name": "R3-PCQA: Ray-Reprojection-Reinforcement for No-Reference 3D Point Cloud Quality Assessment", "authors": [{"id": 177082, "fullname": "Junhyuk Seo", "url": "http://cvpr.thecvf.com/api/miniconf/users/177082?format=json", "institution": "Hansung University Seoul"}, {"id": 183392, "fullname": "SANG HYUK SEO", "url": "http://cvpr.thecvf.com/api/miniconf/users/183392?format=json", "institution": "Hansung University"}, {"id": 192738, "fullname": "Dawoon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/192738?format=json", "institution": "Hansung University Seoul"}, {"id": 181990, "fullname": "Heeseok Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/181990?format=json", "institution": "Hansung University"}], "abstract": "Prevailing no-reference 3D point cloud quality assessment methods predominantly treat 2D projections and 3D point clouds as independent modalities and rely on simplistic feature fusion, thereby neglecting fundamental mechanisms underlying human 3D perception. To address this limitation, we introduce R3-PCQA (Ray-Reprojection-Reinforcement 3D Point Cloud Quality Assessor), a novel and principled framework that explicitly encodes perceptual priors into the assessment pipeline: A geometric-aware ray-based reprojection pipeline simulates viewpoint-dependent observation of 3D structure. A reinforcement-learning-based quality-salient subcloud selector adaptively attends to perceptually informative regions. The global view attention module aggregates local quality responses across viewpoints, forming a unified representation that facilitates reliable cross-view inference. Extensive experiments demonstrate that R3-PCQA achieves state-of-the-art performance on SJTU-PCQA, WPC, and WPC2.0.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39731", "url": null, "sourceid": 36787, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39733, "uid": "d817c023bf760228ebf536f68271bb90", "name": "Depth Any Endoscopy: Towards Self-Supervised Generalizable Depth Estimation in Monocular Endoscopy", "authors": [{"id": 152148, "fullname": "Shuwei Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/152148?format=json", "institution": "Beihang University"}, {"id": 192741, "fullname": "Kejin Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192741?format=json", "institution": "Yanbian University"}, {"id": 146903, "fullname": "Shixing Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/146903?format=json", "institution": "Shandong University"}, {"id": 146323, "fullname": "Xinzhe Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/146323?format=json", "institution": "Shandong University"}, {"id": 87855, "fullname": "Baochang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87855?format=json", "institution": "Beihang University"}, {"id": 192742, "fullname": "Zhe Min", "url": "http://cvpr.thecvf.com/api/miniconf/users/192742?format=json", "institution": "Shandong University"}], "abstract": "Monocular depth estimation serves as a core technique in endoscopic applications such as 3D reconstruction and localization. However, most existing methods focus primarily on in-domain depth estimation, which limits their robustness and prevents them from delivering impressive cross-domain performance, due to variations in depth distributions, illumination conditions, and texture patterns. In this work, we propose Depth Any Endoscopy (DAE), a novel self-supervised framework for generalizable depth estimation in monocular endoscopy. To specify, we develop a dual-level Mixture-of-Experts (MoE) adaptation paradigm that effectively tailors Vision Foundation Models to diverse endoscopic procedures, such as laparoscopy and colonoscopy, accounting for the challenges posed by varying environments. Internally, we integrate LoRA and Adapter modules within the MoE architecture, allowing the model to flexibly adapt to the characteristics of input data. Externally, a mixture of domain-specific experts provides customized guidance to enhance the training stability. In addition, we introduce a learnable gradient harmonization mechanism to dynamically balance the optimization between the depth and pose networks, along with a semantic distribution calibration module that strengthens the semantic consistency of depth predictions. Extensive experiments demonstrate that the proposed DAE achieves state-of-the-art performance in both zero-shot and in-domain depth estimation scenarios.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39733", "url": null, "sourceid": 34470, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39735, "uid": "45ff734cddd7485e2082348bed6dbcda", "name": "Training-free, Perceptually Consistent Low-Resolution Previews with High-Resolution Image for Efficient Workflows of Diffusion Models", "authors": [{"id": 100177, "fullname": "Wongi Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/100177?format=json", "institution": "Seoul National University"}, {"id": 147333, "fullname": "Hoigi Seo", "url": "http://cvpr.thecvf.com/api/miniconf/users/147333?format=json", "institution": "Seoul National University"}, {"id": 87674, "fullname": "Se Young Chun", "url": "http://cvpr.thecvf.com/api/miniconf/users/87674?format=json", "institution": "Seoul National University"}], "abstract": "Image generative models have become indispensable tools to yield exquisite high-resolution (HR) images for everyone, ranging from general users to professional designers. However, a desired outcome often requires generating a large number of HR images with different prompts and seeds, resulting in high computational cost for both users and service providers. Generating low-resolution (LR) images first could alleviate computational burden, but it is not straightforward how to generate LR images that are perceptually consistent with their HR counterparts. Here, we consider the task of generating high-fidelity LR images, called Previews, that preserve perceptual similarity of their HR counterparts for an efficient workflow, allowing users to identify promising candidates before generating the final HR image. We propose the commutator-zero condition to ensure the LR-HR perceptual consistency for flow matching models, leading to the proposed training-free solution with downsampling matrix selection and commutator-zero guidance. Extensive experiments show that our method can generate LR images with up to 33\\% computation reduction while maintaining HR perceptual consistency. When combined with existing acceleration techniques, our method achieves up to 3$\\times$ speedup. Moreover, our formulation can be extended to image manipulations, such as warping and translation, demonstrating its generalizability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39735", "url": null, "sourceid": 37976, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39739, "uid": "96deb4fd26fa5d5a385c22a6a4c5b48f", "name": "ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation", "authors": [{"id": 179638, "fullname": "Mingyang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/179638?format=json", "institution": "Texas A&amp;M University - College Station"}, {"id": 192757, "fullname": "Ashirbad Mishra", "url": "http://cvpr.thecvf.com/api/miniconf/users/192757?format=json", "institution": "eBay Inc."}, {"id": 184014, "fullname": "Soumik Dey", "url": "http://cvpr.thecvf.com/api/miniconf/users/184014?format=json", "institution": "Amazon Inc."}, {"id": 179637, "fullname": "Shuo Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/179637?format=json", "institution": "Google; Texas A&amp;M University - College Station"}, {"id": 192758, "fullname": "Naveen Ravipati", "url": "http://cvpr.thecvf.com/api/miniconf/users/192758?format=json", "institution": "eBay Inc."}, {"id": 192759, "fullname": "Hansi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192759?format=json", "institution": "eBay Inc."}, {"id": 192760, "fullname": "Binbin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192760?format=json", "institution": "eBay Inc."}, {"id": 155027, "fullname": "Zhengzhong Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155027?format=json", "institution": "Texas A&amp;M University - College Station"}], "abstract": "Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions, yet preserving fine-grained object identity under changing viewpoints remains a persistent challenge. Unlike text-to-video models, existing I2V pipelines often suffer from appearance drift and geometric distortion, artifacts we attribute to the sparsity of single-view 2D observations and weak cross-modal alignment. Here we address this problem from both data and model perspectives. First, we curate ConsIDVid, a large-scale object-centric dataset built with a scalable pipeline for high-quality, temporally aligned videos, and establish ConsIDVid-Bench, where we present a novel benchmarking and evaluation framework for multi-view consistency using metrics sensitive to subtle geometric and appearance deviations. We further propose ConsID-Gen, a view-assisted I2V generation framework that augments the first frame with unposed auxiliary views and fuses semantic and structural cues via a dual-stream visual\u2013geometric encoder as well as a text\u2013visual connector, yielding unified conditioning for a Diffusion Transformer backbone. Experiments across ConsIDVid-Bench demonstrate that ConsID-Gen consistently outperforms in multiple metrics, with the best overall performance surpassing leading video generation models like Wan2.1 and HunyuanVideo, delivering superior identity fidelity and temporal coherence under challenging real-world scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39739", "url": null, "sourceid": 43264, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39742, "uid": "71893eee68115a8116f0534826a2e155", "name": "ArtPro: Self-Supervised Articulated Object Reconstruction with Adaptive Integration of Mobility Proposals", "authors": [{"id": 192768, "fullname": "Xuelu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192768?format=json", "institution": "Shandong University"}, {"id": 181381, "fullname": "Zhaonan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181381?format=json", "institution": "Shandong University"}, {"id": 105137, "fullname": "Xiaogang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/105137?format=json", "institution": "Southwest University"}, {"id": 149152, "fullname": "Lei Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/149152?format=json", "institution": "Shandong University"}, {"id": 158522, "fullname": "Manyi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/158522?format=json", "institution": "Shandong University"}, {"id": 149490, "fullname": "Changhe Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/149490?format=json", "institution": "Shandong University"}], "abstract": "Reconstructing articulated objects into high-fidelity digital twins is crucial for applications such as robotic manipulation and interactive simulation. Recent self-supervised methods using differentiable rendering frameworks like 3D Gaussian Splatting remain highly sensitive to the initial part segmentation. Their reliance on heuristic clustering or pre-trained models often causes optimization to converge to local minima, especially for complex multi-part objects.To address these limitations, we propose ArtPro, a novel self-supervised framework that introduces adaptive integration of mobility proposals. Our approach begins with an over-segmentation initialization guided by geometry features and motion priors, generating part proposals with plausible motion hypotheses. During optimization, we dynamically merge these proposals by analyzing motion consistency among spatial neighbors, while a collision-aware motion pruning mechanism prevents erroneous kinematic estimation. Extensive experiments on both synthetic and real-world objects demonstrate that ArtPro achieves robust reconstruction of complex multi-part objects, significantly outperforming existing methods in accuracy and stability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39742", "url": null, "sourceid": 41443, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39743, "uid": "66f185be8c09e6746e18ff160e90b380", "name": "SplitFlux: Learning to Decouple Content and Style from a Single Image", "authors": [{"id": 85126, "fullname": "Yitong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85126?format=json", "institution": "Guizhou University"}, {"id": 192769, "fullname": "Yinglin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192769?format=json", "institution": "Shanghai University of Finance and Economics, Shanghai 200433, China"}, {"id": 169665, "fullname": "Changshuo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/169665?format=json", "institution": "Nanyang Technological University"}, {"id": 76300, "fullname": "Yongjun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76300?format=json", "institution": "Guizhou University"}, {"id": 131208, "fullname": "Ziyang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/131208?format=json", "institution": "Guizhou University"}, {"id": 166830, "fullname": "Shuting He", "url": "http://cvpr.thecvf.com/api/miniconf/users/166830?format=json", "institution": "Shanghai University of Finance and Economics"}], "abstract": "Disentangling image content and style is essential for customized image generation. Existing SDXL-based methods struggle to achieve high-quality results,  while the recently proposed Flux model fails to achieve effective content\u2013style separation due to its underexplored characteristics. To address these challenges, we conduct a systematic analysis of Flux and make two key observations: (1) Single Dream Blocks are essential for image generation; and (2) Early single stream blocks mainly control content, whereas later blocks govern style. Based on these insights, we propose SplitFlux, which disentangles content and style by fine-tuning the single dream blocks via LoRA, enabling the disentangled content to be re-embedded into new contexts. It includes two key components: (1) Rank-Constrained Adaptation. To preserve content identity and structure, we compress the rank and amplify the magnitude of updates within specific blocks, preventing content leakage into style blocks. (2) Visual-Gated LoRA. We split the content LoRA into two branches with different ranks, guided by image saliency. The high-rank branch preserves primary subject information, while the low-rank branch encodes residual details, mitigating content overfitting and enabling seamless re-embedding. Extensive experiments demonstrate that SplitFlux consistently outperforms state-of-the-art methods, achieving superior content preservation and stylization quality across diverse scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39743", "url": null, "sourceid": 38385, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39747, "uid": "724e901b27069d807cbb81cbf93ba374", "name": "SGDE: Self-supervised Geometry Degradation Estimation Framework for Coded Aperture Compressive Spectral Imaging", "authors": [{"id": 180505, "fullname": "Yuqiao He", "url": "http://cvpr.thecvf.com/api/miniconf/users/180505?format=json", "institution": "Hunan University"}, {"id": 192774, "fullname": "Xiaoyan LIU", "url": "http://cvpr.thecvf.com/api/miniconf/users/192774?format=json", "institution": "Hunan University"}, {"id": 192775, "fullname": "Jianxu Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192775?format=json", "institution": "Hunan University"}, {"id": 128787, "fullname": "Yaonan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128787?format=json", "institution": "Hunan University"}, {"id": 192776, "fullname": "Hui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192776?format=json", "institution": "Hunan University"}, {"id": 192777, "fullname": "Lizhu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192777?format=json", "institution": "Hunan University"}, {"id": 192778, "fullname": "Yurong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192778?format=json", "institution": "Hunan University"}, {"id": 192779, "fullname": "Wenbin He", "url": "http://cvpr.thecvf.com/api/miniconf/users/192779?format=json", "institution": "Hunan University"}], "abstract": "Coded Aperture Snapshot Spectral Imaging (CASSI) has emerged as a prominent technique for efficient hyperspectral imaging. However, the strong coupling between physical encoding and computational decoding makes CASSI highly sensitive to minor hardware misalignments, which can significantly degrade reconstruction quality. Existing methods either assume ideal imaging conditions, or rely on offline calibration, making them vulnerable to dynamic perturbations, such as thermal expansion and mechanical vibration that cause mask shifts. To address these limitations, we propose a Self-Supervised Geometry Degradation Estimation (SGDE) framework that explicitly models mask misalignments as an affine transformation and embeds it into the imaging model. SGDE jointly estimates affine parameters and reconstructs the hyperspectral image in a self-supervised manner, eliminating the need for calibration targets or device-specific training data. Furthermore, we introduce a multi-kernel estimation strategy to enhance calibration robustness under large perturbations. Extensive experiments on both simulated and real-world datasets demonstrate that SGDE achieves superior robustness against geometric degradations. Moreover, the estimated affine parameters can be directly integrated into existing reconstruction algorithms, enabling plug-and-play calibration for practical CASSI systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39747", "url": null, "sourceid": 44877, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39749, "uid": "418140029d08ec9365aebdc9542616a0", "name": "PP-Brep: Few-Shot B-rep Classification with Hybrid Graph Representation", "authors": [{"id": 192781, "fullname": "Hao jiacheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192781?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 173517, "fullname": "Chunying Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/173517?format=json", "institution": "Northwestern Polytechnical University; Northwest A&amp;F University"}, {"id": 158723, "fullname": "Hao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/158723?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 144686, "fullname": "RuohanWang ruohan", "url": "http://cvpr.thecvf.com/api/miniconf/users/144686?format=json", "institution": "Northwestern Polytechinical University"}, {"id": 87165, "fullname": "Hongping Gan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87165?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 158726, "fullname": "Yilei Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/158726?format=json", "institution": "the Northwestern Polytechnical University"}], "abstract": "In industrial settings, classification of 3D CAD models are critical for efficient manufacturing. However, the limited availability of annotated CAD models presents an obstacle to achieving rapid adaptation in few-shot part classification scenarios. In this paper, we propose a hybrid graph representation and a pre-training and graph prompt framework for B-rep few-shot classification. Specifically, hybrid graph representation captures comprehensive and multi-level structural information of B-rep models by constructing local topology graph, global parallel graph and regional association hypergraph. A hierarchical graph network then fuses component-level structures with topological details in the hybrid graph. Reinforcement-augmented contrastive pre-training produces robust universal representations while in-place perturbation reduces training time. Structure-aware graph prompts finally produce node-specific cues, enabling few-shot B-rep part classification without heavy fine-tuning. Experiments on the TraceParts-11and FabWave-31 datasets show that our method outperforms existing general-purpose approaches. This work provides an efficient and state-of-the-art solution for few-shot B-rep part classification.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39749", "url": null, "sourceid": 32190, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39754, "uid": "6ec70e52e105bf01ba41639601076d5b", "name": "Reconstruction-Guided Slot Curriculum: Addressing Object Over-Fragmentation in Video Object-Centric Learning", "authors": [{"id": 87385, "fullname": "WonJun Moon", "url": "http://cvpr.thecvf.com/api/miniconf/users/87385?format=json", "institution": "Sungkyunkwan University"}, {"id": 87337, "fullname": "Hyun Seok Seong", "url": "http://cvpr.thecvf.com/api/miniconf/users/87337?format=json", "institution": "Sungkyunkwan University"}, {"id": 87383, "fullname": "Jae-Pil Heo", "url": "http://cvpr.thecvf.com/api/miniconf/users/87383?format=json", "institution": "Sungkyunkwan University"}], "abstract": "Video Object\u2011Centric Learning seeks to decompose raw videos into a small set of object slots, but existing slot\u2011attention models often suffer from severe over\u2011fragmentation.This is because the model is implicitly encouraged to occupy all slots to minimize the reconstruction objective, thereby representing a single object with multiple redundant slots.We tackle this limitation with a reconstruction\u2011guided slot curriculum (SlotCurri). Training starts with only a few coarse slots and progressively allocates new slots where reconstruction error remains high, thus expanding capacity only where it is needed and preventing fragmentation from the outset. Yet, during slot expansion, meaningful sub\u2011parts can emerge only if coarse\u2011level semantics are already well separated; however, with a small initial slot budget and an MSE objective, semantic boundaries remain blurry.Therefore, we augment MSE with a structure\u2011aware loss that preserves local contrast and edge information to encourage each slot to sharpen its semantic boundaries.Lastly, we propose a cyclic inference that rolls slots forward and then backward through the frame sequence, producing temporally consistent object representations even in the earliest frames.All combined, SlotCurri addresses object over-fragmentation by allocating representational capacity where reconstruction fails, further enhanced by structural cues and cyclic inference.Notable FG-ARI gains of +6.8 on YouTube-VIS and +8.3 on MOVi-C validate the effectiveness of SlotCurri.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39754", "url": null, "sourceid": 45116, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39755, "uid": "b884881fa38175c803d9084ac18e39b9", "name": "Learning Straight Flows: Variational Flow Matching for Efficient Generation", "authors": [{"id": 147395, "fullname": "Chenrui Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/147395?format=json", "institution": "University of California, Irvine"}, {"id": 169031, "fullname": "Xi Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/169031?format=json", "institution": "Oak Ridge National Laboratory; University of Alabama at Birmingham"}, {"id": 190800, "fullname": "Tianyang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190800?format=json", "institution": "University of Alabama at Birmingham"}, {"id": 169935, "fullname": "Xiao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/169935?format=json", "institution": "Oak ridge national lab"}, {"id": 192793, "fullname": "Yanning Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192793?format=json", "institution": "University of California, Irvine"}], "abstract": "Flow Matching has limited ability in achieving one-step generation due to its reliance on learned curved trajectories. Previous studies have attempted to address this limitation by either modifying the coupling distribution to prevent interpolant intersections or introducing consistency and mean-velocity modeling to promote straight trajectory learning. However, these approaches often suffer from discrete approximation errors, training instability, and convergence difficulties. To tackle these issues, in the present work, we propose \\textbf{S}traight \\textbf{V}ariational \\textbf{F}low \\textbf{M}atching (\\textbf{S-VFM}), which integrates a variational latent code representing the ``generation overview'' into the Flow Matching framework. \\textbf{S-VFM} explicitly enforces trajectory straightness, ideally producing linear generation paths. The proposed method achieves competitive performance across three challenge benchmarks and demonstrates advantages in both training and inference efficiency compared with existing methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39755", "url": null, "sourceid": 31263, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39756, "uid": "9d2596a5514f8021eaf84fc7e42f792b", "name": "ReLaGS: Relational Language Gaussian Splatting", "authors": [{"id": 113852, "fullname": "Yaxu Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/113852?format=json", "institution": "German Research Center for Artificial Intelligence"}, {"id": 192794, "fullname": "Abdalla Arafa", "url": "http://cvpr.thecvf.com/api/miniconf/users/192794?format=json", "institution": "German Research Center for AI; Rheinland-Pf\u00e4lzische Technische Universit\u00e4t"}, {"id": 142824, "fullname": "Alireza Javanmardi", "url": "http://cvpr.thecvf.com/api/miniconf/users/142824?format=json", "institution": "DFKI"}, {"id": 130883, "fullname": "Christen Millerdurai", "url": "http://cvpr.thecvf.com/api/miniconf/users/130883?format=json", "institution": "Max Planck Institute for Informatics"}, {"id": 192795, "fullname": "Jia Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192795?format=json", "institution": "University of Modena and Reggio Emilia"}, {"id": 130393, "fullname": "Shaoxiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130393?format=json", "institution": "German Research Center for AI"}, {"id": 130375, "fullname": "Alain Pagani", "url": "http://cvpr.thecvf.com/api/miniconf/users/130375?format=json", "institution": "German Research Center for Artificial Intelligence (DFKI)"}, {"id": 89910, "fullname": "Didier Stricker", "url": "http://cvpr.thecvf.com/api/miniconf/users/89910?format=json", "institution": "Universit\u00e4t Kaiserslautern"}], "abstract": "Achieving unified 3D perception and reasoning across tasks such as segmentation, retrieval, and relation understanding remains challenging, as existing methods are either object-centric or rely on costly training for inter-object reasoning. We present a novel framework that constructs a hierarchical language-distilled Gaussian scene and its 3D semantic scene graph without scene-specific training. A Gaussian pruning mechanism refines scene geometry, while a robust multi-view language alignment strategy aggregates noisy 2D features into accurate 3D object embeddings. On top of this hierarchy, we build an open-vocabulary 3D scene graph with Vision Language-derived annotations and Graph Neural Network-based relational reasoning. Our approach enables efficient and scalable open-vocabulary 3D reasoning by jointly modeling hierarchical semantics and inter/intra-object relationships, validated across tasks including open-vocabulary segmentation, scene graph generation, and relation-guided retrieval.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39756", "url": null, "sourceid": 45340, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39758, "uid": "b2f24ad52942ea9eadda8ac7b7db4f54", "name": "Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models", "authors": [{"id": 153913, "fullname": "Jiadong Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153913?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 77213, "fullname": "Liang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/77213?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 87023, "fullname": "Yuxin Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87023?format=json", "institution": "Peking University"}, {"id": 105067, "fullname": "Yu-Ming Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/105067?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 185013, "fullname": "Shuohuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185013?format=json", "institution": "Baidu"}, {"id": 84960, "fullname": "Yu Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/84960?format=json", "institution": "Baidu"}, {"id": 84968, "fullname": "Hua Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84968?format=json", "institution": "Baidu"}, {"id": 85019, "fullname": "Qingming Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85019?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 84919, "fullname": "Haifeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84919?format=json", "institution": "Baidu"}], "abstract": "Recently, unified multimodal models (UMMs) have made remarkable progress in integrating visual understanding and generation, demonstrating strong potential for complex text-to-image (T2I) tasks. Despite their theoretical promise, a persistent capability gap exists: UMMs typically exhibit superior visual understanding but comparatively weaker generative capabilities.This discrepancy arises largely from the intrinsic decoupling between the understanding and generation processes.While a UMM can accurately interpret fine-grained visual details, it often struggles to produce semantically coherent images from complex textual prompts.To address this challenge, we explore UMMs' internal understanding capability to enhance generation quality.We propose a token-level intrinsic text-image alignment reward mechanism,GvU, enabling the UMM to act simultaneously as teacher and student: it evaluates its own outputs using the understanding branch to guide the generations accordingly. Building upon this, we design a self-supervised reinforcement learning framework, allowing UMMs to iteratively improve their generation quality through understanding-based intrinsic reward signals\u2014without reliance on external supervision.Experimental results show that our method substantially boosts UMMs' generation, which in turn strengthens their fine-grained visual understanding, narrowing the capability gap between UMMs' visual understanding and generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39758", "url": null, "sourceid": 41169, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39761, "uid": "666ea6cdce817ac66f83e17f0229b4d7", "name": "AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors", "authors": [{"id": 180313, "fullname": "Matic Fu\u010dka", "url": "http://cvpr.thecvf.com/api/miniconf/users/180313?format=json", "institution": "University of Ljubljana"}, {"id": 129794, "fullname": "Vitjan Zavrtanik", "url": "http://cvpr.thecvf.com/api/miniconf/users/129794?format=json", "institution": "University of Ljubljana"}, {"id": 180314, "fullname": "Danijel Sko\u010daj", "url": "http://cvpr.thecvf.com/api/miniconf/users/180314?format=json", "institution": "University of Ljubljana, Faculty of Computer and Information Science"}], "abstract": "Zero-shot anomaly detection aims to detect and localise abnormal regions in the image without access to any in-domain training images. While recent approaches leverage vision\u2013language models (VLMs), such as CLIP, to transfer high-level concept knowledge, methods based on purely vision foundation models (VFMs), like DINOv2, have lagged behind in performance. We argue that this gap stems from two practical issues: (i) limited diversity in existing auxiliary anomaly detection datasets and (ii) overly shallow VFM adaptation strategies. To address both challenges, we propose AnomalyVFM, a general and effective framework that turns any pretrained VFM into a strong zero-shot anomaly detector. Our approach combines a robust three-stage synthetic dataset generation scheme with a parameter-efficient adaptation mechanism, utilising low-rank feature adapters and a confidence-weighted pixel loss. Together, these components enable modern VFMs to substantially outperform current state-of-the-art methods. More specifically, with RADIO as a backbone, AnomalyVFM achieves an average image-level AUROC of $94.1\\%$ across 9 diverse datasets, surpassing previous methods by significant $3.3$ percentage points. Code: \\textcolor{magenta}{Upon Acceptance}", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39761", "url": "https://maticfuc.github.io/anomaly_vfm/", "sourceid": 39109, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39770, "uid": "0d027acefa6f870bd312eed9df6d48ff", "name": "WorldGen: From Text to Traversable and Interactive 3D Worlds", "authors": [{"id": 154362, "fullname": "Dilin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154362?format=json", "institution": "Meta"}, {"id": 77465, "fullname": "Hyunyoung Jung", "url": "http://cvpr.thecvf.com/api/miniconf/users/77465?format=json", "institution": "Seoul National University"}, {"id": 152119, "fullname": "Tom Monnier", "url": "http://cvpr.thecvf.com/api/miniconf/users/152119?format=json", "institution": "Facebook"}, {"id": 84616, "fullname": "Kihyuk Sohn", "url": "http://cvpr.thecvf.com/api/miniconf/users/84616?format=json", "institution": "Google"}, {"id": 75768, "fullname": "Chuhang Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/75768?format=json", "institution": "Amazon"}, {"id": 87127, "fullname": "Xiaoyu Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87127?format=json", "institution": "Meta"}, {"id": 156703, "fullname": "Yu-Ying Yeh", "url": "http://cvpr.thecvf.com/api/miniconf/users/156703?format=json", "institution": "Meta"}, {"id": 128416, "fullname": "Di Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128416?format=json", "institution": "Rutgers University, New Brunswick"}, {"id": 77107, "fullname": "Zixuan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77107?format=json", "institution": "University of Illinois Urbana-Champaign"}, {"id": 126485, "fullname": "Thu Nguyen-Phuoc", "url": "http://cvpr.thecvf.com/api/miniconf/users/126485?format=json", "institution": "Reality Labs Research, Meta"}, {"id": 89705, "fullname": "Yuchen Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/89705?format=json", "institution": "Facebook"}, {"id": 192821, "fullname": "Sergiu Oprea", "url": "http://cvpr.thecvf.com/api/miniconf/users/192821?format=json", "institution": "Meta"}, {"id": 133010, "fullname": "Ziyan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/133010?format=json", "institution": "Facebook"}, {"id": 85874, "fullname": "Roman Shapovalov", "url": "http://cvpr.thecvf.com/api/miniconf/users/85874?format=json", "institution": "Meta"}, {"id": 76700, "fullname": "Nikolaos Sarafianos", "url": "http://cvpr.thecvf.com/api/miniconf/users/76700?format=json", "institution": "Meta Reality Labs"}, {"id": 126403, "fullname": "Thibault Groueix", "url": "http://cvpr.thecvf.com/api/miniconf/users/126403?format=json", "institution": "Adobe Systems"}, {"id": 129874, "fullname": "Antoine Toisoul", "url": "http://cvpr.thecvf.com/api/miniconf/users/129874?format=json", "institution": "Meta"}, {"id": 192822, "fullname": "Prithviraj Dhar", "url": "http://cvpr.thecvf.com/api/miniconf/users/192822?format=json", "institution": "Meta Platforms, Inc."}, {"id": 192823, "fullname": "Xiao Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192823?format=json", "institution": "Facebook"}, {"id": 128085, "fullname": "Minghao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/128085?format=json", "institution": "University of Oxford"}, {"id": 85065, "fullname": "Geon Yeong Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/85065?format=json", "institution": "Korea Advanced Institute of Science and Technology"}, {"id": 73496, "fullname": "Rakesh Ranjan", "url": "http://cvpr.thecvf.com/api/miniconf/users/73496?format=json", "institution": "Meta"}, {"id": 85624, "fullname": "Andrea Vedaldi", "url": "http://cvpr.thecvf.com/api/miniconf/users/85624?format=json", "institution": "University of Oxford"}], "abstract": "We introduce WorldGen, a method for generating large, fully formed, navigable 3D worlds from a single text prompt. Existing approaches to 3D scene generation often trade off scene diversity, completeness, and correctness in different ways. We push this envelope by producing large scenes explicitly decomposed into individual, high-quality 3D meshes, making them compatible with standard game engines. Our approach first uses a language-driven procedural generator to lay out the scene's basic volumes and navigable regions. An image generator then establishes the scene's theme, style, and details. Next, we obtain a high-quality, compositional 3D reconstruction of the planned scene. This step first uses an image-to-3D model to perform a holistic reconstruction that implicitly determines the shape and location of all scene objects, accounting for context and navigability. The reconstruction is then decomposed into individual entities, which are regenerated at higher resolution, synthesizing additional details with guidance from the image generator. We ablate key design choices and compare qualitatively against existing scene generators, showing that our design addresses many of their common challenges.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39770", "url": null, "sourceid": 30842, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39771, "uid": "572d375abf307ce8a9652a76f02df10e", "name": "VisiLock: Authorizing Instruction-based Image editing with Dual Score Distillation", "authors": [{"id": 85340, "fullname": "Thanh Van Le", "url": "http://cvpr.thecvf.com/api/miniconf/users/85340?format=json", "institution": "VinAI Research"}, {"id": 86434, "fullname": "Yun Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86434?format=json", "institution": "Northeastern University"}], "abstract": "While open-sourcing instruction-guided image editing models accelerates research, it surrenders control over their capabilities to anyone who downloads the weights.Existing protection methods are reactive: they verify ownership after generation, but the underlying model remains fully functional for unauthorized users.We introduce \\textbf{Visilock}, where access control is baked into model weights, rendering the model unusable without a visual trigger in the input.The challenge is training a model that retains editing capability for authorized input and remains unusable for unauthorized input, without destabilizing training.Naive multi-task objectives create gradient conflicts that collapse training, while contrastive approaches like FMLock destroy the denoising manifold.We develop \\textbf{Diverged Score Distillation}, a dual-teacher framework where a degraded teacher defines locked behavior and an original teacher guides editing quality, eliminating gradient interference through separate frozen targets.A key risk is that released models could be unlocked through post-hoc fine-tuning. To prevent this, we initialize the student model from the degraded teacher so that it begins in a locked state, and only regains editing ability for authorized inputs via distillation. This impedes adversarial fine-tuning from recovering full editing capability.Evaluation on InstructPix2Pix shows authorized edits maintain baseline quality (CLIP-I: 0.821, DINO: 0.726) while unauthorized attempts degrade substantially (CLIP-I: 0.481, DINO: 0.072) with 41\\% and 90\\% drops in image and semantic similarity.The lock remains robust to key corruptions, spatial perturbations, and adversarial unlock fine-tuning.Code and data will be available for research purposes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39771", "url": null, "sourceid": 38328, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39774, "uid": "352c158de620027ff0452ad48dd2c3b2", "name": "Neural Dynamic GI: Random-Access Neural Compression for Temporal Lightmaps in Dynamic Lighting Environments", "authors": [{"id": 174948, "fullname": "jianhui wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/174948?format=json", "institution": "ustc &amp; tencent"}, {"id": 176522, "fullname": "Jian Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/176522?format=json", "institution": "University of Science and Technology of China"}, {"id": 180471, "fullname": "Zhi Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/180471?format=json", "institution": "University of Science and Technology of China"}, {"id": 177849, "fullname": "Zhangjin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177849?format=json", "institution": "University of Science and Technology of China"}, {"id": 175337, "fullname": "Chao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/175337?format=json", "institution": "Tencent"}], "abstract": "High-quality global illumination (GI) in real-time rendering is commonly achieved using precomputed lighting techniques, with lightmap as the standard choice. To support GI for static objects in dynamic lighting environments, multiple lightmaps at different lighting conditions need to be precomputed, which incurs substantial storage and memory overhead.To overcome this limitation, we propose Neural Dynamic GI (NDGI), a novel compression technique specifically designed for temporal lightmap sets. Our method utilizes multi-dimensional feature maps and lightweight neural networks to integrate the temporal information instead of storing multiple sets explicitly, which significantly reduces the storage size of lightmaps. Additionally, we introduce a block compression (BC) simulation strategy during the training process, which enables BC compression on the final generated feature maps and further improves the compression ratio. To enable efficient real-time decompression, we also integrate a virtual texturing (VT) system with our neural representation.  Compared with prior methods, our approach achieves high-quality dynamic GI while maintaining remarkably low storage and memory requirements, with only modest real-time decompression overhead. To facilitate further research in this direction, we will release our temporal lightmap dataset precomputed in multiple scenes featuring diverse temporal variations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39774", "url": null, "sourceid": 37678, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39775, "uid": "1a56a7fdcd92b1520a3508d47cb5adbb", "name": "DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos", "authors": [{"id": 143769, "fullname": "Huang yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/143769?format=json", "institution": "Shanghai Kaiyong Information Technology Co., Ltd"}, {"id": 153759, "fullname": "Sijie Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153759?format=json", "institution": null}, {"id": 147682, "fullname": "Jing Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/147682?format=json", "institution": "ByteDance Inc."}, {"id": 192829, "fullname": "HaoXu HaoXu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192829?format=json", "institution": null}, {"id": 129821, "fullname": "Shaohui Jiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/129821?format=json", "institution": "Bytedance"}], "abstract": "Stereo video inpainting, which aims to fill the occluded regions of warped videos with visually coherent content while maintaining temporal consistency, remains a challenging open problem. The regions to be filled are scattered along object boundaries and occupy only a small fraction of each frame, leading to two key challenges. First, existing approaches perform poorly on such tasks due to the scarcity of high-quality stereo inpainting datasets, which limits their ability to learn effective inpainting priors. Second, these methods apply equal processing to all regions of the frame, even though most pixels require no modification, resulting in substantial redundant computation. To address these issues, we introduce three interconnected components. We first propose Gradient-Aware Parallax Warping (GAPW), which leverages backward warping and the gradient of the coordinate mapping function to obtain continuous edges and smooth occlusion regions. Then, a Parallax-Based Dual Projection (PBDP) strategy is introduced, which incorporates GAPW to produce geometrically consistent stereo inpainting pairs and accurate occlusion masks without requiring stereo videos. Finally, we present Sparsity-Aware Stereo Inpainting (SASI), which reduces over 70\\% of redundant tokens, achieving a 10.7$\\times$ speedup during diffusion inference and delivering results comparable to its full-computation counterpart, enabling real-time processing of HD ($768\\times1280$) videos at 25\\,FPS on a single A100 GPU.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39775", "url": null, "sourceid": 31665, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39777, "uid": "ccb255da5d69e61b4f86dd59c3f6aca7", "name": "Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos", "authors": [{"id": 97232, "fullname": "Yayuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/97232?format=json", "institution": "University of Michigan, Ann Arbor"}, {"id": 192831, "fullname": "Aadit Jain", "url": "http://cvpr.thecvf.com/api/miniconf/users/192831?format=json", "institution": "University of Michigan - Ann Arbor"}, {"id": 97456, "fullname": "Filippos Bellos", "url": "http://cvpr.thecvf.com/api/miniconf/users/97456?format=json", "institution": "University of Michigan - Ann Arbor"}, {"id": 95802, "fullname": "Jason Corso", "url": "http://cvpr.thecvf.com/api/miniconf/users/95802?format=json", "institution": "Voxel51; University of Michigan"}], "abstract": "We introduce Mistake Attribution (MATT), a task for fine-grained understanding of human mistakes in egocentric video. Unlike prior mistake understanding work, which lacks fine-grained output, MATT concretely attributes mistakes to the input instruction text or the attempt video.MATT determines what part of the instruction is violated (semantic role), when the deviation becomes irreversible (the Point-of-No-Return, PNR), and where the mistake appears in the PNR frame. We develop MisEngine, a data engine that automatically constructs attribution-rich mistake samples from existing datasets and inherits their annotations. Applied to large egocentric corpora, MisEngine yields EPIC-KITCHENS-M and Ego4D-M\u2014two datasets that are up to two orders of magnitude larger than prior mistake datasets. We then present MisFormer, a unified attention-based model for mistake attribution across semantic (what), temporal (when), and spatial (where) dimensions, trained using MisEngine supervision. Experiments on our new datasets and prior benchmarks show that MisFormer outperforms strong video-language, temporal localization, hand-object interaction, and mistake-detection baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39777", "url": null, "sourceid": 38918, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39773, "uid": "d996e31032e7c288d7e20e7b82221c20", "name": "InstantRetouch: Efficient and High-Fidelity Instruction-Guided Image Retouching with Bilateral Space", "authors": [{"id": 130080, "fullname": "Jiarui Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130080?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 131538, "fullname": "Yujin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131538?format=json", "institution": "Shanghai Artificial Intelligence  Laboratory"}, {"id": 175578, "fullname": "Ruikang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/175578?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 149220, "fullname": "Fan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/149220?format=json", "institution": "Tetras"}, {"id": 76562, "fullname": "Mingde Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/76562?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 87471, "fullname": "Tianfan Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/87471?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "Language-guided photo retouching aims to adjust color and tone while preserving geometry and texture. Recently, diffusion-based retouching shows a superior visual quality, but often struggles with both fidelity issues due to its generative nature and efficiency because of its iterative sampling process.In this work, we propose an efficient and fidelity-preserving retouching method using bilateral space manipulation, which is both compact and content-decoupled. Specifically, instead of directly editing pixels or image latents, our model predicts a low-resolution bilateral grid of affine transforms, which are sliced using a learned guidance map and then applied to the full-resolution image. This approach yields both high fidelity and improved efficiency.To retain strong priors of a pretrained generative model, we distill a multi-step diffusion model into our bilateral grid framework using Variational Score Distillation, complemented by a prompt alignment loss to guide instruction-following behavior. Additionally, we introduce a new benchmark and evaluate our method across multiple dimensions: fidelity, instruction following, and efficiency.Compared to the latest retouch methods, like Gemini-2.5-Flash (Nano-Banana), our method can avoid content drift, significantly improve latency, and generate visually pleasing edits, while maintaining a high level of fidelity.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39773", "url": null, "sourceid": 32883, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39778, "uid": "7888220d6fa80e0ba9f548a8ea9f1678", "name": "Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation", "authors": [{"id": 88227, "fullname": "Subin Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/88227?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 160686, "fullname": "Sangwoo Mo", "url": "http://cvpr.thecvf.com/api/miniconf/users/160686?format=json", "institution": "Pohang University of Science and Technology"}, {"id": 129638, "fullname": "Mamshad Nayeem Rizve", "url": "http://cvpr.thecvf.com/api/miniconf/users/129638?format=json", "institution": "Amazon"}, {"id": 180251, "fullname": "Yiran Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180251?format=json", "institution": "Adobe Research"}, {"id": 106443, "fullname": "Difan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/106443?format=json", "institution": "Adobe Research"}, {"id": 84533, "fullname": "Jinwoo Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/84533?format=json", "institution": "Korea Advanced Institute of Science and Technology"}, {"id": 85887, "fullname": "Tobias Hinz", "url": "http://cvpr.thecvf.com/api/miniconf/users/85887?format=json", "institution": "Adobe Systems"}], "abstract": "Achieving precise alignment between user intent and generated visuals remains a central challenge in text-to-visual generation, as a single attempt often fails to produce the desired output. To handle this, prior approaches mainly scale the visual generation process (e.g., increasing sampling steps or seeds), but this quickly leads to a quality plateau. This limitation arises because the prompt, crucial for guiding generation, is kept fixed. To address this, we propose Prompt Redesign for Inference-time Scaling, coined PRIS, a framework that adaptively revises the prompt during inference in response to the scaled visual generations. The core idea of PRIS is to review the generated visuals, identify recurring failure patterns across visuals, and redesign the prompt accordingly before regenerating the visuals with the revised prompt.To provide precise alignment feedback for prompt revision, we introduce a new verifier, element-level factual correction, which evaluates the alignment between prompt attributes and generated visuals at a fine-grained level, achieving more accurate and interpretable assessments than holistic measures.Extensive experiments on both text-to-image and text-to-video benchmarks demonstrate the effectiveness of our approach, including a 15\\% gain on VBench 2.0. These results highlight that jointly scaling prompts and visuals is key to fully leveraging scaling laws at inference-time.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39778", "url": null, "sourceid": 35269, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39782, "uid": "83218450304f89053114eaa3b1487815", "name": "From Attraction to Equilibrium: Physics-Inspired Semantic Gravitons for Zero-Shot Anomaly Detection", "authors": [{"id": 87316, "fullname": "Yuwen Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87316?format=json", "institution": "University of Science and Technology of China"}, {"id": 87373, "fullname": "Yuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87373?format=json", "institution": "University of Science and Technology of China"}, {"id": 192840, "fullname": "Shaohui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192840?format=json", "institution": "Zhejiang University"}, {"id": 192841, "fullname": "Zhi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192841?format=json", "institution": "Tsinghua University"}, {"id": 129105, "fullname": "Yu LIU", "url": "http://cvpr.thecvf.com/api/miniconf/users/129105?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 129104, "fullname": "You He", "url": "http://cvpr.thecvf.com/api/miniconf/users/129104?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Zero-shot anomaly detection (ZSAD) aims to detect unseen anomalies without any abnormal supervision, which is crucial for open-world scenarios where anomalies are diverse and unpredictable. By expressing normal and abnormal concepts in natural language, recent vision\u2013language models such as CLIP enable anomaly reasoning through shared visual\u2013textual embeddings. However, existing approaches rely on coarse prompt fusion, resulting in unstable alignment and inaccurate localization under domain shifts. To overcome these challenges, we propose the Semantic Graviton Network (SGNet), a physics-inspired framework that models multimodal alignment as an adaptive potential field. We introduce semantic gravitons, learnable dynamic mediators that bridge visual and textual modalities by establishing localized semantic equilibria through attraction and equilibrium forces. Within this framework, a graviton interaction network alternately performs text-to-graviton and vision-to-graviton coupling, progressively refining multimodal correspondence and promoting structured semantic binding. Furthermore, an energy-based potential regularization, composed of attraction and equilibrium forces, constrains the evolution of these interactions, ensuring stability and interpretability in the learned representations. Extensive experiments on ten industrial and medical benchmarks demonstrate that SGNet achieves state-of-the-art zero-shot anomaly detection performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39782", "url": null, "sourceid": 39335, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39783, "uid": "8ed420bdc5a06207eecefd77318cebf9", "name": "Focal\u2013General Diffusion Model with Semantic Consistent Guidance for Sign Language Production", "authors": [{"id": 181096, "fullname": "Yiheng Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181096?format=json", "institution": "Zhejiang University of Technology - Pingfeng Campus"}, {"id": 192842, "fullname": "Sheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192842?format=json", "institution": null}, {"id": 192843, "fullname": "Yuan Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192843?format=json", "institution": null}, {"id": 192844, "fullname": "Zhelun Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192844?format=json", "institution": "Zhejiang University of Technology"}, {"id": 192845, "fullname": "Yining Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192845?format=json", "institution": "Zhejiang University of Technology"}, {"id": 192846, "fullname": "Min Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192846?format=json", "institution": "Zhejiang University of Technology"}], "abstract": "Sign Language Production (SLP) aims to translate spoken language into sign sequences, where the main challenge lies in generating coherent and natural poses from discrete glosses (G2P). Existing G2P methods typically treat each pose as an indivisible unit, limiting their ability to capture fine-grained joint-level dependencies and thus degrading pose quality. To address this, we propose the Focal\u2013General Diffusion Model (FGDM), characterized by a pioneering two-stage denoising framework that harmonizes local joint-level dependencies and global coherence. Specifically, in the Focal stage, a novel Adaptive Sign GCN (ASGCN) adaptively models each pose based on contextual correlations, skeletal topology, and semantic conditions, ensuring precise generation of local details. In the General stage, a Transformer-based module refines the entire pose sequence to enhance global coherence and naturalness. Moreover, we introduce a Semantic Consistent Guidance (SCG) mechanism that seamlessly integrates semantic supervision into diffusion training, enforcing tighter alignment between generated pose sequences and their intended gloss semantics. Extensive experiments on PHOENIX14T and USTC-CSL demonstrate that FGDM achieves SOTA performance. The source code will be released on GitHub.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39783", "url": null, "sourceid": 43171, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39784, "uid": "180d9ac0990bb42907fe9cc7aa3eb5a1", "name": "Adaptive Bayesian Early-Exit Networks for Efficient Non-Transferable Learning", "authors": [{"id": 176307, "fullname": "Siyu Luan", "url": "http://cvpr.thecvf.com/api/miniconf/users/176307?format=json", "institution": "University of Copenhagen"}, {"id": 187131, "fullname": "Yan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187131?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 184143, "fullname": "Zhong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184143?format=json", "institution": "Southern Illinois University Carbondale"}, {"id": 129403, "fullname": "Zhenyi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129403?format=json", "institution": "University of Central Florida"}], "abstract": "Non-transferable learning (NTL) aims to enforce usage restrictions by limiting a model\u2019s generalization on target-domain data while maintaining its utility on the source domain. Current approaches face three major challenges: (1) low training efficiency due to retraining of the backbone network, (2) low inference efficiency, and (3) a rigid reliance on a shared, non-adaptive backbone network spanning both source and target domains. This shared setup, which aims to maximize source-domain performance and minimize target-domain performance, often introduces optimization conflicts due to overlapping class categories across source and target domains.In this paper, we propose a novel and efficient NTL approach using a dynamic Early-Exit Network, named ENL-DEE, which leverages Bayesian theory and dynamic neural networks to address these limitations. Our custom loss function guides source-domain data to exit at later stages of the network, maximizing model utility, while target-domain data exits earlier with non-semantic features, ensuring limited transferability. ENL-DEE offers three key advantages: (1) it enhances training efficiency by optimizing only the parameters of dynamic exit classifiers, bypassing the need to retrain the backbone; (2) it improves inference efficiency as data exits at various exit classifiers in the network; and (3) it resolves optimization conflicts by using distinct parameter sets for source and target domains, achieving higher performance on the source domain and lower performance on the target domain, thereby strengthening NTL. Extensive experiments across diverse datasets and model architectures validate the scalability, efficiency, and effectiveness of our approach.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39784", "url": null, "sourceid": 44731, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39785, "uid": "37086104cef13b38ab3f950584b6929c", "name": "WorldStereo: Bridging Controllable Video Generation and Scene Reconstruction via 3D Geometric Memories", "authors": [{"id": 89329, "fullname": "Yisu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89329?format=json", "institution": "Zhejiang University"}, {"id": 76242, "fullname": "Chenjie Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/76242?format=json", "institution": "Fudan University"}, {"id": 85462, "fullname": "Tengfei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85462?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 192847, "fullname": "Xuhui Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192847?format=json", "institution": "tencent"}, {"id": 192848, "fullname": "Junta Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192848?format=json", "institution": "Tencent"}, {"id": 89309, "fullname": "Jianke Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89309?format=json", "institution": "Zhejiang University"}, {"id": 129664, "fullname": "Chunchao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/129664?format=json", "institution": "Tencent"}], "abstract": "Recent advances in foundational Video Diffusion Models (VDMs) have yielded significant progress. Yet, despite the remarkable visual quality of generated videos, reconstructing consistent 3D scenes from these outputs remains challenging, due to limited camera controllability and inconsistent generated content when viewed from distinct camera trajectories.In this paper, we propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules. Formally, the global-geometric memory enables precise camera control while injecting coarse structural priors through incrementally updated point clouds.Moreover, the spatial-stereo memory constrains the model's attention receptive fields with 3D correspondence to focus on fine-grained details from the memory bank.These components enable WorldStereo to generate multi-view-consistent videos under precise camera control, facilitating high-quality 3D reconstruction.Furthermore, the flexible control branch-based WorldStereo shows impressive efficiency, benefiting from the distribution matching distilled VDM backbone without joint training.Extensive experiments across both camera-guided video generation and 3D reconstruction benchmarks demonstrate the effectiveness of our approach. Notably, we show that WorldStereo acts as a powerful world model, tackling diverse scene generation tasks (whether starting from perspective or panoramic images) with high-fidelity 3D results. Models will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39785", "url": null, "sourceid": 43636, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39786, "uid": "796e2f57fd87b9b44251e692e269f0bf", "name": "InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity", "authors": [{"id": 183724, "fullname": "Haoming Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183724?format=json", "institution": "University of Pittsburgh"}, {"id": 153816, "fullname": "Qiyao Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/153816?format=json", "institution": "University of Pittsburgh"}, {"id": 153819, "fullname": "Wei Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153819?format=json", "institution": "University of Pittsburgh"}], "abstract": "Modern vision-language models (VLMs) are expected to have abilities of spatial reasoning with diverse scene complexities, but evaluating such abilities is difficult due to the lack of benchmarks that are not only diverse and scalable but also fully customizable. Existing benchmarks offer limited customizability over the scene complexity and are incapable of isolating and analyzing specific VLM failure modes under distinct spatial conditions. To address this gap, instead of individually presenting benchmarks for different scene complexities, in this paper we present InfiniBench, a fully automated, customizable and user-friendly benchmark generator that can synthesize a theoretically infinite variety of 3D scenes with parameterized control on scene complexity. InfiniBench uniquely translates scene descriptions in natural language into photo-realistic videos with complex and physically plausible 3D layouts. This is achieved through three key innovations: 1) a LLM-based agentic framework that iteratively refines procedural scene constraints from scene descriptions; 2) a flexible cluster-based layout optimizer that generates dense and cluttered scenes previously intractable for procedural methods; and 3) a task-aware camera trajectory optimization method that renders scenes into videos with full object coverage as VLM input. Experiments demonstrate that InfiniBench outperforms state-of-the-art procedural and LLM-based 3D generation methods in prompt fidelity and physical plausibility, especially in high-complexity scenarios. We further showcased the usefulness of InfiniBench, by generating benchmarks for representative spatial reasoning tasks including measurement, perspective-taking and spatiotemporal tracking.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39786", "url": null, "sourceid": 45135, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40382?format=json"], "related_events_ids": [40382]}, {"id": 40382, "uid": "796e2f57fd87b9b44251e692e269f0bf", "name": "InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity", "authors": [{"id": 183724, "fullname": "Haoming Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183724?format=json", "institution": "University of Pittsburgh"}, {"id": 153816, "fullname": "Qiyao Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/153816?format=json", "institution": "University of Pittsburgh"}, {"id": 153819, "fullname": "Wei Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153819?format=json", "institution": "University of Pittsburgh"}], "abstract": "Modern vision-language models (VLMs) are expected to have abilities of spatial reasoning with diverse scene complexities, but evaluating such abilities is difficult due to the lack of benchmarks that are not only diverse and scalable but also fully customizable. Existing benchmarks offer limited customizability over the scene complexity and are incapable of isolating and analyzing specific VLM failure modes under distinct spatial conditions. To address this gap, instead of individually presenting benchmarks for different scene complexities, in this paper we present InfiniBench, a fully automated, customizable and user-friendly benchmark generator that can synthesize a theoretically infinite variety of 3D scenes with parameterized control on scene complexity. InfiniBench uniquely translates scene descriptions in natural language into photo-realistic videos with complex and physically plausible 3D layouts. This is achieved through three key innovations: 1) a LLM-based agentic framework that iteratively refines procedural scene constraints from scene descriptions; 2) a flexible cluster-based layout optimizer that generates dense and cluttered scenes previously intractable for procedural methods; and 3) a task-aware camera trajectory optimization method that renders scenes into videos with full object coverage as VLM input. Experiments demonstrate that InfiniBench outperforms state-of-the-art procedural and LLM-based 3D generation methods in prompt fidelity and physical plausibility, especially in high-complexity scenarios. We further showcased the usefulness of InfiniBench, by generating benchmarks for representative spatial reasoning tasks including measurement, perspective-taking and spatiotemporal tracking.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40382", "url": null, "sourceid": -45135, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39786?format=json"], "related_events_ids": [39786]}, {"id": 39793, "uid": "c7ced588de969dbd96e09067876bab3a", "name": "LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding", "authors": [{"id": 156304, "fullname": "Jihao Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156304?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 89218, "fullname": "Lingxi Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/89218?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 192873, "fullname": "Xinyue Huo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192873?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 88171, "fullname": "Qi Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/88171?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 87065, "fullname": "Qixiang Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/87065?format=json", "institution": "University of Chinese Academy of Sciences"}], "abstract": "This paper addresses the critical and underexplored challenge of long video understanding with low computational budgets.We propose LongVideo-R1, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video context navigation, avoiding the redundancy of exhaustive search.At the core of LongVideo-R1 lies a reasoning module that leverages high-level visual cues to infer the most informative video clip for subsequent processing.During inference, the agent initiates traversal from top-level visual summaries and iteratively refines its focus, immediately halting the exploration process upon acquiring sufficient knowledge to answer the query.To facilitate training, we first extract hierarchical video captions from CGBench, a video corpus with grounding annotations, and guide GPT-5 to generate 33K high-quality chain-of-thought-with-tool trajectories. The LongVideo-R1 agent is fine-tuned upon the Qwen-3-8B model through a two-stage paradigm: supervised fine-tuning (SFT) followed by reinforcement learning (RL), where RL employs a specifically designed reward function to maximize selective and efficient clip navigation.Experiments on multiple long video benchmarks validate the effectiveness of name, which enjoys superior tradeoff between QA accuracy and efficiency. All curated data and source code are provided in the supplementary material and will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39793", "url": null, "sourceid": 34996, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39794, "uid": "0fbb86b06c2fa661433282ee30ab3723", "name": "MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping", "authors": [{"id": 181208, "fullname": "Shiyao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181208?format=json", "institution": "IP Paris &amp; Inria"}, {"id": 77312, "fullname": "Antoine Gu\u00e9don", "url": "http://cvpr.thecvf.com/api/miniconf/users/77312?format=json", "institution": "Ecole des Ponts ParisTech"}, {"id": 89359, "fullname": "Shizhe Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89359?format=json", "institution": "INRIA"}, {"id": 75898, "fullname": "Vincent Lepetit", "url": "http://cvpr.thecvf.com/api/miniconf/users/75898?format=json", "institution": "Ecole des Ponts ParisTech"}], "abstract": "Active mapping aims to determine how an agent should move to efficiently reconstruct an unknown environment. Most existing approaches rely on greedy next-best-view prediction, resulting in inefficient exploration and incomplete scene reconstruction.To address this limitation, we introduce MAGICIAN a novel long-term planning framework that maximizes accumulated surface coverage gain through Imagined Gaussians, a predicted scene representation derived from a pre-trained occupancy network with strong structural priors. This representation enables efficient computation of overage gain for any novel viewpoint via fast volumetric rendering.The resulting speedup allows the integration of the gain metric into a tree-search algorithm for planning long-horizon paths.We update Imagined Gaussians and refine the planned trajectory in a closed-loop manner.Our method achieves state-of-the-art performance across indoor and outdoor benchmarks with varying action spaces, demonstrating the critical advantage of long-term planning in active mapping.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39794", "url": null, "sourceid": 36274, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40384?format=json"], "related_events_ids": [40384]}, {"id": 40384, "uid": "0fbb86b06c2fa661433282ee30ab3723", "name": "MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping", "authors": [{"id": 181208, "fullname": "Shiyao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181208?format=json", "institution": "IP Paris &amp; Inria"}, {"id": 77312, "fullname": "Antoine Gu\u00e9don", "url": "http://cvpr.thecvf.com/api/miniconf/users/77312?format=json", "institution": "Ecole des Ponts ParisTech"}, {"id": 89359, "fullname": "Shizhe Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89359?format=json", "institution": "INRIA"}, {"id": 75898, "fullname": "Vincent Lepetit", "url": "http://cvpr.thecvf.com/api/miniconf/users/75898?format=json", "institution": "Ecole des Ponts ParisTech"}], "abstract": "Active mapping aims to determine how an agent should move to efficiently reconstruct an unknown environment. Most existing approaches rely on greedy next-best-view prediction, resulting in inefficient exploration and incomplete scene reconstruction.To address this limitation, we introduce MAGICIAN a novel long-term planning framework that maximizes accumulated surface coverage gain through Imagined Gaussians, a predicted scene representation derived from a pre-trained occupancy network with strong structural priors. This representation enables efficient computation of overage gain for any novel viewpoint via fast volumetric rendering.The resulting speedup allows the integration of the gain metric into a tree-search algorithm for planning long-horizon paths.We update Imagined Gaussians and refine the planned trajectory in a closed-loop manner.Our method achieves state-of-the-art performance across indoor and outdoor benchmarks with varying action spaces, demonstrating the critical advantage of long-term planning in active mapping.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40384", "url": null, "sourceid": -36274, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39794?format=json"], "related_events_ids": [39794]}, {"id": 39803, "uid": "2dfbe1c802c78a8b1f75ef13b70c1124", "name": "Uika: Universal Head Avatar from Pose-Free Images", "authors": [{"id": 154162, "fullname": "Zijian Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154162?format=json", "institution": "nanjing university"}, {"id": 157046, "fullname": "Boyao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/157046?format=json", "institution": "Tsinghua University"}, {"id": 127484, "fullname": "Liangxiao Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127484?format=json", "institution": "Harbin Institute of Technology"}, {"id": 87692, "fullname": "Hongyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87692?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 192890, "fullname": "Yuan Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/192890?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 85470, "fullname": "Xuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85470?format=json", "institution": "Tencent AI Lab"}, {"id": 85035, "fullname": "Xun Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/85035?format=json", "institution": "Nanjing University"}, {"id": 88128, "fullname": "Yujun Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88128?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 153839, "fullname": "Hao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153839?format=json", "institution": "Nanjing University"}], "abstract": "We present UIKA, a feed-forward animatable Gaussian head model from an arbitrary number of unposed inputs, including a single image, multi-view captures, and smartphone-captured videos. Unlike the traditional avatar method, which requires a studio-level multi-view capture system and reconstructs a human-specific model through a long-time optimization process, we rethink the task through the lenses of model representation, network design, and data preparation.  First, we introduce a UV-guided avatar modeling strategy, in which each input image is associated with a pixel-wise UV coordinate estimation. Such UV coordinate estimation allows us to project each valid pixel from screen space to UV space, which is independent of camera pose and character expression. We thus leverage this UV space to represent our Gaussian head avatar. To this end, we design learnable UV tokens on which the attention mechanism can be applied at both the screen and UV levels. The learned UV token can be decoded into canonical Gaussian attributes using aggregated UV information from all input views. Such a Gaussian avatar is directly animatable via standard linear blend skinning and supports real-time rendering. To train our large avatar model, we further prepare a large-scale, identity-rich training dataset with controllable views and motions, synthesized with a 3D GAN and a state-of-the-art image animation model. Our proposed method significantly outperforms existing approaches in rendering quality, 3D consistency, and inference efficiency on both single-view and multi-view input data.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39803", "url": null, "sourceid": 44321, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39808, "uid": "042d5795f8a89756d19c4d5aff933e18", "name": "Defect Cue-Preserved Structural Feature Refinement for Few-Shot Anomaly Detection", "authors": [{"id": 126763, "fullname": "Le Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126763?format=json", "institution": "South China University of Technology"}, {"id": 182954, "fullname": "Yan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182954?format=json", "institution": "South China University of Technology"}, {"id": 91310, "fullname": "Zhen Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/91310?format=json", "institution": "South China University of Technology"}, {"id": 126780, "fullname": "Yong Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126780?format=json", "institution": "Peng Cheng Laboratory"}, {"id": 86139, "fullname": "Hau San Wong", "url": "http://cvpr.thecvf.com/api/miniconf/users/86139?format=json", "institution": "City University of Hong Kong"}, {"id": 87469, "fullname": "Si Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87469?format=json", "institution": "South China University of Technology"}], "abstract": "Modern industrial quality control heavily relies on automated anomaly detection. While few-shot anomaly detection addresses the challenge of limited labeled data, real-world inspection faces a vast diversity of anomaly types, sizes, and shapes. We identify the primary cause for the anomaly detection difficulty as the progressive loss of detect cues as they pass through deep feature extraction pipelines. To counteract the defect cue fading, we propose a Defect Cue-Preserved Structural Feature Refinement model, referred to as DCP-SFR. Recognizing that early-stage cues are paramount, we design a conditional anomaly cue amplification module to produce an initial anomaly score map, which is then enhanced to increase the contrast between anomalous and normal regions. The amplified cues is subsequently used for reconstruction-based anomaly localization, by anchoring attention on true anomaly regions to preserve spatial integrity and prevent drift. Further, we incorporate a structure-aware segmentation refinement stage to improve anomaly segmentation in terms of edge alignment, thereby significantly improve boundary accuracy. On the MVTec AD and VisA benchmarks, DCP-SFR achieves state-of-the-art performance, with an image-level AUROC of 97.3% and a pixel-level AUROC of 98.2%, demonstrating strong cross-domain generalization performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39808", "url": null, "sourceid": 40400, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39820, "uid": "f8f6c2b02a64736a13529bb5003a562c", "name": "Self-Attention Driven Tensor Representation for High-Order Data Recovery", "authors": [{"id": 182096, "fullname": "Zhi-Wei SHI", "url": "http://cvpr.thecvf.com/api/miniconf/users/182096?format=json", "institution": "Southwest Jiaotong University"}, {"id": 127872, "fullname": "Yu-Bang Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/127872?format=json", "institution": "Southwest Jiaotong University"}, {"id": 127846, "fullname": "Heng-Chao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/127846?format=json", "institution": "Southwest Jiaotong University"}], "abstract": "Low-rank tensor representation (LRTR) is an effective tool for compactly modeling high-order data. While nonlinear LRTR models can better capture real-world nonlinear dependencies, most existing methods rely on fixed mappings of multilayer perceptrons (MLPs) or convolutional neural networks (CNNs), limiting their ability to model complex global dependencies. To overcome this limitation, we construct a novel paradigm called Self-Attention Driven Tensor Representation (SADTR), which is the first framework that models nonlinearity from the perspective of self-attention. Specifically, we design a factor self-representation mechanism to establish dynamic global mapping, thereby adaptively capturing both local and non-local nonlinear dependencies. Moreover, we introduce an implicit sparse representation to impose sparsity constraint while avoiding additional optimization problems. As a result, the proposed SADTR can achieve a more accurate low-rank representation. In theory, we provide a detailed analysis to demonstrate the recoverability of SADTR. To validate the effectiveness of SADTR, we apply it to three representative high-order data recovery tasks. Experimental results demonstrate that SADTR consistently outperforms existing state-of-the-art LRTR methods.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39820", "url": null, "sourceid": 46408, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39809, "uid": "63b7cfecf5585795a08cde4e46a2af36", "name": "Learning Effective Sign Features without Text for Gloss-free Sign Language Translation", "authors": [{"id": 107387, "fullname": "Shiwei Gan", "url": "http://cvpr.thecvf.com/api/miniconf/users/107387?format=json", "institution": "Nanjing University"}, {"id": 181586, "fullname": "Xiao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181586?format=json", "institution": "nanjing university"}, {"id": 130214, "fullname": "Yafeng Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/130214?format=json", "institution": "Nanjing University"}, {"id": 192902, "fullname": "Nan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192902?format=json", "institution": "nanjing university"}, {"id": 192903, "fullname": "Kuizhuang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192903?format=json", "institution": "nanjing university"}, {"id": 192904, "fullname": "Desibieer Tuerdaken", "url": "http://cvpr.thecvf.com/api/miniconf/users/192904?format=json", "institution": "nanjing university"}, {"id": 130219, "fullname": "Zhiwei Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130219?format=json", "institution": "Nanjing University"}, {"id": 130192, "fullname": "Lei Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/130192?format=json", "institution": "Nanjing University"}, {"id": 72976, "fullname": "Sanglu Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/72976?format=json", "institution": "Nanjing University"}, {"id": 130189, "fullname": "Hongkai Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/130189?format=json", "institution": "University of Warwick"}], "abstract": "Self-supervised learning (SSL) has achieved remarkable success across both NLP and CV domains. However, sign language translation (SLT) models still heavily rely on gloss annotations in gloss-based SLT or text annotations in gloss-free SLT (GFSLT) during pretraining, aiming to ensure that the backbone provides effective sign language (SL) features for the translation model. Such reliance restricts the scalability and generalization ability of the SLT model. One natural question arises: \\textbf{Can existing SSL methods be directly applied to the SL domain to train an effective sign feature extractor for downstream GFSLT tasks, eliminating the need for text annotations?}In this paper, we propose a simple yet effective pretraining framework with two goals:(1) decoupling the pretraining process from gloss or text annotations, relying purely on sign frames; and(2) only global frames are required during inference for simplicity. We show that directly applying existing SSL methods yields suboptimal performance, as SL features involve subtle motion patterns and discriminative cues that are often confined to local regions. To achieve this, we introduce SignDINO, a simple yet effective sign-aware DINO training strategy that learns effective and semantically meaningful representations from global frames without any textual supervision. Specifically, a student\u2013teacher architecture is employed, where the teacher model receives the global sign frame, while the student model learns from masked local views that preserve only the hand and facial regions. Such a simple design encourages the  model to infer global semantics from discriminative local cues, allowing the teacher model to extract SL-related feature during inference solely based on global views. Extensive experiments on public SL datasets show that SignDINO achieves highly competitive performance on the GFSLT task without relying on extra cues or additional SL-related pretraining.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39809", "url": null, "sourceid": 31700, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39812, "uid": "f20fd1e8deb34b1bc57a3146f2511eb2", "name": "Prompt-Free Unknown Label Generation for Open World Detection in Remote Sensing", "authors": [{"id": 181713, "fullname": "Abdullah Azeem", "url": "http://cvpr.thecvf.com/api/miniconf/users/181713?format=json", "institution": "Shenzhen University"}, {"id": 150979, "fullname": "Ruisheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150979?format=json", "institution": "University of Calgary"}, {"id": 186761, "fullname": "Qingquan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186761?format=json", "institution": "Shenzhen University"}, {"id": 192908, "fullname": "Abubakar Siddique", "url": "http://cvpr.thecvf.com/api/miniconf/users/192908?format=json", "institution": null}], "abstract": "Autonomous object detection in remote sensing requires systems that can discover new categories and assign them usable labels during deployment. Existing Open-World Object Detectors identify unknown objects but leave them unnamed until manual annotation. In contrast, Open-Vocabulary Detectors recognize unseen categories only with provided prompts at test time, lacking autonomous discovery or naming. This work presents HSGDet, a detector that achieves both discovery and semantic assignment at test time without external prompts. This method introduces DHGA that navigates a hierarchical semantic graph to perform scene-conditioned coarse-to-fine classification of detected objects. It leverages spatial co-occurrence patterns from surrounding scene context to produce classification confidence scores. High-scoring regions are identified as known objects, while low-scoring regions are flagged as unknown detections. Unknown regions pass to CR2T, which synthesizes text embeddings by fusing visual features, hierarchical parents, and scene context, enabling prompt-free labeling and vocabulary expansion. This approach enables prompt-free semantic labeling and supports autonomous vocabulary expansion without requiring external models. Results demonstrate that HSGDet outperforms state-of-the-art methods by a large margin of 6.6 points in Known mAP and 9.9 points in Unknown Recall. It also reduces Wilderness Impact by 36\\%, enabling scalable and autonomous aerial monitoring.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39812", "url": null, "sourceid": 35452, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39813, "uid": "f7e38046dc880ae3144b422c84cb83e5", "name": "Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images", "authors": [{"id": 147684, "fullname": "bo zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/147684?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 129592, "fullname": "Qiuxia Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/129592?format=json", "institution": "Communication University of China"}, {"id": 129610, "fullname": "Zeren Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/129610?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 157797, "fullname": "Xiangbo Shu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157797?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 129605, "fullname": "Yazhou Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/129605?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 86315, "fullname": "Wenguan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86315?format=json", "institution": "Zhejiang University"}], "abstract": "Robust 3D representation learning forms the perceptual foundation of spatial intelligence, enabling downstream tasks in scene understanding and embodied AI. However, learning such representations directly from unposed multi-view images remains challenging. Recent self-supervised methods attempt to unify geometry, appearance, and semantics in a feed-forward manner, but they often suffer from weak geometry induction, limited appearance detail, and inconsistencies between geometry and semantics.We introduce $\\textbf{\\textit{UniSplat}}$, a feed-forward framework designed to address these limitations through three complementary components. First, we propose a $\\textit{dual-masking strategy}$ that strengthens geometry induction in the encoder. By masking both encoder and decoder tokens, and targeting decoder masks toward geometry-rich regions, the model is forced to infer structural information from incomplete visual cues, yielding geometry-aware representations even under unposed inputs.Second, we develop a $\\textit{coarse-to-fine Gaussian splatting strategy}$ that reduces appearance-semantics inconsistencies by progressively refining the radiance field.Finally, to enforce geometric\u2013semantic consistency, we introduce a $\\textit{pose-conditioned recalibration mechanism}$ that interrelates the outputs of multiple heads by reprojecting predicted 3D point and semantic maps into the image plane using estimated camera parameters, and aligning them with corresponding RGB and semantic predictions to ensure cross-task consistency and resolving geometry\u2013semantic mismatches. Together, these components yield unified 3D representations that are robust to unposed, sparse-view inputs and generalize across diverse tasks, laying a perceptual foundation for spatial intelligence.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39813", "url": null, "sourceid": 41699, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39814, "uid": "9a10ba00bcdf929df950b6e1eae1994e", "name": "InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting", "authors": [{"id": 183859, "fullname": "Hong Duc Vu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183859?format=json", "institution": "Qualcomm AI Research"}, {"id": 192909, "fullname": "Kien Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192909?format=json", "institution": null}, {"id": 152491, "fullname": "Trong-Tung Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/152491?format=json", "institution": "Johns Hopkins University"}, {"id": 192910, "fullname": "Ngan Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192910?format=json", "institution": "Qualcomm Inc, QualComm"}, {"id": 76220, "fullname": "Phong Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76220?format=json", "institution": "VinAI"}, {"id": 77028, "fullname": "Khoi Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/77028?format=json", "institution": "Qualcomm AI Research"}, {"id": 134650, "fullname": "Cuong Pham", "url": "http://cvpr.thecvf.com/api/miniconf/users/134650?format=json", "institution": "Posts &amp; Telecom. Institute of Technology and VinAI Research"}, {"id": 76979, "fullname": "Anh Tran", "url": "http://cvpr.thecvf.com/api/miniconf/users/76979?format=json", "institution": "VinAI Research"}], "abstract": "Recent diffusion-based models achieve photorealism in image inpainting but require many sampling steps, limiting practical use. Few-step text-to-image models offer faster generation, but naively applying them to inpainting yields poor harmonization and artifacts between the background and inpainted region. We trace this cause to random Gaussian noise initialization, which under low function evaluations causes semantic misalignment and reduced fidelity. To overcome this, we propose InverFill, a one-step inversion method tailored for inpainting that injects semantic information from the target masked image into the initial noise, enabling high-fidelity few-step inpainting. Instead of training inpainting models, InverFill leverages few-step text-to-image models in a blended sampling pipeline with semantically aligned noise as inputs, significantly improving vanilla blended sampling and even matching specialized inpainting models at low NFEs. Moreover, InverFill requires no real-image supervision and only adds minimal inference overhead. Extensive experiments show that InverFill consistently boosts baseline few-step models, improving image quality and text coherence without costly retraining or heavy iterative optimization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39814", "url": null, "sourceid": 43133, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39818, "uid": "b90510a6ad93ef8af036ae0a8ab5c021", "name": "Bootstrapping Video Semantic Segmentation Model via Distillation-assisted Test-Time Adaptation", "authors": [{"id": 128424, "fullname": "Jihun Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/128424?format=json", "institution": "KAIST"}, {"id": 130885, "fullname": "Hoyong Kwon", "url": "http://cvpr.thecvf.com/api/miniconf/users/130885?format=json", "institution": "KAIST"}, {"id": 156373, "fullname": "Hyeokjun Kweon", "url": "http://cvpr.thecvf.com/api/miniconf/users/156373?format=json", "institution": "Chung-Ang University"}, {"id": 76867, "fullname": "Kuk-Jin Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/76867?format=json", "institution": "KAIST"}], "abstract": "Fully supervised Video Semantic Segmentation (VSS) relies heavily on densely annotated video data, limiting practical applicability. Alternatively, applying pre-trained Image Semantic Segmentation (ISS) models frame-by-frame avoids annotation costs but ignores crucial temporal coherence. Recent foundation models such as SAM2 enable high-quality mask propagation yet remain impractical for direct VSS due to limited semantic understanding and computational overhead. In this paper, we propose DiTTA (Distillation-assisted Test-Time Adaptation), a novel framework that converts an ISS model into a temporally-aware VSS model through efficient test-time adaptation (TTA), without annotated videos. DiTTA distills SAM2's temporal segmentation knowledge into the ISS model during a brief, single-pass initialization phase, complemented by a lightweight temporal fusion module to aggregate cross-frame context. Crucially, DiTTA achieves robust generalization even when adapting with highly limited partial video snippets (e.g., initial 10\\%), significantly outperforming zero-shot refinement approaches that repeatedly invoke SAM2 during inference. Extensive experiments on VSPW and Cityscapes demonstrate DiTTA\u2019s effectiveness, achieving competitive or superior performance relative to fully-supervised VSS methods, thus providing a practical and annotation-free solution for real-world VSS tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39818", "url": null, "sourceid": 40072, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39821, "uid": "74061f08793737e9374dd85cd2233d3c", "name": "High-Precision Dichotomous Image Segmentation via Depth Integrity-Prior and Fine-Grained Patch Strategy", "authors": [{"id": 192922, "fullname": "Xianjie Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192922?format=json", "institution": "Sichuan University"}, {"id": 153198, "fullname": "Keren Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153198?format=json", "institution": "Sichuan University"}, {"id": 153199, "fullname": "Qijun Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153199?format=json", "institution": "Sichuan University"}], "abstract": "High-precision dichotomous image segmentation (DIS) is a task of extracting fine-grained objects from high-resolution images.Existing methods trade efficiency for accuracy: non-diffusion methods are fast but suffer from weak semantics and unstable spatial priors, causing false detections; diffusion-based methods offer high accuracy via strong generative priors but are computationally expensive.In depth maps, a complete object appears as a low variance region with a smooth interior and sharp boundaries, whereas the background exhibits a chaotic, high variance pattern due to disconnected surfaces at varying depths. We refer to this as the depth integrity-prior.Inspired by this, and noting that DIS currently lacks depth maps, we leverage pseudo-depth information from monocular depth estimation models to obtain essential semantic understanding, thereby rapidly revealing spatial differences across target objects and the background.To exploit this prior, we propose the Prior-guided Depth Fusion Network (PDFNet), which fuses RGB and pseudo-depth features for depth-aware structure perception. We further introduce a novel depth integrity-prior loss to enforce depth consistency in segmentation and a fine-grained enhancement module with adaptive patch selection to sharpen boundaries.Notably, PDFNet with DAM-v2 achieves SOTA ($F^{max}_\\beta$ 0.915 on DIS-VD and 0.915 on DIS-TE) using less than half the params of diffusion-based methods.Code is provided in the supplementary.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39821", "url": null, "sourceid": 32335, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39822, "uid": "2bfd7c7985715037980235d588dc2e9e", "name": "V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs", "authors": [{"id": 192923, "fullname": "Sen Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/192923?format=json", "institution": "Chinese Academy of Sciences"}, {"id": 127360, "fullname": "Jie Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127360?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 192924, "fullname": "Jianxin Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192924?format=json", "institution": "Zhejiang University"}, {"id": 86312, "fullname": "Shiguang Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/86312?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 76990, "fullname": "Xilin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76990?format=json", "institution": "Institute of Computing Technology"}], "abstract": "Adversarial attacks have evolved from simply disrupting predictions on conventional task-specific models to the more complex goal of manipulating image semantics on Large Vision-Language Models (LVLMs). However, existing methods struggle with controllability and fail to precisely manipulate the semantics of specific concepts in the image. We attribute this limitation to semantic entanglement in the patch-token representations on which adversarial attacks typically operate: global context aggregated by self-attention in the vision encoder dominates individual patch features, making them unreliable handles for precise local semantic manipulation. Our systematic investigation reveals a key insight: value features (V) computed within the transformer attention block serve as much more precise handles for manipulation. We show that V suppresses global-context channels, allowing it to retain high-entropy, disentangled local semantic information. Building on this discovery, we propose \\textbf{V-Attack}, a novel method designed for precise local semantic attacks. V-Attack targets the value features and introduces two core components: (1) a Self-Value Enhancement module to refine V's intrinsic semantic richness, and (2) a Text-Guided Value Manipulation module that leverages text prompts to locate source concept and optimize it toward a target concept. By bypassing the entangled patch features, V-Attack achieves highly effective semantic control. Extensive experiments across diverse LVLMs, including LLaVA, InternVL, DeepseekVL and GPT-4o, show that V-Attack improves the attack success rate by an average of 36\\% over state-of-the-art methods, exposing critical vulnerabilities in modern visual-language understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39822", "url": null, "sourceid": 35790, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39825, "uid": "f3ebd784518ba600e0ae288653819b5e", "name": "HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks", "authors": [{"id": 145116, "fullname": "Ting Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/145116?format=json", "institution": "Sun Yat-Sen University"}, {"id": 157999, "fullname": "Daoyuan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/157999?format=json", "institution": "Alibaba Group"}, {"id": 157998, "fullname": "Qirui Jiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/157998?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 154621, "fullname": "Bolin Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/154621?format=json", "institution": "Alibaba Group"}, {"id": 158000, "fullname": "Yaliang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/158000?format=json", "institution": "Alibaba Group"}, {"id": 158001, "fullname": "Ying Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/158001?format=json", "institution": "SUN YAT-SEN UNIVERSITY, Tsinghua University"}], "abstract": "Evaluating the nuanced human-centric video understanding capabilities of Multimodal Large Language Models (MLLMs) remains a great challenge, as existing benchmarks often overlook the intricacies of emotion, behavior, and cross-modal alignment. We introduce HumanVBench, a comprehensive video benchmark designed to rigorously probe these capabilities across 16 fine-grained tasks. A cornerstone of our work is a novel and scalable benchmark construction methodology, featuring two automated pipelines that synthesize high-quality video annotations and challenging multiple-choice questions with minimal human labor. By leveraging state-of-the-art models for annotation and systematically converting model-induced errors into plausible distractors, our framework provides a generalizable ``machine'' for creating nuanced evaluation suites. Our extensive evaluation of 27 leading MLLMs on HumanVBench reveals critical deficiencies, particularly in perceiving subtle emotions and aligning speech with visual cues, with even top proprietary models falling short of human performance. We open-source HumanVBench and our synthesis pipelines to catalyze the development of more socially intelligent and capable video MLLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39825", "url": null, "sourceid": 45875, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39829, "uid": "5409d94570540a9c8f0e83ddd73e2453", "name": "RoboTAG: End-to-end Robot Pose Estimation via Topological Alignment Graph", "authors": [{"id": 126413, "fullname": "Yifan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126413?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 179596, "fullname": "Fangneng Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/179596?format=json", "institution": "Harvard University &amp; MIT"}, {"id": 181404, "fullname": "Wanhua Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181404?format=json", "institution": "School of Electrical and Electronic Engineering, Nanyang Technological University"}, {"id": 191559, "fullname": "Haowen Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/191559?format=json", "institution": "Tsinghua University, Tsinghua University; Tsinghua University, Tsinghua University"}, {"id": 93898, "fullname": "Katerina Fragkiadaki", "url": "http://cvpr.thecvf.com/api/miniconf/users/93898?format=json", "institution": "CMU"}, {"id": 89796, "fullname": "Hanspeter Pfister", "url": "http://cvpr.thecvf.com/api/miniconf/users/89796?format=json", "institution": "Harvard University"}], "abstract": "Estimating robot pose from a monocular RGB image is a challenge in robotics and computer vision. Existing methods typically build networks on top of 2D visual backbones and depend heavily on labeled data for training, which is often scarce in real-world scenarios, causing a sim-to-real gap.Moreover, these approaches reduce the 3D-based problem to 2D domain, neglecting the 3D priors. To address these, we propose Robot Topological Alignment Graph (RoboTAG), which incorporates a 3D branch to inject 3D priors while enabling co-evolution of the 2D and 3D representations, alleviating the reliance on labels. Specifically, the RoboTAG consists of a 3D branch and a 2D branch, where nodes represent the states of the camera and robot system, and edges capture the dependencies between these variables or denote alignments between them. Closed loops are then defined in the graph, on which a consistency supervision across branches can be applied. This design allows us to utilize in-the-wild images as training data without annotations. Experimental results demonstrate that our method is effective across robot types, highlighting its potential to alleviate the data bottleneck in robotics.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39829", "url": null, "sourceid": 44691, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39838, "uid": "4337f7ca32a4a36b744e31b93244e926", "name": "Restore, Assess, Repeat: A Unified Framework for Iterative Image Restoration", "authors": [{"id": 154107, "fullname": "I-Hsiang (Aaron) Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/154107?format=json", "institution": "NTU"}, {"id": 167830, "fullname": "Isma Hadji", "url": "http://cvpr.thecvf.com/api/miniconf/users/167830?format=json", "institution": "Samsung"}, {"id": 107646, "fullname": "Enrique Sanchez", "url": "http://cvpr.thecvf.com/api/miniconf/users/107646?format=json", "institution": "Samsung AI Center Cambridge"}, {"id": 77078, "fullname": "Adrian Bulat", "url": "http://cvpr.thecvf.com/api/miniconf/users/77078?format=json", "institution": "Samsung AI Center Cambridge"}, {"id": 126235, "fullname": "Sy-Yen Kuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/126235?format=json", "institution": "National Taiwan University"}, {"id": 73509, "fullname": "Radu Timofte", "url": "http://cvpr.thecvf.com/api/miniconf/users/73509?format=json", "institution": "University of W\u00fcrzburg"}, {"id": 86077, "fullname": "Georgios Tzimiropoulos", "url": "http://cvpr.thecvf.com/api/miniconf/users/86077?format=json", "institution": "Queen Mary University London"}, {"id": 154330, "fullname": "Brais Martinez", "url": "http://cvpr.thecvf.com/api/miniconf/users/154330?format=json", "institution": "Samsung AI Center - Cambridge"}], "abstract": "Image restoration aims to recover high quality images from inputs degraded by various factors, such as adverse weather, blur, or low light. While recent studies have shown remarkable progress across individual or unified restoration tasks, they still suffer from limited generalization and inefficiency when handling unknown or composite degradations. To address these limitations, we propose RAR, a Restore, Assess and Repeat process, that integrates Image Quality Assessment (IQA) and Image Restoration (IR) into a unified framework to iteratively and efficiently achieve high quality image restoration. Specifically, we introduce a restoration process that operates entirely in the latent domain to jointly perform degradation identification, image restoration, and quality verification. The resulting model is fully trainable end to end and allows for an all-in-one assess and restore approach that dynamically adapts the restoration process. Also, the tight integration of IQA and IR into a unified model minimizes the latency and information loss that typically arises from keeping the two modules disjoint, (e.g. during image and/or text decoding). Extensive experiments show that our approach consistent improvements under single, unknown and composite degradations, thereby establishing a new state-of-the-art.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39838", "url": null, "sourceid": 39236, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39839, "uid": "4208c962bda954ddcbbc9dcd21c0aa0a", "name": "TokenLight: Precise Lighting Control in Images using Attribute Tokens", "authors": [{"id": 156652, "fullname": "Sumit Chaturvedi", "url": "http://cvpr.thecvf.com/api/miniconf/users/156652?format=json", "institution": "Department of Computer Science, Yale University"}, {"id": 85036, "fullname": "Yannick Hold-Geoffroy", "url": "http://cvpr.thecvf.com/api/miniconf/users/85036?format=json", "institution": "Adobe Research"}, {"id": 106348, "fullname": "Mengwei Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/106348?format=json", "institution": "Adobe"}, {"id": 180955, "fullname": "Jingyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180955?format=json", "institution": "Adobe"}, {"id": 86848, "fullname": "He Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86848?format=json", "institution": "Adobe Systems"}, {"id": 76137, "fullname": "Yiqun Mei", "url": "http://cvpr.thecvf.com/api/miniconf/users/76137?format=json", "institution": "Johns Hopkins University"}, {"id": 192956, "fullname": "Julie Dorsey", "url": "http://cvpr.thecvf.com/api/miniconf/users/192956?format=json", "institution": "Yale University"}, {"id": 137329, "fullname": "ZHIXIN SHU", "url": "http://cvpr.thecvf.com/api/miniconf/users/137329?format=json", "institution": "Adobe Systems"}], "abstract": "This paper presents a method for image relighting that enables precise and continuous control over multiple illumination attributes in a photograph. We formulate relighting as a conditional image generation task and introduce attribute tokens to encode distinct lighting factors such as intensity, color, ambient illumination, diffuse level, and 3D light positions. The model is trained on a large-scale synthetic dataset with ground-truth lighting annotations, supplemented by a small set of real captures to enhance realism and generalization. We validate our approach across a variety of relighting tasks, including controlling in-scene lighting fixtures and editing environment illumination using virtual light sources, on synthetic and real images. Our method achieves state-of-the-art quantitative and qualitative performance compared to prior work. Remarkably, without explicit inverse rendering supervision, the model exhibits an inherent understanding of how light interacts with scene geometry, occlusion, and materials, yielding convincing lighting effects even in traditionally challenging scenarios such as placing lights within objects or relighting transparent materials plausibly.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39839", "url": null, "sourceid": 42408, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39840, "uid": "96e1da3b09373c4e9ab748abf0d43329", "name": "The Coherence Trap: When MLLM-Crafted Narratives Exploit Manipulated Visual Contexts", "authors": [{"id": 181644, "fullname": "Yuchen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181644?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 154334, "fullname": "Yaxiong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154334?format=json", "institution": "Hefei University of Technology"}, {"id": 191472, "fullname": "Yujiao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191472?format=json", "institution": "CSIRO"}, {"id": 192957, "fullname": "Lianwei Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192957?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 192958, "fullname": "Li Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192958?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 77195, "fullname": "Zhedong Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/77195?format=json", "institution": "National University of Singapore"}], "abstract": "The detection and grounding of multimedia manipulation has emerged as a critical challenge in combating AI-generated disinformation. While existing methods have made progress in recent years, we identify two fundamental limitations in current approaches: (1) Underestimation of MLLM-driven deception risk: prevailing techniques primarily address rule-based text manipulations, yet fail to account for sophisticated misinformation synthesized by multimodal large language models (MLLMs) that can dynamically generate semantically coherent, contextually plausible yet deceptive narratives conditioned on manipulated images; (2) Unrealistic misalignment artifacts: currently focused scenarios rely on artificially misaligned content that lacks semantic coherence, rendering them easily detectable. To address these gaps holistically, we propose a new adversarial pipeline that leverages MLLMs to generate high-risk disinformation. Our approach begins with constructing the MLLM-Driven Synthetic Multimodal (MDSM) dataset, where images are first altered using state-of-the-art editing techniques and then paired with MLLM-generated deceptive texts that maintain semantic consistency with the visual manipulations. Building upon this foundation, we present the Artifact-aware Manipulation Diagnosis via MLLM (AMD) framework featuring two key innovations: Artifact Pre-perception Encoding strategy and Manipulation-Oriented Reasoning, to tame MLLMs for the MDSM problem. Comprehensive experiments validate our framework's superior generalization capabilities as a unified architecture for detecting MLLM-powered multimodal deceptions. In cross-domain testing on the MDSM dataset, AMD achieves the best average performance, with ACC, mAP, and mIoU scores of 88.18, 60.25, and 61.02, respectively.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39840", "url": null, "sourceid": 43872, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39841, "uid": "f1a0ecdce0260ad2fad9a56d64582084", "name": "AvatarPointillist: AutoRegressive 4D Gaussian Avatarization", "authors": [{"id": 87692, "fullname": "Hongyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87692?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 85470, "fullname": "Xuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85470?format=json", "institution": "Tencent AI Lab"}, {"id": 154162, "fullname": "Zijian Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154162?format=json", "institution": "nanjing university"}, {"id": 147923, "fullname": "yating wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/147923?format=json", "institution": "shanghai jiaotong university"}, {"id": 91804, "fullname": "Ziyu Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/91804?format=json", "institution": "City University of Hong Kong"}, {"id": 127301, "fullname": "Yue Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/127301?format=json", "institution": "The Hong Kong University of Science and Technology, Hong Kong"}, {"id": 153949, "fullname": "Runtao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153949?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 157046, "fullname": "Boyao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/157046?format=json", "institution": "Tsinghua University"}, {"id": 88128, "fullname": "Yujun Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88128?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 87711, "fullname": "Qifeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87711?format=json", "institution": "Hong Kong University of Science and Technology"}], "abstract": "We introduce AvatarPointillist, a novel framework for generating dynamic 4D Gaussian avatars from a single portrait image. At the core of our method is a decoder-only Transformer that autoregressively generates a point cloud for 3D Gaussian Splatting. This sequential approach allows for precise, adaptive construction, dynamically adjusting point density and the total number of points based on the subject's complexity. During point generation, the AR model also jointly predicts per-point binding information, enabling realistic animation. After generation, a dedicated Gaussian decoder converts the points into complete, renderable Gaussian attributes. We demonstrate that conditioning the decoder on the latent features from the AR generator enables effective interaction between stages and markedly improves fidelity. Extensive experiments validate that AvatarPointillist produces high-quality, photorealistic, and controllable avatars. We believe this autoregressive formulation represents a new paradigm for avatar generation, and we will release our code and data to inspire future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39841", "url": null, "sourceid": 33898, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39842, "uid": "762ce0877a6d271c947878cdb07bb68e", "name": "Does YOLO Really Need to See Every Training Image in Every Epoch?", "authors": [{"id": 174643, "fullname": "Xingxing Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/174643?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 174669, "fullname": "Jiahua Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/174669?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 86299, "fullname": "Junwei Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/86299?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 190382, "fullname": "Gong Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190382?format=json", "institution": "Northwestern Polytechnical University"}], "abstract": "YOLO detectors are known for their fast inference speed, yet training them remains unexpectedly time-consuming due to their exhaustive pipeline that processes every training image in every epoch, even when many images have already been sufficiently learned. This stands in clear contrast to the efficiency suggested by the ``You Only Look Once'' philosophy.This naturally raises an important question: Does YOLO really need to see every training image in every epoch? To explore this, we propose an Anti-Forgetting Sampling Strategy (AFSS) that dynamically determines which images should be used and which can be skipped during each epoch, allowing the detector to learn more effectively and efficiently.Specifically, AFSS measures the learning sufficiency of each training image as the minimum of its detection recall and precision, and dynamically categorizes training images into easy, medium, or hard levels accordingly. Easy training images are sparsely resampled during training in a continuous review manner, with priority given to those that have not been used for a long time to reduce redundancy and prevent forgetting. Medium training images are partially selected, with priority given to recently unused ones and the remaining quota filled randomly to ensure short-term coverage and prevent forgetting. Hard training images are fully sampled in every epoch to ensure sufficient learning. The learning sufficiency of each training image is periodically updated, enabling detectors to adaptively shift its focus toward the informative training images over time while progressively discarding redundant ones.On widely used natural image detection benchmarks (MS COCO 2017 and PASCAL VOC 2007) and remote sensing detection datasets (DOTA-v1.0 and DIOR-R), AFSS achieves more than $1.43\\times$ training speedup for YOLO-series detectors (e.g., YOLOv8, YOLOv10, YOLO11, YOLO12) while also improving detection accuracy.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39842", "url": null, "sourceid": 41317, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40385?format=json"], "related_events_ids": [40385]}, {"id": 40385, "uid": "762ce0877a6d271c947878cdb07bb68e", "name": "Does YOLO Really Need to See Every Training Image in Every Epoch?", "authors": [{"id": 174643, "fullname": "Xingxing Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/174643?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 174669, "fullname": "Jiahua Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/174669?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 86299, "fullname": "Junwei Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/86299?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 190382, "fullname": "Gong Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190382?format=json", "institution": "Northwestern Polytechnical University"}], "abstract": "YOLO detectors are known for their fast inference speed, yet training them remains unexpectedly time-consuming due to their exhaustive pipeline that processes every training image in every epoch, even when many images have already been sufficiently learned. This stands in clear contrast to the efficiency suggested by the ``You Only Look Once'' philosophy.This naturally raises an important question: Does YOLO really need to see every training image in every epoch? To explore this, we propose an Anti-Forgetting Sampling Strategy (AFSS) that dynamically determines which images should be used and which can be skipped during each epoch, allowing the detector to learn more effectively and efficiently.Specifically, AFSS measures the learning sufficiency of each training image as the minimum of its detection recall and precision, and dynamically categorizes training images into easy, medium, or hard levels accordingly. Easy training images are sparsely resampled during training in a continuous review manner, with priority given to those that have not been used for a long time to reduce redundancy and prevent forgetting. Medium training images are partially selected, with priority given to recently unused ones and the remaining quota filled randomly to ensure short-term coverage and prevent forgetting. Hard training images are fully sampled in every epoch to ensure sufficient learning. The learning sufficiency of each training image is periodically updated, enabling detectors to adaptively shift its focus toward the informative training images over time while progressively discarding redundant ones.On widely used natural image detection benchmarks (MS COCO 2017 and PASCAL VOC 2007) and remote sensing detection datasets (DOTA-v1.0 and DIOR-R), AFSS achieves more than $1.43\\times$ training speedup for YOLO-series detectors (e.g., YOLOv8, YOLOv10, YOLO11, YOLO12) while also improving detection accuracy.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40385", "url": null, "sourceid": -41317, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39842?format=json"], "related_events_ids": [39842]}, {"id": 39843, "uid": "485b551f4f68d315b9a0856bd3c92195", "name": "Decoding 3D Perception via BrainSSD: Synergistic Fusion of EEG Representations from Static and Dynamic Visual Streams", "authors": [{"id": 173088, "fullname": "Yincheng Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/173088?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 192959, "fullname": "Enze Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192959?format=json", "institution": "Northwestern Polytechnical University Xi&#x27;an"}, {"id": 185731, "fullname": "Shu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185731?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}], "abstract": "Understanding how the brain constructs coherent 3D visual percepts from multifaceted experiences remains a pivotal yet underexplored challenge. To investigate this, we introduce BrainSSD, a novel framework for decoding 3D representations from electroencephalography (EEG) signals. The core of BrainSSD is a neuro-inspired fusion architecture, Hierarchical Phase-Amplitude Coupling guided Fusion (HPACF), which synergistically integrates EEG from two distinct viewing paradigms: brief presentations of a static 3D object view, and sustained observation of the object undergoing full rotation. HPACF embodies two key principles of neural computation, namely hierarchical processing realized through multi-level cross-attention, and neural synchrony actualized by using a differentiable estimator of Phase-Amplitude Coupling (PAC) to dynamically guide the integration. The resulting fused representations are subsequently mapped to the visual domain via a multi-level alignment loss. Our framework establishes a new state-of-the-art across a range of EEG decoding tasks, achieving superior discriminative power and exceptional generative fidelity. Furthermore, our static-dynamic dominance analysis provides the first direct visual evidence for a functional specialization in the brain's 3D perception, revealing that neural responses to static object views primarily underpin the object's holistic structure and form, while responses to rotational observation are indispensable for resolving its fine-grained geometric details. Our work presents an advanced framework for probing EEG-based visual decoding and offers computational insights into the brain's synergistic strategies for 3D perception.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39843", "url": null, "sourceid": 33992, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39849, "uid": "41d727696cfe79387a8644bba00cd45e", "name": "CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization", "authors": [{"id": 76322, "fullname": "Xinhai Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/76322?format=json", "institution": "University of Michigan"}, {"id": 134612, "fullname": "Shaoyuan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/134612?format=json", "institution": "Amazon Inc."}, {"id": 139899, "fullname": "Manan Biyani", "url": "http://cvpr.thecvf.com/api/miniconf/users/139899?format=json", "institution": "Amazon"}, {"id": 137733, "fullname": "Moyan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/137733?format=json", "institution": "Amazon"}, {"id": 192977, "fullname": "Jia Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192977?format=json", "institution": "The Ohio State University"}, {"id": 84663, "fullname": "Todd C. Hollon", "url": "http://cvpr.thecvf.com/api/miniconf/users/84663?format=json", "institution": "University of Michigan"}, {"id": 139814, "fullname": "Bryan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/139814?format=json", "institution": "Amazon"}], "abstract": "Agentic vision\u2013language models are increasingly trained to \u201cthink with images\u201d by calling image operations. However, we show that high final-answer accuracy often hides unfaithful visual reasoning: models may invoke tools on irrelevant regions or ignore tool outputs entirely, yet still guess the correct answer. In this work, we first propose a faithfulness evaluation protocol that measures whether intermediate visual tool outputs (e.g., crops) actually contain the queried evidence. This reveals that recent visual agents achieve high final-answer accuracy but exhibit low rates of faithful tool-use on visual search benchmarks. We then introduce CodeV, a code-based visual agent trained with Tool-Aware Policy Optimization (TAPO). TAPO is a process-level RL framework that augments GRPO with dense rewards defined directly on visual tool inputs and outputs, rather than on chain-of-thought tokens, making supervision easier to verify and less susceptible to reward hacking. CodeV represents visual tools as executable Python code, and TAPO assigns step-wise rewards based solely on the question and tool output, encouraging both necessary and evidence-consistent tool use. In a two-stage SFT+RL pipeline, CodeV achieves competitive or superior accuracy while substantially increasing faithful tool-use rates on related visual search benchmarks. Beyond visual search, CodeV attains strong performance on a range of multimodal reasoning and math benchmarks, suggesting that explicitly supervising intermediate tool behavior is crucial for building trustworthy, agentic visual reasoning systems.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39849", "url": null, "sourceid": 45760, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40386?format=json"], "related_events_ids": [40386]}, {"id": 39850, "uid": "52ac909c791fd918500ce7e8fed79cf5", "name": "PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing", "authors": [{"id": 136714, "fullname": "Rohan Mahadev", "url": "http://cvpr.thecvf.com/api/miniconf/users/136714?format=json", "institution": "Pinterest Inc."}, {"id": 182458, "fullname": "Joyce Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/182458?format=json", "institution": "Pinterest"}, {"id": 164846, "fullname": "Patrick Poirson", "url": "http://cvpr.thecvf.com/api/miniconf/users/164846?format=json", "institution": "Pinterest, Inc."}, {"id": 182542, "fullname": "David Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/182542?format=json", "institution": "Pinterest"}, {"id": 192978, "fullname": "Hao-Yu Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192978?format=json", "institution": null}, {"id": 192979, "fullname": "Dmitry Kislyuk", "url": "http://cvpr.thecvf.com/api/miniconf/users/192979?format=json", "institution": "Pinterest, Inc."}], "abstract": "Composed Image Retrieval (CIR) has made significant progress, yet current benchmarks are limited to single ground-truth answers and lack the annotations needed to evaluate false positive avoidance, robustness and multi-image reasoning. We present \\textbf{PinPoint}, a comprehensive real world benchmark with 7,846 queries and 329K relevance judgments across 23 query categories. PinPoint advances the field by providing: (1) multiple correct answers (averaging 9.1 per query) (2) explicit hard negatives, (3) six instruction paraphrases per query for robustness testing, (4) multi-image composition support (13.4\\% of queries), and (5) demographic metadata for fairness evaluation. Based on our analysis of 20+ methods across 4 different major paradigms, we uncover three significant drawbacks: The best methods while achieving mAP@10 of 28.5\\%, still retrieves irrelevant results (hard negatives) 9\\% of the time. The best models also exhibit 25.1\\% performance variation across paraphrases, indicating significant potential for enhancing current CIR techniques. Multi-image queries performs 40 to 70\\% worse across different methods. To overcome these new issues uncovered by our evaluation framework, we propose a training-free reranking method based on an off-the-shelf MLLM that can be applied to any existing system to bridge the gap. We release the complete dataset, including all images, queries, annotations, retrieval index, and benchmarking code.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39850", "url": null, "sourceid": 35983, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40386, "uid": "41d727696cfe79387a8644bba00cd45e", "name": "CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization", "authors": [{"id": 76322, "fullname": "Xinhai Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/76322?format=json", "institution": "University of Michigan"}, {"id": 134612, "fullname": "Shaoyuan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/134612?format=json", "institution": "Amazon Inc."}, {"id": 139899, "fullname": "Manan Biyani", "url": "http://cvpr.thecvf.com/api/miniconf/users/139899?format=json", "institution": "Amazon"}, {"id": 137733, "fullname": "Moyan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/137733?format=json", "institution": "Amazon"}, {"id": 192977, "fullname": "Jia Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192977?format=json", "institution": "The Ohio State University"}, {"id": 84663, "fullname": "Todd C. Hollon", "url": "http://cvpr.thecvf.com/api/miniconf/users/84663?format=json", "institution": "University of Michigan"}, {"id": 139814, "fullname": "Bryan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/139814?format=json", "institution": "Amazon"}], "abstract": "Agentic vision\u2013language models are increasingly trained to \u201cthink with images\u201d by calling image operations. However, we show that high final-answer accuracy often hides unfaithful visual reasoning: models may invoke tools on irrelevant regions or ignore tool outputs entirely, yet still guess the correct answer. In this work, we first propose a faithfulness evaluation protocol that measures whether intermediate visual tool outputs (e.g., crops) actually contain the queried evidence. This reveals that recent visual agents achieve high final-answer accuracy but exhibit low rates of faithful tool-use on visual search benchmarks. We then introduce CodeV, a code-based visual agent trained with Tool-Aware Policy Optimization (TAPO). TAPO is a process-level RL framework that augments GRPO with dense rewards defined directly on visual tool inputs and outputs, rather than on chain-of-thought tokens, making supervision easier to verify and less susceptible to reward hacking. CodeV represents visual tools as executable Python code, and TAPO assigns step-wise rewards based solely on the question and tool output, encouraging both necessary and evidence-consistent tool use. In a two-stage SFT+RL pipeline, CodeV achieves competitive or superior accuracy while substantially increasing faithful tool-use rates on related visual search benchmarks. Beyond visual search, CodeV attains strong performance on a range of multimodal reasoning and math benchmarks, suggesting that explicitly supervising intermediate tool behavior is crucial for building trustworthy, agentic visual reasoning systems.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40386", "url": null, "sourceid": -45760, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39849?format=json"], "related_events_ids": [39849]}, {"id": 39852, "uid": "b8d1eddbcaebc960137cadf0b70cbbcc", "name": "VLM-Guided Group Preference Alignment for Diffusion-based Human Mesh Recovery", "authors": [{"id": 175123, "fullname": "Wenhao Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/175123?format=json", "institution": "Nanyang Technological University"}, {"id": 156341, "fullname": "Hao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156341?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 129474, "fullname": "Wanqi Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/129474?format=json", "institution": "SenseTime Research "}, {"id": 127844, "fullname": "Fayao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127844?format=json", "institution": "Institute for Infocomm Research, A*STAR"}, {"id": 107391, "fullname": "Xulei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107391?format=json", "institution": "Institute for Infocomm Research (I2R), A*STAR"}, {"id": 127309, "fullname": "Chao Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127309?format=json", "institution": "Zhejiang University"}, {"id": 128967, "fullname": "Zhongang Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/128967?format=json", "institution": "Nanyang Technological University"}, {"id": 86136, "fullname": "Guosheng Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/86136?format=json", "institution": "Nanyang Technological University"}], "abstract": "Human mesh recovery (HMR) from a single RGB image is inherently ambiguous, as multiple 3D poses can correspond to the same 2D observation. Recent probabilistic and diffusion-based methods tackle this ambiguity by generating various hypotheses, but often sacrifice accuracy. They yield predictions that are either physically implausible or drift from the input image, especially under occlusion or in cluttered, in-the-wild scenes. To address this issue, we introduce a dual-memory augmented HMR critique agent with self-reflection to produce context-aware quality scores for predicted meshes. These scores distill fine-grained cues about 3D human motion structure, physical feasibility, and alignment with the input image. We use these scores to build a group-wise HMR preference dataset. Building upon this dataset, we propose a group preference alignment framework for finetuning diffusion-based HMR models. This process injects the rich preference signals into the model, guiding it to generate more physically plausible and image-consistent human meshes. Extensive experiments demonstrate that our method achieves superior performance compared to state-of-the-art probabilistic HMR approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39852", "url": null, "sourceid": 39778, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39853, "uid": "b38700450d76fa2e5c3498b9218b9bc6", "name": "OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning", "authors": [{"id": 182075, "fullname": "Yanqing Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182075?format=json", "institution": "UC Santa Cruz"}, {"id": 129795, "fullname": "Xianhang li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129795?format=json", "institution": "University of California, Santa Cruz"}, {"id": 129889, "fullname": "Letian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129889?format=json", "institution": "Tongji University"}, {"id": 89422, "fullname": "Zirui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89422?format=json", "institution": "Google Brain"}, {"id": 192984, "fullname": "Zeyu Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192984?format=json", "institution": "University of California Berkeley"}, {"id": 75508, "fullname": "Yuyin Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/75508?format=json", "institution": "UC Santa Cruz"}, {"id": 75526, "fullname": "Cihang Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/75526?format=json", "institution": "University of California, Santa Cruz"}], "abstract": "This paper provides a simplification on OpenVision's architecture and loss design for enhancing its training efficiency. Following the prior vision-language pretraining works CapPa and AIMv2, as well as modern multimodal designs like LLaVA, our changes are straightforward: we remove the text encoder (and therefore the contrastive loss), retaining only the captioning loss as a purely generative training signal. We name this new version OpenVision 2. The initial results are promising: despite this simplification, OpenVision 2 competitively matches the original model's performance on a broad set of multimodal benchmarks while substantially cutting both training time and memory consumption. For example, with ViT-L/14, it reduces training time by about 1.5x (from 83h to 57h), and memory usage by about 1.8x (from 24.5GB to 13.8GB, equivalently allowing the maximum batch size to grow from 2k to 8k). This superior training efficiency also allows us to scale far beyond the largest vision encoder used in OpenVision, reaching more than 1 billion parameters. We hold a strong belief that this lightweight, generative-only paradigm is compelling for future vision encoder development in multimodal foundation models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39853", "url": null, "sourceid": 40662, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39856, "uid": "bd9c29a7219e6d849af77f68e697debe", "name": "GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling", "authors": [{"id": 181587, "fullname": "Shivanshu Shekhar", "url": "http://cvpr.thecvf.com/api/miniconf/users/181587?format=json", "institution": "UIUC"}, {"id": 129213, "fullname": "Uttaran Bhattacharya", "url": "http://cvpr.thecvf.com/api/miniconf/users/129213?format=json", "institution": "Adobe Inc."}, {"id": 192992, "fullname": "Raghavendra Addanki", "url": "http://cvpr.thecvf.com/api/miniconf/users/192992?format=json", "institution": "Adobe Research"}, {"id": 106596, "fullname": "Mehrab Tanjim", "url": "http://cvpr.thecvf.com/api/miniconf/users/106596?format=json", "institution": "Adobe Research"}, {"id": 192993, "fullname": "Somdeb Sarkhel", "url": "http://cvpr.thecvf.com/api/miniconf/users/192993?format=json", "institution": "Adobe Research"}, {"id": 130359, "fullname": "Tong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130359?format=json", "institution": "UIUC"}], "abstract": "Aligning video generative models with human preferences remains challenging: current approaches rely on Vision-Language Models (VLMs) for reward modeling, but these models struggle to capture subtle temporal dynamics. We propose a fundamentally different approach: repurposing video generative models, which are inherently designed to model temporal structure, as reward models. We present the Generative-Transformer-based Self-Supervised Video Judge (\\modelname), a novel evaluation model that transforms state-of-the-art video generation models into powerful temporally-aware reward models. Our key insight is that generative models can be reformulated as energy-based models (EBMs) that assign low energy to high-quality videos and high energy to degraded ones, enabling them to discriminate video quality with remarkable precision when trained via contrastive objectives. To prevent the model from exploiting superficial differences between real and generated videos, we design challenging synthetic negative videos through controlled latent-space perturbations: temporal slicing, feature swapping, and frame shuffling, which simulate realistic but subtle visual degradations. This forces the model to learn meaningful spatiotemporal features rather than trivial artifacts. \\modelname achieves state-of-the-art performance on GenAI-Bench and MonteBench using only 30K human-annotations: $6\\times$ to $65\\times$ fewer than existing VLM-based approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39856", "url": null, "sourceid": 41329, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39863, "uid": "517b17a8b5f138f3e0aef1099044bc7a", "name": "MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation", "authors": [{"id": 182211, "fullname": "Yuta Oshima", "url": "http://cvpr.thecvf.com/api/miniconf/users/182211?format=json", "institution": "The University of Tokyo"}, {"id": 193004, "fullname": "Daiki Miyake", "url": "http://cvpr.thecvf.com/api/miniconf/users/193004?format=json", "institution": "The University of Tokyo, The University of Tokyo"}, {"id": 193005, "fullname": "Kohsei Matsutani", "url": "http://cvpr.thecvf.com/api/miniconf/users/193005?format=json", "institution": "Liquid AI; the University of Tokyo"}, {"id": 152470, "fullname": "Yusuke Iwasawa", "url": "http://cvpr.thecvf.com/api/miniconf/users/152470?format=json", "institution": "The University of Tokyo, The University of Tokyo"}, {"id": 188377, "fullname": "Masahiro Suzuki", "url": "http://cvpr.thecvf.com/api/miniconf/users/188377?format=json", "institution": "The University of Tokyo, Tokyo Institute of Technology"}, {"id": 152471, "fullname": "Yutaka Matsuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/152471?format=json", "institution": "The University of Tokyo"}, {"id": 193006, "fullname": "Hiroki Furuta", "url": "http://cvpr.thecvf.com/api/miniconf/users/193006?format=json", "institution": "Google DeepMind"}], "abstract": "Recent text-to-image generation models have acquired the ability of multi-reference generation and editing; the ability to inherit the appearance of subjects from multiple reference images and re-render them under new contexts.However, the existing benchmark datasets often focus on the generation with single or a few reference images, which prevents us from measuring the progress on how model performance advances or pointing out their weaknesses, under different multi-reference conditions.In addition, their task definitions are vague, typically limited to axes such as \"what to edit\" or \"how many references are given\", and therefore fail to capture the intrinsic difficulty of multi-reference settings. To address this gap, we introduce MultiBanana, which is carefully designed to assesses the edge of model capabilities by widely covering multi-reference-specific problems at scale: (1) varying the number of references, (2) domain mismatch among references (e.g., photo vs. anime), (3) scale mismatch between reference and target scenes, (4) references containing rare concepts (e.g., a red banana), and (5) multilingual textual references for rendering. Our analysis among a variety of text-to-image models reveals their superior performances, typical failure modes, and areas for improvement. MultiBanana will be released as an open benchmark to push the boundaries and establish a standardized basis for fair comparison in multi-reference image generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39863", "url": null, "sourceid": 37333, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39868, "uid": "f1f49699d5abb3a62e57c6541770b65d", "name": "Hypergraph-State Collaborative Reasoning for Multi-Object Tracking", "authors": [{"id": 152183, "fullname": "Zikai Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/152183?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 152184, "fullname": "Junqing Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152184?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 155106, "fullname": "Yi-Ping Phoebe Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/155106?format=json", "institution": "La Trobe University"}, {"id": 90421, "fullname": "Wei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90421?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 87323, "fullname": "Xinchao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87323?format=json", "institution": "National University of Singapore"}], "abstract": "Motion reasoning serves as the cornerstone of multi-object tracking (MOT), as it enables consistent association of targets across frames. However, existing motion estimation approaches face two major limitations: (1) instability caused by noisy or probabilistic predictions, and (2) vulnerability under occlusion, where trajectories often fragment once visual cues disappear.To overcome these issues, we propose a collaborative reasoning framework that enhances motion estimation through joint inference among multiple correlated objects. By allowing objects with similar motion states to mutually constrain and refine each other, our framework stabilizes noisy trajectories and infers plausible motion continuity even when target is occluded.To realize this concept, we design HyperSSM, an architecture that integrates Hypergraph computation and a State Space Model (SSM) for unified spatial\u2013temporal reasoning. The Hypergraph module captures spatial motion correlations through dynamic hyperedges, while the SSM enforces temporal smoothness via structured state transitions. This synergistic design enables simultaneous optimization of spatial consensus and temporal coherence, resulting in robust and stable motion estimation.Extensive experiments on four mainstream and diverse benchmarks(MOT17, MOT20, DanceTrack, and SportsMOT) covering various motion patterns and scene complexities, demonstrate that our approach achieves state-of-the-art performance across a wide range of tracking scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39868", "url": null, "sourceid": 30636, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39869, "uid": "57ecd7316c52ffcbdea03690ea7db2b8", "name": "EgoX: Egocentric Video Generation from a Single Exocentric Video", "authors": [{"id": 156200, "fullname": "Taewoong Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156200?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 152981, "fullname": "Kinam Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/152981?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 193033, "fullname": "Dohyeon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/193033?format=json", "institution": "Seoul National University"}, {"id": 128128, "fullname": "Minho Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/128128?format=json", "institution": "KAIST"}, {"id": 87940, "fullname": "Junha Hyung", "url": "http://cvpr.thecvf.com/api/miniconf/users/87940?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 87936, "fullname": "Jaegul Choo", "url": "http://cvpr.thecvf.com/api/miniconf/users/87936?format=json", "institution": "Korea Advanced Institute of Science and Technology"}], "abstract": "Egocentric perception enables humans to experience and understand the world directly from their own point of view. Translating exocentric (third-person) videos into egocentric (first-person) videos opens up new possibilities for immersive understanding but remains highly challenging due to extreme camera pose variations and minimal view overlap. This task requires faithfully preserving visible content while synthesizing unseen regions in a geometrically consistent manner. To achieve this, we present EgoX, a novel framework for generating egocentric videos from a single exocentric input.EgoX leverages the pretrained spatio\u2013temporal knowledge of large-scale video diffusion models through lightweight LoRA adaptation and introduces a unified conditioning strategy that combines exocentric and egocentric priors via width- and channel-wise concatenation.Additionally, a geometry-guided self-attention mechanism selectively attends to spatially relevant regions, ensuring geometric coherence and high visual fidelity.Our approach achieves coherent and realistic egocentric video generation while demonstrating strong scalability and robustness across unseen and in-the-wild videos.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39869", "url": null, "sourceid": 41930, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39874, "uid": "0226542688e5dc36cf47b133ef4e7237", "name": "MetricHMSR: Metric Human Mesh and Scene Recovery from Monocular Images", "authors": [{"id": 193038, "fullname": "Chentao Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/193038?format=json", "institution": null}, {"id": 102159, "fullname": "He Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/102159?format=json", "institution": "Tsinghua University"}, {"id": 182500, "fullname": "Yuan Haolei", "url": "http://cvpr.thecvf.com/api/miniconf/users/182500?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 133385, "fullname": "Haozhe Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/133385?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 156643, "fullname": "Jianhua Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/156643?format=json", "institution": "Tsinghua University"}, {"id": 133204, "fullname": "Hongwen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/133204?format=json", "institution": "Beijing Normal University"}, {"id": 86367, "fullname": "Tao Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86367?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "We introduce MetricHMSR (Metric Human Mesh and Scene Recovery), a novel approach for metric human mesh and scene recovery from monocular images. Due to unrealistic assumptions in the camera model and inherent challenges in metric perception, existing approaches struggle to achieve human pose and metric 3D position estimation through a unified module.To address this limitation, MetricHMSR incorporates camera rays to comprehensively encode both the bounding box information and the intrinsic parameters of perspective projection. Then we proposed Human Mixture-of-Experts (MoE), the model dynamically routes image features and ray features to task-specific experts for specialized understanding of different data aspects, enabling a unified framework that simultaneously perceives the local pose and the global 3D position.Based on the results above, we further refine the existing monocular metric depth estimation method to achieve more accurate results, ultimately enabling the seamless overlay of humans and scenes in 3D space.Comprehensive experimental results demonstrate that the proposed method achieves state-of-the-art performance on both human mesh  and scene recovery.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39874", "url": null, "sourceid": 38012, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39875, "uid": "6ef77bd3e3cfb00cd02bba48e6e9a9e3", "name": "ColorFLUX: A Structure-Color Decoupling Framework for Old Photo Colorization", "authors": [{"id": 89755, "fullname": "Bingchen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/89755?format=json", "institution": "University of Science and Technology of China"}, {"id": 153256, "fullname": "Zhixin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153256?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 153255, "fullname": "Fan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/153255?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 76441, "fullname": "Jiaqi Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76441?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 193039, "fullname": "Jiaming Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/193039?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 84715, "fullname": "Renjing Pei", "url": "http://cvpr.thecvf.com/api/miniconf/users/84715?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 158204, "fullname": "Xin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/158204?format=json", "institution": "University of Science and Technology of China"}, {"id": 85129, "fullname": "Zhibo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/85129?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Old photos preserve invaluable historical memories, making their restoration and colorization highly desirable. While existing restoration models can address some degradation issues like denoising and scratch removal, they often struggle with accurate colorization.This limitation arises from the unique degradation inherent in old photos, such as faded brightness and altered color hues, which are different from modern photo distributions, creating a substantial domain gap during colorization. In this paper, we propose a novel old photo colorization framework based on the generative diffusion model FLUX. Our approach introduces a structure-color decoupling strategy that separates structure preservation from color restoration, enabling accurate colorization of old photos while maintaining structural consistency. We further enhance the model with a progressive Direct Preference Optimization (Pro-DPO) strategy, which allows the model to learn subtle color preferences through coarse-to-fine transitions in color augmentation. Additionally, we address the limitations of text-based prompts by introducing visual semantic prompts, which extract fine-grained semantic information directly from old photos, helping to eliminate the color bias inherent in old photos. Experimental results on both synthetic and real datasets demonstrate that our approach outperforms existing state-of-the-art colorization methods, including closed-source commercial models, producing high-quality and vivid colorization.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39875", "url": null, "sourceid": 40724, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39877, "uid": "1ede38c6bf62e43dd7ecb9f2bd7417a8", "name": "Uncertainty-Aware Modality Fusion for Unaligned RGB-T Salient Object  Detection", "authors": [{"id": 147200, "fullname": "Mianzhao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/147200?format=json", "institution": "Tianjin University of Technology"}, {"id": 157190, "fullname": "Fan Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/157190?format=json", "institution": "Tianjin University of Technology"}, {"id": 157191, "fullname": "Xu Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/157191?format=json", "institution": "Tianjin University of Technology"}, {"id": 157189, "fullname": "Chen Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/157189?format=json", "institution": "Tianjin University of Technology"}, {"id": 85973, "fullname": "Shengyong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/85973?format=json", "institution": "Tianjin University of Technology"}], "abstract": "Unaligned RGB-T salient object detection (SOD) remains challenging due to severe cross-modal spatial discrepancies and unreliable feature fusion. Existing methods often assume perfect alignment or rely on geometric registration, which is computationally demanding and sensitive to cross-modal inconsistencies. To address these limitations, we propose an uncertainty-aware modality fusion network (UMFNet) that reformulates RGB-T SOD as an uncertainty-aware representation learning problem. Specifically, the proposed uncertainty alignment module (UAM) models pixel-wise features as Gaussian latent distributions to estimate local uncertainty and identify cross-modal consistency regions within the feature space, thereby achieving implicit alignment without explicit registration. Furthermore, the confidence-guided global modulation (CGM) mechanism leverages confidence maps derived from uncertainty estimation to adaptively regulate the fusion of RGB and thermal features, enhancing salient cues in reliable regions while suppressing noisy or inconsistent information. Extensive experiments on five unaligned and three aligned RGB-T SOD benchmarks demonstrate that UMFNet achieves state-of-the-art performance across diverse alignment conditions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39877", "url": null, "sourceid": 46154, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39880, "uid": "db525c87bd3c69102f83450fbf2c684c", "name": "STCast: Adaptive Boundary Alignment for Global and Regional Weather Forecasting", "authors": [{"id": 71432, "fullname": "Hao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/71432?format=json", "institution": "Zhejiang University"}, {"id": 126145, "fullname": "Tao Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/126145?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 193043, "fullname": "Jie ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/193043?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 126142, "fullname": "Song Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/126142?format=json", "institution": "Department of Computer Science and Engineering, Hong Kong University of Science and Technology"}, {"id": 87059, "fullname": "Lei Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/87059?format=json", "institution": "Shanghai AI Laboratory"}], "abstract": "To gain finer regional forecasts, many works have explored the regional integration from the global atmosphere, e.g., by solving boundary equations in physics-based methods or cropping regions from global forecasts in data-driven methods. However, the effectiveness of these methods is often constrained by static and imprecise regional boundaries, resulting in poor generalization ability. To address this issue, we propose Spatial-Temporal Weather Forecasting (STCast), a novel AI-driven framework for adaptive regional boundary optimization and dynamic monthly forecast allocation. Specifically, our approach employs a Spatial-Aligned Attention (SAA) mechanism, which aligns global and regional spatial distributions to initialize boundaries and adaptively refines them based on attention-derived alignment patterns. Furthermore, we design a Temporal Mixture-of-Experts (TMoE) module, where atmospheric variables from distinct months are dynamically routed to specialized experts using a discrete Gaussian distribution, enhancing the model\u2019s ability to capture temporal patterns. Beyond global and regional forecasting, STCast is evaluated on extreme event prediction and ensemble forecasting. Experimental results demonstrate consistent superiority over state-of-the-art methods across all four tasks.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39880", "url": null, "sourceid": 30938, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39882, "uid": "0b54a9f970d7aceab45b6b4e15349fa2", "name": "The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation", "authors": [{"id": 102935, "fullname": "Weijia Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/102935?format=json", "institution": "NUS"}, {"id": 193046, "fullname": "Hao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193046?format=json", "institution": "ByteDance Inc."}, {"id": 157173, "fullname": "Zhenheng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157173?format=json", "institution": "Tiktok"}, {"id": 73557, "fullname": "Mike Zheng Shou", "url": "http://cvpr.thecvf.com/api/miniconf/users/73557?format=json", "institution": "National University of Singapore"}], "abstract": "A reliable reward function is essential for reinforcement learning (RL) in image generation. Most current RL approaches depend on pre-trained preference models that output scalar rewards to approximate human preferences. However, these rewards often fail to capture human perception and are vulnerable to reward hacking, where higher scores do not correspond to better images. To address this, we introduce **Adv-GRPO**, an RL framework with an adversarial reward that iteratively updates both the reward model and the generator. The reward model is supervised using reference images as positive samples and can largely avoid being hacked. Unlike KL regularization that constrains parameter updates, our learned reward directly guides the generator through its visual outputs, leading to higher-quality images. Moreover, while optimizing existing reward functions can alleviate reward hacking, their inherent biases remain. For instance, PickScore may degrade image quality, whereas OCR-based rewards often reduce aesthetic fidelity. To address this, we take **the image itself as a reward**, using reference images and vision foundation models (e.g., DINO) to provide rich visual rewards. These dense visual signals, instead of a single scalar, lead to consistent gains across image quality, aesthetics, and task-specific metrics. Finally, we show that combining reference samples with foundation-model rewards enables distribution transfer and flexible style customization. In human evaluation, our method outperforms Flow-GRPO and SD3, achieving 70.0% and 72.4% win rates in image quality and aesthetics, respectively. Code and models will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39882", "url": null, "sourceid": 46686, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39887, "uid": "561ac63524680368701567891601024a", "name": "Extending Embodied Question Answering from Perception to Decision", "authors": [{"id": 180575, "fullname": "Xicheng Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/180575?format=json", "institution": "Peking University"}, {"id": 129979, "fullname": "Qiwei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129979?format=json", "institution": "Peking University"}, {"id": 180523, "fullname": "Peiran Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180523?format=json", "institution": "Peking University"}, {"id": 89566, "fullname": "Yadong Mu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89566?format=json", "institution": "Peking University"}], "abstract": "Embodied Question Answering (EQA) connects perception, reasoning, and interaction within embodied environments. However, existing datasets and benchmarks remain fragmented, each focusing on a limited subset of reasoning skills such as spatial understanding or procedural reasoning, without offering a unified large-scale framework for comprehensive evaluation. We present EQA-Decision, a large-scale embodied QA dataset that systematically covers four complementary dimensions of embodied reasoning: static scene construction, spatial understanding, task dynamics reasoning, and instant decision. The dataset contains over four million question\u2013answer pairs with hierarchical annotations across diverse embodied scenarios. In addition, we develop RoboDecision, a strong baseline model aligned with the EQA-Decision Benchmark, providing a unified framework that jointly evaluates perception, reasoning, and action-level decision-making in embodied environments. Results demonstrate that EQA-Decision effectively benchmarks and enhances VLM capabilities in spatial and interaction reasoning, providing a solid foundation for advancing embodied intelligence research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39887", "url": null, "sourceid": 36882, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39891, "uid": "4e0c50348aedc8071dd0073f5aa46adc", "name": "Weight Space Representation Learning with Neural Fields", "authors": [{"id": 180797, "fullname": "Zhuoqian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180797?format=json", "institution": "EPFL - EPF Lausanne"}, {"id": 85443, "fullname": "Mathieu Salzmann", "url": "http://cvpr.thecvf.com/api/miniconf/users/85443?format=json", "institution": "EPFL"}, {"id": 77165, "fullname": "Sabine S\u00fcsstrunk", "url": "http://cvpr.thecvf.com/api/miniconf/users/77165?format=json", "institution": "EPFL - EPF Lausanne"}], "abstract": "In this work, we investigate the potential of weights to serve as effective representations, focusing on neural fields. Our key insight is that constraining the optimization space through a pre-trained base model and multiplicative low-rank adaptation (mLoRA) can induce structure in weight space. Across reconstruction, generation, and analysis tasks on 2D and 3D data, we find that mLoRA weights achieve high representation quality while exhibiting distinctiveness and semantic structure. When used with latent diffusion models, mLoRA weights enable higher-quality generation than existing weight-space methods. Source code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39891", "url": null, "sourceid": 37308, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39892, "uid": "b156dd7416c8124ee31921d2d6f3111c", "name": "$\\texttt{MonoVLM}$: Monocular 3D Visual Grounding with Vision Language Models", "authors": [{"id": 192537, "fullname": "Huaizhi Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192537?format=json", "institution": "Department of Computer Science, University of North Carolina at Chapel Hill"}, {"id": 188277, "fullname": "Hossein Nourkhiz Mahjoub", "url": "http://cvpr.thecvf.com/api/miniconf/users/188277?format=json", "institution": "Honda Research Institute US"}, {"id": 182479, "fullname": "Vaishnav Tadiparthi", "url": "http://cvpr.thecvf.com/api/miniconf/users/182479?format=json", "institution": "Honda Research Institute USA, Inc."}, {"id": 93484, "fullname": "Kwonjoon Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/93484?format=json", "institution": "Honda Research Institute USA"}, {"id": 128595, "fullname": "Tianlong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/128595?format=json", "institution": "Massachusetts Institute of Technology"}], "abstract": "Vision-Language Models (VLMs) have demonstrated remarkable capabilities in instruction following and 2D visual understanding. However, state-of-the-art VLMs, including GPT-5 still struggle with 3D perception, particularly in tasks such as monocular 3D visual grounding. While specialized vision-only models excel in this domain, they often lack the rich semantic understanding inherent to VLMs. To bridge this gap, we propose $\\texttt{MonoVLM}$, a novel triple-stage training framework that effectively enables VLMs with accurate monocular 3D grounding. The core of our method is a progressive training process, which utilizes Group Relative Policy Optimization (GRPO) to gradually teach the model to first localize the described object, then understand its 3D structure, and finally, perform accurate estimation. Comprehensive experiments show that $\\texttt{MonoVLM}$ models significantly outperform existing VLMs and even surpass the performance of specialized vision-only models. We validate our design via extensive comparisons and ablation studies.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39892", "url": null, "sourceid": 33973, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39894, "uid": "e8660ee8b41231074f2d9f16adb2ea71", "name": "DA-VAE: Plug-in Latent Compression for Diffusion via Detail Alignment", "authors": [{"id": 152360, "fullname": "Xin Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/152360?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 152361, "fullname": "Zhiyuan You", "url": "http://cvpr.thecvf.com/api/miniconf/users/152361?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 129904, "fullname": "Zhoutong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129904?format=json", "institution": "Adobe Systems"}, {"id": 87471, "fullname": "Tianfan Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/87471?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "Reducing token number in the latent diffusion is important for both efficient training and inference, especially at high resolution.A common approach is to design high-compression image tokenizers that store more information per token by increasing the number of channels. However, packing more details into each token tends to make the latent space less structured, which in turn makes diffusion training difficult. To solve this, current solutions use semantic alignment or training-time dropout to impose structures in the latent space, which often requires retraining the diffusion model from scratch. Can we increase the compression ratio of the image tokenizer, while not requiring expensive re-training? As we find out, a simple solution is to explicit add channels to the existing latent to capture image details, and align them towards the latent from the pre-trained diffusion model. Our method, \\textbf{D}etail-\\textbf{A}ligned VAE, increases the compression ratio of a pretrained VAE, while only require a light-weight adaptation stage for the corresponding pretrained diffusion backbone. Specifically, DA-VAE imposes an explicit latent structure: the first $C$ channels of the latent space is given by the pre-trained VAE, encoding the input image at half the resolution. We use an extra of $D$ channels to encode details of the image at full-res.To make this new latent diffusion friendly, we introduces a simple detail alignment strategy that constraints the extra $D$ channels to have similar structures of the first $C$ channels. With such a design, we provide a warm-start finetuning recipe which effectively enables $1024\\times 1024$ image generation with Stable Diffusion 3.5, using only $32\\times32$ tokens, $4\\times$ less than the original model.This adaptation only takes 5 H100 days. We also show that we could unlock $2048\\times2048$ image generation with SD3.5, with $6\\times$ speed up and more stable image structure. We further validate the effectiveness of our method and design decisions quantitatively on ImageNet.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39894", "url": null, "sourceid": 32521, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39895, "uid": "b6d647bef97806f924072628784229c2", "name": "Edge-RecViT: Efficient Vision Transformer via Semantic-Refined Dynamic Recursion", "authors": [{"id": 175252, "fullname": "YiZhou Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/175252?format=json", "institution": "Xi\u2019an Jiaotong\u2013Liverpool University"}, {"id": 175433, "fullname": "Jinyi Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175433?format=json", "institution": "Xi&#x27;an Jiaotong-Liverpool University"}, {"id": 193067, "fullname": "Mingyu Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/193067?format=json", "institution": null}, {"id": 177619, "fullname": "Xianyi Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/177619?format=json", "institution": "Xi\u2019an Jiaotong-Liverpool University"}], "abstract": "Vision Transformers (ViTs) have achieved remarkable progress in visual and multimodal tasks, yet their deployment remains costly.Token-adaptive methods reduce FLOPs through dynamic depth computation, but face two limitations:(1) Global attention overemphasizes highly similar foreground regions, causing token-adaptive modules to assign the deepest computation to semantically weak foreground tokens while prematurely exiting edge tokens rich in structural cues;(2) Although token-adaption lowers FLOPs, it still relies on large parameter sets, and deep-layer weights remain underutilized due to early token exit. Parameter sharing could address redundancy but is difficult to apply in ViTs, where hierarchical abstraction typically requires diverse transformations.To address these issues, we propose Edge-RecViT, an Edge-Adaptive Dynamic Recursive Vision Transformer that integrates an edge-aware token-adaptive ranker with a recursive transformer using fully shared parameters in its hidden layers.Edge-RecViT dynamically allocates computation based on semantic richness: structurally informative edge tokens receive deeper refinement, whereas redundant low-information tokens exit early.Extensive experiments show that Edge-RecViT provides an excellent trade-off among accuracy, FLOPs, and parameter efficiency.On ImageNet-1K, it matches DeiT within 0.3\\% Top-1 accuracy while reducing FLOPs by 30.5\\% (35.1 \u2192 24.39 GFLOPs).At the Base level, parameter drops from 86M to 23.21M with higher accuracy than ViT-Base; compared with ViT-Large, parameters are reduced by 93\\% while maintaining superior accuracy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39895", "url": null, "sourceid": 39818, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39896, "uid": "dbfd53b86f75d866c5d574d87738c43a", "name": "LumiMotion: Improving Gaussian Relighting with Scene Dynamics", "authors": [{"id": 184286, "fullname": "Joanna Kaleta", "url": "http://cvpr.thecvf.com/api/miniconf/users/184286?format=json", "institution": "Sano Centre for Computational Medicine"}, {"id": 193068, "fullname": "Piotr W\u00f3jcik", "url": "http://cvpr.thecvf.com/api/miniconf/users/193068?format=json", "institution": "Universit\u00e4t K\u00f6ln"}, {"id": 193069, "fullname": "Kacper Marzol", "url": "http://cvpr.thecvf.com/api/miniconf/users/193069?format=json", "institution": "Jagiellonian University in Krakow"}, {"id": 87369, "fullname": "Tomasz Trzci\u0144ski", "url": "http://cvpr.thecvf.com/api/miniconf/users/87369?format=json", "institution": "Jagiellonian University"}, {"id": 87341, "fullname": "Kacper Kania", "url": "http://cvpr.thecvf.com/api/miniconf/users/87341?format=json", "institution": "Warsaw University of Technology"}, {"id": 87347, "fullname": "Marek Kowalski", "url": "http://cvpr.thecvf.com/api/miniconf/users/87347?format=json", "institution": "Microsoft"}], "abstract": "In 3D reconstruction, the problem of inverse rendering, namely recovering the illumination of the scene and the material properties, is fundamental. Existing Gaussian Splatting-based methods primarily target static scenes and often assume simplified or moderate lighting to avoid entangling shadows with surface appearance. This limits their ability to accurately separate lighting effects from material properties, particularly in real-world conditions. We address this limitation by leveraging dynamic elements - regions of the scene that undergo motion - as a supervisory signal for inverse rendering. Motion reveals the same surfaces under varying lighting conditions, providing stronger cues for disentangling material and illumination. This thesis is supported by our experimental results which show we improve LPIPS by 23\\% for albedo estimation and by 15% for scene relighting relative to next-best baseline. To this end, we introduce LumiMotion, the first Gaussian-based approach that leverages dynamics for inverse rendering and operates in arbitrary dynamic scenes. Our method learns a dynamic 2D Gaussian Splatting representation that employs a set of novel constraints which encourage the dynamic regions of the scene to deform, while keeping static regions stable. As we demonstrate, this separation is crucial for correct optimization of the albedo. Finally, we release a new synthetic benchmark comprising five scenes under four lighting conditions, each in both static and dynamic variants, for the first time enabling systematic evaluation of inverse rendering methods in dynamic environments and challenging lighting.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39896", "url": null, "sourceid": 44210, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39900, "uid": "83bf29a1e7fb5f7afa57187371bdd6e7", "name": "CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects", "authors": [{"id": 146136, "fullname": "Gabriel Fiastre", "url": "http://cvpr.thecvf.com/api/miniconf/users/146136?format=json", "institution": "Inria Paris"}, {"id": 90739, "fullname": "Antoine Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90739?format=json", "institution": "Inria"}, {"id": 69185, "fullname": "Cordelia Schmid", "url": "http://cvpr.thecvf.com/api/miniconf/users/69185?format=json", "institution": "Inria / Google"}], "abstract": "Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language.Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to training strategies with limited data, potentially leading to suboptimal performance.To circumvent this issue,  we propose to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM, and extend the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap). Moreover, we introduce an end-to-end model, CaptionFormer, capable of jointly detecting, segmenting, tracking and captioning object trajectories.With pretraining on LVISCap and LV-VISCap, CaptionFormer achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39900", "url": null, "sourceid": 42509, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39899, "uid": "821277b37d3d6ac1b891520564c66300", "name": "Reframing Long-Tailed Learning via Loss Landscape Geometry", "authors": [{"id": 193073, "fullname": "shenghan chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193073?format=json", "institution": "Shandong University"}, {"id": 193074, "fullname": "Yiming Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193074?format=json", "institution": null}, {"id": 193075, "fullname": "Yanzhen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193075?format=json", "institution": null}, {"id": 157796, "fullname": "Yujia Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157796?format=json", "institution": "Zhejiang Sci-Tech University"}, {"id": 155291, "fullname": "Xiankai Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155291?format=json", "institution": "Shandong University"}], "abstract": "Balancing performance trade-off on long-tail data distributions remains a long-standing challenge. In this paper, we posit that this dilemma stems from a phenomenon called  \"catastrophic forgetting'' in continual learning (the model tends to severely overfit on head classes while quickly forgetting tail classes) and pose a solution from a loss landscape perspective.  We observe that  different classes possess divergent convergence points in the loss landscape. Besides, this divergence is aggravated when the model settles into sharp and non-robust minima, rather than a shared and flat solution that beneficial for all classes. In light of this, we propose a continual learning inspired framework to prevent \"catastrophic forgetting''. To avoid inefficient per-class parameter preservation, a Grouped Knowledge Preservation module is proposed to memorize group-specific convergence parameters, promoting convergence towards a shared solution. Concurrently, our framework integrates a Grouped Sharpness Aware  module to seek flatter minima by explicitly addressing the geometry of the loss landscape. Notably, our framework requires neither external training samples nor pre-trained models, facilitating the broad applicability. Extensive experiments on four benchmarks demonstrate significant performance gains over state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39899", "url": null, "sourceid": 35185, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39905, "uid": "3ef00cbe8a65af09beddab1c55e103fd", "name": "Thinking with Programming Vision: Towards a Unified View for Thinking with Images", "authors": [{"id": 155582, "fullname": "Zirun Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/155582?format=json", "institution": "Zhejiang University"}, {"id": 193087, "fullname": "Minjie Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/193087?format=json", "institution": "Zhejiang University"}, {"id": 193088, "fullname": "Feng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193088?format=json", "institution": "ByteDance Inc."}, {"id": 193089, "fullname": "Kai Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/193089?format=json", "institution": "ByteDance Inc."}, {"id": 152602, "fullname": "Tao Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/152602?format=json", "institution": "Zhejiang University"}], "abstract": "Multimodal large language models (MLLMs) that ''think with images'' can interactively use tools to reason about visual inputs, but current approaches often rely on a narrow set of tools with limited real-world necessity and scalability. In this work, we first reveal a critical and previously overlooked weakness: even state-of-the-art MLLMs are surprisingly brittle, showing significant performance degradation on images with simple orientation changes, underscoring the need for more robust tool-based reasoning. To address this, we propose **CodeVision**, a flexible and scalable ``code-as-tool'' framework where the model generates code as a universal interface to invoke any image operation, moving beyond fixed tool registries. We train our model using a two-stage methodology, beginning with Supervised Fine-Tuning (SFT) on a high-quality dataset curated for complex, multi-turn tool composition and error recovery, followed by Reinforcement Learning (RL) with a novel and dense process reward function to encourage strategic and efficient tool use. To facilitate this research, we construct new SFT and RL datasets and introduce a challenging new benchmark suite designed to rigorously evaluate robustness to orientation changes and multi-tool reasoning. Experiments on Qwen2.5-VL and Qwen3-VL series show that our approach significantly improves model performance and fosters emergent capabilities such as flexible tool composition, efficient chained execution, and robust error recovery from runtime feedback. Codes and datasets will be available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39905", "url": null, "sourceid": 37653, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39912, "uid": "87b65f9d738ea59c2c51cb5d0e492087", "name": "SpotEdit: Selective Region Editing in Diffusion Transformers", "authors": [{"id": 183247, "fullname": "ZHIBIN QIN", "url": "http://cvpr.thecvf.com/api/miniconf/users/183247?format=json", "institution": "National university of singapore"}, {"id": 127173, "fullname": "Zhenxiong Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/127173?format=json", "institution": "National University of Singapore"}, {"id": 193101, "fullname": "Zeqing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193101?format=json", "institution": "National University of Singapore"}, {"id": 180264, "fullname": "Songhua Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180264?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 87323, "fullname": "Xinchao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87323?format=json", "institution": "National University of Singapore"}], "abstract": "Diffusion Transformer (DiT)-based models have significantly advanced image editing by encoding conditional images and integrating them into transformer layers. However, most edits involve modifying only small regions, while current methods uniformly process and denoise all tokens at every timestep, causing redundant computation and potentially degrading unchanged areas. This raises a fundamental question: Is it truly necessary to regenerate every region during editing? To address this, we propose SpotEdit, a training-free diffusion editing framework that selectively updates only the modified regions. SpotEdit comprises two key components: SpotSelector identifies stable regions via perceptual similarity and skips their computation by reusing conditional image features; SpotFusion adaptively blends these features with edited tokens through a dynamic fusion mechanism, preserving contextual coherence and editing quality. By reducing unnecessary computation and maintaining high fidelity in unmodified areas, SpotEdit achieves efficient and precise image editing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39912", "url": null, "sourceid": 38696, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39915, "uid": "14919153f8aaf3af893165510756de56", "name": "QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models", "authors": [{"id": 175764, "fullname": "Jingxuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/175764?format=json", "institution": "Indiana University Indy"}, {"id": 175780, "fullname": "Yun-Ta Hsieh", "url": "http://cvpr.thecvf.com/api/miniconf/users/175780?format=json", "institution": "University of Michigan, Ann Arbor"}, {"id": 193107, "fullname": "Zhongwei Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193107?format=json", "institution": "Ohio State University, Columbus"}, {"id": 129022, "fullname": "Haokun Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/129022?format=json", "institution": "City University of Hong Kong"}, {"id": 193108, "fullname": "Xin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193108?format=json", "institution": "The Ohio StateUniversity"}, {"id": 178522, "fullname": "Ziqi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/178522?format=json", "institution": "The Ohio State University"}, {"id": 170433, "fullname": "Yingtie Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/170433?format=json", "institution": "The Ohio State University"}, {"id": 151046, "fullname": "Mi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151046?format=json", "institution": "The Ohio State University"}], "abstract": "Vision\u2013language\u2013action (VLA) models unify perception, language, and control for embodied agents but face significant challenges in practical deployment due to rapidly increasing compute and memory demands, especially as models scale to longer horizons and larger backbones. To address these bottlenecks, we introduce QuantVLA, a training-free post-training quantization (PTQ) framework that, to our knowledge, is the first PTQ approach for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head. QuantVLA incorporates three scale-calibrated components: (1) a selective quantization layout that integerizes all linear layers in both the language backbone and the DiT while keeping attention projections in floating point to preserve the original operator schedule; (2) attention temperature matching, a lightweight per-head scaling mechanism that stabilizes attention logits and is folded into the dequantization scales at inference; and (3) output head balancing, a per-layer residual interface calibration that mitigates post-projection energy drift. The framework requires no additional training, uses only a small unlabeled calibration buffer, and supports integer kernels for low-bit weights and activations while leaving the architecture unchanged. Across representative VLA models, QuantVLA surpasses full-precision baselines, substantially reduces memory usage within quantized modules, and lowers inference latency, providing a practical pathway toward scalable, low-bit embodied intelligence under strict compute, memory, and power constraints.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39915", "url": null, "sourceid": 40101, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39916, "uid": "ff051a2ee798b928b05590a311d0c44b", "name": "D$^3$FER: Dual Channel and Dual Branch Network for Robust Facial Expression Recognition under Dual Noise", "authors": [{"id": 180681, "fullname": "Hui Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180681?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 193109, "fullname": "Yifan He", "url": "http://cvpr.thecvf.com/api/miniconf/users/193109?format=json", "institution": "Fudan University"}, {"id": 149203, "fullname": "Zhong Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/149203?format=json", "institution": "China University of Petroleum-Beijing at Karamay"}], "abstract": "Facial expression recognition (FER) in the wild remains a challenging task due to the coexistence of data noise and label noise. While existing methods often address one type of noise in isolation, they struggle to achieve robust performance under the compound effects of both. To this end, we propose D$^3$FER ($\\textbf{D}$ual Channel and $\\textbf{D}$ual Branch Network for Robust Facial Expression Recognition under $\\textbf{D}$ual Noise), a unified framework that simultaneously tackles data and label noise in a single architecture. D$^3$FER introduces a dual-channel augmentation strategy, pairing weakly and strongly augmented views, to facilitate reliable pseudo-label generation and noise-aware training. Coupled with a dynamic queue mechanism, it adaptively estimates a noise threshold based on historical prediction confidence, enabling automatic identification and correction of label noise. Furthermore, inspired by contrastive learning, we design a momentum-updated Query-Key dual-branch structure that enhances intra-class compactness and inter-class separability, thereby improving robustness to data noise. At inference time, the stable Key branch parameters are leveraged to ensure consistent and generalized predictions. Extensive experiments on major in-the-wild benchmarks demonstrate that D$^3$FER outperforms state-of-the-art methods, setting new records in both accuracy and robustness under realistic, noisy conditions. The source code is available at https://github.com/D3FER/D3FER.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39916", "url": null, "sourceid": 31357, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39918, "uid": "49dee2008d36b4304fcb72e867f40c14", "name": "Beyond Global Similarity: Multi-Conditional Retrieval for Fine-Grained Cross-Modal Understanding", "authors": [{"id": 183892, "fullname": "Xuan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183892?format=json", "institution": "Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo"}, {"id": 178221, "fullname": "Kangle Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/178221?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 193111, "fullname": "Haohang Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193111?format=json", "institution": "Xi&#x27;an Jiaotong-Liverpool University"}, {"id": 193112, "fullname": "Rui Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193112?format=json", "institution": "Google Cloud AI Research"}, {"id": 133538, "fullname": "Wenjun Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/133538?format=json", "institution": "Eastern Institute of Technology, Ningbo"}, {"id": 154481, "fullname": "Xiaoyu Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/154481?format=json", "institution": "Eastern Institute of Technology, Ningbo"}], "abstract": "Recent advances in multimodal large language models (MLLMs) have substantially expanded the capabilities of multimodal retrieval, enabling systems to align and retrieve information across visual and textual modalities. Yet, existing benchmarks largely focus on coarse-grained or single-condition alignment, overlooking real-world scenarios where user queries specify multiple interdependent constraints across modalities.To bridge this gap, we introduce MCMR (Multi-Conditional Multimodal Retrieval)\u2014a large-scale benchmark designed to evaluate fine-grained, multi-condition retrieval under natural-language queries. MCMR spans five product domains\u2014upper and bottom clothing, jewelry, shoes, and furniture\u2014and preserves rich long-form metadata essential for compositional matching. Each query integrates complementary visual and textual attributes, requiring models to jointly satisfy all specified conditions for relevance.We benchmark a diverse suite of MLLM-based multimodal retrievers and vision\u2013language rerankers to assess their condition-aware reasoning abilities. Experimental results reveal: (i) distinct modality asymmetries across models; (ii) visual cues dominate early-rank precision, while textual metadata stabilizes long-tail ordering; and (iii) MLLM-based pointwise rerankers markedly improve fine-grained matching by explicitly verifying query\u2013candidate consistency.Overall, MCMR establishes a challenging and diagnostic benchmark for advancing multimodal retrieval toward compositional, constraint-aware, and interpretable understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39918", "url": null, "sourceid": 31086, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39920, "uid": "218a7d88cba82b6cca4f1ae59063e3ca", "name": "Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors", "authors": [{"id": 193115, "fullname": "Mingqian Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/193115?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 157800, "fullname": "Shanshan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157800?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 85000, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85000?format=json", "institution": "Nanjing University of Science and Technology"}], "abstract": "Vision Transformer (ViT)-based sparse multi-view 3D object detectors have achieved remarkable accuracy but still suffer from high inference latency due to heavy token processing. To accelerate these models, token compression has been widely explored. However, our revisit of existing strategies, such as token pruning, merging, and patch size enlargement, reveals that they often discard informative background cues, disrupt contextual consistency, and lose fine-grained semantics, negatively affecting 3D detection. To overcome these limitations, we propose SEPatch3D, a novel framework that dynamically adjusts patch sizes while preserving critical semantic information within coarse patches. Specifically, we design Spatiotemporal-aware Patch Size Selection (SPSS) that assigns small patches to scenes containing nearby objects to preserve fine details and large patches to background-dominated scenes to reduce computation cost. To further mitigate potential detail loss, Informative Patch Selection (IPS) selects the informative patches for feature refinement, and Cross-Granularity Feature Enhancement (CGFE) injects fine-grained details into selected coarse patches, enriching semantic features. Experiments on the nuScenes and Argoverse 2 validation sets show that SEPatch3D achieves up to ***57\\%*** faster inference than the StreamPETR baseline and ***20\\%*** higher efficiency than the state-of-the-art ToC3D-faster, while preserving comparable detection accuracy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39920", "url": null, "sourceid": 46276, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39921, "uid": "3c12c84af346626dc2f1b77e52bb301e", "name": "Back to the Feature: Explaining Video Classifiers with Video Counterfactual Explanations", "authors": [{"id": 191341, "fullname": "Chao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191341?format=json", "institution": "King&#x27;s College London, University of London"}, {"id": 148200, "fullname": "chengan che", "url": "http://cvpr.thecvf.com/api/miniconf/users/148200?format=json", "institution": "King&#x27;s College London, University of London"}, {"id": 191863, "fullname": "Xinyue Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191863?format=json", "institution": "King&#x27;s College London, University of London"}, {"id": 191343, "fullname": "Sophia Tsoka", "url": "http://cvpr.thecvf.com/api/miniconf/users/191343?format=json", "institution": "King&#x27;s College London, University of London"}, {"id": 181202, "fullname": "Luis Carlos Garcia Peraza Herrera", "url": "http://cvpr.thecvf.com/api/miniconf/users/181202?format=json", "institution": "King&#x27;s College London"}], "abstract": "Counterfactual explanations (CFEs) are minimal and semantically meaningful modifications of the input of a model that alter the model predictions. They highlight the decisive features the model relies on, providing contrastive interpretations for classifiers. State-of-the-art visual counterfactual explanation methods are designed to explain image classifiers. The generation of CFEs for video classifiers remains largely underexplored. For the counterfactual videos to be useful, they have to be physically plausible, temporally coherent, and exhibit smooth motion trajectories. Existing CFE image-based methods, designed to explain image classifiers, lack the capacity to generate temporally coherent, smooth and physically plausible video CFEs. To address this, we propose Back To The Feature (BTTF), an optimization framework that generates video CFEs. Our method introduces two novel features, 1) an optimization scheme to retrieve the initial latent noise conditioned by the first frame of the input video, 2) a two-stage optimization strategy to enable the search for counterfactual videos in the vicinity of the input video. Both optimization processes are guided solely by the target classifier, ensuring the explanation is faithful. To accelerate convergence, we also introduce a progressive optimization strategy that incrementally increases the number of denoising steps. Extensive experiments on video datasets such as Shape-Moving (motion classification), MEAD (emotion classification), and NTU RGB+D (action classification) show that our BTTF effectively generates valid, visually similar and realistic counterfactual videos that provide concrete insights into the classifier's decision-making mechanism.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39921", "url": null, "sourceid": 31095, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39928, "uid": "d1377d8518f097d94cc5061d1c593a3b", "name": "Compositional Transformation Reasoning for Composed Video Retrieval", "authors": [{"id": 152040, "fullname": "Sihong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152040?format=json", "institution": "South China University of Technology"}, {"id": 158036, "fullname": "Jiaxin Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/158036?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 128613, "fullname": "Dongmei Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128613?format=json", "institution": "Peng Cheng Laboratory"}, {"id": 158038, "fullname": "Yi Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/158038?format=json", "institution": "South China University of Technology"}, {"id": 87840, "fullname": "Yaowei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87840?format=json", "institution": "Pengcheng Laboratory"}, {"id": 158037, "fullname": "Xiaoyong Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/158037?format=json", "institution": "Hong Kong Polytechnic University"}], "abstract": "Composed Video Retrieval (CoVR) aims to retrieve a target video given a reference video and a textual modification describing the desired change. The core challenge lies in modeling compositional multimodal transformations, i.e., how objects, actions, and scenes evolve across video and language modalities in response to fine\u2011grained textual edits. Existing methods address this issue by training on large\u2011scale video\u2013text\u2013video triplets or by generating dense textual descriptions to capture subtle visual differences. However, these supervised approaches often rely on noisy web\u2011scale data and dataset\u2011specific correspondences, leading to overfitting and limited generalization in diverse or fine\u2011grained scenarios, while also failing to effectively model compositional and temporal transformations. To overcome these limitations, we propose a zero\u2011shot, fine\u2011grained transformation reasoning framework based on Multimodal Large Language Models (MLLMs). Our method decomposes the compositional transformation into three complementary reasoning dimensions, i.e., \\emph{entity}, \\emph{action}, and \\emph{scene}, and performs pairwise candidate reasoning to explicitly capture semantic evolution over time. Furthermore, we introduce a recall\u2011oriented multi\u2011objective candidate selection module that identifies high\u2011quality retrieval targets by jointly balancing visual, textual, and multimodal similarities before transformation reasoning. Experiments on EgoCVR and WebVid\u2011CoVR demonstrate the effectiveness of our method over state\u2011of\u2011the\u2011art approaches under the zero\u2011shot setting, with R@1 improvements of +5.8 and +10.8, respectively.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39928", "url": null, "sourceid": 31058, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39934, "uid": "3bbc8b8ac2f0e75cafa24ec9b9530352", "name": "TacSIm: A Dataset and Benchmark for Football Tactical Style Imitation", "authors": [{"id": 172811, "fullname": "Peng Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/172811?format=json", "institution": "Capital University of Physical Education And Sports"}, {"id": 193151, "fullname": "Yuting Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193151?format=json", "institution": "cupes"}, {"id": 166463, "fullname": "Qiurui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/166463?format=json", "institution": "Institution of Artificial Intelligence in Sport in the capital university of physical education and sport"}], "abstract": "Current football imitation research primarily aims to optimize reward-based objectives, such as goals scored or win rate proxies, paying less attention to accurately replicating real-world team tactical behaviors. We introduce TacSIm, a large-scale dataset and benchmark for Tactical Style Imitation in football. TacSIm imitates the acitons of all 11 players in one team in the given broadcast footage of Premier League matches under a single broadcast view. Under a offensive or defensive broadcast footage, TacSIm projects the beginning positions and actions of all 22 players from both sides onto a standard pitch coordinate system. TacSIm offers an explicit style imitation task and evaluation protocols. Tactics style imitation is measured by using spatial occupancy similarity and movement vector similarity in defined time, supporting the evaluation of spatial and temporal similarities for one team. We run multiple baseline methods in a unified virtual environment to generate full-team behaviors, enabling both quantitative and visual assessment of tactical coordination. By using unified data and metrics from broadcast to simulation, TacSIm establishes a rigorous benchmark for measuring and modeling style-aligned tactical imitation task in football. The benchmark will be soon public.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39934", "url": null, "sourceid": 33308, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39935, "uid": "ced5e89cbbebcc24beba11a8148cfaa1", "name": "AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs", "authors": [{"id": 181735, "fullname": "Lidong Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181735?format=json", "institution": "Nanjing University"}, {"id": 126529, "fullname": "Guo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126529?format=json", "institution": "Nanjing University"}, {"id": 187988, "fullname": "Zhu Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/187988?format=json", "institution": null}, {"id": 132597, "fullname": "Zhiqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/132597?format=json", "institution": "NVIDIA"}, {"id": 193152, "fullname": "Yicheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193152?format=json", "institution": "Nanjing University"}, {"id": 88504, "fullname": "Tong Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88504?format=json", "institution": "Nanjing University"}], "abstract": "Despite progress in video understanding, current MLLMs struggle with counting tasks. Existing benchmarks are limited by short videos, close-set queries, lack of clue annotations, and weak multimodal coverage. In this paper, we introduce CG-AV-Counting, a manually-annotated clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It supports both black-box and white-box evaluation, serving as a comprehensive testbed for both end-to-end and reasoning-based counting. To explore ways to improve model's counting capability, we propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks. AV-Reasoner achieves SOTA results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning. However, experiments reveal that on out-of-domain benchmarks, reasoning in the language space offers limited performance gains, suggesting the need for more robust cross-domain reasoning mechanisms.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39935", "url": null, "sourceid": 36425, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39939, "uid": "af1042fcbdb33caab1c3ba63169e360c", "name": "MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models", "authors": [{"id": 181484, "fullname": "Ruoxiang Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181484?format=json", "institution": "Peking University"}, {"id": 182300, "fullname": "Zhen Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/182300?format=json", "institution": "Peking. University"}], "abstract": "Vision-Language Models (VLMs) have demonstrated remarkable capabilities in multimodal understanding, yet their positional encoding mechanisms remain fundamentally limited. Current approaches apply uniform positional indices across all tokens, failing to account for dramatic variations in information density between and within modalities. This uniform treatment leads to suboptimal attention allocation and inefficient cross-modal fusion. We introduce MODIX (Multimodal Information-Driven Positional Index Scaling), a training-free framework that dynamically adapts positional granularity based on information-theoretic analysis of modality contributions. By jointly quantifying intrinsic information density within each modality and cross-modal interaction strength, MODIX assigns finer positional strides to information-rich content and coarser strides to redundant regions. Operating purely at inference time, our method requires no architectural modifications or retraining, enabling plug-and-play integration with existing VLMs. Comprehensive experiments across multiple state-of-the-art architectures and six benchmarks demonstrate that MODIX consistently improves multimodal reasoning, achieving up to 8.4% gains on ScienceQA and 6.8% on RealWorldQA, while dynamically adapting positional resolution to task-specific information distributions.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39939", "url": null, "sourceid": 41660, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39948, "uid": "5bf496834d830d71d0d517e552b8245f", "name": "OpenDance: Multimodal Controllable 3D Dance Generation with Large-scale Internet Data", "authors": [{"id": 129927, "fullname": "Jinlu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129927?format=json", "institution": "Peking University"}, {"id": 176658, "fullname": "Zixi Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176658?format=json", "institution": "Peking University"}, {"id": 161896, "fullname": "Libin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/161896?format=json", "institution": "Peking University"}, {"id": 88151, "fullname": "Jianlong Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88151?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 88171, "fullname": "Qi Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/88171?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 129700, "fullname": "Feng Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/129700?format=json", "institution": "Peking University"}, {"id": 74208, "fullname": "Yizhou Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/74208?format=json", "institution": "Peking University"}], "abstract": "Music-driven 3D dance generation offers significant creative potential, yet practical applications demand versatile and multimodal control. As the highly dynamic and complex human motion covering various styles and genres, dance generation requires satisfying diverse conditions beyond just music (e.g., spatial trajectories, keyframe gestures, or style descriptions). However, the absence of a large-scale and richly annotated dataset severely hinders progress. In this paper, we build OpenDanceSet, an extensive human dance dataset comprising over 100 hours across 14 genres and 147 subjects. Each sample has rich annotations to facilitate robust cross-modal learning: 3D motion, paired music, 2D keypoints, trajectories, and expert-annotated text descriptions. Furthermore, we propose OpenDanceNet, a unified masked modeling framework for controllable dance generation, including a disentangled auto-encoder and a multimodal joint-prediction Transformer. OpenDanceNet supports generation conditioned on music and arbitrary combinations of text, keypoints, or trajectories. Comprehensive experiments demonstrate that our work achieves high-fidelity synthesis with strong diversity and realistic physical contacts, while also offering flexible control over spatial and stylistic conditions.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39948", "url": null, "sourceid": 30766, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40388?format=json"], "related_events_ids": [40388]}, {"id": 39943, "uid": "44374c1f0ebad6dc48951e6c20c25806", "name": "LongStream: Long-Sequence Streaming Autoregressive Visual Geometry", "authors": [{"id": 193164, "fullname": "Chong Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193164?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou))"}, {"id": 193165, "fullname": "Xianda Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193165?format=json", "institution": "Central South University"}, {"id": 159903, "fullname": "Tao Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/159903?format=json", "institution": "Zhejiang University"}, {"id": 88382, "fullname": "Wei Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/88382?format=json", "institution": " Shenzhen DJI Sciences and Technologies Ltd."}, {"id": 176618, "fullname": "Weiqiang Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/176618?format=json", "institution": "Horizon Robotics"}, {"id": 184347, "fullname": "Qian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184347?format=json", "institution": "Horizon Robotics"}, {"id": 184348, "fullname": "Xiaoyang Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/184348?format=json", "institution": "Horizon Robotics"}, {"id": 156341, "fullname": "Hao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156341?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}], "abstract": "Long-sequence streaming 3D reconstruction remains a significant open challenge. Existing autoregressive models often fail when processing long sequences. They typically anchor poses to the first frame, which leads to attention decay, scale drift, and extrapolation errors. We introduce LongStream, a novel gauge-decoupled streaming visual geometry model for metric-scale scene reconstruction across thousands of frames. Our approach is threefold. First, we discard the first-frame anchor and predict keyframe-relative poses. This reformulates long-range extrapolation into a constant-difficulty local task. Second, we introduce orthogonal scale learning. This method fully disentangles geometry from scale estimation to suppress drift. Finally, we solve Transformer cache issues such as attention-sink reliance and long-term KV-cache contamination. We propose cache-consistent training combined with periodic cache refresh. This approach suppresses attention degradation over ultra-long sequences and reduces the gap between training and inference.Experiments show LongStream achieves state-of-the-art performance. It delivers stable, metric-scale reconstruction over kilometer-scale sequences at 18 FPS.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39943", "url": null, "sourceid": 32815, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39945, "uid": "c9181c54402583f631b0ec7e6685b0d0", "name": "ProgTrack: A Multi-Object Tracking Algorithm with Progressive Matching Strategy", "authors": [{"id": 181797, "fullname": "Chenhui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181797?format=json", "institution": "Xidian University"}, {"id": 193171, "fullname": "Guoqing Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/193171?format=json", "institution": "Northwestern Polytechinical University"}, {"id": 193172, "fullname": "WeijiePeng WeijiePeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193172?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}], "abstract": "Multi-object tracking (MOT) based on unmanned aerial vehicle (UAV) aims to identify and continuously track the positions of multiple ground targets during UAV flight. Current mainstream methods utilize appearance matching and motion matching to match targets in consecutive frames. However, these methods often fail in the following scenarios: First, scenarios with multi-scale targets, where small targets have weak appearance features and small bounding boxes; second, scenarios with complex backgrounds or occlusions, where the background or occlusions interfere with the appearance features and change the bounding box size of targets; third, scenarios where the UAV lens shakes, rotates, or zooms, leading to misalignment between consecutive frames; and fourth, scenarios with high targets similarity, where the appearance features between targets are difficult to distinguish, such as vehicles on a road. To address these issues, we propose a multi-object tracking algorithm, ProgTrack, based on a multi-stage progressive matching mechanism. This algorithm simulates human eye tracking strategies, employing a progressive process of \"first matching easily matched large targets, then matching difficult-to-match small targets, and finally matching the remaining mixed-scale targets.\" Similarly, ProgTrack employs three strategies for target matching at different scales and appearances: a simple Local Motion Information (LMI) matching strategy for large targets, a complex Context Enhancement Feature (CE-Feature) matching strategy for small targets, and a Global Motion Information (GMI) matching strategy for multi-scale targets matching, thereby achieving target matching. On the VisDrone2019 UAV tracking dataset, ProgTrack achieves MOTA, MOTP, and IDF1 scores of 40.2, 77.5, and 52.8, respectively, demonstrating state-of-the-art performance among ten methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39945", "url": null, "sourceid": 39378, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39947, "uid": "f9cba935dc57ee3ead2f292cd14d3c3c", "name": "VideoMaMa: Mask-Guided Video Matting via Generative Prior", "authors": [{"id": 193176, "fullname": "Sangbeom Lim", "url": "http://cvpr.thecvf.com/api/miniconf/users/193176?format=json", "institution": "Korea University; Adobe Systems"}, {"id": 87532, "fullname": "Seoung Wug Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/87532?format=json", "institution": "Adobe Systems"}, {"id": 91712, "fullname": "Gabriel Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91712?format=json", "institution": "Adobe Research"}, {"id": 193177, "fullname": "Heeji Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/193177?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 153109, "fullname": "Seungryong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/153109?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 180187, "fullname": "Joon-Young Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/180187?format=json", "institution": "Adobe Research"}], "abstract": "Generalizing video matting models to real-world videos remains a significant challenge due to the scarcity of labeled data. To address this, we present Video Mask-to-Matte Model VideoMaMa that converts coarse segmentation masks into pixel accurate alpha mattes, by leveraging pretrained video diffusion models. VideoMaMa demonstrates strong zero-shot generalization to real-world footage, even though it is trained solely on synthetic data. Building on this capability, we develop a scalable pseudo-labeling pipeline for large-scale video matting and construct the Matting Anything in Video MA-V dataset, which offers high-quality matting annotations for more than 50K real-world videos spanning diverse scenes and motions. To validate the effectiveness of this dataset, we fine-tune the SAM2 model on MAV to obtain SAM2-Matte, which outperforms the same model trained on existing matting datasets in terms of robustness on in-the-wild videos.These findings emphasize the importance of large-scale pseudo-labeled video matting and showcase how generative priors and accessible segmentation cues can drive scalable progress in video matting research.All models and the MAV dataset will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39947", "url": null, "sourceid": 38932, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40250, "uid": "bd898b3aa551025f930aa4638ab497c8", "name": "EG-3DVG: Expression and Geometry Aware Grounding Decoder for 3D Visual Grounding", "authors": [{"id": 193878, "fullname": "GwangWook Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/193878?format=json", "institution": "Chungnam National University"}, {"id": 157437, "fullname": "Hyo-Jun Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/157437?format=json", "institution": "Kangwon National University"}, {"id": 193879, "fullname": "Jong-Hyeon Baek", "url": "http://cvpr.thecvf.com/api/miniconf/users/193879?format=json", "institution": "Chungnam National University"}, {"id": 85480, "fullname": "Hanul Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/85480?format=json", "institution": "Seoul National University of Science and Technology"}, {"id": 85472, "fullname": "Yeong Jun Koh", "url": "http://cvpr.thecvf.com/api/miniconf/users/85472?format=json", "institution": "Chungnam National University"}], "abstract": "Despite recent progress in 3D visual grounding, existing methods still struggle with three core challenges: 1) cross-modal misalignment that prevents textual cues from being reliably delivered to visual representations, 2) intra-class confusion arising from insufficient understanding of fine-grained expression cues, and 3) geometric reasoning errors caused by inaccurate aggregation of spatially relevant visual features. We propose EG-3DVG, a unified framework that addresses these issues through an expression and geometry aware grounding decoder. The decoder integrates two complementary attention modules\u2014position-guided expression cross-attention (PECA) for reliable text\u2013vision alignment and geometry-aware masked attention (GMA) for selective aggregation of geometry-consistent visual cues. To further distinguish semantically similar instances, we introduce expression-aware contrastive learning (ECL), which strengthens the alignment between the target object token and expression-relevant words. Extensive experiments on ScanRefer and SR3D/NR3D demonstrate that EG-3DVG achieves state-of-the-art performance in both 3D bounding box localization and mask prediction, validating the effectiveness of our geometry- and expression-aware design.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40250", "url": null, "sourceid": 34287, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40388, "uid": "5bf496834d830d71d0d517e552b8245f", "name": "OpenDance: Multimodal Controllable 3D Dance Generation with Large-scale Internet Data", "authors": [{"id": 129927, "fullname": "Jinlu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129927?format=json", "institution": "Peking University"}, {"id": 176658, "fullname": "Zixi Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176658?format=json", "institution": "Peking University"}, {"id": 161896, "fullname": "Libin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/161896?format=json", "institution": "Peking University"}, {"id": 88151, "fullname": "Jianlong Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88151?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 88171, "fullname": "Qi Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/88171?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 129700, "fullname": "Feng Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/129700?format=json", "institution": "Peking University"}, {"id": 74208, "fullname": "Yizhou Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/74208?format=json", "institution": "Peking University"}], "abstract": "Music-driven 3D dance generation offers significant creative potential, yet practical applications demand versatile and multimodal control. As the highly dynamic and complex human motion covering various styles and genres, dance generation requires satisfying diverse conditions beyond just music (e.g., spatial trajectories, keyframe gestures, or style descriptions). However, the absence of a large-scale and richly annotated dataset severely hinders progress. In this paper, we build OpenDanceSet, an extensive human dance dataset comprising over 100 hours across 14 genres and 147 subjects. Each sample has rich annotations to facilitate robust cross-modal learning: 3D motion, paired music, 2D keypoints, trajectories, and expert-annotated text descriptions. Furthermore, we propose OpenDanceNet, a unified masked modeling framework for controllable dance generation, including a disentangled auto-encoder and a multimodal joint-prediction Transformer. OpenDanceNet supports generation conditioned on music and arbitrary combinations of text, keypoints, or trajectories. Comprehensive experiments demonstrate that our work achieves high-fidelity synthesis with strong diversity and realistic physical contacts, while also offering flexible control over spatial and stylistic conditions.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40388", "url": null, "sourceid": -30766, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39948?format=json"], "related_events_ids": [39948]}, {"id": 39951, "uid": "4ecca9950b8e48cca47014655c2c4789", "name": "FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching", "authors": [{"id": 181169, "fullname": "Andranik Sargsyan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181169?format=json", "institution": "Picsart"}, {"id": 90051, "fullname": "Shant Navasardyan", "url": "http://cvpr.thecvf.com/api/miniconf/users/90051?format=json", "institution": "Picsart AI Research"}], "abstract": "Accurate image segmentation is essential for modern computer vision applications such as image editing, autonomous driving, and medical analysis. In recent years, Dichotomous Image Segmentation (DIS)  has become the standard task for training and evaluating highly accurate segmentation models. Existing DIS approaches often fail to preserve fine-grained details or fully capture the semantic structure of the foreground.To address these challenges, we present $\\textbf{FlowDIS}$, a novel dichotomous image segmentation method built upon the flow matching framework, which learns a time-dependent vector field to transport the image distribution into the corresponding mask distribution under optional textual guidance.Moreover, with our $\\textbf{Position-Aware Instance Pairing (PAIP)}$ training strategy, FlowDIS offers strong controllability through textual prompts, enabling precise, pixel-level object segmentation.Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches both with and without language guidance. Compared to the second best DIS method, FlowDIS achieves $\\textbf{5.5}$% $\\textbf{higher $F_\\beta^\\omega$}$ measure and $\\textbf{43}$% $\\textbf{better MAE}$ ($\\mathcal{M})$ on DIS-TE test set.The code will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39951", "url": null, "sourceid": 31343, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39955, "uid": "4a11791c655ba29d3978d655a68516e5", "name": "Remedying Target-Domain Astigmatism for Cross-Domain Few-Shot Object Detection", "authors": [{"id": 181242, "fullname": "Yongwei Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181242?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 131641, "fullname": "Yixiong Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/131641?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 131642, "fullname": "Yuhua Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/131642?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 86547, "fullname": "Ruixuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86547?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Cross-domain few-shot object detection (CD-FSOD) aims to adapt pretrained detectors from a source domain to target domains with limited annotations, suffering from severe domain shifts and data scarcity problems. In this work, we find a previously overlooked phenomenon: models exhibit dispersed and unfocused attention in target domains, leading to imprecise localization and redundant predictions, just like a human cannot focus on visual objects. Therefore, we call it the target-domain Astigmatism problem. Analysis on attention distances across transformer layers reveals that regular fine-tuning inherently shows a trend to remedy this problem, but results are still far from satisfactory, which we aim to enhance in this paper. Biologically inspired by the human fovea-style visual system, we enhance the fine-tuning's inherent trend through a center-periphery attention refinement framework, which contains (1) a Positive Pattern Refinement module to reshape attention toward semantic objects using class-specific prototypes, simulating the visual center region;(2) a Negative Context Modulation module to enhance boundary discrimination by modeling background context, simulating the visual periphery region; and (3) a Textual Semantic Alignment module to strengthen center-periphery distinction through cross-modal cues. Our bio-inspired approach transforms astigmatic attention into focused patterns, substantially improving adaptation to target domains. Experiments on six challenging CD-FSOD benchmarks consistently demonstrate improved detection accuracy and establish new state-of-the-art results.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39955", "url": null, "sourceid": 46580, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39959, "uid": "c68d176afebc29d2b60db6dcfb35a5b4", "name": "TEXTRIX: Latent Attribute Grid for Native Texture Generation and Beyond", "authors": [{"id": 175944, "fullname": "Yifei Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/175944?format=json", "institution": "Nanjing University"}, {"id": 176513, "fullname": "Bao Yajie", "url": "http://cvpr.thecvf.com/api/miniconf/users/176513?format=json", "institution": "Department of Intelligent Science and Technology, nanjing university"}, {"id": 193195, "fullname": "Jiachen Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/193195?format=json", "institution": "University of Hong Kong; DreamTech"}, {"id": 193196, "fullname": "Shuang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193196?format=json", "institution": "Nanjing University"}, {"id": 98719, "fullname": "Youtian Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/98719?format=json", "institution": "Nanjing university"}, {"id": 153839, "fullname": "Hao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153839?format=json", "institution": "Nanjing University"}, {"id": 193197, "fullname": "Buyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193197?format=json", "institution": "OriginArk"}, {"id": 193198, "fullname": "Feihu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193198?format=json", "institution": "DreamTech"}, {"id": 85035, "fullname": "Xun Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/85035?format=json", "institution": "Nanjing University"}, {"id": 128100, "fullname": "Yao Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/128100?format=json", "institution": "Nanjing University"}], "abstract": "Prevailing 3D texture generation methods, which often rely on multi-view fusion, are frequently hindered by inter-view inconsistencies and incomplete coverage of complex surfaces, limiting the fidelity and completeness of the generated content. To overcome these challenges, we introduce TEXTRIX, a native 3D attribute generation framework for high-fidelity texture synthesis and downstream applications such as precise 3D part segmentation. Our approach constructs a latent 3D attribute grid and leverages a Diffusion Transformer equipped with sparse attention, enabling direct coloring of 3D models in volumetric space and fundamentally avoiding the limitations of multi-view fusion. Built upon this native representation, the framework naturally extends to high-precision 3D segmentation by training the same architecture to predict semantic attributes on the grid. Extensive experiments demonstrate state-of-the-art performance on both tasks, producing seamless, high-fidelity textures and accurate 3D part segmentation with precise boundaries.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39959", "url": null, "sourceid": 38098, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39963, "uid": "28dee3e52394cbf8e862643faac6a735", "name": "OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis", "authors": [{"id": 171903, "fullname": "Junuk Cha", "url": "http://cvpr.thecvf.com/api/miniconf/users/171903?format=json", "institution": "KAIST"}, {"id": 193206, "fullname": "Jihyeon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/193206?format=json", "institution": "Korea Telecom Research"}, {"id": 134344, "fullname": "Han-Mu Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/134344?format=json", "institution": "Korea Electronics Technology Institute (KETI)"}], "abstract": "Fingerspelling is a component of sign languages in which words are spelled out letter by letter using specific hand poses. Automatic fingerspelling recognition plays a crucial role in bridging the communication gap between Deaf and hearing communities, yet it remains challenging due to the signing-hand ambiguity issue, the lack of appropriate training losses, and the out-of-vocabulary (OOV) problem. Prior fingerspelling recognition methods rely on explicit signing-hand detection, which often leads to recognition failures, and on a connectionist temporal classification (CTC) loss, which exhibits the peaky behavior problem. To address these issues, we develop OpenFS, an open-source approach for fingerspelling recognition and synthesis. We propose a multi-hand-capable fingerspelling recognizer that supports both single- and multi-hand inputs and performs implicit signing-hand detection by incorporating a dual-level positional encoding and a signing-hand focus (SF) loss. The SF loss encourages cross-attention to focus on the signing hand, enabling implicit signing-hand detection during recognition. Furthermore, without relying on the CTC loss, we introduce a monotonic alignment (MA) loss that enforces the output letter sequence to follow the temporal order of the input pose sequence through cross-attention regularization. In addition, we propose a frame-wise letter-conditioned generator that synthesizes realistic fingerspelling pose sequences for OOV words. This generator enables the construction of a new benchmark, called FSNeo. Through comprehensive experiments, we demonstrate that our approach achieves state-of-the-art performance in recognition and validate the effectiveness of the proposed recognizer and generator. We will release the code and processed data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39963", "url": null, "sourceid": 41483, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39969, "uid": "fc9b0afb85bfffc53285f0478d3e45a8", "name": "AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models", "authors": [{"id": 182017, "fullname": "Yubo Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/182017?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 193216, "fullname": "Xianchao Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193216?format=json", "institution": "Harbin Institute of Technology"}, {"id": 193217, "fullname": "Zijun Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/193217?format=json", "institution": "Harbin Institute of Technology"}, {"id": 128732, "fullname": "Zheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128732?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Pre-trained vision-language models (VLMs) have exhibited exceptional generalization capabilities in zero-shot tasks, yet remain vulnerable to adversarial examples. Conventional classification-guided adversarial fine-tuning often compromises the pre-trained cross-modal alignment, undermining the intricate visual-textual correspondence essential for zero-shot performance. To mitigate this, we introduce Alignment-Guided Fine-Tuning (AGFT), a novel framework that preserves semantic integrity while enhancing robustness. AGFT leverages the output distribution of pre-trained VLMs as the fine-tuning objective, thereby maintaining cross-modal semantic correspondence. Recognizing the divergence in feature alignment objectives between pre-trained and robust models, we further calibrate the output distribution by attenuating cross-modal feature similarity of robust models, all while safeguarding correspondence across images and diverse textual descriptions. This calibration ensures compatibility with robust feature representation without sacrificing generalization. Comprehensive experiments across diverse zero-shot datasets and settings demonstrate that AGFT achieves state-of-the-art performance, significantly improving the zero-shot adversarial robustness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39969", "url": null, "sourceid": 39782, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39964, "uid": "ac246594da24758e8d58c4c36b913616", "name": "Beyond Soft Label: Dataset Distillation via Orthogonal Gradient Matching", "authors": [{"id": 145858, "fullname": "Deyu Bo", "url": "http://cvpr.thecvf.com/api/miniconf/users/145858?format=json", "institution": "National University of Singapore"}, {"id": 87323, "fullname": "Xinchao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87323?format=json", "institution": "National University of Singapore"}], "abstract": "Condensing the large-scale, high-resolution ImageNet-1K dataset remains a challenge for dataset distillation (DD). Existing methods typically match batch normalization (BN) statistics, \\ie, mean and variance, between real and synthetic datasets. Although effective with soft labels, their performance degrades substantially under hard labels. In this paper, we theoretically identify that BN matching mainly aligns the scales of real and synthetic gradients but overlooks their directions. However, experimental evidence demonstrates that gradient direction, rather than scale, is pivotal to model training, clarifying the limitations of prior methods. Building on this insight, we introduce \\textbf{O}rthogonal \\textbf{G}radient \\textbf{M}atching (OGM), which explicitly aligns the intrinsic direction of gradients, \\ie, singular vectors. Specifically, OGM first orthogonalizes real and synthetic gradients by setting all singular values to one, eliminating their scales, and then minimizes the distance between these orthogonal gradients so that their singular vectors coincide. To further reduce computation, OGM employs a least-squares loss whose gradients can be obtained in the forward pass, avoiding back-propagation. Extensive experiments on ImageNet-1K validate the effectiveness of OGM. With only ten images per class (IPC = 10), OGM achieves 47.0\\% accuracy with soft labels and 16.7\\% with hard labels, outperforming training-based DD methods and RDED.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39964", "url": null, "sourceid": 32629, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39968, "uid": "d0b5eca03e134d06550912d2a5c1f314", "name": "OMGTex: One-stage Multi-style Facial Texture Reconstruction without Geometry Guidance", "authors": [{"id": 193214, "fullname": "Xiao Zitong", "url": "http://cvpr.thecvf.com/api/miniconf/users/193214?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 193215, "fullname": "Yuda Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193215?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 163578, "fullname": "Zisheng Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/163578?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 88683, "fullname": "Xiaoguang Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/88683?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}], "abstract": "We propose OMGTex, an end-to-end diffusion-based framework for reconstructing high-quality and editable facial UV textures from multi-style facial images. Existing texture reconstruction methods face two major limitations: (1) Fragility due to reliance on 3D geometry priors, which are difficult to estimate accurately, especially under facial occlusions or in stylized domains; and (2) A lack of semantic disentanglement, inhibiting region-specific texture editing and style transfer. Our work addresses both challenges simultaneously.Our core innovation is a geometry-free pipeline that directly maps a 2D face image to its corresponding editable UV texture. We introduce two key techniques: First, to address the challenge of UV misalignment common in diffusion generation, we introduce a gradient-guided refinement strategy at inference time, which explicitly corrects structural consistency. Second, we leverage the inherent semantic distribution capability of diffusion models and design a novel training paradigm to enhance this tendency, enabling semantic-aware editing of facial texture. Furthermore, to address the data scarcity in multi-style texture reconstruction, we construct CANVAS, the first comprehensive paired texture reconstruction dataset covering realistic and diverse stylized domains.To the best of our knowledge, OMGTex is the first geometry-free inference framework that achieves robust, style-consistent, and editable facial texture reconstruction across diverse domains. Our method achieves state-of-the-art performance on facial texture benchmarks. Both the dataset and the pretrained model weights will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39968", "url": null, "sourceid": 32711, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39972, "uid": "6798acf8da0458e3299ad2a1f4e7c2bf", "name": "FireScope: Wildfire Risk Raster Prediction With a Chain-of-Thought Oracle", "authors": [{"id": 180877, "fullname": "Mario Markov", "url": "http://cvpr.thecvf.com/api/miniconf/users/180877?format=json", "institution": "INSAIT, Sofia University"}, {"id": 180878, "fullname": "Stefan Ailuro", "url": "http://cvpr.thecvf.com/api/miniconf/users/180878?format=json", "institution": "INSAIT, Sofia University"}, {"id": 75489, "fullname": "Luc Van Gool", "url": "http://cvpr.thecvf.com/api/miniconf/users/75489?format=json", "institution": "INSAIT, Sofia Un. St. Kliment Ohridski"}, {"id": 86863, "fullname": "Konrad Schindler", "url": "http://cvpr.thecvf.com/api/miniconf/users/86863?format=json", "institution": "ETH Zurich"}, {"id": 156198, "fullname": "Danda Paudel", "url": "http://cvpr.thecvf.com/api/miniconf/users/156198?format=json", "institution": "INSAIT, Sofia University"}], "abstract": "Predicting wildfire risk is a reasoning-intensive spatial problem that requires the integration of visual, climatic, and geographic factors to infer continuous risk maps. Existing methods lack the causal reasoning and multimodal understanding required for reliable generalization. We introduce $\\textbf{FireScope-Bench}$, a large-scale dataset and benchmark that couples Sentinel-2 imagery and climate data with expert-defined risk rasters across the USA, and real wildfire events in Europe for cross-continental evaluation. Building on this dataset, we propose $\\textbf{FireScope}$, a VLM-based reasoning-to-generation framework that learns from  both reinforcement learning and visual supervision to predict risk rasters with complementary reasoning traces. When trained in the USA and tested in Europe, $\\textbf{FireScope}$ achieves substantial performance gains, while expert feedback and automated analysis confirm that its reasoning traces are faithful and semantically meaningful. Our findings demonstrate that reasoning can ground raster prediction models, improving both generalization and interpretability. To our knowledge, this is the first framework to (1) demonstrate that language-based reasoning can improve generalization in visual generation, (2) propose a high-resolution wildfire risk model that can be applied across continents, and (3) enable systematic studies of robust cross-continental generalization for multimodal fire risk models. We believe that $\\textbf{FireScope-Bench}$ has the potential to serve as a foundation for advancing reasoning-driven, interpretable and generalizable spatial modeling. Data and source code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39972", "url": null, "sourceid": 41564, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39974, "uid": "50a039e881bb157121e9ea9afea996c5", "name": "Attention, May I Have Your Decision? Localizing Generative Choices in Diffusion Models", "authors": [{"id": 182543, "fullname": "Katarzyna Zaleska", "url": "http://cvpr.thecvf.com/api/miniconf/users/182543?format=json", "institution": "Warsaw University of Technology"}, {"id": 193222, "fullname": "\u0141ukasz Popek", "url": "http://cvpr.thecvf.com/api/miniconf/users/193222?format=json", "institution": "Warsaw University of Technology"}, {"id": 193223, "fullname": "Monika Wysocza\u0144ska", "url": "http://cvpr.thecvf.com/api/miniconf/users/193223?format=json", "institution": "Valeo"}, {"id": 193224, "fullname": "Kamil Deja", "url": "http://cvpr.thecvf.com/api/miniconf/users/193224?format=json", "institution": "IDEAS NCBR; Warsaw University of Technology"}], "abstract": "Text-to-image diffusion models exhibit remarkable generative capabilities, yet their internal operations remain opaque, particularly when handling prompts that are not fully descriptive. In such scenarios, models must make implicit decisions to generate details not explicitly specified in the text. This work investigates the hypothesis that this decision-making process is not diffuse but is computationally localized within the model's architecture. While existing localization techniques focused on prompt-related interventions, we notice that such explicit conditioning might differ from the implicit decisions. Therefore, we introduce a probing-based localization technique to identify the layers causally responsible for making the decisions. Our findings indicate that the resolution of ambiguous concepts is governed principally by self-attention layers, identifying them as the most effective point for intervention. Based on this discovery, we propose ICM - a precise steering method that applies targeted interventions to a small subset of layers. Extensive experiments confirm that intervening on these specific self-attention layers yields superior debiasing performance compared to existing state-of-the-art methods, minimizing artifacts common to less precise approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39974", "url": null, "sourceid": 40849, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39975, "uid": "0b845236da4ce3d6e20524161c1966c0", "name": "Learning Explicit Continuous Motion Representation for Dynamic Gaussian Splatting from Monocular Videos", "authors": [{"id": 170053, "fullname": "Xuankai Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/170053?format=json", "institution": "Sun Yat-sen University"}, {"id": 102024, "fullname": "Junjin Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/102024?format=json", "institution": "School of Computer Science and Engineering, Sun Yat-sen University"}, {"id": 193225, "fullname": "Shangwei Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193225?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 86314, "fullname": "Wei-Shi Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86314?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 87176, "fullname": "Qing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87176?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "We present an approach for high-quality dynamic Gaussian Splatting from monocular videos. To this end, we in this work go one step further beyond previous methods to explicitly model continuous position and orientation deformation of dynamic Gaussians, using an SE(3) B-spline motion bases with a compact set of control points. To improve computational efficiency while enhancing the ability to model complex motions, an adaptive control mechanism is devised to dynamically adjust the number of motion bases and control points. Besides, we develop a soft segment reconstruction strategy to mitigate long-interval motion interference, and employ a multi-view diffusion model to provide multi-view cues for avoiding overfitting to training views. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in novel view synthesis. Our code and trained model will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39975", "url": null, "sourceid": 44005, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39977, "uid": "911d28522117a1c1f47a42df7743b846", "name": "RAYNOVA: Geometry-Free Auto-Regressive 4D World Modeling with Unified Spatio-Temporal Representation", "authors": [{"id": 88188, "fullname": "Yichen Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/88188?format=json", "institution": "University of California, Berkeley"}, {"id": 153960, "fullname": "Chensheng Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/153960?format=json", "institution": "University of California, Berkeley"}, {"id": 193228, "fullname": "Mazen Abdelfattah", "url": "http://cvpr.thecvf.com/api/miniconf/users/193228?format=json", "institution": null}, {"id": 96453, "fullname": "Yihan Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/96453?format=json", "institution": "Applied Intuition"}, {"id": 191294, "fullname": "Jiezhi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191294?format=json", "institution": "Applied Intuition, Inc."}, {"id": 193229, "fullname": "Eric Higgins", "url": "http://cvpr.thecvf.com/api/miniconf/users/193229?format=json", "institution": "Applied Intuition"}, {"id": 193230, "fullname": "Ryan Brigden", "url": "http://cvpr.thecvf.com/api/miniconf/users/193230?format=json", "institution": "Applied Intuition"}, {"id": 86415, "fullname": "Masayoshi Tomizuka", "url": "http://cvpr.thecvf.com/api/miniconf/users/86415?format=json", "institution": "University of California, Berkeley"}, {"id": 88181, "fullname": "Wei Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88181?format=json", "institution": "University of California Berkeley"}], "abstract": "World foundation models aim to simulate the evolution of the real world with physically plausible behavior. Unlike prior methods that handle spatial and temporal correlations separately, we propose RayNova, a geometry-free world model that employs a dual-causal autoregressive framework. It follows both scale-wise and temporal topological orders in the autoregressive process, and leverages global attention for unified 4D spatio-temporal reasoning. Different from existing works that impose strong 3D geometric priors, RayNova constructs an isotropic spatio-temporal representation across views, frames, and scales based on relative Pl\u00fccker-ray positional encoding, enabling robust generalization to diverse camera setups and ego motions. We further introduce a recurrent training paradigm to alleviate distribution drift in long-horizon video generation. RayNova achieves state-of-the-art multi-view video generation results on nuScenes, while offering higher throughput and strong controllability under diverse input conditions, generalizing to novel views and camera configurations without explicit 3D scene representation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39977", "url": null, "sourceid": 45153, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39978, "uid": "e78c666d9e6c8aaf2cb2044b8960c4d2", "name": "PhysGaia: A Physics-aware Benchmark with Multi-Body Interactions for Dynamic Novel View Synthesis", "authors": [{"id": 182751, "fullname": "Mijeong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/182751?format=json", "institution": "Huawei Technologies Ltd.; Seoul National University"}, {"id": 180347, "fullname": "Gunhee Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/180347?format=json", "institution": "Seoul National University"}, {"id": 135964, "fullname": "Jungyoon Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/135964?format=json", "institution": "Seoul National University"}, {"id": 132965, "fullname": "WonJae Roh", "url": "http://cvpr.thecvf.com/api/miniconf/users/132965?format=json", "institution": "Seoul national university"}, {"id": 75881, "fullname": "Bohyung Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/75881?format=json", "institution": "Seoul National University"}], "abstract": "We introduce PhysGaia, a novel physics-aware dataset specifically designed for Dynamic Novel View Synthesis (DyNVS),  encompassing both structured objects and unstructured physical phenomena.Unlike existing datasets that primarily focus on photorealistic reconstruction, PhysGaia is created to actively support physics-aware dynamic scene modeling.Our dataset provides complex dynamic scenarios with rich interactions among multiple objects, where they realistically collide with each other and exchange forces.Furthermore, it contains a diverse range of physical materials, such as liquid, gas, textile, and rheological substances, which moves beyond the rigid bodies prevalent in existing datasets.All scenes in PhysGaia are faithfully generated to strictly adhere to physical laws, leveraging carefully selected material-specific physics solvers. To enable quantitative evaluation of physical modeling, our dataset provides essential ground-truth information, including 3D particle trajectories and physics parameters, e.g., viscosity.To facilitate research adoption, we also provide essential integration pipelines for using recent 4D Gaussian Splatting models with our dataset and report their results.By addressing the critical lack of datasets for physics-aware modeling, PhysGaia will significantly advance research in dynamic view synthesis, physics-based scene understanding, and deep learning models integrated with physical simulation-ultimately enabling more faithful reconstruction and interpretation of complex dynamic scenes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39978", "url": null, "sourceid": 32976, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39979, "uid": "fe1d29cf02c128bc7f714c4fab1fe7dd", "name": "Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation", "authors": [{"id": 153758, "fullname": "Aviral Chharia", "url": "http://cvpr.thecvf.com/api/miniconf/users/153758?format=json", "institution": "Carnegie Mellon University"}, {"id": 75815, "fullname": "Fernando De la Torre", "url": "http://cvpr.thecvf.com/api/miniconf/users/75815?format=json", "institution": "Carnegie Mellon"}], "abstract": "Generating large-scale 3D head avatars of non-existent identities with high-fidelity and strong multi-view consistency (MVC) is essential for applications such as synthetic crowds, digital twins, and large asset libraries. For high scalability, avatars must be generated from minimal resources, without costly MV studio captures or any 3D data. In this work, we target this challenging minimal-resource setting for 3D head generation. Second, we argue that the common strategy of enforcing MVC via intermediate MV image generation is both expensive and fundamentally fragile. Instead, we analyze how MVC can be induced by design, showing that intermediate view synthesis is unnecessary. To this end, we introduce MVCHead \u2014 a fast, single-shot state space model that directly predicts Gaussians, without intermediate generation. At its core, we propose a Hierarchical State Space (HiSS) block that enforces grid-aligned coherence while capturing long-range dependencies. We further modify Mamba's standard unidirectional scanning into a Hierarchical Bi-directional State Scan (HiBiSS), scanning the render grid to better propagate geometric and appearance cues. Finally, we design an SE(3) MV Critic that judges whether a set of self-renders arise from a single underlying 3D configuration, rewarding cross-view pixel alignment without real MV data. In this setting, MVCHead surpasses SOTA in perceptual quality and on all three MVC axes\u2014shape, texture, and geometry. The code has been submitted and will be open-sourced with model weights upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39979", "url": null, "sourceid": 40531, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39980, "uid": "1ba11d583c8cec68cddb9d103387ffbe", "name": "Asynchronous Temporal Modeling with Two-Agent Framework for Streaming Dense Video Captioning", "authors": [{"id": 152930, "fullname": "Yunlong Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152930?format=json", "institution": "University of Rochester"}, {"id": 90333, "fullname": "Chao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90333?format=json", "institution": "Department of Computer Science, University of Rochester"}, {"id": 127813, "fullname": "Susan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127813?format=json", "institution": "University of Rochester"}, {"id": 149423, "fullname": "Jing Bi", "url": "http://cvpr.thecvf.com/api/miniconf/users/149423?format=json", "institution": "University of Rochester"}, {"id": 193231, "fullname": "Yicheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193231?format=json", "institution": "University of Rochester"}, {"id": 91797, "fullname": "Daiki Shimada", "url": "http://cvpr.thecvf.com/api/miniconf/users/91797?format=json", "institution": "Sony Group Corporation"}, {"id": 90334, "fullname": "Chenliang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90334?format=json", "institution": "University of Rochester"}], "abstract": "Streaming dense video captioning requires real-time processing of continuous visual input while determining precisely when and what to caption. Current approaches primarily focus on designing complex external memory mechanisms, failing to leverage Large Multimodal Models' (LMMs) inherent long-context capabilities. Moreover, existing methods employing threshold-based caption triggering face a severe Threshold-Gated Discrepancy (TGD) problem, a training-inference mismatch arising from data imbalance, where models predominantly predict silence tokens, requiring thresholds that vary drastically across videos with extremely narrow effective ranges. We introduce Takusen, an asynchronous temporal modeling two-agent framework comprising a Small Multimodal Model (SMM) as an Oracle agent and an LMM as a Listener agent. The Oracle agent processes sparse video inputs at an accelerated rate to detect event boundaries, while the Listener agent processes dense inputs to generate accurate captions when prompted by the Oracle's signals. This architecture eliminates threshold dependencies by fundamentally changing how silence/generation decisions are made, resolving the TGD problem. To enhance robustness against boundary prediction instabilities, we integrate uniformly distributed fixed decoding points with Oracle-predicted boundaries. Experiments on ActivityNet Captions and YouCook2 datasets demonstrate that Takusen achieves state-of-the-art performance with a simpler and more efficient design that balances temporal sensitivity with descriptive accuracy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39980", "url": null, "sourceid": 30875, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39985, "uid": "ba08e773374e74b0609b9668df3cd163", "name": "Matching Every Pair to Track Every Point: PairFormer for All-Pairs Tracking and Video Trajectory Fields", "authors": [{"id": 127765, "fullname": "Guangyang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127765?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 193239, "fullname": "Youran Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/193239?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 193240, "fullname": "Xinyu Che", "url": "http://cvpr.thecvf.com/api/miniconf/users/193240?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 77401, "fullname": "BENYUAN SUN", "url": "http://cvpr.thecvf.com/api/miniconf/users/77401?format=json", "institution": "Central Media Technology Institute, Huawei"}, {"id": 185397, "fullname": "Yi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185397?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 71043, "fullname": "Xiaohong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/71043?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Tracking-any-point (TAP) answers query-conditioned correspondence but leaves the dense, all-pairs structure of a video implicit. We formulate All-Pairs Tracking (APT): given a video, predict dense displacement and visibility for every source-target frame pair, from which per-pixel trajectories can be read out.To this end, we propose PairFormer, a feed-forward transformer that addresses APT in a single pass. A spatio-temporal patch encoder computes temporally conditioned features for all frames. CorrBank constructs a learnable correlation memory for each frame pair to obtain pairwise motion tokens. A broadcast motion mixer aggregates information across space and time and refines these tokens with global context. A trajectory head then predicts full-resolution displacement for each pair, yielding a coherent all-pairs trajectory field.To support APT at scale, we develop PAIRender, a data platform that synthesizes photo-realistic dynamic scenes with dense annotations. From PAIRender we derive a training set ($\\pi$-R10K) and a benchmark (APT-Bench) with an all-to-all evaluation protocol. Experiments show that PairFormer achieves strong performance on APT-Bench and competitive results on standard TAP benchmarks. Code and dataset will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39985", "url": null, "sourceid": 40408, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39986, "uid": "d04853731ce0790af8fbd9de5c37c8c1", "name": "TM-BSN: Triangular-Masked Blind-Spot Network for Real-World Self-Supervised Image Denoising", "authors": [{"id": 142855, "fullname": "Junyoung Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/142855?format=json", "institution": "Seoul National University"}, {"id": 177631, "fullname": "Youngjin Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/177631?format=json", "institution": "Seoul National University"}, {"id": 77105, "fullname": "Nam Ik Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/77105?format=json", "institution": "Seoul National University"}], "abstract": "Blind-spot networks (BSNs) enable self-supervised image denoising by preventing access to the target pixel during training, allowing the network to estimate clean signals without ground-truth supervision.However, this approach assumes pixel-wise noise independence, which is violated in real-world sRGB images due to spatially correlated noise introduced by the camera's image signal processing (ISP) pipeline.While several methods employ downsampling strategies to decorrelate noise, these approaches alter noise statistics and limit the network's ability to utilize full contextual information.In this paper, we propose the Triangular-Masked Blind-Spot Network (TM-BSN), a novel blind-spot architecture that accurately models the spatial correlation of real sRGB noise.This correlation originates from the demosaicing process, where each pixel is reconstructed from neighboring samples with weights that decay spatially, resulting in a characteristic diamond-shaped pattern.To align the receptive field with this geometry, we introduce a triangular-masked convolution that restricts the kernel to its upper-triangular region, creating a diamond-shaped blind spot at the original image resolution.This design effectively excludes correlated pixels while fully leveraging uncorrelated contextual information, eliminating the need for downsampling or post-processing.Furthermore, we use knowledge distillation to transfer complementary knowledge from multiple blind-spot predictions into a lightweight U-Net, improving both accuracy and efficiency.Extensive experiments on real-world denoising benchmarks demonstrate that our method achieves state-of-the-art performance, significantly outperforming existing self-supervised approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39986", "url": null, "sourceid": 31483, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39988, "uid": "5f3df1f487d87e9623e3da17e9136918", "name": "Test-Time Instance-Specific Parameter Composition: A New Paradigm for Adaptive Generative Modeling", "authors": [{"id": 128665, "fullname": "Minh-Tuan Tran", "url": "http://cvpr.thecvf.com/api/miniconf/users/128665?format=json", "institution": "Monash University"}, {"id": 128678, "fullname": "Xuan-May Le", "url": "http://cvpr.thecvf.com/api/miniconf/users/128678?format=json", "institution": "University of Melbourne"}, {"id": 193244, "fullname": "Quan Hung Tran", "url": "http://cvpr.thecvf.com/api/miniconf/users/193244?format=json", "institution": "Facebook"}, {"id": 87144, "fullname": "Mehrtash Harandi", "url": "http://cvpr.thecvf.com/api/miniconf/users/87144?format=json", "institution": "Monash University"}, {"id": 128648, "fullname": "Dinh Phung", "url": "http://cvpr.thecvf.com/api/miniconf/users/128648?format=json", "institution": "Monash University"}, {"id": 128675, "fullname": "Trung Le", "url": "http://cvpr.thecvf.com/api/miniconf/users/128675?format=json", "institution": "Monash University"}], "abstract": "Existing generative models, such as diffusion and auto-regressive networks, are inherently static, relying on a fixed set of pretrained parameters to handle all inputs. In contrast, humans flexibly adapt their internal generative representations to each perceptual or imaginative context. Inspired by this capability, we introduce Composer, a new paradigm for adaptive generative modeling based on test-time instance-specific parameter composition. Composer generates input-conditioned parameter adaptations at inference time, which are injected into the pretrained model\u2019s weights, enabling per-input specialization without fine-tuning or retraining. Adaptation occurs once prior to multi-step generation, yielding higher-quality, context-aware outputs with minimal computational and memory overhead. Experiments show that Composer substantially improves performance across diverse generative models and use cases, including lightweight/quantized models and test-time scaling. By leveraging input-aware parameter composition, Composer establishes a new paradigm for designing generative models that dynamically adapt to each input, moving beyond static parameterization. The code will be available at \\url{https://anonymous.4open.science/r/Composer-IPC}.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39988", "url": null, "sourceid": 35812, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39990, "uid": "4fc6dab4a7563bdae6315d673ff09f3b", "name": "Self-Critical Distillation Network for Video-based Commonsense Captioning", "authors": [{"id": 193254, "fullname": "Mengqi Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193254?format=json", "institution": null}, {"id": 152753, "fullname": "Gengyun Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/152753?format=json", "institution": "Nanjing University of Posts and Telecommunications"}, {"id": 76806, "fullname": "Bing-Kun Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/76806?format=json", "institution": "Hefei University of Technology"}], "abstract": "Video-based commonsense captioning aims to generate captions for the video content while providing multiple commonsense about the underlying events. Existing approaches rely on constructing a \"video $\\rightarrow$ content caption $\\rightarrow$ commonsense\" reasoning chain, which generates visually ungrounded commonsense and neglects inter-category commonsense correlations. Firstly, the existing reasoning chain induces the model's excessive reliance on content caption when generating commonsense, resulting in generic outputs with limited visual relevance. Secondly, the reasoning chain adopts multiple isolated decoders for commonsense generation, which fails to leverage the correlations between different categories of commonsense. To address these limitations, we introduce a novel self-critical distillation network (SCD-Net), which optimizes the reasoning chain by enhancing visual reasoning and establishing inter-category commonsense correlations. Specifically, on the one hand, we introduce self-critical learning and design a reward function to allow the model to refine its output. This mechanism incentivizes the model to maximize the utilization of visual information, thus improving the model's capacity for visual comprehension. On the other hand, we propose a joint reasoning distillation framework that fosters mutual inference among diverse commonsense categories. In this framework, we incorporate the cascaded decoder and knowledge distillation strategy to facilitate inter-category commonsense knowledge transfer while maintaining the fairness of the testing. Our experiments on the large-scale Video-to-Commonsense dataset demonstrate that our approach performs favorably against state-of-the-art methods. The code will be released soon.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39990", "url": null, "sourceid": 46302, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39992, "uid": "94f0a35a8ae73c670fec214e5c595227", "name": "3D Instance Models are Implicit Generalizable Spatial Learners", "authors": [{"id": 126312, "fullname": "Lu Ling", "url": "http://cvpr.thecvf.com/api/miniconf/users/126312?format=json", "institution": "Purdue University"}, {"id": 132877, "fullname": "Yunhao Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/132877?format=json", "institution": "NVIDIA"}, {"id": 88915, "fullname": "Yichen Sheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/88915?format=json", "institution": "NVIDIA"}, {"id": 128106, "fullname": "Aniket Bera", "url": "http://cvpr.thecvf.com/api/miniconf/users/128106?format=json", "institution": "Purdue University"}], "abstract": "Generalization remains the central challenge for interactive 3D scene generation. Existing learning\u2011based approaches ground spatial understanding in limited scene dataset, restricting generalization to new layouts.We instead reprogram a pre\u2011trained 3D instance generator to act as a scene\u2011level learner via, replacing dataset-bounded supervision with model-centric spatial supervision.This reprogramming unlocks the generator's transferable spatial knowledge, enabling generalization to unseen layouts and novel object compositions.Remarkably, spatial reasoning still emerges even when the training scenes are randomly composed objects. This demonstrates that the generator\u2019s transferable scene prior provides a rich learning signal for inferring proximity, support, and symmetry from purely geometric cues.Replacing widely used canonical space, we instantiate this insight with a view\u2011centric formulation of the scene space, yielding a fully feed\u2011forward, generalizable scene generator that learns spatial relations directly from the instance model.Quantitative and qualitative results show that a 3D instance generator is an implicit spatial learner and reasoner, pointing toward foundation models for interactive 3D scene understanding and generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39992", "url": null, "sourceid": 34846, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39993, "uid": "eaac44eefff218557692720f8a56af4e", "name": "Aligning Multi-Character Narrative Image Generation with Multi-Aspect Human Preferences", "authors": [{"id": 193256, "fullname": "Ziyi Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193256?format=json", "institution": "Fudan University"}, {"id": 77399, "fullname": "Zhipeng Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/77399?format=json", "institution": "Fudan University"}, {"id": 89124, "fullname": "Jingjing Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89124?format=json", "institution": "Fudan University"}, {"id": 127467, "fullname": "Stewart Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/127467?format=json", "institution": "Alibaba DAMO Academy"}, {"id": 99574, "fullname": "Hao li", "url": "http://cvpr.thecvf.com/api/miniconf/users/99574?format=json", "institution": "Fudan University"}, {"id": 155106, "fullname": "Yi-Ping Phoebe Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/155106?format=json", "institution": "La Trobe University"}], "abstract": "Narrative image generation aims to create images featuring multiple distinct characters while capturing their interrelationships, posing significant challenges for current text-to-image diffusion models. As a result, general personalized methods often suffer from poor semantic alignment, identity blending, and aesthetic implausibility.These issues are inadequately captured by existing evaluation metrics such as CLIP, ArcFace, and conventional reward models, which fundamentally fail to align with human perceptual preferences. To align with human preferences, we first construct a fine-grained human preference dataset, NI-RLHF, by collecting both detailed human critiques and preference judgments across three core dimensions: prompt following, identity consistency, and visual quality.This comprehensive dataset facilitates the training of NIReward, a critique-based reward model capable of generating interpretable image evaluations.Building upon the interpretable reward signal from NIReward, we propose Adaptive Dominance-based Preference Optimization (ADPO) to balance learning across diverse preference dimensions while dynamically adapting to reward margins.Experimental results indicate that NIReward significantly outperforms existing evaluation models and reward models, and ADPO yields a significant improvement across the three key preference dimensions. By introducing NIReward and ADPO, our work paves the way for generating narrative images aligned with actual human preferences.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39993", "url": null, "sourceid": 41969, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39996, "uid": "29c4f178686891b4988c32f722acec62", "name": "Structural Action Transformer for 3D Dexterous Manipulation", "authors": [{"id": 107066, "fullname": "Xiaohan Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/107066?format=json", "institution": "\u4e2d\u56fd\u79d1\u5b66\u6280\u672f\u5927\u5b66"}, {"id": 90344, "fullname": "Min Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90344?format=json", "institution": "Institute of Artificial Intelligence, Hefei Comprehensive National Science Center"}, {"id": 193266, "fullname": "Bohong Weng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193266?format=json", "institution": "University of Science and Technology of China"}, {"id": 85751, "fullname": "Wengang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/85751?format=json", "institution": "University of Science and Technology of China"}, {"id": 89074, "fullname": "Houqiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/89074?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Achieving human-level dexterity in robots via imitation learning from heterogeneous datasets is hindered by the challenge of cross-embodiment skill transfer, particularly for high-DoF robotic hands. Existing methods, often relying on 2D observations and temporal-centric action representation, struggle to capture 3D spatial relations and fail to handle embodiment heterogeneity.This paper proposes the Structural Action Transformer (SAT), a new 3D dexterous manipulation policy that challenges this paradigm by introducing a structural-centric perspective. We reframe each action chunk not as a temporal sequence, but as a variable-length, unordered sequence of joint-wise trajectories.This structural formulation allows a Transformer to natively handle heterogeneous embodiments, treating the joint count as a variable sequence length.To encode structural priors and resolve ambiguity, we introduce an Embodied Joint Codebook that embeds each joint's functional role and kinematic properties.Our model learns to generate these trajectories from 3D point clouds via a continuous-time flow matching objective.We validate our approach by pre-training on large-scale heterogeneous datasets and fine-tuning on simulation and real-world dexterous manipulation tasks.Our method consistently outperforms all baselines, demonstrating superior sample efficiency and effective cross-embodiment skill transfer.This structural-centric representation offers a new path toward scaling policies for high-DoF, heterogeneous manipulators.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39996", "url": null, "sourceid": 30923, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40389?format=json"], "related_events_ids": [40389]}, {"id": 39994, "uid": "a6807243689c76d0e34230e8e6ce5ca9", "name": "Interpretable  Cross-Domain Few-Shot Learning with Rectified  Target-Domain Local Alignment", "authors": [{"id": 182758, "fullname": "Yaze Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/182758?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 131641, "fullname": "Yixiong Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/131641?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 131642, "fullname": "Yuhua Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/131642?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 86547, "fullname": "Ruixuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86547?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Cross-Domain Few-Shot Learning (CDFSL) adapts models trained with large-scale general data (source domain) to downstream target domains with only scarce training data, where the research on vision-language models (e.g., CLIP) is still in the early stages. Typical downstream domains, such as medical diagnosis, require fine-grained visual cues for interpretable recognition, but we find that current fine-tuned CLIP models can hardly focus on these cues, albeit they can roughly focus on important regions in source domains. Although current works have demonstrated CLIP's shortcomings in capturing local subtle patterns, in this paper, we find that **the domain gap and scarce training data further exacerbate such shortcomings, much more than that of holistic patterns**, which we call the local misalignment problem in CLIP-based CDFSL. To address this problem, due to the lack of supervision in aligning local visual features and text semantics, we turn to self-supervision information. Inspired by the translation task, we propose the CC-CDFSL method with cycle consistency, which translates local visual features into text features and then translates them back into visual features (and vice versa), and constrains the original features close to the translated back features.To reduce the noise imported by richer information in the visual modality, we further propose a Semantic Anchor mechanism, which first augments visual features to provide a larger corpus for the text-to-image mapping, and then shrinks the image features to filter out irrelevant image-to-text mapping. Extensive experiments on various benchmarks, backbones, and fine-tuning methods show we can (1) effectively improve the local vision-language alignment, (2) enhance the interpretability of learned patterns and model decisions by visualizing patches, and (3) achieve state-of-the-art performance. Our codes will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39994", "url": null, "sourceid": 46301, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40389, "uid": "29c4f178686891b4988c32f722acec62", "name": "Structural Action Transformer for 3D Dexterous Manipulation", "authors": [{"id": 107066, "fullname": "Xiaohan Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/107066?format=json", "institution": "\u4e2d\u56fd\u79d1\u5b66\u6280\u672f\u5927\u5b66"}, {"id": 90344, "fullname": "Min Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90344?format=json", "institution": "Institute of Artificial Intelligence, Hefei Comprehensive National Science Center"}, {"id": 193266, "fullname": "Bohong Weng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193266?format=json", "institution": "University of Science and Technology of China"}, {"id": 85751, "fullname": "Wengang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/85751?format=json", "institution": "University of Science and Technology of China"}, {"id": 89074, "fullname": "Houqiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/89074?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Achieving human-level dexterity in robots via imitation learning from heterogeneous datasets is hindered by the challenge of cross-embodiment skill transfer, particularly for high-DoF robotic hands. Existing methods, often relying on 2D observations and temporal-centric action representation, struggle to capture 3D spatial relations and fail to handle embodiment heterogeneity.This paper proposes the Structural Action Transformer (SAT), a new 3D dexterous manipulation policy that challenges this paradigm by introducing a structural-centric perspective. We reframe each action chunk not as a temporal sequence, but as a variable-length, unordered sequence of joint-wise trajectories.This structural formulation allows a Transformer to natively handle heterogeneous embodiments, treating the joint count as a variable sequence length.To encode structural priors and resolve ambiguity, we introduce an Embodied Joint Codebook that embeds each joint's functional role and kinematic properties.Our model learns to generate these trajectories from 3D point clouds via a continuous-time flow matching objective.We validate our approach by pre-training on large-scale heterogeneous datasets and fine-tuning on simulation and real-world dexterous manipulation tasks.Our method consistently outperforms all baselines, demonstrating superior sample efficiency and effective cross-embodiment skill transfer.This structural-centric representation offers a new path toward scaling policies for high-DoF, heterogeneous manipulators.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40389", "url": null, "sourceid": -30923, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39996?format=json"], "related_events_ids": [39996]}, {"id": 39997, "uid": "ffbc391008d6481ccf89703e9619a98c", "name": "ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting", "authors": [{"id": 193267, "fullname": "Yeonkyung Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/193267?format=json", "institution": "Yonsei University"}, {"id": 128742, "fullname": "Dayun Ju", "url": "http://cvpr.thecvf.com/api/miniconf/users/128742?format=json", "institution": "Yonsei University"}, {"id": 193268, "fullname": "Youngmin Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/193268?format=json", "institution": "Yonsei University"}, {"id": 101322, "fullname": "seil kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/101322?format=json", "institution": "yonsei university"}, {"id": 107168, "fullname": "Seong Jae Hwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107168?format=json", "institution": "Yonsei University"}], "abstract": "Recent advancements in Video Large Language Models (VideoLLMs) have enabled strong performance across diverse multimodal video tasks. To reduce the high computational cost of processing dense video frames, efficiency-oriented methods such as frame selection have been widely adopted. While effective at minimizing redundancy, these methods often cause notable performance drops on tasks requiring temporal reasoning. Unlike humans, who can infer event progression from sparse visual cues, VideoLLMs frequently misinterpret temporal relations when intermediate frames are omitted. To address this limitation, we explore visual prompting (VP) as a lightweight yet effective way to enhance temporal understanding in VideoLLMs. Our analysis reveals that simply annotating each frame with explicit ordinal information helps the model perceive temporal continuity. This visual cue also supports frame-level referencing and mitigates positional ambiguity within a sparsely sampled sequence. Building on these insights, we introduce ViKey, a training-free framework that combines VP with a lightweight Keyword\u2013Frame Mapping (KFM) module. KFM leverages frame indices as dictionary-like keys to link textual cues to the most relevant frames, providing explicit temporal anchors during inference. Despite its simplicity, our approach substantially improves temporal reasoning and preserves dense-frame baseline performance with as few as 20% of frames.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39997", "url": null, "sourceid": 35838, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39999, "uid": "f6f658b6c7f13e833d7f81797e9a0869", "name": "Cleaning the Pool: Progressive Filtering of Unlabeled Pools in Deep Active Learning", "authors": [{"id": 183790, "fullname": "Denis Huseljic", "url": "http://cvpr.thecvf.com/api/miniconf/users/183790?format=json", "institution": "University of Kassel"}, {"id": 193269, "fullname": "Marek Herde", "url": "http://cvpr.thecvf.com/api/miniconf/users/193269?format=json", "institution": "Universit\u00e4t Kassel"}, {"id": 193270, "fullname": "Lukas Rauch", "url": "http://cvpr.thecvf.com/api/miniconf/users/193270?format=json", "institution": "University of Kassel"}, {"id": 176802, "fullname": "Paul Hahn", "url": "http://cvpr.thecvf.com/api/miniconf/users/176802?format=json", "institution": "University of Kassel"}, {"id": 193271, "fullname": "Bernhard Sick", "url": "http://cvpr.thecvf.com/api/miniconf/users/193271?format=json", "institution": "Universit\u00e4t Kassel"}], "abstract": "Existing active learning (AL) strategies capture fundamentally different notions of data value, e.g., uncertainty or representativeness. Consequently, the effectiveness of strategies can vary substantially across datasets, models, and even AL cycles. Committing to a single strategy risks suboptimal performance, as no single strategy dominates throughout the entire AL process. We introduce REFINE, an ensemble AL method that combines multiple strategies without knowing in advance which will perform best. In each AL cycle, REFINE operates in two stages: (1) Progressive filtering iteratively refines the unlabeled pool by considering an ensemble of AL strategies, retaining promising candidates capturing different notions of value. (2) Coverage-based selection then chooses a final batch from this refined pool, ensuring all previously identified notions of value are accounted for. Extensive experiments across 6 classification datasets and 3 foundation models show that REFINE consistently outperforms individual strategies and existing ensemble methods.  Notably, progressive filtering serves as a powerful preprocessing step that improves the performance of any individual AL strategy applied to the refined pool, which we demonstrate on an audio spectrogram classification use case. Finally, the ensemble of REFINE can be easily extended with upcoming state-of-the-art AL strategies.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39999", "url": null, "sourceid": 35757, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40390, "uid": "7991414f1fb147d65ca8fb117934f94b", "name": "Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views", "authors": [{"id": 174241, "fullname": "Kunwar Maheep Singh", "url": "http://cvpr.thecvf.com/api/miniconf/users/174241?format=json", "institution": "MPI for Informatics"}, {"id": 136775, "fullname": "Jianchun Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/136775?format=json", "institution": "Max Planck Institute for Informatics"}, {"id": 75786, "fullname": "Vladislav Golyanik", "url": "http://cvpr.thecvf.com/api/miniconf/users/75786?format=json", "institution": "MPI for Informatics"}, {"id": 87361, "fullname": "Stephan J. Garbin", "url": "http://cvpr.thecvf.com/api/miniconf/users/87361?format=json", "institution": "Microsoft"}, {"id": 75876, "fullname": "Thabo Beeler", "url": "http://cvpr.thecvf.com/api/miniconf/users/75876?format=json", "institution": "Google"}, {"id": 89511, "fullname": "Rishabh Dabral", "url": "http://cvpr.thecvf.com/api/miniconf/users/89511?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 127234, "fullname": "Marc Habermann", "url": "http://cvpr.thecvf.com/api/miniconf/users/127234?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 75465, "fullname": "Christian Theobalt", "url": "http://cvpr.thecvf.com/api/miniconf/users/75465?format=json", "institution": "MPI Informatik"}], "abstract": "We present _Relightable Holoported Characters_ (RHC), a novel person-specific method for free-view rendering and relighting of full-body and highly dynamic humans solely observed from sparse-view RGB videos at inference. In contrast to classical one-light-at-a-time (OLAT)-based human relighting, our transformer-based RelightNet predicts relit appearance within a single network pass, avoiding costly OLAT-basis capture and generation. For training such a model, we introduce a new capture strategy and dataset recorded in a multi-view lightstage, where we alternate frames lit by random environment maps with uniformly lit tracking frames, simultaneously enabling accurate motion tracking and diverse illumination as well as dynamics coverage. Inspired by the rendering equation, we derive physics-informed features that encode geometry, albedo, shading, and the virtual camera view from a coarse human mesh proxy and the input views. Our RelightNet then takes these features as input and cross-attends them with a novel lighting condition, and regresses the relit appearance in the form of texel-aligned 3D Gaussian splats attached to the coarse mesh proxy. Consequently, our RelightNet implicitly learns to efficiently compute the rendering equation for novel lighting conditions within a single feed-forward pass. Experiments demonstrate our method\u2019s superior visual fidelity and lighting reproduction compared to state-of-the-art approaches.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40390", "url": null, "sourceid": -34670, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39998?format=json"], "related_events_ids": [39998]}, {"id": 40000, "uid": "3d640c33a0135a046e055254038872c6", "name": "INSIGHT Bench: Towards Grounded IN-SItu Guidance for Robotic ManipulaTion", "authors": [{"id": 181145, "fullname": "Seonho Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/181145?format=json", "institution": "Hanyang University"}, {"id": 176278, "fullname": "JUNHYEONG HONG", "url": "http://cvpr.thecvf.com/api/miniconf/users/176278?format=json", "institution": "Hanyang University"}, {"id": 193272, "fullname": "Kyungjae Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/193272?format=json", "institution": "Korea University"}, {"id": 90672, "fullname": "Yoonseon Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/90672?format=json", "institution": "Hanyang University"}], "abstract": "Humans intuitively rely on text and symbols inscribed on objects (e.g. \"PULL\", \"Squeeze and Turn\") to perform tasks safely and correctly. In contrast, vision-language-action models excel at following external language commands, but remain largely unaware of this object-centric information. This capability is essential for reliable robotic operation, yet progress remains unmeasured due to the absence of standardized benchmarks.  To address this gap, we introduce INSIGHT Bench, a benchmark that formalizes the task of ``in-situ guide grounding\". INSIGHT Bench provides a comprehensive taxonomy that evaluates how agents utilize diverse guide information, including action-direction cues and procedural instructions. It also includes a scalable simulation framework that procedurally generates tasks and programmatically links each visual guide to its corresponding physical constraint. We release both the benchmark and the resulting trajectory dataset to support future research. Our evaluation of state-of-the-art VLA models reveals a critical limitation: their ability to ground in-situ guides is inconsistent and strongly dependent on the type of information. While models succeed on some guide categories, they frequently fail on others. However, performance improves substantially when the same information is provided as language instructions, indicating that in-situ guides could contribute to manipulation performance if VLAs were capable of interpreting them. These findings underscore the need for further research on understanding and grounding in-situ guides.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40000", "url": null, "sourceid": 38621, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40008, "uid": "0dacbbb65aed5d4ee2c067a456e7c0f2", "name": "CAR-SAM: Cross-Attention Reconstruction for Post-Training Quantization of the Segment Anything Model", "authors": [{"id": 181557, "fullname": "Houji Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/181557?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 193287, "fullname": "Jiangyong Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193287?format=json", "institution": "Houmo"}, {"id": 90681, "fullname": "Dawei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90681?format=json", "institution": "University of York"}, {"id": 127403, "fullname": "Jun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/127403?format=json", "institution": "Nanjing University of Science and Technology"}], "abstract": "Segment Anything Models (SAMs) are extensively used in computer vision for universal image segmentation, but deploying them on resource-constrained devices is challenging due to their high computational and memory demands. Post-Training Quantization (PTQ) is a widely used technique for model compression and acceleration. However, existing PTQ methods fail to consider the cross-attention architecture in the SAM decoder.This degradation primarily stems from the unique challenges posed by SAMs: (1) Attention dissipation, where the attention information in the decoder, which is crucial for representing segmentation masks, collapses into a diffuse and non-semantic form under low-bit quantization;and (2) Reconstruction oscillation, where bidirectional coupling within the two-way transformer introduces cross-branch error interference and destabilizes convergence. To tackle these issues, we propose CAR-SAM, a unified quantization framework tailored for SAMs. Firstly, to mitigate attention dissipation, we introduce MatMul-Aware Compensation (MAC) mechanism that transfers activation-induced quantization errors from MatMul to preceding linear weights.Secondly, to mitigate oscillation in decoder optimization, we develop a Joint Cross-Attention Reconstruction (JCAR) strategy that jointly reconstructs coupled attention branches, suppressing oscillatory behavior and promoting stable convergence.Extensive experiments show that CAR-SAM robustly quantizes SAM models down to 4-bit precision, surpassing existing methods by 14.6\\% and 6.6\\% mAP on SAM-B and SAM-L respectively.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40008", "url": null, "sourceid": 45463, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40010, "uid": "7bc5f2f1017ea56b1bb2d971a6190dbc", "name": "AdvFM: Lookahead Flow-Matching Velocity-Field Attacks for Imperceptible and Transferable Adversarial Examples", "authors": [{"id": 183433, "fullname": "Runze Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183433?format=json", "institution": "Harbin Institute of Technology"}, {"id": 193289, "fullname": "Zeyue Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193289?format=json", "institution": "Shanghai Institute of Microsystem and Information Technology"}, {"id": 193290, "fullname": "Fanghui Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/193290?format=json", "institution": null}, {"id": 193291, "fullname": "Rui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193291?format=json", "institution": "Harbin Institute of Technology"}, {"id": 193292, "fullname": "Yihan Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193292?format=json", "institution": null}, {"id": 193293, "fullname": "Shen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193293?format=json", "institution": "Harbin Institute of Technology"}, {"id": 193294, "fullname": "Zhaoyang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193294?format=json", "institution": null}], "abstract": "Unrestricted adversarial attacks based on generative models typically operate either directly in image space or through diffusion-style denoising and re-noising, which limits transferability and robustness against defenses. We revisit this problem through the lens of flow matching and continuous-time velocity fields, and propose AdvFM, a velocity-field attack that injects adversarial signals into the flow-matching dynamics instead of the pixel space. Given a noisy state $x_t$, AdvFM perturbs the reconstruction at $t{=}1$ and converts this perturbation into a change of the velocity field, yielding a state update that amplifies the inner PGD step in the noisy space. We further introduce a lookahead variant that optimizes a two-point objective over the current and rolled-out reconstructions, reducing temporal mismatch along the ODE trajectory. From a theoretical perspective, we show that compared to diffusion-based attacks, AdvFM enjoys: (i) larger single-step increases in the black-box loss via step amplification, (ii) reduced gradient variance and stronger surrogate-target alignment due to Gaussian smoothing, enhancing its transferability, and (iii) perturbations that concentrate in robust-tangent directions, thereby aligning with robust gradients of adversarially trained models and surviving purification more effectively; the lookahead variant further lowers gradient noise for a two-point robust objective. Extensive experiments demonstrate that AdvFM achieves promising performance in both black-box transferability and a suite of adversarial training and purification defenses.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40010", "url": null, "sourceid": 42233, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40020, "uid": "e77cae9fcdb3fd588d797921ac663823", "name": "LangRef3DGS: Natural Language-Guided 3D Referential Segmentation from Partial Observations via 3D Gaussian Splatting", "authors": [{"id": 180087, "fullname": "xulun ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/180087?format=json", "institution": "Ningbo University"}, {"id": 193318, "fullname": "Qin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193318?format=json", "institution": "Ningbo University"}, {"id": 186403, "fullname": "Kun Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186403?format=json", "institution": "Shenzhen University"}], "abstract": "Language-Guided 3D segmentation is crucial for linking 3D perception with semantic understanding, yet it remains vulnerable to the incomplete and occluded views common in real-world RGB-D data. To overcome this, we present a real-time framework that leverages 3D Gaussian Splatting (3DGS) to build a semantically continuous and differentiable embedding field from partial observations. Our approach integrates two key components: a Dirichlet Process (DP) for the adaptive discovery of novel object categories, and a gradient low-rank  mechanism that enhances class separability by reducing feature redundancy. This combination enables robust open-vocabulary segmentation guided directly by text prompts. Extensive experiments on challenging benchmarks demonstrate that our method achieves strong performance, exhibiting superior accuracy, robustness to incomplete inputs, and a powerful capacity for novel class discovery.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40020", "url": null, "sourceid": 46678, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40026, "uid": "d47da82feac4fce587b7b03ab2ee7f03", "name": "SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution", "authors": [{"id": 182278, "fullname": "Chao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182278?format=json", "institution": "University of Science and Technology of China"}, {"id": 113808, "fullname": "Zijin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/113808?format=json", "institution": "University of Science and Technology of China"}, {"id": 193333, "fullname": "Yaofei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193333?format=json", "institution": "Hefei University of Technology"}, {"id": 193334, "fullname": "Yuang Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/193334?format=json", "institution": "University of Science and Technology of China"}, {"id": 90592, "fullname": "Weiming Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90592?format=json", "institution": "University of Science and Technology of China"}, {"id": 90580, "fullname": "Nenghai Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90580?format=json", "institution": "University of Science and Technology of China"}, {"id": 107137, "fullname": "Kejiang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/107137?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Recent advancements in video generation technologies have been significant, resulting in their widespread application across multiple domains. However, concerns have been mounting over the potential misuse of generated content. Tracing the origin of generated videos has become crucial to mitigate potential misuse and identify responsible parties. Existing video attribution methods require additional operations or the training of source attribution models, which may degrade video quality or necessitate large amounts of training samples. To address these challenges, we define for the first time the \"few-shot training-free generated video attribution\" task and propose SWIFT, which is tightly integrated with the temporal characteristics of the video. By leveraging the \"Pixel Frames(many)$\\leftrightarrow$Latent Frame(one)\" temporal mapping within each video chunk, SWIFT applies a fixed-length sliding window to perform two distinct reconstructions: normal and corrupted. The variation in the losses between two reconstructions is then used as an attribution signal. We conducted an extensive evaluation of five state-of-the-art (SOTA) video generation models. Experimental results show that SWIFT achieves over 90\\% average attribution accuracy with merely 20 video samples across all models and even enables zero-shot attribution for HunyuanVideo, EasyAnimate, and Wan2.2.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40026", "url": null, "sourceid": 42994, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40028, "uid": "0bc62afce4c0004839d37c08c5f0528a", "name": "\u03bcVLM: A Vision Language Model for \u03bcNPUs", "authors": [{"id": 181669, "fullname": "Zijie Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/181669?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 193335, "fullname": "Guiyun Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193335?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 193336, "fullname": "Zhaoxing Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193336?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 193337, "fullname": "Rong Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/193337?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 182036, "fullname": "Haiming Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/182036?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "The proliferation of low-power intelligent processors with integrated Neural Processing Units (NPUs), called $\\mu$NPUs, has created new opportunities for on-device generative AI, benefitting end devices like smart wearables and small robots. However, deploying Vision-Language Models (VLMs) on $\\mu$NPUs is severely hindered by stringent memory constraints and limited operator support. To bridge this critical gap, we propose $\\mu$VLM, the first lightweight-oriented VLM architecture designed for $\\mu$NPUs. It is comprised of our proposed OverMod encoder and AttSSM decoder. OverMod is a lightweight dynamic convolutional network inspired by biomimetic vision, incorporating our novel Global Spatial Modulation mechanism to enable adaptive, high-fidelity feature extraction using only NPU-friendly operators. AttSSM leverages a highly efficient State Space Model (SSM) core, augmented with multi-scale feature fusion and Global Context Dynamic Modulation mechanism, to perform robust sequential modeling. Furthermore, we introduce a coordinated full-parameter quantization strategy that preserves precision across the encoder-decoder boundary, alongside hand-optimized operators for unsupported modules like SSMs. $\\mu$VLM achieves a competitive CIDEr score of 117.8 on the COCO Karpathy test split and, for the first time, demonstrates the feasibility of millisecond-level VLM inference on a $\\mu$NPU platform.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40028", "url": null, "sourceid": 44006, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40034, "uid": "6ea6f37c28354e994795384f82e200d2", "name": "AutoMoMa: Scalable Coordinated Mobile Manipulation Trajectory Generation", "authors": [{"id": 183429, "fullname": "Yida Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183429?format=json", "institution": "Peking University"}, {"id": 193350, "fullname": "Xinhai Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193350?format=json", "institution": "Peking University"}, {"id": 193351, "fullname": "Xin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193351?format=json", "institution": "Beijing Institute for General Artificial Intelligence"}, {"id": 193352, "fullname": "Ziyuan Jiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193352?format=json", "institution": "BIGAI"}, {"id": 88123, "fullname": "Yixin Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88123?format=json", "institution": "Peking University"}], "abstract": "Mobile robots need coordinated whole-body motion to perform household tasks effectively. Current mobile manipulation datasets rely on expensive teleoperation or slow planning methods, limiting available data to hundreds of demonstrations. This data scarcity severely constrains the development of generalizable learning-based policies. Here, we demonstrate that GPU-accelerated planning generates up to 5,000 episodes per GPU hour, over 80 $\\times$ faster than existing methods. Our AutoMoMa pipeline produces 500K diverse physically valid whole-body motions across 300 household scenes and multiple robot embodiments, compared to previous datasets limited to narrow robot-scene pairs with a few hundred demonstrations. Downstream validation demonstrates consistent policy improvements with large-scale training data. This work provides the first scalable solution to the mobile manipulation data bottleneck. By enabling massive dataset generation, AutoMoMa accelerates progress toward general-purpose household robots capable of complex coordination tasks.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40034", "url": null, "sourceid": 39503, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40032, "uid": "a75c57ea23c3e5a95e9f00bf745bd8b1", "name": "Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models", "authors": [{"id": 126773, "fullname": "Tianci Bi", "url": "http://cvpr.thecvf.com/api/miniconf/users/126773?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 88233, "fullname": "Xiaoyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88233?format=json", "institution": "Research, Microsoft"}, {"id": 87597, "fullname": "Yan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87597?format=json", "institution": "Microsoft Research Asia"}, {"id": 87581, "fullname": "Nanning Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87581?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "The performance of Latent Diffusion Models (LDMs) is critically dependent on the quality of their visual tokenizers. While recent works have explored incorporating Vision Foundation Models (VFMs) into the tokenizers training via distillation, we empirically find this approach inevitably weakens the robustness of learnt representation from original VFM. In this paper, we bypass the distillation by proposing a more direct approach by leveraging the frozen VFM for the LDMs tokenizer, named VFM Variational Autoencoder (VFM-VAE). To fully exploit the potential to leverage frozen VFM for the LDMs tokenizer, we design a new decoder to reconstruct realistic images from the semantic-rich representation of VFM. With the proposed VFM-VAE, we conduct a systematic study on how the representation from different tokenizers impact the representation learning process throughout diffusion training, enabling synergistic benefits of dual-side alignment on both tokenizers and diffusion models. Our effort in tokenizer design and training strategy lead to superior performance and efficiency: our system reaches a **gFID (w/o CFG) of 2.22 in merely 80 epochs** (a $10\\times$ speedup over prior tokenizers). With continued training to **640 epochs**, it further attains a **gFID (w/o CFG) of 1.62**. These results offer solid evidence for the substantial potential of VFMs to serve as visual tokenizers to accelerate the LDM training progress.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40032", "url": null, "sourceid": 38678, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40036, "uid": "533b878d8231617eeaad0b9543a206d1", "name": "Talking Together: Synthesizing Co-Located 3D Conversations from Audio", "authors": [{"id": 131521, "fullname": "Mengyi Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/131521?format=json", "institution": "University of Washington"}, {"id": 193354, "fullname": "Shouchieh Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193354?format=json", "institution": "Google"}, {"id": 158442, "fullname": "Ziqian Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/158442?format=json", "institution": "Google"}, {"id": 128093, "fullname": "Shichen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128093?format=json", "institution": "Google"}, {"id": 88352, "fullname": "Yinda Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88352?format=json", "institution": "Google"}, {"id": 96787, "fullname": "Luchuan Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/96787?format=json", "institution": "University of Rochester"}, {"id": 85742, "fullname": "Rohit Pandey", "url": "http://cvpr.thecvf.com/api/miniconf/users/85742?format=json", "institution": "Google"}, {"id": 85744, "fullname": "Sean Fanello", "url": "http://cvpr.thecvf.com/api/miniconf/users/85744?format=json", "institution": "Google"}, {"id": 88347, "fullname": "Zeng Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88347?format=json", "institution": "University of Southern California"}], "abstract": "We tackle the challenging task of generating complete 3D facial animations for two interacting, co-located participants from a mixed audio stream. While existing methods often produce disembodied ``talking heads'' akin to a video conference call, our work is the first to explicitly model the dynamic 3D spatial relationship\u2014including relative position, orientation, and mutual gaze\u2014that is crucial for realistic in-person dialogues. Our system synthesizes the full performance of both individuals, including precise lip-sync, and uniquely allows their relative head poses to be controlled via textual descriptions. To achieve this, we propose a dual-stream architecture where each stream is responsible for one participant's output. We employ speaker's role embeddings and inter-speaker cross-attention mechanisms are designed to disentangle the mixed audio and model the interaction. Furthermore, we introduce a novel eye gaze loss to promote natural, mutual eye contact. To power our data-hungry approach, we introduce a novel pipeline to curate a large-scale conversational dataset consisting of over 2 million dyadic pairs from in-the-wild videos. Our method generates fluid, controllable, and spatially aware dyadic animations suitable for immersive applications in VR and telepresence, significantly outperforming existing baselines in perceived realism and interaction coherence.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40036", "url": null, "sourceid": 36513, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40038, "uid": "17d6278aaf26be24980654e102a111fc", "name": "EduDiag: A Benchmark for Educational Diagnostic Reasoning with Error Tracing and Correction on Large Multimodal Models", "authors": [{"id": 181181, "fullname": "Jiali Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/181181?format=json", "institution": "South China University Of Technology, The Hong Kong Polytechnic University"}, {"id": 193356, "fullname": "Yuqi Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/193356?format=json", "institution": null}, {"id": 193357, "fullname": "Xusen Hei", "url": "http://cvpr.thecvf.com/api/miniconf/users/193357?format=json", "institution": "South China University of Technology"}, {"id": 193358, "fullname": "DingBa Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193358?format=json", "institution": "South China University of Technology"}, {"id": 132980, "fullname": "wei yuancheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/132980?format=json", "institution": "South China University Of Technology"}, {"id": 193359, "fullname": "Jiayuan Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/193359?format=json", "institution": null}, {"id": 158038, "fullname": "Yi Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/158038?format=json", "institution": "South China University of Technology"}], "abstract": "Large multimodal models (LMMs) have achieved impressive performance on multimodal reasoning, becoming crucial technology for the advancement of intelligent question-answering systems. In real-world educational scenarios, effective teaching extends far beyond providing answers. Experienced teachers analyze students' incorrect answers to trace underlying errors and provide corrective feedback, termed educational diagnostic reasoning, a capability that remains under-explored in existing LMMs. To bridge this research gap, we introduce Edudiag benchmark, requiring LMMs to reconstruct erroneous reasoning chains from incorrect answers and generate corrective feedback. Through an AI-assisted annotation pipeline with rigorous human verification, we create 8K erroneous reasoning chains and corresponding feedback, spanning three representative educational domains: commonsense, science, and mathematics.Extensive evaluation across 28 leading LMMs highlights \\textit{Edudiag} as a challenging testbed, where even leading proprietary LMMs struggle on it and supervised fine-tuning (SFT) on open-source LMMs achieves marginal performance gains. Moreover, we conduct analysis experiments and identify three critical insights for educational diagnostic reasoning: (i) Effective error tracing remains the primary bottleneck, while SFT models still fail to reversely identify errors that commonly occur. (ii) Group relative policy optimization (GRPO) mitigates this bottleneck and boosts performance. (iii) LMMs optimized with GRPO can generate plausible yet challenging distractors for multiple-choice questions based on their self-constructed erroneous reasoning chains. We believe Edudiag provides a new direction for evaluating the advanced LMMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40038", "url": null, "sourceid": 46260, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40040, "uid": "a2a561ffc61849e2ba1ad94624ad4e0b", "name": "PositionIC: Unified Position and Identity Consistency for Image Customization", "authors": [{"id": 182219, "fullname": "Junjie Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182219?format=json", "institution": "Fudan University"}, {"id": 193362, "fullname": "Tianyang Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/193362?format=json", "institution": "Meituan"}, {"id": 193363, "fullname": "Kai Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/193363?format=json", "institution": "Meituan"}, {"id": 159498, "fullname": "Jialin Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/159498?format=json", "institution": "Meituan"}, {"id": 193364, "fullname": "Yang Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/193364?format=json", "institution": null}, {"id": 193365, "fullname": "Xianhua He", "url": "http://cvpr.thecvf.com/api/miniconf/users/193365?format=json", "institution": null}, {"id": 154365, "fullname": "Junfeng Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/154365?format=json", "institution": "Meituan"}, {"id": 84905, "fullname": "Xiaoming Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/84905?format=json", "institution": "Meituan"}, {"id": 76787, "fullname": "Wenqiang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76787?format=json", "institution": "Fudan University"}], "abstract": "Recent subject-driven image customization excels in fidelity, yet fine-grained instance-level spatial control remains an elusive challenge, hindering real-world applications. This limitation stems from two factors: a scarcity of scalable, position-annotated datasets, and the entanglement of identity and layout by global attention mechanisms.To this end, we introduce PositionIC, a unified framework for high-fidelity, spatially controllable multi-subject customization. First, we present BMPDS, the first automatic data-synthesis pipeline for position-annotated multi-subject datasets, effectively providing crucial spatial supervision. Second, we design a lightweight, layout-aware diffusion framework that integrates a novel visibility-aware attention mechanism. This mechanism explicitly models spatial relationships via an NeRF-inspired volumetric weight regulation to effectively decouple instance-level spatial embeddings from semantic identity features, enabling precise, occlusion-aware placement of multiple subjects.Extensive experiments demonstrate PositionIC achieves state-of-the-art performance on public benchmarks, setting new records for spatial precision and identity consistency. Our work represents a significant step towards truly controllable, high-fidelity image customization in multi-entity scenarios. Code and data will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40040", "url": null, "sourceid": 37299, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40042, "uid": "1768fdbdf94f1a65de7f0885a4b67400", "name": "Back to Source: Open-Set Continual Test-Time Adaptation via Domain Compensation", "authors": [{"id": 180947, "fullname": "YINGKAI YANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/180947?format=json", "institution": "Shenzhen University"}, {"id": 158540, "fullname": "Chaoqi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/158540?format=json", "institution": "Shenzhen University"}, {"id": 85724, "fullname": "Hui Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85724?format=json", "institution": "Shenzhen University"}], "abstract": "Test-Time Adaptation (TTA) aims to mitigate distributional shifts between training and test domains during inference time. However, existing TTA methods fall short in the realistic scenario where models face both continually changing domains and the simultaneous emergence of unknown semantic classes --- a challenging setting we term Open-set Continual Test-Time Adaptation (OCTTA). The coupling of domain and semantic shifts often collapses the feature space, severely degrading both classification and out-of-distribution detection. To tackle this, we propose DOmain COmpensation (DOCO), a lightweight and effective framework that robustly performs domain adaptation and OOD detection in a synergistic, closed loop. DOCO first performs dynamic, adaptation-conditioned sample splitting to separate likely ID from OOD samples. Then, using only the ID samples, it learns a domain compensation prompt by aligning feature statistics with the source domain, guided by a structural preservation regularizer that prevents semantic distortion. This learned prompt is then propagated to the OOD samples within the same batch, effectively isolating their semantic novelty for more reliable detection. Extensive experiments on multiple challenging benchmarks demonstrate that DOCO outperforms prior CTTA and OSTTA methods, establishing a new state-of-the-art for the demanding OCTTA setting.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40042", "url": null, "sourceid": 37805, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40044, "uid": "e3a76242ae14e12c01d1554751bb7a90", "name": "StreamDiT: Real-Time Streaming Text-to-Video Generation", "authors": [{"id": 193370, "fullname": "Akio Kodaira", "url": "http://cvpr.thecvf.com/api/miniconf/users/193370?format=json", "institution": "Shizuku AI"}, {"id": 127701, "fullname": "Tingbo Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/127701?format=json", "institution": "Google DeepMind"}, {"id": 89885, "fullname": "Ji Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/89885?format=json", "institution": "Facebook"}, {"id": 75979, "fullname": "Markos Georgopoulos", "url": "http://cvpr.thecvf.com/api/miniconf/users/75979?format=json", "institution": "Synthesia"}, {"id": 69156, "fullname": "Felix Juefei-Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/69156?format=json", "institution": "GenAI, Meta"}, {"id": 86415, "fullname": "Masayoshi Tomizuka", "url": "http://cvpr.thecvf.com/api/miniconf/users/86415?format=json", "institution": "University of California, Berkeley"}, {"id": 167239, "fullname": "Yue Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/167239?format=json", "institution": "Meta"}], "abstract": "Recently, great progress has been achieved in text-to-video (T2V) generation by scaling transformer-based diffusion models to billions of parameters, which can generate high-quality videos. However, existing models typically produce only short clips offline, restricting their use cases in interactive and real-time applications. This paper addresses these challenges by proposing StreamDiT, a streaming video generation model. StreamDiT training is based on flow matching by adding a moving buffer. We design mixed training with different partitioning schemes of buffered frames to boost both content consistency and visual quality. StreamDiT modeling is based on adaLN DiT with varying time embedding and window attention. To practice the proposed method, we train a StreamDiT model with 4B parameters. In addition, we propose a multistep distillation method tailored for StreamDiT. Sampling distillation is performed in each segment of a chosen partitioning scheme. After distillation, the total number of function evaluations (NFEs) is reduced to the number of chunks in a buffer. Finally, our distilled model reaches real-time performance at 16 FPS on one GPU, which can generate video streams at 512p resolution. We evaluate our method through both quantitative metrics and human evaluation. Our model enables real-time applications, e.g. streaming generation, interactive generation, and video-to-video.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40044", "url": null, "sourceid": 38703, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40046, "uid": "83a17b1aec90395c67bba676be661288", "name": "STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative", "authors": [{"id": 89448, "fullname": "Peixuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89448?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 193373, "fullname": "Zijian Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/193373?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 193374, "fullname": "Kaiqi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193374?format=json", "institution": "Nanjing University"}, {"id": 89455, "fullname": "Shuchen Weng", "url": "http://cvpr.thecvf.com/api/miniconf/users/89455?format=json", "institution": "Peking University"}, {"id": 89137, "fullname": "Si Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/89137?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 76401, "fullname": "Boxin Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/76401?format=json", "institution": "Peking University"}], "abstract": "While recent advancements in generative models have achieved remarkable visual fidelity in video synthesis, creating coherent multi-shot narratives remains a significant challenge. To address this, keyframe-based approaches have emerged as a promising alternative to computationally intensive end-to-end methods, offering the advantages of fine-grained control and greater efficiency. However, these methods often fail to maintain cross-shot consistency and capture cinematic language. In this paper, we introduce STAGE, a SToryboard-Anchored GEneration workflow to reformulate the keyframe-based multi-shot video generation task. Instead of using sparse keyframes, we propose STEP^2 to predict a structural storyboard composed of start-end frame pairs for each shot. We introduce the multi-shot memory pack to ensure long-range entity consistency, the dual-encoding strategy for intra-shot coherence, and the two-stage training scheme to learn cinematic inter-shot transition. We also contribute the large-scale ConStoryBoard dataset, including high-quality movie clips with fine-grained annotations for story progression, cinematic attributes, and human preferences. Extensive experiments demonstrate that STAGE achieves superior performance in structured narrative control and cross-shot coherence.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40046", "url": null, "sourceid": 35405, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40049, "uid": "a39878ae282a18ea051ad89e7875a272", "name": "Perceptual-Evidence Anchored Reinforced Learning for Multimodal Reasoning", "authors": [{"id": 180532, "fullname": "Chi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180532?format=json", "institution": "Wuhan University"}, {"id": 166719, "fullname": "Haibo Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/166719?format=json", "institution": "Meituan"}, {"id": 193378, "fullname": "Qiming Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193378?format=json", "institution": "University of Sydney, University of Sydney"}, {"id": 89331, "fullname": "Yufei Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89331?format=json", "institution": "The University of Sydney, University of Sydney"}, {"id": 186186, "fullname": "Zhixiong Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186186?format=json", "institution": "Meituan"}, {"id": 186192, "fullname": "Siqi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186192?format=json", "institution": null}, {"id": 186191, "fullname": "Peng Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186191?format=json", "institution": "Meituan"}, {"id": 88103, "fullname": "Lin Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/88103?format=json", "institution": "Meituan"}, {"id": 91732, "fullname": "Jing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91732?format=json", "institution": "The University of Sydney"}], "abstract": "Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) and is now being applied to Vision-Language Models (VLMs). However, vanilla RLVR for VLMs verifies only the final textual output, critically neglecting the foundational step of visual perception. This oversight leads to visual hallucinations and reward hacking, as reasoning built upon flawed perception is inherently unreliable. To address this, we propose PEARL (Perceptual-Evidence Anchored Reinforced Learning), a dual-branch, perception-reasoning synergistic that strengthens multimodal reasoning by explicitly anchoring it to verified visual evidence. For each reasoning-oriented QA instance, PEARL first derive a perception checklist---a set of perception-oriented sub-questions with verifiable answers that probe the model's understanding of key visual evidence. During training, auxiliary rollouts on this checklist yield a perceptual reward that both directly reinforces the model's perception ability and acts as a fidelity gate for reasoning. If the model passes the perception check, its policy update is biased towards evidence-anchored reasoning. Otherwise, the process is halted to prevent reasoning from flawed premises. PEARL can be seamlessly integrated with popular RL methods like GRPO and DAPO. Comprehensive experiments show PEARL achieves substantial gains on multimodal reasoning benchmarks, e.g., a +9.7\\% improvement over the baseline and +6.6\\% over GRPO on MathVerse.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40049", "url": null, "sourceid": 31779, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40051, "uid": "099546f43eb9438109dd51bfbdfd2eda", "name": "MUST: Modality-Specific Representation-Aware Transformer for Diffusion-Enhanced Survival Prediction with Missing Modality", "authors": [{"id": 183500, "fullname": "Kyungwon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/183500?format=json", "institution": "Yonsei University"}, {"id": 91230, "fullname": "Dosik Hwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91230?format=json", "institution": "Yonsei University"}], "abstract": "Accurate survival prediction from multimodal medical data is essential for precision oncology, yet clinical deployment faces a persistent challenge: modalities are frequently incomplete due to cost constraints, technical limitations, or retrospective data availability. While recent methods attempt to address missing modalities through feature alignment or joint distribution learning, they fundamentally lack explicit modeling of the unique contributions of each modality as opposed to the information derivable from other modalities. We propose a novel framework that explicitly decomposes each modality's representation into modality-specific and cross-modal contextualized components through algebraic constraints in a learned low-rank shared subspace. This decomposition enables us to identify precisely what information is missing when a modality is absent. For the truly modality-specific information that cannot be inferred from available modalities, we employ conditional latent diffusion models to generate high-quality representations conditioned on recovered shared information and learned structural priors. Extensive experiments on five TCGA cancer datasets demonstrate that the proposed method achieves state-of-the-art performance with complete data while maintaining robust predictions in both missing pathology and missing genomics conditions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40051", "url": null, "sourceid": 32845, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40053, "uid": "20913c01b73eb72bf3bbd8b570e4dfa4", "name": "Disentangling to Re-couple: Resolving the Similarity-Controllability Paradox in Subject-Driven Text-to-Image Generation", "authors": [{"id": 193385, "fullname": "Shuang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193385?format=json", "institution": "Tencent"}, {"id": 193386, "fullname": "Chao Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193386?format=json", "institution": "Tsinghua University"}, {"id": 193387, "fullname": "Hang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193387?format=json", "institution": "Tsinghua University"}, {"id": 193388, "fullname": "Liqun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193388?format=json", "institution": "Tencent"}, {"id": 193389, "fullname": "zhenyu hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193389?format=json", "institution": null}, {"id": 193390, "fullname": "Cao Te", "url": "http://cvpr.thecvf.com/api/miniconf/users/193390?format=json", "institution": null}, {"id": 193391, "fullname": "Mengge Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/193391?format=json", "institution": null}, {"id": 193392, "fullname": "Yuan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193392?format=json", "institution": "Tencent"}, {"id": 193393, "fullname": "Peng Shu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193393?format=json", "institution": "Tencent"}, {"id": 193394, "fullname": "Huan Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193394?format=json", "institution": "Tencent"}, {"id": 86191, "fullname": "Jie Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86191?format=json", "institution": "Tencent AI Lab"}], "abstract": "Subject-Driven Text-to-Image (T2I) Generation aims to preserve a subject's identity while editing its context based on a text prompt. A core challenge in this task is the ``similarity-controllability paradox'', where enhancing textual control often degrades the subject's fidelity, and vice-versa. We argue this paradox stems from the ambiguous role of text prompts, which are often tasked with describing both the subject and the desired modifications, leading to conflicting signals for the model. To resolve this, we propose \\textbf{DisCo}, a novel framework that first \\textbf{Dis}entangles and then re-\\textbf{Co}uples visual and textual information. First, our textual-visual decoupling module isolates the sources of information: subject identity is extracted exclusively from the reference image with the entity word of the subject, while the text prompt is simplified to contain only the modification command, where the subject refers to general pronouns, eliminating descriptive ambiguity. However, this strict separation can lead to unnatural compositions between the subject and its contexts. We address this by designing a dedicated reward signal and using reinforcement learning to seamlessly recouple the visually-defined subject and the textually-generated context. Our approach effectively resolves the paradox, enabling simultaneous high-fidelity subject preservation and precise textual control. Extensive experiments demonstrate that our method achieves state-of-the-art performance, producing highly realistic and coherent images.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40053", "url": null, "sourceid": 38673, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40057, "uid": "b5943cfb2b78a2694cc84c22b9381970", "name": "Pointing at Parts: Training-Free Few-Shot Grounding in Multimodal LLMs", "authors": [{"id": 183508, "fullname": "Shiang-Feng Tsai", "url": "http://cvpr.thecvf.com/api/miniconf/users/183508?format=json", "institution": "National Tsing Hua University"}, {"id": 152562, "fullname": "Yuan-Hong Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/152562?format=json", "institution": "University of Toronto"}, {"id": 130488, "fullname": "Jin-Cheng Jhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130488?format=json", "institution": "National Tsing Hua University"}, {"id": 97203, "fullname": "Nan Qiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/97203?format=json", "institution": "Amazon"}, {"id": 93149, "fullname": "Min Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/93149?format=json", "institution": "Amazon/NTHU"}], "abstract": "Part-level pointing is important for fine-grained interaction and reasoning, yet existing Multimodal Large Language Models (MLLMs) remain limited to instance-level pointing. Part-level pointing presents unique challenges: annotation is costly, parts are long-tail distributed, and many are difficult to specify precisely in language. We introduce POinting at Parts (POP), a training-free, plug-and-play approach that addresses these challenges under a few-shot setup. POP fuses textual and visual attention maps with self-supervised visual correspondences from query image and few-shot examples. On average across the three evaluated datasets, POP achieves accuracy gains of up to 8.9 points in the one-shot setting and 16.4 points in the three-shot setting for the pointing-capable MLLMs\u2014Qwen2.5-VL, Ovis2.5, and Molmo. Notably, even MLLMs without pointing capability benefit significantly from the proposed approach. These results establish a simple yet effective path toward fine-grained spatial grounding in MLLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40057", "url": null, "sourceid": 31259, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40060, "uid": "1f79c37c38ef31e2174277b34e5aa64b", "name": "Distilling Quasi-Conformal Mapping: A Generalizable and Efficient Solution for Wide-Angle Correction", "authors": [{"id": 144365, "fullname": "Chengyang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/144365?format=json", "institution": "The University of Hong Kong"}, {"id": 180133, "fullname": "Zixuan Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/180133?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 180131, "fullname": "Miaolin Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/180131?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 193410, "fullname": "Michael Ng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193410?format=json", "institution": "Hong Kong Baptist University"}, {"id": 146860, "fullname": "huibin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/146860?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "This paper introduces a novel framework for wide-angle correction by distilling the principles of quasi-conformal mapping into an efficient and generalizable deep neural network. Our methodology can be divided into two primary stages. In the first stage, the wide-angle distortion correction problem is treated as a quasi-conformal mapping from the distorted image to the target image. In particular, we minimize the Beltrami smoothness energy with constraints of both line structures and human body regions. The Beltrami coefficient is subsequently estimated using the Proximal Gradient Descent algorithm. This alternating optimization yields the final quasi-conformal mapping and the corresponding corrected image. In the second stage, the Quasi-conformal-mapping Distilled Wide-angle Correction Network (QDWC-Net) is proposed, which is trained on these corrected images to predict the correction flow directly from a distorted input and built upon an encoder-decoder followed by a soft-argmin regression output head and loss functions. Extensive quantitative and qualitative experiments demonstrate the superior effectiveness and efficiency of our distilled approach, which achieves state-of-the-art correction results, especially in mitigating distortion in both portrait and human body regions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40060", "url": null, "sourceid": 31493, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40061, "uid": "cb0dfa84f66dc1d66825f5aa726cbf55", "name": "CF-IPT: Cross-Modal Fusion Interactive Prompt Tuning of Vision-Language Pre-Trained Model for Multisource Remote Sensing Data Classification", "authors": [{"id": 180631, "fullname": "Jinheng Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/180631?format=json", "institution": "Xidian University"}, {"id": 193411, "fullname": "Jiahui Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193411?format=json", "institution": "Xidian University"}, {"id": 193412, "fullname": "Wenqian Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/193412?format=json", "institution": "Xidian University"}, {"id": 73148, "fullname": "Yunsong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/73148?format=json", "institution": "Xidian University"}], "abstract": "Fine-tuning Vision-Language Models (VLMs) trained on large-scale datasets of natural image-text pairs has demonstrated impressive performance for various downstream tasks. However, their fine-tuning for remote sensing (RS) tasks faces dual barriers: (1) Data-level barrier caused by the fundamental modality gap between natural imagery and RS data, and (2) Task-level barrier stemming from the requirement for multi-source interaction modeling capabilities. This paper proposes a Cross-modal Fusion Interactive Prompt Tuning (CF-IPT) method to fine-tune CLIP for multi-source RS image classification tasks. It aims to leverage the prompt learning framework to transfer the alignment target of the text branch shifts from natural images to multi-source RS images. Specifically, we design a Multi-source Interactive Fusion\u2013guided Spectral-Spatial Prompt Generation (MFPG) module, which enables cross-modal feature interaction to generate a prompt matrix that preserves the original spectral and spatial information while performing adaptive multi-scale fusion to address the multi-source image adaptation problem. Subsequently, a Spectral\u2013Spatial Prompt\u2013guided Visual\u2013Text Prompt Interaction (V-TPI)  Strategy is proposed, which leverages spectral\u2013spatial prompt matrices to guide visual\u2013textual prompt interaction and inject RS\u2013specific information into both branches of CLIP, ultimately enabling multi-source RS image\u2013text representation alignment. The proposed approach performs the downstream task of multi-source RS image classification with merely 0.76\\% of CLIP\u2019s parameters. It is evaluated on several widely used datasets, demonstrating the effectiveness of the proposed approach.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40061", "url": null, "sourceid": 43787, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40056, "uid": "7f57bce28faaaea3f907428563a4437c", "name": "Generalizable Sparse-View 3D Reconstruction from Unconstrained Images", "authors": [{"id": 132592, "fullname": "Vinayak Gupta", "url": "http://cvpr.thecvf.com/api/miniconf/users/132592?format=json", "institution": "University of Maryland, College Park"}, {"id": 73881, "fullname": "Chih-Hao Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/73881?format=json", "institution": "Department of Computer Science"}, {"id": 86810, "fullname": "Shenlong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86810?format=json", "institution": "University of Illinois, Urbana Champaign"}, {"id": 193404, "fullname": "Anand Bhattad", "url": "http://cvpr.thecvf.com/api/miniconf/users/193404?format=json", "institution": "Johns Hopkins University"}, {"id": 88945, "fullname": "Jia-Bin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88945?format=json", "institution": "University of Maryland, College Park"}], "abstract": "Reconstructing 3D scenes from sparse, unposed images remains challenging under real-world conditions with varying illumination and transient occlusions. Existing methods rely on scene-specific optimization with appearance embeddings or dynamic masks, requiring extensive per-scene training and failing under sparse views. Moreover, evaluations on limited scenes raise questions about generalization. We present GenWildSplat, a feed-forward framework for sparse-view outdoor reconstruction that requires no per-scene optimization. Given unposed internet images, GenWildSplat predicts depth, camera parameters, and 3D Gaussians in a canonical space using learned geometric priors. An appearance adapter modulates appearance for target lighting conditions, while semantic segmentation handles transient objects. Through curriculum learning on synthetic and real data, GenWildSplat generalizes across diverse illumination and occlusion patterns. Evaluations on PhotoTourism and a new 20-scene MegaScenes benchmark demonstrate state-of-the-art feed-forward reconstruction quality, achieving real-time inference without test-time optimization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40056", "url": null, "sourceid": 32300, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40064, "uid": "dacf1de88e7c80b43b3b23f39c1e8d0b", "name": "Latent Action Pretraining Meets Pose Estimation", "authors": [{"id": 183219, "fullname": "Zhengqing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183219?format=json", "institution": "Simon Fraser University"}, {"id": 93292, "fullname": "Saurabh Nair", "url": "http://cvpr.thecvf.com/api/miniconf/users/93292?format=json", "institution": "Wayve AI"}, {"id": 93632, "fullname": "Prajwal Chidananda", "url": "http://cvpr.thecvf.com/api/miniconf/users/93632?format=json", "institution": "Wayve AI"}, {"id": 193420, "fullname": "Pujith Kachana", "url": "http://cvpr.thecvf.com/api/miniconf/users/193420?format=json", "institution": "Wayve"}, {"id": 193421, "fullname": "Samuel Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193421?format=json", "institution": "Wayve"}, {"id": 193422, "fullname": "Matthew Brown", "url": "http://cvpr.thecvf.com/api/miniconf/users/193422?format=json", "institution": "Wayve"}, {"id": 75671, "fullname": "Yasutaka Furukawa", "url": "http://cvpr.thecvf.com/api/miniconf/users/75671?format=json", "institution": "Simon Fraser University"}], "abstract": "This paper revisits camera pose estimation through the lens of self-supervised pretraining, focusing on inverse-dynamics pretraining as a scalable alternative to the current trend of fully supervised training with 3D annotations. Concretely, we employ inverse- and forward-dynamics models to learn latent action representations, similar to Genie from large-scale driving videos.Our idea is simple yet effective. Existing methods use latent actions in their original capacity, that is, as action conditioning of world-models or as proxies of robot action parameters in policy networks.Our method, dubbed LA-Pose, repurposes the latent action features as inputs to a camera pose estimator, finetuned on a limited set of high-quality 3D annotations.This formulation enables accurate and generalizable pose prediction while maintaining feed-forward efficiency. Extensive experiments on driving benchmarks show that LA-Pose achieves competitive and even superior performance to state-of-the-art methods while using orders of magnitude less labeled data. Concretely, on the Waymo and PandaSet benchmarks, LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods.To our knowledge, this work is the first to demonstrate the power of inverse-dynamics self-supervised learning for pose estimation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40064", "url": null, "sourceid": 41548, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40065, "uid": "7a45114aa2fd2789a144aac7d8ee89dd", "name": "AntiStyler: Defending Object Detection Models Against Adversarial Patch Attacks Using Style Removal", "authors": [{"id": 180787, "fullname": "Idan Yankelev", "url": "http://cvpr.thecvf.com/api/miniconf/users/180787?format=json", "institution": "Ben-Gurion University of the Negev"}, {"id": 193423, "fullname": "Edita Grolman", "url": "http://cvpr.thecvf.com/api/miniconf/users/193423?format=json", "institution": "Ben-Gurion University of the Negev"}, {"id": 193424, "fullname": "Yarin Levi", "url": "http://cvpr.thecvf.com/api/miniconf/users/193424?format=json", "institution": null}, {"id": 193425, "fullname": "Amit Giloni", "url": "http://cvpr.thecvf.com/api/miniconf/users/193425?format=json", "institution": "Fujitsu Research and Development Center Co. Ltm."}, {"id": 193426, "fullname": "Omer Hofman", "url": "http://cvpr.thecvf.com/api/miniconf/users/193426?format=json", "institution": "Fujitsu Research and Development Center Co. Ltm."}, {"id": 193427, "fullname": "Toshiya Shimizu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193427?format=json", "institution": "Fujitsu"}, {"id": 128368, "fullname": "Yuval Elovici", "url": "http://cvpr.thecvf.com/api/miniconf/users/128368?format=json", "institution": "Ben Gurion University of the Negev"}, {"id": 128334, "fullname": "Asaf Shabtai", "url": "http://cvpr.thecvf.com/api/miniconf/users/128334?format=json", "institution": "Ben-Gurion University of the Negev"}], "abstract": "Adversarial patch attacks pose a significant threat to the reliability of object detection (OD) models, particularly in real-time security applications. Although several defenses have been proposed, they often suffer from two limitations: 1) reduced performance on benign images, and 2) impractical processing time for real-time OD applications. In this paper, we present AntiStyler, a novel and rapid defense against adversarial patches. Given an input image, AntiStyler identifies and masks pixels that exhibit a ``random'' style associated with adversarial attacks and uses a series of spatial filters to enhance the mask and remove unwanted noise, efficiently masking adversarial patches. AntiStyler features model-, patch-, and attack-agnostic capabilities and does not require any training, making it a fully agnostic zero-shot defense against adversarial patch attacks. Our evaluation on the COCO, INRIA, Superstore, and APRICOT datasets, with both digital and physical attacks, demonstrates AntiStyler's state-of-the-art robustness (improving adversarial performance by 8-15 mAP%) without compromising the original performance on benign images. Additionally, unlike most existing defenses, AntiStyler can process 10-12 frames per second (FPS), making it efficient and relevant for real-time OD applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40065", "url": null, "sourceid": 36689, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40068, "uid": "65b15d7b1dbbaaecb9e2b33d681c5497", "name": "Fine-Tuning Impairs the Balancedness of Foundation Models in Long-tailed Personalized Federated Learning", "authors": [{"id": 145127, "fullname": "Shihao Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/145127?format=json", "institution": "Xiamen University"}, {"id": 180937, "fullname": "Chikai Shang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180937?format=json", "institution": "Xiamen University"}, {"id": 193436, "fullname": "Zhiheng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193436?format=json", "institution": "Xiamen University"}, {"id": 180950, "fullname": "jiacheng yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180950?format=json", "institution": "Xiamen University"}, {"id": 153841, "fullname": "Xinyi Shang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153841?format=json", "institution": "University College London, University of London"}, {"id": 184649, "fullname": "Junlong Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184649?format=json", "institution": "Xiamen University"}, {"id": 153842, "fullname": "Yiqun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153842?format=json", "institution": "Hong Kong Baptist University"}, {"id": 90817, "fullname": "Yang Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90817?format=json", "institution": "Xiamen University"}], "abstract": "Personalized federated learning (PFL) with foundation models has emerged as a promising paradigm enabling clients to adapt to heterogeneous data distributions. However, real-world scenarios often face the co-occurrence of non-IID data and long-tailed class distributions, presenting unique challenges that remain underexplored in PFL. In this paper, we investigate this long-tailed personalized federated learning and observe that current methods suffer from two limitations: (i) Fine-tuning degrades performance below zero-shot baselines due to the erosion of inherent class balance in foundation models; (ii) Conventional personalization techniques further transfer this bias to local models through parameter or feature-level fusion. To address these challenges, we propose Federated Learning via Gradient Purification and Residual Learning (FedPuReL), which preserves balanced knowledge in the global model while enabling unbiased personalization. Specifically, we purify local gradients using zero-shot predictions to maintain a class-balanced global model, and model personalization as residual corrections atop the frozen global model. Extensive experiments demonstrate that FedPuReL consistently outperforms state-of-the-art methods, achieving superior performance on both global and personalized models across diverse long-tailed scenarios. The code is available in the supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40068", "url": null, "sourceid": 35205, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40071, "uid": "06886417c92cf23925a22755585b1899", "name": "Cross-Domain Few-Shot Segmentation via Multi-view Progressive Adaptation", "authors": [{"id": 107262, "fullname": "Jiahao Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/107262?format=json", "institution": "Nanyang Technological University"}, {"id": 193441, "fullname": "Guanqiao Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193441?format=json", "institution": "Nanyang Technological University"}, {"id": 155418, "fullname": "Wenbin An", "url": "http://cvpr.thecvf.com/api/miniconf/users/155418?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 190414, "fullname": "Yap-Peng Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190414?format=json", "institution": "VinUniversity; Nanyang Technological University"}, {"id": 87266, "fullname": "Alex C. Kot", "url": "http://cvpr.thecvf.com/api/miniconf/users/87266?format=json", "institution": "Nanyang Technological University"}, {"id": 87301, "fullname": "Shijian Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87301?format=json", "institution": "Nanyang Technological University"}], "abstract": "Cross-Domain Few-Shot Segmentation aims to segment categories in data-scarce domains conditioned on a few exemplars. Typical methods first establish few-shot capability in a large-scale source domain and then adapt it to target domains. However, due to the limited quantity and diversity of target samples, existing methods still exhibit constrained performance. Moreover, the source-trained model's initially weak few-shot capability in target domains, coupled with substantial domain gaps, severely hinders the effective utilization of target samples and further impedes adaptation. To this end, we propose Multi-view Progressive Adaptation, which progressively adapts few-shot capability to target domains from both data and strategy perspectives. (i) From the data perspective, we introduce Hybrid Progressive Augmentation, which progressively generates more diverse and complex views through cumulative strong augmentations, thereby creating increasingly challenging learning scenarios. (ii) From the strategy perspective, we design Dual-chain Multi-view Prediction, which fully leverages these progressively complex views through sequential and parallel learning paths under extensive supervision. By jointly enforcing prediction consistency across diverse and complex views, MPA achieves both robust and accurate adaptation to target domains. Extensive experiments demonstrate that MPA effectively adapts few-shot capability to target domains, outperforming state-of-the-art methods by a large margin (+7.0\\%).", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40071", "url": null, "sourceid": 42740, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40075, "uid": "3fd28bdb83c87432e82dff65aedef74d", "name": "Pixel2Phys: Distilling Governing Laws from Visual Dynamics", "authors": [{"id": 193446, "fullname": "Ruikun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193446?format=json", "institution": "Tsinghua University"}, {"id": 193447, "fullname": "Jun Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193447?format=json", "institution": "University of Science and Technology of China"}, {"id": 193448, "fullname": "Yingfan Hua", "url": "http://cvpr.thecvf.com/api/miniconf/users/193448?format=json", "institution": "University of Science and Technology of China; Shanghai Artificial Intelligence Laboratory"}, {"id": 129389, "fullname": "SHIXIANG TANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/129389?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 129269, "fullname": "Biqing Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/129269?format=json", "institution": "Harbin Institute of Technology &amp; Tsinghua University &amp; Frontis.AI"}, {"id": 107620, "fullname": "Bin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107620?format=json", "institution": "University of Science and Technology of China"}, {"id": 151075, "fullname": "Wanli Ouyang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151075?format=json", "institution": "Shanghai AI Lab"}, {"id": 87596, "fullname": "Yan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87596?format=json", "institution": "Shanghai AI lab"}], "abstract": "Discovering physical laws directly from high-dimensional visual data is a long-standing human pursuit but remains a formidable challenge for machines, representing a fundamental goal of scientific intelligence. This task is inherently difficult because physical knowledge is low-dimensional and structured, whereas raw video observations are high-dimensional and redundant, with most pixels carrying little or no physical meaning. Extracting concise, physically relevant variables from such noisy data remains a key obstacle. To address this, we propose Pixel2Phys, a collaborative multi-agent framework adaptable to any Multimodal Large Language Model (MLLM). It emulates human scientific reasoning by employing a structured workflow to extract formalized physical knowledge through iterative hypothesis generation, validation, and refinement.By repeatedly formulating, and refining candidate equations on high-dimensional data, it identifies the most concise representations that best capture the underlying physical evolution. This automated exploration mimics the iterative workflow of human scientists, enabling AI to reveal interpretable governing equations directly from raw observations. Across diverse simulated and real-world physics videos, Pixel2Phys discovers accurate, interpretable governing equations and maintaining stable long-term extrapolation where baselines rapidly diverge.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40075", "url": null, "sourceid": 38797, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40073, "uid": "78775654c69f1cb73d97d96c1c65354e", "name": "Hyperbolic Defect Feature Synthesis for Few-Shot Defect Classification", "authors": [{"id": 131652, "fullname": "Huimin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/131652?format=json", "institution": "Beihang University"}, {"id": 193443, "fullname": "Boxuan Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193443?format=json", "institution": "Beihang University"}, {"id": 176854, "fullname": "Yulin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176854?format=json", "institution": "Beihang University"}, {"id": 129760, "fullname": "Xiuzhuang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/129760?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 131631, "fullname": "Junlin Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131631?format=json", "institution": "Beihang University"}], "abstract": "Defect synthesis, as a core technology for addressing the problem of few-shot defect classification, has been widely adopted in industrial scenarios. It helps alleviate the problem of insufficient model generalization capability owing to data scarcity by establishing a data augmentation pipeline. Recently, remarkable progress has been achieved in both explicit defect image generation and implicit defect feature synthesis approaches. However, the existing methods are always conducted in Euclidean space. Constrained by the flatness of Euclidean space, it is difficult to synthesize defect data containing complex structures. In this paper, we attempt to explore the defect generation in hyperbolic space and propose a hyperbolic defect feature synthesis (HypDFS) method. By modeling the potential defect distribution via a small number of hyperbolic defect prototypes and further optimizing the synthetic defect features with the hierarchical defect contrastive loss in hyperbolic space, our HypDFS method can obtain a better generalized defect representation that is more conducive to downstream few-shot defect classification task. Extensive experiments conducted on the MVTec-FS benchmark and standard MTD dataset under the few-shot settings demonstrate that the proposed HypDFS surpasses the Euclidean baseline by a large margin, showing the promising prospects for defect synthesis in hyperbolic space.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40073", "url": null, "sourceid": 45694, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40079, "uid": "7fc34eee4c21d2e8aacb9bb7774a27ea", "name": "ReAttnCLIP: Training-Free Open-Vocabulary Remote Sensing Image Segmentation via Re-defined Attention in CLIP", "authors": [{"id": 182123, "fullname": "Xin Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182123?format=json", "institution": "Renmin University of China"}, {"id": 193452, "fullname": "Manqi Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193452?format=json", "institution": null}, {"id": 131145, "fullname": "Dongsheng Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131145?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 193453, "fullname": "Yingying Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193453?format=json", "institution": "BeiJing China-Power Information Technology Co., Ltd,"}, {"id": 76389, "fullname": "Bing Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/76389?format=json", "institution": "Renmin University of China"}], "abstract": "Remote sensing image segmentation is critical for a range of applications, including natural disaster monitoring and precision agriculture. Open-vocabulary segmentation enhances flexibility by removing fixed category constraints, enabling more fine-grained and adaptive scene understanding. Unlike CLIP\u2019s original pretraining objective, which emphasizes global image-text alignment, segmentation tasks require accurate and discriminative patch-level representations to support precise pixel-wise predictions. As a result, the quality of attention maps\u2014particularly those generated in the final transformer layers\u2014plays a pivotal role in guiding inter-region interactions. However, current methods generate suboptimal representations when capturing the complex spatial hierarchies in remote sensing. We address this gap by optimizing CLIP's 197\u00d7197 attention matrix through three key modifications: (1) substituting the 196\u00d7196 patch-to-patch submatrix with intermediate-layer feature similarities to preserve spatial structures; (2) prioritizing intermediate-layer attention for global-to-local (class-to-patch) token alignment to reduce classification interference; (3) disabling the \\texttt{[CLS]} token's self-attention to mitigate bias. Extensive experiments on eight remote sensing benchmarks and two building/road extraction datasets demonstrate that our method achieves state-of-the-art performance among existing training-free approaches.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40079", "url": null, "sourceid": 35752, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40089, "uid": "878c57ac87b8629049a172596fb9a67d", "name": "PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction", "authors": [{"id": 128523, "fullname": "Xiang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128523?format=json", "institution": "University of California, San Diego"}, {"id": 193487, "fullname": "Sohyun Yoo", "url": "http://cvpr.thecvf.com/api/miniconf/users/193487?format=json", "institution": "University of California, San Diego"}, {"id": 193488, "fullname": "Hongrui Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193488?format=json", "institution": "Tongji University"}, {"id": 174066, "fullname": "Chuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/174066?format=json", "institution": "Lambda, Inc"}, {"id": 193489, "fullname": "Jianwen Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/193489?format=json", "institution": null}, {"id": 84901, "fullname": "Zhuowen Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84901?format=json", "institution": "University of California, San Diego"}], "abstract": "We introduce PixARMesh, the first method to autoregressively reconstruct complete 3D indoor scene meshes directly from a single RGB image. Unlike prior methods that rely on implicit signed distance fields and post-hoc layout optimization, PixARMesh jointly predicts object layout and geometry within a unified model, producing coherent and artist-ready meshes in a single forward pass. Building on recent advances in mesh generative modeling, we enrich a point-cloud encoder with pixel-aligned image features and global scene context via cross-attention, enabling accurate spatial reasoning from a single image. Scenes are generated autoregressively from a unified token stream of context, pose, and mesh tokens, yielding compact meshes with high-fidelity geometry. Experiments on synthetic and real-world datasets show that PixARMesh achieves state-of-the-art reconstruction quality while producing lightweight, high-quality meshes ready for downstream applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40089", "url": null, "sourceid": 46180, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40097, "uid": "db5cfc2cb39ac654a6a533f2f2611fb9", "name": "Incremental Object Detection via Future-Aware Decoupled Cross-Head Distillation", "authors": [{"id": 174326, "fullname": "Chenfeng Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/174326?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 88793, "fullname": "De Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/88793?format=json", "institution": "Xidian University"}, {"id": 144821, "fullname": "Wenlong Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/144821?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 193518, "fullname": "Mingyue Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193518?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 156650, "fullname": "Shizhou Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156650?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 86416, "fullname": "Nannan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86416?format=json", "institution": "Xidian University"}, {"id": 88813, "fullname": "Xinbo Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88813?format=json", "institution": "Xidian University"}], "abstract": "Incremental Object Detection (IOD) enables AI systems to continuously acquire new object classes while preserving knowledge of previously learned ones, an ability essential for deployment in dynamic, real-world environments. Existing IOD methods typically rely on knowledge distillation to mitigate catastrophic forgetting. However, the tight coupling between the student model\u2019s detection head and backbone causes distillation gradients to conflict with new-class supervision at the head, injecting head-specific bias into the backbone and ultimately weakening distillation effectiveness. To address this issue, we propose a decoupled training mechanism for the model\u2019s backbone and classification head. Specifically, we introduce the Future-aware Cross-head Distillation (FaCHD) method, which utilizes two frozen complementary teachers (historical and intermediate teachers) to decode the student\u2019s ROI features for cross-head distillation. This strategy implicitly alleviates prediction conflicts caused by detection-head bias and provides richer task-relevant guidance, thereby improving distillation efficiency. To further address the detection head bias and model recency problem, we propose a Prototype Semantic Drift Compensation module, which recalibrates multi-granularity prototypes of old classes, effectively correcting semantic drift and enhancing the stability of the detection head. Extensive experiments on two standard IOD benchmarks demonstrate the effectiveness and superiority of the proposed method. Code is available in the supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40097", "url": null, "sourceid": 35809, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40100, "uid": "6cb7f43fec13e471a347be105e7cbd08", "name": "D2FANet: Enhancing Video Object Detection with Dual-Domain Feature Aggregation Network", "authors": [{"id": 180753, "fullname": "Qiang Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/180753?format=json", "institution": "Qingdao University of Science and Technology"}, {"id": 193528, "fullname": "Wenqi Shang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193528?format=json", "institution": "QUST"}, {"id": 193529, "fullname": "Meifang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193529?format=json", "institution": "Qingdao University of Science and Technology"}, {"id": 185146, "fullname": "Xiao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185146?format=json", "institution": "Qingdao University of Science and Technology"}], "abstract": "Accurately capturing and aggregating spatiotemporal information has become crucial for video object detection. Previous methods mainly perform feature aggregation in the spatiotemporal domain, treating all regions indiscriminately and overlooking both their relative importance and the frequency characteristics that capture periodic motion patterns. This limits the capability of these methods to capture dynamic interactions and adapt to complex scene variations. In this paper, we propose a novel Dual-Domain Feature Aggregation Network (D2FANet) for video object detection, which, to the best of our knowledge, is the first work to introduce the frequency-domain feature aggregation into the video object detection task. By collaboratively modeling spatiotemporal and frequency information, our D2FANet enhances motion awareness and temporal consistency, thereby improving detection accuracy. First, we develop a frequency-domain feature aggregation module that decomposes frame features into high- and low-frequency distributions and reinforces object query representations through aggregating multi-scale frequency features. Second, we design a spatiotemporal-domain feature aggregation module that leverages an importance guidance mechanism to dynamically emphasize regions of different importance and reinforces object query representations via guiding the aggregation of spatiotemporal features. Experiments on the ImageNet VID and EPIC-KITCHENS datasets demonstrate that D2FANet achieves state-of-the-art performance. The code will be made available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40100", "url": null, "sourceid": 38467, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40105, "uid": "382eb7cfc227924eadb64ffc62bb0e58", "name": "EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation", "authors": [{"id": 154316, "fullname": "Abhishek Saroha", "url": "http://cvpr.thecvf.com/api/miniconf/users/154316?format=json", "institution": "Department of Informatics, Technische Universit\u00e4t M\u00fcnchen"}, {"id": 193545, "fullname": "Huajian Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193545?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 176743, "fullname": "Xingxing Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/176743?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 84985, "fullname": "Daniel Cremers", "url": "http://cvpr.thecvf.com/api/miniconf/users/84985?format=json", "institution": "Technical University Munich"}, {"id": 73527, "fullname": "Xi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73527?format=json", "institution": "ETHZ - ETH Zurich"}], "abstract": "Understanding and predicting object motion from egocentric video is fundamental to embodied perception and interaction. However, generating physically consistent 6DoF trajectories remains challenging due to occlusions, fast motion, and the lack of explicit physical reasoning in existing generative models. We present EgoFlow, a flow-matching framework that synthesizes realistic and physically plausible trajectories conditioned on multimodal egocentric observations. EgoFlow employs a hybrid Mamba\u2013Transformer\u2013Perceiver architecture to jointly model temporal dynamics, scene geometry, and semantic intent, while a gradient-guided inference process enforces differentiable physical constraints such as collision avoidance and motion smoothness. This combination yields coherent and controllable motion generation without post-hoc filtering or additional supervision. Experiments on HD-EPIC, EgoExo4D, and HOT3D show that EgoFlow outperforms diffusion-based and transformer baselines in accuracy, generalization, and physical realism, reducing collision rates by up to 79\\%, and strong generalization to unseen scenes. Our results highlight the promise of flow-based generative modeling for scalable and physically grounded egocentric motion understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40105", "url": null, "sourceid": 45066, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40110, "uid": "f0fe8624ed77b0f2c7c5a6e6826021cd", "name": "ReMoGen: Real-time Human Interaction-to-Reaction Generation via Modular Learning from Diverse Data", "authors": [{"id": 181174, "fullname": "Yaoqin Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/181174?format=json", "institution": "ShanghaiTech University"}, {"id": 107237, "fullname": "Yiteng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107237?format=json", "institution": "ShanghaiTech University"}, {"id": 193559, "fullname": "Qin Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/193559?format=json", "institution": null}, {"id": 89298, "fullname": "Xinge Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89298?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 131459, "fullname": "YUJING SUN", "url": "http://cvpr.thecvf.com/api/miniconf/users/131459?format=json", "institution": "the University of Hong Kong, University of Hong Kong"}, {"id": 132350, "fullname": "Yuexin Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/132350?format=json", "institution": "ShanghaiTech University"}], "abstract": "Human behaviors in real-world environments are inherently interactive, with an individual\u2019s motion shaped by surrounding agents and the scene. Such capabilities are essential for applications in virtual avatars, interactive animation, and human\u2013robot collaboration. We target real-time human interaction-to-reaction generation, which generates the ego\u2019s future motion from dynamic multi-source cues, including others\u2019 actions, scene geometry, and semantic inputs.This task is fundamentally challenging due to (i) limited and fragmented interaction data distributed across heterogeneous single-person, human\u2013human, and human\u2013scene domains, and (ii) the need to produce low-latency yet high-fidelity motion responses during continuous online interaction. To address these challenges, we propose ReMoGen (Reaction Motion Generation), a modular learning framework for real-time interaction-to-reaction generation. ReMoGen leverages a universal motion prior learned from large-scale single-person motion datasets and adapts it to target interaction domains through independently trained Meta-Interaction modules, enabling robust generalization under data-scarce and heterogeneous supervision. During online rollout, ReMoGen performs generation in short temporal segments and employs a lightweight Frame-wise Segment Refinement module that incorporates freshly observed interaction cues, achieving responsive and temporally coherent motion without heavy full-sequence inference. Extensive experiments across human\u2013human, human\u2013scene, and composite interaction settings demonstrate that ReMoGen delivers superior motion fidelity, responsiveness, and cross-domain generalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40110", "url": null, "sourceid": 35712, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40131, "uid": "6751bce20296374f82f386e054e0b9f7", "name": "VirtueBench: Evaluating Trustworthiness under Uncertainty in Long Video Understanding", "authors": [{"id": 183913, "fullname": "Xueqing Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183913?format=json", "institution": "Peking University"}, {"id": 181706, "fullname": "Bohan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181706?format=json", "institution": "ByteDance Inc."}, {"id": 193601, "fullname": "Yan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193601?format=json", "institution": "TikTok"}, {"id": 157173, "fullname": "Zhenheng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157173?format=json", "institution": "Tiktok"}], "abstract": "Recent Vision-Language Models (VLMs) have made remarkable progress in multimodal understanding tasks, yet their evaluation on long video understanding remains unreliable. Due to limited frame inputs, key frames necessary for answering the question may be missing from the model\u2019s input. However, models that truthfully refuse to answer under such uncertainty are marked as incorrect, while those that guess may coincidentally produce the correct answer and thus obtain deceptively higher accuracy, leading to misleading evaluation results and encouraging models to guess rather than respond honestly.To address this issue, we introduce VirtueBench, a benchmark explicitly designed to assess model trustworthiness under uncertainty. VirtueBench constructs multiple frame-sampling levels for each video and provides ground truths that distinguish between answerable and unanswerable cases. Evaluations on 25 open-source and commercial VLMs reveal distinct refusal behaviors across different model families, with refusal accuracy ranging from over 70\\% in the best models to nearly 0\\% in the worst. Moreover, most models exhibit a substantial drop in refusal when the prompt does not explicitly require them to do so. These findings highlight the need for developing trustworthy VLMs for multimodal understanding, guided by benchmarks and leaderboards that emphasize reliability and trustworthiness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40131", "url": null, "sourceid": 46304, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40115, "uid": "1f06d3acecd32c5394bf8ba9911d66ea", "name": "VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding", "authors": [{"id": 180426, "fullname": "Yufei Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/180426?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 180390, "fullname": "Qianke Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180390?format=json", "institution": "Hangzhou Dianzi University, China"}, {"id": 88165, "fullname": "Minghao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88165?format=json", "institution": "Zhejiang University"}, {"id": 104052, "fullname": "Jiajun Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/104052?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 77176, "fullname": "Zhenwei Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/77176?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 76382, "fullname": "Zhou Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76382?format=json", "institution": "Hangzhou Dianzi University"}], "abstract": "Long-form video understanding remains challenging due to the extended temporal structure and dense multimodal cues. Despite recent progress, many existing approaches still rely on hand-crafted reasoning pipelines or employ token-consuming video preprocessing to guide MLLMs in autonomous reasoning. To overcome these limitations, we introduce VideoARM, an agentic reasoning-over-hierarchical-memory paradigm for long-form video understanding. Instead of static, exhaustive preprocessing, VideoARM performs adaptive, on-the-fly agentic reasoning and memory construction. Specifically, VideoARM performs an adaptive and continuous loop of observing, thinking, acting, and memorizing, where a Controller autonomously invokes tools to interpret the video in a coarse-to-fine manner, thereby substantially reducing token consumption. In parallel, a hierarchical multimodal memory continuously captures and updates multi-level clues throughout the operation of the agent, providing precise contextual information to support the Controller in decision-making. Experiments on prevalent benchmarks demonstrate that VideoARM outperforms the state-of-the-art method, DVD, while significantly reducing token consumption for long-form videos.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40115", "url": null, "sourceid": 32550, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40116, "uid": "51ac520c783d88964a793e455dae3506", "name": "Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers", "authors": [{"id": 189726, "fullname": "Xinyu Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189726?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 129580, "fullname": "Han Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129580?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 156763, "fullname": "Yuyang Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156763?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 155993, "fullname": "Ziyang Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/155993?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 76656, "fullname": "Yaoming Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76656?format=json", "institution": "Meituan"}, {"id": 189727, "fullname": "Xin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189727?format=json", "institution": "Meituan"}, {"id": 89859, "fullname": "Wenrui Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/89859?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 89835, "fullname": "Chenglin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/89835?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 90132, "fullname": "Junni Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/90132?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 76584, "fullname": "Hongkai Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76584?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Existing video frame interpolation (VFI) methods often adopt a frame-centric approach, processing videos as independent short segments (e.g., triplets), which leads to temporal inconsistencies and motion artifacts. To overcome this, we propose a holistic, video-centric paradigm named \\textbf{L}ocal \\textbf{D}iffusion \\textbf{F}orcing for \\textbf{V}ideo \\textbf{F}rame \\textbf{I}nterpolation (LDF-VFI). Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence. To mitigate error accumulation inherent in auto-regressive generation, we introduce a novel skip-concatenate sampling strategy that effectively maintains temporal stability. Furthermore, LDF-VFI incorporates sparse, local attention and tiled VAE encoding, a combination that not only enables efficient processing of long sequences but also allows generalization to arbitrary spatial resolutions (e.g., 4K) at inference without retraining. An enhanced conditional VAE decoder, which leverages multi-scale features from the input video, further improves reconstruction fidelity. Empirically, LDF-VFI achieves state-of-the-art performance on challenging long-sequence benchmarks, demonstrating superior per-frame quality and temporal consistency, especially in scenes with large motion.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40116", "url": null, "sourceid": 40021, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40122, "uid": "fc7b1dd1da46208a27e47035cbdab149", "name": "ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos", "authors": [{"id": 183776, "fullname": "Peijun Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/183776?format=json", "institution": "Nanyang Technological University"}, {"id": 193579, "fullname": "Luo Anwei", "url": "http://cvpr.thecvf.com/api/miniconf/users/193579?format=json", "institution": "Nanyang Technological University"}, {"id": 87277, "fullname": "Gang Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87277?format=json", "institution": "Zhejiang University"}, {"id": 87266, "fullname": "Alex C. Kot", "url": "http://cvpr.thecvf.com/api/miniconf/users/87266?format=json", "institution": "Nanyang Technological University"}, {"id": 87664, "fullname": "Xudong Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87664?format=json", "institution": "Nanyang Technological University"}], "abstract": "Temporal forgery localization aims to temporally identify manipulated segments in untrimmed videos. Most existing benchmarks focus on appearance-level forgeries, such as face swapping and object removal. However, recent advances in video generation have driven the emergence of activity-level forgeries that modify human actions to distort event semantics, resulting in highly deceptive forgeries that critically undermine media authenticity and public trust. To address this issue, we introduce ActivityForensics, the first large-scale benchmark for localizing manipulated activity in untrimmed videos. It contains over 6K forgery video segments that are seamlessly blended into the video context, rendering high visual consistency that makes them almost indistinguishable from authentic content to the human eye. We further propose Temporal Artifact Diffuser (TADiff), a simple yet effective baseline that enhances artifact cues through a diffusion-based feature regularizer. Based on ActivityForensics, we introduce comprehensive evaluation protocols covering intra-domain, cross-domain, and open-world settings, and benchmark a wide range of state-of-the-art forgery localizers to facilitate future research. The dataset, code, and pretrained models will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40122", "url": null, "sourceid": 35876, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40123, "uid": "7fbb006d37d666ab411008bb1f454f05", "name": "OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting", "authors": [{"id": 152874, "fullname": "Hongjia Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/152874?format=json", "institution": "Zhejiang University"}, {"id": 85018, "fullname": "Qi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85018?format=json", "institution": "Tencent AI Lab"}, {"id": 152876, "fullname": "Xiaokun Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/152876?format=json", "institution": "Zhejiang University"}, {"id": 104283, "fullname": "Xiyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/104283?format=json", "institution": "Zhejiang University"}, {"id": 187938, "fullname": "Yitong Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187938?format=json", "institution": "Zhejiang University"}, {"id": 89659, "fullname": "Huaqi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89659?format=json", "institution": "Hangzhou VIVO Information Technology Co., Ltd"}, {"id": 88296, "fullname": "Dan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88296?format=json", "institution": "CSE, HKUST"}, {"id": 84995, "fullname": "Guofeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84995?format=json", "institution": "Zhejiang University"}], "abstract": "Open-vocabulary scene understanding with online panoptic mapping is essential for embodied applications to perceive and interact with environments. However, existing methods are predominantly offline or lack instance-level understanding, limiting their applicability to real-world robotic tasks. In this paper, we propose OnlinePG, a novel and effective system that integrates geometric reconstruction and open-vocabulary perception using 3D Gaussian Splatting in an online setting. Technically, to achieve online panoptic mapping, we employ an efficient local-to-global paradigm with a sliding window. To build local consistency map, we construct a 3D segment clustering graph that jointly leverages geometric and semantic cues, fusing inconsistent segments within sliding window into complete instances. Subsequently, to update the global map, we construct explicit spatial attribute grids for the local 3D Gaussian map and fuse them into the global map via robust bidirectional bipartite 3D Gaussian instance matching. Finally, we utilize the fused VLM features inside the 3D spatial attribute grids to achieve open-vocabulary scene understanding. Extensive experiments on widely used datasets demonstrate that our method achieves better performance among online approaches, while maintaining real-time efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40123", "url": null, "sourceid": 39068, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40125, "uid": "f16805d9fc37f021a682d86ab3013bb7", "name": "MotionMaster: Generalizable Text-Driven Motion Generation and Editing", "authors": [{"id": 128553, "fullname": "Nan Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128553?format=json", "institution": "Peking University"}, {"id": 193587, "fullname": "yunhao li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193587?format=json", "institution": "Peking University"}, {"id": 193588, "fullname": "Lexi Pang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193588?format=json", "institution": "Peking University"}, {"id": 193589, "fullname": "Zimo He", "url": "http://cvpr.thecvf.com/api/miniconf/users/193589?format=json", "institution": "Peking University"}, {"id": 75767, "fullname": "Siyuan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75767?format=json", "institution": "Beijing Institute of General Artificial Intelligence"}, {"id": 88123, "fullname": "Yixin Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88123?format=json", "institution": "Peking University"}], "abstract": "Text-driven human motion generation struggles with complex multi-action sequences and precise editing tasks due to limited training data diversity, inadequate motion representations, and fragmented generation pipelines. We present MotionMaster, a framework that addresses these challenges. First, we introduce MotionGB, a 10,000-hour motion dataset created from 400 hours of manually verified motion capture data, enriched with multi-level descriptions, then expanded through spatial-temporal editing while maintaining precise motion-text correspondence. Second, we develop a motion representation method that encodes local frame-wise features into discrete tokens while employing sequence-level reconstruction to preserve global trajectory coherence. Third, we finetune the pre-trained multimodal LLM with motion and language tokens in a shared embedding space, enabling end-to-end understanding of motion semantics. We propose a technique to address unbalanced motion semantics in the dataset. Evaluated using a Gemini-based scorer validated against human judgments, MotionMaster demonstrates strong generalization: it achieves state-of-the-art zero-shot motion generation ability, demonstrating a 41.6% relative improvement over baselines in semantic consistency for long multi-action sequences and a 20.8% relative improvement in coordinating complex body part specifications for spatial composition tasks. These results represent a strong generalization across language and motion modalities.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40125", "url": null, "sourceid": 31109, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40130, "uid": "3505514e9f9ba3724fc51cb3278e0e67", "name": "Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress", "authors": [{"id": 180299, "fullname": "Yuelin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180299?format=json", "institution": "Renmin University of China"}, {"id": 103271, "fullname": "Sijie Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/103271?format=json", "institution": "Tsinghua University"}, {"id": 193599, "fullname": "Chen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193599?format=json", "institution": "Renmin University of China"}, {"id": 185687, "fullname": "Zongzhao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185687?format=json", "institution": "Renmin University of China"}, {"id": 193600, "fullname": "Yuxin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193600?format=json", "institution": "Renmin University of China"}, {"id": 128402, "fullname": "Yang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128402?format=json", "institution": "Tsinghua University"}, {"id": 88061, "fullname": "Wenbing Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88061?format=json", "institution": "Renmin University of China"}], "abstract": "Accurately estimating task progress is critical for embodied agents to plan and execute long-horizon, multi-step tasks. Despite promising advances, existing Vision-Language Models (VLMs) based methods primarily leverage their video understanding capabilities, while neglecting their complex reasoning potential. Furthermore, processing long video trajectories with VLMs is computationally prohibitive for real-world deployment. To address these challenges, we propose the Recurrent Reasoning Vision-Language Model ($\\text{R}^2$VLM). Our model features a recurrent reasoning framework that processes local video snippets iteratively, maintaining a global context through an evolving Chain of Thought (CoT). This CoT explicitly records task decomposition, key steps, and their completion status, enabling the model to reason about complex temporal dependencies. This design avoids the high cost of processing long videos while preserving essential reasoning capabilities. We train $\\text{R}^2$VLM on large-scale, automatically generated datasets from ALFRED and Ego4D, enhanced with advanced post-training techniques. Extensive experiments on progress estimation and downstream applications\u2014including progress-enhanced policy learning, reward modeling for reinforcement learning, and proactive assistance\u2014demonstrate that $\\text{R}^2$VLM achieves strong performance and generalization, achieving a new state-of-the-art in long-horizon task progress estimation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40130", "url": null, "sourceid": 36979, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40133, "uid": "c9da48779f4865f0d44f00393c293413", "name": "IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment", "authors": [{"id": 183518, "fullname": "Simone Magistri", "url": "http://cvpr.thecvf.com/api/miniconf/users/183518?format=json", "institution": "University of Florence"}, {"id": 107405, "fullname": "Dipam Goswami", "url": "http://cvpr.thecvf.com/api/miniconf/users/107405?format=json", "institution": "Computer Vision Center"}, {"id": 106277, "fullname": "Marco Mistretta", "url": "http://cvpr.thecvf.com/api/miniconf/users/106277?format=json", "institution": "Universit\u00e0 degli Studi di Firenze"}, {"id": 127055, "fullname": "Bart\u0142omiej Twardowski", "url": "http://cvpr.thecvf.com/api/miniconf/users/127055?format=json", "institution": "Computer Vision Center / IDEAS NCBR"}, {"id": 90564, "fullname": "Joost van de Weijer", "url": "http://cvpr.thecvf.com/api/miniconf/users/90564?format=json", "institution": "Computer Vision Center Barcelona"}, {"id": 126613, "fullname": "Andrew Bagdanov", "url": "http://cvpr.thecvf.com/api/miniconf/users/126613?format=json", "institution": "Universit\u00e0 degli Studi di Firenze"}], "abstract": "Vision-Language Models like CLIP are extensively used for inter-modal tasks which involve both visual and text modalities. However, when the individual modality encoders are applied to inherently intra-modal tasks like image-to-image retrieval, their performance suffers from the intra-modal misalignment. In this paper we study intra-modal misalignment in CLIP with a focus on the role of the projectors that map pre-projection image and text embeddings into the shared embedding space. By analyzing the form of the cosine similarity applied to projected features, and its interaction with the contrastive CLIP loss, we show that there is an inter-modal operator responsible for aligning the two modalities during training, and a second, intra-modal operator that only enforces intra-modal normalization but does nothing to promote intra-modal alignment. Via spectral analysis of the inter-modal operator, we identify an approximately isotropic subspace in which the two modalities are well-aligned, as well as anisotropic directions specific to each modality. We demonstrate that this aligned subspace can be directly obtained from the projector weights and that removing the anisotropic directions improves intra-modal alignment. Our experiments on intra-modal retrieval and classification benchmarks show that our training-free method reduces intra-modal misalignment, greatly lowers latency, and outperforms existing approaches across multiple pre-trained CLIP models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40133", "url": "https://github.com/simomagi/IsoCLIP", "sourceid": 46145, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40138, "uid": "e16b7c6134db5e15a521ec10fd405ddf", "name": "MimiCAT: Mimic with Correspondence-Aware Cascade-Transformer for Category-Free 3D Pose Transfer", "authors": [{"id": 90191, "fullname": "Zenghao Chai", "url": "http://cvpr.thecvf.com/api/miniconf/users/90191?format=json", "institution": "National University of Singapore"}, {"id": 151909, "fullname": "Chen Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151909?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 180752, "fullname": "Yongkang Wong", "url": "http://cvpr.thecvf.com/api/miniconf/users/180752?format=json", "institution": "National University of Singapore"}, {"id": 107391, "fullname": "Xulei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107391?format=json", "institution": "Institute for Infocomm Research (I2R), A*STAR"}, {"id": 193609, "fullname": "Mohan Kankanhalli", "url": "http://cvpr.thecvf.com/api/miniconf/users/193609?format=json", "institution": "National University of Singapore"}], "abstract": "3D pose transfer aims to transfer the pose-style of a source mesh to a target character while preserving both the target's geometry and the source's pose characteristic. Existing methods are largely restricted to characters with similar structures and fail to generalize to category-free settings (e.g., transferring a humanoid's pose to a quadruped). The key challenge lies in the structural and transformation diversity inherent in distinct character types, which often leads to mismatched regions and poor transfer quality. To address these issues, we first construct a million-scale pose dataset across hundreds of distinct characters. We further propose MimiCAT, a cascade-transformer model designed for category-free 3D pose transfer. Instead of relying on strict one-to-one correspondence mappings, MimiCAT leverages semantic keypoint labels to learn a novel soft correspondence that enables flexible many-to-many matching across characters. The pose transfer is then formulated as a conditional generation process, in which the source transformations are first projected onto the target through soft correspondence matching and subsequently refined using shape-conditioned representations. Extensive qualitative and quantitative experiments demonstrate that MimiCAT transfers plausible poses across different characters, significantly outperforming prior methods that are limited to narrow category transfer (e.g., humanoid-to-humanoid).", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40138", "url": null, "sourceid": 43237, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40139, "uid": "941cd9fc8182892c91ed6ea4c33909d0", "name": "Gau-Occ: Geometry-Completed Gaussians for Multi-Modal 3D Occupancy Prediction", "authors": [{"id": 185765, "fullname": "Chengxin Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/185765?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 180730, "fullname": "Yihui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180730?format=json", "institution": "Beihang University"}, {"id": 76464, "fullname": "Hongyu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76464?format=json", "institution": "Beihang University"}, {"id": 89528, "fullname": "Yunhong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89528?format=json", "institution": "Beihang University"}], "abstract": "3D semantic occupancy prediction is crucial for autonomous driving, yet vision-only approaches suffer from weak geometric cues, and existing multi-modal frameworks often depend on dense voxel or BEV tensors that impose heavy computational cost. We present  **Gau-Occ**, a multi-modal framework that models the scene as a compact collection of semantic 3D Gaussians, enabling geometry-guided fusion without dense volumetric processing.To enhance geometric completeness, a learned **LiDAR Completion Diffuser (LCD)** trained on real-world priors recovers missing structures from sparse LiDAR, and the completed points are encoded as semantic Gaussian anchors.To further integrate multi-view image semantics, we introduce  **Gaussian Anchor Fusion (GAF)**, a geometry-aligned aggregation module that performs anchor-guided 2D sampling, local neighborhood encoding, and cross-modal alignment. By constructing locally aggregated Gaussian descriptors that capture spatial consistency and semantic discriminability, GAF facilitates accurate feature association across modalities.Through anchor-driven refinement of Gaussian attributes, Occ-GS supports detailed 3D occupancy prediction. Extensive experiments across challenging benchmarks demonstrate that Occ-GS achieves state-of-the-art performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40139", "url": null, "sourceid": 46628, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40144, "uid": "67a9b6e59ac1105da3d7785693e2028d", "name": "Illuminating Visual Identity in Universal Multimodal Embeddings", "authors": [{"id": 159379, "fullname": "Jiawei Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/159379?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 193617, "fullname": "Junyi Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193617?format=json", "institution": null}, {"id": 193618, "fullname": "Jiashen Hua", "url": "http://cvpr.thecvf.com/api/miniconf/users/193618?format=json", "institution": null}, {"id": 193619, "fullname": "Ziheng Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193619?format=json", "institution": "Alibaba Group"}, {"id": 154152, "fullname": "Bing Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/154152?format=json", "institution": "Alibaba Group"}, {"id": 149095, "fullname": "Kaijie Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/149095?format=json", "institution": null}, {"id": 91414, "fullname": "Chaochen Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/91414?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 88210, "fullname": "Jieping Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/88210?format=json", "institution": "Alibaba Group"}], "abstract": "Universal Multimodal Embeddings (UMEs) aim to unify various modalities and tasks into a shared representation space. In recent years, this field has witnessed substantial progress driven by the development of Multimodal Large Language Models (MLLMs). However, a crucial capability, visual identity discrimination, remains underexplored in existing UME methods, despite its critical role in a wide range of tasks, including instance retrieval, re-identification, and identity preservation in AI-generated content (AIGC).To bridge this gap, we propose a unified formulation for visual identity discrimination and introduce $\\textbf{MIEB}$ ($\\textbf{M}$ultimodal Visual $\\textbf{I}$dentity $\\textbf{E}$mbedding $\\textbf{B}$enchmark), a large-scale benchmark curated from both real-world and synthetic datasets to support evaluation and training.Furthermore, we present a simple yet effective learning framework that jointly optimizes general multimodal and visual identity representations through a carefully designed identity-aware sampling mechanism.Extensive experiments demonstrate that our approach successfully endows UMEs with strong identity discrimination capability and maintains competitive general multimodal performance.We believe this work not only illuminates a critical yet neglected capability, but also takes a step toward more holistic universal multimodal embeddings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40144", "url": null, "sourceid": 37468, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40145, "uid": "b0c7f60cbb73155ae6bf42fce5422dfc", "name": "SMRABooth: Subject and Motion Representation Alignment for Customized Video Generation", "authors": [{"id": 179998, "fullname": "Xuancheng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/179998?format=json", "institution": "Nanjing University of Posts and Telecommunications"}, {"id": 175882, "fullname": "Li Yaning", "url": "http://cvpr.thecvf.com/api/miniconf/users/175882?format=json", "institution": "Nanjing University of Posts and Telecommunications"}, {"id": 76798, "fullname": "Sisi You", "url": "http://cvpr.thecvf.com/api/miniconf/users/76798?format=json", "institution": "Hefei University of Technology"}, {"id": 76806, "fullname": "Bing-Kun Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/76806?format=json", "institution": "Hefei University of Technology"}], "abstract": "Customized video generation aims to produce videos that faithfully preserve the subject's appearance from reference images while maintaining temporally consistent motion from reference videos. Existing methods struggle to ensure both subject appearance similarity and motion pattern consistency due to the lack of object-level guidance for subject and motion. To address this, we propose SMRABooth, which leverages the self-supervised encoder and optical flow encoder to provide object-level subject appearance and motion representations. These representations are aligned with the model during the LoRA fine-tuning process. Our approach is structured in three core stages: (1) We exploit subject representations via a self-supervised encoder to guide subject alignment, enabling the model to capture overall structure of subject and enhance high-level semantic consistency. (2) We utilize motion representations from an optical flow encoder to capture structurally coherent and object-level motion trajectories independent of appearance. (3) We propose a subject-motion association decoupling strategy that applies sparse LoRAs injection across both locations and timing, effectively reducing interference between subject and motion LoRAs. Extensive experiments show that SMRABooth excels in subject and motion customization, maintaining consistent subject appearance and motion patterns, proving its effectiveness in controllable text-to-video generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40145", "url": null, "sourceid": 38398, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40151, "uid": "af93c9db0e2bfa76554dfbdde8426e81", "name": "DecoVLN: Decoupling Observation, Reasoning, and Correction for Vision-and-Language Navigation", "authors": [{"id": 180043, "fullname": "zihao xin", "url": "http://cvpr.thecvf.com/api/miniconf/users/180043?format=json", "institution": "NUAA"}, {"id": 90550, "fullname": "Wentong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/90550?format=json", "institution": "College of Computer Science and Technology, Zhejiang University"}, {"id": 193646, "fullname": "Yixuan Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193646?format=json", "institution": null}, {"id": 193647, "fullname": "Bin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193647?format=json", "institution": "Shandong University; Shandong University of Technology"}, {"id": 88498, "fullname": "Runmin Cong", "url": "http://cvpr.thecvf.com/api/miniconf/users/88498?format=json", "institution": "Shandong University"}, {"id": 85792, "fullname": "Jie Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/85792?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 156156, "fullname": "Sheng-Jun Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156156?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}], "abstract": "Vision-and-Language Navigation (VLN)  requires agents to follow long-horizon instructions and navigate complex 3D environments. However, existing approaches face two major challenges: constructing an effective long-term memory bank and overcoming the compounding errors problem.  To address these issues, we propose DecoVLN, an effective framework designed for robust streaming perception and closed-loop control in long-horizon navigation.  First, we formulate long-term memory construction as an optimization problem and introduce adaptive refinement mechanism that selects frames from a historical candidate pool by iteratively optimizing a unified scoring function. This function jointly balances three key criteria: semantic relevance to the instruction, visual diversity from the selected memory, and temporal coverage of the historical trajectory. Second, to alleviate compounding errors, we introduce a state-action pair-level corrective  finetuning strategy. By leveraging geodesic distance between states to precisely quantify deviation from the expert trajectory, the agent selectively collects high-quality state-action pairs in the trusted region while filtering out the polluted data with low relevance. This improves both the efficiency and stability of error correction. Extensive experiments demonstrate the effectiveness of our DecoVLN, and we have depolyed it in real-world environments.. Codes and models will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40151", "url": null, "sourceid": 36332, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40154, "uid": "1a6203d931bf37e0476a02bea3effe97", "name": "One-Shot Flow, Any-Time Frame: A Bidirectional Warping Framework for Event-Based Video Frame Interpolation", "authors": [{"id": 173963, "fullname": "Linghui Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/173963?format=json", "institution": "Beijing University of Technology"}, {"id": 191852, "fullname": "Yuhan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191852?format=json", "institution": "Xiamen University"}, {"id": 131016, "fullname": "Hao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/131016?format=json", "institution": "Southeast University"}, {"id": 131032, "fullname": "Zhen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131032?format=json", "institution": "Beijing University of Technology"}, {"id": 87377, "fullname": "Yongjian Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87377?format=json", "institution": "Beijing University of Technology"}], "abstract": "Video Frame Interpolation (VFI) is a crucial task in video processing. Flow-based methods, despite their success, are constrained by a fundamental dilemma: forward warping is efficient but prone to artifacts, while backward warping yields higher quality at a significant computational cost, especially for multi-frame interpolation. This trade-off is a major bottleneck. To overcome this, we introduce ``One-Shot Flow, Any-Time Frame,\" a novel framework for Event-based VFI (E-VFI) that achieves both high efficiency and superior quality for arbitrary-time interpolation. Our framework uniquely computes a comprehensive motion trajectory representation in a single pass using a Bidirectional Flow Estimation Block (BiFEB), leveraging the high temporal resolution of event data. Subsequently, our Flow Query (FQ) module can instantly retrieve the bidirectional optical flow for any timestamp, enabling the generation of any number of frames without repeated computation. Finally, a novel Bidirectional Warping (BiW) mechanism intelligently fuses the strengths of both warping directions, effectively mitigating artifacts and producing high-fidelity results. Extensive experiments show that our approach consistently surpasses state-of-the-art E-VFI methods in both reconstruction quality and inference efficiency, representing a substantial advance in efficient and high-quality event-based video interpolation. *The code will be released after acceptance.*", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40154", "url": null, "sourceid": 46099, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40155, "uid": "73ea78729d717e6a435948b8912a67cf", "name": "Beyond Single-View Sufficiency: CVBench for Cross-View Human Understanding", "authors": [{"id": 148617, "fullname": "Tianchen Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/148617?format=json", "institution": "The University of Queensland"}, {"id": 87497, "fullname": "Chen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87497?format=json", "institution": "The University of Queensland"}, {"id": 183016, "fullname": "Xin Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183016?format=json", "institution": "Adelaide University"}], "abstract": "Human perception of social environments is inherently a multi-view synthesis problem, requiring the integration of complementary and often occluded information across space and time. However, existing benchmarks for Multimodal Large Language Models (MLLMs) are overwhelmingly predicated on a \"sufficient-view\" assumption, rewarding single-view pattern recognition while failing to evaluate cross-view fusion. To address this critical gap, we introduce \\textbf{CVBench}, a large-scale, multi-task benchmark for cross-view human understanding. CVBench comprises 3,000 challenging questions across 12 spatial and temporal tasks, where every item is designed with \\textit{verifiable single-view insufficiency}, mandating that models synthesize disparate evidence to resolve ambiguities. Our comprehensive evaluation of state-of-the-art open and closed-source MLLMs (from InternVL to Gemini 2.5 Pro) reveals a substantial performance gap, with the best models (e.g., Gemini 2.5 Pro, $\\sim$42\\% spatial accuracy) falling nearly 50 points behind human performance ($\\sim$94\\%). We identify a systemic failure mechanism across all models: a dominant \"Single-View Bias,\" whereby models ignore conflicting evidence and default to the most confident but incorrect single-view prediction. This demonstrates that current MLLMs lack the fundamental mechanisms for geometric grounding, identity persistence, and true spatio-temporal fusion. CVBench provides a rigorous diagnostic framework to catalyze the development of next-generation, cross-view\u2013aware architectures.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40155", "url": null, "sourceid": 33575, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40180, "uid": "356ce221cc12dab65cbc2bd0723bb798", "name": "Flow Matching for Multimodal Distributions", "authors": [{"id": 180493, "fullname": "Gaoxiang Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/180493?format=json", "institution": "University of Minnesota Twin Cities"}, {"id": 180332, "fullname": "Frank Cole", "url": "http://cvpr.thecvf.com/api/miniconf/users/180332?format=json", "institution": "University of Minnesota Twin Cities"}, {"id": 177133, "fullname": "Sihang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177133?format=json", "institution": "University of Minnesota"}, {"id": 193732, "fullname": "Yuxiang Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193732?format=json", "institution": "University of Minnesota - Twin Cities"}, {"id": 193733, "fullname": "Yulong Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193733?format=json", "institution": "University of Minnesota - Twin Cities"}, {"id": 85282, "fullname": "Ju Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/85282?format=json", "institution": "University of Minnesota, Twin Cities"}], "abstract": "Visual foundation models play an increasingly important role in the training efficiency of flow-based models by inducing structured latent space through alignments, distillations, adapters, and even replacements of visual encoders. When structured latent space improves training efficiency by lowering the complexity of the target (latent) distribution, the efficiency can be further boosted by a data-adaptive multimodal source (noise) distribution that globally shortens the distance to the target (latent) distribution, and a mode-dependent coupling between source and target samples to move probability mass locally. To this end, we propose an efficient source and coupling co-design algorithm termed Mixture-Modeling Flow Matching (MM-FM). Under linear conditional flow objective and multimodal target assumption, our theoretical results reveal straighter and shorter sampling trajectories and smaller Lipschitz constant for learning complexity relative to isotropic Gaussian with independent coupling. In our ImageNet256x256 experiments with multimodal DINOv2-B latents, we observe superior convergence and state-of-the-art unconditional generation FID=2.74 with autoguidance in only 80 epochs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40180", "url": null, "sourceid": 39251, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40159, "uid": "99fbae8818b0e4199eddab733d8e1b15", "name": "RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations", "authors": [{"id": 90381, "fullname": "Mochu Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90381?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 88785, "fullname": "Zhelun Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88785?format=json", "institution": "Peking University"}, {"id": 102621, "fullname": "Xuesong li", "url": "http://cvpr.thecvf.com/api/miniconf/users/102621?format=json", "institution": "Australian National University"}, {"id": 193667, "fullname": "Jiahui Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/193667?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 130439, "fullname": "Jing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130439?format=json", "institution": "Australian National University"}, {"id": 190854, "fullname": "Chen Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190854?format=json", "institution": "Baidu"}, {"id": 190855, "fullname": "Shanshan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190855?format=json", "institution": null}, {"id": 90074, "fullname": "Haocheng Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90074?format=json", "institution": "Baidu"}, {"id": 126338, "fullname": "Jingdong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126338?format=json", "institution": "Baidu"}, {"id": 87079, "fullname": "Yuchao Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/87079?format=json", "institution": "Northwestern Polytechnical University"}], "abstract": "Human perceive the 3D world through 2D observations from limited viewpoints. While recent feed-forward generalizable 3D reconstruction models excel at recovering 3D structures from sparse images, their representations are often confined to observed regions, leaving unseen geometry un-modeled. This raises a key, fundamental challenge: Can we infer a complete 3D structure from partial 2D observations? We present RnG (Reconstruction and Generation), a novel feed-forward Transformer that unifies these two tasks by predicting an implicit, complete 3D representation. At the core of RnG, we propose a reconstruction-guided causal attention mechanism that separates reconstruction and generation at the attention level, and treats the KV-cache as an implicit 3D representation. Then, arbitrary poses can efficiently query this cache to render high-fidelity, novel-view RGBD outputs. As a result, RnG not only accurately reconstructs visible geometry but also generates plausible, coherent unseen geometry and appearance. Our method achieves state-of-the-art performance in both generalizable 3D reconstruction and novel view generation, while operating efficiently enough for real-time interactive applications.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40159", "url": null, "sourceid": 42229, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40161, "uid": "6623f60e12e3d40244a0a59f4f765695", "name": "Learning to Act Robustly with View-Invariant Latent Actions", "authors": [{"id": 182173, "fullname": "Youngjoon Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/182173?format=json", "institution": "Seoul National University"}, {"id": 193672, "fullname": "Junha Chun", "url": "http://cvpr.thecvf.com/api/miniconf/users/193672?format=json", "institution": "Seoul National University"}, {"id": 85394, "fullname": "Taesup Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/85394?format=json", "institution": "Seoul National University"}], "abstract": "Vision-based robotic policies often struggle with even minor viewpoint changes, underscoring the need for view-invariant visual representations. This challenge becomes more pronounced in real-world settings, where viewpoint variability is unavoidable and can significantly disrupt policy performance.Existing methods typically learn invariance from multi-view observations at the scene level, but such approaches rely on visual appearance and fail to incorporate the physical dynamics essential for robust generalization.We propose View-Invariant Latent Action (VILA), which models a latent action capturing transition patterns across trajectories to learn view-invariant representations grounded in physical dynamics. VILA aligns these latent actions across viewpoints using an action-guided objective based on ground-truth action sequences.Experiments in both simulation and the real world show that VILA-based policies generalize effectively to unseen viewpoints and transfer well to new tasks, establishing VILA as a strong pretraining framework that improves robustness and downstream learning performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40161", "url": null, "sourceid": 32195, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40162, "uid": "f1031e5f06c70abc713119143e888b33", "name": "OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis", "authors": [{"id": 189140, "fullname": "Yuxuan Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189140?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 182651, "fullname": "JING HAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/182651?format=json", "institution": "University of Hong Kong"}, {"id": 193673, "fullname": "Hong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193673?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 193674, "fullname": "Jiahao Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193674?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 146663, "fullname": "Yihua Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/146663?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 189138, "fullname": "Yuci Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189138?format=json", "institution": "Shenzhen University"}, {"id": 189145, "fullname": "Kuo Hung", "url": "http://cvpr.thecvf.com/api/miniconf/users/189145?format=json", "institution": "University of Hong Kong"}, {"id": 77263, "fullname": "Hao Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77263?format=json", "institution": "ETH Zurich and CMU"}], "abstract": "Panoramic dental radiographs require fine-grained spatial reasoning, bilateral symmetry understanding, and multi-step diagnostic verification, yet existing vision\u2013language models operate under a static single-pass paradigm that limits their clinical reliability. In this paper, we introduce OralGPT-Plus, an agentic vision\u2013language model designed to perform iterative and symmetry-aware diagnostic reasoning for panoramic dental radiograph analysis. To support this paradigm, we construct DentalProbe, a five-thousand\u2013image dataset with expert-curated diagnostic trajectories that provide structured supervision for localized inspection and contralateral comparison. We further develop a Reinspection-driven reinforcement learning framework that encourages clinically meaningful re-examination and stabilizes long-horizon reasoning with rubric-based reward and conditioned diagnostic-driven reward. In parallel, we present MMOral-X, the first benchmark for holistic panoramic diagnosis, containing 300 open-ended questions and region-level annotations across multiple difficulty levels. OralGPT-Plus demonstrates consistent and reliable improvements over strong baselines on MMOral-X and established panoramic benchmarks, indicating the effectiveness of interactive and symmetry-informed reasoning. Our work highlights the value of agentic modeling for dental imaging and provides a foundation for future research in clinically aligned panoramic radiograph analysis. Code, benchmark, and models will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40162", "url": null, "sourceid": 43071, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40198, "uid": "65023e4f3b9d5d2703342990e5d3682a", "name": "PAS : Prelim Attention Score for Detecting Object Hallucinations in Large Vision-Language Models", "authors": [{"id": 183965, "fullname": "Nhat Hoang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183965?format=json", "institution": "Los Alamos National Laboratory"}, {"id": 193759, "fullname": "Minh Vu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193759?format=json", "institution": "Los Alamos National Laboratory"}, {"id": 193760, "fullname": "My T. Thai", "url": "http://cvpr.thecvf.com/api/miniconf/users/193760?format=json", "institution": "University of Florida"}, {"id": 193761, "fullname": "Manish Bhattarai", "url": "http://cvpr.thecvf.com/api/miniconf/users/193761?format=json", "institution": "Los Alamos National Laboratory"}], "abstract": "Large vision\u2013language models (LVLMs) are powerful, yet they remain unreliable due to object hallucinations. In this work, we show that in many hallucinatory predictions the LVLM effectively ignores the image and instead relies on previously generated output (\u201cprelim\u201d) tokens to infer new objects. We quantify this behavior via the mutual information between the image and the predicted object conditioned on the prelim, demonstrating that weak image dependence strongly correlates with hallucination. Building on this finding, we introduce the Prelim Attention Score (PAS), a lightweight, training-free signal computed from attention weights over prelim tokens. PAS requires no additional forward passes and can be computed on the fly during inference. Exploiting this previously overlooked signal, PAS achieves state-of-the-art object-hallucination detection across multiple models and datasets, enabling real-time filtering and intervention.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40198", "url": null, "sourceid": 39183, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40175, "uid": "79304cca1ad8a247a9bafffd5f4db436", "name": "HyperNAS: Enhancing Architecture Representation for NAS Predictor via Hypernetwork", "authors": [{"id": 153077, "fullname": "Jindi Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/153077?format=json", "institution": "Sichuan University"}, {"id": 153076, "fullname": "Yuhao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/153076?format=json", "institution": "Sichuan University"}, {"id": 153025, "fullname": "Yuxin Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/153025?format=json", "institution": "Sichuan University"}, {"id": 153080, "fullname": "Qing Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/153080?format=json", "institution": "Sichuan University"}, {"id": 86196, "fullname": "Wentao Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86196?format=json", "institution": "Sichuan University"}, {"id": 86144, "fullname": "Jiancheng Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/86144?format=json", "institution": "Sichuan University"}], "abstract": "Time-intensive performance evaluations significantly impede progress in Neural Architecture Search (NAS). To address this, neural predictors leverage surrogate models trained on proxy datasets,  allowing for direct performance predictions for new architectures.However, these predictors often exhibit poor generalization due to their limited ability to capture intricate relationships among various architectures. In this paper, we propose HyperNAS, a novel neural predictor paradigm for enhancing architecture representation learning. HyperNAS consists of two primary components: a global encoding scheme and a shared hypernetwork. The global encoding scheme is devised to capture the comprehensive macro-structure information, while the shared hypernetwork serves as an auxiliary task to enhance the investigation of inter-architecture patterns. To ensure training stability, we further develop a dynamic adaptive multi-task loss to facilitate personalized exploration on the Pareto front. Extensive experiments across five representative search spaces, including ViTs, demonstrate the advantages of HyperNAS, particularly in few-shot scenarios. For instance, HyperNAS strikes new state-of-the-art results, with 97.60\\% top-1 accuracy on CIFAR-10 and 82.4\\% top-1 accuracy on ImageNet, using at least 5.0$\\times$ fewer samples.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40175", "url": null, "sourceid": 41432, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40181, "uid": "b857ee1aae0229369354fbeb79044373", "name": "SinGeo: Unlock Single Model's Potential for Robust Cross-View Geo-Localization", "authors": [{"id": 172586, "fullname": "CHEN Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/172586?format=json", "institution": "National University of Defense Technology, Changsha, China."}, {"id": 132213, "fullname": "Xieyuanli Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/132213?format=json", "institution": "National University of Defense Technology"}, {"id": 193734, "fullname": "Junxiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193734?format=json", "institution": null}, {"id": 101498, "fullname": "Jie Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/101498?format=json", "institution": "National University of Defense Technology"}, {"id": 183632, "fullname": "Tao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183632?format=json", "institution": "\u56fd\u9632\u79d1\u6280\u5927\u5b66\uff08National University of Defense Technology\uff09"}], "abstract": "Robust cross-view geo-localization (CVGL) remains challenging despite the surge in recent progress. Existing methods still rely on field-of-view (FoV)-specific training paradigms, where models are optimized under a fixed FoV but collapse when tested on unseen FoVs and unknown orientations. This limitation necessitates deploying multiple models to cover diverse variations. Although studies have explored dynamic FoV training by simply randomizing FoVs, they failed to achieve robustness across diverse conditions---implicitly assuming all FoVs are equally difficult. To address this gap, we present SinGeo, a simple yet powerful framework that enables a single model to realize robust cross-view geo-localization without additional modules or explicit transformations. SinGeo employs a dual discriminative learning architecture that enhances intra-view discriminability within both ground and satellite branches, and is the first to introduce a curriculum learning strategy to achieve robust CVGL. Extensive evaluations on four benchmark datasets reveal that SinGeo sets state-of-the-art (SOTA) results under diverse conditions, and notably outperforms methods specifically trained for extreme FoVs. Beyond superior performance, SinGeo also exhibits cross-architecture transferability. Furthermore, we propose a consistency evaluation method to quantitatively assess model stability under varying views, providing an explainable perspective for understanding and advancing robustness in future CVGL research. Codes will be available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40181", "url": null, "sourceid": 38265, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40187, "uid": "303807c0c0db7db426d7b6d2081a276f", "name": "No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models", "authors": [{"id": 158583, "fullname": "Hai X. Pham", "url": "http://cvpr.thecvf.com/api/miniconf/users/158583?format=json", "institution": "Samsung AI Center"}, {"id": 152367, "fullname": "David T. Hoffmann", "url": "http://cvpr.thecvf.com/api/miniconf/users/152367?format=json", "institution": "University of Freiburg"}, {"id": 193742, "fullname": "Ricardo Guerrero", "url": "http://cvpr.thecvf.com/api/miniconf/users/193742?format=json", "institution": "Samsung AI Center"}, {"id": 154330, "fullname": "Brais Martinez", "url": "http://cvpr.thecvf.com/api/miniconf/users/154330?format=json", "institution": "Samsung AI Center - Cambridge"}], "abstract": "Contrastive vision-language (V\\&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of V\\&L models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard-negative samples. Hard-negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial degradation of basic V\\&L capabilities such as zero-shot or retrieval performance, rendering them impractical. In this work we follow a different approach. We identify two root causes that limit compositionality performance of V\\&Ls: 1) Long training captions do not require a compositional representation; and 2) The final global pooling in the text and image encoders leads to a complete loss of the necessary information to learn binding in the first place. As a remedy, we propose two simple solutions: 1) We obtain short concept centric caption parts using standard NLP software and align those with the image; and 2) We introduce a parameter-free cross-modal attention-pooling to obtain concept centric visual embeddings from the image encoder. With these two changes and simple auxiliary contrastive losses, we obtain SOTA performance on standard compositionality benchmarks, while maintaining or improving strong zero-shot and retrieval capabilities. This is achieved without increasing inference cost.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40187", "url": null, "sourceid": 37189, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40188, "uid": "20352cbe288211abf5161bc6fcbc1a3c", "name": "DreamStyle: A Unified Framework for Video Stylization", "authors": [{"id": 154491, "fullname": "Mengtian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/154491?format=json", "institution": "ByteDance"}, {"id": 138539, "fullname": "Jinshu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/138539?format=json", "institution": "ByteDance"}, {"id": 154495, "fullname": "Songtao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/154495?format=json", "institution": "ByteDance"}, {"id": 154492, "fullname": "Wanquan Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/154492?format=json", "institution": "ByteDance"}, {"id": 193743, "fullname": "Pengqi Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193743?format=json", "institution": "ByteDance"}, {"id": 127533, "fullname": "Qian HE", "url": "http://cvpr.thecvf.com/api/miniconf/users/127533?format=json", "institution": "Institute of Remote Sensing Application, Chinese Academic of Sciences"}], "abstract": "Video stylization, an important downstream task of video generation models, has not yet been thoroughly explored. Its input style conditions typically include text, style image, and stylized first frame. Each condition has a characteristic advantage: text is more flexible, style image provides a more accurate visual anchor, and stylized first frame makes long-video stylization feasible. However, existing methods are largely confined to a single type of style condition, which limits their scope of application. Additionally, their lack of high-quality datasets leads to style inconsistency and temporal flicker. To address these limitations, we introduce DreamStyle, a unified framework for video stylization, supporting (1) text-guided, (2) style-image-guided, and (3) first-frame-guided video stylization, accompanied by a well-designed data curation pipeline to acquire high-quality paired video data. DreamStyle is built on a vanilla Image-to-Video (I2V) model and trained using a Low-Rank Adaptation (LoRA) with token-specific up matrices that reduces the confusion among different condition tokens. Both qualitative and quantitative evaluations demonstrate that DreamStyle is competent in all three video stylization tasks, and outperforms the competitors in style consistency and video quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40188", "url": null, "sourceid": 37649, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40189, "uid": "fbfdf08a1210970f7c2f199f4eb10718", "name": "PointCNN++: Performant Convolution on Native Points", "authors": [{"id": 146470, "fullname": "Lihan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/146470?format=json", "institution": "Peking University"}, {"id": 130519, "fullname": "Haofeng Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/130519?format=json", "institution": "Peking University"}, {"id": 193744, "fullname": "Rui Bu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193744?format=json", "institution": null}, {"id": 189976, "fullname": "Mingchao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/189976?format=json", "institution": null}, {"id": 137984, "fullname": "Wenzheng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/137984?format=json", "institution": "Peking University"}, {"id": 88739, "fullname": "Baoquan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88739?format=json", "institution": "Peking University"}, {"id": 183245, "fullname": "Yangyan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183245?format=json", "institution": "Ant Group"}], "abstract": "Existing convolutional learning methods for 3D point cloud data are divided into two paradigms: point-based methods that preserve geometric precision but often face performance challenges, and voxel-based methods that achieve high efficiency through quantization at the cost of geometric fidelity. This loss of precision is a critical bottleneck for tasks such as point cloud registration. We propose PointCNN++, a novel architectural design that fundamentally mitigates this precision-performance trade-off. It generalizes sparse convolution from voxels to points, treating voxel-based convolution as a specialized, degraded case of our more general point-based convolution. First, we introduce a point-centric convolution where the receptive field is centered on the original, high-precision point coordinates. Second, to make this high-fidelity operation performant, we design a computational strategy that operates natively on points. We formulate the convolution on native points as a Matrix-Vector Multiplication and Reduction (MVMR) problem, for which we develop a dedicated, highly-optimized GPU kernel. Experiments demonstrate that PointCNN++ uses an order of magnitude less memory and is several times faster than representative point-based methods. Furthermore, when used as a simple replacement for the voxel-based backbones it generalizes, it significantly improves point cloud registration accuracies while proving both more memory-efficient and faster. PointCNN++ shows that preserving geometric detail and achieving high performance are not mutually exclusive, paving the way for a new class of 3D learning with high fidelity and efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40189", "url": null, "sourceid": 36687, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40190, "uid": "85ac6feb584b665e85664974c546cfec", "name": "SineProject: Machine Unlearning for Stable Vision\u2013Language Alignment", "authors": [{"id": 101325, "fullname": "Arpit Garg", "url": "http://cvpr.thecvf.com/api/miniconf/users/101325?format=json", "institution": "University of Adelaide"}, {"id": 127561, "fullname": "Hemanth Saratchandran", "url": "http://cvpr.thecvf.com/api/miniconf/users/127561?format=json", "institution": "University of Adelaide/Australian Institute of Machine Learning"}, {"id": 86945, "fullname": "Simon Lucey", "url": "http://cvpr.thecvf.com/api/miniconf/users/86945?format=json", "institution": "University of Adelaide"}], "abstract": "Multimodal Large Language Models (MLLMs) increasingly need to forget specific knowledge, such as unsafe or private information, without full retraining. However, existing unlearning methods often disrupt vision\u2013language alignment, causing models to reject both harmful and benign queries simultaneously. We trace this failure to the projector network: during unlearning, its Jacobian becomes severely ill-conditioned, leading to unstable optimization and drift in cross-modal embeddings. We introduce SineProject, a simple approach that augments the frozen projector with sinusoidally modulated trainable parameters that improve the Jacobian\u2019s spectral conditioning and stabilize alignment throughout unlearning. Evaluated across standard safety and privacy unlearning benchmarks using LLaVA-v1.5-7B and 13B, SineProject reduces benign-query refusals while achieving complete forgetting of targeted information, delivering state-of-the-art forget\u2013retain trade-offs with negligible computational overhead", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40190", "url": null, "sourceid": 39828, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40191, "uid": "be531378fcd22f55e661e629d5dab9c8", "name": "Flow3r: Factored Flow Prediction for Visual Geometry Learning", "authors": [{"id": 182357, "fullname": "Zhongxiao Cong", "url": "http://cvpr.thecvf.com/api/miniconf/users/182357?format=json", "institution": "Carnegie Mellon University"}, {"id": 86486, "fullname": "Qitao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86486?format=json", "institution": "CMU"}, {"id": 193745, "fullname": "Minsik Jeon", "url": "http://cvpr.thecvf.com/api/miniconf/users/193745?format=json", "institution": "CMU, Carnegie Mellon University"}, {"id": 76012, "fullname": "Shubham Tulsiani", "url": "http://cvpr.thecvf.com/api/miniconf/users/76012?format=json", "institution": "Carnegie Mellon University"}], "abstract": "We propose Flow3r, a scalable framework for visual geometry learning that leverages flow prediction to guide learning using unlabeled monocular videos. Current 3D/4D reconstruction systems primarily rely on dense geometry and pose supervision, and cannot easily generalize to diverse dynamic real-world scenes. In this work, we propose a mechanism to augment training directly from unlabeled videos, leveraging dense 2D correspondences (or \u2018flow\u2019) between arbitrary image pairs as supervision. Our key insight is that a factored flow prediction module that computes from two images using \u2018geometry latents\u2019 from one image and the \u2018pose latent\u2019 from the othercan guide visual geometry learning. We first highlight the benefits and scalability of flow supervision in controlled settings and then leverage large-scale unlabeled data to improve off-the-shelf visual geometry models. We evaluate Flow3r across diverse 3D benchmarks and demonstrate competitive or state-of-the-art performance, even surpassing supervised models trained with more labeled data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40191", "url": null, "sourceid": 39230, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40193, "uid": "501f06dacc74edd54022151f71c8960b", "name": "QuietPrune: Query-Guided Early Token Pruning for Vision-Language Models", "authors": [{"id": 107631, "fullname": "Tianxiao Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/107631?format=json", "institution": "Alipay (Hangzhou) Digital Service Technology Co., Ltd."}, {"id": 193747, "fullname": "Shanwei Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193747?format=json", "institution": "Ant Group"}, {"id": 191419, "fullname": "Shuo Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191419?format=json", "institution": null}, {"id": 191420, "fullname": "Shiai Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191420?format=json", "institution": null}, {"id": 156101, "fullname": "Chenguang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/156101?format=json", "institution": "Ant Group"}], "abstract": "Vision-language models (VLMs) demonstrate powerful capabilities in multimodal tasks. However, the large number of visual tokens imposes a significant computational cost. In this paper, we propose QuietPrune, a QUery-guIded Early Token Pruning method to remove redundant visual tokens within VLMs, thereby enhancing computational efficiency. Unlike previous late pruning methods, we recognize that implementing early pruning within the vision transformer (ViT) can achieve benefits in both latency reduction and accuracy maintenance. To address the semantic loss problem in early pruning, we design a lightweight adapter by performing a inverse transformation of the projector in VLMs. The proposed adapter converts the contextual query into a visual domain [Q-CLS] (Query [CLS]) token, providing textual guidance for ViT pruning. During pruning, we further introduce a semi-structured pruning scheme based on visual-textual  relevance. Specifically, we group spatially adjacent $2 \\times 2$ tokens to accommodate the visual token merging operation prevalent in mainstream VLMs. We use the mean attention scores between the [Q-CLS] token and the visual tokens as the relevance metric for each group, avoiding additional computation. Pruning is then applied at the group level based on the relevance score, preserving positional continuity. After pruning, we aggregate the redundant tokens into a single token to maintain context cues. Our method achieves up to 19.0\\% reduction in prefill latency while outperforming 4.2\\% in accuracy on the recent Qwen3-VL and InternVL3 series compared to existing late pruning methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40193", "url": null, "sourceid": 44160, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40205, "uid": "67eed99b828e1b94f5b53bce9e400a17", "name": "iSplat: Iterative Learning for Fine-Grained Gaussian Splatting", "authors": [{"id": 154038, "fullname": "Haifeng Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154038?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 103351, "fullname": "Wei Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/103351?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 86553, "fullname": "Shuhang Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86553?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 88453, "fullname": "Lixin Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88453?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 88470, "fullname": "Wen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88470?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Recent advances in feed-forward 3D Gaussian splatting have demonstrated remarkable efficiency by reconstructing scenes in a single pass. However, the reconstruction fidelity of these methods lags behind that of traditional optimization-based approaches, which gradually correct reconstruction flaws through a lengthy iterative process. In this paper, we leverage the strengths of both paradigms and introduce iSplat, a novel framework that reformulates reconstruction as an iterative feed-forward process involving multiple (typically three) passes.Central to iSplat is a recurrent GRU-based optimizer that refines both geometry and appearance in a synergistic loop. To address geometric inaccuracies, we propose an uncertainty-driven depth refinement strategy that progressively narrows the search space for each Gaussian based on its estimated uncertainty from the previous step. To further improve appearance details, we design a region-aware enhancement mechanism that applies targeted multi-view and monocular feature aggregation to resolve ambiguities in both overlapping and non-overlapping areas.We validate iSplat's robustness and generalization on in-domain (RealEstate10K, ACID) and cross-dataset (DTU, ACID) benchmarks. With only 42.6M parameters, iSplat surpasses DepthSplat (354M) on RealEstate10K (PSNR: 27.67 vs. 27.47 dB). Crucially, on the cross-dataset DTU benchmark, it further boosts the PSNR by 2.88 dB (18.26 vs. 15.38 dB), showcasing exceptional generalization. These results highlight the significant potential of iterative refinement to overcome the inherent limitations of one-shot approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40205", "url": null, "sourceid": 46087, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40194, "uid": "f613b01d52b86da04ae810f173d5aaef", "name": "Pixel Motion Diffusion is What We Need for Robot Control", "authors": [{"id": 193748, "fullname": "E-Ro Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193748?format=json", "institution": "State University of New York at Stony Brook"}, {"id": 193749, "fullname": "Yichi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193749?format=json", "institution": "State University of New York at Stony Brook"}, {"id": 97168, "fullname": "Kanchana Ranasinghe", "url": "http://cvpr.thecvf.com/api/miniconf/users/97168?format=json", "institution": "Stony Brook University"}, {"id": 193750, "fullname": "Xiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193750?format=json", "institution": "State University of New York, Stony Brook"}, {"id": 127277, "fullname": "Michael Ryoo", "url": "http://cvpr.thecvf.com/api/miniconf/users/127277?format=json", "institution": "Stony Brook University"}], "abstract": "We present DAWN (Diffusion is All We Need for robot control), a unified diffusion-based framework for language-conditioned robotic manipulation that bridges high-level motion intent and low-level robot action via structured pixel motion representation. In DAWN, both the high-level and low-level controllers are modeled as diffusion processes, yielding a fully trainable, end-to-end system with interpretable intermediate motion abstractions. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark, demonstrating strong multi-task performance, and further validates its effectiveness on MetaWorld. Despite the substantial domain gap between simulation and reality and limited real-world data, we demonstrate reliable real-world transfer with only minimal finetuning, illustrating the practical viability of diffusion-based motion abstractions for robotic control. Our results show the effectiveness of combining diffusion modeling with motion-centric representations as a strong baseline for scalable and robust robot learning. Visualization page:  https://anonymous.4open.science/w/DAWN", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40194", "url": null, "sourceid": 45358, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40199, "uid": "c1d2812e7562e2eb5de5a162cbbe1eb7", "name": "GS-CLIP: Zero-shot 3D Anomaly Detection by Geometry-Aware Prompt and Synergistic View Representation Learning", "authors": [{"id": 181563, "fullname": "Zehao Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/181563?format=json", "institution": "Soochow University"}, {"id": 193762, "fullname": "An Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193762?format=json", "institution": "Soochow University"}, {"id": 130713, "fullname": "Yan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130713?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Zero-shot 3D Anomaly Detection (ZS3DAD) is an emerging task that aims to detect anomalies in a target dataset without any target training data, which is particularly important in scenarios constrained by sample scarcity and data privacy concerns. While current methods adapt CLIP by projecting 3D point clouds into 2D representations, they face challenges. The projection inherently loses some geometric details, and the reliance on a single 2D modality provides an incomplete visual understanding, limiting their ability to detect diverse anomaly types. To address these limitations, we propose the Geometry-Aware Prompt and Synergistic View Representation Learning (GS-CLIP) framework, which enables the model to identify geometric anomalies through a two-stage learning process. In the stage 1, we dynamically generate text prompts embedded with 3D geometric priors. These prompts contain global shape context and local defect information distilled by our Geometric Defect Distillation Module (GDDM). In the stage 2, we introduce Synergistic View Representation Learning architecture that processes rendered and depth images in parallel. A Synergistic Refinement Module (SRM) subsequently fuses the features of both streams, capitalizing on their complementary strengths. Comprehensive experimental results on four large-scale public datasets show that GS-CLIP achieves superior performance in detection and segmentation. Code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40199", "url": null, "sourceid": 45974, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40203, "uid": "cf7065ddd8146156c055e0f61d01dcec", "name": "Prototypical Action Reasoning Facilitated by Vision-Language Alignment for Egocentric Action Anticipation", "authors": [{"id": 180501, "fullname": "jiang shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180501?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 193765, "fullname": "Xinbo Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193765?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 193766, "fullname": "Wenyin Tuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/193766?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 193767, "fullname": "XiaoChun Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/193767?format=json", "institution": "Northwestern Polytechnical University"}], "abstract": "Egocentric Action Anticipation aims to infer future actions from videos, which is crucial for embodied AI systems. However, its advancement is hindered by the inherent stochasticity of the future, which introduces significant prediction uncertainty. Prevailing methods typically adopt an end-to-end approach to model holistic spatiotemporal contexts, yet they often lack explicit semantic reasoning capabilities, making it difficult to handle open-ended future uncertainties.To address these challenges, we propose a Prototypical Action Reasoning  Framework Facilitated by Vision-Language Alignment (PAR-VLA), which leverages the semantic alignment capability of vision-language models to learn disentangled visual prototype for verbs and nouns. These prototypes serve as robust semantic anchors, transforming the unconstrained temporal prediction problem into a conditional forecasting task guided by well-defined semantic concepts. Our multi-stage framework first extracts visually-grounded and text-aligned prototype groups from a VLM, learning multiple prototypes per category to capture intra-class diversity. Subsequently, a novel Prototypical Action Reasoning-guided Verb-Noun Encoding branch dynamically retrieves the most relevant verb and noun concepts based on visual observations and explicitly models their interactions to guide temporal anticipation. Furthermore, we introduce Dual-Stream Symbiotic Predictive Decoders to more finely capture the interdependencies between verbs and nouns during the prediction process. Experiments Results demonstrate that PAR achieves state-of-the-art performance and exhibits a strong capability in dealing with future uncertainty.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40203", "url": null, "sourceid": 38775, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40209, "uid": "0338bf13624f52beaca91ec4a23c860c", "name": "MotionV2V: Editing Motion in a Video", "authors": [{"id": 97282, "fullname": "Ryan Burgert", "url": "http://cvpr.thecvf.com/api/miniconf/users/97282?format=json", "institution": "Stony Brook University"}, {"id": 87557, "fullname": "Charles Herrmann", "url": "http://cvpr.thecvf.com/api/miniconf/users/87557?format=json", "institution": "Google"}, {"id": 85425, "fullname": "Forrester Cole", "url": "http://cvpr.thecvf.com/api/miniconf/users/85425?format=json", "institution": "Google"}, {"id": 127277, "fullname": "Michael Ryoo", "url": "http://cvpr.thecvf.com/api/miniconf/users/127277?format=json", "institution": "Stony Brook University"}, {"id": 130388, "fullname": "Neal Wadhwa", "url": "http://cvpr.thecvf.com/api/miniconf/users/130388?format=json", "institution": "Google"}, {"id": 129547, "fullname": "Andrey Voynov", "url": "http://cvpr.thecvf.com/api/miniconf/users/129547?format=json", "institution": "Google Research"}, {"id": 133261, "fullname": "Nataniel Ruiz", "url": "http://cvpr.thecvf.com/api/miniconf/users/133261?format=json", "institution": "Google"}], "abstract": "While generative video models have achieved remarkable fidelity and consistency, applying these capabilities to video editing remains a complex challenge. Recent research has extensively explored motion controllability as a means to enhance text-to-video generation or image animation; however, we identify precise motion control as a promising, yet under-explored, paradigm for editing existing videos. In this work, we propose modifying video motion by directly editing sparse trajectories extracted from the input. We term the deviation between input and output trajectories a 'motion edit' and demonstrate that this representation, when coupled with a generative backbone, enables many powerful video editing capabilities. To achieve this, we introduce a novel pipeline for generating `motion counterfactuals' \u2014 video pairs that share identical content but distinct motion \u2014 and fine-tune a motion-conditioned video diffusion architecture on this dataset. Our approach allows for edits that start at any timestamp and propagate naturally. In a 4-way head-to-head user study, our model achieves over 65% preference against prior work.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40209", "url": null, "sourceid": 45242, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40223, "uid": "7b76aabbcca9301ca0f8604376a1908b", "name": "GGPT: Geometry-Grounded Point Transformer", "authors": [{"id": 193819, "fullname": "Yutong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193819?format=json", "institution": "Department of Computer Science, ETHZ - ETH Zurich"}, {"id": 188322, "fullname": "Yiming Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188322?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 73724, "fullname": "Xucong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73724?format=json", "institution": "Delft University of Technology"}, {"id": 90760, "fullname": "Sergey Prokudin", "url": "http://cvpr.thecvf.com/api/miniconf/users/90760?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 69176, "fullname": "Siyu Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/69176?format=json", "institution": "ETH Zurich"}], "abstract": "Recent feed-forward networks have achieved remarkable progress in sparse-view 3D reconstruction by predicting dense point maps directly from RGB images. However, they often suffer from geometric inconsistencies and limited fine-grained accuracy due to the absence of explicit multi-view constraints. We introduce the Geometry-Grounded Point Transformer (GGPT), a framework that augments feed-forward reconstruction with reliable sparse geometric guidance. We first propose an improved Structure-from-Motion pipeline based on dense feature matching and lightweight geometric optimisation to efficiently estimate accurate camera poses and partial 3D point clouds from sparse input views.Building on this foundation, we propose a geometry-guided 3D point transformer that refines dense point maps under explicit sparse-geometry supervision using an optimised guidance encoding. Extensive experiments demonstrate that our method provides a principled mechanism for integrating geometric priors with dense feed-forward predictions, producing reconstructions that are both geometrically consistent and spatially complete, recovering fine structures and filling gaps in textureless areas.Trained solely on ScanNet++ with VGGT predictions, GGPT generalises across architectures and datasets, substantially outperforming state-of-the-art feed-forward 3D reconstruction models in both in-domain and out-of-domain settings. All code and pre-trained models will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40223", "url": null, "sourceid": 40744, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40212, "uid": "0ed04c5f61b5459b009b5b663c43bf94", "name": "FedSST: Rethinking Fair Federated Graph Learning under Structural Shift", "authors": [{"id": 179903, "fullname": "Dingyi Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/179903?format=json", "institution": "Wuhan University"}], "abstract": "Federated Graph Learning (FGL) offers a privacy-preserving paradigm for collaborative training on graph data, yet significant topological heterogeneity poses a critical threat to generalization fairness, often yielding a global model dominated by a subset of clients. This introduces two critical issues: at the global level, aggregation bias disproportionately amplifies the influence of dominant clients, while at the local level, blind optimization results in inefficient and inequitable training processes. To address these challenges, we propose FedSST, an adaptive fairness framework. FedSST introduces a fair, structure-based signal to quantify client contributions, which in turn guides fair aggregation and adaptive local training. Extensive experiments across diverse cross-domain and cross-dataset settings demonstrate that FedSST enhances generalization fairness and overall model performance, outperforming various state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40212", "url": null, "sourceid": 33641, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40215, "uid": "656b60bbc8395b4e80dc1cc6a3cf1ef2", "name": "Stable Spike: Dual Consistency Optimization via Bitwise AND Operations for Spiking Neural Networks", "authors": [{"id": 181544, "fullname": "Yongqi Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/181544?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 193801, "fullname": "Kunshan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193801?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 193802, "fullname": "Linze Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193802?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 193803, "fullname": "Yiyang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193803?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 193804, "fullname": "Mengmeng Jing", "url": "http://cvpr.thecvf.com/api/miniconf/users/193804?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 193805, "fullname": "Lin Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/193805?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Although the temporal spike dynamics of spiking neural networks (SNNs) enable low-power temporal capture capabilities, they also incur inherent inconsistencies that severely compromise representation. In this paper, we perform dual consistency optimization via Stable Spike to mitigate this problem, thereby improving the recognition performance of SNNs. With the hardware-friendly ``AND\"  bit operation, we efficiently decouple the stable spike skeleton from the multi-timestep spike maps, which captures critical semantics and reduces the inconsistency from variable noise spikes. Enforcing the unstable spike maps to converge to the stable spike skeleton significantly improves the inherent consistency across timesteps. Furthermore, we inject amplitude-aware spike noise into the stable spike skeleton to diversify the representations while preserving consistent semantics. The SNN is encouraged to produce perturbation-consistent predictions, thereby contributing to generalization. Extensive experiments across multiple architectures and datasets validate the effectiveness and versatility of our method. In particular, our method significantly advances neuromorphic object recognition under ultra-low latency, improving accuracy by up to 8.33\\%. This will help unlock the full power consumption and speed potential of SNNs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40215", "url": null, "sourceid": 42842, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40225, "uid": "d5cbe173e3496a9a8cf33ff403326f36", "name": "Dual-Granularity Memory for Efficient Video Generation", "authors": [{"id": 84689, "fullname": "Hongjun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84689?format=json", "institution": "University of Hong Kong"}, {"id": 193821, "fullname": "Lin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193821?format=json", "institution": "Alibaba Group"}, {"id": 193822, "fullname": "Jianguo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193822?format=json", "institution": "Ant Group"}, {"id": 158625, "fullname": "Tao Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/158625?format=json", "institution": "Westlake University"}], "abstract": "Video generation using recurrent architectures offers compelling efficiency advantages over attention-based transformers, particularly for long-sequence generation. However, chunked processing in recurrent models creates temporal discontinuities that harm long-range consistency. We introduce two complementary memory mechanisms to address this challenge at different granularities: \\textbf{(1) Context Memory} maintains persistent global context within attention chunks through learnable \\textit{sink columns} and \\textit{boundary buffers}, adding only 150K parameters (\\textless 0.1\\% overhead); \\textbf{(2) Latent Context-as-Memory (LCaM)} extends memory across video segments by storing and retrieving historical latent embeddings, enabling cross-segment consistency without requiring camera annotations or frame reconstruction. Applied to Generalized Spatial-temporal Propagation Networks (GSTPN), our dual-memory approach achieves \\textbf{1.54$\\times$ faster} inference than attention-based transformers, while excelling in visual quality metrics. Our approach is particularly effective for knowledge distillation scenarios where only pre-extracted latent embeddings are available. This work demonstrates compelling efficiency-quality trade-offs for practical long video generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40225", "url": null, "sourceid": 36125, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40227, "uid": "34164328c04dabc126f4fe82bbb2d5e6", "name": "Video Generation with Stable Transparency via Shiftable RGB-A Distribution Learner", "authors": [{"id": 180431, "fullname": "Haotian Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/180431?format=json", "institution": "Tianjin University"}, {"id": 130020, "fullname": "Wenjing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130020?format=json", "institution": "Peking University"}, {"id": 86641, "fullname": "Chen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86641?format=json", "institution": "WeChat, Tencent"}, {"id": 185164, "fullname": "Jing LYU", "url": "http://cvpr.thecvf.com/api/miniconf/users/185164?format=json", "institution": "Wechat Team"}, {"id": 154517, "fullname": "Di Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/154517?format=json", "institution": "Tianjin University"}], "abstract": "Generating RGB-A videos, which include alpha channels for transparency, has wide applications. However, current methods often suffer from low quality due to confusion between RGB and alpha. In this paper, we address this problem by learning shiftable RGB\u2011A distributions. We adjust both the latent space and noise space, shifting the alpha distribution outward while preserving the RGB distribution, thereby enabling stable transparency generation without compromising RGB quality. Specifically, for the latent space, we propose a transparency\u2011aware bidirectional diffusion loss during VAE training, which shifts the RGB\u2011A distribution according to likelihood. For the noise space, we propose shifting the mean of diffusion noise sampling and applying a Gaussian ellipse mask to provide transparency guidance and controllability. Additionally, we construct a high\u2011quality RGB\u2011A video dataset. Compared to state\u2011of\u2011the\u2011art methods, our model excels in visual quality, naturalness, transparency rendering, inference convenience, and controllability.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40227", "url": null, "sourceid": 45495, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40228, "uid": "05f747f9753a0b4172a8faf1128a78e1", "name": "Are Image-to-Video Models Good Zero-Shot Image Editors?", "authors": [{"id": 184292, "fullname": "Zechuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184292?format=json", "institution": "Zhejiang University"}, {"id": 143968, "fullname": "Zhenyuan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/143968?format=json", "institution": "Zhejiang University"}, {"id": 75838, "fullname": "Zongxin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75838?format=json", "institution": "Zhejiang University"}, {"id": 86325, "fullname": "Yi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86325?format=json", "institution": "Zhejiang University"}], "abstract": "Large-scale video diffusion models exhibit strong world-simulation and temporal reasoning capabilities, yet their potential as zero-shot image editors remains underexplored. We present \\ifedit{IF-Edit} (\\textbf{I}mage Edit by Generating \\textbf{F}rames), a tuning-free framework that repurposes pre-trained image-to-video diffusion models for instruction-driven image editing. \\ifedit{IF-Edit} addresses three core obstacles\u2014prompt misalignment, redundant temporal latents, and blurry late-stage frames\u2014via: (1) a Chain-of-Thought Prompt Enhancement module that reformulates static editing instructions into temporally grounded reasoning prompts; (2) a Temporal Latent Dropout strategy that compresses frame latents after the expert-switch point, accelerating denoising while preserving global semantics and temporal coherence; and (3) a Self-Consistent Post-Refinement step that refines the sharpest late-stage frame through a brief still-video trajectory, leveraging the video prior for sharper and more faithful results. Extensive experiments across four public benchmarks\u2014covering non-rigid deformations, physical and temporal reasoning, and general instruction editing\u2014show that \\ifedit{IF-Edit} achieves strong performance on non-rigid and reasoning-centric tasks while remaining competitive on general-purpose edits. Our study offers a systematic view of video diffusion models as image editors, revealing their unique strengths, limitations, and a simple recipe for unified video\u2013image generative reasoning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40228", "url": null, "sourceid": 43683, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40229, "uid": "aba05e9ba7fdfe0164049eb9bfd495cf", "name": "InfinityHuman: Towards Long-Term Audio-Driven Human Animation", "authors": [{"id": 180417, "fullname": "Xiaodi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180417?format=json", "institution": "Zhejiang University"}, {"id": 193826, "fullname": "Pan Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/193826?format=json", "institution": null}, {"id": 193827, "fullname": "Yi Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/193827?format=json", "institution": "ByteDance"}, {"id": 193828, "fullname": "Qijun Gan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193828?format=json", "institution": "College of Computer Science and Technology, Zhejiang University"}, {"id": 193829, "fullname": "Chen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193829?format=json", "institution": "Facebook"}, {"id": 193830, "fullname": "Fangyuan Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/193830?format=json", "institution": "ByteDance Inc."}, {"id": 180939, "fullname": "Xiang Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/180939?format=json", "institution": "Miyou Internet Technology (Shanghai) Co., Ltd."}, {"id": 89569, "fullname": "Zehuan Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/89569?format=json", "institution": "Nanjing University"}, {"id": 154867, "fullname": "BINGYUE PENG", "url": "http://cvpr.thecvf.com/api/miniconf/users/154867?format=json", "institution": "ByteDance Inc."}], "abstract": "Audio-driven human animation has attracted wide attention thanks to its practical applications. However, critical challenges remain in generating high-resolution, long-duration videos with consistent appearance and natural hand motions. Existing methods extend videos using overlapping motion frames but suffer from error accumulation, leading to identity drift, color shifts, and scene instability. Additionally, hand movements are poorly modeled, resulting in noticeable distortions and misalignment with the audio. In this work, we propose InfinityHuman, a coarse-to-fine framework that first generates audio-synchronized representations, then progressively refines them into high-resolution, long-duration videos using a pose-guided refiner. Since pose sequences are decoupled from appearance and resist temporal degradation, our pose-guided refiner employs stable poses and the initial frame as a visual anchor to reduce drift and improve lip synchronization. Moreover, to enhance semantic accuracy and gesture realism, we introduce a hand-specific reward mechanism trained with high-quality hand motion data. Experiments on the EMTD and HDTF datasets show that InfinityHuman achieves state-of-the-art performance in video quality, identity preservation, hand accuracy, and lip-sync. Ablation studies further confirm the effectiveness of each module.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40229", "url": null, "sourceid": 33917, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40233, "uid": "97b7e121d7fbde23d1f19991e52793a3", "name": "Seeing through boxes: Non-Line-of-Sight 3D Reconstruction from Radar Signals", "authors": [{"id": 181601, "fullname": "Jiachen Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181601?format=json", "institution": "Swiss Federal Technology Institute of Lausanne (EPFL)"}, {"id": 181598, "fullname": "Hailan Shanbhag", "url": "http://cvpr.thecvf.com/api/miniconf/users/181598?format=json", "institution": "EPFL - EPF Lausanne"}, {"id": 126362, "fullname": "Haitham Al Hassanieh", "url": "http://cvpr.thecvf.com/api/miniconf/users/126362?format=json", "institution": "University of Illinois at Urbana-Champaign"}], "abstract": "Reconstructing object geometry from radio frequency (RF) signals is fundamentally challenging due to the lensless imaging nature of RF sensing, which leads to low spatial resolution and high noise. Unlike light signals, RF signals can penetrate occlusions and thus capture information about hidden scenes. Existing Non-Line-of-Sight (NLoS) 3D neural reconstruction methods can recover coarse surfaces inside enclosed environments but often suffer from unstable optimization, noisy surface geometry, and surface ambiguity, failing to produce accurate zero-level sets from the signed distance field (SDF). These limitations largely stem from neglecting the role of Line-of-Sight (LoS) geometry outside the enclosed region, which provides valuable physical constraints for modeling signal propagation. In this paper, we introduce a Unified LoS and NLoS neural geometry reconstruction framework that leverages the outside LoS geometry to model and guide RF propagation from the LoS region into the NLoS region. By integrating visual LoS priors into the neural field formulation, our system achieves stable training and physically consistent reconstruction of both visible and hidden geometry, setting a new state-of-the-art in RF-based geometry reconstruction.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40233", "url": null, "sourceid": 43147, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40236, "uid": "f14155e842ad13c13a8025b93e6803e3", "name": "Saliency-Guided Representation with Consistency Policy Learning for Visual Unsupervised Reinforcement Learning", "authors": [{"id": 180420, "fullname": "Jingbo Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/180420?format=json", "institution": "Institute of Automation\uff0cChinese Academy of Sciences"}, {"id": 190198, "fullname": "Qichao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190198?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 193844, "fullname": "Songjun Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193844?format=json", "institution": "Institute of Automation Chinese Academy of Sciences"}, {"id": 193845, "fullname": "Xing Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193845?format=json", "institution": null}, {"id": 72214, "fullname": "Yupeng Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/72214?format=json", "institution": "Institute of Automation\uff0cChinese Academy of Sciences"}, {"id": 193846, "fullname": "Haoran Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193846?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 154602, "fullname": "Ke Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/154602?format=json", "institution": "Peng Cheng Laboratory"}, {"id": 193847, "fullname": "Dongbin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193847?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}], "abstract": "Zero-shot unsupervised reinforcement learning (URL) offers a promising direction for building generalist agents capable of generalizing to unseen tasks without additional supervision. Among existing approaches, successor representations (SR) have emerged as a prominent paradigm due to their effectiveness in structured, low-dimensional settings. However, SR methods struggle to scale to high-dimensional visual environments. Through empirical analysis, we identify two key limitations of SR in visual URL: (1) SR objectives often lead to suboptimal representations that attend to dynamics-irrelevant regions, resulting in inaccurate successor measures and degraded task generalization; and (2) these flawed representations hinder SR policies from modeling multi-modal skill-conditioned action distributions and ensuring skill controllability. To address these limitations, we propose Saliency-Guided Representation with Consistency Policy Learning (SRCP), a novel framework that improves zero-shot generalization of SR methods in visual URL. SRCP decouples representation learning from successor training by introducing a saliency-guided dynamics task to capture dynamics-relevant representations, thereby improving successor measure and task generalization. Moreover, it integrates a fast-sampling consistency policy with URL-specific classifier-free guidance and tailored training objectives to improve skill-conditioned policy modeling and controllability. Extensive experiments on 16 tasks across 4 datasets from the ExORL benchmark demonstrate that SRCP achieves state-of-the-art zero-shot generalization in visual URL and is compatible with various SR methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40236", "url": null, "sourceid": 42151, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40243, "uid": "4ead4a1873d5c890f6fd1f296422a4cb", "name": "InternVideo-Next: Towards World-Understanding Video Models", "authors": [{"id": 182217, "fullname": "Chenting Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182217?format=json", "institution": "Shanghai jiaotong university"}, {"id": 71093, "fullname": "Yuhan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/71093?format=json", "institution": "Nanjing University"}, {"id": 183335, "fullname": "Yicheng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183335?format=json", "institution": "Institute of Science Tokyo"}, {"id": 156839, "fullname": "Jiange Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156839?format=json", "institution": "Nanjing University"}, {"id": 151685, "fullname": "ziang yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/151685?format=json", "institution": "Zhejiang University"}, {"id": 86624, "fullname": "Yali Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86624?format=json", "institution": "SIAT, Chinese Academy of Sciences"}, {"id": 86626, "fullname": "Yi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86626?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 86063, "fullname": "Limin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86063?format=json", "institution": "Nanjing University"}], "abstract": "Large-scale video\u2013text pretraining achieves strong performance but depends on noisy, synthetic captions with limited semantic coverage, often overlooking implicit world knowledge such as object motion, 3D geometry, and physical cues. In contrast, masked video modeling (MVM) directly exploits spatiotemporal structures but trails text-supervised methods on general tasks.We find this gap arises from overlooked architectural issues: pixel-level reconstruction struggles with convergence and its low-level requirement often conflicts with semantics, while latent prediction often encourages shortcut learning.To address these, we disentangle the traditional encoder\u2013decoder design into an Encoder\u2013Predictor\u2013Decoder (EPD) framework, where the predictor acts as a latent world model, and propose InternVideo-Next, a two-stage pretraining scheme that builds a semantically consistent yet detail-preserving latent space for this world model.First, conventional linear decoder in pixel MVM enforces the predictor\u2019s output latent to be linearly projected to, thus separable in pixel space, causing the conflict with semantic abstraction.Our Stage 1 proposes a conditional diffusion decoder and injects clean image-level semantic priors to enhance semantics and convergence, thus bridging pixel-level fidelity with high-level semantic abstraction.Stage 2 further learns world knowledge by predicting frozen Stage 1 targets within this space, mitigating shortcut learning.Trained on public, unlabeled videos, InternVideo-Next achieves state-of-the-art results across benchmarks and provides a scalable path toward general video representation learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40243", "url": null, "sourceid": 41589, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40003, "uid": "88f9344f9167663504c5cc3a143a9b89", "name": "Multinex: Lightweight Low-light Image Enhancement via Multi-prior Retinex", "authors": [{"id": 193276, "fullname": "Alexandru Brateanu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193276?format=json", "institution": "University of Manchester, University of Manchester"}, {"id": 89361, "fullname": "Tingting Mu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89361?format=json", "institution": "University of Manchester"}, {"id": 75719, "fullname": "Codruta Ancuti", "url": "http://cvpr.thecvf.com/api/miniconf/users/75719?format=json", "institution": "University Politehnica Timisoara"}, {"id": 193277, "fullname": "Cosmin Ancuti", "url": "http://cvpr.thecvf.com/api/miniconf/users/193277?format=json", "institution": "UVT"}], "abstract": "Low-light image enhancement (LLIE) aims to restore natural visibility, color fidelity, and structural detail under severe illumination degradation. State-of-the-art (SOTA)  LLIE techniques often depend on large model size  and multi-stage training, limiting  practicality for edge deployment. Moreover, they often rely on a single color space, which introduces instability and visible exposure or color artifacts. To achieve low-cost, effective LLIE, we present Multinex, an ultra-lightweight structured framework that integrates multiple fine-grained representations within a principled Retinex formulation. It decomposes an image into illumination and color prior stacks derived from distinct analytic representations, and learns to fuse these representations into luminance and reflectance adjustments required to correct exposure. We emphasize enhancement over reconstruction to  enable drastic  reduction of computational overhead, supported by lightweight neural operations. Accordingly, we develop a lightweight Multinex (45K parameters) and a micro version (2.6K parameters). Examined by intensive benchmark comparison,  they outperform  significantly lightweight and micro SOTA models, while reach  close performance to large SOTA models. Code will be released upon publication. Across extensive benchmarks, both variants significantly outperform existing lightweight and micro SOTA models, and reach performance comparable to more complex approaches. Code will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40003", "url": null, "sourceid": 31127, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40254, "uid": "fd759db147415e8a49d1ed0dfdd22f81", "name": "CARLoS: Retrieval via Concise Assessment Representation of LoRAs at Scale", "authors": [{"id": 183751, "fullname": "Shahar Sarfaty", "url": "http://cvpr.thecvf.com/api/miniconf/users/183751?format=json", "institution": "Tel Aviv University"}, {"id": 193892, "fullname": "Adi Haviv", "url": "http://cvpr.thecvf.com/api/miniconf/users/193892?format=json", "institution": "School of Computer Science, Tel Aviv University"}, {"id": 193893, "fullname": "Uri Y. Hacohen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193893?format=json", "institution": "Tel Aviv University, Tel Aviv University"}, {"id": 149759, "fullname": "Niva Elkin-Koren", "url": "http://cvpr.thecvf.com/api/miniconf/users/149759?format=json", "institution": "Tel-Aviv University Faculty of Law"}, {"id": 193894, "fullname": "Roi Livni", "url": "http://cvpr.thecvf.com/api/miniconf/users/193894?format=json", "institution": "Tel Aviv University"}, {"id": 86202, "fullname": "Amit H. Bermano", "url": "http://cvpr.thecvf.com/api/miniconf/users/86202?format=json", "institution": "Tel Aviv University, Technion"}], "abstract": "The rapid proliferation of generative components, such as LoRAs, has created a vast but unstructured ecosystem. Existing discovery methods depend on unreliable user descriptions or biased popularity metrics, hindering usability. We present CARLoS, a large-scale framework for characterizing LoRAs without requiring additional metadata. Analyzing over 650 LoRAs, we employ them in image generation over a variety of prompts and seeds, as a credible way to assess their behavior. Using CLIP embeddings and their difference to a base-model generation, we concisely define a three part representation: \\textit{Directions}, defining semantic shift; \\textit{Strength}, quantifying the significance of the effect; and \\textit{Consistency}, quantifying how stable the effect is. Using these representations, we develop an efficient retrieval framework that semantically matches textual queries to relevant LoRAs while filtering overly strong or unstable ones, outperforming textual baselines in automated and human evaluations. While retrieval is our primary focus, the same representation also supports analyses linking Strength and Consistency to legal notions of substantiality and volition, key considerations in copyright, positioning CARLoS as a practical system with broader relevance for LoRA analysis.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40254", "url": null, "sourceid": 46775, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39738, "uid": "19e733acd3ee9f9362a484e5915c132b", "name": "Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning", "authors": [{"id": 180267, "fullname": "SeungHee Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/180267?format=json", "institution": "Hanyang University"}, {"id": 132367, "fullname": "minju Jeon", "url": "http://cvpr.thecvf.com/api/miniconf/users/132367?format=json", "institution": "Hanyang University"}, {"id": 147999, "fullname": "Hyunwoo Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/147999?format=json", "institution": "Hanyang University"}, {"id": 192756, "fullname": "Jihwan Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/192756?format=json", "institution": "Graduate School of Hanyang University"}, {"id": 69900, "fullname": "Dong-Jin Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/69900?format=json", "institution": "Hanyang University"}], "abstract": "Existing retrieval-augmented approaches for Dense Video Captioning (DVC) often fail to achieve accurate temporal segmentation aligned with true event boundaries, as they rely on heuristic strategies that overlook ground truth event boundaries.The proposed framework, STaRC, overcomes this limitation by supervising frame-level saliency through a highlight detection module. Note that the highlight detection module is trained on binary labels derived directly from DVC ground truth annotations without the need for additional annotation.We also propose to utilize the saliency scoresas a unified temporal signal that drives retrieval via saliency-guided segmentation and informs caption generation through explicit Saliency Prompts injected into the decoder.By enforcing saliency-constrained segmentation, our method produces temporally coherent segments that align closely with actual event transitions, leading to more accurate retrieval and contextually grounded caption generation. We conduct comprehensive evaluations on the YouCook2 and ViTT benchmarks, where STaRC achieves state-of-the-art performance across most of the metrics.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39738", "url": null, "sourceid": 43150, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37496, "uid": "3696cd0b907299b239100ea1d759d913", "name": "HandDreamer: Zero-Shot Text to 3D Hand Model Generation using Corrective Hand Shape Guidance", "authors": [{"id": 187580, "fullname": "Green Rosh", "url": "http://cvpr.thecvf.com/api/miniconf/users/187580?format=json", "institution": "Samsung Research Institute India Bangalore"}, {"id": 187581, "fullname": "Prateek Kukreja", "url": "http://cvpr.thecvf.com/api/miniconf/users/187581?format=json", "institution": "Samsung"}, {"id": 187582, "fullname": "Vishakha SR", "url": "http://cvpr.thecvf.com/api/miniconf/users/187582?format=json", "institution": "Samsung"}, {"id": 187583, "fullname": "Pawan Prasad B H", "url": "http://cvpr.thecvf.com/api/miniconf/users/187583?format=json", "institution": "Samsung Research"}], "abstract": "The emergence of virtual reality has necessitated the generation of detailed and customizable 3D hand models for interaction in the virtual world. However, the current methods for 3D hand model generation are both expensive and cumbersome, offering very little customizability to the users. While recent advancements in zero-shot text-to-3D synthesis have enabled the generation of diverse and customizable 3D models using Score Distillation Sampling (SDS), they do not generalize very well to 3D hand model generation, resulting in unnatural hand structures, view-inconsistencies and loss of details. To address these limitations, we introduce HandDreamer, the first method for zero-shot 3D hand model generation from text prompts. Our findings suggest that view-inconsistencies in SDS is primarily caused due to the ambiguity in the probability landscape described by the text prompt, resulting in similar views converging to different modes of the distribution. This is particularly aggravated for hands due to the large variations in articulations and poses. To alleviate this, we propose to use MANO hand model based initialization and a hand skeleton guided diffusion process to provide a strong prior for the hand structure and to ensure view and pose consistency.  Further, we propose a novel corrective hand shape guidance loss to ensure that all the views of the 3D hand model converges to view-consistent modes, without leading to geometric distortions. Extensive evaluations demonstrate the superiority of our method over the state-of-the-art methods, paving a new way forward in 3D hand model generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37496", "url": null, "sourceid": 35675, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40259, "uid": "a9bdc9589c2df82ca15eaa0205447770", "name": "Learning Convex Decomposition via Feature Fields", "authors": [{"id": 135048, "fullname": "Yuezhi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/135048?format=json", "institution": "University of Texas at Austin"}, {"id": 91087, "fullname": "Qixing Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91087?format=json", "institution": "University of Texas at Austin"}, {"id": 184979, "fullname": "Mikaela Angelina Uy", "url": "http://cvpr.thecvf.com/api/miniconf/users/184979?format=json", "institution": "NVIDIA"}, {"id": 184980, "fullname": "Nicholas Sharp", "url": "http://cvpr.thecvf.com/api/miniconf/users/184980?format=json", "institution": "NVIDIA"}], "abstract": "This work proposes  a new formulation to the long-standing problem of convex decomposition through learning feature fields, enabling the first feed-forward model for open-world learning of convex decomposition. Our method produces high-quality decompositions of 3D shapes into a union of convex bodies, which are essential to accelerate collision detection in physical simulation, amongst many other applications.The key insight is to adopt a feature learning approach and learn a continuous feature field that can later be clustered to yield a good convex decomposition via our self-supervised, purely-geometric objective derived from the classical definition of convexity.Our formulation can be used for single shape optimization, but more importantly, feature prediction unlocks scalable, self-supervised learning on large datasets resulting in the first learned open-world for convex decomposition.Experiments show that our decompositions are higher-quality than alternatives and generalize across open-world objects as well as across representations to meshes, CAD models, and even Gaussian splats.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40259", "url": null, "sourceid": -43285, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/36413?format=json"], "related_events_ids": [36413]}, {"id": 38507, "uid": "5b9a8d01164ae511a4d6f8c83b041556", "name": "CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention", "authors": [{"id": 190014, "fullname": "Jiacheng Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190014?format=json", "institution": "Fudan University"}, {"id": 190015, "fullname": "Zhiyuan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190015?format=json", "institution": "Fudan University"}, {"id": 142795, "fullname": "Zhuolin He", "url": "http://cvpr.thecvf.com/api/miniconf/users/142795?format=json", "institution": "Fudan University"}, {"id": 190016, "fullname": "Jia Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190016?format=json", "institution": null}, {"id": 190017, "fullname": "Kai Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190017?format=json", "institution": "East China Normal University"}, {"id": 76452, "fullname": "Jian Pu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76452?format=json", "institution": "Fudan University"}], "abstract": "Planning-oriented end-to-end driving models show great promise, yet they fundamentally learn statistical correlations instead of true causal relationships. This vulnerability leads to causal confusion, where models exploit dataset biases as shortcuts, critically harming their reliability and safety in complex scenarios. To address this, we introduce CausalVAD, a de-confounding training framework that leverages causal intervention. As its core, we design the sparse causal intervention scheme (SCIS), a lightweight, plug-and-play module to instantiate the backdoor adjustment theory in neural networks. SCIS first constructs a dictionary of prototypes representing latent driving contexts. It then uses this dictionary to intervene on the model's sparse vectorized queries. This step actively eliminates spurious associations induced by confounders, thereby purifying the representations for downstream tasks. Extensive experiments on benchmarks like nuScenes show CausalVAD achieves state-of-the-art planning accuracy and safety. Furthermore, our method also demonstrates superior robustness against both data bias and noisy scenarios specifically configured to induce causal confusion. We will release our code upon paper acceptance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38507", "url": null, "sourceid": 31510, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40018, "uid": "877f8395efda54ec44a890080c4e4fc0", "name": "Optical Diffraction-based Convolution for Semiconductor Lithography", "authors": [{"id": 153651, "fullname": "Young-Han Son", "url": "http://cvpr.thecvf.com/api/miniconf/users/153651?format=json", "institution": "Korea University"}, {"id": 153650, "fullname": "Dong-Hee Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/153650?format=json", "institution": "Korea University"}, {"id": 193314, "fullname": "Deok-Joong Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/193314?format=json", "institution": "Korea University"}, {"id": 193315, "fullname": "Hyun Jung Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/193315?format=json", "institution": "Korea University"}, {"id": 153653, "fullname": "Tae-Eui Kam", "url": "http://cvpr.thecvf.com/api/miniconf/users/153653?format=json", "institution": "Korea University"}], "abstract": "In recent years, the increasing demand for smaller and more powerful semiconductors highlighted the critical role of lithography\u2014a key stage in semiconductor manufacturing responsible for precise mask design and wafer patterning. To meet these demands, the semiconductor industry has increasingly adopted computational lithography, employing machine learning and deep learning techniques to accelerate advancements in lithographic technology. Despite the various research efforts and successes in computational lithography, there remains a lack of explicit incorporation of physical principles. This gap limits the ability of existing methods to fully capture the complex physical phenomena inherent in lithography behaviors. To bridge this gap, we propose OptiCo, a novel convolutional neural network that seamlessly integrates optical diffraction principles into its architecture. At its core, OptiCo employs an optical phase kernel to model phase variations resulting from light propagation, effectively capturing the physical interactions among light, masks, and wafers. We evaluate OptiCo on semiconductor lithography benchmarks, demonstrating its superior performance in mask optimization tasks, with its remarkable generalization capabilities in OOD datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40018", "url": null, "sourceid": 33228, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40355, "uid": "0ccdcb33f85df88037bd0701b351d5df", "name": "Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment", "authors": [{"id": 189114, "fullname": "Youming Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189114?format=json", "institution": "Department of Computer Science, Cornell University"}, {"id": 135529, "fullname": "Songyou Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/135529?format=json", "institution": "Google DeepMind"}, {"id": 107485, "fullname": "Junyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107485?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 191094, "fullname": "Kathryn Heal", "url": "http://cvpr.thecvf.com/api/miniconf/users/191094?format=json", "institution": "Google"}, {"id": 191095, "fullname": "Tiancheng Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/191095?format=json", "institution": "Google"}, {"id": 191096, "fullname": "John Flynn", "url": "http://cvpr.thecvf.com/api/miniconf/users/191096?format=json", "institution": null}, {"id": 155571, "fullname": "Steve Marschner", "url": "http://cvpr.thecvf.com/api/miniconf/users/155571?format=json", "institution": "Department of Computer Science, Cornell University"}, {"id": 135661, "fullname": "Lucy Chai", "url": "http://cvpr.thecvf.com/api/miniconf/users/135661?format=json", "institution": "Google"}], "abstract": "Novel View Synthesis (NVS) has traditionally relied on models with explicit 3D inductive biases combined with known camera parameters from Structure-from-Motion (SfM) beforehand. Recent vision foundation models like VGGT take an orthogonal approach -- 3D knowledge is gained implicitly through training data and loss objectives, enabling feed-forward prediction of both camera parameters and 3D representations directly from a set of uncalibrated images. While flexible, VGGT features lack explicit multi-view geometric consistency, and we find that improving such 3D feature consistency benefits both NVS and pose estimation tasks. We introduce Selfi, a self-improving 3D reconstruction pipeline via feature alignment, transforming a VGGT backbone into a high-fidelity 3D reconstruction engine by leveraging its own  outputs as pseudo-ground-truth. Specifically, we train a lightweight feature adapter using a reprojection-based consistency loss, which distills VGGT outputs into a new geometrically-aligned feature space that captures spatial proximity in 3D. This enables state-of-the-art performance in both NVS and camera pose estimation, demonstrating that feature alignment is a highly beneficial step for downstream 3D reasoning.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40355", "url": null, "sourceid": -31886, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38972?format=json"], "related_events_ids": [38972]}, {"id": 38972, "uid": "0ccdcb33f85df88037bd0701b351d5df", "name": "Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment", "authors": [{"id": 189114, "fullname": "Youming Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189114?format=json", "institution": "Department of Computer Science, Cornell University"}, {"id": 135529, "fullname": "Songyou Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/135529?format=json", "institution": "Google DeepMind"}, {"id": 107485, "fullname": "Junyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107485?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 191094, "fullname": "Kathryn Heal", "url": "http://cvpr.thecvf.com/api/miniconf/users/191094?format=json", "institution": "Google"}, {"id": 191095, "fullname": "Tiancheng Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/191095?format=json", "institution": "Google"}, {"id": 191096, "fullname": "John Flynn", "url": "http://cvpr.thecvf.com/api/miniconf/users/191096?format=json", "institution": null}, {"id": 155571, "fullname": "Steve Marschner", "url": "http://cvpr.thecvf.com/api/miniconf/users/155571?format=json", "institution": "Department of Computer Science, Cornell University"}, {"id": 135661, "fullname": "Lucy Chai", "url": "http://cvpr.thecvf.com/api/miniconf/users/135661?format=json", "institution": "Google"}], "abstract": "Novel View Synthesis (NVS) has traditionally relied on models with explicit 3D inductive biases combined with known camera parameters from Structure-from-Motion (SfM) beforehand. Recent vision foundation models like VGGT take an orthogonal approach -- 3D knowledge is gained implicitly through training data and loss objectives, enabling feed-forward prediction of both camera parameters and 3D representations directly from a set of uncalibrated images. While flexible, VGGT features lack explicit multi-view geometric consistency, and we find that improving such 3D feature consistency benefits both NVS and pose estimation tasks. We introduce Selfi, a self-improving 3D reconstruction pipeline via feature alignment, transforming a VGGT backbone into a high-fidelity 3D reconstruction engine by leveraging its own  outputs as pseudo-ground-truth. Specifically, we train a lightweight feature adapter using a reprojection-based consistency loss, which distills VGGT outputs into a new geometrically-aligned feature space that captures spatial proximity in 3D. This enables state-of-the-art performance in both NVS and camera pose estimation, demonstrating that feature alignment is a highly beneficial step for downstream 3D reasoning.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38972", "url": null, "sourceid": 31886, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40355?format=json"], "related_events_ids": [40355]}, {"id": 39283, "uid": "9c70cb2f32394f12a8527ccb82da9942", "name": "Deconstructing the Failure of Ideal Noise Correction: A Three-Pillar Diagnosis", "authors": [{"id": 181105, "fullname": "Chen Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/181105?format=json", "institution": "Queen&#x27;s University Belfast"}, {"id": 170631, "fullname": "Zhuo ZHI", "url": "http://cvpr.thecvf.com/api/miniconf/users/170631?format=json", "institution": "UCL"}, {"id": 184718, "fullname": "Zhao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184718?format=json", "institution": "Northumbria University"}, {"id": 191763, "fullname": "Jiawei Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/191763?format=json", "institution": "Queen Mary, University of London; Southeast University"}, {"id": 183232, "fullname": "Ling Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/183232?format=json", "institution": "Hokkaido University"}, {"id": 75973, "fullname": "Nicu Sebe", "url": "http://cvpr.thecvf.com/api/miniconf/users/75973?format=json", "institution": "University of Trento"}, {"id": 86077, "fullname": "Georgios Tzimiropoulos", "url": "http://cvpr.thecvf.com/api/miniconf/users/86077?format=json", "institution": "Queen Mary University London"}, {"id": 75872, "fullname": "Ioannis Patras", "url": "http://cvpr.thecvf.com/api/miniconf/users/75872?format=json", "institution": "Queen Mary University of London"}], "abstract": "Statistically consistent methods based on the noise transition matrix ($T$) offer a theoretically grounded solution to Learning with Noisy Labels (LNL), with guarantees of convergence to the optimal clean-data classifier. In practice, however, these methods are often outperformed by empirical approaches such as sample selection, and this gap is usually attributed to the difficulty of accurately estimating $T$. The common assumption is that, given a perfect $T$, noise-correction methods would recover their theoretical advantage.In this work, we put this longstanding hypothesis to a decisive test. We conduct experiments under idealized conditions, providing correction methods with a \\emph{perfect, oracle transition matrix}. Even under these ideal conditions, we observe that these methods still suffer from performance collapse during training. This compellingly demonstrates that the failure is not fundamentally a $T$-estimation problem, but stems from a more deeply rooted flaw.To explain this behaviour, we provide a unified analysis that links three levels: macroscopic convergence states, microscopic optimisation dynamics, and information-theoretic limits on what can be learned from noisy labels. Together, these results give a formal account of why ideal noise correction fails and offer concrete guidance for designing more reliable methods for learning with noisy labels.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39283", "url": null, "sourceid": 43542, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36728, "uid": "7789a12ee7ca2b5d0728970ede4b0777", "name": "SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting", "authors": [{"id": 180698, "fullname": "Jun-Jee Chao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180698?format=json", "institution": "University of Minnesota, Minneapolis"}, {"id": 185738, "fullname": "Volkan Isler", "url": "http://cvpr.thecvf.com/api/miniconf/users/185738?format=json", "institution": "University of Texas at Austin"}], "abstract": "Reconstructing a dynamic target moving over a large area is challenging. Standard approaches for dynamic object reconstruction require dense coverage in both the viewing space and the temporal dimension, typically relying on multi-view videos captured at each time step.However, such setups are only possible in constrained environments. In real-world scenarios, observations are often sparse over time and captured sparsely from diverse viewpoints (e.g., from security cameras), making dynamic reconstruction highly ill-posed. We present SV-GS, a framework that simultaneously estimates a deformation model and the object\u2019s motion over time under sparse observations. To initialize SV-GS, we leverage a rough skeleton graph and an initial static reconstruction as inputs to guide motion estimation. (Later, we show that this input requirement can be relaxed.) Our method optimizes a skeleton-driven deformation field composed of a coarse skeleton joint pose estimator and a module for fine-grained deformations. By making only the joint pose estimator time-dependent, our model enables smooth motion interpolation while preserving learned geometric details. Experiments on synthetic datasets show that our method outperforms existing approaches under sparse observations by up to 34\\% in PSNR, and achieves comparable performance to dense monocular video methods on real-world datasets despite using significantly fewer frames. Moreover, we demonstrate that the input initial static reconstruction can be replaced by a diffusion-based generative prior, making our method more practical for real-world scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36728", "url": null, "sourceid": 31152, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40183, "uid": "5cbcd79fd63dc4d72595fa43f6df0dd4", "name": "Unblur-SLAM: Dense Neural SLAM for Blurry Inputs", "authors": [{"id": 146739, "fullname": "Qi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/146739?format=json", "institution": "University Of Amsterdam"}, {"id": 164716, "fullname": "Denis Rozumny", "url": "http://cvpr.thecvf.com/api/miniconf/users/164716?format=json", "institution": "Meta"}, {"id": 193738, "fullname": "Francesco Girlanda", "url": "http://cvpr.thecvf.com/api/miniconf/users/193738?format=json", "institution": ""}, {"id": 153393, "fullname": "Sezer Karaoglu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153393?format=json", "institution": "University of Amsterdam"}, {"id": 73915, "fullname": "Marc Pollefeys", "url": "http://cvpr.thecvf.com/api/miniconf/users/73915?format=json", "institution": "ETH Zurich / Microsoft"}, {"id": 153394, "fullname": "Theo Gevers", "url": "http://cvpr.thecvf.com/api/miniconf/users/153394?format=json", "institution": "University of Amsterdam, University of Amsterdam"}, {"id": 88372, "fullname": "Martin R. Oswald", "url": "http://cvpr.thecvf.com/api/miniconf/users/88372?format=json", "institution": "University of Amsterdam"}], "abstract": "We propose Unblur-SLAM, an RGB SLAM pipeline for sharp 3D reconstruction from blurred image inputs. In contrast to previous work, our approach is able to handle different types of blur and demonstrates state-of-the-art performance in the presence of both motion blur and defocus blur. Moreover, we adjust the computation effort with the amount of blur in the input image.As a first stage, our method uses a feed-forward image deblurring model for which we propose a suitable training scheme that can improve both tracking and mapping modules.Frames that are successfully deblurred by the feed-forward network obtain refined poses and depth through local-global multi-view optimization and loop closure. Frames that fail the first stage deblurring are directly modeled through the global 3DGS representation and an additional blur network to model multiple blurred sub-frames and simulate the blur formation process in 3D space, thereby learning sharp details and refined sub-frame poses.Experiments on several real-world datasets demonstrate consistent improvements in both pose estimation and sharp reconstruction results of geometry and texture.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40183", "url": null, "sourceid": 35125, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38452, "uid": "d4a7d6def9138bb65e8a419c473a4d16", "name": "3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding", "authors": [{"id": 189885, "fullname": "Makanjuola Adekunmi Ogunleye", "url": "http://cvpr.thecvf.com/api/miniconf/users/189885?format=json", "institution": "Virginia Polytechnic Institute and State University"}, {"id": 189886, "fullname": "Eman Abdelrahman", "url": "http://cvpr.thecvf.com/api/miniconf/users/189886?format=json", "institution": "Virginia Tech"}, {"id": 152651, "fullname": "Ismini Lourentzou", "url": "http://cvpr.thecvf.com/api/miniconf/users/152651?format=json", "institution": "University of Illinois Urbana - Champaign"}], "abstract": "Large Language Models are increasingly integrated as the cognitive core of 3D embodied agents to enable complex environmental reasoning. However, these agents tend to inherit the critical flaw of hallucination, often failing to ground their responses to their 3D view. While Visual Contrastive Decoding (VCD) is a powerful training-free method for mitigating hallucinations in 2D image-based models, it has not been adapted to the complex 3D embodied environment. In this paper, we embarked on the ambitious goal of being  the very first to bridge this gap by introducing a VCD framework for 3D embodied agents. Our method operates at inference time by generating a \"negative\" 3D context, not by blurring an image, but by applying novel distortions directly to a 3D scene graph, such as swapping object category labels or noising positional coordinates. We evaluate our approach on standard evaluation benchmarks and find that it consistently outperforms existing models. For example, in the random category of 3D-POPE, our 3D-VCD method reduced Yes-rate from 99.9\\% to 75.1\\% while simultaneously increasing precision from 50.0\\% to 62.2\\%. These results demonstrate that our training-free approach effectively curbs hallucination, yielding 3D agents that are significantly more reliable and grounded.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38452", "url": null, "sourceid": 38346, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38831, "uid": "b11712a557efbc1dda47d9024b28fc78", "name": "Learning Where to Look and How to Judge: Resolution-agnostic Image Quality Assessment with Quality-aware Saliency", "authors": [{"id": 190788, "fullname": "Hakan Emre Gedik", "url": "http://cvpr.thecvf.com/api/miniconf/users/190788?format=json", "institution": "University of Texas at Austin"}, {"id": 136220, "fullname": "Shashank Gupta", "url": "http://cvpr.thecvf.com/api/miniconf/users/136220?format=json", "institution": "University of Texas at Austin"}, {"id": 183579, "fullname": "Alan Bovik", "url": "http://cvpr.thecvf.com/api/miniconf/users/183579?format=json", "institution": "University if Colorado Boulder"}], "abstract": "No-reference image quality assessment (NR IQA) has recently benefited from deep and multimodal models, yet many SOTA systems still violate at least one basic requirement: they either discard critical quality cues via aggressive resizing, fail to generalize across resolutions, cannot be jointly trained on heterogeneous IQA datasets with mismatched MOS scales, or require prohibitive computation. We present $\\textbf{ReLIQS}$, a model for $\\textbf{Re}$solution-agnostic $\\textbf{L}$earning for $\\textbf{I}$mage $\\textbf{Q}$uality with $\\textbf{S}$aliency, which is resolution-agnostic, preserves original-resolution quality cues, learns from multiple subjective studies, and remains computationally efficient and budget-adaptive. ReLIQS is a CLIP-based multiscale patch-driven architecture that learns both \\emph{where to look} and \\emph{how to judge} quality. Fixed-size patches are sampled across multiple resolutions, including the original resolution, and encoded with a CLIP vision backbone. A lightweight Perceptual Importance Estimator then predicts IQA-specific importance maps to select a small set of informative patches, and a Quality Aspect Module aggregates their embeddings into a single image-level score. Across authentic, synthetic, and AIGC benchmarks spanning diverse resolutions and distortions, ReLIQS generalizes better than strong CNN-, CLIP-, and MLLM-based baselines with matching or reduced computational cost.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38831", "url": null, "sourceid": 33904, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38052, "uid": "1a0765221ce5d8a5c55dfaeb5321e8df", "name": "IF-Prune: Information-Flow Guided Token Pruning for Efficient Vision-Language Models", "authors": [{"id": 131553, "fullname": "Guohao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/131553?format=json", "institution": "Rochester Institute of Technology"}, {"id": 87315, "fullname": "Yufei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87315?format=json", "institution": "Nanyang Technological University"}, {"id": 94938, "fullname": "Sizhuo Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/94938?format=json", "institution": "Snap Inc."}, {"id": 188932, "fullname": "Yuege Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/188932?format=json", "institution": "OpenAI"}, {"id": 188933, "fullname": "Yuting Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188933?format=json", "institution": null}, {"id": 131620, "fullname": "ZHIQIANG TAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/131620?format=json", "institution": "Rochester Institute of Technology"}, {"id": 85791, "fullname": "Jian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85791?format=json", "institution": "Snap Inc."}], "abstract": "Vision-language models (VLMs) with dynamic resolution vision encoders achieve strong performance, but face significant efficiency challenges due to long input sequences. A common approach is to assess the importance of tokens and prune those that are less informative. Recent methods utilizing a small VLM to provide the importance map of visual tokens have outperformed existing rule-based and similarity-driven pruning approaches, particularly under high pruning ratios. However, directly using the small VLM remains unreliable, as it utilizes the aggregated visual attention weights as importance score, which can lead to noisy guidance if the generated tokens are incorrect.To address this, we invert the approach by having it detect non-informative visual tokens according to the user's input query. By adding a variational information bottleneck in the small VLM, we can approximate the entropy of each visual token as pruning guidance. Such a posteriori-guided pruning method allows the large VLM to retain its reasoning capacity with improved efficiency.Extensive experiments on eight benchmarks demonstrate the effectiveness of our approach. With only 5\\% of visual tokens retained, the large VLM preserves 95\\% of its original performance, outperforming the state of the art by 8\\%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38052", "url": null, "sourceid": 44388, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39769, "uid": "ed164c2412284ebffb9b5fe457b207b9", "name": "Learning Coordinate-based Convolutional Kernels for Continuous SE(3) Equivariant and Efficient Point Cloud Analysis", "authors": [{"id": 182189, "fullname": "Jaein Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/182189?format=json", "institution": "Seoul National University"}, {"id": 182222, "fullname": "Hee Bin Yoo", "url": "http://cvpr.thecvf.com/api/miniconf/users/182222?format=json", "institution": "Seoul National University"}, {"id": 192820, "fullname": "Dong-Sig Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/192820?format=json", "institution": "Imperial College London"}, {"id": 86445, "fullname": "Byoung-Tak Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86445?format=json", "institution": "Seoul National University"}], "abstract": "A symmetry on rigid motion is one of the salient factors in efficient learning of 3D point cloud problems. Group convolution has been a representative method to extract equivariant features, but its realizations have struggled to retain both rigorous symmetry and scalability simultaneously. We advocate utilizing the intertwiner framework to resolve this trade-off, but previous works on it, which did not achieve complete SE(3) symmetry or scalability to large-scale problems, necessitate a more advanced kernel architecture. We present Equivariant Coordinate-based Kernel Convolution, or ECKConv. It acquires SE(3) equivariance from the kernel domain defined in a double coset space, and its explicit kernel design using coordinate-based networks enhances its learning capability and memory efficiency. The experiments on diverse point cloud tasks, e.g., classification, pose registration, part segmentation, and large-scale semantic segmentation, validate the rigid equivariance, memory scalability, and outstanding performance of ECKConv compared to state-of-the-art equivariant methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39769", "url": null, "sourceid": 41315, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40058, "uid": "26dd1a5da58ce465bd88b083747df5c6", "name": "Chain-of-Models Pre-training: Rethinking Training Acceleration of CLIP Models", "authors": [{"id": 183164, "fullname": "Jiawei Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/183164?format=json", "institution": "Intel"}, {"id": 193405, "fullname": "Shigeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193405?format=json", "institution": "Intel Labs China"}, {"id": 193406, "fullname": "Chao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193406?format=json", "institution": "Intel"}, {"id": 193407, "fullname": "Xiaolong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193407?format=json", "institution": null}, {"id": 88026, "fullname": "Anbang Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88026?format=json", "institution": "Intel"}], "abstract": "In this paper, we present Chain-of-Models Pre-training (CoM-PT), a novel performance-lossless training acceleration method for vision transformer models. This approach fundamentally differs from existing acceleration methods in its core motivation: rather than optimizing each model individually, CoM-PT is designed to accelerate the training pipeline at the model family level, which scales efficiently with the family size.  Specifically, CoM-PT establishes a pre-training sequence for the model family, arranged in ascending order of parameter size, called model chain. In this chain, only the smallest model undergoes standard individual pre-training, while the other models are efficiently trained through sequential inverse knowledge transfer from their smaller predecessors by jointly reusing the knowledge in the parameter space and the feature space. As a result, CoM-PT enables all models to achieve performance comparable to standard individual training while significantly reducing the training cost. Thanks to the property of model chain, we empirically find two compelling phenomena: i) adding smaller models can even decrease the total training cost, and ii) adding medium-sized models incurs only marginal additional training cost. In light of this, our CoM-PT first unlocks the pre-training efficiency that scales favorably with family size, providing large deployment flexibility across various devices. We plan to open-source the code and encourage the community to extend it to more pre-training paradigms.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40058", "url": null, "sourceid": 39895, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39244, "uid": "810cd6adda70722fd9d2c292867b86d7", "name": "D-Prism: Differentiable Primitives for Structured Dynamic Modeling", "authors": [{"id": 90554, "fullname": "Xingyuan Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90554?format=json", "institution": "Zhejiang University"}, {"id": 155974, "fullname": "Yijin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155974?format=json", "institution": "Avolution AI"}, {"id": 191694, "fullname": "Chong Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191694?format=json", "institution": "Stanford University"}, {"id": 158823, "fullname": "Yuhang Ming", "url": "http://cvpr.thecvf.com/api/miniconf/users/158823?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 86219, "fullname": "Hujun Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86219?format=json", "institution": "Zhejiang University"}, {"id": 84995, "fullname": "Guofeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84995?format=json", "institution": "Zhejiang University"}], "abstract": "Capturing both geometry and rigid motion for structured dynamic objects, like multi-part assemblies or jointed mechanisms, remains a key challenge. Existing dynamic methods, such as deformable meshes or 3DGS, rely on unstructured representations and fail to jointly model suitable geometry and articulated motion. Primitive-based methods excel at structured static scenes, but their dynamic potential is still unexplored. We propose D-Prism, the first framework to achieve high-fidelity structured dynamic modeling by extending differentiable primitives to the dynamic domain.Specifically, we bind 3DGS to primitive surfaces, leveraging their respective strengths in appearance and geometry. We introduce a deformation network to control primitive motion, ensuring it accurately matches the object's movement. Furthermore, we design a novel adaptive control strategy to dynamically adjust primitive counts, better matching objects' true spatial footprint.Experiments confirm that our method excels at structured dynamic modeling, providing both structured geometry and precise motion tracking.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39244", "url": null, "sourceid": 40281, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39835, "uid": "bb7272eafe90d1ea3a8dd6db76688ad1", "name": "HandX+: Scaling Up Text-Conditioned Bimanual Motion Generation", "authors": [{"id": 180781, "fullname": "Zimu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180781?format=json", "institution": "Peking University"}, {"id": 180343, "fullname": "Yucheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180343?format=json", "institution": "University of Illinois Urbana Champaign"}, {"id": 148650, "fullname": "Xiyan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/148650?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 164280, "fullname": "Ziyin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/164280?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 140805, "fullname": "Sirui Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/140805?format=json", "institution": "University of Illinois Urbana-Champaign"}, {"id": 192946, "fullname": "Kai Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192946?format=json", "institution": "Snap Inc."}, {"id": 191255, "fullname": "Bing Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191255?format=json", "institution": "Snap Inc."}, {"id": 96456, "fullname": "Chuan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/96456?format=json", "institution": "Meta Reality Lab"}, {"id": 85791, "fullname": "Jian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85791?format=json", "institution": "Snap Inc."}, {"id": 73909, "fullname": "Yu-Xiong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73909?format=json", "institution": "School of Computer Science, Carnegie Mellon University"}, {"id": 91538, "fullname": "Liangyan Gui", "url": "http://cvpr.thecvf.com/api/miniconf/users/91538?format=json", "institution": "UIUC"}], "abstract": "Text-conditioned human motion and video generation have progressed rapidly, yet realistic hand motion and bimanual interaction remain significantly underexplored. Existing whole-body models often overlook the fine-grained details required for natural dexterous behavior, such as finger articulation, contact timing, and inter-hand coordination. We aim to close this gap by introducing a hand-centric animation framework. As a foundation, we consolidate large-scale motion data from diverse sources into a unified corpus with rigorous animation quality control. Through this process, we identify a limitation in most of the existing resources: the absence of high-fidelity bimanual motion data that capture nuanced finger dynamics and inter-hand collaboration. To remedy this, we collect a new dataset designed to enrich these underrepresented aspects. To scale motion-language alignment automatically, rather than relying on large language models to directly reason over raw motion sequences, we propose a decoupled paradigm. It extracts representative motion features, such as contact events and finger flexion, and then leverages LLM's reasoning to generate fine-grained, semantically rich descriptions aligned with these features.Building on our corpus and annotations, we develop benchmark models using diffusion and FSQ-based architectures and enable versatile conditioning modes, including standard text-conditioned generation, hand-reaction synthesis, motion inbetweening, keyframe-guided generation, and long-horizon temporal composition. Experiments show that our approach achieves strong text alignment, high-quality dexterous motion, and accurate contact prediction, supported by newly designed metrics tailored for hand animation. We additionally observe clear scaling behavior: larger models trained on larger, higher-quality datasets produce markedly more semantically coherent bimanual motions. All data will be released to support future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39835", "url": null, "sourceid": 46644, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37261, "uid": "47648d4793a97f09d9440bd7caf3869e", "name": "Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding", "authors": [{"id": 187030, "fullname": "Shrinidhi Kumbhar", "url": "http://cvpr.thecvf.com/api/miniconf/users/187030?format=json", "institution": "Arizona State University"}, {"id": 187031, "fullname": "Haofu Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187031?format=json", "institution": "Amazon"}, {"id": 187032, "fullname": "srikar appalaraju", "url": "http://cvpr.thecvf.com/api/miniconf/users/187032?format=json", "institution": "Amazon"}, {"id": 187033, "fullname": "Kunwar Yashraj Singh", "url": "http://cvpr.thecvf.com/api/miniconf/users/187033?format=json", "institution": "Amazon"}], "abstract": "Autoregressive (AR) vision\u2013language models (VLMs) have long dominated multimodal understanding, reasoning, and graphical user interface (GUI) grounding. Recently, discrete diffusion vision\u2013language models (DVLMs) have shown strong performance in multimodal reasoning, offering bidirectional attention, parallel token generation, and iterative refinement. However, their potential for GUI grounding remains unexplored. In this work, we evaluate whether discrete DVLMs can serve as a viable alternative to AR models for GUI grounding. We adapt LLaDA-V for single-turn action and bounding-box prediction, framing the task as text generation from multimodal input. To better capture the hierarchical structure of bounding-box geometry, we propose a hybrid masking schedule that combines linear and deterministic masking, improving grounding accuracy by up to 6.1 points in Step Success Rate (SSR) over the GUI-adapted LLaDA-V trained with linear masking. Evaluations on four datasets spanning web, desktop, and mobile interfaces show that the adapted diffusion model with hybrid masking consistently outperforms the linear-masked variant and performs competitively with autoregressive counterparts despite limited pretraining. Systematic ablations reveal that increasing diffusion steps, generation length, and block length improves accuracy but also increases latency, with accuracy plateauing beyond a certain number of diffusion steps. Expanding the training data with diverse GUI domains further reduces latency by about 1.3 seconds and improves grounding accuracy by an average of 20 points across benchmarks. These results demonstrate that discrete DVLMs are a promising modeling framework for GUI grounding and represent an important step toward diffusion-based GUI agents.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37261", "url": null, "sourceid": 40749, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40179, "uid": "779259368ccf87da1146127db1360be8", "name": "Generative Adversarial Perturbations with Cross-paradigm Transferability on Localized Crowd Counting", "authors": [{"id": 182611, "fullname": "Alabi Mehzabin Anisha", "url": "http://cvpr.thecvf.com/api/miniconf/users/182611?format=json", "institution": "University of South Florida"}, {"id": 193730, "fullname": "Guangjing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193730?format=json", "institution": "University of South Florida"}, {"id": 193731, "fullname": "Sriram Chellappan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193731?format=json", "institution": null}], "abstract": "State-of-the-art crowd counting and localization are primarily modeled using two paradigms: density maps and point regression. Given the field's security ramifications, there is active interest in model robustness against adversarial attacks. Recent studies have demonstrated transferability across density-map-based approaches via adversarial patches, but cross-paradigm attacks (e.g., from density map-based models to point regression-based models) remain unexplored. We introduce a novel adversarial framework that compromises both density map and point regression architectural paradigms through a comprehensive multi-task loss optimization. For point-regression models, we employ scene-density-specific high-confidence logit suppression; for density-map approaches, we use peak-targeted density map suppression. Both are combined with model-agnostic perceptual constraints to ensure that perturbations are effective and imperceptible to the human eye. Extensive experiments demonstrate the effectiveness of our attack, achieving on average a $7\\times$ increase in Mean Absolute Error compared to clean images while maintaining competitive visual quality, and successfully transferring across seven state-of-the-art crowd models with transfer ratios ranging from $0.55$ to $1.69$. Our approach strikes a balance between attack effectiveness and imperceptibility compared to state-of-the-art transferable attack strategies.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40179", "url": null, "sourceid": 43389, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38164, "uid": "274a743f5ed8555e1183b6b752cbeba7", "name": "CGHair: Compact Gaussian Hair Reconstruction with Card Clustering", "authors": [{"id": 86407, "fullname": "Haimin Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/86407?format=json", "institution": "Shanghaitech University"}, {"id": 158877, "fullname": "Srinjay Sarkar", "url": "http://cvpr.thecvf.com/api/miniconf/users/158877?format=json", "institution": "Indian Institute of Science"}, {"id": 189192, "fullname": "Albert Mosella-Montoro", "url": "http://cvpr.thecvf.com/api/miniconf/users/189192?format=json", "institution": "Carnegie Mellon University"}, {"id": 189193, "fullname": "Francisco Vicente Carrasco", "url": "http://cvpr.thecvf.com/api/miniconf/users/189193?format=json", "institution": "CMU, Carnegie Mellon University"}, {"id": 75815, "fullname": "Fernando De la Torre", "url": "http://cvpr.thecvf.com/api/miniconf/users/75815?format=json", "institution": "Carnegie Mellon"}], "abstract": "We present a compact pipeline for high-fidelity hair reconstruction from multi-view images. While recent 3D Gaussian Splatting (3DGS) methods achieve realistic results, they often require millions of primitives, leading to high storage and rendering costs. Observing that hair exhibits structural and visual similarities across a hairstyle, we cluster strands into representative hair cards and group these into shared texture codebooks. Our approach integrates this structure with 3DGS rendering, significantly reducing reconstruction time and storage while maintaining comparable visual quality. In addition, we propose a generative prior accelerated method to reconstruct the initial strand geometry from a set of images. Our experiments demonstrate a 4-fold reduction in strand reconstruction time and achieve comparable rendering performance with over 200\u00d7 lower memory footprint.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38164", "url": null, "sourceid": 30593, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40211, "uid": "d75f5329450826822ae35da4c1e7d103", "name": "VideoAutoThink: Video Auto Reasoning via Thinking Once, Answering Twice", "authors": [{"id": 76480, "fullname": "Shuming Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76480?format=json", "institution": "KAUST"}, {"id": 90942, "fullname": "Mingchen Zhuge", "url": "http://cvpr.thecvf.com/api/miniconf/users/90942?format=json", "institution": "King Abdullah University of Science and Technology"}, {"id": 193790, "fullname": "Changsheng Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193790?format=json", "institution": "Meta Inc."}, {"id": 193791, "fullname": "Jun Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193791?format=json", "institution": "Facebook"}, {"id": 87109, "fullname": "Lemeng Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87109?format=json", "institution": "University of Texas, Austin"}, {"id": 193792, "fullname": "Zechun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193792?format=json", "institution": "Meta Inc."}, {"id": 139142, "fullname": "Chenchen Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/139142?format=json", "institution": "Meta AI"}, {"id": 129012, "fullname": "zhipeng cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/129012?format=json", "institution": "Intel Labs"}, {"id": 156799, "fullname": "Chong Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/156799?format=json", "institution": "Facebook"}, {"id": 86851, "fullname": "Haozhe Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86851?format=json", "institution": "King Abdullah University of Science and Technology"}, {"id": 176926, "fullname": "Ernie Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176926?format=json", "institution": "Meta "}, {"id": 156800, "fullname": "Saksham Suri", "url": "http://cvpr.thecvf.com/api/miniconf/users/156800?format=json", "institution": "Facebook"}, {"id": 156979, "fullname": "Hongyu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156979?format=json", "institution": "Meta                  Reality Labs"}, {"id": 127452, "fullname": "Qi Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/127452?format=json", "institution": "Alibaba Group"}, {"id": 193793, "fullname": "Wei Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193793?format=json", "institution": "AI at Meta"}, {"id": 128406, "fullname": "Balakrishnan Varadarajan", "url": "http://cvpr.thecvf.com/api/miniconf/users/128406?format=json", "institution": "Meta"}, {"id": 153070, "fullname": "Zhuang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153070?format=json", "institution": "Princeton University"}, {"id": 130912, "fullname": "Hu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130912?format=json", "institution": "FAIR, Multimodal Foundation"}, {"id": 193794, "fullname": "Florian Bordes", "url": "http://cvpr.thecvf.com/api/miniconf/users/193794?format=json", "institution": "Meta"}, {"id": 87353, "fullname": "Raghuraman Krishnamoorthi", "url": "http://cvpr.thecvf.com/api/miniconf/users/87353?format=json", "institution": "Facebook"}, {"id": 75441, "fullname": "Bernard Ghanem", "url": "http://cvpr.thecvf.com/api/miniconf/users/75441?format=json", "institution": "KAUST"}, {"id": 87371, "fullname": "Vikas Chandra", "url": "http://cvpr.thecvf.com/api/miniconf/users/87371?format=json", "institution": "Facebook"}, {"id": 85197, "fullname": "Yunyang Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/85197?format=json", "institution": "Facebook"}], "abstract": "Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models in video understanding. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher compute cost. Motivated by this, we propose VideoAutoThink, a video understanding framework that adopts a ``reason-when-necessary'' strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised with verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAutoThink achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 144 to just 44 tokens. Moreover, we observe low activation of thinking on perception-oriented tasks, but higher activation on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40211", "url": null, "sourceid": 36716, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40246, "uid": "0e1290a0f5dd521ac317b2fa20bf511a", "name": "EgoAVU: Egocentric Audio-Visual Understanding", "authors": [{"id": 91138, "fullname": "Ashish Seth", "url": "http://cvpr.thecvf.com/api/miniconf/users/91138?format=json", "institution": "Indian Institute of Technology, Madras"}, {"id": 193869, "fullname": "Xinhao Mei", "url": "http://cvpr.thecvf.com/api/miniconf/users/193869?format=json", "institution": "Facebook"}, {"id": 193790, "fullname": "Changsheng Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193790?format=json", "institution": "Meta Inc."}, {"id": 133859, "fullname": "Varun Nagaraja", "url": "http://cvpr.thecvf.com/api/miniconf/users/133859?format=json", "institution": "Meta"}, {"id": 176926, "fullname": "Ernie Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176926?format=json", "institution": "Meta "}, {"id": 127062, "fullname": "Gregory P. Meyer", "url": "http://cvpr.thecvf.com/api/miniconf/users/127062?format=json", "institution": "Cruise"}, {"id": 193870, "fullname": "Gael Le Lan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193870?format=json", "institution": "Meta"}, {"id": 85197, "fullname": "Yunyang Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/85197?format=json", "institution": "Facebook"}, {"id": 87371, "fullname": "Vikas Chandra", "url": "http://cvpr.thecvf.com/api/miniconf/users/87371?format=json", "institution": "Facebook"}, {"id": 193871, "fullname": "Yangyang Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/193871?format=json", "institution": "Meta"}, {"id": 85839, "fullname": "Dinesh Manocha", "url": "http://cvpr.thecvf.com/api/miniconf/users/85839?format=json", "institution": "University of Maryland, College Park"}, {"id": 129012, "fullname": "zhipeng cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/129012?format=json", "institution": "Intel Labs"}], "abstract": "Understanding egocentric videos plays a vital role for embodied intelligence. Recent multi-modal large language models (MLLMs) can accept both visual and audio inputs. However, due to the challenge of obtaining text labels with coherent joint modality information, whether MLLMs can jointly understand both modalities in egocentric videos remains under-explored. To address this problem, we introduce EgoAVU, a scalable data engine to automatically generate egocentric audio-visual narrations, questions and answers. EgoAVU enriches human narrations with multimodal context and generates audio-visual narrations through cross-modal correlation modeling. Token-based video filtering and modular, graph based curation ensure both the data diversity and quality. Leveraging EgoAVU, we construct EgoAVU-Instruct \u2014 a large scale training dataset of 3M samples, and EgoAVU-Bench \u2014 a manually verified evaluation split of 3K samples. EgoAVU-Bench clearly reveals the limitation of existing MLLMs: they bias heavily towards visual signals, often neglecting audio cues or failing to correspond audio with the visual source. Finetuning MLLMs on EgoAVU-Instruct effectively solves this issue, enabling up to 113% performance improvement on EgoAVU-Bench. Such benefit can also transfer to other benchmarks such as EgoTempo and EgoIllusion, achieving up to 28% relative performance gain. Code will be released to the community.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40246", "url": null, "sourceid": 43428, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39595, "uid": "ab661b15e494eb0afffd940606881fc2", "name": "UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition", "authors": [{"id": 192435, "fullname": "Zhuangcheng Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192435?format=json", "institution": "Carnegie Mellon University"}, {"id": 183759, "fullname": "Guang Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183759?format=json", "institution": "Nanjing University"}, {"id": 155031, "fullname": "Bin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155031?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 157145, "fullname": "Zhiyuan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/157145?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 172570, "fullname": "Qintong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/172570?format=json", "institution": "Peking University"}, {"id": 107541, "fullname": "Weijia Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/107541?format=json", "institution": "Sun Yat-sen University"}, {"id": 157152, "fullname": "Chao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157152?format=json", "institution": "Shanghai AI Lab"}, {"id": 87494, "fullname": "Bo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87494?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 87495, "fullname": "Botian Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/87495?format=json", "institution": "Shanghai AI Lab"}, {"id": 188923, "fullname": "Jiang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188923?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 156267, "fullname": "Wentao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156267?format=json", "institution": "Peking University"}, {"id": 73990, "fullname": "Conghui He", "url": "http://cvpr.thecvf.com/api/miniconf/users/73990?format=json", "institution": "Shanghai AI Lab"}], "abstract": "This paper introduces UniMERNet, a high-accuracy, computation-efficient algorithm for Mathematical Expression Recognition (MER) across diverse real-world scenarios. To facilitate UniMERNet's training, we constructed UniMER-1M, a million-scale dataset whose unprecedented diversity endows the model with robust generalization ability. Through in-depth analysis, we discover a distinctive raster-scan pattern (left-to-right, top-to-bottom) in the attention distribution of Transformer models for MER tasks, which closely aligns with human reading habits. Based on this key finding, we design an innovative  **Raster-Scan Attention** mechanism that employs a ``horizontal-first, vertical-second\" sequential attention computation strategy. This approach not only successfully reduces computational complexity from $ \\mathcal{O}(NH^2W^2D) $ to $ \\mathcal{O}(NHWD(H + W)) $, but also enables the model to capture long-range dependencies more efficiently, achieving recognition performance comparable to global attention. Leveraging both UniMER-1M and our innovative attention mechanism, UniMERNet achieves state-of-the-art performance across four real-world scenarios while significantly reducing computational resources compared to global attention: over **1.2$\\times$** memory savings during training, approximately **10$\\times$** memory reduction during inference, and **5$\\times$** speed with slightly improved accuracy. All resources will be publicly released to advance MER research further.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39595", "url": null, "sourceid": 38902, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36219, "uid": "325f53e3727938dc321cafd8b8ce8dcb", "name": "3D-IDE: 3D Implicit Depth Emergent", "authors": [{"id": 184476, "fullname": "Chushan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184476?format=json", "institution": "Australian National University"}, {"id": 134965, "fullname": "Ruihan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/134965?format=json", "institution": "The University of Queensland"}, {"id": 69384, "fullname": "Jinguang Tong", "url": "http://cvpr.thecvf.com/api/miniconf/users/69384?format=json", "institution": "Australian National University"}, {"id": 184477, "fullname": "Yikai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184477?format=json", "institution": "Tsinghua University"}, {"id": 92749, "fullname": "Hongdong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/92749?format=json", "institution": "Australian National University"}], "abstract": "Leveraging 3D information within Multimodal Large Language Models (MLLMs) has recently shown significant advantages for indoor scene understanding. However, existing methods, including those using explicit ground-truth 3D positional encoding and those grafting external 3D foundation models for implicit geometry, struggle with the trade-off in 2D-3D representation fusion, leading to suboptimal deployment. To this end, we propose 3D-Implicit Depth Emergence, a method that reframes 3D perception as an emergent property derived from geometric self-supervision rather than explicit encoding. Our core insight is the Implicit Geometric Emergence Principle: by strategically leveraging privileged geometric supervision through mechanisms like a fine-grained geometry validator and global representation constraints, we construct an information bottleneck. This bottleneck forces the model to maximize the mutual information between visual features and 3D structures, allowing 3D awareness to emerge naturally within a unified visual representation. Unlike existing approaches, our method enables 3D perception to emerge implicitly, disentangling features in dense regions and, crucially, eliminating depth and pose dependencies during inference with zero latency overhead. This paradigm shift from external grafting to implicit emergence represents a fundamental rethinking of 3D knowledge integration in visual-language models. Extensive experiments demonstrate that our method surpasses SOTA on multiple 3D scene understanding benchmarks. Our approach achieves a 55\\% reduction in inference latency while maintaining strong performance across diverse downstream tasks, underscoring the effectiveness of meticulously designed auxiliary objectives for dependency-free 3D understanding. Source code will be made publicly available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36219", "url": null, "sourceid": 39170, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38624, "uid": "8becabfd3781cac86c0988f11d76e690", "name": "MMFace-DiT: A Dual-Stream Diffusion Transformer for Multimodal Face Generation", "authors": [{"id": 190329, "fullname": "Bharath Krishnamurthy", "url": "http://cvpr.thecvf.com/api/miniconf/users/190329?format=json", "institution": "University of North Texas"}, {"id": 148762, "fullname": "Ajita Rattani", "url": "http://cvpr.thecvf.com/api/miniconf/users/148762?format=json", "institution": "University of North Texas"}], "abstract": "Recent multimodal face generation models address the spatial control limitations of text-to-image diffusion models by augmenting text-based conditioning with spatial priors such as segmentation masks, sketches, or edge maps. This multimodal fusion enables controllable synthesis aligned with both high-level semantic intent and low-level structural layout. However, most existing approaches typically extend pre-trained text-to-image pipelines by appending auxiliary control modules or stitching together separate uni-modal networks. These ad hoc designs inherit architectural constraints, duplicate parameters, and often fail under conflicting modalities or mismatched latent spaces, limiting their ability to perform synergistic fusion across semantic and spatial domains. We introduce MMFace-DiT, a unified dual-stream diffusion transformer engineered for synergistic multimodal face synthesis. Its core novelty lies in a dual-stream transformer block that processes spatial (mask/sketch) and semantic (text) tokens in parallel, deeply fusing them through a shared Rotary Position-Embedded (RoPE) Attention mechanism. This design prevents modal dominance and ensures strong adherence to both text and structural priors to achieve unprecedented spatial\u2013semantic consistency for controllable face generation. Furthermore, a novel Modality Embedder enables a single cohesive model to dynamically adapt to varying spatial conditions without retraining. MMFace-DiT achieves a 40% improvement in visual fidelity and prompt alignment over five state-of-the-art multimodal face generation models, establishing a flexible new paradigm for end-to-end controllable generative modeling.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38624", "url": null, "sourceid": 31013, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39902, "uid": "0a175e9ffb2cc47dffead9c1cf4eb9fe", "name": "Rascene: High-Fidelity 3D Scene Imaging with mmWave Communication Signals", "authors": [{"id": 183803, "fullname": "Kunzhe Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/183803?format=json", "institution": "Michigan State University"}, {"id": 193079, "fullname": "Geo Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/193079?format=json", "institution": "Michigan State University"}, {"id": 73926, "fullname": "Xiaoming Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73926?format=json", "institution": "Michigan State University"}, {"id": 193080, "fullname": "Huacheng Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193080?format=json", "institution": "Michigan State University"}], "abstract": "Robust 3D environmental perception is critical for applications like autonomous navigation and robotics, yet existing optical sensors like cameras and LiDAR fail in adverse conditions such as smoke, fog, and non-ideal lighting. While specialized radar systems can operate in these conditions, their reliance on bespoke, ultra-wideband hardware and licensed spectrum limits their scalability and cost-effectiveness. This paper introduces Rascene, a novel framework that enables high-fidelity 3D imaging by repurposing ubiquitous mmWave OFDM communication signals. Recognizing that a single-frame RF signal is inherently sparse, noisy, and highly ambiguous, the key innovation of Rascene is a multi-frame 3D imaging framework designed to fuse information from signals captured across multiple, arbitrary poses. This framework leverages a spatially adaptive fusion mechanism to find geometric consensus across the multiple views, effectively suppressing multipath artifacts while preserving sparse geometric details. Experiments demonstrate that our method reconstructs 3D scenes with high precision, providing a new pathway for low-cost, scalable, and robust 3D perception.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39902", "url": null, "sourceid": 30824, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36784, "uid": "41ab32a8ee30c68af11c6e320db4f5d0", "name": "MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization", "authors": [{"id": 181365, "fullname": "Ashutosh Chaubey", "url": "http://cvpr.thecvf.com/api/miniconf/users/181365?format=json", "institution": "PhD Student@USC, Intern@Meta"}, {"id": 185864, "fullname": "Jiacheng Pang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185864?format=json", "institution": "University of Southern California"}, {"id": 158581, "fullname": "Mohammad Soleymani", "url": "http://cvpr.thecvf.com/api/miniconf/users/158581?format=json", "institution": "University of Southern California"}], "abstract": "Omni-modal large language models (omni LLMs) have recently achieved strong performance across audiovisual understanding tasks, yet they remain highly susceptible to cross-modal hallucinations arising from spurious correlations and dominant language priors. In this work, we propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni LLMs. MoD-DPO introduces modality-aware regularization terms that explicitly enforce invariance to corruptions in irrelevant modalities and sensitivity to perturbations in relevant modalities, thereby reducing unintended cross-modal interactions. To further mitigate over-reliance on textual priors, we incorporate a language-prior debiasing penalty that discourages hallucination-prone text-only responses. Extensive experiments across multiple audio-visual hallucination benchmarks demonstrate that MoD-DPO consistently improves perception accuracy and hallucination resistance, outperforming previous preference optimization baselines under similar training budgets. Our findings underscore the importance of modality-faithful alignment and demonstrate a scalable path toward more reliable and resilient multimodal foundation models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36784", "url": "https://mod-dpo.github.io/", "sourceid": 45095, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38913, "uid": "ab65e1a4d850fa15d38469a1ad02ce90", "name": "Lynx: Towards High-Fidelity Personalized Video Generation", "authors": [{"id": 136597, "fullname": "Shen Sang", "url": "http://cvpr.thecvf.com/api/miniconf/users/136597?format=json", "institution": "TikTok"}, {"id": 158240, "fullname": "Tiancheng Zhi", "url": "http://cvpr.thecvf.com/api/miniconf/users/158240?format=json", "institution": "ByteDance"}, {"id": 190952, "fullname": "Tianpei Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190952?format=json", "institution": "Morpheus AI"}, {"id": 158241, "fullname": "Jing Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/158241?format=json", "institution": "ByteDance Inc"}, {"id": 85733, "fullname": "Linjie Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/85733?format=json", "institution": "ByteDance Inc."}], "abstract": "We present Lynx, a high-fidelity model for personalized video synthesis from a single input image. Built on an open-source Diffusion Transformer (DiT) foundation model, Lynx introduces two lightweight adapters to ensure identity fidelity. The ID-adapter employs a Perceiver Resampler to convert ArcFace-derived facial embeddings into compact identity tokens for conditioning, while the Ref-adapter integrates dense VAE features from a frozen reference pathway, injecting fine-grained details across all transformer layers through cross attention. These modules collectively enable robust identity preservation while maintaining temporal coherence and visual realism. Through evaluation on a curated benchmark of 40 subjects and 20 unbiased prompts, which yielded 800 test cases, Lynx has demonstrated superior face resemblance, competitive prompt following, and strong video quality, thereby advancing the state of personalized video generation. Code and models will be released publicly upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38913", "url": null, "sourceid": 41141, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38599, "uid": "30627d0d71028dbab2dfc4335ba3ad32", "name": "When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm", "authors": [{"id": 176161, "fullname": "Ye Leng", "url": "http://cvpr.thecvf.com/api/miniconf/users/176161?format=json", "institution": "CISPA Helmholtz Center for Information Security"}, {"id": 190261, "fullname": "Junjie Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190261?format=json", "institution": "CISPA Helmholtz Center for Information Security"}, {"id": 190262, "fullname": "Mingjie Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190262?format=json", "institution": "CISPA Helmholtz Center for Information Security"}, {"id": 129957, "fullname": "Chenhao Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/129957?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 129955, "fullname": "Chao Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/129955?format=json", "institution": "Xi\u2019an Jiaotong University"}, {"id": 85953, "fullname": "Michael Backes", "url": "http://cvpr.thecvf.com/api/miniconf/users/85953?format=json", "institution": "CISPA Helmholtz Center for Information Security"}, {"id": 190263, "fullname": "Yun Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190263?format=json", "institution": "Flexera"}, {"id": 85984, "fullname": "Yang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85984?format=json", "institution": "CISPA Helmholtz Center for Information Security"}], "abstract": "Recently, multimodal large language models (MLLMs) have emerged as a unified paradigm for language and image generation. Our work shows that MLLMs pair usability with higher risks, highlighting the need for adaptive safeguards to mitigate real-world harms.Compared with diffusion models, MLLMs possess a much stronger capability for semantic understanding, enabling them to process more complex textual inputs and comprehend richer contextual meanings. However, this enhanced semantic ability may also introduce new and potentially greater safety risks.Taking diffusion models as a reference point, we systematically analyze and compare the safety risks of emerging MLLMs along two dimensions: unsafe content generation and fake image synthesis.Across multiple unsafe generation benchmark datasets, we observe that MLLMs tend to generate more unsafe images than diffusion models. This difference partly arises because diffusion models often fail to interpret abstract prompts, producing corrupted outputs, whereas MLLMs can comprehend these prompts and generate unsafe content.For current advanced fake image detectors, MLLM-generated images are also notably harder to identify. Even when detectors are retrained with MLLMs-specific data, they can still be bypassed by simply providing MLLMs with longer and more descriptive inputs.Our measurements indicate that the emerging safety risks of the cutting-edge generative paradigm, MLLMs, have not been sufficiently recognized, posing new challenges to real-world safety.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38599", "url": null, "sourceid": 40797, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38076, "uid": "f0b314f185b80cf35d986e298db53fe3", "name": "Scaling Spatial and Temporal Context for Robotic Imitation Learning Policies With Scene Graphs", "authors": [{"id": 180489, "fullname": "Jianing Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/180489?format=json", "institution": "University of Pennsylvania"}, {"id": 188998, "fullname": "Qinhe Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188998?format=json", "institution": "University of Pennsylvania, University of Pennsylvania"}, {"id": 188999, "fullname": "Emmanuel Panov", "url": "http://cvpr.thecvf.com/api/miniconf/users/188999?format=json", "institution": "Robotics and AI Institute"}, {"id": 155357, "fullname": "Leonor Fermoselle", "url": "http://cvpr.thecvf.com/api/miniconf/users/155357?format=json", "institution": "Boston Dynamics AI Institute"}, {"id": 168798, "fullname": "Dinesh Jayaraman", "url": "http://cvpr.thecvf.com/api/miniconf/users/168798?format=json", "institution": "University of Pennsylvania, University of Pennsylvania"}, {"id": 155359, "fullname": "Bernadette Bucher", "url": "http://cvpr.thecvf.com/api/miniconf/users/155359?format=json", "institution": "University of Michigan - Ann Arbor"}, {"id": 189000, "fullname": "Tarik Kelestemur", "url": "http://cvpr.thecvf.com/api/miniconf/users/189000?format=json", "institution": "Boston Dynamics AI Institute"}], "abstract": "Imitation learning enables robots to learn how to execute tasks via observation. However, real-world environments like homes and offices are often severely partially observed due to their large spatial scales. In addition, many tasks involve executing a series of subtasks requiring autonomous robots to reason over extended time horizons. To address these challenges, we propose using scene graphs as an explicit and structured memory mechanism in imitation learning. By maintaining a dynamic scene graph that captures object-centric relationships and their evolution over time, our method allows the agent to retain relevant historical context during task execution to efficiently reason over incrementally accrued scene information. Our experiments on simulated mobile manipulation and real-world tabletop manipulation demonstrate that our approach substantially improves policy performance, particularly in settings that demand long-term reasoning and robust generalization under partial observability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38076", "url": null, "sourceid": 46289, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38916, "uid": "e77d89a5cfa17ff55d0b928bf21b2d0f", "name": "LAMP: Language-Assisted Motion Planning for Controllable Video Generation", "authors": [{"id": 182141, "fullname": "Muhammed Burak K\u0131z\u0131l", "url": "http://cvpr.thecvf.com/api/miniconf/users/182141?format=json", "institution": "Ko\u00e7 University"}, {"id": 176672, "fullname": "Enes \u015eanl\u0131", "url": "http://cvpr.thecvf.com/api/miniconf/users/176672?format=json", "institution": "Ko\u00e7 University"}, {"id": 85661, "fullname": "Niloy J. Mitra", "url": "http://cvpr.thecvf.com/api/miniconf/users/85661?format=json", "institution": "University College London"}, {"id": 182163, "fullname": "Erkut Erdem", "url": "http://cvpr.thecvf.com/api/miniconf/users/182163?format=json", "institution": "Hacettepe University"}, {"id": 190958, "fullname": "Aykut Erdem", "url": "http://cvpr.thecvf.com/api/miniconf/users/190958?format=json", "institution": "Ko\u00e7 University"}, {"id": 86074, "fullname": "Duygu Ceylan", "url": "http://cvpr.thecvf.com/api/miniconf/users/86074?format=json", "institution": "Adobe Systems"}], "abstract": "Recent advances in video generation have achieved remarkable progress in visual fidelity and controllability, enabling conditioning not only on text but also on structural layout and motion signals. Among these, motion control (i.e., specifying both object dynamics and camera trajectories) is particularly critical for directing complex, cinematic scenes, yet existing interfaces remain limited. To address this gap, we introduce LAMP that leverages large language models~(LLMs) as motion planners to translate natural language descriptions into explicit 3D trajectories for both dynamic objects and (relatively defined) cameras. Specifically, we fine-tune an LLM to generate frame-wise 3D bounding-box trajectories for objects and, conditioned on these, produce corresponding 3D camera paths, which are then converted into generator-compatible 2D control signals. We enable this by constructing a large-scale paired datasets through a combination of procedurally generated text\u2013trajectory pairs and augmented real video datasets with 3D annotations. Experiments demonstrate improved controllability and alignment with user intent compared to state-of-the-art alternatives, establishing the first framework for joint object\u2013camera trajectory generation directly from natural language.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38916", "url": null, "sourceid": 32493, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36201, "uid": "b44d4a198e1a6f8a50e48ba713684725", "name": "EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents", "authors": [{"id": 131682, "fullname": "Wenjia Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131682?format=json", "institution": "University of Hong Kong"}, {"id": 139136, "fullname": "Liang Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/139136?format=json", "institution": "The University of Hong Kong"}, {"id": 184432, "fullname": "Huaijin Pi", "url": "http://cvpr.thecvf.com/api/miniconf/users/184432?format=json", "institution": "The University of Hong Kong"}, {"id": 184433, "fullname": "Yuke Lou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184433?format=json", "institution": null}, {"id": 147045, "fullname": "Xuqian Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/147045?format=json", "institution": "Tampere University"}, {"id": 184434, "fullname": "Yifan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184434?format=json", "institution": "Tencent Hunyuan"}, {"id": 127593, "fullname": "Zhouyingcheng Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/127593?format=json", "institution": "the University of Hong Kong, University of Hong Kong"}, {"id": 89773, "fullname": "Lei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89773?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 89511, "fullname": "Rishabh Dabral", "url": "http://cvpr.thecvf.com/api/miniconf/users/89511?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 75465, "fullname": "Christian Theobalt", "url": "http://cvpr.thecvf.com/api/miniconf/users/75465?format=json", "institution": "MPI Informatik"}, {"id": 89090, "fullname": "Taku Komura", "url": "http://cvpr.thecvf.com/api/miniconf/users/89090?format=json", "institution": "the University of Hong Kong, University of Hong Kong"}], "abstract": "Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and acting.However, existing capture systems typically rely on costly studio setups and wearable devices, limiting the large-scale collection of scene-conditioned human motion data in the wild.To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes within a unified metric world coordinate frame.The proposed method allows metric-scale and scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry seamlessly.Based on the collected data, we empower three embodied AI tasks: monocular human-scene-reconstruction, where we fine-tune on feedforward models that output metric-scale, world-space aligned humans and scenes; physics-based character animation, where we prove our data could be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via sim-to-real RL to replicate human motions depicted in videos. Experimental results validate the effectiveness of our pipeline and its contributions towards advancing embodied AI research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36201", "url": null, "sourceid": 45732, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39690, "uid": "ee04d7461f4c27af47fb49708e132073", "name": "Composite-Attribute Person Re-Identification via Pose-Guided Disentanglement", "authors": [{"id": 179328, "fullname": "Kartik Patwari", "url": "http://cvpr.thecvf.com/api/miniconf/users/179328?format=json", "institution": "University of California, Davis"}, {"id": 184708, "fullname": "Noranart Vesdapunt", "url": "http://cvpr.thecvf.com/api/miniconf/users/184708?format=json", "institution": "Amazon"}, {"id": 129838, "fullname": "Chien-Yi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129838?format=json", "institution": "NVIDIA"}, {"id": 192652, "fullname": "Dawei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192652?format=json", "institution": "Amazon"}, {"id": 88646, "fullname": "Cong Phuoc Huynh", "url": "http://cvpr.thecvf.com/api/miniconf/users/88646?format=json", "institution": "Amazon"}, {"id": 140562, "fullname": "Ning Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/140562?format=json", "institution": "Amazon"}, {"id": 192653, "fullname": "Chen-Nee Chuah", "url": "http://cvpr.thecvf.com/api/miniconf/users/192653?format=json", "institution": "University of California, Davis"}, {"id": 138144, "fullname": "Kah Fu Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/138144?format=json", "institution": "Amazon"}], "abstract": "Recent advancement in vision-language models have enabled multi-modal person re-identification (Re-ID), where the system takes both an image and a text query to identify matching individuals. While previous state-of-the-art methods perform well with detailed, sentence-level descriptions, we found that their Recall@1 drops by half when using short, keyword-based queries due to ambiguity, training biases, and under-represented attributes. Despite this challenge, short queries provide a more natural and efficient user experience, requiring less effort and allowing for iterative refinement. To address this limitation, we introduce a new problem setting, Composite-Attributes Person Re-ID (CA-ReID), along with a fine-grained composite attribute dataset with queries belonging to varying levels of ambiguity. We further propose two methods: Dense Disentangling Loss to promote attribute-specific embeddings, and Part-Aware Representations that use pose estimation to align textual attributes with relevant body regions. Our method sets a new state of the art on the new CA-ReID benchmark (up to +17% Recall@1) and performs on par with prior methods on existing CC-ReID benchmarks. We will release our dataset to support this emerging direction.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39690", "url": null, "sourceid": 35706, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38962, "uid": "6bacc516b88b78d62e578cd9ad50c08b", "name": "Artiverse: A Diverse and Physically Grounded Dataset for Articulated Objects", "authors": [{"id": 182501, "fullname": "Denys Iliash", "url": "http://cvpr.thecvf.com/api/miniconf/users/182501?format=json", "institution": "Simon Fraser University"}, {"id": 92222, "fullname": "Jiayi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/92222?format=json", "institution": "Simon Fraser University"}, {"id": 191070, "fullname": "Egor Fokin", "url": "http://cvpr.thecvf.com/api/miniconf/users/191070?format=json", "institution": "Simon Fraser University"}, {"id": 72738, "fullname": "Qirui Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/72738?format=json", "institution": "Simon Fraser University"}, {"id": 98634, "fullname": "Ali Mahdavi Amiri", "url": "http://cvpr.thecvf.com/api/miniconf/users/98634?format=json", "institution": "Simon Fraser University"}, {"id": 75894, "fullname": "Manolis Savva", "url": "http://cvpr.thecvf.com/api/miniconf/users/75894?format=json", "institution": "Simon Fraser University"}, {"id": 88970, "fullname": "Angel Xuan Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88970?format=json", "institution": "Simon Fraser University"}], "abstract": "We present Artiverse, a diverse and physically grounded dataset of high-quality articulated 3D objects designed for realistic functional modeling and simulation. Artiverse contains 5.4K human-authored objects across a broad range of 88 categories, aggregated from multiple 3D static repositories. Objects are annotated with functional parts, interior structures, realistic kinematic relationships including multi-DoF joints, and physical attributes such as metric scale, material, and mass. We develop a semi-automated annotation pipeline that combines few-shot segmentation, geometric reasoning, vision-language model inference, and multi-stage human verification to achieve high-quality and efficient annotation, reducing manual annotation time by over 30\\%. We demonstrate the value of Artiverse on tasks of part mobility analysis, articulated object generation, and physics-based interaction. Artiverse provides a data resource to advance functional understanding for articulated objects.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38962", "url": null, "sourceid": 44186, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39752, "uid": "f8548a8d98a27fe73f2558a90f989c5c", "name": "Live Interactive Training for Video Segmentation", "authors": [{"id": 181818, "fullname": "Xinyu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181818?format=json", "institution": "Cornell University"}, {"id": 87931, "fullname": "Haozheng Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87931?format=json", "institution": "Tencent Media Lab"}, {"id": 192792, "fullname": "Yihong Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/192792?format=json", "institution": "Cornell University"}, {"id": 89201, "fullname": "Bharath Hariharan", "url": "http://cvpr.thecvf.com/api/miniconf/users/89201?format=json", "institution": "Cornell University"}, {"id": 156412, "fullname": "Jennifer J. Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/156412?format=json", "institution": "Cornell University"}], "abstract": "Interactive video segmentation often requires many user interventions for robust performance in challenging scenarios (e.g., occlusions, object separations, camouflage, etc.).Yet, even state-of-the-art models like SAM2 use corrections only for immediate fixes without *learning* from this feedback, leading to inefficient, repetitive user effort. To address this, we introduce Live Interactive Training (LIT), a novel framework for prompt-based visual systems where models also learn online from human corrections at inference time. Our primary instantiation, LIT-LoRA, implements this by continually updating a lightweight LoRA module on-the-fly. When a user provides a correction, this module is rapidly trained on that feedback, allowing the vision system to improve performance on subsequent frames of the same video.Leveraging the core principles of LIT, our LIT-LoRA implementation achieves an average 18-34\\%  reduction in total corrections on challenging video segmentation benchmarks, with a negligible training overhead of $\\sim$0.5s per correction. We further demonstrate its generality by successfully adapting it to other segmentation models and extending it to CLIP-based fine-grained image classification. Our work highlights the promise of live adaptation to transform interactive tools and significantly reduce redundant human effort in complex visual tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39752", "url": null, "sourceid": 34835, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39566, "uid": "5fc17e04f87f6161a2f1a3be17d4ba2b", "name": "Neural Differentiation in Deep Networks: A Theoretical Framework for Expressivity and Representational Diversity", "authors": [{"id": 182496, "fullname": "Boyuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182496?format=json", "institution": "Lancaster University"}, {"id": 192365, "fullname": "Richard Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192365?format=json", "institution": "Lancaster University; Shanghai Jiaotong University"}], "abstract": "We begin by developing a mathematical framework of neural differentiation, formulated at the level of individual neurons. This framework formalizes the principle that each neuron should acquire a distinct representational role within the network, thereby avoiding redundancy and maximizing collective expressivity. Differentiation is quantified through the Neural Differentiation Index (NDI), a loss-aware measure that characterizes neuron significance from geometric, informational, and curvature-based perspectives within a unified framework. The NDI enables a rigorous characterization of how strongly a neuron diverges from its peers in both function and importance, and supports theoretical guarantees: we establish formal bounds on the error increase under NDI-guided elimination, thereby providing provable safety margins for network compression. Building on this foundation, we introduce Neural Differentiation Pruning (NDP) as a practical instantiation. NDP leverages NDI to perform adaptive, training-time neuron sparsification, followed by targeted fine-tuning, guiding networks toward compact yet highly differentiated backbones. Although the terminology draws loose intuition from biological differentiation, the framework is fully mathematical and architecture-agnostic. Experiments on modern vision benchmarks and architectures show that NDP achieves substantial structured sparsity while maintaining\u2014or even improving\u2014accuracy and robustness, underscoring the practical impact of the differentiation framework.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39566", "url": null, "sourceid": 36865, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38762, "uid": "b20ec3227487235bbe4117ced1337719", "name": "Spherical Leech Quantization for Visual Tokenization and Generation", "authors": [{"id": 183381, "fullname": "Yue Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/183381?format=json", "institution": "Stanford University"}, {"id": 135363, "fullname": "Hanwen Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/135363?format=json", "institution": "Adobe Systems"}, {"id": 190611, "fullname": "Zhenlin Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190611?format=json", "institution": "Mistral AI"}, {"id": 190612, "fullname": "Chutong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190612?format=json", "institution": "University of Texas at Austin"}, {"id": 75810, "fullname": "Ehsan Adeli", "url": "http://cvpr.thecvf.com/api/miniconf/users/75810?format=json", "institution": "Stanford University"}, {"id": 89183, "fullname": "Philipp Kr\u00e4henb\u00fchl", "url": "http://cvpr.thecvf.com/api/miniconf/users/89183?format=json", "institution": "University of Texas at Austin"}], "abstract": "Lookup-free quantization has received much attention due to its efficiency on parameters and scalability to a large codebook. In this paper, we present a unified formulation of different non-parametric quantization methods through the lens of lattice coding. The geometry of lattice codes explains the necessity of auxiliary loss terms when training auto-encoders with certain existing lookup-free quantization variants such as BSQ. As a step forward, we explore a few possible candidates, including random lattices, generalized Fibonacci lattices, and densest sphere packing lattices. Among all, we find the Leech lattice-based quantization method, which is dubbed as Spherical Leech Quantization ($\\Lambda_{24}$-SQ), leads to both a simplified training recipe and an improved reconstruction-compression tradeoff thanks to its high symmetry and even distribution on the hypersphere. In image tokenization and compression tasks, this quantization approach achieves better reconstruction quality across all metrics than BSQ, the best prior art, while consuming slightly fewer bits. The improvement also extends to state-of-the-art auto-regressive image generation frameworks.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38762", "url": null, "sourceid": 43994, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38154, "uid": "2b4a70dc918fea2c26d42edc6dbadc46", "name": "Aesthetic Camera Viewpoint Suggestion with 3D Aesthetic Field", "authors": [{"id": 145825, "fullname": "Sheyang Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/145825?format=json", "institution": "University of Waterloo"}, {"id": 154069, "fullname": "Armin Shafiee Sarvestani", "url": "http://cvpr.thecvf.com/api/miniconf/users/154069?format=json", "institution": "University of Waterloo"}, {"id": 189168, "fullname": "Jialu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189168?format=json", "institution": "University of Waterloo"}, {"id": 175300, "fullname": "Xiaoyu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175300?format=json", "institution": "City University of Hong Kong"}, {"id": 86639, "fullname": "Zhou Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86639?format=json", "institution": "University of Waterloo"}], "abstract": "The aesthetic quality of a scene depends strongly on camera viewpoint.Existing approaches for aesthetic viewpoint suggestion are either single-view adjustments, predicting limited camera adjustments from a single image without understanding scene geometry, or 3D exploration approaches, which rely on dense captures or prebuilt 3D environments coupled with costly reinforcement learning (RL) searches. In this work, we introduce the notion of 3D aesthetic field that enables geometry-grounded aesthetic reasoning in 3D with sparse captures, allowing efficient viewpoint suggestions in contrast to costly RL searches.We opt to learn this 3D aesthetic field using a feedforward 3D Gaussian Splatting network that distills high-level aesthetic knowledge from a pretrained 2D aesthetic model into 3D space, enabling aesthetic prediction for novel viewpoints from only sparse input views. Building on this field, we propose a two-stage search pipeline that combines coarse viewpoint sampling with gradient-based refinement, efficiently identifying aesthetically appealling viewpoints without dense captures or RL exploration. Extensive experiments show that our method consistently suggests viewpoints with superior framing and composition compared to existing approaches, establishing a new direction toward 3D-aware aesthetic modeling.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38154", "url": "https://sheyangtang.com/aesthetic-field/", "sourceid": 46673, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39319, "uid": "24584682a19b117289970188b6d68ad4", "name": "CaT-GS: Efficient 3DGS Rendering for Large Scale Scenes via Inter-frame Caching and Tile Scheduling", "authors": [{"id": 180009, "fullname": "TingJia Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180009?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 191839, "fullname": "Bo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191839?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 191840, "fullname": "Shengzhong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191840?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 149695, "fullname": "Fan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/149695?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 191841, "fullname": "Guihai Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191841?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Recent breakthroughs in 3D Gaussian Splatting (3DGS) have advanced neural rendering with high fidelity and speed. However, its performance degrades significantly in large-scale scenes due to the computational burden of tile-based rasterization. Existing optimization efforts either require costly scene re-training or focus on narrow aspects of the pipeline, overlooking critical inefficiencies in real-world deployments. Through a comprehensive analysis, we identify three primary sources of redundancy and low GPU utilization: redundant inter-frame pre-processing, viewpoint-based occlusion redundancy, and severe tile-level load imbalance. To address these issues, we propose CaT-GS, a novel and efficient 3DGS rendering pipeline. CaT-GS introduces a speculative multi-frame preprocessing method to eliminate redundant computations across consecutive frames, and an inter-frame caching mechanism to eliminate viewpoint redundant rendering stages. Furthermore, it refactors rasterization tasks with a dedicated kernel to mitigate tile load imbalance, significantly boosting GPU utilization. Extensive experiments demonstrate that CaT-GS achieves a speedup of up to 10\\times over the original 3DGS and up to 70\\% over previous state-of-the-art methods, establishing a new benchmark for high-fidelity, real-time rendering of large-scale scenes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39319", "url": null, "sourceid": 45416, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37752, "uid": "073e91c1793a1400af92c0d2ba639daf", "name": "Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Langugae Model Blindness", "authors": [{"id": 178950, "fullname": "Xin Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/178950?format=json", "institution": "Tulane University"}, {"id": 163067, "fullname": "Haomiao Ni", "url": "http://cvpr.thecvf.com/api/miniconf/users/163067?format=json", "institution": "University of Memphis"}, {"id": 136638, "fullname": "Yunbei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/136638?format=json", "institution": "Tulane University"}, {"id": 187449, "fullname": "Jihun Hamm", "url": "http://cvpr.thecvf.com/api/miniconf/users/187449?format=json", "institution": "Tulane University"}, {"id": 188168, "fullname": "Zechen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188168?format=json", "institution": "Tulane University"}, {"id": 75472, "fullname": "Zhengming Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/75472?format=json", "institution": "Tulane University"}], "abstract": "Vision language models (VLMs) have achieved remarkable success in broad visual understanding, yet they remain challenged by object-centric reasoning on rare objects due to the scarcity of such instances in pretraining data. While prior efforts alleviate this issue by retrieving additional data or introducing stronger vision encoders, these methods are still computationally intensive during finetuning VLMs and don't fully exploit the original training data. In this paper, we introduce an efficient plug-and-play module that substantially improves VLMs' reasoning over rare objects by refining visual tokens and enriching input text prompts, without VLMs finetuning. Specifically, we propose to learn multi-modal class embeddings for rare objects by leveraging prior knowledge from vision foundation models and synonym-augmented text descriptions, compensating for limited training examples. These embeddings refine the visual tokens in VLMs through a lightweight attention-based enhancement module that improves fine-grained object details. In addition, we use the learned embeddings as object-aware detectors to generate informative hints, which are injected into the text prompts to help guide the VLM\u2019s attention toward relevant image regions. Experiments on two benchmarks show consistent and substantial gains for pretrained VLMs in rare object recognition and reasoning. Further analysis reveals how our method strengthens the VLM's ability to focus on and reason about rare objects.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37752", "url": null, "sourceid": 43265, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38930, "uid": "81c5e25ca82fea62f9752fd589ccf200", "name": "ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images", "authors": [{"id": 190984, "fullname": "Muhammad Naseer Subhani", "url": "http://cvpr.thecvf.com/api/miniconf/users/190984?format=json", "institution": "Independent Researcher"}], "abstract": "Interactive segmentation models such as the Segment Anything Model (SAM) have demonstrated remarkable generalization on natural images, but perform suboptimally on remote sensing imagery (RSI) due to severe domain shift and the scarcity of dense annotations. To address this, we propose a self-prompting point-supervised framework that adapts SAM to RSIs using only sparse point annotation. Our method employs a Refine\u2013Requery\u2013Reinforce loop, where coarse pseudo-masks are generated from initial points (Refine), improved with self-constructed box prompts (Requery), and embeddings are aligned across iterations to reduce confirmation bias (Reinforce). Without relying on full-mask supervision, our approach progressively enhances SAM's segmentation quality and domain robustness through self-guided prompt adaptation. We evaluate our proposed method on three RSI benchmark datasets, including WHU, HRSID, and NWPU VHR-10, showing that our method consistently surpasses pretrained SAM and recent point-supervised segmentation methods. Our results demonstrate that self-prompting and semantic alignment provide an efficient path towards scalable, point-level adaptation of foundation segmentation models for remote sensing applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38930", "url": null, "sourceid": 43861, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39579, "uid": "e30faf2040dbb22e0c56d1ff3c95ed0c", "name": "Ego-STAR: Spatiotemporal Hints for Egocentric Video Understanding", "authors": [{"id": 75518, "fullname": "Arsha Nagrani", "url": "http://cvpr.thecvf.com/api/miniconf/users/75518?format=json", "institution": "Google "}, {"id": 192396, "fullname": "Jasper Uijlings", "url": "http://cvpr.thecvf.com/api/miniconf/users/192396?format=json", "institution": "Google"}, {"id": 192397, "fullname": "Shyamal Buch", "url": "http://cvpr.thecvf.com/api/miniconf/users/192397?format=json", "institution": "Google DeepMind"}, {"id": 187640, "fullname": "Tobias Weyand", "url": "http://cvpr.thecvf.com/api/miniconf/users/187640?format=json", "institution": "Google DeepMind"}, {"id": 192398, "fullname": "Sudheendra Vijayanarasimhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192398?format=json", "institution": "Research, Google"}, {"id": 192399, "fullname": "Bo Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192399?format=json", "institution": null}, {"id": 192400, "fullname": "Ramin Mehran", "url": "http://cvpr.thecvf.com/api/miniconf/users/192400?format=json", "institution": "Google Deepmind"}, {"id": 89423, "fullname": "David A. Ross", "url": "http://cvpr.thecvf.com/api/miniconf/users/89423?format=json", "institution": "Google Research"}, {"id": 69185, "fullname": "Cordelia Schmid", "url": "http://cvpr.thecvf.com/api/miniconf/users/69185?format=json", "institution": "Inria / Google"}], "abstract": "Video reasoning models are a core component of egocentric and embodied agents. However, standard benchmarks for assessing models provide only evaluation of the output (e.g., the answer to a question), without evaluation of intermediate reasoning steps, and most provide answers only in the text domain. We introduce EgoSTAR, a benchmark for evaluating complex egocentric visual reasoning. We extend recent high-quality video data sources recorded from egocentric / embodied settings with a set of challenging, multi-step multimodal questions and spatiotemporally-dense human-annotated reasoning traces. Benchmarking experiments show that state-of-the-art models still have a large gap to human performance. To investigate this gap in detail, we annotate each reasoning trace in the dataset with the objects of interest required to solve the question, for which we also have spatio-temporal mask annotations. Through extensive evaluations, we identify that if frontier models are prompted with hints of `where' and `when' to look, we can get substantial improvements in performance. EgoSTAR will be released publicly to foster progress in egocentric reasoning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39579", "url": null, "sourceid": 41128, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39188, "uid": "8862fafcae4e60912acd6e74d41b5858", "name": "Hermite Radial Basis Function for Surface Reconstruction via Differentiable Rendering", "authors": [{"id": 181194, "fullname": "Hugo Blanc", "url": "http://cvpr.thecvf.com/api/miniconf/users/181194?format=json", "institution": "Mines Paris"}, {"id": 191540, "fullname": "Jean-Emmanuel Deschaud", "url": "http://cvpr.thecvf.com/api/miniconf/users/191540?format=json", "institution": "Mines Paris - PSL University"}, {"id": 191541, "fullname": "Alexis Paljic", "url": "http://cvpr.thecvf.com/api/miniconf/users/191541?format=json", "institution": "Mines Paris PSL"}], "abstract": "Recent advances in novel view synthesis have enabled differentiable rendering methods to reconstruct 3D scenes directly from images. Algorithms such as 3D Gaussian Splatting and RayGauss use local basis functions to represent radiance fields, enabling fast, high-quality rendering of real-world scenes. However, these methods lack an exact geometric representation of the scene. In this work, inspired by Hermite Radial Basis Function (HRBF) implicits, we introduce a global implicit function constructed from local RBFs and their derivatives to represent surfaces. The proposed formulation enables learning scene geometry through differentiable rendering of an implicit function. By leveraging local basis functions, it achieves both an efficient geometric representation and fast rendering, using a bounding volume hierarchy (BVH) to accelerate intersections with the local basis functions. The implementation of our approach will be made publicly available upon the paper\u2019s publication.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39188", "url": null, "sourceid": 36336, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38523, "uid": "8b58bfa9e198667418d251769277200c", "name": "X-band Radar Non-Line-of-Sight Imaging", "authors": [{"id": 190056, "fullname": "Dongyu Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/190056?format=json", "institution": "Department of Computer Science, University of Toronto"}, {"id": 190057, "fullname": "Mingkun Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190057?format=json", "institution": "Princeton University"}, {"id": 190058, "fullname": "Yutong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190058?format=json", "institution": "Mercedes-Benz AG; Universit\u00e4t Stuttgart"}, {"id": 87803, "fullname": "Dominik Scheuble", "url": "http://cvpr.thecvf.com/api/miniconf/users/87803?format=json", "institution": "Universit\u00e4t des Saarlandes"}, {"id": 190059, "fullname": "Xiaolong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190059?format=json", "institution": null}, {"id": 190060, "fullname": "Zijian Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190060?format=json", "institution": "Princeton University"}, {"id": 87821, "fullname": "Mario Bijelic", "url": "http://cvpr.thecvf.com/api/miniconf/users/87821?format=json", "institution": "Princeton University"}, {"id": 190061, "fullname": "Kaushik Sengupta", "url": "http://cvpr.thecvf.com/api/miniconf/users/190061?format=json", "institution": "Princeton University"}, {"id": 87808, "fullname": "Felix Heide", "url": "http://cvpr.thecvf.com/api/miniconf/users/87808?format=json", "institution": "Department of Computer Science, Princeton University"}], "abstract": "Conventional imaging systems capture objects visible in the direct line-of-sight (LOS). A decade of research on non-line-of-sight (NLOS) imaging approaches has made it possible to reconstruct hidden geometry outside the line of sight by analyzing indirect light transport. However, most existing methods operate in the optical visible or IR range. Relying on diffuse inter-reflections, every bounce incurs a quadratic intensity falloff. As such, with illumination power limited by eye-safety limitations, existing methods are fundamentally restricted to short ranges on the order of a few meters. We propose an X-band radar-based NLOS imaging method that leverages the long wavelength to convert diffuse reflections into predominantly specular ones, allowing for large-scale hidden-scene perception. We develop a neural reconstruction method that combines a learned dense prediction module and a geometry-aware NLOS reconstruction module, tackling the inherently low spatial resolution of long-wavelength radar. We assess our method using a prototype system and in simulation. Synthetic validation shows that, under the same transmit power, X-band radar achieves 10$\\times$ longer NLOS reconstruction range than optical systems, while experimental results further demonstrate accurate hidden-object reconstructions up to 40 m, establishing a practical pathway toward real-world long-range NLOS sensing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38523", "url": null, "sourceid": 32599, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40240, "uid": "7f3998170c5afdabcf87fa6e94c293ae", "name": "Inter-Photon-Limited Videography", "authors": [{"id": 153384, "fullname": "Andrew Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/153384?format=json", "institution": "Department of Computer Science, University of Toronto"}, {"id": 190056, "fullname": "Dongyu Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/190056?format=json", "institution": "Department of Computer Science, University of Toronto"}, {"id": 156815, "fullname": "Sotiris Nousias", "url": "http://cvpr.thecvf.com/api/miniconf/users/156815?format=json", "institution": "University of Toronto"}, {"id": 77223, "fullname": "David B. Lindell", "url": "http://cvpr.thecvf.com/api/miniconf/users/77223?format=json", "institution": "University of Toronto"}, {"id": 93592, "fullname": "Kiriakos Kutulakos", "url": "http://cvpr.thecvf.com/api/miniconf/users/93592?format=json", "institution": "University of Toronto"}], "abstract": "We consider the problem of imaging a dynamic scene when scene appearance variations can outpace photon arrivals. Under such conditions, a pixel is effectively ``blind'' to changes in appearance that occur within the timespan separating the photons it detects, and so the inter-photon interval presents a significant speed barrier to video acquisition systems. To analyze and advance imaging capabilities at the inter-photon limit, we introduce a novel reparameterization of time-varying flux that reveals the intrinsic difficulty of signal reconstruction by relating the Fourier decomposition of a flux function to the number of photons arriving within each oscillation period. We find that inter-photon-limited videography of general scenes is underexplored and beyond the reach of existing reconstruction techniques. To this end, we introduce Neural Flux Fields---a technique that combines statistical modeling of photon arrival with intrinsic priors of a neural network to achieve robust videography at the inter-photon limit. Using this approach, we demonstrate never-before-scene capabilties in video reconstruction across a range of captured single-photon video datasets spanning the inter-photon-limited regime.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40240", "url": null, "sourceid": 42312, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37717, "uid": "e2d6f6988ac20b859d35d047d778c522", "name": "MuM: Multi-View Masked Image Modeling for 3D Vision", "authors": [{"id": 182932, "fullname": "David Nordstr\u00f6m", "url": "http://cvpr.thecvf.com/api/miniconf/users/182932?format=json", "institution": "Chalmers University of Technology"}, {"id": 188081, "fullname": "Johan Edstedt", "url": "http://cvpr.thecvf.com/api/miniconf/users/188081?format=json", "institution": "Link\u00f6ping University"}, {"id": 87688, "fullname": "Fredrik Kahl", "url": "http://cvpr.thecvf.com/api/miniconf/users/87688?format=json", "institution": "Chalmers University"}, {"id": 188082, "fullname": "Georg B\u00f6kman", "url": "http://cvpr.thecvf.com/api/miniconf/users/188082?format=json", "institution": "University of Amsterdam"}], "abstract": "Self-supervised learning on images seeks to extract meaningful visual representations from unlabeled data. When scaled to large datasets, this paradigm has achieved state-of-the-art performance and the resulting trained models such as DINOv3 have seen widespread adoption. However, most prior efforts are optimized for semantic understanding rather than geometric reasoning. One important exception is Cross-View Completion, CroCo, which is a form of masked autoencoding (MAE) tailored for 3D understanding. In this work, we continue on the path proposed by CroCo and focus on learning features tailored for 3D vision. In a nutshell, we extend MAE to arbitrarily many views of the same scene. By uniformly masking all views and employing a lightweight decoder with inter-frame attention, our approach is inherently simpler and more scalable than CroCo. We evaluate the resulting model, MuM, extensively on downstream tasks including feedforward reconstruction, dense image matching and relative pose estimation, finding that it outperforms the state-of-the-art visual encoders DINOv3 and CroCo v2.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37717", "url": null, "sourceid": 37966, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36336, "uid": "51bb254d1af7e6d9c5c43be0f7aeabca", "name": "Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation", "authors": [{"id": 153479, "fullname": "Johannes Schusterbauer", "url": "http://cvpr.thecvf.com/api/miniconf/users/153479?format=json", "institution": "CompVis       LMU Munich"}, {"id": 153480, "fullname": "Ming Gui", "url": "http://cvpr.thecvf.com/api/miniconf/users/153480?format=json", "institution": "Ludwig-Maximilians-Universit\u00e4t M\u00fcnchen"}, {"id": 184807, "fullname": "Yusong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184807?format=json", "institution": null}, {"id": 184808, "fullname": "Pingchuan Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/184808?format=json", "institution": "University of Munich"}, {"id": 153588, "fullname": "Felix Krause", "url": "http://cvpr.thecvf.com/api/miniconf/users/153588?format=json", "institution": "Ludwig-Maximilians-Universit\u00e4t M\u00fcnchen"}, {"id": 85132, "fullname": "Bj\u00f6rn Ommer", "url": "http://cvpr.thecvf.com/api/miniconf/users/85132?format=json", "institution": "University of Munich"}], "abstract": "Diffusion and flow-based models typically allocate compute uniformly across space, updating every patch with the same noise level and number of steps. However, images are highly heterogeneous and not all regions are equally difficult to denoise. We introduce Patch Forcing (PF), a framework that dynamically allocates compute to regions that require more refinement than others. Using an additional head that predicts per-patch difficulty, we can formulate adaptive samplers that dynamically allocate compute where it is most needed. With noise scales that can vary over space and diffusion time, combined with our adaptive solvers, we can advance easier regions earlier to provide context for harder ones. We show that our framework achieves competitive results on class-conditional ImageNet, while remaining orthogonal to guidance methods. We further show that our method also scales to text-to-image synthesis. With Patch Forcing we hope to open a path towards a new family of samplers that allocate compute adaptively, focusing effort on the hardest parts of an image.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36336", "url": null, "sourceid": 43678, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40019, "uid": "bc8dc2e92112275da59ba352aa5070c3", "name": "StreamVLO: Streaming Visual\u2013LiDAR Odometry with Cumulative Drift Compensation", "authors": [{"id": 193316, "fullname": "Mengmeng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193316?format=json", "institution": "University of Twente"}, {"id": 192000, "fullname": "Jiuming Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192000?format=json", "institution": "University of Cambridge"}, {"id": 150265, "fullname": "Michael Ying Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150265?format=json", "institution": "University of Bath"}, {"id": 103437, "fullname": "Chaokang Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/103437?format=json", "institution": "PhiGent Robotics"}, {"id": 193317, "fullname": "Jiangtao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193317?format=json", "institution": "Phigent Robotics"}, {"id": 89102, "fullname": "Yunpeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89102?format=json", "institution": "PhiGent Robotics"}, {"id": 129735, "fullname": "Hesheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129735?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 86720, "fullname": "Francesco Nex", "url": "http://cvpr.thecvf.com/api/miniconf/users/86720?format=json", "institution": "University of Twente"}, {"id": 133794, "fullname": "Hao Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/133794?format=json", "institution": "University of Twente"}], "abstract": "We propose StreamVLO, a streaming visual\u2013LiDAR odometry framework that performs unified spatio-temporal correlation with Mamba models and tackles the long-standing cumulative drift problem via an online Cumulative Drift Compensation scheme for localization in 4D dynamic environments. Specifically, StreamVLO introduces a unified spatio-temporal correlation module built on Mamba to fuse heterogeneous visual and LiDAR cues across multi-frame clips, overcoming the limited temporal exploration of prior pairwise methods. Furthermore, a Cumulative Drift Compensation module minimizes cumulative drift by iteratively learning residual corrections from multiple historical frames in a causal manner. To strengthen spatial feature representation on salient regions, we adopt a Keypoint-Aware Auxiliary Loss with a winner-takes-all strategy. StreamVLO achieves state-of-the-art performance on two commonly used autonomous driving datasets, reducing errors by 19\\% ($t_{\\text{rel}}$) and 22\\% ($r_{\\text{rel}}$) on KITTI, and by 18\\% ATE and 16\\% RPE on Argoverse, while remaining suitable for real-time deployment.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40019", "url": null, "sourceid": 40993, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38449, "uid": "9a48bb6f434ea2cdb7c907ec71313405", "name": "ORBIT: Benchmarking SfM in the Wild with 360\u00b0 Video", "authors": [{"id": 85332, "fullname": "Sara Sabour", "url": "http://cvpr.thecvf.com/api/miniconf/users/85332?format=json", "institution": "Google"}, {"id": 86360, "fullname": "Richard Tucker", "url": "http://cvpr.thecvf.com/api/miniconf/users/86360?format=json", "institution": "Google"}, {"id": 86677, "fullname": "Marcus A. Brubaker", "url": "http://cvpr.thecvf.com/api/miniconf/users/86677?format=json", "institution": "Samsung AI"}, {"id": 189881, "fullname": "Saurabh Saxena", "url": "http://cvpr.thecvf.com/api/miniconf/users/189881?format=json", "institution": "Google DeepMind"}, {"id": 87576, "fullname": "Junhwa Hur", "url": "http://cvpr.thecvf.com/api/miniconf/users/87576?format=json", "institution": "Google"}, {"id": 126249, "fullname": "Andrea Tagliasacchi", "url": "http://cvpr.thecvf.com/api/miniconf/users/126249?format=json", "institution": "Simon Fraser University, Google Brain"}, {"id": 87560, "fullname": "Deqing Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/87560?format=json", "institution": "Google"}, {"id": 84604, "fullname": "David J. Fleet", "url": "http://cvpr.thecvf.com/api/miniconf/users/84604?format=json", "institution": "University of Toronto &amp; Google DeepMind"}, {"id": 88991, "fullname": "Richard Szeliski", "url": "http://cvpr.thecvf.com/api/miniconf/users/88991?format=json", "institution": "Google"}, {"id": 85450, "fullname": "Noah Snavely", "url": "http://cvpr.thecvf.com/api/miniconf/users/85450?format=json", "institution": "Google / Cornell"}], "abstract": "Structure-from-Motion (SfM) is a cornerstone of 3D perception, yet current methods often fail when applied to complex videos involving challenging camera motions or dynamic scenes.Compounding the problem, the field lacks reliable ground-truth benchmarks for such difficult scenarios, making it hard to gauge real-world progress, or pinpoint where improvements are most needed.To address this gap, we introduce a new benchmark for evaluating camera pose estimation.Our key insight is to leverage online panoramic 360\u00b0 as a source of data from which to construct challenging clips, while still enabling robust ground-truth trajectory recovery.The panoramic nature of these videos provides richer visual context for tracking camera motion, even when parts of the view are affected by blur, motion, or dynamic objects.By tracking camera motion across full 360\u00b0 videos, we crop and reproject selected portions to generate perspective-view clips that serve as our benchmark---ORBIT---a diverse collection of 100 video clips.Experiments show that COLMAP and other state-of-the-art SfM methods struggle to accurately estimate camera positions on our benchmark, indicating that it remains a challenging and open problem space for future research.As a result, ORBIT provides a valuable testbed where researchers can meaningfully compete and measure progress on truly challenging, real-world SfM problems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38449", "url": null, "sourceid": 42473, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39458, "uid": "882f51857b62621eafb2988625a16b2e", "name": "MoLingo: Motion\u2013Language Alignment for Text-to-Human Motion Generation", "authors": [{"id": 98629, "fullname": "Yannan He", "url": "http://cvpr.thecvf.com/api/miniconf/users/98629?format=json", "institution": "University of T\u00fcbingen"}, {"id": 131729, "fullname": "Garvita Tiwari", "url": "http://cvpr.thecvf.com/api/miniconf/users/131729?format=json", "institution": "University of Tuebingen and MPI-Saarbrucken"}, {"id": 192111, "fullname": "Xiaohan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192111?format=json", "institution": "University of Tuebingen and MPI Informatics"}, {"id": 192112, "fullname": "Pankaj Bora", "url": "http://cvpr.thecvf.com/api/miniconf/users/192112?format=json", "institution": "Eberhard-Karls-Universit\u00e4t T\u00fcbingen"}, {"id": 73518, "fullname": "Tolga Birdal", "url": "http://cvpr.thecvf.com/api/miniconf/users/73518?format=json", "institution": "Imperial College London"}, {"id": 126719, "fullname": "Jan Lenssen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126719?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 75975, "fullname": "Gerard Pons-Moll", "url": "http://cvpr.thecvf.com/api/miniconf/users/75975?format=json", "institution": "University of T\u00fcbingen"}], "abstract": "We introduce MoLingo, a text-to-motion (T2M) model that generates realistic, lifelike human motion by denoising in a continuous latent space. Recent works perform latent space diffusion, either on the whole latent at once or auto-regressively over multiple latents. In this paper, we study how to make diffusion on continuous motion latents work best. We focus on two questions: (1) how to build a semantically aligned latent space so diffusion becomes more effective, and (2) how to best inject text conditioning so the motion follows the description closely. We propose a semantic-aligned motion encoder trained with frame-level text labels so that latents with similar text meaning stay close, which makes the latent space more diffusion-friendly. We also compare single-token conditioning with a multi-token cross-attention scheme and find that cross-attention gives better motion realism and text\u2013motion alignment. With semantically aligned latents, auto-regressive generation, and cross-attention text conditioning, our model sets a new state of the art in human motion generation on standard metrics and in a user study. We will release our code and models for further research and downstream usage.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39458", "url": null, "sourceid": 35635, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37167, "uid": "75151da8b881b77d075fbe45c47cef39", "name": "MIM Representations Encode Non-Semantic Noise: Post-Hoc Suppression Boosts Zero-Shot Performance", "authors": [{"id": 183013, "fullname": "Martine Hjelkrem-Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/183013?format=json", "institution": "University of Oslo"}, {"id": 186822, "fullname": "Marius Aasan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186822?format=json", "institution": null}, {"id": 76816, "fullname": "Rwiddhi Chakraborty", "url": "http://cvpr.thecvf.com/api/miniconf/users/76816?format=json", "institution": "University of Troms\u00f8"}, {"id": 186823, "fullname": "Gabriel Arteaga", "url": "http://cvpr.thecvf.com/api/miniconf/users/186823?format=json", "institution": "University of Oslo"}, {"id": 186824, "fullname": "Changkyu Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186824?format=json", "institution": "University of Oslo"}, {"id": 186825, "fullname": "Ad\u00edn Ram\u00edrez Rivera", "url": "http://cvpr.thecvf.com/api/miniconf/users/186825?format=json", "institution": "University of Oslo"}], "abstract": "Masked Image Modeling (MIM) has become a ubiquitous self-supervised vision paradigm. In this work, we show that MIM objectives cause the learned representations to retain non-semantic information, which ultimately hurts performance during inference. We introduce a model-agnostic score for semantic invariance using Principal Component Analysis (PCA) on real and synthetic non-semantic images. Based on this score, we propose a simple method, Semantic Orthogonal Projection (SOaP), to directly suppress non-semantic information in patch representations, leading to consistent improvements in zero-shot performance across various MIM-based models. SOaP is a post-hoc suppression method, requires zero training, and can be attached to any model as a single linear head.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37167", "url": null, "sourceid": 39687, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39441, "uid": "36686212b9b05b73736f0d77f98377bb", "name": "Omni-Attack: Adversarial Attacks on Open-Ended VQA in Black-Box Multimodal LLMs", "authors": [{"id": 90408, "fullname": "Kai Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90408?format=json", "institution": "Carnegie Mellon University"}, {"id": 192079, "fullname": "Weichen Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192079?format=json", "institution": "Carnegie Mellon University"}, {"id": 192080, "fullname": "Li Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192080?format=json", "institution": null}, {"id": 192081, "fullname": "Alexander Robey", "url": "http://cvpr.thecvf.com/api/miniconf/users/192081?format=json", "institution": "Thinking Machines"}, {"id": 192082, "fullname": "Andy Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192082?format=json", "institution": "CMU, Carnegie Mellon University"}, {"id": 192083, "fullname": "Haoqi Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192083?format=json", "institution": "Amazon"}, {"id": 156408, "fullname": "Chengming Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156408?format=json", "institution": "Tencent"}, {"id": 192084, "fullname": "Matt Fredrikson", "url": "http://cvpr.thecvf.com/api/miniconf/users/192084?format=json", "institution": "Gray Swan AI; Carnegie Mellon University"}], "abstract": "Multimodal large language models (MLLMs) have achieved remarkable success across diverse applications, from autonomous driving to document understanding. As these models are deployed in safety-critical contexts, understanding their adversarial robustness becomes crucial. However, current evaluations focus primarily on simple tasks like coarse-grained classification, and employ inconsistent evaluation protocols, hindering rigorous comparison of attack methods. We introduce AdvRobustBench, a comprehensive adversarial robustness benchmark for MLLMs comprising 1,000 examples across visual question answering (VQA) and optical character recognition (OCR) tasks, drawn from widely-used MLLM benchmarks (MMBench, MMStar, OCRBench-v2). We further propose Omni-Attack, a novel transfer-based black-box attack method that addresses key challenges in attacking open-ended question-answering systems. Our approach introduces (i) a target-construction pipeline that generates question-conditioned textual and visual targets to provide stronger optimization signals, and (ii) a location-aware attack strategy for OCR that enables spatially-precise perturbations. Extensive experiments demonstrate that Omni-Attack achieves strong targeted attack success rates (up to 71.8\\% on GPT-4.1 at $\\varepsilon=8/255$) across both proprietary models (GPT-4.1, Claude 3.7, Gemini 2.0) and open-source MLLMs, revealing significant vulnerabilities in current multimodal systems. Our benchmark and findings establish a foundation for developing more robust MLLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39441", "url": null, "sourceid": 31911, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39479, "uid": "7ac09a39f051ff62eefbafd363b1bbb3", "name": "Unleashing Stealthy Backdoor Pandemic by Infecting a Single Diffusion Model", "authors": [{"id": 155568, "fullname": "Mohaiminul Al Nahian", "url": "http://cvpr.thecvf.com/api/miniconf/users/155568?format=json", "institution": "State University of New York at Binghamton"}, {"id": 192160, "fullname": "Abeer Almalky", "url": "http://cvpr.thecvf.com/api/miniconf/users/192160?format=json", "institution": "State University of New York at Binghamton"}, {"id": 130201, "fullname": "Sabbir Ahmed", "url": "http://cvpr.thecvf.com/api/miniconf/users/130201?format=json", "institution": "State University of New York at Binghamton"}, {"id": 183974, "fullname": "Abdullah Al Arafat", "url": "http://cvpr.thecvf.com/api/miniconf/users/183974?format=json", "institution": "Florida International University"}, {"id": 129638, "fullname": "Mamshad Nayeem Rizve", "url": "http://cvpr.thecvf.com/api/miniconf/users/129638?format=json", "institution": "Amazon"}, {"id": 107234, "fullname": "Adnan Rakin Rakin", "url": "http://cvpr.thecvf.com/api/miniconf/users/107234?format=json", "institution": "State University of New York at Binghamton"}], "abstract": "The remarkable success of modern Deep Neural Networks (DNNs) can be primarily attributed to having access to compute resources and high-quality labeled data, which is often costly and challenging to acquire. Recently, text-to-image Diffusion Models (DMs) have emerged as powerful data generators to augment training datasets. Machine learning practitioners often utilize off-the-shelf third-party DMs for generating synthetic data without domain-specific expertise or adaptation. Such a practice leads to a novel and insidious threat: diffusion-model infected with a backdoor can effectively spread into a large number of downstream models, causing a backdoor pandemic. To achieve this for the first time, we propose Eidolon, designed and optimized to stealthily transfer the backdoor injected into a single diffusion model into virtually an infinite number of downstream models without any active attacker role in the downstream training tasks. Proposed Eidolon not only makes the attack stealthier and effective, it also enforces a strict threat model for injecting backdoor into the downstream model compared to conventional backdoor attacks. We propose four necessary tests that a successful backdoor attack on the diffusion model should pass to cause a backdoor pandemic. Our evaluation across a wide range of benchmark datasets and model architectures exhibits that only our attack successfully passes these tests, causing widespread pandemic across many downstream classifiers.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39479", "url": null, "sourceid": 33960, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37796, "uid": "f7b80214c9564ac1bd11fd91a037d17b", "name": "OptiMVMap: Offline Vectorized Map Construction via Optimal Multi-vehicle Perspectives", "authors": [{"id": 182840, "fullname": "Zedong Dan", "url": "http://cvpr.thecvf.com/api/miniconf/users/182840?format=json", "institution": "Sun Yat-sen University"}, {"id": 188289, "fullname": "Zijie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188289?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 90626, "fullname": "Wei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90626?format=json", "institution": "Baidu"}, {"id": 128015, "fullname": "Xiangru Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/128015?format=json", "institution": "University of Hong Kong; Sun Yat-Sen University"}, {"id": 90627, "fullname": "Weiming Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90627?format=json", "institution": "Baidu"}, {"id": 87785, "fullname": "Xiao Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87785?format=json", "institution": "Baidu"}, {"id": 126338, "fullname": "Jingdong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126338?format=json", "institution": "Baidu"}, {"id": 75470, "fullname": "Liang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/75470?format=json", "institution": "Sun Yat-sen University"}, {"id": 74074, "fullname": "Guanbin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/74074?format=json", "institution": "Sun Yat-sen University"}], "abstract": "Offline vectorized maps constitute critical infrastructure for high-precision autonomous driving and mapping services. Existing approaches rely predominantly on single ego-vehicle trajectories, which fundamentally suffer from viewpoint insufficiency: while memory-based methods extend observation time by aggregating ego-trajectory frames, they lack the spatial diversity needed to reveal occluded regions. Incorporating views from surrounding vehicles offers complementary perspectives, yet naive fusion introduces three key challenges: computational cost from large candidate pools, redundancy from near-collinear viewpoints, and noise from pose errors and occlusion artifacts.We present OptiMVMap, which reformulates multi-vehicle mapping as a select-then-fuse problem to address these challenges systematically. An Optimal Vehicle Selection (OVS) module strategically identifies a compact subset of helpers that maximally reduce ego-centric uncertainty in occluded regions, addressing computation and redundancy challenges. Cross-Vehicle Attention (CVA) and Semantic-aware Noise Filter (SNF) then perform pose-tolerant alignment and artifact suppression before BEV-level fusion, addressing the noise challenge. This targeted pipeline yields more complete and topologically faithful maps with substantially fewer views than indiscriminate aggregation.On nuScenes and Argoverse2, OptiMVMap improves MapTRv2 by +10.5 mAP and +9.3 mAP, respectively, and surpasses memory-augmented baselines MVMap and HRMapNet by +6.2 mAP and +3.8 mAP on nuScenes. These results demonstrate that uncertainty-guided selection of helper vehicles is essential for efficient and accurate multi-vehicle vectorized mapping.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37796", "url": null, "sourceid": 45597, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36701, "uid": "3b155e7975caedac55e57c64b23d2843", "name": "Detecting Unknown Objects via Energy-based Separation for Open World Object Detection", "authors": [{"id": 180987, "fullname": "JunWoo Heo", "url": "http://cvpr.thecvf.com/api/miniconf/users/180987?format=json", "institution": "Korea University"}, {"id": 185677, "fullname": "Keonhee Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/185677?format=json", "institution": "Seoul National University"}, {"id": 153134, "fullname": "Gyeong-Moon Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/153134?format=json", "institution": "Korea University"}], "abstract": "In this work, we tackle the problem of Open World Object Detection (OWOD). This challenging scenario requires the detector to incrementally learn to classify given known objects without forgetting while identifying unknown objects without supervision. Previous OWOD methods have enhanced the unknown discovery process and employed memory replay to mitigate catastrophic forgetting. However, since existing methods heavily rely on the detector's known class prediction information for detecting unknown objects, they struggle to effectively learn and recognize unknown object representations. Moreover, while memory replay mitigates forgetting of old classes, it often sacrifices the knowledge of newly learned classes. To resolve these limitations, we propose DEUS (\\textbf{De}tecting \\textbf{U}nknowns via energy-based \\textbf{S}eparation), a novel framework that addresses the challenges of Open World Object Detection. DEUS consists of ETF-Subspace Unknown Separation (EUS) and an Energy-based Known Distinction (EKD) loss. EUS leverages ETF-based geometric properties to create orthogonal subspaces, enabling cleaner separation between known and unknown object representations and leverages energies from both spaces to better capture distinct patterns of unknown objects, in contrast to prior energy-based approaches that consider only the energy within the known space. Furthermore, EKD loss enforces the separation between previous and current classifiers, thus minimizing knowledge interference between previous and newly learned classes during memory replay. We thoroughly validate DEUS on OWOD benchmarks, demonstrating outstanding performance improvements in unknown detection while maintaining competitive known class performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36701", "url": null, "sourceid": 31711, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37964, "uid": "b3999f2198b3a6ddc547b84352c90e9c", "name": "Coverage Optimization for Camera View Selection", "authors": [{"id": 188696, "fullname": "Timothy Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188696?format=json", "institution": "Stanford University"}, {"id": 188697, "fullname": "Adam Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/188697?format=json", "institution": "Stanford University"}, {"id": 188698, "fullname": "Maximilian Adang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188698?format=json", "institution": "Stanford University"}, {"id": 188699, "fullname": "Grace Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188699?format=json", "institution": "Stanford University"}, {"id": 151069, "fullname": "Mac Schwager", "url": "http://cvpr.thecvf.com/api/miniconf/users/151069?format=json", "institution": "Stanford University"}], "abstract": "What makes a good viewpoint? The quality of the data used to learn 3D reconstructions is crucial for enabling efficient and accurate scene modeling. We study the active view selection problem and develop a principled analysis that yields a simple and interpretable criterion for selecting informative camera poses. Our key insight is that informative views can be obtained by minimizing a tractable approximation of the Fisher Information Gain, which reduces to favoring viewpoints that cover geometry that has been insufficiently observed by past cameras. This leads to a lightweight coverage-based view selection metric that avoids expensive transmittance estimation and is robust to noise and training dynamics. We integrate our method into the Nerfstudio framework and evaluate it on synthetic and real scenes. Across multiple datasets and radiance-field baselines, our method achieves consistently improved reconstruction quality compared to state-of-the-art active view selection methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37964", "url": null, "sourceid": 37477, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39347, "uid": "8282b546dda9ec05515966ba0fe3e5d1", "name": "PhysVid: Physics Aware Local Conditioning for Generative Video Models", "authors": [{"id": 191897, "fullname": "Saurabh Pathak", "url": "http://cvpr.thecvf.com/api/miniconf/users/191897?format=json", "institution": "Eindhoven University of Technology"}, {"id": 158847, "fullname": "Elahe Arani", "url": "http://cvpr.thecvf.com/api/miniconf/users/158847?format=json", "institution": "Wayve ; Eindhoven University of Technology"}, {"id": 191898, "fullname": "Mykola Pechenizkiy", "url": "http://cvpr.thecvf.com/api/miniconf/users/191898?format=json", "institution": "Eindhoven University of Technology"}, {"id": 191899, "fullname": "Bahram Zonooz", "url": "http://cvpr.thecvf.com/api/miniconf/users/191899?format=json", "institution": "Eindhoven University of Technology"}], "abstract": "Generative video models achieve high visual fidelity but often violate basic physical principles, limiting reliability in real\u2011world settings. Prior attempts to inject physics rely on conditioning: frame\u2011level signals are domain\u2011specific and short\u2011horizon, while global text prompts are coarse and noisy, missing fine\u2011grained dynamics. We present PhysVid, a physics\u2011aware local conditioning scheme that operates over temporally contiguous chunks of frames. Each chunk is annotated with physics\u2011grounded descriptions of states, interactions, and constraints, which are fused with the global prompt via chunk\u2011aware cross\u2011attention during training. At inference, we introduce negative physics prompts (descriptions of locally relevant law violations) to steer generation away from implausible trajectories. On VideoPhy, PhysVid improves physical commonsense scores by $\\approx 33$% over baseline video generators, and by up to $\\approx 8$% on VideoPhy2. These results show that local, physics\u2011aware guidance substantially increases physical plausibility in generative video and marks a step toward physics\u2011grounded video models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39347", "url": null, "sourceid": 44998, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37331, "uid": "40d8bca82978d40267c607afedab4b78", "name": "From Events to Clarity: The Event-Guided Diffusion Framework for Dehazing", "authors": [{"id": 187179, "fullname": "Ling Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187179?format=json", "institution": "HKUST(GZ)"}, {"id": 89612, "fullname": "Yunfan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89612?format=json", "institution": "Hong Kong University of Science and Technology(GuangZhou)"}, {"id": 187180, "fullname": "Wenzong Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/187180?format=json", "institution": "Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 187181, "fullname": "Huizai Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187181?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 187182, "fullname": "Pengteng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187182?format=json", "institution": "HKUST(GZ)"}, {"id": 187183, "fullname": "Hui Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187183?format=json", "institution": "Hong Kong University of Science and Technology (Guangzhou)"}], "abstract": "Clear imaging under hazy conditions is a critical task. Prior-based and neural methods have improved results.However, they operate on RGB frames, which suffer from limited dynamic range.Therefore, dehazing remains ill-posed and can erase structure and illumination details.To address this, we use event cameras for dehazing for the \\textbf{first time}.Event cameras offer much higher HDR ($120 dB~vs.~60 dB$) and microsecond latency, therefore they suit hazy scenes.In practice, transferring HDR cues from events to frames is hard because real paired data are scarce.To tackle this, we propose an event-guided diffusion model that utilizes the strong generative priors of diffusion models to reconstruct clear images from hazy inputs by effectively transferring HDR information from events.Specifically, we design an event-guided module that maps sparse HDR event features, \\textit{e.g.,} edges, corners, into the diffusion latent space.This clear conditioning provides precise structural guidance during generation, improves visual realism, and reduces semantic drift.For real-world evaluation, we collect a drone dataset in heavy haze (AQI = 341) with synchronized RGB and event sensors. Experiments on two benchmarks and our dataset achieve state-of-the-art results.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37331", "url": null, "sourceid": 35966, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39422, "uid": "662a2fde878a9bef6370fb07c5a28705", "name": "Scan Clusters, Not Pixels: A Cluster-Centric Paradigm for Efficient Ultra-high-definition Image Restoration", "authors": [{"id": 183533, "fullname": "Chen Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183533?format=json", "institution": "College of Electronic Engineering, National University of Defense Technology, Changsha"}, {"id": 187179, "fullname": "Ling Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187179?format=json", "institution": "HKUST(GZ)"}, {"id": 185773, "fullname": "Zhuoran Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185773?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 184897, "fullname": "Yuning Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/184897?format=json", "institution": "Technical University of Munich"}, {"id": 131426, "fullname": "Zhixiong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131426?format=json", "institution": "National University of Defense Technology"}, {"id": 91016, "fullname": "Xiangyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/91016?format=json", "institution": "University of Macau"}, {"id": 192046, "fullname": "Yue Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192046?format=json", "institution": "Beihang University"}, {"id": 192047, "fullname": "Weidong Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192047?format=json", "institution": "National University of Defense Technology"}, {"id": 131407, "fullname": "Jingyuan Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/131407?format=json", "institution": "National University of Defense Technology"}], "abstract": "Ultra-High-Definition (UHD) image restoration is trapped in a scalability crisis: existing models, bound to pixel-wise operations, demand unsustainable computation. While state space models (SSMs) like Mamba promise linear complexity, their pixel-serial scanning remains a fundamental bottleneck for the millions of pixels in UHD content. We ask: must we process every pixel to understand the image? This paper introduces  C$^2$SSM, a visual state space model that breaks this taboo by shifting from pixel-serial to cluster-serial scanning. Our core discovery is that the rich feature distribution of a UHD image can be distilled into a sparse set of semantic centroids via a neural-parameterized mixture model.  C$^2$SSM leverages this to reformulate global modeling into a novel dual-path process: it scans and reasons over a handful of cluster centers, then diffuses the global context back to all pixels through a principled similarity distribution, all while a lightweight modulator preserves fine details. This cluster-centric paradigm achieves a decisive leap in efficiency, slashing computational costs while establishing new state-of-the-art results across five UHD restoration tasks. More than a solution,  C$^2$SSM charts a new course for efficient large-scale vision: scan clusters, not pixels.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39422", "url": null, "sourceid": 45888, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37129, "uid": "d0a446d119cee636ddcafab757831a9f", "name": "From Where Things Are to What They Are For: Benchmarking Spatial\u2013Functional Intelligence in Multimodal LLMs", "authors": [{"id": 127449, "fullname": "Le Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127449?format=json", "institution": "Mila-Quebec AI Institute"}, {"id": 157979, "fullname": "Jihan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157979?format=json", "institution": "New York University"}, {"id": 186726, "fullname": "Soundarya Krishnan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186726?format=json", "institution": "Apple"}, {"id": 186727, "fullname": "Jimit Majmudar", "url": "http://cvpr.thecvf.com/api/miniconf/users/186727?format=json", "institution": "Apple"}, {"id": 186728, "fullname": "Xiou Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/186728?format=json", "institution": "Apple"}, {"id": 186729, "fullname": "Prasoon Puri", "url": "http://cvpr.thecvf.com/api/miniconf/users/186729?format=json", "institution": "Apple"}, {"id": 186730, "fullname": "Prathamesh Saraf", "url": "http://cvpr.thecvf.com/api/miniconf/users/186730?format=json", "institution": "Apple"}, {"id": 186731, "fullname": "Shruti Bhargava", "url": "http://cvpr.thecvf.com/api/miniconf/users/186731?format=json", "institution": "WisdomAI"}, {"id": 186732, "fullname": "Dhivya Piraviperumal", "url": "http://cvpr.thecvf.com/api/miniconf/users/186732?format=json", "institution": "Apple"}, {"id": 186733, "fullname": "Yinan Ling", "url": "http://cvpr.thecvf.com/api/miniconf/users/186733?format=json", "institution": "Apple"}, {"id": 186734, "fullname": "Cindy Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186734?format=json", "institution": "Apple"}, {"id": 186735, "fullname": "Hong Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186735?format=json", "institution": "Apple"}, {"id": 92323, "fullname": "Aishwarya Agrawal", "url": "http://cvpr.thecvf.com/api/miniconf/users/92323?format=json", "institution": "Universit\u00e9 de Montr\u00e9al"}, {"id": 186736, "fullname": "Bo-Hsiang Tseng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186736?format=json", "institution": "Apple"}], "abstract": "Human level agentic intelligence transcends low-level geometric perception, evolving from knowing where things are to understanding what they are for. While existing benchmarks effectively evaluate this foundational geometric perception capabilites of multimodal LLMs, they fall short of probing the higher-order cognitive abilities essential for grounded intelligence. To bridge this gap, we introduce the Spatial-Functional Intelligence Benchmark (SFI-Bench), a video-based benchmark with over 1500 expert-annotated questions derived from diverse, egocentric indoor video scans. SFI-Bench is designed to systematically evaluate two complementary dimensions of advanced reasoning: 1) Structured Spatial Reasoning, understanding complex layouts and forming coherent spatial representations, and 2) Functional Reasoning, inferring object affordances and context-dependent utility. Its tasks, including conditional counting, multi-hop relational reasoning, functional pairing, and knowledge-grounded troubleshooting, directly challenge a model's ability to integrate perception, memory, and inference. Our experiments reveal that current MLLMs consistently struggle to integrate spatial memory with functional and external knowledge, highlighting a critical bottleneck. SFI-Bench thus provides an essential tool for measuring and driving progress towards more cognitively capable and truly grounded multimodal agents.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37129", "url": null, "sourceid": 30765, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40208, "uid": "6c01156a337cb1e4748f3567bdeff63c", "name": "PromptMoE: A Segmentation Refinement Framework Leveraging Mixture of Experts for Improved Prompting", "authors": [{"id": 172629, "fullname": "Stephen Price", "url": "http://cvpr.thecvf.com/api/miniconf/users/172629?format=json", "institution": "Worcester Polytechnic Institute"}, {"id": 193783, "fullname": "Danielle Cote", "url": "http://cvpr.thecvf.com/api/miniconf/users/193783?format=json", "institution": "Worcester Polytechnic Institute"}, {"id": 193784, "fullname": "Elke Rundensteiner", "url": "http://cvpr.thecvf.com/api/miniconf/users/193784?format=json", "institution": "Worcester Polytechnic Institute"}], "abstract": "High-quality segmentations are critical in vision tasks where boundary accuracy is important (e.g., medical diagnostics, quality control, etc.). Recently, promptable vision models have emerged as effective backbones for segmentation refinement frameworks. However, their performance not only hinges on prompt quality, they also must overcome noisy input masks and semantically ambiguous outputs from promptable models. Existing prompt-based refiners rely on fixed prompt rules, making them brittle to changing failure modes and new tasks or domains. We propose \\MOE{}, a model-agnostic MoE-driven prompting refiner effective in segmentation refinement across tasks and domains. \\MOE{} features three collaborative modules to refine an initial mask: our MoE-based Image-Informed Prompting framework (IIP) takes an image and coarse mask and produces a set of expert score maps to guide prompt generation, the Dynamic Expert Selector (DES)  activates only the most relevant experts and fuses their maps to avoid dense evaluation and signal dilution, and the Prompt-Placement Explorer (PPE) explores the fused guidance map to place high-confidence spatially diverse point prompts. Across five benchmark datasets (BIG, VOC, DAVIS585, ECSSD, MSRA-B), \\MOE{} achieves statistically significant gains over SOTA methods CascadePSP, SegRefiner, and SAMRefiner on semantic, instance, and salient tasks, with mean improvements of +6.24 IoU / +8.99 BIoU.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40208", "url": null, "sourceid": 31052, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39686, "uid": "6ce8e09d88ff747283ee161a90cb1cd4", "name": "Towards Foundation Models for 3D Scene Understanding: Instance-Aware Self-Supervised Learning for Point Clouds", "authors": [{"id": 107626, "fullname": "Bin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107626?format=json", "institution": "Robert Bosch GmbH"}, {"id": 154467, "fullname": "Mohamed Abdelsamad", "url": "http://cvpr.thecvf.com/api/miniconf/users/154467?format=json", "institution": "Bosch Center for AI"}, {"id": 192646, "fullname": "Miao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192646?format=json", "institution": "Robert Bosch GmbH, Bosch"}, {"id": 105703, "fullname": "Alexandru Paul Condurache", "url": "http://cvpr.thecvf.com/api/miniconf/users/105703?format=json", "institution": "Bosch"}], "abstract": "Recent advances in self-supervised learning (SSL) for point clouds have substantially improved 3D scene understanding without human annotations. Existing approaches emphasize semantic awareness by enforcing feature consistency across augmented views or by masked scene modeling. However, the resulting representations transfer poorly to localization, and often require full finetuning for strong performance. Accurate localization is a fundamental component of 3D perception, thus bridging this gap is crucial for progressing toward true 3D foundation models that supports all downstream tasks on 3D data. In this work, we introduce PointINS, a localization-oriented self-supervised framework that enriches point cloud representations through geometry-aware learning. PointINS employs an orthogonal localization branch to jointly learn high-level semantic understanding and geometric reasoning, yielding localization awareness. We identify two consistent properties essential for robust localization and formulate them as complementary regularization strategies, Offset Distribution Regularization (ODR), which aligns predicted offsets with empirically observed geometric priors, and Spatial Clustering Regularization (SCR), which enforces local coherence by regularizing offsets with pseudo instance masks. Through extensive experiments across five datasets, PointINS achieves on average +3.5% mAP improvement for indoor instance segmentation and +4.1% PQ gain for outdoor panoptic segmentation, paving the way for scalable 3D foundation models. Code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39686", "url": null, "sourceid": 40853, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39485, "uid": "66d58ab619387ebb945e0e3abe5e0a1c", "name": "In Pursuit of Pixel Supervision for Visual Pre-training", "authors": [{"id": 77204, "fullname": "Lihe Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77204?format=json", "institution": "The University of Hong Kong"}, {"id": 130924, "fullname": "Shang-Wen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/130924?format=json", "institution": "Facebook"}, {"id": 192170, "fullname": "Yang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192170?format=json", "institution": "Meta FAIR"}, {"id": 192171, "fullname": "Xinjie Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/192171?format=json", "institution": "Facebook"}, {"id": 192172, "fullname": "Dong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192172?format=json", "institution": "Facebook"}, {"id": 190172, "fullname": "Abdelrahman Mohamed", "url": "http://cvpr.thecvf.com/api/miniconf/users/190172?format=json", "institution": "Facebook"}, {"id": 90241, "fullname": "Saining Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/90241?format=json", "institution": "Facebook"}, {"id": 87814, "fullname": "Hengshuang Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/87814?format=json", "institution": "The University of Hong Kong"}, {"id": 150920, "fullname": "Kaiming He", "url": "http://cvpr.thecvf.com/api/miniconf/users/150920?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 130912, "fullname": "Hu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130912?format=json", "institution": "FAIR, Multimodal Foundation"}], "abstract": "Pixels provide a lightweight, scalable way to encode the physical world, preserving rich visual information with minimal human inductive bias. We demonstrate that visual pre-training using pixel supervision alone can learn desirable visual properties and produce strong representations, while remaining simple, stable, and efficient. We present Pixo, a capable self-supervised model trained by purely predicting pixels. It is instantiated on the masked autoencoding (MAE) framework, but enhances MAE with a deeper decoder, larger-block masking, and additional class tokens. It is trained on 2B web-crawled images with a self-curated strategy. Pixo performs well on many downstream tasks, covering monocular depth estimation (e.g., Depth Anything), feed-forward 3D reconstruction (i.e., MapAnything), object segmentation (e.g., SAM 2), and embodied AI. We will release the training code and pre-trained models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39485", "url": null, "sourceid": 44270, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40107, "uid": "fd9c99c4dd5bfb124407e7c6a84b5ef0", "name": "CamDirector: Towards Long-Term Coherent Video Trajectory Editing", "authors": [{"id": 183466, "fullname": "Kejia Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/183466?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 157322, "fullname": "Zhihao Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/157322?format=json", "institution": "Noah&#x27;s Ark Lab"}, {"id": 193551, "fullname": "Weilin Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193551?format=json", "institution": "Snap Inc."}, {"id": 157324, "fullname": "Yuhongze Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/157324?format=json", "institution": "McGill University"}, {"id": 193552, "fullname": "YUANHAO YU", "url": "http://cvpr.thecvf.com/api/miniconf/users/193552?format=json", "institution": ""}, {"id": 152200, "fullname": "Xinxin Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/152200?format=json", "institution": "Concordia University"}, {"id": 193553, "fullname": "Qiang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/193553?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence; University of Toronto"}, {"id": 157326, "fullname": "Juwei Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157326?format=json", "institution": "Huawei Technologies Ltd."}], "abstract": "Video (camera) trajectory editing aims to synthesize new videos that follow user-defined camera paths while preserving scene content and plausibly inpainting previously unseen regions, upgrading amateur footage into professionally styled videos. Existing VTE methods struggle with precise camera control and long-range consistency because they either inject target poses through a limited-capacity embedding or rely on single-frame warping with only implicit cross-frame aggregation in video diffusion models. To address these issues, we introduce a new VTE framework that 1) explicitly aggregates information across the entire source video via a hybrid warping scheme. Specifically, static regions are progressively fused into a world cache then rendered to target camera poses, while dynamic regions are directly warped; their fusion yields globally consistent coarse frames that guide refinement.2) processes video segments jointly with their history via a history-guided autoregressive diffusion model, while the world cache is incrementally updated to reinforce already inpainted content, enabling long-term temporal coherence. Finally, we present iPhone-PTZ, a new VTE benchmark with diverse camera motions and large trajectory variations, and achieve state-of-the-art performance with fewer parameters.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40107", "url": null, "sourceid": 33069, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38140, "uid": "3a56cc8f30a06a396fe53361b0f3553f", "name": "MagicQuill V2: Precise and Interactive Image Editing with Layered Visual Cues", "authors": [{"id": 151683, "fullname": "Zichen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/151683?format=json", "institution": "HKUST"}, {"id": 156838, "fullname": "Yue Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156838?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 89797, "fullname": "Hao Ouyang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89797?format=json", "institution": "Department of Computer Science and Engineering, Hong Kong University of Science and Technology"}, {"id": 93859, "fullname": "Qiuyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/93859?format=json", "institution": "Ant Group"}, {"id": 69405, "fullname": "Shuailei Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/69405?format=json", "institution": "Northeastern University"}, {"id": 91740, "fullname": "Ka Leong Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/91740?format=json", "institution": "HKUST"}, {"id": 127962, "fullname": "Wen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127962?format=json", "institution": "Zhejiang University"}, {"id": 70941, "fullname": "Qingyan Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/70941?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 189146, "fullname": "Yuxuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189146?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 94115, "fullname": "Yanhong Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/94115?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 71750, "fullname": "Yixuan LI", "url": "http://cvpr.thecvf.com/api/miniconf/users/71750?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 186273, "fullname": "Xing Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186273?format=json", "institution": "Ant Group"}, {"id": 88128, "fullname": "Yujun Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88128?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 87711, "fullname": "Qifeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87711?format=json", "institution": "Hong Kong University of Science and Technology"}], "abstract": "We propose MagicQuill V2, a novel framework that introduces a layered composition paradigm to generative image editing, bridging the gap between the semantic power of modern diffusion models and the granular control of traditional graphics software. While state-of-the-art diffusion transformers excel at holistic generation, their use of singular, monolithic prompts fails to disentangle distinct user intentions for content, position, and style. To overcome this limitation, our method deconstructs creative intent into a stack of independently controllable visual cues: a content layer for what to create, a spatial layer for where to place it, a structural layer for how it is shaped, and a color layer for its palette. Our technical contributions include a specialized data generation pipeline for context-aware content integration, a unified control module to process all visual cues, and a fine-tuned spatial branch for precise local editing, including object removal. Extensive experiments validate that this layered approach effectively resolves the user intention gap, granting creators direct, intuitive, and powerful control over the generative process.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38140", "url": null, "sourceid": 32534, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37246, "uid": "ee577c86d39b672b84ced795d14380dc", "name": "ELVIS: Enhance Low-light for Video Instance Segmentation in the Dark", "authors": [{"id": 183455, "fullname": "Joanne Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/183455?format=json", "institution": "University of Bristol"}, {"id": 187003, "fullname": "Ruirui Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/187003?format=json", "institution": "University of Bristol"}, {"id": 187004, "fullname": "Yini Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187004?format=json", "institution": "University of Bristol"}, {"id": 155305, "fullname": "David Bull", "url": "http://cvpr.thecvf.com/api/miniconf/users/155305?format=json", "institution": "University of Bristol"}, {"id": 187005, "fullname": "Nantheera Anantrasirichai", "url": "http://cvpr.thecvf.com/api/miniconf/users/187005?format=json", "institution": "University of Bristol"}], "abstract": "Video instance segmentation (VIS) for low-light content remains highly challenging for both humans and machines alike, due to adverse imaging conditions including noise, blur and low-contrast. The lack of large-scale annotated datasets and the limitations of current synthetic pipelines, particularly in modeling temporal degradations, further hinder progress. Moreover, existing VIS methods are not robust to the degradations found in low-light videos and, as a result, perform poorly even when finetuned on low-light data. In this paper, we introduce ELVIS (Enhance Low-light for Video Instance Segmentation), a novel framework that enables effective domain adaptation of state-of-the-art VIS models to low-light scenarios. ELVIS comprises an unsupervised synthetic low-light video pipeline that models both spatial and temporal degradations, a calibration-free degradation profile synthesis network (VDP-Net) and an enhancement decoder head that disentangles degradations from content features. ELVIS improves performances by up to +3.7AP on the synthetic low-light YouTube-VIS 2019 dataset. Code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37246", "url": null, "sourceid": 39714, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39962, "uid": "0cb5bbd38e65f8df5c422232fe758c5d", "name": "WalkGPT: Grounded Vision\u2013Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation", "authors": [{"id": 180602, "fullname": "Rafi Ibn Sultan", "url": "http://cvpr.thecvf.com/api/miniconf/users/180602?format=json", "institution": "Wayne State University"}, {"id": 193201, "fullname": "Hui Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193201?format=json", "institution": "Wayne State University"}, {"id": 193202, "fullname": "Xiangyu Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/193202?format=json", "institution": "Wayne State University"}, {"id": 182705, "fullname": "Chengyin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182705?format=json", "institution": "Henry Ford Health"}, {"id": 193203, "fullname": "Prashant Khanduri", "url": "http://cvpr.thecvf.com/api/miniconf/users/193203?format=json", "institution": "Wayne State University"}, {"id": 193204, "fullname": "Marco Brocanelli", "url": "http://cvpr.thecvf.com/api/miniconf/users/193204?format=json", "institution": "Ohio State University, Columbus"}, {"id": 193205, "fullname": "Dongxiao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193205?format=json", "institution": "Wayne State University"}], "abstract": "Ensuring accessible pedestrian navigation requires reasoning about both semantic and spatial aspects of complex urban scenes, a challenge that existing Large Vision\u2013Language Models (LVLMs) struggle to meet. Although these models can describe visual content, their lack of explicit grounding leads to object hallucinations and unreliable depth reasoning, limiting their usefulness for accessibility guidance. We introduce WalkGPT, a pixel-grounded LVLM for the new task of Grounded Navigation Guide, unifying language reasoning and segmentation within a single architecture for depth-aware accessibility guidance. Given a pedestrian-view image and a navigation query, WalkGPT generates a conversational response with segmentation masks that delineate accessible and harmful features along with relative depth estimation. The model incorporates a Multi-Scale Query Projector (MSQP) that shapes the final image tokens by aggregating them along text tokens across spatial hierarchies, and a Calibrated Text Projector (CTP), guided by a Region Alignment Loss, that maps language embeddings into segmentation-aware representations. These components enable fine-grained grounding and depth inference without user-provided cues or anchor points, allowing the model to generate complete and realistic navigation guidance. We also introduce PAVE, a large-scale benchmark of 41k pedestrian-view images paired with accessibility-aware questions and depth-grounded answers. Experiments show that WalkGPT achieves strong grounded reasoning and segmentation performance. The source code and dataset are included in the supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39962", "url": null, "sourceid": 32807, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38609, "uid": "66f09de47d7af86fdb784e3e7f68640b", "name": "Push-and-Step: From RL-Based Balance Recovery to Physical Simulation of Dense Crowds", "authors": [{"id": 190287, "fullname": "Alexis Jensen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190287?format=json", "institution": "INRIA"}, {"id": 190288, "fullname": "Pei Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190288?format=json", "institution": "Stanford University"}, {"id": 190289, "fullname": "Ioannis Karamouzas", "url": "http://cvpr.thecvf.com/api/miniconf/users/190289?format=json", "institution": "University of California, Riverside"}, {"id": 190290, "fullname": "Charles Pontonnier", "url": "http://cvpr.thecvf.com/api/miniconf/users/190290?format=json", "institution": "Ecole Normale Sup\u00e9rieure de Rennes - Inria ComBO"}, {"id": 128425, "fullname": "Julien Pettr\u00e9", "url": "http://cvpr.thecvf.com/api/miniconf/users/128425?format=json", "institution": "INRIA"}], "abstract": "We present a physics-based method for simulating full-body agents that recover balance by stepping or applying contact forces after being perturbed in dense crowds. While traditional 2D crowd simulations focus on navigation and social interactions in moderately dense settings, interactions in highly dense environments are predominantly physical, leading to push propagation, falls, and potential hazards. Existing models cannot capture how forces are transmitted through the body at the limb level. To address this, we use physics-based anthropomorphic simulations combined with a two-stage deep reinforcement learning framework. In the first stage, a policy is pre-trained using reference motion data and general balance rewards, enabling agents to handle a wide range of perturbations. In the second stage, an adaptive phase refines the policy to allow socially aware interactions, using hand contacts for stabilization guided by an online heuristic targeting neighbors\u2019 shoulders based on mechanical efficiency and collision risk. Ablation studies validate the training framework and reward components, and simulations reproduce trends observed in empirical studies of push propagation. Our method scales to large populations, offering new opportunities to study safety and collective behavior in dense crowds.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38609", "url": null, "sourceid": 31036, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39060, "uid": "cde3137a12d0a821fe6657c5a5292bfd", "name": "Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post\u2011hoc Debiasing in Vision-Language Models", "authors": [{"id": 191273, "fullname": "Dachuan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191273?format=json", "institution": "Harvard University, Harvard University"}, {"id": 191274, "fullname": "Weiyue Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191274?format=json", "institution": "Harvard University"}, {"id": 191275, "fullname": "Zhenda Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191275?format=json", "institution": "Harvard University"}, {"id": 191276, "fullname": "Yushu Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191276?format=json", "institution": "Harvard University"}, {"id": 191277, "fullname": "Bowen Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191277?format=json", "institution": "Harvard University"}, {"id": 182465, "fullname": "Haoyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/182465?format=json", "institution": "Harvard University"}, {"id": 135132, "fullname": "Yongchao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/135132?format=json", "institution": "MIT"}], "abstract": "Vision-Language Models (VLMs) have become indispensable for multimodal reasoning, yet their representations often encode and amplify demographic biases, resulting in biased associations and misaligned predictions in downstream tasks. Such behavior undermines fairness and distorts the intended alignment between vision and language. Recent post-hoc approaches attempt to mitigate bias by replacing the most attribute-correlated embedding coordinates with neutral values. However, our systematic analysis reveals three critical failures of this coordinate-wise approach: feature entanglement, poor cross-dataset generalization, and incomplete bias removal. We find that bias is not localized to a few coordinates but is instead distributed across a few linear subspaces. To address these limitations, we propose  $\\textbf{S}$ubspace $\\textbf{P}$rojection $\\textbf{D}$ebiasing ($\\textbf{SPD}$), a geometrically principled framework that identifies and removes the entire subspace of linearly decodable bias while reinserting a neutral mean component to preserve semantic fidelity. Extensive experiments across zero-shot classification, text-to-image retrieval, and image generation validate the effectiveness of \\sys{}: our method achieves more robust debiasing with an average improvement of 18.5\\% across four fairness metrics, while maintaining minimal loss in task performance compared to the best debiasing baseline.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39060", "url": null, "sourceid": 46310, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37510, "uid": "c6429c03537e18407cf1e2d16f4af3c8", "name": "Same or Not? Enhancing Visual Perception in Vision-Language Models", "authors": [{"id": 153252, "fullname": "Damiano Marsili", "url": "http://cvpr.thecvf.com/api/miniconf/users/153252?format=json", "institution": "California Institute of Technology"}, {"id": 187608, "fullname": "Aditya Mehta", "url": "http://cvpr.thecvf.com/api/miniconf/users/187608?format=json", "institution": "California Institute of Technology"}, {"id": 183309, "fullname": "Ryan Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/183309?format=json", "institution": "California Institute of Technology"}, {"id": 86982, "fullname": "Georgia Gkioxari", "url": "http://cvpr.thecvf.com/api/miniconf/users/86982?format=json", "institution": "California Institute of Technology"}], "abstract": "Vision\u2013language models (VLMs) excel at broad visual understanding but remain coarse-grained, exhibit visual biases, and miss subtle visual details. Existing training corpora reinforce this limitation by emphasizing general recognition (\u201cIs it a cat or a dog?\u201d) over fine-grained perception. To address this, we introduce a new training corpus and task designed to enhance the perceptual abilities of VLMs. TWIN is a large-scale dataset of 561,000 image-pair queries that task models to determine whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. The dataset spans a diverse range of everyday objects across contexts, viewpoints, and appearances. Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even on unseen domains such as art, animals, plants, and landmarks. To quantify these gains, we introduce FGVQA, a benchmark suite of 12,000 queries that repurposes fine-grained recognition and retrieval datasets from multiple domains. While existing VLMs struggle on FGVQA, when fine-tuned on TWIN they improve by up to 19.3%, without compromising performance on general VQA benchmarks. Finally, our TWIN dataset scales favorably with object annotations, and our analysis shows that scale is key to performance. We envision TWIN as a drop-in addition to open-source VLM training corpora, advancing perceptual precision of future models.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37510", "url": null, "sourceid": 36869, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38069, "uid": "28f36aa85fbadc0663c2df15a5af35db", "name": "SeeU: Seeing the Unseen World via 4D Dynamics-aware Generation", "authors": [{"id": 145390, "fullname": "Yu Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/145390?format=json", "institution": "Purdue University"}, {"id": 183202, "fullname": "Tharindu Wickremasinghe", "url": "http://cvpr.thecvf.com/api/miniconf/users/183202?format=json", "institution": "Purdue University"}, {"id": 188977, "fullname": "Zeeshan Nadir", "url": "http://cvpr.thecvf.com/api/miniconf/users/188977?format=json", "institution": "Samsung Research America"}, {"id": 154013, "fullname": "Xijun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154013?format=json", "institution": "Purdue University"}, {"id": 188978, "fullname": "Yiheng Chi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188978?format=json", "institution": "DeepLux Technology Inc"}, {"id": 75635, "fullname": "Stanley H. Chan", "url": "http://cvpr.thecvf.com/api/miniconf/users/75635?format=json", "institution": "Purdue University, USA"}], "abstract": "Images and videos are discrete 2D projections of the 4D world (3D space + time). Most visual understanding, prediction, and generation operate directly on 2D observations, leading to suboptimal performance. We propose SeeU, a novel approach that learns the continuous 4D dynamics and generate the unseen visual contents. The principle behind SeeU is a new 2D$\\to$4D$\\to$2D learning framework. SeeU first reconstructs the 4D world from sparse and monocular 2D frames (2D$\\to$4D). It then learns the continuous 4D dynamics on a low-rank representation and physical constraints (discrete 4D$\\to$continuous 4D). Finally, SeeU rolls the world forward in time, re-projects it back to 2D at sampled times and viewpoints, and generates unseen regions based on spatial-temporal context awareness (4D$\\to$2D). By modeling dynamics in 4D, SeeU achieves continuous and physically-consistent novel visual generation, demonstrating strong potentials in multiple tasks including unseen temporal generation, unseen spatial generation, and video editing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38069", "url": "https://yuyuanspace.com/SeeU/", "sourceid": 46225, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36583, "uid": "92d97e5c9d7f16e0f7f89464108ea62e", "name": "ReaGEN: Adaptive Generation of Structured Chains-of-Thought for Efficient Multimodal Reasoning", "authors": [{"id": 100023, "fullname": "Ruiqing Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/100023?format=json", "institution": "University of Alberta"}, {"id": 177164, "fullname": "Mohan Sai Singamsetti", "url": "http://cvpr.thecvf.com/api/miniconf/users/177164?format=json", "institution": "University of Alberta"}, {"id": 88141, "fullname": "Di Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88141?format=json", "institution": "University of Alberta"}, {"id": 185411, "fullname": "Bahador Rashidi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185411?format=json", "institution": "Huawei Canada"}], "abstract": "Large Vision Language Models (LVLMs) exhibit strong perceptual and linguistic capabilities yet struggle with complex visual reasoning tasks that require structured, compositional, and adaptive inference. Existing approaches either rely on costly inference-time exploration\u2014such as multi-path or tree-based Chain-of-Thought (CoT) search\u2014or on expensive post-training with large curated CoT datasets. We propose ReaGEN, a lightweight framework for the adaptive generation of structured reasoning chains that enhances reasoning without modifying the underlying vision\u2013language model (VLM). ReaGEN first employs a teacher-guided evolutionary search to collect sample specific CoT structure, leveraging attention-derived stage importance to capture how information flows across reasoning stages. These adaptive CoT structures are then used to train a compact generator (GEN) that learns to refine and improve CoT structures by reflecting on attention feedback from the reasoning process. At inference, the GEN dynamically produces question-adaptive structured CoTs, and can be iteratively invoked to refine them based on the VLM\u2019s internal state\u2014achieving the flexibility of deep search with single-path efficiency. Across diverse multimodal reasoning benchmarks, ReaGEN achieves up to +26 accuracy points over test-time scaling methods while reducing the average inference-time token usage by 79\\%, establishing a scalable and model-agnostic approach for structured reasoning generation in VLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36583", "url": null, "sourceid": 42264, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39640, "uid": "adcf964dc675106763e656dcd371299f", "name": "Efficient Decentralized Diffusion with Heterogeneous Training Objectives", "authors": [{"id": 177956, "fullname": "Zhiying Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177956?format=json", "institution": "Bagel Labs"}, {"id": 192546, "fullname": "Raihan Seraj", "url": "http://cvpr.thecvf.com/api/miniconf/users/192546?format=json", "institution": "Bagel Labs"}, {"id": 192547, "fullname": "Marcos Villagra", "url": "http://cvpr.thecvf.com/api/miniconf/users/192547?format=json", "institution": "Bagel Labs"}, {"id": 192548, "fullname": "Bidhan Roy", "url": "http://cvpr.thecvf.com/api/miniconf/users/192548?format=json", "institution": "Bagel Labs"}], "abstract": "Training state-of-the-art diffusion models requires massive computational resources concentrated in tightly-coupled clusters, fundamentally limiting participation to well-resourced institutions. While Decentralized Diffusion Models (DDM) enable training multiple experts in isolation, existing approaches require 1176 GPU-days and homogeneous training objectives across all experts. We present an efficient framework that dramatically reduces resource requirements while supporting heterogeneous training objectives. Our approach combines three key contributions: (1) PixArt-$\\alpha$'s efficient AdaLN-Single architecture, reducing parameters while maintaining quality; (2) pretrained checkpoint conversion from ImageNet-DDPM to Flow Matching objectives, accelerating convergence and enabling initialization without objective-specific pretraining; and (3) a training-free inference conversion framework that unifies heterogeneous expert predictions (DDPM and Flow Matching) into a common velocity space without any retraining. Experiments on LAION-Aesthetics demonstrate that our decentralized approach achieves comparative results with 16$\\times$ compute reduction (72 vs 1176 GPU-days) and 14$\\times$ data reduction (11M vs 158M images). Our heterogeneous variant mixing DDPM and Flow Matching experts exhibits complementary specialization patterns, improving generation diversity and texture quality despite modest FID increases. By eliminating synchronization requirements and enabling arbitrary objective combinations, our framework democratizes large-scale generative model training, allowing contributors with diverse resources to participate using consumer GPUs requiring only 20-48GB VRAM.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39640", "url": null, "sourceid": 33757, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39965, "uid": "4a9afaeb2472f426769ee7fe737f82ff", "name": "Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning", "authors": [{"id": 193207, "fullname": "Ido Sobol", "url": "http://cvpr.thecvf.com/api/miniconf/users/193207?format=json", "institution": "Technion - Israel Institute of Technology"}, {"id": 84616, "fullname": "Kihyuk Sohn", "url": "http://cvpr.thecvf.com/api/miniconf/users/84616?format=json", "institution": "Google"}, {"id": 193208, "fullname": "Yoav Blum", "url": "http://cvpr.thecvf.com/api/miniconf/users/193208?format=json", "institution": null}, {"id": 87620, "fullname": "Egor Zakharov", "url": "http://cvpr.thecvf.com/api/miniconf/users/87620?format=json", "institution": "Meta Reality Labs"}, {"id": 193209, "fullname": "Max Bluvstein", "url": "http://cvpr.thecvf.com/api/miniconf/users/193209?format=json", "institution": "Meta"}, {"id": 85624, "fullname": "Andrea Vedaldi", "url": "http://cvpr.thecvf.com/api/miniconf/users/85624?format=json", "institution": "University of Oxford"}, {"id": 87005, "fullname": "Or Litany", "url": "http://cvpr.thecvf.com/api/miniconf/users/87005?format=json", "institution": "NVIDIA / Technion"}], "abstract": "We often aim to generate images that are both photorealistic and 3D-consistent, adhering to precise geometry, material, and viewpoint controls.Typically, this is achieved by fine-tuning an image generator, pre-trained on billions of real images, using renders of synthetic 3D assets, where annotations for control signals are available.While this approach can learn the desired controls, it often compromises the realism of the images due to domain gap between photographs and renders. We observe that this issue largely arises from the model learning an unintended association between the presence of control signals and the synthetic appearance of the images.To address this, we introduce Realiz3D, a lightweight framework that decouples controls and visual domain.The key idea is to explicitly learn visual domain, real or synthetic, separately from other control signals by introducing a co-variate that, fed into small residual adapters, shifts the domain. Then, the generator can be trained to gain controllability, without fitting to specific visual domain.In this way, the model can be guided to produce realistic images even when controls are applied.We enhance control transferability to the real domain by leveraging insights about roles of different layers and denoising steps in diffusion-based generators, informing new training and inference strategies that further mitigate the gap.We demonstrate the advantages of Realiz3D in tasks as text-to-multiview generation and texturing from 3D inputs, producing outputs that are 3D-consistent and photorealistic.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39965", "url": null, "sourceid": 30885, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37433, "uid": "bcbc28b41f550d272eb2f1d343011760", "name": "SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control", "authors": [{"id": 181438, "fullname": "Arman Zarei", "url": "http://cvpr.thecvf.com/api/miniconf/users/181438?format=json", "institution": "Department of Computer Science, University of Maryland, College Park; Netflix"}, {"id": 187435, "fullname": "Samyadeep Basu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187435?format=json", "institution": "University of Maryland, College Park"}, {"id": 187436, "fullname": "Mobina Pournemat", "url": "http://cvpr.thecvf.com/api/miniconf/users/187436?format=json", "institution": "University of Maryland, College Park"}, {"id": 187437, "fullname": "Sayan Nag", "url": "http://cvpr.thecvf.com/api/miniconf/users/187437?format=json", "institution": "Adobe Research"}, {"id": 187438, "fullname": "Ryan A. Rossi", "url": "http://cvpr.thecvf.com/api/miniconf/users/187438?format=json", "institution": "Adobe Research"}, {"id": 84636, "fullname": "Soheil Feizi", "url": "http://cvpr.thecvf.com/api/miniconf/users/84636?format=json", "institution": "University of Maryland, College Park"}], "abstract": "Instruction-based image editing models have recently achieved impressive performance, enabling complex edits to an input image from a multi-instruction prompt. However, these models apply each instruction in the prompt with a fixed strength, limiting the user\u2019s ability to precisely and continuously control the intensity of individual edits.We introduce *SliderEdit*, a framework for continuous image editing with fine-grained, interpretable instruction control. Given a multi-part edit instruction, SliderEdit disentangles the individual instructions and exposes each as a globally trained slider, allowing smooth adjustment of its strength. Unlike prior works that introduced slider-based attribute controls in text-to-image generation, typically requiring separate training or fine-tuning for each attribute or concept, our method learns a *single* set of low-rank adaptation matrices that generalize across diverse edits, attributes, and compositional instructions. This enables continuous interpolation along individual edit dimensions while preserving both spatial locality and global semantic consistency. We apply SliderEdit to state-of-the-art editing models, including FLUX-Kontext and Qwen-Image-Edit, and observe substantial improvements in edit controllability, visual consistency, and user steerability. We are the first to explore and propose a framework for continuous, fine-grained instruction control in image editing models. Our results pave the way for interactive, instruction-driven image manipulation with continuous and compositional control.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37433", "url": null, "sourceid": 35955, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40301?format=json"], "related_events_ids": [40301]}, {"id": 40301, "uid": "bcbc28b41f550d272eb2f1d343011760", "name": "SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control", "authors": [{"id": 181438, "fullname": "Arman Zarei", "url": "http://cvpr.thecvf.com/api/miniconf/users/181438?format=json", "institution": "Department of Computer Science, University of Maryland, College Park; Netflix"}, {"id": 187435, "fullname": "Samyadeep Basu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187435?format=json", "institution": "University of Maryland, College Park"}, {"id": 187436, "fullname": "Mobina Pournemat", "url": "http://cvpr.thecvf.com/api/miniconf/users/187436?format=json", "institution": "University of Maryland, College Park"}, {"id": 187437, "fullname": "Sayan Nag", "url": "http://cvpr.thecvf.com/api/miniconf/users/187437?format=json", "institution": "Adobe Research"}, {"id": 187438, "fullname": "Ryan A. Rossi", "url": "http://cvpr.thecvf.com/api/miniconf/users/187438?format=json", "institution": "Adobe Research"}, {"id": 84636, "fullname": "Soheil Feizi", "url": "http://cvpr.thecvf.com/api/miniconf/users/84636?format=json", "institution": "University of Maryland, College Park"}], "abstract": "Instruction-based image editing models have recently achieved impressive performance, enabling complex edits to an input image from a multi-instruction prompt. However, these models apply each instruction in the prompt with a fixed strength, limiting the user\u2019s ability to precisely and continuously control the intensity of individual edits.We introduce *SliderEdit*, a framework for continuous image editing with fine-grained, interpretable instruction control. Given a multi-part edit instruction, SliderEdit disentangles the individual instructions and exposes each as a globally trained slider, allowing smooth adjustment of its strength. Unlike prior works that introduced slider-based attribute controls in text-to-image generation, typically requiring separate training or fine-tuning for each attribute or concept, our method learns a *single* set of low-rank adaptation matrices that generalize across diverse edits, attributes, and compositional instructions. This enables continuous interpolation along individual edit dimensions while preserving both spatial locality and global semantic consistency. We apply SliderEdit to state-of-the-art editing models, including FLUX-Kontext and Qwen-Image-Edit, and observe substantial improvements in edit controllability, visual consistency, and user steerability. We are the first to explore and propose a framework for continuous, fine-grained instruction control in image editing models. Our results pave the way for interactive, instruction-driven image manipulation with continuous and compositional control.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40301", "url": null, "sourceid": -35955, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37433?format=json"], "related_events_ids": [37433]}, {"id": 36211, "uid": "a162d5eaf59d4935d3f6196f03f7b994", "name": "Lite Any Stereo: Efficient Zero-Shot Stereo Matching", "authors": [{"id": 182350, "fullname": "Junpeng Jing", "url": "http://cvpr.thecvf.com/api/miniconf/users/182350?format=json", "institution": "Imperial College London"}, {"id": 150460, "fullname": "Weixun Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/150460?format=json", "institution": "Imperial College London"}, {"id": 184461, "fullname": "Ye Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184461?format=json", "institution": "Imperial College London"}, {"id": 184462, "fullname": "Krystian Mikolajczyk", "url": "http://cvpr.thecvf.com/api/miniconf/users/184462?format=json", "institution": "Imperial College London"}], "abstract": "Recent advances in stereo matching have focused on accuracy, often at the cost of significantly increased model size. Traditionally, the community has regarded efficient models as incapable of zero-shot ability due to their limited capacity. In this paper, we introduce Lite Any Stereo, a stereo depth estimation framework that achieves strong zero-shot generalization while remaining highly efficient. To this end, we design a compact yet expressive backbone to ensure scalability, along with a carefully crafted hybrid cost aggregation module. We further propose a three-stage training strategy on million-scale data to effectively bridge the sim-to-real gap. Together, these components demonstrate that an ultra-light model can deliver strong zero-shot generalization, ranking 1st across four widely used real-world benchmarks. Remarkably, our model attains accuracy comparable to or exceeding state-of-the-art non-prior-based accurate methods while requiring less than 1% of their computational cost, setting a new standard for efficient stereo matching. Code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36211", "url": null, "sourceid": 33374, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36461, "uid": "7221cf069a295e443767735660697a24", "name": "From None to All: Self-Supervised 3D Reconstruction via Novel View Synthesis", "authors": [{"id": 185108, "fullname": "Ranran Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185108?format=json", "institution": "Imperial College London"}, {"id": 150460, "fullname": "Weixun Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/150460?format=json", "institution": "Imperial College London"}, {"id": 184461, "fullname": "Ye Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184461?format=json", "institution": "Imperial College London"}, {"id": 184462, "fullname": "Krystian Mikolajczyk", "url": "http://cvpr.thecvf.com/api/miniconf/users/184462?format=json", "institution": "Imperial College London"}], "abstract": "In this paper, we introduce NAS3R, a self-supervised feed-forward framework that jointly learns explicit 3D geometry and camera parameters with no ground-truth annotations and no pretrained priors.Given uncalibrated and unposed multi-view images, NAS3R reconstructs 3D Gaussian primitives from context views and renders target views using its self-predicted camera parameters, enabling self-supervised training from 2D photometric supervision.To ensure stable convergence, NAS3R integrates scene reconstruction and camera estimation within a shared transformer backbone regulated by masked attention, and adopts a depth-based Gaussian formulation that facilitates well-conditioned optimization.The framework is compatible with state-of-the-art architectures and can  incorporate pretrained priors or intrinsic information when available.Extensive experiments show that NAS3R achieves  superior results to other self-supervised methods, establishing a scalable and geometry-aware paradigm for 3D learning from unconstrained data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36461", "url": null, "sourceid": 36489, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39862, "uid": "5d2a47cdbbcc3d3af68f7d4003f796b5", "name": "HalluGen: Synthesizing Realistic and Controllable Hallucinations for Evaluating Image Restoration", "authors": [{"id": 181882, "fullname": "Seunghoi Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/181882?format=json", "institution": "University College London"}, {"id": 193001, "fullname": "Henry Tregidgo", "url": "http://cvpr.thecvf.com/api/miniconf/users/193001?format=json", "institution": "University College London, University of London"}, {"id": 193002, "fullname": "Chen Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/193002?format=json", "institution": "Astrazeneca"}, {"id": 193003, "fullname": "Matteo Figini", "url": "http://cvpr.thecvf.com/api/miniconf/users/193003?format=json", "institution": "University College London"}, {"id": 185469, "fullname": "Daniel C. Alexander", "url": "http://cvpr.thecvf.com/api/miniconf/users/185469?format=json", "institution": "University College London"}], "abstract": "Generative models are prone to hallucinations: plausible but incorrect structures absent in the ground truth. This issue is problematic in image restoration for safety-critical domains such as medical imaging, industrial inspection, and remote sensing, where such errors undermine reliability and trust. For example, in low-field MRI, widely used in resource-limited settings, restoration models are essential for enhancing low-quality scans, yet hallucinations can lead to serious diagnostic errors.Progress has been hindered by a circular dependency: evaluating hallucinations requires labeled data, yet such labels are costly and subjective.We introduce HalluGen, a diffusion-based framework that synthesizes realistic hallucinations with controllable type, location, and severity, producing perceptually realistic but semantically incorrect outputs (segmentation IoU drops from 0.86 to 0.36).Using HalluGen, we construct the first large-scale hallucination dataset comprising 4,350 annotated images derived from 1,450 brain MR images for low-field enhancement, enabling systematic evaluation of hallucination detection and mitigation.We demonstrate its utility in two applications: (1) benchmarking image quality metrics and developing Semantic Hallucination Assessment via Feature Evaluation (SHAFE), a feature-based metric with soft-attention pooling that improves hallucination sensitivity over traditional metrics; and (2) training reference-free hallucination detectors that generalize to real restoration failures.Together, HalluGen and its open dataset establish the first scalable foundation for evaluating hallucinations in safety-critical image restoration.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39862", "url": null, "sourceid": 32266, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37488, "uid": "2920bee1c45be7884374eabc7f50f78d", "name": "SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation", "authors": [{"id": 152414, "fullname": "Vaibhav Agrawal", "url": "http://cvpr.thecvf.com/api/miniconf/users/152414?format=json", "institution": "IIIT-Hyderabad"}, {"id": 126903, "fullname": "Rishubh Parihar", "url": "http://cvpr.thecvf.com/api/miniconf/users/126903?format=json", "institution": "Indian Institute of Science, Bangalore"}, {"id": 187572, "fullname": "Pradhaan S Bhat", "url": "http://cvpr.thecvf.com/api/miniconf/users/187572?format=json", "institution": "Indian Institute of Science, Bangalore"}, {"id": 151819, "fullname": "Ravi Kiran Sarvadevabhatla", "url": "http://cvpr.thecvf.com/api/miniconf/users/151819?format=json", "institution": "International Institute of Information Technology Hyderabad, Dhirubhai Ambani Institute Of Information and Communication Technology"}, {"id": 76920, "fullname": "R. Venkatesh Babu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76920?format=json", "institution": "Indian Institute of Science"}], "abstract": "We identify occlusion reasoning as a fundamental yet overlooked aspect for 3D layout\u2013conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can generate realistic scenes that follow input layouts, they often fail to model precise inter-object occlusions. We propose \\textbf{SeeThrough3D}, a model for 3D layout conditioned generation that explicitly models occlusions. We introduce an occlusion-aware 3D scene representation (OSCR), where objects are depicted as translucent 3D boxes placed within a virtual environment and rendered from desired camera viewpoint. The transparency encodes hidden object regions, enabling the model to reason about occlusions, while the rendered viewpoint provides explicit camera control during generation. We condition a pretrained flow based text-to-image image generation model by introducing a set of visual tokens derived from our rendered 3D representation. Furthermore, we apply masked self-attention to accurately bind each object bounding box to its corresponding textual description, enabling accurate generation of multiple objects without object attribute mixing. To train the model, we construct a synthetic dataset with diverse multi-object scenes with strong inter-object occlusions. SeeThrough3D generalizes effectively to unseen object categories and enables precise 3D layout control with realistic occlusions and consistent camera control.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37488", "url": null, "sourceid": 41166, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37602, "uid": "0dcee28928c857ea32a8eb174c74c663", "name": "OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective", "authors": [{"id": 187804, "fullname": "Markus Gross", "url": "http://cvpr.thecvf.com/api/miniconf/users/187804?format=json", "institution": "TU Munich"}, {"id": 187805, "fullname": "Sai Bharadhwaj Matha", "url": "http://cvpr.thecvf.com/api/miniconf/users/187805?format=json", "institution": ""}, {"id": 187806, "fullname": "Aya Fahmy", "url": "http://cvpr.thecvf.com/api/miniconf/users/187806?format=json", "institution": "Fraunhofer"}, {"id": 179571, "fullname": "Rui Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/179571?format=json", "institution": "University of California, Los Angeles; University of Cambridge"}, {"id": 84985, "fullname": "Daniel Cremers", "url": "http://cvpr.thecvf.com/api/miniconf/users/84985?format=json", "institution": "Technical University Munich"}, {"id": 187807, "fullname": "Henri Mee\u00df", "url": "http://cvpr.thecvf.com/api/miniconf/users/187807?format=json", "institution": "SWARM Biotactics GmbH"}], "abstract": "Semantic Scene Completion (SSC) is crucial for 3D perception in mobile robotics, as it enables holistic scene understanding by jointly estimating dense volumetric occupancy and per-voxel semantics. Although SSC has been widely studied in terrestrial domains such as autonomous driving, aerial scenarios like autonomous flying remain largely unexplored, thereby limiting progress on downstream applications. Furthermore, LiDAR sensors represent the primary modality for SSC data generation, which poses challenges for most uncrewed aerial vehicles (UAVs) due to flight regulations, mass and energy constraints, and the sparsity of LiDAR-based point clouds from elevated viewpoints. To address these limitations, we introduce OccuFly, the first real-world, camera-based aerial SSC benchmark, captured at altitudes of 50m, 40m, and 30m during spring, summer, fall, and winter. OccuFly covers urban, industrial, and rural scenarios, provides 22 semantic classes, and the data format adheres to established conventions to facilitate seamless integration with existing research. Crucially, we propose a LiDAR-free data generation framework that is based on camera modality, which is ubiquitous on modern UAVs. By utilizing traditional 3D reconstruction, our framework automates label transfer by projecting annotated 2D masks into the reconstructed 3D point cloud, thereby minimizing manual 3D annotation effort. Finally, we benchmark several state-of-the-art SSC methods on OccuFly using standard metrics, and highlight challenges specific to aerial viewpoints, yielding a comprehensive aerial vision benchmark that fosters holistic aerial 3D scene understanding.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37602", "url": null, "sourceid": 42703, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40308?format=json"], "related_events_ids": [40308]}, {"id": 40308, "uid": "0dcee28928c857ea32a8eb174c74c663", "name": "OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective", "authors": [{"id": 187804, "fullname": "Markus Gross", "url": "http://cvpr.thecvf.com/api/miniconf/users/187804?format=json", "institution": "TU Munich"}, {"id": 187805, "fullname": "Sai Bharadhwaj Matha", "url": "http://cvpr.thecvf.com/api/miniconf/users/187805?format=json", "institution": ""}, {"id": 187806, "fullname": "Aya Fahmy", "url": "http://cvpr.thecvf.com/api/miniconf/users/187806?format=json", "institution": "Fraunhofer"}, {"id": 179571, "fullname": "Rui Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/179571?format=json", "institution": "University of California, Los Angeles; University of Cambridge"}, {"id": 84985, "fullname": "Daniel Cremers", "url": "http://cvpr.thecvf.com/api/miniconf/users/84985?format=json", "institution": "Technical University Munich"}, {"id": 187807, "fullname": "Henri Mee\u00df", "url": "http://cvpr.thecvf.com/api/miniconf/users/187807?format=json", "institution": "SWARM Biotactics GmbH"}], "abstract": "Semantic Scene Completion (SSC) is crucial for 3D perception in mobile robotics, as it enables holistic scene understanding by jointly estimating dense volumetric occupancy and per-voxel semantics. Although SSC has been widely studied in terrestrial domains such as autonomous driving, aerial scenarios like autonomous flying remain largely unexplored, thereby limiting progress on downstream applications. Furthermore, LiDAR sensors represent the primary modality for SSC data generation, which poses challenges for most uncrewed aerial vehicles (UAVs) due to flight regulations, mass and energy constraints, and the sparsity of LiDAR-based point clouds from elevated viewpoints. To address these limitations, we introduce OccuFly, the first real-world, camera-based aerial SSC benchmark, captured at altitudes of 50m, 40m, and 30m during spring, summer, fall, and winter. OccuFly covers urban, industrial, and rural scenarios, provides 22 semantic classes, and the data format adheres to established conventions to facilitate seamless integration with existing research. Crucially, we propose a LiDAR-free data generation framework that is based on camera modality, which is ubiquitous on modern UAVs. By utilizing traditional 3D reconstruction, our framework automates label transfer by projecting annotated 2D masks into the reconstructed 3D point cloud, thereby minimizing manual 3D annotation effort. Finally, we benchmark several state-of-the-art SSC methods on OccuFly using standard metrics, and highlight challenges specific to aerial viewpoints, yielding a comprehensive aerial vision benchmark that fosters holistic aerial 3D scene understanding.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40308", "url": null, "sourceid": -42703, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37602?format=json"], "related_events_ids": [37602]}, {"id": 40197, "uid": "18074a65b9a1273408c6d5a00e1da59e", "name": "Unsupervised Multi-agent and Single-agent Perception from Cooperative Views", "authors": [{"id": 183988, "fullname": "Haochen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183988?format=json", "institution": "CLEVELAND STATE UNIVERSITY"}, {"id": 126979, "fullname": "Baolu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/126979?format=json", "institution": "Cleveland State University"}, {"id": 193755, "fullname": "Lei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193755?format=json", "institution": "Cleveland State University"}, {"id": 193756, "fullname": "Delin Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/193756?format=json", "institution": "Cleveland State University"}, {"id": 193757, "fullname": "Jiacheng Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/193757?format=json", "institution": "Cleveland State University"}, {"id": 84704, "fullname": "Minghai Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/84704?format=json", "institution": "Western Digital Corporation"}, {"id": 193758, "fullname": "Tianyun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193758?format=json", "institution": "Cleveland State University"}, {"id": 89963, "fullname": "Hongkai Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89963?format=json", "institution": "Cleveland State University"}], "abstract": "The LiDAR sensor based multi-agent and single-agent perception has shown promising performance in the environmental understanding for robots and automated vehicles.  However, there is no existing method that simultaneously solves both multi-agent and single-agent perception in an unsupervised way. By sharing sensor data between multiple agents via communication, this paper discovers two key insights: 1) Improved point cloud density after the data sharing from cooperative views could benefit unsupervised object classification, 2) Cooperative view of multiple agents can be used as unsupervised guidance to the 3D object detection in the single view. Based on these two discovered insights, we propose an Unsupervised Multi-agent and Single-agent (UMS) perception framework that leverages multi-agent cooperation without human annotations to simultaneously solve multi-agent and single-agent perception. UMS combines a learning-based Proposal Purifying Filter to better classify the candidate proposals after multi-agent point cloud density cooperation, followed by a Progressive Proposal Stabilizing module to yield reliable pseudo labels by the easy-to-hard curriculum learning. Furthermore, we design a Cross-View Consensus Learning to use multi-agent cooperative view to guide detection in single-agent view. Experimental results on two  public datasets  V2V4Real and OPV2V  show that our UMS method achieved significantly higher 3D detection  performance than the state-of-the-art methods on both multi-agent and single-agent perception tasks in an unsupervised way.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40197", "url": null, "sourceid": 45354, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36926, "uid": "fb3176bfa58b2e438f42411a2fb3443d", "name": "The Missing GAP: From Solving Square Jigsaw Puzzles To Handling Real World Archaeological Fragments", "authors": [{"id": 180485, "fullname": "Ofir Shahar", "url": "http://cvpr.thecvf.com/api/miniconf/users/180485?format=json", "institution": "Ben-Gurion University of the Negev"}, {"id": 186234, "fullname": "Gur Elkin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186234?format=json", "institution": "Ben Gurion University of the Negev"}, {"id": 186235, "fullname": "Ohad Ben-Shahar", "url": "http://cvpr.thecvf.com/api/miniconf/users/186235?format=json", "institution": "Ben Gurion University of the Negev"}], "abstract": "Jigsaw puzzle solving has been an increasingly popular task in the computer vision research community. Recent works have utilized cutting-edge architectures and computational approaches to reassemble groups of pieces into a coherent image, while achieving increasingly good results on well established datasets. However, most of these approaches share a common, restricting setting: operating solely on strictly square puzzle pieces. In this work, we introduce GAP, a set of novel jigsaw puzzles datasets containing synthetic, heavily eroded pieces of unrestricted shapes, generated by a learned distribution of real-world archaeological fragments. We also introduce PuzzleFlow, a novel ViT and Flow-Matching based framework for jigsaw puzzle solving, capable of handling complex puzzle pieces, demonstrating superior performance on GAP, comparing to both classic and recent prominent works in this domain.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36926", "url": null, "sourceid": 38607, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38005, "uid": "001d3439223b7bb23ed89b9c8890d096", "name": "Thinking in 360\u00b0: Humanoid Visual Search in the Wild", "authors": [{"id": 188806, "fullname": "Heyang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188806?format=json", "institution": "New York University; Tsinghua University"}, {"id": 188807, "fullname": "Yinan Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/188807?format=json", "institution": "Technische Universit\u00e4t Darmstadt"}, {"id": 169906, "fullname": "Xiangyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/169906?format=json", "institution": "Department of Computer Science and Technology, Tsinghua University"}, {"id": 188808, "fullname": "Baiqiao Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188808?format=json", "institution": "New York University"}, {"id": 188809, "fullname": "Bowen Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188809?format=json", "institution": "New York University"}, {"id": 188810, "fullname": "Xiangyu Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/188810?format=json", "institution": "May Mobility"}, {"id": 74032, "fullname": "Xinhao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/74032?format=json", "institution": "New York University"}, {"id": 104589, "fullname": "Jing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/104589?format=json", "institution": "New York University"}, {"id": 88162, "fullname": "Marco Pavone", "url": "http://cvpr.thecvf.com/api/miniconf/users/88162?format=json", "institution": "NVIDIA"}, {"id": 86335, "fullname": "Chen Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86335?format=json", "institution": "New York University"}, {"id": 90241, "fullname": "Saining Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/90241?format=json", "institution": "Facebook"}, {"id": 77530, "fullname": "Yiming Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/77530?format=json", "institution": "New York University"}], "abstract": "Humans rely on the synergistic control of head (cephalomotor) and eye (oculomotor) to efficiently search for visual information in 360\u00b0. However, prior approaches to visual search are limited to a static image, neglecting the physical embodiment and its interaction with the 3D world. How can we develop embodied visual search agents as efficient as humans while bypassing the constraints imposed by real-world hardware? To this end, we propose humanoid visual search where a humanoid agent actively rotates its head to search for objects or paths in an immersive world represented by a 360\u00b0 panoramic image. To study visual search in visually-crowded real-world scenarios, we build H* Bench, a new benchmark that moves beyond household scenes to challenging in-the-wild scenes that necessitate advanced visual-spatial reasoning capabilities, such as transportation hubs, large-scale retail spaces, urban streets, and public institutions. Our experiments first reveal that even top-tier proprietary models falter, achieving only ~30% success in object and path search. We then use post-training techniques to enhance the open-source Qwen2.5-VL, increasing its success rate by over threefold for both object search (14.83% \u2192 47.38%) and path search (6.44% \u2192 24.94%). Notably, the lower ceiling of path search reveals its inherent difficulty, which we attribute to the demand for sophisticated spatial commonsense. Our results not only show a promising path forward but also quantify the immense challenge that remains in building MLLM agents that can be seamlessly integrated into everyday human life.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38005", "url": null, "sourceid": 45930, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39730, "uid": "a6101db109c4529f60805430ce177318", "name": "VGGT-$\\Omega$", "authors": [{"id": 160682, "fullname": "Jianyuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/160682?format=json", "institution": "Oxford VGG"}, {"id": 128085, "fullname": "Minghao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/128085?format=json", "institution": "University of Oxford"}, {"id": 190744, "fullname": "Shangzhan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190744?format=json", "institution": "University of Oxford"}, {"id": 86804, "fullname": "Nikita Karaev", "url": "http://cvpr.thecvf.com/api/miniconf/users/86804?format=json", "institution": "University of Oxford"}, {"id": 153646, "fullname": "Johannes Sch\u00f6nberger", "url": "http://cvpr.thecvf.com/api/miniconf/users/153646?format=json", "institution": "Meta"}, {"id": 95950, "fullname": "Patrick Labatut", "url": "http://cvpr.thecvf.com/api/miniconf/users/95950?format=json", "institution": "Meta AI"}, {"id": 84556, "fullname": "Piotr Bojanowski", "url": "http://cvpr.thecvf.com/api/miniconf/users/84556?format=json", "institution": "Facebook"}, {"id": 152120, "fullname": "David Novotny", "url": "http://cvpr.thecvf.com/api/miniconf/users/152120?format=json", "institution": "Meta"}, {"id": 85624, "fullname": "Andrea Vedaldi", "url": "http://cvpr.thecvf.com/api/miniconf/users/85624?format=json", "institution": "University of Oxford"}, {"id": 129663, "fullname": "Christian Rupprecht", "url": "http://cvpr.thecvf.com/api/miniconf/users/129663?format=json", "institution": "University of Oxford"}], "abstract": "We present VGGT-\u03a9, a feed-forward model for 3D reconstruction that substantially advances the state of the art in accuracy, efficiency, and capability for both static and dynamic scenes. Prior models such as VGGT have shown that feed-forward 3D reconstruction can already be competitive with traditional optimization-based methods. Here, we further demonstrate that the accuracy and robustness of these models scale predictably with model capacity and data size. To enable training 3D reconstruction models at an unprecedented scale, we introduce a high-quality data annotation pipeline that handles dynamic scenes, a self-supervised learning protocol, and architectural changes that greatly reduce memory requirements. We significantly simplify VGGT\u2019s architecture by replacing multiple dense prediction heads with loss-driven multitask learning, removing unstable DPT blocks, and introducing more efficient global attention via scene tokens. These changes allow us to efficiently train VGGT-\u03a9 with 20$\\times$ more supervised data and 100$\\times$ more unsupervised data than prior work, while requiring only 30% of VGGT\u2019s memory and running 1.6$\\times$ faster at inference. As a result, VGGT-\u03a9 establishes a new state of the art for 3D reconstruction on both static and dynamic scenes across a wide range of benchmarks, e.g., improving the camera estimation accuracy by 77% on the Sintel dataset. Models and code will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39730", "url": null, "sourceid": 31221, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40381?format=json"], "related_events_ids": [40381]}, {"id": 40381, "uid": "a6101db109c4529f60805430ce177318", "name": "VGGT-$\\Omega$", "authors": [{"id": 160682, "fullname": "Jianyuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/160682?format=json", "institution": "Oxford VGG"}, {"id": 128085, "fullname": "Minghao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/128085?format=json", "institution": "University of Oxford"}, {"id": 190744, "fullname": "Shangzhan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190744?format=json", "institution": "University of Oxford"}, {"id": 86804, "fullname": "Nikita Karaev", "url": "http://cvpr.thecvf.com/api/miniconf/users/86804?format=json", "institution": "University of Oxford"}, {"id": 153646, "fullname": "Johannes Sch\u00f6nberger", "url": "http://cvpr.thecvf.com/api/miniconf/users/153646?format=json", "institution": "Meta"}, {"id": 95950, "fullname": "Patrick Labatut", "url": "http://cvpr.thecvf.com/api/miniconf/users/95950?format=json", "institution": "Meta AI"}, {"id": 84556, "fullname": "Piotr Bojanowski", "url": "http://cvpr.thecvf.com/api/miniconf/users/84556?format=json", "institution": "Facebook"}, {"id": 152120, "fullname": "David Novotny", "url": "http://cvpr.thecvf.com/api/miniconf/users/152120?format=json", "institution": "Meta"}, {"id": 85624, "fullname": "Andrea Vedaldi", "url": "http://cvpr.thecvf.com/api/miniconf/users/85624?format=json", "institution": "University of Oxford"}, {"id": 129663, "fullname": "Christian Rupprecht", "url": "http://cvpr.thecvf.com/api/miniconf/users/129663?format=json", "institution": "University of Oxford"}], "abstract": "We present VGGT-\u03a9, a feed-forward model for 3D reconstruction that substantially advances the state of the art in accuracy, efficiency, and capability for both static and dynamic scenes. Prior models such as VGGT have shown that feed-forward 3D reconstruction can already be competitive with traditional optimization-based methods. Here, we further demonstrate that the accuracy and robustness of these models scale predictably with model capacity and data size. To enable training 3D reconstruction models at an unprecedented scale, we introduce a high-quality data annotation pipeline that handles dynamic scenes, a self-supervised learning protocol, and architectural changes that greatly reduce memory requirements. We significantly simplify VGGT\u2019s architecture by replacing multiple dense prediction heads with loss-driven multitask learning, removing unstable DPT blocks, and introducing more efficient global attention via scene tokens. These changes allow us to efficiently train VGGT-\u03a9 with 20$\\times$ more supervised data and 100$\\times$ more unsupervised data than prior work, while requiring only 30% of VGGT\u2019s memory and running 1.6$\\times$ faster at inference. As a result, VGGT-\u03a9 establishes a new state of the art for 3D reconstruction on both static and dynamic scenes across a wide range of benchmarks, e.g., improving the camera estimation accuracy by 77% on the Sintel dataset. Models and code will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40381", "url": null, "sourceid": -31221, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39730?format=json"], "related_events_ids": [39730]}, {"id": 39558, "uid": "cf9ad2ca65265e31795644dfe32958db", "name": "EI-Part\uff1aExplode for Completion and Implode for Refinement", "authors": [{"id": 181192, "fullname": "wanhu sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/181192?format=json", "institution": "\u9999\u6e2f\u4e2d\u6587\u5927\u5b66(\u6df1\u5733)"}, {"id": 77211, "fullname": "Zhongjin Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/77211?format=json", "institution": "the chinese university of hong kong, shenzhen"}, {"id": 192341, "fullname": "Heliang Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192341?format=json", "institution": "USTC"}, {"id": 184898, "fullname": "Jiahao Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184898?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 90336, "fullname": "Chongjie Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/90336?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 192342, "fullname": "Huiang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/192342?format=json", "institution": "South China University of Technology"}, {"id": 192343, "fullname": "Shengchu Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192343?format=json", "institution": "Mathmagic"}, {"id": 147012, "fullname": "Rongfei Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/147012?format=json", "institution": "Mathmagical.com"}, {"id": 88683, "fullname": "Xiaoguang Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/88683?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}], "abstract": "Part-level 3D generation is crucial for various downstream applications, including gaming, film production, and industrial design. However, decomposing a 3D shape into geometrically plausible and meaningful components remains a significant challenge. Previous part-based generation methods often struggle to produce well-constructed parts, exhibiting either poor structural coherence, geometric implausibility, inaccuracy, or inefficiency. To address these challenges, we introduce EI-Part, a novel framework specifically designed to generate high-quality 3D shapes with components distinguished by structural coherence, geometric plausibility, accuracy, and generation efficiency. We propose utilizing distinct representations at different stages: an Explode state for part completion and an Implode state for geometry refinement. This strategy allows us to fully leverage spatial resolution, enabling flexible part completion and fine geometric details generation. To maintain structural coherence between parts, a self-attention mechanism is incorporated in both the exploded and imploded states, facilitating effective information perception and feature fusion among components during generation. Extensive experiments conducted on various benchmarks demonstrate that EI-Part efficiently yields semantically meaningful and structurally coherent parts with fine-grained geometric details, achieving state-of-the-art performance in part-level generation compared to existing methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39558", "url": null, "sourceid": 41543, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38674, "uid": "9926b3ed9e0bc20b2ed7f32711b8b45f", "name": "Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs", "authors": [{"id": 181246, "fullname": "Angela van Sprang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181246?format=json", "institution": "University of Amsterdam"}, {"id": 190439, "fullname": "Laurens Samson", "url": "http://cvpr.thecvf.com/api/miniconf/users/190439?format=json", "institution": "University of Amsterdam"}, {"id": 190440, "fullname": "Ana Lucic", "url": "http://cvpr.thecvf.com/api/miniconf/users/190440?format=json", "institution": "University of Amsterdam"}, {"id": 190441, "fullname": "Erman Acar", "url": "http://cvpr.thecvf.com/api/miniconf/users/190441?format=json", "institution": "University of Amsterdam"}, {"id": 190442, "fullname": "Sennay Ghebreab", "url": "http://cvpr.thecvf.com/api/miniconf/users/190442?format=json", "institution": "University of Amsterdam"}, {"id": 189862, "fullname": "Yuki M Asano", "url": "http://cvpr.thecvf.com/api/miniconf/users/189862?format=json", "institution": "University of Technology Nuremberg"}], "abstract": "We introduce two new benchmarks \\textbf{REST} and \\textbf{REST+} (Render-Equivalence Stress Tests) to enable systematic evaluation of cross-modal inconsistency in multimodal large language models (MLLMs). MLLMs are trained to represent vision and language in the same embedding space, yet they cannot perform the same tasks in both modalities. Our benchmarks contain samples with the \\textbf{same semantic information} in three modalities (image, text, mixed) and we show that state-of-the-art MLLMs cannot consistently reason over these different modalities. We evaluate 15 MLLMs and find that the degree of modality inconsistency varies substantially, even when accounting for problems with text recognition (OCR). Neither rendering text as image nor rendering an image as text solves the inconsistency. Even if OCR is correct, we find that visual characteristics (text colour and resolution, but not font) and the number of vision tokens have an impact on model performance. Finally, we find that our consistency score correlates with the modality gap between text and images, highlighting a mechanistic interpretation of cross-modal inconsistent MLLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38674", "url": null, "sourceid": 43090, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36933, "uid": "56355c10afa230a92f133313b3e75536", "name": "Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding", "authors": [{"id": 98107, "fullname": "Christopher Clark", "url": "http://cvpr.thecvf.com/api/miniconf/users/98107?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 90298, "fullname": "Jieyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90298?format=json", "institution": "Department of Computer Science, University of Washington"}, {"id": 136978, "fullname": "Zixian Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/136978?format=json", "institution": "Department of Computer Science, University of Washington"}, {"id": 85306, "fullname": "Jae Sung Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/85306?format=json", "institution": "University of Washington"}, {"id": 133433, "fullname": "Rohun Tripathi", "url": "http://cvpr.thecvf.com/api/miniconf/users/133433?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 98421, "fullname": "Sangho Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/98421?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 136938, "fullname": "Reza Salehi", "url": "http://cvpr.thecvf.com/api/miniconf/users/136938?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 179554, "fullname": "Jason Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/179554?format=json", "institution": "University of North Carolina at Chapel Hill; Allen Institute for Artificial Intelligence; University of Washington"}, {"id": 186255, "fullname": "Chris Dongjoo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/186255?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 186256, "fullname": "Yinuo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186256?format=json", "institution": "Department of Computer Science, University of Washington"}, {"id": 186257, "fullname": "Vincent Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186257?format=json", "institution": "Department of Computer Science, University of Washington"}, {"id": 186258, "fullname": "Yue Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186258?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 161151, "fullname": "Weikai Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/161151?format=json", "institution": "University of Washington"}, {"id": 184879, "fullname": "Ziqi Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184879?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 153873, "fullname": "Taira Anderson", "url": "http://cvpr.thecvf.com/api/miniconf/users/153873?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 186259, "fullname": "Jianrui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186259?format=json", "institution": "Department of Computer Science, University of Wisconsin - Madison"}, {"id": 94898, "fullname": "Jitesh Jain", "url": "http://cvpr.thecvf.com/api/miniconf/users/94898?format=json", "institution": "Georgia Tech"}, {"id": 186260, "fullname": "George Stoica", "url": "http://cvpr.thecvf.com/api/miniconf/users/186260?format=json", "institution": "Georgia Institute of Technology"}, {"id": 88931, "fullname": "Ali Farhadi", "url": "http://cvpr.thecvf.com/api/miniconf/users/88931?format=json", "institution": "University of Washington"}, {"id": 84558, "fullname": "Ranjay Krishna", "url": "http://cvpr.thecvf.com/api/miniconf/users/84558?format=json", "institution": "University of Washington"}], "abstract": "Today\u2019s strongest video-language models (VLMs) remain proprietary.The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe.As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models.Crucially, many downstream applications require more than just high-level video understanding; they require grounding\u2014either by pointing or by tracking in pixels. Even proprietary models lack this capability.We present Molmo2, a new family of VLMs that are state-of-the-art amongst open-source models and demonstrate exceptional new capabilities in point-driven grounding in single image, multi-image, and video tasks.Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs.We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme and show bi-directional attention on vision tokens and a novel token-weight strategy improve performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long-videos. On video-grounding Molmo2 outperforms larger proprietary models, including 32.9% (Molmo2) vs 17% (Gemini 2.5 Pro) on video pointing.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36933", "url": null, "sourceid": 39754, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40278?format=json"], "related_events_ids": [40278]}, {"id": 40278, "uid": "56355c10afa230a92f133313b3e75536", "name": "Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding", "authors": [{"id": 98107, "fullname": "Christopher Clark", "url": "http://cvpr.thecvf.com/api/miniconf/users/98107?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 90298, "fullname": "Jieyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90298?format=json", "institution": "Department of Computer Science, University of Washington"}, {"id": 136978, "fullname": "Zixian Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/136978?format=json", "institution": "Department of Computer Science, University of Washington"}, {"id": 85306, "fullname": "Jae Sung Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/85306?format=json", "institution": "University of Washington"}, {"id": 133433, "fullname": "Rohun Tripathi", "url": "http://cvpr.thecvf.com/api/miniconf/users/133433?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 98421, "fullname": "Sangho Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/98421?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 136938, "fullname": "Reza Salehi", "url": "http://cvpr.thecvf.com/api/miniconf/users/136938?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 179554, "fullname": "Jason Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/179554?format=json", "institution": "University of North Carolina at Chapel Hill; Allen Institute for Artificial Intelligence; University of Washington"}, {"id": 186255, "fullname": "Chris Dongjoo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/186255?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 186256, "fullname": "Yinuo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186256?format=json", "institution": "Department of Computer Science, University of Washington"}, {"id": 186257, "fullname": "Vincent Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186257?format=json", "institution": "Department of Computer Science, University of Washington"}, {"id": 186258, "fullname": "Yue Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186258?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 161151, "fullname": "Weikai Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/161151?format=json", "institution": "University of Washington"}, {"id": 184879, "fullname": "Ziqi Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184879?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 153873, "fullname": "Taira Anderson", "url": "http://cvpr.thecvf.com/api/miniconf/users/153873?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 186259, "fullname": "Jianrui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186259?format=json", "institution": "Department of Computer Science, University of Wisconsin - Madison"}, {"id": 94898, "fullname": "Jitesh Jain", "url": "http://cvpr.thecvf.com/api/miniconf/users/94898?format=json", "institution": "Georgia Tech"}, {"id": 186260, "fullname": "George Stoica", "url": "http://cvpr.thecvf.com/api/miniconf/users/186260?format=json", "institution": "Georgia Institute of Technology"}, {"id": 88931, "fullname": "Ali Farhadi", "url": "http://cvpr.thecvf.com/api/miniconf/users/88931?format=json", "institution": "University of Washington"}, {"id": 84558, "fullname": "Ranjay Krishna", "url": "http://cvpr.thecvf.com/api/miniconf/users/84558?format=json", "institution": "University of Washington"}], "abstract": "Today\u2019s strongest video-language models (VLMs) remain proprietary.The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe.As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models.Crucially, many downstream applications require more than just high-level video understanding; they require grounding\u2014either by pointing or by tracking in pixels. Even proprietary models lack this capability.We present Molmo2, a new family of VLMs that are state-of-the-art amongst open-source models and demonstrate exceptional new capabilities in point-driven grounding in single image, multi-image, and video tasks.Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs.We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme and show bi-directional attention on vision tokens and a novel token-weight strategy improve performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long-videos. On video-grounding Molmo2 outperforms larger proprietary models, including 32.9% (Molmo2) vs 17% (Gemini 2.5 Pro) on video pointing.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40278", "url": null, "sourceid": -39754, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/36933?format=json"], "related_events_ids": [36933]}, {"id": 39095, "uid": "d1c790abf37bba7049eb9e8c1e520334", "name": "TextFM: Robust Semi-dense Feature Matching with Language Guidance", "authors": [{"id": 183371, "fullname": "Zhihao Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/183371?format=json", "institution": "Lehigh University"}, {"id": 191348, "fullname": "Jinglun Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191348?format=json", "institution": "Apple"}, {"id": 191349, "fullname": "Nirav Savaliya", "url": "http://cvpr.thecvf.com/api/miniconf/users/191349?format=json", "institution": "Honda Research Institution US"}, {"id": 136550, "fullname": "Zheng-Hang Yeh", "url": "http://cvpr.thecvf.com/api/miniconf/users/136550?format=json", "institution": "Nuro.ai"}, {"id": 161147, "fullname": "Bo Lang", "url": "http://cvpr.thecvf.com/api/miniconf/users/161147?format=json", "institution": "Lehigh University"}, {"id": 191350, "fullname": "Mooi Chuah", "url": "http://cvpr.thecvf.com/api/miniconf/users/191350?format=json", "institution": "Lehigh University"}], "abstract": "Feature matching is a critical task in geometric perception, yet existing methods often struggle under domain shifts and illumination changes due to reliance on visual-only learning and expensive 3D supervision.  In this paper, we present TextFM, the first language-guided feature matching framework that incorporates domain-invariant semantic information from vision-language models (VLMs). Built upon a detector-free architecture, TextFM leverages textual embeddings as instance-level queries to provide global semantic context during coarse-level matching, enhancing robustness in challenging scenarios such as textureless surfaces and cross-domain shifts. Additionally,  we integrate illumination-invariant physical priors and apply Low-Rank Adaptation (LoRA) to efficiently fine-tune Vision Foundation Models (VFMs) for more robust visual feature extraction.  Extensive experiments on outdoor and indoor datasets show that our method outperforms other state-of-the-art methods. In addition, we contribute a synthetic day-night matching benchmark for rigorous evaluation under extreme lighting conditions. Together, our method and dataset establish a strong foundation for robust and generalizable feature matching under real-world constraints.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39095", "url": null, "sourceid": 31215, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38901, "uid": "ec8cb5c7c2b8b15598b6b6d4f41af482", "name": "Uni-DAD: Unified Distillation and Adaptation of Diffusion Models for Few-step Few-shot Image Generation", "authors": [{"id": 190938, "fullname": "Yara Bahram", "url": "http://cvpr.thecvf.com/api/miniconf/users/190938?format=json", "institution": "\u00c9cole de technologie sup\u00e9rieure (ETS Montreal)"}, {"id": 183369, "fullname": "M\u00e9lodie Desbos", "url": "http://cvpr.thecvf.com/api/miniconf/users/183369?format=json", "institution": "LIVIA, ETS"}, {"id": 190939, "fullname": "Mohammadhadi Shateri", "url": "http://cvpr.thecvf.com/api/miniconf/users/190939?format=json", "institution": "\u00c9cole de technologie sup\u00e9rieure, Universit\u00e9 du Qu\u00e9bec"}, {"id": 77403, "fullname": "Eric Granger", "url": "http://cvpr.thecvf.com/api/miniconf/users/77403?format=json", "institution": "ETS Montreal"}], "abstract": "Diffusion models (DMs) produce high-quality images, yet their sampling remains costly when adapted to new domains. Distilled DMs are faster but typically remain confined within their teacher\u2019s domain. Thus, fast and high-quality generation for novel domains relies on two-stage training pipelines: Adapt-then-Distill or Distill-then-Adapt. However, both add design complexity and suffer from degraded quality or diversity. We introduce Uni-DAD, a single-stage pipeline that unifies distillation and adaptation of DMs. It couples two signals during training: (i) a dual-domain distribution-matching distillation objective that guides the student toward the distributions of the source teacher and a target teacher, and (ii) a multi-head generative adversarial network (GAN) loss that encourages target realism across multiple feature scales. The source domain distillation preserves diverse source knowledge, while the multi-head GAN stabilizes training and reduces overfitting, especially in few-shot regimes. The inclusion of a target teacher facilitates adaptation to more structurally distant domains. We perform evaluations on a variety of datasets for few-shot image generation (FSIG). Uni-DAD delivers higher quality than state-of-the-art (SoTA) adaptation methods even with $\\leq4$ sampling steps, and outperforms two-stage training pipelines in both quality and diversity.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38901", "url": null, "sourceid": 43177, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38019, "uid": "8ed67538022400bdba14f3e8b4bf6ab3", "name": "PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation", "authors": [{"id": 182723, "fullname": "Minjae Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/182723?format=json", "institution": "Pohang University of Science and Technology"}, {"id": 188839, "fullname": "Sungwoo Hur", "url": "http://cvpr.thecvf.com/api/miniconf/users/188839?format=json", "institution": "POSTECH"}, {"id": 188840, "fullname": "Soojin Hwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188840?format=json", "institution": "Pohang University of Science and Technology"}, {"id": 89062, "fullname": "Won Hwa Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/89062?format=json", "institution": "POSTECH"}], "abstract": "Visual Foundation Models (VFMs) such as the Segment Anything Model (SAM) have significantly advanced broad use of image segmentation. However, SAM and its variants necessitate substantial manual effort for prompt generation and additional training for specific applications. Recent approaches address these limitations by integrating SAM into in-context (one/few shot) segmentation, enabling auto-prompting through semantic alignment between query and support images. Despite these efforts, they still generate sub-optimal prompts that degrade segmentation quality due to visual inconsistencies between support and query images. To tackle this limitation, we introduce PR-MaGIC (Prompt Refinement via Mask Decoder Gradient Flow for In-Context Segmentation), a training-free test-time framework that refines prompts via gradient flow derived from SAM\u2019s mask decoder. PR-MaGIC seamlessly integrates into in-context segmentation frameworks, being theoretically grounded yet practically stabilized through a simple top-1 selection strategy that ensures robust performance across samples.Extensive evaluations demonstrate that PR-MaGIC consistently improves segmentation quality across various benchmarks, effectively mitigating inadequate prompts without requiring additional training or architectural modifications.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38019", "url": null, "sourceid": 45817, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40321?format=json"], "related_events_ids": [40321]}, {"id": 40149, "uid": "8845795a6b7bb32de1b64d7521ff976b", "name": "Globally Optimal Pose from Silhouettes", "authors": [{"id": 193638, "fullname": "Agniva Sengupta", "url": "http://cvpr.thecvf.com/api/miniconf/users/193638?format=json", "institution": "Zuse Institute Berlin"}, {"id": 193639, "fullname": "Dilara Kus", "url": "http://cvpr.thecvf.com/api/miniconf/users/193639?format=json", "institution": "Zuse Institute Berlin; Freie Universit\u00e4t Berlin"}, {"id": 193640, "fullname": "Jianning Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193640?format=json", "institution": "Zuse Institute Berlin"}, {"id": 193641, "fullname": "Stefan Zachow", "url": "http://cvpr.thecvf.com/api/miniconf/users/193641?format=json", "institution": "Zuse Institute Berlin (ZIB)"}], "abstract": "We solve the problem of determining the pose of known shapes in $\\mathbb{R}^3$ from their unoccluded silhouettes. The pose is determined up to global optimality using a simple yet under-explored property of the area-of-silhouette: its continuity w.r.t trajectories in the rotation space. The proposed method utilises pre-computed silhouette-signatures, modelled as a response surface of the area-of-silhouettes. Querying this silhouette-signature response surface for pose estimation leads to a strong branching of the rotation search space, making resolution-guided candidate search feasible. Additionally, we utilise the aspect ratio of 2D ellipses fitted to projected silhouettes as an auxiliary global shape signature to accelerate the pose search. This combined strategy forms the first method to efficiently estimate globally optimal pose from just the silhouettes, without being guided by correspondences, for any shape, irrespective of its convexity and genus. We validate our method on synthetic and real examples, demonstrating significantly improved accuracy against comparable approaches.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40149", "url": null, "sourceid": 43079, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36273, "uid": "dd2a4a45d18fb70e74e7cbd38b6ac1bf", "name": "GP-4DGS: Probabilistic Analysis of 4D Gaussian Splattings for Monocular Video Reconstruction via Variational Gaussian Processes", "authors": [{"id": 182751, "fullname": "Mijeong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/182751?format=json", "institution": "Huawei Technologies Ltd.; Seoul National University"}, {"id": 184647, "fullname": "Jungtaek Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/184647?format=json", "institution": "University of Wisconsin\u2013Madison"}, {"id": 75881, "fullname": "Bohyung Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/75881?format=json", "institution": "Seoul National University"}], "abstract": "We present GP-4DGS, a probabilistic framework for monocular video reconstruction that models the motion of 4D Gaussian Splatting (GS) primitives using variational Gaussian Processes (GPs). In contrast to prior approaches that depend on manually designed motion priors, our kernel-based probabilistic formulation enables flexible, data-adaptive motion modeling while implicitly providing appropriate priors for unobserved regions. GP-4DGS employs variational GPs with spatial kernels to capture geometric correlations and periodic kernels to characterize temporal dynamics, achieving efficient scalability to large sets of primitives compared to standard GPs. To train GP-4DGS, we introduce an optimization strategy that jointly optimizes GS primitive parameters as well as GP hyperparameters, establishing a complementary relationship between probabilistic and geometric modeling. Beyond improved reconstruction quality, our variational GP formulation naturally supports uncertainty quantification and temporal extrapolation beyond the input sequence. Experiments on challenging dynamic scenes demonstrate that GP-4DGS delivers high-quality reconstructions, robustly handles severe occlusions and extreme viewpoints, and provides principled uncertainty estimation and extrapolation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36273", "url": null, "sourceid": 44502, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37390, "uid": "2121301dde1bec34b962291dad9093ef", "name": "Selection-as-Nonlinearity: Bridging Attention and Activation via a Joint Game\u2013Decision Lens for Interpretable, Discriminative Visual Representations", "authors": [{"id": 187325, "fullname": "Sudong Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/187325?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 187326, "fullname": "Shuai Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187326?format=json", "institution": "Beijing Institute of Technology"}, {"id": 157969, "fullname": "Bingzhi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/157969?format=json", "institution": "Beijing Institute of Technology"}, {"id": 187327, "fullname": "Rui Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187327?format=json", "institution": "Shenzhen Institute of Computing Sciences; Shenzhen University"}, {"id": 77364, "fullname": "Bing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77364?format=json", "institution": "The Hong Kong Polytechnic University"}], "abstract": "Self-attention with separate pre- and post-projections can be a universal approximator (on compact domains) under mild conditions.Yet we observe a striking gap: an attention-only Transformer (w/o FFN layers) exhibits a marked accuracy drop relative to its standard interleaved attention--FFN baseline.We term this the **weak-independence** challenge of attention.We study this through a new conceptual lens, **Selection-as-Nonlinearity (SaN)**, which interprets effective nonlinearity as directed, cost-constrained selection, offering a coherent account of attention as context-gated activation.In this joint game\u2013decision view, attention performs a resource-constrained cooperative allocation over values: each query distributes a unit-mass weight budget over shared values to optimize representational utility, under a normalizer (e.g., $\\mathrm{softmax}$), and guided by context-derived scores (e.g., q-k similarities).SaN interprets *weak-independence* as a structural tension: the value weights almost cannot simultaneously attain the decoupled per-query (row-wise) and the per-value (column-wise) optimums under shared budgets, thereby limiting attention's stand-alone capacity.Guided by SaN, we introduce **CSaN**, an interpretable, efficient attention compensation paradigm with two key insights: **1) hierarchical budget calibration,** *re-allocate* row budgets via inter-query correction signals; and **2) public-private cooperation,** enhancing the *public* attention pathway with a per-token *private* value pathway to decouple conflicting demands.CSaN is evaluated on various vision benchmarks and demonstrates *level-jump gains* across popular Transformer families (Swin, ViT, Hiera), enabling models to rival much heavier same-family counterparts $\\sim2\\times$ as large.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37390", "url": null, "sourceid": 43918, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36803, "uid": "8aecd51d6d87cc8f1ceed1134cc06934", "name": "Radiance Meshes for Volumetric Reconstruction", "authors": [{"id": 176286, "fullname": "Alexander Mai", "url": "http://cvpr.thecvf.com/api/miniconf/users/176286?format=json", "institution": "University of California, San Diego"}, {"id": 185903, "fullname": "Trevor Hedstrom", "url": "http://cvpr.thecvf.com/api/miniconf/users/185903?format=json", "institution": "University of California, San Diego"}, {"id": 141062, "fullname": "George Kopanas", "url": "http://cvpr.thecvf.com/api/miniconf/users/141062?format=json", "institution": "Research, Google"}, {"id": 127184, "fullname": "Janne Kontkanen", "url": "http://cvpr.thecvf.com/api/miniconf/users/127184?format=json", "institution": "Research, Google"}, {"id": 185904, "fullname": "Falko Kuester", "url": "http://cvpr.thecvf.com/api/miniconf/users/185904?format=json", "institution": "University of California, San Diego"}, {"id": 69179, "fullname": "Jonathan T. Barron", "url": "http://cvpr.thecvf.com/api/miniconf/users/69179?format=json", "institution": "Google"}], "abstract": "We introduce Radiance Meshes for representing radiance fields with constant density tetrahedral cells produced with a Delaunay tetrahedralization.Unlike a Voronoi diagram, a Delaunay tetrahedralization yields simple triangles that are natively supported by existing hardware. As such, our model is able to perform exact and fast volume rendering using both rasterization and ray-tracing. We introduce a new rasterization method that achieve faster rendering speeds than all prior radiance field representations (assuming an equivalent number of primitives and resolution) across a variety of platforms.Optimizing the positions of Delaunay vertices introduces topological discontinuities (edge flips). To solve this, we use a Zip-NeRF-style backbone which allows us to express a smoothly varying field even when the topology changes.Our rendering method exactly evaluates the volume rendering equation and enables high quality, real-time view synthesis on standard consumer hardware. Our tetrahedral meshes also lend themselves to a variety of exciting applications including fisheye lens distortion, physics-based simulation, editing, and mesh extraction.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36803", "url": null, "sourceid": 38855, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38580, "uid": "ef36bd288598753e5d732ce574984a2c", "name": "Learning 3D Reconstruction with Priors in Test Time", "authors": [{"id": 184012, "fullname": "Lei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184012?format=json", "institution": "Stony Brook University"}, {"id": 190199, "fullname": "Haoyu Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190199?format=json", "institution": "Stony Brook University"}, {"id": 189077, "fullname": "Akshat Dave", "url": "http://cvpr.thecvf.com/api/miniconf/users/189077?format=json", "institution": ", State University of New York at Stony Brook"}, {"id": 85146, "fullname": "Dimitris Samaras", "url": "http://cvpr.thecvf.com/api/miniconf/users/85146?format=json", "institution": "Stony Brook University"}], "abstract": "We introduce a test-time framework for multiview Transformers (MVTs) that    incorporates priors (e.g., camera poses, intrinsics, and depth) to improve 3D tasks,     without retraining or modifying the pre-trained image-only networks.    Rather than feeding priors into the architecture,    we cast them as constraints on the predictions and optimize the network at inference.    The optimization loss is composed of a self-supervised objective and prior penalty terms.    The self-supervised objective is defined as the compatibility among multi-view predictions,     implemented by the photometric or geometric loss between the renderings from other views and each view itself.    Any available priors are turned into the penalty terms on the corresponding output modalities.    Across a series of 3D vision benchmarks, including point map estimation and camera pose estimation,     our method consistently improves performance over base MVTs by a large margin.    On ETH3D, 7-Scenes, and NRGBD datasets, our method cuts the point map distance error by more than half compared to the base image-only models.    Our method also outperforms those re-trained prior-aware feed-forward methods,    demonstrating the effectiveness of our test-time constrained optimization (TCO) framework, in incorporating priors for 3D vision tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38580", "url": null, "sourceid": 36822, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38522, "uid": "38e70244a463f6aed90c243813938d4f", "name": "RAPID: Reusing Attention Sparsity with Inter-step Adaptation for Efficient Video Diffusion", "authors": [{"id": 180488, "fullname": "Shangran Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/180488?format=json", "institution": "Alibaba Cloud"}, {"id": 181072, "fullname": "Lu Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181072?format=json", "institution": "Hangzhou AliCloud Apsara Information Technology Co., Ltd."}, {"id": 188661, "fullname": "Jian Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188661?format=json", "institution": "Alibaba Group"}, {"id": 188662, "fullname": "Qiang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188662?format=json", "institution": "Alibaba Group"}], "abstract": "The prohibitive cost of 3D attention hinders high-quality video generation with diffusion models. Existing sparse attention methods either lack content adaptivity (static) or incur excessive overhead from per-step recalculation (dynamic). Our work challenges the necessity of this trade-off, based on a twofold empirical discovery: (1) attention patterns in video diffusion exhibit strong temporal stability, and (2) the requisite computational density progressively decays. This insight motivates RAPID, a framework that performs a one-shot attention block importance estimation early in the generation process. The resulting scores and high-fidelity sparse mask are then cached for efficient reuse, eliminating recalculation overhead. The cached scores also enable an optional, multi-stage adaptive pruning (Turbo mode) for maximum acceleration. On leading models like Wan2.1-14B and HunyuanVideo, our high-fidelity configuration surpasses all baselines across key quality metrics (PSNR, SSIM, LPIPS) under a controlled compute budget. Concurrently, its Turbo mode achieves speedups of up to $1.79\\times$ on Wan2.1-14B and $2.01\\times$ on HunyuanVideo while maintaining strong visual quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38522", "url": null, "sourceid": 37827, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40201, "uid": "781a5b9b1037727527423ddb406a47a7", "name": "Grounded Latents for Entity-Centric 4D Scene Generation", "authors": [{"id": 89967, "fullname": "Jinhyung Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/89967?format=json", "institution": "Carnegie Mellon University"}, {"id": 158144, "fullname": "Navyata Sanghvi", "url": "http://cvpr.thecvf.com/api/miniconf/users/158144?format=json", "institution": "Carnegie Mellon University"}, {"id": 193763, "fullname": "Erica Weng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193763?format=json", "institution": "Carnegie Mellon University"}, {"id": 89975, "fullname": "Shawn Hunt", "url": "http://cvpr.thecvf.com/api/miniconf/users/89975?format=json", "institution": "DENSO International America, Inc."}, {"id": 137539, "fullname": "Shinya Tanaka", "url": "http://cvpr.thecvf.com/api/miniconf/users/137539?format=json", "institution": "Denso International America Inc"}, {"id": 133682, "fullname": "Hironobu Fujiyioshi", "url": "http://cvpr.thecvf.com/api/miniconf/users/133682?format=json", "institution": "DENSO CORPORATION"}, {"id": 88213, "fullname": "Kris Kitani", "url": "http://cvpr.thecvf.com/api/miniconf/users/88213?format=json", "institution": "Carnegie Mellon University"}], "abstract": "Although recent work has explored generative modeling of 3D or 4D driving scenes, most approaches operate on dense voxel-based representations, which are computationally expensive and struggle to maintain temporal or structural consistency. These methods often produce blurred or merged entities (i.e., cars, trucks, pedestrians) and lack fine-grained control over individual scene elements. We propose to perform generative modeling in a compact, entity-centric latent space, where each grounded 3D latent represents a semantically meaningful local region of the scene. This formulation enables precise, consistent control of both foreground and background elements while preserving geometric detail. We further extend this representation to 4D by learning a motion diffusion model for both ego and dynamic actors, conditioned on the generated 3D scene, and by propagating the grounded latents through time. Our framework produces physically consistent and temporally coherent 4D scenes, supporting controllable and realistic generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40201", "url": null, "sourceid": 30983, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39515, "uid": "f0e1267124a8106605be012d766d65f1", "name": "A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks", "authors": [{"id": 192244, "fullname": "Tangzheng Lian", "url": "http://cvpr.thecvf.com/api/miniconf/users/192244?format=json", "institution": "King&#x27;s College London, University of London"}, {"id": 192245, "fullname": "Guanyu Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192245?format=json", "institution": null}, {"id": 192246, "fullname": "Yijing Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/192246?format=json", "institution": null}, {"id": 73485, "fullname": "Dimitrios Kollias", "url": "http://cvpr.thecvf.com/api/miniconf/users/73485?format=json", "institution": "Queen Mary University of London"}, {"id": 164949, "fullname": "Oya Celiktutan", "url": "http://cvpr.thecvf.com/api/miniconf/users/164949?format=json", "institution": "King&#x27;s College London"}], "abstract": "While Vision-Language Models (VLMs) have achieved remarkable performance across diverse downstream tasks, recent studies have shown that they can inherit social biases from the training data and further propagate them into downstream applications. To address this issue, various debiasing approaches have been proposed, yet most of them aim to improve fairness without having a theoretical guarantee that the utility of the model is preserved. In this paper, we introduce a debiasing method that yields a \\textbf{closed-form} solution in the cross-modal space, achieving Pareto-optimal fairness with \\textbf{bounded utility losses}. Our method is \\textbf{training-free}, requires \\textbf{no annotated data}, and can jointly debias both visual and textual modalities across downstream tasks. Extensive experiments show that our method outperforms existing methods in debiasing VLMs across diverse fairness metrics and datasets for both group and \\textbf{intersectional} fairness in downstream tasks such as zero-shot image classification, text-to-image retrieval, and text-to-image generation while preserving task performance. Code will be made available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39515", "url": null, "sourceid": 36208, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39015, "uid": "c306791fb7d2c4488fd53630c7e5e91d", "name": "Grounded 3D-Aware Spatial Vision-Language Modeling", "authors": [{"id": 154466, "fullname": "An-Chieh Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/154466?format=json", "institution": "University of California, San Diego"}, {"id": 102978, "fullname": "Yang Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/102978?format=json", "institution": "University of California San Diego"}, {"id": 191183, "fullname": "Yatai Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/191183?format=json", "institution": "University of Hong Kong"}, {"id": 154456, "fullname": "Ligeng Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154456?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 130381, "fullname": "Guanqi Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/130381?format=json", "institution": "VGG, University of Oxford"}, {"id": 154298, "fullname": "Zhuoyang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154298?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 191184, "fullname": "Zhaojing Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191184?format=json", "institution": null}, {"id": 85763, "fullname": "Song Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/85763?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 154012, "fullname": "Yao Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154012?format=json", "institution": "NVIDIA"}, {"id": 73958, "fullname": "Pavlo Molchanov", "url": "http://cvpr.thecvf.com/api/miniconf/users/73958?format=json", "institution": "NVIDIA"}, {"id": 191185, "fullname": "Vidya Nariyambut Murali", "url": "http://cvpr.thecvf.com/api/miniconf/users/191185?format=json", "institution": "NVIDIA"}, {"id": 73960, "fullname": "Jan Kautz", "url": "http://cvpr.thecvf.com/api/miniconf/users/73960?format=json", "institution": "NVIDIA"}, {"id": 91790, "fullname": "Xiaolong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91790?format=json", "institution": "UCSD"}, {"id": 73956, "fullname": "Danny Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/73956?format=json", "institution": "NVIDIA"}, {"id": 76011, "fullname": "Sifei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76011?format=json", "institution": "NVIDIA"}], "abstract": "We present GR3D, a spatial vision language model equipped with three complementary grounding capabilities\u2014explicit 2D grounding, implicit 2D grounding, and monocular 3D grounding\u2014within a single framework. GR3D introduces an implicit grounding mechanism that identifies entity mentions during generation and inserts the corresponding region tokens into the text stream, allowing the model to reference visual evidence on the fly when producing spatial chain-of-thought responses. In parallel, a region-prompted monocular 3D grounding design predicts 3D bounding boxes in the camera view from grounded region queries, supported by intrinsic-aware normalization and dense geometric supervision. Together, these grounding capabilities enable GR3D to decompose complex spatial understanding problems into grounded 2D perception followed by 3D inference. GR3D achieves consistent improvements across grounded and non-grounded spatial benchmarks, demonstrating grounding as an effective inductive bias for strengthening spatial understanding in VLMs. These grounding capabilities collectively enhance general spatial understanding beyond the grounding task itself.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39015", "url": null, "sourceid": 41669, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39467, "uid": "9c082cb3963143beeb37c1bcf9ed6210", "name": "Decoupling Vision and Language: Codebook Anchored Visual Adaptation", "authors": [{"id": 192134, "fullname": "Jason Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192134?format=json", "institution": "University of California, Los Angeles"}, {"id": 152817, "fullname": "Tianchen Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/152817?format=json", "institution": "Amazon"}, {"id": 87556, "fullname": "Chang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87556?format=json", "institution": "Northeastern University"}, {"id": 127609, "fullname": "Jiarui Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/127609?format=json", "institution": "Amazon AWS AI"}, {"id": 152818, "fullname": "Zheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152818?format=json", "institution": "Amazon"}, {"id": 191908, "fullname": "Zhuowei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191908?format=json", "institution": "Amazon"}, {"id": 192135, "fullname": "Aaditya Singh", "url": "http://cvpr.thecvf.com/api/miniconf/users/192135?format=json", "institution": "Amazon"}, {"id": 133743, "fullname": "Xiang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/133743?format=json", "institution": "Amazon"}, {"id": 88694, "fullname": "Mani Srivastava", "url": "http://cvpr.thecvf.com/api/miniconf/users/88694?format=json", "institution": "UCLA"}, {"id": 192136, "fullname": "Jonathan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192136?format=json", "institution": "Amazon"}], "abstract": "Large Vision\u2013Language Models (LVLMs) use their vision encoders to translate images into representations for downstream reasoning, but the encoders often underperform in domain-specific visual tasks such as medical image diagnosis or fine-grained classification, where representation errors can cascade through the language model, leading to incorrect responses.Existing adaptation methods modify the continuous feature interface between encoder and language model through projector tuning or other parameter-efficient updates, which still couples the two components and requires re-alignment whenever the encoder changes.We introduce CRAFT (Codebook RegulAted Fine-Tuning), a lightweight method that fine-tunes the encoder using a discrete codebook that anchors visual representations to a stable token space, achieving domain adaptation without modifying other parts of the model.This decoupled design allows the adapted encoder to seamlessly boost the performance of LVLMs with different language architectures, as long as they share the same codebook. Empirically, CRAFT achieves an average gain of 14.98\\% across 10 domain-specific benchmarks such as VQARAD and PlantVillage, while preserving the LLM\u2019s linguistic capabilities and outperforming peer methods that operate on continuous tokens.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39467", "url": null, "sourceid": 35070, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37154, "uid": "eb840fce767293ed769b0e5ac37d4554", "name": "RAP: Fast Feedforward Rendering-Free Attribute-Guided Primitive Importance Score Prediction for Efficient 3D Gaussian Splatting Processing", "authors": [{"id": 149471, "fullname": "Kaifa Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/149471?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 131912, "fullname": "Qi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131912?format=json", "institution": "Tencent MediaLab"}, {"id": 107313, "fullname": "Yiling Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107313?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 186795, "fullname": "Zhu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186795?format=json", "institution": "University of Missouri - Kansas City"}], "abstract": "3D Gaussian Splatting (3DGS) has emerged as a leading technology for high-quality 3D scene reconstruction. However, the iterative refinement and densification process leads to the generation of a large number of primitives, each contributing to the reconstruction to a substantially different extent. Estimating primitive importance is thus crucial, both for removing redundancy during reconstruction and for enabling efficient compression and transmission.Existing methods typically rely on rendering-based analyses, where each primitive is evaluated through its contribution across multiple camera viewpoints. However, such methods are 1) sensitive to the number and selection of views; 2) rely on specialized differentiable rasterizers; and 3) have long calculation times that grow linearly with view count, making them difficult to integrate as plug-and-play modules, as well as resulting in limited scalability and generalization.To address these issues, we propose RAP \u2014 a fast feedforward Rendering-free Attribute-guided method for efficient importance score Prediction in 3DGS. RAP infers primitive significance directly from intrinsic Gaussian attributes and local neighborhood statistics, avoiding any rendering-based or visibility-dependent computations. A compact MLP is trained to predict per-primitive importance scores using a combination of rendering loss, pruning-aware loss, and significance distribution regularization loss. After being trained on a small set of scenes, RAP generalizes effectively to unseen data and can be seamlessly integrated into reconstruction, compression, and transmission pipelines, providing a unified and efficient pruning solution.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37154", "url": null, "sourceid": 37366, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40249, "uid": "6315d05a06c6802f2e81cb79428686b9", "name": "DRiffusion: Draft-and-Refine Process Parallelizes Diffusion Models with Ease", "authors": [{"id": 193876, "fullname": "Runsheng Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/193876?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 193877, "fullname": "Chengyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193877?format=json", "institution": "nanjing university"}, {"id": 176253, "fullname": "Yangdong Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/176253?format=json", "institution": "Tsinghua University"}], "abstract": "Diffusion models have achieved remarkable success in generating high-fidelity content but suffer from slow, iterative sampling, resulting in high latency that limits their use in interactive applications. We introduce DRiffusion, a parallel sampling framework that parallelizes diffusion inference through a draft-and-refine process. DRiffusion employs skip connections to generate multiple draft states for future timesteps and computes their corresponding noises in parallel, which are then used in the standard denoising process to produce refined results. Theoretically, our method achieves an acceleration rate of $\\tfrac{1}{n}$ or $\\tfrac{2}{n+1}$, depending on whether the conservative or aggressive mode is used, where $n$ denotes the number of devices. Empirically, DRiffusion attains 1.5\u00d7\u20134\u00d7 speedup on Stable Diffusion 2.1 with minimal degradation in generation quality. On MS-COCO dataset, both FID and CLIP remain close to those of the original sampler: averaged across configurations, DRiffusion even improves FID by 0.45 and incurs a negligible 0.06 drop in CLIP score. These results show that DRiffusion delivers substantial acceleration while largely preserving perceptual quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40249", "url": null, "sourceid": 41623, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37667, "uid": "6a8782653d5924b6e4102093849b1c2a", "name": "Is the Modality Gap a Bug or a Feature? A Robustness Perspective", "authors": [{"id": 187978, "fullname": "Rhea Chowers", "url": "http://cvpr.thecvf.com/api/miniconf/users/187978?format=json", "institution": "Hebrew University of Jerusalem"}, {"id": 187979, "fullname": "Oshri Naparstek", "url": "http://cvpr.thecvf.com/api/miniconf/users/187979?format=json", "institution": "IBM Research"}, {"id": 183410, "fullname": "Udi Barzelay", "url": "http://cvpr.thecvf.com/api/miniconf/users/183410?format=json", "institution": "IBM"}, {"id": 187980, "fullname": "Yair Weiss", "url": "http://cvpr.thecvf.com/api/miniconf/users/187980?format=json", "institution": "Hebrew University of Jerusalem"}], "abstract": "Many modern multi-modal models (e.g. CLIP) seek an embedding space in which the two modalities are aligned. Somewhat surprisingly, almost all existing models show a strong modality gap: the distribution of images is well-separated from the distribution of texts in the shared embedding space. Despite a series of recent papers on this topic, it is still not clear why this gap exists nor whether closing the gap in post-processing will lead to better performance on downstream tasks. In this paper we show that under certain conditions, minimizing the contrastive loss will lead to a representation in which the two modalities are separated by a global gap vector that is orthogonal to the embeddings of both modalities.  We also show that under these conditions the modality gap is monotonically related to robustness: decreasing the gap does not change the clean accuracy of the models but makes it less likely that a model will change its output when small, semantically inconsequential changes are made to the input.  Our experiments show that for many real-world VLMs we can significantly increase robustness by a simple post-processing step that moves one modality towards the mean of the other modality, without any loss to clean accuracy.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37667", "url": null, "sourceid": 42923, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36251, "uid": "7b92988fdc205f49374662dd64dd1bbd", "name": "Neural Gabor Splatting: Enhanced Gaussian Splatting with Neural Gabor for High-frequency Surface Reconstruction", "authors": [{"id": 183636, "fullname": "Haato Watanabe", "url": "http://cvpr.thecvf.com/api/miniconf/users/183636?format=json", "institution": "The University of Tokyo"}, {"id": 184582, "fullname": "Nobuyuki Umetani", "url": "http://cvpr.thecvf.com/api/miniconf/users/184582?format=json", "institution": "The University of Tokyo"}], "abstract": "Recent years have witnessed the rapid emergence of 3D Gaussian Splatting (3DGS) as a powerful approach for 3D reconstruction and novel view synthesis. Its explicit representation with Gaussian primitives enables fast training, real-time rendering, and convenient post-processing such as editing and surface reconstruction. However, 3DGS suffers from a critical drawback: the number of primitives grows drastically for scenes with high-frequency appearance details, since each primitive can represent only a single color, requiring multiple primitives for every sharp color transition.To overcome this limitation, we propose Neural Gabor splatting, which augments each Gaussian primitive with a lightweight multi-layer perceptron (MLP) that models a wide range of color variations within a single primitive. To further control primitive numbers, we introduce a frequency-aware densification strategy that selects mismatch primitives for pruning and cloning based on frequency energy.Our method achieves accurate reconstruction of challenging high-frequency surfaces. We demonstrate its effectiveness through extensive experiments on both standard benchmarks, such as Mip-NerRF360 and high-frequency surface datasets (e.g., checkered patterns), supported by comprehensive ablation studies.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36251", "url": null, "sourceid": 36210, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38982, "uid": "9f205dad67b7407fec00834c17b2cf2e", "name": "DiffSoup: Direct Differentiable Rasterization of Triangle Soup for Extreme Radiance Field Simplification", "authors": [{"id": 191119, "fullname": "Kenji Tojo", "url": "http://cvpr.thecvf.com/api/miniconf/users/191119?format=json", "institution": "The University of Tokyo"}, {"id": 191120, "fullname": "Bernd Bickel", "url": "http://cvpr.thecvf.com/api/miniconf/users/191120?format=json", "institution": "ETHZ - ETH Zurich; Google"}, {"id": 184582, "fullname": "Nobuyuki Umetani", "url": "http://cvpr.thecvf.com/api/miniconf/users/184582?format=json", "institution": "The University of Tokyo"}], "abstract": "Radiance field reconstruction aims to recover high-quality 3D representations from multi-view RGB images. Recent advances, such as 3D Gaussian splatting, have achieved real-time rendering with high visual fidelity, given sufficiently powerful graphics hardware. However, drastic model simplification \u2014 i.e., reducing the number of primitives by several orders of magnitude \u2014 is required to enable efficient online transmission and rendering across diverse hardware platforms. We introduce DiffSoup, a radiance field representation that employs a soup (i.e., a highly unstructured primitives) of a small number of triangles with neural textures that have binary opacity. We show that the binary opacity representation is directly differentiable via stochastic opacity masking, enabling stable training without molifier (i.e., smooth rasterization). DiffSoup can be rasterized with a traditional depth-testing framework, allowing the optimized scenes to be seamlessly integrated into conventional graphics pipelines and rendered interactively on consumer-grade laptops and mobile devices.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38982", "url": null, "sourceid": 31708, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40311, "uid": "37ce349008efb55b22c8594ca047413c", "name": "SAM 3D Body: Robust Full-Body Human Mesh Recovery", "authors": [{"id": 85606, "fullname": "Xitong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85606?format=json", "institution": "Meta"}, {"id": 127724, "fullname": "Devansh Kukreja", "url": "http://cvpr.thecvf.com/api/miniconf/users/127724?format=json", "institution": "Carnegie Mellon University"}, {"id": 188071, "fullname": "Don Pinkus", "url": "http://cvpr.thecvf.com/api/miniconf/users/188071?format=json", "institution": "Indepedent"}, {"id": 188072, "fullname": "Taosha Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188072?format=json", "institution": null}, {"id": 89967, "fullname": "Jinhyung Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/89967?format=json", "institution": "Carnegie Mellon University"}, {"id": 119973, "fullname": "Soyong Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/119973?format=json", "institution": "Carnegie Mellon University"}, {"id": 188073, "fullname": "Jinkun Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188073?format=json", "institution": "Amazon FAR"}, {"id": 188074, "fullname": "Jia-Wei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188074?format=json", "institution": "Facebook"}, {"id": 188075, "fullname": "Nicol\u00e1s Ugrinovic", "url": "http://cvpr.thecvf.com/api/miniconf/users/188075?format=json", "institution": null}, {"id": 188076, "fullname": "Anushka Sagar", "url": "http://cvpr.thecvf.com/api/miniconf/users/188076?format=json", "institution": "Meta"}, {"id": 75437, "fullname": "Jitendra Malik", "url": "http://cvpr.thecvf.com/api/miniconf/users/75437?format=json", "institution": "University of California at Berkeley"}, {"id": 75834, "fullname": "Matt Feiszli", "url": "http://cvpr.thecvf.com/api/miniconf/users/75834?format=json", "institution": "Meta AI"}, {"id": 188077, "fullname": "Piotr Doll\u00e1r", "url": "http://cvpr.thecvf.com/api/miniconf/users/188077?format=json", "institution": "Thinking Machines"}, {"id": 88213, "fullname": "Kris Kitani", "url": "http://cvpr.thecvf.com/api/miniconf/users/88213?format=json", "institution": "Carnegie Mellon University"}], "abstract": "We introduce SAM 3D Body (3DB), a promptable model for single-image full-body 3D human mesh recovery (HMR) that demonstrates state-of-the-art performance, with strong generalization and consistent accuracy in diverse in-the-wild conditions. 3DB estimates the human pose of the body, feet, and hands. It is the first model to use a new parametric mesh representation, Momentum Human Rig (MHR), which decouples skeletal pose and body shape. 3DB employs an encoder\u2013decoder architecture and supports auxiliary prompts, including 2D keypoints and masks, enabling user-guided inference similar to the SAM family of models. We derive high-quality annotations from a multi-stage annotation pipeline that uses various combinations of manual keypoint annotation, differentiable optimization, multi-view geometry, and dense keypoint detection. Our data engine efficiently selects and processes data to ensure data diversity, collecting unusual poses and rare imaging conditions. We present a new evaluation dataset organized by pose and appearance categories, enabling nuanced analysis of model behavior. Our experiments demonstrate superior generalization and substantial improvements over prior methods in both qualitative user preference studies and traditional quantitative analysis. Both 3DB and MHR are open-source.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40311", "url": null, "sourceid": -39197, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37714?format=json"], "related_events_ids": [37714]}, {"id": 39508, "uid": "c3fbc138598bb208f9a02bff183976c3", "name": "SAM 3D: 3Dfy Anything in Images", "authors": [{"id": 130167, "fullname": "Xingyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/130167?format=json", "institution": "Facebook"}, {"id": 85560, "fullname": "Fu-Jen Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85560?format=json", "institution": "Facebook"}, {"id": 130160, "fullname": "Pierre Gleize", "url": "http://cvpr.thecvf.com/api/miniconf/users/130160?format=json", "institution": "Polytech Nice Sophia"}, {"id": 127733, "fullname": "Kevin Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127733?format=json", "institution": "FAIR at Meta"}, {"id": 127052, "fullname": "Alexander Sax", "url": "http://cvpr.thecvf.com/api/miniconf/users/127052?format=json", "institution": "University of California Berkeley"}, {"id": 130151, "fullname": "Hao Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130151?format=json", "institution": "Meta Platforms"}, {"id": 130164, "fullname": "Weiyao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130164?format=json", "institution": "Facebook"}, {"id": 192228, "fullname": "Michelle Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192228?format=json", "institution": "Facebook"}, {"id": 192229, "fullname": "Thibaut Hardin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192229?format=json", "institution": "Facebook"}, {"id": 155531, "fullname": "Xiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155531?format=json", "institution": "University of Illinois, Urbana Champaign"}, {"id": 192230, "fullname": "Aohan Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192230?format=json", "institution": "Facebook"}, {"id": 188074, "fullname": "Jia-Wei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188074?format=json", "institution": "Facebook"}, {"id": 159767, "fullname": "Ziqi Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/159767?format=json", "institution": "California Institute of Technology"}, {"id": 188076, "fullname": "Anushka Sagar", "url": "http://cvpr.thecvf.com/api/miniconf/users/188076?format=json", "institution": "Meta"}, {"id": 148301, "fullname": "Bowen Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/148301?format=json", "institution": "University of Michigan - Ann Arbor"}, {"id": 188612, "fullname": "Xiaodong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188612?format=json", "institution": "Meta Platforms, Inc."}, {"id": 150918, "fullname": "Jianing &amp;quot;Jed&amp;quot; Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150918?format=json", "institution": "UMich / Meta"}, {"id": 70795, "fullname": "Bowen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70795?format=json", "institution": "University of Science and Technology of China"}, {"id": 188077, "fullname": "Piotr Doll\u00e1r", "url": "http://cvpr.thecvf.com/api/miniconf/users/188077?format=json", "institution": "Thinking Machines"}, {"id": 86982, "fullname": "Georgia Gkioxari", "url": "http://cvpr.thecvf.com/api/miniconf/users/86982?format=json", "institution": "California Institute of Technology"}, {"id": 75834, "fullname": "Matt Feiszli", "url": "http://cvpr.thecvf.com/api/miniconf/users/75834?format=json", "institution": "Meta AI"}, {"id": 75437, "fullname": "Jitendra Malik", "url": "http://cvpr.thecvf.com/api/miniconf/users/75437?format=json", "institution": "University of California at Berkeley"}], "abstract": "We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale.  We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D \"data barrier\". We obtain significant gains over recent work, with at least a $5:1$ win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39508", "url": null, "sourceid": 40168, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40370?format=json"], "related_events_ids": [40370]}, {"id": 40370, "uid": "c3fbc138598bb208f9a02bff183976c3", "name": "SAM 3D: 3Dfy Anything in Images", "authors": [{"id": 130167, "fullname": "Xingyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/130167?format=json", "institution": "Facebook"}, {"id": 85560, "fullname": "Fu-Jen Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85560?format=json", "institution": "Facebook"}, {"id": 130160, "fullname": "Pierre Gleize", "url": "http://cvpr.thecvf.com/api/miniconf/users/130160?format=json", "institution": "Polytech Nice Sophia"}, {"id": 127733, "fullname": "Kevin Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127733?format=json", "institution": "FAIR at Meta"}, {"id": 127052, "fullname": "Alexander Sax", "url": "http://cvpr.thecvf.com/api/miniconf/users/127052?format=json", "institution": "University of California Berkeley"}, {"id": 130151, "fullname": "Hao Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130151?format=json", "institution": "Meta Platforms"}, {"id": 130164, "fullname": "Weiyao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130164?format=json", "institution": "Facebook"}, {"id": 192228, "fullname": "Michelle Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192228?format=json", "institution": "Facebook"}, {"id": 192229, "fullname": "Thibaut Hardin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192229?format=json", "institution": "Facebook"}, {"id": 155531, "fullname": "Xiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155531?format=json", "institution": "University of Illinois, Urbana Champaign"}, {"id": 192230, "fullname": "Aohan Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192230?format=json", "institution": "Facebook"}, {"id": 188074, "fullname": "Jia-Wei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188074?format=json", "institution": "Facebook"}, {"id": 159767, "fullname": "Ziqi Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/159767?format=json", "institution": "California Institute of Technology"}, {"id": 188076, "fullname": "Anushka Sagar", "url": "http://cvpr.thecvf.com/api/miniconf/users/188076?format=json", "institution": "Meta"}, {"id": 148301, "fullname": "Bowen Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/148301?format=json", "institution": "University of Michigan - Ann Arbor"}, {"id": 188612, "fullname": "Xiaodong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188612?format=json", "institution": "Meta Platforms, Inc."}, {"id": 150918, "fullname": "Jianing &amp;quot;Jed&amp;quot; Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150918?format=json", "institution": "UMich / Meta"}, {"id": 70795, "fullname": "Bowen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70795?format=json", "institution": "University of Science and Technology of China"}, {"id": 188077, "fullname": "Piotr Doll\u00e1r", "url": "http://cvpr.thecvf.com/api/miniconf/users/188077?format=json", "institution": "Thinking Machines"}, {"id": 86982, "fullname": "Georgia Gkioxari", "url": "http://cvpr.thecvf.com/api/miniconf/users/86982?format=json", "institution": "California Institute of Technology"}, {"id": 75834, "fullname": "Matt Feiszli", "url": "http://cvpr.thecvf.com/api/miniconf/users/75834?format=json", "institution": "Meta AI"}, {"id": 75437, "fullname": "Jitendra Malik", "url": "http://cvpr.thecvf.com/api/miniconf/users/75437?format=json", "institution": "University of California at Berkeley"}], "abstract": "We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale.  We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D \"data barrier\". We obtain significant gains over recent work, with at least a $5:1$ win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40370", "url": null, "sourceid": -40168, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39508?format=json"], "related_events_ids": [39508]}, {"id": 36817, "uid": "5a9fa2197c5e405543c00bd501751082", "name": "Continual Learning for fMRI-Based Brain Disorder Diagnosis via Functional Connectivity Matrices Generative Replay", "authors": [{"id": 182325, "fullname": "qianyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/182325?format=json", "institution": "NTU"}, {"id": 185942, "fullname": "Shujian Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185942?format=json", "institution": "Vrije Universiteit Amsterdam; University of Troms\u00f8"}], "abstract": "Functional magnetic resonance imaging (fMRI) is widely used for studying and diagnosing brain disorders, with functional connectivity (FC) matrices providing powerful representations of large-scale neural interactions. However, existing diagnostic models are trained either on a single site or under full multi-site access, making them unsuitable for real-world scenarios where clinical data arrive sequentially from different institutions. This results in limited generalization and severe catastrophic forgetting.This paper presents the first continual learning framework specifically designed for fMRI-based diagnosis across heterogeneous clinical sites. Our framework introduces a structure-aware variational autoencoder that synthesizes realistic FC matrices for both patient and control groups. Built on this generative backbone, we develop a multi-level knowledge distillation strategy that aligns predictions and graph representations between new-site data and replayed samples. To further enhance efficiency, we incorporate a hierarchical contextual bandit scheme for adaptive replay sampling.Experiments on multi-site datasets for major depressive disorder (MDD), schizophrenia (SZ), and autism spectrum disorder (ASD) show that the proposed generative model enhances data augmentation quality, and the overall continual learning framework substantially outperforms existing methods in mitigating catastrophic forgetting.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36817", "url": null, "sourceid": 40338, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37468, "uid": "f3f611415939d28bdeb54703b65a36bb", "name": "A Geometric Algebra-Informed 3DGS Framework for Wireless Channel Prediction", "authors": [{"id": 183587, "fullname": "Jingzhou Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/183587?format=json", "institution": "Florida International University"}, {"id": 177003, "fullname": "Tianya Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/177003?format=json", "institution": "Florida International University"}, {"id": 73967, "fullname": "Xuyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73967?format=json", "institution": "Florida International University"}], "abstract": "In this paper, we introduce Geometric Algebra\u2013Informed 3D Gaussian Splatting (GAI-GS), a framework for wireless modeling that couples 3D Gaussian splatting with a geometric-algebra\u2013based attention mechanism to explicitly model ray\u2013object interactions in complex propagation environments. GAI-GS encodes joint spatial\u2013electromagnetic (EM) relations into token representations, enabling scene-level aggregation within a unified, end-to-end neural architecture. This design renders ray tracing for wireless propagation physically grounded, with token interactions that respect EM constraints including multipath, path-dependent attenuation, and reflection/diffraction. Through extensive evaluations on on multiple real-world indoor datasets, GAI-GS consistently surpasses current baselines across various wireless tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37468", "url": null, "sourceid": 40420, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38264, "uid": "66221bae97a54a1d31b0319c30437bdd", "name": "Generative Diffusion Priors for 3D Mapping of the Dark Universe", "authors": [{"id": 130723, "fullname": "Brandon Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/130723?format=json", "institution": "California Institute of Technology"}, {"id": 189452, "fullname": "Diana Scognamiglio", "url": "http://cvpr.thecvf.com/api/miniconf/users/189452?format=json", "institution": "Jet Propulsion Laboratory"}, {"id": 189453, "fullname": "Olivier Dor\u00e9", "url": "http://cvpr.thecvf.com/api/miniconf/users/189453?format=json", "institution": "Jet Propulsion Laboratory"}, {"id": 184011, "fullname": "Katie Bouman", "url": "http://cvpr.thecvf.com/api/miniconf/users/184011?format=json", "institution": "California Institute of Technology"}], "abstract": "Reconstructing the three-dimensional distribution of dark matter from weak-lensing observations is a central but highly ill-posed inverse problem in cosmology. Unlike standard 3D reconstruction with multiple viewpoints, we observe the universe from a single line of sight, through noisy shape distortions of galaxies with uncertain distances, so meaningful recovery of the 3D matter field requires strong prior assumptions. Existing methods either produce point estimates with handcrafted priors or use neural ensembles for approximate Bayesian uncertainty, and struggle to capture the non-Gaussian, filamentary structure of the cosmic web. With the advent of new high-resolution cosmological simulations, we now have an alternative source of prior knowledge that captures the nonlinear statistics of structure formation with far greater fidelity than analytic prescriptions. We leverage these simulations to build a new dataset \\texttt{Conicus3D}, which enables us to learn a data-driven diffusion-model prior capturing the full 3D distribution of dark matter structure across cosmic time. Building on recent plug-and-play approaches, we modify a diffusion-based posterior sampling scheme to the 3D weak-lensing setting, combining the learned prior with a differentiable physical forward model. On realistic simulations targeting a modern weak lensing survey, our approach yields substantially improved 2D and 3D reconstruction accuracy over baseline methods. Moreover, it produces posterior samples whose statistics closely track the underlying simulations, while remaining robust to moderate shifts in cosmology.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38264", "url": null, "sourceid": 41523, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37322, "uid": "fb4f401f943fac2830a81ac63178e9a4", "name": "TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis", "authors": [{"id": 174347, "fullname": "Zhengpeng Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/174347?format=json", "institution": "University of Cambridge"}, {"id": 187151, "fullname": "Clement Atzberger", "url": "http://cvpr.thecvf.com/api/miniconf/users/187151?format=json", "institution": "dClimate"}, {"id": 187152, "fullname": "Sadiq Jaffer", "url": "http://cvpr.thecvf.com/api/miniconf/users/187152?format=json", "institution": "University of Cambridge"}, {"id": 187153, "fullname": "Jovana Knezevic", "url": "http://cvpr.thecvf.com/api/miniconf/users/187153?format=json", "institution": "University of Cambridge"}, {"id": 187154, "fullname": "Silja Sormunen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187154?format=json", "institution": "Aalto University"}, {"id": 187155, "fullname": "Robin Young", "url": "http://cvpr.thecvf.com/api/miniconf/users/187155?format=json", "institution": "University of Cambridge"}, {"id": 187156, "fullname": "Madeline Lisaius", "url": "http://cvpr.thecvf.com/api/miniconf/users/187156?format=json", "institution": "University of Cambridge"}, {"id": 187157, "fullname": "Markus Immitzer", "url": "http://cvpr.thecvf.com/api/miniconf/users/187157?format=json", "institution": "Cyclops MRV Inc; Universit\u00e4t f\u00fcr Bodenkultur Wien"}, {"id": 187158, "fullname": "Toby Jackson", "url": "http://cvpr.thecvf.com/api/miniconf/users/187158?format=json", "institution": "University of Bristol"}, {"id": 175418, "fullname": "James Ball", "url": "http://cvpr.thecvf.com/api/miniconf/users/175418?format=json", "institution": "University of Cambridge "}, {"id": 187159, "fullname": "David Coomes", "url": "http://cvpr.thecvf.com/api/miniconf/users/187159?format=json", "institution": "University of Cambridge"}, {"id": 187160, "fullname": "Anil Madhavapeddy", "url": "http://cvpr.thecvf.com/api/miniconf/users/187160?format=json", "institution": "University of Cambridge"}, {"id": 174736, "fullname": "Andrew Blake", "url": "http://cvpr.thecvf.com/api/miniconf/users/174736?format=json", "institution": "Clare Hall, U. Cambridge."}, {"id": 187161, "fullname": "Srinivasan Keshav", "url": "http://cvpr.thecvf.com/api/miniconf/users/187161?format=json", "institution": "University of Cambridge"}], "abstract": "Satellite Earth-observation (EO) time series in the optical and microwave ranges are often irregular due to orbital patterns and cloud obstruction, and while compositing addresses these issues, it loses critical phenological information. To overcome this, we present TESSERA, a pixel-wise foundation model for multi-modal (Sentinel-1/2) EO time series that learns robust, label-efficient embeddings. During training, TESSERA uses Barlow Twins and sparse random temporal sampling to enforce invariance to the selection of valid observations, aided by two key regularizers: global shuffling to decorrelate spatial neighborhoods and mix-based regulation for invariance under extreme sparsity. We find that for diverse classification, segmentation, and regression tasks, TESSERA embeddings deliver state-of-the-art accuracy with high label efficiency, often requiring only a small task head and minimal computation. To democratize access, adhere to FAIR principles, and simplify use, we release global, annual, 10m, pixel-wise int8 embeddings together with open weights/code and lightweight adaptation heads, providing practical tooling for large-scale retrieval and inference at planetary scale.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37322", "url": null, "sourceid": 33548, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39457, "uid": "cc80340474901bf4636a8630bb8edd3b", "name": "Drainage: A Unifying Framework for Addressing Class Uncertainty", "authors": [{"id": 182140, "fullname": "Yasser Taha", "url": "http://cvpr.thecvf.com/api/miniconf/users/182140?format=json", "institution": "Robert Koch Institute"}, {"id": 192109, "fullname": "Gr\u00e9goire Montavon", "url": "http://cvpr.thecvf.com/api/miniconf/users/192109?format=json", "institution": "Charit\u00e9 - Universit\u00e4tsmedizin Berlin"}, {"id": 192110, "fullname": "Nils K\u00f6rber", "url": "http://cvpr.thecvf.com/api/miniconf/users/192110?format=json", "institution": "Robert Koch Institute"}], "abstract": "Modern deep learning faces significant challenges with noisy labels, class ambiguity, as well as the need to robustly reject out-of-distribution or corrupted samples. In this work, we propose a unified framework based on the concept of a \"drainage node\" which we add at the output of the network. The node serves to reallocate probability mass toward uncertainty, while preserving desirable properties such as end-to-end training and differentiability. This mechanism provides a natural escape route for highly ambiguous, anomalous, or noisy samples, particularly relevant for instance-dependent and asymmetric label noise. In systematic experiments involving the addition of varying proportions of instance-dependent noise or asymmetric noise to CIFAR-10/100 labels, our drainage formulation achieves an accuracy increase of up to 9% over existing approaches in the high-noise regime. Our results on real-world datasets, such as mini-WebVision, mini-ImageNet and Clothing-1M, match or surpass existing state-of-the-art methods. Qualitative analysis reveals a denoising effect, where the drainage neuron consistently absorbs corrupt, mislabeled, or outlier data, leading to more stable decision boundaries. Furthermore, our drainage formulation enables applications well beyond classification, with immediate benefits for web-scale, semi-supervised dataset cleaning, and open-set applications.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39457", "url": null, "sourceid": 42755, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36813, "uid": "62e2d6c7039cae71d31bfb49b2226b6a", "name": "WorldReel: 4D Video Generation with Consistent Geometry and Motion Modeling", "authors": [{"id": 185939, "fullname": "Shaoheng Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185939?format=json", "institution": "University of Texas at Austin"}, {"id": 135363, "fullname": "Hanwen Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/135363?format=json", "institution": "Adobe Systems"}, {"id": 91909, "fullname": "Yunpeng Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/91909?format=json", "institution": "Tsinghua University"}, {"id": 85661, "fullname": "Niloy J. Mitra", "url": "http://cvpr.thecvf.com/api/miniconf/users/85661?format=json", "institution": "University College London"}, {"id": 91087, "fullname": "Qixing Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91087?format=json", "institution": "University of Texas at Austin"}], "abstract": "Recent video generators achieve striking photorealism, yet remain fundamentally inconsistent in 3D. We present WorldReel, a 4D video generator that is natively spatio-temporally consistent. WorldReel jointly produces RGB frames together with 4D scene representations, including pointmaps, camera trajectory, and dense flow mapping, enabling coherent geometry and appearance modeling over time. Our explicit 4D representation enforces a single underlying scene that persists across viewpoints and dynamic content, yielding videos that remain consistent even under large non-rigid motion and significant camera movement.We train WorldReel by carefully combining synthetic and real data: synthetic data provides precise 4D supervision (geometry, motion, and camera), while real videos contribute visual diversity and realism. This blend allows WorldReel to generalize to in-the-wild footage while preserving strong geometric fidelity.Extensive experiments demonstrate that WorldReel sets a new state-of-the-art for consistent video generation with dynamic scenes and moving cameras, improving metrics of geometric consistency, motion coherence, and reducing view-time artifacts over competing methods. We believe that WorldReel brings video generation closer to 4D-consistent world modeling, where agents can render, interact, and reason about scenes through a single and stable spatiotemporal representation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36813", "url": null, "sourceid": 35750, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40002, "uid": "36248ba19f52a21af8e9363b4424a5fe", "name": "Generative Modeling of Weights: Generalization or Memorization?", "authors": [{"id": 193261, "fullname": "Boya Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193261?format=json", "institution": "Princeton University"}, {"id": 193257, "fullname": "Yida Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/193257?format=json", "institution": "Princeton University"}, {"id": 193275, "fullname": "Zhiqiu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193275?format=json", "institution": "University of Pennsylvania, University of Pennsylvania"}, {"id": 153070, "fullname": "Zhuang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153070?format=json", "institution": "Princeton University"}], "abstract": "Generative models, with their success in image and video generation, have recently been explored for synthesizing effective neural network weights. These approaches take trained neural network checkpoints as training data, and aim to generate high-performing neural network weights during inference. In this work, we examine four representative, well-known methods in this emerging area on their ability to generate novel model weights, i.e., weights that are different from the checkpoints seen during training. Contrary to claims in prior work, we find that these methods synthesize weights largely by memorization: they produce either replicas, or at best simple interpolations, of the training checkpoints. Current methods fail to outperform simple baselines, such as adding noise to the weights or taking a simple weight ensemble, in obtaining different and simultaneously high-performing models. Our further results suggest that the memorization potentially resulted from limited data, overparameterized models, and the underuse of structural priors specific to weight data. Our findings highlight the need for more careful design and evaluation of generative models in new domains.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40002", "url": null, "sourceid": 38216, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38659, "uid": "e6effb67cd9e11866f4b9c5b6ed8c1b7", "name": "Unifying Precise Keyframes and Semantic Control via Multi-level Diffusion", "authors": [{"id": 190403, "fullname": "Linjun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190403?format=json", "institution": "Zhejiang University"}, {"id": 190404, "fullname": "Jiejia Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190404?format=json", "institution": "Zhejiang University"}, {"id": 190405, "fullname": "Leyang Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190405?format=json", "institution": "Hangzhou Research Institute of AI and Holographic Technology"}, {"id": 107376, "fullname": "He Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107376?format=json", "institution": "University College London, University of London"}, {"id": 190406, "fullname": "Bowen Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190406?format=json", "institution": "Zhejiang University"}, {"id": 190407, "fullname": "Xu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190407?format=json", "institution": "Tencent"}, {"id": 190408, "fullname": "Hao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190408?format=json", "institution": null}, {"id": 190409, "fullname": "Fei Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/190409?format=json", "institution": "Tencent Timi  L1 Studio"}, {"id": 190410, "fullname": "Fei Ling", "url": "http://cvpr.thecvf.com/api/miniconf/users/190410?format=json", "institution": null}, {"id": 190411, "fullname": "Jun Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190411?format=json", "institution": ""}, {"id": 102599, "fullname": "Xiaogang Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/102599?format=json", "institution": "State Key Lab of CAD&amp;CG, Zhejiang University"}], "abstract": "Text-conditioned human motion in-betweening leverages keyframes for spatio-temporal control, with text providing high-level semantic guidance for the transitions. However, existing methods are unable to establish a coherent alignment between textual semantics and the spatio-temporal constraints provided by keyframes, often resulting in insufficiently constrained motions with unintended behavior. Moreover, they struggle with precise spatial control, often generating motions that deviate from keyframe constraints. To address these issues, we propose a multi-level diffusion framework that integrates textual semantics with implicit cues from keyframe sequences to modulate global motion dynamics, while leveraging individual keyframes to guide local transitions around them. During inference, to ensure strict keyframe adherence, we propose a novel trajectory refinement strategy that adjusts the root positions of the generated motion, followed by diffusion imputation to refine the poses of the generated keyframes. Additionally, our framework enables semantics-preserving motion editing, allowing for plausible modifications while retaining the original motion semantics. Extensive experiments demonstrate that our method generates high-quality motions that strictly satisfy keyframe constraints while achieving precise semantic alignment.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38659", "url": null, "sourceid": 37597, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38306, "uid": "a0851c2a81f3b106242e08d54365233d", "name": "Cross-Hand Latent Representation for Vision-Language-Action models", "authors": [{"id": 189553, "fullname": "Guangqi Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189553?format=json", "institution": "University of California, San Diego"}, {"id": 183389, "fullname": "Yutong Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183389?format=json", "institution": "University of California San Diego"}, {"id": 188205, "fullname": "Jianglong Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/188205?format=json", "institution": "University of California, San Diego"}, {"id": 189554, "fullname": "Jia Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189554?format=json", "institution": "University of California, San Diego"}, {"id": 184247, "fullname": "Changwei Jing", "url": "http://cvpr.thecvf.com/api/miniconf/users/184247?format=json", "institution": "University of California San Diego"}, {"id": 188206, "fullname": "Rocky Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188206?format=json", "institution": "Amazon FAR"}, {"id": 84867, "fullname": "Pieter Abbeel", "url": "http://cvpr.thecvf.com/api/miniconf/users/84867?format=json", "institution": "Covariant"}, {"id": 91790, "fullname": "Xiaolong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91790?format=json", "institution": "UCSD"}, {"id": 189555, "fullname": "Xueyan Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/189555?format=json", "institution": "University of California, San Diego"}], "abstract": "Dexterous manipulation is essential for real-world robot autonomy, mirroring the central role of human hand coordination in daily activity. Humans rely on rich multimodal perception\u2014vision, sound, and language-guided intent\u2014to perform dexterous actions, motivating vision-based, language-conditioned manipulation systems for robots. However, training reliable vision-language-action (VLA) models for dexterous manipulation requires large-scale demonstrations across many robotic hands. In addition, as new dexterous embodiments appear rapidly, collecting data for each becomes costly and impractical, creating a need for scalable cross-embodiment learning. We introduce \\ourmethod, a vision-language-action framework integrated with a unified latent action space shared across diverse dexterous hands. This embodiment-invariant latent space is directly pluggable into standard VLA architectures, enabling seamless cross-embodiment training and efficient reuse of both existing and newly collected data. Experimental results demonstrate that \\ourmethod consistently outperforms baseline VLA models operating in raw joint spaces, establishing it as an effective solution for scalable cross-embodiment dexterous manipulation.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38306", "url": null, "sourceid": 44821, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36258, "uid": "7f3721efabdb38d926d9688f7e1ed28e", "name": "OmniBrainBench: A Comprehensive Multimodal Benchmark for Brain Imaging Analysis Across Multi-stage Clinical Tasks", "authors": [{"id": 180604, "fullname": "Zhihao Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180604?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 184595, "fullname": "Cheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184595?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 184596, "fullname": "Shengyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184596?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 184597, "fullname": "Zhiying Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184597?format=json", "institution": "Sun Yat-sen Memorial Hospital, Sun Yat-sen University"}, {"id": 184598, "fullname": "Zanting Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/184598?format=json", "institution": "Southern Medical University"}, {"id": 184599, "fullname": "Min Ju", "url": "http://cvpr.thecvf.com/api/miniconf/users/184599?format=json", "institution": "Fudan University"}, {"id": 184600, "fullname": "Peter Woo", "url": "http://cvpr.thecvf.com/api/miniconf/users/184600?format=json", "institution": "Prince of Wales Hospital"}, {"id": 89467, "fullname": "Yixuan Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/89467?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "Brain imaging analysis is crucial for diagnosing and treating brain disorders, and multimodal large language models (MLLMs) are increasingly supporting it. However, current brain imaging visual question-answering (VQA) benchmarks either cover a limited number of imaging modalities or are restricted to coarse-grained pathological descriptions, hindering a comprehensive assessment of MLLMs across the full clinical continuum. To address these, we introduce OmniBrainBench, the first comprehensive multimodal VQA benchmark specifically designed to assess the multimodal comprehension capabilities of MLLMs in brain imaging analysis with closed- and open-ended evaluations. OmniBrainBench comprises 15 distinct brain imaging modalities collected from 30 verified medical sources, yielding 9,527 validated VQA pairs and 31,706 images. It simulates clinical workflows and encompasses 15 multi-stage clinical tasks rigorously validated by a professional radiologist. Evaluations of 24 state-of-the-art models, including open-source general-purpose, medical, and proprietary MLLMs, highlight the substantial challenges posed by OmniBrainBench. Experiments reveal that proprietary MLLMs like GPT-5 (63.37%) outperform open-source and medical MLLMs yet lag far behind physicians (91.35%), while medical MLLMs show wide variance in closed- and open-ended VQA. Open-source general-purpose MLLMs generally trail but excel in specific tasks, and all MLLMs fall short in complex preoperative reasoning, revealing a critical visual-to-clinical gap. OmniBrainBench establishes a new standard to assess MLLMs in brain imaging analysis, highlighting the gaps against physicians.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36258", "url": null, "sourceid": 45490, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36612, "uid": "a40214bd3a777dc2d496e989bc520f98", "name": "MoCha: End-to-End Video Character Replacement without Structural Guidance", "authors": [{"id": 185463, "fullname": "Zhengbo Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185463?format=json", "institution": "Department of Computer Science, University of Oxford"}, {"id": 129313, "fullname": "Jie Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/129313?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 185464, "fullname": "Ziheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185464?format=json", "institution": "Columbia University"}, {"id": 107337, "fullname": "Zhan Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/107337?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 185465, "fullname": "Jun Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185465?format=json", "institution": "Alibaba Group"}, {"id": 183393, "fullname": "Jing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183393?format=json", "institution": "Alibaba Group"}], "abstract": "Controllable video character replacement with a user-provided identity remains a challenging problem due to the lack of paired video data.Prior works have predominantly relied on a reconstruction-based paradigm that requires per-frame segmentation masks and explicit structural guidance (e.g., skeleton, depth). This reliance, however, severely limits their generalizability in complex scenarios involving occlusions, character-object interactions, unusual poses, or challenging illumination, often leading to visual artifacts and temporal inconsistencies.In this paper, we propose MoCha, a pioneering framework that mitigates these limitations by harnessing the inherent tracking ability of the video diffusion model, therefore requiring only a single arbitrary frame mask and no structural guidance.To effectively adapt the multi-modal input condition and enhance facial identity, we introduce a condition-aware RoPE and employ an RL-based post-training stage.Furthermore, to overcome the scarcity of qualified paired-training data, we propose a comprehensive data construction pipeline. Specifically, we design three specialized datasets: a high-fidelity rendered dataset built with Unreal Engine 5 (UE5), an expression-driven dataset synthesized by current portrait animation techniques, and an augmented dataset derived from existing video-mask pairs.Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches.We will release the code and dataset to facilitate further research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36612", "url": null, "sourceid": 34819, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36171, "uid": "7281942387a1a0c3f72a50a8b0bb0920", "name": "Ghosts in the Point Clouds: De-glaring LiDAR in the Transient Domain", "authors": [{"id": 175621, "fullname": "Avery gump", "url": "http://cvpr.thecvf.com/api/miniconf/users/175621?format=json", "institution": "University of Wisconsin - Madison"}, {"id": 184325, "fullname": "Connor Andrew Henley", "url": "http://cvpr.thecvf.com/api/miniconf/users/184325?format=json", "institution": "University of Wisconsin - Madison"}, {"id": 184326, "fullname": "Sungjin Cheong", "url": "http://cvpr.thecvf.com/api/miniconf/users/184326?format=json", "institution": "University of Wisconsin - Madison"}, {"id": 184327, "fullname": "Akarsh Prabhakara", "url": "http://cvpr.thecvf.com/api/miniconf/users/184327?format=json", "institution": "University of Wisconsin - Madison"}, {"id": 126898, "fullname": "Mohit Gupta", "url": "http://cvpr.thecvf.com/api/miniconf/users/126898?format=json", "institution": "Department of Computer Sciences, University of Wisconsin - Madison"}], "abstract": "Modern LiDARs are rapidly transitioning from bulky, mechanically scanned systems to ultra-compact, low-cost, solid-state arrays. This miniaturization\u2014while enabling scalability, affordability, and camera-like data structures\u2014introduces a new and severe failure mode: internal-multipath glare. When light from a bright or retroreflective surface reflects and scatters within the LiDAR, light that should reach a single pixel spreads across the pixel array. The resulting artifacts create phantom objects, obscure real ones, and produce safety-critical \u201cghosts in the point clouds.\u201d This paper introduces a physically-grounded sensing model and algorithmic techniques for addressing this effect. We show that internal glare can be represented as a linear, scene-independent operator\u2014the Transient Glare Spread Function (TGSF)\u2014acting on the raw transient histogram cube. This formulation enables simple, training-free inversion directly in the measurement domain, before nonlinear point-cloud formation. We develop exact and approximate de-glare algorithms that are general, computationally efficient, and compatible with existing LiDAR data-processing pipelines. Using experiments with real single-photon LiDAR hardware, we demonstrate suppression of severe glare artifacts at millisecond latency, establishing de-glare as a practical, lightweight preprocessing step for next-generation LiDAR systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36171", "url": null, "sourceid": 38925, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40024, "uid": "2de39d164c7807c1be5cad819d978cf2", "name": "Computer Vision with a Superpixelation Camera", "authors": [{"id": 180697, "fullname": "Sasidharan Mahalingam", "url": "http://cvpr.thecvf.com/api/miniconf/users/180697?format=json", "institution": "Portland State University"}, {"id": 193331, "fullname": "Rachel Brown", "url": "http://cvpr.thecvf.com/api/miniconf/users/193331?format=json", "institution": "Willamette University"}, {"id": 193332, "fullname": "Atul Ingle", "url": "http://cvpr.thecvf.com/api/miniconf/users/193332?format=json", "institution": "Portland State University"}], "abstract": "Conventional cameras generate a lot of data that can be challenging to process in resource-constrained applications. Usually, cameras generate data streams on the order of the number of pixels in the image. However, most of this captured data is redundant for many downstream computer vision algorithms. We propose a novel camera design, which we call SuperCam, that adaptively processes captured data by performing superpixel segmentation on the fly. We show that SuperCam performs better than current state-of-the-art superpixel algorithms under memory-constrained situations. We also compare how well SuperCam performs when the compressed data is used for downstream computer vision tasks. Our results demonstrate that the proposed design provides superior output for image segmentation, object detection, and monocular depth estimation in situations where the available memory on the camera is limited. We posit that superpixel segmentation will play a crucial role as more computer vision inference models are deployed in edge devices. SuperCam would allow computer vision engineers to design more efficient systems for these applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40024", "url": null, "sourceid": 35084, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37900, "uid": "9a233d274515549314aeacfaf2702f25", "name": "FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants", "authors": [{"id": 145933, "fullname": "Mahesh Bhosale", "url": "http://cvpr.thecvf.com/api/miniconf/users/145933?format=json", "institution": "University at Buffalo"}, {"id": 188527, "fullname": "Abdul Wasi Lone", "url": "http://cvpr.thecvf.com/api/miniconf/users/188527?format=json", "institution": "State University of New York at Buffalo"}, {"id": 188528, "fullname": "Shantam Srivastava", "url": "http://cvpr.thecvf.com/api/miniconf/users/188528?format=json", "institution": "State University of New York at Buffalo"}, {"id": 188529, "fullname": "Shifa Latif", "url": "http://cvpr.thecvf.com/api/miniconf/users/188529?format=json", "institution": "University of Kashmir"}, {"id": 88006, "fullname": "Tianyu Luan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88006?format=json", "institution": "State University of New York at Buffalo"}, {"id": 88895, "fullname": "Mingchen Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88895?format=json", "institution": "University at Buffalo, SUNY"}, {"id": 138526, "fullname": "David Doermann", "url": "http://cvpr.thecvf.com/api/miniconf/users/138526?format=json", "institution": "State University of New York at Buffalo"}, {"id": 127134, "fullname": "Xuan Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/127134?format=json", "institution": "Harvard University"}], "abstract": "While powerful in image-conditioned generation, multimodal large language models (MLLMs) can display uneven performance across demographic groups, highlighting fairness risks. In safety-critical clinical settings, such disparities risk producing unequal diagnostic narratives and eroding trust in AI-assisted decision-making. While fairness has been studied extensively in vision-only and language-only models, its impact on MLLMs remains largely underexplored. To address these biases, we introduce FairLLaVA, a parameter-efficient fine-tuning method that mitigates group disparities in visual instruction tuning without compromising overall performance. By minimizing the mutual information between target attributes, FairLLaVA regularizes the model\u2019s representations to be demographic-invariant. The method can be incorporated as a lightweight plug-in, maintaining efficiency with low-rank adapter fine-tuning, and provides an architecture-agnostic approach to fair visual instruction following. Extensive experiments on large-scale chest radiology report\u2013generation task shows FairLLaVA consistently reduces inter-group gaps while improving equity-scaled clinical performance and natural language generation metrics. Code and models will be open-sourced.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37900", "url": null, "sourceid": 38415, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37140, "uid": "92600cf5751eef50371a96f136857c84", "name": "Hierarchical Action Learning for Weakly-Supervised Action Segmentation", "authors": [{"id": 180232, "fullname": "Junxian Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180232?format=json", "institution": "Guangdong University of Technology"}, {"id": 186753, "fullname": "Ruichu Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/186753?format=json", "institution": "Guangdong University of Technology"}, {"id": 186754, "fullname": "Juntao Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186754?format=json", "institution": "Guangdong University of Technology"}, {"id": 186755, "fullname": "Hao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186755?format=json", "institution": "Guangdong University of Technology"}, {"id": 186756, "fullname": "Boyan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186756?format=json", "institution": null}, {"id": 186757, "fullname": "Weilin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186757?format=json", "institution": null}, {"id": 186758, "fullname": "Zijian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186758?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 153801, "fullname": "Shenghua Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153801?format=json", "institution": "University of Hong Kong"}], "abstract": "Humans perceive actions through key transitions that structure actions across multiple abstraction levels, whereas machines, relying on visual features, tend to over-segment. This highlights the difficulty of enabling hierarchical reasoning in video understanding. Interestingly, we observe that lower-level visual and high-level action latent variables evolve at different rates, with low-level visual variables changing rapidly, while high-level action variables evolve more slowly, making them easier to identify. Building on this insight, we propose the Hierarchical Action Learning (\\textbf{HAL}) model for weakly-supervised action segmentation. Our approach introduces a hierarchical causal data generation process, where high-level latent action govern the dynamics of low-level visual features. To model these varying timescales effectively, we introduce deterministic processes to align these latent variables over time. The \\textbf{HAL} model employs a hierarchical pyramid transformer to capture both visual features and latent variables, and a sparse transition constraint is applied to enforce the slower dynamics of high-level action variables. This mechanism enhances the identification of these latent variables over time. Under mild assumptions, we prove that these latent action variables are strictly identifiable. Experimental results on several benchmarks show that the \\textbf{HAL} model significantly outperforms existing methods for weakly-supervised action segmentation, confirming its practical effectiveness in real-world applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37140", "url": null, "sourceid": 42670, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39413, "uid": "502a3825a6dbdba0449624b8354cedc8", "name": "UTPTrack: Towards Simple and Unified Token Pruning for Visual Tracking", "authors": [{"id": 183543, "fullname": "Hao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183543?format=json", "institution": "Eastern Institute of Technology, Ningbo"}, {"id": 192027, "fullname": "Xudong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192027?format=json", "institution": "Hangzhou City University"}, {"id": 174090, "fullname": "Jialiang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174090?format=json", "institution": "Ocean University of China"}, {"id": 183538, "fullname": "Junlong Tong", "url": "http://cvpr.thecvf.com/api/miniconf/users/183538?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 192028, "fullname": "Xinghao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192028?format=json", "institution": "Hong Kong Polytechnic University; Eastern Institute of Technology"}, {"id": 188306, "fullname": "Junyan Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188306?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 188308, "fullname": "Yunpu Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/188308?format=json", "institution": "Ludwig-Maximilians-Universit\u00e4t M\u00fcnchen; Siemens Corporate Research"}, {"id": 154481, "fullname": "Xiaoyu Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/154481?format=json", "institution": "Eastern Institute of Technology, Ningbo"}], "abstract": "One-stream Transformer-based trackers achieve advanced performance in visual object tracking suffer from significant computational overhead that hinders real-time deployment. While token pruning offers a path to efficiency, a critical limitation persists: no existing work performs pruning jointly across all three critical components\u2014the search region, dynamic template, and static template. This isolation overlooks interdependencies, yielding suboptimal pruning and degraded accuracy. To address this, we introduce \\textbf{UTPTrack}, a simple and Unified Token Pruning framework that, for the first time, jointly compresses all three components. UTPTrack employs an attention-guided, token type-aware strategy to holistically model redundancy, a design that seamlessly supports unified tracking across multi-modal and language-guided tasks within a single model. Comprehensive evaluations on 10 benchmarks demonstrate that UTPTrack achieves a new state-of-the-art in the accuracy-efficiency trade-off for pruning-based trackers, pruning 65.4\\% of vision tokens in RGB-based tracking and 67.5\\% in unified tracking while preserving 99.7\\% and 100.5\\% of baseline performance, respectively. This strong performance across both RGB and multimodal scenarios underlines its potential as a robust foundation for future research in efficient visual tracking.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39413", "url": null, "sourceid": 45869, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38495, "uid": "419c2a2a2f4ef10201e0bd13e579ca21", "name": "Decoupling Bias, Aligning Distributions: Synergistic Fairness Optimization for Deepfake Detection", "authors": [{"id": 126164, "fullname": "Feng Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/126164?format=json", "institution": "Nanchang University"}, {"id": 189983, "fullname": "Wenhui Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189983?format=json", "institution": "Nanchang University"}, {"id": 189984, "fullname": "Yunpeng Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/189984?format=json", "institution": "Nanchang University"}, {"id": 131987, "fullname": "Xinan He", "url": "http://cvpr.thecvf.com/api/miniconf/users/131987?format=json", "institution": "Nanchang University"}, {"id": 153953, "fullname": "Hong Rao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153953?format=json", "institution": "Nanchang University"}, {"id": 126179, "fullname": "Shu Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126179?format=json", "institution": "Purdue University"}], "abstract": "Fairness is a core element in the trustworthy deployment of deepfake detection models, especially in the field of digital identity security. Biases in detection models toward different demographic groups, such as gender and race, may lead to systemic misjudgments, exacerbating the digital divide and social inequities. However, current fairness-enhanced detectors often improve fairness at the cost of detection accuracy. To address this challenge, we propose a dual-mechanism collaborative optimization framework. Our proposed method innovatively integrates structural fairness decoupling and global distribution alignment: decoupling channels sensitive to demographic groups at the model architectural level, and subsequently reducing the distance between the overall sample distribution and the distributions corresponding to each demographic group at the feature level. Experimental results demonstrate that, compared with other methods, our framework improves both inter-group and intra-group fairness while maintaining overall detection accuracy across domains.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38495", "url": null, "sourceid": 45862, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38510, "uid": "089dd30eda093ec9415c24ce6da94b33", "name": "Dynamic-Static Decomposition for Novel View Synthesis of Dynamic Scenes with Spiking Neurons", "authors": [{"id": 190024, "fullname": "Lingyun Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/190024?format=json", "institution": "Zhejiang University"}, {"id": 190025, "fullname": "Zehao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190025?format=json", "institution": "Zhejiang University"}, {"id": 190026, "fullname": "Yan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190026?format=json", "institution": "Zhejiang University"}, {"id": 153814, "fullname": "Shi Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153814?format=json", "institution": "University of Electronic Science and Technology of China, Tsinghua University"}, {"id": 185438, "fullname": "Peng Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185438?format=json", "institution": "Zhejiang University"}, {"id": 185439, "fullname": "De Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/185439?format=json", "institution": "Zhejiang University"}, {"id": 87278, "fullname": "Huajin Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87278?format=json", "institution": "Zhejiang University"}, {"id": 87673, "fullname": "Qian Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87673?format=json", "institution": "Zhejiang University"}, {"id": 87277, "fullname": "Gang Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87277?format=json", "institution": "Zhejiang University"}], "abstract": "Novel view synthesis for dynamic scenes remains challenging due to complex motion variations.Recent methods represent dynamic and static regions with separate Gaussians to improve efficiency and accuracy, but inaccurate assignment of  static and dynamic Gaussian primitive still limits performance.We identify two key issues, namely inaccurate mask priors and improper tag representations, which lead to boundary artifacts, loss of fine-grained motion details, and overfitting on input views, resulting in degraded side-view synthesis.To address these problems, we propose a spatio-temporally fine-grained mask field and a discontinuous dynamic\u2013static tagging field to achieve accurate assignment of dynamic and static Gaussian primitives, enabling high-quality novel view synthesis, especially in fine-grained motions, motion boundary regions, and side viewpoints.Experiments show that our method achieves state-of-the-art rendering quality and real-time performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38510", "url": null, "sourceid": 42610, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38811, "uid": "b6f79519d3b769aff4991526d838022d", "name": "GUI-SAGE: Enhancing GUI Automation with Self-Explanatory Learning", "authors": [{"id": 183048, "fullname": "Fei Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183048?format=json", "institution": "Nankai University"}, {"id": 156777, "fullname": "Zhangxuan Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156777?format=json", "institution": "Ant Group"}, {"id": 190743, "fullname": "Zhengxi Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190743?format=json", "institution": "Zhejiang University"}, {"id": 190744, "fullname": "Shangzhan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190744?format=json", "institution": "University of Oxford"}, {"id": 187083, "fullname": "Zhengwen Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187083?format=json", "institution": "Ant Group"}, {"id": 187084, "fullname": "Shuheng Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187084?format=json", "institution": null}, {"id": 90247, "fullname": "Changhua Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90247?format=json", "institution": "Nanjing University"}, {"id": 190745, "fullname": "Yuchen Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190745?format=json", "institution": "Zhejiang University"}, {"id": 155990, "fullname": "Wenqi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155990?format=json", "institution": "Zhejiang University"}, {"id": 190746, "fullname": "Yongliang Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190746?format=json", "institution": "Zhejiang University"}, {"id": 190747, "fullname": "Weiming Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190747?format=json", "institution": "Zhejiang University"}, {"id": 129046, "fullname": "Yueting Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129046?format=json", "institution": "Zhejiang University"}], "abstract": "Reinforcement learning with verifiable rewards (RLVR) has shown promise for GUI automation, enabling agents to learn from binary task completion signals. However, when task difficulty exceeds model capacity, on-policy exploration fails to discover correct actions, creating zero-advantage traps that eliminate learning signals. While incorporating off-policy expert demonstrations seems intuitive, it causes persistent high-entropy states due to distribution mismatch, disrupting effective learning. We propose GUI-SAGE, a self-explanation framework that generates in-distribution reasoning trajectories for GUI automation. By conditioning on ground-truth actions, our method produces in-distribution guidance that avoids the confusion caused by out-of-distribution expert demonstrations. We further introduce Entropy-Modulated Credit Assignment, which recalibrates learning weights by jointly considering prediction confidence and reward signals, enabling amplified updates for confident correct actions and attenuated updates for uncertain explorations. Extensive experiments on AndroidControl and GUI-Odyssey demonstrate that GUI-SAGE-3B achieves competitive performance with 81.1\\% success rate, substantially outperforming existing methods. Our analysis validates that self-explanations maintain stable learning dynamics while expert demonstrations cause entropy collapse, and that entropy modulation provides the largest improvements on in-distribution samples.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38811", "url": null, "sourceid": 43627, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39109, "uid": "a0a6d7cc839b1db0f0d06c7e0c74594f", "name": "Stochastic Ray Tracing for the Reconstruction of 3D Gaussian Splatting", "authors": [{"id": 191378, "fullname": "Peiyu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191378?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 183589, "fullname": "Shuang Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/183589?format=json", "institution": "University of Illinois Urbana-Champaign"}, {"id": 88913, "fullname": "Xin Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/88913?format=json", "institution": "Adobe Systems"}, {"id": 163462, "fullname": "Krishna Mullia", "url": "http://cvpr.thecvf.com/api/miniconf/users/163462?format=json", "institution": "Canva"}, {"id": 191379, "fullname": "Raymond Fei", "url": "http://cvpr.thecvf.com/api/miniconf/users/191379?format=json", "institution": null}, {"id": 127644, "fullname": "Iliyan Georgiev", "url": "http://cvpr.thecvf.com/api/miniconf/users/127644?format=json", "institution": "Adobe"}], "abstract": "Ray-tracing-based 3D Gaussian splatting (3DGS) enjoys the generality of supporting non-pinhole camera models and relightable formulations. However, they are usually lacking in performance, partially due to the need for depth-based sorting of all intersecting Gaussians along the traced rays.In this paper, we introduce a sorting-free differentiable stochastic formulation for ray-traced 3DGS, enabling efficient reconstruction and rendering of both standard and relightable 3DGS scenes.For standard 3DGS, our method offers performance comparable to rasterization-based 3DGS and outperforms sorting-based ray tracing.For relightable 3DGS, our technique provides higher-quality reconstructions and renderings thanks to the accurate shadow and shading computation provided by fully ray-traced shadow and light rays.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39109", "url": null, "sourceid": 36961, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39210, "uid": "8230258b576f81e8dec86997100d1bfb", "name": "TokenTrace: Multi-Concept Attribution through Watermarked Token Recovery", "authors": [{"id": 149857, "fullname": "Li Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/149857?format=json", "institution": "University of California, San Diego"}, {"id": 92390, "fullname": "Shruti Agarwal", "url": "http://cvpr.thecvf.com/api/miniconf/users/92390?format=json", "institution": "Adobe"}, {"id": 90975, "fullname": "John Collomosse", "url": "http://cvpr.thecvf.com/api/miniconf/users/90975?format=json", "institution": "University of Surrey"}, {"id": 191593, "fullname": "Pengtao Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/191593?format=json", "institution": "University of California, San Diego"}, {"id": 76133, "fullname": "Vishal Asnani", "url": "http://cvpr.thecvf.com/api/miniconf/users/76133?format=json", "institution": "Michigan State University"}], "abstract": "Generative AI models pose a significant challenge to intellectual property (IP), as they can replicate unique artistic styles and concepts without attribution. While watermarking offers a potential solution, existing methods often fail in complex scenarios where multiple concepts (e.g., an object and an artistic style) are composed within a single image. These methods struggle to disentangle and attribute each concept individually. In this work, we introduce TokenTrace, a novel proactive watermarking framework for robust, multi-concept attribution. Our method embeds secret signatures into the semantic domain by simultaneously perturbing the text prompt embedding and the initial latent noise that guide the diffusion model's generation process. For retrieval, we propose a query-based TokenTrace module that takes the generated image and a textual query specifying which concepts need to be retrieved (e.g., a specific object or style) as inputs. This query-based mechanism allows the module to disentangle and independently verify the presence of multiple concepts from a single generated image. Extensive experiments show that our method achieves state-of-the-art performance on both single-concept (object and style) and multi-concept attribution tasks, significantly outperforming existing baselines while maintaining high visual quality and robustness to common transformations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39210", "url": null, "sourceid": 34368, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38042, "uid": "740a20d4e1051f3e11ca595e8d5a046f", "name": "SALMUBench: A Benchmark for Sensitive Association-Level Multimodal Unlearning", "authors": [{"id": 188905, "fullname": "Cai Selvas-Sala", "url": "http://cvpr.thecvf.com/api/miniconf/users/188905?format=json", "institution": "Computer Vision Center"}, {"id": 188906, "fullname": "Lei Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188906?format=json", "institution": "Computer Vision Center, Universitat Aut\u00f3noma de Barcelona"}, {"id": 188907, "fullname": "Lluis Gomez", "url": "http://cvpr.thecvf.com/api/miniconf/users/188907?format=json", "institution": "Computer Vision Center, Universitat Aut\u00f2noma de Barcelona"}], "abstract": "As multimodal models like CLIP become integral to downstream systems, the need to remove sensitive information is critical. However, machine unlearning for contrastively-trained encoders remains underexplored, and existing evaluations fail to diagnose fine-grained, association-level forgetting. We introduce SALMUBench (Sensitive Association-Level Multimodal Unlearning), a benchmark built upon a synthetic dataset of 60K persona-attribute associations and two foundational models: a \\textit{Compromised} model polluted with this data, and a \\textit{Clean} model without it. Both are trained from scratch on a 400M-pair \\texttt{retain} set to isolate unlearning effects. We propose a novel evaluation protocol with structured holdout sets (\\texttt{holdout\\_identity}, \\texttt{holdout\\_association}) to precisely measure unlearning efficacy and collateral damage. Our benchmark reveals that while utility-efficient deletion is feasible, current methods exhibit distinct failure modes: they either fail to forget effectively or over-generalize by erasing more than intended. SALMUBench sets a new standard for comprehensive unlearning evaluation, and we publicly release our dataset, models, evaluation scripts, and leaderboards to foster future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38042", "url": "https://cvc-mmu.github.io/salmubench/", "sourceid": 37509, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39121, "uid": "d4d367f9ef2000f4ae6c37752ce3afc2", "name": "VAD-GS: Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes", "authors": [{"id": 191402, "fullname": "Yikang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191402?format=json", "institution": "Tongji University"}, {"id": 185544, "fullname": "Rui Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185544?format=json", "institution": "Tongji University"}], "abstract": "3D Gaussian splatting (3DGS) has demonstrated impressive performance in synthesizing high-fidelity novel views. Nonetheless, its effectiveness critically depends on the quality of the initialized point cloud. Specifically, achieving uniform and complete point coverage over the underlying scene structure requires overlapping observation frustums, an assumption that is often violated in unbounded, dynamic urban environments. Training Gaussian models with partially initialized point clouds often leads to distortions and artifacts, as camera rays may fail to intersect valid surfaces, resulting in incorrect gradient propagation to Gaussian primitives associated with occluded or invisible geometry. Additionally, existing densification strategies simply clone and split Gaussian primitives from existing ones, incapable of reconstructing geometry from missing structures. To address these limitations, we propose VAD-GS, a 3DGS framework tailored for geometry recovery in challenging urban scenes. Our method identifies unreliable geometry structures via voxel-based visibility reasoning, selects informative supporting views through diversity-aware view selection, and recovers missing structures via multi-view stereo reconstruction. This design enables the generation of new Gaussian primitives guided by reliable geometric priors, even in regions lacking initial points. Extensive experiments on the Waymo and nuScenes datasets demonstrate that VAD-GS outperforms state-of-the-art 3DGS approaches and significantly improves the quality of reconstructed geometry for both static and dynamic objects.Source code will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39121", "url": null, "sourceid": 40569, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37715, "uid": "bd0027406f48f714d12dc1de747133cf", "name": "Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model", "authors": [{"id": 85387, "fullname": "Minh Quan Dao", "url": "http://cvpr.thecvf.com/api/miniconf/users/85387?format=json", "institution": "Rutgers University"}, {"id": 75668, "fullname": "Dimitris N. Metaxas", "url": "http://cvpr.thecvf.com/api/miniconf/users/75668?format=json", "institution": "Rutgers University"}], "abstract": "Transformer architectures, particularly Diffusion Transformers (DiTs), have become widely used in diffusion and flow-matching models due to their strong performance compared to convolutional UNets. However, the isotropic design of DiTs processes the same number of patchified tokens in every block, leading to relatively heavy computation during training process. In this work, we introduce a multi-patch transformer design in which early blocks operate on larger patches to capture coarse global context, while later blocks use smaller patches to refine local details. This hierarchical design could reduces computational cost by up to 50\\% in GFLOPs while achieving good generative performance. In addition, we also propose improved designs for time and class embeddings that accelerate training convergence. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of our architectural choices.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37715", "url": null, "sourceid": 40754, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39285, "uid": "93978cb97ec047e348cad39ee701fe8c", "name": "EchoPOSE: 6D Pose Estimation of Sparse Echocardiograms for Left-Ventricular 3D Shape Reconstruction", "authors": [{"id": 182131, "fullname": "Lucas Iijima", "url": "http://cvpr.thecvf.com/api/miniconf/users/182131?format=json", "institution": "Imperial College London"}, {"id": 183591, "fullname": "Yihao Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/183591?format=json", "institution": "Imperial College London"}, {"id": 191766, "fullname": "Roberto Sesia", "url": "http://cvpr.thecvf.com/api/miniconf/users/191766?format=json", "institution": "Imperial College London"}, {"id": 191767, "fullname": "Amit Kaura", "url": "http://cvpr.thecvf.com/api/miniconf/users/191767?format=json", "institution": "Imperial College London"}, {"id": 191768, "fullname": "Jamil Mayet", "url": "http://cvpr.thecvf.com/api/miniconf/users/191768?format=json", "institution": "Imperial College London"}, {"id": 188938, "fullname": "Choon Hwai Yap", "url": "http://cvpr.thecvf.com/api/miniconf/users/188938?format=json", "institution": "Imperial College London"}], "abstract": "3D echocardiography provides superior cardiac quantification to traditional 2D echocardiography, which suffers from geometric idealizations and imaging plane misalignment. However, despite its advantages, clinical adoption of 3D echo remains limited due to logistical and visualization challenges. We propose a novel framework that reconstructs the 3D shape of the left ventricle (LV) throughout the cardiac cycle from sparse 2D echocardiographic views routinely acquired in clinical practice, without the need for external hardware or manual tracking. Our method integrates EchoPOSE, a new deep network that automatically estimates the 6D pose (position and orientation) of LV segmentations, with a graph-harmonic algorithm for 3D shape reconstruction. EchoPOSE employs a transformer-based architecture that combines local image features with global multi-view context, and introduces a geometry-aware loss to ensure spatial consistency across intersecting imaging planes. Trained and evaluated on large-scale synthetic data derived from 3D echocardiography and validated on prospectively acquired clinical echocardiograms, EchoPOSE achieves 3.78 mm and 8.65$^{\\circ}$ pose errors, yielding 87.5 Dice reconstruction accuracy, 1.44\\% ejection fraction error, and 3.03\\% volume error, outperforming alternative deep learning techniques and classical clinical approaches. Notably, the framework remains robust under suboptimal imaging alignment, suggesting that EchoPOSE can reduce the sonography skills required for transducer positioning and allow minimally trained clinicians to perform echo scans.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39285", "url": null, "sourceid": 39694, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36895, "uid": "39f90409e6bf42733041b973096d3d10", "name": "It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models", "authors": [{"id": 186139, "fullname": "Anne Harrington", "url": "http://cvpr.thecvf.com/api/miniconf/users/186139?format=json", "institution": "University of California, Berkeley"}, {"id": 150990, "fullname": "A. Koepke", "url": "http://cvpr.thecvf.com/api/miniconf/users/150990?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}, {"id": 162350, "fullname": "Shyamgopal Karthik", "url": "http://cvpr.thecvf.com/api/miniconf/users/162350?format=json", "institution": "University of T\u00fcbingen"}, {"id": 86710, "fullname": "Trevor Darrell", "url": "http://cvpr.thecvf.com/api/miniconf/users/86710?format=json", "institution": "Electrical Engineering &amp; Computer Science Department"}, {"id": 72510, "fullname": "Alexei A. Efros", "url": "http://cvpr.thecvf.com/api/miniconf/users/72510?format=json", "institution": "UC Berkeley"}], "abstract": "Contemporary text-to-image models exhibit a surprising degree of mode collapse, as can be seen when sampling several images given the same text prompt. While previous work has attempted to address this issue by steering the model using guidance mechanisms, or by generating a large pool of candidates and refining them, in this work we take a different direction and optimize for variation in generations via noise optimization. Specifically, we show that a simple noise optimization objective can mitigate mode collapse while preserving the fidelity of the base model. We also analyze the frequency characteristics of the noise and show that alternative noise initializations with different frequency profiles can improve both optimization and search. Our experiments demonstrate that noise optimization yields superior results in terms of generation quality and variety.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36895", "url": null, "sourceid": 38802, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39204, "uid": "db5362e819448c35aafb4dc80add9aec", "name": "InterRVOS: Interaction-Aware Referring Video Object Segmentation", "authors": [{"id": 191572, "fullname": "Woojeong Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/191572?format=json", "institution": "KAIST"}, {"id": 191573, "fullname": "Seongchan Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/191573?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 191574, "fullname": "Jae Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/191574?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 153109, "fullname": "Seungryong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/153109?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}], "abstract": "Referring video object segmentation (RVOS) aims to segment objects in a video described by a natural language expression.However, most existing approaches focus only on the referred object (typically the actor), even when the expression clearly describes an interaction involving multiple objects with distinct roles. In this paper, we introduce Interaction-Aware Referring Video Object Segmentation (InterRVOS), a novel task that focuses on explicit interaction modeling by requiring separate segmentation of actor and target objects.This formulation enables fine-grained understanding of object relationships, as many video events are defined by such interactions rather than individual objects. We present InterRVOS-127K, a large-scale dataset of over 127K automatically annotated expressions with distinct actor-target mask pairs, and propose ReVIOSa, a MLLM-based architecture that introduces interaction-aware special tokens and attention mask loss (AML) to enhance interaction-aware segmentation. We also propose a new evaluation protocol that separately evaluates actor and target segmentation for more accurate role distinction. Comprehensive experiments demonstrate that ReVIOSa outperforms existing baselines on the proposed InterRVOS-127K benchmark, with further analyses validating the necessity and effectiveness of both ReVIOSa and InterRVOS-127K. Code and datasets will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39204", "url": null, "sourceid": 38339, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36680, "uid": "eff9f344fdd9b97ea827134223f62586", "name": "Diff-SemiER: Transparency-Aware Adaptive Fusion Diffusion Model with Generative Prior for Semi-Transparent Eyeglasses Removal", "authors": [{"id": 175592, "fullname": "Jiahao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/175592?format=json", "institution": "Henan University"}, {"id": 185631, "fullname": "Shiqi Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185631?format=json", "institution": "Henan Univeristy"}, {"id": 185632, "fullname": "Zhenxiang Lian", "url": "http://cvpr.thecvf.com/api/miniconf/users/185632?format=json", "institution": "Henan Univeristy"}, {"id": 185633, "fullname": "jingtao guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185633?format=json", "institution": "School of Software, Henan University, China"}], "abstract": "Existing eyeglasses removal methods primarily focus on opaque or fully transparent lenses. However, when dealing with semi-transparent sunglasses, these methods often corrupt the visible facial details beneath the lenses, thereby degrading the performance of downstream vision tasks. To address this issue, we propose Diff-SemiER, a novel diffusion-based framework for semi-transparent eyeglasses removal that leverages generative priors and transparency-aware adaptive fusion. The proposed framework fully utilizes the visible eye-region information beneath the lenses while retaining sufficient generative flexibility, thereby striking a balance between generation and restoration within semi-transparent regions. Specifically, Diff-SemiER comprises two diffusion branches: the Generative Prior Diffusion Model (GPDM) generates high-quality eyeglass-free facial images via image inpainting, which provides global semantic guidance for highly occluded scenarios. The Transparency-Aware Adaptive Fusion Diffusion Model (TAFDM) employs a Soft Mask-Aware Adaptive Fusion (SMAF) mechanism to adaptively merge generative and restorative features across multiple scales, enabling dynamic trade-offs between generative capability and fine-detail preservation under varying occlusion levels. Furthermore, we design a transmittance-based data synthesis method to construct a large-scale, high-quality dataset of faces with semi-transparent eyeglasses for model training and evaluation. Extensive experimental results demonstrate that Diff-SemiER significantly outperforms state-of-the-art methods in both synthetic and real-world scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36680", "url": null, "sourceid": 44744, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39923, "uid": "e7964b868c9e6d6506e5a69c1b680ee0", "name": "MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models", "authors": [{"id": 193121, "fullname": "lulu hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193121?format=json", "institution": "Alibaba Group"}, {"id": 180044, "fullname": "Xiao Wenhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180044?format=json", "institution": "alibaba"}, {"id": 179678, "fullname": "Chen Xin", "url": "http://cvpr.thecvf.com/api/miniconf/users/179678?format=json", "institution": "alibaba cloud"}, {"id": 193122, "fullname": "Xinhua Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193122?format=json", "institution": "Peking University"}, {"id": 193123, "fullname": "Bowen Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193123?format=json", "institution": "Alibaba Group"}, {"id": 193124, "fullname": "Kun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193124?format=json", "institution": null}, {"id": 193125, "fullname": "Yongliang Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193125?format=json", "institution": null}], "abstract": "Post-training quantization (PTQ) with computational equivalence for Large Language Models (LLMs) have demonstrated remarkable advances, however, their application to Multimodal Large Language Models (MLLMs) presents substantial challenges. In this paper, we analyze SmoothQuant as a case study and identify two critical issues: Smoothing Misalignment and Cross-Modal Computational Invariance. To address these issues, we propose Modality-Aware Smoothing Quantization (MASQuant), a novel framework that introduces (1) Modality-Aware Smoothing (MAS), which learns separate, modality-specific smoothing factors to prevent Smoothing Misalignment, and (2) Cross-Modal Compensation (CMC), which addresses Cross-modal Computational Invariance by using SVD whitening to transform multi-modal activation differences into low-rank forms, enabling unified quantization across modalities. MASQuant demonstrates stable quantization performance across both dual-modal and tri-modal MLLMs. Experimental results show that MASQuant is competitive among the state-of-the-art PTQ algorithms.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39923", "url": null, "sourceid": 37343, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36585, "uid": "efaee7456f7f0595875626c8b9588f31", "name": "Towards Generalized Representations for Low-Light Understanding: When Signal Constancy Meets Semantic Enrichment", "authors": [{"id": 185413, "fullname": "Yifan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185413?format=json", "institution": "Peking University"}, {"id": 185414, "fullname": "Haofeng Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185414?format=json", "institution": "Peking University"}, {"id": 87306, "fullname": "Wenhan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87306?format=json", "institution": "Peng Cheng Lab"}, {"id": 75639, "fullname": "Jiaying Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75639?format=json", "institution": "Peking University"}], "abstract": "Low-light degradation hampers machine understanding at night. Existing methods either overfit labeled data (paired supervision) or specific distributions (unpaired supervision), resulting in poor generalization under unseen degradations. In this paper, we propose $\\textbf{UniPrior}$, a unified prior-based low-light adaptation framework that integrates the general semantic prior embedded in vision foundation models (VFMs) with illumination-invariant priors, to capture both stable and changing semantics under varied low-light degradation without any real low-light training data. In detail, the illumination-invariant prior is used as an auxiliary input, and a parallel decoder reconstructs it as a regularization target, enforcing representation consistency and reducing feature drift. Such signal constancy enables us to build a VFM-aligned semantic space via a contrastive training strategy guided by VFM self-correlation maps, enriching features with high-level cues, thereby improving adaptation to diverse low-light conditions. Beyond high-level features, we also give a joint consideration of such unified prior and low-level signal space through our machine-oriented enhancement scheme. We extend the signal prior to handle overexposure and inject VFM-guided semantic cues into the enhancement process via a CLIP-based loss. This coupling of semantic alignment and pixel correction enables sample-adaptive optimization to improve performance. Extensive experiments on multiple low-light tasks demonstrate our method\u2019s superiority and practical utility.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36585", "url": null, "sourceid": 37776, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36683, "uid": "2181d94fba9a1d2de2b5f6fb75f8ab08", "name": "GeoSemba: Reconstructing State Space Model for Cross Paradigm Representation in Medical Image Segmentation", "authors": [{"id": 182286, "fullname": "Xutao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/182286?format=json", "institution": "Liaoning Normal University"}, {"id": 185638, "fullname": "Jiarui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185638?format=json", "institution": "Liaoning Normal University"}, {"id": 185639, "fullname": "\u5218\u4fca\u6587 \u5218\u4fca\u6587", "url": "http://cvpr.thecvf.com/api/miniconf/users/185639?format=json", "institution": "liaoning Normal University"}, {"id": 184059, "fullname": "Yonggong Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/184059?format=json", "institution": "Liaoning Normal University"}], "abstract": "Recently, the Vision Mamba architecture has emerged as a promising paradigm for medical image segmentation. However, representation discrepancies often arise between anatomical structures and their associated tissue types, while crucial diagnostic cues tend to be spatially entangled, constraining the performance of Mamba architectures in this domain. To address these limitations, we propose $\\textbf{GeoSemba}$, a novel Mamba-based segmentation framework that unifies geometric\u2013semantic and spatial\u2013channel representations. Specifically, we reformulate the Mamba\u2019s state-space equations with two key components, a Semantic-guided State Refiner (SSR) and a Cross-dimensional Affinity Refiner (CAR). SSR reconstructs information flow within an abstract semantic space to forge a synergistic representation between anatomical textures and geometric contours.  Concurrently, CAR adaptively models spatial\u2013channel affinities to capture the intrinsic tissue heterogeneity common in medical imaging. By jointly integrating SSR and CAR in a complementary manner, $\\textbf{GeoSemba}$ only requires a single scan to effectively achieve cross-dimensional consistency and cross-level interaction. Extensive experiments on public datasets spanning six medical imaging modalities demonstrate that $\\textbf{GeoSemba}$ consistently delivers superior segmentation accuracy while maintaining high computational efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36683", "url": null, "sourceid": 44555, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36812, "uid": "ac029f072468dd8c97c15f0a9fa96f00", "name": "Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery", "authors": [{"id": 151843, "fullname": "Minh Kha Do", "url": "http://cvpr.thecvf.com/api/miniconf/users/151843?format=json", "institution": "La Trobe University"}, {"id": 91374, "fullname": "Wei Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91374?format=json", "institution": "La Trobe University"}, {"id": 156020, "fullname": "Kang Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/156020?format=json", "institution": "La Trobe University"}, {"id": 184154, "fullname": "Di Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184154?format=json", "institution": "La Trobe University"}, {"id": 156022, "fullname": "Khoa T. Phan", "url": "http://cvpr.thecvf.com/api/miniconf/users/156022?format=json", "institution": "La Trobe University"}, {"id": 155106, "fullname": "Yi-Ping Phoebe Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/155106?format=json", "institution": "La Trobe University"}, {"id": 92855, "fullname": "Gaowen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/92855?format=json", "institution": "Cisco Systems"}, {"id": 129129, "fullname": "Ramana Kompella", "url": "http://cvpr.thecvf.com/api/miniconf/users/129129?format=json", "institution": "Cisco"}], "abstract": "Vision-language foundation models (VLFMs) promise zero-shot and retrieval understanding for Earth observation. While operational satellite systems often lack full multi-spectral coverage, making RGB-only inference highly desirable for scalable deployment, the adoption of VLFMs for satellite imagery remains hindered by two factors: (1) multi-spectral inputs are informative but difficult to exploit consistently due to band redundancy and misalignment; and (2) CLIP-style text encoders limit semantic expressiveness and weaken fine-grained alignment. We present SATtxt, a spectrum-aware VLFM that operates with RGB inputs only at inference while retaining spectral cues learned during training. Our framework comprises two stages. First, Spectral Representation Distillation transfers spectral priors from a frozen multi-spectral teacher to an RGB student via a lightweight projector. Second, Spectrally Grounded Alignment with Instruction-Augmented LLMs bridges the distilled visual space and an expressive LLM embedding space. Across EuroSAT, BigEarthNet, and ForestNet, SATtxt improves zero-shot classification on average by 4.2%, retrieval by 5.9%, and linear probing by 2.7% over state-of-the-art baselines, demonstrating an efficient and deployable path toward spectrum-aware vision-language learning for Earth observation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36812", "url": "https://ikhado.github.io/sattxt/", "sourceid": 45665, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38959, "uid": "d01e50a1df779a3b2ee9cf8776e02303", "name": "Sampling-Aware Quantization for Diffusion Models", "authors": [{"id": 143622, "fullname": "Qian Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/143622?format=json", "institution": "Zhejiang University"}, {"id": 85433, "fullname": "Jie Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/85433?format=json", "institution": "Zhejiang University"}, {"id": 188519, "fullname": "Yuanyu Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188519?format=json", "institution": "Zhejiang University"}, {"id": 131897, "fullname": "Huiqiong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131897?format=json", "institution": "Zhejiang University"}, {"id": 85446, "fullname": "Mingli Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/85446?format=json", "institution": "Zhejiang University"}], "abstract": "Diffusion models have recently emerged as the dominant approach in visual generation tasks. However, the lengthy denoising chains and the computationally intensive noise estimation networks hinder their applicability in low-latency and resource-limited environments. Previous research has endeavored to address these limitations in a decoupled manner, utilizing either advanced samplers or efficient model quantization techniques. In this study, we uncover that quantization-induced noise disrupts directional estimation at each sampling step, further distorting the precise directional estimations of higher-order samplers when solving the sampling equations through discretized numerical methods, thereby altering the optimal sampling trajectory. To attain dual acceleration with high fidelity, we propose a sampling-aware quantization strategy, wherein a Mixed-Order Trajectory Alignment technique is devised to impose a more stringent constraint on the error bounds at each sampling step, facilitating a more linear probability flow. Extensive experiments on sparse-step fast sampling across multiple datasets demonstrate that our approach preserves the rapid convergence characteristics of high-speed samplers while maintaining superior generation quality. Code will be made publicly available soon.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38959", "url": null, "sourceid": 45489, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37894, "uid": "56390d1113302e0c420d4393523ca92b", "name": "SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models", "authors": [{"id": 153796, "fullname": "Yuechen Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/153796?format=json", "institution": "Zhejiang University"}, {"id": 188516, "fullname": "Xiaoyan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188516?format=json", "institution": "Zhejiang University"}, {"id": 188517, "fullname": "Yicheng Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188517?format=json", "institution": "University of Sydney, University of Sydney"}, {"id": 188518, "fullname": "Hao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188518?format=json", "institution": "Manycore, Inc."}, {"id": 86545, "fullname": "Rui Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86545?format=json", "institution": "Kujiale.com"}, {"id": 186097, "fullname": "Rong Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/186097?format=json", "institution": null}, {"id": 85446, "fullname": "Mingli Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/85446?format=json", "institution": "Zhejiang University"}, {"id": 188519, "fullname": "Yuanyu Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188519?format=json", "institution": "Zhejiang University"}, {"id": 85433, "fullname": "Jie Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/85433?format=json", "institution": "Zhejiang University"}], "abstract": "Vision-Language Models (VLMs) have been increasingly applied in real-world scenarios due to their outstanding understanding and reasoning capabilities. Although VLMs have already demonstrated impressive capabilities in common visual question answering and logical reasoning, they still lack the ability to make reasonable decisions in complex real-world environments. We define this ability as spatial logical reasoning, which not only requires understanding the spatial relationships among objects in complex scenes, but also the logical dependencies between steps in multi-step tasks. To bridge this gap, we introduce Spatial Logical Question Answering (SpatiaLQA), a benchmark designed to evaluate the spatial logical reasoning capabilities of VLMs. SpatiaLQA consists of 9,605 question answer pairs derived from 241 real-world indoor scenes. We conduct extensive experiments on 41 mainstream VLMs, and the results show that even the most advanced models still struggle with spatial logical reasoning. To address this issue, we propose a method called recursive scene graph assisted reasoning, which leverages visual foundation models to progressively decompose complex scenes into task-relevant scene graphs, thereby enhancing the spatial logical reasoning ability of VLMs, outperforming all previous methods. We will release our code and dataset soon.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37894", "url": null, "sourceid": 44012, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36645, "uid": "52ec1c0cc952d63a8bda67ff969b6968", "name": "Semantic-Adaptive Diffusion for Dynamic Spatiotemporal Fusion", "authors": [{"id": 172710, "fullname": "Jinsong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/172710?format=json", "institution": "Capital Normal University"}, {"id": 185545, "fullname": "Ying Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185545?format=json", "institution": "Beijing Normal University"}, {"id": 185546, "fullname": "Yuan Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185546?format=json", "institution": "Capital Normal University"}, {"id": 185547, "fullname": "Hairong Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185547?format=json", "institution": "University of Tennessee, Knoxville; University of Tennessee, Knoxville"}, {"id": 185548, "fullname": "Zhenzhou Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185548?format=json", "institution": "Capital Normal University"}], "abstract": "Frequent and precise land surface monitoring is critical for numerous applications, yet existing satellites struggle to achieve both simultaneously. Spatiotemporal fusion (STF) tackles this challenge by integrating multiple satellite images to generate data with improved temporal and spatial resolution, enabling more frequent and precise land surface observations. However, current methods often fail to recover dynamic landscape changes due to significant scale discrepancies among multi-source images. To address these challenges, we propose a semantic-adaptive diffusion framework for dynamic spatiotemporal fusion (SA-STF), which constrains the solution space using low-resolution and high-frequency features decoupled via a Taylor-inspired decoder. By incorporating temporal feature alignment and semantic-adaptive fusion modules, SA-STF projects multimodal images with temporal dynamics into a unified latent space, and adaptively enhances spatial details while maintaining the spectral consistency of the reconstructed images. Experiments on benchmark datasets demonstrate that SA-STF outperforms existing methods in both quantitative and qualitative evaluations, particularly in complex and dynamic scenes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36645", "url": null, "sourceid": 39743, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39625, "uid": "8bd73c11f9e36c039ff3cee4f3f5e2d2", "name": "Block-based Learned Image Compression without Blocking Artifacts", "authors": [{"id": 192503, "fullname": "Jong Wook Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/192503?format=json", "institution": "Kyung Hee University"}, {"id": 192504, "fullname": "Suyong Bahk", "url": "http://cvpr.thecvf.com/api/miniconf/users/192504?format=json", "institution": "Kyung Hee University"}, {"id": 192505, "fullname": "TaeHwa Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/192505?format=json", "institution": "Kyung Hee University"}, {"id": 192506, "fullname": "HyunDong CHO", "url": "http://cvpr.thecvf.com/api/miniconf/users/192506?format=json", "institution": "Kyung Hee University"}, {"id": 192507, "fullname": "Donghyun Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/192507?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology; Electronics and Telecommunications Research Institute (ETRI)"}, {"id": 192508, "fullname": "Sung-Chang Lim", "url": "http://cvpr.thecvf.com/api/miniconf/users/192508?format=json", "institution": "Electronics and Telecommunications Research Institute (ETRI)"}, {"id": 192509, "fullname": "Jin Soo Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192509?format=json", "institution": "ETRI"}, {"id": 192510, "fullname": "Hui Yong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/192510?format=json", "institution": "Kyung Hee University"}], "abstract": "Learned Image Compression (LIC) outperforms traditional codecs but suffers from excessive peak memory usage when handling high-resolution images. Consequently, block-based LIC has been studied to reduce peak memory and computational costs; however, this approach often introduces blocking artifacts that degrade visual quality.To mitigate this, the JPEG-AI standard introduced a patch-based scheme where overlapped blocks are coded independently using empirically determined overlap sizes. However, the experimental search for optimal overlaps is time-consuming and does not guarantee blocking-free reconstruction. To address these limitations, we propose an analytic framework modeling overlap propagation through convolutional and transposed convolutional layers to precisely determine the minimal overlaps for blocking-free reconstruction.Based on the minimum overlaps calculated, we provide the block-based implementation methodology for the convolution networks used in most CNN-based LIC models.Applied to four CNN-based LIC models on 4K images partitioned into 256$\\times$256 blocks, our method achieves rate\u2013distortion performance identical to full-image coding while reducing average peak memory usage to 18.7\\% (encoder) and 17.9\\% (decoder), and only with average computational cost of 4.23\\% and 2.34\\%, respectively. Notably, the proposed block-based framework does not require any re-training of the original model. Furthermore, it can also be applied to most CNN-based image processing neural networks without worrying about any performance degradation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39625", "url": null, "sourceid": 38509, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39090, "uid": "75b106cadadfe7ffb2ae427acec5041d", "name": "Reinforcement-Guided Synthetic Data Generation for Privacy-Sensitive Identity Recognition", "authors": [{"id": 191339, "fullname": "Xuemei Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/191339?format=json", "institution": "Wuhan University"}, {"id": 89172, "fullname": "Jiawei Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/89172?format=json", "institution": "Centre for Frontier AI Research (CFAR), A*STAR, Singapore"}, {"id": 191340, "fullname": "Hui Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/191340?format=json", "institution": "University of Oulu"}, {"id": 130766, "fullname": "Jun Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/130766?format=json", "institution": "Wuhan University"}, {"id": 89174, "fullname": "Joey Tianyi Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/89174?format=json", "institution": "National University of Singapore "}, {"id": 75857, "fullname": "Zheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75857?format=json", "institution": "Wuhan University"}], "abstract": "High-fidelity generative models are increasingly needed in privacy-sensitive scenarios, where access to data is severely restricted due to regulatory and copyright constraints. This scarcity hampers model development\u2014ironically, in settings where generative models are most needed to compensate for the lack of data. This creates a self-reinforcing challenge: limited data leads to poor generative models, which in turn fail to mitigate data scarcity. To break this cycle, we propose a reinforcement-guided synthetic data generation framework that adapts general-domain generative priors to privacy-sensitive identity recognition tasks.We first perform a cold-start adaptation to align a pretrained generator with the target domain, establishing semantic relevance and initial fidelity.Building on this foundation, we introduce a multi-objective reward that jointly optimizes semantic consistency, coverage diversity, and expression richness, guiding the generator to produce both realistic and task-effective samples.During downstream training, a dynamic sample selection mechanism further prioritizes high-utility synthetic samples, enabling adaptive data scaling and improved domain alignment.Extensive experiments on benchmark datasets demonstrate that our framework significantly improves both generation fidelity and classification accuracy, while also exhibiting strong generalization to novel categories in small-data regimes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39090", "url": null, "sourceid": 35793, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37951, "uid": "d83a3b7c173b00c655629bc5bde5bb9c", "name": "MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks", "authors": [{"id": 174882, "fullname": "Lirong Che", "url": "http://cvpr.thecvf.com/api/miniconf/users/174882?format=json", "institution": "Tsinghua University"}, {"id": 188663, "fullname": "Shuo Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188663?format=json", "institution": "McGill University, MILA- Quebec AI Institute"}, {"id": 177798, "fullname": "Huang Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/177798?format=json", "institution": "Tsinghua university"}, {"id": 188664, "fullname": "wang chuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188664?format=json", "institution": null}, {"id": 188665, "fullname": "yuzhe yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188665?format=json", "institution": null}, {"id": 177824, "fullname": "Gregory Dudek", "url": "http://cvpr.thecvf.com/api/miniconf/users/177824?format=json", "institution": "McGill University"}, {"id": 88722, "fullname": "Xueqian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88722?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 188666, "fullname": "Jian Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/188666?format=json", "institution": "Agibot"}], "abstract": "Real-world robotic tasks are long-horizon and often span multiple floors, requiring complex spatial reasoning. Existing embodied benchmarks, however, are largely confined to single-floor homes, failing to evaluate agents on realistic, building-scale tasks. We introduce MANSION, a language-driven framework for generating building-scale, multi-floor 3D environments for long-horizon tasks. Using this framework, we release MansionWorld, a large-scale dataset featuring over 1,000 diverse, non-residential buildings. These environments support cross-floor skills and long-horizon task generation on reusable building layouts. Experiments show that current methods degrade sharply on our multi-floor tasks, highlighting both the challenge and the value of this setting for advancing embodied AI.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37951", "url": null, "sourceid": 38136, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37896, "uid": "434c2eb8627ed4e1a0a4f0ee5d6022aa", "name": "Gated KalmaNet: A fading memory layer through test-time ridge regression", "authors": [{"id": 92174, "fullname": "Liangzu Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/92174?format=json", "institution": "Johns Hopkins University"}, {"id": 169266, "fullname": "Aditya Chattopadhyay", "url": "http://cvpr.thecvf.com/api/miniconf/users/169266?format=json", "institution": "Amazon"}, {"id": 85578, "fullname": "Luca Zancato", "url": "http://cvpr.thecvf.com/api/miniconf/users/85578?format=json", "institution": "AWS AI Labs"}, {"id": 188522, "fullname": "Elvis Nunez", "url": "http://cvpr.thecvf.com/api/miniconf/users/188522?format=json", "institution": "Amazon Web Services"}, {"id": 187281, "fullname": "Wei Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/187281?format=json", "institution": "Amazon"}, {"id": 130843, "fullname": "Stefano Soatto", "url": "http://cvpr.thecvf.com/api/miniconf/users/130843?format=json", "institution": "University of California, Los Angeles"}], "abstract": "As efficient alternatives to softmax Attention, linear state-space models (SSMs) achieve constant memory and linear compute, but maintain only a lossy, fading summary of the past, often leading to inferior performance in recall oriented settings. We propose \\ourname (\\ourshortname), a layer that reduces this gap by accounting for the full past when predicting the next token, while maintaining SSM-style efficiency. \\ourshortname achieves this by solving an online ridge regression problem at test time, with constant memory and linear compute cost in the sequence length. Drawing inspiration from the Kalman Filter, we iteratively solve the online ridge regression problem. However, a critical insight is that standard Kalman filter equations are numerically unstable in low-precision environments (like bfloat16) and difficult to parallelize in modern hardware. We address both challenges through two key innovations: (1) an adaptive regularization strategy with input-dependent gating that controls the condition number of the ridge regression, ensuring numerical stability while balancing memory retention. And 2) the use of Chebyshev Iteration instead of other conventional iterative solvers, which we demonstrate to be more stable in low-precision settings. To further improve scalability, we develop a hardware-aware chunk-wise implementation of Chebyshev Iteration along with custom kernels for backpropagating through our adaptive regularization and gating mechanisms. Empirically, \\ourshortname shows strong language understanding capabilites on short-context tasks outperforming existing SSM layers (like Mamba2, GLA and Gated DeltaNet). On long-context, \\ourshortname excels at real-world RAG and LongQA tasks up to 128k tokens, achieving more than $10$\\% relative improvement over other fading memory baselines.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37896", "url": null, "sourceid": 44332, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36400, "uid": "0bddac7512f3a93b174c33117a370741", "name": "SDDF: Specificity-Driven Dynamic Focusing for Open-Vocabulary Camouflaged Object Detection", "authors": [{"id": 129884, "fullname": "Jiaming Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129884?format=json", "institution": "Shenzhen University"}, {"id": 184952, "fullname": "Yifeng Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184952?format=json", "institution": "Shenzhen University"}, {"id": 184953, "fullname": "Chunlin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184953?format=json", "institution": "Shenzhen University"}, {"id": 184954, "fullname": "Weihua Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184954?format=json", "institution": "Shenzhen University"}, {"id": 184955, "fullname": "bingye Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184955?format=json", "institution": "Shenzhen University"}, {"id": 183119, "fullname": "Qiwei Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183119?format=json", "institution": "Shenzhen University"}, {"id": 184956, "fullname": "Boyang Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/184956?format=json", "institution": "Shenzhen University"}, {"id": 184957, "fullname": "Xiaochun Mai", "url": "http://cvpr.thecvf.com/api/miniconf/users/184957?format=json", "institution": "Shenzhen University"}, {"id": 184958, "fullname": "Qiang Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/184958?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}], "abstract": "Open-vocabulary object detection (OVOD) aims to detect known and unknown objects in the open world by leveraging text prompts.Benefiting from the emergence of large-scale vision--language pre-trained models, OVOD has demonstrated strong zero-shot generalization capabilities. However, when dealing with camouflaged objects, the detector often fails to distinguish and localize objects because the visual features of the objects and the background are highly similar. To bridge this gap, we construct a benchmark named \\textbf{OVCOD-D} by augmenting carefully selected camouflaged object images with fine-grained textual descriptions. Due to the limited scale of available camouflaged object datasets, we adopt detectors pre-trained on large-scale object detection datasets as our baseline methods, as they possess stronger zero-shot generalization ability. In the specificity-aware sub-descriptions generated by multimodal large models, there still exist confusing and overly decorative modifiers. To mitigate such interference, we design a sub-description principal component contrastive fusion strategy that reduces noisy textual components. Furthermore, to address the challenge that the visual features of camouflaged objects are highly similar to those of their surrounding environment, we propose a specificity-guided regional weak alignment and dynamic focusing method, which aims to strengthen the detector\u2019s ability to discriminate camouflaged objects from background objects. Under the open-set evaluation setting, the proposed method achieves an AP of 56.4 on the OVCOD-D benchmark.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36400", "url": null, "sourceid": 35654, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36499, "uid": "6f1098383b8d527a7b2391d00b0dda70", "name": "Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation", "authors": [{"id": 154346, "fullname": "Kwanyoung Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/154346?format=json", "institution": "Hanyang University"}, {"id": 147634, "fullname": "SeungJu Cha", "url": "http://cvpr.thecvf.com/api/miniconf/users/147634?format=json", "institution": "Hanyang University"}, {"id": 185215, "fullname": "Yebin Ahn", "url": "http://cvpr.thecvf.com/api/miniconf/users/185215?format=json", "institution": "Hanyang Universty"}, {"id": 147999, "fullname": "Hyunwoo Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/147999?format=json", "institution": "Hanyang University"}, {"id": 185216, "fullname": "Sungho Koh", "url": "http://cvpr.thecvf.com/api/miniconf/users/185216?format=json", "institution": "Hanyang University"}, {"id": 69900, "fullname": "Dong-Jin Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/69900?format=json", "institution": "Hanyang University"}], "abstract": "Diffusion-based text-to-image (T2I) models have made remarkable progress in generating photorealistic and semantically rich images. However, when the target concepts lie in low-density regions of the training distribution, these models often produce semantically misaligned or structurally inconsistent results. This limitation arises from the long-tailed nature of text-image datasets, where rare concepts or editing instructions are underrepresented. To address this, we introduce Adaptive Auxiliary Prompt Blending (AAPB) \u2014 a unified framework that stabilizes the diffusion process in low-density regions. AAPB leverages auxiliary anchor prompts to provide semantic support in rare concept generation and structural support in image editing, ensuring faithful guidance toward the target prompt. Unlike prior heuristic prompt alternation methods, AAPB derives a closed-form adaptive coefficient that optimally balances the influence between the auxiliary anchor and the target prompt at each diffusion step. Grounded in Tweedie's identity, our formulation provides a principled and training-free framework for adaptive prompt blending, ensuring stable and target-faithful generation. We demonstrate the effectiveness of adaptive interpolation over fixed interpolation through controlled experiments and empirically show consistent improvements on the RareBench and FlowEdit datasets, achieving superior semantic accuracy and structural fidelity compared to prior training-free baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36499", "url": null, "sourceid": 34875, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36744, "uid": "6710565534bb35faf7fad5971cf885b3", "name": "Video-CoE: Reinforcing Video Event Prediction via Chain of Events", "authors": [{"id": 180867, "fullname": "Qile Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/180867?format=json", "institution": "AMAP, Alibaba Group"}, {"id": 185769, "fullname": "Jing Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185769?format=json", "institution": "Alibaba Group"}, {"id": 183308, "fullname": "Rui Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/183308?format=json", "institution": "AMAP, Alibaba Group"}, {"id": 185770, "fullname": "Lei Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/185770?format=json", "institution": "Alibaba Group"}, {"id": 88278, "fullname": "Xiangxiang Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88278?format=json", "institution": "MeiTuan"}], "abstract": "Despite advances in the application of MLLMs for various video tasks, video event prediction (VEP) remains relatively underexplored. VEP requires the model to perform fine-grained temporal modeling of videos and establish logical relationships between videos and future events, which current MLLMs still struggle with.In this work, we first present a comprehensive evaluation of current leading MLLMs on the VEP task, revealing the reasons behind their inaccurate predictions, including lack of logical reasoning ability for future events prediction and insufficient utilization of visual information.To address these challenges, we propose **C**hain of **E**vents (**CoE**) paradigm, which constructs temporal event chains to implicitly enforce MLLM focusing on the visual content and the logical connections between videos and future events, incentivizing model's reasoning capability with multiple training protocols.Experimental results on public benchmarks demonstrate that our method outperforms both leading open-source and commercial MLLMs, establishing a new state-of-the-art on the VEP task.Codes and models will be released soon.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36744", "url": null, "sourceid": 41177, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37265, "uid": "8e09c1416fa221eafaacbb6c60e11f02", "name": "From Scale to Speed: Adaptive Test-Time Scaling for Image Editing", "authors": [{"id": 151496, "fullname": "Xiangyan Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/151496?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 187047, "fullname": "Zhenlong Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187047?format=json", "institution": ", Chinese Academy of Sciences"}, {"id": 185769, "fullname": "Jing Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185769?format=json", "institution": "Alibaba Group"}, {"id": 183308, "fullname": "Rui Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/183308?format=json", "institution": "AMAP, Alibaba Group"}, {"id": 145621, "fullname": "Datao Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/145621?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 187048, "fullname": "Meng Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187048?format=json", "institution": "Lanzhou University"}, {"id": 185770, "fullname": "Lei Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/185770?format=json", "institution": "Alibaba Group"}, {"id": 186537, "fullname": "Yancheng Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/186537?format=json", "institution": "Alibaba Group"}, {"id": 88278, "fullname": "Xiangxiang Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88278?format=json", "institution": "MeiTuan"}, {"id": 154418, "fullname": "Gaopeng Gou", "url": "http://cvpr.thecvf.com/api/miniconf/users/154418?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 154419, "fullname": "Gang Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/154419?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Science"}, {"id": 153126, "fullname": "Yujun Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/153126?format=json", "institution": "The University of Queensland"}], "abstract": "Image Chain-of-Thought (Image-CoT) is a test-time scaling paradigm that improves image generation by extending inference time. Most Image-CoT methods focus on text-to-image (T2I) generation. Unlike T2I generation, image editing is goal-directed: the solution space is constrained by the source image and instruction. This mismatch causes three challenges when applying Image-CoT to editing: inefficient resource allocation with fixed sampling budgets, unreliable early-stage verification using general MLLM scores, and redundant edited results from large-scale sampling. To address this, we propose ADaptive Edit-CoT (ADE-CoT), an on-demand test-time scaling framework to enhance editing efficiency and performance. It incorporates three key strategies: (1) a difficulty-aware resource allocation that assigns dynamic budgets based on estimated edit difficulty; (2) edit-specific verification in early pruning that uses region localization and caption consistency to select promising candidates; and (3) depth-first opportunistic stopping, guided by an instance-specific verifier, that terminates when intent-aligned results are found. Extensive experiments on three SOTA editing models (Step1X-Edit, BAGEL, FLUX.1 Kontext) across three benchmarks show that ADE-CoT achieves superior performance-efficiency trade-offs. With comparable sampling budgets, ADE-CoT obtains better performance with more than 2\u00d7 speedup over Best-of-N.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37265", "url": null, "sourceid": 32174, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39107, "uid": "7923554f3401e7aaadf9466f121ce1a4", "name": "Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction", "authors": [{"id": 191375, "fullname": "Yisheng He", "url": "http://cvpr.thecvf.com/api/miniconf/users/191375?format=json", "institution": "Tongyi Lab, Alibaba Group"}], "abstract": "We introduce a feed-forward framework for one-shot animatable mesh head reconstruction that generates high-fidelity, directly animatable 3D head avatars from a single image. Unlike previous work that relies on time-consuming test-time optimization or extensive multi-view data, our method produces complete mesh representations with inherent animatability from a single image in a single forward pass. Our approach employs a dual shape and texture map architecture that simultaneously processes mesh vertices and texture map with extracted image features from a shared transformer backbone, allowing for coherent shape carving and appearance modeling. To prevent mesh collapse and ensure topological integrity during feed-forward deformation, we propose an iterative GRU-based decoding mechanism with progressive geometry deformation and texture refinement, coupled with a novel reprojection-based texture guidance mechanism that anchors appearance learning to the input image. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in reconstruction quality, animation capability, and computational efficiency. Code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39107", "url": null, "sourceid": 33658, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38277, "uid": "da0945054b3ff8a44325d3507066d10d", "name": "Mixture of States: Routing Token-Level Dynamics for Multimodal Generation", "authors": [{"id": 86851, "fullname": "Haozhe Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86851?format=json", "institution": "King Abdullah University of Science and Technology"}, {"id": 187098, "fullname": "Ding Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187098?format=json", "institution": "Meta"}, {"id": 90942, "fullname": "Mingchen Zhuge", "url": "http://cvpr.thecvf.com/api/miniconf/users/90942?format=json", "institution": "King Abdullah University of Science and Technology"}, {"id": 103410, "fullname": "Zijian Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/103410?format=json", "institution": "King&#x27;s College London"}, {"id": 157112, "fullname": "Tian Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/157112?format=json", "institution": "Meta"}, {"id": 127697, "fullname": "Sen He", "url": "http://cvpr.thecvf.com/api/miniconf/users/127697?format=json", "institution": "Meta AI"}, {"id": 189482, "fullname": "Yukang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189482?format=json", "institution": "Princeton University"}, {"id": 76480, "fullname": "Shuming Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76480?format=json", "institution": "KAUST"}, {"id": 157113, "fullname": "Yuren Cong", "url": "http://cvpr.thecvf.com/api/miniconf/users/157113?format=json", "institution": "Meta"}, {"id": 140044, "fullname": "Jiadong Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/140044?format=json", "institution": "Microsoft"}, {"id": 156979, "fullname": "Hongyu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156979?format=json", "institution": "Meta                  Reality Labs"}, {"id": 140855, "fullname": "Ke Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/140855?format=json", "institution": "Meta"}, {"id": 157111, "fullname": "Kam Woh Ng", "url": "http://cvpr.thecvf.com/api/miniconf/users/157111?format=json", "institution": "Meta"}, {"id": 189483, "fullname": "Juan Camilo Perez", "url": "http://cvpr.thecvf.com/api/miniconf/users/189483?format=json", "institution": "Facebook"}, {"id": 90976, "fullname": "Juan-Manuel P\u00e9rez-R\u00faa", "url": "http://cvpr.thecvf.com/api/miniconf/users/90976?format=json", "institution": "Meta AI"}, {"id": 85886, "fullname": "Tao Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85886?format=json", "institution": "University of Surrey"}, {"id": 102365, "fullname": "Wei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/102365?format=json", "institution": "Meta Inc."}, {"id": 76995, "fullname": "Shikun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76995?format=json", "institution": "Meta AI"}, {"id": 128347, "fullname": "J\u00fcrgen Schmidhuber", "url": "http://cvpr.thecvf.com/api/miniconf/users/128347?format=json", "institution": "King Abdullah University of Science and Technology"}], "abstract": "We introduce MoS (Mixture of States), a novel fusion paradigm for multimodal diffusion models that merges modalities using flexible, state-based interactions. The core of MoS is a learnable, token-wise router that creates denoising timestep- and input-dependent interactions between modalities' hidden states, precisely aligning token-level features with the diffusion trajectory. This router sparsely selects the top-$k$ hidden states and is trained with an $\\epsilon$-greedy strategy, efficiently selecting contextual features with minimal learnable parameters and negligible computational overhead. We validate our design with text-to-image generation (MoS-Image) and editing (MoS-Editing), which achieve state-of-the-art results. With only 3B to 5B parameters, our models match or surpass counterparts up to $4\\times$ larger. These findings establish MoS as a flexible and compute-efficient paradigm for scaling multimodal diffusion models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38277", "url": null, "sourceid": 41663, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38965, "uid": "04ee08ead1a259009cab6b2198fb6d93", "name": "OMoBlur: An Object Motion Blur Dataset and Benchmark for Real-World Local Motion Deblurring", "authors": [{"id": 174243, "fullname": "Dingchuan Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/174243?format=json", "institution": "Zhejiang University"}, {"id": 191073, "fullname": "Jiatong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191073?format=json", "institution": "Zhejiang University"}, {"id": 191074, "fullname": "Jingwen Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191074?format=json", "institution": "Zhejiang University"}, {"id": 191075, "fullname": "Zhengyue Zhuge", "url": "http://cvpr.thecvf.com/api/miniconf/users/191075?format=json", "institution": "Zhejiang University"}, {"id": 191076, "fullname": "Yueting Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191076?format=json", "institution": "Zhejiang University"}, {"id": 178457, "fullname": "Qi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/178457?format=json", "institution": "Zhejiang University"}], "abstract": "Object motion blur in static scenes is spatially heterogeneous, differing from conventional deblurring problems yet frequently occurring in real handheld capture scenarios. Existing datasets either rely on costly beam-splitting capture with residual misalignment or employ synthetic blur that fails to model the continuous photon-integration process during exposure. To overcome these limitations, we introduce OMoBlur, a physically grounded dataset that emulates realistic exposure integration via programmable sensor control, ensuring close alignment between synthetic and real blur distributions. OMoBlur provides 20,000 blur\u2013sharp\u2013mask pairs covering diverse object motion types. Leveraging this dataset, we further propose OMDNet, an object-motion-aware deblurring network that integrates a Motion\u2013Appearance Extract Block, a Flow-Guided Gate Predictor, and an Adaptive Gated Fusion mechanism. This design enables the network to selectively restore blurred regions while preserving static backgrounds, without requiring pixel-accurate mask annotations. Extensive experiments demonstrate that OMoBlur\u2019s physically faithful data collection and large-scale diversity significantly enhance the network\u2019s generalization to real-world motion blur, establishing OMoBlur and OMDNet as a robust benchmark and practical solution for local motion deblurring. The dataset and code will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38965", "url": null, "sourceid": 39840, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37293, "uid": "d312e6a8dd81e4a0fca59ebc3915e5e0", "name": "OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory", "authors": [{"id": 113866, "fullname": "Zhaochong An", "url": "http://cvpr.thecvf.com/api/miniconf/users/113866?format=json", "institution": "University of Copenhagen"}, {"id": 128169, "fullname": "Menglin Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/128169?format=json", "institution": "Facebook"}, {"id": 155930, "fullname": "Haonan Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155930?format=json", "institution": "Nanyang Technological University"}, {"id": 103410, "fullname": "Zijian Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/103410?format=json", "institution": "King&#x27;s College London"}, {"id": 154250, "fullname": "Xiaoke Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154250?format=json", "institution": "UC Santa Cruz"}, {"id": 126968, "fullname": "Zhiheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126968?format=json", "institution": "University of Science and Technology of China"}, {"id": 130315, "fullname": "Weiming Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/130315?format=json", "institution": "University of Waterloo"}, {"id": 187097, "fullname": "Kumara Kahatapitiya", "url": "http://cvpr.thecvf.com/api/miniconf/users/187097?format=json", "institution": "Meta"}, {"id": 187098, "fullname": "Ding Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187098?format=json", "institution": "Meta"}, {"id": 127697, "fullname": "Sen He", "url": "http://cvpr.thecvf.com/api/miniconf/users/127697?format=json", "institution": "Meta AI"}, {"id": 187099, "fullname": "Chenyang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187099?format=json", "institution": "Meta"}, {"id": 85886, "fullname": "Tao Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85886?format=json", "institution": "University of Surrey"}, {"id": 187100, "fullname": "Fanny Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187100?format=json", "institution": "Facebook"}, {"id": 73712, "fullname": "Serge Belongie", "url": "http://cvpr.thecvf.com/api/miniconf/users/73712?format=json", "institution": "Pioneer Centre for Artificial Intelligence, Copenhagen University"}, {"id": 157112, "fullname": "Tian Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/157112?format=json", "institution": "Meta"}], "abstract": "Storytelling in real-world videos often unfolds through multiple shots\u2014discontinuous yet semantically connected clips that together convey a coherent narrative. However, existing multi-shot video generation (MSV) methods struggle to effectively model long-range cross-shot context, as they rely on limited temporal windows or single keyframe conditioning, leading to degraded performance under complex narratives. In this work, we propose OneStory, enabling global yet compact cross-shot context modeling for consistent and scalable narrative generation. OneStory reformulates MSV as a next-shot generation task, enabling autoregressive shot synthesis while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. We introduce two key modules: a Frame Selection module that constructs a semantically-relevant global memory based on informative frames from prior shots, and an Adaptive Conditioner that performs importance-guided patchification to generate compact context for direct conditioning. We further curate a high-quality multi-shot dataset with referential captions to mirror real-world storytelling patterns, and design effective training strategies under the next-shot paradigm. Finetuned from a pretrained I2V model on our curated 60K dataset, OneStory achieves state-of-the-art narrative coherence across diverse and complex scenes in both text- and image-conditioned settings, enabling controllable and immersive long-form video storytelling. Our model and data will be released with the paper.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37293", "url": null, "sourceid": 40140, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39949, "uid": "fc45efdc2a351a5eeee3268204677741", "name": "Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token", "authors": [{"id": 154812, "fullname": "Anqi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154812?format=json", "institution": "Beijing Institute of Technology"}, {"id": 193178, "fullname": "Xiaokang Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/193178?format=json", "institution": "Beijing Institute of Technology"}, {"id": 154813, "fullname": "Guangyu Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/154813?format=json", "institution": "Beijing Institute of Technology"}, {"id": 70991, "fullname": "Jianbo Jiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/70991?format=json", "institution": "University of Birmingham"}, {"id": 90079, "fullname": "Chi Harold Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90079?format=json", "institution": "Beijing Institute of Technology"}, {"id": 75837, "fullname": "Yunchao Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/75837?format=json", "institution": "Beijing Jiaotong University"}], "abstract": "Recent segmentation methods leveraging Multi-modal Large Language Models (MLLMs) have shown reliable object-level segmentation and enhanced spatial perception. However, almost all previous methods predominantly rely on specialist mask decoders to interpret masks from generated segmentation-related embeddings and visual features, or incorporate multiple additional tokens to assist. This paper aims to investigate whether and how we can unlock segmentation from MLLM itSELF with 1 segmentation Embedding~(SELF1E) while achieving competitive results, which eliminates the need for external decoders. To this end, our approach targets the fundamental limitation of resolution reduction in pixel-shuffled image features from MLLMs. First, we retain image features at their original uncompressed resolution, and refill them with residual features extracted from MLLM-processed compressed features, thereby improving feature precision. Subsequently, we integrate pixel-unshuffle operations on image features with and without MLLM processing, respectively, to unleash the details of compressed features and simulate the residual features under uncompressed resolution, which further enhances the resolution of refilled features. Moreover, we redesign the attention mask with dual perception pathways, i.e., image-to-image and image-to-segmentation, enabling rich feature interaction between pixels and the segmentation token. Comprehensive experiments across multiple segmentation tasks validate that SELF1E achieves performance competitive with specialist mask decoder-based methods, demonstrating the feasibility of decoder-free segmentation in MLLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39949", "url": null, "sourceid": 42501, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36972, "uid": "2b2050eb04c8bf0965fd4f153767c64b", "name": "Soft Modality-Guided Expert Specialization in MoE-VLMs", "authors": [{"id": 182555, "fullname": "Zi-Hao Bo", "url": "http://cvpr.thecvf.com/api/miniconf/users/182555?format=json", "institution": "Li Auto Inc."}, {"id": 186352, "fullname": "Yaqian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186352?format=json", "institution": "Li Auto Inc."}, {"id": 186353, "fullname": "Anzhou Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186353?format=json", "institution": null}, {"id": 146843, "fullname": "rinyoichi takezoe", "url": "http://cvpr.thecvf.com/api/miniconf/users/146843?format=json", "institution": "Li Auto Inc."}, {"id": 186354, "fullname": "Ertao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186354?format=json", "institution": null}, {"id": 186355, "fullname": "Tianxiang Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186355?format=json", "institution": "Li Auto Inc."}, {"id": 186356, "fullname": "Jiale Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186356?format=json", "institution": "Li Auto Inc."}, {"id": 186357, "fullname": "Mo Guang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186357?format=json", "institution": null}, {"id": 186358, "fullname": "Kaiwen Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/186358?format=json", "institution": "Li Auto Inc."}], "abstract": "Mixture-of-Experts (MoE) has become a prevalent backbone for large vision-language models (VLMs), yet how modality-specific signals should guide expert routing remains under-explored.    Existing routing strategies are either hand-crafted or modality-agnostic, relying on idealized priors that ignore the layer-dependent modality fusion patterns in MoE-VLMs and provide little guidance for expert specialization.    We propose Soft Modality-guided Expert Specialization (SMoES), which consists of dynamic soft modality scores that capture layer-dependent fusion patterns, an expert binning mechanism aligned with expert-parallel deployment, and an inter-bin mutual information regularization that encourages coherent modality specialization.    Our method leverages attention-based or Gaussian-statistics modality scores to optimize mutual information regularization.    Experiments across four MoE-based VLMs and 16 benchmarks demonstrate improvement on both effectiveness and efficiency: 0.9\\% and 4.2\\% average gain on multimodal and language-only tasks, 56.1\\% reduction in EP communication overhead, and 12.3\\% throughput improvement under realistic deployment.    These results validate that aligning routing with modality-aware expert specialization unlocks MoE-VLM capacity and efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36972", "url": null, "sourceid": 37401, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39956, "uid": "f275c3f63affaa9fd3251140a5578ee6", "name": "Building Robust Vision Encoders for Cross-Dataset Evaluation in Immunofluorescent Microscopy", "authors": [{"id": 181250, "fullname": "Umar Marikkar", "url": "http://cvpr.thecvf.com/api/miniconf/users/181250?format=json", "institution": "University of Surrey"}, {"id": 193190, "fullname": "Syed Sameed Husain", "url": "http://cvpr.thecvf.com/api/miniconf/users/193190?format=json", "institution": "University of Surrey"}, {"id": 154655, "fullname": "Muhammad Awais", "url": "http://cvpr.thecvf.com/api/miniconf/users/154655?format=json", "institution": "University of Surrey"}, {"id": 154652, "fullname": "Sara Atito", "url": "http://cvpr.thecvf.com/api/miniconf/users/154652?format=json", "institution": "University of Surrey"}], "abstract": "Immunofluorescence (IF) images reveal detailed information about structures and functions at the subcellular level. However, unlike RGB images, IF datasets pose challenges for deep learning models due to their inconsistencies in channel count and configuration, stemming from varying staining protocols across laboratories and studies. Although existing approaches build channel-adaptive models for training, they do not perform evaluations across IF datasets with unseen channel configurations. To address this, we first introduce a biologically informed view of cellular image channels by grouping them into either context or concept, where we treat the context channels as a reference for the concept channels in the image. We leverage this view to propose Channel Conditioned Cell Representations (C3R), a framework that learns representations that transfers well to both in-distribution (ID) and out-of-distribution (OOD) datasets which contain same and different channel configurations, respectively.  C3R is a two-fold framework comprising a channel-adaptive encoder architecture and a masked knowledge distillation training strategy, both built around the context-concept principle. We find that C3R outperforms existing benchmarks on both ID and OOD tasks, while yielding state-of-the-art results on frozen encoder evaluation on the CHAMMI benchmark. Our method opens a new pathway for cross-dataset generalization between IF datasets, with no need for retraining on unseen channel configurations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39956", "url": null, "sourceid": 41184, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38078, "uid": "ad0896d82c8097a18dfb0206489abcac", "name": "Differentiable Adaptive 4D Structured Illumination for Joint Capture of Shape and Reflectance", "authors": [{"id": 189006, "fullname": "Huakeng Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/189006?format=json", "institution": "College of Computer Science and Technology, Zhejiang University"}, {"id": 189007, "fullname": "Yaowen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189007?format=json", "institution": "Zhejiang University"}, {"id": 85731, "fullname": "Kun Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/85731?format=json", "institution": "Zhejiang University"}, {"id": 85722, "fullname": "Hongzhi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85722?format=json", "institution": "Zhejiang University"}], "abstract": "We present a differentiable framework to adaptively compute 4D illumination conditions with respect to an object, for efficient, high-quality simultaneous acquisition of its shape and reflectance, with a unified spatial-angular structured light and a single camera. Using a simple histogram-based pixel-level probability model for depth and reflectance, we differentiably link the next illumination condition(s) with a loss that encourages the reduction in depth uncertainty. As new structured illumination is cast, corresponding image measurements are used to update the uncertainty at each pixel. Finally, a fine-tuning-based approach reconstructs the depth map and reflectance parameter maps, by minimizing the differences between all physical measurements and their simulated counterparts. The effectiveness of our framework is demonstrated on physical objects with wide variations in shape and appearance. Our depth results compare favorably with state-of-the-art techniques, while our reflectance results are comparable when validated against photographs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38078", "url": null, "sourceid": 44721, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39176, "uid": "bb2da85c47ec0d635ae708645f47475f", "name": "Hg-I2P: Bridging Modalities for Generalizable Image-to-Point-Cloud Registration via Heterogeneous Graphs", "authors": [{"id": 152503, "fullname": "Pei An", "url": "http://cvpr.thecvf.com/api/miniconf/users/152503?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 152504, "fullname": "Junfeng Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/152504?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 89263, "fullname": "Jiaqi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89263?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 131539, "fullname": "Yulong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131539?format=json", "institution": "Huazhong Agricultural University"}, {"id": 152505, "fullname": "Jie Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/152505?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 127808, "fullname": "Liangliang Nan", "url": "http://cvpr.thecvf.com/api/miniconf/users/127808?format=json", "institution": "Delft University of Technology"}], "abstract": "Image-to-point-cloud (I2P) registration aims to align 2D images with 3D point clouds by establishing reliable 2D-3D correspondences. The drastic modality gap between images and point clouds makes it challenging to learn features that are both discriminative and generalizable, leading to severe performance drops in unseen scenarios. We address this challenge by introducing a heterogeneous graph framework that jointly refines cross-modal features and correspondences within a unified architecture. The proposed graph represents a mapping between segmented 2D and 3D regions, which enhances cross-modal feature interaction and thus improves feature discriminability. In addition, modeling the consistency among vertices and edges within the graph enables pruning of unreliable correspondences. Building on these insights, we propose a heterogeneous graph embedded I2P registration method, termed Hg-I2P. It learns a heterogeneous graph by mining multi-path feature relationships, adapts features under the guidance of heterogeneous edges, and prunes correspondences using graph-based projection consistency. Experiments on six indoor and outdoor benchmarks under cross-domain setups demonstrate that Hg-I2P significantly outperforms existing methods in both generalization and accuracy. Code will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39176", "url": "https://github.com/anpei96/hg-i2p-demo", "sourceid": 35474, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65724, "file": "/media/PosterPDFs/CVPR%202026/39176.png", "modified": "2026-04-22T19:32:02.552268-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": true, "sortkey": 0, "is_live_content": false, "detailed_kind": "", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65725, "file": "/media/PosterPDFs/CVPR%202026/39176-thumb.png", "modified": "2026-04-22T19:32:02.766390-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": false, "sortkey": 0, "is_live_content": false, "detailed_kind": "thumb", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65726, "modified": "2026-04-22T19:55:55.812346-07:00", "display_section": 1, "type": "PDF", "name": "Slides", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "/media/cvpr-2026/Slides/39176.pdf", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37444, "uid": "aa5d5b1b36318c95dffe1d6c4e153f2d", "name": "Gallant: Voxel Grid-based Humanoid Locomotion and Local-navigation across 3-D Constrained Terrains", "authors": [{"id": 181826, "fullname": "Qingwei Ben", "url": "http://cvpr.thecvf.com/api/miniconf/users/181826?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 154244, "fullname": "Botian Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154244?format=json", "institution": "Tsinghua University"}, {"id": 127073, "fullname": "Kailin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/127073?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 187463, "fullname": "Feiyu Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/187463?format=json", "institution": "University of Science and Technology of China"}, {"id": 187464, "fullname": "Wentao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187464?format=json", "institution": "The University of Tokyo"}, {"id": 187465, "fullname": "Jingping Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187465?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 126543, "fullname": "Jingbo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126543?format=json", "institution": "Shanghai AI LAB"}, {"id": 84911, "fullname": "Dahua Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/84911?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 88208, "fullname": "Jiangmiao Pang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88208?format=json", "institution": "Shanghai AI Laboratory "}], "abstract": "Robust humanoid locomotion requires accurate and globally consistent perception of the surrounding 3D environment. However, existing perception modules, mainly based on depth images or elevation maps, offer only partial and locally flattened views of the environment, failing to capture the full 3D structure.This paper presents $\\textbf{Gallant}$, a voxel-grid\u2013based framework for humanoid locomotion and local navigation in 3D constrained terrains.It leverages voxelized LiDAR data as a lightweight and structured perceptual representation, and employs a z-grouped 2D CNN to map this representation to the control policy, enabling fully end-to-end optimization. A high-fidelity LiDAR simulation that dynamically generates realistic observations is developed to support scalable, LiDAR-based training and ensure sim-to-real consistency.Experimental results show that Gallant\u2019s broader perceptual coverage facilitates the use of a single policy that goes beyond the limitations of previous methods confined to ground-level obstacles, extending to lateral clutter, overhead constraints, multi-level structures, and narrow passages. Gallant also firstly achieves near-100\\% success rates in challenging scenarios such as stair climbing and stepping onto elevated platforms through improved end-to-end optimization. This project will be fully open-source.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37444", "url": null, "sourceid": 36629, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39906, "uid": "97b21687a279e25c7a74e7cfb9005bba", "name": "LinVideo: A Post-Training Framework towards $\\mathcal{O}(n)$ Attention in Efficient Video Generation", "authors": [{"id": 179914, "fullname": "yushi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/179914?format=json", "institution": "SenseTime"}, {"id": 130712, "fullname": "Xingtong Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/130712?format=json", "institution": "Beijing Institute of Technology"}, {"id": 139422, "fullname": "RUIHAO GONG", "url": "http://cvpr.thecvf.com/api/miniconf/users/139422?format=json", "institution": "Beihang University"}, {"id": 193090, "fullname": "Chengtao Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/193090?format=json", "institution": "Nanyang Technological University"}, {"id": 90255, "fullname": "Jun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90255?format=json", "institution": "The Hong Kong University of Science and Technology"}], "abstract": "Video diffusion models (DMs) have enabled high-quality video synthesis, but their computation costs scale quadratically with sequence length due to the nature of self-attention. While linear attention offers a more efficient alternative, fully replacing quadratic attention demands costly pretraining. This is largely because linear attention lacks sufficient expressiveness and struggles with the complex spatiotemporal dynamics inherent to video generation. In this paper, we present LinVideo, an efficient data-free post-training framework that replaces a target number of self-attention modules with linear attention while preserving performance. First, we observe a significant disparity in the replaceability of different layers. Instead of manual or heuristic choices, we frame layer selection as a binary classification problem and propose a selective transfer, which automatically and progressively converts layers to linear attention with minimal performance impact. Additionally, to overcome the ineffectiveness and even inefficiency of existing objectives in optimizing this challenge transfer process, we introduce an anytime distribution matching (ADM) objective that aligns the distributions of samples across any timestep along the sampling trajectory. This objective is highly efficient and recovers model performance. Extensive experiments show that LinVideo achieves a $\\mathbf{1.43\\text{-}1.71\\times}$ speedup while preserving generation quality, and the 4-step distilled models further reduce latency by $\\mathbf{15.9\\text{-}20.9\\times}$ with only a minor drop in visual quality.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39906", "url": null, "sourceid": 38257, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39170, "uid": "738754a8dc187735701c05a35e80aff9", "name": "Probabilistic Precipitation Nowcasting with Rectified Flow Transformers", "authors": [{"id": 153479, "fullname": "Johannes Schusterbauer", "url": "http://cvpr.thecvf.com/api/miniconf/users/153479?format=json", "institution": "CompVis       LMU Munich"}, {"id": 170371, "fullname": "Jannik Wiese", "url": "http://cvpr.thecvf.com/api/miniconf/users/170371?format=json", "institution": "LMU Munich"}, {"id": 153590, "fullname": "Nick Stracke", "url": "http://cvpr.thecvf.com/api/miniconf/users/153590?format=json", "institution": "CompVis       LMU Munich"}, {"id": 191500, "fullname": "Timy Phan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191500?format=json", "institution": "Ludwig-Maximilians-Universit\u00e4t M\u00fcnchen"}, {"id": 85132, "fullname": "Bj\u00f6rn Ommer", "url": "http://cvpr.thecvf.com/api/miniconf/users/85132?format=json", "institution": "University of Munich"}], "abstract": "Accurate weather forecasts are essential across various domains and are safety-critical in extreme weather conditions.Compared to simulation-based forecasting, data-driven approaches show greater efficiency, enabling short-term, high-resolution nowcasting.In particular, diffusion models proved effective in weather nowcasting due to their strong probabilistic foundation.However, existing methods rely on deterministic compression to reduce the complexity of high-dimensional weather data, limiting their ability to capture uncertainty in the decoding process.In this work, we introduce **FREUD**, a **Fr**ame-wise **E**ncoder and **U**nited **D**ecoder model based on rectified flow transformers for efficient compression of spatio-temporal weather data. Frame-wise encoding enables continuous forecast updates, while the unified video decoder ensures temporal consistency. Our uncertainty-preserving first stage allows us to capture aleatoric uncertainty through ensembling, which is particularly beneficial for extreme weather events with high decoding variability. We achieve state-of-the-art performance in precipitation nowcasting with a compact latent-space rectified flow transformer on the SEVIR benchmark and show further performance gains by scaling model size. With FREUD and the latent rectified flow model, we aim to push the boundaries of data-driven weather nowcasting.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39170", "url": null, "sourceid": 44123, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36330, "uid": "3bcad4e7af821b33b29f7078b90ab75a", "name": "3D Gaussian Splatting from unposed Spike Stream", "authors": [{"id": 180975, "fullname": "Yijia Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/180975?format=json", "institution": "Peking University"}, {"id": 182125, "fullname": "Tong Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182125?format=json", "institution": "Peking University"}, {"id": 71710, "fullname": "Liwen Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/71710?format=json", "institution": "Peking University"}, {"id": 128297, "fullname": "Lei Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/128297?format=json", "institution": "Peking University"}, {"id": 88774, "fullname": "Tiejun Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88774?format=json", "institution": "Peking University"}], "abstract": "3D Gaussian Splatting (3DGS) has significantly advanced 3D reconstruction with its impressive performance. However, its reliance on sharp images and precise camera pose priors limits its effectiveness in high-speed scenarios. Recent advances have integrated spike camera, a bio-inspired sensor with a high temporal resolution, to enhance 3DGS in such conditions. Although spike-based methods reduce the need for sharp images, they still face challenges in achieving precise camera pose estimation due to unstable observations and visual texture deficiency.To address these challenges, we propose Nope-SGS, the first framework that reconstructs high-speed 3D scenes from **unposed captures** of the bio-inspired high-temporal-resolution spike camera. To achieve robust 3D reconstruction and pose estimation, we first reformulate the spike model from a probabilistic perspective and extend its application to keyframing, effectively alleviating the instability caused by the spike stream. Building upon this foundation, we devise a progressive optimization framework to facilitate swift 3D reconstruction.  The experimental results demonstratethat our method achieves up to 7.4dB higher PSNR and 40\\% lower AbsoluteTrajectory Error (ATE) compared to state-of-the-art methodsunder challenging high-speed scenarios while maintaining the fastest reconstruction speed among spike-based methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36330", "url": null, "sourceid": 45078, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36495, "uid": "5ebcc5d6380763e7fbaadf4270357929", "name": "SHAPE: Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation for Medical Image Segmentation", "authors": [{"id": 185193, "fullname": "Linkuan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185193?format=json", "institution": "Northwestern Polytechnical University Xi&#x27;an"}, {"id": 185194, "fullname": "Yinghao Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/185194?format=json", "institution": "Northwestern Polytechnical University Xi&#x27;an"}, {"id": 185195, "fullname": "Yufei Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185195?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 185196, "fullname": "Xiangyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185196?format=json", "institution": "Harbin Institute of Technology"}, {"id": 185197, "fullname": "Wenjie Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/185197?format=json", "institution": "University of Science and Technology of China"}, {"id": 185198, "fullname": "Cong Cong", "url": "http://cvpr.thecvf.com/api/miniconf/users/185198?format=json", "institution": "Macquarie University"}, {"id": 185199, "fullname": "leyi wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/185199?format=json", "institution": "Shandong University"}, {"id": 87585, "fullname": "Ran Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/87585?format=json", "institution": "Tianjin University"}, {"id": 182986, "fullname": "Qiangguo Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/182986?format=json", "institution": "Northwestern Polytechnical University"}], "abstract": "Unsupervised Domain Adaptation (UDA) is essential for deploying medical segmentation models across diverse clinical environments. Existing methods are fundamentally limited, suffering from semantically unaware feature alignment that results in poor distributional fidelity and from pseudo-label validation that disregards global anatomical constraints, thus failing to prevent the formation of globally implausible structures. To address these issues, we propose SHAPE (**S**tructure-aware **H**ierarchical Unsupervised Domain **A**daptation with **P**lausibility **E**valuation), a framework that reframes adaptation towards global anatomical plausibility. Built on a DINOv3 foundation, its Hierarchical Feature Modulation (HFM) module first generates features with both high fidelity and class-awareness. This shifts the core challenge to robustly validating pseudo-labels. To augment conventional pixel-level validation, we introduce Hypergraph Plausibility Estimation (HPE), which leverages hypergraphs to assess the global anatomical plausibility that standard graphs cannot capture.This is complemented by Structural Anomaly Pruning (SAP) to purge remaining artifacts via cross-view stability.SHAPE significantly outperforms prior methods on cardiac and abdominal cross-modality benchmarks, achieving state-of-the-art average Dice scores of 90.08\\% (MRI$\\to$CT) and 78.51\\% (CT$\\to$MRI) on cardiac data, and 87.48\\% (MRI$\\to$CT) and 86.89\\% (CT$\\to$MRI) on abdominal data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36495", "url": null, "sourceid": 34387, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36666, "uid": "75e4b350d35c7e5892cfd9432683c990", "name": "Learning Generalizable 3D Medical Image Representations from Mask-Guided Self-Supervision", "authors": [{"id": 152572, "fullname": "Yunhe Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/152572?format=json", "institution": "Rutgers University"}, {"id": 185592, "fullname": "Yabin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185592?format=json", "institution": "Stanford University"}, {"id": 185593, "fullname": "Chong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185593?format=json", "institution": "Stanford University"}, {"id": 185594, "fullname": "Jiaming Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185594?format=json", "institution": "Stanford University"}, {"id": 185595, "fullname": "Maya Varma", "url": "http://cvpr.thecvf.com/api/miniconf/users/185595?format=json", "institution": "Stanford University"}, {"id": 185596, "fullname": "Jean-Benoit Delbrouck", "url": "http://cvpr.thecvf.com/api/miniconf/users/185596?format=json", "institution": "Stanford University"}, {"id": 166890, "fullname": "Akshay Chaudhari", "url": "http://cvpr.thecvf.com/api/miniconf/users/166890?format=json", "institution": "Stanford University"}, {"id": 185597, "fullname": "Curtis Langlotz", "url": "http://cvpr.thecvf.com/api/miniconf/users/185597?format=json", "institution": "Stanford University"}], "abstract": "Foundation models have transformed vision and language by learning general-purpose representations from large-scale unlabeled data, yet 3D medical imaging lacks analogous approaches. Existing self-supervised methods rely on low-level reconstruction or contrastive objectives that fail to capture the anatomical semantics critical for medical image analysis, limiting transfer to downstream tasks. We present MASS (MAsk-guided Self-Supervised learning), which treats in-context segmentation as the pretext task for learning general-purpose medical imaging representations. MASS's key insight is that automatically generated class-agnostic masks provide sufficient structural supervision for learning semantically rich representations. By training on thousands of diverse mask proposals spanning anatomical structures and pathological findings, MASS learns what semantically defines medical structures: the holistic combination of appearance, shape, spatial context, and anatomical relationships. We demonstrate effectiveness across data regimes: from small-scale pretraining on individual datasets (20-200 scans) to large-scale multi-modal pretraining on 5K CT, MRI, and PET volumes, all without annotations. MASS demonstrates: (i) few-shot segmentation on novel structures, (ii) matching full supervision with only 20-40\\% labeled data while outperforming self-supervised baselines by over 20 in Dice score in low-data regimes, and (iii) frozen-encoder classification on unseen pathologies that matches full supervised training with thousands of samples. These results validate that mask-guided pretraining captures broadly generalizable knowledge, opening a path toward 3D medical imaging foundation models without expert annotations. Code and models will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36666", "url": null, "sourceid": 36365, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39189, "uid": "404d3f0612eb04b0949751d0102417a9", "name": "TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models", "authors": [{"id": 126968, "fullname": "Zhiheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126968?format=json", "institution": "University of Science and Technology of China"}, {"id": 130315, "fullname": "Weiming Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/130315?format=json", "institution": "University of Waterloo"}, {"id": 86851, "fullname": "Haozhe Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86851?format=json", "institution": "King Abdullah University of Science and Technology"}, {"id": 103410, "fullname": "Zijian Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/103410?format=json", "institution": "King&#x27;s College London"}, {"id": 160689, "fullname": "Shoufa Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/160689?format=json", "institution": "The University of Hong Kong"}, {"id": 155930, "fullname": "Haonan Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155930?format=json", "institution": "Nanyang Technological University"}, {"id": 154250, "fullname": "Xiaoke Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154250?format=json", "institution": "UC Santa Cruz"}, {"id": 113866, "fullname": "Zhaochong An", "url": "http://cvpr.thecvf.com/api/miniconf/users/113866?format=json", "institution": "University of Copenhagen"}, {"id": 187100, "fullname": "Fanny Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187100?format=json", "institution": "Facebook"}, {"id": 139232, "fullname": "Aditya Patel", "url": "http://cvpr.thecvf.com/api/miniconf/users/139232?format=json", "institution": "Meta"}, {"id": 191542, "fullname": "Viktar Atliha", "url": "http://cvpr.thecvf.com/api/miniconf/users/191542?format=json", "institution": "Meta"}, {"id": 190543, "fullname": "Tony Ng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190543?format=json", "institution": "Facebook"}, {"id": 157110, "fullname": "Xiao Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/157110?format=json", "institution": "Meta AI"}, {"id": 188036, "fullname": "Chuyan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188036?format=json", "institution": "Facebook"}, {"id": 187099, "fullname": "Chenyang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187099?format=json", "institution": "Meta"}, {"id": 187098, "fullname": "Ding Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187098?format=json", "institution": "Meta"}, {"id": 90976, "fullname": "Juan-Manuel P\u00e9rez-R\u00faa", "url": "http://cvpr.thecvf.com/api/miniconf/users/90976?format=json", "institution": "Meta AI"}, {"id": 127697, "fullname": "Sen He", "url": "http://cvpr.thecvf.com/api/miniconf/users/127697?format=json", "institution": "Meta AI"}, {"id": 128347, "fullname": "J\u00fcrgen Schmidhuber", "url": "http://cvpr.thecvf.com/api/miniconf/users/128347?format=json", "institution": "King Abdullah University of Science and Technology"}, {"id": 129054, "fullname": "Wenhu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/129054?format=json", "institution": "University of Waterloo"}, {"id": 86834, "fullname": "Ping Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/86834?format=json", "institution": "The University of Hong Kong"}, {"id": 102365, "fullname": "Wei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/102365?format=json", "institution": "Meta Inc."}, {"id": 85886, "fullname": "Tao Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85886?format=json", "institution": "University of Surrey"}, {"id": 132523, "fullname": "Jonas Schult", "url": "http://cvpr.thecvf.com/api/miniconf/users/132523?format=json", "institution": "Rheinisch Westf\u00e4lische Technische Hochschule Aachen"}, {"id": 157113, "fullname": "Yuren Cong", "url": "http://cvpr.thecvf.com/api/miniconf/users/157113?format=json", "institution": "Meta"}], "abstract": "Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that builds a unified continuous visual representation by cascading a VAE encoder with a representation encoder. This unified representation space allows end-to-end processing of images and videos for both understanding and generation tasks. Compared to prior UMMs with decoupled representations, TUNA's unified visual space avoids representation format mismatches introduced by separate encoders, outperforming decoupled alternatives in both understanding and generation. Moreover, we observe that stronger pretrained representation encoders consistently yield better performance across all multimodal tasks, highlighting the importance of the representation encoder. Finally, in this unified setting, jointly training on both understanding and generation data allows the two tasks to benefit from each other rather than interfere. Our extensive experiments on multimodal understanding and generation benchmarks show that TUNA achieves state-of-the-art results in image and video understanding, image and video generation, and image editing, demonstrating the effectiveness and scalability of its unified representation design.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39189", "url": null, "sourceid": 34763, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36800, "uid": "34aba6f934aa1a8de795d088a6875d28", "name": "Activation Matters: Test-time Activated Negative Labels for OOD Detection with Vision-Language Models", "authors": [{"id": 185592, "fullname": "Yabin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185592?format=json", "institution": "Stanford University"}, {"id": 185595, "fullname": "Maya Varma", "url": "http://cvpr.thecvf.com/api/miniconf/users/185595?format=json", "institution": "Stanford University"}, {"id": 152572, "fullname": "Yunhe Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/152572?format=json", "institution": "Rutgers University"}, {"id": 185596, "fullname": "Jean-Benoit Delbrouck", "url": "http://cvpr.thecvf.com/api/miniconf/users/185596?format=json", "institution": "Stanford University"}, {"id": 185594, "fullname": "Jiaming Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185594?format=json", "institution": "Stanford University"}, {"id": 185593, "fullname": "Chong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185593?format=json", "institution": "Stanford University"}, {"id": 185597, "fullname": "Curtis Langlotz", "url": "http://cvpr.thecvf.com/api/miniconf/users/185597?format=json", "institution": "Stanford University"}], "abstract": "Out-of-distribution (OOD) detection aims to identify samples that deviate from in-distribution (ID). One popular pipeline addresses this by introducing negative labels distant from ID classes and detecting OOD based on their distance to these labels.However, such labels may present poor activation on OOD samples, failing to capture the OOD characteristics. To address this, we propose \\underline{T}est-time \\underline{A}ctivated \\underline{N}egative \\underline{L}abels (TANL) by dynamically evaluating activation levels across the corpus dataset and mining candidate labels with high activation responses in the testing process. Specifically, TANL identifies high-confidence test images online and accumulates their assignment probabilities over the corpus to construct a label activation metric.Such a metric leverages historical test samples to adaptively align with the test distribution, enabling the selection of distribution-adaptive activated negative labels. By further exploring the activation information within the current testing batch, we introduce a more fine-grained, batch-adaptive variant. To fully utilize label activation knowledge, we propose an activation-aware score function that emphasizes negative labels with stronger activations, boosting performance and enhancing its robustness to the label number.Our TANL is training-free, test-efficient, and grounded in theoretical justification. Experiments on diverse backbones and wide task settings validate its effectiveness.Notably, on the large-scale ImageNet benchmark, TANL significantly reduces the FPR95 from 17.5\\% to 9.8\\%.Codes will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36800", "url": null, "sourceid": 45621, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38404, "uid": "cb0217e062ae7cb7f4ddd97326c843cf", "name": "CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization", "authors": [{"id": 189804, "fullname": "Liangbin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189804?format=json", "institution": "Alibaba Group"}, {"id": 189805, "fullname": "Xiaohua Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189805?format=json", "institution": "Alibaba Group"}, {"id": 189806, "fullname": "Chaoqun Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/189806?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 187574, "fullname": "Shijing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187574?format=json", "institution": null}, {"id": 189807, "fullname": "Zhaolong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189807?format=json", "institution": null}, {"id": 187352, "fullname": "Yanlong Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/187352?format=json", "institution": "Alibaba Group"}, {"id": 189808, "fullname": "Wenji Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189808?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}], "abstract": "Traditional speaker diarization systems have primarily focused on constrained scenarios such as meetings and interviews, where the number of speakers is limited and acoustic conditions are relatively clean. To explore open-world speaker diarization, we extend this task to the visual media domain, encompassing complex audiovisual programs such as films and TV series. This new setting introduces several challenges, including long-form video understanding, a large number of speakers, cross-modal asynchrony between audio and visual cues, and uncontrolled in-the-wild variability. To address these challenges, we propose Cinematic Speaker Registration \\& Diarization (CineSRD), a unified multimodal framework that leverages visual, acoustic, and linguistic cues from video, speech, and subtitles for speaker annotation. CineSRD first performs visual anchor clustering to register initial speakers and then integrates an audio language model for speaker turn detection, refining annotations and supplementing unregistered off-screen speakers. Furthermore, we construct and release a dedicated speaker diarization benchmark for visual media that includes Chinese and English programs. Experimental results demonstrate that CineSRD achieves superior performance on the proposed benchmark and competitive results on conventional datasets, validating its robustness and generalizability in open-world visual media settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38404", "url": null, "sourceid": 39485, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39169, "uid": "5d904c056ee171ce7a126a71d7ac507d", "name": "MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models", "authors": [{"id": 87561, "fullname": "Dohwan Ko", "url": "http://cvpr.thecvf.com/api/miniconf/users/87561?format=json", "institution": "Korea University"}, {"id": 182644, "fullname": "Jinyoung Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/182644?format=json", "institution": "KAIST"}, {"id": 191498, "fullname": "Seoung Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/191498?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 182643, "fullname": "Sanghyeok Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/182643?format=json", "institution": "KAIST (Korea Advanced Institute of Science and Technology)"}, {"id": 191499, "fullname": "Seohyun Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/191499?format=json", "institution": null}, {"id": 156026, "fullname": "Hyunwoo J. Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/156026?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}], "abstract": "Mixture-of-Experts (MoE) has emerged as an effective approach to reduce the computational overhead of Transformer architectures by sparsely activating a subset of parameters for each token while preserving high model capacity. This paradigm has recently been extended to Vision-Language Models (VLMs), enabling scalable multi-modal understanding with reduced computational cost. However, the widely adopted deterministic top-K routing mechanism may overlook more optimal expert combinations and lead to expert overfitting. To address this limitation and improve the diversity of expert selection, we propose MoE-GRPO, a reinforcement learning (RL)-based framework for optimizing expert routing in MoE-based VLMs. Specifically, we formulate expert selection as a sequential decision-making problem and optimize it using Group Relative Policy Optimization (GRPO), allowing the model to learn adaptive expert routing policies through exploration and reward-based feedback. Furthermore, we introduce a modality-aware router guidance that encourages the router to assign tokens to the appropriate modality-specific experts during training. Extensive experiments on multimodal image and video benchmarks show that MoE-GRPO consistently outperforms standard top-K routing and its variants by promoting more diverse expert selection, thereby mitigating expert overfitting and improving generalization. To the best of our knowledge, this is the first work to explicitly optimize the expert selection policy through RL.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39169", "url": null, "sourceid": 32326, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36265, "uid": "53a5bd61dfc6a512ba5da320ed0e4494", "name": "DocPrune: Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning", "authors": [{"id": 87575, "fullname": "Joonmyung Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/87575?format=json", "institution": "Korea University"}, {"id": 182643, "fullname": "Sanghyeok Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/182643?format=json", "institution": "KAIST (Korea Advanced Institute of Science and Technology)"}, {"id": 130915, "fullname": "Jongha Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/130915?format=json", "institution": "Korea University"}, {"id": 130094, "fullname": "Sehyung Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/130094?format=json", "institution": "Korea University"}, {"id": 87561, "fullname": "Dohwan Ko", "url": "http://cvpr.thecvf.com/api/miniconf/users/87561?format=json", "institution": "Korea University"}, {"id": 184622, "fullname": "Jihyung Kil", "url": "http://cvpr.thecvf.com/api/miniconf/users/184622?format=json", "institution": "Adobe Research"}, {"id": 156026, "fullname": "Hyunwoo J. Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/156026?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}], "abstract": "Recent advances in vision\u2013language models have shown strong performance across diverse multimodal tasks, including document question answering that leverages structured visual cues from text, tables, and figures. However, unlike natural images, document images contain large backgrounds and only sparse supporting evidence, leading to the waste of substantial computational resources, especially for long documents. We observe that existing token reduction methods for natural images and videos fall short in utilizing the structural sparsity unique to documents. To address this, we propose DOCPRUNE, a training-free document token pruning framework designed for efficient long document understanding. The proposed method preserves only the essential tokens for the task while removing unnecessary ones, such as background or question-irrelevant tokens. Moreover, it automatically selects the appropriate layers to initiate token pruning based on the model\u2019s level of comprehension. Our experiments on the M3DocRAG benchmark show that DOCPRUNE improves throughput by 3.0\u00d7 and 3.3\u00d7 in the encoder and decoder, respectively, while boosting the F1 score by +1.0, achieving both higher accuracy and efficiency without any additional training.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36265", "url": null, "sourceid": 46156, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40204, "uid": "29ef8c120a1b5783630e8bbe933d9db4", "name": "Spatio-Temporal Difference Guided Motion Deblurring with the Complementary Vision Sensor", "authors": [{"id": 132376, "fullname": "Yapeng Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/132376?format=json", "institution": "Tsinghua University"}, {"id": 180591, "fullname": "Lin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180591?format=json", "institution": "Communication University of China"}, {"id": 193768, "fullname": "Yuguo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193768?format=json", "institution": "Tsinghua University"}, {"id": 193769, "fullname": "Xiangru Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193769?format=json", "institution": "Tsinghua University"}, {"id": 193770, "fullname": "Taoyi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193770?format=json", "institution": "Primievision Co."}, {"id": 146518, "fullname": "Lijian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/146518?format=json", "institution": "Tsinghua University"}, {"id": 176576, "fullname": "Zheyu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176576?format=json", "institution": "Tsinghua University"}, {"id": 193771, "fullname": "Yihan Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/193771?format=json", "institution": "Xiamen University"}, {"id": 193772, "fullname": "Rong Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193772?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Motion blur arises when rapid scene changes occur during the exposure period, collapsing rich intra-exposure motion into a single RGB frame. Without explicit structural or temporal cues, RGB-only deblurring is highly ill-posed and often fails under extreme motion.Inspired by the human visual system, neuromorphic sensors introduce temporally dense information to alleviate this problem; however, event cameras still suffer from event rate saturation under rapid motion, while the event modality entangles edge features and motion cues, which limits their effectiveness.As a recent breakthrough, the complementary vision sensor (CVS) captures synchronized RGB frames together with high-frame-rate, multi-bit spatial difference ($\\mathcal{SD}$, encoding structural edges) and temporal difference ($\\mathcal{TD}$, encoding motion cues) data within a single RGB exposure, offering a promising solution for RGB deblurring under extreme dynamic scenes. To fully leverage these complementary modalities, we propose Spatio-Temporal Difference Guided Deblur Net (STGDNet), which adopts a recurrent multi-branch architecture that iteratively encodes and fuses $\\mathcal{SD}$ and $\\mathcal{TD}$ sequences to restore structure and color details lost in blurry RGB inputs. Our method outperforms current RGB or event-based approaches in both synthetic CVS dataset and real-world evaluations. Moreover, STGDNet exhibits strong generalization capability across over 100 extreme real-world scenarios. Our code, dataset and pre-trained weights will be fully publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40204", "url": null, "sourceid": 41842, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36357, "uid": "239805ed8d17af152c5aec76c7b22dfe", "name": "MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent", "authors": [{"id": 184865, "fullname": "Yuxia Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184865?format=json", "institution": "The University of Queensland"}, {"id": 142567, "fullname": "Zhizhen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/142567?format=json", "institution": "The University of Queensland"}, {"id": 184866, "fullname": "Yuqi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184866?format=json", "institution": "The University of Queensland"}, {"id": 184867, "fullname": "Zijian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184867?format=json", "institution": "The University of Queensland"}, {"id": 90777, "fullname": "Zi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90777?format=json", "institution": "University of Queensland"}, {"id": 158034, "fullname": "Yadan Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/158034?format=json", "institution": "The University of Queensland"}], "abstract": "Recent Vision-Language-Action (VLA) models reformulate vision-language models by tuning them with millions of robotic demonstrations. While they perform well when fine-tuned for a single embodiment or task family, extending them to multi-skill settings remains challenging: directly merging VLA experts trained on different tasks results in near-zero success rates. This raises a fundamental question: what prevents VLAs from mastering multiple skills within one model? With an empirical decomposition of learnable parameters during VLA fine-tuning, we identify two key sources of non-mergeability:(1) Finetuning drives LoRA adapters in the VLM backbone toward divergent, task-specific directions beyond the capacity of existing merging methods to unify.(2) Action experts develop inter-block dependencies through self-attention feedback, causing task information to spread across layers and preventing modular recombination.To address these challenges, we present MergeVLA, a merging-oriented VLA architecture that preserves mergeability by design.MergeVLA introduces sparsely activated LoRA adapters via task masks to retain consistent parameters and reduce irreconcilable conflicts in the VLM.Its action expert replaces self-attention with cross-attention-only blocks to keep specialization localized and composable.When the task is unknown, it uses a test-time task router to adaptively select the appropriate task mask and expert head from the initial observation, enabling unsupervised task inference.Across LIBERO, LIBERO-Plus, RoboTwin, and multi-task experiments on the real SO101 robotic arm, MergeVLA achieves performance comparable to or even exceeding individually finetuned experts, demonstrating robust generalization across tasks, embodiments, and environments.The source code can be found in the supplementary material for reference.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36357", "url": null, "sourceid": 43470, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39744, "uid": "5cd580b09d20ca28f5aeaeb0d505bc6d", "name": "The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection", "authors": [{"id": 154787, "fullname": "Qingdong He", "url": "http://cvpr.thecvf.com/api/miniconf/users/154787?format=json", "institution": "Tencent Youtu Lab"}, {"id": 192770, "fullname": "Xueqin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192770?format=json", "institution": "Sichuan University"}, {"id": 192771, "fullname": "Yanjie Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192771?format=json", "institution": "Fudan University"}, {"id": 192554, "fullname": "Peng Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192554?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}, {"id": 158874, "fullname": "Pengcheng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/158874?format=json", "institution": "Western University"}, {"id": 89708, "fullname": "Zhenye Gan", "url": "http://cvpr.thecvf.com/api/miniconf/users/89708?format=json", "institution": "Tencent Youtu Lab"}, {"id": 86912, "fullname": "Chengjie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86912?format=json", "institution": "Tencent Youtu Lab; Shanghai Jiao Tong University"}, {"id": 152689, "fullname": "Xiaobin Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152689?format=json", "institution": "Tencent AI Lab"}, {"id": 88656, "fullname": "Jiangning Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88656?format=json", "institution": "Tencent Youtu Lab"}, {"id": 86953, "fullname": "Yabiao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86953?format=json", "institution": "Zhejiang University"}], "abstract": "Although diffusion transformer (DiT)-based video virtual try-on (VVT) has made significant progress in synthesizing realistic videos, existing methods still struggle to capture fine-grained garment dynamics and preserve background integrity across video frames. They also incur high computational costs due to additional interaction modules introduced into DiTs, while the limited scale and quality of existing public datasets also restrict model generalization and effective training. To address these challenges, we propose a novel framework, KeyTailor, along with a large-scale, high-definition dataset, ViT-HD. The core idea of KeyTailor is a keyframe-driven details injection strategy, motivated by the fact that keyframes inherently contain both foreground dynamics and background consistency. Specifically, KeyTailor adopts an instruction-guided keyframe sampling strategy to filter informative frames from the input video. Subsequently, two tailored keyframe-driven modules\u2014the garment details enhancement module and the collaborative background optimization module\u2014are employed to distill garment dynamics into garment-related latents and to optimize the integrity of background latents, both guided by keyframes. These enriched details are then injected into standard DiT blocks together with pose, mask, and noise latents, enabling efficient and realistic try-on video synthesis. This design ensures consistency without explicitly modifying the DiT architecture, while simultaneously avoiding additional complexity. In addition, our dataset ViT-HD comprises 15,070 high-quality video samples at a resolution of 810 \u00d7 1080, covering diverse garments. Extensive experiments demonstrate that KeyTailor outperforms state-of-the-art baselines in terms of garment fidelity and background integrity across both dynamic and static scenarios. The dataset and code will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39744", "url": null, "sourceid": 38641, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38485, "uid": "4b0cb9685dd1da13cd7d85b3e4de824f", "name": "Target-Aware Invertible Encoder with Reconstruction Guidance for Infrared Small Target Detection", "authors": [{"id": 176841, "fullname": "Shule Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/176841?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 189955, "fullname": "Zetian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189955?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 189956, "fullname": "Xiao Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/189956?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 189957, "fullname": "Zexuan Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/189957?format=json", "institution": "Nanjing University of Science and Technology"}], "abstract": "Modern detectors typically deepen backbones and rely on aggressive downsampling to harvest high-level semantics. But this severely degrades low-energy infrared tiny targets via rescale-induced information loss. This work introduces InvDet, a target-aware invertible encoder that unifies information preservation and target-aware enhancement within a reconstruction-guided detection framework. An invertible pathway reconstructs the input from feature latents, exposing information loss as an optimizable quantity. To decouple detection from irrelevant reconstruction, a Target-Aware Reconstruction Modulation (TARM) module operates only in the inverse path, gating high-pass latents and applying a mild gain to low-pass features without altering the forward detection distribution. In addition, a Geometry\u2013Content Tolerance Metric (GCTM) is proposed to focus on truly informative regions and yields a pixel-wise weight map that gently regularizes the reconstruction branch. Our method yields state-of-the-art accuracy on five public infrared benchmarks, providing a principled pathway toward detection-friendly representation learning for scale-challenged visual regimes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38485", "url": null, "sourceid": 31327, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38979, "uid": "52342fd964a7fcc286db3b72884c57cc", "name": "The Golden Subspace: Where Efficiency Meets Generalization in Continual Test-Time Adaptation", "authors": [{"id": 191110, "fullname": "Guannan Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/191110?format=json", "institution": "Nanjing University"}, {"id": 105203, "fullname": "Da-Wei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/105203?format=json", "institution": "Nanjing University"}, {"id": 191111, "fullname": "Zhenguo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191111?format=json", "institution": "Department of Computer Science and Engineering, Hong Kong University of Science and Technology; Huawei Noah&#x27;s Ark Lab"}, {"id": 84842, "fullname": "Han-Jia Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/84842?format=json", "institution": "Nanjing University"}], "abstract": "Continual Test-Time Adaptation (CTTA) aims to enable models to adapt online to unlabeled data streams under distribution shift without accessing source data. Existing CTTA methods face an efficiency\u2013generalization trade-off: updating more parameters improves adaptation but severely reduces online inference efficiency. An ideal solution is to achieve comparable adaptation with minimal feature updates; we call this minimal subspace the golden subspace. We prove its existence in a single-step adaptation setting and show that it coincides with the row space of the pretrained classifier. To enable online maintenance of this subspace, we introduce the sample-wise Average Gradient Outer Product (AGOP) as an efficient proxy for estimating the classifier weights without retraining. Building on these insights, we propose Guided Online Low-rank Directional adaptation (GOLD), which uses a lightweight adapter to project features onto the golden subspace and learns a compact scaling vector while the subspace is dynamically updated via AGOP. Extensive experiments on classification and segmentation benchmarks, including autonomous-driving scenarios, demonstrate that GOLD attains superior efficiency, stability, and overall performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38979", "url": null, "sourceid": 44484, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40247, "uid": "9a8a7205e2aa8f927bd0ee99371f8e41", "name": "Gloria: Consistent Character Video Generation via Content Anchors", "authors": [{"id": 126329, "fullname": "Yuhang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126329?format=json", "institution": "University of Science and Technology of China"}, {"id": 107356, "fullname": "Fan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107356?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 184432, "fullname": "Huaijin Pi", "url": "http://cvpr.thecvf.com/api/miniconf/users/184432?format=json", "institution": "The University of Hong Kong"}, {"id": 77241, "fullname": "Ailing Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/77241?format=json", "institution": "Anuttacon"}, {"id": 172154, "fullname": "Shuai Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/172154?format=json", "institution": "University of Science and Technology of China"}, {"id": 193872, "fullname": "Guowei Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193872?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 86247, "fullname": "Wei Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/86247?format=json", "institution": "University of Science and Technology of China"}, {"id": 86250, "fullname": "Yang Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86250?format=json", "institution": "University of Science and Technology of China"}, {"id": 86637, "fullname": "Zheng-Jun Zha", "url": "http://cvpr.thecvf.com/api/miniconf/users/86637?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Digital characters are central to modern media, yet generating character videos with long-duration, consistent multi-view appearance and expressive identity remains challenging. Existing approaches either provide insufficient context to preserve identity or leverage non-character-centric information as the \"memory\", leading to suboptimal consistency.Recognizing that character video generation inherently resembles an ``outside-looking-in\" scenario. In this work, we propose represent the character\u2019s visual attributes through a compact set of anchor frames.This design provides stable references for consistency, while reference-based video generation inherently faces challenges of copy-pasting and multi-reference conflicts. To address these, we introduce two mechanisms: Superset Content Anchoring, providing intra- and extra-training clip cues to prevent duplication, and RoPE as Weak Condition, encoding positional offsets to distinguish multiple anchors.Furthermore, we construct a scalable pipeline to extract these anchors from massive videos. Experiments show our method generates high-quality character videos exceeding $10$ minutes, and achieves expressive identity and appearance consistency across views, surpassing existing methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40247", "url": null, "sourceid": 39299, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39041, "uid": "c7f803c381391bfba324f22ba9203446", "name": "Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models", "authors": [{"id": 191229, "fullname": "Hengzhuang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191229?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 191230, "fullname": "Xinsong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191230?format=json", "institution": "Tencent Hunyuan Research"}, {"id": 191231, "fullname": "QIMING PENG", "url": "http://cvpr.thecvf.com/api/miniconf/users/191231?format=json", "institution": null}, {"id": 191232, "fullname": "Bin Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/191232?format=json", "institution": null}, {"id": 126300, "fullname": "Han Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126300?format=json", "institution": "Microsft Research Asia"}, {"id": 158676, "fullname": "Dengyang Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158676?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 84842, "fullname": "Han-Jia Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/84842?format=json", "institution": "Nanjing University"}, {"id": 191233, "fullname": "Teng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191233?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 84759, "fullname": "Hai Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/84759?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in multimodal tasks.Despite their impressive performance, MLLMs suffer from the modality imbalance issue, where visual information is often underutilized compared to textual representations in deeper layers, leading to degraded visual performance or hallucinations.This issue stems from the predominant reliance on next-text-token-prediction during training, which fails to provide direct visual supervisory signals, resulting in progressive homogenization of visual representations throughout the layers.To this end, we propose Latent Visual Reconstruction (LaVer), a novel training framework that facilitates MLLMs in learning more discriminative visual representations via masked image modeling in the joint latent semantic space of LLM.Our method offers direct visual activation to MLLMs, which exhibit increased visual attention allocation, indicating enhanced utilization of visual information.Extensive experiments across diverse benchmarks prove the superiority of our approach in various scenarios, especially those requiring dense visual capabilities.Code will be publicly available upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39041", "url": null, "sourceid": 42065, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38863, "uid": "7b4e82cb855801d7098534835e2ca260", "name": "Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving", "authors": [{"id": 85001, "fullname": "Jianhua Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/85001?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 190861, "fullname": "Meng Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/190861?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 190862, "fullname": "Jiangtong Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190862?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 190863, "fullname": "Fan He", "url": "http://cvpr.thecvf.com/api/miniconf/users/190863?format=json", "institution": null}, {"id": 190864, "fullname": "Huixin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190864?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 190865, "fullname": "Sitong Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/190865?format=json", "institution": null}, {"id": 190866, "fullname": "Dechang Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190866?format=json", "institution": null}, {"id": 190867, "fullname": "Hao Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190867?format=json", "institution": ""}, {"id": 178413, "fullname": "Pei Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/178413?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 190868, "fullname": "Yuze Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/190868?format=json", "institution": "Yinwang Intelligent Technology Co.,Ltd."}, {"id": 88270, "fullname": "Minzhe Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88270?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 190869, "fullname": "Haojie Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190869?format=json", "institution": "Shanghai Yinwang Intelligent Technologies Co., Ltd"}, {"id": 190870, "fullname": "Qichao Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/190870?format=json", "institution": null}, {"id": 190871, "fullname": "Xuechao Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190871?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 190872, "fullname": "Siyuan Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/190872?format=json", "institution": null}, {"id": 128148, "fullname": "Lu Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/128148?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 190873, "fullname": "Qingqiu Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190873?format=json", "institution": null}, {"id": 75577, "fullname": "Xiaosong Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/75577?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 91873, "fullname": "Hang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/91873?format=json", "institution": "Huawei Noah\u2018s Ark Lab"}], "abstract": "Autonomous driving heavily relies on accurate and robust spatial perception. Many failures arise from inaccuracies and instability, especially in long-tail scenarios and complex interactions. However, current vision\u2013language models are weak at spatial grounding and understanding, and VLA systems built on them therefore show limited perception and localization ability.To address these challenges, we introduce Percept-WAM, a perception-enhanced World\u2013Awareness\u2013Action Model that is the first to implicitly integrate 2D/3D scene understanding abilities within a single vision-language model (VLM).Instead of relying on QA-style spatial reasoning, Percept-WAM unifies 2D/3D perception tasks into World-PV and World-BEV tokens, which encode both spatial coordinates and confidence.We propose a grid-conditioned prediction mechanism for dense object perception, incorporating IoU-aware scoring and parallel autoregressive decoding, improving stability in long-tail, far-range, and small-object scenarios. Additionally, Percept-WAM leverages pretrained VLM parameters to retain general intelligence (e.g., logical reasoning) and can output perception results and trajectory control outputs directly. Experiments show that Percept-WAM matches or surpasses classical detectors and segmenters on downstream perception benchmarks, achieving 51.7/58.9 mAP on COCO 2D detection and nuScenes BEV 3D detection. When integrated with trajectory decoders, it further improves planning performance on nuScenes and NAVSIM, e.g., surpassing DiffusionDrive by 2.1 in PMDS on NAVSIM.Qualitative results further highlight its strong open-vocabulary and long-tail generalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38863", "url": null, "sourceid": 32439, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37610, "uid": "074ec9c45a208ce26a241b07e38e0e17", "name": "Ghost-FWL: A Large-Scale Full-Waveform LiDAR Dataset for Ghost Detection and Removal", "authors": [{"id": 181324, "fullname": "Kazuma Ikeda", "url": "http://cvpr.thecvf.com/api/miniconf/users/181324?format=json", "institution": "Keio University"}, {"id": 179705, "fullname": "Ryosei Hara", "url": "http://cvpr.thecvf.com/api/miniconf/users/179705?format=json", "institution": "Keio University"}, {"id": 187850, "fullname": "Rokuto Nagata", "url": "http://cvpr.thecvf.com/api/miniconf/users/187850?format=json", "institution": "Keio University"}, {"id": 187851, "fullname": "Ozora Sako", "url": "http://cvpr.thecvf.com/api/miniconf/users/187851?format=json", "institution": "Keio University"}, {"id": 187852, "fullname": "Zihao Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/187852?format=json", "institution": "Keio University"}, {"id": 187853, "fullname": "Takahiro Kado", "url": "http://cvpr.thecvf.com/api/miniconf/users/187853?format=json", "institution": "Sony Semiconductor Solutions"}, {"id": 187854, "fullname": "Ibuki Fujioka", "url": "http://cvpr.thecvf.com/api/miniconf/users/187854?format=json", "institution": "Sony Semiconductor Solutions"}, {"id": 178556, "fullname": "Taro Beppu", "url": "http://cvpr.thecvf.com/api/miniconf/users/178556?format=json", "institution": "Sony Semiconductor Solutions"}, {"id": 90260, "fullname": "Mariko Isogawa", "url": "http://cvpr.thecvf.com/api/miniconf/users/90260?format=json", "institution": "Keio University"}, {"id": 187855, "fullname": "Kentaro Yoshioka", "url": "http://cvpr.thecvf.com/api/miniconf/users/187855?format=json", "institution": "Keio University"}], "abstract": "LiDAR has become an essential sensing modality in autonomous driving, robotics, and smart-city applications. However, ghost points (or ghost), which are false reflections caused by multi-path laser returns from glass and reflective surfaces, severely degrade 3D mapping and localization accuracy. Prior ghost removal rely on geometric consistency in dense point clouds, failing on mobile LiDAR's sparse, dynamic data. We address this by exploiting full-waveform LiDAR (FWL), which captures complete temporal intensity profiles rather than just peak distances, providing crucial cues for distinguishing ghosts from genuine reflections in mobile scenarios. As this is a new task, we present Ghost-FWL, the first and largest annotated mobile FWL dataset for ghost detection and removal. Ghost-FWL comprises 24K frames across 10 diverse scenes with 7.5 billion peak-level annotations, which is 100$\\times$ larger than existing annotated FWL datasets. Experiments show that our baseline outperforms existing methods in ghost removal accuracy, and our ghost removal further enhance downstream tasks such as LiDAR-based SLAM (66% trajectory error reduction) and 3D object detection (50$\\times$ false positive reduction).", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37610", "url": null, "sourceid": 39036, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39616, "uid": "3a8e92de76e8511cda8a11cbb937a020", "name": "AdaBet: Gradient-free Layer Selection for Efficient Training of Deep Neural Networks", "authors": [{"id": 192486, "fullname": "Irene Tenison", "url": "http://cvpr.thecvf.com/api/miniconf/users/192486?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 192487, "fullname": "Soumyajit Chatterjee", "url": "http://cvpr.thecvf.com/api/miniconf/users/192487?format=json", "institution": "Brave Software Research"}, {"id": 192488, "fullname": "Fahim Kawsar", "url": "http://cvpr.thecvf.com/api/miniconf/users/192488?format=json", "institution": "University of Glasgow"}, {"id": 192489, "fullname": "Mohammad Malekzadeh", "url": "http://cvpr.thecvf.com/api/miniconf/users/192489?format=json", "institution": "Nokia networks GmbH"}], "abstract": "To utilize pre-trained neural networks on edge and mobile devices, we often require efficient adaptation to user-specific runtime data distributions while operating under limited compute and memory resources. On-device retraining with a target dataset can facilitate such adaptations; however, it remains impractical due to the increasing depth of modern neural nets, as well as the computational overhead associated with gradient-based optimization across all layers. Current approaches reduce training cost by selecting a subset of layers for retraining; however, they rely on labeled data, at least one full-model backpropagation, or server-side meta-training, limiting their suitability for constrained devices. We introduce AdaBet, a gradient-free layer selection approach to rank important layers, followed by important channels of these layers, by analyzing topological features of their activation spaces through Betti Numbers and using forward passes alone. AdaBet allows selecting layers and channels with high learning capacity, which are important for retraining and adaptation, without requiring labels or gradients. Evaluating AdaBet on sixteen pairs of benchmark models and datasets shows AdaBet achieves an average gain of 5\\% more classification accuracy over gradient-based baselines while reducing average peak memory consumption by 40\\%. We open-source our code at \\url{https://anonymous.4open.science/r/adabet-37CF/}.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39616", "url": "https://github.com/Nokia-Bell-Labs/efficient_layer_selection", "sourceid": 44738, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37445, "uid": "ae63d6b3d240db3112d29de1bde72a8c", "name": "Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection", "authors": [{"id": 107490, "fullname": "Qian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107490?format=json", "institution": "Montreal Institute for Learning Algorithms, University of Montreal, Universit\u00e9 de Montr\u00e9al"}, {"id": 128717, "fullname": "Shivam Chandhok", "url": "http://cvpr.thecvf.com/api/miniconf/users/128717?format=json", "institution": "INRIA"}, {"id": 187466, "fullname": "Oscar Ma\u00f1as", "url": "http://cvpr.thecvf.com/api/miniconf/users/187466?format=json", "institution": "Meta"}, {"id": 151087, "fullname": "Kanishk Jain", "url": "http://cvpr.thecvf.com/api/miniconf/users/151087?format=json", "institution": "Universit\u00e9 de Montr\u00e9al"}, {"id": 92323, "fullname": "Aishwarya Agrawal", "url": "http://cvpr.thecvf.com/api/miniconf/users/92323?format=json", "institution": "Universit\u00e9 de Montr\u00e9al"}, {"id": 69182, "fullname": "Leonid Sigal", "url": "http://cvpr.thecvf.com/api/miniconf/users/69182?format=json", "institution": "University Of British Columbia"}], "abstract": "Instruction tuning has been central to the success of recent vision-language models (VLMs), but it remains expensive\u2014requiring large scale datasets, high-quality annotations and large-compute budget. We propose $\\textbf{PR}$ioritized c$\\textbf{O}$ncept learnin$\\textbf{G}$ via $\\textbf{R}$elative $\\textbf{E}$rror-driven $\\textbf{S}$ample $\\textbf{S}$election -- $\\textbf{PROGRESS}$ -- a data- and compute-efficient framework that enables VLMs to dynamically select what to learn next based on their evolving needs during training. At each stage, the model tracks its learning progress across skills and selects the most informative samples: those it has not already mastered and are not too difficult to learn at the current state of training. This strategy effectively controls skill acquisition and the order in which skills are learned. Specifically, we sample from skills showing the highest learning progress, prioritizing those with the most rapid improvement. Unlike prior methods, PROGRESS requires no upfront answer annotations, querying answers only on a need basis, avoids reliance on additional supervision from auxiliary VLM, or compute-heavy gradient computations for data selection. Experiments across multiple instruction-tuning datasets of varying scales demonstrate that PROGRESS consistently outperforms state-of-the-art baselines with much less data and supervision. Additionally, we show strong cross-architecture generalization to different VLMs and transferability to larger models, validating PROGRESS as a scalable solution for efficient learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37445", "url": null, "sourceid": 43920, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38500, "uid": "55f8c1c115741faf7519ccc644af7074", "name": "FastHybrid: Accelerating Hybrid Autoregressive Image Generation with Lookahead and Guided Decoding", "authors": [{"id": 171219, "fullname": "j zg", "url": "http://cvpr.thecvf.com/api/miniconf/users/171219?format=json", "institution": "University of Science and Technology of China"}, {"id": 189998, "fullname": "Fang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189998?format=json", "institution": "University of Science and Technology of China"}, {"id": 185105, "fullname": "YongXiang Hua", "url": "http://cvpr.thecvf.com/api/miniconf/users/185105?format=json", "institution": "University of Science and Technology of China"}, {"id": 189999, "fullname": "Bocheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189999?format=json", "institution": "University of Science and Technology of China"}, {"id": 190000, "fullname": "Wentao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190000?format=json", "institution": null}, {"id": 128045, "fullname": "Linli Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128045?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Autoregressive (AR) models have achieved remarkable success in natural language processing, yet their application to image generation faces significant challenges. When implementing VQ-based decoders for autoregressive image generation, the generated images typically preserve semantic information but struggle with fine-grained details. Recent hybrid AR image generation approaches address these issues by integrating diffusion models as decoder heads, enabling more  high-fidelity generation. However, the diffusion-based denoising process introduces significant computational overhead during inference.To accelerate hybrid AR image generation, we propose the Lookahead Decoding Strategy, which integrates the strengths of autoregressive and diffusion models by separating the process into two complementary branches: semantic prediction and detail refinement. The autoregressive branch captures high-level semantic structures while refining coarse predictions made by the parallel. Furthermore, we introduce Guided Diffusion Sampling to steer the diffusion denoising trajectory, significantly reducing the number of denoising steps. Extensive experiments demonstrate that our approach provides an effective solution for accelerating hybrid AR image generation models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38500", "url": null, "sourceid": 38476, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40200, "uid": "2e7cf629b7196ebe926e93fb3c02567a", "name": "ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos", "authors": [{"id": 127734, "fullname": "Luigi Seminara", "url": "http://cvpr.thecvf.com/api/miniconf/users/127734?format=json", "institution": "University of Catania"}, {"id": 106402, "fullname": "Davide Moltisanti", "url": "http://cvpr.thecvf.com/api/miniconf/users/106402?format=json", "institution": "University of Bath"}, {"id": 126366, "fullname": "Antonino Furnari", "url": "http://cvpr.thecvf.com/api/miniconf/users/126366?format=json", "institution": "University of Catania"}], "abstract": "Procedural planning aims to predict a sequence of actions that transforms an initial visual state into a desired goal, a fundamental ability for intelligent agents operating in complex environments. Existing approaches typically rely on large-scale models that learn procedural structures implicitly, resulting in limited sample-efficiency and high computational cost. In this work we introduce ViterbiPlanNet, a principled framework that explicitly integrates procedural knowledge into the learning process through a Differentiable Viterbi Layer (DVL). The DVL embeds a Procedural Knowledge Graph (PKG) directly with the Viterbi decoding algorithm, replacing non-differentiable operations with smooth relaxations that enable end-to-end optimization. This design allows the model to learn through graph-based decoding. Experiments on CrossTask, COIN, and NIV demonstrate that ViterbiPlanNet achieves state-of-the-art performance with an order of magnitude fewer parameters than diffusion- and LLM-based planners. Extensive ablations show that performance gains arise from our differentiable structure-aware training rather than post-hoc refinement, resulting in improved sample efficiency and robustness to shorter unseen horizons. We also address testing inconsistencies establishing a unified testing protocol with consistent splits and evaluation metrics. With this new protocol, we run experiments multiple times and report results using bootstrapping to assess statistical significance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40200", "url": null, "sourceid": 38303, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38800, "uid": "9802e8b631921383a4b9426bbae4a531", "name": "AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention", "authors": [{"id": 190698, "fullname": "Lei Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190698?format=json", "institution": "Li Auto Inc."}, {"id": 190699, "fullname": "Jifeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190699?format=json", "institution": "Li Auto Inc."}, {"id": 190700, "fullname": "Juntao Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190700?format=json", "institution": "Beijing University Of Technology"}, {"id": 180259, "fullname": "Feiyang Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/180259?format=json", "institution": "Li Auto Inc."}, {"id": 190701, "fullname": "Yan Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190701?format=json", "institution": "Independent Author"}, {"id": 190702, "fullname": "Jingjing Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/190702?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 190703, "fullname": "Jing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190703?format=json", "institution": "Beijing University of Technology"}, {"id": 177084, "fullname": "Yong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/177084?format=json", "institution": "Li Auto Inc."}, {"id": 190704, "fullname": "Xiaoyuan Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190704?format=json", "institution": ""}], "abstract": "Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in embodied AI tasks. However, existing VLA models, often built upon Vision-Language Models (VLMs), typically process dense visual inputs independently at each timestep. This approach implicitly models the task as a Markov Decision Process (MDP). However, this history-agnostic design is suboptimal for effective visual token processing in dynamic sequential decision-making, as it fails to leverage the context of history. To address this limitation, we reformulate the problem from a Partially Observable Markov Decision Process (POMDP) perspective and propose a novel framework named AVA-VLA. Inspired by the POMDP that the action generation should be conditioned on the belief state. AVA-VLA introduces Active Visual Attention (AVA) to dynamically modulate visual processing. It achieves this by leveraging the recurrent state, which is a neural approximation of the agent's belief state derived from the previous decision step. Specifically, the AVA module uses the recurrent state to compute the soft weights to actively process task-relevant visual tokens based on its historical context. Comprehensive evaluations demonstrate that AVA-VLA achieves state-of-the-art performance across popular robotic benchmarks, including LIBERO and CALVIN. Furthermore, real-world deployments on a dual-arm robot platform validate the framework's practical applicability and robust sim-to-real transferability.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38800", "url": "https://liauto-dsr.github.io/AVA-VLA-Page/", "sourceid": 43016, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38636, "uid": "62b257077af83f9950f299b97d065b18", "name": "OpenMMReasoner: Pushing the Frontiers in Multimodal Reasoning with an Open and General Recipe", "authors": [{"id": 190359, "fullname": "Kaichen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190359?format=json", "institution": "The University of Hong Kong"}, {"id": 190360, "fullname": "Keming Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190360?format=json", "institution": "School of Software, Tsinghua University"}, {"id": 186684, "fullname": "Zuhao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186684?format=json", "institution": "Nanyang Technological University"}, {"id": 90785, "fullname": "Bo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/90785?format=json", "institution": "Nanyang Technological University"}, {"id": 190361, "fullname": "Kairui Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190361?format=json", "institution": "Nanyang Technological University"}, {"id": 127348, "fullname": "Bin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127348?format=json", "institution": "Tsinghua University"}, {"id": 182987, "fullname": "Xingxuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182987?format=json", "institution": "MiroMind"}, {"id": 126171, "fullname": "Lidong Bing", "url": "http://cvpr.thecvf.com/api/miniconf/users/126171?format=json", "institution": "Alibaba DAMO Academy"}], "abstract": "Recent advancements in reasoning language models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual and video reasoning, the lack of transparent and reproducible data curation and training pipelines remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874k-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74k-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves a 9.5\\% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38636", "url": null, "sourceid": 30670, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38677, "uid": "4741dc94acb74a29da52f86a87c6508b", "name": "SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge", "authors": [{"id": 190446, "fullname": "Yumeng He", "url": "http://cvpr.thecvf.com/api/miniconf/users/190446?format=json", "institution": "University of Southern California"}, {"id": 136657, "fullname": "Ying Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/136657?format=json", "institution": "University of California, Los Angeles"}, {"id": 190447, "fullname": "Jiayin Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190447?format=json", "institution": "University of California, Los Angeles"}, {"id": 152336, "fullname": "Yin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152336?format=json", "institution": "University of Utah"}, {"id": 150889, "fullname": "Chenfanfu Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150889?format=json", "institution": "University of California, Los Angeles"}], "abstract": "Articulated 3D objects are critical for embodied AI, robotics, and interactive scene understanding, yet creating simulation-ready assets remains labor-intensive and requires expert modeling of part hierarchies and motion structures. We introduce SPARK, a framework for reconstructing physically consistent, kinematic part-level articulated objects from a single RGB image. Given an input image, we first leverage VLMs to extract coarse URDF parameters and generate part-level reference images. We then integrate the part-image guidance and the inferred structure graph into a generative diffusion transformer to synthesize consistent part and complete shapes of articulated objects. To further refine the URDF parameters, we incorporate differentiable forward kinematics and differentiable rendering to optimize joint types, axes, and origins under VLM-generated open-state supervision. Extensive experiments show that SPARK produces high-quality, simulation-ready articulated assets across diverse categories, enabling downstream applications such as robotic manipulation and interaction modeling.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38677", "url": null, "sourceid": 31537, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40344?format=json"], "related_events_ids": [40344]}, {"id": 40344, "uid": "4741dc94acb74a29da52f86a87c6508b", "name": "SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge", "authors": [{"id": 190446, "fullname": "Yumeng He", "url": "http://cvpr.thecvf.com/api/miniconf/users/190446?format=json", "institution": "University of Southern California"}, {"id": 136657, "fullname": "Ying Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/136657?format=json", "institution": "University of California, Los Angeles"}, {"id": 190447, "fullname": "Jiayin Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190447?format=json", "institution": "University of California, Los Angeles"}, {"id": 152336, "fullname": "Yin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152336?format=json", "institution": "University of Utah"}, {"id": 150889, "fullname": "Chenfanfu Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150889?format=json", "institution": "University of California, Los Angeles"}], "abstract": "Articulated 3D objects are critical for embodied AI, robotics, and interactive scene understanding, yet creating simulation-ready assets remains labor-intensive and requires expert modeling of part hierarchies and motion structures. We introduce SPARK, a framework for reconstructing physically consistent, kinematic part-level articulated objects from a single RGB image. Given an input image, we first leverage VLMs to extract coarse URDF parameters and generate part-level reference images. We then integrate the part-image guidance and the inferred structure graph into a generative diffusion transformer to synthesize consistent part and complete shapes of articulated objects. To further refine the URDF parameters, we incorporate differentiable forward kinematics and differentiable rendering to optimize joint types, axes, and origins under VLM-generated open-state supervision. Extensive experiments show that SPARK produces high-quality, simulation-ready articulated assets across diverse categories, enabling downstream applications such as robotic manipulation and interaction modeling.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40344", "url": null, "sourceid": -31537, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38677?format=json"], "related_events_ids": [38677]}, {"id": 36724, "uid": "e5d2af5148daa43f2ff08bb3af6780e7", "name": "Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models", "authors": [{"id": 183927, "fullname": "Hayeon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/183927?format=json", "institution": "Seoul National University"}, {"id": 183930, "fullname": "Ji Ha Jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183930?format=json", "institution": "Seoul National University"}, {"id": 180148, "fullname": "Junghun James Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/180148?format=json", "institution": "Seoul National University"}, {"id": 87674, "fullname": "Se Young Chun", "url": "http://cvpr.thecvf.com/api/miniconf/users/87674?format=json", "institution": "Seoul National University"}], "abstract": "While Vision-Language Models (VLMs) have achieved remarkable performance, their Euclidean embeddings remain limited in capturing hierarchical relationships such as part-to-whole or parent-child structures, and often face challenges in multi-object compositional scenarios. Hyperbolic VLMs mitigate this issue by better preserving hierarchical structures and modeling part-whole relations (i.e., whole scene and its part images) through entailment. However, existing approaches do not model that each part has a different level of semantic representativeness to the whole. We propose UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) for enhancing hyperbolic VLMs. UNCHA models part-to-whole semantic representativeness with hyperbolic uncertainty, by assigning lower uncertainty to more representative parts and higher uncertainty to less representative ones for the whole scene. This representativeness is then incorporated into the contrastive objective with uncertainty-guided weights. Finally, the uncertainty is further calibrated with an entailment loss regularized with entropy-based term. With the proposed losses, UNCHA learns hyperbolic embeddings with more accurate part-whole ordering, capturing the underlying compositional structure in an image and improving its understanding of complex multi-object scenes. UNCHA achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label classification benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36724", "url": null, "sourceid": 31256, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38852, "uid": "f29f71107b98a4283e9c5af8170ecef6", "name": "GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation", "authors": [{"id": 190702, "fullname": "Jingjing Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/190702?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 190834, "fullname": "Boyao Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/190834?format=json", "institution": "Hunan University"}, {"id": 90288, "fullname": "Chen Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/90288?format=json", "institution": "Peking University"}, {"id": 190698, "fullname": "Lei Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190698?format=json", "institution": "Li Auto Inc."}, {"id": 190835, "fullname": "Long Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190835?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 86239, "fullname": "Shaoshuai Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/86239?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 133263, "fullname": "Li Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/133263?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}], "abstract": "Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38852", "url": null, "sourceid": 36288, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37196, "uid": "af6e8730844faa627625a6c3fa98f0fc", "name": "Unified Camera Positional Encoding for Controlled Video Generation", "authors": [{"id": 102085, "fullname": "Cheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/102085?format=json", "institution": "Monash University"}, {"id": 186892, "fullname": "Boying Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186892?format=json", "institution": "Monash University"}, {"id": 186893, "fullname": "Meng Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/186893?format=json", "institution": "Monash University"}, {"id": 84806, "fullname": "Yan-Pei Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84806?format=json", "institution": "Tencent ARC Lab"}, {"id": 131297, "fullname": "Camilo Cruz Gambardella", "url": "http://cvpr.thecvf.com/api/miniconf/users/131297?format=json", "institution": "Monash University"}, {"id": 128648, "fullname": "Dinh Phung", "url": "http://cvpr.thecvf.com/api/miniconf/users/128648?format=json", "institution": "Monash University"}, {"id": 126993, "fullname": "Jianfei Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/126993?format=json", "institution": "Monash University"}], "abstract": "Transformers have emerged as a universal backbone across 3D perception, video generation, and world models for autonomous driving and embodied AI, where understanding camera geometry is essential for grounding visual observations in three-dimensional space. However, existing camera encoding methods often rely on simplified pinhole assumptions, restricting generalization across the diverse intrinsics and lens distortions in real-world cameras. We introduce **Relative Ray Encoding**, a geometry-consistent representation that unifies complete camera information, including 6-DoF poses, intrinsics, and lens distortions. To evaluate its capability under diverse controllability demands, we adopt camera-controlled text-to-video generation as a testbed task. Within this setting, we further identify pitch and roll as two components effective for **Absolute Orientation Encoding**, enabling full control over the initial camera orientation. Together, these designs form **UCPE (Unified Camera Positional Encoding)**, which integrates into a pretrained video Diffusion Transformer through a lightweight spatial attention adapter, adding **less than 1% trainable parameters** while achieving state-of-the-art camera controllability and visual fidelity. To facilitate systematic training and evaluation, we construct a large video dataset covering a wide range of camera motions and lens types. Extensive experiments validate the effectiveness of UCPE in camera-controllable video generation and highlight its potential as a general camera representation for Transformers across future multi-view, video, and 3D tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37196", "url": null, "sourceid": 30706, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37262, "uid": "251081321995fc4185d55a5d3e788e39", "name": "Beyond What's Shared: Recovering Lost Unique Information from Intermediate Layers to Boost Multimodal Geo-Foundation Models", "authors": [{"id": 182791, "fullname": "JangHyeon Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/182791?format=json", "institution": "University of Minnesota"}, {"id": 140667, "fullname": "Philipe Ambrozio Dias", "url": "http://cvpr.thecvf.com/api/miniconf/users/140667?format=json", "institution": "Oak Ridge National Laboratory"}, {"id": 187034, "fullname": "Yao-Yi Chiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187034?format=json", "institution": "University of Minnesota, Minneapolis"}, {"id": 106955, "fullname": "Dalton Lunga", "url": "http://cvpr.thecvf.com/api/miniconf/users/106955?format=json", "institution": "Oak Ridge National Laboratory"}], "abstract": "Learning general-purpose representations of geographic locations has become essential to geospatial tasks such as population estimation and environmental monitoring. To obtain such representations, multimodal geo-foundation models often use contrastive learning (CL) to align satellite imagery with geo-coordinates, implicitly assuming that cross-modal (shared) information suffices for downstream tasks. However, not all task-relevant information is shared between modalities, and retaining modality-specific (unique) features can improve task performance. Prior methods retain unique information through extra training objectives or databases, increasing training complexity and computation. Motivated by the conventional wisdom that earlier layers capture general input features while later layers become task-specific, we hypothesize that early layers in CL models consist unique information that is lost toward the final layer. Through a comprehensive layerwise analysis of modality gap, representation similarity, and mutual information, we confirm this trend and find that fusing intermediate (more unique) and final (more shared) representations outperforms state-of-the-art models across diverse geospatial benchmarks. Our findings reveal underutilized information diversity in CL models and show that simple layerwise fusion is an efficient path to richer geo-embeddings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37262", "url": null, "sourceid": 41305, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37660, "uid": "7cab59661cfaa9701dba0ee2d13eb25f", "name": "KLIP: Localized Distribution Shift Detection via KL-Divergence with Diffusion Priors in Inverse Problems", "authors": [{"id": 187968, "fullname": "Alireza Kheirandish", "url": "http://cvpr.thecvf.com/api/miniconf/users/187968?format=json", "institution": "Georgia Institute of Technology"}, {"id": 187969, "fullname": "Jihoon Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187969?format=json", "institution": "Georgia Institute of Technology"}, {"id": 165077, "fullname": "Sara Fridovich-Keil", "url": "http://cvpr.thecvf.com/api/miniconf/users/165077?format=json", "institution": "Georgia Institute of Technology"}], "abstract": "Diffusion models have shown promising performance as data-driven priors for computational imaging, as well as some capacity to detect out-of-distribution (OOD) images. However, existing approaches to OOD detection often require some knowledge of the shifted distribution, fail to detect subtle or localized distribution shifts, and operate on full images, rather than the indirect measurements available in inverse problems. We propose an OOD detection metric based on the Kullback-Leibler divergence between the diffusion prior and the posterior distribution, that (i) does not require any calibration data or knowledge of the shifted distribution, and (ii) can detect whole images as OOD as well as localize OOD patches within an image. Experimentally, we show that this metric can detect subtle yet semantically meaningful distribution shifts, such as the shift from healthy liver CT scans to those with tumors, and generalizes across different types of diffusion models, datasets, and inverse problems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37660", "url": null, "sourceid": 32383, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36808, "uid": "d65598b7a583ff113467d5b6f693a031", "name": "Revisiting 2D Foundation Models for Scalable 3D Medical Image Classification", "authors": [{"id": 167235, "fullname": "Han Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/167235?format=json", "institution": "Siemens Healthineers"}, {"id": 185922, "fullname": "Bogdan Georgescu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185922?format=json", "institution": "Siemens Healthineers"}, {"id": 185923, "fullname": "Yanbo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185923?format=json", "institution": "Siemens Healthineers"}, {"id": 185924, "fullname": "Youngjin Yoo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185924?format=json", "institution": "Siemens Healthineers"}, {"id": 185925, "fullname": "Michael Baumgartner", "url": "http://cvpr.thecvf.com/api/miniconf/users/185925?format=json", "institution": "Siemens Healthineers"}, {"id": 90948, "fullname": "Riqiang Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/90948?format=json", "institution": "Siemens Healthineers"}, {"id": 185926, "fullname": "Jianing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185926?format=json", "institution": "Siemens Healthineers"}, {"id": 185927, "fullname": "Gengyan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185927?format=json", "institution": "Siemens Healthineers"}, {"id": 185928, "fullname": "Eli Gibson", "url": "http://cvpr.thecvf.com/api/miniconf/users/185928?format=json", "institution": "Siemens Healthineers"}, {"id": 90931, "fullname": "Dorin Comaniciu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90931?format=json", "institution": "Siemens Healthineers"}, {"id": 185929, "fullname": "Sasa Grbic", "url": "http://cvpr.thecvf.com/api/miniconf/users/185929?format=json", "institution": null}], "abstract": "3D medical image classification is essential to modern clinical workflows. Medical foundation models (FMs) have emerged as a promising approach for scaling to new tasks, yet current research suffers from three critical pitfalls: data-regime bias, suboptimal adaptation, and insufficient task coverage. In this paper, we address these pitfalls and introduce AnyMC3D, a scalable 3D classifier adapted from 2D FMs. It allows efficient scaling to new tasks by adding only lightweight plugins (~1M parameters per task) to a single frozen backbone. Besides, this versatile framework also supports multi-view inputs, auxiliary pixel-level supervision, and interpretable heatmap generation. We establish a comprehensive benchmark of 12 tasks covering diverse pathologies, anatomies, and modalities and systematically evaluate state-of-the-art 3D classification techniques. Our analysis reveals several key insights: (1) effective adaptation is critical to unlock FM potential, (2) general-purpose FMs can match medical-specific FMs if properly adapted, and (3) 2D-based methods surpass 3D architectures for 3D classification. For the first time, we demonstrate the feasibility of achieving state-of-the-art performance across diverse applications using a single scalable framework (e.g., 1st place in the *** challenge), eliminating the need for separate task-specific 3D models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36808", "url": null, "sourceid": 32791, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38233, "uid": "6dda8f32d9ddc2a9ffe87be7d361a5c9", "name": "Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models", "authors": [{"id": 157407, "fullname": "Enguang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157407?format=json", "institution": "Nankai University"}, {"id": 189386, "fullname": "Qiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189386?format=json", "institution": "Tencent Youtu Lab"}, {"id": 189387, "fullname": "Yuanchen Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189387?format=json", "institution": "Tencent YouTu Lab"}, {"id": 107147, "fullname": "Ke Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/107147?format=json", "institution": "Tencent Youtu Lab"}, {"id": 101549, "fullname": "Xinbin Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/101549?format=json", "institution": "Nankai university"}, {"id": 87241, "fullname": "Shouhong Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/87241?format=json", "institution": "Tencent Youtu Lab"}, {"id": 75691, "fullname": "Xialei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75691?format=json", "institution": "Nankai University"}, {"id": 90540, "fullname": "Ming-Ming Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90540?format=json", "institution": "Nankai University, Tsinghua University"}], "abstract": "While Multimodal Large Language Models (MLLMs) excel at vision-language tasks, the cost of their language-driven training on internal visual foundational competence remains unclear.   In this paper, we conduct a detailed diagnostic analysis to unveil a pervasive issue: visual representation degradation in MLLMs.   Specifically, we find that compared to the initial visual features, the visual representation in the middle layers of LLM exhibits both a degradation in global function and patch structure. We attribute this phenomenon to a visual sacrifice driven by the singular text-generation objective, where the model compromises its visual fidelity to optimize for answer generation. We argue that a robust MLLM requires both strong cross-modal reasoning and core visual competence, and propose Predictive Regularization (PRe)  to force degraded intermediate features to predict initial visual features, thereby maintaining the inherent visual attributes of the MLLM's internal representations.  Extensive experiments confirm that mitigating this visual degradation effectively boosts vision-language performance, underscoring the critical importance of fostering robust internal visual representations within MLLMs for comprehensive multimodal understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38233", "url": null, "sourceid": 41981, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37015, "uid": "fb5dc7f2fdd8cb3c0c885f0c269039da", "name": "Collaborative Multi-Mode Pruning for Vision-Language Models", "authors": [{"id": 145145, "fullname": "Zimeng Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/145145?format=json", "institution": "Beihang University"}, {"id": 89528, "fullname": "Yunhong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89528?format=json", "institution": "Beihang University"}, {"id": 186491, "fullname": "Donghao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186491?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 87568, "fullname": "Jiaxin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87568?format=json", "institution": "Beihang University"}], "abstract": "Vision-Language Models (VLMs) have advanced rapidly within the unified Transformer architecture, yet their deployment on resource-constrained devices remains challenging due to high computational complexity. While pruning has emerged as an effective technique for compressing VLMs, existing approaches predominantly focus on a single mode by pruning either parameters or tokens, neglecting fully exploring the inherent redundancy in each mode, which leads to substantial performance degradation at high pruning ratios. To address the above limitations, we propose Collaborative Multi-Mode Pruning (CoMP), a novel framework tailored for VLMs by performing joint parameter and token pruning.Specifically, we first design a Collaborative Importance Metric (CIM) that investigates the mutual interference between the coupled parameters and tokens. It incorporates distinct significance of tokens into the computation of parameter importance scores, while simultaneously mitigating the affect of pruned parameters on token importance scores.Moreover, we develop a Multi-Mode Pruning Strategy (MPS) that decomposes the overall pruning process into a sequence of pruning stages, while in each stage we estimate the priory of different pruning modes based on their pruning cost and adaptively shift to the optimal one. Additionally, MPS integrates the historical cost and random exploration, in order to achieve a stable pruning process and avoid local optimum.Extensive experiments across various vision-language tasks and models demonstrate that our method effectively promotes the performance under high pruning ratios by comparing to the state-of-the-art approaches.The source code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37015", "url": null, "sourceid": 36311, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37651, "uid": "49723df33764496f4b5fed667878b26d", "name": "FSLoRA: Harmonizing Detection and Re-Identification via Freq-Spatial Low-Rank Adapter for One-Stage Person Search", "authors": [{"id": 187950, "fullname": "Yanling TIAN", "url": "http://cvpr.thecvf.com/api/miniconf/users/187950?format=json", "institution": "Nanjing University of Science and Technology; Waseda University"}, {"id": 157800, "fullname": "Shanshan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157800?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 126713, "fullname": "Di Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126713?format=json", "institution": "Alibaba Group"}, {"id": 85000, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85000?format=json", "institution": "Nanjing University of Science and Technology"}], "abstract": "Person search, which aims to detect and re-identify individuals in unconstrained scenes, faces an inherent conflict in one-stage models: pedestrian detection focuses on shared human features, while person re-identification requires identity-specific representations. Existing approaches, such as feature decoupling and loss re-weighting, primarily address this issue in later network stages but fail to resolve early-stage feature entanglement. To overcome this limitation, we propose FSLoRA, a Freq-Spatial Low-Rank Adapter that progressively decouples task-specific features at the backbone level. FSLoRA consists of a Spatial-Level Module (SLM), which employs LoRA and a mixture-of-experts to dynamically activate task-relevant spatial features, and a Frequency-Level Module (FLM), which transforms features into the frequency domain to selectively enhance task-relevant frequency components while suppressing task-irrelevant noise. By integrating both spatial and frequency-based adaptations, FSLoRA reduces feature interference, enabling more effective joint optimization. Extensive experiments on CUHK-SYSU, PRW, and Posetrack21 demonstrate that FSLoRA not only achieves state-of-the-art performance but also serves as a plug-and-play module adaptable to various person search frameworks, offering a unified and generalizable solution for one-stage person search.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37651", "url": null, "sourceid": 39475, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36981, "uid": "6cf8ae4c2312ba4a0103e20d0ace1ea3", "name": "TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model", "authors": [{"id": 186372, "fullname": "Ao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186372?format=json", "institution": "Shandong University"}, {"id": 186373, "fullname": "Yuxiang Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186373?format=json", "institution": "Shandong University"}, {"id": 186374, "fullname": "Jinghui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186374?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 186375, "fullname": "Congbo Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/186375?format=json", "institution": "New York University, Abu Dhabi"}, {"id": 85271, "fullname": "Yutong Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/85271?format=json", "institution": "University of Adelaide"}, {"id": 89213, "fullname": "Gustavo Carneiro", "url": "http://cvpr.thecvf.com/api/miniconf/users/89213?format=json", "institution": "University of Surrey"}, {"id": 186376, "fullname": "Mohammad Yaqub", "url": "http://cvpr.thecvf.com/api/miniconf/users/186376?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 91679, "fullname": "Hu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91679?format=json", "institution": "The University of Adelaide"}], "abstract": "Large Vision-Language Models (LVLMs) have advanced multimodal learning but face high computational cost issues due to the input of large number of visual tokens, motivating token pruning to improve inference efficiency.The key challenge lies in identifying which tokens are truly important.Most existing approaches rely on attention- or similarity-based criteria to estimate token importance.However, they inherently suffer from certain limitations, such as being task-agnostic and exhibiting positional bias.In this work, we explore a new perspective on token importance assignment based on token transitions in LVLMs, where token transitions are defined as the changes in token representations occurring as they propagate through the model\u2019s modules.We observe that the transition of token representations provides a meaningful signal of semantic information.Based on this insight, we propose TransPrune, a training-free and efficient token pruning method.Specifically, TransPrune progressively prunes tokens by assessing their importance through a combination of Token Transition Variation (TTV), which measures changes in both the magnitude and direction of token representations; as well as Instruction-Guided Attention (IGA), which measures how strongly the instruction attends to visual tokens via attention.Extensive experiments on various LVLM architectures, such as LLaVA-v1.5, LLaVA-Next and Qwen2.5-VL, demonstrate that TransPrune maintains comparable multimodal performance while reducing inference TFLOPs by more than half.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36981", "url": null, "sourceid": 45570, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37133, "uid": "9e38b231e32ad1066f81da8e83626957", "name": "Bridge: Basis-Driven Causal Inference Marries VFMs for Domain Generalization", "authors": [{"id": 181241, "fullname": "Mingbo Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/181241?format=json", "institution": "University of Twente"}, {"id": 154086, "fullname": "Feng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154086?format=json", "institution": "Drexel University"}, {"id": 186740, "fullname": "Caroline Gevaert", "url": "http://cvpr.thecvf.com/api/miniconf/users/186740?format=json", "institution": "University of Twente; The World Bank"}, {"id": 72228, "fullname": "George Vosselman", "url": "http://cvpr.thecvf.com/api/miniconf/users/72228?format=json", "institution": "University of Twente"}, {"id": 133794, "fullname": "Hao Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/133794?format=json", "institution": "University of Twente"}], "abstract": "Detectors often suffer from degraded performance, primarily due to the distributional gap between the source and target domains. This issue is especially evident in single-source domains with limited data, as models tend to rely on confounders (e.g., illumination, co-occurrence, and style) from the source domain, leading to spurious correlations that hinder generalization. To this end, this paper proposes a novel Basis-driven framework for domain generalization, namely **Bridge**, that incorporates causal inference into object detection. By learning the low-rank bases for front-door adjustment, **Bridge** blocks confounders' effects to mitigate spurious correlations, while simultaneously refining representations by filtering redundant and task-irrelevant components.**Bridge** can be seamlessly integrated with both discriminative (e.g., DINOv2/3, SAM) and generative (e.g., Stable Diffusion) Vision Foundation Models (VFMs).Extensive experiments across multiple domain generalization object detection datasets, i.e., Cross-Camera, Adverse Weather, Real-to-Artistic, Diverse Weather Datasets, and Diverse Weather DroneVehicle (our newly augmented real-world UAV-based benchmark), underscore the superiority of our proposed method over previous state-of-the-art approaches.Code, models, and data will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37133", "url": null, "sourceid": 37609, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36236, "uid": "90a9beec2356e3613f667e20704c953d", "name": "Re-evaluating Continual VQA: Toward Fair and Robust Evaluation for Multimodal Continual Learning", "authors": [{"id": 154699, "fullname": "Zijian Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/154699?format=json", "institution": "National University of Defense Technology"}, {"id": 184535, "fullname": "Zicheng Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/184535?format=json", "institution": "Beijing Jiaotong University"}, {"id": 154700, "fullname": "Xingxing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154700?format=json", "institution": "Tsinghua University"}, {"id": 153556, "fullname": "Kele Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153556?format=json", "institution": "National University of Defense Technology"}, {"id": 154705, "fullname": "Huaimin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154705?format=json", "institution": "National University of Defense Technology"}], "abstract": "Continual Visual Question Answering (Continual VQA) poses unique challenges for multimodal continual learning, requiring models to incrementally acquire new knowledge while preserving visual\u2013semantic grounding across tasks. However, existing benchmarks hinder fair and robust evaluation of such capabilities, as they allow models to exploit dataset biases rather than demonstrate genuine continual reasoning. We identify two structural flaws in current benchmark design. First, shared answer vocabularies across tasks encourage answer memorization, inflating performance and underestimating forgetting. Second, static answer priors within each task make the training and test answer distributions nearly identical, obscuring robustness under distribution shifts. To address these issues, we introduce UCo-VQA, an Unbiased benchmark suite that enforces token-level disjoint answer spaces across tasks and introduces intra-task train\u2013test distribution shifts, enabling fairer assessment of forgetting and generalization in multimodal continual learning. We further provide a parameter-efficient baseline that mitigates forgetting and enhances grounding through question-only replay and dual-level distillation, offering a lightweight and memory-efficient framework for continual adaptation. Extensive experiments on UCo-VQA reveal that prior methods substantially overestimate performance under biased setups, while our approach achieves state-of-the-art results, improving robustness and retention by up to 4.18% and 2.21%, respectively.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36236", "url": null, "sourceid": 34428, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37765, "uid": "35dd8512e2d97da12b68a1a094145924", "name": "Block-Sparse Global Attention for Efficient Multi-View Geometry Transformers", "authors": [{"id": 188208, "fullname": "Chung-Shien Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188208?format=json", "institution": "Rheinisch Westf\u00e4lische Technische Hochschule Aachen"}, {"id": 188209, "fullname": "Christian Schmidt", "url": "http://cvpr.thecvf.com/api/miniconf/users/188209?format=json", "institution": "Rheinisch Westf\u00e4lische Technische Hochschule Aachen"}, {"id": 134583, "fullname": "Jens Piekenbrinck", "url": "http://cvpr.thecvf.com/api/miniconf/users/134583?format=json", "institution": "Rheinisch Westf\u00e4lische Technische Hochschule Aachen"}, {"id": 75750, "fullname": "Bastian Leibe", "url": "http://cvpr.thecvf.com/api/miniconf/users/75750?format=json", "institution": "RWTH Aachen University"}], "abstract": "Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models like VGGT, $\\pi^3$ and MapAnything have demonstrated remarkable performance with relatively simple architectures. However, their scalability is fundamentally constrained by the quadratic complexity of global attention, which imposes a significant runtime bottleneck when processing large image sets. In this work, we empirically analyze the global attention matrix of these models and observe that the probability mass concentrates on a small subset of patch-patch interactions corresponding to cross-view geometric correspondences. Building on this insight and inspired by recent advances in large language models, we propose a training-free, block-sparse replacement for dense global attention, implemented with highly optimized kernels. Our method accelerates inference by more than 3$\\times$ while maintaining comparable task performance. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate that our approach seamlesslyintegrates into existing global attention-based architectures such as VGGT, $\\pi^3$, and MapAnything, while substantially improving scalability to large image collections.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37765", "url": null, "sourceid": 41304, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38323, "uid": "2ec8adb60182b3c41a31333baecd332a", "name": "TV2TV: A Unified Framework for Interleaved Language and Video Generation", "authors": [{"id": 188476, "fullname": "Xiaochuang Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/188476?format=json", "institution": "Meta FAIR"}, {"id": 189603, "fullname": "Youssef Emad", "url": "http://cvpr.thecvf.com/api/miniconf/users/189603?format=json", "institution": "Facebook"}, {"id": 150972, "fullname": "Melissa Hall", "url": "http://cvpr.thecvf.com/api/miniconf/users/150972?format=json", "institution": "FAIR (Meta)"}, {"id": 185169, "fullname": "John Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185169?format=json", "institution": "Facebook"}, {"id": 189604, "fullname": "Karthik Padthe", "url": "http://cvpr.thecvf.com/api/miniconf/users/189604?format=json", "institution": "Meta AI"}, {"id": 189605, "fullname": "Liam Robbins", "url": "http://cvpr.thecvf.com/api/miniconf/users/189605?format=json", "institution": "FAIR; Georgia Institute of Technology"}, {"id": 106665, "fullname": "Amir Bar", "url": "http://cvpr.thecvf.com/api/miniconf/users/106665?format=json", "institution": "Meta (FAIR)"}, {"id": 189606, "fullname": "Delong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189606?format=json", "institution": "AMI Labs &amp; HKUST"}, {"id": 150973, "fullname": "Michal Drozdzal", "url": "http://cvpr.thecvf.com/api/miniconf/users/150973?format=json", "institution": "Meta"}, {"id": 189607, "fullname": "Maha Elbayad", "url": "http://cvpr.thecvf.com/api/miniconf/users/189607?format=json", "institution": "FAIR"}, {"id": 72562, "fullname": "Yushi Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/72562?format=json", "institution": "University of Washington"}, {"id": 130924, "fullname": "Shang-Wen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/130924?format=json", "institution": "Facebook"}, {"id": 138110, "fullname": "Jakob Verbeek", "url": "http://cvpr.thecvf.com/api/miniconf/users/138110?format=json", "institution": "Meta"}, {"id": 90357, "fullname": "XuDong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90357?format=json", "institution": "UC Berkeley"}, {"id": 189608, "fullname": "Marjan Ghazvininejad", "url": "http://cvpr.thecvf.com/api/miniconf/users/189608?format=json", "institution": "Facebook"}, {"id": 130921, "fullname": "Luke Zettlemoyer", "url": "http://cvpr.thecvf.com/api/miniconf/users/130921?format=json", "institution": "University of Washington"}, {"id": 189609, "fullname": "Emily Dinan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189609?format=json", "institution": "Facebook"}], "abstract": "Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to \"think in words\" about subsequent content before \"acting in pixels\" to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality (preferred 92% of the time in human evaluations vs. a comparable text-to-video model) and controllability (19 point improvement in fine-grained instruction following accuracy vs. a \"think-then-act\" approach). TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38323", "url": null, "sourceid": 35146, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36339, "uid": "a46f11583fd90dc51b576a8c1c320d04", "name": "Efficient Weighted Sampling via Score-based Generative Models", "authors": [{"id": 147161, "fullname": "Heasung Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/147161?format=json", "institution": "University of Texas at Austin"}, {"id": 147159, "fullname": "Taekyun Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/147159?format=json", "institution": "University of Texas at Austin"}, {"id": 184812, "fullname": "Hyeji Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/184812?format=json", "institution": "University of Texas, Austin"}, {"id": 184813, "fullname": "Gustavo De Veciana", "url": "http://cvpr.thecvf.com/api/miniconf/users/184813?format=json", "institution": "University of Texas, Austin"}], "abstract": "Weighted sampling\u2014sampling from a probability density function (PDF) proportional to the product of a base PDF and a weight function\u2014is a fundamental technique with wide-ranging applications in variance reduction, biased sampling, data augmentation, and more. Leveraging the increasing availability of pretrained score-based generative models (SGMs), we propose a training-free weighted sampling framework that approximates the backward diffusion process of the target distribution by augmenting the pretrained base score function with an auxiliary guidance term, in a principled and computationally efficient manner. Our approach builds on two key components: a lightweight approximation of the guidance that avoids costly higher-order derivatives of both the score and weight functions, and an uncertainty-aware scheduler that dynamically adjusts the guidance strength based on a temporal analysis of approximation error. Together, these components enable accurate and stable sampling without relying on particle-based resampling or Hessian evaluations commonly required by existing methods. We validate the effectiveness of our method from synthetic to large-scale settings such as Stable Diffusion XL, where our framework achieves $1.2\\times$ to $4.7\\times$ speedups while consistently matching or outperforming state-of-the-art baselines in task performance. These results position our method as a scalable and inference-efficient solution for task-adaptive, time-sensitive sampling in generative applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36339", "url": null, "sourceid": 45758, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65727, "modified": "2026-04-22T21:57:54.399344-07:00", "display_section": 1, "type": "PDF", "name": "Slides", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "/media/cvpr-2026/Slides/36339.pdf", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38389, "uid": "5761547e8b52b26d2f7f9bdc41e0336a", "name": "Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning", "authors": [{"id": 189775, "fullname": "Guanjie Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189775?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 189776, "fullname": "Shirui Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189776?format=json", "institution": "SLIM"}, {"id": 189777, "fullname": "Yifu Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/189777?format=json", "institution": null}, {"id": 189778, "fullname": "Kai Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189778?format=json", "institution": null}, {"id": 189779, "fullname": "Jianchen Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189779?format=json", "institution": null}, {"id": 156078, "fullname": "Xiaoye Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156078?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 74051, "fullname": "Yu Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/74051?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 189780, "fullname": "Peng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189780?format=json", "institution": "Tencent"}], "abstract": "Diffusion Models have emerged as a leading class of generative models, yet their iterative sampling process remains computationally expensive. Timestep distillation is a promising technique to accelerate generation, but it often requires extensive training and leads to image quality degradation. Furthermore, fine-tuning these distilled models for specific objectives, such as aesthetic appeal or user preference, using Reinforcement Learning (RL) is notoriously unstable and easily falls into reward hacking. In this work, we introduce Flash-DMD, a novel framework that enables fast convergence with distillation and joint RL-based refinement.Specifically, we first propose an efficient timestep-aware distillation strategy that significantly reduces training cost with enhanced realism, outperforming DMD2 with only $2.1\\%$ its training cost. Second, we introduce a joint training scheme where the model is fine-tuned with an RL objective while the timestep distillation training continues simultaneously. We demonstrate that the stable, well-defined loss from the ongoing distillation acts as a powerful regularizer, effectively stabilizing the RL training process and preventing policy collapse. Extensive experiments on score-based and flow matching models show that our proposed Flash-DMD not only converges significantly faster but also achieves state-of-the-art generation quality in the few-step sampling regime, outperforming existing methods in visual quality, human preference, and text-image alignment metrics. Our work presents an effective paradigm for training efficient, high-fidelity, and stable generative models. Codes are attached in the supplementary.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38389", "url": null, "sourceid": 38338, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38586, "uid": "6e91d06a7bd6b43a1bdd70dd9068bdc7", "name": "Revisiting Model Stitching In the Foundation Model Era", "authors": [{"id": 85353, "fullname": "Zheda Mai", "url": "http://cvpr.thecvf.com/api/miniconf/users/85353?format=json", "institution": "Ohio State University"}, {"id": 135447, "fullname": "Ke Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/135447?format=json", "institution": "Amazon"}, {"id": 156889, "fullname": "Fu-En Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156889?format=json", "institution": "Amazon"}, {"id": 190214, "fullname": "Zixiao Ken Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190214?format=json", "institution": "Amazon"}, {"id": 94700, "fullname": "Albert Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/94700?format=json", "institution": "Amazon"}, {"id": 98825, "fullname": "Lu Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/98825?format=json", "institution": "Amazon"}, {"id": 93149, "fullname": "Min Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/93149?format=json", "institution": "Amazon/NTHU"}, {"id": 84728, "fullname": "Wei-Lun Chao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84728?format=json", "institution": "Ohio State University"}, {"id": 130482, "fullname": "Cheng-Hao Kuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/130482?format=json", "institution": "Amazon"}], "abstract": "Model stitching, connecting early layers of one model (source) to later layers of another (target) via a light stitchlayer, has served as a probe of representational compatibility. Prior work finds that models trained on the same datasetremain stitchable (negligible accuracy drop) despite different initializations or objectives. We revisit stitching for Vision Foundation Models (VFMs) that vary in objectives, data, and modality mix (e.g., CLIP, DINOv2, SigLIP 2) and ask: Are heterogeneous VFMs stitchable? We introduce a systematic protocol spanning the stitch points, stitch layer families, training losses, and downstream tasks. Three findings emerge. (1) Stitch layer training matters: conventional approaches that match the intermediate features at the stitch point or optimize the task loss end-to-end struggle to retain accuracy, especially at shallow stitch points. (2) With a simple feature-matching loss at the target model\u2019s penultimate layer, heterogeneous VFMs become reliably stitchable across vision tasks. (3) For deep stitch points, the stitched model can surpass either constituent model at only a small inference overhead (for the stitch layer). Building on these findings, we further propose the VFM Stitch Tree (VST), which shares early layers across VFMs while retaining their later layers, yielding a controllable accuracy-latency trade-off for multimodal LLMs that often leverage multiple VFMs. Taken together, our study elevates stitching from a diagnostic probe to a practical recipe for integrating complementary VFM strengths and pinpointing where their representations align or diverge.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38586", "url": null, "sourceid": 46466, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39374, "uid": "6f39194d3df14d057e8ba796fcae6942", "name": "Attribute-Preserving Pseudo-Labeling for Diffusion-Based Face Swapping", "authors": [{"id": 159508, "fullname": "Jiwon Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/159508?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 191945, "fullname": "Yeji Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/191945?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 191946, "fullname": "JoungBin Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/191946?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 153107, "fullname": "Wooseok Jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153107?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 185599, "fullname": "Jinhyeok Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185599?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 191947, "fullname": "Taekeun Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191947?format=json", "institution": "Samsung"}, {"id": 191948, "fullname": "Yongjae Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/191948?format=json", "institution": "Samsung"}, {"id": 191949, "fullname": "Myungin Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/191949?format=json", "institution": "Ajou University; Samsung Electronics"}, {"id": 153109, "fullname": "Seungryong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/153109?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}], "abstract": "Face swapping aims to transfer the identity of a source face onto a target face while preserving target-specific attributes such as pose, expression, lighting, skin tone, and makeup. However, since real face-swapping ground truth is unavailable, achieving both accurate identity transfer and high-quality attribute preservation remains challenging. Although recent diffusion-based approaches attempt to improve visual fidelity through conditional inpainting on masked target images, the masked condition removes crucial appearance cues, resulting in plausible yet misaligned attributes due to the lack of explicit supervision. To address these limitations, we propose APPLE (Attribute-Preserving Pseudo-Labeling for Diffusion-Based Face Swapping) a diffusion-based teacher\u2013student framework that enhances attribute fidelity through attribute-aware pseudo-label supervision. First, we reformulate face swapping as a conditional deblurring task to more faithfully preserve target-specific attributes such as lighting, skin tone, and makeup. In addition, we introduce an attribute-aware inversion scheme to further improve detailed attribute preservation. Through an elaborate attribute-preserving design for teacher learning, APPLE produces high-quality pseudo triplets that explicitly provide the student with direct face-swapping supervision. Overall, APPLE achieves state-of-the-art performance in terms of attribute preservation and identity transfer, producing more photorealistic and target-faithful results. Code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39374", "url": null, "sourceid": 31037, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36452, "uid": "33c7c478e181b849b1a65eef4ba8d414", "name": "GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport", "authors": [{"id": 107602, "fullname": "Youngju Na", "url": "http://cvpr.thecvf.com/api/miniconf/users/107602?format=json", "institution": "KAIST"}, {"id": 135036, "fullname": "Jaeseong Yun", "url": "http://cvpr.thecvf.com/api/miniconf/users/135036?format=json", "institution": "NAVER LABS"}, {"id": 185089, "fullname": "Soohyun Ryu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185089?format=json", "institution": "Naver Labs"}, {"id": 185090, "fullname": "Hyunsu Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/185090?format=json", "institution": "Naver Labs"}, {"id": 89458, "fullname": "Sung-Eui Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/89458?format=json", "institution": "KAIST"}, {"id": 155592, "fullname": "Suyong Yeon", "url": "http://cvpr.thecvf.com/api/miniconf/users/155592?format=json", "institution": "NAVER LABS"}], "abstract": "While 3D Gaussian splatting has emerged as a powerful paradigm, it fundamentally fails to model transparency such as glass panels, which are prevalent in everyday environments. The core challenge lies in decoupling the intertwined radiance contributions from transparent interfaces and the transmitted geometry observed through the glass. We present GLINT, a framework that models scene-scale transparency through explicit decomposed Gaussian representation. GLINT reconstructs the primary interface and separates outgoing radiance into reflection and transmission components according to its optical properties, enabling coherent Gaussian radiance transport. During the optimization, GLINT bootstraps transparency localization by utilizing geometry separation cues that emerge from our decomposition with the geometry and material priors from a pre-trained video relighting model. Extensive experiments demonstrate that GLINT achieves state-of-the-art performance in 3D reconstruction of complex transparent scenes.Our code will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36452", "url": null, "sourceid": 45476, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40261?format=json"], "related_events_ids": [40261]}, {"id": 37958, "uid": "075a91a991e7d46281436936d8e17bff", "name": "ReFAct: Empowering Multimodal Web Agents with Visual and Context Focusing", "authors": [{"id": 184039, "fullname": "Rui Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184039?format=json", "institution": "Alipay (Hangzhou) Digital Service Technology Co., Ltd."}, {"id": 188680, "fullname": "Shuo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188680?format=json", "institution": "Ant Group"}, {"id": 188681, "fullname": "Xiaoxuan Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188681?format=json", "institution": "Alibaba Group"}, {"id": 188682, "fullname": "Ruirui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188682?format=json", "institution": "Alibaba Group"}, {"id": 188683, "fullname": "Yi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188683?format=json", "institution": "nanjing university"}, {"id": 188684, "fullname": "Tao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188684?format=json", "institution": null}, {"id": 188685, "fullname": "Wenhao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188685?format=json", "institution": "Alibaba Group"}, {"id": 188686, "fullname": "Yong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188686?format=json", "institution": "Alibaba Group"}], "abstract": "Multimodal Web Agents demonstrate a practically valuable capability by fusing information from diverse modalities (e.g., text and vision), retrieved iteratively from the internet, to respond to complex user queries.However, the visual modality is prone to information overload, and the noise contained within it\u2014such as irrelevant background details or complex structures\u2014can disrupt the model's attention, misdirecting its operational focus toward an erroneous path.To address the aforementioned challenge, we propose ReFAct (Reasoning, Focusing, and Acting), a novel framework that empowers the agent to actively manage its cross-modal context. This allows the agent to adjust its operational focus, thereby mitigating the impact of noise on multimodal Web Agents.Specifically, ReFAct employs a Grounding tool for active visual perception to dynamically filter information. We also design external memory-based Defocus/Refocus operations for selective information retention, further modulating information density within the multimodal context. Ultimately, this ensures the agent maintains focus during problem-solving.To evaluate and enhance agent capabilities in complex and noisy multimodal contexts, we first propose a pipeline for constructing datasets with flexible complexity. We introduce a new open-source benchmark: GroundedVQA. Finally, we experimentally demonstrate the effectiveness of our proposed method on GroundedVQA and other widely-used benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37958", "url": null, "sourceid": 37067, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39278, "uid": "4100a0586c1f755aa468aeb5f28df335", "name": "VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking", "authors": [{"id": 169037, "fullname": "Jingyang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/169037?format=json", "institution": "University of Rochester"}, {"id": 139388, "fullname": "Jialian Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/139388?format=json", "institution": "Advanced Micro Devices"}, {"id": 152509, "fullname": "Jiang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152509?format=json", "institution": "Advanced Micro Devices"}, {"id": 131723, "fullname": "Ximeng Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/131723?format=json", "institution": "Boston University"}, {"id": 191753, "fullname": "Ze Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191753?format=json", "institution": "Luma AI"}, {"id": 191754, "fullname": "Xiaodong Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191754?format=json", "institution": "Snowflake"}, {"id": 85765, "fullname": "Jiebo Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/85765?format=json", "institution": "University of Rochester"}, {"id": 126358, "fullname": "Zicheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126358?format=json", "institution": "Microsoft"}, {"id": 149601, "fullname": "Emad Barsoum", "url": "http://cvpr.thecvf.com/api/miniconf/users/149601?format=json", "institution": "AMD"}], "abstract": "Video agentic models have substantially advanced video-language understanding performance. However, most agentic approaches heavily rely on greedy parsing over densely sampled video frames, resulting in high computational cost. Instead, we argue that leveraging the logical flow of videos allows models to use far fewer frames while maintaining, or even improving, their video understanding capability. In this paper, we introduce VideoSeek, a long-horizon video agent that actively seeks informative content via tool use, conditioned on the underlying logic flows throughout videos. Specifically, the VideoSeek agent follows a think\u2013act\u2013observe loop: it reasons over collected evidence to determine a tool-using plan, then acts by calling tools to gather new observations, and stops once it is sufficient to answer the given question. Experiments on four long-form video understanding and complex reasoning benchmarks demonstrate the superiority of VideoSeek. Notably, VideoSeek achieves a 10.2 absolute points improvement on LVBench over its base model, GPT-5, while using 93% fewer frames. Futher, a comprehensive analysis highlights the significance of leveraging logic flow, strong reasoning capability, and toolkit design for video agents.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39278", "url": null, "sourceid": 33422, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39666, "uid": "4c4aa2205676a162ba900d37ea48e67d", "name": "SPE-MVS: Spatial Position Encoding Enhanced Multi-View Stereo with Monocular Depth Priors", "authors": [{"id": 192598, "fullname": "Shaoqian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192598?format=json", "institution": "North China Electric Power University (Baoding)"}, {"id": 192599, "fullname": "Jiadai Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/192599?format=json", "institution": "Baidu"}, {"id": 174657, "fullname": "Bosen Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/174657?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 192600, "fullname": "Qiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192600?format=json", "institution": "NCEPU"}, {"id": 73930, "fullname": "Bin Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/73930?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 76063, "fullname": "Bo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/76063?format=json", "institution": "Northwestern Polytechnical University Xi&#x27;an"}, {"id": 192601, "fullname": "Bin Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192601?format=json", "institution": "NCEPU"}, {"id": 87079, "fullname": "Yuchao Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/87079?format=json", "institution": "Northwestern Polytechnical University"}], "abstract": "Learning-based Multi-View Stereo (MVS) methods have become the mainstream in the field, relying on the construction of cost volumes through multi-view feature similarity computation and regularization. However, existing methods depend heavily on photometric consistency across views, leading to poor performance in challenging regions, such as weakly textured or non-Lambertian surfaces. To overcome this limitation, we propose SPE-MVS, a novel MVS framework enhanced with Spatial Position Encoding (SPE). The SPE represents the 3D positional information of pixels in each image within a unified metric space, constructed using monocular depth priors. We integrate the SPE alongside image data as input and introduce a Photometric-Spatial Hybrid Feature Extractor, along with an SPE-enhanced cost volume construction module. These components incorporate spatial position-based similarity computation, substantially improving robustness in challenging areas. Furthermore, we propose a Monocular Depth-guided Enhancement (MDGE) module that enhances depth probability distributions using monocular depth priors, thereby further boosting the depth estimation performance. Extensive experiments demonstrate that our method significantly improves reconstruction quality in difficult regions and achieves state-of-the-art (\\textit{SOTA}) performance on multiple benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39666", "url": null, "sourceid": 31238, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39940, "uid": "1910192e62b0e66ee2c4b827aff0fca6", "name": "GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks", "authors": [{"id": 139519, "fullname": "Saelyne Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/139519?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 193158, "fullname": "Jaesang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193158?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 136475, "fullname": "Yi-Hao Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/136475?format=json", "institution": "Carnegie Mellon University"}, {"id": 88436, "fullname": "Kevin Qinghong Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/88436?format=json", "institution": "national university of singaore, National University of Singapore"}, {"id": 76828, "fullname": "Jae Won Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/76828?format=json", "institution": "Korea Advanced Institute of Science and Technology"}, {"id": 69927, "fullname": "Yale Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/69927?format=json", "institution": "Google"}, {"id": 193159, "fullname": "Juho Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/193159?format=json", "institution": "Korea Advanced Institute of Science and Technology"}], "abstract": "Graphical User Interface (GUI) agents have the potential to assist users in interacting with complex software. While prior research has primarily focused on automating user actions through clicks and keystrokes, this paradigm overlooks human intention, where users value the ability to explore, iterate, and refine their ideas while maintaining agency.To move beyond automation and toward collaboration, GUI agents must understand what users are doing and why. We introduce GUIDE (GUI Understanding, Intent, and Help Decision Evaluation), a benchmark that evaluates AI models on their ability to perceive user behavior, infer intent, and provide assistance in open-ended GUI tasks. GUIDE consists of 67.5 hours of screen recordings from 120 novice user demonstrations with think-aloud narrations that surface user intent, across 10 complex software (e.g., PowerPoint, Photoshop). GUIDE defines three tasks\u2014(i) Behavior State Detection, (ii) Intent Prediction, and (iii) Help Prediction that test a model\u2019s ability to recognize behavior state, reason about goals, and decide when and how to help. Evaluations across eight state-of-the-art multimodal models reveal that all models struggled with the tasks, achieving only 44.6\\% and 55.0\\% accuracy on behavior state and help prediction. However, providing user context such as behavioral state and intent significantly improved the performance, raising help prediction by up to 50.2\\%. These results highlight the critical role of structured user understanding in effective assistance.Our benchmark provides a path toward GUI agents that go beyond automation to become truly user-aware collaborators.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39940", "url": null, "sourceid": 35968, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36529, "uid": "ead316e44cc96ef5f2397ca7cceaba82", "name": "GVIS: Generative Vector Image Steganography", "authors": [{"id": 185286, "fullname": "ZiHao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185286?format=json", "institution": "ChangChun University"}, {"id": 185287, "fullname": "Dawei xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185287?format=json", "institution": "Beijing Institute of Technology"}, {"id": 185288, "fullname": "Zihan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185288?format=json", "institution": "Beijing Institute of Technology"}, {"id": 185289, "fullname": "Xixi Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185289?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 185290, "fullname": "Chuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185290?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "Vector images have attracted increasing attention in the field of information hiding in recent years due to their scalability and structural properties. However, existing steganographic methods for vector images often introduce noticeable modifications to the files themselves, resulting in potential security risks and limited embedding capacity. Motivated by recent advances in diffusion models and image generative steganography, we propose GVIS, a novel \\textbf{G}enerative \\textbf{V}ector \\textbf{I}mage \\textbf{S}teganography framework. GVIS deterministically generates bitmap images using diffusion models, which are subsequently vectorized into scalable vector images. On the sender side, we design a lightweight overlap detection algorithm to identify B\u00e9zier curve control points suitable for data embedding, which enables the secret information to be encoded into the polar coordinate parameters of these control points. Then, the receiver can use the pre-shared conditional inputs to reconstruct the generation process and accurate message extraction by vector difference. Extensive theoretical analysis and experimental results demonstrate that GVIS achieves high-capacity, high-accuracy, secure, and training-free steganography. To the best of our knowledge, this is the first work to introduce generative model into the domain of vector image steganography.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36529", "url": null, "sourceid": 32546, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36846, "uid": "b17817e6bd62910a6e9016c9a58ee9bb", "name": "PaQ-DETR: Learning Pattern and Quality-Aware Dynamic Queries for Object Detection", "authors": [{"id": 182874, "fullname": "Zhengjian Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182874?format=json", "institution": "New York University"}, {"id": 186012, "fullname": "Jun Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186012?format=json", "institution": "Boise State University"}, {"id": 186013, "fullname": "Kangtong Mo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186013?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 186014, "fullname": "Qi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186014?format=json", "institution": "University of California, Irvine"}, {"id": 186015, "fullname": "Rui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186015?format=json", "institution": "Illinois Institute of Technology"}, {"id": 182977, "fullname": "Ye Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182977?format=json", "institution": "University Of Pittsburgh"}], "abstract": "Detection Transformer (DETR) has redefined object detection by casting it as a set prediction task within an end-to-end framework. Despite its elegance, DETR and its variants still rely on fixed learnable queries and suffer from severe query utilization imbalance, which limits adaptability and leaves the model capacity underused. We propose PaQ-DETR (Pattern and Quality-Aware DETR), a unified framework that enhances both query adaptivity and supervision balance. It learns a compact set of shared latent patterns capturing global semantics and dynamically generates image-specific queries through content-conditioned weighting. In parallel, a quality-aware one-to-many assignment strategy adaptively selects positive samples based on localization\u2013classification consistency, enriching supervision and promoting balanced query optimization. Experiments on COCO, CityScapes, and other benchmarks show consistent gains of 1.5\\%\u20134.2\\% mAP across DETR backbones, including ResNet and Swin-Transformer. Beyond accuracy improvement, our method provides interpretable insights into how dynamic patterns cluster semantically across object categories.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36846", "url": null, "sourceid": 32418, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38086, "uid": "db6812fe6a7155525f318274bdf46c4c", "name": "MSPT: Efficient Large-Scale Physical Modeling via Parallelized Multi-Scale Attention", "authors": [{"id": 181244, "fullname": "Pedro Curvo", "url": "http://cvpr.thecvf.com/api/miniconf/users/181244?format=json", "institution": "University of Amsterdam"}, {"id": 189031, "fullname": "Jan-Willem van de Meent", "url": "http://cvpr.thecvf.com/api/miniconf/users/189031?format=json", "institution": "University of Amsterdam"}, {"id": 189032, "fullname": "Maksim Zhdanov", "url": "http://cvpr.thecvf.com/api/miniconf/users/189032?format=json", "institution": "University of Amsterdam"}], "abstract": "A key scalability challenge in neural solvers for industrial-scale physics simulations is efficiently capturing both fine-grained local interactions and long-range global dependencies across millions of spatial elements. We introduce the Multi-Scale Patch Transformer (MSPT), an architecture that combines local point attention within patches with global attention to coarse patch-level representations. To partition the input domain into spatially-coherent patches, we employ ball trees, which handle irregular geometries efficiently. This dual-scale design enables MSPT to scale to millions of points on a single GPU. We validate MSPT on standard PDE benchmarks (elasticity, plasticity, fluid dynamics, porous flow) and large-scale aerodynamic datasets (ShapeNet-Car, Ahmed-ML), achieving state-of-the-art accuracy with substantially lower memory footprint and computational cost.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38086", "url": null, "sourceid": 43173, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39306, "uid": "5301bb35a65ac7aef3ba3f102d07f770", "name": "Stability-Driven Motion Generation for Object-Guided Human-Human Co-Manipulation", "authors": [{"id": 181481, "fullname": "Jiahao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181481?format=json", "institution": "Tianjin University"}, {"id": 191809, "fullname": "Xiaohan Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191809?format=json", "institution": "National University of Singapore"}, {"id": 191810, "fullname": "Xingchen Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191810?format=json", "institution": "Donghua University, Shanghai"}, {"id": 126752, "fullname": "Chongyang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126752?format=json", "institution": "Sichuan University"}, {"id": 77013, "fullname": "Kun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/77013?format=json", "institution": "Tianjin University"}, {"id": 120005, "fullname": "Buzhen Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/120005?format=json", "institution": "Southeast University"}], "abstract": "Co-manipulation requires multiple humans to synchronize their motions with a shared object while ensuring reasonable interactions, maintaining natural poses, and preserving stable states. However, most existing motion generation approaches are designed for single-character scenarios or fail to account for payload-induced dynamics. In this work, we propose a flow-matching framework that ensures the generated co-manipulation motions align with the intended goals while maintaining naturalness and effectiveness. Specifically, we first introduce a generative model that derives explicit manipulation strategies from the object\u2019s affordance and spatial configuration, which guide the motion flow toward successful manipulation. To improve motion quality, we then design an adversarial interaction prior that promotes natural individual poses and realistic inter-person interactions during co-manipulation. In addition, we also incorporate a stability-driven simulation into the flow matching process, which refines unstable interaction states through sampling-based optimization and directly adjusts the vector field regression to promote more effective manipulation. The experimental results demonstrate that our method achieves higher contact accuracy, lower penetration, and better distributional fidelity compared to state-of-the-art human-object interaction baselines. The code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39306", "url": null, "sourceid": 40935, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36462, "uid": "74003f585934ff31d164241113dc5a24", "name": "Let Your Image Move with Your Motion! -- Implicit Multi-Object Multi-Motion Transfer", "authors": [{"id": 181917, "fullname": "Li Yuze", "url": "http://cvpr.thecvf.com/api/miniconf/users/181917?format=json", "institution": "Tianjin University"}, {"id": 88361, "fullname": "Dong Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/88361?format=json", "institution": "University of New South Wales"}, {"id": 185109, "fullname": "Xiao Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185109?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 185110, "fullname": "Junchao Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185110?format=json", "institution": "Tianjin University"}, {"id": 185111, "fullname": "Dongsheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185111?format=json", "institution": "Tianjin University"}, {"id": 156172, "fullname": "Lei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/156172?format=json", "institution": "Hainan University"}, {"id": 185112, "fullname": "Yun Sing Koh", "url": "http://cvpr.thecvf.com/api/miniconf/users/185112?format=json", "institution": "University of Auckland"}, {"id": 156173, "fullname": "Cheng Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/156173?format=json", "institution": "Tianjin University"}, {"id": 128184, "fullname": "Xinyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128184?format=json", "institution": "The University of Adelaide"}], "abstract": "Motion transfer has emerged as a promising direction for controllable video generation, yet existing methods largely focus on single-object scenarios and struggle when multiple objects require distinct motion patterns. In this work, we present FlexiMMT, the first implicit image-to-video (I2V) motion transfer framework that explicitly enables multi-object, multi-motion transfer. Given a static multi-object image and multiple reference videos, FlexiMMT independently extracts motion representations and accurately assigns them to different objects, supporting flexible recombination and arbitrary motion-to-object mappings. To address the core challenge of cross-object motion entanglement, we introduce a Motion Decoupled Mask Attention Mechanism that uses object-specific masks to constrain attention, ensuring that motion and text tokens only influence their designated regions. We further propose a Differentiated Mask Propagation Mechanism that derives object-specific masks directly from diffusion attention and progressively propagates them across frames efficiently. Extensive experiments demonstrate that FlexiMMT achieves precise, compositional, and state-of-the-art performance in I2V-based multi-object multi-motion transfer. All code, models, and benchmarks will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36462", "url": null, "sourceid": 43205, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38362, "uid": "874121fee1d3bebea38815d169d47848", "name": "FMPose: 3D Pose Estimation via Flow Matching", "authors": [{"id": 183935, "fullname": "Ti Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183935?format=json", "institution": "EPFL"}, {"id": 189724, "fullname": "Xiaohang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189724?format=json", "institution": "EPFL - EPF Lausanne"}, {"id": 189725, "fullname": "Mackenzie Mathis", "url": "http://cvpr.thecvf.com/api/miniconf/users/189725?format=json", "institution": "Swiss Federal Institute of Technology Lausanne"}], "abstract": "Monocular 3D pose estimation is fundamentally ill-posed due to depth ambiguity and occlusions, thereby motivating probabilistic methods that generate multiple plausible 3D pose hypotheses.In particular, diffusion-based models have demonstrated strong performance, but their iterative denoising process typically requires many time steps for each prediction, making inference computationally expensive.In contrast, Flow Matching (FM) learns an ODE-based velocity field, enabling efficient generation of 3D pose samples with only a few integration steps. Inspired by this capability, we propose a novel generative pose estimation framework, FMPose, that formulates 3D pose estimation as a conditional distribution transport problem. It continuously transports samples from a standard Gaussian prior to the distribution of plausible 3D poses conditioned on 2D inputs. While the ODE trajectories are deterministic, FMPose naturally generates diverse pose hypotheses by sampling different noise seeds.To obtain a single accurate prediction from those hypotheses, we further introduce a Reprojection-based Posterior Expectation Aggregation (RPEA) module, which approximates the Bayesian posterior expectation over 3D hypotheses. FMPose surpasses existing methods on the widely used human pose estimation benchmarks Human3.6M and MPI-INF-3DHP, and further achieves state-of-the-art performance on the 3D animal pose datasets Animal3D and CtrlAni3D, demonstrating strong performance across both human and animal 3D pose domains.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38362", "url": null, "sourceid": 30694, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40253, "uid": "5c3a444b864621608460b0e5589556ab", "name": "What Matters in Practical Learned Image Compression", "authors": [{"id": 193888, "fullname": "Kedar Tatwawadi", "url": "http://cvpr.thecvf.com/api/miniconf/users/193888?format=json", "institution": "Apple"}, {"id": 176289, "fullname": "Parisa Rahimzadeh", "url": "http://cvpr.thecvf.com/api/miniconf/users/176289?format=json", "institution": "Apple"}, {"id": 76117, "fullname": "Zhanghao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/76117?format=json", "institution": "Stanford University"}, {"id": 193889, "fullname": "Zhiqi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193889?format=json", "institution": "Apple"}, {"id": 87027, "fullname": "Ziyun Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87027?format=json", "institution": "Duke University"}, {"id": 193890, "fullname": "Sanjay Nair", "url": "http://cvpr.thecvf.com/api/miniconf/users/193890?format=json", "institution": null}, {"id": 193891, "fullname": "Divija Hasteer", "url": "http://cvpr.thecvf.com/api/miniconf/users/193891?format=json", "institution": "Apple"}, {"id": 94821, "fullname": "Oren Rippel", "url": "http://cvpr.thecvf.com/api/miniconf/users/94821?format=json", "institution": "Apple"}], "abstract": "One of the major differentiators unlocked by learned codecs relative to their hard-coded traditional counterparts is their ability to be optimized directly to appeal to the human visual system. Despite this potential, a perceptual yet practical image codec is yet to be proposed.In this work, we aim to close this gap. We conduct a comprehensive study of the key modeling choices that govern the design of a practical learned image codec, jointly optimized for perceptual quality and runtime \u2014 including within the ablations several novel techniques. We then perform performance-aware neural architecture search over millions of backbone configurations to identify models that achieve the target on-device runtime while maximizing compression performance as captured by perceptual metrics. We combine the various optimizations to construct a new codec that achieves a significantly improved tradeoff between speed and perceptual quality. Based on rigorous subjective user studies, it provides 2.3-3\u00d7 bitrate savings against AV1, AV2, VVC, ECM and JPEG-AI, and 20-40% bitrate savings against the best learned codec alternatives. At the same time, on an iPhone 17 Pro Max, it encodes 12MP images as fast as 230ms, and decodes them in 150ms \u2014 faster than most top ML-based codecs run on a V100 GPU.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40253", "url": null, "sourceid": 40787, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38219, "uid": "a5676171b00e004f5956b7933de18b0d", "name": "QuCNet: Quantum Deep Learning Driven Multi-Circuit Network for Remote Sensing Image Classification", "authors": [{"id": 189351, "fullname": "Komal Komal", "url": "http://cvpr.thecvf.com/api/miniconf/users/189351?format=json", "institution": "IIT Ropar"}, {"id": 189352, "fullname": "Mukul Gupta", "url": "http://cvpr.thecvf.com/api/miniconf/users/189352?format=json", "institution": null}, {"id": 189353, "fullname": "Saumya Singh", "url": "http://cvpr.thecvf.com/api/miniconf/users/189353?format=json", "institution": null}, {"id": 181845, "fullname": "SANTOSH VIPPARTHI", "url": "http://cvpr.thecvf.com/api/miniconf/users/181845?format=json", "institution": "INDIAN INSTITUTE OF TECHNOLOGY ROPAR"}, {"id": 189354, "fullname": "Chakradhar Reddy Chandupatla", "url": "http://cvpr.thecvf.com/api/miniconf/users/189354?format=json", "institution": "Indian Institute Of Technology\u2013Ropar (IIT\u2013Ropar)"}, {"id": 189355, "fullname": "Subrahmanyam Murala", "url": "http://cvpr.thecvf.com/api/miniconf/users/189355?format=json", "institution": "Trinity College Dublin, Ireland"}], "abstract": "We present QuCNet, a hybrid quantum classical network for efficient remote sensing image classification. QuCNet integrates a lightweight convolutional encoder with sixteen parallel four-qubit trainable quantum circuits (TQCs) trained under a Hybrid Cyclic Weight-Sharing (HCWS)} strategy. This design enhances expressibility while keeping the parameter count extremely low ~87K, 85\u00d7 smaller than prior hybrid models). Guided by expressibility analysis, the proposed quantum configuration maintains stable gradients and mitigates barren plateaus on near term quantum devices. Extensive experiments across seven remote sensing benchmarks (AID, AIDER, UC Merced, NWPU-45, EuroSAT, IIITDMJ Smoke, and USTC SmokeRS) demonstrate that QuCNet consistently improves accuracy and generalization over classical CNN baselines. Furthermore, hardware only inference on IBM Quantum processors (ibm\\_torino, ibm\\_fez) confirms robustness under realistic noise and connectivity constraints. These results suggest a practical path toward \\textbf{scalable, hardware feasible quantum deep learning} for geospatial applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38219", "url": null, "sourceid": 42409, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38605, "uid": "5732c8f48b699d91047d62ad16125a72", "name": "Saliency-Driven Token Merging for Vision Transformers", "authors": [{"id": 76627, "fullname": "Weiying Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/76627?format=json", "institution": "Xidian University"}, {"id": 190273, "fullname": "Xiaoyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190273?format=json", "institution": "Xidian University"}, {"id": 190274, "fullname": "Xin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190274?format=json", "institution": "Xidian University"}, {"id": 150495, "fullname": "Chenhe Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/150495?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 190275, "fullname": "Jitao Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/190275?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 73148, "fullname": "Yunsong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/73148?format=json", "institution": "Xidian University"}, {"id": 77219, "fullname": "Leyuan Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77219?format=json", "institution": "Hunan University"}], "abstract": "Vision Transformers (ViTs) exhibit robust performance across diverse visual scenarios. However, their efficiency is constrained by excessive token counts. Token merging offers a viable solution for achieving efficient ViTs. Existing methods merge tokens based solely on specific characteristics within the attention mechanism, which changes significantly across different layers. In this paper, we propose a novel training-free SAliency-Driven Token Merging (SAD-TM) approach by leveraging not only the semantic relevance in the attention space but also the latent visual saliency of input patches. Our SAD-TM is inspired by the discovery that saliency-based statistics can directly capture the causal relationship between model input and output, regardless of the layers. Based on the observation, we develop a method that is mathematically formulated to merge tokens with high saliency outliers. The principle behind our merging is that tokens with high saliency outliers usually imply inconsistencies with the global gradient direction, and thus can be merged safely. Besides, our systematic analysis indicates that class attention shows considerable variation across early blocks, so a deferred merging strategy is introduced to optimize the selection of merging rates. In a training-free manner, SAD-TM demonstrates superior performance across various ViT architectures. Especially, with a FLOPs compression of 23.08\\% on DeiT-Tiny, SAD-TM achieves a Top-1 Accuracy comparable to that of the pretrained baseline on ImageNet dataset. The code will be available soon.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38605", "url": null, "sourceid": 31854, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39847, "uid": "de356af39a1d1dc704a0978744396217", "name": "AR\u00b2-4FV: Anchored Referring and Re-identification for Long-Term Grounding in Fixed-View Videos", "authors": [{"id": 173219, "fullname": "Teng Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/173219?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 192965, "fullname": "Yihan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192965?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 192966, "fullname": "Jiongxu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192966?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 192967, "fullname": "Teng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192967?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 192968, "fullname": "Jiaqi LI", "url": "http://cvpr.thecvf.com/api/miniconf/users/192968?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 192969, "fullname": "Bingzhuo Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/192969?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}], "abstract": "Long-term language-guided referring in fixed-view videos is challenging: the referent may be occluded or leave the scene for long intervals and later re-enter, while framewise referring pipelines drift as re-identification (ReID) becomes unreliable. AR\u00b2-4FV leverages background stability for long-term referring. An offline Anchor Bank is distilled from static background structures; at inference, the text query is aligned with this bank to produce an Anchor Map that serves as persistent semantic memory when the referent is absent. An anchor-based re-entry prior accelerates re-capture upon return, and a lightweight ReID-Gating mechanism maintains identity continuity using displacement cues in the anchor frame. The system predicts per-frame bounding boxes without assuming the target is visible in the first frame or explicitly modeling appearance variations. AR\u00b2-4FV achieves +10.3% Re-Capture Rate (RCR) improvement and \u201324.2% Re-Capture Latency (RCL) reduction over the best baseline, and ablation studies further confirm the benefits of the Anchor Map, re-entry prior, and ReID-Gating.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39847", "url": null, "sourceid": 40567, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39173, "uid": "c7204958ab2a72430f7c9c8bbea79be0", "name": "GeoCoT: Towards Reliable Remote Sensing Reasoning with Manifold Perspective", "authors": [{"id": 158769, "fullname": "Daixun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/158769?format=json", "institution": "State Key Laboratory of Integrated Services Networks"}, {"id": 191504, "fullname": "Zirui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191504?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 191505, "fullname": "Sibo He", "url": "http://cvpr.thecvf.com/api/miniconf/users/191505?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 191506, "fullname": "Jiayun Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/191506?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 191507, "fullname": "Mingxiang Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191507?format=json", "institution": "State Key Laboratory of Integrated Services Networks"}, {"id": 76627, "fullname": "Weiying Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/76627?format=json", "institution": "Xidian University"}, {"id": 187579, "fullname": "Yunke Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187579?format=json", "institution": "University of Sydney"}, {"id": 190274, "fullname": "Xin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190274?format=json", "institution": "Xidian University"}, {"id": 191508, "fullname": "Yusi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191508?format=json", "institution": "Xidian University of Electronic Science and Technology"}, {"id": 73148, "fullname": "Yunsong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/73148?format=json", "institution": "Xidian University"}, {"id": 87489, "fullname": "Chang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87489?format=json", "institution": "University of Sydney"}, {"id": 77219, "fullname": "Leyuan Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77219?format=json", "institution": "Hunan University"}], "abstract": "Multimodal Large Language Models (MLLMs) have shown strong potential in remote sensing (RS) through multi-task reasoning and cross-modal generalization.However, existing RS-MLLMs mainly rely on a single shared expert for all tasks, making it hard to produce reliable results. Meanwhile, the intrinsic redundancy and homogeneity of RS images bring substantial difficulties for both training and inference. These challenges directly conflict with the demands of remote sensing, which values task precision and trustworthy reasoning.To address these limitations, we propose GeoCoT, a manifold-driven mixture-of-experts (MoE) system with Chain-of-Thought (CoT) reasoning. GeoCoT introduces Mani-MoE, a sparse expert architecture grounded in local manifold mapping. It projects high-dimensional tokens onto low-rank subspaces adaptively to eliminate redundancy and uncover intrinsic structure, and then routes them through a sparse expert pathway, where gating decisions are guided by the manifold structure of the input.To optimize this architecture, we adopt a CoT-driven multi-stage training strategy. It leverages a cold-start phase for domain adaptation, followed by our RS Vision Group Relative Policy Optimization (RSV-GRPO) to systematically strengthen structured reasoning from global to objectives. Furthermore, we innovatively build *RS-CoT-20k* dataset for task-specific supervision.Extensive experiments on multi-task datasets demonstrate that GeoCoT outperforms prior approaches, achieving $5.27 \\\\%$ higher average accuracy than the state-of-the-art method. Our code will be available.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39173", "url": null, "sourceid": 37163, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38302, "uid": "ac3e8c12239b89993979761d511af70f", "name": "CC-VQA: Conflict- and Correlatoin-Aware Method for Mitigating Knowledge Conflict in Knowledge-Based Visual Question Answering", "authors": [{"id": 173778, "fullname": "Yuyang Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/173778?format=json", "institution": "Institute of Automation,CAS"}, {"id": 152480, "fullname": "Jiaqi Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152480?format=json", "institution": "Alibaba Group"}, {"id": 152482, "fullname": "Yujing Lou", "url": "http://cvpr.thecvf.com/api/miniconf/users/152482?format=json", "institution": "Alibaba Group"}, {"id": 152481, "fullname": "Lubin Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/152481?format=json", "institution": "Alibaba Cloud"}, {"id": 130742, "fullname": "Qi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130742?format=json", "institution": "School of Artificial Intelligence, University of Chinese Academy of Sciences."}, {"id": 149633, "fullname": "Ying Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/149633?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences; Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 189545, "fullname": "Kun Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/189545?format=json", "institution": "Institute of Automation"}, {"id": 126248, "fullname": "Yue Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126248?format=json", "institution": "Alibaba Group"}, {"id": 89642, "fullname": "Shiming Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89642?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 88210, "fullname": "Jieping Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/88210?format=json", "institution": "Alibaba Group"}], "abstract": "Knowledge-based visual question answering (KB-VQA) demonstrates significant potential for handling knowledge-intensive tasks. However, conflicts arise between static parametric knowledge in vision language models (VLMs) and dynamically retrieved information due to the static model knowledge from pre-training. The outputs either ignore retrieved contexts or exhibit inconsistent integration with parametric knowledge, posing substantial challenges for KB-VQA. Current knowledge conflict mitigation methods primarily adapted from language-based approaches, focusing on context-level conflicts through engineered prompting strategies or context-aware decoding mechanisms. However, these methods neglect the critical role of visual information in conflicts and suffer from redundant retrieved contexts, which impair accurate conflict identification and effective mitigation. To address these limitations, we propose CC-VQA: a novel training-free, conflict- and correlation-aware method for KB-VQA. Our method comprises two core components: (1) Vision-Centric Contextual Conflict Reasoning, which performs visual-semantic conflict analysis across internal and external knowledge contexts; and (2) Correlation-Guided Encoding and Decoding, featuring positional encoding compression for low-correlation statements and adaptive decoding using correlation-weighted conflict scoring. Extensive evaluations on E-VQA, InfoSeek, and OK-VQA benchmarks demonstrate that CC-VQA achieves state-of-the-art performance, yielding absolute accuracy improvements of 3.3\\% to 6.4\\% compared to existing methods. Code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38302", "url": null, "sourceid": 38494, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37453, "uid": "7072a5c26d796e02dc28cb861210abba", "name": "Simple-ViLMedSAM: Simple Text Prompts Meet Vision-Language Models for Medical Image Segmentation", "authors": [{"id": 159592, "fullname": "Chengcan Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/159592?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 158781, "fullname": "Dong Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/158781?format=json", "institution": "Meta Inc."}, {"id": 187484, "fullname": "Geng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187484?format=json", "institution": "Northwest Polytechnical University"}, {"id": 87100, "fullname": "Daoqiang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87100?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 187485, "fullname": "Xuyun Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187485?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}], "abstract": "Medical image segmentation is challenging due to limited annotated data, high labeling costs, and substantial image heterogeneity. Although large-scale vision foundation models (e.g., SAM) have shown great potential in this field, existing SAM-based methods typically rely on expert-defined geometric prompts or complex clinical text prompts, which limits their generalizability across diverse medical image segmentation tasks. To overcome these challenges, we propose Simple-ViLMedSAM, a CLIP-SAM integration framework that enables high-accuracy segmentation in zero-shot and few-shot settings using only simple text queries, that is, using only basic anatomical or disease-related text labels. At its core is an Implicit Pos-Prompter (IPP), which generates attribution maps containing implicit positional cues to replace traditional geometric prompts. IPP incorporates a multi-modal information bottleneck and an affinity-based refinement strategy to ensure high-quality guidance from CLIP-SAM interactions. To further enhance segmentation, we introduce a Bidirectional Interaction Decoder (BID) that employs bidirectional cross-attention to align IPP\u2019s positional maps with SAM's pixel-level features. By jointly modeling global semantics and local details, BID significantly improves segmentation accuracy. Extensive experiments on four public datasets demonstrate that Simple-ViLMedSAM consistently outperforms existing methods in both zero-shot and few-shot medical image segmentation tasks, using only simple text queries. The code will be publicly available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37453", "url": null, "sourceid": 42316, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37594, "uid": "7dfca50f1976e6b46fb99619513b9215", "name": "Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models", "authors": [{"id": 152742, "fullname": "Jinlong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152742?format=json", "institution": "University of Trento"}, {"id": 187783, "fullname": "Liyuan Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187783?format=json", "institution": ""}, {"id": 105779, "fullname": "Haonan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/105779?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 75973, "fullname": "Nicu Sebe", "url": "http://cvpr.thecvf.com/api/miniconf/users/75973?format=json", "institution": "University of Trento"}], "abstract": "Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. Existing pruning primary targets intra-frame spatial redundancy or prunes inside the LLM with shallow-layer overhead, yielding suboptimal spatiotemporal reduction and underutilizing long-context compressibility. All of them often discard subtle yet informative context from merged or pruned tokens. In this paper, we propose a new perspective that elaborates token Anchors within intra-frame and inter-frame to comprehensively aggregate the informative contexts via local-global Optimal Transport (AOT). Specifically, we first establish local- and global-aware token anchors within each frame under the attention guidance, which then optimal transport aggregates the informative contexts from pruned tokens, constructing intra-frame token anchors. Then, building on the temporal frame clips, the first frame within each clip will be considered as the keyframe anchors to ensemble similar information from consecutive frames through optimal transport, while keeping distinct tokens to represent temporal dynamics, leading to efficient token reduction in a training-free manner. Extensive evaluations show that our proposed AOT obtains competitive performances across various short- and long-video benchmarks on leading video LLMs, obtaining substantial computational efficiency while preserving temporal and visual fidelity.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37594", "url": "https://tyroneli.github.io/AOT/", "sourceid": 45667, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40224, "uid": "155722af3900b91a17eeb5a3c987defe", "name": "Enhancing Unregistered Hyperspectral Image Super-Resolution via Unmixing-based Abundance Fusion Learning", "authors": [{"id": 193820, "fullname": "Yingkai Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193820?format=json", "institution": "Beijing Institute of Technology"}, {"id": 158835, "fullname": "Tao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158835?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 177117, "fullname": "Jing Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/177117?format=json", "institution": "Beijing Institute of Technology"}, {"id": 73966, "fullname": "Ying Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73966?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "Unregistered hyperspectral image (HSI) super-resolution (SR) typically aims to enhance a low-resolution HSI using an unregistered high-resolution reference image.In this paper, we propose an unmixing-based fusion framework that decouples spatial-spectral information to simultaneously mitigate the impact of unregistered fusion and enhance the learnability of SR models.Specifically, we first utilize singular value decomposition for initial spectral unmixing, preserving the original endmembers while dedicating the subsequent network to enhancing the initial abundance map.To leverage the spatial texture of the unregistered reference, we introduce a coarse-to-fine deformable aggregation module, which first estimates a pixel-level flow and a similarity map using a coarse pyramid predictor. It further performs fine sub-pixel refinement to achieve deformable aggregation of the reference features. The aggregative features are then refined via a series of spatial-channel abundance cross-attention blocks. Furthermore, a spatial-channel modulated fusion module is presented to merge encoder-decoder features using dynamic gating weights, yielding a high-quality, high-resolution HSI. Experimental results on simulated and real datasets confirm that our proposed method achieves state-of-the-art super-resolution performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40224", "url": null, "sourceid": 41019, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37491, "uid": "cb2363691a8351ee799c9108229c75b4", "name": "Continual Learning by Reuse, New, Adapt and Skip: A Hierarchical Exploration-Exploitation Approach", "authors": [{"id": 182488, "fullname": "Chinmay Savadikar", "url": "http://cvpr.thecvf.com/api/miniconf/users/182488?format=json", "institution": "North Carolina State University"}, {"id": 187576, "fullname": "Michelle Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/187576?format=json", "institution": "Johns Hopkins University"}, {"id": 92047, "fullname": "Tianfu Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/92047?format=json", "institution": "North Carolina State University"}], "abstract": "To effectively manage the complexities of real-world dynamic environments, continual learning must incrementally acquire, update, and accumulate knowledge from a stream of tasks of different nature\u2014without suffering from catastrophic forgetting of prior knowledge. While this capability is innate to human cognition, it remains a significant challenge for modern deep learning systems.At the heart of this challenge lies *the stability-plasticity dilemma*: the need to balance leveraging prior knowledge, integrating novel information, and allocating model capacity adaptively based on task complexity and synergy.  In this paper, we propose a novel exemplar-free class-incremental continual learning (ExfCCL) framework that addresses these issues through a Hierarchical Exploration-Exploitation (HEE) approach. The core of our method is a HEE-guided efficient neural architecture search (HEE-NAS) that enables a learning-to-adapt backbone via four primitive operations - reuse, new, adapt, and skip\u2014thereby serving as an internal memory that dynamically updates selected components across streaming tasks.  To address the task ID inference problem in  ExfCCL, we exploit an external memory of task centroids proposed in the prior art. We term our method **CHEEM** (Continual Hierarchical-Exploration-Exploitation Memory). CHEEM is evaluated on the challenging MTIL and VDD benchmarks using both Tiny and Base Vision Transformers and a proposed holistic **Figure-of-Merit (FoM) metric**. It significantly outperforms state-of-the-art prompting-based continual learning methods, closely approaching full fine-tuning upper bounds. Furthermore, it learns adaptive model structures tailored to individual tasks in a semantically meaningful way.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37491", "url": null, "sourceid": 43309, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37088, "uid": "a40a236b4b73aa4df86721e02ef932e9", "name": "Fast Markov Random Field Optimisation for Topologically Noisy 3D Shape Matching", "authors": [{"id": 77453, "fullname": "Paul Roetzer", "url": "http://cvpr.thecvf.com/api/miniconf/users/77453?format=json", "institution": "University of Bonn"}, {"id": 186630, "fullname": "Anders Johan Thunberg", "url": "http://cvpr.thecvf.com/api/miniconf/users/186630?format=json", "institution": "Lund University"}, {"id": 131073, "fullname": "Zorah L\u00e4hner", "url": "http://cvpr.thecvf.com/api/miniconf/users/131073?format=json", "institution": "Rheinische Friedrich-Wilhelms Universit\u00e4t Bonn"}, {"id": 88205, "fullname": "Florian Bernard", "url": "http://cvpr.thecvf.com/api/miniconf/users/88205?format=json", "institution": "University of Bonn"}], "abstract": "In many real world applications of non-rigid shape matching, the shapes are subject to topological noise (i.e. varying genus). In this paper, we propose a novel formulation based on Markov Random Fields (MRF) that can handle these cases with topological noise. The solutions to our optimisation problem can be approximated efficiently using the alpha expansion algorithm, which gives rise to theoretical approximation guarantees. In particular, we cast non-rigid 3D shape matching as a multi-labelling problem in which each triangle of the source shape is assigned a label that represents the matching to a specific surface element on the target shape. We propose a novel pairwise term that imposes that our matching prefers solutions in which neighbouring triangles on the source shape remain close on the target shape. Further, by exploiting the specific structure of our label space, we show that the alpha expansion algorithm can be customised to gain significant speed-ups, while maintaining its approximation guarantees. We test our formalism on various shape matching datasets including settings in which shapes have topological artefacts.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37088", "url": null, "sourceid": 38525, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39706, "uid": "4a447fa0819e4b3e6dcbc7dad6c83670", "name": "Unsupervised Multi-Scale Segmentation of 3D Subcellular World with Stable Diffusion Foundation Model", "authors": [{"id": 96226, "fullname": "Mostofa Uddin Uddin", "url": "http://cvpr.thecvf.com/api/miniconf/users/96226?format=json", "institution": "Carnegie Mellon University"}, {"id": 192692, "fullname": "HM Shadman Tabib", "url": "http://cvpr.thecvf.com/api/miniconf/users/192692?format=json", "institution": "University of Illinois Urbana-Champaign"}, {"id": 192693, "fullname": "Thanh-Huy Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192693?format=json", "institution": "CMU, Carnegie Mellon University"}, {"id": 192694, "fullname": "Kashish Gandhi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192694?format=json", "institution": "Carnegie Mellon University"}, {"id": 92284, "fullname": "Min Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/92284?format=json", "institution": "Carnegie Mellon University"}], "abstract": "We introduce an unsupervised approach for segmenting multiscale subcellular objects in 3D volumetric cryo-electron tomography (cryo-ET) images. To this end, we address key challenges such as lack of annotated data, large data volumes, high heterogeneity of subcellular shapes and sizes, and high inter-domain variability of cellular cryo-ET images across different experiments and contexts. Our method requires users to only select a small number of slabs from a few representative tomograms in the dataset. The core of our method is extracting features for the corresponding slabs, leveraging a Stable Diffusion foundation model pretrained on mostly natural images. The feature extraction is followed by a novel heuristic-based feature aggregation strategy, and adaptive thresholding to segment the aggregated features. The resulting masks are refined with pretrained CellPose to split composite regions, and then utilized as pseudo-ground truth for training supervised deep learning models. We validated our unsupervised foundation-model based pipeline on publicly available cryo-ET benchmark datasets, demonstrating performance that closely approximates expert human annotations. This fully automated, data-driven framework enables the mining of multi-scale subcellular patterns, paving the way for accelerated biological discoveries from large-scale cellular cryo-ET datasets.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39706", "url": null, "sourceid": 40894, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39573, "uid": "01e651518630062d985188c1f0dbd83a", "name": "FedARA: Resource-adaptive Low-rank Personalized Federated Learning via Anchor-driven Representation Alignment on Heterogeneous Edge Devices", "authors": [{"id": 181935, "fullname": "Ruonan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/181935?format=json", "institution": "Zhengzhou University"}, {"id": 192380, "fullname": "Zheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192380?format=json", "institution": "Zhengzhou University"}, {"id": 192381, "fullname": "Debin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192381?format=json", "institution": "Zhengzhou University"}, {"id": 192382, "fullname": "shijie lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/192382?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 126581, "fullname": "Laurence Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126581?format=json", "institution": "Hainan University"}], "abstract": "Personalized Federated Learning (PFL) has gained significant attention for enabling participating clients to train customized personalized models on non-IID local data. However, current PFL methods mainly suffer from two limitations: 1) Only the personalized part supports heterogeneous design, while the shared part must remain homogeneous. 2) The semantic representations of models generated on different clients with non-IID data characteristics inevitably tend to be inconsistent, negatively impacting model performance. To conquer them, this paper proposes a novel resource-adaptive personalized Federated Learning via Anchor-driven Representation Alignment (FedARA). Concretely, we design a low-rank decomposition and reconstruction fusion scheme for shared feature extractors based on the matrix decomposition technology, where each client can autonomously set the rank value based on its locally available resources, controlling the complexity of extractors and naturally reducing communication and computational costs. Moreover, to address the inconsistency of feature spaces across clients, an anchor-driven representation consistency learning mechanism is developed, which can guide client models to learn unified feature representations and alleviate global knowledge forgetting, thereby improving personalized model performance. Extensive experimental results demonstrate that our method significantly outperforms seventeen state-of-the-art baselines in diverse heterogeneous scenarios with less communication and computational costs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39573", "url": null, "sourceid": 31888, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39380, "uid": "6d63e1eebba07ade3b0fe1982583e746", "name": "MMGait: Towards Multi-Modal Gait Recognition", "authors": [{"id": 191959, "fullname": "Chenye Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191959?format=json", "institution": "Beijing Normal University"}, {"id": 185560, "fullname": "Qingyuan Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/185560?format=json", "institution": "Beijing Normal University"}, {"id": 185561, "fullname": "Saihui Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185561?format=json", "institution": "Beijing Normal University"}, {"id": 191960, "fullname": "Aoqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191960?format=json", "institution": null}, {"id": 88386, "fullname": "Yongzhen Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88386?format=json", "institution": "Beijing Normal University"}], "abstract": "Gait recognition has emerged as a powerful biometric technique for identifying individuals at a distance without requiring user cooperation. Most existing methods focus primarily on RGB-derived modalities, which fall short in real-world scenarios requiring multi-modal collaboration and cross-modal retrieval. To overcome these challenges, we present MMGait, a comprehensive multi-modal gait benchmark integrating data from five heterogeneous sensors, including an RGB camera, a depth camera, an infrared camera, a LiDAR scanner, and a 4D Radar system. MMGait contains twelve modalities and 334,060 sequences from 725 subjects, enabling systematic exploration across geometric, photometric, and motion domains. Based on MMGait, we conduct extensive evaluations on single-modal, cross-modal, and multi-modal paradigms to analyze modality robustness and complementarity. Furthermore, we introduce a new task, Omni Multi-Modal Gait Recognition, which aims to unify the above three gait recognition paradigms within a single model. We also propose a simple yet powerful baseline, OmniGait, which learns a shared embedding space across diverse modalities and achieves promising recognition performance. The MMGait benchmark, complete codebase, and pretrained checkpoints will be publicly released upon acceptance to promote future research in multi-modal gait recognition.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39380", "url": null, "sourceid": 41587, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38533, "uid": "08faa15558a741e7cc59f718a5cc6213", "name": "DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding", "authors": [{"id": 183275, "fullname": "Hao Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/183275?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 85267, "fullname": "Yuliang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85267?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 176160, "fullname": "Xingchen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/176160?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 190077, "fullname": "Yuyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190077?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 190078, "fullname": "Minghui Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190078?format=json", "institution": "Huawei"}, {"id": 149634, "fullname": "Jihao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/149634?format=json", "institution": "\u534e\u4e3a"}, {"id": 190079, "fullname": "Wei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190079?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 85817, "fullname": "Xiang Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/85817?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Existing Multimodal Large Language Models (MLLMs) suffer from significant performance degradation on the long document understanding task as document length increases. This stems from two fundamental challenges: 1) a low Signal-to-Noise Ratio (SNR), with crucial evidence buried in irrelevant pages; and 2) supervision scarcity, as datasets offering only final short answers provide a weak learning signal. In this paper, we address these challenges by proposing a paradigm that requires the model to execute a structured ``Analysis, Localization and Reasoning'' workflow. To instill this capability, we design a two-stage training framework: we first perform Supervised Fine-Tuning on high-quality data generated via an efficient knowledge distillation strategy. Subsequently, we employ an Evidence-aware Group Relative Policy Optimization which jointly optimizes for both evidence localization and answer accuracy. Additionally, we introduce a Evidence-Guided Resolution Allocation strategy to mitigate memory constraints of training on multi-pages documents. Extensive experiments demonstrate that DocSeeker achieves superior performance on both in-domain and out-of-domain tasks. We show it robustly generalizes from short-page training to ultra-long documents and is naturally synergistic with visual Retrieval-Augmented Generation systems, serving as an ideal foundation for their implementation.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38533", "url": null, "sourceid": 32921, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37973, "uid": "402b101d0a6ddddecf2227b18c03f248", "name": "R$^2$TUA: Reconstruction-residual Based Targeted and Untargeted Attack Against Text-Image Person Re-Identification", "authors": [{"id": 180645, "fullname": "Yubo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180645?format=json", "institution": "University of Science and Technology of China"}, {"id": 87596, "fullname": "Yan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87596?format=json", "institution": "Shanghai AI lab"}, {"id": 107620, "fullname": "Bin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107620?format=json", "institution": "University of Science and Technology of China"}, {"id": 188716, "fullname": "Xulin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188716?format=json", "institution": "University of Science and Technology of China"}, {"id": 188717, "fullname": "Jixiang Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188717?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Text-Image Person Re-Identification (TI-ReID) models are widely deployed in intelligent surveillance.Built on deep neural networks and vision\u2013language models, TI-ReID models inherit their vulnerability to adversarial attacks, posing potential security risks.Yet their security issues have received far less attention than retrieval accuracy, and the robustness of TI-ReID to adversarial attacks remains largely unexplored.To fill this gap, we propose Reconstruction-residual based Targeted and Untargeted Attack (R$^2$TUA), which takes an image and an adversarial text prompt as input and generates perturbations that make TI-ReID models incorrectly match the perturbed image to the identity described by the adversarial prompt.To precisely inject identity attributes into perturbations and achieve fine-grained targeted attack, R$^2$TUA proposes Transformer-based Gradual Multimodal Fusion (TGMF) that fuses image and adversarial prompt progressively across layers with tunable cross-modal weight.In addition, we propose a fully differentiable Soft Clamp Function (SCF), which enables us to ensure perturbations remain inconspicuous while avoiding local gradient vanishing effects that would trap training into suboptimal local minima.To further align perturbed images with the adversarial text descriptions while leading them to mismatch their original descriptions, R$^2$TUA employs Push-Pull Losses (PPLs) and matching losses during training.Extensive evaluations across multiple datasets and models demonstrate the superior untargeted attack and targeted attack performance of R$^2$TUA. It also exhibits strong adaptability and transferability against black-box models, outperforming all related attacks across multiple tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37973", "url": null, "sourceid": 45255, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39688, "uid": "1c495f16f70906bb6768dd8933f8ab5b", "name": "Learning Diffeomorphism for Medical Image Registration with Time-Embedded Architectures Using Semigroup Regularization", "authors": [{"id": 183822, "fullname": "Mohammadjavad Matinkia", "url": "http://cvpr.thecvf.com/api/miniconf/users/183822?format=json", "institution": "University of Alberta"}, {"id": 192650, "fullname": "Nilanjan Ray", "url": "http://cvpr.thecvf.com/api/miniconf/users/192650?format=json", "institution": "University of Alberta"}], "abstract": "Diffeomorphic image registration (DIR) seeks topology-preserving transformations and is fundamental in medical imaging. Existing DIR methods rely on integration schemes (e.g., scaling-and-squaring) and multiple regularizers to enforce invertibility. We introduce **SGDIR**, a continuous-time registration framework, parameterized by known time-embedded backbones, that models diffeomorphisms using only a single semigroup-based regularization, eliminating explicit integration and auxiliary constraints. We mathematically prove that this formulation directly learns the flow of an underlying ODE, inherently enforcing inverse and cycle consistencies. We evaluate on eight 2D and 3D MR and CT datasets. Under strict semigroup enforcement, our model achieves near-perfect diffeomorphism (near-zero folding) and significantly outperforms existing diffeomorphic methods, while remaining competitive with leading non-diffeomorphic deformable models. When the regularization is relaxed, the same architecture functions as a deformable method and substantially surpasses state-of-the-art non-diffeomorphic approaches in registration accuracy. These results demonstrate that continuous-time deformation modeling, guided solely by our semigroup-based regularization, yields a unified framework capable of both rigorously diffeomorphic mapping and state-of-the-art deformable registration.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39688", "url": null, "sourceid": 41273, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40376?format=json"], "related_events_ids": [40376]}, {"id": 40376, "uid": "1c495f16f70906bb6768dd8933f8ab5b", "name": "Learning Diffeomorphism for Medical Image Registration with Time-Embedded Architectures Using Semigroup Regularization", "authors": [{"id": 183822, "fullname": "Mohammadjavad Matinkia", "url": "http://cvpr.thecvf.com/api/miniconf/users/183822?format=json", "institution": "University of Alberta"}, {"id": 192650, "fullname": "Nilanjan Ray", "url": "http://cvpr.thecvf.com/api/miniconf/users/192650?format=json", "institution": "University of Alberta"}], "abstract": "Diffeomorphic image registration (DIR) seeks topology-preserving transformations and is fundamental in medical imaging. Existing DIR methods rely on integration schemes (e.g., scaling-and-squaring) and multiple regularizers to enforce invertibility. We introduce **SGDIR**, a continuous-time registration framework, parameterized by known time-embedded backbones, that models diffeomorphisms using only a single semigroup-based regularization, eliminating explicit integration and auxiliary constraints. We mathematically prove that this formulation directly learns the flow of an underlying ODE, inherently enforcing inverse and cycle consistencies. We evaluate on eight 2D and 3D MR and CT datasets. Under strict semigroup enforcement, our model achieves near-perfect diffeomorphism (near-zero folding) and significantly outperforms existing diffeomorphic methods, while remaining competitive with leading non-diffeomorphic deformable models. When the regularization is relaxed, the same architecture functions as a deformable method and substantially surpasses state-of-the-art non-diffeomorphic approaches in registration accuracy. These results demonstrate that continuous-time deformation modeling, guided solely by our semigroup-based regularization, yields a unified framework capable of both rigorously diffeomorphic mapping and state-of-the-art deformable registration.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40376", "url": null, "sourceid": -41273, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39688?format=json"], "related_events_ids": [39688]}, {"id": 37366, "uid": "cabe93e7ad2813b2009aae33c3974878", "name": "VAE-REPA: Variational Autoencoder Representation Alignment for Efficient Diffusion Training", "authors": [{"id": 135748, "fullname": "Mengmeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/135748?format=json", "institution": "Zhejiang University of Technology"}, {"id": 158676, "fullname": "Dengyang Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158676?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 184349, "fullname": "Liuzhuozheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184349?format=json", "institution": "University of Tokyo"}, {"id": 187261, "fullname": "Yucheng Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/187261?format=json", "institution": "Zhejiang University of Technology"}, {"id": 157908, "fullname": "Guojiang Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/157908?format=json", "institution": "Zhejiang University of Technology"}, {"id": 157907, "fullname": "Xiangjie Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/157907?format=json", "institution": "Zhejiang University of Technology"}, {"id": 89243, "fullname": "Yong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89243?format=json", "institution": "Zhejiang University"}, {"id": 127182, "fullname": "Guang Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/127182?format=json", "institution": "SGIT AI"}, {"id": 126338, "fullname": "Jingdong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126338?format=json", "institution": "Baidu"}], "abstract": "Denoising-based diffusion transformers, despite their strong generation performance, suffer from inefficient training convergence. Existing methods addressing this issue, such as REPA (relying on external representation encoders) or SRA (requiring dual-model setups), inevitably incur heavy computational overhead during training due to external dependencies. To tackle these challenges, this paper proposes VAE-REPA, a lightweight intrinsic guidance framework for efficient diffusion training. VAE-REPA leverages off-the-shelf pre-trained Variational Autoencoder (VAE) features: their reconstruction property ensures inherent encoding of visual priors like rich texture details, structural patterns, and basic semantic information. Specifically, VAE-REPA aligns the intermediate latent features of diffusion transformers with VAE features via a lightweight projection layer, supervised by a feature alignment loss. This design accelerates training without extra representation encoders or dual-model maintenance, resulting in a simple yet effective pipeline. Extensive experiments demonstrate that VAE-REPA improves both generation quality and training convergence speed compared to vanilla diffusion transformers, matches or outperforms state-of-the-art acceleration methods, and incurs merely 4% extra GFLOPs with zero additional cost for external guidance models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37366", "url": null, "sourceid": 45555, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38340, "uid": "ec3afa919ec7331e91c462f93608e0da", "name": "G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning", "authors": [{"id": 179607, "fullname": "Wenbo hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/179607?format=json", "institution": "UCLA"}, {"id": 178519, "fullname": "JINGLI LIN", "url": "http://cvpr.thecvf.com/api/miniconf/users/178519?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 189640, "fullname": "Yilin Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/189640?format=json", "institution": "Fudan University"}, {"id": 189641, "fullname": "Yunlong Ran", "url": "http://cvpr.thecvf.com/api/miniconf/users/189641?format=json", "institution": "Zhejiang University"}, {"id": 151776, "fullname": "Lihan Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151776?format=json", "institution": "University of Science and Technology of China"}, {"id": 189642, "fullname": "Yifan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189642?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 145076, "fullname": "Chenming Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/145076?format=json", "institution": "The University of Hong Kong"}, {"id": 188611, "fullname": "Runsen Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188611?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 127299, "fullname": "Tai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127299?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 88208, "fullname": "Jiangmiao Pang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88208?format=json", "institution": "Shanghai AI Laboratory "}], "abstract": "Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM, a geometry grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G$^2$VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data, while simultaneously leveraging the benefits of 3D visual priors that are typically only derived from hard-to-collect annotations.Experimental results demonstrate G$^2$VLM is proficient in both tasks, achieving comparable results to state-of-the-art feed-forward 3D reconstruction models and achieving better or competitive results across spatial understanding and reasoning tasks.By unifying a semantically strong VLM with low-level 3D vision tasks, we hope G$^2$VLM can serve as a strong baseline for the community and unlock more future applications, such as 3D scene editing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38340", "url": null, "sourceid": 40621, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38567, "uid": "df29803a60f8f4f537908df8188f6676", "name": "M3Grounder: Mask-Based Multi-Span and Multi-Granular Grounding for Document QA", "authors": [{"id": 181994, "fullname": "Venkata Kesav Venna", "url": "http://cvpr.thecvf.com/api/miniconf/users/181994?format=json", "institution": "TIH-IITB"}, {"id": 190166, "fullname": "Sai Madhusudan Gunda", "url": "http://cvpr.thecvf.com/api/miniconf/users/190166?format=json", "institution": "International Institute of Information Technology Hyderabad"}, {"id": 181871, "fullname": "Jyothi Swaroopa Jinka", "url": "http://cvpr.thecvf.com/api/miniconf/users/181871?format=json", "institution": "International Institute of Information Technology , Hyderabad , India"}, {"id": 190167, "fullname": "Hrithik Rachakonda", "url": "http://cvpr.thecvf.com/api/miniconf/users/190167?format=json", "institution": "International Institute of Information Technology, Hyderabad, INDIA. \ud83c\uddee\ud83c\uddf3"}, {"id": 182015, "fullname": "Anirudh Srinivasan", "url": "http://cvpr.thecvf.com/api/miniconf/users/182015?format=json", "institution": "BharatGen"}, {"id": 151819, "fullname": "Ravi Kiran Sarvadevabhatla", "url": "http://cvpr.thecvf.com/api/miniconf/users/151819?format=json", "institution": "International Institute of Information Technology Hyderabad, Dhirubhai Ambani Institute Of Information and Communication Technology"}], "abstract": "**Document QA** requires not only accurate answers but also identifying where each answer is grounded on the page. Most models treat the task as text-only generation, while existing answer grounding methods generate coarse bounding boxes that fail to capture curved text. We introduce **M3Grounder, a hybrid vision\u2013language and segmentation architecture that formulates document grounding as pixel-level segmentation. It produces fine-grained evidence masks** refined by a bleed-suppression loss to prevent spillover. M3Grounder autoregressively generates answer text interleaved with [GROUND] tokens that link individual answer spans to their corresponding evidence regions. Also, **M3Grounder grounds evidence hierarchically across phrase, line, and block levels** using an enclosure loss that enforces spatial containment. We release **GroundingDocQA dataset (200K documents, 2M multi-span and multi-granular QA pairs with pixel-level grounding masks)**, built through a data engine that handles complex layouts, curved-text, and graphics-rich documents. We also release **GroundingDocQA-Bench, a diverse and challenging human-verified benchmark**. M3Grounder sets **a new state of the art in grounded DocVQA, advancing from coarse boxes to hierarchical, fine-grained and contextually grounded evidence.**", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38567", "url": null, "sourceid": 45106, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39904, "uid": "95753ccc84f42b133d6db84a22df8dd4", "name": "Multi-level Causal LLM-based Text-to-Motion Generation with Human Alignment", "authors": [{"id": 161085, "fullname": "Chen Xiaodong", "url": "http://cvpr.thecvf.com/api/miniconf/users/161085?format=json", "institution": "University of Science and Technology of China"}, {"id": 193083, "fullname": "Qian Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193083?format=json", "institution": null}, {"id": 193084, "fullname": "Xudong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193084?format=json", "institution": null}, {"id": 193085, "fullname": "Jianping Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193085?format=json", "institution": "Meituan"}, {"id": 193086, "fullname": "Jintao Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193086?format=json", "institution": "Meituan"}, {"id": 85538, "fullname": "Yongdong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85538?format=json", "institution": "University of Science and Technology of China"}, {"id": 85027, "fullname": "Tao Mei", "url": "http://cvpr.thecvf.com/api/miniconf/users/85027?format=json", "institution": "JD Explore Academy"}, {"id": 76244, "fullname": "Wu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76244?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Although progress has been made in LLM-based text-driven motion generation, it still has the limitations of generating fine-grained and semantically consistent motions. These limitations stem from: 1) fine-grained motion quantization errors; 2) mismatches between causal reasoning language and non-causal motion representation; and 3) lack of human preference alignment. To solve them, this paper proposes MoTiGA, a multi-level causal LLM-based text-to-motion generation framework with human alignment. Firstly, MoTiGA employs Causal RVQ-VAE for multi-level causal fine-grained motion representation, then explores iterative residual quantization and causal convolutions to reduce fine-grained motion quantization errors, while preserving the causality as language presentation. Furthermore, the framework incorporates a time-lagged causal prediction strategy, enabling parallel prediction across motion token levels while maintaining temporal dependencies. Finally, to enhance human alignment, we propose Multi-level Hybrid-weighted Preference Optimization (MHPO), which dynamically adjusts semantic similarity weighting and continuous similarity scores. For MHPO, we also release the HumanML3D-R dataset, the first large-scale preference dataset for motion generation, with 101,490 human preference pairs. Evaluations show MoTiGA's superior performance, with an 82.3\\% FID improvement on HumanML3D and a 64.7\\% improvement on KIT-ML over other LLM-based methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39904", "url": null, "sourceid": 33079, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39226, "uid": "6af61fb573a8011b0b15832b1c9201d3", "name": "GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning", "authors": [{"id": 184248, "fullname": "Jiayin Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/184248?format=json", "institution": "\u4e2d\u79fb\u4e5d\u5929\u4eba\u5de5\u667a\u80fd\u79d1\u6280\uff08\u5317\u4eac\uff09\u6709\u9650\u516c\u53f8"}, {"id": 191637, "fullname": "Caixia Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/191637?format=json", "institution": "JIUTIAN Research"}, {"id": 191638, "fullname": "Boyu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191638?format=json", "institution": "China Mobile Research Institute"}, {"id": 191639, "fullname": "hailin li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191639?format=json", "institution": "China Mobile Communications Company Limited Research Institute"}, {"id": 191640, "fullname": "Xiao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191640?format=json", "institution": "China Mobile Communications Company Limited Research Institute"}, {"id": 186947, "fullname": "Yi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186947?format=json", "institution": "cmjt"}, {"id": 152251, "fullname": "Errui Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/152251?format=json", "institution": "Baidu"}, {"id": 155606, "fullname": "Liang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155606?format=json", "institution": "Alibaba Group"}, {"id": 191641, "fullname": "Chao Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191641?format=json", "institution": "China Mobile Research Institute"}, {"id": 191642, "fullname": "Junlan Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191642?format=json", "institution": "China Mobile"}], "abstract": "Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable perceptual and reasoning abilities. However, they struggle to perceive fine-grained geometric structures, constraining their ability of geometric understanding and visual reasoning. To address this, we propose GeoTikzBridge, a framework that enhances local geometric perception and visual reasoning through tikz-based code generation. Within this framework, we build two models supported by two complementary datasets. The GeoTikzBridge-Base model is trained on GeoTikz-Base dataset, the largest image-to-tikz dataset to date with 2.5M pairs (16 $\\times$ larger than existing open-sourced datasets). This process is achieved via iterative data expansion and a localized geometric transformation strategy. Subsequently, GeoTikzBridge-Instruct is fine-tuned on GeoTikz-Instruct dataset which is the first instruction-augmented tikz dataset supporting visual reasoning. Extensive experimental results demonstrate that our models achieve state-of-the-art performance among open-sourced MLLMs. Furthermore, GeoTikzBridge models can serve as plug-and-play reasoning modules for any MLLM(LLM), enhancing reasoning performance in geometric problem-solving.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39226", "url": null, "sourceid": 37788, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40095, "uid": "43d1e4f821fc02813f594e852f0c59b6", "name": "VIAFormer: Voxel-Image Alignment Transformer for High-Fidelity Voxel Refinement", "authors": [{"id": 180600, "fullname": "Tiancheng Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180600?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 188547, "fullname": "Bowen Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188547?format=json", "institution": "Alibaba Group"}, {"id": 193497, "fullname": "Lingxi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193497?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 104646, "fullname": "Jiangjing Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/104646?format=json", "institution": "Alibaba Group"}, {"id": 91226, "fullname": "Chengfei Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/91226?format=json", "institution": "Zhejiang University"}, {"id": 128373, "fullname": "Chaoyue Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128373?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 149695, "fullname": "Fan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/149695?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "We propose VIAFormer, a \\textbf{V}oxel-\\textbf{I}mage \\textbf{A}lignment Trans-\\textbf{former} model designed for Multi-view Conditioned Voxel Refinement\u2014the task of repairing incomplete noisy voxels using calibrated multi-view images as guidance. Its effectiveness stems from a synergistic design: an Image Index that provides explicit 3D spatial grounding for 2D image tokens, a Correctional Flow objective that learns a direct voxel-refinement trajectory, and a Hybrid Stream Transformer that enables robust cross-modal fusion. Experiments show that VIAFormer establishes a new state of the art in correcting both severe synthetic corruptions and realistic artifacts on the voxel shape obtained from powerful Vision Foundation Models. Beyond benchmarking, we demonstrate VIAFormer as a practical and reliable bridge in real-world 3D creation pipelines, paving the way for voxel-based methods to thrive in large-model, big-data wave.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40095", "url": null, "sourceid": 36707, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39639, "uid": "fa0b77ceb5f375388fba9a76d7d6d953", "name": "Zoo3D: Zero-Shot 3D Object Detection at Scene Level", "authors": [{"id": 183547, "fullname": "Andrey Lemeshko", "url": "http://cvpr.thecvf.com/api/miniconf/users/183547?format=json", "institution": "Higher School of Economics, Higher School of Economics"}, {"id": 192543, "fullname": "Bulat Gabdullin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192543?format=json", "institution": null}, {"id": 192544, "fullname": "Nikita Drozdov", "url": "http://cvpr.thecvf.com/api/miniconf/users/192544?format=json", "institution": "Lomonosov Moscow State University"}, {"id": 128152, "fullname": "Anton Konushin", "url": "http://cvpr.thecvf.com/api/miniconf/users/128152?format=json", "institution": "Samsung"}, {"id": 128154, "fullname": "Danila Rukhovich", "url": "http://cvpr.thecvf.com/api/miniconf/users/128154?format=json", "institution": "Samsung Research"}, {"id": 192545, "fullname": "Maksim Kolodiazhnyi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192545?format=json", "institution": "Moscow State University, Lomonosov Moscow State University"}], "abstract": "3D object detection is fundamental for spatial understanding. Real-world environments demand models capable of recognizing diverse, previously unseen objects, which remains a major limitation of closed-set methods. Existing open-vocabulary 3D detectors relax annotation requirements but still depend on training scenes, either as point clouds or images. We take this a step further by introducing $Zoo3D$, the first training-free 3D object detection framework. Our method constructs 3D bounding boxes via graph clustering of 2D instance masks, then assigns semantic labels using a novel open-vocabulary module with best-view selection and view-consensus mask generation. $Zoo3D$ operates in two modes: the zero-shot $Zoo3D_{0}$, which requires no training at all, and the self-supervised $Zoo3D_{1}$, which refines 3D box prediction by training a class-agnostic detector on $Zoo3D_{0}$-generated pseudo labels. Furthermore, we extend $Zoo3D$ beyond point clouds to work directly with posed and even unposed images. Across ScanNet200 and ARKitScenes benchmarks, both $Zoo3D_{0}$ and $Zoo3D_{1}$ achieve state-of-the-art results in open-vocabulary 3D detection. Remarkably, our zero-shot $Zoo3D_{0}$ outperforms all existing self-supervised methods, hence demonstrating the power and adaptability of training-free, off-the-shelf approaches for real-world 3D understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39639", "url": null, "sourceid": 42252, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38590, "uid": "326a8605975106f1672c912bd16b4e7b", "name": "ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization", "authors": [{"id": 183578, "fullname": "Anzhe Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/183578?format=json", "institution": "University of Southern Califotnia"}, {"id": 190218, "fullname": "Shukai Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190218?format=json", "institution": "University of Southern California"}, {"id": 190219, "fullname": "Shixuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190219?format=json", "institution": "University of Southern California"}, {"id": 190220, "fullname": "Chenzhong Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190220?format=json", "institution": null}, {"id": 154541, "fullname": "Mingxi Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/154541?format=json", "institution": "Microsoft"}, {"id": 190221, "fullname": "Heng Ping", "url": "http://cvpr.thecvf.com/api/miniconf/users/190221?format=json", "institution": "University of Southern California"}, {"id": 190222, "fullname": "Tamoghna Chattopadhyay", "url": "http://cvpr.thecvf.com/api/miniconf/users/190222?format=json", "institution": "University of Southern California"}, {"id": 190223, "fullname": "Sophia Thomopoulos", "url": "http://cvpr.thecvf.com/api/miniconf/users/190223?format=json", "institution": "University of Southern California"}, {"id": 190224, "fullname": "Shahin Nazarian", "url": "http://cvpr.thecvf.com/api/miniconf/users/190224?format=json", "institution": null}, {"id": 190225, "fullname": "Paul Thompson", "url": "http://cvpr.thecvf.com/api/miniconf/users/190225?format=json", "institution": "University of Southern California"}, {"id": 190226, "fullname": "Paul Bogdan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190226?format=json", "institution": "University of Southern California"}], "abstract": "Mixture-of-Experts (MoE) architectures expand model capacity by sparsely activating experts, but suffer from two core challenges: misalignment between router logits and each expert\u2019s internal structure leads to unstable routing and expert underutilization, and load imbalances create straggler bottlenecks. Standard solutions, such as auxiliary load-balancing losses, can reduce load disparities but often weaken expert specialization and hurt downstream performance. To address these issues, we propose ERMoE, a sparse MoE transformer that reparameterizes each expert in a learned orthonormal eigenbasis and replaces learned gating logits with an Eigenbasis Score\u2014the cosine similarity between input features and an expert\u2019s basis. This content-aware routing ties token assignments directly to experts\u2019 representation spaces, inherently stabilizing utilization and promoting interpretable specialization without sacrificing sparsity. Crucially, ERMoE eliminates the need for explicit balancing losses and avoids the interfering gradients they introduce. We demonstrate that ERMoE achieves state-of-the-art accuracy on ImageNet classification and cross-modal image-text retrieval benchmarks (e.g., COCO, Flickr30K), while naturally producing flatter expert load distributions. Moreover, a 3D MRI variant (ERMoE-ba) improves brain age prediction accuracy by over 7% and yields anatomically interpretable expert specializations. ERMoE thus introduces a new architectural principle for sparse expert models, directly addressing core routing instabilities and enabling improved performance with scalable, interpretable specialization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38590", "url": null, "sourceid": 32881, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38077, "uid": "7b7114a49b5827132d6097582ec1722d", "name": "Splatent: Splatting Diffusion Latents for Novel View Synthesis", "authors": [{"id": 71753, "fullname": "Or Hirschorn", "url": "http://cvpr.thecvf.com/api/miniconf/users/71753?format=json", "institution": "Tel Aviv University"}, {"id": 189001, "fullname": "Omer Sela", "url": "http://cvpr.thecvf.com/api/miniconf/users/189001?format=json", "institution": "Tel Aviv University"}, {"id": 126125, "fullname": "Inbar Huberman-Spiegelglas", "url": "http://cvpr.thecvf.com/api/miniconf/users/126125?format=json", "institution": "Technion - Israel Institute of Technology, Technion - Israel Institute of Technology"}, {"id": 189002, "fullname": "Netalee Efrat Sela", "url": "http://cvpr.thecvf.com/api/miniconf/users/189002?format=json", "institution": "Amazon"}, {"id": 189003, "fullname": "Eli Alshan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189003?format=json", "institution": null}, {"id": 178482, "fullname": "Ianir Ideses", "url": "http://cvpr.thecvf.com/api/miniconf/users/178482?format=json", "institution": "Amazon"}, {"id": 189004, "fullname": "Frederic Devernay", "url": "http://cvpr.thecvf.com/api/miniconf/users/189004?format=json", "institution": "Amazon"}, {"id": 162723, "fullname": "Yochai Zvik", "url": "http://cvpr.thecvf.com/api/miniconf/users/162723?format=json", "institution": "Amazon"}, {"id": 189005, "fullname": "Lior Fritz", "url": "http://cvpr.thecvf.com/api/miniconf/users/189005?format=json", "institution": null}], "abstract": "Radiance field representations have recently been explored in the latent space of VAEs that are commonly used by diffusion models. This direction offers efficient rendering and seamless integration with diffusion-based pipelines. However, these methods face a fundamental limitation: The VAE latent space lacks multi-view consistency, leading to blurred textures and missing details during 3D reconstruction. Existing approaches attempt to address this by fine-tuning the VAE, at the cost of reconstruction quality, or by relying on pre-trained diffusion models to recover fine-grained details, at the risk of some hallucinations. We present Splatent, a diffusion-based enhancement framework designed to operate on top of 3D Gaussian Splatting (3DGS) in the latent space of VAEs. Our key insight departs from the conventional 3D-centric view: rather than reconstructing fine-grained details in 3D space, we recover them in 2D from input views through multi-view attention mechanisms. This approach preserves the reconstruction quality of pretrained VAEs while achieving faithful detail recovery.Evaluated across multiple benchmarks, Splatent establishes a new state-of-the-art for VAE latent radiance field reconstruction. We further demonstrate that integrating our method with existing feed-forward frameworks, consistently improves detail preservation, opening new possibilities for high-quality sparse-view 3D reconstruction.We will release our code upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38077", "url": null, "sourceid": 33356, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38667, "uid": "e724e5ad0d0df625744239ef1c60e5e8", "name": "HAVE-Bench: Hierarchical Audio-Visual Evaluation from Perception to Interaction", "authors": [{"id": 128102, "fullname": "Zhong Muyan", "url": "http://cvpr.thecvf.com/api/miniconf/users/128102?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 186663, "fullname": "Erfei Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/186663?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 128091, "fullname": "Sen Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/128091?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 186575, "fullname": "Weiyun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186575?format=json", "institution": "Fudan University"}, {"id": 190423, "fullname": "Wen Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190423?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 190424, "fullname": "Yuchen Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190424?format=json", "institution": null}, {"id": 126958, "fullname": "Yanting Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126958?format=json", "institution": "Donghua University, Shanghai"}, {"id": 190425, "fullname": "Xiaowei Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190425?format=json", "institution": "South China University of Technology"}, {"id": 87610, "fullname": "Wenhai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87610?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 190426, "fullname": "Chao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190426?format=json", "institution": "Tsinghua University; Shanghai Artificial Intelligence Laboratory; University College London"}, {"id": 87572, "fullname": "Jifeng Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/87572?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Multimodal large language models (MLLMs) have expanded from vision\u2013language systems to include audio, unlocking new capabilities in cross-modal reasoning and interaction. To address the limitation that existing benchmarks focus mainly on perception tasks and lack a unified cognitive evaluation framework, we propose Hierarchical Audio-Visual Evaluation Benchmark (HAVE-Bench). It systematically evaluates the audio-related capabilities of MLLMs along a three-level cognitive hierarchy: Perception, Reasoning, and Interaction, utilizing 2,451 curated samples and manually annotated multi-turn interaction-level tasks. Experiments using this unified framework reveal significant gaps in existing models at the reasoning and interaction levels, with speech-driven visual question answering (VQA) performance significantly lagging behind the text\u2013image setting. These findings underscore the urgency of enhancing models\u2019 handling of long and complex audio and facilitating the transfer of reasoning capabilities from the vision\u2013text to the audio\u2013visual domain.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38667", "url": null, "sourceid": 32443, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37960, "uid": "0aff79643e0e8ce75a892aa9a9e736f4", "name": "Towards Multimodal Domain Generalization with Few Labels", "authors": [{"id": 180457, "fullname": "Hongzhao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180457?format=json", "institution": "Zhengzhou University"}, {"id": 155538, "fullname": "Hao Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/155538?format=json", "institution": "ETH Zurich"}, {"id": 188689, "fullname": "Hualei Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188689?format=json", "institution": "Zhengzhou University"}, {"id": 188690, "fullname": "Shupan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188690?format=json", "institution": null}, {"id": 156113, "fullname": "Mingliang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156113?format=json", "institution": "Zhengzhou University"}, {"id": 76698, "fullname": "Muhammad Haris Khan", "url": "http://cvpr.thecvf.com/api/miniconf/users/76698?format=json", "institution": "Mohamed Bin Zayed University of Artificial Intelligence"}], "abstract": "Multimodal models ideally should generalize to unseen domains while remaining data-efficient to reduce annotation costs. To this end, we introduce and study a new problem, Semi-Supervised Multimodal Domain Generalization (SSMDG), which aims to learn robust multimodal models from multi-source data with few labeled samples. We observe that existing approaches fail to address this setting effectively: multimodal domain generalization methods cannot exploit unlabeled data, semi-supervised multimodal learning methods ignore domain shifts, and semi-supervised domain generalization methods are confined to single-modality inputs. To overcome these limitations, we propose a unified framework featuring three key components: Consensus-Driven Consistency Regularization, which obtains reliable pseudo-labels through confident fused-unimodal consensus; Disagreement-Aware Regularization, which effectively utilizes ambiguous non-consensus samples; and Cross-Modal Prototype Alignment, which enforces domain- and modality-invariant representations while promoting robustness under missing modalities via cross-modal translation. We further establish the first SSMDG benchmarks, on which our method consistently outperforms strong baselines in both standard and missing-modality scenarios. Our benchmarks and code will be released to support future research.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37960", "url": null, "sourceid": 35610, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37217, "uid": "0ac9a79e4aa15b44845b6b553cfcddbd", "name": "Language Does Matter for Cross-Domain Few-Shot Visual Feature Enhancement", "authors": [{"id": 186943, "fullname": "Fei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186943?format=json", "institution": "Chinese University of Hong Kong"}, {"id": 186944, "fullname": "Xiwen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186944?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 186945, "fullname": "Qingqing Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186945?format=json", "institution": "Zhejiang University"}, {"id": 89126, "fullname": "Lei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89126?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 89142, "fullname": "Wei Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/89142?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 186946, "fullname": "Chen Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/186946?format=json", "institution": "Xi&#x27;an University of Posts and Telecommunications"}, {"id": 186947, "fullname": "Yi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186947?format=json", "institution": "cmjt"}, {"id": 186948, "fullname": "Liang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186948?format=json", "institution": "China Mobile Communications Company Limited Research Institute"}, {"id": 95127, "fullname": "Xiangyu Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/95127?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 88282, "fullname": "Yanning Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88282?format=json", "institution": "Northwestern Polytechnical University"}], "abstract": "Cross-domain few-shot image interpretation (CD-FSII) has been significantly advanced by fine-tuning pre-trained visual feature models using limited labeled samples in target domains. However, profound cross-domain distribution discrepancies, along with inherent conflicts between extensive object visual appearance variations and limited annotations, trap those existing pure visual feature representations into some non-transferable short-cut patterns, thus degrading their cross-domain generalization capacity. To mitigate this problem, we present a simple yet effective cross-modal visual feature enhancement framework which primarily contributes in the following three aspects. 1) We make the first attempt to introduce linguistic descriptions of image attributes to regulate the pre-trained visual feature model for specific target image adaptation. Specifically, image-level attributes (e.g., object appearance in individual images) and domain-level attributes (e.g., overall style and background characteristics of the dataset) are extracted using a pre-trained image captioning model and a large language model (LLM), respectively, to construct comprehensive linguistic characterizations. 2) A lightweight residual cross-attention scheme is developed to seamlessly embed linguistic descriptions of image attributes into visual feature representations, thereby compensating for the limitations of purely visual cues in capturing cross-domain transferable high-level semantic characteristics. 3) The proposed framework is task-agnostic and can be seamlessly integrated with off-the-shelf pre-trained visual feature models. It demonstrates superior generalization performance compared to several state-of-the-art methods across multiple CD-FSII benchmarks, including image classification, semantic segmentation, and object detection. We will release all code and data to facilitate further research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37217", "url": null, "sourceid": 31846, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36721, "uid": "40965efa19c345bfd62e38846d71eb85", "name": "Advancing Image Classification with Discrete Diffusion Classification Modeling", "authors": [{"id": 180108, "fullname": "Omer Belhasin", "url": "http://cvpr.thecvf.com/api/miniconf/users/180108?format=json", "institution": "Technion - Israel Institute of Technology / NVIDIA"}, {"id": 185723, "fullname": "Shelly Golan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185723?format=json", "institution": "Tel Aviv University"}, {"id": 185724, "fullname": "Ran El-Yaniv", "url": "http://cvpr.thecvf.com/api/miniconf/users/185724?format=json", "institution": "Technion; Deci; Technion - Israel Institute of Technology, Technion"}, {"id": 185725, "fullname": "Michael Elad", "url": "http://cvpr.thecvf.com/api/miniconf/users/185725?format=json", "institution": "NVIDIA; Computer Science Department, Technion - Israel Institute of Technology"}], "abstract": "Image classification is a well-studied task in computer vision, and yet it remains challenging under high-uncertainty conditions, such as when input images are corrupted or training data are limited. Conventional classification approaches typically train models to directly predict class labels from input images, but this might lead to suboptimal performance in such scenarios. To address this issue, we propose Discrete Diffusion Classification Modeling (DiDiCM), a novel framework that leverages a diffusion-based procedure to model the posterior distribution of class labels conditioned on the input image. DiDiCM supports diffusion-based predictions either on class probabilities or on discrete class labels, providing flexibility in computation and memory trade-offs. We conduct a comprehensive empirical study demonstrating the superior performance of DiDiCM over standard classifiers, showing that a few diffusion iterations achieve higher classification accuracy on the ImageNet dataset compared to baselines, with accuracy gains increasing as the task becomes more challenging.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36721", "url": null, "sourceid": 41418, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40273?format=json"], "related_events_ids": [40273]}, {"id": 38711, "uid": "f2024bb88e892c25cfab097cce2d24fa", "name": "PARSE: Part-Aware Relational Spatial Modeling", "authors": [{"id": 181395, "fullname": "Yinuo Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/181395?format=json", "institution": "ShanghaiTech University"}, {"id": 182835, "fullname": "Peijun Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182835?format=json", "institution": "ShanghaiTech University"}, {"id": 190504, "fullname": "Kuixiang Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190504?format=json", "institution": "ShanghaiTech University"}, {"id": 190505, "fullname": "Yuyang Jiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190505?format=json", "institution": "ShanghaiTech University"}, {"id": 190506, "fullname": "Jingxuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190506?format=json", "institution": "ShanghaiTech University"}, {"id": 86411, "fullname": "Kaixin Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86411?format=json", "institution": "ShanghaiTech University"}, {"id": 154716, "fullname": "Jiayuan Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154716?format=json", "institution": "ShanghaiTech University"}, {"id": 75945, "fullname": "Jingyi Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75945?format=json", "institution": "Shanghai Tech University"}], "abstract": "Inter-object relations underpin spatial intelligence, yet existing representations\u2014linguistic prepositions or object-level scene graphs\u2014are too coarse to specify which regions actually support, contain, or contact one another, leading to ambiguous and physically inconsistent layouts. To address these ambiguities, a part-level formulation is needed; therefore, we introduce PARSE, a framework that explicitly models how object parts interact to determine feasible and spatial grounded scene configurations. PARSE centers on the Part-centric Assembly Graph (PAG), which encodes geometric relations between specific object parts, and a Part-Aware Spatial Configuration Solver that converts these relations into geometric constraints to assemble collision-free, physically valid scenes. Using PARSE, we build PARSE-10K, a dataset of 10,000 3D indoor scenes constructed from real-image layout priors and a curated part-annotated shape database, each with dense contact structures and a part-level contact graph. With this structured, spatially grounded supervision, fine-tuning Qwen3-VL on PARSE-10K yields stronger object-level layout reasoning and more accurate part-level relation understanding; furthermore, leveraging PAGs as structural priors in 3D generation models leads to scenes with substantially improved physical realism and structural complexity. Together, these results show that PARSE significantly advances geometry-grounded spatial reasoning and supports the generation of physically consistent 3D scenes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38711", "url": null, "sourceid": 43367, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40273, "uid": "40965efa19c345bfd62e38846d71eb85", "name": "Advancing Image Classification with Discrete Diffusion Classification Modeling", "authors": [{"id": 180108, "fullname": "Omer Belhasin", "url": "http://cvpr.thecvf.com/api/miniconf/users/180108?format=json", "institution": "Technion - Israel Institute of Technology / NVIDIA"}, {"id": 185723, "fullname": "Shelly Golan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185723?format=json", "institution": "Tel Aviv University"}, {"id": 185724, "fullname": "Ran El-Yaniv", "url": "http://cvpr.thecvf.com/api/miniconf/users/185724?format=json", "institution": "Technion; Deci; Technion - Israel Institute of Technology, Technion"}, {"id": 185725, "fullname": "Michael Elad", "url": "http://cvpr.thecvf.com/api/miniconf/users/185725?format=json", "institution": "NVIDIA; Computer Science Department, Technion - Israel Institute of Technology"}], "abstract": "Image classification is a well-studied task in computer vision, and yet it remains challenging under high-uncertainty conditions, such as when input images are corrupted or training data are limited. Conventional classification approaches typically train models to directly predict class labels from input images, but this might lead to suboptimal performance in such scenarios. To address this issue, we propose Discrete Diffusion Classification Modeling (DiDiCM), a novel framework that leverages a diffusion-based procedure to model the posterior distribution of class labels conditioned on the input image. DiDiCM supports diffusion-based predictions either on class probabilities or on discrete class labels, providing flexibility in computation and memory trade-offs. We conduct a comprehensive empirical study demonstrating the superior performance of DiDiCM over standard classifiers, showing that a few diffusion iterations achieve higher classification accuracy on the ImageNet dataset compared to baselines, with accuracy gains increasing as the task becomes more challenging.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40273", "url": null, "sourceid": -41418, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/36721?format=json"], "related_events_ids": [36721]}, {"id": 38878, "uid": "01623048e7d81ab613de0f5d03e95fcc", "name": "ArtLLM: Generating Articulated Assets via 3D LLM", "authors": [{"id": 180002, "fullname": "Penghao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180002?format=json", "institution": "ShanghaiTech University"}, {"id": 190907, "fullname": "Siyuan Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/190907?format=json", "institution": "ShanghaiTech University"}, {"id": 156079, "fullname": "Jiawei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/156079?format=json", "institution": "Stony Brook University"}, {"id": 159485, "fullname": "Xianghui Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/159485?format=json", "institution": "Tencent"}, {"id": 127361, "fullname": "Jingwei Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127361?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 129664, "fullname": "Chunchao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/129664?format=json", "institution": "Tencent"}, {"id": 154716, "fullname": "Jiayuan Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154716?format=json", "institution": "ShanghaiTech University"}], "abstract": "Creating interactive digital environments for gaming, robotics, and simulation relies on articulated 3D objects whose functionality emerges from their part geometry and kinematic structure. However, existing approaches remain fundamentally limited: optimization-based reconstruction methods require slow, per-object joint fitting and typically handle only simple, single-joint objects, while retrieval-based methods assemble parts from a fixed library, leading to repetitive geometry and poor generalization. To address these challenges, we introduce ArtLLM, a novel framework for generating high-quality articulated assets directly from complete 3D meshes. At its core is a 3D multimodal large language model trained on a large-scale articulation dataset curated from both existing articulation datasets and procedurally generated objects. Unlike prior work, ArtLLM autoregressively predicts a variable number of parts and joints, inferring their kinematic structure in a unified manner from the object\u2019s point cloud. This articulation-aware layout then conditions a 3D generative model to synthesize high-fidelity part geometries. Experiments on the PartNet-Mobility dataset show that ArtLLM significantly outperforms state-of-the-art methods in both part layout accuracy and joint prediction, while generalizing robustly to real-world objects. Finally, we demonstrate its utility in constructing digital twins, highlighting its potential for scalable robot learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38878", "url": null, "sourceid": 39400, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36990, "uid": "45cdb55866bcdb7caac8f7e643856747", "name": "SharpTimeGS: Sharp and Stable Dynamic Gaussian Splatting via Lifespan Modulation", "authors": [{"id": 157942, "fullname": "Zhanfeng Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/157942?format=json", "institution": "Tsinghua University"}, {"id": 127477, "fullname": "Jiajun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127477?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 88533, "fullname": "Hanzhang Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88533?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 186397, "fullname": "Zhixi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186397?format=json", "institution": "Tsinghua University"}, {"id": 127294, "fullname": "Yunqi Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/127294?format=json", "institution": "Central China Normal University"}, {"id": 133204, "fullname": "Hongwen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/133204?format=json", "institution": "Beijing Normal University"}, {"id": 75944, "fullname": "Yebin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75944?format=json", "institution": "Tsinghua University"}], "abstract": "Novel view synthesis of dynamic scenes is fundamental to achieving photorealistic 4D reconstruction and immersive visual experiences. Recent progress in Gaussian-based representations has significantly improved real-time rendering quality, yet existing methods still struggle to maintain a balance between long-term static and short-term dynamic regions in both representation and optimization. To address this, we present SharpTimeGS, a lifespan-aware 4D Gaussian framework that achieves temporally adaptive modeling of both static and dynamic regions under a unified representation.Specifically, we introduce a learnable lifespan parameter that reformulates temporal visibility from a Gaussian-shaped decay into a flat-top profile, allowing primitives to remain consistently active over their intended duration and avoiding redundant densification. In addition, the learned lifespan modulates each primitive\u2019s motion, reducing drift in long-lived static points while retaining unrestricted motion for short-lived dynamic ones. This effectively decouples motion magnitude from temporal duration, improving long-term stability without compromising dynamic fidelity.Moreover, we design a lifespan\u2013velocity\u2013aware densification strategy that mitigates optimization imbalance between static and dynamic regions by allocating more capacity to regions with pronounced motion while keeping static areas compact and stable.Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art performance while supporting real-time rendering up to 4K resolution at 100 FPS on one RTX 4090.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36990", "url": null, "sourceid": 39636, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38122, "uid": "6003822a5989689befd308ce1be4ac75", "name": "HumanBA: Human-Aware Bundle Adjustment via Global Human-Camera Decoupling", "authors": [{"id": 107118, "fullname": "Fengyuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107118?format=json", "institution": "National University of Singapore"}, {"id": 167356, "fullname": "Tanuj Sur", "url": "http://cvpr.thecvf.com/api/miniconf/users/167356?format=json", "institution": "Chennai Mathematical Institute"}, {"id": 189098, "fullname": "Tze Ho Elden Tse", "url": "http://cvpr.thecvf.com/api/miniconf/users/189098?format=json", "institution": "National University of Singapore"}, {"id": 85773, "fullname": "Angela Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/85773?format=json", "institution": "National University of Singapore"}], "abstract": "Recovering global human and camera motion from monocular video is essential for world-coordinate human reconstruction but remains challenging due to entangled motions in image space. Traditional SLAM methods estimate monocular camera motion but fail in scenes dominated by foreground objects such as humans. A common workaround is to mask out dynamic objects, yet this approach becomes brittle when humans occupy most of the view or the background is too noisy, leading to unstable tracking and loss of constraints. This paper takes the opposite stance and reintegrates human motion as informative landmarks. We introduce HumanBA, a human-aware bundle adjustment framework that transforms dynamic humans into usable constraints via motion decoupling.  HumanBA subtracts the human-induced component from observed joint trajectories, isolating a camera-induced (pseudo-static) component that can be safely incorporated into bundle adjustment alongside background features. To mitigate noise in global human estimates, HumanBA applies motion refinements and motion-aware reliability weighting.  Across EMDB and SLOPER4D benchmarks, we show consistent improvements on camera pose estimation and reduce global human reconstruction error, demonstrating the benefits of treating humans as dynamic yet informative landmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38122", "url": null, "sourceid": 40615, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37273, "uid": "8d6bb63cd30962947301a373dee84121", "name": "RI-Mamba: Rotation-Invariant Mamba for Robust Text-to-Shape Retrieval", "authors": [{"id": 152856, "fullname": "Khanh Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/152856?format=json", "institution": "The University of Western Australia"}, {"id": 187061, "fullname": "Dasith de Silva Edirimuni", "url": "http://cvpr.thecvf.com/api/miniconf/users/187061?format=json", "institution": "University of Western Australia"}, {"id": 93221, "fullname": "Ghulam Mubashar Hassan", "url": "http://cvpr.thecvf.com/api/miniconf/users/93221?format=json", "institution": "The University of Western Australia"}, {"id": 90780, "fullname": "Ajmal Mian", "url": "http://cvpr.thecvf.com/api/miniconf/users/90780?format=json", "institution": "University of Western Australia"}], "abstract": "3D assets have rapidly expanded in quantity and diversity due to the growing popularity of virtual reality and gaming. As a result, text-to-shape retrieval has become essential in facilitating intuitive search within large repositories. However, existing methods require canonical poses and support few object categories, limiting their real-world applicability where objects can belong to diverse classes and appear in random orientations. To address this challenge, we propose RI-Mamba, the first rotation-invariant state-space model for point clouds. RI-Mamba defines global and local reference frames to disentangle pose from geometry and uses Hilbert sorting to construct token sequences with meaningful geometric structure while maintaining rotation invariance. We further introduce a novel strategy to compute orientational embeddings and reintegrate them via feature-wise linear modulation, effectively recovering spatial context and enhancing model expressiveness. Our strategy is inherently compatible with state-space models and operates in linear time. To scale up retrieval, we adopt cross-modal contrastive learning with automated triplet generation, allowing training on diverse datasets without manual annotation. Extensive experiments demonstrate RI-Mamba's superior representational capacity and robustness, achieving state-of-the-art performance on the OmniObject3D benchmark across more than 200 object categories under arbitrary orientations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37273", "url": null, "sourceid": 38484, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39961, "uid": "9799b6b22f21d84883c0514f307b54b5", "name": "Drift-Resilient Temporal Priors for Visual Tracking", "authors": [{"id": 128144, "fullname": "Yuqing Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128144?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 193200, "fullname": "Liting Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/193200?format=json", "institution": "University of Limerick"}, {"id": 188737, "fullname": "Weijun Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188737?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 89529, "fullname": "Zhenyu He", "url": "http://cvpr.thecvf.com/api/miniconf/users/89529?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 128139, "fullname": "Xin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/128139?format=json", "institution": "Peng Cheng Laboratory"}], "abstract": "Temporal information is crucial for visual tracking, but existing multi-frame trackers are vulnerable to model drift caused by naively aggregating noisy historical predictions. In this paper, we introduce DTPTrack, a lightweight and generalizable module designed to be seamlessly integrated into existing trackers to suppress drift. Our framework consists of two core components: (1) a Temporal Reliability Calibrator (TRC) mechanism that learns to assign a per-frame reliability score to historical states, filtering out noise while anchoring on the ground-truth template; and (2) a Temporal Guidance Synthesizer (TGS) module that synthesizes this calibrated history into a compact set of dynamic temporal priors to provide predictive guidance. To demonstrate its versatility, we integrate DTPTrack into three diverse tracking architectures\u2014OSTrack, ODTrack, and LoRAT\u2014and show consistent, significant performance gains across all baselines. Our best-performing model, built upon an extended LoRATv2 backbone, sets a new state-of-the-art on several benchmarks, achieving a 77.5\\% Success rate on LaSOT and an 80.3\\% AO on GOT-10k.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39961", "url": null, "sourceid": 42244, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38747, "uid": "17a65bb1dc56c1235d7ee8029c94408b", "name": "Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations", "authors": [{"id": 190575, "fullname": "Tuan Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190575?format=json", "institution": "Australian Institute for Machine Learning, University of Adelaide"}, {"id": 190576, "fullname": "Minh Khoi Ho", "url": "http://cvpr.thecvf.com/api/miniconf/users/190576?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 129517, "fullname": "Qi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/129517?format=json", "institution": "The University of Adelaide"}, {"id": 85271, "fullname": "Yutong Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/85271?format=json", "institution": "University of Adelaide"}, {"id": 190577, "fullname": "Cam-Tu Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190577?format=json", "institution": "Nanjing University"}, {"id": 190578, "fullname": "Minh Khoi Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190578?format=json", "institution": "Hanoi University of Science and Technology"}, {"id": 190579, "fullname": "Dang Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190579?format=json", "institution": "Hanoi University of Science and Technology; Hanoi - Amsterdam High School for the Gifted, Vietnam"}, {"id": 88134, "fullname": "Anton van den Hengel", "url": "http://cvpr.thecvf.com/api/miniconf/users/88134?format=json", "institution": "University of Adelaide"}, {"id": 128842, "fullname": "Johan Verjans", "url": "http://cvpr.thecvf.com/api/miniconf/users/128842?format=json", "institution": "University of Adelaide"}, {"id": 184202, "fullname": "Le Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184202?format=json", "institution": "Institute for AI Innovation and Societal Impact, Hanoi University of Science and Technology"}, {"id": 107228, "fullname": "Vu Minh Hieu Phan", "url": "http://cvpr.thecvf.com/api/miniconf/users/107228?format=json", "institution": "Australian Institute for Machine Learning, University of Adelaide"}], "abstract": "Large vision-language models (LVLMs) achieve strong performance on visual reasoning tasks but remain highly susceptible to hallucination. Existing detection methods predominantly rely on coarse, whole-image measures of how an object token relates to the input image. This global strategy is limited: hallucinated tokens may exhibit weak but widely scattered correlations across many local regions, which aggregate into deceptively high overall relevance, thus evading the current global hallucination detectors. We begin with a simple yet critical observation: a faithful object token must be strongly grounded in a specific image region. Building on this insight, we introduce a patch-level hallucination detection framework that examines fine-grained token-level interactions across model layers.Our analysis uncovers two characteristic signatures of hallucinated tokens: (i) they yield diffuse, non-localized attention patterns, in contrast to the compact, well-focused attention and (ii) they fail to exhibit meaningful semantic alignment with any visual region.Guided by these findings, we develop a lightweight and interpretable detection method that leverages patch-level statistical features, combined with hidden-layer representations. Our approach achieves up to 90% accuracy in token-level hallucination detection, demonstrating the superiority of fine-grained structural analysis for detecting hallucinations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38747", "url": "https://token-grounding-detection-cvpr26.github.io/", "sourceid": 40157, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38600, "uid": "d6b7011e4b5b41fc4c9b0fc470013b0c", "name": "Refer-Agent: A Collaborative Multi-Agent System for Referring Video Object Segmentation with Reasoning and Reflection", "authors": [{"id": 190264, "fullname": "Haichao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190264?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 101034, "fullname": "Tianming Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/101034?format=json", "institution": "Sun Yat-sen University"}, {"id": 86314, "fullname": "Wei-Shi Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86314?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 88842, "fullname": "Jian-Fang Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88842?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Referring Video Object Segmentation (RVOS) aims to segment objects in videos based on textual queries. Current methods mainly rely on large-scale supervised fine-tuning (SFT) of Multi-modal Large Language Models (MLLMs). However, this paradigm suffers from heavy data dependence and limited scalability against the rapid evolution of MLLMs. Although recent zero-shot approaches offer a flexible alternative, their performance remains significantly behind SFT-based methods, due to the straightforward workflow designs. To address these limitations, we propose \\textbf{Refer-Agent}, a collaborative multi-agent system with alternating reasoning-reflection mechanisms. This system decomposes RVOS into step-by-step reasoning process. During reasoning, we introduce a Coarse-to-Fine frame selection strategy to ensure the frame diversity and textual relevance, along with a Dynamic Focus Layout that adaptively adjusts the agent\u2019s visual focus.Furthermore, we propose a Chain-of-Reflection mechanism, which employs a Questioner-Responder pair to generate a self-reflection chain, enabling the system to verify intermediate results and generates feedback for next-round reasoning refinement. Extensive experiments on five challenging benchmarks demonstrate that Refer-Agent significantly outperforms state-of-the-art methods, including both SFT-based models and zero-shot approaches.Moreover, Refer-Agent is flexible and enables fast integration of new MLLMs without any additional fine-tuning costs. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38600", "url": null, "sourceid": 35421, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39973, "uid": "94c6b1d6c1c454fc8ccea6e5f5a082de", "name": "Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation", "authors": [{"id": 101034, "fullname": "Tianming Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/101034?format=json", "institution": "Sun Yat-sen University"}, {"id": 190264, "fullname": "Haichao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190264?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 193220, "fullname": "Yuting Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193220?format=json", "institution": null}, {"id": 193221, "fullname": "Chaolei Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193221?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 157617, "fullname": "Shuai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/157617?format=json", "institution": "Shandong University"}, {"id": 86314, "fullname": "Wei-Shi Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86314?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 88842, "fullname": "Jian-Fang Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88842?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Referring video object segmentation (RVOS) aims to identify, track and segment the objects in a video based on language descriptions, which has received great attention in recent years. However, existing datasets remain focus on short video clips within several seconds, with salient objects visible in most frames.To advance the task towards more practical scenarios, we introduce \\textbf{Long-RVOS}, a large-scale benchmark for long-term referring video object segmentation. Long-RVOS contains 2,000+ videos of an average duration exceeding 60 seconds, covering a variety of objects that undergo occlusion, disappearance-reappearance and shot changing.The objects are manually annotated with three different types of descriptions to individually evaluate the understanding of static attributes, motion patterns and spatiotemporal relationships.Moreover, unlike previous benchmarks that rely solely on the per-frame spatial evaluation, we introduce two new metrics to assess the temporal and spatiotemporal consistency.We benchmark 7 state-of-the-art methods on Long-RVOS. The results show that current approaches struggle severely with the long-video challenges.To address this, we further propose ReferMo, a promising baseline method that integrates motion information to expand the temporal receptive field, and employs a local-to-global architecture to capture both short-term dynamics and long-term dependencies.Despite simplicity, ReferMo achieves significant improvements over current methods in long-term scenarios. We hope that Long-RVOS and our baseline can drive future RVOS research towards tackling more realistic and long-form videos. Our dataset and code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39973", "url": null, "sourceid": 36419, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39236, "uid": "812030f5df0d7e38686fee76cb224528", "name": "HAD: Heterogeneity-Aware Distillation for Lifelong Heterogeneous Learning", "authors": [{"id": 191677, "fullname": "Xuerui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191677?format=json", "institution": null}, {"id": 191678, "fullname": "Xuehao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191678?format=json", "institution": "Zhejiang University"}, {"id": 191679, "fullname": "Zhan Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191679?format=json", "institution": "City University of Hong Kong"}, {"id": 191680, "fullname": "Linglan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191680?format=json", "institution": "Tencent Youtu Lab"}, {"id": 191681, "fullname": "Ziyue Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191681?format=json", "institution": "Technical University of Munich"}, {"id": 191682, "fullname": "Xinmin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191682?format=json", "institution": "Zhejiang University"}, {"id": 191683, "fullname": "Zhihuan Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/191683?format=json", "institution": null}, {"id": 86185, "fullname": "Yu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86185?format=json", "institution": "Southern University of Science and Technology"}], "abstract": "Lifelong learning aims to preserve knowledge acquired from previous tasks while incorporating knowledge from a sequence of new tasks. However, most prior work explores only streams of homogeneous tasks (*e.g.*, only classification tasks) and neglects the scenario of learning across heterogeneous tasks that possess different structures of outputs. In this work, we formalize this broader setting as lifelong heterogeneous learning (LHL).Departing from conventional lifelong learning, the task sequence of LHL spans different task types, and the learner needs to retain heterogeneous knowledge for different output space structures.To instantiate the LHL, we focus on LHL in the context of dense prediction (LHL4DP), a realistic and challenging scenario.To this end, we propose the Heterogeneity-Aware Distillation (HAD) method, an exemplar-free approach that preserves previously gained heterogeneous knowledge by self-distillation in each training phase.The proposed HAD comprises two complementary components, including a distribution-balanced heterogeneity-aware distillation loss to alleviate the global imbalance of prediction distribution and a salience-guided heterogeneity-aware distillation loss that concentrates learning on informative edge pixels extracted with the Sobel operator.Extensive experiments demonstrate that the proposed HAD method significantly outperforms existing methods in this new scenario.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39236", "url": null, "sourceid": 33116, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36651, "uid": "2e93fad1e91614dbea27879646a09bd6", "name": "First Frame Is the Place to Go for Video Content Customization", "authors": [{"id": 104666, "fullname": "Jingxi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/104666?format=json", "institution": "University of Maryland College Park"}, {"id": 126703, "fullname": "Zongxia Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/126703?format=json", "institution": "University of Maryland, College Park"}, {"id": 179987, "fullname": "Zhichao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/179987?format=json", "institution": "Independant"}, {"id": 185563, "fullname": "Guangyao Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185563?format=json", "institution": "University of Southern California"}, {"id": 126693, "fullname": "Xiyang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126693?format=json", "institution": "University of Maryland, College Park"}, {"id": 185564, "fullname": "Fuxiao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185564?format=json", "institution": "NVIDIA"}, {"id": 136019, "fullname": "Cornelia Fermuller", "url": "http://cvpr.thecvf.com/api/miniconf/users/136019?format=json", "institution": "University of Maryland, College Park"}, {"id": 128449, "fullname": "Brandon Y. Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/128449?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 85403, "fullname": "Yiannis Aloimonos", "url": "http://cvpr.thecvf.com/api/miniconf/users/85403?format=json", "institution": "University of Maryland, College Park"}], "abstract": "What role does the first frame play in video generation models? Traditionally, it\u2019s viewed as the spatial-temporal starting point of a video, merely a seed for subsequent animation. In this work, we reveal a fundamentally different perspective: video models implicitly treat the first frame as a conceptual memory buffer that stores visual entities for later reuse during generation. Leveraging this insight, we show that it's possible to achieve robust and generalized video content customization in diverse scenarios, using only 20\u201350 training examples without architectural changes or large-scale finetuning. This unveils a powerful, overlooked capability of video generation models for reference-based video customization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36651", "url": null, "sourceid": 32376, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36296, "uid": "e4b79783cddf15c2a0903352eb73ad7c", "name": "Similarity-Consistent Likelihood Diffusion enables Hidden Person Detection from Wall Reflections", "authors": [{"id": 184715, "fullname": "Zhiwen Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184715?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 184716, "fullname": "Hao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184716?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 184717, "fullname": "Huiyu Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/184717?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 184718, "fullname": "Zhao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184718?format=json", "institution": "Northumbria University"}, {"id": 184719, "fullname": "Guangyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184719?format=json", "institution": "Peking University"}, {"id": 184720, "fullname": "Shaowei Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184720?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 176647, "fullname": "Wenwen Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176647?format=json", "institution": "Johns Hopkins University"}, {"id": 184721, "fullname": "Bin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184721?format=json", "institution": "Hunan University"}, {"id": 148526, "fullname": "Jin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/148526?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 184722, "fullname": "Xiaoshuai Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184722?format=json", "institution": "Ocean University of China"}, {"id": 181645, "fullname": "Xingru Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181645?format=json", "institution": "Hangzhou Dianzi University"}], "abstract": "This paper studies passive non-line-of-sight corner-camera detection and human localization using faint indirect reflections on a visible wall. The challenge is twofold: multi-exposure wall observations are unstable and entangled with sensor nonlinearities, and mapping these observations to a hidden-view RGB image is severely underdetermined, making purely discriminative regressors brittle and unconstrained diffusion priors stochastic. To address these challenges, we introduce the Similarity-Likelihood Diffusion Network (SLD-Net), a two-stage framework that produces measurement-consistent, deterministic reconstructions. First, DeLi-Inversion forms an exposure-aware differential representation and jointly predicts an initial reconstruction and a pixel-wise precision map, yielding a heteroscedastic pseudo-likelihood. Second, SiCo-Diffusion injects this likelihood as precision-weighted energy into a deterministic DDIM trajectory and fuses it with the diffusion prior using an annealed Bayesian precision rule, producing a unique reconstruction for fixed observations and schedules. Extensive experiments on two real datasets: Reflect-Corridor and Reflect-Room, demonstrate that the proposed method outperforms generic, physics-inspired, and NLOS-specific baselines across PSNR, SSIM, LPIPS, and FID. In particular, relative to the best-performing baseline, it improves PSNR from 13.84 to 15.58 dB on Reflect-Corridor and from 11.58 to 12.49 dB on Reflect-Room, and reduces FID from 264.91 to 73.54 and from 177.05 to 26.89, respectively, while also achieving the lowest LPIPS on both datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36296", "url": null, "sourceid": 42173, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36770, "uid": "97388f939a25aa3982d259411d298b00", "name": "CodePercept: Code-Grounded Visual STEM Perception for MLLM", "authors": [{"id": 90267, "fullname": "Tongkun Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/90267?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 86577, "fullname": "Zhibo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86577?format=json", "institution": "Alibaba Group"}, {"id": 131375, "fullname": "Jianqiang Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/131375?format=json", "institution": "Alibaba Group"}, {"id": 181352, "fullname": "Mingkun Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181352?format=json", "institution": "ALIBABA (BEIJING) SOFTWARE SERVICES CO., LTD."}, {"id": 154364, "fullname": "Zhentao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/154364?format=json", "institution": "Beijing Institute of Technology"}, {"id": 185833, "fullname": "Zijian Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185833?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 185834, "fullname": "Ruilin Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185834?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 185835, "fullname": "Ruizhe Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185835?format=json", "institution": "Zhejiang University"}, {"id": 181993, "fullname": "Sontao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181993?format=json", "institution": "Zhejiang University"}, {"id": 185836, "fullname": "Peng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185836?format=json", "institution": "Alibaba tong yi team"}, {"id": 76466, "fullname": "Wei Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76466?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 185837, "fullname": "Junyang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185837?format=json", "institution": "Alibaba Group"}, {"id": 86696, "fullname": "Xiaokang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86696?format=json", "institution": "Shanghai Jiao Tong University, China"}], "abstract": "When MLLMs fail at Science, Technology, Engineering, and Mathematics (STEM) visual reasoning, a fundamental question arises: is it due to perceptual deficiencies or reasoning limitations? Through systematic scaling analysis that independently scales perception and reasoning components, we uncover a critical insight: scaling perception consistently outperforms scaling reasoning. This reveals perception as the true lever limiting current STEM visual reasoning. Motivated by this insight, our work focuses on systematically enhancing the perception capabilities of MLLMs by establishing code as a powerful perceptual medium\u2014executable code provides precise semantics that naturally align with the structured nature of STEM visuals. Specifically, we construct ICC-1M, a large-scale dataset comprising 1M Image-Caption-Code triplets that materializes this code-as-perception paradigm through two complementary approaches: (1) Code-Grounded Caption Generation treats executable code as ground truth for image captions, eliminating the hallucinations inherent in existing knowledge distillation methods; (2) STEM Image-to-Code Translation prompts models to generate reconstruction code, mitigating the ambiguity of natural language for perception enhancement. To validate this paradigm, we further introduce STEM2Code-Eval, a novel benchmark that directly evaluates visual perception in STEM domains. Unlike existing work relying on problem-solving accuracy as a proxy that only measures problem-relevant understanding, our benchmark requires comprehensive visual comprehension through executable code generation for image reconstruction, providing deterministic and verifiable assessment.  Code will be available soon.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36770", "url": null, "sourceid": 37255, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40325, "uid": "2f838cade4a6012a6cb1016d1d8d95ed", "name": "FAITHFUL CONTOURING: NEAR-LOSSLESS 3D VOXEL REPRESENTATION FREE FROM ISO-SURFACE", "authors": [{"id": 183591, "fullname": "Yihao Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/183591?format=json", "institution": "Imperial College London"}, {"id": 188934, "fullname": "Xianglong He", "url": "http://cvpr.thecvf.com/api/miniconf/users/188934?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 188935, "fullname": "Chuanyu Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188935?format=json", "institution": null}, {"id": 87723, "fullname": "Yiwen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87723?format=json", "institution": "Nanyang Technological University"}, {"id": 188936, "fullname": "Jiaqi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188936?format=json", "institution": "Math Magic"}, {"id": 126956, "fullname": "Yangguang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/126956?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 151075, "fullname": "Wanli Ouyang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151075?format=json", "institution": "Shanghai AI Lab"}, {"id": 188937, "fullname": "Yuanming Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188937?format=json", "institution": "Meshy AI"}, {"id": 149352, "fullname": "Guang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/149352?format=json", "institution": "Imperial College London"}, {"id": 188938, "fullname": "Choon Hwai Yap", "url": "http://cvpr.thecvf.com/api/miniconf/users/188938?format=json", "institution": "Imperial College London"}], "abstract": "Accurate and efficient voxelized representations of 3D meshes are the foundation of 3D reconstruction and generation. However, existing representations based on iso-surface heavily rely on water-tightening or rendering optimization, which inevitably compromise geometric fidelity. We propose Faithful Contouring, a sparse voxelized representation that supports 2048+ resolutions for arbitrary meshes, requiring neither converting meshes to field functions nor extracting the isosurface during remeshing. It achieves near-lossless fidelity by preserving sharpness and internal structures, even for challenging cases with complex geometry and topology. The proposed method also shows flexibility for texturing, manipulation, and editing. Beyond representation, we design a dual-mode autoencoder for Faithful Contouring, enabling scalable and detail-preserving shape reconstruction. Extensive experiments show that Faithful Contouring surpasses existing methods in accuracy and efficiency for both representation and reconstruction. For direct representation, it achieves distance errors at the $10^{-5}$ level; for mesh reconstruction, it yields a 93\\% reduction in Chamfer Distance and a 35\\% improvement in F-score over strong baselines, confirming superior fidelity as a representation for 3D learning tasks.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40325", "url": null, "sourceid": -36354, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38053?format=json"], "related_events_ids": [38053]}, {"id": 37588, "uid": "59039a3b52d947c16b7eb0060d7b57ea", "name": "Guiding a Diffusion Model by Swapping Its Tokens", "authors": [{"id": 180356, "fullname": "Weijia Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180356?format=json", "institution": "Shanghai Jiao Tong University / Alibaba"}, {"id": 166210, "fullname": "Yuehao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/166210?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 186981, "fullname": "Shanyan Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186981?format=json", "institution": "vivo"}, {"id": 187778, "fullname": "Wu Ran", "url": "http://cvpr.thecvf.com/api/miniconf/users/187778?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 186983, "fullname": "Yanhao Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/186983?format=json", "institution": "Future Imaging Area"}, {"id": 186984, "fullname": "Wei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186984?format=json", "institution": "vivo; Huawei Technologies Ltd."}, {"id": 86334, "fullname": "Chao Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/86334?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Classifier-Free Guidance (CFG) is a widely used inference-time technique to boost the image quality of diffusion models. Yet, its reliance on text conditions prevents its use in unconditional generation. We propose a simple method to enable CFG-like guidance for both conditional and unconditional generation. The key idea is to generate a perturbed prediction via simple token swap operations, and use the direction between it and the clean prediction to steer sampling toward higher-fidelity distributions. In practice, we swap pairs of most semantically dissimilar tokens in either spatial or channel dimensions.Unlike existing methods that apply perturbation in a global or  less constrained manner, our approach modifies only selected tokens, allowing finer control over perturbation and its influence on generated samples. Experiments on MS-COCO2014, MS-COCO 2017, and ImageNet datasets demonstrate that our Self-Swap Guidance (SSG), when applied to  state-of-the-art diffusion models, outperforms previous condition-free methods in image fidelity and prompt alignment under different set-ups. Its fine-grained perturbation granularity also improves robustness, reducing side-effects across a wider range of perturbation strengths. Overall, SSG extends CFG to a broader scope of applications including both conditional and unconditional generation, and can be readily inserted into any diffusion model as a plug-in to gain immediate improvements.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37588", "url": null, "sourceid": 38637, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40306?format=json"], "related_events_ids": [40306]}, {"id": 36881, "uid": "5d80f41d392a2f39804eae9eb91fe770", "name": "Arcadia: Toward a Full-Lifecycle Framework for Embodied Lifelong Learning", "authors": [{"id": 153654, "fullname": "Minghe Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153654?format=json", "institution": "Zhejiang University"}, {"id": 88890, "fullname": "Juncheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88890?format=json", "institution": "Zhejiang University"}, {"id": 186091, "fullname": "Yuze Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186091?format=json", "institution": "Zhejiang University"}, {"id": 186092, "fullname": "Xuqi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186092?format=json", "institution": "Zhejiang University"}, {"id": 186093, "fullname": "Jiaming Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/186093?format=json", "institution": "Peking University"}, {"id": 186094, "fullname": "Xiaoran Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186094?format=json", "institution": "Zhejiang University"}, {"id": 186095, "fullname": "Zihan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186095?format=json", "institution": "Zhejiang University"}, {"id": 149258, "fullname": "Xian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/149258?format=json", "institution": "Zhejiang University"}, {"id": 186096, "fullname": "Mingjie Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186096?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 76247, "fullname": "Wei Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/76247?format=json", "institution": "Nanjing University"}, {"id": 186097, "fullname": "Rong Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/186097?format=json", "institution": null}, {"id": 86545, "fullname": "Rui Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86545?format=json", "institution": "Kujiale.com"}, {"id": 186098, "fullname": "Qizhou Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186098?format=json", "institution": "Zhejiang University"}, {"id": 186099, "fullname": "Kai Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186099?format=json", "institution": null}, {"id": 84768, "fullname": "Jun Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84768?format=json", "institution": "Zhejiang University"}, {"id": 75891, "fullname": "Qi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75891?format=json", "institution": "University of Adelaide"}, {"id": 126909, "fullname": "Siliang Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126909?format=json", "institution": "Zhejiang University"}, {"id": 129046, "fullname": "Yueting Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129046?format=json", "institution": "Zhejiang University"}], "abstract": "We contend that embodied learning is fundamentally a lifecycle problem rather than a single-stage optimization. Systems that optimize only one link (data collection, simulation, learning, or deployment) rarely sustain improvement or generalize beyond narrow settings. We introduce Arcadia, a closed-loop framework that operationalizes embodied lifelong learning by tightly coupling four stages: (1) Self-evolving exploration and grounding for autonomous data acquisition in physical environments, (2) Generative scene reconstruction and augmentation for realistic and extensible scene creation, (3) a Shared embodied representation architecture that unifies navigation and manipulation within a single multimodal backbone, and (4) Sim-from-real evaluation and evolution that closes the feedback loop through simulation-based adaptation. This coupling is non-decomposable: removing any stage breaks the improvement loop and reverts to one-shot training. Arcadia delivers consistent gains on navigation and manipulation benchmarks and transfers robustly to physical robots, indicating that a tightly coupled lifecycle: continuous real-world data acquisition, generative simulation update, and shared-representation learning, supports lifelong improvement and end-to-end generalization. We release standardized interfaces enabling reproducible evaluation and cross-model comparison in reusable environments, positioning Arcadia as a scalable foundation for general-purpose embodied agents.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36881", "url": null, "sourceid": 34860, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40306, "uid": "59039a3b52d947c16b7eb0060d7b57ea", "name": "Guiding a Diffusion Model by Swapping Its Tokens", "authors": [{"id": 180356, "fullname": "Weijia Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180356?format=json", "institution": "Shanghai Jiao Tong University / Alibaba"}, {"id": 166210, "fullname": "Yuehao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/166210?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 186981, "fullname": "Shanyan Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186981?format=json", "institution": "vivo"}, {"id": 187778, "fullname": "Wu Ran", "url": "http://cvpr.thecvf.com/api/miniconf/users/187778?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 186983, "fullname": "Yanhao Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/186983?format=json", "institution": "Future Imaging Area"}, {"id": 186984, "fullname": "Wei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186984?format=json", "institution": "vivo; Huawei Technologies Ltd."}, {"id": 86334, "fullname": "Chao Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/86334?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Classifier-Free Guidance (CFG) is a widely used inference-time technique to boost the image quality of diffusion models. Yet, its reliance on text conditions prevents its use in unconditional generation. We propose a simple method to enable CFG-like guidance for both conditional and unconditional generation. The key idea is to generate a perturbed prediction via simple token swap operations, and use the direction between it and the clean prediction to steer sampling toward higher-fidelity distributions. In practice, we swap pairs of most semantically dissimilar tokens in either spatial or channel dimensions.Unlike existing methods that apply perturbation in a global or  less constrained manner, our approach modifies only selected tokens, allowing finer control over perturbation and its influence on generated samples. Experiments on MS-COCO2014, MS-COCO 2017, and ImageNet datasets demonstrate that our Self-Swap Guidance (SSG), when applied to  state-of-the-art diffusion models, outperforms previous condition-free methods in image fidelity and prompt alignment under different set-ups. Its fine-grained perturbation granularity also improves robustness, reducing side-effects across a wider range of perturbation strengths. Overall, SSG extends CFG to a broader scope of applications including both conditional and unconditional generation, and can be readily inserted into any diffusion model as a plug-in to gain immediate improvements.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40306", "url": null, "sourceid": -38637, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37588?format=json"], "related_events_ids": [37588]}, {"id": 39203, "uid": "5578e16e3443da7f7af10c70874a7fe2", "name": "Vision Transformers Need More Than Registers", "authors": [{"id": 184960, "fullname": "Cheng Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/184960?format=json", "institution": "University of Hong Kong"}, {"id": 89008, "fullname": "Yizhou Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89008?format=json", "institution": "The University of Hong Kong"}, {"id": 184961, "fullname": "Sibei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184961?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Vision Transformers (ViTs), when pre-trained on large-scale data, provide general-purpose representations for diverse downstream tasks. However, artifacts in ViTs are widely observed across different supervision paradigms and downstream tasks. Through systematic analysis of artifacts in ViTs, we find that their fundamental mechanisms have yet to be sufficiently elucidated. In this paper, through systematic analysis, we conclude that these artifacts originate from a lazy aggregation behavior: ViT uses semantically irrelevant background patches as shortcuts to represent global semantics, driven by global attention and Coarse-grained semantic supervision. Our solution selectively integrates patch features into the CLS token, reducing the influence of background-dominated shortcuts and consistently improving performance across 12 benchmarks under label-, text-, and self-supervision. We hope this work offers a new perspective on ViT behavior. All the code and weights will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39203", "url": null, "sourceid": 40843, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36401, "uid": "f3535a949e0637d4894131d516649a5f", "name": "WeaveTime: Streaming from Earlier Frames into Emergent Memory in VideoLLMs", "authors": [{"id": 184959, "fullname": "Yulin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184959?format=json", "institution": "ShanghaiTech University"}, {"id": 184960, "fullname": "Cheng Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/184960?format=json", "institution": "University of Hong Kong"}, {"id": 184961, "fullname": "Sibei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184961?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Recent advances in Multimodal Large Language Models have greatly improved visual understanding and reasoning, yet their quadratic attention and offline training protocols make them ill-suited for streaming settings where frames arrive sequentially and future observations are inaccessible. We diagnose a core limitation of current Video-LLMs, namely Time-Agnosticism, in which videos are treated as an unordered bag of evidence rather than a causally ordered sequence, yielding two failures in streams: temporal order ambiguity, in which the model cannot follow or reason over the correct chronological order, and past\u2013current focus blindness where it fails to distinguish present observations from accumulated history. We present WeaveTime, a simple, efficient, and model-agnostic framework that first teaches order and then uses order. We introduce a lightweight Temporal Reconstruction objective\u2014our Streaming Order Perception enhancement\u2014that instills order-aware representations with minimal finetuning and no specialized streaming data. At inference, a Past\u2013Current Dynamic Focus Cache performs uncertainty-triggered, coarse-to-fine retrieval, expanding history only when needed. Plugged into exsiting Video-LLM without architectural changes, WeaveTime delivers consistent gains on representative streaming benchmarks, improving accuracy while reducing latency. These results establish WeaveTime as a practical path toward time-aware stream Video-LLMs under strict online, time-causal constraints. Code and weights will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36401", "url": null, "sourceid": 30854, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36304, "uid": "5499df60b3f5b017d19ba4f03165deda", "name": "RoadGIE: Towards A Global-Scale Aerial Benchmark for Generalizable Interactive Road Extraction", "authors": [{"id": 184737, "fullname": "Chenxu Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184737?format=json", "institution": "Nankai University; Zhejiang University of Technology"}, {"id": 162852, "fullname": "Chenxu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/162852?format=json", "institution": "NanKai University"}, {"id": 184738, "fullname": "Yimian Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/184738?format=json", "institution": "Nankai University"}, {"id": 132148, "fullname": "Yongxiang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/132148?format=json", "institution": "National University of Defense Technology"}, {"id": 90540, "fullname": "Ming-Ming Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90540?format=json", "institution": "Nankai University, Tsinghua University"}, {"id": 130949, "fullname": "Xiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/130949?format=json", "institution": "Nankai University"}], "abstract": "Accurate road segmentation from aerial imagery is fundamental to many geospatial applications. However, existing datasets often suffer from limited scene diversity, low semantic granularity, and poor structural continuity, restricting their generalization across environments. To address these challenges, we introduce $WorldRoadSeg-360K$, the largest and most diverse road segmentation dataset to date, comprising 366,947 high-resolution images collected from 38 countries and 223 cities across various terrains and continents. $WorldRoadSeg-360K$ serves as a comprehensive benchmark and reveals key challenges in handling diverse and structurally complex scenes. Automated approaches often struggle to preserve road connectivity, while current interactive methods lack efficient, topology-sensitive tools for real-world road editing. To this end, we present $RoadGIE$, establishing a novel interactive paradigm for road extraction in remote sensing. Unlike prior point- or box-based prompting strategies, $RoadGIE$ supports connectivity-aware prompts, including clicks and scribbles, which inherently align with the topology of road networks. To improve structural consistency and mitigate performance degradation during iterative interactions, $RoadGIE$ integrates an expert-guided prompting strategy and adapts the skeleton-based recall loss for interactive scenarios. Meanwhile, to alleviate user intent ambiguity, RoadGIE introduces a topo-semantic instantiation during training to enhance interaction stability and consistency. $RoadGIE$ achieves state-of-the-art performance in both segmentation accuracy and topological consistency on $WorldRoadSeg-360K$ and other benchmarks, while maintaining efficient operation with only 3.7 million parameters and real-time processing capabilities.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36304", "url": null, "sourceid": 46349, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37233, "uid": "1ce018a9cf7f2480f079ce6bdd49af8a", "name": "FEAST: Fully Connected Expressive Attention for Spatial Transcriptomics", "authors": [{"id": 186971, "fullname": "Taejin Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186971?format=json", "institution": "Yonsei University"}, {"id": 182669, "fullname": "Joohyeok Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/182669?format=json", "institution": "Yonsei University"}, {"id": 153958, "fullname": "Jinyeong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/153958?format=json", "institution": "Yonsei University"}, {"id": 132222, "fullname": "Chanyoung Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/132222?format=json", "institution": "Emory University"}, {"id": 107168, "fullname": "Seong Jae Hwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107168?format=json", "institution": "Yonsei University"}], "abstract": "Spatial Transcriptomics (ST) provides spatially-resolved gene expression, offering crucial insights into tissue architecture and complex diseases. However, its prohibitive cost limits widespread adoption, leading to significant attention on inferring spatial gene expression from readily available whole slide images. While graph neural networks have been proposed to model interactions between tissue regions, their reliance on pre-defined sparse graphs prevents them from considering potentially interacting spot pairs, resulting in a structural limitation in capturing complex biological relationships. To address this, we propose FEAST (Fully connected Expressive Attention for Spatial Transcriptomics), an attention-based framework that models the tissue as a fully connected graph, enabling the consideration of all pairwise interactions. To better reflect biological interactions, we introduce negative-aware attention, which models both excitatory and inhibitory interactions, capturing essential negative relationships that standard attention often overlooks. Furthermore, to mitigate the information loss from truncated or ignored context in standard spot image extraction, we introduce an off-grid sampling strategy that gathers additional images from intermediate regions, allowing the model to capture a richer morphological context. Experiments on public ST datasets show that FEAST surpasses state-of-the-art methods in gene expression prediction while providing biologically plausible attention maps that clarify positive and negative interactions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37233", "url": null, "sourceid": 33366, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40307, "uid": "66b708d452969d6b22e8b5eaeb029b0d", "name": "Hearing the Room Through the Shape of the Drum: Modal-Guided Sound Recovery from Multi-Point Surface Vibrations", "authors": [{"id": 87958, "fullname": "Shai Bagon", "url": "http://cvpr.thecvf.com/api/miniconf/users/87958?format=json", "institution": "Weizmann Institute of Science, Israel"}, {"id": 187779, "fullname": "Matan Kichler", "url": "http://cvpr.thecvf.com/api/miniconf/users/187779?format=json", "institution": "Weizmann Institute of Science"}, {"id": 89397, "fullname": "Mark Sheinin", "url": "http://cvpr.thecvf.com/api/miniconf/users/89397?format=json", "institution": "Weizmann Institute of Science"}], "abstract": "Optical vibration sensing enables recovering the scene sound directly from the surface vibration of nearby objects, turning everyday objects into ``visual microphones''. However, most prior methods had focused on capturing the vibrations of specific objects with highly favorable vibration responses. These include objects where the surface vibrations are generated by the object itself (e.g., speaker membrane or guitar body) or objects consisting of a thin membrane which is highly reactive to sound (e.g., a chip bag or the leaf of a plant).In this paper, we tackle sound recovery for a more challenging class of solid objects whose vibration responses are poor or highly resonant. We simultaneously capture vibrations for multiple surface points on the object using a speckle-based vibrometry imaging system. Then, we derive a novel physics-guided vibration formation model that relates the scene sound source to the captured multi-point multi-axis vibrations via the object's vibrational modes. The model is then used to reverse the resonant transfer function of the vibrating object, fusing the plurality of vibration signals to estimate the original sound source of the scene. We evaluate our approach by recovering sound from a variety of everyday objects, demonstrating that it significantly outperforms traditional single-point speckle vibrometry in challenging scenarios where it performs poorly.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40307", "url": null, "sourceid": -43134, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37589?format=json"], "related_events_ids": [37589]}, {"id": 37589, "uid": "66b708d452969d6b22e8b5eaeb029b0d", "name": "Hearing the Room Through the Shape of the Drum: Modal-Guided Sound Recovery from Multi-Point Surface Vibrations", "authors": [{"id": 87958, "fullname": "Shai Bagon", "url": "http://cvpr.thecvf.com/api/miniconf/users/87958?format=json", "institution": "Weizmann Institute of Science, Israel"}, {"id": 187779, "fullname": "Matan Kichler", "url": "http://cvpr.thecvf.com/api/miniconf/users/187779?format=json", "institution": "Weizmann Institute of Science"}, {"id": 89397, "fullname": "Mark Sheinin", "url": "http://cvpr.thecvf.com/api/miniconf/users/89397?format=json", "institution": "Weizmann Institute of Science"}], "abstract": "Optical vibration sensing enables recovering the scene sound directly from the surface vibration of nearby objects, turning everyday objects into ``visual microphones''. However, most prior methods had focused on capturing the vibrations of specific objects with highly favorable vibration responses. These include objects where the surface vibrations are generated by the object itself (e.g., speaker membrane or guitar body) or objects consisting of a thin membrane which is highly reactive to sound (e.g., a chip bag or the leaf of a plant).In this paper, we tackle sound recovery for a more challenging class of solid objects whose vibration responses are poor or highly resonant. We simultaneously capture vibrations for multiple surface points on the object using a speckle-based vibrometry imaging system. Then, we derive a novel physics-guided vibration formation model that relates the scene sound source to the captured multi-point multi-axis vibrations via the object's vibrational modes. The model is then used to reverse the resonant transfer function of the vibrating object, fusing the plurality of vibration signals to estimate the original sound source of the scene. We evaluate our approach by recovering sound from a variety of everyday objects, demonstrating that it significantly outperforms traditional single-point speckle vibrometry in challenging scenarios where it performs poorly.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37589", "url": null, "sourceid": 43134, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40307?format=json"], "related_events_ids": [40307]}, {"id": 36578, "uid": "80ee6b4e99b7282ccb7d6688af1b6edc", "name": "OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding", "authors": [{"id": 130623, "fullname": "Minghang Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/130623?format=json", "institution": "Peking University"}, {"id": 185396, "fullname": "Zihao Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185396?format=json", "institution": null}, {"id": 185397, "fullname": "Yi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185397?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 87023, "fullname": "Yuxin Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87023?format=json", "institution": "Peking University"}, {"id": 89783, "fullname": "Yang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89783?format=json", "institution": "Peking University"}], "abstract": "Video Temporal Grounding (VTG), the task of localizing video segments from natural language queries, faces significant challenges in open-world applications. These challenges stem from the limited scale and semantic diversity of existing datasets, which lead to a performance gap between common and rare concepts. To overcome these limitations, we introduce OmniVTG, a new large-scale dataset for open-world VTG, coupled with a Self-Correction Chain-of-Thought (CoT) training paradigm designed to enhance the grounding capabilities of Multimodal Large Language Models (MLLMs). Our OmniVTG is constructed via a novel Semantic Coverage Iterative Expansion pipeline, which first identifies gaps in the vocabulary of existing datasets and collects videos that are highly likely to contain these target concepts. For high-quality annotation, we leverage the insight that modern MLLMs excel at dense captioning more than direct grounding and design a caption-centric data engine to prompt MLLMs to generate dense, timestamped descriptions. Beyond the dataset, we observe that simple supervised finetuning (SFT) is insufficient, as a performance gap between rare and common concepts still persists. We find that MLLMs' video understanding ability significantly surpasses their direct grounding ability. Based on this, we propose a Self-Correction Chain-of-Thought (CoT) training paradigm. We train the MLLM to first predict, then use its understanding capabilities to reflect on and refine its own predictions. This capability is instilled via a three-stage pipeline of SFT, CoT finetuning, and reinforcement learning. Extensive experiments show our approach not only excels at open-world grounding in our OmniVTG dataset but also achieves state-of-the-art zero-shot performance on four existing VTG benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36578", "url": null, "sourceid": 37158, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39111, "uid": "5162c8e2f5e54609e7f32ae984efe16a", "name": "TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models", "authors": [{"id": 191380, "fullname": "Jiaming He", "url": "http://cvpr.thecvf.com/api/miniconf/users/191380?format=json", "institution": "Ant Group"}, {"id": 191381, "fullname": "Guanyu Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191381?format=json", "institution": "University of Manchester"}, {"id": 87628, "fullname": "Hongwei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/87628?format=json", "institution": "University of Electronic Science and Technology of China, Tsinghua University"}, {"id": 191382, "fullname": "Zhicong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191382?format=json", "institution": "Ant Group"}, {"id": 191383, "fullname": "Kangjie Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191383?format=json", "institution": "Tianjin University"}, {"id": 76212, "fullname": "Yi Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76212?format=json", "institution": "Nanyang Technological University, Singapore"}, {"id": 191384, "fullname": "Wenbo Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191384?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 91735, "fullname": "Guowen Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/91735?format=json", "institution": "Nanyang Technological University"}, {"id": 87654, "fullname": "Tianwei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87654?format=json", "institution": "Nanyang Technological University"}], "abstract": "Text-to-Video (T2V) models are capable of synthesizing high-quality, temporally coherent dynamic video content, but the diverse generation also inherently introduces critical safety challenges. Existing safety evaluation methods, which focus on static image and text generation, are insufficient to capture the complex temporal dynamics in video generation. To address this, we propose a $\\textbf{TE}$mporal-aware $\\textbf{A}$utomated $\\textbf{R}$ed-teaming framework, named $\\textbf{TEAR}$, an automated framework designed to uncover safety risks specifically linked to the dynamic temporal sequencing of T2V models. TEAR employs a temporal-aware test generator optimized via a two-stage approach: initial generator training and temporal-aware online preference learning, to craft textually innocuous prompts that exploit temporal dynamics to elicit policy-violating video output. And a refine model is adopted to improve the prompt stealthiness and adversarial effectiveness cyclically. Extensive experimental evaluation demonstrates the effectiveness of TEAR across open-source and commercial T2V systems with over 80\\% attack success rate, a significant boost from prior best result of 57\\%.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39111", "url": null, "sourceid": 35252, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40357?format=json"], "related_events_ids": [40357]}, {"id": 39776, "uid": "372fb05aa270af730e0f6029a811f835", "name": "Forecasting 3D Scanpaths in Egocentric Video", "authors": [{"id": 86790, "fullname": "Fiona Ryan", "url": "http://cvpr.thecvf.com/api/miniconf/users/86790?format=json", "institution": "Georgia Institute of Technology"}, {"id": 126841, "fullname": "Ishwarya Ananthabhotla", "url": "http://cvpr.thecvf.com/api/miniconf/users/126841?format=json", "institution": "Meta Reality Labs Research"}, {"id": 149057, "fullname": "Yijun Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/149057?format=json", "institution": "META AI"}, {"id": 192830, "fullname": "Judy Hoffman", "url": "http://cvpr.thecvf.com/api/miniconf/users/192830?format=json", "institution": "University of California, Irvine"}, {"id": 95904, "fullname": "James Rehg", "url": "http://cvpr.thecvf.com/api/miniconf/users/95904?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 75447, "fullname": "Vamsi Krishna Ithapu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75447?format=json", "institution": "Facebook Reality Labs"}, {"id": 152865, "fullname": "Calvin Murdock", "url": "http://cvpr.thecvf.com/api/miniconf/users/152865?format=json", "institution": "Reality Labs Research"}], "abstract": "Forecasting gaze behavior is an important task for understanding user intent and creating AR/VR systems that can anticipate where users will look and interact next. While prior works have addressed predicting scanpaths in static images, forecasting gaze in egocentric videos presents new challenges due to the dynamic nature of the scene and the camera wearer\u2019s continuous movement through the 3D environment. To address these challenges, we formulate the novel task of egocentric scanpath prediction as forecasting a sequence of future fixations in 3D Cartesian coordinates relative to the last observed camera pose, producing a 3D scanpath that is grounded in the environment. We propose a transformer architecture that leverages egocentric video frames, head pose, and past 3D gaze observations to predict future 3D fixation sequences. We evaluate our method on the Aria Digital Twin dataset. Our findings establish a baseline for the novel task of 3D scanpath prediction and highlight important architectural elements for our task.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39776", "url": null, "sourceid": 45602, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36533, "uid": "35c0435bac5b49fc667bd23a5c49fea1", "name": "Robustness Under Data Scarcity: Few-Shot Continual Adversarial Training for Evolving Threats", "authors": [{"id": 155023, "fullname": "Wenxuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155023?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 183526, "fullname": "Chenglei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183526?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 185296, "fullname": "Chengzhi Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185296?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 76839, "fullname": "Xuelin Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/76839?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 88282, "fullname": "Yanning Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88282?format=json", "institution": "Northwestern Polytechnical University"}], "abstract": "Deep learning models remain highly vulnerable to evolving adversarial attacks. While existing continual adversarial training approaches often assume abundant adversarial data at each stage, real-world scenarios frequently involve limited data availability. This paper addresses the setting of Few-shot Continual Adversarial Training, where only a small number of adversarial examples are available per stage, presenting major challenges in achieving robust generalization and mitigating catastrophic forgetting. To tackle these challenges, we propose a novel continual adversarial training framework that incorporates three key components: (i) an Adversarial Margin loss that explicitly pushes clean samples away from decision boundaries to enhance feature discrimination; (ii) a Gaussian mixture model Prototype Replay strategy that synthesizes representative pseudo-features to preserve knowledge of past adversarial domains; and (iii) a Multi-Domain Balanced loss that guides updates to stabilize learning across diverse attack distributions. Extensive experiments on ImageNet-1K and CIFAR-100 demonstrate that our approach consistently outperforms state-of-the-art methods in both clean and robust accuracy across a variety of adversarial settings. The code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36533", "url": null, "sourceid": 35885, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38210, "uid": "7bd68c5248f3487b5837947038e2f943", "name": "GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction", "authors": [{"id": 189322, "fullname": "Ayesh Abu Lehyeh", "url": "http://cvpr.thecvf.com/api/miniconf/users/189322?format=json", "institution": "University of Vermont"}, {"id": 189323, "fullname": "Xiaohan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189323?format=json", "institution": "University of Vermont"}, {"id": 189324, "fullname": "Ahmad Arrabi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189324?format=json", "institution": "University of Vermont"}, {"id": 97909, "fullname": "Waqas Sultani", "url": "http://cvpr.thecvf.com/api/miniconf/users/97909?format=json", "institution": null}, {"id": 73542, "fullname": "Chen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/73542?format=json", "institution": "University of Central Florida"}, {"id": 189325, "fullname": "Safwan Wshah", "url": "http://cvpr.thecvf.com/api/miniconf/users/189325?format=json", "institution": "University of Vermont"}], "abstract": "Accurate and fast localization is vital for safe autonomous navigation in GPS-denied areas. Fine-Grained Cross-View Geolocalization (FG-CVG) aims to estimate the precise 2-Degree-of-Freedom (2-DoF) location of a ground image relative to a satellite image. However, current methods force a difficult trade-off, with high-accuracy models being slow for real-time use. In this paper, we introduce GeoFlow, a new approach that offers a lightweight and highly efficient framework that breaks this accuracy-speed trade-off. Our technique learns a direct probabilistic mapping, predicting the displacement (in distance and direction) required to correct any given location hypothesis. This is complemented by our novel inference algorithm, Iterative Refinement Sampling (IRS). Instead of trusting a single prediction, IRS refines a population of hypotheses, allowing them to iteratively 'flow' from random starting points to a robust, converged consensus. Even its iterative nature, this approach offers flexible inference-time scaling, allowing a direct trade-off between performance and computation without any re-training. Experiments on the KITTI and VIGOR datasets show that GeoFlow achieves state-of-the-art efficiency, running at real-time speeds of 29 FPS while maintaining competitive localization accuracy. This work opens a new path for the development of practical real-time geolocalization systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38210", "url": null, "sourceid": 44399, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37474, "uid": "21f0db5872591cdad329f08cd5a1f770", "name": "The Devil is in Attention Sharing: Improving Complex Non-rigid Image Editing Faithfulness via Attention Synergy", "authors": [{"id": 187539, "fullname": "Zhuo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187539?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 187540, "fullname": "Fanyue Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/187540?format=json", "institution": "National University of Singapore"}, {"id": 187541, "fullname": "Runze Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187541?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 88449, "fullname": "Jingjing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88449?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 88453, "fullname": "Lixin Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88453?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 85773, "fullname": "Angela Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/85773?format=json", "institution": "National University of Singapore"}, {"id": 88470, "fullname": "Wen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88470?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Training-free image editing with large diffusion models has become practical, yet faithfully performing complex non-rigid edits (e.g., pose or shape changes) remains highly challenging. We identify a key underlying cause: attention collapse in existing attention sharing mechanisms, where either positional embeddings or semantic features dominate visual content retrieval, leading to over-editing or under-editing.To address this issue, we introduce $\\textbf{SynPS}$, a method that $\\textbf{Syn}$ergistically leverages$\\textbf{P}$ositional embeddings and $\\textbf{S}$emantic information for faithful non-rigid image editing. We first propose an editing measurement that quantifies the required editing magnitude at each denoising step. Based on this measurement, we design an attention synergy pipeline that dynamically modulates the influence of positional embeddings, enabling SynPS to balance semantic modifications and fidelity preservation.By adaptively integrating positional and semantic cues, SynPS effectively avoids both over- and under-editing. Extensive experiments on public and newly curated benchmarks demonstrate the superior performance and faithfulness of our approach. Our code will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37474", "url": null, "sourceid": 33646, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39369, "uid": "a9ff8314bd4af3bf87c9037dceceff53", "name": "Visual Diffusion Models are Geometric Solvers", "authors": [{"id": 191940, "fullname": "Nir Goren", "url": "http://cvpr.thecvf.com/api/miniconf/users/191940?format=json", "institution": "Tel Aviv University"}, {"id": 178509, "fullname": "Shai Yehezkel", "url": "http://cvpr.thecvf.com/api/miniconf/users/178509?format=json", "institution": "Tel-Aviv University"}, {"id": 189400, "fullname": "Omer Dahary", "url": "http://cvpr.thecvf.com/api/miniconf/users/189400?format=json", "institution": "Snap Inc.; Tel Aviv University"}, {"id": 129547, "fullname": "Andrey Voynov", "url": "http://cvpr.thecvf.com/api/miniconf/users/129547?format=json", "institution": "Google Research"}, {"id": 88548, "fullname": "Or Patashnik", "url": "http://cvpr.thecvf.com/api/miniconf/users/88548?format=json", "institution": "Tel Aviv University"}, {"id": 87616, "fullname": "Daniel Cohen-Or", "url": "http://cvpr.thecvf.com/api/miniconf/users/87616?format=json", "institution": "Google"}], "abstract": "In this paper we show that visual diffusion models can serve as effective geometric solvers: they can directly reason about geometric problems by working in pixel space. We first demonstrate this on the Inscribed Square Problem, a long-standing problem in geometry that asks whether every Jordan curve contains four points forming a square. We then extend the approach to two other well-known hard geometric problems: the Steiner Tree Problem and the Maximum Area Polygon Problem. Our method treats each problem instance as an image and trains a standard visual diffusion model that transforms Gaussian noise into an image representing a valid approximate solution that closely matches the exact one. The model learns to transform noisy geometric structures into correct configurations, effectively recasting geometric reasoning as image generation. Unlike prior work that necessitates specialized architectures and domain-specific adaptations when applying diffusion to parametric geometric representations, we employ a standard visual diffusion model that operates on the visual representation of the problem. This simplicity highlights a surprising bridge between generative modeling and geometric problem solving. Beyond the specific problems studied here, our results point toward a broader paradigm: operating in image space provides a general and practical framework for approximating notoriously hard problems, and opens the door to tackling a far wider class of challenging geometric tasks.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39369", "url": null, "sourceid": 39712, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37263, "uid": "263674f3b69459e3f1d42bccb8996777", "name": "One Algorithm to Align Them All", "authors": [{"id": 187035, "fullname": "Boyi Pang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187035?format=json", "institution": "Harbin Institute of Technology"}, {"id": 187036, "fullname": "Savva Ignatyev", "url": "http://cvpr.thecvf.com/api/miniconf/users/187036?format=json", "institution": "Applied AI Institute"}, {"id": 187037, "fullname": "Vladimir Ippolitov", "url": "http://cvpr.thecvf.com/api/miniconf/users/187037?format=json", "institution": null}, {"id": 187038, "fullname": "Ramil Khafizov", "url": "http://cvpr.thecvf.com/api/miniconf/users/187038?format=json", "institution": "Applied AI Institute"}, {"id": 187039, "fullname": "Yurii Melnik", "url": "http://cvpr.thecvf.com/api/miniconf/users/187039?format=json", "institution": "Applied AI Institute"}, {"id": 86933, "fullname": "Oleg Voynov", "url": "http://cvpr.thecvf.com/api/miniconf/users/86933?format=json", "institution": "Applied AI Institute"}, {"id": 187040, "fullname": "Maksim Nakhodnov", "url": "http://cvpr.thecvf.com/api/miniconf/users/187040?format=json", "institution": "Moscow State University, Lomonosov Moscow State University; Constructor University"}, {"id": 129266, "fullname": "Aibek Alanov", "url": "http://cvpr.thecvf.com/api/miniconf/users/129266?format=json", "institution": "Artificial Intelligence Research Institute"}, {"id": 128724, "fullname": "Xiaopeng Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/128724?format=json", "institution": "Harbin Institute of Technology"}, {"id": 75925, "fullname": "Peter Wonka", "url": "http://cvpr.thecvf.com/api/miniconf/users/75925?format=json", "institution": "KAUST"}, {"id": 187041, "fullname": "Evgeny Burnaev", "url": "http://cvpr.thecvf.com/api/miniconf/users/187041?format=json", "institution": "Applied AI Institute"}], "abstract": "We suggest a new multi-modal algorithm for joint inference of paired structurally aligned samples with Rectified Flow models. While some existing methods propose a codependent generation process they do not view the problem of joint generation from a structural alignment perspective. Recent work uses Score Distillation Sampling to generate aligned 3D models, but SDS is known to be time-consuming, prone to mode collapse and often provides cartoonish results.  By contrast, our suggested approach relies on the joint transport of a segment in the sample space yielding faster computation at inference time. Our approach can be built on top of an arbitrary Rectified Flow model operating on the structured latent space.  We show the applicability of our method to the domains of image, video, and 3D shape generation using state-of-the-art baselines and evaluate it against both editing-based and joint inference-based competing approaches.  We demonstrate a high degree of structural alignment for the sample pairs obtained with our method and a high visual quality of the samples. Our method improves the state-of-the-art for image and video generation pipelines.  For 3D generation it is able to show comparable quality while working orders of magnitude faster.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37263", "url": null, "sourceid": 31313, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40014, "uid": "f6688fcc298d6dee5c1898e2b090621c", "name": "Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Modeling?", "authors": [{"id": 193299, "fullname": "Peter Yongho Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/193299?format=json", "institution": "Seoul National University"}, {"id": 193300, "fullname": "Juhyeon Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/193300?format=json", "institution": "Seoul National University"}, {"id": 193301, "fullname": "Jungwoo Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/193301?format=json", "institution": "Seoul National University"}, {"id": 193302, "fullname": "Jubin Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/193302?format=json", "institution": "Seoul National University"}, {"id": 193303, "fullname": "Jungwoo Seo", "url": "http://cvpr.thecvf.com/api/miniconf/users/193303?format=json", "institution": "Seoul National University"}, {"id": 193304, "fullname": "Jiook Cha", "url": "http://cvpr.thecvf.com/api/miniconf/users/193304?format=json", "institution": "Seoul National University"}, {"id": 88560, "fullname": "Taesup Moon", "url": "http://cvpr.thecvf.com/api/miniconf/users/88560?format=json", "institution": "Seoul National University"}], "abstract": "Modeling long-range spatiotemporal dynamics in functional Magnetic Resonance Imaging (fMRI) remains a key challenge due to the high dimensionality of the four-dimensional signals. Prior voxel-based models, although demonstrating excellent performance and interpretation capabilities, are constrained by prohibitive memory demands and thus can only capture limited temporal windows. To address this, we propose TABLeT (Two-dimensionally Autoencoded Brain Latent Transformer), a novel approach that tokenizes fMRI volumes using a pre-trained 2D natural image autoencoder. Each 3D fMRI volume is compressed into a compact set of continuous tokens, enabling long-sequence modeling with a simple Transformer encoder with limited VRAM. Across large-scale benchmarks including the UK-Biobank (UKB), Human Connectome Project (HCP), and ADHD-200 datasets, TABLeT outperforms existing models in multiple tasks, while demonstrating substantial gains in computational and memory efficiency over the state-of-the-art voxel-based method given the same input. Furthermore, we develop a self-supervised masked token modeling approach to pre-train TABLeT, which improves the model's performance for various downstream tasks.Our findings suggest a promising approach for scalable and interpretable spatiotemporal modeling of brain activity.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40014", "url": null, "sourceid": 45010, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37701, "uid": "2b9a13841f7514fa2ea25e98497b8e97", "name": "OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios", "authors": [{"id": 184073, "fullname": "Hong Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184073?format=json", "institution": "School of Computer Science and Engineering, Southeast University"}, {"id": 188040, "fullname": "Jingyu Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188040?format=json", "institution": "ZTE Corporation"}, {"id": 188041, "fullname": "Xiangkai Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188041?format=json", "institution": "ZTE Corporation"}, {"id": 188042, "fullname": "Kangni Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/188042?format=json", "institution": "ZTE Corporation"}, {"id": 188043, "fullname": "Yunchen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188043?format=json", "institution": "ZTE Corporation"}, {"id": 188044, "fullname": "Bin Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/188044?format=json", "institution": "ZTE Corporation"}, {"id": 188045, "fullname": "Xurui Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188045?format=json", "institution": "Brown University"}, {"id": 87706, "fullname": "Min-Ling Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87706?format=json", "institution": "Southeast University"}], "abstract": "Spatio-Temporal Video Grounding (STVG) aims to localize target objects in videos based on natural language descriptions. While Multimodal Large Language Models have shown promise, a significant gap remains between current models and real-world demands involving diverse objects and complex queries. We attribute this to limited benchmark scope, causing models to exhibit category bias, oversimplified reasoning, and poor linguistic robustness.To address these limitations, we introduce OmniGround, a comprehensive benchmark with 3,475 videos spanning 81 categories and complex real-world queries. We propose the Forward-Backward-Refinement (FBR) annotation pipeline for high-quality labels and DeepSTG, a systematic evaluation framework quantifying dataset quality beyond superficial statistics.Evaluations reveal performance average drops of 10.4% on complex real-world scenes, particularly with small/occluded objects and intricate spatial relations. Motivated by these, we propose PG-TAF, a training-free two-stage framework decomposing STVG into high-level temporal grounding and fine-grained spatio-temporal propagation. Experiments demonstrate PG-TAF achieves 25.6% and 35.6% improvements in m\\_tIoU and m\\_vIoU on OmniGround with consistent gains across four benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37701", "url": null, "sourceid": 45890, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36326, "uid": "f9148ba9f7fe304fd171caff200636ab", "name": "PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting", "authors": [{"id": 184778, "fullname": "Yixiao Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/184778?format=json", "institution": "Beijing Jiaotong University"}, {"id": 85328, "fullname": "Qingyong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/85328?format=json", "institution": "Beijing Jiaotong University"}, {"id": 184245, "fullname": "Wen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184245?format=json", "institution": "Beijing Jiaotong University"}, {"id": 184779, "fullname": "Zhicheng Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184779?format=json", "institution": "Beijing Jiaotong University"}], "abstract": "Unsupervised point cloud segmentation is critical for embodied intelligence and autonomous driving, as it mitigates the prohibitive cost of dense point-level annotations required by fully supervised methods. Integrating 2D pre-trained models such as SAM to supplement semantic information is a natural choice, yet this approach faces a fundamental mismatch between discrete 3D points and continuous 2D images. This mismatch leads to inevitable projection overlap and complex modality alignment, resulting in compromised semantic consistency across 2D-3D transfer.\u200b To address these limitations and achieve semantic-consistent segmentation, this paper proposes PointGS, a simple yet effective pipeline for unsupervised 3D point cloud segmentation. PointGS leverages 3D Gaussian Splatting as a unified intermediate representation to bridge the discrete-continuous domain gap. Input sparse point clouds are first reconstructed into dense 3D Gaussian spaces via multi-view observations, filling spatial gaps and encoding occlusion relationships to eliminate projection-induced semantic conflation. Multi-view dense images are rendered from the Gaussian space, with 2D semantic masks extracted via SAM, and semantics are distilled to 3D Gaussian primitives through contrastive learning to ensure consistent semantic assignments across different views. The Gaussian space is aligned with the original point cloud via two-step registration, and point semantics are assigned through nearest-neighbor search on labeled Gaussians. Experiments demonstrate that PointGS outperforms state-of-the-art unsupervised methods, achieving +0.9\\% mIoU on ScanNet-V2 and +2.8\\% mIoU on S3DIS, highlighting its effectiveness in label-free 3D segmentation.\u200b", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36326", "url": null, "sourceid": 30628, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37335, "uid": "9b0e182ca0eb4fa7e7e5958418aa8208", "name": "On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models", "authors": [{"id": 187192, "fullname": "Chongyang Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187192?format=json", "institution": "University of New South Wales (UNSW Sydney)"}, {"id": 187193, "fullname": "Mingsong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187193?format=json", "institution": "University of New South Wales"}, {"id": 152667, "fullname": "Haodong Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152667?format=json", "institution": "University of New South Wales"}, {"id": 88361, "fullname": "Dong Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/88361?format=json", "institution": "University of New South Wales"}], "abstract": "Multimodal Continual Instruction Tuning aims to continually enhance Large Vision-Language Models  by learning from new data without forgetting previously acquired knowledge. Mixture-of-Experts (MoE) architectures support this by adding new experts and expanding routers while keeping existing ones frozen. However, despite expert isolation, they still suffer from forgetting due to router drifting, where old-task tokens are mistakenly attracted to newly added experts, leading to performance degradation, \\ie, forgetting. We propose a dynamic MoE approach with drift-aware token assignment to regularize router drifting and mitigate forgetting. We analyze the failure mode and identify its link to how different token types are assigned during training. In particular, tokens with ambiguous assignments between old and new experts tend to cause problems, although some can still be benign or even beneficial.Motivated by this, our proposed LLaVA-DyMoE incrementally expands the MoE and learns with a two-fold regularization strategy that regularizes token assignment and dispatching by representing token types through their routing scores, reducing router drift. Our drift-aware token assignment guidance provides conditional guidance for ambiguous tokens to preserve old patterns, complemented by a pair of synergistic routing losses that enforce separation and promote new expert specialization. Extensive experiments demonstrate that our LLaVA-DyMoE outperforms baselines, achieving over a 7\\% increase in average accuracy and a 12\\% reduction in forgetting by mitigating this router-drift\u2013induced forgetting.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37335", "url": "zhaoc5.github.io/DyMoE", "sourceid": 37915, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40136, "uid": "18197405b180a434e40551201cd25c54", "name": "Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding", "authors": [{"id": 99349, "fullname": "Yue Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/99349?format=json", "institution": "University of Amsterdam"}, {"id": 126228, "fullname": "Qi Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/126228?format=json", "institution": "ETH Zurich, INSAIT Sofia"}, {"id": 193605, "fullname": "Runyi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193605?format=json", "institution": "Institute for Computer Science, Artificial Intelligence and Technology (INSAIT)"}, {"id": 193606, "fullname": "Mengjiao Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/193606?format=json", "institution": "Institute for Computer Science, Artificial Intelligence and Technology"}, {"id": 186317, "fullname": "Bin Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/186317?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 193607, "fullname": "Nikola Popovic", "url": "http://cvpr.thecvf.com/api/miniconf/users/193607?format=json", "institution": "Institute for Computer Science, Artificial Intelligence and Technology"}, {"id": 75973, "fullname": "Nicu Sebe", "url": "http://cvpr.thecvf.com/api/miniconf/users/75973?format=json", "institution": "University of Trento"}, {"id": 153394, "fullname": "Theo Gevers", "url": "http://cvpr.thecvf.com/api/miniconf/users/153394?format=json", "institution": "University of Amsterdam, University of Amsterdam"}, {"id": 75489, "fullname": "Luc Van Gool", "url": "http://cvpr.thecvf.com/api/miniconf/users/75489?format=json", "institution": "INSAIT, Sofia Un. St. Kliment Ohridski"}, {"id": 156198, "fullname": "Danda Paudel", "url": "http://cvpr.thecvf.com/api/miniconf/users/156198?format=json", "institution": "INSAIT, Sofia University"}, {"id": 88372, "fullname": "Martin R. Oswald", "url": "http://cvpr.thecvf.com/api/miniconf/users/88372?format=json", "institution": "University of Amsterdam"}], "abstract": "While 3DGS has emerged as a high-fidelity scene representation, encoding rich, general-purpose features directly from its primitives remains under-explored. We address this gap by introducing Chorus, a multi-teacher pretraining framework that learns a holistic feed-forward 3D Gaussian Splatting (3DGS) scene encoder by distilling complementary signals from 2D foundation models.  Chorus employs a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, encouraging a shared embedding space that captures signals from high-level semantics to fine-grained structure.We evaluate Chorus on a wide range of tasks: open-vocabulary semantic and instance segmentation, linear and decoder probing, as well as data-efficient supervision. Besides 3DGS, we also test Chorus on several benchmarks that only support point clouds by pretraining a variant using only Gaussians\u2019 centers, colors, estimated normals as inputs. Interestingly, this encoder shows strong transfer and outperforms the point clouds baseline while using $39.9\\times$ fewer training scenes. Finally, we propose a render-and-distill adaptation that facilitates out-of-domain finetuning. Our code and model will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40136", "url": null, "sourceid": 40319, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40395?format=json"], "related_events_ids": [40395]}, {"id": 38595, "uid": "1c3e364cbbce8b2c102bf0dd90ca89fa", "name": "ReFTA: Breaking the Weight Reconstruction Bottleneck in Tensorized Parameter-Efficient Fine-Tuning", "authors": [{"id": 190240, "fullname": "Jingjing Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190240?format=json", "institution": "The University of British Columbia"}, {"id": 190241, "fullname": "Anda Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190241?format=json", "institution": "Peking University"}, {"id": 190242, "fullname": "Qiangqiang Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190242?format=json", "institution": "University of British Columbia"}, {"id": 128968, "fullname": "Zhouchen Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/128968?format=json", "institution": "Peking University"}, {"id": 190243, "fullname": "Yankai Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190243?format=json", "institution": "University of British Columbia"}], "abstract": "Tensor\u2013based fine-tuning has attracted growing interest for its ability to reduce trainable parameters beyond matrix-based approaches such as LoRA and PiSSA, while capturing inter-layer correlations within networks. However, existing tensor-based methods typically require repeated reconstruction of model weights during training, leading to substantial computational and memory overhead. To overcome these limitations, we propose Reconstruction-Free Tensor-Based Adaptation (ReFTA), which offers four key advantages:  (1) it eliminates repeated explicit tensor reconstruction by exploiting the  algebraic properties of tensors; (2) it achieves lower quantization error by fine-tuning only the principal tensor components;  (3) it is supported by a rigorous generalization guarantee rooted in the algebraic foundations of tensor product\u2013based approaches; and  (4) it adopts a unified design controlled by a single tensor rank configuration. Extensive experiments on both image classification (IC) and natural language understanding (NLU) tasks demonstrate that ReFTA achieves the best accuracy\u2013efficiency trade-off among all evaluated methods. Across most cases, ReFTA attains the highest average accuracy with the fewest trainable parameters. On NLU tasks with RoBERTa-Large, ReFTA improves the average accuracy by approximately 5% over most existing methods while using only 86.4% fewer parameters than LoRA (r=1) and  97.5% fewer than PiSSA. In particular, ReFTA achieves substantially lower peak GPU memory consumption, reducing usage by over 30% compared with tensor-based baselines on the RTE dataset and demonstrating markedly improved scalability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38595", "url": null, "sourceid": 43832, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37526, "uid": "ecb369befefc67e57056c2c89ab1af20", "name": "Premier: Personalized Preference Modulation with Learnable User Embedding in Text-to-Image Generation", "authors": [{"id": 151786, "fullname": "Zihao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151786?format=json", "institution": "Harbin Institute of Technology"}, {"id": 89750, "fullname": "Yuxiang Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/89750?format=json", "institution": "The Hong Kong Polytechnic University, Hong Kong Polytechnic University"}, {"id": 187647, "fullname": "Xinpeng Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/187647?format=json", "institution": null}, {"id": 86105, "fullname": "Tianyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86105?format=json", "institution": "Beihang University"}, {"id": 187648, "fullname": "Tao Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187648?format=json", "institution": "duxiaoman"}, {"id": 126241, "fullname": "Yalong Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/126241?format=json", "institution": "JD AI Research"}, {"id": 89122, "fullname": "Hongzhi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89122?format=json", "institution": "School of Computer Science and Technology, Harbin Institute of Technology"}, {"id": 84797, "fullname": "Wangmeng Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/84797?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Text-to-image generation has advanced rapidly, yet it still struggles to capture the nuanced user preferences. Existing approaches typically rely on multimodal large language models to infer user preferences, but the derived prompts or latent codes rarely reflect them faithfully, leading to suboptimal personalization.We present Premier, a novel preference modulation framework for personalized image generation. Premier represents each user's preference as a learnable embedding and introduces a preference adapter that fuses the user embedding with the text prompt.To enable accurate and fine-grained preference control, the fused preference embedding is further used to modulate the generative process.To enhance the distinctness of individual preference and improve alignment between outputs and user-specific styles, we incorporate a dispersion loss that enforces separation among user embeddings. When user data are scarce, new users are represented as linear combinations of existing preference embeddings learned during training, enabling effective generalization.Experiments show that Premier outperforms prior methods under the same history length, achieving stronger preference alignment and superior performance on text consistency, ViPer proxy metrics, and expert evaluations.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37526", "url": null, "sourceid": 43361, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39513, "uid": "5246c96db29909e5fc9432e1db33c2b5", "name": "Masked Region Transformer for Layered Image Generation and Editing at Scale", "authors": [{"id": 153161, "fullname": "Zhicong Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153161?format=json", "institution": "Tsinghua University"}, {"id": 135784, "fullname": "Jingye Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/135784?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 131107, "fullname": "Zhao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131107?format=json", "institution": "Sensetime Research"}, {"id": 192240, "fullname": "Mohan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192240?format=json", "institution": "Canva; Harbin Institute of Technology"}, {"id": 192241, "fullname": "Yuchi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192241?format=json", "institution": null}, {"id": 130360, "fullname": "Yifan Pu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130360?format=json", "institution": "Tsinghua University"}, {"id": 126241, "fullname": "Yalong Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/126241?format=json", "institution": "JD AI Research"}, {"id": 192242, "fullname": "Ethan Smith", "url": "http://cvpr.thecvf.com/api/miniconf/users/192242?format=json", "institution": "Canva"}, {"id": 192243, "fullname": "Yuhui Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192243?format=json", "institution": "Canva"}], "abstract": "Layered image generation and editing is a fundamental capability that enables layer-wise reuse, editing, and composition of the generated visual content, analogous to word-level editing in natural language. Despite its importance, this remains an underexplored area at scale. To address this gap, we present the Masked Region Transformer, a 20B-parameter diffusion model tailored for multi-layer transparent image generation and editing, trained on over 10M multilingual design samples spanning diverse aspect ratios and textual prompts. To fully leverage this scale, we make three key technical contributions. First, we unify three complementary tasks---text-to-layers, image-to-layers, and layers-to-layers---within a shared masked region diffusion framework, where selective token masking enables flexible cross-modal generation and fine-grained layer-wise editing. Second, we design an efficient conditional diffusion decoder that incorporates Gated DeltaNet and gated attention mechanisms, enhancing visual fidelity while maintaining computational efficiency. Third, we introduce an overflow-aware canvas layer to handle boundary inconsistencies and support semi-transparent background synthesis, enabling complete editable layer generation beyond visible canvas boundaries. Additionally, we apply distribution matching distillation to achieve one-step, real-time multi-layer generation with minimal quality degradation. Extensive experiments demonstrate that our framework substantially outperforms prior state-of-the-art approaches across all three tasks, establishing a new benchmark for region-aware transparent image generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39513", "url": null, "sourceid": 40520, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38008, "uid": "598ab90b05ae513d138400065f45927d", "name": "SARL-STG: A Spatially Aware Reinforcement Learning Framework for Refining MLLMs in Spatio-Temporal Video Grounding", "authors": [{"id": 184073, "fullname": "Hong Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184073?format=json", "institution": "School of Computer Science and Engineering, Southeast University"}, {"id": 188041, "fullname": "Xiangkai Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188041?format=json", "institution": "ZTE Corporation"}, {"id": 188044, "fullname": "Bin Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/188044?format=json", "institution": "ZTE Corporation"}, {"id": 188822, "fullname": "Junjie Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188822?format=json", "institution": "ZTE Corporation"}, {"id": 188823, "fullname": "Fangyu Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188823?format=json", "institution": "ZTE Corporation"}, {"id": 188824, "fullname": "Yutong Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188824?format=json", "institution": "ZTE Corporation"}, {"id": 188825, "fullname": "Xiugang Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/188825?format=json", "institution": "ZTE Corporation"}, {"id": 188045, "fullname": "Xurui Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188045?format=json", "institution": "Brown University"}, {"id": 87706, "fullname": "Min-Ling Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87706?format=json", "institution": "Southeast University"}], "abstract": "Spatio-Temporal Video Grounding (STVG) requires models to localize objects both spatially and temporally. Despite recent progress, existing methods struggle with complex and fine-grained spatial semantics in language descriptions, leading to error propagation from temporal to spatial grounding stages. We identify that this fundamental limitation arises from the absence of iterative refinement between temporal and spatial predictions. To address these challenges, we propose SARL-STG, the first RL-based framework for STVG. It progressively refines spatio-temporal grounding through multi-stage optimization, leveraging reinforcement learning to enable dynamic interaction between temporal and spatial modules, where spatial grounding quality serves as feedback to improve temporal localization. Specifically, SARL-STG contains: (1) a unified architecture that seamlessly integrates a pretrained MLLM for temporal reasoning with an open-vocabulary detector for spatial localization, (2) a hierarchical RL training strategy that progresses from coarse temporal to fine-grained spatio-temporal optimization, and (3) a spatial knowledge-injected reward mechanism that uses spatial grounding confidence as discriminative signals for temporal refinement. To facilitate training at scale, we also construct STVG-Wild, a large-scale dataset with diverse spatio-temporal annotations. Experiments demonstrate that our method achieves state-of-the-art performance on multiple benchmarks (HCSTVG, VidSTG, Charades-STA, etc.), significantly reducing error accumulation and enhances both temporal and spatial grounding accuracy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38008", "url": null, "sourceid": 43110, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38357, "uid": "b6f5a67bcd2f8ec4ff7e5eff7d14554c", "name": "GeniNav: Generative Model Driven Image-Goal Navigation via Imagination-Guided Consistency Flow Matching", "authors": [{"id": 189705, "fullname": "Yuqi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189705?format=json", "institution": "Nanyang Technological University"}, {"id": 189706, "fullname": "GAO JUNJIE", "url": "http://cvpr.thecvf.com/api/miniconf/users/189706?format=json", "institution": "Nanyang Technological University"}, {"id": 189707, "fullname": "Pan Yongzhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/189707?format=json", "institution": "Nanyang Technological University"}, {"id": 189708, "fullname": "Siyuan Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/189708?format=json", "institution": "Nanyang Technological University"}, {"id": 174938, "fullname": "ZIXUAN ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/174938?format=json", "institution": "Nanyang Technological University "}, {"id": 184069, "fullname": "Jiaping Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184069?format=json", "institution": "Nanyang Technological University"}, {"id": 189709, "fullname": "Mir Feroskhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189709?format=json", "institution": "Nanyang Technological University"}], "abstract": "Image-goal navigation driven by generative models has recently shown strong potential owing to their ability to perform multi-modal reasoning and stable learning in continuous control spaces.Despite their promise, current methods still face several fundamental limitations.Many rely on pre-built priors and lack explicit mechanisms for trajectory evaluation, restricting generalization and goal alignment in map-free navigation. Moreover, current generative policies often face inefficiency or temporal inconsistency, resulting in temporally unstable motion. The absence of interactive, closed-loop benchmarks further limits fair and reproducible comparison.To address these issues, we propose GeniNav, a generative image-goal navigation framework that couples a VLM-driven latent subgoal imagination module for high-level semantic guidance with Multi-Segment Consistency Flow Matching (MS-CFM) for temporally smooth and dynamically coherent motion generation. A hybrid trajectory evaluation module further integrates semantic alignment and geometric feasibility to assess goal consistency.We also introduce a closed-loop simulation benchmark with a large-scale dataset spanning 176 scenes and 491.6 km for standardized training and evaluation. Extensive experiments in simulation and on real robots demonstrate the effectiveness of our method.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38357", "url": null, "sourceid": 45791, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38295, "uid": "9b6b49274600928e603580cb381b299e", "name": "Anomaly as Non-Conformity via Training-Free Graph Laplacian Energy Minimization", "authors": [{"id": 189532, "fullname": "Jungwook Seo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189532?format=json", "institution": "Hanyang Universty"}, {"id": 189533, "fullname": "Minjeong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/189533?format=json", "institution": "Hanyang University"}, {"id": 189534, "fullname": "Younkwan Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/189534?format=json", "institution": "Samsung Electronics"}, {"id": 189535, "fullname": "Seungho Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189535?format=json", "institution": "Samsung"}, {"id": 91550, "fullname": "Sungyong Baik", "url": "http://cvpr.thecvf.com/api/miniconf/users/91550?format=json", "institution": "Hanyang University"}], "abstract": "Detecting subtle visual anomalies in images remains challenging, particularly when only normal samples are available a priori. Such unsupervised anomaly detection is typically solved by measuring feature similarity of a query patch to a memory of normal patches. However, similarity alone does not reveal how strongly a query patch violates the structure of the normal feature manifold. We propose a training-free Laplacian graph energy optimization formulation, named ANoCo that scores Anomaly by the cost of Non-Conformity of a query patch to align with a fixed normal manifold. For each query patch, we construct a bipartite query to normal graph weighted by cosine affinity, explicitly removing query-query and normal-normal edges to prevent evidence dilution.  We formulate anomaly scoring as a convex Laplacian energy with anchored normal nodes, and solve in closed form.  In particular, we do not use the optimized features themselves\u2014the anomaly score is the magnitude of the update required to satisfy normality constraints, reframing the graph Laplacian as a non-conformity operator rather than a smoothing prior.  The proposed method introduces no learnable parameters, message passing, or sampling, and has complexity comparable to a single linear solve.  Across standard benchmarks, it delivers strong image-level AUROC, stable localization maps, and improved robustness over prior methods, demonstrating the effectiveness of using optimization-induced feature drift as anomaly measure.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38295", "url": null, "sourceid": 45216, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37950, "uid": "db4d9c25578b08df3afd5599a2e7099d", "name": "EarlyTom: Early Token Compression Completes Fast Video Understanding", "authors": [{"id": 177740, "fullname": "Hesong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177740?format=json", "institution": "Westlake University"}, {"id": 188659, "fullname": "Xin Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188659?format=json", "institution": "Westlake University"}, {"id": 181072, "fullname": "Lu Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181072?format=json", "institution": "Hangzhou AliCloud Apsara Information Technology Co., Ltd."}, {"id": 188660, "fullname": "Chenhaowen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188660?format=json", "institution": "Alibaba Group"}, {"id": 188661, "fullname": "Jian Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188661?format=json", "institution": "Alibaba Group"}, {"id": 188662, "fullname": "Qiang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188662?format=json", "institution": "Alibaba Group"}, {"id": 87566, "fullname": "Huan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87566?format=json", "institution": "Northeastern University"}], "abstract": "Video large language models (Video-LLMs) have demonstrated strong capabilities in video understanding tasks. However, their practical deployment is still hindered by the inefficiency introduced by processing massive amounts of visual tokens. Although recent approaches achieve extremely low token retention ratios while maintaining accuracy comparable to full-token baselines, most of them perform compression only at the late stage of prefilling, leaving the efficiency of the vision encoder unoptimized. In this paper, we first show that vision encoding contributes a large portion to the time-to-first-token (TTFT). Therefore, instead of compressing visual tokens only after the vision encoder, performing compression inside the encoder still leaves substantial room for exploration. Based on this insight, we propose EarlyTom, a training-free token compression framework that performs early-stage visual token compression inside the vision encoder, enabling significantly better TTFT reduction and higher throughput. In addition, we introduce a decoupled spatial token selection strategy that improves the overall compression effectiveness. EarlyTom reduces TTFT by up to 2.65$\\times$ and FLOPs by up to 61\\% on a single NVIDIA A100 GPU for the LLaVA-OneVision-7B model, while maintaining accuracy comparable to the full-token baseline. These improvements substantially enhance the practicality of deploying Video-LLMs in real-world production scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37950", "url": null, "sourceid": 33399, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36267, "uid": "4fbab5b4444f903987961d84f9821488", "name": "M$^3$KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation", "authors": [{"id": 152126, "fullname": "Hyeongcheol Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/152126?format=json", "institution": "Korea University"}, {"id": 159431, "fullname": "Jiyoung Seo", "url": "http://cvpr.thecvf.com/api/miniconf/users/159431?format=json", "institution": "Korea University"}, {"id": 184624, "fullname": "Jaewon Mun", "url": "http://cvpr.thecvf.com/api/miniconf/users/184624?format=json", "institution": "Korea University"}, {"id": 184625, "fullname": "Hogun Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/184625?format=json", "institution": "Sungkyunkwan University"}, {"id": 85816, "fullname": "Wonmin Byeon", "url": "http://cvpr.thecvf.com/api/miniconf/users/85816?format=json", "institution": "NVIDIA"}, {"id": 130947, "fullname": "Sung June Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/130947?format=json", "institution": "Korea University"}, {"id": 154405, "fullname": "Hyeonsoo Im", "url": "http://cvpr.thecvf.com/api/miniconf/users/154405?format=json", "institution": "Hanwhasystem"}, {"id": 184626, "fullname": "JeungSub Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/184626?format=json", "institution": "Hanhwa Group"}, {"id": 130426, "fullname": "Sangpil Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/130426?format=json", "institution": "Korea University"}], "abstract": "Retrieval-Augmented Generation (RAG) has recently been extended to multimodal settings, connecting multimodal large language models (MLLMs) with vast corpora of external knowledge such as multimodal knowledge graphs (MMKGs). Despite their recent success, multimodal RAG in the audio-visual domain remains challenging due to 1) limited modality coverage and multi-hop connectivity of existing MMKGs, and 2) retrieval based solely on similarity in a shared multimodal embedding space, which fails to filter out off-topic or redundant knowledge. To address these limitations, we propose M$^3$KG-RAG, a Multi-hop Multimodal Knowledge Graph-enhanced RAG that retrieves query-aligned audio-visual knowledge from MMKGs, improving reasoning depth and answer faithfulness in MLLMs. Specifically, we devise a lightweight multi-agent pipeline to construct multi-hop MMKG (M$^3$KG), which contains context-enriched triplets of multimodal entities, enabling modality-wise retrieval based on input queries. Furthermore, we introduce GRASP (Grounded Retrieval And Selective Pruning), which ensures precise entity grounding to the query, evaluates answer-supporting relevance, and prunes redundant context to retain only knowledge essential for response generation. Extensive experiments across diverse multimodal benchmarks demonstrate that M$^3$KG-RAG significantly enhances MLLMs\u2019 multimodal reasoning and grounding over existing approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36267", "url": null, "sourceid": 34825, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36599, "uid": "442465f5282183631234848d916ce365", "name": "On the Role of Temporal Granularity in the Robustness of Spiking Neural Networks", "authors": [{"id": 181679, "fullname": "Mengting Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181679?format=json", "institution": "Zhejiang University"}, {"id": 153814, "fullname": "Shi Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153814?format=json", "institution": "University of Electronic Science and Technology of China, Tsinghua University"}, {"id": 185438, "fullname": "Peng Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185438?format=json", "institution": "Zhejiang University"}, {"id": 185439, "fullname": "De Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/185439?format=json", "institution": "Zhejiang University"}, {"id": 87278, "fullname": "Huajin Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87278?format=json", "institution": "Zhejiang University"}, {"id": 87673, "fullname": "Qian Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87673?format=json", "institution": "Zhejiang University"}, {"id": 87277, "fullname": "Gang Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87277?format=json", "institution": "Zhejiang University"}], "abstract": "As the third generation of neural networks, Spiking Neural Networks (SNNs) have demonstrated remarkable potential across diverse applications owing to their unique temporal dynamics. In recent years, analyzing the robustness of SNNs from a temporal perspective has become an emerging research focus. However, most existing works examine only the overall temporal behavior of SNNs, typically applying adversarial attacks that rely on time-averaged gradients.In this study, we revisit SNN robustness through the lens of temporal granularity, emphasizing the distinct behaviors that occur at individual time steps. We first introduce a Temporal Granularity Attack (TG-Attack), which selectively perturbs gradients at specific time steps. This approach enables a finer-grained evaluation of SNN robustness across time and demonstrates higher attack success rates than traditional gradient-averaging methods.Furthermore, we theoretically show that the robustness of SNNs at a given time step is determined by the Hessian of the input\u2013output gradient at that step, which we define as Temporal Sensitivity (TS). By calculating the Temporal Sensitivity Value (TSV) for each time step, robustness can be effectively estimated without generating adversarial examples. Finally, we propose a Temporal Granularity Regularization (TG-Reg) term that constrains the TSV across all time steps, thereby improving the model\u2019s overall robustness. Experimental evaluations confirm that our framework consistently outperforms existing state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36599", "url": null, "sourceid": 34514, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39828, "uid": "a75aa7d6160524eaef78d011e48f6c5c", "name": "Order Matters: 3D Shape Generation from Sequential VR Sketches", "authors": [{"id": 132008, "fullname": "Yizi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/132008?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 103668, "fullname": "Sidi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/103668?format=json", "institution": "ETH Zurich"}, {"id": 192933, "fullname": "Tianyi Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192933?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 192934, "fullname": "Nina Wiedemann", "url": "http://cvpr.thecvf.com/api/miniconf/users/192934?format=json", "institution": "Intel; ETH Zurich"}, {"id": 92511, "fullname": "Loic Landrieu", "url": "http://cvpr.thecvf.com/api/miniconf/users/92511?format=json", "institution": "ENPC"}], "abstract": "VR sketching lets users explore and iterate on ideas directly in 3D, offering a faster and more intuitive alternative to conventional CAD software. However, existing sketch-to-shape models ignore the temporal ordering of strokes, discarding crucial cues about structure and design intent. We introduce VRSketch2Shape, the first framework and multi-category dataset for 3D shape generation from sequential VR sketches. Our contributions are threefold: (i) an automated pipeline that generates ordered VR sketches from arbitrary shapes, (ii) a dataset comprising over 20k synthetic and 900 hand-drawn sketch\u2013shape pairs across four categories, and (iii) an order-aware sketch encoder coupled with a diffusion-based 3D generator. Our approach yields higher geometric fidelity than prior work and generalizes effectively from synthetic to real sketches with minimal supervision. All data and models will be released in open access.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39828", "url": null, "sourceid": 45203, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38098, "uid": "5a271f2b9dd5fd7223ec585c6ec1fe69", "name": "Robust Spiking Neural Networks by Temporal Mutual Information", "authors": [{"id": 181679, "fullname": "Mengting Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181679?format=json", "institution": "Zhejiang University"}, {"id": 153814, "fullname": "Shi Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153814?format=json", "institution": "University of Electronic Science and Technology of China, Tsinghua University"}, {"id": 185438, "fullname": "Peng Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185438?format=json", "institution": "Zhejiang University"}, {"id": 185439, "fullname": "De Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/185439?format=json", "institution": "Zhejiang University"}, {"id": 87278, "fullname": "Huajin Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87278?format=json", "institution": "Zhejiang University"}, {"id": 87673, "fullname": "Qian Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87673?format=json", "institution": "Zhejiang University"}, {"id": 87277, "fullname": "Gang Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87277?format=json", "institution": "Zhejiang University"}], "abstract": "Spiking Neural Networks (SNNs) have attracted increasing attention for their biologically inspired temporal dynamics. As their applications expand, understanding their robustness has become an important research focus. However, little is known about how the intrinsic temporal properties of SNNs affect robustness. In this work, we revisit SNN robustness from an information-theoretic perspective and reveal the pivotal role of temporal dynamics. We establish a theoretical link between robustness error and the mutual information (MI) between inputs and latent representations along the temporal dimension, grounded in the information bottleneck principle. Through an analysis of spike-based information transmission, we show that temporal dynamics inherently compress MI, thereby tightening the robustness error bound. Building on this insight, we propose a Temporal Mutual Information (TMI) regularizer that explicitly exploits temporal characteristics to enhance robustness. Extensive experiments on CIFAR-10, CIFAR-100, DVS-CIFAR10, and Tiny-ImageNet demonstrate that our method consistently improves SNN robustness across various architectures and attack settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38098", "url": null, "sourceid": 41070, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39799, "uid": "9ba2c76cf4e1f1ce6701481c13964f20", "name": "Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality", "authors": [{"id": 180126, "fullname": "Zekai Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/180126?format=json", "institution": "Zhejiang University"}, {"id": 192880, "fullname": "Zongze Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/192880?format=json", "institution": "Zhejiang University"}, {"id": 192881, "fullname": "Zhouhang Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192881?format=json", "institution": "Zhejiang University"}, {"id": 180084, "fullname": "Hao Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/180084?format=json", "institution": "Zhejiang University"}, {"id": 129165, "fullname": "Muzhi Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129165?format=json", "institution": "Zhejiang University"}, {"id": 127962, "fullname": "Wen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127962?format=json", "institution": "Zhejiang University"}, {"id": 192882, "fullname": "Yuling Xi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192882?format=json", "institution": "Zhejiang University"}, {"id": 192883, "fullname": "Chenchen Jing", "url": "http://cvpr.thecvf.com/api/miniconf/users/192883?format=json", "institution": "Zhejiang University of Technology"}, {"id": 185384, "fullname": "Hao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185384?format=json", "institution": "Zhejiang University"}, {"id": 86450, "fullname": "Chunhua Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/86450?format=json", "institution": "Zhejiang University"}], "abstract": "Video face swapping is crucial in film and entertainment production, where achieving high fidelity and temporal consistency over long and complex video sequences remains a significant challenge. Inspired by recent advances in reference-guided image editing, we explore whether rich visual attributes from source videos can be similarly leveraged to enhance both fidelity and temporal coherence in video face swapping. This work presents LivingSwap, the first video reference guided face swapping model. Our approach employs keyframes as conditioning signals to inject the target identity, enabling flexible and controllable editing. By combining keyframe conditioning with video reference guidance, the model performs temporal stitching to ensure stable identity preservation and high-fidelity reconstruction across long video sequences. To address the scarcity of data for reference-guided training, we construct a paired face-swapping dataset, Face2Face, and further reverse the data pairs to ensure reliable ground-truth supervision. Extensive experiments demonstrate that our method achieves state-of-the-art results, seamlessly integrating the target identity with the source video\u2019s expressions, lighting, and motion, while significantly reducing manual effort in production workflows.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39799", "url": null, "sourceid": 36391, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40148, "uid": "016d1a9ec4760a10dedf95556c8f7a23", "name": "Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection", "authors": [{"id": 193630, "fullname": "Tianxiao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193630?format=json", "institution": "University of Liverpool"}, {"id": 151999, "fullname": "Zhenglin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151999?format=json", "institution": "The University of Liverpool"}, {"id": 193631, "fullname": "Haiquan Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193631?format=json", "institution": "University of Liverpool"}, {"id": 193632, "fullname": "Yiwei He", "url": "http://cvpr.thecvf.com/api/miniconf/users/193632?format=json", "institution": "University of Liverpool"}, {"id": 193633, "fullname": "Xinze Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193633?format=json", "institution": "University of Liverpool"}, {"id": 193634, "fullname": "BINGYU ZHU", "url": "http://cvpr.thecvf.com/api/miniconf/users/193634?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 193635, "fullname": "WUHUI DUAN", "url": "http://cvpr.thecvf.com/api/miniconf/users/193635?format=json", "institution": "University of Liverpool"}, {"id": 193636, "fullname": "Congang CHEN", "url": "http://cvpr.thecvf.com/api/miniconf/users/193636?format=json", "institution": null}, {"id": 190102, "fullname": "ZEYU FU", "url": "http://cvpr.thecvf.com/api/miniconf/users/190102?format=json", "institution": "University of Exeter"}, {"id": 193637, "fullname": "Yi Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/193637?format=json", "institution": "University of Liverpool"}, {"id": 88740, "fullname": "Baoyuan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88740?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 139138, "fullname": "Xiangtai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/139138?format=json", "institution": "ByteDance Inc."}, {"id": 152228, "fullname": "Guangliang Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/152228?format=json", "institution": "University of Liverpool"}], "abstract": "Multimodal Deepfakes proliferating on social media threaten authenticity, information integrity, and digital forensics. Existing benchmarks are constrained by their single-modality scope, simplified manipulations, or unrealistic distributions, which limit their ability to assess real-world robustness. We present Omni-Fake, a unified omni-dataset for comprehensive multimodal deepfake detection in social-media settings. It comprises Omni-Fake-Set, a large-scale, high-quality dataset with 1M+ samples, and Omni-Fake-OOD, an out-of-distribution benchmark with 100k+ samples intentionally excluded from training to evaluate generalization. Omni-Fake spans four modalities\u2014image, audio, video, and audio-video talking head and supports a joint detection\u2013localization\u2013explanation protocol. For images, audio, and videos, we define a ternary task (real / partially manipulated / fully synthetic) with spatial or temporal localization masks for fine-grained reasoning. Talking heads are formulated as an audio-video fusion binary task targeting speaking digital humans and lip-synced avatar forgeries. On top of Omni-Fake, we further propose Omni-Fake-R1, a reinforcement-learning-driven multimodal detector that adaptively integrates visual and auditory cues and outputs structured decisions, localization, and natural-language explanations. Extensive experiments show significant gains in detection accuracy, cross-modal generalization, and explainability over state-of-the-art baselines. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40148", "url": null, "sourceid": 45552, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38860, "uid": "2bb4997e9e7e9f45820e6df11e801f88", "name": "GenHOI: Towards Object-Consistent Hand\u2013Object Interaction with Temporally Balanced and Spatially Selective Object Injection", "authors": [{"id": 190853, "fullname": "Xuan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190853?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 90381, "fullname": "Mochu Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90381?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 88785, "fullname": "Zhelun Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88785?format=json", "institution": "Peking University"}, {"id": 130996, "fullname": "Jinbo Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130996?format=json", "institution": "Baidu"}, {"id": 98162, "fullname": "Chenming Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/98162?format=json", "institution": "Baidu"}, {"id": 190854, "fullname": "Chen Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190854?format=json", "institution": "Baidu"}, {"id": 152243, "fullname": "Kaisiyuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152243?format=json", "institution": "Baidu Inc."}, {"id": 90530, "fullname": "Hang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/90530?format=json", "institution": "Baidu"}, {"id": 190855, "fullname": "Shanshan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190855?format=json", "institution": null}, {"id": 90074, "fullname": "Haocheng Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90074?format=json", "institution": "Baidu"}, {"id": 185764, "fullname": "Wei He", "url": "http://cvpr.thecvf.com/api/miniconf/users/185764?format=json", "institution": "Baidu"}, {"id": 126338, "fullname": "Jingdong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126338?format=json", "institution": "Baidu"}], "abstract": "Hand\u2013Object Interaction (HOI) remains a core challenge in digital human video synthesis, where models must generate physically plausible contact and preserve object identity across frames. Although recent HOI reenactment approaches have achieved progress, they are typically trained and evaluated in-domain and fail to generalize to complex, in-the-wild scenarios. In contrast, all-in-one video editing models exhibit broader robustness but still struggle with HOI-specific issues such as inconsistent object appearance. In this paper, we present GenHOI, a lightweight augmentation to pretrained video generation models that injects reference-object information in a temporally balanced and spatially selective manner. For temporal balancing, we propose Head-Sliding RoPE, which assigns head-specific temporal offsets to reference tokens, distributing their influence evenly across frames and mitigating the temporal decay of 3D RoPE to improve long-range object consistency. For spatial selectivity, we design a two-level spatial attention gate that concentrates object-conditioned attention on HOI regions and adaptively scales its strength, preserving background realism while enhancing interaction fidelity. Extensive qualitative and quantitative evaluations on unseen, in-the-wild scenes demonstrate that GenHOI significantly outperforms state-of-the-art HOI reenactment and all-in-one video editing competitors.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38860", "url": null, "sourceid": 32557, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37099, "uid": "8f781e8af632c834150e15eb05e69f27", "name": "Long-SCOPE: Fully Sparse Long-Range Cooperative 3D Perception", "authors": [{"id": 151065, "fullname": "Jiahao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151065?format=json", "institution": "Tsinghua University"}, {"id": 186646, "fullname": "Zikun Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186646?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 186647, "fullname": "Yuner Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186647?format=json", "institution": "University of Pennsylvania, University of Pennsylvania"}, {"id": 186648, "fullname": "Zhongwei Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186648?format=json", "institution": null}, {"id": 182679, "fullname": "Chenyang Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182679?format=json", "institution": "Tsinghua University"}, {"id": 186649, "fullname": "Shuocheng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186649?format=json", "institution": "Tsinghua University"}, {"id": 186650, "fullname": "Yuxuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186650?format=json", "institution": "Tsinghua University"}, {"id": 151067, "fullname": "Jiaru Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/151067?format=json", "institution": "Beijing Institute of Technology"}, {"id": 186651, "fullname": "Chuang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186651?format=json", "institution": "Tsinghua University"}, {"id": 186652, "fullname": "Shaobing Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186652?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 155095, "fullname": "Jianqiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155095?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Cooperative 3D perception via Vehicle-to-Everything (V2X) communication is a promising paradigm for enhancing autonomous driving, offering extended sensing horizons and occlusion resolution.However, the practical deployment of existing methods is hindered at long distances by two critical bottlenecks:the quadratic computational scaling of dense BEV representations and the fragility of feature association mechanisms under significant observation and alignment errors.To overcome these limitations, we introduce Long-SCOPE, a fully sparse framework designed for robust long-distance cooperative 3D perception.Our method introduces two novel components: a Geometry-guided Query Generation module to accurately detect small, distant objects,and a learnable Context-Aware Association module that robustly matches cooperative queries even despite severe positional noise.Experiments on the V2X-Seq and Griffin datasets validate that Long-SCOPE achieves state-of-the-art performance, particularly in challenging long-range settings, while maintaining superior computational efficiency and a highly competitive transmission cost.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37099", "url": null, "sourceid": 39734, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40101, "uid": "26b54d75fad5660ca6471905fe1793cf", "name": "FM-Steer: Enhance Generalist Policies with Value-Guided Cascaded Denosing", "authors": [{"id": 154177, "fullname": "Haoming Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/154177?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 86716, "fullname": "Delin Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86716?format=json", "institution": "Fudan University"}, {"id": 193530, "fullname": "Yuanqi Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193530?format=json", "institution": "Institute for Computer Science, Artificial Intelligence and Technology"}, {"id": 102054, "fullname": "Qizhi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/102054?format=json", "institution": "Zhejiang University"}, {"id": 193531, "fullname": "Jiarui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193531?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 151589, "fullname": "Qi Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/151589?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 192525, "fullname": "Yiwen Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192525?format=json", "institution": "Shanghai AI Lab; Northwest Polytechnical University"}, {"id": 193532, "fullname": "Li Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193532?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 193533, "fullname": "Heng Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/193533?format=json", "institution": "University of Science and Technology of China"}, {"id": 193534, "fullname": "Xianqiang Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193534?format=json", "institution": "University of Science and Technology of China"}, {"id": 193535, "fullname": "Yuhang Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193535?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 157000, "fullname": "Xiaofan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/157000?format=json", "institution": "BAIDU Inc,"}, {"id": 126999, "fullname": "Modi Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/126999?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 99906, "fullname": "Guanghui Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/99906?format=json", "institution": "AgiBot"}, {"id": 193536, "fullname": "Maoqing Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193536?format=json", "institution": "Agibot"}, {"id": 88582, "fullname": "Bin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88582?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 88588, "fullname": "Dong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88588?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 185717, "fullname": "Xuelong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185717?format=json", "institution": "China Telecom; Northwestern Polytechnical University"}], "abstract": "Humans naturally allocate more time before performing actual actions when handling complex tasks in the physical world. This paradigm, recently, has achieved remarkable advancement in boosting Large Language Models (LLMs) to solve complex tasks in digital domains.However, the potential of test-time computing remains largely unexplored for robotic foundation models interacting with the physical world.In this work, we propose \\textbf{\\ours}: a test-time computing framework that augments flow-based Vision-Language-Action (VLA) generalist policies with value-guided sampling and cascaded action denoising, enabling higher control performance and real-time action rates for dexterous robot manipulation.\\ours first incorporates a flow-based intermediate verifier to estimate state\u2013action values for candidate actions. At test time, the policy iteratively samples multiple noisy action proposals and retains the one with the highest predicted value, yielding value-aligned, high-quality actions without retraining.To satisfy the stringent frequency demands of robot control, \\ours further introduces cascaded action denoising, decoupling expensive value-guided sampling from fast action refinement. A lightweight flow denoiser asynchronously takes the selected high-value noisy action and rapidly denoises it to produce the final control signal, enabling fluid, high-rate execution.During deployment, the intermediate verifier operates at a low frequency to provide value-guided sampling, while the lite-flow denoiser continually processes selected candidates to maintain real-time control.Extensive experiments demonstrate that \\ours scales flow-based VLA models effectively at test time, and achieves state-of-the-art performance across diverse simulation benchmarks and real-world dexterous robotic tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40101", "url": null, "sourceid": 40700, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40300, "uid": "ad2b5e729bc747066e6422cb6e1fa5da", "name": "Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long-Video Understanding", "authors": [{"id": 187427, "fullname": "Pengfei Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187427?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 70644, "fullname": "Meng Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/70644?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 187428, "fullname": "Yingyao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187428?format=json", "institution": null}, {"id": 86626, "fullname": "Yi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86626?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 181696, "fullname": "Jiahua Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/181696?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 152920, "fullname": "Jun Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/152920?format=json", "institution": "Alibaba Group"}, {"id": 185404, "fullname": "YuCheng YuCheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185404?format=json", "institution": null}, {"id": 152921, "fullname": "Bo Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/152921?format=json", "institution": "Alibaba Group"}, {"id": 69930, "fullname": "Xiaodan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/69930?format=json", "institution": "Sun Yat-sen University"}], "abstract": "Long video understanding is essential for human-like intelligence, enabling coherent perception and reasoning over extended temporal contexts. While the emerging thinking-with-frames paradigm\u2014which alternates between global temporal reasoning and local frame examination\u2014has advanced the reasoning capabilities of video multi-modal large language models (MLLMs), it suffers from a significant efficiency bottleneck due to the progressively growing and redundant multi-modal context. To address this, we propose SpecTemp, a reinforcement learning-based Speculative Temporal reasoning framework that decouples temporal perception from reasoning via a cooperative dual-model design. In SpecTemp, a lightweight draft MLLM rapidly explores and proposes salient frames from densely sampled temporal regions, while a powerful target MLLM focuses on temporal reasoning and verifies the draft\u2019s proposals, iteratively refining its attention until convergence. This design mirrors the collaborative pathways of the human brain, balancing efficiency with accuracy. To support training, we construct the SpecTemp-80K dataset, featuring synchronized dual-level annotations for coarse evidence spans and fine-grained frame-level evidence. Experiments across multiple video understanding benchmarks demonstrate that SpecTemp not only maintains competitive accuracy but also significantly accelerates inference compared with existing thinking-with-frames methods.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40300", "url": null, "sourceid": -35599, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37426?format=json"], "related_events_ids": [37426]}, {"id": 37706, "uid": "5bd7f2feff1f11170a507fcd0c0e9734", "name": "CryoHype: Reconstructing a thousand cryo-EM structures with transformer-based hypernetworks", "authors": [{"id": 181616, "fullname": "Jeffrey Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181616?format=json", "institution": "Princeton University"}, {"id": 178092, "fullname": "Minkyu Jeon", "url": "http://cvpr.thecvf.com/api/miniconf/users/178092?format=json", "institution": "Princeton University"}, {"id": 188057, "fullname": "Ambri Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/188057?format=json", "institution": "Princeton University"}, {"id": 69178, "fullname": "Serena Yeung", "url": "http://cvpr.thecvf.com/api/miniconf/users/69178?format=json", "institution": "Stanford"}, {"id": 136461, "fullname": "Ellen Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/136461?format=json", "institution": "Princeton University"}], "abstract": "Cryo-electron microscopy (cryo-EM) is an indispensable technique for determining the 3D structures of dynamic biomolecular complexes. While typically applied to image a single molecular species, cryo-EM holds great potential for structure determination of many targets simultaneously in a high-throughput fashion. However, existing methods typically focus on modeling \\textit{conformational heterogeneity} within a single or very few structures and are not designed to resolve \\textit{compositional heterogeneity} arising from mixtures of many distinct molecular species. To address this challenge, we propose CryoHype, a transformer-based hypernetwork for cryo-EM reconstruction that dynamically adjusts the weights of an implicit neural representation based on the input structure. Using CryoHype, we successfully reconstruct 1,000 distinct structures from cryo-EM imaging in the fixed-pose setting without any pre-existing knowledge of the structures present, which is beyond the capabilities of any existing algorithm.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37706", "url": null, "sourceid": 34434, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39870, "uid": "1e4cb2e03f2f23188afd6326c1ccd15b", "name": "End-to-End Language-Action Model for Humanoid Whole Body Control", "authors": [{"id": 180637, "fullname": "Yuxuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180637?format=json", "institution": "Peking University"}, {"id": 193034, "fullname": "Haobin Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193034?format=json", "institution": "Peking University"}, {"id": 193035, "fullname": "Shiqing Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193035?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 193036, "fullname": "Ziluo Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/193036?format=json", "institution": "BeingBeyond"}, {"id": 87087, "fullname": "Zongqing Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87087?format=json", "institution": "Peking University"}], "abstract": "Existing humanoid control systems often rely on teleoperation or modular generation pipelines that separate language understanding from physical execution. However, the former is entirely human-driven, and the latter lacks tight alignment between language commands and physical behaviors. In this paper, we present SENTINEL, a fully end-to-end language\u2013action model for humanoid whole-body control. We construct a large-scale dataset by tracking human motions in simulation using a pretrained whole body controller, combined with their text annotations. The model directly maps language commands and proprioceptive inputs to low-level actions without any intermediate representation. The model generates action chunks using flow matching, which can be subsequently refined by a residual action head for real-world deployment. Our method exhibits strong semantic understanding and stable execution on humanoid robots in both simulation and real-world deployment, and also supports multi-modal extensions by converting inputs into texts.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39870", "url": null, "sourceid": 33313, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40037, "uid": "a2bf11a5cc1a2b5a58a0549dbd66fb37", "name": "GazeOnce360: Fisheye-Based 360\u00b0 Multi-Person Gaze Estimation with Global\u2013Local Feature Fusion", "authors": [{"id": 185249, "fullname": "Zhuojiang Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/185249?format=json", "institution": "Technical University of Munich"}, {"id": 193355, "fullname": "Zhenghui Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/193355?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 128288, "fullname": "Feng Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128288?format=json", "institution": "Beihang University, Tsinghua University"}], "abstract": "We present GazeOnce360, a novel end-to-end model for multi-person gaze estimation from a single tabletop-mounted upward-facing fisheye camera. Unlike conventional approaches that rely on forward-facing cameras in constrained viewpoints, we address the underexplored setting of estimating the 3D gaze direction of multiple people distributed across a 360\u00b0 scene from an upward fisheye perspective. To support research in this setting, we introduce MPSGaze360, a large-scale synthetic dataset rendered using Unreal Engine, featuring diverse multi-person configurations with accurate 3D gaze and eye landmark annotations.Our model tackles the severe distortion and perspective variation inherent in fisheye imagery by incorporating rotational convolutions and eye landmark supervision. To better capture fine-grained eye features crucial for gaze estimation, we propose a dual-resolution architecture that fuses global low-resolution context with high-resolution local eye regions. Experimental results demonstrate the effectiveness of each component in our model. This work highlights the feasibility and potential of fisheye-based 360\u00b0 gaze estimation in practical multi-person scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40037", "url": null, "sourceid": 42203, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36515, "uid": "1cb5980f737facb627e2bb3d759b73af", "name": "M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation", "authors": [{"id": 145959, "fullname": "Yiheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/145959?format=json", "institution": "Tencent VISVISE"}, {"id": 185249, "fullname": "Zhuojiang Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/185249?format=json", "institution": "Technical University of Munich"}, {"id": 184054, "fullname": "Mingdao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184054?format=json", "institution": "Tsinghua University"}, {"id": 185250, "fullname": "Meitong Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185250?format=json", "institution": "Tsinghua University"}, {"id": 185251, "fullname": "Tianxiao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185251?format=json", "institution": "Tsinghua University"}, {"id": 156640, "fullname": "Li Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/156640?format=json", "institution": "China Mobile Communications Company Limited Migu Beijing Research Institute"}, {"id": 88239, "fullname": "Yuwang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88239?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "In text-driven 3D scene generation, object layout serves as a crucial intermediate representation that bridges high-level language instructions with detailed geometric output. It not only provides a structural blueprint for ensuring physical plausibility but also supports semantic controllability and interactive editing. However, the learning capabilities of current 3D indoor layout generation models are constrained by the limited scale, diversity, and annotation quality of existing datasets. To address this, we introduce M3DLayout, a large-scale, multi-source dataset for 3D indoor layout generation. M3DLayout comprises 21,367 layouts and over 433k object instances, integrating three distinct sources: real-world scans, professional CAD designs, and procedurally generated scenes. Each layout is paired with detailed structured text describing global scene summaries, relational placements of large furniture, and fine-grained arrangements of smaller items. This diverse and richly annotated resource enables models to learn complex spatial and semantic patterns across a wide variety of indoor environments. To assess the potential of M3DLayout, we establish a benchmark using both a text-conditioned diffusion model and a text-conditioned autoregressive model. Experimental results demonstrate that our dataset provides a solid foundation for training layout generation models. Its multi-source composition enhances diversity, notably through the Inf3DLayout subset which provides rich small-object information, enabling the generation of more complex and detailed scenes. We hope that M3DLayout can serve as a valuable resource for advancing research in text-driven 3D scene synthesis. All dataset and code will be made public upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36515", "url": null, "sourceid": 36628, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40021, "uid": "da969836ff9696f9c182a7026558fe23", "name": "Vision-Language Attribute Disentanglement and Reinforcement for Lifelong Person Re-Identification", "authors": [{"id": 128426, "fullname": "Kunlun Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128426?format=json", "institution": "Peking University"}, {"id": 193319, "fullname": "Haotong Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193319?format=json", "institution": "Jilin University; Peking University"}, {"id": 185408, "fullname": "Jiangmeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185408?format=json", "institution": "Institute of Software, Chinese Academy of Sciences"}, {"id": 127460, "fullname": "Xu Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/127460?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 107125, "fullname": "Jiahuan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/107125?format=json", "institution": "Peking University"}], "abstract": "Lifelong person re-identification (LReID) aims to learn from varying domains to obtain a unified person retrieval model. Existing LReID approaches typically focus on learning from scratch or a visual classification-pretrained model, while the Vision-Language Model (VLM) has shown generalizable knowledge in a variety of tasks. Although existing methods can be directly adapted to the VLM, since they only consider global-aware learning, the fine-grained attribute knowledge is underleveraged, leading to limited acquisition and anti-forgetting capacity. To address this problem, we introduce a novel VLM-driven LReID approach named Vision-Language Attribute Disentanglement and Reinforcement (VLADR). Our key idea is to explicitly model the universally shared human attributes to improve inter-domain knowledge transfer, thereby effectively utilizing historical knowledge to reinforce new knowledge learning and alleviate forgetting. Specifically, VLADR includes a Multi-grain Text Attribute Disentanglement mechanism that mines the global and diverse local text attributes of an image. Then, an Inter-domain Cross-modal Attribute Reinforcement scheme is developed, which introduces cross-modal attribute alignment to guide visual attribute extraction and adopts inter-domain attribute alignment to achieve fine-grained knowledge transfer. Experimental results demonstrate that our VLADR outperforms the state-of-the-art methods by 1.9%-2.2% and 2.1%-2.5% on anti-forgetting and generalization capacity. Our code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40021", "url": null, "sourceid": 34354, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39342, "uid": "0dee07203418a72583e5dd79d66965ed", "name": "RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation", "authors": [{"id": 182740, "fullname": "Nicolas Houdr\u00e9", "url": "http://cvpr.thecvf.com/api/miniconf/users/182740?format=json", "institution": "Universit\u00e9 Paris Cit\u00e9 - LIPADE"}, {"id": 166560, "fullname": "Diego Marcos", "url": "http://cvpr.thecvf.com/api/miniconf/users/166560?format=json", "institution": "INRIA"}, {"id": 191885, "fullname": "Hugo Turckheim", "url": "http://cvpr.thecvf.com/api/miniconf/users/191885?format=json", "institution": "INRIA"}, {"id": 191886, "fullname": "Dino Ienco", "url": "http://cvpr.thecvf.com/api/miniconf/users/191886?format=json", "institution": "National Institute for Agriculture, Environment and Food; INRAE, National Research Institute in Agriculture and Environment"}, {"id": 166571, "fullname": "Laurent Wendling", "url": "http://cvpr.thecvf.com/api/miniconf/users/166571?format=json", "institution": "Universit\u00e9 Paris Cit\u00e9 (LIPADE)"}, {"id": 166570, "fullname": "Camille Kurtz", "url": "http://cvpr.thecvf.com/api/miniconf/users/166570?format=json", "institution": "Universit\u00e9 Paris Cit\u00e9 (LIPADE)"}, {"id": 191887, "fullname": "Sylvain Lobry", "url": "http://cvpr.thecvf.com/api/miniconf/users/191887?format=json", "institution": "Universit\u00e9 Paris Cit\u00e9"}], "abstract": "Earth observation (EO) data spans a wide range of spatial, spectral, and temporal resolutions, from high-resolution optical imagery to low resolution multispectral products or radar time series. While recent foundation models have improved multimodal integration for learning meaningful representations, they often expect fixed input resolutions or are based on sensor-specific encoders limiting generalization across heterogeneous EO modalities. To overcome these limitations we introduce RAMEN, a resolution-adjustable multimodal encoder that learns a shared visual representation across EO data in a fully sensor-agnostic manner. RAMEN treats the modality and spatial and temporal resolutions as key input data features, enabling coherent analysis across modalities within a unified latent space. Its main methodological contribution is to define spatial resolution as a controllable output parameter, giving users direct control over the desired level of detail at inference and allowing explicit trade-offs between spatial precision and computational cost. We train a single, unified transformer encoder reconstructing masked multimodal EO data drawn from diverse sources, ensuring generalization across sensors and resolutions. Once pretrained, RAMEN transfers effectively to both known and unseen sensor configurations and outperforms larger state-of-the-art models on the community-standard PANGAEA benchmark, containing various multi-sensor and multi-resolution downstream tasks. Our code and pretrained model will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39342", "url": null, "sourceid": 39401, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38721, "uid": "95119c54e4be718b4e691e95b0265892", "name": "Multi-modal Test-time adaptation via Adaptive Probabilistic Gaussian Calibration", "authors": [{"id": 177989, "fullname": "Jinglin Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/177989?format=json", "institution": "Chinese Academy of Science"}, {"id": 190524, "fullname": "Yi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190524?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 190525, "fullname": "Chuxiong Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/190525?format=json", "institution": "Chinese Academy of Sciences, Institute of Software"}, {"id": 185406, "fullname": "Xiao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185406?format=json", "institution": "National Defense University"}, {"id": 185408, "fullname": "Jiangmeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185408?format=json", "institution": "Institute of Software, Chinese Academy of Sciences"}, {"id": 190526, "fullname": "Fanjiang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190526?format=json", "institution": "Institute of Software, Chinese Academy of Sciences"}], "abstract": "Multi-modal test-time adaptation (TTA) enhances the resilience of benchmark multi-modal models against distribution shifts by leveraging the unlabeled target data during inference. Despite the documented success, the advancement of multi-modal TTA methodologies has been impeded by a persistent limitation, i.e., the lack of explicit modeling of category-conditional distributions, which is crucial for yielding accurate predictions and reliable decision boundaries. Canonical Gaussian discriminant analysis (GDA) provides a vanilla modeling of category-conditional distributions and achieves moderate advancement in uni-modal contexts. However, in multi-modal TTA scenario, the inherent modality distribution asymmetry undermines the effectiveness of modeling the category\u2011conditional distribution via the canonical GDA. To this end, we introduce a tailored probabilistic Gaussian model for multi-modal TTA to explicitly model the category-conditional distributions, and further propose an adaptive contrastive asymmetry rectification technique to counteract the adverse effects arising from modality asymmetry, thereby deriving calibrated predictions and reliable decision boundaries. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38721", "url": null, "sourceid": 35004, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39525, "uid": "f5fa13cee3a924c334302f00db8b0fc9", "name": "Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm", "authors": [{"id": 175940, "fullname": "Jingqi Tong", "url": "http://cvpr.thecvf.com/api/miniconf/users/175940?format=json", "institution": "Fudan University"}, {"id": 192261, "fullname": "Yurong Mou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192261?format=json", "institution": "Fudan University"}, {"id": 192262, "fullname": "Hangcheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192262?format=json", "institution": "Fudan University"}, {"id": 192263, "fullname": "Mingzhe Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192263?format=json", "institution": "Harbin Institute of Technology"}, {"id": 192264, "fullname": "Yongzhuo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192264?format=json", "institution": "Fudan University"}, {"id": 192265, "fullname": "Ming Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192265?format=json", "institution": "Fudan University"}, {"id": 192266, "fullname": "Qiguang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192266?format=json", "institution": "Harbin Institute of Technology"}, {"id": 180377, "fullname": "Tianyi Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180377?format=json", "institution": "Shanghai Innovation Institute, OpenMOSS Team, Fudan University NLP Lab"}, {"id": 192267, "fullname": "Xiaomeng Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192267?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 192268, "fullname": "Yining Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192268?format=json", "institution": "Fudan University"}, {"id": 192269, "fullname": "Xinchi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192269?format=json", "institution": "Fudan University"}, {"id": 192270, "fullname": "Jun Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192270?format=json", "institution": "Fudan University"}, {"id": 157430, "fullname": "Xuanjing Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157430?format=json", "institution": "Fudan University"}, {"id": 130875, "fullname": "Xipeng Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130875?format=json", "institution": "Fudan University"}], "abstract": "Thinking with Text and \"Thinking with Images\" paradigm significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations. (1) Images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) The separation of text and vision as distinct modalities, hindering unified multimodal understanding and generation. Therefore, we propose \"Thinking with Video\", a new paradigm that leverages video generation models such as Sora-2 to use video frames as a unified medium for multimodal reasoning. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench), which encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU).  Our evaluation on VideoThinkBench establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpassing GPT5 by 10% on eyeballing puzzles. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH, and 69.2% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2\u2019s performance. In summary, our findings demonstrate that the video generation model is the potential unified multimodal understanding and generation model, positioning \"Thinking with Video\" as a unified multimodal reasoning paradigm.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39525", "url": "https://github.com/tongjingqi/Thinking-with-Video", "sourceid": 36294, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65728, "file": "/media/PosterPDFs/CVPR%202026/39525.png", "modified": "2026-04-23T02:26:40.955451-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": true, "sortkey": 0, "is_live_content": false, "detailed_kind": "", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65729, "file": "/media/PosterPDFs/CVPR%202026/39525-thumb.png", "modified": "2026-04-23T02:26:41.179636-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": false, "sortkey": 0, "is_live_content": false, "detailed_kind": "thumb", "generated_from": null, "resourcetype": "EventmediaImageFile"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40167, "uid": "b42f8dd1d6717181700541fb97ce1de1", "name": "UZ3DVG: Unaided Zero-Shot 3D Visual Grounding with Generated Language Conditions", "authors": [{"id": 193692, "fullname": "Wenbin Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193692?format=json", "institution": "Xiamen University"}, {"id": 193693, "fullname": "Jiawen Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/193693?format=json", "institution": "Xiamen University"}, {"id": 135330, "fullname": "Yuan Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/135330?format=json", "institution": "East China Normal University"}, {"id": 157848, "fullname": "Yachao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157848?format=json", "institution": "Xiamen University"}, {"id": 87721, "fullname": "Yanyun Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87721?format=json", "institution": "Xiamen University"}], "abstract": "Zero-Shot 3D Visual Grounding (Zero-Shot 3DVG) aims to localize target objects in 3D scenes from natural language descriptions without relying on instance-wise description annotations. Existing methods rely on extra 2D images during inference and/or require multi-turn interactions with large language models (LLMs) or vision-language models (VLMs), which increase latency, computational cost, and deployment complexity. To overcome these limitations, we propose Unaided Zero-Shot 3D Visual Grounding with Generated Language Conditions (UZ3DVG), which is fed with 3D point clouds and textual descriptions only during inference and does not depend on external models. This is a new training paradigm: a VLM is employed solely to produce object-wise descriptions (pseudo labels) and reasoning chains for training a lightweight 3DVG model with robust spatial reasoning. Specifically, the introduced Open-Vocabulary Multi-Source Spatial Annotation and Reasoning Chain Generator processes RGB-D images or 3D-projected 2D images from open-world scenes to generate spatial pseudo-labels and reasoning chains for training. Then, we propose Reasoning Chain Distillation, which transfers reasoning knowledge extracted by a large teacher network to a lightweight student network. To represent both global and local geometric relationships, the Geometry-Aware Spatial Modeling (GeoSM) module is introduced to align textual reasoning with 3D spatial structures. Experiments show that UZ3DVG achieves SOTA zero-shot performance on ScanRefer and NR3D, with inference speeds up to $7.7~\\mathrm{FPS}$, approximately 38 times faster than SOTA methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40167", "url": null, "sourceid": 38396, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36581, "uid": "fd04ecf4077388816a37d6ac193c3152", "name": "Test-Time Perturbation Tuning with Delayed Feedback for Vision-Language-Action Models", "authors": [{"id": 184057, "fullname": "Zehua Zang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184057?format=json", "institution": "Institute of Software, Chinese Academy of Sciences"}, {"id": 185405, "fullname": "Xi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185405?format=json", "institution": "Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences"}, {"id": 131593, "fullname": "Fuchun Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/131593?format=json", "institution": "Tsinghua University"}, {"id": 185406, "fullname": "Xiao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185406?format=json", "institution": "National Defense University"}, {"id": 185407, "fullname": "Lixiang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185407?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 107125, "fullname": "Jiahuan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/107125?format=json", "institution": "Peking University"}, {"id": 185408, "fullname": "Jiangmeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185408?format=json", "institution": "Institute of Software, Chinese Academy of Sciences"}], "abstract": "Vision-Language-Action models (VLAs) achieve strong performance in sequential decision-making but remain fragile to subtle environment shifts, such as small changes in object pose. We attribute this brittleness to trajectory overfitting, where VLAs over-attend to spurious cues and replicate memorized actions. We propose Perturbation learning with Delayed Feedback (PDF), a verifier-free test-time adaptation framework that improves decision performance without fine-tuning the base model. PDF mitigates spurious correlations through uncertainty-based data augmentation and action voting, while an adaptive scheduler allocates augmentation budgets to balance performance and efficiency. To further improve stability, PDF learns a lightweight perturbation module that retrospectively adjusts action logits guided by delayed feedback, correcting high-confidence errors. Experiments on LIBERO (+7.4\\% success rate) and Atari (+10.3 human normalized score) demonstrate consistent gains in task success over strong VLA, test-time adaptation and even fine-tuned approaches baselines with minimal overhead, establishing a practical path toward reliable test-time adaptation in multimodal decision-making agents.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36581", "url": null, "sourceid": 34474, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37012, "uid": "ef6a34ef0a948a42a3e116a2fa31dc8d", "name": "From Panel to Pixel: Zoom-In Vision\u2013Language Pretraining from Biomedical Scientific Literature", "authors": [{"id": 152706, "fullname": "Kun yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/152706?format=json", "institution": "Universit\u00e9 de Strasbourg &amp; Technical University of Munich"}, {"id": 155680, "fullname": "Min Woo Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/155680?format=json", "institution": "Stanford University"}, {"id": 186488, "fullname": "Zhen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186488?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 151391, "fullname": "Alejandro Lozano", "url": "http://cvpr.thecvf.com/api/miniconf/users/151391?format=json", "institution": "Stanford University"}, {"id": 182607, "fullname": "Xiangteng He", "url": "http://cvpr.thecvf.com/api/miniconf/users/182607?format=json", "institution": "University of British Columbia"}, {"id": 186489, "fullname": "Shi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186489?format=json", "institution": "Universit\u00e9 de Strasbourg"}, {"id": 86152, "fullname": "Nassir Navab", "url": "http://cvpr.thecvf.com/api/miniconf/users/86152?format=json", "institution": "TU Munich"}, {"id": 179716, "fullname": "Xiaoxiao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/179716?format=json", "institution": "Stanford University"}, {"id": 85960, "fullname": "Nicolas Padoy", "url": "http://cvpr.thecvf.com/api/miniconf/users/85960?format=json", "institution": "University of Strasbourg"}, {"id": 69178, "fullname": "Serena Yeung", "url": "http://cvpr.thecvf.com/api/miniconf/users/69178?format=json", "institution": "Stanford"}], "abstract": "There is growing interest in biomedical vision--language models trained on scientific literature. However, most pipelines compress rich multi-panel figures and long captions into coarse figure-level pairs, discarding the fine-grained correspondences clinicians rely on when zooming into local structures. We introduce Panel2Patch, a data pipeline that mines hierarchical structure from multi-panel, marker-heavy biomedical figures and their surrounding text, and converts them into multi-granular supervision. Given figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs aligned image--text pairs at the figure, panel, and region levels, preserving local semantics instead of treating each figure as a single sample. Built on this corpus, we develop a granularity-aware pretraining strategy that unifies heterogeneous objectives from coarse didactic descriptions to fine region-focused phrases in a shared embedding space. Applying Panel2Patch to a small subset of literature figures yields substantially better performance than prior pipelines, demonstrating that exploiting hierarchical figure structure can provide more effective supervision with less pretraining data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37012", "url": null, "sourceid": 38764, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40011, "uid": "7d3029aebef6762e96d7b267c221703a", "name": "VVS: Accelerating Speculative Decoding for Visual Autoregressive Model via Partial Verification Skipping", "authors": [{"id": 180364, "fullname": "Haotian Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/180364?format=json", "institution": "Tsinghua Shenzhen International Graduate School"}, {"id": 187769, "fullname": "Ye Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187769?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 193295, "fullname": "Rongwei Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193295?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 151909, "fullname": "Chen Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151909?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 87242, "fullname": "Shu-Tao Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/87242?format=json", "institution": "Shenzhen International Graduate School, Tsinghua University"}, {"id": 119978, "fullname": "Zhi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/119978?format=json", "institution": "SIGS, Tsinghua University"}], "abstract": "Visual autoregressive (AR) generation models have demonstrated strong potential for image generation, yet their next-token-prediction paradigm introduces considerable inference latency. Although speculative decoding (SD) has been proven effective for accelerating visual AR models, its \"draft one step, then verify one step'' paradigm prevents a direct reduction of the forward passes, thus restricting acceleration potential. Motivated by the visual token interchangeability, we for the first time to explore verification skipping in the SD process of visual AR model generation to explicitly cut the number of target model forward passes, thereby reducing inference latency. Based on an analysis of the drafting stage\u2019s characteristics, we observe that $\\textbf{verification redundancy}$ and $\\textbf{stale feature reusability}$ are key factors to retain generation quality and speedup for verification-free steps. Inspired of these two observations, we propose a novel SD framework $\\textbf{VVS}$ to accelerate $\\underline{\\text{v}}$isual AR model via partial $\\underline{\\text{v}}$erification $\\underline{\\text{s}}$kipping, which integrates three complementary modules: (1) a verification-free token selector with dynamically truncation, (2) token level feature caching and reuse, and (3) fine-grained skipped step scheduling. Consequently, VVS reduces the number of target model forward passes by a factor of ${2.8\\times}$ relative to vanilla AR decoding while maintaining competitive generation quality, offering a superior speed\u2013quality trade-off over conventional SD frameworks and revealing strong potential to reshape the SD paradigm. Our code will be publicly available upon acceptance of this paper.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40011", "url": null, "sourceid": 41563, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37582, "uid": "d5e6ff260692ebb551aa687db5b759b2", "name": "Test-time Sparsity for Extreme Fast Action Diffusion", "authors": [{"id": 180083, "fullname": "Kangye Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/180083?format=json", "institution": "Tsinghua University"}, {"id": 129863, "fullname": "Yuan Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/129863?format=json", "institution": "Department of Computer Science and Technology, Tsinghua University"}, {"id": 187768, "fullname": "Jianbo Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/187768?format=json", "institution": "South China University of Technology"}, {"id": 187769, "fullname": "Ye Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187769?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 151909, "fullname": "Chen Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151909?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 119978, "fullname": "Zhi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/119978?format=json", "institution": "SIGS, Tsinghua University"}], "abstract": "Action diffusion excels at high-fidelity action generation but incurs heavy computational costs owing to its iterative denoising nature. Despite current technologies showing promise in accelerating diffusion transformers by reusing the cached features, they struggle to adapt to policy dynamics arising from diverse perceptions and multi-round rollout iterations in open environments. We propose test-time sparsity to tackle this challenge, which aims to accelerate action diffusion by dynamically predicting prunable residual computations for each model forward at test time.However, two bottlenecks remain in this paradigm: 1) repetitive conditional encoding and pruning offset most potential speed gains, and 2) the features cached from previous denosing timeteps cannot constrain large pruning errors under aggressive sparsity. To address the first bottleneck, we design a highly parallelized inference pipeline that minimizes the non-decoder delay to milliseconds. Specifically, we first design a lightweight pruner that shares the encoder with the diffusion transformer. Then, we decouple the encoding and pruning from the autoregressive denoising loop by processing all denoising timesteps in parallel, and overlap the pruner with the decoder forward inference through asynchronism. To overcome the second bottleneck, we introduce an omnidirectional reusing strategy, which achieves 95\\% sparsity by selectively reusing the features cached from the current forward, previous denoising timesteps, and earlier rollout iterations. To learn the rollout-level reusing strategies, we sample a few action trajectories to supervise the actions generated by the sparsified diffusion step by step.Extensive experiments demonstrate that our method reduces FLOPs by 92\\% and accelerates action generation by 5$\\times$, achieving lossless performance with an inference frequency of 47.5 Hz.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37582", "url": null, "sourceid": 42879, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37454, "uid": "4979a7b20a42638a04977d3615767bfd", "name": "VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation", "authors": [{"id": 158316, "fullname": "Maximilian Rokuss", "url": "http://cvpr.thecvf.com/api/miniconf/users/158316?format=json", "institution": "Microsoft // German Cancer Research Center (DKFZ)"}, {"id": 187486, "fullname": "Moritz Langenberg", "url": "http://cvpr.thecvf.com/api/miniconf/users/187486?format=json", "institution": "Deutsches Krebsforschungszentrum"}, {"id": 158317, "fullname": "Yannick Kirchhoff", "url": "http://cvpr.thecvf.com/api/miniconf/users/158317?format=json", "institution": "Deutsches Krebsforschungszentrum"}, {"id": 170574, "fullname": "Fabian Isensee", "url": "http://cvpr.thecvf.com/api/miniconf/users/170574?format=json", "institution": "German Cancer Research Center"}, {"id": 187487, "fullname": "Benjamin Hamm", "url": "http://cvpr.thecvf.com/api/miniconf/users/187487?format=json", "institution": "Deutsches Krebsforschungszentrum"}, {"id": 158320, "fullname": "Constantin Ulrich", "url": "http://cvpr.thecvf.com/api/miniconf/users/158320?format=json", "institution": "Deutsches Krebsforschungszentrum"}, {"id": 187488, "fullname": "Sebastian Regnery", "url": "http://cvpr.thecvf.com/api/miniconf/users/187488?format=json", "institution": "Ruprecht-Karls-Universit\u00e4t Heidelberg"}, {"id": 187489, "fullname": "Lukas Bauer", "url": "http://cvpr.thecvf.com/api/miniconf/users/187489?format=json", "institution": "Ruprecht-Karls-Universit\u00e4t Heidelberg"}, {"id": 187490, "fullname": "Efthimios Katsigiannopulos", "url": "http://cvpr.thecvf.com/api/miniconf/users/187490?format=json", "institution": null}, {"id": 187491, "fullname": "Tobias Norajitra", "url": "http://cvpr.thecvf.com/api/miniconf/users/187491?format=json", "institution": null}, {"id": 85944, "fullname": "Klaus Maier-Hein", "url": "http://cvpr.thecvf.com/api/miniconf/users/85944?format=json", "institution": "German Cancer Research Center"}], "abstract": "We introduce *VoxTell*, a vision\u2013language model for text-prompted volumetric medical image segmentation. It maps free-form descriptions, from single words to full clinical sentences, to 3D masks. Trained on 62K+ CT, MRI, and PET volumes spanning 1K+ anatomical and pathological classes, VoxTell uses multi-stage vision\u2013language fusion across decoder layers to align textual and visual features at multiple scales. It achieves state-of-the-art zero-shot performance across modalities on unseen datasets, excelling on familiar concepts while generalizing to related unseen classes. Extensive experiments further demonstrate strong cross-modality transfer, robustness to linguistic variations and clinical language, as well as accurate instance-specific segmentation from real-world text. Code and model will be published at: www.github.com/anonymous", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37454", "url": "https://github.com/MIC-DKFZ/VoxTell", "sourceid": 43975, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37544, "uid": "19ddc61af8f213d2c43c17204efab297", "name": "UniVerse: Empower Unified Generation with Reasoning and Knowledge", "authors": [{"id": 146501, "fullname": "Kaiyue Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/146501?format=json", "institution": "The University of Hong Kong"}, {"id": 187687, "fullname": "Weiyang Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/187687?format=json", "institution": "University of Hong Kong"}, {"id": 179893, "fullname": "Chengqi Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/179893?format=json", "institution": "The University of Hong Kong"}, {"id": 153277, "fullname": "Rongyao Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153277?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 156491, "fullname": "Xian Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156491?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 156861, "fullname": "Yuwei Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156861?format=json", "institution": "Chongqing University"}, {"id": 154123, "fullname": "Chunwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154123?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 87035, "fullname": "Aoxue Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/87035?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 86697, "fullname": "Xihui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86697?format=json", "institution": "The University of Hong Kong"}], "abstract": "Current text-to-image (T2I) generation models often struggle with prompts that require complex reasoning or specialized knowledge, failing to accurately interpret implicit user intent. To bridge this gap, we introduce \\textbf{T2I-Reason}, a large-scale dataset designed to empower text-to-image generation in unified multimodal models (UMMs) with reasoning and knowledge. The dataset contains 120k pairs of text triplet and image. The text triplet consists of (1) an implicit prompt, which requires  reasoning or knowledge to decipher its underlying meaning; (2) a reasoning chain, which provides a step-by-step analysis to resolve the implicit prompt's meaning; and (3) an explicit prompt, a clear and straightforward visual description prepared for T2I generation. T2I-Reason is meticulously constructed: 65k samples are dedicated to reasoning, specifically targeting arithmetic reasoning, spatial-attribute relationship reasoning, deductive reasoning (cause to effect), and abductive reasoning (effect to cause).  While 55k samples necessitate specialized knowledge, which covers multiple disciplines, spatial-temporal concepts, and entity knowledge. To validate the effectiveness of our dataset, we train a unified multimodal model, Bagel, on our dataset. Results across multiple benchmarks that evaluate the reasoning capabilities of T2I generation demonstrate that our model achieves significant and consistent improvements on both composition and reasoning, confirming that explicit training on intermediate reasoning chains is a pivotal step towards more intelligent unified generative models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37544", "url": null, "sourceid": 31242, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38964, "uid": "8e2a291cae0ef77758e70c6cac2daa45", "name": "PAMotion: Physics-Aware Motion Generation for Full-Body Interaction with Multiple Objects", "authors": [{"id": 181955, "fullname": "Yan Di", "url": "http://cvpr.thecvf.com/api/miniconf/users/181955?format=json", "institution": "Harbin Institute of Technology"}, {"id": 191071, "fullname": "Yuheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191071?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 157672, "fullname": "Yaoxing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157672?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 191072, "fullname": "Mengge Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191072?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 128010, "fullname": "Shan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/128010?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 89145, "fullname": "Xiangyang Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/89145?format=json", "institution": "Tsinghua University"}], "abstract": "We present PAMotion, a physics-aware diffusion framework for generating realistic full-body human interactions with multiple objects.Existing diffusion-based methods that jointly synthesize human and object motions often struggle to capture the intricate physical interactions\u2014especially those involving complex hand\u2013object contacts. To address this issue, in this paper, we begin with our key observation: in everyday, slow-motion scenarios, object accelerations inherently reveal the underlying physical interactions.If an object\u2019s acceleration aligns with gravity, it is likely in free motion with no physical contact from human or other objects; otherwise, it must be in contact\u2014directly or indirectly\u2014with the human body.  Building on this intuition, PAMotion jointly models full-body human motion, object motion, and their corresponding accelerations, enforcing physical plausibility through a physics-aware interaction loss.In this loss, we softly penalizes violations of consistency between object acceleration and human-object contact states. PAMotion follows a coarse-to-fine pipline: we first synthesize global torso and object translations, then conditionally refine hand motions and object rotations, achieving both high-level motion-text consistency and low-level physical fidelity. Experiments on two challenging datasets HIMO and ParaHome demonstrate that PAMotion achieves state-of-the-art performance in generating realistic, physically consistent full-body manipulation sequences involving multiple objects.Codes and trained models will be released soon.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38964", "url": null, "sourceid": 41764, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38823, "uid": "ec2951e5afb60d72a4a3e0be6d3e9c0a", "name": "Copy-Transform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints", "authors": [{"id": 190766, "fullname": "Rotem Gatenyo", "url": "http://cvpr.thecvf.com/api/miniconf/users/190766?format=json", "institution": "Reichman University"}, {"id": 86680, "fullname": "Ohad Fried", "url": "http://cvpr.thecvf.com/api/miniconf/users/86680?format=json", "institution": "Reichman University"}], "abstract": "We study zero-shot 3D alignment of two given meshes from a short text prompt describing their spatial relation---an essential capability for content creation and scene assembly. Earlier approaches primarily rely on geometric alignment procedures, while recent work leverages pretrained 2D diffusion models to model language-conditioned object-object spatial relationships. In contrast, we directly optimize the relative pose at test time---updating translation, rotation, and isotropic scale with CLIP-driven gradients via a differentiable renderer---without training a new model.Our framework augments language supervision with geometry-aware objectives: a variant of soft-Iterative Closest Point (ICP) term to encourage controlled surface attachment and a penetration loss to discourage interpenetration. A phased schedule strengthens contact constraints over time, camera control concentrates views on the interaction region, and randomized restarts improve robustness.To enable evaluation, we curate a benchmark of 50 \\{mesh pair, prompt\\} cases spanning diverse categories and relations, and compare against baselines. Across the benchmark, our method yields semantically faithful and physically plausible alignments, improving CLIP similarity while reducing intersection volume.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38823", "url": null, "sourceid": 44349, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36706, "uid": "f1704ebe1eb5bfc1c24b0acb73d157a9", "name": "Solvability of the Viewing Graph Under the Affine Camera Model", "authors": [{"id": 185692, "fullname": "Gabriele Pedroni", "url": "http://cvpr.thecvf.com/api/miniconf/users/185692?format=json", "institution": "Institut Polytechnique de Paris"}, {"id": 181458, "fullname": "Rakshith Madhavan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181458?format=json", "institution": "Politecnico di Milano"}, {"id": 167108, "fullname": "Federica Arrigoni", "url": "http://cvpr.thecvf.com/api/miniconf/users/167108?format=json", "institution": "Politecnico di Milano"}], "abstract": "In this paper we focus on the viewing graph, which is used to represent cameras (as nodes) and their pairwise relationships (as edges) in the context of Structure from Motion. By analyzing this graph, it is possible to establish if the available pairwise relationships (e.g., fundamental matrices in the uncalibrated case) are theoretically enough to uniquely determine the cameras, in which case the graph is termed \"solvable\". Previous results considered calibrated and uncalibrated settings, whereas other camera models have not been explored in the context of viewing graph solvability: this work represents the first study under the affine camera model. We provide a characterization of the problem in terms of a linear system, from which we derive a practical method to check affine solvability. We complement this by some theoretical results providing sufficient/necessary conditions for affine solvability, in order to give further insights on the problem. Thanks to our experiments, we analyze synthetic graphs and real graphs coming from structure-from-motion datasets, where we focus on understanding the differences among different camera models (calibrated, uncalibrated and affine) in terms of solvability. In this context, we also raise an open research question and conjecture a possible answer, which is supported by empirical evidence.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36706", "url": null, "sourceid": 45924, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36340, "uid": "b1ff86a5f63c33a35b44640aa148b33d", "name": "ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference", "authors": [{"id": 184814, "fullname": "Ali Hojjat", "url": "http://cvpr.thecvf.com/api/miniconf/users/184814?format=json", "institution": "Hamburg University of Technology"}, {"id": 184815, "fullname": "Janek Haberer", "url": "http://cvpr.thecvf.com/api/miniconf/users/184815?format=json", "institution": "Kiel University, Germany"}, {"id": 75654, "fullname": "Soeren Pirk", "url": "http://cvpr.thecvf.com/api/miniconf/users/75654?format=json", "institution": "Adobe"}, {"id": 184816, "fullname": "Olaf Landsiedel", "url": "http://cvpr.thecvf.com/api/miniconf/users/184816?format=json", "institution": "Hamburg University of Technology (TUHH)"}], "abstract": "ViTs deliver SOTA performance, yet their fixed computational budget prevents scalable deployment across heterogeneous hardware.Recent Matryoshka-style Transformer architectures mitigate this by embedding nested subnetworks within a single model to enable scalable inference. However, these models allocate the same amount of compute to all inputs, regardless of their complexity, which leads to inefficiencies.To address this, we introduce ThinkingViT, a nested ViT architecture that employs progressive thinking stages to dynamically adjust inference computation based on input difficulty.ThinkingViT first activates a small subset of the most important attention heads to produce an initial prediction. If the prediction confidence exceeds a predefined threshold, inference terminates early.Otherwise, within the same backbone, it activates a larger subset of attention heads and conducts a new forward pass. This process continues iteratively until the model reaches the predefined confidence level or exhausts its maximum capacity.To boost the performance of subsequent rounds, we introduce a Token Recycling approach that fuses the input embeddings with the embeddings from the previous stage.Experiments show that ThinkingViT surpasses nested baselines by up to 2.0 percentage points (p.p.) in accuracy at the same throughput and by up to 2.9 p.p. at equal GMACs on ImageNet-1K.We show that the backbone-preserving design of ThinkingViT allows it to serve as a plug-in upgrade for ViTs in downstream tasks such as semantic segmentation. We also demonstrate that ThinkingViT transfers effectively to other architectures such as Swin.The source code is available at submitted_in_zip.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36340", "url": "https://github.com/ds-kiel/ThinkingViT", "sourceid": 33591, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38286, "uid": "040fec70955253769acc94f134177c63", "name": "AMap: Distilling Future Priors for Ahead-Aware Online HD Map Construction", "authors": [{"id": 180068, "fullname": "Ruikai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180068?format=json", "institution": "Beihang University"}, {"id": 189500, "fullname": "Xinrun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189500?format=json", "institution": "Durham University"}, {"id": 186046, "fullname": "Mengwei Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/186046?format=json", "institution": null}, {"id": 189501, "fullname": "Hao Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189501?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 189502, "fullname": "Shoumeng Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189502?format=json", "institution": null}, {"id": 158334, "fullname": "Xinyuan Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158334?format=json", "institution": "Alibaba Group"}, {"id": 189503, "fullname": "Yizhe Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189503?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 154901, "fullname": "Feng Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/154901?format=json", "institution": "Alibaba Group"}, {"id": 189504, "fullname": "Han Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189504?format=json", "institution": "Beihang University"}, {"id": 166363, "fullname": "Yilong Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/166363?format=json", "institution": "State Key Laboratory of Intelligent Transportation Systems, Beijing"}, {"id": 166376, "fullname": "Haiyang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/166376?format=json", "institution": "School of transportation science and engineering,\u00a0Beihang University"}, {"id": 154906, "fullname": "Mu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154906?format=json", "institution": "Alibaba Group"}, {"id": 189505, "fullname": "Yang Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/189505?format=json", "institution": "Durham University"}, {"id": 189506, "fullname": "Varun Ojha", "url": "http://cvpr.thecvf.com/api/miniconf/users/189506?format=json", "institution": "Newcastle University, UK"}, {"id": 180226, "fullname": "Zhiyong Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/180226?format=json", "institution": "Beihang University"}], "abstract": "Online High-Definition (HD) map construction is pivotal for autonomous driving. While recent approaches leverage historical temporal fusion to improve performance, we identify a critical safety flaw in this paradigm: it is inherently \"spatially backward-looking.\" These methods predominantly enhance map reconstruction in traversed areas, offering minimal improvement for the unseen road ahead. Crucially, our analysis of downstream planning tasks reveals a severe asymmetry: while rearward perception errors are often tolerable, inaccuracies in the forward region directly precipitate hazardous driving maneuvers. To bridge this safety gap, we propose AMap, a novel framework for Ahead-aware online HD Mapping. We pioneer a \"distill-from-future\" paradigm, where a teacher model with privileged access to future temporal contexts guides a lightweight student model restricted to the current frame. This process implicitly compresses prospective knowledge into the student model, endowing it with \"look-ahead\" capabilities at zero inference-time cost. Technically, we introduce a Multi-Level BEV Distillation strategy with spatial masking and an Asymmetric Query Adaptation module to effectively transfer future-aware representations to the student's static queries. Extensive experiments on the nuScenes and Argoverse 2 benchmark demonstrate that AMap significantly enhances current-frame perception. Most notably, it outperforms state-of-the-art temporal models in critical forward regions while maintaining the efficiency of current-frame inference.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38286", "url": null, "sourceid": 32941, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40387, "uid": "b952172111c8ee9611f15f35632f4c6a", "name": "Understanding and Enforcing Weight Disentanglement in Task Arithmetic", "authors": [{"id": 192996, "fullname": "Shangge Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192996?format=json", "institution": "Nanjing University"}, {"id": 192997, "fullname": "Yuehan Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192997?format=json", "institution": "nanjing university"}, {"id": 85317, "fullname": "Lei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85317?format=json", "institution": "University of Wollonong"}, {"id": 130772, "fullname": "Qi Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/130772?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 86630, "fullname": "Yinghuan Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/86630?format=json", "institution": "Nanjing University"}, {"id": 184978, "fullname": "Wenbin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184978?format=json", "institution": "Nanjing University"}, {"id": 86625, "fullname": "Yang Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86625?format=json", "institution": "Nanjing University"}, {"id": 73994, "fullname": "Dacheng Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/73994?format=json", "institution": "Nanyang Technological University"}], "abstract": "Task arithmetic provides an efficient, training-free way to edit pre-trained models, yet lacks a fundamental theoretical explanation for its success. The existing concept of ``weight disentanglement\" describes the ideal outcome of non-interfering task composition but does not reveal its underlying cause. Crucially, what intrinsic properties of the pre-trained model ($\\theta_0$) or the task vectors ($\\tau_t$) enable this disentanglement remains underexplored. In this paper, we introduce Task-Feature Specialization (TFS), a model's ability to allocate distinct internal features to different tasks, as the fundamental principle. We first prove that TFS is a sufficient condition for weight disentanglement. More importantly, we find that TFS also gives rise to an observable geometric consequence: weight vector orthogonality. This positions TFS as the common cause for both the desired functional outcome (disentanglement) and a measurable geometric property (orthogonality). This relationship provides the key insight for our method: since the abstract TFS property is intractable to enforce directly, we can instead promote weight disentanglement by shaping its concrete geometric consequence, orthogonality. Therefore, we propose OrthoReg, a simple and effective regularization method that actively enforces an internal orthogonal structure on weight updates ($\\Delta W$) that constitute $\\tau_t$ during fine-tuning. And we theoretically prove that OrthoReg promotes disentanglement. Extensive experiments demonstrate that OrthoReg consistently and significantly enhances the performance of various task arithmetic methods.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40387", "url": null, "sourceid": -46711, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39858?format=json"], "related_events_ids": [39858]}, {"id": 36888, "uid": "f6b3231ed2f02817af859e83a272d419", "name": "LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging", "authors": [{"id": 186115, "fullname": "Zhijian Shu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186115?format=json", "institution": "Nanjing University of Posts and Telecommunications"}, {"id": 90999, "fullname": "Cheng Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/90999?format=json", "institution": "Tencent"}, {"id": 159903, "fullname": "Tao Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/159903?format=json", "institution": "Zhejiang University"}, {"id": 88382, "fullname": "Wei Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/88382?format=json", "institution": " Shenzhen DJI Sciences and Technologies Ltd."}, {"id": 186116, "fullname": "Ben Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186116?format=json", "institution": "China Mobile Zijin (Jiangsu) Innovation Research Institute Co., Ltd."}, {"id": 186117, "fullname": "Zhiyuan Pu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186117?format=json", "institution": null}, {"id": 186118, "fullname": "Weize Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186118?format=json", "institution": "TARS Robotics"}, {"id": 128100, "fullname": "Yao Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/128100?format=json", "institution": "Nanjing University"}, {"id": 85035, "fullname": "Xun Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/85035?format=json", "institution": "Nanjing University"}, {"id": 184348, "fullname": "Xiaoyang Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/184348?format=json", "institution": "Horizon Robotics"}, {"id": 186119, "fullname": "Xiao-Xiao Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/186119?format=json", "institution": "Nanjing University"}], "abstract": "3D vision foundation models like Visual Geometry Grounded Transformer (VGGT) have advanced greatly in geometric perception. However it is time-consuming and memory-intensive for long sequences, limiting application to large-scale scenes beyond hundreds of images. To address this, we propose LiteVGGT, achieving up to 10\u00d7 speedup and substantial memory reduction, enabling efficient processing of 1000-image scenes. We derive two key insights for 3D reconstruction: 1) tokens from local image regions have inherent geometric correlations, leading to high similarity and computational redundancy; 2) token similarity acroses adjacent network layers remains stable, allowing for reusable merge decisions. Guided by these, we design a simple yet efficient strategy, dubbed geometry-aware cached token merging . We analyze each token\u2019s geometric importance, optimizing anchor token selection to better preserve key information for reconstruction. We also cache and reuse merge indices across layers, substantially reducing latency with minimal accuracy impact. This strategy retains VGGT\u2019s core performance, enabling efficient fine-tuning and FP8 quantization for further gains. Extensive experiments validate LiteVGGT\u2019s effectiveness, scalability, and robustness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36888", "url": null, "sourceid": 45407, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38456, "uid": "4b7147dea804df0cb40380152680314c", "name": "Bi-directional Autoregressive Diffusion for Large Complex Motion Interpolation", "authors": [{"id": 99292, "fullname": "Yongrui Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/99292?format=json", "institution": "Department of Computer Science and Engineering, The Chinese University of Hong Kong"}, {"id": 130256, "fullname": "Shijie Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/130256?format=json", "institution": "ByteDance Inc."}, {"id": 76562, "fullname": "Mingde Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/76562?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 130240, "fullname": "Junlin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/130240?format=json", "institution": "ByteDance Inc."}, {"id": 130296, "fullname": "Li zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130296?format=json", "institution": "Bytedance Inc."}, {"id": 71043, "fullname": "Xiaohong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/71043?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 75510, "fullname": "Qi Dou", "url": "http://cvpr.thecvf.com/api/miniconf/users/75510?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 75943, "fullname": "Jinwei Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75943?format=json", "institution": "NVIDIA"}, {"id": 87471, "fullname": "Tianfan Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/87471?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "Despite recent progress, diffusion-based video frame interpolation methods still struggle with large complex motions, resulting in discontinuous motions and inconsistent object appearances across frames. We observe that these limitations arise from both the current full-sequence interpolation strategy and the pixel reconstruction training objective. To solve these challenges, we propose ARVFI, a novel video diffusion-based interpolation method for large complex motion interpolation. Instead of generating all intermediate frames simultaneously, ARVFI interpolates in an autoregressive manner from two input frames to the middle ones. Thus, ARVFI interpolates a frame that is further away from the inputs based on all previous interpolation results, resulting in smoother motion transitions and better temporal consistency. Additionally, ARVFI further utilizes DINOv3 features as motion representations, which provide high-level semantics for accurate motion estimation, compared with a simple pixel-level loss. With all these designs, ARVFI generates the intermediate DINOv3 features first and then the frames with an effective conditional generation method for frames. Our ARVFI consistently outperforms existing methods with superior interpolation accuracy and visual quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38456", "url": null, "sourceid": 35888, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38842, "uid": "289df84d26002f38f3562ebc1a964ae3", "name": "Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM", "authors": [{"id": 180235, "fullname": "Junyuan Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180235?format=json", "institution": "National University of Singapore"}, {"id": 182645, "fullname": "Qiankun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182645?format=json", "institution": "Nanyang Technological University"}, {"id": 190807, "fullname": "Linghao Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190807?format=json", "institution": "National University of Singapore"}, {"id": 190808, "fullname": "Zhicheng He", "url": "http://cvpr.thecvf.com/api/miniconf/users/190808?format=json", "institution": "National University of Singapore"}, {"id": 190809, "fullname": "Xinliang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190809?format=json", "institution": "Stanford University"}, {"id": 187445, "fullname": "Kun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187445?format=json", "institution": "Nanyang Technological University"}, {"id": 127378, "fullname": "Yang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127378?format=json", "institution": "Nanyang Technological University"}, {"id": 190810, "fullname": "Yueming Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190810?format=json", "institution": "National University of Singapore"}], "abstract": "Recent advances in multimodal large language models largely rely on CLIP-based visual encoders, which emphasize global semantic alignment but struggle with fine-grained visual understanding. In contrast, DINOv3 provides strong pixel-level perception yet lacks coarse-grained semantic abstraction, leading to limited multi-granularity reasoning. To address this gap, we propose Granulon, a novel DINOv3-based MLLM with adaptive granularity augmentation. Granulon introduces a text-conditioned granularity Controller that dynamically adjusts the visual abstraction level according to the semantic scope of the textual input, and an Adaptive Token Aggregation module that performs granularity-guided pooling and relation-aware clustering to produce compact, semantically rich visual tokens. This design enables unified \"pixel-to-fine-to-coarse\" reasoning within a single forward pass. Extensive and interpretable experiments demonstrate that Granulon improves accuracy by 30% and reduces hallucination by 20%, outperforming all visual encoders under identical settings. Code is available at the Supplementary.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38842", "url": null, "sourceid": 36823, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38999, "uid": "0f7b5a93baceca15e62b21ed14e5fdf0", "name": "Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video", "authors": [{"id": 181314, "fullname": "Yuting Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181314?format=json", "institution": "Communication University of China"}, {"id": 191156, "fullname": "Xilong Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191156?format=json", "institution": "Communication University of China"}, {"id": 191157, "fullname": "Yunxiao Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/191157?format=json", "institution": "Communication University of China"}, {"id": 191158, "fullname": "Zhengnan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191158?format=json", "institution": "The Chinese University of Hong Kong Shenzhen"}, {"id": 191159, "fullname": "Jingjing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191159?format=json", "institution": "Communication University of China"}], "abstract": "Humans develop visual intelligence through perceiving and interacting with their environment\u2014a self-supervised learning process grounded in egocentric experience. Inspired by this, we ask how can artificial systems learn stable object representations from continuous, uncurated first-person videos without relying on manual annotations.This setting poses challenges of separating, recognizing, and persistently tracking objects amid clutter, occlusion, and ego-motion.We propose EgoViT, a unified vision Transformer framework designed to learn stable object representations from unlabeled egocentric video. EgoViT bootstraps this learning process by jointly discovering and stabilizing \"proto-objects\" through three synergistic mechanisms: (1) Proto-object Learning, which uses intra-frame distillation to form discriminative representations; (2) Depth Regularization, which grounds these representations in geometric structure; and (3) Teacher-Filtered Temporal Consistency, which enforces identity over time. This creates a virtuous cycle where initial object hypotheses are progressively refined into stable, persistent representations. The framework is trained end-to-end on unlabeled first-person videos and exhibits robustness to geometric priors of varied origin and quality.On standard benchmarks, EgoViT achieves +8.0% CorLoc improvement in unsupervised object discovery and +4.8% mIoU improvement in semantic segmentation, is potential to lay a foundation for achieving robust visual abstraction for embodied intelligence.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38999", "url": null, "sourceid": 43887, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36382, "uid": "539ef09cdaadc97ead9ad8794efde829", "name": "Inferring Compositional 4D Scenes without Ever Seeing One", "authors": [{"id": 143041, "fullname": "Ahmet Berke G\u00f6kmen", "url": "http://cvpr.thecvf.com/api/miniconf/users/143041?format=json", "institution": "INSAIT, Sofia University &quot;St. Kliment Ohridski&quot;"}, {"id": 152178, "fullname": "Ajad Chhatkuli", "url": "http://cvpr.thecvf.com/api/miniconf/users/152178?format=json", "institution": "Institute for Computer Science, Artificial Intelligence and Technology"}, {"id": 75489, "fullname": "Luc Van Gool", "url": "http://cvpr.thecvf.com/api/miniconf/users/75489?format=json", "institution": "INSAIT, Sofia Un. St. Kliment Ohridski"}, {"id": 156198, "fullname": "Danda Paudel", "url": "http://cvpr.thecvf.com/api/miniconf/users/156198?format=json", "institution": "INSAIT, Sofia University"}], "abstract": "Scenes in the real world are often composed of several static and dynamic objects. Capturing their 4-dimensional structures, composition and spatio-temporal configuration in-the-wild, though extremely interesting, is equally hard.Therefore, existing works often focus on one object at a time, while relying on some category-specific parametric shape model for dynamic objects. This can lead to inconsistent scene configurations, in addition to being limited to the modeled object categories. We propose COM4D (Compositional 4D), a method that consistently and jointly predicts the structure and spatio-temporal configuration of 4D/3D objects using only static multi-object or dynamic single object supervision. We achieve this by a carefully designed training of spatial and temporal attentions on 2D video input. The training is disentangled into learning from object compositions on the one hand, and single object dynamics throughout the video on the other, thus completely avoiding reliance on 4D compositional training data.At inference time, our proposed attention mixing mechanism combines these independently learned attentions, without requiring any 4D composition examples.By alternating between spatial and temporal reasoning, COM4D reconstructs complete and persistent 4D scenes with multiple interacting objects directly from monocular videos.COM4D provides state-of-the-art results in existing separate problems of 4D object and composed 3D reconstruction despite being purely data-driven.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36382", "url": "https://com4d.insait.ai/", "sourceid": 39790, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38145, "uid": "68fa8bc4c8b50ad66a583462a1c684e4", "name": "Keep It Frozen: Domain-Routed Conditional Residual Modulation for Multi-Domain Vision Transformers", "authors": [{"id": 184563, "fullname": "Ufaq Khan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184563?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 184562, "fullname": "Umair Nawaz", "url": "http://cvpr.thecvf.com/api/miniconf/users/184562?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 189156, "fullname": "Massimo Caputo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189156?format=json", "institution": "University of Bristol"}, {"id": 189157, "fullname": "Muhammad Bilal", "url": "http://cvpr.thecvf.com/api/miniconf/users/189157?format=json", "institution": "Birmingham City University"}, {"id": 189158, "fullname": "Junaid Qadir", "url": "http://cvpr.thecvf.com/api/miniconf/users/189158?format=json", "institution": "University of Qatar"}, {"id": 76698, "fullname": "Muhammad Haris Khan", "url": "http://cvpr.thecvf.com/api/miniconf/users/76698?format=json", "institution": "Mohamed Bin Zayed University of Artificial Intelligence"}], "abstract": "Medical imaging presents significant challenges due to acoustic shadows, motion blur, and indistinct boundaries. Addressing these issues is crucial for improving diagnostic accuracy. Many conventional vision require extensive fine-tuning on task-specific data and often lose generalizability to natural-image domains. We propose DCRM-ViT, a domain-conditioned residual modulation framework for Vision Transformers that preserves general-vision capability while adapting to diverse domains. DCRM-ViT keeps the backbone frozen and augments each block with a lightweight Residual Modulation Block (RMB) whose parameters are synthesized per sample by a Domain Router (DR) and Parameter Synthesizer Network (PSN). The router outputs soft domain weights from input features, whereas the synthesizer maps these weights to low-rank residuals that modulate selected projections and, optionally, add a domain-aware bias to attention. Crucially, we learn routing and modulation via a bi-level optimization scheme: a short inner loop adapts RMB parameters to task supervision, while an outer loop updates DR, PSN, and RMB initializations/step sizes so the synthesized residuals generalize across medical and natural domains. Across fine-grained classification (Food101, SUN397, Stanford Cars) and medical segmentation (ultrasound, CT, MRI), DCRM-ViT improves over strong baselines while using modest trainable compute. The ablation studies confirmed the benefits of our architectural enhancements, showing improved performance and adaptability. The results demonstrate DCRM-ViT's potential to offer high diagnostic performance with reduced computational overhead of using 4.7 GFLOPs and 0.3 training min/epoch. Our code will be publicly available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38145", "url": null, "sourceid": 42431, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37773, "uid": "0982d54d18a026163f76888c0d226166", "name": "WeatherDiffusion: Controllable Weather Editing in Intrinsic Space", "authors": [{"id": 188234, "fullname": "Yixin Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188234?format=json", "institution": "Nanjing University"}, {"id": 76631, "fullname": "Zuo-Liang Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76631?format=json", "institution": "Nankai University"}, {"id": 152125, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152125?format=json", "institution": "nanjing university"}, {"id": 155414, "fullname": "Milos Hasan", "url": "http://cvpr.thecvf.com/api/miniconf/users/155414?format=json", "institution": "Adobe Systems"}, {"id": 137969, "fullname": "Jin Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/137969?format=json", "institution": "Nanjing University"}, {"id": 145863, "fullname": "Beibei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/145863?format=json", "institution": "Nanjing University"}], "abstract": "We present WeatherDiffusion, a diffusion-based framework for controllable weather editing in intrinsic space. Our framework includes two components based on diffusion priors: an inverse renderer that estimates material properties, scene geometry, and lighting as intrinsic maps from an input image, and a forward renderer that utilizes these geometry and material maps along with a text prompt that describes specific weather conditions to generate a final image. The intrinsic maps enhance controllability compared to traditional pixel-space editing approaches. We propose an intrinsic map-aware attention mechanism that improves spatial correspondence and decomposition quality in large outdoor scenes. For forward rendering, we leverage CLIP-space interpolation of weather prompts to achieve fine-grained weather control. We also introduce a synthetic and a real-world dataset, containing 38k and 18k images under various weather conditions, each with intrinsic map annotations. WeatherDiffusion outperforms state-of-the-art pixel-space editing approaches, weather restoration methods, and rendering-based methods, showing promise for downstream tasks such as autonomous driving, enhancing the robustness of detection and segmentation in challenging weather scenarios.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37773", "url": null, "sourceid": 31287, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39415, "uid": "fd3dad51304f5ff120abce31ca350f35", "name": "Cross-Modal Attention Calibration for LVLM Hallucination Mitigation", "authors": [{"id": 126336, "fullname": "Jiaming Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/126336?format=json", "institution": "Baidu"}, {"id": 192032, "fullname": "Jiacheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192032?format=json", "institution": "The University of Hong Kong"}, {"id": 88289, "fullname": "Zequn Jie", "url": "http://cvpr.thecvf.com/api/miniconf/users/88289?format=json", "institution": "Meituan"}, {"id": 88103, "fullname": "Lin Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/88103?format=json", "institution": "Meituan"}, {"id": 192033, "fullname": "Ming Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192033?format=json", "institution": null}, {"id": 192034, "fullname": "Xiaonan Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192034?format=json", "institution": "Guilin University of Electronic Technology; SUN YAT-SEN UNIVERSITY"}, {"id": 185257, "fullname": "Guanbin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185257?format=json", "institution": null}], "abstract": "Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding. Despite their success, LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. To address this issue, some approaches have introduced inference-time interventions, such as contrastive decoding, to reduce overreliance on language priors.   However, these approaches overlook hallucinations stemming from position bias and spurious inter-modality correlations. In this paper, we propose a Cross-Modal Attention Calibration (CMAC) method to mitigate hallucinations in LVLMs in a training-free manner. In this method, we design an Inter-Modality Decoding (IMD) module to alleviate hallucination by a novel contrastive decoding mechanism. IMD masks the value vectors associated with significant cross-modal attention weights as distortion, which addresses both uni-modality overreliance and misleading inter-modality correlations. Additionally, a Cross-Modal Position Calibration (CMPC) module shrinks the position gap of image tokens, alleviating the position bias in cross-modal attention. Experimental results on diverse hallucination benchmarks validate the superiority of our method over existing state-of-the-art techniques in reducing hallucinations for LVLM. Our code will be available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39415", "url": null, "sourceid": 38174, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36798, "uid": "a0097122646b4c5ff8de96d593cc5de5", "name": "Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation", "authors": [{"id": 185899, "fullname": "Shifeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185899?format=json", "institution": "Beihang University"}, {"id": 180730, "fullname": "Yihui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180730?format=json", "institution": "Beihang University"}, {"id": 175641, "fullname": "Jun Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/175641?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 76464, "fullname": "Hongyu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76464?format=json", "institution": "Beihang University"}, {"id": 87605, "fullname": "Di Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87605?format=json", "institution": "Beihang University"}], "abstract": "Recent advances in 3D scene editing using NeRF and 3DGS enable high-quality static scene editing. In contrast, dynamic scene editing remains challenging, as methods that directly extend 2D diffusion models to 4D often produce motion artifacts, temporal flickering, and inconsistent style propagation. We introduce Catalyst4D, a framework that transfers high-quality 3D edits to dynamic 4D Gaussian scenes while maintaining spatial and temporal coherence. At its core, Anchor-based Motion Guidance (AMG) builds a set of structurally stable and spatially representative anchors from both original and edited Gaussians. These anchors serve as robust region-level references, and their correspondences are established via optimal transport to enable consistent deformation propagation without cross-region interference or motion drift. Complementarily, Color Uncertainty-guided Appearance Refinement (CUAR) preserves temporal appearance consistency by estimating per-Gaussian color uncertainty and selectively refining regions prone to occlusion-induced artifacts. Extensive experiments demonstrate that Catalyst4D achieves temporally stable, high-fidelity dynamic scene editing and outperforms existing methods in both visual quality and motion coherence.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36798", "url": null, "sourceid": 40897, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36501, "uid": "722122a6b4ba1d6a821792615d953cea", "name": "SVBench: Evaluation of Video Generation Models on Social Reasoning", "authors": [{"id": 185217, "fullname": "Wenshuo Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185217?format=json", "institution": "tencent"}, {"id": 185218, "fullname": "Gongxuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185218?format=json", "institution": "Harbin Institute of Technology"}, {"id": 185219, "fullname": "Tianmeng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185219?format=json", "institution": "Baidu"}, {"id": 184555, "fullname": "Chuanhao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184555?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 152732, "fullname": "Xiaojie Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152732?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 185220, "fullname": "Hui He", "url": "http://cvpr.thecvf.com/api/miniconf/users/185220?format=json", "institution": "Harbin Institute of Technology"}, {"id": 184560, "fullname": "Kaipeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184560?format=json", "institution": "Shanda AI Research"}], "abstract": "Recent text-to-video generation models exhibit remarkable progress in visual realism, motion fidelity, and text\u2013video alignment, yet they remain fundamentally limited in their ability to generate socially coherent behavior. Unlike humans\u2014who effortlessly infer intentions, beliefs, emotions, and social norms from brief visual cues\u2014current models tend to render literal scenes without capturing the underlying causal or psychological logic. To systematically evaluate this gap, we introduce the first benchmark for social reasoning in video generation. Grounded in findings from developmental and social psychology, our benchmark organizes thirty classic social cognition paradigms into seven core dimensions, including mental-state inference, goal-directed action, joint attention, social coordination, prosocial behavior, social norms, and multi-agent strategy. To operationalize these paradigms, we develop a fully training-free agent-based pipeline that (i) distills the reasoning mechanism of each experiment, (ii) synthesizes diverse video-ready scenarios, (iii) enforces conceptual neutrality and difficulty control through cue-based critique, and (iv) evaluates generated videos using a high-capacity VLM judge across five interpretable dimensions of social reasoning. Using this framework, we conduct the first large-scale study across seven state-of-the-art video generation systems. Our results reveal substantial performance gaps: while modern models excel in surface-level plausibility, they systematically fail in intention recognition, belief reasoning, joint attention, and prosocial inference.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36501", "url": null, "sourceid": 35103, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36188, "uid": "1c38fdd837bd814787fd6ea06ec72dc3", "name": "OneSparse: A Unified Framework for Sparse Activation Layers in Vision Models", "authors": [{"id": 184379, "fullname": "Xingkui Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184379?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 76439, "fullname": "Dingkang Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76439?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 184380, "fullname": "Cheng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184380?format=json", "institution": "Xiaohongshu"}, {"id": 184381, "fullname": "Daoxin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184381?format=json", "institution": null}, {"id": 184382, "fullname": "lv hanxiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184382?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 184383, "fullname": "Zhe Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184383?format=json", "institution": null}, {"id": 126819, "fullname": "Yao Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126819?format=json", "institution": "Zhejiang University, Tsinghua University"}, {"id": 85817, "fullname": "Xiang Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/85817?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Sparse activation layers, primarily Mixture-of-Experts (MoE) and memory-based modules, are a central approach for scaling large models and are gaining traction in vision tasks. Despite conceptual similarities, these paradigms have evolved independently, hindering systematic comparison and the development of modules that exploit their complementary strengths. To bridge this gap, we propose **OneSparse**, a unified framework that reformulates MoE and memory modules under a common abstraction. This enables their systematic comparison and integration, revealing a continuous design space. Guided by this abstraction, we design the **Nexus Layer**, which features two key innovations: a unified routing mechanism that merges the efficiency of memory retrieval with MoE's load balancing to ensure stable and scalable token assignment, and an adaptive processing strategy where memory modules sketches coarse representations while expert modules refine critical regions. Extensive experiments on image classification, object detection, and semantic segmentation demonstrate that our Nexus Layer establishes a new performance efficiency frontier, surpassing representative sparse baselines on convolutional and transformer architectures.  These results validate the power of the OneSparse framework to unify and integrate complementary sparse paradigms and underscores the potential of hybrid sparse modeling in vision.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36188", "url": null, "sourceid": 46017, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37340, "uid": "b39ac02ff5021fed10cb9988a23d5d02", "name": "A Cross-view Fusion Framework for Robust 6-DoF Grasp Pose Estimation", "authors": [{"id": 187200, "fullname": "Kangjian Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187200?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 86987, "fullname": "Haobo Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86987?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 127401, "fullname": "Jianjun Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/127401?format=json", "institution": "Nanjing University of Science and Techonology"}, {"id": 137969, "fullname": "Jin Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/137969?format=json", "institution": "Nanjing University"}], "abstract": "In this paper, we propose a cross-view fusion framework that enhances the robustness of 6-DoF grasp pose estimation in corner views.Our framework alleviates occlusion by incorporating an auxiliary view and avoids the time-consuming, task-agnostic multi-view reconstruction through a post-fusion strategy.To enable cross-view fusion, we propose a self-supervised contrastive learning strategy that leverages cross-view associations to regularize point cloud features.In brief, a cross-view point pair is considered a match if the two points correspond to the same 3D location, and a non-match if they represent distinct grasp directions.The learning strategy significantly enhances the spatial consistency and direction distinctiveness of point features, thereby facilitating cross-view fusion and improving estimation robustness.Furthermore, we propose a cross-view-aligned cylinder integration module to fuse grasp-relevant geometry into a comprehensive representation.Specifically, the module first aligns the cross-view points and features according to their similarity to enhance the robustness against noise.Subsequently, these points are registered into the cylindrical coordinate frame, emphasizing the rotation-symmetric geometry which is important for grasping.Finally, local self-attention and seed cross-attention layers are alternately employed, respectively enabling interactions within single views and across views, which supports fine-grained representation of grasp-relevant geometry.Our framework achieves strong performance on the GraspNet-1Billion benchmark and in real-world applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37340", "url": null, "sourceid": 30679, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39861, "uid": "72b8ade48bd53ac615b67923ece11593", "name": "EffectMaker: Unifying Reasoning and Generation for Customized Visual Effect Creation", "authors": [{"id": 147534, "fullname": "Shiyuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/147534?format=json", "institution": "City University of Hong Kong"}, {"id": 76781, "fullname": "Ruihuang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/76781?format=json", "institution": null}, {"id": 187697, "fullname": "Jiale Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187697?format=json", "institution": "Tencent"}, {"id": 168459, "fullname": "Shuai Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/168459?format=json", "institution": "Tencent Hunyuan"}, {"id": 106459, "fullname": "qinglin lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/106459?format=json", "institution": "Tencent"}, {"id": 86410, "fullname": "Jing Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86410?format=json", "institution": "City University of Hong Kong"}], "abstract": "Visual effects (VFX) are essential for enhancing the expressiveness and creativity of video content, yet producing high-quality effects typically requires expert knowledge and costly production pipelines. Existing AIGC systems face significant challenges in VFX generation due to the scarcity of effect-specific data and the inherent difficulty of modeling supernatural or stylized effects. Moreover, these approaches often require per-effect fine-tuning, which severely limits their scalability and generalization to novel VFX.In this work, we present EffectMaker, a unified reasoning\u2013generation framework that enables reference-based VFX customization. EffectMaker employs a multimodal large language model to interpret high-level effect semantics and reason about their adaptation to a target subject, while a diffusion transformer leverages in-context learning to capture fine-grained visual cues from reference videos. These two components form a semantic\u2013visual dual-path guidance mechanism that enables accurate, controllable, and effect-consistent synthesis without per-effect fine-tuning.Furthermore, we construct EffectData, a largest and high-quality synthetic dataset containing 100K videos across 2K VFX categories, to enhance generalization and scalability. Experiments show that EffectMaker achieves superior visual quality and effect consistency over state-of-the-art baselines, offering a scalable and flexible paradigm for customized VFX generation. Code and data will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39861", "url": null, "sourceid": 33936, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38163, "uid": "adb10166081eca5b9313b4a35e45eaf1", "name": "RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cue for 3D Object Detection", "authors": [{"id": 180634, "fullname": "Xiaokai Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/180634?format=json", "institution": "Zhejiang University"}, {"id": 144399, "fullname": "Chenxu Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/144399?format=json", "institution": "Zhejiang University"}, {"id": 189187, "fullname": "Lianqing Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189187?format=json", "institution": "Tongji University"}, {"id": 189188, "fullname": "Jianan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189188?format=json", "institution": "Momoni AI"}, {"id": 131034, "fullname": "Siyuan Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/131034?format=json", "institution": "Zhejiang University"}, {"id": 189189, "fullname": "Xiaohan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189189?format=json", "institution": "Zhejiang University"}, {"id": 189190, "fullname": "Yiming Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189190?format=json", "institution": "Zhejiang University"}, {"id": 176242, "fullname": "Zhengzhuang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176242?format=json", "institution": "Zhejiang University"}, {"id": 77435, "fullname": "Hui-Liang Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/77435?format=json", "institution": "Zhejiang University"}], "abstract": "4D millimeter-wave radar is a promising sensing modality for autonomous driving, yet effective 3D object detection from 4D radar and monocular images remains challenging. Existing fusion approaches either rely on instance proposals lacking global context or dense BEV grids constrained by rigid structures, lacking a flexible and adaptive representation for diverse scenes. To address this, we propose RaGS, the first framework that leverages 3D Gaussian Splatting (GS) to fuse 4D radar and monocular cues for 3D object detection. 3D GS models the scene as a continuous field of Gaussians, enabling dynamic resource allocation to foreground objects while maintaining flexibility and efficiency. Moreover, the velocity dimension of 4D radar provides motion cues that help anchor and refine the spatial distribution of Gaussians. Specifically, RaGS adopts a cascaded pipeline to construct and progressively refine the Gaussian field. It begins with Frustum-based Localization Initiation (FLI), which unprojects foreground pixels to initialize coarse Gaussian centers. Then, Iterative Multimodal Aggregation (IMA) explicitly exploits image semantics and implicitly integrates 4D radar velocity geometry to refine the Gaussians within regions of interest. Finally, Multi-level Gaussian Fusion (MGF) renders the Gaussian field into hierarchical BEV features for 3D object detection. By dynamically focusing on sparse and informative regions, RaGS achieves object-centric precision and comprehensive scene perception. Extensive experiments on View-of-Delft, TJ4DRadSet, and OmniHD-Scenes demonstrate its state-of-the-art performance. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38163", "url": null, "sourceid": 39260, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37739, "uid": "7db335d0b996c9c9e498ba1d3fbe1e81", "name": "Seeing Through Blur: Tackling Defocus in Spike-Based Imaging", "authors": [{"id": 188137, "fullname": "Xiantao Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/188137?format=json", "institution": "Beijing Institute of Technology; Beijing Institute of Technology"}, {"id": 188138, "fullname": "Siwei Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/188138?format=json", "institution": "Peking University"}, {"id": 133113, "fullname": "Lin Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/133113?format=json", "institution": "Beijing Institute of Technology"}, {"id": 157334, "fullname": "Lizhi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157334?format=json", "institution": "Beijing Normal University"}, {"id": 127562, "fullname": "Hua Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127562?format=json", "institution": "Beijing Normal University"}], "abstract": "Spike cameras are a novel class of neuromorphic vision sensors that capture scene dynamics with ultra-high temporal resolution via spike planes. While recent methods have addressed motion blur and noise in spike-based reconstruction, defocus blur caused by shallow depth of field or lens adjustment delays remains a critical yet underexplored issue in real-world applications such as autonomous driving. In this work, we present DeSpike, the first end-to-end defocus removal framework specifically designed for spike cameras. Our method begins by explicitly modeling the defocus formation process using a physics-inspired thin-lens approximation to simulate spike responses under optical blur. Guided by this formulation, DeSpike employs multi-temporal-scale integrate-and-fire (IF) neurons to compensate for FPN and extract defocus-aware features from spike streams. These features are then processed by a physics-informed deblurring module constructed from learnable discrete PSF priors. To address spatially variant blur, we introduce a Transformer-based fusion mechanism that adaptively weighs multi-scale deblurring results through attention across defocus levels. Finally, a coarse-to-fine iterative refinement stage combines spike features and PSF priors for progressive restoration. Extensive experiments on both synthetic and real-world defocused spike datasets demonstrate that our method achieves superior performance over state-of-the-art deblurring approaches in terms of structural fidelity, perceptual sharpness, and contrast, setting a new benchmark for defocus-aware spike-based image reconstruction.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37739", "url": null, "sourceid": 33488, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39152, "uid": "24871a24babba7a37fcd8eab5a1e8e11", "name": "Progressive Cross-Modal Causal Intervention for Long-Term Action Recognition", "authors": [{"id": 191457, "fullname": "Shaowu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191457?format=json", "institution": "Beijing University Of Technology"}, {"id": 191458, "fullname": "Xibin Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/191458?format=json", "institution": "Beijing University of Technology"}, {"id": 191459, "fullname": "Chao Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191459?format=json", "institution": "Beijing University of Technology"}, {"id": 76372, "fullname": "Junyu Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/76372?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 191460, "fullname": "Jing Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191460?format=json", "institution": "Beijing Chao-yang Hospital, Capital Medical University"}, {"id": 191461, "fullname": "Qianmei Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/191461?format=json", "institution": "Beijing Chao-yang Hospital, Capital Medical University"}], "abstract": "Intricate correlations among atomic actions and inherent visual confounders in long-term action recognition (LTAR) contribute to the persistent challenges in this domain. While methods based on vision-language models that employ label text for supervision offer potential for handling visual confounders, their reliance on statistical correlations rather than causal mechanisms introduces two vulnerabilities: (1) spurious alignments with non-causal co-occurring visual features during cross-modal interaction, and (2) misinterpretation of codependencies among actions. To address these limitations, this paper introduces Progressive Cross-Modal Causal Intervention (PCMCI). PCMCI first mitigates co-occurrence hallucination via causal intervention grounded in optimal transport theory. Subsequently, an action relation-aware mechanism counters the backdoor path induced by codependency illusion, enabling the derivation of deconfounded text embeddings. Finally, these deconfounded embeddings serve as mediator to implement front-door adjustment to remove visual confounders. This progressive causal intervention framework facilitates learning robust representations for LTAR. Experiments on three long-term action benchmarks demonstrate the effectiveness of the proposed model.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39152", "url": null, "sourceid": 34260, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37480, "uid": "c17d923ee20f8a125139fe2dc0054ff1", "name": "Subspace Alignment for CLIP-based Continual Learning via Canonical Correlation Analysis", "authors": [{"id": 180534, "fullname": "Huan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180534?format=json", "institution": "Wuhan University"}, {"id": 187556, "fullname": "Shuyu Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187556?format=json", "institution": "Central China Normal University"}, {"id": 187557, "fullname": "Yujin Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187557?format=json", "institution": null}, {"id": 187558, "fullname": "Dingwen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187558?format=json", "institution": "Wuhan University"}, {"id": 187559, "fullname": "Shenghua Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187559?format=json", "institution": "Wuhan University"}, {"id": 187560, "fullname": "Fan Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187560?format=json", "institution": "Computer Vision Center, Universitat Aut\u00f3noma de Barcelona"}], "abstract": "Recent advances in CLIP-based continual learning have shown the potential of leveraging pre-trained vision\u2013language models for sequential tasks. However, existing methods overlook a key problem we call Asymmetric Drift. In unimodal CLIP-based continual learning, the visual branch undergoes stronger adaptation because the visual distribution shifts significantly, whereas the text branch remains relatively stable due to the low variance of textual prompts. This imbalance increases the modality distance and degrades cross-modal alignment over time.To address this issue, we propose CCA-CL, a framework that accumulates visual-textual covariance statistics across tasks and solves Canonical Correlation Analysis to compute a shared subspace. In this subspace, the distance between visual and textual features is minimized, enabling better alignment without modifying CLIP parameters. This also makes our method naturally compatible with exemplar-free CL settings.To further capture nonlinear relationships that linear Canonical Correlation Analysis hard to model, we introduce Random Fourier Projection as an extension.Experimental results demonstrate that CCA-CL effectively mitigates the asymmetric drift problem and achieves state-of-the-art performance on several benchmarks. Our code will be available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37480", "url": null, "sourceid": 31216, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37435, "uid": "7f6e7707c60d6274e2e0ce07bad488de", "name": "Learning Distribution-wise Foundation Prior Consistency and Instance-wise Style Calibration for Medical Image Generalization", "authors": [{"id": 102059, "fullname": "Liu Chuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/102059?format=json", "institution": "Beihang University"}, {"id": 187446, "fullname": "Yichao Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187446?format=json", "institution": "Central South University"}, {"id": 156647, "fullname": "Xiu Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/156647?format=json", "institution": "Central South University"}, {"id": 187447, "fullname": "Haogang Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187447?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}], "abstract": "Test-time adaptation (TTA) has emerged as a promising solution to address real world domain shifts in medical image segmentation. Most current approaches adapt by updating or regularizing a pre-trained source model. However, they face two major issues: \\textit{(i) the source models on which they rely are prone to overfitting under domain shift; (ii) in dynamic continual testing scenarios, error accumulation and class forgetting are further exacerbated.} To overcome these limitations, we propose \\textbf{TanGo}, a novel framework that combines \\textbf{T}raining to \\textbf{a}dapt with Fou\\textbf{n}dation \\textbf{G}uidance and C\\textbf{o}ntinual Style Calibration. During training, TanGo learns generalization priors from vision foundation models (VFMs) through distribution-level consistency learning. We incorporate stable low-frequency representations from a frozen encoder of VFMs as priors to guide the source model, constraining its output feature distribution to yield a more generalizable feature space. At test time, we introduce an instance-wise style calibration method that employs a learnable data decorator to transform dynamic test images back toward source-like distributions. Subsequently, a set of source-anchored constraints is applied to preserve semantic integrity in the transformed test images and align their distributions more closely with the enhanced source space. Extensive experiments on multiple medical image segmentation tasks demonstrate that TanGo achieves state-of-the-art performance. All code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37435", "url": null, "sourceid": 31932, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36716, "uid": "d7b34cc18c3b1c0a53acf0987d834a31", "name": "H2-Surv: Hierarchical Hyperbolic Multimodal Representation Learning for Survival Prediction", "authors": [{"id": 92799, "fullname": "Jiaqi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/92799?format=json", "institution": "University of Nottingham"}, {"id": 157860, "fullname": "Wenting Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/157860?format=json", "institution": "City University of Hong Kong"}, {"id": 91359, "fullname": "Xiangjian He", "url": "http://cvpr.thecvf.com/api/miniconf/users/91359?format=json", "institution": "The University of Nottingham Ningbo China"}, {"id": 185713, "fullname": "Yuanbai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185713?format=json", "institution": "The University of Nottingham Ningbo China"}, {"id": 85971, "fullname": "Sen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85971?format=json", "institution": "Sichuan University"}, {"id": 76746, "fullname": "Linlin Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76746?format=json", "institution": "Shenzhen University"}, {"id": 157917, "fullname": "Xiaohan Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/157917?format=json", "institution": "Stanford University"}], "abstract": "Cancer survival prediction through multimodal learning that combines histopathology images with genomic data represents a promising research direction. However, current approaches still suffer from two key limitations. First, most methods operate in a Euclidean feature space, which makes it difficult to capture the intrinsic hierarchies in histopathology, where information is organized from patches to whole-slide images to patients, and in genomics, where it progresses from genes to pathways to patients. Second, they typically discretize survival times into coarse risk intervals, neglecting fine-grained ordinal relationships among samples within the same interval and thus failing to capture the continuous ranking characteristics of survival outcomes.To address these issues, we propose \\ourmethod, a hyperbolic hierarchical multimodal learning framework for survival prediction. H2-SurvNet first employs a **hyperbolic hierarchical information modeling**  (H2IM) module that maps multimodal features into a shared hyperbolic space and explicitly encodes intra-modal and inter-modal hierarchies across patches, WSIs, patients, genes, and pathways. On top of this representation, we design a **Temporal Ordinal Contrastive learning** (TOCL) module that models the temporal progression of survival outcomes by enforcing ordinal risk ordering through contrastive objectives, thereby promoting continuity in the learned risk scores.Extensive experiments on heterogeneous cohorts from TCGA, CPTAC, and NLST demonstrate that H2-SurvNet consistently outperforms state-of-the-art multimodal survival prediction methods and exhibits strong robustness and generalization across diverse data distributions. Source code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36716", "url": null, "sourceid": 42673, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38628, "uid": "913c130aa3a3e9780ee459eadf80c05c", "name": "Physically Ground Commonsense Knowledge for Articulated Object Manipulation with Analytic Concepts", "authors": [{"id": 179757, "fullname": "Jiude Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/179757?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 190339, "fullname": "Yuxuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190339?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 86440, "fullname": "Cewu Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86440?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 190340, "fullname": "Jianhua Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/190340?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "We human rely on a wide range of commonsense knowledge to interact with an extensive number and categories of objects in the physical world. Likewise, such commonsense knowledge is also crucial for robots to successfully develop generalized object manipulation skills. While recent advancements in Multi-modal Large Language Models (MLLMs) have showcased their impressive capabilities in acquiring commonsense knowledge and conducting commonsense reasoning, effectively grounding this semantic-level knowledge produced by MLLMs to the physical world to thoroughly guide robots in generalized articulated object manipulation remains a challenge that has not been sufficiently addressed. To this end, we introduce analytic concepts, procedurally defined upon mathematical symbolism that can be directly computed and simulated by machines. By leveraging the analytic concepts as a bridge between the semantic-level knowledge inferred by MLLMs and the physical world where real robots operate, we can figure out the knowledge of object structure and functionality with physics-informed representations, and then use the physically grounded knowledge to instruct robot control policies for generalized and accurate articulated object manipulation. Extensive experiments in both real world and simulation demonstrate the superiority of our approach. Please refer to the Supplementary Material for more details, and our codes will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38628", "url": null, "sourceid": 31603, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36542, "uid": "cf7f5d18c19536b2299bd648d5028f0c", "name": "PhysGS: Bayesian-Inferred Gaussian Splatting for Physical Property Estimation", "authors": [{"id": 185306, "fullname": "Samarth Chopra", "url": "http://cvpr.thecvf.com/api/miniconf/users/185306?format=json", "institution": "University of Maryland, College Park"}, {"id": 181760, "fullname": "Jing Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181760?format=json", "institution": "Stanford University"}, {"id": 185307, "fullname": "Gershom Seneviratne", "url": "http://cvpr.thecvf.com/api/miniconf/users/185307?format=json", "institution": "University of Maryland, College Park"}, {"id": 85839, "fullname": "Dinesh Manocha", "url": "http://cvpr.thecvf.com/api/miniconf/users/85839?format=json", "institution": "University of Maryland, College Park"}], "abstract": "Understanding physical properties such as friction, stiffness, hardness, and material composition is essential for enabling robots to interact safely and effectively with their surroundings. However, existing 3D reconstruction methods focus on geometry and appearance and cannot infer these underlying physical properties. We present PhysGS, a Bayesian-inferred extension of 3D Gaussian Splatting that estimates dense, per-point physical properties from visual cues and vision--language priors. We formulate property estimation as Bayesian inference over Gaussian splats, where material and property beliefs are iteratively refined as new observations arrive. PhysGS also models aleatoric and epistemic uncertainties, enabling uncertainty-aware object and scene interpretation. Across object-scale (ABO-500), indoor, and outdoor real-world datasets, PhysGS improves accuracy of the mass estimation by up to 22.8%, reduces Shore hardness error by up to 61.2%, and lowers kinetic friction error by up to 18.1% compared to deterministic baselines. Our results demonstrate that PhysGS unifies 3D reconstruction, uncertainty modeling, and physical reasoning in a single, spatially continuous framework for dense physical property estimation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36542", "url": null, "sourceid": 45098, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38409, "uid": "36de309602983818b4c2b9b606ef4ff4", "name": "R$^2$-Seg: Training-Free OOD Medical Tumor Segmentation via Anatomical Reasoning and Statistical Rejection", "authors": [{"id": 189818, "fullname": "Shuaike Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189818?format=json", "institution": "Carnegie Mellon University"}, {"id": 189819, "fullname": "Ke Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189819?format=json", "institution": "Zhejiang University"}, {"id": 189820, "fullname": "Jiaqing Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/189820?format=json", "institution": "Fudan University"}, {"id": 189821, "fullname": "Shangde Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189821?format=json", "institution": "Zhejiang University"}, {"id": 86450, "fullname": "Chunhua Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/86450?format=json", "institution": "Zhejiang University"}, {"id": 189822, "fullname": "Ge Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189822?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 189700, "fullname": "Mireia Crispin-Ortuzar", "url": "http://cvpr.thecvf.com/api/miniconf/users/189700?format=json", "institution": "University of Cambridge"}, {"id": 189823, "fullname": "Shangqi Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189823?format=json", "institution": "University of Cambridge"}], "abstract": "Foundation models for medical image segmentation struggle under out-of-distribution (OOD) shifts, often producing fragmented false positives on OOD tumors. We introduce **R$^2$-Seg**, a **training-free** framework for robust OOD tumor segmentation that operates via a two-stage **Reason-and-Reject** process. First, the **Reason** step employs an LLM-guided anatomical reasoning planner to localize organ anchors and generate multi-scale ROIs. Second, the **Reject** step applies two-sample statistical testing to candidates generated by a frozen foundation model (BiomedParse) within these ROIs. This statistical rejection filter retains only candidates significantly different from normal tissue, effectively suppressing false positives. Our framework requires no parameter updates, making it compatible with zero-update test-time augmentation and avoiding catastrophic forgetting. On multi-center and multi-modal tumor segmentation benchmarks, **R$^2$-Seg** substantially improves Dice, specificity, and sensitivity over strong baselines and the original foundation models.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38409", "url": null, "sourceid": 40644, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40334?format=json"], "related_events_ids": [40334]}, {"id": 36796, "uid": "ea3ce3f65b0dcb09b148b728d01ee291", "name": "CAD-Refiner: A Unified Framework for CAD Generation and Iterative Editing", "authors": [{"id": 181641, "fullname": "Meng Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181641?format=json", "institution": "Jilin University"}, {"id": 185889, "fullname": "Dawei Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185889?format=json", "institution": "Jilin University"}, {"id": 133400, "fullname": "Hongxia Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/133400?format=json", "institution": "Jilin University"}, {"id": 158521, "fullname": "Tieru Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/158521?format=json", "institution": "school of AI, Jilin University"}, {"id": 126769, "fullname": "Rui Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/126769?format=json", "institution": "Jilin University"}], "abstract": "Computer-Aided Design (CAD) modeling underpins a wide range of industrial applications. During the conceptual design phase, designers often refine initial solutions iteratively to achieve desired results. A key goal of AI-assisted CAD is to support the full modeling workflow from initial generation to iterative refinement. However, most existing approaches treat generation and editing as separate tasks, hindering coherence and adaptability in real-world scenarios. To address this limitation, we propose CAD-Refiner, a unified framework that supports free-form multimodal inputs and enables iterative refinement over previously generated results. Specifically, we design an agent named CAD Insighter that interprets multimodal inputs into topological structure graphs, which explicitly represent the fundamental elements and their relationships within CAD objects. We then propose a carefully designed decoder architecture and a Sequence Injection Strategy (SIS) to enable multiple applications within a unified modeling framework. Furthermore, we propose CAD Checker, an error-aware feedback module that performs geometry-based reward shaping during optimization, enhancing modeling quality and geometric validity. Additionally, we introduce MMCAD, a multimodal extension of DeepCAD tailored for CAD generation and editing. Extensive experiments demonstrate the effectiveness of CAD-Refiner across multiple tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36796", "url": null, "sourceid": 37160, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40248, "uid": "adba8655a172abad7782f03d08c9abf3", "name": "Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression", "authors": [{"id": 181798, "fullname": "Hamidreza Dastmalchi", "url": "http://cvpr.thecvf.com/api/miniconf/users/181798?format=json", "institution": "York University"}, {"id": 193873, "fullname": "Aijun An", "url": "http://cvpr.thecvf.com/api/miniconf/users/193873?format=json", "institution": "York University"}, {"id": 193874, "fullname": "Ali Cheraghian", "url": "http://cvpr.thecvf.com/api/miniconf/users/193874?format=json", "institution": "CSIRO"}, {"id": 193875, "fullname": "Hamed Barzamini", "url": "http://cvpr.thecvf.com/api/miniconf/users/193875?format=json", "institution": "Northern Illinois University"}], "abstract": "While large vision\u2013language models (LVLMs) achieve strong performance on multimodal tasks, they frequently generate hallucinations\u2014unfaithful outputs misaligned with the visual input. To address this issue, we introduce CIPHER (Counterfactual Image Perturbations for Hallucination Extraction and Removal), a training-free method that suppresses vision-induced hallucinations via lightweight feature-level correction. Unlike prior training-free approaches that primarily focus on text-induced hallucinations, CIPHER explicitly targets hallucinations arising from the visual modality.CIPHER operates in two phases. In the offline phase, we construct OHC-25K (Object-Hallucinated Counterfactuals, 25,000 samples), a counterfactual dataset consisting of diffusion-edited images that intentionally contradict the original ground-truth captions. We pair these edited images with the unchanged ground-truth captions and process them through an LVLM to extract hallucination-related representations. Contrasting these representations with those from authentic (image, caption) pairs reveals structured, systematic shifts spanning a low-rank subspace characterizing vision-induced hallucination.In the inference phase, CIPHER suppresses hallucinations by projecting intermediate hidden states away from this subspace. Experiments across multiple benchmarks show that CIPHER significantly reduces hallucination rates while preserving task performance, demonstrating the effectiveness of counterfactual visual perturbations for improving LVLM faithfulness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40248", "url": null, "sourceid": 33728, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36986, "uid": "6d3c7d453759616fd8ae9f6c20e5d071", "name": "SEBA: Sample-Efficient Black-Box Attacks on Visual Reinforcement Learning", "authors": [{"id": 181095, "fullname": "Tairan HUANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/181095?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 186380, "fullname": "Yulin Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186380?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 186381, "fullname": "Junxu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186381?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 186382, "fullname": "Qingqing Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/186382?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 186383, "fullname": "Haibo Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186383?format=json", "institution": "Hong Kong Polytechnic University"}], "abstract": "Visual reinforcement learning has achieved remarkable progress in visual control and robotics, but its vulnerability to adversarial perturbations remains underexplored. Most existing black-box attacks focus on vector-based or discrete-action RL, and their effectiveness on image-based continuous control is limited by the large action space and excessive environment queries. We propose SEBA, a sample-efficient framework for black-box adversarial attacks on visual RL agents. SEBA integrates a shadow Q model that estimates cumulative rewards under adversarial conditions, a generative adversarial network that produces visually imperceptible perturbations, and a world model that simulates environment dynamics to reduce real-world queries. Through a two-stage iterative training procedure that alternates between learning the shadow model and refining the generator, SEBA achieves strong attack performance while maintaining efficiency. Experiments on MuJoCo and Atari benchmarks show that SEBA significantly reduces cumulative rewards, preserves visual fidelity, and greatly decreases environment interactions compared to prior black-box and white-box methods. Our code is provided in the supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36986", "url": null, "sourceid": 46042, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40334, "uid": "36de309602983818b4c2b9b606ef4ff4", "name": "R$^2$-Seg: Training-Free OOD Medical Tumor Segmentation via Anatomical Reasoning and Statistical Rejection", "authors": [{"id": 189818, "fullname": "Shuaike Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189818?format=json", "institution": "Carnegie Mellon University"}, {"id": 189819, "fullname": "Ke Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189819?format=json", "institution": "Zhejiang University"}, {"id": 189820, "fullname": "Jiaqing Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/189820?format=json", "institution": "Fudan University"}, {"id": 189821, "fullname": "Shangde Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189821?format=json", "institution": "Zhejiang University"}, {"id": 86450, "fullname": "Chunhua Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/86450?format=json", "institution": "Zhejiang University"}, {"id": 189822, "fullname": "Ge Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189822?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 189700, "fullname": "Mireia Crispin-Ortuzar", "url": "http://cvpr.thecvf.com/api/miniconf/users/189700?format=json", "institution": "University of Cambridge"}, {"id": 189823, "fullname": "Shangqi Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189823?format=json", "institution": "University of Cambridge"}], "abstract": "Foundation models for medical image segmentation struggle under out-of-distribution (OOD) shifts, often producing fragmented false positives on OOD tumors. We introduce **R$^2$-Seg**, a **training-free** framework for robust OOD tumor segmentation that operates via a two-stage **Reason-and-Reject** process. First, the **Reason** step employs an LLM-guided anatomical reasoning planner to localize organ anchors and generate multi-scale ROIs. Second, the **Reject** step applies two-sample statistical testing to candidates generated by a frozen foundation model (BiomedParse) within these ROIs. This statistical rejection filter retains only candidates significantly different from normal tissue, effectively suppressing false positives. Our framework requires no parameter updates, making it compatible with zero-update test-time augmentation and avoiding catastrophic forgetting. On multi-center and multi-modal tumor segmentation benchmarks, **R$^2$-Seg** substantially improves Dice, specificity, and sensitivity over strong baselines and the original foundation models.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40334", "url": null, "sourceid": -40644, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38409?format=json"], "related_events_ids": [38409]}, {"id": 38658, "uid": "a1950012395742fd71f4fdd4cba9414e", "name": "CROWn: A Unified Framework for Anti\u2011Aliased Downsampling and Phase\u2011Calibrated Fusion in 3D Medical Segmentation", "authors": [{"id": 181645, "fullname": "Xingru Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181645?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 190401, "fullname": "Shuanghua Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/190401?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 184718, "fullname": "Zhao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184718?format=json", "institution": "Northumbria University"}, {"id": 176647, "fullname": "Wenwen Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176647?format=json", "institution": "Johns Hopkins University"}, {"id": 190402, "fullname": "Huiyu Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190402?format=json", "institution": "University of Leicester"}, {"id": 184715, "fullname": "Zhiwen Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184715?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 148526, "fullname": "Jin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/148526?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 184722, "fullname": "Xiaoshuai Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184722?format=json", "institution": "Ocean University of China"}], "abstract": "Precise 3D medical image segmentation is a clinical cornerstone for diagnosis, therapy planning, and longitudinal monitoring. However, routine acquisition with anisotropic voxel spacing and heterogeneous reconstruction induces downsampling aliasing and cross-scale misalignment that blur boundaries, fragment topology, and undermine reliability. Existing U-shaped CNN or Transformer designs neither control alias injection at decimation nor explicitly align high-resolution evidence before decoder fusion, leading to unstable interfaces under device and protocol variability. We introduce the Coset-fibRated micrO-local co-attention Network (CROWn), a general segmentation framework that couples sampling theory with representation learning to jointly suppress aliasing and calibrate cross-scale fusion. CROWn comprises two complementary components. The Microlocal Polyphase Co-Attentive Decimator ($\\mu$PCAD) performs axis-aware polyphase analysis with pooled\u2013subband co-attention and explicit anti-alias low-pass, routing boundary-relevant high-frequency evidence while attenuating spurious phase components during downsampling. The Octaphase Coset Fibration (OCF) anti-aliases high-resolution skips, restructures them via 3D space-to-depth into cosets, and applies phase attention with edge-gated modulation to deliver compact, phase-aligned, boundary-aware features to the decoder. Extensive evaluations across 15 publicly available datasets spanning CT, MRI, and OCT demonstrate CROWn's state-of-the-art performance against 17 recent leading methods, improves overlap and topological consistency, consistently reduces boundary errors, while maintaining controlled training and inference cost. The code is publicly available.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38658", "url": null, "sourceid": 30999, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38002, "uid": "fd642be2b73b62dd7d2f2935735f67c8", "name": "ResDiT: Evoking the Intrinsic Resolution Scalability in Diffusion Transformers", "authors": [{"id": 188800, "fullname": "Yiyang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/188800?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 158216, "fullname": "Feng Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/158216?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 188801, "fullname": "Xuedan Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188801?format=json", "institution": "Tsinghua University"}, {"id": 151493, "fullname": "Pu Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/151493?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 188802, "fullname": "Yonghao Dang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188802?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 188803, "fullname": "Jianqin Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188803?format=json", "institution": "Beijing University of Posts and Telecommunications"}], "abstract": "Leveraging pre-trained Diffusion Transformers (DiTs) for high-resolution (HR) image synthesis often leads to spatial layout collapse and degraded texture fidelity. Prior work mitigates these issues with complex pipelines that first perform a base-resolution (i.e., training-resolution) denoising process to guide HR generation. We instead explore the intrinsic generative mechanisms of DiTs and propose ResDiT, a training-free method that scales resolution efficiently. We identify the core factor governing spatial layout, position embeddings (PEs), and show that the original PEs encode incorrect positional information when extrapolated to HR, which triggers layout collapse. To address this, we introduce a PE scaling technique that rectifies positional encoding under resolution changes. To further remedy low-fidelity details, we develop a local-enhancement mechanism grounded in base-resolution local attention. We design a patch-level fusion module that aggregates global and local cues, together with a Gaussian-weighted splicing strategy that eliminates grid artifacts. Comprehensive evaluations demonstrate that ResDiT consistently delivers high-fidelity, high-resolution image synthesis and integrates seamlessly with downstream tasks, including spatially controlled generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38002", "url": null, "sourceid": 43388, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39193, "uid": "de1dbe3b8d524c96d50c30f2467c4bc6", "name": "TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection", "authors": [{"id": 182698, "fullname": "Jian-Yu Jiang-Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/182698?format=json", "institution": "National Taiwan University"}, {"id": 191546, "fullname": "Kang-Yang Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191546?format=json", "institution": "National Taiwan University"}, {"id": 180881, "fullname": "LING ZOU", "url": "http://cvpr.thecvf.com/api/miniconf/users/180881?format=json", "institution": "National Taiwan University"}, {"id": 103917, "fullname": "Ling Lo", "url": "http://cvpr.thecvf.com/api/miniconf/users/103917?format=json", "institution": "National Yang Ming Chiao Tung University"}, {"id": 191547, "fullname": "Sheng-Ping Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191547?format=json", "institution": null}, {"id": 191548, "fullname": "Yu-Wen Tseng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191548?format=json", "institution": "National Taiwan University"}, {"id": 145431, "fullname": "Kun-Hsiang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/145431?format=json", "institution": "National Taiwan University"}, {"id": 191549, "fullname": "Chia-Ling Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191549?format=json", "institution": "Department of computer science and informational engineering, National Taiwan University"}, {"id": 191550, "fullname": "Yu-Ting Ta", "url": "http://cvpr.thecvf.com/api/miniconf/users/191550?format=json", "institution": "Department of computer science and informational engineering, National Taiwan University"}, {"id": 191551, "fullname": "Yan-Tsung Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191551?format=json", "institution": "Department of computer science and informational engineering, National Taiwan University"}, {"id": 191552, "fullname": "Po-Ching Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191552?format=json", "institution": "Department of computer science and informational engineering, National Taiwan University"}, {"id": 133400, "fullname": "Hongxia Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/133400?format=json", "institution": "Jilin University"}, {"id": 132968, "fullname": "Hong-Han Shuai", "url": "http://cvpr.thecvf.com/api/miniconf/users/132968?format=json", "institution": "National Yang Ming Chiao Tung University"}, {"id": 127961, "fullname": "Wen-Huang Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/127961?format=json", "institution": "National Taiwan University"}], "abstract": "Advances in generative modeling have made it increasingly easy to fabricate realistic portrayals of individuals, creating serious risks for security, communication, and public trust. Detecting such person-driven manipulations requires systems that not only distinguish altered content from authentic media but also provide clear and reliable reasoning. In this paper, we introduce \\benchname, a comprehensive benchmark for interpretable DeepFake detection. \\benchname\\ contains high-quality forgeries from advanced synthesis models, covering 16 DeepFake types across image, video, and audio modalities. The benchmark evaluates three key aspects: Perception, which measures the ability of a model to identify fine-grained manipulation artifacts using human-annotated evidence; Detection, which assesses classification performance across diverse forgery families and generators; and Hallucination, which quantifies the reliability of model-generated explanations. Experiments on state-of-the-art multimodal large language models show that accurate perception is essential for reliable detection, but hallucination can severely disrupt decision-making, revealing the interdependence of these three aspects. \\benchname\\ provides a unified framework for understanding the interaction between detection accuracy, evidence identification, and explanation reliability, offering a foundation for building trustworthy systems that address real-world synthetic media threats.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39193", "url": "https://j1anglin.github.io/TriDF/", "sourceid": 44971, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38822, "uid": "bd60b7054e3090a6a7b3fc94f4d9d17f", "name": "OVOD-Agent: A Markov\u2013Bandit Framework for Proactive Visual Reasoning and Self-Evolving Detection", "authors": [{"id": 182049, "fullname": "Chujie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182049?format=json", "institution": "Wuhan University"}, {"id": 190762, "fullname": "Jianyu Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190762?format=json", "institution": ", Wuhan University"}, {"id": 190763, "fullname": "Zhiyuan Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/190763?format=json", "institution": "Wuhan University"}, {"id": 190764, "fullname": "Xi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190764?format=json", "institution": "Wuhan University"}, {"id": 190765, "fullname": "Chu He", "url": "http://cvpr.thecvf.com/api/miniconf/users/190765?format=json", "institution": "Wuhan University"}], "abstract": "Open-Vocabulary Object Detection (OVOD) aims to enable detectors to generalize across categories by leveraging semantic information. Although existing methods are pretrained on large vision\u2013language datasets, their inference is still limited to fixed category names, creating a gap between multimodal training and unimodal inference. Previous work has shown that improving textual representation can significantly enhance OVOD performance, indicating that the textual space is still underexplored. To this end, we propose OVOD-Agent, which transforms passive category matching into proactive visual reasoning and self-evolving detection. Inspired by the Chain-of-Thought (CoT) paradigm, OVOD-Agent extends the textual optimization process into an interpretable Visual Chain of Thought with explicit actions. OVOD\u2019s lightweight nature makes LLM-based management unsuitable; instead, we model visual context transitions as a Weakly Markovian Decision Process (w-MDP) over eight state spaces, which naturally represents the agent\u2019s state, memory, and interaction dynamics. A Bandit module generates exploration signals under limited supervision, helping the agent focus on uncertain regions and adapt its detection policy. We further integrate Markov transition matrices with Bandit trajectories for self-supervised Reward Model (RM) optimization, forming a closed loop from Bandit exploration to RM learning. Experiments on COCO and LVIS show that OVOD-Agent improves existing OVOD baselines and outperforms prior methods on novel categories, demonstrating strong generalization and scalability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38822", "url": null, "sourceid": 46334, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38081, "uid": "523b96f500fcb4459aa8718e387c9b23", "name": "UniSpector: Towards Universal Open-set Defect Recognition via Spectral-Contrastive Visual Prompting", "authors": [{"id": 106547, "fullname": "Geonuk Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/106547?format=json", "institution": "LG Energy Solution"}, {"id": 184060, "fullname": "Minhoi Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/184060?format=json", "institution": "LG Energy Solution"}, {"id": 189019, "fullname": "Kangil Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/189019?format=json", "institution": null}, {"id": 189020, "fullname": "Minsu Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/189020?format=json", "institution": null}, {"id": 189021, "fullname": "Hyeonseong Jeon", "url": "http://cvpr.thecvf.com/api/miniconf/users/189021?format=json", "institution": null}, {"id": 189022, "fullname": "JEONGHOON HAN", "url": "http://cvpr.thecvf.com/api/miniconf/users/189022?format=json", "institution": "LG Energy Solution"}, {"id": 189023, "fullname": "Hyoungjoon Lim", "url": "http://cvpr.thecvf.com/api/miniconf/users/189023?format=json", "institution": "LG Energy Solution"}, {"id": 189024, "fullname": "Junho Yim", "url": "http://cvpr.thecvf.com/api/miniconf/users/189024?format=json", "institution": "LG Energy Solution"}], "abstract": "Even though industrial inspection systems should be capable of recognizing unprecedented defects, most existing approaches operate under a closed-set assumption, which prevents them from detecting novel anomalies. While the visual prompting approach provides a scalable alternative, it struggles in industrial settings where subtle inter-class differences and high intra-class variance make prompt-to-region matching ambiguous and cause prompt embeddings to collapse, limiting the effectiveness of existing methods. To address these challenges, we introduce UniSpector\u2014 a Universal Inspector for open-set defect detection and segmentation. To empower defect prompt embeddings for robust recognition of novel defects, it comprises two key components: the Spatial\u2013Spectral Prompt Encoder (SSPE) and the Contrastive Prompt Encoder (CPE). SSPE extracts orientation-invariant frequency cues and fuses them with spatial features to distinguish subtle defects. CPE encodes the prompt into an angular space to facilitate semantically meaningful embedding of unseen defect prompts. In addition, to improve adaptability to novel defect types, we introduce Prompt-guided Query Selection (PQS) to generate adaptive object queries aligned with the prompt.To standardize evaluation, we introduce Inspect Anything (InsA), the first benchmark for visual-prompt-based open-set defect localization.Experiments demonstrate that UniSpector significantly surpasses prior baselines by at least 19.7\\% and 15.8\\% in AP50b and AP50m, respectively. These results show that our method enables a scalable, retraining-free inspection paradigm for continuously evolving industrial environments.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38081", "url": null, "sourceid": 31977, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39672, "uid": "f1cc3b90f5460d50d5200128a455979d", "name": "MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts", "authors": [{"id": 155635, "fullname": "Zilong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155635?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 155636, "fullname": "Jun He", "url": "http://cvpr.thecvf.com/api/miniconf/users/155636?format=json", "institution": "Sun Yat-sen University"}, {"id": 192610, "fullname": "Xiaobin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192610?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 192611, "fullname": "Ziyi Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/192611?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 172363, "fullname": "Yang Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/172363?format=json", "institution": "Sun Yat-sen University"}, {"id": 129844, "fullname": "Junyan Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/129844?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 107541, "fullname": "Weijia Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/107541?format=json", "institution": "Sun Yat-sen University"}, {"id": 155637, "fullname": "Yiping Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/155637?format=json", "institution": "Sun Yat-Sen University"}, {"id": 140977, "fullname": "Ting Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/140977?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Generating realistic 3D cities is fundamental to world models, virtual reality, and game development, where an ideal urban scene must satisfy both stylistic diversity, fine-grained, and controllability. However, existing methods struggle to balance the creative flexibility offered by text-based generation with the object-level editability enabled by explicit structural representations.  We introduce MajutsuCity, a natural language\u2013driven and aesthetically adaptive framework for synthesizing structurally consistent and stylistically diverse 3D urban scenes. MajutsuCity represents a city as a composition of controllable layouts, assets, and materials, and operates through a four-stage pipeline. To extend controllability beyond initial generation, we further integrate MajutsuAgent, an interactive language-grounded editing agent that supports five object-level operations. To support photorealistic and customizable scene synthesis, we also construct MajutsuDataset, a high-quality multimodal dataset containing 2D semantic layouts and height maps, diverse 3D building assets, and curated PBR materials and skyboxes, each accompanied by detailed annotations. Meanwhile, we develop a practical set of evaluation metrics, covering key dimensions such as structural consistency, scene complexity, material fidelity, and lighting atmosphere. Extensive experiments demonstrate MajutsuCity reduces layout FID by 83.7% compared with CityDreamer and by 20.1% over CityCraft. Our method ranks first across all AQS and RDR scores, outperforming existing methods by a clear margin. These results confirm MajutsuCity as a new state-of-the-art in geometric fidelity, stylistic adaptability, and semantic controllability for 3D city generation. We expect our framework can inspire new avenues of research in 3D city generation. Our dataset and code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39672", "url": null, "sourceid": 32460, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39062, "uid": "c40df15b5da1af4f7e5e658b00d4c627", "name": "Rethinking Occlusion Modeling for UAV Tracking", "authors": [{"id": 191278, "fullname": "Jian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191278?format=json", "institution": "Sichuan University"}, {"id": 191279, "fullname": "Xincheng Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191279?format=json", "institution": "Sichuan University"}, {"id": 191280, "fullname": "Yi Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/191280?format=json", "institution": "Sichuan University"}], "abstract": "Occlusion remains one of the major challenges in UAV tracking, where dynamic viewpoints and complex environments often cause partial or complete visibility loss.Existing transformer-based trackers typically regard occlusion as random information dropout, overlooking its structured and spatially correlated nature in real-world scenes.We rethink occlusion modeling in UAV tracking as a structured process governed by spatial dependencies.Based on this insight, we introduce Clustered Occlusion Modeling (COM) to generate realistic, density-adaptive occlusion patterns that enhance feature robustness under partial visibility.Furthermore, we design Cost-Aware Depth Bias (CADB), which employs a depth-dependent prior to adjust inference depth, yielding better efficiency while maintaining competitive accuracy.Integrating COM and CADB into a unified single-stream transformer framework, termed OCTrack, our tracker achieves robust and efficient UAV tracking in occlusion-prone environments.Extensive experiments on multiple UAV benchmarks validate its effectiveness and demonstrate state-of-the-art performance. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39062", "url": null, "sourceid": 37169, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38159, "uid": "035f5a5a474f2131ccbfe12d0a0fa0af", "name": "Multi-Modal Image Fusion via Intervention-Stable Feature Learning", "authors": [{"id": 189178, "fullname": "Xue Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189178?format=json", "institution": "Nanyang Normal University"}, {"id": 189179, "fullname": "Zheng Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189179?format=json", "institution": "Yunnan University"}, {"id": 189180, "fullname": "Wenhua Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/189180?format=json", "institution": null}, {"id": 189181, "fullname": "Chengchao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189181?format=json", "institution": "Yunnan University"}, {"id": 189182, "fullname": "Runzhuo MA", "url": "http://cvpr.thecvf.com/api/miniconf/users/189182?format=json", "institution": "Hong Kong Polytechnic University"}], "abstract": "Multi-modal image fusion integrates complementary information from different modalities into a unified representation. Current methods predominantly optimize statistical correlations between modalities, often capturing dataset-induced spurious associations that degrade under distribution shifts. In this paper, we propose an intervention-based framework inspired by causal principles to identify robust cross-modal dependencies. Drawing insights from Pearl's causal hierarchy, we design three principled intervention strategies to probe different aspects of modal relationships: i) complementary masking with spatially disjoint perturbations tests whether modalities can genuinely compensate for each other's missing information, ii) random masking of identical regions identifies feature subsets that remain informative under partial observability, and iii) modality dropout evaluates the irreplaceable contribution of each modality. Based on these interventions, we introduce a Causal Feature Integrator (CFI) that learns to identify and prioritize intervention-stable features maintaining importance across different perturbation patterns through adaptive invariance gating, thereby capturing robust modal dependencies rather than spurious correlations. Extensive experiments demonstrate that our method achieves SOTA performance on both public benchmarks and downstream high-level vision tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38159", "url": null, "sourceid": 32375, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39913, "uid": "e69aa3389bba902b227d5de50c815dd1", "name": "Style-GRPO: Semantic-Aware Preference Optimization for Image Style Transfer Guided by Reward Modeling", "authors": [{"id": 193102, "fullname": "Jianbin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193102?format=json", "institution": "Dalian University of Technology"}, {"id": 180418, "fullname": "Chaoran Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180418?format=json", "institution": "Peking University"}, {"id": 193103, "fullname": "Miao Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193103?format=json", "institution": "Communication University of China"}, {"id": 193104, "fullname": "Yingtao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193104?format=json", "institution": "Dalian University of Technology"}, {"id": 182317, "fullname": "Zhenyu Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182317?format=json", "institution": "Peking University"}, {"id": 155747, "fullname": "Wangbo Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155747?format=json", "institution": "Peking University"}, {"id": 89465, "fullname": "Yian Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/89465?format=json", "institution": "Peking University"}, {"id": 151535, "fullname": "Xiaomin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/151535?format=json", "institution": "Dalian University of Technology"}, {"id": 186105, "fullname": "Li Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186105?format=json", "institution": "Peking University"}, {"id": 86326, "fullname": "Yonghong Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/86326?format=json", "institution": "Peking University"}], "abstract": "Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also imposed higher demands on prompt complexity\u2014particularly in encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts.To address this challenge, we introduce a novel method that strengthens the spatial understanding of current image generation models. We first construct the **SpatialReward-Dataset** with over 80k preference pairs. Building on this dataset, we build **SpatialScore**, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation, achieving performance that even surpasses leading proprietary models on spatial evaluation.% Through rigorous data curation and filtering, **SpatialScore** surpasses several leading vision-language models (VLMs) and existing reward models on spatial evaluation. We further demonstrate that this reward model effectively enables online reinforcement learning for the complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in spatial understanding for image generation. All models and datasets will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39913", "url": null, "sourceid": 36872, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39832, "uid": "cbcb32994827a5c4ef1b8a630a4fa66e", "name": "Advancing Cancer Prognosis with Hierarchical Fusion of Genomic, Proteomic and Pathology Imaging Data from a Systems Biology Perspective", "authors": [{"id": 128986, "fullname": "Junjie Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/128986?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 192941, "fullname": "Bao Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/192941?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 191730, "fullname": "Meiling Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191730?format=json", "institution": "Nanjing University of Posts and Telecommunications"}, {"id": 128998, "fullname": "WEI SHAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/128998?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 87100, "fullname": "Daoqiang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87100?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}], "abstract": "To enhance the precision of cancer prognosis, recent research has increasingly focused on multimodal survival methods by integrating genomic data and histology images. However, current approaches overlook the fact that the proteome serves as an intermediate layer bridging genomic alterations and histopathological features while providing complementary biological information essential for survival prediction. This biological reality exposes another architectural limitation: existing integrative analysis studies fuse these heterogeneous data sources in a flat manner that fails to capture their inherent biological hierarchy. To address these limitations, we propose HFGPI, a hierarchical fusion framework that models the biological progression from genes to proteins to histology images from a systems biology perspective. Specifically, we introduce Molecular Tokenizer, a molecular encoding strategy that integrates identity embeddings with expression profiles to construct biologically informed representations for genes and proteins. We then develop Gene-Regulated Protein Fusion (GRPF), which employs graph-aware cross-attention with structure-preserving alignment to explicitly model gene-protein regulatory relationships and generate gene-regulated protein representations. Additionally, we propose Protein-Guided Hypergraph Learning (PGHL), which establishes associations between proteins and image patches, leveraging hypergraph convolution to capture higher-order protein-morphology relationships. The final features are progressively fused across hierarchical layers to achieve precise survival outcome prediction. Extensive experiments on five benchmark datasets demonstrate the superiority of HFGPI over state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39832", "url": null, "sourceid": 42562, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38877, "uid": "7c2410c8be77b896e8a5b26d1a994a23", "name": "ManifoldNeuS: Manifold-aware View Optimizability for Pose-Free Neural Surface Reconstruction", "authors": [{"id": 190904, "fullname": "Xinxin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190904?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 190905, "fullname": "Xue Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190905?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 190906, "fullname": "Guoqing Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190906?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 76385, "fullname": "Qing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76385?format=json", "institution": "Northwestern Polytechnical University"}], "abstract": "Jointly optimizing camera poses and object geometry from unposed images is a challenging task in neural surface reconstruction. Existing methods often suffer from pose drift and geometric distortion, stemming from the easy-view bias --- uniform view optimization favors easy-to-optimize views with abundant texture and good overlap that dominate gradient updates, while hard-to-optimize counterparts with weak texture or limited overlap yet critical for geometric completeness are progressively marginalized. To address this, we propose ManifoldNeuS, a novel framework that explicitly models and leverages per-view optimizability to guide pose-free neural surface reconstruction. Specifically, we introduce the manifold-aware view optimizability score (MaVOS), which jointly assesses immediate fitness (the ease of optimizing each view) and long-term coverage gain (the value of optimizing each view) over the view-coherent manifold. Building on the MaVOS, we further devise a reconstruction pipeline that incorporates the per-view optimizability as a state control signal to guide the joint optimization process through three key components: dynamic view scheduling, gated positional encoding, and anti-score loss weighting. Experimental results on the benchmark dataset demonstrate that ManifoldNeuS outperforms existing methods in terms of accurate pose estimation and high-quality reconstruction, achieving robust joint optimization without known camera poses.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38877", "url": null, "sourceid": 35345, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39258, "uid": "ede4663c5f5906bda914f26ec185b666", "name": "IEBGL:An Interpretability-Enhanced Brain Graph Learning Framework with LLM-Instructed Topology and Literature-Augmented Semantics", "authors": [{"id": 181155, "fullname": "Yihang Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181155?format=json", "institution": "Nanjing Forestry University"}, {"id": 191728, "fullname": "Shuo Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191728?format=json", "institution": "Nanjing Forestry University"}, {"id": 191729, "fullname": "Lizhang Lizhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191729?format=json", "institution": "Nanjing Forestry University"}, {"id": 191730, "fullname": "Meiling Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191730?format=json", "institution": "Nanjing University of Posts and Telecommunications"}, {"id": 191731, "fullname": "Li Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191731?format=json", "institution": "Nanjing Forestry University"}], "abstract": "Resting-state functional MRI (rs-fMRI) provides rich information for modeling brain connectivity in disease diagnosis. However, most existing brain graph learning methods rely solely on imaging data, leading to limited biological interpretability and poor integration of external medical knowledge. To address these challenges, we propose an \\textbf{I}nterpretability-\\textbf{E}nhanced \\textbf{B}rain \\textbf{G}raph \\textbf{L}earning (IEBGL) framework that anchors brain network modeling in large-scale medical knowledge. Our framework introduces two complementary modules: LLM-Instructed Topological Reconstruction (LITR) and Literature-Augmented Semantic Aggregation (LASA). LITR employs large language model (LLM) reasoning to refine brain connectivity and construct topological structure. LASA augments node representations by aggregating  semantic information from biomedical literature, ensuring the model\u2019s interpretability and relevance to clinical disease knowledge. Finally, the framework is trained with the Graph Bi-directional Mamba Network (GBMN) for disease diagnosis. Extensive experiments on the REST-meta-MDD and ABIDE datasets, together with 35,133 depression-related and 32,617 autism-related publications, demonstrate that IEBGL outperforms state-of-the-art methods in classification performance. Further analyses show that the LITR module reveals biologically meaningful alterations in brain connectivity, while the LASA module establishes interpretable associations between these regions and disease-related biomedical literature. Together, these mechanisms help IEBGL explain abnormal brain connections and their links to disease-related knowledge.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39258", "url": null, "sourceid": 31990, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39007, "uid": "116918676ff49aebce8075a874be7e6d", "name": "GSNR: Graph Smooth Null-Space Representation for Inverse Problems", "authors": [{"id": 191167, "fullname": "Romario Gualdr\u00f3n-Hurtado", "url": "http://cvpr.thecvf.com/api/miniconf/users/191167?format=json", "institution": "Universidad Industrial de Santander"}, {"id": 92436, "fullname": "Roman Jacome", "url": "http://cvpr.thecvf.com/api/miniconf/users/92436?format=json", "institution": "Univesridad Industrial de Santander"}, {"id": 191168, "fullname": "Rafael S. Su\u00e1rez", "url": "http://cvpr.thecvf.com/api/miniconf/users/191168?format=json", "institution": "Universidad Industrial de Santander"}, {"id": 95406, "fullname": "Henry Arguello", "url": "http://cvpr.thecvf.com/api/miniconf/users/95406?format=json", "institution": "Universidad Industrial de Santander"}], "abstract": "Inverse problems in imaging are undetermined, leading to infinitely many solutions consistent with the measurements due to the non-trivial null-space of the sensing matrix. Common image priors promote solutions on the general image manifold, such as sparsity, smoothness, or score function. However, as these priors do not constrain the null\u2011space component, they can bias the reconstruction. Thus, we aim to incorporate meaningful null-space information into the reconstruction framework. Inspired by smooth image representation on graphs, we propose Graph-Smooth Null-Space Representation (GSNR), a mechanism that imposes structure only into the invisible component. Particularly, given a graph Laplacian, we construct a null-restricted Laplacian that encodes similarity between neighboring pixels in the null-space signal, and we design a low-dimensional projection matrix from the $p$-smoothest spectral graph modes (lowest graph frequencies). This approach has strong theoretical and practical implications: i) improved convergence via a null-only graph regularizer, ii) better coverage, how much null\u2011space variance is captured by $p$ modes, and iii) high predictability, how well these modes can be inferred from the measurements. GSNR is incorporated into well-known inverse problem solvers, e.g., PnP, DIP, and diffusion solvers, in four scenarios: image deblurring, compressed sensing, demosaicing, and image super-resolution, providing consistent improvement of up to 4.3 dB over baseline formulations and up to 1 dB compared with end-to-end learned models in terms of PSNR.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39007", "url": null, "sourceid": 35076, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38584, "uid": "35e1d9679148d95fdb4f567db207ae06", "name": "Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition", "authors": [{"id": 180195, "fullname": "Qianrui Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/180195?format=json", "institution": "Tsinghua University"}, {"id": 190208, "fullname": "Hua Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190208?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 190209, "fullname": "Yunjin Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190209?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 190210, "fullname": "Yifan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190210?format=json", "institution": "Hebei University of Science and Technology"}, {"id": 190211, "fullname": "Songze Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190211?format=json", "institution": "Hebei University of Science and Technology"}, {"id": 190212, "fullname": "Hanlei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190212?format=json", "institution": null}], "abstract": "Multimodal intent recognition aims to infer human intents by jointly modeling various modalities, playing a pivotal role in real-world dialogue systems. However, current methods struggle to model hierarchical semantics underlying complex intents and lack the capacity for self-evolving reasoning over multimodal representations. To address these issues, we propose HIER, a novel method that integrates HIerarchical semantic representation with Evolutionary Reasoning based on Multimodal Large Language Model (MLLM). Inspired by human cognition, HIER introduces a structured reasoning paradigm that organizes multimodal semantics into three progressively abstracted levels. It starts with modality-specific tokens extracted by encoders, capturing localized semantic cues. These tokens are then abstracted into semantic concepts via a label-guided clustering strategy, yielding mid-level intent-aware patterns. To capture higher-order structure, inter-concept relations are selected through a JS-divergence-based mechanism to highlight salient dependencies across concepts. These hierarchical representations are then injected into MLLM via CoT-driven prompting, enabling step-wise reasoning. Besides, HIER utilizes a self-evolution mechanism that refines semantic representations through MLLM feedback, allowing dynamic adaptation during inference. Experiments on three challenging benchmarks show that HIER not only consistently outperforms state-of-the-art methods and MLLMs with 1\u20133% gains across all metrics, but also exhibits strong generalization across diverse backbones.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38584", "url": null, "sourceid": 43932, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39118, "uid": "bedfbb70a637d0bfc3fa0a39eefc036f", "name": "Diversity over Uniformity: Rethinking Representation in Generated Image Detection", "authors": [{"id": 191390, "fullname": "Qinghui He", "url": "http://cvpr.thecvf.com/api/miniconf/users/191390?format=json", "institution": "Chongqing University of Post and Telecommunications"}, {"id": 191391, "fullname": "Haifeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191391?format=json", "institution": "Chongqing University of Post and Telecommunications"}, {"id": 191392, "fullname": "Qiao Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/191392?format=json", "institution": "Chongqing University of Post and Telecommunications"}, {"id": 90522, "fullname": "Bo Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90522?format=json", "institution": "Chongqing University of Posts and Telecommunications"}, {"id": 77315, "fullname": "Xiuli Bi", "url": "http://cvpr.thecvf.com/api/miniconf/users/77315?format=json", "institution": "Chongqing University of Posts and Telecommunications"}, {"id": 86069, "fullname": "Bin Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86069?format=json", "institution": "Chongqing University of Posts and Tel."}], "abstract": "With the rapid advancement of generative models, generated image detection has become an important task in visual forensics. Although existing methods have achieved remarkable progress, they often rely, after training, on only a small subset of highly salient forgery cues, which limits their ability to generalize to unseen generative mechanisms. We argue that reliably generated image detection should not depend on a single decision path but should preserve multiple judgment perspectives, enabling the model to understand the differences between real and generated images from diverse viewpoints. Based on this idea, we propose an anti-feature-collapse learning framework that filters task-irrelevant components and suppresses excessive overlap among different forgery cues in the representation space, preventing discriminative information from collapsing into a few dominant feature directions. This design maintains diverse and complementary evidence within the model, reduces reliance on a small set of salient cues, and enhances robustness under unseen generative settings. Extensive experiments on multiple public benchmarks demonstrate that the proposed method significantly outperforms the state-of-the-art approaches in cross-model scenarios, achieving an accuracy improvement of 5.02\\% and exhibiting superior generalization and detection reliability. The source code will be publicly available upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39118", "url": null, "sourceid": 45661, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37227, "uid": "d67f1ab80fc9118ef90cc00cae40529f", "name": "ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding", "authors": [{"id": 179999, "fullname": "Byeongjun Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/179999?format=json", "institution": "EverEx"}, {"id": 186955, "fullname": "Byung-Hoon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/186955?format=json", "institution": "Yonsei University; EverEx"}, {"id": 152100, "fullname": "Hyungjin Chung", "url": "http://cvpr.thecvf.com/api/miniconf/users/152100?format=json", "institution": "EverEx"}, {"id": 85079, "fullname": "Jong Chul Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/85079?format=json", "institution": "Korea Advanced Institute of Science and Technology"}], "abstract": "We present ReDirector, a novel camera-controlled video retake generation method for dynamically captured variable-length videos. In particular, we rectify a common misuse of RoPE in previous works by aligning the spatiotemporal positions of the input video and the target retake. Moreover, we introduce Rotary Camera Encoding (RoCE), a camera-conditioned RoPE phase shift that captures and integrates multi-view relationships within and across the input and target videos. By integrating camera conditions into RoPE, our method generalizes to out-of-distribution camera trajectories and video lengths, yielding improved dynamic object localization and static background preservation. Extensive experiments further demonstrate significant improvements in camera controllability, geometric consistency, and video quality across various trajectories and lengths.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37227", "url": null, "sourceid": 36571, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36417, "uid": "5f937e78a9f11802066ba28a4f8d959f", "name": "Basis-Oriented Low-rank Transfer for Few-Shot and Test-Time Adaptation", "authors": [{"id": 182206, "fullname": "Junghwan Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/182206?format=json", "institution": "TelePIX"}, {"id": 184988, "fullname": "Woojin Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/184988?format=json", "institution": "Telepix"}, {"id": 184989, "fullname": "Junhyuk Heo", "url": "http://cvpr.thecvf.com/api/miniconf/users/184989?format=json", "institution": "Telepix"}, {"id": 135948, "fullname": "Darongsae Kwon", "url": "http://cvpr.thecvf.com/api/miniconf/users/135948?format=json", "institution": "TelePIX Co., Ltd."}, {"id": 184990, "fullname": "Kookjin Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/184990?format=json", "institution": "Arizona State University"}], "abstract": "Adapting large pre-trained models to unseen tasks under tight data and compute budgets remains challenging. Meta-learning approaches explicitly learn good initializations, but they require an additional meta-training phase over many tasks, incur high training cost, and can be unstable. At the same time, the number of task-specific pre-trained models continues to grow, yet the question of how to transfer them to new tasks with minimal additional training remains relatively underexplored. We propose BOLT (Basis-Oriented Low-rank Transfer), a framework that reuses existing fine-tuned models not by merging weights, but instead by extracting an orthogonal, task-informed spectral basis and adapting within that subspace.In the offline phase, BOLT collects dominant singular directions from multiple task vectors and orthogonalizes them per layer to form reusable bases. In the online phase, we freeze these bases and train only a small set of diagonal coefficients per layer for the new task, yielding a rank-controlled update with very few trainable parameters. This design provides (i) a strong, training-free initialization for unseen tasks, obtained by pooling source-task coefficients\u2014along with a lightweight rescaling step\u2014while leveraging the shared orthogonal bases, and (ii) a parameter-efficient fine-tuning (PEFT) path that, in our experiments, achieves robust performance compared to common PEFT baselines as well as a representative meta-learned initialization. Our results show that constraining adaptation to a task-informed orthogonal subspace provides an effective alternative for unseen-task transfer.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36417", "url": null, "sourceid": 32561, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40082, "uid": "e58e29db292ff8dfbbe41afb846e469e", "name": "CanonCGT: Reference-Based Color Grading via Canonical Pivot Representation", "authors": [{"id": 105442, "fullname": "JINWON KO", "url": "http://cvpr.thecvf.com/api/miniconf/users/105442?format=json", "institution": "Korea University, Seoul"}, {"id": 191734, "fullname": "Keunsoo Ko", "url": "http://cvpr.thecvf.com/api/miniconf/users/191734?format=json", "institution": "The Catholic University of Korea"}, {"id": 87670, "fullname": "Chang-Su Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/87670?format=json", "institution": "Korea University"}], "abstract": "Reference-based color grading aims to reproduce the tonal mood and color harmony of a reference while preserving scene structure. Existing photorealistic and filter-based methods often produce unstable tone mappings --- over-shifting or inconsistently retaining colors --- leading to unnatural results. We propose CanonCGT, a two-stage framework built on a canonical pivot --- a style-neutral intermediate representation for stable color mapping. The first stage canonicalizes the input by removing intrinsic tonal bias, and the second color-grades it to match the reference style. A dual-phase training scheme, DP-CGT, combines supervised preset learning with self-supervised refinement on unpaired photographs. CanonCGT delivers photorealistic and tonally consistent results across diverse datasets, surpassing state-of-the-art methods in stability and visual fidelity.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40082", "url": null, "sourceid": 31756, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37143, "uid": "a106cea5c02ed8a70b421831cc4e7192", "name": "InTrain: Intrinsic Trainability for Zero-Cost Neural Architecture Search", "authors": [{"id": 183720, "fullname": "Qinqin Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/183720?format=json", "institution": "Fuzhou University"}, {"id": 186762, "fullname": "Fuhai Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186762?format=json", "institution": "Fuzhou University"}, {"id": 186763, "fullname": "Jipeng Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186763?format=json", "institution": "Minjiang University"}, {"id": 186764, "fullname": "Zhiwei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186764?format=json", "institution": "Nanchang University; Jiangxi Normal University"}, {"id": 186765, "fullname": "Zhikai Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186765?format=json", "institution": "Hong Kong Baptist University"}, {"id": 186766, "fullname": "Weiwei Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/186766?format=json", "institution": null}], "abstract": "Training-free neural architecture search promises efficient discovery of high-performance networks without costly training. However, existing zero-cost proxies rely on fragmented heuristics that fail to capture the fundamental question: what makes an architecture trainable? This paper introduces Intrinsic Trainability (InTrain), a unified theoretical proxy that formalizes trainability as an architectural invariant emerging from two synergistic components: geometric capacity and optimization resilience. We operationalize intrinsic trainability through analysis of neural information processing. Geometric capacity is quantified via the participation ratio of activation covariance eigenspectrum, capturing the effective dimensionality of representation manifolds. Optimization resilience is measured through cumulative gradient health, assessing the robustness of backpropagation across network depth. InTrain synthesizes these dimensions through a scale-invariant multiplicative coupling, which we validate is essential for capturing their synergistic, non-additive relationship. Extensive experiments on standard NAS benchmarks and search spaces demonstrate that InTrain achieves ranking correlations on par with state-of-the-art ensemble-based proxies and outperforms other single-metric methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37143", "url": null, "sourceid": 41610, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40156, "uid": "f84caa4b0ddc10bfd833073a466bf638", "name": "TAPE: Task-Adaptive Prototype Evolution in Audio-Language Models for Fully Few-shot Class-incremental Audio Classification", "authors": [{"id": 180662, "fullname": "Yunlong Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180662?format=json", "institution": "Dalian University of Technology"}, {"id": 193651, "fullname": "Wenxin Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193651?format=json", "institution": "Dalian University of Technology"}, {"id": 193652, "fullname": "Guanglu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193652?format=json", "institution": "Dalian University of Technology"}, {"id": 173988, "fullname": "Senqi Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/173988?format=json", "institution": "Dalian University of Technology"}, {"id": 193653, "fullname": "Linlin Zong", "url": "http://cvpr.thecvf.com/api/miniconf/users/193653?format=json", "institution": "Dalian University of Technology"}, {"id": 193654, "fullname": "Dongyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193654?format=json", "institution": "Dalian University of Technology"}, {"id": 193655, "fullname": "Xinyue Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193655?format=json", "institution": "Dalian University of Technology"}], "abstract": "Fully Few-shot Class-incremental Audio Classification (FFCAC) is challenging since the training samples are limited both in the incremental sessions and in the base session. Existing few-shot learning methods suffer from catastrophic forgetting and overfitting when applied to FFCAC.Pre-trained Audio-Language Models (ALMs) have achieved success in many audio learning tasks. However, we find that it is impractical to directly use ALM on FFCAC, since misalignment between text and audio causes even severe catastrophic forgetting and overfitting. We propose a Task-Adaptive Prototype Evolution (TAPE) framework to facilitate ALMs to tackle the challenges of FFCAC, which consists of two key components:(1) A Task-Adapter that isolates audio features in a metric space to mitigate catastrophic forgetting while preserving knowledge across sessions, and (2) A Prototype Evolution mechanism that dynamically refines class prototypes using query samples during inference, thereby enabling adaptive learning and reducing overfitting.To the best of our knowledge, we are the first to use ALMs on the FFCAC task. We conduct experiments on three audio datasets: NSynth-100 (instrument recognition), FSC-89 (event detection), and LBS-100 (voice recognition). The experimental results show that our proposed approach TAPE significantly surpasses the baselines. Specifically, it averagely improves upon the second best from 54.93\\%  to 82.76\\% in terms of Average Accuracy (AA $\\uparrow$), and from 28.74\\% to 12.56\\% in terms of Performance Dropping rate (PD $\\downarrow$).", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40156", "url": null, "sourceid": 36348, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39323, "uid": "68a3ead65a4bad5da277ab9ecf50ca89", "name": "SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation", "authors": [{"id": 171149, "fullname": "Ryosuke Matsuda", "url": "http://cvpr.thecvf.com/api/miniconf/users/171149?format=json", "institution": "Tohoku University"}, {"id": 191846, "fullname": "Keito Kudo", "url": "http://cvpr.thecvf.com/api/miniconf/users/191846?format=json", "institution": "Tohoku University RIKEN"}, {"id": 191847, "fullname": "Haruto Yoshida", "url": "http://cvpr.thecvf.com/api/miniconf/users/191847?format=json", "institution": "Tohoku University"}, {"id": 191848, "fullname": "Nobuyuki Shimizu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191848?format=json", "institution": "LY Corporation"}, {"id": 165099, "fullname": "Jun Suzuki", "url": "http://cvpr.thecvf.com/api/miniconf/users/165099?format=json", "institution": "Tohoku University"}], "abstract": "We introduce SLVMEval, a benchmark for meta-evaluating text-to-video (T2V) evaluation systems. SLVMEval focuses on assessing these systems on long videos of up to 10,486 seconds (approximately 3 hours). Our benchmark targets a fundamental requirement: whether systems can accurately judge video quality in settings that are easy for humans to assess. We adopt a pairwise comparison-based meta-evaluation framework.Building on dense video captioning datasets, we synthetically degrade source videos to create controlled ``high-quality vs. low-quality'' pairs across 10 distinct aspects. We then use crowdsourcing to filter and retain only those pairs in which the degradation is clearly perceptible, thereby establishing the final testbed. Using this testbed, we assess the reliability of existing evaluation systems in ranking these pairs. Our experiments show that human evaluators identify the better long video with 84.7\\%--96.8\\% accuracy, while in 9 of the 10 aspects, the accuracy of these systems falls short of human judgment, revealing weaknesses in text-to-long video evaluation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39323", "url": null, "sourceid": 44792, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38630, "uid": "2a384b15e8016b260de6ef70a54dbd22", "name": "UniCompress: Token Compression for Unified Vision\u2013Language Understanding and Generation", "authors": [{"id": 190343, "fullname": "Ziyao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190343?format=json", "institution": "University of Maryland, College Park"}, {"id": 128864, "fullname": "Chen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/128864?format=json", "institution": "Sony AI"}, {"id": 152530, "fullname": "Jingtao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152530?format=json", "institution": "Sony AI"}, {"id": 128883, "fullname": "Weiming Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128883?format=json", "institution": "Sony Research"}, {"id": 190344, "fullname": "Jiabo Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190344?format=json", "institution": "Sony AI"}, {"id": 128760, "fullname": "Ang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/128760?format=json", "institution": "University of Maryland, College Park"}, {"id": 75477, "fullname": "Lingjuan Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75477?format=json", "institution": "Sony AI"}], "abstract": "Unified models aim to support both understanding and generation by encoding images into discrete tokens and processing them alongside text within a single autoregressive framework. This unified design offers architectural simplicity and cross-modal synergy, which facilitates shared parameterization, consistent training objectives, and seamless transfer between modalities. However, the large number of visual tokens required by such models introduces substantial computation and memory overhead, and this inefficiency directly hinders deployment in resource constrained scenarios such as embodied AI systems. In this work, we propose a unified token compression algorithm UniCompress that significantly reduces visual token count while preserving performance on both image understanding and generation tasks. Our method introduces a plug-in compression and decompression mechanism guided with learnable global meta tokens. The framework is lightweight and modular, enabling efficient integration into existing models without full retraining. Experimental results show that our approach reduces image tokens by up to \\(4\\times\\), achieves substantial gains in inference latency and training cost, and incurs only minimal performance degradation, which demonstrates the promise of token-efficient unified modeling for real world multimodal applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38630", "url": null, "sourceid": 46060, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39028, "uid": "aafe7061523ba396c00a46a0f4056b31", "name": "From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing", "authors": [{"id": 180759, "fullname": "Haoyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180759?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 191201, "fullname": "Keyao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191201?format=json", "institution": "Baidu"}, {"id": 191202, "fullname": "Guosheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191202?format=json", "institution": "Baidu"}, {"id": 129305, "fullname": "Haixiao Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/129305?format=json", "institution": "Baidu"}, {"id": 191203, "fullname": "Zhiwen Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191203?format=json", "institution": null}, {"id": 152712, "fullname": "Siran Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/152712?format=json", "institution": "Institute of automation, Chinese Academy of Sciences"}, {"id": 126688, "fullname": "Tianshuo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126688?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 87785, "fullname": "Xiao Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87785?format=json", "institution": "Baidu"}, {"id": 185763, "fullname": "KunbinChen KunbinChen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185763?format=json", "institution": "Baidu"}, {"id": 185764, "fullname": "Wei He", "url": "http://cvpr.thecvf.com/api/miniconf/users/185764?format=json", "institution": "Baidu"}, {"id": 126338, "fullname": "Jingdong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126338?format=json", "institution": "Baidu"}, {"id": 75781, "fullname": "Ajian Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75781?format=json", "institution": "NLPR, CASIA"}, {"id": 76305, "fullname": "Xiangyu Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76305?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 89292, "fullname": "Zhen Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/89292?format=json", "institution": "Institute of Automation,  Chinese Academy of Sciences"}], "abstract": "Face recognition remains vulnerable to presentation attacks, calling for robust face anti-spoofing (FAS) solutions. Recent Multimodal Large Language Model (MLLM)-based FAS methods reformulate the binary classification task as the generation of brief textual descriptions to improve cross-domain generalization. However, their generalizability is still limited, as such descriptions mainly capture intuitive semantic cues (e.g., screen borders or mask contours) while struggling to perceive fine-grained visual patterns. To address this limitation, we incorporate external visual tools into MLLMs to encourage deeper investigation of subtle spoof clues. Specifically, we propose the Tool-Augmented Reasoning FAS (TAR-FAS) framework, which reformulates the FAS task as a Chain-of-Thought with Visual Tools (CoT-VT) paradigm, allowing MLLMs to begin with intuitive observations and adaptively invoke external visual tools for fine-grained investigation. To this end, we design a tool-augmented data annotation pipeline and construct the ToolFAS-16K dataset, which contains multi-turn tool-use reasoning trajectories. Furthermore, we introduce a tool-aware FAS training pipeline, where Diverse-Tool Group Relative Policy Optimization (DT-GRPO) enables the model to autonomously learn efficient tool use. Extensive experiments under a challenging one-to-eleven cross-domain protocol demonstrate that TAR-FAS achieves state-of-the-art performance while providing fine-grained visual investigation for trustworthy spoof detection.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39028", "url": null, "sourceid": 39458, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40137, "uid": "ee1c481777677e45c72cba48b091cbd9", "name": "Beyond Weak Supervision: MLLMs-Guided Graded Knowledge Distillation for Unsupervised Camouflaged Object Detection", "authors": [{"id": 193608, "fullname": "Huafeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193608?format=json", "institution": "nanjing university"}, {"id": 145475, "fullname": "Chenguang Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/145475?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 190020, "fullname": "Yueming Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190020?format=json", "institution": "Nanjing university"}, {"id": 153687, "fullname": "Caifeng Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153687?format=json", "institution": "Nanjing University"}], "abstract": "Most Camouflaged Object Detection (COD) methods rely on costly pixel-level annotations. Recent studies have adopted unsupervised COD (UCOD) to eliminate labeling costs, but still suffer from two issues:1) insufficient supervision, leading to reliance on self-supervised backbone DINO and reduced model flexibility; and 2) ineffective use of pseudo-labels, which widens the performance gap with supervised methods and limits real-world applicability. In this paper, we propose a novel teacher-student framework for UCOD to address these two issues. To tackle the lack of supervision, we build a powerful teacher model by integrating Multimodal Large Language Models (MLLMs) and the Segment Anything Model (SAM) to generate high-quality pseudo-labels. However, the teacher model faces two challenges: 1) suboptimal performance of MLLMs in COD, and 2) cascading errors.To address these challenges, we first propose a Camouflaged-Aware Chain-of-Thought (CA-CoT) for MLLMs. CA-CoT guides MLLMs through step-by-step reasoning to simulate human perceptual processes, thereby enhancing their performance in COD.Subsequently, we design a Graded Mask Evaluator (GME) to mitigate cascading errors, which evaluates and grades the quality of masks generated by SAM, and then filters out the low-quality masks to provide more reliable supervision.To better leverage pseudo-labels, we propose Graded Knowledge Distillation (GKD), which adaptively enhances distillation at both image and pixel levels based on pseudo-label quality.Extensive experiments show that our method outperforms existing UCOD approaches by a large margin and achieves performance comparable to weakly supervised methods. Notably, our method also achieves good performance under zero-shot settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40137", "url": null, "sourceid": 33302, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38814, "uid": "5fdb4a28dbe649f89634d06546d454c4", "name": "FlowComposer: Composable Flows for Compositional Zero-Shot Learning", "authors": [{"id": 190754, "fullname": "Zhenqi He", "url": "http://cvpr.thecvf.com/api/miniconf/users/190754?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 154832, "fullname": "Lin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/154832?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 90895, "fullname": "Long Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/90895?format=json", "institution": "HKUST"}], "abstract": "Compositional zero-shot learning (CZSL) aims to recognize unseen attribute\u2013object compositions by recombining primitives learned from seen pairs. Recent CZSL methods built on vision-language models (VLMs) typically adopt parameter-efficient fine-tuning (PEFT).They apply visual disentanglers for decomposition and manipulate token-level prompts or prefixes to encode compositions.However, such PEFT-based designs suffer from two fundamental limitations: (1) Implicit Composition Construction, where composition is realized only via token concatenation or branch-wise prompt tuning rather than an explicit operation in the embedding space; (2) Remained Feature Entanglement, where imperfect disentanglement leaves attribute, object, and composition features mutually contaminated. Together, these issues limit the generalization ability of current CZSL models.In this paper, we are the first to systematically study flow matching for CZSL and introduce FlowComposer, a model-agnostic framework that learns two primitive flows to transport visual features toward attribute and object text embeddings, and a learnable Composer that explicitly fuses their velocity fields into a composition flow. To exploit the inevitable residual entanglement, we further devise a leakage-guided augmentation scheme that reuses leaked features as auxiliary signals.We thoroughly evaluate FlowComposer on three public CZSL benchmarks by integrating it as a plug-and-play component into various baselines, consistently achieving significant improvements.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38814", "url": null, "sourceid": 32500, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36318, "uid": "39183d4b26c4bb6f1a69b8040cfc8960", "name": "Photo3D: Advancing Photorealistic 3D Generation through Structure\u2011Aligned Detail Enhancement", "authors": [{"id": 156711, "fullname": "Xinyue Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156711?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 76238, "fullname": "Zhiyuan Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/76238?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 88758, "fullname": "Lingchen Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/88758?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 184763, "fullname": "Yanjun Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/184763?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 88043, "fullname": "Lei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88043?format=json", "institution": "The Hong Kong Polytechnic University"}], "abstract": "Although recent 3D\u2011native generators have made great progress in synthesizing reliable geometry, they still fall short in achieving realistic appearances. A key obstacle lies in the lack of diverse and high-quality real-world 3D assets with rich surface details, since capturing such data is intrinsically difficult due to the diverse scales of scenes, non\u2011rigid motions of objects, and the limited precision of scanners.We introduce Photo3D, a framework for advancing photorealistic 3D generation, which is driven by the image data generated by the GPT\u20114o\u2011Image model.Considering that the generated images can distort 3D structures due to their lack of multi\u2011view consistency, we design a structure\u2011aligned multi\u2011view synthesis pipeline and construct a detail\u2011enhanced multi\u2011view dataset paired with 3D geometry. Building on it, we present a realistic detail enhancement scheme that leverages perceptual feature adaptation and semantic structure matching to enforce appearance consistency with the realistic detail priors while preserving the structural consistency with the 3D-native geometry. While our scheme is general to different 3D-native generators, we present dedicated training strategies to facilitate the optimization of geometry-texture coupled and decoupled 3D-native generation paradigms. Experiments demonstrate that Photo3D generalizes well across diverse 3D\u2011native generation paradigms and achieves state\u2011of\u2011the\u2011art photorealistic 3D generation performance.  Codes, models and datasets will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36318", "url": null, "sourceid": 46307, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37698, "uid": "0b6412624cfd8ebee546fa2dc1db6945", "name": "Scaling Zero-Shot Reference-to-Video Generation", "authors": [{"id": 103410, "fullname": "Zijian Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/103410?format=json", "institution": "King&#x27;s College London"}, {"id": 76995, "fullname": "Shikun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76995?format=json", "institution": "Meta AI"}, {"id": 86851, "fullname": "Haozhe Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86851?format=json", "institution": "King Abdullah University of Science and Technology"}, {"id": 155930, "fullname": "Haonan Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155930?format=json", "institution": "Nanyang Technological University"}, {"id": 113866, "fullname": "Zhaochong An", "url": "http://cvpr.thecvf.com/api/miniconf/users/113866?format=json", "institution": "University of Copenhagen"}, {"id": 130315, "fullname": "Weiming Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/130315?format=json", "institution": "University of Waterloo"}, {"id": 126968, "fullname": "Zhiheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126968?format=json", "institution": "University of Science and Technology of China"}, {"id": 154250, "fullname": "Xiaoke Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154250?format=json", "institution": "UC Santa Cruz"}, {"id": 157111, "fullname": "Kam Woh Ng", "url": "http://cvpr.thecvf.com/api/miniconf/users/157111?format=json", "institution": "Meta"}, {"id": 157112, "fullname": "Tian Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/157112?format=json", "institution": "Meta"}, {"id": 157110, "fullname": "Xiao Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/157110?format=json", "institution": "Meta AI"}, {"id": 157113, "fullname": "Yuren Cong", "url": "http://cvpr.thecvf.com/api/miniconf/users/157113?format=json", "institution": "Meta"}, {"id": 154505, "fullname": "Hang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/154505?format=json", "institution": "Facebook"}, {"id": 188036, "fullname": "Chuyan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188036?format=json", "institution": "Facebook"}, {"id": 139232, "fullname": "Aditya Patel", "url": "http://cvpr.thecvf.com/api/miniconf/users/139232?format=json", "institution": "Meta"}, {"id": 85886, "fullname": "Tao Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85886?format=json", "institution": "University of Surrey"}, {"id": 127697, "fullname": "Sen He", "url": "http://cvpr.thecvf.com/api/miniconf/users/127697?format=json", "institution": "Meta AI"}], "abstract": "Reference-to-video (R2V) generation aims to synthesize videos that align with a text prompt while preserving the subject identity from reference images. However, current R2V methods are hindered by the reliance on explicit reference image-video-text triplets, whose construction is highly expensive and difficult to scale. We bypass this bottleneck by introducing Saber, a scalable zero-shot framework that requires no explicit R2V data. Trained exclusively on video-text pairs, Saber employs a masked training strategy and a tailored attention-based model design to learn identity-consistent and reference-aware representations. Mask augmentation techniques are further integrated to mitigate copy-paste artifacts common in reference-to-video generation. Moreover, Saber demonstrates remarkable generalization capabilities across a varying number of references and achieves superior performance on the OpenS2V-Eval benchmark compared to methods trained with R2V data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37698", "url": null, "sourceid": 41787, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38184, "uid": "12c9572aee915aa05233c43f75e915a6", "name": "Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images", "authors": [{"id": 156260, "fullname": "Matias Turkulainen", "url": "http://cvpr.thecvf.com/api/miniconf/users/156260?format=json", "institution": "Aalto University"}, {"id": 189235, "fullname": "Akshay Krishnan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189235?format=json", "institution": "Georgia Institute of Technology"}, {"id": 127993, "fullname": "Filippo Aleotti", "url": "http://cvpr.thecvf.com/api/miniconf/users/127993?format=json", "institution": "Niantic, Inc."}, {"id": 86742, "fullname": "Mohamed Sayed", "url": "http://cvpr.thecvf.com/api/miniconf/users/86742?format=json", "institution": "University College London, University of London"}, {"id": 189236, "fullname": "Guillermo Garcia-Hernando", "url": "http://cvpr.thecvf.com/api/miniconf/users/189236?format=json", "institution": "Niantic Spatial"}, {"id": 130155, "fullname": "Juho Kannala", "url": "http://cvpr.thecvf.com/api/miniconf/users/130155?format=json", "institution": "University of Oulu"}, {"id": 150992, "fullname": "Arno Solin", "url": "http://cvpr.thecvf.com/api/miniconf/users/150992?format=json", "institution": "Aalto University"}, {"id": 158893, "fullname": "Gabriel Brostow", "url": "http://cvpr.thecvf.com/api/miniconf/users/158893?format=json", "institution": "Department of Computer Science, University College London"}, {"id": 87391, "fullname": "Daniyar Turmukhambetov", "url": "http://cvpr.thecvf.com/api/miniconf/users/87391?format=json", "institution": "Niantic"}], "abstract": "We present Cross-View Splatter, a feed-forward method that predicts pixel-aligned Gaussian splats for outdoor scenes captured at ground-level AND by satellite. Faithful reconstructions require good camera coverage, but ground imagery is time-consuming and hard to capture at scale for large outdoor scenes. Fortunately, satellite imagery can provide a global geometric prior that is easy to access via public APIs. Cross-View Splatter fuses orthorectified satellite views with GPS-tagged ground photos to predict Gaussian splats in a unified 3D coordinate frame. By aligning ground and bird's-eye feature representations, our model improves scene coverage and novel-view synthesis, compared to ground imagery alone. We train on curated georeferenced data sets and paired satellite--terrain data, mined from open mapping services.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38184", "url": null, "sourceid": 30603, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39177, "uid": "bef08a31c0318484b7787659041858da", "name": "TempoControl: Temporal Attention Guidance for Text-to-Video Models", "authors": [{"id": 191510, "fullname": "Shira Schiber", "url": "http://cvpr.thecvf.com/api/miniconf/users/191510?format=json", "institution": "Bar-Ilan University"}, {"id": 191511, "fullname": "Ofir Lindenbaum", "url": "http://cvpr.thecvf.com/api/miniconf/users/191511?format=json", "institution": "Bar-Ilan University"}, {"id": 183447, "fullname": "Idan Schwartz", "url": "http://cvpr.thecvf.com/api/miniconf/users/183447?format=json", "institution": "Bar-Ilan University"}], "abstract": "Recent advances in generative video models have enabled the creation of high-quality videos based on natural language prompts. However, these models frequently lack fine-grained temporal control, meaning they do not allow users to specify when particular visual elements should appear within a generated sequence. In this work, we introduce TempoControl, a method that allows for temporal alignment of visual concepts during inference, without requiring retraining or additional supervision. TempoControl utilizes cross-attention maps, a key component of text-to-video diffusion models, to guide the timing of concepts through a novel optimization approach. Our method steers attention using three complementary principles: aligning its temporal pattern with a control signal (correlation), adjusting its strength where visibility is required (magnitude), and preserving semantic consistency (entropy). TempoControl provides precise temporal control while maintaining high video quality and diversity. We demonstrate its effectiveness across various applications, including temporal reordering of single and multiple objects, action timing, and audio-aligned video generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39177", "url": null, "sourceid": 43080, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39901, "uid": "834119d8cd999e0d70abd1bcabb627ad", "name": "SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia", "authors": [{"id": 180988, "fullname": "Pengfei Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/180988?format=json", "institution": "Xiamen University"}, {"id": 193076, "fullname": "Xingran Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193076?format=json", "institution": null}, {"id": 145074, "fullname": "Juntao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/145074?format=json", "institution": "Tongji University"}, {"id": 191101, "fullname": "Peng Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191101?format=json", "institution": "Shopee"}, {"id": 193077, "fullname": "Wang Longchao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193077?format=json", "institution": "Shopee"}, {"id": 193078, "fullname": "Jianghang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/193078?format=json", "institution": "Xiamen University"}, {"id": 70439, "fullname": "Shengchuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70439?format=json", "institution": "Xiamen University"}, {"id": 191102, "fullname": "Anxiang Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191102?format=json", "institution": "Shopee"}, {"id": 88415, "fullname": "Liujuan Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88415?format=json", "institution": "Xiamen University"}], "abstract": "Multilingual document and scene text understanding plays an important role in applications such as search, finance, and public services. However, most existing benchmarks focus on high-resource languages and fail to evaluate models in realistic multilingual environments. In Southeast Asia, the diversity of languages, complex writing systems, and highly varied document types make this challenge even greater. We introduce SEA-Vision, a benchmark that jointly evaluates Document Parsing and Text-Centric Visual Question Answering~(TEC-VQA)  across 11 Southeast Asian languages. SEA-Vision contains 15,234 document parsing pages from nine representative document types, annotated with hierarchical page-, block-, and line-level labels. It also provides 7,496 TEC-VQA question\u2013answer pairs that probe text recognition, numerical calculation, comparative analysis, logical reasoning, and spatial understanding. To make such multilingual, multi-task annotation feasible, we design a hybrid pipeline for Document Parsing and TEC-VQA. It combines automated filtering and scoring with MLLM-assisted labeling and lightweight native-speaker verification, greatly reducing manual labeling while maintaining high quality. We evaluate several leading multimodal models and observe pronounced performance degradation on low-resource Southeast Asian languages, highlighting substantial remaining gaps in multilingual document and scene text understanding. We believe SEA-Vision will help drive progress in document and scene text understanding from a global perspective.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39901", "url": null, "sourceid": 41336, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39554, "uid": "d3af6e88ef95e50bacebe1bd779ea52c", "name": "Prune Wisely, Reconstruct Sharply: Compact 3D Gaussian Splatting via Adaptive Pruning and Difference-of-Gaussian Primitives", "authors": [{"id": 192333, "fullname": "Haoran Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192333?format=json", "institution": "University of Bristol"}, {"id": 192334, "fullname": "Guoxi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192334?format=json", "institution": "University of Bristol"}, {"id": 155302, "fullname": "Fan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155302?format=json", "institution": "University of Bristol"}, {"id": 155305, "fullname": "David Bull", "url": "http://cvpr.thecvf.com/api/miniconf/users/155305?format=json", "institution": "University of Bristol"}, {"id": 187005, "fullname": "Nantheera Anantrasirichai", "url": "http://cvpr.thecvf.com/api/miniconf/users/187005?format=json", "institution": "University of Bristol"}], "abstract": "Recent significant advances in 3D scene representation have been driven by 3D Gaussian Splatting (3DGS), which has enabled real-time rendering with photorealistic quality. 3DGS often requires a large number of primitives to achieve high fidelity, leading to redundant representations and high resource consumption, thereby limiting its scalability for complex or large-scale scenes. Consequently, effective pruning strategies and more expressive primitives that can reduce redundancy while preserving visual quality are crucial for practical deployment. We propose an efficient, integrated reconstruction-aware pruning strategy that adaptively determines pruning timing and refining intervals based on reconstruction quality, thus reducing model size while enhancing rendering quality. Moreover, we introduce a 3D Difference-of-Gaussians primitive that jointly models both positive and negative densities in a single primitive, improving the expressiveness of Gaussians under compact configurations. Our method significantly improves model compactness, achieving up to 90% reduction in Gaussian-count while delivering visual quality that is similar to, or in some cases better than, that produced by state-of-the-art methods. Code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39554", "url": null, "sourceid": 34700, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38244, "uid": "6c6094f256f51e83fe02bce6091163e7", "name": "URICA: A Uniformity Region Affine Identifier Capture Algorithm for Arbitrary Region Retrieval in Pathology Images", "authors": [{"id": 180064, "fullname": "Ri Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/180064?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 189412, "fullname": "Zhao CHEN", "url": "http://cvpr.thecvf.com/api/miniconf/users/189412?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 189413, "fullname": "Caleb Chen Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189413?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 90495, "fullname": "Lei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/90495?format=json", "institution": "Hong Kong University of Science and Technology"}], "abstract": "Whole slide image (WSI) region retrieval remains an open challenge in computational pathology, as existing methods struggle to represent and preserve information of all possible regions. Current approaches that rely on fixed-size patches or slide-level retrieval are misaligned with real clinical workflows, where pathologists often examine WSI regions of arbitrary orientations and sizes rather than predefined patches or slides. In this work, we redefine WSI retrieval as a semantically optimal matching problem between arbitrary regions under spatial transformations, which necessitates a region-level representation that maintains semantic consistency. To fulfill this requirement, we introduce semantic tessellation, which organizes patch units into flexible, geometry-aware region descriptors. Building on this representation, we develop the affine identifier, a semantic signature that enables rotation- and scale-consistent region matching. We further derive theoretical bounds between the tessellation-derived descriptors and the ideal pixel-level semantic mask objective, showing that they reliably approximate mask-based region similarity. Together, these components form URICA, a theoretically grounded algorithm for robust WSI region retrieval. Experiments on large public datasets demonstrate that URICA achieves strong and consistent performance across diverse WSI retrieval tasks.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38244", "url": null, "sourceid": 39523, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39135, "uid": "9a81e45c27473af4628ddce5c7e1d576", "name": "ProjFlow: Projection Sampling with Flow Matching for Zero\u2011Shot Exact Spatial Motion Control", "authors": [{"id": 186101, "fullname": "Akihisa Watanabe", "url": "http://cvpr.thecvf.com/api/miniconf/users/186101?format=json", "institution": "Waseda University"}, {"id": 113807, "fullname": "Qing Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/113807?format=json", "institution": "LY Corporation"}, {"id": 89626, "fullname": "Edgar Simo-Serra", "url": "http://cvpr.thecvf.com/api/miniconf/users/89626?format=json", "institution": "Waseda University"}, {"id": 131224, "fullname": "Kent Fujiwara", "url": "http://cvpr.thecvf.com/api/miniconf/users/131224?format=json", "institution": "LY Corporation"}], "abstract": "Generating human motion with precise spatial control is a challenging problem. Existing approaches often require task-specific training or slow optimization, and enforcing hard constraints frequently disrupts motion naturalness. Building on the observation that many animation tasks can be formulated as a linear inverse problem, we introduce $\\textbf{ProjFlow}$, a training-free sampler that achieves zero-shot, exact satisfaction of linear spatial constraints while preserving motion realism. Our key advance is a novel kinematics-aware metric that encodes skeletal topology. This metric allows the sampler to enforce hard constraints by distributing corrections coherently across the entire skeleton, avoiding the unnatural artifacts of naive projection. Furthermore, for sparse inputs, such as filling in long gaps between a few keyframes, we introduce a time-varying formulation using pseudo-observations that fade during sampling. Extensive experiments on representative applications, motion inpainting, and 2D-to-3D lifting, demonstrate that ProjFlow achieves exact constraint satisfaction and matches or improves realism over zero-shot baselines, while remaining competitive with training-based controllers.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39135", "url": null, "sourceid": 40682, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38708, "uid": "80fb8b5d8e12b6d0cf36310afeb3ebc5", "name": "KAMP: Knowledge-Anchored Multimodal Pretraining Framework for Medical Image Representation", "authors": [{"id": 172766, "fullname": "Feiyu Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/172766?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 190498, "fullname": "Jia Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190498?format=json", "institution": "Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 189412, "fullname": "Zhao CHEN", "url": "http://cvpr.thecvf.com/api/miniconf/users/189412?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 190499, "fullname": "Yang WU", "url": "http://cvpr.thecvf.com/api/miniconf/users/190499?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 189413, "fullname": "Caleb Chen Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189413?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 90495, "fullname": "Lei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/90495?format=json", "institution": "Hong Kong University of Science and Technology"}], "abstract": "Cross-modal biomedical signals such as pathology and genomics can provide richer and more robust semantic guidance for medical image representation. However, semantic guidance remains limited, as privacy constraints and acquisition costs severely restrict the availability of medical images paired with other biomedical data. A further challenge is modality discrepancy, which propagates intra-modal statistical bias and cross-modal noise, degrading medical image representation quality. To this end, we propose KAMP, a large language model (LLM)--driven multimodally guided pretraining framework for medical image representation learning. KAMP leverages textual priors as semantic anchors to enhance medical image representations and align medical images with multimodal biomedical data, enabling the generation of rich and robust representations even under scarce paired data. KAMP operates in three stages. First, the LLM generates personalized diagnostic knowledge from patient clinical text and imaging metadata. We inject this knowledge as a prior to enrich the medical image representation and use it as a semantic anchor to reduce the distance between the medical image representations and other biomedical modalities. Second, the LLM is optimized via the Group Relative Policy Optimization (GRPO) strategy, with the cross-modal aligner pretrained in the first stage serving as the reward model. Third, the optimized knowledge is employed to retrain the cross-modal aligner, yielding more robust medical image representations while mitigating bias and noise introduced by other modalities.Comprehensive evaluations on brain, bladder, and liver cancer datasets demonstrate that KAMP consistently outperforms existing methods on downstream few-shot prediction tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38708", "url": null, "sourceid": 35817, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36884, "uid": "0d8a89515fe89cd53cdedd3c039a15b0", "name": "Causal Motion Diffusion Models for Autoregressive Motion Generation", "authors": [{"id": 113807, "fullname": "Qing Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/113807?format=json", "institution": "LY Corporation"}, {"id": 186101, "fullname": "Akihisa Watanabe", "url": "http://cvpr.thecvf.com/api/miniconf/users/186101?format=json", "institution": "Waseda University"}, {"id": 131224, "fullname": "Kent Fujiwara", "url": "http://cvpr.thecvf.com/api/miniconf/users/131224?format=json", "institution": "LY Corporation"}], "abstract": "Recent advances in motion diffusion models have substantially improved the realism of human motion synthesis. However, existing approaches either rely on full-sequence diffusion models with bidirectional generation, which limits temporal causality and real-time applicability, or autoregressive models that suffer from instability and cumulative errors. In this work, we present Causal Motion Diffusion Models (CMDM), a unified framework for autoregressive motion generation based on a causal diffusion transformer that operates in a semantically aligned latent space. CMDM builds upon a Motion-Language-Aligned Causal VAE (MAC-VAE), which encodes motion sequences into temporally causal latent representations. On top of this latent representation, an autoregressive diffusion transformer is trained using causal diffusion forcing to perform temporally ordered denoising across motion frames. To achieve fast inference, we introduce a frame-wise sampling schedule with causal uncertainty, where each subsequent frame is predicted from partially denoised previous frames. The resulting framework supports high-quality text-to-motion generation, streaming synthesis, and long-horizon motion generation at interactive rates. Experiments on HumanML3D and SnapMoGen demonstrate that CMDM outperforms existing diffusion and autoregressive models in both semantic fidelity and temporal smoothness, while substantially reducing inference latency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36884", "url": null, "sourceid": 33614, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38202, "uid": "cfefe028584e7a3f406e1096e7daaaff", "name": "Harnessing Chain-of-Thought Reasoning in Multimodal Large Language Models for Face Anti-Spoofing", "authors": [{"id": 182135, "fullname": "Honglu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182135?format=json", "institution": "Didi Chuxing"}, {"id": 189295, "fullname": "Zhiqin Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189295?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 189296, "fullname": "Ningning Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189296?format=json", "institution": null}, {"id": 185561, "fullname": "Saihui Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185561?format=json", "institution": "Beijing Normal University"}, {"id": 189297, "fullname": "Long Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/189297?format=json", "institution": null}, {"id": 189298, "fullname": "Renwang Pei", "url": "http://cvpr.thecvf.com/api/miniconf/users/189298?format=json", "institution": null}, {"id": 189299, "fullname": "Zhaofeng He", "url": "http://cvpr.thecvf.com/api/miniconf/users/189299?format=json", "institution": "Beijing University of Post and Telecommunication"}], "abstract": "Face Anti-Spoofing (FAS) typically depends on a single visual modality when defending against presentation attacks such as print attacks, screen replays, and 3D masks, resulting in limited generalization across devices, environments, and attack types. Meanwhile, Multimodal Large Language Models (MLLMs) have recently achieved breakthroughs in image\u2013text understanding and semantic reasoning, suggesting that integrating visual and linguistic co-inference into FAS can substantially improve both robustness and interpretability. However, the lack of a high-quality vision\u2013language multimodal dataset has been a critical bottleneck. To address this, we introduce FaceCoT (Face Chain-of-Thought), the first large-scale Visual Question Answering (VQA) dataset tailored for FAS. FaceCoT covers 14 spoofing attack types and enriches model learning with high-quality CoT VQA annotations. Meanwhile, we develop a caption model refined via reinforcement learning to expand the dataset and enhance annotation quality. Furthermore, we introduce a CoT-Enhanced Progressive Learning (CEPL) strategy to better leverage the CoT data and boost model performance on FAS tasks. Extensive experiments demonstrate that models trained with FaceCoT and CEPL outperform state-of-the-art methods on multiple benchmark datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38202", "url": null, "sourceid": 33197, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39045, "uid": "46f913fb3c2719fb01ecad725bc3952d", "name": "Assignment-Driven Hash Learning in a Hyper-Semantic Space for On-the-Fly Category Discovery", "authors": [{"id": 191245, "fullname": "Kaibing Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191245?format=json", "institution": "Tianjin University"}, {"id": 191246, "fullname": "Yucheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191246?format=json", "institution": "University of Minnesota, Twin Cities"}, {"id": 191247, "fullname": "Tingzhang Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/191247?format=json", "institution": "City University of Hong Kong"}], "abstract": "On-the-fly Category Discovery (OCD) aims to dynamically identify both known and emerging unknown categories from streaming data, using supervision from only a limited set of labeled classes. Despite recent progress, our empirical analysis reveals fundamental limitations: existing methods suffer from cascading feature-to-hash degradation and severe space monopolization by known classes, fundamentally hindering novel category discovery. To address these coupled challenges, we introduce a principled two-stage framework.We first construct a Hyper-Semantic Space with dual geometric subspaces: a Derived Subspace employing parent\u2013derived prototype augmentation to capture intra-class diversity and enhance inter-class discrimination, and a Calibrated Subspace synthesized through cross-prototype interpolation to impose distributional constraints and prevent representational collapse.Within this geometrically-constrained space, we perform Assignment-Driven Hash Learning, where Flexible Prototype Assignment (FPA) models intra-class variations and enhances inter-class separation, alongside Binary Hash Regularization (BHR) to enforce compact and discriminative hash representations. Our framework serves as a plug-and-play module, consistently improving state-of-the-art OCD methods across fine-grained benchmarks. Code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39045", "url": null, "sourceid": 43144, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38056, "uid": "b1211abafb24dcd0eea6ef6e8f4790a6", "name": "RHO: Robust Holistic OSM-Based Metric Cross-View Geo-Localization", "authors": [{"id": 92182, "fullname": "Junwei Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/92182?format=json", "institution": "Karlsruhe Institute of Technology"}, {"id": 188940, "fullname": "Ruize Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/188940?format=json", "institution": "KIT"}, {"id": 106502, "fullname": "Ruiping Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/106502?format=json", "institution": "Karlsruher Institut f\u00fcr Technologie"}, {"id": 186156, "fullname": "Zichao Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186156?format=json", "institution": "University College London, University of London; Karlsruher Institut f\u00fcr Technologie"}, {"id": 113798, "fullname": "Yufan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/113798?format=json", "institution": "Karlsruhe Institute of Technology (KIT)"}, {"id": 77183, "fullname": "Fangjinhua Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77183?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 70993, "fullname": "Kunyu Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/70993?format=json", "institution": "KIT"}, {"id": 89634, "fullname": "Kailun Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89634?format=json", "institution": "Karlsruher Institut f\u00fcr Technologie"}, {"id": 186157, "fullname": "Jiaming Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186157?format=json", "institution": "Hunan University"}, {"id": 75808, "fullname": "Rainer Stiefelhagen", "url": "http://cvpr.thecvf.com/api/miniconf/users/75808?format=json", "institution": "Karlsruhe Institute of Technology"}], "abstract": "Metric Cross-View Geo-Localization (MCVGL) aims to estimate the 3-DoF camera pose (position and heading) by matching ground and satellite images. In this work, instead of pinhole and satellite images, we study robust MCVGL using holistic panoramas and OpenStreetMap (OSM). To this end, we establish a large-scale MCVGL benchmark dataset, CV-RHO, with over 2.7M images under different weather and lighting conditions, as well as sensor noise. Furthermore, we propose a model termed RHO with a two-branch Pin-Pan architecture for accurate visual localization. A Split-Undistort-Merge (SUM) module is introduced to address the panoramic distortion, and a Position-Orientation Fusion (POF) mechanism is designed to enhance the localization accuracy. Extensive experiments prove the value of our CV-RHO dataset and the effectiveness of the RHO model, with a significant performance gain up to 20% compared with the state-of-the-art baselines. The dataset, model, and code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38056", "url": null, "sourceid": 32234, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38171, "uid": "ec5ab2bab6fdb880337c5c92785f77dd", "name": "LacTokGen: Latent Consistency Tokenizer for 1024-pixel Image  Generation by 256 Tokens", "authors": [{"id": 130250, "fullname": "Qingsong Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/130250?format=json", "institution": "OPPO"}, {"id": 154482, "fullname": "Luyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154482?format=json", "institution": "Tsinghua university"}, {"id": 131107, "fullname": "Zhao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131107?format=json", "institution": "Sensetime Research"}, {"id": 86615, "fullname": "Siyuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86615?format=json", "institution": "Westlake University, Zhejiang University"}, {"id": 183334, "fullname": "Zhe Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183334?format=json", "institution": "Tsinghua University"}, {"id": 189203, "fullname": "Zhenyu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189203?format=json", "institution": "Guangdong OPPO Mobile Telecommunications Corp.,Ltd."}, {"id": 154487, "fullname": "Haonan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154487?format=json", "institution": "OPPO Guangdong Mobile Telecommunications Co., Ltd."}], "abstract": "Image tokenization has significantly advanced visual generation and multimodal modeling, particularly when paired with autoregressive models. However, current methods face challenges in balancing efficiency and quality: high-resolution image generation either requires an excessive number of tokens or compromises critical details through token reduction. To resolve this, we propose Latent Consistency Tokenizer (LacTok) that bridges discrete visual tokens with the compact latent space of pretrained latent diffusion models (LDMs), enabling efficient representation of 1024\u00d71024 images using only 256 tokens\u2014a 16$\\times$ compression over VQGAN. LacTok integrates a transformer encoder, a quantized codebook, and a latent consistency decoder.Direct application of LDM as the decoder results in color and brightness discrepancies;thus, we convert it to latent consistency decoder, reducing multi-step sampling to 1-2 steps for direct pixel-level supervision. To endow LacTok with text-to-image  generation capabilities, we seamlessly integrate it with an autoregressive transformer, forming LacTokGen. This transformer is trained by predicting compact token sequences conditioned on text instructions.Experiments demonstrate LacTok\u2019s superiority in high-fidelity reconstruction, with 10.8 reconstruction Fr\\'{e}chet Inception Distance on MSCOCO-2017 5K benchmark for 1024\u00d71024 image reconstruction.LacTokGen achieves 0.73 score on GenEval benchmark and 0.304 HPSv2 on MSCOCO-2017 dataset.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38171", "url": null, "sourceid": 36410, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39860, "uid": "447f51ef2d069e87d21382789ccab1e6", "name": "Reflection Separation from a Single Image via Joint Latent Diffusion", "authors": [{"id": 104777, "fullname": "Zheng-Hui Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/104777?format=json", "institution": "National Taiwan University"}, {"id": 92507, "fullname": "Zhixiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/92507?format=json", "institution": "The University of Tokyo"}, {"id": 127300, "fullname": "Yu-Lun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127300?format=json", "institution": "National Yang Ming Chiao Tung University"}, {"id": 90154, "fullname": "Yung-Yu Chuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90154?format=json", "institution": "National Taiwan University"}], "abstract": "Single-image reflection separation remains challenging due to its ill-posed nature, especially under extreme conditions with strong or subtle reflections. Existing methods often struggle to recover both layers in glare or weak-reflection scenarios, because of insufficient information. This paper presents the first diffusion model explicitly fine-tuned for this task, leveraging generative diffusion priors for robust separation. Our method simultaneously generates transmission and reflection layers through a unified diffusion model, incorporating a novel cross-layer self-attention mechanism for better feature disentanglement. We further introduce a disjoint sampling strategy to iteratively reduce interference between the layers during diffusion and a latent optimization step with a learned composition function for improved results in complex real-world scenarios. Extensive experiments show our approach achieves superior separation performance on multiple real-world benchmarks and surpasses state-of-the-art methods in both quantitative metrics and perceptual quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39860", "url": null, "sourceid": 35416, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37455, "uid": "105080a6d9b902a81355dc79a51155c9", "name": "IDperturb: Enhancing Variation in Synthetic Face Generation via Angular Perturbations", "authors": [{"id": 87159, "fullname": "Fadi Boutros", "url": "http://cvpr.thecvf.com/api/miniconf/users/87159?format=json", "institution": "Fraunhofer Institute for Computer Graphics Research"}, {"id": 187492, "fullname": "Eduarda Caldeira", "url": "http://cvpr.thecvf.com/api/miniconf/users/187492?format=json", "institution": "Fraunhofer IGD"}, {"id": 187493, "fullname": "Tahar Chettaoui", "url": "http://cvpr.thecvf.com/api/miniconf/users/187493?format=json", "institution": "Fraunhofer IGD"}, {"id": 87174, "fullname": "Naser Damer", "url": "http://cvpr.thecvf.com/api/miniconf/users/87174?format=json", "institution": "Fraunhofer Institute for Computer Graphics Research IGD"}], "abstract": "Synthetic data has emerged as a practical alternative to authentic face datasets for training face recognition (FR) systems, especially as privacy and legal concerns increasingly restrict the use of real biometric data. Recent advances in identity-conditional diffusion models have enabled the generation of photorealistic and identity-consistent face images. However, many of these models suffer from limited intra-class variation, an essential property for training robust and generalizable FR models. In this work, we propose IDperturb, a simple yet effective geometric-driven sampling strategy to enhance diversity in synthetic face generation. IDperturb perturbs identity embeddings within a constrained angular region of the unit hyper-sphere, producing a diverse set of embeddings without modifying the underlying generative model. Each perturbed embedding serves as a conditioning vector for a pre-trained diffusion model, enabling the synthesis of visually varied yet identity-coherent face images suitable for training generalizable FR systems. Empirical results show that training FR on datasets generated using IDperturb leads to improved performance across multiple FR benchmarks, compared to existing synthetic data generation approaches. Code and generated datasets will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37455", "url": null, "sourceid": 35193, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38492, "uid": "3ff5f362073a3e37fcf98867f5b4f527", "name": "Few-shot Acoustic Synthesis with Multimodal Flow Matching", "authors": [{"id": 189982, "fullname": "Amandine Brunetto", "url": "http://cvpr.thecvf.com/api/miniconf/users/189982?format=json", "institution": "IMAGINE LIGM - ENPC IP Paris"}], "abstract": "Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for each environment. Few-shot approaches improve scalability across rooms but still rely on multiple recordings and, being deterministic, fail to capture the inherent uncertainty of scene acoustics under sparse context.We introduce FLow-matching ACoustic generation (FLAC), a probabilistic method for few-shot acoustic synthesis that models the distribution of plausible room impulse responses (RIRs) given minimal scene context. FLAC leverages a diffusion transformer trained with a flow-matching objective to generate RIRs at arbitrary positions in novel scenes, conditioned on spatial, geometric, and acoustic cues.FLAC outperforms state-of-the-art eight-shot baselines with one-shot on both the AcousticRooms and Hearing Anything Anywhere datasets. To complement standard perceptual metrics, we further introduce AGREE, a joint Acoustic\u2013GeometRy EmbEdding, enabling geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics.This work is the first to apply generative flow matching to acoustics, establishing a new direction for robust and data-efficient acoustic synthesis.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38492", "url": "https://amandinebtto.github.io/FLAC/", "sourceid": 43349, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39630, "uid": "6d9292034eae369e7d6808ee56589403", "name": "UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions", "authors": [{"id": 71092, "fullname": "Guozhen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/71092?format=json", "institution": "Nanjing University"}, {"id": 127553, "fullname": "Zixiang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/127553?format=json", "institution": "xiaobing.ai"}, {"id": 71640, "fullname": "Teng Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/71640?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 101234, "fullname": "Ziqiao Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/101234?format=json", "institution": "Renmin University of China"}, {"id": 187395, "fullname": "Youliang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187395?format=json", "institution": "Tsinghua University"}, {"id": 155777, "fullname": "Yi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/155777?format=json", "institution": "Tencent"}, {"id": 155471, "fullname": "Yuan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/155471?format=json", "institution": "Microsoft Research"}, {"id": 106459, "fullname": "qinglin lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/106459?format=json", "institution": "Tencent"}, {"id": 86063, "fullname": "Limin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86063?format=json", "institution": "Nanjing University"}], "abstract": "Due to the lack of effective cross-modal modeling, existing open-source audio-video generation methods often exhibit compromised lip synchronization and insufficient semantic consistency. To mitigate these drawbacks, we propose UniAVGen, a unified framework for human-centric joint audio and video generation. UniAVGen is anchored in a dual-branch joint synthesis architecture, incorporating two parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which enables bidirectional, temporally aligned cross-attention, thus ensuring precise spatiotemporal synchronization and semantic consistency. Furthermore, this cross-modal interaction is augmented by a Face-Aware Modulation (FAM) module, which dynamically prioritizes salient regions in the interaction process. To enhance generative fidelity during inference, we additionally introduce Modality-Aware Classifier-Free Guidance (MA-CFG), a novel strategy that explicitly amplifies cross-modal correlation signals. Notably, UniAVGen's robust joint synthesis design enables the seamless unification of pivotal audio-visual tasks within a single model. Furthermore, we demonstrate that joint multi-task training can further boost the performance of joint generation. Comprehensive experiments validate that, with far fewer training samples (1.3M vs. 30.1M), UniAVGen delivers overall advantages in audio-video synchronization, timbre consistency, and emotion consistency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39630", "url": null, "sourceid": 37789, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37721, "uid": "a4fa73f413400524fef474c93faa5e02", "name": "When Pretty Isn\u2019t Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators", "authors": [{"id": 181302, "fullname": "Krzysztof Adamkiewicz", "url": "http://cvpr.thecvf.com/api/miniconf/users/181302?format=json", "institution": "German Research Center for Artificial Intelligence DFKI"}, {"id": 153941, "fullname": "Brian Bernhard Moser", "url": "http://cvpr.thecvf.com/api/miniconf/users/153941?format=json", "institution": "German Research Center for Artificial Intelligence"}, {"id": 153940, "fullname": "Stanislav Frolov", "url": "http://cvpr.thecvf.com/api/miniconf/users/153940?format=json", "institution": "German Research Center for AI"}, {"id": 188084, "fullname": "Tobias Christian Nauen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188084?format=json", "institution": "German Research Center for AI; RPTU Kaiserslautern-Landau"}, {"id": 146467, "fullname": "Federico Raue", "url": "http://cvpr.thecvf.com/api/miniconf/users/146467?format=json", "institution": "German Research Center for AI"}, {"id": 86783, "fullname": "Andreas Dengel", "url": "http://cvpr.thecvf.com/api/miniconf/users/86783?format=json", "institution": "DFKI &amp; RPTU"}], "abstract": "Recent text-to-image (T2I) diffusion models produce visually stunning images and demonstrate excellent prompt following. But do they perform well as synthetic vision data generators? In this work, we revisit the promise of synthetic data as a scalable substitute for real training sets and uncover a surprising performance regression.We generate large-scale synthetic datasets using state-of-the-art T2I models released between 2022 and 2025, train standard classifiers solely on this synthetic data, and evaluate them on real test data.Despite observable advances in visual fidelity and prompt adherence, classification accuracy on real test data consistently declines with newer T2I models as training data generators.Our analysis reveals a hidden trend: These models collapse to a narrow, aesthetic-centric distribution that undermines diversity and label-image alignment. Overall, our findings challenge a growing assumption in vision research, namely that progress in generative realism implies progress in data realism. We thus highlight an urgent need to rethink the capabilities of modern T2I models as reliable training data generators.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37721", "url": null, "sourceid": 39258, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36776, "uid": "352480df93fb022eb1deeb819f255f55", "name": "Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling", "authors": [{"id": 180616, "fullname": "Yuran Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180616?format=json", "institution": "Peking University"}, {"id": 185850, "fullname": "Bohan Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185850?format=json", "institution": "Peking University"}, {"id": 152416, "fullname": "Chengzhuo Tong", "url": "http://cvpr.thecvf.com/api/miniconf/users/152416?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 158283, "fullname": "Wenxuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/158283?format=json", "institution": "Peking University"}, {"id": 185851, "fullname": "Yang Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185851?format=json", "institution": "Peking University"}, {"id": 150297, "fullname": "Xiaochen Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/150297?format=json", "institution": "Sichuan University"}, {"id": 182580, "fullname": "Hao Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182580?format=json", "institution": "Peking University"}, {"id": 168185, "fullname": "Yuanxing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/168185?format=json", "institution": "Kuaishou Technology"}, {"id": 156267, "fullname": "Wentao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156267?format=json", "institution": "Peking University"}], "abstract": "Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to identify and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in realistic and complex visual settings. We propose Scone, a unified understanding-generation framework that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge that conveys semantic information and guides the generation expert to preserve subject identity while reducing inference. A two-stage training scheme first learns composition and then strengthens distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark designed to evaluate composition, distinction, and their combination across diverse scenarios. Experiments show that Scone outperforms existing open-source models in both composition and distinction tasks. Our model, benchmark, and training data will be open-sourced.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36776", "url": null, "sourceid": 42365, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39684, "uid": "81ab41f724391ef12094724fc6d8234f", "name": "VABench: A Comprehensive Benchmark for Audio-Video Generation", "authors": [{"id": 192641, "fullname": "Daili Hua", "url": "http://cvpr.thecvf.com/api/miniconf/users/192641?format=json", "institution": "Peking University"}, {"id": 192642, "fullname": "Xizhi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192642?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 185850, "fullname": "Bohan Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185850?format=json", "institution": "Peking University"}, {"id": 192643, "fullname": "Xinyi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192643?format=json", "institution": "Renmin University of China"}, {"id": 182580, "fullname": "Hao Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182580?format=json", "institution": "Peking University"}, {"id": 146197, "fullname": "Junbo Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/146197?format=json", "institution": "Shanghai AI Lab"}, {"id": 145308, "fullname": "Xinlong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/145308?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 192644, "fullname": "Quanqing Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192644?format=json", "institution": null}, {"id": 156267, "fullname": "Wentao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156267?format=json", "institution": "Peking University"}], "abstract": "Recent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive and multi-dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization, lip-speech consistency, and carefully curated audio and video question-answering (QA) pairs, among others. Furthermore, VABench covers seven major content categories: animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, and virtual worlds.  We provide a systematic analysis and visualization of the evaluation results, aiming to establish a new standard for assessing video generation models with synchronous audio capabilities and to promote the comprehensive advancement of the field.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39684", "url": null, "sourceid": 43489, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38629, "uid": "a7ba6dde28c61d9cec9909bde1d7dca2", "name": "VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models", "authors": [{"id": 167768, "fullname": "Shuhao Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/167768?format=json", "institution": "TUM"}, {"id": 190341, "fullname": "Youqi Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190341?format=json", "institution": "Wuhan University"}, {"id": 146646, "fullname": "Peijie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/146646?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 153208, "fullname": "Wenlong Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153208?format=json", "institution": "COWAROBOT"}, {"id": 190342, "fullname": "Qilin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190342?format=json", "institution": "Technical University of Munich, Karlsruhe Institute of Technology"}, {"id": 74044, "fullname": "Benjamin Busam", "url": "http://cvpr.thecvf.com/api/miniconf/users/74044?format=json", "institution": "Technical University of Munich"}, {"id": 132213, "fullname": "Xieyuanli Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/132213?format=json", "institution": "National University of Defense Technology"}, {"id": 155616, "fullname": "Yun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155616?format=json", "institution": "Nankai University"}], "abstract": "Text-to-point-cloud (T2P) localization aims to infer precise spatial positions within 3D point cloud maps from natural language descriptions, reflecting how humans perceive and communicate spatial layouts through language. However, existing methods largely rely on shallow text-point cloud correspondence without effective spatial reasoning, limiting their accuracy in complex environments. To address this limitation, we propose VLM-Loc, a framework that leverages the spatial reasoning capability of large vision-language models (VLMs) for T2P localization. Specifically, we transform point clouds into bird\u2019s-eye-view (BEV) images and scene graphs that jointly encode geometric and semantic context, providing structured inputs for the VLM to learn cross-modal representations bridging linguistic and spatial semantics. On top of these representations, we introduce a partial node assignment mechanism that explicitly associates textual cues with scene graph nodes, enabling interpretable spatial reasoning for accurate localization. To facilitate systematic evaluation across diverse scenes, we present CityLoc, a benchmark built from multi-source point clouds for fine-grained T2P localization. Experiments on CityLoc demonstrate VLM-Loc achieves superior accuracy and robustness compared to state-of-the-art methods. Our code, model, and dataset will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38629", "url": null, "sourceid": 36843, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36319, "uid": "41c6fd75530da76805a27dc0de3cd5b1", "name": "Mining Instance-Centric Vision\u2013Language Contexts for Human\u2013Object Interaction Detection", "authors": [{"id": 182730, "fullname": "Soo Won Seo", "url": "http://cvpr.thecvf.com/api/miniconf/users/182730?format=json", "institution": "Seoul National University"}, {"id": 182728, "fullname": "Kyungchae Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/182728?format=json", "institution": "Seoul National University"}, {"id": 184764, "fullname": "Hyungchan Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/184764?format=json", "institution": "Seoul National University"}, {"id": 184765, "fullname": "Taein Son", "url": "http://cvpr.thecvf.com/api/miniconf/users/184765?format=json", "institution": "Hanyang University"}, {"id": 77105, "fullname": "Nam Ik Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/77105?format=json", "institution": "Seoul National University"}, {"id": 133243, "fullname": "Jun Won Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/133243?format=json", "institution": "Seoul National University"}], "abstract": "Human\u2013Object Interaction (HOI) detection aims to localize human\u2013object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision\u2013Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance-centric Context Mining Network (InCoM-Net)\u2014a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance-specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM-Net comprises two core components: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues from VLM-derived features and Progressive Context Aggregation (ProCA), which iteratively fuses these multi-context features with instance-level detector features to support high-level HOI reasoning. Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36319", "url": null, "sourceid": 42586, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37976, "uid": "10f95327c7464eda880047e7be289e41", "name": "TaskIT: Memory-Efficient Fine-Tuning of Multi-LoRA LLMs via Cross-Task Importance Transfer", "authors": [{"id": 180392, "fullname": "Cheng Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180392?format=json", "institution": "City University of Hong Kong"}, {"id": 188724, "fullname": "Zimu Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/188724?format=json", "institution": "City University of Hong Kong"}, {"id": 151563, "fullname": "Ke Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/151563?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 156586, "fullname": "Bin Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/156586?format=json", "institution": "Northwestern Polytechnical University"}], "abstract": "On-device AI systems increasingly adopt a single foundation model equipped with task-specific Low-Rank Adaptation (LoRA) modules, forming a multi-LoRA LLM that supports multiple tasks.We study how to adapt such a model to a new task on memory-constrainted devices.Although LoRA reduces trainable parameters, fine-tuning a full set of modules remains memory-intensive.To improve efficiency, we apply sparse updating, training a subset of LoRA modules within the memory budget.However, existing sparse updating methods assume all candidate parameters are instantiated and cannot estimate the importance of modules that do not yet exist, while prior memory models designed for sequential networks fail to capture the blockwise parallel structure of Transformers.We propose TaskIT, a framework for memory-efficient fine-tuning via cross-task importance transfer. TaskIT predicts pre-insertion module importance by transferring from previously tuned tasks and employs a block-based memory predictor that captures parallel and sequential dependencies of Transformer blocks. A dynamic programming scheduler then selects module locations, numbers, and ranks to maximize accuracy within the memory budget.Experiments on uni-modal and cross-modal benchmarks show that TaskIT achieves superior accuracy-memory tradeoffs compared with Zero-FT, non-LoRA, and LoRA-based fine-tuning methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37976", "url": null, "sourceid": 37901, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36383, "uid": "12795bc7a24533b7946cf3584f0e8e1e", "name": "Visual Prototype Conditioned Focal Region Generation for UAV-Based Object Detection", "authors": [{"id": 177178, "fullname": "Wenhao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/177178?format=json", "institution": "Beihang University"}, {"id": 145145, "fullname": "Zimeng Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/145145?format=json", "institution": "Beihang University"}, {"id": 184919, "fullname": "Yu Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184919?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 184920, "fullname": "Zehua Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184920?format=json", "institution": "Hangzhou Innovation Institute, Beihang University"}, {"id": 87568, "fullname": "Jiaxin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87568?format=json", "institution": "Beihang University"}], "abstract": "Unmanned aerial vehicle (UAV) based object detection is a critical but challenging task, when applied in dynamically changing scenarios with limited annotated training data. Layout-to-image generation approaches have proved effective in promoting detection accuracy by synthesizing labeled images based on diffusion models. However, they suffer from frequently producing artifacts, especially near layout boundaries of tiny objects, thus substantially limiting their performance. To address these issues, we propose UAVGen, a novel layout-to-image generation framework tailored for UAV-based object detection. Specifically, UAVGen designs a Visual Prototype Conditioned Diffusion Model (VPC-DM) that constructs representative instances for each class and integrates them into latent embeddings for high-fidelity object generation. Moreover, a Focal Region-Enhanced Data Pipeline (FRE-DP) is introduced to emphasize object-concentrated foreground regions in synthesis, combined with a label refinement to correct missing, extra and misaligned generations. Extensive experimental results demonstrate that our method significantly outperforms state-of-the-art approaches, and consistently promotes accuracy when integrated with distinct detectors. We will release source code upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36383", "url": null, "sourceid": 40039, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39540, "uid": "9804911810c0dd6d53af460e8ef34811", "name": "WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces", "authors": [{"id": 192299, "fullname": "Sicheng Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192299?format=json", "institution": "Fudan University"}, {"id": 192300, "fullname": "Rui Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192300?format=json", "institution": "Shanghai Innovation Institute; Fudan University"}, {"id": 192301, "fullname": "Yifei Leng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192301?format=json", "institution": null}, {"id": 192302, "fullname": "Gaoning Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192302?format=json", "institution": "iMean.AI"}, {"id": 192303, "fullname": "LI LING", "url": "http://cvpr.thecvf.com/api/miniconf/users/192303?format=json", "institution": "Fudan University"}, {"id": 192304, "fullname": "Yanyi Shang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192304?format=json", "institution": "iMeanAI"}, {"id": 179908, "fullname": "Dehan Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/179908?format=json", "institution": "iMean.ai"}], "abstract": "We introduce WebChain, the largest open-source dataset of human-annotated trajectories on real-world websites, designed to accelerate reproducible research in web agents. It contains 31,725 trajectories and 318k steps, featuring a core Triple Alignment of visual, structural, and action data to provide rich, multi-modal supervision. The data is collected via a scalable pipeline that ensures coverage of complex, high-value tasks often missed by synthetic methods. Leveraging this dataset, we propose a Dual Mid-Training recipe that decouples spatial grounding from planning, achieving state-of-the-art performance on our proposed WebChainBench and other public GUI benchmarks. Our work provides the data and insights necessary to build and rigorously evaluate the next generation of scalable web agents.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39540", "url": null, "sourceid": 39005, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36559, "uid": "dec0f433860ff18f2df8be7cac5437ef", "name": "TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Dual-Level Scale-Oriented Contrast", "authors": [{"id": 181794, "fullname": "Beilei Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/181794?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 185337, "fullname": "Yiming Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185337?format=json", "institution": "The Chinese University of Hong Kong &amp; NVIDIA"}, {"id": 185338, "fullname": "Long Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/185338?format=json", "institution": ""}, {"id": 94912, "fullname": "Hongliang Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/94912?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "This work presents a generalizable framework to transfer relative depth to metric depth. Current monocular depth estimation methods are mainly divided into metric depth estimation (MMDE) and relative depth estimation (MRDE). MMDEs estimate depth in metric scale but are often limited to a specific domain. MRDEs generalize well across different domains, but with uncertain scales which hinders downstream applications. To this end, we aim to build up a framework to solve scale uncertainty and transfer relative depth to metric depth. Previous methods used language as input and estimated two factors for conducting rescaling. Our approach, TR2M, utilizes both text description and image as inputs and estimates two rescale maps to transfer relative depth to metric depth at pixel level. Features from two modalities are fused with a cross-modality attention module to better capture scale information. A strategy is designed to construct and filter confident pseudo metric depth for more comprehensive supervision. We also develop dual-level scale-oriented contrastive learning to utilize depth distribution as guidance to enforce the model learning about intrinsic knowledge aligning with the scale distribution. TR2M only exploits a small number of trainable parameters to train on datasets in various domains and experiments not only demonstrate TR2M\u2019s great performance in seen datasets but also reveal superior zero-shot capabilities on five unseen datasets. We show the huge potential in pixel-wise transferring relative depth to metric depth with language assistance instead of large-size metric depth models with large amounts of training data. Code will be public available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36559", "url": null, "sourceid": 34372, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39631, "uid": "667986c93812d47ecc3ad2ae4442b118", "name": "Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation", "authors": [{"id": 192525, "fullname": "Yiwen Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192525?format=json", "institution": "Shanghai AI Lab; Northwest Polytechnical University"}, {"id": 168173, "fullname": "Ziyu Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/168173?format=json", "institution": "Department of Computer Science and Engineering, The Chinese University of Hong Kong"}, {"id": 192526, "fullname": "Kaixin Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192526?format=json", "institution": "Peking University"}, {"id": 91572, "fullname": "Renrui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91572?format=json", "institution": "MMLab of CUHK &amp;amp;amp; Shanghai AI Laboratory"}, {"id": 102054, "fullname": "Qizhi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/102054?format=json", "institution": "Zhejiang University"}, {"id": 183446, "fullname": "Dongzhi Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183446?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 192527, "fullname": "Junli Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192527?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 185850, "fullname": "Bohan Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185850?format=json", "institution": "Peking University"}, {"id": 154177, "fullname": "Haoming Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/154177?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 86716, "fullname": "Delin Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86716?format=json", "institution": "Fudan University"}, {"id": 192528, "fullname": "Tianyi Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/192528?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 88296, "fullname": "Dan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88296?format=json", "institution": "CSE, HKUST"}, {"id": 156267, "fullname": "Wentao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156267?format=json", "institution": "Peking University"}, {"id": 88582, "fullname": "Bin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88582?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}], "abstract": "Reinforcement learning (RL), earlier proven to be effective in large language and multi-modal models, has been successfully extended to enhance 2D image generation recently. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation significantly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D Benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes the global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, expert from coarse shape to texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39631", "url": null, "sourceid": 39773, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39035, "uid": "4798443501a54a147b4098b919f7b44c", "name": "Towards Fine-Grained Attribution: Instance-Aware Preference Optimization for Aligning Diffusion Models", "authors": [{"id": 146913, "fullname": "Jiayang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/146913?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 176008, "fullname": "Pin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176008?format=json", "institution": "Tianjin University"}, {"id": 191219, "fullname": "Hongbo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191219?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 181509, "fullname": "Xinyue Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181509?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 76200, "fullname": "Huaibo Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76200?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 76640, "fullname": "Ran He", "url": "http://cvpr.thecvf.com/api/miniconf/users/76640?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}], "abstract": "Direct Preference Optimization has achieved remarkable success in aligning diffusion models with human feedback. However, existing methods heavily rely on image-level preferences, which suffer from sparse rewards in the spatial dimension. This creates a fundamental misalignment: while an image may be globally preferred, it can contain locally inferior instances. Applying the same positive preference to these areas thus unfairly credits distracting regions while penalizing informative ones, leading to suboptimal performance and inefficient learning. To resolve this issue, we propose IAPO, an Instance-Aware Preference Optimization that introduces instance-level credit assignment to advance alignment from image-level to instance-level. We first construct a high-quality instance-level preference dataset by automatically identifying and relabeling corresponding instances in image pairs using vision-language models and object detection models. Leveraging this fine-grained dataset, we design a novel instance alignment loss with a dynamic reweighting mask that modulates instance-level loss within annotated bounding boxes, suppressing distractors to enforce fine-grained human preference alignment. Extensive experiments demonstrate that our method not only achieves state-of-the-art performance in multiple benchmarks but also attains higher training efficiency due to fine-grained instance-level preference labels.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39035", "url": null, "sourceid": 36501, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39709, "uid": "ff2eddb9f95fc26f5a3d08f1a77e6f82", "name": "AMoE: Agglomerative Mixture-of-Experts Vision Foundation Model", "authors": [{"id": 186204, "fullname": "Sofian Chaybouti", "url": "http://cvpr.thecvf.com/api/miniconf/users/186204?format=json", "institution": "Johann Wolfgang Goethe Universit\u00e4t Frankfurt am Main"}, {"id": 131496, "fullname": "Sanath Narayan", "url": "http://cvpr.thecvf.com/api/miniconf/users/131496?format=json", "institution": "Technology Innovation Institute"}, {"id": 186200, "fullname": "Yasser Dahou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186200?format=json", "institution": "Technology Innovation Institute"}, {"id": 186203, "fullname": "Ph\u00fac H. L\u00ea Kh\u1eafc", "url": "http://cvpr.thecvf.com/api/miniconf/users/186203?format=json", "institution": "Technology Innovation Institute"}, {"id": 156049, "fullname": "Ankit Singh", "url": "http://cvpr.thecvf.com/api/miniconf/users/156049?format=json", "institution": "TII"}, {"id": 186201, "fullname": "Ngoc Huynh", "url": "http://cvpr.thecvf.com/api/miniconf/users/186201?format=json", "institution": "Deakin University"}, {"id": 186202, "fullname": "Wamiq Reyaz Para", "url": "http://cvpr.thecvf.com/api/miniconf/users/186202?format=json", "institution": "TII; KAUST"}, {"id": 150894, "fullname": "Hilde Kuehne", "url": "http://cvpr.thecvf.com/api/miniconf/users/150894?format=json", "institution": "Eberhard-Karls-Universit\u00e4t T\u00fcbingen"}, {"id": 192702, "fullname": "Hakim Hacid", "url": "http://cvpr.thecvf.com/api/miniconf/users/192702?format=json", "institution": "TII"}], "abstract": "Vision foundation models trained via multi-teacher distillation offer a promising path toward unified visual representations, yet the learning dynamics and data efficiency of such approaches remain underexplored. In this paper, we systematically study multi-teacher distillation for vision foundation models and identify key factors that enable training at lower computational cost. We introduce Agglomerative Mixture-of-Experts Vision Foundation Models (AMoE), which distill knowledge from SigLIP2 and DINOv3 simultaneously into a Mixture-of-Experts student. We show that (1) our Asymmetric Relation-Knowledge Distillation loss preserves the geometric properties of each teacher while enabling effective knowledge transfer, (2) token-balanced batching that packs varying-resolution images into sequences with uniform token budgets stabilizes representation learning across resolutions without sacrificing performance, (3) hierarchical clustering and sampling of training data\u2014typically reserved for self-supervised learning\u2014substantially improves sample efficiency over random sampling for multi-teacher distillation, and (4) the resulting representations transfer effectively to early-fusion Grounding-VLMs, outperforming models trained from scratch. By combining these findings, we curate OpenLVD200M, a 200M-image corpus that demonstrates superior efficiency for multi-teacher distillation. Instantiated in a Mixture-of-Experts, our AMoE initializes an early-fusion Grounding-VLM that replaces the conventional ViT\u2192LLM stack, demonstrating improved performance compared to a model trained from scratch.  We release OpenLVD200M and distilled checkpoints.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39709", "url": null, "sourceid": 42715, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38562, "uid": "a826e7e3851f04b576471ad4450abc4d", "name": "UniRain: Unified Image Deraining with RAG-based Dataset Distillation and Multi-objective Reweighted Optimization", "authors": [{"id": 180719, "fullname": "Qianfeng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180719?format=json", "institution": "Dalian Polytechnic University"}, {"id": 190151, "fullname": "Qiyuan Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190151?format=json", "institution": "Dalian Polytechnic University"}, {"id": 90859, "fullname": "Xiang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/90859?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 190152, "fullname": "Jiyu Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190152?format=json", "institution": "Dalian Polytechnic University"}, {"id": 190153, "fullname": "Guiyue Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190153?format=json", "institution": "Dalian Polytechnic University"}, {"id": 89307, "fullname": "Jiangxin Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/89307?format=json", "institution": "Nanjing University of Science and Technology"}], "abstract": "Despite significant progress has been made in image deraining, we note that most existing methods are often developed for only specific types of rain degradation and fail to generalize across diverse real-world rainy scenes. How to effectively model different rain degradations within a universal framework is important for real-world image deraining. In this paper, we propose UniRain, an effective unified image deraining framework capable of restoring images degraded by rain streak and raindrop under both daytime and nighttime conditions. To better enhance unified model generalization, we construct an intelligent retrieval augmented generation (RAG)-based dataset distillation pipeline that selects high-quality training samples from all public deraining datasets for better mixed training. Furthermore, we incorporate a simple yet effective multi-objective reweighted optimization strategy into the asymmetric mixture-of-experts (MoE) architecture to facilitate consistent performance and improve robustness across diverse scenes. Extensive experiments show that our framework performs consistently favorably against the state-of-the-art models on both our proposed benchmarks and multiple public datasets. Code and dataset will be available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38562", "url": null, "sourceid": 42109, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38079, "uid": "37845557f7ff2fbf8eba7589c8529d27", "name": "RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark", "authors": [{"id": 185851, "fullname": "Yang Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185851?format=json", "institution": "Peking University"}, {"id": 153708, "fullname": "Yuhao Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/153708?format=json", "institution": "Nanyang Technological University"}, {"id": 189008, "fullname": "Yue Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/189008?format=json", "institution": "Institute of Automation, University of the Chinese Academy of Sciences"}, {"id": 180616, "fullname": "Yuran Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180616?format=json", "institution": "Peking University"}, {"id": 189009, "fullname": "Xuanyu Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189009?format=json", "institution": "Peking University"}, {"id": 170759, "fullname": "Sheng Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/170759?format=json", "institution": "Hefei University of Technology"}, {"id": 189010, "fullname": "Wenting Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189010?format=json", "institution": "Peking University"}, {"id": 142811, "fullname": "Haochen Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/142811?format=json", "institution": "CASIA, Xiaomi EV"}, {"id": 174459, "fullname": "rundong wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174459?format=json", "institution": "University of Science and Technology of China"}, {"id": 189011, "fullname": "Huanqian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189011?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 89059, "fullname": "Zuyan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89059?format=json", "institution": "Department of Automation, Tsinghua University, Tsinghua University"}, {"id": 185850, "fullname": "Bohan Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185850?format=json", "institution": "Peking University"}, {"id": 185835, "fullname": "Ruizhe Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185835?format=json", "institution": "Zhejiang University"}, {"id": 152519, "fullname": "Qixun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152519?format=json", "institution": "Peking University"}, {"id": 189012, "fullname": "Zhuoran Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189012?format=json", "institution": "Peking University"}, {"id": 145308, "fullname": "Xinlong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/145308?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 152416, "fullname": "Chengzhuo Tong", "url": "http://cvpr.thecvf.com/api/miniconf/users/152416?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 176715, "fullname": "bozhou li", "url": "http://cvpr.thecvf.com/api/miniconf/users/176715?format=json", "institution": "Peking  University"}, {"id": 189013, "fullname": "Qiang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189013?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 189014, "fullname": "Haotian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189014?format=json", "institution": "National University of Defense Technology"}, {"id": 158616, "fullname": "Wenjing Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158616?format=json", "institution": "National University of Defense Technology"}, {"id": 168185, "fullname": "Yuanxing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/168185?format=json", "institution": "Kuaishou Technology"}, {"id": 134947, "fullname": "Pengfei Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/134947?format=json", "institution": "Kuaishou Technology"}, {"id": 188832, "fullname": "YiFan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188832?format=json", "institution": "Institute of automation, Chinese academy of science"}, {"id": 89788, "fullname": "Ziwei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89788?format=json", "institution": "Nanyang Technological University"}], "abstract": "The integration of visual understanding and generation into unified multimodal models represents a significant stride toward general-purpose AI. However, a fundamental question remains unanswered by existing benchmarks: does this architectural unification actually enable synergetic interaction between the constituent capabilities? Existing evaluation paradigms, which primarily assess understanding and generation in isolation, are insufficient for determining whether a unified model can leverage its understanding to enhance its generation, or use generative simulation to facilitate deeper comprehension. To address this critical gap, we introduce RealUnify, a benchmark specifically designed to evaluate bidirectional capability synergy. RealUnify comprises 1,000 meticulously human-annotated instances spanning 10 categories and 32 subtasks. It is structured around two core axes: 1) Understanding Enhances Generation, which requires reasoning (e.g., commonsense, logic) to guide image generation, and 2) Generation Enhances Understanding, which necessitates mental simulation or reconstruction (e.g., of transformed or disordered visual inputs) to solve reasoning tasks. A key contribution is our dual-evaluation protocol, which combines direct end-to-end assessment with a diagnostic stepwise evaluation that decomposes tasks into distinct understanding and generation phases. This protocol allows us to precisely discern whether performance bottlenecks stem from deficiencies in core abilities or from a failure to integrate them. Through large-scale evaluations of 12 leading unified models and 6 specialized baselines, we find that current unified models still struggle to achieve effective synergy, indicating that architectural unification alone is insufficient. These results highlight the need for new training strategies and inductive biases to fully unlock the potential of unified modeling.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38079", "url": null, "sourceid": 38550, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39468, "uid": "c8113305c0428b74ed3f1b6051520039", "name": "Think-Then-Generate: Structural Chain-of-Thought Reasoning for Consistent 3D Generation", "authors": [{"id": 181509, "fullname": "Xinyue Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181509?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 181512, "fullname": "Jin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181512?format=json", "institution": "Shanghaitech University"}, {"id": 191219, "fullname": "Hongbo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191219?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 76640, "fullname": "Ran He", "url": "http://cvpr.thecvf.com/api/miniconf/users/76640?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 76200, "fullname": "Huaibo Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76200?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}], "abstract": "Recently, generating 3D assets using visual priors from pretrained diffusion models has shown remarkable results. However, due to the inherent lack of 3D geometric priors in 2D diffusion, the synthesized results often suffer from spatial hallucination and multi-view inconsistency. To address this limitation, we propose Thoughtful3D, a novel framework that enhances 3D content generation quality by introducing structural chain-of-thought (CoT) reasoning to alleviate inconsistent issues and mitigate hallucinations. Specifically, we design a dual-phase structural CoT strategy: (1) 3DBlueprint-CoT explicitly plans the 3D generation process through textual semantic parsing and logical deduction during the initialization phase. (2) 3DRefine-CoT dynamically evaluates latent inconsistencies by analyzing multiple renderings, employing a multi-round iterative refinement mechanism to suppress hallucinations and enhance cross-view consistency. To further promote consistency across views, we propose a Cross-view Semantic Appearance Alignment strategy that enhances multi-view consistency by establishing dynamic geometric associations between the same features from different viewpoints. Extensive experiments demonstrate that Thoughtful3D significantly improves the quality and consistency of generated 3D assets.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39468", "url": null, "sourceid": 37094, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38829, "uid": "b3dd9704c36f89bb79ee6408a4bea644", "name": "MDCS-MoAME: Multi-directional Composite Scanning with Mixture of Attention and Mamba Experts for Cancer Survival Prediction", "authors": [{"id": 190781, "fullname": "Linjie Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190781?format=json", "institution": "Xiamen University"}, {"id": 190782, "fullname": "Jin Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190782?format=json", "institution": null}, {"id": 190783, "fullname": "Xiangrong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190783?format=json", "institution": "Xiamen University"}, {"id": 85997, "fullname": "Changming Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/85997?format=json", "institution": "CSIRO"}, {"id": 190784, "fullname": "Hui Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/190784?format=json", "institution": "La Trobe University"}, {"id": 190785, "fullname": "Yuqi Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190785?format=json", "institution": "nanjing university"}, {"id": 87585, "fullname": "Ran Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/87585?format=json", "institution": "Tianjin University"}, {"id": 182986, "fullname": "Qiangguo Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/182986?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 185199, "fullname": "leyi wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/185199?format=json", "institution": "Shandong University"}], "abstract": "Multi-modal learning approaches that integrate pathological images with genomic profiles have significantly enhanced the accuracy of survival prediction tasks. However, previous methods often struggle to effectively process long-range gigapixel whole slide images (WSIs) and sparse genomic profiles due to the limitations of conventional scanning strategies to serialize data and the complex and heterogeneous nature of the modalities. Inspired by recent advancements in Mamba and mixture of experts (MoE), we propose a novel multi-directional composite scanning strategy with mixture of attention and Mamba experts (MDCS-MoAME) for cancer survival prediction. Specifically, we introduce a multi-directional composite scanning (MDCS) strategy to both WSIs and genomic profiles, and use the Mamba encoder to process intra-modal representations at the region, patch, and gene level, ensuring sufficient utilization of the intrinsic information within each modality. To further capture heterogeneous inter-modal representations, we introduce mixture of attention and Mamba experts (MoAME), which dynamically selects tailored experts to model complex inter-modal correlations, flexibly focusing on the interactions between modalities. Finally, we introduce alignment constraints to recalibrate inter-modal interactions and reduce intra- and inter-modal representation redundancy, enhancing its discriminative power for comprehensive survival analysis. Experimental results on five publicly available datasets demonstrate that our method outperforms existing approaches, achieving state-of-the-art performance. Our code is included in the supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38829", "url": null, "sourceid": 34306, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40347?format=json"], "related_events_ids": [40347]}, {"id": 36912, "uid": "886551d6661c7e64f03ecdc16f7eae8b", "name": "VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs", "authors": [{"id": 186199, "fullname": "Brigitta Malagurski T\u00f6rtei", "url": "http://cvpr.thecvf.com/api/miniconf/users/186199?format=json", "institution": "Technology Innovation Institute"}, {"id": 186200, "fullname": "Yasser Dahou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186200?format=json", "institution": "Technology Innovation Institute"}, {"id": 186201, "fullname": "Ngoc Huynh", "url": "http://cvpr.thecvf.com/api/miniconf/users/186201?format=json", "institution": "Deakin University"}, {"id": 186202, "fullname": "Wamiq Reyaz Para", "url": "http://cvpr.thecvf.com/api/miniconf/users/186202?format=json", "institution": "TII; KAUST"}, {"id": 186203, "fullname": "Ph\u00fac H. L\u00ea Kh\u1eafc", "url": "http://cvpr.thecvf.com/api/miniconf/users/186203?format=json", "institution": "Technology Innovation Institute"}, {"id": 156049, "fullname": "Ankit Singh", "url": "http://cvpr.thecvf.com/api/miniconf/users/156049?format=json", "institution": "TII"}, {"id": 186204, "fullname": "Sofian Chaybouti", "url": "http://cvpr.thecvf.com/api/miniconf/users/186204?format=json", "institution": "Johann Wolfgang Goethe Universit\u00e4t Frankfurt am Main"}, {"id": 131496, "fullname": "Sanath Narayan", "url": "http://cvpr.thecvf.com/api/miniconf/users/131496?format=json", "institution": "Technology Innovation Institute"}], "abstract": "Vision-Language Models (VLMs) have achieved remarkable progress across tasks such as visual question answering and image captioning. Yet, the extent to which these models perform visual reasoning as opposed to relying on linguistic priors remains unclear. To adress this, we introduce VisRes Bench, a benchmark designed to study visual reasoning in naturalistic settings without contextual language supervision. Analyzing model behavior across three levels of complexity, we uncover clear limitations in perceptual and relational visual reasoning capacities. VisRes isolates distinct reasoning abilities across its levels. Level 1 probes perceptual completion and global image matching under perturbations such as blur, texture changes, occlusion, and rotation; Level 2 tests rule-based inference over a single attribute (e.g., color, count, orientation); and Level 3 targets compositional reasoning that requires integrating multiple visual attributes. Across more than 19,000 controlled task images, we find that state-of-the-art VLMs perform near randomly under subtle perceptual perturbations, revealing limited abstraction beyond pattern recognition. We conclude by discussing how VisRes provides a unified framework for advancing abstract visual reasoning in multimodal research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36912", "url": null, "sourceid": 31344, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40347, "uid": "b3dd9704c36f89bb79ee6408a4bea644", "name": "MDCS-MoAME: Multi-directional Composite Scanning with Mixture of Attention and Mamba Experts for Cancer Survival Prediction", "authors": [{"id": 190781, "fullname": "Linjie Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190781?format=json", "institution": "Xiamen University"}, {"id": 190782, "fullname": "Jin Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190782?format=json", "institution": null}, {"id": 190783, "fullname": "Xiangrong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190783?format=json", "institution": "Xiamen University"}, {"id": 85997, "fullname": "Changming Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/85997?format=json", "institution": "CSIRO"}, {"id": 190784, "fullname": "Hui Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/190784?format=json", "institution": "La Trobe University"}, {"id": 190785, "fullname": "Yuqi Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190785?format=json", "institution": "nanjing university"}, {"id": 87585, "fullname": "Ran Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/87585?format=json", "institution": "Tianjin University"}, {"id": 182986, "fullname": "Qiangguo Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/182986?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 185199, "fullname": "leyi wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/185199?format=json", "institution": "Shandong University"}], "abstract": "Multi-modal learning approaches that integrate pathological images with genomic profiles have significantly enhanced the accuracy of survival prediction tasks. However, previous methods often struggle to effectively process long-range gigapixel whole slide images (WSIs) and sparse genomic profiles due to the limitations of conventional scanning strategies to serialize data and the complex and heterogeneous nature of the modalities. Inspired by recent advancements in Mamba and mixture of experts (MoE), we propose a novel multi-directional composite scanning strategy with mixture of attention and Mamba experts (MDCS-MoAME) for cancer survival prediction. Specifically, we introduce a multi-directional composite scanning (MDCS) strategy to both WSIs and genomic profiles, and use the Mamba encoder to process intra-modal representations at the region, patch, and gene level, ensuring sufficient utilization of the intrinsic information within each modality. To further capture heterogeneous inter-modal representations, we introduce mixture of attention and Mamba experts (MoAME), which dynamically selects tailored experts to model complex inter-modal correlations, flexibly focusing on the interactions between modalities. Finally, we introduce alignment constraints to recalibrate inter-modal interactions and reduce intra- and inter-modal representation redundancy, enhancing its discriminative power for comprehensive survival analysis. Experimental results on five publicly available datasets demonstrate that our method outperforms existing approaches, achieving state-of-the-art performance. Our code is included in the supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40347", "url": null, "sourceid": -34306, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38829?format=json"], "related_events_ids": [38829]}, {"id": 39212, "uid": "2d533c9bac8862a184b2ad4374a9090f", "name": "Human-like Abstract Visual Reasoning via Understanding and Solving Reasoning Loop", "authors": [{"id": 191596, "fullname": "Xinwang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191596?format=json", "institution": "Beijing Institute of Technology"}, {"id": 189987, "fullname": "Xiuxing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189987?format=json", "institution": "Beijing Institute of Technology"}, {"id": 183716, "fullname": "Qing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183716?format=json", "institution": "Beijing Institute of Technology"}, {"id": 191597, "fullname": "Ziyue Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191597?format=json", "institution": "Beijing Institute of Technology"}, {"id": 191598, "fullname": "Yutong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191598?format=json", "institution": "Beijing Institute of Technology"}, {"id": 189986, "fullname": "Ziyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189986?format=json", "institution": "Beijing Institute of Technology"}, {"id": 189988, "fullname": "Zhuo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189988?format=json", "institution": "Beijing Institute of Technology"}, {"id": 191599, "fullname": "Kai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191599?format=json", "institution": "Huawei Noah&#x27;s Ark Lab"}, {"id": 84767, "fullname": "Jianye Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84767?format=json", "institution": "Tianjin University"}, {"id": 189990, "fullname": "Xia Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189990?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "Abstract visual reasoning benchmarks such as ARC-AGI evaluate the ability to infer generalizable transformation rules from few graphical demonstrations\u2014a capability where current deep learning models severely underperform. Mainstream large language models achieve only 15.8% (DeepSeek-R1) and 34.5% (o3-mini-high) test accuracy. The core reason lies in their static processing of task examples: unlike humans, who iteratively refine their understanding of examples while constructing solutions, these models lack mechanisms for dynamically aligning understanding and solving. We address this gap with the Understanding and Solving Reasoning Loop (USRL) framework. The architecture comprises two explicitly interacting modules: an Understanding Module (UM) that encodes and refines rule representations of examples, and a Solving Module (SM) that generates a draft solution informed by these evolving contexts. Through recurrent interaction, the model iteratively aligns its draft solution with its understanding about task examples continuously. Furthermore, we introduce an adaptive reasoning halting mechanism that autonomously terminates the reasoning loop based on the consistency between the generated draft solution and the rule representations. With only 7M parameters, our model achieves 47.2% accuracy on ARC-AGI-1, significantly outperforming both DeepSeek-R1 and o3-mini-high. This reveals that neurocognitive principles offer an effective pathway for abstract reasoning, with implications extending to compositional generalization and structured problem-solving.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39212", "url": null, "sourceid": 31407, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37688, "uid": "9f11e692a2a53a8382be86ee9713763c", "name": "A Provable Energy-Guided Test-Time Defense Boosting Adversarial Robustness of Large Vision-Language Models", "authors": [{"id": 180003, "fullname": "Mujtaba Hussain Mirza", "url": "http://cvpr.thecvf.com/api/miniconf/users/180003?format=json", "institution": "Sapienza University of Rome"}, {"id": 188016, "fullname": "Antonio D Orazio", "url": "http://cvpr.thecvf.com/api/miniconf/users/188016?format=json", "institution": "University of Roma &quot;La Sapienza&quot;"}, {"id": 188017, "fullname": "Odelia Melamed", "url": "http://cvpr.thecvf.com/api/miniconf/users/188017?format=json", "institution": "Weizmann Institute of Science"}, {"id": 85566, "fullname": "Iacopo Masi", "url": "http://cvpr.thecvf.com/api/miniconf/users/85566?format=json", "institution": "Sapienza, University of Rome"}], "abstract": "Despite the rapid progress in multimodal models and Large Visual-Language Models (LVLM), they remain highly susceptible to adversarial perturbations, raising serious concerns about their reliability in real-world use. While adversarial training has become the leading paradigm for building models that are robust to adversarial attacks, Test-Time Transformations (TTT) have emerged as a promising strategy to boost robustness at inference.In light of this, we propose **Energy-Guided Test-Time Transformation (ET3)**, a lightweight, training-free defense that enhances the robustness by minimizing the energy of the input samples.Our method is grounded in a theory that proves our transformation succeeds in classification under reasonable assumptions. We present extensive experiments demonstrating that ET3 provides a strong defense for classifiers, zero-shot classification with CLIP, and also for boosting the robustness of LVLMs in tasks such as Image Captioning and Visual Question Answering. Code will be released upon acceptance of the paper.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37688", "url": null, "sourceid": 40689, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37967, "uid": "5bf73f0ab50f712f61880e1254f1b723", "name": "Spectral Conformal Risk Control: Distribution-Free Tail Guarantees via Bayesian Quadrature", "authors": [{"id": 95487, "fullname": "Mohammad Esfeh Esfeh", "url": "http://cvpr.thecvf.com/api/miniconf/users/95487?format=json", "institution": "University of British Columbia"}, {"id": 94196, "fullname": "Qi Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/94196?format=json", "institution": "University of British Columbia"}, {"id": 188704, "fullname": "Yongxing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188704?format=json", "institution": "University of British Columbia"}, {"id": 188705, "fullname": "Zahra Gholami", "url": "http://cvpr.thecvf.com/api/miniconf/users/188705?format=json", "institution": null}, {"id": 153127, "fullname": "Renjie Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153127?format=json", "institution": "The University of British Columbia, Vector Institute"}, {"id": 188706, "fullname": "Purang Abolmaesumi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188706?format=json", "institution": "University of British Columbia"}], "abstract": "Modern vision systems are deployed in settings where occasional catastrophic failures matter more than average accuracy\u2014for example in medical imaging, autonomous driving, and safety monitoring. While conformal prediction gives distribution-free uncertainty guarantees, most existing methods only control mean error and are hard to tune toward rare but high-cost mistakes. We propose Bayesian-Quadrature Spectral Risk Control (BQ-SRC), a general framework for controlling tail-focused risks (such as conditional value at risk (CVaR)-style objectives) in a distribution-free way. BQ-SRC views conformal prediction through a Bayesian-quadrature lens and replaces mean-risk control with a flexible family of risk-averse criteria, while keeping the same black-box access to a trained model. A binomial testing scheme reduces the Monte Carlo conservatism of prior approaches, leading to tighter sets without sacrificing guarantees. We evaluate BQ-SRC across diverse vision tasks, including synthetic regression, closed-set and zero-shot image classification, multilabel classification, and semantic segmentation. Across these settings, BQ-SRC consistently maintains finite-sample risk guarantees and often yields smaller or otherwise more informative prediction sets than existing conformal and risk-controlling baselines, sometimes trading a modest amount of efficiency for stronger tail-risk control. We will make our implementation publicly available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37967", "url": null, "sourceid": 32230, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39454, "uid": "d4b42304f98e8c0760723c747006a5a4", "name": "Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence", "authors": [{"id": 180100, "fullname": "Kun Ouyang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180100?format=json", "institution": "Peking University"}, {"id": 192104, "fullname": "Yuanxin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192104?format=json", "institution": "Peking University"}, {"id": 128083, "fullname": "Linli Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/128083?format=json", "institution": "Peking University"}, {"id": 138833, "fullname": "Yishuo Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/138833?format=json", "institution": "Peking University"}, {"id": 192105, "fullname": "Hao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192105?format=json", "institution": "Tencent"}, {"id": 192106, "fullname": "Fandong Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192106?format=json", "institution": "WeChat AI, Tencent Inc."}, {"id": 149440, "fullname": "Jie Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/149440?format=json", "institution": "Tencent Inc"}, {"id": 128142, "fullname": "Xu Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/128142?format=json", "institution": "Peking University"}], "abstract": "Video reasoning, which requires multi-step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)-based methods enhance reasoning capabilities, they often rely on text-only chains that yield ungrounded or hallucinated conclusions. Conversely, frame-retrieval approaches introduce visual grounding, yet still struggle with inaccurate evidence localization. To address these limitations, we present Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies context and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. To achieve this, we **1)** construct Conan-91K, a large-scale dataset of automatically generated reasoning traces that include frame identification, evidence reasoning, and action decision, and **2)** design a multi-stage progressive cold-start strategy combined with an Identification-Reasoning-Action (AIR) RLVR training framework to progressively incentivize multi-step visual reasoning. Extensive experiments on six multi-step reasoning benchmarks demonstrate that Conan surpasses the baseline Qwen2.5-VL-7B-Instruct by an average of over 10\\% in accuracy, achieving state-of-the-art performance. Furthermore, Conan generalizes effectively to long video understanding tasks, validating its strong scalability and robustness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39454", "url": null, "sourceid": 33820, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36183, "uid": "efbd774fdce204077f64ad083665abb0", "name": "Structure-Aware Representation Distillation for Tiny-Dense Object Segmentation", "authors": [{"id": 165037, "fullname": "Xuesong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/165037?format=json", "institution": "Rochester Institute of Technology"}, {"id": 184356, "fullname": "Anke Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184356?format=json", "institution": "Facebook"}, {"id": 184357, "fullname": "Wenbo Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184357?format=json", "institution": "Monroe University"}, {"id": 184358, "fullname": "Emmett Ientilucci", "url": "http://cvpr.thecvf.com/api/miniconf/users/184358?format=json", "institution": "Rochester Institute of Technology"}], "abstract": "Dense scenes containing numerous tiny objects pose a fundamental challenge for segmentation models, where small localization errors can significantly degrade downstream measurements. We present Structure-Aware Representation Distillation (SARD), a teacher-compatible framework that transfers structural knowledge from a large teacher to a compact student via feature-space alignment rather than mask imitation. SARD constructs a structure-importance map that combines boundary salience, local density, and teacher confidence, and uses it to weight a unified representation loss integrating feature consistency, distribution alignment, and structural contrast. This encourages the student to allocate capacity to geometrically informative regions while preserving global context. Experiments on Cityscapes, ADE20K, and a challenging rock fragmentation benchmark (RockFrag) show that SARD consistently improves both mIoU and boundary IoU over strong distillation baselines; on RockFrag, SARD improves a Swin-T student over CWD by +4.3 mIoU and +6.7 bIoU. A ResNet-50 student distilled from a Swin-L teacher achieves up to 7.7$\\times$ parameter reduction and 9$\\times$ higher throughput than the teacher, with no additional inference overhead beyond the student network, demonstrating that structure-aware representation distillation is effective and efficient for tiny-dense segmentation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36183", "url": null, "sourceid": 42150, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39507, "uid": "3fa8502ffeb11e407c06cee72b58398f", "name": "Unified Video Editing as Temporal Reasoner", "authors": [{"id": 101961, "fullname": "xiangpeng yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/101961?format=json", "institution": "University of Techolodgy Sydney"}, {"id": 192226, "fullname": "Ji Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/192226?format=json", "institution": "University of California, Berkeley; Zhejiang University"}, {"id": 192227, "fullname": "Yiyuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192227?format=json", "institution": "University of Technology Sydney"}, {"id": 127301, "fullname": "Yue Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/127301?format=json", "institution": "The Hong Kong University of Science and Technology, Hong Kong"}, {"id": 133029, "fullname": "Yan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/133029?format=json", "institution": "University of Technology Sydney"}, {"id": 86542, "fullname": "Min Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86542?format=json", "institution": "University of Technology Sydney"}, {"id": 90502, "fullname": "Qiang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90502?format=json", "institution": "University of Technology Sydney"}], "abstract": "Existing video editing methods face a critical trade-off: expert models offer precision but rely on task-specific priors like masks, hindering unification; conversely, unified temporal in-context learning models are mask-free but lack explicit spatial cues, leading to weak instruction-to-region mapping and imprecise localization. To resolve this conflict, we propose VideoCoF, a novel Chain-of-Frames approach inspired by Chain-of-Thought reasoning. VideoCoF enforces a \"seeing, reasoning, then editing\" procedure by compelling the video diffusion model to first predict reasoning tokens (edit-region latents) before generating the target video tokens. This explicit reasoning step removes the need for user-provided masks while achieving precise instruction-to-region alignment and fine-grained video editing. Furthermore, we introduce a RoPE alignment strategy that leverages these reasoning tokens to ensure motion alignment and enable length extrapolation beyond the training duration. We demonstrate that with a minimal data cost of only 50k video pairs, VideoCoF achieves state-of-the-art performance on VideoCoF-Bench, validating the efficiency and effectiveness of our approach.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39507", "url": null, "sourceid": 35506, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40158, "uid": "3dab936961a08882a5a63319a046c6fe", "name": "MOGeo: Beyond One-to-One Cross-View Object Geo-localization", "authors": [{"id": 193663, "fullname": "Lv Bo", "url": "http://cvpr.thecvf.com/api/miniconf/users/193663?format=json", "institution": "Shenzhen University"}, {"id": 193664, "fullname": "Qingwang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193664?format=json", "institution": "Fudan University"}, {"id": 193665, "fullname": "Le Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193665?format=json", "institution": "Shenzhen University"}, {"id": 193666, "fullname": "Yuanyuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193666?format=json", "institution": null}, {"id": 182970, "fullname": "YINGYING ZHU", "url": "http://cvpr.thecvf.com/api/miniconf/users/182970?format=json", "institution": "Shenzhen University"}], "abstract": "Cross-View Object Geo-Localization (CVOGL) aims to locate an object of interest in a query image within a corresponding satellite image. Existing methods typically assume that the query image contains only a single object, which does not align with the complex, multi-object geo-localization requirements in real-world applications, making them unsuitable for practical scenarios. To bridge the gap between the realistc setting and existing task, we propose a new task, called Cross-View Multi-Object Geo-Localization (CVMOGL). To advance CVMOGL task, we first construct a benchmark, CMLocation, which includes two datasets: CMLocation-V1 and CMLocation-V2. Furthermore, we propose a novel cross view multi-object geo-localization method, MOGeo, and benchmark it against existing state-of-the-art methods.Extensive experiments are conducted under various application scenarios to validate the effectiveness of our method. The results demonstrate that cross-view geo-localization in the more realistic setting remains a challenging problem, encouraging further research in this area.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40158", "url": null, "sourceid": 35325, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39387, "uid": "8e90864815242bf798fb87d2f66a7e94", "name": "SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving", "authors": [{"id": 135626, "fullname": "Peizheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/135626?format=json", "institution": "Mercedes-Benz AG"}, {"id": 175542, "fullname": "Zhenghao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/175542?format=json", "institution": "Mercedes-Benz AG"}, {"id": 191972, "fullname": "David Holtz", "url": "http://cvpr.thecvf.com/api/miniconf/users/191972?format=json", "institution": "Mercedes-Benz"}, {"id": 191973, "fullname": "Hang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191973?format=json", "institution": "Karlsruhe Institute of Technology; Mercedes-Benz AG"}, {"id": 190058, "fullname": "Yutong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190058?format=json", "institution": "Mercedes-Benz AG; Universit\u00e4t Stuttgart"}, {"id": 191974, "fullname": "Yuzhi Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/191974?format=json", "institution": "Eberhard-Karls-Universit\u00e4t T\u00fcbingen"}, {"id": 179571, "fullname": "Rui Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/179571?format=json", "institution": "University of California, Los Angeles; University of Cambridge"}, {"id": 69174, "fullname": "Andreas Geiger", "url": "http://cvpr.thecvf.com/api/miniconf/users/69174?format=json", "institution": "University of T\u00fcbingen"}, {"id": 166446, "fullname": "Andreas Zell", "url": "http://cvpr.thecvf.com/api/miniconf/users/166446?format=json", "institution": "Cognitive systems - University Tuebingen"}], "abstract": "End-to-end autonomous driving methods built on vision language models (VLMs) have undergone rapid development driven by their universal visual understanding and strong reasoning capabilities obtained from the large-scale pretraining. However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships which is a fundamental requirement for systems interacting with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) instead of textual digit tokens, enabling joint reasoning over semantic and spatial representations. SpaceDrive employs a universal positional encoder to all 3D coordinates derived from multi-view depth estimation, historical ego-states, and text prompts. These 3D PEs are first superimposed to augment the corresponding 2D visual tokens. Meanwhile, they serve as a task-agnostic coordinate representation, replacing the digit-wise numerical tokens as both inputs and outputs for the VLM. This mechanism enables the model to better index specific visual semantics in spatial reasoning and directly regress trajectory coordinates rather than generating digit-by-digit, thereby enhancing planning accuracy. Extensive experiments validate that SpaceDrive achieves state-of-the-art open-loop performance on the nuScenes dataset and the second-best Driving Score of 78.02 on the Bench2Drive closed-loop benchmark over existing VLM-based methods. Code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39387", "url": "https://zhenghao2519.github.io/SpaceDrive_Page/", "sourceid": 44908, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37415, "uid": "ee764b406ebc906642809b8429a1cf33", "name": "Unlearning without Forgetting: Securely Removing Targeted Concepts from Large-Scale Vision-Language Open-Vocabulary Detectors", "authors": [{"id": 187389, "fullname": "Zhongze Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187389?format=json", "institution": "Central South University"}, {"id": 156647, "fullname": "Xiu Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/156647?format=json", "institution": "Central South University"}, {"id": 187390, "fullname": "Feng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187390?format=json", "institution": "Southeast University"}, {"id": 149607, "fullname": "Dan Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/149607?format=json", "institution": "School of Automation, Southeast University"}, {"id": 187391, "fullname": "Shan You", "url": "http://cvpr.thecvf.com/api/miniconf/users/187391?format=json", "institution": "SenseTime Research"}, {"id": 187392, "fullname": "Yueyi Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187392?format=json", "institution": "Central South University"}, {"id": 187393, "fullname": "Jun Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/187393?format=json", "institution": "Central South University"}], "abstract": "Open-vocabulary detectors (OvOD) inherit tightly coupled cross-modal knowledge from web-scale pretraining, creating privacy, copyright, and compliance risks. Existing machine unlearning methods face \\emph{geometric entanglement interference} in OvOD: forgetting updates inevitably distort preserved knowledge due to shared semantic factors in decomposable embeddings. We introduce \\textbf{SafeDetect}, a geometrically constrained unlearning framework that constructs a null-space from preserved knowledge embeddings offline, then constrains parameter updates to this orthogonal complement, mathematically preventing interference with retained concepts. Forgetting is achieved through a one-step mean-flow objective that drives forgotten concepts toward non-detectable, while multimodal decoupling prevents cross-modal recovery.  We establish UOD-Bench, the first unified benchmark for OvOD unlearning, featuring 14.7K images with 67.3K region-phrase pairs across three tasks. Extensive experiments across UOD-Bench and standard benchmarks with diverse architectures (\\textit{e.g.}, GroundingDINO, LLM-Det) demonstrate that SafeDetect achieves superior forgetting efficacy (64.75\\% improvement over NPO) while maintaining stable retention performance and significantly better zero-shot generalization, with 1.5$\\times$ faster convergence than iterative methods. Code and benchmark will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37415", "url": null, "sourceid": 43570, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39971, "uid": "714bb9ae1e0f98eab9dff4c8edaeb6f8", "name": "BriMA: Bridged Modality Adaptation for Multi-Modal Continual Action Quality Assessment", "authors": [{"id": 189302, "fullname": "Kanglei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/189302?format=json", "institution": "Tsinghua University"}, {"id": 189303, "fullname": "Chang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189303?format=json", "institution": "Tsinghua University; ByteDance Inc.; School of Gifted Young, USTC"}, {"id": 193219, "fullname": "Qingyi Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193219?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 101938, "fullname": "Liyuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/101938?format=json", "institution": "Tsinghua University"}], "abstract": "Action Quality Assessment (AQA) aims to score how well an action is performed and is widely used in sports analysis, rehabilitation assessment, and human skill evaluation. Multi-modal AQA has recently achieved strong progress by leveraging complementary visual and kinematic cues, yet real-world deployments often suffer from non-stationary modality imbalance, where certain modalities become missing or intermittently available due to sensor failures or annotation gaps. Existing continual AQA methods overlook this issue and assume that all modalities remain complete and stable throughout training, which restricts their practicality. To address this challenge, we introduce **Bri**dged **M**odality **A**daptation (BriMA), an innovative approach to multi-modal continual AQA under modality-missing conditions. BriMA consists of a memory-guided bridging imputation module that reconstructs missing modalities using both task-agnostic and task-specific representations, and a modality-aware replay mechanism that prioritizes informative samples based on modality distortion and distribution drift.  Experiments on three representative multi-modal AQA datasets (RG, Fis-V, and FS1000) show that BriMA consistently improves performance under different modality-missing conditions, achieving 6--8\\% higher correlation and 12--15\\% lower error on average.  These results demonstrate a step toward robust multi-modal AQA systems under real-world deployment constraints.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39971", "url": null, "sourceid": 36434, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39428, "uid": "b271042deee6c05270d5acb12775e928", "name": "Dynamic Logits Adjustment and Exploration for Test-Time Adaptation in Vision Language Models", "authors": [{"id": 192056, "fullname": "Haoyan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192056?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 88473, "fullname": "Yahao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88473?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 76644, "fullname": "Yinjie Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/76644?format=json", "institution": "Sichuan University"}, {"id": 88453, "fullname": "Lixin Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88453?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 88470, "fullname": "Wen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88470?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Existing Test-Time Adaptation (TTA) methods for Vision-Language Models (VLMs), focusing on designing efficient adaptation parameters (eg. prompts or residual prototypes), predominantly rely on high-confidence samples obtained via entropy-based filtering. However, this prevailing paradigm implicitly inherits the VLM\u2019s class-wise prediction biases and leads to insufficient coverage of the test distribution, rendering the adaptation process biased and insufficiently exploratory.To overcome these limitations, we propose Dynamic Logits Adjustment and Exploration (DLAE), a novel framework that integrates Dynamic Logit Adjustment (DLA) with a Consistency-Guided Exploratory Cache (CGEC). DLA dynamically recalibrates model logits based on test prediction statistics, thereby mitigating class-wise prediction inconsistencies. Different from traditional cache mechanisms, our CGEC actively identifies additional samples near decision boundaries whose predicted labels are sensitive to the logit adjustment, thereby exploring beyond only high-confidence samples. By enforcing semantic and temporal consistency, the cache preserves the reliability of selected samples while enabling cautious yet effective exploration of low-confidence regions, ultimately yielding stable and reliable adaptation.Extensive experiments across multiple vision-language benchmarks demonstrate that our approach consistently surpasses state-of-the-art TTA methods, showing superior stability, adaptability, and generalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39428", "url": null, "sourceid": 36128, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37206, "uid": "ae128ac418fcc6ef161cb98a152030eb", "name": "LLM-Guided Probabilistic Fusion for Label-Efficient Document Layout Analysis", "authors": [{"id": 186919, "fullname": "Ibne Farabi Shihab", "url": "http://cvpr.thecvf.com/api/miniconf/users/186919?format=json", "institution": "Amazon"}, {"id": 186920, "fullname": "Sanjeda Akter", "url": "http://cvpr.thecvf.com/api/miniconf/users/186920?format=json", "institution": "Iowa State University"}, {"id": 75497, "fullname": "Anuj Sharma", "url": "http://cvpr.thecvf.com/api/miniconf/users/75497?format=json", "institution": "Iowa State University"}], "abstract": "Document layout understanding remains data-intensive despite advances in semi-supervised learning. We present a framework that enhances semi-supervised detection by fusing visual predictions with structural priors from text-pretrained LLMs via principled probabilistic weighting. Given unlabeled documents, an OCR-LLM pipeline infers hierarchical regions which are combined with teacher detector outputs through inverse-variance fusion to generate refined pseudo-labels.Our method demonstrates consistent gains across model scales. With a lightweight SwiftFormer backbone (26M params), we achieve 88.2$\\pm$0.3 AP using only 5\\% labels on PubLayNet. When applied to document-pretrained LayoutLMv3 (133M params), our fusion framework reaches 89.7$\\pm$0.4 AP, surpassing both LayoutLMv3 with standard semi-supervised learning (89.1$\\pm$0.4 AP, p=0.02) and matching UDOP~\\cite{udop} (89.8 AP) which requires 100M+ pages of multimodal pretraining. This demonstrates that LLM structural priors are complementary to both lightweight and pretrained architectures. Key findings include: (1) learned instance-adaptive gating improves over fixed weights by +0.9 AP with data-dependent PAC bounds correctly predicting convergence; (2) open-source LLMs enable privacy-preserving deployment with minimal loss (Llama-3-70B: 87.1 AP lightweight, 89.4 AP with LayoutLMv3); (3) LLMs provide targeted semantic disambiguation (18.7\\% of cases, +3.8 AP gain) beyond simple text heuristics.Total system cost includes \\$12 for GPT-4o-mini API or 17 GPU-hours for local Llama-3-70B per 50K pages, amortized across training runs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37206", "url": null, "sourceid": 44161, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38324, "uid": "ac5b6ea5394bc6a1a9a71e0552236c72", "name": "DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation", "authors": [{"id": 88116, "fullname": "Tuan Duc Ngo", "url": "http://cvpr.thecvf.com/api/miniconf/users/88116?format=json", "institution": "UMass Amherst"}, {"id": 91712, "fullname": "Gabriel Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91712?format=json", "institution": "Adobe Research"}, {"id": 87532, "fullname": "Seoung Wug Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/87532?format=json", "institution": "Adobe Systems"}, {"id": 87942, "fullname": "Kevin Blackburn-Matzen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87942?format=json", "institution": "Adobe Systems"}, {"id": 189610, "fullname": "Evangelos Kalogerakis", "url": "http://cvpr.thecvf.com/api/miniconf/users/189610?format=json", "institution": "TU Crete, UMass Amherst"}, {"id": 75997, "fullname": "Chuang Gan", "url": "http://cvpr.thecvf.com/api/miniconf/users/75997?format=json", "institution": "UMass Amherst/MIT-IBM Watson AI Lab"}, {"id": 180187, "fullname": "Joon-Young Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/180187?format=json", "institution": "Adobe Research"}], "abstract": "Estimating accurate, view-consistent geometry and camera poses from uncalibrated multi-view/video inputs remains challenging\u2014especially at high spatial resolutions and over long sequences. We present DAGE, a dual-stream transformer whose main novelty is to disentangle global coherence from fine detail. A low-resolution stream operates on aggressively downsampled frames with alternating frame/global attention to build a view-consistent representation and estimate cameras efficiently, while a high-resolution stream processes the original images per-frame to preserve sharp boundaries and small structures. A lightweight adapter fuses these streams via cross-attention, injecting global context without disturbing the pretrained single-frame pathway. This design scales resolution and clip length independently, supports inputs up to 2K, and maintains practical inference cost. DAGE delivers sharp depth/pointmaps, strong cross-view consistency, and accurate poses, establishing new state-of-the-art results for video geometry estimation and multi-view reconstruction.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38324", "url": null, "sourceid": 33737, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38204, "uid": "4b787bf893e1d9519259a265d9fc607c", "name": "Geometry-aware Cross-modal Graph Alignment for Referring Segmentation in 3D Gaussian Splatting", "authors": [{"id": 189301, "fullname": "Yuwen Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189301?format=json", "institution": "East China Normal University; Tsinghua University"}, {"id": 189302, "fullname": "Kanglei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/189302?format=json", "institution": "Tsinghua University"}, {"id": 189303, "fullname": "Chang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189303?format=json", "institution": "Tsinghua University; ByteDance Inc.; School of Gifted Young, USTC"}, {"id": 101938, "fullname": "Liyuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/101938?format=json", "institution": "Tsinghua University"}], "abstract": "Referring 3D segmentation seeks to localize and segment target objects in a 3D scene given a natural-language query, requiring joint reasoning over geometric structures and linguistic cues. Although recent progress using 3D Gaussian Splatting (3DGS) has improved rendering quality, existing methods still struggle to spatially ground textual references due to two fundamental limitations: (1) language encoders provide no explicit positional priors, weakening geometric relation modeling; and (2) cross-modal attention is self-reinforcing, causing spatial errors to propagate through the Gaussian field once misalignment occurs. To address this, we propose GeoCGA, a geometry-aware cross-modal graph alignment framework that bridges linguistic semantics with the 3DGS representation. GeoCGA introduces position-aware prompt expansion to build a semantic-spatial graph capturing relational structure in text, and constructs a Gaussian-based geometric graph encoding 3D topology. A cross-modal alignment module enforces geometric consistency between the two graphs, enabling stable and spatially grounded correspondence across views. GeoCGA consistently outperforms prior state-of-the-art methods, yielding mIoU improvements of 28.8\\% on Ref-LERF, 2.6\\% on LERF-OVS, and 3.1\\% on 3D-OVS. These results point to an incremental advance toward more stable and spatially consistent 3D language grounding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38204", "url": null, "sourceid": 41361, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38898, "uid": "9f9efae1ea901d57708f8066de0a8951", "name": "VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents", "authors": [{"id": 133195, "fullname": "George Eskandar", "url": "http://cvpr.thecvf.com/api/miniconf/users/133195?format=json", "institution": "Universit\u00e4t Stuttgart"}, {"id": 89671, "fullname": "Fengyi Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89671?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}, {"id": 163834, "fullname": "Mohammad Altillawi", "url": "http://cvpr.thecvf.com/api/miniconf/users/163834?format=json", "institution": "Autonomous University of Barcelona"}, {"id": 190929, "fullname": "Dong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190929?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 190930, "fullname": "Yang Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/190930?format=json", "institution": "Ludwig-Maximilians-Universit\u00e4t M\u00fcnchen"}, {"id": 190931, "fullname": "Liudi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190931?format=json", "institution": "Albert-Ludwigs-Universit\u00e4t Freiburg"}, {"id": 91461, "fullname": "Ziyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/91461?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}], "abstract": "Recent progress in video-to-video (V2V) translation has enabled realistic resimulation of embodied AI demonstrations, a capability that allows pretrained robot policies to be transferable to new environments without additional data collection. However, prior works can only operate on a single view at a time, while embodied AI tasks are commonly captured from multiple synchronized cameras to support policy learning. Naively applying single-view models independently to each camera leads to inconsistent appearance across views, and standard transformer architectures do not scale to multi-view settings due to the quadratic cost of cross-view attention. We present VideoWeaver, the first multimodal multi-view V2V translation framework. VideoWeaver is initially trained as a single-view flow-based V2V model. To achieve an extension to the multi-view regime, we propose to ground all views in a shared 4D latent space derived from a feed-forward spatial foundation model, namely, Pi3. This encourages view-consistent appearance even under wide baselines and dynamic camera motion. To scale beyond a fixed number of cameras, we train views at distinct diffusion timesteps, enabling the model to learn both joint and conditional view distributions. This in turn allows autoregressive synthesis of new viewpoints conditioned on existing ones. Experiments show superior or similar performance to the state-of-the-art on the single-view translation benchmarks and, for the first time, physically and stylistically consistent multi-view translations, including challenging egocentric and heterogeneous-camera setups central to world randomization for robot learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38898", "url": null, "sourceid": 39612, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38938, "uid": "0a0f319f4d1b7acc7a879809f0f1063f", "name": "Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image", "authors": [{"id": 191011, "fullname": "Zidian Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191011?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 86514, "fullname": "Ancong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86514?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Current compositional image-to-3D scene generation approaches construct 3D scenes by time-consuming iterative layout optimization or inflexible joint object-layout generation. Moreover, most methods rely on limited field-of-view perspective images, hindering the creation of complete $360^\\circ$ environments. To address these limitations, we design $\\textbf{Pano3DComposer}$, an efficient feed-forward framework for panoramic images. To decouple object generation from layout estimation, we propose a plug-and-play Object-World Transformation Predictor. This module converts the 3D objects generated by off-the-shelf image-to-3D models from local to world coordinates. To achieve this, we adapt the VGGT architecture to $\\textbf{Alignment-VGGT}$ by using target object crop, multi-view object renderings and camera parameters to predict the transformation. The predictor is trained using pseudo-geometric supervision to address the shape discrepancy between generated and ground-truth objects. For input images from unseen domains, we further introduce a Coarse-to-Fine (C2F) alignment mechanism for Pano3DComposer that iteratively refines geometric consistency with feedback of scene rendering. Our method achieves superior geometric accuracy for image/text-to-3D tasks on synthetic and real-world datasets. It can generate a high-fidelity 3D scene in approximately 20 seconds on an RTX 4090 GPU. The code will be released if accepted.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38938", "url": null, "sourceid": 44599, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36278, "uid": "274717a0a7b3e03bd51ea93f16ba9ff0", "name": "PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On", "authors": [{"id": 183573, "fullname": "Haohua Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/183573?format=json", "institution": "Xiaohongshu Inc."}, {"id": 180476, "fullname": "Tianze Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/180476?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 184659, "fullname": "Wei Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184659?format=json", "institution": null}, {"id": 184660, "fullname": "Runqi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184660?format=json", "institution": "Xiaohongshu"}, {"id": 161356, "fullname": "Yandong Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/161356?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 184661, "fullname": "Dejia Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/184661?format=json", "institution": "Xiaohongshu"}, {"id": 184662, "fullname": "Yibo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184662?format=json", "institution": "Xiaohongshu"}, {"id": 86765, "fullname": "Xu Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86765?format=json", "institution": "Shanghaitech University"}, {"id": 126819, "fullname": "Yao Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126819?format=json", "institution": "Zhejiang University, Tsinghua University"}, {"id": 86998, "fullname": "Lu Sheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86998?format=json", "institution": "Beihang University"}, {"id": 87140, "fullname": "Zhiyong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87140?format=json", "institution": "Tsinghua University"}], "abstract": "Virtual Try-on (VTON) has become a core capability for online retail, where realistic try-on results provide reliable fit guidance, reduce returns, and benefit both consumers and merchants. Diffusion-based VTON methods achieve photorealistic synthesis, yet often rely on intricate architectures such as auxiliary reference networks and suffer from slow sampling, making the trade-off between fidelity and efficiency a persistent challenge.We approach VTON as a structured image editing problem that demands strong conditional generation under three key requirements: subject preservation, faithful texture transfer, and seamless harmonization. Under this perspective, our training framework is generic and transfers to broader image editing tasks. Moreover, the paired data produced by VTON constitutes a rich supervisory resource for training general-purpose editors.We present PROMO, a promptable virtual try-on framework built upon a Flow Matching DiT backbone with latent multi-modal conditional concatenation. By leveraging conditioning efficiency and self-reference mechanisms, our approach substantially reduces inference overhead. On standard benchmarks, PROMO surpasses both prior VTON methods and general image editing models in visual fidelity while delivering a competitive balance between quality and speed. These results demonstrate that flow-matching transformers, coupled with latent multi-modal conditioning and self-reference acceleration, offer an effective and training-efficient solution for high-quality virtual try-on.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36278", "url": null, "sourceid": 38971, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39750, "uid": "29215b10f38547ecaf0e25d3a2cc1c94", "name": "RMAE-ProGRess: Advancing Semantic Segmentation in Unstructured Environments", "authors": [{"id": 179922, "fullname": "Manish Bhurtel", "url": "http://cvpr.thecvf.com/api/miniconf/users/179922?format=json", "institution": "Howard University"}, {"id": 192782, "fullname": "Danda Rawat", "url": "http://cvpr.thecvf.com/api/miniconf/users/192782?format=json", "institution": "Howard University"}], "abstract": "Semantic segmentation in unstructured environments presents unique challenges due to irregular terrain, occlusions, and complex spatial layouts. While structured settings (e.g., urban scenes) have been widely studied, segmentation in unstructured settings remains relatively underexplored, both in terms of standardized benchmarking and architectural design. In this work, we propose a encoder-decoder based semantic segmentation architecture that integrates a Reduced Masked Autoencoder (RMAE) as the encoder, a Feature-to-Pyramid (F2P) neck, and a novel decoder called ProGRess. The ProGRess decoder introduces Progressive Leapwise Fusion (PLF) for top-down multi-scale fusion of non-contiguous feature maps, a Lightweight Channel Attention gate with Residuals (LCAR) module, and a Bottleneck Feature Fusion (BFF) block for compact refinement. We establish comprehensive baselines by benchmarking state-of-the-art CNN and transformer-based models on challenging unstructured environment datasets viz. RELLIS-3D, it's coarse-grained variant, and RUGD. Our architecture achieves the state-of-the-art performance with 57.41\\% mIoU on RELLIS-3D, 45.63\\% mIoU on RUGD, 78.95\\% mIoU on RELLIS-3DC datasets while maintaining competitive parameter-count and vRAM usage.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39750", "url": null, "sourceid": 44084, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37342, "uid": "e7ff9f6c0d2b46f4b4560c29ed23a582", "name": "ShapeR: Robust Conditional 3D Shape Generation from Casual Captures", "authors": [{"id": 73809, "fullname": "Mohd Yawar Nihal Siddiqui", "url": "http://cvpr.thecvf.com/api/miniconf/users/73809?format=json", "institution": "Technical University Munich"}, {"id": 146779, "fullname": "Duncan Frost", "url": "http://cvpr.thecvf.com/api/miniconf/users/146779?format=json", "institution": "Meta"}, {"id": 187204, "fullname": "Samir Aroudj", "url": "http://cvpr.thecvf.com/api/miniconf/users/187204?format=json", "institution": "AceNerds"}, {"id": 187205, "fullname": "Armen Avetisyan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187205?format=json", "institution": "Meta Reality Labs"}, {"id": 140486, "fullname": "Henry Howard-Jenkins", "url": "http://cvpr.thecvf.com/api/miniconf/users/140486?format=json", "institution": "Facebook; Meta"}, {"id": 89106, "fullname": "Daniel DeTone", "url": "http://cvpr.thecvf.com/api/miniconf/users/89106?format=json", "institution": "Facebook"}, {"id": 75961, "fullname": "Pierre Moulon", "url": "http://cvpr.thecvf.com/api/miniconf/users/75961?format=json", "institution": "Facebook"}, {"id": 72738, "fullname": "Qirui Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/72738?format=json", "institution": "Simon Fraser University"}, {"id": 126494, "fullname": "Zhengqin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/126494?format=json", "institution": "Facebook"}, {"id": 89078, "fullname": "Julian Straub", "url": "http://cvpr.thecvf.com/api/miniconf/users/89078?format=json", "institution": "Meta Reality Labs Research"}, {"id": 127741, "fullname": "Richard Newcombe", "url": "http://cvpr.thecvf.com/api/miniconf/users/127741?format=json", "institution": "Meta, Reality Labs Research"}, {"id": 127728, "fullname": "Jakob Engel", "url": "http://cvpr.thecvf.com/api/miniconf/users/127728?format=json", "institution": "Research, Meta Reality Labs"}], "abstract": "Recent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such conditions are rarely met in real-world scenarios. We present ShapeR, a novel approach for conditional 3D object shape generation from casually captured sequences. Given a image sequence, we leverage off-the-shelf visual-inertial SLAM,3D detection algorithms and VLMs to extract for each object, a set of sparse SLAM points, posed multi-view images, and machine-generated captions. A rectified flow transformer trained to effectively condition on these modalities then generates high-fidelity metric 3D shapes. To ensure robustness to the challenges of casually captured data, we employ a range of techniques including on-the-fly compositional augmentations, a curriculum training scheme spanning object- and scene-level datasets, and strategies to handle background clutter. Additionally, we introduce a new evaluation benchmark comprising 178 in the wild objects across 7 real-world scenes with geometry annotations. Experiments show that ShapeR significantly outperforms existing approaches in this challenging setting, achieving an improvement of 2.7x in Chamfer distance compared to SoTA.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37342", "url": null, "sourceid": 31684, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39804, "uid": "edb24109f64fd8d18c12f6cfe0fe8868", "name": "JRM: Joint Reconstruction Model for Multiple Objects without Alignment", "authors": [{"id": 72738, "fullname": "Qirui Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/72738?format=json", "institution": "Simon Fraser University"}, {"id": 73809, "fullname": "Mohd Yawar Nihal Siddiqui", "url": "http://cvpr.thecvf.com/api/miniconf/users/73809?format=json", "institution": "Technical University Munich"}, {"id": 146779, "fullname": "Duncan Frost", "url": "http://cvpr.thecvf.com/api/miniconf/users/146779?format=json", "institution": "Meta"}, {"id": 187204, "fullname": "Samir Aroudj", "url": "http://cvpr.thecvf.com/api/miniconf/users/187204?format=json", "institution": "AceNerds"}, {"id": 187205, "fullname": "Armen Avetisyan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187205?format=json", "institution": "Meta Reality Labs"}, {"id": 127741, "fullname": "Richard Newcombe", "url": "http://cvpr.thecvf.com/api/miniconf/users/127741?format=json", "institution": "Meta, Reality Labs Research"}, {"id": 88970, "fullname": "Angel Xuan Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88970?format=json", "institution": "Simon Fraser University"}, {"id": 127728, "fullname": "Jakob Engel", "url": "http://cvpr.thecvf.com/api/miniconf/users/127728?format=json", "institution": "Research, Meta Reality Labs"}, {"id": 140486, "fullname": "Henry Howard-Jenkins", "url": "http://cvpr.thecvf.com/api/miniconf/users/140486?format=json", "institution": "Facebook; Meta"}], "abstract": "Object-centric reconstruction seeks to recover the 3D structure of a scene through composition of independent objects. While this independence can simplify modeling, it discards strong signals that could improve reconstruction, notably repetition where the same object model is seen multiple times in a scene, or across scans. We propose the Joint Reconstruction Model (JRM) to leverage repetition by framing object reconstruction as one of personalized generation: multiple observations share a common subject that should be consistent for all observations, while still adhering to the specific pose and state from each. Prior methods in this direction rely on explicit matching and rigid alignment across observations, making them sensitive to errors and difficult to extend to non-rigid transformations. In contrast, JRM is a 3D flow-matching generative model that implicitly aggregates unaligned observations in its latent space, learning to produce consistent and faithful reconstructions in a data-driven manner without explicit constraints. Evaluations on synthetic and real-world data show that JRM\u2019s implicit aggregation removes the need for explicit alignment, improves robustness to incorrect associations, and naturally handles non-rigid changes such as articulation. Overall, JRM outperforms both independent and alignment-based baselines in reconstruction quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39804", "url": null, "sourceid": 37419, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38048, "uid": "db86b46760e47dba2bc98b9558aa20cb", "name": "Virtual Nodes Guided Dynamic Graph Neural Network for Brain Tumor Segmentation with Missing Modalities", "authors": [{"id": 188924, "fullname": "Sha Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188924?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 168992, "fullname": "Jiao PAN", "url": "http://cvpr.thecvf.com/api/miniconf/users/168992?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 188925, "fullname": "Yu Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188925?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 188926, "fullname": "Chao Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188926?format=json", "institution": "University of Science and Technology Beijing"}], "abstract": "Multimodal magnetic resonance imaging (MRI) is crucial for brain tumor segmentation, with many methods leveraging its four key modalities to capture complementary information for effective sub-region analysis. However, the absence of several modalities is very common in practice, leading to severe performance degradation in existing full-modality segmentation methods. Limited by the structured data model, recent works often adopt a multi-stage training strategy for full-modality and missing-modality scenarios, which increases training costs and inadequately addresses the interference of miss. In this work, we propose a graph-based one-stage framework for robust brain tumor segmentation with missing modalities. Specifically, we introduce modality-specific virtual nodes that serve as supplementary information sources to compensate for missing modalities. To enhance model robustness against arbitrary modality combinations, we leverage the inherent flexibility of graph networks to devise a dynamic connection strategy. This mechanism dynamically adjusts the adjacency matrix based on modality availability, preserving beneficial information flow while mitigating interference effects caused by missing modalities. Furthermore, we enhance the graph network through heterogeneous weight matrices, enhancing its adaptability to multimodal scenarios. Extensive experiments on the BRATS-2018 and BRATS-2020 datasets demonstrate that our method outperforms the state-of-the-art methods on almost all subsets of incomplete modalities.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38048", "url": null, "sourceid": 41545, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38354, "uid": "62d8ac431c3b201066064163fe6ab29e", "name": "CARE: A Molecular-Guided Foundation Model with Adaptive Region Modeling for Whole Slide Image Analysis", "authors": [{"id": 175461, "fullname": "Di Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/175461?format=json", "institution": "Xi\u2018an jiaotong university"}, {"id": 189689, "fullname": "Zhangpeng Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/189689?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 189690, "fullname": "Xiaobo Pang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189690?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 189691, "fullname": "Jiashuai Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189691?format=json", "institution": null}, {"id": 174500, "fullname": "Junbo Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/174500?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 189692, "fullname": "Hao Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/189692?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 189693, "fullname": "Jiusong Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/189693?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 189694, "fullname": "Zhi Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189694?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 189695, "fullname": "Kai Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189695?format=json", "institution": "University of Cambridge"}, {"id": 189696, "fullname": "Yinghua Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189696?format=json", "institution": "KingMed Diagnostics"}, {"id": 189697, "fullname": "Si Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189697?format=json", "institution": "South China University of Technology; KingMed Diagnostics"}, {"id": 189698, "fullname": "Tingsong Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189698?format=json", "institution": "KingMed"}, {"id": 189699, "fullname": "Haoran Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189699?format=json", "institution": "Baidu"}, {"id": 189700, "fullname": "Mireia Crispin-Ortuzar", "url": "http://cvpr.thecvf.com/api/miniconf/users/189700?format=json", "institution": "University of Cambridge"}, {"id": 189701, "fullname": "Weimiao Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189701?format=json", "institution": null}, {"id": 131886, "fullname": "Chen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/131886?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 189702, "fullname": "Zeyu Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189702?format=json", "institution": "University of Cambridge"}], "abstract": "Foundation models have recently achieved impressive success in computational pathology, demonstrating strong generalization across diverse histopathology tasks. However, existing models overlook the heterogeneous and non-uniform organization of pathological regions of interest (ROIs) because they rely on natural image backbones not tailored for tissue morphology. Consequently, they often fail to capture the coherent tissue architecture beyond isolated patches, limiting interpretability and clinical relevance. To address these challenges, we present Cross-modal Adaptive Region Encoder (CARE), a foundation model for pathology that automatically partitions WSIs into several morphologically relevant regions. Specifically, CARE employs a two-stage pretraining strategy: (1) a self-supervised unimodal pretraining stage that learns morphological representations from 34,277 whole-slide images (WSIs) without segmentation annotations, and (2) a cross-modal alignment stage that leverages RNA and protein profiles to refine the construction and representation of adaptive regions. This molecular guidance enables CARE to identify biologically relevant patterns and generate irregular yet coherent tissue regions, selecting the most representative area as ROI. CARE supports a broad range of pathology-related tasks, using either the ROI feature or the slide-level feature obtained by aggregating adaptive regions. Based on only one-tenth of the pretraining data typically used by mainstream foundation models, CARE achieves superior average performance across 33 downstream benchmarks, including morphological classification, molecular prediction, and survival analysis, and outperforms other foundation model baselines overall.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38354", "url": null, "sourceid": 37650, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36677, "uid": "b36b959e8edd9be3b4a2b01742259810", "name": "Unlocking Positive Transfer in Incrementally Learning Surgical Instruments: A Self-reflection Hierarchical Prompt Framework", "authors": [{"id": 185624, "fullname": "Yu ZHU", "url": "http://cvpr.thecvf.com/api/miniconf/users/185624?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 185625, "fullname": "Kang LI", "url": "http://cvpr.thecvf.com/api/miniconf/users/185625?format=json", "institution": null}, {"id": 185626, "fullname": "LI ZHENG", "url": "http://cvpr.thecvf.com/api/miniconf/users/185626?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 87709, "fullname": "Pheng-Ann Heng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87709?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "To continuously enhance model adaptability in surgical video scene parsing, recent studies incrementally update it to progressively learn to segment an increasing number of surgical instruments over time. However, prior works constantly overlooked the potential of positive forward knowledge transfer, i.e., how past knowledge could help learn new classes, and positive backward knowledge transfer, i.e., how learning new classes could help refine past knowledge. In this paper, we propose a self-reflection hierarchical prompt framework that unlocks the power of positive forward and backward knowledge transfer in class incremental segmentation, aiming to proficiently learn new instruments, improve existing skills of regular instruments, and avoid catastrophic forgetting of old instruments. Our framework is built on a frozen, pre-trained model that adaptively appends instrument-aware prompts for new classes throughout training episodes. To enable positive forward knowledge transfer, we organize instrument prompts into a hierarchical prompt parsing tree with the instrument-shared prompt partition as the root node, $n$-part-shared prompt partitions as intermediate nodes and instrument-distinct prompt partitions as leaf nodes, to expose the reusable historical knowledge for new classes to simplify their learning. Conversely, to encourage positive backward knowledge transfer, we conduct self-reflection refining on existing knowledge by directed-weighted graph propagation, examining the knowledge associations recorded in the tree to improve its representativeness without causing catastrophic forgetting. Our framework is applicable to both CNN-based models and advanced transformer-based foundation models, yielding more than 5\\% and 11\\% improvements over the competing methods on two public benchmarks respectively.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36677", "url": null, "sourceid": 39160, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37737, "uid": "79da9938d61a8bb4ddeead82d229441a", "name": "MV2UV: Generating High-quality UV Texture Maps with Multiview Prompts", "authors": [{"id": 188130, "fullname": "Zheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188130?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 178363, "fullname": "Qinchuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/178363?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 188131, "fullname": "Yuteng Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/188131?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 188132, "fullname": "Zhi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188132?format=json", "institution": null}, {"id": 188133, "fullname": "Penglei Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/188133?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 131520, "fullname": "Mengfei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/131520?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 185362, "fullname": "Wenxiao ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/185362?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 77389, "fullname": "Yuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/77389?format=json", "institution": "The University of Hong Kong"}], "abstract": "Generating high-quality textures for 3D assets is a challenging task. Existing multiview texture generation methods suffer from the multiview inconsistency and missing textures on unseen parts, while UV inpainting texture methods do not generalize well due to insufficient UV data and cannot well utilize 2D image diffusion priors. In this paper, we propose a new method called MV2UV that combines 2D generative priors from multiview generation and the inpainting ability of UV refinement to get high-quality texture maps. Our key idea is to adopt a UV space generative model that simultaneously inpaints unseen parts of multiview images while resolving the inconsistency of multiview images. Experiments show that our method enables a better texture generation quality than existing methods, especially in unseen occluded and multiview-inconsistent parts.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37737", "url": null, "sourceid": 32002, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39702, "uid": "d01d1ca662d5257010298baa3936153e", "name": "UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass", "authors": [{"id": 131520, "fullname": "Mengfei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/131520?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 131523, "fullname": "Peng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/131523?format=json", "institution": "Tsinghua University"}, {"id": 188130, "fullname": "Zheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188130?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 186540, "fullname": "Jiahao Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186540?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 192682, "fullname": "Chengfeng Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192682?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 131529, "fullname": "Wei Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/131529?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 158377, "fullname": "Qifeng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/158377?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 76570, "fullname": "Sida Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/76570?format=json", "institution": "Zhejiang University"}, {"id": 185362, "fullname": "Wenxiao ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/185362?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 86370, "fullname": "Wenhan Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/86370?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 77389, "fullname": "Yuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/77389?format=json", "institution": "The University of Hong Kong"}, {"id": 131536, "fullname": "Yike Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/131536?format=json", "institution": "Imperial College London"}], "abstract": "We present UniSH, a unified, feed-forward framework for joint metric-scale 3D scene and human reconstruction. A key challenge in this domain is the scarcity of large-scale, annotated real-world data, forcing a reliance on synthetic datasets. This reliance introduces a significant sim-to-real domain gap, leading to poor generalization, low-fidelity human geometry, and poor alignment on in-the-wild videos. To address this, we propose an innovative training paradigm that effectively leverages unlabeled in-the-wild data. Our framework bridges strong, disparate priors from scene reconstruction and HMR, and is trained with two core components: (1) a robust distillation strategy to refine human surface details by distilling high-frequency details from an expert depth model, and (2) a two-stage supervision scheme, which first learns coarse localization on synthetic data, then fine-tunes on real data by directly optimizing the geometric correspondence between the SMPL mesh and the human point cloud. This approach enables our feed-forward model to jointly recover high-fidelity scene geometry, human point clouds, camera parameters, and coherent, metric-scale SMPL bodies, all in a single forward pass. Extensive experiments demonstrate that our model achieves state-of-the-art performance on human-centric scene reconstruction and delivers highly competitive results on global human motion estimation, comparing favorably against both optimization-based frameworks and HMR-only methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39702", "url": null, "sourceid": 38026, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37494, "uid": "d78952e2a37fcc10fd011148682958cb", "name": "D2Cache: Second-Order Delta Caching for Higher Video Diffusion Acceleration", "authors": [{"id": 183403, "fullname": "Enhuai Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183403?format=json", "institution": "University of Sydney"}, {"id": 187579, "fullname": "Yunke Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187579?format=json", "institution": "University of Sydney"}, {"id": 85997, "fullname": "Changming Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/85997?format=json", "institution": "CSIRO"}, {"id": 87489, "fullname": "Chang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87489?format=json", "institution": "University of Sydney"}], "abstract": "Video diffusion models achieve impressive visual fidelity but remain computationally prohibitive for real-time or interactive generation due to their sequential denoising process. Recent caching methods accelerate inference by reusing outputs across timesteps, typically estimating each new output from the first-order residual, which is the difference between adjacent model predictions.To mitigate the accumulated error in caching methods, we propose D2Cache, a training-free method that leverages the smoothness of second-order residual delta, which is temporal differences between consecutive first-order residuals, to predict future timesteps more accurately. We theoretically show that this second-order correction improves prediction accuracy and effectively suppresses cumulative errors. Moreover, D2Cache adaptively scales second-order deltas using error estimates derived from timestep embeddings, maintaining accuracy across varying cache intervals.Empirically, D2Cache outperforms the state-of-the-art TeaCache across four video diffusion models (Latte, Open-Sora, LTX-video, and Wan2.1) at comparable acceleration rates, showing even larger gains under higher acceleration settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37494", "url": null, "sourceid": 46483, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40221, "uid": "08457b5ff06564481f2455c1adad6fd5", "name": "Affordance Field Intervention: Enabling VLAs to Escape Memory Traps in Robotic Manipulation", "authors": [{"id": 182323, "fullname": "Siyu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182323?format=json", "institution": "The University of Sydney"}, {"id": 193817, "fullname": "Zijian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193817?format=json", "institution": "University of Sydney"}, {"id": 187579, "fullname": "Yunke Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187579?format=json", "institution": "University of Sydney"}, {"id": 178539, "fullname": "Chenghao Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/178539?format=json", "institution": "the University of Sydney"}, {"id": 186710, "fullname": "Tao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186710?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 87489, "fullname": "Chang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87489?format=json", "institution": "University of Sydney"}], "abstract": "Vision-Language-Action (VLA) models have shown great performance in robotic manipulation by mapping visual observations and language instructions directly to actions. However, they remain brittle under distribution shifts: when test scenarios change, VLAs often reproduce memorized trajectories instead of adapting to the updated scene, which is a failure mode we refer to as the ``Memory Trap''. This limitation stems from the end-to-end design, which lacks explicit 3D spatial reasoning and prevents reliable identification of actionable regions in unfamiliar environments. To compensate for this missing spatial understanding, 3D Spatial Affordance Fields (SAFs) can provide a geometric representation that highlights where interactions are physically feasible, offering explicit cues about regions the robot should approach or avoid. We therefore introduce Affordance Field Intervention (AFI), a lightweight hybrid framework that uses SAFs as an on-demand plug-in to guide VLA behavior. Our system detects memory traps through proprioception, repositions the robot to recent high-affordance regions, and proposes affordance-driven waypoints that anchor VLA-generated actions. A SAF-based scorer then selects trajectories with the highest cumulative affordance. Extensive experiments demonstrate that our method achieves an average improvement of 23.5\\% across different VLA backbones ($\\pi_{0}$ and $\\pi_{0.5}$) under out-of-distribution scenarios on real-world robotic platforms, and 20.2\\% on the LIBERO-Pro benchmark, validating its effectiveness in enhancing VLA robustness to distribution shifts.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40221", "url": null, "sourceid": 36509, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38547, "uid": "e25ca237a194723ab3c86e793660ef21", "name": "Confusion-Aware Spectral Regularizer for Long-Tailed Recognition", "authors": [{"id": 183412, "fullname": "Ziquan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183412?format=json", "institution": "University of Leicester, UK"}, {"id": 190099, "fullname": "Gaojie Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190099?format=json", "institution": "University of Exeter"}, {"id": 190100, "fullname": "Hanruo Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190100?format=json", "institution": "University of Leicester"}, {"id": 177704, "fullname": "Si-Yuan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/177704?format=json", "institution": "Nanjing University of Posts and Telecommunications"}, {"id": 190101, "fullname": "Yunxiao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190101?format=json", "institution": "University of Exeter"}, {"id": 190102, "fullname": "ZEYU FU", "url": "http://cvpr.thecvf.com/api/miniconf/users/190102?format=json", "institution": "University of Exeter"}, {"id": 190103, "fullname": "Ronghui Mu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190103?format=json", "institution": "University of Exeter"}, {"id": 190104, "fullname": "Guoqiang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190104?format=json", "institution": "University of Exeter"}, {"id": 190105, "fullname": "Zhao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/190105?format=json", "institution": "Zhengzhou University"}, {"id": 190106, "fullname": "Xia Yuhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190106?format=json", "institution": "Chengdu University of Technology"}, {"id": 190107, "fullname": "Jiaxing Shang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190107?format=json", "institution": "Chongqing University; University of Exeter"}, {"id": 74040, "fullname": "Xiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/74040?format=json", "institution": "King Abdullah University of Science and Technology"}, {"id": 190108, "fullname": "Lu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190108?format=json", "institution": "University of Exeter"}, {"id": 190109, "fullname": "Tianjin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190109?format=json", "institution": "University of Exeter"}], "abstract": "Long-tailed image classification remains a long-standing challenge, as real-world data typically follow highly imbalanced distributions where a few head classes dominate and many tail classes contain only limited samples. This imbalance biases feature learning toward head categories and leads to significant degradation on rare classes. Although recent studies have proposed re-sampling, re-weighting, and decoupled learning strategies, the improvement on the most underrepresented classes still remains marginal compared with overall accuracy. In this work, we present a confusion-centric perspective for long-tailed recognition that explicitly focuses on worst-class generalization.  We first establish a new theoretical framework of class-specific error analysis, which shows that the worst-class error can be tightly upper-bounded by the spectral norm of the frequency-weighted confusion matrix and a model-dependent complexity term. Guided by this insight, we propose the Confusion-Aware Spectral Regularizer (CAR) that minimizes the spectral norm of the confusion matrix during training to reduce inter-class confusion and enhance tail-class generalization. To enable stable and efficient optimization, CAR integrates a Differentiable Confusion Matrix Surrogate and an EMA-based Confusion Estimator to maintain smooth and low-variance estimates across mini-batches. Extensive experiments across multiple long-tailed benchmarks demonstrates that CAR substantially improves both worst-class accuracy and overall performance. When combined with ConCutMix augmentation, CAR consistently surpasses exisiting state-of-the-art long-tailed learning methods under both the training-from-scratch setting (by $2.37\\% \\sim 4.83\\%$) and the fine-tuning-from-pretrained setting (by $2.42\\% \\sim 4.17\\%$) across ImageNet-LT, CIFAR100-LT, and iNaturalist datasets.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38547", "url": null, "sourceid": 46719, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40340?format=json"], "related_events_ids": [40340]}, {"id": 36489, "uid": "af5604111becd28211b2bc271c8c9b57", "name": "FreeForm: Reduced-Order Deformable Simulation from Particle-Based Skinning Eigenmodes", "authors": [{"id": 135471, "fullname": "Donglai Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/135471?format=json", "institution": "NVIDIA Research"}, {"id": 185177, "fullname": "Vismay Modi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185177?format=json", "institution": "NVIDIA"}, {"id": 185178, "fullname": "Rishit Dagli", "url": "http://cvpr.thecvf.com/api/miniconf/users/185178?format=json", "institution": "University of Toronto"}, {"id": 185179, "fullname": "Ty Trusty", "url": "http://cvpr.thecvf.com/api/miniconf/users/185179?format=json", "institution": "Department of Computer Science, University of Toronto"}, {"id": 154188, "fullname": "Gilles Daviet", "url": "http://cvpr.thecvf.com/api/miniconf/users/154188?format=json", "institution": "NVIDIA"}, {"id": 177745, "fullname": "Anka H. Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/177745?format=json", "institution": "NVIDIA"}, {"id": 184980, "fullname": "Nicholas Sharp", "url": "http://cvpr.thecvf.com/api/miniconf/users/184980?format=json", "institution": "NVIDIA"}, {"id": 150890, "fullname": "David I.W.", "url": "http://cvpr.thecvf.com/api/miniconf/users/150890?format=json", "institution": "Department of Computer Science, University of Toronto"}], "abstract": "We present a novel formulation for mesh-free, reduced-order simulation of deformable hyperelastic objects. Existing work in reduced-order elastodynamic simulation represents the input geometry by either meshes, which can be difficult to obtain due to challenges in scanning and triangulating complex shapes, or by neural fields that require per-shape optimization. We propose to adopt a Reproducing Kernel Particle Method (RKPM) representation, which enables the construction of reduced-order skinning weights by solving a generalized eigensystem on the Hessian matrix of the elastic energy. We demonstrate that this formulation not only leads to a 40$\\times$ training speedup compared with the per-shape optimization of neural fields, but also achieves lower simulation error when evaluated against the converged results of finite element method. We show our simulation results on a wide variety of objects in different representations including meshes and Gaussian splats, as well as the application of our method in the downstream task of robot simulation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36489", "url": null, "sourceid": 46330, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38035, "uid": "c6dd57cb9806eadc9f7915a90d91aa92", "name": "Few-for-Many Personalized Federated Learning", "authors": [{"id": 151429, "fullname": "Ping Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/151429?format=json", "institution": "City University of Hong Kong"}, {"id": 153236, "fullname": "ZHANG Tiantian", "url": "http://cvpr.thecvf.com/api/miniconf/users/153236?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 188885, "fullname": "Xi Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188885?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 188886, "fullname": "Xiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188886?format=json", "institution": "Southeast University"}, {"id": 188887, "fullname": "Zhi-Ri Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188887?format=json", "institution": "Jinan University"}, {"id": 73898, "fullname": "Qingfu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73898?format=json", "institution": "City University of Hong Kong"}], "abstract": "Personalized Federated Learning (PFL) aims to train customized models for clients with highly heterogeneous data distributions while preserving data privacy. Existing approaches often rely on heuristics like clustering or model interpolation, which lack principled mechanisms for balancing heterogeneous client objectives. Serving $M$ clients with distinct data distributions is inherently a multi-objective optimization problem, where achieving optimal personalization ideally requires $M$ distinct models on the Pareto front. However, maintaining $M$ separate models poses significant scalability challenges in federated settings with hundreds or thousands of clients. To address this challenge, we reformulate PFL as a few-for-many optimization problem that maintains only $K$ shared server models ($K \\ll M$) to collectively serve all $M$ clients. We prove that this framework achieves near-optimal personalization: the approximation error diminishes as $K$ increases and converges to each client's optimum as data grows. Building on this reformulation, we propose FedFew, a practical algorithm that jointly optimizes the $K$ server models through efficient gradient-based updates. Unlike clustering-based approaches that require manual client partitioning or interpolation-based methods that demand careful hyperparameter tuning, FedFew automatically discovers the optimal model diversity through its optimization process. Experiments across vision, NLP, and real-world medical imaging datasets demonstrate that FedFew, with just 3 models, consistently outperforms other state-of-the-art approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38035", "url": null, "sourceid": 33042, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36413, "uid": "a9bdc9589c2df82ca15eaa0205447770", "name": "Learning Convex Decomposition via Feature Fields", "authors": [{"id": 135048, "fullname": "Yuezhi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/135048?format=json", "institution": "University of Texas at Austin"}, {"id": 91087, "fullname": "Qixing Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91087?format=json", "institution": "University of Texas at Austin"}, {"id": 184979, "fullname": "Mikaela Angelina Uy", "url": "http://cvpr.thecvf.com/api/miniconf/users/184979?format=json", "institution": "NVIDIA"}, {"id": 184980, "fullname": "Nicholas Sharp", "url": "http://cvpr.thecvf.com/api/miniconf/users/184980?format=json", "institution": "NVIDIA"}], "abstract": "This work proposes  a new formulation to the long-standing problem of convex decomposition through learning feature fields, enabling the first feed-forward model for open-world learning of convex decomposition. Our method produces high-quality decompositions of 3D shapes into a union of convex bodies, which are essential to accelerate collision detection in physical simulation, amongst many other applications.The key insight is to adopt a feature learning approach and learn a continuous feature field that can later be clustered to yield a good convex decomposition via our self-supervised, purely-geometric objective derived from the classical definition of convexity.Our formulation can be used for single shape optimization, but more importantly, feature prediction unlocks scalable, self-supervised learning on large datasets resulting in the first learned open-world for convex decomposition.Experiments show that our decompositions are higher-quality than alternatives and generalize across open-world objects as well as across representations to meshes, CAD models, and even Gaussian splats.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36413", "url": null, "sourceid": 43285, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40259?format=json"], "related_events_ids": [40259]}, {"id": 38445, "uid": "a16adb956f28c621d4e83cb0ec9616cf", "name": "Towards Uncertainty-aware Unsupervised Domain Adaptation for Videos and Time-Series with Causal Optimal Transport", "authors": [{"id": 189872, "fullname": "Khushboo Mishra", "url": "http://cvpr.thecvf.com/api/miniconf/users/189872?format=json", "institution": "Indian Institute of Technology (BHU) Varanasi"}, {"id": 189873, "fullname": "Varun Trivedi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189873?format=json", "institution": "Indian Institute of Technology (BHU) Varanasi"}, {"id": 130097, "fullname": "Tanima Dutta", "url": "http://cvpr.thecvf.com/api/miniconf/users/130097?format=json", "institution": "IIT BHU"}], "abstract": "Unsupervised domain adaptation (UDA) for videos and 1D time-series data faces significant challenges due to domain shifts in terms of both temporal dynamics and feature distributions. Existing UDA approaches for time-series data often address temporal alignment and uncertainty mitigation as separate objectives, leading to unstable training, noisy pseudo-labels, and incomplete feature transfer. This disjoint treatment fails to capture inter-channel causal dependencies and also overlooks the impact of prediction uncertainty on adaptation quality. This limits the transferability of learned representations and results in suboptimal adaptation. To address the aforementioned limitations, we propose a novel UDA framework, named Causally-Regularized Optimal Transport (in short Causal-OT), that preserves domain-invariant causal mechanisms by embedding causal graph regularization into robust OT alignment process. First we estimate inter-channel causal graphs in both source and target domains and learn a transport plan that not only aligns feature distributions but also improves interpretability and minimizes the discrepancy between causal structures of the Granger graphs. However, pseudo-labeling may still prone to error propagation allowing incorrect target predictions during self-training, degrading the model stability and transfer quality across domains. To mitigate this, we further introduce a causality-aware pseudo-labeling strategy that selects high-confidence target samples based on both entropy and structural consistency with the causal graph of the source domain. This enhances robustness against pseudo-label noise.Extensive experiments on six time-series benchmarks achieving  4.5\\% gain in accuracy and a 3.8\\% improvement in F1-score. We conduct experiments on four benchmark video datasets that achieve a 2.5\\% gain in accuracy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38445", "url": null, "sourceid": 34989, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36805, "uid": "346ff40778351836ea68a14e304aa0ae", "name": "Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models", "authors": [{"id": 152054, "fullname": "Yexing Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152054?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 180040, "fullname": "Wei Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180040?format=json", "institution": "JD"}, {"id": 185906, "fullname": "Shen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185906?format=json", "institution": null}, {"id": 185907, "fullname": "Haohan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185907?format=json", "institution": null}, {"id": 185908, "fullname": "Yuxin Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185908?format=json", "institution": "JD.com"}, {"id": 185909, "fullname": "Yaoyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185909?format=json", "institution": null}, {"id": 184465, "fullname": "Ao Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/184465?format=json", "institution": "JD.com"}, {"id": 185910, "fullname": "Yuhao Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185910?format=json", "institution": "JD.com"}, {"id": 185911, "fullname": "Lu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185911?format=json", "institution": "JD.com"}, {"id": 185912, "fullname": "Xudong Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/185912?format=json", "institution": null}, {"id": 185913, "fullname": "Haoran Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185913?format=json", "institution": "JD.com"}, {"id": 173123, "fullname": "Run Ling", "url": "http://cvpr.thecvf.com/api/miniconf/users/173123?format=json", "institution": "Northeastern University"}, {"id": 185914, "fullname": "Zheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185914?format=json", "institution": "JD"}, {"id": 185915, "fullname": "Jingjing Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/185915?format=json", "institution": "JD"}, {"id": 185916, "fullname": "Junjie Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185916?format=json", "institution": "JD"}, {"id": 185917, "fullname": "Ching Law", "url": "http://cvpr.thecvf.com/api/miniconf/users/185917?format=json", "institution": "JD.com"}, {"id": 75725, "fullname": "Longguang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75725?format=json", "institution": "National University of Defense Technology"}, {"id": 87094, "fullname": "Yulan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/87094?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Generating realistic and user-preferred advertisements is a key challenge in e-commerce. Existing approaches utilize multiple independent models driven by click-through-rate (CTR) to controllably create attractive image or text advertisements. However, their pipelines lack cross-modal perception and rely on CTR that only reflects average preferences. Therefore, we explore jointly generating personalized image-text advertisements from historical click behaviors. We first design a Unified Advertisement Generative model (Uni-AdGen) that employs a single autoregressive framework to produce both advertising images and texts. By incorporating a foreground perception module and instruction tuning, Uni-AdGen enhances the realism of the generated content. To further personalize advertisements, we equip Uni-AdGen with a coarse-to-fine preference understanding module that effectively captures user interests from noisy multimodal historical behaviors to drive personalized generation. Additionally, we construct the first large-scale Personalized Advertising image-text dataset (PAd1M) and introduce a Product Background Similarity (PBS) metric to facilitate training and evaluation. Extensive experiments show that our method outperforms baselines in general and personalized advertisement generation. The dataset and codes will be released after acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36805", "url": null, "sourceid": 35801, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40220, "uid": "f060fb7eb8965de491b69639994123a5", "name": "Clothe and Pose", "authors": [{"id": 193816, "fullname": "Nakul Sharma", "url": "http://cvpr.thecvf.com/api/miniconf/users/193816?format=json", "institution": "Luma AI"}, {"id": 150939, "fullname": "Aayush Bansal", "url": "http://cvpr.thecvf.com/api/miniconf/users/150939?format=json", "institution": "Spree AI"}, {"id": 137035, "fullname": "Minh Vo", "url": "http://cvpr.thecvf.com/api/miniconf/users/137035?format=json", "institution": "SpreeAI"}], "abstract": "We present Clothe-and-Pose, an image generation and editing method that enables a user to try-on different clothes and allows them to pose as they desire. Our method inputs a single user image, garment image and example pose, and outputs the user wearing the target garment in desired pose. In this work, we also introduce an evaluation setup for clothing and posing tasks. Our study spans across a wide variety of garments, including athletic wear, bottoms, dresses, innerwear and swimwear, and a diverse set of poses. Finally, we demonstrate the capability of our model to do general human editing in real-world captures, as well as artificially generated images.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40220", "url": null, "sourceid": 37946, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39360, "uid": "86d1c91847351b6999f019d54a4f3f8c", "name": "FPSBench: A Benchmark for Video Understanding at High Frame Rates", "authors": [{"id": 151032, "fullname": "Rohan Choudhury", "url": "http://cvpr.thecvf.com/api/miniconf/users/151032?format=json", "institution": "CMU, Carnegie Mellon University"}, {"id": 191919, "fullname": "Jean Dandurand", "url": "http://cvpr.thecvf.com/api/miniconf/users/191919?format=json", "institution": "CMU, Carnegie Mellon University"}, {"id": 191920, "fullname": "Kai Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191920?format=json", "institution": "CMU, Carnegie Mellon University; University of Hong Kong; Nanyang Technological University"}, {"id": 94848, "fullname": "Kshitij Madhav Bhat", "url": "http://cvpr.thecvf.com/api/miniconf/users/94848?format=json", "institution": "Indian Institute of Technology Indore"}, {"id": 191921, "fullname": "Kartik Sharma", "url": "http://cvpr.thecvf.com/api/miniconf/users/191921?format=json", "institution": "Carnegie Mellon University"}, {"id": 191922, "fullname": "Liza Dahiya", "url": "http://cvpr.thecvf.com/api/miniconf/users/191922?format=json", "institution": "Carnegie Mellon University"}, {"id": 131817, "fullname": "Yizhou Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/131817?format=json", "institution": "Carnegie Mellon University"}, {"id": 191923, "fullname": "Souraja Kundu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191923?format=json", "institution": "Carnegie Mellon University"}, {"id": 177608, "fullname": "Chun-Hsien Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/177608?format=json", "institution": "Carnegie Mellon University"}, {"id": 88213, "fullname": "Kris Kitani", "url": "http://cvpr.thecvf.com/api/miniconf/users/88213?format=json", "institution": "Carnegie Mellon University"}, {"id": 76711, "fullname": "Laszlo Jeni", "url": "http://cvpr.thecvf.com/api/miniconf/users/76711?format=json", "institution": "Carnegie Mellon University"}], "abstract": "Modern video-language models are typically trained on videos downsampled to low frames-per-second (FPS), and the most commonly used evaluation benchmarks are designed for low-FPS input as well. To address this shortcoming, we present FPS-Bench, a large video question-answering benchmark designed to evaluate VLMs\u2019 capabilities to understand video at high-frame rates. We introduce a new metric, the minimum frames-per-second (minFPS), which measures the minimum frame-rate required to solve a given question. While existing benchmarks require <1  minFPS, we rigorously curate more than 1000 questions from a diverse source of videos and manually verify minFPS for each example, leading to a benchmark that requires watching videos at on average 7 FPS to solve. Our evaluation of several state-of-the-art VLMs shows that they are severely lacking, achieving QA accuracy of 30\\% in the FPS-Bench multiple-choice task, while humans achieve  72\\% accuracy. We believe that FPS-Bench will serve as a valuable tool for improving frontier-level VLMs and will release all data and code.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39360", "url": null, "sourceid": 38284, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39989, "uid": "6c777229ea7df5098a0a57a29558ed31", "name": "OmniGen2: Towards Instruction-Aligned Multimodal Generation", "authors": [{"id": 193245, "fullname": "Chenyuan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193245?format=json", "institution": "University of Science and Technology of China"}, {"id": 182423, "fullname": "Jiahao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182423?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 193246, "fullname": "PengFei Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193246?format=json", "institution": "University of Science and Technology of China"}, {"id": 158347, "fullname": "Ruiran Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/158347?format=json", "institution": "University of Science and Technology of China"}, {"id": 155474, "fullname": "Shitao Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/155474?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 193247, "fullname": "Xin Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/193247?format=json", "institution": "University of Science and Technology of China"}, {"id": 126964, "fullname": "Yueze Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126964?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 193248, "fullname": "Wanli Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193248?format=json", "institution": "Zhejiang University"}, {"id": 193249, "fullname": "Xiyan Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193249?format=json", "institution": "Zhejiang University"}, {"id": 168362, "fullname": "Yexin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/168362?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 159072, "fullname": "Junjie Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/159072?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 193250, "fullname": "Ziyi Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/193250?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 193251, "fullname": "Ze Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193251?format=json", "institution": "University of Science and Technology of China"}, {"id": 158348, "fullname": "Chaofan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/158348?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 189427, "fullname": "Haoge Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189427?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 193252, "fullname": "Kun Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/193252?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 89378, "fullname": "Bo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89378?format=json", "institution": "Microsoft"}, {"id": 191032, "fullname": "Jiajun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191032?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 87804, "fullname": "Dong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87804?format=json", "institution": "University of Science and Technology of China"}, {"id": 193253, "fullname": "Defu Lian", "url": "http://cvpr.thecvf.com/api/miniconf/users/193253?format=json", "institution": "University of Science and Technology of China"}, {"id": 89234, "fullname": "Xinlong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89234?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 128283, "fullname": "Zhongyuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128283?format=json", "institution": "Kuaishou Inc."}, {"id": 88774, "fullname": "Tiejun Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88774?format=json", "institution": "Peking University"}, {"id": 153059, "fullname": "Zheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153059?format=json", "institution": "Research, Microsoft"}], "abstract": "Multimodal generative models can process instructions in various modalities and demonstrate outstanding performance across a wide range of image generation tasks. However, their robustness in complex real-world scenarios remains limited due to insufficient generalized instruction alignment. We introduces \\textbf{OmniGen2}, a unified multimodal generator designed to follow complex, fine-grained instructions. Our core contribution is a two-stage design that first builds a strong, world-knowledge-grounded foundation model and then aligns it using a progressive, multi-task instruction tuning strategy. The foundation model features a streamlined architecture with decoupled decoding for versatile multimodal generation and a novel positional encoding scheme to improve learning efficiency. We ground this model in real-world knowledge using large-scale data construction pipelines. Building on this foundation, we propose a progressive, reinforcement-based alignment process. This phase carefully schedules training tasks and reward signals to foster cross-task knowledge transfer, significantly improving the model's instruction-following capabilities. Our models demonstrate competitive performance on standard benchmarks and our dedicated in-context generation benchmark, \\textbf{OmniContext}. We will release our models, code, benchmark, and training datasets to catalyze future research in building more capable and instruction aligned generative models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39989", "url": null, "sourceid": 32436, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39560, "uid": "ad0ca0bad566bac4acce21e459b10e16", "name": "Modeling the Visual Ambiguity of Human Sketches", "authors": [{"id": 192347, "fullname": "Yang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192347?format=json", "institution": "Zhejiang University"}, {"id": 192348, "fullname": "Ping Ni", "url": "http://cvpr.thecvf.com/api/miniconf/users/192348?format=json", "institution": "Zhejiang University"}, {"id": 192349, "fullname": "Jin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192349?format=json", "institution": "Zhejiang University"}, {"id": 146335, "fullname": "Senyun Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/146335?format=json", "institution": "Zhejiang University"}, {"id": 182911, "fullname": "Jingdan Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/182911?format=json", "institution": "Zhejiang University"}, {"id": 192350, "fullname": "Kaixiang Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192350?format=json", "institution": "Zhejiang University"}, {"id": 192351, "fullname": "Guodong Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192351?format=json", "institution": "Zhejiang University"}, {"id": 144817, "fullname": "Jingru Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/144817?format=json", "institution": "Carnegie Mellon University"}, {"id": 73862, "fullname": "Shengfeng He", "url": "http://cvpr.thecvf.com/api/miniconf/users/73862?format=json", "institution": "Singapore Management University"}], "abstract": "Human sketches provide a compact and expressive form of visual communication, but their sparse structural cues, while capturing essential object structures, introduce ambiguity because a single sketch can correspond to multiple plausible images, making cross-domain alignment uncertain and unstable. Such ambiguity fundamentally limits sketch-based vision tasks that rely on precise sketch--image correspondence. To address this challenge, we introduce AmbiScore, a metric that quantifies the ambiguity of sketch-image pairs, and use Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) as a testbed to reveal how ambiguous supervision leads to performance collapse in existing methods. We further propose DisAmb (Disentangling Ambiguity), a framework that explicitly models and mitigates ambiguity through two components: (1) Elastic Matching, which adaptively adjusts supervision strength using AmbiScore, and (2) Purified Matching, which employs ambiguity-agnostic masks to disentangle structure and appearance via shape jigsaw and texture swapping. DisAmb establishes new benchmarks under high ambiguity and provides a robust, transferable supervisory signal for downstream sketch-guided tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39560", "url": null, "sourceid": 36536, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38619, "uid": "e680644ba734ef9acc82b74484c7b689", "name": "Linear Fundamental Matrix Estimation from 7 or 5 Points", "authors": [{"id": 190319, "fullname": "Taci Ata Kucukpinar", "url": "http://cvpr.thecvf.com/api/miniconf/users/190319?format=json", "institution": "University of Missouri"}, {"id": 190320, "fullname": "Juan Mogollon", "url": "http://cvpr.thecvf.com/api/miniconf/users/190320?format=json", "institution": "University of Missouri"}, {"id": 190321, "fullname": "Joshua Fraser", "url": "http://cvpr.thecvf.com/api/miniconf/users/190321?format=json", "institution": "University of Missouri"}, {"id": 186594, "fullname": "Timothy Duff", "url": "http://cvpr.thecvf.com/api/miniconf/users/186594?format=json", "institution": "University of Missouri"}, {"id": 190322, "fullname": "Kannappan Palaniappan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190322?format=json", "institution": "University of Missouri"}], "abstract": "We revisit the problem of estimating the fundamental matrix of a pair of perspective cameras, a cornerstone of geometric computer vision.As is well-known, linear solvers require at least 8 point correspondences, whereas nonlinear minimal solvers require just 7 in the uncalibrated case or 5 in the calibrated case.In this paper, we consider a special case of the 7-point problem where 5 of the points are configured to lie on two lines, which has previously been shown to have a unique solution.As a theoretical contribution, we offer an analysis of how this uniqueness manifests in the standard 7-point algorithm. On a practical level, we provide the first practical linear solver for the minimal problem associated to this special configuration.Additionally, we evaluate a heuristic 5-point fundamental matrix solver based on the construction of virtual midpoints.When combined with early non-minimal fitting, the runtime and accuracy of our solver is competitive with the state-of-the-art (SoTA) on multiple benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38619", "url": null, "sourceid": 42343, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40341?format=json"], "related_events_ids": [40341]}, {"id": 38290, "uid": "c31298d87a7446a1854f5a71f475b343", "name": "Universal-to-Specific: Dynamic Knowledge-Guided Multiple Instance Learning for Few-Shot Whole Slide Image Classification", "authors": [{"id": 182204, "fullname": "Junjian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182204?format=json", "institution": "Central South Univertsity"}, {"id": 189509, "fullname": "Hulin Kuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189509?format=json", "institution": "Central South University"}, {"id": 189510, "fullname": "Jin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189510?format=json", "institution": "Central South University"}, {"id": 189511, "fullname": "Hailin Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/189511?format=json", "institution": "Central South University"}, {"id": 189512, "fullname": "Mengshen He", "url": "http://cvpr.thecvf.com/api/miniconf/users/189512?format=json", "institution": "Central South University"}, {"id": 189513, "fullname": "Jianxin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189513?format=json", "institution": "Central South University"}], "abstract": "Multiple Instance Learning (MIL) has emerged as the dominant paradigm for the analysis of gigapixel-scale Whole Slide Images (WSIs). However, recent methods leveraging guidance from Vision-Language Models often rely on static and universal pathological descriptions. This one-size-fits-all strategy fails to account for the vast morphological heterogeneity within individual WSIs, as its uniform guidance is not tailored to slide-specific visual evidence. To address this, we propose DyKo, a \\textbf{Dy}namic \\textbf{K}n\\textbf{o}wledge-guided MIL framework that adapts universal knowledge to slide-specific evidence for few-shot WSI classification. The core of DyKo is the WSI-Adaptive Knowledge Instantiation module (WAKI). WAKI begins by identifying key visual prototypes within a specific WSI's histology. These slide-specific prototypes then serve as queries to retrieve relevant concepts from a pathology knowledge base. This retrieved knowledge is then used to synthesize unique, knowledge-instantiated features for each instance, effectively instantiating tailored guidance at the patch level. To ensure fidelity and prevent semantic drift, we introduce a Structural Consistency loss that enforces alignment between knowledge-instantiated and visual features. Comprehensive experiments on four public real-world cancer datasets demonstrate that DyKo achieves superior performance over state-of-the-art methods in few-shot pathology diagnosis. Code will be made publicly available upon paper acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38290", "url": null, "sourceid": 36059, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38447, "uid": "f09845b1ef57647ae29b2833540f0028", "name": "OrthoFuse: Training-free Riemannian Fusion of Orthogonal Style-Concept Adapters for Diffusion Models", "authors": [{"id": 189874, "fullname": "Ali Aliev", "url": "http://cvpr.thecvf.com/api/miniconf/users/189874?format=json", "institution": "HSE University"}, {"id": 189875, "fullname": "Kamil Garifullin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189875?format=json", "institution": "Higher School of Economics"}, {"id": 189876, "fullname": "Nikolay Yudin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189876?format=json", "institution": "Higher School of Economics"}, {"id": 189877, "fullname": "Vera Soboleva", "url": "http://cvpr.thecvf.com/api/miniconf/users/189877?format=json", "institution": "FusionBrain Lab; Higher School of Economics"}, {"id": 189878, "fullname": "Alexander Molozhavenko", "url": "http://cvpr.thecvf.com/api/miniconf/users/189878?format=json", "institution": "Higher School of Economics"}, {"id": 189879, "fullname": "Ivan Oseledets", "url": "http://cvpr.thecvf.com/api/miniconf/users/189879?format=json", "institution": "Institute of Numerical Mathematics"}, {"id": 129266, "fullname": "Aibek Alanov", "url": "http://cvpr.thecvf.com/api/miniconf/users/129266?format=json", "institution": "Artificial Intelligence Research Institute"}, {"id": 189880, "fullname": "Maxim Rakhuba", "url": "http://cvpr.thecvf.com/api/miniconf/users/189880?format=json", "institution": "Higher School of Economics"}], "abstract": "In a rapidly growing field of model training there is a constant practical interest in parameter-efficient model fine-tuning and various techniques that use a small amount of training data to adapt the model to a narrow task. Despite the efficiency of LoRA, one of the most popular fine-tuning methods nowadays, there is an open question: how to combine several adapters tuned for different tasks in one which is able to yield adequate results on both tasks? Specifically, merging subject and style adapters for generative models remains unresolved. In this paper we seek to show that in the case of orthogonal fine-tuning (OFT), we can use structured orthogonal parametrization and, utilizing manifold theory, get the formulas for training-free adapter merging. In particular, we derive the structure of the manifold formed by $\\mathcal{GS}$ orthogonal matrices, and obtain efficient formulas for the geodesics approximation between two points. We identify that naive geodesic merging compresses spectral distributions, reducing expressiveness; our Cayley transform correction restores spectral properties for higher-quality fusion. We conduct experiments in subject-driven generation tasks showing that our technique to merge two $\\mathcal{GS}$ orthogonal matrices is capable to unite concept and style features of different adapters. To our knowledge, this is the first training-free method for merging multiplicative orthogonal adapters.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38447", "url": null, "sourceid": 38906, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39042, "uid": "c7b2bf69f1796f4b7bd46bf55d8920f8", "name": "Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving", "authors": [{"id": 130079, "fullname": "Jiahao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130079?format=json", "institution": "Johns Hopkins University"}, {"id": 129620, "fullname": "Bo Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/129620?format=json", "institution": "University of Texas, Austin"}, {"id": 157167, "fullname": "Yijing Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/157167?format=json", "institution": "Waymo"}, {"id": 150906, "fullname": "Vincent Casser", "url": "http://cvpr.thecvf.com/api/miniconf/users/150906?format=json", "institution": "Waymo"}, {"id": 135529, "fullname": "Songyou Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/135529?format=json", "institution": "Google DeepMind"}, {"id": 130668, "fullname": "Zehao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130668?format=json", "institution": "University of Texas at Austin"}, {"id": 132207, "fullname": "Meng-Li Shih", "url": "http://cvpr.thecvf.com/api/miniconf/users/132207?format=json", "institution": "University of Washington"}, {"id": 191234, "fullname": "Xander Masotto", "url": "http://cvpr.thecvf.com/api/miniconf/users/191234?format=json", "institution": "Waymo"}, {"id": 191235, "fullname": "Shih-Yang Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/191235?format=json", "institution": "Waymo LLC"}, {"id": 191236, "fullname": "Kanaad Parvate", "url": "http://cvpr.thecvf.com/api/miniconf/users/191236?format=json", "institution": "Waymo"}, {"id": 191237, "fullname": "Tiancheng Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/191237?format=json", "institution": "Waymo"}, {"id": 191238, "fullname": "Linn Bieske", "url": "http://cvpr.thecvf.com/api/miniconf/users/191238?format=json", "institution": "Google"}, {"id": 85225, "fullname": "Dragomir Anguelov", "url": "http://cvpr.thecvf.com/api/miniconf/users/85225?format=json", "institution": "Waymo"}, {"id": 138724, "fullname": "Mingxing Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/138724?format=json", "institution": "Waymo"}, {"id": 85215, "fullname": "Chiyu \u201cMax\u201d Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85215?format=json", "institution": "Waymo LLC"}], "abstract": "Robust training and validation of Autonomous Driving Systems (ADS) require massive, diverse datasets. Proprietary data collected by Autonomous Vehicle (AV) fleets, while high-fidelity, are limited in scale, diversity of sensor configurations, as well as geographic and long-tail-behavioral coverage. In contrast, in-the-wild data from sources like dashcams offers immense scale and diversity, capturing critical long-tail scenarios and novel environments. However, this unstructured, in-the-wild video data is incompatible with ADS expecting structured, multi-modal sensor inputs for system validation and training purposes. To bridge this data gap, we propose Sensor2Sensor, a novel generative modeling paradigm that translates in-the-wild monocular dashcam videos into a high-fidelity, multi-modal sensor suite that we refer to as the AV log, which includes multi-view camera images and LiDAR point clouds. A core challenge that arises is the lack of paired training data. We address this by converting real AV logs into dashcam-style videos via 4D Gaussian Splatting (4DGS) reconstruction and novel-view rendering. Sensor2Sensor then utilizes a diffusion architecture to perform the generative conversion. We perform a comprehensive set of quantitative evaluations on the fidelity and realism of the generated sensor data. We demonstrate Sensor2Sensor's practical utility by converting challenging in-the-wild internet and dashcam footage into realistic, multi-modal data formats, further unlocking vast external data sources for AV development.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39042", "url": null, "sourceid": 41224, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37535, "uid": "b47b84be1e32ada197b34168a7c88052", "name": "DiffBMP: Differentiable Rendering with Bitmap Primitives", "authors": [{"id": 129819, "fullname": "Seongmin Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/129819?format=json", "institution": "Seoul National University"}, {"id": 180148, "fullname": "Junghun James Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/180148?format=json", "institution": "Seoul National University"}, {"id": 187665, "fullname": "Daehyeop Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/187665?format=json", "institution": "Seoul National University"}, {"id": 187666, "fullname": "Insoo Chung", "url": "http://cvpr.thecvf.com/api/miniconf/users/187666?format=json", "institution": "Seoul National University"}, {"id": 87674, "fullname": "Se Young Chun", "url": "http://cvpr.thecvf.com/api/miniconf/users/87674?format=json", "institution": "Seoul National University"}], "abstract": "We introduce **DiffBMP**, a scalable and efficient differentiable rendering engine for a collection of bitmap images. Our work addresses a limitation that traditional differentiable renderers are constrained to vector graphics, given that most images in the world are bitmaps. Our core contribution is a highly parallelized rendering pipeline, featuring a custom CUDA implementation for calculating gradients. This system can, for example, optimize the position, rotation, scale, color, and opacity of thousands of bitmap primitives all in under 1 min using a consumer GPU. We employ and validate several techniques to facilitate the optimization: soft rasterization via Gaussian blur, structure-aware initialization, noisy canvas, and specialized losses/heuristics for videos or spatially constrained images. We demonstrate DiffBMP is not just an isolated tool, but a practical one designed to integrate into creative workflows. It supports exporting compositions to a native, layered file format, and the entire framework is publicly accessible via an easy-to-hack Python package.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37535", "url": null, "sourceid": 37220, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39332, "uid": "a2f32c0381843529ca285ed87f5d1982", "name": "GaussianMatch: Semi-Supervised Regression with Pseudo-Label Filtering via Multi-View Gaussian Consistency", "authors": [{"id": 153983, "fullname": "Yin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153983?format=json", "institution": "Zhejiang University"}, {"id": 70254, "fullname": "Hao Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/70254?format=json", "institution": "HKUST (GZ)"}, {"id": 191864, "fullname": "Zixuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191864?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 191865, "fullname": "Zhen Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/191865?format=json", "institution": "Zhejiang University"}, {"id": 191866, "fullname": "Li Kuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191866?format=json", "institution": "Central South University"}, {"id": 191867, "fullname": "Mengchu Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191867?format=json", "institution": "New Jersey Institute of Technology"}, {"id": 155454, "fullname": "Shuiguang Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/155454?format=json", "institution": "Zhejiang University"}], "abstract": "Semi-Supervised Regression (SSR) is essential in domains like sentiment analysis, healthcare, etc., where labeled data is limited but unlabeled data is plentiful. Despite its practical importance, SSR remains underexplored due to the lack of effective pseudo-labeling strategies for continuous outputs. Unlike classification, regression lacks inherent confidence measures, making it harder to filter and trust pseudo-labels. This limitation permits low-quality pseudo-labels to propagate during training without proper validation, significantly  amplifying prediction errors in semi-supervised regression frameworks. In this work, we propose GaussianMatch, a novel SSR framework enabling high-quality pseudo-label filtering, which selects reliable pseudo-labels through multi-view prediction consistency under feature-space smoothness assumptions. Our framework introduces two key innovations: 1) Gaussian Consistency Filter (GCF) that quantifies prediction consistency across weakly augmented views through Gaussian similarity scoring, retaining pseudo-labels only when all predictions fall within a confidence interval; 2) Adaptive Gaussian Standard Deviation Smoothing (AGDS) that enhances GCF's robustness through a Bayesian-regularized curriculum that phases confidence intervals from warm-up conservative bounds to progressively tightened thresholds. The use of AGDS ensures stable and reliable pseudo-label filtering throughout training. Extensive experiments demonstrate that GaussianMatch performs strongly across varying data conditions, showing notable robustness under extreme label scarcity. For instance, it outperforms the state of the art on UTKFace with only 30 labels, reducing error by 15.36\\% and improving the Coefficient of Determination by 50.21\\%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39332", "url": null, "sourceid": 37486, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36525, "uid": "9e039cde871ee92385cac87ca0468af2", "name": "ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation", "authors": [{"id": 175009, "fullname": "Bo Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175009?format=json", "institution": "HKUST(GZ)"}, {"id": 185271, "fullname": "Haotian Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185271?format=json", "institution": "EPFL - EPF Lausanne"}, {"id": 184642, "fullname": "Hehai Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184642?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 185272, "fullname": "Weiquan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185272?format=json", "institution": "The Hong Kong University of Science and Technology (GuangZhou)"}, {"id": 159049, "fullname": "Beier Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/159049?format=json", "institution": "Nanyang Technological University"}, {"id": 185273, "fullname": "Yao Shu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185273?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 184645, "fullname": "Chengwei Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184645?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}], "abstract": "Model merging aims to combine multiple task-specific expert models into a single model while preserving generalization across diverse tasks. However, interference among experts, especially when they are trained on different objectives, often leads to significant performance degradation. Despite recent progress, resolving this interference without data access, retraining, or architectural modification remains a fundamental challenge. This paper provides a theoretical analysis demonstrating that the input covariance of each task, which is a key factor for optimal merging, can be implicitly estimated from the parameter differences of its fine-tuned model, even in a fully data-free setting. Building on this insight, we introduce ACE-Merging, an Adaptive Covariance Estimation framework that effectively mitigates inter-task interference. Our approach features a principled, closed-form solution that contrasts with prior iterative or heuristic methods. Extensive experiments on both vision and language benchmarks demonstrate that ACE-Merging sets a new state-of-the-art among data-free methods. It consistently outperforms existing baselines; for example, ACE-Merging achieves an average absolute improvement of 4\\% over the previous methods across seven tasks on GPT-2. Owing to its efficient closed-form formulation, ACE-Merging delivers superior performance with a modest computational cost, providing a practical and theoretically grounded solution for model merging.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36525", "url": null, "sourceid": 43067, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36636, "uid": "f7633dd32138e8ff73ba5b69c7e0c88d", "name": "Learning Spatial-Temporal Consistency for 3D Semantic Scene Completion", "authors": [{"id": 102171, "fullname": "Yujie Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/102171?format=json", "institution": "HNU"}, {"id": 185522, "fullname": "Meng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185522?format=json", "institution": "Hunan University"}, {"id": 127241, "fullname": "Ruihui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/127241?format=json", "institution": "Hunan University"}, {"id": 127223, "fullname": "F anWu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127223?format=json", "institution": "Wuhan University"}, {"id": 185523, "fullname": "Liu Zhi-Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/185523?format=json", "institution": "Hunan University"}, {"id": 127251, "fullname": "Zhuo Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127251?format=json", "institution": "Hunan University"}, {"id": 127232, "fullname": "Kenli Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/127232?format=json", "institution": "Hunan University"}], "abstract": "Camera-based Semantic Scene Completion (SSC) is able to comprehensively understand the entire scene, but it suffers from ambiguous predictions due to occlusions and incomplete information. Temporal SSC alleviates this issue, but existing models simply stack multi-frame temporal features, which can lead to inconsistencies between geometry and semantics over time. In this paper, we present ConSSC, a novel SSC method that learns Spatial-Temporal Consistency. It works by lifting historical frames into a 3D scene-level occupancy framework, aggregating 2D and 3D historical features from current voxels, and learning from 2D visibility and similarity cues in a temporal buffer. Specifically, our framework introduces two key components: the Hierarchical Voxel Refinement module, which extracts a coarse occupancy from depth and refines it through voxel-level representations, recovering missing information. The Temporal Semantic Aggregation module effectively integrates semantic features from different viewpoints and time points, enabling the reconstruction of occluded regions in the current frame using historical context, aggregating them into corresponding voxel features. Without additional sensors or data, ConSSC improves both geometric and semantic consistency. Extensive experiments on the SemanticKITTI and SSCBench-KITTI-360 datasets show that ConSSC outperforms state-of-the-art camera-based and temporal SSC baselines by a significant margin in terms of IoU and mIoU.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36636", "url": null, "sourceid": 45345, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38902, "uid": "9d57a1e06a731e1b2377de6781e881b1", "name": "LongVT: Incentivizing \"Thinking with Long Videos\" via Native Tool Calling", "authors": [{"id": 186684, "fullname": "Zuhao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186684?format=json", "institution": "Nanyang Technological University"}, {"id": 190940, "fullname": "Sudong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190940?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 190359, "fullname": "Kaichen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190359?format=json", "institution": "The University of Hong Kong"}, {"id": 190360, "fullname": "Keming Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190360?format=json", "institution": "School of Software, Tsinghua University"}, {"id": 126180, "fullname": "Sicong Leng", "url": "http://cvpr.thecvf.com/api/miniconf/users/126180?format=json", "institution": "Nanyang Technological University"}, {"id": 73774, "fullname": "Yifan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73774?format=json", "institution": "National University of Singapore"}, {"id": 90785, "fullname": "Bo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/90785?format=json", "institution": "Nanyang Technological University"}, {"id": 184645, "fullname": "Chengwei Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184645?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 87301, "fullname": "Shijian Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87301?format=json", "institution": "Nanyang Technological University"}, {"id": 182987, "fullname": "Xingxuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182987?format=json", "institution": "MiroMind"}, {"id": 126171, "fullname": "Lidong Bing", "url": "http://cvpr.thecvf.com/api/miniconf/users/126171?format=json", "institution": "Alibaba DAMO Academy"}], "abstract": "Large multimodal models (LMMs) have shown great potential in video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucination, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos\u2014by first skimming globally and then examining relevant clips for details\u2014we introduce LongVT, an end-to-end agentic framework that sparks \"Thinking with Long Videos\" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs\u2019 inherent temporal grounding ability as a native video cropping tool to zoom in on specific video clips and resample finer-grained frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering data for long-video reasoning, we curated and will release a data suite named VideoSIAH to facilitate both training and evaluation. Our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.7K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. Our evaluation benchmark contains 1,280 QA pairs carefully verified through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validations, LongVT consistently outperforms strong existing baselines across four challenging long-video understanding and reasoning benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38902", "url": null, "sourceid": 31878, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36418, "uid": "58caf27a48e7930aefd5437298d64a70", "name": "$\\phi$-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models", "authors": [{"id": 77073, "fullname": "Thanh-Dat Truong", "url": "http://cvpr.thecvf.com/api/miniconf/users/77073?format=json", "institution": "University of Arkansas"}, {"id": 161743, "fullname": "Huu-Thien Tran", "url": "http://cvpr.thecvf.com/api/miniconf/users/161743?format=json", "institution": "University of Arkansas"}, {"id": 89965, "fullname": "Jackson Cothren", "url": "http://cvpr.thecvf.com/api/miniconf/users/89965?format=json", "institution": "University of Arkansas - Fayetteville"}, {"id": 89954, "fullname": "Bhiksha Raj", "url": "http://cvpr.thecvf.com/api/miniconf/users/89954?format=json", "institution": "Carnegie Mellon University"}, {"id": 71354, "fullname": "Khoa Luu", "url": "http://cvpr.thecvf.com/api/miniconf/users/71354?format=json", "institution": "University of Arkansas"}], "abstract": "Fairness in Continual Learning for Large Multimodal Models (LMMs) is an emerging yet underexplored challenge, particularly in the presence of imbalanced data distributions that can lead to biased model updates and suboptimal performance across tasks. While recent continual learning studies have made progress in addressing catastrophic forgetting, the problem of fairness caused the imbalanced data remains largely underexplored. This paper presents a novel Fairness Direct Preference Optimization (FaiDPO or $\\phi$-DPO) framework for continual learning in LMMs. In particular, we first propose a new continual learning paradigm based on Direct Preference Optimization (DPO) to mitigate catastrophic forgetting by aligning learning with pairwise preference signals. Then, we identify the limitations of conventional DPO in imbalanced data and present a new $\\phi$-DPO loss that explicitly addresses distributional biases. We provide a comprehensive theoretical analysis demonstrating that our approach addresses both forgetting and data imbalance. Additionally, to enable $\\phi$-DPO-based continual learning, we construct pairwise preference annotations for existing benchmarks in the context of continual learning. Extensive experiments and ablation studies show the proposed $\\phi$-DPO achieves State-of-the-Art performance across multiple benchmarks, outperforming prior continual learning methods of LMMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36418", "url": "https://uark-cviu.github.io/projects/Fai-DPO/", "sourceid": 44390, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36272, "uid": "d82493b6823132975adf6bd97c53014e", "name": "CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation", "authors": [{"id": 179993, "fullname": "Shilei Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/179993?format=json", "institution": "Sun Yat-sen University"}, {"id": 184641, "fullname": "Ziyang Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/184641?format=json", "institution": "Shanghai Artificial Intelligence Laboratory; SUN YAT-SEN UNIVERSITY"}, {"id": 184642, "fullname": "Hehai Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184642?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 184643, "fullname": "Yang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184643?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 184644, "fullname": "Jiashun Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184644?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 146676, "fullname": "Xiaoxing Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/146676?format=json", "institution": "Beijing Institute of Technology"}, {"id": 156141, "fullname": "Haoyuan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156141?format=json", "institution": "Sun Yat-Sen University"}, {"id": 143266, "fullname": "Guowen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/143266?format=json", "institution": "Sun Yat-Sen University"}, {"id": 184645, "fullname": "Chengwei Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184645?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 184646, "fullname": "Hong Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184646?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 128276, "fullname": "Xue Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128276?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 131057, "fullname": "Juepeng Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/131057?format=json", "institution": "Sun Yat-Sen University"}, {"id": 131048, "fullname": "Haohuan Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131048?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "In Remote Sensing (RS), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key approach to activate the generalizable representation ability of foundation models for downstream tasks. However, existing specialized PEFT methods often fail when applied to large-scale Earth observation tasks, as they are unable to fully handle the multifaceted and unpredictable domain gaps (e.g., spatial, semantic, and frequency shifts) inherent in RS data. To overcome this, we propose CrossEarth-Gate, which introduces two primary contributions. First, we establish a comprehensive RS module toolbox to address multifaceted domain gaps, comprising spatial, semantic, and frequency modules. Second, we develop a Fisher-guided adaptive selection mechanism that operates on this toolbox. This selection is guided by Fisher Information to quantify each module's importance by measuring its contribution to the task-specific gradient flow. It dynamically activates only the most critical modules at the appropriate layers, guiding the gradient flow to maximize adaptation effectiveness and efficiency. Comprehensive experiments validate the efficacy and generalizability of our method, where CrossEarth-Gate achieves state-of-the-art performance across 16 cross-domain benchmarks for RS semantic segmentation. The code of the work is provided in the supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36272", "url": "https://github.com/ShileiCao/CrossEarth-Gate", "sourceid": 31273, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65700, "file": "/media/PosterPDFs/CVPR%202026/36272.png", "modified": "2026-04-16T22:25:02.914781-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": true, "sortkey": 0, "is_live_content": false, "detailed_kind": "", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65701, "file": "/media/PosterPDFs/CVPR%202026/36272-thumb.png", "modified": "2026-04-16T22:25:03.121989-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": false, "sortkey": 0, "is_live_content": false, "detailed_kind": "thumb", "generated_from": null, "resourcetype": "EventmediaImageFile"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38706, "uid": "19b3fa53ed04b475f8eca3c2f862b60c", "name": "PhyCo: Learning Controllable Physical Priors for Generative Motion", "authors": [{"id": 131212, "fullname": "Sriram Narayanan", "url": "http://cvpr.thecvf.com/api/miniconf/users/131212?format=json", "institution": "Carnegie Mellon University"}, {"id": 127828, "fullname": "Ziyu Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127828?format=json", "institution": "Texas A&amp;M"}, {"id": 89406, "fullname": "Srinivasa G. Narasimhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/89406?format=json", "institution": "Carnegie Mellon University"}, {"id": 75820, "fullname": "Manmohan Chandraker", "url": "http://cvpr.thecvf.com/api/miniconf/users/75820?format=json", "institution": "UC San Diego"}], "abstract": "Modern video diffusion models excel at appearance synthesis but still struggle with physical consistency: objects drift, collisions lack realistic rebound, and material responses seldom match their underlying properties. We present PhyCo, a framework that introduces continuous, interpretable, and physically grounded control into video generation. Our approach integrates three key components: (i) a large-scale dataset of over 100K photorealistic simulation videos where friction, restitution, deformation, and force are systematically varied; (ii) physics-supervised fine-tuning of a pretrained diffusion model using a ControlNet conditioned on pixel-aligned physical property maps; and (iii) VLM-guided reward optimization, where a fine-tuned vision\u2013language model evaluates generated videos with targeted physics queries and provides differentiable feedback. This combination enables a generative model to produce physically consistent and controllable outputs through variations in physical attributes\u2014without any simulator or geometry reconstruction at inference. On the Physics-IQ benchmark, PhyCo significantly improves physical realism over strong baselines, and human studies confirm clearer and more faithful control over physical attributes. Our results demonstrate a scalable path toward physically consistent, controllable generative video models that generalize beyond synthetic training environments.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38706", "url": "https://phyco-video.github.io/", "sourceid": 39514, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38209, "uid": "128481e48bed12a68431ca6c954cbebe", "name": "Foca-VLA: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation", "authors": [{"id": 189313, "fullname": "Yang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189313?format=json", "institution": "Tongji University"}, {"id": 189314, "fullname": "Zhaxizhuoma Zhaxizhuoma", "url": "http://cvpr.thecvf.com/api/miniconf/users/189314?format=json", "institution": "Shanghai Jiaotong University; Shanghai Artificial Intelligence Laboratory"}, {"id": 189315, "fullname": "Hongru Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189315?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 189316, "fullname": "Junjie Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/189316?format=json", "institution": "Shanghai Artificial Intelligence Laboratory; Shanghai University of Science and Technology"}, {"id": 189317, "fullname": "Hongquan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189317?format=json", "institution": "East China Normal University"}, {"id": 189318, "fullname": "Jinda Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/189318?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 189319, "fullname": "Yunsong Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/189319?format=json", "institution": ""}, {"id": 90129, "fullname": "Jia Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90129?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 189320, "fullname": "Ce Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189320?format=json", "institution": "national university of singaore, National University of Singapore"}, {"id": 71462, "fullname": "Jieji Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/71462?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 189321, "fullname": "Qiaojun Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189321?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 86440, "fullname": "Cewu Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86440?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 86632, "fullname": "Yu Qiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86632?format=json", "institution": "Shanghai Aritifcal Intelligence Laboratory"}, {"id": 88208, "fullname": "Jiangmiao Pang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88208?format=json", "institution": "Shanghai AI Laboratory "}], "abstract": "Embodied intelligence for contact-rich manipulation has predominantly relied on position control, while explicit awareness and regulation of interaction forces remain under-explored, limiting stability, precision, and robustness in real-world tasks. We propose Foca-VLA, an end-to-end vision-language-action framework that equips robots with hybrid force-position control and explicit force awareness. Foca-VLA introduces force-based prompts into the VLM expert to construct force-aware task concepts across stages, and employs a cross-scale routing Mixture-of-Experts (MoE) with impedance control in the action expert to adaptively fuse these concepts with real-time interaction forces for closed-loop hybrid force--position regulation. To support learning and evaluation, we construct Foca-Dataset, containing 1,000 trajectories over 5 contact-rich tasks, including wiping, pressing, and assembling, with multi-view images, task prompts, proprioceptive state, and force signals. Extensive experiments show that Foca-VLA substantially improves success rates and reliability in contact-rich manipulation, outperforming Pi0 and Pi0.5 by 48.0% and 35.0%, respectively, across the 5 tasks, and mitigating common failure modes such as arm overload and unstable contact, thereby advancing force-aware physical intelligence in VLAs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38209", "url": null, "sourceid": 45012, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37108, "uid": "5e740f261196151a0b02c7b2fccf6ae6", "name": "Synergistic Bleeding Region and Point Detection in Laparoscopic Surgical Videos", "authors": [{"id": 186673, "fullname": "Jialun Pei", "url": "http://cvpr.thecvf.com/api/miniconf/users/186673?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 186674, "fullname": "Zhangjun Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186674?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 132802, "fullname": "Diandian Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/132802?format=json", "institution": "Universit\u00e4t Stuttgart"}, {"id": 186675, "fullname": "Zhixi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186675?format=json", "institution": "Southern Medical University"}, {"id": 127069, "fullname": "Jing Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/127069?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 84747, "fullname": "Bo Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/84747?format=json", "institution": "Wuhan University"}, {"id": 87709, "fullname": "Pheng-Ann Heng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87709?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "Intraoperative bleeding in laparoscopic surgery causes rapid obscuration of the operative field to hinder the surgical process and increases the risk of postoperative complications. Intelligent detection of bleeding areas can quantify the blood loss to assist decision-making, while locating bleeding points helps surgeons quickly identify the source of bleeding and achieve hemostasis in time to improve surgical success rates. To fill the benchmark gap, we first construct a real-world laparoscopic surgical bleeding detection dataset, named SurgBlood, comprising 5,330 frames from 95 surgical video clips with bleeding region and point annotations. Accordingly, we develop a dual-task synergistic online detector called BlooDet, enabling simultaneous detection of bleeding regions and points in laparoscopic surgery. The baseline embraces a dual-branch bidirectional guidance design based on Segment Anything Model 2. The mask branch detects bleeding regions through adaptive edge and point prompt embeddings, while the point branch leverages mask memory to induce bleeding point memory modeling and captures point motion direction via inter-frame optical flow. By coupled bidirectional guidance, our framework explores spatial-temporal correlations while exploiting memory modeling to infer current bleeding status. Extensive experiments indicate that our method outperforms 13 counterparts in bleeding detection. Code and data are available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37108", "url": null, "sourceid": 34366, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38645, "uid": "ee90b45cf1106fef95ee81de63d7a322", "name": "FINE: Factorizing Knowledge for Initialization of Variable-sized Diffusion Models", "authors": [{"id": 157444, "fullname": "Yucheng Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/157444?format=json", "institution": "Southeast University"}, {"id": 151464, "fullname": "Fu Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/151464?format=json", "institution": "Southeast University"}, {"id": 188707, "fullname": "Ruixiao Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188707?format=json", "institution": "Southeast University"}, {"id": 149140, "fullname": "Jianlu Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/149140?format=json", "institution": "School of Computer Science and Engineering, Southeast University"}, {"id": 157445, "fullname": "Jing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157445?format=json", "institution": "Southeast University"}, {"id": 190379, "fullname": "Yong Rui", "url": "http://cvpr.thecvf.com/api/miniconf/users/190379?format=json", "institution": "Berkeley Frontier Fund"}, {"id": 84884, "fullname": "Xin Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/84884?format=json", "institution": "Southeast University"}], "abstract": "The training of diffusion models is computationally intensive, making effective pre-training essential. However, real-world deployments often demand models of variable sizes due to diverse memory and computational constraints, posing challenges when corresponding pre-trained versions are unavailable.To address this, we propose FINE, a novel pre-training method whose resulting model can flexibly factorize its knowledge into fundamental components, termed learngenes, enabling direct initialization of models of various sizes and eliminating the need for repeated pre-training.Rather than optimizing a conventional full-parameter model, FINE represents each layer\u2019s weights as the product of $U_{\\star}$, $\\Sigma_{\\star}^{(l)}$, and $V_{\\star}^\\top$, where $U_{\\star}$ and $V_{\\star}$ serve as size-agnostic learngenes shared across layers, while $\\Sigma_{\\star}^{(l)}$ remains layer-specific.By jointly training these components, FINE forms a decomposable and transferable knowledge structure that allows efficient initialization through flexible recombination of learngenes, requiring only light retraining of $\\Sigma_{\\star}^{(l)}$ on limited data.Extensive experiments demonstrate the efficiency of FINE, achieving state-of-the-art performance in initializing variable-sized models across diverse resource-constrained deployments. Furthermore, models initialized by FINE effectively adapt to diverse tasks, showcasing the task-agnostic versatility of learngenes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38645", "url": null, "sourceid": 32292, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38067, "uid": "e25e74105b0ea8f9e8403033b7444f34", "name": "Tackling Alignment Ambiguity in Person Retrieval through Conversational Attribute Mining", "authors": [{"id": 188972, "fullname": "Hao Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/188972?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 188973, "fullname": "Runqing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188973?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 188974, "fullname": "Jin Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/188974?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 181590, "fullname": "xue zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/181590?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 188975, "fullname": "Jianxiao Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/188975?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 188976, "fullname": "Mingzhu Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/188976?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Text-to-Image Person Retrieval (TIPR) aims to retrieve pedestrian images with a given natural-language description. It remains highly challenging due to the inherent ambiguity in cross-modal alignment: existing models often struggle to capture fine-grained correspondences, and their understanding of detailed pedestrian attributes is typically confined to partial or coarse cues, leading to mismatched or erroneous retrieval results.To overcome this challenge, we propose CECA, a Conversation-Enhanced Cross-modal Alignment framework. CECA strengthens the attribute correspondence between textual and visual modalities through multimodal large language models (MLLMs)-guided dialogue, enhances detailed cross-modal matching via a Bidirectional Correlation Matching (BCM) mechanism, and stabilizes optimization with a Confidence-Aware Weighting Loss (CAWL) that reduces the impact of low-quality conversational responses. Extensive experiments on three public benchmarks demonstrate the superior performance and strong generalization ability of our approach.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38067", "url": null, "sourceid": 32668, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39737, "uid": "e7004e2ce735b7e8af316bdc0cb84fad", "name": "Continuous Exposure-Time Modeling for Realistic Atmospheric Turbulence Synthesis", "authors": [{"id": 175351, "fullname": "junwei zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/175351?format=json", "institution": "Nanjing University Of Aeronautics And Astronautics"}, {"id": 192754, "fullname": "Dong Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192754?format=json", "institution": null}, {"id": 156156, "fullname": "Sheng-Jun Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156156?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 190270, "fullname": "Kun Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190270?format=json", "institution": "Lanzhou University"}, {"id": 192755, "fullname": "Songcan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192755?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}], "abstract": "Atmospheric turbulence significantly degrades long-range imaging by introducing geometric warping and exposure-time-dependent blur, which adversely affects both visual quality and the performance of high-level vision tasks. Existing methods for synthesizing turbulence effects often oversimplify the relationship between blur and exposure-time, typically assuming fixed or binary exposure settings. This leads to unrealistic synthetic data and limited generalization capability of trained models. To address this gap, we revisit the modulation transfer function (MTF) formulation and propose a novel Exposure-Time-dependent MTF (ET-MTF) that models blur as a continuous function of exposure-time. For blur synthesis, we derive a tilt-invariant point spread function (PSF) from the ET-MTF, which, when integrated with a spatially varying blur-width field, provides a comprehensive and physically accurate characterization of turbulence-induced blur. Building on this synthesis pipeline, we construct ET-Turb, a large-scale synthetic turbulence dataset that explicitly incorporates continuous exposure-time modeling across diverse optical and atmospheric conditions. The dataset comprises 5,083 videos (2,005,835 frames), partitioned into 3,988 training and 1,095 test videos. Extensive experiments demonstrate that models trained on ET-Turb produce more realistic restorations and achieve superior generalization on real-world turbulence data compared to those trained on other datasets. Our dataset will be publicly released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39737", "url": null, "sourceid": 31184, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38603, "uid": "cc6177dae9053807361351c19dee7af7", "name": "Elucidating the SNR-t Bias of Diffusion Probabilistic Models", "authors": [{"id": 187048, "fullname": "Meng Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187048?format=json", "institution": "Lanzhou University"}, {"id": 185770, "fullname": "Lei Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/185770?format=json", "institution": "Alibaba Group"}, {"id": 103451, "fullname": "Jianhao Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/103451?format=json", "institution": "Tianjin University"}, {"id": 88278, "fullname": "Xiangxiang Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88278?format=json", "institution": "MeiTuan"}, {"id": 190270, "fullname": "Kun Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190270?format=json", "institution": "Lanzhou University"}], "abstract": "Diffusion Probabilistic Models have demonstrated remarkable performance across a wide range of generative tasks. However, we have observed that these models often suffer from a Signal-to-Noise Ratio\u2013timestep (SNR-t) bias. This bias refers to the misalignment between the SNR of the denoising sample and its corresponding timestep during the inference phase. Specifically, during training, the SNR of a sample is strictly coupled with its timestep; however, this correspondence is disrupted during inference, leading to error accumulation and performance degradation in generation quality. We provide comprehensive empirical evidence and theoretical analysis to substantiate this phenomenon and propose a simple yet effective differential correction method to mitigate the SNR-t bias. Recognizing that diffusion models typically reconstruct low-frequency components before focusing on high-frequency details during the reverse denoising process, we decompose samples into various frequency components and apply differential correction to each component individually. Extensive experiments show that our approach significantly improves the generation quality of various diffusion models (IDDPM, ADM, DDIM, A-DPM, EA-DPM, EDM, PFGM++, and SDXL) on datasets of various resolutions with negligible computational overhead.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38603", "url": null, "sourceid": 33707, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38281, "uid": "bed2ebfb0c0857dd7a048030a237fe43", "name": "PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards", "authors": [{"id": 156627, "fullname": "Shulei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156627?format=json", "institution": "Zhejiang University"}, {"id": 126204, "fullname": "Longhui Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/126204?format=json", "institution": "Huawei Cloud Technologies Ltd."}, {"id": 147712, "fullname": "XIN HE", "url": "http://cvpr.thecvf.com/api/miniconf/users/147712?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 189491, "fullname": "Jianbo Ouyang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189491?format=json", "institution": null}, {"id": 189492, "fullname": "Hui Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189492?format=json", "institution": ""}, {"id": 84818, "fullname": "Zhou Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84818?format=json", "institution": "Zhejiang University, Tsinghua University"}, {"id": 88171, "fullname": "Qi Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/88171?format=json", "institution": "Huawei Technologies Ltd."}], "abstract": "Personalized generation models for a single subject have demonstrated remarkable effectiveness, highlighting their significant potential. However, when extended to multiple subjects, existing models often exhibit degraded performance, particularly in maintaining subject consistency and adhering to textual prompts. We attribute these limitations to the absence of high-quality multi-subject datasets and refined post-training strategies. To address these challenges, we propose a scalable multi-subject data generation pipeline that leverages powerful single-subject generation models to construct diverse and high-quality multi-subject training data. Through this dataset, we first enable single-subject personalization models to acquire knowledge of synthesizing multi-image and multi-subject scenarios. Furthermore, to enhance both subject consistency and text controllability, we design a set of Pairwise Subject-Consistency Rewards and general-purpose rewards, which are incorporated into a refined reinforcement learning stage. To comprehensively evaluate multi-subject personalization, we introduce a new benchmark that assesses model performance using seven subsets across three dimensions. Extensive experiments demonstrate the effectiveness of our approach in advancing multi-subject personalized image generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38281", "url": null, "sourceid": 37322, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37597, "uid": "e1c60d74120b7d2257c1f2a7561dfe41", "name": "RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning", "authors": [{"id": 99564, "fullname": "Ehsan Ahmadi", "url": "http://cvpr.thecvf.com/api/miniconf/users/99564?format=json", "institution": "University of Alberta"}, {"id": 183596, "fullname": "Hunter Schofield", "url": "http://cvpr.thecvf.com/api/miniconf/users/183596?format=json", "institution": "York University"}, {"id": 187791, "fullname": "Behzad Khamidehi", "url": "http://cvpr.thecvf.com/api/miniconf/users/187791?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 187792, "fullname": "Fazel Arasteh", "url": "http://cvpr.thecvf.com/api/miniconf/users/187792?format=json", "institution": null}, {"id": 187793, "fullname": "Jinjun Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187793?format=json", "institution": "York University"}, {"id": 187794, "fullname": "Lili Mou", "url": "http://cvpr.thecvf.com/api/miniconf/users/187794?format=json", "institution": "University of Alberta"}, {"id": 128417, "fullname": "Dongfeng Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/128417?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 187795, "fullname": "Kasra Rezaee", "url": "http://cvpr.thecvf.com/api/miniconf/users/187795?format=json", "institution": "Waabi"}], "abstract": "Supervised open-loop training has been widely adopted for training traffic simulation models; however, it fails to capture the inherently dynamic, multi-agent interactions common in complex driving scenarios. We introduce RLFTSim, a reinforcement-learning-based fine-tuning framework that enhances scenario realism by aligning simulator rollouts with real-world data distributions and provides a method for distilling goal-conditioned controllability in scenario generation. We instantiate RLFTSim on top of a pre-trained simulation model, design a reward that balances fidelity and controllability, and perform comprehensive experiments on the Waymo Open Motion Dataset. Our results show improvements in realism enhancement, achieving state-of-the-art performance. Compared with other heuristic search-based fine-tuning methods, RLFTSim requires significantly fewer samples due to a proposed low-variance and dense reward signal, and it directly addresses the realism alignment issue by design. We also demonstrate the effectiveness of our approach for distilling traffic simulation controllability through goal conditioning.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37597", "url": null, "sourceid": 35096, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36977, "uid": "f0eefcbcfb4afc1b3fbef0018e0773a0", "name": "Anomaly-Related Residual Fields for Cross-domain Anomaly Detection", "authors": [{"id": 181371, "fullname": "Kewei Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/181371?format=json", "institution": "Zhejiang University"}, {"id": 186364, "fullname": "Jiayi Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/186364?format=json", "institution": "Zhejiang University of Technology"}, {"id": 176666, "fullname": "Zhengda Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/176666?format=json", "institution": "ChongQing University"}, {"id": 186365, "fullname": "Weijun Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186365?format=json", "institution": "EbTech Co. Ltd."}, {"id": 186366, "fullname": "Lingxiang Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/186366?format=json", "institution": "Zhejiang University"}, {"id": 146606, "fullname": "Kejia Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/146606?format=json", "institution": "Zhejiang University"}, {"id": 86263, "fullname": "Zunlei Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86263?format=json", "institution": "Zhejiang University"}, {"id": 186367, "fullname": "Yijun Bei", "url": "http://cvpr.thecvf.com/api/miniconf/users/186367?format=json", "institution": "Zhejiang University"}], "abstract": "Label-free image anomaly detection is difficult because anomalies must be separated from intra-normal variability. Diffusion models learn a manifold for normal data, and, under the common assumption that off-manifold anomalies are harder to generate and yield larger prediction errors, many methods build detectors from prediction residuals; yet reverse-process stochasticity and complex but normal structure also produce large residuals, so magnitude alone is non-diagnostic. To clarify what is recoverable from such noisy residuals, the theory examines how residual signals propagate through later reverse steps, showing that variability consistent with normal statistics is gradually absorbed toward stationarity, whereas anomalous regions retain an additional non-stationary signal that persists. Building on this insight, the Residual\u2013Evolution Field (REF) isolates this persistent signal, with labeled source data calibrating the extractor and Cross-domain Field Alignment (CFA) transferring it to unlabeled targets. A theoretical framework with formal guarantees is established, and experiments across multiple benchmarks under substantial domain shifts demonstrate state-of-the-art performance, improving over strong baselines by 2.01\u201314 percentage points (pp).", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36977", "url": null, "sourceid": 32591, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39574, "uid": "a85e8c31c51cc06e0b9abc2076abe38c", "name": "Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence", "authors": [{"id": 93916, "fullname": "Panagiotis Filntisis", "url": "http://cvpr.thecvf.com/api/miniconf/users/93916?format=json", "institution": "National Technical University of Athens; Athena Research and Innovation Centre"}, {"id": 192383, "fullname": "George Retsinas", "url": "http://cvpr.thecvf.com/api/miniconf/users/192383?format=json", "institution": "NVIDIA"}, {"id": 128471, "fullname": "Radek Danecek", "url": "http://cvpr.thecvf.com/api/miniconf/users/128471?format=json", "institution": "Max Planck Institute for Intelligent Systems, Max-Planck Institute"}, {"id": 133591, "fullname": "Vanessa Sklyarova", "url": "http://cvpr.thecvf.com/api/miniconf/users/133591?format=json", "institution": "Max Planck Institute for Intelligent Systems, Tubingen"}, {"id": 130563, "fullname": "Petros Maragos", "url": "http://cvpr.thecvf.com/api/miniconf/users/130563?format=json", "institution": "National Technical University of Athens"}, {"id": 88338, "fullname": "Timo Bolkart", "url": "http://cvpr.thecvf.com/api/miniconf/users/88338?format=json", "institution": "Google"}], "abstract": "Recent learning-based face reconstruction and registration frameworks such as ToFu and TEMPEH have shown that dense correspondence between facial scans and a common topology can be learned directly from images. However, these approaches still depend on precomputed registrations obtained through iterative optimization pipelines that often require manual verification and correction by human annotators. We introduce MOCHI (Multi-view Optimizable Correspondence of Heads from Images), a fully differentiable and registration-free alternative. Instead of relying on optimization-based registrations, we employ a pseudo-linear inverse kinematic solver in conjunction with dense 2D keypoints produced by a tracker trained only on synthetic data to directly enforce a common face topology at the vertex level. We further find that the commonly used point-to-surface distance can lead to unstable training and artifacts, and instead use pointmap- and normal-based losses that provide smoother gradients, more stable optimization, and improved reconstruction results.Additionally, we introduce at inference a brief test-time-optimization scheme which can further refine the results of the network, resulting in registrations that outperform traditional labor-intensive pipelines.Despite removing external registrations, our extensive experimental results show that MOCHI surpasses the previous state-of-the-art in reconstruction accuracy and visual fidelity. The code and the model will be made public.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39574", "url": null, "sourceid": 42782, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37065, "uid": "7dcf06698a343e7d4985689898cb8491", "name": "Minimal Constraint Relaxation for Multiview Autocalibration", "authors": [{"id": 182834, "fullname": "Norio Kosaka", "url": "http://cvpr.thecvf.com/api/miniconf/users/182834?format=json", "institution": "National  Institute of Informatics"}, {"id": 186594, "fullname": "Timothy Duff", "url": "http://cvpr.thecvf.com/api/miniconf/users/186594?format=json", "institution": "University of Missouri"}, {"id": 84644, "fullname": "Tomas Pajdla", "url": "http://cvpr.thecvf.com/api/miniconf/users/84644?format=json", "institution": "CIIRC - Czech Technical University in Prague"}], "abstract": "Polynomial systems in multiview geometry are often highly over-constrained, and na\u00efve subsampling or elimination can lead to unstable or inconsistent estimation. We revisit this issue through the lens of \\emph{constraint relaxation}\u2014the selective removal of equations to recover a finite and well-conditioned solution space. Focusing on the Kruppa equations for camera autocalibration, we introduce the notion of \\emph{minimal relaxation}, a principled framework for identifying constraint subsets that preserve geometric validity while restoring solvability. Through symbolic analysis of the full three-view Kruppa system, we enumerate and classify all relaxation patterns, revealing algebraically minimal families that yield finite, well-conditioned problems.Comprehensive experiments validate this analysis across symbolic and numerical settings.Using homotopy continuation and synthetic perturbations, we show that specific relaxations remain stable under noise and permutation.Experiments with synthetic and real images further demonstrate that these relaxations consistently outperform the classical SVD-based Kruppa formulation in both robustness and calibration accuracy, establishing algebraic relaxation as a powerful paradigm for stable multiview autocalibration.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37065", "url": null, "sourceid": 45652, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37120, "uid": "ea991922e5d2d407b59be72152863092", "name": "Do You Have Freestyle? Expressive Humanoid Locomotion via Audio Control", "authors": [{"id": 103186, "fullname": "Zhe Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/103186?format=json", "institution": "\u534e\u4e2d\u79d1\u6280\u5927\u5b66"}, {"id": 186708, "fullname": "Cheng Chi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186708?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 186709, "fullname": "Yangyang Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/186709?format=json", "institution": "Harbin Institute of Technology"}, {"id": 162957, "fullname": "Boan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/162957?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 186710, "fullname": "Tao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186710?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 178399, "fullname": "Zhenguo Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/178399?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 186711, "fullname": "Yibo Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186711?format=json", "institution": null}, {"id": 156220, "fullname": "Pengwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156220?format=json", "institution": "baai-\u5317\u4eac\u4eba\u5de5\u667a\u80fd\u7814\u7a76\u9662"}, {"id": 128283, "fullname": "Zhongyuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128283?format=json", "institution": "Kuaishou Inc."}, {"id": 186712, "fullname": "Fangzhou Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186712?format=json", "institution": "Harbin Institute of Technology"}, {"id": 87489, "fullname": "Chang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87489?format=json", "institution": "University of Sydney"}, {"id": 91956, "fullname": "Shanghang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91956?format=json", "institution": "Peking University"}], "abstract": "Humans intuitively move to sound, but current humanoid robots lack expressive improvisational capabilities, confined to predefined motions or sparse commands. Generating motion from audio and then retargeting it to robots relies on explicit motion reconstruction, leading to cascaded errors, high latency, and disjointed acoustic-actuation mapping. We propose RoboPerform, the first unified audio-to-locomotion framework that can directly generate music-driven dance and speech-driven co-speech gestures from audio. Guided by the core principle of \"motion = content + style\", the framework treats audio as implicit style signals and eliminates the need for explicit motion reconstruction. RoboPerform integrates a ResMoE teacher policy for adapting to diverse motion patterns and a diffusion-based student policy for audio style injection. This retargeting-free design ensures low latency and high fidelity. Experimental validation shows that RoboPerform achieves promising results in physical plausibility and audio alignment, successfully transforming robots into responsive freestyle performers capable of reacting to audio.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37120", "url": null, "sourceid": 32413, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40341, "uid": "e680644ba734ef9acc82b74484c7b689", "name": "Linear Fundamental Matrix Estimation from 7 or 5 Points", "authors": [{"id": 190319, "fullname": "Taci Ata Kucukpinar", "url": "http://cvpr.thecvf.com/api/miniconf/users/190319?format=json", "institution": "University of Missouri"}, {"id": 190320, "fullname": "Juan Mogollon", "url": "http://cvpr.thecvf.com/api/miniconf/users/190320?format=json", "institution": "University of Missouri"}, {"id": 190321, "fullname": "Joshua Fraser", "url": "http://cvpr.thecvf.com/api/miniconf/users/190321?format=json", "institution": "University of Missouri"}, {"id": 186594, "fullname": "Timothy Duff", "url": "http://cvpr.thecvf.com/api/miniconf/users/186594?format=json", "institution": "University of Missouri"}, {"id": 190322, "fullname": "Kannappan Palaniappan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190322?format=json", "institution": "University of Missouri"}], "abstract": "We revisit the problem of estimating the fundamental matrix of a pair of perspective cameras, a cornerstone of geometric computer vision.As is well-known, linear solvers require at least 8 point correspondences, whereas nonlinear minimal solvers require just 7 in the uncalibrated case or 5 in the calibrated case.In this paper, we consider a special case of the 7-point problem where 5 of the points are configured to lie on two lines, which has previously been shown to have a unique solution.As a theoretical contribution, we offer an analysis of how this uniqueness manifests in the standard 7-point algorithm. On a practical level, we provide the first practical linear solver for the minimal problem associated to this special configuration.Additionally, we evaluate a heuristic 5-point fundamental matrix solver based on the construction of virtual midpoints.When combined with early non-minimal fitting, the runtime and accuracy of our solver is competitive with the state-of-the-art (SoTA) on multiple benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40341", "url": null, "sourceid": -42343, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38619?format=json"], "related_events_ids": [38619]}, {"id": 37201, "uid": "3ffd71f587fdbbec02e6d4a51c962b10", "name": "SaPaVe: Towards Active Perception and Manipulation in Vision-Language Action Models for Robot", "authors": [{"id": 175270, "fullname": "MENGZHEN LIU", "url": "http://cvpr.thecvf.com/api/miniconf/users/175270?format=json", "institution": "PEKING UNIVERSITY"}, {"id": 127179, "fullname": "Enshen Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/127179?format=json", "institution": "Beihang university"}, {"id": 186708, "fullname": "Cheng Chi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186708?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 186909, "fullname": "Yi Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/186909?format=json", "institution": "Beihang University"}, {"id": 186910, "fullname": "Shanyu Rong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186910?format=json", "institution": "Peking University"}, {"id": 186911, "fullname": "Liming Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186911?format=json", "institution": "Harbin Institute of Technology"}, {"id": 156220, "fullname": "Pengwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156220?format=json", "institution": "baai-\u5317\u4eac\u4eba\u5de5\u667a\u80fd\u7814\u7a76\u9662"}, {"id": 128283, "fullname": "Zhongyuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128283?format=json", "institution": "Kuaishou Inc."}, {"id": 91956, "fullname": "Shanghang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91956?format=json", "institution": "Peking University"}], "abstract": "Active perception and manipulation are crucial for embodied robots to interact with complex scenes. Existing methods struggle to unify semantic-driven perception actively with robust, viewpoint-invariant execution accordingly. To this end, we propose SaPaVe, an end-to-end framework that jointly learns these capabilities in a data-efficient manner. Central to our approach is a decoupling of camera and manipulation actions, contrary to shared-action-space, and learning in a bottom-up strategy: we first train semantic camera control on our proposed large-scale dataset, then jointly optimize both action types via hybrid data. To support this learning, we introduce ActiveViewPose-200K, comprising 200k image-language-camera movement pairs for semantic camera movement learning, and a 3D geometry-aware module that improves execution robustness under dynamic viewpoints. We further present ActiveManip-Bench, the first benchmark filling the gap to evaluate active manipulation. Extensive experiments in both simulation and real-world settings show that SaPaVe outperforms recent VLA models such as GR00T and $\\pi_0$, achieving up to 31.25\\% higher success rates in real-world tasks.  Our results show that tightly coupled perception and execution, when trained with decoupled yet coordinated strategies, enable efficient and generalizable active manipulation.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37201", "url": null, "sourceid": 39256, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37106, "uid": "7edaadde50012f0860952123564eb1ba", "name": "SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts", "authors": [{"id": 186667, "fullname": "Khanh-Binh Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186667?format=json", "institution": "Deakin University"}, {"id": 140401, "fullname": "Chae Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/140401?format=json", "institution": "National Cancer Center"}], "abstract": "Large-scale pre-trained image-text models exhibit robust multimodal representation, yet applying contrastive language-image pretraining (CLIP) to audio-visual localization remains challenging. Replacing the classification token ($[CLS]$) with an audio-embedded token ($[V_A]$)struggles to capture semantic cues, and the prompt \u201ca photo of a $[V_A]$\u201d fails to establish meaningful connections between audio embeddings and context tokens. To address these issues, we propose sound-aware prompt learning (\\textsc{SouPLe}), which replaces fixed prompts with learnable context tokens. These tokens incorporate visual features to generate conditional context for a mask decoder, effectively bridging semantic correspondence between audio and visual inputs. Experiments on VGGSound, SoundNet, and AVSBench confirm that \\textsc{SouPLe} significantly improves localization and segmentation performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37106", "url": null, "sourceid": 38416, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38760, "uid": "21ee021ac65ce078bfd68b48368dc6a8", "name": "TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos", "authors": [{"id": 164882, "fullname": "Seungjae Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/164882?format=json", "institution": "University of Maryland, College Park"}, {"id": 179965, "fullname": "Yoonkyo Jung", "url": "http://cvpr.thecvf.com/api/miniconf/users/179965?format=json", "institution": "University of Maryland, College Park"}, {"id": 176842, "fullname": "Inkook Chun", "url": "http://cvpr.thecvf.com/api/miniconf/users/176842?format=json", "institution": "New York University"}, {"id": 89238, "fullname": "Yao-Chih Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/89238?format=json", "institution": "University of Maryland College Park"}, {"id": 190604, "fullname": "Zikui Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/190604?format=json", "institution": "University of Maryland, College Park"}, {"id": 190605, "fullname": "Hongjia Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190605?format=json", "institution": "New York University"}, {"id": 190606, "fullname": "Aayush Talreja", "url": "http://cvpr.thecvf.com/api/miniconf/users/190606?format=json", "institution": "University of Maryland, College Park"}, {"id": 190607, "fullname": "Tan Dao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190607?format=json", "institution": "University of Maryland, College Park"}, {"id": 151379, "fullname": "Yongyuan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151379?format=json", "institution": "University of Maryland, College Park"}, {"id": 88945, "fullname": "Jia-Bin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88945?format=json", "institution": "University of Maryland, College Park"}, {"id": 126668, "fullname": "Furong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126668?format=json", "institution": "Department of Computer Science, University of Maryland"}], "abstract": "Learning new robot tasks on new platforms and in new scenes from only a handful of demonstrations remains challenging. While videos of other embodiments---humans and different robots---are abundant, differences in embodiment, camera, and environment hinder their direct use. We address the small-data problem by introducing a unifying, symbolic representation---a compact 3D \"trace-space\" of scene-level trajectories---that enables learning from cross-embodiment, cross-environment, and cross-task videos. We present TraceGen, a world model that predicts future motion in trace-space rather than pixel space, abstracting away appearance while retaining the geometric structure needed for manipulation. To train TraceGen at scale, we develop TraceForge, a data pipeline that transforms heterogeneous human and robot videos into consistent 3D traces, yielding a corpus of 123K videos and 1.8M observation--trace--language triplets. Pretraining on this corpus produces a transferable 3D motion prior that adapts efficiently: with just five target robot videos, TraceGen attains 80% success across four tasks while offering 50-600x faster inference than state-of-the-art video-based world models. In the more challenging case where only five uncalibrated human demonstration videos captured on a handheld phone are available, it still reaches 67.5% success on a real robot, highlighting TraceGen\u2019s ability to adapt across embodiments without relying on object detectors or heavy pixel-space generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38760", "url": null, "sourceid": 39963, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40235, "uid": "e6d27bd10b9098033a43ca91066caa84", "name": "MU-GeNeRF: Multi-view Uncertainty-guided Generalizable Neural Radiance Fields for Distractor-aware Scene", "authors": [{"id": 193837, "fullname": "wenjie mu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193837?format=json", "institution": "Tongji University"}, {"id": 182044, "fullname": "Zhan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182044?format=json", "institution": "Tongji University"}, {"id": 193838, "fullname": "Chuanzhou su", "url": "http://cvpr.thecvf.com/api/miniconf/users/193838?format=json", "institution": "Tongji University"}, {"id": 193839, "fullname": "XUANYI SHEN", "url": "http://cvpr.thecvf.com/api/miniconf/users/193839?format=json", "institution": "Tongji University"}, {"id": 193840, "fullname": "Ziniu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193840?format=json", "institution": "Tongji University"}, {"id": 89023, "fullname": "Fan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89023?format=json", "institution": "Tongji University"}, {"id": 193841, "fullname": "Yujian Mo", "url": "http://cvpr.thecvf.com/api/miniconf/users/193841?format=json", "institution": "Tongji University"}, {"id": 193842, "fullname": "Junqiao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193842?format=json", "institution": "Tongji Unviersity"}, {"id": 193843, "fullname": "Tiantian Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193843?format=json", "institution": "Tongji University"}, {"id": 177443, "fullname": "chen ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/177443?format=json", "institution": "Tongji University"}, {"id": 86210, "fullname": "Guang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/86210?format=json", "institution": "Tongji University"}], "abstract": "Generalizable Neural Radiance Fields (GeNeRF) enable high-quality scene reconstruction from a limited number of views and can generalize to unseen scenes. However, in real-world environments, transient distractors disrupt structural consistency across views, leading to deviated supervision signals and degraded reconstruction quality. Existing distractor-free NeRF methods rely on per-scene optimization and they estimate uncertainty from per-view reconstruction errors to remove distractors, but this is unreliable to GeNeRF, because it may misjudge inconsistent static structures from source views as distractors. To address this issue, we propose MUGeNeRF: a multi-view uncertainty-guided distractor-aware GeNeRF method, aim to effectively alleviate GeNeRF's robust modeling challenges in dynamic scenes with transient distractions. We explicitly decompose distractor awareness into two complementary uncertainty modeling tasks: Source-view uncertainty, serving as a transferable prior during the feed-forward process, captures structural inconsistencies across source views caused by viewpoint changes or dynamic factors; Target-view uncertainty focuses on observation anomalies caused by transient changes to infer distractor spatial distribution. These two uncertainties are integrated into a heteroscedastic reconstruction loss that guides adaptive supervision weighting, boosting the model's capability to detect and suppress distractors, and enabling more robust geometric modeling. To our knowledge, this is the first attempt to explore GeNeRF modeling with in scenes with transient distractors. Extensive experiments demonstrate that our method not only outperforms existing GeNeRF approaches but also rivals the performance of scene-specific distractor-free NeRFs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40235", "url": null, "sourceid": 42083, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36321, "uid": "e2a06284900590e8df25361db42caede", "name": "Towards Knowledge-augmented Bayesian Deep Learning For Computer Vision", "authors": [{"id": 184767, "fullname": "Wang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/184767?format=json", "institution": "Rensselaer Polytechnic Institute"}, {"id": 184768, "fullname": "Hanjing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184768?format=json", "institution": "Tiktok"}, {"id": 169435, "fullname": "Yufei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/169435?format=json", "institution": "ByteDance Inc."}, {"id": 184769, "fullname": "Darsha Udayanga", "url": "http://cvpr.thecvf.com/api/miniconf/users/184769?format=json", "institution": "Rensselaer Polytechnic Institute"}, {"id": 184770, "fullname": "Qiang Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/184770?format=json", "institution": "Rensselaer Polytechnic Institute"}], "abstract": "Bayesian deep learning (BDL) integrates Bayesian inference with deep learning, improving predictive performance while enabling principled uncertainty quantification. However, existing BDLs often rely on non-informative random priors, limiting the benefits of Bayesian inference. In contrast, knowledge-augmented deep learning explicitly injects domain knowledge during training, yet lacks a probabilistic foundation. In this paper, we propose a knowledge-augmented BDL framework that integrates domain knowledge both as an informative prior and as an adaptive likelihood under a unified two-stage hybrid formulation. In the first stage, we learn a knowledge-informed prior $p(\\theta \\mid \\mathcal{K})$ by pre-training a model to satisfy domain-specific constraints. In the second stage, we perform Bayesian inference on task data with an adaptive knowledge likelihood $p(\\mathcal{K} \\mid \\theta, \\mathcal{D})$, which dynamically enforces these constraints during optimization. This unified framework enables knowledge to guide both initialization and training, significantly improving prediction accuracy, robustness, adaptation and uncertainty estimation. Experiments on various computer vision tasks, including semi-synthetic and real-knowledge scenarios, demonstrate that our two-stage framework consistently outperforms state-of-the-art Bayesian and knowledge-augmented baselines.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36321", "url": null, "sourceid": 37461, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38339, "uid": "bc2ba20c4af583beaa5af3e1905773db", "name": "CGU-Bayes: Causal Graph Uncertainty-Guided Bayesian Inference for Domain Generalization", "authors": [{"id": 182292, "fullname": "Naiyu Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/182292?format=json", "institution": "Lehigh University"}, {"id": 184768, "fullname": "Hanjing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184768?format=json", "institution": "Tiktok"}, {"id": 189637, "fullname": "Yue Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189637?format=json", "institution": "Lehigh University"}, {"id": 86878, "fullname": "Tian Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86878?format=json", "institution": "International Business Machines"}, {"id": 189638, "fullname": "Amit Dhurandhar", "url": "http://cvpr.thecvf.com/api/miniconf/users/189638?format=json", "institution": "IBM TJ Watson Research"}, {"id": 189639, "fullname": "Chung-Hao Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/189639?format=json", "institution": "University of California, Riverside"}, {"id": 184770, "fullname": "Qiang Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/184770?format=json", "institution": "Rensselaer Polytechnic Institute"}], "abstract": "Causal graphs play a crucial role in AI research as they reveal the data generation processes underlying real-world machine learning and computer vision tasks. Recent studies have leveraged causal graphs to develop more robust and interpretable models. However, limited or biased data often lead to inaccurate causal graph estimation, reducing a model\u2019s transferability to unseen domains. To address this challenge, we propose a novel framework that performs Bayesian inference over causal graphs to capture potential underlying causal relations and identify invariant causal features for DG prediction. The key advantage of our framework lies in its ability to quantify causal graph uncertainty in the context of prediction tasks and incorporate it into the prediction process. Our proposed uncertainty provides valuable insights into (i) the reliability of our method on specific datasets, (ii) the alignment between learned causal graphs and unseen test domains, and (iii) the confidence of our predictions. In particular, we go beyond merely quantifying uncertainty and leverage it as weighting factors in a weighted Bayesian inference scheme. Empirical results on multiple benchmark distribution-shift datasets show that our algorithm, **C**ausal **G**raph **U**ncertainty-guided **Bayes**ian Inference (CGU-Bayes), outperforms existing DG methods on challenging datasets and achieves state-of-the-art performance overall.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38339", "url": null, "sourceid": 32696, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40102, "uid": "2a2b24197c68a7ce96e1fcd9e5cca0a8", "name": "Scaling Up AI-Generated Image Detection with Generator-Aware Prototypes", "authors": [{"id": 193537, "fullname": "Ziheng Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/193537?format=json", "institution": "Beijing Jiaotong University"}, {"id": 156445, "fullname": "Yuheng Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/156445?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 153565, "fullname": "Renshuai Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153565?format=json", "institution": "Beijing Jiaotong University"}, {"id": 193538, "fullname": "Yuxuan Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/193538?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences; Tianjin University"}, {"id": 193539, "fullname": "Yuyang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193539?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 193540, "fullname": "Yipu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193540?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 156454, "fullname": "Xiaolong Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/156454?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}], "abstract": "The pursuit of a universal AI-generated image (AIGI) detector often relies on aggregating data from numerous generators to improve generalization. However, this paper identifies a paradoxical phenomenon we term the Benefit then Conflict dilemma, where detector performance stagnates and eventually degrades as source diversity expands. Our systematic analysis, leveraging Linear Discriminant Analysis (LDA), diagnoses this failure by identifying two core issues: severe data-level heterogeneity, which causes the feature distributions of real and synthetic images to increasingly overlap, and a critical model-level bottleneck from fixed, pretrained encoders that cannot adapt to the rising complexity. To address these challenges, we propose Generator-Aware Prototype Learning (GAPL), a framework that replaces unconstrained aggregation with a structured learning paradigm. GAPL learns a compact set of canonical forgery prototypes to create a unified, low-variance feature space, effectively countering data heterogeneity. To resolve the model bottleneck, it employs a two-stage training scheme with Low-Rank Adaptation (LoRA) to fine-tune the feature extractor, enhancing its discriminative power while preserving valuable pretrained knowledge. This approach establishes a more robust and generalizable decision boundary. Through extensive experiments, we demonstrate that GAPL achieves state-of-the-art performance, showing superior detection accuracy across a wide variety of unseen GAN and diffusion-based generators.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40102", "url": null, "sourceid": 34139, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38027, "uid": "8686fa633cfb5f49a0609122b9e4140b", "name": "Where Does Vision Meet Language? Understanding and Refining Visual Fusion in MLLMs via Contrastive Attention", "authors": [{"id": 181592, "fullname": "Shezheng Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/181592?format=json", "institution": "Hefei University of Technology"}, {"id": 188871, "fullname": "Shasha Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188871?format=json", "institution": "National University of Defense Technology"}, {"id": 188872, "fullname": "Shan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188872?format=json", "institution": "Hefei University of Technology"}, {"id": 188873, "fullname": "Xiaopeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188873?format=json", "institution": "National University of Defense Technology"}, {"id": 188874, "fullname": "Qian Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188874?format=json", "institution": "Central China Normal University"}, {"id": 188875, "fullname": "Chengyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188875?format=json", "institution": "Hunan University; National University of Defense Technology"}, {"id": 188876, "fullname": "Tianwei Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188876?format=json", "institution": "Chongqing Jiaotong University"}, {"id": 188877, "fullname": "Ma Jun", "url": "http://cvpr.thecvf.com/api/miniconf/users/188877?format=json", "institution": "National University of Defense Technology"}, {"id": 188878, "fullname": "Jie Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188878?format=json", "institution": "National University of Defense Technology"}], "abstract": "Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision\u2013language understanding, yet how they internally integrate visual and textual information remains poorly understood. To bridge this gap, we perform a systematic layer-wise masking analysis across multiple architectures, revealing how visual\u2013text fusion evolves within MLLMs. The results show that fusion emerges at several specific layers rather than being uniformly distributed across the network, and certain models exhibit a late-stage \u201creview\u201d phenomenon where visual signals are reactivated before output generation. Besides, we further analyze layer-wise attention evolution and observe persistent high-attention noise on irrelevant regions, along with gradually increasing attention on text-aligned areas. Guided by these insights, we introduce a training-free contrastive attention framework that models the transformation between early fusion and final layers to highlight meaningful attention shifts. Extensive experiments across various MLLMs and benchmarks validate our analysis and demonstrate that the proposed approach improves multimodal reasoning performance.Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38027", "url": null, "sourceid": 40076, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37919, "uid": "a992fdba69fe794820cb261f31df434d", "name": "PrivateEyes: Gaze-Preserving Anonymization for Data Sharing", "authors": [{"id": 183948, "fullname": "Surabhi Gupta", "url": "http://cvpr.thecvf.com/api/miniconf/users/183948?format=json", "institution": "Samsung R&amp;D Institute India-Bangalore (SRI-B)"}, {"id": 183946, "fullname": "Dinesh Prabhu Muthumariappan", "url": "http://cvpr.thecvf.com/api/miniconf/users/183946?format=json", "institution": "Samsung"}, {"id": 153192, "fullname": "Biplab Das", "url": "http://cvpr.thecvf.com/api/miniconf/users/153192?format=json", "institution": "International Institute of Information Technology, Bangalore"}, {"id": 188591, "fullname": "Anoop Kolar Rajagopal", "url": "http://cvpr.thecvf.com/api/miniconf/users/188591?format=json", "institution": "Samsung"}, {"id": 188592, "fullname": "Kiran Iyer", "url": "http://cvpr.thecvf.com/api/miniconf/users/188592?format=json", "institution": "Samsung"}, {"id": 167163, "fullname": "Donghwan Seo", "url": "http://cvpr.thecvf.com/api/miniconf/users/167163?format=json", "institution": "Samsung Electronics"}], "abstract": "Eye images captured from wearable devices such as head-mounted displays (HMDs) contain identifiable biometric cues, posing significant challenges for safe data sharing. Existing eye anonymization techniques often degrade downstream performance, particularly gaze estimation while still retaining iris-recognizable features. Although these methods aim to anonymize the iris, they introduce noticeable visual artifacts that reduce image fidelity. To address these limitations, we propose \\textbf{PrivateEyes}, a privacy-preserving framework that synthesizes anonymized yet gaze-consistent eye images. Our approach employs a three-stage pipeline: (1) a deep segmentation network that isolates semantic eye regions and provides structural control signals for synthesis, (2) a pose estimation network (PEN) trained on anatomically accurate synthetic eye renders to infer precise eye pose, and (3) a conditional diffusion model that reconstructs realistic, anonymized eye images conditioned on segmentation and pose. Extensive experiments across multiple benchmark datasets show that PrivateEyes achieves superior gaze-estimation accuracy compared to state-of-the-art anonymization baselines, improving performance by over 10\\% while reducing iris-recognition accuracy by $\\sim$50\\%. Our method also produces higher-fidelity images compared to other existing approaches. By enabling task-preserving and privacy-secure sharing of eye images, PrivateEyes supports responsible research and development in AR/VR and other gaze-driven applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37919", "url": null, "sourceid": 35453, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37264, "uid": "168f6cac16567296b81233afac6f127b", "name": "Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark", "authors": [{"id": 187042, "fullname": "Seng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187042?format=json", "institution": "The Chinese University of Hong Kong(Shen Zhen)"}, {"id": 181218, "fullname": "Hao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/181218?format=json", "institution": "Hao Chen"}, {"id": 187043, "fullname": "Chenglam Ho", "url": "http://cvpr.thecvf.com/api/miniconf/users/187043?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 187044, "fullname": "Xinyu Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187044?format=json", "institution": null}, {"id": 187045, "fullname": "Jinping Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187045?format=json", "institution": null}, {"id": 85821, "fullname": "Yu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85821?format=json", "institution": "Nanjing University"}, {"id": 187046, "fullname": "Chao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187046?format=json", "institution": "University of Dundee; University of Cambridge"}], "abstract": "Long video understanding (LVU) remains a core challenge in multimodal learning. Although recent vision\u2013language models (VLMs) have made notable progress, existing benchmarks mainly focus on either fine-grained perception or coarse summarization, offering limited insight into temporal understanding over long contexts. In this work, we define a scene as a coherent segment of a video in which both visual and semantic contexts remain consistent, aligning with human perception. This leads us to a key question: Can current VLMs reason effectively over long, scene-level contexts? To answer this, we introduce a new benchmark, SceneBench, designed to provide scene-level challenges. Our evaluation reveals a sharp drop in accuracy when VLMs attempt to answer scene-level questions, indicating significant forgetting of long-range context. To further validate these findings, we propose Scene Retrieval-Augmented Generation (Scene-RAG), which constructs a dynamic scene memory by retrieving and integrating relevant context across scenes. Scene-RAG improves VLM performance by +7.11%, confirming that current models still struggle with long-context retention. We hope SceneBench will encourage future research toward VLMs with more robust, human-like video comprehension.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37264", "url": null, "sourceid": 38456, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37785, "uid": "0028a24e18e166c292689023e6c22e09", "name": "Universal Guideline-Driven Image Clustering via a Hybrid LLM Agent", "authors": [{"id": 185552, "fullname": "Wenliang Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/185552?format=json", "institution": "University of Texas at Arlington"}, {"id": 188261, "fullname": "Rob Barton", "url": "http://cvpr.thecvf.com/api/miniconf/users/188261?format=json", "institution": "Amazon"}, {"id": 188262, "fullname": "Lucas Goncalves", "url": "http://cvpr.thecvf.com/api/miniconf/users/188262?format=json", "institution": "Amazon"}, {"id": 188263, "fullname": "Kushal Kumar", "url": "http://cvpr.thecvf.com/api/miniconf/users/188263?format=json", "institution": "Amazon"}, {"id": 185550, "fullname": "Feng Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185550?format=json", "institution": "University of Texas at Arlington"}, {"id": 185555, "fullname": "Hehuan Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/185555?format=json", "institution": "University of Texas at Arlington"}, {"id": 185554, "fullname": "Yuzhi Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185554?format=json", "institution": "University of Texas at Arlington"}, {"id": 188264, "fullname": "Vidit Bansal", "url": "http://cvpr.thecvf.com/api/miniconf/users/188264?format=json", "institution": "Amazon"}, {"id": 141480, "fullname": "Karim Bouyarmane", "url": "http://cvpr.thecvf.com/api/miniconf/users/141480?format=json", "institution": "Amazon"}, {"id": 156403, "fullname": "Junzhou Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156403?format=json", "institution": "University of Texas, Arlington"}], "abstract": "Unifying image clustering across different clustering scenarios remains challenging due to fundamental gaps among tasks. We introduce a Guideline-Driven Image Clustering Agent, the first universal framework that bridges these gaps through textual guidelines. To incorporate complex guidelines without task-specific training, we propose Generative Concept Proxy Modeling, which generates guideline-aware embeddings via concept proxy extraction. For scenarios requiring automatic cluster discovery, we introduce MST-based LLM Traversal that selectively applies LLM reasoning for complex semantic judgments, reducing computational costs. Our method generalizes across diverse clustering scenarios spanning from general to fine-grained categorization, from global to local criteria, and from balanced to long-tail distributions. We demonstrate superior performance across various clustering tasks, consistently outperforming specialized state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37785", "url": null, "sourceid": 32353, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39249, "uid": "49549984ae7c4b7d26235e812c3e4df1", "name": "VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale", "authors": [{"id": 136514, "fullname": "Parth Parag Kulkarni", "url": "http://cvpr.thecvf.com/api/miniconf/users/136514?format=json", "institution": "University of Central Florida"}, {"id": 91932, "fullname": "Rohit Gupta", "url": "http://cvpr.thecvf.com/api/miniconf/users/91932?format=json", "institution": "University of Central Florida"}, {"id": 191702, "fullname": "Prakash Chandra Chhipa", "url": "http://cvpr.thecvf.com/api/miniconf/users/191702?format=json", "institution": "Lule\u00e5 University of Technology"}, {"id": 73977, "fullname": "Mubarak Shah", "url": "http://cvpr.thecvf.com/api/miniconf/users/73977?format=json", "institution": "Amazon"}], "abstract": "The task of video geolocalization aims to determine the precise GPS coordinates of a video\u2019s origin and map its trajectory; with applications in forensics, social media, and exploration. Existing classification-based approaches operate at a coarse city-level granularity and fail to capture fine-grained details, while image retrieval methods are impractical on a global scale due to the need for extensive image galleries which are infeasible to compile. Comparatively, constructing a gallery of GPS coordinates is straightforward and inexpensive. We propose VidTAG, a dual-encoder framework that performs frame-to-GPS retrieval using both self-supervised and language-aligned features. To address temporal inconsistencies in video predictions, we introduce the TempGeo module, which aligns frame embeddings, and the GeoRefiner module, an encoder-decoder architecture that refines GPS features using said aligned frame embeddings. Evaluations on Mapillary (MSLS) and GAMa datasets demonstrate our model\u2019s ability to generate temporally consistent trajectories and outperform baselines, achieving a 20% improvement at the 1 km threshold over GeoCLIP. We also beat current State-of-the-Art by 25% on global coarse grained video geolocalization (CityGuessr68k). Our approach enables fine-grained video geolocalization and lays a strong foundation for future research. Code and models will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39249", "url": null, "sourceid": 42103, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39845, "uid": "6487ba8ae406887e3c94a658d21dfbdd", "name": "RelightAnyone: A Generalized Relightable 3D Gaussian Head Model", "authors": [{"id": 130632, "fullname": "Yingyan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130632?format=json", "institution": "ETH Zurich, DisneyResearch|Studios"}, {"id": 176579, "fullname": "Pramod Rao", "url": "http://cvpr.thecvf.com/api/miniconf/users/176579?format=json", "institution": "Max Planck Institute for Informatics"}, {"id": 130626, "fullname": "Sebastian Weiss", "url": "http://cvpr.thecvf.com/api/miniconf/users/130626?format=json", "institution": "DisneyResearch|Studios"}, {"id": 84994, "fullname": "Gaspard Zoss", "url": "http://cvpr.thecvf.com/api/miniconf/users/84994?format=json", "institution": "Disney Research Studios"}, {"id": 86229, "fullname": "Markus Gross", "url": "http://cvpr.thecvf.com/api/miniconf/users/86229?format=json", "institution": "Disney Research, Disney"}, {"id": 75465, "fullname": "Christian Theobalt", "url": "http://cvpr.thecvf.com/api/miniconf/users/75465?format=json", "institution": "MPI Informatik"}, {"id": 127234, "fullname": "Marc Habermann", "url": "http://cvpr.thecvf.com/api/miniconf/users/127234?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 84987, "fullname": "Derek Bradley", "url": "http://cvpr.thecvf.com/api/miniconf/users/84987?format=json", "institution": "DisneyResearch|Studios"}], "abstract": "3D Gaussian Splatting (3DGS) has become a standard approach to reconstruct and render photorealistic 3D head avatars. A major challenge is to relight the avatars to match any scene illumination. For high quality relighting, existing methods require subjects to be captured under complex time-multiplexed illumination, such as one-light-at-a-time (OLAT). We propose a new generalized relightable 3D Gaussian head model that can relight any subject observed in a single- or multi-view images without requiring OLAT data for that subject. Our core idea is to learn a mapping from flat-lit 3DGS avatars to corresponding relightable Gaussian parameters for that avatar. Our model consists of two stages: a first stage that models flat-lit 3DGS avatars without OLAT lighting, and a second stage that learns the mapping to physically-based reflectance parameters for high-quality relighting. This two-stage design allows us to train the first stage across diverse existing multi-view datasets without OLAT lighting ensuring cross-subject generalization, where we learn a dataset-specific lighting code for self-supervised lighting alignment. Subsequently, the second stage can be trained on a significantly smaller dataset of subjects captured under OLAT illumination. Together, this allows our method to generalize well and relight any subject from the first stage as if we had captured them under OLAT lighting. Furthermore, we can fit our model to unseen subjects from as little as a single image, allowing several applications in novel view synthesis and relighting for digital avatars.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39845", "url": "https://studios.disneyresearch.com/2026/05/31/relightanyone-a-generalized-relightable-3d-gaussian-head-model/", "sourceid": 38001, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65712, "file": "/media/PosterPDFs/CVPR%202026/39845.png", "modified": "2026-04-20T05:29:05.770310-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": true, "sortkey": 0, "is_live_content": false, "detailed_kind": "", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65714, "file": "/media/PosterPDFs/CVPR%202026/39845-thumb.png", "modified": "2026-04-20T12:28:03.959815-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": false, "sortkey": 0, "is_live_content": false, "detailed_kind": "thumb", "generated_from": null, "resourcetype": "EventmediaImageFile"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37408, "uid": "0b2e4ebf84cf95514cf99299c650d871", "name": "QUANTIPHY: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models", "authors": [{"id": 172675, "fullname": "Puyin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/172675?format=json", "institution": "Stanford University"}, {"id": 89277, "fullname": "Tiange Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89277?format=json", "institution": "Stanford University"}, {"id": 187365, "fullname": "Ella Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187365?format=json", "institution": "Stanford University"}, {"id": 187366, "fullname": "Shirley Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/187366?format=json", "institution": "Stanford University"}, {"id": 187367, "fullname": "Xinye Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187367?format=json", "institution": "Stanford University"}, {"id": 187368, "fullname": "Adnan Masood", "url": "http://cvpr.thecvf.com/api/miniconf/users/187368?format=json", "institution": "UST"}, {"id": 150947, "fullname": "Li Fei-Fei", "url": "http://cvpr.thecvf.com/api/miniconf/users/150947?format=json", "institution": "Stanford University"}, {"id": 75810, "fullname": "Ehsan Adeli", "url": "http://cvpr.thecvf.com/api/miniconf/users/75810?format=json", "institution": "Stanford University"}], "abstract": "Understanding the physical world is essential for generalist AI agents. However, it remains unclear whether state-of-the-art vision perception models (e.g., large VLMs) can perform quantitative physical reasoning tasks. Existing evaluations are predominantly VQA-based and qualitative, offering limited insight into whether these models can infer the kinematic quantities of moving objects from videos. To address this, we present QuantiPhy, the first benchmark designed to quantitatively measure a VLM's physical reasoning ability. Comprising more than 3.3K video\u2013text instances with numerical ground truth, QuantiPhy evaluates a VLM's performance on estimating an object's size, velocity, and acceleration at a given timestamp, using one of these properties as an input prior. The benchmark standardizes prompts and scoring to assess numerical accuracy, enabling fair comparisons across models. Our experiments on state-of-the-art VLMs reveal a consistent gap between their qualitative plausibility and actual numerical correctness. We further provide an in-depth analysis of key factors like background noise, counterfactual priors, and strategic prompting and find that state-of-the-art VLMs lean heavily on pre-trained world knowledge rather than faithfully using the provided visual and textual inputs when reasoning about objects\u2019 kinematic properties. QuantiPhy offers the first rigorous, scalable testbed to move VLMs beyond mere verbal plausibility toward a numerically grounded physical understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37408", "url": null, "sourceid": 38985, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37910, "uid": "53d201c136e76d7f5d18104f1c21f342", "name": "Foundation Model Priors Enhance Object Focus in Feature Space for Source-Free Object Detection", "authors": [{"id": 181662, "fullname": "Sairam Rebbapragada", "url": "http://cvpr.thecvf.com/api/miniconf/users/181662?format=json", "institution": null}, {"id": 179599, "fullname": "Rishabh Lalla", "url": "http://cvpr.thecvf.com/api/miniconf/users/179599?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 188569, "fullname": "Aveen Dayal", "url": "http://cvpr.thecvf.com/api/miniconf/users/188569?format=json", "institution": "Indian Institute of Technology, Hyderabad"}, {"id": 188570, "fullname": "Tejal Kulkarni", "url": "http://cvpr.thecvf.com/api/miniconf/users/188570?format=json", "institution": "University of California, San Diego"}, {"id": 188571, "fullname": "Anuj Lalla", "url": "http://cvpr.thecvf.com/api/miniconf/users/188571?format=json", "institution": "Indian Institute of Technology, Jodhpur"}, {"id": 153478, "fullname": "Vineeth Balasubramanian", "url": "http://cvpr.thecvf.com/api/miniconf/users/153478?format=json", "institution": "Microsoft Research and IIT-Hyderabad"}, {"id": 76698, "fullname": "Muhammad Haris Khan", "url": "http://cvpr.thecvf.com/api/miniconf/users/76698?format=json", "institution": "Mohamed Bin Zayed University of Artificial Intelligence"}], "abstract": "Current state-of-the-art approaches in Source-Free Object Detection (SFOD) typically rely on Mean-Teacher self-labeling. However, domain shift often reduces the detector\u2019s ability to maintain strong object-focused representations, causing high-confidence activations over background clutter. This weak object focus results in unreliable pseudo-labels from the detection head. While prior works mainly refine these pseudo-labels, they overlook the underlying need to strengthen the feature space itself. We propose FALCON-SFOD (Foundation-Aligned Learning with Clutter suppression and Noise robustness), a framework designed to enhance object-focused adaptation under domain shift. It consists of two complementary components. SPAR (Spatial Prior-Aware Regularization) leverages the generalization strength of vision foundation models to regularize the detector\u2019s feature space. Using class-agnostic binary masks derived from OV-SAM, SPAR promotes structured and foreground-focused activations by guiding the network toward object regions. IRPL (Imbalance-aware Noise Robust Pseudo-Labeling) complements SPAR by promoting balanced and noise-tolerant learning under severe foreground-background imbalance. Guided by a theoretical analysis that connects these designs to tighter localization and classification error bounds, FALCON-SFOD achieves competitive performance across SFOD benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37910", "url": null, "sourceid": 45373, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37965, "uid": "cb0b5f1330d14a551aac833db0baf14c", "name": "NimbusGS: Unified 3D Scene Reconstruction under Hybrid Weather", "authors": [{"id": 188700, "fullname": "Yanying Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188700?format=json", "institution": "Ocean University of China"}, {"id": 188701, "fullname": "Jinyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188701?format=json", "institution": "Ocean University of China; Zhejiang University"}, {"id": 73862, "fullname": "Shengfeng He", "url": "http://cvpr.thecvf.com/api/miniconf/users/73862?format=json", "institution": "Singapore Management University"}, {"id": 188702, "fullname": "Yangyang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188702?format=json", "institution": "Harbin Institute of Technology"}, {"id": 89803, "fullname": "Junyu Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/89803?format=json", "institution": "Ocean University of China"}, {"id": 89817, "fullname": "Yong Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/89817?format=json", "institution": "Ocean University of China"}], "abstract": "We present NimbusGS, a unified framework for reconstructing high-quality 3D scenes from degraded multi-view inputs captured under diverse and mixed adverse weather conditions. Unlike existing methods that target specific weather types, NimbusGS addresses the broader challenge of generalization by modeling the dual nature of weather: a continuous, view-consistent medium that attenuates light, and dynamic, view-dependent particles that cause scattering and occlusion. To capture this structure, we decompose degradations into a global transmission field and per-view particulate residuals. The transmission field represents static atmospheric effects shared across views, while the residuals model transient disturbances unique to each input. To enable stable geometry learning under severe visibility degradation, we introduce a geometry-guided gradient scaling mechanism that mitigates gradient imbalance during the self-supervised optimization of 3D Gaussian representations. This physically grounded formulation allows NimbusGS to disentangle complex degradations while preserving scene structure, yielding superior geometry reconstruction and outperforming task-specific methods across diverse and challenging weather conditions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37965", "url": null, "sourceid": 33237, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40239, "uid": "3d1ed124d48ac4f12106b32decf840b0", "name": "ChronoGS: Disentangling Invariants and Changes in Multi-Period Scenes", "authors": [{"id": 180152, "fullname": "Zhongtao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180152?format=json", "institution": "Peking University"}, {"id": 193851, "fullname": "Jiaqi Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/193851?format=json", "institution": "Beijing Forestry University"}, {"id": 188078, "fullname": "Qingtian Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188078?format=json", "institution": "The University of Tokyo"}, {"id": 193852, "fullname": "Yilong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193852?format=json", "institution": "Peking University"}, {"id": 147342, "fullname": "Mai Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/147342?format=json", "institution": "Peking University"}, {"id": 193853, "fullname": "Fei Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193853?format=json", "institution": "Peking University"}, {"id": 193854, "fullname": "Meng GAI", "url": "http://cvpr.thecvf.com/api/miniconf/users/193854?format=json", "institution": "Peking University"}, {"id": 193855, "fullname": "Shaorong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193855?format=json", "institution": "Beijing Forestry University"}, {"id": 188436, "fullname": "Chengwei Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188436?format=json", "institution": "Beihang Uinveristy"}, {"id": 188437, "fullname": "Yisong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188437?format=json", "institution": "Peking University"}, {"id": 193856, "fullname": "Guoping Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193856?format=json", "institution": null}], "abstract": "Multi-period image collections are common in real-world applications. Cities are re-scanned for mapping, construction sites are revisited for progress tracking, and natural regions are monitored for environmental change. Such data form multi-period scenes, where geometry and appearance evolve. Reconstructing such scenes is an important yet underexplored problem. Existing pipelines rely on incompatible assumptions: static and in-the-wild methods enforce a single geometry, while dynamic ones assume smooth motion, both failing under long-term, discontinuous changes. To solve this problem, we introduce ChronoGS, a temporally modulated Gaussian representation that reconstructs all periods within a unified anchor scaffold. It\u2018s also designed to disentangle stable and evolving components, achieving temporally consistent reconstruction of multi-period scenes. To catalyze relevant research, we release ChronoScene dataset, a benchmark of real and synthetic multi-period scenes, capturing geometric and appearance variation. Experiments demonstrate that ChronoGS consistently outperforms baselines in reconstruction quality and temporal consistency. Our code and the ChronoScene dataset will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40239", "url": null, "sourceid": 31716, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37775, "uid": "8633ed0606d56f587bc45831882dd583", "name": "PoseMaster: A Unified 3D Native Framework for Stylized Pose Generation", "authors": [{"id": 156079, "fullname": "Jiawei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/156079?format=json", "institution": "Stony Brook University"}, {"id": 131410, "fullname": "Kunming Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/131410?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 73844, "fullname": "Weiyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/73844?format=json", "institution": "Shandong University"}, {"id": 188241, "fullname": "Kaiyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188241?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 152113, "fullname": "Yixun Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152113?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 127361, "fullname": "Jingwei Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127361?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 129664, "fullname": "Chunchao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/129664?format=json", "institution": "Tencent"}, {"id": 126273, "fullname": "Ping Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/126273?format=json", "institution": "Hong Kong University of Science and Technology"}], "abstract": "Pose stylization is a fundamental task across the 2D, 3D, or video fields, which aim to output a stylized image or 3D mesh with the expected pose. In the 3D domains, existing pose stylization methods typically rely on 2D foundational models to modify the pose of an image before generating the corresponding 3D assets, which limits the ability of these methods to achieve rich and precise 3D pose stylization. To address this challenge, we propose a novel paradigm for 3D pose stylization that unifies pose stylization and 3D generation within a cohesive framework. This integration minimizes the risk of cumulative errors and enhances the model's efficiency and effectiveness. In addition, instead of a 2D skeleton used in previous works, we directly utilize the 3D skeleton because it can provide a more accurate representation of 3D spatial and topological relationships, which significantly enhances the model's capacity to achieve richer and more precise pose stylization. Additionally, we establish a comprehensive data engine to create a large-scale dataset that includes pairs of image-body misalignment and skeleton-body alignment. This dataset encourages 3D generative models to concurrently learn both the style of images and the pose-related 3D structures. Building on these innovations, we present PoseMaster, a unified 3D native method for stylized pose generation. Extensive experimental evaluations demonstrate that PoseMaster significantly outperforms current state-of-the-art techniques in both qualitative and quantitative assessments.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37775", "url": null, "sourceid": 45117, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36371, "uid": "8a9707913ae744d57924ed5450567889", "name": "Dataset Distillation via Influence Matching", "authors": [{"id": 128162, "fullname": "Haoru Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/128162?format=json", "institution": "HKU"}, {"id": 184901, "fullname": "Wang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184901?format=json", "institution": "University of Hong Kong"}, {"id": 128180, "fullname": "WU Sitong", "url": "http://cvpr.thecvf.com/api/miniconf/users/128180?format=json", "institution": "Department of Computer Science and Engineering, The Chinese University of Hong Kong"}, {"id": 184902, "fullname": "Xiuzhe Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184902?format=json", "institution": null}, {"id": 102107, "fullname": "Yangtian Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/102107?format=json", "institution": "University of Hong Kong"}, {"id": 76914, "fullname": "Chirui Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76914?format=json", "institution": "University of Hong Kong"}, {"id": 181212, "fullname": "Shaofeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181212?format=json", "institution": "University of Science and Technology of China"}, {"id": 88776, "fullname": "Xiaojuan Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/88776?format=json", "institution": "University of Oxford"}], "abstract": "We revisit dataset distillation from an outcome-centric perspective. Rather than aligning process surrogates (per-step gradients or training trajectories), Influence Matching (**Inf-Match**) aligns the final outcome of training: it learns a compact synthetic set whose effect on the converged parameters matches that of the full dataset. Concretely, we introduce a fully differentiable, sample-level influence estimator that quantifies parameter shifts from adding or removing data-- without time-consuming inverse-Hessian products or convexity assumptions. The estimator runs in linear time by unrolling the optimization dynamics and applying a first-order Taylor approximation. We then learn the synthetic set by minimizing the mismatch between its influence and that of the real dataset, yielding outcome alignment rather than heuristic process imitation. **Inf-Match** delivers the best accuracy across standard classification benchmarks. For instance, on Tiny-ImageNet (IPC=10), **Inf-Match** attains 31.5\\%, a +4.7\\% improvement over NCFM. Beyond classification, **Inf-Match** scales to vision-language distillation on Flickr30K, outperforming strong process-matching baselines. For instance, with 200 to 1000 synthetic samples, our method achieved a leading impressive average on image/text retrieval tasks, higher than NCFM by 2.5\\%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36371", "url": null, "sourceid": 31513, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37750, "uid": "f24d3891deca9c62c05caeb4918a772c", "name": "UniTEX: Universal High Fidelity Generative Texturing for 3D Shapes", "authors": [{"id": 152113, "fullname": "Yixun Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152113?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 131410, "fullname": "Kunming Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/131410?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 188164, "fullname": "Xiao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188164?format=json", "institution": "lightillusions"}, {"id": 152112, "fullname": "Rui Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/152112?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 156079, "fullname": "Jiawei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/156079?format=json", "institution": "Stony Brook University"}, {"id": 73844, "fullname": "Weiyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/73844?format=json", "institution": "Shandong University"}, {"id": 188165, "fullname": "Jiarui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188165?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 104911, "fullname": "Fei-Peng Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/104911?format=json", "institution": "Light Illusions"}, {"id": 126273, "fullname": "Ping Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/126273?format=json", "institution": "Hong Kong University of Science and Technology"}], "abstract": "We present UniTEX, a novel two-stage 3D texture generation framework to create high-quality, consistent textures for 3D assets. Existing approaches predominantly rely on UV-based models in the second stage to refine textures after reprojecting the generated multi-view images onto the 3D shapes, which introduces challenges related to topological ambiguity. To address this, we bypass the limitations of UV mapping by introducing a Large Texturing Model (LTM) that directly regresses textures in a unified 3D functional space. Moreover, to enable more effective and complete supervision of LTM, we propose to extend surface-defined textures into a continuous volumetric field to serve as an advanced training objective, which we refer to as Texture Functions (TF). Finally, we develop an advanced LoRA-based strategy for efficiently adapting large-scale 2D Diffusion Transformers (DiTs) for high-quality multi-view texture synthesis as our first stage. Extensive experiments demonstrate that UniTEX achieves superior visual quality and texture integrity compared to existing approaches, offering a generalizable and scalable solution for automated 3D texture generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37750", "url": null, "sourceid": 40923, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37666, "uid": "11b465ec6e2e00fbe0397fc8f78230d0", "name": "Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly", "authors": [{"id": 155569, "fullname": "Aditya Chetan", "url": "http://cvpr.thecvf.com/api/miniconf/users/155569?format=json", "institution": "Cornell University"}, {"id": 187975, "fullname": "Eric Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/187975?format=json", "institution": "Cornell University"}, {"id": 187976, "fullname": "Peeyush Kushwaha", "url": "http://cvpr.thecvf.com/api/miniconf/users/187976?format=json", "institution": "Independent Researcher"}, {"id": 187977, "fullname": "Bharath Raj Nagoor Kani", "url": "http://cvpr.thecvf.com/api/miniconf/users/187977?format=json", "institution": "Cornell University"}, {"id": 179562, "fullname": "Utkarsh Mall", "url": "http://cvpr.thecvf.com/api/miniconf/users/179562?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 150923, "fullname": "Qianqian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150923?format=json", "institution": "University of California, Berkeley"}, {"id": 85450, "fullname": "Noah Snavely", "url": "http://cvpr.thecvf.com/api/miniconf/users/85450?format=json", "institution": "Google / Cornell"}, {"id": 89201, "fullname": "Bharath Hariharan", "url": "http://cvpr.thecvf.com/api/miniconf/users/89201?format=json", "institution": "Cornell University"}], "abstract": "The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classification, captioning, and retrieval. Furthermore, these benchmarks often rely on entities that can be easily identified verbally, like household objects, animals, human subjects, etc., limiting their applicability to complex, in-the-wild video scenarios. But, many applications such as furniture assembly, cooking, etc., require step-by-step fine-grained spatio-temporal understanding of the video, which is not sufficiently evaluated in current benchmarks.To address this gap, we introduce Flat-Pack Bench, a novel benchmark centered on furniture assembly tasks. Our benchmark evaluates LVLMs on nuanced tasks, including temporal ordering of assembly actions, temporal localization of assembly state, understanding part mating, and tracking, using multiple-choice questions paired with visual prompts highlighting relevant parts as references for fine-grained questions. Our experiments reveal that state-of-the-art LVLMs struggle significantly with fine-grained spatio-temporal reasoning, highlighting their limitations in effectively leveraging temporal information from videos, limited tracking ability, and understanding of spatial interactions like physical contact.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37666", "url": null, "sourceid": 39702, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37529, "uid": "7602980244b8793fb90d3bfb3bef1639", "name": "DVAR: Dynamic Visual Autoregressive Modeling for Image Super-Resolution", "authors": [{"id": 187653, "fullname": "Yu Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187653?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 158831, "fullname": "Kai Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158831?format=json", "institution": "Nanjing University"}, {"id": 145729, "fullname": "Wei Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/145729?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 129362, "fullname": "Qingguo Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129362?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 187654, "fullname": "Xiantao Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187654?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 127403, "fullname": "Jun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/127403?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 85000, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85000?format=json", "institution": "Nanjing University of Science and Technology"}], "abstract": "Next-scale prediction paradigm visual autoregressive (VAR) models have demonstrated significant potential for image super-resolution. However, their practical application is constrained by a rigid, size-specific design. This limitation stems from their reliance on memorizing fixed, absolute scaling schedules, which necessitates a distinct model for each target resolution. We introduce DVAR, a Dynamic Visual AutoRegressive framework that overcomes this fundamental bottleneck. Instead of memorizing these rigid schedules, DVAR learns a canonical scaling dynamic. This dynamic effectively decouples the logic of relative scaling from the absolute target size, thereby preserving a single set of proportions between generative steps that can be applied uniformly to any size. Furthermore, we introduce a dynamic sampling scheduler to mitigate the teacher-forcing problem with negligible computational overhead. By leveraging the geometric proximity of visual tokens in the codebook, it efficiently simulates the model's predictive error distribution to bridge the training-inference gap. To our knowledge, DVAR is the first framework to grant VAR models size-flexibility, breaking their one-to-one dependency on a fixed resolution. Extensive evaluations demonstrate that DVAR achieves superior visual quality over existing Real-ISR methods, proving that a flexible, purely autoregressive approach is a viable path to state-of-the-art image super-resolution.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37529", "url": null, "sourceid": 41003, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36167, "uid": "8512eaf345bf11c7a75a1306c0366da7", "name": "SafeGRPO: Self-Rewarded Multimodal Safety Alignment via Rule-Governed Policy Optimization", "authors": [{"id": 184314, "fullname": "Xuankun Rong", "url": "http://cvpr.thecvf.com/api/miniconf/users/184314?format=json", "institution": "Wuhan University"}, {"id": 184315, "fullname": "Wenke Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184315?format=json", "institution": "Nanyang Technological University"}, {"id": 184316, "fullname": "Tingfeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184316?format=json", "institution": "Wuhan University"}, {"id": 184317, "fullname": "Daiguo Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184317?format=json", "institution": "Xiaomi"}, {"id": 84747, "fullname": "Bo Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/84747?format=json", "institution": "Wuhan University"}, {"id": 76422, "fullname": "Mang Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/76422?format=json", "institution": "Wuhan University"}], "abstract": "Multimodal large language models (MLLMs) have demonstrated impressive reasoning and instruction-following capabilities, yet their expanded modality space introduces new compositional safety risks that emerge from complex text\u2013image interactions.Such cross-modal couplings can produce unsafe semantics even when individual inputs are benign, exposing the fragile safety awareness of current MLLMs.While recent works enhance safety by guiding models to reason about potential risks, unregulated reasoning traces may compromise alignment; although Group Relative Policy Optimization (GRPO) offers self-rewarded refinement without human supervision, it lacks verifiable signals for reasoning safety.To address this, we propose **SafeGRPO** a self-rewarded multimodal safety alignment framework that integrates rule-governed reward construction into GRPO, enabling interpretable and verifiable optimization of reasoning safety. Built upon the constructed **SafeTag-VL-3K** dataset with explicit visual, textual, and combined safety tags, SafeGRPO performs **step-guided safety thinking** to enforce structured reasoning and behavior alignment, substantially improving multimodal safety awareness, compositional robustness, and reasoning stability across diverse benchmarks without sacrificing general capabilities.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36167", "url": null, "sourceid": 32992, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38148, "uid": "bc6ed32af5e74a9eae9accaab95902fd", "name": "CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection", "authors": [{"id": 180081, "fullname": "Zhaonian Kuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180081?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 148401, "fullname": "Rui Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/148401?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 189162, "fullname": "Haotian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189162?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 189163, "fullname": "Xinhu Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189163?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 189164, "fullname": "Meng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189164?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 85363, "fullname": "Gang Hua", "url": "http://cvpr.thecvf.com/api/miniconf/users/85363?format=json", "institution": "Wormpex AI Research"}], "abstract": "Multi-camera 3D object detection (MC3D) has attracted increasing attention with the growing deployment of multi-sensor physical agents, such as robots and autonomous vehicles. However, MC3D models still struggle to generalize to unseen platforms with new multi-camera configurations. Current solutions simply employ a meta-camera for unified representation but lack comprehensive consideration. In this paper, we revisit this issue and identify that the devil lies in spatial prior discrepancies across source and target configurations, including different intrinsics, extrinsics, and array layouts. To address this, we propose CoIn3D, a generalizable MC3D framework that enables strong transferability from source configurations to unseen target ones. CoIn3D explicitly incorporates all identified spatial priors into both feature embedding and image observation through spatial-aware feature modulation (SFM) and camera-aware data augmentation (CDA), respectively. SFM enriches feature space by integrating four spatial representations, such as focal length, ground depth, ground gradient, and Pl\u00fccker coordinate. CDA improves observation diversity under various configurations via a training-free dynamic novel-view image synthesis scheme. Extensive experiments demonstrate that CoIn3D achieves strong cross-configuration performance on landmark datasets such as NuScenes, Waymo, and Lyft, under three dominant MC3D paradigms represented by BEVDepth, BEVFormer, and PETR.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38148", "url": null, "sourceid": 43460, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38530, "uid": "2d676aeab38fa3680f4369dbbd4f3e52", "name": "TimeBridge: Self-Supervised Video Representation Learning via Start-End Joint Embedding and In-Between Frame Prediction", "authors": [{"id": 179968, "fullname": "Qin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/179968?format=json", "institution": "RWTH Aachen &amp; FZJ"}, {"id": 149372, "fullname": "Abigail Morrison", "url": "http://cvpr.thecvf.com/api/miniconf/users/149372?format=json", "institution": "RWTH Aachen University; Forschungszentrum J\u00fclich"}, {"id": 190068, "fullname": "Hanno Scharr", "url": "http://cvpr.thecvf.com/api/miniconf/users/190068?format=json", "institution": "Forschungszentrum Juelich GmbH"}, {"id": 190069, "fullname": "Kai Krajsek", "url": "http://cvpr.thecvf.com/api/miniconf/users/190069?format=json", "institution": "Forschungszentrum J\u00fclich GmbH"}], "abstract": "Learning temporal transformations, that is, how visual objects evolve across frames, is a fundamental challenge in video representation learning. Frame-to-frame dynamics involve complex, non-linear, and non-local changes that go far beyond conventional spatial augmentations. We propose TimeBridge, a self-supervised method that combines the joint embedding for video representation with learning temporal transformations by reconstructing in-between frames from only the start and end frames. This formulation encourages the model to infer the temporal evolution bridging the two endpoints, rather than merely encoding static frame representations. Unlike joint-embedding methods that lack explicit transformation modelling or future-prediction objectives that rely on unconstrained extrapolation, TimeBridge learns concrete frame-to-frame dynamics by promoting temporal consistency. We realise this through cross-concatenated class tokens and lightweight decoders, which recombine features from the start and end frames to reconstruct intermediates. TimeBridge achieves new state-of-the-art performance on multiple dense video prediction benchmarks, including 73.5 J\\&F on DAVIS 2017 video object segmentation, 47.5 mIoU on VIP part propagation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38530", "url": null, "sourceid": 33585, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39898, "uid": "70b98536f7f2cff5c36df2424787d87b", "name": "ODGS-SLAM: Omnidirectional Gaussian Splatting SLAM", "authors": [{"id": 183702, "fullname": "Stefan Spiss", "url": "http://cvpr.thecvf.com/api/miniconf/users/183702?format=json", "institution": "University of Innsbruck"}, {"id": 193071, "fullname": "Joey Hieronimy", "url": "http://cvpr.thecvf.com/api/miniconf/users/193071?format=json", "institution": "Universit\u00e4t Innsbruck"}, {"id": 176798, "fullname": "Marcel Ritter", "url": "http://cvpr.thecvf.com/api/miniconf/users/176798?format=json", "institution": "University of Innsbruck"}, {"id": 193072, "fullname": "Matthias Harders", "url": "http://cvpr.thecvf.com/api/miniconf/users/193072?format=json", "institution": "Universit\u00e4t Innsbruck"}], "abstract": "This work presents ODGS-SLAM, an omnidirectional simultaneous localization and mapping (SLAM) system utilizing 3D Gaussian Splatting (3DGS) as the unified representation for tracking and mapping.Thus, it reconstructs scene geometry from panoramic image sequences (RGB or RGBD) via splats while also detecting the camera poses.Such a framework is important to understand the full surrounding, *e.g.*, for augmented reality applications or autonomous systems.We extended existing 3DGS-SLAM methods to handle omnidirectional input by including closed-form gradients for mapping and camera pose estimation, utilizing an equirectangular projection model.To lower memory footprint, a key frame removal procedure based on graph analysis is proposed, enabling the application to handle larger input sizes.For evaluation, we provide a data set of controlled real-world and synthetic test scenes (indoor and outdoor), employing a custom developed virtual camera lens.An extensive evaluation shows that, for camera tracking, the proposed method achieves statistically significant lower ATE RMSE scores compared to a recent omnidirectional SLAM system, as well as other 3DGS-SLAM frameworks, while reaching a similar mapping performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39898", "url": "https://odgs-slam.github.io/", "sourceid": 35158, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39354, "uid": "9447b7e2c724174c9af773d5f1b64263", "name": "VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models", "authors": [{"id": 156381, "fullname": "Soumya Suvra Ghosal", "url": "http://cvpr.thecvf.com/api/miniconf/users/156381?format=json", "institution": "University of Maryland, College Park"}, {"id": 181887, "fullname": "Youngeun Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/181887?format=json", "institution": "Amazon AWS AI Labs"}, {"id": 191908, "fullname": "Zhuowei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191908?format=json", "institution": "Amazon"}, {"id": 134314, "fullname": "Ritwick Chaudhry", "url": "http://cvpr.thecvf.com/api/miniconf/users/134314?format=json", "institution": "Amazon"}, {"id": 158872, "fullname": "Linghan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/158872?format=json", "institution": "Amazon"}, {"id": 191909, "fullname": "Hongjing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191909?format=json", "institution": "Amazon"}, {"id": 191910, "fullname": "Jakub Zablocki", "url": "http://cvpr.thecvf.com/api/miniconf/users/191910?format=json", "institution": "Amazon"}, {"id": 98249, "fullname": "Yifan Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/98249?format=json", "institution": "Amazon"}, {"id": 191911, "fullname": "Qin ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/191911?format=json", "institution": null}], "abstract": "Advances in large reasoning models have shown strong performance on complex reasoning tasks by scaling test-time compute through extended inference-time thinking. However, recent studies observe that in vision-dependent tasks, extended textual reasoning at inference-time can often degrade performance as models progressively lose attention to visual tokens, increasingly relying on textual priors alone. To address this, prior works used reinforcement learning (RL)-based fine-tuning to route visual tokens or employ refocusing mechanisms during reasoning. While effective, these methods are computationally expensive, requiring large-scale data generation and policy optimization. To leverage the benefits of inference-time compute without additional RL fine-tuning, we propose VisRef, a visually grounded test-time scaling framework. Our key idea is to actively guide the reasoning process through re-injecting a coreset of visual tokens that are semantically relevant to the reasoning context yet diverse and globally representative of the image for more grounded multi-modal reasoning. Experiments on three visual reasoning benchmarks with state-of-the-art multi-modal large reasoning models demonstrate that under fixed inference-time compute budgets, VisRef consistently outperforms existing test-time scaling approaches by up to 6.4%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39354", "url": null, "sourceid": 42300, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38287, "uid": "91cd64268d833ea419db6bae48551d22", "name": "NAMI: Efficient Image Generation via Bridged Progressive Rectified Flow Transformers", "authors": [{"id": 184353, "fullname": "Yuhang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/184353?format=json", "institution": "Qihoo 360"}, {"id": 184352, "fullname": "Bo Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184352?format=json", "institution": "Qihoo 360"}, {"id": 102638, "fullname": "Shanyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/102638?format=json", "institution": "360 AI Institute"}, {"id": 189507, "fullname": "Hongyi Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/189507?format=json", "institution": "Tsinghua University"}, {"id": 184351, "fullname": "Liebucha Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184351?format=json", "institution": "Qihoo 360"}, {"id": 180658, "fullname": "Dawei Leng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180658?format=json", "institution": "360 AI Research"}, {"id": 184354, "fullname": "Yuhui Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184354?format=json", "institution": "360 AI Research"}], "abstract": "Flow-based Transformer models have achieved state-of-the-art image generation performance, but often suffer from high inference latency and computational cost due to their large parameter sizes. To improve inference efficiency without compromising quality, we propose Bridged Progressive Rectified Flow Transformers (NAMI), which decompose the generation process across temporal, spatial, and architectural demensions. We divide the rectified flow into different stages according to resolution, and use a BridgeFlow module to connect them. Fewer Transformer layers are used at low-resolution stages to generate image layouts and concept contours, and more layers are progressively added as the resolution increases. Experiments demonstrate that our approach achieves fast convergence and reduces inference time while ensuring generation quality. The main contributions of this paper are summarized as follows: (1) We introduce Bridged Progressive Rectified Flow Transformers that enable multi-resolution training, accelerating model convergence; (2) NAMI leverages piecewise flow and spatial cascading of Diffusion Transformer (DiT) to rapidly generate images, reducing inference time by 64% for generating 1024\u00d71024 resolution images; (3) We propose a BridgeFlow module to align flows between different stages; (4) We propose the NAMI-1K benchmark to evaluate human preference performance, aiming to mitigate distributional bias and comprehensively assess model effectiveness. The results show that our model is competitive with state-of-the-art models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38287", "url": null, "sourceid": 37850, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36809, "uid": "7859df91d0151aa6b404de12b8362300", "name": "Dehallu3D: Hallucination-Mitigated 3D Generation from a Single Image via Cyclic View Consistency Refinement", "authors": [{"id": 180536, "fullname": "Xiwen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180536?format=json", "institution": "Sichuan University"}, {"id": 175139, "fullname": "Shichao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/175139?format=json", "institution": "Sichuan University"}, {"id": 185930, "fullname": "Ruowei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185930?format=json", "institution": "Sichuan University"}, {"id": 185931, "fullname": "mao li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185931?format=json", "institution": "Sichuan University"}, {"id": 185932, "fullname": "Chenyu Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185932?format=json", "institution": "Sichuan University"}, {"id": 153024, "fullname": "Ji-Zhe Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/153024?format=json", "institution": "Sichuan University"}, {"id": 153199, "fullname": "Qijun Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153199?format=json", "institution": "Sichuan University"}, {"id": 185933, "fullname": "Hailun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185933?format=json", "institution": "Sichuan University"}], "abstract": "Large 3D reconstruction models have revolutionized the 3D content generation field, enabling broad applications in virtual reality and gaming. Just like other large models, large 3D reconstruction models suffer from hallucinations as well, introducing structural outliers (e.g., odd holes or protrusions) that deviate from the input data. However, unlike other large models, hallucinations in large 3D reconstruction models remain severely underexplored, leading to malformed 3D-printed objects or insufficient immersion in virtual scenes. Such hallucinations majorly originate from that existing methods reconstruct 3D content from sparsely generated multi-view images which suffer from large viewpoint gaps and discontinuities. To mitigate hallucinations by eliminating the outliers, we propose Dehallu3D for 3D mesh generation. Our key idea is to design a balanced multi-view continuity constraint to enforce smooth transitions across dense intermediate viewpoints, while avoiding over-smoothing that could erase sharp geometric features. Therefore, Dehallu3D employs a plug-and-play optimization module with two key constraints: (i) adjacent consistency to ensure geometric continuity across views, and (ii) adaptive smoothness to retain fine details. We further propose the Outlier Risk Measure (ORM) metric to quantify geometric fidelity in 3D generation from the perspective of outliers. Extensive experiments show that Dehallu3D achieves high-fidelity 3D generation by effectively preserving structural details while removing hallucinated outliers. The code will be fully available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36809", "url": null, "sourceid": 36476, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38604, "uid": "b5385961c08290e3097e9e4784c65807", "name": "Visual Grounding for Object Questions", "authors": [{"id": 190271, "fullname": "Martin Nicolas Everaert", "url": "http://cvpr.thecvf.com/api/miniconf/users/190271?format=json", "institution": "EPFL"}, {"id": 137332, "fullname": "Xiruo Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/137332?format=json", "institution": "Amazon"}, {"id": 190272, "fullname": "Hiroyuki Takeda", "url": "http://cvpr.thecvf.com/api/miniconf/users/190272?format=json", "institution": "Amazon"}, {"id": 130422, "fullname": "Raja Bala", "url": "http://cvpr.thecvf.com/api/miniconf/users/130422?format=json", "institution": "Amazon"}, {"id": 141119, "fullname": "Vivek Yadav", "url": "http://cvpr.thecvf.com/api/miniconf/users/141119?format=json", "institution": "Amazon"}, {"id": 99803, "fullname": "Vidya Narayanan", "url": "http://cvpr.thecvf.com/api/miniconf/users/99803?format=json", "institution": "Amazon"}], "abstract": "Current visual grounding research remains limited for practical applications, because existing techniques primarily focus on direct visual queries (e.g., \"find the red car\") or reading visible text (e.g., \"what is the title of this book?\"), rather than supporting general questions about objects (e.g., \"how comfortable are these earbuds?\"). We introduce the novel problem of Visual Grounding for Object Questions (VGOQ). Unlike previous work that grounds only what is directly visible in images, VGOQ handles open-ended general questions about objects, including concepts such as ease and comfort of use, and aims to identify visual evidence or context that would support an answer. This unexplored problem has immediate practical value, particularly in designing and optimizing product imagery in e-commerce stores. As initial steps toward this challenging task, we develop two automated data generation techniques combining existing models and data, and create two new datasets: ABO-VGOQ and VizWiz-VGOQ.We show that the data can be used to train a lightweight visual grounding model, and evaluate it against state-of-the-art approaches. Our results provide initial evidence that VGOQ represents a meaningful research direction: current SOTA visual grounding performance decreases from 29.2\\%-52.2\\% gIoU to 22.6\\%-37.2\\% gIoU when questions are rephrased from visual questions (segmentation of the answer) to general object questions (VizWiz-VGOQ, segmentation of visual evidence). On our new ABO-VGOQ dataset, our lightweight model achieves 39.5\\% gIoU, while current SOTA visual grounding approaches achieve only 12.4\\%-19.3\\%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38604", "url": null, "sourceid": 34555, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39412, "uid": "0a69e2db28ae2f16a203997c548b79be", "name": "Paper2Figure: A Multi-Agent Collaborative System for Figure Generation Towards Academic Research Paper", "authors": [{"id": 192020, "fullname": "Siwei Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/192020?format=json", "institution": "University of North Carolina at Chapel Hill"}, {"id": 177488, "fullname": "Haonian Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/177488?format=json", "institution": "Southwest Jiaotong University"}, {"id": 192021, "fullname": "Siyang Xin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192021?format=json", "institution": "Fudan University"}, {"id": 192022, "fullname": "Juanquan Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192022?format=json", "institution": "Fudan University"}, {"id": 192023, "fullname": "Shi Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192023?format=json", "institution": "Peking University"}, {"id": 130606, "fullname": "Xinyu Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/130606?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 192024, "fullname": "Peng Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/192024?format=json", "institution": "Department of Computer Science, University of North Carolina at Chapel Hill"}, {"id": 192025, "fullname": "Jiaqi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192025?format=json", "institution": "Department of Computer Science, University of North Carolina at Chapel Hill"}, {"id": 192026, "fullname": "Zhaorun Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192026?format=json", "institution": "University of Chicago"}, {"id": 190368, "fullname": "Yiyang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190368?format=json", "institution": "University of North Carolina at Chapel Hill"}, {"id": 91788, "fullname": "Linjie Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/91788?format=json", "institution": "Microsoft"}, {"id": 87290, "fullname": "Lijuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87290?format=json", "institution": "Microsoft"}, {"id": 131401, "fullname": "Huaxiu Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/131401?format=json", "institution": "Department of Computer Science, University of North Carolina at Chapel Hill"}], "abstract": "Automatically generating clear and accurate figures for research papers remains challenging, as it requires semantic understanding, precise structure, and visual aesthetics. Existing approaches struggle to balance fidelity and quality: large language model (LLM) code-based methods (e.g., SVG, Mermaid) are structured but inflexible, while image-generation models (e.g., GPT-Image-1, Nano Banana) produce hard-to-edit and often inaccurate figures. We present Paper2Figure, a dual multi-agent system with an interactive web platform for paper-to-figure generation. Generation Agents convert text into our designed FigScript language, encoding figure semantics, styles and layout. The web system renders the FigScript into an initial image, which Refinement Agents iteratively analyze to locate issues and revise the FigScript for improved logic, alignment, aesthetics and text accuracy. Crucially, users can further refine results through an intuitive web interface, ensuring full control over the final output. To evaluate Paper2Figure, we introduce Paper2Figure Bench, a benchmark comprising 100 academic figures with paired descriptions. Experiments demonstrate that Paper2Figure markedly improves accuracy by 12%, beauty by 13.5%, and completeness by 17.0% over state-of-the-art baselines in fully automatic generation without human adjustment. By combining automated generation with interactive edit, Paper2Figure bridges the gap between AI assistance and researcher control, offering a practical solution for high-quality academic figure creation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39412", "url": null, "sourceid": 45953, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37225, "uid": "0d4c8e8f080ebba0f8ca13a2faaa390c", "name": "Frame2Freq: Spectral Adapters for Fine-Grained Video Understanding", "authors": [{"id": 180236, "fullname": "Thinesh Thiyakesan Ponbagavathi", "url": "http://cvpr.thecvf.com/api/miniconf/users/180236?format=json", "institution": "University of Hildesheim"}, {"id": 186954, "fullname": "Constantin Seibold", "url": "http://cvpr.thecvf.com/api/miniconf/users/186954?format=json", "institution": "Ruprecht-Karls-Universit\u00e4t Heidelberg"}, {"id": 150991, "fullname": "Alina Roitberg", "url": "http://cvpr.thecvf.com/api/miniconf/users/150991?format=json", "institution": "Universit\u00e4t Stuttgart"}], "abstract": "Adapting image-pretrained backbones to video typically relies on time-domain adapters tuned to a single temporal scale. Our experiments show that these modules pick up static image cues and very fast flicker changes, while overlooking medium-speed motion. Capturing dynamics across multiple time-scales is, however, crucial for fine-grained temporal analysis (i.e., opening vs. closing bottle).To address this, we introduce Frame2Freq -- a family of frequency-aware adapters that perform spectral encoding during image-to-video adaptation of pretrained Vision Foundation Models (VFMs), improving fine-grained action recognition. Frame2Freq uses Fast Fourier Transform (FFT) along time and learns frequency-band specific embeddings that adaptively highlight the most discriminative frequency ranges. Across five fine-grained activity recognition datasets, Frame2Freq  outperforms prior PEFT methods and even surpasses fully fine-tuned models on four of them. These results provide encouraging evidence, that frequency analysis methods are a powerful tool for modeling temporal dynamics in image-to-video transfer.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37225", "url": null, "sourceid": 44483, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40078, "uid": "4467c5ebcc8206601625e713d31605f9", "name": "K$\\alpha$LOS finds Consensus: A Meta-Algorithm for Evaluating Inter-Annotator Agreement in Complex Vision Tasks", "authors": [{"id": 182430, "fullname": "David Tschirschwitz", "url": "http://cvpr.thecvf.com/api/miniconf/users/182430?format=json", "institution": "Bauhaus-University Weimar"}, {"id": 193451, "fullname": "Volker Rodehorst", "url": "http://cvpr.thecvf.com/api/miniconf/users/193451?format=json", "institution": "Bauhaus-Universit\u00e4t Weimar"}], "abstract": "Progress in computer vision relies on the interplay of data, algorithms, and computation. For foundational tasks such as object detection, supervised learning with human-annotated data remains the state-of-the-art approach. However, this \"gold-standard\" data is notoriously error-prone, which is a fundamental bottleneck that hinders both model training and evaluation. As a result, benchmarking improvements have become negligible or non-existent in the last year. This issue does not stem from algorithms or computation, but from problem specifications and the dataset creation process. This ultimately leads to ill-defined tasks with noisy labels. Although statistical methods for Inter-Annotator Agreement (IAA) exist, they are often applied inconsistently and lack standardization, which makes dataset quality comparisons unreliable.We propose a unified meta-algorithm for dataset quality evaluation called K$\\alpha$LOS (Krippendorff's $\\alpha$ Localization Object Sensing) that serves as a tool for dataset creation and final assessment. Our framework conceptually incorporates existing methods and extends upon them. This provides a broader scope, as our method applies to any combined localization and classification task. It provides greater analytical depth than competing methods, enabling downstream tasks such as evaluating intra-annotator-consistency, rater vitality, and localization sensitivity. Crucially, it is modular, flexible, and extensible, allowing components to be interchanged for specific use-cases and enabling comparability across datasets and tasks.Validating such a metric is challenging, as no \"real\" ground truth exists. Typically, what we evaluate is considered the ground truth and starting point in the modeling process. Prior validation often relies on heuristics or machine-generated labels that fail to capture the complexity of real annotation noise. Therefore, we introduce an experimental validation approach using an empirical noise generator from real, multi-annotated datasets, which also scrutinizes heuristic assumptions about the noise distribution.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40078", "url": "https://github.com/Madave94/kalos", "sourceid": 31479, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65733, "file": "/media/PosterPDFs/CVPR%202026/40078.png", "modified": "2026-04-23T14:24:38.114701-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": true, "sortkey": 0, "is_live_content": false, "detailed_kind": "", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65734, "file": "/media/PosterPDFs/CVPR%202026/40078-thumb.png", "modified": "2026-04-23T14:24:38.331937-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": false, "sortkey": 0, "is_live_content": false, "detailed_kind": "thumb", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65732, "modified": "2026-04-23T11:13:45.226413-07:00", "display_section": 1, "type": "PDF", "name": "Slides", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "/media/cvpr-2026/Slides/40078.pdf", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39445, "uid": "a844418805e810cc482327eace05e206", "name": "Medic-AD: : Towards Medical Vision-Language Model's Clinical Intelligence", "authors": [{"id": 181545, "fullname": "Woohyeon Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/181545?format=json", "institution": "Seoul National University"}, {"id": 192091, "fullname": "Jaeik Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/192091?format=json", "institution": "Seoul National University"}, {"id": 192092, "fullname": "Sunghwan Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/192092?format=json", "institution": "Seoul National University"}, {"id": 192093, "fullname": "Pa Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/192093?format=json", "institution": "samsung changwon hospital"}, {"id": 178711, "fullname": "Woo Kyoung Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/178711?format=json", "institution": "Samsung Medical Center"}, {"id": 178428, "fullname": "Yoojin Nam", "url": "http://cvpr.thecvf.com/api/miniconf/users/178428?format=json", "institution": "Samsung Changwon Hospital"}, {"id": 192094, "fullname": "Nam-Joon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/192094?format=json", "institution": null}, {"id": 192095, "fullname": "Ginny Wong", "url": "http://cvpr.thecvf.com/api/miniconf/users/192095?format=json", "institution": "NVIDIA"}, {"id": 88097, "fullname": "Ka Chun Cheung", "url": "http://cvpr.thecvf.com/api/miniconf/users/88097?format=json", "institution": "NVIDIA"}, {"id": 87756, "fullname": "Jaeyoung Do", "url": "http://cvpr.thecvf.com/api/miniconf/users/87756?format=json", "institution": "Department of Computer Science, University of Wisconsin - Madison"}], "abstract": "Lesion detection, symptom tracking, and visual explainability are central to real-world medical image analysis, yet current medical Vision-Language Models (VLMs) still lack mechanisms that translate their broad knowledge into clinically actionable outputs. To bridge this gap, we present Medic-AD, a clinically oriented VLM that strengthens these three capabilities through a stage-wise framework. First, learnable anomaly-aware tokens (Ano) encourage the model to focus on abnormal regions and build more discriminative lesion centered representations. Second, inter-image difference tokens (Diff) explicitly encode temporal changes between studies, allowing the model to distinguish worsening, improvement, and stability in disease burden. Finally, a dedicated explainability stage trains the model to generate heatmaps that highlight lesion-related regions, offering clear visual evidence that is consistent with the model's reasoning. Through our staged design, Medic-AD steadily boosts performance across anomaly detection, symptom tracking, and anomaly segmentation, achieving state-of-the-art results compared with both closed source and medical-specialized baselines. Evaluations on real longitudinal clinical data collected from real hospital workflows further show that Medic-AD delivers stable predictions and clinically faithful explanations in practical patient-monitoring and decision-support workflows.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39445", "url": null, "sourceid": 44381, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40369?format=json"], "related_events_ids": [40369]}, {"id": 37397, "uid": "1c699143cd368d893bb7b5fa1fdcabcc", "name": "Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation", "authors": [{"id": 168173, "fullname": "Ziyu Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/168173?format=json", "institution": "Department of Computer Science and Engineering, The Chinese University of Hong Kong"}, {"id": 91572, "fullname": "Renrui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91572?format=json", "institution": "MMLab of CUHK &amp;amp;amp; Shanghai AI Laboratory"}, {"id": 155705, "fullname": "Hongyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155705?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 185043, "fullname": "Manyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185043?format=json", "institution": null}, {"id": 187346, "fullname": "Xinyan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187346?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 187347, "fullname": "Sifan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187347?format=json", "institution": "Microsoft"}, {"id": 185047, "fullname": "Yan Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185047?format=json", "institution": "Meituan"}, {"id": 185048, "fullname": "Peng Pei", "url": "http://cvpr.thecvf.com/api/miniconf/users/185048?format=json", "institution": "Meituan"}, {"id": 87709, "fullname": "Pheng-Ann Heng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87709?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., *think*, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself.In this preliminary study, we introduce **Thinking-while-Generating** (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generating, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs.To unveil the potential of this framework, we investigate three candidate strategies, zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning.We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37397", "url": null, "sourceid": 32738, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40369, "uid": "a844418805e810cc482327eace05e206", "name": "Medic-AD: : Towards Medical Vision-Language Model's Clinical Intelligence", "authors": [{"id": 181545, "fullname": "Woohyeon Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/181545?format=json", "institution": "Seoul National University"}, {"id": 192091, "fullname": "Jaeik Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/192091?format=json", "institution": "Seoul National University"}, {"id": 192092, "fullname": "Sunghwan Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/192092?format=json", "institution": "Seoul National University"}, {"id": 192093, "fullname": "Pa Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/192093?format=json", "institution": "samsung changwon hospital"}, {"id": 178711, "fullname": "Woo Kyoung Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/178711?format=json", "institution": "Samsung Medical Center"}, {"id": 178428, "fullname": "Yoojin Nam", "url": "http://cvpr.thecvf.com/api/miniconf/users/178428?format=json", "institution": "Samsung Changwon Hospital"}, {"id": 192094, "fullname": "Nam-Joon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/192094?format=json", "institution": null}, {"id": 192095, "fullname": "Ginny Wong", "url": "http://cvpr.thecvf.com/api/miniconf/users/192095?format=json", "institution": "NVIDIA"}, {"id": 88097, "fullname": "Ka Chun Cheung", "url": "http://cvpr.thecvf.com/api/miniconf/users/88097?format=json", "institution": "NVIDIA"}, {"id": 87756, "fullname": "Jaeyoung Do", "url": "http://cvpr.thecvf.com/api/miniconf/users/87756?format=json", "institution": "Department of Computer Science, University of Wisconsin - Madison"}], "abstract": "Lesion detection, symptom tracking, and visual explainability are central to real-world medical image analysis, yet current medical Vision-Language Models (VLMs) still lack mechanisms that translate their broad knowledge into clinically actionable outputs. To bridge this gap, we present Medic-AD, a clinically oriented VLM that strengthens these three capabilities through a stage-wise framework. First, learnable anomaly-aware tokens (Ano) encourage the model to focus on abnormal regions and build more discriminative lesion centered representations. Second, inter-image difference tokens (Diff) explicitly encode temporal changes between studies, allowing the model to distinguish worsening, improvement, and stability in disease burden. Finally, a dedicated explainability stage trains the model to generate heatmaps that highlight lesion-related regions, offering clear visual evidence that is consistent with the model's reasoning. Through our staged design, Medic-AD steadily boosts performance across anomaly detection, symptom tracking, and anomaly segmentation, achieving state-of-the-art results compared with both closed source and medical-specialized baselines. Evaluations on real longitudinal clinical data collected from real hospital workflows further show that Medic-AD delivers stable predictions and clinically faithful explanations in practical patient-monitoring and decision-support workflows.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40369", "url": null, "sourceid": -44381, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39445?format=json"], "related_events_ids": [39445]}, {"id": 39196, "uid": "9c87b4fa747d4b5675c82f561eb9cd4c", "name": "Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment", "authors": [{"id": 191558, "fullname": "Jerry Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191558?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 191559, "fullname": "Haowen Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/191559?format=json", "institution": "Tsinghua University, Tsinghua University; Tsinghua University, Tsinghua University"}, {"id": 181261, "fullname": "Denis Gudovskiy", "url": "http://cvpr.thecvf.com/api/miniconf/users/181261?format=json", "institution": "Panasonic AI Lab"}, {"id": 191560, "fullname": "Yohei Nakata", "url": "http://cvpr.thecvf.com/api/miniconf/users/191560?format=json", "institution": "Panasonic Holdings Corp."}, {"id": 191561, "fullname": "Tomoyuki Okuno", "url": "http://cvpr.thecvf.com/api/miniconf/users/191561?format=json", "institution": "Panasonic Holdings Corporation"}, {"id": 75618, "fullname": "Kurt Keutzer", "url": "http://cvpr.thecvf.com/api/miniconf/users/75618?format=json", "institution": "EECS, UC Berkeley"}, {"id": 130710, "fullname": "Wenzhao Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/130710?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Spatial intelligence in vision-language models (VLMs) attracts research interest with the practical demand to reason in the 3D world. Despite promising results, most existing methods follow the conventional 2D pipeline in VLMs and use pixel-aligned representations for the vision modality. However, correspondence-based models with implicit 3D scene understanding often fail to achieve spatial consistency, and representation-based models with 3D geometric priors lack efficiency in vision sequence serialization. To address this, we propose a Proxy3D method with compact yet comprehensive 3D proxy representations for the vision modality. Given only video frames as input, we employ semantic and geometric encoders to extract scene features and then perform their semantic-aware clustering to obtain a set of proxies in the 3D space. For representation alignment, we further curate the SpaceSpan dataset and apply multi-stage training to adopt the proposed 3D proxy representations with the VLM. When using shorter sequences for vision information, our method achieves competitive or state-of-the-art performance in 3D visual question answering, visual grounding and general spatial intelligence benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39196", "url": null, "sourceid": 31281, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36697, "uid": "a0d5b283cb5fc43b41097fc63ce92b17", "name": "ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars", "authors": [{"id": 177834, "fullname": "Kaiwen Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/177834?format=json", "institution": "University of Science and Techonology of China"}, {"id": 185669, "fullname": "Jinkai Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/185669?format=json", "institution": "University of Science and Technology of China"}, {"id": 127405, "fullname": "Juyong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127405?format=json", "institution": "University of Science and Technology of China"}], "abstract": "In practical real-time XR and telepresence applications, network and computing resources fluctuate frequently. Therefore, a progressive, streamable 3D representation method is needed that can be immediately deployed and continuously optimized as resources increase. To this end, we propose ProgressiveAvatars, a progressive avatar representation built on a hierarchy of 3D Gaussians grown byadaptive implicit subdivision on a template mesh. 3D Gaussians are defined in face\u2011local coordinates to remain animatable under varying expressions and head motion across multiple detail levels. The hierarchy expands when screen-space signals indicate a lack of detail, allocating resources to important areas. ProgressiveAvatars supports incremental loading rendering, adding new Gaussians as they arrive while preserving previous content, thus achieving smooth quality improvements across varying bandwidths. Thanks to our progressive representation method with an inherited tree structure, ProgressiveAvatars enables progressive delivery and progressive rendering under fluctuating network bandwidth and varying compute and memory resources.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36697", "url": null, "sourceid": 41410, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38071, "uid": "6269b862f80762b76e50df2dd5c4b99c", "name": "Stepwise Credit Assignment for GRPO on Flow-Matching Models", "authors": [{"id": 188986, "fullname": "Yash Savani", "url": "http://cvpr.thecvf.com/api/miniconf/users/188986?format=json", "institution": "CMU, Carnegie Mellon University"}, {"id": 188069, "fullname": "Branislav Kveton", "url": "http://cvpr.thecvf.com/api/miniconf/users/188069?format=json", "institution": "Adobe Research"}, {"id": 188987, "fullname": "Yuchen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188987?format=json", "institution": "Adobe Systems"}, {"id": 95788, "fullname": "Yilin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/95788?format=json", "institution": "Adobe Systems"}, {"id": 126843, "fullname": "Jing Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/126843?format=json", "institution": "Adobe Systems"}, {"id": 188068, "fullname": "Subhojyoti Mukherjee", "url": "http://cvpr.thecvf.com/api/miniconf/users/188068?format=json", "institution": "Adobe Research"}, {"id": 188988, "fullname": "Nikos Vlassis", "url": "http://cvpr.thecvf.com/api/miniconf/users/188988?format=json", "institution": "Adobe Systems"}, {"id": 88736, "fullname": "Krishna Kumar Singh", "url": "http://cvpr.thecvf.com/api/miniconf/users/88736?format=json", "institution": "Adobe Systems"}], "abstract": "Flow-GRPO successfully applies reinforcement learning to flow models, but uses _uniform credit assignment_ across all timesteps. This ignores the temporal structure of diffusion generation: early timesteps determine composition and content (low-frequency structure), while late timesteps resolve details and textures (high-frequency details). Moreover, assigning uniform credit based solely on the final image can inadvertently _reward suboptimal intermediate steps_, especially when errors are corrected later in the diffusion trajectory. We propose __Stepwise-Flow-GRPO__, which assigns credit based on each step's reward improvement. By leveraging Tweedie's formula to obtain intermediate reward estimates and introducing gain-based advantages, our method achieves superior sample efficiency and faster convergence. We also introduce a DDIM-inspired SDE that improves visual quality while preserving stochasticity for policy gradients.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38071", "url": null, "sourceid": 42818, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36397, "uid": "d4f3031272693602ccb1df4024655175", "name": "Learning complete and explainable visual representations from itemized text supervision", "authors": [{"id": 137385, "fullname": "Yiwei Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/137385?format=json", "institution": "University of Michigan - Ann Arbor"}, {"id": 184945, "fullname": "Chenhui Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184945?format=json", "institution": "University of Michigan - Ann Arbor"}, {"id": 184946, "fullname": "Soumyanil Banerjee", "url": "http://cvpr.thecvf.com/api/miniconf/users/184946?format=json", "institution": "University of Michigan - Ann Arbor"}, {"id": 176274, "fullname": "Shixuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/176274?format=json", "institution": null}, {"id": 184947, "fullname": "Akshay Rao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184947?format=json", "institution": "University of Michigan - Ann Arbor"}, {"id": 84591, "fullname": "Akhil Kondepudi", "url": "http://cvpr.thecvf.com/api/miniconf/users/84591?format=json", "institution": "University of Michigan - Ann Arbor"}, {"id": 135184, "fullname": "Honglak Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/135184?format=json", "institution": "LG AI Research"}, {"id": 84663, "fullname": "Todd C. Hollon", "url": "http://cvpr.thecvf.com/api/miniconf/users/84663?format=json", "institution": "University of Michigan"}], "abstract": "Training vision models with language supervision enables general and transferable representations. However, many visual domains, especially non-object-centric domains such as medical imaging and remote sensing, contain itemized text annotations: multiple text items describing distinct and semantically independent findings within a single image. Such supervision differs from standard multi-caption supervision, where captions are redundant or highly overlapping. Here, we introduce ItemizedCLIP, a framework for learning complete and explainable visual representations from itemized text supervision. ItemizedCLIP employs a cross-attention module to produce text item-conditioned visual embeddings and a set of tailored objectives that jointly enforce item independence (distinct regions for distinct items) and representation completeness (coverage of all items). Across four domains with naturally itemized text supervision (brain MRI, head CT, chest CT, remote sensing) and one additional synthetically itemized dataset, ItemizedCLIP achieves substantial improvements in zero-shot performance and fine-grained interpretability over baselines. The resulting ItemizedCLIP representations are semantically grounded, item-differentiable, complete, and visually interpretable.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36397", "url": null, "sourceid": 37110, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38221, "uid": "1aace02b1dc7a9ee987286a90bbef89c", "name": "OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control", "authors": [{"id": 179611, "fullname": "Xilong Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/179611?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 136775, "fullname": "Jianchun Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/136775?format=json", "institution": "Max Planck Institute for Informatics"}, {"id": 176579, "fullname": "Pramod Rao", "url": "http://cvpr.thecvf.com/api/miniconf/users/176579?format=json", "institution": "Max Planck Institute for Informatics"}, {"id": 189361, "fullname": "Timo Teufel", "url": "http://cvpr.thecvf.com/api/miniconf/users/189361?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 189362, "fullname": "Linjie Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189362?format=json", "institution": "Max Planck Institute for Informatics"}, {"id": 189363, "fullname": "Tigran Minasian", "url": "http://cvpr.thecvf.com/api/miniconf/users/189363?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute; Universit\u00e4t des Saarlandes"}, {"id": 189364, "fullname": "Oleksandr Sotnychenko", "url": "http://cvpr.thecvf.com/api/miniconf/users/189364?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 186119, "fullname": "Xiao-Xiao Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/186119?format=json", "institution": "Nanjing University"}, {"id": 127234, "fullname": "Marc Habermann", "url": "http://cvpr.thecvf.com/api/miniconf/users/127234?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 75465, "fullname": "Christian Theobalt", "url": "http://cvpr.thecvf.com/api/miniconf/users/75465?format=json", "institution": "MPI Informatik"}], "abstract": "We introduce OLATverse, a large-scale dataset comprising around 9M images of 765 real-world objects, captured from multiple viewpoints under a diverse set of precisely controlled lighting conditions. While recent advances in object-centric inverse rendering, novel view synthesis and relighting have shown promising results, most techniques still heavily rely on the synthetic datasets for training and small-scale real-world datasets for benchmarking, which limits their realism and generalization. To address this gap, OLATverse offers two key advantages over existing datasets: large-scale coverage of real objects and high-fidelity appearance under precisely controlled illuminations. Specifically, OLATverse contains 765 common and uncommon real-world objects, spanning a wide range of material categories. Each object is captured using 35 DSLR cameras and 331 individually controlled light sources, enabling the simulation of diverse illumination conditions. In addition, for each object, we provide well-calibrated camera parameters, accurate object masks, photometric surface normals, and diffuse albedo as auxiliary resources. We also construct an extensive evaluation set, establishing the first comprehensive real-world object-centric benchmark for inverse rendering and normal estimation. We believe that OLATverse represents a pivotal step toward integrating the next generation of inverse rendering and relighting methods with real-world data.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38221", "url": null, "sourceid": 34909, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40328?format=json"], "related_events_ids": [40328]}, {"id": 37824, "uid": "21b1a8129e987682b9ee28f6eaf36a0f", "name": "Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer", "authors": [{"id": 130408, "fullname": "Dong In Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/130408?format=json", "institution": "Korea University"}, {"id": 188347, "fullname": "Hyungjun Doh", "url": "http://cvpr.thecvf.com/api/miniconf/users/188347?format=json", "institution": "Purdue University"}, {"id": 70133, "fullname": "Seunggeun Chi", "url": "http://cvpr.thecvf.com/api/miniconf/users/70133?format=json", "institution": "Purdue"}, {"id": 188348, "fullname": "Runlin Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188348?format=json", "institution": "Purdue University"}, {"id": 130426, "fullname": "Sangpil Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/130426?format=json", "institution": "Korea University"}, {"id": 90006, "fullname": "Karthik Ramani", "url": "http://cvpr.thecvf.com/api/miniconf/users/90006?format=json", "institution": "Purdue University"}], "abstract": "Recent progress in 4D representations, such as Dynamic NeRF and 4D Gaussian Splatting (4DGS), has enabled dynamic 4D scene reconstruction. However, text-driven 4D scene editing remains under-explored due to the challenge of ensuring both multi-view and temporal consistency across space and time during editing.Existing studies rely on 2D diffusion models that edit frames independently, often causing motion distortion, geometric drift, and incomplete editing. We introduce Dynamic-eDiTor, a training-free text-driven 4D editing framework leveraging Multimodal Diffusion Transformer (MM-DiT) and 4DGS. This mechanism consists of Spatio-Temporal Sub-Grid Attention (STGA) for locally consistent cross-view and temporal fusion, and Context Token Propagation (CTP) for global propagation via token inheritance and optical-flow-guided token replacement. Together, these components allow Dynamic-eDiTor to perform seamless, globally consistent multi-view video without additional training and directly optimize pre-trained source 4DGS.Extensive experiments on multi-view video dataset DyNeRF demonstrate that our method achieves superior editing fidelity and both multi-view and temporal consistency prior approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37824", "url": null, "sourceid": 37699, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37807, "uid": "62dee803f4071bd95a11e66e9b8324a7", "name": "Visual Personalization Turing Test", "authors": [{"id": 188320, "fullname": "Rameen Abdal", "url": "http://cvpr.thecvf.com/api/miniconf/users/188320?format=json", "institution": "Snap Inc."}, {"id": 153609, "fullname": "James Burgess", "url": "http://cvpr.thecvf.com/api/miniconf/users/153609?format=json", "institution": "Stanford University"}, {"id": 85389, "fullname": "Sergey Tulyakov", "url": "http://cvpr.thecvf.com/api/miniconf/users/85389?format=json", "institution": "Snap Inc."}, {"id": 158585, "fullname": "Kuan-Chieh Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158585?format=json", "institution": "Snap Inc."}], "abstract": "We introduce the Visual Personalization Turing Test (VPTT), a new paradigm for evaluating contextual visual personalization based on perceptual indistinguishability, rather than identity replication. A model passes the VPTT if its output (image, video, 3D asset, etc.) is indistinguishable to a human or calibrated VLM judge from content a given person might plausibly create or share. To operationalize VPTT, we present the VPTT Framework, integrating a 10k-persona benchmark (VPTT-Bench), a visual retrieval-augmented generator (VPRAG), and the VPTT Score, a text-only metric calibrated against human and VLM judgments. We show high correlation across human, VLM, and VPTT evaluations, validating the VPTT Score as a reliable perceptual proxy. Experiments demonstrate that VPRAG achieves the best alignment\u2013originality balance, offering a scalable and privacy-safe foundation for personalized generative AI.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37807", "url": null, "sourceid": 32841, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40328, "uid": "1aace02b1dc7a9ee987286a90bbef89c", "name": "OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control", "authors": [{"id": 179611, "fullname": "Xilong Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/179611?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 136775, "fullname": "Jianchun Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/136775?format=json", "institution": "Max Planck Institute for Informatics"}, {"id": 176579, "fullname": "Pramod Rao", "url": "http://cvpr.thecvf.com/api/miniconf/users/176579?format=json", "institution": "Max Planck Institute for Informatics"}, {"id": 189361, "fullname": "Timo Teufel", "url": "http://cvpr.thecvf.com/api/miniconf/users/189361?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 189362, "fullname": "Linjie Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189362?format=json", "institution": "Max Planck Institute for Informatics"}, {"id": 189363, "fullname": "Tigran Minasian", "url": "http://cvpr.thecvf.com/api/miniconf/users/189363?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute; Universit\u00e4t des Saarlandes"}, {"id": 189364, "fullname": "Oleksandr Sotnychenko", "url": "http://cvpr.thecvf.com/api/miniconf/users/189364?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 186119, "fullname": "Xiao-Xiao Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/186119?format=json", "institution": "Nanjing University"}, {"id": 127234, "fullname": "Marc Habermann", "url": "http://cvpr.thecvf.com/api/miniconf/users/127234?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 75465, "fullname": "Christian Theobalt", "url": "http://cvpr.thecvf.com/api/miniconf/users/75465?format=json", "institution": "MPI Informatik"}], "abstract": "We introduce OLATverse, a large-scale dataset comprising around 9M images of 765 real-world objects, captured from multiple viewpoints under a diverse set of precisely controlled lighting conditions. While recent advances in object-centric inverse rendering, novel view synthesis and relighting have shown promising results, most techniques still heavily rely on the synthetic datasets for training and small-scale real-world datasets for benchmarking, which limits their realism and generalization. To address this gap, OLATverse offers two key advantages over existing datasets: large-scale coverage of real objects and high-fidelity appearance under precisely controlled illuminations. Specifically, OLATverse contains 765 common and uncommon real-world objects, spanning a wide range of material categories. Each object is captured using 35 DSLR cameras and 331 individually controlled light sources, enabling the simulation of diverse illumination conditions. In addition, for each object, we provide well-calibrated camera parameters, accurate object masks, photometric surface normals, and diffuse albedo as auxiliary resources. We also construct an extensive evaluation set, establishing the first comprehensive real-world object-centric benchmark for inverse rendering and normal estimation. We believe that OLATverse represents a pivotal step toward integrating the next generation of inverse rendering and relighting methods with real-world data.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40328", "url": null, "sourceid": -34909, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38221?format=json"], "related_events_ids": [38221]}, {"id": 37425, "uid": "0ca410c8d727f9fe3ad4f29b4cacf1fa", "name": "Proxy-Tuning: Tailoring Multimodal Autoregressive Models for Subject-Driven Image Generation", "authors": [{"id": 181143, "fullname": "Yi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181143?format=json", "institution": "\u4e2d\u56fd\u79d1\u5b66\u6280\u672f\u5927\u5b66"}, {"id": 187423, "fullname": "Shengju Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/187423?format=json", "institution": "Tencent"}, {"id": 76217, "fullname": "Lingting Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76217?format=json", "institution": "The University of Hong Kong"}, {"id": 187424, "fullname": "Lei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187424?format=json", "institution": "University of Science and Technology of China"}, {"id": 187425, "fullname": "Wandi Qiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187425?format=json", "institution": "University of Science and Technology of China"}, {"id": 187426, "fullname": "Ziqiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187426?format=json", "institution": "Nanjing University of Information Science and Technology"}, {"id": 87579, "fullname": "Lequan Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87579?format=json", "institution": "The University of Hong Kong"}, {"id": 88935, "fullname": "Bin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88935?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Multimodal autoregressive (AR) models, based on next-token prediction and transformer architecture, have demonstrated remarkable capabilities in various multimodal tasks including text-to-image (T2I) generation. Despite their strong performance in general T2I tasks, our research reveals that these models initially struggle with subject-driven image generation compared to dominant diffusion models. To address this limitation, we introduce Proxy-Tuning, leveraging diffusion models to enhance AR models' capabilities in subject-specific image generation. Our method reveals a striking weak-to-strong phenomenon: fine-tuned AR models consistently outperform their diffusion model supervisors in both subject fidelity and prompt adherence. We analyze this performance shift and identify scenarios where AR models excel, particularly in multi-subject compositions and contextual understanding. This work not only demonstrates impressive results in subject-driven AR image generation, but also unveils the potential of weak-to-strong generalization in the image generation domain, contributing to a deeper understanding of different architectures' strengths and limitations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37425", "url": null, "sourceid": 33828, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39352, "uid": "c46269770def083aca28a5ad93d79e18", "name": "SemanticVLA: Towards Semantic Reasoning over Action Memorization via Synergistic Explicit Trace and Latent Action Planning", "authors": [{"id": 180357, "fullname": "Fei Ni", "url": "http://cvpr.thecvf.com/api/miniconf/users/180357?format=json", "institution": "Imperial College London"}, {"id": 144929, "fullname": "Zhuo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/144929?format=json", "institution": "King\u2019s College London"}, {"id": 191905, "fullname": "Yifu Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191905?format=json", "institution": "Tianjin University"}, {"id": 191906, "fullname": "Zibin Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/191906?format=json", "institution": "Tsinghua University"}, {"id": 191907, "fullname": "Xianze Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191907?format=json", "institution": "Tianjin University"}, {"id": 156018, "fullname": "Shan Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/156018?format=json", "institution": "King\u2019s College London"}, {"id": 84767, "fullname": "Jianye Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84767?format=json", "institution": "Tianjin University"}, {"id": 74045, "fullname": "Jiankang Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/74045?format=json", "institution": "Imperial College London"}, {"id": 86877, "fullname": "Stefanos Zafeiriou", "url": "http://cvpr.thecvf.com/api/miniconf/users/86877?format=json", "institution": "Imperial College London"}], "abstract": "Vision-Language-Action (VLA) models have emerged as a promising paradigm where pretrained Vision-Language Models (VLMs) serve as System 2 for high-level reasoning, connected to action experts as System 1 for low-level motor control.However, current works fail to genuinely leverage VLM capabilities: VLMs produce latent embeddings that lack semantic interpretability, providing ambiguous and unstable guidance to downstream policies, while solely action supervision further causes VLMs to degenerate into parameter-heavy fusion encoders that memorize action patterns rather than perform generalized reasoning.To bridge this gap, we introduce **SemanticVLA**, which leverages VLM reasoning through synergistic dual-path design. *Explicit trace reasoning* generates interpretable spatial waypoints as textual coordinate sequences through the VLM's native language interface, directly reusing its pretrained spatial grounding to provide a \"thinking process\" for task planning. *Latent action tokens* complement trace reasoning by learning compact visuomotor primitives grounded in visual observations, providing more fine-grained action representations beyond pure coordinate prediction. This synergy enables trace reasoning to leverage VLM's multimodal understanding for refining latent token prediction, while latent tokens provide stable and grounded guidance that compensates for trace's numerical sensitivity.SemanticVLA achieves 97.0\\% average success rate on LIBERO and 65.1\\% on SimplerEnv WidowX, substantially outperforming strong baselines. More importantly, SemanticVLA maintains significantly more stable performance under instruction rephrasing in both simulation suites, and demonstrates strong advantages on real-world long-horizon and reasoning-intensive tasks.By bridging VLM reasoning and action expert through semantically explicit trace and visually grounded latent action tokens, our approach enables *genuine reasoning* rather than *action memorization*.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39352", "url": null, "sourceid": 46382, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39290, "uid": "5b8fb4dc83626faa47bdf214c6119098", "name": "CoSMo3D: Open-World Promptable 3D Semantic Segmentation through LLM-Guided Canonical Spatial Modeling", "authors": [{"id": 104825, "fullname": "Li Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/104825?format=json", "institution": "Shandong University"}, {"id": 89410, "fullname": "Weikai Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89410?format=json", "institution": "Tencent America"}, {"id": 154392, "fullname": "Yujie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154392?format=json", "institution": "University of North Carolina at Chapel Hill"}, {"id": 76921, "fullname": "Yingda Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/76921?format=json", "institution": "Peking University"}, {"id": 153580, "fullname": "Zeyu HU", "url": "http://cvpr.thecvf.com/api/miniconf/users/153580?format=json", "institution": "Tencent Lightspeed Studios, Singapore"}, {"id": 185827, "fullname": "Runze Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185827?format=json", "institution": "Tencent"}, {"id": 185828, "fullname": "Keyang Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185828?format=json", "institution": "Tencent LightSpeed Studio"}, {"id": 187423, "fullname": "Shengju Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/187423?format=json", "institution": "Tencent"}, {"id": 153582, "fullname": "Xin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153582?format=json", "institution": "LightSpeed Studios"}, {"id": 155002, "fullname": "Xueying Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/155002?format=json", "institution": "Shandong University"}], "abstract": "Open-world promptable 3D semantic segmentation remains brittle as semantics are inferred in the input sensor coordinates. Yet, humans, in contrast, interpret parts via functional roles in a canonical space -- wings extend laterally, handles protrude to the side, and legs support from below. Psychophysical evidence shows that we mentally rotate objects into canonical frames to reveal these roles. To fill this  gap, we propose CoSMo3D, which attains canonical space perception by inducing a latent canonical reference frame  learned directly from data.  By construction, we create a unified canonical dataset through LLM-guided intra- and cross-category alignment, exposing canonical spatial regularities across 200 categories. By induction, we realize canonicality inside the model through a dual-branch architecture with canonical map anchoring and canonical box calibration, collapsing pose variation and symmetry into a stable canonical embedding. This shift from input pose space to canonical representation yields far more stable and transferable part semantics. Experimental results show that CoSMo3D establishes new state of the art in open-world promptable 3D segmentation.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39290", "url": null, "sourceid": 46491, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40362?format=json"], "related_events_ids": [40362]}, {"id": 39728, "uid": "12686196691e427422aba9f58eb9188a", "name": "RegFormer: Transferable Relational Grounding for Efficient Weakly-Supervised Human-Object Interaction Detection", "authors": [{"id": 130913, "fullname": "Jihwan Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/130913?format=json", "institution": "Korea University"}, {"id": 192736, "fullname": "Chanhyeong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192736?format=json", "institution": "Korea University"}, {"id": 182644, "fullname": "Jinyoung Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/182644?format=json", "institution": "KAIST"}, {"id": 175505, "fullname": "Taehoon Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/175505?format=json", "institution": " Korea Advanced Institute of Science and Technology"}, {"id": 156026, "fullname": "Hyunwoo J. Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/156026?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}], "abstract": "Weakly supervised Human\u2013Object Interaction (HOI) detection is vital for scalable scene understanding by learning interactions from only image-level annotations, i.e., no labels specifying which human\u2013object instances are engaged in the interaction.Due to the lack of localization signals, prior works typically propose candidate pairs using an external object detector and then infer their interactions through pairwise reasoning.However, this framework often struggles to scale due to the substantial computational cost incurred by enumerating numerous instance pairs. In addition, it exhibits suboptimal performance due to false positives arising from non-interactive combinations, hindering its capability of instance-level HOI reasoning.To this end, we introduce Relational Grounding Transformer (RegFormer), a versatile interaction recognition module that enables efficient and accurate HOI reasoning.Under image-level supervision, RegFormer leverages spatially grounded implicit signals as guidance for the reasoning process, facilitating effective locality elicitation.Benefiting from the implicitly learned local interactions, our module can accurately distinguish humans, objects, and their interactions within their corresponding regions, enabling precise and efficient instance-level HOI reasoning without any additional training.Our extensive experiments and analysis demonstrate that RegFormer effectively learns spatial cues for instance-level interaction reasoning, operates with high efficiency, and even shows comparable performance compared to fully supervised models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39728", "url": null, "sourceid": 42047, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38740, "uid": "4a481c12f9ce3441585bc800ae000fe8", "name": "Towards Robust Sequential Decomposition for Complex Image Editing", "authors": [{"id": 182379, "fullname": "Zilai Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/182379?format=json", "institution": "Brown University"}, {"id": 87233, "fullname": "Mingdeng Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/87233?format=json", "institution": "The University of Tokyo"}, {"id": 190556, "fullname": "Zijie Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190556?format=json", "institution": "ByteDance Inc."}, {"id": 190557, "fullname": "Xiaochen Lian", "url": "http://cvpr.thecvf.com/api/miniconf/users/190557?format=json", "institution": "ByteDance"}, {"id": 135318, "fullname": "Yichun Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/135318?format=json", "institution": "ByteDance"}, {"id": 168338, "fullname": "Peihao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/168338?format=json", "institution": "ByteDance"}, {"id": 89379, "fullname": "Chen Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/89379?format=json", "institution": "Brown University"}, {"id": 126478, "fullname": "Peng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126478?format=json", "institution": "Bytedance US AILab"}], "abstract": "Recent advances in visual generative models have enabled high-fidelity image editing guided by human instructions. However, these models often struggle with complex instructions involving combinatorial editing operations or inter-step dependencies. This difficulty stems from the limitations of two canonical paradigms: (1) single-turn editing, which attempts to apply all instructed edits in one pass, often fails to parse the complex instruction accurately and causes undesired edits; and (2) sequential editing can decompose the task into simpler steps but suffers from compounding errors introduced by the sequential execution, leading to low-fidelity results. To derive a robust solution for complex image editing, we examine editing behaviors of different paradigms under a unified in-context editing framework, and study how the benefits of sequential decomposition can be balanced against its error-accumulation drawbacks. We further develop a synthetic data pipeline that constructs editing tasks of varying instruction complexity, allowing us to curate a large-scale editing dataset with high-quality decomposed sequences. By finetuning on synthetic data, we discovered that with properly designed editing paradigms, sequential decomposition yields robust improvements even as task complexity increases. Furthermore, the decomposition skills learned from synthetic tasks can transfer to real images by co-training with real-world editing data, demonstrating the promise of sim-to-real generalization for tackling complex image editing across broader domains.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38740", "url": null, "sourceid": 35384, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36903, "uid": "bc4d9b0e9bdbd3186592452785c479cc", "name": "Superman: Unifying Skeleton and Vision for Human Motion Perception and Generation", "authors": [{"id": 179959, "fullname": "Xinshun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/179959?format=json", "institution": "Peking University"}, {"id": 144398, "fullname": "Peiming Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/144398?format=json", "institution": "Peking University"}, {"id": 186160, "fullname": "Ziyi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186160?format=json", "institution": "Peking University"}, {"id": 186161, "fullname": "Zhongbin Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186161?format=json", "institution": "Tencent"}, {"id": 186162, "fullname": "Zhichao Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186162?format=json", "institution": "Alibaba Group"}, {"id": 186163, "fullname": "Songtao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186163?format=json", "institution": "Sony (China) Limited"}, {"id": 139138, "fullname": "Xiangtai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/139138?format=json", "institution": "ByteDance Inc."}, {"id": 86481, "fullname": "Mengyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86481?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Human motion analysis tasks, such as temporal 3D pose estimation, motion prediction, and motion in-betweening, play an essential role in computer vision. However, current paradigms suffer from severe fragmentation. First, the field is split between \"perception\" models that understand motion from video but only output text, and \"generation\" models that cannot perceive from raw visual input. Second, generative MLLMs are often limited to single-frame, static poses using dense, parametric SMPL models, failing to handle temporal motion. Third, existing motion vocabularies are built from skeleton data alone, severing the link to the visual domain. To address these challenges, we introduce Superman, a unified framework that bridges visual perception with temporal, skeleton-based motion generation. Our solution is twofold. First, to overcome the modality disconnect, we propose a Vision-Guided Motion Tokenizer. Leveraging the natural geometric alignment between 3D skeletons and visual data, this module pioneers robust joint learning from both modalities, creating a unified, cross-modal motion vocabulary. Second, grounded in this motion language, a single, unified MLLM architecture is trained to handle all tasks. This module flexibly processes diverse, temporal inputs, unifying 3D skeleton pose estimation from video (perception) with skeleton-based motion prediction and in-betweening (generation). Extensive experiments on standard benchmarks, including Human3.6M, demonstrate that our unified method achieves state-of-the-art or competitive performance across all motion tasks. This showcases a more efficient and scalable path for generative motion analysis using skeletons. Code is in the supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36903", "url": null, "sourceid": 39955, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36269, "uid": "a528bca9b3e06f60ba610a7249ab99e5", "name": "See What We Cannot See: A Geo-guided Reasoning Benchmark for Object Counting under Adverse Earth Observation Conditions", "authors": [{"id": 176626, "fullname": "Jiayi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176626?format=json", "institution": "Wuhan University"}, {"id": 184631, "fullname": "Zhihong Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184631?format=json", "institution": "Wuhan University"}, {"id": 184632, "fullname": "Hongchen Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/184632?format=json", "institution": "Wuhan University"}, {"id": 184633, "fullname": "Daiqin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184633?format=json", "institution": "Wuhan University"}, {"id": 129613, "fullname": "Zhenzhong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/129613?format=json", "institution": "Wuhan University"}], "abstract": "Object counting in remote sensing imagery becomes challenging when visual cues are obscured by clouds, fog, shadows, or low-light conditions. Yet earth observation inherently provides complementary geo-modalities, including land use and map, which offer stable structural and contextual priors that remain available when appearance cues fail. In this paper, we introduce  \\textbf{GROC},  the first large-scale dataset \\textbf{G}eo-guided \\textbf{R}easoning in \\textbf{O}bject \\textbf{C}ounting under adverse earth observation conditions. GROC contains 1.2 million point annotations over 14K images, each aligned with 3 modalities that preserve original geospatial information. We also provide a data engine to collect a large-scale object counting dataset with multiple geo-modalities, realistic degradations, and reliable annotations. We further present an counting agent that adaptively leverages geo-modalities to produce reliable estimates. Extensive experiments show that existing models struggle to \u201csee\u201d through adverse conditions, whereas geo-modalities improve robustness. GROC establishes the first benchmark that explicitly challenges models to \\textbf{see what they cannot see}, charting a new direction for geo-guided amodal reasoning in earth observation.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36269", "url": null, "sourceid": 39719, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38316, "uid": "57a9d589fa03ef4795f38f84306486c4", "name": "Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation", "authors": [{"id": 146919, "fullname": "Binyuan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/146919?format=json", "institution": "Wuhan University"}, {"id": 85619, "fullname": "Yuning Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85619?format=json", "institution": "University of Science and Technology of China"}, {"id": 151427, "fullname": "Weinan Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/151427?format=json", "institution": "University of Science and Technology of China"}, {"id": 71048, "fullname": "hualiang wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/71048?format=json", "institution": "HKUST"}, {"id": 189583, "fullname": "Mu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189583?format=json", "institution": "ByteDance Inc."}, {"id": 184633, "fullname": "Daiqin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184633?format=json", "institution": "Wuhan University"}], "abstract": "Recent proprietary models such as Sora\u20112 demonstrate promising progress in generating multi\u2011shot videos conditioned on multiple reference characters. However, academic research on this problem remains limited. We study this task and identify a core challenge: when reference images exhibit highly similar appearances, the model often suffers from reference confusion, where semantically similar tokens degrade the model\u2019s ability to retrieve the correct context. To address this, we introduce PoCo (Position Embedding as a Context Controller), which incorporates position encoding as additional context control beyond semantic retrieval. By employing side information of tokens, PoCo enables precise token\u2011level matching while preserving implicit semantic consistency modeling. Building on PoCo, we develop a multi\u2011reference and multi\u2011shot video generation model capable of accurately controlling characters with extremely similar visual traits. Extensive experiments demonstrate that PoCo improves cross\u2011shot consistency and reference fidelity compared with various baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38316", "url": "https://poco-multiref-multishot.github.io/", "sourceid": 33663, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36879, "uid": "b90b1b71d03d17c8af3e78947fa87a0a", "name": "The Invisible Gorilla Effect in Out-of-distribution Detection", "authors": [{"id": 186088, "fullname": "Harry Anthony", "url": "http://cvpr.thecvf.com/api/miniconf/users/186088?format=json", "institution": "University of Oxford"}, {"id": 186089, "fullname": "Ziyun Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186089?format=json", "institution": "University of Oxford"}, {"id": 186090, "fullname": "Hermione Warr", "url": "http://cvpr.thecvf.com/api/miniconf/users/186090?format=json", "institution": "University of Oxford"}, {"id": 157202, "fullname": "Konstantinos Kamnitsas", "url": "http://cvpr.thecvf.com/api/miniconf/users/157202?format=json", "institution": "University of Oxford"}], "abstract": "Deep Neural Networks achieve high performance in vision tasks by learning features from regions of interest (ROI) within images, but their performance degrades when deployed on out-of-distribution (OOD) data that differs from training data. This challenge has led to OOD detection methods that aim to identify and reject unreliable predictions. Although prior work shows that OOD detection performance varies by artefact type, the underlying causes remain underexplored. To this end, we identify a previously unreported bias in OOD detection: for hard-to-detect artefacts (near-OOD), detection performance typically improves when the artefact shares visual similarity (e.g. colour) with the model\u2019s ROI and drops when it does not - a phenomenon we term the Invisible Gorilla Effect. For example, in a skin lesion classifier with red lesion ROI, we show the method Mahalanobis Score achieves a 31.5% higher AUROC when detecting OOD red ink (similar to ROI) compared to black ink (dissimilar) annotations. We annotated artefacts by colour in 11,355 images from three public datasets (e.g. ISIC) and generated colour-swapped counterfactuals to rule out dataset bias. We then evaluated 40 OOD methods across 7 benchmarks and found significant performance drops for most methods when artefacts differed from the ROI. Our findings highlight an overlooked failure mode in OOD detection and provide guidance for more robust detectors. Code and annotations will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36879", "url": null, "sourceid": 33906, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39108, "uid": "b2b9e8eb2b9bc46c4f57785472b96f4d", "name": "RegionRoute: Regional Style Transfer with Diffusion Model", "authors": [{"id": 181925, "fullname": "Bowen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/181925?format=json", "institution": "The University of Texas at Austin"}, {"id": 191376, "fullname": "Jake Zuena", "url": "http://cvpr.thecvf.com/api/miniconf/users/191376?format=json", "institution": "Dolby Laboratories"}, {"id": 183579, "fullname": "Alan Bovik", "url": "http://cvpr.thecvf.com/api/miniconf/users/183579?format=json", "institution": "University if Colorado Boulder"}, {"id": 191377, "fullname": "Divya Kothandaraman", "url": "http://cvpr.thecvf.com/api/miniconf/users/191377?format=json", "institution": "Dolby Laboratories"}], "abstract": "Precise spatial control in diffusion-based style transfer remains challenging. This challenge arises because diffusion models treat style as a global feature and lack explicit spatial grounding of style representations, making it difficult to restrict style application to specific objects or regions. To our knowledge, existing diffusion models are unable to perform true localized style transfer, typically relying on handcrafted masks or multi-stage post-processing that introduce boundary artifacts and limit generalization. To address this, we propose an attention-supervised diffusion framework that explicitly teaches the model where to apply a given style by aligning the attention scores of style tokens with object masks during training. Two complementary objectives, a Focus loss based on KL divergence and a Cover loss using binary cross-entropy, jointly encourage accurate localization and dense coverage. A modular LoRA-MoE design further enables efficient and scalable multi-style adaptation. To evaluate localized stylization, we introduce the Regional Style Editing Score, which measures Regional Style Matching through CLIP-based similarity within the target region and Identity Preservation via masked LPIPS and pixel-level consistency on unedited areas. Experiments show that our method achieves mask-free, single-object style transfer at inference, producing regionally accurate and visually coherent results that outperform existing diffusion-based editing approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39108", "url": null, "sourceid": 37017, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38103, "uid": "4328ad1978e9bcb7b297600e3019a5a6", "name": "MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents", "authors": [{"id": 180925, "fullname": "Ruoxuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180925?format=json", "institution": "Jilin University"}, {"id": 189063, "fullname": "Qiyun Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189063?format=json", "institution": "Jilin University"}, {"id": 189064, "fullname": "Zhiyu Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/189064?format=json", "institution": "Jilin University"}, {"id": 189065, "fullname": "Ziqi Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189065?format=json", "institution": "Jilin University"}, {"id": 189066, "fullname": "Siyu Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189066?format=json", "institution": "Jilin University"}, {"id": 182698, "fullname": "Jian-Yu Jiang-Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/182698?format=json", "institution": "National Taiwan University"}, {"id": 189067, "fullname": "Bin Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189067?format=json", "institution": "Jilin University"}, {"id": 133400, "fullname": "Hongxia Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/133400?format=json", "institution": "Jilin University"}, {"id": 88152, "fullname": "Jianlong Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88152?format=json", "institution": "Microsoft"}, {"id": 127961, "fullname": "Wen-Huang Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/127961?format=json", "institution": "National Taiwan University"}], "abstract": "Theory of Mind (ToM) refers to the ability to infer others\u2019 mental states, such as beliefs, desires, and intentions. Current vision\u2013language embodied agents lack ToM-based decision-making, and existing benchmarks focus solely on human mental states while ignoring the agent\u2019s own perspective, hindering coherent decision and action generation. To address this, we propose MindPower, a Robot-Centric framework integrating Perception, Mental Reasoning, Decision Making and Action. Given multimodal inputs, MindPower first perceives the environment and human states, then performs ToM Reasoning to model both self and others, and finally generates decisions and actions guided by inferred mental states. Furthermore, we introduce Mind-Reward, a novel optimization objective that encourages VLMs to produce consistent ToM Reasoning and behavior. Our model outperforms GPT-4o by 12.77% in decision making and 12.49% in action generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38103", "url": null, "sourceid": 36566, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38833, "uid": "343e8f41b18325a6058adc3773ed4d53", "name": "Towards Reasoning-Preserving Unlearning in Multimodal Large Language Models", "authors": [{"id": 190791, "fullname": "Hongji Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190791?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 190792, "fullname": "Manjiang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190792?format=json", "institution": "The University of Queensland"}, {"id": 190793, "fullname": "Junchi Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190793?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence; University of Electronic Science and Technology of China"}, {"id": 179482, "fullname": "PRIYANKA SINGH", "url": "http://cvpr.thecvf.com/api/miniconf/users/179482?format=json", "institution": "The University of Queensland"}, {"id": 179644, "fullname": "Xue Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/179644?format=json", "institution": "The University of Queensland"}, {"id": 190794, "fullname": "Di Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190794?format=json", "institution": "KAUST"}, {"id": 181486, "fullname": "Lijie Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181486?format=json", "institution": "MBZUAI"}], "abstract": "Machine unlearning aims to erase requested data from trained models without full retraining. For Reasoning Multimodal Large Language Models (RMLLMs), this is especially challenging: intermediate chain-of-thought steps can still leak sensitive information even when final answers are forgotten, and aggressive interventions easily damage general reasoning ability. However, existing benchmarks do not jointly evaluate how well unlearning methods suppress reasoning-level leakage while preserving reasoning competence. We address this gap with RMLLMU-Bench, the first benchmark for RMLLM unlearning that extends standard forgetting metrics with explicit reasoning traces and dedicated measures of reasoning leakage and reasoning retention. A systematic evaluation on RMLLMU-Bench shows that current unlearning methods for MLLMs and Large (Language) Reasoning Models (LRMs) either leave substantial leakage in the reasoning process or severely degrade reasoning performance. To overcome these limitations, we propose R-MUSE (Reasoning-preserving MLLM Unlearning via Subspace guidance and Adaptive Steering), a training-free, inference-time intervention framework that steers internal representations to forget both answers and reasoning traces while explicitly preserving general reasoning ability. Experiments on RMLLMU-Bench demonstrate that R-MUSE achieves a substantially better balance between effective forgetting and reasoning retention than existing approaches. Our code and data will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38833", "url": null, "sourceid": 36890, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39651, "uid": "bd71244c9e7f8c7835251fa4c63a4996", "name": "BoostSLT: Boosting Sign Language Translation via a Plug-and-Play Diffusion-Based Semantic Enhancer", "authors": [{"id": 192567, "fullname": "Changzhou Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/192567?format=json", "institution": "Swinburne University of Technology"}, {"id": 192568, "fullname": "Wanlun Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/192568?format=json", "institution": "Swinburne University of Technology"}, {"id": 146283, "fullname": "XI TANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/146283?format=json", "institution": "Swinburne University of Technology"}, {"id": 187802, "fullname": "Kun Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187802?format=json", "institution": "Edith Cowan University"}, {"id": 192569, "fullname": "Sheng Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192569?format=json", "institution": "Swinburne University of Technology"}, {"id": 149034, "fullname": "Yang Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/149034?format=json", "institution": "Swinburne University of Technology"}], "abstract": "Sign Language Translation (SLT) converts continuous sign videos into spoken language text, yet current models, whether gloss-based or gloss-free, struggle with long or discourse-level inputs. Recent architectures such as TwoStreamNetwork and CV-SLT have nearly saturated short-sentence accuracy, but their performance degrades on long sentences and multi-sentence paragraphs. In real scenarios such as news, interviews or daily conversations, signers naturally produce extended signing sequences with complex contextual dependencies. Moreover, identifying precise gloss boundaries remains a key obstacle, while gloss-based methods, though often superior, incur heavy annotation costs. The community therefore needs a solution that mitigates gloss dependency while preserving translation quality.We present **BoostSLT**, a context-aware framework enhancing semantic consistency over long sign sequences without gloss supervision. Instead of requiring explicit gloss segmentation, BoostSLT introduces an *Energy-Aware Temporal Segmentation (EAT-Seg)* module that dynamically partitions videos into semantically coherent fragments, followed by a *Diffusion-based Semantic Reconstruction (DSR)* module that stitches and refines fragment-level translations into globally fluent paragraphs. The framework is plug-and-play and model-agnostic, seamlessly integrating with existing gloss-based or gloss-free pipelines across languages. Experiments on PHOENIX-2014T, CSL-Daily, and Auslan-Daily show consistent BLEU and Rouge-L gains, confirming that diffusion-driven semantic reconstruction effectively bridges local accuracy and global coherence in long-form SLT.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39651", "url": null, "sourceid": 41130, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40375?format=json"], "related_events_ids": [40375]}, {"id": 36616, "uid": "239ce949b3bcb25a0c47779fe37c80e0", "name": "Best Segmentation Buddies for Image-Shape Correspondence", "authors": [{"id": 72635, "fullname": "Itai Lang", "url": "http://cvpr.thecvf.com/api/miniconf/users/72635?format=json", "institution": "University of Chicago &amp; Tel Aviv University"}, {"id": 185474, "fullname": "Dongwei Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185474?format=json", "institution": "University of California, Berkeley"}, {"id": 87982, "fullname": "Dale Decatur", "url": "http://cvpr.thecvf.com/api/miniconf/users/87982?format=json", "institution": "University of Chicago"}, {"id": 86488, "fullname": "Rana Hanocka", "url": "http://cvpr.thecvf.com/api/miniconf/users/86488?format=json", "institution": "University of Chicago"}], "abstract": "Finding correspondences is a fundamental and extensively researched problem in computer vision and graphics. In this work, we examine the underexplored problem of estimating segmentation-to-segmentation correspondence between images in the wild and untextured 3D shapes. This task is highly challenging due to substantial differences in appearance, geometry, and viewpoint. Our approach bridges the cross-modality gap by linking pixels in the image segment to vertices in the corresponding semantic part of the 3D shape.To achieve this, we first distill deep visual features from a 2D vision model onto the 3D shape surface, allowing for the computation of feature similarity between image pixels and shape vertices. We then identify Best Segmentation Buddies, vertices whose most similar image pixel lies within the image segmentation region, enabling the reliable discovery of vertices in semantically corresponding shape parts. Finally, we leverage distilled 3D features from the 2D segmentation model of the image to segment the shape directly in 3D, bootstrapping the correspondence process. We demonstrate the generality and robustness of our approach across a wide range of image-shape pairs, showcasing accurate and semantically meaningful correspondences. Our code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36616", "url": null, "sourceid": 44615, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37663, "uid": "5f74f8625f41c155c6437067d78ca25e", "name": "PCA-Seg: Revisiting  Cost Aggregation for Open-Vocabulary Semantic and Part Segmentation", "authors": [{"id": 181787, "fullname": "Jianjian Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/181787?format=json", "institution": "Nanjing University of Science  and Technology"}, {"id": 107127, "fullname": "Tao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/107127?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 187972, "fullname": "Yi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187972?format=json", "institution": "Nanjing Normal University"}, {"id": 181698, "fullname": "Gensheng Pei", "url": "http://cvpr.thecvf.com/api/miniconf/users/181698?format=json", "institution": "Sungkyunkwan University"}, {"id": 157797, "fullname": "Xiangbo Shu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157797?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 129605, "fullname": "Yazhou Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/129605?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 72996, "fullname": "Fumin Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/72996?format=json", "institution": "UESTC"}], "abstract": "Recent advances in vision-language models (VLMs) have garnered substantial attention in open-vocabulary semantic and part segmentation (OSPS).  However, existing methods extract image-text alignment cues from cost volumes through a serial structure of spatial and class aggregations, leading to knowledge interference between class-level semantics and spatial context.  Therefore, this paper proposes a simple yet effective parallel cost aggregation (PCA-Seg) paradigm to alleviate the above challenge,  enabling the model to capture richer vision-language alignment information from cost volumes.   Specifically, we design an expert-driven perceptual learning (EPL) module that efficiently integrates semantic and contextual streams. It incorporates a multi-expert parser to extract complementary features from multiple perspectives. In addition, a coefficient mapper is designed to adaptively learn pixel-specific weights for each feature, enabling the integration of complementary knowledge into a unified and robust feature embedding. Furthermore, we propose a feature orthogonalization decoupling (FOD) strategy to mitigate redundancy between the semantic and contextual streams, which allows the EPL module to learn diverse knowledge from orthogonalized features. Extensive experiments on eight benchmarks show that each parallel block in PCA-Seg adds merely 0.35M parameters while achieving state-of-the-art OSPS performance.  Our source code is available in the supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37663", "url": null, "sourceid": 41243, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40375, "uid": "bd71244c9e7f8c7835251fa4c63a4996", "name": "BoostSLT: Boosting Sign Language Translation via a Plug-and-Play Diffusion-Based Semantic Enhancer", "authors": [{"id": 192567, "fullname": "Changzhou Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/192567?format=json", "institution": "Swinburne University of Technology"}, {"id": 192568, "fullname": "Wanlun Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/192568?format=json", "institution": "Swinburne University of Technology"}, {"id": 146283, "fullname": "XI TANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/146283?format=json", "institution": "Swinburne University of Technology"}, {"id": 187802, "fullname": "Kun Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187802?format=json", "institution": "Edith Cowan University"}, {"id": 192569, "fullname": "Sheng Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192569?format=json", "institution": "Swinburne University of Technology"}, {"id": 149034, "fullname": "Yang Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/149034?format=json", "institution": "Swinburne University of Technology"}], "abstract": "Sign Language Translation (SLT) converts continuous sign videos into spoken language text, yet current models, whether gloss-based or gloss-free, struggle with long or discourse-level inputs. Recent architectures such as TwoStreamNetwork and CV-SLT have nearly saturated short-sentence accuracy, but their performance degrades on long sentences and multi-sentence paragraphs. In real scenarios such as news, interviews or daily conversations, signers naturally produce extended signing sequences with complex contextual dependencies. Moreover, identifying precise gloss boundaries remains a key obstacle, while gloss-based methods, though often superior, incur heavy annotation costs. The community therefore needs a solution that mitigates gloss dependency while preserving translation quality.We present **BoostSLT**, a context-aware framework enhancing semantic consistency over long sign sequences without gloss supervision. Instead of requiring explicit gloss segmentation, BoostSLT introduces an *Energy-Aware Temporal Segmentation (EAT-Seg)* module that dynamically partitions videos into semantically coherent fragments, followed by a *Diffusion-based Semantic Reconstruction (DSR)* module that stitches and refines fragment-level translations into globally fluent paragraphs. The framework is plug-and-play and model-agnostic, seamlessly integrating with existing gloss-based or gloss-free pipelines across languages. Experiments on PHOENIX-2014T, CSL-Daily, and Auslan-Daily show consistent BLEU and Rouge-L gains, confirming that diffusion-driven semantic reconstruction effectively bridges local accuracy and global coherence in long-form SLT.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40375", "url": null, "sourceid": -41130, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39651?format=json"], "related_events_ids": [39651]}, {"id": 38725, "uid": "1f4e960dc47c49ab5c2a495e00055f64", "name": "OmniFM: Toward Modality-Robust and Task-Agnostic Federated Learning for Heterogeneous Medical Imaging", "authors": [{"id": 180411, "fullname": "meilin liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180411?format=json", "institution": "Shenyang University of Technology"}, {"id": 190529, "fullname": "Jiaying Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190529?format=json", "institution": "Shenyang University of Technology"}, {"id": 190530, "fullname": "Jing Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190530?format=json", "institution": "Shenyang University of Technology, Shenyang University of Technology"}], "abstract": "Federated learning (FL) has become a promising paradigm for collaborative medical image analysis, yet existing frameworks remain tightly coupled to task-specific backbones and are fragile under heterogeneous imaging modalities. Such constraints hinder real-world deployment, where institutions vary widely in modality distributions and must support diverse downstream tasks. To address this limitation, we propose OmniFM,a modality- and task-agnostic FL framework that unifies training across classification,segmentation, super-resolution, visual question answering,and multimodal fusion without re-engineering the optimization pipeline. OmniFM builds on a key frequency-domain insight: low-frequency spectral components exhibit strong cross-modality consistency and encode modality-invariant anatomical structures. Accordingly, OmniFM integrates(i) Global Spectral Knowledge Retrieval to inject global frequency priors, (ii) Embedding-wise Cross-Attention Fusion to align representations, and (iii) Prefix\u2013Suffix Spectral Prompting to jointly condition global and personalized cues, together regularized by a Spectral-Proximal Alignment objective that stabilizes aggregation. Experiments on real-world datasets show that OmniFM consistently surpasses state-of-the-art FL baselines across intra- and cross-modality heterogeneity, achieving superior results under both fine-tuning and training-from-scratch setups.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38725", "url": null, "sourceid": 38419, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37600, "uid": "ff63d5c2a8762344e119750b0cb6ce1b", "name": "F2Net: A Frequency-Fused Network for Ultra-High Resolution Remote Sensing Segmentation", "authors": [{"id": 187798, "fullname": "Hengzhi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187798?format=json", "institution": null}, {"id": 187799, "fullname": "Liqian Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187799?format=json", "institution": "University of Sydney, University of Sydney"}, {"id": 185893, "fullname": "Wenhua Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185893?format=json", "institution": "University of Sydney, University of Sydney"}, {"id": 187800, "fullname": "Xiaogang Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187800?format=json", "institution": "Adelaide University"}, {"id": 187801, "fullname": "Qiuxia Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187801?format=json", "institution": "South China University of Technology"}, {"id": 144965, "fullname": "Lianlei Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/144965?format=json", "institution": "Tsinghua University"}, {"id": 187802, "fullname": "Kun Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187802?format=json", "institution": "Edith Cowan University"}], "abstract": "Semantic segmentation of ultra-high-resolution (UHR) remote sensing imagery is critical for applications like environmental monitoring and urban planning but faces com- putational and optimization challenges. Conventional methods either lose fine details through downsampling or fragment global context via patch processing. While multi-branch networks ad- dress this trade-off, they suffer from computational inefficiency and conflicting gradient dynamics during training. We propose F2Net, a frequency-aware framework that decomposes UHR images into high- and low-frequency components for specialized processing. The high-frequency branch preserves full-resolution structural details, while the low-frequency branch processes downsampled inputs through dual sub-branches capturing short- and long-range dependencies. A Hybrid-Frequency Fusion mod- ule integrates these observations, guided by two novel objectives: Cross-Frequency Alignment Loss ensures semantic consistency between frequency components, and Cross-Frequency Balance Loss regulates gradient magnitudes across branches to stabilize training. Evaluated on DeepGlobe and Inria Aerial benchmarks, F2Net achieves state-of-the-art performance with mIoU of 80.22 and 83.39, respectively.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37600", "url": null, "sourceid": 33048, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39679, "uid": "a0e862b5c2de0352aeb851d943e6046c", "name": "PhenoYieldNet: Learning Crop-Aware Phenological Responses for Multi-Crop Yield Prediction", "authors": [{"id": 192632, "fullname": "Yu Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192632?format=json", "institution": "University of Sydney, University of Sydney"}, {"id": 187800, "fullname": "Xiaogang Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187800?format=json", "institution": "Adelaide University"}, {"id": 192633, "fullname": "Shan Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192633?format=json", "institution": "Wuhan polytechnic university"}, {"id": 91374, "fullname": "Wei Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91374?format=json", "institution": "La Trobe University"}, {"id": 192634, "fullname": "Thomas Bishop", "url": "http://cvpr.thecvf.com/api/miniconf/users/192634?format=json", "institution": "University of Sydney, University of Sydney"}, {"id": 87486, "fullname": "Zhiyong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87486?format=json", "institution": "The University of Sydney"}, {"id": 187802, "fullname": "Kun Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187802?format=json", "institution": "Edith Cowan University"}], "abstract": "Accurate crop yield prediction is crucial for sustainable agriculture and global food security. While existing methods are predominantly developed for single-crop prediction, they often struggle to generalize across diverse crop types, without addressing the unique crop phenological responses that are dynamically modulated by complex weather patterns. In this paper, we propose PhenoYieldNet, a multi-crop yield prediction framework that learns crop-specific phenology by explicitly modeling their responses with temporal drivers. Specifically, we develop a crop-aware temporal decoder consisting of a Crop Phenology Bank (CPB) and a Crop Phenology Attention (CPA) module. The CPB integrates a set of learnable embeddings, which leverage a query to guide the CPA module to learn the most relevant phenology patterns for the specific crop. And the CPA module explicitly captures multi-scale trend and variation components to construct temporal contexts, enabling the model to dynamically adjust the attention across different phenological stages. To learn robust and generalizable features for multi-crop prediction, the encoder is initialized with a pre-trained foundation model, and further adapted via a self-supervised Temporal Contrastive Adaptation strategy to align with agricultural temporal dynamics. Extensive experiments conducted on multi-crop datasets indicate that our proposed method significantly outperforms state-of-the-art methods, exhibiting strong generalization capabilities across different regions and crops.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39679", "url": null, "sourceid": 34253, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36471, "uid": "01bfa6f22b2ebecc7539536db6ea78f3", "name": "FLOW: Optimal Transport-Driven Feature Warping for Generalized Remote Physiological Measurement", "authors": [{"id": 180291, "fullname": "bo zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180291?format=json", "institution": "GBU/CUHKSZ"}, {"id": 185124, "fullname": "Junzhe Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185124?format=json", "institution": "Harbin Institute of Technology (Shenzhen); Great Bay University"}, {"id": 128503, "fullname": "Dan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/128503?format=json", "institution": "Hefei University of Technology"}, {"id": 185125, "fullname": "Dongmin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185125?format=json", "institution": "Southern University of Science and Technology"}, {"id": 169175, "fullname": "Wenjin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/169175?format=json", "institution": "Southern University of Science and Technology"}, {"id": 156482, "fullname": "Tao Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/156482?format=json", "institution": "Macao Polytechnic University"}, {"id": 156480, "fullname": "Yue Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/156480?format=json", "institution": "Macao Polytechnic University"}, {"id": 70489, "fullname": "Zitong YU", "url": "http://cvpr.thecvf.com/api/miniconf/users/70489?format=json", "institution": "Nanyang Technological University"}], "abstract": "Remote photoplethysmography (rPPG) enables non-contact physiological measurement from facial videos but often suffers from severe performance degradation under domain shifts. Traditional STMap-based methods~\\cite{niu2019rhythmnet} rely on predefined spatio-temporal representations that offer engineered robustness but discard fine-grained temporal cues. In contrast, end-to-end rPPG models directly learn hierarchical features from raw videos, capturing richer physiological patterns yet remaining highly sensitive to variations in illumination, motion, and camera sensors. To address these challenges, we propose \\textbf{FLOW (Feature-Level Optimal Warping)}, an Optimal Transport (OT)\u2013driven and plug-and-play framework for multi-source domain generalization in rPPG measurement.FLOW formulates domain shifts as structured Optimal Transport problems and performs feature-level warping to align multiple source domains in a shared latent space. Specifically, a dual-consistency regularization is proposed to enforce both frequency fidelity and mapping invariance, while a shape-adaptive alignment module is designed to enable architecture-agnostic integration without re-training. We further derive a generalization bound based on conditional OT discrepancy, providing theoretical insight into FLOW\u2019s robustness under distributional shifts. Extensive experiments across diverse rPPG benchmarks demonstrate that FLOW consistently improves cross-domain generalization while maintaining lightweight and modular deployment.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36471", "url": null, "sourceid": 32194, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37072, "uid": "9692a39f82adafb404aa2620230fd829", "name": "PHASE-Net: Physics-Grounded Harmonic Attention System for Efficient Remote Photoplethysmography Measurement", "authors": [{"id": 180291, "fullname": "bo zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180291?format=json", "institution": "GBU/CUHKSZ"}, {"id": 128503, "fullname": "Dan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/128503?format=json", "institution": "Hefei University of Technology"}, {"id": 185124, "fullname": "Junzhe Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185124?format=json", "institution": "Harbin Institute of Technology (Shenzhen); Great Bay University"}, {"id": 87841, "fullname": "Yong Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87841?format=json", "institution": "Harbin Institute of Technology"}, {"id": 184923, "fullname": "Bochao Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184923?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 156482, "fullname": "Tao Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/156482?format=json", "institution": "Macao Polytechnic University"}, {"id": 156480, "fullname": "Yue Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/156480?format=json", "institution": "Macao Polytechnic University"}, {"id": 70489, "fullname": "Zitong YU", "url": "http://cvpr.thecvf.com/api/miniconf/users/70489?format=json", "institution": "Nanyang Technological University"}], "abstract": "Remote photoplethysmography (rPPG) measurement enables non-contact physiological monitoring but suffers from accuracy degradation under head motion and illumination changes. Existing deep learning methods are mostly heuristic and lack theoretical grounding, limiting robustness and interpretability. In this work, we propose a physics-informed rPPG paradigm derived from the Navier\u2013Stokes equations of hemodynamics, showing that the pulse signal follows a second-order dynamical system whose discrete solution naturally leads to a causal convolution, justifying the use of a Temporal Convolutional Network (TCN). Based on this principle, we design the PHASE-Net, a lightweight model with three key components: 1) Zero-FLOPs  Axial Swapper module to swap or transpose a few spatial channels to mix distant facial regions, boosting cross-region feature interaction without changing temporal order; 2) Adaptive Spatial Filter to learn a soft spatial mask per frame to highlight signal-rich areas and suppress noise for cleaner feature maps; and 3) Gated TCN, a causal dilated TCN with gating that models long-range temporal dynamics for accurate pulse recovery. Extensive experiments demonstrate that PHASE-Net achieves state-of-the-art performance and strong efficiency, offering a theoretically grounded and deployment-ready rPPG solution.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37072", "url": null, "sourceid": 33675, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39363, "uid": "3953e20c5314a7773feba5489c97d84b", "name": "Nestwork: Conditional 3D Furnished House Layout Generation through Latent Heterogeneous Graph Diffusion", "authors": [{"id": 191928, "fullname": "Shuhan Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191928?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 191929, "fullname": "Biru Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191929?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 191930, "fullname": "Junling Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191930?format=json", "institution": "Georgia Institute of Technology"}], "abstract": "This paper introduces Nestwork, a unified latent-diffusion framework for conditional 3D furnished house layout generation using a heterogeneous graph of rooms and furniture. Designing reasonable and controllable 3D layouts that reflect the underlying semantic structure of a house is a key challenge in AI-assisted architectural design. Existing graph-based methods either produce unfurnished multi-room layouts or generate furnished scenes one room at a time, preventing joint reasoning over room structure and furniture placement. Nestwork represents an entire house as a heterogeneous graph with typed room and furniture nodes and multiple spatial relations. A single unconditional autoencoder based on a heterogeneous graph attention network embeds this graph into a compact latent space, and a low-rank relational field compensates for missing geometric edge information at test time. A diffusion denoiser is trained once using random masking, enabling the same model to operate under different conditioning strengths, from topology-only to fully annotated graphs. Multi-level conditioning combines masked node-level attention with graph-level embeddings to support flexible user control, including layouts specified through natural-language descriptions. Experiments on the 3D-FRONT dataset show that Nestwork achieves high fidelity, structural consistency, and diversity. Controlled ablations further validate the contributions of each component.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39363", "url": null, "sourceid": 45199, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36553, "uid": "7dd701e074e9850f9d6d1b52332b0dee", "name": "NeuROK: Generative 4D Neural Object Kinematics", "authors": [{"id": 86204, "fullname": "Chen Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86204?format=json", "institution": "Stanford University"}, {"id": 185326, "fullname": "Guangzhao He", "url": "http://cvpr.thecvf.com/api/miniconf/users/185326?format=json", "institution": "Cornell University"}, {"id": 151344, "fullname": "Yue Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/151344?format=json", "institution": "Stanford University"}, {"id": 159444, "fullname": "Yunzhi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/159444?format=json", "institution": "Stanford University"}, {"id": 152816, "fullname": "Shangzhe Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152816?format=json", "institution": "University of Cambridge"}, {"id": 84531, "fullname": "Jiajun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84531?format=json", "institution": "Stanford University"}], "abstract": "Data-driven approaches have revolutionized 3D vision, enabling transformers to effectively reconstruct and generate static 3D objects. However, generating simulative 4D dynamics---realistic temporal deformations of static objects under various physical conditions---remains challenging and often ad hoc despite being critical for building comprehensive 3D world models. Most existing methods assume a predefined physical model and use system identification to estimate parameters, restricting these methods to specific categories and small-scale datasets. We propose that these restrictions can be overcome by learning a data-driven kinematic state parameterization for object-centric physical systems. Specifically, we learn both a latent space of all possible states of the object and a decoder that maps any sampled latent to a plausibly deformed shape of the object. We refer to this parameterization as Neural Object Kinematics (NeuROK), and learn a transformer-based encoder-decoder model on a curated large-scale 4D dataset. This formulation and the learned model significantly simplify the generation of simulative dynamics since we only need to consider the dynamics within a low-dimensional latent space from the Lagrangian mechanics' perspective in classical physics. We demonstrate the effectiveness and generality of this framework across diverse dynamic object types, showing clear advantages over prior works.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36553", "url": null, "sourceid": 35948, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39711, "uid": "690baec7c6e9b45cfeaf09ccec212bb4", "name": "Multi-view Crowd Tracking Transformer with View-Ground Interactions Under Large Real-World Scenes", "authors": [{"id": 192704, "fullname": "Qi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192704?format=json", "institution": "Shenzhen University"}, {"id": 192705, "fullname": "Jixuan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192705?format=json", "institution": "shenzhen university"}, {"id": 181373, "fullname": "Zhang Kaiyi", "url": "http://cvpr.thecvf.com/api/miniconf/users/181373?format=json", "institution": "Shenzhen University"}, {"id": 192706, "fullname": "Xinquan Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192706?format=json", "institution": "Shenzhen University"}, {"id": 88738, "fullname": "Antoni B. Chan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88738?format=json", "institution": "City University of Hong Kong"}, {"id": 85724, "fullname": "Hui Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85724?format=json", "institution": "Shenzhen University"}], "abstract": "Multi-view crowd tracking estimates each person's tracking trajectories on the ground of the scene. Recent research works mainly rely on CNNs-based multi-view crowd tracking architectures, and most of them are evaluated and compared on relatively small datasets, such as Wildtrack and MultiviewX. Since these two datasets are collected in small scenes and only contain tens of frames in the evaluation stage, it is difficult for the current methods to be applied to real-world applications where scene size and occlusion are more complicated. In this paper, we propose a Transformer-based multi-view crowd tracking model, MVTrackTrans, which adopts interactions between camera views and the ground plane for enhanced multi-view tracking performance. Besides, for better evaluation, we collect and label two large real-world multi-view tracking datasets, MVCrowdTrack and CityTrack, which contain a much larger scene size over a longer time period. Compared with existing methods on the two large and new datasets, the proposed MVTrackTrans model achieves better performance, demonstrating the advantages of the model design in dealing with large scenes. We believe the proposed datasets and model will push the frontiers of the task to more practical scenarios, and the datasets and code will be made public upon paper acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39711", "url": null, "sourceid": 34203, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37344, "uid": "5fe6b849fbf264bf68e4602074f10809", "name": "D2T2 - Multimodal Automated Planning for Brachytherapy", "authors": [{"id": 182083, "fullname": "Lance Moore", "url": "http://cvpr.thecvf.com/api/miniconf/users/182083?format=json", "institution": "University of California, San Diego; University of California, San Diego"}, {"id": 187207, "fullname": "Aranyo Mitra", "url": "http://cvpr.thecvf.com/api/miniconf/users/187207?format=json", "institution": "University of California, San Diego"}, {"id": 177718, "fullname": "Ryan Truong", "url": "http://cvpr.thecvf.com/api/miniconf/users/177718?format=json", "institution": "University of California San Diego"}, {"id": 187208, "fullname": "Karoline Kallis", "url": "http://cvpr.thecvf.com/api/miniconf/users/187208?format=json", "institution": null}, {"id": 187209, "fullname": "Kelly Kisling", "url": "http://cvpr.thecvf.com/api/miniconf/users/187209?format=json", "institution": "UC San Diego Health"}, {"id": 187210, "fullname": "Sandra Meyers", "url": "http://cvpr.thecvf.com/api/miniconf/users/187210?format=json", "institution": "University of California, San Diego"}, {"id": 84939, "fullname": "Nuno Vasconcelos", "url": "http://cvpr.thecvf.com/api/miniconf/users/84939?format=json", "institution": "University of California San Diego"}], "abstract": "Brachytherapy is a complex radiation oncology problem that requires the simultaneous prediction of radiation dose, which is used for treatment planning, and a set of machine parameters, known as dwell times, used for treatment delivery. We propose Direct Dwell Time Transformer ($\\textbf{D2T2}$), the first deep learning architecture that directly predicts dwell times during dose prediction. $\\textbf{D2T2}$ is a two stage model, where the first stage predicts a vector of dwell times and the second implements the physical model of radiation delivery, namely a linear combination of radiation dose kernels. Besides the the automatic prediction of dwell times, this has the benefit of constraining the model to make physically plausible dose predictions, when trained end-to-end.  To enhance this training, we also propose a new loss function, denoted as the gamma loss, based on the prediction of the gamma index, which is the gold standard of dose comparisons. This is implemented by training a model to predict the latter using a synthetic dataset of  groundtruth and predicted dose pairs. We train $\\textbf{D2T2}$ on a large dataset of $\\sim$5,000 clinical brachytherapy plans---the largest such dataset to date---spanning gynecological, breast, and other treatment sites. Results demonstrate that $\\textbf{D2T2}$ outperforms existing methods in both accuracy and speed. Notably, $\\textbf{D2T2}$ produces deliverable plans and physically valid dose distributions in a single forward pass, for any application of brachytherapy, hours faster than manual planning and minutes faster than more recent automated methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37344", "url": null, "sourceid": 32824, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38343, "uid": "3ce529861d4b288181b5454f0acceabf", "name": "MedLoc-R1: Performance-Aware Curriculum Reward Scheduling for GRPO-Based Medical Visual Grounding", "authors": [{"id": 174415, "fullname": "Yang Guangjing", "url": "http://cvpr.thecvf.com/api/miniconf/users/174415?format=json", "institution": "bupt"}, {"id": 189648, "fullname": "Ziyuan Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189648?format=json", "institution": "Emory University"}, {"id": 189649, "fullname": "Chaoran Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189649?format=json", "institution": null}, {"id": 189650, "fullname": "Chenlin Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/189650?format=json", "institution": "Peking University"}, {"id": 189651, "fullname": "Jinglin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189651?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 189652, "fullname": "Wanran Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/189652?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 189653, "fullname": "Zhenyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189653?format=json", "institution": "Tsinghua University"}, {"id": 189654, "fullname": "Bing Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/189654?format=json", "institution": "Shandong University"}, {"id": 189655, "fullname": "Qicheng Lao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189655?format=json", "institution": "Beijing University of Posts and Telecommunications"}], "abstract": "Medical visual grounding serves as a crucial foundation for fine-grained multimodal reasoning and interpretable clinical decision support. Despite recent advances in reinforcement learning (RL) for grounding tasks, existing approaches such as Group Relative Policy Optimization~(GRPO) suffer from severe reward sparsity when directly applied to medical images, primarily due to the inherent difficulty of localizing small or ambiguous regions of interest, which is further exacerbated by the rigid and suboptimal nature of fixed IoU-based reward schemes in RL. This leads to vanishing policy gradients and stagnated optimization, particularly during early training. To address this challenge, we propose MedLoc-R1, a performance-aware reward scheduling framework that progressively tightens the reward criterion in accordance with model readiness. MedLoc-R1 introduces a sliding-window performance tracker and a multi-condition update rule that automatically adjust the reward schedule from dense, easily obtainable signals to stricter, fine-grained localization requirements, while preserving the favorable properties of GRPO without introducing auxiliary networks or additional gradient paths. Experiments on three medical visual grounding benchmarks demonstrate that MedLoc-R1 consistently improves both localization accuracy and training stability over GRPO-based baselines. Our framework offers a general, lightweight, and effective solution for RL-based grounding in high-stakes medical applications. Code and checkpoints will be released after acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38343", "url": null, "sourceid": 41960, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36314, "uid": "cfbc4c656854352ff0ed6f6975d35c4c", "name": "HoneyBee: Data Recipes for Vision-Language Reasoners", "authors": [{"id": 131051, "fullname": "Hritik Bansal", "url": "http://cvpr.thecvf.com/api/miniconf/users/131051?format=json", "institution": "University of California, Los Angeles"}, {"id": 184753, "fullname": "Devendra Singh Sachan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184753?format=json", "institution": "Meta"}, {"id": 89393, "fullname": "Kai-Wei Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89393?format=json", "institution": "University of California, Los Angeles"}, {"id": 131036, "fullname": "Aditya Grover", "url": "http://cvpr.thecvf.com/api/miniconf/users/131036?format=json", "institution": "University of California, Los Angeles"}, {"id": 184754, "fullname": "Gargi Ghosh", "url": "http://cvpr.thecvf.com/api/miniconf/users/184754?format=json", "institution": null}, {"id": 130928, "fullname": "Wen-tau Yih", "url": "http://cvpr.thecvf.com/api/miniconf/users/130928?format=json", "institution": "Meta Platforms, Inc."}, {"id": 184755, "fullname": "Ramakanth Pasunuru", "url": "http://cvpr.thecvf.com/api/miniconf/users/184755?format=json", "institution": "FAIR @ Meta"}], "abstract": "Recent advances in vision-language models (VLMs) have made them highly effective at reasoning tasks. However, the principles underlying the construction of performant VL reasoning training datasets remain poorly understood. In this work, we introduce several data curation approaches and study their impacts on VL reasoning capabilities by carefully controlling training and evaluation setups. We analyze the effects of context (image and question pair) sources, implement targeted data interventions, and explore scaling up images, questions, and chain-of-thought (CoT) solutions. Our findings reveal that (a) context source strategies significantly affect VLM performance, (b) interventions such as auxiliary signals from image captions and the inclusion of text-only reasoning yield substantial gains, and (c) scaling all data dimensions (e.g., unique questions per image and unique CoTs per image-question pair) consistently improves reasoning capability. Motivated by these insights, we introduce HoneyBee, a large-scale, high-quality CoT reasoning dataset with 2.5M examples consisting 350K image-question pairs. VLMs trained with HoneyBee outperform state-of-the-art models across model sizes. For instance, a HoneyBee-trained VLM with 3B parameters outperforms the SOTA model and the base model by 7.8% and 24.8%, respectively, on MathVerse. Furthermore, we propose a test-time scaling strategy that reduces decoding cost by 73% without sacrificing accuracy. Overall, this work presents improved strategies for VL reasoning dataset curation research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36314", "url": null, "sourceid": 31566, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39610, "uid": "626dc1c3716c817b064632fded168e35", "name": "CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation", "authors": [{"id": 179150, "fullname": "Ke Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/179150?format=json", "institution": "Fudan University"}, {"id": 192470, "fullname": "Haiyang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192470?format=json", "institution": null}, {"id": 192471, "fullname": "Zhuofan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192471?format=json", "institution": "Fudan University"}, {"id": 192472, "fullname": "Zhengtao Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192472?format=json", "institution": "University of Southern California"}, {"id": 192473, "fullname": "WeitaoJia WeitaoJia", "url": "http://cvpr.thecvf.com/api/miniconf/users/192473?format=json", "institution": null}, {"id": 192474, "fullname": "Xiaodong Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/192474?format=json", "institution": "Fudan University"}, {"id": 126844, "fullname": "Jingqun Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126844?format=json", "institution": "Bytedance"}, {"id": 156578, "fullname": "Benlei Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/156578?format=json", "institution": "Alibaba Group"}, {"id": 189927, "fullname": "Bin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189927?format=json", "institution": "Fudan University"}, {"id": 89233, "fullname": "Xiangyang Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/89233?format=json", "institution": "Fudan University"}], "abstract": "Computer-Aided Design (CAD) is essential in industrial design, but the complexity of traditional CAD modeling and workflows presents significant challenges for automating the generation of high-precision, editable CAD models. Existing methods, such as 3D reconstruction from sketches, often produce non-editable, approximate models that fall short of meeting the stringent requirements for precision and editability in industrial design. Moreover, the reliance on text or image-based inputs often requires significant manual annotation, limiting their scalability and applicability in industrial settings. To overcome these challenges, we propose the Heterogeneous Collaborative Multi-Expert Reinforcement Learning (CME-CAD) paradigm, a novel training paradigm for CAD code generation. Our approach integrates the complementary strengths of these models, facilitating collaborative learning and improving the model\u2019s ability to generate accurate, constraint-compatible, and fully editable CAD models. We introduce a two-stage training process: Multi-Expert Fine-Tuning (MEFT), and Multi-Expert Reinforcement Learning (MERL). Additionally, we present CADExpert, an open-source benchmark consisting of 17,299 instances, including orthographic projections with precise dimension annotations, expert-generated Chain-of-Thought (CoT) processes, executable CADQuery code, and rendered 3D models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39610", "url": null, "sourceid": 42129, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39392, "uid": "d1b1dc60ff9ec109ebb3fdd6dd06fdcc", "name": "Omni-MMSI: Toward Identity-attributed Social Interaction Understanding", "authors": [{"id": 183817, "fullname": "Xinpeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183817?format=json", "institution": "The University of Texas at Dallas"}, {"id": 99367, "fullname": "Bolin Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/99367?format=json", "institution": "Amazon AGI  / Georgia Tech"}, {"id": 191984, "fullname": "Hardy Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191984?format=json", "institution": "University of California, Santa Cruz"}, {"id": 136608, "fullname": "Shijian Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/136608?format=json", "institution": "The University of Texas at Dallas"}, {"id": 75526, "fullname": "Cihang Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/75526?format=json", "institution": "University of California, Santa Cruz"}, {"id": 75508, "fullname": "Yuyin Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/75508?format=json", "institution": "UC Santa Cruz"}, {"id": 95904, "fullname": "James Rehg", "url": "http://cvpr.thecvf.com/api/miniconf/users/95904?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 87904, "fullname": "Yapeng Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/87904?format=json", "institution": "University of Texas at Dallas"}], "abstract": "We introduce \\textbf{Omni-MMSI}, a new task that requires comprehensive social interaction understanding from raw audio, vision, and speech. The task involves two tightly coupled goals: extracting identity-attributed social cues (e.g., who is speaking what) and reasoning about the social interaction (e.g., whom the speaker refers to).This task is essential for developing AI assistants that can perceive and respond to human interactions.Unlike prior studies that assume identity-attributed social cues perfectly provided, Omni-MMSI reflects realistic scenarios where AI assistants must perceive from raw multi-modal streams and reason over extracted social cues.However, existing pipelines and multi-modal LLMs perform poorly in this setting because they lack reliable identity attribution ability, which leads to inaccurate social cues and weak interaction reasoning.To address this challenge, we propose \\textbf{Omni-MMSI-R}, a reference-based pipeline that uses reference audio-vision pairs to produce identity-attributed social cues and leverages curated chain-of-thought supervision for reasoning on reference-based inputs. To enable reference-based research, we construct participant-level reference pairs and curated reasoning annotations on top of the existing datasets.Extensive experiments demonstrate that Omni-MMSI-R consistently outperforms advanced multi-modal LLMs and counterparts in Omni-MMSI.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39392", "url": null, "sourceid": 38380, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38684, "uid": "0c8637e962dc6beed9beecbb0069d296", "name": "Fine-Grained Multi Image Object Hallucination Benchmark", "authors": [{"id": 179863, "fullname": "Joonki Min", "url": "http://cvpr.thecvf.com/api/miniconf/users/179863?format=json", "institution": "Seoul National University"}, {"id": 182427, "fullname": "Chaeyun Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/182427?format=json", "institution": "AIM Intelligence"}, {"id": 188897, "fullname": "Hyungwook Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188897?format=json", "institution": "Seoul National University"}, {"id": 188896, "fullname": "Yejin Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/188896?format=json", "institution": "Seoul National University"}, {"id": 153545, "fullname": "Kihyun Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/153545?format=json", "institution": "Seoul National University"}, {"id": 190459, "fullname": "Yohan Jo", "url": "http://cvpr.thecvf.com/api/miniconf/users/190459?format=json", "institution": "Seoul National University"}, {"id": 153548, "fullname": "Joonseok Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/153548?format=json", "institution": "Seoul National University"}], "abstract": "Multimodal Large Language Models (MLLMs) are increasingly deployed in multi-image scenarios requiring complex reasoning across visual contexts. However, current MLLMs remain fundamentally limited by object hallucination\u2014generating plausible yet factually inconsistent descriptions about objects. Existing benchmarks, designed primarily for single-image settings or providing only high-level multi-image assessments, cannot systematically diagnose how visual complexity and reasoning demands trigger hallucination. To address this gap, we introduce MIOH, a fine-grained multi-image object hallucination benchmark that systematically evaluates object hallucination across four foundational tasks (existence, counting, attribute, position) through three multi-image reasoning patterns (comprehensive, comparative, selective) under three controlled adversarial pressures (visual context scale, perceptual difficulty, contextual bias). Through evaluation of 30 models, we reveal that even state-of-the-art systems like GPT-5 and Gemini-2.5-Pro exhibit distinct failure patterns across different reasoning patterns and tasks. Our evaluation reveals that hallucination stems not merely from perceptual failures but from integration-stage limitations when maintaining object representations across multiple images. MIOH provides a controlled framework for analyzing multi-image object hallucination and serves as a critical evaluation tool for developing more reliable multimodal AI systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38684", "url": null, "sourceid": 44201, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38187, "uid": "b71ae9a72156d7961f68be39331f4f28", "name": "Vista4D: Video Reshooting with 4D Point Clouds", "authors": [{"id": 176470, "fullname": "Kuan Heng Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/176470?format=json", "institution": "Columbia University"}, {"id": 189242, "fullname": "Zhizheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189242?format=json", "institution": "UCLA Computer Science Department, University of California, Los Angeles"}, {"id": 189243, "fullname": "Pablo Salamanca", "url": "http://cvpr.thecvf.com/api/miniconf/users/189243?format=json", "institution": null}, {"id": 70723, "fullname": "Yash Kant", "url": "http://cvpr.thecvf.com/api/miniconf/users/70723?format=json", "institution": "University of Toronto / Snap Research"}, {"id": 97282, "fullname": "Ryan Burgert", "url": "http://cvpr.thecvf.com/api/miniconf/users/97282?format=json", "institution": "Stony Brook University"}, {"id": 189244, "fullname": "Yuancheng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189244?format=json", "institution": "Netflix"}, {"id": 189245, "fullname": "Koichi Namekata", "url": "http://cvpr.thecvf.com/api/miniconf/users/189245?format=json", "institution": "University of Oxford"}, {"id": 158367, "fullname": "Yiwei Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/158367?format=json", "institution": "NetFlix"}, {"id": 89955, "fullname": "Bolei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/89955?format=json", "institution": "University of California, Los Angeles"}, {"id": 85090, "fullname": "Micah Goldblum", "url": "http://cvpr.thecvf.com/api/miniconf/users/85090?format=json", "institution": "Columbia University"}, {"id": 153383, "fullname": "Paul Debevec", "url": "http://cvpr.thecvf.com/api/miniconf/users/153383?format=json", "institution": "NetFlix"}, {"id": 126808, "fullname": "Ning Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126808?format=json", "institution": "Netflix Eyeline Studios"}], "abstract": "We present **Vista4D**, a robust and flexible video reshooting framework that grounds the input video and target cameras in a 4D point cloud. Specifically, given an input video, our method re-synthesizes the scene with the same dynamics from a different camera trajectory and viewpoint. Existing video reshooting methods often struggle with depth estimation artifacts of real-world dynamic videos, while also failing to preserve content appearance and maintain precise camera control for challenging new trajectories. We build a 4D-grounded point cloud representation with static pixel segmentation and 4D reconstruction to explicitly preserve seen content and provide rich camera signals, and we train with reconstructed multiview dynamic data for robustness against point cloud artifacts during real-world inference. Our results demonstrate improved 4D consistency, camera control, and visual quality compared to state-of-the-art baselines under a variety of videos and camera paths. Moreover, our method generalizes to real-world applications such as dynamic scene expansion and 4D scene recomposition. Results are best viewed as videos in the Supplement.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38187", "url": "https://eyeline-labs.github.io/Vista4D/", "sourceid": 46121, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37568, "uid": "2972f61f995d94a50224a2613be30d4e", "name": "Aligning Text, Images and 3D Structure Token-by-Token", "authors": [{"id": 159429, "fullname": "Aadarsh Sahoo", "url": "http://cvpr.thecvf.com/api/miniconf/users/159429?format=json", "institution": "California Institute of Technology"}, {"id": 187747, "fullname": "Vansh Tibrewal", "url": "http://cvpr.thecvf.com/api/miniconf/users/187747?format=json", "institution": "California Institute of Technology"}, {"id": 86982, "fullname": "Georgia Gkioxari", "url": "http://cvpr.thecvf.com/api/miniconf/users/86982?format=json", "institution": "California Institute of Technology"}], "abstract": "Creating machines capable of understanding the world in 3D is essential in assisting designers that build and edit 3D environments and robots navigating and interacting within a three-dimensional space. Inspired by advances in language and image modeling, we investigate the potential of autoregressive models for a new modality: structured 3D scenes. To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes and provide a detailed \"cookbook\" outlining critical design choices for achieving optimal training and performance addressing key questions related to data representation, modality-specific objectives, and more. We show how to tokenize complex 3D objects to incorporate into our structured 3D scene modality. We evaluate performance across four core 3D tasks \u2013 rendering, recognition, instruction-following, and question-answering \u2013 and four 3D datasets, synthetic and real-world. We show our model\u2019s effectiveness on reconstructing complete 3D scenes consisting of complex objects from a single image and on real-world 3D object recognition tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37568", "url": null, "sourceid": 34592, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37915, "uid": "bbd230052c26e38339105b7d1af05990", "name": "Toward Diffusible High-Dimensional Latent Spaces: A Frequency Perspective", "authors": [{"id": 99367, "fullname": "Bolin Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/99367?format=json", "institution": "Amazon AGI  / Georgia Tech"}, {"id": 90357, "fullname": "XuDong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90357?format=json", "institution": "UC Berkeley"}, {"id": 188582, "fullname": "Sai Saketh Rambhatla", "url": "http://cvpr.thecvf.com/api/miniconf/users/188582?format=json", "institution": "Skild AI"}, {"id": 95904, "fullname": "James Rehg", "url": "http://cvpr.thecvf.com/api/miniconf/users/95904?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 88875, "fullname": "Zsolt Kira", "url": "http://cvpr.thecvf.com/api/miniconf/users/88875?format=json", "institution": "Georgia Institute of Technology"}, {"id": 86361, "fullname": "Rohit Girdhar", "url": "http://cvpr.thecvf.com/api/miniconf/users/86361?format=json", "institution": "Meta"}, {"id": 84585, "fullname": "Ishan Misra", "url": "http://cvpr.thecvf.com/api/miniconf/users/84585?format=json", "institution": "Facebook"}], "abstract": "Latent diffusion has become the default paradigm for visual generation, yet we observe a persistent reconstruction\u2013generation trade-off as latent dimensionality increases: higher-capacity autoencoders improve reconstruction fidelity but generation quality eventually declines. We trace this gap to the different behaviors in high-frequency tokenization and detokenization. Through controlled perturbations in both RGB and latent domains, we analyze encoder/decoder behaviors and find that decoders depend strongly on high-frequency latent components to recover details, whereas encoders under-represent high-frequency contents, yielding insufficient exposure and underfitting in high-frequency bands for diffusion model training. To address this issue, we introduce FreqWarm, a plug-and-play frequency warm-up curriculum that increases early-stage exposure to high-frequency latent signals during diffusion or flow-matching training -- without modifying or retraining the autoencoder. Applied across several high-dimensional tokenizers, FreqWarm consistently improves generation quality: decreasing gFID by 14.11 on Wan2.2-VAE, 6.14 on LTX-VAE, and 4.42 on DC-AE-f32, while remaining architecture-agnostic and compatible with diverse backbones. Our study shows that explicitly managing frequency exposure can successfully turn high-dimensional latent spaces into more diffusible targets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37915", "url": "https://bolinlai.github.io/projects/FreqWarm/", "sourceid": 31484, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36498, "uid": "7126c3a6111f0dc9f0bc7ecd4325b63d", "name": "GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents", "authors": [{"id": 182318, "fullname": "Yang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182318?format=json", "institution": "Xiaomi Corporation"}, {"id": 185205, "fullname": "Yuchen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185205?format=json", "institution": "Xiaomi Corporation"}, {"id": 185206, "fullname": "Haoyu Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185206?format=json", "institution": "Xiaomi Corporation"}, {"id": 185207, "fullname": "ZhiqiangXia ZhiqiangXia", "url": "http://cvpr.thecvf.com/api/miniconf/users/185207?format=json", "institution": "Xiaomi Corporation"}, {"id": 185208, "fullname": "Hongzhen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185208?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 185209, "fullname": "Kaiyang Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/185209?format=json", "institution": "Xiaomi Corporation"}, {"id": 185210, "fullname": "Changpeng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185210?format=json", "institution": "Xiaomi Corporation"}, {"id": 185211, "fullname": "Jinyang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185211?format=json", "institution": "Tsinghua University"}, {"id": 185212, "fullname": "Jiaming Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185212?format=json", "institution": "Xiaomi Corporation"}, {"id": 185213, "fullname": "Runyu Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185213?format=json", "institution": "Xiaomi Corporation"}, {"id": 185214, "fullname": "Ying Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185214?format=json", "institution": "Xiaomi Corporation"}], "abstract": "Recent progress in Multimodal Large Language Models (MLLMs) has enabled mobile GUI agents capable of visual perception, cross-modal reasoning, and interactive control. However, existing benchmarks are largely English-centric and fail to capture the linguistic and interaction characteristics of the Chinese mobile ecosystem. They also focus on isolated skills such as GUI grounding or offline agent, lacking a unified and fine-grained framework to assess the full capability chain from perception to execution. To address this gap, we introduce \\textbf{GUI-CEval}, the first comprehensive benchmark for Chinese mobile GUI agents, built entirely on \\textbf{real-device} environments. GUI-CEval spans 201 mainstream apps across four device types and adopts a two-level structure that evaluates both atomic abilities and end-to-end performance along five dimensions: perception, planning, reflection, execution, and evaluation. All data are collected and verified through multi-stage manual processes to ensure authenticity and reproducibility. Extensive experiments on 20 representative MLLMs and multi-agent systems show that while models such as Qwen2.5-VL and UI-TARS perform competitively, most MLLMs still exhibit clear weaknesses in reflective decision-making and post-action self-evaluation, limiting their reliability in real-world interactions. We hope GUI-CEval provides a reproducible and interpretable benchmark to guide capability diagnosis and advance the development of Chinese mobile GUI agents.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36498", "url": null, "sourceid": 43663, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37090, "uid": "e2d9bc580ff07f2c76c307150110f38f", "name": "Archon: A Unified Multimodal Model for Holistic Digital Human Generation", "authors": [{"id": 88702, "fullname": "Chong Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88702?format=json", "institution": "Zhejiang University"}, {"id": 128093, "fullname": "Shichen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128093?format=json", "institution": "Google"}, {"id": 158398, "fullname": "Lijun Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/158398?format=json", "institution": "Google DeepMind"}, {"id": 85705, "fullname": "David Futschik", "url": "http://cvpr.thecvf.com/api/miniconf/users/85705?format=json", "institution": "Czech Technical Univeresity in Prague, FEE"}, {"id": 186632, "fullname": "Stylianos Moschoglou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186632?format=json", "institution": "Google"}, {"id": 186633, "fullname": "Shefali Srivastava", "url": "http://cvpr.thecvf.com/api/miniconf/users/186633?format=json", "institution": "Google"}, {"id": 158442, "fullname": "Ziqian Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/158442?format=json", "institution": "Google"}, {"id": 158443, "fullname": "Feitong Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/158443?format=json", "institution": "Google"}, {"id": 84995, "fullname": "Guofeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84995?format=json", "institution": "Zhejiang University"}, {"id": 76752, "fullname": "Zhaopeng Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/76752?format=json", "institution": "Zhejiang University"}, {"id": 85744, "fullname": "Sean Fanello", "url": "http://cvpr.thecvf.com/api/miniconf/users/85744?format=json", "institution": "Google"}, {"id": 88352, "fullname": "Yinda Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88352?format=json", "institution": "Google"}], "abstract": "We introduce Archon, a unified multimodal framework that extends multimodal language models to address the fundamental challenge of holistic digital human generation. Archon unifies diverse human-centric modalities, including description, script, speech, animation, semantic segmentation, image and video, within a single controllable generative system, enabled by modality-specific tokenization and auto-regressive cross-modal reasoning. For high-quality video outputs, we incorporate a semantic-driven video diffusion decoder that reconstructs photorealistic video from compact representations. We further analyze cross-modality ambiguity and explore alternative modality generation chain that improves controllability and coherence. Experiments demonstrate strong performance across diverse multimodal generation tasks without task-specific fine-tuning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37090", "url": null, "sourceid": 37012, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39668, "uid": "c1873a205a7b7b021a082c65c7548d5d", "name": "VRR-QA: Visual Relational Reasoning in Videos Beyond Explicit Cues", "authors": [{"id": 151034, "fullname": "Swetha Sirnam", "url": "http://cvpr.thecvf.com/api/miniconf/users/151034?format=json", "institution": "University of Central Florida"}, {"id": 91932, "fullname": "Rohit Gupta", "url": "http://cvpr.thecvf.com/api/miniconf/users/91932?format=json", "institution": "University of Central Florida"}, {"id": 136514, "fullname": "Parth Parag Kulkarni", "url": "http://cvpr.thecvf.com/api/miniconf/users/136514?format=json", "institution": "University of Central Florida"}, {"id": 136031, "fullname": "David Shatwell", "url": "http://cvpr.thecvf.com/api/miniconf/users/136031?format=json", "institution": "University of Central Florida"}, {"id": 191269, "fullname": "Jeffrey A. Chan-Santiago", "url": "http://cvpr.thecvf.com/api/miniconf/users/191269?format=json", "institution": "University of Central Florida"}, {"id": 192604, "fullname": "Nyle Siddiqui", "url": "http://cvpr.thecvf.com/api/miniconf/users/192604?format=json", "institution": "University of Central Florida"}, {"id": 134410, "fullname": "Joseph Fioresi", "url": "http://cvpr.thecvf.com/api/miniconf/users/134410?format=json", "institution": "University of Central Florida"}, {"id": 73977, "fullname": "Mubarak Shah", "url": "http://cvpr.thecvf.com/api/miniconf/users/73977?format=json", "institution": "Amazon"}], "abstract": "Video Question Answering (VideoQA) has made significant strides by leveraging multimodal learning to align visual and textual modalities. However, current benchmarks overwhelmingly focus on questions answerable through explicit visual content - actions, objects, and events directly observable within individual frames or short clips. To truly understand videos as humans do, models must go beyond what is directly shown, inferring hidden relationships and contextual cues that are only implied across frames. Humans naturally excel at such implicit reasoning, seamlessly integrating partial visual cues over time to infer motin dynamics, spatial layout and context, constructing a coherent mental model of the scene even when such relationships are never explicitly depicted. Current benchmarks fail to capture this essential aspect of video understanding. To address this gap, we introduce VRR-QA, a benchmark for Visual Relational Reasoning Beyond Explicit Cues. We curate our benchmark from creative and cinematic videos such as movies, that deliberately employ storytelling techniques which omit direct depictions of certain events or relations, requiring viewers to infer them. VRR-QA comprises $1K$ meticulously expert-annotated QA pairs drawn from $1K$ creative video clips covering $15$ genres across $7$ decades of content, from both live-action and animated titles. These annotations are deliberately challenging, crafted by authors, validated through multiple annotators, and benchmarked against human performance to ensure high quality. Our extensive evaluations on $11$ leading VideoQA models reveals consistent and significant performance degradation, underscoring their reliance on surface-level visual cues and highlighting the difficulty of implicit reasoning. Even the best model substantially underperforms human baselines with only 64% accuracy. Performance variations across models further illustrate the complexity and diversity of the challenges presented by VRR-QA. By releasing both dataset and data collection framework, VRR-QA establishes a rigorous, diverse, and reproducible testbed for advancing VideoQA.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39668", "url": null, "sourceid": 35212, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39946, "uid": "f7a23934663d0479450d75481425c26b", "name": "HeroGS: Hierarchical Guidance for Robust 3D Gaussian Splatting under Sparse Views", "authors": [{"id": 180718, "fullname": "Jiashu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180718?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 193173, "fullname": "Xumeng Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/193173?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 126699, "fullname": "Zhaoyang Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/126699?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 193174, "fullname": "Zipeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193174?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 193175, "fullname": "Kuiran Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193175?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 84996, "fullname": "Guorong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/84996?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 126243, "fullname": "Zhenjun Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/126243?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 72567, "fullname": "Jianbin Jiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/72567?format=json", "institution": "University of the Chinese Academy of Sciences"}], "abstract": "3D Gaussian Splatting (3DGS) has recently emerged as a promising approach in novel view synthesis, combining photorealistic rendering with real-time efficiency. However, its success heavily relies on dense camera coverage; under sparse-view conditions, insufficient supervision leads to irregular Gaussian distributions\u2014characterized by globally sparse coverage, blurred background, and distorted  high-frequency areas.To address this, we propose HeroGS\u2014Hierarchical Guidance for Robust 3D Gaussian Splatting\u2014a unified framework that establishes hierarchical guidance across the image, feature, and parameter levels. At the image level, sparse supervision is converted into pseudo-dense guidance, globally regularizing the Gaussian distributions and forming a consistent foundation for subsequent optimization. Building upon this, Feature-Adaptive Densification and Pruning (FADP) at the feature level leverages low-level features to refine high-frequency details and adaptively densifies Gaussians in background regions.The optimized distributions then support Co-Pruned Geometry Consistency (CPG) at parameter level, which guides geometric consistency through parameter freezing and  co-pruning, effectively removing inconsistent splats. The hierarchical guidance strategy effectively constrains and optimizes the overall Gaussian distributions, thereby enhancing both structural fidelity and rendering quality.Extensive experiments demonstrate that HeroGS achieves high-fidelity reconstructions and consistently surpasses state-of-the-art baselines under sparse-view conditions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39946", "url": null, "sourceid": 45533, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39765, "uid": "e74186a9024394af6d13cb98b343f11a", "name": "Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution Can Improve Robust Scene Graph Generation", "authors": [{"id": 155644, "fullname": "Changsheng Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/155644?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 192810, "fullname": "Zijian Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192810?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 154782, "fullname": "Mengshi Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/154782?format=json", "institution": "Beijing University of Posts and Telecommunications"}], "abstract": "In this paper, we propose Robo-SGG, a plug-and-play module for robust scene graph generation (SGG). Unlike standard SGG, the robust scene graph generation aims to perform inference on a diverse range of corrupted images, with the core challenge being the domain shift between the clean and corrupted images. Existing SGG methods suffer from degraded performance due to shifted visual features (e.g., corruption interference or occlusions). To obtain robust visual features, we leverage layout information, representing the global structure of an image, which is robust to domain shift, to enhance the robustness of SGG methods under corruption. Specifically, we employ Instance Normalization (IN) to alleviate the domain-specific variations and recover the robust structural features (i.e., the positional and semantic relationships among objects) by the proposed Layout-Oriented Restitution. Furthermore, under corrupted images, we introduce a Layout-Embedded Encoder (LEE) that adaptively fuses layout and visual features via a gating mechanism, enhancing the robustness of positional and semantic representations for objects and predicates. Note that our proposed Robo-SGG module is designed as a plug-and-play component, which can be easily integrated into any baseline SGG model. Extensive experiments demonstrate that by integrating the state-of-the-art method into our proposed Robo-SGG, we achieve relative improvements of 6.3%, 11.1%, and 8.0% in mR@50 for PredCls, SGCls, and SGDet tasks on the VG-C benchmark, respectively, and achieve new state-of-the-art performance in the corruption scene graph generation benchmark (VG-C and GQA-C). We will release our source code and model.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39765", "url": null, "sourceid": 35857, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38931, "uid": "2ac74624f549e19f8b2797ebbd4269ea", "name": "LiDAR Prompted Spatio-Temporal Multi-View Stereo for Autonomous Driving", "authors": [{"id": 190985, "fullname": "Qihao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/190985?format=json", "institution": "Harbin Institute of Technology"}, {"id": 190986, "fullname": "Jiarun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190986?format=json", "institution": "NIO; Alibaba Group"}, {"id": 190987, "fullname": "Ziqian Ni", "url": "http://cvpr.thecvf.com/api/miniconf/users/190987?format=json", "institution": "Alibaba Group"}, {"id": 157645, "fullname": "Jianyun Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157645?format=json", "institution": "Alibaba Group"}, {"id": 190988, "fullname": "Sheng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190988?format=json", "institution": "NIO"}, {"id": 76581, "fullname": "Tao Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/76581?format=json", "institution": "Harbin Institute of Technology"}, {"id": 187102, "fullname": "lijun zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187102?format=json", "institution": "Harbin Institute of Technology"}, {"id": 71512, "fullname": "Ruifeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/71512?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Accurate metric depth is critical for autonomous driving perception and simulation, yet current approaches struggle to achieve high metric accuracy, multi-view and temporal consistency, and cross-domain generalization.To address these challenges, we present MVS-Pro, a novel multi-view stereo framework that reconciles these competing objectives through two key insights:  (1) Sparse but metrically accurate LiDAR observations can serve as geometric prompts to anchor depth estimation in absolute scale, and (2) deep fusion of diverse cues is essential for resolving ambiguities and enhancing robustness, while a spatio-temporal decoder ensures consistency across frames.Built upon these principles, MVS-Pro embeds the LiDAR prompt in two ways: as a hard geometric prior anchoring the cost volume, and as soft feature-wise guidance fused by a triple cues combiner.As for temporal consistency, MVS-Pro leverages a spatio-temporal decoder that jointly exploits geometric cues from the MVS cost volume and temporal context from neighboring frames. Experiments show that MVS-Pro achieves state-of-the-art performance on multiple benchmarks, excelling in metric accuracy, temporal stability, and zero-shot cross-domain transfer, demonstrating its practical value for scalable, reliable autonomous driving systems.Code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38931", "url": null, "sourceid": 37096, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37188, "uid": "82367463300c6d423d4bf86699b0d20b", "name": "StableMaterials: Enhancing Diversity in Material Generation via Semi-Supervised Learning", "authors": [{"id": 129556, "fullname": "Giuseppe Vecchio", "url": "http://cvpr.thecvf.com/api/miniconf/users/129556?format=json", "institution": "Adobe Research"}], "abstract": "We introduce **StableMaterials**, a novel approach for generating photorealistic physical-based rendering (PBR) materials that integrate semi-supervised learning with Latent Diffusion Models (LDMs). Our method employs adversarial training to distill knowledge from existing large-scale image generation models, minimizing the reliance on annotated data and enhancing the diversity in generation. This distillation approach aligns the distribution of the generated materials with that of image textures from an SDXL model, enabling the generation of novel materials that are not present in the initial training dataset.Furthermore, we employ a diffusion-based refiner model to improve the visual quality of the samples and achieve high-resolution generation. Finally, we distill a latent consistency model for fast generation in just four steps and propose a new tileability technique that removes visual artifacts typically associated with fewer diffusion steps. We detail the architecture and training process of StableMaterials, the integration of semi-supervised training within existing LDM frameworks.Comparative evaluations with state-of-the-art methods show the effectiveness of StableMaterials, highlighting its potential applications in computer graphics and beyond.StableMaterials will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37188", "url": "https://gvecchio.com/stablematerials", "sourceid": 37418, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36843, "uid": "943344b5f592e157b66b4b2c6843b301", "name": "Scene-VLM: Multimodal Video Scene Segmentation via Vision-Language Models", "authors": [{"id": 186005, "fullname": "Nimrod Berman", "url": "http://cvpr.thecvf.com/api/miniconf/users/186005?format=json", "institution": null}, {"id": 186006, "fullname": "Adam Botach", "url": "http://cvpr.thecvf.com/api/miniconf/users/186006?format=json", "institution": "Amazon"}, {"id": 182071, "fullname": "Emanuel Ben Baruch", "url": "http://cvpr.thecvf.com/api/miniconf/users/182071?format=json", "institution": "Amazon"}, {"id": 186007, "fullname": "Shunit Haviv Hakimi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186007?format=json", "institution": ""}, {"id": 169746, "fullname": "Asaf Gendler", "url": "http://cvpr.thecvf.com/api/miniconf/users/169746?format=json", "institution": "Amazon Prime Video"}, {"id": 186008, "fullname": "Ilan Naiman", "url": "http://cvpr.thecvf.com/api/miniconf/users/186008?format=json", "institution": "Google; Ben Gurion University of the Negev, Technion"}, {"id": 126986, "fullname": "Erez Yosef", "url": "http://cvpr.thecvf.com/api/miniconf/users/126986?format=json", "institution": "Tel Aviv University"}, {"id": 186009, "fullname": "Igor Kviatkovsky", "url": "http://cvpr.thecvf.com/api/miniconf/users/186009?format=json", "institution": "Amazon"}], "abstract": "Segmenting long-form videos into semantically coherent scenes is a fundamental task in large-scale video understanding. Existing encoder-based methods are limited by visual-centric biases, classify each shot in isolation without leveraging sequential dependencies, and lack both narrative understanding and explainability. In this paper, we present Scene-VLM, the first fine-tuned vision-language model (VLM) framework for video scene segmentation. Scene-VLM jointly processes visual and textual cues including frames, transcriptions, and optional metadata to enable multimodal reasoning across consecutive shots. The model generates predictions sequentially with causal dependencies among shots and introduces a context\u2013focus window mechanism to ensure sufficient temporal context for each shot-level decision. In addition, we propose a scheme to extract confidence scores from the token-level logits of the VLM, enabling controllable precision\u2013recall trade-offs that were previously limited to encoder-based methods. Furthermore, we demonstrate that our model can be aligned to generate coherent natural-language rationales for its boundary decisions through minimal targeted supervision. Our approach achieves state-of-the-art performance on standard scene segmentation benchmarks. On MovieNet, for example, Scene-VLM yields significant improvements of +6 AP and +13.7 F1 over the previous leading method.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36843", "url": null, "sourceid": 41404, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36242, "uid": "274ba853b1325395ce18cdd0b43c5d9e", "name": "ProSoftArena: Evaluating Hierarchical Capabilities of Multimodal Agents in Professional Software Environments", "authors": [{"id": 88843, "fullname": "Jiaxin Ai", "url": "http://cvpr.thecvf.com/api/miniconf/users/88843?format=json", "institution": "Wuhan University"}, {"id": 184553, "fullname": "Yukang Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184553?format=json", "institution": "Nankai University"}, {"id": 153648, "fullname": "Fanrui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153648?format=json", "institution": "University of Science and Technology of China"}, {"id": 184554, "fullname": "Jianwen Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/184554?format=json", "institution": "Nankai University"}, {"id": 177438, "fullname": "Zizhen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/177438?format=json", "institution": "Nankai University"}, {"id": 184555, "fullname": "Chuanhao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184555?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 184556, "fullname": "Yifan Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184556?format=json", "institution": "University of Science and Technology of China &amp; Shanghai Innovation Institute"}, {"id": 184557, "fullname": "Wenxiao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184557?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 184558, "fullname": "Ruoxi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184558?format=json", "institution": "University of Pittsburgh"}, {"id": 184559, "fullname": "Mingliang Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/184559?format=json", "institution": "Beijing Institute of Technology"}, {"id": 184560, "fullname": "Kaipeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184560?format=json", "institution": "Shanda AI Research"}], "abstract": "Multimodal agents are making rapid progress on general computer-use tasks, yet existing benchmarks remain largely confined to browsersand basic desktop applications, falling short in professional software workflows that dominate real-world scientific and industrial practice. To close this gap, we introduce ProSoftArena, a benchmark and platform specifically for evaluating multimodal agents in professional software environments. We establish the first capability hierarchy tailored to agent use of professional software and construct a benchmark of 437 realistic work and research tasks spanning 6 disciplines and 13 core professional applications. To ensure reliable and reproducible assessment, we build an executable real-computer environment with an execution-based evaluation framework and uniquely incorporate a human-in-the-loop evaluation paradigm. Extensive experiments show that even the best-performing agent attains only a 24.4\\% success rate on L2 tasks and completely fails on L3 multi-software workflow. In-depth analysis further provides valuable insights for addressing current agent limitations and more effective design principles, paving the way to build more capable agents in professional software settings.We will release all data and codes to forster research in this critical area.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36242", "url": null, "sourceid": 44870, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37630, "uid": "5824d6556d667e44db4870fcc6cbafa0", "name": "RemedyGS: Defend 3D Gaussian Splatting Against Computation Cost Attacks", "authors": [{"id": 184280, "fullname": "Yanping LI", "url": "http://cvpr.thecvf.com/api/miniconf/users/184280?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 183140, "fullname": "Zhening Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183140?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 187910, "fullname": "Zijian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187910?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 187911, "fullname": "Zehong Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/187911?format=json", "institution": "Lingnan University"}, {"id": 90255, "fullname": "Jun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90255?format=json", "institution": "The Hong Kong University of Science and Technology"}], "abstract": "As a mainstream technique for 3D reconstruction, 3D Gaussian splatting (3DGS) has been applied in a wide range of applications and services. Recent studies have revealed critical vulnerabilities in this pipeline and introduced computation cost attacks that lead to malicious resource occupancies and even denial-of-service (DoS) conditions, thereby hindering the reliable deployment of 3DGS. In this paper, we propose the first effective and comprehensive black-box defense framework, named RemedyGS, against such computation cost attacks, safeguarding 3DGS reconstruction systems and services. Our pipeline comprises two key components: a detector to identify the attacked input images with poisoned textures and a purifier to recover the benign images from their attacked counterparts, mitigating the adverse effects of these attacks. Moreover, we incorporate adversarial training into the purifier to enforce distributional alignment between the recovered and original natural images, thereby enhancing the defense efficacy. Experimental results demonstrate that our framework effectively defends against white-box, black-box, and adaptive attacks in 3DGS systems, achieving state-of-the-art performance in both safety and utility.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37630", "url": null, "sourceid": 39123, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39772, "uid": "3ae358cf08aab5c083ec6b4dc03fa2b5", "name": "Solving a Nonlinear Blind Inverse Problem for Tagged MRI with Physics and Deep Generative Priors", "authors": [{"id": 180685, "fullname": "Zhangxing Bian", "url": "http://cvpr.thecvf.com/api/miniconf/users/180685?format=json", "institution": "Johns Hopkins University"}, {"id": 192824, "fullname": "Shuwen Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/192824?format=json", "institution": "Johns Hopkins University"}, {"id": 192825, "fullname": "Samuel Remedios", "url": "http://cvpr.thecvf.com/api/miniconf/users/192825?format=json", "institution": "Johns Hopkins University"}, {"id": 192826, "fullname": "Junyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192826?format=json", "institution": "Johns Hopkins Medicine"}, {"id": 168313, "fullname": "Aaron Carass", "url": "http://cvpr.thecvf.com/api/miniconf/users/168313?format=json", "institution": "Johns Hopkins University"}, {"id": 192827, "fullname": "Blake Dewey", "url": "http://cvpr.thecvf.com/api/miniconf/users/192827?format=json", "institution": "Johns Hopkins University"}, {"id": 192828, "fullname": "Jerry L Prince", "url": "http://cvpr.thecvf.com/api/miniconf/users/192828?format=json", "institution": "Johns Hopkins University"}], "abstract": "Tagged MRI enables tracking internal tissue motion non-invasively. It encodes motion by modulating anatomy with periodic tags, which deforms along with tissue. However, the entanglement between anatomy, tags and motion poses significant challenges on post processing. Existence of tags and imaging blur hinders downstream tasks such as segmenting anatomy. Tag fading, due to T1-relaxation, disrupts brightness constancy assumption for motion tracking. For decades, these challenges are handled in isolation and sub-optimally. In contrast, we introduce a blind and nonlinear inverse framework for tagged MRI that, for the first time, unifies these tasks: anatomical image recovery, high-resolution cine image synthesis, and motion estimation. At its core, the synergy of MR physics and generative priors enables us to blindly estimate the unknown forward imaging models, high-resolution underlying anatomy, while simultaneously tracking 3D diffeomorphic Lagrangian motion over time. Experiments on tagged brain MRI demonstrate that our approach yields high-resolution anatomy images, cine images, and more accurate motion than specialized methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39772", "url": null, "sourceid": 46308, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38875, "uid": "bffa0846b1e371410d49aa3c87d57c29", "name": "Fine-grained Image Aesthetic Assessment: Learning Discriminative Scores from Relative Ranks", "authors": [{"id": 143140, "fullname": "Zhichao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/143140?format=json", "institution": "Xidian University"}, {"id": 190897, "fullname": "Jianjie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190897?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 190898, "fullname": "Zhixianhe Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190898?format=json", "institution": "Xidian University"}, {"id": 190899, "fullname": "Pangu Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/190899?format=json", "institution": "Xidian University"}, {"id": 190900, "fullname": "Xiangfei Sheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190900?format=json", "institution": "Xidian University"}, {"id": 190901, "fullname": "Pengfei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190901?format=json", "institution": "Xidian University"}, {"id": 126996, "fullname": "Leida Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/126996?format=json", "institution": "Xidian University"}], "abstract": "Image aesthetic assessment (IAA) has extensive applications in content creation, album management, and recommendation systems, etc. In such applications, it is commonly needed to pick out the most aesthetically pleasing image from a series of images with subtle aesthetic variations, a topic we refer to as fine-grained IAA. Unfortunately, state-of-the-art IAA models are typically designed for coarse-grained evaluation, where images with notable aesthetic differences are evaluated independently on an absolute scale. These models are inherently limited in discriminating fine-grained aesthetic differences. To address the dilemma, we contribute FGAesthetics, a fine-grained IAA database with 32,217 images organized into 10,028 series, which are sourced from diverse categories including Natural, AIGC, and Cropping. Annotations are collected via pairwise comparisons within each series. We also devise Series Refinement and Rank Calibration to ensure the reliability of data and labels. Based on FGAesthetics, we further propose FGAesQ, a novel IAA framework that learns discriminative aesthetic scores from relative ranks through Difference-preserved Tokenization (DiffToken), Comparative Text-assisted Alignment (CTAlign), and Rank-aware Regression (RankReg). FGAesQ enables accurate aesthetic assessment in fine-grained scenarios while still maintains competitive performance in coarse-grained evaluation. Extensive experiments and comparisons demonstrate the superiority of the proposed method. Data and model will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38875", "url": null, "sourceid": 37904, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40349?format=json"], "related_events_ids": [40349]}, {"id": 38896, "uid": "2d76ad7ae5fb13d95eb34c39c6fb59b0", "name": "Hyperbolic Prototype Learning with Uncertainty-Aware Consistency for Continual Test-Time Segmentation", "authors": [{"id": 166554, "fullname": "Siddhant Gole", "url": "http://cvpr.thecvf.com/api/miniconf/users/166554?format=json", "institution": "IIT Bombay"}, {"id": 182736, "fullname": "Akash Pal", "url": "http://cvpr.thecvf.com/api/miniconf/users/182736?format=json", "institution": "Indian Institute of Technology, Bombay"}, {"id": 185086, "fullname": "Amit More", "url": "http://cvpr.thecvf.com/api/miniconf/users/185086?format=json", "institution": "Honda R&amp;D Co., Ltd."}, {"id": 185085, "fullname": "S Divakar Bhat", "url": "http://cvpr.thecvf.com/api/miniconf/users/185085?format=json", "institution": "Honda R&amp;D Japan"}, {"id": 158110, "fullname": "Subhasis Chaudhuri", "url": "http://cvpr.thecvf.com/api/miniconf/users/158110?format=json", "institution": "Indian Institute of Technology Bombay"}, {"id": 71694, "fullname": "Biplab Banerjee", "url": "http://cvpr.thecvf.com/api/miniconf/users/71694?format=json", "institution": "Associate Professor, IIT Bombay, India"}], "abstract": "Continual Test-Time Adaptation (CTTA) for semantic segmentation is vital for deploying vision models in dynamic environments with persistent domain shifts. Existing methods often degrade over time as self-supervised updates amplify early prediction errors. We attribute this fragility to a geometric limitation: Euclidean feature spaces, with polynomial volume growth, lead to distorted semantic representations and crowded, unstable decision boundaries. We propose **HyperProtoSeg**, a hyperbolic prototypical segmentation network that learns geometrically optimal class prototypes in the Poincar\u00e9 ball. Leveraging the exponential expansion of hyperbolic space, it enforces large and uniform inter-class margins with low distortion, yielding well-separated and curvature-stable embeddings. For robust online adaptation, we introduce **Hyperbolic Boundary Consistency Adaptation (HBCA)**, which partitions pixels by cross-view consistency into confident \u201ccore\u2019\u2019 and uncertain \u201cboundary\u2019\u2019 sets. HBCA applies geodesic distance minimization for confident regions and a novel Hyperbolic Directional Consistency Loss for uncertain ones, preventing error amplification. Experiments on challenging synthetic-to-real benchmarks (Cityscapes to ACDC, IDD to IDD-AW, SHIFT) show that HyperProtoSeg + HBCA achieves     an average improvement of (**1.94%**,**4.02%**,**1.24%**) over state-of-the-art CTTA methods under severe structural shifts.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38896", "url": null, "sourceid": 32948, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38758, "uid": "2cadebae208572a73bcc7385d5de0710", "name": "Radar-Guided Polynomial Fitting for Metric Depth Estimation", "authors": [{"id": 156899, "fullname": "Patrick Rim", "url": "http://cvpr.thecvf.com/api/miniconf/users/156899?format=json", "institution": "Yale University / Google"}, {"id": 130760, "fullname": "Hyoungseob Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/130760?format=json", "institution": "Yale University"}, {"id": 190600, "fullname": "Vadim Ezhov", "url": "http://cvpr.thecvf.com/api/miniconf/users/190600?format=json", "institution": "Avride Inc."}, {"id": 190601, "fullname": "Changil Jeffrey Moon", "url": "http://cvpr.thecvf.com/api/miniconf/users/190601?format=json", "institution": "University of Pennsylvania"}, {"id": 72573, "fullname": "Alex Wong", "url": "http://cvpr.thecvf.com/api/miniconf/users/72573?format=json", "institution": "Yale University"}], "abstract": "We propose POLAR, a novel radar-guided depth estimation method that introduces polynomial fitting to efficiently transform scaleless depth predictions from pretrained monocular depth estimation (MDE) models into metric depth maps. Unlike existing approaches that rely on complex architectures or expensive sensors, our method is grounded in a fundamental insight: although MDE models often infer reasonable local depth structure within each object or local region, they may misalign these regions relative to one another, making a linear scale and shift (affine) transformation insufficient given three or more of these regions. To address this limitation, we use polynomial coefficients predicted from cheap, ubiquitous radar data to adaptively adjust predictions non-uniformly across depth ranges. In this way, POLAR generalizes beyond affine transformations and is able to correct such misalignments by introducing inflection points. Importantly, our polynomial fitting framework preserves structural consistency through a novel training objective that enforces local monotonicity via first-derivative regularization. POLAR achieves state-of-the-art performance across three datasets, outperforming existing methods by an average of 24.9% in MAE and 33.2% in RMSE, while also achieving state-of-the-art efficiency in terms of latency and computational cost.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38758", "url": null, "sourceid": 44101, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39037, "uid": "eec278b875a784c7646e2a27648819d5", "name": "Neural Distribution Prior for LiDAR Out-of-Distribution Detection", "authors": [{"id": 183287, "fullname": "Zizhao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183287?format=json", "institution": "The University of Melbourne"}, {"id": 191222, "fullname": "Zhengkang Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191222?format=json", "institution": "University of Melbourne"}, {"id": 158801, "fullname": "Jiayang Ao", "url": "http://cvpr.thecvf.com/api/miniconf/users/158801?format=json", "institution": "The University of Melbourne"}, {"id": 153389, "fullname": "Feng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153389?format=json", "institution": "University of Melbourne"}, {"id": 191223, "fullname": "Joseph West", "url": "http://cvpr.thecvf.com/api/miniconf/users/191223?format=json", "institution": "University of Melbourne"}, {"id": 191224, "fullname": "Kourosh Khoshelham", "url": "http://cvpr.thecvf.com/api/miniconf/users/191224?format=json", "institution": "University of Melbourne"}], "abstract": "LiDAR-based perception is critical for autonomous driving due to its robustness to poor lighting and visibility conditions. Yet, current models operate under the closed-set assumption and often fail to recognize unexpected out-of-distribution (OOD) objects in the open world. Existing OOD scoring functions exhibit limited performance because they ignore the pronounced class imbalance inherent in LiDAR OOD detection and assume a uniform class distribution. To address this limitation, we propose the Neural Distribution Prior (NDP), a framework that models the distributional structure of network predictions and adaptively reweights OOD scores based on alignment with a learned distribution prior. NDP dynamically captures the logit distribution patterns of training data and corrects class-dependent confidence bias through an attention-based module. We further introduce a Perlin noise\u2013based OOD synthesis strategy that generates diverse auxiliary OOD samples from input scans, enabling robust OOD training without external datasets. Extensive experiments on the SemanticKITTI and STU benchmarks demonstrate that NDP substantially improves OOD detection performance, achieving a point-level AP of 61.31\\% on the STU test set, which is more than 10$\\times$ higher than the previous best result. Our framework is compatible with various existing OOD scoring formulations, providing an effective solution for open-world LiDAR perception.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39037", "url": null, "sourceid": 35415, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37458, "uid": "1ab9f53c53dc087056a99065861a6f65", "name": "VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition", "authors": [{"id": 180609, "fullname": "Tanush Yadav", "url": "http://cvpr.thecvf.com/api/miniconf/users/180609?format=json", "institution": "University of Washington"}, {"id": 136938, "fullname": "Reza Salehi", "url": "http://cvpr.thecvf.com/api/miniconf/users/136938?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 85306, "fullname": "Jae Sung Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/85306?format=json", "institution": "University of Washington"}, {"id": 187496, "fullname": "Vivek Ramanujan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187496?format=json", "institution": "Department of Computer Science, University of Washington"}, {"id": 153897, "fullname": "Hannaneh Hajishirzi", "url": "http://cvpr.thecvf.com/api/miniconf/users/153897?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 187497, "fullname": "Yejin Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/187497?format=json", "institution": "Stanford University"}, {"id": 88931, "fullname": "Ali Farhadi", "url": "http://cvpr.thecvf.com/api/miniconf/users/88931?format=json", "institution": "University of Washington"}, {"id": 133433, "fullname": "Rohun Tripathi", "url": "http://cvpr.thecvf.com/api/miniconf/users/133433?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 84558, "fullname": "Ranjay Krishna", "url": "http://cvpr.thecvf.com/api/miniconf/users/84558?format=json", "institution": "University of Washington"}], "abstract": "Videos capture a rich array of subtleties in actions. While large video language models have advanced in understanding long videos, their ability to discern nuanced motions in domain-specific, fine-grained actions remains unclear. Current benchmarks evaluate for fine-grained actions in a domain agnostic manner, making to hard to evaluate models on this task. To address this gap, we introduce \\dataset, a comprehensive benchmark aimed at evaluating the domain-specific, fine-grained action understanding of video models.This benchmark covers $1,087$ distinct actions spanning $38$ domains, from bouldering to suturing.Our evaluations demonstrate that current video models encounter significant difficulties in recognizing these actions in a zero-shot scenario. We then examine how to improve model performance on this task. To this end, we collect a training dataset of 160K clips of fine-grained, domain-specific actions. Post-training a 4B model on this data, we surpass all Gemini models and GPT-4o on our benchmark. Next, we evaluate few-shot evaluation and demonstrate that even the best-performing model, GPT-5, struggles in a few-shot evaluation setting. When given three in-context examples, the gap between model and human performance widens, with human accuracy improving by 13% while models only improve by 3%. This suggests that video language models are currently not effective few-shot learners--unlike their text-only counterparts and further gains may be elicited from improving these models' few-short learning capabilities.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37458", "url": null, "sourceid": 33619, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37932, "uid": "192beb199bc41714bc563f5a0cc7e9a5", "name": "SounDiT: Geo-Contextual Soundscape-to-Landscape Generation", "authors": [{"id": 188615, "fullname": "Junbo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188615?format=json", "institution": "University of Tennessee, Knoxville"}, {"id": 188616, "fullname": "Haofeng Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188616?format=json", "institution": "University of South Carolina"}, {"id": 176287, "fullname": "Bowen Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/176287?format=json", "institution": "Arizona State University"}, {"id": 188617, "fullname": "Albert Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188617?format=json", "institution": "University of Texas at Austin"}, {"id": 188618, "fullname": "Teng Fei", "url": "http://cvpr.thecvf.com/api/miniconf/users/188618?format=json", "institution": "University of Canterbury; Wuhan University"}, {"id": 91087, "fullname": "Qixing Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91087?format=json", "institution": "University of Texas at Austin"}, {"id": 188619, "fullname": "Bing Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/188619?format=json", "institution": "University of Tennessee, Knoxville"}, {"id": 155027, "fullname": "Zhengzhong Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155027?format=json", "institution": "Texas A&amp;M University - College Station"}, {"id": 188620, "fullname": "Shan Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/188620?format=json", "institution": "National Institute of Clean-and-Low-Carbon Energy"}, {"id": 188621, "fullname": "Yuhao Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188621?format=json", "institution": "University of Texas at Austin"}], "abstract": "Recent audio-to-image models have shown impressive performance in generating images of specific objects conditioned on their corresponding sounds. However, these models fail to reconstruct real-world landscapes conditioned on acoustic environments. To address this challenge, we present Geo-contextual Soundscape-to-Landscape (GeoS2L) generation, a novel and practically significant task that aims to synthesize geographically realistic landscape images from environmental soundscapes. To support this task, we construct two large-scale geo-contextual multi-modal datasets, SoundingSVI and SonicUrban, which pair diverse environmental soundscapes with real-world landscape images. We further propose SounDiT, a diffusion transformer (DiT)-based model that incorporates acoustic environments and geo-contextual scene conditioning to synthesize geographically coherent landscape images. Furthermore, we propose the Place Similarity Score (PSS), a practically-informed geo-contextual evaluation framework to measure consistency between input soundscapes and generated landscape images. Extensive experiments demonstrate that SounDiT outperforms existing baselines in the GeoS2L, while the PSS effectively captures multi-level generation consistency across element, scene,and human perception.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37932", "url": null, "sourceid": 34541, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36391, "uid": "6628a98abb4cf561365336de2e4ff220", "name": "HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image", "authors": [{"id": 165078, "fullname": "Hezhen Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/165078?format=json", "institution": "UT Austin"}, {"id": 129149, "fullname": "Wangbo Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/129149?format=json", "institution": "National University of Singapore"}, {"id": 76222, "fullname": "Lanqing Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/76222?format=json", "institution": "Nanyang Technological University"}, {"id": 135363, "fullname": "Hanwen Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/135363?format=json", "institution": "Adobe Systems"}, {"id": 184940, "fullname": "Jonathan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184940?format=json", "institution": "University of Texas at Austin"}, {"id": 90152, "fullname": "Zhiwen Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/90152?format=json", "institution": "University of Texas, Austin"}, {"id": 73838, "fullname": "Kai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73838?format=json", "institution": "national university of singaore, National University of Singapore"}, {"id": 75636, "fullname": "Zhangyang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75636?format=json", "institution": "University of Texas at Austin"}, {"id": 128201, "fullname": "Georgios Pavlakos", "url": "http://cvpr.thecvf.com/api/miniconf/users/128201?format=json", "institution": "University of Texas at Austin"}], "abstract": "In this paper, we present HumanNOVA, a photorealistic, universal, and rapid model for generating 3D human avatars from a single RGB image. Achieving both photorealism and generalization is challenging due to the scarcity of diverse, high-quality 3D human data. To address this, we build a scalable data generation pipeline that follows two strategies. The first one is to leverage existing rigged assets and animate them with extensive poses from daily life. The second strategy is to utilize existing multi-camera captures of humans and employ fitting to generate more diverse views for training. These two strategies enable us to scale up to 100k assets, significantly enhancing both the quantity and the diversity of data for robust model training. In terms of the architecture,  HumanNOVA adopts a feed-forward, token-conditioned avatar modeling framework that allows fast inference in less than one second and requires no test-time optimization. Given an input image and an estimated simplified human mesh (SMPL) without detailed geometry or appearance, the model first encodes both inputs into compact token representations. These tokens then act as conditioning signals and are fused through cross-attention to construct a triplane-based 3D avatar representation. Extensive experiments on multiple benchmarks demonstrate the superiority of our approach, both quantitatively and qualitatively, as well as its robustness under diverse input image conditions.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36391", "url": null, "sourceid": 42860, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39836, "uid": "1a12488a1e9d664c3dc0832ee94b67c5", "name": "Data-Centric Meta-Learning for Robust Few-Shot Generalization", "authors": [{"id": 192947, "fullname": "Jongmin Lim", "url": "http://cvpr.thecvf.com/api/miniconf/users/192947?format=json", "institution": "Sungkyunkwan University"}, {"id": 160392, "fullname": "Soobin CHA", "url": "http://cvpr.thecvf.com/api/miniconf/users/160392?format=json", "institution": "Sung Kyun Kwan University"}, {"id": 133165, "fullname": "Jaehun Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/133165?format=json", "institution": "Sung Kyun Kwan University"}, {"id": 161364, "fullname": "Inho Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/161364?format=json", "institution": "Sung Kyun Kwan University"}, {"id": 192948, "fullname": "Minho Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/192948?format=json", "institution": null}, {"id": 130891, "fullname": "Kwangsu Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/130891?format=json", "institution": "Department of Computer Science &amp; Engineering, College of Computing, Sungkyunkwan University"}], "abstract": "Few-shot learning aims to enable rapid adaptation to unseen tasks using limited data. Optimization-based meta-learning addresses this challenge by acquiring shared prior knowledge across diverse tasks. However, its effectiveness degrades in cross-domain scenarios where unseen tasks differ significantly from training tasks. We identify this degradation as a failure to acquire generalizable prior knowledge, which is fundamentally caused by gradient discrepancies\u2014conflicting update directions arising in the meta-training environment with diverse task distributions. To achieve robust few-shot generalization, we propose Data-Centric Meta-Learning (DCML), a novel framework that mitigates gradient discrepancies by aligning task-specific input distributions with shared prior knowledge.  DCML accomplishes this alignment through a meta-learnable visual prompt that is integrated into the entire meta-learning process\u2014unlike previous prompt-based methods restricted solely to test-time adaptation. During meta-training, the prompt transforms each task\u2019s inputs to induce more consistent gradients, thereby facilitating the learning of generalizable prior knowledge. Leveraging this robust knowledge, DCML enables rapid and parameter-efficient test-time adaptation by updating only the lightweight prompt and classifier while keeping the backbone frozen.Extensive experiments demonstrate that DCML consistently outperforms baselines, particularly in challenging few-shot cross-domain scenarios, establishing a data-centric perspective for robust meta-learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39836", "url": null, "sourceid": 34391, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36366, "uid": "1f1933824a8410104b4d51949b95900c", "name": "IMAIA: Interactive Maps AI Assistant for Travel Planning and Geo-Spatial Intelligence", "authors": [{"id": 183026, "fullname": "Jieren Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/183026?format=json", "institution": "Capital One"}, {"id": 184884, "fullname": "Zhizhang Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184884?format=json", "institution": "Amazon"}, {"id": 184885, "fullname": "Ziyan He", "url": "http://cvpr.thecvf.com/api/miniconf/users/184885?format=json", "institution": "Microsoft"}, {"id": 184886, "fullname": "Aleksandar Cvetkovic", "url": "http://cvpr.thecvf.com/api/miniconf/users/184886?format=json", "institution": "Microsoft"}, {"id": 139483, "fullname": "Pak Chung", "url": "http://cvpr.thecvf.com/api/miniconf/users/139483?format=json", "institution": null}, {"id": 184887, "fullname": "Dragomir Yankov", "url": "http://cvpr.thecvf.com/api/miniconf/users/184887?format=json", "institution": null}, {"id": 182370, "fullname": "Chiqun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182370?format=json", "institution": "Microsoft"}], "abstract": "Most mapping tools remain point-and-click, making it hard to ask spatial questions or relate what a camera sees to its surrounding geography in a view-aware way. We present **IMAIA** \u2014 the *Interactive Maps AI Assistant* \u2014 which enables natural-language interaction with both vector (street) maps and satellite imagery, while enriching camera inputs with geospatial intelligence to help users interpret the world around them.IMAIA consists of two complementary modules:* **Maps Plus**, which treats the map as primary context by converting tiled vector or satellite views into a grid-aligned format that language models can query to resolve deictic references (e.g., \u201cthe flower-shaped building next to the park in the top-right\u201d).* **Places AI Smart Assistant (PAISA)**, which performs camera-aware place reasoning by fusing image\u2013place embeddings with geospatial signals such as location, heading, and distance to ground the scene, highlight key attributes, and produce concise explanations.A lightweight multi-agent design ensures low latency and transparent intermediate reasoning. Across map-centric question answering and camera-to-place grounding tasks, IMAIA consistently improves accuracy and responsiveness over strong baselines while remaining efficient for real-world use. By uniting language, maps, and geospatial cues, IMAIA advances from scripted interactions to *conversational mapping* that is both spatially grounded and widely accessible.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36366", "url": null, "sourceid": 39568, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39267, "uid": "814780edbfedcd6356e9be7786960e64", "name": "Multimodal Learning on Low-Quality Data with Conformal Predictive Self-Calibration", "authors": [{"id": 107446, "fullname": "Xun Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107446?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 191735, "fullname": "Yufan Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191735?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 191736, "fullname": "Disen Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191736?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 191737, "fullname": "Yuqing Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191737?format=json", "institution": "Independent Researcher"}, {"id": 129605, "fullname": "Yazhou Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/129605?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 72996, "fullname": "Fumin Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/72996?format=json", "institution": "UESTC"}, {"id": 84847, "fullname": "Heng Tao Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/84847?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 130166, "fullname": "Xing Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130166?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Multimodal learning often grapples with the challenge of low-quality data, which predominantly manifests as two facets: modality imbalance and noisy corruption. While these issues are often studied in isolation, we argue that they share a common root in the predictive uncertainty towards the reliability of individual modalities and instances during learning. In this paper, we propose a unified framework, termed Conformal Predictive Self-Calibration (CPSC), which leverages conformal prediction to equip the model with the ability to perform self-guided calibration on-the-fly. The core of our proposed CPSC lies in a novel self-calibrating training loop that seamlessly integrates two key modules: (1) Representation Self-Calibration, which decomposes unimodal features into components, selectively fuses the most robust ones identified by a conformal predictor to enhance feature resilience. (2) Gradient Self-Calibration, which recalibrates the gradient flow during backpropagation based on instance-wise reliability scores, steering the optimization towards more trustworthy directions. Furthermore, we also devise a self-update strategy for the conformal predictor to ensure the entire system co-evolves consistently throughout the training process. Extensive experiments on six benchmark datasets under both imbalanced and noisy settings demonstrate that our CPSC framework consistently outperforms existing state-of-the-art methods. Code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39267", "url": null, "sourceid": 38636, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40365, "uid": "fd1e1971f70705af958fce3debc479dd", "name": "NitroGen: An Open Foundation Model for Generalist Gaming Agents", "authors": [{"id": 191868, "fullname": "Lo\u00efc Magne", "url": "http://cvpr.thecvf.com/api/miniconf/users/191868?format=json", "institution": "NVIDIA"}, {"id": 191869, "fullname": "Anas Awadalla", "url": "http://cvpr.thecvf.com/api/miniconf/users/191869?format=json", "institution": "Department of Computer Science, University of Washington"}, {"id": 191870, "fullname": "Guanzhi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191870?format=json", "institution": "California Institute of Technology"}, {"id": 86771, "fullname": "Yinzhen Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86771?format=json", "institution": "University of California, San Diego"}, {"id": 191871, "fullname": "Joshua Belofsky", "url": "http://cvpr.thecvf.com/api/miniconf/users/191871?format=json", "institution": "General Trajectory"}, {"id": 191872, "fullname": "Fengyuan Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191872?format=json", "institution": "NVIDIA"}, {"id": 191873, "fullname": "Joohwan Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/191873?format=json", "institution": "NVIDIA"}, {"id": 85320, "fullname": "Ludwig Schmidt", "url": "http://cvpr.thecvf.com/api/miniconf/users/85320?format=json", "institution": "University of Washington"}, {"id": 86982, "fullname": "Georgia Gkioxari", "url": "http://cvpr.thecvf.com/api/miniconf/users/86982?format=json", "institution": "California Institute of Technology"}, {"id": 73960, "fullname": "Jan Kautz", "url": "http://cvpr.thecvf.com/api/miniconf/users/73960?format=json", "institution": "NVIDIA"}, {"id": 76018, "fullname": "Yisong Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/76018?format=json", "institution": "Caltech"}, {"id": 187497, "fullname": "Yejin Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/187497?format=json", "institution": "Stanford University"}, {"id": 75460, "fullname": "Yuke Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75460?format=json", "institution": "University of Texas - Austin"}, {"id": 169493, "fullname": "Linxi Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/169493?format=json", "institution": "NVIDIA"}], "abstract": "We introduce NitroGen, a video-action foundation model for generalist gaming agents, trained on 40,000 hours of gameplay videos across more than 1000 games. We incorporate three key ingredients: 1) an internet-scale video-action dataset constructed by automatically extracting player actions from publicly available gameplay videos, 2) a multi-game benchmark environment that can measure cross-game generalization, and 3) a unified vision-action model trained with large-scale behavior cloning. NitroGen exhibits strong competence across diverse domains, including combat encounters in 3D action games, high-precision control in 2D platformers, and exploration in procedurally generated worlds. It transfers effectively to unseen games, achieving up to 52% relative improvement in success rates over models trained from scratch. We release the dataset, benchmark, and model weights to advance research on generalist embodied agents.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40365", "url": null, "sourceid": -43643, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39333?format=json"], "related_events_ids": [39333]}, {"id": 40128, "uid": "04781d233bb1aa1e65ead1422900fdbc", "name": "Translating Signals to Languages for sEMG-Based Activity Recognition", "authors": [{"id": 176868, "fullname": "Ming Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176868?format=json", "institution": "School of Computing and Communications, Lancaster University"}, {"id": 153314, "fullname": "Haoxuan Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153314?format=json", "institution": "Lancaster University"}, {"id": 75819, "fullname": "Qiuhong Ke", "url": "http://cvpr.thecvf.com/api/miniconf/users/75819?format=json", "institution": "Monash University"}, {"id": 193596, "fullname": "Wei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/193596?format=json", "institution": "Cardiff University"}, {"id": 73672, "fullname": "Hossein Rahmani", "url": "http://cvpr.thecvf.com/api/miniconf/users/73672?format=json", "institution": "Lancaster University"}, {"id": 153315, "fullname": "Jun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153315?format=json", "institution": "Lancaster University"}], "abstract": "Surface electromyography (sEMG) signal-based activity recognition has attracted increasing research attention in recent years. To develop accurate sEMG signal-based activity recognizers, numerous approaches have been proposed. Some studies focus on designing larger and more expressive model architectures to enhance the representational capacity of sEMG signals, while others aim to enrich model priors through large-scale pretraining, thereby improving recognition performance. Recently, large language models (LLMs) have shown remarkable generalization and reasoning capabilities in natural language processing, whose implicit knowledge, learned from extensive linguistic descriptions of actions, opens new possibilities for interpreting sEMG signals and inferring activity intentions. Motivated by this, we propose LLM-sEMG, a novel framework that leverages LLMs as sEMG activity recognizers. Within this framework, we design a language-oriented mapping mechanism that converts continuous sEMG sequences into \u201csEMG language,\u201d integrating several strategies to further facilitate the signal-to-language mapping process. Extensive experiments demonstrate that the proposed framework achieves highly accurate sEMG signal-based activity recognition using large language models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40128", "url": null, "sourceid": 37181, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39242, "uid": "9fca30adda6cfcf3fa2d1e0b15e69691", "name": "GeoWorld: Geometric World Models", "authors": [{"id": 158542, "fullname": "Zeyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158542?format=json", "institution": "The Australian National University"}, {"id": 191689, "fullname": "Danning Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191689?format=json", "institution": "Sichuan University"}, {"id": 153152, "fullname": "Ian Reid", "url": "http://cvpr.thecvf.com/api/miniconf/users/153152?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 159454, "fullname": "Richard Hartley", "url": "http://cvpr.thecvf.com/api/miniconf/users/159454?format=json", "institution": "Australian National University"}], "abstract": "Energy-based predictive world models provide a powerful approach for multi-step visual planning by reasoning over latent energy landscapes rather than generating pixels. However, existing approaches face two major challenges: (i) their latent representations are typically learned in Euclidean space, neglecting the underlying geometric and hierarchical structure among states, and (ii) they struggle with long-horizon prediction, which leads to rapid degradation across extended rollouts. To address these challenges, we introduce GeoWorld, a geometric world model that preserves geometric structure and hierarchical relations through a Hyperbolic JEPA, which maps latent representations from Euclidean space onto hyperbolic manifolds. We further introduce Geometric Reinforcement Learning for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space. Extensive experiments on CrossTask and COIN demonstrate around 3% SR improvement in 3-step planning and 2% SR improvement in 4-step planning compared to the state-of-the-art V-JEPA-2.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39242", "url": null, "sourceid": 35289, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36451, "uid": "89188e01d32520f2c127da5a731796da", "name": "AdaPrior: Bayesian-Inspired Adaptive Prior Correction for Long-Tailed Continual Learning", "authors": [{"id": 185085, "fullname": "S Divakar Bhat", "url": "http://cvpr.thecvf.com/api/miniconf/users/185085?format=json", "institution": "Honda R&amp;D Japan"}, {"id": 185086, "fullname": "Amit More", "url": "http://cvpr.thecvf.com/api/miniconf/users/185086?format=json", "institution": "Honda R&amp;D Co., Ltd."}, {"id": 185087, "fullname": "Mudit Soni", "url": "http://cvpr.thecvf.com/api/miniconf/users/185087?format=json", "institution": "Honda R&amp;D"}, {"id": 185088, "fullname": "Bhuvan Aggarwal", "url": "http://cvpr.thecvf.com/api/miniconf/users/185088?format=json", "institution": "Honda"}], "abstract": "Long-Tail Class Incremental Learning (LTCIL) combines two fundamental challenges: \\textit{catastrophic forgetting} of past tasks and \\textit{severe class imbalance}. Existing approaches mitigate one challenge at a time, through rehearsal, reweighting, or classifier alignment, but they typically assume \\emph{static priors} and rely on multi-stage training. In contrast, we propose \\textbf{AdaPrior}, a simple Bayesian framework that treats LTCIL as a problem of \\emph{dynamic prior misalignment}. Our key idea is to estimate model-induced priors online via an exponential moving average and use them for (i) debiasing during training (\\textbf{AdaPrior Loss}), and (ii) lightweight post-hoc correction at inference. The combined approach unifies loss-level and inference-level debiasing without additional stages or heavy computation. We provide theoretical analysis showing that AdaPrior\u2019s prior estimator converges to the true model prior and that its logit adjustment yields well calibrated posteriors under mild assumptions. Extensive experiments on CIFAR100-LT, Food-101-LT, ImageNet-LT-subset, and iNaturalist18-subset demonstrate consistent gains over recent LTCIL baselines. Beyond accuracy, AdaPrior improves calibration, and forgetting curves, making it a practical and scalable solution for long-tail continual learning.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36451", "url": null, "sourceid": 40459, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37323, "uid": "de596bc2f2a21ed618ef73e8f5e9e58e", "name": "Conflict-Aware Adaptive Cross-Reconstruction for Multimodal Sentiment Analysis", "authors": [{"id": 187162, "fullname": "Yan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187162?format=json", "institution": "Shanxi University"}, {"id": 181532, "fullname": "Fuyuan Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/181532?format=json", "institution": "Shanxi University"}, {"id": 187163, "fullname": "Xingwang Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187163?format=json", "institution": "Shanxi University"}], "abstract": "Disentanglement-based methods for learning shared representations are widely used in multimodal sentiment analysis. However, most of them adopt intra-modal reconstruction strategy and rely on similarity losses to align shared representations, while often ignoring potential emotional conflicts across modalities within the same sample, thereby distorting the shared semantics. To address these issues, we propose a Conflict-aware Adaptive Cross-Reconstruction approach (CACR). First, we formally define emotional conflict and design a conflict-aware weighting strategy. This strategy calculates sample-level conflict scores based on modality consistency metrics and maps them into dynamic weights for the cross-reconstruction loss of each modality. Second, based on this, we construct a cross-reconstruction module: for each modality, reconstructs its representation by leveraging its own specific features and the shared features of the other modalities, adaptively weighting each cross-reconstruction term with the aforementioned weights, thereby achieving implicit alignment of shared representations while mitigating semantic ambiguity. Extensive experiments on three widely used benchmarks show that CACR outperforms existing state-of-the-art methods on six evaluation metrics, demonstrating its effectiveness in handling modality-level emotional conflict.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37323", "url": null, "sourceid": 35391, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36428, "uid": "a8fc21015db4f75ac1bc2269f1e2a58e", "name": "Decoupling Defense Strategies for Robust Image Watermarking", "authors": [{"id": 180019, "fullname": "Jiahui Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/180019?format=json", "institution": "Tsinghua University"}, {"id": 185020, "fullname": "Zehang Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185020?format=json", "institution": "Swinburne University of Technology"}, {"id": 158542, "fullname": "Zeyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158542?format=json", "institution": "The Australian National University"}, {"id": 185021, "fullname": "Chaoyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185021?format=json", "institution": "Tsinghua University"}, {"id": 176856, "fullname": "Lianchen Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/176856?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 184785, "fullname": "Lifeng Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/184785?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Deep learning-based image watermarking, while robust against conventional distortions, remains vulnerable to advanced adversarial and regeneration attacks. Conventional countermeasures, which jointly optimize the encoder and decoder via a noise layer, face 2 inevitable challenges:(1) decrease of clean accuracy due to decoder adversarial training and (2) limited robustness due to simultaneous training of all three advanced attacks.To overcome these issues, we propose AdvMark, a novel two-stage fine-tuning framework that decouples the defense strategies. In stage 1, we address adversarial vulnerability via a tailored adversarial training paradigm that primarily fine-tunes the encoder while only conditionally updating the decoder. This approach learns to move the image into a non-attackable region, rather than modifying the decision boundary, thus preserving clean accuracy.In stage 2, we tackle distortion and regeneration attacks via direct image optimization. To preserve the adversarial robustness gained in stage 1, we formulate a principled, constrained image loss with theoretical guarantees, which balances the deviation from cover and previous encoded images. We also propose a quality-aware early-stop to further guarantee the lower bound of visual quality.Extensive experiments demonstrate AdvMark outperforms with the highest image quality and comprehensive robustness, i.e. up to 29\\%, 33\\% and 46\\% accuracy improvement for distortion, regeneration and adversarial attacks, respectively.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36428", "url": null, "sourceid": 32408, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37616, "uid": "0b5130310a93e632cdfcf01eefd6b01e", "name": "Latent Visual Reasoning", "authors": [{"id": 187869, "fullname": "Kelvin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187869?format=json", "institution": "University of California, Berkeley"}, {"id": 187870, "fullname": "Chuyi Shang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187870?format=json", "institution": "University of California, Berkeley"}, {"id": 89723, "fullname": "Leonid Karlinsky", "url": "http://cvpr.thecvf.com/api/miniconf/users/89723?format=json", "institution": "IBM Research AI"}, {"id": 89688, "fullname": "Rogerio Feris", "url": "http://cvpr.thecvf.com/api/miniconf/users/89688?format=json", "institution": "International Business Machines"}, {"id": 86710, "fullname": "Trevor Darrell", "url": "http://cvpr.thecvf.com/api/miniconf/users/86710?format=json", "institution": "Electrical Engineering &amp; Computer Science Department"}, {"id": 89687, "fullname": "Roei Herzig", "url": "http://cvpr.thecvf.com/api/miniconf/users/89687?format=json", "institution": "Tel Aviv University"}], "abstract": "While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, this strategy imposes restrictive priors on ``useful'' visual abstractions, creates heavy annotation costs, and struggles to generalize across tasks. To address this critical limitation, we propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks -- including those where intermediate abstractions are hard to specify -- in addition to demonstrating strong cross-task generalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37616", "url": null, "sourceid": 31210, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38775, "uid": "0115afca115b436ff35c864efd26d9fe", "name": "Protego: User-Centric Pose-Invariant Privacy Protection Against Face Recognition-Induced Digital Footprint Exposure", "authors": [{"id": 187229, "fullname": "Ziling Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187229?format=json", "institution": "University of Hong Kong"}, {"id": 151465, "fullname": "Shuya Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151465?format=json", "institution": "University of Hong Kong"}, {"id": 187227, "fullname": "Jialin Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187227?format=json", "institution": "University of Hong Kong"}, {"id": 187230, "fullname": "Ka-Ho Chow", "url": "http://cvpr.thecvf.com/api/miniconf/users/187230?format=json", "institution": "The University of Hong Kong"}], "abstract": "Face recognition (FR) technologies are increasingly used to power large-scale image retrieval systems, raising serious privacy concerns. Services like Clearview AI and PimEyes allow anyone to upload a facial photo and retrieve a large amount of online content associated with that person. This not only enables identity inference but also exposes their digital footprint, such as social media activity, private photos, and news reports, often without their consent. In response to this emerging threat, we propose Protego, a user-centric privacy protection method that safeguards facial images from such retrieval-based privacy intrusions. Protego encapsulates a user\u2019s 3D facial signatures into a pose-invariant 2D representation, which is dynamically deformed into a natural-looking 3D mask tailored to the pose and expression of any facial image of the user, and applied prior to online sharing. Motivated by a critical limitation of existing methods, Protego amplifies the sensitivity of FR models so that protected images cannot be matched even among themselves. Experiments show that Protego significantly reduces retrieval accuracy across a wide range of black-box FR models and performs at least $2\\times$ better than existing methods. It also offers unprecedented visual coherence, particularly in video settings where consistency and natural appearance are essential. Overall, Protego contributes to the fight against the misuse of FR for mass surveillance and identity tracing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38775", "url": null, "sourceid": 45946, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38356, "uid": "e6a01bad243e951417c655ab34e09f76", "name": "ObjectMorpher: 3D-Aware Image Editing via Deformable 3DGS", "authors": [{"id": 189704, "fullname": "Yuhuan Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/189704?format=json", "institution": "The University of Hong Kong"}, {"id": 176651, "fullname": "Aoxuan Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/176651?format=json", "institution": "Independent"}, {"id": 70960, "fullname": "Yihua Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70960?format=json", "institution": "University of Hong Kong"}, {"id": 76914, "fullname": "Chirui Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76914?format=json", "institution": "University of Hong Kong"}, {"id": 77054, "fullname": "Peng Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/77054?format=json", "institution": "University of Hong Kong"}, {"id": 90670, "fullname": "Xin Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90670?format=json", "institution": "The University of Hong Kong"}, {"id": 88776, "fullname": "Xiaojuan Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/88776?format=json", "institution": "University of Oxford"}], "abstract": "Achieving precise, object-level control in image editing remains challenging: 2D methods lack 3D awareness and often yield ambiguous or implausible results, while existing 3D-aware approaches rely on heavy optimization or incomplete monocular reconstructions. We present ObjectMorpher, a unified, interactive framework that converts ambiguous 2D edits into geometry-grounded operations. ObjectMorpher lifts target instances with an image-to-3D generator into editable 3D Gaussian Splatting (3DGS), enabling fast, identity-preserving manipulation. Users drag control points; a graph-based non-rigid deformation with as-rigid-as-possible (ARAP) constraints ensures physically sensible shape and pose changes. A composite diffusion module harmonizes lighting, color, and boundaries for seamless reintegration. Across diverse categories, ObjectMorpher delivers fine-grained, photorealistic edits with superior controllability and efficiency, outperforming 2D drag and 3D-aware baselines on KID, LPIPS, SIFID, and user preference.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38356", "url": null, "sourceid": 40827, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39235, "uid": "6bdffb19c15963b8e630b6a1861b477f", "name": "Test-Time Attention Purification for Backdoored Large Vision Language Models", "authors": [{"id": 144784, "fullname": "Zhifang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/144784?format=json", "institution": "The University of Queensland"}, {"id": 177067, "fullname": "Yang Bojun", "url": "http://cvpr.thecvf.com/api/miniconf/users/177067?format=json", "institution": "Xidain University"}, {"id": 191672, "fullname": "Shuo He", "url": "http://cvpr.thecvf.com/api/miniconf/users/191672?format=json", "institution": "Nanyang Technological University"}, {"id": 191673, "fullname": "Weitong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191673?format=json", "institution": "University of Adelaide"}, {"id": 191674, "fullname": "Wei Emma Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191674?format=json", "institution": "Adelaide University"}, {"id": 191675, "fullname": "Olaf Maennel", "url": "http://cvpr.thecvf.com/api/miniconf/users/191675?format=json", "institution": "Adelaide University"}, {"id": 127450, "fullname": "Lei Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/127450?format=json", "institution": "Nanyang Technological University"}, {"id": 191676, "fullname": "Miao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191676?format=json", "institution": "The University of Queensland; RIKEN"}], "abstract": "Despite the strong multimodal performance, large vision\u2013language models (LVLMs) are vulnerable during fine-tuning to backdoor attacks, where adversaries insert trigger-embedded samples into the training data to implant behaviors that can be maliciously activated at test time. Existing defenses typically rely on retraining backdoored parameters (e.g., adapters or LoRA modules) with clean data, which is computationally expensive and often degrades model performance. In this work, we provide a new mechanistic understanding of backdoor behaviors in LVLMs: the trigger does not influence prediction through low-level visual patterns, but through abnormal cross-modal attention redistribution, where trigger-bearing visual tokens steal attention away from the textual context \u2014 a phenomenon we term attention stealing. Motivated by this, we propose CleanSight, a training-free, plug-and-play defense that operates purely at test time. CleanSight (i) detects poisoned inputs based on the relative visual\u2013text attention ratio in selected cross-modal fusion layers, and (ii) purifies the input by selectively pruning the suspicious high-attention visual tokens to neutralize the backdoor activation. Extensive experiments show that CleanSight significantly outperforms existing pixel-based purification defenses across diverse datasets and backdoor attack types, while preserving the model\u2019s utility on both clean and poisoned samples.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39235", "url": null, "sourceid": 45108, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38344, "uid": "90d68610088debb296464490f04c110f", "name": "Bridging the Perception Gap in Image Super-Resolution Evaluation", "authors": [{"id": 189656, "fullname": "Shaolin Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/189656?format=json", "institution": "Computer Vision Center"}, {"id": 189657, "fullname": "Josep Rocafort", "url": "http://cvpr.thecvf.com/api/miniconf/users/189657?format=json", "institution": null}, {"id": 96860, "fullname": "Danna Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/96860?format=json", "institution": "Universitat Aut\u00f2noma Barcelona, Computer Vision Center"}, {"id": 162198, "fullname": "David Serrano-Lozano", "url": "http://cvpr.thecvf.com/api/miniconf/users/162198?format=json", "institution": "Computer Vision Center"}, {"id": 76465, "fullname": "Lei Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/76465?format=json", "institution": "Zhejiang University"}, {"id": 92830, "fullname": "Javier Vazquez-Corral", "url": "http://cvpr.thecvf.com/api/miniconf/users/92830?format=json", "institution": "Computer Vision Center / Autonomous University of Barcelona"}], "abstract": "As super-resolution (SR) techniques advance, we observe a growing distrust of evaluation metrics in recent SR research. An inconsistency often emerges between certain evaluation criteria and human perceptual preference. Although current SR research employs varying metrics to evaluate SR performance, it remains underexplored how robust and reliable these metrics actually are. To bridge this gap, we conduct a comprehensive analysis of widely used image quality metrics, examining their consistency with human perception when evaluating state-of-the-art SR models. We show that some metrics exhibit only limited\u2014or even negative\u2014correlation with human preferences. We further identify several intrinsic challenges in SR evaluation that compromise the effectiveness of both full-reference (FR) and no-reference (NR) image quality assessment (IQA) frameworks. To address these issues, we propose a simple yet effective Relative Quality Index (RQI) framework, which assesses the relative quality discrepancy between image pairs. Our framework enables easy integration and notable improvements for existing IQA metrics in SR evaluation. Moreover, it can be utilized as a valuable training guide for SR models, enabling the generation of images with more realistic details while maintaining structural fidelity.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38344", "url": null, "sourceid": 37917, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38851, "uid": "39f08ca9bf44368a6a029b0ff7c43ab3", "name": "Zero-shot Detection of AI-Generated Image via RAW-RGB Alignment", "authors": [{"id": 185718, "fullname": "Haiwei Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185718?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 190831, "fullname": "Fengpeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190831?format=json", "institution": ""}, {"id": 190832, "fullname": "Zhilin Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190832?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 185719, "fullname": "Yuanman Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185719?format=json", "institution": "Shenzhen University"}, {"id": 190833, "fullname": "Xiong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190833?format=json", "institution": null}, {"id": 91005, "fullname": "Jiantao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/91005?format=json", "institution": "University of Macau"}], "abstract": "Advances in generative AI (GenAI) have increasingly complicated the identification of synthetic images, prompting the proposal of numerous zero-/few-shot detection methods to counter unknown GenAI better. However, we observe that existing detectors often misclassify synthetic images with physical transformations (e.g., print+scan) as real. The essence of this observation lies in: should images remapped from the physical world to digital space still be categorized as ``Synthetic''? Furthermore, the definition of what constitutes real and synthetic images urgently needs to be clarified. We first boldly propose that the authenticity of an image depends on whether it originates from the physical world, i.e., it is necessary to verify the original correlation between the digital image and the physical world. To this end, we first analyze the physical-to-digital mapping process: illumination signals are captured by camera sensors as RAW data, which is then converted into RGB data via camera internal parameters. This process embodies unique physical cues inherent to real scenes. Based on this, we propose a novel forensic feature termed alignment trace, which is constructed by modeling a shared RAW-RGB feature space. This trace captures the inherent parameter correlations of real images in the physical-to-digital conversion process, thereby indirectly verifying the physical origin of the image. Experiments demonstrate that our method achieves state-of-the-art zero-shot detection using only real RAW-RGB data pairs. When additional prior knowledge is provided, the method can be easily fine-tuned to achieve better cross-domain detection performance. We hope this work provides a new baseline for zero-shot synthetic detection and, more significantly, inspires the forensics community to explore the essential distinctions between real and synthetic images.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38851", "url": null, "sourceid": 43077, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38551, "uid": "cc37617187aba8b27cf5dc43cf6231d1", "name": "DuetGen: Towards General Purpose Interleaved Multimodal Generation", "authors": [{"id": 157793, "fullname": "Min Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/157793?format=json", "institution": "Georgia Institute of Technology"}, {"id": 90964, "fullname": "Xiaohui Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90964?format=json", "institution": "Department of Computer Science, University of Toronto"}, {"id": 189854, "fullname": "Jiannan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189854?format=json", "institution": "Georgia Institute of Technology"}, {"id": 85190, "fullname": "Yin Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/85190?format=json", "institution": "NVIDIA"}, {"id": 72179, "fullname": "Francesco Ferroni", "url": "http://cvpr.thecvf.com/api/miniconf/users/72179?format=json", "institution": "NVIDIA"}, {"id": 184159, "fullname": "Jialuo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184159?format=json", "institution": "Georgia Institute of Technology"}, {"id": 135762, "fullname": "Max Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/135762?format=json", "institution": "NVIDIA Research"}, {"id": 130060, "fullname": "Yogesh Balaji", "url": "http://cvpr.thecvf.com/api/miniconf/users/130060?format=json", "institution": "NVIDIA"}, {"id": 190116, "fullname": "Haoxiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190116?format=json", "institution": "NVIDIA"}, {"id": 90963, "fullname": "Tsung-Yi Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/90963?format=json", "institution": "NVIDIA"}, {"id": 76033, "fullname": "Xiao Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76033?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 189255, "fullname": "Yue Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189255?format=json", "institution": "Georgia Institute of Technology"}, {"id": 71186, "fullname": "Chieh-Yun Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/71186?format=json", "institution": "GaTech"}, {"id": 90941, "fullname": "Ming-Yu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90941?format=json", "institution": "NVIDIA"}, {"id": 75967, "fullname": "Humphrey Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/75967?format=json", "institution": "Georgia Tech | UIUC / Oregon | PAIR"}], "abstract": "Unified multimodal generation aims to jointly model image-to-text and text-to-image tasks within a single architecture. However, current approaches struggle to produce coherent, interleaved sequences of text and images. This limitation hinders applications that rely on tightly integrated multimodal outputs\u2014such as step-by-step instructional guides, visual planning tools, and interactive content editing\u2014where textual explanations and visual elements must be generated in a coordinated manner. We introduce DuetGen, a general-purpose interleaved multimodal generation model and investigate data curation, architecture design, and evaluation. In terms of data, we construct a large-scale high-quality instruction-tuning corpus combining curated web content, rewritten multimodal conversations, and diverse synthetic examples covering everyday scenarios. Architecturally, DuetGen builds upon a pretrained MLLM and diffusion transformer (DiT) pretrained on video generation, avoiding costly unimodal pretraining while remaining scalable. A two-stage decoupled training strategy first instruct-tunes the MLLM and then aligns it with the DiT using large-scale curated interleaved image\u2013text sequences. Experiments on public and newly constructed benchmarks show that DuetGen substantially outperforms prior open-source systems across text quality, image fidelity, and image\u2013context alignment, achieving substantial gains on text-to-image and image-editing benchmarks. Code and data will be released.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38551", "url": null, "sourceid": 38233, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38437, "uid": "4607d59dfdb94c404b1bc22055f11b4c", "name": "Plenoptic Video Generation", "authors": [{"id": 76033, "fullname": "Xiao Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76033?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 189864, "fullname": "Shitao Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189864?format=json", "institution": "NVIDIA"}, {"id": 157793, "fullname": "Min Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/157793?format=json", "institution": "Georgia Institute of Technology"}, {"id": 156491, "fullname": "Xian Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156491?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 75943, "fullname": "Jinwei Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75943?format=json", "institution": "NVIDIA"}, {"id": 90941, "fullname": "Ming-Yu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90941?format=json", "institution": "NVIDIA"}, {"id": 84911, "fullname": "Dahua Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/84911?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 69633, "fullname": "Chen-Hsuan Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/69633?format=json", "institution": "NVIDIA"}], "abstract": "Camera-controlled generative video re-rendering methods, such as ReCamMaster, have achieved remarkable progress. However, despite their success in single-view setting, these works often struggle to maintain consistency across multi-view scenarios. Ensuring spatio-temporal coherence in hallucinated regions remains challenging due to the inherent stochasticity of generative models. To address it, we introduce PlenopticDreamer, a framework that synchronizes generative hallucinations to maintain spatio-temporal memory. The core idea is to train a multi-in-single-out video-conditioned model in an autoregressive manner, aided by a camera-guided video retrieval strategy that adaptively selects salient videos from previous generations as conditional inputs. In addition, Our training incorporates progressive context-scaling to improve convergence, self-conditioning to enhance robustness against long-range visual degradation caused by error accumulation, and a long-video conditioning mechanism to support extended video generation. Extensive experiments on the Basic and Agibot benchmarks demonstrate that PlenopticDreamer achieves state-of-the-art video re-rendering, delivering superior view synchronization, high-fidelity visuals, accurate camera estimation, and diverse view transformations (e.g., third-person \u2192 third-person, and head-view \u2192 gripper-view in robotic manipulation).", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38437", "url": null, "sourceid": 45891, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37639, "uid": "ac5a08fd81df9def6dd34354e08bc3bd", "name": "Semantic-Guided Global-Local Collaborative Prompt Learning for Few-Shot Class Incremental Learning", "authors": [{"id": 180872, "fullname": "yongxin yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/180872?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 187927, "fullname": "Weisen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187927?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 187928, "fullname": "Xingye Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187928?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 126524, "fullname": "Yuanjie Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/126524?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 187929, "fullname": "Zhengrong Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187929?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 84798, "fullname": "Wenming Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/84798?format=json", "institution": "Hikvision Research Institute"}, {"id": 187930, "fullname": "Wenqi Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/187930?format=json", "institution": "Hikvision Research Institute"}, {"id": 91032, "fullname": "Changxin Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/91032?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 91049, "fullname": "Nong Sang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91049?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Few-Shot Class-Incremental Learning (FSCIL) poses a critical challenge in machine learning, requiring models to continuously integrate novel classes with limited samples while preserving knowledge of previously seen classes. While existing FSCIL approaches have demonstrated promising results, they still suffer from catastrophic forgetting and few-shot overfitting due to the challenge of balancing old knowledge retention with new knowledge acquisition. To address these challenges, we propose an innovative Semantic-Guided Global-Local Collaborative Prompt Learning (SGLC) framework. Built upon powerful pre-trained Vision-Language Models (VLMs), the framework first introduces a dual-alignment mechanism: globally aligning visual features with visual-textual prototypes and locally aligning multi-view visual features with local textual attribute features, which facilitates effective knowledge learning while preserving existing knowledge via frozen prototypes of previous classes. Furthermore, to alleviate overfitting, we incorporate Large Language Models (LLMs) to generate semantically rich textual descriptions, which simultaneously guide both global and local prompt learning through knowledge distillation. Extensive experiments on the miniImageNet, CIFAR-100, and CUB200 datasets demonstrate that SGLC performs favorably against the state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37639", "url": null, "sourceid": 34976, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39459, "uid": "3286e51f959c4a846d19f1edfaf416f1", "name": "UDAPose: Unsupervised Domain Adaptation for Low-Light Human Pose Estimation", "authors": [{"id": 179890, "fullname": "Haopeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/179890?format=json", "institution": "University of Mississippi"}, {"id": 180926, "fullname": "Yihao Ai", "url": "http://cvpr.thecvf.com/api/miniconf/users/180926?format=json", "institution": "NUS (Ai Yihao)"}, {"id": 192113, "fullname": "Kabeen Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/192113?format=json", "institution": "Duksung Women&#x27;s University"}, {"id": 85616, "fullname": "Robby T. Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/85616?format=json", "institution": "National University of Singapore"}, {"id": 192114, "fullname": "Yixin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192114?format=json", "institution": "University of Mississippi"}, {"id": 69287, "fullname": "Bo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/69287?format=json", "institution": "University of Mississippi"}], "abstract": "Low-visibility scenarios, such as low-light conditions, pose significant challenges to human pose estimation due to the scarcity of annotated low-light datasets and the loss of visual information under poor illumination. Recent domain adaptation techniques attempt to utilize well-lit labels by augmenting well-lit images to mimic low-light conditions.But handcrafted augmentations oversimplify noise patterns, while learning-based methods often fail to preserve high-frequency low-light characteristics, producing unrealistic images that lead pose models to generalize poorly to real low-light scenes.Moreover, recent pose estimators rely on image cues through image-to-keypoint cross-attention, but these cues become unreliable under low-light conditions.To address these issues, we propose Unsupervised Domain Adaptation for Pose Estimation (UDAPose), a novel framework that synthesizes realistic low-light images and dynamically fuses visual cues with pose priors for improved pose estimation.Specifically, our synthesis method incorporates a Direct-Current-based High-Pass Filter (DHF) and a Low-light Characteristics Injection Module (LCIM) to inject high-frequency details from input low-light images, overcoming rigidity or the detail loss in existing approaches.Furthermore, we introduce a Dynamic Control of Attention (DCA) module that adaptively balances image cues with learned pose priors in the Transformer architecture.Experiments show that UDAPose outperforms state-of-the-art methods, with notable AP gains of 10.1 (56.4%) on the challenging ExLPose-test hard set (LL-H) and 7.4 (31.4%) in cross-dataset validation on EHPT-XC.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39459", "url": null, "sourceid": 42096, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38639, "uid": "7c8616cf86f6ca71e8189ff8d544d5ac", "name": "When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought", "authors": [{"id": 190368, "fullname": "Yiyang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190368?format=json", "institution": "University of North Carolina at Chapel Hill"}, {"id": 103236, "fullname": "Haoqin Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/103236?format=json", "institution": "University of Chinese Academy of Sciences, UCAS"}, {"id": 190369, "fullname": "Zijun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190369?format=json", "institution": "University of California, Santa Cruz"}, {"id": 84813, "fullname": "Zeyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84813?format=json", "institution": "University of California, Santa Cruz"}, {"id": 153870, "fullname": "Niklas Muennighoff", "url": "http://cvpr.thecvf.com/api/miniconf/users/153870?format=json", "institution": "Stanford University"}, {"id": 190370, "fullname": "Fan Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/190370?format=json", "institution": "Stanford University"}, {"id": 168674, "fullname": "Chaorui Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/168674?format=json", "institution": "ByteDance Inc."}, {"id": 140057, "fullname": "Shen Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/140057?format=json", "institution": "ByteDance Inc."}, {"id": 153751, "fullname": "Haoqi Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153751?format=json", "institution": null}, {"id": 187497, "fullname": "Yejin Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/187497?format=json", "institution": "Stanford University"}, {"id": 98797, "fullname": "James Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/98797?format=json", "institution": null}, {"id": 75526, "fullname": "Cihang Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/75526?format=json", "institution": "University of California, Santa Cruz"}, {"id": 131401, "fullname": "Huaxiu Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/131401?format=json", "institution": "Department of Computer Science, University of North Carolina at Chapel Hill"}, {"id": 126230, "fullname": "Qinghao Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/126230?format=json", "institution": "Alibaba Group"}], "abstract": "We propose MIRA (Multimodal Imagination for Reasoning Assessment), a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional Chain-of-thought (CoT) methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images --- such as sketches, structural diagrams, or path drawings --- to guide their reasoning process. This setup closely mirrors how humans solve complex problems through \"drawing to think\". To solve this, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone (e.g., tracking a die\u2019s movement on a board and summing the face-down values after each roll). To ensure that our evaluation data is of high-quality, we include 546 multimodal problems, annotated with intermediate visual images and final answers. We also propose a unified evaluation protocol for MIRA that spans three levels of evaluation input: direct input with image and question only, text-only CoT (Text-CoT) input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority voting accuracies under different k settings. Experimental results show that existing multimodal large language models (MLLMs), including strongest private models (e.g., GPT-5, o3, Gemini 2.5 Pro) as well as strong open-weight models (e.g., Qwen2.5-VL, GLM 4.5V), perform poorly when relying solely on textual prompts. However, when intermediate visual cues are provided, model performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We also probe the upper bound by expanding the search space and designing textual prompts aligned with Visual-CoT, but both yield only limited improvements compared to our Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38639", "url": null, "sourceid": 40506, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38191, "uid": "2517e55b4bfe200e1e2c7dfd27e8fd5f", "name": "MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models", "authors": [{"id": 71186, "fullname": "Chieh-Yun Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/71186?format=json", "institution": "GaTech"}, {"id": 139798, "fullname": "Zhonghao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/139798?format=json", "institution": "Adobe Systems"}, {"id": 189253, "fullname": "Qi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189253?format=json", "institution": "Adobe Systems"}, {"id": 189254, "fullname": "Zhifan Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/189254?format=json", "institution": "Georgia Institute of Technology"}, {"id": 157793, "fullname": "Min Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/157793?format=json", "institution": "Georgia Institute of Technology"}, {"id": 189255, "fullname": "Yue Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189255?format=json", "institution": "Georgia Institute of Technology"}, {"id": 189256, "fullname": "Yinan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189256?format=json", "institution": "Adobe Systems"}, {"id": 127520, "fullname": "Hui Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127520?format=json", "institution": "Adobe Inc."}, {"id": 189257, "fullname": "Wei-An Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189257?format=json", "institution": "Adobe Systems"}, {"id": 189258, "fullname": "Yiru Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189258?format=json", "institution": "Adobe Systems"}, {"id": 85138, "fullname": "Ajinkya Kale", "url": "http://cvpr.thecvf.com/api/miniconf/users/85138?format=json", "institution": "Adobe Systems"}, {"id": 87987, "fullname": "Irfan Essa", "url": "http://cvpr.thecvf.com/api/miniconf/users/87987?format=json", "institution": "Georgia Institute of Technology"}, {"id": 75967, "fullname": "Humphrey Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/75967?format=json", "institution": "Georgia Tech | UIUC / Oregon | PAIR"}], "abstract": "Reinforcement learning from human feedback (RLHF) with reward models has advanced alignment of generative models to human aesthetic and perceptual preferences. However, jointly optimizing multiple rewards often incurs an alignment tax\u2014improving one dimension while degrading others. To address this, we introduce two complementary methods: MapReduce LoRA and Reward-aware Token Embedding (RaTE). MapReduce LoRA trains preference-specific LoRA experts in parallel and iteratively merges them to refine a shared base model; RaTE learns reward-specific token embeddings that compose at inference for flexible preference control. Experiments on Text-to-Image generation (Stable Diffusion 3.5 Medium and FLUX.1-dev) show improvements of 36.1%, 4.6%, and 55.7%, and 32.7%, 4.3%, and 67.1% on GenEval, PickScore, and OCR, respectively. On Text-to-Video generation (HunyuanVideo), visual and motion quality improve by 48.1% and 90.0%, respectively. Our framework sets a new state-of-the-art multi-preference alignment recipe across modalities.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38191", "url": null, "sourceid": 39389, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39983, "uid": "5033fd9ca925f86dbd50b35dcdd85206", "name": "Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing", "authors": [{"id": 180036, "fullname": "SeongRae Noh", "url": "http://cvpr.thecvf.com/api/miniconf/users/180036?format=json", "institution": "Korea University"}, {"id": 193238, "fullname": "SeungWon Seo", "url": "http://cvpr.thecvf.com/api/miniconf/users/193238?format=json", "institution": "Korea University"}, {"id": 153134, "fullname": "Gyeong-Moon Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/153134?format=json", "institution": "Korea University"}, {"id": 180047, "fullname": "HyeongYeop Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180047?format=json", "institution": "Korea University"}], "abstract": "Editing a 3D indoor scene from natural language is conceptually straightforward but technically challenging. Existing open-vocabulary systems often regenerate large portions of a scene or rely on image-space edits that disrupt spatial structure, resulting in unintended global changes or physically inconsistent layouts. These limitations stem from treating editing primarily as a generative task.We take a different view. A user instruction defines a desired world state, and editing should be the minimal sequence of actions that makes this state true while preserving everything else. This perspective motivates Edit-As-Act, a framework that performs open-vocabulary scene editing as goal-regressive planning in 3D space.Given a source scene and free-form instruction, Edit-As-Act predicts symbolic goal predicates and plans in EditLang, a PDDL-inspired action language that we design with explicit preconditions and effects encoding support, contact, collision, and other geometric relations. A language-driven planner proposes actions, and a validator enforces goal-directedness, monotonicity, and physical feasibility, producing interpretable and physically coherent transformations.By separating reasoning from low-level generation, Edit-As-Act achieves instruction fidelity, semantic consistency, and physical plausibility\u2014three criteria that existing paradigms cannot satisfy together. On E2A-Bench, our benchmark of 63 editing tasks across 9 indoor environments, Edit-As-Act significantly outperforms prior approaches across all edit types and scene categories.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39983", "url": null, "sourceid": 45594, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38825, "uid": "fd7d20b9c2863af38093925e27205843", "name": "Finding Distributed Object-Centric Properties in Self-Supervised Transformers", "authors": [{"id": 183839, "fullname": "Samyak Rawlekar", "url": "http://cvpr.thecvf.com/api/miniconf/users/183839?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 190769, "fullname": "Amitabh Swain", "url": "http://cvpr.thecvf.com/api/miniconf/users/190769?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 153126, "fullname": "Yujun Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/153126?format=json", "institution": "The University of Queensland"}, {"id": 182615, "fullname": "Yiwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182615?format=json", "institution": "University of California, Merced"}, {"id": 75715, "fullname": "Ming-Hsuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75715?format=json", "institution": "University of California at Merced"}, {"id": 136101, "fullname": "Narendra Ahuja", "url": "http://cvpr.thecvf.com/api/miniconf/users/136101?format=json", "institution": "University of Illinois at Urbana-Champaign"}], "abstract": "Self-supervised Vision Transformers (ViTs) like DINO show an emergent ability to discover objects, typically observed in \\texttt{[CLS]} token attention maps of the final layer. However, these maps often contain spurious activations resulting in poor localization of objects. This is because the \\texttt{[CLS]} token, trained on an image-level objective, summarizes the entire image instead of focusing on objects. This aggregation dilutes the object-centric information existing in the local, patch-level interactions. We analyze this by computing inter-patch similarity using patch-level attention components (query, key, and value) across all layers. We find that: (1) Object-centric properties are encoded in the similarity maps derived from all three components ($q, k, v$), unlike prior work that uses only key features or the \\texttt{[CLS]} token. (2) This object-centric information is distributed across the network, not just confined to the final layer. Based on these insights, we introduce Object-DINO, a training-free method that extracts this distributed object-centric information. Object-DINO clusters attention heads across all layers based on the similarities of their patches and automatically identifies the object-centric cluster corresponding to all objects. We demonstrate Object-DINO's effectiveness on two applications: enhancing unsupervised object discovery (+3.6 to +12.4 CorLoc gains) and mitigating object hallucination in Multimodal Large Language Models by providing visual grounding. Our results demonstrate that using this distributed object-centric information improves downstream tasks without additional training.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38825", "url": null, "sourceid": 33827, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39795, "uid": "035b8cb517231275e8af15d62d99aa40", "name": "Spherical Voronoi: Directional Appearance as a Differentiable Partition of the Sphere", "authors": [{"id": 192874, "fullname": "Francesco Di Sario", "url": "http://cvpr.thecvf.com/api/miniconf/users/192874?format=json", "institution": "University of Turin"}, {"id": 132581, "fullname": "Daniel Rebain", "url": "http://cvpr.thecvf.com/api/miniconf/users/132581?format=json", "institution": "Wayve; University of British Columbia"}, {"id": 94620, "fullname": "Dor Verbin", "url": "http://cvpr.thecvf.com/api/miniconf/users/94620?format=json", "institution": "Google"}, {"id": 192875, "fullname": "Marco Grangetto", "url": "http://cvpr.thecvf.com/api/miniconf/users/192875?format=json", "institution": "University of Turin"}, {"id": 126249, "fullname": "Andrea Tagliasacchi", "url": "http://cvpr.thecvf.com/api/miniconf/users/126249?format=json", "institution": "Simon Fraser University, Google Brain"}], "abstract": "Radiance field methods (e.g.~3D Gaussian Splatting) have emerged as a powerful paradigm for novel view synthesis, yet their appearance modeling often relies on Spherical Harmonics (SH), which impose fundamental limitations.SH struggle with high-frequency signals, exhibit Gibbs ringing artifacts, and critically fail to capture specular reflections -- a key component of realistic rendering. While alternatives like Spherical Gaussians offer improvements, they introduce significant optimization complexity.We propose Spherical Voronoi (SV) as a unified framework for appearance representation in 3D Gaussian Splatting.SV partitions the directional domain into learnable regions with smooth boundaries, providing an intuitive and stable parameterization for view-dependent effects. For diffuse appearance, SV achieves competitive results while maintaining simpler optimization compared to existing alternatives. For reflections -- where SH fundamentally fail -- we leverage SV as learnable reflection probes, taking reflected directions as input following principles from traditional graphics. This formulation achieves state-of-the-art results across both synthetic and real-world datasets, demonstrating that SV offers a principled, efficient, and general solution for appearance modeling in explicit 3D representations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39795", "url": null, "sourceid": 35509, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38687, "uid": "85bb35d4a343eb4beeb2b03c450e244c", "name": "PointNSP: Autoregressive 3D Point Cloud Generation with Next-Scale Level-of-Detail Prediction", "authors": [{"id": 180434, "fullname": "Ziqiao Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180434?format=json", "institution": "National University of Singapore"}, {"id": 190465, "fullname": "Qichao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190465?format=json", "institution": "Nanyang Technological University"}, {"id": 190466, "fullname": "Zhiyang Dou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190466?format=json", "institution": "MIT"}, {"id": 190467, "fullname": "Zixing Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/190467?format=json", "institution": "University of Bristol"}, {"id": 76652, "fullname": "Zhipeng Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/76652?format=json", "institution": "Nanyang Technological University"}, {"id": 188679, "fullname": "Irwin King", "url": "http://cvpr.thecvf.com/api/miniconf/users/188679?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 190468, "fullname": "Peilin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190468?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Autoregressive point cloud generation has long lagged behind diffusion-based approaches in quality. The performance gap stems from the fact that autoregressive models impose an artificial ordering on inherently unordered point sets, forcing shape generation to proceed as a sequence of local predictions. This sequential bias reinforces short-range continuity but limits the model\u2019s ability to capture long-range dependencies, thereby weakening its capacity to enforce global structural properties such as symmetry, geometric consistency, and large-scale spatial regularities. Inspired by the level-of-detail (LOD) principle in shape modeling, we propose PointNSP, a coarse-to-fine generative framework that preserves global shape structure at low resolutions and progressively refines fine-grained geometry at higher scales through a next-scale prediction paradigm. This multi-scale factorization aligns the autoregressive objective with the permutation-invariant nature of point sets, enabling rich intra-scale interactions while avoiding brittle fixed orderings. Strictly following the baseline experimental setups, empirical results on ShapeNet benchmark demonstrate that PointNSP achieves state-of-the-art (SOTA) generation quality for the first time within the autoregressive paradigm. Moreover, it surpasses strong diffusion-based baselines in parameter, training, and inference efficiency. Finally, under dense generation with 8,192 points, PointNSP's advantages become even more pronounced, highlighting its strong scalability potential.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38687", "url": null, "sourceid": 32951, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40086, "uid": "3e43837dc774ebfbd1ccc4801237041d", "name": "VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network", "authors": [{"id": 181412, "fullname": "Yang Zepeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/181412?format=json", "institution": "Beihang University"}, {"id": 193481, "fullname": "Junxuan Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/193481?format=json", "institution": "Capital University of Physical Education and Sports"}, {"id": 158990, "fullname": "Hao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/158990?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 158511, "fullname": "Ju Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/158511?format=json", "institution": "Peng Cheng Laboratory"}, {"id": 158782, "fullname": "Junjun Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/158782?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 180052, "fullname": "Yongfeng Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/180052?format=json", "institution": null}, {"id": 193482, "fullname": "Bin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193482?format=json", "institution": "Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences"}], "abstract": "The rapid advances in deep learning have significantly enhanced the accuracy of multimodal 3D human pose estimation (HPE). However, the state-of-the-art (SOTA) HPE pipelines still rely on Transformers, whose quadratic complexity makes real-time processing for long sequences impractical. Mamba addresses this issue through selective state-space modeling, enabling efficient sequence processing without sacrificing representational power. Nevertheless, it struggles to capture complex spatial dependencies in multimodal settings. To bridge this gap, we propose VIMCAN, a hybrid architecture that combines the efficient sequence modeling of Mamba with the spatial reasoning of Cross-Attention, and performs robust visual\u2013inertial fusion and human pose estimation between RGB keypoints and wearable IMU data. By leveraging Mamba\u2019s dynamic parameterization for temporal modeling and Attention for spatial dependency extraction, VIMCAN achieves superior accuracy, with mean per-joint position errors (MPJPE) of 17.2 mm on TotalCapture and 45.3 mm on 3DPW. VIMCAN outperforms prior Transformer-based and other SOTA approaches while supporting real-time inference at over 60 frames per second on consumer-grade hardware. The source code will be available on GitHub.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40086", "url": null, "sourceid": 32357, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37077, "uid": "a7c3525b2c5e24f3f212246ee3a64b6d", "name": "TAG-MoE: Task-Aware Gating for Unified Generative Mixture-of-Experts", "authors": [{"id": 186611, "fullname": "Yu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186611?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 186612, "fullname": "Hongbin Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186612?format=json", "institution": "Hangzhou Institute for Advanced Study, University of the Chinese Academy of Sciences"}, {"id": 90790, "fullname": "Juan Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/90790?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 180724, "fullname": "YIJI CHENG", "url": "http://cvpr.thecvf.com/api/miniconf/users/180724?format=json", "institution": "Tencent"}, {"id": 186613, "fullname": "Tiankai Hang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186613?format=json", "institution": "Tencent"}, {"id": 102099, "fullname": "Runze He", "url": "http://cvpr.thecvf.com/api/miniconf/users/102099?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 107041, "fullname": "Zijin Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/107041?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 181780, "fullname": "Shiyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181780?format=json", "institution": "Tsinghua University"}, {"id": 186614, "fullname": "Yuxin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186614?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 126129, "fullname": "Jintao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/126129?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 186615, "fullname": "Chunyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186615?format=json", "institution": "Tencent Hunyuan"}, {"id": 106459, "fullname": "qinglin lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/106459?format=json", "institution": "Tencent"}, {"id": 131773, "fullname": "Tong-yee Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/131773?format=json", "institution": "National Cheng Kung University"}, {"id": 87043, "fullname": "Fan Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87043?format=json", "institution": "Institute of Computing Technology, CAS"}], "abstract": "Unified image generation and editing models suffer from severe task interference in dense diffusion transformers architectures, where a shared parameter space must compromise between conflicting objectives (e.g., local editing v.s. subject-driven generation). While the sparse Mixture-of-Experts (MoE) paradigm is a promising solution, its gating networks remain task-agnostic, operating based on local features, unaware of global task intent. This task-agnostic nature prevents meaningful specialization and fails to resolve the underlying task interference.In this paper, we propose a novel framework to inject semantic intent into MoE routing. We introduce a Hierarchical Task Semantic Annotation scheme to create structured task descriptors (e.g., scope, type, preservation). We then design Predictive Alignment Regularization to align internal routing decisions with the task's high-level semantics. This regularization evolves the gating network from a task-agnostic executor to a dispatch center. Our model effectively mitigates task interference, outperforming dense baselines in fidelity and quality, and our analysis shows that experts naturally develop clear and semantically correlated specializations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37077", "url": null, "sourceid": 43271, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39933, "uid": "bbf4a983b7d312cf210692dc24c9a4be", "name": "General Process Reward Modeling for Robotic Reinforcement Learning", "authors": [{"id": 156446, "fullname": "Huajie Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/156446?format=json", "institution": "Peking University"}, {"id": 193146, "fullname": "Sixiang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193146?format=json", "institution": "Peking University"}, {"id": 192491, "fullname": "Yijie Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192491?format=json", "institution": "University of Sydney, University of Sydney"}, {"id": 193147, "fullname": "Zixiao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193147?format=json", "institution": "Peking University"}, {"id": 186708, "fullname": "Cheng Chi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186708?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 156445, "fullname": "Yuheng Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/156445?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 193148, "fullname": "Yaoxu Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193148?format=json", "institution": "Peking University"}, {"id": 192493, "fullname": "Zhongxia Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192493?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 192492, "fullname": "Xiansheng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192492?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 192490, "fullname": "Peterson Co", "url": "http://cvpr.thecvf.com/api/miniconf/users/192490?format=json", "institution": "Peking University"}, {"id": 193149, "fullname": "Shaoxuan Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/193149?format=json", "institution": "Chinese Academy of Sciences"}, {"id": 193150, "fullname": "Guocai Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193150?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 156220, "fullname": "Pengwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156220?format=json", "institution": "baai-\u5317\u4eac\u4eba\u5de5\u667a\u80fd\u7814\u7a76\u9662"}, {"id": 128283, "fullname": "Zhongyuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128283?format=json", "institution": "Kuaishou Inc."}, {"id": 91956, "fullname": "Shanghang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91956?format=json", "institution": "Peking University"}], "abstract": "The primary obstacle for applying reinforcement learning (RL) to real-world robotics is the design of effective reward functions. While recently learning-based Process Reward Models (PRMs) are a promising direction, they are often hindered by two fundamental limitations: their reward models lack step-aware understanding and rely on single-view perception, leading to unreliable assessments of fine-grained manipulation progress; and their reward shaping procedures are theoretically unsound, often inducing a semantic trap that misguides policy optimization.To address these, we introduce Robo-Dopamine, a novel reward modeling method for learning a general-purpose, step-aware process reward model from multi-view inputs. At its core is our General Reward Model (GRM), trained on a vast 3,400+ hour dataset, which leverages Step-wise Reward Discretization for structural understanding and Multi-Perspective Reward Fusion to overcome perceptual limitations. Building upon Robo-Dopamine, we propose Dopamine-RL, a robust policy learning framework that employs a theoretically-sound Policy-Invariant Reward Shaping method, which enables the agent to leverage dense rewards for efficient self-improvement without altering the optimal policy, thereby fundamentally avoiding the semantic trap.Extensive experiments across 10 simulated and 8 real-world tasks validate our approach. {GRM achieves state-of-the-art accuracy in reward assessment}, and {Dopamine-RL built on GRM significantly improves policy learning efficiency}.For instance, after {GRM} is adapted to a new task in a one-shot manner from a single expert trajectory, the resulting reward model enables Dopamine-RL to improve the policy from near-zero to 95\\% success with only 150 online rollouts (approximately 1 hour of real robot interaction), while retaining strong generalization across tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39933", "url": null, "sourceid": 34821, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40085, "uid": "85ecc824097a247c49dfdf7cffe98500", "name": "Energy Waveify and Redistribution for Test-Time Adaptation: A Control System Perspective", "authors": [{"id": 193474, "fullname": "Zhenbin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193474?format=json", "institution": "Sichuan University"}, {"id": 193475, "fullname": "Lei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193475?format=json", "institution": "Sichuan University"}, {"id": 193476, "fullname": "Lituan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193476?format=json", "institution": "Sichuan University"}, {"id": 193477, "fullname": "Zhenwei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193477?format=json", "institution": "Sichuan University"}, {"id": 193478, "fullname": "Guangwu Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/193478?format=json", "institution": "Sichuan University"}, {"id": 193479, "fullname": "Yan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193479?format=json", "institution": "Institute of High Performance Computing, Singapore, A*STAR"}, {"id": 193480, "fullname": "Wei Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193480?format=json", "institution": null}], "abstract": "This work tackles a key challenge in test-time energy adaptation: prohibitive time overhead arising from recent state-of-the-art test-time adaptation (TTA) methods, which are built on energy models relying on iterative Monte Carlo or Langevin dynamics sampling with multiple stochastic updates per test instance to approximate energy gradients. We tackle the problem from an innovative control system perspective by i) describing the energy as a complex-valued wave, where the amplitude encodes energy uncertainty and the phase characterizes its evolution, and ii) maintaining a time-dependent wave equation that interprets TTA as a control system evolution process. By enforcing the control system law of probability current conservation, our method directs probability current away from high-energy (error-prone) regions toward low-energy (accurate) ones, achieving adaptive energy redistribution without additional stochastic sampling while preserving the overall normalization of the energy landscape. Experimentally, the proposed method significantly outperforms baseline methods across several public benchmark datasets, with adaptive time being only 1/3 ~ 1/7 of that required by the compared Top-1 to Top-3 baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40085", "url": null, "sourceid": 39814, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37548, "uid": "a4d8904831cfd921f81dc279df02f6c1", "name": "PromptEnhancer: Taming Your Rewriter for Text-to-Image Generation via Fine-Grained Reward", "authors": [{"id": 180828, "fullname": "Linqing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180828?format=json", "institution": "\u817e\u8baf\u79d1\u6280\uff08\u6df1\u5733\uff09\u6709\u9650\u516c\u53f8"}, {"id": 187694, "fullname": "zhiyong xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187694?format=json", "institution": "Tencent HunYuan"}, {"id": 126922, "fullname": "XiMing Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/126922?format=json", "institution": "Beihang University"}, {"id": 180724, "fullname": "YIJI CHENG", "url": "http://cvpr.thecvf.com/api/miniconf/users/180724?format=json", "institution": "Tencent"}, {"id": 135754, "fullname": "Zhiyuan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/135754?format=json", "institution": "Tencent"}, {"id": 187695, "fullname": "Donghao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187695?format=json", "institution": "Tencent"}, {"id": 186613, "fullname": "Tiankai Hang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186613?format=json", "institution": "Tencent"}, {"id": 187696, "fullname": "Zhenxi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187696?format=json", "institution": "Tencent Hunyuan"}, {"id": 187697, "fullname": "Jiale Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187697?format=json", "institution": "Tencent"}, {"id": 187698, "fullname": "wangqixun wangqixun", "url": "http://cvpr.thecvf.com/api/miniconf/users/187698?format=json", "institution": null}, {"id": 76781, "fullname": "Ruihuang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/76781?format=json", "institution": null}, {"id": 187699, "fullname": "Comi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187699?format=json", "institution": "tencent"}, {"id": 107592, "fullname": "Xin LI", "url": "http://cvpr.thecvf.com/api/miniconf/users/107592?format=json", "institution": null}, {"id": 180146, "fullname": "Mingrui Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180146?format=json", "institution": "Xiamen University"}, {"id": 187700, "fullname": "Xinchi Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187700?format=json", "institution": null}, {"id": 154539, "fullname": "Shuyang Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154539?format=json", "institution": "Tencent Hunyuan Research"}, {"id": 186615, "fullname": "Chunyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186615?format=json", "institution": "Tencent Hunyuan"}, {"id": 106459, "fullname": "qinglin lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/106459?format=json", "institution": "Tencent"}], "abstract": "Recent advances in text-to-image (T2I) diffusion models have demonstrated remarkable capabilities in generating high-fidelity images. However, these models often struggle to faithfully render complex user prompts, particularly in aspects such as attribute binding, negation, and compositional relationships. To address this challenge, we introduce PromptEnhancer, a novel and universal prompt rewriting framework that enhances any pre-trained T2I model.Specifically, we adopt a multi-stage training pipeline to systematically boost the rewriter's understanding and rewriting performance. In the first stage, we conduct supervised fine-tuning (SFT) using CoT-enabled data to enable the rewriter to generate structured, chain-of-thought-style responses. In the second stage, we design a task-specific reward model\u2014AlignEvaluator\u2014to further align user prompts with fine-grained preferences through GRPO.The AlignEvaluator is trained to provide explicit and fine-grained feedback based on a systematic taxonomy derived from common T2I failure cases. By optimizing the rewriter to maximize the reward from AlignEvaluator, our framework learns to generate prompts that T2I models can interpret more precisely. Furthermore, we introduce a comprehensive human-aligned benchmark to facilitate future research in this direction. Extensive experiments demonstrate that PromptEnhancer significantly improves image-text alignment across a wide range of semantic and compositional challenges.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37548", "url": null, "sourceid": 33920, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36508, "uid": "4a00ec743cd160ce59b375e9d7e4696a", "name": "Breaking the Scalability Limit of Multi-Projector Calibration with Embedded Cameras", "authors": [{"id": 185235, "fullname": "Takumi Kawano", "url": "http://cvpr.thecvf.com/api/miniconf/users/185235?format=json", "institution": "Osaka University"}, {"id": 172921, "fullname": "Kohei Miura", "url": "http://cvpr.thecvf.com/api/miniconf/users/172921?format=json", "institution": "The University of Osaka"}, {"id": 185236, "fullname": "Daisuke Iwai", "url": "http://cvpr.thecvf.com/api/miniconf/users/185236?format=json", "institution": "The University of Osaka"}], "abstract": "Conventional multi-projector calibration requires projecting and capturing structured light patterns for each projector sequentially, causing calibration time and effort to increase linearly with the number of projectors. This scalability bottleneck has long limited the deployment of large-scale projection mapping systems. We present a new calibration framework that breaks this limitation by embedding cameras into the surface of the calibration target. The embedded cameras directly capture the incoming projection light, enabling the separation of simultaneously projected structured light patterns from multiple projectors according to their incident directions. Our method establishes correspondences between the optical centers of the embedded cameras and the projector pixels, allowing the intrinsic and extrinsic parameters of all projectors to be simultaneously estimated. We further introduce a correction technique for small misalignments between the calibration board and camera optical centers. As a result, our system achieves calibration accuracy comparable to conventional methods while reducing the required number of projection-capture cycles from linear to nearly constant with respect to the number of projectors, dramatically improving scalability for large multi-projector environments.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36508", "url": null, "sourceid": 31550, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40265?format=json"], "related_events_ids": [40265]}, {"id": 37150, "uid": "9abbbda872050c52f250190578b0c178", "name": "Contrastive Cross-Bag Augmentation for Multiple Instance Learning-based Whole Slide Image Classification", "authors": [{"id": 151439, "fullname": "Bo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151439?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 186779, "fullname": "Xu Xinan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186779?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 186780, "fullname": "Shuo Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186780?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 186781, "fullname": "Yu Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/186781?format=json", "institution": "Muyuan Laboratory"}, {"id": 151441, "fullname": "Zheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151441?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 186782, "fullname": "Wufan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186782?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 186783, "fullname": "Hui Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186783?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 154267, "fullname": "Wendong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154267?format=json", "institution": "Beijing University of Posts and Telecommunications"}], "abstract": "Recent pseudo-bag augmentation methods for Multiple Instance Learning (MIL)-based Whole Slide Image (WSI) classification sample instances from a limited number of bags, resulting in constrained diversity. To address this issue, we propose Contrastive Cross-Bag Augmentation ($\\text{C}^{2}$Aug) to sample instances from all bags with the same class to increase the diversity of pseudo-bags. However, introducing new instances into the pseudo-bag increases the number of critical instances (e.g., tumor instances). This increase results in a reduced occurrence of pseudo-bags containing few critical instances, thereby limiting model performance, particularly on test slides with small tumor areas. To address this, we introduce a bag-level and group-level contrastive learning framework to enhance the discrimination of features with distinct semantic meanings, thereby improving model performance. Experimental results demonstrate that $\\text{C}^{2}$Aug consistently outperforms state-of-the-art approaches across multiple evaluation metrics.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37150", "url": null, "sourceid": 34331, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40265, "uid": "4a00ec743cd160ce59b375e9d7e4696a", "name": "Breaking the Scalability Limit of Multi-Projector Calibration with Embedded Cameras", "authors": [{"id": 185235, "fullname": "Takumi Kawano", "url": "http://cvpr.thecvf.com/api/miniconf/users/185235?format=json", "institution": "Osaka University"}, {"id": 172921, "fullname": "Kohei Miura", "url": "http://cvpr.thecvf.com/api/miniconf/users/172921?format=json", "institution": "The University of Osaka"}, {"id": 185236, "fullname": "Daisuke Iwai", "url": "http://cvpr.thecvf.com/api/miniconf/users/185236?format=json", "institution": "The University of Osaka"}], "abstract": "Conventional multi-projector calibration requires projecting and capturing structured light patterns for each projector sequentially, causing calibration time and effort to increase linearly with the number of projectors. This scalability bottleneck has long limited the deployment of large-scale projection mapping systems. We present a new calibration framework that breaks this limitation by embedding cameras into the surface of the calibration target. The embedded cameras directly capture the incoming projection light, enabling the separation of simultaneously projected structured light patterns from multiple projectors according to their incident directions. Our method establishes correspondences between the optical centers of the embedded cameras and the projector pixels, allowing the intrinsic and extrinsic parameters of all projectors to be simultaneously estimated. We further introduce a correction technique for small misalignments between the calibration board and camera optical centers. As a result, our system achieves calibration accuracy comparable to conventional methods while reducing the required number of projection-capture cycles from linear to nearly constant with respect to the number of projectors, dramatically improving scalability for large multi-projector environments.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40265", "url": null, "sourceid": -31550, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/36508?format=json"], "related_events_ids": [36508]}, {"id": 38919, "uid": "91ee256d2ebc089ddeacfa66b84ffb66", "name": "HybridDriveVLA: Vision-Language-Action model with Visual CoT reasoning and ToT Evaluation for Autonomous Driving", "authors": [{"id": 190963, "fullname": "Yipene Bassole", "url": "http://cvpr.thecvf.com/api/miniconf/users/190963?format=json", "institution": "Dongguk University"}, {"id": 190964, "fullname": "Sungwoo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/190964?format=json", "institution": "Dongguk University"}, {"id": 190965, "fullname": "Jiwoo Jung", "url": "http://cvpr.thecvf.com/api/miniconf/users/190965?format=json", "institution": "Dongguk University"}, {"id": 190966, "fullname": "Yunsick Sung", "url": "http://cvpr.thecvf.com/api/miniconf/users/190966?format=json", "institution": "Dongguk University"}], "abstract": "Vision-Language-Action (VLA) models are emerging as an important technology in autonomous driving, recognized for their sophisticated reasoning and interpretability. However, traditional VLA models often rely on image-to-text with Chain-of-Thought (CoT) reasoning, which converts sequential visual scenes into textual symbols, thereby under-utilizing spatial context in visual information. Existing autonomous driving systems using VLA models predict only a single sequence of waypoints as a trajectory considering a given command and multiple aspects. However, we suggest evaluating each sequence of waypoints to reveal the importance of the corresponding aspect. We introduce HybridDriveVLA, a VLA model that integrates visual Chain-of-Thought (V-Cot) reasoning and a proposed Tree-of-Thought (ToT)-inspired waypoint evaluation (ToT-evaluation). V-Cot reasoning anticipates future scenes, which serve as goals for ToT-evaluation. The ToT-evaluation generates and scores waypoints based each on safety, progress, and comfort aspects. The highest cumulated score of the waypoints based on the three aspects is optimal. To the best of our knowledge, we are the first to propose a unified method integrating both a CoT and a ToT approach in a VLA model. Experimental results demonstrated that HybridDriveVLA achieved strong performance on comfort, progress, and safety metrics with and average collision rate of 0.17\\% on the nuScenes benchmark, outperforming traditional VLA models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38919", "url": null, "sourceid": 41505, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37587, "uid": "589e19fafa037ef3e798363d7f9bd6d3", "name": "Agent4FaceForgery: Multi-Agent LLM Framework for Realistic Face Forgery Detection", "authors": [{"id": 157178, "fullname": "Yingxin Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/157178?format=json", "institution": "Xiamen University"}, {"id": 70489, "fullname": "Zitong YU", "url": "http://cvpr.thecvf.com/api/miniconf/users/70489?format=json", "institution": "Nanyang Technological University"}, {"id": 187777, "fullname": "Jun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187777?format=json", "institution": "Great Bay University"}, {"id": 76746, "fullname": "Linlin Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76746?format=json", "institution": "Shenzhen University"}, {"id": 87841, "fullname": "Yong Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87841?format=json", "institution": "Harbin Institute of Technology"}, {"id": 126239, "fullname": "Xiaochun Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/126239?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Face forgery detection faces a critical challenge: a persistent gap between offline benchmarks and real-world efficacy, which we attribute to the ecological invalidity of training data. This work introduces Agent4FaceForgery to address two fundamental problems: (1) how to capture the diverse intents and iterative processes of human forgery creation, and (2) how to model the complex, often adversarial, text-image interactions that accompany forgeries in social media. To solve this, we propose a multi-agent framework where LLM-powered agents, equipped with profile and memory modules, simulate the forgery creation process. Crucially, these agents interact in a simulated social environment to generate samples labeled for nuanced text-image consistency, moving beyond simple binary classification. An Adaptive Rejection Sampling (ARS) mechanism ensures data quality and diversity. Extensive experiments validate that the data generated by our simulation-driven approach brings significant performance gains to detectors of multiple architectures, fully demonstrating the effectiveness and value of our framework.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37587", "url": null, "sourceid": 39976, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38487, "uid": "572fdfd496ec32968f94ab3cb3ca9991", "name": "Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning", "authors": [{"id": 189958, "fullname": "Apoorv Vyas", "url": "http://cvpr.thecvf.com/api/miniconf/users/189958?format=json", "institution": "Meta"}, {"id": 189959, "fullname": "Heng-Jui Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189959?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 189960, "fullname": "Cheng-Fu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189960?format=json", "institution": "University of California, Los Angeles"}, {"id": 89206, "fullname": "Po-Yao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89206?format=json", "institution": "Facebook"}, {"id": 189961, "fullname": "Luya Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189961?format=json", "institution": "Meta"}, {"id": 148381, "fullname": "Julius Richter", "url": "http://cvpr.thecvf.com/api/miniconf/users/148381?format=json", "institution": "University of Hamburg"}, {"id": 189962, "fullname": "Sanyuan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189962?format=json", "institution": "Facebook; Harbin Institute of Technology"}, {"id": 189963, "fullname": "Matthew Le", "url": "http://cvpr.thecvf.com/api/miniconf/users/189963?format=json", "institution": "Meta"}, {"id": 188077, "fullname": "Piotr Doll\u00e1r", "url": "http://cvpr.thecvf.com/api/miniconf/users/188077?format=json", "institution": "Thinking Machines"}, {"id": 189964, "fullname": "Christoph Feichtenhofer", "url": "http://cvpr.thecvf.com/api/miniconf/users/189964?format=json", "institution": "Meta FAIR"}, {"id": 189965, "fullname": "Ann Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/189965?format=json", "institution": "Meta"}, {"id": 189966, "fullname": "Wei-Ning Hsu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189966?format=json", "institution": "Facebook"}], "abstract": "We introduce Perception Encoder-Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE~\\citep{pe}, PE-AV makes several key contributions to extend representations to audio, and natively support joint embeddings across audio\u2013video, audio\u2013text, and video\u2013text modalities. PE-AV's unified cross-modal embeddings enable novel tasks such as speech retrieval, and set a new state of the art across standard audio and video benchmarks. We unlock this by building a strong audiovisual data engine that synthesizes high-quality captions for O(100M) audio\u2013video pairs, enabling large-scale supervision consistent across modalities. Our audio data includes speech, music, and general sound effects\u2014avoiding single-domain limitations common in prior work. We exploit ten pairwise contrastive objectives, showing that scaling cross-modality and caption-type pairs strengthens alignment and improves zero-shot performance. Our models and code will be available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38487", "url": null, "sourceid": 39037, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36966, "uid": "c5cc5960420a4189fb225dbaec8f9e62", "name": "WPT: World-to-Policy Transfer via Online World Model Distillation", "authors": [{"id": 186331, "fullname": "Guangfeng Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186331?format=json", "institution": "University of Science and Technology of China"}, {"id": 145274, "fullname": "Yueru Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/145274?format=json", "institution": "Beijing University of Post and Telecommunications; The Chinese University of Hong Kong"}, {"id": 186332, "fullname": "Jun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186332?format=json", "institution": "University of Science and Technology of China"}, {"id": 186333, "fullname": "Yi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186333?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 180283, "fullname": "Yiyao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180283?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 186334, "fullname": "zhan qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186334?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 130142, "fullname": "Dave Zhenyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/130142?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}, {"id": 90680, "fullname": "Bingbing Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90680?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 77320, "fullname": "Xu Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/77320?format=json", "institution": "Huawei Technologies Ltd."}], "abstract": "Recent years have witnessed remarkable progress in world models, which primarily aim to capture the spatiotemporal correlations between an agent\u2019s actions and the evolving environment. However, existing approaches often suffer from tight runtime coupling or depend on offline reward signals, resulting in substantial inference overhead or hindering end-to-end optimization. To overcome these limitations, we introduce WPT, a World-to-Policy Transfer training paradigm that enables online distillation under the guidance of an end-to-end world model. Specifically, we develop a trainable reward model that infuses world knowledge into a teacher policy by aligning candidate trajectories with the future dynamics predicted by the world model. Subsequently, we propose policy distillation and world reward distillation to transfer the teacher\u2019s reasoning ability into a lightweight student policy, enhancing planning performance while preserving real-time deployability. Extensive experiments on both open-loop and closed-loop benchmarks show that WPT achieves state-of-the-art performance with a simple policy architecture: it attains a 0.11 collision rate (open-loop) and achieves a 79.23 driving score (closed-loop), surpassing both world-model-based and imitation-learning methods in accuracy and safety. Moreover, the student sustains up to 4.9$\\times$ faster inference, while retaining most of the gains.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36966", "url": null, "sourceid": 31229, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37368, "uid": "a405656bc824415c53be4e7bc6272620", "name": "DLWM: Dual Latent World Models enable Holistic Gaussian-centric Pre-training in Autonomous Driving", "authors": [{"id": 180283, "fullname": "Yiyao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180283?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 187264, "fullname": "Ying Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/187264?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 158469, "fullname": "Haiming Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158469?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 186331, "fullname": "Guangfeng Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186331?format=json", "institution": "University of Science and Technology of China"}, {"id": 76438, "fullname": "Wending Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/76438?format=json", "institution": null}, {"id": 77320, "fullname": "Xu Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/77320?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 187265, "fullname": "Jiantao Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187265?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 130714, "fullname": "Yingjie CAI", "url": "http://cvpr.thecvf.com/api/miniconf/users/130714?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 90680, "fullname": "Bingbing Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90680?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 155659, "fullname": "Zhen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155659?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 187266, "fullname": "Shaojie Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187266?format=json", "institution": "Hong Kong University of Science and Technology"}], "abstract": "Vision-based autonomous driving has gained much attention due to its low costs and excellent performance.Compared with dense BEV (Bird\u2019s Eye View) or sparse query models, Gaussian-centric method is a comprehensive yet sparse representation by describing scene with 3D semantic Gaussians. In this paper, we introduce DLWM, a novel paradigm with Dual Latent World Models specifically designed to enable holistic gaussian-centric pre-training in autonomous driving using two stages. In the first stage, DLWM predicts 3D Gaussians from queries by self-supervised reconstructing multi-view semantic and depth images. Equipped with fine-grained contextual features, in the second stage, two latent world models are trained separately for temporal feature learning, including Gaussian-flow-guided latent prediction for downstream occupancy perception and forecasting tasks, and ego-planning-guided latent prediction for motion planning. Extensive experiments in SurroundOcc and nuScenes benchmarks demonstrate that DLWM shows significant performance gains across Gaussian-centric 3D occupancy perception, 4D occupancy forecasting and motion planning tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37368", "url": null, "sourceid": 37122, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38168, "uid": "a790b04a0380999fb3ea958c0e6eb5dc", "name": "HCL-FF: Hierarchical and Contrastive Learning for Forward-Forward Algorithm", "authors": [{"id": 90164, "fullname": "Jie-En Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/90164?format=json", "institution": "National Tsing Hua University"}, {"id": 189196, "fullname": "Hong-En Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189196?format=json", "institution": "University of Southern California"}, {"id": 189197, "fullname": "C.-C. Jay Kuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189197?format=json", "institution": "University of Southern California"}], "abstract": "Deep neural networks trained with backpropagation have achieved outstanding performance in vision tasks but remain biologically implausible, computationally demanding, and difficult to interpret. The Forward-Forward (FF) algorithm offers a promising alternative by training each layer independently through local goodness objectives. However, its purely local optimization lacks hierarchical coordination across layers, and the decoupling of goodness from features leaves the representations unconstrained and semantically ambiguous. We propose a Hierarchical and Contrastive Learning FF framework (HCL-FF) to address these limitations. HCL-FF introduces (1) a coarse-to-fine hierarchical learning strategy that guides representations from low-level cues to high-level semantics, and (2) a supervised contrastive objective that enforces class-discriminative alignment after goodness decoupling. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that HCL-FF achieves new state-of-the-art performance among FF-based methods, with notable accuracy gains of $+5.46\\%$, $+17.00\\%$, and $+9.53\\%$, respectively.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38168", "url": null, "sourceid": 31866, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38294, "uid": "3aab3d6ff7ff48ac335703d9125ad2a7", "name": "Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning", "authors": [{"id": 189529, "fullname": "Xuchen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189529?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences; Zhongguancun Academy"}, {"id": 189530, "fullname": "Xuzhao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189530?format=json", "institution": "Beijing Institute of Technology"}, {"id": 189531, "fullname": "Shiyu Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189531?format=json", "institution": "Nanyang Technological University"}, {"id": 131797, "fullname": "Kaiqi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131797?format=json", "institution": ", Institute of automation, Chinese academy of science"}], "abstract": "Long-form video reasoning remains a major challenge for Video Large Language Models (Video LLMs), as static uniform frame sampling leads to information dilution and obscures critical evidence. Furthermore, existing pixel-space video reasoning agents, which are designed to actively interact with the video to acquire new visual information, remain suboptimal due to their lack of rigorous reward mechanisms to enforce evidence purity and their inability to perform temporal information supplementation beyond pre-sampled frames. To address this critical gap, we propose a novel evidence-prioritized adaptive framework built upon our core philosophy: \u201cSelect Less, Reason More.\u201d Our core contribution is the evidence-aware reinforcement learning (EARL) framework, which transforms the model into an active interrogator of evidence. EARL is precisely engineered to dynamically select the most relevant frames and, crucially, to perform localized re-sampling around the selected key frames to access fine-grained temporal detail. Extensive experiments on five demanding video reasoning benchmarks demonstrate that our EARL-trained model achieves new state-of-the-art among open-source Video LLMs, simultaneously learning an effective and high-purity visual evidence selection policy. Impressively, our 7B model achieves 59.8\\% on LongVideoBench, 69.0\\% on MVBench and 64.9\\% on VideoMME. These results highlight the importance of prioritizing evidence purity and the effectiveness of our framework.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38294", "url": null, "sourceid": 37230, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36174, "uid": "c0dba5809c620f70942856ad09b144d0", "name": "RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation", "authors": [{"id": 184332, "fullname": "Linfei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184332?format=json", "institution": "Tongji University"}, {"id": 184333, "fullname": "Lin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184333?format=json", "institution": "Tongji University; IEIT SYSTEMS (Beijing) Co., Ltd."}, {"id": 184334, "fullname": "Ying Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184334?format=json", "institution": "Tongji University"}], "abstract": "Visual-language grounding aims to establish semantic correspondences between natural language and visual entities, enabling models to accurately identify and localize target objects based on textual instructions. Existing VLG approaches focus on coarse-grained, object-level localization, while traditional robotic grasping methods rely predominantly on geometric cues and lack language guidance, which limits their applicability in language-driven manipulation scenarios. To address these limitations, we propose the RealVLG framework, which integrates the RealVLG-11B dataset and the RealVLG-R1 model to unify real-world visual-language grounding and grasping tasks. RealVLG-11B dataset provides multi-granularity annotations including bounding boxes, segmentation masks, grasp poses, contact points, and human-verified fine-grained language descriptions, covering approximately 165,000 images, over 800 object instances, 1.3 million segmentation, detection, and language annotations, and roughly 11 billion grasping examples. Building on this dataset, RealVLG-R1 employs Reinforcement Fine-tuning on pretrained large-scale vision-language models to predict bounding boxes, segmentation masks, grasp poses, and contact points in an end-to-end manner from natural language instructions. Experimental results demonstrate that RealVLG supports zero-shot perception and manipulation in real-world unseen environments, establishing a unified semantic-visual multimodal benchmark that provides a comprehensive data and evaluation platform for language-driven robotic perception and grasping policy learning. All data and code will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36174", "url": null, "sourceid": 33859, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39834, "uid": "ab62bfb723e1c716bbb98005b47df372", "name": "DriveVLN: Towards Mapless Vision-and-Language Navigation in Autonomous Driving", "authors": [{"id": 192944, "fullname": "Dongqian Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192944?format=json", "institution": "University of Macau"}, {"id": 192945, "fullname": "Haoran Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/192945?format=json", "institution": "University of Macau"}, {"id": 89411, "fullname": "Wencheng Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/89411?format=json", "institution": "University of Macau"}, {"id": 89897, "fullname": "Runzhou Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/89897?format=json", "institution": "Beijing Institute of Technology"}, {"id": 185570, "fullname": "Zhongying Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185570?format=json", "institution": "ZEEKR"}, {"id": 185571, "fullname": "Jianfei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185571?format=json", "institution": "Shanghai ZEEKR Blue New Energy Technology Co., Ltd."}, {"id": 126757, "fullname": "Jianbing Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126757?format=json", "institution": "University of Macau"}], "abstract": "Autonomous driving has made substantial progress recently, achieving reliable performance in most real-world environments. However, existing algorithms still depend heavily on high-definition maps, making them ineffective in mapless scenarios such as indoor parking lots. These limitations hinder seamless point-to-point navigation and restrict the broader deployment of the autonomous driving system.To address this challenge, we propose DriveVLN, a new task that extends Vision-and-Language Navigation (VLN) to autonomous driving. DriveVLN employs visual and linguistic priors to guide vehicles toward destinations based solely on concise natural-language descriptions, without access to predefined maps or routes. Unlike conventional VLN, which relies on detailed step-wise instructions in indoor environments, DriveVLN requires models to produce navigation information based on diverse visual cues and history, including signs, landmarks, and textual indicators.We further develop a CARLA-based simulation engine comprising over 200 realistic scenes reconstructed from real road scans, enabling large-scale training and closed-loop evaluation. A baseline model is established through supervised fine-tuning on real data, followed by reinforcement learning in simulation.Comprehensive experiments show that DriveVLN effectively bridges map-based and mapless driving, providing a new foundation for unified, language-driven autonomous navigation in complex real-world environments.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39834", "url": null, "sourceid": 42864, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39179, "uid": "463305739b212edcd50d4df28ea21663", "name": "P2GS: Physical Prior-guided Gaussian Splatting for Photometrically Consistent Urban Reconstruction", "authors": [{"id": 92500, "fullname": "Kota Shimomura", "url": "http://cvpr.thecvf.com/api/miniconf/users/92500?format=json", "institution": "Chubu university"}, {"id": 191514, "fullname": "Hidehisa Arai", "url": "http://cvpr.thecvf.com/api/miniconf/users/191514?format=json", "institution": "Turing Inc."}, {"id": 189451, "fullname": "Tsubasa Takahashi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189451?format=json", "institution": "Acompany"}, {"id": 129528, "fullname": "Takayoshi Yamashita", "url": "http://cvpr.thecvf.com/api/miniconf/users/129528?format=json", "institution": "Chubu University"}, {"id": 133682, "fullname": "Hironobu Fujiyioshi", "url": "http://cvpr.thecvf.com/api/miniconf/users/133682?format=json", "institution": "DENSO CORPORATION"}], "abstract": "3D Gaussian Splatting (3DGS) has recently emerged as a powerful explicit representation enabling fast, high-fidelity rendering, making it a promising foundation for closed-loop simulators and perception models in autonomous driving. However, conventional 3DGS implicitly assumes consistent exposure and tone mapping across views. Real driving data violates this assumption due to heterogeneous camera pipelines and dynamic outdoor illumination, baking exposure discrepancies and sensor noise into the radiance field and producing artifacts and inconsistent illumination especially in static backgrounds crucial for realistic simulation. These issues are amplified in autonomous driving, where sparse viewpoints, varying exposures, and outdoor lighting interact, while prior work mainly targets dynamic-object reconstruction and overlooks cross-view photometric consistency.To address this limitation, we introduce $\\textbf{P2GS}$, a physically consistent Gaussian Splatting framework that jointly decomposes a view-invariant linear HDR radiance field, per-view exposure scales, and tone-mapping functions from only LDR images without HDR supervision. P2GS employs a unified optimization strategy grounded in the physical image-formation process, enforcing relative-exposure consistency and HDR-domain radiance regularization. This yields a radiance field robust to inter-camera illumination differences while preserving the real-time efficiency of standard 3DGS. Experiments across real and simulated driving environments show that P2GS matches or surpasses prior methods in LDR reconstruction while providing substantially improved photometric consistency, reliable exposure normalization, and physically coherent illumination across diverse scenes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39179", "url": null, "sourceid": 43758, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37180, "uid": "728c80d8a69ff8c92e3d9a046bece8f8", "name": "DyaDiT: A Multi-Modal Diffusion Transformer for Socially-Aware Dyadic Gesture Generation", "authors": [{"id": 159466, "fullname": "YICHEN PENG", "url": "http://cvpr.thecvf.com/api/miniconf/users/159466?format=json", "institution": "Japan Advanced Institute of Science and Technology, Tokyo Institute of Technology"}, {"id": 159763, "fullname": "Jyun-Ting Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/159763?format=json", "institution": "Carnegie Mellon University"}, {"id": 185885, "fullname": "Siyeol Jung", "url": "http://cvpr.thecvf.com/api/miniconf/users/185885?format=json", "institution": "Ulsan National Institute of Science and Technology"}, {"id": 175386, "fullname": "RUOFAN LIU", "url": "http://cvpr.thecvf.com/api/miniconf/users/175386?format=json", "institution": "Institute of Science Tokyo"}, {"id": 155112, "fullname": "Haiyang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155112?format=json", "institution": "the university of tokyo"}, {"id": 156179, "fullname": "Xuangeng Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156179?format=json", "institution": "The University of Tokyo"}, {"id": 107613, "fullname": "Ruicong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107613?format=json", "institution": "The University of Tokyo"}, {"id": 186863, "fullname": "Erwin Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186863?format=json", "institution": "Institute of Science Tokyo"}, {"id": 186864, "fullname": "Hideki Koike", "url": "http://cvpr.thecvf.com/api/miniconf/users/186864?format=json", "institution": "Institute of Science Tokyo"}, {"id": 88213, "fullname": "Kris Kitani", "url": "http://cvpr.thecvf.com/api/miniconf/users/88213?format=json", "institution": "Carnegie Mellon University"}], "abstract": "Generating realistic conversational gestures are essential for achieving natural, socially engaging interactions with digital humans. However, existing methods typically map a single audio stream to a single speaker\u2019s motion, without considering social context or modeling the mutual dynamics between two people engaging in conversation. We present DyaDiT, a multi-modal diffusion transformer that generates contextually appropriate human motion from dyadic audio signals. Trained on Seamless Interaction Dataset, DyaDiT takes dyadic audio with optional social-context tokens to produce context-appropriate motion. It fuses information from both speakers to capture interaction dynamics, uses a motion dictionary to encode motion priors, and can optionally utilize the conversational partner's gestures to produce more responsive motion. We evaluate DyaDiT on standard motion generation metrics and conduct quantitative user studies, demonstrating that it not only surpasses existing methods on objective metrics but is also strongly preferred by users, highlighting its robustness and socially favorable motion generation. Code and models will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37180", "url": null, "sourceid": 31614, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37030, "uid": "0a2c8d94e057ab72ea4f88b67e0bb40e", "name": "Pose-Free Omnidirectional Gaussian Splatting for 360-Degree Videos with Consistent Depth Priors", "authors": [{"id": 186527, "fullname": "Chuanqing Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186527?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 181744, "fullname": "Xin Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181744?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 186528, "fullname": "Zehui Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186528?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 152747, "fullname": "Zhengda Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152747?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 87313, "fullname": "Yiqun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87313?format=json", "institution": "Chongqing University"}, {"id": 186529, "fullname": "Junqi Diao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186529?format=json", "institution": "Air Force Engineering University"}, {"id": 85727, "fullname": "Jun Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/85727?format=json", "institution": "University of Chinese Academy of Sciences"}], "abstract": "Omnidirectional 3D Gaussian Splatting with panoramas is a key technique for 3D scene representation, and existing methods typically rely on slow SfM to provide camera poses and sparse points priors. In this work, we propose a pose-free omnidirectional 3DGS method, named PFGS360, that reconstructs 3D Gaussians from unposed omnidirectional videos. To achieve accurate camera pose estimation, we first construct a spherical consistency-aware pose estimation module, which recovers poses by establishing consistent 2D\u20133D correspondences between the reconstructed Gaussians and the unposed images using Gaussians' internal depth priors. Besides, to enhance the fidelity of novel view synthesis, we introduce a depth-inlier-aware densification module to extract depth inliers and Gaussian outliers with consistent monocular depth priors, enabling efficient Gaussian densification and achieving photorealistic novel view synthesis. The experiments show significant outperformance over existing pose-free and pose-aware 3DGS methods on both real-world and synthetic 360-degree videos.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37030", "url": null, "sourceid": 39477, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39402, "uid": "499a3f88156bb6321e00d0ff0ab55041", "name": "RL\u2011ScanIQA: Reinforcement-Learned Scanpaths  for Blind 360\u00b0 Image Quality Assessment", "authors": [{"id": 143695, "fullname": "yujia wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/143695?format=json", "institution": "Victoria University of Wellington"}, {"id": 191999, "fullname": "Yuyan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191999?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 192000, "fullname": "Jiuming Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192000?format=json", "institution": "University of Cambridge"}, {"id": 185061, "fullname": "Fang-Lue Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185061?format=json", "institution": "Victoria University of Wellington"}, {"id": 189163, "fullname": "Xinhu Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189163?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 192001, "fullname": "Neil Dodgson", "url": "http://cvpr.thecvf.com/api/miniconf/users/192001?format=json", "institution": "Victoria University of Wellington"}], "abstract": "Blind 360\u00b0 image quality assessment (IQA) aims to predict perceptual quality for panoramic images without a pristine reference. Unlike conventional planar images, 360\u00b0 content in immersive environments restricts viewers to a limited viewport at any moment, making viewing behaviors critical to quality perception. Although existing scanpath-based approaches have attempted to model viewing behaviors by approximating the human view\u2011then\u2011rate paradigm, they treat scanpath generation and quality assessment as separate steps, preventing end-to-end optimization and task-aligned exploration. To address this limitation, we propose RL\u2011ScanIQA, a reinforcement\u2011learned framework for blind 360\u00b0 IQA. RL-ScanIQA optimize a PPO-trained scanpath policy and a quality assessor, where the policy receives quality-driven feedback to learn task-relevant viewing strategies. To improve training stability and prevent mode collapse, we design multi-level rewards, including scanpath diversity and equator-biased priors. We further boost cross\u2011dataset robustness using distortion\u2011space augmentation together with rank\u2011consistent losses that preserve intra\u2011image and inter\u2011image quality orderings. Extensive experiments on three benchmarks show that RL\u2011ScanIQA achieves superior in\u2011dataset performance and cross\u2011dataset generalization. Code will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39402", "url": null, "sourceid": 39616, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36963, "uid": "df0e19d29493ef2136fc3e2fc029c054", "name": "FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation", "authors": [{"id": 186321, "fullname": "YIYI CAI", "url": "http://cvpr.thecvf.com/api/miniconf/users/186321?format=json", "institution": "Shanda AI Research Tokyo"}, {"id": 186322, "fullname": "Yuhan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186322?format=json", "institution": "The University of Tokyo"}, {"id": 186323, "fullname": "Kunhang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186323?format=json", "institution": "The University of Tokyo"}, {"id": 186324, "fullname": "YOU ZHOU", "url": "http://cvpr.thecvf.com/api/miniconf/users/186324?format=json", "institution": "Shanda Group Co., Ltd."}, {"id": 186325, "fullname": "Bo Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186325?format=json", "institution": "Shanda Group Corp."}, {"id": 155112, "fullname": "Haiyang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155112?format=json", "institution": "the university of tokyo"}], "abstract": "We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency.Unlike existing methods that rely on chunk-by-chunk or auto-regressive model with diffusion head, we adopt a diffusion forcing framework to model this time-series generation task under time-varying control events.We find that a straightforward implementation of vanilla diffusion forcing (as proposed for video models) fails to model real motion distributions. We demonstrate that to guarantee modeling the output distribution, the vanilla diffusion forcing must be tailored to: (i) train with a bi-directional attention instead of casual attention; (ii) implement a lower triangular time scheduler instead of a random one; (iii) utilize a continues time-varying way to introduce text conditioning.With these improvements, we demonstrate in the first time that the diffusion forcing-based framework achieves state-of-the-art performance on the streaming motion generation task, reaching an FID of 0.057 on the HumanML3D benchmark. Models, code, and weights are available.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36963", "url": null, "sourceid": 35145, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38041, "uid": "6f70611ba289fba67a59ee773cd3df3c", "name": "LaSM: Layer-wise Scaling Mechanism for Defending Pop-up Attack on GUI Agents", "authors": [{"id": 181764, "fullname": "Zihe Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181764?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 106927, "fullname": "Zhuosheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/106927?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 188904, "fullname": "Jiaping Gui", "url": "http://cvpr.thecvf.com/api/miniconf/users/188904?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 178500, "fullname": "Gongshen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/178500?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Graphical user interface (GUI) agents built on multimodal large language models (MLLMs) have recently demonstrated strong decision-making abilities in screen-based interaction tasks. However, they remain highly vulnerable to pop-up-based environmental injection attacks, where malicious visual elements divert model attention and lead to unsafe or incorrect actions. Existing defense methods either require costly retraining or perform poorly under inductive interference. In this work, we systematically study how such attacks alter the attention behavior of GUI agents and uncover a layer-wise attention divergence pattern between correct and incorrect outputs. Based on this insight, we propose LaSM, a Layer-wise Scaling Mechanism that selectively amplifies attention and MLP modules in critical layers. LaSM improves the alignment between model saliency and task-relevant regions without additional training. Extensive experiments across multiple datasets demonstrate that our method significantly improves the defense success rate and exhibits strong robustness, while having negligible impact on the model's general capabilities. Our findings reveal that attention misalignment is a core vulnerability in MLLM agents and can be effectively addressed through selective layer-wise modulation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38041", "url": null, "sourceid": 39718, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36427, "uid": "0e201cedf37601b3d133df35bd9db100", "name": "Unifying Language-Action  Understanding and Generation for Autonomous Driving", "authors": [{"id": 182492, "fullname": "Xinyang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182492?format=json", "institution": "Zhejiang University"}, {"id": 185014, "fullname": "Qian Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185014?format=json", "institution": "Li Auto Inc."}, {"id": 185015, "fullname": "WENJIE DING", "url": "http://cvpr.thecvf.com/api/miniconf/users/185015?format=json", "institution": null}, {"id": 185016, "fullname": "Zhao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185016?format=json", "institution": "Li Auto Inc."}, {"id": 185017, "fullname": "Wei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185017?format=json", "institution": "Li Auto"}, {"id": 185018, "fullname": "Chang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185018?format=json", "institution": null}, {"id": 185019, "fullname": "Bailin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185019?format=json", "institution": "Li Auto"}, {"id": 153218, "fullname": "Kun Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153218?format=json", "institution": "LiAuto"}, {"id": 153220, "fullname": "XianPeng Lang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153220?format=json", "institution": "LiAuto"}, {"id": 131775, "fullname": "Wei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/131775?format=json", "institution": "State key laboratory of CAD&amp;CG"}], "abstract": "Vision-Language-Action (VLA) models are emerging as a promising paradigm for end-to-end autonomous driving, valued for their potential to leverage world knowledge and reason about complex driving scenes. However, existing methods suffer from two critical limitations: a persistent misalignment between language instructions and action outputs, and the inherent inefficiency of typical auto-regressive action generation. In this paper, we introduce LinkVLA, a novel architecture that directly addresses these challenges to enhance both alignment and efficiency. First, we establish a structural link by unifying language and action tokens into a shared discrete codebook, processed within a single multi-modal model. This structurally enforces cross-modal consistency from the ground up. Second, to create a deep semantic link, we introduce an auxiliary action understanding objective that trains the model to generate descriptive captions from trajectories, fostering a bidirectional language\u2013action mapping. Finally, we replace the slow, step-by-step generation with a two-step coarse-to-fine generation method (C2F) that efficiently decodes the action sequence, saving 86% inference time. Experiments on closed-loop driving benchmarks show consistent gains in instruction following accuracy and driving performance, alongside reduced inference latency.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36427", "url": null, "sourceid": 32831, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38402, "uid": "e8c3ffce73ea28b5d4874789c5828145", "name": "Focus on Background: Exploring SAM's Potential in Few-Shot Medical Image Segmentation with Background-Centric Prompting", "authors": [{"id": 189799, "fullname": "Yuntian Bo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189799?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 71240, "fullname": "Yazhou Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/71240?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 86028, "fullname": "Piotr Koniusz", "url": "http://cvpr.thecvf.com/api/miniconf/users/86028?format=json", "institution": "Data61/CSIRO + Australian National University"}, {"id": 181383, "fullname": "Haofeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181383?format=json", "institution": "Nanjing University of Science and Technology"}], "abstract": "Conventional few-shot medical image segmentation (FSMIS) approaches face performance bottlenecks that hinder broader clinical applicability. Although the Segment Anything Model (SAM) exhibits strong category-agnostic segmentation capabilities, its direct application to medical images often leads to over-segmentation due to ambiguous anatomical boundaries. In this paper, we reformulate SAM-based FSMIS as a prompt localization task and propose FoB (Focus on Background), a background-centric prompt generator that provides accurate background prompts to constrain SAM\u2019s over-segmentation. Specifically, FoB bridges the gap between segmentation and prompt localization by category-agnostically generating support background prompts and localizing them directly in the query image. To address the challenge of prompt localization for novel categories, FoB models rich contextual information to capture foreground-background spatial dependencies. Moreover, inspired by the inherent structural patterns of background prompts in medical images, FoB models this structure as a constraint to progressively refine background prompt predictions. Experiments on three diverse medical image datasets demonstrate that FoB outperforms other baselines by large margins, achieves state-of-the-art performance on FSMIS, and further exhibits strong cross-domain generalization. All code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38402", "url": null, "sourceid": 34642, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38181, "uid": "6a0f5ef283b2008eeff6756343f8810c", "name": "Socratic-Geo: Synthetic Data Generation and Cross-Modal Geometric Reasoning via Multi-Agent Interaction", "authors": [{"id": 189226, "fullname": "Zhengbo Jiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189226?format=json", "institution": "Shanghai University of Finance and Economics"}, {"id": 189227, "fullname": "Zifan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189227?format=json", "institution": "Alibaba Group"}, {"id": 151933, "fullname": "Shaobo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151933?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 189228, "fullname": "Wei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189228?format=json", "institution": "Alibaba Group"}, {"id": 189229, "fullname": "Bing Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189229?format=json", "institution": "Alibaba Group"}, {"id": 189230, "fullname": "hu wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/189230?format=json", "institution": "Alibaba Group"}, {"id": 87643, "fullname": "Linfeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87643?format=json", "institution": ", Tsinghua University"}], "abstract": "Multimodal Large Language Models (MLLMs) have significantly advanced vision\u2013language understanding. However, even state-of-the-art models struggle with geometric reasoning, revealing a critical bottleneck: the extreme scarcity of high-quality image\u2013text pairs. Human annotation is prohibitively expensive, while automated methods fail to ensure both fidelity and training effectiveness. Existing approaches either passively adapt to available images or employ inefficient random exploration with post-hoc filtering, decoupling generation from learning needs.We propose Socratic-Geo, a fully autonomous framework that dynamically couples data synthesis with model learning through multi-agent interaction. The Teacher agent generates parameterized Python scripts with reflective feedback (Reflect for solvability, RePI for visual validity), ensuring image\u2013text pair purity. The Solver agent optimizes reasoning through preference learning, with failure paths guiding the Teacher\u2019s targeted augmentation. Independently, the Generator learns image generation capabilities on accumulated \u201cimage\u2013code\u2013instruction\u201d triplets, distilling programmatic drawing intelligence into visual generation.Starting from only 108 seed problems, Socratic-Solver achieves 49.11 on six geometric benchmarks using one-quarter of baseline data, surpassing strong baselines by 2.43 points. Socratic-Generator achieves 42.4% on GenExam, establishing new state-of-the-art for open-source models, surpassing Seedream-4.0 (39.8%) and approaching Gemini-2.5-Flash-Image (43.1%).", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38181", "url": null, "sourceid": 38351, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38802, "uid": "c920b75709d49179c0ba5a02c9839a46", "name": "OneThinker: All-in-one Reasoning Model for Image and Video", "authors": [{"id": 126995, "fullname": "Kaituo Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/126995?format=json", "institution": "Beijing Institute of Technology"}, {"id": 185043, "fullname": "Manyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185043?format=json", "institution": null}, {"id": 155705, "fullname": "Hongyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155705?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 190711, "fullname": "Kaixuan Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190711?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 173056, "fullname": "shuang chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/173056?format=json", "institution": "Zhejiang University"}, {"id": 159912, "fullname": "Yilei Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/159912?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 190712, "fullname": "Dian Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190712?format=json", "institution": "The Chinese University of  Hong Kong"}, {"id": 190713, "fullname": "Peiwen Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/190713?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 107255, "fullname": "Yiyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107255?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 190714, "fullname": "Haoze Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/190714?format=json", "institution": null}, {"id": 185047, "fullname": "Yan Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185047?format=json", "institution": "Meituan"}, {"id": 185048, "fullname": "Peng Pei", "url": "http://cvpr.thecvf.com/api/miniconf/users/185048?format=json", "institution": "Meituan"}, {"id": 185049, "fullname": "Xunliang Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/185049?format=json", "institution": "Meituan"}, {"id": 95127, "fullname": "Xiangyu Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/95127?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This results in limited scalability toward a multimodal reasoning generalist, which restricts practical versatility and hinders potential knowledge sharing across tasks and modalities.To this end, we propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the OneThinker-600k training corpus covering all these tasks and employ commercial models for CoT annotation, resulting in OneThinker-SFT-340k for SFT cold start.Moreover, we propose EMA-GRPO to handle reward heterogeneity in multi-task RL by tracking task-wise moving averages of reward standard deviations for balanced optimization. Extensive experiments on diverse visual benchmarks show that OneThinker delivers strong performance on 31 benchmarks, across 10 fundamental visual understanding tasks.Moreover, it exhibits effective knowledge transfer between certain tasks and preliminary zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist. All code, model, and data will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38802", "url": null, "sourceid": 37571, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37424, "uid": "2d2633f2f6a184c9733f83e50dedbc7d", "name": "MedLIME: A Distribution-Aligned and Evidence-Supported Framework for Medical Saliency Explanations", "authors": [{"id": 187422, "fullname": "Raghav Magazine", "url": "http://cvpr.thecvf.com/api/miniconf/users/187422?format=json", "institution": "Microsoft"}, {"id": 152409, "fullname": "Xingjian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152409?format=json", "institution": "Carnegie Mellon University"}, {"id": 92284, "fullname": "Min Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/92284?format=json", "institution": "Carnegie Mellon University"}], "abstract": "Saliency-based explainability methods are widely used to interpret deep learning models in medical imaging, yet many existing approaches rely on white box access of models, which is not always possible due to privacy concerns. In this work, we introduce **MedLIME**, a novel, model-agnostic explanation framework designed to enhance the robustness and fidelity of saliency maps for medical imaging abnormality localization. Building upon the Local Interpretable Model-agnostic Explanations (LIME) paradigm, MedLIME integrates three key components: (1) **Generative Masking** (GM), (2) **Supervised Test-Time Adaptation** (STT) and (3) a **Evidence-based Regularization** (EBR) to improve the saliency map generation accuracy of LIME. Extensive experiments on multiple medical datasets, across three model architectures demonstrate that MedLIME consistently outperforms gradient-based and perturbation-based baselines in abnormality localization as measured by AUPRC. Our results highlight that incorporating generative reconstruction, adaptive perturbation and data-driven regularization improves the reliability and interpretability of medical imaging models.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37424", "url": null, "sourceid": 41946, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39069, "uid": "ecf634864fe4ff241756b47582c09582", "name": "Chain of World: World Model Thinking in Latent Motion", "authors": [{"id": 191295, "fullname": "Fuxiang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191295?format=json", "institution": "Harbin Institute of Technology"}, {"id": 153284, "fullname": "Donglin Di", "url": "http://cvpr.thecvf.com/api/miniconf/users/153284?format=json", "institution": "Harbin Institute of Technology"}, {"id": 158474, "fullname": "Lulu Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158474?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 191296, "fullname": "Xuancheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191296?format=json", "institution": "Li Auto Inc."}, {"id": 152449, "fullname": "Lei Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/152449?format=json", "institution": "University of New South Wales"}, {"id": 187196, "fullname": "Hao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187196?format=json", "institution": "Li Auto Inc."}, {"id": 187197, "fullname": "Chen Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/187197?format=json", "institution": "Li Auto Inc."}, {"id": 189477, "fullname": "Tonghua Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/189477?format=json", "institution": "Harbin Institute of Technology"}, {"id": 191297, "fullname": "Baorui Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/191297?format=json", "institution": "Beijing Academy of Artificial Intelligence"}], "abstract": "Vision-Language-Action (VLA) models are a promising path toward embodied intelligence, yet they often overlook the predictive and temporal-causal structure underlying visual dynamics.World-model VLAs address this by predicting future frames, but waste capacity reconstructing redundant backgrounds.Latent-action VLAs encode frame-to-frame transitions compactly, but lack temporally continuous dynamic modeling and world knowledge.To overcome these limitations, we introduce CoWVLA (Chain-of-World VLA), a new \"Chain of World\" paradigm that unifies world-model temporal reasoning with a disentangled latent motion representation. First, a pretrained video VAE serves as a latent motion extractor, explicitly factorizing video segments into structure and motion latents. Then, during pre-training, the VLA learns from an instruction and an initial frame to infer a continuous latent motion chain and predict the segment's terminal frame. Finally, during co-fine-tuning, this latent dynamic is aligned with discrete action prediction by jointly modeling sparse keyframes and action sequences in a unified autoregressive decoder.This design preserves the world-model benefits of temporal reasoning and world knowledge while retaining the compactness and interpretability of latent actions, enabling efficient visuomotor learning.Extensive experiments on robotic simulation benchmarks show that CoWVLA outperforms existing world-model and latent-action approaches and achieves moderate computational efficiency, highlighting its potential as a more effective VLA pretraining paradigm.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39069", "url": null, "sourceid": 33430, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39781, "uid": "89b0bf9a915d78c4d84bb40a137ac250", "name": "BEV-CAR: Enhancing Monocular Bird\u2019s Eye View Segmentation with Context-Aware Rasterization", "authors": [{"id": 192835, "fullname": "Yixin Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/192835?format=json", "institution": "Chongqing University"}, {"id": 192836, "fullname": "Ke Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192836?format=json", "institution": "Chongqing University"}, {"id": 192837, "fullname": "Tongtong Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192837?format=json", "institution": "Chongqing University"}, {"id": 192838, "fullname": "Chunhui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192838?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 192839, "fullname": "Kai Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192839?format=json", "institution": "Chongqing University"}], "abstract": "Bird\u2019s Eye View (BEV) semantic segmentation is essential for autonomous driving and mobile robotics, yet it still faces significant challenges on accurate segmentation of foreground object and efficient estimating of layout categories obscured by objects. To address these issues, we propose BEV-CAR, a Context-Aware Rasterization method that rasterizes the BEV representation without any coordinate transformations. By optimising each ray and incorporating depth features, BEV-CAR effectively addresses the challenges posed by object occlusions and varying environmental conditions. It ensures robust performance across diverse scenarios, particularly improving the accuracy of foreground object segmentation and layout estimation in occluded areas. And extensive experiments on the nuScenes and Argoverse datasets demonstrate that BEV-CAR achieves state-of-the-art (SOTA) performance. More importantly, the rasterization technique in this paper does not introduce additional computational overhead during the inference process, making it suitable for practical deployment in real-world scenarios. Code and technical appendix are available in supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39781", "url": null, "sourceid": 35773, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37330, "uid": "9c9b74ef3f74c104388a1b4f1e09183a", "name": "NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity", "authors": [{"id": 180531, "fullname": "Weijian Mai", "url": "http://cvpr.thecvf.com/api/miniconf/users/180531?format=json", "institution": "The University of Hong Kong"}, {"id": 181350, "fullname": "Mu Nan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181350?format=json", "institution": "University of Hong Kong"}, {"id": 187175, "fullname": "Yu Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187175?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 187176, "fullname": "Jiahang Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187176?format=json", "institution": "University of Hong Kong"}, {"id": 187177, "fullname": "Rui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187177?format=json", "institution": "University of Hong Kong"}, {"id": 187178, "fullname": "Yuqin Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/187178?format=json", "institution": "Tsinghua University"}, {"id": 152969, "fullname": "Chunfeng Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/152969?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 158393, "fullname": "Andrew Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/158393?format=json", "institution": "University of Hong Kong"}, {"id": 152964, "fullname": "Jiamin Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152964?format=json", "institution": "The Chinese University of Hong Kong, Shanghai AI Laboratory"}], "abstract": "Visual encoding and decoding models act as gateways to understanding the neural mechanisms underlying human visual perception. Typically, visual encoding models that predict brain activity from stimuli and decoding models that reproduce stimuli from brain activity are treated as distinct tasks, requiring separate models and training procedures. This separation is inefficient and fails to model the consistency between encoding and decoding processes. To address this limitation, we propose NeuroFlow, the first unified framework that jointly models visual encoding and decoding from neural activity within a single flow model. NeuroFlow introduces two key components: (i) NeuroVAE is designed as a variational backbone to model neural variability and establish a compact, semantically structured latent space for bidirectional modeling across visual and neural modalities. (ii) Cross-modal Flow Matching (XFM) bypasses the typical paradigm of noise-to-data diffusion guided by a specific modality condition, instead learning a reversibly consistent flow model between visual and neural latent distributions. For the first time, visual encoding and decoding are reformulated as a time-dependent, reversible process within a shared latent space for unified modeling. Empirical results demonstrate that NeuroFlow achieves superior overall performance in visual encoding and decoding tasks with higher computational efficiency compared to any isolated methods. We further analyze principal factors that steer the model toward encoding\u2013decoding consistency and, through brain functional analyses, demonstrate that NeuroFlow captures consistent activation patterns underlying neural variability. NeuroFlow marks a major step toward unified visual encoding and decoding from neural activity, providing mechanistic insights that inform future bidirectional visual brain\u2013computer interfaces. Code will be released to facilitate future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37330", "url": null, "sourceid": 39675, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37122, "uid": "c64c545aff0d17ad713c907fdada37d1", "name": "R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space", "authors": [{"id": 164361, "fullname": "Tin Stribor Sohn", "url": "http://cvpr.thecvf.com/api/miniconf/users/164361?format=json", "institution": "Dr. Ing. h.c. Porsche AG"}, {"id": 164362, "fullname": "Maximilian Dillitzer", "url": "http://cvpr.thecvf.com/api/miniconf/users/164362?format=json", "institution": "Hochschule Esslingen"}, {"id": 95802, "fullname": "Jason Corso", "url": "http://cvpr.thecvf.com/api/miniconf/users/95802?format=json", "institution": "Voxel51; University of Michigan"}, {"id": 186718, "fullname": "Eric Sax", "url": "http://cvpr.thecvf.com/api/miniconf/users/186718?format=json", "institution": "Karlsruher Institut f\u00fcr Technologie"}], "abstract": "Humans perceive and reason about their surroundings in four dimensions-three spatial and one temporal axis-by building persistent, structured internal representations that encode semantic meaning, spatial layout, and temporal dynamics. These multimodal memories enable them to recall past events, infer unobserved states, and integrate new information into context-dependent reasoning. Inspired by this capability, we introduce R4, a training-free framework for retrieval-augmented reasoning in 4D spatio-temporal space that equips vision-language models (VLMs) with structured, lifelong memory. R4 continuously constructs a 4D knowledge database by anchoring object-level semantic descriptions in metric space and time, yielding a persistent world model that can be shared across agents. At inference, natural language queries are decomposed into semantic, spatial, and temporal keys to retrieve relevant observations, which are integrated into the VLM's reasoning through an iterative retrieval-reasoning loop. Unlike classical retrieval-augmented generation methods, retrieval in R4 operates directly in structured 4D space, enabling episodic and collaborative reasoning without training. Experiments on embodied question answering and navigation benchmarks demonstrate that R4 substantially improves retrieval and reasoning over spatio-temporal information compared to baselines, advancing a new paradigm for embodied 4D reasoning in dynamic environments.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37122", "url": null, "sourceid": 32991, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39890, "uid": "0e7e8bf3c8755a3fd3922bba39aa51d8", "name": "SHands: A Multi-View Dataset and Benchmark for Surgical Hand-Gesture and Error Recognition Toward Medical Training", "authors": [{"id": 193063, "fullname": "Le Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/193063?format=json", "institution": "University of Geneva"}, {"id": 193064, "fullname": "Thiago Santos", "url": "http://cvpr.thecvf.com/api/miniconf/users/193064?format=json", "institution": "MIRALab, University of Geneva"}, {"id": 193065, "fullname": "Nadia Thalmann", "url": "http://cvpr.thecvf.com/api/miniconf/users/193065?format=json", "institution": "University of Geneva"}, {"id": 193066, "fullname": "Katarzyna Wac", "url": "http://cvpr.thecvf.com/api/miniconf/users/193066?format=json", "institution": "Quality of Life Lab, University of Geneva, Switzerland"}], "abstract": "In surgical training for medical students, proficiency development relies on expert-led skill assessment, which is costly, time-limited, difficult to scale, and its expertise remains confined to institutions with available specialists. Automated AI-based assessment offers a viable alternative, but progress is constrained by the lack of datasets containing realistic trainee errors and the multi-view variability needed to train robust computer vision approaches. To address this gap, we present Surgical-Hands (SHands), a large-scale multi-view video dataset for surgical hand-gesture and error recognition for medical training. SHands captures linear incision and suturing using five RGB cameras from complementary viewpoints, performed by 52 participants (20 experts and 32 trainees) each completing three standardized trials per procedure. The videos are annotated at the frame level with 15 gesture primitives and include a validated taxonomy of 8 trainee error types, enabling both gesture recognition and error detection. We further define standardized evaluation protocols for single-view, multi-view, and cross-view generalization, and benchmark state-of-the-art deep learning models on the dataset. SHands will be publicly released to support the development of robust and scalable AI systems for surgical training grounded in clinically curated domain knowledge.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39890", "url": null, "sourceid": 35617, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37585, "uid": "fa80153c38c0265b4dbaaaabc54f7485", "name": "Good Can Sometimes be Bad: A Unified Attack against 3D Point Cloud Classifier by a Flexible Isotropic Resampling", "authors": [{"id": 180468, "fullname": "linkun fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/180468?format=json", "institution": "Henan University"}, {"id": 174339, "fullname": "Jiahao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174339?format=json", "institution": "\u6cb3\u5357\u5927\u5b66"}, {"id": 187773, "fullname": "JTzhang JTzhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187773?format=json", "institution": "Henan Univeristy"}, {"id": 187774, "fullname": "Lei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187774?format=json", "institution": "Henan Univeristy"}, {"id": 145848, "fullname": "Fazhi He", "url": "http://cvpr.thecvf.com/api/miniconf/users/145848?format=json", "institution": "Wuhan University"}, {"id": 187775, "fullname": "Daojun Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/187775?format=json", "institution": null}], "abstract": "To ensure the robustness of 3D point cloud Deep Neural Network(3D DNN), 3D adversarial attack targeting the inference stage and backdoor attack targeting the training stage are well studied. The success of both attacks usually requires a specified permissions that attacker must have. However, the obtainable permissions are uncertain due to the deployment environment changes in practical scenarios. This renders existing separately designed adversarial attack or backdoor attack ineffective. To solve this issue, this paper proposes a unified attack that can adapt to both 3D point cloud backdoor attack and adversarial attack, named UAtt3D. Furthermore, by observing existing attacks, their way to promise attack stealthiness is to limit the undesirable perturbation. This strategy requires moving the point position as little as possible, which restricts the attack intensity and is not suitable for our unified attack. Meanwhile, this strategy will inevitably cause a quality decrease on 3D point cloud due to the remaining malicious perturbation. Therefore, our UAtt3D explores a new avenue to guarantee attack stealthiness which improves the quality of attacked 3D point cloud rather than decreasing it. In detail, to simultaneously consider feature movement of adversarial attack and backdoor feature learning of backdoor attack, a flexible isotropic resampling is designed. It realigns the position of most points based on surface approximation and rays sampling. By fine tuning the resampled point cloud, adversarial point cloud and backdoored point cloud are obtained. Several experiments suggest that the proposed UAtt3D achieves outstanding stealthiness comparing with existing adversarial attacks and backdoor attacks from the subjective and objective perspective. Meanwhile, its attack efficiency is competitive.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37585", "url": null, "sourceid": 43435, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38915, "uid": "7b27ab2fbcbe3b67935da0694742ed0e", "name": "DPL: Decoupled Prototype Learning for Enhancing Robustness of Vision\u2013Language Transformers to Missing Modalities", "authors": [{"id": 180403, "fullname": "Jueqing Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180403?format=json", "institution": "Monash University"}, {"id": 190953, "fullname": "Yuanyuan Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/190953?format=json", "institution": "Monash University"}, {"id": 190954, "fullname": "Xiaohao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190954?format=json", "institution": "Monash University"}, {"id": 156883, "fullname": "Shuaicheng Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156883?format=json", "institution": "Nanyang Technological University"}, {"id": 170533, "fullname": "Fucai Ke", "url": "http://cvpr.thecvf.com/api/miniconf/users/170533?format=json", "institution": "Monash University"}, {"id": 190955, "fullname": "Shujie Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190955?format=json", "institution": "Monash University"}, {"id": 178847, "fullname": "Wei Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/178847?format=json", "institution": "Oracle"}, {"id": 190956, "fullname": "Jionghao Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190956?format=json", "institution": "University of Hong Kong"}, {"id": 190957, "fullname": "Wray Buntine", "url": "http://cvpr.thecvf.com/api/miniconf/users/190957?format=json", "institution": "VinUniversity &amp; Monash University"}, {"id": 75933, "fullname": "Hamid Rezatofighi", "url": "http://cvpr.thecvf.com/api/miniconf/users/75933?format=json", "institution": "Monash University"}, {"id": 158410, "fullname": "Lan Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/158410?format=json", "institution": "Monash University"}], "abstract": "The performance of Vision\u2013Language Transformers drops sharply when an input modality (e.g., image) is missing, because the model is forced to make predictions using incomplete information. Existing missing-aware prompt methods help reduce this degradation, but they still rely on conventional prediction heads (e.g., a Fully-Connected layer) that compute class scores in the same way regardless of which modality is present or absent. We introduce Decoupled Prototype Learning (DPL), a new prediction head architecture that explicitly adjusts its decision process to the observed input modalities. For each class, DPL selects a set of prototypes specific to the current missing-modality cases (image-missing, text-missing, or mixed-missing). Each prototype is then decomposed into image-specific and text-specific components, enabling the head to make decisions that depend on the information actually present. This adaptive design allows DPL to handle inputs with missing modalities more effectively while remaining fully compatible with existing prompt-based frameworks. Extensive experiments on MM-IMDb, UPMC Food-101, and Hateful Memes demonstrate that DPL outperforms state-of-the-art approaches across all widely used multimodal image\u2013text datasets and various missing cases.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38915", "url": null, "sourceid": 40839, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36829, "uid": "1ba2e3e63336e31e2474cac0fd74bb40", "name": "PlanaReLoc: Camera Relocalization in 3D Planar Primitives via Region-based Structure Matching", "authors": [{"id": 153358, "fullname": "Hanqiao Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/153358?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 98504, "fullname": "Yuzhou Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/98504?format=json", "institution": "Institute of automation, Chinese academy of science"}, {"id": 185975, "fullname": "Yangdong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185975?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 129136, "fullname": "Shuhan Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/129136?format=json", "institution": "Institute of automation, Chinese academy of science"}], "abstract": "While structure-based relocalizers have long strived for *point* correspondences when establish or regress query-map associations, in this paper, we pioneer the use of **planar primitives** and planar 3D maps for lightweight 6-DoF camera relocalization in structured environments.Planar primitives, beyond being fundamental entities in projective geometry, also serve as region-based representations that encapsulate both structural and semantic richness.This motivates us to introduce *PlanaReLoc*, a streamlined \"plane-centric\" paradigm where a deep matcher associates planar primitives across the query image and the map within a learned unified embedding space, after which the 6-DoF pose is solved and refined under a robust framework.Through extensive experiments on the *ScanNet* and *12Scenes* datasets across hundreds of scenes, our method demonstrates the superiority of planar primitives in facilitating reliable cross-modality structural correspondences and achieving effective camera relocalization without requiring realistically textured/colored maps, pose priors, or per-scene training.The code and data will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36829", "url": null, "sourceid": 32470, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37741, "uid": "d4d9dd228996e12e46d286639eccd3e1", "name": "Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep", "authors": [{"id": 188146, "fullname": "Tianyi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188146?format=json", "institution": "Nanyang Technological University"}, {"id": 188147, "fullname": "Ye Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188147?format=json", "institution": "Nanyang Technological University"}, {"id": 87643, "fullname": "Linfeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87643?format=json", "institution": ", Tsinghua University"}, {"id": 188148, "fullname": "Chen Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/188148?format=json", "institution": "Nanyang Technological University"}, {"id": 188149, "fullname": "Jianjun Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188149?format=json", "institution": "Nanyang Technological University"}, {"id": 88428, "fullname": "Yi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88428?format=json", "institution": "Nanyang Technological University"}, {"id": 73234, "fullname": "Kim-Hui Yap", "url": "http://cvpr.thecvf.com/api/miniconf/users/73234?format=json", "institution": "Nanyang Technological University"}, {"id": 88437, "fullname": "Lap-Pui Chau", "url": "http://cvpr.thecvf.com/api/miniconf/users/88437?format=json", "institution": "The Hong Kong Polytechnic University"}], "abstract": "Diffusion-based video editing has emerged as an important paradigm for high-quality and flexible content generation. However, despite their generality and strong modeling capacity, diffusion transformers remain computationally expensive due to the iterative denoising process, posing challenges for practical deployment. Existing video diffusion acceleration methods primarily exploit denoising timestep-level feature reuse, which mitigates the redundancy in denoising process, but overlooks the architectural redundancy within the Diffusion Transformer (DiT) itself that many attention operations over spatio-temporal tokens are redundantly executed, offering little to no incremental contribution to the model\u2019s output.This work introduces HetCache, a training-free diffusion acceleration framework designed to exploit the inherent heterogeneity in diffusion transformers and video editing tasks. Instead of uniformly reuse or randomly sampling tokens, HetCache assesses the contextual relevance and interaction strength among various types of tokens in designated computing steps. Guided by spatial priors, it divides the spatial-temporal tokens in DiT model into context and generative tokens, and selectively caches the context tokens that exhibit the strongest correlation and most representative semantics with generative ones. This strategy effectively reduces redundant attention operations while maintaining editing consistency and fidelity. Experiments show that HetCache achieves noticeable acceleration includes 2.67$\\times$ latency speedup and noticeable FLOPs reduction over commonly used foundation models with negligible degradation in editing quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37741", "url": null, "sourceid": 31824, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36564, "uid": "ee36a2060ec0721650bf82c39619ab88", "name": "Hidden Monotonicity: Explaining Deep Neural Networks via their DC Decomposition", "authors": [{"id": 185363, "fullname": "Jakob Zimmermann", "url": "http://cvpr.thecvf.com/api/miniconf/users/185363?format=json", "institution": "Freie Universit\u00e4t Berlin; Technische Universit\u00e4t Berlin; Fraunhofer HHI"}, {"id": 185364, "fullname": "Georg Loho", "url": "http://cvpr.thecvf.com/api/miniconf/users/185364?format=json", "institution": "Freie Universit\u00e4t Berlin; University of Twente"}], "abstract": "It has been demonstrated in various contexts that monotonicity leads to better explainability in neural networks. However, not every function can be well approximated by a monotone neural network.We demonstrate that monotonicity can still be used in two ways to boost explainability. First, we use an adaptation of the decomposition of a trained ReLU network into two monotone and convex parts, thereby overcoming numerical obstacles from an inherent blowup of the weights in this procedure. Our proposed saliency methods -- SplitCAM and SplitLRP --improve onstate of the art results on both VGG16 and Resnet18 networks on ImageNet-S across all Quantus saliency metric categories.Second, we exhibit that training a model as the difference between two monotone neural networks results in a system with strong self-explainability properties.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36564", "url": null, "sourceid": 31575, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38565, "uid": "a0c6cd885e0a0bd2ebfb8710f10c6bc8", "name": "The Power of Decaying Steps: Enhancing Attack Stability and Transferability for Sign-based Optimizers", "authors": [{"id": 190158, "fullname": "Wei Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190158?format=json", "institution": "National University of Defense Technology"}, {"id": 183813, "fullname": "Yang Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/183813?format=json", "institution": "National University of Defense Technology"}, {"id": 190159, "fullname": "Jincai Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190159?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 190160, "fullname": "Qing Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190160?format=json", "institution": "Hefei Institution of Technology"}], "abstract": "Crafting adversarial examples can be formulated as an optimization problem. While sign-based optimizers such as I-FGSM and MI-FGSM have become the de facto standard for the induced optimization problems, there still exist several unsolved problems in theoretical grounding and practical reliability especially in non-convergence and instability, which inevitably influences their transferability. Contrary to the expectation, we observe that the attack success rate may degrade sharply when more number of iterations are conducted. In this paper, we address these issues from an optimization perspective. By reformulating the sign-based optimizer as a specific coordinate-wise gradient descent, we argue that one cause for non-convergence and instability is their non-decaying step-size scheduling. Based upon this viewpoint, we propose a series of new attack algorithms that enforce Monotonically Decreasing Coordinate-wise Step-sizes (MDCS) within sign-based optimizers. Typically, we further provide theoretical guarantees proving that MDCS-MI attains an optimal convergence rate of $O(1/\\sqrt{T})$, where $T$ is the number of iterations. Extensive experiments on image classification and cross-modal retrieval tasks demonstrate that our approach not only significantly improves transferability but also enhances attack stability compared to state-of-the-art sign-based methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38565", "url": null, "sourceid": 43496, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40295, "uid": "6e465bf376c95a9f4a5f56bd005455f6", "name": "Customized Fusion: A Closed-Loop Dynamic Network for Adaptive Multi-Task-Aware Infrared-Visible Image Fusion", "authors": [{"id": 180286, "fullname": "Zengyi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180286?format=json", "institution": "Hefei University of Technology"}, {"id": 187013, "fullname": "Yu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187013?format=json", "institution": "Hefei University of Technology"}, {"id": 187262, "fullname": "Juan Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187262?format=json", "institution": "Hefei University of Technology"}, {"id": 187263, "fullname": "Zhiqin Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187263?format=json", "institution": "Chongqing University of Post and Telecommunications"}, {"id": 127590, "fullname": "Yafei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127590?format=json", "institution": "Kunmimg University of Science and Technology"}, {"id": 127563, "fullname": "Huafeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/127563?format=json", "institution": "Kunmimg University of Science and Technology"}], "abstract": "Infrared-visible image fusion aims to integrate complementary information for robust visual understanding, but existing fusion methods struggle with simultaneously adapting to multiple downstream tasks. To address this issue, we propose a Closed-Loop Dynamic Network (CLDyN) that can adaptively respond to the semantic requirements of diverse downstream tasks for task-customized image fusion. Specifically, CLDyN introduces a closed-loop optimization mechanism that establishes a semantic transmission chain to achieve explicit feedback from downstream tasks to the fusion network through a Requirement-driven Semantic Compensation (RSC) module. The RSC module leverages a Basis Vector Bank (BVB) and an Architecture-Adaptive Semantic Injection (A2SI) block to customize the network architecture according to task requirements, thereby enabling task-specific semantic compensation and allowing the fusion network to actively adapt to diverse tasks without retraining. To promote accurate semantic compensation, a reward-penalty strategy is introduced to reward or penalize the RSC module based on task performance variations. Experiments on the M3FD, FMB, and VT5000 datasets demonstrate that CLDyN not only maintains high fusion quality but also exhibits strong multi-task adaptability.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40295", "url": null, "sourceid": -44482, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37367?format=json"], "related_events_ids": [37367]}, {"id": 39367, "uid": "150784e5fbeb562400a0cd1111471d6a", "name": "Changes in Real Time: Online Scene Change Detection with Multi-View Fusion", "authors": [{"id": 126456, "fullname": "Chamuditha Jayanga Galappaththige", "url": "http://cvpr.thecvf.com/api/miniconf/users/126456?format=json", "institution": "QUT Centre for Robotics, Australia."}, {"id": 152497, "fullname": "Jason Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/152497?format=json", "institution": "University of Sydney"}, {"id": 152498, "fullname": "Lloyd Windrim", "url": "http://cvpr.thecvf.com/api/miniconf/users/152498?format=json", "institution": "Abyss Solutions"}, {"id": 152499, "fullname": "Donald G. Dansereau", "url": "http://cvpr.thecvf.com/api/miniconf/users/152499?format=json", "institution": "University of Sydney"}, {"id": 152500, "fullname": "Niko Suenderhauf", "url": "http://cvpr.thecvf.com/api/miniconf/users/152500?format=json", "institution": "Queensland University of Technology"}, {"id": 152501, "fullname": "Dimity Miller", "url": "http://cvpr.thecvf.com/api/miniconf/users/152501?format=json", "institution": "Queensland University of Technology"}], "abstract": "Online Scene Change Detection (SCD) is an extremely challenging problem that requires an agent to detect relevant changes on the fly while observing the scene from unconstrained viewpoints. Existing online SCD methods are significantly less accurate than offline approaches. We present the first online SCD approach that is pose-agnostic, label-free, and ensures multi-view consistency, while operating at over 10 FPS and achieving new state-of-the-art performance, surpassing even the best offline approaches.    Our method introduces a new self-supervised fusion loss to infer scene changes from multiple cues and observations, PnP-based fast pose estimation against the reference scene, and a fast change-guided update strategy for the 3D Gaussian Splatting scene representation. Extensive experiments on complex real-world datasets demonstrate that our approach outperforms both online and offline baselines. Code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39367", "url": "https://chumsy0725.github.io/O-SCD/", "sourceid": 32827, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37367, "uid": "6e465bf376c95a9f4a5f56bd005455f6", "name": "Customized Fusion: A Closed-Loop Dynamic Network for Adaptive Multi-Task-Aware Infrared-Visible Image Fusion", "authors": [{"id": 180286, "fullname": "Zengyi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180286?format=json", "institution": "Hefei University of Technology"}, {"id": 187013, "fullname": "Yu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187013?format=json", "institution": "Hefei University of Technology"}, {"id": 187262, "fullname": "Juan Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187262?format=json", "institution": "Hefei University of Technology"}, {"id": 187263, "fullname": "Zhiqin Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187263?format=json", "institution": "Chongqing University of Post and Telecommunications"}, {"id": 127590, "fullname": "Yafei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127590?format=json", "institution": "Kunmimg University of Science and Technology"}, {"id": 127563, "fullname": "Huafeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/127563?format=json", "institution": "Kunmimg University of Science and Technology"}], "abstract": "Infrared-visible image fusion aims to integrate complementary information for robust visual understanding, but existing fusion methods struggle with simultaneously adapting to multiple downstream tasks. To address this issue, we propose a Closed-Loop Dynamic Network (CLDyN) that can adaptively respond to the semantic requirements of diverse downstream tasks for task-customized image fusion. Specifically, CLDyN introduces a closed-loop optimization mechanism that establishes a semantic transmission chain to achieve explicit feedback from downstream tasks to the fusion network through a Requirement-driven Semantic Compensation (RSC) module. The RSC module leverages a Basis Vector Bank (BVB) and an Architecture-Adaptive Semantic Injection (A2SI) block to customize the network architecture according to task requirements, thereby enabling task-specific semantic compensation and allowing the fusion network to actively adapt to diverse tasks without retraining. To promote accurate semantic compensation, a reward-penalty strategy is introduced to reward or penalize the RSC module based on task performance variations. Experiments on the M3FD, FMB, and VT5000 datasets demonstrate that CLDyN not only maintains high fusion quality but also exhibits strong multi-task adaptability.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37367", "url": null, "sourceid": 44482, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40295?format=json"], "related_events_ids": [40295]}, {"id": 40321, "uid": "8ed67538022400bdba14f3e8b4bf6ab3", "name": "PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation", "authors": [{"id": 182723, "fullname": "Minjae Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/182723?format=json", "institution": "Pohang University of Science and Technology"}, {"id": 188839, "fullname": "Sungwoo Hur", "url": "http://cvpr.thecvf.com/api/miniconf/users/188839?format=json", "institution": "POSTECH"}, {"id": 188840, "fullname": "Soojin Hwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188840?format=json", "institution": "Pohang University of Science and Technology"}, {"id": 89062, "fullname": "Won Hwa Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/89062?format=json", "institution": "POSTECH"}], "abstract": "Visual Foundation Models (VFMs) such as the Segment Anything Model (SAM) have significantly advanced broad use of image segmentation. However, SAM and its variants necessitate substantial manual effort for prompt generation and additional training for specific applications. Recent approaches address these limitations by integrating SAM into in-context (one/few shot) segmentation, enabling auto-prompting through semantic alignment between query and support images. Despite these efforts, they still generate sub-optimal prompts that degrade segmentation quality due to visual inconsistencies between support and query images. To tackle this limitation, we introduce PR-MaGIC (Prompt Refinement via Mask Decoder Gradient Flow for In-Context Segmentation), a training-free test-time framework that refines prompts via gradient flow derived from SAM\u2019s mask decoder. PR-MaGIC seamlessly integrates into in-context segmentation frameworks, being theoretically grounded yet practically stabilized through a simple top-1 selection strategy that ensures robust performance across samples.Extensive evaluations demonstrate that PR-MaGIC consistently improves segmentation quality across various benchmarks, effectively mitigating inadequate prompts without requiring additional training or architectural modifications.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40321", "url": null, "sourceid": -45817, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38019?format=json"], "related_events_ids": [38019]}, {"id": 38368, "uid": "68287713f50585614720e60776c95ad8", "name": "Mark4D: Temporally-Consistent Watermarking for 4D Gaussian Splatting", "authors": [{"id": 181720, "fullname": "Jaejin Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/181720?format=json", "institution": "Pohang University of Science and Technology"}, {"id": 181724, "fullname": "Minjae JEONG", "url": "http://cvpr.thecvf.com/api/miniconf/users/181724?format=json", "institution": "Pohang University of Science and Technology"}, {"id": 189733, "fullname": "Joonhyuk Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/189733?format=json", "institution": "Pohang University of Science and Technology"}, {"id": 189734, "fullname": "Yechan Hwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189734?format=json", "institution": "Pohang University of Science and Technology"}, {"id": 189735, "fullname": "Seunghun Baek", "url": "http://cvpr.thecvf.com/api/miniconf/users/189735?format=json", "institution": "Pohang University of Science and Technology"}, {"id": 89062, "fullname": "Won Hwa Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/89062?format=json", "institution": "POSTECH"}], "abstract": "Embedding invisible and temporally consistent watermarks into dynamic 4D Gaussian Splatting (4DGS) models poses unique challenges due to continuous spatio-temporal deformation of Gaussians and diverse motion dynamics.Existing 3DGS watermarking methods, which directly fine-tune parameters within Gaussian splats, fail to preserve geometric fidelity and temporal coherence when extended to dynamic 4D settings. In this regime, we propose Mark4D, a temporally consistent watermarking technique that achieves robust, imperceptible, and motion-aware watermark embedding for 4DGS. Mark4D comprises 1) a decoder for watermark recovery in the latent video\u2013text space for robustness against pixel-level distortions,  2) trajectory-aligned offsets that embed watermark signals along Gaussian motion paths to preserve geometry, and 3) a motion-adaptive loss weighting strategy that balances supervision across frames with varying motion intensities. Extensive experiments on synthetic and real-world dynamic scene datasets demonstrate that Mark4D achieves superior bit accuracy, visual fidelity, and robustness under diverse distortions, establishing a foundation for secure and reliable protection of dynamic 4D scene assets. The implementation of Mark4D will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38368", "url": null, "sourceid": 45792, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39766, "uid": "e96e43446dc54b1e2a36367f390bdb0c", "name": "OrienPose: Orientation-Guided Novel View Synthesis for Single-Image Unseen Object Pose Estimation", "authors": [{"id": 183757, "fullname": "Yating Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183757?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 192811, "fullname": "Zhaoshuai Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192811?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 192812, "fullname": "Yang Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192812?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 192813, "fullname": "Yongnan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192813?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 156650, "fullname": "Shizhou Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156650?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 88282, "fullname": "Yanning Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88282?format=json", "institution": "Northwestern Polytechnical University"}], "abstract": "Estimating the 3D pose of unseen objects from a single image remains a fundamental yet challenging problem in computer vision, especially under a CAD model-free setting.Pioneering attempts address this issue by matching templates generated through Novel View Synthesis (NVS), which essentially aims to learn the geometric transformation from a reference to a target view.  While promising, these methods can only approximate this transformation under pixel-level supervision, as the starting orientation remains undefined. In the absence of explicit geometric constraints to verify the correctness of the predicted transformation, existing methods often synthesize novel views with geometry-distorted structures or severely blurred local textures, leading to unreliable template matching and suboptimal pose estimation results. To this end, we propose OrienPose, a novel object pose estimation framework via orientation-aware NVS from a single image. Specifically, we introduce the Orientation-Aware Guidance, which explicitly injects object orientation cues into the reference latent embedding to enhance orientation awareness during viewpoint transformation. We also introduce an orientation consistency loss that supervises viewpoint transformation at the geometric level, establishing sufficient supervision for explicit and geometry-consistent transformation guidance beyond pixel-level similarity. This loss justifies estimating the reference orientation rather than using its ground-truth pose, thereby ensuring the alignment of coordinate domains between the injected and supervised priors. Extensive experiments demonstrate that OrienPose achieves state-of-the-art performance in single-view unseen object pose estimation and impressive robustness to image degradations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39766", "url": null, "sourceid": 45872, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40134, "uid": "746e60af033af91289867fa75d7d0d28", "name": "Towards High-resolution and Disentangled Reference-based Sketch Colorization", "authors": [{"id": 160788, "fullname": "Dingkun Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/160788?format=json", "institution": "Shanda AI Research Tokyo"}, {"id": 152467, "fullname": "Xinrui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152467?format=json", "institution": "The University of Tokyo"}, {"id": 193603, "fullname": "Ru Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193603?format=json", "institution": "The University of Tokyo, The University of Tokyo"}, {"id": 152468, "fullname": "Zhuoru Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152468?format=json", "institution": "ProjectHAT"}, {"id": 193604, "fullname": "Jinze Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193604?format=json", "institution": "Waseda University"}, {"id": 152470, "fullname": "Yusuke Iwasawa", "url": "http://cvpr.thecvf.com/api/miniconf/users/152470?format=json", "institution": "The University of Tokyo, The University of Tokyo"}, {"id": 152471, "fullname": "Yutaka Matsuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/152471?format=json", "institution": "The University of Tokyo"}, {"id": 152472, "fullname": "Jiaxian Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/152472?format=json", "institution": "Google Research"}], "abstract": "Sketch colorization models have been widely studied to automate and assist in the creation of animation frames and digital illustrations. However, current methods are still not satisfactory for industrial standard applications in high-resolution synthesis and precise controllability of details. To further enhance the synthesis quality and controllability, we propose an image-referenced sketch colorization method based on the powerful SDXL backbone and leverage sketches as spatial guidance and RGB images as color references. A split cross-attention mechanism is coupled with spatial masks to separately colorize the foreground and background regions to avoid spatial entanglement. A tagger network trained on a massive anime-style image dataset is employed to extract attribution-level information from reference images and integrated into the pipeline to provide precise control signals for synthesis. However, the increased resolution and number of attention layers in the SDXL backbone and precise reference information from the tagger network cause severe entanglement during colorization. We consequently combine a foreground encoder and a background encoder for disentanglement and better synthesis quality. Furthermore, a high-quality annotated and paired sketch colorization dataset is collected for fine-tuning. The proposed method is the first to achieve high resolution high quality sketch colorization with precise control, and obvious outperforms existing methods in quantitative and qualitative validations, as well as user studies in both quality and controllability. Ablation study reveals the influence of each component. Code and dataset will be made publicly available upon paper acceptance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40134", "url": null, "sourceid": 46321, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38612, "uid": "553273a1c6a3ea4b3df58f9d0de5aaaf", "name": "Test-Time 3D Occupancy Prediction", "authors": [{"id": 180899, "fullname": "Fengyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180899?format=json", "institution": "The University of Queensland"}, {"id": 190299, "fullname": "Xiangyu Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/190299?format=json", "institution": "The University of Queensland"}, {"id": 177293, "fullname": "Huitong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177293?format=json", "institution": "The University of Queensland"}, {"id": 128732, "fullname": "Zheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128732?format=json", "institution": "Harbin Institute of Technology"}, {"id": 90777, "fullname": "Zi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90777?format=json", "institution": "University of Queensland"}, {"id": 158034, "fullname": "Yadan Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/158034?format=json", "institution": "The University of Queensland"}], "abstract": "Self-supervised 3D occupancy prediction offers a promising solution for understanding complex driving scenes without requiring costly 3D annotations. However, training dense occupancy decoders to capture fine-grained geometry and semantics can demand hundreds of GPU hours, and once trained, such models struggle to adapt to varying voxel resolutions or novel object categories without extensive retraining. To overcome these limitations, we propose a practical and flexible test-time occupancy prediction framework termed TT-Occ.   Our method incrementally constructs, optimizes and voxelizes time-aware 3D Gaussians from raw sensor streams by integrating vision foundation models (VFMs) at runtime. The flexible nature of 3D Gaussians allows voxelization at arbitrary user-specified resolutions, while the generalization ability of VFMs enables accurate perception and open-vocabulary recognition, without any network training or fine-tuning. Specifically, TT-Occ operates in a lift-track-voxelize symphony: We first lift the geometry and semantics of surrounding-view extracted from VFMs to instantiate Gaussians at 3D space; Next, we track dynamic Gaussians while accumulating static ones to complete the scene and enforce temporal consistency; Finally, we voxelize the optimized Gaussians to generate occupancy prediction. Optionally, inherent noise in VFM predictions and tracking is mitigated by periodically smoothing neighboring Gaussians during optimization. To validate the generality and effectiveness of our framework, we offer two variants: one LiDAR-based and one vision-centric, and conduct extensive experiments on Occ3D and nuCraft benchmarks with varying voxel resolutions. Source code is available in the supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38612", "url": null, "sourceid": 32115, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39308, "uid": "c92c2cc982b4aff542f0cfa4cab0895b", "name": "ViStoryBench: Comprehensive Benchmark Suite for Story Visualization", "authors": [{"id": 191814, "fullname": "Cailin Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191814?format=json", "institution": "ShanghaiTech University"}, {"id": 180444, "fullname": "Ailin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180444?format=json", "institution": "StepFun AI"}, {"id": 181057, "fullname": "Hu Yaoqi", "url": "http://cvpr.thecvf.com/api/miniconf/users/181057?format=json", "institution": "AIGC Research"}, {"id": 191815, "fullname": "Jingwei Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191815?format=json", "institution": null}, {"id": 88826, "fullname": "Wei Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/88826?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 101226, "fullname": "Jiaqi Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/101226?format=json", "institution": "Research, Microsoft"}, {"id": 178412, "fullname": "Hongyuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/178412?format=json", "institution": "JYXC"}, {"id": 189458, "fullname": "Xinyao Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189458?format=json", "institution": "Nanyang Technological University"}, {"id": 191816, "fullname": "Weiwei Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/191816?format=json", "institution": "Fudan University"}, {"id": 107192, "fullname": "Hengyuan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107192?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 86059, "fullname": "Xuanyang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86059?format=json", "institution": "Megvii Technology Inc."}, {"id": 126775, "fullname": "Xianfang Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/126775?format=json", "institution": "Tencent PCG"}, {"id": 90514, "fullname": "Zhewei Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90514?format=json", "institution": "Megvii Technology Inc."}, {"id": 87502, "fullname": "Gang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87502?format=json", "institution": "Tencent"}, {"id": 153810, "fullname": "Chi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153810?format=json", "institution": "Westlake University"}], "abstract": "Story visualization aims to generate coherent image sequences that faithfully depict a narrative and align with character references. Despite progress in generative models, existing benchmarks are narrow in scope, often limited to short prompts, lacking character reference, or single-image cases, and fail to capture real-world storytelling complexity. This hinders a nuanced understanding of model capabilities and limitations. We present \\textbf{ViStoryBench}, a comprehensive benchmark designed to evaluate story visualization models across diverse narrative structures, visual styles, and character settings. The benchmark features richly annotated multi-shot scripts derived from curated stories spanning literature, film, and folklore. Large language models assist in story summarization and script generation, with all outputs verified by humans to ensure coherence and fidelity. Character references are carefully curated to maintain intra-story consistency across varying artistic styles. To enable thorough evaluation, ViStoryBench introduces a set of automated metrics that assess character consistency, style similarity, prompt alignment, aesthetic quality, and generation artifacts such as copy-paste behavior. These metrics are validated through human studies, and are used to benchmark a broad range of open-source and commercial models. ViStoryBench offers a multi-dimensional evaluation suite that facilitates systematic analysis and fosters future progress in visual storytelling.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39308", "url": null, "sourceid": 40858, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38525, "uid": "2e8f1f7c2e446cabefd299853e768bf7", "name": "PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks", "authors": [{"id": 184496, "fullname": "Cheng Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/184496?format=json", "institution": "Baidu"}, {"id": 175650, "fullname": "yubo zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/175650?format=json", "institution": "Baidu"}, {"id": 175159, "fullname": "Ting Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/175159?format=json", "institution": "Baidu"}, {"id": 184500, "fullname": "Xueqing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184500?format=json", "institution": "Baidu"}, {"id": 184502, "fullname": "Hongen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184502?format=json", "institution": null}, {"id": 184503, "fullname": "Lin Manhui", "url": "http://cvpr.thecvf.com/api/miniconf/users/184503?format=json", "institution": "Baidu"}, {"id": 184504, "fullname": "Yue Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184504?format=json", "institution": "Baidu"}, {"id": 184498, "fullname": "Tingquan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184498?format=json", "institution": "Baidu"}, {"id": 184501, "fullname": "Changda Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184501?format=json", "institution": "Baidu"}, {"id": 90373, "fullname": "Jiaxuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90373?format=json", "institution": "Nankai University"}, {"id": 184499, "fullname": "Zelun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184499?format=json", "institution": "Baidu"}, {"id": 181263, "fullname": "Jing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181263?format=json", "institution": "Baidu"}, {"id": 184505, "fullname": "Jun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184505?format=json", "institution": "Baidu"}, {"id": 105233, "fullname": "Yi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/105233?format=json", "institution": "Baidu"}], "abstract": "The advent of \u201cOCR 2.0\u201d and large-scale vision-language models (VLMs) has set new benchmarks in text recognition. However, these unified architectures often come with significant computational demands, challenges in precise text localization within complex layouts, and a propensity for textual hallucinations. Revisiting the prevailing notion that model scale is the sole path to high accuracy, this paper introduces PP-OCRv5, a meticulously optimized, lightweight OCR system with merely 5 million parameters. We demonstrate that PP-OCRv5 achieves performance comparable to billion-parameter VLMs, while offering superior localization precision and reduced hallucinations. The cornerstone of our success lies not in architectural expansion but in a data-centric investigation. We systematically dissect the role of training data by quantifying three critical dimensions: data difficulty, data accuracy, and data diversity. Our extensive experiments reveal that with a sufficient volume of high-quality, accurately labeled, and diverse data, the performance ceiling for traditional, efficient two-stage OCR pipelines is far higher than commonly assumed. This work provides compelling evidence for the viability of lightweight, specialized models in the large-model era and offers practical insights into data curation for OCR.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38525", "url": null, "sourceid": 41918, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36523, "uid": "4b650319cca0b2d0480d58b5c6451a28", "name": "When LoRA Betrays: Backdooring Text-to-Image Models by Masquerading as Benign Adapters", "authors": [{"id": 182185, "fullname": "Liangwei Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182185?format=json", "institution": "People&#x27;s Public Security University of China"}, {"id": 185266, "fullname": "Jiaqi Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185266?format=json", "institution": "People&#x27;s Public Security University of China"}, {"id": 185267, "fullname": "Jianwei Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/185267?format=json", "institution": "People&#x27;s Public Security University of China"}, {"id": 185268, "fullname": "Qiyao Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185268?format=json", "institution": "People&#x27;s Public Security University of China"}], "abstract": "Low-Rank Adaptation (LoRA) has emerged as a leading technique for efficiently fine-tuning text-to-image diffusion models, and its widespread adoption on open-source platforms has fostered a vibrant culture of model sharing and customization. However, the same modular and plug-and-play flexibility that makes LoRA appealing also introduces a broader attack sur-face. To highlight this risk, we propose Masquerade-LoRA (MasqLoRA), the first backdoor attack that leverages the LoRA mechanism to stealthily inject malicious behavior into text-to-image diffusion models. MasqLoRA operates by freezing the base model parameters and updating only the low-rank adapter weights using a small number of \u201ctrigger word-target image\u201d pairs. This enables the attacker to train a standalone backdoor LoRA module that embeds a hidden cross-modal mapping: when the module is loaded and a specific textual trigger is provided, the model produces a predefined visual output; other-wise, it behaves indistinguishably from the clean model, ensuring the stealthiness of the attack. Experimental results demonstrate that MasqLoRA can be trained with minimal resource overhead and achieves a high attack success rate of 99.8%. MasqLoRA reveals a severe and unique threat in the AI supply chain, underscoring the urgent need for dedicated defense mechanisms for the LoRA-centric sharing ecosystem.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36523", "url": null, "sourceid": 45300, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65737, "file": "/media/PosterPDFs/CVPR%202026/36523.png", "modified": "2026-04-24T03:13:20.748717-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": true, "sortkey": 0, "is_live_content": false, "detailed_kind": "", "generated_from": null, "resourcetype": "EventmediaImageFile"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37792, "uid": "b496dd03413b9ca410a34d5a25f2ef62", "name": "DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles", "authors": [{"id": 188278, "fullname": "Yiming Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/188278?format=json", "institution": "Harbin Institute of Technology"}, {"id": 188279, "fullname": "Hongkun Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188279?format=json", "institution": "Ocean University of China"}, {"id": 188280, "fullname": "Lionel WANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/188280?format=json", "institution": null}, {"id": 157930, "fullname": "BIN CHEN", "url": "http://cvpr.thecvf.com/api/miniconf/users/157930?format=json", "institution": null}, {"id": 188281, "fullname": "Weizhi Xian", "url": "http://cvpr.thecvf.com/api/miniconf/users/188281?format=json", "institution": "Chongqing Research Institute, Harbin Institute of Technology"}, {"id": 188282, "fullname": "Jianzhi Teng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188282?format=json", "institution": ""}], "abstract": "Prompt learning is a dominant paradigm for adapting pre-trained Vision-Language Models (VLMs) to downstream tasks. However, existing methods often rely on a simplistic, layer-centric view, assuming shallow layers capture general features while deep layers handle task-specific knowledge. This assumption results in uncontrolled interactions between learnable tokens and original tokens. Task-specific knowledge could degrades the model's core generalization and creates a trade-off between task adaptation and the preservation of zero-shot generalization. To address this, we challenge the layer-centric view and propose \\textbf{DeAR}, a framework that achieves fine-grained VLM adaptation by \\textbf{De}composing \\textbf{A}ttention head \\textbf{R}oles. We posit that the functional specialization within VLMs occurs not between layers, but at the finer-grained level of individual attention heads in the deeper layers. Based on this insight, we introduce a novel metric, Concept Entropy, to systematically classify attention heads into distinct functional roles: \\textit{Attribute}, \\textit{Generalization}, and \\textit{Mixed}. Guided by these roles, we introduce specialized attribute tokens and a Role-Based Attention Mask mechanism to precisely control information flow, ensuring generalization heads remain isolated from task-specific knowledge. We further incorporate a Task-Adaptive Fusion Strategy for inference. Extensive experiments on fifteen datasets show that DeAR achieves a strong balance between task adaptation and generalization, outperforming previous methods across various tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37792", "url": null, "sourceid": 39766, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37977, "uid": "f7ce934a42554260ad8225c18d2b61c1", "name": "SARMAE: Masked Autoencoder for SAR Representation Learning", "authors": [{"id": 188725, "fullname": "Danxu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188725?format=json", "institution": "Beijing Institute of Technology"}, {"id": 158612, "fullname": "Di Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158612?format=json", "institution": "Wuhan University"}, {"id": 180748, "fullname": "Hebaixu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180748?format=json", "institution": "Zhongguancun Academy, Beijing, China"}, {"id": 184689, "fullname": "Haoyang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184689?format=json", "institution": "Wuhan University"}, {"id": 188726, "fullname": "Wentao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188726?format=json", "institution": "Wuhan University"}, {"id": 188727, "fullname": "Yilin Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188727?format=json", "institution": "Fudan University"}, {"id": 186959, "fullname": "Haonan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186959?format=json", "institution": "Wuhan University"}, {"id": 188728, "fullname": "Wei Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/188728?format=json", "institution": "Beijing Institute of Technology"}, {"id": 91732, "fullname": "Jing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91732?format=json", "institution": "The University of Sydney"}], "abstract": "Synthetic Aperture Radar (SAR) imagery plays a critical role in all-weather, day-and-night remote sensing applications. However, existing SAR-oriented deep learning is constrained by data scarcity, while the physically grounded speckle noise in SAR imagery further hampers fine-grained semantic representation learning. To address these challenges, we propose SARMAE, a Noise-Aware Masked Autoencoder for self-supervised SAR representation learning. Specifically, we construct SAR-1M, the first million-scale SAR dataset, with additional paired optical images, to enable large-scale pre-training.Building upon this, we design Speckle-Aware Representation Enhancement (SARE), which injects SAR-specific speckle noise into masked autoencoders to facilitate noise-aware and robust representation learning. Furthermore, we introduce Semantic Anchor Representation Constraint (SARC), which leverages paired optical priors to align SAR features and ensure semantic consistency. Extensive experiments across multiple SAR datasets demonstrate that SARMAE achieves state-of-the-art performance on classification, detection, and segmentation tasks. Code and models will be available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37977", "url": null, "sourceid": 44774, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38238, "uid": "dec9c56e749d1e55a8efbd3fceafdaf3", "name": "InternData-A1:  Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy", "authors": [{"id": 90271, "fullname": "Yang Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/90271?format=json", "institution": "Peking University"}, {"id": 180901, "fullname": "Yuyin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180901?format=json", "institution": "Fudan University"}, {"id": 189394, "fullname": "Yiman Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/189394?format=json", "institution": "Zhejiang University"}, {"id": 189395, "fullname": "Zetao Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/189395?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 189396, "fullname": "Xu Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189396?format=json", "institution": "Shanghai Artificial Intelligence Laboratory; Shanghai Jiaotong University"}, {"id": 155397, "fullname": "Ning Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/155397?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 189397, "fullname": "Hangxu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189397?format=json", "institution": "Fudan University"}, {"id": 130405, "fullname": "Xuekun Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130405?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 189398, "fullname": "Zherui Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189398?format=json", "institution": "University of Science and Technology of China"}, {"id": 179494, "fullname": "Feng Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/179494?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 189399, "fullname": "Yaping Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189399?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 88374, "fullname": "Ping Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88374?format=json", "institution": "Peking University"}, {"id": 129281, "fullname": "Junhao Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/129281?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 90129, "fullname": "Jia Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90129?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 76571, "fullname": "Hao Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76571?format=json", "institution": "Peking University"}, {"id": 88208, "fullname": "Jiangmiao Pang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88208?format=json", "institution": "Shanghai AI Laboratory "}], "abstract": "Recent work explores how real and synthetic data contribute to VLA model generalization. While the $\\pi$-series model has shown the strong effectiveness of large-scale real-robot pre-training, synthetic data has not previously demonstrated comparable capability at scale.This paper provides the first evidence that synthetic data alone can match the performance of the strongest $\\pi$-dataset in pre-training a VLA model, revealing the substantial value of large-scale simulation.The resulting model also exhibits surprisingly strong zero-shot sim-to-real transfer on several challenging tasks.Our synthetic dataset, InternData-A1, contains over 630k trajectories and 7,433 hours across 4 embodiments, 18 skills, 70 tasks, and 227 scenes, covering rigid, articulated, deformable, and fluid-object manipulation. It is generated through a highly autonomous, fully decoupled, and compositional simulation pipeline that enables flexible task assembly, long-horizon skill composition, and heterogeneous embodiments with minimal manual tuning.Using the same architecture as $\\pi_0$, we pre-train a model entirely on InternData-A1 and find that it matches the official $\\pi_0$ across 49 simulation tasks, 5 real-world tasks, and 4 long-horizon dexterous tasks.We will open-source both the dataset and the generation pipeline to broaden access to large-scale robotic data and to lower the barrier to scalable data creation for embodied AI research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38238", "url": null, "sourceid": 34601, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39580, "uid": "a51cc66bf971e1aaae696dfeb130f38b", "name": "Taming Generative Diffusion Model for Task-Oriented Infrared Imaging", "authors": [{"id": 151421, "fullname": "Tengyu Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/151421?format=json", "institution": "Dalian University of Technology"}, {"id": 192401, "fullname": "Zhilong Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/192401?format=json", "institution": "Dalian University of Technology"}, {"id": 192402, "fullname": "Yubo Diao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192402?format=json", "institution": "Dalian University of Technology"}, {"id": 192403, "fullname": "Guanming An", "url": "http://cvpr.thecvf.com/api/miniconf/users/192403?format=json", "institution": "Dalian University of Technology"}, {"id": 69209, "fullname": "Long Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/69209?format=json", "institution": "Dalian University of Technology"}, {"id": 152576, "fullname": "Jinyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152576?format=json", "institution": "Dalian University of Technology"}, {"id": 131737, "fullname": "Risheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131737?format=json", "institution": "Dalian University of Technology"}], "abstract": "Infrared (IR) imaging is indispensable for perception in adverse environments, yet real-world data is often corrupted by dynamically coupled degradations that impair both visual quality and downstream semantic understanding. Although diffusion models offer powerful generative priors, existing approaches remain ill-suited to this setting. Their slow multi-step sampling, reliance on RGB-driven statistics misaligned with IR physics, and the necessity for costly fine-tuning of all model parameters render them impractical for dynamic IR perception. We present a unified diffusion framework that re-formulates IR restoration as a single-step generative process. The core idea is to associate each degraded input with a specific intermediate latent state in the diffusion trajectory, enabling the model to reconstruct the clean image via a single, direct reverse step. Physical realism is further reinforced through an IR-specific spectral regularization that preserves the characteristic energy distribution of thermal emissions. Addressing the diverse and rapidly shifting demands of dynamic IR perception, we further develop a task-aware low-rank adaptation mechanism. This mechanism employs a lightweight prompting hypernetwork to generate compact modulation parameters, facilitating rapid and scalable adaptation ability without retraining the entire network. Comprehensive evaluations demonstrate that our framework attains state-of-the-art restoration performance, preserves reliable semantic structures, and supports rapid adaptation that generalizes effectively across diverse tasks and conditions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39580", "url": null, "sourceid": 34852, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37075, "uid": "2620e1332513cc8456f58c01089c2508", "name": "Prompt-Free Universal Region Proposal Network", "authors": [{"id": 186605, "fullname": "Qihong Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186605?format=json", "institution": "Nanjing University"}, {"id": 146746, "fullname": "Changhan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/146746?format=json", "institution": "nankai university"}, {"id": 181212, "fullname": "Shaofeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181212?format=json", "institution": "University of Science and Technology of China"}, {"id": 184978, "fullname": "Wenbin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184978?format=json", "institution": "Nanjing University"}, {"id": 130772, "fullname": "Qi Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/130772?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 86625, "fullname": "Yang Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86625?format=json", "institution": "Nanjing University"}], "abstract": "Identifying potential objects is critical for object recognition and analysis across various computer vision applications.Existing methods typically localize potential objects by relying on exemplar images, predefined categories, or textual descriptions.However, their reliance on image and text prompts often limits flexibility, restricting adaptability in real-world scenarios.In this paper, we introduce a novel Prompt-Free Universal Region Proposal Network (\\ourmodel), which identifies potential objects without relying on external prompts.First, the Sparse Image-Aware Adapter (SIA) module performs initial localization of potential objects using a learnable query embedding dynamically updated with visual features.Next, the Cascade Self-Prompt (CSP) module identifies the remaining potential objects by leveraging the self-prompted learnable embedding, autonomously aggregating informative visual features in a cascading manner.Finally, the Centerness-Guided Query Selection (CG-QS) module facilitates the selection of high-quality query embeddings using a centerness scoring network.Our method can be optimized with limited data (e.g., 5\\% of MS COCO data) and applied directly to various object detection application domains for identifying potential objects without fine-tuning, such as underwater object detection, industrial defect detection, and remote sensing image object detection.Experimental results across 19 datasets validate the effectiveness of our method.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37075", "url": null, "sourceid": 43707, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37607, "uid": "42c12185a3238db19e1ff3993c120902", "name": "MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe", "authors": [{"id": 131385, "fullname": "Tianyu Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131385?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 187820, "fullname": "Zefan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187820?format=json", "institution": "Tsinghua University; Tsinghua University"}, {"id": 187821, "fullname": "Chongyi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187821?format=json", "institution": "ModelBest"}, {"id": 187822, "fullname": "Fuwei Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187822?format=json", "institution": "ModelBest"}, {"id": 187823, "fullname": "Wenshuo Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/187823?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 149407, "fullname": "Zhihui He", "url": "http://cvpr.thecvf.com/api/miniconf/users/149407?format=json", "institution": "Tsinghua University"}, {"id": 187824, "fullname": "Tianchi Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/187824?format=json", "institution": "ModelBest"}, {"id": 187825, "fullname": "Weize Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187825?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 187826, "fullname": "Yuxiang Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187826?format=json", "institution": "Tsinghua University"}, {"id": 187827, "fullname": "Ranchi Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187827?format=json", "institution": "ModelBest"}, {"id": 187828, "fullname": "Bokai Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187828?format=json", "institution": "ModelBest"}, {"id": 187829, "fullname": "Junbo Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/187829?format=json", "institution": null}, {"id": 151733, "fullname": "Yingjing Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/151733?format=json", "institution": "Zhejiang University"}, {"id": 187830, "fullname": "Liqing Ruan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187830?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 187831, "fullname": "Luoyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187831?format=json", "institution": "Modelbest"}, {"id": 187832, "fullname": "Hanyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187832?format=json", "institution": "Peking University"}, {"id": 187833, "fullname": "Jingkun Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187833?format=json", "institution": "Sichuan University"}, {"id": 187834, "fullname": "Hongyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187834?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 187835, "fullname": "Qining Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187835?format=json", "institution": ""}, {"id": 187836, "fullname": "Wenhao Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187836?format=json", "institution": "Nanjing University of Information Science and Technology"}, {"id": 187837, "fullname": "Bingxiang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/187837?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 187838, "fullname": "Jie Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/187838?format=json", "institution": null}, {"id": 187839, "fullname": "Jie Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/187839?format=json", "institution": "ModelBest"}, {"id": 187840, "fullname": "Ji Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/187840?format=json", "institution": "Tsinghua University"}, {"id": 76730, "fullname": "Zonghao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/76730?format=json", "institution": "Tsinghua University"}, {"id": 156515, "fullname": "Chi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/156515?format=json", "institution": "Tsinghua University"}, {"id": 187841, "fullname": "Guoyang Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187841?format=json", "institution": "ModelBest"}, {"id": 177968, "fullname": "Yuxuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/177968?format=json", "institution": "Tsinghua University"}, {"id": 187842, "fullname": "Ganqu Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/187842?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 187843, "fullname": "Ning Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/187843?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 187844, "fullname": "Xu Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/187844?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 106924, "fullname": "Yuan Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/106924?format=json", "institution": "Tsinghua University"}, {"id": 129010, "fullname": "Zhiyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129010?format=json", "institution": "Tsinghua University"}, {"id": 131381, "fullname": "Maosong Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/131381?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Multimodal Large Language Models (MLLMs) are undergoing rapid progress and represent the frontier of AI development. However, their training and inference efficiency have emerged as a core bottleneck in making MLLMs more accessible and scalable. To address the challenges, we present MiniCPM-V 4.5, an 8B parameter model designed for high efficiency and strong performance. We introduce three core improvements in model architecture, data strategy and training method: a unified 3D-Resampler model architecture for highly compact encoding over images and videos, a unified learning paradigm for document knowledge and text recognition without heavy data engineering, and a hybrid reinforcement learning strategy for proficiency in both short and long reasoning modes. Comprehensive experimental results in OpenCompass evaluation show that MiniCPM-V 4.5 surpasses widely used proprietary models such as GPT-4o-latest, and significantly larger open-source models such as Qwen2.5-VL 72B. Notably, the strong performance is achieved with remarkable efficiency. For example, on the widely adopted VideoMME benchmark, MiniCPM-V 4.5 achieves state-of-the-art performance among models under 30B size, using just 46.7\\% GPU memory cost and 8.7\\% inference time of Qwen2.5-VL 7B.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37607", "url": null, "sourceid": 41756, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37100, "uid": "77dd1e39a455f1c57e984f7fe30b59b9", "name": "Mixture-of-Experts based Feature Decoupling for Open Vocabulary Scene Graph Generation", "authors": [{"id": 186653, "fullname": "Yiming Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186653?format=json", "institution": "Hefei University of Technology"}, {"id": 76798, "fullname": "Sisi You", "url": "http://cvpr.thecvf.com/api/miniconf/users/76798?format=json", "institution": "Hefei University of Technology"}, {"id": 76806, "fullname": "Bing-Kun Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/76806?format=json", "institution": "Hefei University of Technology"}], "abstract": "In recent years, while Scene Graph Generation has advancedsignificantly, mainstream methods remain constrained by pre-defined object and relationship categories, limiting general-ization to open real-world scenarios. Inspired by open vocab-ulary object detection, recent efforts have expanded SGG tothe open vocabulary domain. However, these models oftenrely on off-the-shelf VLMs, lacking discriminative attributeextraction and suffering from limited object-relationship se-mantic interaction, which leads to misclassification of un-seen categories. To address these issues, we propose the MoEFeature Decoupling (MoE-FD) framework for Open Vocab-ulary Scene Graph Generation. MoE-FD adaptively learnsfeature decoupling for objects and relationships via multipleexperts, prioritizing critical features through gating networkweights. Moreover, it models semantic interactions betweenobjects and relationships using iterative cross-attention, en-hancing relationship triple associations and visual-semanticalignment. The main contributions of MoE-FD are threefold:(1) A MoE-based feature decoupling framework that adap-tively enhances discriminative feature representation for ob-jects and relations. (2) Semantic interaction modeling be-tween objects and relations to strengthen relationship tripleassociations and image-text alignment accuracy. (3) Exten-sive experiments demonstrate the effectiveness of MoE-FDon the Visual Genome dataset.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37100", "url": null, "sourceid": 42835, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36303, "uid": "175cc8fdd6fdda3d20e4fa4dcf984c42", "name": "Image-Based Outlier Synthesis With Training Data", "authors": [{"id": 135673, "fullname": "Sudarshan Regmi", "url": "http://cvpr.thecvf.com/api/miniconf/users/135673?format=json", "institution": "Independent"}], "abstract": "Out-of-distribution (OOD) detection is critical to ensure the safe deployment of deep learning models in critical applications. Deep learning models can often misidentify OOD samples as in-distribution (ID) samples. This vulnerability worsens in the presence of spurious correlation in the training set. Likewise, in fine-grained classification settings, detection of fine-grained OOD samples becomes inherently challenging due to their high similarity to ID samples. However, current research on OOD detection has focused instead largely on relatively easier (conventional) cases. Even the few recent works addressing these challenging cases rely on carefully curated or synthesized outliers, ultimately requiring external data. This motivates our central research question: ``Can we innovate OOD detection training framework for fine-grained and spurious settings \\textbf{without requiring any external data at all?}\" In this work, we present a unified \\textbf{A}pproach to \\textbf{S}purious, fine-grained, and \\textbf{C}onventional \\textbf{OOD D}etection (\\textbf{\\ASCOOD}) that eliminates the reliance on external data. First, we synthesize virtual outliers from ID data by approximating the destruction of invariant features. Specifically, we propose to add gradient attribution values to ID inputs to disrupt invariant features while amplifying true-class logit, thereby synthesizing challenging near-manifold virtual outliers. Then, we simultaneously incentivize ID classification and predictive uncertainty towards virtual outliers. For this, we further propose to leverage standardized features with z-score normalization. ASCOOD effectively mitigates impact of spurious correlations and encourages capturing fine-grained attributes. Extensive experiments across \\textbf{7} datasets and and comparisons with \\textbf{30+} methods demonstrate merit of ASCOOD in spurious, fine-grained and conventional settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36303", "url": null, "sourceid": 44572, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38174, "uid": "7be8ee938a81badb604610babcc53b30", "name": "Motus: A Unified Latent Action World Model", "authors": [{"id": 189206, "fullname": "Hongzhe Bi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189206?format=json", "institution": "Tsinghua University"}, {"id": 189207, "fullname": "Hengkai Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189207?format=json", "institution": "the Department of Computer Science, Tsinghua University"}, {"id": 189208, "fullname": "Shenghao Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/189208?format=json", "institution": "Peking University"}, {"id": 189209, "fullname": "Zeyuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189209?format=json", "institution": "Tsinghua University"}, {"id": 189210, "fullname": "Shuhe Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189210?format=json", "institution": "Tsinghua University"}, {"id": 189211, "fullname": "Haitian Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189211?format=json", "institution": "Tsinghua University"}, {"id": 189212, "fullname": "Ruowen Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189212?format=json", "institution": "Tsinghua University"}, {"id": 189213, "fullname": "Yao Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189213?format=json", "institution": "Tsinghua University"}, {"id": 189214, "fullname": "Chendong Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189214?format=json", "institution": "Tsinghua University"}, {"id": 189215, "fullname": "Yinze Rong", "url": "http://cvpr.thecvf.com/api/miniconf/users/189215?format=json", "institution": "Tsinghua University"}, {"id": 189216, "fullname": "Hongyan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189216?format=json", "institution": "Tsinghua University"}, {"id": 187832, "fullname": "Hanyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187832?format=json", "institution": "Peking University"}, {"id": 164010, "fullname": "Zhizhong Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/164010?format=json", "institution": "Horizon Robotics"}, {"id": 128297, "fullname": "Lei Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/128297?format=json", "institution": "Peking University"}, {"id": 86642, "fullname": "Hang Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/86642?format=json", "institution": "Tsinghua University"}, {"id": 86599, "fullname": "Jun Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86599?format=json", "institution": "Tsinghua University"}], "abstract": "While a general embodied agent must function as a unified system, current methods are built on isolated models for understanding, world modeling, and control. This fragmentation prevents unifying multimodal generative capabilities and hinders learning from large-scale, heterogeneous data. In this paper, we propose Motus, a unified latent action world model that leverages existing general pretrained models and rich, sharable motion information. Motus introduces a Mixture-of-Transformer (MoT) architecture to integrate three experts (ie, understanding, video generation, and action) and adopts a UniDiffuser-style scheduler to enable flexible switching between different modeling modes (ie, world models, vision-language-action models, inverse dynamics models, video generation models, and video-action joint prediction). Motus further leverages the optical flow to learn latent actions and adopts a recipe with three-phase training pipeline and six-layer data pyramid, thereby extracting pixel-level \"delta action\" and enabling large-scale action pretraining. Experiments show that Motus achieves superior performance against state-of-the-art methods in both simulation (a +15% improvement over X-VLA and a +45% improvement over Pi-0.5) and real-world scenarios(improved by +11~48%), demonstrating unified modeling of all functionalities and priors significantly benefits downstream robotic tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38174", "url": null, "sourceid": 40270, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37519, "uid": "b229154b44b3e48723ec020fb471975f", "name": "TTRV: Test-Time Reinforcement Learning for Vision Language Models", "authors": [{"id": 180247, "fullname": "Akshit Singh", "url": "http://cvpr.thecvf.com/api/miniconf/users/180247?format=json", "institution": "Independent"}, {"id": 187624, "fullname": "Shyam Marjit", "url": "http://cvpr.thecvf.com/api/miniconf/users/187624?format=json", "institution": "Indian Institute of Science, Bangalore"}, {"id": 135282, "fullname": "Wei Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/135282?format=json", "institution": "ELLIS Unit &amp; LIT AI Lab, JKU"}, {"id": 91828, "fullname": "Paul Gavrikov", "url": "http://cvpr.thecvf.com/api/miniconf/users/91828?format=json", "institution": "Independent Researcher"}, {"id": 69178, "fullname": "Serena Yeung", "url": "http://cvpr.thecvf.com/api/miniconf/users/69178?format=json", "institution": "Stanford"}, {"id": 150894, "fullname": "Hilde Kuehne", "url": "http://cvpr.thecvf.com/api/miniconf/users/150894?format=json", "institution": "Eberhard-Karls-Universit\u00e4t T\u00fcbingen"}, {"id": 89688, "fullname": "Rogerio Feris", "url": "http://cvpr.thecvf.com/api/miniconf/users/89688?format=json", "institution": "International Business Machines"}, {"id": 89691, "fullname": "Sivan Doveh", "url": "http://cvpr.thecvf.com/api/miniconf/users/89691?format=json", "institution": "IBM, Weizmann Institute of Science"}, {"id": 131403, "fullname": "James Glass", "url": "http://cvpr.thecvf.com/api/miniconf/users/131403?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 150900, "fullname": "Muhammad Jehanzeb Mirza", "url": "http://cvpr.thecvf.com/api/miniconf/users/150900?format=json", "institution": "MIT"}], "abstract": "Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment.In this work, we propose TTRV to enhance vision\u2013language understanding by adapting the model on-the-fly at inference time, without the need for any labeled data.Concretely, we enhance the Group Relative Policy Optimization (GRPO) framework by designing rewards based on the frequency of the base model's output, while inferring on each test sample multiple times.Further, we also propose to control the diversity of the model's output by simultaneously rewarding the model for obtaining low entropy of the output empirical distribution.Our approach delivers consistent gains across both object recognition and visual question answering (VQA), with improvements of up to 52.4% and 29.8%, respectively, and average boosts of 24.6% and 10.0% across 16 datasets. Remarkably, on image recognition, TTRV applied to Intern-VL-8B surpasses GPT-4o by an average of 2.3% over 8 benchmarks, while remaining highly competitive on VQA, demonstrating that test-time reinforcement learning can match or exceed the strongest proprietary models. Finally, we find many interesting properties of test-time RL for VLMs: for example, even in extremely data-constrained scenarios, where adaptation is performed on a single randomly chosen unlabeled test example, \\method still yields non-trivial improvements of up to 5.5% in recognition tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37519", "url": null, "sourceid": 42268, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36896, "uid": "ca3f6e75176256acc0e0756a3f8eccea", "name": "Spe-BEVHead: Rethinking the Detection Head Design for Bird\u2019s-Eye-View Object Detection", "authors": [{"id": 172756, "fullname": "Junshu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/172756?format=json", "institution": "Tsinghua University"}, {"id": 152220, "fullname": "Sicheng Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/152220?format=json", "institution": "Tsinghua University"}, {"id": 186140, "fullname": "Xin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186140?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 151687, "fullname": "Fan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151687?format=json", "institution": "Tsinghua University"}, {"id": 186141, "fullname": "Ruike Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186141?format=json", "institution": "Tsinghua University"}, {"id": 86015, "fullname": "Jungong Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/86015?format=json", "institution": "Aberystwyth University"}, {"id": 90686, "fullname": "Guiguang Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/90686?format=json", "institution": "Tsinghua University"}], "abstract": "Bird\u2019s-Eye-View (BEV) detection has become a dominant paradigm for 3D object detection in autonomous driving, due to its strong perception capability. However, most existing methods mainly focus on constructing high-quality BEV feature representations, while neglecting the design of task-specific detection heads. In practice, they directly adopt the center-based head originally developed for 2D detection, without any specific optimization. This leads to three inherent limitations: (i) a geometric mismatch between the Gaussian kernel used for classification and the real BEV object, (ii) degraded end-to-end performance without Non-Maximum Suppression(NMS), and (iii) sparse supervisory signals. To address these issues, we propose Spe-BEVHead, a detection head specifically tailored for BEV 3D object detection. Spe-BEVHead introduces three BEV-specific adaptations: (1) a Rotated Box Kernel that generates geometry-aligned classification weights, (2) a Local Response Refinement Module (LRRM) that suppresses non-peak responses and improves end-to-end performance, and (3) a dual-branch architecture that provides richer supervisory signals to promote more robust learning while inherently preserving the performance for end-to-end inference. Extensive experiments show that Spe-BEVHead can be seamlessly integrated into existing BEV backbones, delivering direct performance gains while retaining competitive performance under the challenging end-to-end setting.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36896", "url": null, "sourceid": 33514, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37618, "uid": "2096479f1f57167cecb0a029dd9477f3", "name": "Exploring Spatiotemporal Feature Propagation for Video-Level Compressive Spectral Reconstruction: Dataset, Model and Benchmark", "authors": [{"id": 187873, "fullname": "Lijing Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/187873?format=json", "institution": "nanjing university"}, {"id": 187874, "fullname": "Zhan Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/187874?format=json", "institution": "Nanjing University"}, {"id": 143813, "fullname": "Chenglong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/143813?format=json", "institution": "nanjing university"}, {"id": 187875, "fullname": "Jinyao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187875?format=json", "institution": "Nanjing University"}, {"id": 187876, "fullname": "Qiping Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187876?format=json", "institution": "nanjing university"}, {"id": 187877, "fullname": "Zikang Huo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187877?format=json", "institution": "nanjing university"}, {"id": 187878, "fullname": "Linsen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187878?format=json", "institution": "nanjing university"}, {"id": 187879, "fullname": "Chongde Zi", "url": "http://cvpr.thecvf.com/api/miniconf/users/187879?format=json", "institution": "Nanjing University"}, {"id": 85035, "fullname": "Xun Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/85035?format=json", "institution": "Nanjing University"}], "abstract": "Recently, Spectral Compressive Imaging (SCI) has achieved remarkable success, unlocking significant potential for dynamic spectral vision. However, existing reconstruction methods, primarily image-based, suffer from two limitations: (i) Encoding process masks spatial-spectral features, leading to uncertainty in reconstructing missing information from single compressed measurements, and (ii) The frame-by-frame reconstruction paradigm fails to ensure temporal consistency, which is crucial in the video perception. To address these challenges, this paper seeks to advance spectral reconstruction from the image level to the video level, leveraging the complementary features and temporal continuity across adjacent frames in dynamic scenes. Initially, we construct the first high-quality dynamic hyperspectral image dataset (DynaSpec), comprising 30 sequences obtained through frame-scanning acquisition. Subsequently, we propose the Propagation-Guided Spectral Video Reconstruction Transformer (PG-SVRT), which employs a spatial-then-temporal attention to effectively reconstruct spectral features from abundant video information, while using a bridged token to reduce computational complexity. Finally, we conduct simulation experiments to assess the performance of four SCI systems, and construct a DD-CASSI prototype for real-world data collection and benchmarking. Extensive experiments demonstrate that PG-SVRT achieves superior performance in reconstruction quality, spectral fidelity, and temporal consistency, while maintaining minimal FLOPs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37618", "url": null, "sourceid": 40296, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40112, "uid": "5a68e6cc0195b878aa5ff70df000cd5c", "name": "OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation", "authors": [{"id": 193561, "fullname": "Zhuoxiao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193561?format=json", "institution": ""}, {"id": 193562, "fullname": "Hongyang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193562?format=json", "institution": "Oracle"}, {"id": 193563, "fullname": "Ying Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193563?format=json", "institution": "Oracle"}, {"id": 158034, "fullname": "Yadan Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/158034?format=json", "institution": "The University of Queensland"}, {"id": 177233, "fullname": "Long Duong", "url": "http://cvpr.thecvf.com/api/miniconf/users/177233?format=json", "institution": "Oracle"}, {"id": 152754, "fullname": "Yuan-Fang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152754?format=json", "institution": "Monash University"}], "abstract": "Radiology report generation (RRG) aims to automatically produce clinically faithful reports from chest X-ray images. Prevailing work typically follows a scale-driven paradigm, by multi-stage training over large paired corpora and oversized backbones, making pipelines highly data- and compute-intensive. In this paper, we propose Oracle-educated GRPO (OraPO) with a FactScore-based reward (FactS) to tackle the RRG task under constrained budgets. OraPO enables single-stage, RL-only training by converting failed GRPO explorations on rare or difficult studies into direct preference supervision via a lightweight oracle step. FactS grounds learning in diagnostic evidence by extracting atomic clinical facts and checking entailment against ground-truth labels, yielding dense, interpretable sentence-level rewards. Together, OraPO and FactS create a compact and powerful framework that significantly improves learning efficiency on clinically challenging cases, setting the new SOTA performance on the CheXpert Plus dataset (0.341 in F1) with 2--3 orders of magnitude less training data using a small base VLM on modest hardware.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40112", "url": null, "sourceid": 33457, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39805, "uid": "eee683c03a8c960e1309ce1cb5523a1e", "name": "Beyond explicit language: plug-and-play visual-to-Linguistic modeling towards general object tracking", "authors": [{"id": 192891, "fullname": "Kaiyang Lan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192891?format=json", "institution": "Zhejiang University of Technology"}, {"id": 186195, "fullname": "Ying Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/186195?format=json", "institution": "Zhejiang University of Technology"}, {"id": 192883, "fullname": "Chenchen Jing", "url": "http://cvpr.thecvf.com/api/miniconf/users/192883?format=json", "institution": "Zhejiang University of Technology"}, {"id": 127567, "fullname": "Jianwei Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/127567?format=json", "institution": "Zhejiang University of Technology"}, {"id": 182416, "fullname": "Dongyan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/182416?format=json", "institution": "Zhejiang University of Technology"}], "abstract": "Natural language provides valuable auxiliary information for enhancing visual object tracking. While existing vision-language tracking methods explicitly leverage linguistic descriptions to aid tracking, they suffer from two critical limitations: the inability to dynamically adapt descriptions to the moving target and changing context; and the strong dependency on language input may causes failure when text is unavailable. To address the issues, we design a simple yet effective plug-and-play module that leverages linguistic assistance implicitly, without requiring explicit language input. The proposed textual inversion module converts visual features from template and search regions into text tokens in the CLIP text embedding space. It effectively inverts visual representations into linguistic forms, integrating contextual information from the both template and search region. The linguistic cues are then injected into the visual feature space via a multi-layer semantic injection mechanism. The design enhances the completeness of cross-modal feature representations and the accuracy of inter-modal semantic alignment, thus enabling dynamically updated linguistic information guidance for general object tracking. Extensive experiments demonstrate the effectiveness of our proposed method. We integrate the proposed module into several advanced trackers and evaluate on both visual and vision-language tracking datasets, including MCITrack, DUTrack, and SeqTrack.By training only the newly introduced module and the corresponding decoder, the proposed approach achieves significant performance gains with minimal computational overhead. Code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39805", "url": null, "sourceid": 44419, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39164, "uid": "3c2039db61fe90adf34f25f38deb893a", "name": "Beyond Graph Model: Reliable VLM Fine-Tuning via Random Graph Adapter", "authors": [{"id": 190618, "fullname": "Bo Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190618?format=json", "institution": "Anhui University"}, {"id": 191487, "fullname": "Xueyang Ze", "url": "http://cvpr.thecvf.com/api/miniconf/users/191487?format=json", "institution": "Anhui University"}, {"id": 191488, "fullname": "Beibei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191488?format=json", "institution": "Anhui university"}, {"id": 191489, "fullname": "Xixi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191489?format=json", "institution": "Anhui University"}, {"id": 145315, "fullname": "Xixi Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/145315?format=json", "institution": "Anhui University"}, {"id": 188997, "fullname": "Bin Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188997?format=json", "institution": "Anhui University"}], "abstract": "Textual adapter-based tuning methods have shown significant potential in transferring knowledge from pre-trained Vision-Language Models (VLMs) to downstream tasks. Existing works generally employ the deterministic textual feature adapter to refine each category textual representation. However, due to inherent factors such as different attributes and contexts, there exists significant diversity in textual descriptions for each category. Such description diversity offers rich discriminative semantic knowledge that can benefit downstream visual learning tasks. Obviously, traditional deterministic adapter model cannot adequately capture this varied semantic information. Also, it is desirable to exploit the inter-class relationships in VLM adapter. To address these issues, we propose to exploit random graph model into VLM adapter and develop a novel Vertex Random Graph Adapter (VRGAdapter). VRGAdapter first models the inherent diverse descriptions of each category and inter-class relationships of different categories simultaneously by leveraging a Vertex Random Knowledge Graph (VRKG) model. Then, it employs probabilistic message propagation on VRKG to learn context-aware distribution representation for each class node. Finally, it adopts a reparameterized sampling function to achieve textual adapter learning. Note that, VRGAdapter provides a more general adapter solution that encompasses traditional graphbased adapter as a special case. In addition, to enable more robust performance for downstream tasks, we also introduce a new Uncertainty-guided Multi-branch Fusion (UMF) scheme that dynamically integrates multiple pre-trained models for ensemble prediction. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our approach.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39164", "url": null, "sourceid": 33001, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36999, "uid": "b9285ab4c80f0e0d5c090dda9fb3014c", "name": "Text-Image Conditioned 3D Generation", "authors": [{"id": 179946, "fullname": "Jiazhong Cen", "url": "http://cvpr.thecvf.com/api/miniconf/users/179946?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 129706, "fullname": "Jiemin Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129706?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 186427, "fullname": "Sikuang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186427?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 102592, "fullname": "Guanjun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/102592?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 186428, "fullname": "Chen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186428?format=json", "institution": "SJTU; HUAWEI"}, {"id": 107257, "fullname": "Taoran Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/107257?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 71290, "fullname": "Zanwei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/71290?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 186429, "fullname": "zhikuan bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186429?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 89218, "fullname": "Lingxi Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/89218?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 76466, "fullname": "Wei Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76466?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 88171, "fullname": "Qi Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/88171?format=json", "institution": "Huawei Technologies Ltd."}], "abstract": "High-quality 3D assets are critical for VR/AR, industrial design, and entertainment, driving growing interest in generative models that can create 3D content from user-provided prompts. Most existing 3D generators, however, rely on a single conditioning modality: image-conditioned models deliver high visual fidelity by exploiting pixel-aligned cues but suffer from viewpoint bias when the input view is limited or ambiguous, whereas text-conditioned models benefit from broad semantic guidance yet lack low-level visual detail. This restricts how users can express their intent and raises a natural question: can the two modalities be combined to yield more flexible and faithful 3D generation? Our diagnostic study shows that even a simple late fusion of text- and image-conditioned predictions improves over single-modality models, evidencing strong cross-modal complementarity. Building on this finding, we formalize the task of Text\u2013Image Conditioned 3D Generation, which requires joint reasoning over a visual exemplar and a textual specification during generation. To address this task, we introduce TIGON, a minimalist dual-branch baseline that maintains separate image- and text-conditioned backbones with lightweight cross-modal fusion. Extensive experiments demonstrate that text\u2013image conditioning yields consistent gains over single-modality methods, suggesting complementary vision\u2013language guidance as a promising direction for future 3D generation research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36999", "url": null, "sourceid": 46305, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38947, "uid": "84f11ceb99dafe222dde3767eb4fe663", "name": "Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT", "authors": [{"id": 182966, "fullname": "Yesheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182966?format=json", "institution": "The Institute of Automation, Chinese Academy of Sciences"}, {"id": 191029, "fullname": "Hao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191029?format=json", "institution": "Beihang University"}, {"id": 187299, "fullname": "Haiyu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187299?format=json", "institution": "Peking University"}, {"id": 126504, "fullname": "Baoqi Pei", "url": "http://cvpr.thecvf.com/api/miniconf/users/126504?format=json", "institution": "Zhejiang University"}, {"id": 182423, "fullname": "Jiahao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182423?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 187301, "fullname": "Mingxuan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187301?format=json", "institution": "Peking University"}, {"id": 191030, "fullname": "Jing-Shu Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191030?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 147097, "fullname": "Zheqi He", "url": "http://cvpr.thecvf.com/api/miniconf/users/147097?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 148615, "fullname": "Jin-Ge Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/148615?format=json", "institution": "BAAI"}, {"id": 147263, "fullname": "Xi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/147263?format=json", "institution": "Beijing Academy of Artificial Intelligence (BAAI)"}, {"id": 191031, "fullname": "Bowen Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/191031?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 191032, "fullname": "Jiajun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191032?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}], "abstract": "Multiple-choice question answering (MCQA) has been a popular format for evaluating and reinforcement fine-tuning (RFT) of modern multimodal language models. Its constrained output format allows for simplified, deterministic automatic verification.However, we find that the options may leak exploitable signals, which makes the accuracy metrics unreliable for indicating real capabilities and encourages explicit or implicit answer guessing behaviors during RFT.  We propose ReVeL (\\textbf{Re}write and \\textbf{Ve}rify by \\textbf{L}LM), a framework that rewrites multiple-choice questions into open-form questions while keeping answers verifiable whenever possible. The framework categorizes questions according to different answer types, apply different rewriting and verification schemes, respectively. When applied for RFT, we converted 20k MCQA examples and use GRPO to finetune Qwen2.5-VL models. Models trained on ReVeL-OpenQA match MCQA accuracy on multiple-choice benchmarks and improve OpenQA accuracy by about six percentage points, indicating better data efficiency and more robust reward signals than MCQA-based training. When used for evaluation, ReVeL also reveals up to 20 percentage points of score inflation in MCQA benchmarks (relative to OpenQA), improves judging accuracy, and reduces both cost and latency. We will release code and data publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38947", "url": null, "sourceid": 44300, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36560, "uid": "0ca28c19a7db0b4d5e3f17829bbe29b8", "name": "Residual Connections Harm Self-Supervised Abstract Feature Learning", "authors": [{"id": 128481, "fullname": "Xiao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128481?format=json", "institution": "University of Chicago"}, {"id": 185339, "fullname": "Ruoxi Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185339?format=json", "institution": "Fudan University; Shanghai Academy of Artificial Intelligence for Science"}, {"id": 185340, "fullname": "Will Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185340?format=json", "institution": "University of Chicago"}, {"id": 152310, "fullname": "Rebecca Willett", "url": "http://cvpr.thecvf.com/api/miniconf/users/152310?format=json", "institution": "University of Chicago"}, {"id": 95062, "fullname": "Michael Maire", "url": "http://cvpr.thecvf.com/api/miniconf/users/95062?format=json", "institution": "University of Chicago"}], "abstract": "We show that introducing a weighting factor to reduce the influence of identity shortcuts in residual networks significantly enhances semantic feature learning in generative representation learning frameworks, such as masked autoencoders (MAEs) and diffusion models. Our modification improves linear probing accuracy for both, notably increasing ImageNet accuracy from 67.8% to 72.7% for MAEs with a VIT-B/16 backbone, while also boosting generation quality for diffusion models. This significant gap suggests that, while residual connection structure serves an essential role in facilitating gradient propagation, it may have a harmful side effect of reducing capacity for abstract learning by virtue of injecting an echo of shallower representations into deeper layers. We ameliorate this downside via a fixed formula for monotonically decreasing the contribution of identity connections as layer depth increases. Our design promotes the gradual development of feature abstractions, without impacting network trainability. Analyzing the representations learned by our modified residual networks, we find correlation between low effective feature rank and downstream task performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36560", "url": null, "sourceid": 33845, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37503, "uid": "d0e0749588db30f8a91be3067b88e2b6", "name": "RealAppiance: Let High-fidelity Appliance Assets Controllable and Workable as Aligned Real Manauls", "authors": [{"id": 182390, "fullname": "Yuzheng Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/182390?format=json", "institution": "Peking University"}, {"id": 155830, "fullname": "Yuxing Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/155830?format=json", "institution": "Peking University"}, {"id": 160138, "fullname": "Lei Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/160138?format=json", "institution": "Peking University"}, {"id": 187594, "fullname": "Yuchong Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187594?format=json", "institution": "Peking University"}, {"id": 187595, "fullname": "Ziyan Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187595?format=json", "institution": "Peking University"}, {"id": 187596, "fullname": "Shangqing Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187596?format=json", "institution": "Peking University"}, {"id": 155831, "fullname": "Jiyao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155831?format=json", "institution": "Peking University"}, {"id": 129288, "fullname": "Ruihai Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129288?format=json", "institution": "Peking University"}, {"id": 148926, "fullname": "Dongjiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/148926?format=json", "institution": "JD.com"}, {"id": 187597, "fullname": "Hui Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187597?format=json", "institution": "jd.com"}, {"id": 76571, "fullname": "Hao Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76571?format=json", "institution": "Peking University"}], "abstract": "Existing appliance assets suffer from poor rendering, incomplete mechanisms, and misalignment with manuals, leading to simulation-reality gaps that hinder appliance manipulation development. In this work, we introduce the RealAppliance dataset, comprising 100 high-fidelity appliances with complete physical, electronic mechanisms, and program logic aligned with their manuals. Based on these assets, we propose the RealAppliance-Bench benchmark, which evaluates multimodal large language models and embodied manipulation planning models across key tasks in appliance manipulation planning: manual page retrieval, appliance part grounding, open-loop manipulation planning, and closed-loop planning adjustment. Our analysis of model performances on RealAppliance-Bench provides insights for advancing appliance manipulation research", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37503", "url": null, "sourceid": 45757, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40176, "uid": "be67958ef1e8199e28f765c75287d07c", "name": "Ground Reaction Inertial Poser: Physics-based Human Motion Capture from Sparse IMUs and Insole Pressure Sensors", "authors": [{"id": 166496, "fullname": "Ryosuke Hori", "url": "http://cvpr.thecvf.com/api/miniconf/users/166496?format=json", "institution": "Keio University"}, {"id": 159763, "fullname": "Jyun-Ting Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/159763?format=json", "institution": "Carnegie Mellon University"}, {"id": 96336, "fullname": "Zhengyi Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/96336?format=json", "institution": "Carnegie Mellon University"}, {"id": 188073, "fullname": "Jinkun Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188073?format=json", "institution": "Amazon FAR"}, {"id": 119973, "fullname": "Soyong Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/119973?format=json", "institution": "Carnegie Mellon University"}, {"id": 93489, "fullname": "HIDEO SAITO", "url": "http://cvpr.thecvf.com/api/miniconf/users/93489?format=json", "institution": "Keio University"}, {"id": 88213, "fullname": "Kris Kitani", "url": "http://cvpr.thecvf.com/api/miniconf/users/88213?format=json", "institution": "Carnegie Mellon University"}], "abstract": "We propose Ground Reaction Inertial Poser (GRIP), a method that reconstructs physically plausible human motion using four wearable devices. Unlike conventional IMU-only approaches, GRIP combines IMU signals with foot pressure data to capture both body dynamics and ground interactions. Furthermore, rather than relying solely on kinematic estimation, GRIP uses a digital twin of a person, in the form of a synthetic humanoid in a physics simulator, to reconstruct realistic and physically plausible motion. At its core, GRIP consists of two modules: KinematicsNet, which estimates body poses and velocities from sensor data, and DynamicsNet, which controls the humanoid in the simulator using the residual between the KinematicsNet prediction and the simulated humanoid state. To enable robust training and fair evaluation, we introduce a large-scale dataset, Pressure and Inertial Sensing for Human Motion and Interaction (PRISM), that captures diverse human motions with synchronized IMUs and insole pressure sensors. Experimental results show that GRIP outperforms existing IMU-only and IMU\u2013pressure fusion methods across all evaluated datasets, achieving higher global pose accuracy and improved physical consistency. Code and data will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40176", "url": null, "sourceid": 39809, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40185, "uid": "c0a94cff7a1e60bd1c87bc2f4d0d14c4", "name": "Repurposing 3D Generative Model for Autoregressive Layout Generation", "authors": [{"id": 172265, "fullname": "Haoran Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/172265?format=json", "institution": "Tsinghua University"}, {"id": 193560, "fullname": "Yifan Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193560?format=json", "institution": "Beihang University"}, {"id": 126967, "fullname": "Zehuan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126967?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 102107, "fullname": "Yangtian Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/102107?format=json", "institution": "University of Hong Kong"}, {"id": 129664, "fullname": "Chunchao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/129664?format=json", "institution": "Tencent"}, {"id": 87023, "fullname": "Yuxin Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87023?format=json", "institution": "Peking University"}, {"id": 86998, "fullname": "Lu Sheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86998?format=json", "institution": "Beihang University"}], "abstract": "We introduce LaviGen, a framework that repurposes 3D generative models for 3D layout generation. Unlike previous methods that infer object layouts from textual descriptions, LaviGen operates directly in the native 3D space, formulating layout generation as an autoregressive process that explicitly models geometric relations and physical constraints among objects, producing coherent and physically plausible 3D scenes. To further enhance this process, we propose an adapted 3D diffusion model to integrate scene, object, and instruction information and employ a dual-guidance self-rollout distillation mechanism to improve efficiency and spatial accuracy. Extensive experiments on the LayoutVLM benchmark show LaviGen achieves superior 3D layout generation performance, with 19\\% higher physical plausibility than the state of the art and 65\\% faster computation. We will release our code publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40185", "url": null, "sourceid": 45795, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40111, "uid": "6ba667e88fcf8fa9c8d633ca5f198a48", "name": "Stereo World Model", "authors": [{"id": 102107, "fullname": "Yangtian Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/102107?format=json", "institution": "University of Hong Kong"}, {"id": 126967, "fullname": "Zehuan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126967?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 193560, "fullname": "Yifan Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193560?format=json", "institution": "Beihang University"}, {"id": 190427, "fullname": "Lin Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/190427?format=json", "institution": null}, {"id": 84806, "fullname": "Yan-Pei Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84806?format=json", "institution": "Tencent ARC Lab"}, {"id": 190428, "fullname": "Yuewen Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/190428?format=json", "institution": "ByteDance"}, {"id": 88776, "fullname": "Xiaojuan Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/88776?format=json", "institution": "University of Oxford"}], "abstract": "We present StereoWorld, a camera-conditioned stereo world model that jointly learns appearance and binocular geometry for end-to-end stereo video generation.Unlike monocular RGB or RGBD approaches, StereoWorld operates exclusively within the RGB modality, while simultaneously grounding geometry directly from disparity. To efficiently achieve consistent stereo generation, our approach introduces two key designs: (1) a unified camera-frame RoPE that augments latent tokens with camera-aware rotary positional encoding, enabling relative, view- and time-consistent conditioning while preserving pretrained video priors via a stable attention initialization; and (2) a stereo-aware attention decomposition that factors full 4D attention into 3D intra-view attention plus horizontal row attention, leveraging the epipolar prior to capture disparity-aligned correspondences with substantially lower compute. Across benchmarks, StereoWorld improves stereo consistency, disparity accuracy, and camera-motion fidelity over strong  monocular-then-convert pipelines, achieving more than 3x faster generation with an additional 5% gain in viewpoint consistency. Beyond benchmarks, StereoWorld enables end-to-end binocular VR rendering without depth estimation or inpainting, enhances embodied policy learning through metric-scale depth grounding, and is compatible with long-video distillation for extended interactive stereo synthesis.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40111", "url": null, "sourceid": 36497, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39038, "uid": "ec0c5dbf1690d83f53fad507995b0bd6", "name": "MM-ACT: Learn from Multimodal Parallel Generation to Act", "authors": [{"id": 174514, "fullname": "Haotian Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174514?format=json", "institution": "University of Science and Technology of China"}, {"id": 191225, "fullname": "Xinyi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191225?format=json", "institution": "Fudan University"}, {"id": 149245, "fullname": "Bin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/149245?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 191226, "fullname": "MingKang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191226?format=json", "institution": null}, {"id": 191227, "fullname": "Yitian Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191227?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 191228, "fullname": "Yuhao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191228?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 154770, "fullname": "Zanxin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/154770?format=json", "institution": "Shenzhen University"}, {"id": 185099, "fullname": "Tianshuo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185099?format=json", "institution": "University of Hong Kong"}, {"id": 155398, "fullname": "Yilun Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/155398?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 88208, "fullname": "Jiangmiao Pang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88208?format=json", "institution": "Shanghai AI Laboratory "}, {"id": 87804, "fullname": "Dong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87804?format=json", "institution": "University of Science and Technology of China"}, {"id": 86696, "fullname": "Xiaokang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86696?format=json", "institution": "Shanghai Jiao Tong University, China"}, {"id": 185103, "fullname": "Yao Mu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185103?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 87374, "fullname": "Wenqi Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/87374?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 86834, "fullname": "Ping Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/86834?format=json", "institution": "The University of Hong Kong"}], "abstract": "A generalist robotic policy needs both semantic understanding for task planning and the ability to interact with the environment through predictive capabilities. To tackle this, we present MM-ACT, a unified Vision-Language-Action(VLA) model that integrates text, image, and action in shared token space and performs generation across all three modalities. MM-ACT adopts a re-mask parallel decoding strategy for text and image generation, and employs a one-step parallel decoding strategy for action generation to improve efficiency. We introduce Context\u2011Shared Multimodal Learning, a unified training paradigm that supervises generation in all three modalities from a shared context, enhancing action generation through cross-modal task learning.Experiments were conducted on the LIBERO simulation and Franka real-robot setups as well as RoboTwin2.0 to assess in-domain and out-of-domain performances, respectively. Our approach achieves a success rate of 96.3\\% on LIBERO,  62.2\\% across four tasks of Franka, and 52.38\\% across eight bimanual tasks of RoboTwin2.0, with an additional gain of 9.25\\% from text-image co-training.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39038", "url": null, "sourceid": 39827, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37714, "uid": "37ce349008efb55b22c8594ca047413c", "name": "SAM 3D Body: Robust Full-Body Human Mesh Recovery", "authors": [{"id": 85606, "fullname": "Xitong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85606?format=json", "institution": "Meta"}, {"id": 127724, "fullname": "Devansh Kukreja", "url": "http://cvpr.thecvf.com/api/miniconf/users/127724?format=json", "institution": "Carnegie Mellon University"}, {"id": 188071, "fullname": "Don Pinkus", "url": "http://cvpr.thecvf.com/api/miniconf/users/188071?format=json", "institution": "Indepedent"}, {"id": 188072, "fullname": "Taosha Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188072?format=json", "institution": null}, {"id": 89967, "fullname": "Jinhyung Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/89967?format=json", "institution": "Carnegie Mellon University"}, {"id": 119973, "fullname": "Soyong Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/119973?format=json", "institution": "Carnegie Mellon University"}, {"id": 188073, "fullname": "Jinkun Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188073?format=json", "institution": "Amazon FAR"}, {"id": 188074, "fullname": "Jia-Wei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188074?format=json", "institution": "Facebook"}, {"id": 188075, "fullname": "Nicol\u00e1s Ugrinovic", "url": "http://cvpr.thecvf.com/api/miniconf/users/188075?format=json", "institution": null}, {"id": 188076, "fullname": "Anushka Sagar", "url": "http://cvpr.thecvf.com/api/miniconf/users/188076?format=json", "institution": "Meta"}, {"id": 75437, "fullname": "Jitendra Malik", "url": "http://cvpr.thecvf.com/api/miniconf/users/75437?format=json", "institution": "University of California at Berkeley"}, {"id": 75834, "fullname": "Matt Feiszli", "url": "http://cvpr.thecvf.com/api/miniconf/users/75834?format=json", "institution": "Meta AI"}, {"id": 188077, "fullname": "Piotr Doll\u00e1r", "url": "http://cvpr.thecvf.com/api/miniconf/users/188077?format=json", "institution": "Thinking Machines"}, {"id": 88213, "fullname": "Kris Kitani", "url": "http://cvpr.thecvf.com/api/miniconf/users/88213?format=json", "institution": "Carnegie Mellon University"}], "abstract": "We introduce SAM 3D Body (3DB), a promptable model for single-image full-body 3D human mesh recovery (HMR) that demonstrates state-of-the-art performance, with strong generalization and consistent accuracy in diverse in-the-wild conditions. 3DB estimates the human pose of the body, feet, and hands. It is the first model to use a new parametric mesh representation, Momentum Human Rig (MHR), which decouples skeletal pose and body shape. 3DB employs an encoder\u2013decoder architecture and supports auxiliary prompts, including 2D keypoints and masks, enabling user-guided inference similar to the SAM family of models. We derive high-quality annotations from a multi-stage annotation pipeline that uses various combinations of manual keypoint annotation, differentiable optimization, multi-view geometry, and dense keypoint detection. Our data engine efficiently selects and processes data to ensure data diversity, collecting unusual poses and rare imaging conditions. We present a new evaluation dataset organized by pose and appearance categories, enabling nuanced analysis of model behavior. Our experiments demonstrate superior generalization and substantial improvements over prior methods in both qualitative user preference studies and traditional quantitative analysis. Both 3DB and MHR are open-source.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37714", "url": null, "sourceid": 39197, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40311?format=json"], "related_events_ids": [40311]}, {"id": 38768, "uid": "3277651248e7ccf4b6e6a8b1763508fe", "name": "Cov2Pose: Leveraging Spatial Covariance for Direct Manifold-aware 6-DoF Object Pose Estimation", "authors": [{"id": 180089, "fullname": "Nassim ALI OUSALAH", "url": "http://cvpr.thecvf.com/api/miniconf/users/180089?format=json", "institution": "University of Luxembourg"}, {"id": 190626, "fullname": "Peyman Abendansari", "url": "http://cvpr.thecvf.com/api/miniconf/users/190626?format=json", "institution": "University of Luxemburg"}, {"id": 162221, "fullname": "Vincent Gaudilli\u00e8re", "url": "http://cvpr.thecvf.com/api/miniconf/users/162221?format=json", "institution": "Inria"}, {"id": 190627, "fullname": "Emmanuel Koumandakis", "url": "http://cvpr.thecvf.com/api/miniconf/users/190627?format=json", "institution": "Infinite Orbits SAS"}, {"id": 126663, "fullname": "Anis Kacem", "url": "http://cvpr.thecvf.com/api/miniconf/users/126663?format=json", "institution": "University of Luxemburg"}, {"id": 126667, "fullname": "Enjie Ghorbel", "url": "http://cvpr.thecvf.com/api/miniconf/users/126667?format=json", "institution": "(1) CRISTAL laboratory, ENSI, University of Manouba; (2) SnT, University of Luxembourg"}, {"id": 93028, "fullname": "Djamila Aouada", "url": "http://cvpr.thecvf.com/api/miniconf/users/93028?format=json", "institution": "SnT, University of Luxembourg"}], "abstract": "In this paper, we address the problem of 6-DoF object pose estimation from a single RGB image. Indirect methods that typically predict intermediate 2D keypoints, followed by a Perspective-$n$-Point solver, have shown great performance. Direct approaches, which regress the pose in an end-to-end manner, are usually computationally more efficient but less accurate. However, direct heads rely on globally pooled features, ignoring spatial second-order statistics despite their informativeness in pose prediction. They also predict, in most cases, discontinuous pose representations that lack robustness.  Herein, we therefore propose a covariance-pooled representation that encodes convolutional feature distributions as a symmetric positive definite (SPD) matrix. Moreover, we propose a novel pose encoding in the form of an SPD matrix via its Cholesky decomposition. Pose is then regressed in an end-to-end manner with a manifold-aware network head, taking into account the Riemannian geometry of SPD matrices. Experiments and ablations consistently demonstrate the relevance of second-order pooling and continuous representations for direct pose regression, including under partial occlusion.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38768", "url": null, "sourceid": 30996, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39142, "uid": "6a6a609a5ef03909b9297f07351739b2", "name": "Enhancing Visual Representation with Textual Semantics: Textual Semantics-Powered Prototypes for Heterogeneous Federated Learning", "authors": [{"id": 182886, "fullname": "Xinghao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182886?format=json", "institution": "Beihang University"}, {"id": 191436, "fullname": "Jianwei Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191436?format=json", "institution": "Beihang University"}, {"id": 191437, "fullname": "Xuefeng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191437?format=json", "institution": null}, {"id": 191438, "fullname": "Guogang Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191438?format=json", "institution": null}, {"id": 191439, "fullname": "Jiayuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191439?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 191440, "fullname": "Shaojie Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191440?format=json", "institution": "State University of New York at Buffalo"}, {"id": 191441, "fullname": "Wei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191441?format=json", "institution": "Beihang University"}], "abstract": "Federated Prototype Learning (FedCL) has emerged as an effective strategy for handling data heterogeneity in Federated Learning (FL). In FedCL, clients collaboratively construct a set of global feature centers (prototypes), and let local features align with these prototypes to mitigate the effects of data heterogeneity. The performance of FedCL highly depends on the quality of prototypes. Existing methods assume that larger inter-class distances among prototypes yield better performance, and thus design different methods to increase these distances. However, we observe that while these methods increase prototype distances to enhance class discrimination, they inevitably disrupt essential semantic relationships among classes, which are crucial for model generalization. This raises an important question: how to construct prototypes that inherently preserve semantic relationships among classes? Directly learning these relationships from limited and heterogeneous client data can be problematic in FL. Recently, the success of pre-trained language models (PLMs) demonstrates their ability to capture semantic relationships from vast textual corpora. Motivated by this, we propose FedTSP, a novel method that leverages PLMs to construct semantically enriched prototypes from the textual modality, enabling more effective collaboration in heterogeneous data settings. We first use a large language model (LLM) to generate fine-grained textual descriptions for each class, which are then processed by a PLM on the server to form textual prototypes. To address the modality gap between client image models and the PLM, we introduce trainable prompts, allowing prototypes to adapt better to client tasks. Extensive experiments demonstrate that FedTSP mitigates data heterogeneity while significantly accelerating convergence.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39142", "url": null, "sourceid": 32626, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39185, "uid": "b4a2d301ddc8a3e8c500551900bdffd4", "name": "Time Without Time: Pseudo-Temporal Representation for Space-Time Super-Resolution", "authors": [{"id": 191530, "fullname": "Hee Min Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/191530?format=json", "institution": "Samsung Electronics"}, {"id": 191531, "fullname": "Hyoa Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191531?format=json", "institution": "Samsung Electronics Co., Ltd.; Seoul National University"}, {"id": 191532, "fullname": "Suji Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/191532?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology; Samsung Advanced Institute of Technology"}, {"id": 191533, "fullname": "Dokwan Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/191533?format=json", "institution": "Samsung Advanced Institute of Technology"}, {"id": 77105, "fullname": "Nam Ik Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/77105?format=json", "institution": "Seoul National University"}], "abstract": "Space-time video super-resolution (STVSR) is a task aimed at simultaneously upsampling a video in both spatial and temporal dimensions. Previous studies on STVSR have primarily focused on task-specific architectures and modeling paradigms, while effective pretraining strategies remain underexplored. In this paper, we propose a pseudo-temporal space\u2013time reconstruction pretraining framework for STVSR networks that enables effective use of image datasets, which naturally provide strong spatial cues. Each training sample is constructed by duplicating a single image into a pseudo-temporal video and independently zero-filling random pixel regions across its frames. Instead of designing a separate pretraining module, we pretrain the STVSR network on a task aligned with its core objectives of spatial restoration and cross-frame aggregation. The model learns to reconstruct clean, higher-spatio-temporal-resolution outputs from degraded, pseudo-temporal inputs, with a modulation factor encouraging greater focus on difficult regions. Extensive experiments show that our simple pretraining significantly improves STVSR performance and outperforms existing video representation learning approaches. We note our method is effective even when pretraining and finetuning with a limited quantity of data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39185", "url": null, "sourceid": 41216, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40245, "uid": "307527f309aa633edcc5f228a5533f25", "name": "Forging a Dynamic Memory: Retrieval-Guided Continual Learning for Generalist Medical Foundation Models", "authors": [{"id": 184821, "fullname": "Zizhi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184821?format=json", "institution": "Fudan University"}, {"id": 193868, "fullname": "Yizhen Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193868?format=json", "institution": "Central South University"}, {"id": 145888, "fullname": "Minghao Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/145888?format=json", "institution": "Fudan University"}, {"id": 187008, "fullname": "Yizhou Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187008?format=json", "institution": "Fudan University"}, {"id": 91020, "fullname": "Zhaoyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/91020?format=json", "institution": "Fudan University"}, {"id": 91003, "fullname": "Dingkang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91003?format=json", "institution": "Fudan University"}, {"id": 91007, "fullname": "Lihua Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91007?format=json", "institution": "Fudan University"}], "abstract": "Multimodal biomedical Vision-Language Models (VLMs) exhibit immense potential in the field of Continual Learning (CL). However, they confront a core dilemma: how to preserve fine-grained intra-modality features while bridging the significant domain gap across different modalities. To address this challenge, we propose a comprehensive framework. Leveraging our 18-million multimodal and comprehensive medical retrieval database derived from PubMed scientific papers, we pioneer the integration of Retrieval-Augmented Generation (RAG) into CL. Specifically, we employ a multi-modal, multi-layer RAG system that provides real-time guidance for model fine-tuning through dynamic, on-demand knowledge retrieval. Building upon this, we introduce a dynamic knowledge distillation framework. This framework precisely resolves the aforementioned core dilemma by dynamically modulating the importance of the parameter space, the granularity of the distilled knowledge, and the data distribution of the reference dataset in accordance with the required level of detail. To thoroughly validate the clinical value of our strategy, we have designed a more rigorous Medical Generalist Task Incremental Learning (MGTIL) benchmark. This benchmark is engineered to simultaneously evaluate the model's capacity for adaptation to significant domain shifts, retention of subtle intra-domain features, and real-time learning of novel and complex medical tasks. Extensive experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) performance across all metrics. The code is provided in the supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40245", "url": null, "sourceid": 43001, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37248, "uid": "2488e9d2b7edf9d573d55e5339ae5b9e", "name": "Beyond Pixel Simulation: Pathology Image Generation via Diagnostic Semantic Tokens and Prototype Control", "authors": [{"id": 145888, "fullname": "Minghao Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/145888?format=json", "institution": "Fudan University"}, {"id": 187007, "fullname": "Yichen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187007?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 187008, "fullname": "Yizhou Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187008?format=json", "institution": "Fudan University"}, {"id": 184821, "fullname": "Zizhi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184821?format=json", "institution": "Fudan University"}, {"id": 126844, "fullname": "Jingqun Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126844?format=json", "institution": "Bytedance"}, {"id": 153724, "fullname": "Xuecheng Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153724?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 91003, "fullname": "Dingkang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91003?format=json", "institution": "Fudan University"}, {"id": 91007, "fullname": "Lihua Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91007?format=json", "institution": "Fudan University"}], "abstract": "In computational pathology, understanding and generation have evolved along disparate paths: advanced understanding models already exhibit diagnostic-level competence, whereas generative models largely simulate pixels. Progress remains hindered by three coupled factors: the scarcity of large, high-quality image\u2013text corpora; the lack of precise, fine-grained semantic control, which forces reliance on non-semantic cues; and terminological heterogeneity, where diverse phrasings for the same diagnostic concept impede reliable text conditioning. We introduce UniPath, a semantics-driven pathology image generation framework that leverages mature diagnostic understanding to enable controllable generation. UniPath implements Multi-Stream Control: a Raw-Text stream; a High-Level Semantics stream that uses learnable queries to a frozen pathology MLLM to distill paraphrase-robust Diagnostic Semantic Tokens and to expand prompts into diagnosis-aware attribute bundles; and a Prototype stream that affords component-level morphological control via a prototype bank. On the data front, we curate a 2.65M image\u2013text corpus and a finely annotated, high-quality 68K subset to alleviate data scarcity. For a comprehensive assessment, we establish a four-tier evaluation hierarchy tailored to pathology. Extensive experiments demonstrate UniPath's SOTA performance, including a Patho\u2011FID of 80.9 (51% better than the second-best) and fine-grained semantic control, achieving 98.7% of the real-image. Our curated datasets, code, and model weights will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37248", "url": null, "sourceid": 34921, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36227, "uid": "deba1b8aa887b1342b5b03f8f194548f", "name": "Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing", "authors": [{"id": 184496, "fullname": "Cheng Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/184496?format=json", "institution": "Baidu"}, {"id": 175159, "fullname": "Ting Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/175159?format=json", "institution": "Baidu"}, {"id": 184497, "fullname": "Suyin Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184497?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 184498, "fullname": "Tingquan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184498?format=json", "institution": "Baidu"}, {"id": 184499, "fullname": "Zelun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184499?format=json", "institution": "Baidu"}, {"id": 90373, "fullname": "Jiaxuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90373?format=json", "institution": "Nankai University"}, {"id": 184500, "fullname": "Xueqing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184500?format=json", "institution": "Baidu"}, {"id": 184501, "fullname": "Changda Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184501?format=json", "institution": "Baidu"}, {"id": 184502, "fullname": "Hongen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184502?format=json", "institution": null}, {"id": 184503, "fullname": "Lin Manhui", "url": "http://cvpr.thecvf.com/api/miniconf/users/184503?format=json", "institution": "Baidu"}, {"id": 184504, "fullname": "Yue Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184504?format=json", "institution": "Baidu"}, {"id": 175650, "fullname": "yubo zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/175650?format=json", "institution": "Baidu"}, {"id": 181263, "fullname": "Jing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181263?format=json", "institution": "Baidu"}, {"id": 184505, "fullname": "Jun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184505?format=json", "institution": "Baidu"}, {"id": 76677, "fullname": "Xing Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/76677?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 105233, "fullname": "Yi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/105233?format=json", "institution": "Baidu"}, {"id": 184506, "fullname": "Dianhai Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184506?format=json", "institution": "Peking University; Baidu Inc."}, {"id": 184507, "fullname": "Yanjun Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/184507?format=json", "institution": "Baidu"}], "abstract": "Document parsing is a fine-grained task where image resolution significantly impacts performance. While advanced research leveraging vision-language models benefits from high-resolution input to boost model performance, this often leads to a quadratic increase in the number of vision tokens and significantly raises computational costs. We attribute this inefficiency to substantial visual regions redundancy in document images, like background. To tackle this, we propose PaddleOCR-VL, a novel coarse-to-fine architecture that focuses on semantically relevant regions while suppressing redundant ones, thereby improving both efficiency and performance. Specifically, we introduce a lightweight Valid Region Focus Module (VRFM) which leverages localization and contextual relationship prediction capabilities to identify valid vision tokens. Subsequently, we design and train a compact yet powerful 0.9B vision-language model (PaddleOCR-VL-0.9B) to perform detailed recognition, guided by VRFM outputs to avoid direct processing of the entire large image. Extensive experiments demonstrate that PaddleOCR-VL achieves state-of-the-art performance in both page-level parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference while utilizing substantially fewer vision tokens and parameters, highlighting the effectiveness of targeted coarse-to-fine parsing for accurate and efficient document understanding.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36227", "url": null, "sourceid": 41247, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36566, "uid": "4f9fcbf18f62c9f0be2a96dfe3708bc4", "name": "ExPose: Reinforcing Video Generation Models for Extreme Pose Estimation", "authors": [{"id": 76876, "fullname": "Youngho Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/76876?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 185366, "fullname": "Wonjune Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/185366?format=json", "institution": "Naver Labs Korea"}, {"id": 185367, "fullname": "Hyunho Ha", "url": "http://cvpr.thecvf.com/api/miniconf/users/185367?format=json", "institution": "Naver Labs"}, {"id": 164628, "fullname": "Sujung Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/164628?format=json", "institution": "Naver Labs"}, {"id": 76867, "fullname": "Kuk-Jin Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/76867?format=json", "institution": "KAIST"}], "abstract": "Pose estimation remains challenging under sparse views, especially when visual overlap across images is extremely limited. Recent advances in video generation models offer a promising solution by enabling keyframe interpolation, which can enrich contextual cues and improve pose estimation performance. However, existing video generation models often lack 3D consistency, producing temporally plausible but spatially inconsistent frames that degrade downstream pose estimation. In this paper, we propose a framework ExPose that directly addresses 3D inconsistency when applying video generation to pose estimation in extreme-view settings. Specifically, we fine-tune a video generation model using Group Relative Preference Optimization (GRPO), aligning its outputs with 3D-consistent supervisory signals derived from pose estimation objectives. Our approach not only enhances the quality of temporal interpolation, but also ensures spatial coherence across views, significantly improving pose estimation accuracy. Extensive experiments demonstrate that our method outperforms state-of-the-art baselines, highlighting the potential of preference-optimized video generation as a powerful tool for pose estimation in extreme-view scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36566", "url": null, "sourceid": 36970, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36759, "uid": "acc51fbcbcb7e218430df58a98d4a181", "name": "EMMA: Extracting Multiple physical parameters from Multimodal Data", "authors": [{"id": 185809, "fullname": "Farhat Shaikh", "url": "http://cvpr.thecvf.com/api/miniconf/users/185809?format=json", "institution": "Arizona State University"}, {"id": 185810, "fullname": "Ayan Banerjee", "url": "http://cvpr.thecvf.com/api/miniconf/users/185810?format=json", "institution": "Arizona State University"}, {"id": 185811, "fullname": "Sandeep Gupta", "url": "http://cvpr.thecvf.com/api/miniconf/users/185811?format=json", "institution": "Arizona State University"}], "abstract": "We introduce EMMA, a physics-informed multimodal framework that recovers all identifiable dynamical parameters of a system directly from raw video, audio, and image-based time-series observations. Unlike prior video-only approaches that struggle with occluded states, hidden actuation inputs, or assumptions about known initial conditions and coordinate frames, EMMA performs joint inference of explicit parameters, implicit dynamical components, and calibration invariants within a unified continuous-time model. EMMA leverages a Liquid Time-Constant (LTC) network to learn latent dynamics from heterogeneous modalities while a physics-constrained loss enforces consistency with the governing differential equations. A unified feature pipeline enables consistent alignment across video trajectories, acoustic signatures, and chart-derived measurements, allowing EMMA to estimate parameters under forced, implicit, and multivariate dynamics without requiring segmentation masks, differentiable rendering, or specialized sensors. Across 100+ scenarios including five standard dynamical benchmarks (75 Delfys videos), real-world rover and quadrotor systems with hidden inputs, and simulation-chart case studies spanning biological and chaotic systems EMMA delivers robust multi-parameter recovery and significantly outperforms existing single-modality and equation-discovery baselines. Our results establish EMMA as a general, scalable solution for physics-consistent model extraction from opportunistic multimodal data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36759", "url": null, "sourceid": 42612, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39067, "uid": "e6a07b64cc15d0ce53064207ece842dd", "name": "FAVE: A Structured Benchmark for Fine-Grained Audio-Visual Temporal Evaluation in Multimodal LLMs", "authors": [{"id": 191291, "fullname": "Weiheng Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191291?format=json", "institution": "Peking University"}, {"id": 182608, "fullname": "An Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182608?format=json", "institution": "State University of New York at Albany"}, {"id": 86922, "fullname": "Jian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86922?format=json", "institution": "NJUST"}, {"id": 191292, "fullname": "Zhenfei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191292?format=json", "institution": ""}, {"id": 191293, "fullname": "Felix X. Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/191293?format=json", "institution": "State University of New York at Albany"}, {"id": 75494, "fullname": "Ming-Ching Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75494?format=json", "institution": "University at Albany - SUNY"}], "abstract": "Audio-visual large language models (AVLLMs) have made significant strides in understanding visual and auditory content. However, their ability to capture fine-grained temporal relationships between audio and visual streams remains insufficiently evaluated. To address this, we introduce FAVE (Fine-grained Audio-Visual Temporal Evaluation), a comprehensive benchmark targeting three core dimensions of temporal perception: cross-modal temporal alignment (FAVE-align), event temporal relationship (FAVE-low), and detailed moment captioning (FAVE-high). To construct FAVE, we propose a scalable annotation pipeline that integrates shot boundary detection, automated captioning, and GPT-assisted refinement to produce temporally grounded, high-quality data. Extensive experiments on twelve state-of-the-art multimodal LLMs, both open-source and closed-source, reveal key limitations in multimodal integration, temporal relationship and timestamp localization, especially for joint audio-visual tasks. These findings highlight the need for better temporal modeling to improve AVLLMs' understanding of real-world video content. FAVE serves as a rigorous testbed for advancing temporally aware multimodal systems, and will be publicly released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39067", "url": null, "sourceid": 35733, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38310, "uid": "6da987f4db3577a16f958f27c6d0a251", "name": "RINO: Rotation-Invariant Non-Rigid Correspondences", "authors": [{"id": 107534, "fullname": "Maolin Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/107534?format=json", "institution": "Technical University Munich"}, {"id": 189569, "fullname": "Shao Jie Hu-Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189569?format=json", "institution": "Technical University of Munich"}, {"id": 189570, "fullname": "Congyue Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189570?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 91088, "fullname": "Riccardo Marin", "url": "http://cvpr.thecvf.com/api/miniconf/users/91088?format=json", "institution": "Eberhard-Karls-Universit\u00e4t T\u00fcbingen"}, {"id": 75804, "fullname": "Leonidas Guibas", "url": "http://cvpr.thecvf.com/api/miniconf/users/75804?format=json", "institution": "Stanford University"}, {"id": 84985, "fullname": "Daniel Cremers", "url": "http://cvpr.thecvf.com/api/miniconf/users/84985?format=json", "institution": "Technical University Munich"}], "abstract": "Dense 3D shape correspondence remains a central challenge in computer vision and graphics as many deep learning approaches still rely on intermediate geometric features or handcrafted descriptors,  limiting their effectiveness under non-isometric deformations, partial data, and non-manifold inputs. To overcome these issues, we introduce RINO, an unsupervised, rotation-invariant dense correspondence framework that effectively unifies rigid and non-rigid shape matching. The core of our method is the novel RINONet, a feature extractor that integrates vector-based SO(3)-invariant learning with orientation-aware complex functional maps to extract robust features directly from raw geometry. This allows for a fully end-to-end, data-driven approach that bypasses the need for shape pre-alignment or handcrafted features. Extensive experiments show unprecedented performance of RINO across challenging non-rigid matching tasks, including arbitrary poses, non-isometry, partiality, non-manifoldness, and noise.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38310", "url": null, "sourceid": 41232, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39944, "uid": "6a734236fe497a0bdbb0018bfc62faaf", "name": "Toward Generalizable Whole Brain Representations with High-Resolution Light-Sheet Data", "authors": [{"id": 193166, "fullname": "Minyoung Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/193166?format=json", "institution": "LifeCanvas Technologies"}, {"id": 193167, "fullname": "Dae Hee Yun", "url": "http://cvpr.thecvf.com/api/miniconf/users/193167?format=json", "institution": "LifeCanvas Technologies"}, {"id": 193168, "fullname": "Aditi Patel", "url": "http://cvpr.thecvf.com/api/miniconf/users/193168?format=json", "institution": "LifeCanvas Technologies"}, {"id": 178100, "fullname": "Madeline Hon", "url": "http://cvpr.thecvf.com/api/miniconf/users/178100?format=json", "institution": "MIT"}, {"id": 182356, "fullname": "Webster Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/182356?format=json", "institution": "LifeCanvas Technologies"}, {"id": 193169, "fullname": "Taegeon Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/193169?format=json", "institution": "LifeCanvas Technologies"}, {"id": 193170, "fullname": "Brian Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193170?format=json", "institution": "LifeCanvas Technologies Inc."}], "abstract": "Unprecedented visual details of biological structures are being revealed by subcellular-resolution whole-brain 3D microscopy data, enabled by recent advances in intact tissue processing and light-sheet fluorescence microscopy (LSFM). These volumetric data offer rich morphological and spatial cellular information, however, the lack of scalable data processing and analysis methods tailored to these petabyte-scale data poses a substantial challenge for accurate interpretation. Further, existing models for visual tasks such as object detection and classification struggle to generalize to this type of data. To accelerate the development of suitable methods and foundational models, we present CANVAS, a comprehensive set of high-resolution whole mouse brain LSFM benchmark data, encompassing six neuronal and immune cell-type markers, along with cell annotations and a leaderboard. We also demonstrate challenges in generalization of baseline models built on existing architectures, especially due to the heterogeneity in cellular morphology across phenotypes and anatomical locations in the brain. To the best of our knowledge, CANVAS is the first and largest LSFM benchmark that captures intact mouse brain tissue at subcellular level, and includes extensive annotations of cells throughout the brain.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39944", "url": null, "sourceid": 45438, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39276, "uid": "c726062049174dd685bbb960958fa1c1", "name": "Post-training feature pruning for fundus images classification", "authors": [{"id": 191748, "fullname": "Van-Nguyen Pham", "url": "http://cvpr.thecvf.com/api/miniconf/users/191748?format=json", "institution": "Sungkyunkwan University"}, {"id": 191749, "fullname": "Duc-Tai Le", "url": "http://cvpr.thecvf.com/api/miniconf/users/191749?format=json", "institution": "Sungkyunkwan University"}, {"id": 177906, "fullname": "Junghyun Bum", "url": "http://cvpr.thecvf.com/api/miniconf/users/177906?format=json", "institution": "Sungkyunkwan"}, {"id": 191750, "fullname": "Hyunseung Choo", "url": "http://cvpr.thecvf.com/api/miniconf/users/191750?format=json", "institution": "Sung Kyun Kwan University"}], "abstract": "Deep neural networks have achieved strong performance in fundus image classification, yet their flattened feature representations are often highly redundant. Such redundancy can lead to poor generalization across imaging devices, reduced interpretability, and inefficient use of model capacity. To address this issue, this study proposes a post-training feature pruning framework, termed {greedy feature pruning (GFP)}, which removes weak or redundant dimensions from the flattened features of trained backbones. GFP employs a greedy build-up process guided by performance metrics on the training set, constrained by a minimum feature keeping ratio, to identify compact yet discriminative subsets of features. Experiments are conducted on five public fundus datasets covering multiple tasks, including diabetic retinopathy detection (DDR, Messidor-2), glaucoma detection (PAPILA), multi-label classification (ODIR) and multi-class retinal disease classification (RETINA), using EfficientNetV2, ViT, and CoAtNet as backbones. Results show that GFP consistently improves AUROC and AUPRC across datasets while reducing the number of flattened features by up to 96\\%. Feature visualizations and quantitative analyses confirm that GFP enhances the compactness and separability of latent features. Moreover, cross-dataset evaluation demonstrates that GFP improves transferability between datasets, indicating better domain robustness. Overall, the proposed GFP framework provides a simple yet effective approach for compressing feature representations and improving both discriminability and generalization in fundus image classification.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39276", "url": null, "sourceid": 35115, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38341, "uid": "5d977a1a6ac388fbbbc745d11142cdd8", "name": "ORIC: Benchmarking Object Recognition under Contextual Incongruity in Large Vision-Language Models", "authors": [{"id": 173703, "fullname": "Zhaoyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/173703?format=json", "institution": "University of California San Diego San Diego"}, {"id": 189643, "fullname": "Zhan Ling", "url": "http://cvpr.thecvf.com/api/miniconf/users/189643?format=json", "institution": "ByteDance Inc."}, {"id": 189644, "fullname": "Yuchen Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/189644?format=json", "institution": "University of California, San Diego"}, {"id": 189645, "fullname": "Litian Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/189645?format=json", "institution": "University of California, Riverside"}, {"id": 189646, "fullname": "Erdem Biyik", "url": "http://cvpr.thecvf.com/api/miniconf/users/189646?format=json", "institution": "University of Southern California"}, {"id": 75771, "fullname": "Hao Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/75771?format=json", "institution": "UCSD"}], "abstract": "Large Vision-Language Models (LVLMs) excel at captioning, visual question answering, and robotics by combining vision and language, yet they often miss obvious objects or hallucinate nonexistent ones in atypical scenes. We examine these failures through the lens of uncertainty, focusing on contextual incongruity, where objects appear unexpectedly or fail to appear in expected contexts, and show that such cases increase recognition difficulty for state-of-the-art LVLMs. To study this regime, we introduce the Object Recognition in Incongruous Context  (ORIC) framework, which constructs incongruous object-context pairs through two complementary strategies: (1) LLM-guided sampling to identify hard-to-recognize objects present in the image and (2) CLIP-guided sampling to mine plausible but absent ones. Applied to MSCOCO, ORIC produces ORIC-Bench and ORIC-style training data. Evaluating 18 LVLMs and 2 open-vocabulary detectors reveals substantial performance drops and bias patterns under incongruous contexts. Fine-tuning Qwen3-VL-8B-Instruct with Visual Reinforcement Fine-Tuning on 600 ORIC-style samples improves results on ORIC-Bench, AMBER, and HallusionBench. Overall, we show that contextual incongruity is a key source of uncertainty and provide tools for more reliable LVLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38341", "url": null, "sourceid": 42768, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39927, "uid": "9d635acac17e5e2d78b10ac123bb2a97", "name": "DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime", "authors": [{"id": 134073, "fullname": "Julian Lorenz", "url": "http://cvpr.thecvf.com/api/miniconf/users/134073?format=json", "institution": "University of Augsburg"}, {"id": 193130, "fullname": "Vladyslav Kovganko", "url": "http://cvpr.thecvf.com/api/miniconf/users/193130?format=json", "institution": "Universit\u00e4t Augsburg"}, {"id": 193131, "fullname": "Elias Kohout", "url": "http://cvpr.thecvf.com/api/miniconf/users/193131?format=json", "institution": "University of Augsburg, Universit\u00e4t Augsburg"}, {"id": 181465, "fullname": "Mrunmai Phatak", "url": "http://cvpr.thecvf.com/api/miniconf/users/181465?format=json", "institution": "Universit\u00e4t Augsburg"}, {"id": 161809, "fullname": "Daniel Kienzle", "url": "http://cvpr.thecvf.com/api/miniconf/users/161809?format=json", "institution": "Universit\u00e4t Augsburg"}, {"id": 73696, "fullname": "Rainer Lienhart", "url": "http://cvpr.thecvf.com/api/miniconf/users/73696?format=json", "institution": "University of Augsburg"}], "abstract": "Scene Graph Generation (SGG) aims to extract a detailed graph structure from an image, a representation that holds significant promise as a robust intermediate step for complex downstream tasks like reasoning for embodied agents.However, practical deployment in real-world applications - especially on resource constrained edge devices - requires speed and resource efficiency, challenges that have received limited attention in existing research.To bridge this gap, we introduce DSFlash, a low-latency model for panoptic scene graph generation designed to overcome these limitations. DSFlash can process a video stream at 56 frames per second on a standard RTX 3090 GPU, without compromising performance against existing state-of-the-art methods. Crucially, unlike prior approaches that often restrict themselves to salient relationships, DSFlash computes comprehensive scene graphs, offering richer contextual information while maintaining its superior latency. Furthermore, DSFlash is light on resources, requiring less than 24 hours to train on a single, nine-year-old GTX 1080 GPU.This accessibility makes DSFlash particularly well-suited for researchers and practitioners operating with limited computational resources, empowering them to adapt and fine-tune SGG models for specialized applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39927", "url": null, "sourceid": 33115, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39798, "uid": "0808f7df778520ed04c75f8d0ce0c44c", "name": "Adaptive Learned Image Compression with Graph Neural Networks", "authors": [{"id": 181923, "fullname": "Yunuo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/181923?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 192878, "fullname": "Bing He", "url": "http://cvpr.thecvf.com/api/miniconf/users/192878?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 192879, "fullname": "Zezheng Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192879?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 154175, "fullname": "Hongwei Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154175?format=json", "institution": "Alibaba Group"}, {"id": 187789, "fullname": "Qunshan Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187789?format=json", "institution": "Alibaba Group"}, {"id": 152151, "fullname": "Yuan Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/152151?format=json", "institution": "Shanghai AI Lab"}, {"id": 130678, "fullname": "Guo Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130678?format=json", "institution": "Shanghai Jiaotong University"}], "abstract": "Efficient image compression relies on the accurate detection and elimination of both local and global redundancy. While most state-of-the-art (SOTA) learned image compression (LIC) methods are built on Convolutional Neural Networks (CNNs) or Transformer architectures, these frameworks are inherently rigid. Standard CNN kernels and window-based attention mechanisms impose fixed receptive fields and static connectivity patterns, which potentially couple non-redundant pixels simply due to their proximity in Euclidean space. This rigidity limits the model\u2019s ability to adaptively capture spatially varying redundancy across the image, particularly at the global level.To overcome these limitations, we propose a content-adaptive image compression framework based on Graph Neural Networks (GNNs). Specifically, our approach constructs dual-scale graphs that enable flexible, data-driven receptive fields. Furthermore, we introduce adaptive connectivity by dynamically adjusting the number of neighbors for each node based on local content complexity. These innovations empower our Graph-based Learned Image Compression (GLIC) model to effectively model diverse redundancy patterns across images, leading to more efficient and adaptive compression.Experiments demonstrate that GLIC achieves SOTA performance, outperforming VTM-9.1 by-19.29\\%,  -21.69\\%,  -18.71\\% in BD-rate on  Kodak, Tecnick, and CLIC datasets, respectively. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39798", "url": null, "sourceid": 46773, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36777, "uid": "49b16bb8d40a3f0e2bba9ba44b65bfd9", "name": "See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs", "authors": [{"id": 180919, "fullname": "Yongchang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180919?format=json", "institution": "Southeast University"}, {"id": 156528, "fullname": "Xianzheng Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/156528?format=json", "institution": "University of Oxford"}, {"id": 185852, "fullname": "Tianyi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185852?format=json", "institution": "Southeast University"}, {"id": 185853, "fullname": "Guangquan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185853?format=json", "institution": "Southeast University"}, {"id": 90866, "fullname": "Yang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/90866?format=json", "institution": "Southeast University"}], "abstract": "Recent large vision-language models (LVLMs) have demonstrated impressive reasoning ability by generating long chain-of-thought (CoT) responses. However, CoT reasoning in multimodal contexts is highly vulnerable to visual hallucination propagation: once an intermediate reasoning step becomes inconsistent with the visual evidence, subsequent steps\u2014even if logically valid\u2014can still lead to incorrect final answers.Existing solutions attempt to mitigate this issue by training models to \u201cthink with images\u201d via reinforcement learning (RL). While effective, these methods are costly, model-specific, and difficult to generalize across architectures.Differently, we present a lightweight method that bypasses RL training and provides an iterative, training-free, plug-and-play framework for visually-grounded multimodal reasoning. Our key idea is to supervise each reasoning step at test time with visual evidence, ensuring that every decoded token is justified by corresponding visual cues. Concretely, we construct a textual visual-evidence pool that guides the model\u2019s reasoning generation. When existing evidence is insufficient, a visual decider module dynamically extracts additional relevant evidence from the image based on the ongoing reasoning context, expanding the pool until the model achieves sufficient visual certainty to terminate reasoning and produce the final answer.Extensive experiments on multiple LVLM backbones and benchmarks demonstrate the effectiveness of our approach. our method achieves 16.5\\%\u201329.5\\% improvements on TreeBench and 13.7\\% RH-AUC gains on RH-Bench, substantially reducing hallucination rates while improving reasoning accuracy without any additional training.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36777", "url": null, "sourceid": 45415, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39981, "uid": "d9df28994e8a318d20f4a2598f5f16c8", "name": "Draft and Refine with Visual Experts", "authors": [{"id": 193232, "fullname": "SungHeon Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/193232?format=json", "institution": "University of California, Irvine"}, {"id": 193233, "fullname": "Ryozo Masukawa", "url": "http://cvpr.thecvf.com/api/miniconf/users/193233?format=json", "institution": "University of California, Irvine"}, {"id": 193234, "fullname": "Jihong Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/193234?format=json", "institution": ""}, {"id": 193235, "fullname": "Sanggeon Yun", "url": "http://cvpr.thecvf.com/api/miniconf/users/193235?format=json", "institution": "University of California, Irvine"}, {"id": 134707, "fullname": "Wenjun Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/134707?format=json", "institution": "University of California, Irvine"}, {"id": 193236, "fullname": "Hanning Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193236?format=json", "institution": "University of California, Irvine"}, {"id": 193237, "fullname": "Mahdi Imani", "url": "http://cvpr.thecvf.com/api/miniconf/users/193237?format=json", "institution": "Northeastern University"}, {"id": 132067, "fullname": "Mohsen Imani", "url": "http://cvpr.thecvf.com/api/miniconf/users/132067?format=json", "institution": "University of California, Irvine"}], "abstract": "While recent Large Vision\u2013Language Models (LVLMs) exhibit impressive multimodal reasoning abilities, they often produce ungrounded, *hallucinated* responses by over-relying on linguistic priors rather than visual evidence. This critical limitation arises from the lack of a quantitative measure of how much these models actually rely on visual inputs during reasoning. We propose **Draft and Refine (DnR)**, an agent framework driven by a novel *question-conditioned utilization metric*. This metric quantifies the model\u2019s actual reliance on visual evidence by first constructing a *query-conditioned relevance map* to localize question-specific evidence, and then assessing dependence through relevance-based probabilistic masking. Guided by this metric, the DnR agent refines its initial *draft* through targeted feedback from external visual experts. Each expert\u2019s output (e.g., boxes, masks) is rendered as visual cues on the image, and the VLM is re-queried to select the response that yields the greatest improvement in utilization. This process strengthens visual grounding of predictions without retraining or architectural changes. Experiments across a broad range of VQA and captioning benchmarks demonstrate consistent accuracy gains and reduced hallucination. These results show that quantifying visual utilization provides a principled path for designing more interpretable and evidence-driven multimodal agent systems that effectively leverage visual experts.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39981", "url": null, "sourceid": 42526, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39005, "uid": "45d95af3e246cc5a1514a05b6c5a172a", "name": "SkySense-VITA: Towards Universal In-context Segmentation of Multi-modal Remote Sensing Imagery", "authors": [{"id": 127776, "fullname": "Kang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127776?format=json", "institution": "Wuhan University"}, {"id": 185943, "fullname": "Lei Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185943?format=json", "institution": "antgroup"}, {"id": 102617, "fullname": "Junwei Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/102617?format=json", "institution": "Wuhan University"}, {"id": 127721, "fullname": "Bo Dang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127721?format=json", "institution": "Wuhan University"}, {"id": 185947, "fullname": "Junjian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185947?format=json", "institution": null}, {"id": 185948, "fullname": "Xiangyuan Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/185948?format=json", "institution": "Ant Group"}, {"id": 154175, "fullname": "Hongwei Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154175?format=json", "institution": "Alibaba Group"}, {"id": 149274, "fullname": "Jingdong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/149274?format=json", "institution": "Ant Group"}, {"id": 127790, "fullname": "Yansheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/127790?format=json", "institution": "Wuhan University"}], "abstract": "While recent foundation models for remote sensing (RS) segmentation have shown notable progress, they still face significant challenges, struggling to process diverse multi-modal inputs, synergize complementary prompt types, and leverage semantic hierarchies. To address these limitations, we introduce SkySense-VITA, a unified in-context segmentation model, which synergistically processes both optical and SAR imagery using visual, textual, or fused prompts. Based on a novel prompt-and-prediction decoupling strategy, we propose the VITA-Former and VITA-Decoder to decouple multi-modal prompt fusion and prediction process, allowing the model to flexibly support visual-only, textual-only, and fused prompt modes. We train SkySense-VITA with a progressive two-stage strategy: a first stage of Image-Level Alignment Pretraining featuring optical-SAR alignment, and a second stage of Pixel-Level In-context Pretraining using Semantic Granularity Annealing (SGA), a coarse-to-fine curriculum that enables robust hierarchical learning. To support this training, we also introduce our new large-scale, multi-modal Sky-VT-300k dataset. Extensive experiments show SkySense-VITA establishes a new state-of-the-art (SOTA) on 18 datasets, with an average performance lead of over 10\\% mIoU. Code, models, and data will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39005", "url": null, "sourceid": 32017, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38336, "uid": "67170e68b57b735f4fe2052e6b68772a", "name": "Synthetic Curriculum Reinforces Compositional Text-to-Image Generation", "authors": [{"id": 182420, "fullname": "Shijian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182420?format=json", "institution": "Southeast Univeristy"}, {"id": 189629, "fullname": "Runhao Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189629?format=json", "institution": "Anhui university"}, {"id": 189630, "fullname": "Siyi Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189630?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 189631, "fullname": "Qingqin Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189631?format=json", "institution": "Tencent Keen Lab"}, {"id": 189632, "fullname": "Xingjian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189632?format=json", "institution": "Shanghai University of Electric Power"}, {"id": 189633, "fullname": "Jiarui Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189633?format=json", "institution": "Xiaohongshu"}, {"id": 189634, "fullname": "Yuan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189634?format=json", "institution": "Xiaohongshu"}, {"id": 189635, "fullname": "Hanqian Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189635?format=json", "institution": "Southeast University"}, {"id": 152755, "fullname": "Cunjian Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/152755?format=json", "institution": "Monash University"}], "abstract": "Text-to-Image (T2I) generation has long been an open problem, with compositional synthesis remaining particularly challenging. This task requires accurate rendering of complex scenes containing multiple objects that exhibit diverse attributes as well as intricate spatial and semantic relationships, demanding both precise object placement and coherent inter-object interactions. In this paper, we propose a novel compositional curriculum reinforcement learning framework named CompGen that addresses compositional weakness in existing T2I models. Specifically, we leverage scene graphs to establish a novel difficulty criterion for compositional ability and develop a corresponding adaptive Markov Chain Monte Carlo graph sampling algorithm. This difficulty-aware approach enables the synthesis of training curriculum data that progressively optimize T2I models through reinforcement learning. We integrate our curriculum learning approach into Group Relative Policy Optimization (GRPO) and investigate different curriculum scheduling strategies. Our experiments reveal that CompGen exhibits distinct scaling curves under different curriculum scheduling strategies, with easy-to-hard and Gaussian sampling strategies yielding superior scaling performance compared to random sampling. Extensive experiments demonstrate that CompGen significantly enhances compositional generation capabilities for both diffusion-based and auto-regressive T2I models, highlighting its effectiveness in improving the compositional T2I generation systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38336", "url": null, "sourceid": 37812, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40206, "uid": "2fde8cfb613fd3e898205d1dab47f625", "name": "MERIT: Multi-domain Efficient RAW Image Translation", "authors": [{"id": 134707, "fullname": "Wenjun Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/134707?format=json", "institution": "University of California, Irvine"}, {"id": 193773, "fullname": "Shenghao Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193773?format=json", "institution": "University of Pennsylvania, University of Pennsylvania"}, {"id": 193774, "fullname": "Yian Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/193774?format=json", "institution": "University of California, Irvine"}, {"id": 182957, "fullname": "Yang Ni", "url": "http://cvpr.thecvf.com/api/miniconf/users/182957?format=json", "institution": "Purdue University Northwest"}, {"id": 151644, "fullname": "Ziteng Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/151644?format=json", "institution": "The University of Tokyo"}, {"id": 193236, "fullname": "Hanning Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193236?format=json", "institution": "University of California, Irvine"}, {"id": 193775, "fullname": "Yirui He", "url": "http://cvpr.thecvf.com/api/miniconf/users/193775?format=json", "institution": "University of California, Irvine"}, {"id": 193776, "fullname": "Yezi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193776?format=json", "institution": "University of California, Irvine"}, {"id": 193235, "fullname": "Sanggeon Yun", "url": "http://cvpr.thecvf.com/api/miniconf/users/193235?format=json", "institution": "University of California, Irvine"}, {"id": 193232, "fullname": "SungHeon Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/193232?format=json", "institution": "University of California, Irvine"}, {"id": 193233, "fullname": "Ryozo Masukawa", "url": "http://cvpr.thecvf.com/api/miniconf/users/193233?format=json", "institution": "University of California, Irvine"}, {"id": 193777, "fullname": "William Chung", "url": "http://cvpr.thecvf.com/api/miniconf/users/193777?format=json", "institution": "University of California, Irvine"}, {"id": 132067, "fullname": "Mohsen Imani", "url": "http://cvpr.thecvf.com/api/miniconf/users/132067?format=json", "institution": "University of California, Irvine"}], "abstract": "RAW images captured by different camera sensors exhibit substantial domain shifts due to varying spectral responses, noise characteristics, and tone behaviors, complicating their direct use in downstream vision tasks. Prior methods address this by training one-to-one RAW-to-RAW translators for each source-target domain pair, but such approaches do not scale to real-world scenarios with multiple cameras. We introduce MERIT, the first unified framework for multi-domain RAW image translation, which leverages a single model to perform translations across arbitrary camera domains. To address domain-specific noise discrepancies, we propose a sensor-aware noise modeling loss that explicitly aligns the signal-dependent noise statistics of the generated images with those of the target domain. Additionally, we enhance the generator\u2019s context modeling with a conditional multi-scale large kernel attention module, enabling efficient capture of both global illumination and fine-grained sensor cues. To support standardized evaluation, we construct MDRAW, a new dataset of paired and unpaired RAW images from five diverse camera sensors. Extensive experiments on existing and newly proposed benchmarks demonstrate that MERIT significantly outperforms prior models in both accuracy and scalability, offering a practical and generalizable solution to cross-domain RAW image harmonization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40206", "url": null, "sourceid": 38408, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37297, "uid": "92fe0f14f026db9c6631440f5b5cf0dc", "name": "MOMO: Mars Orbital MOdel Foundation Model for Mars Orbital Applications", "authors": [{"id": 187111, "fullname": "Mirali Purohit", "url": "http://cvpr.thecvf.com/api/miniconf/users/187111?format=json", "institution": "Arizona State University (ASU)"}, {"id": 187112, "fullname": "Bimal Gajera", "url": "http://cvpr.thecvf.com/api/miniconf/users/187112?format=json", "institution": null}, {"id": 187113, "fullname": "Irish Mehta", "url": "http://cvpr.thecvf.com/api/miniconf/users/187113?format=json", "institution": "Arizona State University"}, {"id": 187114, "fullname": "Bhanu Tokas", "url": "http://cvpr.thecvf.com/api/miniconf/users/187114?format=json", "institution": "Arizona State University"}, {"id": 187115, "fullname": "Jacob Adler", "url": "http://cvpr.thecvf.com/api/miniconf/users/187115?format=json", "institution": "Arizona State University"}, {"id": 162879, "fullname": "Steven Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/162879?format=json", "institution": "Jet Propulsion Laboratory"}, {"id": 187116, "fullname": "Scott Dickenshied", "url": "http://cvpr.thecvf.com/api/miniconf/users/187116?format=json", "institution": "Arizona State University"}, {"id": 187117, "fullname": "Serina Diniega", "url": "http://cvpr.thecvf.com/api/miniconf/users/187117?format=json", "institution": "Jet Propulsion Laboratory"}, {"id": 187118, "fullname": "Brian Bue", "url": "http://cvpr.thecvf.com/api/miniconf/users/187118?format=json", "institution": "Jet Propulsion Laboratory"}, {"id": 187119, "fullname": "Umaa Rebbapragada", "url": "http://cvpr.thecvf.com/api/miniconf/users/187119?format=json", "institution": "Jet Propulsion Laboratory"}, {"id": 75560, "fullname": "Hannah Kerner", "url": "http://cvpr.thecvf.com/api/miniconf/users/75560?format=json", "institution": "Arizona State University"}], "abstract": "We introduce MOMO, the first multi-sensor foundation model for Mars remote sensing. MOMO uses model merge to integrate representations learned independently from three key Martian sensors (HiRISE, CTX, and THEMIS), spanning resolutions from 0.25 m/pixel to 100 m/pixel. Central to our method is our novel Equal Validation Loss (EVL) strategy, which aligns checkpoints across sensors based on validation loss similarity before fusion via task arithmetic. This ensures models are merged at compatible convergence stages, leading to improved stability and generalization. We train MOMO on a large-scale, high-quality corpus of ~12 million samples curated from Mars orbital data and evaluate it on 9 downstream tasks from Mars-Bench. MOMO achieves better overall performance compared to ImageNet pre-trained, earth observation foundation model, sensor-specific pre-training, and fully-supervised baselines. Particularly on segmentation tasks, MOMO shows consistent and significant performance improvement. Our results demonstrate that model merging through an optimal checkpoint selection strategy provides an effective approach for building foundation models for multi-resolution data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37297", "url": null, "sourceid": 43217, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38941, "uid": "d9106553cc5dcab924a87b57eb707fdd", "name": "Anti-Degradation Lifelong Multi-View Clustering", "authors": [{"id": 131836, "fullname": "Xingfeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/131836?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 191015, "fullname": "Hao Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191015?format=json", "institution": "Southwest University of Science and Technology"}, {"id": 191016, "fullname": "Honglin Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191016?format=json", "institution": "Southwest University of Science and Technology"}, {"id": 157685, "fullname": "Yuan Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/157685?format=json", "institution": "Sichuan University"}, {"id": 191017, "fullname": "Xujian Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191017?format=json", "institution": "Southwest University of Science and Technology"}, {"id": 191018, "fullname": "Jia-Qi Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/191018?format=json", "institution": "A*STAR"}, {"id": 88480, "fullname": "Zhenwen Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/88480?format=json", "institution": "Southwest University Of Science And Technology"}], "abstract": "In real-world scenarios, new views are continuously collected over time, forming a dynamic view stream. To handle such evolving data, a lifelong multi-view clustering framework is needed instead of a static model. However, large discrepancies across views make it challenging to learn new knowledge while preserving previously acquired information. There are few methods use consistency alignment or knowledge distillation to align new knowledge with old ones. However, these strategies cannot fundamentally prevent knowledge degradation, since new knowledge inevitably interferes with the learned representation space. To overcome this limitation, we propose a new \\textbf{A}nti-degradation Lifelong Multi-view Clustering (ALMC) framework. Specifically, we innovatively propose a null-space-projection knowledge base anti-degradation technique, which ensures that new knowledge updates to the model only occur in directions orthogonal to the retained knowledge, thus preventing catastrophic forgetting of knowledge and degradation of clustering performance, and provides theoretical proof for this. Extensive experiments on multiple multi-view benchmark datasets demonstrate superior performance in multi-view clustering.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38941", "url": null, "sourceid": 37755, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36402, "uid": "f4f63e527935e0ffe843235d4131a223", "name": "PRUE: A Practical Recipe for Field Boundary Segmentation at Scale", "authors": [{"id": 184962, "fullname": "Gedeon Muhawenayo", "url": "http://cvpr.thecvf.com/api/miniconf/users/184962?format=json", "institution": "Arizona State University (ASU)"}, {"id": 184963, "fullname": "Caleb Robinson", "url": "http://cvpr.thecvf.com/api/miniconf/users/184963?format=json", "institution": "Microsoft"}, {"id": 107003, "fullname": "Subash Khanal", "url": "http://cvpr.thecvf.com/api/miniconf/users/107003?format=json", "institution": "Washington University in St Louis"}, {"id": 164970, "fullname": "Zhanpei Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/164970?format=json", "institution": "Oregon State University"}, {"id": 184964, "fullname": "Isaac Corley", "url": "http://cvpr.thecvf.com/api/miniconf/users/184964?format=json", "institution": "Taylor Geospatial"}, {"id": 150309, "fullname": "Alexander Wollam", "url": "http://cvpr.thecvf.com/api/miniconf/users/150309?format=json", "institution": "Washington University in St. Louis"}, {"id": 184038, "fullname": "Tianyi Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184038?format=json", "institution": "Washington University in St. Louis"}, {"id": 184965, "fullname": "Leonard Strnad", "url": "http://cvpr.thecvf.com/api/miniconf/users/184965?format=json", "institution": null}, {"id": 184966, "fullname": "Ryan Avery", "url": "http://cvpr.thecvf.com/api/miniconf/users/184966?format=json", "institution": "Wherobots"}, {"id": 184967, "fullname": "Lyndon Estes", "url": "http://cvpr.thecvf.com/api/miniconf/users/184967?format=json", "institution": "Clark University"}, {"id": 184968, "fullname": "Ana T\u00e1rano", "url": "http://cvpr.thecvf.com/api/miniconf/users/184968?format=json", "institution": "Arizona State University"}, {"id": 75557, "fullname": "Nathan Jacobs", "url": "http://cvpr.thecvf.com/api/miniconf/users/75557?format=json", "institution": "Washington University in St. Louis"}, {"id": 75560, "fullname": "Hannah Kerner", "url": "http://cvpr.thecvf.com/api/miniconf/users/75560?format=json", "institution": "Arizona State University"}], "abstract": "Large-scale maps of field boundaries are essential for agricultural monitoring tasks. Existing deep learning approaches for satellite-based field mapping have undesirable properties for large-scale inference, including sensitivity to illumination, spatial scale, and geographic location changes. We conduct the first systematic evaluation of segmentation and geospatial foundation models (GFM) for global field boundary delineation using the Fields of The World (FTW) benchmark. We evaluate 18 models under unified experimental settings, showing that a U-Net semantic segmentation model outperforms instance-based and GFM alternatives on a suite of performance and deployment metrics. We propose a new segmentation approach that combines a U-Net backbone, composite loss functions, and targeted data augmentations to enhance performance and robustness under real-world conditions. Our model achieves a 76\\% IoU and 47\\% object-F1 on the FTW benchmark, an increase of 6\\% and 9\\% over the previous baseline. Our approach provides a practical framework for reliable, scalable, and reproducible field boundary delineation across model design, training, and inference.We release trained models and model-derived field boundary datasets for 5 countries outside of the FTW dataset to support future research and deployment.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36402", "url": null, "sourceid": 39657, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37636, "uid": "06524331e2c63c0ed3479bf1be85ce3b", "name": "Pressure2Motion: Hierarchical Human Motion Reconstruction from Ground Pressure with Text Guidance", "authors": [{"id": 180293, "fullname": "Zhengxuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180293?format=json", "institution": "nanjing university"}, {"id": 187923, "fullname": "Qinhui Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187923?format=json", "institution": "Nanjing University"}, {"id": 70490, "fullname": "Yiyu Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70490?format=json", "institution": "Nanjing University"}, {"id": 96456, "fullname": "Chuan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/96456?format=json", "institution": "Meta Reality Lab"}, {"id": 152200, "fullname": "Xinxin Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/152200?format=json", "institution": "Concordia University"}, {"id": 186119, "fullname": "Xiao-Xiao Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/186119?format=json", "institution": "Nanjing University"}, {"id": 128100, "fullname": "Yao Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/128100?format=json", "institution": "Nanjing University"}, {"id": 85035, "fullname": "Xun Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/85035?format=json", "institution": "Nanjing University"}, {"id": 130123, "fullname": "Qiu Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/130123?format=json", "institution": "Nanjing University"}, {"id": 153839, "fullname": "Hao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153839?format=json", "institution": "Nanjing University"}], "abstract": "We present Pressure2Motion, a novel motion capture algorithm that reconstructs human motion from a ground pressure sequence and text prompt. At inference time, Pressure2Motion requires only a pressure mat, eliminating the need for specialized lighting setups, cameras, or wearable devices, making it suitable for privacy-preserving, low-light, and low-cost motion capture scenarios. Such a task is severely ill-posed due to the indeterminacy of pressure signals with respect to full-body motion. To address this issue, we introduce Pressure2Motion, a generative model that leverages pressure features as input and utilizes a text prompt as a high-level guiding constraint to resolve ambiguities. Specifically, our model adopts a dual-level feature extractor to accurately interpret pressure data, followed by a hierarchical diffusion model that discerns broad-scale movement trajectories and subtle posture adjustments. Both the physical cues gained from the pressure sequence and the semantic guidance derived from descriptive texts are leveraged to guide the motion estimation with precision. To the best of our knowledge, Pressure2Motion is a pioneering work in leveraging both pressure data and linguistic priors for motion reconstruction, and the established MPL benchmark is the first benchmark for this novel motion capture task. Experiments show that our method generates high-fidelity, physically plausible motions, establishing a new state of the art for this task. The codes and benchmarks will be publicly released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37636", "url": null, "sourceid": 33449, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36349, "uid": "827a1fd7a77a96b4c3a3cd36b431f878", "name": "Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration", "authors": [{"id": 184842, "fullname": "Weile Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184842?format=json", "institution": "Zhejiang University"}, {"id": 184843, "fullname": "Bingchen Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184843?format=json", "institution": "Zhejiang University"}, {"id": 69296, "fullname": "Qifan Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/69296?format=json", "institution": "Zhejiang University"}, {"id": 184844, "fullname": "Wendong Bu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184844?format=json", "institution": "Zhejiang University"}, {"id": 126883, "fullname": "Guoming Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126883?format=json", "institution": "Zhejiang University"}, {"id": 126912, "fullname": "Wenqiao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126912?format=json", "institution": "National University of Singapore"}, {"id": 84791, "fullname": "Shengyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84791?format=json", "institution": "Zhejiang University"}, {"id": 88890, "fullname": "Juncheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88890?format=json", "institution": "Zhejiang University"}, {"id": 126909, "fullname": "Siliang Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126909?format=json", "institution": "Zhejiang University"}], "abstract": "Recent advances in Multimodal Large Language Models (MLLMs) have led to promising progress in web agents. However, existing web agents often rely on handcrafted execution pipelines or expensive expert trajectories, limiting their adaptability to complex, dynamic environments. To address these challenges, we propose **SCALE** (**S**elf-**C**ognitive-**A**ware **L**earning and **E**xploration), which leverages three advertise roles\u2014\u2014*selector*, *predictor*, and *judger* to autonomously discover their limitations and expand its cognitive boundaries through the environment exploration. Moreover, we propose **SCALE-Hop**, a graph exploration strategy that facilitates global planning and helps agents avoid local exploration traps. To further support learning, we construct **SCALE-20k**, a large-scale dataset collected from 19 real-world websites, containing diverse task types and structured demonstrations generated from SCALE\u2019s exploration traces. Experimental results show that our approach significantly improves the performance and generalization of multiple MLLMs in various web environments. Our framework offers a scalable and generalizable solution for building truly autonomous and adaptive web agents.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36349", "url": null, "sourceid": 34280, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39301, "uid": "0cc346ec65e459bd61888a572b3b138e", "name": "WiseEdit: Benchmarking Cognition- and Creativity-Informed Image Editing", "authors": [{"id": 153656, "fullname": "Kaihang Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153656?format=json", "institution": "Zhejiang University"}, {"id": 184842, "fullname": "Weile Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184842?format=json", "institution": "Zhejiang University"}, {"id": 191800, "fullname": "Haiyi Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191800?format=json", "institution": "Zhejiang University"}, {"id": 69296, "fullname": "Qifan Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/69296?format=json", "institution": "Zhejiang University"}, {"id": 184844, "fullname": "Wendong Bu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184844?format=json", "institution": "Zhejiang University"}, {"id": 151420, "fullname": "zehan wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151420?format=json", "institution": "Zhejiang University"}, {"id": 191801, "fullname": "Yun Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191801?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 88890, "fullname": "Juncheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88890?format=json", "institution": "Zhejiang University"}, {"id": 126909, "fullname": "Siliang Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126909?format=json", "institution": "Zhejiang University"}], "abstract": "Recent image editing models boast next-level intelligent capabilities, facilitating cognition- and creativity-informed image editing. Yet, existing benchmarks provide too narrow a scope for evaluation, failing to holistically assess these advanced abilities. To address this, we introduce WiseEdit, a knowledge-intensive benchmark for comprehensive evaluation of cognition- and creativity-informed image editing, featuring deep task depth and broad knowledge breadth. Drawing an analogy to human cognitive creation, WiseEdit decomposes image editing into three cascaded steps\u2014Awareness, Interpretation, and Imagination\u2014each corresponding to a task that poses a challenge for models to complete at the specific step. It also encompasses complex tasks, where none of the three steps can be finished easily. Furthermore, WiseEdit incorporates three fundamental types of knowledge: Declarative, Procedural, and Metacognitive knowledge. Ultimately, WiseEdit comprises 1,220 test cases, objectively revealing the limitations of SoTA image editing models in knowledge-based cognitive reasoning and creative composition capabilities.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39301", "url": null, "sourceid": 39053, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37274, "uid": "0ffe258ac76e674bf236710ea891d791", "name": "WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation", "authors": [{"id": 174116, "fullname": "Wei Chow", "url": "http://cvpr.thecvf.com/api/miniconf/users/174116?format=json", "institution": "National University of Singapore"}, {"id": 133166, "fullname": "Jiachun Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/133166?format=json", "institution": "National University of Singapore"}, {"id": 151379, "fullname": "Yongyuan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151379?format=json", "institution": "University of Maryland, College Park"}, {"id": 187062, "fullname": "Mingze Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/187062?format=json", "institution": "Zhejiang University"}, {"id": 129293, "fullname": "Xue Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/129293?format=json", "institution": "Fudan University"}, {"id": 157940, "fullname": "Liyu Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/157940?format=json", "institution": "School of Computer Science and  Engineering, Nanyang Technological University"}, {"id": 187063, "fullname": "Saining Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187063?format=json", "institution": "Nanyang Technological University"}, {"id": 126909, "fullname": "Siliang Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126909?format=json", "institution": "Zhejiang University"}, {"id": 88890, "fullname": "Juncheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88890?format=json", "institution": "Zhejiang University"}, {"id": 187064, "fullname": "Fengda Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187064?format=json", "institution": "Nanyang Technological University"}, {"id": 103767, "fullname": "Weijia Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/103767?format=json", "institution": "National University of Singapore"}, {"id": 91500, "fullname": "Hanwang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91500?format=json", "institution": "Nanyang Technological University"}, {"id": 88927, "fullname": "Tat-seng Chua", "url": "http://cvpr.thecvf.com/api/miniconf/users/88927?format=json", "institution": "National University of Singapore"}], "abstract": "Recent advances in unified multimodal models (UMMs) have enabled impressive progress in visual comprehension and generation.However, existing datasets and benchmarks focus primarily on single-turn interactions, failing to capture the multi-turn, context-dependent nature of real-world image creation and editing.To address this gap, we present WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation.Our suite consists of two complementary parts. WEAVE-100k is a large-scale dataset of $100$K interleaved samples spanning over $370$K dialogue turns and $500$K images, covering comprehension, editing, and generation tasks that require reasoning over historical context. WEAVEBench is a human-annotated benchmark with $100$ tasks based on $480$ images, featuring a hybrid VLM judger evaluation framework based on both the reference image and the combination of the original image with editing instructions that assesses models' abilities in multi-turn generation, visual memory, and world-knowledge reasoning across diverse domains.Experiments demonstrate that training on WEAVE-100k enables vision comprehension, image editing, and comprehension-generation collaboration capabilities. Furthermore, it facilitates UMMs to develop emergent visual-memory capabilities, while extensive evaluations on WEAVEBench expose the persistent limitations and challenges of current approaches in multi-turn, context-aware image generation and editing.We believe WEAVE provides a view and foundation for studying in-context interleaved comprehension and generation for multi-modal community.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37274", "url": null, "sourceid": 36897, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37127, "uid": "8631e5b91354a846f00dbbf16ad94259", "name": "Visual Document Understanding and Reasoning: A Multi-Agent Collaboration Framework with Agent-Wise Adaptive Test-Time Scaling", "authors": [{"id": 185044, "fullname": "Xinlei Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185044?format=json", "institution": "National University of Singapore"}, {"id": 156408, "fullname": "Chengming Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156408?format=json", "institution": "Tencent"}, {"id": 144046, "fullname": "Zhangquan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/144046?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 186697, "fullname": "Yudong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186697?format=json", "institution": "University of Science and Technology of China"}, {"id": 101815, "fullname": "Shilin Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/101815?format=json", "institution": "Nanyang Technological University"}, {"id": 186725, "fullname": "Cheng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186725?format=json", "institution": "Hangzhou Dianzi University; DeepWisdom"}, {"id": 88656, "fullname": "Jiangning Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88656?format=json", "institution": "Tencent Youtu Lab"}, {"id": 106922, "fullname": "Shuicheng Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/106922?format=json", "institution": "National University of Singapore, Department of Electrical and Computer Engineering"}, {"id": 152689, "fullname": "Xiaobin Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152689?format=json", "institution": "Tencent AI Lab"}], "abstract": "The dominant paradigm of monolithic scaling in Vision-Language Models (VLMs) is failing for understanding and reasoning in documents, yielding diminishing returns as it struggles with the inherent need of this domain for document-based procedural reasoning, cognitive complexity, and factual accuracy. To this end, we introduce MACT, a Multi-Agent Collaboration framework with agent-wise adaptive Test-time scaling that pioneers a paradigm shift to procedural scaling, adapting dynamically to the functional entities of visual documents understanding and reasoning. MACT decomposes the visual document processing flow into four specialized agents, i.e., planning, execution, judgment, and answer, to resolve cognitive overload and introduce a critical self-correction loop for factual grounding.This collaborative architecture is amplified by an agent-wise adaptive test-time scaling strategy that intelligently allocates computational resources based on the complexity and redundancy of each functionality. Evaluated on multiple visual document understanding benchmarks, MACT achieves superior performance with a smaller parameter scale, adapting effectively to various document scenarios without compromising its general or mathematical reasoning capabilities. The three variants of MACT consistently attain top-three average performance rankings, with average performance enhancements of 9.9\u201311.5\\% over the base models. The source code will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37127", "url": null, "sourceid": 45850, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37115, "uid": "cc1d6397370f680339bb84ca6ad55267", "name": "VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models", "authors": [{"id": 185044, "fullname": "Xinlei Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185044?format=json", "institution": "National University of Singapore"}, {"id": 156408, "fullname": "Chengming Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156408?format=json", "institution": "Tencent"}, {"id": 153503, "fullname": "Guibin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153503?format=json", "institution": "National University of Singapore"}, {"id": 144046, "fullname": "Zhangquan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/144046?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 186697, "fullname": "Yudong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186697?format=json", "institution": "University of Science and Technology of China"}, {"id": 161900, "fullname": "Yongbo He", "url": "http://cvpr.thecvf.com/api/miniconf/users/161900?format=json", "institution": "Zhejiang University"}, {"id": 126680, "fullname": "Peng-Tao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126680?format=json", "institution": "vivo Mobile Communication (Hangzhou) Co., Ltd."}, {"id": 88656, "fullname": "Jiangning Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88656?format=json", "institution": "Tencent Youtu Lab"}, {"id": 152689, "fullname": "Xiaobin Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152689?format=json", "institution": "Tencent AI Lab"}, {"id": 106922, "fullname": "Shuicheng Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/106922?format=json", "institution": "National University of Singapore, Department of Electrical and Computer Engineering"}], "abstract": "Despite the remarkable success of Vision-Language Models (VLMs), their performance on a range of complex visual tasks is often hindered by a \"visual processing bottleneck\": a propensity to lose grounding in visual evidence and exhibit a deficit in contextualized visual experience during prolonged generation. Drawing inspiration from human cognitive memory theory, which distinguishes short-term visually-dominant memory and long-term semantically-dominant memory, we propose VisMem, a cognitively-aligned framework that equips VLMs with dynamic latent vision memories, a short-term module for fine-grained perceptual retention and a long-term module for abstract semantic consolidation. These memories are seamlessly invoked during inference, allowing VLMs to maintain both perceptual fidelity and semantic consistency across thinking and generation. Extensive experiments across diverse benchmarks for understanding, reasoning, and generation reveal that VisMem delivers a significant average performance boost of 11.8% relative to the vanilla model and outperforms all counterparts, establishing a new paradigm for latent-space memory enhancement. The source code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37115", "url": null, "sourceid": 41979, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37861, "uid": "8727f12649fd2a6867810cc038cfd2af", "name": "Probing and Bridging Geometry\u2013Interaction Cues for Affordance Reasoning in Vision Foundation Models", "authors": [{"id": 188423, "fullname": "Qing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188423?format=json", "institution": "Australian National University"}, {"id": 102621, "fullname": "Xuesong li", "url": "http://cvpr.thecvf.com/api/miniconf/users/102621?format=json", "institution": "Australian National University"}, {"id": 130439, "fullname": "Jing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130439?format=json", "institution": "Australian National University"}], "abstract": "What does it mean for a visual system to truly understand affordance? We argue that this understanding hinges on two complementary capacities: geometric perception, which identifies the structural parts of objects that enable interaction, and interaction perception, which models how an agent's actions engage with those parts. To test this hypothesis, we conduct a systematic probing of Visual Foundation Models (VFMs). We find that models like DINO inherently encode part-level geometric structures, while generative models like Flux contain rich, verb-conditioned spatial attention maps that serve as implicit interaction priors. Crucially, we demonstrate that these two dimensions are not merely correlated but are composable elements of affordance. By simply fusing DINO's geometric prototypes with Flux's interaction maps in a training-free, zero-shot manner, we achieve affordance estimation competitive with weakly-supervised methods. This final fusion experiment confirms that geometric and interaction perception are the fundamental building blocks of affordance understanding in VFMs, providing a mechanistic account of how perception grounds action.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37861", "url": null, "sourceid": 40884, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40080, "uid": "82fa3c9858eb0c2f5edd0add97145f6a", "name": "FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing", "authors": [{"id": 193454, "fullname": "Guangzhao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193454?format=json", "institution": "Central South University"}, {"id": 185461, "fullname": "Yanming Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185461?format=json", "institution": "Westlake University"}, {"id": 182531, "fullname": "Chenxi Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/182531?format=json", "institution": "Westlake University"}, {"id": 71043, "fullname": "Xiaohong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/71043?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 153810, "fullname": "Chi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153810?format=json", "institution": "Westlake University"}], "abstract": "Text-driven video editing aims to modify video content based on natural language instructions. While recent training-free methods have leveraged pretrained diffusion models, they often rely on an inversion-editing paradigm. This paradigm maps the video to a latent space before editing. However, the inversion process is not perfectly accurate, often compromising appearance fidelity and motion consistency.To address this, we introduce FlowDirector, a novel training-free and inversion-free video editing framework. Our framework models the editing process as a direct evolution in the data space. It guides the video to transition smoothly along its inherent spatio-temporal manifold using an ordinary differential equation (ODE), thereby avoiding the inaccurate inversion step. From this foundation, we introduce three flow correction strategies for appearance, motion, and stability: 1) Direction-aware flow correction amplifies components that oppose the source direction and removes irrelevant terms, breaking conservative streamlines and enabling stronger structural and textural changes. 2) Motion\u2013appearance decoupling optimizes motion agreement as an energy term at each timestep, significantly improving consistency and motion transfer. 3) Differential averaging guidance strategy leverages differences among multiple candidate flows to approximate a low variance regime at low cost, suppressing artifacts and stabilizing the trajectory.  Extensive experiments across various editing tasks and benchmarks demonstrate that FlowDirector achieves state-of-the-art performance in instruction following, temporal consistency, and background preservation, establishing an efficient new paradigm for coherent video editing without inversion.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40080", "url": null, "sourceid": 44528, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38173, "uid": "29860f5735fc4987b5f8ef3ee2767847", "name": "PR Depth: Pose-Refined Monocular Depth Estimation with Temporal Consistency", "authors": [{"id": 179982, "fullname": "Leezy Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/179982?format=json", "institution": "Ajou University"}, {"id": 180621, "fullname": "SEUNGGYU KIM", "url": "http://cvpr.thecvf.com/api/miniconf/users/180621?format=json", "institution": "Ajou University"}, {"id": 189205, "fullname": "Dongseok Shim", "url": "http://cvpr.thecvf.com/api/miniconf/users/189205?format=json", "institution": "Sony Group Corporation"}, {"id": 179859, "fullname": "Hyeonbeom Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/179859?format=json", "institution": "Ajou University"}], "abstract": "Monocular depth estimation (MDE) has been widely adopted in the perception systems of autonomous vehicles and mobile robots. However, existing approaches often struggle to maintain temporal consistency in depth estimation across consecutive frames. This inconsistency not only causes flickering artifacts in the resulting depth maps but may also lead to estimation failures when the depth range changes abruptly due to camera motion. To address these challenges, this paper proposes a consistency-aware monocular depth estimation framework that leverages wheel odometry information from a mobile robot to achieve stable and coherent depth predictions over time. Specifically, we develop pose estimation and sparse depth estimation modules based on optical flow computed from consecutive image frames. The resulting sparse depth estimates are then used to rescale and refine the relative depth predicted by a pre-trained depth estimation foundation model. Through this refinement process, our method produces dense and temporally consistent depth maps. The proposed method is evaluated on the KITTI, TartanAir, MS2, and our-collected datasets, demonstrating robust and accurate performance in both pose estimation and depth prediction tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38173", "url": "https://ptc-depth.github.io/", "sourceid": 43536, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36624, "uid": "e5406b07110c540bb5d4ae2478236ec4", "name": "Ref4D-VideoBench: Four-Dimensional Reference-Based Evaluation of Text-to-Video Generative Models", "authors": [{"id": 174628, "fullname": "Jiajia Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/174628?format=json", "institution": "Sun Yat-sen University"}, {"id": 185496, "fullname": "YuJia He", "url": "http://cvpr.thecvf.com/api/miniconf/users/185496?format=json", "institution": "Sun Yat-sen Univercity"}, {"id": 185497, "fullname": "Yuhan Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185497?format=json", "institution": "Sun Yat-sen University"}, {"id": 163391, "fullname": "Hang Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/163391?format=json", "institution": "Sun Yat-sen University"}, {"id": 185498, "fullname": "Sihua Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185498?format=json", "institution": "Sun Yat-sen University"}, {"id": 185499, "fullname": "Jincheng Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185499?format=json", "institution": "Sun Yat-sen University"}, {"id": 185500, "fullname": "Kwok Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185500?format=json", "institution": "Sun Yat-sen University"}, {"id": 105581, "fullname": "Zibin Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/105581?format=json", "institution": "Sun Yat-sen University"}, {"id": 85118, "fullname": "Weibin Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85118?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Most existing evaluations of generated videos adopt a no-reference paradigm. Although recent benchmarks cover multiple dimensions and show moderate correlation with human preferences, relying solely on textual prompts weakens real-world constraints and makes it difficult to produce accountable and interpretable judgments on instance-level issues such as target behavior deviation, temporal inconsistency, and commonsense violations. In scenarios with explicit expectations, such as controlled generation, reference videos naturally provide rich, unambiguous spatio-temporal evidence, enabling stricter and more trustworthy assessment. Motivated by this, we propose Ref4D, a reference-based, fine-grained, multi-dimensional benchmark for generated video evaluation. Ref4D contains 600 high-quality reference videos with tightly evidence-bounded prompts, and introduces a 12-metric structured evaluation suite along four key dimensions: basic semantic alignment, motion consistency, event temporal consistency, and world knowledge consistency. Experiments on eight text-to-video models show that Ref4D achieves stronger agreement with human judgments than representative no-reference frameworks, while precisely diagnosing the dimensions and causes of failure for each video. By integrating explicit reference evidence with multimodal reasoning, Ref4D provides a practical and human-aligned standard for generated video evaluation and a tool to guide the development of more reliable generative models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36624", "url": null, "sourceid": 36776, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39491, "uid": "ab130df6b0564471063c3387c2a07e4a", "name": "FedAlign: Differentially Private Distribution Alignment for Non-IID Federated Learning", "authors": [{"id": 176628, "fullname": "Peng Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/176628?format=json", "institution": "Hunan University"}, {"id": 180116, "fullname": "Jiapeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180116?format=json", "institution": "Hunan University"}, {"id": 192191, "fullname": "Yingjie Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/192191?format=json", "institution": "Naval University of Engineering"}, {"id": 192192, "fullname": "Xiong Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192192?format=json", "institution": "Hunan University"}, {"id": 127251, "fullname": "Zhuo Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127251?format=json", "institution": "Hunan University"}], "abstract": "Federated Learning (FL) enables collaborative model training without sharing raw data, but client data are often Non-Independent and Identically Distributed (Non-IID), which often slow convergence and degrade global performance. Meanwhile, privacy preservation is also a critical concern in FL. To address these two issues, we propose $\\textit{FedAlign}$, a differentially private framework that aligns local data distributions via client-side statistical moment alignment. Clients upload perturbed distribution statistics, which the server aggregates to infer global distribution characteristics and guide local alignment, thereby reducing inter-client discrepancies. Experiments and theoretical analysis show that FedAlign accelerates convergence and improves accuracy under Non-IID settings while preserving rigorous privacy guarantees.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39491", "url": null, "sourceid": 32954, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39381, "uid": "302626eb160bcd2ff5b9e4a0578e021b", "name": "Flow4DGS-SLAM: Optical Flow-Guided 4D Gaussian Splatting SLAM", "authors": [{"id": 107599, "fullname": "Yunsong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107599?format=json", "institution": "NVIDIA"}, {"id": 86340, "fullname": "Gim Hee Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/86340?format=json", "institution": "National University of Singapore"}], "abstract": "Handling the dynamic environments is a significant research challenge in Visual Simultaneous Localization and Mapping (SLAM). Recent research combines 3D Gaussian Splatting (3DGS) with SLAM to achieve both robust camera pose estimation and photorealistic renderings. However, using SLAM to efficiently reconstruct both static and dynamic regions remains challenging. In this work, we propose an efficient framework for dynamic 3DGS SLAM guided by optical flow. Using the input depth and prior optical flow, we first propose a category-agnostic motion mask generation strategy by fitting a camera ego-motion model to decompose the optical flow. This module separates dynamic and static Gaussians and simultaneously provides flow-guided camera pose initialization. We boost the training speed of dynamic 3DGS by explicitly modeling their temporal centers at keyframes. These centers are propagated using 3D scene flow priors and are dynamically initialized with an adaptive insertion strategy. Alongside this, we model the temporal opacity and rotation using a Gaussian Mixture Model (GMM) to adaptively learn the complex dynamics. The empirical results demonstrate our state-of-the-art performance in tracking, dynamic reconstruction, and training efficiency. Our code will be made publicly available upon paper acceptance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39381", "url": null, "sourceid": 33689, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38273, "uid": "0a9a820b2c608bb2c204ad980d7c60c8", "name": "EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding", "authors": [{"id": 152779, "fullname": "Seungjun Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/152779?format=json", "institution": "national university of singaore, National University of Singapore"}, {"id": 132274, "fullname": "Zihan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/132274?format=json", "institution": "National University of Singapore"}, {"id": 107599, "fullname": "Yunsong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107599?format=json", "institution": "NVIDIA"}, {"id": 86340, "fullname": "Gim Hee Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/86340?format=json", "institution": "National University of Singapore"}], "abstract": "Understanding a 3D scene immediately with its exploration is essential for embodied tasks, where an agent must construct and comprehend the 3D representation in an online and nearly real-time manner. In this study, we propose **EmbodiedSplat**, an online feed-forward 3DGS for open-vocabulary scene understanding that enables simultaneous online 3D reconstruction and 3D semantic understanding from the streaming images. Unlike existing open-vocabulary 3DGS methods, our objectives are two-fold: 1) Reconstructs the semantic-embedded 3DGS of the entire scene from over 300 streaming images in an online manner. 2) Highly generalizable to novel scenes with feed-forward design and supports nearly real-time 3D semantic reconstruction when combined with real-time 2D models. To achieve these objectives, we propose an Online Sparse Coefficients Field with a CLIP Global Codebook where it binds the 2D CLIP embeddings to each 3D Gaussian while minimizing memory consumption and preserving the full semantic generalizability of CLIP. Furthermore, we generate 3D geometric-aware CLIP features by aggregating the partial point cloud of 3DGS through 3D U-Net to compensate the 3D geometric prior to 2D-oriented language embeddings. Our code will be publicly available on paper acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38273", "url": null, "sourceid": 39387, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38058, "uid": "923ccbcec867ebc5588386df7c370c55", "name": "Moving Border Ownership for Event-based Motion Segmentation", "authors": [{"id": 188942, "fullname": "Zhiyuan Hua", "url": "http://cvpr.thecvf.com/api/miniconf/users/188942?format=json", "institution": "University of Maryland, College Park"}, {"id": 136019, "fullname": "Cornelia Fermuller", "url": "http://cvpr.thecvf.com/api/miniconf/users/136019?format=json", "institution": "University of Maryland, College Park"}, {"id": 85403, "fullname": "Yiannis Aloimonos", "url": "http://cvpr.thecvf.com/api/miniconf/users/85403?format=json", "institution": "University of Maryland, College Park"}], "abstract": "Event cameras provide accurate information at motion boundaries\u2014exactly where disentangling ego-motion, object motion, and border ownership determines segmentation quality. We argue that the missing ingredient in dynamic scene interpretation is moving border ownership: detecting motion boundaries and assigning which side is foreground so occlusions are resolved by design.Traditional geometric motion segmentation pipelines (e.g., flow clustering, simple motion models) remain assumption-heavy and slow, while deep models often fail to generalize across sensors or datasets. We introduce a lightweight, ownership-aware predictor trained solely on synthetic events with perfect supervision for boundaries, ownership, and motion, generated via a Blender pipeline. Its key targets\u2014a signed-distance ownership field and a motion mask\u2014focus learning where events occur and yield stable gradients. The model runs in real time and generalizes without tuning: trained on synthetic events, it achieves zero-shot transfer on EED, EVIMO1, EVIMO2, and EMSMC,  delivering state-of-the-art performance. By casting motion segmentation as ownership-aware edge understanding, we combine the robustness of model-based reasoning with the scalability of learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38058", "url": null, "sourceid": 38555, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36483, "uid": "d16dcfcb8006fc84bd0ab2ea79b70863", "name": "Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction", "authors": [{"id": 179913, "fullname": "Shannan Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/179913?format=json", "institution": "Tsinghua University"}, {"id": 173791, "fullname": "Leqi Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/173791?format=json", "institution": "Tsinghua University"}, {"id": 185159, "fullname": "Keyu Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/185159?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 185160, "fullname": "Jingchen Ni", "url": "http://cvpr.thecvf.com/api/miniconf/users/185160?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 185161, "fullname": "Hongyang Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/185161?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 185162, "fullname": "Jiajun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185162?format=json", "institution": "University of Science and Technology of China"}, {"id": 185163, "fullname": "Guangting Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185163?format=json", "institution": null}, {"id": 185164, "fullname": "Jing LYU", "url": "http://cvpr.thecvf.com/api/miniconf/users/185164?format=json", "institution": "Wechat Team"}, {"id": 86975, "fullname": "Chun Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/86975?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 126882, "fullname": "Fengyun Rao", "url": "http://cvpr.thecvf.com/api/miniconf/users/126882?format=json", "institution": "WeChat, Tencent Inc."}], "abstract": "We study the task of establishing object-level visual correspondence across different viewpoints in videos, focusing on the challenging egocentric-to-exocentric and exocentric-to-egocentric scenarios. We propose a simple yet effective framework based on conditional binary segmentation, where an object query mask is encoded into a latent representation to guide the localization of the corresponding object in a target video. To encourage robust, view-invariant representations, we introduce a cycle-consistency training objective: the predicted mask in the target view is projected back to the source view to reconstruct the original query mask. This bidirectional constraint provides a strong self-supervisory signal without requiring ground-truth annotations and enables test-time training (TTT) at inference. Experiments on the Ego-Exo4D and HANDAL-X benchmarks demonstrate the effectiveness of our optimization objective and TTT strategy, achieving state-of-the-art performance. The code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36483", "url": null, "sourceid": 38018, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37006, "uid": "e2cb74df8fd50616f5e9e763c5a15a01", "name": "Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following", "authors": [{"id": 153748, "fullname": "Tianyi Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/153748?format=json", "institution": "University of Maryland, College Park"}, {"id": 186448, "fullname": "Yi Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/186448?format=json", "institution": "University of Maryland, College Park"}, {"id": 91542, "fullname": "Ming Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/91542?format=json", "institution": "The University of Tokyo"}, {"id": 186449, "fullname": "Zuolong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186449?format=json", "institution": "University of Maryland, College Park"}, {"id": 186450, "fullname": "Pranav Kulkarni", "url": "http://cvpr.thecvf.com/api/miniconf/users/186450?format=json", "institution": "University of Maryland, College Park"}, {"id": 186451, "fullname": "Kaishen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186451?format=json", "institution": "University of Maryland, College Park"}, {"id": 186452, "fullname": "Qi He", "url": "http://cvpr.thecvf.com/api/miniconf/users/186452?format=json", "institution": "University of Maryland, College Park"}, {"id": 186453, "fullname": "Zeying Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186453?format=json", "institution": "University of Maryland, College Park"}, {"id": 186454, "fullname": "Chenxi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186454?format=json", "institution": "University of Maryland, College Park"}, {"id": 186455, "fullname": "Ruibo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186455?format=json", "institution": "University of Maryland, College Park"}, {"id": 183630, "fullname": "Tong Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/183630?format=json", "institution": "University of Maryland, College Park"}, {"id": 186456, "fullname": "Yanshuo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186456?format=json", "institution": "University of Maryland, College Park"}, {"id": 153749, "fullname": "Xiyao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153749?format=json", "institution": "University of Maryland, College Park"}, {"id": 91572, "fullname": "Renrui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91572?format=json", "institution": "MMLab of CUHK &amp;amp;amp; Shanghai AI Laboratory"}, {"id": 129054, "fullname": "Wenhu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/129054?format=json", "institution": "University of Waterloo"}, {"id": 84846, "fullname": "Heng Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84846?format=json", "institution": "University of Pittsburgh"}], "abstract": "Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction following and consistency with human preferences. However, their ability to follow diverse, fine-grained evaluation criteria remains underexplored.We develop Multi-Crit, a benchmark for evaluating multimodal judges on their capacity to follow pluralistic criteria and produce reliable criterion-level judgments. Covering both open-ended generation and verifiable reasoning tasks, Multi-Crit is built through a rigorous data curation pipeline that gathers challenging response pairs with multi-criterion human annotations. It further introduces three novel metrics for systematically assessing pluralistic adherence, criterion-switching flexibility, and the ability to recognize criterion-level preference conflicts.Comprehensive analysis of 25 LMMs reveals that 1) proprietary models still struggle to maintain consistent adherence to pluralistic criteria\u2014especially in open-ended evaluation; 2) open-source models lag further behind in flexibly following diverse criteria; and 3) critic fine-tuning with holistic judgment signals enhances visual grounding but fails to generalize to pluralistic criterion-level judgment.Additional analyses on reasoning fine-tuning, test-time scaling, and boundary consistency between open-source and proprietary models further probe the limits of current multimodal judges.As a pioneering study, Multi-Crit lays the foundation for building reliable and steerable multimodal AI evaluation.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37006", "url": null, "sourceid": 43188, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38951, "uid": "5aad38004a6546b2382974698dbcb264", "name": "LUMINA: A Multi-Vendor Mammography Benchmark with Energy Harmonization Protocol", "authors": [{"id": 182781, "fullname": "Hongyi Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/182781?format=json", "institution": "Northwestern University"}, {"id": 148755, "fullname": "Gorkem Durak", "url": "http://cvpr.thecvf.com/api/miniconf/users/148755?format=json", "institution": "Northwestern University"}, {"id": 191035, "fullname": "Halil Aktas", "url": "http://cvpr.thecvf.com/api/miniconf/users/191035?format=json", "institution": "Northwestern University"}, {"id": 191036, "fullname": "Andrea Bejar", "url": "http://cvpr.thecvf.com/api/miniconf/users/191036?format=json", "institution": "Northwestern University"}, {"id": 191037, "fullname": "Baver Tutun", "url": "http://cvpr.thecvf.com/api/miniconf/users/191037?format=json", "institution": "Prof. Dr. Cemil Tascioglu City Hospital"}, {"id": 191038, "fullname": "Emre Uysal", "url": "http://cvpr.thecvf.com/api/miniconf/users/191038?format=json", "institution": "Health Sciences University"}, {"id": 191039, "fullname": "Ezgi B\u00fclb\u00fcl", "url": "http://cvpr.thecvf.com/api/miniconf/users/191039?format=json", "institution": "Health Sciences University"}, {"id": 191040, "fullname": "Mehmet Do\u011fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191040?format=json", "institution": "Health Sciences University"}, {"id": 191041, "fullname": "Berrin Erok", "url": "http://cvpr.thecvf.com/api/miniconf/users/191041?format=json", "institution": "Prof Dr Cemil Ta\u015f\u00e7\u0131o\u011flu City Hospital"}, {"id": 191042, "fullname": "Berna Yildirim", "url": "http://cvpr.thecvf.com/api/miniconf/users/191042?format=json", "institution": "Health Sciences University"}, {"id": 191043, "fullname": "Sukru M Erturk", "url": "http://cvpr.thecvf.com/api/miniconf/users/191043?format=json", "institution": "Istanbul University"}, {"id": 191044, "fullname": "Ulas Bagci", "url": "http://cvpr.thecvf.com/api/miniconf/users/191044?format=json", "institution": "Northwestern University"}], "abstract": "Publicly available full-field digital mammography (FFDM) datasets remain limited in size, clinical labels, and vendor diversity, which hinders the training of robust models. We present \\textbf{LUMINA}, a curated, multi-vendor FFDM dataset that explicitly encodes acquisition energy and vendor metadata to expose clinically relevant appearance shifts that current benchmarks overlook. This innovative resource comprises 1824 images from 468 patients (960 benign, 864 malignant) with pathology-confirmed outcomes, BI-RADS assessments, and breast-density annotations. LUMINA spans six acquisition systems and both high- and low-energy styles, exposing vendor- and energy-driven appearance shifts. To reduce cross\u2011vendor/energy drift while preserving lesion morphology, we introduce a \\textit{foreground\u2011only}, pixel\u2011space alignment (\u201cenergy harmonization\u201d) that aligns each image to a low\u2011energy reference style, leaving the zero\u2011valued background unchanged. By benchmarking modern CNN and transformer baselines on three clinically meaningful tasks\u2014diagnosis (benign vs. malignant), BI\u2011RADS risk grouping, and density\u2014we unify single\u2011vs\u2011two\u2011view evaluation and show that two\u2011view models consistently outperform single\u2011view; in our benchmark, EfficientNet\u2011B0 ($512^2$) attains AUC 93.61\\% for diagnosis, and Swin\u2011T yields the best macro\u2011AUC 89.10\\% for density. Harmonization improves AUC/ACC across backbones and yields more focal Grad\u2011CAM localization around suspicious regions. Being a richly annotated resource, LUMINA thus provides (a) a vendor\u2011diverse, energy\u2011labeled benchmark and (b) a model\u2011agnostic harmonization protocol that together catalyze reliable, deployable mammography AI.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38951", "url": null, "sourceid": 33545, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39345, "uid": "e91df292995700ed5f85a01c4515962c", "name": "DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs", "authors": [{"id": 191891, "fullname": "Yanbin Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/191891?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 191892, "fullname": "Jiangyue Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191892?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 191893, "fullname": "Chun Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191893?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 143592, "fullname": "Yang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/143592?format=json", "institution": "Southern University of Science and Technology"}, {"id": 191894, "fullname": "Hua Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191894?format=json", "institution": "Southern University of Science and Technology"}, {"id": 86143, "fullname": "James Kwok", "url": "http://cvpr.thecvf.com/api/miniconf/users/86143?format=json", "institution": "Department of Computer Science and Engineering, The Hong Kong University of Science and Technology"}, {"id": 86185, "fullname": "Yu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86185?format=json", "institution": "Southern University of Science and Technology"}], "abstract": "Vision-Language Models (VLMs) have emerged as versatile solutions for zero-shot question answering (QA) across various domains. However, enabling VLMs to effectively comprehend structured graphs and perform accurate, efficient QA remains challenging. Existing approaches typically rely on a single type of graph topology representation (GTR) of graphs, such as fixed-style visual images or unified text descriptions. This \"one-size-fits-all'' strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or overly lengthy responses to graph-related queries. To address this,  we propose the $\\mbox{DynamicGTR}$ framework, which dynamically selects the optimal GTR for each query during inference, thereby enhancing the zero-shot graph QA capabilities of VLMs with a customizable accuracy and brevity trade-off. Extensive experiments show that DynamicGTR not only improves VLM-based graph algorithm QA performance but also successfully transfers the experience trained from synthetic graph algorithm tasks to real-world applications like link prediction and node classification, without any additional training. Additionally, DynamicGTR demonstrates strong transferability across tasks, domains, and models, suggesting its potential as a flexible solution for broad graph scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39345", "url": null, "sourceid": 36422, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37268, "uid": "582ca5172eb531ba24fbeaf043de3d0d", "name": "Efficient Hybrid SE(3)-Equivariant Visuomotor Flow Policy via Spherical Harmonics for Robot Manipulation", "authors": [{"id": 187053, "fullname": "Qinglun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187053?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 187054, "fullname": "Shen Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187054?format=json", "institution": "Megvii Technology Inc."}, {"id": 187055, "fullname": "Tian Dan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187055?format=json", "institution": null}, {"id": 128936, "fullname": "Haoqiang Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/128936?format=json", "institution": "Dexmal"}, {"id": 187056, "fullname": "Guanghui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187056?format=json", "institution": "University of Electronic Science and Technology of China, Tsinghua University"}, {"id": 93490, "fullname": "Shuaicheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/93490?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "While existing equivariant methods enhance data efficiency, they suffer from high computational intensity, reliance on single-modality inputs, and instability when combined with fast-sampling methods. In this work, we propose E3Flow, a novel framework that addresses the critical limitations of equivariant diffusion policies.  E3Flow overcomes these challenges, successfully unifying efficient rectified flow with stable, multi-modal equivariant learning for the first time. Our framework is built upon spherical harmonic representations to ensure rigorous SO(3) equivariance. We introduce a novel invariant Feature Enhancement Module (FEM) that dynamically fuses hybrid visual modalities (point clouds and images), injecting rich visual cues into the spherical harmonic features. We evaluate E3Flow on 8 manipulation tasks from the MimicGen benchmark and further conduct 4 real-world experiments to validate its effectiveness in physical environments. Simulation results show that E3Flow achieves a 3.12\\% improvement in average success rate over the state-of-the-art Spherical Diffusion Policy (SDP) while simultaneously delivering a 7$\\times$ inference speedup. E3Flow thus demonstrates a new and highly effective trade-off between performance, efficiency, and data efficiency for robotic policy learning. Code and videos will be released.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37268", "url": null, "sourceid": 42162, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38772, "uid": "5143b1a03753e9634ad9395eb18e0417", "name": "DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation", "authors": [{"id": 90071, "fullname": "Zhechao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90071?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 190631, "fullname": "Yiming Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190631?format=json", "institution": "XPeng Motors"}, {"id": 190632, "fullname": "Lufan Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/190632?format=json", "institution": "XPeng Motors"}, {"id": 190633, "fullname": "Zeqing Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190633?format=json", "institution": "XPeng Motors"}, {"id": 181757, "fullname": "Chen Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/181757?format=json", "institution": "Xpeng Motors"}, {"id": 70172, "fullname": "Dongshuo Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/70172?format=json", "institution": "Tsinghua University"}, {"id": 190634, "fullname": "Ziyao Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190634?format=json", "institution": "XPeng Motors"}, {"id": 165027, "fullname": "Cheng Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/165027?format=json", "institution": null}], "abstract": "Synthesis of diverse driving scenes serves as a crucial data augmentation technique for validating the robustness and generalizability of autonomous driving systems. Current methods aggregate high-definition (HD) maps and 3D bounding boxes as geometric conditions in diffusion models for conditional scene generation. However, implicit inter-condition dependency causes generation failures when control conditions change independently. Additionally, these methods suffer from insufficient details in both semantic and structural aspects. Specifically, brief and view-invariant captions restrict semantic contexts, resulting in weak background modeling. Meanwhile, the standard denoising loss with uniform spatial weighting neglects foreground structural details, causing visual distortions and blurriness. To address these challenges, we propose DrivePTS, which incorporates three key innovations. Firstly, our framework adopts a progressive learning strategy to mitigate inter-dependency between geometric conditions, reinforced by an explicit mutual information constraint. Secondly, a Vision-Language Model is utilized to generate multi-view hierarchical descriptions across six semantic aspects, providing fine-grained textual guidance. Thirdly, a frequency-guided structure loss is introduced to strengthen the model's sensitivity to high-frequency elements, improving foreground structural fidelity. Extensive experiments demonstrate that our DrivePTS achieves state-of-the-art fidelity and controllability in generating diverse driving scenes. Notably, DrivePTS successfully generates rare scenes where existing methods fail, highlighting its strong generalization ability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38772", "url": null, "sourceid": 33278, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36593, "uid": "e069a65788839872ffe1902a16286563", "name": "Bidirectional Normalizing Flow: From Data to Noise and Back", "authors": [{"id": 180383, "fullname": "Yiyang Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180383?format=json", "institution": "Tsinghua University"}, {"id": 180579, "fullname": "Qiao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/180579?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 185425, "fullname": "Xianbang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185425?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 180589, "fullname": "Zhicheng Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180589?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 181517, "fullname": "Hanhong Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/181517?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 150920, "fullname": "Kaiming He", "url": "http://cvpr.thecvf.com/api/miniconf/users/150920?format=json", "institution": "Massachusetts Institute of Technology"}], "abstract": "Normalizing Flows (NFs) are a principled framework for generative modeling, consisting of a forward process and a reverse process. The forward process maps data to a simple prior distribution, while the reverse process generates samples by inverting this mapping. Traditional approaches focus on designing expressive forward transformations under strict requirement of explicitly invertibility, so that the reverse process can serve as their exact analytic inverse. Recent advances such as TARFlow enhance the forward model with Transformers and autoregressive structures, achieving state-of-the-art generation quality\u2014but at the expense of slow sampling due to autoregressive decoding. In this work, we introduce Bidirectional Normalizing Flow ($\\textbf{BiFlow}$), a new framework that removes the need for an exact analytic inverse by learning a flexible, data-driven reverse model to $\\textbf{approximate}$ the inverse mapping. This relaxation enables richer architectures and loss formulations while preserving the probabilistic foundation of NFs. BiFlow performs direct, single-forward (1-NFE) generation, eliminating autoregressive bottlenecks and achieving up to two orders of magnitude faster sampling with improved generation quality. We hope this work encourages rethinking Normalizing Flows as direct, flexible, and efficient generative models.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36593", "url": null, "sourceid": 32199, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40141, "uid": "d21b47a5b72b3ca7e9f74750fff29dd1", "name": "Stake the Points: Structure-Faithful Instance Unlearning", "authors": [{"id": 193611, "fullname": "Kiseong Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/193611?format=json", "institution": "Chung-Ang University"}, {"id": 156659, "fullname": "JungKyoo Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/156659?format=json", "institution": "Chung-Ang University"}, {"id": 156661, "fullname": "Eunwoo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/156661?format=json", "institution": "ChungAng University"}], "abstract": "Machine unlearning (MU) addresses privacy risks in pretrained models. The main goal of MU is to remove the influence of designated data while preserving the utility of retained knowledge. Achieving this goal requires preserving semantic relations among retained instances, which existing studies often overlook. We observe that without such preservation, models suffer from progressive structural collapse, undermining both the deletion\u2013retention balance. In this work, we propose a novel structure-faithful framework that introduces stakes, i.e., semantic anchors that serve as reference points to maintain the knowledge structure. By leveraging these anchors, our framework captures and stabilizes the semantic organization of knowledge. Specifically, we instantiate the anchors from language-driven attribute descriptions encoded by a semantic encoder (e.g., CLIP). We enforce preservation of the knowledge structure via structure-aware alignment and regularization: the former aligns the organization of retained knowledge before and after unlearning around anchors, while the latter regulates updates to structure-critical parameters. Results from image classification, retrieval, and face recognition show average gains of 32.9%, 22.5%, and 19.3% in performance, balancing the deletion\u2013retention trade-off and enhancing generalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40141", "url": null, "sourceid": 40762, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36185, "uid": "817a5acf77700c5f189851e8deacd22c", "name": "Rethinking Camera Choice : An Empirical Study on Fisheye Camera Properties in Robotic Manipulation", "authors": [{"id": 77515, "fullname": "Han Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/77515?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 184363, "fullname": "Nan Min", "url": "http://cvpr.thecvf.com/api/miniconf/users/184363?format=json", "institution": "Southeast University"}, {"id": 184364, "fullname": "Xiaotong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184364?format=json", "institution": "University of Science and Technology of China"}, {"id": 184365, "fullname": "Wendi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184365?format=json", "institution": "Shanghai Jiao Tong University; Shanghai Innovation Institute"}, {"id": 184366, "fullname": "Fang Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184366?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 184367, "fullname": "Jun Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/184367?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 86440, "fullname": "Cewu Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86440?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 181363, "fullname": "Chuan Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/181363?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "The adoption of fisheye cameras in robotic manipulation, driven by their exceptionally wide Field of View (FoV), is rapidly outpacing a systematic understanding of their downstream effects on policy learning. This paper presents the first comprehensive empirical study to bridge this gap, rigorously analyzing the properties of wrist-mounted fisheye cameras for imitation learning. Through extensive experiments in both simulation and the real world, we investigate three critical research questions: spatial localization, scene generalization, and hardware generalization. Our investigation reveals that: (1) The wide FoV significantly enhances spatial localization, but this benefit is critically contingent on the visual complexity of the environment. (2) Fisheye-trained policies, while prone to overfitting in simple scenes, unlock superior scene generalization when trained with sufficient environmental diversity. (3) While naive cross-camera transfer leads to failures, we identify the root cause as scale overfitting and demonstrate that hardware generalization performance can be improved with a simple Random Scale Augmentation (RSA) strategy. Collectively, our findings provide concrete, actionable guidance for the large-scale collection and effective use of fisheye datasets in robotic learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36185", "url": null, "sourceid": 37495, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39547, "uid": "f835b8aee9c81ffe5be5ec96139a0567", "name": "URScenes: A Multi-scenario Dataset for Unstructured Road Environments", "authors": [{"id": 180384, "fullname": "runsen liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180384?format=json", "institution": "Beihang University"}, {"id": 192316, "fullname": "Aizemaitijiang Baoerhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192316?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 192317, "fullname": "Zhangyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192317?format=json", "institution": "Beihang University"}, {"id": 192318, "fullname": "Jie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192318?format=json", "institution": "Beihang University"}, {"id": 192319, "fullname": "Jinghao Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/192319?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 192320, "fullname": "GuizhenYu GuizhenYu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192320?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 192321, "fullname": "Songyue Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192321?format=json", "institution": null}, {"id": 192322, "fullname": "WanCheng Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/192322?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 192323, "fullname": "Mingjun Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192323?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 171619, "fullname": "Zhanbo Hua", "url": "http://cvpr.thecvf.com/api/miniconf/users/171619?format=json", "institution": "2077AI"}, {"id": 192324, "fullname": "Wenwen Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192324?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}], "abstract": "As autonomous driving technology transitions from small-scale validation to large-scale deployment, its development in unstructured road environments has become a critical and inevitable trend. Autonomous vehicles increasingly rely on high-quality and diverse datasets for perception systems. However, existing public datasets predominantly focus on clear-weather and urban-road scenarios, leaving a significant gap in the coverage of unstructured road environments. To bridge this gap, we construct URScenes, the first multi-scenario, open-source perception dataset for unstructured road environments. The dataset consists of 472 scenes, each lasting 30 seconds, and provides over 28K annotated samples and 119K sweeps. URScenes, for the first time, covers eight typical scenarios, including rainy, snowy, foggy, dusty, glare, night, cloudy, and sunny conditions. Additionally, URScenes supports multi-task perception for 3D object detection, multi-object tracking, and 3D occupancy in unstructured road environments. URScenes also provides a unified annotation system and format conversion tools, enabling easy conversion to popular formats such as NuScenes, KITTI, and Waymo dataset. Finally, this study presents comparative experimental results to assess the performance of state-of-the-art algorithms on the URScenes dataset. The data, development toolkit, and additional information are available online.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39547", "url": null, "sourceid": 43718, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39941, "uid": "468df2874368dc8fc93a74a0977cfba5", "name": "GazeShift: Unsupervised Gaze Estimation and Dataset for VR", "authors": [{"id": 193160, "fullname": "Gil Shapira", "url": "http://cvpr.thecvf.com/api/miniconf/users/193160?format=json", "institution": "Samsung; Bar-Ilan University"}, {"id": 193161, "fullname": "Ishay Goldin", "url": "http://cvpr.thecvf.com/api/miniconf/users/193161?format=json", "institution": "Samsung"}, {"id": 193162, "fullname": "Evgeny Artyomov", "url": "http://cvpr.thecvf.com/api/miniconf/users/193162?format=json", "institution": "Samsung"}, {"id": 149195, "fullname": "Donghoon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/149195?format=json", "institution": "Samsung electronics"}, {"id": 72068, "fullname": "Yosi Keller", "url": "http://cvpr.thecvf.com/api/miniconf/users/72068?format=json", "institution": "Bar Ilan University"}, {"id": 193163, "fullname": "Niv Zehngut", "url": "http://cvpr.thecvf.com/api/miniconf/users/193163?format=json", "institution": "Samsung"}], "abstract": "Gaze estimation is instrumental in modern virtual reality (VR) systems. Despite significant progress in remote-camera gaze estimation, VR gaze research remains constrained by data scarcity\u2014particularly the lack of large-scale, accurately labeled datasets captured with the off-axis camera configurations typical of modern headsets. Gaze annotation is difficult since fixation on intended targets cannot be guaranteed. To address these challenges, we introduce VRGaze\u2014the first large-scale off-axis gaze estimation dataset for VR\u2014comprising 2.1 million near-eye infrared images collected from 68 participants. We further propose GazeShift, an attention-guided unsupervised framework for learning gaze representations without labeled data. Unlike prior redirection-based methods that rely on multi-view or 3D geometry, GazeShift is tailored to near-eye infrared imagery, achieving effective gaze\u2013appearance disentanglement in a compact, real-time model. A lightweight few-shot calibration can optionally adapt embeddings to individual users, achieving 1.84\u00b0 mean error on VRGaze under per-person calibration and 7.15\u00b0 on MPIIGaze under person-agnostic calibration, with a tenfold reduction in parameters and 5 ms runtime on a VR headset GPU. Quantitative robustness analyses confirm invariance to illumination variations, demonstrating a label-efficient and deployable solution for VR gaze estimation.VRGaze and GazeShift are released under \\url{https://github.com/gazeshift3/gazeshift}.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39941", "url": null, "sourceid": 34795, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36441, "uid": "7cfc37ec1434434ea9e2c926984fd996", "name": "PolarGuide-GSDR: 3D Gaussian Splatting Driven by Polarization Priors and Deferred Reflection for Real-World Reflective Scenes", "authors": [{"id": 174079, "fullname": "Derui Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/174079?format=json", "institution": "North China University of Technology"}, {"id": 185053, "fullname": "Qian Qiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185053?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 185054, "fullname": "Hao Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185054?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 185055, "fullname": "Tao Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/185055?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 157536, "fullname": "Peng Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157536?format=json", "institution": "Beijing University of Posts and Telecommunications"}], "abstract": "Polarization-aware Neural Radiance Fields (NeRF) enable novel view synthesis of specular-reflection scenes but face challenges in slow training, inefficient rendering, and strong dependencies on material/viewpoint assumptions. However, 3D Gaussian Splatting (3DGS) enables real-time rendering yet struggles with accurate reflection reconstruction from reflection-geometry entanglement, adding a deferred reflection module introduces environment map dependence. We address these limitations by proposing PolarGuide-GSDR, a polarization-forward-guided paradigm establishing a bidirectional coupling mechanism between polarization and 3DGS: first 3DGS\u2019s geometric priors are leveraged to resolve  polarization ambiguity, and then the refined polarization information cues are used to guide 3DGS\u2019s normal and spherical harmonic representation. This process achieves high-fidelity reflection separation and full-scene reconstruction without requiring environment maps or restrictive material assumptions. We demonstrate on public and self-collected datasets that PolarGuide-GSDR achieves state-of-the-art performance in specular reconstruction, normal estimation, and novel view synthesis, all while maintaining real-time rendering capabilities. To our knowledge, this is the first framework embedding polarization priors directly into 3DGS optimization, yielding superior interpretability and real-time performance for modeling complex reflective scenes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36441", "url": null, "sourceid": 41234, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36384, "uid": "8d797c31772f01dcac08cb12f08cf389", "name": "Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs", "authors": [{"id": 184921, "fullname": "Hiran Sarkar", "url": "http://cvpr.thecvf.com/api/miniconf/users/184921?format=json", "institution": "Sony Research India"}, {"id": 178103, "fullname": "Liming Kuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/178103?format=json", "institution": "Technical University of Munich"}, {"id": 184922, "fullname": "Yordanka Velikova", "url": "http://cvpr.thecvf.com/api/miniconf/users/184922?format=json", "institution": "Technical University of Munich"}, {"id": 74044, "fullname": "Benjamin Busam", "url": "http://cvpr.thecvf.com/api/miniconf/users/74044?format=json", "institution": "Technical University of Munich"}], "abstract": "Predicting scene dynamics from visual observations is challenging. Existing methods capture dynamics only within observed boundaries failing to extrapolate far beyond the training sequence. Node-RF (Neural ODE-based NeRF) overcomes this limitation by integrating Neural Ordinary Differential Equations (NODEs) with dynamic Neural Radiance Fields (NeRFs), enabling a continuous-time, spatiotemporal representation that generalizes beyond observed trajectories at constant memory cost. From visual input, Node-RF learns an implicit scene state that evolves over time via an ODE solver, propagating feature embeddings via differential calculus. A NeRF-based renderer interprets calculated embeddings to synthesize arbitrary views for long-range extrapolation. Training on multiple motion sequences with shared dynamics allows for generalization to unseen conditions. Our experiments demonstrate that Node-RF can characterize abstract system behavior without explicit model to identify critical points for future predictions. Our code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36384", "url": null, "sourceid": 38806, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38935, "uid": "794baf210c4725bbbbc089dffeb553bb", "name": "Reasoning-Driven Anomaly Detection and Localization with Image-Level Supervision", "authors": [{"id": 76634, "fullname": "yizhou jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/76634?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 190997, "fullname": "Yuezhu Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190997?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 120011, "fullname": "Jinjin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/120011?format=json", "institution": "Beihang University"}, {"id": 190998, "fullname": "Peng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190998?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 77156, "fullname": "Qingjie Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/77156?format=json", "institution": "Beihang University"}, {"id": 89528, "fullname": "Yunhong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89528?format=json", "institution": "Beihang University"}], "abstract": "Multimodal large language models (MLLMs) have recently demonstrated remarkable reasoning and perceptual abilities for anomaly detection.However, most approaches remain confined to image-level anomaly detection and textual reasoning, while pixel-level localization still relies on external vision modules and dense annotations.In this work, we activate the intrinsic reasoning potential of MLLMs to perform anomaly detection, pixel-level localization, and interpretable reasoning solely from image-level supervision, without any auxiliary components or pixel-wise labels.Specifically, we propose Reasoning-Driven Anomaly Localization (ReAL), which extracts anomaly-related tokens from the autoregressive reasoning process and aggregates their attention responses to produce pixel-level anomaly maps.We further introduce a Consistency-Guided Reasoning Optimization (CGRO) module that leverages reinforcement learning to align reasoning tokens with visual attentions, resulting in more coherent reasoning and accurate anomaly localization.Extensive experiments on four public benchmarks demonstrate that our method significantly improves anomaly detection, localization, and interpretability.Remarkably, despite relying solely on image-level supervision, our approach achieves performance competitive with MLLM-based methods trained under dense pixel-level supervision.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38935", "url": null, "sourceid": 42832, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36543, "uid": "d4dbf7b5778980abf44eca853674f4e9", "name": "SFR-Net: Steering-Fusion-Refining Network in Multi-label Zero-Shot Sewer Defect Detection", "authors": [{"id": 180737, "fullname": "Zhao-Min Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/180737?format=json", "institution": "Wenzhou University"}, {"id": 185308, "fullname": "Xinjian Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185308?format=json", "institution": "Wenzhou University"}, {"id": 185309, "fullname": "Yisu Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/185309?format=json", "institution": "Wenzhou University"}, {"id": 185310, "fullname": "Yu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185310?format=json", "institution": "Zhejiang College of Security Technology"}], "abstract": "Due to the prohibitive cost of data annotation and the impossibility of exhaustively enumerating all defect categories, municipal sewer pipe defect detection poses significant generalization challenges for traditional models. Multi-Label Zero-Shot Learning (ML-ZSL) offers a viable solution to address this challenge. However, existing methods struggle to establish robust and fine-grained visual-semantic alignment between the complex visual environment inside the pipes and the often sparse semantic descriptions, leading to a critical issue: Alignment Ambiguity. To mitigate this, we propose a novel Steering-Fusion-Refining Network (SFR-Net) that follows a three-stage paradigm to progressively dissolve this ambiguity. This is achieved as the Representation Steering (RS) module first integrates a parameter-efficient feature steering mechanism to continuously adapt the representation to the pipe scene; the Multi-Granularity Evidence Fusion (MEF) module subsequently aggregates unambiguous multi-granularity visual evidence through decoupled global and local paths; and the Generalized Relational Score Refining (GR) module ultimately learns and transfers relational logic from seen defects to gain a universal score correction ability, directly refining preliminary prediction scores and significantly boosting the model\u2019s zero-shot generalization and prediction consistency. Extensive experiments on the public Sewer-ML dataset and our private WZ-Pipe dataset demonstrate that the proposed SFR-Net achieves state-of-the-art (SOTA) performance in multi-label zero-shot learning task.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36543", "url": null, "sourceid": 39383, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39126, "uid": "734ec5296d8357fa809ab7ff73b4740e", "name": "Tea-Adapter: Teacher Adapter for Efficient Conditional Generation", "authors": [{"id": 191407, "fullname": "Yinhan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191407?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 127301, "fullname": "Yue Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/127301?format=json", "institution": "The Hong Kong University of Science and Technology, Hong Kong"}, {"id": 191408, "fullname": "Fangqiu Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/191408?format=json", "institution": null}, {"id": 191409, "fullname": "Chenyang Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/191409?format=json", "institution": null}, {"id": 152943, "fullname": "Chi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152943?format=json", "institution": "China Telecom"}, {"id": 191410, "fullname": "Kunyu Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191410?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 128022, "fullname": "Zeyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128022?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}], "abstract": "We propose Tea-Adapter, a plug-and-play adapter designed to efficiently integrate conditional knowledge from a smaller teacher model into a larger student video diffusion model. Existing controllable video DiT methods face critical challenges: full fine-tuning of billion-parameter models is extremely expensive, while cascaded ControlNets introduce substantial parameter overhead and exhibit limited flexibility for novel multi-condition compositions.To overcome these issues, Tea-Adapter introduces a novel reversed distillation method, enabling large video diffusion models to inherit precise control capabilities from smaller, efficiently-tuned teacher diffusion models, eliminating full fine-tuning. Moreover, recognizing the intrinsic relationships between different conditions, we replace the cascaded ControlNet design with a Mixture of Condition Experts (MCE) layer. This structure dynamically routes diverse conditional inputs within a unified architecture, supporting both single-condition control and multiple condition combinations without additional training cost.To achieve cross-scale knowledge transfer, we further develop a Feature Propagation Module to ensure efficient and temporally consistent feature propagation across video frames.Experiments demonstrate that Tea-Adapter enables high-fidelity multiple condition video synthesis, making advanced controllable video generation feasible on low-resource hardware and establishing a new efficiency standard for the field.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39126", "url": null, "sourceid": 40022, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37483, "uid": "273ccb1327ae5286a5a511b9d4243711", "name": "Keep it SymPL: Symbolic Projective Layout for Allocentric Spatial Reasoning in Vision-Language Models", "authors": [{"id": 187565, "fullname": "Jaeyun Jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187565?format=json", "institution": "Kyung Hee University"}, {"id": 187566, "fullname": "Seunghui Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/187566?format=json", "institution": "Kyung Hee University"}, {"id": 183244, "fullname": "Taeho Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/183244?format=json", "institution": "Kyunghee University"}, {"id": 126303, "fullname": "Hyoseok Hwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126303?format=json", "institution": "Kyung Hee University"}], "abstract": "Perspective-aware spatial reasoning involves understanding spatial relationships from specific viewpoints\u2014either egocentric (observer-centered) or allocentric (object-centered).While vision\u2013language models (VLMs) perform well in egocentric settings, their performance deteriorates when reasoning from allocentric viewpoints, where spatial relations must be inferred from the perspective of objects within the scene.In this study, we address this underexplored challenge by introducing $\\textbf{Sym}$bolic $\\textbf{P}$rojective $\\textbf{L}$ayout (SymPL), a framework that reformulates allocentric reasoning into symbolic-layout forms that VLMs inherently handle well.By leveraging four key factors\u2014projection, abstraction, bipartition, and localization\u2014SymPL converts allocentric questions into structured symbolic-layout representations.Extensive experiments demonstrate that this reformulation substantially improves performance in both allocentric and egocentric tasks, enhances robustness under visual illusions and multi-view scenarios, and that each component contributes critically to these gains.These results show that SymPL provides an effective and principled approach for addressing complex perspective-aware spatial reasoning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37483", "url": null, "sourceid": 37313, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37751, "uid": "d111938e811c0ed7c4ee3e749b07b454", "name": "ReMatch: Boosting Representation through Matching for Multimodal Retrieval", "authors": [{"id": 188166, "fullname": "Qianying Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188166?format=json", "institution": null}, {"id": 188167, "fullname": "Xiao Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188167?format=json", "institution": "Xiaohongshu"}, {"id": 99765, "fullname": "Zhiqiang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/99765?format=json", "institution": null}, {"id": 184662, "fullname": "Yibo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184662?format=json", "institution": "Xiaohongshu"}, {"id": 86765, "fullname": "Xu Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86765?format=json", "institution": "Shanghaitech University"}, {"id": 129493, "fullname": "Zhongfei Qing", "url": "http://cvpr.thecvf.com/api/miniconf/users/129493?format=json", "institution": "SenseTime Research"}, {"id": 106597, "fullname": "Fengfan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/106597?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 126819, "fullname": "Yao Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126819?format=json", "institution": "Zhejiang University, Tsinghua University"}, {"id": 90382, "fullname": "Paul Henderson", "url": "http://cvpr.thecvf.com/api/miniconf/users/90382?format=json", "institution": "University of Glasgow"}], "abstract": "We present ReMatch, a framework that leverages the generative strength of MLLMs for multimodal retrieval. Previous approaches treated an MLLM as a simple encoder, ignoring its generative nature, and under-utilising its compositional reasoning and world knowledge. We instead train the embedding MLLM end-to-end with a chat-style generative matching stage. The matching stage uses the same MLLM to autoregressively decide relevance from multi-view inputs, including both raw data and its own projected embeddings for each query and document. It provides instance-wise discrimination supervision that complements a standard contrastive loss, offering stronger gradients on hard negatives and preserving the compositional strengths of the original MLLM. To obtain semantically richer multimodal embeddings, we use multiple learnable tokens to augment each input, generating fine-grained contextual, mutually orthogonal embeddings with low inference cost. Leveraging our established high-performance baseline, we assemble the ideas mentioned above into a powerful training recipe and achieve a new state-of-the-art on the Massive Multimodal Embedding Benchmark(MMEB).  Our experiments show particularly strong zero-shot generalization results on five datasets, highlighting the robustness and transferability of ReMatch.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37751", "url": null, "sourceid": 43705, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38835, "uid": "f559078e976df9bade3f5025d1fe8937", "name": "From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition", "authors": [{"id": 183763, "fullname": "Francesco Gentile", "url": "http://cvpr.thecvf.com/api/miniconf/users/183763?format=json", "institution": "Universit\u00e0 di Trento"}, {"id": 190796, "fullname": "Nicola DallAsen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190796?format=json", "institution": "SB Intutions"}, {"id": 190797, "fullname": "Francesco Tonini", "url": "http://cvpr.thecvf.com/api/miniconf/users/190797?format=json", "institution": "University of Trento"}, {"id": 126270, "fullname": "Massimiliano Mancini", "url": "http://cvpr.thecvf.com/api/miniconf/users/126270?format=json", "institution": "University of Trento"}, {"id": 146674, "fullname": "Lorenzo Vaquero", "url": "http://cvpr.thecvf.com/api/miniconf/users/146674?format=json", "institution": "Fondazione Bruno Kessler"}, {"id": 75841, "fullname": "Elisa Ricci", "url": "http://cvpr.thecvf.com/api/miniconf/users/75841?format=json", "institution": "University of Trento"}], "abstract": "As vision-language models are deployed at scale, understanding their internal mechanisms becomes increasingly critical. Existing interpretability methods predominantly rely on activations, making them dataset-dependent, vulnerable to data bias, and often restricted to coarse head-level explanations. We introduce SITH (Semantic Inspection of Transformer Heads), a fully data-free, training-free framework that directly analyzes CLIP\u2019s vision transformer in weight space. For each attention head, we decompose its value-output matrix into singular vectors and interpret each one via COMP (Coherent Orthogonal Matching Pursuit), a new algorithm that explains them as sparse, semantically coherent combinations of human-interpretable concepts. We show that SITH yields monosemantic, faithful intra-head explanations, validated through reconstruction fidelity and interpretability experiments. This allows us to use SITH for precise, interpretable weight-space model edits that amplify or suppress specific concepts, improving downstream performance without retraining. Furthermore, we use SITH to study model adaptation, showing how fine-tuning primarily reweights a stable semantic basis rather than learning entirely new features. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38835", "url": null, "sourceid": 31945, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36356, "uid": "5800ccd9514fd789d08e5831951aa6bc", "name": "Efficient and High-Fidelity Omni Modality Retrieval", "authors": [{"id": 88937, "fullname": "Chuong Huynh", "url": "http://cvpr.thecvf.com/api/miniconf/users/88937?format=json", "institution": "Pinterest Inc."}, {"id": 184864, "fullname": "Manh Luong", "url": "http://cvpr.thecvf.com/api/miniconf/users/184864?format=json", "institution": "Monash University"}, {"id": 98052, "fullname": "Abhinav Shrivastava", "url": "http://cvpr.thecvf.com/api/miniconf/users/98052?format=json", "institution": "University of Maryland"}], "abstract": "Multimodal retrieval is the task of aggregating information from queries across heterogeneous modalities to retrieve desired targets. State-of-the-art multimodal retrieval models can understand complex queries, yet they are typically limited to two modalities: text and vision. This limitation impedes the development of universal retrieval systems capable of comprehending queries that combine more than two modalities. To advance toward this goal, we present OmniRet, the first retrieval model capable of handling complex, composed queries spanning three key modalities: text, vision, and audio.Our OmniRet model addresses two critical challenges for universal retrieval: computational efficiency and representation fidelity. First, feeding massive token sequences from modality-specific encoders to Large Language Models (LLMs) is computationally inefficient. We therefore introduce an attention-based resampling mechanism to generate compact, fixed-size representations from these sequences. This shared module is designed to maintain representational diversity and generalization capabilities while remaining sensitive to modality-specific information. Second, compressing rich omni-modal data into a single embedding vector inevitably causes information loss and discards fine-grained details. We propose Attention Sliced Wasserstein Pooling to preserve these fine-grained details, leading to improved omni-modal representations.OmniRet is trained on an aggregation of approximately 6 million query-target pairs spanning 30 datasets. We benchmark our model on 13 retrieval tasks and a MMEBv2 subset. Our model demonstrates significant improvements on composed query, audio and video retrieval tasks, while achieving on-par performance with state-of-the-art models on others. Furthermore, we curate a new Audio-Centric Multimodal Benchmark (ACM). This new benchmark introduces two critical, previously missing tasks\u2014composed audio retrieval and audio-visual retrieval\u2014to more comprehensively evaluate a model's omni-modal embedding capacity. We believe our benchmark will facilitate the development of universal retrieval systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36356", "url": null, "sourceid": 36248, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40062, "uid": "4a11d6352175e2c16fa7c264092942a0", "name": "CLP: A Real-World Dataset of Contaminated Lens Protectors for Robust Semantic Segmentation", "authors": [{"id": 156105, "fullname": "Sungyong Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/156105?format=json", "institution": "Soongsil University"}, {"id": 193413, "fullname": "Sooyoung Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/193413?format=json", "institution": "Soongsil University"}, {"id": 193414, "fullname": "Hyunseo Koh", "url": "http://cvpr.thecvf.com/api/miniconf/users/193414?format=json", "institution": "Soongsil University"}, {"id": 193415, "fullname": "Youngjae Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/193415?format=json", "institution": "Soongsil University"}, {"id": 156106, "fullname": "Heewon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/156106?format=json", "institution": "Soongsil University"}], "abstract": "The reliability of autonomous systems in real-world environments is mainly dependent on the robustness of their visual perception.Although recent studies have advanced the handling of visual degradations, physical contaminants that adhere to the camera lens\u2014such as mud, water droplets, and condensation\u2014remain largely underexplored.To this end, we introduce the CLP (Contaminated Lens Protector) dataset, a real-world benchmark designed to evaluate perception performance under realistic lens-protector contamination.The CLP dataset offers degraded images across multiple types of contamination and various lens-to-protector distances, along with dense semantic segmentation masks and aligned restoration targets.This dataset enables robust segmentation and restoration studies in conditions that closely match those encountered by real-world autonomous systems.Experiments analyze strategies to improve perception under contamination with limited data, highlighting the importance of domain generalization, foundation models, data scale, and joint restoration-segmentation pipelines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40062", "url": null, "sourceid": 39643, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37076, "uid": "335ebb59c2d4bc89cef80c692c9a10b7", "name": "Learning Surgical Robotic Manipulation with 3D Spatial Priors", "authors": [{"id": 186606, "fullname": "Yu Sheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186606?format=json", "institution": "University of Science and Technology of China"}, {"id": 186607, "fullname": "Lidian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186607?format=json", "institution": "University of Science and Technology of China"}, {"id": 156145, "fullname": "Xiaomeng Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156145?format=json", "institution": "University of Science and Technology of China"}, {"id": 186608, "fullname": "Jiajun Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186608?format=json", "institution": "University of Adelaide"}, {"id": 186609, "fullname": "Min Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186609?format=json", "institution": "Tuodao Medical"}, {"id": 90108, "fullname": "Yanyong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90108?format=json", "institution": "Rutgers University, Newark"}, {"id": 186610, "fullname": "Bei Hua", "url": "http://cvpr.thecvf.com/api/miniconf/users/186610?format=json", "institution": "University of Science and Technology of China"}, {"id": 89074, "fullname": "Houqiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/89074?format=json", "institution": "University of Science and Technology of China"}, {"id": 90089, "fullname": "Jianmin Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/90089?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Achieving 3D spatial awareness is crucial for surgical robotic manipulation, where precise and delicate operations are required.  Existing methods either explicitly reconstruct the surgical scene prior to manipulation, or enhance multi-view features by adding wrist-mounted cameras to supplement the default stereo endoscopes.  However, both paradigms suffer from notable limitations: the former easily leads to error accumulation and prevents end-to-end optimization due to its multi-stage nature, while the latter is rarely adopted in clinical practice since wrist-mounted cameras can interfere with the motion of surgical robot arms.  In this work, we introduce the Spatial Surgical Transformer (SST), an end-to-end visuomotor policy that empowers surgical robots with 3D spatial awareness by directly exploring 3D spatial cues embedded in endoscopic images. First, we build Surgical3D, a large-scale photorealistic dataset containing 30K stereo endoscopic image pairs with accurate 3D geometry, addressing the scarcity of 3D data in surgical scenes. Based on Surgical3D, we finetune a powerful geometric transformer to extract robust 3D latent representations from stereo endoscopes images. These representations are then seamlessly aligned with the robot's action space via a lightweight multi-level spatial feature connector (MSFC), all within an endoscope-centric coordinate frame.  Extensive real-robot experiments demonstrate that SST achieves state-of-the-art performance and strong spatial generalization on complex surgical tasks such as knot tying and ex-vivo organ dissection, representing a significant step toward practical clinical deployment. The dataset and code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37076", "url": null, "sourceid": 40325, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38454, "uid": "d7524e44b42e52a688f6b5cc14bb3006", "name": "Scene Grounding in the Wild", "authors": [{"id": 189889, "fullname": "Tamir Cohen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189889?format=json", "institution": "Tel Aviv University"}, {"id": 189890, "fullname": "Leo Segre", "url": "http://cvpr.thecvf.com/api/miniconf/users/189890?format=json", "institution": "Tel Aviv University"}, {"id": 189891, "fullname": "Shay Shomer-Chai", "url": "http://cvpr.thecvf.com/api/miniconf/users/189891?format=json", "institution": "Tel Aviv University"}, {"id": 88462, "fullname": "Shai Avidan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88462?format=json", "institution": "Tel-Aviv University"}, {"id": 156128, "fullname": "Hadar Averbuch-Elor", "url": "http://cvpr.thecvf.com/api/miniconf/users/156128?format=json", "institution": "Department of Computer Science, Cornell University"}], "abstract": "Reconstructing accurate 3D models of large-scale real-world scenes from unstructured, in-the-wild imagery remains a core challenge in computer vision, especially when the input views have little or no overlap. In such cases, existing reconstruction pipelines often produce multiple disconnected partial reconstructions or erroneously merge non-overlapping regions into overlapping geometry.In this work, we propose a framework that grounds each partial reconstruction to a complete reference model of the scene, enabling globally consistent alignment even in the absence of visual overlap. We obtain reference models from dense, geospatially accurate pseudo-synthetic renderings derived from Google Earth Studio. These renderings provide full scene coverage but differ substantially in appearance from real-world photographs. Our key insight is that, despite this significant domain gap, both domains share the same underlying scene semantics. We represent the reference model using 3D Gaussian Splatting, augmenting each Gaussian with semantic features, and formulate alignment as an inverse feature-based optimization scheme that estimates a global 6DoF pose and scale while keeping the reference model fixed. Furthermore, we introduce the WikiEarth dataset, which registers existing partial 3D reconstructions with pseudo-synthetic reference models. We demonstrate that our approach consistently improves global alignment when initialized with various classical and learning-based pipelines, while mitigating failure modes of state-of-the-art end-to-end models. All code, data, and trained models will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38454", "url": null, "sourceid": 31193, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38347, "uid": "6d27466e8d5feb302b427042e1034d63", "name": "Bidirectional Multimodal Prompt Learning with Scale-Aware Training for Few-Shot Multi-Class Anomaly Detection", "authors": [{"id": 189666, "fullname": "Yujin Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/189666?format=json", "institution": "Yonsei University"}, {"id": 189667, "fullname": "Sewon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/189667?format=json", "institution": "Yonsei University"}, {"id": 189668, "fullname": "Daeun Moon", "url": "http://cvpr.thecvf.com/api/miniconf/users/189668?format=json", "institution": "Yonsei University"}, {"id": 189669, "fullname": "Seoyoon Jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189669?format=json", "institution": null}, {"id": 72999, "fullname": "Hyunsoo Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/72999?format=json", "institution": "Yonsei University"}], "abstract": "Few-shot multi-class anomaly detection is crucial in real industrial settings, where only a few normal samples are available while numerous object types must be inspected. This setting is particularly challenging because defect patterns vary widely across categories while normal data remain scarce. Existing vision\u2013language model\u2013based approaches typically depend on class-specific anomaly descriptions or auxiliary modules, limiting both scalability and computational efficiency. In this work, we propose AnoPLe, a lightweight multimodal prompt learning framework that removes reliance on anomaly-type textual descriptions and avoids any external modules. AnoPLe employs bidirectional interactions between textual and visual prompts, allowing class semantics and instance-level cues to refine one another and form class-grounded representations that capture shared normal patterns across categories. To enhance localization, we design a scale-aware prefix trained on both global and local views, enabling the prompts to capture both global context and fine-grained details. In addition, an alignment loss propagates local anomaly evidence to global features, strengthening the consistency between pixel- and image-level predictions. Despite its simplicity, AnoPLe achieves strong performance on MVTec-AD, VisA, and Real-IAD under the few-shot multi-class setting, surpassing prior approaches while remaining efficient and free from expert-crafted anomaly descriptions. Moreover, AnoPLe generalizes well to unseen anomalies and even extends effectively to the medical domain.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38347", "url": null, "sourceid": 38492, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38577, "uid": "a6a2b3e3c6569e69afe1e025d0ff8d1f", "name": "Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning", "authors": [{"id": 90278, "fullname": "Yuto Shibata", "url": "http://cvpr.thecvf.com/api/miniconf/users/90278?format=json", "institution": "Keio University"}, {"id": 190193, "fullname": "Kashu Yamazaki", "url": "http://cvpr.thecvf.com/api/miniconf/users/190193?format=json", "institution": "CMU, Carnegie Mellon University"}, {"id": 190194, "fullname": "Lalit Jayanti", "url": "http://cvpr.thecvf.com/api/miniconf/users/190194?format=json", "institution": "Carnegie Mellon University"}, {"id": 90274, "fullname": "Yoshimitsu Aoki", "url": "http://cvpr.thecvf.com/api/miniconf/users/90274?format=json", "institution": "Keio University"}, {"id": 90260, "fullname": "Mariko Isogawa", "url": "http://cvpr.thecvf.com/api/miniconf/users/90260?format=json", "institution": "Keio University"}, {"id": 93898, "fullname": "Katerina Fragkiadaki", "url": "http://cvpr.thecvf.com/api/miniconf/users/93898?format=json", "institution": "CMU"}], "abstract": "Humanoid robotics has strong potential to transform daily service and caregiving applications. Although recent advances in general motion tracking within physics engines (GMT) have enabled virtual characters and humanoid robots to reproduce a broad range of human motions, these behaviors remain largely isolated and non-interactive. Assistive scenarios, by contrast, require continuous awareness of a human partner and rapid adaptation to their evolving posture and dynamics.In this paper, we formulate the imitation of closely interacting, force-exchanging human\u2013human motion sequences as a multi-agent reinforcement learning problem. We jointly train partner-aware policies for both the supporter (assistant) and the recipient in a physics simulator to track assistive motion references. To make this problem tractable, we introduce a partner policies initialization scheme that transfers priors from single-human motion-tracking controllers, greatly improving exploration. We further propose dynamic reference retargeting and contact-promoting reward, which adapt the assistant\u2019s reference motion to the recipient\u2019s real-time pose and encourage physically meaningful support.We show that AssistMimic is the first method capable of successfully tracking assistive interaction motions on established benchmarks, demonstrating the benefits of a multi-agent RL formulation for physically grounded and socially aware humanoid control. We will make our code available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38577", "url": null, "sourceid": 32670, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37873, "uid": "416419b78de5f8db4cbea08fc3583666", "name": "Memory Matters: Boosting Training-Free Zero-Shot Temporal Action Localization with a Learnable Lookup Table", "authors": [{"id": 188456, "fullname": "Han Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188456?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 87813, "fullname": "Haoyu Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87813?format=json", "institution": "Shandong University"}, {"id": 188457, "fullname": "Xiaoxuan Mu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188457?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 188458, "fullname": "Chen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188458?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 87817, "fullname": "Jihua Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87817?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "Zero-Shot Temporal Action Localization (ZS-TAL) aims to classify and localize actions in untrimmed videos that are unseen during training. Existing training-based ZS-TAL methods typically rely on fine-tuning models on large-scale annotated training data. This can be impractical in real-world applications and damage its generalization. As a result, Training-Free ZS-TAL has gained attention, which directly leveraging Vision-Language Models (VLM) enables action localization without any additional training. However, current techniques perform test-time adaptation independently on each video, neglecting the potential benefit of accumulating knowledge from historical test videos. To address this, we propose a learnable lookup table (LLT) framework. During testing, we continuously update lookup table by incorporating high-confidence, diverse lookup candidates to construct action-positive lookup item. Additionally, we introduce a learnable residual module to adapt the corresponding lookup item to the current video context features. Finally, we employ refined activation scores to select accurate video frames and further adjust the text prototypes. This simple yet effective text-visual collaboration enables training-free ZS-TAL to harness historical videos. Extensive experiments show our method significantly outperforms state-of-the-art zero-shot VLM baselines, validating the effectiveness of our framework.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37873", "url": null, "sourceid": 46575, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36614, "uid": "bddcc5065237c686cb4d89dba8b276f2", "name": "Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models", "authors": [{"id": 185470, "fullname": "Jingchen Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/185470?format=json", "institution": "The State University of New York at Buffalo"}, {"id": 180494, "fullname": "Shaobo Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/180494?format=json", "institution": "NEC Laboratories America"}, {"id": 85706, "fullname": "Deep Patel", "url": "http://cvpr.thecvf.com/api/miniconf/users/85706?format=json", "institution": "NEC Laboratories America"}, {"id": 185471, "fullname": "Wataru Kohno", "url": "http://cvpr.thecvf.com/api/miniconf/users/185471?format=json", "institution": "NEC Corporation"}, {"id": 185472, "fullname": "Can Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185472?format=json", "institution": "Rutgers University"}, {"id": 84519, "fullname": "Changyou Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/84519?format=json", "institution": "State University of New York, Buffalo"}], "abstract": "Knowledge distillation establishes a learning paradigm that learns from both data supervision and teacher guidance. However, the optimal weighting between learning from data and learning from the teacher is hard to determine, as some samples are data-noisy while others are teacher-uncertain. This raises a pressing need to adaptively balance data and teacher supervision. We propose Beta-weighted Knowledge Distillation \\textbf{$\\beta$-KD}, an adaptive, uncertainty-aware knowledge distillation framework that supports arbitrary distillation objectives under a unified Bayesian formulation. Specifically, we model teacher signals as a Gibbs prior over student activations and use amortized optimization to jointly infer activations and weighting parameters $\\beta$, leading to a closed-form, uncertainty-aware weighting. Extensive experiments distilling a 1.7B-parameter student from MobileVLM-7B demonstrate that $\\beta$-KD consistently outperforms existing methods under different loss combination settings. Moreover, large-scale distillation and evaluations on six multimodal benchmarks further confirm the effectiveness of the proposed approach.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36614", "url": null, "sourceid": 46270, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37770, "uid": "2f384466a8f3cc12bbd45d984ff77765", "name": "NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos", "authors": [{"id": 89053, "fullname": "Yuxue Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89053?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 153222, "fullname": "Lue Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153222?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 188216, "fullname": "Ziqi Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188216?format=json", "institution": "Institute of automation, Chinese academy of science"}, {"id": 90492, "fullname": "Junran Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90492?format=json", "institution": "Institute of automation, Chinese academy of science"}, {"id": 188217, "fullname": "Feng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188217?format=json", "institution": "TuSimple"}, {"id": 88072, "fullname": "Zhaoxiang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88072?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}], "abstract": "In this paper, we propose **NeoVerse**, a versatile 4D world model that is capable of 4D reconstruction, novel-trajectory video generation, and rich downstream applications. We first identify a common limitation of scalability in current 4D world modeling methods, caused either by expensive and specialized multi-view 4D data or by cumbersome training pre-processing. In contrast, our NeoVerse is built upon a core philosophy that makes the full pipeline scalable to diverse in-the-wild monocular videos. Specifically, NeoVerse features pose-free feed-forward 4D reconstruction, online monocular degradation pattern simulation, and other well-aligned techniques. These designs empower NeoVerse with versatility and generalization to various domains. Meanwhile, NeoVerse achieves state-of-the-art performance in standard reconstruction and generation benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37770", "url": "https://neoverse-4d.github.io/", "sourceid": 43798, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37052, "uid": "6375d77e5b6bea3ac46a02b2ebc17fe3", "name": "WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval", "authors": [{"id": 172799, "fullname": "Tianyue Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/172799?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 129356, "fullname": "Leigang Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129356?format=json", "institution": "National University of Singapore"}, {"id": 162442, "fullname": "tianyu yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/162442?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 145196, "fullname": "xiangzhao hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/145196?format=json", "institution": "CASIA"}, {"id": 176020, "fullname": "Yifan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/176020?format=json", "institution": "Minzu University of China"}, {"id": 85427, "fullname": "Haiyun Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/85427?format=json", "institution": "Institute of automation, Chinese Academy of Sciences"}, {"id": 85436, "fullname": "Jinqiao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85436?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}], "abstract": "Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a multimodal query (comprising a reference image and a modification text), without training on annotated triplets. Existing methods typically convert the multimodal query into a single modality\u2014either as an edited caption for Text-to-Image retrieval (T2I) or as an edited image for Image-to-Image retrieval (I2I). However, each paradigm has inherent limitations: T2I often loses fine-grained visual details, while I2I struggles with complex semantic modifications. To effectively leverage their complementary strengths under diverse query intents, we propose **WISER**, a training-free framework that unifies T2I and I2I via a \u201cretrieve\u2013verify\u2013refine\u201d pipeline, explicitly modeling *intent awareness* and *uncertainty awareness*. Specifically, WISER first performs **Wider Search** by generating both edited captions and images for parallel retrieval to broaden the candidate pool. Then, it conducts **Adaptive Fusion** with a verifier to assess retrieval confidence, triggering refinement for uncertain retrievals, and dynamically fusing the dual-path for reliable ones. For uncertain retrievals, WISER generates refinement suggestions through structured self-reflection to guide the next retrieval round toward **Deeper Thinking**. Extensive experiments demonstrate that WISER significantly outperforms previous methods across multiple benchmarks, achieving relative improvements of 45% on CIRCO (mAP@5) and 57% on CIRR (Recall@1) over existing training-free methods. Notably, it even surpasses many training-dependent methods, highlighting its superiority and generalization under diverse scenarios. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37052", "url": null, "sourceid": 33222, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38792, "uid": "48b3aba5a6c4d0f9b8320bf816f586d3", "name": "Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy", "authors": [{"id": 71640, "fullname": "Teng Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/71640?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 153281, "fullname": "Zhentao Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153281?format=json", "institution": "Tencent Hunyuan"}, {"id": 71092, "fullname": "Guozhen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/71092?format=json", "institution": "Nanjing University"}, {"id": 190681, "fullname": "Zihan Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/190681?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 175795, "fullname": "zhengguang zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/175795?format=json", "institution": "tencent"}, {"id": 187395, "fullname": "Youliang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187395?format=json", "institution": "Tsinghua University"}, {"id": 155471, "fullname": "Yuan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/155471?format=json", "institution": "Microsoft Research"}, {"id": 106459, "fullname": "qinglin lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/106459?format=json", "institution": "Tencent"}, {"id": 88671, "fullname": "Ran Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/88671?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "The synthesis of synchronized audio-visual content is a key challenge in generative AI, with open-source models facing challenges in robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization.To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine-grained audio-visual synchronization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38792", "url": null, "sourceid": 44077, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39417, "uid": "e64a2afffe539dbd117ce1499a1883b6", "name": "HTC-VLM: Disentangled Hybrid Token Compression for Vision-Language Models", "authors": [{"id": 147015, "fullname": "jusheng zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/147015?format=json", "institution": "National University of Singapore; SUN YAT-SEN UNIVERSITY"}, {"id": 147095, "fullname": "Xiaoyang Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/147095?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 191355, "fullname": "Kaitong Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/191355?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 192040, "fullname": "Qinhan Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192040?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 192041, "fullname": "Yijia Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192041?format=json", "institution": "Microsoft Research Asia; SUN YAT-SEN UNIVERSITY"}, {"id": 161702, "fullname": "Wenhao Chai", "url": "http://cvpr.thecvf.com/api/miniconf/users/161702?format=json", "institution": "Princeton University"}, {"id": 85791, "fullname": "Jian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85791?format=json", "institution": "Snap Inc."}, {"id": 128912, "fullname": "Keze Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128912?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Vision-language models (VLMs) have transformed multimodal reasoning, but feeding hundreds of visual patch tokens to LLMs incurs quadratic computational costs, straining memory and context windows. Traditional approaches face a trade-off: continuous compression dilutes high-level semantics like object identities, while discrete quantization loses granular details such as textures. We challenge this by introducing **HTC-VLM**, a hybrid framework that disentangles semantics and appearance through dual channels, i.e., a continuous pathway for fine-grained details via ViT patches and a discrete pathway for symbolic anchors using MGVQ quantization projected to four tokens. These are fused into a 580-token hybrid sequence and compressed to one token via a disentanglement attention mask and a `<voco>` bottleneck, ensuring efficient, grounded representations.HTC-VLM achieves an average performance retention of **87.2%** across seven benchmarks (GQA, VQAv2, MMBench, MME, POPE, SEED-Bench, ScienceQA-Image), outperforming the leading continuous baseline at **81.0%** with a 580-to-1 compression ratio. Attention analyses show the compressed token prioritizes the discrete anchor, validating its semantic guidance. Our work demonstrates that a minimalist hybrid can resolve the efficiency\u2013fidelity dilemma, advancing scalable VLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39417", "url": null, "sourceid": 34083, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40055, "uid": "c27cb025c4befa0319116806faebb82e", "name": "BHCast: Unlocking Black Hole Plasma Dynamics from a Single Blurry Image with Long-Term Forecasting", "authors": [{"id": 193400, "fullname": "Renbo Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193400?format=json", "institution": "University of Toronto"}, {"id": 193401, "fullname": "Ali SaraerToosi", "url": "http://cvpr.thecvf.com/api/miniconf/users/193401?format=json", "institution": "University of Toronto"}, {"id": 193402, "fullname": "Nicholas Conroy", "url": "http://cvpr.thecvf.com/api/miniconf/users/193402?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 193403, "fullname": "Gennady Pekhimenko", "url": "http://cvpr.thecvf.com/api/miniconf/users/193403?format=json", "institution": "Department of Computer Science, University of Toronto"}, {"id": 130722, "fullname": "Aviad Levis", "url": "http://cvpr.thecvf.com/api/miniconf/users/130722?format=json", "institution": "California Institute of Technology"}], "abstract": "The Event Horizon Telescope (EHT) delivered the first image of a black hole by capturing the light from its surrounding accretion flow, revealing structure but not dynamics. Simulations of black hole accretion dynamics are essential for interpreting EHT images, though they are costly to generate and impractical for inference, as exploring many physical configurations remains computationally intractable. Consequently, EHT analyses often resort to comparing observations with libraries of precomputed models. Motivated by this bottleneck, BHCast presents a framework for forecasting black hole plasma dynamics from a single, blurry image, as those captured by the EHT. At its core, BHCast is a neural model that transforms a static image into forecasted future frames, revealing the underlying dynamics hidden within one snapshot. With a multi-scale pyramid loss, we demonstrate how autoregressive prediction can simultaneously super-resolve and evolve a blurry frame into a coherent, high-resolution movie that remains stable over long time horizons. By forecasting dynamics as a first step, we can then extract interpretable spatio-temporal features, such as pattern speed (rotation rate) and pitch angle. This two-step approach makes BHCast more versatile and interpretable than direct inference of such features. Finally, BHCast uses gradient-boosting trees to recover black hole properties from these plasma features, including the spin and viewing inclination angle. We demonstrate the effectiveness of BHCast on simulations of two distinct black hole accretion systems, Sagittarius A* and M87*, by testing on simulated frames blurred to EHT resolution. In addition, we show an application of our forecaster on real EHT images of M87* .", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40055", "url": null, "sourceid": 34417, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36298, "uid": "36165040b1051f2cfde3c3e5c095d3ed", "name": "CycleManip: Enabling Cycle-based Manipulation via Effective History Perception and Understanding", "authors": [{"id": 129725, "fullname": "Yi-Lin Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/129725?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 184725, "fullname": "Haoran Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184725?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 184726, "fullname": "Yuhao Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184726?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 176106, "fullname": "Pengyue Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176106?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 184727, "fullname": "Zhizhao Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184727?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 155370, "fullname": "Guiliang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155370?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 86314, "fullname": "Wei-Shi Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86314?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "In this paper, we explore an important yet underexplored task in robot manipulation: cycle-based manipulation, where robots need to perform cyclic or repetitive actions with an expected terminal time. These tasks are crucial in daily life, such as shaking a bottle or knocking a nail. However, few prior works have explored this task, leading to two main challenges: 1) the imitation methods often fail to complete these tasks within the expected terminal time due to the ineffective utilization of history; 2) the absence of a benchmark with sufficient data and automatic evaluation tools hinders development of effective solutions in this area. To address these challenges, we firstly propose the CycleManip framework to achieve cycle-based task manipulation in a end-to-end imitation manner without requiring any extra models, hierarchical structure or significant computational overhead. The core insight is to enhance effective history perception by a cost-aware sampling strategy and to improve historical understanding by multi-task learning. Secondly, we introduce a cycle-based task manipulation benchmark, which provides diverse cycle-based tasks, and an automatic evaluation method. Extensive experiments conducted in both simulation and real-world settings demonstrate that our method achieves high success rates in cycle-based task manipulation. The results further show strong adaptation performance in general manipulation, and the plug-and-play ability on imitation policies such as Vision-Language-Action (VLA) models. Moreover, the results show that our approach can be applied across diverse robotic platforms, including bi-arm grippers, dexterous hands, and humanoid robots.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36298", "url": null, "sourceid": 40401, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38996, "uid": "4fe9b99a93f17369b95f15d8d806128a", "name": "eRetinexGS: Retinex Modeling for Low-Light Scene Enhancement via Event Streams and 3D Gaussian Splatting", "authors": [{"id": 127805, "fullname": "Haojie Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/127805?format=json", "institution": "Zhejiang University"}, {"id": 190025, "fullname": "Zehao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190025?format=json", "institution": "Zhejiang University"}, {"id": 190026, "fullname": "Yan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190026?format=json", "institution": "Zhejiang University"}, {"id": 153814, "fullname": "Shi Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153814?format=json", "institution": "University of Electronic Science and Technology of China, Tsinghua University"}, {"id": 185438, "fullname": "Peng Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185438?format=json", "institution": "Zhejiang University"}, {"id": 185439, "fullname": "De Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/185439?format=json", "institution": "Zhejiang University"}, {"id": 87278, "fullname": "Huajin Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87278?format=json", "institution": "Zhejiang University"}, {"id": 87673, "fullname": "Qian Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87673?format=json", "institution": "Zhejiang University"}, {"id": 87277, "fullname": "Gang Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87277?format=json", "institution": "Zhejiang University"}], "abstract": "Perception under low illumination remains a major challenge for computer vision systems, as RGB sensors often fail to capture sufficient structural and color information in extremely dark environments. Event cameras, with their high dynamic range and temporal resolution, provide complementary cues that are well suited for such conditions. In this work, we present eRetinexGS, a novel framework that jointly leverages event streams and low-light frames through 3D Gaussian Splatting for scene-level enhancement and reconstruction. Unlike previous approaches that operate on individual frames, eRetinexGS enforces geometric and photometric consistency across multiple views, bridging the gap between degraded images and noisy event signals. By introducing an event-assisted Retinex decomposition and a reflectance\u2013illumination representation within the 3DGS pipeline, our method reconstructs normal-light radiance fields with fine-grained details and accurate color. Extensive experiments on both synthetic and real datasets demonstrate that eRetinexGS achieves state-of-the-art performance in low-light scene enhancement while maintaining real-time rendering capability. The code and dataset will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38996", "url": null, "sourceid": 32315, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37797, "uid": "2391ec15bbf52334385acc13936bd38f", "name": "Probabilistic Prompt Adaptation for Unified Image Aesthetics and Quality Assessment", "authors": [{"id": 188290, "fullname": "Takayuki Hara", "url": "http://cvpr.thecvf.com/api/miniconf/users/188290?format=json", "institution": "LY Corporation"}, {"id": 188291, "fullname": "Yuya Otsuka", "url": "http://cvpr.thecvf.com/api/miniconf/users/188291?format=json", "institution": "LY Corporation"}], "abstract": "Recent advances in vision\u2013language foundation models have enabled text-driven evaluation of image aesthetics and visual quality.  However, existing models are typically optimized for fixed prompts or specific datasets, limiting their adaptability to diverse evaluation criteria. This paper presents \\textit{Probabilistic Prompt Adaptation (PPA)}, a unified probabilistic framework that flexibly predicts aesthetic and quality scores conditioned on arbitrary text prompts. PPA formulates score prediction as a mixture over prompts, dynamically estimating prompt suitability based on both image content and task context. By marginalizing over prompts pre-sampled from a large language model (LLM), it enables annotation-free training using only triplets of task, image, and score. Experiments across multiple IAA and IQA benchmarks demonstrate that PPA achieves consistent and perceptually aligned prompt-based scoring, allowing fine-grained control over evaluation semantics.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37797", "url": null, "sourceid": 42828, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38539, "uid": "3c3d2bb7565649cee9650f4a094d4052", "name": "UAV-CB: A Complex-Background RGB\u2013T Dataset and Local Frequency Bridge Network for UAV Detection", "authors": [{"id": 183907, "fullname": "Shenghui Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183907?format=json", "institution": "Pengcheng Laboratory"}, {"id": 190087, "fullname": "Menghao Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190087?format=json", "institution": "Xinjiang University; Pengcheng Lab"}, {"id": 190088, "fullname": "Longkun Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190088?format=json", "institution": "Pengcheng Laboratory"}, {"id": 190089, "fullname": "Hongyu Chi", "url": "http://cvpr.thecvf.com/api/miniconf/users/190089?format=json", "institution": "Harbin Institute of Technology; Pengcheng Lab"}, {"id": 190090, "fullname": "Zekai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190090?format=json", "institution": "South China University of Technology"}, {"id": 129700, "fullname": "Feng Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/129700?format=json", "institution": "Peking University"}, {"id": 184623, "fullname": "Fan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184623?format=json", "institution": "Peking University"}, {"id": 190091, "fullname": "Qingyao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190091?format=json", "institution": "South China University of Technology"}, {"id": 154602, "fullname": "Ke Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/154602?format=json", "institution": "Peng Cheng Laboratory"}], "abstract": "Detecting Unmanned Aerial Vehicles (UAVs) in low-altitude environments is essential for perception and defense systems but remains highly challenging due to complex backgrounds, camouflage, and multimodal interference. In real-world scenarios, UAVs are frequently visually blended with surrounding structures such as buildings, vegetation, and power lines, resulting in low contrast, weak boundaries, and strong confusion with cluttered background textures. Existing UAV detection datasets, though diverse, are not specifically designed to capture these camouflage and complex-background challenges, which limits progress toward robust real-world perception. To fill this gap, we construct UAV-CB, a new RGB\u2013T UAV detection dataset deliberately curated to emphasize complex low-altitude backgrounds and camouflage characteristics. Furthermore, we propose the Local Frequency Bridge Network (LFBNet), which models features in localized frequency space to bridge both the frequency\u2013spatial fusion gap and the cross-modality discrepancy gap in RGB\u2013T fusion. Extensive experiments on UAV-CB and public benchmarks demonstrate that LFBNet achieves state-of-the-art detection performance and strong robustness under camouflaged and cluttered conditions, offering a frequency-aware perspective on multimodal UAV perception in real-world applications. The UAV-CB dataset will be publicly released to support future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38539", "url": null, "sourceid": 35432, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36764, "uid": "d348734a9ee240ebc4c0937a6e755621", "name": "Harmonized Feature Conditioning and Frequency-Prompt Personalization for Multi-Rater Medical Segmentation", "authors": [{"id": 185820, "fullname": "Sanaz Karimijafarbigloo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185820?format=json", "institution": "Rheinisch Westf\u00e4lische Technische Hochschule Aachen"}, {"id": 185821, "fullname": "Armin Khosravi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185821?format=json", "institution": "Sharif University of Technology"}, {"id": 185822, "fullname": "Alireza Kheyrkhah", "url": "http://cvpr.thecvf.com/api/miniconf/users/185822?format=json", "institution": "Sharif University of Technology"}, {"id": 185823, "fullname": "Reza Azad", "url": "http://cvpr.thecvf.com/api/miniconf/users/185823?format=json", "institution": "RWTH Aachen, Rheinisch Westf\u00e4lische Technische Hochschule Aachen"}, {"id": 185824, "fullname": "Mauricio Reyes", "url": "http://cvpr.thecvf.com/api/miniconf/users/185824?format=json", "institution": null}, {"id": 185825, "fullname": "Dorit Merhof", "url": "http://cvpr.thecvf.com/api/miniconf/users/185825?format=json", "institution": "University of Regensburg"}], "abstract": "Multi-rater medical image segmentation captures the inherent ambiguity of clinical interpretation, where diagnostic boundaries vary across experts and imaging devices. Existing approaches often reduce this diversity to consensus labels or treat rater differences as noise, resulting in overconfident and poorly calibrated models. We propose a harmonized probabilistic framework that disentangles acquisition artifacts from genuine annotator variability through adaptive feature conditioning and frequency-domain personalization. A lightweight Harmonizer Network implicitly models scanner-specific artifacts and performs dynamic feature modulation to standardize latent representations, ensuring that uncertainty reflects anatomy rather than noise. To represent rater-specific styles, we introduce High-Frequency Prompt Modules that operate in the spectral domain to encode annotator-dependent boundary precision and textural sensitivity. These prompts adaptively modulate harmonized features to produce personalized yet anatomically consistent segmentations. Furthermore, a Generalized Energy Distance (GED)\u2013based regularization aligns the generative distribution with empirical annotation variability, promoting diversity where experts disagree and consensus where they converge. Experiments on LIDC-IDRI and NPC-170 show SOTA aggregated and individualized segmentation, with notable GED reductions and improved Dice scores, especially on noisy cases. Beyond accuracy, the model exhibits clinically meaningful uncertainty, confidence rises in agreement regions and declines in ambiguous areas, supporting its use as a reliable and interpretable tool for multi-expert clinical workflows.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36764", "url": null, "sourceid": 40964, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38865, "uid": "7de393a886fd4e75a564829e060449da", "name": "OneHOI: Unifying Human-Object Interaction Generation and Editing", "authors": [{"id": 100125, "fullname": "Jiun Tian Hoe", "url": "http://cvpr.thecvf.com/api/miniconf/users/100125?format=json", "institution": "Nanyang Technological University"}, {"id": 190876, "fullname": "Weipeng Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190876?format=json", "institution": "SUN YAT-SEN UNIVERSITY; Nanyang Technological University"}, {"id": 87664, "fullname": "Xudong Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87664?format=json", "institution": "Nanyang Technological University"}, {"id": 190414, "fullname": "Yap-Peng Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190414?format=json", "institution": "VinUniversity; Nanyang Technological University"}, {"id": 71274, "fullname": "Chee Seng Chan", "url": "http://cvpr.thecvf.com/api/miniconf/users/71274?format=json", "institution": "Universiti Malaya"}], "abstract": "Human-Object Interaction (HOI) modelling captures how humans act upon and relate to objects, typically expressed as <person, action, object> triplets. Existing approaches split into two disjoint families: HOI generation synthesises scenes from structured triplets and layout, but fails to integrate mixed conditions like HOI and object-only entities; and HOI editing modifies interactions via text, yet struggles to decouple pose from physical contact and scale to multiple interactions. We introduce OneHOI, a unified diffusion transformer framework that consolidates HOI generation and editing into a single conditional denoising process driven by shared structured interaction representations. At its core, the Relational Diffusion Transformer (R-DiT) models verb-mediated relations through role- and instance-aware HOI tokens, a Structured HOI Attention to enforce interaction topology, and HOI RoPE to disentangle multi-HOI scenes. Trained jointly with modality dropout on our new HOI-Edit-44K, along with HOI and object-centric datasets, OneHOI supports layout-guided, layout-free, arbitrary-mask, and mixed-condition control, achieving state-of-the-art results across both HOI generation and editing. Code and dataset will be open publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38865", "url": "https://jiuntian.github.io/OneHOI/", "sourceid": 36786, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38455, "uid": "363b27721bc30b0327f475f174615752", "name": "Polyphony: Diffusion-based Dual-Hand Action Segmentation with Alternating Vision Transformer and Semantic Conditioning", "authors": [{"id": 189892, "fullname": "Hao Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189892?format=json", "institution": "New York University Abu Dhabi"}, {"id": 91679, "fullname": "Hu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91679?format=json", "institution": "The University of Adelaide"}, {"id": 189893, "fullname": "Tiantian Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189893?format=json", "institution": "New York University Abu Dhabi"}, {"id": 189894, "fullname": "Prajjwal Bhattarai", "url": "http://cvpr.thecvf.com/api/miniconf/users/189894?format=json", "institution": "New York University"}, {"id": 189895, "fullname": "Tuka Alhanai", "url": "http://cvpr.thecvf.com/api/miniconf/users/189895?format=json", "institution": "New York University, Abu Dhabi"}], "abstract": "Dual-hand action segmentation, densely predicting actions for both hands from untrimmed videos, is essential for understanding complex bimanual activities. However, it poses several unique challenges: complex inter-hand dependencies, visual asymmetry between hands, representation conflicts where the dominant hand monopolizes gradients, and semantic ambiguity in fine-grained actions. We propose Polyphony, a three-stage method to address these challenges through: (1) an Alternating Dual-Hand Vision Transformer that alternates training between left- and right-hand mini-batches to ensure balanced gradient contributions from both hands while sharing a spatio-temporal encoder; (2) Semantic Feature Conditioning that aligns visual features with structured, compositional action descriptions to enhance discrimination of semantically similar actions; and (3) Diffusion-Based Segmentation with cross-hand feature fusion for inter-hand coordination and adaptive loss weighting for balancing performance. Polyphony achieves state-of-the-art on both dual-hand datasets (HA-ViD, ATTACH) with improvements up to 16.8 points, and on the single-stream Breakfast dataset (82.5\\%), outperforming the prior best method that uses a 12\u00d7 larger backbone. Notably, our unified model with a single shared backbone surpasses baselines requiring separate per-hand models. Code is at \\url{https://github.com/polyphony-cvpr/polyphony}.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38455", "url": null, "sourceid": 42134, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39013, "uid": "243153b1d3e9aae08821d40e3b402ffe", "name": "Seeing Through the Noise: Improving Infrared Small Target Detection and Segmentation from Noise Suppression Perspective", "authors": [{"id": 181086, "fullname": "Maoxun Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181086?format=json", "institution": "Beihang University"}, {"id": 191178, "fullname": "Duanni Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191178?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 191179, "fullname": "Ziteng Xi", "url": "http://cvpr.thecvf.com/api/miniconf/users/191179?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 106377, "fullname": "Tianyi Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/106377?format=json", "institution": "Beihang University"}, {"id": 182642, "fullname": "Shiji Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/182642?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 184738, "fullname": "Yimian Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/184738?format=json", "institution": "Nankai University"}, {"id": 77417, "fullname": "Xingxing Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/77417?format=json", "institution": "Beihang University"}], "abstract": "Infrared small target detection and segmentation (IRSTDS) is a critical yet challenging task in defense and civilian applications, owing to the dim, shapeless appearance of targets and severe background clutter. Recent CNN-based methods have achieved promising target perception results, but they only focus on enhancing feature representation to offset the impact of noise, which results in the increased false alarm problem. In this paper, through analyzing the problem from the frequency domain, we pioneer in improving performance from noise suppression perspective and propose a novel noise-suppression feature pyramid network (NS-FPN), which integrates a low-frequency guided feature purification (LFP) module and a spiral-aware feature sampling (SFS) module into the original FPN structure. The LFP module suppresses the noise features by purifying high-frequency components to achieve feature enhancement devoid of noise interference, while the SFS module further adopts spiral sampling to fuse target-relevant features in feature fusion process. Our NS-FPN is designed to be lightweight yet effective and can be easily plugged into existing IRSTDS frameworks. Extensive experiments on the IRSTD-1k and NUAA-SIRST datasets demonstrate that our method significantly reduces false alarms and achieves superior performance on IRSTDS task.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39013", "url": null, "sourceid": 40862, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38504, "uid": "e4c72549fee9063d8aa8f9b6c0b621af", "name": "MVLM: Template-Free Tracking via Vision\u2013Language Margin Confidence and Memory-Gated Tracking", "authors": [{"id": 184083, "fullname": "Dae-Hyeon Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/184083?format=json", "institution": "Inha University"}, {"id": 190008, "fullname": "Mina Baek", "url": "http://cvpr.thecvf.com/api/miniconf/users/190008?format=json", "institution": "Inha University"}, {"id": 147770, "fullname": "Jeong-Hun Ha", "url": "http://cvpr.thecvf.com/api/miniconf/users/147770?format=json", "institution": "Inha University"}, {"id": 190009, "fullname": "Chan-Seop Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/190009?format=json", "institution": "Inha University"}, {"id": 179974, "fullname": "Jamshidjon Ganiev", "url": "http://cvpr.thecvf.com/api/miniconf/users/179974?format=json", "institution": "Inha University"}, {"id": 183528, "fullname": "Seung-Hwan Bae", "url": "http://cvpr.thecvf.com/api/miniconf/users/183528?format=json", "institution": "Inha University"}], "abstract": "We introduce a new template-free tracking paradigm based solely on natural language, capable of tracking an arbitrary object and seamlessly switching to a new target without box initialization.Our key idea is to localize an object via vision-language (VL) correlation.However, using the correlation alone is brittle under large search regions due to spatial uncertainty and ambiguous VL saliency. To resolve these, we propose MVLM, a memory-based vision-language margin confidence that integrates vision\u2013language correlation, encoder prediction, and temporal memory.MVLM dynamically gates the search region\u2014switching between compact ROI (Region of Interest) search and global re-localization\u2014to reduce spatial uncertainty. Theoretically, we derive bounds that connect the MVLM score to tracking probability, characterizing mis-localization within ROI and ROI-exclusion probabilities.Through extensive evaluation, we validate our theorems and achieve state-of-the-art performance on several benchmarks (TNL2K, LaSOT, OTB99 and MGIT) using only language guidance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38504", "url": null, "sourceid": 41573, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37386, "uid": "af5b84f03fc2467351129eced8f32909", "name": "Twin-T & TwintVQA: A Reliable Structure\u2013Detail Separating VLM and a Comprehensive Benchmark for Chart and Table Tasks", "authors": [{"id": 180610, "fullname": "Jiahua Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180610?format=json", "institution": "Harbin Institute of Technology"}, {"id": 187314, "fullname": "Siyao Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187314?format=json", "institution": "Harbin Institute of Technology"}, {"id": 187315, "fullname": "Jiaxing Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/187315?format=json", "institution": "Harbin Institute of Technology"}, {"id": 172762, "fullname": "Qingtao Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/172762?format=json", "institution": "Harbin Institute of Technology"}, {"id": 187316, "fullname": "Changjiang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/187316?format=json", "institution": "Harbin Institute of Technology"}, {"id": 187317, "fullname": "Zeming Lang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187317?format=json", "institution": "Harbin Institute of Technology"}, {"id": 96348, "fullname": "Jie Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/96348?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "With the rapid development of Vision-Language Models (VLMs), there is a growing demand for automatic analysis of structured visual data. Charts and tables are primary carriers of quantitative information, with regular layouts and explicit numbers. However, current general VLMs and expert models make limited use of these chart-table features during training and inference. Another challenge is cross-format conversion in realistic settings, as chart and table outputs span Python and LaTeX and most VLMs struggle to handle this breadth reliably. These gaps often lead to analysis mistakes, and unreliable generation text.  To overcome these limitations, we propose $\\underline{\\texttt{\\textbf{Twin-T}}}$, a two-stage expert VLM for comprehensive char$\\underline{\\texttt{\\textbf{t}}}$-$\\underline{\\texttt{\\textbf{t}}}$able tasks across Image, LaTeX, and Python. In stage 1, we propose a novel dual-head image encoder that can separate structural cues and fine details from input images. In stage 2, we propose MINT, a preference learning method that emphasizes numbers and keywords fidelity and vision\u2013text matching. Furthermore, we introduce a comprehensive TwintVQA benchmark with 17 chart types, 11 task types, 3 data formats and short / medium / long QA settings. Our model narrows the gap between open-source and closed-source models on mainstream chart\u2013table benchmarks, outperforming open-source models and GLM-4.5V-106B while even remaining competitive with GPT-4o and Gemini-2.5-Pro. Our code and additional details are available in the Appendix.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37386", "url": null, "sourceid": 45989, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37049, "uid": "02dbe7f07a799718d80ddf18d709bdc2", "name": "OneOcc: Semantic Occupancy Prediction for Legged Robots with a Single Panoramic Camera", "authors": [{"id": 89630, "fullname": "Hao Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/89630?format=json", "institution": "Zhejiang University"}, {"id": 186564, "fullname": "Ze Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186564?format=json", "institution": null}, {"id": 186565, "fullname": "Shangwei Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186565?format=json", "institution": "Zhejiang University"}, {"id": 154552, "fullname": "Mengfei Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/154552?format=json", "institution": "Hunan University"}, {"id": 70670, "fullname": "Song Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70670?format=json", "institution": "Zhejiang University"}, {"id": 186566, "fullname": "Teng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186566?format=json", "institution": "Xiaomi Corporation"}, {"id": 89634, "fullname": "Kailun Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89634?format=json", "institution": "Karlsruher Institut f\u00fcr Technologie"}, {"id": 153635, "fullname": "Lin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153635?format=json", "institution": "Nanyang Technological University"}, {"id": 89635, "fullname": "Kaiwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89635?format=json", "institution": "Zhejiang University"}], "abstract": "Robust 3D semantic occupancy is essential for legged and humanoid robots, yet most Semantic Scene Completion (SSC) systems are built for wheeled platforms with forward-facing sensors. We present $\\textbf{OneOcc}$, a vision-only panoramic SSC framework tailored to severe body jitter and $360^{\\circ}$ continuity. OneOcc integrates four complementary modules: (i) $\\textit{Dual-Projection fusion (DP-ER)}$, which jointly exploits the raw annular panorama and its equirectangular unfolding to preserve true $360^{\\circ}$ continuity while enabling grid-aligned feature extraction and seam-aware context; (ii) $\\textit{Bi-Grid Voxelization (BGV)}$, which reasons in Cartesian and polar/cylindrical voxel spaces to reduce discretization bias and better align with panoramic geometry, yielding sharper free/occupied boundaries; (iii) a lightweight decoder with $\\textit{Hierarchical AMoE-3D}$ fusion that dynamically routes multi-scale 3D features to specialized experts, improving long-range context and occlusion handling; and (iv) a plug-and-play $\\textit{Gait Displacement Compensation (GDC)}$ module that learns feature-level motion correction from gait, stabilizing representations without extra sensors. We also release two panoramic occupancy benchmarks: $\\textbf{QuadOcc}$ (real quadruped, first-person $360^{\\circ}$) and $\\textbf{Human360Occ (H3O)}$ (CARLA human-ego $360^{\\circ}$ with RGB/Depth/semantic-occupancy and standardized within-/cross-city splits). OneOcc sets new SOTA: on QuadOcc it exceeds strong vision baselines and even popular LiDAR methods, and on H3O it improves within-city by +3.83 mIoU and cross-city by +8.08. The modules are lightweight, enabling deployable full-surround semantic perception for legged and humanoid robots. Datasets and code will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37049", "url": null, "sourceid": 39336, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38568, "uid": "9389089fa48bdc2708e6f0890890c516", "name": "EgoRoC: Towards Egocentric Robotic Control via Task-Agnostic Visual Alignment", "authors": [{"id": 90857, "fullname": "Wei Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90857?format=json", "institution": "Tianjin University"}, {"id": 190168, "fullname": "Chi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190168?format=json", "institution": "Tianjin University"}, {"id": 143966, "fullname": "Nan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/143966?format=json", "institution": "Tianjin University"}, {"id": 189948, "fullname": "Qian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189948?format=json", "institution": "Tianjin University"}, {"id": 154514, "fullname": "Qi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154514?format=json", "institution": "Tianjin University"}, {"id": 190169, "fullname": "Mingyan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190169?format=json", "institution": "Tianjin University"}], "abstract": "Recent Vision-Language-Action (VLA) models map visual-textual inputs to robotic actions via end-to-end architectures, yet this approach entangles visual understanding with task-specific actions. This leads to an exhaustive collection of full operational sequences and parameter redundancy across tasks, while generic third-person camera setups require fine-tuning for different hardware due to implicit hand-eye assumptions. We argue that decoupling \\textbf{how robots see} from \\textbf{how robots act} is a missing primitive in VLA systems. We present \\textbf{EgoRoC}, a plug-and-play egocentric alignment head that precedes any task policy and exposes only a thin 6-DoF pose interface. EgoRoC establishes task-agnostic viewpoint consistency from a wrist-mounted (first-person) camera and then alternates alignment with manipulation, while a diffusion-based online hand\u2013eye module corrects the action in the end-effector frame for hardware-agnostic deployment. Trained once from static wrist\u2013target image pairs with relative poses, rather than full manipulation trajectories, EgoRoC leaves downstream VLAs unchanged. By turning egocentric alignment into a reusable capability, EgoRoC reduces training redundancy, strengthens zero-shot cross-scene transfer, and scales across VLA backbones without manual calibration.Across simulation and real settings, attaching EgoRoC consistently boosts success rates, especially on long-horizon and out-of-distribution tasks, and improves data efficiency during fine-tuning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38568", "url": null, "sourceid": 36395, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38240, "uid": "05e41bc57a80ca4bbef3c523cc0d40d9", "name": "SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images", "authors": [{"id": 180954, "fullname": "Zepeng Xin", "url": "http://cvpr.thecvf.com/api/miniconf/users/180954?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 160589, "fullname": "Kaiyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/160589?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 189401, "fullname": "Luodi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189401?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 189402, "fullname": "Wanchen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189402?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 189403, "fullname": "Xiao Yuchen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189403?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 189404, "fullname": "Hui Qiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189404?format=json", "institution": "China Telecom"}, {"id": 189405, "fullname": "Weizhan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189405?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 84835, "fullname": "Deyu Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/84835?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 71122, "fullname": "Xiangyong Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/71122?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "Effectively grounding complex language to pixels in remote sensing (RS) images is a critical challenge for applications like disaster response and environmental monitoring. Current models can parse simple, single-target commands but fail when presented with complex geospatial scenarios, e.g., segmenting objects at various granularities, executing multi-target instructions, and interpreting implicit user intent. To drive progress against these failures, we present LaSeRS, the first large-scale dataset built for comprehensive training and evaluation across four critical dimensions of language-guided segmentation: hierarchical granularity, target multiplicity, reasoning requirements, and linguistic variability. By capturing these dimensions, LaSeRS moves beyond simple commands, providing a benchmark for complex geospatial reasoning. This addresses a critical gap: existing datasets oversimplify, leading to sensitivity-prone real-world models. We also propose SegEarth-R2, an MLLM architecture designed for comprehensive language-guided segmentation in RS, which directly confronts these challenges. The model's effectiveness stems from two key improvements: (1) a spatial attention supervision mechanism specifically handles the localization of small objects and their components, and (2) a flexible and efficient segmentation query mechanism that handles both single-target and multi-target scenarios. Experimental results demonstrate that our SegEarth-R2 achieves outstanding performance on LaSeRS and other benchmarks, establishing a powerful baseline for the next generation of geospatial segmentation. All data and code will be released.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38240", "url": null, "sourceid": 44058, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38044, "uid": "4bfa62213e6badab50fce7fc0c8d9b68", "name": "FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters", "authors": [{"id": 131008, "fullname": "Shitong Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/131008?format=json", "institution": "Southeast University"}, {"id": 188908, "fullname": "Yufei Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188908?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 73868, "fullname": "Zeke Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/73868?format=json", "institution": "Baidu Research"}], "abstract": "The recent advent of powerful video generation models, such as Hunyuan, WanX, Veo3, and Kling, has inaugurated a new era in the field. However, the practical deployment of these models is severely impeded by their substantial computational overhead, which stems from enormous parameter counts and the iterative, multi-step sampling process required during inference. Prior research on accelerating generative models has predominantly followed two distinct trajectories: reducing the number of sampling steps (e.g., LCM, DMD, and MagicDistillation) or compressing the model size for more efficient inference (e.g., ICMD). The potential of simultaneously compressing both to create a fast and lightweight model remains an unexplored avenue. In this paper, we propose _**FastLightGen**_, an algorithm that transforms large, computationally expensive models into fast, lightweight counterparts. The core idea is to construct an optimal teacher model, one engineered to maximize student performance, within a synergistic framework for distilling both model size and inference steps. Our extensive experiments on HunyuanVideo-ATI2V and WanX-TI2V reveal that a generator using 4-step sampling and 30\\% parameter pruning achieves optimal visual quality under a constrained inference budget. Furthermore, FastLightGen consistently outperforms all competing methods, establishing a new state-of-the-art in efficient video generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38044", "url": null, "sourceid": 34724, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40083, "uid": "12e29001e93f781f477b80f9299192f2", "name": "AnyPcc: Compressing Any Point Cloud with a Single Universal Model", "authors": [{"id": 181082, "fullname": "Kangli Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181082?format=json", "institution": "Peking University"}, {"id": 193464, "fullname": "Qianxi Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/193464?format=json", "institution": "Peking University"}, {"id": 193465, "fullname": "Yuqi Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/193465?format=json", "institution": "Peking University"}, {"id": 193466, "fullname": "Shihao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193466?format=json", "institution": "Peking University"}, {"id": 89858, "fullname": "Wei Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/89858?format=json", "institution": "Shenzhen Graduate School, Peking University "}], "abstract": "Generalization remains a critical challenge in deep learning-based point cloud geometry compression. While existing methods perform well on standard benchmarks, their performance collapses in real-world scenarios due to two fundamental limitations: the lack of context models that are robust across diverse data densities, and the inability to efficiently adapt to out-of-distribution (OOD) data. To overcome both challenges, we introduce AnyPcc, a universal point cloud compression framework. AnyPcc first employs a Universal Context Model that leverages coarse-grained spatial priors with fine-grained channel priors to ensure robust context modeling across the entire density spectrum. Second, our novel Instance-Adaptive Fine-Tuning (IAFT) strategy tackles OOD data by synergizing explicit and implicit compression paradigms. For each instance, it fine-tunes a small subset of network weights and transmits them within the bitstream. The minimal bitrate overhead from these weights is significantly outweighed by the resulting gains in geometry compression. Extensive experiments on a benchmark of 15 diverse datasets confirm that AnyPcc sets a new state-of-the-art in point cloud compression while maintaining low complexity. Our code and datasets will be released to encourage reproducible research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40083", "url": null, "sourceid": 42315, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39097, "uid": "2937ad0c0fbd0054fe78eca0466fd677", "name": "WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens", "authors": [{"id": 155522, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155522?format=json", "institution": "University of Science and Technology of China"}, {"id": 155523, "fullname": "Dacheng Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/155523?format=json", "institution": "Tencent"}, {"id": 191351, "fullname": "Xiaoxuan He", "url": "http://cvpr.thecvf.com/api/miniconf/users/191351?format=json", "institution": "Zhejiang University"}, {"id": 191352, "fullname": "Yong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191352?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 126882, "fullname": "Fengyun Rao", "url": "http://cvpr.thecvf.com/api/miniconf/users/126882?format=json", "institution": "WeChat, Tencent Inc."}, {"id": 185164, "fullname": "Jing LYU", "url": "http://cvpr.thecvf.com/api/miniconf/users/185164?format=json", "institution": "Wechat Team"}, {"id": 86247, "fullname": "Wei Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/86247?format=json", "institution": "University of Science and Technology of China"}, {"id": 86250, "fullname": "Yang Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86250?format=json", "institution": "University of Science and Technology of China"}, {"id": 86637, "fullname": "Zheng-Jun Zha", "url": "http://cvpr.thecvf.com/api/miniconf/users/86637?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods using a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, failing to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with linear projection to recover fine-grained image details. Experimental results confirm our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39097", "url": null, "sourceid": 42867, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65742, "modified": "2026-04-26T07:33:44.367717-07:00", "display_section": 1, "type": "PDF", "name": "Slides", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "/media/cvpr-2026/Slides/39097.pdf", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37243, "uid": "c50b40b057149f613124fe9c535b003a", "name": "Revisiting Sparsity Constraint Under High-Rank Property in Partial Multi-Label Learning", "authors": [{"id": 180535, "fullname": "Chongjie Si", "url": "http://cvpr.thecvf.com/api/miniconf/users/180535?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 186996, "fullname": "Yidan Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/186996?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 186997, "fullname": "Fuchao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186997?format=json", "institution": "Nanyang Technological University"}, {"id": 76466, "fullname": "Wei Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76466?format=json", "institution": "Shanghai Jiaotong University"}], "abstract": "Partial Multi-Label Learning (PML) extends the multi-label learning paradigm to scenarios where each sample is associated with a candidate label set containing both ground-truth labels and noisy labels. Existing PML methods commonly rely on two assumptions: sparsity of the noise label matrix and low-rankness of the ground-truth label matrix. However, these assumptions are inherently conflicting and impractical for real-world scenarios, where the true label matrix is typically full-rank or close to full-rank. To address these limitations, we demonstrate that the sparsity constraint contributes to the high-rank property of the predicted label matrix. Based on this, we propose a novel method Schirn, which introduces a sparsity constraint on the noise label matrix while enforcing a high-rank property on the predicted label matrix. Extensive experiments demonstrate the superior performance of Schirn compared to state-of-the-art methods, validating its effectiveness in tackling real-world PML challenges.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37243", "url": null, "sourceid": 37073, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39984, "uid": "b5c358c2c23fa59ee6b48569d0fc2797", "name": "Robust3DGSW: Toward Robust Watermarking for Quantization-Aware 3D Gaussian Splatting", "authors": [{"id": 180806, "fullname": "Boyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180806?format=json", "institution": "East China Normal University"}, {"id": 186054, "fullname": "Jun Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/186054?format=json", "institution": "University of Notre Dame"}, {"id": 128946, "fullname": "Mingsong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/128946?format=json", "institution": "East China Normal University"}], "abstract": "Although current watermarking techniques for 3D Gaussian Splatting (3DGS) are promising in protecting the copyrights of both 3DGS models and their rendered images, they greatly suffer from low watermark robustness and poor rendering quality when applying quantization to large 3DGS models to accommodate resource-limited devices. To address these problems, this paper introduces a novel two-stage quantization-aware 3DGS watermarking approach called Robust3DGSW. By properly embedding watermarks into the mid-frequency bands of both the 3D Gaussian parameters and 2D rendered images, the first stage of Robust3DGSW can effectively counteract the quantization-induced signal loss and mitigate the adverse effects of watermarks on rendered images. In the second stage, Robust3DGSW trains both 2D and 3D decoders using our proposed multi-scale adversarial perturbation approach, alongside a gradual quantization process, which enables robust watermark extraction even under excessive quantization. Comprehensive experimental results obtained from the well-known Blender, LLFF, and MipNeRF-360 datasets demonstrate that, when compared to leading 3DGS watermarking techniques, Robust3DGSW not only mitigates the negative effects of quantization on watermarks but also enables fast rendering with high quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39984", "url": null, "sourceid": 38977, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37908, "uid": "abcd43eee9c9f6bce781261953bad742", "name": "Neuro-Cognitive Reward Modeling for Human-Centered Autonomous Vehicle Control", "authors": [{"id": 176954, "fullname": "Zhuoli Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176954?format=json", "institution": "University of Technology Sydney"}, {"id": 188561, "fullname": "Yu-Cheng Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188561?format=json", "institution": "University of Technology Sydney"}, {"id": 188562, "fullname": "Yu-Kai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188562?format=json", "institution": "University of Technology Sydney"}, {"id": 188563, "fullname": "Thomas Do", "url": "http://cvpr.thecvf.com/api/miniconf/users/188563?format=json", "institution": "University of Technology Sydney"}, {"id": 188564, "fullname": "Chin-teng Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188564?format=json", "institution": "University of Technology Sydney"}], "abstract": "Recent advancements in computer vision have accelerated the development of autonomous driving. Despite these advancements, training machines to drive in a way that aligns with human expectations remains a significant challenge. Human factors are still essential, as humans possess a sophisticated cognitive system capable of rapidly interpreting scene information and making accurate decisions. Aligning machine with human intent has been explored with Reinforcement Learning with Human Feedback (RLHF). Conventional RLHF methods rely on collecting human preference data by manually ranking AI-generated outputs, which is time-consuming and indirect. In this work, we propose an electroencephalography (EEG)-guided decision-making framework to incorporate human cognitive insights into reinforcement learning (RL) for autonomous driving. We collected EEG signals from 20 participants in a realistic driving simulator and analyzed event-related potentials (ERP) in response to sudden environmental changes. Our proposed framework employs a neural network to predict the strength of ERP based on the cognitive information from visual scene information. Moreover, we explore the integration of such cognitive information into the reward signal of the RL algorithm. Experimental results show that our framework can improve the collision avoidance ability of the RL algorithm, highlighting the potential of neuro-cognitive feedback in enhancing autonomous driving systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37908", "url": null, "sourceid": 31528, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39914, "uid": "e51c23226c8e20febd0c656a33711356", "name": "Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning", "authors": [{"id": 184203, "fullname": "Seung Hyup Baek", "url": "http://cvpr.thecvf.com/api/miniconf/users/184203?format=json", "institution": "Konkuk University"}, {"id": 193105, "fullname": "Jimin Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/193105?format=json", "institution": "Sejong University"}, {"id": 193106, "fullname": "Hyeongkeun Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/193106?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 76828, "fullname": "Jae Won Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/76828?format=json", "institution": "Korea Advanced Institute of Science and Technology"}], "abstract": "Dense Video Captioning (DVC) is a challenging multimodal task that involves temporally localizing multiple events within a video and describing them with natural language. While query-based frameworks enable the simultaneous, end-to-end processing of localization and captioning, their reliance on shared queries often leads to significant multi-task interference between the two tasks, as well as temporal redundancy in localization. In this paper, we propose utilizing role-specific queries that separate localization and captioning into independent components, allowing each to exclusively learn its role. We then employ contrastive alignment to enforce semantic consistency between the corresponding outputs, ensuring coherent behavior across the separated queries. Furthermore, we design a novel suppression mechanism in which mutual temporal overlaps across queries are penalized to tackle temporal redundancy, supervising the model to learn distinct, non-overlapping event regions for more precise localization. Additionally, we introduce a lightweight module that captures core event concepts to further enhance semantic richness in captions through concept-level representations. We demonstrate the effectiveness of our method through extensive experiments on major DVC benchmarks YouCook2 and ActivityNet Captions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39914", "url": null, "sourceid": 46416, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36866, "uid": "881d3a52fe93ba18d57708504d345b79", "name": "SGI: Structured 2D Gaussians for Efficient and Compact Large Image Representation", "authors": [{"id": 180597, "fullname": "Zixuan Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/180597?format=json", "institution": "University of Notre Dame"}, {"id": 180694, "fullname": "Kaiyuan Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180694?format=json", "institution": "University of Notre Dame"}, {"id": 186054, "fullname": "Jun Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/186054?format=json", "institution": "University of Notre Dame"}, {"id": 186055, "fullname": "Yifan Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186055?format=json", "institution": "University of Notre Dame"}, {"id": 186056, "fullname": "Lin Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186056?format=json", "institution": "Tohoku University"}, {"id": 185145, "fullname": "Chaoli Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185145?format=json", "institution": "University of Notre Dame"}, {"id": 186057, "fullname": "Jianxu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186057?format=json", "institution": "Leibniz-Institut f\u00fcr Analytische Wissenschaften"}, {"id": 186058, "fullname": "Yiyu Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186058?format=json", "institution": "University of Notre Dame"}], "abstract": "2D Gaussian Splatting has emerged as a novel image representation technique that can support efficient rendering on low-end devices. However, scaling to high-resolution images requires optimizing and storing millions of unstructured Gaussian primitives independently, leading to slow convergence and redundant parameters. To address this, we propose Structured Gaussian Image (SGI), a compact and efficient framework for representing high-resolution images. SGI decomposes a complex image into multi-scale local spaces defined by a set of seeds. Each seed corresponds to a spatially coherent region and, together with lightweight multi-layer perceptrons (MLPs), generates structured implicit 2D neural Gaussians. This seed-based formulation imposes structural regularity on otherwise unstructured Gaussian primitives, which facilitates entropy-based compression at the seed level to reduce the total storage. However, optimizing seed parameters directly on high-resolution images is a challenging and non-trivial task. Therefore, we designed a multi-scale fitting strategy that refines the seed representation in a coarse-to-fine manner, substantially accelerating convergence. Quantitative and qualitative evaluations demonstrate that SGI achieves up to 7.5$\\times$ compression over prior non-quantized 2D Gaussian methods and 1.6$\\times$ over quantized ones, while also delivering 1.6$\\times$ and 6.5$\\times$ faster optimization, respectively, without degrading, and often improving, image fidelity. Uploaded code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36866", "url": null, "sourceid": 35037, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36939, "uid": "09fc33e53018e6c214db6da7f89b2ea2", "name": "VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation", "authors": [{"id": 186272, "fullname": "Jiayi Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186272?format=json", "institution": "Singapore University of Technology and Design"}, {"id": 86987, "fullname": "Haobo Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86987?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 151729, "fullname": "De Soh Soh", "url": "http://cvpr.thecvf.com/api/miniconf/users/151729?format=json", "institution": "Singapore University of Technology and Design"}, {"id": 135934, "fullname": "Na Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/135934?format=json", "institution": "Singapore University of Technology and Design"}], "abstract": "This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that together form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT\u2019s perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT.  (ii) Structure-saliency enhanced attention strengthens VGGT\u2019s robustness during 3D reconstruction by injecting structure-aware confidence into its attention layers, guiding focus toward geometrically reliable regions and enhancing cross-view coherence. (iii) Correlation-weighted 3D model correction refines the reconstructed 3D model by reweighting overlapping points using attention-inferred correlation scores, providing a consistent geometric basis for accurate panoramic reprojection. Extensive experiments show that VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36939", "url": null, "sourceid": 37834, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39205, "uid": "db7f2bf023263eb0a0a8247c5007f49a", "name": "EE-RL: Vision Language Guided Reinforcement Learning with Explorer and Expert model for End-to-End Autonomous Driving", "authors": [{"id": 191575, "fullname": "Xiaolong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191575?format=json", "institution": "Chang&#x27;an University"}, {"id": 191576, "fullname": "Lan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191576?format=json", "institution": "Chang&#x27;an university"}, {"id": 191577, "fullname": "Ruyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191577?format=json", "institution": "IEIT SYSTEMS (Beijing) Co., Ltd."}, {"id": 191578, "fullname": "Shan Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191578?format=json", "institution": "Chang&#x27;an University"}, {"id": 191579, "fullname": "Yang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191579?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 191580, "fullname": "Xiangmo Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191580?format=json", "institution": "Chang&#x27;an University"}], "abstract": "End-to-end driving frameworks that directly map raw sensor data to vehicle control commands have shown remarkable potential. However, their performance often deteriorates in sparse-critical scenarios, where rare but safety-sensitive events occur. To address this problem, we propose Explorer-Expert Reinforcement Learning (EE-RL), a novel end-to-end framework that integrates an RL-based explorer, a fine-tuned vision-language model (VLM)-based expert, and a dual replay buffer. EE-RL adopts a collaborative learning strategy in which the explorer and experts jointly generate experiences from regular driving scenarios to guide policy learning. As training progresses, a dedicated VLM expert focuses on reasoning about sparse-critical scenarios, therefore enhancing learning efficiency and policy optimization in both scenarios. Additionally, the StateHash algorithm is designed to measure RGB-image and kinematic-data similarity, thereby skipping unnecessary VLM reasoning and enabling denser, more effective expert experience generation. Extensive experiments on the CARLA Leaderboard demonstrate that EE-RL significantly outperforms state-of-the-art (SOTA) baselines, achieving +19.82% and +20.98% improvements in driving and infraction scores on Town03, respectively. The EE-RL further achieves 0% accident probability in the red-light running and get an average driving score of 80.09 in the generalization towns (Town05\u201306), demonstrating its strong capability in addressing sparse-critical scenarios as well as its robutness and generzlization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39205", "url": null, "sourceid": 46458, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37735, "uid": "9209d9f3516cb004ae00cb39cfe27fa5", "name": "Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction", "authors": [{"id": 180216, "fullname": "Zhihao LI", "url": "http://cvpr.thecvf.com/api/miniconf/users/180216?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 188124, "fullname": "Shengwei Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/188124?format=json", "institution": "SandGold AI Research"}, {"id": 188125, "fullname": "Chuang Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188125?format=json", "institution": "SandGlod AI Research"}, {"id": 188126, "fullname": "Junxuan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188126?format=json", "institution": "SandGold AI Research"}, {"id": 188127, "fullname": "Zhilu Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/188127?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 188128, "fullname": "Zhiqiang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188128?format=json", "institution": "Southern University of Science and Technology"}, {"id": 85525, "fullname": "Wei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85525?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 188129, "fullname": "Guangtao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188129?format=json", "institution": "Southern University of Science and Technology"}], "abstract": "Existing image SR and generic diffusion models transfer poorly to fluid SR: they are sampling-intensive, ignore physical constraints, and often yield spectral mismatch and spurious divergence. We address fluid super-resolution (SR) with **ReMD** (**Re**sidual-**M**ultigrid **D**iffusion), a physics-consistent diffusion framework. At each reverse step, ReMD performs a **multigrid residual correction**: the update direction is obtained by coupling data consistency with lightweight physics cues and then correcting the residual across scales; the multiscale hierarchy is instantiated with a **multi-wavelet** basis to capture both large structures and fine vortical details. This coarse-to-fine design accelerates convergence and preserves fine structures while remaining equation-free. Across atmospheric and oceanic benchmarks, ReMD improves accuracy and spectral fidelity, reduces divergence, and reaches comparable quality with markedly fewer sampling steps than diffusion baselines. Our results show that enforcing physics consistency **inside** the diffusion process via multigrid residual correction and multi-wavelet multiscale modeling is an effective route to efficient fluid SR.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37735", "url": null, "sourceid": 44842, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39006, "uid": "f5ebd52a0dc78551b072aa2eb23438b8", "name": "CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions", "authors": [{"id": 172152, "fullname": "Chonghuinan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/172152?format=json", "institution": "Harbin Institute of Technology"}, {"id": 191165, "fullname": "Zihan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191165?format=json", "institution": "Harbin Institute of Technology"}, {"id": 89750, "fullname": "Yuxiang Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/89750?format=json", "institution": "The Hong Kong Polytechnic University, Hong Kong Polytechnic University"}, {"id": 191166, "fullname": "Tianyi Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191166?format=json", "institution": "Harbin Institute of Technology"}, {"id": 89739, "fullname": "Xiaohe Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89739?format=json", "institution": "Harbin Institute of Technology"}, {"id": 153255, "fullname": "Fan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/153255?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 84797, "fullname": "Wangmeng Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/84797?format=json", "institution": "Harbin Institute of Technology"}, {"id": 86527, "fullname": "Hongxun Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86527?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Instruction-based multimodal image manipulation has recently made rapid progress. However, existing evaluation methods lack a systematic and human-aligned framework for assessing model performance on complex and creative editing tasks.To address this gap, we propose CREval, a fully automated question\u2013answer (QA)\u2013based evaluation pipeline that that overcomes the incompleteness and poor interpretability of opaque Large Language Models (MLLMs)  scoring. Simultaneously, we introduce CREval-Bench, a comprehensive benchmark specifically designed for creative image manipulation under complex instructions. CREval-Bench covers three categories and nine creative dimensions, comprising over 800 editing samples and 13K evaluation queries.Leveraging this pipeline and benchmark, we systematically evaluate a diverse set of state-of-the-art open and closed-source models. The results reveal that while closed-source models generally outperform open-source ones on complex and creative tasks, all models still struggle to complete such edits effectively. In addition, user studies demonstrate strong consistency between CREval\u2019s automated metrics and human judgments.Therefore, CREval provides a reliable foundation for evaluating image editing models on complex and creative image manipulation tasks, and highlights key challenges and opportunities for future research. All code and data will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39006", "url": null, "sourceid": 40775, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39220, "uid": "c2a6300aa487ba1f56883b5d6c05e8aa", "name": "Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection", "authors": [{"id": 129553, "fullname": "Yasiru Ranasinghe", "url": "http://cvpr.thecvf.com/api/miniconf/users/129553?format=json", "institution": "Johns Hopkins University"}, {"id": 191614, "fullname": "Elim Schenck", "url": "http://cvpr.thecvf.com/api/miniconf/users/191614?format=json", "institution": "Kitware"}, {"id": 191615, "fullname": "Florence Yellin", "url": "http://cvpr.thecvf.com/api/miniconf/users/191615?format=json", "institution": "Kitware, Inc."}, {"id": 191616, "fullname": "Shuowen Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191616?format=json", "institution": "DEVCOM Army Research Laboratory"}, {"id": 91001, "fullname": "Christopher Funk", "url": "http://cvpr.thecvf.com/api/miniconf/users/91001?format=json", "institution": "Kitware, Inc"}, {"id": 87493, "fullname": "Vishal M. Patel", "url": "http://cvpr.thecvf.com/api/miniconf/users/87493?format=json", "institution": "Johns Hopkins University"}], "abstract": "Existing open-vocabulary detectors focus on RGB images and fail to generalize to thermal imagery, where low texture and emissivity variations challenge RGB-based semantics. We present Thermal-Det, the first large language model (LLM) supervised open-vocabulary detector tailored for thermal images. To enable large-scale training, we construct a synthetic dataset by converting GroundingCap-1M into the thermal domain and filtering captions to remove RGB-specific terms, yielding over one million thermally aligned samples with bounding boxes, grounding texts, and detailed captions. Thermal-Det jointly optimizes detection, captioning, and cross-modal distillation objectives. A frozen RGB teacher provides geometric and semantic pseudo-supervision for paired but unlabeled RGB\u2013thermal data, transferring open-vocabulary knowledge without manual annotation. The model further employs a Thermal\u2013Text Alignment Head for text calibration and a Modality-Fused Cross-Attention module for dual-modality reasoning. Unlike prior domain-adaptation methods, the detector is fully fine-tuned to internalize thermal contrast patterns while preserving language alignment. Experiments on public benchmarks show consistent 2\u20134\\% AP gains over existing open-vocabulary detectors, establishing a strong foundation for scalable, language-driven thermal perception.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39220", "url": null, "sourceid": 42702, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37943, "uid": "85ce3745f9ab408773e5415064f206a3", "name": "SafeLogo: Turning Your Logos into Jailbreak Shields via Micro-Regional Adversarial Training", "authors": [{"id": 188650, "fullname": "Zhiyi Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188650?format=json", "institution": "Inner Mongolia University"}, {"id": 188651, "fullname": "Xiaoyue Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188651?format=json", "institution": "Inner Mongolia University"}, {"id": 188652, "fullname": "Tianxing Man", "url": "http://cvpr.thecvf.com/api/miniconf/users/188652?format=json", "institution": "Jilin University"}], "abstract": "Recent Vision-Language Models (VLMs) have become increasingly susceptible to jailbreak attacks, where adversarial prompts exploit subtle manipulation to circumvent safety alignment.The diversity and adaptability of such jailbreakers necessitate a defense mechanism with strong generalization capability.However, fine-tuning large-scale VLMs is computationally expensive, and introducing excessive visual or textual defense prompts is impractical for preserving image realism and model usability.We propose SafeLogo, which tunes a logo-sized visual prompt into a universal shield against diverse jailbreak attacks through micro-regional adversarial training.We are the first to integrate min\u2013max adversarial optimization into visual defense prompt generation.Specifically, in the outer loop, SafeLogo injects compact, bounded perturbations into extremely small image regions ($\\leq 2\\%$ pixel coverage), effectively preserving both visual fidelity and semantic consistency.Meanwhile, overcoming the limitations of prior defenses constrained to a single attack direction or fixed benign supervision, the inner loop dynamically generates and selects the strongest one from a variety of jailbreakers.Extensive experiments on LLaVA-1.5-13B,MiniGPT-4, and Qwen3-VL show that SafeLogo markedly lowers jailbreak success rates on MM-SafetyBench, VLGuard, and FigStep, while preserving benign performance on MM-Vet and MME.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37943", "url": null, "sourceid": 42174, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37624, "uid": "25fac99a7e76310b5286a9df9c3839a1", "name": "Parameter-efficient Continual Learning for Enhancing Plasticity without Forgetting under Limited Model Capacity", "authors": [{"id": 187891, "fullname": "Yitian Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187891?format=json", "institution": "Central South University"}, {"id": 187892, "fullname": "Shigeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187892?format=json", "institution": "Central South University"}, {"id": 186591, "fullname": "Xuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186591?format=json", "institution": "Hunan University"}, {"id": 187893, "fullname": "Mingming Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187893?format=json", "institution": "Central South University"}, {"id": 187894, "fullname": "Kai Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187894?format=json", "institution": null}, {"id": 187895, "fullname": "Zhu Hongye", "url": "http://cvpr.thecvf.com/api/miniconf/users/187895?format=json", "institution": "Hunan University"}, {"id": 187896, "fullname": "Xinning Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187896?format=json", "institution": "Hunan University"}], "abstract": "Avoiding catastrophic forgetting for previous tasks and maintaining model plasticity to support new tasks are two critical objectives of continual learning. However, existing methods usually neglect one of the two aspects and fail to support long task sequences with satisfactory performance, especially in resource-constrained scenarios in which the size of the model is limited. This work proposes GRAPA, a parameter-efficient continual learning method that well balances stability and plasticity of the model to handle long task sequences with diverse complexities. GRAPA enhances model plasticity without sacrificing stability with two novel designs. First, a gradient-guided parameter reuse strategy is proposed to make full use of frozen parameters while ensuring that no task interference is introduced. Second, a reinforcement-learning-based parameter allocation is designed to enable the model to adapt to the current task on top of reused parameters while preserving maximal model capacity for future tasks. Experiments on multiple task sequences composed of various datasets demonstrate that GRAPA lifts mean task accuracy by up to 7.67%, with up to 14.92% gains on subsequent complex tasks, reflecting GRAPA\u2019s superior plasticity.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37624", "url": null, "sourceid": 38669, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38115, "uid": "12e8bc147311e2f5bf7ba36fd0039a59", "name": "VoDaSuRe: A Large-Scale Dataset Revealing Domain Shift in Volumetric Super-Resolution", "authors": [{"id": 184281, "fullname": "August Leander H\u00f8eg", "url": "http://cvpr.thecvf.com/api/miniconf/users/184281?format=json", "institution": "Technical University of Denmark"}, {"id": 189086, "fullname": "Sophia Bardenfleth", "url": "http://cvpr.thecvf.com/api/miniconf/users/189086?format=json", "institution": "Technical University of Denmark"}, {"id": 189087, "fullname": "Hans Martin Kjer", "url": "http://cvpr.thecvf.com/api/miniconf/users/189087?format=json", "institution": "Technical University of Denmark"}, {"id": 189088, "fullname": "Tim Dyrby", "url": "http://cvpr.thecvf.com/api/miniconf/users/189088?format=json", "institution": "Technical University of Denmark; Copenhagen University Hospital Amager and Hvidovre"}, {"id": 93992, "fullname": "Vedrana Dahl", "url": "http://cvpr.thecvf.com/api/miniconf/users/93992?format=json", "institution": "Technical University of Denmark"}, {"id": 93991, "fullname": "Anders Dahl", "url": "http://cvpr.thecvf.com/api/miniconf/users/93991?format=json", "institution": "DTU Compute"}], "abstract": "Recent advances in volumetric super-resolution (SR) have demonstrated great performance in medical and scientific imaging, with transformer- and CNN-based approaches achieving impressive results even at extreme scaling factors. We show that this impressive performance largely stems from training on downsampled data rather than real low-resolution scans. Such a training setup arises partly from the limited availability of paired high- and low-resolution volumetric datasets. To address this gap, we introduce VoDaSuRe, a large-scale volumetric dataset containing paired high- and low-resolution scans. When training models on VoDaSuRe, we reveal a significant discrepancy: models trained on downscaled data produce substantially sharper predictions than those trained on real low-resolution scans, which smooth fine structures. Conversely, applying downscaled trained models to real scans preserves more structure but is inaccurate. Our findings suggest that current SR methods are overstated - when applied to real data, they do not recover structures lost in low-resolution scans but instead predict a smoothed average. We argue that progress in deep learning-based volumetric SR requires datasets with paired real scans of high complexity, such as VoDaSuRe. Our dataset and code are publicly available at link_when_published.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38115", "url": null, "sourceid": 36794, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37174, "uid": "1aad8feb62179dc358d760cdc4210a90", "name": "VisualAD: Language-Free Zero-Shot Anomaly Detection via Vision Transformer", "authors": [{"id": 186842, "fullname": "Yanning Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186842?format=json", "institution": "National University of Defense Technology"}, {"id": 186843, "fullname": "Peiyuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186843?format=json", "institution": "Anhui University"}, {"id": 186844, "fullname": "Zirui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186844?format=json", "institution": "Anhui University"}, {"id": 186845, "fullname": "Yitong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186845?format=json", "institution": "Anhui University"}, {"id": 186846, "fullname": "Yanran Ruan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186846?format=json", "institution": "Anhui University"}, {"id": 186847, "fullname": "Jianfeng Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186847?format=json", "institution": "Anhui University"}, {"id": 183734, "fullname": "Ke Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183734?format=json", "institution": "Anhui University"}], "abstract": "Zero-shot anomaly detection (ZSAD) requires detecting and localizing anomalies without access to target-class anomaly samples. Mainstream methods rely on vision\u2013language models (VLMs) such as CLIP: they build hand-crafted or learned prompt sets for normal and abnormal semantics, then compute image\u2013text similarities for open-set discrimination. While effective, this paradigm depends on a text encoder and cross-modal alignment, which can lead to training instability and parameter redundancy. This work revisits the necessity of the text branch in ZSAD and presents VisualAD, a purely visual framework built on Vision Transformers. We introduce two learnable tokens within a frozen backbone to directly encode normality and abnormality. Through multi-layer self-attention, these tokens interact with patch tokens, gradually acquiring high-level notions of normality and anomaly while guiding patches to highlight anomaly-related cues. Additionally, we incorporate a Spatial-Aware Cross-Attention (SCA) module and a lightweight Self-Alignment Function (SAF): SCA injects fine-grained spatial information into the tokens, and SAF recalibrates patch features before anomaly scoring. VisualAD achieves state-of-the-art performance on 13 zero-shot anomaly detection benchmarks spanning industrial and medical domains, and adapts seamlessly to pretrained vision backbones such as the CLIP image encoder and DINOv2. Our code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37174", "url": null, "sourceid": 45640, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36659, "uid": "02cf7b95eaf630256990316ef6d5bcb3", "name": "CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning", "authors": [{"id": 185580, "fullname": "Qi Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/185580?format=json", "institution": "Beihang university"}, {"id": 87088, "fullname": "Honglin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/87088?format=json", "institution": "Westlake University"}, {"id": 76526, "fullname": "Yingchen Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76526?format=json", "institution": "Nanyang Technological University"}, {"id": 185581, "fullname": "Haoyi Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185581?format=json", "institution": "Beihang University"}, {"id": 87061, "fullname": "Lin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87061?format=json", "institution": "Westlake University "}, {"id": 90418, "fullname": "Song Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/90418?format=json", "institution": "ByteDance"}, {"id": 185040, "fullname": "Qi She", "url": "http://cvpr.thecvf.com/api/miniconf/users/185040?format=json", "institution": "Bytedance AI Lab"}, {"id": 155910, "fullname": "Zilong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155910?format=json", "institution": "Bytedance"}, {"id": 185582, "fullname": "Yunqing Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185582?format=json", "institution": "ByteDance"}], "abstract": "Recent releases such as o3 highlight human-like \u201cthinking with images\u201d reasoning that combines structured tool use with stepwise verification, yet most open-source approaches still rely on text-only chains, rigid visual schemas, or single-step pipelines, limiting flexibility, interpretability, and transferability on complex tasks.We introduce CodeDance, which explores executable code as a general solver for visual reasoning. Unlike fixed-schema calls (e.g., only predicting bounding-box coordinates), CodeDance defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts (e.g., boxes, lines, plots) that support transparent, self-checkable reasoning.To guide this process, we introduce a reward for balanced and adaptive tool-call, which balances exploration with efficiency and mitigates tool overuse.Interestingly, beyond the expected capabilities taught by atomic supervision, we empirically observe novel emergent behaviors during RL training: CodeDance demonstrates novel tool invocations, unseen compositions, and cross-task transfer. These behaviors arise without task-specific fine-tuning, suggesting a general and scalable mechanism of executable visual reasoning.Extensive experiments across reasoning benchmarks (e.g., visual search, math, chart QA) show that CodeDance not only consistently outperforms schema-driven and text-only baselines, but also surpasses advanced closed models such as GPT-4o and larger open-source models. Our code is available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36659", "url": "https://codedance-vl.github.io/", "sourceid": 35282, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37946, "uid": "3c81b085f42fee9992a2391b4aaee470", "name": "AdapAction: Adaptive Target Action Backdoor Attack against GUI Agents", "authors": [{"id": 183891, "fullname": "Baicheng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/183891?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 188654, "fullname": "Mingda Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188654?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 188655, "fullname": "Min Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188655?format=json", "institution": "Harbin Institute of Technology (Shenzhen)"}, {"id": 85561, "fullname": "Haizhou Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/85561?format=json", "institution": "The Chinese University of Hong Kong (Shenzhen); National University of Singapore"}, {"id": 88740, "fullname": "Baoyuan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88740?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}], "abstract": "Autonomous Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) are increasingly vital for complex task automation. However, their capacity for self-driven decision-making introduces significant, yet underexplored, security risks, among which backdoor attacks pose a particularly stealthy and high-impact threat. Prior work has shown GUI agents vulnerable to such attacks, but existing methods rely on static trigger-action mappings that execute fixed, context-agnostic behaviors, making them highly detectable. To address this limitation, we introduce **AdapAction**, a novel backdoor attack that subverts the agent\u2019s decision-making by embedding an **adaptive, context-aware policy**. Unlike traditional approaches, AdapAction enables the agent to autonomously select environmentally coherent malicious actions based on the current GUI state and user instruction, thereby evading detection while preserving functional utility. Extensive experiments on the Android-In-The-Zoo (AitZ) and AndroidControl benchmarks show that AdapAction achieves up to 100% Attack Success Rate (ASR) while preserving benign task utility. More critically, AdapAction consistently evades a multi-principle-based LLM defense evaluating instruction alignment, visual coherence, and safety, whereas traditional fixed-action attacks are nearly 100% detected. This resilience stems from AdapAction\u2019s contextually grounded malicious actions, which are semantically and visually indistinguishable from legitimate operations. As a result, AdapAction exhibits exceptional stealth and poses a significantly greater real-world threat to LLM-powered GUI agents.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37946", "url": null, "sourceid": 43574, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37996, "uid": "82aa20a74dbe3e0e85c23ba8d645d3ce", "name": "Fine-Grained GRPO for Precise Preference Alignment in Flow Models", "authors": [{"id": 154990, "fullname": "Yujie Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/154990?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 126396, "fullname": "Pengyang Ling", "url": "http://cvpr.thecvf.com/api/miniconf/users/126396?format=json", "institution": "University of Science and Technology of China"}, {"id": 151486, "fullname": "Jiazi Bu", "url": "http://cvpr.thecvf.com/api/miniconf/users/151486?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 103354, "fullname": "Yibin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/103354?format=json", "institution": "Fudan University"}, {"id": 131138, "fullname": "Yuhang Zang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131138?format=json", "institution": "Nanyang Technological University"}, {"id": 77217, "fullname": "Jiaqi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77217?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 74024, "fullname": "Li Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/74024?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 86659, "fullname": "Guangtao Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/86659?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "The incorporation of online reinforcement learning (RL) into diffusion and flow-based generative models has recently gained attention as a powerful paradigm for aligning model behavior with human preferences. By leveraging stochastic sampling via Stochastic Differential Equations (SDEs) during the denoising phase, these models can explore a variety of denoising trajectories, enhancing the exploratory capacity of RL. However, despite their ability to discover potentially high-reward samples, current approaches often struggle to effectively align with preferences due to the sparsity and narrowness of reward feedback. To overcome this limitation, we introduce a novel framework called Granular-GRPO (G$^2$RPO), which enables fine-grained and comprehensive evaluation of sampling directions in the RL training of flow models. Specifically, we propose a Singular Stochastic Sampling mechanism that supports step-wise stochastic exploration while ensuring strong correlation between injected noise and reward signals, enabling more accurate credit assignment to each SDE perturbation. Additionally, to mitigate the bias introduced by fixed-granularity denoising, we design a Multi-Granularity Advantage Integration module that aggregates advantages computed across multiple diffusion scales, resulting in a more robust and holistic assessment of sampling trajectories. Extensive experiments on various reward models, including both in-domain and out-of-domain settings, demonstrate that our G$^2$RPO outperforms existing flow-based GRPO baselines, highlighting its effectiveness and generalization capability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37996", "url": "https://bujiazi.github.io/g2rpo.github.io/", "sourceid": 40145, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37699, "uid": "a545f9807a9d4746ceb2d13b80a38d2e", "name": "Neural-Centric Video Processing Pipeline for Unified Multi-Task Inference", "authors": [{"id": 188037, "fullname": "Seyeon Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/188037?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 85728, "fullname": "Juncheol Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/85728?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 188038, "fullname": "Jaehong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/188038?format=json", "institution": "Inha University"}, {"id": 85708, "fullname": "Dongsu Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/85708?format=json", "institution": "KAIST, Korea Advanced Institute of Science &amp; Technology"}], "abstract": "Videos are increasingly used as inputs to machine learning models, where repeated decoding and processing for diverse downstream tasks dominate computational costs. However, existing video processing pipelines remain inefficient: traditional video codecs (H.264, H.265) are optimized for human visual quality and require full pixel decoding for each inference, Compressed Domain Inference (CDI) is tightly coupled to specific codec structures with limited task flexibility, and Video Coding for Machines (VCM) demands separate representations task-specific encoders without human visualization support.We propose Neural Video Pipeline (NVP), a framework that leverages Implicit Neural Representations (INR) to directly extract task-specific features from intermediate layers, eliminating pixel reconstruction overhead.NVP employs lightweight Micro Adapters to bridge INR features directly into the feature space of downstream models, bypassing both decoding and early extraction stages.Through comprehensive benchmarks across four representative tasks (image classification, object detection, action recognition, and segmentation), NVP reduces latency up to 89.5\\%, inference FLOPs up to 29.9\\%, while supporting multiple tasks with a single unified representation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37699", "url": null, "sourceid": 39898, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38933, "uid": "9faf494a168e02772dfb2a2414b8309e", "name": "Harnessing the Power of Foundation Models for Accurate Material Classification", "authors": [{"id": 182198, "fullname": "QINGRAN LIN", "url": "http://cvpr.thecvf.com/api/miniconf/users/182198?format=json", "institution": "Georgia Institute of Technology"}, {"id": 190992, "fullname": "Fengwei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190992?format=json", "institution": "Duke University"}, {"id": 190993, "fullname": "Chaolun Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190993?format=json", "institution": "Waseda University"}], "abstract": "Material classification has emerged as a critical task in computer vision and graphics, supporting the assignment of accurate material properties to a wide range of digital and real-world applications. While traditionally framed as an image classification task, this domain faces significant challenges due to the scarcity of annotated data, limiting the accuracy and generalizability of trained models. Recent advances in vision-language foundation models (VLMs) offer promising avenues to address these issues, yet existing solutions leveraging these models still exhibit unsatisfying results in material recognition tasks. In this work, we propose a novel framework that effectively harnesses foundation models to overcome data limitations and enhance classification accuracy. Our method integrates two key innovations:(a) a robust image generation and auto-labeling pipeline that creates a diverse and high-quality training dataset with material-centric images, and automatically assigns labels by fusing object semantics and material attributes in text prompts; (b) a prior incorporation strategy to distill information from VLMs, combined with a joint fine-tuning method that optimizes a pre-trained vision foundation model alongside VLM-derived priors, preserving broad generalizability while adapting to material-specific features.Extensive experiments demonstrate significant improvements on multiple datasets. We show that our synthetic dataset effectively captures the characteristics of real world materials, and the integration of priors from vision-language models significantly enhances the final performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38933", "url": null, "sourceid": 33797, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37259, "uid": "22faad819c7d2f9739083b503674694e", "name": "Synthesizing Visual Concepts as Vision-Language Programs", "authors": [{"id": 181301, "fullname": "Antonia W\u00fcst", "url": "http://cvpr.thecvf.com/api/miniconf/users/181301?format=json", "institution": "TU Darmstadt"}, {"id": 187024, "fullname": "Wolfgang Stammer", "url": "http://cvpr.thecvf.com/api/miniconf/users/187024?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 187025, "fullname": "Hikaru Shindo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187025?format=json", "institution": "TU Darmstadt"}, {"id": 187026, "fullname": "Lukas Helff", "url": "http://cvpr.thecvf.com/api/miniconf/users/187026?format=json", "institution": "Technische Universit\u00e4t Darmstadt"}, {"id": 187027, "fullname": "Devendra Singh Dhami", "url": "http://cvpr.thecvf.com/api/miniconf/users/187027?format=json", "institution": "Eindhoven University of Technology"}, {"id": 187028, "fullname": "Kristian Kersting", "url": "http://cvpr.thecvf.com/api/miniconf/users/187028?format=json", "institution": "German Research Center for AI; The Hessian Center for AI; TU Darmstadt"}], "abstract": "Vision-Language models (VLMs) achieve strong performance on multimodal tasks but often fail at systematic visual reasoning tasks, leading to inconsistent or illogical outputs. Neuro-symbolic methods promise to address this by inducing interpretable logical rules, though they exploit rigid, domain-specific perception modules. We propose Vision-Language Programs (VLPs), which combine the perceptual flexibility of VLMs with systematic reasoning of program synthesis. Rather than embedding reasoning inside the VLM, VLPs leverage the model to produce structured visual descriptions that are compiled into neuro-symbolic programs. The resulting programs execute directly on images, remain consistent with task constraints, and provide human-interpretable explanations that enable easy shortcut mitigation. Experiments on synthetic and real-world datasets demonstrate that VLPs outperform direct and structured prompting, particularly on tasks requiring complex logical reasoning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37259", "url": null, "sourceid": 34144, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36859, "uid": "8c226e712430372111f4fe3244742466", "name": "From Measurement to Mitigation: Quantifying and Reducing Identity Leakage in Image Representation Encoders with Linear Subspace Removal", "authors": [{"id": 186037, "fullname": "Daniel George", "url": "http://cvpr.thecvf.com/api/miniconf/users/186037?format=json", "institution": "Persona"}, {"id": 186038, "fullname": "Charles Yeh", "url": "http://cvpr.thecvf.com/api/miniconf/users/186038?format=json", "institution": "Persona Inc."}, {"id": 186039, "fullname": "Daniel Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/186039?format=json", "institution": "Persona Identities"}, {"id": 186040, "fullname": "Yifei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186040?format=json", "institution": "Persona Identities"}], "abstract": "Frozen visual embeddings (e.g., CLIP, DINOv2/v3, SSCD) power retrieval and integrity systems, yet their use on face-containing data is constrained by unmeasured identity leakage and a lack of deployable mitigations. We take an attacker-aware view and contribute: (i) a benchmark of visual embeddings that reports open-set verification at low false-accept rates, a calibrated diffusion-based template inversion check, and face\u2013context attribution with equal-area perturbations; and (ii) propose a one-shot linear projector that removes an estimated identity subspace while preserving the complementary space needed for utility, which for brevity we denote as the identity sanitization projection ISP. Across CelebA-20 and VGGFace2, we show that these encoders are robust under open-set linear probes, with CLIP exhibiting relatively higher leakage than DINOv2/v3 and SSCD, robust to template inversion, and are context-dominant. In addition, we show that ISP drives linear access to near-chance while retaining high non-biometric utility, and transfers across datasets with minor degradation. Our results establish the first attacker-calibrated facial privacy audit of non-FR encoders and demonstrate that linear subspace removal achieves strong privacy guarantees while preserving utility for visual search and retrieval.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36859", "url": null, "sourceid": 33731, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36230, "uid": "f1f6d9f111dd1f5a4498fa1d38be78c8", "name": "Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study", "authors": [{"id": 106811, "fullname": "Yuhan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/106811?format=json", "institution": "University of California, Santa Cruz"}, {"id": 184515, "fullname": "Zihan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184515?format=json", "institution": "University of Washington"}, {"id": 167235, "fullname": "Han Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/167235?format=json", "institution": "Siemens Healthineers"}, {"id": 178499, "fullname": "Simon Arberet", "url": "http://cvpr.thecvf.com/api/miniconf/users/178499?format=json", "institution": "Siemens Healthineers"}, {"id": 184516, "fullname": "Martin F. Kraus", "url": "http://cvpr.thecvf.com/api/miniconf/users/184516?format=json", "institution": "Siemens Healthineers"}, {"id": 75508, "fullname": "Yuyin Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/75508?format=json", "institution": "UC Santa Cruz"}, {"id": 184517, "fullname": "Florin-Cristian Ghesu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184517?format=json", "institution": "Siemens Healthineers"}, {"id": 90931, "fullname": "Dorin Comaniciu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90931?format=json", "institution": "Siemens Healthineers"}, {"id": 90924, "fullname": "Ali Kamen", "url": "http://cvpr.thecvf.com/api/miniconf/users/90924?format=json", "institution": "Siemens Healthineers"}, {"id": 90948, "fullname": "Riqiang Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/90948?format=json", "institution": "Siemens Healthineers"}], "abstract": "Voxel-wise dose prediction is a critical yet challenging task in practical radiotherapy (RT) planning, as bespoke models trained from scratch often struggle to generalize across diverse clinical settings. Meanwhile, generative models trained on billion-scale datasets from vision domains have achieved impressive performance. Herein, we propose DiffKT3D, a unified Any2Any 3D diffusion framework that leverages prior knowledge from pretrained video diffusion models for efficient and clinically meaningful dose prediction. To enable flexible conditioning across multiple clinical modalities (CT, anatomical structures, body, beam settings, etc.), we introduce an Any2Any conditional paradigm utilizing modality-specific embeddings without cross-attention overhead. Further, we design a novel reinforcement learning (RL) post-training mechanism guided by a clinically-informed scorecard explicitly tailored to institutional treatment preferences. Compared with winner of GDP\u2013HMM challenge, DiffKT3D sets a new state-of-the-art in dose prediction by reducing voxel-level MAE from 2.07 to 1.93. In addition, DiffKT3D achieves superior image quality and preference match. These results demonstrate that transferring diffusion priors via modality-aware conditioning and clinically aligned RL post-training provides a robust and generalizable solution for RT planning across various clinical scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36230", "url": null, "sourceid": 44523, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38566, "uid": "ebf283eef673870e3325594ffcc9049e", "name": "M4Human: A Large-Scale Multimodal mmWave Radar Benchmark for Human Mesh Reconstruction", "authors": [{"id": 162153, "fullname": "Fan Junqiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/162153?format=json", "institution": "Nanyang Technological University"}, {"id": 190161, "fullname": "Yunjiao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190161?format=json", "institution": "Nanyang Technological University"}, {"id": 190162, "fullname": "Yizhuo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190162?format=json", "institution": "Nanyang Technological University"}, {"id": 190163, "fullname": "Xinyuan Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/190163?format=json", "institution": null}, {"id": 180971, "fullname": "Jiarui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180971?format=json", "institution": "Nanyang Technological University"}, {"id": 72464, "fullname": "Lihua Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/72464?format=json", "institution": "Nanyang Technological University"}, {"id": 187548, "fullname": "Jianfei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187548?format=json", "institution": "Nanyang Technological University"}, {"id": 190164, "fullname": "Chris Xiaoxuan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190164?format=json", "institution": "University College London"}, {"id": 190165, "fullname": "Fangqiang Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/190165?format=json", "institution": "Massachusetts Institute of Technology"}], "abstract": "Human mesh reconstruction (HMR) provides direct insights into body-environment interaction, which enables various immersive applications. While existing large-scale HMR datasets rely heavily on line-of-sight RGB input, vision-based sensing is limited by occlusion, lighting variation, and privacy concerns. To overcome these limitations, recent efforts have explored radio-frequency (RF) mmWave radar for privacy-preserving indoor human sensing. However, current radar datasets are constrained by sparse skeleton labels, limited scale, and simple in-place actions. To advance the HMR research community, we introduce M4Human, the current largest-scale (661K-frame) ($9\\times$ prior largest) multimodal benchmark, featuring high-resolution mmWave radar, RGB, and depth data. M4Human provides both raw radar tensors (RT) and processed radar point clouds (RPC) to enable research across different levels of RF signal granularity. M4Human includes high-quality motion capture (MoCap) annotations with 3D meshes and global trajectories, and spans 20 subjects and 50 diverse actions, including in-place, sit-in-place, and free-space sports or rehabilitation movements. We establish benchmarks on both RT and RPC modalities, as well as multimodal fusion with RGB-D modalities. Extensive results highlight the significance of M4Human for radar-based human modeling while revealing persistent challenges under fast, unconstrained motion. The dataset and code will be released after the paper publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38566", "url": null, "sourceid": 43055, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38147, "uid": "1be5b1b81fb31616482ac54524fe159b", "name": "Rethinking Concept Bottleneck Models: From Pitfalls to Solutions", "authors": [{"id": 182513, "fullname": "Merve Tapli", "url": "http://cvpr.thecvf.com/api/miniconf/users/182513?format=json", "institution": "Middle East Technical University"}, {"id": 157603, "fullname": "Quentin Bouniot", "url": "http://cvpr.thecvf.com/api/miniconf/users/157603?format=json", "institution": "TUM / Helmholtz Munich"}, {"id": 187024, "fullname": "Wolfgang Stammer", "url": "http://cvpr.thecvf.com/api/miniconf/users/187024?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 154682, "fullname": "Zeynep Akata", "url": "http://cvpr.thecvf.com/api/miniconf/users/154682?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}, {"id": 134258, "fullname": "Emre Akbas", "url": "http://cvpr.thecvf.com/api/miniconf/users/134258?format=json", "institution": "METU"}], "abstract": "Concept Bottleneck Models (CBMs) ground predictions in human-understandable concepts but face fundamental limitations: the absence of a metric to pre-evaluate concept relevance, the ``linearity problem'' causing recent CBMs to bypass the concept bottleneck entirely, an accuracy gap compared to opaque models, and finally the lack of systematic study on the impact of different visual backbones and VLMs. We introduce CBM-Suite, a methodological framework to systematically addresses these challenges. First, we propose an entropy-based metric to quantify the intrinsic suitability of a concept set for a given dataset. Second, we resolve the linearity problem by inserting a non-linear layer between concept activations and the classifier, which ensures that model accuracy faithfully reflects concept relevance. Third, we narrow the accuracy gap by leveraging a distillation loss guided by a linear teacher probe. Finally, we provide comprehensive analyses on how different vision encoders, vision-language models, and concept sets interact to influence accuracy and interpretability in CBMs. Extensive evaluations show that CBM-Suite yields more accurate models and provides insights for improving concept-based interpretability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38147", "url": null, "sourceid": 34104, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38292, "uid": "ce8d8333fc3f225ab596b0505fbdb2cf", "name": "BiomedCCPL: Causal Conditional Prompt Learning for Biomedical Vision-Language Models", "authors": [{"id": 181905, "fullname": "Xueliang Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/181905?format=json", "institution": "SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY CHINESE ACADEMY OF SCIENCES"}, {"id": 189520, "fullname": "Juncai Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189520?format=json", "institution": "Shenzhen Institutes of Advanced Technology Chinese Academy of Sciences; Shantou University"}, {"id": 189521, "fullname": "Jiacheng Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/189521?format=json", "institution": "Tsinghua University"}, {"id": 189522, "fullname": "Dan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189522?format=json", "institution": null}, {"id": 189523, "fullname": "Hao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189523?format=json", "institution": "Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences"}, {"id": 69902, "fullname": "Ruxin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/69902?format=json", "institution": "Shenzhen Institute of Advanced Technology,Chinese Academy of Sciences"}], "abstract": "Vision-language models (VLMs) have demonstrated strong potential for adapting to downstream biomedical tasks with limited training samples. However, their generalization to unseen classes within the same dataset remains limited, as the image\u2013text alignment semantics often rely on spurious cues present in seen classes that do not transfer.To tackle this, we propose $\\textbf{BiomedCCPL (Causal Conditional Prompt Learning)}$, a framework that uses VGAP (Visual Grounder with Adaptive Prototype) to generate image-conditional prompts from multi-scale adaptive prototypes and employs SCD (Synergistic Causal Disentanglement) to regularize the generation of image-conditional prompts.Guided by insights from a causal analysis of generalization to unseen classes, SCD leverages multiple synergistic learning objectives to perform front-door adjustment, ensuring that the dynamically generated image-conditional prompts focus on underlying diagnostic image features shared across seen and unseen classes.Experiments on 11 datasets across 9 modalities demonstrate that BiomedCCPL effectively enhances the model's data efficiency and generalization ability.In particular, on the Base-to-Novel task, BiomedCCPL achieves an average HM of 79.98\\%, surpassing the previous state-of-the-art by 6.45\\%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38292", "url": null, "sourceid": 46422, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40045, "uid": "510c7d0be8906391133aa81683f4ce2f", "name": "Mechanisms of Object Localization in Vision\u2013Language Models", "authors": [{"id": 182156, "fullname": "Timothy Schauml\u00f6ffel", "url": "http://cvpr.thecvf.com/api/miniconf/users/182156?format=json", "institution": "Goethe University"}, {"id": 193371, "fullname": "Martina G. Vilas", "url": "http://cvpr.thecvf.com/api/miniconf/users/193371?format=json", "institution": "Goethe University"}, {"id": 193372, "fullname": "Gemma Roig", "url": "http://cvpr.thecvf.com/api/miniconf/users/193372?format=json", "institution": "Johann Wolfgang Goethe Universit\u00e4t Frankfurt am Main"}], "abstract": "Visually-grounded language models (VLMs) are highly effective in linking visual and textual information, yet they often struggle with basic classification and localization tasks. While classification mechanisms have been studied more extensively, the processes that support object localization remain poorly understood. In this work, we investigate two representative families, LLaVA-1.5 and InternVL-3.5, using a suite of mechanistic interpretability tools, including token ablations, attention knockout, and causal mediation analysis.We find that localization is driven by a containerization mechanism in which object-aligned tokens define the spatial extent of the object, while internal structure is largely ignored. Only a very small set of attention heads mediates the causal effect for both classification and localization, concentrating in early\u2013mid layers for LLaVA and mid\u2013late layers for InternVL. The two tasks share some early processing but ultimately depend on largely distinct specialized heads.Overall, we provide the first layer- and head-level account of localization in VLMs, revealing narrow computational pathways that can guide future model design and grounding objectives.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40045", "url": null, "sourceid": 37558, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37804, "uid": "f282242afef97d009d90f7c360fb1882", "name": "Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models", "authors": [{"id": 174090, "fullname": "Jialiang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174090?format=json", "institution": "Ocean University of China"}, {"id": 183538, "fullname": "Junlong Tong", "url": "http://cvpr.thecvf.com/api/miniconf/users/183538?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 188306, "fullname": "Junyan Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188306?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 183543, "fullname": "Hao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183543?format=json", "institution": "Eastern Institute of Technology, Ningbo"}, {"id": 188307, "fullname": "Yirong Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/188307?format=json", "institution": "Institute of Digital Twin, Eastern Institute of Technology, Ningbo"}, {"id": 188308, "fullname": "Yunpu Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/188308?format=json", "institution": "Ludwig-Maximilians-Universit\u00e4t M\u00fcnchen; Siemens Corporate Research"}, {"id": 154481, "fullname": "Xiaoyu Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/154481?format=json", "institution": "Eastern Institute of Technology, Ningbo"}], "abstract": "Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities in Chain-of-Thought (CoT) reasoning. However, existing LVLM reasoning paradigms only begin reasoning after the entire video becomes available, introducing unnecessary latency and diminishing attention to early visual cues in dynamic scenes. Inspired by the human ability to think while watching, we introduce a streaming reasoning paradigm for LVLMs, where reasoning unfolds sequentially with incoming frames and deepens after the full video is observed. We instantiate this paradigm through Think-as-You-See (TaYS), a unified framework that enables LVLMs to reason while watching by integrating streaming CoT generation, stream-constrained training, and stream-parallel inference. Specifically, TaYS employs temporally aligned streaming reasoning units with precise CoT supervision, enforces ordered reasoning via streaming attention masks and positional encodings, and utilizes a parallel KV caches mechanism that decouples input encoding from reasoning generation, ensuring alignment and true concurrency. We evaluate TaYS on the Qwen2.5-VL model family across representative video CoT tasks, including event dynamics analysis, causal reasoning, and thematic understanding. Experimental results show that TaYS achieves superior reasoning performance compared with batch-mode CoT, while reducing pre-reasoning latency to under one second and overall answer delay by more than 50\\%. These findings demonstrate the effectiveness of the streaming paradigm in enabling real-time, human-like reasoning for LVLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37804", "url": null, "sourceid": 36474, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37803, "uid": "03e3685ec5e4518558e64360a570cc34", "name": "Dual Band Video Thermography: Separating Time-Varying Reflection and Emission Near Ambient Conditions", "authors": [{"id": 131212, "fullname": "Sriram Narayanan", "url": "http://cvpr.thecvf.com/api/miniconf/users/131212?format=json", "institution": "Carnegie Mellon University"}, {"id": 131210, "fullname": "Mani Ramanagopal", "url": "http://cvpr.thecvf.com/api/miniconf/users/131210?format=json", "institution": "Carnegie Mellon University"}, {"id": 89406, "fullname": "Srinivasa G. Narasimhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/89406?format=json", "institution": "Carnegie Mellon University"}], "abstract": "Long-wave infrared radiation captured by a thermal camera includes (a) emission from an object governed by its temperature and emissivity, and (b) reflected radiation from the surrounding environment. Separating these components is a long-standing challenge in thermography. Even when using multiple bands, the problem is under-determined without priors on emissivity. This difficulty is amplified in near ambient conditions, where emitted and reflected signals are of comparable magnitude. We present a dual-band video thermography framework that reduces this ambiguity by combining two complementary ideas at a per-pixel level: (i) spectral cues (ratio of emissivity between bands is unknown but fixed), and (ii) temporal cues (object radiation changes smoothly while background radiation changes rapidly). We derive an image formation model and an algorithm to jointly estimate the object's emissivity at each band, and the time-varying object and background temperatures. Experiments with calibrated and uncalibrated emissivities in everyday scenes (e.g., coffee pot heating up, palm print on mirrors dissipating, reflections of moving people), demonstrate robust separation and recovery of temperature fields. We will release code and data upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37803", "url": "https://dual-band-thermal.github.io/", "sourceid": 38951, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40314?format=json"], "related_events_ids": [40314]}, {"id": 39854, "uid": "33190d14d318d1c823c983c952c1fac8", "name": "Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens", "authors": [{"id": 192985, "fullname": "Zeyuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192985?format=json", "institution": "University of Massachusetts at Amherst"}, {"id": 152242, "fullname": "Xueyang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152242?format=json", "institution": "University of Massachusetts at Amherst"}, {"id": 192986, "fullname": "Delin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192986?format=json", "institution": "University of Hongkong"}, {"id": 192987, "fullname": "Maohao Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192987?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 75997, "fullname": "Chuang Gan", "url": "http://cvpr.thecvf.com/api/miniconf/users/75997?format=json", "institution": "UMass Amherst/MIT-IBM Watson AI Lab"}], "abstract": "Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning, limiting performance on tasks that demand visual imagination. Recent attempts train VLMs to render explicit images, but the heavy image-generation pre-training often hinders the reasoning ability. Inspired by the way humans reason with mental imagery\u2014the internal construction and manipulation of visual cues\u2014we investigate whether VLMs can reason through interleaved multimodal trajectories without producing explicit images. To this end, we present a Machine Mental Imagery framework, dubbed as \u201cMirage\u201d, which augments VLM decoding with latent visual tokens alongside ordinary text. Concretely, whenever the model chooses to \u201cthink visually\u201d, it recasts its hidden states as next tokens, thereby continuing a multimodal trajectory without generating pixel-level images. Begin by supervising the latent tokens through distillation from ground-truth image embeddings, we then switch to text-only supervision to make the latent trajectory align tightly with the task objective. A subsequent reinforcement learning stage further enhances the multimodal reasoning capability. Experiments on diverse benchmarks demonstrate that \\Model unlocks stronger multimodal reasoning without explicit image generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39854", "url": null, "sourceid": 33455, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37833, "uid": "9ed1f55ac4f3b402b1d08b26870c34a6", "name": "Spatiotemporal Pyramid Flow Matching for Climate Emulation", "authors": [{"id": 182460, "fullname": "Jeremy Irvin", "url": "http://cvpr.thecvf.com/api/miniconf/users/182460?format=json", "institution": "Stanford University"}, {"id": 188359, "fullname": "Jiaqi Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/188359?format=json", "institution": "Computer Science Department, Stanford University"}, {"id": 188360, "fullname": "Zikui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188360?format=json", "institution": "Stanford University"}, {"id": 188361, "fullname": "Abdulaziz Alharbi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188361?format=json", "institution": "Stanford University"}, {"id": 188362, "fullname": "Yufei Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188362?format=json", "institution": "Stanford University"}, {"id": 188363, "fullname": "Nomin-Erdene Bayarsaikhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188363?format=json", "institution": "Stanford University"}, {"id": 188364, "fullname": "Daniele Visioni", "url": "http://cvpr.thecvf.com/api/miniconf/users/188364?format=json", "institution": "Cornell University"}, {"id": 188365, "fullname": "Andrew Y. Ng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188365?format=json", "institution": "Stanford University"}, {"id": 188366, "fullname": "Duncan Watson-Parris", "url": "http://cvpr.thecvf.com/api/miniconf/users/188366?format=json", "institution": "University of California, San Diego"}], "abstract": "Generative models have the potential to transform the way we emulate Earth\u2019s changing climate. Previous generative approaches rely on weather-scale autoregression for climate emulation, but this is inherently slow for long climate horizons and has yet to demonstrate stable rollouts under nonstationary forcings. Here, we introduce Spatiotemporal Pyramid Flows (SPF), a new class of flow matching approaches that model data hierarchically across spatial and temporal scales. Inspired by cascaded video models, SPF partitions the generative trajectory into a spatiotemporal pyramid, progressively increasing spatial resolution to reduce computation and coupling each stage with an associated timescale to enable direct sampling at any temporal level in the pyramid. This design, together with conditioning each stage on prescribed physical forcings (e.g., greenhouse gases or aerosols), enables efficient, parallel climate emulation at multiple timescales. On ClimateBench, SPF outperforms strong flow matching baselines and pre-trained models at yearly and monthly timescales while offering fast sampling, especially at coarser temporal levels. To scale SPF, we curate ClimateSuite, the largest collection of Earth system simulations to date, comprising over 33,000 simulation-years across ten climate models and the first dataset to include simulations of climate interventions. We find that the scaled SPF model demonstrates good generalization to held-out scenarios across climate models. Together, SPF and ClimateSuite provide a foundation for accurate, efficient, probabilistic climate emulation across temporal scales and realistic future scenarios. Data and code is publicly available at [anonymized for review].", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37833", "url": null, "sourceid": 45455, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36915, "uid": "db59312b9eb5439bf9e3c66374aed1f7", "name": "JoPPO: Hierarchical Photography Assessment via Contrastive Joint Conditional Probabilistic Reinforcement Learning", "authors": [{"id": 183399, "fullname": "Yifan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183399?format=json", "institution": "Shanghai Normal University"}, {"id": 186211, "fullname": "Juntuo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186211?format=json", "institution": "Brown University"}, {"id": 166982, "fullname": "Yuming Qiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/166982?format=json", "institution": "OPPO Research Institute"}, {"id": 166984, "fullname": "Xudong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/166984?format=json", "institution": "OPPO Research Institute"}, {"id": 186212, "fullname": "Chunyang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186212?format=json", "institution": "Guangdong OPPO Mobile Telecommunications Corp.,Ltd."}, {"id": 186213, "fullname": "Yan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186213?format=json", "institution": "Shanghai Normal University"}, {"id": 186214, "fullname": "Xiao Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186214?format=json", "institution": "Shanghai normal university"}, {"id": 186215, "fullname": "Liang Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186215?format=json", "institution": null}, {"id": 162196, "fullname": "Dan Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/162196?format=json", "institution": "OPPO"}], "abstract": "The value of a photograph lies not in what it contains, but in what it is about. -John SzarkowskiWith the advancement of Vision-Language Models (VLMs), employing VLM-as-a-Judge for visual evaluation has become a widely adopted metric in vision research. However, existing VLM-as-a-Judge approaches suffer from biased scoring outcomes with low discrimination and lack the capacity for unified multi-attribute compositional assessment. To address these limitations, we propose a novel training paradigm, termed JoPPO (**Jo**int **P**robabilistic **P**olicy **O**ptimization) that enables the VLMs to learn ranking under compositional assessment constraints. We evaluate the JoPPO on image aesthetics as a testbed, a task requiring nuanced understanding of multiple attributes including composition, lighting, color and geometry. Training follows two stages: (1) Supervised Fine-Tuning (SFT) on synthetic composition dataset provided by automated data generation pipeline to instill compositional priors; and (2) Contrastive Joint Conditional Probabilistic Reinforcement Learning: building upon the GRPO algorithm, we introduce JoPPO, which compute reward based on the expected win rate of total scores derived from the conditional distribution of fine-grained attribute scores within batches, effectively enhancing the model\u2019s discriminative ability in composite evaluation. Across standard aesthetic benchmarks, our method achieves consistent improvements in ranking consistency, demonstrating strong zero-shot generalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36915", "url": null, "sourceid": 38469, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39519, "uid": "860b989a383593396648518c761c64a5", "name": "CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation", "authors": [{"id": 192254, "fullname": "Pablo Messina", "url": "http://cvpr.thecvf.com/api/miniconf/users/192254?format=json", "institution": "Pontificia Universidad Catolica de Chile"}, {"id": 76856, "fullname": "Andr\u00e9s Villa", "url": "http://cvpr.thecvf.com/api/miniconf/users/76856?format=json", "institution": "King Abdullah University of Science and Technology"}, {"id": 87404, "fullname": "Juan Le\u00f3n Alc\u00e1zar", "url": "http://cvpr.thecvf.com/api/miniconf/users/87404?format=json", "institution": "King Abdullah University of Science and Technology"}, {"id": 151119, "fullname": "Karen Sanchez", "url": "http://cvpr.thecvf.com/api/miniconf/users/151119?format=json", "institution": "King Abdullah University of Science and Technology, Saudi Arabia"}, {"id": 76727, "fullname": "Carlos Hinojosa", "url": "http://cvpr.thecvf.com/api/miniconf/users/76727?format=json", "institution": "KAUST"}, {"id": 192255, "fullname": "Denis Parra", "url": "http://cvpr.thecvf.com/api/miniconf/users/192255?format=json", "institution": "Pontificia Universidad Catolica de Chile"}, {"id": 87388, "fullname": "Alvaro Soto", "url": "http://cvpr.thecvf.com/api/miniconf/users/87388?format=json", "institution": "Universidad Cat\u00f3lica de Chile"}, {"id": 75441, "fullname": "Bernard Ghanem", "url": "http://cvpr.thecvf.com/api/miniconf/users/75441?format=json", "institution": "KAUST"}], "abstract": "Medical vision\u2013language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present \"CURE\", an error-aware curriculum learning framework that improves grounding and report quality without any additional data. CURE tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy-grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance emphasizing on harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.37 IoU, boosts report quality by +0.188 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data-efficient framework that enhances both grounding accuracy and report reliability.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39519", "url": null, "sourceid": 45682, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40373?format=json"], "related_events_ids": [40373]}, {"id": 39255, "uid": "6bd64b8669cf79ea63ff5a68f1baf5e3", "name": "Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events", "authors": [{"id": 182110, "fullname": "Xiaoxing You", "url": "http://cvpr.thecvf.com/api/miniconf/users/182110?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 191721, "fullname": "Qiang Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191721?format=json", "institution": "Harbin Institute of Technology (Shenzhen)"}, {"id": 191722, "fullname": "Lingyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191722?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 89609, "fullname": "Xiaojun Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89609?format=json", "institution": "University of Technology Sydney"}, {"id": 88598, "fullname": "Jun Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88598?format=json", "institution": "Hangzhou Dianzi University"}], "abstract": "Multimodal Summarization (MMS) aims to generate concise textual summaries by understanding and integrating key information across multiple modalities such as videos, transcripts, and images. However, existing approaches still suffer from three main challenges: (1) reliance on domain-specific supervision, (2) implicit fusion with weak cross-modal grounding, and (3) flat temporal modeling without event transitions. To address these issues, we introduce **CoE**, a training-free MMS framework that performs structured reasoning through a **Chain-of-Events** guided by a Hierarchical Event Graph (HEG). The HEG encodes textual semantics into a structured prior that serves as a global scaffold for cross-modal reasoning.Guided by this hierarchy, **CoE** first performs cross-modal grounding to localize key visual cues, followed by event-evolution reasoning to capture temporal dependencies and causal transitions across the video.A lightweight style adaptation module further refines the generated summaries to match domain-specific linguistic conventions. Extensive experiments on eight diverse datasets demonstrate that **CoE** consistently outperforms state-of-the-art video CoT baselines, achieving average gains of **+3.04 ROUGE**, **+9.51 CIDEr**, and **+1.88 BERTScore**, highlighting its robustness, interpretability, and superior cross-domain generalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39255", "url": null, "sourceid": 44976, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39562, "uid": "ab3c3351517e17a8b2561ee2227dae11", "name": "Physics-Guided Multistep Deformation Reversal for Ancient Bamboo Slip Restoration", "authors": [{"id": 192352, "fullname": "Qianqian Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192352?format=json", "institution": "Wuhan University"}, {"id": 192353, "fullname": "Jinchi Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192353?format=json", "institution": "Wuhan University"}, {"id": 192354, "fullname": "Xiaolu Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192354?format=json", "institution": "Wuhan University"}, {"id": 185802, "fullname": "Yongchao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185802?format=json", "institution": "Wuhan University"}], "abstract": "Bamboo slips are essential media for recording ancient East Asian civilizations, but excavated slips often suffer severe deformation due to dehydration and stress effects, creating substantial challenges for restoration. Traditional manual restoration is time-consuming and risks damage, while existing generative models struggle with the complex non-linear deformations in bamboo materials.We propose a novel framework for inverse restoration of deformed bamboo slips that provides a progressive physical deformation modeling with stepwise inverse displacement prediction. Our approach establishes a computable mathematical model of deformation based on wood fiber microstructure and stress-diffusion coupling effects, enabling the forward process to simulate physically plausible deformation trajectories as a deterministic, physics-driven progressive evolution. The inverse process transforms from predicting abstract noise to learning physically meaningful inverse displacement fields that progressively restore deformations.Experimental results show substantial gains in restoration fidelity while preserving delicate textual features, enabling the reliable correction of complex non-linear deformations that defeat traditional techniques. By integrating physical insights into bamboo material behavior with progressive restoration modeling, this work establishes a new paradigm for digital archaeological restoration\u2014one that holds significant potential to transform how deformed cultural relics are reconstructed and studied.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39562", "url": null, "sourceid": 34479, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38265, "uid": "30aaabbec8b8e1961770c48ee27036b9", "name": "Spike-driven Discrete Aggregation for Event-based Object Detection", "authors": [{"id": 181080, "fullname": "Huaning Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181080?format=json", "institution": "zhejiang University"}, {"id": 189454, "fullname": "Ziming Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189454?format=json", "institution": "Zhejiang University"}, {"id": 189455, "fullname": "Runhao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189455?format=json", "institution": "Zhejiang University"}, {"id": 189456, "fullname": "Rui Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189456?format=json", "institution": null}, {"id": 87278, "fullname": "Huajin Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87278?format=json", "institution": "Zhejiang University"}], "abstract": "With their high dynamic range and temporal resolution, event cameras are well-suited for object detection, especially under motion blur and extreme illumination.Recent state-of-the-art works for event-based object detection primarily focus on the high-level design of backbones. However, developing effective event representations is equally crucial, as it bridges asynchronous event streams with the dense tensors required by detection networks. Most existing aggregation strategies for event representation continuously accumulate all events within sampled intervals without selective filtering, inevitably introducing uninformative events that degrade detection accuracy.To address this limitation, we introduce a novel perspective, termed Discrete Aggregation, which adaptively and discretely selects informative events for differentiable aggregation. We realize this through the Spiking Discrete Aggregation (SDA) module, which is inspired by the threshold-based spike firing mechanism in Spiking Neural Networks (SNNs) and implemented using gated recurrent spiking neurons.Additionally, we introduce the Multi-Timescale Fusion (MTF) method which leverages coarse-grained temporal features from continuous event streams to further enhance the representation capability of SDA. Experimental results on neuromorphic datasets demonstrate that our method achieves state-of-the-art performance among all fully spiking architectures while using fewer parameters, reaching 43.4\\% $mAP_{50:95}$   on Gen1 (+ 4.5\\% over prior art). Moreover, our method exhibits superior robustness under noisy conditions and shows strong compatibility with non-spiking models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38265", "url": null, "sourceid": 36035, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36780, "uid": "8f7062bc9c3fa3b8fd9a01211c2e7792", "name": "Rethinking SNN Online Training and Deployment: Gradient-Coherent Learning via Hybrid-Driven LIF Model", "authors": [{"id": 181109, "fullname": "Zecheng Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/181109?format=json", "institution": "Peking University"}, {"id": 185857, "fullname": "Yifan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185857?format=json", "institution": "Peking University"}, {"id": 185858, "fullname": "Zijie Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185858?format=json", "institution": "Peking University"}, {"id": 158283, "fullname": "Wenxuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/158283?format=json", "institution": "Peking University"}, {"id": 185859, "fullname": "Yuanhong Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185859?format=json", "institution": "Peking University"}, {"id": 89151, "fullname": "Zhaofei Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89151?format=json", "institution": "Peking University"}, {"id": 88774, "fullname": "Tiejun Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88774?format=json", "institution": "Peking University"}], "abstract": "Spiking Neural Networks (SNNs) are considered to have enormous potential in the future development of Artificial Intelligence due to their brain-inspired and energy-efficient properties. Compared to vanilla Spatial-Temporal Back-propagation (STBP) training methods, online training can effectively avoid the risk of GPU memory explosion. However, current online learning frameworks cannot tackle the gradient discrepancy problem between the forward and backward process, merely aiming to optimize the GPU memory, resulting in no performance advantages compared to the STBP-based models in the inference stage. To address the aforementioned challenges, we propose Hybrid-Driven Leaky Integrate-and-Fire (HD-LIF) model family for efficient online learning, which respectively adopt different spiking calculation mechanism in the upper-region and lower-region of the firing threshold. We theoretically point out that our learning framework can effectively separate temporal gradients and address the misalignment problem of surrogate gradients, as well as achieving full-stage optimization towards learning precision, memory footprint and power consumption. Experimental results have demonstrated that our scheme is enable to achieve state-of-the-art performance for multiple evaluation metrics, breaking through the traditional paradigm of SNN online training and deployment.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36780", "url": null, "sourceid": 43820, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38702, "uid": "aacdba4f27d947d222396a6eb7d7a7af", "name": "From Contrast to Consistency: Rethinking Event-based Continuous-Time Optical Flow Estimation", "authors": [{"id": 160450, "fullname": "rui hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/160450?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 190490, "fullname": "Song Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190490?format=json", "institution": "Xidian University"}, {"id": 190491, "fullname": "Wen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190491?format=json", "institution": "Xidian University"}, {"id": 90655, "fullname": "Jinjian Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90655?format=json", "institution": "Xidian University"}], "abstract": "Estimating continuous optical flow is a fundamental yet challenging problem in dynamic visual perception. Event-based cameras, with microsecond latency and high dynamic range, capture brightness changes asynchronously, offering a unique opportunity to model motion with fine temporal precision. However, the scarcity of dense annotations restricts the effectiveness of supervised learning, while contrast maximization (CM) frameworks, focused on sharpening the Image of Warped Events (IWE), often neglect temporal continuity and structural coherence, leading to distorted trajectories under complex motion.To overcome these challenges, we propose a hybrid-supervised framework for continuous-time optical flow estimation, grounded in the principle of Spatio-temporal Structural Consistency (STSC). This paradigm jointly enforces local structural stability and trajectory continuity, ensuring physically coherent motion across time. To further enhance representation and robustness, we design a bidirectionally complementary multi-scale architecture and employ a curriculum-guided hybrid training strategy, enabling a smooth transition from supervised point constraints to self-supervised manifold regularization.Comprehensive experiments across multiple benchmarks show that our method achieves state-of-the-art performance in both continuous-time and standard optical flow estimation, demonstrating the effectiveness of the proposed learning paradigm.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38702", "url": null, "sourceid": 37747, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37418, "uid": "cac2d36868fd45ce175dfe731aa5bf6b", "name": "SketchVL: Policy Optimization via Fine-Grained Credit Assignment for Chart Understanding and More", "authors": [{"id": 187399, "fullname": "Muye Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187399?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 129237, "fullname": "Lingling Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129237?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 187400, "fullname": "Yifei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187400?format=json", "institution": "Xi&#x27;an Jiaotong University; Zhongguancun Academy"}, {"id": 185530, "fullname": "Yaqiang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185530?format=json", "institution": "Lenovo"}, {"id": 129235, "fullname": "Jun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129235?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "Charts are high-density visual carriers of complex data and medium for information extraction and analysis. Due to the need for precise and complex visual reasoning, automated chart understanding poses a significant challenge to existing Multimodal Large Language Models (MLLMs). Many MLLMs trained with reinforcement learning (RL) face the challenge of credit assignment. Their advantage estimation, typically performed at the trajectory level, cannot distinguish between correct and incorrect reasoning steps within a single generated response. To address this limitation, we introduce SketchVL, a novel MLLM that optimized with FinePO, a new RL algorithm designed for fine-grained credit assignment within each trajectory. SketchVL's methodology involves drawing its intermediate reasoning steps as markers on the image and feeding the annotated image back to itself, creating a robust, multi-step reasoning process. During training, the FinePO algorithm leverages a Fine-grained Process Reward Model (FinePRM) to score each drawing action within a trajectory, thereby precisely assigning credit for each step. This mechanism allows FinePO to more strongly reward correct tokens when a trajectory is globally successful, and more heavily penalize incorrect tokens when the trajectory is globally suboptimal, thus achieving fine-grained reinforcement signals. Experiments show that SketchVL learns to align its step-level behavior with the FinePRM, achieving an average performance gain of 7.23\\% over its base model across chart datasets, natural image datasets, and mathematics, providing a promising new direction for training powerful reasoning models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37418", "url": null, "sourceid": 40139, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37475, "uid": "65f4d68a152bcdfb558e48afafe24a0a", "name": "IMU-HOI: A Symbiotic Framework for Coherent Human-Object Interaction and Motion Capture via Contact-Conscious Inertial Fusion", "authors": [{"id": 180548, "fullname": "Lizhou Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/180548?format=json", "institution": "\u4e0a\u6d77\u4ea4\u901a\u5927\u5b66"}, {"id": 106625, "fullname": "Songpengcheng Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/106625?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 187542, "fullname": "Zengyuan Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/187542?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 187543, "fullname": "LanSun LanSun", "url": "http://cvpr.thecvf.com/api/miniconf/users/187543?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 130686, "fullname": "Jiarui Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130686?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 105373, "fullname": "Ling Pei", "url": "http://cvpr.thecvf.com/api/miniconf/users/105373?format=json", "institution": "Shanghai Jiao Tong Univeristy"}], "abstract": "Capturing full-body human motion with object interactions is crucial for AR/VR and robotics applications, yet it remains challenging for conventional vision-based methods due to occlusions and constrained capture volumes. Inertial measurement units (IMUs) offer a compelling alternative without line-of-sight requirements, but existing IMU-based motion capture assumes an isolated human and ignores object contacts and dynamics. To bridge this gap, we present IMU-HOI, a novel framework that jointly recovers full-body human pose and 6-DoF object trajectory from sparse IMUs on the body and object, explicitly modeling human-object Interaction.Our approach first infers probabilistic hand\u2013object contacts directly from IMU streams and uses them as a high-level signal to route between kinematic and inertial reasoning. These contact cues drive a three-stage fusion pipeline that refines human pose and root translation, and fuses hand-based forward kinematics with object-IMU integration for object motion, yielding coherent, drift-resilient trajectories for both human and object.  Experiments on challenging human-object interaction scenarios demonstrate substantial accuracy gains over prior inertial motion capture methods. Moreover, IMU-HOI can be plugged into existing sparse-IMU mocap backbones with minimal changes, effectively extending the scope of purely inertial motion capture from isolated humans to full human\u2013object interaction and joint motion estimation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37475", "url": null, "sourceid": 40509, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37690, "uid": "8cb984778464a4dabc3bdccdb3c60bde", "name": "SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models", "authors": [{"id": 175536, "fullname": "Ruosen Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/175536?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 188019, "fullname": "Zhikang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188019?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 188020, "fullname": "Jialei Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188020?format=json", "institution": null}, {"id": 184898, "fullname": "Jiahao Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184898?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 188021, "fullname": "Dong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188021?format=json", "institution": "The University of Hong Kong"}, {"id": 188022, "fullname": "Lingyun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188022?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 188023, "fullname": "Weijian Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/188023?format=json", "institution": null}, {"id": 188024, "fullname": "Zizhuang Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/188024?format=json", "institution": null}], "abstract": "Large vision-language models (VLMs) show strong multimodal understanding but still struggle with 3D spatial reasoning, such as distance estimation, size comparison, and cross-view consistency. Existing 3D-aware methods either depend on auxiliary 3D information or enhance RGB-only VLMs with geometry encoders through shallow feature fusion.We propose SpaceMind, a multimodal large language model explicitly designed for spatial reasoning solely from RGB inputs. The model adopts a dual-encoder architecture, integrating VGGT as a spatial understanding encoder and InternViT as a 2D visual encoder.The key idea is to treat the camera representation as an active guiding modality rather than passive metadata. Specifically, SpaceMind introduces a lightweight Camera-Guided Modality Fusion module before the language model to replace shallow fusion. It applies camera-conditioned biasing to spatial tokens, assigns query-independent weights reflecting their geometric importance, and uses the camera embedding to gate the fused representation.Empirically, SpaceMind establishes new state-of-the-art results on VSI-Bench, SQA3D, and SPBench, surpassing both open and proprietary systems on VSI-Bench and SPBench by large margins and achieving state-of-the-art performance on SQA3D.These results demonstrate that camera-guided modality fusion is an effective and practical inductive bias for equipping VLMs with genuinely spatially grounded intelligence.We will release code and model checkpoints to support future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37690", "url": null, "sourceid": 31267, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39349, "uid": "430cbcb0d9dcadb1ca659196202990d4", "name": "FisherPoser: Human Motion Estimation from Sparse Observations with Hierarchical Region-Wise Fisher-Matrix Uncertainty Modeling", "authors": [{"id": 106625, "fullname": "Songpengcheng Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/106625?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 191901, "fullname": "Qingyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191901?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 86439, "fullname": "Zhuo Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/86439?format=json", "institution": "ByteDance"}, {"id": 130686, "fullname": "Jiarui Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130686?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 187542, "fullname": "Zengyuan Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/187542?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 130721, "fullname": "Qi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130721?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 105373, "fullname": "Ling Pei", "url": "http://cvpr.thecvf.com/api/miniconf/users/105373?format=json", "institution": "Shanghai Jiao Tong Univeristy"}], "abstract": "Full-body motion estimation from sparse VR observations is an inherently under-constrained problem, with only three 6-DoF trackers (HMD and controllers) available to infer a full skeletal pose. To address this ambiguity, we introduce a probabilistic framework that models joint orientations as distributions on SO(3) using the Matrix Fisher distribution. Instead of predicting a single deterministic pose, our network outputs a distribution for each joint, whose mode and concentration directly quantify prediction uncertainty on the rotation manifold. This enables likelihood-based training and principled uncertainty propagation. At the core of our model is a causal Transformer encoder that fuses sparse observations with motion history. We further propose region-wise tokens for the torso, arms, and legs, obtained via attention pooling over local joint features and semantic VR anchors. These tokens guide compact, per-region Fisher regression. To ensure kinematic coherence efficiently, we employ a limb refinement module, where each child joint's Fisher parameters are conditioned on its parent's distribution and the regional context, propagating pose and uncertainty hierarchically. Extensive experiments on standard sparse-VR benchmarks show that our approach achieves state-of-the-art performance, while providing well-calibrated joint-wise uncertainty.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39349", "url": null, "sourceid": 43566, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39410, "uid": "bb9bd43af0e81257a5d156a5411ebce7", "name": "RC-NF: Robot-Conditioned Normalizing Flow for Real-Time Anomaly Detection in Robotic Manipulation", "authors": [{"id": 192016, "fullname": "Shijie Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192016?format=json", "institution": "Fudan University"}, {"id": 155269, "fullname": "Bin Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155269?format=json", "institution": "Singapore Management University"}, {"id": 192017, "fullname": "Jiarui Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192017?format=json", "institution": "Fudan University"}, {"id": 192018, "fullname": "Xiangyu Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192018?format=json", "institution": "Fudan University"}, {"id": 89124, "fullname": "Jingjing Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89124?format=json", "institution": "Fudan University"}, {"id": 86826, "fullname": "Yu-Gang Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86826?format=json", "institution": "Fudan University"}], "abstract": "Recent advances in Vision-Language-Action (VLA) models have enabled robots to execute increasingly complex tasks. However, VLA models trained through imitation learning struggle to operate reliably in dynamic environments and often fail under Out-of-Distribution (OOD) conditions. To address this issue, we propose Robot-Conditioned Normalizing Flow(RC-NF), a real-time monitoring model for robotic anomaly detection and intervention that ensures the robot's state and the object's motion trajectory align with the task. RC-NF decouples the processing of task-aware robot and object states within the normalizing flow. It requires only positive samples for unsupervised training and calculates accurate robotic anomaly scores during inference through the probability density function. We further present LIBERO-Anomaly-10, a benchmark comprising three categories of robotic anomalies for simulation evaluation. RC-NF achieves state-of-the-art performance across all anomaly types compared to previous methods in monitoring robotic tasks. Real-world experiments demonstrate that RC-NF operates as a plug-and-play module for VLA models (e.g., Pi0) by triggering task replanning for task-level OOD and task rollback for state-level OOD within 100 ms. These results have demonstrated that our RC-NF noticeably enhances the robustness and adaptability of VLA-based robotic systems in dynamic environments.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39410", "url": null, "sourceid": 38746, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36924, "uid": "91cd93e0fc08ec1e3067cb9c49e8f92d", "name": "COPE: Consistent Occlusion and Prompt Enhancement Network for Occluded Person Re-identification", "authors": [{"id": 144136, "fullname": "Sun Siyi", "url": "http://cvpr.thecvf.com/api/miniconf/users/144136?format=json", "institution": "Xiamen University"}, {"id": 186230, "fullname": "Jinliang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186230?format=json", "institution": "Xiamen University"}, {"id": 186231, "fullname": "Juanjuan Weng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186231?format=json", "institution": "Jinan University"}, {"id": 186232, "fullname": "Zhihui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186232?format=json", "institution": null}, {"id": 157183, "fullname": "Shaozi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/157183?format=json", "institution": "Xiamen University"}, {"id": 157182, "fullname": "Zhiming Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/157182?format=json", "institution": "Xiamen University"}], "abstract": "Occlusion presents two critical challenges for person re-identification (Re-ID): feature interference and information loss. While existing efforts have explored occlusion-aware data augmentation and feature reconstruction to mitigate these issues, the former often fails to address erroneous matches caused by similar occlusion patterns and background distractions, whereas the latter typically introduces significant computational overhead. To overcome these limitations, we propose a Consistent Occlusion and Prompt Enhancement (COPE) network. COPE incorporates a Cross-Identity Consistent Occlusion (CICO) module that applies identical occlusions across different identities and encourages feature similarity in the same occluded regions across different identities to reduce occlusion feature interference. A Prompt Background Filling (PBF) module leverages vision-language alignment to generate foreground heatmaps and performs random background filling, enhancing feature robustness under varying backgrounds. Additionally, a lightweight Prompt Similarity Scoring (PSS) module refines retrieval similarity by utilizing prompt-guided reliability scores. Extensive experiments on both occluded and holistic Re-ID benchmarks demonstrate that COPE consistently outperforms existing methods. Notably, it achieves 82.4% Rank-1 accuracy and 76.4% mAP on the challenging Occluded-Duke dataset.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36924", "url": null, "sourceid": 43997, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39924, "uid": "75d08d5f3498e7bd52c3c4f50672f121", "name": "LiDeRe: A Lightweight Readout for Fast and Data-Efficient Dense Prediction", "authors": [{"id": 186468, "fullname": "Timo L\u00fcddecke", "url": "http://cvpr.thecvf.com/api/miniconf/users/186468?format=json", "institution": "University of G\u00f6ttingen"}, {"id": 186467, "fullname": "Jan Frederik Meier", "url": "http://cvpr.thecvf.com/api/miniconf/users/186467?format=json", "institution": "Georg-August Universit\u00e4t G\u00f6ttingen"}, {"id": 193126, "fullname": "Jan van Delden", "url": "http://cvpr.thecvf.com/api/miniconf/users/193126?format=json", "institution": "University of G\u00f6ttingen"}, {"id": 186487, "fullname": "Alexander Ecker", "url": "http://cvpr.thecvf.com/api/miniconf/users/186487?format=json", "institution": "University of G\u00f6ttingen"}], "abstract": "Parameter-efficient fine-tuning (PEFT) methods have recently gained popularity for applying deep neural networks on small datasets as they reduce overfitting, simplify deployment, and enable fast training. We demonstrate that for dense image prediction tasks, a well-designed and lightweight dense readout on top of a frozen large backbone can surpass state-of-the-art PEFT methods in both efficiency and accuracy. Our parameter-efficient readout module combines interpolation and attention for fine-grained dense prediction. It integrates seamlessly with a wide range of pretrained vision backbones such as DINOv3. We achieve competitive or superior performance in semantic segmentation, object detection, pose estimation and semantic contour prediction, offering an efficient alternative to current PEFT techniques.  Code: https://to.be.released", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39924", "url": null, "sourceid": 43770, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40033, "uid": "ae23fc20a0346df4e0b9594aefb7c26d", "name": "Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection", "authors": [{"id": 179951, "fullname": "SA ZHU", "url": "http://cvpr.thecvf.com/api/miniconf/users/179951?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 89695, "fullname": "Wanqian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89695?format=json", "institution": "IIE, CAS"}, {"id": 193346, "fullname": "Lin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193346?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 193347, "fullname": "Xiaohua Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193347?format=json", "institution": "Tsinghua University"}, {"id": 193348, "fullname": "Chenxu Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/193348?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 193349, "fullname": "Jinchao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193349?format=json", "institution": "Instituion of Information Engeering, Chinese Academy of Sciences,"}, {"id": 89665, "fullname": "Bo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/89665?format=json", "institution": "Chinese Academy of Sciences"}], "abstract": "Open-Vocabulary Temporal Action Detection (OV-TAD) aims to classify and localize action segments in untrimmed videos for unseen categories. Previous methods rely solely on global alignment between label-level semantics and visual features, which is insufficient to transfer temporal consistent visual knowledge from seen to unseen classes. To address this, we propose a Phase-wise Decomposition and Alignment (PDA) framework, which enables fine-grained action pattern learning for effective prior knowledge transfer. Specifically, we first introduce the CoT-Prompting Semantic Decomposition (CSD) module, which leverages the chain-of-thought (CoT) reasoning ability of large language models to automatically decompose action labels into coherent phase-level descriptions, emulating human cognitive processes. Then, Text-infused Foreground Filtering (TIF) module is introduced to adaptively filter action-relevant segments for each phase leveraging phase-wise semantic cues, producing semantically aligned visual representations. Furthermore, we propose the Adaptive Phase-wise Alignment (APA) module to perform phase-level visual\u2013textual matching, and adaptively aggregates alignment results across phases for final prediction. This adaptive phase-wise alignment facilitates the capture of transferable action patterns and significantly enhances generalization to unseen actions. Extensive experiments on two OV-TAD benchmarks demonstrated the superiority of the proposed method.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40033", "url": null, "sourceid": 39215, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39942, "uid": "761635701a1b57e55385d45e8e868498", "name": "Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation", "authors": [{"id": 183621, "fullname": "Bowen Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/183621?format=json", "institution": "USTC"}, {"id": 184536, "fullname": "Zheng-Peng Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184536?format=json", "institution": "Nankai University"}, {"id": 180775, "fullname": "Qixin Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/180775?format=json", "institution": "Tencent"}, {"id": 130020, "fullname": "Wenjing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130020?format=json", "institution": "Peking University"}, {"id": 157066, "fullname": "Hao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157066?format=json", "institution": "WeChat Vision, Tencent Inc."}, {"id": 76506, "fullname": "Chun-Le Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/76506?format=json", "institution": "Nankai University"}, {"id": 73507, "fullname": "Chongyi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/73507?format=json", "institution": "Nankai University"}, {"id": 86641, "fullname": "Chen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86641?format=json", "institution": "WeChat, Tencent"}, {"id": 185164, "fullname": "Jing LYU", "url": "http://cvpr.thecvf.com/api/miniconf/users/185164?format=json", "institution": "Wechat Team"}], "abstract": "Generating high-fidelity human videos that match user-specified identities is important yet challenging in the field of generative AI.Existing methods often rely on an excessive number of training parameters and lack compatibility with other AIGC tools.In this paper, we propose $\\textbf{Stand-In}$, a lightweight and plug-and-play framework for identity preservation in video generation.Specifically, we introduce a conditional image branch into the pre-trained video generation model.Identity control is achieved through restricted self-attentions with conditional position mapping.Thanks to these designs, which greatly preserve the pretrained prior of the video generation model, our approach is able to outperform other full-parameter training methods in video quality and identity preservation, even with just $\\sim$1\\% additional parameters and only 2000 training pairs.Moreover, our framework can be seamlessly integrated for other tasks, such as subject-driven video generation, pose-referenced video generation, stylization, and face swapping.Code and dataset will be available to the community.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39942", "url": null, "sourceid": 42953, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37720, "uid": "184c31bbacd0d31253965ebe448448f2", "name": "Beyond the Static-World: Lifelong Learning for All-in-One Medical ImageRestoration", "authors": [{"id": 188083, "fullname": "Shihao Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188083?format=json", "institution": "Tianjin University"}, {"id": 153901, "fullname": "Hongying Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153901?format=json", "institution": "Tianjin University"}, {"id": 106136, "fullname": "Fanhua Shang", "url": "http://cvpr.thecvf.com/api/miniconf/users/106136?format=json", "institution": "Tianjin University"}, {"id": 153902, "fullname": "Liang Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153902?format=json", "institution": "Tianjin University"}, {"id": 170573, "fullname": "Jingjing Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/170573?format=json", "institution": "University of Bristol"}], "abstract": "All-in-One Medical Image Restoration (MedIR) models offer a promising path towards generalized medical imaging intelligence but face two critical spatiotemporal challenges: 1) Spatial Modality Interference, where conflicting gradients from diverse modalities (e.g., MRI, CT, PET) degrade performance; and 2) a Temporal Static-World Assumption that ignores the continual data streams in real-world clinical settings, leading to catastrophic forgetting. To address this dual challenge, we propose Resilient On-the-fly Medical Enhancement(ROME), a novel lifelong learning framework governed by a \"Disentangle-Optimize-Consolidate\" paradigm. ROME first resolves the foundational modality conflict via the Modality-Invariant Disentanglement via Adversarial Balancing(MIDAB) module. It establishes a strategic \"adversarial balance\" between a \"content preservation force\" and a \"modality erasure force\" to optimize a disentangled, unified feature manifold. Building on this stable foundation, the Adaptive Feature Consolidation(AFC) module combats forgetting. AFC dynamically locates an optimal feature consolidation point via a prediction network, enforced by a novel Diversity Loss to ensure robust continuous learning. Experiments demonstrate that ROME not only achieves SOTA performance in static settings but also exhibits superior resilience in rigorous domain-incremental benchmarks, reducing the average catastrophic performance degradation by over 10%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37720", "url": null, "sourceid": 43899, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39909, "uid": "29c08ec2725d8cc323fdd68147a40703", "name": "GSV2X: Geometry-Aware Uncertainty Modeling and Orthogonal Fusion for Robust Roadside Perception", "authors": [{"id": 178455, "fullname": "jianqiang xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/178455?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 181698, "fullname": "Gensheng Pei", "url": "http://cvpr.thecvf.com/api/miniconf/users/181698?format=json", "institution": "Sungkyunkwan University"}, {"id": 131476, "fullname": "\u5218\u534e\u5cf0 Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131476?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 129605, "fullname": "Yazhou Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/129605?format=json", "institution": "Nanjing University of Science and Technology"}], "abstract": "Reliable 3D perception from multi-view roadside sensors hinges on the robust fusion of camera and LiDAR data, a task complicated by geometric misalignments and sensor calibration errors. This paper presents GSV2X, a fusion framework that tackles these challenges through two core contributions. First, to achieve robustness against spatial uncertainty, we lift 2D image features into a unified Bird's-Eye-View (BEV) space by representing them as 3D Gaussian distributions. By incorporating learnable perturbations guided by camera geometry, our model explicitly accounts for potential calibration inaccuracies. Second, to maximize the synergy between modalities, we propose a new orthogonal fusion module. This module employs constrained attention to enforce orthogonality between camera and LiDAR features, effectively disentangling redundant information and promoting the learning of complementary representations. Extensive experiments on the challenging RCooper dataset demonstrate that GSV2X sets a new state-of-the-art in multi-view roadside perception and exhibits remarkable robustness in complex, real-world scenarios.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39909", "url": "https://github.com/Kilitoku/GSV2X", "sourceid": 40245, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36509, "uid": "8c9a375b5a9103b8e637ddfe81251902", "name": "Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision", "authors": [{"id": 127999, "fullname": "Kartik Kuckreja", "url": "http://cvpr.thecvf.com/api/miniconf/users/127999?format=json", "institution": "BITS Pilani, Birla Institute of Technology and Science"}, {"id": 185237, "fullname": "Parul Gupta", "url": "http://cvpr.thecvf.com/api/miniconf/users/185237?format=json", "institution": "Adobe Systems; Monash University"}, {"id": 76698, "fullname": "Muhammad Haris Khan", "url": "http://cvpr.thecvf.com/api/miniconf/users/76698?format=json", "institution": "Mohamed Bin Zayed University of Artificial Intelligence"}, {"id": 88074, "fullname": "Abhinav Dhall", "url": "http://cvpr.thecvf.com/api/miniconf/users/88074?format=json", "institution": "Monash University"}], "abstract": "Deepfake detection models often generate natural-language explanations, yet their reasoning is frequently ungrounded in visual evidence, limiting reliability. Existing evaluations measure classification accuracy but overlook reasoning fidelity. We propose DeepfakeJudge, a framework for scalable reasoning supervision and evaluation, that integrates an out-of-distribution benchmark containing recent generative and editing forgeries, a human-annotated subset with visual reasoning labels, and a suite of evaluation models, that specialize in evaluating reasoning rationales without the need for explicit ground truth reasoning rationales. The Judge is optimized through a bootstrapped generator\u2013evaluator process that scales human feedback into structured reasoning supervision and supports both pointwise and pairwise evaluation. On the proposed meta-evaluation benchmark, our reasoning-bootstrapped model achieves an accuracy of 96.2\\%, outperforming \\texttt{30x} larger baselines. The reasoning judge attains very high correlation with human ratings and 98.9\\% percent pairwise agreement on the human annotated meta-evaluation subset. These results establish reasoning fidelity as a quantifiable dimension of deepfake detection and demonstrate scalable supervision for interpretable deepfake reasoning. Our user study indicates that humans prefer reasonings generated by our framework 70\\% of the time, in faithfullness, groundedness and usefulness compared to  other models and datasets.  All of our datasets, models, and codebase will be open-sourced.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36509", "url": null, "sourceid": 45770, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38266, "uid": "a6835971b07994243412ed59f9c5f310", "name": "iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation", "authors": [{"id": 127874, "fullname": "ZHOUJIE FU", "url": "http://cvpr.thecvf.com/api/miniconf/users/127874?format=json", "institution": "Nanyang Technological University"}, {"id": 126775, "fullname": "Xianfang Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/126775?format=json", "institution": "Tencent PCG"}, {"id": 189457, "fullname": "jinghong lan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189457?format=json", "institution": "Fudan University"}, {"id": 189458, "fullname": "Xinyao Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189458?format=json", "institution": "Nanyang Technological University"}, {"id": 127834, "fullname": "Chen Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/127834?format=json", "institution": "Nanyang Technological University"}, {"id": 189459, "fullname": "Junyi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189459?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 189460, "fullname": "Jiacheng Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/189460?format=json", "institution": null}, {"id": 88826, "fullname": "Wei Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/88826?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 189461, "fullname": "Shiyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189461?format=json", "institution": "ShanghaiTech University"}, {"id": 181923, "fullname": "Yunuo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/181923?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 87502, "fullname": "Gang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87502?format=json", "institution": "Tencent"}, {"id": 86136, "fullname": "Guosheng Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/86136?format=json", "institution": "Nanyang Technological University"}], "abstract": "Pre-trained video models learn powerful priors for generating high-quality, temporally coherent content. While these models excel at temporal coherence, their dynamics are often constrained by the continuous nature of their training data. We hypothesize that by injecting the rich and unconstrained content diversity from image data into this coherent temporal framework, we can generate image sets that feature both natural transitions and a far more expansive dynamic range. To this end, we introduce iMontage, a unified framework designed to repurpose a powerful video model into an all-in-one image generator. The framework consumes and produces variable-length image sets, unifying a wide array of image generation and editing tasks. To achieve this, we propose an elegant and minimally invasive adaptation strategy, complemented by a tailored data curation process and training paradigm. This approach allows the model to acquire broad image manipulation capabilities without corrupting its invaluable original motion priors. iMontage excels across several mainstream many-in-many-out tasks, not only maintaining strong cross-image contextual consistency but also generating scenes with extraordinary dynamics that surpass conventional scopes. Our code and model weights will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38266", "url": null, "sourceid": 45049, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39552, "uid": "4f0adf8dde494dff7c0c6ed4df420520", "name": "Adapting In-context Generation for Enhanced Composed Image Retrieval", "authors": [{"id": 134545, "fullname": "Haiwen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/134545?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 101507, "fullname": "Zining Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/101507?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 184224, "fullname": "Delong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184224?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 152334, "fullname": "Zhaohui Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/152334?format=json", "institution": "Sensetime"}, {"id": 131585, "fullname": "Zhicheng Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/131585?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 131551, "fullname": "Fei Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/131551?format=json", "institution": "Beijing University of Posts and Telecommunications"}], "abstract": "As a challenge vision-language task, Composed Image Retrieval (CIR) aims to integrate information from a bi-modal query (image + text) to retrieve target images. While supervised CIR has achieved notable success in domain-specific scenarios, its reliance on manually annotated triplets restricts its scalability and application. Zero-shot CIR alleviates this by leveraging unlabeled data or automatically collected triplets, yet it often suffers from an intractable domain gap. To this end, we shift the focus to developing robust CIR models under limited labeled data and propose \\textbf{D}omain-\\textbf{A}daptive \\textbf{I}n-context \\textbf{G}eneration (DAIG), which adapts the in-context capability of a pretrained Text-to-Image (T2I) model to the target domain and the CIR task using few-shot samples and then transforms the LLM-generated textual triplets into unbiased CIR triplets as additional training data. After that, we present a two-stage framework applicable to any supervised CIR approach. The first stage, Distributionally Robust Synthetic Pretraining (DRSP), perturbs visual features to expand the distribution of synthetic data and improve training robustness on it. The second stage, Fine-grained Real-world Adaptation (FRA), fine-tunes on manually annotated triplets by imposing an angular margin on matching pairs to facilitate fine-grained learning. Experiments on two benchmarks validate the effectiveness of our method, $i.e.,$ under both few-shot and fully supervised CIR settings, DAIG yields substantial performance gains over CLIP4CIR, BLIP4CIR, and SPRC. The code and data will be released as open source.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39552", "url": null, "sourceid": 31925, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36376, "uid": "0200a91354cdcc7e7f803af641b0a56c", "name": "MMLandmarks: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding", "authors": [{"id": 183169, "fullname": "Oskar Kristoffersen", "url": "http://cvpr.thecvf.com/api/miniconf/users/183169?format=json", "institution": "Technical University of Denmark (DTU)"}, {"id": 184907, "fullname": "Alba Reinders", "url": "http://cvpr.thecvf.com/api/miniconf/users/184907?format=json", "institution": "DTU"}, {"id": 94900, "fullname": "Morten Hannemose", "url": "http://cvpr.thecvf.com/api/miniconf/users/94900?format=json", "institution": "Technical University of Denmark"}, {"id": 93991, "fullname": "Anders Dahl", "url": "http://cvpr.thecvf.com/api/miniconf/users/93991?format=json", "institution": "DTU Compute"}, {"id": 164954, "fullname": "Dim Papadopoulos", "url": "http://cvpr.thecvf.com/api/miniconf/users/164954?format=json", "institution": "Technical University of Denmark"}], "abstract": "Geo-spatial analysis of our world benefit from a multimodal approach, as every single geographic location can be described in numerous ways (images from various viewpoints, textual descriptions, and geographic coordinates). Current geo-spatial benchmarks have limited coverage across modalities, considerably restricting progress in the field, as current approaches cannot integrate all relevant modalities within a unified framework. We introduce the Multi-Modal Landmark dataset (MMLandmarks), a benchmark composed of four modalities: 197k high-resolution aerial images, 329k ground-view images, textual information, and geographic coordinates from 18,557 distinct landmarks in the United States. The MMLandmarks dataset has a one-to-one correspondence across every modality, which enables training and benchmarking models on various geo-spatial tasks, including cross-view Ground-to-Satellite retrieval, ground and satellite geolocalization, Text-to-Image, and Text-to-GPS retrieval.We demonstrate broad generalization and competitive performance against off-the-shelf foundational models and specialized state-of-the-art models across different tasks by employing a simple CLIP-inspired baseline, illustrating the necessity for multimodal datasets to achieve broad geo-spatial understanding. The dataset, labels, and code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36376", "url": null, "sourceid": 32751, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36264, "uid": "67b291530eb3c8b5bcbbe0d931a59a87", "name": "Taming the Long Tail: Rebalancing Adversarial Training via Adaptive Perturbation", "authors": [{"id": 155292, "fullname": "Lilin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155292?format=json", "institution": "Sichuan University"}, {"id": 184618, "fullname": "Yimo Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/184618?format=json", "institution": "Sichuan University"}, {"id": 184619, "fullname": "Li Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/184619?format=json", "institution": null}, {"id": 184620, "fullname": "Jiancheng Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/184620?format=json", "institution": null}, {"id": 184621, "fullname": "Xianggen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184621?format=json", "institution": "Sichuan University"}], "abstract": "Deep neural networks are highly vulnerable to adversarial examples, i.e.,small perturbations that can significantly degrade model performance. While adversarial training has become the primary defense strategy, most studies focus on balanced datasets, overlooking the challenges posed by real-world long-tail data. Motivated by the fact that perturbations in adversarial examples inherently alter the training distribution, we theoretically investigate their impact. We first revisit adversarial training for long-tail data and identify two key limitations: (i) a skewed training objective caused by class imbalance, and (ii) unstable evolution of adversarial distributions. Furthermore, we show that perturbations can simultaneously address both adversarial vulnerability and class imbalance. Based on these insights, we propose Rebalanced Adversarial Intensity for Long-Tailed Data (RAIL), a plug-and-play framework that adaptively adjusts perturbations during adversarial training. Extensive experiments demonstrate that RAIL consistently enhances adversarial robustness and class-balance on long-tailed datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36264", "url": null, "sourceid": 35501, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37209, "uid": "c5e247391bbdd57927e2930538296f2e", "name": "From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings", "authors": [{"id": 177462, "fullname": "Jiajie Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177462?format=json", "institution": "ShanghaiTech University"}, {"id": 186927, "fullname": "S\u00f6ren Schwertfeger", "url": "http://cvpr.thecvf.com/api/miniconf/users/186927?format=json", "institution": "ShanghaiTech University"}, {"id": 177791, "fullname": "Alexander Kleiner", "url": "http://cvpr.thecvf.com/api/miniconf/users/177791?format=json", "institution": "Hangzhou Dianzi University"}], "abstract": "We present a novel unsupervised framework to unlock vast unlabeled human demonstration data from continuous industrial video streams for Vision-Language-Action (VLA) model pre-training. Our method first trains a lightweight motion tokenizer to encode motion dynamics, then employs an unsupervised action segmenter leveraging a novel \"Latent Action Energy\" metric to discover and segment semantically coherent action primitives. The pipeline outputs both segmented video clips and their corresponding latent action sequences, providing structured data directly suitable for VLA pre-training. Evaluations on public benchmarks and a proprietary electric motor assembly dataset demonstrate effective segmentation of key tasks performed by humans at workstations. Further clustering and quantitative assessment via a Vision-Language Model confirm the semantic coherence of the discovered action primitives. To our knowledge, this is the first fully automated end-to-end system for extracting and organizing VLA pre-training data from unstructured industrial videos, offering a scalable solution for embodied AI integration in manufacturing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37209", "url": null, "sourceid": 35196, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65743, "file": "/media/PosterPDFs/CVPR%202026/37209.png", "modified": "2026-04-27T01:34:47.615660-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": true, "sortkey": 0, "is_live_content": false, "detailed_kind": "", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65744, "file": "/media/PosterPDFs/CVPR%202026/37209-thumb.png", "modified": "2026-04-27T01:34:47.782970-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": false, "sortkey": 0, "is_live_content": false, "detailed_kind": "thumb", "generated_from": null, "resourcetype": "EventmediaImageFile"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36473, "uid": "1dec44d3b8974e86ab337f1755cd893e", "name": "FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection", "authors": [{"id": 185128, "fullname": "Jin Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/185128?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 180078, "fullname": "Boran Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180078?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 185129, "fullname": "Jiajun Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185129?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 160222, "fullname": "Jiaqi guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/160222?format=json", "institution": "Nankai University"}, {"id": 185130, "fullname": "Shuo Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185130?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 185131, "fullname": "Pengju Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/185131?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "Coreset selection compresses large datasets into compact, representative subsets, reducing the energy and computational burden of training deep neural networks. Existing methods are either: (i) DNN-based, which are inherently coupled with network-specific parameters, inevitably introducing architectural bias and compromising generalization; or (ii) DNN-free, which utilize heuristics that lack rigorous theoretical guarantees for stability and accuracy. Neither approach explicitly constrains distributional equivalence of the representative subsets, largely because continuous distribution matching is broadly considered inapplicable to discrete dataset sampling. Furthermore, prevalent distribution metrics (e.g., MSE, KL, MMD, and CE) are often incapable of accurately capturing higher-order moment differences. These deficiencies lead to suboptimal coreset performance, preventing the selected coreset from being truly equivalent to the original dataset.In this work, we propose FAST (Frequency-domain Aligned Sampling via Topology), the first DNN-free distribution-matching coreset selection framework that formulates coreset selection as a graph-constrained optimization problem grounded in spectral graph theory and employs the Characteristic Function Distance (CFD) to capture full distributional information (i.e., all moments and intrinsic correlations) in the frequency domain. We further discover that naive CFD suffers from a \u201cvanishing phase gradient\u201d issue in medium and high-frequency regions; to address this, we introduce an Attenuated Phase-Decoupled CFD. Furthermore, for better convergence, we design a Progressive Discrepancy-Aware Sampling strategy that progressively schedules frequency selection from low to high. This preserves global structures before refining local details, enabling accurate matching with few frequencies while preventing overfitting. Extensive experiments demonstrate that FAST significantly outperforms state-of-the-art  coreset selection methods across all evaluated benchmarks, achieving an average accuracy gain of 9.12%. Compared to other baseline coreset methods, it reduces power consumption by 96.57% and achieves a 2.2$\\times$ average speedup even on CPU with 1.7GB of memory, underscoring its high performance and energy efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36473", "url": null, "sourceid": 45473, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40087, "uid": "08a9b8693c1c3ce49fd327977f837abf", "name": "HUMORCHAIN: Theory-Guided Multi-Stage Reasoning for Interpretable Multimodal Humor Generation", "authors": [{"id": 183151, "fullname": "Jiajun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183151?format=json", "institution": "Peking University"}, {"id": 193483, "fullname": "Shijia Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/193483?format=json", "institution": "Ocean University of China; Heriot-Watt University"}, {"id": 193484, "fullname": "Ruikang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193484?format=json", "institution": "Peking University"}, {"id": 178254, "fullname": "Qi Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/178254?format=json", "institution": "Peking University"}], "abstract": "Humor, as both a creative human activity and a social binding mechanism, has long posed a major challenge for AI generation. Although producing humor requires complex cognitive reasoning and social understanding, theories of humor suggest that it follows learnable patterns and structures, making it theoretically possible for generative models to acquire them implicitly. In recent years, multimodal humor has become a prevalent form of online communication, especially among Gen Z, highlighting the need for AI systems capable of integrating visual understanding with humorous language generation. However, existing data-driven approaches lack explicit modeling or theoretical grounding of humor, often producing literal descriptions that fail to capture its underlying cognitive mechanisms, resulting in the generated image descriptions that are fluent but lack genuine humor or cognitive depth. To address this limitation, we propose HUMORCHAIN (HUmor-guided Multi-step Orchestrated Reasoning Chain for Image Captioning), a theory-guided multi-stage reasoning framework. It integrates visual semantic parsing, humor- and psychology-based reasoning, and a fine-tuned discriminator for humor evaluation, forming an interpretable and controllable cognitive reasoning chain. To the best of our knowledge, this is the first work to explicitly embed cognitive structures from humor theories into multimodal humor generation, enabling a structured reasoning process from visual understanding to humor creation. Experiments on Meme-Image-No-Text, Oogiri-GO, and OxfordTVG-HIC datasets show that HUMORCHAIN outperforms state-of-the-art baselines in human humor preference, Elo/BT scores, and semantic diversity, demonstrating that theory-driven structured reasoning enables large language models to generate humor aligned with human perception.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40087", "url": null, "sourceid": 43595, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36611, "uid": "5599b93313facb3b448b1124a75b23f7", "name": "Scene-Centric Unsupervised Video Panoptic Segmentation", "authors": [{"id": 136627, "fullname": "Christoph Reich", "url": "http://cvpr.thecvf.com/api/miniconf/users/136627?format=json", "institution": "Technische Universit\u00e4t Darmstadt"}, {"id": 159011, "fullname": "Oliver Hahn", "url": "http://cvpr.thecvf.com/api/miniconf/users/159011?format=json", "institution": "Technische Universit\u00e4t Darmstadt"}, {"id": 106015, "fullname": "Nikita Araslanov", "url": "http://cvpr.thecvf.com/api/miniconf/users/106015?format=json", "institution": "TU Munich"}, {"id": 128661, "fullname": "Laura Leal-Taixe", "url": "http://cvpr.thecvf.com/api/miniconf/users/128661?format=json", "institution": "NVIDIA"}, {"id": 129663, "fullname": "Christian Rupprecht", "url": "http://cvpr.thecvf.com/api/miniconf/users/129663?format=json", "institution": "University of Oxford"}, {"id": 84985, "fullname": "Daniel Cremers", "url": "http://cvpr.thecvf.com/api/miniconf/users/84985?format=json", "institution": "Technical University Munich"}, {"id": 73884, "fullname": "Stefan Roth", "url": "http://cvpr.thecvf.com/api/miniconf/users/73884?format=json", "institution": "Technische Universit\u00e4t Darmstadt"}], "abstract": "Video panoptic segmentation (VPS) aims to jointly detect, segment, and track all objects while partitioning the video into semantically consistent regions. We introduce the task setting of unsupervised VPS, omitting any human supervision. Existing unsupervised scene understanding works mainly focused on image segmentation tasks; the video domain remains underexplored. We propose CUViPS, the first unsupervised VPS approach. CUViPS generates temporally consistent panoptic video pseudo-labels from monocular scene-centric videos by exploiting unsupervised depth, motion, and visual cues. Training on these pseudo-labels using a novel Video DropLoss yields an accurate and unsupervised VPS model. To benchmark progress, we introduce a comprehensive evaluation protocol and four competitive baselines, extending state-of-the-art unsupervised panoptic image and instance video segmentation models to VPS. CUViPS consistently outperforms all baselines and demonstrates strong label-efficient learning. With CUViPS, our evaluation protocol, and baselines, we provide a strong foundation for future research on unsupervised VPS.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36611", "url": null, "sourceid": 34705, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39607, "uid": "f3da4e93ad137dbe17a730562563cf9d", "name": "Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation", "authors": [{"id": 188608, "fullname": "Kailing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188608?format=json", "institution": "East China Normal University"}, {"id": 159520, "fullname": "Tianwen Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/159520?format=json", "institution": "East China Normal University"}, {"id": 192467, "fullname": "Lijin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192467?format=json", "institution": "The University of Tokyo"}, {"id": 76647, "fullname": "Yuqian Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76647?format=json", "institution": "Fudan University"}, {"id": 156478, "fullname": "Jingyu Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/156478?format=json", "institution": "East China Normal University"}, {"id": 188610, "fullname": "Xiaoling Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188610?format=json", "institution": "East China Normal University"}, {"id": 89725, "fullname": "Liang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/89725?format=json", "institution": "East China Normal University"}], "abstract": "Vision-Language Navigation (VLN) enables embodied agents to reach target locations in unseen environments by following language instructions. Despite recent progress with vision-language models (VLMs), a critical semantic\u2013geometric gap remains: while VLMs excel at language and 2D visual understanding, they struggle with 3D spatial reasoning and fail to capture the causal dynamics between actions and spatial transitions, resulting in unreliable navigation, particularly in zero-shot settings. To bridge this gap, we propose a Hierarchical Semantic\u2013Geometric Map (HSGM) that transforms 3D geometric information into a structured representation compatible with VLMs, effectively linking them to the physical world. Specifically, HSGM is represented as a multi-channel top-down map organized into three levels: (1) geometric level that records navigable regions and obstacles, (2) semantic level that represents objects and their relations, and (3) decision level that supports high-level task reasoning and goal selection. During navigation, the VLM acts as a high-level semantic planner, interpreting the spatial layout encoded in the HSGM to select geometrically valid waypoints, while low-level, collision-free movements between waypoints are executed by a classical path-planning algorithm, fully decoupling semantic reasoning from action execution. Additionally, complex instructions are decomposed into subtasks to alleviate the problem of progress forgetting or hallucinating in long-horizon navigation. Extensive experiments on R2R-CE and RxR-CE benchmarks demonstrate that our zero-shot framework achieves state-of-the-art performance and even outperforms several supervised methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39607", "url": null, "sourceid": 30640, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37929, "uid": "15aea757696aaf9af950992f299b6789", "name": "CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning", "authors": [{"id": 188608, "fullname": "Kailing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188608?format=json", "institution": "East China Normal University"}, {"id": 188609, "fullname": "Qi&amp;#x27;ao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188609?format=json", "institution": "East China Normal University"}, {"id": 159520, "fullname": "Tianwen Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/159520?format=json", "institution": "East China Normal University"}, {"id": 76647, "fullname": "Yuqian Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76647?format=json", "institution": "Fudan University"}, {"id": 166718, "fullname": "Yang Jiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/166718?format=json", "institution": "Fudan University"}, {"id": 188610, "fullname": "Xiaoling Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188610?format=json", "institution": "East China Normal University"}], "abstract": "Embodied Visual Reasoning (EVR) seeks to follow complex, free-form instructions based on egocentric video, enabling semantic understanding and spatiotemporal reasoning in dynamic environments. Despite its promising potential, EVR encounters significant challenges stemming from the diversity of complex instructions and the intricate spatiotemporal dynamics in long-term egocentric videos. Prior solutions either employ Large Language Models (LLMs) over static video captions, which often omit critical visual details, or rely on end\u2011to\u2011end Vision-Language Models (VLMs) that struggle with stepwise compositional reasoning. Considering the complementary strengths of LLMs in reasoning and VLMs in perception, we propose CLiViS. It is a novel training-free framework that leverages LLMs for high-level task planning and orchestrates VLM\u2011driven open\u2011world visual perception to iteratively update the scene context. Building on this synergy, the core of CLiViS is a dynamic Cognitive Map that evolves throughout the reasoning process. This map constructs a structured representation of the embodied scene, bridging low-level perception and high-level reasoning. Extensive experiments across multiple benchmarks demonstrate the effectiveness and generality of CLiViS, especially in handling long\u2011term visual dependencies. Code will be available after review.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37929", "url": null, "sourceid": 39577, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38132, "uid": "e20dd8dbc9d9a3e0f0396c27a38df6aa", "name": "V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence", "authors": [{"id": 161478, "fullname": "Jiancheng Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/161478?format=json", "institution": "Tsinghua University"}, {"id": 189131, "fullname": "Runze Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189131?format=json", "institution": "Fudan University"}, {"id": 159520, "fullname": "Tianwen Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/159520?format=json", "institution": "East China Normal University"}, {"id": 156197, "fullname": "Mohammad Mahdi", "url": "http://cvpr.thecvf.com/api/miniconf/users/156197?format=json", "institution": "Institute for Computer Science, Artificial Intelligence and Technology"}, {"id": 73907, "fullname": "Yanwei Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73907?format=json", "institution": "Fudan University"}, {"id": 89233, "fullname": "Xiangyang Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/89233?format=json", "institution": "Fudan University"}, {"id": 189132, "fullname": "Xiaomeng Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189132?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 75489, "fullname": "Luc Van Gool", "url": "http://cvpr.thecvf.com/api/miniconf/users/75489?format=json", "institution": "INSAIT, Sofia Un. St. Kliment Ohridski"}, {"id": 156198, "fullname": "Danda Paudel", "url": "http://cvpr.thecvf.com/api/miniconf/users/156198?format=json", "institution": "INSAIT, Sofia University"}, {"id": 76647, "fullname": "Yuqian Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76647?format=json", "institution": "Fudan University"}], "abstract": "Cross-view object correspondence, exemplified by the representative task of ego\u2013exo object correspondence, aims to establish consistent associations of the same object across different viewpoints (e.g., ego-centric and exo-centric). This task poses significant challenges due to drastic viewpoint and appearance variations, making existing segmentation models, such as SAM2, non-trivial to apply directly. To address this, we present V$^{2}$-SAM, a unified cross-view object correspondence framework that adapts SAM2 from single-view segmentation to cross-view correspondence through two complementary prompt generators. Specifically, the Cross-View Anchor Prompt Generator (V$^{2}$-Anchor), built upon DINOv3 features, establishes geometry-aware correspondences and, for the first time, unlocks coordinate-based prompting for SAM2 in cross-view scenarios, while the Cross-View Visual Prompt Generator (V$^{2}$-Visual) enhances appearance-guided cues via a novel visual prompt matcher that aligns ego\u2013exo representations from both feature and structural perspectives. To effectively exploit the strengths of both prompts, we further adopt a multi-expert design and introduce a Post-hoc Cyclic Consistency Selector (PCCS) that adaptively selects the most reliable expert based on cyclic consistency. Extensive experiments validate the effectiveness of V$^{2}$-SAM, achieving new state-of-the-art performance on Ego-Exo4D (Ego\u2013Exo object correspondence), DAVIS-2017 (video object tracking), and HANDAL-X (robotic-ready cross-view correspondence). Codes will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38132", "url": null, "sourceid": 33962, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37781, "uid": "dcc4b21702248a25947ecf9ba174e0f5", "name": "Bypassing the Transport Plan: Dynamic Reweighting for Out-of-Distribution Detection with Optimal Transport", "authors": [{"id": 165874, "fullname": "Yang Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/165874?format=json", "institution": "Zhejiang University"}, {"id": 131929, "fullname": "Weiming Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131929?format=json", "institution": "Zhejiang University"}, {"id": 126847, "fullname": "Jun Dan", "url": "http://cvpr.thecvf.com/api/miniconf/users/126847?format=json", "institution": "Zhejiang University"}, {"id": 188253, "fullname": "Tengyue Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188253?format=json", "institution": "City University of Hong Kong; Zhejiang University"}, {"id": 154335, "fullname": "Fan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154335?format=json", "institution": "Zhejiang University"}, {"id": 153988, "fullname": "Hua Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153988?format=json", "institution": "Nanyang Technological University"}, {"id": 107430, "fullname": "Junhao Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/107430?format=json", "institution": "Nanyang Technological University &amp; A*STAR"}, {"id": 188254, "fullname": "Jiao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188254?format=json", "institution": "Nanyang Technological University"}, {"id": 154336, "fullname": "Shunjie Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/154336?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 154337, "fullname": "Lianyong Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/154337?format=json", "institution": "China University of Petroleum"}], "abstract": "Semi-supervised learning (SSL) has achieved remarkable progress by leveraging both limited labeled data and abundant unlabeled data. However, unlabeled datasets often contain out-of-distribution (OOD) samples from unknown classes, which can lead to performance degradation in open-set SSL scenarios. Current OOD detection methods are constrained by the absence of labeled OOD samples. Although optimal transport (OT) has proven to be effective to provide pseudo OOD scores for supervised learning, it still faces two main challenges, i.e., the unavoidable problem of finding the optimal transport plan and the unreliable OOD score caused by dense solutions. To overcome thess limitations, we propose a novel open-set OOD detection model named **DREW**, which leverages **D**ynamic **REW**eighting approach for OT-based OOD detection. Specifically, we start by formulating OOD detection as a semi-unbalanced optimal transport (SemiUOT) problem. The proposed DREW model can dynamically transform SemiUOT into the classical OT formula and then directly obtain the pseudo OOD score from the new source distribution weights. Contrary to existing OT-based methods, DREW provides theoretically grounded and more accurate pseudo OOD scores, while avoiding the direct computation of the transport plan. Empirical results demonstrate the superiority of DREW in terms of both accuracy and efficiency. Extensive analytical experiments are conducted to elucidate the properties of each component.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37781", "url": null, "sourceid": 30968, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38742, "uid": "b5464187e768a1db895e5df954d66a04", "name": "MR. Illuminate: Zero-Shot Low-Light Image Enhancement with Diffusion Prior", "authors": [{"id": 190559, "fullname": "Joshua Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/190559?format=json", "institution": "University of Illinois Urbana-Champaign"}, {"id": 181257, "fullname": "Sara Aghajanzadeh", "url": "http://cvpr.thecvf.com/api/miniconf/users/181257?format=json", "institution": "University of Illinois Urbana Champaign"}, {"id": 135038, "fullname": "Zhen Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/135038?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 69186, "fullname": "David Forsyth", "url": "http://cvpr.thecvf.com/api/miniconf/users/69186?format=json", "institution": "University of Illinois Urbana-Champaign"}], "abstract": "The primary axes of interest in low-light image enhancement (LLIE) are color constancy\u2014ensuring consistent outputs across inputs of the same scene under varying illumination and noise\u2014and generalization across diverse datasets. Existing methods, whether supervised, unsupervised, or zero-shot, rely on auxiliary loss functions and empirically selected hyperparameters, which yield strong results on the datasets used for evaluation but often exhibit limited generalization. To overcome these constraints, we propose MR. Illuminate (pronounced \"Mister Illuminate\"), the first deep learning-based solution for LLIE that requires no optimization and no degradation assumption. \"MR.\" emphasizes our Modulate\u2013Refine design: global illuminance and color are modulated via Adaptive Instance Normalization (AdaIN), while local structure and color are refined through self-attention features within a pre-trained diffusion model, taking a unique approach from prior methods. Extensive quantitative evaluations show that our approach surpasses SOTA methods on standard LLIE benchmarks, while qualitative results demonstrate improved color fidelity. Moreover, without any modification to our framework, our method achieves competitive results on the auto white balance (AWB) task, underscoring its strong generalization capability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38742", "url": null, "sourceid": 40566, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36733, "uid": "7c75e25d978ed546db1b6f6e1dd84848", "name": "Think, Then Verify: A Hypothesis\u2013Verification Multi-Agent Framework for Long Video Understanding", "authors": [{"id": 88871, "fullname": "Zheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88871?format=json", "institution": "Zhejiang University of Technology"}, {"id": 185746, "fullname": "Haoran Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185746?format=json", "institution": "Zhejiang University of Technology"}, {"id": 185747, "fullname": "Haoxuan Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185747?format=json", "institution": "Zhejiang University of Technology"}, {"id": 77399, "fullname": "Zhipeng Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/77399?format=json", "institution": "Fudan University"}, {"id": 159520, "fullname": "Tianwen Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/159520?format=json", "institution": "East China Normal University"}, {"id": 157696, "fullname": "Cong Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/157696?format=json", "institution": "Zhejiang University of Technology"}], "abstract": "Long video understanding is challenging due to dense visual redundancy, long-range temporal dependencies, and the tendency of chain-of-thought and retrieval-based agents to accumulate semantic drift and correlation-driven errors. We argue that long-video reasoning should begin not with reactive retrieval, but with deliberate task formulation: the model must first articulate what must be true in the video for each candidate answer to hold. This thinking-before-finding principle motivates VideoHV-Agent, a framework that reformulates video question answering as a structured hypothesis\u2013verification process.Based on video summaries, a Thinker rewrites answer candidates into testable hypotheses, a Judge derives a discriminative clue specifying what evidence must be checked, a Verifier grounds and tests the clue using localized, fine-grained video content, and an Answer agent integrates validated evidence to produce the final answer.Experiments on three long-video understanding benchmarks show that VideoHV-Agent achieves state-of-the-art accuracy while providing enhanced interpretability, improved logical soundness, and lower computational cost.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36733", "url": null, "sourceid": 36052, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37121, "uid": "eb3c9620d7cd9b43d91437fc4397453e", "name": "ChartR: Evaluating Reasoning Accuracy and Robustness in Chart Question Answering", "authors": [{"id": 186713, "fullname": "Xiaojun Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186713?format=json", "institution": "Shenzhen University"}, {"id": 186714, "fullname": "Sixiao Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186714?format=json", "institution": "Shenzhen University"}, {"id": 186715, "fullname": "Ziqi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186715?format=json", "institution": "Shenzhen University"}, {"id": 186716, "fullname": "Min Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186716?format=json", "institution": "Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences"}, {"id": 129246, "fullname": "Qin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129246?format=json", "institution": "Shenzhen University"}, {"id": 186717, "fullname": "Liang-Jie Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186717?format=json", "institution": "Shenzhen University"}], "abstract": "Chart Question Answering (CQA) benchmarks are critical for evaluating Multimodal Large Language Models (MLLMs) on visual data reasoning. Existing benchmarks focus mainly on final-answer correctness, ignoring intermediate reasoning steps and the propagation of errors in multi-step processes. To address this, we introduce \\textbf{ChartR}, a benchmark designed to assess both the accuracy and robustness of reasoning in chart-understanding tasks. Each question is decomposed into 4\u201310 sub-questions covering key reasoning types, and each chart includes four visually perturbed variants (blurred, noise-added, watermark-added, annotation-removed) to systematically evaluate robustness. ChartR contains 200 base charts, 800 variants, 1,652 questions, and 8,260 image\u2013question pairs. We further propose a comprehensive evaluation framework with eight metrics that evaluate reasoning-chain accuracy, robustness under visual perturbations, and enable analysis of potential error propagation patterns. Experiments on twelve MLLMs, including general-purpose and chart-specialized models, reveal low reasoning reliability, early-step errors that may propagate, value extraction as the primary bottleneck, and sharp performance drops under perturbations, highlighting reliance on textual cues over true visual understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37121", "url": null, "sourceid": 38244, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37360, "uid": "ae77492ecda002233a4f8e9f3a5973a1", "name": "EgoSound: Benchmarking Sound Understanding in Egocentric Videos", "authors": [{"id": 187253, "fullname": "Bingwen Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187253?format=json", "institution": "Fudan University"}, {"id": 76647, "fullname": "Yuqian Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76647?format=json", "institution": "Fudan University"}, {"id": 76243, "fullname": "Qiaole Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76243?format=json", "institution": "Fudan University"}, {"id": 187254, "fullname": "Guolei Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/187254?format=json", "institution": "Nankai University"}, {"id": 159520, "fullname": "Tianwen Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/159520?format=json", "institution": "East China Normal University"}, {"id": 187255, "fullname": "Yuzheng Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187255?format=json", "institution": "Fudan University"}, {"id": 156198, "fullname": "Danda Paudel", "url": "http://cvpr.thecvf.com/api/miniconf/users/156198?format=json", "institution": "INSAIT, Sofia University"}, {"id": 73907, "fullname": "Yanwei Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73907?format=json", "institution": "Fudan University"}, {"id": 89233, "fullname": "Xiangyang Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/89233?format=json", "institution": "Fudan University"}], "abstract": "Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in visual-language understanding. Yet, human perception is inherently multisensory, integrating sight, sound, and motion to reason about the world. Among these modalities, sound provides indispensable cues about spatial layout, off-screen events, and causal interactions, particularly in egocentric settings where auditory and visual signals are tightly coupled. To this end, we introduce EgoSound, the first benchmark designed to systematically evaluate egocentric sound understanding in MLLMs. EgoSound unifies data from Ego4D and EgoBlind, encompassing both sighted and sound-dependent experiences. It defines a seven-task taxonomy spanning intrinsic sound perception, spatial localization, causal inference, and cross-modal reasoning. Constructed through a multi-stage auto-generative pipeline, EgoSound contains 7315 validated QA pairs across 900 videos. Comprehensive experiments on nine state-of-the-art MLLMs reveal that current models exhibit emerging auditory reasoning abilities but remain limited in fine-grained spatial and causal understanding. EgoSound establishes a challenging foundation for advancing multisensory egocentric intelligence, bridging the gap between seeing and truly hearing the world.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37360", "url": null, "sourceid": 43054, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36783, "uid": "433d9cb8f9bb7b8623c18045a1c75c18", "name": "VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions", "authors": [{"id": 77078, "fullname": "Adrian Bulat", "url": "http://cvpr.thecvf.com/api/miniconf/users/77078?format=json", "institution": "Samsung AI Center Cambridge"}, {"id": 185863, "fullname": "Alberto Baldrati", "url": "http://cvpr.thecvf.com/api/miniconf/users/185863?format=json", "institution": "Samsung"}, {"id": 159514, "fullname": "Ioannis Maniadis Metaxas", "url": "http://cvpr.thecvf.com/api/miniconf/users/159514?format=json", "institution": "Samsung"}, {"id": 127546, "fullname": "Yassine Ouali", "url": "http://cvpr.thecvf.com/api/miniconf/users/127546?format=json", "institution": "Samsung"}, {"id": 86077, "fullname": "Georgios Tzimiropoulos", "url": "http://cvpr.thecvf.com/api/miniconf/users/86077?format=json", "institution": "Queen Mary University London"}], "abstract": "Existing approaches for improving the efficiency of Large Vision-Language Models (LVLMs) are largely based on the concept of visual token reduction. This approach, however, creates an information bottleneck that impairs performance, especially on challenging tasks that require fine-grained understanding and reasoning. In this work, we challenge this paradigm by introducing VISion On Request (VISOR), a method that reduces inference cost without discarding visual information. Instead of compressing the image, VISOR improves efficiency by sparsifying the interaction between image and text tokens. Specifically, the language model attends to the full set of high-resolution visual tokens through a small, strategically placed set of attention layers: general visual context is provided by efficient cross-attention between text-image, while a few well-placed and dynamically selected self-attention layers refine the visual representations themselves, enabling complex, high-resolution reasoning when needed. Based on this principle, we first train a single universal network on a range of computational budgets by varying the number of self-attention layers, and then introduce a lightweight policy mechanism that dynamically allocates visual computation based on per-sample complexity. Extensive experiments show that VISOR drastically reduces computational cost while matching or exceeding state-of-the-art results across a diverse suite of benchmarks, and excels in challenging tasks that require detailed visual understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36783", "url": null, "sourceid": 38551, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38026, "uid": "52b8e6d0773472f490daafa68658b651", "name": "EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models", "authors": [{"id": 144193, "fullname": "Yiyang Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/144193?format=json", "institution": "Wuhan University"}, {"id": 184315, "fullname": "Wenke Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184315?format=json", "institution": "Nanyang Technological University"}, {"id": 188870, "fullname": "Pei Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188870?format=json", "institution": "Xiaomi Corporation"}, {"id": 147114, "fullname": "Yihao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/147114?format=json", "institution": "Wuhan University"}, {"id": 149183, "fullname": "Kehua Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/149183?format=json", "institution": "Wuhan University"}, {"id": 188098, "fullname": "Zhenbo Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188098?format=json", "institution": "Xiaomi Corporation"}, {"id": 188099, "fullname": "Jian Luan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188099?format=json", "institution": "Xiaomi Corporation"}, {"id": 76422, "fullname": "Mang Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/76422?format=json", "institution": "Wuhan University"}], "abstract": "Multimodal Large Language Models (MLLMs) have shown remarkable progress in visual reasoning and understanding tasks but still struggle to capture the complexity and subjectivity of human emotions. Existing approaches based on supervised fine-tuning often suffer from limited generalization and poor interpretability, while reinforcement learning methods such as Group Relative Policy Optimization fail to align with the intrinsic characteristics of emotional cognition.To address these challenges, we propose Reflective Reinforcement Learning for Emotional Reasoning (EMO-R3), a framework designed to enhance the emotional reasoning ability of MLLMs. Specifically, we introduce Structured Emotional Thinking to guide the model to perform step-by-step emotional reasoning in a structured and interpretable manner, and design a Reflective Emotional Reward that enables the model to re-evaluate its reasoning based on visual-text consistency and emotional coherence. Extensive experiments demonstrate that EMO-R3 significantly improves both the interpretability and emotional intelligence of MLLMs, achieving superior performance across multiple visual emotional understanding benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38026", "url": null, "sourceid": 39751, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38243, "uid": "b56c83155c640bf87af59210d57da3a9", "name": "Omni-Supervised Motion Editing: Balancing Change and Invariance through Positive-Negative Learning", "authors": [{"id": 182022, "fullname": "Zhenwu Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/182022?format=json", "institution": "East China Normal University"}, {"id": 156478, "fullname": "Jingyu Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/156478?format=json", "institution": "East China Normal University"}, {"id": 189408, "fullname": "Peiwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189408?format=json", "institution": "East China Normal University"}, {"id": 189409, "fullname": "Xingzan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189409?format=json", "institution": "East China Normal University"}, {"id": 159520, "fullname": "Tianwen Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/159520?format=json", "institution": "East China Normal University"}, {"id": 189410, "fullname": "Wenxi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189410?format=json", "institution": "East China Normal University"}, {"id": 189411, "fullname": "Yuan Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189411?format=json", "institution": "East China Normal University"}, {"id": 127185, "fullname": "Jiao Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/127185?format=json", "institution": "Xiamen University"}, {"id": 89127, "fullname": "Lizhuang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/89127?format=json", "institution": "Dept. of Computer Sci. &amp; Eng., Shanghai Jiao Tong University"}, {"id": 87891, "fullname": "Shaohui Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/87891?format=json", "institution": "East China Normal University"}], "abstract": "Text-based human motion editing aims to modify existing motion sequences according to natural language instructions while maintaining the consistency of the original motion. Existing diffusion-based approaches often rely on heuristic similarity cues or coarse global conditioning, leading to motion distortion and suboptimal semantic alignment. The key challenge lies in balancing change (i.e. precisely editing target regions) and invariance (i.e. preserving unedited parts). To handle such challenge, we propose an Omni-Supervised Positive-Negative Learning framework, named OmniME. Our method integrates three complementary components: (1) retrospective feature supervision that enforces coarse-to-fine consistency across transformer layers,(2) motion preservation mechanism that focuses on  subtle variations accoding to the source-target similarity, and (3) triplet-based semantic alignment that strengthens text-motion correspondence. Together, these components form a unified supervision paradigm that balances change and invariance. Extensive experiments on the MotionFix and STANCE Adjustment datasets demonstrate that **OmniME achieves state-of-the-art performance in editing alignment**, validating the effectiveness of our unified learning framework. The code will be made publicly available uppon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38243", "url": null, "sourceid": 44079, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38529, "uid": "106ba8a5ad18d8cc7837398000276c46", "name": "MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding", "authors": [{"id": 176275, "fullname": "Wenhui Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/176275?format=json", "institution": "Renmin University of China"}, {"id": 190067, "fullname": "Xiaoyi Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190067?format=json", "institution": "Tongji University"}, {"id": 152287, "fullname": "Jiaze Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152287?format=json", "institution": "Zhejiang University"}, {"id": 188096, "fullname": "Yijing Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188096?format=json", "institution": "Renmin University of China"}, {"id": 188097, "fullname": "Jianzhong Ju", "url": "http://cvpr.thecvf.com/api/miniconf/users/188097?format=json", "institution": "Xiaomi Corporation"}, {"id": 188098, "fullname": "Zhenbo Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188098?format=json", "institution": "Xiaomi Corporation"}, {"id": 155075, "fullname": "Ruihua Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/155075?format=json", "institution": "Renmin University of China"}, {"id": 188099, "fullname": "Jian Luan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188099?format=json", "institution": "Xiaomi Corporation"}], "abstract": "Efficiently understanding long-form videos remains a fundamental challenge for multimodal large language models (MLLMs).In this paper, we present MLLM-Sampler Joint-Evolution (MiSJoE), a novel framework that **jointly evolves** the MLLM and a lightweight key-frame sampler for efficient long-form video understanding.MiSJoE builds upon a key assumption that **only a small subset of key-frames is truly informative for answering each question to a video**.Specifically, MiSJoE first reasons out several queries, which describe diverse visual perspectives relevant to the question.Then, these queries interact with a frozen CLIP model to produce a query\u2013frame similarity matrix. Finally, A lightweight sampler predicts key-frame sampling weights from this matrix, selecting a compact set of informative frames, which are then fed into the MLLM for answer generation.Both the MLLM and sampler are **jointly optimized through reinforcement learning**, enabling co-adaptation of query-reasoning, frame-sampling, and key-frame understanding.A new long-video QA dataset containing 2.8k videos with 7k question\u2013answer pairs is collected to support the training process.Extensive experiments on VideoMME, LongVideoBench, LVBench, and MLVU show that MiSJoE achieves 8.0% accuracy gain upon the base MLLM, and 1.1% higher accuracy than strongest baseline method.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38529", "url": null, "sourceid": 46267, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38491, "uid": "886438584cd75af26e683e77526dff7e", "name": "SocialNav: Training Human-Inspired Foundation Model for Socially-Aware Embodied Navigation", "authors": [{"id": 189973, "fullname": "Ziyi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189973?format=json", "institution": null}, {"id": 189974, "fullname": "Yingnan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189974?format=json", "institution": "Zhejiang University"}, {"id": 150618, "fullname": "Zedong Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/150618?format=json", "institution": null}, {"id": 189488, "fullname": "Minghua Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189488?format=json", "institution": "Alibaba Group"}, {"id": 189975, "fullname": "Yanfen Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189975?format=json", "institution": "Alibaba Group; Wuhan University"}, {"id": 189976, "fullname": "Mingchao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/189976?format=json", "institution": null}, {"id": 189489, "fullname": "Junjun Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189489?format=json", "institution": "Alibaba Group"}, {"id": 189487, "fullname": "Shichao Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/189487?format=json", "institution": null}, {"id": 189977, "fullname": "Yang Kuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189977?format=json", "institution": null}, {"id": 189978, "fullname": "Pei Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189978?format=json", "institution": "Alibaba Group; Alibaba Group"}, {"id": 189979, "fullname": "Zhining Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189979?format=json", "institution": "Alibaba Group"}, {"id": 189980, "fullname": "Lu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189980?format=json", "institution": null}, {"id": 189981, "fullname": "Honglin Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/189981?format=json", "institution": "Alibaba Group"}, {"id": 128317, "fullname": "Xiaolong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128317?format=json", "institution": "Georgia Institute of Technology"}, {"id": 154906, "fullname": "Mu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154906?format=json", "institution": "Alibaba Group"}, {"id": 73318, "fullname": "Yu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73318?format=json", "institution": "Zhejiang University"}], "abstract": "Embodied navigation that adheres to social norms remains an open research challenge. Our SocialNav is a foundational model for socially-aware navigation with a hierarchical \"brain-action\" architecture, capable of understanding high-level social norms and generating low-level, socially compliant trajectories. To enable such dual capabilities, we construct the SocNav Dataset, a large-scale collection of 7 million samples, comprising (1) a Cognitive Activation Dataset providing social reasoning signals such as chain-of-thought explanations and social traversability prediction, and (2) an Expert Trajectories Pyramid aggregating diverse navigation demonstrations from internet videos, simulated environments, and real-world robots. A multi-stage training pipeline is proposed to gradually inject and refine navigation intelligence: we first inject general navigation skills and social norms understanding into the model via imitation learning, and then refine such skills through a deliberately designed Socially-Aware FlowExploration GRPO (SAFE-GRPO), the first flow-based reinforcement learning framework for embodied navigation that explicitly rewards socially compliant behaviors. SocialNav achieves +38% success rate and +46% social compliance rate compared to the state-of-the-art method, demonstrating strong gains in both navigation performance and social compliance. Data and code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38491", "url": null, "sourceid": 44996, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40337?format=json"], "related_events_ids": [40337]}, {"id": 37724, "uid": "dc3471f74deaba60b3d3693749ce09bc", "name": "REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding", "authors": [{"id": 152287, "fullname": "Jiaze Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152287?format=json", "institution": "Zhejiang University"}, {"id": 156129, "fullname": "Hao Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/156129?format=json", "institution": "University of Science and Technology of China"}, {"id": 176275, "fullname": "Wenhui Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/176275?format=json", "institution": "Renmin University of China"}, {"id": 188094, "fullname": "Jingyang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188094?format=json", "institution": "Fudan University"}, {"id": 188095, "fullname": "Boshen Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188095?format=json", "institution": "Renmin University of China"}, {"id": 147633, "fullname": "Yuxun Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/147633?format=json", "institution": "Nankai University"}, {"id": 188096, "fullname": "Yijing Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188096?format=json", "institution": "Renmin University of China"}, {"id": 188097, "fullname": "Jianzhong Ju", "url": "http://cvpr.thecvf.com/api/miniconf/users/188097?format=json", "institution": "Xiaomi Corporation"}, {"id": 188098, "fullname": "Zhenbo Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188098?format=json", "institution": "Xiaomi Corporation"}, {"id": 188099, "fullname": "Jian Luan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188099?format=json", "institution": "Xiaomi Corporation"}], "abstract": "Self-reflection mechanisms that rely on purely text-based rethinking processes perform well in most multimodal tasks. However, when directly applied to long-form video understanding scenarios, they exhibit clear limitations. The fundamental reasons for this lie in two points: (1) long-form video understanding involves richer and more dynamic visual input, meaning rethinking only the text information is insufficient and necessitates a further rethinking process specifically targeting visual information; (2) purely text-based reflection mechanisms lack cross-modal interaction capabilities, preventing them from fully integrating visual information during reflection. Motivated by these insights, we propose REVISOR (REflective VIsual Segment Oriented Reasoning), a novel framework for tool-augmented multimodal reflection. REVISOR enables MLLMs to collaboratively construct introspective reflection processes across textual and visual modalities, significantly enhancing their reasoning capability for long-form video understanding. To ensure that REVISOR can learn to accurately review video segments highly relevant to the question during reinforcement learning, we designed the Dual Attribution Decoupled Reward (DADR) mechanism. Integrated into the GRPO training strategy, this mechanism enforces causal alignment between the model\u2019s reasoning and the selected video evidence. Notably, the REVISOR framework significantly enhances long-form video understanding capability of MLLMs without requiring supplementary supervised fine-tuning or external models, achieving impressive results on four benchmarks including VideoMME, LongVideoBench, MLVU, and LVBench.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37724", "url": null, "sourceid": 36334, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37004, "uid": "67ed5d53450537d929bca84e0289813e", "name": "StoryTailor:A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives", "authors": [{"id": 146154, "fullname": "Jinghao Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/146154?format=json", "institution": "Northwest University,China"}, {"id": 186445, "fullname": "Yuhe Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186445?format=json", "institution": "Northwest University Xi&#x27;an"}, {"id": 186446, "fullname": "GuoHua Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186446?format=json", "institution": null}, {"id": 149614, "fullname": "Kang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/149614?format=json", "institution": "School of Information Science and Technology, Northwest University,"}, {"id": 186447, "fullname": "Han Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186447?format=json", "institution": "Northwest University Xi&#x27;an"}], "abstract": "Generating multi-frame, action-rich visual narratives without fine-tuning faces a threefold tension: action text faithfulness, subject identity fidelity, and cross frame background continuity. We propose StoryTailor, a zero-shot pipeline that runs on a single RTX 4090 (24 GB) and produces temporally coherent, identity-preserving image sequences from a long narrative prompt, per subject references, and grounding boxes. Three synergistic modules drive the system: Gaussian-Centered Attention (GCA) to dynamically focus on each subject core and ease grounding-box overlaps; Action-Boost Singular Value Reweighting (AB-SVR) to amplify action-related directions in the text embedding space; and Selective Forgetting Cache (SFC) that retains transferable background cues, forgets nonessential history, and selectively surfaces the retained cues to build cross scene semantic ties. Compared with baseline methods, the experiments show that CLIP-T improves by up to 10\u201315\\%, with DreamSim lower than strong baselines, while CLIP-I stays in a visually acceptable, competitive range. With a matched resolution and steps on a 24 GB GPU, inference is faster than FluxKontext. Qualitatively, StoryTailor delivers expressive interactions and evolving yet stable scenes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37004", "url": null, "sourceid": 45985, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39472, "uid": "38b5190d475f087295d31efdccb0c058", "name": "Similarity-as-Evidence: Calibrating Overconfident VLMs for Interpretable and Label-Efficient Medical Active Learning", "authors": [{"id": 180219, "fullname": "Zhuofan Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/180219?format=json", "institution": "Xiamen University"}, {"id": 192140, "fullname": "Zishan Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192140?format=json", "institution": "Xiamen University"}, {"id": 186230, "fullname": "Jinliang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186230?format=json", "institution": "Xiamen University"}, {"id": 192141, "fullname": "Jie Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192141?format=json", "institution": "Xiamen University"}, {"id": 192142, "fullname": "Shaohua Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/192142?format=json", "institution": "Xiamen University"}, {"id": 90856, "fullname": "Shuo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/90856?format=json", "institution": "Case Western Reserve University"}], "abstract": "Active Learning (AL) reduces annotation costs in medical imaging by selecting only the most informative samples for labeling, but suffers from cold-start when labeled data are scarce. Vision-Language Models (VLMs) address the cold-start problem via zero-shot predictions, yet their temperature-scaled softmax outputs treat text-image similarities as deterministic scores while ignoring inherent uncertainty, leading to overconfidence. This overconfidence misleads sample selection, wasting annotation budgets on uninformative cases. To overcome these limitations, the Similarity-as-Evidence (SaE) framework calibrates text\u2013image similarities by introducing a Similarity Evidence Head (SEH), which reinterprets the similarity vector as evidence and parameterizes a Dirichlet distribution over labels. In contrast to a standard softmax that enforces confident predictions even under weak signals, the Dirichlet formulation explicitly quantifies lack of evidence (vacuity) and conflicting evidence (dissonance), thereby mitigating overconfidence caused by rigid softmax normalization. Building on this, SaE employs a dual-factor acquisition strategy: high-vacuity samples (e.g., rare diseases) are prioritized in early rounds to ensure coverage, while high-dissonance samples (e.g., ambiguous diagnoses) are prioritized later to refine boundaries, providing clinically interpretable selection rationales. Experiments on ten public medical imaging datasets with a 20% label budget show that SaE attains state-of-the-art macro-averaged accuracy of 82.57%. On the representative BTMRI dataset, SaE also achieves superior calibration, with a negative log-likelihood (NLL) of 0.425.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39472", "url": null, "sourceid": 43255, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40337, "uid": "886438584cd75af26e683e77526dff7e", "name": "SocialNav: Training Human-Inspired Foundation Model for Socially-Aware Embodied Navigation", "authors": [{"id": 189973, "fullname": "Ziyi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189973?format=json", "institution": null}, {"id": 189974, "fullname": "Yingnan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189974?format=json", "institution": "Zhejiang University"}, {"id": 150618, "fullname": "Zedong Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/150618?format=json", "institution": null}, {"id": 189488, "fullname": "Minghua Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189488?format=json", "institution": "Alibaba Group"}, {"id": 189975, "fullname": "Yanfen Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189975?format=json", "institution": "Alibaba Group; Wuhan University"}, {"id": 189976, "fullname": "Mingchao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/189976?format=json", "institution": null}, {"id": 189489, "fullname": "Junjun Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189489?format=json", "institution": "Alibaba Group"}, {"id": 189487, "fullname": "Shichao Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/189487?format=json", "institution": null}, {"id": 189977, "fullname": "Yang Kuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189977?format=json", "institution": null}, {"id": 189978, "fullname": "Pei Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189978?format=json", "institution": "Alibaba Group; Alibaba Group"}, {"id": 189979, "fullname": "Zhining Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189979?format=json", "institution": "Alibaba Group"}, {"id": 189980, "fullname": "Lu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189980?format=json", "institution": null}, {"id": 189981, "fullname": "Honglin Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/189981?format=json", "institution": "Alibaba Group"}, {"id": 128317, "fullname": "Xiaolong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128317?format=json", "institution": "Georgia Institute of Technology"}, {"id": 154906, "fullname": "Mu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154906?format=json", "institution": "Alibaba Group"}, {"id": 73318, "fullname": "Yu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73318?format=json", "institution": "Zhejiang University"}], "abstract": "Embodied navigation that adheres to social norms remains an open research challenge. Our SocialNav is a foundational model for socially-aware navigation with a hierarchical \"brain-action\" architecture, capable of understanding high-level social norms and generating low-level, socially compliant trajectories. To enable such dual capabilities, we construct the SocNav Dataset, a large-scale collection of 7 million samples, comprising (1) a Cognitive Activation Dataset providing social reasoning signals such as chain-of-thought explanations and social traversability prediction, and (2) an Expert Trajectories Pyramid aggregating diverse navigation demonstrations from internet videos, simulated environments, and real-world robots. A multi-stage training pipeline is proposed to gradually inject and refine navigation intelligence: we first inject general navigation skills and social norms understanding into the model via imitation learning, and then refine such skills through a deliberately designed Socially-Aware FlowExploration GRPO (SAFE-GRPO), the first flow-based reinforcement learning framework for embodied navigation that explicitly rewards socially compliant behaviors. SocialNav achieves +38% success rate and +46% social compliance rate compared to the state-of-the-art method, demonstrating strong gains in both navigation performance and social compliance. Data and code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40337", "url": null, "sourceid": -44996, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38491?format=json"], "related_events_ids": [38491]}, {"id": 39867, "uid": "cb507f13bcf4b271a7f0544275e4db34", "name": "GHPT: Real-Time Relightable Gaussian Splatting using Hybrid Path Tracing", "authors": [{"id": 181778, "fullname": "Jinyang Bo", "url": "http://cvpr.thecvf.com/api/miniconf/users/181778?format=json", "institution": "Northwest University, China"}, {"id": 193029, "fullname": "Fan Dou", "url": "http://cvpr.thecvf.com/api/miniconf/users/193029?format=json", "institution": "Northwest University Xi&#x27;an"}, {"id": 193030, "fullname": "Wenrui Quan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193030?format=json", "institution": "Northwest University Xi&#x27;an"}, {"id": 193031, "fullname": "Shangxun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193031?format=json", "institution": "Northwest University Xi&#x27;an"}, {"id": 193032, "fullname": "Yang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193032?format=json", "institution": "Northwest University"}, {"id": 186445, "fullname": "Yuhe Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186445?format=json", "institution": "Northwest University Xi&#x27;an"}, {"id": 149614, "fullname": "Kang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/149614?format=json", "institution": "School of Information Science and Technology, Northwest University,"}, {"id": 186446, "fullname": "GuoHua Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186446?format=json", "institution": null}], "abstract": "3D Gaussian splatting (3DGS) has emerged as a promising approach for high-fidelity 3D scene representation. However, relighting and composition of Gaussian splatting remain challenging because path tracing is not directly applicable. Existing relighting methods for Gaussian splatting typically adopt either approximate rendering formulations or rely on Gaussian ray tracing, yielding low relighting performance and low rendering efficiency. To address these limitations, we propose Gaussian hybrid path tracing (GHPT), a three-stage framework to acquire relightable Gaussian splatting models. The first stage utilizes planar-based Gaussian splatting reconstruction representation (PGSR) to enable multi-view consistent depth rendering and reconstruct the surface mesh of a scene. The second stage performs physically-based differentiable rendering on the obtained mesh to reconstruct the material maps and the environment map. The third stage utilizes factorized inverse path tracing (FIPT) on the G-buffer rendered by the PGSR, and visibility and indirect illumination are evaluated by hardware-accelerated ray tracing on the mesh with the material maps and the environment map reconstructed in the second stage. Experiments demonstrate that the relighting performance of GHPT outperforms the baselines, and our method can perform real-time relighting and composition of Gaussian splatting.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39867", "url": null, "sourceid": 38076, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39582, "uid": "7cbabebc4ff57faf9cd22b4188243793", "name": "PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image", "authors": [{"id": 154399, "fullname": "Ziang Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/154399?format=json", "institution": "Nanyang Technological University"}, {"id": 126298, "fullname": "Fangzhou Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/126298?format=json", "institution": "Nanyang Technological University"}, {"id": 90325, "fullname": "Zhaoxi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/90325?format=json", "institution": "MMLab@NTU, Nanyang Technological University"}, {"id": 127120, "fullname": "Liang Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/127120?format=json", "institution": "Shanghai AI Lab"}, {"id": 89788, "fullname": "Ziwei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89788?format=json", "institution": "Nanyang Technological University"}], "abstract": "3D modeling is shifting from static visual representations toward physical, articulated assets that can be directly used in simulation and interaction. However, most existing 3D generation methods overlook key physical and articulation properties, thereby limiting their utility in embodied AI. To bridge this gap, we introduce \\textbf{PhysX-Anything}, the first \\textbf{simulation-ready} physical 3D generative framework that, given a single in-the-wild image, produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. Specifically, we propose the first VLM-based physical 3D generative model, along with a new 3D representation that efficiently tokenizes geometry. It reduces the number of tokens by \\textbf{193$\\times$}, enabling explicit geometry learning within standard VLM token budgets without introducing any special tokens during fine-tuning and significantly improving generative quality. In addition, to overcome the limited diversity of existing physical 3D datasets, we construct a new dataset, \\textbf{PhysX-Mobility}, which expands the object categories in prior physical 3D datasets by over \\textbf{2$\\times$} and includes more than 2K common real-world objects with rich physical annotations. Extensive experiments on PhysX-Mobility and in-the-wild images demonstrate that PhysX-Anything delivers strong generative performance and robust generalization. Furthermore, simulation-based experiments in a MuJoCo-style environment validate that our sim-ready assets can be directly used for contact-rich robotic policy learning. We believe PhysX-Anything can substantially empower a broad range of downstream applications, especially in embodied AI and physics-based simulation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39582", "url": "https://physx-anything.github.io/", "sourceid": 44939, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65746, "file": "/media/PosterPDFs/CVPR%202026/39582.png", "modified": "2026-04-27T04:39:28.417462-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": true, "sortkey": 0, "is_live_content": false, "detailed_kind": "", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65747, "file": "/media/PosterPDFs/CVPR%202026/39582-thumb.png", "modified": "2026-04-27T04:39:28.616477-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": false, "sortkey": 0, "is_live_content": false, "detailed_kind": "thumb", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65745, "modified": "2026-04-27T02:27:41.603626-07:00", "display_section": 1, "type": "PDF", "name": "Slides", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "/media/cvpr-2026/Slides/39582.pdf", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36231, "uid": "d23941275ef524a546d5921aa8c5af2d", "name": "Same Attention, Different Truths: Put Logit-Lens over Visual Attention to Detect and Mitigate LVLM Object Hallucination", "authors": [{"id": 181101, "fullname": "Zichuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181101?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 184518, "fullname": "Songlin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184518?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 184519, "fullname": "Bo Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184519?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 184520, "fullname": "Zhenchen Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184520?format=json", "institution": "NLPR, Institute of Automation, Chinese Academy of Sciences"}, {"id": 184521, "fullname": "Yang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184521?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 184522, "fullname": "BeibeiDong BeibeiDong", "url": "http://cvpr.thecvf.com/api/miniconf/users/184522?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences; Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 89679, "fullname": "Jing Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/89679?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}], "abstract": "Large Vision-Language Models (LVLMs) often suffer from object hallucination, generating objects that are absent from the image. Prior work largely attributes this to insufficient visual attention. However, in this work, we are surprised to find that both real and hallucinated objects receive equally strong visual attention in the model\u2019s mid-to-late layers. This indicates that the key issue may not be how much the model attends, but **what it attends to and why**. To this end, we decode the visual features of high-attention regions using Logit Lens, and observe that high-attention regions corresponding to real objects can be correctly decoded to the target object token, whereas those for hallucinated objects cannot. Building on this, we identify two distinct hallucination mechanisms: **(i) visual uncertainty**, triggered by semantically similar or confusable regions, masking these regions eliminates the hallucination. **(ii) contextual prior**, triggered by strong co-occurrence priors, even when the initially attended region is masked, the hallucination persists and attention drifts to other regions. Based on these findings, we propose a simple yet effective training-free **Detect\u2013Mitigate framework** comprising a Logit-Lens Consistency Check to detect hallucination and targeted remedies: High-Attention Regions Masking (HARM) for visual uncertainty hallucination, and Visual Evidence Enhanced Decoding (VEED) for contextual prior hallucination, which leverages genuine visual evidence to suppress erroneous priors. Our approach achieves state-of-the-art results on multiple hallucination benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36231", "url": null, "sourceid": 37290, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38150, "uid": "819fefda81994b411a379b51918baa45", "name": "SceMoS: Local Scene-Aware Human Motion Synthesis by Planning with Geometry-Grounded Tokens", "authors": [{"id": 158729, "fullname": "Anindita Ghosh", "url": "http://cvpr.thecvf.com/api/miniconf/users/158729?format=json", "institution": "MPI-Informatics, Saarland University"}, {"id": 75786, "fullname": "Vladislav Golyanik", "url": "http://cvpr.thecvf.com/api/miniconf/users/75786?format=json", "institution": "MPI for Informatics"}, {"id": 89090, "fullname": "Taku Komura", "url": "http://cvpr.thecvf.com/api/miniconf/users/89090?format=json", "institution": "the University of Hong Kong, University of Hong Kong"}, {"id": 189165, "fullname": "Philipp Slusallek", "url": "http://cvpr.thecvf.com/api/miniconf/users/189165?format=json", "institution": "German Research Center for Artificial Intelligence (DFKI); Saarland University"}, {"id": 75465, "fullname": "Christian Theobalt", "url": "http://cvpr.thecvf.com/api/miniconf/users/75465?format=json", "institution": "MPI Informatik"}, {"id": 89511, "fullname": "Rishabh Dabral", "url": "http://cvpr.thecvf.com/api/miniconf/users/89511?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}], "abstract": "Synthesizing text-driven 3D human motion within realistic scenes requires learning both semantic intent (\u201cwalk to the couch\u201d) and physical feasibility (e.g., avoiding collisions). Current methods use generative frameworks that simultaneously learn high-level planning and low-level contact reasoning, and rely on computationally expensive 3D scene data such as point clouds or voxel occupancy grids. We propose SceMoS, a scene-aware motion synthesis framework that shows that structured 2D scene representations can serve as a powerful alternative to full 3D supervision in physically grounded motion synthesis. SceMoS disentangles global planning from local execution using lightweight 2D cues and relying on (1) a text-conditioned autoregressive global motion planner that operates on a top-down bird\u2019s-eye-view (BEV) image of the scene, encoded with DINOv2 features, as the scene representation, and (2) geometry-grounded motion tokenizer trained via a conditional VQ-VAE, that uses 2D local scene heightmap, thus embedding surface physics directly into a discrete vocabulary. This 2D factorization reaches an efficiency-fidelity trade-off: BEV semantics capture spatial layout and affordance for global reasoning, while local heightmaps enforce fine-grained physical adherence without full 3D volumetric reasoning. SceMoS achieves state-of-the-art motion realism and contact accuracy on the TRUMANS benchmark, reducing the number of trainable parameters for scene encoding by over 50%, showing that 2D scene cues can effectively ground 3D human-scene interaction.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38150", "url": null, "sourceid": 46071, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37195, "uid": "62f2e5c3c0b25e4cc73b796979789a7f", "name": "Urban-GS: A Unified 3D Gaussian Splatting Framework for Compact and High-Fidelity Aerial-to-Street Reconstruction", "authors": [{"id": 186885, "fullname": "Meng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186885?format=json", "institution": "Beihang University"}, {"id": 186886, "fullname": "Changqun Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/186886?format=json", "institution": "Peng Cheng Laboratory"}, {"id": 186887, "fullname": "Yuze Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186887?format=json", "institution": "Beihang University"}, {"id": 186888, "fullname": "Junyi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186888?format=json", "institution": "Shandong University"}, {"id": 186889, "fullname": "Wantong Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186889?format=json", "institution": "Beihang University"}, {"id": 186890, "fullname": "Xinxiong Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/186890?format=json", "institution": "PengCheng Labratory"}, {"id": 186891, "fullname": "Yue Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186891?format=json", "institution": null}], "abstract": "Recently, 3D Gaussian Splatting (3DGS) has revolutionized radiance field reconstruction,  enabling efficient and high-fidelity novel view synthesis. However, seamless integration of both aerial and street view images to model urban scenes remains a significant challenge for 3DGS. This joint setting suffers from extreme view coverage disparity, complex multi-scale details, and imbalanced viewpoint distributions.In this work, we present Urban-GS, a novel framework built upon Gaussian Splatting for the compact unified reconstruction and high-fidelity rendering of urban scenes from both aerial and street views. Specifically, we first develop an Aerial-Street Joint Adaptive Densification method to resolve the densification conflicts arising from large view coverage disparity. We then introduce a Contribution-based Anchor Pruning strategy to effectively mitigate the storage overhead from capturing multi-scale scene details. Furthermore, we propose a Global-to-Local Optimization strategy to refine the reconstruction of under-optimized regions resulting from imbalanced view distributions. Experiments across diverse urban scene datasets demonstrate that Urban-GS significantly outperforms the state-of-the-art method in novel-view rendering quality, while simultaneously reducing storage overhead by an average of 41\\%.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37195", "url": null, "sourceid": 41737, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38716, "uid": "ac0d7b02669d3fc473a8a11232e89d82", "name": "FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding", "authors": [{"id": 190516, "fullname": "Yiweng Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/190516?format=json", "institution": "Fudan University"}, {"id": 73872, "fullname": "Bo He", "url": "http://cvpr.thecvf.com/api/miniconf/users/73872?format=json", "institution": "Facebook"}, {"id": 77020, "fullname": "Junke Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77020?format=json", "institution": "Fudan University"}, {"id": 190517, "fullname": "Xiangyu Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190517?format=json", "institution": "Fudan University"}, {"id": 190518, "fullname": "Ziyi Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/190518?format=json", "institution": "Fudan University"}, {"id": 74132, "fullname": "Zuxuan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/74132?format=json", "institution": "Fudan University"}], "abstract": "This paper presents FluxMem, a training-free and adaptive memory framework for efficient streaming video understanding. FluxMem progressively compresses visual memory through a hierarchical redundancy reduction process. Specifically, Temporal Adjacency Selection (TAS) removes redundant tokens across adjacent frames to alleviate temporal redundancy, while Spatial Domain Consolidation (SDC) further merges spatially repetitive regions within each frame into compact representations. To ensure robustness across diverse scene dynamics, both modules employ adaptive thresholds derived from intrinsic scene statistics, automatically adjusting the compression rate without manual tuning. Extensive experiments demonstrate that FluxMem establishes a new state of the art on online benchmarks, achieving 76.4 on StreamingBench and 66.3 on OVO-Bench in real time. Furthermore, it exhibits strong offline performance, attaining 73.1 on MLVU while using 65% fewer visual tokens.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38716", "url": null, "sourceid": 30753, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38541, "uid": "b3b0b34ebdc9b8ba6bd98224365ed43d", "name": "The Drift Kernel: Why Diffusion Models Change Even When Told Not To", "authors": [{"id": 184179, "fullname": "Gokul Srinath Seetha Ram", "url": "http://cvpr.thecvf.com/api/miniconf/users/184179?format=json", "institution": "Independent Author"}, {"id": 190092, "fullname": "Rashmi Elavazhagan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190092?format=json", "institution": null}], "abstract": "Even when told to \u201cdo nothing,\u201d modern diffusion models subtly alter their output relative to the input they are supposed to preserve. We call this effect **No-Op Drift**. We introduce the **Drift Kernel**$$K_M(\\sigma)=\\mathbb{E}[|I' - I_0|_2^2 \\mid \\sigma],$$the expected perceptual deviation induced when running a diffusion model at noise strength $\\sigma$ under a null instruction. Using 120{,}000 baseline samples (30{,}000 per model across SD15, SD21, SDXL, and InstructPix2Pix) and 9{,}600 ablation samples (four strengths, null vs. strict copy prompts), we show that variance-driven diffusion models follow a quadratic form$$K_M(\\sigma)\\approx k_M\\sigma^2 + c_M,$$with aggregate $R^2=0.97$. We derive this scaling from first principles via a Taylor expansion of the decoder, yielding $k_M=\\mathrm{Tr}(J_D J_D^\\top)$, which depends only on the decoder Jacobian\u2014not on prompts. To validate mechanistic structure, we construct synthetic decoders that reproduce the two regimes seen in practice: quadratic variance-driven drift and flat, high-variance edit-driven drift. We show that prompt wording has negligible effect ($<17%$ coefficient difference), proving that drift is structural, not prompt-induced. We release **NoOp-Bench**, a benchmark with 10{,}000 inputs and full code for reproducible kernel estimation. Additional proofs, ablations, LPIPS/CLIP metrics, and extended visualizations appear in the Supplementary Material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38541", "url": null, "sourceid": 32137, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38795, "uid": "e940b96a746bb2917d9d8a48227456e2", "name": "RAW-Domain Degradation Models for Realistic Smartphone Super-Resolution", "authors": [{"id": 190685, "fullname": "Ali Mosleh", "url": "http://cvpr.thecvf.com/api/miniconf/users/190685?format=json", "institution": "Samsung Electronics (AI Center\u2013Toronto)"}, {"id": 190686, "fullname": "Faraz Ali", "url": "http://cvpr.thecvf.com/api/miniconf/users/190686?format=json", "institution": "Samsung"}, {"id": 156295, "fullname": "Fengjia Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156295?format=json", "institution": "Samsung"}, {"id": 190687, "fullname": "Stavros Tsogkas", "url": "http://cvpr.thecvf.com/api/miniconf/users/190687?format=json", "institution": "Samsung"}, {"id": 69239, "fullname": "Junyong Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/69239?format=json", "institution": "Samsung"}, {"id": 89869, "fullname": "Michael S. Brown", "url": "http://cvpr.thecvf.com/api/miniconf/users/89869?format=json", "institution": "Samsung / York"}, {"id": 76844, "fullname": "Alex Levinshtein", "url": "http://cvpr.thecvf.com/api/miniconf/users/76844?format=json", "institution": "Samsung"}], "abstract": "Digital zoom on smartphones relies on learning-based super-resolution (SR) models that operate on RAW sensor images, but obtaining sensor-specific training data is challenging due to the lack of ground-truth images. Synthetic data generation via ``unprocessing'' pipelines offers a potential solution by simulating the degradations that transform high-resolution (HR) images into their low-resolution (LR) counterparts. However, these pipelines can introduce domain gaps due to incomplete or unrealistic degradation modeling. In this paper, we demonstrate that principled and carefully designed degradation modeling can enhance SR performance in real-world conditions. Instead of relying on generic priors for camera blur and noise, we model device-specific degradations through calibration and unprocess publicly available rendered images into the RAW domain of different smartphones. Using these image pairs, we train a single-image RAW-to-RGB SR model and evaluate it on real data from a held-out device. Our experiments show that accurate degradation modeling leads to noticeable improvements, with our SR model outperforming baselines trained on large pools of arbitrarily chosen degradations. We will make our calibrated kernels and noise models publicly available, to facilitate research on image enhancement for mobile photography.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38795", "url": null, "sourceid": 45940, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36786, "uid": "b858ea5596a4dcd199209deb235bf758", "name": "MDS-VQA: Model-Informed Data Selection for Video Quality Assessment", "authors": [{"id": 181281, "fullname": "Jian Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/181281?format=json", "institution": "City University of Hong Kong"}, {"id": 175300, "fullname": "Xiaoyu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175300?format=json", "institution": "City University of Hong Kong"}, {"id": 185868, "fullname": "Zhihua Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185868?format=json", "institution": "City University of Hong Kong"}, {"id": 185869, "fullname": "Yilin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185869?format=json", "institution": "Google"}, {"id": 185870, "fullname": "Balu Adsumilli", "url": "http://cvpr.thecvf.com/api/miniconf/users/185870?format=json", "institution": "YouTube"}, {"id": 84859, "fullname": "Kede Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/84859?format=json", "institution": "City University of Hong Kong"}], "abstract": "Recent advances in learning-based video quality assessment (VQA) have achieved remarkable progress, yet the two fundamental components, model and data, are often studied in isolation.Model-centric approaches tend to design superior architectures over fixed and repeatedly used datasets, risking overfitting to benchmark-specific characteristics. In contrast, data-centric efforts emphasize constructing large-scale datasets through costly and time-consuming subjective experiments, typically overlooking the strengths and failure modes of existing VQA models. This separation limits progress, leading to brittle generalization and inefficient use of annotation resources.To bridge the gap, we introduce MDS-VQA, a model-informed data selection method that integrates model-centric and data-centric VQA. In its specific instantiation, a learned failure prediction module trained via a learning-to-rank formulation is combined with a content diversity measure based on deep semantic video features.Experiments across multiple VQA datasets demonstrate that MDS-VQA effectively spots diverse and challenging samples that expose model weaknesses.The selected videos are proven to be particularly informative for fine-tuning, offering a principled path toward constructing more challenging datasets and developing more generalizable and robust VQA models.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36786", "url": null, "sourceid": 45801, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39623, "uid": "713db6add2b5e85a240d2daffd3e9dab", "name": "Hierarchical Point-Patch Fusion with Adaptive Patch Codebook for 3D Shape Anomaly Detection", "authors": [{"id": 192501, "fullname": "Xueyang Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192501?format=json", "institution": "The  University of Melbourne"}, {"id": 183287, "fullname": "Zizhao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183287?format=json", "institution": "The University of Melbourne"}, {"id": 192502, "fullname": "Tian Lan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192502?format=json", "institution": null}, {"id": 88361, "fullname": "Dong Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/88361?format=json", "institution": "University of New South Wales"}, {"id": 191224, "fullname": "Kourosh Khoshelham", "url": "http://cvpr.thecvf.com/api/miniconf/users/191224?format=json", "institution": "University of Melbourne"}, {"id": 127808, "fullname": "Liangliang Nan", "url": "http://cvpr.thecvf.com/api/miniconf/users/127808?format=json", "institution": "Delft University of Technology"}], "abstract": "3D shape anomaly detection is a crucial task for industrial inspection and geometric analysis. Existing deep learning approaches typically learn representations of normal shapes and identify anomalies via out-of-distribution feature separation or decoder-based reconstruction. They often fail to generalize across diverse anomaly types and scales, such as global geometric errors (e.g., planar shifts, surface misalignments), and are sensitive to noisy or incomplete local points during training. To address these limitations, we propose a hierarchical point\u2013patch anomaly scoring network that jointly models regional part features and local point features for robust anomaly reasoning. An adaptive patchification module integrates self-supervised decomposition to capture complex structural deviations. Beyond evaluations on public benchmarks (Anomaly-ShapeNet and Real3D-AD), we release an industrial test set with real CAD models exhibiting planar, angular, and structural defects. Experiments on public and industrial datasets show superior AUC-ROC and AUC-PR performance, including over 50\\% point-level improvement on the new industrial anomaly type and average object-level gains of 7\\% on Real3D-AD and 4\\% on Anomaly-ShapeNet, demonstrating strong robustness and generalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39623", "url": null, "sourceid": 43972, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38346, "uid": "9bc792898a1f66e2ed7c2e2407b3fe8c", "name": "RoMo: A Large-Scale, Richly Organized Dataset and Semantic Taxonomy for Human Motion Generation", "authors": [{"id": 69924, "fullname": "Jiahao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/69924?format=json", "institution": "The Australian National University"}, {"id": 164445, "fullname": "Joseph Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/164445?format=json", "institution": "Roblox"}, {"id": 189660, "fullname": "Young-Yoon Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/189660?format=json", "institution": null}, {"id": 189661, "fullname": "Seonghyeon Moon", "url": "http://cvpr.thecvf.com/api/miniconf/users/189661?format=json", "institution": null}, {"id": 189662, "fullname": "Victor Zordan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189662?format=json", "institution": "Roblox"}, {"id": 92141, "fullname": "Guy Tevet", "url": "http://cvpr.thecvf.com/api/miniconf/users/92141?format=json", "institution": "Tel Aviv University"}, {"id": 85573, "fullname": "Karen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85573?format=json", "institution": "Computer Science Department, Stanford University"}, {"id": 86243, "fullname": "Stephen Gould", "url": "http://cvpr.thecvf.com/api/miniconf/users/86243?format=json", "institution": "Australian National University"}, {"id": 189663, "fullname": "Oren Jacob", "url": "http://cvpr.thecvf.com/api/miniconf/users/189663?format=json", "institution": "Roblox"}, {"id": 189664, "fullname": "Haomiao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189664?format=json", "institution": "Roblox"}, {"id": 189665, "fullname": "Mubbasir Kapadia", "url": "http://cvpr.thecvf.com/api/miniconf/users/189665?format=json", "institution": "Rutgers University ; Roblox"}, {"id": 86255, "fullname": "Yizhak Ben-Shabat", "url": "http://cvpr.thecvf.com/api/miniconf/users/86255?format=json", "institution": "Technion, Israel Institute of Technology"}], "abstract": "Success in generative modeling across language, image, and video demonstrates that large, well-curated datasets are the key driver for building capable models. 3D Human motion, however, has lagged behind, constrained by an unsatisfying choice between small, high-fidelity motion capture datasets and large-scale in-the-wild collections dominated by static or low-quality sequences.We introduce RoMo, a rich, large-scale, carefully curated dataset of in-the-wild human motions that resolves these tradeoffs. To ensure quality, we introduce a taxonomy-aware filtering pipeline that aggressively removes static and artifact-prone sequences. Every sequence is annotated with detailed captions and organized by a novel three-level semantic taxonomy. This hierarchical structure provides the first benchmark for fine-grained, per-category evaluation, revealing model strengths and weaknesses obscured by global metrics. We demonstrate that models trained on RoMo achieve state-of-the-art fidelity and diversity while gaining a superior understanding of complex, subtle text prompts. Finally, we release the Motion Toolbox to standardize metrics, data conversion, and visualization, establishing a foundation for reproducible and interpretable motion generation research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38346", "url": null, "sourceid": 31605, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37726, "uid": "6f9f69256d91652afad06c9c4894cce6", "name": "Act2See: Emergent Active Visual Perception for Video Reasoning", "authors": [{"id": 87384, "fullname": "Martin Q. Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/87384?format=json", "institution": "Carnegie Mellon University"}, {"id": 188101, "fullname": "Yuxiao Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188101?format=json", "institution": "Carnegie Mellon University"}, {"id": 188102, "fullname": "Aditya Agrawal", "url": "http://cvpr.thecvf.com/api/miniconf/users/188102?format=json", "institution": null}, {"id": 188103, "fullname": "Willis Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188103?format=json", "institution": null}, {"id": 188104, "fullname": "Paul Pu Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188104?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 188105, "fullname": "Ruslan Salakhutdinov", "url": "http://cvpr.thecvf.com/api/miniconf/users/188105?format=json", "institution": "School of Computer Science, Carnegie Mellon University"}, {"id": 188106, "fullname": "Louis-Philippe Morency", "url": "http://cvpr.thecvf.com/api/miniconf/users/188106?format=json", "institution": "Carnegie Mellon University"}], "abstract": "Vision-Language Models (VLMs) typically rely on static initial frames for video reasoning, restricting their ability to incorporate essential dynamic information as the reasoning process evolves. Existing methods that augment Chain-of-Thought (CoT) with additional frame information often exhibit suboptimal CoT quality and lack the crucial ability to synthesize visual information for hypothetical or counterfactual scenarios. We introduce Act-to-See (Act2See), a novel framework that enables active visual perception by empowering VLMs to actively interleave video frames within text CoTs. Act2See is developed via Supervised Fine-Tuning (SFT) on a high-quality dataset of reasoning traces generated by a frontier VLM. These traces integrate active calls to either retrieve existing frames or generate new ones, and are rigorously verified against human-annotated CoTs to ensure quality. This approach cultivates an emergent capability: at inference time, the model actively determines when to search for or synthesize the necessary visual evidence. Act2See establishes new state-of-the-art results on challenging benchmarks, including VideoEspresso and ViTIB, and outperforms comparable or larger models on Video-MME, EgoNormia, and VCR-Bench, demonstrating an advancement in enabling VLMs with active visual perception for video reasoning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37726", "url": null, "sourceid": 43129, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36344, "uid": "244a2a16c9d3a2e517a188c9596676fe", "name": "UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection", "authors": [{"id": 146684, "fullname": "Yanran Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/146684?format=json", "institution": "Tsinghua University"}, {"id": 130710, "fullname": "Wenzhao Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/130710?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 155028, "fullname": "Yifei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155028?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 184825, "fullname": "Bingyao Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184825?format=json", "institution": "Tsinghua University"}, {"id": 184826, "fullname": "Yu Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184826?format=json", "institution": "Tsinghua University"}, {"id": 157160, "fullname": "Lei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/157160?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 88597, "fullname": "Jiwen Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88597?format=json", "institution": "Tsinghua University"}, {"id": 77142, "fullname": "Jie Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/77142?format=json", "institution": "Tsinghua University"}], "abstract": "In recent years, significant progress has been made in both image generation and generated image detection. Despite their rapid, yet largely independent, development, these two fields have evolved distinct architectural paradigms: the former predominantly relies on generative networks, while the latter favors discriminative frameworks. A recent trend in both domains is the use of adversarial information to enhance performance, revealing potential for synergy. However, the significant architectural divergence between them presents considerable challenges. Departing from previous approaches, we propose **UniGenDet**: a Unified generative-discriminative framework for co-evolutionary image Generation and generated image Detection. To bridge the task gap, we design a symbiotic multimodal self-attention mechanism and a unified fine-tuning algorithm. This synergy allows the generation task to improve the interpretability of authenticity identification, while authenticity criteria guide the creation of higher-fidelity images. Furthermore, we introduce a detector-informed generative alignment mechanism to facilitate seamless information exchange. Extensive experiments on multiple datasets demonstrate that our method achieves state-of-the-art performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36344", "url": "https://ivg-yanranzhang.github.io/UniGenDet/", "sourceid": 40277, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39465, "uid": "543f36cbc029e57fb098cead9cec5cb6", "name": "Distributed Image Compression with Multimodal Side Information at Extremely Low Bitrates", "authors": [{"id": 179925, "fullname": "Guojun Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/179925?format=json", "institution": "Wuhan University of Technology"}, {"id": 192129, "fullname": "MingyangZhang MingyangZhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192129?format=json", "institution": "Wuhan University of Technology"}, {"id": 192130, "fullname": "Jianwen Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192130?format=json", "institution": null}, {"id": 192131, "fullname": "Cheng Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192131?format=json", "institution": null}, {"id": 192132, "fullname": "Yanchao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192132?format=json", "institution": null}, {"id": 192133, "fullname": "Junwei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192133?format=json", "institution": "Wuhan University of Technology"}], "abstract": "Distributed Image Compression (DIC) is crucial for multi-view transmission, especially when operating at extremely low bitrates ($<$ 0.1 bpp). Its core challenge is effectively utilizing side information to achieve high-quality reconstruction under strict bitrate budgets. However, existing DIC approaches struggle to exploit global context and object-level details from side information, leading to local blurring and the loss of fine details in the reconstruction. To address these limitations, we propose a Multimodal DIC framework (MDIC), which, for the first time, leverages side information in a multimodal manner into the DIC paradigm, effectively preserving fine-grained local details and enhancing global perceptual quality in reconstructed images. Specifically, we introduce a text-to-image diffusion-based decoder conditioned on textual side information extracted from correlated images to capture shared global semantics. Moreover, we design a feature-mask generator, supervised by a multimodal fine-grained alignment task, to strengthen the exploitation of visual side information. The generated mask serves two purposes: it first guides the extraction of fine-grained details from losslessly transmitted side information to preserve the semantic consistency of reconstructed details; then, it regulates the extraction of clustered feature representations from the quantized VQ-VAE embeddings, compensating for category information lost under the extreme compression of the primary image. Extensive experiments on the widely used KITTI Stereo and Cityscapes datasets demonstrate that MDIC achieves state-of-the-art perceptual quality at extremely low bitrates.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39465", "url": "https://mommqq.github.io/MDIC-proj/", "sourceid": 37432, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38517, "uid": "ddf79cabfa698a8d8a713582d8a3fbff", "name": "BeautyGRPO: Aesthetic Alignment for Face Retouching via Dynamic Path Guidance and Fine-Grained Preference Modeling", "authors": [{"id": 190042, "fullname": "Jiachen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190042?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 190043, "fullname": "Xianhui Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190043?format=json", "institution": "vivo"}, {"id": 190044, "fullname": "Yi Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/190044?format=json", "institution": "Nanyang Technological University"}, {"id": 190045, "fullname": "Zebiao Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190045?format=json", "institution": "Vivo"}, {"id": 156966, "fullname": "Xing Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156966?format=json", "institution": "vivo"}, {"id": 156967, "fullname": "Hong Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156967?format=json", "institution": "Hangzhou VIVO Information Technology Co., Ltd"}, {"id": 190046, "fullname": "Yanmei Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190046?format=json", "institution": "Sun Yat-sen University"}], "abstract": "Face retouching requires removing subtle imperfections while preserving unique facial identity  features, in order to enhance overall aesthetic appeal. However, existing methods suffer from a fundamental trade-off. Supervised learning on labeled data is constrained to pixel-level label mimicry, failing to capture complex subjective human aesthetic preferences. Conversely, while online reinforcement learning (RL) excels at preference alignment, its stochastic exploration paradigm conflicts with the high-fidelity demands of face retouching and often introduces noticeable noise artifacts due to accumulated stochastic drift. To address these limitations, we propose BeautyGRPO, a reinforcement learning framework that aligns face retouching with human aesthetic preferences. We construct FRPref-10K, a fine-grained preference dataset covering five key retouching dimensions, and train a specialized reward model capable of evaluating subtle perceptual differences. To reconcile exploration and fidelity, we introduce Dynamic Path Guidance (DPG). DPG stabilizes the stochastic sampling trajectory by dynamically computing an anchor-based ODE path and replanning a guided trajectory at each sampling timestep, effectively correcting stochastic drift while maintaining controlled exploration. Extensive experiments show that BeautyGRPO outperforms both specialized face retouching methods and general image editing models, achieving superior texture quality, more accurate blemish removal, and overall results that better align with human aesthetic preferences. Our code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38517", "url": null, "sourceid": 41811, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36836, "uid": "f6f0655687b41bca41c7cb54c5d407ba", "name": "Causality in Video Diffusers is Separable from Denoising", "authors": [{"id": 133226, "fullname": "Xingjian Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/133226?format=json", "institution": "MIT"}, {"id": 185984, "fullname": "Guande He", "url": "http://cvpr.thecvf.com/api/miniconf/users/185984?format=json", "institution": "University of Texas at Austin"}, {"id": 154196, "fullname": "Zhengqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/154196?format=json", "institution": "Google"}, {"id": 75717, "fullname": "Eli Shechtman", "url": "http://cvpr.thecvf.com/api/miniconf/users/75717?format=json", "institution": "Adobe Research, US"}, {"id": 138396, "fullname": "Xun Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/138396?format=json", "institution": "Adobe Research"}, {"id": 156829, "fullname": "Zongze Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156829?format=json", "institution": "Adobe Research"}], "abstract": "Causality\u2014referring to temporal, uni-directional cause\u2013effect relationships between components\u2014underlies many complex generative processes, including videos, language, and robot trajectories.Current causal diffusion models entangle temporal reasoning with iterative denoising, applying causal attention across all layers, at every denoising step, and over the entire context.In this paper, we show that the causal computation in these models is separable from the multi-step denoising process.Through systematic probing of autoregressive video diffusers, we uncover two key regularities:(1) early blocks produce highly similar features across denoising steps, indicating redundant computation along the diffusion trajectory; and(2) deeper blocks exhibit sparse cross-frame attention and primarily perform intra-frame rendering.Motivated by these findings, we introduce Separable Causal Diffusion (SCD), a new architecture that explicitly decouples once-per-frame temporal reasoning, via a causal transformer encoder, from multi-step frame-wise rendering, via a lightweight diffusion decoder.Extensive experiments on both pretraining and post-training tasks across synthetic and real benchmarks show that CSD significantly improves throughput and latency while matching or surpassing the generation quality of strong causal diffusion baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36836", "url": null, "sourceid": 46064, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36598, "uid": "b47d9065841adbb95bc1254f4e045571", "name": "CLIP Is Shortsighted: Paying Attention Beyond the First Sentence", "authors": [{"id": 156560, "fullname": "Marc-Antoine Lavoie", "url": "http://cvpr.thecvf.com/api/miniconf/users/156560?format=json", "institution": "University of Toronto"}, {"id": 179921, "fullname": "Anas Mahmoud", "url": "http://cvpr.thecvf.com/api/miniconf/users/179921?format=json", "institution": "Mila - Quebec Artificial Intelligence Institute; University of Toronto"}, {"id": 185435, "fullname": "Aldo Zaimi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185435?format=json", "institution": "Mila - Quebec Artificial Intelligence Institute"}, {"id": 185437, "fullname": "Arsene Fansi Tchango", "url": "http://cvpr.thecvf.com/api/miniconf/users/185437?format=json", "institution": "Montreal Institute of Learning Algorithms"}, {"id": 85829, "fullname": "Steven L. Waslander", "url": "http://cvpr.thecvf.com/api/miniconf/users/85829?format=json", "institution": "University of Toronto"}], "abstract": "CLIP models learn transferable multi-modal features via image-text contrastive learning on internet-scale data. They are widely used in zero-shot classification, multi-modal retrieval, text-to-image diffusion, and as image encoders in large vision-language models.  However, CLIP\u2019s pretraining is dominated by images paired with short captions, biasing the model toward encoding simple descriptions of salient objects and leading to coarse alignment on complex scenes and dense descriptions. While recent work mitigates this by fine-tuning on small-scale long-caption datasets, we identify an important common bias: both human- and LLM-generated long captions typically begin with a one-sentence summary followed by a detailed description. We show that this acts as a shortcut during training, concentrating attention on the opening sentence and early tokens and weakening alignment over the rest of the caption. To resolve this, we introduce DeBias-CLIP, which removes the summary sentence during training and applies sentence sub-sampling and text token padding to distribute supervision across all token positions. DeBias-CLIP achieves state-of-the-art long-text retrieval, improves short-text retrieval, and is less sensitive to sentence order permutations. It is a drop-in replacement for Long-CLIP with no additional trainable parameters.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36598", "url": null, "sourceid": 39106, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37508, "uid": "a788f0e71f06b98c423367abd77503d7", "name": "HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT", "authors": [{"id": 183127, "fullname": "Yongsung Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/183127?format=json", "institution": "Seoul National University"}, {"id": 183124, "fullname": "Wooseok Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/183124?format=json", "institution": "Seoul National University"}, {"id": 187600, "fullname": "Jaihyun Lew", "url": "http://cvpr.thecvf.com/api/miniconf/users/187600?format=json", "institution": "Seoul National University"}, {"id": 187601, "fullname": "Hun Hwangbo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187601?format=json", "institution": "Seoul National University"}, {"id": 187602, "fullname": "Jaehoon Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/187602?format=json", "institution": "Seoul National University"}, {"id": 152386, "fullname": "Sungroh Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/152386?format=json", "institution": "Seoul National University"}], "abstract": "Visual Geometry Grounded Transformer (VGGT) has shown significant progress in 3D vision tasks. However, its global attention layers incur quadratic computational cost with respect to the number of input views, becoming a critical bottleneck for scalability. Several sparsification-based acceleration techniques have been proposed to alleviate this issue, but they often suffer from substantial accuracy degradation. We hypothesize that the accuracy degradation stems from the heterogeneity in head-wise sparsification sensitivity, as the existing methods apply a uniform sparsity pattern across all heads. Motivated by this hypothesis, we present a two-stage sparsification pipeline that effectively quantifies and exploits head-wise sparsification sensitivity. In the first stage, we measure head-wise sparsification sensitivity using a novel metric, the Head Sensitivity Score (HeSS), which approximates the Hessian with respect to two distinct error terms on a small calibration set. In the inference stage, we perform HeSS-Guided Sparsification, leveraging the pre-computed HeSS to reallocate the total attention budget\u2014assigning denser attention to sensitive heads and sparser attention to more robust ones. We demonstrate that HeSS effectively captures head-wise sparsification sensitivity and empirically confirm that attention heads in the global attention layers exhibit heterogeneous sensitivity characteristics. Extensive experiments further show that our method effectively mitigates performance degradation under high sparsity, demonstrating strong robustness across varying sparsification levels.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37508", "url": null, "sourceid": 34596, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37538, "uid": "e5f0a515e53bfd721fdb162b9b34d220", "name": "Is Bin Generation Indispensable? A Bin-Generation-Free Dataset Quantization via Semantic Perspective", "authors": [{"id": 176046, "fullname": "Maijie Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/176046?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 131642, "fullname": "Yuhua Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/131642?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 131641, "fullname": "Yixiong Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/131641?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 187674, "fullname": "Yao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187674?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 187675, "fullname": "Chenru Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/187675?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Dataset quantization has recently emerged as a promising solution for mitigating the computational and memory challenges of large-scale datasets. However, existing approaches rely on a bin generation step that is computationally expensive and inefficient for large-scale datasets. Moreover, a fixed drop ratio in its patch dropping step fails to adapt to the diverse redundancy levels across samples, which degrades the representational quality of the quantized coreset. To address these limitations, we present Bin-Generation-Free Dataset Quantization (BGFDQ), a fully restructured framework that incorporates a simple yet effective KNN-based neighbor identification and neighbor-aware coreset selection strategy. We theoretically demonstrate that the proposed selection strategy achieves superior sampling efficiency compared to bin-generation-based methods. Additionally, we introduce an adaptive patch dropping strategy to further enhance the quality of the quantized dataset. Extensive experiments on four image classification benchmarks show that BGFDQ consistently outperforms state-of-the-art baselines. In particular, we achieve up to 5\\% validation accuracy improvement on CIFAR-100. Moreover, our framework successfully scales to datasets containing up to $10^5$ same-class samples while existing bin-generation-based approaches fail due to memory constraints. Code is available at https://anonymous.4open.science/r/BGFDQ-F093.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37538", "url": null, "sourceid": 40809, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39810, "uid": "cf4bc646dd5d0bc48d3f63b68a6f5926", "name": "PoseAnything: Universal Pose-guided Video Generation with Part-aware Temporal Coherence", "authors": [{"id": 181199, "fullname": "Ruiyan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181199?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 71640, "fullname": "Teng Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/71640?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 192905, "fullname": "Kaihui Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192905?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 190681, "fullname": "Zihan Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/190681?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 88671, "fullname": "Ran Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/88671?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 89127, "fullname": "Lizhuang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/89127?format=json", "institution": "Dept. of Computer Sci. &amp; Eng., Shanghai Jiao Tong University"}], "abstract": "Pose-guided video generation refers to controlling the motion of subjects in generated video through a sequence of poses. It enables precise control over subject motion and has important applications in animation. However, current pose-guided video generation methods are limited to accepting only human poses as input, thus generalizing poorly to pose of other subjects. To address this issue, we propose PoseAnything, the first universal pose-guided video generation framework capable of handling both human and non-human characters, supporting arbitrary skeletal inputs. To enhance consistency preservation during motion, we introduce Part-aware Temporal Coherence Module, which divides the subject into different parts, establishes part correspondences, and computes cross-attention between corresponding parts across frames to achieve fine-grained part-level consistency. Additionally, we propose Subject and Camera Motion Decoupled CFG, a novel guidance strategy that, for the first time, enables independent camera movement control in pose-guided video generation, by separately injecting subject and camera motion control information into the positive and negative anchors of CFG. Furthermore, we present XPose, a high-quality public dataset containing 50,000 non-human pose-video pairs, along with an automated pipeline for annotation and filtering. Extensive experiments demonstrate that Pose-Anything significantly outperforms state-of-the-art methods in both effectiveness and generalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39810", "url": null, "sourceid": 35428, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38650, "uid": "24e55cac9cc9e1404b6b557666901797", "name": "SkeletonContext: Skeleton-side Context Prompt Learning for Zero-Shot Skeleton-based Action Recognition", "authors": [{"id": 190386, "fullname": "Ning Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190386?format=json", "institution": "Chang\u2019an University"}, {"id": 190387, "fullname": "Tieyue Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190387?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 190388, "fullname": "Naeha Sharif", "url": "http://cvpr.thecvf.com/api/miniconf/users/190388?format=json", "institution": "University of Western Australia"}, {"id": 73072, "fullname": "Farid Boussaid", "url": "http://cvpr.thecvf.com/api/miniconf/users/73072?format=json", "institution": "The University of Western Australia"}, {"id": 128834, "fullname": "Guangming Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128834?format=json", "institution": "Xidian University"}, {"id": 190389, "fullname": "Lin Mei", "url": "http://cvpr.thecvf.com/api/miniconf/users/190389?format=json", "institution": "Donghailab; Donghai Lab"}, {"id": 88320, "fullname": "Mohammed Bennamoun", "url": "http://cvpr.thecvf.com/api/miniconf/users/88320?format=json", "institution": "University of Western Australia"}, {"id": 90793, "fullname": "Liang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90793?format=json", "institution": "Xidian University"}], "abstract": "Zero-shot skeleton-based action recognition aims to recognize unseen actions by transferring knowledge from seen categories through semantic descriptions. Most existing methods typically align skeleton features with textual embeddings within a shared latent space. However, the absence of contextual cues, such as objects involved in the action, introduces an inherent gap between skeleton and semantic representations, making it difficult to distinguish visually similar actions. To address this, we propose SkeletonContext, a prompt-based framework that enriches skeletal motion representations with language-driven contextual semantics. Specifically, we introduce a Cross-Modal Context Prompt Module, which leverages a pretrained language model to reconstruct masked contextual prompts under guidance derived from LLMs. This design effectively transfers linguistic context to the skeleton encoder for instance-level semantic grounding and improved cross-modal alignment. In addition, a Key-Part Decoupling Module is incorporated to decouple motion-relevant joint features, ensuring robust action understanding even in the absence of explicit object interactions. Extensive experiments on multiple benchmarks demonstrate that SkeletonContext achieves state-of-the-art performance under both conventional and generalized zero-shot settings, validating its effectiveness in reasoning about context and distinguishing fine-grained, visually similar actions. Our project is available at:", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38650", "url": null, "sourceid": 36719, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38576, "uid": "92fb24d9addf85cd522569d24d24b160", "name": "POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation", "authors": [{"id": 181979, "fullname": "Yaohou Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181979?format=json", "institution": "Tohoku University"}, {"id": 190189, "fullname": "Qingzhong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190189?format=json", "institution": "Baidu"}, {"id": 190190, "fullname": "Yongsong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190190?format=json", "institution": "Tohoku University"}, {"id": 190191, "fullname": "Junyi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190191?format=json", "institution": "AWS"}, {"id": 93481, "fullname": "Tomo Miyazaki", "url": "http://cvpr.thecvf.com/api/miniconf/users/93481?format=json", "institution": "Tohoku University"}, {"id": 190192, "fullname": "Shinichiro Omachi", "url": "http://cvpr.thecvf.com/api/miniconf/users/190192?format=json", "institution": "Tohoku University"}], "abstract": "Current visual text generation models struggle with the trade-off between text accuracy and overall image coherence like aesthetic appeal during training. We find that to achieve a high text accuracy could reduce aesthetic score and instruction following capability. Although reinforcement learning approaches can alleviate the problem through aligning with multiple rewards, they are often unstable for text generation, as existing approaches normally optimize multiple rewards in a weighted-sum way. In addition, it is difficult to balance the weight of each reward. Moreover, reinforcement learning requires a set of training instructions. A large number of prompts require more training time and computing resources, while a small set leads to poor performance. Hence, how to select the prompts for efficient and effective training is an unsolved problem.In this study, we propose Pareto-Optimal Curriculum Alignment (POCA), a framework that addresses this issue as a multi-objective problem by: 1) identifying the Pareto-optimal set to avoid simple scalarization and 2) designing an adaptive curriculum alignment strategy to manage a learning sequence of a multi-reward dataset using automatic difficulty assessment, which is crucial for optimal convergence as RL methods explore in a limited data environment. In synergy, POCA finds the Pareto-optimal set in a unified reward space, which eliminates inconsistent signals to find the best trade-off solution from different rewards under an easy-to-hard optimization landscape. The experimental results show that POCA significantly improves all metrics such as CLIP, HPS scores and sentence accuracy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38576", "url": null, "sourceid": 36552, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36297, "uid": "6d17745ad39541ad3f760e9c9b20058b", "name": "Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences", "authors": [{"id": 182151, "fullname": "Julian Kaltheuner", "url": "http://cvpr.thecvf.com/api/miniconf/users/182151?format=json", "institution": "University Bonn"}, {"id": 184723, "fullname": "Hannah Dr\u00f6ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/184723?format=json", "institution": "Rheinische Friedrich-Wilhelms Universit\u00e4t Bonn"}, {"id": 86230, "fullname": "Markus Plack", "url": "http://cvpr.thecvf.com/api/miniconf/users/86230?format=json", "institution": "University of Bonn"}, {"id": 184724, "fullname": "Patrick Stotko", "url": "http://cvpr.thecvf.com/api/miniconf/users/184724?format=json", "institution": "Rheinische Friedrich-Wilhelms Universit\u00e4t Bonn"}, {"id": 72969, "fullname": "Reinhard Klein", "url": "http://cvpr.thecvf.com/api/miniconf/users/72969?format=json", "institution": "University of Bonn"}], "abstract": "emporally consistent surface reconstruction of dynamic 3D objects from unstructured point cloud data remains challenging, especially for very long sequences. Existing methods either optimize deformations incrementally, risking drift and requiring long runtimes, or rely on complex learned models that demand category-specific training. We present Neu-PiG, a fast optimization method based on a novel preconditioned surface encoding that estimates coherent non-rigid deformations without sacrificing temporal stability or accuracy. Our method encodes entire deformations across all time steps at various spatial scales into a multi-resolution latent grid, parameterized by the position and normal direction of a reference surface from a single keyframe. This latent representation is then augmented for time modulation and decoded into per-frame 6-DoF deformations via a lightweight multi-layer perceptron (MLP). To achieve high-fidelity, drift-free surface reconstructions in seconds, we employ Sobolev preconditioning during gradient-based training of the latent space, completely avoiding the need for any explicit correspondences or further priors. Experiments across diverse human and animal datasets demonstrate that Neu-PiG outperforms state-the-art approaches, offering both superior accuracy and scalability to long sequences while running at least 60\u00d7 faster than existing training-free methods and even matching the inference-time performance of heavy pretrained models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36297", "url": "https://github.com/vc-bonn/neu-pig.git", "sourceid": 35699, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40226, "uid": "c53e096440994145aebc056e820ece8b", "name": "Reviving ConvNeXt for Efficient Convolutional Diffusion Models", "authors": [{"id": 182404, "fullname": "Taesung Kwon", "url": "http://cvpr.thecvf.com/api/miniconf/users/182404?format=json", "institution": "KAIST"}, {"id": 127367, "fullname": "Lorenzo Bianchi", "url": "http://cvpr.thecvf.com/api/miniconf/users/127367?format=json", "institution": "CNR-ISTI"}, {"id": 193823, "fullname": "Lennart Wittke", "url": "http://cvpr.thecvf.com/api/miniconf/users/193823?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 193824, "fullname": "Felix Watine", "url": "http://cvpr.thecvf.com/api/miniconf/users/193824?format=json", "institution": null}, {"id": 127354, "fullname": "Fabio Carrara", "url": "http://cvpr.thecvf.com/api/miniconf/users/127354?format=json", "institution": "CNR-ISTI"}, {"id": 85079, "fullname": "Jong Chul Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/85079?format=json", "institution": "Korea Advanced Institute of Science and Technology"}, {"id": 193825, "fullname": "Romann M. Weber", "url": "http://cvpr.thecvf.com/api/miniconf/users/193825?format=json", "institution": "Disney Research, Disney"}, {"id": 152547, "fullname": "Vinicius C. Azevedo", "url": "http://cvpr.thecvf.com/api/miniconf/users/152547?format=json", "institution": "Disney Research, Disney Research"}], "abstract": "Recent diffusion models increasingly favor Transformer backbones, motivated by the remarkable scalability of fully attentional architectures.Yet the locality bias, parameter efficiency, and hardware friendliness\u2014the attributes that established ConvNets as the efficient vision backbone\u2014have seen limited exploration in modern generative modeling. Here we introduce the fully convolutional diffusion model (FCDM), a ConvNeXt-inspired backbone redesigned for conditional diffusion modeling. We find that FCDM-XL, using only 50$\\%$ of the FLOPs of DiT-XL/2, achieves comparable performance while delivering 7$\\times$ and 7.5$\\times$ speedups at 256$\\times$256 and 512$\\times$512 resolutions, respectively. Remarkably, FCDM-XL can be trained on a 4-GPU system, highlighting the exceptional training efficiency of our architecture. Our results demonstrate that modern convolutional designs provide a competitive and highly efficient alternative for scaling diffusion models, reviving ConvNeXt as a simple yet powerful building block for efficient generative modeling.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40226", "url": null, "sourceid": 33932, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37387, "uid": "f471afca234cd00e51058653c32ffca9", "name": "MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models", "authors": [{"id": 180146, "fullname": "Mingrui Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180146?format=json", "institution": "Xiamen University"}, {"id": 187318, "fullname": "Hang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187318?format=json", "institution": "Xiamen University"}, {"id": 131760, "fullname": "Jiayi Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/131760?format=json", "institution": "Xiamen University"}, {"id": 76395, "fullname": "Xiaoshuai Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/76395?format=json", "institution": "Xiamen University"}, {"id": 86308, "fullname": "Rongrong Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/86308?format=json", "institution": "Xiamen University"}], "abstract": "Recent advancements in Unified Multimodal Models (UMMs) have enabled remarkable image understanding and generation capabilities. However, while models like Gemini-2.5-Flash-Image show emerging abilities to reason over multiple related images, existing benchmarks rarely address the challenges of multi-image context generation, focusing mainly on text-to-image or single-image editing tasks. In this work, we introduce MICON-Bench, a comprehensive benchmark covering six tasks that evaluate cross-image composition, contextual reasoning, and identity preservation. We further propose an MLLM-driven Evaluation-by-Checkpoint framework for automatic verification of semantic and visual consistency, where multimodal large language model (MLLM) serves as a verifier. Additionally, we present Dynamic Attention Rebalancing (DAR), a training-free, plug-and-play mechanism that dynamically adjusts attention during inference to enhance coherence and reduce hallucinations. Extensive experiments on various state-of-the-art open-source models demonstrate both the rigor of MICON-Bench in exposing multi-image reasoning challenges and the efficacy of DAR in improving generation quality and cross-image coherence.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37387", "url": null, "sourceid": 44275, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40007, "uid": "a634a5f2675a3e5c04bd4fa3b7a17214", "name": "U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences", "authors": [{"id": 155840, "fullname": "Xiang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155840?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 188169, "fullname": "Alan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188169?format=json", "institution": "National University of Singapore; Shenyang Institute of Automation, Chinese Academy of Sciences"}, {"id": 142971, "fullname": "Youquan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/142971?format=json", "institution": "Fudan University"}, {"id": 188378, "fullname": "Linfeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188378?format=json", "institution": "ByteDance Inc."}, {"id": 76351, "fullname": "Lingdong Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76351?format=json", "institution": "National University of Singapore"}, {"id": 89788, "fullname": "Ziwei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89788?format=json", "institution": "Nanyang Technological University"}, {"id": 131414, "fullname": "Qingshan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131414?format=json", "institution": "Nanjing University of Posts and Telecommunications"}], "abstract": "Modeling dynamic 3D environments from LiDAR sequences is central to building reliable 4D worlds for autonomous driving and embodied AI. Existing generative frameworks, however, often treat all spatial regions uniformly, overlooking the varying uncertainty across real-world scenes. This uniform generation leads to artifacts in complex or ambiguous regions, limiting realism and temporal stability. In this work, we present **U4D**, an uncertainty-aware framework for 4D LiDAR world modeling. Our approach first estimates spatial uncertainty maps from a pretrained segmentation model to localize semantically challenging regions. It then performs generation in a \"hard-to-easy\" manner through two sequential stages: (1) *uncertainty-region modeling*, which reconstructs high-entropy regions with fine geometric fidelity, and (2) *uncertainty-conditioned completion*, which synthesizes the remaining areas under learned structural priors. To further ensure temporal coherence, U4D incorporates a mixture of spatio-temporal (MoST) block that adaptively fuses spatial and temporal representations during diffusion. Extensive experiments show that U4D produces geometrically faithful and temporally consistent LiDAR sequences, advancing the reliability of 4D world modeling for autonomous perception and simulation.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40007", "url": null, "sourceid": 35664, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37620, "uid": "630a1e2b96fd0b0216e3a30b53eaa6c7", "name": "Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing", "authors": [{"id": 180199, "fullname": "Pengzhen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/180199?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 158291, "fullname": "Yanwei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/158291?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 187880, "fullname": "Xiaoyan Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187880?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 187881, "fullname": "Xiaojun Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187881?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 76244, "fullname": "Wu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76244?format=json", "institution": "University of Science and Technology of China"}, {"id": 126593, "fullname": "Weiping Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126593?format=json", "institution": "IIE"}], "abstract": "Recent advancements in diffusion-based image editing pose a significant threat to the authenticity of digital visual content. Traditional embedding-based watermarking methods often introduce perceptible perturbations to maintain robustness, inevitably compromising visual fidelity. Meanwhile, existing zero-watermarking approaches, typically relying on global image features, struggle to withstand sophisticated manipulations. In this work, we reveal a critical insight: while individual image patches undergo substantial alterations during AI-based editing, the relational distance between patch pairs remain largely invariant. Leveraging this property, we propose Relational Zero-Watermarking (Rel-Zero), a novel framework that requires no modification to the original image but derives a unique zero-watermark from these editing-invariant patch relations. By grounding the watermark in intrinsic structural consistency rather than absolute appearance, Rel-Zero provides a non-invasive yet resilient mechanism for content authentication. Extensive experiments demonstrate that Rel-Zero achieves substantially improved robustness across diverse editing models and manipulations compared to prior zero-watermarking approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37620", "url": null, "sourceid": 41321, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38558, "uid": "007619eedbdde16adf6849f0e993f245", "name": "SAM2Text: Towards Prompt-Free and Multi-Resolution Video Scene Text Segmentation", "authors": [{"id": 184246, "fullname": "Jing-Yao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184246?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 190141, "fullname": "Heng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190141?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 190142, "fullname": "Mingsen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190142?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 190143, "fullname": "Binbin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190143?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 152669, "fullname": "Fei Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/152669?format=json", "institution": ", Institute of automation, Chinese academy of science"}], "abstract": "We introduce a novel method for video Scene Text Segmentation (STS), a task critical for understanding dynamic visual content. Despite the success of foundation models like SAM2 in generic segmentation, their application to video STS is hindered by the reliance on external prompts, limited output resolution, and instability in video sequences. To address these, we present a comprehensive framework based on SAM2. First, we fine-tune the image encoder using LoRA and integrate a self-prompting module, enabling the model to autonomously generate text-specific prompts. Second, we augment the decoder with additional upsampling branches at 512\u00d7512 and 1024\u00d71024 resolutions, complementing the original 256\u00d7256 output to produce high-fidelity, multi-resolution masks. Third, we enhance the memory mechanism by combining short-term memory with a top-k selection strategy, ensuring temporally consistent and stable segmentation across video frames. A significant obstacle in video STS is data scarcity. To this end, we contribute two datasets: STS-SynthV, containing 1,410 synthetic video clips generated via FlowText, and STS-RealV, comprising 660 meticulously annotated real-world video sequences. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple video and image scene text benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38558", "url": null, "sourceid": 35481, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38682, "uid": "6cb7e3021d98159b9f4ce8b55786de04", "name": "VDFE: Difference-Aware 3D Scene Editing with Non-Intrusive Video Diffusion Priors for Multi-View Consistency and Efficiency", "authors": [{"id": 159829, "fullname": "Chao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/159829?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 90787, "fullname": "Fang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90787?format=json", "institution": "Xidian University"}, {"id": 152373, "fullname": "Shuo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152373?format=json", "institution": "Xidian University"}, {"id": 190454, "fullname": "Yang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190454?format=json", "institution": "Xidian University"}, {"id": 190455, "fullname": "Jiahao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190455?format=json", "institution": "Xidian University"}, {"id": 190456, "fullname": "Xinyan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190456?format=json", "institution": "Xidian University"}, {"id": 71561, "fullname": "Lingling Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/71561?format=json", "institution": "Xidian University"}, {"id": 72376, "fullname": "Puhua Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/72376?format=json", "institution": "School of Artificial Intelligence , Xidian University"}, {"id": 90815, "fullname": "Xu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90815?format=json", "institution": "Xidian University"}, {"id": 131849, "fullname": "Wenping Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/131849?format=json", "institution": "Xidian University"}, {"id": 190457, "fullname": "Siqi Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190457?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}], "abstract": "Text-driven 3D editing, enabled by advancements in 3D reconstruction techniques such as NeRF and 3D Gaussian Splatting, aims to provide intuitive scene customization. However, existing methods frequently exhibit limitations in controllability and consistency. To address these shortcomings, we propose \\textbf{VDFE}, a difference-aware 3D scene editing method based on non-intrusive utilization of pre-trained video diffusion priors, which integrates Optimal Control Guided Flow Editing (FlowOCE), Decoupled Flow Difference (DFD), and Difference-Aware Gaussians Editing (DAGE). Specifically, FlowOCE treats the editing process as an optimal control problem, optimizing a noise-free editing trajectory to minimize unintended modifications in non-target region; DFD precisely locates editing region by analyzing flow differences, which supplies priors for the subsequent optimization process; and DAGE leverages precise localization to selectively update 3D Gaussians for efficient and precise refinement. Extensive experiments demonstrate that our method significantly outperforms existing methods in both qualitative and quantitative evaluations, achieving state-of-the-art (SOTA) performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38682", "url": null, "sourceid": 39263, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39518, "uid": "f2e624030d1505c3aabc60d22dc1022a", "name": "OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer", "authors": [{"id": 175673, "fullname": "Haosong Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/175673?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 72347, "fullname": "Hao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/72347?format=json", "institution": "Northwest Polytechnical University"}, {"id": 92957, "fullname": "Yalun Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/92957?format=json", "institution": "Nanyang Technological University"}, {"id": 90429, "fullname": "Yushi Lan", "url": "http://cvpr.thecvf.com/api/miniconf/users/90429?format=json", "institution": "Nanyang Technological University"}, {"id": 87789, "fullname": "Yihang Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/87789?format=json", "institution": "Nanyang Technological University"}, {"id": 192250, "fullname": "Tianyu Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192250?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 192251, "fullname": "Zhengshen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192251?format=json", "institution": "ByteDance Seed"}, {"id": 192252, "fullname": "Yufeng Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192252?format=json", "institution": "Beijing Institute of Technology"}, {"id": 159506, "fullname": "Junfei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/159506?format=json", "institution": "Alibaba Group"}, {"id": 192253, "fullname": "Wenchao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192253?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 89788, "fullname": "Ziwei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89788?format=json", "institution": "Nanyang Technological University"}], "abstract": "General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. In our framework, a GeoAdapter is proposed to encode depth and camera intrinsics/extrinsics into a spatial foundation model. It employs zero-initialized convolutions to progressively inject geometric information without disrupting the foundation model's representation space. This design ensures stable optimization with negligible overhead, maintaining inference speed comparable to VGGT even with multiple additional inputs. Additionally, a stochastic multimodal fusion regimen is proposed, which randomly samples modality subsets per instance during training. This enables an arbitrary number of modality inputs during testing and promotes learning robust spatial representations instead of overfitting to auxiliary cues. Comprehensive experiments on monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation demonstrate that OmniVGGT outperforms prior methods with auxiliary inputs and achieves state-of-the-art results even with RGB-only input. To further highlight its practical utility, we integrated OmniVGGT into vision-language-action (VLA) models. The enhanced VLA model by OmniVGGT not only outperforms the vanilla point-cloud-based baseline on mainstream benchmarks, but also effectively leverages accessible auxiliary inputs to achieve consistent gains on robotic tasks.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39518", "url": null, "sourceid": 41085, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39329, "uid": "9971bf4009012efdc5a9cb7ed7417e90", "name": "TruckDrive: Long-Range Autonomous Highway Driving Dataset", "authors": [{"id": 164473, "fullname": "Filippo Ghilotti", "url": "http://cvpr.thecvf.com/api/miniconf/users/164473?format=json", "institution": "TORC Europe GMBH"}, {"id": 138267, "fullname": "Edoardo Palladin", "url": "http://cvpr.thecvf.com/api/miniconf/users/138267?format=json", "institution": "Torc Robotics"}, {"id": 191861, "fullname": "Samuel Brucker", "url": "http://cvpr.thecvf.com/api/miniconf/users/191861?format=json", "institution": "Torc Robotics"}, {"id": 191862, "fullname": "Adam Sigal", "url": "http://cvpr.thecvf.com/api/miniconf/users/191862?format=json", "institution": "Torc Robotics"}, {"id": 87821, "fullname": "Mario Bijelic", "url": "http://cvpr.thecvf.com/api/miniconf/users/87821?format=json", "institution": "Princeton University"}, {"id": 87808, "fullname": "Felix Heide", "url": "http://cvpr.thecvf.com/api/miniconf/users/87808?format=json", "institution": "Department of Computer Science, Princeton University"}], "abstract": "Safe highway autonomy for heavy trucks remains an open and unsolved challenge: due to long braking distances, scene understanding of hundreds of meters is required for anticipatory planning and to allow safe braking margins. However, existing driving datasets primarily cover urban scenes, with perception effectively limited to short ranges of only up to 100 meters. To address this gap, we introduce TruckDrive, a highway-scale multimodal driving dataset, captured with a sensor suite purpose-built for long range sensing: seven long-range FMCW LiDARs measuring range and radial velocity, three high-resolution short-range LiDARs, eleven 8MP surround cameras with varying focal lengths and ten 4D FMCW radars. The dataset offers 475 thousands samples with 165 thousands densely annotated frames for driving perception benchmarking up to 1,000 meters to for 2D detection and 400 meters for 3D detection, depth estimation, tracking, planning and end to end driving over 20 seconds sequences at highway speeds. We find that state-of-the-art autonomous driving models do not generalize to ranges beyond 150 meters, with drops between 31\\% and 99\\% in 3D perception tasks, exposing a systematic long-range gap that current architectures and training signals cannot close.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39329", "url": null, "sourceid": 40399, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36865, "uid": "7829e6d847f0b9d897d940aa3f3b7b46", "name": "Tell2Adapt: A Unified Framework for Source Free Unsupervised Domain Adaptation via Vision Foundation Model", "authors": [{"id": 172666, "fullname": "Yulong Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/172666?format=json", "institution": "Northeastern University"}, {"id": 180853, "fullname": "Shijie Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180853?format=json", "institution": "Northeastern University, China"}, {"id": 186052, "fullname": "Ziyi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186052?format=json", "institution": "Northeastern University"}, {"id": 186053, "fullname": "Lin Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186053?format=json", "institution": "Northeastern University"}], "abstract": "Source Free Unsupervised Domain Adaptation (SFUDA) is critical for deploying deep learning models across diverse clinical settings. However, existing methods are typically designed for low-gap, specific domain shifts and lack the ability to generalize into a unified, multi-modalities, and multi-target framework, which presents a major barrier to real-world application. To overcome this issue, we introduce Tell2Adapt, a novel SFUDA framework that harnesses the vast, generalizable knowledge of the Vision Foundation Model (VFM). Our approach ensures high-fidelity VFM prompts through Context-Aware Prompts Regularization (CAPR), which robustly translates varied text prompts into canonical instructions. This enables the generation of high-quality pseudo-labels for efficiently adapting the lightweight source model to the target domain. To guarantee clinical reliability, the framework incorporates Visual Plausibility Refinement (VPR), which leverages the VFM's anatomical knowledge to re-ground the adapted model's predictions in the original target image's low-level visual features, effectively removing noise and false positives. We conduct one of the most extensive SFUDA evaluations to date, validating our framework across 10 domain adaptation directions and 22 anatomical targets, including brain, cardiac, polyp, and abdominal targets. Our results demonstrate that Tell2Adapt consistently outperforms existing approaches, achieving SOTA for a unified SFUDA framework in medical image segmentation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36865", "url": null, "sourceid": 31651, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40368, "uid": "e9c519a82a563ea4613914e83d06acda", "name": "3D-LATTE: Latent Space 3D Editing from Textual Instructions", "authors": [{"id": 93733, "fullname": "Maria Parelli", "url": "http://cvpr.thecvf.com/api/miniconf/users/93733?format=json", "institution": "University of T\u00fcbingen"}, {"id": 166991, "fullname": "Michael Oechsle", "url": "http://cvpr.thecvf.com/api/miniconf/users/166991?format=json", "institution": "Google"}, {"id": 192078, "fullname": "Michael Niemeyer", "url": "http://cvpr.thecvf.com/api/miniconf/users/192078?format=json", "institution": "Google"}, {"id": 87927, "fullname": "Federico Tombari", "url": "http://cvpr.thecvf.com/api/miniconf/users/87927?format=json", "institution": "Google, TUM"}, {"id": 69174, "fullname": "Andreas Geiger", "url": "http://cvpr.thecvf.com/api/miniconf/users/69174?format=json", "institution": "University of T\u00fcbingen"}], "abstract": "Despite the recent success of multi-view diffusion models for text/image-based 3D asset generation, instruction-based editing of 3D assets lacks surprisingly far behind the quality of generation models. The main reason is that recent approaches using 2D priors suffer from view-inconsistent editing signals.   Going beyond 2D prior distillation methods and multi-view editing strategies, we propose a training-free editing method that operates within the latent space of a native 3D diffusion model, allowing us to directly manipulate 3D geometry. We guide the edit synthesis by blending 3D attention maps from the generation with the source object. Coupled with geometry-aware regularization guidance, a spectral modulation strategy in the Fourier domain and a refinement step for 3D enhancement, our method outperforms previous 3D editing methods enabling high-fidelity and precise edits across a wide range of shapes and semantic manipulations. Code will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40368", "url": null, "sourceid": -40790, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39440?format=json"], "related_events_ids": [39440]}, {"id": 39316, "uid": "baa0ceb9d3bf8583d22479f67f86d67d", "name": "DF$^2$-VB: Dual-level Fuzzy Fusion with View-specific Boosting for Multi-view Multi-label Classification", "authors": [{"id": 183314, "fullname": "Yuena Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/183314?format=json", "institution": "Beijing University of Technology"}, {"id": 191830, "fullname": "Haichun Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/191830?format=json", "institution": "Fuzhou University"}, {"id": 191831, "fullname": "Yi Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191831?format=json", "institution": "Beijing University of Technology"}, {"id": 180962, "fullname": "Hao Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/180962?format=json", "institution": "Beijing University Of Technology"}, {"id": 87377, "fullname": "Yongjian Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87377?format=json", "institution": "Beijing University of Technology"}, {"id": 131032, "fullname": "Zhen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131032?format=json", "institution": "Beijing University of Technology"}, {"id": 191832, "fullname": "Gengyu Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191832?format=json", "institution": "Beijing University of Technology"}], "abstract": "Multi-view multi-label classification (MVMLC) aims to utilize both consensus and complementarity information to predict potentially relevant labels for samples. Existing MVMLC approaches typically focus on either feature-level fusion, which integrates complementary features for more expressive representations, or decision-level fusion, which aggregates view-specific predictions to exploit label supervision more effectively. In fact, relying solely on feature-level fusion often underutilizes label information and limits discriminability of learned representations, whereas pure decision-level fusion pays insufficient attention to view representation expressiveness and thus constrains classification performance. To address these limitations, we propose DF$^2$-VB, a dual-level fusion framework that jointly exploits complementary strengths to mitigate their respective weaknesses by integrating feature- and decision-level fusion. At the feature level, a Fuzzy Dynamic Fusion (FDF) module maps consensus features into a more compatible fuzzy feature space, where essential features are identified and redundant features are suppressed to further fuse an expressive consensus representation and boost view-specific predictions for decision-level fusion. At the decision level, a View-specific Boosting (VB) strategy adaptively measures the importance of samples and view-specific predictions to strengthen the utilization of supervision for facilitating the discriminability in feature-level fusion. Complementarily, FDF and VB jointly reinforce the model expressiveness and discriminability for reliable predictions. Extensive experiments on multiple public datasets verify the superiority of our strategy over advanced MVMLC models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39316", "url": null, "sourceid": 33583, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39527, "uid": "9d55db9e54e6dfb6ae280528ee34a0a1", "name": "HDW-SR: High-Frequency Guided Diffusion Model based on Wavelet Decomposition for Image Super-Resolution", "authors": [{"id": 156825, "fullname": "Chao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156825?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 147320, "fullname": "Boqian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/147320?format=json", "institution": "Xidian University"}, {"id": 192272, "fullname": "Jinghao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192272?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 156827, "fullname": "Guang Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156827?format=json", "institution": "Xidian university"}], "abstract": "Diffusion-based methods have shown great promise in single image super-resolution (SISR); however, existing approaches often produce blurred fine details due to insufficient guidance in the high-frequency domain. To address this issue, we propose a High-Frequency Guided Diffusion Network based on Wavelet Decomposition (HDW-SR), which replaces the conventional U-Net backbone in diffusion frameworks. Specifically, we perform diffusion only on the residual map, allowing the network to focus more effectively on high-frequency information restoration. We then introduce wavelet-based downsampling in place of standard CNN downsampling to achieve multi-scale frequency decomposition, enabling sparse cross-attention between the high-frequency subbands of the pre-super-resolved image and the low-frequency subbands of the diffused image for explicit high-frequency guidance. Moreover, a Dynamic Thresholding Block (DTB) is designed to refine high-frequency selection during the sparse attention process. During upsampling, the invertibility of the wavelet transform ensures low-loss feature reconstruction. Experiments on both synthetic and real-world datasets demonstrate that HDW-SR achieves competitive super-resolution performance, excelling particularly in recovering fine-grained image details. The code will be available after acceptance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39527", "url": null, "sourceid": 34776, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40172, "uid": "895a699ee499c9e2141d030d61c32ad1", "name": "Driving on Registers", "authors": [{"id": 193705, "fullname": "Ellington Kirby", "url": "http://cvpr.thecvf.com/api/miniconf/users/193705?format=json", "institution": "Valeo"}, {"id": 87147, "fullname": "Alexandre Boulch", "url": "http://cvpr.thecvf.com/api/miniconf/users/87147?format=json", "institution": "valeo.ai"}, {"id": 193706, "fullname": "Yihong Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193706?format=json", "institution": "Valeo"}, {"id": 193707, "fullname": "Yuan Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/193707?format=json", "institution": "Valeo"}, {"id": 86003, "fullname": "Gilles Puy", "url": "http://cvpr.thecvf.com/api/miniconf/users/86003?format=json", "institution": "valeo.ai"}, {"id": 86061, "fullname": "\u00c9loi Zablocki", "url": "http://cvpr.thecvf.com/api/miniconf/users/86061?format=json", "institution": "Valeo"}, {"id": 87148, "fullname": "Andrei Bursuc", "url": "http://cvpr.thecvf.com/api/miniconf/users/87148?format=json", "institution": "valeo.ai"}, {"id": 87186, "fullname": "Spyros Gidaris", "url": "http://cvpr.thecvf.com/api/miniconf/users/87186?format=json", "institution": "Valeo.ai"}, {"id": 87177, "fullname": "Renaud Marlet", "url": "http://cvpr.thecvf.com/api/miniconf/users/87177?format=json", "institution": "INRIA"}, {"id": 127671, "fullname": "Florent Bartoccioni", "url": "http://cvpr.thecvf.com/api/miniconf/users/127671?format=json", "institution": "Valeo"}, {"id": 113787, "fullname": "Anh Quan Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/113787?format=json", "institution": "INRIA"}, {"id": 193708, "fullname": "Nermin Samet", "url": "http://cvpr.thecvf.com/api/miniconf/users/193708?format=json", "institution": "Valeo"}, {"id": 107207, "fullname": "TUAN-HUNG VU", "url": "http://cvpr.thecvf.com/api/miniconf/users/107207?format=json", "institution": "valeo.ai"}, {"id": 77540, "fullname": "Matthieu Cord", "url": "http://cvpr.thecvf.com/api/miniconf/users/77540?format=json", "institution": "Sorbonne Universit\u00e9"}], "abstract": "We present DrivoR, a simple and efficient transformer-based architecture for end-to-end autonomous driving. Our approach builds on pretrained Vision Transformers (ViTs) and introduces camera-aware register tokens that compress multi-camera features into a compact scene representation, significantly reducing downstream computation without sacrificing accuracy. These tokens drive two lightweight transformer decoders that generate and then score candidate trajectories. The scoring decoder learns to mimic an oracle and predicts interpretable sub-scores representing aspects such as safety, comfort, and efficiency, enabling behavior-conditioned driving at inference. Despite its minimal design, DrivoR outperforms or matches strong contemporary baselines across NAVSIM-v1, NAVSIM-v2, and the photorealistic closed-loop HUGSIM benchmark. Our results show that a pure-transformer architecture, combined with targeted token compression, is sufficient for accurate, efficient, and adaptive end-to-end driving. Code and checkpoints will be made available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40172", "url": null, "sourceid": 34414, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36216, "uid": "4e7634b7fd398dff22445063e0266ac7", "name": "Towards Stable Federated Continual Test-Time Adaptation in Wild World", "authors": [{"id": 184472, "fullname": "Liwen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184472?format=json", "institution": "Anhui University"}, {"id": 129216, "fullname": "Xingbo Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/129216?format=json", "institution": "Anhui University"}, {"id": 184473, "fullname": "Iman Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184473?format=json", "institution": "University of Nottingham, Malaysia Campus"}, {"id": 129224, "fullname": "Zhe Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/129224?format=json", "institution": "Anhui University"}], "abstract": "Federated Learning (FL) enables collaborative model training while preserving privacy, but faces challenges with client data heterogeneity and domain shifts during deployment. Although Personalized Federated Learning (PFL) mitigates heterogeneity, it typically requires labelled data from target clients, which is an impractical assumption. Test-Time Adaptation (TTA) offers label-free adaptation, yet its direct use in a continual federated setting risks destabilizing the global model and causing catastrophic forgetting. To address this, we consider the Federated Continual Test-Time Adaptation (FedCTTA) setting, where unlabeled clients arrive sequentially, requiring online adaptation and continuous global model updates. We propose BPFedCTTA, a framework that employs Bayesian Prior-guided Adaptation (BPA) for stable local adaptation via Maximum a Posteriori estimation, and Uncertainty-Gated Single-client Aggregation (UGSA) to selectively integrate updates based on client uncertainty. This approach balances adaptation with knowledge retention, thereby mitigating forgetting. Extensive experiments on cross-domain classification and segmentation show BPFedCTTA outperforms existing FL, PFL, and TTA methods in sequential adaptation and global model improvement. The source code will be made public upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36216", "url": null, "sourceid": 37871, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38779, "uid": "8433ddfeaa9887f62f36e8c05b6e7022", "name": "CIGPose: Causal Intervention Graph Neural Network for Whole-Body Pose Estimation", "authors": [{"id": 180652, "fullname": "Bohao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180652?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 190647, "fullname": "Zhicheng Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190647?format=json", "institution": "Xidian University"}, {"id": 190648, "fullname": "Huixian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190648?format=json", "institution": "Northwest Polytechnical University"}, {"id": 190649, "fullname": "Yangming Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/190649?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}], "abstract": "State-of-the-art whole-body pose estimators often lack robustness, producing anatomically implausible predictions in challenging scenes. We posit this failure stems from spurious correlations learned from visual context, a problem we formalize using a Structural Causal Model (SCM). The SCM identifies visual context as a confounder that creates a non-causal backdoor path, corrupting the model's reasoning. We introduce the Causal Intervention Graph Pose (CIGPose) framework to address this by approximating the true causal effect between visual evidence and pose. The core of CIGPose is a novel Causal Intervention Module: it first identifies confounded keypoint representations via predictive uncertainty and then replaces them with learned, context-invariant canonical embeddings. These deconfounded embeddings are processed by a hierarchical graph neural network that reasons over the human skeleton at both local and global semantic levels to enforce anatomical plausibility. Extensive experiments show CIGPose achieves a new state-of-the-art on COCO-WholeBody. Notably, our CIGPose-x model achieves 67.0\\% AP, surpassing prior methods that rely on extra training data. With the additional UBody dataset, CIGPose-x is further boosted to 67.5\\% AP, demonstrating superior robustness and data efficiency.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38779", "url": null, "sourceid": 46166, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36818, "uid": "ee79d5e1b7c2b698105bc5811e6a3eb5", "name": "Beyond Tie Points: Satellite Image Block Adjustment based on Dense Feature Consistency", "authors": [{"id": 184180, "fullname": "Yi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184180?format=json", "institution": "Wuhan University"}, {"id": 184242, "fullname": "Yi Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184242?format=json", "institution": "Wuhan University"}, {"id": 185943, "fullname": "Lei Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185943?format=json", "institution": "antgroup"}, {"id": 185944, "fullname": "Panwang Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/185944?format=json", "institution": "Wuhan University"}, {"id": 145097, "fullname": "Qiong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/145097?format=json", "institution": "Wuhan University"}, {"id": 185945, "fullname": "Yingying Pei", "url": "http://cvpr.thecvf.com/api/miniconf/users/185945?format=json", "institution": "Wuhan University"}, {"id": 185946, "fullname": "Xuejun Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185946?format=json", "institution": "Wuhan University"}, {"id": 185947, "fullname": "Junjian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185947?format=json", "institution": null}, {"id": 185948, "fullname": "Xiangyuan Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/185948?format=json", "institution": "Ant Group"}, {"id": 154175, "fullname": "Hongwei Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154175?format=json", "institution": "Alibaba Group"}, {"id": 104189, "fullname": "Yongjun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/104189?format=json", "institution": "Wuhan University"}], "abstract": "Owing to the weak stereo geometry of satellite images, Planar Block Adjustment (PBA) is a predominant technique for correcting geometric distortions in satellite images, which treats elevation as a known constraint and primarily optimizes planar coordinates. Existing PBA methods mainly rely on explicit tie points, suffering from parallax caused by inaccurate elevation (e.g., near high buildings) and irreversible error accumulation, which severely degrades adjustment accuracy. In this paper, a \"Beyond Tie Points\" paradigm for satellite image adjustment is proposed. A pretrained feature extractor is employed to extract robust dense features and a parallax-aware confidence map from each image. A gridded coarse-to-fine optimization framework then directly solves for the adjustment parameters basing on confidence-weighted feature consistency. Experiments conducted on multiview satellite image datasets covering Beijing, Guangzhou and San Jose demonstrate that the proposed method is significantly superior to traditional approaches in both accuracy and robustness, reducing the average error by up to 75.43% compared to traditional PBA.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36818", "url": null, "sourceid": 34407, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65748, "file": "/media/PosterPDFs/CVPR%202026/36818.png", "modified": "2026-04-28T05:21:02.545890-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": true, "sortkey": 0, "is_live_content": false, "detailed_kind": "", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65749, "file": "/media/PosterPDFs/CVPR%202026/36818-thumb.png", "modified": "2026-04-28T05:21:02.710042-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": false, "sortkey": 0, "is_live_content": false, "detailed_kind": "thumb", "generated_from": null, "resourcetype": "EventmediaImageFile"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36287, "uid": "a463cac2327534f6f02563ffbdf92918", "name": "Heuristic Self-Paced Learning for  Domain Adaptive Semantic Segmentation under Adverse Conditions", "authors": [{"id": 184688, "fullname": "Shiqin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184688?format=json", "institution": "Wuhan University"}, {"id": 184689, "fullname": "Haoyang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184689?format=json", "institution": "Wuhan University"}, {"id": 184691, "fullname": "Huaizhou Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184691?format=json", "institution": "Wuhan University"}, {"id": 184692, "fullname": "Yinkan He", "url": "http://cvpr.thecvf.com/api/miniconf/users/184692?format=json", "institution": "Wuhan University"}, {"id": 184693, "fullname": "Dongfang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/184693?format=json", "institution": "Wuhan University"}, {"id": 184694, "fullname": "Xiaoqing Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184694?format=json", "institution": "Huazhong University of Science and Technology; Zhongguancun Academy"}, {"id": 184695, "fullname": "Xingyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184695?format=json", "institution": "Wuhan University"}, {"id": 75857, "fullname": "Zheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75857?format=json", "institution": "Wuhan University"}, {"id": 155280, "fullname": "Kaiyan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/155280?format=json", "institution": "Wuhan University"}], "abstract": "The learning order of semantic classes significantly impacts unsupervised domain adaptation for semantic segmentation, especially under adverse weather conditions. Most existing curricula rely on handcrafted heuristics (e.g., fixed uncertainty metrics) and follow a static schedule, which fails to adapt to a model's evolving, high-dimensional training dynamics, leading to category bias. Inspired by Reinforcement Learning, we cast curriculum learning as a sequential decision problem and propose an \\emph{autonomous class scheduler}. This scheduler consists of two components: (i) a high-dimensional state encoder that maps the model's training status into a latent space and distills key features indicative of progress, and (ii) a category-fair policy-gradient objective that ensures balanced improvement across classes. Coupled with mixed source\u2013target supervision, the learned class rankings direct the network\u2019s focus to the most informative classes at each stage, enabling more adaptive and dynamic learning. It is worth noting that our method achieves state-of-the-art performance on three widely used benchmarks (e.g., ACDC, Dark Zurich, and Nighttime Driving), and shows generalization ability in synthetic-to-real semantic segmentation (i.e., SYNTHIA $\\rightarrow$ Cityscapes).", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36287", "url": null, "sourceid": 39993, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39970, "uid": "a8caec1eeb9c042bae95680a60789701", "name": "StyleTextGen: Style-Conditioned Multilingual Scene Text Generation", "authors": [{"id": 191118, "fullname": "Zeyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191118?format=json", "institution": "Nankai University"}, {"id": 193218, "fullname": "Fangmin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193218?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 154622, "fullname": "Yan Shu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154622?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 183226, "fullname": "Yichao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183226?format=json", "institution": "Nankai University"}, {"id": 191116, "fullname": "Liu Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191116?format=json", "institution": "Nankai University"}, {"id": 152555, "fullname": "Yu ZHOU", "url": "http://cvpr.thecvf.com/api/miniconf/users/152555?format=json", "institution": "Nankai University"}], "abstract": "Style-conditioned scene text generation faces unique challenges in extracting precise text styles from complex backgrounds and maintaining fine-grained style consistency across characters, especially for multilingual scripts. We propose StyleTextGen, a novel framework that learns to perceive and replicate visual text styles across different languages and writing systems. Our approach features three key contributions: First, we introduce a dual-branch style encoder dedicated to style modeling, yielding robust multilingual text style representations in complex real-world scenes. Second, we design a text style consistency loss that enhances style coherence and improves overall visual quality. Third, we develop a mask-guided inference strategy that ensures precise style alignment between generated and reference text. To facilitate systematic evaluation, we construct StyleText-CE, a bilingual scene text style benchmark covering both monolingual and cross-lingual settings. Extensive experiments demonstrate that StyleTextGen significantly outperforms existing methods in style consistency and cross-lingual generalization, establishing new state-of-the-art performance in multilingual style-conditioned text generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39970", "url": null, "sourceid": 32598, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38981, "uid": "34ba30e0f862c55b46622ab5d0aec71f", "name": "DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding", "authors": [{"id": 183226, "fullname": "Yichao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183226?format=json", "institution": "Nankai University"}, {"id": 188367, "fullname": "Huawen Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188367?format=json", "institution": "Institute of Information Engineering"}, {"id": 191116, "fullname": "Liu Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191116?format=json", "institution": "Nankai University"}, {"id": 191117, "fullname": "Shiyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191117?format=json", "institution": "Zhongguancun Academy ; Nankai University"}, {"id": 191118, "fullname": "Zeyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191118?format=json", "institution": "Nankai University"}, {"id": 152555, "fullname": "Yu ZHOU", "url": "http://cvpr.thecvf.com/api/miniconf/users/152555?format=json", "institution": "Nankai University"}], "abstract": "GUI agents powered by Multimodal Large Language Models (MLLMs) have demonstrated impressive capability in understanding and executing user instructions. However, accurately grounding instruction-relevant elements from high-resolution screenshots cluttered with irrelevant UI components remains challenging for existing approaches. Inspired by how humans dynamically adjust their perceptual scope to locate task-related regions on complex screens, we propose DRS-GUI, a training-free dynamic region search framework for GUI grounding that can be seamlessly integrated into existing MLLMs. DRS-GUI introduces a lightweight UI Perceptor that performs three human-like perceptual actions (Focus, Shift, and Scatter) to progressively explore the interface and generate region proposals. To dynamically schedule these actions, we further design an Action Planner based on Monte Carlo Tree Search (MCTS). A region quality reward is employed to evaluate and select the highly instruction-relevant region, efficiently pruning redundant UI elements. Experiments demonstrate that DRS-GUI yields a 14\\% improvement on ScreenSpot-Pro for general and GUI-specific MLLMs (Qwen2.5-VL-7B and UGround-V1-7B), significantly enhancing grounding  performance and generalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38981", "url": null, "sourceid": 34452, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36710, "uid": "53eef9f0f3c7ddca94957c056aef9227", "name": "HierEdit : Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing", "authors": [{"id": 185700, "fullname": "Yuyao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185700?format=json", "institution": "Dartmouth College"}, {"id": 185701, "fullname": "Alexander Huang-Menders", "url": "http://cvpr.thecvf.com/api/miniconf/users/185701?format=json", "institution": "Dartmouth College"}, {"id": 87547, "fullname": "Yu-Wing Tai", "url": "http://cvpr.thecvf.com/api/miniconf/users/87547?format=json", "institution": "Dartmouth College"}], "abstract": "High-resolution image editing is essential for professional and creative applications, yet existing multimodal diffusion-based editors remain computationally inefficient and constrained to relatively low resolutions. Current approaches redundantly process the entire image canvas or rely on large-scale high-resolution datasets, resulting in substantial training and inference costs. We introduce **HierEdit**, a region-aware hierarchical diffusion framework designed for efficient and scalable high-resolution image editing. Our method first performs edits on a low-resolution proxy using an off-the-shelf editing model to generate a reference and to localize the modified regions. A hierarchical local-window diffusion model (**Local-Window MMDiT**) that refines only edited regions within the original high-resolution image, while reusing the unaltered regions as conditioning inputs. The low-resolution proxy further provides structural guidance and intermediate denoising supervision (**Inference Acceleration**) , ensuring consistent global semantics and stable generation without the need for full-resolution attention computation. This targeted and hierarchical design enables fast, high-fidelity editing of images up to 4K resolution without requiring any specialized high-resolution training data. Extensive experiments demonstrate that **HierEdit** achieves competitive visual quality on commodity-resolution datasets while significantly accelerating inference and extending seamlessly to ultra-high-resolution 4K editing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36710", "url": null, "sourceid": 39528, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38263, "uid": "08ecfbbc924a19234f7eb081c20d87ac", "name": "Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models", "authors": [{"id": 189448, "fullname": "Shojiro Yamabe", "url": "http://cvpr.thecvf.com/api/miniconf/users/189448?format=json", "institution": "Institute of Science Tokyo"}, {"id": 189449, "fullname": "Futa Kai Waseda", "url": "http://cvpr.thecvf.com/api/miniconf/users/189449?format=json", "institution": "The University of Tokyo"}, {"id": 189450, "fullname": "Daiki Shiono", "url": "http://cvpr.thecvf.com/api/miniconf/users/189450?format=json", "institution": "Tohoku University"}, {"id": 189451, "fullname": "Tsubasa Takahashi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189451?format=json", "institution": "Acompany"}], "abstract": "Recent large vision\u2013language models (LVLMs) have been applied to diverse VQA tasks.However, achieving practical performance typically requires task-specific fine-tuning with large numbers of image-text pairs, which are costly to collect.In this work, we study text-centric training, a setting where only textual descriptions are available and no real images are provided, as a paradigm for low-cost data scaling.Unlike images, whose collection is often restricted by privacy constraints and scarcity in niche domains, text is widely available.Moreover, text is easily editable, enabling automatic diversification and expansion with LLMs at minimal human effort.While this offers clear advantages over image collection in terms of scalability and cost, training on raw text without images still yields limited gains on VQA tasks because of the image\u2013text modality gap.To address this issue, we propose a Text-Printed Image (TPI), which generates synthetic images by directly rendering the given textual description on a plain white canvas.This simple rendering projects text into the image modality and can be integrated into arbitrary existing LVLM training pipelines at low cost.Moreover, TPI preserves the semantics of the text, whereas text-to-image models often fail to do.Across four models and seven benchmarks, our systematic experiments show that TPI enables more effective text-centric training than synthetic images generated by a diffusion model.We further explore TPI as a low-cost data-augmentation strategy and demonstrate its practical utility.Overall, our findings highlight the significant potential of text-centric training and, more broadly, chart a path toward fully automated data generation for LVLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38263", "url": null, "sourceid": 32653, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36556, "uid": "c7c78f106a8458a6c7794002e878a9d8", "name": "Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions", "authors": [{"id": 153757, "fullname": "Seongyu Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/153757?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 185331, "fullname": "Seungwoo Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/185331?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 90970, "fullname": "Hyeonggon Ryu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90970?format=json", "institution": "KAIST"}, {"id": 126430, "fullname": "Joon Chung", "url": "http://cvpr.thecvf.com/api/miniconf/users/126430?format=json", "institution": "KAIST"}, {"id": 87269, "fullname": "Arda Senocak", "url": "http://cvpr.thecvf.com/api/miniconf/users/87269?format=json", "institution": "KAIST"}], "abstract": "We address the problem of tactile localization, where the goal is to identify image regions that share the same material properties as a tactile input. Existing visuo-tactile methods rely on global alignment and thus fail to capture the fine-grained local correspondences required for this task. The challenge is amplified by existing datasets, which predominantly contain close-up, low-diversity images. We propose a model that learns local tactile\u2013visual alignment via dense cross-modal feature interactions, producing tactile saliency maps for touch-conditioned material segmentation. To overcome dataset constraints, we introduce: (i) in-the-wild multi-material scene images that expand visual diversity, and (ii) a material-diversity pairing strategy that aligns each tactile sample with visually varied yet tactilely consistent images, improving contextual localization and robustness to weak signals. We also construct two new tactile-grounded material segmentation datasets for quantitative evaluation. Experiments on both new and existing benchmarks show that our approach substantially outperforms prior visuo-tactile methods in tactile localization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36556", "url": null, "sourceid": 43611, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37286, "uid": "9c27fc035ee5723de2346839e39e6663", "name": "Unified Generation and Self-Verification for Vision-Language Models via Advantage Decoupled Preference Optimization", "authors": [{"id": 180917, "fullname": "Xinyu Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180917?format=json", "institution": "Zhejiang University"}, {"id": 187082, "fullname": "Heng Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/187082?format=json", "institution": "Zhejiang University"}, {"id": 187083, "fullname": "Zhengwen Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187083?format=json", "institution": "Ant Group"}, {"id": 187084, "fullname": "Shuheng Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187084?format=json", "institution": null}, {"id": 90247, "fullname": "Changhua Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90247?format=json", "institution": "Nanjing University"}, {"id": 86325, "fullname": "Yi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86325?format=json", "institution": "Zhejiang University"}, {"id": 74218, "fullname": "Linchao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/74218?format=json", "institution": "Zhejiang University"}], "abstract": "Parallel test-time scaling typically trains separate generation and verification models, incurring high training and inference costs.    We propose Advantage Decoupled Preference Optimization (\\textbf{ADPO}), a unified reinforcement learning framework that jointly learns answer generation and self-verification within a single policy.    ADPO introduces two innovations: \\textbf{a preference verification reward} improving verification capability and \\textbf{a decoupled optimization mechanism} enabling synergistic optimization of generation and verification.    Specifically, the preference verification reward computes mean verification scores from positive and negative samples as decision thresholds, providing positive feedback when prediction correctness aligns with answer correctness.    Meanwhile, the advantage decoupled optimization computes separate advantages for generation and verification, applies token masks to isolate gradients, and combines masked GRPO objectives, preserving generation quality while calibrating verification scores.    ADPO achieves up to \\textbf{+34.1\\%} higher verification AUC and \\textbf{-53.5\\%} lower inference time, with significant gains of \\textbf{+2.8\\%/+1.4\\%} accuracy on MathVista/MMMU, \\textbf{+1.9} cIoU on ReasonSeg, and \\textbf{+1.7\\%/+1.0\\%} step success rate on AndroidControl/GUI Odyssey.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37286", "url": null, "sourceid": 37584, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37441, "uid": "c9d64f4101b5c9a415f45c377b780f6b", "name": "Through the Frequency Lens: Cross-Domain Generalisable Gaze Estimation with Adaptive Modulation", "authors": [{"id": 187455, "fullname": "Yang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187455?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 128323, "fullname": "Yiwei Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/128323?format=json", "institution": "Beihang University"}, {"id": 128288, "fullname": "Feng Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128288?format=json", "institution": "Beihang University, Tsinghua University"}], "abstract": "Deep learning-based gaze estimation methods often exhibit significant performance degradation on unseen target domains. Through systematic frequency-domain analysis, we reveal that face images contain frequency components with distinct contributions: some facilitate cross-domain generalization while others introduce domain-specific interference that impedes it, with both components varying across datasets and constituting a key source of domain gap. Based on these observations, we propose the Frequency-Guided Adaptive Learning framework (FGAL), a novel framework enhancing domain generalization without accessing target domain data. The FGAL consists of two complementary modules: the Adaptive Interference Suppression Module (AISM) and the Spectrum Diversification Module (SDM). AISM adaptively suppresses sample-specific interfering frequency components through learnable modulation maps, while SDM diversifies frequency distribution patterns to enhance robustness against cross-domain variations. Experiments demonstrate that FGAL achieves substantial improvements, outperforming baselines by up to 28.2\\% and state-of-the-art methods by up to 19.5\\% across multiple cross-domain settings, demonstrating our framework's potential for broader domain generalization tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37441", "url": null, "sourceid": 44822, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36528, "uid": "7f15535c7ed441352a52758452d82f18", "name": "AutoDebias: An Automated Framework for Detecting and Mitigating Backdoor Biases in Text-to-Image Models", "authors": [{"id": 185279, "fullname": "Hongyi Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/185279?format=json", "institution": "Universiti Malaya - Cai&#x27;s Group"}, {"id": 181090, "fullname": "HONGYI CAI", "url": "http://cvpr.thecvf.com/api/miniconf/users/181090?format=json", "institution": "Universiti Malaya"}, {"id": 185280, "fullname": "MingKang Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/185280?format=json", "institution": "Universiti Malaya"}, {"id": 185281, "fullname": "Muxin Pu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185281?format=json", "institution": "Monash University"}, {"id": 185282, "fullname": "Moayad Aloqaily", "url": "http://cvpr.thecvf.com/api/miniconf/users/185282?format=json", "institution": "United Arab Emirates University"}, {"id": 166278, "fullname": "jie li", "url": "http://cvpr.thecvf.com/api/miniconf/users/166278?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 185283, "fullname": "Xinfeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185283?format=json", "institution": "Nanyang Technological University"}, {"id": 185284, "fullname": "Jialie Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185284?format=json", "institution": "City St George&#x27;s, University of London"}, {"id": 153699, "fullname": "Meikang Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153699?format=json", "institution": "Augusta University"}, {"id": 185285, "fullname": "Qingsong Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185285?format=json", "institution": "Squirrel Ai Learning"}], "abstract": "Text-to-Image (T2I) models generate high-quality images but are vulnerable to malicious backdoor attacks that inject harmful biases (e.g., trigger-activated gender or racial stereotypes). Existing debiasing methods, often designed for natural statistical biases, struggle with these deliberate and subtle injected attacks. We propose AutoDebias, a framework that automatically identifies and mitigates these malicious biases in T2I models without prior knowledge of the specific attack vectors. Specifically, AutoDebias leverages vision-language models to detect trigger-activated visual patterns and constructs neutralization guides by generating counter-prompts. These guides drive a CLIP-guided training process that breaks the harmful associations while preserving the original model's image quality and diversity. Unlike methods designed for natural bias, AutoDebias effectively addresses subtle, injected stereotypes and multiple interacting attacks. We evaluate the framework on a new benchmark covering 17 distinct backdoor attack scenarios, including challenging cases where multiple backdoors co-exist. AutoDebias detects malicious patterns with 91.6\\% accuracy and reduces the backdoor success rate from 90\\% to negligible levels, while preserving the visual fidelity of the original model.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36528", "url": null, "sourceid": 40793, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37124, "uid": "efbb36a8571980ef0b1fdb4aa04376d6", "name": "Learning What Helps: Task-Aligned Context Selection for Vision Tasks", "authors": [{"id": 186719, "fullname": "Jingyu Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186719?format=json", "institution": "KTH Royal Institute of Technology; Science for Life Laboratory"}, {"id": 186720, "fullname": "Emir Konuk", "url": "http://cvpr.thecvf.com/api/miniconf/users/186720?format=json", "institution": "KTH Royal Institute of Technology"}, {"id": 186721, "fullname": "Fredrik Strand", "url": "http://cvpr.thecvf.com/api/miniconf/users/186721?format=json", "institution": "Karolinska Institutet"}, {"id": 186722, "fullname": "Christos Matsoukas", "url": "http://cvpr.thecvf.com/api/miniconf/users/186722?format=json", "institution": "AstraZeneca"}, {"id": 182230, "fullname": "Kevin Smith", "url": "http://cvpr.thecvf.com/api/miniconf/users/182230?format=json", "institution": "KTH Royal Institute of Technology"}], "abstract": "Humans often resolve visual uncertainty by comparing an image with relevant examples, but ViTs lack the ability to identify which examples would improve their predictions. We present Task-Aligned Context Selection (TACS), a framework that learns to select paired examples which truly improve task performance rather than those that merely appear similar. TACS jointly trains a selector network with the task model through a hybrid optimization scheme combining gradient-based supervision and reinforcement learning, making retrieval part of the learning objective.By aligning selection with task rewards, TACS enables discriminative models to discover which contextual examples genuinely help.Across 18 datasets covering fine-grained recognition, medical image classification, and medical image segmentation, TACS consistently outperforms similarity-based retrieval, particularly in challenging or data-limited settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37124", "url": null, "sourceid": 39345, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39802, "uid": "dd61b41d5b91a9a09e504f025a87553b", "name": "Charge: A Comprehensive Benchmark and Dataset for Dynamic Novel View Synthesis", "authors": [{"id": 183638, "fullname": "Michal Nazarczuk", "url": "http://cvpr.thecvf.com/api/miniconf/users/183638?format=json", "institution": "Huawei"}, {"id": 186134, "fullname": "Thomas Tanay", "url": "http://cvpr.thecvf.com/api/miniconf/users/186134?format=json", "institution": "Huawei"}, {"id": 182228, "fullname": "Arthur Moreau", "url": "http://cvpr.thecvf.com/api/miniconf/users/182228?format=json", "institution": "Huawei Technlogies Research &amp; Development (UK) Ltd."}, {"id": 128891, "fullname": "Zhensong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128891?format=json", "institution": "Huawei Noah&#x27;s Ark Lab"}, {"id": 126534, "fullname": "Eduardo P\u00e9rez-Pellitero", "url": "http://cvpr.thecvf.com/api/miniconf/users/126534?format=json", "institution": "Huawei Noah&#x27;s Ark Lab (UK)"}], "abstract": "This paper presents a new dataset for Novel View Synthesis, generated from a high-quality, animated film with stunning realism and intricate detail. Our dataset captures a variety of dynamic scenes, complete with detailed textures, lighting, and motion, making it ideal for training and evaluating cutting-edge 4D scene reconstruction and novel view generation models. In addition to high-fidelity RGB images, we provide multiple complementary modalities, including depth, surface normals, object segmentation and optical flow, enabling a deeper understanding of scene geometry and motion. The dataset is organised into three distinct benchmarking scenarios: a dense multi-view camera setup, a sparse camera arrangement, and monocular video sequences, enabling a wide range of experimentation and comparison across varying levels of data sparsity. With its combination of visual richness, high-quality annotations, and diverse experimental setups, this dataset offers a unique resource for pushing the boundaries of view synthesis and 3D vision.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39802", "url": null, "sourceid": 40605, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38109, "uid": "b192238c5f2084617424714334351ffa", "name": "RAM: Recover Any 3D Human Motion in-the-Wild", "authors": [{"id": 158508, "fullname": "Sen Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/158508?format=json", "institution": "University of Washington"}, {"id": 189074, "fullname": "Ning Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189074?format=json", "institution": "Anhui University"}, {"id": 189075, "fullname": "Jinqin Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/189075?format=json", "institution": "Anhui University"}, {"id": 189076, "fullname": "Jiale Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/189076?format=json", "institution": "East China University of Science and Technology"}, {"id": 189030, "fullname": "Zhang Huaping", "url": "http://cvpr.thecvf.com/api/miniconf/users/189030?format=json", "institution": "Beijing Institute of Technology"}, {"id": 93440, "fullname": "Jenq-Neng Hwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/93440?format=json", "institution": "University of Washington, Seattle"}, {"id": 188406, "fullname": "Lei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188406?format=json", "institution": null}], "abstract": "Recovering 3D human motion from monocular videos in-the-wild remains challenging due to occlusions, rapid movements, and viewpoint variations. To address these challenges, we introduce **Recover-Anyone Module (RAM)**, a unified framework for real-time and accurate 3D human motion reconstruction. RAM incorporates a motion-aware semantic tracker with adaptive Kalman filtering to achieve robust identity association under severe occlusions and dynamic interactions. A memory-augmented Temporal HMR module further enhances human motion reconstruction by injecting spatio-temporal priors for consistent and smooth motion estimation. Moreover, a lightweight Predictor module forecasts future poses to maintain reconstruction continuity, while a gated combiner adaptively fuses reconstructed and predicted features to ensure coherence and robustness. Experiments on in-the-wild multi-person benchmarks such as PoseTrack and 3DPW, demonstrate that RAM substantially outperforms previous state-of-the-art in both Zero-shot tracking stability and 3D accuracy, offering a generalizable paradigm for markerless 3D human motion capture in-the-wild.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38109", "url": null, "sourceid": 37978, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36619, "uid": "ead6d4ad482d8a338ec1de8a26504e97", "name": "TAMER: A Tri-Modal Contrastive Alignment and Multi-Scale Embedding Refinement Framework for Zero-Shot ECG Diagnosis", "authors": [{"id": 185480, "fullname": "Xuewei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185480?format=json", "institution": "Ocean University of China"}, {"id": 149241, "fullname": "Yajie Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/149241?format=json", "institution": "Wuhan Textile University"}, {"id": 185481, "fullname": "Pan Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185481?format=json", "institution": "Chongqing Normal University"}, {"id": 185482, "fullname": "Xianfang Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185482?format=json", "institution": "Wuhan Textile University"}, {"id": 185483, "fullname": "Feifei Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/185483?format=json", "institution": null}, {"id": 182986, "fullname": "Qiangguo Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/182986?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 185484, "fullname": "Jialiang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185484?format=json", "institution": null}, {"id": 183862, "fullname": "Junlin Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183862?format=json", "institution": "Wuhan University of Science and Technology"}], "abstract": "Cardiovascular disease (CVD) diagnosis relies heavily on electrocardiograms (ECGs). However, most existing self-supervised uni-modal methods suffer from limited representational capacity, while multi-modal frameworks are hindered by coarse-grained semantic alignment across modalities, thus restricting their generalizability in clinical settings. To address these limitations, we propose TAMER, a Tri-modal contrastive Alignment and Multi-scale Embedding Refinement framework that jointly models ECG recordings, spectrograms, and diagnostic reports. TAMER is composed of three key components: First, the tri-modal feature encoding and projection (TFEP) module employs modality-specific encoders to extract global and local features from ECG recordings, spectrograms, and diagnostic reports, and projects them into latent spaces. Then, the global-local temporal-spectral alignment (GLTSA) module captures complementary rhythm- and wave-level characteristics via contrastive alignment and attentive interaction between temporal and spectral modalities. Finally, the report-aware alignment and refinement (RAAR) module performs diagnostic-level alignment and wave-level refinement with clinical reports, enabling semantic enrichment of ECG representations.Extensive experiments on three public ECG datasets demonstrate that TAMER achieves state-of-the-art zero-shot classification performance (AUC: 81.2%) and strong cross-domain generalization (AUC: 83.1%), outperforming existing uni-modal and multi-modal baselines methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36619", "url": null, "sourceid": 37490, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36893, "uid": "75b3ef7f6b2e155c5ff3a5b2c32fed80", "name": "Off The Grid: Detection of Primitives for Feed-Forward 3D Gaussian Splatting", "authors": [{"id": 182228, "fullname": "Arthur Moreau", "url": "http://cvpr.thecvf.com/api/miniconf/users/182228?format=json", "institution": "Huawei Technlogies Research &amp; Development (UK) Ltd."}, {"id": 186133, "fullname": "Richard Shaw", "url": "http://cvpr.thecvf.com/api/miniconf/users/186133?format=json", "institution": "Huawei Noah&#x27;s Ark Lab"}, {"id": 183638, "fullname": "Michal Nazarczuk", "url": "http://cvpr.thecvf.com/api/miniconf/users/183638?format=json", "institution": "Huawei"}, {"id": 107610, "fullname": "Jisu Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/107610?format=json", "institution": "Gwangju Institute of Science and Technology"}, {"id": 186134, "fullname": "Thomas Tanay", "url": "http://cvpr.thecvf.com/api/miniconf/users/186134?format=json", "institution": "Huawei"}, {"id": 128891, "fullname": "Zhensong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128891?format=json", "institution": "Huawei Noah&#x27;s Ark Lab"}, {"id": 84723, "fullname": "Songcen Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84723?format=json", "institution": "Huawei Noah&#x27;s Ark Lab"}, {"id": 126534, "fullname": "Eduardo P\u00e9rez-Pellitero", "url": "http://cvpr.thecvf.com/api/miniconf/users/126534?format=json", "institution": "Huawei Noah&#x27;s Ark Lab (UK)"}], "abstract": "Feed-forward 3D Gaussian Splatting (3DGS) models enable real-time scene generation but are hindered by suboptimal pixel-aligned primitive placement, which relies on a dense, rigid grid and limits both quality and efficiency. We introduce a new feed-forward architecture that detects 3D Gaussian primitives at a sub-pixel level, replacing the pixel grid with an adaptive, ``Off The Grid\" distribution. Inspired by keypoint detection, our multi-resolution decoder learns to distribute primitives across image patches. This module is trained end-to-end with a 3D reconstruction backbone using self-supervised learning. Our resulting pose-free model generates photorealistic scenes in seconds, achieving state-of-the-art novel view synthesis for feed-forward models. It outperforms competitors while using far fewer primitives, demonstrating a more accurate and efficient allocation that captures fine details and reduces artifacts. Moreover, we observe that by learning to render 3D Gaussians, our 3D reconstruction backbone improves camera pose estimation, suggesting opportunities to train these foundational models without labels.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36893", "url": null, "sourceid": 45029, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36671, "uid": "60803ea798a0c0dfb7f36397d8d4d772", "name": "DynBridge: Bridging Imagination and Control through Interaction Dynamics for Robot Manipulation", "authors": [{"id": 180564, "fullname": "Alex Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180564?format=json", "institution": "SGIT AI"}, {"id": 156629, "fullname": "Zhiwei Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/156629?format=json", "institution": "Riemann Lab,                     Huawei Technologies Ltd."}, {"id": 185613, "fullname": "Qicheng Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/185613?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 185614, "fullname": "Chenshi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185614?format=json", "institution": "Central South University"}, {"id": 185615, "fullname": "Yujie Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185615?format=json", "institution": "Beijing University Of Technology"}, {"id": 127182, "fullname": "Guang Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/127182?format=json", "institution": "SGIT AI"}, {"id": 89243, "fullname": "Yong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89243?format=json", "institution": "Zhejiang University"}, {"id": 135748, "fullname": "Mengmeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/135748?format=json", "institution": "Zhejiang University of Technology"}], "abstract": "Recent generative models allow robots to generate future visual outcomes for action guidance, yet most still address imagination and control independently, resulting in visually coherent rollouts but physically inconsistent behaviors. While structural priors enhance spatial grounding, these methods remain visually correlation-driven rather than causally informed, overlooking the bidirectional coupling between robot actions and the evolving environment. We formalize the coupling as interaction dynamics, which specify where environmental changes occur and how actions cause them. Based on this formulation, we introduce DynBridge, an end-to-end framework that unifies imagination and control through the shared dynamics representation. Specifically, DynBridge realizes this via three components: (1) an Interaction Dynamics Generator that forecasts interaction dynamics via joint trajectory generation and action prediction; (2) an Action-Conditioned Dynamics Aggregator that integrates dynamics under control signals; and (3) a Dynamics-Guided Action Predictor that leverages the aggregated dynamics to produce executable, context-aware actions. Results demonstrate that DynBridge consistently outperforms prior methods on simulated and real-world benchmarks without external pretraining.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36671", "url": null, "sourceid": 40730, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38961, "uid": "62b9868335fc381363065ca4ca58b33d", "name": "Towards Policy-Adaptive Image Guardrail: Benchmark and Method", "authors": [{"id": 191067, "fullname": "Caiyong Piao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191067?format=json", "institution": "Fudan University"}, {"id": 131427, "fullname": "Zhiyuan Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/131427?format=json", "institution": "Peking University"}, {"id": 191068, "fullname": "Haoming Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191068?format=json", "institution": "Tencent AI Lab"}, {"id": 154325, "fullname": "Yunzhen Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/154325?format=json", "institution": null}, {"id": 186102, "fullname": "Kaiqing Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186102?format=json", "institution": "Shenzhen University"}, {"id": 191069, "fullname": "Feiyang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191069?format=json", "institution": "Fudan University"}, {"id": 129791, "fullname": "Shuigeng Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/129791?format=json", "institution": "Fudan University"}], "abstract": "Accurate rejection of sensitive or harmful visual content, i.e., harmful image guardrail, is critical in many application scenarios. This task must continuously adapt to the evolving safety policies and content across various domains and over time.  However, traditional classifiers, confined to fixed categories, require frequent retraining when new policies are introduced. Vision-language models (VLMs) offer a more adaptable and generalizable foundation for dynamic safety guardrails. Despite this potential, existing VLM-based safeguarding methods are typically trained and evaluated under only a fixed safety policy. We find that these models are heavily overfitted to the seen policy, fail to generalize to unseen policies, and even lose the basic instruction-following ability and general knowledge. To address this issue, in this paper we make two key contributions. First, we benchmark the cross-policy generalization performance of existing VLMs with **SafeEditBench**, a new evaluation suite. SafeEditBench leverages image-editing models to convert unsafe images into safe counterparts, producing policy-aligned datasets where each safe\u2013unsafe image pair remains visually similar except for localized regions violating specific safety rules. Human annotators then provide accurate safe/unsafe labels under five distinct policies, enabling fine-grained assessment of policy-aware generalization. Second, we introduce **SafeGuard-VL**, a reinforcement learning\u2013based method with verifiable rewards (RLVR) for robust unsafe-image guardrails. Instead of relying solely on supervised fine-tuning (SFT) under fixed policies, SafeGuard-VL explicitly optimizes the model with policy-grounded rewards, promoting verifiable adaptation across evolving policies. Extensive experiments verify the effectiveness of our method for unsafe image guardrails across various policies.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38961", "url": null, "sourceid": 41080, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39736, "uid": "a9eea24ba5eb9664c1341d21cf78476e", "name": "Benchmarking Unified Any-to-Any Interleaved Multimodal Learning", "authors": [{"id": 180066, "fullname": "Yanlin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180066?format=json", "institution": "National University of Singapore"}, {"id": 192744, "fullname": "Minghui Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192744?format=json", "institution": "National University of Singapore"}, {"id": 192745, "fullname": "Kaiwen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192745?format=json", "institution": "national university of singaore, National University of Singapore"}, {"id": 192746, "fullname": "Shize Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192746?format=json", "institution": "National University of Singapore (NUS)"}, {"id": 192747, "fullname": "Yiran Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192747?format=json", "institution": null}, {"id": 192748, "fullname": "Haodong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192748?format=json", "institution": "StepFun; South China University of Technology"}, {"id": 192749, "fullname": "Congyue Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192749?format=json", "institution": "South China University of Technology"}, {"id": 192750, "fullname": "Weijie Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192750?format=json", "institution": "Nanyang Technological University"}, {"id": 192751, "fullname": "Yushen Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192751?format=json", "institution": "South China University of Technology"}, {"id": 131572, "fullname": "Shengqiong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131572?format=json", "institution": "National University of Singapore"}, {"id": 76247, "fullname": "Wei Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/76247?format=json", "institution": "Nanjing University"}, {"id": 158453, "fullname": "Lei Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/158453?format=json", "institution": "Microsoft Research Asia"}, {"id": 84993, "fullname": "Furu Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/84993?format=json", "institution": "Microsoft Research"}, {"id": 178983, "fullname": "Hao Fei", "url": "http://cvpr.thecvf.com/api/miniconf/users/178983?format=json", "institution": "National University of Singapore"}, {"id": 192752, "fullname": "Mong-Li Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/192752?format=json", "institution": "National University of Singapore"}, {"id": 192753, "fullname": "Wynne Hsu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192753?format=json", "institution": "National University of Singapore"}], "abstract": "In real-world multimodal applications, systems usually need to comprehend arbitrarily combined and interleaved multimodal inputs from users, while also generating outputs in any interleaved multimedia form. This capability defines the goal of any-to-any interleaved multimodal learning under a unified paradigm of understanding and generation, posing new challenges and opportunities for advancing Multimodal Large Language Models (MLLMs). To foster and benchmark this capability, this paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset. UniM contains 31K high-quality instances across 30 domains and 7 representative modalities: text, image, audio, video, document, code, and 3D, each requiring multiple intertwined reasoning and generation capabilities. We further introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence. In addition, we propose UniMA, an agentic baseline model equipped with traceable reasoning for structured interleaved generation. Comprehensive experiments demonstrate the difficulty of UniM and highlight key challenges and directions for advancing unified any-to-any multimodal intelligence. Our resource will all be open.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39736", "url": null, "sourceid": 36097, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38431, "uid": "407f5bc47339074637a289fb2b000805", "name": "MultiShotMaster: A Controllable Multi-Shot Video Generation Framework", "authors": [{"id": 189855, "fullname": "Qinghe Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189855?format=json", "institution": "Kuaishou; Dalian University of Technology"}, {"id": 88119, "fullname": "Xiaoyu Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/88119?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 189856, "fullname": "Baolu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189856?format=json", "institution": "Dalian University of Technology"}, {"id": 155973, "fullname": "Weikang Bian", "url": "http://cvpr.thecvf.com/api/miniconf/users/155973?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 189135, "fullname": "Quande Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189135?format=json", "institution": "Kuaishou- \u5feb\u624b\u79d1\u6280"}, {"id": 87510, "fullname": "Huchuan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87510?format=json", "institution": "Dalian University of Technology"}, {"id": 75722, "fullname": "Xintao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75722?format=json", "institution": "Tencent"}, {"id": 134947, "fullname": "Pengfei Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/134947?format=json", "institution": "Kuaishou Technology"}, {"id": 156268, "fullname": "Kun Gai", "url": "http://cvpr.thecvf.com/api/miniconf/users/156268?format=json", "institution": "Kuaishou- \u5feb\u624b\u79d1\u6280"}, {"id": 87542, "fullname": "Xu Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/87542?format=json", "institution": "Dalian University of Technology"}], "abstract": "Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporal-grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline to extract multi-shot videos, captions, cross-shot grounding signals and reference images. Our framework leverages the intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subject with motion control, and background-driven customized scene. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38431", "url": null, "sourceid": 33868, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40016, "uid": "77bf56892ce1f5ba79671eff6f421419", "name": "Revisiting Optimal Coding for I-ToF under Practical Sensor Constraints", "authors": [{"id": 180979, "fullname": "WENBIN LUO", "url": "http://cvpr.thecvf.com/api/miniconf/users/180979?format=json", "institution": "Kyushu Universty"}, {"id": 193309, "fullname": "Takafumi Iwaguchi", "url": "http://cvpr.thecvf.com/api/miniconf/users/193309?format=json", "institution": "Kyushu University"}, {"id": 193310, "fullname": "Ryusuke Sagawa", "url": "http://cvpr.thecvf.com/api/miniconf/users/193310?format=json", "institution": "DENSO IT Laboratory"}, {"id": 97728, "fullname": "Hiroshi Kawasaki", "url": "http://cvpr.thecvf.com/api/miniconf/users/97728?format=json", "institution": "Kyushu University"}], "abstract": "The depth precision of an indirect time-of-flight (I-ToF) camera is highly dependent on its coding scheme. However, identifying the optimal coding scheme is challenging due to the infinitely many possible combinations of modulation and demodulation functions. Although previous works have derived depth-precision metrics to guide coding-scheme design, they either do not satisfy the constraints of real-world I-ToF devices or rely heavily on large-scale deep-learning optimization. In this work, we first analyze the error-propagation process in I-ToF depth sensing and derive a new metric for guiding the design and search of coding schemes. Then we incorporate practical hardware constraints of I-ToF sensors directly into the coding-scheme design, which greatly reduces the space of feasible modulation and demodulation functions and makes metric-based search feasible. The coding schemes obtained by our search method outperform previous schemes in both simulations and real-world experiments.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40016", "url": null, "sourceid": 43606, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39190, "uid": "82a39bd5c8bf10687c67b8ca738d5a1a", "name": "Mitigating The Distribution Shift of Diffusion-based Dataset Distillation", "authors": [{"id": 129607, "fullname": "Yue Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129607?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 191543, "fullname": "Chenyu Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191543?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 191544, "fullname": "Pengyu An", "url": "http://cvpr.thecvf.com/api/miniconf/users/191544?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 127551, "fullname": "Yonglu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/127551?format=json", "institution": "Shanghai Jiaotong University"}], "abstract": "Dataset Distillation (DD) seeks to create small, synthetic datasets for efficient model training. While diffusion models are powerful generators, their use in DD is hampered by distribution shifts between synthetic and ideal distilled data, leading to suboptimal performance. We identify two critical shifts. First, considering the small capacity of the synthetic data, an optimal synthetic distribution for DD should be a simplification of the real data distribution, rather than replicating the original data's complexity. Second, there is a hazardous empirical deviation in the synthetic dataset from this learned distribution due to the data sampling process. To address these, we introduce a two-stage approach. During diffusion training time, we mitigate the distribution shift by employing an L1 sparsity regularizer, compelling the diffusion model to learn a compact and semantically sparse manifold. Then, during sampling time, we abandon the flawed sequential sampling paradigm and instead synchronously denoises the entire synthetic dataset with distribution regularizers. This framework systematically mitigates both identified distribution shifts. Experiments show our method achieves state-of-the-art performance with superior computational efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39190", "url": null, "sourceid": 37659, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36426, "uid": "5a82e1764dd93b64072fcde33fd4b38f", "name": "Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding", "authors": [{"id": 167290, "fullname": "Yuchen Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/167290?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China"}, {"id": 185009, "fullname": "Zhenyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185009?format=json", "institution": "Baidu Inc."}, {"id": 185010, "fullname": "Naibin Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185010?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 185011, "fullname": "Yilong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185011?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 185012, "fullname": "Peng Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185012?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 126564, "fullname": "Zheng Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/126564?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 185013, "fullname": "Shuohuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185013?format=json", "institution": "Baidu"}, {"id": 84960, "fullname": "Yu Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/84960?format=json", "institution": "Baidu"}, {"id": 84968, "fullname": "Hua Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84968?format=json", "institution": "Baidu"}, {"id": 126593, "fullname": "Weiping Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126593?format=json", "institution": "IIE"}, {"id": 84919, "fullname": "Haifeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84919?format=json", "institution": "Baidu"}], "abstract": "Multimodal large language models (MLLMs) have achieved remarkable progress on various vision-language tasks, yet their visual perception remains limited. Humans, in comparison, perceive complex scenes efficiently by dynamically scanning and focusing on salient regions in a sequential ``blink-like'' process. Motivated by this strategy, we first investigate whether MLLMs exhibit similar behavior. Our pilot analysis reveals that MLLMs naturally attend to different visual regions across layers and that selectively allocating more computation to salient tokens can enhance visual perception. Building on this insight, we propose Blink, a dynamic visual token resolution framework that emulates the human-inspired process within a single forward pass. Specifically, Blink includes two modules: saliency-guided scanning and dynamic token resolution. It first estimates the saliency of visual tokens in each layer based on the attention map, and extends important tokens through a plug-and-play token super-resolution (TokenSR) module. In the next layer, it drops the extended tokens when they lose focus. This dynamic mechanism balances broad exploration and fine-grained focus, thereby enhancing visual perception adaptively and efficiently. Extensive experiments validate Blink, demonstrating its effectiveness in enhancing visual perception and multimodal understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36426", "url": null, "sourceid": 41473, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37935, "uid": "d90b1c32b55f731e8a2072bfad782fdd", "name": "CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography", "authors": [{"id": 182680, "fullname": "Gasser Elazab", "url": "http://cvpr.thecvf.com/api/miniconf/users/182680?format=json", "institution": "CARIAD SE"}, {"id": 176235, "fullname": "Frank Neuhaus", "url": "http://cvpr.thecvf.com/api/miniconf/users/176235?format=json", "institution": "V&amp;R Vision &amp; Robotics GmbH"}, {"id": 188627, "fullname": "Tilman Ko\u00df", "url": "http://cvpr.thecvf.com/api/miniconf/users/188627?format=json", "institution": "Vision &amp; Robotics GmbH"}, {"id": 188628, "fullname": "Malte Splietker", "url": "http://cvpr.thecvf.com/api/miniconf/users/188628?format=json", "institution": "V&amp;R Vision &amp; Robotics GmbH"}, {"id": 188629, "fullname": "Aditya Date", "url": "http://cvpr.thecvf.com/api/miniconf/users/188629?format=json", "institution": "Hochschule Ravensburg-Weingarten University of Applied Sciences"}, {"id": 188630, "fullname": "Michael Unterreiner", "url": "http://cvpr.thecvf.com/api/miniconf/users/188630?format=json", "institution": "Porsche AG"}, {"id": 188631, "fullname": "Maximilian Jansen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188631?format=json", "institution": "CARIAD"}, {"id": 188632, "fullname": "Olaf Hellwich", "url": "http://cvpr.thecvf.com/api/miniconf/users/188632?format=json", "institution": "TU Berlin"}], "abstract": "Autonomous driving must operate reliably across diverse surfaces to enable safe mobility. However, most driving datasets are captured on well-paved flat roads. Moreover, recent driving datasets primarily provide sparse LiDAR ground truth for images, which is insufficient for assessing fine-grained geometry in depth estimation and completion. To address these gaps, we introduce CARD, a multi-modal driving dataset that delivers quasi-dense 3D ground truth across continuous sequences rich in speed bumps, potholes, irregular surfaces and off-road segments. Our sensor suite includes synchronized global-shutter stereo cameras, front and rear LiDARs, 6-DoF poses from LiDAR-inertial odometry, per-wheel motion traces, and full calibration. Notably, our multi-LiDAR fusion yields ~500K valid depth pixels per frame, about 6.5x more than KITTI Depth Completion and 10x more on average than other public driving datasets. The dataset spans ~110 km and 4.7 hours across Germany and Italy. In addition, CARD provides 2D bounding boxes targeting road-topography irregularities, enabling accurate benchmarking for both geometry and perception tasks. Furthermore, we introduce a standardized evaluation protocol for road surface irregularities and a stereo-guided depth completion variant that achieves leading performance on CARD. Moreover, we benchmark state-of-the-art depth estimation models to establish strong baselines. We host CARD on Hugging Face with an open source SDK and standardized splits to enable public leaderboards and reproducible evaluation.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37935", "url": null, "sourceid": 41541, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37565, "uid": "11eee5a972a890e44dc5f1d8129db9ed", "name": "UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation", "authors": [{"id": 187739, "fullname": "Jiehui Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187739?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 132567, "fullname": "Yuechen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/132567?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 187740, "fullname": "Xu He", "url": "http://cvpr.thecvf.com/api/miniconf/users/187740?format=json", "institution": "Tsinghua University"}, {"id": 187741, "fullname": "Yuan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187741?format=json", "institution": "Kuaishou- \u5feb\u624b\u79d1\u6280"}, {"id": 91459, "fullname": "Zhi Cen", "url": "http://cvpr.thecvf.com/api/miniconf/users/91459?format=json", "institution": "Zhejiang University"}, {"id": 87856, "fullname": "Bin Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/87856?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 187742, "fullname": "Yan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/187742?format=json", "institution": "Kuaishou- \u5feb\u624b\u79d1\u6280"}, {"id": 87541, "fullname": "Xin Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/87541?format=json", "institution": "Kuaishou"}, {"id": 134947, "fullname": "Pengfei Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/134947?format=json", "institution": "Kuaishou Technology"}, {"id": 154575, "fullname": "Jiaya Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/154575?format=json", "institution": "Department of Computer Science and Engineering, Hong Kong University of Science and Technology"}], "abstract": "Recent video generation models demonstrate impressive synthesis capabilities but remain limited by single-modality conditioning, constraining their holistic world understanding. This stems from insufficient cross-modal interaction and limited modal diversity for comprehensive world knowledge representation.To address these limitations, we introduce UnityVideo, a unified framework for world-aware video generation that jointly learns across multiple modalities\u2014segmentation masks, human skeletons, DensePose, optical flow, and depth maps\u2014and training paradigms. Our approach features two core components: (1) dynamic noising to unify heterogeneous training paradigms, and (2) a modality switcher with an in-context learner that enables unified processing via modular parameters and contextual learning. We contribute a large-scale unified dataset with 1.3M samples. Through joint optimization, UnityVideo accelerates convergence and significantly enhances zero-shot generalization to unseen data. We demonstrate that UnityVideo achieves superior video quality, consistency, and improved alignment with physical world constraints. Code and models will be released. More results can be viewed in the supplementary.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37565", "url": null, "sourceid": 35580, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36172, "uid": "c7226ffd4a80a6abfd8b4e348e11e7a0", "name": "F$^2$-Assist: Multi-Phase Fetal Growth Forecast and Report Generation from Ultrasound Examination", "authors": [{"id": 184328, "fullname": "Bin Pu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184328?format=json", "institution": "Hunan University"}, {"id": 182070, "fullname": "XUSHENG LIANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/182070?format=json", "institution": "City University of Hong Kong"}, {"id": 107253, "fullname": "Xinpeng Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/107253?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 184329, "fullname": "Jinlin Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184329?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 89292, "fullname": "Zhen Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/89292?format=json", "institution": "Institute of Automation,  Chinese Academy of Sciences"}, {"id": 129205, "fullname": "Shengli Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129205?format=json", "institution": "Shenzhen Maternity and Child Healthcare Hospital"}, {"id": 127232, "fullname": "Kenli Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/127232?format=json", "institution": "Hunan University"}, {"id": 76503, "fullname": "Jiawei Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/76503?format=json", "institution": "City University of Hong Kong"}], "abstract": "Forecasting fetal growth from sequential ultrasound examinations is essential for personalized prenatal care. Existing medical vision-language models (MLLMs) are limited to single-phase/organ evaluations and qualitative reasoning, neglecting longitudinal history and precise continuous biometric values.  To address this gap, we introduce the novel task of multi-phase fetal growth forecasting and report generation. To support this task, we first present GrowthFetus, the largest multi-phase, multi-organ fetal ultrasound dataset to date, containing 9,280 examinations from 2,000 fetuses. Based on this dataset, we propose F$^2$-Assist, a unified MLLM framework with three key components: (i) a Cross-Phase Organ Alignment module for for heterogeneous multi-organ feature fusion across phases, (ii) a History-Aware Temporal Encoding module for modeling irregular temporal dynamics, and (iii) a Growth Parameter Adapter that encodes continuous biometric values as differentiable tokens for numerically precise reasoning. Extensive experiments show that F$^2$-Assist achieves temporally coherent predictions and clinically consistent reports, significantly outperforming state-of-the-art MLLMs.  Our study establishes a practical framework for longitudinal ultrasound analysis, bridging growth forecasting and report generation in a unified model.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36172", "url": null, "sourceid": 42585, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37999, "uid": "b09ba312974bccb95fa44b4eec467373", "name": "Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions", "authors": [{"id": 179716, "fullname": "Xiaoxiao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/179716?format=json", "institution": "Stanford University"}, {"id": 188791, "fullname": "Mingyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188791?format=json", "institution": "Stanford University"}, {"id": 152706, "fullname": "Kun yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/152706?format=json", "institution": "Universit\u00e9 de Strasbourg &amp; Technical University of Munich"}, {"id": 155680, "fullname": "Min Woo Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/155680?format=json", "institution": "Stanford University"}, {"id": 181333, "fullname": "Mark Endo", "url": "http://cvpr.thecvf.com/api/miniconf/users/181333?format=json", "institution": "Stanford University"}, {"id": 188792, "fullname": "Shengguang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188792?format=json", "institution": "Stanford University"}, {"id": 188793, "fullname": "Changlin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188793?format=json", "institution": "Stanford University"}, {"id": 126414, "fullname": "Yuhui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126414?format=json", "institution": "Stanford University"}, {"id": 188794, "fullname": "Zeyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188794?format=json", "institution": "Stanford University"}, {"id": 69178, "fullname": "Serena Yeung", "url": "http://cvpr.thecvf.com/api/miniconf/users/69178?format=json", "institution": "Stanford"}], "abstract": "Large Vision-Language Models (VLMs) often answer classic visual illusions \"correctly\" on original images, yet persist with the same responses when illusion factors are inverted, even though the visual change is obvious to humans. This raises a fundamental question: *do VLMs perceive visual changes or merely recall memorized patterns?* While several studies have noted this phenomenon, the underlying causes remain unclear.  To move from observations to systematic understanding, this paper introduces \\textbf{VI-Probe}, a controllable visual-illusion framework with graded perturbations and matched visual controls (without illusion inducer) that disentangles visually grounded perception from language-driven recall.  Unlike prior work that focus on averaged accuracy, we measure stability and sensitivity using Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier normalized against matched controls.Experiments across different families reveal that response persistence arises from heterogeneous causes rather than a single mechanism. For instance, GPT-5 exhibits memory override, Claude-Opus-4.1 shows perception-memory competition, while Qwen variants suggest visual-processing limits. Our findings challenge single-cause views and motivate probing-based evaluation that measures both knowledge and sensitivity to controlled visual change.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37999", "url": null, "sourceid": 36806, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37058, "uid": "a4700f244723a6277a576f50af1d387b", "name": "Robust Promptable Video Object Segmentation", "authors": [{"id": 88378, "fullname": "Sohyun Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/88378?format=json", "institution": "POSTECH"}, {"id": 142876, "fullname": "Yeho Gwon", "url": "http://cvpr.thecvf.com/api/miniconf/users/142876?format=json", "institution": "POSTECH"}, {"id": 186580, "fullname": "Lukas Hoyer", "url": "http://cvpr.thecvf.com/api/miniconf/users/186580?format=json", "institution": "Google Zurich"}, {"id": 86863, "fullname": "Konrad Schindler", "url": "http://cvpr.thecvf.com/api/miniconf/users/86863?format=json", "institution": "ETH Zurich"}, {"id": 179645, "fullname": "Christos Sakaridis", "url": "http://cvpr.thecvf.com/api/miniconf/users/179645?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 87833, "fullname": "Suha Kwak", "url": "http://cvpr.thecvf.com/api/miniconf/users/87833?format=json", "institution": "POSTECH"}], "abstract": "The performance of promptable video object segmentation (PVOS) models substantially degrades under input corruptions, which prevents PVOS deployment in safety-critical domains. This paper offers the first comprehensive study on robust PVOS (RobustPVOS). We first construct a new, comprehensive benchmark with two real-world evaluation datasets of 351 video clips and more than 2,500 object masks under real-world adverse conditions. At the same time, we generate synthetic training data by applying diverse and temporally varying corruptions to existing VOS datasets. Moreover, we present a new RobustPVOS method, dubbed Memory-object-conditioned Gated-rank Adaptation (MoGA). The key to successfully performing RobustPVOS is two-fold: effectively handling object-specific degradation and ensuring temporal consistency in predictions. MoGA leverages object-specific representations maintained in memory across frames to condition the robustification process, which allows the model to handle each tracked object differently in a temporally consistent way. Extensive experiments on our benchmark validate MoGA's efficacy, showing consistent and significant improvements across diverse corruption types on both synthetic and real-world datasets, establishing a strong baseline for future RobustPVOS research. Our benchmark will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37058", "url": null, "sourceid": 32236, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37982, "uid": "1dcfee25dedf7c8e7e25a9b588299f84", "name": "FluidGaussian: Propagating Simulation-Based Uncertainty Toward Functionally-Intelligent 3D Reconstruction", "authors": [{"id": 188744, "fullname": "Yuqiu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188744?format=json", "institution": "Simon Fraser University"}, {"id": 188745, "fullname": "Jialin Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/188745?format=json", "institution": "Simon Fraser University"}, {"id": 188746, "fullname": "Marissa Ramirez de Chanlatte", "url": "http://cvpr.thecvf.com/api/miniconf/users/188746?format=json", "institution": "Lawrence Berkeley National Lab"}, {"id": 188747, "fullname": "Rochishnu Chowdhury", "url": "http://cvpr.thecvf.com/api/miniconf/users/188747?format=json", "institution": "Lawrence Berkeley National Lab"}, {"id": 188748, "fullname": "Rushil Desai", "url": "http://cvpr.thecvf.com/api/miniconf/users/188748?format=json", "institution": "University of California, Berkeley"}, {"id": 135756, "fullname": "Wuyang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/135756?format=json", "institution": "Simon Fraser University"}, {"id": 188749, "fullname": "Daniel Martin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188749?format=json", "institution": "Lawrence Berkeley National Lab"}, {"id": 179555, "fullname": "Michael Mahoney", "url": "http://cvpr.thecvf.com/api/miniconf/users/179555?format=json", "institution": "University of California Berkeley"}], "abstract": "Real objects inhabit a physical world and must behave plausibly during interaction with other physical objects. However, current methods that perform 3D reconstructions of real-world scenes from multi-view images optimize primarily for visual fidelity, i.e., they train with photometric losses and reason about uncertainty in the image or representation space. This appearance-centric view overlooks body contacts and couplings, conflates function-critical regions (e.g., aerodynamic or hydrodynamic surfaces) with ornamentation, and reconstructs structures suboptimally, even when physical regularizers are added. We consider the question: How can 3D reconstruction become aware of real-world interactions and underlying object function, beyond visual cues? We propose FluidGaussian, a plug-and-play method that tightly couples geometry reconstruction with ubiquitous fluid-structure interactions to assess surface quality at high granularity. (1) We define our simulation-based uncertainty induced from fluid simulations that capture physical plausibility. (2) We integrate our uncertainty with NBV (next-best-view) policies to prioritize views that improve both visual and physical fidelity. On NeRF Synthetic (Blender), Mip-NeRF 360, and DrivAerNet++, our method yields up to +8.6% PSNR, and -62.3% velocity divergence, with PSNR gains on function-critical surfaces of +7.7%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37982", "url": null, "sourceid": 31497, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39649, "uid": "fb15e71ebe80aa93e4a63eeb755645d3", "name": "Tunable Soft Equivariance with Guarantees", "authors": [{"id": 183872, "fullname": "Md Ashiqur Rahman", "url": "http://cvpr.thecvf.com/api/miniconf/users/183872?format=json", "institution": "Purdue University"}, {"id": 192562, "fullname": "Lim Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192562?format=json", "institution": "DSO National Laboratories"}, {"id": 192563, "fullname": "Jeremiah Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192563?format=json", "institution": "DSO National Laboratories"}, {"id": 126949, "fullname": "Teck-Yian Lim", "url": "http://cvpr.thecvf.com/api/miniconf/users/126949?format=json", "institution": "DSO National Laboratories"}, {"id": 85283, "fullname": "Raymond A. Yeh", "url": "http://cvpr.thecvf.com/api/miniconf/users/85283?format=json", "institution": "Purdue University"}], "abstract": "Equivariance is a fundamental property in computer vision models, yet strict equivariance is rarely satisfied in real-world data, which can limit a model\u2019s performance. Controlling the degree of equivariance is therefore desirable. We propose a general framework for constructing soft equivariant models by projecting the model weights into a designed subspace. The method applies to any pre-trained architecture and provides theoretical bounds on the induced equivariance error. Empirically, we demonstrate the effectiveness of our method on multiple pre-trained backbones, including ViT and ResNet, across image classification, semantic segmentation, and human-trajectory prediction tasks. Notably, our approach improves the performance while simultaneously reducing equivariance error on the competitive ImageNet benchmark.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39649", "url": "https://github.com/ashiq24/soft-equivariance", "sourceid": 34202, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38160, "uid": "3e60378a5dd096f82f27912f135834bb", "name": "Rewis3d: Reconstruction for Weakly-Supervised Semantic Segmentation", "authors": [{"id": 189183, "fullname": "Jonas Ernst", "url": "http://cvpr.thecvf.com/api/miniconf/users/189183?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute; Universit\u00e4t des Saarlandes"}, {"id": 184098, "fullname": "Wolfgang Boettcher", "url": "http://cvpr.thecvf.com/api/miniconf/users/184098?format=json", "institution": "Max Planck Institute for Informatics"}, {"id": 186580, "fullname": "Lukas Hoyer", "url": "http://cvpr.thecvf.com/api/miniconf/users/186580?format=json", "institution": "Google Zurich"}, {"id": 126719, "fullname": "Jan Lenssen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126719?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 84515, "fullname": "Bernt Schiele", "url": "http://cvpr.thecvf.com/api/miniconf/users/84515?format=json", "institution": "Max Planck Institute for Informatics"}], "abstract": "We present Rewis3d, a framework that leverages recent advances in feed-forward 3D reconstruction to significantly improve weakly supervised semantic segmentation on 2D images. Obtaining dense, pixel-level annotations remains a costly bottleneck for training segmentation models. Alleviating this issue, sparse annotations offer an efficient weakly-supervised alternative. However, they still incur a performance gap. To address this, we introduce a novel approach that leverages 3D scene reconstruction as an auxiliary supervisory signal. Our key insight is that 3D geometric structure recovered from 2D videos provides strong cues that can propagate sparse annotations across entire scenes. Specifically, a dual student\u2013teacher architecture enforces semantic consistency between 2D images and reconstructed 3D point clouds, using state-of-the-art feed-forward reconstruction to generate reliable geometric supervision. Extensive experiments demonstrate that Rewis3d achieves state-of-the-art performance in sparse supervision, outperforming existing approaches by 2-7 % without requiring additional labels or inference overhead. Our code will be released upon acceptance of the paper.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38160", "url": null, "sourceid": 45434, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39673, "uid": "a7ad2cab998ae504104b31b75758440e", "name": "BiGMINT: Biologically-guided Hierarchical Multimodal Integration for Modeling Multiple Compound Activities in Drug Discovery", "authors": [{"id": 129976, "fullname": "Pushpak Pati", "url": "http://cvpr.thecvf.com/api/miniconf/users/129976?format=json", "institution": "Johnson &amp; Johnson"}, {"id": 192612, "fullname": "Bo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192612?format=json", "institution": "Johnson and Johnson"}, {"id": 192613, "fullname": "Abbas Khan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192613?format=json", "institution": "Johnson and Johnson"}, {"id": 192614, "fullname": "Tom\u00e9 Albuquerque", "url": "http://cvpr.thecvf.com/api/miniconf/users/192614?format=json", "institution": "Johnson and Johnson"}, {"id": 192615, "fullname": "Steffen Jaensch", "url": "http://cvpr.thecvf.com/api/miniconf/users/192615?format=json", "institution": "Johnson and Johnson"}, {"id": 192616, "fullname": "Amina Mollaysa", "url": "http://cvpr.thecvf.com/api/miniconf/users/192616?format=json", "institution": "Johnson and Johnson"}, {"id": 106033, "fullname": "Walid Hassan", "url": "http://cvpr.thecvf.com/api/miniconf/users/106033?format=json", "institution": "Johnson and Johnson"}, {"id": 192617, "fullname": "Samantha J Allen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192617?format=json", "institution": "Johnson and Johnson"}, {"id": 192618, "fullname": "Joke Reumers", "url": "http://cvpr.thecvf.com/api/miniconf/users/192618?format=json", "institution": "Johnson and Johnson"}, {"id": 192619, "fullname": "Helai Mohammad", "url": "http://cvpr.thecvf.com/api/miniconf/users/192619?format=json", "institution": "Johnson and Johnson"}, {"id": 192620, "fullname": "Scott Oloff", "url": "http://cvpr.thecvf.com/api/miniconf/users/192620?format=json", "institution": "Johnson &amp; Johnson"}, {"id": 192621, "fullname": "Tommaso Mansi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192621?format=json", "institution": "Johnson and Johnson"}, {"id": 192622, "fullname": "Rui Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192622?format=json", "institution": "Johnson and Johnson"}, {"id": 192623, "fullname": "Dmytro Lituiev", "url": "http://cvpr.thecvf.com/api/miniconf/users/192623?format=json", "institution": "Johnson and Johnson"}, {"id": 175275, "fullname": "Zhoubing Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175275?format=json", "institution": "Johnson &amp; Johnson"}], "abstract": "Compound activity modeling is critical for drug discovery, where accurate *in silico* predictions can significantly reduce reliance on expensive, time\u2011consuming target-specific experimental assays. Traditional machine learning approaches for compound activity modeling typically rely on either chemoproteomics-centric molecular data or phenotype-centric imaging screens, limiting their ability to capture complementary biological signals. While multimodal approaches show promise, they often fail to capture the interplay between molecular mechanisms and cellular responses. In this paper, we present **BiGMINT**, a **Bi**ologically **G**uided **M**ultimodal framework that hierarchically **INT**egrates chemoproteomic and high-content imaging (HCI) data, introducing chemoproteomics-guided phenotypic aggregation, task-aware cross-modal fusion, and protein\u2013protein interaction priors for modeling activities. On two large-scale in-house datasets, with 99K and 40K compound\u2013HCI pairs from U2OS and iNeuron, **BiGMINT** improves mean AUCROC by up to 10.0% and 4.2%, and high-performing task coverage by up to 103% and 56% over best unimodal and multimodal methods. Thorough analysis revealed mechanistic insights, showing these gains stem from modality complementarity, and protein\u2013protein priors enhance modeling of challenging activities. Code will be released for reproducibility on acceptance of the paper.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39673", "url": null, "sourceid": 40756, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38271, "uid": "9f8e4b84f731020330443ee756a01dc4", "name": "Selective Amnesia using Contrastive Subnet Erasure for Class Level Unlearning in Vision Models", "authors": [{"id": 189469, "fullname": "Vishal Pramanik", "url": "http://cvpr.thecvf.com/api/miniconf/users/189469?format=json", "institution": "University of Florida"}, {"id": 189470, "fullname": "Maisha Maliha", "url": "http://cvpr.thecvf.com/api/miniconf/users/189470?format=json", "institution": "University of Oklahoma"}, {"id": 189471, "fullname": "Susmit Jha", "url": "http://cvpr.thecvf.com/api/miniconf/users/189471?format=json", "institution": "DARPA"}, {"id": 156386, "fullname": "Alvaro Velasquez", "url": "http://cvpr.thecvf.com/api/miniconf/users/156386?format=json", "institution": "Defense Advanced Research Projects Agency"}, {"id": 189472, "fullname": "Olivera Kotevska", "url": "http://cvpr.thecvf.com/api/miniconf/users/189472?format=json", "institution": "Oak Ridge National Laboratory"}, {"id": 179880, "fullname": "Sumit Jha", "url": "http://cvpr.thecvf.com/api/miniconf/users/179880?format=json", "institution": "University of Florida"}], "abstract": "We study concept-level forgetting in pretrained vision models: removing an entire semantic category so the system no longer recognizes that object in unseen images and contexts, rather than merely forgetting specific training examples. Prior work either applies blunt global projections or fine-tunes parameters, which can introduce collateral damage to unrelated features, add compute, and become unstable as forgetting strength increases. We introduce Contrastive Subnet Erasure (CSE), a training-free, encoder-centric edit that targets a compact set of channels most responsible for the class and attenuates them in a calibrated manner. The modification is algebraically folded into the subsequent layer, yielding no inference-time overhead and leaving task heads unchanged. To evaluate whether forgetting generalizes beyond the data used to specify the class, we introduce a cross dataset protocol in which the class is defined on a source dataset and performance is measured on a disjoint target dataset drawn from a different distribution with no shared images. This setup tests whether the model still fails to recognize the object when it looks different or appears in new scenes, and it avoids overfitting to patterns of the source dataset. Across CIFAR 10, CIFAR 100, and ImageNet under this protocol, CSE achieves stronger forgetting of the target class while better preserving non target utility than existing baselines in both single class and multi class settings. Overall, CSE provides a simple stable and deployment ready mechanism for class level unlearning in vision.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38271", "url": null, "sourceid": 35749, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40127, "uid": "9313492d85a66aa6cfec9656fbb95629", "name": "VAST: Video Ability\u2011Stratified Taxonomy for Data\u2011Efficient Video Reasoning", "authors": [{"id": 193591, "fullname": "Zhongan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193591?format=json", "institution": "Zhejiang University"}, {"id": 180353, "fullname": "Xiaoyu Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/180353?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 193592, "fullname": "Lingxiao Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/193592?format=json", "institution": "Fudan University; Shanghai Artificial Intelligence Laboratory"}, {"id": 128709, "fullname": "Kun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/128709?format=json", "institution": "Hefei University of Technology"}, {"id": 188842, "fullname": "zhiliang wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188842?format=json", "institution": "Zhejiang University"}, {"id": 193593, "fullname": "Xingcheng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193593?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 193594, "fullname": "Qiaosheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193594?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 193595, "fullname": "Chaochao Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193595?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 163978, "fullname": "Hehe Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/163978?format=json", "institution": "Zhejiang University"}], "abstract": "Reinforcement learning (RL) enhances reasoning capabilities in multimodal large language models (MLLMs) for video understanding. However, current methods face two coupled challenges. \\textbf{First}, existing methods organize datasets by task types rather than reasoning capabilities. This creates a many-to-many mismatch where models learn task patterns instead of transferable reasoning abilities. Consequently, achieving ability generalization requires broad coverage across ability-task combinations, making RL training costly. \\textbf{Second}, these methods compensate for this inefficiency through complex algorithmic modifications (e.g., specialized temporal architectures or multi-objective reward frameworks), which increase the complexity of training. To address these issues, we take a joint perspective from both the data and method sides. On the data side, we propose VAST, an ability-stratified framework that reorganizes video understanding tasks into a three-layer cognitive taxonomy spanning Perception, Reasoning, and Cognition. We further construct VAST-15K for training and VAST-Bench for evaluation. On the method side, we introduce VideoVAST, employing RL with consistency rewards for reasoning-answer alignment without architectural modifications. Experiments show that VideoVAST achieves 66.3\\% accuracy on MVBench and 57.4\\% on VAST-Bench, compared with 62.7\\% and 54.3\\% respectively for Video-R1. Under the same training settings, VideoVAST uses 72\\% fewer GPU hours and 96\\% fewer training samples. The code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40127", "url": null, "sourceid": 37008, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38469, "uid": "012a7e2d6e350a7ca98e33cbed8be478", "name": "Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs", "authors": [{"id": 128531, "fullname": "Ziqi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128531?format=json", "institution": "Hefei University of Technology"}, {"id": 189918, "fullname": "Chang Che", "url": "http://cvpr.thecvf.com/api/miniconf/users/189918?format=json", "institution": "Hefei University of Technology"}, {"id": 189687, "fullname": "Qi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189687?format=json", "institution": "National University of Defense Technology"}, {"id": 189919, "fullname": "Hui Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/189919?format=json", "institution": "Hefei University of Technology"}, {"id": 189447, "fullname": "Zenglin Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189447?format=json", "institution": "Hefei University of Technology"}, {"id": 86706, "fullname": "Cees G. M. Snoek", "url": "http://cvpr.thecvf.com/api/miniconf/users/86706?format=json", "institution": "University of Amsterdam"}, {"id": 85089, "fullname": "Meng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85089?format=json", "institution": "Hefei University of Technology"}], "abstract": "While continual visual instruction tuning (CVIT) has shown promise in adapting multimodal large language models (MLLMs), existing studies predominantly focus on models without safety alignment. This critical oversight ignores the fact that real-world MLLMs inherently require such mechanisms to mitigate potential risks. In this work, we shift our focus to CVIT for safety-aligned MLLMs and observe that during continual adaptation, the model not only suffers from task forgetting but also exhibits degradation in its safety. Achieving a harmonious balance between safety and task performance remains a crucial challenge. To address this, we propose Harmonious Parameter Adaptation (HPA), a post-training framework composed of focusing-based parameter partition,  harmoniously balanced parameter selection, and orthogonal parameter adjustment. Specifically, HPA partitions parameters into two types based on their focus on safety or task performance, and selects the focused ones to preserve from a balanced perspective. In addition, HPA imposes orthogonality constraints on parameter updates to further alleviate catastrophic forgetting. Extensive experiments on the CVIT benchmark and safety evaluation datasets demonstrate that HPA better maintains high safety and mitigates forgetting than existing baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38469", "url": null, "sourceid": 36895, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40166, "uid": "585a999cb6b22b1e8ced8066e496784c", "name": "Quota-Calibrated Fine-Grained Alignment with Context-Aware Marginals for Text-based Person Retrieval", "authors": [{"id": 183168, "fullname": "Dongsheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183168?format=json", "institution": "Northeast Normal University"}, {"id": 193688, "fullname": "Xinyuan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/193688?format=json", "institution": "Northeast Normal University"}, {"id": 193689, "fullname": "Huijie Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193689?format=json", "institution": "Northeast Normal University"}, {"id": 193690, "fullname": "Pingting Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193690?format=json", "institution": "Northeast Normal University"}, {"id": 193691, "fullname": "Qiushi Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/193691?format=json", "institution": "northeast normal university"}], "abstract": "The core challenge in Text-based Person Retrieval (TPR) lies in establishing fine-grained, many-to-many semantic alignment between textual words and visual regions. Existing methods predominantly rely on pointwise similarity or attention mechanisms, implicitly assuming matches are independent and balanced. Consequently, under conditions of attribute overlap and substantial background noise, these methods often misallocate matching weights to non-discriminative regions or words, resulting in ambiguous matching outcomes. To address this, we propose QC-Align, a quota-calibrated fine-grained alignment framework guided by context-aware marginals. Specifically, we propose a Context-Aware Marginal Estimator (CAME) that dynamically assigns \"matching quotas\" to each word and visual region, and subsequently employs a Quota-Calibrated Transport (QCT) objective to explicitly constrain the matching quality each word and region can carry, thereby jointly optimizing the many-to-many correspondence between text and vision under these constraints. Notably, QC-Align is a parameter-free, plug-and-play training regularizer that requires no fine-grained annotations and incurs no inference overhead. Experiments on multiple mainstream person retrieval benchmarks demonstrate that QC-Align consistently improves baseline model performance, with greater gains and better interpretability in few-shot and cross-domain scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40166", "url": null, "sourceid": 41106, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38142, "uid": "1946140d57f78341485a76d39cd6b0ce", "name": "Residual Diffusion Bridge Model for Image Restoration", "authors": [{"id": 180748, "fullname": "Hebaixu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180748?format=json", "institution": "Zhongguancun Academy, Beijing, China"}, {"id": 91732, "fullname": "Jing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91732?format=json", "institution": "The University of Sydney"}, {"id": 184689, "fullname": "Haoyang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184689?format=json", "institution": "Wuhan University"}, {"id": 186959, "fullname": "Haonan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186959?format=json", "institution": "Wuhan University"}, {"id": 158612, "fullname": "Di Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158612?format=json", "institution": "Wuhan University"}, {"id": 86222, "fullname": "Jiayi Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/86222?format=json", "institution": "Wuhan University"}, {"id": 84747, "fullname": "Bo Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/84747?format=json", "institution": "Wuhan University"}], "abstract": "Diffusion bridge models establish probabilistic paths between arbitrary paired distributions and exhibit great potential for universal image restoration. Most existing methods merely treat them as simple variants of stochastic interpolants, lacking a unified analytical perspective. Besides, they indiscriminately reconstruct images through global noise injection and removal, inevitably distorting undegraded regions due to imperfect reconstruction. To address these challenges, we propose the Residual Diffusion Bridge Model (RDBM). Specifically, we theoretically reformulate the stochastic differential equations of generalized diffusion bridge and derive the analytical formulas of its forward and reverse processes. Crucially, we leverage the residuals from given distributions to modulate the noise injection and removal, enabling adaptive restoration of degraded regions while preserving intact others. Additionally, we unravel the fundamental mathematical essence of existing bridge models, all of which are special cases of RDBM and empirically demonstrate the optimality of our proposed models. Extensive experiments are conducted to demonstrate the state-of-the-art performance of our method both qualitatively and quantitatively across diverse image restoration tasks.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38142", "url": null, "sourceid": 38412, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39855, "uid": "3c10e2d01d8fcae59688c61ccbbca20f", "name": "FLARE: A Failure-Aware Framework for Autonomous Correction and Recovery in Visual-Language Robotic Manipulation", "authors": [{"id": 90968, "fullname": "Ganlong Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/90968?format=json", "institution": "University of Hong Kong"}, {"id": 192988, "fullname": "Zijia Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192988?format=json", "institution": "Duke University"}, {"id": 192989, "fullname": "Xingping Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192989?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 192990, "fullname": "Zhanghui Kuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192990?format=json", "institution": "TengenX"}, {"id": 192991, "fullname": "Ye Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/192991?format=json", "institution": null}, {"id": 74074, "fullname": "Guanbin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/74074?format=json", "institution": "Sun Yat-sen University"}], "abstract": "Vision-Language-Action Models~(VLAs) have demonstrated significant promise in generalizing to complex, long-horizon robotic manipulation tasks. However, their performance remains brittle, as they are typically trained on trajectory-monotonic, failure-free demonstrations. This reliance on \"perfect\" data leaves them unable to recover from common execution errors, such as a missed grasp, a dropped object, or an unexpected collision. In this paper, we propose FLARE, a novel framework that endows VLAs with robust error recovery capabilities through a \"Retry\" and \"Reset\" paradigm. First, we introduce a \"Retry\" mechanism by injecting perturbation and bridging segments that decouple robot pose from environment state into demonstrations, enabling the policy to autonomously handle execution deviations. Second, to address critical, state-breaking (OOD) failures, we introduce a \"Reset\" pipeline. We leverage an MLLM for offline failure analysis to automatically identify OOD states from execution videos. This analysis enables the efficient, targeted collection of a small library of object-centric \"Reset\" skills, which are trained to restore the environment to a task-valid state. Our full framework integrates these learned policies. At inference, an online MLLM monitor arbitrates between task execution and \"Reset\" skills. Experiments on challenging, contact-rich manipulation tasks show our approach significantly improves task success and robustness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39855", "url": null, "sourceid": 38204, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40169, "uid": "90092a96d41dc90bde569b2383465360", "name": "DENALI: A Dataset Enabling Non-Line-of-Sight Spatial Reasoning with Low-Cost LiDARs", "authors": [{"id": 96307, "fullname": "Nikhil Behari", "url": "http://cvpr.thecvf.com/api/miniconf/users/96307?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 193697, "fullname": "Diego Rivero", "url": "http://cvpr.thecvf.com/api/miniconf/users/193697?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 193698, "fullname": "Luke Apostolides", "url": "http://cvpr.thecvf.com/api/miniconf/users/193698?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 74228, "fullname": "Suman Ghosh", "url": "http://cvpr.thecvf.com/api/miniconf/users/74228?format=json", "institution": "TU Berlin"}, {"id": 188104, "fullname": "Paul Pu Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188104?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 85615, "fullname": "Ramesh Raskar", "url": "http://cvpr.thecvf.com/api/miniconf/users/85615?format=json", "institution": "Massachusetts Institute of Technology"}], "abstract": "Consumer LiDARs in mobile devices and robots typically output a single depth value per pixel. Yet internally, they record full time-resolved histograms containing direct and multi-bounce light returns; these multi-bounce returns encode rich non-line-of-sight (NLOS) cues that can enable perception of hidden objects in a scene. However, severe hardware limitations of consumer LiDARs make NLOS reconstruction with conventional methods difficult. In this work, we motivate a complementary direction: enabling NLOS perception with low-cost LiDARs through data-driven inference. We present DENALI, the first large-scale real-world dataset of space\u2013time histograms from low-cost LiDARs capturing hidden objects. We capture time-resolved LiDAR histograms for 72,000 hidden-object scenes across diverse object shapes, positions, lighting conditions, and spatial resolutions. Using our dataset, we show that consumer LiDARs can enable accurate, data-driven NLOS perception. We further identify key scene and modeling factors that limit performance, as well as simulation-fidelity gaps that hinder current sim-to-real transfer, motivating future work toward scalable NLOS vision with consumer LiDARs.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40169", "url": null, "sourceid": 35390, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36518, "uid": "0ccbfba55e36b3ba7c1ad522cd1ae2e5", "name": "TrajTok: Learning Trajectory Tokens enables better Video Understanding", "authors": [{"id": 152947, "fullname": "Chenhao Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/152947?format=json", "institution": "Department of Computer Science, University of Washington"}, {"id": 90298, "fullname": "Jieyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90298?format=json", "institution": "Department of Computer Science, University of Washington"}, {"id": 185258, "fullname": "Jianing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185258?format=json", "institution": "University of Washington"}, {"id": 161151, "fullname": "Weikai Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/161151?format=json", "institution": "University of Washington"}, {"id": 182395, "fullname": "Ashutosh Kumar", "url": "http://cvpr.thecvf.com/api/miniconf/users/182395?format=json", "institution": "Woven by Toyota, Inc."}, {"id": 85099, "fullname": "Quan Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/85099?format=json", "institution": "Woven by Toyota"}, {"id": 90013, "fullname": "Oncel Tuzel", "url": "http://cvpr.thecvf.com/api/miniconf/users/90013?format=json", "institution": "Apple"}, {"id": 84603, "fullname": "Chun-Liang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/84603?format=json", "institution": "Google"}, {"id": 84558, "fullname": "Ranjay Krishna", "url": "http://cvpr.thecvf.com/api/miniconf/users/84558?format=json", "institution": "University of Washington"}], "abstract": "Tokenization in video models, typically through patchification, generates an excessive and redundant number of tokens. This severely limits video efficiency and scalability. While the recent trajectory-based tokenizers offer a promising solution by decoupling video duration from token count, they rely on complex, external segmentation and tracking pipelines that are slow and task-agnostic. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective, dynamically adapting its token granularity to semantic complexity, independent of video duration. TrajTok contains a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity, TrajTok is lightweight, efficient, and yet empirically improves video understanding performance. With TrajTok, we implement a video CLIP model trained from scratch (TrajViT2).  It achieves the best accuracy at scale across both classification and retrieval benchmarks,  while maintaining efficiency comparable to the best token-merging methods. TrajTok also proves to be a versatile component beyond its role as a tokenizer.   We show that it can be seamlessly integrated as either a probing head for pretrained visual features (TrajAdapter) or an alignment connector in vision\u2013language models (TrajVLM) with especially strong performance in long-video reasoning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36518", "url": null, "sourceid": 34973, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40357, "uid": "5162c8e2f5e54609e7f32ae984efe16a", "name": "TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models", "authors": [{"id": 191380, "fullname": "Jiaming He", "url": "http://cvpr.thecvf.com/api/miniconf/users/191380?format=json", "institution": "Ant Group"}, {"id": 191381, "fullname": "Guanyu Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191381?format=json", "institution": "University of Manchester"}, {"id": 87628, "fullname": "Hongwei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/87628?format=json", "institution": "University of Electronic Science and Technology of China, Tsinghua University"}, {"id": 191382, "fullname": "Zhicong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191382?format=json", "institution": "Ant Group"}, {"id": 191383, "fullname": "Kangjie Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191383?format=json", "institution": "Tianjin University"}, {"id": 76212, "fullname": "Yi Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76212?format=json", "institution": "Nanyang Technological University, Singapore"}, {"id": 191384, "fullname": "Wenbo Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191384?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 91735, "fullname": "Guowen Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/91735?format=json", "institution": "Nanyang Technological University"}, {"id": 87654, "fullname": "Tianwei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87654?format=json", "institution": "Nanyang Technological University"}], "abstract": "Text-to-Video (T2V) models are capable of synthesizing high-quality, temporally coherent dynamic video content, but the diverse generation also inherently introduces critical safety challenges. Existing safety evaluation methods, which focus on static image and text generation, are insufficient to capture the complex temporal dynamics in video generation. To address this, we propose a $\\textbf{TE}$mporal-aware $\\textbf{A}$utomated $\\textbf{R}$ed-teaming framework, named $\\textbf{TEAR}$, an automated framework designed to uncover safety risks specifically linked to the dynamic temporal sequencing of T2V models. TEAR employs a temporal-aware test generator optimized via a two-stage approach: initial generator training and temporal-aware online preference learning, to craft textually innocuous prompts that exploit temporal dynamics to elicit policy-violating video output. And a refine model is adopted to improve the prompt stealthiness and adversarial effectiveness cyclically. Extensive experimental evaluation demonstrates the effectiveness of TEAR across open-source and commercial T2V systems with over 80\\% attack success rate, a significant boost from prior best result of 57\\%.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40357", "url": null, "sourceid": -35252, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39111?format=json"], "related_events_ids": [39111]}, {"id": 40352, "uid": "d12dbd37a4a8bc4327e3faa9bf8a20cf", "name": "ChordEdit: One-Step Low-Energy Transport for Image Editing", "authors": [{"id": 190926, "fullname": "Liangsi Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190926?format=json", "institution": "Guangdong University of Technology"}, {"id": 188356, "fullname": "Xuhang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188356?format=json", "institution": "Huizhou University"}, {"id": 190927, "fullname": "Minzhe Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/190927?format=json", "institution": "Guangdong University of Technology"}, {"id": 190928, "fullname": "Shichu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190928?format=json", "institution": "Shenzhen University"}, {"id": 176876, "fullname": "Jingchao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176876?format=json", "institution": "the School of Computer Science, Peking University"}, {"id": 183044, "fullname": "Yang Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/183044?format=json", "institution": "Guangdong University of Technology"}], "abstract": "The advent of one-step text-to-image (T2I) models offers unprecedented synthesis speed. However, their application to text-guided image editing remains severely hampered, as forcing existing training-free editors into a single inference step fails. This failure manifests as severe object distortion and a critical loss of consistency in non-edited regions, resulting from the high-energy, erratic trajectories produced by naive vector arithmetic on the models' structured fields. To address this problem, we introduce \\textbf{ChordEdit}, a model agnostic, training-free, and inversion-free method that facilitates high-fidelity one-step editing. We recast editing as a transport problem between the source and target distributions defined by the source and target text prompts. Leveraging dynamic optimal transport theory, we derive a principled, low-energy control strategy. This strategy yields a smoothed, variance-reduced editing field that is inherently stable, facilitating the field to be traversed in a single, large integration step. A theoretically grounded and experimentally validated approach allows ChordEdit to deliver fast, lightweight and precise edits, finally achieving true real-time editing on these challenging models.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40352", "url": "https://chordedit.github.io/", "sourceid": -45497, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38895?format=json"], "related_events_ids": [38895]}, {"id": 38526, "uid": "cf6a9b5fa924deb1c223f5c9c80ac9a8", "name": "LaDy: Lagrangian-Dynamic Informed Network for Skeleton-based Action Segmentation via Spatial-Temporal Modulation", "authors": [{"id": 181357, "fullname": "Haoyu Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/181357?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 187818, "fullname": "Xueting Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187818?format=json", "institution": "Southern University of Science and Technology"}, {"id": 187817, "fullname": "Yu Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187817?format=json", "institution": "Harbin Institute of Technology"}, {"id": 187816, "fullname": "Wenze Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187816?format=json", "institution": null}, {"id": 159245, "fullname": "Zhihao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/159245?format=json", "institution": "Harbin Institute of Technology"}, {"id": 131200, "fullname": "Weihong Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/131200?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 187819, "fullname": "Zhiyong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187819?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 131181, "fullname": "Honghai LIU", "url": "http://cvpr.thecvf.com/api/miniconf/users/131181?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}], "abstract": "Skeleton-based Temporal Action Segmentation (STAS) aims to densely parse untrimmed skeletal sequences into frame-level action categories. However, existing methods, while proficient at capturing spatio-temporal kinematics, neglect the underlying physical dynamics that govern human motion. This oversight limits inter-class discriminability between actions with similar kinematics but distinct dynamic intents, and hinders precise boundary localization where dynamic force profiles shift. To address these, we propose the Lagrangian-Dynamic Informed Network (LaDy), a framework integrating principles of Lagrangian dynamics into the segmentation process. Specifically, LaDy first computes generalized coordinates from joint positions and then estimates Lagrangian terms under physical constraints to explicitly synthesize the generalized forces. To further ensure physical coherence, our Energy Consistency Loss enforces the work-energy theorem, aligning kinetic energy change with the work done by the net force. The learned dynamics then drive a Spatio-Temporal Modulation module: Spatially, generalized forces are fused with spatial representations to provide more discriminative semantics. Temporally, salient dynamic signals are constructed for temporal gating, thereby significantly enhancing boundary awareness. Experiments on challenging datasets show LaDy achieves state-of-the-art performance, validating the integration of physical dynamics for action segmentation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38526", "url": null, "sourceid": 35439, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37606, "uid": "19321216090a45a21c65ee8ea8cdddbc", "name": "Spectral Scalpel: Amplifying Adjacent Action Discrepancy via Frequency-Selective Filtering for Skeleton-Based Action Segmentation", "authors": [{"id": 181357, "fullname": "Haoyu Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/181357?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 187815, "fullname": "Bowen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187815?format=json", "institution": null}, {"id": 159245, "fullname": "Zhihao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/159245?format=json", "institution": "Harbin Institute of Technology"}, {"id": 187816, "fullname": "Wenze Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187816?format=json", "institution": null}, {"id": 187817, "fullname": "Yu Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187817?format=json", "institution": "Harbin Institute of Technology"}, {"id": 187818, "fullname": "Xueting Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187818?format=json", "institution": "Southern University of Science and Technology"}, {"id": 131200, "fullname": "Weihong Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/131200?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 187819, "fullname": "Zhiyong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187819?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 131181, "fullname": "Honghai LIU", "url": "http://cvpr.thecvf.com/api/miniconf/users/131181?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}], "abstract": "Skeleton-based Temporal Action Segmentation (STAS) seeks to densely segment and classify diverse actions within long, untrimmed skeletal motion sequences. However, existing STAS methodologies face challenges of limited inter-class discriminability and blurred segmentation boundaries, primarily due to insufficient distinction of spatio-temporal patterns between adjacent actions. To address these limitations, we propose Spectral Scalpel, a frequency-selective filtering framework aimed at suppressing shared frequency components between adjacent distinct actions while amplifying their action-specific frequencies, thereby enhancing inter-action discrepancies and sharpening transition boundaries. Specifically, Spectral Scalpel employs adaptive multi-scale spectral filters as scalpels to edit frequency spectra, coupled with a discrepancy loss between adjacent actions serving as the surgical objective. This design amplifies representational disparities between neighboring actions, effectively mitigating boundary localization ambiguities and categorical confusions. Furthermore, complementing long-term temporal modeling, we introduce a frequency-aware channel mixer to strengthen channel evolution by aggregating spectra across channels. This work presents a novel paradigm for STAS that extends conventional spatio-temporal modeling by incorporating frequency-domain analysis. Extensive experiments on five public datasets demonstrate that Spectral Scalpel achieves state-of-the-art performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37606", "url": null, "sourceid": 45073, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37414, "uid": "95f8fbf9e0653a1c0fee3572b5a25042", "name": "DDSF: Robust Few-Shot Learning via Disentangled Subspaces with Determinantal Point Process", "authors": [{"id": 180087, "fullname": "xulun ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/180087?format=json", "institution": "Ningbo University"}, {"id": 187386, "fullname": "Yifan Mei", "url": "http://cvpr.thecvf.com/api/miniconf/users/187386?format=json", "institution": "Ningbo University"}, {"id": 186403, "fullname": "Kun Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186403?format=json", "institution": "Shenzhen University"}, {"id": 187387, "fullname": "Zelei Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187387?format=json", "institution": "Ningbo University"}, {"id": 187388, "fullname": "Jieyu Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187388?format=json", "institution": "Ningbo University"}], "abstract": "The performance of mean-based prototypical methods in few-shot learning is frequently compromised by noise and hard positives, where entangled feature representations cause prototype instability. We present a novel ``Filter-Repair-Expand'' framework grounded in Determinantal Point Process (DPP) theory. The method leverages DPP as its core logic, employing it to estimate sample confidence to filter anomalous samples from the initial set, guide a diffusion process via volume-maximization to enhance the sample representation, and subsequently maximize the volume of synergistic disentangled subspaces, constructing robust and diverse prototype subspaces. Experimental results establish new state-of-the-art performance on multiple benchmarks, demonstrating significant gains in few-shot learning robustness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37414", "url": null, "sourceid": 44021, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36875, "uid": "393a02a8b46c706c42e3aa9795cc73ca", "name": "Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning", "authors": [{"id": 186075, "fullname": "Qi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186075?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 186076, "fullname": "Mian Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186076?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 186077, "fullname": "Yuyang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186077?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 177335, "fullname": "Mingqi Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/177335?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 186078, "fullname": "Wenyao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186078?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 186079, "fullname": "Haoxiang You", "url": "http://cvpr.thecvf.com/api/miniconf/users/186079?format=json", "institution": "Yale University"}, {"id": 180804, "fullname": "Yunbo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180804?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 132631, "fullname": "Xin Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/132631?format=json", "institution": "Eastern Institute of Technology, Ningbo"}, {"id": 86696, "fullname": "Xiaokang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86696?format=json", "institution": "Shanghai Jiao Tong University, China"}, {"id": 133538, "fullname": "Wenjun Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/133538?format=json", "institution": "Eastern Institute of Technology, Ningbo"}], "abstract": "Reinforcement Learning (RL) has achieved remarkable success in various domains, yet it often relies on carefully designed programmatic reward functions to guide agent behavior.  Designing such reward functions can be challenging and may not generalize well across different tasks.  To address this limitation, we leverage the rich world knowledge contained in pretrained video diffusion models to provide goal-driven reward signals for RL agents without ad-hoc design of reward. Our key idea is to exploit off-the-shelf video diffusion models pretrained on large-scale video datasets as informative reward functions in terms of video-level and frame-level goals.  For video-level rewards, we first finetune a pretrained video diffusion model on domain-specific datasets and then employ its video encoder to evaluate the alignment between the latent representations of agent's trajectories and the generated goal videos.  To enable more fine-grained goal-achievement, we derive a frame-level goal by identifying the most relevant frame from the generated video using CLIP, which serves as the goal state.  We then employ a learned forward\u2013backward representation that represents the probability of visiting the goal state from a given state\u2013action pair as frame-level reward, promoting more coherent and goal-driven trajectories. Experiments on various Meta-World tasks demonstrate the effectiveness of our approach.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36875", "url": "https://qiwang067.github.io/genreward", "sourceid": 43987, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65739, "file": "/media/PosterPDFs/CVPR%202026/36875.png", "modified": "2026-04-24T21:06:22.320370-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": true, "sortkey": 0, "is_live_content": false, "detailed_kind": "", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65740, "file": "/media/PosterPDFs/CVPR%202026/36875-thumb.png", "modified": "2026-04-24T20:53:49.004500-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": false, "sortkey": 0, "is_live_content": false, "detailed_kind": "thumb", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65741, "modified": "2026-04-25T02:08:52.041948-07:00", "display_section": 1, "type": "PDF", "name": "Slides", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "/media/cvpr-2026/Slides/36875.pdf", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37627, "uid": "45e16b2b55c1b4854ea2cc7361f3a513", "name": "AeroGS: Scale-Aware Gaussian Splatting for Pose-Free Dynamic UAV Scene Reconstruction", "authors": [{"id": 172770, "fullname": "Tingyun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/172770?format=json", "institution": "Wuhan University"}, {"id": 180422, "fullname": "Xinyi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180422?format=json", "institution": "Wuhan University"}, {"id": 104189, "fullname": "Yongjun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/104189?format=json", "institution": "Wuhan University"}, {"id": 184242, "fullname": "Yi Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184242?format=json", "institution": "Wuhan University"}, {"id": 187902, "fullname": "Xiaoan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187902?format=json", "institution": "Wuhan University"}, {"id": 187903, "fullname": "Fan Weiwei", "url": "http://cvpr.thecvf.com/api/miniconf/users/187903?format=json", "institution": "Wuhan University"}, {"id": 187904, "fullname": "Jiahao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187904?format=json", "institution": "Wuhan University"}], "abstract": "Monocular UAV videos pose a fundamental challenge for 3D reconstruction: dynamic scene modeling requires accurate camera poses, yet recovering poses from long UAV trajectories often fails under texture-sparse regions and moving objects.Existing approaches typically handle either pose-free static reconstruction or dynamic reconstruction with known poses, but jointly solving both from casual aerial footage remains difficult due to motion coupling and severe scale variation.We introduce \\modelname, a scale-aware Gaussian splatting framework that jointly recovers camera trajectories and reconstructs dynamic scenes from pose-free monocular videos.Central to our method are scale-aware spatio-temporal anchors (S$^2$A-Anchors), which enable a unified optimization via three key decoupling mechanisms:(i) separating ego-motion from object motion,(ii) isolating static geometry from temporal deformation, and(iii) adapting scale between distant terrain and nearby objects.This design effectively stabilizes optimization under large motion and scale imbalance.Extensive experiments on UAV and driving benchmarks show that \\modelname~achieves state-of-the-art rendering quality (PSNR/LPIPS), precise trajectory recovery (ATE/RPE), and faithful motion reconstruction, consistently surpassing recent pose-free baselines.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37627", "url": null, "sourceid": 42853, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37827, "uid": "4dad75d2365dedf73aa94b7606de7426", "name": "ConeSep: Cone-based Robust Noise-Unlearning Compositional Network for Composed Image Retrieval", "authors": [{"id": 184421, "fullname": "Zixu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184421?format=json", "institution": "Shandong University"}, {"id": 157298, "fullname": "Yupeng Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157298?format=json", "institution": "Shandong University"}, {"id": 184420, "fullname": "Zhiwei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184420?format=json", "institution": "Shandong University"}, {"id": 188349, "fullname": "Mingyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188349?format=json", "institution": "Shandong University"}, {"id": 184417, "fullname": "Zhiheng Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184417?format=json", "institution": "Shandong University"}, {"id": 84777, "fullname": "Liqiang Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/84777?format=json", "institution": "Harbin Institute of Technology (Shenzhen)"}], "abstract": "The Composed Image Retrieval (CIR) task provides a flexible retrieval paradigm via a reference image and modification text, but it heavily relies on expensive and error-prone triplet annotations. This paper systematically investigates the Noisy Triplet Correspondence (NTC) problem introduced by annotations. We find that NTC noise, particularly \"hard noise\" (i.e., the reference and target images are highly similar but the modification text is incorrect), poses a unique challenge to existing Noise Correspondence Learning (NCL) methods because it breaks the traditional \"small loss hypothesis\". We identify and elucidate three key, yet overlooked, challenges in the NTC task, namely (C1) Modality Suppression, (C2) Negative Anchor Deficiency, and (C3) Unlearning Backlash. To address these challenges, we propose a Cone-based robuSt noisE-unlearning comPositional network (ConeSep). Specifically, we first propose Geometric Fidelity Quantization, theoretically establishing and practically estimating a noise boundary to precisely locate noisy correspondence. Next, we introduce Negative Boundary Learning, which learns a \"diagonal negative combination\" for each query as its explicit semantic opposite-anchor in the embedding space. Finally, we design Boundary-based Targeted Unlearning, which models the noisy correction process as an optimal transport problem, elegantly avoiding Unlearning Backlash. Extensive experiments on benchmark datasets (FashionIQ and CIRR) demonstrate that ConeSep significantly outperforms current state-of-the-art methods, which fully demonstrates the effectiveness and robustness of our method.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37827", "url": null, "sourceid": 38962, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39966, "uid": "893f93375b0b69ee6854a8c70aa78689", "name": "E-comIQ-ZH: A Human-Aligned Dataset and Benchmark for Fine-Grained Evaluation of E-commerce Posters with Chain-of-Thought", "authors": [{"id": 193210, "fullname": "Meiqi Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/193210?format=json", "institution": "Alibaba Group"}, {"id": 193211, "fullname": "mingyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193211?format=json", "institution": null}, {"id": 193212, "fullname": "Junxiong Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193212?format=json", "institution": "Alibaba Group"}], "abstract": "Generative AI is widely used to create commercial posters. However, rapid advances in generation have outpaced automated quality assessment. Existing models emphasize generic esthetics or low level distortions and lack the functional criteria required for e-commerce design. It is especially challenging for Chinese content, where complex characters often produce subtle but critical textual artifacts that are overlooked by existing methods. To address this, we introduce E-comIQ-ZH, a framework for evaluating Chinese e-commerce posters. We build the first dataset \\textbf{E-comIQ-18k} to feature multi dimensional scores and expert calibrated Chain of Thought (CoT) rationales. Using this dataset, we train \\textbf{E-comIQ-M}, a specialized evaluation model that aligns with human expert judgment. Our framework enables \\textbf{E-comIQ-Bench}, the first automated and scalable benchmark for the generation of Chinese e-commerce posters. Extensive experiments show our E-comIQ-M aligns more closely with expert standards and enables scalable automated assessment of e-commerce posters. All datasets, models, and evaluation tools will be released to support future research in this area.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39966", "url": null, "sourceid": 32372, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38433, "uid": "a81d0525b1bf67680570fa60790b8e07", "name": "Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning", "authors": [{"id": 189858, "fullname": "Shashanka Venkataramanan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189858?format=json", "institution": "Valeo.ai"}, {"id": 189859, "fullname": "Valentinos Pariza", "url": "http://cvpr.thecvf.com/api/miniconf/users/189859?format=json", "institution": "niversity of Technology Nuremberg"}, {"id": 189860, "fullname": "Mohammadreza Salehi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189860?format=json", "institution": "Samsung; University of Amsterdam"}, {"id": 126805, "fullname": "Lukas Knobel", "url": "http://cvpr.thecvf.com/api/miniconf/users/126805?format=json", "institution": "University of Amsterdam &amp; TNO"}, {"id": 189861, "fullname": "Elias Ramzi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189861?format=json", "institution": "Valeo"}, {"id": 87186, "fullname": "Spyros Gidaris", "url": "http://cvpr.thecvf.com/api/miniconf/users/87186?format=json", "institution": "Valeo.ai"}, {"id": 87148, "fullname": "Andrei Bursuc", "url": "http://cvpr.thecvf.com/api/miniconf/users/87148?format=json", "institution": "valeo.ai"}, {"id": 189862, "fullname": "Yuki M Asano", "url": "http://cvpr.thecvf.com/api/miniconf/users/189862?format=json", "institution": "University of Technology Nuremberg"}], "abstract": "We present Franca (pronounced Fran-ka): free one; the first fully open-source (data, code, weights) vision foundation model that matches and in many cases surpasses the performance of state-of-the-art proprietary models, e.g., DINOv2, CLIP, SigLIPv2, etc. Our approach is grounded in a transparent training pipeline inspired by Web-SSL and uses publicly available data: ImageNet-21K and a subset of ReLAION-2B. Beyond model release, we tackle critical limitations in self-supervised learning  clustering methods. Existing approaches assign image features to large codebooks via clustering algorithms such as Sinkhorn-Knopp, but they often overlook the inherent ambiguity in cluster semantics. To address this, we introduce a multi-head clustering projector based on nested Matryoshka representations. This design progressively refines features into increasingly fine-grained clusters without increasing the model size, producing higher-quality dense representations. Additionally, we propose a novel positional disentanglement strategy that explicitly removes positional biases from dense representations.This leads to consistent gains on several downstream benchmarks, demonstrating the utility of cleaner feature spaces. Our contributions establish a new standard for transparent, high-performance vision models and open a path toward more reproducible and generalizable foundation models for the broader AI community.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38433", "url": null, "sourceid": 44745, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37810, "uid": "58770e07bc120e15567ce7e2d014f19c", "name": "BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting", "authors": [{"id": 181287, "fullname": "Jiaxing Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181287?format=json", "institution": "Nanjing University"}, {"id": 188323, "fullname": "Dongyang Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/188323?format=json", "institution": "nanjing university"}, {"id": 188324, "fullname": "Hangyu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188324?format=json", "institution": "nanjing university"}, {"id": 188325, "fullname": "Zhouyuxiao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188325?format=json", "institution": "nanjing university"}, {"id": 152507, "fullname": "Yuanqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152507?format=json", "institution": "Nanjing University"}, {"id": 188326, "fullname": "Jie Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188326?format=json", "institution": "Nanjing university"}, {"id": 188327, "fullname": "Zhengkang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/188327?format=json", "institution": "Nanjing Urban Construction Tunnel&amp;Bridge Intelligent Management Co.,Ltd."}, {"id": 90556, "fullname": "Yanwen Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/90556?format=json", "institution": "Nanjing University"}], "abstract": "The boundary representation (B-rep) models a 3D solid as its explicit boundaries: trimmed corners, edges, and faces. Recovering B-rep representation from unstructured data is a challenging and valuable task of computer vision and graphics. Recent advances in deep learning have greatly improved the recovery of 3D shape geometry, but still depend on dense and clean point clouds and struggle to generalize to novel shapes. We propose B-rep Gaussian Splatting (BrepGaussian), a novel framework that learns 3D parametric representations from 2D images. We employ a Gaussian Splatting renderer with learnable features, followed by a specific fitting strategy. To disentangle geometry reconstruction and feature learning, we introduce a two-stage learning framework that first captures geometry and edges and then refines patch features to achieve clean geometry and coherent instance representations. Extensive experiments demonstrate the superior performance of our approach to state-of-the-art methods. We will release our code and datasets upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37810", "url": null, "sourceid": 33810, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38777, "uid": "a8a5204dd969d34ddbd26915537bd937", "name": "Explaining CLIP Zero-shot Predictions Through Concepts", "authors": [{"id": 190640, "fullname": "Onat Ozdemir", "url": "http://cvpr.thecvf.com/api/miniconf/users/190640?format=json", "institution": "University of Edinburgh"}, {"id": 190641, "fullname": "Anders Christensen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190641?format=json", "institution": "Orbital"}, {"id": 154719, "fullname": "Stephan Alaniz", "url": "http://cvpr.thecvf.com/api/miniconf/users/154719?format=json", "institution": "T\u00e9l\u00e9com Paris"}, {"id": 154682, "fullname": "Zeynep Akata", "url": "http://cvpr.thecvf.com/api/miniconf/users/154682?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}, {"id": 134258, "fullname": "Emre Akbas", "url": "http://cvpr.thecvf.com/api/miniconf/users/134258?format=json", "institution": "METU"}], "abstract": "Large-scale vision-language models such as CLIP have achieved remarkable success in zero-shot image recognition, yet their predictions remain largely opaque to human understanding. In contrast, Concept Bottleneck Models provide interpretable intermediate representations by reasoning through human-defined concepts, but they rely on concept supervision and lack the ability to generalize to unseen classes.  We introduce a framework that bridges these two paradigms by explaining CLIP\u2019s zero-shot predictions through human-understandable concepts. Our method projects CLIP\u2019s joint image-text embeddings into a concept space learned from language descriptions, enabling faithful and transparent explanations without additional supervision. The model learns this projection via a combination of alignment and reconstruction objectives, ensuring that concept activations preserve CLIP\u2019s semantic structure while remaining interpretable. Extensive experiments on five benchmark datasets, CIFAR-100, CUB-200-2011, Places365, ImageNet-100, and ImageNet-1k, demonstrate that our approach maintains CLIP\u2019s strong zero-shot classification accuracy while providing meaningful concept-level explanations. By grounding open-vocabulary predictions in explicit semantic concepts, our method offers a principled step toward interpretable and trustworthy vision-language models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38777", "url": "https://oonat.github.io/ezpc/", "sourceid": 44358, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38406, "uid": "19c2d0bfe72d7072c73323cd5c94bcc0", "name": "MUFASA: A Multi-Layer Framework for Slot Attention", "authors": [{"id": 189810, "fullname": "Sebastian Bock", "url": "http://cvpr.thecvf.com/api/miniconf/users/189810?format=json", "institution": "Technical University of Darmstadt; Zuse School ELIZA"}, {"id": 189811, "fullname": "Leonie Sch\u00fc\u00dfler", "url": "http://cvpr.thecvf.com/api/miniconf/users/189811?format=json", "institution": "Technical University Darmstadt"}, {"id": 96327, "fullname": "Krishnakant Singh", "url": "http://cvpr.thecvf.com/api/miniconf/users/96327?format=json", "institution": "TU Darmstadt"}, {"id": 92831, "fullname": "Simone Schaub-Meyer", "url": "http://cvpr.thecvf.com/api/miniconf/users/92831?format=json", "institution": "TU Darmstadt"}, {"id": 73884, "fullname": "Stefan Roth", "url": "http://cvpr.thecvf.com/api/miniconf/users/73884?format=json", "institution": "Technische Universit\u00e4t Darmstadt"}], "abstract": "Unsupervised object-centric learning (OCL) decomposes visual scenes into distinct entities. Slot attention is a popular approach that represents individual objects as latent vectors, called slots. Current methods obtain these slot representations solely from the last layer of a pre-trained vision transformer (ViT), ignoring valuable, semantically rich information encoded across the other layers. To better utilize this latent semantic information, we introduce MUFASA, a lightweight plug-and-play framework for slot attention-based approaches to unsupervised object segmentation. Our model computes slot attention across multiple feature layers of the ViT encoder, fully leveraging their semantic richness. We propose a fusion strategy to aggregate slots obtained on multiple layers into a unified object-centric representation. Integrating MUFASA into existing OCL methods improves their segmentation results across multiple datasets, setting a new state of the art while simultaneously improving training convergence with only minor inference overhead.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38406", "url": null, "sourceid": 40087, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38305, "uid": "f8ace07a82a8c427a24248210e8f783b", "name": "Learning a Unified Latent Action Space from Videos with Action-centric Cycle Consistency", "authors": [{"id": 158350, "fullname": "Guangyan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/158350?format=json", "institution": "Beijing Institute of Technology"}, {"id": 189549, "fullname": "Qi Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189549?format=json", "institution": "Beijing Institute of Technology"}, {"id": 158351, "fullname": "Te Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/158351?format=json", "institution": "Beijing Institute of Technology"}, {"id": 189550, "fullname": "Zichen Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/189550?format=json", "institution": "Beijing Institute of Technology"}, {"id": 189551, "fullname": "Weixin Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189551?format=json", "institution": "Waseda University"}, {"id": 189552, "fullname": "Luojie Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189552?format=json", "institution": "Beijing Institute of Technology"}, {"id": 158352, "fullname": "Meiling Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158352?format=json", "institution": "Beijing Institute of Technology"}, {"id": 155459, "fullname": "Yi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155459?format=json", "institution": "Beijing Institute of Technology"}, {"id": 186935, "fullname": "Hua Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186935?format=json", "institution": "Zhejiang University"}, {"id": 155458, "fullname": "Yufeng Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/155458?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "Video data provides a rich source beyond expensive action-labeled data for advancing robot learning. Recent approaches have demonstrated promising potential in leveraging video data by learning latent actions for policy training. The latent action tokenizer encodes latent actions between successive video frames, and the tokenizer is trained to reconstruct future frames using current frames and the encoded latent actions. However, the unique pairing of successive frames permits future frame reconstruction with little understanding of transition dynamics, hindering the learning of semantically consistent latent actions. Moreover, the tokenizer typically allocates distinct latent action subsets to individual embodiments to accommodate heterogeneous morphologies, constraining knowledge transfer. To overcome such limitations, we propose the action-centric cycle consistency, aiming to establish a unified latent action space. Our method samples latent actions from the latent action space and decodes them with video frames to generate diverse subsequent frames, then enforces cycle consistency by predicting the sampled actions from both original and generated frames. Our concise method creates a challenging task that learns corresponding latent actions from current frames and diverse generated future frames, compelling the tokenizer to develop semantically consistent action representations. Additionally, sampled latent actions can be applied to video frames from distinct embodiments, facilitating the alignment of latent actions across embodiments. Experiments demonstrate that our approach achieves a 20.1% improvement over OpenVLA on the LIBERO benchmark and increases the average length from 3.27 to 3.93 on the CALVIN benchmark. In real-world experiments, our method maintains strong performance with a 44% improvement.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38305", "url": null, "sourceid": 32035, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37696, "uid": "c64e9c3927db744afc824dd2c8d81008", "name": "EasyOmnimatte: Taming Pretrained Inpainting Diffusion Models for End-to-End Video Layered Decomposition", "authors": [{"id": 188035, "fullname": "Yihan Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188035?format=json", "institution": "Great Bay University"}, {"id": 88780, "fullname": "Xuelin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88780?format=json", "institution": "Adobe Research"}, {"id": 85466, "fullname": "Xiaodong Cun", "url": "http://cvpr.thecvf.com/api/miniconf/users/85466?format=json", "institution": "Tencent AI Lab"}], "abstract": "Existing video omnimatte methods typically rely on slow, multi-stage, or inference-time optimization pipelines that fail to fully exploit powerful generative priors, producing suboptimal decompositions. Our key insight is that, if a video inpainting model can be finetuned to remove the foreground-associated effects, then it must be inherently capable of perceiving these effects, and hence can also be finetuned for the complementary task: foreground layer decomposition with associated effects. However, although na\u00efvely finetuning the inpainting model with LoRA applied to all blocks can produce high-quality alpha mattes, it fails to capture associated effects. Our systematic analysis reveals this arises because effect-related cues are primarily encoded in specific DiT blocks and become suppressed when LoRA is applied across all blocks. To address this, we introduce EasyOmnimatte, the first unified, end-to-end video omnimatte method. Concretely, we finetune a pretrained video inpainting diffusion model to learn dual complementary experts while keeping its original weights intact: an Effect Expert, where LoRA is applied only to effect-sensitive DiT blocks to capture the coarse structure of the foreground and associated effects, and a fully LoRA-finetuned Quality Expert learns to refine the alpha matte. During sampling, Effect Expert is used for denoising at early, high-noise steps, while Quality Expert takes over at later, low-noise steps. This design eliminates the need for two full diffusion passes, significantly reducing computational cost without compromising output quality. Ablation studies validate the effectiveness of this Dual-Expert strategy. Experiments demonstrate that EasyOmnimatte sets a new state-of-the-art for video omnimatte and enables various downstream tasks, significantly outperforming baselines in both quality and efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37696", "url": null, "sourceid": 41120, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37422, "uid": "445fa0405d9532d901b9b3a9da55a813", "name": "GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization", "authors": [{"id": 180633, "fullname": "Zixuan Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/180633?format=json", "institution": "Zhongguancun Academy"}, {"id": 91732, "fullname": "Jing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91732?format=json", "institution": "The University of Sydney"}, {"id": 158612, "fullname": "Di Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158612?format=json", "institution": "Wuhan University"}, {"id": 187414, "fullname": "Zidie Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/187414?format=json", "institution": "JiLin University"}, {"id": 187415, "fullname": "Wenbin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187415?format=json", "institution": "Jilin University, China"}, {"id": 186959, "fullname": "Haonan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186959?format=json", "institution": "Wuhan University"}, {"id": 187416, "fullname": "En Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187416?format=json", "institution": "Jilin University, China"}, {"id": 84747, "fullname": "Bo Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/84747?format=json", "institution": "Wuhan University"}], "abstract": "Cross-view geo-localization infers a location by retrieving geo-tagged reference images that visually correspond to a query image. However, the traditional satellite-centric paradigm limits robustness when high-resolution or up-to-date satellite imagery is unavailable. It further underexploits complementary cues across views (e.g., drone, satellite, and street) and modalities (e.g., language and image). To address these challenges, we propose GeoBridge, a foundation model that performs bidirectional matching across views and supports language-to-image retrieval. Going beyond traditional satellite-centric formulations, GeoBridge builds on a novel semantic-anchor mechanism that bridges multi-view features through textual descriptions for robust, flexible localization. In support of this task, we construct GeoLoc, the first large-scale, cross-modal, and multi-view aligned dataset comprising over 50,000 pairs of drone, street-view panorama, and satellite images as well as their textual descriptions collected from 36 countries, ensuring both geographic and semantic alignment. We performed broad evaluations across multiple tasks. Experiments confirm that GeoLoc pre-training markedly improves geo-location accuracy for GeoBridge while promoting cross-domain generalization and cross-modal knowledge transfer. The dataset, source code, and pretrained models will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37422", "url": null, "sourceid": 39093, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36693, "uid": "ef152a79c2f6111858309f64e2d68ebd", "name": "Med-CMR: A Fine-Grained Benchmark Integrating Visual Evidence and Clinical Logic for Medical Complex Multimodal Reasoning", "authors": [{"id": 185654, "fullname": "Haozhen Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/185654?format=json", "institution": "national university of singaore, National University of Singapore"}, {"id": 152687, "fullname": "Xiaozhong Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/152687?format=json", "institution": "Tencent Youtu Lab"}, {"id": 175257, "fullname": "Yuansen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175257?format=json", "institution": "National University of Singapore"}, {"id": 185655, "fullname": "Wenbin Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185655?format=json", "institution": "National University of Singapore"}, {"id": 185656, "fullname": "Xiaoxiao Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185656?format=json", "institution": "Ruijin Hospital, Shanghai Jiao Tong University School of Medicine"}, {"id": 185657, "fullname": "jingjing liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185657?format=json", "institution": "National University of Singapore"}, {"id": 156407, "fullname": "Kai WU", "url": "http://cvpr.thecvf.com/api/miniconf/users/156407?format=json", "institution": "Tencent YouTu Lab"}, {"id": 185658, "fullname": "Jiazhen Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185658?format=json", "institution": "Technical University of Munich"}, {"id": 131883, "fullname": "Bailiang Jian", "url": "http://cvpr.thecvf.com/api/miniconf/users/131883?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}, {"id": 88656, "fullname": "Jiangning Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88656?format=json", "institution": "Tencent Youtu Lab"}, {"id": 152689, "fullname": "Xiaobin Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152689?format=json", "institution": "Tencent AI Lab"}, {"id": 85991, "fullname": "Hongwei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/85991?format=json", "institution": "University of Zurich"}], "abstract": "MLLMs are beginning to appear in clinical workflows, but their ability to perform complex medical reasoning remains unclear. We present Med-CMR, a fine-grained Medical Complex Multimodal Reasoning benchmark. Med-CMR distinguishes from existing counterparts by three core features: 1) Systematic capability decomposition, splitting medical multimodal reasoning into fine-grained visual understanding and multi-step reasoning to enable targeted evaluation; 2) Challenging task design, with visual understanding across three key dimensions (small-object detection, fine-detail discrimination, spatial understanding) and reasoning covering four clinically relevant scenarios (temporal prediction, causal reasoning, long-tail generalization, multi-source integration); 3) Broad, high-quality data coverage, comprising 20,653 Visual Question Answering (VQA) pairs spanning 11 organ systems and 12 imaging modalities, validated via a rigorous two-stage (human expert + model-assisted) review to ensure clinical authenticity. We evaluate 18 state-of-the-art MLLMs with Med-CMR, revealing GPT-5 as the top-performing commercial model: 57.81 accuracy on multiple-choice questions (MCQs) and a 48.70 open-ended score, outperforming Gemini 2.5 Pro (49.87 MCQ accuracy, 45.98 open-ended score) and leading open-source model Qwen3-VL-235B-A22B (49.34 MCQ accuracy, 42.62 open-ended score). However, specialized medical MLLMs do not reliably outperform strong general models, and long-tail generalization emerges as the dominant failure mode. Med-CMR thus provides a stress test for visual\u2013reasoning integration and rare-case robustness in medical MLLMs, and a rigorous yardstick for future clinical systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36693", "url": null, "sourceid": 33610, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37404, "uid": "8e9240a9f16db317677aa70bfeb6f560", "name": "Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery", "authors": [{"id": 187357, "fullname": "Wei He", "url": "http://cvpr.thecvf.com/api/miniconf/users/187357?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 187358, "fullname": "Xianghan Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187358?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 187359, "fullname": "Zhiyuan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187359?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 85981, "fullname": "Xianbiao Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/85981?format=json", "institution": "International Digital Economy Academy"}, {"id": 184713, "fullname": "Rong Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184713?format=json", "institution": null}, {"id": 181531, "fullname": "CHUNGUANG LI", "url": "http://cvpr.thecvf.com/api/miniconf/users/181531?format=json", "institution": "Beijing University of Posts and Telecommunications, P.R. China"}], "abstract": "Generalized Category Discovery (GCD) aims to identify both known and unknown categories, with only partial labels given for the known categories, posing a challenging open-set recognition problem. State-of-the-art approaches for GCD task are usually built on multi-modality representation learning, which is heavily depend upon inter-modality alignment. However, few of them cast a proper intra-modality alignment to generate a desired underlying structure of representation distributions. In this paper, we propose a novel and effective multi-modal representation learning framework for GCD via Semi-Supervised Rate Reduction, called SSR$^2$-GCD, to learn cross-modality representations with desired structural properties based on emphasizing to properly align intra-modality relationships. Moreover, to boost knowledge transfer, we integrate prompt candidates by leveraging the inter-modal alignment offered by Vision Language Models. We conduct extensive experiments on generic and fine-grained benchmark datasets demonstrating superior performance of our approach.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37404", "url": null, "sourceid": 31957, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38597, "uid": "7143b3c23169789d1d83178002a9b07f", "name": "PhysInOne: Visual Physics Learning and Reasoning in One Suite", "authors": [{"id": 158684, "fullname": "Siyuan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/158684?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 190248, "fullname": "Hejun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190248?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 190249, "fullname": "Hu Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190249?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 158682, "fullname": "Jinxi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/158682?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 190250, "fullname": "Dongsheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190250?format=json", "institution": "Occidental College"}, {"id": 190251, "fullname": "Junwei Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190251?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 190252, "fullname": "Yixiao Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190252?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 190253, "fullname": "Jiayue Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190253?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 190254, "fullname": "Shiwei Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190254?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 177017, "fullname": "Shangjia Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/177017?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 183886, "fullname": "Yafei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183886?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 190255, "fullname": "Hongkang Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/190255?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 187014, "fullname": "Shenxing Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/187014?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 76680, "fullname": "Zihui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76680?format=json", "institution": "The Hong Kong Polytechnic University, Hong Kong Polytechnic University"}, {"id": 190256, "fullname": "DataTeam vLAR", "url": "http://cvpr.thecvf.com/api/miniconf/users/190256?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 77364, "fullname": "Bing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77364?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 190257, "fullname": "Zhihua Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190257?format=json", "institution": "Synapsor"}, {"id": 75768, "fullname": "Chuhang Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/75768?format=json", "institution": "Amazon"}, {"id": 85913, "fullname": "Bo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85913?format=json", "institution": "The Hong Kong Polytechnic University"}], "abstract": "We present **PhysInOne**, a large-scale synthetic dataset addressing the critical scarcity of physically-grounded training data for AI systems. Unlike existing datasets limited to merely hundreds or thousands of examples, PhysInOne provides 1.4 million videos across 129,400 dynamic 3D scenes, covering 68 basic physical phenomena in mechanics, optics, fluid dynamics, and magnetism. Distinct from previous works, our scenes feature multiobject interactions against complex backgrounds, with comprehensive ground-truth annotations including 3D geometry, semantics, dynamic motion, physical properties, and text descriptions. We demonstrate PhysInOne\u2019s efficacy across four emerging applications: physics-aware video generation, long-/short-term future frame prediction, physical property estimation, and motion transfer. Experiments show that fine-tuning foundation models on PhysInOne significantly enhances physical plausibility, while also exposing critical gaps in modeling complex physical dynamics and estimating intrinsic properties. As the largest dataset of its kind, orders of magnitude beyond prior works, PhysInOne establishes a new benchmark for advancing physics-grounded world models in generation, simulation, and embodied AI.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38597", "url": null, "sourceid": 32399, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39137, "uid": "fa71aa74e9991ac6b0425e28bc5e5484", "name": "Zero-Shot Feature Upsampling via Neighborhood Attention Filtering.", "authors": [{"id": 127650, "fullname": "Loick Chambon", "url": "http://cvpr.thecvf.com/api/miniconf/users/127650?format=json", "institution": "Valeo"}, {"id": 191427, "fullname": "Paul Couairon", "url": "http://cvpr.thecvf.com/api/miniconf/users/191427?format=json", "institution": null}, {"id": 86061, "fullname": "\u00c9loi Zablocki", "url": "http://cvpr.thecvf.com/api/miniconf/users/86061?format=json", "institution": "Valeo"}, {"id": 87147, "fullname": "Alexandre Boulch", "url": "http://cvpr.thecvf.com/api/miniconf/users/87147?format=json", "institution": "valeo.ai"}, {"id": 191428, "fullname": "Nicolas THOME", "url": "http://cvpr.thecvf.com/api/miniconf/users/191428?format=json", "institution": "sorbonne universit\u00e9"}, {"id": 77540, "fullname": "Matthieu Cord", "url": "http://cvpr.thecvf.com/api/miniconf/users/77540?format=json", "institution": "Sorbonne Universit\u00e9"}], "abstract": "Vision Foundation Models (VFMs) extract spatially downsampled representations, which poses challenges for pixel-level tasks that require fine-grained details.Existing approaches face a trade-off: classical filters are fast and broadly applicable but use fixed forms and feature-independent guidance, while modern upsamplers achieve stronger accuracy with learnable, VFM-specific guidance but require retraining per VFM.We introduce Neighborhood Attention Filtering (NAF), bridging classical filtering with modern upsamplers. Guided solely by the high-resolution input image, NAF learns adaptive content and spatial weights through Cross-Scale Neighborhood Attention and Rotary Position Embeddings (RoPE).NAF is VFM-agnostic and zero-shot: once trained, it upsamples features from any VFM without retraining, being the first VFM-agnostic architecture to outperform VFM-specific upsamplers by achieving state-of-the-art scores on multiple downstream tasks.It remains highly efficient, scaling to 2K feature maps and reconstructing intermediate-resolution maps at 18 FPS.Beyond feature upsampling, NAF demonstrates strong performance on image restoration, showing its versatility. We open-source our code and checkpoints.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39137", "url": null, "sourceid": 42727, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38653, "uid": "aa37b70bbe5a37d659bf67dee2ca9492", "name": "Understanding, Accelerating, and Improving MeanFlow Training", "authors": [{"id": 130508, "fullname": "Jin-Young Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/130508?format=json", "institution": "Yonsei University"}, {"id": 130485, "fullname": "Hyojun Go", "url": "http://cvpr.thecvf.com/api/miniconf/users/130485?format=json", "institution": "Twelvelabs"}, {"id": 190395, "fullname": "Lea Bogensperger", "url": "http://cvpr.thecvf.com/api/miniconf/users/190395?format=json", "institution": "University of Zurich"}, {"id": 88311, "fullname": "Julius Erbach", "url": "http://cvpr.thecvf.com/api/miniconf/users/88311?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 190396, "fullname": "Nikolai Kalischek", "url": "http://cvpr.thecvf.com/api/miniconf/users/190396?format=json", "institution": "Google"}, {"id": 87927, "fullname": "Federico Tombari", "url": "http://cvpr.thecvf.com/api/miniconf/users/87927?format=json", "institution": "Google, TUM"}, {"id": 86863, "fullname": "Konrad Schindler", "url": "http://cvpr.thecvf.com/api/miniconf/users/86863?format=json", "institution": "ETH Zurich"}, {"id": 138002, "fullname": "Dominik Narnhofer", "url": "http://cvpr.thecvf.com/api/miniconf/users/138002?format=json", "institution": "ETHZ - ETH Zurich"}], "abstract": "MeanFlow promises high-quality generative modeling in few steps, by jointly learning instantaneous and average velocity fields. Yet, the underlying training dynamics remain unclear. We analyze the interaction between the two velocities and find: (i) well-established instantaneous velocity is a prerequisite for learning average velocity; (ii) learning of instantaneous velocity benefits from average velocity when the temporal gap is small, but degrades as the gap increases; and (iii) task-affinity analysis indicates that smooth learning of large-gap average velocities, essential for one-step generation, depends on the prior formation of accurate instantaneous and small-gap average velocities. Guided by these observations, we design an effective training scheme that accelerates the formation of instantaneous velocity, then shifts emphasis from short- to long-interval average velocity. Our enhanced MeanFlow training yields faster convergence and significantly better few-step generation: With the same DiT-XL backbone, our method reaches an impressive FID of 2.87 on 1-NFE ImageNet 256x256, compared to 3.43 for the conventional MeanFlow baseline. Alternatively, our method matches the performance of the MeanFlow baseline with 2.5x shorter training time, or with a smaller DiT-L backbone.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38653", "url": null, "sourceid": 30893, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37314, "uid": "cbaafc6f429ace305bb4ead3bff5f73c", "name": "EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs", "authors": [{"id": 187138, "fullname": "Zhenghao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187138?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 187139, "fullname": "Huiqun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187139?format=json", "institution": "ZGC Lab"}, {"id": 87605, "fullname": "Di Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87605?format=json", "institution": "Beihang University"}], "abstract": "Multimodal large language models (MLLMs) are increasingly applied to spatial cognition tasks, where they are expected to understand and interact with complex environments. Most concurrent works enhance spatial reasoning by introducing 3D priors or geometric supervision, which improves performance but incurs substantial data preparation and alignment costs. Purely 2D approaches, however, struggle with multi-frame spatial reasoning due to missing viewpoint transitions and overlooked implicit objects that act as spatial bridges. To address these limitations, we propose EgoMind, a Chain-of-Thought framework that enables geometry-free spatial reasoning through Role-Play Captioning and Progressive Spatial Analysis, jointly constructing a coherent linguistic scene graph across frames. With only 5K auto-generated SFT samples and 20K RL samples, EgoMind achieves competitive results on VSI-Bench, SPAR-Bench, STI-Bench and SPBench, demonstrating its effectiveness in reinforcing the spatial reasoning capabilities of MLLMs and highlighting the potential of linguistic reasoning for spatial cognition.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37314", "url": null, "sourceid": 34722, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36186, "uid": "ebc764ed84593fef4bd6cde9eb72be0a", "name": "TAlignDiff: Automatic Tooth Alignment assisted by Diffusion-based Transformation Learning", "authors": [{"id": 184368, "fullname": "Yunbi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184368?format=json", "institution": "Nanjing University of Posts and Telecommunications"}, {"id": 184369, "fullname": "Enqi Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184369?format=json", "institution": "Nanjing University of Posts and Telecommunications"}, {"id": 184370, "fullname": "Shiyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184370?format=json", "institution": "Nanjing University of Posts and Telecommunications"}, {"id": 157785, "fullname": "hui shuai", "url": "http://cvpr.thecvf.com/api/miniconf/users/157785?format=json", "institution": "Nanjing University of Posts and Telecommunications"}, {"id": 177875, "fullname": "Lei Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/177875?format=json", "institution": "Tongji Univerisity"}, {"id": 128941, "fullname": "Juncheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/128941?format=json", "institution": "Shanghai University"}, {"id": 184371, "fullname": "Kuai Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184371?format=json", "institution": "Nanjing Medical University"}, {"id": 184372, "fullname": "Shu Lou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184372?format=json", "institution": null}, {"id": 184373, "fullname": "Yongchu Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184373?format=json", "institution": null}, {"id": 131414, "fullname": "Qingshan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131414?format=json", "institution": "Nanjing University of Posts and Telecommunications"}], "abstract": "Orthodontic treatment hinges on tooth alignment, which significantly affects occlusal function, facial aesthetics, and patients' quality of life. Current deep learning approaches predominantly predict transformation matrices for the misaligned tooth point cloud via point-to-point geometric constraints to achieve tooth alignment. Nevertheless, these matrices are likely to exhibit clinical-specific distributions, which deterministic constraints fail to capture. To address this, we introduce a new automatic tooth alignment method named TAlignDiff, which is assisted by diffusion-based transformation learning. TAlignDiff comprises two main components: a primary point cloud-based regression network (PRN) and a diffusion-based transformation matrix denoising module (DTMD). Geometry-constrained losses supervise PRN learning for point cloud-level alignment. DTMD, as an auxiliary module, learns the latent distribution of transformation matrices from clinical data. We integrate point cloud-based transformation regression and diffusion-based transformation modeling into a unified framework, allowing bidirectional feedback between geometric constraints and diffusion refinement. We validate our method on a challenge dataset from clinical practice and an extra orthodontic dataset. Its efficacy was confirmed through effective ablation studies and comparative analyses, highlighting its potential for application in orthodontic treatment.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36186", "url": null, "sourceid": 32479, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38724, "uid": "f9f0cd50df8a94019fc0391c61b96f3b", "name": "Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction", "authors": [{"id": 181295, "fullname": "Yuanbo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181295?format=json", "institution": "Jiangnan University"}, {"id": 75854, "fullname": "Tianyang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75854?format=json", "institution": "Jiangnan University"}, {"id": 184809, "fullname": "Cong Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184809?format=json", "institution": "Jiangnan University"}, {"id": 184810, "fullname": "Tao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184810?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 129533, "fullname": "Xiaojun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129533?format=json", "institution": "Jiangnan University"}, {"id": 154654, "fullname": "Josef Kittler", "url": "http://cvpr.thecvf.com/api/miniconf/users/154654?format=json", "institution": "University of Surrey"}], "abstract": "With the rapid advancement and widespread application of vision-language pre-training (VLP) models, their vulnerability to adversarial attacks has become a critical concern. In general, the adversarial examples can typically be designed to exhibit transferable power, attacking not only different models but also across diverse tasks.However, existing attacks on language-vision models mainly rely on static cross-modal interactions and focus solely on disrupting positive image-text pairs, resulting in limited cross-modal disruption and poor transferability.To address this issue, we propose a Semantic-Augmented Dynamic Contrastive Attack (SADCA) that enhances adversarial transferability through progressive and semantically guided perturbation.SADCA progressively disrupts cross-modal alignment through dynamic interactions between adversarial images and texts.This is accomplished by SADCA establishing a contrastive learning mechanism involving adversarial, positive and negative samples, to reinforce the semantic inconsistency of the obtained perturbations.Moreover, we empirically find that input transformations commonly used in traditional transfer-based attacks also benefit VLPs, which motivates a semantic augmentation module that increases the diversity and generalization of adversarial examples.Extensive experiments on multiple datasets and models demonstrate that SADCA significantly improves adversarial transferability and consistently surpasses state-of-the-art methods.The code will be released here.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38724", "url": null, "sourceid": 43802, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40163, "uid": "48ac3f2c97f8d0a311a7b1ba0eb6cbf8", "name": "SynCLIP: Synonym-Coherent Language-Image Pretraining for Robust Open-Vocabulary Dense Perception", "authors": [{"id": 180563, "fullname": "Mingjie Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/180563?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 193675, "fullname": "heguangjun heguangjun", "url": "http://cvpr.thecvf.com/api/miniconf/users/193675?format=json", "institution": null}, {"id": 89070, "fullname": "Dongli Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89070?format=json", "institution": "University of Sydney"}, {"id": 98719, "fullname": "Youtian Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/98719?format=json", "institution": "Nanjing university"}, {"id": 178283, "fullname": "Hongjue Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/178283?format=json", "institution": "Beihang University"}, {"id": 193676, "fullname": "Pengming Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193676?format=json", "institution": null}, {"id": 193677, "fullname": "Jian Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193677?format=json", "institution": "Harbin Engineering University"}, {"id": 187983, "fullname": "Yue Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187983?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}], "abstract": "Open-vocabulary dense perception (OVDP) aims to localize objects unseen during training by leveraging textual knowledge. Despite the remarkable progress of recent CLIP-based approaches, we identify a critical limitation: synonym-induced grounding inconsistency, where semantically equivalent expressions yield disparate spatial attention patterns. This inconsistency undermines the robustness and performance of existing methods in real-world OVDP applications. To address this issue,  we propose SynCLIP, a Synonym-Coherent Language-Image Pretraining framework that enhances synonym-robust grounding for OVDP tasks. SynCLIP introduces a Semantic-consistent Spatial Attention alignment (SSA) module to enhance spatial attention consistency by minimizing discrepancies between attention maps of original and synonymous expressions. Furthermore, a Spatial Attention Refinement (SAR) module selectively strengthens the most semantically relevant spatial regions within aligned maps, resulting in more precise and stable grounding. To support synonym-coherent pretraining, we also construct a Synonym-Enriched Visual Corpus (SEViC), which augments each category with multiple synonyms and textual definitions. Extensive experiments on multiple benchmarks demonstrate that SynCLIP substantially improves grounding consistency under diverse linguistic variants and achieves state-of-the-art performance among CLIP-based OVDP methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40163", "url": null, "sourceid": 35231, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38441, "uid": "af6963942d2a2107c69cf67acb6d302b", "name": "VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation", "authors": [{"id": 126812, "fullname": "Shikun Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/126812?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 152514, "fullname": "Liao Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152514?format=json", "institution": "ByteDance Inc."}, {"id": 154603, "fullname": "Huichao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154603?format=json", "institution": "Bytedance Intelligent Creation"}, {"id": 154604, "fullname": "Yiheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154604?format=json", "institution": "ByteDance Inc."}, {"id": 189870, "fullname": "Yangyang Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/189870?format=json", "institution": "ByteDance Inc."}, {"id": 189871, "fullname": "Xian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189871?format=json", "institution": null}, {"id": 90635, "fullname": "Yi Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90635?format=json", "institution": "bytedance"}, {"id": 154605, "fullname": "Xu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154605?format=json", "institution": "ByteDance"}, {"id": 126799, "fullname": "Jia Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/126799?format=json", "institution": "Department of Computer Science and Technology, Tsinghua University"}, {"id": 145690, "fullname": "Daniel Kang Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/145690?format=json", "institution": "ByteDance Inc."}, {"id": 89702, "fullname": "Xinglong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89702?format=json", "institution": "Jilin University"}], "abstract": "Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes particularly acute in reinforcement learning (RL) scenarios, leading to unstable training and suboptimal alignment.To resolve this, we propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. Our method integrates three synergistic components: 1) a stabilizing intermediate reward to guide early-stage generation; 2) a dynamic time-step reweighting scheme for precise credit assignment; and 3) a novel mask propagation algorithm, derived from principles of Reward Feedback Learning (ReFL), designed to isolate optimization effects both spatially and temporally. Our approach demonstrates significant improvements in sample quality and objective alignment over the vanilla GRPO baseline, enabling robust and effective optimization for VAR models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38441", "url": null, "sourceid": 30727, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37063, "uid": "5590ca803d2831d2a3dff3fa3eaca6f4", "name": "Elucidating the Design Space of Arbitrary-Noise-Based Diffusion Models", "authors": [{"id": 100571, "fullname": "Xingyu Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/100571?format=json", "institution": "Harbin Institute of Technology"}, {"id": 158457, "fullname": "Mengying Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158457?format=json", "institution": "Harbin Institute of Technology"}, {"id": 186592, "fullname": "Xinghua Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/186592?format=json", "institution": "Harbin Institute of Technology"}, {"id": 158460, "fullname": "Dong Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158460?format=json", "institution": "Harbin Institute of Technology"}, {"id": 158459, "fullname": "Fanding Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/158459?format=json", "institution": "Harbin Institute of Technology"}, {"id": 158461, "fullname": "Gongning Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/158461?format=json", "institution": "Harbin Institute of Technology"}, {"id": 158462, "fullname": "wei wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158462?format=json", "institution": "Harbin Institute of Technology"}, {"id": 158463, "fullname": "Kuanquan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158463?format=json", "institution": "Harbin Institute of Technology"}, {"id": 90856, "fullname": "Shuo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/90856?format=json", "institution": "Case Western Reserve University"}], "abstract": "Although EDM aims to unify the design space of diffusion models, its reliance on fixed Gaussian noise prevents it from explaining emerging flow-based methods that diffuse arbitrary noise. Moreover, our study reveals that EDM's forcible injection of Gaussian noise has adverse effects on image restoration task, as it corrupts the degraded images, overextends the restoration distance, and increases the task's complexity. To interpret diverse methods for handling distinct noise patterns within a unified theoretical framework and to minimize the restoration distance, we propose \\textbf{EDA}, which \\textbf{E}lucidates the \\textbf{D}esign space of \\textbf{A}rbitrary-noise diffusion models. Theoretically, EDA expands noise pattern flexibility while preserving EDM's modularity, with rigorous proof that increased noise complexity introduces no additional computational overhead during restoration. EDA is validated on three representative medical image denoising and natural image restoration tasks: MRI bias field correction (global smooth noise), CT metal artifact removal (global sharp noise) and natural image shadow removal (local boundary-aware noise). With only 5 sampling steps, competitive results against specialized methods across medical and natural tasks demonstrate EDA's strong generalization capability for image restoration.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37063", "url": null, "sourceid": 43881, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37571, "uid": "2fa402c141a5cd008c41692f1062ef23", "name": "ASFormer: Adaptive Serialized Transformers for Monocular Semantic Scene Completion from Indoor Environments", "authors": [{"id": 162741, "fullname": "xuzhi wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/162741?format=json", "institution": "Tianjin Normal University"}, {"id": 187750, "fullname": "Xinran Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187750?format=json", "institution": "Tianjin Normal University"}, {"id": 70670, "fullname": "Song Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70670?format=json", "institution": "Zhejiang University"}, {"id": 76351, "fullname": "Lingdong Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76351?format=json", "institution": "National University of Singapore"}, {"id": 187751, "fullname": "Ziping Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187751?format=json", "institution": "Tianjin Normal University"}], "abstract": "Indoor monocular semantic scene completion (MSSC) is notably more challenging than its outdoor counterpart due to complex spatial layouts and severe occlusions. While transformers are well suited for modeling global dependencies, their high memory cost and difficulty in reconstructing fine-grained details have limited their use in indoor MSSC. To address these limitations, we introduce ASFormer, a serialized transformer framework tailored for indoor MSSC. Our model features three key designs: (1) an Adaptive Serialized Transformer with learnable shifts that dynamically adjust receptive fields; (2) a Center-Relative Positional Encoding that captures spatial information richness; and (3) a Convolution-Modulated Layer Normalization that bridges heterogeneous representations between convolutional and transformer features. Extensive experiments on NYUv2 and Occ-ScanNet demonstrate that ASFormer achieves state-of-the-art performance. Code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37571", "url": null, "sourceid": 39772, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38980, "uid": "595b8086058fbf22a5bdfc5b01273e2c", "name": "Do Less, Achieve More: Do We Need Every-Step Optimization for RL Fine-tuning of Diffusion Models?", "authors": [{"id": 190507, "fullname": "Renye Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190507?format=json", "institution": "Peking University"}, {"id": 151412, "fullname": "Jikang Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/151412?format=json", "institution": "Wuhan University"}, {"id": 126812, "fullname": "Shikun Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/126812?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 191112, "fullname": "Yi Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/191112?format=json", "institution": null}, {"id": 191113, "fullname": "You Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191113?format=json", "institution": "nanjing university"}, {"id": 89608, "fullname": "Wei Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/89608?format=json", "institution": "Stanford University"}, {"id": 191114, "fullname": "Zongwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191114?format=json", "institution": "Peking University"}, {"id": 153028, "fullname": "Ling Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153028?format=json", "institution": "Peking University"}, {"id": 75860, "fullname": "Junliang Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/75860?format=json", "institution": "Tsinghua University"}, {"id": 191115, "fullname": "Yimao Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/191115?format=json", "institution": "Peking University"}], "abstract": "Diffusion models have achieved outstanding success in image generation, yet their objectives are often limited to reconstruction, making it difficult to align with human preferences directly. Reinforcement learning (RL) offers a promising approach to address this by optimizing models using explicit reward signals. However, most studies apply RL across the entire denoising process, which is both computationally expensive and tends to weaken preference alignment, i.e., doing more but achieving less. We observe that the impact of RL fine-tuning varies significantly across denoising stages.  In the early stage, image structures are unstable and distant from the final reward signal. Applying RL at this stage leads to delayed rewards and action\u2013reward mismatching, resulting in high variance and inefficient updates. Conversely, in the later stage, reward gains saturate, and continued training tends to overfit local details, intensifying reward hacking. To tackle these challenges, we propose \\ourmethod{}, an RL-enhanced plug-in that improves generation quality while reducing computational cost. Specifically, \\ourmethod{} adaptively identifies the optimal timing for RL by perceiving the structural evolution and semantic consistency during denoising, and dynamically terminates training once the denoising converges and reward gains saturate. As a result, it achieves a rare `dual benefit': a reduction in computational costs alongside a significant performance improvement. Theoretical analysis from an entropy perspective and extensive experiments verify our claims: compared with state-of-the-art methods, \\ourmethod{} improves performance by xx\\% while cutting computational cost by xx\\%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38980", "url": null, "sourceid": 35936, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40340, "uid": "e25ca237a194723ab3c86e793660ef21", "name": "Confusion-Aware Spectral Regularizer for Long-Tailed Recognition", "authors": [{"id": 183412, "fullname": "Ziquan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183412?format=json", "institution": "University of Leicester, UK"}, {"id": 190099, "fullname": "Gaojie Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190099?format=json", "institution": "University of Exeter"}, {"id": 190100, "fullname": "Hanruo Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190100?format=json", "institution": "University of Leicester"}, {"id": 177704, "fullname": "Si-Yuan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/177704?format=json", "institution": "Nanjing University of Posts and Telecommunications"}, {"id": 190101, "fullname": "Yunxiao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190101?format=json", "institution": "University of Exeter"}, {"id": 190102, "fullname": "ZEYU FU", "url": "http://cvpr.thecvf.com/api/miniconf/users/190102?format=json", "institution": "University of Exeter"}, {"id": 190103, "fullname": "Ronghui Mu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190103?format=json", "institution": "University of Exeter"}, {"id": 190104, "fullname": "Guoqiang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190104?format=json", "institution": "University of Exeter"}, {"id": 190105, "fullname": "Zhao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/190105?format=json", "institution": "Zhengzhou University"}, {"id": 190106, "fullname": "Xia Yuhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190106?format=json", "institution": "Chengdu University of Technology"}, {"id": 190107, "fullname": "Jiaxing Shang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190107?format=json", "institution": "Chongqing University; University of Exeter"}, {"id": 74040, "fullname": "Xiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/74040?format=json", "institution": "King Abdullah University of Science and Technology"}, {"id": 190108, "fullname": "Lu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190108?format=json", "institution": "University of Exeter"}, {"id": 190109, "fullname": "Tianjin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190109?format=json", "institution": "University of Exeter"}], "abstract": "Long-tailed image classification remains a long-standing challenge, as real-world data typically follow highly imbalanced distributions where a few head classes dominate and many tail classes contain only limited samples. This imbalance biases feature learning toward head categories and leads to significant degradation on rare classes. Although recent studies have proposed re-sampling, re-weighting, and decoupled learning strategies, the improvement on the most underrepresented classes still remains marginal compared with overall accuracy. In this work, we present a confusion-centric perspective for long-tailed recognition that explicitly focuses on worst-class generalization.  We first establish a new theoretical framework of class-specific error analysis, which shows that the worst-class error can be tightly upper-bounded by the spectral norm of the frequency-weighted confusion matrix and a model-dependent complexity term. Guided by this insight, we propose the Confusion-Aware Spectral Regularizer (CAR) that minimizes the spectral norm of the confusion matrix during training to reduce inter-class confusion and enhance tail-class generalization. To enable stable and efficient optimization, CAR integrates a Differentiable Confusion Matrix Surrogate and an EMA-based Confusion Estimator to maintain smooth and low-variance estimates across mini-batches. Extensive experiments across multiple long-tailed benchmarks demonstrates that CAR substantially improves both worst-class accuracy and overall performance. When combined with ConCutMix augmentation, CAR consistently surpasses exisiting state-of-the-art long-tailed learning methods under both the training-from-scratch setting (by $2.37\\% \\sim 4.83\\%$) and the fine-tuning-from-pretrained setting (by $2.42\\% \\sim 4.17\\%$) across ImageNet-LT, CIFAR100-LT, and iNaturalist datasets.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40340", "url": null, "sourceid": -46719, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38547?format=json"], "related_events_ids": [38547]}, {"id": 39976, "uid": "237b5accef653ebe40f8d2bb179dc554", "name": "ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding", "authors": [{"id": 193226, "fullname": "Daichi Yashima", "url": "http://cvpr.thecvf.com/api/miniconf/users/193226?format=json", "institution": "Keio University"}, {"id": 156674, "fullname": "Shuhei Kurita", "url": "http://cvpr.thecvf.com/api/miniconf/users/156674?format=json", "institution": "National Institute of Informatics"}, {"id": 193227, "fullname": "Yusuke Oda", "url": "http://cvpr.thecvf.com/api/miniconf/users/193227?format=json", "institution": "National Institute of Informatics; Nara Institute of Science and Technology"}, {"id": 92433, "fullname": "Komei Sugiura", "url": "http://cvpr.thecvf.com/api/miniconf/users/92433?format=json", "institution": "Keio University"}], "abstract": "While multimodal large language models (MLLMs) have shown remarkable success across a wide range of tasks, long-form video understanding remains a significant challenge.In this study, we focus on video understanding by MLLMs.This task is challenging because processing a full stream of RGB frames is computationally intractable and highly redundant, as self-attention have quadratic complexity with sequence length.In this paper, we propose ReMoRa, a video MLLM that processes videos by operating directly on their compressed representations.A sparse set of RGB keyframes is retained for appearance, while temporal dynamics are encoded as a motion representation, removing the need for sequential RGB frames.These motion representations act as a compact proxy for optical flow, capturing temporal dynamics without full frame decoding.To refine the noise and low fidelity of block-based motions, we introduce a module to denoise and generate a fine-grained motion representation.Furthermore, our model compresses these features in a way that scales linearly with sequence length.We demonstrate the effectiveness of ReMoRa through extensive experiments across a comprehensive suite of long-video understanding benchmarks.ReMoRa outperformed baseline methods on multiple challenging benchmarks, including LongVideoBench, NExT-QA, and MLVU.Our project page and code are provided in the supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39976", "url": null, "sourceid": 40953, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38625, "uid": "4b997c95a3cd22f4f5a45903bc4f319a", "name": "UniChange: Unifying Change Detection with Multimodal Large Language Model", "authors": [{"id": 180396, "fullname": "Xu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180396?format=json", "institution": "Computer Science, Nankai University"}, {"id": 190330, "fullname": "Danyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190330?format=json", "institution": "Nankai University"}, {"id": 190331, "fullname": "Xiaohang Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/190331?format=json", "institution": "Nankai University"}, {"id": 190332, "fullname": "Tianhao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190332?format=json", "institution": "Sichuan Agricultural University"}, {"id": 190333, "fullname": "Hualong Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190333?format=json", "institution": "Nankai University"}, {"id": 190334, "fullname": "Jianye Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190334?format=json", "institution": "Nankai University"}, {"id": 190335, "fullname": "Qicheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190335?format=json", "institution": "Nankai University"}, {"id": 130949, "fullname": "Xiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/130949?format=json", "institution": "Nankai University"}], "abstract": "Change detection (CD) is a fundamental task for monitoring and analysing land cover dynamics. While recent high performance models and high quality datasets have significantly advanced the field, a critical limitation persists. Current models typically acquire limited knowledge from single-type annotated data and cannot concurrently leverage diverse binary change detection (BCD) and semantic change detection (SCD) datasets. This constraint leads to poor generalisation and limited versatility. The recent advancements in Multimodal Large Language Models (MLLMs) introduce new possibilities for a unified CD framework. We leverage the language priors and unification capabilities of MLLMs to develop UniChange, the first MLLM-based unified change detection model. UniChange integrates generative language abilities with specialised CD functionalities. We introduce three special tokens: [T1], [T2], and [CHANGE], utilising their embeddings as the key to query variations. This approach successfully accommodates both BCD and SCD tasks. Furthermore, UniChange utilises text prompts to guide the identification of change categories, eliminating the reliance on predefined classification heads. This design allows UniChange to effectively acquire knowledge from multi-source datasets, even when their class definitions conflict. Experiments on four public benchmarks (WHU-CD, S2Looking, LEVIR-CD+, and SECOND) demonstrate SOTA performance, achieving IoU scores of 90.41, 53.04, 78.87, and 57.62, respectively, surpassing all previous methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38625", "url": null, "sourceid": 31391, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39617, "uid": "34d4da4dbe2204c93ce46975c5466b82", "name": "Action-Sketcher: From Reasoning to Action via Visual Sketches for Robotic Manipulation", "authors": [{"id": 156446, "fullname": "Huajie Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/156446?format=json", "institution": "Peking University"}, {"id": 192490, "fullname": "Peterson Co", "url": "http://cvpr.thecvf.com/api/miniconf/users/192490?format=json", "institution": "Peking University"}, {"id": 192491, "fullname": "Yijie Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192491?format=json", "institution": "University of Sydney, University of Sydney"}, {"id": 186910, "fullname": "Shanyu Rong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186910?format=json", "institution": "Peking University"}, {"id": 156445, "fullname": "Yuheng Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/156445?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 186708, "fullname": "Cheng Chi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186708?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 192492, "fullname": "Xiansheng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192492?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 192493, "fullname": "Zhongxia Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192493?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 156220, "fullname": "Pengwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156220?format=json", "institution": "baai-\u5317\u4eac\u4eba\u5de5\u667a\u80fd\u7814\u7a76\u9662"}, {"id": 128283, "fullname": "Zhongyuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128283?format=json", "institution": "Kuaishou Inc."}, {"id": 91956, "fullname": "Shanghang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91956?format=json", "institution": "Peking University"}], "abstract": "Long-horizon, open-world robotic manipulation is increasingly important for real-world deployment, requiring spatial disambiguation in complex layouts and temporal resilience under dynamic interaction. However, existing end-to-end and hierarchical Vision\u2013Language\u2013Action (VLA) policies often rely on text-only cues while keeping plan intent latent, which undermines \\textit{referential grounding} in cluttered or underspecified scenes, impedes effective \\textit{task decomposition} of long-horizon goals with close-loop interaction, and limits \\textit{causal explanation} by obscuring the rationale behind action choices. To address these issues, we first introduce \\textbf{Visual Sketch}, a lightweight visual intermediate that renders points, boxes, arrows, and typed relations on the robot\u2019s current views to externalize spatial intent, bind language to scene geometry, and provide a human-verifiable bridge between high-level reasoning and low-level control. Building on \\textit{Visual Sketch}, we present \\textbf{Action-Sketcher}, a VLA framework that operates in a cyclic \\textit{See $\\rightarrow$ Think $\\rightarrow$ Sketch $\\rightarrow$ Act} workflow coordinated by adaptive token-gated strategy for reasoning triggers, sketch revision, and action issuance, thereby supporting reactive corrections and human interaction while preserving real-time action prediction. To enable scalable training and evaluation, we curate a 2.3M-sample corpus with interleaved images, text, \\textit{Visual Sketch} supervision, and action sequences, and train \\textit{Action-Sketcher} with a multi-stage curriculum recipe that combines interleaved sequence alignment for modality unification, language-to-sketch consistency for precise linguistic grounding, and imitation learning augmented with sketch-to-action reinforcement for robustness. Experiments on cluttered tabletops and multi-object tasks, in simulation and on real robots, show improved long-horizon success, stronger robustness to dynamic scene changes, and enhanced interpretability via editable sketches and step-wise plans.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39617", "url": null, "sourceid": 36343, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39358, "uid": "3d910ef9becd86edb2d3be8ca8dfa1fc", "name": "Mitigating Instance Entanglement in Instance-Dependent Partial Label Learning", "authors": [{"id": 191916, "fullname": "Rui Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191916?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 131695, "fullname": "Bin Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/131695?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 191917, "fullname": "Kai Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/191917?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 191918, "fullname": "Bo Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/191918?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "Partial label learning is a prominent weakly supervised classification task, where each training instance is ambiguously labeled with a set of candidate labels. In real-world scenarios, candidate labels are often influenced by instance features, leading to the emergence of instance-dependent PLL (ID-PLL), a setting that more accurately reflects this relationship. A significant challenge in ID-PLL is instance entanglement, where instances from similar classes share overlapping features and candidate labels, resulting in increased class confusion. To address this issue, we propose a novel Class-specific Augmentation based Disentanglement (CAD) framework, which tackles instance entanglement by both intra- and inter-class regulations. For intra-class regulation, CAD amplifies class-specific features to generate class-wise augmentations and aligns same-class augmentations across instances. For inter-class regulation, CAD introduces a weighted penalty loss function that applies stronger penalties to more ambiguous labels, encouraging larger inter-class distances. By jointly applying intra- and inter-class regulations, CAD improves the clarity of class boundaries and reduces class confusion caused by entanglement. Extensive theoretical and experimental results demonstrate the effectiveness of CAD in mitigating the entanglement problem and enhancing ID-PLL performance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39358", "url": null, "sourceid": 37372, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40031, "uid": "3f0b954a086bb1bdcc2af1ffa99022f4", "name": "Failure Modes for Deep Learning\u2013Based Online Mapping: How to Measure and Address Them", "authors": [{"id": 161946, "fullname": "Michael Hubbertz", "url": "http://cvpr.thecvf.com/api/miniconf/users/161946?format=json", "institution": "Bergische Universit\u00e4t Wuppertal"}, {"id": 193344, "fullname": "Qi Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/193344?format=json", "institution": "Aptiv Germany"}, {"id": 193345, "fullname": "Tobias Meisen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193345?format=json", "institution": "Bergische Universit\u00e4t Wuppertal"}], "abstract": "Deep learning-based online mapping has emerged as a cornerstone of autonomous driving, yet these models frequently fail to generalize beyond familiar environments. We propose a framework to identify and measure the underlying failure modes by disentangling two effects: Memorization of input features and overfitting to known map topologies. We propose metrics based on evaluation subsets that control for geographical proximity and topological similarity between training and validation scenes. We introduce Fr\u00e9chet distance\u2013based reconstruction statistics that capture per-element shape fidelity without threshold tuning, and define complementary failure-mode scores: an input-feature overfitting score quantifying the performance drop when geographic cues disappear, and a topology overfitting score measuring degradation as scenes become topologically novel. Beyond models, we analyze dataset biases and contribute topology-aware diagnostics: A minimum-spanning-tree (MST) diversity metric for training sets and a symmetric coverage metric to quantify topological similarity between splits. Leveraging these, we formulate an MST-based sparsification strategy that reduces redundancy and improves balancing and performance while shrinking training size. Experiments on nuScenes and Argoverse 2 across multiple state-of-the-art models yield more trustworthy assessment of generalization and show that topology-diverse and balanced training sets lead to improved performance. Our results motivate failure-mode-aware protocols and topology-centric dataset design for deployable online mapping.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40031", "url": null, "sourceid": 33188, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37604, "uid": "afdef57f55877167686754f14886a0de", "name": "MoECLIP: Patch-Specialized Experts for Zero-shot Anomaly Detection", "authors": [{"id": 181631, "fullname": "Jun Yeong Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/181631?format=json", "institution": "yonsei university"}, {"id": 187811, "fullname": "JunYoung Seo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187811?format=json", "institution": "Yonsei University"}, {"id": 183504, "fullname": "Minji Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183504?format=json", "institution": "Yonsei University"}, {"id": 177103, "fullname": "Yu Rang Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/177103?format=json", "institution": "Yonsei University College of Medicine"}], "abstract": "The CLIP model's outstanding generalization has driven recent success in Zero-Shot Anomaly Detection (ZSAD) for detecting anomalies in unseen categories. The core challenge in ZSAD is to specialize the model for anomaly detection tasks while preserving CLIP's powerful generalization capability. Existing approaches attempting to solve this challenge share the fundamental limitation of a patch-agnostic design that processes all patches monolithically without regard for their unique characteristics. To address this limitation, we propose MoECLIP, a Mixture-of-Experts (MoE) architecture for the ZSAD task, which achieves patch-level adaptation by dynamically routing each image patch to a specialized Low-Rank Adaptation (LoRA) expert based on its unique characteristics. Furthermore, to prevent functional redundancy among the LoRA experts, we introduce (1) Frozen Orthogonal Feature Separation (FOFS), which orthogonally separates the input feature space to force experts to focus on distinct information, and (2) a simplex equiangular tight frame (ETF) loss to regulate the expert outputs to form maximally equiangular representations. Comprehensive experimental results across 14 benchmark datasets spanning industrial and medical domains demonstrate that MoECLIP outperforms existing state-of-the-art methods. We provide our code in the supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37604", "url": "https://github.com/CoCoRessa/MoECLIP", "sourceid": 36624, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38872, "uid": "fff23c80b2468e9402716e56f083ebc8", "name": "Bias In, Bias Out? Finding Unbiased Subnetworks in Vanilla Models", "authors": [{"id": 182326, "fullname": "Ivan Luiz De Moura Matos", "url": "http://cvpr.thecvf.com/api/miniconf/users/182326?format=json", "institution": "Telecom Paris, Institut Polytechnique de Paris"}, {"id": 190892, "fullname": "Abdel Djalil SAD SAOUD", "url": "http://cvpr.thecvf.com/api/miniconf/users/190892?format=json", "institution": "The Atomic Energy and Alternative Energies Commission ( CEA List ), Technological Research Division"}, {"id": 190893, "fullname": "Ekaterina Iakovleva", "url": "http://cvpr.thecvf.com/api/miniconf/users/190893?format=json", "institution": "T\u00e9l\u00e9com Paris"}, {"id": 190894, "fullname": "Vito Paolo Pastore", "url": "http://cvpr.thecvf.com/api/miniconf/users/190894?format=json", "institution": "University of Genoa"}, {"id": 190895, "fullname": "Enzo Tartaglione", "url": "http://cvpr.thecvf.com/api/miniconf/users/190895?format=json", "institution": "T\u00e9l\u00e9com Paris, Institut Polytechnique de Paris"}], "abstract": "The issue of algorithmic biases in deep learning has led to the development of various debiasing techniques, many of which perform complex training procedures or dataset manipulation. However, an intriguing question arises: is it possible to extract fair and bias-agnostic subnetworks from standard vanilla-trained models without relying on additional data, such as unbiased training set? In this work, we introduce Bias-Invariant Subnetwork Extraction (BISE), a learning strategy that identifies and isolates \"bias-free\" subnetworks that already exist within conventionally trained models, without retraining or finetuning the original parameters. Our approach demonstrates that such subnetworks can be extracted via pruning and can operate \"as is\", without modification, effectively relying less on biased features and maintaining robust performance. Our findings contribute towards efficient bias mitigation through structural adaptation of pre-trained neural networks via parameter removal, as opposed to costly strategies that are either data-centric or involve (re)training all model parameters. Extensive experiments on common benchmarks show the advantages of our approach in terms of the performance and computational efficiency of the resulting debiased model. The code will be published upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38872", "url": null, "sourceid": 38300, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38118, "uid": "36313c39a91e7913ecc0cc12e5ad7b3d", "name": "Resolving Endpoint Underfitting in Diffusion Bridges via Noise Alignment", "authors": [{"id": 182583, "fullname": "Yurong Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/182583?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 77485, "fullname": "Zicheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77485?format=json", "institution": "JD.com"}, {"id": 85779, "fullname": "Congying Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/85779?format=json", "institution": "University of  Chinese Academy of Sciences"}, {"id": 85811, "fullname": "Tiande Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/85811?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 139633, "fullname": "Xinmin QIu", "url": "http://cvpr.thecvf.com/api/miniconf/users/139633?format=json", "institution": "University of the Chinese Academy of Sciences"}], "abstract": "Diffusion bridge models offer a powerful framework for connecting two data distributions, such as in image restoration and translation. Many existing methods learn this bridge by mimicking the score-matching formulation of standard diffusion models. In this work, we find that this way leads to an anomalous underfitting phenomenon near the target endpoint,  as the process approaches the target distribution ($t \\to 0$). This underfitting, characterized by significant drift in the predicted variance and direction, results from an excessively large discrepancy in noise levels between the network's input and its regression target. To resolve this issue, we propose the Noise-Aligned Diffusion Bridge (NADB). Our approach reformulates the diffusion bridge by first employing a mean network to provide a cleaner conditional target, and then introducing a novel, noise-aligned mapping relationship. This new formulation resolves the noise mismatch and corrects the underfitting near the target endpoint.Experimental validation across multiple image restoration and image translation tasks demonstrates the effectiveness of our approach.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38118", "url": null, "sourceid": 37123, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38381, "uid": "b6b8870c2b130a562fbe89fb0bb9518a", "name": "IAFMNet: Information-Aware Feature Modulation for Efficient Super-Resolution", "authors": [{"id": 184231, "fullname": "Junwei Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184231?format=json", "institution": "Xidian University"}, {"id": 189756, "fullname": "Mengzu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189756?format=json", "institution": "Xidian University"}, {"id": 189757, "fullname": "Zhenyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189757?format=json", "institution": "Xidian University"}, {"id": 90647, "fullname": "Fangfang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90647?format=json", "institution": "Xidian University"}, {"id": 189758, "fullname": "Sijia Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189758?format=json", "institution": "Xidian University"}, {"id": 146314, "fullname": "Tao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/146314?format=json", "institution": "Xidian University"}, {"id": 89279, "fullname": "Weisheng Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/89279?format=json", "institution": "Xidian University"}], "abstract": "Single Image Super-Resolution (SISR) aims to reconstruct a high-resolution (HR) image from a low-resolution (LR) input, which becomes increasingly challenging under real-world computational constraints. However, most efficient SISR methods adopt lightweight, spatially uniform strategies that allocate equal computation and focus across all regions\u2014ignoring the uneven distribution of visual complexity.From an information theory perspective, textures and edges inherently carry more critical information, resulting in reconstruction errors that are disproportionately concentrated in these regions. This motivates allocating more resources and attention to these informative areas.In this paper, we propose IAFMNet, an Information-Aware Feature Modulation network for efficient SR. At its core lies the Information Density Map (IDM), which is estimated in an unsupervised manner by minimizing the information entropy of features, thereby highlighting informative regions with high theoretical encoding costs. Guided by the IDM, IAFMNet adopts a synergistic dual-branch design: (1) a sparse convolution branch that dynamically allocates computation to informative areas while bypassing low-information regions, and (2) an implicit modulation branch that adaptively emphasizes complex regions via information-aware affine transformations.Extensive experiments demonstrate that IAFMNet effectively identifies informative regions and achieves superior visual fidelity with reduced computational overhead.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38381", "url": null, "sourceid": 40556, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39871, "uid": "a07ee351d0b3e5104dc2777f647f52ea", "name": "Rethinking Visual Rearrangement from A Diffusion Perspective", "authors": [{"id": 148630, "fullname": "Tianliang Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/148630?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 77199, "fullname": "Xinhang Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/77199?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 130849, "fullname": "Yuyi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130849?format=json", "institution": "Institute of Computing Technology,University of the Chinese Academy of Sciences"}, {"id": 85108, "fullname": "Shuqiang Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85108?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}], "abstract": "Rearranging disarrayed objects to their intended goal states requires the agent to comprehend the changes that have occurred in the scene and to reason about the process of these changes. To address this, we propose a novel perspective on the visual rearrangement task, drawing inspiration from the diffusion processes in molecular thermodynamics. We model the room shuffle and unshuffle stages as the forward and reverse processes of diffusion. In contrast to conventional methods that rely on scene modeling and differential comparisons, our approach provides insight into the intrinsic evolution process between the goal and initial states of the scene, which allows for a more reasonable rearrangement of objects through fine-grained and progressive denoising steps with high confidence. By analyzing the task objectives, we represent the scene via spatial distributions of objects and model the visual rearrangement process using a diffusion bridge model. Building upon this, we introduce the Diffusion Rearrangement model, which takes point cloud data as input, fits it into Gaussian mixture distributions to represent the states of objects, and predicts the rearrangement target through an iterative denoising transformer. Experimental results on the RoomR dataset demonstrate the effectiveness of our approach.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39871", "url": null, "sourceid": 43552, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39385, "uid": "65546ed1aea0d4a89d6aa035968c2a3d", "name": "Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing", "authors": [{"id": 102099, "fullname": "Runze He", "url": "http://cvpr.thecvf.com/api/miniconf/users/102099?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 180724, "fullname": "YIJI CHENG", "url": "http://cvpr.thecvf.com/api/miniconf/users/180724?format=json", "institution": "Tencent"}, {"id": 186613, "fullname": "Tiankai Hang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186613?format=json", "institution": "Tencent"}, {"id": 127826, "fullname": "Zhimin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/127826?format=json", "institution": "Tencent Data Platform"}, {"id": 186611, "fullname": "Yu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186611?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 107041, "fullname": "Zijin Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/107041?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 181780, "fullname": "Shiyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181780?format=json", "institution": "Tsinghua University"}, {"id": 88610, "fullname": "Wenxun Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/88610?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 89807, "fullname": "Penghui Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/89807?format=json", "institution": "Beihang University"}, {"id": 184465, "fullname": "Ao Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/184465?format=json", "institution": "JD.com"}, {"id": 186615, "fullname": "Chunyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186615?format=json", "institution": "Tencent Hunyuan"}, {"id": 106459, "fullname": "qinglin lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/106459?format=json", "institution": "Tencent"}, {"id": 88803, "fullname": "Jizhong Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/88803?format=json", "institution": "Institute of Information Engineering"}, {"id": 88801, "fullname": "Jiao Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/88801?format=json", "institution": "Institute of Information Engineering,Chinese Academy of Sciences"}], "abstract": "In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts, demanding precise understanding and faithful execution of user intent. Although recent unified multimodal models exhibit promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core lies the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance and reference association, providing clear textual target and mitigating confusion among reference images. Furthermore, Re-Align introduces an effective RL training scheme that leverages a surrogate reward to measure the alignment between structured reasoning text and the generated image, thereby improving the model\u2019s overall performance on ICGE tasks. Extensive experiments verify that Re-Align outperforms competitive methods of comparable model scale and resources on both in-context image generation and editing tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39385", "url": null, "sourceid": 41021, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39528, "uid": "6e62eb1cd9d42ec8bfba3d4624597524", "name": "Meta-CoT: Enhancing Granularity and Generalization in Image Editing", "authors": [{"id": 181780, "fullname": "Shiyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181780?format=json", "institution": "Tsinghua University"}, {"id": 180724, "fullname": "YIJI CHENG", "url": "http://cvpr.thecvf.com/api/miniconf/users/180724?format=json", "institution": "Tencent"}, {"id": 186613, "fullname": "Tiankai Hang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186613?format=json", "institution": "Tencent"}, {"id": 107041, "fullname": "Zijin Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/107041?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 102099, "fullname": "Runze He", "url": "http://cvpr.thecvf.com/api/miniconf/users/102099?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 186611, "fullname": "Yu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186611?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 88610, "fullname": "Wenxun Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/88610?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 144220, "fullname": "yunlong lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/144220?format=json", "institution": "Xiamen university"}, {"id": 186615, "fullname": "Chunyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186615?format=json", "institution": "Tencent Hunyuan"}, {"id": 106459, "fullname": "qinglin lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/106459?format=json", "institution": "Tencent"}, {"id": 69348, "fullname": "Yansong Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/69348?format=json", "institution": "Tsinghua University"}], "abstract": "Unified multi-modal understanding/generative models have shown improved image editing performance by incorporating fine-grained understanding into their Chain-of-Thought (CoT) process. However, a critical question remains underexplored: what forms of CoT and training strategy can jointly enhance both the understanding granularity and generalization? To address this, we propose Meta-CoT, a paradigm that performs a two-level decomposition of any single-image editing operation with two key properties: (1) Decomposability. We observe that any editing intention can be represented as a triplet \u2014 (task, target, required understanding ability). Inspired by this, Meta-CoT decomposes both the editing task and the target, generating task-specific CoT and traversing editing operations on all targets. This decomposition enhances the model's understanding granularity of editing operations and guides it to learn each element of the triplet during training, substantially improving the editing capability.(2) Generalizability. In the second decomposition level, we further break down editing tasks into five fundamental meta-tasks. We find that training on these five meta-tasks, together with the other two elements of the triplet, is sufficient to achieve strong generalization across diverse, unseen editing tasks.To further align the model's editing behavior with its CoT reasoning, we introduce the CoT\u2013Editing Consistency Reward, which encourages more accurate and effective utilization of CoT information during editing. Experiments demonstrate that our method achieves an overall 15.8\\% improvement across 21 editing tasks, and generalizes effectively to unseen editing tasks when trained on only a small set of meta-tasks. Our code, benchmark, and model will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39528", "url": null, "sourceid": 33985, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39233, "uid": "0fd36ed341d965a7e3f6c413a30c3b31", "name": "WildCap: Facial Appearance Capture in the Wild via Hybrid Inverse Rendering", "authors": [{"id": 107296, "fullname": "Yuxuan Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/107296?format=json", "institution": "Tsinghua University"}, {"id": 191669, "fullname": "Xin Ming", "url": "http://cvpr.thecvf.com/api/miniconf/users/191669?format=json", "institution": "Tsinghua University"}, {"id": 185251, "fullname": "Tianxiao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185251?format=json", "institution": "Tsinghua University"}, {"id": 191670, "fullname": "Zhuofan Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191670?format=json", "institution": "Tsinghua University"}, {"id": 91458, "fullname": "Qixuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91458?format=json", "institution": "ShanghaiTech University"}, {"id": 73999, "fullname": "Lan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73999?format=json", "institution": "ShanghaiTech University"}, {"id": 84990, "fullname": "Feng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84990?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Existing methods achieve high-quality facial appearance capture under controllable lighting, which increases capture cost and limits usability. We propose WildCap, a novel method for high-quality facial appearance capture from a smartphone video recorded in the wild. To disentangle high-quality reflectance from complex lighting effects in in-the-wild captures, we propose a novel hybrid inverse rendering framework. Specifically, we first apply a data-driven method, i.e., SwitchLight, to convert the captured images into more constrained conditions and then adopt model-based inverse rendering. However, unavoidable local artifacts in network predictions, such as shadow-baking, are non-physical and thus hinder accurate inverse rendering of lighting and material. To address this, we propose a novel texel grid lighting model to explain non-physical effects as clean albedo illuminated by local physical lighting. During optimization, we jointly sample a diffusion prior for reflectance maps and optimize the lighting, effectively resolving scale ambiguity between local lights and albedo. Our method achieves significantly better results than prior arts in the same capture setup, closing the quality gap between in-the-wild and controllable recordings by a large margin.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39233", "url": null, "sourceid": 32967, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37464, "uid": "cceb9d12436d7333402c76860baf380c", "name": "SpikeTrack: High-performance and Energy-efficient Event-Based Object Tracking with Spiking Neural Network", "authors": [{"id": 155486, "fullname": "Yang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155486?format=json", "institution": "Dalian University of Technology"}, {"id": 155485, "fullname": "Jiqing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155485?format=json", "institution": "Dalian Martime University"}, {"id": 187512, "fullname": "Chuanyu Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/187512?format=json", "institution": "Dalian University of Technology"}, {"id": 187513, "fullname": "Qianhui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187513?format=json", "institution": "Shandong University"}, {"id": 155487, "fullname": "Huilin Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/155487?format=json", "institution": "Jiangsu University of Science And Technology"}, {"id": 187514, "fullname": "Ziqi Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/187514?format=json", "institution": "School of Integrated Circuits,         Dalian University of Technology"}, {"id": 85034, "fullname": "Xin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85034?format=json", "institution": "Dalian University of Technology"}], "abstract": "Event cameras have attracted considerable attention for object tracking due to their microsecond-level temporal resolution and wide dynamic range, yet effectively harnessing spiking neural networks (SNNs) in this domain remains challenging. In this paper, we introduce SpikeTrack, a purely spike-driven framework for single-object tracking that addresses the shortcomings of RGB-based approaches in fast-motion or target appearance change. Central to SpikeTrack is the Multi-Search-sequence-and-Single-Template (MSST) training paradigm, which captures rich temporal dependencies, alongside a Dynamic Integer Leaky Integrate-and-Fire (DI-LIF) neuron that adaptively predicts integer-valued activations based on the input features during training and converts them into spikes during inference. Our design preserves the intrinsic sparsity and fine-grained spatiotemporal acuity of event data, resulting in efficient energy consumption without sacrificing performance. Extensive evaluations on FE108, FELT, and VisEvent demonstrate that SpikeTrack exceeds the performance of state-of-the-art trackers in both accuracy and efficiency. Furthermore, ablation studies validate each module\u2019s contribution, highlighting the practical potential of spike-driven architectures for future vision applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37464", "url": null, "sourceid": 38421, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36887, "uid": "a64f23737e72f85b8fc0eb8ad5b36458", "name": "V2U4Real: A Real-world Large-scale Dataset for Vehicle-to-UAV Cooperative  Perception", "authors": [{"id": 186112, "fullname": "Weijia Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186112?format=json", "institution": "Xiamen University"}, {"id": 155906, "fullname": "Haoen Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155906?format=json", "institution": "Xiamen University"}, {"id": 186113, "fullname": "Tianxu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186113?format=json", "institution": "Lanzhou University of Technology"}, {"id": 186114, "fullname": "Shuaibing Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186114?format=json", "institution": "Nanchang University"}, {"id": 128691, "fullname": "Qiming Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/128691?format=json", "institution": "XMU"}, {"id": 86652, "fullname": "Cheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86652?format=json", "institution": "Xiamen University"}, {"id": 86653, "fullname": "Chenglu Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/86653?format=json", "institution": "Xiamen University"}], "abstract": "Modern autonomous vehicle perception systems are often constrained by occlusions, blind spots, and limited sensing range, hindering progress toward Level 5 autonomy. While existing cooperative perception paradigms such as Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) have demonstrated their effectiveness in mitigating these challenges, they remain limited to ground-level collaboration and cannot fully address large-scale occlusions or long-range perception in complex 3D environments. To advance research in cross-view cooperative perception, we present V2U4Real, the first large-scale real-world multi-modal dataset for Vehicle-to-UAV (V2U) cooperative perception. V2U4Real is collected by a ground vehicle and a UAV equipped with multi-view LiDARs and RGB cameras. The dataset covers urban streets, university campuses, and rural roads under diverse traffic scenarios, comprising over 56K LiDAR frames, 56K multi-view camera images, and 700K manually annotated 3D bounding boxes across four classes. To support a wide range of research tasks, we establish benchmarks for single-agent 3D object detection, cooperative 3D object detection and object tracking. Comprehensive evaluations of several state-of-the-art models demonstrate the effectiveness of V2U cooperation in enhancing perception robustness and long-range awareness, particularly under severe occlusion conditions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36887", "url": null, "sourceid": 30934, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36261, "uid": "446bd5fa13bf6b29f2b0530c36300b10", "name": "Co-Me: Confidence Guided Token Merging for Visual Geometric Transformers", "authors": [{"id": 184604, "fullname": "Yutian Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184604?format=json", "institution": "Carnegie Mellon University"}, {"id": 88689, "fullname": "Yuheng Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88689?format=json", "institution": "CMU, Carnegie Mellon University"}, {"id": 184605, "fullname": "Ruogu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184605?format=json", "institution": "Carnegie Mellon University"}, {"id": 184606, "fullname": "Jay Patrikar", "url": "http://cvpr.thecvf.com/api/miniconf/users/184606?format=json", "institution": null}, {"id": 73029, "fullname": "Sebastian Scherer", "url": "http://cvpr.thecvf.com/api/miniconf/users/73029?format=json", "institution": "Carnegie Mellon University"}], "abstract": "We propose Confidence-Guided Token Merging (Co-Me), an acceleration mechanism for visual geometric transformers without retraining or finetuning the base model. Co-Me employs a light-weight distilled confidence predictor to rank tokens and selectively merge low-confidence ones, effectively reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers, achieving speedups that scale with sequence length. When applied to VGGT and MapAnything, Co-Me achieves up to $11.3\\times$ and $7.2\\times$ speedup, making visual geometric transformers practical for real-time 3D perception and reconstruction.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36261", "url": null, "sourceid": 36284, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37963, "uid": "cb45a1110562681fbcf8dc2cead7bd23", "name": "TGTrack: Temporal Generative Learning for Unified Single Object Tracking", "authors": [{"id": 188694, "fullname": "Wanting Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188694?format=json", "institution": "Dalian University of Technology"}, {"id": 159501, "fullname": "Xin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/159501?format=json", "institution": "Dalian University of Technology"}, {"id": 187512, "fullname": "Chuanyu Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/187512?format=json", "institution": "Dalian University of Technology"}, {"id": 188695, "fullname": "Jie Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188695?format=json", "institution": "Dalian University of Technology"}, {"id": 180503, "fullname": "Ben Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180503?format=json", "institution": "Dalian University of Technology"}, {"id": 89488, "fullname": "Dong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89488?format=json", "institution": "Dalian University of Technology"}, {"id": 87510, "fullname": "Huchuan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87510?format=json", "institution": "Dalian University of Technology"}], "abstract": "Existing single object trackers typically treat temporal modeling superficially by passing limited inter-frame information, such as propagated tokens or template updates, without intrinsic temporal supervision learning. To address this limitation, we propose TGTrack, a new unified tracking framework that incorporates a temporally generative supervision task to guide the model in learning temporal dynamics. The core of TGTrack is a temporally generative learning paradigm equipped with a transformer-based generative decoder, which consists of a gated fusion module and an autoregressive prediction mechanism. This joint design enables the model to infer future scenarios from preceding information, thereby improving its ability to model both visual appearance and temporal dynamics. Furthermore, we introduce a time token embedding to explicitly encode the temporal position of each frame. Experiments on 11 benchmarks spanning five modalities show that TGTrack achieves state-of-the-art performance in robust unified tracking. For instance, TGTrack-B384 achieves an AUC of 75.3\\% on LaSOT. Code and models will be made available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37963", "url": null, "sourceid": 38190, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39253, "uid": "978c1e23fffde1ea092c5102de17a9ad", "name": "Streamlined Knowledge Distillation", "authors": [{"id": 179983, "fullname": "Hyeon-Jin Jung", "url": "http://cvpr.thecvf.com/api/miniconf/users/179983?format=json", "institution": "Yonsei University"}, {"id": 175190, "fullname": "Han-Jin Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/175190?format=json", "institution": "Yonsei University"}, {"id": 191712, "fullname": "Seok-Hwan Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/191712?format=json", "institution": "Yonsei University"}], "abstract": "Logit-based Knowledge Distillation (KD) has emerged as a lightweight alternative to feature-based KD. Recent logit-based methods often rely on multi-knowledge alignment and relational modeling. These methods are often inefficient due to redundant objectives, suboptimal transformations, and poorly designed loss functions. Motivated by these issues, we propose Streamlined Knowledge Distillation (SKD), a simple yet effective logit-based method that transfers only two essential forms of knowledge without requiring additional alignment or relational modeling. Specifically, SKD transfers instance-wise knowledge via Kullback-Leibler divergence and direction-wise knowledge by aligning the Gramian matrix of normalized logits. For the latter, we introduce a Mahalanobis distance-based direction-wise loss stabilized through Tikhonov regularization and Cholesky decomposition. This direction-wise loss accounts for variance and correlation in the output space and, as we formally show, is equivalent to the L2-norm in a covariance-whitened space. Extensive experiments demonstrate that SKD consistently outperforms existing logit-based methods and even surpasses feature-based methods, despite its simpler design. Code is available at \\url{https://anonymous.4open.science/r/StreamLined-DF23/}.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39253", "url": null, "sourceid": 46489, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39053, "uid": "560f8f4b1f6a8f9b982c7ac33ffe30e1", "name": "SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning", "authors": [{"id": 94898, "fullname": "Jitesh Jain", "url": "http://cvpr.thecvf.com/api/miniconf/users/94898?format=json", "institution": "Georgia Tech"}, {"id": 184159, "fullname": "Jialuo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184159?format=json", "institution": "Georgia Institute of Technology"}, {"id": 136978, "fullname": "Zixian Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/136978?format=json", "institution": "Department of Computer Science, University of Washington"}, {"id": 90298, "fullname": "Jieyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90298?format=json", "institution": "Department of Computer Science, University of Washington"}, {"id": 186255, "fullname": "Chris Dongjoo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/186255?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 98421, "fullname": "Sangho Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/98421?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 133433, "fullname": "Rohun Tripathi", "url": "http://cvpr.thecvf.com/api/miniconf/users/133433?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 89855, "fullname": "Tanmay Gupta", "url": "http://cvpr.thecvf.com/api/miniconf/users/89855?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 98107, "fullname": "Christopher Clark", "url": "http://cvpr.thecvf.com/api/miniconf/users/98107?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 75967, "fullname": "Humphrey Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/75967?format=json", "institution": "Georgia Tech | UIUC / Oregon | PAIR"}], "abstract": "As humans, we are natural any-horizon reasoners, *i.e.*, we can decide whether to iteratively skim long videos or watch short ones in full when necessary for a given task. With this in mind, one would expect video reasoning models to reason flexibly across different durations. However, SOTA models are still trained to predict answers in a single turn while processing a large number of frames, akin to watching an entire long video, requiring significant resources. This raises the question: **Is it possible to develop performant any-horizon video reasoning systems?** Inspired by human behavior, we first propose **SAGE**, an agent system that performs multi-turn reasoning on long videos while handling simpler problems in a single turn. Secondly, we introduce an easy synthetic data generation pipeline using Gemini-2.5-Flash to train the orchestrator, **SAGE-MM**, which lies at the core of SAGE. We further propose an effective RL post-training recipe essential for instilling any-horizon reasoning ability in SAGE-MM. Thirdly, we curate **SAGE-Bench** with an average duration of greater than 700 seconds for evaluating video reasoning ability in real-world entertainment use cases. Lastly, we empirically validate the effectiveness of our system, data, and RL recipe, observing notable improvements of up to **6.1%** on open-ended video reasoning tasks, as well as an impressive **8.2%** improvement on videos longer than 10 minutes. We will open-source our system code, data, and checkpoints upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39053", "url": "https://praeclarumjj3.github.io/sage", "sourceid": 38110, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37862, "uid": "df2782c019d0d66a88af774011e8ab29", "name": "UniDex: A Robot Foundation Suite for Universal Dexterous Hand Control from Egocentric Human Videos", "authors": [{"id": 182442, "fullname": "Gu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182442?format=json", "institution": "Tsinghua University"}, {"id": 188424, "fullname": "Qicheng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188424?format=json", "institution": "Tsinghua University"}, {"id": 188425, "fullname": "Haozhe Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188425?format=json", "institution": "Tsinghua University"}, {"id": 175939, "fullname": "Jianhan Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/175939?format=json", "institution": "Shanghai Qi Zhi Institue"}, {"id": 188426, "fullname": "Long He", "url": "http://cvpr.thecvf.com/api/miniconf/users/188426?format=json", "institution": "Tsinghua University"}, {"id": 180671, "fullname": "Yiming Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180671?format=json", "institution": "Tsinghua University"}, {"id": 188427, "fullname": "Zeyu Ping", "url": "http://cvpr.thecvf.com/api/miniconf/users/188427?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 188428, "fullname": "Zhecheng Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188428?format=json", "institution": "Tsinghua University"}, {"id": 188429, "fullname": "Chenhao Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188429?format=json", "institution": "Tsinghua University"}, {"id": 188430, "fullname": "Chengbo Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188430?format=json", "institution": "Tsinghua University"}, {"id": 188431, "fullname": "Tianhai Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188431?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 87420, "fullname": "Xiaoyu Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/87420?format=json", "institution": "School of Software, Tsinghua University"}, {"id": 188432, "fullname": "Maanping Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188432?format=json", "institution": "Tsinghua University"}, {"id": 188433, "fullname": "Feihong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188433?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 153553, "fullname": "Mingyu Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/153553?format=json", "institution": "Department of Computer Science, University of North Carolina at Chapel Hill"}, {"id": 188434, "fullname": "Yang Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188434?format=json", "institution": "Tsinghua University"}, {"id": 88978, "fullname": "Hao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88978?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 75520, "fullname": "Hang Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/75520?format=json", "institution": "Tsinghua University"}, {"id": 150896, "fullname": "Huazhe Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/150896?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Dexterous manipulation remains challenging due to the cost of collecting real-robot teleoperation data, the heterogeneity of hand embodiments, and the high dimensionality of control. We present UniDex, a robot foundation suite that couples a large-scale robot-centric dataset with a unified vision\u2013language\u2013action (VLA) policy and a practical human-data capture setup for universal dexterous hand control. First, we construct UniDex-Dataset, a robot-centric dataset of 10M paired image\u2013pointcloud\u2013action frames and over 50K trajectories across eight dexterous hands (6\u201324 DoFs), derived from egocentric human video datasets. To transform human data into robot-executable trajectories, we employ a human-in-the-loop retargeting procedure to align fingertip trajectories while preserving plausible hand\u2013object contacts, and we operate on explicit 3D pointclouds with human hands masked to narrow kinematic and visual gaps. Second, we introduce the Function\u2013Actuator\u2013Aligned Space (FAAS), a unified action space that maps functionally similar actuators to shared coordinates, enabling cross-hand transfer. Leveraging FAAS as the action parameterization, we train UniDex-VLA, a 3D VLA policy pretrained on UniDex-Dataset and finetuned with task demonstrations. In addition, we build UniDex-Cap, a simple portable capture setup that records synchronized RGB-D streams and human hand poses and converts them into robot-executable trajectories to enable human\u2013robot data co-training that reduces reliance on costly robot demonstrations. On challenging tool-use tasks across two different hands, UniDex-VLA achieves 81\\% average task progress and outperforms prior VLA baselines by a large margin, while exhibiting strong spatial, object, and zero-shot cross-hand generalization. Together, UniDex-Dataset, UniDex-VLA, and UniDex-Cap provide a scalable foundation suite for universal dexterous manipulation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37862", "url": null, "sourceid": 32205, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36853, "uid": "18f91d43eb4c7f0e879697f012ea3815", "name": "UniT: Unified Multimodal Chain-of-Thought Test-time Scaling", "authors": [{"id": 155681, "fullname": "Leon Liangyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/155681?format=json", "institution": "Computer Science Department, Stanford University"}, {"id": 126717, "fullname": "Haoyu Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/126717?format=json", "institution": "University of California, Irvine"}, {"id": 90116, "fullname": "Zhipeng Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/90116?format=json", "institution": "Facebook"}, {"id": 70562, "fullname": "Ziqi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70562?format=json", "institution": "Nanyang Technological University"}, {"id": 127675, "fullname": "Animesh Sinha", "url": "http://cvpr.thecvf.com/api/miniconf/users/127675?format=json", "institution": "Meta AI"}, {"id": 85185, "fullname": "Xiaoliang Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/85185?format=json", "institution": "Facebook"}, {"id": 89436, "fullname": "Jialiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89436?format=json", "institution": "Facebook"}, {"id": 186025, "fullname": "Zecheng He", "url": "http://cvpr.thecvf.com/api/miniconf/users/186025?format=json", "institution": "Meta"}, {"id": 186026, "fullname": "Jianwei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186026?format=json", "institution": "Microsoft"}, {"id": 73494, "fullname": "Chunyuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/73494?format=json", "institution": "Microsoft Research, Redmond"}, {"id": 186027, "fullname": "Junzhe Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/186027?format=json", "institution": "Meta"}, {"id": 174139, "fullname": "Chu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174139?format=json", "institution": null}, {"id": 69178, "fullname": "Serena Yeung", "url": "http://cvpr.thecvf.com/api/miniconf/users/69178?format=json", "institution": "Stanford"}, {"id": 69156, "fullname": "Felix Juefei-Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/69156?format=json", "institution": "GenAI, Meta"}], "abstract": "Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, or evolving instructions, require decomposing instructions, verifying intermediate results, and making iterative corrections. While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves language model performance, extending this paradigm to unified multimodal models remains an open challenge.We introduce UniT, a framework for multimodal chain-of-thought test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds. UniT combines agentic data synthesis, unified model training, and flexible test-time inference to elicit cognitive behaviors including verification, subgoal decomposition, and content memory. Our key findings are: (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning provides a more scalable and compute-efficient TTS strategy than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. These results establish multimodal test-time scaling as an effective paradigm for advancing both generation and understanding in unified models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36853", "url": null, "sourceid": 30989, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39682, "uid": "ea7a98942b8033462a7c54397555e842", "name": "Rethinking BCE Loss for Multi-Label Image Recognition with Fine-tuning", "authors": [{"id": 192638, "fullname": "Ao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192638?format=json", "institution": "Nanjing University"}, {"id": 130219, "fullname": "Zhiwei Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130219?format=json", "institution": "Nanjing University"}, {"id": 185897, "fullname": "Zifeng Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185897?format=json", "institution": "Nanjing University"}, {"id": 183174, "fullname": "Cong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183174?format=json", "institution": "Nanjing University"}, {"id": 130214, "fullname": "Yafeng Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/130214?format=json", "institution": "Nanjing University"}, {"id": 192639, "fullname": "Shufan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192639?format=json", "institution": "Nanjing University"}, {"id": 192640, "fullname": "Qing Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192640?format=json", "institution": "Nanjing University"}], "abstract": "Fine-tuning vision\u2013language models such as CLIP has become the mainstream paradigm for multi-label image recognition, and prompt tuning is widely adopted due to its lightweight parameter cost and strong transferability. However, we find that when these methods use Binary Cross-entropy as the supervision loss, the model\u2019s confidence structure becomes systematically distorted, leading to pronounced miscalibration. Existing calibration techniques, such as temperature scaling or regularization-based methods, largely fail in multi-label settings because they cannot capture inherent semantic dependencies between classes, nor can they correct the global structural shifts introduced during fine-tuning. To address this issue, we propose Class-wise Covariance Regularization, which aligns the predicted covariance structure of class confidences with the semantic correlations encoded in pretrained text embeddings. This alignment preserves the geometric consistency of the class space throughout fine-tuning, resulting in more stable and interpretable confidence distributions across categories. Experiments on multi-label benchmarks show that CCR significantly reduces calibration errors while maintaining or even improving classification accuracy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39682", "url": null, "sourceid": 45158, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39903, "uid": "5b8f025dafb56cb8d3088b7259aadcfb", "name": "InsCal: Calibrated Multi-Source Fully Test-Time Prompt Tuning for Object Detection", "authors": [{"id": 193081, "fullname": "Xiaofan Que", "url": "http://cvpr.thecvf.com/api/miniconf/users/193081?format=json", "institution": "ARM"}, {"id": 193082, "fullname": "Dingrong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193082?format=json", "institution": "Rochester Institute of Technology"}, {"id": 191933, "fullname": "Xumin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191933?format=json", "institution": "Rochester Institute of Technology"}, {"id": 191932, "fullname": "Qi Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191932?format=json", "institution": "Rochester Institute of Technology"}], "abstract": "Test-time prompt tuning (TPT) has emerged as a powerful technique for adapting pre-trained vision-language models (VLMs) to diverse downstream tasks, including image classification and visual reasoning. With the rise of text-driven object detectors, we extend TPT to object detection, unlocking new capabilities for cross-domain adaptation. However, a critical challenge in TPT is the inherent miscalibration caused by entropy minimization: domain shifts often lead to incorrect predictions, and enforcing high confidence exacerbates miscalibration, ultimately degrading performance. To tackle this, we introduce InsCal, a novel framework designed to enhance cross-domain object detection through three key innovations: (1) extending TPT to a multi-source paradigm, enabling knowledge aggregation across diverse domains; (2) reducing domain gaps via a novel text-driven style transfer strategy that aligns features to the source domain without requiring reference images; and (3) refining the entropy minimization objective with instance-specific calibration, ensuring robust and well-calibrated adaptation. Our approach not only mitigates miscalibration but also significantly improves cross-domain object detection performance, setting a new benchmark for test-time adaptation in VLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39903", "url": null, "sourceid": 39518, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39364, "uid": "658bdae888193156bd7d5c18ce3bd6d4", "name": "The Road Less Seen: Segment Exploration for Weakly Supervised Video Anomaly Detection", "authors": [{"id": 183642, "fullname": "Anusha Achaya", "url": "http://cvpr.thecvf.com/api/miniconf/users/183642?format=json", "institution": "Rochester Institute of Technology"}, {"id": 191931, "fullname": "Hitesh Sapkota", "url": "http://cvpr.thecvf.com/api/miniconf/users/191931?format=json", "institution": "Amazon"}, {"id": 191932, "fullname": "Qi Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191932?format=json", "institution": "Rochester Institute of Technology"}, {"id": 191933, "fullname": "Xumin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191933?format=json", "institution": "Rochester Institute of Technology"}], "abstract": "Weakly supervised learning provides a cost-effective framework for video anomaly detection by using video-level supervision instead of relying on the costly fine-grained segment-level labels. Although contemporary methods have shown promising results on challenging real-world surveillance videos, most of them are evaluated using the Area Under the Receiver Operating Characteristic Curve (AUROC). Our work reveals that a high AUROC could result in a very low recall given a meaningful False Positive Rate (FPR) threshold. Thus, these models suffer from limited practical values, especially in high-stake domains (\\eg public safety and medical diagnosis), where missing the true anomalies are highly costly. This surprising phenomenon is rooted in the interplay of weak supervision and the highly imbalanced distribution between normal and abnormal segments. To tackle this key challenge of building practical video anomaly detection systems, we propose a novel dual exploration strategy that combines temporal clustering with uncertainty-based segment exploration. Temporal clustering selects diverse segments based on both semantic and temporal similarity, while uncertainty-based sampling targets low-scoring segments with high model uncertainty. This ensures the model learns from a wide range of patterns, both diverse and ambiguous, resulting in more informed and robust decision-making, and reduction in false negatives. Meanwhile, we recommend two practical metrics to replace the commonly used AUROC score for a more effective measure for evaluation. Experiments conducted in challenging real-world videos demonstrate better dual exploration performance compared to competitive baselines on these metrics, which justifies its improved practical value in real-world settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39364", "url": null, "sourceid": 44435, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38788, "uid": "90e67987539164c4f62dd88d4558b1d6", "name": "FiDeSR: High-Fidelity and Detail-Preserving One-Step Diffusion Super-Resolution", "authors": [{"id": 148316, "fullname": "Aro Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/148316?format=json", "institution": "Kyungpook National University"}, {"id": 190675, "fullname": "Myeongjin Jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190675?format=json", "institution": "Kyungpook National University"}, {"id": 190676, "fullname": "Chaewon Moon", "url": "http://cvpr.thecvf.com/api/miniconf/users/190676?format=json", "institution": "Kyungpook National University"}, {"id": 190677, "fullname": "Youngjin Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190677?format=json", "institution": "Kyungpook National University"}, {"id": 190678, "fullname": "Jinwoo Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/190678?format=json", "institution": null}, {"id": 149950, "fullname": "Sang-hyo Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/149950?format=json", "institution": "Kyungpook National University"}], "abstract": "Diffusion-based approaches have recently driven remarkable progress in real-world image super-resolution (SR). However, existing methods still struggle to simultaneously preserve fine details and ensure high-fidelity reconstruction, often resulting in suboptimal visual quality. In this paper, we propose FiDeSR, a high-fidelity and detail-preserving one-step diffusion super-resolution framework. During training, we introduce a detail-aware weighting strategy that adaptively emphasizes regions where the model exhibits higher prediction errors. During inference, low- and high-frequency adaptive enhancers further refine the reconstruction without requiring model retraining, enabling flexible enhancement control. To further improve the reconstruction accuracy, FiDeSR incorporates a residual-in-residual noise refinement, which corrects prediction errors in the diffusion noise and enhances fine detail recovery. FiDeSR achieves superior real-world SR performance compared to existing diffusion-based methods, producing outputs with both high perceptual quality and faithful content restoration.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38788", "url": null, "sourceid": 31141, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36892, "uid": "92b1e36cb94cefdb60af51fdd432500b", "name": "Beyond Sequential Tools: A Unified VLM Agent System for Photographic Post-Processing via Dynamic Multi-Expert Fusion", "authors": [{"id": 186128, "fullname": "Honglin Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186128?format=json", "institution": "ShanghaiTech University"}, {"id": 186129, "fullname": "Chenjie Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186129?format=json", "institution": "Southeast University"}, {"id": 186130, "fullname": "Jianbiao Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/186130?format=json", "institution": "Hithink RoyalFlush Information Network Co.,Ltd."}, {"id": 186131, "fullname": "Zixuan Ni", "url": "http://cvpr.thecvf.com/api/miniconf/users/186131?format=json", "institution": null}, {"id": 151400, "fullname": "Wei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/151400?format=json", "institution": "Zhejiang University of Technology"}, {"id": 186132, "fullname": "Zhenpeng Mi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186132?format=json", "institution": "Hithink RoyalFlush Information Network Co.,Ltd."}, {"id": 158497, "fullname": "Qian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158497?format=json", "institution": "ShanghaiTech University"}], "abstract": "Real-world photographic post-processing is a formidable challenge due to the frequent co-occurrence of multiple, coupled image degradations. Current paradigms, such as monolithic \"all-in-one\" models, often face generalization bottlenecks, while recent agent-based systems suffer from time-consuming, sequential tool invocation and suboptimal coordination of isolated, single-task tools. To overcome these limitations, we propose a novel and efficient framework: a vision-language agent system for universal photographic post-processing. Our system employs a powerful Vision-Language Model (VLM) as an orchestrator agent to perform nuanced user intent understanding and in-depth degradation analysis. Based on its assessment, the VLM generates a structured plan, dynamically allocating weights to a suite of specialized expert LoRA modules. These experts, which adapt only the Key (K) and Value (V) matrices for enhanced composability, are then simultaneously merged into a pretrained diffusion backbone to execute a tailored restoration. To ensure perceptually optimal weights, we introduce a lightweight allocation branch trained on the VLM's features using Direct Preference Optimization (DPO) from human feedback. This dynamic fusion paradigm enables a synergistic, context-aware restoration in a single, efficient forward pass. Our method demonstrates state-of-the-art performance across a wide range of synthetic and real-world datasets with diverse degradations. Crucially, it exhibits remarkable zero-shot generalization, achieving excellent results on real-world data. Our code and weights will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36892", "url": null, "sourceid": 38200, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38780, "uid": "a5d3526558368a8cda8a30a3c35963cb", "name": "Efficiently Reconstructing Dynamic Scenes one D4RT at a Time", "authors": [{"id": 136125, "fullname": "Chuhan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/136125?format=json", "institution": "Google"}, {"id": 182924, "fullname": "Guillaume LE MOING", "url": "http://cvpr.thecvf.com/api/miniconf/users/182924?format=json", "institution": "Google DeepMind"}, {"id": 128673, "fullname": "Skanda Koppula", "url": "http://cvpr.thecvf.com/api/miniconf/users/128673?format=json", "institution": "Google Deepmind"}, {"id": 86793, "fullname": "Ignacio Rocco", "url": "http://cvpr.thecvf.com/api/miniconf/users/86793?format=json", "institution": "Facebook"}, {"id": 150948, "fullname": "Liliane Momeni", "url": "http://cvpr.thecvf.com/api/miniconf/users/150948?format=json", "institution": "Google"}, {"id": 128081, "fullname": "Junyu Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/128081?format=json", "institution": "University of Oxford"}, {"id": 130779, "fullname": "Shuyang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/130779?format=json", "institution": "University of Oxford"}, {"id": 178259, "fullname": "Rahul Sukthankar", "url": "http://cvpr.thecvf.com/api/miniconf/users/178259?format=json", "institution": "Google DeepMind"}, {"id": 190650, "fullname": "Jo\u00eblle Barral", "url": "http://cvpr.thecvf.com/api/miniconf/users/190650?format=json", "institution": "Google"}, {"id": 190651, "fullname": "Raia Hadsell", "url": "http://cvpr.thecvf.com/api/miniconf/users/190651?format=json", "institution": "DeepMind"}, {"id": 190652, "fullname": "Zoubin Ghahramani", "url": "http://cvpr.thecvf.com/api/miniconf/users/190652?format=json", "institution": "Google DeepMind ; University of Cambridge"}, {"id": 75512, "fullname": "Andrew Zisserman", "url": "http://cvpr.thecvf.com/api/miniconf/users/75512?format=json", "institution": "University of Oxford"}, {"id": 190653, "fullname": "Junlin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190653?format=json", "institution": "Google"}, {"id": 90310, "fullname": "Mehdi S. M. Sajjadi", "url": "http://cvpr.thecvf.com/api/miniconf/users/90310?format=json", "institution": "Google"}], "abstract": "Understanding and reconstructing the complex geometry and motion of dynamic 4D scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward network designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our unified decoding interface allows the model to independently and efficiently probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state-of-the-art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38780", "url": null, "sourceid": 32380, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40345?format=json"], "related_events_ids": [40345]}, {"id": 36643, "uid": "0cd5d9d528e8f06f787079c89480f5dc", "name": "UNI-OOD: Unified Object- and Image-level Out-of-Distribution Detection via Cross-Context Attentive Vision-Language Modeling", "authors": [{"id": 157319, "fullname": "Yuchuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/157319?format=json", "institution": "Queen&#x27;s University"}, {"id": 185539, "fullname": "Azadeh Motamedi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185539?format=json", "institution": "Queen&#x27;s University"}, {"id": 185540, "fullname": "Hyock Ju Kwon", "url": "http://cvpr.thecvf.com/api/miniconf/users/185540?format=json", "institution": "University of Waterloo"}, {"id": 185541, "fullname": "Chul Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/185541?format=json", "institution": "University of Toronto"}, {"id": 157321, "fullname": "Il-Min Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/157321?format=json", "institution": "Queen&#x27;s University"}], "abstract": "Out-of-distribution (OOD) detection is a key requirement for reliable deployment in open-world environments, where a model must recognize inputs that fall outside the semantic scope of known concepts. While recent advances in vision\u2013language models (VLMs) have achieved strong results in image-level OOD detection, most methods still assume that each image contains a single dominant object. This assumption severely limits their applicability to real-world settings where scenes are naturally composed of multiple objects that each demands independent OOD assessment. Existing object-level approaches, including the current SOTA method RUNA, remain constrained by coarse global representations and insufficient modeling of contextual dependencies between objects and their backgrounds. We propose UNI-OOD, a unified framework that performs both object- and image-level OOD detection within a single vision\u2013language model, without requiring prior knowledge of which task is being addressed at inference time. The key idea is to leverage cross-context attentive modeling that captures complementary visual and textual semantics. UNI-OOD learns to attend to fine-grained spatial details within each object, aligns visual and linguistic embeddings to strengthen semantic correspondence, and model interactions between target objects and their surrounding context. By jointly reasoning over object-centric and background cues, the framework disentangles informative visual evidence from spurious correlations and enables a consistent OOD scoring mechanism across different visual granularities. Extensive experiments on standard object- and image-level benchmarks demonstrate that UNI-OOD achieves substantial and consistent improvements over previous approaches, establishing new SOTA performance in both object-level and image-level OOD detection. Beyond empirical gains, this study provides the first holistic formulation of OOD detection that bridges the gap between object- and image-level detection within a single unified vision\u2013language paradigm, establishing a general foundation for open-world applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36643", "url": null, "sourceid": 33500, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40054, "uid": "0f783a2a5e30eec921c2f2311da42dc5", "name": "TiViBench: Benchmarking Think-in-Video Reasoning for Video Generation", "authors": [{"id": 155194, "fullname": "Haodong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/155194?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 180285, "fullname": "Disen Lan", "url": "http://cvpr.thecvf.com/api/miniconf/users/180285?format=json", "institution": "Fudan University"}, {"id": 148676, "fullname": "Wen-Jie Shu", "url": "http://cvpr.thecvf.com/api/miniconf/users/148676?format=json", "institution": "bitdeer"}, {"id": 107568, "fullname": "Qingyang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107568?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 193395, "fullname": "ZihanWang ZihanWang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193395?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 193396, "fullname": "Sirui CHEN", "url": "http://cvpr.thecvf.com/api/miniconf/users/193396?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 193397, "fullname": "Wenkai Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193397?format=json", "institution": "Northwestern Polytechnical University Xi&#x27;an"}, {"id": 129968, "fullname": "Kanghao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/129968?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 193398, "fullname": "Hongfei (Faye) Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193398?format=json", "institution": ""}, {"id": 193399, "fullname": "Zixin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193399?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 159820, "fullname": "Rongjin Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/159820?format=json", "institution": "City University of Hong Kong"}, {"id": 74051, "fullname": "Yu Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/74051?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 88283, "fullname": "Ying-Cong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88283?format=json", "institution": "The Hong Kong University of Science and Technology"}], "abstract": "The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3's chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose **TiViBench**, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) **Structural Reasoning & Search**, ii) **Spatial & Visual Pattern Reasoning**, iii) **Symbolic & Logical Reasoning**, and iv) **Action Planning & Task Execution**, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (*e.g.*, Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open-source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce **VideoTPO**, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, setting a foundation for future research in this emerging field.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40054", "url": null, "sourceid": 39002, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39034, "uid": "3ecf70058911d495069ec09c0c8c9190", "name": "Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters", "authors": [{"id": 182225, "fullname": "Mohammed Rahman Sherif Khan Mohammad", "url": "http://cvpr.thecvf.com/api/miniconf/users/182225?format=json", "institution": "Edge Hill University"}, {"id": 191215, "fullname": "Ardhendu Behera", "url": "http://cvpr.thecvf.com/api/miniconf/users/191215?format=json", "institution": "Edge Hill University"}, {"id": 191216, "fullname": "Sandip Pradhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191216?format=json", "institution": "Edge Hill University"}, {"id": 191217, "fullname": "Swagat Kumar", "url": "http://cvpr.thecvf.com/api/miniconf/users/191217?format=json", "institution": "Edge Hill University"}, {"id": 191218, "fullname": "Amr Ahmed", "url": "http://cvpr.thecvf.com/api/miniconf/users/191218?format=json", "institution": "Edge Hill University"}], "abstract": "Recent adapter-based CLIP tuning (e.g., Tip-Adapter) is a strong few-shot learner, achieving efficiency by caching support features for fast prototype matching. However, these methods rely on global uni-modal feature vectors, overlooking fine-grained patch relations and their structural alignment with class text. To bridge this gap without incurring inference costs, we introduce a novel asymmetric training-only framework. Instead of altering the lightweight adapter, we construct a high-capacity auxiliary Heterogeneous Graph Teacher that operates solely during training. This teacher (i) integrates multi-scale visual patches and text prompts into a unified graph, (ii) performs deep cross-modal reasoning via a Modality-aware Graph Transformer (MGT), and (iii) applies discriminative node filtering to extract high-fidelity class features. Crucially, we employ a cache-aware dual-objective strategy to supervise this relational knowledge directly into the Tip-Adapter\u2019s key\u2013value cache, effectively upgrading the prototypes while the graph teacher is discarded at test time. Thus, inference remains identical to Tip-Adapter with zero extra latency or memory. Across standard 1-16-shot benchmarks, our method consistently establishes a new state-of-the-art. Ablations confirm that the auxiliary graph supervision, text-guided reasoning, and node filtering are the essential ingredients for robust few-shot adaptation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39034", "url": null, "sourceid": 45684, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39377, "uid": "f481464eecb83ca952937d1f7e24908b", "name": "SyncDreamer: Controllable and Expressive Avatar Generation Beyond the Talking Head", "authors": [{"id": 181237, "fullname": "Fatemeh Nazarieh", "url": "http://cvpr.thecvf.com/api/miniconf/users/181237?format=json", "institution": "University of Surrey"}, {"id": 154653, "fullname": "Zhenhua Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/154653?format=json", "institution": "Jiangnan University"}, {"id": 191955, "fullname": "Diptesh Kanojia", "url": "http://cvpr.thecvf.com/api/miniconf/users/191955?format=json", "institution": "University of Surrey"}, {"id": 154654, "fullname": "Josef Kittler", "url": "http://cvpr.thecvf.com/api/miniconf/users/154654?format=json", "institution": "University of Surrey"}, {"id": 154655, "fullname": "Muhammad Awais", "url": "http://cvpr.thecvf.com/api/miniconf/users/154655?format=json", "institution": "University of Surrey"}], "abstract": "Generating realistic and expressive audio-driven talking avatars remains a central challenge in digital human synthesis. Existing methods often depend on intermediate representations such as pose estimations for natural body motion, which restricts flexibility and adds visual distortions. Moreover, most audio-driven approaches rely on discrete emotion classifiers or text labels to regulate facial expression, reducing complex affective dynamics to coarse categories such as happy, sad, or angry. Such categorical supervision fails to capture the continuous and fine-grained speech dynamics (rhythm, energy, intensity) resulting in limited synchronization and emotionally shallow motion. To overcome these limitations, we present SyncDreamer, a unified Diffusion Transformer framework that generates identity-preserving and emotionally expressive talking avatars from only a single image, speech audio, and text prompt.We propose a visual adapter with Attention Localization Loss to maintain identity fidelity, further incorporating an audio dynamics encoder for rhythm- and emotion-aware motion, and an RL-based Cross-Modal Prompt Enhancer grounding textual cues in visual context for fine-grained motion control. Extensive experiments on portrait and full-body benchmarks demonstrate state-of-the-art performance in realism, synchronization accuracy, and semantic controllability, establishing a scalable foundation for expressive digital avatars in interactive and creative applications.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39377", "url": null, "sourceid": 37124, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37420, "uid": "8d697c1323b49f7888addaf6429d74b7", "name": "TAPFormer: Robust Arbitrary Point Tracking via Transient Asynchronous Fusion of Frames and Events", "authors": [{"id": 187406, "fullname": "Jiaxiong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187406?format=json", "institution": "National University of Defense Technology"}, {"id": 187407, "fullname": "Zhen Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187407?format=json", "institution": "National University of Defense Technology"}, {"id": 187408, "fullname": "Jinpu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187408?format=json", "institution": "National University of Defense Technology"}, {"id": 154803, "fullname": "Yi Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/154803?format=json", "institution": "Hunan University"}, {"id": 187409, "fullname": "Hui Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187409?format=json", "institution": "National University of Defense Technology"}, {"id": 132213, "fullname": "Xieyuanli Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/132213?format=json", "institution": "National University of Defense Technology"}, {"id": 187410, "fullname": "Dewen Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187410?format=json", "institution": "National University of Defense Technology"}], "abstract": "Tracking any point (TAP) is a fundamental yet challenging task in computer vision, requiring high precision and long-term motion reasoning. Recent attempts to combine RGB frames and event streams have shown promise, yet they typically rely on synchronous or non-adaptive fusion, leading to temporal misalignment and severe degradation when one modality fails. We introduce TAPFormer, a transformer-based framework that performs asynchronous temporal-consistent fusion of frames and events for robust and high-frequency arbitrary point tracking.Our key innovation is a Transient Asynchronous Fusion (TAF) mechanism, which explicitly models the temporal evolution between discrete frames through continuous event updates\u2014bridging the gap between low-rate frames and high-rate events. In addition, a Cross-modal Locally Weighted Fusion (CLWF) module adaptively adjusts spatial attention according to modality reliability, yielding stable and discriminative features even under blur or low light.To evaluate our approach under realistic conditions, we construct a novel real-world frame-event TAP dataset under diverse illumination and motion conditions.Our method outperforms existing point trackers, achieving a $28.2$% improvement in average pixel error within threshold. Moreover, on standard point tracking benchmarks, our tracker consistently achieves the best performance. We will release the code and dataset upon acceptance to support future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37420", "url": null, "sourceid": 40658, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36333, "uid": "a45d6d742a58d481c633a346de1c7c2f", "name": "Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation", "authors": [{"id": 184791, "fullname": "Lingfeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184791?format=json", "institution": "Tsinghua University; Xiaomi Corporation"}, {"id": 184792, "fullname": "Yuchen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184792?format=json", "institution": "Georgia Institute of Technology"}, {"id": 184793, "fullname": "Hongsheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184793?format=json", "institution": "Shenzhen International Graduate School, Tsinghua University"}, {"id": 184794, "fullname": "Haoxiang Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184794?format=json", "institution": null}, {"id": 184795, "fullname": "Yingbo Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184795?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 184796, "fullname": "Hangjun Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/184796?format=json", "institution": "Xiaomi Corporation"}, {"id": 158846, "fullname": "Long Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/158846?format=json", "institution": "Wayve"}, {"id": 184797, "fullname": "Xiaojun Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184797?format=json", "institution": "Pengcheng Laboratory"}, {"id": 156448, "fullname": "Xiaoshuai Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/156448?format=json", "institution": "Beijing Academy of Artificial Intelligence(BAAl)"}, {"id": 184798, "fullname": "Wenbo Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/184798?format=json", "institution": "Tsinghua Univeresity"}], "abstract": "Vision-Language Models (VLMs), leveraging their powerful visual perception and reasoning capabilities, have been widely applied in Unmanned Aerial Vehicle (UAV) tasks.However, the spatial intelligence capabilities of existing VLMs in UAV scenarios remain largely unexplored, raising concerns about their effectiveness in navigating and interpreting dynamic environments.To bridge this gap, we introduce SpatialSky-Bench, a comprehensive benchmark specifically designed to evaluate the spatial intelligence capabilities of VLMs in UAV navigation. Our benchmark comprises two categories\u2014Environmental Perception and Scene Understanding\u2014divided into 13 subcategories, including bounding boxes, color, distance, height, and landing safety analysis, among others.Extensive evaluations of various mainstream open-source and closed-source VLMs reveal unsatisfactory performance in complex UAV navigation scenarios, highlighting significant gaps in their spatial capabilities.To address this challenge, we developed the SpatialSky-Dataset, a comprehensive dataset containing 1 M samples with diverse annotations across various scenarios. Leveraging this dataset, we introduce Sky-VLM, a specialized VLM designed for UAV spatial reasoning across multiple granularities and contexts.Extensive experimental results demonstrate that Sky-VLM  achieves state-of-the-art performance across all benchmark tasks, paving the way for the development of VLMs suitable for drone scenarios.The dataset, benchmark toolkit, and associated code and model checkpoints will be publicly accessible.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36333", "url": null, "sourceid": 38217, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37207, "uid": "ff46fb7782a53ef41bc031428e1ed4cd", "name": "DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images", "authors": [{"id": 89003, "fullname": "Xiaoxue Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89003?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 186921, "fullname": "Ziyi Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186921?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 181877, "fullname": "Yuantao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/181877?format=json", "institution": "The Chinese University of Hong Kong,Shenzhen"}, {"id": 186922, "fullname": "Gen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186922?format=json", "institution": "Zhejiang University; Tsinghua University"}, {"id": 186923, "fullname": "Nan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186923?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 186924, "fullname": "Hongcheng Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186924?format=json", "institution": "Xiaomi Corporation"}, {"id": 158846, "fullname": "Long Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/158846?format=json", "institution": "Wayve"}, {"id": 180077, "fullname": "Haiyang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/180077?format=json", "institution": "Xiaomi Corporation"}, {"id": 126660, "fullname": "Bing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126660?format=json", "institution": "Alibaba Group"}, {"id": 185801, "fullname": "Guang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185801?format=json", "institution": "Xiaomi Corporation"}, {"id": 69669, "fullname": "Hongyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/69669?format=json", "institution": "The University of Hong Kong"}, {"id": 89005, "fullname": "Ya-Qin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89005?format=json", "institution": "AIR, Tsinghua University"}, {"id": 184796, "fullname": "Hangjun Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/184796?format=json", "institution": "Xiaomi Corporation"}, {"id": 88978, "fullname": "Hao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88978?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Autonomous driving needs fast, scalable 4D reconstruction and re-simulation for training and evaluation, yet most methods for dynamic driving scenes still rely on per-scene optimization, known camera calibration, or short frame windows, making them slow and impractical. We revisit this problem from a feedforward perspective and introduce Driving Gaussian Grounded Transformer (DGGT), a unified framework for pose-free dynamic scene reconstruction. We note that the existing formulations, treating camera pose as a required input, limit flexibility and scalability. Instead, we reformulate pose as an output of the model, enabling reconstruction directly from sparse, unposed images and supporting an arbitrary number of views for long sequences. Our approach jointly predicts per-frame 3D Gaussian maps and camera parameters, disentangles dynamics with a lightweight dynamic head, and preserves temporal consistency with a lifespan head that modulates visibility over time. A diffusion-based rendering refinement further reduces motion/interpolation artifacts and improves novel-view quality under sparse inputs. The result is a single-pass, pose-free algorithm that achieves state-of-the-art performance and speed. Trained and evaluated on large-scale driving benchmarks (Waymo, nuScenes, Argoverse2), our method outperforms prior work both when trained on each dataset and in zero-shot transfer across datasets, and it scales well as the number of input frames increases.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37207", "url": null, "sourceid": 38077, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37865, "uid": "e5654b80531b9a7338900193f90fbba5", "name": "VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving", "authors": [{"id": 157922, "fullname": "Jie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157922?format=json", "institution": "Tianjin University"}, {"id": 188438, "fullname": "Guang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188438?format=json", "institution": "Xiaomi Corporation"}, {"id": 188439, "fullname": "Zhijian Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188439?format=json", "institution": null}, {"id": 188440, "fullname": "Chenxu Dang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188440?format=json", "institution": "Huazhong University of Science and Technology; Xiaomi Corporation"}, {"id": 184796, "fullname": "Hangjun Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/184796?format=json", "institution": "Xiaomi Corporation"}, {"id": 86180, "fullname": "Yahong Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/86180?format=json", "institution": "Tianjin University"}, {"id": 158846, "fullname": "Long Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/158846?format=json", "institution": "Wayve"}], "abstract": "The significance of cross-view 3D geometric modeling capabilities for autonomous driving is self-evident, yet existing Vision-Language Models (VLMs) inherently lack this capability, resulting in their mediocre performance. While some promising approaches attempt to mitigate this by constructing Q&A data for auxiliary training, they still fail to fundamentally equip VLMs with the ability to comprehensively handle diverse evaluation protocols. We thus chart a new course, advocating for the infusion of VLMs with the cross-view geometric grounding of mature 3D foundation models, closing this critical capability gap in autonomous driving. In this spirit, we propose a novel architecture, $\\textbf{VGGDrive}$, which empowers $\\textbf{V}$ision-language models with cross-view $\\textbf{G}$eometric $\\textbf{G}$rounding for autonomous $\\textbf{Driv}$ing. Concretely, to bridge the cross-view 3D geometric features from the frozen visual 3D model with the VLM's 2D visual features, we introduce a plug-and-play Cross-View 3D Geometric Enabler (CVGE). The CVGE decouples the base VLM architecture and effectively empowers the VLM with 3D features through a hierarchical adaptive injection mechanism. Extensive experiments show that VGGDrive enhances base VLM performance across five autonomous driving benchmarks, including tasks like cross-view risk perception, motion prediction, and trajectory planning. It\u2019s our belief that mature 3D foundation models can empower autonomous driving tasks through effective integration, and we hope our initial exploration demonstrates the potential of this paradigm to the autonomous driving community.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37865", "url": null, "sourceid": 31228, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38332, "uid": "d0ceaba6d228fd9ad99831d5df783c7c", "name": "From Selection to Scheduling: Federated Geometry-Aware Correction Makes Exemplar Replay Work Better under Continual Dynamic Heterogeneity", "authors": [{"id": 189485, "fullname": "Zhuang Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189485?format=json", "institution": "Shandong University"}, {"id": 189623, "fullname": "Yingpeng Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189623?format=json", "institution": "Nanyang Technological University"}, {"id": 189624, "fullname": "Lei Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189624?format=json", "institution": "Shandong University"}, {"id": 153933, "fullname": "Guoqing Chao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153933?format=json", "institution": "Harbin Institute of Technology at Weihai"}, {"id": 149152, "fullname": "Lei Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/149152?format=json", "institution": "Shandong University"}, {"id": 157342, "fullname": "Han Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157342?format=json", "institution": "Nanyang Technological University"}, {"id": 189625, "fullname": "Xiangxu Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189625?format=json", "institution": "Shandong University"}], "abstract": "Exemplar replay has become an effective strategy for mitigating catastrophic forgetting in federated continual learning (FCL) by retaining representative samples from past tasks. Existing studies focus on designing sample-importance estimation mechanisms to identify information-rich samples. However, they typically overlook strategies for effectively utilizing the selected exemplars, which limits their performance under continual dynamic heterogeneity across clients and tasks. To address this issue, this paper proposes a federated geometry-aware correction method, termed FEAT, which alleviates imbalance-induced representation collapse that drags rare-class features toward frequent classes across clients. Specifically, it consists of two key modules: 1) the Geometric Structure Alignment module performs structural knowledge distillation by aligning the pairwise angular similarities between feature representations and their corresponding Equiangular Tight Frame prototypes, which are fixed and shared across clients to serve as a class-discriminative reference structure. This encourages geometric consistency across tasks and helps mitigate representation drift; 2) the Energy-based Geometric Correction module removes task-irrelevant directional components from feature embeddings, which reduces prediction bias toward majority classes. This improves sensitivity to minority classes and enhances the model's robustness under class-imbalanced data distributions. Extensive experiments on three benchmark datasets demonstrate that FEAT substantially achieves a 4%\u20138% improvement in Top-1 accuracy compared to nine state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38332", "url": null, "sourceid": 31291, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36487, "uid": "56363c2b36be6de4ac2f67eaa038b927", "name": "CICA: Coupling Confidence-Aware Pretraining with Confidence-Informed Attention for Robust Multimodal Sentiment Analysis", "authors": [{"id": 185171, "fullname": "Haoyu Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185171?format=json", "institution": "Xihua University"}, {"id": 185172, "fullname": "Xiaoliang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185172?format=json", "institution": "Xihua University"}, {"id": 184481, "fullname": "Duoqian Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184481?format=json", "institution": "Tongji University"}, {"id": 185173, "fullname": "Xiaolin Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185173?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 185174, "fullname": "Xianyong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185174?format=json", "institution": "Xihua University"}, {"id": 185175, "fullname": "Yajun Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/185175?format=json", "institution": "Xihua University"}], "abstract": "Multimodal sentiment analysis requires integrating language, visual, and acoustic cues, yet these modalities are often noisy, incomplete, or contradictory, making fusion unreliable. Most existing methods assume uniformly trustworthy modalities and thus degrade when signals conflict.To address this, we propose CICA, a framework that couples Confidence-Aware Pretraining with Confidence-Informed Attention. In pretraining, each modality encoder learns to estimate the reliability of its own representation, producing both embeddings and confidence scores. These scores then guide a confidence-informed attention mechanism, which strengthens contributions from reliable modalities while suppressing noisy or conflicting ones, enabling adaptive fusion under varying signal conditions.CICA achieves state-of-the-art performance across four major benchmarks on MOSI, MOSEI, CH-SIMS, and CH-SIMSv2. It achieves MAE 0.630 and Corr 0.855 on MOSI, and MAE 0.489 and Corr 0.856 on MOSEI, significantly surpassing prior methods. Consistent improvements are also observed across Acc-7, Acc-2, and F1 metrics. Under noisy and missing-modality conditions, CICA maintains significantly more stable performance, indicating improved robustness and interpretability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36487", "url": null, "sourceid": 45666, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38278, "uid": "4c290319041a7b75c60a6261f74fea51", "name": "Seeing Through the Shift: Causality-Inspired Robust Generalized Category Discovery", "authors": [{"id": 130131, "fullname": "Wei Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/130131?format=json", "institution": "Monash University"}, {"id": 189484, "fullname": "Yiwen Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189484?format=json", "institution": "Monash University"}, {"id": 158746, "fullname": "Sijin Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/158746?format=json", "institution": "Monash University"}, {"id": 189485, "fullname": "Zhuang Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189485?format=json", "institution": "Shandong University"}, {"id": 158743, "fullname": "Zhongxing Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/158743?format=json", "institution": "Weill Cornell Medicine, Cornell University"}, {"id": 189486, "fullname": "Zhonghua Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189486?format=json", "institution": "Monash University"}, {"id": 132272, "fullname": "feilong tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/132272?format=json", "institution": "Monash University"}, {"id": 185612, "fullname": "Zongyuan Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/185612?format=json", "institution": "Monash University"}], "abstract": "Generalized Category Discovery (GCD) aims to transfer knowledge from known categories to automatically discover new, unseen ones while preserving recognition of the known classes. Despite recent progress, existing GCD approaches typically assume that all data are drawn from the same distribution, which is rarely valid in real-world scenarios. In practice, data often experience simultaneous domain shifts and novel category emergence, causing severe performance degradation of existing systems. To address this challenge, we propose CausalGCD, a causality-inspired framework designed to mitigate domain-shift bias in category discovery. Specifically, we first analyze the causal graph to uncover the relationships among key variables in cross-domain GCD. We then introduce the concept of causal dependency risk and propose a Causal Dependency Risk Estimator to capture causal semantics, further deriving a theoretically computable upper bound to optimize this risk under cross-domain GCD settings. Furthermore, we propose a Causal Geometric Manifold Constraint that enforces invariant manifold-level associations between known and unknown categories across domains, thereby facilitating robust discovery of novel classes. Extensive experiments on the \\textbf{SSB-C} and \\textbf{DomainNet} benchmarks demonstrate the effectiveness of \\textbf{CausalGCD} and highlight the significance of causal reasoning in open-world category discovery.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38278", "url": null, "sourceid": 45538, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36363, "uid": "86c1dedbd4a728cf270c4d7af3798b03", "name": "YOLO-ULM: Ultra-Lightweight Models for Real-Time Object Detection", "authors": [{"id": 180455, "fullname": "Shasha Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/180455?format=json", "institution": "Ocean University of China"}, {"id": 184880, "fullname": "Chong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184880?format=json", "institution": "Ocean University of China"}, {"id": 184881, "fullname": "Xinning Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184881?format=json", "institution": "Ocean University of China"}, {"id": 184882, "fullname": "Xuebo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184882?format=json", "institution": "Ocean University of China"}], "abstract": "YOLO series lead object detection with superior accuracy and speed. However, both convolutional and self-attention based architectures suffer from parameter redundancy and insufficient computational efficiency. Existing lightweight methods excessively pursue speed while ignoring the loss of important information during feature extraction and spatial transformation across different stages. Thus, effective lightweighting is crucial for detection performance. We propose YOLO-ULM, an ultra-lightweight real-time detector that achieves accelerated inference while preserving high accuracy. We innovatively design a variety of dual efficiency- and accuracy-driven modules, including efficient feature aggregation and multi-scale downsampling modules, as well as a more focused complete-IoU loss function. To validate our approach, we train it from scratch on COCO dataset without pretrained weights. By refining backbone parameters, we extend it to YOLO-ULM-Turbo for accelerated inference. YOLO-ULM surpasses state-of-the-art real-time detectors like YOLOv11/YOLOv12/YOLOv13 and RT-DETR. On a T4 GPU, YOLO-ULM-N achieves 41.6\\% mAP with an inference latency of 1.52 ms, outperforming YOLOv11-N (2.2\\%$\\uparrow$) and YOLOv12-N (1.0\\%$\\uparrow$). YOLO-ULM-S exceeds RT-DETR-R18 by 1.6\\% mAP with 64.7\\% fewer FLOPs and 63\\% fewer parameters. YOLO-ULM-L / X surpass YOLOv13-L / X by 0.7\\% and 0.8\\% respectively in mAP. YOLO-ULM-Turbo matches YOLOv12-Turbo's performance but uses less computation, with Turbo-N variant achieving 0.3\\% higher mAP and 16\\% fewer parameters than YOLOv12-Turbo-N.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36363", "url": null, "sourceid": 32839, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38579, "uid": "e7d5d2680db41c2f20b454ca2ef90401", "name": "MeanFuser: Fast One-Step Multi-Modal Trajectory Generation and Adaptive Reconstruction via MeanFlow for End-to-End Driving", "authors": [{"id": 145782, "fullname": "junli wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/145782?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 156360, "fullname": "Yinan Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/156360?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 190196, "fullname": "Xueyi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190196?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 158537, "fullname": "Zebin Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/158537?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 157716, "fullname": "Pengfei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/157716?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 186441, "fullname": "Kun Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/186441?format=json", "institution": null}, {"id": 184796, "fullname": "Hangjun Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/184796?format=json", "institution": "Xiaomi Corporation"}, {"id": 185801, "fullname": "Guang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185801?format=json", "institution": "Xiaomi Corporation"}, {"id": 188438, "fullname": "Guang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188438?format=json", "institution": "Xiaomi Corporation"}, {"id": 158846, "fullname": "Long Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/158846?format=json", "institution": "Wayve"}, {"id": 190197, "fullname": "Zhongpu Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/190197?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 190198, "fullname": "Qichao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190198?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}], "abstract": "Generative models have shown great potential in trajectory planning. Recent studies demonstrate that anchor-guided generative models are effective in modeling the uncertainty of driving behaviors and improving overall performance. However, these methods rely on discrete anchor vocabularies that must sufficiently cover the trajectory distribution during testing to ensure robustness, inducing an inherent trade-off between vocabulary size and model performance.To overcome this limitation, we propose MeanFuser, an end-to-end autonomous driving method that enhances both efficiency and robustness through three key designs. (1) We introduce Gaussian Mixture Noise (GMN) to guide generative sampling, enabling a continuous representation of the trajectory space and eliminating the dependency on discrete anchor vocabularies. (2) We introduce ``MeanFlow Identity\", which models the mean velocity field between GMN and data distribution instead of the instantaneous velocity field used in na\u00efve flow-matching methods, effectively eliminating numerical errors from ODE solvers and significantly accelerating inference. (3) We design a lightweight Adaptive Reconstruction Module (ARM) that enables the model to consider all sampled proposals and adaptively decide whether to reconstruct a trajectory when none of the proposals is satisfactory. Experiments on the NAVSIM closed-loop benchmark demonstrate that MeanFuser achieves outstanding performance and exceptional inference efficiency, offering a robust and efficient solution for end-to-end autonomous driving.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38579", "url": null, "sourceid": 42803, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40379, "uid": "ef000db3b0f2a2c40465c2a8464b726b", "name": "Learning to Drive via Real-World Simulation at Scale", "authors": [{"id": 142811, "fullname": "Haochen Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/142811?format=json", "institution": "CASIA, Xiaomi EV"}, {"id": 185064, "fullname": "Tianyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185064?format=json", "institution": "Fudan University"}, {"id": 98926, "fullname": "Haochen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/98926?format=json", "institution": null}, {"id": 90131, "fullname": "Jiazhi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90131?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 192709, "fullname": "Yihang Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192709?format=json", "institution": "University of Hong Kong"}, {"id": 188438, "fullname": "Guang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188438?format=json", "institution": "Xiaomi Corporation"}, {"id": 145782, "fullname": "junli wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/145782?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 192710, "fullname": "Yinfeng Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192710?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 89150, "fullname": "Zhang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89150?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 87288, "fullname": "Liang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87288?format=json", "institution": "CASIA"}, {"id": 184796, "fullname": "Hangjun Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/184796?format=json", "institution": "Xiaomi Corporation"}, {"id": 158846, "fullname": "Long Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/158846?format=json", "institution": "Wayve"}, {"id": 69669, "fullname": "Hongyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/69669?format=json", "institution": "The University of Hong Kong"}], "abstract": "Achieving fully autonomous driving systems requires learning rational decisions in a wide span of scenarios, including safety-critical and out-of-distribution ones. However, such cases are underrepresented in real-world corpus collected by human experts. To complement for the lack of data diversity, we introduce a novel and scalable simulation framework capable of synthesizing these crucial massive unseen states upon existing driving logs. Our pipeline utilizes advanced neural rendering with a reactive environment to generate high-fidelity multi-view observations controlled by ego trajectory perturbations. Furthermore, we develop a pseudo-expert trajectory generation mechanism to provide feasible action supervision for these newly simulated states to provide action supervision.Upon the synthesized data, we find that a simple co-training strategy on both real-world and simulated samples can lead to significant improvements in both robustness and generalization for various planning methods on challenging real-world benchmarks, up to +6.8 EPDMS on navhard and +2.9 on navtest. More importantly, such policy improvement scales smoothly by increasing simulation data only, even without extra real-world data streaming in. We further reveal crucial findings of such a sim-real paradigm, includingthe design of pseudo-experts and the scaling properties for different policy architectures. Simulation data and code would be released.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40379", "url": "https://github.com/OpenDriveLab/SimScale", "sourceid": -34548, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39714?format=json"], "related_events_ids": [39714]}, {"id": 37878, "uid": "d72c5c85c6c5a1da806d45e917a714ee", "name": "Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans", "authors": [{"id": 180302, "fullname": "Sizhong Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/180302?format=json", "institution": "Tsinghua University"}, {"id": 188477, "fullname": "Ramon Weber", "url": "http://cvpr.thecvf.com/api/miniconf/users/188477?format=json", "institution": "University of California, Berkeley"}, {"id": 188478, "fullname": "Xinzheng Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188478?format=json", "institution": "Tsinghua University"}], "abstract": "Architectural floor plan design demands joint reasoning over geometry, semantics, and spatial hierarchy, which remains a major challenge for current AI systems. Although recent diffusion and language models improve visual fidelity, they still struggle with coherent spatial reasoning and controllable generation. We present HouseMind, a multimodal large language model that unifies floor plan understanding, generation, and editing in one framework. We introduce discrete room-instance tokens to construct a unified vocabulary that bridges layouts and symbolic reasoning. With multimodal alignment and instruction tuning, the model synthesizes coherent, controllable layouts from text instructions. Experiments show how the framework achieves superior geometric validity and controllability while remaining efficient and locally deployable.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37878", "url": null, "sourceid": 34757, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40218, "uid": "8e1ba2fadecb9dc939750d1104c8a7f2", "name": "Diffusion-Based sRGB Real Noise Generation via Prompt-Driven Noise Representation Learning", "authors": [{"id": 193811, "fullname": "Jaekyun Ko", "url": "http://cvpr.thecvf.com/api/miniconf/users/193811?format=json", "institution": "Samsung Electronics"}, {"id": 129959, "fullname": "Dongjin Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/129959?format=json", "institution": "Hanyang University"}, {"id": 193812, "fullname": "Soomin Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/193812?format=json", "institution": "Hanyang Universty"}, {"id": 155070, "fullname": "Guanghui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155070?format=json", "institution": "Toronto Metropolitan University"}, {"id": 77245, "fullname": "Tae Hyun Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/77245?format=json", "institution": "Hanyang Univ."}], "abstract": "Denoising in the sRGB image space is challenging due to noise variability.Although end-to-end methods perform well, their effectiveness in real-world scenarios is limited by the scarcity of real noisy-clean image pairs, which are expensive and difficult to collect. To address this limitation, several generative methods have been developed to synthesize realistic noisy images from limited data. These generative approaches often rely on camera metadata during both training and testing to synthesize real-world noise. However, the lack of metadata or inconsistencies between devices restricts their usability.Therefore, we propose a novel framework called Prompt-Driven  Noise Generation (PNG). This model is capable of acquiring high-dimensional prompt features that capture the characteristics of real-world input noise and creating a variety of realistic noisy images consistent with the distribution of the input noise.By eliminating the dependency on explicit camera metadata, our approach significantly enhances the generalizability and applicability of noise synthesis.Comprehensive experiments reveal that our model effectively produces realistic noisy images and show the successful application of these generated images in removing real-world noise across various benchmark datasets.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40218", "url": null, "sourceid": 32278, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38694, "uid": "25c801f83373161c6a13f424af3c2791", "name": "Stable and Efficient Single-Rollout RL for Multimodal Reasoning", "authors": [{"id": 190475, "fullname": "Rui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190475?format=json", "institution": "University of Maryland, College Park"}, {"id": 190476, "fullname": "Dian Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190476?format=json", "institution": "Tencent AI Lab"}, {"id": 70165, "fullname": "Lei Ke", "url": "http://cvpr.thecvf.com/api/miniconf/users/70165?format=json", "institution": "HKUST &amp; ETH Zurich"}, {"id": 190477, "fullname": "Haolin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190477?format=json", "institution": "University of Virginia, Charlottesville"}, {"id": 190478, "fullname": "Yujun Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190478?format=json", "institution": "University of Notre Dame"}, {"id": 154920, "fullname": "Zhenwen Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154920?format=json", "institution": "Tencent AI Lab"}, {"id": 190479, "fullname": "Haitao Mi", "url": "http://cvpr.thecvf.com/api/miniconf/users/190479?format=json", "institution": "Tencent AI Lab"}, {"id": 190480, "fullname": "Pratap Tokekar", "url": "http://cvpr.thecvf.com/api/miniconf/users/190480?format=json", "institution": "University of Maryland, College Park"}, {"id": 185776, "fullname": "Dong Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185776?format=json", "institution": "Capital One"}], "abstract": "Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevalent group-based algorithms such as GRPO require multi-rollout sampling for each prompt. While more efficient single-rollout variants have recently been explored in text-only settings, we find that they suffer from severe instability in multimodal contexts, often leading to training collapse. To address this sample efficiency-stability trade-off, we introduce $\\textbf{MSSR}$ (Multimodal Stabilized Single-Rollout), a group-free RLVR framework that achieves both stable optimization and effective multimodal reasoning performance. MSSR achieves this via an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes, preventing collapse and maintaining training stability. While such mechanisms have been used in group-based RLVR, we show that in the multimodal single-rollout setting they are not merely beneficial but essential for stability. In in-distribution evaluations, MSSR demonstrates superior rollout sample efficiency, achieving similar validation accuracy with half the training steps. When trained for the same number of steps, MSSR's performance surpasses the group-based baseline and shows consistent generalization improvements across five diverse reasoning-intensive benchmarks. Together, these results demonstrate that MSSR enables stable, sample-efficient, and effective RLVR for complex multimodal reasoning tasks. We will release code and checkpoints upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38694", "url": null, "sourceid": 38888, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38841, "uid": "722eaf15b47c6ce753f1482b352f5c48", "name": "AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation", "authors": [{"id": 182878, "fullname": "Milton Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/182878?format=json", "institution": "Tsinghua University"}, {"id": 180302, "fullname": "Sizhong Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/180302?format=json", "institution": "Tsinghua University"}, {"id": 180766, "fullname": "Yongzhi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180766?format=json", "institution": "Kuaishou"}, {"id": 185056, "fullname": "Quan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185056?format=json", "institution": "Kuaishou"}, {"id": 185057, "fullname": "Peng Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185057?format=json", "institution": "Kuaishou Technology"}], "abstract": "Short-form videos have become a primary medium for digital advertising, requiring scalable and efficient content creation.However, current workflows and AI tools remain disjoint and modality-specific, leading to high production costs and low overall efficiency. To address this issue, we propose AutoCut, an end-to-end video ad editing framework based on multimodal discretization and controllable generation. AutoCut employs dedicated encoders to extract video and audio features, then applies residual vector quantization to discretize them into unified tokens aligned with textual representations, constructing a shared video\u2013audio\u2013text token space. Built upon a foundation model, we further develop a multimodal large language model for intelligent video editing through combined multimodal alignment and supervised fine-tuning, supporting tasks covering video selection and ordering, script generation, and background music selection within a unified editing framework. Finally, a complete production pipeline converts generated tokens into deployable long video outputs. Experiments on real-world advertisement datasets show that AutoCut reduces production cost and iteration time while substantially improving consistency and controllability, paving the way for scalable generative video creation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38841", "url": null, "sourceid": 37287, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36778, "uid": "69c263f51349413d34f878c7f6b06144", "name": "An Empirical Study on How Video-LLMs Answer Videos Questions", "authors": [{"id": 126984, "fullname": "Chenhui Gou", "url": "http://cvpr.thecvf.com/api/miniconf/users/126984?format=json", "institution": "Monash University"}, {"id": 156331, "fullname": "Ziyu Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/156331?format=json", "institution": "Hunan University"}, {"id": 185854, "fullname": "Zicheng Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185854?format=json", "institution": "University of Adelaide"}, {"id": 185855, "fullname": "Haoyu He", "url": "http://cvpr.thecvf.com/api/miniconf/users/185855?format=json", "institution": "Tiktok"}, {"id": 158837, "fullname": "Feng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/158837?format=json", "institution": "The University of Adelaide"}, {"id": 128844, "fullname": "Liyang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128844?format=json", "institution": "University of Adelaide"}, {"id": 128740, "fullname": "Bohan Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128740?format=json", "institution": "Monash University"}, {"id": 126993, "fullname": "Jianfei Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/126993?format=json", "institution": "Monash University"}, {"id": 75933, "fullname": "Hamid Rezatofighi", "url": "http://cvpr.thecvf.com/api/miniconf/users/75933?format=json", "institution": "Monash University"}], "abstract": "Taking advantage of large-scale data and pretrained language models, Video Large Language Models (Video-LLMs) have shown strong capabilities in answering video questions. However, most existing efforts focus on improving performance, with limited attention to understanding their internal mechanisms. This paper aims to bridge this gap through a systematic empirical study. To interpret existing VideoLLMs, we adopt attention knockouts as our primary analytical tool and design three variants: Video Temporal Knockout, Video Spatial Knockout, and Language-to-Video Knockout. Then, we apply these three knockouts on different numbers of layers (window of layers). By carefully controlling the window of layers and types of knockouts, we provide two settings: a global setting and a fine-grained setting. Our study reveals three key findings: (1) Global setting indicates Video information extraction primarily occurs in early layers, forming a clear two-stage process\u2014lower layers focus on perceptual encoding, while higher layers handle abstract reasoning; (2) In the fine-grained setting, certain intermediate layers exert an outsized impact on video question answering, acting as critical outliers, whereas most other layers contribute minimally; (3) In both settings, we observe that spatial-temporal modeling relies more on language-guided retrieval than on intra- and inter-frame self-attention among video tokens, despite the latter\u2019s high computational cost. Finally, we demonstrate that these insights can be leveraged to reduce attention computation in Video-LLMs. To our knowledge, this is the first work to systematically uncover how Video-LLMs internally process and understand video content, offering interpretability and efficiency perspectives for future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36778", "url": null, "sourceid": 31085, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38655, "uid": "4cc79b05efacf681d3c957b92ec08ac2", "name": "ReLaX: Reasoning with Latent Exploration for Large Reasoning Models", "authors": [{"id": 180865, "fullname": "Shimin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180865?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 190398, "fullname": "Xianwei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190398?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 180807, "fullname": "Yufan Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/180807?format=json", "institution": "Zhejiang University"}, {"id": 190399, "fullname": "Ziyuan Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/190399?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 189621, "fullname": "Jibin Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189621?format=json", "institution": "Hong Kong Polytechnic University"}], "abstract": "Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated remarkable potential in enhancing the reasoning capability of Large Reasoning Models (LRMs). However, RLVR often leads to premature policy convergence, resulting in early exploitation and performance saturation. While manipulating token-level entropy has proven effective for promoting early exploration, we argue that the latent dynamics underlying token generation provide richer computational structure for guiding policy optimization. To characterize the nonlinear latent structure of LRM and further facilitate measurement and manipulation on a tractable representation space, we leverage the Koopman operator theory to linearize the hidden state dynamics. We then introduce a new metric, $\\textbf{D}$ynamic $\\textbf{S}$pectral $\\textbf{D}$ispersion ($\\textbf{DSD}$),to quantify the diversity of the model's reasoning dynamics, which also serves as a direct measure of the degree of exploration. Building upon these foundations, we introduce a latent dynamics aware training paradigm, $\\textbf{Re}$asoning with $\\textbf{La}$tent e$\\textbf{X}$ploration ($\\textbf{ReLaX}$), to attain a better balance between exploration and exploitation during policy optimization. With the proposed ReLaX, we achieve state-of-the-art results across $7$ multimodal benchmarks and multidisciplinary reasoning benchmarks. Furthermore, comparative analysis reveals that ReLaX's mechanism of adaptive and semantically meaningful exploration cultivates more structured and robust reasoning than methods that merely optimize for token-level entropy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38655", "url": "https://github.com/ZhangShimin1/ReLaX", "sourceid": 37410, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65735, "file": "/media/PosterPDFs/CVPR%202026/38655.png", "modified": "2026-04-29T17:56:10.412201-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": true, "sortkey": 0, "is_live_content": false, "detailed_kind": "", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65736, "file": "/media/PosterPDFs/CVPR%202026/38655-thumb.png", "modified": "2026-04-23T20:33:05.574147-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": false, "sortkey": 0, "is_live_content": false, "detailed_kind": "thumb", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65721, "modified": "2026-04-22T05:30:41.593075-07:00", "display_section": 1, "type": "PDF", "name": "Slides", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "/media/cvpr-2026/Slides/38655.pdf", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39222, "uid": "e9356c402558dcf285db53208880d47e", "name": "ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence On Mobile Devices", "authors": [{"id": 182126, "fullname": "Dezhi Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/182126?format=json", "institution": "Xiaomi Corporation"}, {"id": 191622, "fullname": "Zhengzhao Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191622?format=json", "institution": "Zhejiang University"}, {"id": 191624, "fullname": "Qiliang Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191624?format=json", "institution": "Peking University"}, {"id": 177265, "fullname": "Wang Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/177265?format=json", "institution": "Xiaomi Corporation"}, {"id": 191625, "fullname": "haofei Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/191625?format=json", "institution": "\u6e56\u5357\u6587\u7406\u5b66\u9662"}, {"id": 185210, "fullname": "Changpeng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185210?format=json", "institution": "Xiaomi Corporation"}, {"id": 182318, "fullname": "Yang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182318?format=json", "institution": "Xiaomi Corporation"}, {"id": 191626, "fullname": "Peng Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191626?format=json", "institution": "Xiaomi Corporation"}, {"id": 191627, "fullname": "Shuai Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/191627?format=json", "institution": "Xiaomi Corporation"}, {"id": 185208, "fullname": "Hongzhen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185208?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 191628, "fullname": "Linfeng Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191628?format=json", "institution": "Northeastern University"}, {"id": 191629, "fullname": "Hao Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/191629?format=json", "institution": "Xiaomi Corporation"}, {"id": 185212, "fullname": "Jiaming Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185212?format=json", "institution": "Xiaomi Corporation"}, {"id": 185213, "fullname": "Runyu Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185213?format=json", "institution": "Xiaomi Corporation"}, {"id": 185214, "fullname": "Ying Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185214?format=json", "institution": "Xiaomi Corporation"}], "abstract": "Multimodal large language models (MLLMs) have made significant progress in mobile agent development, yet their capabilities are predominantly confined to a reactive paradigm, where they merely execute explicit user commands.The emerging paradigm of \\textbf{proactive intelligence}, where agents autonomously anticipate needs and initiate actions, represents the next frontier for mobile agents. However, its development is critically bottlenecked by the lack of benchmarks that can address real-world complexity and enable objective, executable evaluation. To overcome these challenges, we introduce \\textbf{ProactiveMobile}, a comprehensive benchmark designed to systematically advance research in this domain. ProactiveMobile formalizes the proactive task as inferring latent user intent across four dimensions of on-device contextual signals and generating an executable function sequence from a comprehensive function pool of 63 APIs. The benchmark features over 3,660 instances of 14 scenarios that embrace real-world complexity through multi-answer annotations. To ensure quality, a team of 30 experts conducts a final audit of the benchmark, verifying factual accuracy, logical consistency, and action feasibility, and correcting any non-compliant entries.Extensive experiments demonstrate that our fine-tuned Qwen2.5-VL-7B-Instruct achieves a success rate of 19.15\\%, outperforming o1 (15.71\\%) and GPT-5 (7.39\\%). This result indicates that proactivity is a critical competency widely lacking in current MLLMs, yet it is learnable, emphasizing the importance of the proposed benchmark for proactivity evaluation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39222", "url": null, "sourceid": 41972, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36534, "uid": "48b928005eb587644756f16e5705e3f0", "name": "SunFaded: Illumination-Aware Gaussian Splatting for Dark Scenes with Camera-Mounted Active Lighting", "authors": [{"id": 77347, "fullname": "Wenjie Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77347?format=json", "institution": "University of Science and Technology of China"}, {"id": 185297, "fullname": "Tianle Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/185297?format=json", "institution": "University of Science and Technology of China"}, {"id": 88062, "fullname": "Wenfei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88062?format=json", "institution": "University of Science and Technology of China"}, {"id": 85977, "fullname": "Tianzhu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85977?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Gaussian Splatting has emerged as a popular 3D representation technique, but still struggles with appearance inconsistencies, especially in dark scenes that require active illumination (e.g., camera flashes or co-moving light sources) to capture usable images, leading to dramatic local appearance fluctuations.While existing methods mainly focus on modeling global appearance changes for in-the-wild scenes, such as those caused by different times of day or weather conditions, they fail to handle the severe variations present in dark scenes with moving light sources.In this paper, we propose a novel Gaussian Splatting\u2013based approach for constructing scene representations in dark scenes where active light sources are rigidly attached to the camera and move together with it.Within this framework, we introduce an illumination-weighted loss function that drives the representation toward the underlying unlit scene. Furthermore, instead of adjusting the illumination of each individual Gaussian as in prior work, we employ a tile-based shading scheme that operates directly on the rendered images, greatly reducing computational cost while explicitly separating illumination from intrinsic scene appearance.Additionally, we further refine the learned Gaussian representation by combining the recovered unlit scene appearance with an advanced geometric prior model, which significantly improves geometric accuracy.Experimental results demonstrate that our method achieves superior reconstruction quality in challenging environments compared to state-of-the-art techniques.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36534", "url": null, "sourceid": 38945, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37103, "uid": "c8e48afc4b04bd84c86db33b225828a4", "name": "Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR", "authors": [{"id": 186662, "fullname": "Yulong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186662?format=json", "institution": "Fudan University; Shanghai; Shanghai Innovation Institute; Shanghai Jiaotong University"}, {"id": 180377, "fullname": "Tianyi Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180377?format=json", "institution": "Shanghai Innovation Institute, OpenMOSS Team, Fudan University NLP Lab"}, {"id": 186663, "fullname": "Erfei Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/186663?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 186664, "fullname": "Guoqing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186664?format=json", "institution": "East China Normal University"}, {"id": 142670, "fullname": "Xu Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/142670?format=json", "institution": "Fudan University; Renmin University of China; Virginia Polytechnic Institute and State University; Tsinghua University, Tsinghua University"}, {"id": 157378, "fullname": "Chenhui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/157378?format=json", "institution": "East China Normal University"}, {"id": 178500, "fullname": "Gongshen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/178500?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Optical Character Recognition (OCR) is fundamental to Vision-Language Models (VLMs) and high-quality data generation for LLM training. Yet, despite progress in average OCR accuracy, state-of-the-art VLMs still struggle with detecting sample-level errors and lack effective unsupervised quality control.We introduce Consensus Entropy (CE), a training-free, model-agnostic metric that estimates output reliability by measuring inter-model agreement entropy. The core insight is that correct predictions converge in output space, while errors diverge.  Based on CE, we develop CE-OCR, a lightweight multi-model framework that verifies outputs by ensemble agreement, selects the best outputs, and further improves efficiency through adaptive routing. Experiments demonstrate that CE is robust for quality verification, improving F1 scores by 42.1\\% over VLM-as-Judge. CE-OCR achieves consistent OCR gains, outperforming self-consistency and single-model baselines at the same cost. Notably, CE requires no training or supervision, enabling plug-and-play integration.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37103", "url": "https://github.com/Aslan-yulong/consensus-entropy", "sourceid": 32377, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40025, "uid": "8d0847366a208968be344b7c3e595291", "name": "ComPose: A Unified Completion-Pose Framework for Robust Category-Level Object Pose Estimation", "authors": [{"id": 88038, "fullname": "Huan Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/88038?format=json", "institution": "University of Science and Technology of China"}, {"id": 144463, "fullname": "Yihan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/144463?format=json", "institution": "University of Science and Technology of China"}, {"id": 183623, "fullname": "Chuxin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183623?format=json", "institution": "University of Science and Technology of China"}, {"id": 178536, "fullname": "Nailong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/178536?format=json", "institution": "Beijing Institute of Control Engineering"}, {"id": 88062, "fullname": "Wenfei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88062?format=json", "institution": "University of Science and Technology of China"}, {"id": 85977, "fullname": "Tianzhu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85977?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Category-level object pose estimation aims to predict the pose and size of arbitrary objects in specific categories. Existing methods struggle with the inherent incompleteness of observed point clouds, which limits their ability to capture complete object shapes for robust pose reasoning. While point cloud completion offers a promising solution, naively treating it as a separate preprocessing step for partial observations introduces compounding errors and additional computational overhead, ultimately hindering both accuracy and efficiency.To address these challenges, we propose ComPose, a novel unified framework that tightly integrates shape completion to provide complete geometric cues for enhanced pose estimation. At the core of ComPose is a keypoint-based progressive completion module, which recovers full shape representations by progressively predicting a sparse set of keypoints and their surrounding dense point sets, empowering the keypoints to capture holistic object geometries. A geometric relation encoding module further enriches keypoint features with both local and global geometric context. In addition, we introduce a novel geometric relation consistency loss to enforce structural alignment between observed keypoints and their predicted NOCS coordinates, ensuring globally coherent coordinate transformations.Extensive experiments on standard benchmarks demonstrate that our method outperforms state-of-the-art approaches without relying on category-level shape priors. Our method pioneers a new direction for future research by effectively and efficiently integrating shape completion into category-level object pose estimation. Code will be open.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40025", "url": null, "sourceid": 40534, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40391?format=json"], "related_events_ids": [40391]}, {"id": 38289, "uid": "81cee3cac73f7edc78814055d7236f4e", "name": "PoseGaussian: 6D Pose Estimation for Unseen Objects via Sparse-View Object-Level 3D Gaussian Splatting", "authors": [{"id": 146058, "fullname": "Wubin Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/146058?format=json", "institution": "Southeast University"}, {"id": 189508, "fullname": "Shaoyan Gai", "url": "http://cvpr.thecvf.com/api/miniconf/users/189508?format=json", "institution": "Southeast University"}, {"id": 90684, "fullname": "Feipeng Da", "url": "http://cvpr.thecvf.com/api/miniconf/users/90684?format=json", "institution": "Southeast University"}], "abstract": "6D pose estimation is a key technology in computer vision and robotic manipulation. However, many methods remain heavily dependent on CAD models that are difficult to obtain. Object-level 3D reconstruction provides an alternative route, and 3D Gaussian Splatting (3DGS) shows convincing potential owing to its training and rendering efficiency. Nevertheless, under sparse reference views, 3DGS is prone to floating artifacts and appearance overfitting, which weakens the stability of pose estimation. We present PoseGaussian, a method for sparse-view 6D pose estimation for unseen object that builds on improved 3DGS. First, we use sparse RGB-D views to inject a depth structure prior into the 3DGS initialization for stable structure, and we adopt adaptive density control, view-warping augmentation, and joint photometric\u2013depth supervision to reduce floaters and appearance overfitting under sparse reference views. Next, in the pose estimation stage, we apply a two-stage learning-guided ICP initializer that exploits geometric features to obtain a stable initial pose. Finally, we introduce a 3DGS-based iterative pose refiner that aligns rendered and query images in both appearance and geometry, further improving pose estimation accuracy. Experiments on LINEMOD, GenMOP, and our real-world datasets show that PoseGaussian achieves significant improvements over baseline methods under model-free and sparse-view settings, demonstrating strong generalization to unseen objects and robustness to view sparsity.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38289", "url": null, "sourceid": 37973, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37527, "uid": "49dda3f3dd880fba41cb7a7d74bd08bd", "name": "LiREC-Net: A Target-Free and Learning-Based Network for LiDAR, RGB, and Event Calibration", "authors": [{"id": 187649, "fullname": "Aditya Ranjan Dash", "url": "http://cvpr.thecvf.com/api/miniconf/users/187649?format=json", "institution": "RPTU Kaiserslautern"}, {"id": 181310, "fullname": "Ramy Battrawy", "url": "http://cvpr.thecvf.com/api/miniconf/users/181310?format=json", "institution": "German Research Center for Artificial Intelligence DFKI"}, {"id": 187650, "fullname": "Ren\u00e9 Schuster", "url": "http://cvpr.thecvf.com/api/miniconf/users/187650?format=json", "institution": "German Research Center for Artificial Intelligence (DFKI)"}, {"id": 89910, "fullname": "Didier Stricker", "url": "http://cvpr.thecvf.com/api/miniconf/users/89910?format=json", "institution": "Universit\u00e4t Kaiserslautern"}], "abstract": "Advanced autonomous systems rely on multi-sensor fusion for safer and more robust perception. To enable effective fusion, calibrating directly from natural driving scenes (i.e., target-free) with high accuracy is crucial for precise multi-sensor alignment. Existing learning-based calibration methods are typically designed for only a single pair of sensor modalities (i.e., a bi-modal setup). Unlike these methods, we propose LiREC-Net, a target-free, learning-based calibration network that jointly calibrates multiple sensor modality pairs, including LiDAR, RGB, and event data, within a unified framework. To reduce redundant computation and improve efficiency, we introduce a shared LiDAR representation that leverages features from both its 3D nature and projected depth map, ensuring better consistency across modalities. Trained and evaluated on established datasets, such as KITTI and DSEC, our LiREC-Net achieves competitive performance to bi-modal models and sets a new strong baseline for the tri-modal use case.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37527", "url": null, "sourceid": 39092, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38021, "uid": "5e7ce02afd479a1ff12bc405e3af182a", "name": "A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens", "authors": [{"id": 134194, "fullname": "Tommie Kerssies", "url": "http://cvpr.thecvf.com/api/miniconf/users/134194?format=json", "institution": "Eindhoven University of Technology"}, {"id": 163745, "fullname": "Gabriele Berton", "url": "http://cvpr.thecvf.com/api/miniconf/users/163745?format=json", "institution": "Amazon"}, {"id": 188843, "fullname": "Ju He", "url": "http://cvpr.thecvf.com/api/miniconf/users/188843?format=json", "institution": "Amazon"}, {"id": 88900, "fullname": "Qihang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88900?format=json", "institution": "Johns Hopkins University"}, {"id": 89979, "fullname": "Wufei Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/89979?format=json", "institution": "Johns Hopkins University"}, {"id": 89339, "fullname": "Daan de Geus", "url": "http://cvpr.thecvf.com/api/miniconf/users/89339?format=json", "institution": "Eindhoven University of Technology"}, {"id": 89343, "fullname": "Gijs Dubbelman", "url": "http://cvpr.thecvf.com/api/miniconf/users/89343?format=json", "institution": "Eindhoven University of Technology"}, {"id": 94929, "fullname": "Liang-Chieh Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/94929?format=json", "institution": "Amazon Frontier AI &amp; Robotics"}], "abstract": "Anticipating diverse future states is a central challenge in video world modeling. A key limitation lies in the computational cost of generating multiple plausible futures with existing world models. Recent work demonstrates that predicting the future in the latent space of a vision foundation model (VFM), rather than in raw pixel space, greatly improves efficiency. Despite this progress, efficient VFM-based world models are still predominantly discriminative, producing predictions that implicitly average over many possible futures. To explicitly and efficiently model diverse plausible futures, we introduce DeltaWorld, the first VFM-based world model which shifts from deterministic prediction to the ability to generate multiple plausible futures in a single forward pass. At the core of DeltaWorld is DeltaTok, a tokenizer that encodes feature differences between consecutive frames into a single compact \u201cdelta\u201d token, effectively reducing redundancy among temporally adjacent feature maps. By representing futures as delta tokens, DeltaWorld efficiently generates multiple diverse predictions in parallel. Experiments on dense forecasting tasks demonstrate that DeltaWorld is capable of predicting futures that more closely align with real-world outcomes, while being orders of magnitude more efficient than existing generative world models. Code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38021", "url": null, "sourceid": 33518, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38351, "uid": "e03814c1d555e06aa5d562d95ed29bd2", "name": "Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation", "authors": [{"id": 182231, "fullname": "Xin Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/182231?format=json", "institution": "University of California, San Diego"}, {"id": 183901, "fullname": "Meixi Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/183901?format=json", "institution": "Tsinghua University"}, {"id": 189052, "fullname": "Dizhe Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189052?format=json", "institution": "insta360"}, {"id": 189677, "fullname": "Wenxuan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189677?format=json", "institution": ""}, {"id": 189678, "fullname": "Haodong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189678?format=json", "institution": "University of California San Diego"}, {"id": 84747, "fullname": "Bo Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/84747?format=json", "institution": "Wuhan University"}, {"id": 75715, "fullname": "Ming-Hsuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75715?format=json", "institution": "University of California at Merced"}, {"id": 128850, "fullname": "Truong Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/128850?format=json", "institution": "University of California, San Diego"}, {"id": 188215, "fullname": "Lu Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188215?format=json", "institution": "Insta360; Wuhan University"}], "abstract": "In this work, we present a panoramic metric depth foundation model that generalizes across diverse scene distances. We explore a data-in-the-loop paradigm from the view of both data construction and framework design. We collect a large-scale dataset by combining public datasets, high-quality synthetic data from our UE5 simulator and text-to-image models, and real panoramic images from the web. To reduce domain gaps between indoor/outdoor and synthetic/real data, we introduce a three-stage pseudo-label curation pipeline to generate reliable ground truth for unlabeled images. For the model, we adopt DINOv3-Large as the backbone for its strong pre-trained generalization, and introduce a plug-and-play range mask head, sharpness-centric optimization, and geometry-centric optimization to improve robustness to varying distances and enforce geometric consistency across views. Experiments on multiple benchmarks (e.g., Stanford2D3D, Matterport3D, and Deep360) demonstrate strong performance and zero-shot generalization, with particularly robust and stable metric predictions in diverse real-world scenes. Code and models will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38351", "url": null, "sourceid": 31759, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39760, "uid": "9ec04091466e231d20d349c52a689a21", "name": "SPAN: Spatial-Projection Alignment for Monocular 3D Object Detection", "authors": [{"id": 152933, "fullname": "Yifan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152933?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 89465, "fullname": "Yian Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/89465?format=json", "institution": "Peking University"}, {"id": 152932, "fullname": "Fanqi Pu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152932?format=json", "institution": "Tsinghua University"}, {"id": 188719, "fullname": "Xiaochen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188719?format=json", "institution": "University of Glasgow"}, {"id": 181272, "fullname": "YANG TANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/181272?format=json", "institution": "Tencent"}, {"id": 190534, "fullname": "Xi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190534?format=json", "institution": null}, {"id": 87900, "fullname": "Wenming Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87900?format=json", "institution": "Tsinghua University,"}], "abstract": "Existing monocular 3D detectors typically tame the pronounced nonlinear regression of 3D bounding box through decoupled prediction paradigm, which employs multiple branches to estimate geometric center, depth, dimensions, and rotation angle separately.Although this decoupling strategy simplifies the learning process, it inherently ignores the geometric collaborative constraints between different attributes, resulting in the lack of geometric consistency prior, thereby leading to suboptimal performance. To address this issue, we propose novel **S**patial-**P**rojection **A**lig**n**ment (**SPAN**) with two pivotal components: (i). ***Spatial Point Alignment*** enforces an explicit global spatial constraint between the predicted and ground\u2011truth 3D bounding boxes, thereby rectifying spatial drift caused by decoupled attribute regression. (ii). ***3D-2D Projection Alignment*** ensures that the projected 3D box is aligned tightly within its corresponding 2D detection bounding box on the image plane, mitigating projection misalignment overlooked in previous works. To ensure training stability, we further introduce a ***Hierarchical Task Learning*** strategy that progressively incorporates spatial-projection alignment as 3D attribute predictions refine, preventing early stage error propagation across attributes. Extensive experiments demonstrate that the proposed method can be easily integrated into any established monocular 3D detector and delivers significant performance improvements.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39760", "url": "https://wyfdut.github.io/SPAN/", "sourceid": 43018, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65751, "file": "/media/PosterPDFs/CVPR%202026/39760-thumb.png", "modified": "2026-04-29T20:13:35.743041-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": false, "sortkey": 0, "is_live_content": false, "detailed_kind": "thumb", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65750, "file": "/media/PosterPDFs/CVPR%202026/39760.png", "modified": "2026-04-29T20:13:35.524877-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": true, "sortkey": 0, "is_live_content": false, "detailed_kind": "", "generated_from": null, "resourcetype": "EventmediaImageFile"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38703, "uid": "64a2a4ca55db5ee37e2b1fc46efcdf17", "name": "3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation", "authors": [{"id": 183175, "fullname": "Zhixue Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183175?format=json", "institution": "Beijing Dajia Internet Information Technology Co., Ltd."}, {"id": 187740, "fullname": "Xu He", "url": "http://cvpr.thecvf.com/api/miniconf/users/187740?format=json", "institution": "Tsinghua University"}, {"id": 190492, "fullname": "Songlin Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190492?format=json", "institution": "Kuaishou"}, {"id": 87626, "fullname": "Haoxian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87626?format=json", "institution": "Peking University"}, {"id": 177007, "fullname": "Qingfeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/177007?format=json", "institution": "Kuaishou- \u5feb\u624b\u79d1\u6280; Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 156752, "fullname": "Xiaoqiang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156752?format=json", "institution": "Kuaishou"}, {"id": 134947, "fullname": "Pengfei Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/134947?format=json", "institution": "Kuaishou Technology"}, {"id": 156268, "fullname": "Kun Gai", "url": "http://cvpr.thecvf.com/api/miniconf/users/156268?format=json", "institution": "Kuaishou- \u5feb\u624b\u79d1\u6280"}], "abstract": "Existing methods for human motion control in video generation typically rely on either 2D poses or explicit 3D parametric models (e.g., SMPL) as control signals. However, 2D poses rigidly bind motion to the driving viewpoint, precluding novel-view synthesis. Explicit 3D models, though structurally informative, suffer from inherent inaccuracies (e.g., depth ambiguity and inaccurate dynamics) which, when used as a strong constraint, override the powerful intrinsic 3D awareness of large-scale video generators. In this work, we revisit motion control from a 3D-aware perspective, advocating for an implicit, view-agnostic motion representation that naturally aligns with the generator's spatial priors rather than depending on externally reconstructed constraints. We introduce 3DiMo, which jointly trains a motion encoder with a pretrained video generator to distill driving frames into compact, view-agnostic motion tokens, injected semantically via cross-attention. To foster 3D awareness, we train with view-rich supervision\u2014single-view, multi-view, and moving-camera videos\u2014forcing motion consistency across diverse viewpoints.  Additionally, we use auxiliary geometric supervision that leverages SMPL only for early initialization and is annealed to zero, enabling the model to transition from external 3D guidance to learning genuine 3D spatial motion understanding from the data and the generator's priors. Experiments confirm that 3DiMo faithfully reproduces driving motions with flexible, text-driven camera control, significantly surpassing existing methods in both motion fidelity and visual quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38703", "url": null, "sourceid": 36484, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37949, "uid": "99e07de9eb99b6a7122844b97af74796", "name": "UARE: A Unified Vision-Language Model for Image Quality Assessment, Restoration, and Enhancement", "authors": [{"id": 91776, "fullname": "Weiqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/91776?format=json", "institution": "Peking University"}, {"id": 70310, "fullname": "Xuanyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70310?format=json", "institution": "Peking University"}, {"id": 151874, "fullname": "Bin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/151874?format=json", "institution": "Peking University"}, {"id": 188657, "fullname": "Jingfen Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/188657?format=json", "institution": "ByteDance Inc."}, {"id": 107035, "fullname": "Yan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107035?format=json", "institution": "Nankai University"}, {"id": 188658, "fullname": "Kexin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188658?format=json", "institution": "ByteDance Inc."}, {"id": 130240, "fullname": "Junlin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/130240?format=json", "institution": "ByteDance Inc."}, {"id": 130296, "fullname": "Li zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130296?format=json", "institution": "Bytedance Inc."}, {"id": 76749, "fullname": "Jian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76749?format=json", "institution": "Peking University"}, {"id": 130256, "fullname": "Shijie Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/130256?format=json", "institution": "ByteDance Inc."}], "abstract": "Image quality assessment (IQA) and image restoration are fundamental problems in low-level vision. Although IQA and restoration are closely connected conceptually, most existing work treats them in isolation. Recent advances in unified multimodal understanding-generation models demonstrate promising results and indicate that stronger understanding can improve generative performance. This motivates a single model that unifies IQA and restoration and explicitly studies how IQA can guide restoration, a setting that remains largely underexplored yet highly valuable. In this paper, we propose UARE, to our knowledge the first Unified vision-language model for image quality Assessment, Restoration, and Enhancement. Built on pretrained unified understanding and generation models, we introduce a two-stage training framework. First, a progressive, easy-to-hard schedule expands from single-type distortions to higher-order mixed degradations, enabling UARE to handle multiple degradations. Second, we perform unified fine-tuning of quality understanding and restoration with interleaved text-image data, aligning IQA signals with restoration objectives. Through multi-task co-training, UARE leverages IQA to boost restoration and enhancement performance. Extensive experiments across IQA, restoration, and enhancement tasks demonstrate the effectiveness of UARE. The code and models will be made available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37949", "url": null, "sourceid": 39875, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40132, "uid": "078d13da75a67637473e543d53ceca49", "name": "Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing", "authors": [{"id": 76370, "fullname": "Baifeng Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/76370?format=json", "institution": "University of California Berkeley"}, {"id": 126401, "fullname": "Stephanie Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126401?format=json", "institution": "University of California, Berkeley"}, {"id": 89036, "fullname": "Long Lian", "url": "http://cvpr.thecvf.com/api/miniconf/users/89036?format=json", "institution": "University of California, Berkeley"}, {"id": 191164, "fullname": "Hanrong Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/191164?format=json", "institution": "NVIDIA"}, {"id": 193602, "fullname": "David Eigen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193602?format=json", "institution": "-"}, {"id": 141305, "fullname": "Aaron Reite", "url": "http://cvpr.thecvf.com/api/miniconf/users/141305?format=json", "institution": "NVIDIA"}, {"id": 73960, "fullname": "Jan Kautz", "url": "http://cvpr.thecvf.com/api/miniconf/users/73960?format=json", "institution": "NVIDIA"}, {"id": 105890, "fullname": "Boyi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/105890?format=json", "institution": "UC Berkeley / NVIDIA"}, {"id": 130672, "fullname": "David Chan", "url": "http://cvpr.thecvf.com/api/miniconf/users/130672?format=json", "institution": "University of California Berkeley"}, {"id": 86710, "fullname": "Trevor Darrell", "url": "http://cvpr.thecvf.com/api/miniconf/users/86710?format=json", "institution": "Electrical Engineering &amp; Computer Science Department"}, {"id": 73958, "fullname": "Pavlo Molchanov", "url": "http://cvpr.thecvf.com/api/miniconf/users/73958?format=json", "institution": "NVIDIA"}, {"id": 73956, "fullname": "Danny Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/73956?format=json", "institution": "NVIDIA"}], "abstract": "Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos---they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that reconstructs the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4x-100x and accelerates ViTs and MLLMs by up to 19x, enabling scaling MLLMs to 1K-frame 4K-resolution videos and achieving superior results on video benchmarks (e.g., 66.5% on VideoMME). Furthermore, we introduce HLVid: the first high-resolution, long-form video QA benchmark with multi-minute 4K videos, where an MLLM scaled with AutoGaze outperform the previous SOTA MLLM by 6.3%.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40132", "url": null, "sourceid": 33645, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36996, "uid": "1d782ef33c1a154781d25b9a4ef174a1", "name": "DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces", "authors": [{"id": 151498, "fullname": "Li Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151498?format=json", "institution": "University of Science and Technology of China"}, {"id": 186417, "fullname": "Mingyu Mei", "url": "http://cvpr.thecvf.com/api/miniconf/users/186417?format=json", "institution": "Zhejiang University"}, {"id": 186418, "fullname": "Ailing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186418?format=json", "institution": "East China Normal University"}, {"id": 186419, "fullname": "Xianhui Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186419?format=json", "institution": "University of Science and Technology of China"}, {"id": 186420, "fullname": "Yan Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186420?format=json", "institution": "Peking University"}, {"id": 186421, "fullname": "Xinyuan Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/186421?format=json", "institution": "Emory University"}, {"id": 156603, "fullname": "Liu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156603?format=json", "institution": "Hefei University of Technology"}, {"id": 156602, "fullname": "RujingWang RujingWang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156602?format=json", "institution": "Chinese Academy Sciences"}, {"id": 186422, "fullname": "Zaixing He", "url": "http://cvpr.thecvf.com/api/miniconf/users/186422?format=json", "institution": "Zhejiang University"}, {"id": 86440, "fullname": "Cewu Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86440?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Articulated object pose estimation is a core task in embodied AI and computer vision. Existing methods typically regress poses in a continuous space, but often struggle with 1) navigating a large, complex search space and 2) failing to incorporate intrinsic kinematic constraints. In this paper, we introduce DICArt (DIsCrete Diffusion for Articulated Object Pose Estimation), a novel framework that formulates pose estimation as a conditional discrete diffusion process. Instead of operating in a continuous domain, DICArt progressively denoises a noisy pose representation through a learned reverse diffusion procedure to recover the ground-truth pose.To improve modeling fidelity, we propose a flexible flow decider that dynamically determines whether each token should be denoised or reset, effectively balancing the real and noise distributions during diffusion. Additionally, we incorporate a hierarchical kinematic coupling strategy, estimating the pose of each rigid part hierarchically to respect the object's kinematic structure.We validate DICArt on both synthetic and real-world datasets with multi-hinged articulated objects. Experimental results demonstrate its superior performance and robustness over state-of-the-art baselines. By integrating discrete generative modeling with structural priors, DICArt offers a new paradigm for reliable category-level 6D pose estimation in complex environments. Codewill be publicly available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36996", "url": null, "sourceid": 46157, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38783, "uid": "60ac4f31ab5b2ee3e0372030eb2a19de", "name": "When to Think and When to Look: Uncertainty-Guided Lookback", "authors": [{"id": 149423, "fullname": "Jing Bi", "url": "http://cvpr.thecvf.com/api/miniconf/users/149423?format=json", "institution": "University of Rochester"}, {"id": 97456, "fullname": "Filippos Bellos", "url": "http://cvpr.thecvf.com/api/miniconf/users/97456?format=json", "institution": "University of Michigan - Ann Arbor"}, {"id": 190672, "fullname": "JunJia Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/190672?format=json", "institution": null}, {"id": 97232, "fullname": "Yayuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/97232?format=json", "institution": "University of Michigan, Ann Arbor"}, {"id": 90333, "fullname": "Chao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90333?format=json", "institution": "Department of Computer Science, University of Rochester"}, {"id": 152930, "fullname": "Yunlong Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152930?format=json", "institution": "University of Rochester"}, {"id": 96787, "fullname": "Luchuan Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/96787?format=json", "institution": "University of Rochester"}, {"id": 127813, "fullname": "Susan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127813?format=json", "institution": "University of Rochester"}, {"id": 190673, "fullname": "Zhongfei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190673?format=json", "institution": "State University of New York, Binghamton"}, {"id": 95802, "fullname": "Jason Corso", "url": "http://cvpr.thecvf.com/api/miniconf/users/95802?format=json", "institution": "Voxel51; University of Michigan"}, {"id": 90334, "fullname": "Chenliang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90334?format=json", "institution": "University of Rochester"}], "abstract": "Test-time \u201cthinking\u201d (i.e., generating explicit intermediate reasoning chains) is known to boost performance in large language models and has recently shown strong gains for large vision\u2013language models (LVLMs). However, despite these promising results, there is still no systematic analysis of how thinking actually affects visual reasoning. We provide the first such analysis with a large-scale, controlled comparison of thinking for LVLMs, evaluating 10 variants from the InternVL3.5 and Qwen3-VL families on MMMUval under generous token budgets and multi-pass decoding. We show that more thinking is not always better: long chains often yield long-wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. A deeper analysis reveals that certain short \u201clookback\u201d phrases, which explicitly refer back to the image, are strongly enriched in successful trajectories and correlate with better visual grounding. Building on this insight, we propose uncertainty-guided lookback, a training-free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. Our method improves overall MMMU performance, delivers the largest gains in categories where standard thinking is weak, and outperforms several strong decoding baselines, setting a new state of the art under fixed model families and token budgets. We further show that this decoding strategy generalizes, yielding consistent improvements on five additional benchmarks, including two broad multimodal suites and math-focused visual reasoning datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38783", "url": null, "sourceid": 39198, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37756, "uid": "cb6e99dc32f0a791b9cf73b03a739277", "name": "InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior", "authors": [{"id": 187569, "fullname": "Weimin Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/187569?format=json", "institution": "Peking University"}, {"id": 180245, "fullname": "Suzhe Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180245?format=json", "institution": "Huaqiao University"}, {"id": 188175, "fullname": "Yiwei Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/188175?format=json", "institution": "Peking University"}, {"id": 127480, "fullname": "Jinhua Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/127480?format=json", "institution": "Kuaishou Tech"}, {"id": 89506, "fullname": "Ming Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/89506?format=json", "institution": "Kuaishou Tech"}, {"id": 137984, "fullname": "Wenzheng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/137984?format=json", "institution": "Peking University"}, {"id": 73505, "fullname": "He Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/73505?format=json", "institution": "Peking University"}], "abstract": "Video inverse problems such as inpainting, deblurring and super-resolution are fundamental to streaming, telepresence, and AR/VR, where high perceptual quality must coexist with tight latency constraints. Diffusion-based priors currently deliver state-of-the-art reconstructions, but existing approaches either adapt image diffusion models with ad hoc temporal regularizers\u2014leading to temporal artifacts\u2014or rely on native video diffusion models whose iterative posterior sampling is far too slow for real-time use. We introduce InstantViR, an amortized inference framework for ultra-fast video reconstruction powered by a pre-trained video diffusion prior. We distill a powerful bidirectional video diffusion model (teacher) into a causal autoregressive student that maps a degraded video directly to its restored version in a single forward pass, inheriting the teacher\u2019s strong temporal modeling while completely removing iterative test-time optimization. The distillation is prior-driven: it only requires the teacher diffusion model and known degradation operators, and does not rely on externally paired clean/noisy video data. To further boost throughput, we replace the standard VAE in video diffusion backbone with a highly efficient LeanVAE, enabling low-latency latent-space processing. Across streaming random inpainting, Gaussian deblurring and super-resolution, InstantViR matches or surpasses the reconstruction quality of diffusion-based baselines while running at over 35 FPS on NVIDIA A100 GPUs, achieving up to 100x speedups over iterative video diffusion priors. These results show that diffusion-based video reconstruction is compatible with real-time, interactive, editable, streaming scenarios, turning high-quality video restoration into a practical component of modern vision systems.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37756", "url": null, "sourceid": 38692, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36973, "uid": "1457f5bbc8999c178f82c02428a6308f", "name": "MV-RoMa: From Pairwise Matching into Multi-View Track Reconstruction", "authors": [{"id": 153168, "fullname": "JongMin Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/153168?format=json", "institution": "Seoul National University"}, {"id": 85670, "fullname": "Seungyeop Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85670?format=json", "institution": "Seoul National University"}, {"id": 76208, "fullname": "Sungjoo Yoo", "url": "http://cvpr.thecvf.com/api/miniconf/users/76208?format=json", "institution": "Seoul National University"}], "abstract": "Establishing consistent correspondences across images is essential for 3D vision tasks such as structure-from-motion (SfM), yet most existing matchers operate in a pairwise manner, often producing fragmented and geometrically inconsistent tracks when their predictions are chained across views. We propose \\textbf{MV-RoMa}, a multi-view dense matching model that jointly estimates dense correspondences from a source image to multiple co-visible targets. Specifically, we design an efficient model architecture which avoids high computational cost of full cross-attention for multi-view feature interaction: (i) multi-view encoder that leverages pair-wise matching results as a geometric prior, and (ii) multi-view matching refiner that refines correspondences using pixel-wise attention. Additionally, we propose a post-processing strategy that integrates our model's consistent multi-view correspondences as high-quality tracks for SfM. Across diverse and challenging benchmarks, MV-RoMa produces more reliable correspondences and substantially denser, more accurate 3D reconstructions than existing sparse and dense matching methods.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36973", "url": null, "sourceid": 39072, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37843, "uid": "b0a054a7a795030eaaa7c53a035a4925", "name": "Language-driven Fine-grained Retrieval", "authors": [{"id": 88175, "fullname": "Shijie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88175?format=json", "institution": "University of Queensland"}, {"id": 183016, "fullname": "Xin Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183016?format=json", "institution": "Adelaide University"}, {"id": 158034, "fullname": "Yadan Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/158034?format=json", "institution": "The University of Queensland"}, {"id": 184867, "fullname": "Zijian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184867?format=json", "institution": "The University of Queensland"}, {"id": 188396, "fullname": "Peng-Fei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188396?format=json", "institution": "University of Queensland"}, {"id": 90777, "fullname": "Zi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90777?format=json", "institution": "University of Queensland"}], "abstract": "Existing fine-grained image retrieval (FGIR) methods learn discriminative embeddings by adopting semantically sparse one-hot labels derived from category names as supervision. While effective on seen classes, such supervision overlooks the rich semantics encoded in category names, hindering the modeling of comparability among cross-category details and, in turn, limiting generalization to unseen categories. To tackle this, we introduce LaFG, a Language-driven framework for Fine-Grained Retrieval that converts class names into attribute-level supervision using large language models (LLMs) and vision\u2013language models (VLMs). Treating each name as a semantic anchor, LaFG prompts an LLM to generate detailed, attribute-oriented descriptions. To mitigate attribute omission in these descriptions, it leverages a frozen VLM to project them into a vision-aligned space, clustering them into a dataset-wide attribute vocabulary while harvesting complementary attributes from related categories. Leveraging this vocabulary, a global prompt template selects category-relevant attributes, which are aggregated into category-specific linguistic prototypes. These prototypes supervise the retrieval model to steer it toward pinpointing visual details consistent with linguistic descriptions, thus modeling comparability among object details. Extensive evaluations show that LaFG achieves impressive performance on both fine- and coarse-grained benchmarks and generalizes well to unseen categories.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37843", "url": null, "sourceid": 46523, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36839, "uid": "a8225fc412044a06c16cf142c105e33d", "name": "SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models", "authors": [{"id": 185992, "fullname": "Senyu Fei", "url": "http://cvpr.thecvf.com/api/miniconf/users/185992?format=json", "institution": "Tongji University &amp; Shanghai Innovation Institute"}, {"id": 185993, "fullname": "Siyin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185993?format=json", "institution": "Fudan University"}, {"id": 185994, "fullname": "Li Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/185994?format=json", "institution": "Fudan University"}, {"id": 185995, "fullname": "Ao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185995?format=json", "institution": "Shanghai Innovation Institute"}, {"id": 185996, "fullname": "Shiduo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185996?format=json", "institution": "Fudan University"}, {"id": 175927, "fullname": "Liming Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175927?format=json", "institution": "Shanghai Innovation Institute"}, {"id": 185997, "fullname": "Jinlong Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185997?format=json", "institution": "Shanghai Innovation Institute"}, {"id": 185998, "fullname": "Jingjing Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/185998?format=json", "institution": "Shanghai Innovation Institute"}, {"id": 185999, "fullname": "Xianzhong Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185999?format=json", "institution": "Tongji University"}, {"id": 130875, "fullname": "Xipeng Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130875?format=json", "institution": "Fudan University"}], "abstract": "Vision-Language-Action (VLA) models excel in robotic manipulation but are constrained by their heavy reliance on expert demonstrations, leading to demonstration bias and limiting performance. Reinforcement learning (RL) is a vital post-training strategy to overcome these limits, yet current VLA-RL methods, including group-based optimization approaches, are crippled by severe reward sparsity. Relying on binary success indicators wastes valuable information in failed trajectories, resulting in low training efficiency. To solve this, we propose Self-Referential Policy Optimization (SRPO), a novel VLA-RL framework. SRPO eliminates the need for external demonstrations or manual reward engineering by leveraging the model\u2019s own successful trajectories, generated within the current training batch, as a self-reference. This allows us to assign a progress-wise reward to failed attempts. A core innovation is the use of Latent World Representations to measure behavioral progress robustly. Instead of relying on raw pixels or requiring domain-specific fine-tuning, we utilize the compressed, transferable encodings from a world model\u2019s latent space. These representations naturally capture progress patterns across environments, enabling accurate, generalized trajectory comparison. Empirical evaluations on the LIBERO benchmark demonstrate SRPO\u2019s efficiency and effectiveness. Starting from a supervised baseline with 48.9% success, SRPO achieves a new state-of-the-art success rate of 99.2% in just 200 RL steps, representing a 103% relative improvement without any extra supervision. Furthermore, SRPO shows substantial robustness, achieving a 167% performance improvement on the LIBERO-Plus benchmark.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36839", "url": null, "sourceid": 40216, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37664, "uid": "bf959c64f2ebaa7aa642e0b5280bbdf6", "name": "Just-in-Time: Tuning-Free Spatial Acceleration for Diffusion Transformers", "authors": [{"id": 181533, "fullname": "Wenhao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/181533?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 187973, "fullname": "Ji Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187973?format=json", "institution": "Capital Normal University"}, {"id": 128190, "fullname": "Zhaoqiang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128190?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Diffusion Transformers have established a new state-of-the-art in image synthesis, but the high computational cost of iterative sampling severely hampers their practical deployment. While existing acceleration methods often focus on the temporal domain, they overlook the substantial spatial redundancy inherent in the generative process, where global structures emerge long before fine-grained details are formed. The uniform computational treatment of all spatial regions represents a critical inefficiency. In this paper, we introduce **Just-in-Time (JiT)**, a novel tuning-free framework that addresses this challenge by accelerating the spatial domain. JiT formulates a spatially approximated generative ordinary differential equation (ODE) that drives the full latent state evolution based on computations from a dynamically selected, sparse subset of \"anchor'' tokens. To ensure seamless transitions as new tokens are incorporated to expand the dimensions of the latent state, we propose a deterministic micro-flow, a simple and effective finite-time ODE that maintains both structural coherence and statistical correctness. Extensive experiments on the state-of-the-art FLUX.1-dev model demonstrate that JiT achieves up to a **7$\\times\\$ speedup** with nearly lossless performance, significantly outperforming existing acceleration methods and establishing a new and superior trade-off between inference speed and generation fidelity.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37664", "url": null, "sourceid": 46425, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37198, "uid": "c652aee4df4084c621a094692ee6f6c9", "name": "All-in-One Slider for Attribute Manipulation in Diffusion Models", "authors": [{"id": 180355, "fullname": "Weixin Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/180355?format=json", "institution": "Beijing Jiaotong University"}, {"id": 186902, "fullname": "Hongguang Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186902?format=json", "institution": "City University of Macau"}, {"id": 186903, "fullname": "Wei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186903?format=json", "institution": "Beijing Jiaotong University"}, {"id": 90145, "fullname": "Yahui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90145?format=json", "institution": "University of Trento"}, {"id": 186904, "fullname": "Mengyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186904?format=json", "institution": "Beijing Jiaotong University"}, {"id": 86562, "fullname": "Xuecheng Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/86562?format=json", "institution": "national university of singaore, National University of Singapore"}], "abstract": "Text-to-image (T2I) diffusion models have made significant strides in generating high-quality images. However, progressively manipulating certain attributes of generated images to meet the desired user expectations remains challenging, particularly for content with rich details, such as human faces. Some studies have attempted to address this by training slider modules. However, they follow a \\textbf{One-for-One} manner, where an independent slider is trained for each attribute, requiring additional training whenever a new attribute is introduced. This not only results in parameter redundancy accumulated by sliders but also restricts the flexibility of practical applications and the scalability of attribute manipulation. To address this issue, we introduce the \\textbf{All-in-One} Slider, a lightweight module that decomposes the text embedding space into sparse, semantically meaningful attribute directions. Once trained, it functions as a general-purpose slider, enabling interpretable and fine-grained continuous control over various attributes. Moreover, by recombining the learned directions, the All-in-One Slider supports zero-shot manipulation of unseen attributes (e.g., races and celebrities) and the composition of multiple attributes. Extensive experiments demonstrate that our method enables accurate and scalable attribute manipulation, achieving notable improvements compared to previous methods. Furthermore, our method can be extended to integrate with the inversion framework to perform attribute manipulation on real images, broadening its applicability to various real-world scenarios. The code and trained model will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37198", "url": null, "sourceid": 34299, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38895, "uid": "d12dbd37a4a8bc4327e3faa9bf8a20cf", "name": "ChordEdit: One-Step Low-Energy Transport for Image Editing", "authors": [{"id": 190926, "fullname": "Liangsi Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190926?format=json", "institution": "Guangdong University of Technology"}, {"id": 188356, "fullname": "Xuhang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188356?format=json", "institution": "Huizhou University"}, {"id": 190927, "fullname": "Minzhe Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/190927?format=json", "institution": "Guangdong University of Technology"}, {"id": 190928, "fullname": "Shichu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190928?format=json", "institution": "Shenzhen University"}, {"id": 176876, "fullname": "Jingchao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176876?format=json", "institution": "the School of Computer Science, Peking University"}, {"id": 183044, "fullname": "Yang Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/183044?format=json", "institution": "Guangdong University of Technology"}], "abstract": "The advent of one-step text-to-image (T2I) models offers unprecedented synthesis speed. However, their application to text-guided image editing remains severely hampered, as forcing existing training-free editors into a single inference step fails. This failure manifests as severe object distortion and a critical loss of consistency in non-edited regions, resulting from the high-energy, erratic trajectories produced by naive vector arithmetic on the models' structured fields. To address this problem, we introduce \\textbf{ChordEdit}, a model agnostic, training-free, and inversion-free method that facilitates high-fidelity one-step editing. We recast editing as a transport problem between the source and target distributions defined by the source and target text prompts. Leveraging dynamic optimal transport theory, we derive a principled, low-energy control strategy. This strategy yields a smoothed, variance-reduced editing field that is inherently stable, facilitating the field to be traversed in a single, large integration step. A theoretically grounded and experimentally validated approach allows ChordEdit to deliver fast, lightweight and precise edits, finally achieving true real-time editing on these challenging models.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38895", "url": "https://chordedit.github.io/", "sourceid": 45497, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40352?format=json"], "related_events_ids": [40352]}, {"id": 40091, "uid": "d45121855efb5fa759a01b8def4ebe31", "name": "Toward Real-world Infrared Image Super-Resolution: A Unified Autoregressive Framework and Benchmark Dataset", "authors": [{"id": 192812, "fullname": "Yang Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192812?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 193491, "fullname": "Jun Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/193491?format=json", "institution": "Dalian University of Technology"}, {"id": 193492, "fullname": "Zhidong Jiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193492?format=json", "institution": "Dalian University of Technology"}, {"id": 155755, "fullname": "Xingyuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155755?format=json", "institution": "Dalian University of Technology"}, {"id": 155757, "fullname": "Zhiying Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155757?format=json", "institution": "Dalian Martime University"}, {"id": 152576, "fullname": "Jinyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152576?format=json", "institution": "Dalian University of Technology"}], "abstract": "Infrared Image Super-Resolution (IISR) under real-world conditions is a practically significant yet rarely addressed task. Pioneering works are often trained and evaluated on simulated datasets or neglect the intrinsic differences between infrared and visible imaging. In practice, however, real infrared images are affected by coupled optical and sensing degradations that jointly deteriorate both structural sharpness and thermal fidelity. To address these challenges, we propose Real-IISR, a unified autoregressive framework for real-world IISR that progressively reconstructs fine-grained thermal structures and clear backgrounds in a scale-by-scale manner via thermal-structural guided visual autoregression. Specifically, a Thermal-Structural Guidance module encodes thermal priors to mitigate the mismatch between thermal radiation and structural edges. Since non-uniform degradations typically induce quantization bias, Real-IISR adopts a Condition-Adaptive Codebook that dynamically modulates discrete representations based on degradation-aware thermal priors. Also, a Thermal Order Consistency Loss enforces a monotonic relation between temperature and pixel intensity, ensuring relative brightness order rather than absolute values to maintain physical consistency under spatial misalignment and thermal drift. We build FLIR-IISR, a real-world IISR dataset with paired LR-HR infrared images acquired via automated focus variation and motion-induced blur. Extensive experiments on real and synthetic datasets demonstrate the promising performance of Real-IISR, providing a unified foundation for real-world IISR and benchmarking.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40091", "url": null, "sourceid": 43982, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37559, "uid": "c6a8488de86eddf87c84edf4136a1126", "name": "Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers", "authors": [{"id": 187725, "fullname": "Binxu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187725?format=json", "institution": "Harvard University"}, {"id": 187726, "fullname": "Jingxuan Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187726?format=json", "institution": "Harvard University"}, {"id": 187727, "fullname": "Xu Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187727?format=json", "institution": "Harvard University, Harvard University"}], "abstract": "Diffusion Transformers (DiTs) have greatly advanced text-to-image generation, but models still struggle to generate the correct spatial relations between objects as specified in the text prompt. In this study, we adopt a mechanistic interpretability approach to investigate how a DiT can generate correct spatial relations between objects. We train, from scratch, DiTs of different sizes with different text encoders to learn to generate images containing two objects whose attributes and spatial relations are specified in the text prompt. We find that, although all the models can learn this task to near-perfect accuracy, the underlying mechanisms differ drastically depending on the choice of text encoder. When using random text embeddings, we find that the spatial-relation information is passed to image tokens through a two-stage circuit, involving two cross-attention heads that separately read the spatial relation and single-object attributes in the text prompt. When using a pretrained text encoder (T5), we find that the DiT uses a different circuit that leverages information fusion in the text tokens, reading spatial-relation and single-object information together from a single text token. We further show that, although the in-domain performance is similar for the two settings, their robustness to out-of-domain perturbations differs, potentially suggesting the difficulty of generating correct relations in real-world scenarios.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37559", "url": null, "sourceid": 43000, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37864, "uid": "27c20a93c89bfd0336394f370163d43c", "name": "DualSplat: Robust 3D Gaussian Splatting via  Pseudo-Mask Bootstrapping from Reconstruction Failures", "authors": [{"id": 180460, "fullname": "Xu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180460?format=json", "institution": "Beihang University"}, {"id": 159998, "fullname": "Zhiru Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/159998?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 188435, "fullname": "Shiyun Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/188435?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 188436, "fullname": "Chengwei Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188436?format=json", "institution": "Beihang Uinveristy"}, {"id": 188437, "fullname": "Yisong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188437?format=json", "institution": "Peking University"}], "abstract": "3D Gaussian Splatting achieves real-time photo-realistic rendering but struggles when training images contain transient objects that violate multi-view consistency. Existing methods face a fundamental dilemma: accurate transient detection requires well-reconstructed static scenes, yet clean reconstruction depends on reliable transient masks. This circular dependency causes persistent artifacts when both components are jointly optimized from poor initialization. We present DualSplat, a two-stage framework which sidesteps this dilemma by first generating pseudo masks from reconstruction failures, then using them to guide clean scene optimization. We observe that transient objects manifest as incomplete fragments during initial training, since they appear in only a subset of views. We consolidate these failures into pseudo masks via instance-level thresholding and a feature-residual filter guided by SAM2 boundaries. Then we trains a clean 3DGS model under pseudo-mask supervision, with a lightweight MLP refining masks online by progressively shifting from pseudo-priors to self-consistency as densification proceeds. Experiments on RobustNeRF and NeRF On-the-Go demonstrate that DualSplat achieves competitive performance with recent methods, with particularly strong results on scenes with high transient density.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37864", "url": null, "sourceid": 32647, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37092, "uid": "f645358b695e99125681b2129918ca81", "name": "Towards Visual Query Localization in the 3D World", "authors": [{"id": 180005, "fullname": "liang peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180005?format=json", "institution": "School of Computer Science, Wuhan University"}, {"id": 186635, "fullname": "Bohan Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186635?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 85051, "fullname": "Zhipeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85051?format=json", "institution": "Didi Research"}, {"id": 186636, "fullname": "Haobo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186636?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 186637, "fullname": "Yifan Jiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186637?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 89415, "fullname": "Xingping Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/89415?format=json", "institution": "Inception Institute of Artificial Intelligence"}, {"id": 88102, "fullname": "Libo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88102?format=json", "institution": "Institute of Software Chinese Academy of Sciences"}], "abstract": "Visual query localization (VQL) aims to predict a spatial-temporal response of the most recent occurrence from a sequence given a query. Currently, most research focuses on visual query localization from 2D videos, while its counterpart in 3D space has received little attention. In this paper, we make the first attempt to visual query localization in the 3D world by introducing a novel benchmark, dubbed 3DVQL. Specifically, 3DVQL contains 2,002 sequences with around 170,000 frames and 6.4K response track segments from 38 object categories. Each sequence in 3DVQL is provided with multiple modalities including point clouds (PC), RGB and depth images to support flexible research. To ensure high-quality annotation, each sequence is manually annotated with multiple rounds of verification and refinement. To our best knowledge, 3DVQL is the first benchmark towards 3D multimodal visual query localization. To facilitate comparison for subsequent research, we implement a series of representative 3D multimodal VQL baselines using PC and RGB. The experimental results show that existing methods exhibit significant performance variations across different fusion modules. To encourage future research, we propose a lift and attention fusion algorithm named LaF, which significantly outperforms than existing baseline models. Our benchmark and model will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37092", "url": null, "sourceid": 39042, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38291, "uid": "48bbfd2497b88edea00746d7c8cb96bf", "name": "Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models", "authors": [{"id": 189514, "fullname": "Christian Simon", "url": "http://cvpr.thecvf.com/api/miniconf/users/189514?format=json", "institution": "Sony"}, {"id": 154935, "fullname": "Masato Ishii", "url": "http://cvpr.thecvf.com/api/miniconf/users/154935?format=json", "institution": "Sony Research Inc."}, {"id": 189515, "fullname": "Wei-Yao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189515?format=json", "institution": "Sony Group Corporation"}, {"id": 189516, "fullname": "Koichi Saito", "url": "http://cvpr.thecvf.com/api/miniconf/users/189516?format=json", "institution": "Sony AI America; Sony Group Corporation, Tokyo"}, {"id": 154936, "fullname": "Akio Hayakawa", "url": "http://cvpr.thecvf.com/api/miniconf/users/154936?format=json", "institution": "Sony AI"}, {"id": 189205, "fullname": "Dongseok Shim", "url": "http://cvpr.thecvf.com/api/miniconf/users/189205?format=json", "institution": "Sony Group Corporation"}, {"id": 189517, "fullname": "Zhi Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/189517?format=json", "institution": "Sony Group Corporation"}, {"id": 189518, "fullname": "Shuyang Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/189518?format=json", "institution": "Sony Group Corporation"}, {"id": 132907, "fullname": "Takashi Shibuya", "url": "http://cvpr.thecvf.com/api/miniconf/users/132907?format=json", "institution": "Sony AI"}, {"id": 189519, "fullname": "Shusuke Takahashi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189519?format=json", "institution": "Sony Group Corporation"}, {"id": 153173, "fullname": "Yuki Mitsufuji", "url": "http://cvpr.thecvf.com/api/miniconf/users/153173?format=json", "institution": "Sony AI"}], "abstract": "Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones during testing. To tackle this challenge, we present multimodal hierarchical networks so-called  MMHNet, an enhanced extension of state-of-the-art video-to-audio models. Our approach integrates a hierarchical method and non-causal Mamba to support long-form audio generation. Our proposed method significantly improves long audio generation up to more than 5 minutes. We also prove that training short and testing long is possible in the video-to-audio generation tasks without training on the longer durations. We show in our experiments that our proposed method could achieve remarkable results on long-video to audio benchmarks, beating prior works in video-to-audio tasks. Moreover, we showcase our model capability in generating more than 5 minutes, while prior video-to-audio methods fall short in generating with long durations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38291", "url": null, "sourceid": 40375, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39908, "uid": "ca75de9a5fde50bc6a0ba910b1b1d908", "name": "UniFusion: A Unified Image Fusion Framework with Robust Representation and Source-Aware Preservation", "authors": [{"id": 155755, "fullname": "Xingyuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155755?format=json", "institution": "Dalian University of Technology"}, {"id": 145645, "fullname": "Songcheng Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/145645?format=json", "institution": "Northwest Polytechnical University"}, {"id": 192812, "fullname": "Yang Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192812?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 187259, "fullname": "HaoYuan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187259?format=json", "institution": null}, {"id": 155757, "fullname": "Zhiying Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155757?format=json", "institution": "Dalian Martime University"}, {"id": 152576, "fullname": "Jinyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152576?format=json", "institution": "Dalian University of Technology"}], "abstract": "Image fusion aims to integrate complementary information from multiple source images to produce a more informative and visually consistent representation, benefiting both human perception and downstream vision tasks. Despite recent progress, most existing fusion methods are designed for specific tasks (i.e., multi-modal, multi-exposure, or multi-focus fusion) and struggle to effectively preserve source information during the fusion process. This limitation primarily arises from task-specific architectures and the degradation of source information caused by deep-layer propagation. To overcome these issues, we propose \\textbf{UniFusion}, a unified image fusion framework designed to achieve cross-task generalization. First, leveraging DINOv3 for modality-consistent feature extraction, UniFusion establishes a shared semantic space for diverse inputs. Second, to preserve the understanding of each source image, we introduce a reconstruction-alignment loss to maintain consistency between fused outputs and inputs. Finally, we employ a bilevel optimization strategy to decouple and jointly optimize reconstruction and fusion objectives, effectively balancing their coupling relationship and ensuring smooth convergence. Extensive experiments across multiple fusion tasks demonstrate UniFusion\u2019s superior visual quality, generalization ability, and adaptability to real-world scenarios.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39908", "url": null, "sourceid": 37893, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37498, "uid": "bf84bd4c313f518513480e20d34412ba", "name": "The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models", "authors": [{"id": 187584, "fullname": "Runhao Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187584?format=json", "institution": "Harbin Institute of Technology"}, {"id": 132045, "fullname": "Hanshi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/132045?format=json", "institution": "Institute of automation, Chinese academy of science"}, {"id": 187585, "fullname": "Yixiang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187585?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 151500, "fullname": "Qianli Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/151500?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 174602, "fullname": "Jingmeng Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/174602?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 85051, "fullname": "Zhipeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85051?format=json", "institution": "Didi Research"}], "abstract": "The integration of Vision-Language Models (VLMs) into autonomous driving promises to solve long-tail scenarios, but this paradigm faces the critical and unaddressed challenge of catastrophic forgetting. The very fine-tuning process used to adapt these models to driving-specific data simultaneously erodes their invaluable pre-trained world knowledge, creating a self-defeating paradox that undermines the core reason for their use. This paper provides the first systematic investigation into this phenomenon. We introduce a new large-scale dataset of 180K scenes, which enables the first-ever benchmark specifically designed to quantify catastrophic forgetting in autonomous driving. Our analysis reveals that existing methods suffer from significant knowledge degradation. To address this, we propose the Drive Expert Adapter (DEA), a novel framework that circumvents this trade-off by shifting adaptation from the weight space to the prompt space. DEA dynamically routes inference through different knowledge experts based on scene-specific cues, enhancing driving-task performance without corrupting the model's foundational parameters. Extensive experiments demonstrate that our approach not only achieves state-of-the-art results on driving tasks but also effectively mitigates catastrophic forgetting, preserving the essential generalization capabilities that make VLMs a transformative force for autonomous systems. Dataset and code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37498", "url": null, "sourceid": 38246, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40039, "uid": "a6ebf387b95966cea37aefa0b29ff33c", "name": "SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker", "authors": [{"id": 179995, "fullname": "Junbin Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/179995?format=json", "institution": "Yanshan University"}, {"id": 193360, "fullname": "Ziteng Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/193360?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 157640, "fullname": "Shihui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157640?format=json", "institution": "Yan Shan University"}, {"id": 193361, "fullname": "Kun Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193361?format=json", "institution": "Yan Shan University"}, {"id": 85031, "fullname": "Weiming Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85031?format=json", "institution": "Institute of automation, Chinese academy of science"}, {"id": 85051, "fullname": "Zhipeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85051?format=json", "institution": "Didi Research"}], "abstract": "Parameter-efficient fine-tuning (PEFT) in multimodal tracking reveals a concerning trend where recent performance gains are often achieved at the cost of inflated parameter budgets, which fundamentally erodes PEFT's efficiency promise. In this work, we introduce SEATrack, a Simple, Efficient, and Adaptive two-stream multimodal tracker that tackles this performance-efficiency dilemma from two complementary perspectives. We first prioritize cross-modal alignment of matching responses, an underexplored yet pivotal factor that we argue is essential for breaking the trade-off. Specifically, we observe that modality-specific biases in existing two-stream methods generate conflicting matching attention maps, thereby hindering effective joint representation learning. To mitigate this, we propose AMG-LoRA, which seamlessly integrates Low-Rank Adaptation (LoRA) for domain adaptation with Adaptive Mutual Guidance (AMG) to dynamically refine and align attention maps across modalities. We then depart from conventional local fusion approaches by introducing a Hierarchical Mixture of Experts (HMoE) that enables efficient global relation modeling, effectively balancing expressiveness and computational efficiency in cross-modal fusion. Experiments show that AMG-LoRA alone establishes a remarkably simple yet strong baseline, outperforming SDSTrack on LasHeR by 3.3\\% in PR and 1.9\\% in SR with only 0.4\\% of its parameters (0.14M vs. 14.8M), while significantly boosting cross-modal fusion with negligible additional latency. Equipped with these innovations, SEATrack advances notable progress over state-of-the-art methods in balancing performance with efficiency across RGB-T, RGB-D, and RGB-E tracking tasks. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40039", "url": null, "sourceid": 36934, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40392?format=json"], "related_events_ids": [40392]}, {"id": 39926, "uid": "097e232de59f809f5a1cdf88e1240b08", "name": "Saliency-R1: Enforcing Interpretable and Faithful  Vision-language Reasoning via Saliency-map Alignment Reward", "authors": [{"id": 131637, "fullname": "Shizhan Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/131637?format=json", "institution": "Department of Computer Science and Engineering, The Chinese University of Hong Kong"}, {"id": 193127, "fullname": "Minda Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193127?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 193128, "fullname": "Qiyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193128?format=json", "institution": "City University of Hong Kong"}, {"id": 193129, "fullname": "Chen Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/193129?format=json", "institution": "City University of Hong Kong"}, {"id": 75510, "fullname": "Qi Dou", "url": "http://cvpr.thecvf.com/api/miniconf/users/75510?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "Vision-language models (VLMs) have achieved remarkable success across diverse tasks. However, concerns about their trustworthiness persist, particularly regarding tendencies to lean more on textual cues than visual evidence and the risk of producing ungrounded or fabricated responses. To address these issues, we propose $\\textbf{Saliency-R1}$, a framework for improving the interpretability and faithfulness of VLMs reasoning. Specifically, we introduce a novel saliency map technique that efficiently highlights critical image regions contributing to generated tokens without additional computational overhead. This can further be extended to trace how visual information flows through the reasoning process to the final answers, revealing the alignment between the thinking process and the visual context. We use the overlap between the saliency maps and human-annotated bounding boxes as the reward function, and apply Group Relative Policy Optimization (GRPO) to align the salient parts and critical regions, encouraging models to focus on relevant areas when conduct reasoning. Experiments show Saliency-R1 improves reasoning faithfulness, interpretability, and overall task performance. The dataset, code and pretrained models of this paper will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39926", "url": null, "sourceid": 35802, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38983, "uid": "eb414a86ec0575dc801691b76791a24d", "name": "PRISM: Learning a Shared Primitive Space for Transferable Skeleton Action Representation", "authors": [{"id": 189382, "fullname": "Di Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189382?format=json", "institution": "University of Science and Technology of China"}, {"id": 126959, "fullname": "Yaohui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126959?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 101727, "fullname": "Shuai Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/101727?format=json", "institution": "Zhejiang Lab"}, {"id": 76930, "fullname": "Francois Bremond", "url": "http://cvpr.thecvf.com/api/miniconf/users/76930?format=json", "institution": "inria"}, {"id": 191121, "fullname": "Jiangtao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191121?format=json", "institution": "University of Teesside"}], "abstract": "Real-world human action understanding remains challenging due to long-tailed label distributions, compositional motion patterns, and viewpoint variations. Existing skeleton-based methods often lack a structured and transferable representation of motion, and task-specific models for generation, classification, and detection are usually trained independently, resulting in fragmented pipelines and limited cross-task generalization. We present PRISM, a PRImitive-centric Skeleton Modeling framework that learns a shared motion representation from a motion generation objective and transfers it to perception tasks. PRISM represents each action sequence as a trajectory in a primitive coefficient space, which captures how a set of learned atomic motion primitives contribute to the observed motion. A structured decomposition module learns this representation in a physically grounded and view-invariant manner via motion generation. Instead of enforcing joint or unified training across tasks, PRISM provides a single primitive-centric representation that can be sequentially transferred to downstream classification and frame-wise detection through lightweight task heads. This representation introduces structure, compositionality, and improved generalization across distinct supervisions. PRISM consistently improves performance on long-tailed and multi-label datasets and enables interpretable reasoning over compositional and rare actions. Extensive experimental results show that the structured primitive space serves as a transferable and robust foundation for diverse action understanding tasks in real-world datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38983", "url": null, "sourceid": 35963, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40392, "uid": "a6ebf387b95966cea37aefa0b29ff33c", "name": "SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker", "authors": [{"id": 179995, "fullname": "Junbin Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/179995?format=json", "institution": "Yanshan University"}, {"id": 193360, "fullname": "Ziteng Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/193360?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 157640, "fullname": "Shihui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157640?format=json", "institution": "Yan Shan University"}, {"id": 193361, "fullname": "Kun Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193361?format=json", "institution": "Yan Shan University"}, {"id": 85031, "fullname": "Weiming Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85031?format=json", "institution": "Institute of automation, Chinese academy of science"}, {"id": 85051, "fullname": "Zhipeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85051?format=json", "institution": "Didi Research"}], "abstract": "Parameter-efficient fine-tuning (PEFT) in multimodal tracking reveals a concerning trend where recent performance gains are often achieved at the cost of inflated parameter budgets, which fundamentally erodes PEFT's efficiency promise. In this work, we introduce SEATrack, a Simple, Efficient, and Adaptive two-stream multimodal tracker that tackles this performance-efficiency dilemma from two complementary perspectives. We first prioritize cross-modal alignment of matching responses, an underexplored yet pivotal factor that we argue is essential for breaking the trade-off. Specifically, we observe that modality-specific biases in existing two-stream methods generate conflicting matching attention maps, thereby hindering effective joint representation learning. To mitigate this, we propose AMG-LoRA, which seamlessly integrates Low-Rank Adaptation (LoRA) for domain adaptation with Adaptive Mutual Guidance (AMG) to dynamically refine and align attention maps across modalities. We then depart from conventional local fusion approaches by introducing a Hierarchical Mixture of Experts (HMoE) that enables efficient global relation modeling, effectively balancing expressiveness and computational efficiency in cross-modal fusion. Experiments show that AMG-LoRA alone establishes a remarkably simple yet strong baseline, outperforming SDSTrack on LasHeR by 3.3\\% in PR and 1.9\\% in SR with only 0.4\\% of its parameters (0.14M vs. 14.8M), while significantly boosting cross-modal fusion with negligible additional latency. Equipped with these innovations, SEATrack advances notable progress over state-of-the-art methods in balancing performance with efficiency across RGB-T, RGB-D, and RGB-E tracking tasks. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40392", "url": null, "sourceid": -36934, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40039?format=json"], "related_events_ids": [40039]}, {"id": 37619, "uid": "6bec22d989fd9944a46e42ea11bebee1", "name": "Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation", "authors": [{"id": 152076, "fullname": "Yunhong Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152076?format=json", "institution": "Zhejiang University"}, {"id": 94115, "fullname": "Yanhong Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/94115?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 186636, "fullname": "Haobo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186636?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 89797, "fullname": "Hao Ouyang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89797?format=json", "institution": "Department of Computer Science and Engineering, Hong Kong University of Science and Technology"}, {"id": 93859, "fullname": "Qiuyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/93859?format=json", "institution": "Ant Group"}, {"id": 91740, "fullname": "Ka Leong Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/91740?format=json", "institution": "HKUST"}, {"id": 153747, "fullname": "Jiapeng Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153747?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 155319, "fullname": "Hengyuan Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/155319?format=json", "institution": "Zhejiang University"}, {"id": 85051, "fullname": "Zhipeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85051?format=json", "institution": "Didi Research"}, {"id": 186273, "fullname": "Xing Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186273?format=json", "institution": "Ant Group"}, {"id": 88128, "fullname": "Yujun Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88128?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 155322, "fullname": "Min Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155322?format=json", "institution": "Zhejiang University"}], "abstract": "Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens initialized from initial frames and continuously updated by fusing evicted tokens via exponential moving average as they exit the sliding window. Without additional computation cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model's ability to prioritize dynamic content. Instead, Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. We include both quantitative and qualitative experiments to show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37619", "url": null, "sourceid": 33666, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39057, "uid": "7e0bb07735b79680eb1ee707248353fd", "name": "Learnability-Guided Diffusion for Dataset Distillation", "authors": [{"id": 191269, "fullname": "Jeffrey A. Chan-Santiago", "url": "http://cvpr.thecvf.com/api/miniconf/users/191269?format=json", "institution": "University of Central Florida"}, {"id": 73977, "fullname": "Mubarak Shah", "url": "http://cvpr.thecvf.com/api/miniconf/users/73977?format=json", "institution": "Amazon"}], "abstract": "Training machine learning models on massive datasets is expensive and time-consuming. Dataset distillation addresses this by creating a small synthetic dataset that achieves the same performance as the full dataset. Recent methods use diffusion models to generate distilled datasets, either by producing diverse samples or by matching the training gradients of the original data. However, existing distilled datasets contain redundant training signals\u2014samples provide overlapping information. Empirically, disjoint subsets of existing distilled datasets capture 70\u201380\\% overlapping training signals. This redundancy arises because existing methods optimize for visual diversity or average training trajectories without accounting for training signal similarity across samples. This produces datasets where multiple samples teach the model similar information rather than providing complementary knowledge across training stages.We propose learnability-driven dataset distillation, which constructs synthetic datasets incrementally through successive stages. Starting from a small distilled dataset, we train a model and generate new samples guided by learnability scores that identify what the current model can learn from, creating an adaptive curriculum. We introduce learnability-guided diffusion that balances current-model informativeness with reference-model validity, automatically generating curriculum-aligned samples. Our approach reduces redundancy by 39.1\\%, enables specialization across training phases, and achieves state-of-the-art results on ImageNet-1K (60.1\\%), ImageNette (87.2\\%), and ImageWoof (72.9\\%)", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39057", "url": null, "sourceid": 36916, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39148, "uid": "ebde33c70e6e38c6377f49e3d1956fd1", "name": "Omni-AD: A Large-scale and Versatile Benchmark for Industrial Anomaly Detection", "authors": [{"id": 89340, "fullname": "Dahu Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/89340?format=json", "institution": "Zhejiang University"}, {"id": 191449, "fullname": "Chengshen He", "url": "http://cvpr.thecvf.com/api/miniconf/users/191449?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 191450, "fullname": "Shaochen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191450?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 103049, "fullname": "Bo Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/103049?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 191451, "fullname": "Xiaochen Quan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191451?format=json", "institution": "Hikrobot Co., Ltd."}, {"id": 191452, "fullname": "Wencong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191452?format=json", "institution": "Hikrobot Co., Ltd."}, {"id": 76677, "fullname": "Xing Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/76677?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "Industrial Anomaly Detection (IAD) has attracted significant attention and witnessed rapid development.However, the advancement in this field is hindered by two key issues: the performance saturation of existing benchmarks, limiting discriminative evaluation of different IAD methods, and the absence of benchmarks tailored to assess recent multi-modal large language models (MLLMs) in anomaly detection.To this end, we present Omni-AD, a comprehensive IAD benchmark featuring:\\romannumeral1) \\textbf{Large scale}:The dataset consists of approximately 35K images (6$\\times$ larger than MVTec) with 150 product categories (10$\\times$ larger than MVTec) spanning 16 industrial sectors, delivering unprecedented diversity in terms of both category and image scale compared with existing datasets.\\romannumeral2) \\textbf{Versatility}:The benchmark supports both conventional unsupervised and emerging MLLM-based IAD evaluation protocols. The latter is achieved by defining three subtasks of progressive difficulty, with two structured as visual question answering (VQA) and one as visual grounding.\\romannumeral3) \\textbf{Challenge}:Extensive experimental results of state-of-the-art methods reveal that the Omni-AD benchmark is more challenging than existing benchmarks, which can drive the future development of the IAD field.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39148", "url": null, "sourceid": 36089, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40090, "uid": "9802a535bbdbcec203871db6d9595586", "name": "Taming Sampling Perturbations with Variance Expansion Loss for Latent Diffusion Models", "authors": [{"id": 191319, "fullname": "Qifan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191319?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 101988, "fullname": "Xingyu Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/101988?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 138953, "fullname": "Jinhua Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/138953?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 193490, "fullname": "Weiyi You", "url": "http://cvpr.thecvf.com/api/miniconf/users/193490?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 86553, "fullname": "Shuhang Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86553?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Latent diffusion models have emerged as the dominant framework for high-fidelity and efficient image generation, owing to their ability to learn diffusion processes in compact latent spaces. However, while previous research has focused primarily on reconstruction accuracy and semantic alignment of the latent space, we observe that another critical factor, robustness to diffusion sampling perturbations, also plays a crucial role in determining generation quality. Through empirical and theoretical analyses, we show that the commonly used $\\beta$-VAE-based tokenizers in latent diffusion models, tend to produce overly compact latent manifolds that are highly sensitive to stochastic perturbations during diffusion sampling, leading to visual degradation. To address this issue, we propose a simple yet effective solution that constructs a latent space robust to sampling perturbations while maintaining strong reconstruction fidelity. This is achieved by introducing a Variance Expansion loss that counteracts variance collapse and leverages the adversarial interplay between reconstruction and variance expansion to achieve an adaptive balance that preserves reconstruction accuracy while improving robustness to stochastic sampling. Extensive experiments demonstrate that our approach consistently enhances generation quality across different latent diffusion architectures, confirming that robustness in latent space is a key missing ingredient for stable and faithful diffusion sampling.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40090", "url": null, "sourceid": 31432, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37295, "uid": "b80b8aad9b9104c9b2cfb95c6d1fdc97", "name": "NVGS: Neural Visibility for Occlusion Culling in 3D Gaussian Splatting", "authors": [{"id": 182232, "fullname": "Brent Zoomers", "url": "http://cvpr.thecvf.com/api/miniconf/users/182232?format=json", "institution": "Hasselt University"}, {"id": 172813, "fullname": "Florian Hahlbohm", "url": "http://cvpr.thecvf.com/api/miniconf/users/172813?format=json", "institution": "TU Braunschweig"}, {"id": 174940, "fullname": "Joni Vanherck", "url": "http://cvpr.thecvf.com/api/miniconf/users/174940?format=json", "institution": "Hasselt University"}, {"id": 187103, "fullname": "Lode Jorissen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187103?format=json", "institution": "Hasselt University"}, {"id": 185325, "fullname": "Marcus Magnor", "url": "http://cvpr.thecvf.com/api/miniconf/users/185325?format=json", "institution": "Institute for Computer Graphics, TU Braunschweig"}, {"id": 187104, "fullname": "Nick Michiels", "url": "http://cvpr.thecvf.com/api/miniconf/users/187104?format=json", "institution": "Hasselt University"}], "abstract": "3D Gaussian Splatting can exploit frustum culling and level-of-detail strategies to accelerate rendering of scenes containing a large number of primitives. However, the semi-transparent nature of Gaussians prevents the application of another highly effective technique: occlusion culling. We address this limitation by proposing a novel method to learn the viewpoint-dependent visibility function of all Gaussians in a trained model using a small, shared MLP across instances of an asset in a scene. By querying it for Gaussians within the viewing frustum prior to rasterization, our method can discard occluded primitives during rendering. Leveraging tensor cores for efficient computation, we integrate these neural queries directly into a novel instanced software rasterizer. Our approach outperforms the current state of the art for composed scenes in terms of VRAM usage and image quality, utilizing a combination of our instanced rasterizer and occlusion culling MLP, and exhibits complementary properties to existing LoD techniques.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37295", "url": null, "sourceid": 34667, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37112, "uid": "14b5caec7ac6a9609e748d56a17c174b", "name": "PromptDepth: Efficient and Promptable Geometric 3D Vision Model \\\\ for Embodied Intelligence", "authors": [{"id": 186689, "fullname": "Xianyun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186689?format=json", "institution": "Harbin Institute of Technology"}, {"id": 73555, "fullname": "Jiaxu Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/73555?format=json", "institution": "Zhejiang University"}, {"id": 186690, "fullname": "Tian Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186690?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 186691, "fullname": "Siyuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186691?format=json", "institution": "Chinese University of Hong Kong, Shenzhen"}, {"id": 186692, "fullname": "Yuehao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186692?format=json", "institution": "Harbin Institute of Technology"}, {"id": 186693, "fullname": "Haoyang Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186693?format=json", "institution": "Harbin Institute of Technology"}, {"id": 84768, "fullname": "Jun Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84768?format=json", "institution": "Zhejiang University"}, {"id": 86326, "fullname": "Yonghong Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/86326?format=json", "institution": "Peking University"}, {"id": 88598, "fullname": "Jun Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88598?format=json", "institution": "Hangzhou Dianzi University"}], "abstract": "Vision models for embodied intelligence require efficient 3D comprehension and interaction with objects within the scene. Existing 3D reconstruction models either overlook instance-level perception or rely on time-consuming offline reasoning, showing a less adaptability in real-time embodied scenario. In this paper, we present PromptDepth, the first promptable vision model that features both geometric 3D understanding and instance-level interaction especially designed for embodied intelligence. PromptDepth is a feed-forward network that quickly yields panoptic, instanced, or tracked depth map from two corresponding frames, enabling the real-time infer sequences from embodied agents. Specifically, following the minimal prediction problem, we design a promptable Dense Prediction Transformer, making it flexible to interact with unified dense prediction according to a specific prompt. Considering the substantial discrepancy between panoptic and instanced depth map, we further introduce a novel Instanced Label Distribution Smoothing (ILDS) loss, followed by Gram Anchoring, to mitigate the inherent conflict between dense and discrete representation. Trained on synthetic data only, our model achieves state-of-the-art results in both depth estimation and interactive segmentation on public benchmarks. Extensive experiments demonstrate superior visual efficiency in embodied tasks compared to current fundamental models. We believe that our efficient and flexible geometric 3D model offers a new foundation for vision tasks in embodied intelligence. The dataset and the code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37112", "url": null, "sourceid": 31593, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36833, "uid": "ddf22c927d3f99e95403afefc4501a8d", "name": "When Local Rules Create Global Order: Self-Organized Representation Learning for Latent Diffusion Models", "authors": [{"id": 103169, "fullname": "Junrong Lian", "url": "http://cvpr.thecvf.com/api/miniconf/users/103169?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 158730, "fullname": "Weijian Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/158730?format=json", "institution": "Australian National University"}, {"id": 73598, "fullname": "Pengxu Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/73598?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 185980, "fullname": "Yaqin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185980?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 87065, "fullname": "Qixiang Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/87065?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 75470, "fullname": "Liang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/75470?format=json", "institution": "Sun Yat-sen University"}], "abstract": "This work studies how latent space structure impacts the performance of Latent Diffusion Models (LDMs). We show that effective generation requires a latent space that is simultaneously locally smooth, enabling stable and reliable reconstruction, and globally dispersive, allowing the model to draw diverse and meaningful samples without collapsing into narrow regions of the latent space. However, existing approaches often emphasize smoothness, which may lead to concentrated latent regions and limited exploration of the broader space. To address these limitations, we propose Self-Organized Representation Learning (SORL), a bottom-up training paradigm inspired by self-organization in complex systems, where global structure emerges naturally from simple local interactions. The critical latent properties of smoothness and maximal dispersity are not explicitly imposed. Instead, SORL promotes these properties through two complementary local mechanisms: local attraction, which encourages coherent reconstructions among nearby latent codes, and local repulsion, which prevents latent codes from collapsing into dense clusters. Through their interaction, SORL induces a latent manifold that maintains both local smoothness and global dispersity, leading to improved reconstruction and generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36833", "url": null, "sourceid": 46447, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37953, "uid": "1f4be5aac180c4920caae2c75e77076d", "name": "4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models", "authors": [{"id": 128862, "fullname": "Yiting Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128862?format=json", "institution": "University of Science and Technology of China"}, {"id": 188667, "fullname": "Wei Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188667?format=json", "institution": "University of Science and Technology of China"}, {"id": 188668, "fullname": "Peiyan Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188668?format=json", "institution": "Zhejiang University"}, {"id": 188669, "fullname": "Haoran Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188669?format=json", "institution": "University of Science and Technology of China"}, {"id": 130884, "fullname": "Hanxin Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130884?format=json", "institution": "University of Science and Technology of China"}, {"id": 188670, "fullname": "Zihao Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188670?format=json", "institution": "University of Science and Technology of China"}, {"id": 188671, "fullname": "Xingrui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188671?format=json", "institution": "University of Science and Technology of China"}, {"id": 188672, "fullname": "Xinyi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188672?format=json", "institution": "University of Science and Technology of China"}, {"id": 188673, "fullname": "Xinge Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188673?format=json", "institution": "University of Science and Technology of China"}, {"id": 158204, "fullname": "Xin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/158204?format=json", "institution": "University of Science and Technology of China"}, {"id": 85129, "fullname": "Zhibo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/85129?format=json", "institution": "University of Science and Technology of China"}], "abstract": "World Generation Models are emerging as a cornerstone of next-generation multimodal intelligence systems. Unlike traditional 2D visual generation, World Models aim to construct realistic, dynamic, and physically consistent 3D/4D worlds from images, videos, or text. These models not only need to produce high-fidelity visual content but also maintain coherence across space, time, physics, and instruction control, enabling applications in virtual reality, autonomous driving, Embodied Intelligence, and content creation.However, prior benchmarks, however, each emphasize different evaluation dimensions and lack a unified assessment of world-realism capability.To systematically evaluate World Models, we introduce the 4DWorldBench, which measures models across four key dimensions: Perceptual Quality, Condition\u20134D Alignment, Physical Realism, and 4D Consistency. The benchmark covers tasks such as Image-to-3D/4D, Video-to-4D, Text-to-3D/4D. Beyond these, we innovatively introduce adaptive conditioning across multiple modalities, which not only integrates but also extends traditional evaluation paradigms. To accommodate different modality-conditioned inputs, we map all modality conditions into a unified textual space during evaluation, and further integrate LLM-as-judge, MLLM-as-judge, and traditional network-based methods. This unified and adaptive design enables more comprehensive and consistent evaluation of alignment, physical Realism, and cross-modal coherence.Preliminary human studies further demonstrate that our adaptive tool selection achieves closer agreement with subjective human judgments.We hope this benchmark will serve as a foundation for objective comparisons and improvements, accelerating the transition from \"visual generation\" to \"world generation.\"", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37953", "url": null, "sourceid": 40025, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38197, "uid": "1e492efd68b177927702d6ade06c849e", "name": "DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior", "authors": [{"id": 189284, "fullname": "Junjia Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189284?format=json", "institution": "Sun yat-sen University"}, {"id": 189285, "fullname": "Binbin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189285?format=json", "institution": "ByteDance Inc."}, {"id": 89930, "fullname": "Pengxiang Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/89930?format=json", "institution": "ByteDance Inc."}, {"id": 189286, "fullname": "JiyangLiu JiyangLiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189286?format=json", "institution": "ByteDance Inc."}, {"id": 87856, "fullname": "Bin Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/87856?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 189287, "fullname": "Zhao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189287?format=json", "institution": "ByteDance Inc."}, {"id": 87905, "fullname": "Yitong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87905?format=json", "institution": "ByteDance Inc"}, {"id": 75470, "fullname": "Liang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/75470?format=json", "institution": "Sun Yat-sen University"}, {"id": 74074, "fullname": "Guanbin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/74074?format=json", "institution": "Sun Yat-sen University"}], "abstract": "Storyboard synthesis plays a crucial role in visual storytelling, aiming to generate coherent shot sequences that visually narrate cinematic events with consistent characters, scenes, and transitions. However, existing approaches are mostly adapted from text-to-image diffusion models, which struggle to maintain long-range temporal coherence, consistent character identities, and narrative flow across multiple shots. In this paper, we introduce DreamShot, a video generative model based storyboard framework that fully exploits powerful video diffusion priors for controllable multi-shot synthesis. DreamShot supports both Text-to-Shot and Reference-to-Shot generation, as well as story continuation conditioned on previous frames, enabling flexible and context-aware storyboard generation. By leveraging the spatial-temporal consistency inherent in video generative models, DreamShot produces visually and semantically coherent sequences with improved narrative fidelity and character continuity. Furthermore, DreamShot incorporates a multi-reference role conditioning module that accepts multiple character reference images and enforces identity alignment via a Role-Attention Consistency Loss, explicitly constraining attention between reference and generated roles. Extensive experiments demonstrate that DreamShot achieves superior scene coherence, role consistency, and generation efficiency compared to state-of-the-art text-to-image storyboard models, establishing a new direction toward controllable video model-driven visual storytelling.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38197", "url": null, "sourceid": 35179, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38049, "uid": "7c40611ab3917aa747d7f7fb249b9620", "name": "RankOOD - Class Ranking-based Out-of-Distribution Detection", "authors": [{"id": 188927, "fullname": "Dishanika Dewani Denipitiyage", "url": "http://cvpr.thecvf.com/api/miniconf/users/188927?format=json", "institution": "University of Sydney"}, {"id": 188928, "fullname": "Naveen Karunanayake", "url": "http://cvpr.thecvf.com/api/miniconf/users/188928?format=json", "institution": "Future Secure AI Pty Ltd"}, {"id": 181978, "fullname": "Suranga Seneviratne", "url": "http://cvpr.thecvf.com/api/miniconf/users/181978?format=json", "institution": "The University of Sydney"}, {"id": 188929, "fullname": "Sanjay Chawla", "url": "http://cvpr.thecvf.com/api/miniconf/users/188929?format=json", "institution": "Qatar Computing Research Institute, Hamad Bin Khalifa University"}], "abstract": "We propose  RankOOD, a rank-based Out-of-Distribution (OOD) detection approach based on training a model with the Placket-Luce loss, which is now extensively used for preference alignment tasks in foundational models. Our approach is based on the insight that with a deep learning model trained using the Cross Entropy Loss,  in-distribution (ID) class prediction induces a ranking pattern for each ID class prediction. The  RankOOD framework formalizes the insight by first extracting a rank list for each class using an initial classifier and then uses another round of training with the Plackett-Luce loss, where the class rank, a fixed permutation for each class, is the predicted variable. An OOD example may get assigned with high probability to an ID example, but the probability of it respecting the ranking classification is likely to be small.  RankOOD, achieves  SOTA performance on the near-ODD TinyImageNet evaluation benchmark, reducing FPR95 by 4.3%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38049", "url": null, "sourceid": 42198, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36312, "uid": "70eeae4cf0f0ddd6e05606e961ec423e", "name": "Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation", "authors": [{"id": 181611, "fullname": "Yongjie Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/181611?format=json", "institution": "Sun Yat-sen University"}, {"id": 184748, "fullname": "Zhouxia Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184748?format=json", "institution": "Nanyang Technological University"}, {"id": 153470, "fullname": "Yang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153470?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 184749, "fullname": "Kaijun Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/184749?format=json", "institution": "SUN YAT-SEN UNIVERSITY; Beijing University of Aeronautics and Astronautics"}, {"id": 184750, "fullname": "Yifan Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184750?format=json", "institution": "SUN YAT-SEN UNIVERSITY; Shandong University"}, {"id": 184751, "fullname": "Mingtong Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/184751?format=json", "institution": "Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences"}, {"id": 151703, "fullname": "weixing chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/151703?format=json", "institution": "Sun Yat-Sen University"}, {"id": 73559, "fullname": "Ziliang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/73559?format=json", "institution": "Pengcheng Laboratory"}, {"id": 184752, "fullname": "Lingbo Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184752?format=json", "institution": "Pengcheng Laboratory"}, {"id": 74074, "fullname": "Guanbin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/74074?format=json", "institution": "Sun Yat-sen University"}, {"id": 75470, "fullname": "Liang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/75470?format=json", "institution": "Sun Yat-sen University"}], "abstract": "Recent vision-language-action (VLA) models for multi-task robotic manipulation commonly rely on static viewpoints and shared visual encoders, which limit 3D perception and cause task interference, hindering robustness and generalization. In this work, we propose Task-aware Virtual View Exploration (TVVE), a framework designed to overcome these challenges by integrating virtual view exploration with task-specific representation learning. TVVE employs an efficient exploration policy, accelerated by a novel pseudo-environment, to acquire informative views. Furthermore, we introduce a Task-aware Mixture-of-Experts (TaskMoE) visual encoder to disentangle features across different tasks, boosting both representation fidelity and task generalization. By learning to see the world in a task-aware way, TVVE generates more complete and discriminative visual representations, demonstrating significantly enhanced action prediction across a wide array of manipulation challenges. To further validate the robustness and generalization capability of TVVE under out-of-distribution (OOD) settings, we construct a challenging benchmark, RLBench-OG, covering various visual perturbations and camera pose variations. Extensive experiments on RLBench and RLBench-OG show that our TVVE achieves superior performance over state-of-the-art approaches. In real-robot experiments, TVVE demonstrates exceptional performance and generalizes robustly in multiple OOD settings, including visual disturbances and unseen instructions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36312", "url": null, "sourceid": 41287, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36240, "uid": "5282dceffe1d3da5d599abe98cf874de", "name": "Roots Beneath the Cut: Uncovering the Risk of Concept Recovery in Pruning-Based Unlearning for Diffusion Models", "authors": [{"id": 184541, "fullname": "Ci Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184541?format=json", "institution": "University of Georgia"}, {"id": 184542, "fullname": "Zhaojun Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/184542?format=json", "institution": "University of Georgia"}, {"id": 184543, "fullname": "Chence Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184543?format=json", "institution": "University of Georgia"}, {"id": 184544, "fullname": "Jun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184544?format=json", "institution": "Carnegie Mellon University"}, {"id": 184545, "fullname": "Xiaoming Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/184545?format=json", "institution": "University of Georgia"}, {"id": 184546, "fullname": "Shaoyi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184546?format=json", "institution": "Stevens Institute of Technology"}, {"id": 184547, "fullname": "Beiwen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184547?format=json", "institution": "University of Georgia"}, {"id": 184548, "fullname": "Xiaolong Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/184548?format=json", "institution": "University of Arizona"}, {"id": 184549, "fullname": "Jin Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184549?format=json", "institution": "University of Georgia"}, {"id": 184550, "fullname": "Geng Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184550?format=json", "institution": "University of Georgia"}], "abstract": "Pruning-based unlearning has recently emerged as a fast, training-free, and data-independent approach to remove undesired concepts from diffusion models. It promises high efficiency and robustness, offering an attractive alternative to traditional fine-tuning or editing-based unlearning. However, in this paper we uncover a hidden danger behind this promising paradigm. We find that the locations of pruned weights, typically set to zero during unlearning, can act as side-channel signals that leak critical information about the erased concepts.To verify this vulnerability, we design a novel attack framework capable of reviving erased concepts from pruned diffusion models in a fully data-free and training-free manner. Our experiments confirm that pruning-based unlearning is not inherently secure, as erased concepts can be effectively revived without any additional data or retraining.Finally, we explore potential defense strategies and advocate safer pruning mechanisms that conceal pruning locations while preserving unlearning effectiveness, providing practical insights for designing more secure pruning-based unlearning frameworks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36240", "url": null, "sourceid": 38347, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40146, "uid": "2cd019e887a1ef10c8c8b3ccd92f2f9b", "name": "H$^{2}$A$^{2}$: Homogeneity-Aware and Heterogeneity-Aware Feature Perception for Unified Indoor 3D Object Detection", "authors": [{"id": 76581, "fullname": "Tao Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/76581?format=json", "institution": "Harbin Institute of Technology"}, {"id": 193620, "fullname": "Tao An", "url": "http://cvpr.thecvf.com/api/miniconf/users/193620?format=json", "institution": "Harbin Institute of Technology"}, {"id": 193621, "fullname": "Feng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193621?format=json", "institution": "Harbin Institute of Technology"}, {"id": 193622, "fullname": "Jin Wensheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193622?format=json", "institution": "Harbin Institute of Technology"}, {"id": 193623, "fullname": "Zhengyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193623?format=json", "institution": "Harbin Institute of Technology"}, {"id": 187102, "fullname": "lijun zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187102?format=json", "institution": "Harbin Institute of Technology"}, {"id": 71512, "fullname": "Ruifeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/71512?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "In this work, we observe that for indoor 3D object detection, fundamental geometric cues induce homogeneous spatial responses across scenes, whereas scene-specific structure yields heterogeneous signatures. However, existing detectors lack effective mechanisms to jointly extract and exploit such dual properties, which imposes inherent limitations on detection performance. Guided by this insight, we propose H$^2$A$^2$, a homogeneity-aware and heterogeneity-aware feature perception network for unified indoor 3D object detection under cross-scene training paradigms.Technically, we introduce a structural-feature-aware kernel selection (SF-KS) method, which encompasses three core components:(i) task-aware linear modulation, a channel-wise affine transformation that strengthens scene-structural feature representation; (ii) kernel weight selection strategy that integrates an offset validity prior to suppress non-informative cross-scene transfer while utilizing a structural consistency posterior to capture scene-homogeneous cues. and (iii) task-aware channel gating that suppresses scene-irrelevant feature responses. Overall, SF-KS enables the precise optimization of homogeneous features while specializing in scene-specific heterogeneous ones. In addition, to stabilize cross-scene optimization, we further introduce norm-based gradient homogenization (NGH) algorithm, which normalizes and dynamically reweights per-task gradient norms to mitigate conflicts and promote consistent updates. Extensive experiments on diverse indoor benchmarks show that H$^2$A$^2$ delivers consistent gains over strong baselines and improves cross-scene generalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40146", "url": null, "sourceid": 31936, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37168, "uid": "e24d28623b5189670316bb727f50961a", "name": "A Unified Benchmark for HOI Evaluation across Vision-Language Models and HOI-Specific Methods", "authors": [{"id": 180808, "fullname": "Qinqian Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/180808?format=json", "institution": "National University of Singapore"}, {"id": 69287, "fullname": "Bo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/69287?format=json", "institution": "University of Mississippi"}, {"id": 85616, "fullname": "Robby T. Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/85616?format=json", "institution": "National University of Singapore"}], "abstract": "Human-object interaction (HOI) detection has traditionally been addressed using task-specific models, sometimes augmented by early vision-language models such as CLIP. With the emergence of large, generative VLMs, a natural question arises: can standalone VLMs perform HOI detection effectively, and how do they compare to specialized HOI methods? Existing benchmarks like HICO-DET rely on exact label matching under incomplete annotations, counting any unmatched prediction as wrong. This leads to incorrect penalization, especially for VLMs whose outputs are less constrained, making fair comparison between the two paradigms difficult. To address this limitation, we introduce a multi-choice HOI benchmark with explicitly defined positives and curated negatives, enabling unified and correct evaluation of both VLMs and HOI-specific models. We further focus on challenging scenarios, such as multi-person scenes and fine-grained interaction distinctions, which are crucial for revealing real differences between the two paradigms. Experiments show that large VLMs achieve competitive, sometimes superior, zero-shot performance, yet they struggle with multiple concurrent actions and with correctly assigning interactions to the target person. Conversely, HOI-specific methods remain weaker in general HOI reasoning but demonstrate stronger multi-action recognition and more reliable identification of which person performs which action. These findings expose complementary strengths and weaknesses of VLMs and HOI-specific methods, which existing benchmarks fail to reveal due to incorrect penalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37168", "url": null, "sourceid": 46149, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36374, "uid": "689b243e8f33e0d27343ebb0d83b513a", "name": "TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies", "authors": [{"id": 183759, "fullname": "Guang Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183759?format=json", "institution": "Nanjing University"}, {"id": 153324, "fullname": "Jie Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153324?format=json", "institution": "Nanjing University"}, {"id": 152096, "fullname": "Ningyuan Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152096?format=json", "institution": "Nanjing University"}, {"id": 184905, "fullname": "Xinyao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184905?format=json", "institution": "University of Science and Technology of China"}, {"id": 69175, "fullname": "Jianxin Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/69175?format=json", "institution": "Nanjing University"}], "abstract": "Native FP8 support in modern hardware is essential for training large Transformers, but is severely hindered by extreme activation outliers. Existing solutions either rely on complex mixed-precision engineering or invasive architectural modifications. This paper fundamentally challenges the conventional wisdom that outliers are data-driven. We demonstrate that extreme outliers are a *data-independent, mechanically-produced artifact of training*, originating from specific structural properties of the weight matrices (i.e., colinearity). Based on this insight, we propose TWEO (Transformers Without Extreme Outliers), a novel, non-invasive loss function. TWEO effectively prevents extreme outliers via a very simple loss term, which reduces outliers from 10000+ to less than 20. TWEO then enables **full-model FP8 pre-training** with neither engineering tricks nor architectural changes for both LLM and ViT. When standard FP8 training catastrophically collapses, TWEO achieves performance comparable to the BF16 baseline while delivering a 36% increase in training throughput. Also, TWEO enables a new quantization paradigm. Hardware-friendly **W8A8 per-tensor static quantization** of LLMs, previously considered completely unusable due to outliers, achieves SOTA performance for the first time on TWEO-trained models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36374", "url": null, "sourceid": 31959, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36958, "uid": "0e45a87d6e065b04784891b6d20ceefb", "name": "Towards Sparse Video Understanding and Reasoning", "authors": [{"id": 180584, "fullname": "Chenwei Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180584?format=json", "institution": "Northwestern University"}, {"id": 186310, "fullname": "Zhen Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/186310?format=json", "institution": "Johns Hopkins University"}, {"id": 186311, "fullname": "Shang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186311?format=json", "institution": "Northwestern University, Northwestern University"}, {"id": 186312, "fullname": "Weijian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186312?format=json", "institution": "Northwestern University"}, {"id": 163117, "fullname": "Zihan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/163117?format=json", "institution": "Northwestern University, Northwestern University"}, {"id": 105089, "fullname": "Zhuofan Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/105089?format=json", "institution": "Tsinghua University"}, {"id": 186313, "fullname": "Lie Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186313?format=json", "institution": "Dolby Laboratories"}, {"id": 186314, "fullname": "Pranav Maneriker", "url": "http://cvpr.thecvf.com/api/miniconf/users/186314?format=json", "institution": "Dolby Laboratories"}, {"id": 186315, "fullname": "Fan Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/186315?format=json", "institution": "Dolby"}, {"id": 150949, "fullname": "Manling Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/150949?format=json", "institution": "Northwestern University"}, {"id": 186316, "fullname": "Han Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186316?format=json", "institution": "Northwestern University"}], "abstract": "We present **ReViSe** (_**Re**asoning with **Vi**deo **S**parsity_), a framework that combines multi-round reasoning with adaptive frame selection for video question answering (VQA).  Existing vision\u2013language models (VLMs) uniformly sample video frames, which introduces redundancy or irrelevancy.  In contrast, ReViSe*interactively selects informative frames through multi-round reasoning.  To achieve this, ReViSe includes three modules:  a multi-round conversation module that retains frame selection history as memory;  a reasoning tracer that maintains a chain-of-thought across rounds;  and a self-correction mechanism that enforces structural and behavioral validity.  ReViSe integrates seamlessly with both proprietary and open-source VLMs.  It supports proprietary models in a \u201cplug-and-play\u201d manner and enables reinforcement fine-tuning for open-source models.  Experiments on multiple VQA benchmarks show that **ReViSe** improves the video understanding ability of VLMs by improving accuracy while reducing the number of frames used.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36958", "url": null, "sourceid": 31351, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37094, "uid": "7b85277ad0a91a4b738f853380f9f6f2", "name": "Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding", "authors": [{"id": 165443, "fullname": "\u6d77\u6d0b \u95eb", "url": "http://cvpr.thecvf.com/api/miniconf/users/165443?format=json", "institution": "casia"}, {"id": 180294, "fullname": "Hongyun Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/180294?format=json", "institution": "Kuaishou- \u5feb\u624b\u79d1\u6280"}, {"id": 186639, "fullname": "Peng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186639?format=json", "institution": "Kuaishou- \u5feb\u624b\u79d1\u6280"}, {"id": 186640, "fullname": "FengXiaoxue FengXiaoxue", "url": "http://cvpr.thecvf.com/api/miniconf/users/186640?format=json", "institution": ""}, {"id": 147650, "fullname": "Mengyi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/147650?format=json", "institution": "Department of Content Security, Kuaishou Technology"}], "abstract": "Despite rapid developments and widespread applications of MLLM agents, they still struggle with long-form video understanding (LVU) tasks, which are characterized by high information density and extended temporal spans. Recent research on LVU agents demonstrates that simple task decomposition and collaboration mechanisms are insufficient for long-chain reasoning tasks. Moreover, directly reducing the time context through embedding-based retrieval may lose key information of complex problems. In this paper, we propose Symphony, a multi-agent system, to alleviate these limitations. By emulating human cognition patterns, Symphony decomposes LVU into fine-grained subtasks and incorporates a deep reasoning collaboration mechanism enhanced by reflection, effectively improving the reasoning capability. Additionally, Symphony provides a VLM-based grounding approach to analyze LVU tasks and assess the relevance of video segments, which significantly enhances the ability to locate complex problems with implicit intentions and large temporal spans. Experimental results show that Symphony achieves state-of-the-art performance on LVBench, LongVideoBench, VideoMME, and MLVU, with a 5.0% improvement over the prior state-of-the-art method on LVBench.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37094", "url": null, "sourceid": 42219, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36914, "uid": "ca81596ca792c84a9be8deebe401d211", "name": "VSRELL: A Simple Baseline for Video Super-Resolution and Enhancement in Low-Light environment", "authors": [{"id": 180928, "fullname": "Yanming hui", "url": "http://cvpr.thecvf.com/api/miniconf/users/180928?format=json", "institution": "Tianjin University"}, {"id": 106136, "fullname": "Fanhua Shang", "url": "http://cvpr.thecvf.com/api/miniconf/users/106136?format=json", "institution": "Tianjin University"}, {"id": 153901, "fullname": "Hongying Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153901?format=json", "institution": "Tianjin University"}, {"id": 167448, "fullname": "Ben Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/167448?format=json", "institution": "Tianjin University"}, {"id": 173108, "fullname": "Zhenwei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/173108?format=json", "institution": "Tianjin University"}, {"id": 153902, "fullname": "Liang Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153902?format=json", "institution": "Tianjin University"}, {"id": 90857, "fullname": "Wei Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90857?format=json", "institution": "Tianjin University"}, {"id": 186209, "fullname": "Tong Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/186209?format=json", "institution": "Tianjin University"}, {"id": 186210, "fullname": "Bingqin Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/186210?format=json", "institution": "Tianjin University"}], "abstract": "We propose an integrated learning scheme of Video Super-Resolution and Enhancement in Low-Light environment, named VSRELL, which aims to recover Well-Illuminated High-Resolution (WIHR) sequence from Low-Light Low-Resolution (LLLR) counterparts. Due to the complex coupling of multiple degradations, this joint task has received relatively little attention. Our approach jointly models illumination enhancement and spatial-temporal super-resolution to disentangle intertwined degradations. Specifically, we introduce an Illumination-Noise Co-Optimization (INCO) network that employs a dynamic window partitioning strategy to explicitly model physical priors of illumination variations and noise distributions within individual frames of a long-term sequence. This effectively suppresses cross-frame noise accumulation and illumination flickering, achieving simultaneous optimization of motion compensation and brightness correction.Additionally, an Illumination-Sensitive Feature Propagation (ISFP) mechanism is introduced, which utilizes hierarchical illumination-sensing gating unit to adaptively modulate feature channel responses. By adjusting feature propagation intensity and using memory feature attenuation strategy, it can enhance the weighting of high-quality features and suppress error accumulation propagation and strengthen transmission efficiency. The experiments show that VSRELL can explicitly strengthen the brightness continuity and texture fidelity of the restored output, maintaining temporal consistency across the video.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36914", "url": null, "sourceid": 35951, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38192, "uid": "e4807834012f253392148c5bfdcb8020", "name": "Circular-DPO: Aligning Multi-Stage 3D Generative Models via Preference Feedback Loop", "authors": [{"id": 131293, "fullname": "Zejian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/131293?format=json", "institution": "Zhejiang University"}, {"id": 189259, "fullname": "Jiarui Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/189259?format=json", "institution": "Zhejiang University"}, {"id": 189260, "fullname": "Han Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189260?format=json", "institution": "Jiangnan University"}, {"id": 189261, "fullname": "Weiting Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189261?format=json", "institution": ""}, {"id": 189262, "fullname": "Yangrui Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189262?format=json", "institution": "Wuhan University"}, {"id": 189263, "fullname": "Chenye Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189263?format=json", "institution": "Zhejiang University"}, {"id": 189264, "fullname": "Pei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189264?format=json", "institution": "Zhejiang University"}, {"id": 128127, "fullname": "Ling Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128127?format=json", "institution": "Peking University"}, {"id": 189265, "fullname": "Zhiyuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189265?format=json", "institution": "Alibaba Group"}, {"id": 189266, "fullname": "Changyuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189266?format=json", "institution": null}, {"id": 189267, "fullname": "Guang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189267?format=json", "institution": "Zhejiang University; Alibaba Group"}, {"id": 189268, "fullname": "Immanuel Koh", "url": "http://cvpr.thecvf.com/api/miniconf/users/189268?format=json", "institution": "Singapore University of Technology and Design"}, {"id": 131272, "fullname": "Lingyun Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/131272?format=json", "institution": "Zhejiang University"}], "abstract": "Multi-stage generative models have shown great promise in 3D content creation due to focused generation of structure or texture in different stages, but their outputs often fail to align with human preferences. The key bottleneck to apply alignment methods is the presence of non-differentiable operations between generative stages.This disconnection stops preference signals applied to the final output from being backpropagated to the crucial, early stages of generation, while simple separated stage-wise alignment leads to texture-geometry inconsistency.To address this challenge, we introduce Circular-DPO, which builds a preference feedback loop to align multi-stage 3D generation models to human preference.Our method first applies Direct Preference Optimization (DPO) to refine the final 3D asset.We then construct new preference pairs by sampling and decoding the assets generated by the optimized model.These newly-formed pairs are used to train the preceding generative stage, effectively creating a feedback loop that bridges the non-differentiable gap. Furthermore, to enhance robustness against noisy data, we introduce a quality-aware weighting mechanism that prioritizes reliable preference pairs during training. Experiments demonstrate that our approach improves the alignment of generated 3D content with human preferences by enabling holistic, multi-stage optimization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38192", "url": null, "sourceid": 43801, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40092, "uid": "9577014358ebabb5010e7513a7439a82", "name": "Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training", "authors": [{"id": 193493, "fullname": "Aojun Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193493?format=json", "institution": "Sichuan University"}, {"id": 131066, "fullname": "Tao Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/131066?format=json", "institution": "Tsinghua University"}, {"id": 126537, "fullname": "Hangjie Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/126537?format=json", "institution": "Zhejiang University"}, {"id": 192115, "fullname": "Wei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192115?format=json", "institution": "Sichuan University"}, {"id": 128886, "fullname": "Yanan Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/128886?format=json", "institution": "Sichuan University"}], "abstract": "The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL\u2019s generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC-SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency. This work offers a data-centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40092", "url": null, "sourceid": 45481, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38125, "uid": "9c6dc3ff593093879a0eaf172f13589a", "name": "TOWARDS CALIBRATING PROMPT TUNING OF VISION- LANGUAGE MODELS", "authors": [{"id": 143271, "fullname": "Ashshak Sharifdeen", "url": "http://cvpr.thecvf.com/api/miniconf/users/143271?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 153189, "fullname": "Fahad Shamshad", "url": "http://cvpr.thecvf.com/api/miniconf/users/153189?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 76446, "fullname": "Muhammad Akhtar Munir", "url": "http://cvpr.thecvf.com/api/miniconf/users/76446?format=json", "institution": "MBZUAI-UAE"}, {"id": 189104, "fullname": "Abhishek Basu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189104?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 189105, "fullname": "Mohamed Ismithdeen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189105?format=json", "institution": null}, {"id": 189106, "fullname": "Jeyapriyan Jeyamohan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189106?format=json", "institution": null}, {"id": 189107, "fullname": "Chathurika Silva", "url": "http://cvpr.thecvf.com/api/miniconf/users/189107?format=json", "institution": "University of Colombo"}, {"id": 153190, "fullname": "Karthik Nandakumar", "url": "http://cvpr.thecvf.com/api/miniconf/users/153190?format=json", "institution": "Michigan State University"}, {"id": 76698, "fullname": "Muhammad Haris Khan", "url": "http://cvpr.thecvf.com/api/miniconf/users/76698?format=json", "institution": "Mohamed Bin Zayed University of Artificial Intelligence"}], "abstract": "Prompt tuning of large-scale vision-language models such as CLIP enables efficienttask adaptation without updating model weights. However, it often leads to poorconfidence calibration and unreliable predictive uncertainty. We address thisproblem by proposing a calibration framework that enhances predictive reliabilitywhile preserving the geometry of the pretrained CLIP embedding space, which isrequired for robust generalization. Our approach extends the standard cross-entropyloss with two complementary regularizers: (1) a mean\u2013variance margin penalty thatstabilizes inter-class logit margins by maximizing their average while minimizingdispersion, mitigating underconfidence and overconfidence spikes; and (2) a textmoment-matching loss that aligns the first and second moments of tuned textembeddings with their frozen CLIP counterparts, preserving semantic dispersioncrucial for generalization. Through extensive experiments across 7 prompt-tuningmethods and 11 diverse datasets, we demonstrate that our approach significantlyreduces the Expected Calibration Error (ECE) compared to competitive calibrationtechniques on both base and novel classes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38125", "url": null, "sourceid": 41190, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38423, "uid": "51ec0e93aa36ff91430519184ad6f244", "name": "ViLearn: Accelerating Training Convergence of Image-to-3D Generation via Visibility Learning", "authors": [{"id": 152112, "fullname": "Rui Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/152112?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 70631, "fullname": "Jianfeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70631?format=json", "institution": "NUS"}, {"id": 89549, "fullname": "Jing Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/89549?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 185269, "fullname": "Xuanyu Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185269?format=json", "institution": "ByteDance Inc."}, {"id": 152113, "fullname": "Yixun Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152113?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 130380, "fullname": "Guan Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/130380?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 157911, "fullname": "Xiu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/157911?format=json", "institution": "Bytedance"}, {"id": 189840, "fullname": "Zeming Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189840?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 126273, "fullname": "Ping Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/126273?format=json", "institution": "Hong Kong University of Science and Technology"}], "abstract": "Single-image-to-3D shape generation has seen remarkable progress, driven by latent diffusion models trained on the compressed latent space of 3D VAEs. However, the task remains intrinsically ill-posed: recovering complete 3D geometry\u2014especially occluded surfaces\u2014from a single view is inherently ambiguous. Existing VecSet-based approaches further exacerbate this challenge by treating shape tokens as an unordered set without explicit positional encoding. This design forces diffusion models to simultaneously learn visible correspondences from the input image and hallucinate invisible geometry within a large, permutation-invariant token space, where the lack of structure significantly hinders training efficiency and convergence stability.To address this, we propose \\textit{Visibility Learning}, a training paradigm that injects visibility structure and positional inductive bias into the image-to-3D pipeline. Our method comprises two synergistic components: (1) \\textit{Visibility Grouping} (VG), which explicitly partitions VecSet tokens into visible and invisible subsets by exploiting the spatial locality of VecSet VAE decoders; and (2) \\textit{Visibility-Aware Positional Encoding} (VAPE), which assigns shared positional embeddings to image tokens and visible shape tokens to amplify their correspondence, while using distinct encodings for invisible tokens to guide hallucination. By explicitly disentangling visible reconstruction from invisible hallucination, our approach shrinks the effective hypothesis space and provides clear structural guidance for diffusion models. Extensive experiments demonstrate that \\textit{Visibility Learning} accelerates training convergence by up to \\textcolor{red}{4.4$\\times$} while achieving superior generation quality compared to strong VecSet-based baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38423", "url": null, "sourceid": 31527, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37722, "uid": "195e177726dca8a0914e3720f7445285", "name": "TAP: A Token-Adaptive Predictor Framework for Training-Free Diffusion Acceleration", "authors": [{"id": 188085, "fullname": "Haowei Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188085?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 155423, "fullname": "Tingxuan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155423?format=json", "institution": "Northeastern University"}, {"id": 154816, "fullname": "XING WANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/154816?format=json", "institution": "ByteDance"}, {"id": 188086, "fullname": "Tianyu Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188086?format=json", "institution": null}, {"id": 188087, "fullname": "Jiexi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188087?format=json", "institution": "ByteDance Inc."}, {"id": 188088, "fullname": "Weifeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188088?format=json", "institution": null}, {"id": 188089, "fullname": "Xurui Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188089?format=json", "institution": "ByteDance Inc."}, {"id": 188090, "fullname": "Fangmin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188090?format=json", "institution": null}, {"id": 86082, "fullname": "Jun-Hai Yong", "url": "http://cvpr.thecvf.com/api/miniconf/users/86082?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 127348, "fullname": "Bin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127348?format=json", "institution": "Tsinghua University"}], "abstract": "Diffusion models achieve strong generative performance but remain slow at inference due to the need for repeated full-model denoising passes. We present Token-Adaptive Predictor (TAP), a training-free, probe-driven framework that adaptively selects a predictor for each token at every sampling step. TAP uses a single full evaluation of the model's first layer as a low-cost probe to compute proxy losses for a compact family of candidate predictors (instantiated primarily with Taylor expansions of varying order and horizon), then assigns each token the predictor with the smallest proxy error. This per-token \u201cprobe-then-select\u2019\u2019 strategy exploits heterogeneous temporal dynamics, requires no additional training, and is compatible with various predictor designs. TAP incurs negligible overhead while enabling large speedups with little or no perceptual quality loss. Extensive experiments across multiple diffusion architectures and generation tasks show that TAP substantially improves the accuracy\u2013efficiency frontier compared to fixed global predictors and caching-only baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37722", "url": null, "sourceid": 42708, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37511, "uid": "c6e986623dc64e92d943aa147603ad49", "name": "Think Before You Drive: World Model-Inspired Multimodal Grounding", "authors": [{"id": 186171, "fullname": "Haicheng Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186171?format=json", "institution": "University of Macau"}, {"id": 187609, "fullname": "Huanming Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187609?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 187610, "fullname": "Bonan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187610?format=json", "institution": "University of Macau"}, {"id": 104448, "fullname": "yong kang li", "url": "http://cvpr.thecvf.com/api/miniconf/users/104448?format=json", "institution": "Purdue University"}, {"id": 184167, "fullname": "Yihong Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184167?format=json", "institution": "ServiceNow Inc; McGill University, McGill University"}, {"id": 187611, "fullname": "Chengyue Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187611?format=json", "institution": "University of Macau"}, {"id": 187612, "fullname": "Dingyi Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187612?format=json", "institution": null}, {"id": 186175, "fullname": "Kehua Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186175?format=json", "institution": "University of Washington"}, {"id": 187613, "fullname": "HAI YANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/187613?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 86400, "fullname": "Cheng-Zhong Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86400?format=json", "institution": "University of Macau"}, {"id": 186177, "fullname": "Zhenning Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186177?format=json", "institution": "University of Macao"}], "abstract": "Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods in AD struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder then hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset in AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. Extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses SOTA baselines on DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. Notably, it also shows strong robustness and efficiency in challenging scenes (long-text, multi-agent, ambiguity) and retains superior performance even when trained on 50% of the data. Our anonymous code submission accompanies this paper, and the dataset will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37511", "url": null, "sourceid": 40891, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39936, "uid": "8dc56b3dd5380fcd7402ce0fbc75cb1e", "name": "FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Alignment", "authors": [{"id": 158115, "fullname": "Myunsoo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/158115?format=json", "institution": "Korea University"}, {"id": 158117, "fullname": "Seong-Woong Shim", "url": "http://cvpr.thecvf.com/api/miniconf/users/158117?format=json", "institution": "Korea University"}, {"id": 158118, "fullname": "Byung-Jun Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/158118?format=json", "institution": "Korea University"}], "abstract": "False negatives pose a critical challenge in vision-language pretraining (VLP) due to the many-to-many correspondence between images and texts in large-scale datasets. These false negatives introduce conflicting supervision signals that degrade the learned embedding space and diminish the effectiveness of hard negative sampling. In this paper, we propose FALCON (False-negative Aware Learning of COntrastive Negatives), a learning-based mini-batch construction strategy that adaptively balances the trade-off between hard and false negatives during VLP. Rather than relying on fixed heuristics, FALCON employs a negative mining scheduler that dynamically selects negative samples of appropriate hardness for each anchor instance during mini-batch construction, guided by a proxy for cross-modal alignment improvement. Experimental results demonstrate that FALCON significantly improves performance across three vision-language learning frameworks (ALBEF, BLIP-2, SigLIP-2) and a broad range of downstream tasks and evaluation settings, underscoring its effectiveness and robustness in mitigating the impact of false negatives.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39936", "url": "https://github.com/ku-dmlab/FALCON", "sourceid": 39118, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65755, "file": "/media/PosterPDFs/CVPR%202026/39936.png", "modified": "2026-04-30T02:09:25.373886-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": true, "sortkey": 0, "is_live_content": false, "detailed_kind": "", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65756, "file": "/media/PosterPDFs/CVPR%202026/39936-thumb.png", "modified": "2026-04-30T02:09:25.607933-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": false, "sortkey": 0, "is_live_content": false, "detailed_kind": "thumb", "generated_from": null, "resourcetype": "EventmediaImageFile"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39559, "uid": "6b9ff2319b3a1f076a994924fb764ee3", "name": "Differentiable Laplacian Matrix Guided Superpixel Segmentation", "authors": [{"id": 180692, "fullname": "Jeremy Juybari", "url": "http://cvpr.thecvf.com/api/miniconf/users/180692?format=json", "institution": "University of Maine"}, {"id": 192344, "fullname": "Joshua Hamilton", "url": "http://cvpr.thecvf.com/api/miniconf/users/192344?format=json", "institution": "University of Maine"}, {"id": 192345, "fullname": "Shuvra Das", "url": "http://cvpr.thecvf.com/api/miniconf/users/192345?format=json", "institution": "University of Maine"}, {"id": 152892, "fullname": "Chaofan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/152892?format=json", "institution": "University of Maine"}, {"id": 192346, "fullname": "Andre Khalil", "url": "http://cvpr.thecvf.com/api/miniconf/users/192346?format=json", "institution": "University of Maine"}, {"id": 175771, "fullname": "Yifeng Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175771?format=json", "institution": "University of Maine"}], "abstract": "Superpixels partition an image into perceptually coherent regions, reducing the cost of downstream vision tasks. Modern deep learning methods excel at superpixel generation but often yield irregular boundaries and isolated pixels, necessitating non-differentiable post-processing to enforce connectivity. This undermines the end-to-end learning capabilities. We propose a simple, fully differentiable graph-Laplacian loss that encourages spatial regularity and connectivity during training. The loss is model-agnostic and can be seamlessly integrated into the training of existing architectures to improve the quality of superpixels. In addition, we introduce two novel metrics, the average stray pixel count and excess component count, to measure the quality of superpixels. We demonstrate both qualitative and quantitative improvements over state-of-the-art methods with and without enforced connectivity. Our approach represents a significant step toward eliminating non-differentiable post-processing.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39559", "url": null, "sourceid": 41490, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40374?format=json"], "related_events_ids": [40374]}, {"id": 40374, "uid": "6b9ff2319b3a1f076a994924fb764ee3", "name": "Differentiable Laplacian Matrix Guided Superpixel Segmentation", "authors": [{"id": 180692, "fullname": "Jeremy Juybari", "url": "http://cvpr.thecvf.com/api/miniconf/users/180692?format=json", "institution": "University of Maine"}, {"id": 192344, "fullname": "Joshua Hamilton", "url": "http://cvpr.thecvf.com/api/miniconf/users/192344?format=json", "institution": "University of Maine"}, {"id": 192345, "fullname": "Shuvra Das", "url": "http://cvpr.thecvf.com/api/miniconf/users/192345?format=json", "institution": "University of Maine"}, {"id": 152892, "fullname": "Chaofan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/152892?format=json", "institution": "University of Maine"}, {"id": 192346, "fullname": "Andre Khalil", "url": "http://cvpr.thecvf.com/api/miniconf/users/192346?format=json", "institution": "University of Maine"}, {"id": 175771, "fullname": "Yifeng Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175771?format=json", "institution": "University of Maine"}], "abstract": "Superpixels partition an image into perceptually coherent regions, reducing the cost of downstream vision tasks. Modern deep learning methods excel at superpixel generation but often yield irregular boundaries and isolated pixels, necessitating non-differentiable post-processing to enforce connectivity. This undermines the end-to-end learning capabilities. We propose a simple, fully differentiable graph-Laplacian loss that encourages spatial regularity and connectivity during training. The loss is model-agnostic and can be seamlessly integrated into the training of existing architectures to improve the quality of superpixels. In addition, we introduce two novel metrics, the average stray pixel count and excess component count, to measure the quality of superpixels. We demonstrate both qualitative and quantitative improvements over state-of-the-art methods with and without enforced connectivity. Our approach represents a significant step toward eliminating non-differentiable post-processing.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40374", "url": null, "sourceid": -41490, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39559?format=json"], "related_events_ids": [39559]}, {"id": 39450, "uid": "c4202309735b5048a6579de2a879e4e1", "name": "LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving", "authors": [{"id": 143724, "fullname": "Long Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/143724?format=json", "institution": "University T\u00fcbingen"}, {"id": 169611, "fullname": "Micha Fauth", "url": "http://cvpr.thecvf.com/api/miniconf/users/169611?format=json", "institution": "University of Tuebingen"}, {"id": 192101, "fullname": "Bernhard Jaeger", "url": "http://cvpr.thecvf.com/api/miniconf/users/192101?format=json", "institution": "KE:SAI \u2013 KYUTAI ELLIS Scalable Autonomous Intelligence gGmbH"}, {"id": 141169, "fullname": "Daniel Dauner", "url": "http://cvpr.thecvf.com/api/miniconf/users/141169?format=json", "institution": "Eberhard-Karls-Universit\u00e4t T\u00fcbingen"}, {"id": 106666, "fullname": "Maximilian Igl", "url": "http://cvpr.thecvf.com/api/miniconf/users/106666?format=json", "institution": "NVIDIA"}, {"id": 69174, "fullname": "Andreas Geiger", "url": "http://cvpr.thecvf.com/api/miniconf/users/69174?format=json", "institution": "University of T\u00fcbingen"}, {"id": 188187, "fullname": "Kashyap Chitta", "url": "http://cvpr.thecvf.com/api/miniconf/users/188187?format=json", "institution": "NVIDIA"}], "abstract": "Simulation-generated datasets for autonomous driving rely on omniscient data collection 'expert' policies, which use unobservable scene information (e.g., from occluded regions) to make driving decisions.When such data is used for end-to-end policy training, it results in an information asymmetry between the expert and the 'learner' policy, which has limited sensor coverage and navigational intent information compared to the expert. We show that this asymmetry leads to a significant drop in the performance of the learner. To combat this, we present LEAD, a new high-quality synthetic dataset collected in the CARLA simulator with three key improvements.(1) The expert minimizes its use of unobservable information by removing entities from its input state that would be occluded in the learner's field of view.By providing the learner with (2) detailed driver intent information and (3) rich sensor modalities (cameras, LiDARs, radars, and odometry), the dataset narrows down the information gap between the learner and expert. We then propose TransFuser v6 (TFv6), a simple end-to-end learner policy trained on LEAD.As a result of our improvements, TFv6 substantially advances the state of the art on all publicly available CARLA closed-loop driving benchmarks, reaching driving scores of 95 on Bench2Drive, 62 on Longest6 v2, and 15 on the Town13 validation routes.Finally, we aggregate the LEAD dataset with several public real-world datasets under a unified repository to enable cross-dataset evaluation.We show that pre-training TFv6 on synthetic data from LEAD leads to consistent performance gains when followed by fine-tuning with real data from the NAVSIM v1, NAVSIM v2, and WOD-E2E benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39450", "url": null, "sourceid": 41376, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37798, "uid": "4295ef9278970f7b412f413a1de942fc", "name": "Reasoning Diffusion for Unpaired Text-Image to Video Generation", "authors": [{"id": 188292, "fullname": "Zirui Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188292?format=json", "institution": "Tsinghua University"}, {"id": 77260, "fullname": "Xin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77260?format=json", "institution": "Tsinghua University"}, {"id": 188293, "fullname": "Yipeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188293?format=json", "institution": "Computer Science and Technology, Tsinghua University, Tsinghua University"}, {"id": 102575, "fullname": "Hong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/102575?format=json", "institution": null}, {"id": 90363, "fullname": "Kecheng Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90363?format=json", "institution": "Ant Group"}, {"id": 84883, "fullname": "Wenwu Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84883?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Text-image to video generation aims to synthesize a video conditioned on the given text-image inputs. Nevertheless, existing methods generally assume that the semantic information carried in the input text and image tends to be perfectly paired and temporally aligned, occurring simultaneously in the generated video. As such, existing literature struggles with ``unpaired'' text-image inputs in the more universal and realistic scenario where i) the semantic information carried by the text and image may occur at different timestamps and ii) the condition image can appear at an arbitrary position rather than the first frame of the synthesized video. Video generation under this unpaired setting poses an urgent need to conduct reasoning over the intrinsic connections between the given textual description and referred image, which is challenging and remains unexplored. To address the challenge, in this paper we study the problem of unpaired text-image to video generation for the first time, proposing ReasonDiff, a novel model for accurate video generation from unpaired text-image inputs. Specifically, ReasonDiff designs a VisionNarrator module to harness the powerful reasoning abilities of a multi-modal large language model to analyze the conditioned unpaired text-image inputs, producing coherent per-frame narratives that temporally align them. Building upon this VisionNarrator module, ReasonDiff further introduces a novel AlignFormer module, which employs a Multi-stage Temporal Anchor Attention mechanism to predict frame-wise latent representations. These reasoning-enhanced latents are subsequently fused with the condition frame, providing structured guidance throughout the video generation process. Extensive experiments and ablation studies demonstrate that ReasonDiff significantly beats state-of-the-art baselines in terms of video generation quality with unpaired text-image inputs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37798", "url": null, "sourceid": 40576, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37577, "uid": "db120c26cd221825b31fcdb62f740192", "name": "CORE: Compact Object-centric REpresentations as a New Paradigm for Token Merging in LVLMs", "authors": [{"id": 187762, "fullname": "Jingyu Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/187762?format=json", "institution": "Zhejiang University"}, {"id": 86591, "fullname": "Gaoang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86591?format=json", "institution": "Zhejiang University"}, {"id": 187763, "fullname": "Der-Horng Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/187763?format=json", "institution": "Zhejiang University"}], "abstract": "Large Vision-Language Models (LVLMs) usually suffer from prohibitive computational and memory costs due to the quadratic growth of visual tokens with image resolution. Existing token compression methods, while varied, often lack a high-level semantic understanding, leading to suboptimal merges, information redundancy, or context loss. To address these limitations, we introduce CORE (Compact Object-centric REpresentations), a new paradigm for visual token compression. CORE leverages an efficient segmentation decoder to generate object masks, which serve as a high-level semantic prior to guide the merging of visual tokens into a compact set of object-centric representations. Furthermore, a novel centroid-guided sorting mechanism restores a coherent spatial order to the merged tokens, preserving vital positional information. Extensive experiments show that CORE not only establishes a new state-of-the-art on six authoritative multimodal benchmarks for fixed-rate compression, but also achieves dramatic efficiency gains in adaptive-rate settings. Even under extreme compression, after aggressively retaining with only 2.2% of all visual tokens, CORE still maintains 97.4% of baseline performance. Our work demonstrates the superiority of object-centric representations for efficient and effective LVLM processing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37577", "url": null, "sourceid": 34897, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37524, "uid": "0af373f9c874b826b6891a67bd6320e4", "name": "CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning", "authors": [{"id": 152900, "fullname": "Darshan Singh S", "url": "http://cvpr.thecvf.com/api/miniconf/users/152900?format=json", "institution": "Google DeepMind"}, {"id": 75518, "fullname": "Arsha Nagrani", "url": "http://cvpr.thecvf.com/api/miniconf/users/75518?format=json", "institution": "Google "}, {"id": 187637, "fullname": "Kawshik Manikantan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187637?format=json", "institution": "Google DeepMind"}, {"id": 187638, "fullname": "Harman Singh", "url": "http://cvpr.thecvf.com/api/miniconf/users/187638?format=json", "institution": "University of California, Berkeley"}, {"id": 187639, "fullname": "Dinesh Tewari", "url": "http://cvpr.thecvf.com/api/miniconf/users/187639?format=json", "institution": "Google DeepMind"}, {"id": 187640, "fullname": "Tobias Weyand", "url": "http://cvpr.thecvf.com/api/miniconf/users/187640?format=json", "institution": "Google DeepMind"}, {"id": 69185, "fullname": "Cordelia Schmid", "url": "http://cvpr.thecvf.com/api/miniconf/users/69185?format=json", "institution": "Inria / Google"}, {"id": 85993, "fullname": "Anelia Angelova", "url": "http://cvpr.thecvf.com/api/miniconf/users/85993?format=json", "institution": "Google"}, {"id": 187641, "fullname": "Shachi Dave", "url": "http://cvpr.thecvf.com/api/miniconf/users/187641?format=json", "institution": "Google DeepMind"}], "abstract": "Recent advancements in video models have shown tremendous progress, particularly in long video understanding. However, current benchmarks predominantly feature western-centric data and English as the dominant language, introducing significant biases in evaluation. To address this, we introduce CURVE, a challenging benchmark for multicultural and multilingual video reasoning. CURVE comprises high-quality, entirely human-generated annotations from diverse, region-specific cultural videos across 18 global locales. Unlike prior work that relies on automatic translations, CURVE provides complex questions, answers, and multi-step reasoning steps, all crafted in native languages. Making progress on CURVE requires a deeply situated understanding of visual cultural context. Furthermore, we leverage CURVE's reasoning traces to construct evidence-based graphs and propose a novel iterative strategy using these graphs to identify fine-grained errors in reasoning. Our evaluations reveal that SoTA Video-LLMs struggle significantly, performing substantially below human-level accuracy, with errors primarily stemming from the visual perception of cultural elements. We will release CURVE to foster the development of more equitable and capable multimodal foundation models.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37524", "url": null, "sourceid": 37285, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40094, "uid": "8efa7d8b50a797e8a598d9b5e2805001", "name": "A Causal Marriage between VLM and IRM from Understanding to Reasoning", "authors": [{"id": 73559, "fullname": "Ziliang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/73559?format=json", "institution": "Pengcheng Laboratory"}, {"id": 193495, "fullname": "Tianang Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193495?format=json", "institution": "polyic"}, {"id": 147015, "fullname": "jusheng zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/147015?format=json", "institution": "National University of Singapore; SUN YAT-SEN UNIVERSITY"}, {"id": 153473, "fullname": "Yongsen Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/153473?format=json", "institution": "Nanyang Technological University"}, {"id": 153470, "fullname": "Yang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153470?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 193496, "fullname": "Zhao-Rong Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/193496?format=json", "institution": "Jinan University"}, {"id": 75470, "fullname": "Liang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/75470?format=json", "institution": "Sun Yat-sen University"}], "abstract": "Vision-Language Models (VLMs) like CLIP exhibit extraordinary out-of-distribution (OOD) generalization, while the theoretical foundations underlying this robustness remain largely unexplored. This work establishes a connection between CLIP and Invariant Risk Minimization (IRM), the principled paradigm to overcome OOD problems, through token-level causal representation learning. Our key insight is that CLIP's contrastive objective, when optimally trained, recovers modality-invariant causal factors at the word-and-phrase granularity. By decomposing text prompts into class-specific tokens (causal factors) and class-agnostic context tokens (environmental factors), we prove that a vocabulary-constrained InfoNCE objective becomes formally equivalent to IRM's invariance criterion. Grounded in this equivalence, we propose a mid-training paradigm aiming to inject invariant learning signals into pre-trained CLIP without architectural modification, yielding CLIP-IRM with superior OOD performance. We further extend this causal alignment to multimodal reasoning via using CLIP-IRM's invariant alignment scores as process-level rewards in reinforcement learning, effectively transplanting IRM's guarantees to robust sequential decision-making in Multimodal Large Language Models. Extensive experiments validate our theoretical framework and present substantial improvements in both multimodal OOD understanding and reasoning tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40094", "url": null, "sourceid": 46710, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36406, "uid": "9f2df8e8663790ab3ee45e4facc17908", "name": "GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering", "authors": [{"id": 184971, "fullname": "Xincheng Shuai", "url": "http://cvpr.thecvf.com/api/miniconf/users/184971?format=json", "institution": "Fudan University"}, {"id": 184972, "fullname": "Ziye Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184972?format=json", "institution": "Fudan University"}, {"id": 76198, "fullname": "Henghui Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/76198?format=json", "institution": "Fudan University"}, {"id": 73994, "fullname": "Dacheng Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/73994?format=json", "institution": "Nanyang Technological University"}], "abstract": "Generating accurate glyphs for visual text rendering is essential yet challenging. Existing methods typically enhance text rendering by training on a large amount of high-quality scene text images, but the limited coverage of glyph variations and excessive stylization often compromise glyph accuracy, especially for complex or out-of-domain characters. Some methods leverage reinforcement learning to alleviate this issue, yet their reward models usually depend on text recognition systems that are insensitive to fine-grained glyph errors, so images with incorrect glyphs may still receive high rewards. Inspired by Direct Preference Optimization (DPO), we propose ***GlyphPrinter***, a preference-based text rendering method that eliminates reliance on explicit reward models. However, the standard DPO objective only models overall preference between two samples, which is insufficient for visual text rendering where glyph errors typically occur in localized regions. To address this issue, we construct the ***GlyphCorrector*** dataset with region-level glyph preference annotations and propose ***Region-Grouped DPO*** (***R-GDPO***), a region-based objective that optimizes inter- and intra-sample preferences over annotated regions, substantially enhancing glyph accuracy. Furthermore, we introduce ***Regional Reward Guidance***, an inference strategy that samples from an optimal distribution with controllable glyph accuracy. Extensive experiments demonstrate that the proposed GlyphPrinter outperforms existing methods in glyph accuracy while maintaining a favorable balance between stylization and precision.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36406", "url": null, "sourceid": 37860, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37942, "uid": "8de91ccecaa259579f4184ce2f632609", "name": "cryoSENSE: Compressive Sensing Enables High-throughput Microscopy with Sparse and Generative Priors on the Protein Cryo-EM Image Manifold", "authors": [{"id": 188646, "fullname": "Zain Shabeeb", "url": "http://cvpr.thecvf.com/api/miniconf/users/188646?format=json", "institution": "Georgia Institute of Technology"}, {"id": 182381, "fullname": "Daniel Saeedi", "url": "http://cvpr.thecvf.com/api/miniconf/users/182381?format=json", "institution": "Georgia Institute of Technology"}, {"id": 188647, "fullname": "Darin Tsui", "url": "http://cvpr.thecvf.com/api/miniconf/users/188647?format=json", "institution": "Georgia Institute of Technology"}, {"id": 188648, "fullname": "Vida Jamali", "url": "http://cvpr.thecvf.com/api/miniconf/users/188648?format=json", "institution": "Georgia Institute of Technology"}, {"id": 188649, "fullname": "Amirali Aghazadeh", "url": "http://cvpr.thecvf.com/api/miniconf/users/188649?format=json", "institution": "Georgia Institute of Technology"}], "abstract": "Cryo-electron microscopy (cryo-EM) enables the atomic-resolution visualization of biomolecules; however, modern direct detectors generate data volumes that far exceed the available storage and transfer bandwidth, thereby constraining practical throughput. We introduce cryoSENSE, the computational realization of a hardware-software co-designed framework for compressive cryo-EM sensing and acquisition. We show that cryo-EM images of proteins lie on low-dimensional manifolds that can be independently represented using sparse priors in predefined bases and generative priors captured by a denoising diffusion model. cryoSENSE leverages these low-dimensional manifolds to enable faithful image reconstruction from spatial and Fourier-domain undersampled measurements while preserving downstream structural resolution. In experiments, cryoSENSE increases acquisition throughput by up to 2.5$\\times$ while retaining the original 3D resolution, offering controllable trade-offs between the number of masked measurements and the level of downsampling. Sparse priors favor faithful reconstruction from Fourier-domain measurements and moderate compression, whereas generative diffusion priors achieve accurate recovery from pixel-domain measurements and more severe undersampling.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37942", "url": null, "sourceid": 41046, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38211, "uid": "9cdbcc9724d284c630cf44c29ba6270a", "name": "FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain", "authors": [{"id": 189326, "fullname": "YuAn Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189326?format=json", "institution": "Baidu"}, {"id": 157000, "fullname": "Xiaofan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/157000?format=json", "institution": "BAIDU Inc,"}, {"id": 189327, "fullname": "Chi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189327?format=json", "institution": "Baidu"}, {"id": 189328, "fullname": "Wenhao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189328?format=json", "institution": "Nanjing University"}, {"id": 189329, "fullname": "Hao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189329?format=json", "institution": "Baidu"}, {"id": 189330, "fullname": "Bosheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189330?format=json", "institution": null}, {"id": 189331, "fullname": "Xun Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/189331?format=json", "institution": null}, {"id": 189332, "fullname": "Jun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189332?format=json", "institution": "Baidu"}], "abstract": "In controllable driving-scene reconstruction and 3D scene generation, maintaining geometric fidelity while synthesizing visually plausible appearance under large viewpoint shifts is crucial. However, effective fusion of geometry-based 3DGS and appearance-driven diffusion models faces inherent challenges, as the absence of pixel-wise, 3D-consistent editing criteria often leads to over-restoration and geometric drift. To address these issues, we introduce **FaithFusion**, a 3DGS-diffusion fusion framework driven by pixel-wise Expected Information Gain (EIG). EIG acts as a unified policy for coherent spatio-temporal synthesis: it guides diffusion as a spatial prior to refine high-uncertainty regions, while its pixel-level weighting distills the edits back into 3DGS. The resulting plug-and-play system is free from extra prior conditions and structural modifications. Extensive experiments on the Waymo dataset demonstrate that our approach attains SOTA performance across NTA-IoU, NTL-IoU, and FID, maintaining an FID of 107.47 even at 6 meters lane shift.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38211", "url": null, "sourceid": 45389, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38462, "uid": "cea735b5027e1808f8c3960cc30a7210", "name": "RADAR: VQ-VAE decoder of VAR is a good student for Restoring Against Degradation by Acceleration", "authors": [{"id": 133293, "fullname": "Ziyang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/133293?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 189902, "fullname": "Yue Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189902?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 184054, "fullname": "Mingdao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184054?format=json", "institution": "Tsinghua University"}, {"id": 189903, "fullname": "Yasen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189903?format=json", "institution": "Xiaomi Corporation"}, {"id": 189904, "fullname": "Teer Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/189904?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 189905, "fullname": "Yu Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/189905?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 149038, "fullname": "Xueming LI", "url": "http://cvpr.thecvf.com/api/miniconf/users/149038?format=json", "institution": "Beijing University of Posts and Telecommunications"}], "abstract": "Visual Autoregressive Modeling (VAR) has recently emerged as a powerful paradigm for image generation that surpasses diffusion models in efficiency and quality. However, accelerating attention computation in VAR is still challenging because attention patterns across scales exhibit strong and complex semantic biases that early coarse-scale tokens dominate global structure, while fine-scale tokens mainly refine local details. Existing acceleration methods rely on heuristic token pruning or fixed attention masks, lacking a principled way to balance acceleration and semantic fidelity. In this work, we propose a two-stage acceleration framework for VAR. First, we introduce a semantic-cost-aware masking strategy (SCA-Mask) that quantifies the importance of each attention tile and formulates mask shape design as a cost-constrained optimization problem. This enables adaptive pruning under a given compute budget while preserving essential semantic context. Second, we present Post-Acceleration Adaptation (PAA), a decoder-side fine-tuning scheme that employs internal knowledge distillation to restore image quality from pruned latents. PAA does not require external data and uses a lightweight LoRA-based adaptation, providing a highly efficient alternative to retraining the autoregressive transformer. Comprehensive experiments across multiple VAR tasks demonstrate that our method achieves decent speedup with negligible loss of visual fidelity, yielding a principled and effective pathway toward fast and high-quality visual autoregressive generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38462", "url": null, "sourceid": 43656, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40244, "uid": "6dd7c703068fe3f6d84b2d6d9da3a93f", "name": "MicroFM: Physics-guided Flow Matching for Isotropic Microscopy Reconstruction", "authors": [{"id": 193863, "fullname": "Xingzu Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193863?format=json", "institution": "Shenzhen University; Carnegie Mellon University"}, {"id": 158971, "fullname": "Runmin Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158971?format=json", "institution": "Carnegie Mellon University"}, {"id": 183329, "fullname": "Vatsal Gupta", "url": "http://cvpr.thecvf.com/api/miniconf/users/183329?format=json", "institution": "Carnegie Mellon University"}, {"id": 193864, "fullname": "Tanush Swaminathan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193864?format=json", "institution": "Carnegie Mellon"}, {"id": 193865, "fullname": "Yanwen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193865?format=json", "institution": "Carnegie Mellon University"}, {"id": 193866, "fullname": "Genpei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193866?format=json", "institution": "CMU, Carnegie Mellon University"}, {"id": 193867, "fullname": "Haili Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193867?format=json", "institution": "New York University"}, {"id": 92284, "fullname": "Min Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/92284?format=json", "institution": "Carnegie Mellon University"}], "abstract": "Isotropic microscopy reconstruction remains challenging because the anisotropic point spread function in optical systems yields much poorer axial resolution and hampers accurate 3D analysis. Hardware strategies can approach isotropy, yet they are complex, costly, susceptible to sidelobes, and introduce phototoxicity. Deep learning based approaches reduce acquisition burden, but common synthetic pipelines blur with Gaussian kernels that do not match the physical degradation, and many methods lack explicit volumetric geometry constraints since they process 2D slices independently. These gaps lead to low-fidelity reconstructions. To address these challenges, we present MicroFM, which synthesizes realistic training data using physical PSFs matched to the target microscope. MicroFM also introduces the first flow-matching framework for 3D microscopy reconstruction, guided by a continuous implicit geometry prior to achieve high-fidelity isotropic recovery. Across four fluorescence microscopy systems and datasets, MicroFM achieves state-of-the-art performance, producing sharper structures, more isotropic spectra, and substantial gains in both full-reference and no-reference metrics.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40244", "url": null, "sourceid": 33026, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36870, "uid": "8af69d2baafa5a3a9bc1bcdecec02add", "name": "Nonlinear Color Transfer via Learnable Bezier Flows", "authors": [{"id": 186067, "fullname": "Junhyoung Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/186067?format=json", "institution": "easywith"}, {"id": 186068, "fullname": "Seongwoon Jo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186068?format=json", "institution": "logp(x) lab, MODULABS"}, {"id": 186069, "fullname": "Jeong-Hun Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/186069?format=json", "institution": "Easywith"}, {"id": 186070, "fullname": "Yeonji Ryou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186070?format=json", "institution": "Easywith"}, {"id": 186071, "fullname": "Jeongha Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186071?format=json", "institution": null}, {"id": 186072, "fullname": "Jangho Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/186072?format=json", "institution": "Kookmin University"}], "abstract": "Color transfer aims to match the color distribution of a content image (source) to that of a style image (target) while preserving structure and perceptual realism. Yet modulation-based flow models such as ModFlows often produce trajectory misalignment and artifacts because they rely on strictly linear transport paths. We propose NCT, a nonlinear color transfer framework that replaces linear paths with Bezier trajectories, enabling smooth, nonlinear, and perceptually coherent color transfer. This parameterization lets the transport bend toward plausible intermediate color regimes, improving content\u2013style alignment and reducing chromatic distortion. We further incorporate a Mixture of Experts (MoE) module in the encoder to select trajectory experts for different chromatic regimes, improving generalization to heterogeneous data with complex illumination and materials. Experiments show that NCT reduces artifacts and achieves more stable color transfer than prior flow-based methods, especially on 3D-rendered or highly textured images. The code is provided in supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36870", "url": null, "sourceid": 42622, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38249, "uid": "a7572e51d89da5dcfc00ce1c2e20e86c", "name": "Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species", "authors": [{"id": 174862, "fullname": "Jinyu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/174862?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 174874, "fullname": "Tianqi Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/174874?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 181695, "fullname": "Xiaonan Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181695?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 189419, "fullname": "Letian Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/189419?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 189420, "fullname": "Songliang Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189420?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 189421, "fullname": "Meng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189421?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 128704, "fullname": "Hao Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128704?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Visually cataloging and quantifying the natural world requires pushing the boundaries of both detailed visual classification and counting at scale. Despite significant progress, particularly in crowd and traffic analysis, the fine-grained, taxonomy-aware plant counting remains underexplored in vision. In contrast to crowds, plants are complicated by nonrigid morphologies and physical appearance variations across growth stages and environments. Tofill this gap, we present TPC-268, the first plant counting benchmark taking plant taxonomy into account. Our dataset couples instance-level point annotations with complete Linnaean labels (kingdom$\\rightarrow$species) and organ categories, enabling hierarchical reasoning and species-aware evaluation. The datasetfeatures $10,000$ images with $678,090$ point annotations, includes $268$ countable plant categories over $242$ plant species in Plantae and Fungi, and spans observation scales from canopy-level remote sensing imagery to tissue-level microscopy.We follow the problem setting of class-agnostic counting (CAC), provide taxonomy-consistent, scale-aware data splits, and benchmark state-of-the-art regression- and detection-based CAC approaches. By capturing the biodiversity, hierarchical structure, and multi-scale nature of botanical and mycological taxa, TPC-268 provides a biologically grounded testbed to advance fine-grained class-agnostic counting.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38249", "url": null, "sourceid": 32681, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40331?format=json"], "related_events_ids": [40331]}, {"id": 37595, "uid": "a551f123e505d4d5d75058aae10e993b", "name": "UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models", "authors": [{"id": 181722, "fullname": "Hewen Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181722?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 187784, "fullname": "Cong Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/187784?format=json", "institution": "Meituan"}, {"id": 187785, "fullname": "liang da shuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187785?format=json", "institution": null}, {"id": 187786, "fullname": "Zepeng Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187786?format=json", "institution": null}, {"id": 132123, "fullname": "Gaopengfei Gaopengfei", "url": "http://cvpr.thecvf.com/api/miniconf/users/132123?format=json", "institution": "Beijing SanKuai Online Technology Co., Ltd."}, {"id": 152975, "fullname": "Ziqi Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/152975?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 187787, "fullname": "Lulu Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/187787?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 127854, "fullname": "Pengfei Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/127854?format=json", "institution": "Meituan"}, {"id": 84905, "fullname": "Xiaoming Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/84905?format=json", "institution": "Meituan"}, {"id": 84784, "fullname": "Minghui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/84784?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 84790, "fullname": "Shengshan Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84790?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "With the advancement of multi-modal Large Language Models (LLMs), Video LLMs have been further developed to perform on holistic and specialized video understanding. However, existing works are limited to specialized video understanding tasks, failing to achieve a comprehensive and multi-grained video perception. To bridge this gap, we introduce $\\textbf{UFVideo}$, the first Video LLM with $\\textbf{unified multi-grained cooperative understanding}$ capabilities. Specifically, we design unified visual-language guided alignment to flexibly handle video understanding across global, pixel and temporal scales within a single model. UFVideo dynamically encodes the visual and text inputs of different tasks and generates the textual response, temporal localization, or grounded mask. Additionally, to evaluate challenging multi-grained video understanding tasks, we construct the $\\textbf{UFVideo-Bench}$ consisting of three distinct collaborative tasks within the scales, which demonstrates UFVideo's flexibility and advantages over GPT-4o. Furthermore, we validate the effectiveness of our model across 9 public benchmarks covering various common video understanding tasks, providing valuable insights for future Video LLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37595", "url": null, "sourceid": 36103, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37679, "uid": "21dfc13281c485bd21bf5f5539ae01d0", "name": "AVGGT: Rethinking Global Attention for Accelerating VGGT", "authors": [{"id": 181349, "fullname": "Xianbing Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/181349?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 175492, "fullname": "Zhikai Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175492?format=json", "institution": "SJTU"}, {"id": 187992, "fullname": "Zhengyu Lou", "url": "http://cvpr.thecvf.com/api/miniconf/users/187992?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 187993, "fullname": "Bo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187993?format=json", "institution": "Alibaba Group"}, {"id": 187994, "fullname": "Jinyang Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187994?format=json", "institution": "Ant Group"}, {"id": 88759, "fullname": "Liqing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88759?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 187995, "fullname": "He Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187995?format=json", "institution": "Alibaba Group"}, {"id": 187996, "fullname": "Jianfu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187996?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Since DUSt3R, models such as VGGT and $\\pi^3$ have shown strong multi-view 3D performance, but their heavy reliance on global self-attention results in high computational cost. Existing sparse-attention variants offer partial speedups, yet lack a systematic analysis of how global attention contributes to multi-view reasoning. In this paper, we first conduct an in-depth investigation of the global attention modules in VGGT and $\\pi^3$ to better understand their roles. Our analysis reveals a clear division of roles in the alternating global-frame architecture: early global layers do not form meaningful correspondences, middle layers perform cross-view alignment, and last layers provide only minor refinements. Guided by these findings, we propose a training-free two-step acceleration scheme: (1) converting early global layers into frame attention, and (2) subsampling global attention by subsampling K/V over patch tokens with diagonal preservation and a mean-fill component.We instantiate this strategy on VGGT and $\\pi^3$ and evaluate across standard pose and point-map benchmarks. Our method achieves up to $8$-$10\\times$ speedup in inference time while matching or slightly improving the accuracy of the original models, and remains robust even in extremely dense multi-view settings where prior sparse-attention baselines fail.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37679", "url": null, "sourceid": 45622, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38573, "uid": "c828b4e93bd75e8f7307dbddedea6480", "name": "Dual-Prototype-Guided Multi-task Learning for Unsupervised Anomaly Detection and Classification", "authors": [{"id": 190173, "fullname": "Qianhao Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/190173?format=json", "institution": "Dongguan University of Technology"}, {"id": 190174, "fullname": "Jiajia Mi", "url": "http://cvpr.thecvf.com/api/miniconf/users/190174?format=json", "institution": "Guangdong University of Technology"}, {"id": 190175, "fullname": "Mingtao Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190175?format=json", "institution": "Qingdao University"}, {"id": 190176, "fullname": "JingSheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190176?format=json", "institution": null}, {"id": 190177, "fullname": "ShuYang Pang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190177?format=json", "institution": null}, {"id": 181938, "fullname": "Weiling Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181938?format=json", "institution": "Dongguan University of Technology"}], "abstract": "Unsupervised Anomaly Detection (UAD) and anomaly classification are frequently used in industrial and medical scenarios. Specifically, UAD identifies anomalous regions at the fine-grained pixel-level, while anomaly classification distinguishes anomaly types at the anomaly region level. However, existing approaches typically treat these tasks independently and sequentially, overlooking the benefits of jointly training them to suppress Local Visual Ambiguity (LVA) caused by the similarities of different types of anomalies in local visual patterns. Moreover, a multi-task learning framework cannot be directly applied to jointly train the two tasks, since UAD and anomaly classification exhibit feature preference incompatibility. To address these limitations, we propose the Prototype-Guided Semi-Supervised Feature Disentanglement (PG-SFD) framework, which makes a paradigm shift from implicit feature sharing to explicit feature disentanglement and explicitly constructs normal and category prototypes to eliminate implicit normal-abnormal semantic coupling via a Dual-Prototype Disentanglement Module (DPRM). Moreover, for cross-task feature differential injection and gradient conflict mitigation, the Differential Gated Interaction (DGI) and Geometry-Regularized Optimization (GRO) are proposed to form a cohesive framework with DPRM. PG-SFD demonstrates high effectiveness in both UAD tasks and weakly supervised classification tasks. Meanwhile, it exhibits stable performance across multiple types of datasets, including industrial and medical datasets, indicating its strong generalizability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38573", "url": null, "sourceid": 42284, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38176, "uid": "3ca5c9417edca73a996f346dd4a69625", "name": "CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal Reasoning", "authors": [{"id": 181751, "fullname": "Yongxin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181751?format=json", "institution": "MBZUAI"}, {"id": 189219, "fullname": "Zhicheng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189219?format=json", "institution": "Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 70644, "fullname": "Meng Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/70644?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 107317, "fullname": "Mingfei Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/107317?format=json", "institution": "MBZUAI; UTS; ByteDance"}, {"id": 189220, "fullname": "Haokun Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189220?format=json", "institution": null}, {"id": 189221, "fullname": "Yingying Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189221?format=json", "institution": "Transsion"}, {"id": 89609, "fullname": "Xiaojun Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89609?format=json", "institution": "University of Technology Sydney"}, {"id": 69930, "fullname": "Xiaodan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/69930?format=json", "institution": "Sun Yat-sen University"}], "abstract": "Group-relative reinforcement learning with verifiable rewards (RLVR) often wastes the most informative data it already has\u2014the failures. When all rollouts are wrong, gradients stall; when one happens to be correct, the update usually ignores why the others are close-but-wrong, and credit can be misassigned to spurious chains. We present CARE (Contrastive Anchored REflection), a failure-centric post-training framework for multimodal reasoning that turns errors into supervision. CARE combines: (i) an anchored-contrastive objective that forms a compact subgroup around the best rollout and a set of semantically proximate hard negatives, performs within-subgroup $z$-score normalization with negative-only scaling, and includes an all-negative rescue to prevent zero-signal batches; and (ii) Reflection-Guided Resampling(RGR), a one-shot structured self-repair that rewrites a representative failure and re-scores it with the same verifier, converting near-misses into usable positives without any test-time reflection. CARE improves accuracy and training smoothness while explicitly increasing the share of learning signal that comes from failures. On Qwen2.5-VL-7B, CARE lifts macro-averaged accuracy by 4.6 points over GRPO across six verifiable visual-reasoning benchmarks; with Qwen3-VL-8B it reaches competitive or state-of-the-art results on MathVista and MMMU-Pro under an identical evaluation protocol.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38176", "url": null, "sourceid": 38034, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36916, "uid": "507828c42f28e5cfc38062b67960699a", "name": "Learned Image Compression via Sparse Attention and Adaptive Frequency", "authors": [{"id": 180721, "fullname": "Huidong Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/180721?format=json", "institution": "Nankai University"}, {"id": 186216, "fullname": "Xinyan Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186216?format=json", "institution": "Nankai University"}, {"id": 186217, "fullname": "Sun Hui", "url": "http://cvpr.thecvf.com/api/miniconf/users/186217?format=json", "institution": "Microsoft; Nankai University"}, {"id": 186218, "fullname": "Xiaofei Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/186218?format=json", "institution": "Beijing Institute of Technology"}, {"id": 186219, "fullname": "xiaoguang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186219?format=json", "institution": "Nankai University"}, {"id": 186220, "fullname": "Gang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186220?format=json", "institution": "Nankai University"}, {"id": 186221, "fullname": "Wentong Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/186221?format=json", "institution": "Nanyang Technological University"}], "abstract": "Learned image compression (LIC) methods surpass traditional algorithms in rate-distortion (RD) performance, but still struggle to optimally balance effectiveness and efficiency. Moreover, many methods often overlook the importance of frequency-domain information. Even the few recent methods that incorporate fixed frequency transforms lack content-adaptive capabilities. Therefore, we propose an efficient spatial-frequency dual-path LIC method. Specifically, for the spatial path, we introduce Cross-Sparse Window Attention, leveraging sparse, window-conditioned global tokens to efficiently model long-range dependencies. It achieves lower computational cost and superior effectiveness than standard Window-based Multi-head Self-attention. For the frequency path, we design a content-adaptive frequency transform, employing a decomposition weight generator and learnable global weights to adaptively process multi-scale frequency components. Furthermore, we propose Denoising-as-Regularizer, a training-only module that structures and smooths the latent representation via a denoising task, enhancing reconstruction quality at zero inference cost. Experiments on the Kodak, CLIC, and Tecnick datasets demonstrate that the proposed method significantly outperforms existing state-of-the-art methods in both RD performance and latency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36916", "url": null, "sourceid": 40820, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37617, "uid": "5468b368b1b5adcbfc133ee8ee94bfa1", "name": "Structural\u2013Semantic Perception for Diffusion-Guided Temporal Forgery Localization", "authors": [{"id": 180322, "fullname": "Ligong Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180322?format=json", "institution": "National University of Defense Technology"}, {"id": 187871, "fullname": "Yeting Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187871?format=json", "institution": "National University of Defense Technology"}, {"id": 187872, "fullname": "Haoang Chi", "url": "http://cvpr.thecvf.com/api/miniconf/users/187872?format=json", "institution": "National University of Defense Technology; Tsinghua University"}], "abstract": "Temporal Forgery Localization (TFL) is crucial for enhancing the interpretability and accountability of deepfake forensics by precisely pinpointing the manipulated segments.However, existing methods face two limitations: (1) localization precision, where one-shot boundary prediction models fail to rectify inherent initial prediction biases, and temporal emphasis overlooks modality-internal semantic forgery cues, resulting in noise-sensitive localization, and (2) cross-dataset generalization, where fixed-scale temporal receptive fields struggle to accommodate varying manipulation durations across real-world scenarios. To address these challenges, we propose a unified framework based on structural\u2013semantic perception and diffusion-guided refinement. The structural\u2013semantic perception comprises two complementary components: (1) structural perception, which adaptively models manipulation durations across varying temporal spans using a designed scale weight allocation network, and (2) semantic perception, which analyzes the semantic consistency within each modality through intra-modal distillation.In this way, it first suppresses low-quality forgery localization proposals, yielding a structurally and semantically reliable candidate set. Then a diffusion-based regression head further iteratively refines the candidates into precise and temporally coherent boundary trajectories.Extensive experiments on multiple TFL benchmarks demonstrate that our method achieves state-of-the-art performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37617", "url": null, "sourceid": 37013, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38660, "uid": "48be364fb41f65675f5074ed6a0354f0", "name": "Structure-to-Intensity Diffusion for Adverse-Weather LiDAR Generation", "authors": [{"id": 174121, "fullname": "Peiyang Ni", "url": "http://cvpr.thecvf.com/api/miniconf/users/174121?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 190412, "fullname": "Longyu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190412?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 130941, "fullname": "Lu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130941?format=json", "institution": "Dalian University of Technology"}, {"id": 190413, "fullname": "Kuniaki Saito", "url": "http://cvpr.thecvf.com/api/miniconf/users/190413?format=json", "institution": "OMRON SINICX"}, {"id": 190414, "fullname": "Yap-Peng Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190414?format=json", "institution": "VinUniversity; Nanyang Technological University"}, {"id": 72996, "fullname": "Fumin Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/72996?format=json", "institution": "UESTC"}, {"id": 84847, "fullname": "Heng Tao Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/84847?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 126913, "fullname": "Xiaofeng Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126913?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 130944, "fullname": "Ping Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130944?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Adverse-weather LiDAR point cloud generation is challenged by complex weather-induced degradations. These degradations affect geometry and reflectance in fundamentally different ways, making joint modeling difficult and ambiguous, especially when diverse real-world training data is limited. To address this, we propose $\\textit{Structure-to-Intensity Diffusion}$ (SiD), a diffusion-based framework that explicitly factorizes the denoising process at each time step: it first reconstructs the geometric structure, then conditions reflectance intensity denoising on the estimated structure. This structure-conditioned design decomposes the joint distribution, reduces modeling ambiguity, and leads to point clouds that are both geometrically coherent and radiometrically realistic. To mitigate data scarcity, we introduce $\\textit{Real-Prior Weather Simulation}$ (RPWS), a degradation module that leverages real-world sensor statistics to synthesize physically plausible adverse-weather point clouds from clear scans. Extensive experiments demonstrate that, with similar model complexity, our approach outperforms the previous state-of-the-art in generating adverse-weather LiDAR scans with both structural and radiometric properties more closely aligned with real-world data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38660", "url": null, "sourceid": 31240, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36405, "uid": "a020312b805eb00036b5b5c530bd4816", "name": "HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars", "authors": [{"id": 184969, "fullname": "Gent Serifi", "url": "http://cvpr.thecvf.com/api/miniconf/users/184969?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 184970, "fullname": "Marcel Buehler", "url": "http://cvpr.thecvf.com/api/miniconf/users/184970?format=json", "institution": "ETHZ - ETH Zurich"}], "abstract": "We introduce HyperGaussians, a novel extension of 3D Gaussian Splatting for high-quality animatable face avatars. While tremendous successes have been achieved for static faces, animatable avatars from dynamic videos still fall in the uncanny valley. The de facto standard, 3D Gaussian Splatting (3DGS), represents a face through a collection of 3D Gaussian primitives. 3DGS excels at rendering static faces, but the state-of-the-art still struggles with nonlinear deformations, complex lighting effects, and fine details. While most related works focus on predicting better Gaussian parameters from expression codes, we rethink the 3D Gaussian representation itself and how to make it more expressive. Our insights lead to a novel extension of 3D Gaussians to high-dimensional multivariate Gaussians, dubbed 'HyperGaussians'. The higher dimensionality increases expressivity through conditioning on a learnable local embedding. However, splatting HyperGaussians is computationally expensive because it requires inverting a high-dimensional covariance matrix. We solve this by reparameterizing the covariance matrix, dubbed the 'inverse covariance trick'. This trick boosts the efficiency so that HyperGaussians can be seamlessly integrated into existing models. To demonstrate this, we plug in HyperGaussians into two state-of-the-art methods for face avatars: FlashAvatar and GaussianHeadAvatar. Our evaluation on 29 subjects from 6 face datasets shows that HyperGaussians outperform 3DGS numerically and visually, particularly for high-frequency details like eyes, teeth, wrinkles, and specular reflections.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36405", "url": null, "sourceid": 36990, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40238, "uid": "b3dcb2651a444aa904bb7c9c5e90ba8c", "name": "H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers", "authors": [{"id": 181040, "fullname": "Ayushi Mehrotra", "url": "http://cvpr.thecvf.com/api/miniconf/users/181040?format=json", "institution": "California Institute of Technology"}, {"id": 193848, "fullname": "Dipkamal Bhusal", "url": "http://cvpr.thecvf.com/api/miniconf/users/193848?format=json", "institution": "Rochester Institute of Technology"}, {"id": 193849, "fullname": "Michael Clifford", "url": "http://cvpr.thecvf.com/api/miniconf/users/193849?format=json", "institution": "Toyota InfoTech Labs        Toyota Motor North America"}, {"id": 193850, "fullname": "Nidhi Rastogi", "url": "http://cvpr.thecvf.com/api/miniconf/users/193850?format=json", "institution": "Rochester Institute of Technology; Rensselaer Polytechnic Institute"}], "abstract": "Feature attribution methods explain the predictions of deep neural networks by assigning importance scores to individual input features. However, most existing methods focus solely on marginal effects, overlooking feature interactions, where groups of features jointly influence model output. Such interactions are especially important in image classification tasks, where semantic meaning often arises from pixel interdependencies rather than isolated features. Existing interaction-based methods for images are either coarse (e.g., superpixel-only) or, fail to satisfy core interpretability axioms. In this work, we introduce H-Sets, a novel two-stage framework for discovering and attributing higher-order feature interactions in image classifiers. First, we detect locally interacting pairs via input Hessians and recursively merge them into semantically coherent sets; segmentation from Segment Anything (SAM) is used as a spatial grouping prior but can be replaced by other segmentations. Second, we attribute each set with IDG-Vis, a set-level extension of Integrated Directional Gradients that integrates directional gradients along pixel-space paths and aggregates them with Harsanyi dividends. While Hessians introduce additional compute at the detection stage, this targeted cost consistently yields saliency maps that are sparser and more faithful. Evaluations across VGG, ResNet, DenseNet and MobileNet models on ImageNet and CUB datasets show that H-Sets generate more interpretable and faithful saliency maps compared to existing methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40238", "url": null, "sourceid": 43140, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39123, "uid": "9ffc7855e8f1aafb9db9c5c1817a59e8", "name": "AGiLe: Learning Robust Long-Horizon Manipulation via Affordance-Grounded Bidirectional Latent Planning", "authors": [{"id": 180160, "fullname": "Zixuan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/180160?format=json", "institution": "Nanjing University"}, {"id": 191403, "fullname": "Xiangrong Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191403?format=json", "institution": "Nanjing university"}, {"id": 191404, "fullname": "Jieqi Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/191404?format=json", "institution": "nanjing university"}, {"id": 130892, "fullname": "Lin Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/130892?format=json", "institution": "National University of Singapore"}, {"id": 88409, "fullname": "Jing Huo", "url": "http://cvpr.thecvf.com/api/miniconf/users/88409?format=json", "institution": "Nanjing University"}, {"id": 86625, "fullname": "Yang Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86625?format=json", "institution": "Nanjing University"}], "abstract": "The robust execution of long-horizon manipulation tasks remains a central challenge in embodied intelligence, necessitating both coherent high-level planning and reliable low-level control. Existing approaches often encounter two critical limitations: the accumulation of prediction errors in subgoal planning, leading to compounding deviations over time; and the planning-execution gap, where high-level abstract plans fail to be effectively grounded in the continuous perception-action space. To address these challenges, we propose a novel unified framework, Affordance-Grounded Bidirectional Latent Planning (AGiLe). AGiLe introduces a bidirectional latent planning mechanism that jointly optimizes a backward planner and a forward critic. The backward planner generates goal-directed subgoals from the final objective, while the forward critic assesses their reachability, thereby ensuring temporal robustness through sustained consistency in long-horizon planning. Furthermore, AGiLe bridges the planning-execution gap by leveraging affordance as structural guidance, grounding abstract subgoals into dense, pixel-level visual affordances that drive action. This approach enhances spatial robustness, enabling the system to effectively adapt to semantic and visual distractors. Extensive empirical evaluations across both simulation and real-world settings confirm that AGiLe significantly outperforms strong baselines, achieving an 8.5% improvement over prior state-of-the-art methods in simulation and demonstrating its effectiveness and robustness in long-horizon manipulation tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39123", "url": null, "sourceid": 30741, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36436, "uid": "98f242e5ddc10fd4d0eaf738ee59407a", "name": "MS^2Gait: A Multi-Scale Spatio-Temporal Fusion Network for LiDAR-based Gait Recognition", "authors": [{"id": 181739, "fullname": "Shenyin Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181739?format=json", "institution": "Wuhan University"}, {"id": 185041, "fullname": "Yishan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185041?format=json", "institution": "Wuhan University"}, {"id": 185042, "fullname": "Xinyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185042?format=json", "institution": "Wuhan University"}, {"id": 181741, "fullname": "Rui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181741?format=json", "institution": "Wuhan University"}, {"id": 86262, "fullname": "Zhongyuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86262?format=json", "institution": "Wuhan University"}, {"id": 87075, "fullname": "Xin Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/87075?format=json", "institution": "Wuhan University"}], "abstract": "3D LiDAR-based gait recognition has gained increasing attention due to its robustness to illumination, privacy preservation, and capability for long-range and non-contact identity verification. However, existing point cloud-based methods suffer from two critical limitations: they fail to model semantically distant correlations across spatial scales and employ simplistic temporal aggregation that cannot handle gait's inherent heterogeneity. To address these limitations, we propose MS^2Gait, a multi-scale spatio-temporal framework tailored for raw point cloud gait recognition. Our Hierarchical Spatial Feature Extraction module introduces four complementary interaction strategies to explicitly capture long-range semantic dependencies and recover structural information under blockage. Additionally, a Similarity-based Temporal Enhancement Transformer strategy leverages multi-scale aggregation to dynamically weight frames based on motion coherence, effectively handling temporal heterogeneity without explicit supervision. Extensive evaluations on SUSTech1K and FreeGait demonstrate that MS^2Gait achieves 93.5% and 83.1% in Rank-1 accuracy, respectively, outperforming prior state-of-the-art methods, while exhibiting significant robustness against non-gait nuisance factors.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36436", "url": null, "sourceid": 37889, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37974, "uid": "c76359a4473736b9e7e6c711727a6c0a", "name": "UVU: Improving Multimodal Understanding via Vision-Language Unified Autoregressive Paradigm", "authors": [{"id": 181939, "fullname": "Zhehan Kan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181939?format=json", "institution": "Tsinghua University"}, {"id": 102222, "fullname": "Xinghua Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/102222?format=json", "institution": "Tencent YouTu Lab"}, {"id": 188718, "fullname": "Yanlin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188718?format=json", "institution": "Tsinghua University"}, {"id": 188719, "fullname": "Xiaochen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188719?format=json", "institution": "University of Glasgow"}, {"id": 106601, "fullname": "ZHIXIANG WEI", "url": "http://cvpr.thecvf.com/api/miniconf/users/106601?format=json", "institution": "University of science and technology of china"}, {"id": 188720, "fullname": "Shifeng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188720?format=json", "institution": "University of Science and Technology of China"}, {"id": 188721, "fullname": "Yubo Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188721?format=json", "institution": "nanjing university"}, {"id": 187652, "fullname": "Qingmin Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187652?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 87900, "fullname": "Wenming Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87900?format=json", "institution": "Tsinghua University,"}, {"id": 127206, "fullname": "Xin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/127206?format=json", "institution": "Tencent Youtu Lab"}, {"id": 127217, "fullname": "Yinsong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127217?format=json", "institution": "Tencent Youtu Lab"}, {"id": 87747, "fullname": "Deqiang Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87747?format=json", "institution": "Tencent YouTu Lab"}, {"id": 126816, "fullname": "Xing Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/126816?format=json", "institution": "Tencent YouTu Lab"}], "abstract": "Despite remarkable advancements in multimodal large language models (MLLMs), their fine-grained visual understanding is constrained by reliance on pure textual supervision. To unify understanding and generation capabilities, unified autoregressive multimodal models introduce visual supervision; however, they impair multimodal understanding due to the effects of visual feature discretization and orthogonality between image-text loss gradients. In this paper, we observe that pixel-level image patches and textual tokens coexist in raw high-dimensional spaces with inherent input symmetry. Motivated by this insight, we propose UVU, a novel vision-language unified autoregressive framework that eschews vector quantization. It uniquely employs continuous visual encoding for lossless representation of visual inputs and proposes a large-scale iterative hierarchical clustering algorithm to construct a pixel-level visual codebook, thereby extending the vocabulary for unified supervision and enabling autoregressive generation of pixel-level image tokens alongside textual tokens. UVU effectively synergizes pixel-level visual perception with semantic-level visual understanding, internalizing visual generation capabilities and, for the first time, unlocking the facilitative role of visual supervision in enhancing understanding. Extensive experiments across multiple tasks demonstrate that MLLMs are capable of achieving superior multimodal understanding performance under the supervised learning paradigm of UVU.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37974", "url": null, "sourceid": 33894, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38943, "uid": "3222a81e0d64425fa749d4cebdfe9179", "name": "HiLoRA: Hierarchical Low-Rank Adaptation for Personalized Federated Learning", "authors": [{"id": 180527, "fullname": "Zihao Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180527?format=json", "institution": "Beijing Normal University"}, {"id": 191019, "fullname": "Nan Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191019?format=json", "institution": "Beijing Normal University"}, {"id": 191020, "fullname": "Jiandian Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191020?format=json", "institution": "Beijing Normal University"}, {"id": 191021, "fullname": "Guo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191021?format=json", "institution": "Beijing Normal University"}, {"id": 191022, "fullname": "Ke Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191022?format=json", "institution": "Beijing Normal University"}, {"id": 191023, "fullname": "Boyuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191023?format=json", "institution": "Zhengzhou University"}, {"id": 191024, "fullname": "Tian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191024?format=json", "institution": null}], "abstract": "Vision Transformers (ViTs) have been widely adopted in vision tasks due to their strong transferability. In Federated Learning (FL), where full fine-tuning is communication-heavy, Low-Rank Adaptation (LoRA) provides an efficient and communication-friendly way to adapt ViTs. However, existing LoRA-based federated tuning methods overlook latent client structures in real-world settings, limiting shared representation learning and hindering generalization to unseen clients. To address this, we propose HiLoRA, a hierarchical LoRA framework that places adapters at three levels: root, cluster, and leaf, each designed to capture global, subgroup, and client-specific knowledge, respectively. Through cross-tier orthogonality and cascaded optimization, HiLoRA separates update subspaces and aligns each tier with its residual personalized objective. In particular, we develop a LoRA-Subspace Adaptive Clustering mechanism that infers latent client groups via subspace similarity analysis, thereby facilitating knowledge sharing across structurally aligned clients. Theoretically, we establish a tier-wise generalization analysis that supports HiLoRA\u2019s design. Experiments on ViT backbones with CIFAR-100 and DomainNet demonstrate consistent improvements in both personalization and generalization.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38943", "url": null, "sourceid": 38225, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37909, "uid": "3632435cf99eec2a53ee7e4d8eeab451", "name": "MV-Fashion: Towards Enabling Virtual Try-On and Size Estimation with Multi-View Paired Data", "authors": [{"id": 164353, "fullname": "Hunor Laczko", "url": "http://cvpr.thecvf.com/api/miniconf/users/164353?format=json", "institution": "Universitat Autonoma de Barcelona"}, {"id": 188565, "fullname": "Libang Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/188565?format=json", "institution": "Universitat de Barcelona"}, {"id": 188566, "fullname": "Phat Truong", "url": "http://cvpr.thecvf.com/api/miniconf/users/188566?format=json", "institution": "University of Barcelona"}, {"id": 188567, "fullname": "Diego Hern\u00e1ndez", "url": "http://cvpr.thecvf.com/api/miniconf/users/188567?format=json", "institution": "Computer Vision Center"}, {"id": 129850, "fullname": "Sergio Escalera", "url": "http://cvpr.thecvf.com/api/miniconf/users/129850?format=json", "institution": "Computer Vision Center"}, {"id": 188568, "fullname": "Jordi Gonz\u00e0lez", "url": "http://cvpr.thecvf.com/api/miniconf/users/188568?format=json", "institution": "Computer Vision Center, Universitat Aut\u00f2noma de Barcelona"}, {"id": 86091, "fullname": "Meysam Madadi", "url": "http://cvpr.thecvf.com/api/miniconf/users/86091?format=json", "institution": "Computer Vision Center, Universitat Aut\u00f2noma de Barcelona"}], "abstract": "Existing 4D human datasets fall short for fashion-specific research, lacking either realistic garment dynamics or task-specific annotations. Synthetic datasets suffer from a realism gap, whereas real-world captures lack the detailed annotations and paired data required for virtual try-on (VTON) and size estimation tasks. To bridge this gap, we introduce MV-Fashion, a large-scale, multi-view video dataset engineered for domain-specific fashion analysis. MV-Fashion features 3,273 sequences (72.5 million frames) from 80 diverse subjects wearing 3-10 outfits each. It is designed to capture complex, real-world garment dynamics, including multiple layers and varied styling (e.g., tucked shirts, rolled sleeves). A core contribution is a rich data representation that includes pixel-level semantic annotations, ground-truth material properties like elasticity, and 3D point clouds. Crucially for VTON applications, MV-Fashion provides paired data: multi-view synchronized captures of worn garments alongside their corresponding flat, catalogue images. We leverage this dataset to establish baselines for fashion-centric tasks, including virtual try-on, clothing size estimation, and novel view synthesis.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37909", "url": null, "sourceid": 31437, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36328, "uid": "19349730ee8ba1777f6620e9eab4e4c4", "name": "Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery", "authors": [{"id": 184780, "fullname": "Yangyang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184780?format=json", "institution": "Hunan Normal University"}, {"id": 184781, "fullname": "Junbo Ke", "url": "http://cvpr.thecvf.com/api/miniconf/users/184781?format=json", "institution": "Hunan Normal University"}, {"id": 184782, "fullname": "You-Wei Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184782?format=json", "institution": "Hunan Normal University"}, {"id": 184783, "fullname": "Chao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184783?format=json", "institution": "Southern University of Science and Technology"}], "abstract": "Tensor Ring (TR) decomposition is a powerful tool for high-order data modeling, but is inherently restricted to discrete forms defined on fixed meshgrids. In this work, we propose a TR functional decomposition for both meshgrid and non-meshgrid data, where factors are parameterized by Implicit Neural Representations (INRs). However, optimizing this continuous framework to capture fine-scale details is intrinsically difficult. Through a frequency-domain analysis, we demonstrate that the spectral structure of TR factors determines the frequency composition of the reconstructed tensor and limits the high-frequency modeling capacity. To mitigate this, we propose a reparameterized TR functional decomposition, in which each TR factor is a structured combination of a learnable latent tensor and a fixed basis. This reparameterization is theoretically shown to improve the training dynamics of TR factor learning. We further derive a principled initialization scheme for the fixed basis and prove the Lipschitz continuity of our proposed model. Extensive experiments on image inpainting, denoising, super-resolution, and point cloud recovery demonstrate that our method achieves consistently superior performance over existing approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36328", "url": null, "sourceid": 38996, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40230, "uid": "19cc05ddb2a645130576802c2e69cf05", "name": "Content-Aware Frequency Encoding for Implicit Neural Representations with Fourier-Chebyshev Features", "authors": [{"id": 184781, "fullname": "Junbo Ke", "url": "http://cvpr.thecvf.com/api/miniconf/users/184781?format=json", "institution": "Hunan Normal University"}, {"id": 184780, "fullname": "Yangyang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184780?format=json", "institution": "Hunan Normal University"}, {"id": 184783, "fullname": "Chao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184783?format=json", "institution": "Southern University of Science and Technology"}, {"id": 184782, "fullname": "You-Wei Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184782?format=json", "institution": "Hunan Normal University"}], "abstract": "Implicit Neural Representations (INRs) have emerged as a powerful paradigm for various signal processing tasks, but their inherent spectral bias limits the ability to capture high-frequency details. Existing methods partially mitigate this issue by using Fourier-based features, which usually rely on fixed frequency bases. This forces multi-layer perceptrons (MLPs) to inefficiently compose the required frequencies, thereby constraining their representational capacity. To address this limitation, we propose Content-Aware Frequency Encoding (CAFE), which builds upon Fourier features through multiple parallel linear layers combined via a Hadamard product. CAFE can explicitly and efficiently synthesize a broader range of frequency bases, while the learned weights enable the selection of task-relevant frequencies. Furthermore, we extend this framework to CAFE+, which incorporates Chebyshev features as a complementary component to Fourier bases. This combination provides a stronger and more stable frequency representation.  Extensive experiments across multiple benchmarks validate the effectiveness and efficiency of our approach, consistently achieving superior performance over existing methods.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40230", "url": null, "sourceid": 36259, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38634, "uid": "5bbf0b1a188dba2f1fbbc0366538ce5a", "name": "Rounded or Streamlined Head? Bridging Concept Bottleneck Models and Attribute-Described Object Parts", "authors": [{"id": 143723, "fullname": "Yang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/143723?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 190354, "fullname": "Jiajin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190354?format=json", "institution": "Hupan Lab; DAMO Academy, Alibaba Group; Shanghai Jiaotong University"}, {"id": 190355, "fullname": "Yaojun Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190355?format=json", "institution": "Zhejiang University"}, {"id": 190356, "fullname": "Bingguang Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190356?format=json", "institution": "Alibaba Group; University of Electronic Science and Technology of China"}, {"id": 190357, "fullname": "Xin Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190357?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 90471, "fullname": "Yingda Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/90471?format=json", "institution": "Alibaba Group"}, {"id": 89486, "fullname": "Danyang Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89486?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 153814, "fullname": "Shi Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153814?format=json", "institution": "University of Electronic Science and Technology of China, Tsinghua University"}, {"id": 90452, "fullname": "Ling Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90452?format=json", "institution": "Alibaba Group"}], "abstract": "A faithful decision-making process requires models to ground human-understandable concepts both spatially (where they appear in the image) and causally (how they influence the prediction). Recent advances in Vision\u2013Language Models (VLMs) enable concept-level alignment and have inspired Concept Bottleneck Models (CBMs), which explain predictions by mapping image representations to human-understandable concepts, allowing users to trace decisions through explicit semantic reasoning. However, existing CBMs suffer from two key inconsistencies. First, semantic inconsistency: VLMs often fail to localize fine-grained part\u2013attribute concepts, producing noisy or incomplete masks. Second, object inconsistency: object-agnostic concepts such as \"head: streamlined front profile\" may describe multiple categories (e.g., fish or human); without enforcing object identity, non-targeted regions can introduce spurious evidence that corrupts the bottleneck representation. To address these challenges, we propose a new Object-Aware Concept Bottleneck Model (OA-CBM) that jointly enforces semantic- and object-level consistency. Specifically, (1) we redefine concepts as part\u2013attribute pairs to enhance VLM robustness at the semantic level, and (2) introduce class-agnostic object clustering to suppress irrelevant visual evidence. We further annotate two grounding datasets with part\u2013attribute descriptions and conduct extensive experiments. Results demonstrate that OA-CBM produces more faithful and robust explanations while maintaining competitive predictive performance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38634", "url": null, "sourceid": 37774, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36970, "uid": "3ae3b9d70e3a1be516add1d7e16bfd20", "name": "GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding", "authors": [{"id": 186342, "fullname": "Jiaqi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186342?format=json", "institution": "Jilin University"}, {"id": 186343, "fullname": "Ronghao Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186343?format=json", "institution": "Jilin University"}, {"id": 186344, "fullname": "Haoran Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186344?format=json", "institution": "Jilin University"}, {"id": 186345, "fullname": "Lang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/186345?format=json", "institution": "Jilin University"}, {"id": 186346, "fullname": "Qipeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186346?format=json", "institution": "Jilin University"}, {"id": 186347, "fullname": "Bo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186347?format=json", "institution": "Jilin University"}], "abstract": "Autoregressive models are structurally misaligned with the inherently parallel nature of geospatial understanding, forcing a rigid sequential narrative onto scenes and fundamentally hindering the generation of structured and coherent outputs. We challenge this paradigm by reframing geospatial generation as a parallel refinement process, enabling a holistic, coarse-to-fine synthesis that resolves all semantic elements simultaneously. To operationalize this, we introduce GeoDiT, the first diffusion-based vision-language model tailored for the geospatial domain. Extensive experiments demonstrate that GeoDiT establishes a new state-of-the-art on benchmarks requiring structured, object-centric outputs. It achieves significant gains in image captioning, visual grounding, and multi-object detection, precisely the tasks where autoregressive models falter. Our work validates that aligning the generative process with the data's intrinsic structure is key to unlocking superior performance in complex geospatial analysis.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36970", "url": null, "sourceid": 37451, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39708, "uid": "6d7364731230122df155fddea6878dd3", "name": "SpikeTrack: A Spike-driven Framework for Efficient Visual Tracking", "authors": [{"id": 179940, "fullname": "Qiuyang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/179940?format=json", "institution": "Tongji University"}, {"id": 192695, "fullname": "Jiujun Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192695?format=json", "institution": "Tongji University"}, {"id": 192696, "fullname": "Qichao Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192696?format=json", "institution": null}, {"id": 192697, "fullname": "Cong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192697?format=json", "institution": "Nova University of lisbon"}, {"id": 192698, "fullname": "Yu Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192698?format=json", "institution": "Tongji University"}, {"id": 192699, "fullname": "Yuhong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192699?format=json", "institution": "Stockholm University"}, {"id": 192700, "fullname": "Mengying Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/192700?format=json", "institution": "Shanghai University"}, {"id": 192701, "fullname": "Shangce Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192701?format=json", "institution": "University of Toyama"}], "abstract": "Spiking Neural Networks (SNNs)  promise energy-efficient vision, but applying them to RGB visual tracking remains difficult: Existing SNN tracking frameworks either do not fully align with spike-driven computation or do not fully leverage neurons\u2019 spatiotemporal dynamics, leading to a trade-off between efficiency and accuracy.  To address this,  we introduce SpikeTrack, a  spike-driven framework  for  energy-efficient RGB object  tracking.   SpikeTrack employs a novel asymmetric design that uses asymmetric timestep expansion and unidirectional information flow, harnessing  spatiotemporal dynamics while cutting computation.  To ensure effective unidirectional information transfer between branches, we design a memory-retrieval module inspired by neural inference mechanisms. This module recurrently  queries a compact memory initialized by the template to retrieve target cues and sharpen target perception over time.  Extensive experiments demonstrate that SpikeTrack achieves the state-of-the-art among SNN-based trackers and remains competitive with advanced ANN trackers.  Notably, it surpasses TransT on LaSOT dataset while consuming only 1/26 of its energy. To  our knowledge, SpikeTrack is the first spike-driven framework to make RGB tracking both accurate and energy efficient.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39708", "url": null, "sourceid": 39999, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38810, "uid": "156cc4979a3e0b77f9835408ca53795a", "name": "PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation", "authors": [{"id": 190737, "fullname": "Yuanzhe Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190737?format=json", "institution": "University of Pennsylvania"}, {"id": 190738, "fullname": "Jingyuan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190738?format=json", "institution": "University of Pennsylvania, University of Pennsylvania"}, {"id": 190739, "fullname": "Yuchen Mo", "url": "http://cvpr.thecvf.com/api/miniconf/users/190739?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 190740, "fullname": "Gen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190740?format=json", "institution": "Nanyang Technological University"}, {"id": 135921, "fullname": "Xu Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/135921?format=json", "institution": "University of Illinois Urbana-Champaign"}, {"id": 190741, "fullname": "Jin Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190741?format=json", "institution": "University of Oxford"}, {"id": 179565, "fullname": "Yifan Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/179565?format=json", "institution": "University of Illinois Urbana-Champaign"}, {"id": 158727, "fullname": "Zhengyuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/158727?format=json", "institution": "Purdue University"}, {"id": 152649, "fullname": "Tianjiao Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152649?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 190742, "fullname": "Wenzhen Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190742?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 190165, "fullname": "Fangqiang Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/190165?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 152651, "fullname": "Ismini Lourentzou", "url": "http://cvpr.thecvf.com/api/miniconf/users/152651?format=json", "institution": "University of Illinois Urbana - Champaign"}], "abstract": "Recent advancements in vision-language\u2013action (VLA) models have shown promise in robotic manipulation, yet they continue to struggle with long-horizon, multi-step tasks. Existing methods lack internal reasoning mechanisms that can identify task-relevant interaction cues or track progress within a subtask, leading to critical execution errors such as repeated actions, missed steps, and premature termination. To address these challenges, we introduce PALM, a VLA framework that structures policy learning around interaction-centric affordance reasoning and subtask progress cues. PALM distills complementary affordance representations that capture object relevance, contact geometry, spatial placements, and motion dynamics, and serve as task-relevant anchors for visuomotor control. To further stabilize long-horizon execution, PALM predicts continuous within-subtask progress, enabling seamless subtask transitions. Across extensive simulation and real-world experiments, PALM consistently outperforms baselines, achieving a 91.8\\% success rate on LIBERO-LONG, a 12.5\\% improvement in average length on CALVIN ABC$\\rightarrow$D, and a 2$\\times$ improvement over real-world baselines across three long-horizon generalization settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38810", "url": null, "sourceid": 39690, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36801, "uid": "347c3daf9d774f4a54dad45ec3153ddd", "name": "Dynamic Stream Network for Combinatorial Explosion Problem in Deformable Medical Image Registration", "authors": [{"id": 185900, "fullname": "Shaochen Bi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185900?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 182546, "fullname": "Yuting He", "url": "http://cvpr.thecvf.com/api/miniconf/users/182546?format=json", "institution": "Case Western Reserve University"}, {"id": 152543, "fullname": "Weiming Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152543?format=json", "institution": "Hong Kong Metropolitan University"}, {"id": 87506, "fullname": "Hao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87506?format=json", "institution": "The Hong Kong University of Science and Technology"}], "abstract": "Combinatorial explosion problem caused by dual inputs presents a critical challenge in Deformable Medical Image Registration (DMIR). Since DMIR process two images simultaneously as input, the combination relationship between features has grown exponentially, ultimately the model considers more interfering features during the feature modeling process. Introducing dynamics in the receptive fields and weights of the network enable the model to eliminate the interfering features combination and model the potential feature combination relationships. In this paper, we propose the Dynamic Stream Network (DySNet), which enables the receptive fields and weights to be dynamically adjusted. This ultimately enables the model to ignore interfering feature combinations and model the potential feature relationships. With two key innovations: 1) Adaptive Stream Basin (AdSB) module dynamically adjusts the shape of the receptive field, thereby enabling the model to focus on the feature relationships with greater correlation. 2) Dynamic Stream Attention (DySA) mechanism generates dynamic weights to search for more valuable feature relationships. Extensive experiments have shown that DySNet consistently outperforms the most advanced DMIR methods, highlighting its outstanding generalization ability. Our code will be released on the website.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36801", "url": null, "sourceid": 38130, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38432, "uid": "d9cd172570d87a489155b135f3fee210", "name": "Attribution-Guided Model Rectification of Unreliable Neural Network Behaviors", "authors": [{"id": 189857, "fullname": "Peiyu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189857?format=json", "institution": "University of Melbourne"}, {"id": 85739, "fullname": "Naveed Akhtar", "url": "http://cvpr.thecvf.com/api/miniconf/users/85739?format=json", "institution": "UNIVERSITY OF WESTERN AUSTRALIA"}, {"id": 107171, "fullname": "Jiantong Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107171?format=json", "institution": "The University of Western Australia"}, {"id": 90780, "fullname": "Ajmal Mian", "url": "http://cvpr.thecvf.com/api/miniconf/users/90780?format=json", "institution": "University of Western Australia"}], "abstract": "The performance of neural network models deteriorates due to their unreliable behavior on non-robust features of corrupted samples. Owing to their opaque nature, rectifying models to address this problem often necessitates arduous data cleaning and model retraining, resulting in huge computational and manual overhead. In this work, we leverage rank-one model editing to establish an attribution-guided model rectification framework that effectively locates and corrects model unreliable behaviors. We first distinguish our rectification setting from existing model editing, yielding a formulation that corrects unreliable behavior while preserving model performance and reducing reliance on large budgets of cleansed samples. We further reveal a bottleneck of model rectifying arising from heterogeneous editability across layers. To target the primary source of misbehavior, we introduce an attribution-guided layer localization method that quantifies layer-wise editability and identifies the layer most responsible for unreliabilities. Extensive experiments demonstrate the effectiveness of our method in correcting unreliabilities observed for neural Trojans, spurious correlations and feature leakage. Our method shows remarkable performance by achieving its editing objective with as few as a single cleansed sample, which makes it appealing for practice.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38432", "url": null, "sourceid": 33522, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40207, "uid": "7929d0c1c363d229873f32a2364556f9", "name": "Fractal Camouflage: A Bio-Inspired Approach for Multi-Scale Adversarial Attacks in the Infrared Domain", "authors": [{"id": 192950, "fullname": "Chengyin Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192950?format=json", "institution": "China University of Petroleum-Beijing at Karamay"}, {"id": 193778, "fullname": "Xin wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193778?format=json", "institution": "China University of Petroleum (Beijing)"}, {"id": 193779, "fullname": "Rui Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193779?format=json", "institution": "China University of Petroleum-Beijing at Karamay"}, {"id": 193780, "fullname": "Zhe Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/193780?format=json", "institution": "China University of Petroleum-Beijing at Karamay"}, {"id": 193781, "fullname": "Yingying Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193781?format=json", "institution": "China University of Petroleum (Beijing)"}, {"id": 193782, "fullname": "Kai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193782?format=json", "institution": "China Electric Power Research Institute Co., Ltd. (CEPRI)"}, {"id": 178421, "fullname": "Xu Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/178421?format=json", "institution": null}, {"id": 183259, "fullname": "Yiwei Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/183259?format=json", "institution": "China University of Petroleum-Beijing at Karamay"}], "abstract": "Infrared pedestrian detection is crucial in safety-critical systems but remains vulnerable to adversarial attacks. Existing physical attacks often rely on fixed, static patterns. However, they often lack robustness across scales, as their hand-crafted or uniformly generated structures are fundamentally limited by a fixed receptive field and fail to adapt to varying distances and scene contexts. In light of this, we propose AdvFractal, a black-box attack that exploits the innate self-similarity and structural richness of fractal geometry to naturally generate multi-scale, physically realizable adversarial perturbations. By modeling perturbations with H-type fractals and optimizing parameters via Particle Swarm Optimization, AdvFractal seamlessly coordinates attacks across scales, progressively disrupting detector features from local textures to global shapes. Experiments show AdvFractal achieves an attack success rate (ASR) of 97.54% in the physical domain and 99.16% cross-dataset, significantly outperforming state-of-the-art methods. The perturbations are highly effective in the infrared spectrum while remaining stealthy in visible light, offering a novel approach for evaluating and understanding the security of infrared detection systems.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40207", "url": null, "sourceid": 34250, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36584, "uid": "f8169d1798f912a4f544390d8a53cb92", "name": "OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning", "authors": [{"id": 185412, "fullname": "Zhijia Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185412?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 126336, "fullname": "Jiaming Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/126336?format=json", "institution": "Baidu"}, {"id": 89410, "fullname": "Weikai Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89410?format=json", "institution": "Tencent America"}, {"id": 87903, "fullname": "Yanhao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87903?format=json", "institution": "Oppo Research Institute"}, {"id": 154487, "fullname": "Haonan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154487?format=json", "institution": "OPPO Guangdong Mobile Telecommunications Co., Ltd."}, {"id": 74074, "fullname": "Guanbin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/74074?format=json", "institution": "Sun Yat-sen University"}], "abstract": "Streaming video reasoning requires models to operate in a setting where history grows without bound while meaningful evidence remains scarce. In such a landscape, relevant signal is like an oasis -- small, critical, and easily lost in a desert of redundancy. Enlarging memory only widens the desert; aggressive compression dries up the oasis. The real difficulty lies in discovering where to look, not how much to remember. We therefore introduce OASIS, a novel framework for streaming video reasoning that tackles this challenge through structured, on-demand retrieval. It organizes streaming history into hierarchical events and performs reasoning as controlled refinement -- short-context inference first, followed by semantically grounded retrieval only when uncertainty arises. As the retrieval is driven by high-level intent rather than embedding similarity, the retrieve memory is substantially more accurate and less noisy. Additionally, the mechanism is plug-and-play, training-free, and compatible with any streaming MLLM. Experiments across multiple benchmarks show that OASIS achieves strong gains in long-horizon accuracy and compositional reasoning with far less memory budget.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36584", "url": null, "sourceid": 35017, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39339, "uid": "a97b25ab48ed33b8adde0e1c0e199b42", "name": "Part$^{2}$GS: Part-aware Modeling of Articulated Objects using 3D Gaussian Splatting", "authors": [{"id": 152649, "fullname": "Tianjiao Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152649?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 187956, "fullname": "Vedant Shah", "url": "http://cvpr.thecvf.com/api/miniconf/users/187956?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 152650, "fullname": "Muntasir Wahed", "url": "http://cvpr.thecvf.com/api/miniconf/users/152650?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 169953, "fullname": "Ying Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/169953?format=json", "institution": "University of Illinois Urbana-Champaign"}, {"id": 152647, "fullname": "Kiet A. Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/152647?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 152651, "fullname": "Ismini Lourentzou", "url": "http://cvpr.thecvf.com/api/miniconf/users/152651?format=json", "institution": "University of Illinois Urbana - Champaign"}], "abstract": "Articulated objects are common in the real world, yet modeling their structure and motion remains a challenging task for 3D reconstruction methods. In this work, we introduce Part$^{2}$GS, a novel framework for modeling articulated digital twins of multi-part objects with high-fidelity geometry and physically consistent articulation. Part$^{2}$GS leverages a part-aware 3D Gaussian representation that encodes articulated components with learnable attributes, enabling structured, disentangled transformations that preserve high-fidelity geometry. To ensure physically consistent motion, we propose a motion-aware canonical representation guided by physics-based constraints, including contact enforcement, velocity consistency, and vector-field alignment. Furthermore, we introduce a field of repel points to prevent part collisions and maintain stable articulation paths, significantly improving motion coherence over baselines. Extensive evaluations on both synthetic and real-world datasets show that Part$^{2}$GS consistently outperforms state-of-the-art methods by up to 10$\\times$ in Chamfer Distance for movable parts.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39339", "url": null, "sourceid": 42453, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39897, "uid": "0ff9d41871184a7b3d80f580eaba64e1", "name": "PHANTOM: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics", "authors": [{"id": 169953, "fullname": "Ying Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/169953?format=json", "institution": "University of Illinois Urbana-Champaign"}, {"id": 193070, "fullname": "Jerry Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/193070?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 152649, "fullname": "Tianjiao Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152649?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 152651, "fullname": "Ismini Lourentzou", "url": "http://cvpr.thecvf.com/api/miniconf/users/152651?format=json", "institution": "University of Illinois Urbana - Champaign"}], "abstract": "Recent advances in generative video modeling, driven by large-scale datasets and powerful architectures, have yielded remarkable visual realism. However, emerging evidence suggests that simply scaling data and model size does not endow these systems with an understanding of the underlying physical laws that govern real-world dynamics. Existing approaches often fail to capture or enforce such physical consistency, resulting in unrealistic motion and dynamics.In this work, we investigate whether integrating the inference of latent physical properties directly into the video generation process can equip models with the ability to produce physically plausible videos. To this end, we propose PHANTOM, a Physics-Infused Video Generation model that jointly models the visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states,  PHANTOM jointly predicts latent physical dynamics and generates future video frames.PHANTOM leverages a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, facilitating the joint prediction of physical dynamics alongside video content without requiring an explicit specification of a complex set of physical dynamics and properties. By integrating the inference of physical-aware video representation directly into the video generation process, PHANTOM produces video sequences that are both visually realistic and physically consistent.Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that PHANTOM not only outperforms existing methods in terms of adherence to physical dynamics but also delivers competitive perceptual fidelity.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39897", "url": null, "sourceid": 38312, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38699, "uid": "f321d1828715ef2d889efb20b8c706b0", "name": "CASR: A Robust Cyclic Framework for Arbitrary Large-Scale Super-Resolution with Distribution Alignment and Self-Similarity Awareness", "authors": [{"id": 157544, "fullname": "Wenhao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/157544?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 143106, "fullname": "Zhaoran Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/143106?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 157536, "fullname": "Peng Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157536?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 190487, "fullname": "Sheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190487?format=json", "institution": "Peking University"}, {"id": 185053, "fullname": "Qian Qiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185053?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 190488, "fullname": "Derui LI", "url": "http://cvpr.thecvf.com/api/miniconf/users/190488?format=json", "institution": "Beijing University of Post and Telecommunications"}], "abstract": "Arbitrary-Scale SR (ASISR) remains fundamentally limited by cross-scale distribution shift: once the inference scale leaves the training range, noise, blur, and artifacts accumulate sharply. We revisit this challenge from a cross-scale distribution transition perspective and propose CASR, a simple yet highly efficient cyclic SR framework that reformulates ultra-magnification as a sequence of in-distribution scale transitions. This design ensures stable inference at arbitrary scales while requiring only a single model. CASR tackles two major bottlenecks: distribution drift across iterations and patch-wise diffusion inconsistencies. The proposed SDAM module aligns structural distributions via superpixel aggregation, preventing error accumulation, while SARM module restores high-frequency textures by enforcing autocorrelation and embedding LR self-similarity priors. Despite using only a single model, our approach significantly reduces distribution drift, preserves long-range texture consistency, and achieves superior generalization even at extreme magnification.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38699", "url": null, "sourceid": 32782, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39297, "uid": "6784ac07a6aeb1433c961ace509971d0", "name": "MSAG: A Multispectral Aerial\u2013Ground Benchmark for Any-Scenario Person Re-Identification", "authors": [{"id": 191790, "fullname": "Yuxuan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191790?format=json", "institution": "Wuhan University"}, {"id": 191791, "fullname": "Zhongao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191791?format=json", "institution": "Wuhan University"}, {"id": 130757, "fullname": "Bin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130757?format=json", "institution": "Wuhan University"}, {"id": 87442, "fullname": "He Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/87442?format=json", "institution": "Wuhan University"}, {"id": 151723, "fullname": "Jian Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151723?format=json", "institution": "Wuhan University"}, {"id": 130766, "fullname": "Jun Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/130766?format=json", "institution": "Wuhan University"}, {"id": 84747, "fullname": "Bo Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/84747?format=json", "institution": "Wuhan University"}, {"id": 76422, "fullname": "Mang Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/76422?format=json", "institution": "Wuhan University"}], "abstract": "Recent person re-identification (ReID) leverages heterogeneous sensing with multiple modalities and viewpoints to improve robustness across diverse conditions. However, most approaches target predefined scenario pairs (e.g., visible-infrared or aerial-ground) and train separate task-specific models. In contrast, real-world applications require retrieving identities from galleries that cover all scenarios, making such designs inefficient and complex to deploy. To bridge this gap, we introduce Any-Scenario ReID (AS-ReID): given a query from any (modality, viewpoint) scenario, a single model retrieves the same identity from a heterogeneous gallery spanning all scenarios. Progress toward AS-ReID is limited by two factors: (i) the lack of a real-world-aligned benchmark with broad scenario coverage, and (ii) the challenge of learning representations that are cohesive within identities and strongly discriminative across identities under diverse scenarios. To this end, we construct MSAG, a Multispectral Aerial-Ground benchmark with 2,337 identities and 434,620 images captured by RGB, near-infrared, and thermal infrared cameras on both ground and UAV platforms. MSAG spans day-night, multiple seasons, and varied weather conditions, and supports AS-ReID as well as conventional ReID tasks. We further propose the Unified Alignment and Discrimination (UAD) framework. Progressive Center Alignment (ProCA) aggregates multi-view features into modality centers and then aligns them toward identity centers to reduce scenario bias. Global Prototype Discrimination (GPD) contrasts samples against global identity prototypes to enforce large-margin discrimination. Extensive experiments highlight the challenges of MSAG and demonstrate the effectiveness of UAD on AS-ReID. The dataset and code will be released.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39297", "url": null, "sourceid": 31186, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38112, "uid": "6c5532cd89a43796f19e4ac21f3b8c72", "name": "TRCoRSurg: Temporal-Relational Co-Reasoning for Surgical Video Triplet Recognition", "authors": [{"id": 180475, "fullname": "Fang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180475?format=json", "institution": "Beihang university"}, {"id": 189078, "fullname": "Shihao Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/189078?format=json", "institution": "Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences"}, {"id": 189079, "fullname": "Weixin Si", "url": "http://cvpr.thecvf.com/api/miniconf/users/189079?format=json", "institution": "Shenzhen University of Advanced Technology"}, {"id": 131069, "fullname": "Yang Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/131069?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 129752, "fullname": "Shuai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129752?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 103538, "fullname": "Aimin Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/103538?format=json", "institution": "Beihang University"}], "abstract": "Understanding complex surgical scenes requires recognizing multiple interdependent entities\u2014such as instruments, actions, and targets\u2014and maintaining their relational consistency across time. Existing surgical triplet recognition methods struggle to jointly model intra-frame label dependencies and inter-frame temporal semantics in a unified manner. To address these limitations, we propose a unified framework that integrates spatial, relational, and temporal cues for robust surgical triplet recognition. Specifically, class-specific spatial priors are first extracted through a multi-scale encoder. Then, these priors are refined by a Label Correlation Modeling module with multi-scale class activation map-guided relational extraction (MS-CAMRE), enabling the model to capture both static co-occurrence and dynamic contextual dependencies among triplet components. Furthermore, a Bidirectional Temporal\u2013Relational Fusion Attention (BTRFA) module harmonizes temporal and relational representations to achieve coherent temporal reasoning. We also introduce a new evaluation metric, the Triplet Consistency Error Rate (TCER), which quantitatively measures the model\u2019s capability to preserve causal and semantic consistency across triplets. Extensive experiments on the CholecT45 and ProStaTD datasets show that our method achieves state-of-the-art (SOTA) performance, improving $AP_{IVT}$ by 5.1\\% and 7.8\\%, respectively. Moreover, on the TCER metric, our approach yields over 36\\% and 25\\% relative reductions on the two datasets, respectively, underscoring the effectiveness of our framework in temporal\u2013relational co-reasoning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38112", "url": null, "sourceid": 40224, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39561, "uid": "c4d8eae871684a2bddeb72d46fe8a030", "name": "CoRoGS: Contextual Gaussian Splatting for Robust Large-Deviation View Synthesis", "authors": [{"id": 181829, "fullname": "Xin Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/181829?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 157536, "fullname": "Peng Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157536?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 188437, "fullname": "Yisong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188437?format=json", "institution": "Peking University"}, {"id": 188436, "fullname": "Chengwei Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188436?format=json", "institution": "Beihang Uinveristy"}, {"id": 190487, "fullname": "Sheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190487?format=json", "institution": "Peking University"}], "abstract": "Novel view synthesis (NVS) under large view deviations remains an underexplored challenge for 3D Gaussian Splatting (3DGS). In urban scenes with limited training coverage, models often fail to maintain geometric consistency when extrapolating to unseen viewpoints, resulting in severe distortions and degraded rendering quality. We introduce Context-Aware Gaussian Splatting (CoRoGS), a $\\textbf{Co}$ntext-aware framework for $\\textbf{Ro}$bust large-deviation novel view synthesis (LD-NVS) that embeds contextual reasoning into 3DGS. Instead of treating Gaussians as independent primitives, CoRoGS adopts a contextual formulation that explicitly models inter-Gaussian dependencies. This representation is implemented by constructing a 3D Gaussian graph, which propagates relational geometry and semantics via message passing, resulting in context-aware Gaussian updates. To further maintain structural consistency under substantial view deviation, we incorporate a progressive graph expansion strategy that adaptively grows and prunes Gaussians, leading to more coherent and complete scene reconstructions. Extensive experiments demonstrate that CoRoGS outperforms state-of-the-art 3DGS-based methods, producing higher-quality results. We highlight that CoRoGS robustly handles a wide range of view shifts, including lateral deviations (e.g., lane-level offsets) and cross-level transitions such as from ground-level driving views to elevated perspectives.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39561", "url": null, "sourceid": 44546, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36708, "uid": "91aa9457862756fd05da015bda13887e", "name": "Learning to Refuse: Refusal-Aware Reinforcement Fine-Tuning for Hard-Irrelevant Queries in Video Temporal Grounding", "authors": [{"id": 70672, "fullname": "Jin-Seop Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/70672?format=json", "institution": "Sungkyunkwan University"}, {"id": 181850, "fullname": "Sungjoon Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/181850?format=json", "institution": "Sung Kyun Kwan University"}, {"id": 185696, "fullname": "SeongJun Jung", "url": "http://cvpr.thecvf.com/api/miniconf/users/185696?format=json", "institution": "Sung Kyun Kwan University; Sung Kyun Kwan University"}, {"id": 84599, "fullname": "Boyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/84599?format=json", "institution": "Nanyang Technological University"}, {"id": 129882, "fullname": "Jee-Hyong Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/129882?format=json", "institution": "Sungkyunkwan University"}], "abstract": "Video Temporal Grounding (VTG) aims to localize a temporal segment in a video corresponding to a natural language query. However, existing VTG models assume that a relevant segment always exists, causing them to always predict a target segment even when the query is irrelevant to the video. While recent approaches attempt to handle irrelevant queries, they can only reject those that are entirely unrelated to the video and still fail to handle hard-irrelevant queries that are semantically similar but not actually relevant.To address this, we propose Refusal-Aware Reinforcement Fine-Tuning (RA-RFT) to effectively refuse hard-irrelevant queries in VTG.Our method is based on the Group Relative Policy Optimization (GRPO) framework and integrates four reward objectives\u2014format, refuse-IoU, explain, and query correction\u2014to improve both relevance discrimination and fine-grained semantic reasoning.In addition, to effectively support RA-RFT, we construct a Hard-Irrelevant VTG (HI-VTG) dataset, which includes hard-irrelevant queries and their refusal answers.We demonstrate the effectiveness of our method across various relevance-aware VTG scenarios, including hard-irrelevant VTG, simply-shuffled RA-VTG, and human-annotated RA-VTG settings. We also show that the proposed method is scalable by applying it to various LVLM-based VTG models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36708", "url": null, "sourceid": 46094, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36432, "uid": "94c83444ad7225785cf634d94a286da0", "name": "Efficient Equivariant Transformer for Self-Driving Agent Modeling", "authors": [{"id": 181816, "fullname": "Scott Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181816?format=json", "institution": "Department of Computer Science, University of Toronto; Waabi"}, {"id": 185032, "fullname": "Dian Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185032?format=json", "institution": "Apple"}, {"id": 87941, "fullname": "Kelvin Wong", "url": "http://cvpr.thecvf.com/api/miniconf/users/87941?format=json", "institution": "Waabi"}, {"id": 185033, "fullname": "Chris Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185033?format=json", "institution": "Department of Computer Science, University of Toronto"}, {"id": 185034, "fullname": "Kion Fallah", "url": "http://cvpr.thecvf.com/api/miniconf/users/185034?format=json", "institution": "Waabi Innovations Inc"}, {"id": 87951, "fullname": "Raquel Urtasun", "url": "http://cvpr.thecvf.com/api/miniconf/users/87951?format=json", "institution": "Waabi"}], "abstract": "Accurately modeling agent behaviors is an important task in self-driving.It is also a task with many symmetries, such as equivariance to the order ofagents and objects in the scene or equivariance to arbitrary roto-translationsof the entire scene as a whole; i.e., SE(2)-equivariance.The transformer architecture is a ubiquitous tool for modeling these symmetries.While standard self-attention is inherently permutation equivariant,explicit pairwise relative positional encodings have been the standard for introducing SE(2)-equivariance.However, this approach introduces an additional cost that is quadratic in the number of agents,limiting its scalability to larger scenes and batch sizes.In this work, we propose DriveGATr, a novel transformer-based architecture for agent modeling that achievesSE(2)-equivariance without the computational cost of existing methods.Inspired by recent advances in geometric deep learning, DriveGATr encodes scene elementsas multivectors in the 2D projective geometric algebra $\\mathbb{R}^*_{2,0,1}$ and processesthem with a stack of equivariant transformer blocks.Crucially, DriveGATr models geometric relationships using standard attentionbetween multivectors, eliminating the need for costly explicit pairwise relative positional encodings.Experiments on the Waymo Open Motion Dataset demonstrate that DriveGATr is comparable to thestate-of-the-art in traffic simulation and establishes a superior Pareto front for performancevs computational cost.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36432", "url": null, "sourceid": 43374, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39872, "uid": "407688b935e4079fecd2b2daefb6432e", "name": "History to Future: Evolving Agent with Experience and Thought for Zero-shot Vision-and-Language Navigation", "authors": [{"id": 190397, "fullname": "Guangzhao Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/190397?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 155013, "fullname": "Shuo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155013?format=json", "institution": "Renmin University of China"}, {"id": 132274, "fullname": "Zihan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/132274?format=json", "institution": "National University of Singapore"}, {"id": 90405, "fullname": "Guo-Sen Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/90405?format=json", "institution": "inception institute of artificial intelligence (iiai)"}, {"id": 191786, "fullname": "Yang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191786?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 87633, "fullname": "Jinshan Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87633?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 76028, "fullname": "Qianru Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/76028?format=json", "institution": "Singapore Management University"}, {"id": 157797, "fullname": "Xiangbo Shu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157797?format=json", "institution": "Nanjing University of Science and Technology"}], "abstract": "Vision-and-Language Navigation in Continuous Environment (VLN-CE) requires an agent to follow language instructions to navigate the target destination.  With the advancement of large language models (LLMs), recent efforts have explored adapting them for zero-shot VLN-CE, offering a promising solution in addressing the drawbacks of poor generalization in the training-based paradigm. However, existing LLM-based works primarily perform naive reasoning for decision-making and lack feedback, e.g., reviewing historical errors and predicting future potentials. Consequently, it may suffer from continuous failure for those initial error tasks. In this paper, we rethink LLM-based zero-shot VLN-CE and propose a new paradigm, named EvoNav, to improve the agent's decision-making with future thought and history experience via Future Chain-of-Thought (F-CoT) and History Chain-of-Experience (H-CoE). F-CoT predicts future actions and landmarks as thoughts to assist navigation progress estimation and direction selection, while H-CoE summarizes historical trajectories and scenes as experience to improve navigation decision reliability. Both F-CoT and H-CoE cooperatively evolve the agent\u2019s decision-making. Extensive experiments in both the simulator and real-world environments demonstrate the effectiveness of our EvoNav. Source code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39872", "url": null, "sourceid": 31732, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38654, "uid": "ec4f1bd61012641a6eb0aa63cd06cf39", "name": "D3D-VLP: Dynamic 3D Vision-Language-Planning Model for Embodied Grounding and Navigation", "authors": [{"id": 132274, "fullname": "Zihan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/132274?format=json", "institution": "National University of Singapore"}, {"id": 152779, "fullname": "Seungjun Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/152779?format=json", "institution": "national university of singaore, National University of Singapore"}, {"id": 190397, "fullname": "Guangzhao Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/190397?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 86340, "fullname": "Gim Hee Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/86340?format=json", "institution": "National University of Singapore"}], "abstract": "Embodied agents face a critical dilemma that end-to-end models lack interpretability and explicit 3D reasoning, while modular systems ignore cross-component interdependencies and synergies. To bridge this gap, we propose the Dynamic 3D Vision-Language-Planning Model (D3D-VLP). Our model introduces two key innovations: 1) A Dynamic 3D Chain-of-Thought (3D CoT) that unifies planning, grounding, navigation, and question answering within a single 3D-VLM and CoT pipeline; 2) A Synergistic Learning from Fragmented Supervision (SLFS) strategy, which uses a masked autoregressive loss to learn from massive and partially-annotated hybrid data. This allows different CoT components to mutually reinforce and implicitly supervise each other. To this end, we construct a large-scale dataset with 10M hybrid samples from 5K real scans and 20K synthetic scenes that are compatible with online learning methods such as RL and DAgger. Our D3D-VLP achieves state-of-the-art results on multiple benchmarks, including Vision-and-Language Navigation (R2R-CE, REVERIE-CE, NavRAG-CE), Object-goal Navigation (HM3D-OVON), and Task-oriented Sequential Grounding and Navigation (SG3D). Real-world mobile manipulation experiments further validate the effectiveness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38654", "url": null, "sourceid": 35297, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36876, "uid": "4e4e70d504b4c0006c8287dedc99d0fc", "name": "Computation and Communication Efficient Federated Unlearning via On-server Gradient Conflict Mitigation and Expression", "authors": [{"id": 186080, "fullname": "Minh-Duong Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186080?format=json", "institution": "VinUniversity"}, {"id": 180397, "fullname": "Senura Hansaja Wanasekara", "url": "http://cvpr.thecvf.com/api/miniconf/users/180397?format=json", "institution": "University of Sydney"}, {"id": 186081, "fullname": "Le-Tuan Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186081?format=json", "institution": null}, {"id": 186082, "fullname": "Ken-Tye Yong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186082?format=json", "institution": "The University of Sydney"}, {"id": 186083, "fullname": "Quoc-Viet Pham", "url": "http://cvpr.thecvf.com/api/miniconf/users/186083?format=json", "institution": "University of Dublin, Trinity College"}, {"id": 186084, "fullname": "Nguyen H. Tran", "url": "http://cvpr.thecvf.com/api/miniconf/users/186084?format=json", "institution": "The University of Sydney"}, {"id": 186085, "fullname": "Dung D. Le", "url": "http://cvpr.thecvf.com/api/miniconf/users/186085?format=json", "institution": "VinUniversity"}], "abstract": "Federated Unlearning (FUL) aims to remove specific participants' data contributions from a trained Federated Learning model, thereby ensuring data privacy and compliance with regulatory requirements. Despite its potential, progress in FUL has been limited due to several challenges, including the cross-client knowledge inaccessibility and high computational and communication costs.To overcome these challenges, we propose Federated On-server Unlearning (FOUL), a novel framework that comprises two key stages. The learning-to-unlearn stage serves as a preparatory learning phase, during which the model identifies and encodes the key features associated with the forget clients. This stage is communication-efficient and establishes the basis for the subsequent unlearning process.Subsequently, on-server knowledge aggregation phase aims to perform the unlearning process at the server without requiring access to client data, thereby preserving both efficiency and privacy.We introduce a new data setting for FUL, which enables a more transparent and rigorous evaluation of unlearning. To highlight the effectiveness of our approach, we propose a novel evaluation metric termed time-to-forget, which measures how quickly the model achieves optimal unlearning performance.Extensive experiments conducted on three datasets under various unlearning scenarios demonstrate that FOUL outperforms the Retraining in FUL. Moreover, FOUL achieves competitive or superior results with significantly reduced time-to-forget, while maintaining low communication and computation costs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36876", "url": null, "sourceid": 41269, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37778, "uid": "ed1b9807eea87cdd31bc57a81490bb43", "name": "Factorized Context Aggregation for Robust Cancer Risk Estimation via Soft Re-Ranked Retrieval and Hierarchical Anchors", "authors": [{"id": 87463, "fullname": "Puria Azadi Moghadam", "url": "http://cvpr.thecvf.com/api/miniconf/users/87463?format=json", "institution": "University of British Columbia"}, {"id": 188246, "fullname": "Ali Khajegili Mirabadi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188246?format=json", "institution": "University of British Columbia"}, {"id": 188247, "fullname": "Behnam Maneshgar", "url": "http://cvpr.thecvf.com/api/miniconf/users/188247?format=json", "institution": "University of British Columbia"}, {"id": 87444, "fullname": "Hossein Farahani", "url": "http://cvpr.thecvf.com/api/miniconf/users/87444?format=json", "institution": "University of British Columbia"}, {"id": 87428, "fullname": "Ali Bashashati", "url": "http://cvpr.thecvf.com/api/miniconf/users/87428?format=json", "institution": "University of British Columbia"}], "abstract": "Accurate cancer risk assessment is critical for personalized treatment planning. While multimodal models that integrate histopathology with complementary data modalities (e.g., genomics, or clinical reports) exhibit superior prognostic capability, they typically assume full data availability, an unrealistic expectation in real-world clinical settings. In contrast, histopathology slides are routinely collected, universally accessible, and information-rich, making them a practical anchor for robust survival prediction.In this study, we propose a novel framework that leverages histopathology as a basis for outcome prediction, while using other data modalities when training the models.Extensive experiments across eight cancer types and scenarios, including various data modalities, demonstrate that our model outperforms all baselines, with up to 8\\% gains over methods that solely use histopathology at training time, and a 1.4\\% gap compared to models that utilize all data modalities. Our model also stratifies patients into meaningful risk groups in 67\\% of risk stratification scenarios (vs. 50\\% for best SOTA), generalizes well under varying modality missingness, and matches the best SOTA even with 40\\% higher rate of missing data during training. It also preserves semantic alignment in zero-shot settings.These results highlight the practical utility and robustness of our approach for real-world cancer risk prediction in resource-limited or modality-incomplete settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37778", "url": null, "sourceid": 36116, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36254, "uid": "c64fa745fa1d9b2a6835cf63d3fa103b", "name": "ZOO-Prune: Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models", "authors": [{"id": 181887, "fullname": "Youngeun Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/181887?format=json", "institution": "Amazon AWS AI Labs"}, {"id": 152301, "fullname": "Youjia Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152301?format=json", "institution": "Sungkyunkwan University"}, {"id": 180298, "fullname": "Huiling Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180298?format=json", "institution": "Sung Kyun Kwan University"}, {"id": 184589, "fullname": "Aecheon Jung", "url": "http://cvpr.thecvf.com/api/miniconf/users/184589?format=json", "institution": "Sungkyunkwan University"}, {"id": 184590, "fullname": "Sunwoo Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/184590?format=json", "institution": "Inha University"}, {"id": 152303, "fullname": "Sungeun Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/152303?format=json", "institution": "Sungkyunkwan University"}], "abstract": "Large Vision\u2013Language Models (VLMs) enable strong multimodal reasoning but incur heavy inference costs from redundant visual tokens. Token pruning alleviates this issue, yet existing approaches face limitations. Attention-based methods rely on raw attention scores, which are often unstable across layers and heads and can lead to redundant selections. Diversity-based methods improve robustness by selecting tokens far apart in feature space, but risk dropping regions needed for accurate prediction. We propose ZOO-Prune, a training-free framework built on the intuition that highly sensitive tokens have a stronger influence on the model's output and capture complementary visual cues rather than redundant ones. To achieve this, we estimate token sensitivity using zeroth-order perturbations at the lightweight projection layer. This measures how small random perturbations affect the projected features and enables efficient approximation of each token\u2019s influence without backpropagation. Extensive experiments across multiple VLMs and benchmarks show that ZOO-Prune consistently outperforms prior methods while pruning up to 94.4% of tokens without sacrificing accuracy. Our method also improves efficiency, reaching up to 2.30x faster end-to-end inference compared to the baseline.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36254", "url": "https://aim-skku.github.io/ZOO-Prune/", "sourceid": 42235, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65702, "file": "/media/PosterPDFs/CVPR%202026/36254.png", "modified": "2026-04-16T23:10:05.022516-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": true, "sortkey": 0, "is_live_content": false, "detailed_kind": "", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65703, "file": "/media/PosterPDFs/CVPR%202026/36254-thumb.png", "modified": "2026-04-16T23:10:05.222751-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": false, "sortkey": 0, "is_live_content": false, "detailed_kind": "thumb", "generated_from": null, "resourcetype": "EventmediaImageFile"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37703, "uid": "3ccd4c93c3d2577f70308360bf4f4149", "name": "3DReflecNet: A Large-Scale Dataset for 3D Reconstruction of Reflective, Transparent, and Low-Texture Objects", "authors": [{"id": 182398, "fullname": "Zhicheng Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182398?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 188050, "fullname": "Haoyi Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188050?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 169085, "fullname": "Boyan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/169085?format=json", "institution": "The Chinese University of Hong Kong(shenzhen)"}, {"id": 188051, "fullname": "Dayou Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188051?format=json", "institution": "Capital Normal University"}, {"id": 188052, "fullname": "Zijian Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188052?format=json", "institution": "The Chinese University of Hong Kong,Shenzhen"}, {"id": 188053, "fullname": "Tianyi Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/188053?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 188054, "fullname": "Junhua Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188054?format=json", "institution": "University of Southern California"}, {"id": 90349, "fullname": "Shuguang Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/90349?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 188055, "fullname": "Fangxin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188055?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}], "abstract": "Accurate 3D reconstruction of objects with reflective, transparent, or low-texture surfaces remains a significant challenge. Such materials often violate key assumptions in multi-view reconstruction pipelines, such as photometric consistency and the reliance on distinct geometric texture cues. Existing datasets primarily focus on diffuse, textured objects, thereby offering limited insight into performance under real-world material complexities. In this paper, we introduce 3DReflecNet, a large-scale hybrid dataset exceeding 22 TB that is specifically designed to benchmark and advance 3D vision methods for these challenging materials. 3DReflecNet combines two types of data: over 100,000 synthetic instances generated via physically-based rendering of more than 10,000 shapes, and over 1,000 real-world objects scanned using consumer RGB-D devices. Together, these data consist of more than 7 million multi-view frames. It encompasses diverse materials, complex lighting conditions, and a wide range of geometric forms\u2014including shapes generated from both real and LLM-synthesized 2D images using diffusion-based methods. To support robust evaluation, we design benchmarks for four core tasks: image matching, reflection removal, structure-from-motion, and novel view synthesis. Through extensive experiments, we show that state-of-the-art methods struggle to maintain accuracy across these settings, highlighting the need for more resilient 3D vision models. We release the dataset, baselines, and evaluation suite to facilitate progress in this direction, which can be accessed at supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37703", "url": null, "sourceid": 40723, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40310?format=json"], "related_events_ids": [40310]}, {"id": 37917, "uid": "03ba283fed2772f4e477eea4f0f236b8", "name": "MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition", "authors": [{"id": 131103, "fullname": "Xinyu Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/131103?format=json", "institution": "Peking University"}, {"id": 188585, "fullname": "Kangrui Cen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188585?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 185161, "fullname": "Hongyang Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/185161?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 188586, "fullname": "Zhen Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188586?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 188587, "fullname": "Bairui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188587?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 153954, "fullname": "Zeqing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153954?format=json", "institution": "Sun Yat-Sen University"}, {"id": 188588, "fullname": "Jinrui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188588?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 88043, "fullname": "Lei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88043?format=json", "institution": "The Hong Kong Polytechnic University"}], "abstract": "In controllable image generation, synthesizing coherent and consistent images from multiple reference inputs, i.e., **Multi-Image Composition** (MICo), remains a challenging problem, partly hindered by the lack of high-quality training data.To bridge this gap, we conduct a systematic study of MICo, categorizing it into 7 representative tasks and curate a large-scale collection of high-quality source images and construct diverse MICo prompts.Leveraging powerful proprietary models, we synthesize a rich amount of balanced composite images, followed by human-in-the-loop filtering and refinement, resulting in **MICo-150K**, a comprehensive dataset for MICo with identity consistency.We further build a Decomposition-and-Recomposition (**De\\&Re**) subset, where 11K real-world complex images are decomposed into components and recomposed, enabling both real and synthetic compositions.To enable comprehensive evaluation, we construct **MICo-Bench** with 100 cases per task and 300 challenging De\\&Re cases, and further introduce a new metric, **Weighted-Ref-VIEScore**, specifically tailored for MICo evaluation.Finally, we fine-tune multiple models on**MICo-150K** and evaluate them on **MICo-Bench**. The results show that MICo-150K effectively equips models without MICo capability and further enhances those with existing skills.Notably, **Qwen-MICo**, fine-tuned from Qwen-Image-Edit, matches **Qwen-Image-2509** in 3-image composition while supporting arbitrary multi-image inputs beyond the latter\u2019s limitation.Our dataset and benchmark will be valuable resources for advancing MICo research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37917", "url": null, "sourceid": 42705, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40310, "uid": "3ccd4c93c3d2577f70308360bf4f4149", "name": "3DReflecNet: A Large-Scale Dataset for 3D Reconstruction of Reflective, Transparent, and Low-Texture Objects", "authors": [{"id": 182398, "fullname": "Zhicheng Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182398?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 188050, "fullname": "Haoyi Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188050?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 169085, "fullname": "Boyan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/169085?format=json", "institution": "The Chinese University of Hong Kong(shenzhen)"}, {"id": 188051, "fullname": "Dayou Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188051?format=json", "institution": "Capital Normal University"}, {"id": 188052, "fullname": "Zijian Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188052?format=json", "institution": "The Chinese University of Hong Kong,Shenzhen"}, {"id": 188053, "fullname": "Tianyi Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/188053?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 188054, "fullname": "Junhua Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188054?format=json", "institution": "University of Southern California"}, {"id": 90349, "fullname": "Shuguang Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/90349?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 188055, "fullname": "Fangxin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188055?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}], "abstract": "Accurate 3D reconstruction of objects with reflective, transparent, or low-texture surfaces remains a significant challenge. Such materials often violate key assumptions in multi-view reconstruction pipelines, such as photometric consistency and the reliance on distinct geometric texture cues. Existing datasets primarily focus on diffuse, textured objects, thereby offering limited insight into performance under real-world material complexities. In this paper, we introduce 3DReflecNet, a large-scale hybrid dataset exceeding 22 TB that is specifically designed to benchmark and advance 3D vision methods for these challenging materials. 3DReflecNet combines two types of data: over 100,000 synthetic instances generated via physically-based rendering of more than 10,000 shapes, and over 1,000 real-world objects scanned using consumer RGB-D devices. Together, these data consist of more than 7 million multi-view frames. It encompasses diverse materials, complex lighting conditions, and a wide range of geometric forms\u2014including shapes generated from both real and LLM-synthesized 2D images using diffusion-based methods. To support robust evaluation, we design benchmarks for four core tasks: image matching, reflection removal, structure-from-motion, and novel view synthesis. Through extensive experiments, we show that state-of-the-art methods struggle to maintain accuracy across these settings, highlighting the need for more resilient 3D vision models. We release the dataset, baselines, and evaluation suite to facilitate progress in this direction, which can be accessed at supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40310", "url": null, "sourceid": -40723, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37703?format=json"], "related_events_ids": [37703]}, {"id": 40113, "uid": "009d58d33b1d8b377e7ddf00aeb43138", "name": "M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG", "authors": [{"id": 180486, "fullname": "David Anugraha", "url": "http://cvpr.thecvf.com/api/miniconf/users/180486?format=json", "institution": "Stanford University"}, {"id": 193564, "fullname": "Patrick Irawan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193564?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 193565, "fullname": "Anshul Singh", "url": "http://cvpr.thecvf.com/api/miniconf/users/193565?format=json", "institution": "Indian Institute of Science, Indian institute of science, Bangalore"}, {"id": 193566, "fullname": "En-Shiun Annie Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/193566?format=json", "institution": "Ontario Tech University; Ontario Tech University; University of Toronto"}, {"id": 193567, "fullname": "Genta Indra Winata", "url": "http://cvpr.thecvf.com/api/miniconf/users/193567?format=json", "institution": "Capital One"}], "abstract": "Vision\u2013language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingual multimodal RAG remains largely underexplored. We introduce M4-RAG, a massive-scale benchmark covering 42 languages and 56 regional dialects and registers, comprising over 80,000 culturally diverse image\u2013question pairs for evaluating retrieval-augmented VQA across languages and modalities. To balance realism with reproducibility, we build a controlled retrieval environment containing millions of carefully curated multilingual documents relevant to the query domains, approximating real-world retrieval conditions while ensuring consistent experimentation. Our systematic evaluation reveals that although RAG consistently benefits smaller VLMs, it fails to scale to larger models and often even degrades their performance, exposing a critical mismatch between model size and current retrieval effectiveness. M4-RAG provides a foundation for advancing next-generation RAG systems capable of reasoning seamlessly across languages, modalities, and cultural contexts.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40113", "url": null, "sourceid": 41876, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37997, "uid": "6f53f5d7325787f8abdbac3b186e80c1", "name": "Real-Time Dynamic Scene Rendering with Controlled Compressibility and Contact Awareness", "authors": [{"id": 188787, "fullname": "Boya Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188787?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 188788, "fullname": "Naiyang Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188788?format=json", "institution": "Academy of Military Science"}, {"id": 188789, "fullname": "Yi Xiaodong", "url": "http://cvpr.thecvf.com/api/miniconf/users/188789?format=json", "institution": "National University of Defense Technology"}], "abstract": "Existing dynamic scene rendering methods often adopt rigid-body or direction-limited assumptions, yet real-world motion and contact routinely violate these, producing artifacts near occlusion boundaries. To address this, we introduce a unified, source-aware framework for dynamic rendering that enforces the consistency of Gaussian primitives under an explicit manifold constraints. We project predicted velocities onto physically grounded priors via efficient, parallel inner solves: (i) a Helmholtz parameterization that separates divergence-free and potential-flow motion components; (ii) an anisotropic, compressible directional prior; and (iii) an affine family that disentangles rotation from isotropic scaling. Experiments on extensive benchmarks show consistent improvements over state-of-the-art methods in reconstruction fidelity and temporal coherence. Our approach ensures physically realistic rendering, especially near contacts, and substantially reduces motion-boundary artifacts.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37997", "url": null, "sourceid": 40476, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36854, "uid": "369eb968269fa49b22b4596a2ce9fbbe", "name": "Proof-of-Perception: Certified Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees", "authors": [{"id": 186028, "fullname": "Arya Fayyazi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186028?format=json", "institution": "University of Southern California"}, {"id": 186029, "fullname": "Haleh Akrami", "url": "http://cvpr.thecvf.com/api/miniconf/users/186029?format=json", "institution": "Nuro"}], "abstract": "We present Proof-of-Perception (PoP), a tool-using framework that casts multimodal reasoning as an executable graph with explicit reliability guarantees. Each perception or logic node outputs a conformal set $\\Gamma^{(t)}_\\delta(x)$, yielding calibrated, stepwise uncertainty; a lightweight controller uses these certificates to allocate compute under a budget\u2014expanding with extra tool calls only when needed and stopping early otherwise. This grounds answers in verifiable evidence, reduces error compounding and hallucinations, and enables principled accuracy\u2013compute trade-offs. Across document, chart, and multi-image QA benchmarks, PoP improves performance and reliability over strong chain-of-thought, ReAct-style, and program-of-thought baselines while using computation more efficiently.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36854", "url": null, "sourceid": 33897, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36712, "uid": "b651fe05fb7ed93f9b8f33ae9506662e", "name": "Adaptive 3D Perception Under Sparse Sampling via Reinforcement Learning", "authors": [{"id": 131235, "fullname": "Shenghai Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/131235?format=json", "institution": "National Technological University"}, {"id": 185708, "fullname": "Wei Yihan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185708?format=json", "institution": "Nanyang Technological University"}, {"id": 185709, "fullname": "Jason Yee", "url": "http://cvpr.thecvf.com/api/miniconf/users/185709?format=json", "institution": "Nanyang Technological University"}, {"id": 185710, "fullname": "Zhuoran Qiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185710?format=json", "institution": ""}, {"id": 185711, "fullname": "boyang lou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185711?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 181707, "fullname": "Enwen Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181707?format=json", "institution": "Beijing university of Posts and Telecommunications"}], "abstract": "Detecting small aerial targets (SATs) from long-range LiDAR is challenging because point density changes dramatically with motion: fast flights produce ultra-sparse returns, while hovering or slow motion yields dense local clusters, breaking fixed-voxel and static-threshold assumptions in standard 3D detectors and trackers. We introduce A3PRL, an RL-driven adaptive perception framework that closes the loop between LiDAR sensing and tracking. A3PRL builds on a sparsity-aware proposal stage with Temporal Dispersion Signatures and velocity-change cues, and deploys a lightweight 5D policy that jointly adjusts voxel resolution, detection sensitivity, and association gating based on purely label-free statistics summarizing spatio\u2013temporal sparsity, foreground acceptance, and tracking continuity. The policy is trained with privileged supervision from ground-truth trajectories to shape a reward that balances geometric accuracy, temporal stability, and regularized acceptance, but runs fully label-free at test time. On the public MMAUD benchmark, training on V1 and evaluating on unseen V2/V3 domains, A3PRL reduces 3D localization error by about 19\\% compared to its non-RL counterpart and consistently outperforms LiDAR-only and multimodal baselines under both day and night conditions. We further show that the same policy transfers to an in-house LiDAR\u2013RTK setup and a public multi-LiDAR SAT dataset with heterogeneous scan patterns, where it maintains accurate trajectories and stable tracks under varying sparsity, while adding less than 2 ms per frame on a 10 Hz LiDAR budget.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36712", "url": null, "sourceid": 46780, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39659, "uid": "530f5a2885388ae3ec8cbfc6188c0294", "name": "What\u2019s Wrong with Synthetic Data for Scene Text Recognition? A Strong Synthetic Engine with Diverse Simulations and Self-Evolution", "authors": [{"id": 145155, "fullname": "Xingsong Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/145155?format=json", "institution": "Fudan University"}, {"id": 144871, "fullname": "Yongkun Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/144871?format=json", "institution": "Fudan University"}, {"id": 192586, "fullname": "Jiaxin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192586?format=json", "institution": "Tencent"}, {"id": 86641, "fullname": "Chen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86641?format=json", "institution": "WeChat, Tencent"}, {"id": 185164, "fullname": "Jing LYU", "url": "http://cvpr.thecvf.com/api/miniconf/users/185164?format=json", "institution": "Wechat Team"}, {"id": 88873, "fullname": "Zhineng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88873?format=json", "institution": "Fudan University"}], "abstract": "Large-scale and categorical-balanced text data is essential for training effective Scene Text Recognition (STR) models, which is hard to achieve when collecting real data. Synthetic data offers a cost-effective and perfectly labeled alternative. However, its performance often lags behind, revealing a significant domain gap between real and current synthetic data. In this work, we systematically analyze mainstream rendering-based synthetic datasets and identify their key limitations: insufficient diversity in corpus, font, and layout, which restricts their realism in complex scenarios. To address these issues, we introduce **UnionST**, a strong data engine synthesizes text covering a union of challenging samples and better aligns with the complexity observed in the wild. We then construct **UnionST-S**, a large-scale synthetic dataset with improved simulations in challenging scenarios. Furthermore, we develop a self-evolution learning (SEL) framework for effective real data annotation. Experiments show that models trained on UnionST-S achieve significant improvements over existing synthetic datasets. They even surpass real-data performance in certain scenarios. Moreover, when using SEL, the trained models achieve competitive performance by only seeing 9% of real data labels.Code is provided in the supplementary.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39659", "url": null, "sourceid": 38576, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39336, "uid": "115679d0a013d1e98730c72c91dd0123", "name": "EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing", "authors": [{"id": 155617, "fullname": "Runjia Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155617?format=json", "institution": "University of Oxford"}, {"id": 131383, "fullname": "Moayed Haji Ali", "url": "http://cvpr.thecvf.com/api/miniconf/users/131383?format=json", "institution": "Rice University"}, {"id": 186495, "fullname": "Ashkan Mirzaei", "url": "http://cvpr.thecvf.com/api/miniconf/users/186495?format=json", "institution": "NVIDIA; Snap Inc."}, {"id": 106929, "fullname": "Chaoyang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/106929?format=json", "institution": "Snap Inc"}, {"id": 191876, "fullname": "Arpit Sahni", "url": "http://cvpr.thecvf.com/api/miniconf/users/191876?format=json", "institution": "Snap Inc."}, {"id": 87280, "fullname": "Ivan Skorokhodov", "url": "http://cvpr.thecvf.com/api/miniconf/users/87280?format=json", "institution": "KAUST"}, {"id": 184760, "fullname": "Aliaksandr Siarohin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184760?format=json", "institution": "Snap Inc.; Snap Inc."}, {"id": 90766, "fullname": "Tomas Jakab", "url": "http://cvpr.thecvf.com/api/miniconf/users/90766?format=json", "institution": "University of Oxford"}, {"id": 155618, "fullname": "Junlin Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/155618?format=json", "institution": "University of Oxford"}, {"id": 85389, "fullname": "Sergey Tulyakov", "url": "http://cvpr.thecvf.com/api/miniconf/users/85389?format=json", "institution": "Snap Inc."}, {"id": 75532, "fullname": "Philip H.S. Torr", "url": "http://cvpr.thecvf.com/api/miniconf/users/75532?format=json", "institution": "University of Oxford"}, {"id": 88383, "fullname": "Willi Menapace", "url": "http://cvpr.thecvf.com/api/miniconf/users/88383?format=json", "institution": "University of Trento"}], "abstract": "We study instruction-guided editing of egocentric videos for interactive AR applications. While recent AI video editors perform well on third-person footage, egocentric views present unique challenges \u2014 including rapid egomotion, and frequent hand\u2013object interactions \u2014 that create a significant domain gap. Moreover, existing offline editing pipelines suffer from high latency, limiting real-time interaction.To address these issues, we present a complete ecosystem for egocentric video editing. First, we construct EgoEditData, a carefully designed and manually curated dataset specifically designed for egocentric editing scenarios, featuring rich hand-object interactions, while explicitly preserving hands. Second, we develop EgoEdit, an instruction-following egocentric video editor that supports real-time streaming inference on a single GPU. Finally, we introduce EgoEditBench, an evaluation suite targeting instruction faithfulness, hand and interaction preservation, and temporal stability under egomotion.Across both egocentric and general editing tasks, EgoEdit produces temporally stable, instruction-faithful results with interactive latency. It achieves clear gains on egocentric editing benchmarks\u2014where existing methods struggle\u2014while maintaining performance comparable to the strongest baselines on general editing tasks. EgoEditData and EgoEditBench will be made public for the research community.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39336", "url": null, "sourceid": 40450, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39070, "uid": "3716117c62e07aa6f23b6fe84990378e", "name": "Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention", "authors": [{"id": 191298, "fullname": "Jianfei Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191298?format=json", "institution": "Beijing Institute of Technology"}, {"id": 191299, "fullname": "Feng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191299?format=json", "institution": "Beijing Institute of Technology"}, {"id": 191300, "fullname": "Xin Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/191300?format=json", "institution": "Beijing Institute of Technology"}, {"id": 191301, "fullname": "Chong Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191301?format=json", "institution": "Beijing Institute of Technology"}, {"id": 191302, "fullname": "Zhixing Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191302?format=json", "institution": "Zhongguancun Laboratory"}], "abstract": "Visual attention serves as the primary mechanism through which MLLMs interpret visual information; however, its limited localization capability often leads to hallucinations. We observe that although MLLMs can accurately extract visual semantics from visual tokens, they fail to fully leverage this advantage during subsequent inference.To address this limitation, we propose Vision-Guided Attention (VGA), a training-free method that first constructs precise visual grounding by exploiting the semantic content of visual tokens, and then uses this grounding to guide the model\u2019s focus toward relevant visual regions. In image captioning, VGA further refines this guidance dynamically during generation by suppressing regions that have already been described.In VGA, each token undergoes only a single forward pass, introducing a negligible latency overhead of just 4.36\\%. In addition, VGA is fully compatible with efficient attention implementations such as FlashAttention.Extensive experiments across diverse MLLMs and multiple hallucination benchmarks demonstrate that VGA achieves state-of-the-art dehallucination performance. Further analysis confirms that explicit visual guidance plays a crucial role in enhancing the visual understanding capabilities of MLLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39070", "url": null, "sourceid": 39561, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37774, "uid": "fc75bd9622425bbc421653770069faf5", "name": "THE MORE, THE MERRIER: CONTRASTIVE FUSION FOR HIGHER-ORDER MULTIMODAL ALIGNMENT", "authors": [{"id": 188235, "fullname": "Stefanos Koutoupis", "url": "http://cvpr.thecvf.com/api/miniconf/users/188235?format=json", "institution": "Foundation for Research and Technology-Hellas, University of Crete"}, {"id": 188236, "fullname": "Michaela Areti Zervou", "url": "http://cvpr.thecvf.com/api/miniconf/users/188236?format=json", "institution": "Institute of Computer Science - FORTH"}, {"id": 188237, "fullname": "Konstantinos Kontras", "url": "http://cvpr.thecvf.com/api/miniconf/users/188237?format=json", "institution": "KU Leuven"}, {"id": 188238, "fullname": "Maarten De Vos", "url": "http://cvpr.thecvf.com/api/miniconf/users/188238?format=json", "institution": "KU Leuven"}, {"id": 188239, "fullname": "Panagiotis Tsakalides", "url": "http://cvpr.thecvf.com/api/miniconf/users/188239?format=json", "institution": "University of Crete"}, {"id": 188240, "fullname": "Grigorios Tsagkatakis", "url": "http://cvpr.thecvf.com/api/miniconf/users/188240?format=json", "institution": "University of Crete; Foundation for Research and Technology - Hellas (FORTH)"}], "abstract": "Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity. Across these settings, ConFu demonstrates competitive performance on retrieval and classification tasks, while supporting unified one-to-one and two-to-one retrieval within a single contrastive framework.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37774", "url": null, "sourceid": 30837, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38626, "uid": "cdbe7ddc282f782ae1ab37b04fa0242d", "name": "Object-Generalized Re-Identification: A Step Towards Universal Instance Perception", "authors": [{"id": 190336, "fullname": "Shuoyi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190336?format=json", "institution": "Wuhan University"}, {"id": 190337, "fullname": "Yurui Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190337?format=json", "institution": "Wuhan University"}, {"id": 76422, "fullname": "Mang Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/76422?format=json", "institution": "Wuhan University"}], "abstract": "The object re-identification (ReID) task aims to recognize the same individual object across diverse viewpoints and sensing conditions.Although person and vehicle ReID have achieved remarkable success, most existing methods are built on the assumption that training and testing data come from the same object category.This constraint requires separate models for each category, which limits scalability and generalization.To address this limitation, we introduce Object-Generalized Re-Identification (OG-ReID), a new paradigm that learns unified identity representations transferable across different object categories.Unlike conventional domain generalization that focuses on appearance variations within a single category, OG-ReID deals with category shifts caused by intrinsic structural differences in identity cues. To achieve this goal, we introduce the Meta-Generalized Object Re-Identification (MGOR) framework, which treats meta-learning as semantic distributional regularization, exposing the model to controlled category shifts so that invariance emerges as an equilibrium between semantic diversity and identity discrimination.Extensive evaluations on more than 100 unseen object categories from multiple domains show that MGOR outperforms existing ReID approaches without any target-domain adaptation, advancing toward universal identity perception beyond domain and category boundaries.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38626", "url": null, "sourceid": 38749, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38967, "uid": "49cd70349fd46e2251b90c6009945469", "name": "RDFace: A Benchmark Dataset for Rare Disease Facial Image Analysis under Extreme Data Scarcity and Phenotype-Aware Synthetic Generation", "authors": [{"id": 181121, "fullname": "Ganlin Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/181121?format=json", "institution": "Western University"}, {"id": 180331, "fullname": "Yuxi Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/180331?format=json", "institution": "Western University"}, {"id": 191082, "fullname": "Hafsa Ali", "url": "http://cvpr.thecvf.com/api/miniconf/users/191082?format=json", "institution": null}, {"id": 191083, "fullname": "Erin Lou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191083?format=json", "institution": "University of Toronto"}, {"id": 191084, "fullname": "Fahad Butt", "url": "http://cvpr.thecvf.com/api/miniconf/users/191084?format=json", "institution": "University of Western Ontario"}, {"id": 191085, "fullname": "Qian Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191085?format=json", "institution": "The University of Winnipeg"}, {"id": 134864, "fullname": "Yang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/134864?format=json", "institution": "Concordia University"}, {"id": 191086, "fullname": "Pingzhao Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191086?format=json", "institution": "Western Univeristy"}], "abstract": "Rare diseases often manifest with distinctive facial phenotypes in children, offering valuable diagnostic cues for clinicians and AI-assisted screening systems. However, progress in this field is severely limited by the scarcity of curated, ethically sourced facial data and the high similarity among phenotypes across different conditions. To address these challenges, we introduce RDFace, a curated benchmark dataset comprising 456 pediatric facial images spanning 103 rare genetic conditions (average 4.4 samples per condition). Each ethically verified image is paired with standardized metadata.  RDFace enables the development and evaluation of data-efficient AI models for rare disease diagnosis under real-world low-data constraints. We benchmark multiple pretrained vision backbones using cross-validation and explore synthetic augmentation with DreamBooth and FastGAN. Generated images are filtered via facial landmark similarity to maintain phenotype fidelity and merged with real data, improving diagnostic accuracy by up to 13.7% in ultra-low-data regimes. To assess semantic validity, phenotype descriptions generated by a vision\u2013language model from real and synthetic images achieve a report similarity score of 0.84. RDFace establishes a transparent, benchmark-ready dataset for equitable rare disease AI research and presents a scalable framework for evaluating both diagnostic performance and the integrity of synthetic medical imagery.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38967", "url": null, "sourceid": 35064, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37981, "uid": "f06819f66743e5f0161904b908c53739", "name": "DSO: Direct Steering Optimization for Bias Mitigation", "authors": [{"id": 188738, "fullname": "Lucas Monteiro Paes", "url": "http://cvpr.thecvf.com/api/miniconf/users/188738?format=json", "institution": "Apple"}, {"id": 188739, "fullname": "Nivedha Sivakumar", "url": "http://cvpr.thecvf.com/api/miniconf/users/188739?format=json", "institution": "Apple"}, {"id": 94982, "fullname": "Yinong Oliver Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/94982?format=json", "institution": "Carnegie Mellon University"}, {"id": 188740, "fullname": "Masha Fedzechkina", "url": "http://cvpr.thecvf.com/api/miniconf/users/188740?format=json", "institution": "Apple"}, {"id": 188741, "fullname": "Barry-John Theobald", "url": "http://cvpr.thecvf.com/api/miniconf/users/188741?format=json", "institution": "University of East Anglia; Apple"}, {"id": 188742, "fullname": "Luca Zappella", "url": "http://cvpr.thecvf.com/api/miniconf/users/188742?format=json", "institution": "Apple"}, {"id": 188743, "fullname": "Nicholas Apostoloff", "url": "http://cvpr.thecvf.com/api/miniconf/users/188743?format=json", "institution": "Apple"}], "abstract": "Generative models are often deployed to make decisions on behalf of users, such as vision-language models (VLMs) identifying which person in a room is a doctor to help visually impaired individuals. Yet, VLM decisions are influenced by the perceived demographic attributes of people in the input, which can lead to biased outcomes like failing to identify women as doctors. Moreover, when reducing bias leads to performance loss, users may have varying needs for balancing bias mitigation with overall model capabilities, highlighting the demand for methods that enable controllable bias reduction during inference. Activation steering is a popular approach for inference-time controllability that has shown potential in inducing safer behavior in large language models (LLMs). However, we observe that current steering methods struggle to correct biases, where equiprobable outcomes across demographic groups are required. To address this, we propose Direct Steering Optimization (DSO) which uses reinforcement learning to find linear transformations for steering activations, tailored to mitigate bias while maintaining control over model performance. We demonstrate that DSO achieves state-of-the-art trade-off between fairness and capabilities on both VLMs and LLMs, while offering practitioners inference-time control over the trade-off. Overall, our work highlights the benefit of designing steering strategies that are directly optimized to control model behavior, providing more effective bias intervention than methods that rely on pre-defined heuristics for controllability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37981", "url": null, "sourceid": 33758, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38418, "uid": "be1412586e4824ef8dc2a8ea42503bf4", "name": "Learning to Solve PDEs on Neural Shape Representations", "authors": [{"id": 189831, "fullname": "Lilian Welschinger", "url": "http://cvpr.thecvf.com/api/miniconf/users/189831?format=json", "institution": "WILLOW, Inria Paris"}, {"id": 189832, "fullname": "Yilin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189832?format=json", "institution": "Autodesk"}, {"id": 144470, "fullname": "Zican Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/144470?format=json", "institution": "University College London"}, {"id": 85661, "fullname": "Niloy J. Mitra", "url": "http://cvpr.thecvf.com/api/miniconf/users/85661?format=json", "institution": "University College London"}], "abstract": "Solving partial differential equations (PDEs) on shapes underpins many shape analysis and engineering tasks; yet, prevailing PDE solvers operate on polygonal/triangle meshes while modern 3D assets increasingly live as neural representations. This mismatch leaves no suitable method to solve surface PDEs directly within the neural domain, forcing explicit mesh extraction or per-instance residual training, preventing end-to-end workflows. We present a novel, mesh-free formulation that learns a local update operator conditioned on neural (local) shape attributes, enabling surface PDEs to be solved directly where the (neural) data lives. The operator integrates naturally with prevalent neural surface representations, is trained once on a single representative shape, and generalizes across shape and topology variations, enabling accurate, fast inference without explicit meshing or per-instance optimization while preserving differentiability. Across analytic benchmarks (heat equation and Poisson solve on sphere) and real neural assets across different representations, our method slightly outperforms CPM while remaining reasonably close to FEM, and, to our knowledge, delivers the first end-to-end pipeline that solves surface PDEs on both neural and classical surface representations. Code will be released on acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38418", "url": null, "sourceid": 32056, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39577, "uid": "8e0c5ad3b0edbdad9151b03cce7344d8", "name": "SynthRGB-T: Language-Vision Guided Image Translation for Diversity Synthesis", "authors": [{"id": 180256, "fullname": "Jiangang Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/180256?format=json", "institution": "Chang&#x27;an University"}, {"id": 192390, "fullname": "Yiquan Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/192390?format=json", "institution": "Chang&#x27;an University"}, {"id": 192391, "fullname": "Pengxiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192391?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 192392, "fullname": "Lili Pei", "url": "http://cvpr.thecvf.com/api/miniconf/users/192392?format=json", "institution": "Chang&#x27;an University"}, {"id": 192393, "fullname": "Yuanlin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192393?format=json", "institution": "Chang&#x27;an University"}, {"id": 192394, "fullname": "Wei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192394?format=json", "institution": "Chang&#x27; an University"}], "abstract": "Bridging the modality gap between infrared and visible imagery is critical for cross-modal understanding and for enriching multimodal benchmarks. However, existing approaches remain confined to one-to-one mappings and are typically evaluated on unidirectional or closed-set scenarios. To address this challenge, we present SynthRGB-T, a unified framework for diverse and bidirectional image translation. Specifically, we formulate image translation as a vision-language guided denoising diffusion process, enabling flexible conditioning and open-world generalization. To enhance semantic alignment, a Visual Grounding Pipeline (VGP) is introduced to exploit the world knowledge of foundation models for fine-grained translation guidance. During the diffusion process, we propose to adopt a decoupling injection strategy to alleviate interference among multiple guidance. In addition, a Dual Conditional Cross-Attention (DCCA) module is designed to facilitate collaborative representation learning in latent space. SynthRGB-T is simple and versatile\u2014capable of synthesizing diverse, high-fidelity data that substantially extends multimodal resources within the community. Comprehensive evaluations on multiple real-world benchmarks confirm that SynthRGB-T delivers superior performance and enhanced visual diversity over existing approaches. All code, models, and large-scale synthetic datasets will be released upon camera ready version.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39577", "url": null, "sourceid": 39802, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38075, "uid": "c8aeda90efcb4fc4ff5ab2c92ee5b98f", "name": "NESTOR: A Nested MOE-based Neural Operator for Large-Scale PDE Pre-Training", "authors": [{"id": 154617, "fullname": "Dengdi Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/154617?format=json", "institution": "Anhui University"}, {"id": 188994, "fullname": "Xiaoya Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/188994?format=json", "institution": "Anhui University"}, {"id": 71047, "fullname": "Xiao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/71047?format=json", "institution": "Anhui University"}, {"id": 188995, "fullname": "Hao Si", "url": "http://cvpr.thecvf.com/api/miniconf/users/188995?format=json", "institution": "Anhui University"}, {"id": 188996, "fullname": "Wanli Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188996?format=json", "institution": "Anhui University"}, {"id": 126842, "fullname": "Jin Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126842?format=json", "institution": "Anhui University"}, {"id": 188997, "fullname": "Bin Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188997?format=json", "institution": "Anhui University"}], "abstract": "Neural operators have emerged as an efficient paradigm for solving PDEs, overcoming the limitations of traditional numerical methods and significantly improving computational efficiency. However, due to the diversity and complexity of PDE systems, existing neural operators typically rely on a single network architecture, which limits their capacity to fully capture heterogeneous features and complex system dependencies. This constraint poses a bottleneck for large-scale PDE pre-training based on neural operators. To address these challenges, we propose a large-scale PDE pre-trained neural operator based on a nested Mixture-of-Experts (MoE) framework. In particular, the image-level MoE is designed to capture global dependencies, while the token-level Sub-MoE focuses on local dependencies. Our model can selectively activate the most suitable expert networks for a given input, thereby enhancing generalization and transferability. We conduct large-scale pre-training on twelve PDE datasets from diverse sources and successfully transfer the model to downstream tasks. Extensive experiments demonstrate the effectiveness of our approach.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38075", "url": null, "sourceid": 30818, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39827, "uid": "3d1327c18a2010bbcada3a8f322c5a2e", "name": "Scene Reconstruction as Mapping Priors for 3D Detection", "authors": [{"id": 102978, "fullname": "Yang Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/102978?format=json", "institution": "University of California San Diego"}, {"id": 133883, "fullname": "Yuliang Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/133883?format=json", "institution": "Waymo"}, {"id": 89962, "fullname": "Hao Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89962?format=json", "institution": "University of California, Los Angeles"}, {"id": 134826, "fullname": "Xin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/134826?format=json", "institution": "Waymo"}, {"id": 157167, "fullname": "Yijing Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/157167?format=json", "institution": "Waymo"}, {"id": 192931, "fullname": "Chen Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/192931?format=json", "institution": "Waymo LLC"}, {"id": 192932, "fullname": "Weijing Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192932?format=json", "institution": "Meta"}, {"id": 89960, "fullname": "Govind Thattai", "url": "http://cvpr.thecvf.com/api/miniconf/users/89960?format=json", "institution": "Amazon"}, {"id": 85225, "fullname": "Dragomir Anguelov", "url": "http://cvpr.thecvf.com/api/miniconf/users/85225?format=json", "institution": "Waymo"}, {"id": 138724, "fullname": "Mingxing Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/138724?format=json", "institution": "Waymo"}, {"id": 75524, "fullname": "Yingwei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/75524?format=json", "institution": "Waymo"}], "abstract": "In autonomous driving, mapping is critical for motion planning but remains an under-utilized resource for perception tasks like 3D object detection. Maps can provide robust structural priors of the static environment, suited to resolving ambiguities and correcting for sensor data sparsity or noise \u2014 issues especially prevalent for distant objects or during adverse weather conditions. However, conventional High-Definition (HD) maps are resource-intensive to obtain and maintain, which presents a challenge for achieving efficient, large-scale deployment. In this paper, we propose a scalable solution to systemically leverage mapping to improve 3D detection by overcoming two primary challenges. First, we introduce a pipeline to automatically build dense mapping priors from aggregated sensor data, eliminating the need for human labeling. Second, we design a novel Mapping Prior Augmented 3D detection (MPA3D) framework to effectively integrate the mapping priors with the distinct modalities of sensor data. Our extensive experiments on the Waymo Open Dataset demonstrate that our approach achieves new state-of-the-art results, and proving the effectiveness of using scalable, reconstructed scene priors to enhance 3D detection.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39827", "url": null, "sourceid": 39590, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37898, "uid": "0c31dda78664045d19fd1c04dc76abab", "name": "Dedelayed: Deleting remote inference delay via on-device correction", "authors": [{"id": 176731, "fullname": "Dan Jacobellis", "url": "http://cvpr.thecvf.com/api/miniconf/users/176731?format=json", "institution": "The University of Texas at Austin"}, {"id": 188523, "fullname": "Mateen Ulhaq", "url": "http://cvpr.thecvf.com/api/miniconf/users/188523?format=json", "institution": "InterDigital"}, {"id": 188524, "fullname": "Fabien Racap\u00e9", "url": "http://cvpr.thecvf.com/api/miniconf/users/188524?format=json", "institution": "InterDigital"}, {"id": 188525, "fullname": "Hyomin Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188525?format=json", "institution": "InterDigital"}, {"id": 188526, "fullname": "Neeraja Yadwadkar", "url": "http://cvpr.thecvf.com/api/miniconf/users/188526?format=json", "institution": "University of Texas at Austin"}], "abstract": "Video comprises the vast majority of bits that are generated daily, and is the primary signal driving current innovations in robotics, remote sensing, and wearable technology.Yet, the most powerful video understanding models are too expensive for the resource-constrained platforms used in these applications.One approach is to offload inference to the cloud; this gives access to GPUs capable of processing high-resolution videos in real time.But even with reliable, high-bandwidth communication channels, the combined latency of video encoding, model inference, and round-trip communication prohibits use for certain real-time applications.The alternative is to use fully local inference; but this places extreme constraints on computational and power costs, requiring smaller models and lower resolution, leading to degraded accuracy.To address these challenges, we propose Dedelayed, a real-time inference system that divides computation between a remote model operating on delayed video frames and a local model with access to the current frame.The remote model is trained to make predictions on anticipated future frames, which the local model incorporates into its prediction for the current frame.The local and remote models are jointly optimized with an autoencoder that limits the transmission bitrate required by the available downlink communication channel.We evaluate Dedelayed on the task of real-time streaming video segmentation using the BDD100k driving dataset. For a round trip delay of 100 ms, Dedelayed improves performance by 6.4 mIoU compared to fully local inference and 9.8 mIoU compared to remote inference---an equivalent improvement to using a model ten times larger. We release our training code, pretrained models, and python library at URL REDACTED FOR ANONYMITY.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37898", "url": null, "sourceid": 32197, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36681, "uid": "59aee406105cd10e79b22ee05173ead7", "name": "Interpretable Debiasing of Vision-Language Models for Social Fairness", "authors": [{"id": 182201, "fullname": "Na Min An", "url": "http://cvpr.thecvf.com/api/miniconf/users/182201?format=json", "institution": "KAIST"}, {"id": 185634, "fullname": "Yoonna Jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185634?format=json", "institution": "University of Copenhagen"}, {"id": 89052, "fullname": "Yusuke Hirota", "url": "http://cvpr.thecvf.com/api/miniconf/users/89052?format=json", "institution": "Osaka University"}, {"id": 86491, "fullname": "Ryo Hachiuma", "url": "http://cvpr.thecvf.com/api/miniconf/users/86491?format=json", "institution": "Konica Minolta, Inc."}, {"id": 185635, "fullname": "Isabelle Augenstein", "url": "http://cvpr.thecvf.com/api/miniconf/users/185635?format=json", "institution": "University of Copenhagen"}, {"id": 144939, "fullname": "Hyunjung Shim", "url": "http://cvpr.thecvf.com/api/miniconf/users/144939?format=json", "institution": "KAIST"}], "abstract": "The rapid advancement of Vision-Language models (VLMs) has raised growing concern that their black-box reasoning processes could lead to unintended forms of social bias. Current debiasing approaches focus on mitigating surface-level bias signals through post-hoc learning or test-time algorithms, while leaving the internal dynamics of the model largely unexplored. In this work, we introduce an interpretable, model-agnostic bias mitigation framework, DEBIASLENS, that localizes social attribute neurons in VLMs through sparse autoencoders (SAEs) applied to multimodal encoders. Building upon the disentanglement ability of SAEs, we train them on facial image or caption datasets without corresponding social attribute labels to uncover neurons highly responsive to specific demographics, including those that are less represented. By selectively deactivating the social neurons most strongly tied to bias for each group, we effectively mitigate socially biased behaviors of VLMs without degrading their semantic knowledge. Our research lays the groundwork for future auditing tools, prioritizing social fairness in emerging real-world AI systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36681", "url": null, "sourceid": 45493, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38394, "uid": "eda5fe1762293c38fb40f42774a274c4", "name": "SO-Bench: A Structural Output Evaluation of Multimodal LLM", "authors": [{"id": 135860, "fullname": "Di Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/135860?format=json", "institution": "Apple Inc."}, {"id": 189787, "fullname": "Kaixin Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/189787?format=json", "institution": "Apple"}, {"id": 189788, "fullname": "Feng Nan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189788?format=json", "institution": "Apple"}, {"id": 189789, "fullname": "Haofeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189789?format=json", "institution": "Apple"}, {"id": 187797, "fullname": "Bohan Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/187797?format=json", "institution": "Snowflake"}, {"id": 158635, "fullname": "David Griffiths", "url": "http://cvpr.thecvf.com/api/miniconf/users/158635?format=json", "institution": "Apple Inc."}, {"id": 88896, "fullname": "Mingfei Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88896?format=json", "institution": "Apple"}, {"id": 87966, "fullname": "Zhe Gan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87966?format=json", "institution": "Apple"}, {"id": 94162, "fullname": "Eshan Verma", "url": "http://cvpr.thecvf.com/api/miniconf/users/94162?format=json", "institution": "Apple Inc."}, {"id": 84650, "fullname": "Yinfei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84650?format=json", "institution": "Apple"}, {"id": 189790, "fullname": "Zhifeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189790?format=json", "institution": "Apple"}, {"id": 151014, "fullname": "Afshin Dehghan", "url": "http://cvpr.thecvf.com/api/miniconf/users/151014?format=json", "institution": "Apple"}], "abstract": "Multimodal large language models (MLLMs) are increasingly deployed in real-world, agentic settings where outputs must not only be correct, but also conform to pre-defined data schemas. Despite recent progress in structured generation in textual domain, there is still no benchmark that systematically evaluates schema-grounded information extraction and reasoning over visual inputs. In this work, we conduct a comprehensive study of visual structural output capabilities for MLLMs with our carefully designed SO-BENCH benchmark. Covering four visual domains, including UI screens, natural images, documents, and charts, SO-BENCH is built from over 6.5K diverse JSON schemas and 1.8K curated image\u2013schema pairs with human-verified quality. Benchmarking experiments on open-sourced andfrontier proprietary models reveal persistent gaps in predicting accurate, schema compliant outputs, highlighting the need for better multimodal structured reasoning. Beyond benchmarking, we further conduct training experiments to largely improve the model\u2019s structured output capability. We plan to make the benchmark available to the community.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38394", "url": null, "sourceid": 46214, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39492, "uid": "16805acc50701e4904620cfe12273731", "name": "SimRecon: SimReady Compositional Scene Reconstruction from Real Videos", "authors": [{"id": 179963, "fullname": "Chong Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/179963?format=json", "institution": "Tsinghua University"}, {"id": 192193, "fullname": "Kai Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192193?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 192194, "fullname": "Zizhuo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192194?format=json", "institution": "Tsinghua University"}, {"id": 132561, "fullname": "Fangfu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/132561?format=json", "institution": "Tsinghua University"}, {"id": 126782, "fullname": "Zhizheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126782?format=json", "institution": "Microsoft Research"}, {"id": 76969, "fullname": "Yueqi Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/76969?format=json", "institution": "Tsinghua University"}], "abstract": "Compositional scene reconstruction seeks to create object-centric representations rather than holistic scenes from real-world videos, which is natively applicable for simulation and interaction. Conventional compositional reconstruction approaches primarily emphasize on visual appearance and show limited generalization ability to real-world scenarios. In this paper, we propose SimRecon, a framework that realizes a ''Perception-Generation-Simulation\" pipeline towards cluttered scene reconstruction, which first conducts scene-level semantic reconstruction from video input, then performs single-object generation, and finally assembles these assets in the simulator. However, naively combining these three stages leads to visual infidelity of generated assets and physical implausibility of the final scene, a problem particularly severe for complex scenes. Thus, we further propose two bridging modules between the three stages to address this problem. To be specific, for the transition from Perception to Generation, critical for visual fidelity, we introduce Active Viewpoint Optimization, which actively searches in 3D space to acquire optimal projected images as conditions for single-object completion. Moreover, for the transition from Generation to Simulation, essential for physical plausibility, we propose a Scene Graph Synthesizer, which guides the construction from scratch in 3D simulators, mirroring the native, constructive principle of the real world. Extensive experiments on the ScanNet dataset validate our method's superior performance over previous state-of-the-art approaches.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39492", "url": null, "sourceid": 43306, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38976, "uid": "b667279b6400b4c05f3b5c4241e8bf7f", "name": "Outlier-Robust Diffusion Solvers for Inverse Problems", "authors": [{"id": 181997, "fullname": "Yang Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/181997?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 191105, "fullname": "Jiahua Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191105?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 191106, "fullname": "Tongyao Pang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191106?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 88470, "fullname": "Wen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88470?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 128190, "fullname": "Zhaoqiang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128190?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Methods based on diffusion models (DMs) for solving inverse problems (IPs) have recently achieved remarkable performance. However, DM-based methods typically struggle against outliers, which are common in real-world measurements. In this work, to tackle IPs with outliers, we first refine the measurement via explicit noise estimation to mitigate the effect of noise. Subsequently, we formulate an iteratively reweighted least squares objective based on the Huber loss to address the outliers. We propose a method utilizing gradient descent to approximately solve the corresponding optimization problem for the robust objective. To avoid delicate tuning of the learning rate required by the gradient descent method, we further employ the conjugate gradient method with an efficient strategy for updating. Extensive experiments on multiple image datasets for linear and nonlinear tasks under various conditions demonstrate that our proposed methods exhibit robustness to outliers and outperform recent DM-based methods in most cases.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38976", "url": null, "sourceid": 31761, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37381, "uid": "509212008def037bf1dc80aeb77ff6db", "name": "240FPS Stereo Vision from Monocular Mixed Spikes", "authors": [{"id": 129436, "fullname": "Yeliduosi Xiaokaiti", "url": "http://cvpr.thecvf.com/api/miniconf/users/129436?format=json", "institution": "Peking University"}, {"id": 77413, "fullname": "Yakun Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77413?format=json", "institution": null}, {"id": 187303, "fullname": "Yang Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/187303?format=json", "institution": "Peking University"}, {"id": 102064, "fullname": "Zhaojun Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/102064?format=json", "institution": "Peking University"}, {"id": 70351, "fullname": "Peiqi Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/70351?format=json", "institution": "Peking University"}, {"id": 76401, "fullname": "Boxin Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/76401?format=json", "institution": "Peking University"}], "abstract": "Stereo vision is fundamental for enabling machines to perceive and interact with the world. While monocular stereo methods offer hardware compactness, they struggle with generalization due to reliance on data-driven priors. Binocular and multi-view systems improve accuracy but incur higher hardware complexity and data inefficiency. In this paper, we introduce a monocular solution for high-frame-rate stereo vision via temporal optical modulation. The modulation directs light from two views in a mixed manner while periodically attenuates one view at 60Hz. To capture the temporal variations introduced by this modulation, we employ a high-speed spike camera that records the mixed scene as temporally dense spikes. And the high temporal resolution of these spikes enables the construction of a linear system for efficient binocular video decoupling.Consequently, we introduce a two-stage decoding methodology for achieving high-quality stereo vision: An efficient least-squares based baseline reconstruction followed by a deep learning refinement module. Experimental results demonstrate that our approach achieves 240FPS binocular video reconstruction with superior accuracy compared to monocular systems, while maintaining the hardware compactness and data efficiency.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37381", "url": null, "sourceid": 46595, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38426, "uid": "94860396908b3cb9a7286bd070132db8", "name": "BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation", "authors": [{"id": 150600, "fullname": "Miaowei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150600?format=json", "institution": "University of Edinburgh, University of Edinburgh"}, {"id": 189844, "fullname": "Qingxuan Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189844?format=json", "institution": "Cornell University"}, {"id": 150390, "fullname": "Zhi Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/150390?format=json", "institution": "University of Michigan - Ann Arbor"}, {"id": 97232, "fullname": "Yayuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/97232?format=json", "institution": "University of Michigan, Ann Arbor"}, {"id": 75678, "fullname": "Oisin Mac Aodha", "url": "http://cvpr.thecvf.com/api/miniconf/users/75678?format=json", "institution": "University of Edinburgh"}, {"id": 95802, "fullname": "Jason Corso", "url": "http://cvpr.thecvf.com/api/miniconf/users/95802?format=json", "institution": "Voxel51; University of Michigan"}, {"id": 86201, "fullname": "Amir Vaxman", "url": "http://cvpr.thecvf.com/api/miniconf/users/86201?format=json", "institution": "University of Edinburgh"}], "abstract": "Text-guided dynamic 3D character generation has advanced rapidly, yet producing high-quality motion that faithfully reflects rich textual descriptions remains challenging. Existing methods tend to generate limited sub-actions or incoherent motion due to fixed-length temporal inputs and  discrete frame-wise representations that fail to capture rich motion semantics. We address these limitations by representing motion with continuous differentiable B-spline curves, enabling more effective motion generation without modifying the capabilities of the underlying generative model. Specifically, our  closed-form, Laplacian-regularized B-spline solver efficiently compresses variable-length motion sequences into compact representations with a fixed number of control points.  Further, we introduce a normal-fusion strategy for input shape adherence along with correspondence-aware and local-rigidity losses for motion-restoration quality.  To train our model, we collate BIMO, a new dataset containing diverse variable-length 3D motion sequences with rich, high-quality text annotations.  Extensive evaluations show that our feed-forward framework BiMotion generates more expressive, higher-quality, and better prompt-aligned motions than existing state-of-the-art methods, while also achieving faster generation.  The code is available in the supplemental material and will be made publicly available upon publication.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38426", "url": null, "sourceid": 36025, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40030, "uid": "f49a02e0f135ae9f2bcf60f55bcd0174", "name": "FedHarmony: Harmonizing Heterogeneous Label Correlations in Federated Multi-Label Learning", "authors": [{"id": 193340, "fullname": "Zhiqiang Kou", "url": "http://cvpr.thecvf.com/api/miniconf/users/193340?format=json", "institution": "Hong Kong Polytechnic University; RIKEN"}, {"id": 180824, "fullname": "Junxiang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180824?format=json", "institution": "Southeast University"}, {"id": 184315, "fullname": "Wenke Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184315?format=json", "institution": "Nanyang Technological University"}, {"id": 144760, "fullname": "Wenwen He", "url": "http://cvpr.thecvf.com/api/miniconf/users/144760?format=json", "institution": "Wuhan University"}, {"id": 154615, "fullname": "Ming-Kun Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/154615?format=json", "institution": "RIKEN"}, {"id": 158850, "fullname": "Changwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158850?format=json", "institution": "Qilu University of Technology (Shandong Academy of Sciences)"}, {"id": 193341, "fullname": "Yuheng Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/193341?format=json", "institution": "Southeast University"}, {"id": 193342, "fullname": "Di Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193342?format=json", "institution": "Hong Kong Polytechnic University; WeBank"}, {"id": 193343, "fullname": "Yang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193343?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 84884, "fullname": "Xin Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/84884?format=json", "institution": "Southeast University"}, {"id": 155217, "fullname": "Qiang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155217?format=json", "institution": "Webank"}], "abstract": "Multi-label representations encode higher-order label dependencies, yet in federated settings the local estimates of these dependencies are statistically inconsistent, causing structural drift across clients and rendering naive quantity-weighted aggregation suboptimal. We propose FedHarmony, a federated multi-label learning framework that harmonizes heterogeneous label correlations without sharing raw data. A Correlation Expert is formed by leave-one-out consolidation of clients\u2019 label\u2013label correlation statistics to provide a round-wise global consensus. Guided by this expert, each client performs consensus-guided correction that aligns its local correlation to the consensus within clusters of strongly related labels obtained via spectral clustering of the expert matrix. This block-wise alignment targets dense, high-signal subspaces. We establish two guarantees: (i) restricting alignment to in-cluster pairs strictly improves optimization curvature and linear convergence rate; (ii) ignoring cross-cluster entries incurs only a bounded, quantitatively small information loss when the consensus is near block-diagonal. Finally, a correlation-aware central aggregation combines data quantity with a dynamic measure of correlation learning quality, using a dynamic balance factor that transitions from quantity-driven weighting in early rounds to structure-driven weighting later. Extensive experiments under diverse non-IID regimes (varying label distributions, client heterogeneity, and client counts) show consistent gains over federated baselines in mAP/F1/Hamming Loss, with improved stability and communication efficiency.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40030", "url": null, "sourceid": 39460, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37047, "uid": "d33802931e5b740f13b45b0aff89aaf4", "name": "Understanding Counting Mechanisms in Large Language and Vision-Language Models", "authors": [{"id": 186557, "fullname": "Hosein Hasani", "url": "http://cvpr.thecvf.com/api/miniconf/users/186557?format=json", "institution": "Sharif University of Technology"}, {"id": 186558, "fullname": "Amirmohammad Izadi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186558?format=json", "institution": "Sharif University of Technology, Sharif University of Technology"}, {"id": 186559, "fullname": "Fatemeh Askari", "url": "http://cvpr.thecvf.com/api/miniconf/users/186559?format=json", "institution": "Sharif University of Technology, Sharif University of Technology"}, {"id": 186560, "fullname": "Mobin Bagherian", "url": "http://cvpr.thecvf.com/api/miniconf/users/186560?format=json", "institution": "Sharif University of Technology, Sharif University of Technology"}, {"id": 186561, "fullname": "Sadegh Mohammadian", "url": "http://cvpr.thecvf.com/api/miniconf/users/186561?format=json", "institution": "Sharif University of Technology"}, {"id": 186562, "fullname": "Mohammad Izadi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186562?format=json", "institution": "Sharif University of Technology"}, {"id": 130177, "fullname": "Mahdieh Baghshah", "url": "http://cvpr.thecvf.com/api/miniconf/users/130177?format=json", "institution": "Sharif University of Technology"}], "abstract": "This paper examines how large language models (LLMs) and large vision-language models (LVLMs) represent and compute numerical information in counting tasks. We use controlled experiments with repeated textual and visual items and analyze model behavior through causal mediation and activation patching. To this end, we design a specialized tool, CountScope, for mechanistic interpretability of numerical content. Results show that individual tokens or visual features encode latent positional count information that can be extracted and transferred across contexts. Layerwise analyses reveal a progressive emergence of numerical representations, with lower layers encoding small counts and higher layers representing larger ones. We identify an internal counter mechanism that updates with each item, stored mainly in the final token or region and transferable between contexts. In LVLMs, numerical information also appears in visual embeddings, shifting between background and foreground regions depending on spatial composition.Models rely on structural cues such as separators in text, which act as shortcuts for tracking item counts and influence the accuracy of numerical predictions. Overall, counting emerges as a structured, layerwise process in LLMs and follows the same general pattern in LVLMs, shaped by the properties of the vision encoder.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37047", "url": null, "sourceid": 33651, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40210, "uid": "479761079a2771a7a1863678f85198da", "name": "Ego2Web: A Web Agent Benchmark Grounded on Egocentric Videos", "authors": [{"id": 147560, "fullname": "Shoubin Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/147560?format=json", "institution": "Department of Computer Science, University of North Carolina at Chapel Hill"}, {"id": 193785, "fullname": "Lei Shu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193785?format=json", "institution": "Google"}, {"id": 90739, "fullname": "Antoine Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90739?format=json", "institution": "Inria"}, {"id": 193786, "fullname": "Yao Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193786?format=json", "institution": "University of Edinburgh"}, {"id": 193787, "fullname": "Srinivas Sunkara", "url": "http://cvpr.thecvf.com/api/miniconf/users/193787?format=json", "institution": "Google Deepmind"}, {"id": 193788, "fullname": "Maria Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193788?format=json", "institution": "Google"}, {"id": 193789, "fullname": "Jindong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193789?format=json", "institution": "Google"}, {"id": 75594, "fullname": "Mohit Bansal", "url": "http://cvpr.thecvf.com/api/miniconf/users/75594?format=json", "institution": "University of North Carolina at Chapel Hill"}, {"id": 88081, "fullname": "Boqing Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/88081?format=json", "institution": "Google"}], "abstract": "Multimodal AI agents are increasingly automating complex real-world workflows that involve online web execution. However, current web-agent benchmarks suffer from a critical limitation: they focus entirely on web-based interaction and perception, lacking grounding in the user's real-world physical surroundings. This limitation prevents evaluation in crucial scenarios, such as when an agent must use egocentric visual perception (e.g., via AR glasses) to recognize an object in the user's surroundingsand then complete a related task online (e.g., making a purchase).To address this gap, we introduce Ego2Web, the first benchmark designed to bridge egocentric video perception and multimodal web agent execution. Ego2Web pairs real-world first-person video recordings with web tasks that require visual understanding, web task planning, and interaction in an online environment for successful completion. We utilize an automatic data-generation pipeline combined with human verification to curate well-constructed, high-quality video-task pairs across diverse web task types, including e-commercial, navigation, media search, and so on. To facilitate a more accurate and scalable evaluation for our novel benchmark, we also develop a novel LLM-as-a-Judge automatic evaluation method Ego2WebJudge, and demonstrate around 85\\% agreement with human judgment, substantially higher than existing evaluation methods.Experiments with diverse SoTA multimodal agents show that they perform significantly below the human level, revealing a major gap in capability. We also conduct a comprehensive ablation study on task design, highlighting the necessity of video perception in the proposed task and the limitations of current agents.We hope Ego2Web can be a critical new resource for developing truly capable AI assistants that can seamlessly see, understand, and act across the physical and digital worlds.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40210", "url": null, "sourceid": 36597, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39134, "uid": "1ad74caec14332594a6e910b3d183e87", "name": "Gradient Knows Best: Mixed-Precision Quantization via Gradient-Guided Bit Allocation for Super-Resolution", "authors": [{"id": 191421, "fullname": "Jun Young Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/191421?format=json", "institution": "Dongguk University"}, {"id": 191422, "fullname": "Joo Jeon", "url": "http://cvpr.thecvf.com/api/miniconf/users/191422?format=json", "institution": "Sogang University"}, {"id": 157125, "fullname": "Sangyeon Ahn", "url": "http://cvpr.thecvf.com/api/miniconf/users/157125?format=json", "institution": "Dongguk University"}, {"id": 184172, "fullname": "Yoonseo Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/184172?format=json", "institution": "Sogang University"}, {"id": 191423, "fullname": "Yong Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/191423?format=json", "institution": "Dongguk University"}, {"id": 191424, "fullname": "Bogyeong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/191424?format=json", "institution": "Kookmin University; Dongguk University"}, {"id": 185262, "fullname": "Sung In Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/185262?format=json", "institution": "Sogang University"}], "abstract": "Although deep learning-based image super-resolution (SR) models have achieved remarkable progress in reconstruction quality, their high computational and memory demands make them unsuitable for lightweight platforms. To address this issue, various quantization techniques have been introduced. Among them, mixed-precision quantization (MPQ) introduces a layer-wise bit-width allocation to balance computational efficiency with reconstruction quality. However, existing MPQ methods based on post-training quantization (PTQ) for SR models face two critical limitations. First, quantization sensitivity estimation using static statistics fails to capture the accurate quantization error induced by each layer, resulting in suboptimal bit allocation. Second, removing batch normalization (BN) to preserve high-frequency details leads to scale inconsistencies across activations, making fixed quantization ranges insufficient to accurately represent their distribution. Therefore, we propose a novel PTQ-based MPQ framework tailored for SR models. Our method estimates the quantization sensitivity of weights and activations by leveraging gradients of the objective function with respect to bit-widths, enabling adaptive layer-wise bit allocation and fast convergence. Additionally, we introduce a dynamic activation range normalization that alleviates the distributional imbalance caused by the absence of BN, ensuring stable quantization under fixed range constraints. Our method outperforms existing PTQ-based methods by 1.26 dB in peak signal-to-noise ratio (PSNR) on the Urban100 dataset and reduces quantization time by \u00d71.9 for 3-bit quantization of EDSR \u00d74.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39134", "url": null, "sourceid": 45740, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39324, "uid": "92e62389b8bf9750a3215b44d54adee2", "name": "Learning to Diversify and Focus: A Reinforcement Framework for Open-Vocabulary HOI Detection", "authors": [{"id": 153649, "fullname": "Yongchao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153649?format=json", "institution": "University of Science and Technology of China"}, {"id": 90691, "fullname": "Jiawei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90691?format=json", "institution": "University of Science and Technology of China"}, {"id": 191849, "fullname": "Junfeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191849?format=json", "institution": "University of Science and Technology of China"}, {"id": 144197, "fullname": "Sen Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/144197?format=json", "institution": "Shenyang Institute of Automation, Chinese Academy of Sciences; University of the Chinese Academy of Sciences; University of Science and Technology of China"}, {"id": 191850, "fullname": "Na Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191850?format=json", "institution": "Capital Normal University"}, {"id": 86637, "fullname": "Zheng-Jun Zha", "url": "http://cvpr.thecvf.com/api/miniconf/users/86637?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Open-Vocabulary Human\u2013Object Interaction (OV-HOI) detection aims to recognize novel HOI categories beyond the training set.Existing OV-HOI detection approaches typically leverage CLIP to extract global visual representations and perform cross-attention between learnable queries and global features to localize human\u2013object pairs.However, such one-stage paradigms tend to overfit seen interactions, limiting their generalization to unseen categories, while the coarse spatial awareness of CLIP also hinders the localization of fine-grained interaction cues.To address these issues, we propose a novel Semantic-Diversified and Interaction-Focused framework (SD-IF), which integrates reinforcement-guided adaptive optimization to jointly enhance semantic generalization and spatial discrimination.Specifically, we introduce a Semantic Diversification (SD) module that applies reinforcement-driven stochastic semantic perturbations and dual-level semantic exploration, expanding the semantic coverage of queries while maintaining visual coherence and effectively encouraging exploration beyond the seen semantic clusters.Furthermore, we design an Interaction Focusing (IF) module that formulates an actor\u2013critic optimization scheme to adaptively refine attention distributions based on detection features and interaction representations, guided by a hybrid reward combining spatial focusing and semantic consistency.This cooperative learning paradigm enables the model to capture discriminative interaction cues and achieve spatially interpretable reasoning.Extensive experiments on two widely used benchmarks demonstrate that SD-IF achieves state-of-the-art performance, significantly surpassing existing OV-HOI detection methods.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39324", "url": null, "sourceid": 40342, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40195, "uid": "0fc8dee1ef46c0467df068405aa59715", "name": "3M-TI: High-Quality Mobile Thermal Imaging via Calibration-free Multi-Camera Cross-Modal Diffusion", "authors": [{"id": 176625, "fullname": "Minchong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/176625?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 183435, "fullname": "Xiaoyun Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/183435?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 176616, "fullname": "Junzhe Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/176616?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 193751, "fullname": "Jianing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193751?format=json", "institution": "Fudan University"}, {"id": 193752, "fullname": "Jun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193752?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "The miniaturization of thermal sensors for mobile platforms inherently limits their spatial resolution and textural fidelity, leading to blurry and less informative images. Existing thermal super-resolution (SR) methods can be grouped into single-image and RGB-guided approaches: the former struggles to recover fine structures from limited information, while the latter relies on accurate and laborious cross-camera calibration, which hinders practical deployment and robustness. Here, we propose 3M-TI, a calibration-free Multi-camera cross-Modality diffusion framework for Mobile Thermal Imaging. At its core, 3M-TI integrates a cross-modal self-attention module (CSM) into the diffusion UNet, replacing the original self-attention layers to adaptively align thermal and RGB features throughout the denoising process, without requiring explicit camera calibration. This design enables the diffusion network to leverage its generative prior to enhance spatial resolution, structural fidelity, and texture detail in the super-resolved thermal images. Extensive evaluations on real-world mobile thermal cameras and public benchmarks validate our superior performance, achieving state-of-the-art results in both visual quality and quantitative metrics. More importantly, the thermal images enhanced by 3M-TI lead to substantial gains in critical downstream tasks like object detection and segmentation, underscoring its practical value for robust mobile thermal perception systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40195", "url": null, "sourceid": 30762, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38613, "uid": "d471ba80482c93d408c57f0974b23a1e", "name": "Negative Binomial Variational Autoencoders for Overdispersed Latent Modeling", "authors": [{"id": 190300, "fullname": "Yixuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190300?format=json", "institution": "Southeast University"}, {"id": 190301, "fullname": "Jinhao Sheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190301?format=json", "institution": ""}, {"id": 190302, "fullname": "Wenxin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190302?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 154150, "fullname": "Quyu Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/154150?format=json", "institution": "Alibaba Group"}, {"id": 129879, "fullname": "Feng Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/129879?format=json", "institution": "Renmin University of China"}], "abstract": "Although artificial neural networks are often described as brain-inspired, their representations typically rely on continuous activations, such as the continuous latent variables in variational autoencoders (VAEs), which limits their biological plausibility compared to the discrete spike-based signaling in real neurons. Extensions like the Poisson VAE introduce discrete count-based latents, but their equal mean-variance assumption fails to capture overdispersion in neural spikes, leading to less expressive and informative representations. To address this, we propose NegBio-VAE, a negative-binomial latent-variable model with a dispersion parameter for flexible spike count modeling. NegBio-VAE preserves interpretability while improving representation quality and training feasibility via novel KL estimation and reparameterization. Experiments on four datasets demonstrate that NegBio-VAE consistently achieves superior reconstruction and generation performance, and yields robust, informative latent representations for downstream tasks. Extensive ablation studies are performed to verify the model\u2019s robustness w.r.t. various components.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38613", "url": null, "sourceid": 42672, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38593, "uid": "6df0f11ac980d2c59e5b4dab6a8e7611", "name": "Towards Human-Imperceptible Backdoor Attacks on Text-to-Image Diffusion Models", "authors": [{"id": 190232, "fullname": "Yiming Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190232?format=json", "institution": "Zhejiang University of Technology"}, {"id": 190233, "fullname": "Chenghao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190233?format=json", "institution": "Zhejiang University of Technology"}, {"id": 190234, "fullname": "Wu kun", "url": "http://cvpr.thecvf.com/api/miniconf/users/190234?format=json", "institution": "Zhejiang University of Technology"}, {"id": 190235, "fullname": "Chong Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190235?format=json", "institution": "Zhengzhou University"}, {"id": 190236, "fullname": "Biru Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190236?format=json", "institution": "Information Engineering University"}, {"id": 190237, "fullname": "Zhenyu Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190237?format=json", "institution": "Zhejiang University of Technology"}, {"id": 190238, "fullname": "Zhen Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/190238?format=json", "institution": "Zhejiang University of Technology"}], "abstract": "Deep learning models are well known to be susceptible to backdoor attacks, and text-to-image generation models are no exception. When a specific trigger is embedded in the input, a backdoored model can be manipulated to perform attacker-defined malicious behaviors, such as generating harmful or inappropriate images. Existing backdoor attacks on text-to-image generation models are largely limited to dirty-label attacks, where misaligned image-caption pairs are injected into the training data. While effective in controlled settings, such methods are often easily detectable, limiting their practicality in realistic applications. To address this limitation, we propose the first clean-label backdoor attack for text-to-image generative models, which preserves semantic consistency within poisoned image-caption pairs to evade detection. We design a dual-modality manipulation strategy that injects nearly imperceptible noise into images while embedding a composite semantic text trigger. The text trigger combines synonym substitution and syntactic restructuring, enabling stealthy yet effective backdoor implantation without compromising the visual\u2013textual alignment. Experimental results demonstrate that our method achieves high attack success while effectively preserving model utility and evading mainstream defenses, including commercial content filters.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38593", "url": null, "sourceid": 35930, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37149, "uid": "6c79c78cd728ec85bca7b10b07082e65", "name": "REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement", "authors": [{"id": 182121, "fullname": "Hankyeol Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/182121?format=json", "institution": "Yonsei University"}, {"id": 150188, "fullname": "WOOYEOL BAEK", "url": "http://cvpr.thecvf.com/api/miniconf/users/150188?format=json", "institution": "Yonsei University"}, {"id": 172794, "fullname": "Seongdo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/172794?format=json", "institution": "Yonsei University"}, {"id": 186778, "fullname": "Jongyoo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/186778?format=json", "institution": "Yonsei University"}], "abstract": "Recent generative models have shown strong performance in generating diverse 3D assets from 2D images, a fundamental research topic in computer vision and graphics. However, these models still struggle to generate voluminous 3D assets when the input is a flat image that provides limited 3D cues. We introduce REVIVE 3D, a two-stage, plug-and-play pipeline for generating voluminous 3D assets from flat images. In Stage 1, we construct an Inflated Prior by inflating the foreground silhouette to recover global volume and superimposing part-aware details to capture local structure. In Stage 2, 3D Latent Refinement injects Gaussian noise into the Inflated Prior's latent and then denoises it, guided by the prior's geometric cues and the backbone's pretrained 3D knowledge. By initializing the process with the encoded latent of a source mesh instead of the prior, the framework also supports 3D editing conditioned on an edited image. To quantify volume and surface flatness, we propose Compactness and Normal Anisotropy. We validate Compactness and Normal Anisotropy through a user study, showing that these metrics align with human perception of volume and quality. We show that REVIVE 3D achieves state-of-the-art performance on a challenging flat image dataset, based on extensive qualitative and quantitative evaluations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37149", "url": null, "sourceid": 36165, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38130, "uid": "74a7ac36372d9ef2c7b7ff0cc02001d2", "name": "HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images", "authors": [{"id": 174931, "fullname": "Yi Chen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/174931?format=json", "institution": "UCAS"}, {"id": 189122, "fullname": "Donghao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/189122?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 189123, "fullname": "Jie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189123?format=json", "institution": null}, {"id": 189124, "fullname": "Xin Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189124?format=json", "institution": "ByteDance Inc."}, {"id": 189125, "fullname": "Guisheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189125?format=json", "institution": null}, {"id": 189126, "fullname": "Jiatong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189126?format=json", "institution": "ByteDance Inc."}, {"id": 189127, "fullname": "Quanwei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189127?format=json", "institution": null}, {"id": 176403, "fullname": "Qiang Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/176403?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 76222, "fullname": "Lanqing Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/76222?format=json", "institution": "Nanyang Technological University"}, {"id": 87119, "fullname": "Shilei Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87119?format=json", "institution": "bytedance"}, {"id": 189128, "fullname": "Weiqiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189128?format=json", "institution": "University of Chinese Academy of Sciences, CAS"}, {"id": 87709, "fullname": "Pheng-Ann Heng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87709?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "Human-product images, which showcase the integration of humans and products, play a vital role in advertising, e-commerce, and digital marketing. The essential challenge of generating such images lies in ensuring the high-fidelity preservation of product details. Among existing paradigms, reference-based inpainting offers a targeted solution by leveraging product reference images to guide the inpainting process. However, limitations remain in three key aspects: the lack of diverse large-scale training data, the struggle of current models to focus on product detail preservation, and the inability of coarse supervision for achieving precise guidance. To address these issues, we propose HiFi-Inpaint, a novel high-fidelity reference-based inpainting framework tailored for generating human-product images. HiFi-Inpaint introduces Shared Enhancement Attention (SEA) to refine fine-grained product features and Detail-Aware Loss (DAL) to enforce precise pixel-level supervision using high-frequency maps. Additionally, we construct a new dataset, HP-Image-40K, with samples curated from self-synthesis data and processed with automatic filtering. Experimental results show that HiFi-Inpaint achieves state-of-the-art performance, delivering detail-preserving human-product images. Our data, model, and code will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38130", "url": null, "sourceid": 39897, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38503, "uid": "211521eb56443380d6a56187203ab1f2", "name": "One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer", "authors": [{"id": 190006, "fullname": "Shijun Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/190006?format=json", "institution": "Jiangnan University"}, {"id": 154799, "fullname": "Jing Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154799?format=json", "institution": "Kunlun"}, {"id": 154801, "fullname": "Zhihang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/154801?format=json", "institution": "Institute of automation, Chinese academy of science"}, {"id": 190007, "fullname": "Chunli Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190007?format=json", "institution": "skywork"}, {"id": 185403, "fullname": "Xiaoda Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185403?format=json", "institution": "Zhejiang University"}, {"id": 184441, "fullname": "Lijing Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184441?format=json", "institution": "Institute of Remote Sensing and Geographic Information System, School of Earth and Space SciencesPeking University"}, {"id": 154802, "fullname": "Kai Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154802?format=json", "institution": "Jiangnan University"}, {"id": 88656, "fullname": "Jiangning Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88656?format=json", "institution": "Tencent Youtu Lab"}], "abstract": "Recent advances in diffusion models have greatly improved pose-driven character animation. However, existing methods are limited to spatially aligned reference-pose pairs with matched skeletal structures. Handling reference-pose misalignment remains unsolved. To address this, we present One-to-All Animation, a unified framework for high-fidelity character animation and image pose transfer for references with arbitrary layouts. First, to handle spatially misaligned reference, we reformulate training as a self-supervised outpainting task that transforms diverse-layout reference into a unified occluded-input format. Second, to process partially visible reference, we design a reference extractor for comprehensive identity feature extraction. Further, we integrate hybrid reference fusion attention to handle varying resolutions and dynamic sequence lengths. Finally, from the perspective of generation quality, we introduce identity-robust pose control that decouples appearance from skeletal structure to mitigate pose overfitting, and a token replace strategy for coherent long-video generation. Extensive experiments show that our method outperforms existing approaches. The code and model will be available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38503", "url": "https://ssj9596.github.io/one-to-all-animation-project/", "sourceid": 41250, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39405, "uid": "331ea134f8056aace0c5bb939ccf88ae", "name": "RNED: Rotary Number Encoding and Decoding for Quantitative Medical VLM Analysis", "authors": [{"id": 129507, "fullname": "Fengbei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129507?format=json", "institution": "Cornell University"}, {"id": 192006, "fullname": "Sunwoo Kwak", "url": "http://cvpr.thecvf.com/api/miniconf/users/192006?format=json", "institution": "Cornell University"}, {"id": 192007, "fullname": "Nusrat Binta Nizam", "url": "http://cvpr.thecvf.com/api/miniconf/users/192007?format=json", "institution": "Cornell University"}, {"id": 192008, "fullname": "Ilan Richter", "url": "http://cvpr.thecvf.com/api/miniconf/users/192008?format=json", "institution": "Columbia University Irving Medical Center"}, {"id": 192009, "fullname": "Ashley Beecy", "url": "http://cvpr.thecvf.com/api/miniconf/users/192009?format=json", "institution": "Sutter Health"}, {"id": 192010, "fullname": "Jayant Raikhelkar", "url": "http://cvpr.thecvf.com/api/miniconf/users/192010?format=json", "institution": null}, {"id": 192011, "fullname": "Deborah Estrin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192011?format=json", "institution": "Cornell Tech, Cornell University"}, {"id": 183383, "fullname": "Mert Sabuncu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183383?format=json", "institution": "Cornell"}], "abstract": "Vision-Language Models (VLMs) are increasingly adopted for medical applications, but their clinical utility is limited by a core weakness in quantitative reasoning. This limitation affects tasks ranging from regression of lesion sizes to prediction of bounding-box coordinates and stems from the discrete tokenization schemes underlying Large Language Models (LLMs). To address this, we propose \\emph{Rotary Number Encoding and Decoding} (RNED), a principled method for embedding continuous numerical values directly in the representation space of a VLM. Analogous to rotary position encoding, RNED represents a scalar by applying a number-specific rotation matrix to a dedicated numeric token embedding. This norm-preserving transformation maintains ordinal structure over a wide numerical range and integrates seamlessly with pretrained model weights. For decoding, we introduce a robust score-matching\u2013based scheme to recover continuous values from hidden states in the presence of stochastic noise. We evaluate RNED on two quantitative tasks: radiological measurement estimation and medical visual grounding. On both internal and public benchmarks, RNED consistently outperforms existing VLM baselines. Together, these results show that RNED offers a robust, generalizable solution for numerical reasoning in medical VLMs, enabling models that are both quantitatively reliable and clinically applicable. We will release code for experiments on public datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39405", "url": null, "sourceid": 42902, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36749, "uid": "423b036257e9b247694318b327fdef68", "name": "Exploring Adaptive Masked Reconstruction for Self-Supervised Skeleton-Based Action Recognition", "authors": [{"id": 183349, "fullname": "Shengkai Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/183349?format=json", "institution": "HFUT"}, {"id": 185781, "fullname": "Zhiyong Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185781?format=json", "institution": "Hefei University of Technology"}, {"id": 185782, "fullname": "Zefan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185782?format=json", "institution": "Jilin University"}, {"id": 180225, "fullname": "Jianfeng Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/180225?format=json", "institution": "Zhejiang Gongshang University"}, {"id": 185783, "fullname": "Zhihui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185783?format=json", "institution": "University of Science and Technology of China"}, {"id": 85089, "fullname": "Meng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85089?format=json", "institution": "Hefei University of Technology"}], "abstract": "Recently, masked skeleton reconstruction models have emerged as strong action representation learners, driving significant progress in self-supervised skeleton\u2011based action recognition. However, existing state\u2011of\u2011the\u2011art methods must predict an exceedingly large number of spatiotemporal patches, significantly prolonging training time. Besides, by treating all spatiotemporal regions equally during reconstruction, these models are distracted from learning the critical motion patterns that underlie action semantics. To address these challenges, we propose Adaptive Masked Reconstruction (AMR), a faster and stronger pre\u2011training framework. We first decouple the decoder from the encoder, enabling flexible prediction of larger spatiotemporal patches and dramatically reducing reconstruction complexity. Given that larger patches contain more complex information, which is challenging to predict and consequently degrades performance, we accordingly introduce an adaptive guidance module. This module identifies regions of high motion informativeness, guiding the model to focus on the most discriminative parts of each patch and alleviating reconstruction difficulty. Experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that AMR not only accelerates pre\u2011training substantially but also improves downstream recognition accuracy, surpassing current state\u2011of\u2011the\u2011art approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36749", "url": null, "sourceid": 37959, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38781, "uid": "75a4f927be158aa885fea8c355d01af3", "name": "AeroAgent: A Vision\u2013Physics\u2013Decision Framework for Aerodynamic Vehicle Design", "authors": [{"id": 180632, "fullname": "Ye Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180632?format=json", "institution": "Eastern Institute of Technology, Ningbo"}, {"id": 190654, "fullname": "shouyiliu shouyiliu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190654?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 190655, "fullname": "Huiyu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190655?format=json", "institution": "Southern University of Science and Technology"}, {"id": 190656, "fullname": "Jianghang gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190656?format=json", "institution": "Peking University"}, {"id": 190657, "fullname": "fanwenhao fanwenhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190657?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 190658, "fullname": "Zhongxin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190658?format=json", "institution": "Peking University"}, {"id": 190659, "fullname": "Ding Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190659?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 190660, "fullname": "Simeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190660?format=json", "institution": "Southern University of Science and Technology"}, {"id": 190661, "fullname": "Zirun Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190661?format=json", "institution": "Southern University of Science and Technology"}, {"id": 190662, "fullname": "Yuanwei Bin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190662?format=json", "institution": null}, {"id": 190663, "fullname": "Shiyi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190663?format=json", "institution": "Eastern Institute for Advanced Study"}, {"id": 139952, "fullname": "Yuntian Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/139952?format=json", "institution": "Eastern Institute of Technology, Ningbo"}], "abstract": "Modern generative models can propose striking 3D vehicle shapes from text and images, but turning these sketches intoaerodynamically efficient, regulation-compliant designs still requires weeks of high-fidelity computational fluiddynamics (CFD) and manual iteration. As a result, fast 3D generation without trustworthy physics in the loop doeslittle to reduce end-to-end design time. We study how an AI agent can close this loop under a strict CFD budget.We introduce AeroAgent, a vision\u2013physics\u2013decision framework built around a single 3D, editable surfacerepresentation for vehicle shapes. A vision module turns text and 2D references into diverse, standardized 3Dcandidates and supports image-level edits. A physics module, AeroFormer, is a geometry-guidedTransformer surrogate trained on a large-scale vehicle aerodynamics dataset of roughly 50k CFD simulations; threetask-specific heads predict drag ($C_d$), surface pressure, and velocity fields on shared 3D grids. A decision module encodesregulatory size limits and aesthetic constraints as feasibility tests, uses prototype priors and surrogate sensitivitiesto guide free-form deformation edits, and runs a budget-aware propose\u2013evaluate\u2013refine loop in which only the finaltop-$K$ shapes are confirmed by high-fidelity CFD.In extensive experiments across five common vehicle classes, running only five propose\u2013evaluate\u2013refine iterations per vehiclereduces drag by an average of 2\u201312\\% and cuts high-fidelity CFD calls by 50\u201380\\% compared to baseline workflows, whilepreserving or improving styling quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38781", "url": null, "sourceid": 33833, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36306, "uid": "719bdf36f0352752837458a9d1b16bc8", "name": "LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map", "authors": [{"id": 184739, "fullname": "Jinzhou Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184739?format=json", "institution": "University of California, San Diego"}, {"id": 184740, "fullname": "Sidi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184740?format=json", "institution": "The Hong Kong University of Science and Technology(Guangzhou)"}, {"id": 184741, "fullname": "Waikit Xiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184741?format=json", "institution": "The university of Hong Kong"}, {"id": 151703, "fullname": "weixing chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/151703?format=json", "institution": "Sun Yat-Sen University"}, {"id": 128912, "fullname": "Keze Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128912?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "A fundamental challenge in embodied AI is verifying if agents build internal models of spatial structure or merely learn to mimic task-specific expert trajectories. This is critical as foundational approaches rooted in action-centric tasks (e.g., VLN) and reasoning-centric tasks (e.g., EQA) often share a common limitation: they lack a learning signal that forces them to encode fine-grained spatial relationships (like topology or distance) over long-range, fragmented experiences. To address this, we first propose LASAR, an architecture featuring a dual-memory system designed to maintain both episodic experiences and a semantic cognitive map. We then introduce Spatio-temporal Contextual Representation Learning (ST-CRL), a contrastive objective designed to train this architecture. ST-CRL leverages spatio-temporal cues from cognitive queries generated through annotated spatio-temporal context in simulation to build sample pairs, thereby forming the internal cognitive map from the agent's experiences. Experiments demonstrate that our method achieves 2\\%-3.5\\% gains in both zero-shot generalization on standard VLN-CE and VSI-Bench benchmarks. We also demonstrate that our proposed cognitive map has high self-consistency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36306", "url": null, "sourceid": 34853, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37392, "uid": "7efa530244b16507ac557be187cedf5b", "name": "Temporal Representation Enhancement (TRE): Learning to Forget Dominant Patterns for More Discriminative Spiking Features", "authors": [{"id": 187328, "fullname": "Wei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187328?format=json", "institution": "State Key laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Science"}, {"id": 187329, "fullname": "Li Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187329?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 187330, "fullname": "Yufei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187330?format=json", "institution": "Nankai University"}, {"id": 187331, "fullname": "Han Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187331?format=json", "institution": "Chang&#x27;an University"}, {"id": 103222, "fullname": "Boyu Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/103222?format=json", "institution": "Dalian Maritime University"}, {"id": 85031, "fullname": "Weiming Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85031?format=json", "institution": "Institute of automation, Chinese academy of science"}], "abstract": "Spiking Neural Networks (SNNs) naturally process visual inputs across multiple timesteps, offering rich temporal dynamics and energy-efficient computation. However, the temporally invariant supervision commonly used in training tends to reinforce the same dominant response patterns across timesteps, leading to redundant representations and limiting temporal discriminability.To overcome this limitation, we introduce \\emph{Temporal Representation Enhancement} (TRE), a novel learning-to-forget paradigm that encourages more diverse and complementary temporal representations. TRE identifies high-contribution semantic patterns through class-specific contribution estimation and temporal accumulation, and selectively suppresses them using a dynamic modulation strategy. By redirecting the model\u2019s attention toward alternative yet informative semantic cues, TRE promotes the learning of complementary features across timesteps.This approach not only strengthens the temporal discriminative capacity of SNNs but also enables more effective multi-timestep learning by leveraging richer semantic information. Extensive experiments on both static image datasets and dynamic neuromorphic datasets demonstrate that TRE consistently improves classification accuracy and feature diversity across different SNN backbones.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37392", "url": null, "sourceid": 40250, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40093, "uid": "0aee8232dea4a0983e006ec371a1b274", "name": "SRA-Det: Learning Omni-Grained Open-Vocabulary Detection Beyond Category Names", "authors": [{"id": 187329, "fullname": "Li Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187329?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 103222, "fullname": "Boyu Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/103222?format=json", "institution": "Dalian Maritime University"}, {"id": 187328, "fullname": "Wei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187328?format=json", "institution": "State Key laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Science"}, {"id": 193494, "fullname": "Yan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193494?format=json", "institution": "Intime Department Store"}, {"id": 88203, "fullname": "Chunfeng Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88203?format=json", "institution": ", Institute of automation, Chinese academy of science"}, {"id": 84979, "fullname": "Bing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/84979?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 85031, "fullname": "Weiming Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85031?format=json", "institution": "Institute of automation, Chinese academy of science"}], "abstract": "Open-vocabulary object detection (OVD) aims to detect objects described by arbitrary text, but most existing methods operate at a coarse category level and struggle with fine-grained, attribute-sensitive queries. We address this from both model and data perspectives. We propose a Semantic-Retrieval-Augmented Detector (SRA-Det) that uses an attention-based module to retrieve multiple semantic facets from token-level text features, and a soft-min matching rule that behaves like a differentiable logical AND over these facets, ensuring that all key attributes are satisfied. In parallel, we introduce an automatic attribute-augmented data pipeline that uses an LLM to generate category-specific visual attributes and a dual CLIP-based similarity check to verify them at the instance level. With a Swin-T backbone, our approach achieves 54.9 mAP in the zero-shot setting on FG-OVD and 40.4 AP on LVIS, establishing strong fine-grained and general OVD performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40093", "url": null, "sourceid": 33350, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39655, "uid": "04f75408812398c1b907d10e22aac579", "name": "Expanding mmWave Datasets for Human Pose Estimation with Unlabeled Data and LiDAR Datasets", "authors": [{"id": 129865, "fullname": "Zhuoxuan Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/129865?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 162957, "fullname": "Boan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/162957?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 192577, "fullname": "Xingjian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192577?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 192578, "fullname": "Wenying Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192578?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 88196, "fullname": "S.-H. Gary Chan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88196?format=json", "institution": "The Hong Kong University of Science and Technology"}], "abstract": "Current mmWave datasets for human pose estimation (HPE) are scarce and lack diversity in both point cloud (PC) attributes and human poses, severely hampering the generalization ability of their trained models. On the other hand, unlabeled mmWave HPE data and diverse LiDAR HPE datasets are readily available. We propose EMDUL, a novel approach to expand the volume and diversity of an existing mmWave dataset using unlabeled mmWave data and a LiDAR dataset. EMDUL trains a pseudo-label estimator to annotate the unlabeled mmWave data and is able to convert, or translate, a given annotated LiDAR PC to its mmWave counterpart. Expanded with both LiDAR-converted and pseudo-labeled mmWave PCs, our mmWave dataset significantly boosts the performance and generalization ability of all our HPE models, with substantial 15.1% and 18.9% error reductions for in-domain and out-of-domain settings, respectively.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39655", "url": null, "sourceid": 36906, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38194, "uid": "50a7d2b0a34ff60d996a9420699ba8b9", "name": "InnoAds-Composer: Efficient Condition Composition for E-Commerce Poster Generation", "authors": [{"id": 185908, "fullname": "Yuxin Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185908?format=json", "institution": "JD.com"}, {"id": 180361, "fullname": "Ke Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180361?format=json", "institution": "University of Science and Technology of China"}, {"id": 189270, "fullname": "Haowei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189270?format=json", "institution": "Chongqing University of Posts and Telecommunications"}, {"id": 184465, "fullname": "Ao Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/184465?format=json", "institution": "JD.com"}, {"id": 189271, "fullname": "Fengheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189271?format=json", "institution": "JD.com"}, {"id": 189272, "fullname": "Honghe Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189272?format=json", "institution": "JD.com"}, {"id": 185914, "fullname": "Zheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185914?format=json", "institution": "JD"}, {"id": 173123, "fullname": "Run Ling", "url": "http://cvpr.thecvf.com/api/miniconf/users/173123?format=json", "institution": "Northeastern University"}, {"id": 180040, "fullname": "Wei Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180040?format=json", "institution": "JD"}, {"id": 184463, "fullname": "Xuanhua He", "url": "http://cvpr.thecvf.com/api/miniconf/users/184463?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 184466, "fullname": "Zhanjie Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184466?format=json", "institution": "Zhejiang University"}, {"id": 189273, "fullname": "Zhen Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189273?format=json", "institution": "JD.com"}, {"id": 189274, "fullname": "Haoyi Bian", "url": "http://cvpr.thecvf.com/api/miniconf/users/189274?format=json", "institution": "JD.com"}, {"id": 185915, "fullname": "Jingjing Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/185915?format=json", "institution": "JD"}, {"id": 185916, "fullname": "Junjie Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185916?format=json", "institution": "JD"}, {"id": 185917, "fullname": "Ching Law", "url": "http://cvpr.thecvf.com/api/miniconf/users/185917?format=json", "institution": "JD.com"}], "abstract": "E-commerce product poster generation aims to automatically synthesize a single image that effectively conveys product information by presenting a subject, text, and a designed style. Recent diffusion models with fine-grained and efficient controllability have advanced product poster synthesis, yet they typically rely on multi-stage pipelines, and simultaneous control over subject, text, and style remains underexplored. Such naive multi-stage pipelines also show three issues: poor subject fidelity, inaccurate text, and inconsistent style.To address these issues, we propose InnoAds-Composer, a single-stage framework that enables efficient tri-conditional control tokens over subject, text, and style. To alleviate the quadratic overhead introduced by naive tri-conditional token concatenation, we perform importance analysis over layers and timesteps and route each condition only to the most responsive positions, thereby shortening the active token sequence. Besides, to improve the accuracy of Chinese text rendering, we design a Text Feature Enhancement Module (TFEM) that integrates features from both glyph images and glyph crops.To support training and evaluation, we also construct a high-quality e-commerce product poster dataset and benchmark, which is the first dataset that jointly contains subject, text, and style conditions. Extensive experiments demonstrate that InnoAds-Composer significantly outperforms existing product poster methods without obviously increasing inference latency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38194", "url": null, "sourceid": 43857, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39907, "uid": "aa721632f9066fae441aa6f0b09bd9e4", "name": "EVLF: Early Vision-Language Fusion for Generative Dataset Distillation", "authors": [{"id": 180946, "fullname": "WENQI CAI", "url": "http://cvpr.thecvf.com/api/miniconf/users/180946?format=json", "institution": "Toyama University"}, {"id": 193091, "fullname": "Yawen Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/193091?format=json", "institution": "Toyama University"}, {"id": 74158, "fullname": "Guang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/74158?format=json", "institution": "Hokkaido University"}, {"id": 190936, "fullname": "Chunzhi Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190936?format=json", "institution": "University of Fukui"}, {"id": 193092, "fullname": "Chao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193092?format=json", "institution": null}], "abstract": "Dataset distillation (DD) aims to synthesize compact training sets that enable models to achieve high accuracy with significantly fewer samples. Recent diffusion-based DD methods commonly introduce semantic guidance through late-stage cross-attention, where textual prompts tend to dominate the generative process. Although this strategy enforces label relevance, it diminishes the contribution of visual latents, resulting in over-corrected samples that mirror prompt patterns rather than reflecting intrinsic visual features. To solve this problem, we introduce an Early Vision\u2013Language Fusion (EVLF) method that aligns textual and visual embeddings at the transition between the encoder and the generative backbone. By incorporating a lightweight cross-attention module at this transition, the early representations simultaneously encode local textures and global semantic directions across the denoising process. Importantly, EVLF is plug-and-play and can be easily integrated into any diffusion-based dataset distillation pipeline with an encoder. It works across different denoiser architectures and sampling schedules without any task-specific modifications. Extensive experiments demonstrate that EVLF generates semantically faithful and visually coherent synthetic data, yielding consistent improvements in downstream classification accuracy across varied settings. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39907", "url": null, "sourceid": 33526, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37358, "uid": "40b0de5f68597721b8591fa6701857f7", "name": "ResAD: Normalized Residual Trajectory Modeling for End-to-End Autonomous Driving", "authors": [{"id": 180510, "fullname": "Zhiyu Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180510?format=json", "institution": "Wuhan University"}, {"id": 187246, "fullname": "Shaoyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187246?format=json", "institution": null}, {"id": 187247, "fullname": "haoran yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/187247?format=json", "institution": "Horizon Robotics"}, {"id": 187248, "fullname": "xinbang zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187248?format=json", "institution": "Horizon Robotics"}, {"id": 187249, "fullname": "Jialv Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/187249?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 86179, "fullname": "Xinggang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86179?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 184347, "fullname": "Qian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184347?format=json", "institution": "Horizon Robotics"}, {"id": 187250, "fullname": "Lefei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187250?format=json", "institution": "Wuhan University"}], "abstract": "End-to-end autonomous driving (E2EAD) systems, which learn to predict future trajectories directly from sensor data, are fundamentally challenged by the inherent spatio-temporal imbalance of trajectory data. This imbalance creates a significant optimization burden, causing models to learn spurious correlations instead of robust driving logic, while also prioritizing uncertain, distant predictions, thereby compromising immediate safety. To address these issues, we propose ResAD, a novel Normalized Residual Trajectory Modeling framework. Instead of predicting the future trajectory directly, our approach reframes and simplifies the learning task by predicting the residual deviation from a deterministic inertial reference. This inertial reference serves as a strong physical prior, compelling the model to move beyond simple pattern-matching and instead focus its capacity on learning the necessary, context-driven deviations ($\\textit{e.g.}$, traffic rules, obstacles) from this default, inertially-guided path. To mitigate the optimization imbalance caused by uncertain, long-term horizons, ResAD further incorporates Point-wise Normalization of the predicted residual. This technique re-weights the optimization objective, preventing large-magnitude errors associated with distant, uncertain waypoints from dominating the learning signal. On the NAVSIM v1 and v2 benchmarks, ResAD achieves state-of-the-art results of 88.8 PDMS and 85.5 EPDMS with only two denoising steps, demonstrating that ResAD significantly simplifies the learning task and improves planning performance. The code will be released to facilitate further research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37358", "url": null, "sourceid": 39064, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38315, "uid": "c6c7afffb963f408548b6470a0520bcc", "name": "Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models", "authors": [{"id": 189580, "fullname": "Zhuan Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189580?format=json", "institution": "Mila &amp; McGill University"}, {"id": 182616, "fullname": "Alireza Dehghanpour Farashah", "url": "http://cvpr.thecvf.com/api/miniconf/users/182616?format=json", "institution": "Mila / McGill"}, {"id": 189581, "fullname": "Rik de Vries", "url": "http://cvpr.thecvf.com/api/miniconf/users/189581?format=json", "institution": "EPFL - EPF Lausanne"}, {"id": 189582, "fullname": "Golnoosh Farnadi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189582?format=json", "institution": "McGill University"}], "abstract": "Concept erasure in text-to-image diffusion models seeks to remove undesired concepts while preserving overall generative capability. Localized erasure methods aim to restrict edits to the spatial region occupied by the target concept. However, we observe that suppressing a concept can unintentionally weaken semantically related neighbor concepts, reducing fidelity in fine-grained domains. We propose Neighbor-Aware Localized Concept Erasure (NLCE), a training-free framework designed to better preserve neighboring concepts while removing target concepts. It operates in three stages: (1) a spectrally-weighted embedding modulation that attenuates target concept directions while stabilizing neighbor concept representations, (2) an attention-guided spatial gate that identifies regions exhibiting residual concept activation, and (3) a spatially-gated hard erasure that eliminates remaining traces only where necessary. This neighbor-aware pipeline enables localized concept removal while maintaining the surrounding concept neighborhood structure. Experiments on fine-grained datasets (Oxford Flowers, Stanford Dogs) show that our method effectively removes target concepts while better preserving closely related categories. Additional results on celebrity identity, explicit content and artistic style demonstrate robustness and generalization to broader erasure scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38315", "url": null, "sourceid": 34014, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38353, "uid": "6ce911816178efa4b3b09d68d55c6c0a", "name": "Neural Mixture Density Processes", "authors": [{"id": 189680, "fullname": "yi ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/189680?format=json", "institution": null}, {"id": 189681, "fullname": "Qi Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189681?format=json", "institution": "National University of Defense Technology"}, {"id": 189682, "fullname": "Xingxing Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189682?format=json", "institution": "Nudt"}, {"id": 189683, "fullname": "Longfei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189683?format=json", "institution": "National University of Defense Technology"}, {"id": 189684, "fullname": "Yiqin Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/189684?format=json", "institution": "National University of Defense Technology"}, {"id": 189685, "fullname": "weitao song", "url": "http://cvpr.thecvf.com/api/miniconf/users/189685?format=json", "institution": "National University of Defense and Technology"}, {"id": 189686, "fullname": "Fangjie Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189686?format=json", "institution": "National University of Defense Technology"}, {"id": 189687, "fullname": "Qi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189687?format=json", "institution": "National University of Defense Technology"}, {"id": 189688, "fullname": "Guangquan Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189688?format=json", "institution": "National University of Defense Technology"}], "abstract": "The neural process (NP) is a probabilistic meta learning model that learns distributions over functions via a global latent variable.It enables fast adaptation in few-shot scenarios by leveraging past experience. However, the design of latent variable structures and conditioning mechanisms in NPs remains underexplored, despite their importance in capturing diverse functional distributions.This paper proposes a new variant of NPs via mixture density modeling, referred to as the neural mixture density process (NMDP).The NMDP decomposes model parameters into task-agnostic and task-specific components to represent function distributions more flexibly. We train the model via the Expectation\u2013Maximization algorithm to construct expressive functional priors.Compared with existing work, our method maintains several advantages: (i) less overfitting by updating a small part of the network parameters, (ii) compact task representation via distributions in the simplex,(iii) an improvement guarantee of generative likelihoods over iteration. Experimental results show that our method can achieve competitive performance with adequate explainability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38353", "url": null, "sourceid": 43782, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38553, "uid": "9c58911404dd3adca8c9266ea983f86d", "name": "Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding", "authors": [{"id": 180736, "fullname": "Sosuke Yamao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180736?format=json", "institution": "Fujitsu Limited"}, {"id": 190119, "fullname": "Natsuki Miyahara", "url": "http://cvpr.thecvf.com/api/miniconf/users/190119?format=json", "institution": "Fujitsu Research"}, {"id": 85042, "fullname": "Yuankai Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/85042?format=json", "institution": "The University of Adelaide"}, {"id": 190120, "fullname": "Shun Takeuchi", "url": "http://cvpr.thecvf.com/api/miniconf/users/190120?format=json", "institution": "Fujitsu Australia; Macquarie University; Fujitsu Research"}], "abstract": "In the context of long-term video understanding with large multimodal models, many frameworks have been proposed.While transformer-based visual compressors and memory-augmented approaches are often used to process long videos, yet they usually compress each frame independently and therefore fail to achieve strong performance on tasks that require understanding complete events, such as temporal ordering tasks in MLVU and VNBench. This motivates us to rethink the conventional one-way scheme from perception to memory, and instead aims to establish a feedback-driven process in which past visual contexts stored in memory can benefit ongoing perception.To this end,we propose Question-guided Visual Compression with Memory Feedback (QViC-MF),a framework for long-term video understanding.At its core is a Question-guided Multimodal Selective Attention (QMSA),which learns to preserve visual information related to the given question from both the current clip and the past related frames in memories. The compressor and memory-feedback works iteratively for each clip of the entire video.This simple yet effective design yields large performance gains on long-term video understanding tasks. Extensive experiments on four benchmarks demonstrate that our method achieves significant improvement over currentstate-of-the-art methods by 6.1\\% on MLVU-test, 8.3\\% on LVBench, 18.3\\% on VNBench Long, and 3.7\\% on VideoMME Long.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38553", "url": null, "sourceid": 36554, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36485, "uid": "ea7943712d1d8b761c51b326e0573127", "name": "Hyper-PCN: Hypergraph-based Point Cloud Completion via High-order Correlation Modeling", "authors": [{"id": 180013, "fullname": "Linfei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180013?format=json", "institution": "Tsinghua University"}, {"id": 185168, "fullname": "Pei Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185168?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 129210, "fullname": "Siqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129210?format=json", "institution": "Tsinghua University"}, {"id": 89067, "fullname": "Changqing Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/89067?format=json", "institution": "Zhejiang University"}, {"id": 90513, "fullname": "Yue Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/90513?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Point cloud completion is an important yet challenging problem in 3D computer vision, which aims to reconstruct complete and dense 3D shapes from partial point clouds. Although transformer-based and geometry-based approaches have made significant progress, they often struggle to capture the complex, high-order correlations inherent in point clouds. To address this limitation, we propose Hyper-PCN, a point cloud completion framework that leverages hypergraphs to explicitly model complex, higher-order correlations within incomplete inputs for more accurate completion. It comprises two key modules: Hyper Refinement Stack, designed to progressively capture coarse-to-fine high-order correlations through a series of hypergraph learning stages, and Anchor-based Hypergraph Neural Network, which employs a two-stage sampling strategy to construct collaborative hypergraphs, ensuring robust modeling of global structures. Extensive experiments on multiple datasets demonstrate that our approach consistently outperforms state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36485", "url": null, "sourceid": 43225, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38458, "uid": "1ef14ce4ee3294e6a1214136ce45e85b", "name": "SAIDO: Generalizable Detection of AI-Generated Images via Scene-Aware and Importance-Guided Dynamic Optimization in Continual Learning", "authors": [{"id": 180380, "fullname": "Yongkang Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180380?format=json", "institution": "East China Normal University"}, {"id": 182926, "fullname": "Yu Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/182926?format=json", "institution": "East China Normal University / Shanghai Innovation Institute"}, {"id": 189896, "fullname": "YuShuo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189896?format=json", "institution": "East China Normal University"}, {"id": 135330, "fullname": "Yuan Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/135330?format=json", "institution": "East China Normal University"}, {"id": 189897, "fullname": "Zhaoxia Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189897?format=json", "institution": "East China Normal University"}], "abstract": "The widespread misuse of image generation technologies has raised security concerns, driving the development of AI-generated image detection methods. However, generalization has become a key challenge and open problem: existing approaches struggle to adapt to emerging generative methods and content types in real-world scenarios. To address this issue, we propose a Scene-Aware and Importance-Guided Dynamic Optimization detection framework with continual learning (SAIDO). Specifically, we design Scene-Awareness-Based Expert Module (SAEM) that dynamically identifies and incorporates new scenes using VLLMs. For each scene, independent expert modules are dynamically allocated, enabling the framework to capture scene-specific forgery features better and enhance cross-scene generalization. To mitigate catastrophic forgetting when learning from multiple image generative methods, we introduce Importance-Guided Dynamic Optimization Mechanism (IDOM), which optimizes each neuron through an importance-guided gradient projection strategy, thereby achieving an effective balance between model plasticity and stability. Extensive experiments on continual learning tasks demonstrate that our method outperforms the current SOTA method in both stability and plasticity, achieving 44.22\\% and 40.57\\% relative reductions in average detection error rate and forgetting rate, respectively. On open-world datasets, it improves the average detection accuracy by 9.47\\% compared to the current SOTA method.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38458", "url": null, "sourceid": 35060, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39002, "uid": "883f28f679481ee7e374a23077c81e03", "name": "Semi-Supervised Conformal Prediction With Unlabeled Nonconformity Score", "authors": [{"id": 181372, "fullname": "Xuanning Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/181372?format=json", "institution": "Southern University of Science and Technology"}, {"id": 191162, "fullname": "Zihao Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/191162?format=json", "institution": "Fudan University"}, {"id": 191163, "fullname": "Hao Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191163?format=json", "institution": "Southern University of Science and Technology"}, {"id": 84803, "fullname": "Xiaobo Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/84803?format=json", "institution": "The University of Sydney"}, {"id": 189577, "fullname": "Bingyi Jing", "url": "http://cvpr.thecvf.com/api/miniconf/users/189577?format=json", "institution": "The Chinese University of Hong Kong; South University of Science and Technology"}, {"id": 189576, "fullname": "Hongxin Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/189576?format=json", "institution": "Southern University of Science and Technology"}], "abstract": "Conformal prediction (CP) is a powerful framework for uncertainty quantification, generating prediction sets with coverage guarantees. Split conformal prediction relies on labeled data in the calibration procedure. However, the labeled data is often limited in real-world scenarios, leading to unstable coverage performance in different runs. To address this issue, we extend CP to the semi-supervised setting and propose SemiCP, a new paradigm that leverages both labeled and unlabeled data for calibration. To achieve this, we introduce an unlabeled nonconformity score, Nearest Neighbor Matching (NNM) score. Specifically, NNM estimates the nonconformity scores of unlabeled samples using their most similar pseudo-labeled counterparts during calibration, while maintaining the original scores for labeled data. Theoretically, we demonstrate that the average coverage gap (i.e., the absolute difference between the empirical marginal coverage and the target coverage) of SemiCP can decrease significantly at a rate $\\mathcal{O}\\bigl(1/N\\bigr)$, where $N$ is the number of unlabeled data. Extensive experiments validate the effectiveness of SemiCP under limited labeled data, reducing the average coverage gap by up to 77\\% on common benchmarks with 4000 unlabeled examples, when there are only 20 labeled examples.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39002", "url": null, "sourceid": 38143, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37575, "uid": "8f047b38280e17991af74f09b722e28d", "name": "Single-step Diffusion-based Video Coding with Semantic-Temporal Guidance", "authors": [{"id": 185643, "fullname": "Naifu Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/185643?format=json", "institution": "Communication University of China"}, {"id": 131172, "fullname": "Zhaoyang Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/131172?format=json", "institution": "University of Science and Technology of China"}, {"id": 132590, "fullname": "Jiahao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/132590?format=json", "institution": "Microsoft Research Asia"}, {"id": 88028, "fullname": "Bin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88028?format=json", "institution": "Microsoft"}, {"id": 174684, "fullname": "Zihan Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/174684?format=json", "institution": "University of Science and Technology of China"}, {"id": 187761, "fullname": "Yuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187761?format=json", "institution": "Communication University of China"}, {"id": 87597, "fullname": "Yan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87597?format=json", "institution": "Microsoft Research Asia"}], "abstract": "While traditional and neural video codecs (NVCs) have achieved remarkable rate\u2013distortion performance, improving perceptual quality at low bitrates remains challenging. Some NVCs incorporate perceptual or adversarial objectives but still suffer from artifacts due to limited generation capacity, whereas others leverage pretrained diffusion models to improve quality at the cost of heavy sampling complexity. To overcome these challenges, we propose S$^2$VC, a Single-Step diffusion\u2013based Video Codec that integrates a conditional coding framework with an efficient single-step diffusion generator, enabling realistic reconstruction at low bitrates with reduced sampling cost. Recognizing the importance of semantic conditioning in single-step diffusion, we introduce Contextual Semantic Guidance to extract frame-adaptive semantics from buffered features. It replaces text captions with efficient, fine-grained conditioning, thereby improving generation realism. In addition, Temporal Consistency Guidance is incorporated into the diffusion U-Net to enforce temporal coherence across frames and ensure stable generation. Extensive experiments show that S$^2$VC delivers state-of-the-art perceptual quality with an average 52.73% bitrate saving over prior perceptual methods, underscoring the promise of single-step diffusion for efficient, high-quality video compression.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37575", "url": null, "sourceid": 45880, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36686, "uid": "b8cd49eecd9cfca69f942e62a2325f2c", "name": "CoD: A Diffusion Foundation Model for Image Compression", "authors": [{"id": 131172, "fullname": "Zhaoyang Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/131172?format=json", "institution": "University of Science and Technology of China"}, {"id": 174684, "fullname": "Zihan Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/174684?format=json", "institution": "University of Science and Technology of China"}, {"id": 185643, "fullname": "Naifu Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/185643?format=json", "institution": "Communication University of China"}, {"id": 132590, "fullname": "Jiahao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/132590?format=json", "institution": "Microsoft Research Asia"}, {"id": 88028, "fullname": "Bin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88028?format=json", "institution": "Microsoft"}, {"id": 185301, "fullname": "Zongyu Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185301?format=json", "institution": "Microsoft Research"}, {"id": 88233, "fullname": "Xiaoyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88233?format=json", "institution": "Research, Microsoft"}, {"id": 89074, "fullname": "Houqiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/89074?format=json", "institution": "University of Science and Technology of China"}, {"id": 87597, "fullname": "Yan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87597?format=json", "institution": "Microsoft Research Asia"}], "abstract": "Existing diffusion codecs typically build on text-to-image diffusion foundation models like Stable Diffusion.However, text conditioning is suboptimal from a compression perspective, hindering the potential of downstream diffusion codecs, particularly at ultra-low bitrates.To address it, we introduce CoD, the first Compression-oriented Diffusion foundation model, trained from scratch to enable end-to-end optimization of both compression and generation. CoD is not a fixed codec but a general foundation model designed for various diffusion-based codecs.It offers several advantages: High compression efficiency, replacing Stable Diffusion with CoD in downstream codecs like DiffC achieves SOTA results, especially at ultra-low bitrates (e.g., 0.0039 bpp); Low-cost and reproducible training, 300$\\times$ faster training than Stable Diffusion ($\\sim$ 20 vs. $\\sim$ 6,250 A100 GPU days) on entirely open image-only datasets; Providing new insights, e.g., We find pixel-space diffusion can achieve VTM-level PSNR with high perceptual quality and can outperform GAN-based codecs using fewer parameters.We hope CoD lays the foundation for future diffusion codec research.Codes will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36686", "url": null, "sourceid": 46608, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38689, "uid": "ddaf8b7602e9011c04f95ad0bdf57c8f", "name": "A Temporal and Content Co-Awareness Latent Diffusion for Controllable Hand Image Generation", "authors": [{"id": 190469, "fullname": "Shuang Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190469?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 126479, "fullname": "Pengfei Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/126479?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 153469, "fullname": "Haifeng Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/153469?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 190470, "fullname": "Ting Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190470?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 126480, "fullname": "Qi Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/126480?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 188951, "fullname": "Lei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188951?format=json", "institution": "China Unicom Network Communications Co., Ltd."}, {"id": 190471, "fullname": "Cong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190471?format=json", "institution": null}, {"id": 126466, "fullname": "Jianxin Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/126466?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 126469, "fullname": "Jingyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126469?format=json", "institution": "Beijing University of Post and Telecommunication, Tsinghua University"}], "abstract": "Controllable hand image generation aims to synthesize geometrically accurate images with consistent appearance. Recently, diffusion models have achieved remarkable success in image generation and have been applied for hand image synthesis. Through input-level fusion or feature-level modulation, existing methods inject control signals with fixed strength across all denoising timesteps. However, this static modulation ignores the progressive characteristic of the denoising process. In this paper, we reveal that the modulation process of control signals depends on the denoising state and the conditions complexity. Due to their distinct semantic distributions and information densities, it remains challenging to achieve effective interaction of these heterogeneous representations. To address this, we propose a Temporal and Content Co-Awareness Latent Diffusion method that designs a temporal- and content-driven modulation mechanism for controllable hand image generation. To achieve temporal and content co-awareness among the heterogeneous representations, we propose a query-based interaction mechanism designed to mitigate information redundancy and align semantic distributions. Leveraging this cross-domain interaction, the model infers the control information required at the current denoising state and dynamically adjust pose and appearance injection strengths. To obtain a stable appearance representation from multi-pose images of the same identity, we design the Pose-Invariant Appearance Encoder that captures both global appearance consistency and local texture details. Furthermore, we employ a feature orthogonal decomposition to mitigate pose leakage into appearance subspaces. Both quantitative and qualitative experimental results demonstrate the superiority of our method over the state of the arts.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38689", "url": null, "sourceid": 30658, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38065, "uid": "0b672b01a4ca61d669cd81383f6a4baa", "name": "W2W: Language-Model-Based Trajectory Prediction with Reinforcement Learning", "authors": [{"id": 181327, "fullname": "Zirui Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181327?format=json", "institution": "Changzhou University"}, {"id": 188960, "fullname": "Biao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188960?format=json", "institution": "Changzhou University"}, {"id": 183315, "fullname": "rongrong Ni", "url": "http://cvpr.thecvf.com/api/miniconf/users/183315?format=json", "institution": "changzhou university"}, {"id": 186704, "fullname": "Zhongkai Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186704?format=json", "institution": null}, {"id": 188961, "fullname": "Shaobo Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188961?format=json", "institution": "Changzhou University"}], "abstract": "Pedestrian trajectory prediction is crucial for applications such as autonomous driving and social robots. Recently, language model (LM)\u2013based trajectory prediction has offered both prediction accuracy and interpretability. However, the L2 loss commonly used in trajectory prediction cannot be directly applied to LM optimization, resulting in degraded prediction performance. Moreover, current LM-based trajectory prediction methods lack explicit expressions of social interactions, and their scene descriptions are overly simplistic, making it challenging to impose practical scene constraints. To address these issues, we propose Write-to-Walk (W2W). First, we construct a pedestrian trajectory dataset with explicit interaction semantics and generate parsable prompts based on observed trajectories and interaction cues (companion/following/obstacle), alleviating the lack of interaction semantics in prompts. Afterward, a T5-Small backbone is trained in a two-stage manner: (1) Full-parameter supervised fine-tuning with cross-entropy loss for language learning, endowing W2W with the capability for formatted question answering; (2) Reinforcement Learning (RL) to optimize W2W, where a reward function that integrates ADE error and off-road penalties strengthens scene constraints, producing future trajectories consistent with the scene context and further improving prediction accuracy. Experiments on the benchmarking datasets (ETH/UCY and SDD) demonstrate that W2W outperforms LM-based prediction methods on ADE/FDE metrics and achieves competitive results compared with SOTA trajectory prediction methods. Meanwhile, the interpretability of LMs further enhances W2W\u2019s prospects for deployment in safety-critical domains, such as autonomous driving.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38065", "url": null, "sourceid": 43135, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38329, "uid": "caea84e7e50ea556a7cd84ee1d0963a8", "name": "Generative Video Compression with One-Dimensional Latent Representation", "authors": [{"id": 174684, "fullname": "Zihan Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/174684?format=json", "institution": "University of Science and Technology of China"}, {"id": 131172, "fullname": "Zhaoyang Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/131172?format=json", "institution": "University of Science and Technology of China"}, {"id": 185643, "fullname": "Naifu Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/185643?format=json", "institution": "Communication University of China"}, {"id": 132590, "fullname": "Jiahao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/132590?format=json", "institution": "Microsoft Research Asia"}, {"id": 88028, "fullname": "Bin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88028?format=json", "institution": "Microsoft"}, {"id": 185301, "fullname": "Zongyu Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185301?format=json", "institution": "Microsoft Research"}, {"id": 88233, "fullname": "Xiaoyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88233?format=json", "institution": "Research, Microsoft"}, {"id": 187283, "fullname": "Zhenghao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187283?format=json", "institution": "University of Newcastle"}, {"id": 89074, "fullname": "Houqiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/89074?format=json", "institution": "University of Science and Technology of China"}, {"id": 87597, "fullname": "Yan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87597?format=json", "institution": "Microsoft Research Asia"}], "abstract": "Recent advancements in generative video codec (GVC) typically encode video into a 2D latent grid and employ high-capacity generative decoders for reconstruction. However, this paradigm still leaves two key challenges in fully exploiting spatial-temporal redundancy: Spatially, the 2D latent grid inevitably preserves intra-frame redundancy due to its rigid structure, where adjacent patches remain highly similar, thereby necessitating a higher bitrate. Temporally, the 2D latent grid is less effective for modeling long-term correlations in a compact and semantically coherent manner, as it hinders the aggregation of common contents across frames. To address these limitations, we introduce Generative Video Compression with One-Dimensional (1D) Latent Representation (GVC1D). GVC1D encodes the video data into extreme compact 1D latent tokens conditioned on both short- and long-term contexts. Without the rigid 2D spatial correspondence, these 1D latent tokens can adaptively attend to semantic regions and naturally facilitate token reduction, thereby reducing spatial redundancy. Furthermore, the proposed 1D memory provides semantically rich long-term context while maintaining low computational cost, thereby further reducing temporal redundancy. Experimental results indicate that GVC1D attains superior compression efficiency, where it achieves bitrate reductions of 60.4\\% under LPIPS and 68.8\\% under DISTS on the HEVC Class B dataset, surpassing the previous video compression methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38329", "url": null, "sourceid": 44209, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37777, "uid": "6d1eda42e54a1c2b1083d901f073529a", "name": "Occluded Human Body Capture with Frequency Domain Denoising Prior", "authors": [{"id": 120005, "fullname": "Buzhen Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/120005?format=json", "institution": "Southeast University"}, {"id": 126752, "fullname": "Chongyang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126752?format=json", "institution": "Sichuan University"}, {"id": 188243, "fullname": "Wentao Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188243?format=json", "institution": "University of Tokyo"}, {"id": 188244, "fullname": "Yuan Shu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188244?format=json", "institution": "Southeast University"}, {"id": 188245, "fullname": "Jingyi Ju", "url": "http://cvpr.thecvf.com/api/miniconf/users/188245?format=json", "institution": "Southeast University"}, {"id": 182804, "fullname": "Binghui Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/182804?format=json", "institution": "Southeast University"}, {"id": 90020, "fullname": "Yangang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90020?format=json", "institution": "Southeast University"}], "abstract": "Monocular human motion capture in occlusion scenarios presents significant challenges. Although a few works have explicitly considered the occlusion problem, image-based methods are unreliable due to the lack of temporal constraints while video-based approaches cannot gain sufficient knowledge from time domain motion priors to address long-term occlusions. However, occluded human motion typically exhibits periodic patterns and consistent momentum. Inspired by this observation, we exploit reliable image observations in frequency domain and formulate the motion capture task as a wavelet coefficients selection process. Specifically, we first construct probabilistic distributions for the occluded 2D keypoints, and then introduce a frequency domain diffusion model to refine the distributions by learning long-term periodic information and physical momentum with Discrete Wavelet Transform (DWT). Consequently, the learned denoising prior can select valid wavelet components to facilitate the 3D motion capture with a 3D decoder. By employing a joint reprojection strategy, we can also use the same diffusion process to train the 3D decoder. To further promote human occlusion-related tasks, we also present the first 3D occluded motion dataset, OcMotion, which serves as a new benchmark for both training and evaluation. Experimental results demonstrate that our method can produce accurate and coherent human motions from occluded videos. The dataset and code will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37777", "url": null, "sourceid": 37092, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37914, "uid": "ec4a6c0d2b86d3993a40f23899d1fe82", "name": "Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning", "authors": [{"id": 145890, "fullname": "Ziyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/145890?format=json", "institution": "Wuhan University"}, {"id": 188579, "fullname": "Li Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188579?format=json", "institution": "Sun Yat-Sen University"}, {"id": 184409, "fullname": "Deheng Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/184409?format=json", "institution": "Tencent; Tencent"}, {"id": 128962, "fullname": "Yong Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/128962?format=json", "institution": "Wuhan University"}, {"id": 188580, "fullname": "Huangxuan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188580?format=json", "institution": "Wuhan University"}, {"id": 188581, "fullname": "Meng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188581?format=json", "institution": null}, {"id": 177107, "fullname": "Wei Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/177107?format=json", "institution": "Wuhan University"}, {"id": 187250, "fullname": "Lefei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187250?format=json", "institution": "Wuhan University"}], "abstract": "Text-to-multiview (T2MV) diffusion models have shown great promise in generating multiple views of a scene from a single text prompt. While few-step backbones enable real-time T2MV generation, they often compromise key aspects of generation quality, such as per-view fidelity and cross-view consistency. Reinforcement learning (RL) finetuning offers a potential solution, yet existing approaches desgined for single-image diffusion do not readily extend to the few-step T2MV setting, as they neglect cross-view coordination and suffer from weak learning signals in few-step regimes. To address this, we propose MVC-ZigAL, a tailored RL finetuning framework for few-step T2MV diffusion models. Specifically, its core insights are: (1) a new MDP formulation that jointly models all generated views and assesses their collective quality via a joint-view reward; (2) a novel advantage learning strategy that exploits the performance gains of a self-refinement sampling scheme over standard sampling, yielding stronger learning signals for effective RL finetuning; and (3) a unified RL framework that extends advantage learning with a Lagrangian dual formulation for multiview-constrained optimization, balancing single-view and joint-view objectives through adaptive primal-dual updates under a self-paced threshold curriculum that harmonizes exploration and constraint enforcement. Collectively, these designs enable robust and balanced RL finetuning for few-step T2MV diffusion models, yielding substantial gains in both per-view fidelity and cross-view consistency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37914", "url": null, "sourceid": 44712, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39701, "uid": "a1c61921bdd6ed619ed8d854edfa535a", "name": "FlowDC: Flow-Based Decoupling-Decay for Complex Image Editing", "authors": [{"id": 192681, "fullname": "Yilei Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192681?format=json", "institution": "Zhejiang University"}, {"id": 180301, "fullname": "Zhen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180301?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 155063, "fullname": "Yanghao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155063?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 88598, "fullname": "Jun Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88598?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 129046, "fullname": "Yueting Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129046?format=json", "institution": "Zhejiang University"}, {"id": 84768, "fullname": "Jun Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84768?format=json", "institution": "Zhejiang University"}, {"id": 90895, "fullname": "Long Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/90895?format=json", "institution": "HKUST"}], "abstract": "With the surge of pre-trained text-to-image flow matching models, text-based image editing performance has gained remarkable improvement, especially for **simple editing** that only contains a single editing target. However, to satisfy the exploding editing requirements, the **complex editing** that contains multiple editing targets is posed as a more challenging task. However, current complex editing solutions: single-round and multi-round editing are limited by long text following and cumulative inconsistency, respectively. Thus, they struggle to strike a balance between semantic alignment and source consistency.In this paper, we propose **FlowDC**, which decouples the complex editing into multiple sub-editing effects and superposes them in parallel during the editing process. Meanwhile, we observed that the velocity quantity that is orthogonal to the editing displacement harms the source structure preserving. Thus, we decompose the velocity and decay the orthogonal part for better source consistency.To evaluate the effectiveness of complex editing settings, we construct a complex editing benchmark: Complex-PIE-Bench. On two benchmarks, FlowDC shows superior results compared with existing methods. We also detail the ablations of our module designs.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39701", "url": null, "sourceid": 39292, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40213, "uid": "34760ceac693e42fec03f15aa139b9e6", "name": "TF-SSD: A Strong Pipeline via Synergic Mask Filter for Training-free Co-salient Object Detection", "authors": [{"id": 180731, "fullname": "Zhijin He", "url": "http://cvpr.thecvf.com/api/miniconf/users/180731?format=json", "institution": "Xi&#x27;an Jiaotong-Liverpool University"}, {"id": 193795, "fullname": "Shuo Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/193795?format=json", "institution": "University of Liverpool"}, {"id": 128633, "fullname": "Siyue Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128633?format=json", "institution": "Xi&#x27;an Jiaotong-Liverpool University"}, {"id": 193796, "fullname": "Shuwei Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193796?format=json", "institution": "Xi&#x27;an Jiaotong-Liverpool University"}, {"id": 89336, "fullname": "Bingfeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89336?format=json", "institution": "China University of Petroleum (East China)"}, {"id": 193797, "fullname": "Li Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193797?format=json", "institution": "Nanjing University of Information Science and Technology"}, {"id": 89348, "fullname": "Jimin Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/89348?format=json", "institution": "Xi&#x27;an Jiaotong-Liverpool University"}], "abstract": "Co-salient Object Detection (CoSOD) aims to segment salient objects that consistently appear across a group of related images. Despite the notable progress achieved by recent training-based approaches, they still remain constrained by the closed-set datasets and exhibit limited generalization. However, few studies explore the potential of Vision Foundation Models (VFMs) to address CoSOD, which demonstrate a strong generalized ability and robust saliency understanding. In this paper, we investigate and leverage VFMs for CoSOD, and further propose a novel training-free method, TF-SSD, through the synergy between SAM and DINO. Specifically, we first utilize SAM to generate comprehensive raw proposals, which serve as a candidate mask pool. Then, we introduce a quality mask generator to filter out redundant masks, thereby acquiring a refined mask set. Since this generator is built upon SAM, it inherently lacks semantic understanding of saliency. To this end, we adopt an intra-image saliency filter that employs DINO's attention maps to identify visually salient masks within individual images. Moreover, to extend saliency understanding across group images, we propose an inter-image prototype selector, which computes similarity scores among cross-image prototypes to select masks with the highest score. These selected masks serve as final predictions for CoSOD. Extensive experiments show that our TF-SSD outperforms existing methods (e.g., 13.7% gains over the recent training-free method). Codes will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40213", "url": null, "sourceid": 35959, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36597, "uid": "61ae94caa4df330f8633334fa4cdc804", "name": "SCE-SLAM: Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings", "authors": [{"id": 185433, "fullname": "Yuchen Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185433?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 129120, "fullname": "Jiahe Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129120?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 155480, "fullname": "Xiaohan Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155480?format=json", "institution": "Macquarie University"}, {"id": 185434, "fullname": "Lina Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185434?format=json", "institution": "Institute of Semiconductors, Chinese Academy of Sciences"}, {"id": 128746, "fullname": "Jin Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/128746?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 128753, "fullname": "Xiao Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/128753?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}], "abstract": "Monocular visual SLAM enables 3D reconstruction from internet video and autonomous navigation on resource-constrained platforms, yet suffers from scale drift, i.e., the gradual divergence of estimated scale over long sequences. Existing frame-to-frame methods achieve real-time performance through local optimization but accumulate scale drift due to the lack of global constraints among independent windows. To address this, we propose SCE-SLAM, an end-to-end SLAM system that maintains scale consistency through scene coordinate embeddings, which are learned patch-level representations encoding 3D geometric relationships under a canonical scale reference. The framework consists of two key modules: geometry-guided aggregation that leverages 3D spatial proximity to propagate scale information from historical observations through geometry-modulated attention, and scene coordinate bundle adjustment that anchors current estimates to the reference scale through explicit 3D coordinate constraints decoded from the scene coordinate embeddings. Experiments on KITTI, Waymo, and vKITTI demonstrate substantial improvements: our method reduces absolute trajectory error by 8.36m on KITTI compared to the best prior approach, while maintaining 36 FPS and achieving scale consistency across large-scale scenes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36597", "url": null, "sourceid": 42436, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40052, "uid": "2913779eaaea51ea2ca2d21953d845c4", "name": "GeoSANE: Learning Geospatial Representations From Models, Not Data", "authors": [{"id": 92812, "fullname": "Jo\u00eblle Hanna", "url": "http://cvpr.thecvf.com/api/miniconf/users/92812?format=json", "institution": "Universit\u00e4t St. Gallen"}, {"id": 193384, "fullname": "Damian Falk", "url": "http://cvpr.thecvf.com/api/miniconf/users/193384?format=json", "institution": "University of St. Gallen"}, {"id": 73935, "fullname": "Stella X. Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73935?format=json", "institution": "University of Michigan - Ann Arbor"}, {"id": 92809, "fullname": "Damian Borth", "url": "http://cvpr.thecvf.com/api/miniconf/users/92809?format=json", "institution": "University of St.Gallen"}], "abstract": "Recent advances in remote sensing have led to an increase in the number of available foundation models;  each trained on different modalities, datasets, and objectives, yet capturing only part of the vast geospatial knowledge landscape. While these models show strong results within their respective domains, their capabilities remain complementary rather than unified. Therefore, instead of choosing one model over another, we aim to combine their strengths into a single shared representation.We introduce GeoSANE, a geospatial model foundry that learns a unified neural representation from the weights of existing foundation models and task-specific models, able to generate novel neural networks weights on-demand. Given a target architecture, GeoSANE generates weights ready for finetuning for classification, segmentation, and detection tasks across multiple modalities.Models generated by GeoSANE consistently outperform their counterparts trained from scratch, match or surpass state-of-the-art remote sensing foundation models, and outperform models obtained through pruning or knowledge distillation when generating lightweight networks. Evaluations across ten diverse datasets and on GEO-Bench confirm its strong generalization capabilities.By shifting from pre-training to weight generation, GeoSANE introduces a new framework for unifying and transferring geospatial knowledge across models and tasks. Code is available at \\url{anonymized}.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40052", "url": null, "sourceid": 33509, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39426, "uid": "f1b967e673681c3b9cdbc9c568949344", "name": "CoLoR: The Devil is in Scene Coordinate Regression for Large-Scale Visual Localization", "authors": [{"id": 192054, "fullname": "Xindong Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192054?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 192055, "fullname": "Hang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192055?format=json", "institution": "Beihang University"}, {"id": 185433, "fullname": "Yuchen Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185433?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 129120, "fullname": "Jiahe Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129120?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 128753, "fullname": "Xiao Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/128753?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 128746, "fullname": "Jin Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/128746?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}], "abstract": "Scene Coordinate Regression (SCR) has emerged as a memory-efficient paradigm for visual localization.While SCR has demonstrated performance comparable to classic feature matching based approaches in small-scale scenes, it has consistently underperformed in large-scale environments.Large-scale localization is hampered by two challenges: sparse co-visibility and local appearance ambiguity.In this work, we propose **CoLoR**, a novel training framework tailored for large-scale SCR.First, we explicitly and efficiently partition scene points into multi-view and single-view sets and introduce a two-stage bootstrapping paradigm to provide complete and strong supervision for all points.Second, we propose a multi-granularity retrieval feature, which unifies the conventional global and local features as retrieval-oriented representations at the image and pixel levels, respectively, to enforce feature consistency.Our method achieves state-of-the-art performance on multiple challenging large-scale datasets and significantly narrows the accuracy gap with classical feature matching based approaches while retaining a compact map size.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39426", "url": null, "sourceid": 31093, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36618, "uid": "010ed37e44e2fdc175b4c5c6c930805a", "name": "More Than Meets the Eye: A Unified Image Fusion Framework via Semantic-Pixel Entropy Trade-off for Zero-Shot Generalization", "authors": [{"id": 183769, "fullname": "Xiaowen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183769?format=json", "institution": "People&#x27;s Public Security University of China"}, {"id": 185475, "fullname": "Jing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185475?format=json", "institution": "East China Normal University"}, {"id": 185476, "fullname": "Hongtao Huo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185476?format=json", "institution": "People&#x27;s Public Security University of China"}, {"id": 185477, "fullname": "Haozhe Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185477?format=json", "institution": "People\u2019s Public Security University of China"}, {"id": 185478, "fullname": "Renhua Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185478?format=json", "institution": "People\u2019s Public Security University of China"}, {"id": 185479, "fullname": "Xu Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/185479?format=json", "institution": "People&#x27;s Public Security University of China; Criminal Investigation Police University of China"}], "abstract": "Existing image fusion methods face difficulties in adapting to unseen fusion tasks and have limitations in balancing semantic information with pixel-level details. This limitation can be attributed to three key challenges: (1) the lack of a unified, task-agnostic optimization objective; (2) the inherent difficulty in balancing semantic fidelity and pixel-level richness; and (3) an over-reliance on supervised learning, which limits transferability across tasks. To overcome these issues, this work proposes a unified fusion framework that generalizes to diverse fusion tasks even when trained solely on infrared\u2013visible image pairs. Specifically, inspired by the free-energy principle, we introduce a fusion paradigm that combines high pixel-entropy expectation with low semantic-entropy expectation, and we design a frequency-aware feature decoupling mechanism to balance semantic content and pixel detail. Furthermore, an unsupervised dual-path trade-off strategy provides collaborative constraints at both semantic and pixel levels. Experiments show that our method significantly outperforms existing state-of-the-art methods in visual quality and downstream-task performance. It not only handles trained tasks efficiently but also generalizes well to unseen fusion tasks, while featuring lightweight model parameters and strong practical applicability. Code and data will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36618", "url": null, "sourceid": 43181, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39155, "uid": "dfc9e4d35fdc35e6f3afeec8d7e0e474", "name": "SCoRe: Salience-Coverage Reduction for Vision Token Pruning in  Vision-Language Models", "authors": [{"id": 184260, "fullname": "Tong Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184260?format=json", "institution": "Institute of Microelectronics, Chinese Academy of Sciences"}, {"id": 191470, "fullname": "Hailong Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/191470?format=json", "institution": "Chinese Academy of Sciences"}, {"id": 191471, "fullname": "Xingyu Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191471?format=json", "institution": "Chinese Academy of Sciences"}], "abstract": "The heavy computational burden of Large Vision-Language Models (LVLMs) stems primarily from the lengthy visual token sequences generated by their vision encoders. To mitigate this, recent work has shifted towards pruning tokens within the vision encoder. However, we observe that these methods predominantly rely on a suboptimal decoupled heuristic method. This method is conceptually flawed: it is prone to sampling collapse, fails to fundamentally eliminate token redundancy, and tends to systematically discard secondary yet important semantic clusters.Addressing this limitation, this paper proposes to formalize visual token pruning as a unified Representativeness Optimization problem. We introduce SCoRe (Salience-Coverage Reduction), a unified optimization method theoretically grounded in the Weighted k-Center Problem. SCoRe constructs the final token set by greedily selecting tokens\u2014at each iteration, choosing the token that maximizes the current set's unified representativeness score, thereby achieving the optimization of global representativeness. Extensive experiments demonstrate that SCoRe achieves State-of-the-Art (SOTA) performance across multiple benchmarks. Notably, with negligible computational overhead, our method reduces tokens by 94.4% while retaining 95% of the full performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39155", "url": null, "sourceid": 43839, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36414, "uid": "2131847b1d4cccf8c5e06f7bef69cf60", "name": "SECOS: Semantic Capture for Rigorous Classification in Open-World Semi-Supervised Learning", "authors": [{"id": 184981, "fullname": "Hezhao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184981?format=json", "institution": "Xiamen University"}, {"id": 180950, "fullname": "jiacheng yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180950?format=json", "institution": "Xiamen University"}, {"id": 184649, "fullname": "Junlong Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184649?format=json", "institution": "Xiamen University"}, {"id": 184982, "fullname": "Mengke Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184982?format=json", "institution": "Shenzhen University"}, {"id": 153842, "fullname": "Yiqun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153842?format=json", "institution": "Hong Kong Baptist University"}, {"id": 163658, "fullname": "Shreyank Gowda Gowda", "url": "http://cvpr.thecvf.com/api/miniconf/users/163658?format=json", "institution": "University of Nottingham"}, {"id": 90817, "fullname": "Yang Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90817?format=json", "institution": "Xiamen University"}], "abstract": "In open-world semi-supervised learning (OWSSL), a model learns from labeled data and unlabeled data containing both known and novel classes. In practical OWSSL applications, models are expected to perform rigorous classification by directly selecting the most semantically relevant label from a candidate set for each sample. Existing OWSSL methods fail to achieve this because novel samples are trained without explicit supervision, and these methods lack mechanisms to extract latent semantic information, resulting in predicted labels that have no semantic correspondence to candidate textual labels. To address this, we introduce SEmantic Capture for Open-world Semi-supervised learning (SECOS), which directly predicts textual labels from the candidate set without post-processing, meeting the requirements of practical OWSSL applications. SECOS leverages external knowledge to extract and align semantic representations across modalities for both known and novel classes, providing explicit supervisory signals for training novel classes. Extensive experiments demonstrate that even when existing OWSSL methods are evaluated under the more lenient post-hoc matching setting, SECOS still surpasses them by up to 5.4\\% without such assistance, highlighting its superior effectiveness. Code is available in the supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36414", "url": null, "sourceid": 33647, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37184, "uid": "2dcde3677e9b9888f0fc505018ec4ea8", "name": "Decision Boundary-aware Generation for Long-tailed Learning", "authors": [{"id": 180950, "fullname": "jiacheng yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180950?format=json", "institution": "Xiamen University"}, {"id": 180949, "fullname": "Ruichi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180949?format=json", "institution": "Xiamen University"}, {"id": 180937, "fullname": "Chikai Shang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180937?format=json", "institution": "Xiamen University"}, {"id": 184982, "fullname": "Mengke Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184982?format=json", "institution": "Shenzhen University"}, {"id": 153841, "fullname": "Xinyi Shang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153841?format=json", "institution": "University College London, University of London"}, {"id": 184649, "fullname": "Junlong Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184649?format=json", "institution": "Xiamen University"}, {"id": 185918, "fullname": "Yonggang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185918?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 90817, "fullname": "Yang Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90817?format=json", "institution": "Xiamen University"}], "abstract": "Long-tailed data bias decision boundaries toward head classes and degrade tail class accuracy. Diffusion-based generative augmentation address this problem by generating additional data, while head-to-tail transfer further mitigate the generator bias inherit from long-tailed dataset. However, we show that while head-to-tail transfer helps balance the decision space of the classifier, it also induces latent non-local feature mixing that entangles inter-class features, causing decision boundary overlap and tail class distribution shift. To address this, we first identify the problem of boundary ambiguity and then propose Decision Boundary-aware Generation (DBG) framework, which promotes near-boundary representation learning by generating informative near-boundary samples. Overall, DBG rebalances the long-tailed dataset while yielding more separable decision space for long-tailed learning. Across standard long-tailed benchmarks, DBG consistently improves tail class and overall accuracy with less inter-class overlap. The code is available in the supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37184", "url": null, "sourceid": 39767, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39952, "uid": "46f3eb0e98ba1f34a396ebc7a9ba9f7c", "name": "When AVSR Meets Video Conferencing: Dataset, Degradation, and the Hidden Mechanism Behind Performance Collapse", "authors": [{"id": 180594, "fullname": "Yihuan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180594?format=json", "institution": "Wuhan University"}, {"id": 192649, "fullname": "Jun Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/192649?format=json", "institution": "Wuhan University"}, {"id": 193180, "fullname": "Liu Jiajun", "url": "http://cvpr.thecvf.com/api/miniconf/users/193180?format=json", "institution": "Wuhan University"}, {"id": 193181, "fullname": "Daixian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193181?format=json", "institution": "Wuhan University"}, {"id": 193182, "fullname": "Tong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193182?format=json", "institution": "Wuhan University"}, {"id": 193183, "fullname": "Zhuolin Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/193183?format=json", "institution": "Wuhan University"}, {"id": 193184, "fullname": "Yanzhen Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/193184?format=json", "institution": "Wuhan University"}, {"id": 193185, "fullname": "Kai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193185?format=json", "institution": "Tsinghua University"}], "abstract": "Audio-Visual Speech Recognition (AVSR) has achieved remarkable progress in offline conditions, yet its robustness in real-world video conferencing (VC) remains largely unexplored. This paper presents the first systematic evaluation of state-of-the-art AVSR models across mainstream VC platforms, revealing severe performance degradation caused by transmission distortions and spontaneous human hyper-expression. To address this gap, we construct MLD-VC, the first multimodal dataset tailored for VC, comprising 31 speakers, 22.79 hours of audio-visual data, and explicit use of the Lombard effect to enhance human hyper-expression. Through comprehensive analysis, we find that speech enhancement algorithms are the primary source of distribution shift, which alters the first and second formants of audio. Interestingly, we find that the distribution shift induced by the Lombard effect closely resembles that introduced by speech enhancement, which explains why models trained on Lombard data exhibit greater robustness in VC. Fine-tuning AVSR models on MLD-VC mitigates this issue, achieving an average 17.5\\% reduction in CER across several VC platforms. Our findings and dataset provide a foundation for developing more robust and generalizable AVSR systems in real-world video conferencing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39952", "url": null, "sourceid": 30908, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37457, "uid": "e6b5c4f9b23ce25d8722f4f8f395db85", "name": "CUE: Concept-Aware Multi-Label Expansion to Mitigate Concept Confusion in Long-Tailed Learning", "authors": [{"id": 180949, "fullname": "Ruichi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180949?format=json", "institution": "Xiamen University"}, {"id": 180937, "fullname": "Chikai Shang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180937?format=json", "institution": "Xiamen University"}, {"id": 180950, "fullname": "jiacheng yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180950?format=json", "institution": "Xiamen University"}, {"id": 184982, "fullname": "Mengke Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184982?format=json", "institution": "Shenzhen University"}, {"id": 187495, "fullname": "Yang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/187495?format=json", "institution": "Institute of High Performance Computing, Singapore, A*STAR"}, {"id": 184649, "fullname": "Junlong Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184649?format=json", "institution": "Xiamen University"}, {"id": 90817, "fullname": "Yang Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90817?format=json", "institution": "Xiamen University"}], "abstract": "Long-tailed distributions are common in real-world recognition tasks, where a few head classes have many samples while most tail classes have very few. Recently, fine-tuning foundation models for long-tailed learning has gained attention due to their excellent performance. However, most existing methods focus solely on mitigating long-tailed distribution bias while overlooking concept confusion caused by the long-tailed distribution. In this paper, we study this problem and attribute it to the mutual exclusivity of single-label supervision under long-tailed distributions, which suppresses feature sharing among related classes and amplifies the dominance of head classes, leading to disrupted inter-class discriminality. To address this, we propose $\\textbf{CUE}$,  $\\underline{C}$oncept-aware m$\\underline{U}$lti-label $\\underline{E}$xpansion, which introduces multi-label concept signals to preserve disrupted inter-class relationships. Specifically, CUE constructs concept sets by $\\textbf{(i)}$ extracting instance-level visual cues from zero-shot CLIP and $\\textbf{(ii)}$ generating class-level semantic cues with LLM; the two cues are incorporated via separately weighted Binary Logit-Adjustmen (BLA) auxiliary losses and jointly optimized with the baseline Logit-Adjustmen (LA) loss. Experiments on several long-tailed benchmarks, CUE achieves balanced and strong performance, surpassing recent state-of-the-art methods. The code is available in the supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37457", "url": null, "sourceid": 44725, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37125, "uid": "0ff768b9aec0057b915265fb8fccbe3a", "name": "Breaking the 3D Dataset Bottleneck: Fast Scalable Generation of Aligned 3D Assets from Scratch  for Category 6D Pose Estimation and Robotic Grasping", "authors": [{"id": 186723, "fullname": "Guillaume Duret", "url": "http://cvpr.thecvf.com/api/miniconf/users/186723?format=json", "institution": null}, {"id": 182056, "fullname": "Danylo Mazurak", "url": "http://cvpr.thecvf.com/api/miniconf/users/182056?format=json", "institution": "ECOLE CENTRALE DE LYON"}, {"id": 178310, "fullname": "Florence Zara", "url": "http://cvpr.thecvf.com/api/miniconf/users/178310?format=json", "institution": "Universit\u00e9 Claude Bernard Lyon 1"}, {"id": 186724, "fullname": "Jan Peters", "url": "http://cvpr.thecvf.com/api/miniconf/users/186724?format=json", "institution": "German Research Center for AI; TU Darmstadt"}, {"id": 158663, "fullname": "Liming Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/158663?format=json", "institution": "Ecole Centrale de Lyon"}], "abstract": "While 2D vision has been revolutionized by large-scale datasets like ImageNet, 3D vision remains constrained by the scarcity of high-quality, canonically aligned data. We introduce the first scalable, automated framework that generates complete category-level 6D pose datasets directly from text prompts, bypassing the need for existing 3D assets. Our method overcomes key challenges by: (1) ensuring reliable, scalable asset generation via a controlled text-to-image-to-3D pipeline; (2) enforcing built-in canonical alignment through depth-conditioned generation, achieving a 96\\% pose consistency rate; and (3) enabling large-scale 6D annotation via mixed reality rendering. The pipeline produces high-quality, aligned 3D meshes in under 3 minutes per object\u2014a 5\u201320$\\times$ speedup over traditional scanning. We generate over 1,000 instances for each of the 153 categories in the Omni6Dpose benchmark, culminating in 153,000 aligned meshes\u2014a >40$\\times$ increase in instances per category over previous aligned real-world datasets. Extensive evaluation demonstrates competitive zero-shot sim2real transfer on the NOCS 6D pose benchmark and superior robotic grasping performance in both simulation and real-world zero-shot transfer, where aligned meshes prove essential for success. We release the largest publicly available aligned 3D mesh dataset, largest category-level 6D pose dataset, grasping simulation environments, and open-source pipeline, providing a critical step toward foundation models for 3D understanding and enabling efficient, unlimited generation of task-specific 3D data from scratch.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37125", "url": null, "sourceid": 34958, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37434, "uid": "f4d135e23bccef02aedebf969915d885", "name": "Reallocating Attention Across Layers to Reduce Multimodal Hallucination", "authors": [{"id": 187439, "fullname": "Haolang Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187439?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 187440, "fullname": "Bolun Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187440?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 187441, "fullname": "WeiYe Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187441?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 130585, "fullname": "Guoshun Nan", "url": "http://cvpr.thecvf.com/api/miniconf/users/130585?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 187442, "fullname": "Junning Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187442?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 187443, "fullname": "Minghui Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187443?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 182645, "fullname": "Qiankun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182645?format=json", "institution": "Nanyang Technological University"}, {"id": 76212, "fullname": "Yi Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76212?format=json", "institution": "Nanyang Technological University, Singapore"}, {"id": 187444, "fullname": "Hua Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187444?format=json", "institution": "Guangxi Transportation Science and Technology Group Co., Ltd."}, {"id": 187445, "fullname": "Kun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187445?format=json", "institution": "Nanyang Technological University"}], "abstract": "Multimodal large reasoning models (MLRMs) often suffer from hallucinations that stem not only from insufficient visual grounding but also from imbalanced allocation between perception and reasoning processes. Building upon recent interpretability findings suggesting a staged division of attention across layers, we analyze how this functional misalignment leads to two complementary failure modes: perceptual bias in shallow layers and reasoning drift in deeper layers.To alleviate these issues, we propose Functional Head Identification and Class-Conditioned Rescaling , a lightweight, training-free plugin that identifies perception- and reasoning-oriented heads and adaptively rebalances their layerwise contributions. Our method improves reasoning consistency and visual faithfulness without retraining or any architectural modification. Evaluations across three representative MLRMs and five multimodal reasoning benchmarks show an average 4.2-percentage-point gain, with less than 1\\% additional computation and only 9\\% baseline latency. Beyond empirical improvements, our study provides an interpretable perspective on regulating cross-layer functional dynamics to enhance the reliability of multimodal reasoning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37434", "url": null, "sourceid": 32912, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38524, "uid": "48e6328a802b323d0b14b31a10227588", "name": "UFO: Unifying Feed-Forward and Optimization-based Methods for Large Driving Scene Modeling", "authors": [{"id": 182344, "fullname": "Kaiyuan Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/182344?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 190062, "fullname": "Yingying Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190062?format=json", "institution": null}, {"id": 88626, "fullname": "Ziyue Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88626?format=json", "institution": "Nankai University"}, {"id": 190063, "fullname": "Mingfei Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190063?format=json", "institution": null}, {"id": 190064, "fullname": "HAOHUI ZHU", "url": "http://cvpr.thecvf.com/api/miniconf/users/190064?format=json", "institution": null}, {"id": 180077, "fullname": "Haiyang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/180077?format=json", "institution": "Xiaomi Corporation"}, {"id": 126660, "fullname": "Bing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126660?format=json", "institution": "Alibaba Group"}, {"id": 185801, "fullname": "Guang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185801?format=json", "institution": "Xiaomi Corporation"}, {"id": 184796, "fullname": "Hangjun Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/184796?format=json", "institution": "Xiaomi Corporation"}], "abstract": "Dynamic driving scene reconstruction is critical for autonomous driving simulation and closed-loop learning. While recent feed-forward methods have shown promise for 3D reconstruction, they struggle with long-range driving sequences due to quadratic complexity in sequence length and challenges in modeling dynamic objects over extended durations. We propose UFO, a novel recurrent paradigm that combines the benefits of optimization-based and feed-forward methods for efficient long-range 4D reconstruction. Our approach maintains a 4D scene representation that is iteratively refined as new observations arrive, using a visibility-based filtering mechanism to select informative scene tokens and enable efficient processing of long sequences. For dynamic objects, we introduce an object pose-guided modeling approach that supports accurate long-range motion capture. Experiments on the Waymo Open Dataset demonstrate that our method significantly outperforms both per-scene optimization and existing feed-forward methods across various sequence lengths. Notably, our approach can reconstruct 16-second driving logs within 0.5 second while maintaining superior visual quality and geometric accuracy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38524", "url": null, "sourceid": 38355, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36755, "uid": "dcb3e8c1e29848df4f0326f764cd0131", "name": "From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection", "authors": [{"id": 180197, "fullname": "yepeng liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180197?format=json", "institution": "Wuhan University"}, {"id": 185795, "fullname": "Hao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185795?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 185796, "fullname": "Liwen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185796?format=json", "institution": "Wuhan University"}, {"id": 185797, "fullname": "Fangzhen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185797?format=json", "institution": "Xiaomi Corporation"}, {"id": 185798, "fullname": "Xudi Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/185798?format=json", "institution": "Wuhan University"}, {"id": 185799, "fullname": "Yuliang Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185799?format=json", "institution": null}, {"id": 185800, "fullname": "kuang Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185800?format=json", "institution": "Xiaomi Corporation"}, {"id": 126660, "fullname": "Bing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126660?format=json", "institution": "Alibaba Group"}, {"id": 185801, "fullname": "Guang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185801?format=json", "institution": "Xiaomi Corporation"}, {"id": 184796, "fullname": "Hangjun Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/184796?format=json", "institution": "Xiaomi Corporation"}, {"id": 185802, "fullname": "Yongchao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185802?format=json", "institution": "Wuhan University"}], "abstract": "Keypoint-based matching is a fundamental component of modern 3D vision systems, such as Structure-from-Motion (SfM) and SLAM. Most existing learning-based methods are trained on image pairs, a paradigm that fails to explicitly optimize for the long-term trackability of keypoints across  sequences under challenging viewpoint and illumination changes. In this paper, we reframe keypoint detection as a sequential decision-making problem. We introduce TraqPoint, a novel, end-to-end Reinforcement Learning (RL) framework designed to optimize the Track-quality (Traq) of keypoints directly on image sequences. Our core innovation is a track-aware reward mechanism that jointly encourages the consistency and distinctiveness of keypoints across multiple views, guided by a policy gradient method. Extensive evaluations on sparse matching benchmarks, including relative pose estimation and 3D reconstruction, demonstrate that TraqPoint significantly outperforms some state-of-the-art keypoint detection and description methods.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36755", "url": null, "sourceid": 40771, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40275?format=json"], "related_events_ids": [40275]}, {"id": 39531, "uid": "7b1b6d2fdf573b01058f16f6e398b69c", "name": "NeAR: Coupled Neural Asset\u2013Renderer Stack", "authors": [{"id": 127767, "fullname": "Hong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/127767?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 90336, "fullname": "Chongjie Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/90336?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 145090, "fullname": "Houyuan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/145090?format=json", "institution": "nanjing university"}, {"id": 192275, "fullname": "Weiqing Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192275?format=json", "institution": "Nanjing University"}, {"id": 144302, "fullname": "Ziyang Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/144302?format=json", "institution": "University of Trento"}, {"id": 192276, "fullname": "Lixing Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192276?format=json", "institution": "Zhejiang University"}, {"id": 90325, "fullname": "Zhaoxi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/90325?format=json", "institution": "MMLab@NTU, Nanyang Technological University"}, {"id": 151573, "fullname": "Jianfeng XIANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/151573?format=json", "institution": "Tsinghua University"}, {"id": 185393, "fullname": "Shaocong Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185393?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 192277, "fullname": "Xuhui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192277?format=json", "institution": "Beijing University"}, {"id": 184477, "fullname": "Yikai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184477?format=json", "institution": "Tsinghua University"}, {"id": 87855, "fullname": "Baochang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87855?format=json", "institution": "Beihang University"}, {"id": 88683, "fullname": "Xiaoguang Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/88683?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 129072, "fullname": "Jiaolong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129072?format=json", "institution": "Microsoft Research"}, {"id": 88978, "fullname": "Hao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88978?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Neural asset authoring and neural rendering have emerged as largely disjoint threads: one generates digital assets using neural networks for traditional graphics pipelines, while the other develops neural renderers that map conventional assets to images. However, the joint design of the asset representation and renderer remains largely unexplored. We argue that coupling them can unlock an end-to-end learnable graphics stack with benefits in fidelity, consistency, and efficiency. In this paper, we explore this possibility with **NeAR**: a Coupled Neural Asset\u2013Renderer Stack. On the **asset** side, we build on Trellis-style Structured 3D Latents and introduce a lighting-homogenized neural asset: from a casually lit input, a rectified-flow backbone predicts a Lighting-Homogenized SLAT that encodes geometry and intrinsic material cues in a compact, view-agnostic latent. On the **renderer** side, we design a lighting-aware neural renderer that uses this neural asset, along with explicit view embeddings and HDR environment maps, to produce lighting-aware renderings in realtime. We validate NeAR on four tasks: (1) G-buffer\u2013based forward rendering, (2) random-lit single-image reconstruction, (3) unknown-lit single-image relighting, and (4) novel-view relighting, where our coupled stack surpasses state-of-the-art baselines in quantitative metrics and perceptual quality. We hope this coupled asset-renderer perspective inspires new graphics stacks that view neural assets and renderers as co-designed components instead of independent ones.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39531", "url": null, "sourceid": 33100, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36922, "uid": "5bf38e0fcbeebe423cbc6b20775bc2b4", "name": "Towards Storytelling Animations: Joint Synthesis of Human and Camera Motions", "authors": [{"id": 182240, "fullname": "Boyuan Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/182240?format=json", "institution": "Bournemouth University"}, {"id": 178403, "fullname": "Yingjie Xi", "url": "http://cvpr.thecvf.com/api/miniconf/users/178403?format=json", "institution": "Bournemouth University"}, {"id": 186225, "fullname": "Rui He", "url": "http://cvpr.thecvf.com/api/miniconf/users/186225?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 143576, "fullname": "Jinhe Na", "url": "http://cvpr.thecvf.com/api/miniconf/users/143576?format=json", "institution": "Dalian Minzu University"}, {"id": 186226, "fullname": "Ying Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186226?format=json", "institution": "ShanghaiTech University"}, {"id": 186227, "fullname": "Pengjie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186227?format=json", "institution": "Dalian Minzu University"}, {"id": 186228, "fullname": "Jian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186228?format=json", "institution": "Bournemouth University"}, {"id": 186229, "fullname": "Xiaosong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186229?format=json", "institution": "Bournemouth University"}], "abstract": "Animation relies heavily on effective cinematography to enhance narrative clarity and emotional resonance, yet crafting optimal character interactions and camera positioning remains a resource-intensive challenge. Existing methods typically require extensive, predefined datasets, which restrict their effectiveness when encountering unfamiliar character interactions or novel animation contexts. We introduce an innovative approach to jointly generate character interactions and camera placements through unconditional diffusion-based generative models. Our method leverages a unified framework to simultaneously synthesize realistic two-person motions and corresponding cinematographic compositions without relying on predefined visual datasets. By integrating 3D motion representations and Toric features, our diffusion model effectively captures spatial orientation and relative positioning, enabling coherent and expressive scene generation. Experiments demonstrate that our approach can autonomously produce diverse and plausible dual-character interactions coupled with compelling camera movements, enhancing creative flexibility in animated storytelling.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36922", "url": null, "sourceid": 36530, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38064, "uid": "f1a860c4306a9bf87570bf8491809064", "name": "Mamba Learns in Context: Structure-Aware Domain Generalization for Multi-Task Point Cloud Understanding", "authors": [{"id": 188955, "fullname": "Jincen Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188955?format=json", "institution": "Bournemouth University"}, {"id": 188956, "fullname": "Qianyu Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/188956?format=json", "institution": "Jilin University"}, {"id": 188957, "fullname": "Yuhang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188957?format=json", "institution": null}, {"id": 188958, "fullname": "Kui Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/188958?format=json", "institution": "Hangzhou City University"}, {"id": 181397, "fullname": "Meili Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181397?format=json", "institution": "Northwest A&amp;F University"}, {"id": 188959, "fullname": "Jian Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188959?format=json", "institution": "Bournemouth University"}, {"id": 186228, "fullname": "Jian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186228?format=json", "institution": "Bournemouth University"}, {"id": 156718, "fullname": "Xuequan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156718?format=json", "institution": "The University of Western Australia"}], "abstract": "While recent Transformer and Mamba architectures have advanced point cloud representation learning, they are typically developed for single-task or single-domain settings. Directly applying them to multi-task domain generalization (DG) leads to degraded performance. Transformers effectively model global dependencies but suffer from quadratic attention cost and lack explicit structural ordering, whereas Mamba offers linear-time recurrence yet often depends on coordinate-driven serialization, which is sensitive to viewpoint changes and missing regions, causing structural drift and unstable sequential modeling. In this paper, we propose Structure-Aware Domain Generalization (SADG), a Mamba-based In-Context Learning framework that preserves structural hierarchy across domains and tasks. We design structure-aware serialization (SAS) that generates transformation-invariant sequences using centroid-based topology and geodesic curvature continuity. We further devise hierarchical domain-aware modeling (HDM) that stabilizes cross-domain reasoning by consolidating intra-domain structure and fusing inter-domain relations. At test time, we introduce a lightweight spectral graph alignment (SGA) that shifts target features toward source prototypes in the spectral domain without updating model parameters, ensuring structure-preserving test-time feature shifting. In addition, we introduce MP3DObject, a real-scan object dataset for multi-task DG evaluation. Comprehensive experiments demonstrate that the proposed approach improves structural fidelity and consistently outperforms state-of-the-art methods across multiple tasks including reconstruction, denoising, and registration.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38064", "url": null, "sourceid": 33690, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40275, "uid": "dcb3e8c1e29848df4f0326f764cd0131", "name": "From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection", "authors": [{"id": 180197, "fullname": "yepeng liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180197?format=json", "institution": "Wuhan University"}, {"id": 185795, "fullname": "Hao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185795?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 185796, "fullname": "Liwen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185796?format=json", "institution": "Wuhan University"}, {"id": 185797, "fullname": "Fangzhen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185797?format=json", "institution": "Xiaomi Corporation"}, {"id": 185798, "fullname": "Xudi Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/185798?format=json", "institution": "Wuhan University"}, {"id": 185799, "fullname": "Yuliang Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185799?format=json", "institution": null}, {"id": 185800, "fullname": "kuang Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185800?format=json", "institution": "Xiaomi Corporation"}, {"id": 126660, "fullname": "Bing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126660?format=json", "institution": "Alibaba Group"}, {"id": 185801, "fullname": "Guang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185801?format=json", "institution": "Xiaomi Corporation"}, {"id": 184796, "fullname": "Hangjun Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/184796?format=json", "institution": "Xiaomi Corporation"}, {"id": 185802, "fullname": "Yongchao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185802?format=json", "institution": "Wuhan University"}], "abstract": "Keypoint-based matching is a fundamental component of modern 3D vision systems, such as Structure-from-Motion (SfM) and SLAM. Most existing learning-based methods are trained on image pairs, a paradigm that fails to explicitly optimize for the long-term trackability of keypoints across  sequences under challenging viewpoint and illumination changes. In this paper, we reframe keypoint detection as a sequential decision-making problem. We introduce TraqPoint, a novel, end-to-end Reinforcement Learning (RL) framework designed to optimize the Track-quality (Traq) of keypoints directly on image sequences. Our core innovation is a track-aware reward mechanism that jointly encourages the consistency and distinctiveness of keypoints across multiple views, guided by a policy gradient method. Extensive evaluations on sparse matching benchmarks, including relative pose estimation and 3D reconstruction, demonstrate that TraqPoint significantly outperforms some state-of-the-art keypoint detection and description methods.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40275", "url": null, "sourceid": -40771, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/36755?format=json"], "related_events_ids": [36755]}, {"id": 37941, "uid": "f5eb0cf67469a4982ea81ddf8b4d4048", "name": "CMR-RD: Long-Tailed Adaptive VLM for Explainable CMR Diagnosis", "authors": [{"id": 188642, "fullname": "Yansong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188642?format=json", "institution": "Beijing Normal University"}, {"id": 188643, "fullname": "Zhongxi Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188643?format=json", "institution": "Southern University of Science and Technology"}, {"id": 188644, "fullname": "Yun Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/188644?format=json", "institution": "Beijing Normal University"}, {"id": 188645, "fullname": "Zheng jinyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188645?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 90856, "fullname": "Shuo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/90856?format=json", "institution": "Case Western Reserve University"}], "abstract": "Cardiac magnetic resonance (CMR) is the clinical gold standard for assessing cardiovascular diseases, but its interpretation relies on expert experience and remains challenging, particularly for identifying rare diseases. Existing automated methods lack interpretable reasoning processes, limiting clinical adoption. Although vision-language models (VLMs) possess basic visual understanding and text generation capabilities, they still lack verifiable reasoning chains in medical diagnosis and underperform on minority classes in long-tail distributions. To address these challenges, we propose CMR-RD, to our knowledge the first VLM for interpretable diagnosis in CMR, capable of generating explicit diagnostic chains aligned with imaging evidence. We construct a CMR dataset that reflects real-world clinical distributions, comprising five disease categories (including two rare conditions) plus normal controls. Building on this, the general-purpose VLM is aligned to medical and CMR semantics using large-scale medical vision\u2013text data, and cold-start training is used to enhance its understanding of medical concepts and basic reasoning. To enhance reasoning and performance on rare samples, we propose Group Phase Policy Optimization (GPPO), which combines online multi-stage reinforcement learning (RL)with adaptive sampling. GPPO enables the model to proactively explore rare and underperforming classes, thereby effectively mitigating long-tail bias. Experiments demonstrate that CMR-RD achieves state-of-the-art accuracy and reasoning-chain correctness compared with medical and general VLM baselines, shows stronger recognition of rare categories, and exhibits higher data efficiency. These results provide an interpretable pathway for automated CMR diagnosis.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37941", "url": null, "sourceid": 39508, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37002, "uid": "22ab6168adc1c019f687009e997b43a0", "name": "DriveLaW: Unifying Planning and Video Generation in a Latent Driving World", "authors": [{"id": 186438, "fullname": "Tianze Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/186438?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 147616, "fullname": "Yongkang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/147616?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 186439, "fullname": "Lijun Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186439?format=json", "institution": "Xiaomi Corporation"}, {"id": 153927, "fullname": "Jingfeng Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153927?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 186440, "fullname": "Kaixin Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186440?format=json", "institution": "Xiaomi Corporation"}, {"id": 180077, "fullname": "Haiyang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/180077?format=json", "institution": "Xiaomi Corporation"}, {"id": 126660, "fullname": "Bing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126660?format=json", "institution": "Alibaba Group"}, {"id": 186441, "fullname": "Kun Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/186441?format=json", "institution": null}, {"id": 185801, "fullname": "Guang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185801?format=json", "institution": "Xiaomi Corporation"}, {"id": 184796, "fullname": "Hangjun Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/184796?format=json", "institution": "Xiaomi Corporation"}, {"id": 86175, "fullname": "Wenyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86175?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 86179, "fullname": "Xinggang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86179?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "World models have become crucial for autonomous driving, as they learn how scenarios evolve over time to address the long-tail challenges of the real world. However, current approaches relegate world models to limited roles: they operate within ostensibly unified architectures that still keep world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By directly injecting the latent representation from its video generator into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW-Video, our powerful world model that generates high-fidelity forecasting with expressive latent representations, and DriveLaW-Act, a diffusion planner that generates consistent and reliable trajectories from the latent of DriveLaW-Video, with both components optimized by a three-stage progressive training strategy. The power of our unified paradigm is demonstrated by new state-of-the-art results across both tasks. DriveLaW not only advances video prediction significantly, surpassing best-performing work by 33.3% in FID and 1.8% in FVD, but also achieves a new record on the NAVSIM planning benchmark.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37002", "url": null, "sourceid": 46662, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37841, "uid": "37e0eaaff0973a8ab20092edeacf2ff0", "name": "Self-Diffusion Driven Blind Imaging", "authors": [{"id": 188394, "fullname": "Yanlong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188394?format=json", "institution": "James Cook University"}, {"id": 182547, "fullname": "Guanxiong Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/182547?format=json", "institution": "Independent"}], "abstract": "Optical imaging systems are inherently imperfect due to diffraction limits, lens manufacturing tolerances, assembly misalignment, and other physical constraints. In addition, unavoidable camera shake and object motion further introduce non-ideal degradations during acquisition. These aberrations and motion-induced variations are typically unknown, difficult to measure, and costly to model or calibrate in practice. Blind inverse problems offer a promising direction by jointly estimating both the latent image and the unknown degradation kernel. However, existing approaches often suffer from convergence instability, limited prior expressiveness, and sensitivity to hyperparameters. Inspired by recent advances in self-diffusion, we propose DeblurSDI, a zero-shot, self-supervised blind imaging framework that requires no pre-training. DeblurSDI formulates blind image recovery as an iterative reverse self-diffusion process that begins from pure noise and progressively refines both the sharp image and the blur kernel. Extensive experiments on combined optical aberrations and motion blur demonstrate that DeblurSDI consistently outperforms other methods by a substantial margin.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37841", "url": null, "sourceid": 32291, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39438, "uid": "121a8fa990428e63ef770c4b64f8aa2a", "name": "ParkGaussian: Surround-view 3D Gaussian Splatting for Autonomous Parking", "authors": [{"id": 130099, "fullname": "Xiaobao Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/130099?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 192074, "fullname": "Zhangjie Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/192074?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 192075, "fullname": "Yuxiang Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192075?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 129692, "fullname": "Zunjie Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129692?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 192076, "fullname": "Yunfei Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192076?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 190062, "fullname": "Yingying Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190062?format=json", "institution": null}, {"id": 192077, "fullname": "Shan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192077?format=json", "institution": "Xiaomi Corporation"}, {"id": 89661, "fullname": "Ming Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89661?format=json", "institution": "Intel Labs China"}, {"id": 180077, "fullname": "Haiyang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/180077?format=json", "institution": "Xiaomi Corporation"}, {"id": 126660, "fullname": "Bing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126660?format=json", "institution": "Alibaba Group"}, {"id": 185801, "fullname": "Guang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185801?format=json", "institution": "Xiaomi Corporation"}, {"id": 189365, "fullname": "Rongfeng Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189365?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 184796, "fullname": "Hangjun Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/184796?format=json", "institution": "Xiaomi Corporation"}], "abstract": "Parking is a critical task for autonomous driving systems (ADS), with unique challenges in crowded parking slots and GPS-denied environments. However, existing works focus on 2D parking slot perception, mapping, and localization, 3D reconstruction remains underexplored, which is crucial for capturing complex spatial geometry in parking scenarios. Naively improving the visual quality of reconstructed parking scenes does not directly benefit autonomous parking, as the key entry point for parking is the slots perception module. To address these limitations, we curate the first benchmark named ParkRecon3D,  specifically designed for parking scene reconstruction. It includes sensor data from four surround-view fisheye cameras with calibrated extrinsics and dense parking slot annotations. We then propose ParkGaussian, the first framework that integrates 3D Gaussian Splatting (3DGS) for parking scene reconstruction. To further improve the alignment between reconstruction and downstream parking slot detection, we introduce a slot-aware reconstruction strategy that leverages existing parking perception methods to enhance the synthesis quality of slot regions. Experiments on ParkRecon3D demonstrate that ParkGaussian achieves state-of-the-art reconstruction quality and better preserves perception consistency for downstream tasks. The code and dataset will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39438", "url": null, "sourceid": 35314, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40152, "uid": "e65314c557950debea6692c6d0d4b278", "name": "U$^{2}$Flow: Uncertainty-Aware Unsupervised Optical Flow Estimation", "authors": [{"id": 183485, "fullname": "Xunpei Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/183485?format=json", "institution": "Sun Yat-sen University"}, {"id": 193648, "fullname": "Wenwei Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/193648?format=json", "institution": null}, {"id": 90470, "fullname": "Yi Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90470?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 156851, "fullname": "Gang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/156851?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Existing unsupervised optical flow methods typically lack reliable uncertainty estimation, limiting their robustness and interpretability. We propose U$^{2}$Flow, the first recurrent unsupervised framework that jointly  estimates optical flow and per-pixel uncertainty. The core innovation is a decoupled learning strategy that derives uncertainty supervision from augmentation consistency via a Laplace-based maximum likelihood objective, enabling stable training without ground truth. The predicted uncertainty is further integrated into the network to guide adaptive flow refinement and dynamically modulate the regional smoothness loss. Furthermore, we introduce an uncertainty-guided bidirectional flow fusion mechanism that enhances robustness in challenging regions. Extensive experiments on KITTI and Sintel demonstrate that U$^{2}$Flow achieves state-of-the-art performance among unsupervised methods while producing highly reliable uncertainty maps, validating the effectiveness of our joint estimation paradigm.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40152", "url": null, "sourceid": 36230, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40396?format=json"], "related_events_ids": [40396]}, {"id": 40396, "uid": "e65314c557950debea6692c6d0d4b278", "name": "U$^{2}$Flow: Uncertainty-Aware Unsupervised Optical Flow Estimation", "authors": [{"id": 183485, "fullname": "Xunpei Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/183485?format=json", "institution": "Sun Yat-sen University"}, {"id": 193648, "fullname": "Wenwei Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/193648?format=json", "institution": null}, {"id": 90470, "fullname": "Yi Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90470?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 156851, "fullname": "Gang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/156851?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Existing unsupervised optical flow methods typically lack reliable uncertainty estimation, limiting their robustness and interpretability. We propose U$^{2}$Flow, the first recurrent unsupervised framework that jointly  estimates optical flow and per-pixel uncertainty. The core innovation is a decoupled learning strategy that derives uncertainty supervision from augmentation consistency via a Laplace-based maximum likelihood objective, enabling stable training without ground truth. The predicted uncertainty is further integrated into the network to guide adaptive flow refinement and dynamically modulate the regional smoothness loss. Furthermore, we introduce an uncertainty-guided bidirectional flow fusion mechanism that enhances robustness in challenging regions. Extensive experiments on KITTI and Sintel demonstrate that U$^{2}$Flow achieves state-of-the-art performance among unsupervised methods while producing highly reliable uncertainty maps, validating the effectiveness of our joint estimation paradigm.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40396", "url": null, "sourceid": -36230, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40152?format=json"], "related_events_ids": [40152]}, {"id": 36571, "uid": "a315bb3296a043597fc3758b29dbe9ff", "name": "Learning Latent Concepts for Detecting Out-of-Distribution Objects", "authors": [{"id": 185378, "fullname": "Ting Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185378?format=json", "institution": "Ningbo University of Finance and Economy"}, {"id": 107430, "fullname": "Junhao Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/107430?format=json", "institution": "Nanyang Technological University &amp; A*STAR"}, {"id": 131393, "fullname": "Yew-Soon Ong", "url": "http://cvpr.thecvf.com/api/miniconf/users/131393?format=json", "institution": "Nanyang Technological University"}], "abstract": "Detecting out-of-distribution (OOD) objects is indispensable for safely deploying object detectors in the wild. Current approaches enable the unknown-aware ability by regularizing the instance-level feature space, such as outlier synthesis. Despite the general efficacy, it is challenging to truly learn the concept of `unknown' under the absence of real unknown data. In this paper, we propose UNO-Adapter, a simple yet highly effective framework tailored for OOD object detection. Our key insight is that in object detection, where in-distribution~(ID) and OOD objects may coexist within the same context, we need global abstraction and reasoning to help the detector learn their differences, i.e., unknown injection. UNO-Adapter consists of two key steps: unsupervised concept discovery and neural concept binder. The former introduces an object-centric learning paradigm to abstract and model the holistic image, including both ID and OOD, obtaining sparse and compressed slot-based representations with relational constraints. The latter dynamically combines slots with object candidates extracted by the detector, binding the concept of unknown to the de facto detector. During inference, we introduce an image-guided OOD object score to reinforce the distinction between ID and OOD. Experiments on standard benchmarks demonstrate the superiority of the proposed method. In particular, UNO-Adapter reduces the FPR95 by up to 11.96% compared to the previous best OOD object detection method.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36571", "url": null, "sourceid": 33226, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40266?format=json"], "related_events_ids": [40266]}, {"id": 40147, "uid": "9117b58934e7639cd9a09be9db43fb7d", "name": "Back to Point: Exploring Point-Language Models for Zero-Shot 3D Anomaly Detection", "authors": [{"id": 193624, "fullname": "Kaiqiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193624?format=json", "institution": "Qilu University of Technology (ShanDong Academy of Science)"}, {"id": 193625, "fullname": "Gang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193625?format=json", "institution": "Qilu University of Technology"}, {"id": 193626, "fullname": "Mingle Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/193626?format=json", "institution": "Qilu University of Technology"}, {"id": 193627, "fullname": "Min Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193627?format=json", "institution": "Qilu University of Technology"}, {"id": 193628, "fullname": "Delong Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/193628?format=json", "institution": null}, {"id": 193629, "fullname": "Jin Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193629?format=json", "institution": "Qilu University of Technology"}], "abstract": "Zero-shot (ZS) 3D anomaly detection is crucial for reliable industrial inspection, as it enables detecting and localizing defects without requiring anomalous training data. Existing approaches render 3D point clouds into 2D images and leverage pre-trained Vision-Language Models (VLMs) for anomaly detection. However, such strategies inevitably discard geometric details and exhibit limited sensitivity to local anomalies.  In this paper, we revisit intrinsic 3D representations and explore the potential of pre-trained Point-Language Models (PLMs) for ZS 3D anomaly detection. We propose BTP (Back To Point), a novel framework that effectively aligns 3D point cloud and textual embeddings. Specifically, BTP aligns multi-granularity patch features with textual representations for localized anomaly detection, while incorporating geometric descriptors to enhance sensitivity to structural anomalies. Furthermore, we introduce a joint representation learning strategy that leverages auxiliary point cloud data to improve robustness and enrich anomaly semantics. Extensive experiments on Real3D-AD and Anomaly-ShapeNet demonstrate that BTP achieves superior performance in ZS 3D anomaly detection.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40147", "url": null, "sourceid": 44594, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40266, "uid": "a315bb3296a043597fc3758b29dbe9ff", "name": "Learning Latent Concepts for Detecting Out-of-Distribution Objects", "authors": [{"id": 185378, "fullname": "Ting Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185378?format=json", "institution": "Ningbo University of Finance and Economy"}, {"id": 107430, "fullname": "Junhao Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/107430?format=json", "institution": "Nanyang Technological University &amp; A*STAR"}, {"id": 131393, "fullname": "Yew-Soon Ong", "url": "http://cvpr.thecvf.com/api/miniconf/users/131393?format=json", "institution": "Nanyang Technological University"}], "abstract": "Detecting out-of-distribution (OOD) objects is indispensable for safely deploying object detectors in the wild. Current approaches enable the unknown-aware ability by regularizing the instance-level feature space, such as outlier synthesis. Despite the general efficacy, it is challenging to truly learn the concept of `unknown' under the absence of real unknown data. In this paper, we propose UNO-Adapter, a simple yet highly effective framework tailored for OOD object detection. Our key insight is that in object detection, where in-distribution~(ID) and OOD objects may coexist within the same context, we need global abstraction and reasoning to help the detector learn their differences, i.e., unknown injection. UNO-Adapter consists of two key steps: unsupervised concept discovery and neural concept binder. The former introduces an object-centric learning paradigm to abstract and model the holistic image, including both ID and OOD, obtaining sparse and compressed slot-based representations with relational constraints. The latter dynamically combines slots with object candidates extracted by the detector, binding the concept of unknown to the de facto detector. During inference, we introduce an image-guided OOD object score to reinforce the distinction between ID and OOD. Experiments on standard benchmarks demonstrate the superiority of the proposed method. In particular, UNO-Adapter reduces the FPR95 by up to 11.96% compared to the previous best OOD object detection method.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40266", "url": null, "sourceid": -33226, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/36571?format=json"], "related_events_ids": [36571]}, {"id": 37764, "uid": "0ad5d0c8747fd99d37ab4cd396ae9662", "name": "Contact-Aware Neural  Dynamics", "authors": [{"id": 184247, "fullname": "Changwei Jing", "url": "http://cvpr.thecvf.com/api/miniconf/users/184247?format=json", "institution": "University of California San Diego"}, {"id": 188204, "fullname": "Jai Bandi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188204?format=json", "institution": "University of California, San Diego"}, {"id": 188205, "fullname": "Jianglong Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/188205?format=json", "institution": "University of California, San Diego"}, {"id": 188206, "fullname": "Rocky Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188206?format=json", "institution": "Amazon FAR"}, {"id": 84867, "fullname": "Pieter Abbeel", "url": "http://cvpr.thecvf.com/api/miniconf/users/84867?format=json", "institution": "Covariant"}, {"id": 91790, "fullname": "Xiaolong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91790?format=json", "institution": "UCSD"}, {"id": 188207, "fullname": "Sha Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188207?format=json", "institution": "University of California, San Diego"}], "abstract": "High-fidelity physics simulation is essential for scalable robotic learning, but the sim-to-real gap persists, especially for tasks involving complex, dynamic, and discontinuous interactions like physical contacts. Explicit system identification, which tunes explicit simulator parameters, is often insufficient to align the intricate, high-dimensional, and state-dependent dynamics of the real world. To overcome this, we propose an implicit sim-to-real alignment framework that learns to directly align the simulator's dynamics with contact information. Our method treats the off-the-shelf simulator as a base prior and learns a contact-aware neural dynamics model to refine simulated states using real-world observations. We show that using tactile contact information from robotic hands can effectively model the non-smooth discontinuities inherent in contact-rich tasks, resulting in a neural dynamics model grounded by real-world data. We demonstrate that this learned forward dynamics model improves state prediction accuracy and can be effectively used to predict policy performance and refine policies trained purely in standard simulators, offering a scalable, data-driven approach to sim-to-real alignment.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37764", "url": null, "sourceid": 43179, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39036, "uid": "a1a01d83de53972346fe4c46dbebdef7", "name": "Affordance-First Decomposition for Continual Learning in Video\u2013Language Understanding", "authors": [{"id": 191220, "fullname": "Mengzhu xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191220?format=json", "institution": "University of Sydney, University of Sydney"}, {"id": 191221, "fullname": "Hanzhi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191221?format=json", "institution": "University of California, Santa Barbara"}, {"id": 182729, "fullname": "Ningkang Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/182729?format=json", "institution": "Nanjing Normal University"}, {"id": 182325, "fullname": "qianyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/182325?format=json", "institution": "NTU"}, {"id": 185896, "fullname": "Canran Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185896?format=json", "institution": "Shenzhen Campus of Sun Yat-sen University"}], "abstract": "Continual learning for video--language understanding is increasingly important as models face non-stationary data, domains, and query styles, yet prevailing solutions blur what should stay stable versus what should adapt, rely on static routing/capacity, or require replaying past videos. We aim to explicitly specify where stability lives and where plasticity should be focused under realistic memory and privacy constraints. We introduce Affordance-First Decomposition (AFD): videos are mapped to slowly varying affordance tokens that form a shared, time-aligned substrate, while a lightweight, query-routed, conflict-aware scheduler concentrates adaptation and grows capacity only when needed. The substrate is stabilized via weak alignment and teacher consistency, and training uses question-only replay. AFD achieves state-of-the-art across protocols: 51.6% average accuracy with -1.8% forgetting on domain-incremental VideoQA, ViLCo R@1@0.5 of 29.6% (MQ) and 20.7% (NLQ) with 18.4% stAP@0.25 (VQ), and 39.5% accuracy with -1.6% forgetting on time-incremental iVQA. Overall, AFD offers an explicit, interpretable split between a stable interaction-centered substrate and targeted adaptation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39036", "url": null, "sourceid": 34576, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38876, "uid": "c8f2e54fe7b8ab5c291ea8d5831669df", "name": "MAMMA: Markerless Accurate Multi-person Motion Acquisition", "authors": [{"id": 183193, "fullname": "Hanz Cuevas Velasquez", "url": "http://cvpr.thecvf.com/api/miniconf/users/183193?format=json", "institution": "Max Planck Institute for Intelligent Systems"}, {"id": 183825, "fullname": "Anastasios Yiannakidis", "url": "http://cvpr.thecvf.com/api/miniconf/users/183825?format=json", "institution": "Max Planck Institute"}, {"id": 119973, "fullname": "Soyong Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/119973?format=json", "institution": "Carnegie Mellon University"}, {"id": 128457, "fullname": "Giorgio Becherini", "url": "http://cvpr.thecvf.com/api/miniconf/users/128457?format=json", "institution": "Max Planck Institute for Intelligent Systems, Max-Planck Institute"}, {"id": 131971, "fullname": "Markus H\u00f6schle", "url": "http://cvpr.thecvf.com/api/miniconf/users/131971?format=json", "institution": "Max Planck Institute for Intelligent Systems, Max-Planck Institute"}, {"id": 88491, "fullname": "Joachim Tesch", "url": "http://cvpr.thecvf.com/api/miniconf/users/88491?format=json", "institution": "Max Planck Institute for Intelligent Systems"}, {"id": 190902, "fullname": "Taylor Obersat", "url": "http://cvpr.thecvf.com/api/miniconf/users/190902?format=json", "institution": "Max-Planck Institute"}, {"id": 190903, "fullname": "Tsvetelina Alexiadis", "url": "http://cvpr.thecvf.com/api/miniconf/users/190903?format=json", "institution": "Max Planck Institute for Intelligent Systems, Max-Planck Institute"}, {"id": 132038, "fullname": "Eni Halilaj", "url": "http://cvpr.thecvf.com/api/miniconf/users/132038?format=json", "institution": "Carnegie Mellon University"}, {"id": 85690, "fullname": "Michael J. Black", "url": "http://cvpr.thecvf.com/api/miniconf/users/85690?format=json", "institution": "University of T\u00fcbingen"}], "abstract": "We present MAMMA, a markerless motion-capture pipeline that accurately recovers SMPL-X parameters from multi-view video.Traditional motion-capture systems rely on physical markers. Although they offer high accuracy, their requirements of specialised hardware, manual marker placement, and extensive post-processing make them costly and time-consuming. Recent learning-based methods attempt to overcome these limitations, but most are designed for single-person capture, rely on sparse keypoints, or struggle with occlusions and physical interactions. In this work, we introduce a method that predicts dense 2D surface landmarks conditioned on segmentation masks, enabling person-specific correspondence estimation even under heavy occlusion. We employ a novel architecture that exploits learnable queries for each landmark. We demonstrate that our approach can handle complex person--person interaction and offers greater accuracy than existing methods. To train our network, we construct a large, synthetic multi-view dataset combining human motions from diverse sources, including extreme poses, hand motions, and close interactions. Our dataset yields high-variability synthetic sequences with rich body contact and occlusion, and includes SMPL-X ground-truth annotations with dense 2D landmarks.The result is a system capable of accurately capturing human motion without the need for markers. Our approach offers competitive reconstruction quality compared to commercial marker-based motion-capture solutions, without the extensive manual cleanup. Finally, we address the absence of common benchmarks for dense-landmark prediction and markerless motion capture by introducing two evaluation settings built from real multi-view sequences. We will release our dataset, method, code, and model weights for research purposes.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38876", "url": null, "sourceid": 39983, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40350?format=json"], "related_events_ids": [40350]}, {"id": 40350, "uid": "c8f2e54fe7b8ab5c291ea8d5831669df", "name": "MAMMA: Markerless Accurate Multi-person Motion Acquisition", "authors": [{"id": 183193, "fullname": "Hanz Cuevas Velasquez", "url": "http://cvpr.thecvf.com/api/miniconf/users/183193?format=json", "institution": "Max Planck Institute for Intelligent Systems"}, {"id": 183825, "fullname": "Anastasios Yiannakidis", "url": "http://cvpr.thecvf.com/api/miniconf/users/183825?format=json", "institution": "Max Planck Institute"}, {"id": 119973, "fullname": "Soyong Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/119973?format=json", "institution": "Carnegie Mellon University"}, {"id": 128457, "fullname": "Giorgio Becherini", "url": "http://cvpr.thecvf.com/api/miniconf/users/128457?format=json", "institution": "Max Planck Institute for Intelligent Systems, Max-Planck Institute"}, {"id": 131971, "fullname": "Markus H\u00f6schle", "url": "http://cvpr.thecvf.com/api/miniconf/users/131971?format=json", "institution": "Max Planck Institute for Intelligent Systems, Max-Planck Institute"}, {"id": 88491, "fullname": "Joachim Tesch", "url": "http://cvpr.thecvf.com/api/miniconf/users/88491?format=json", "institution": "Max Planck Institute for Intelligent Systems"}, {"id": 190902, "fullname": "Taylor Obersat", "url": "http://cvpr.thecvf.com/api/miniconf/users/190902?format=json", "institution": "Max-Planck Institute"}, {"id": 190903, "fullname": "Tsvetelina Alexiadis", "url": "http://cvpr.thecvf.com/api/miniconf/users/190903?format=json", "institution": "Max Planck Institute for Intelligent Systems, Max-Planck Institute"}, {"id": 132038, "fullname": "Eni Halilaj", "url": "http://cvpr.thecvf.com/api/miniconf/users/132038?format=json", "institution": "Carnegie Mellon University"}, {"id": 85690, "fullname": "Michael J. Black", "url": "http://cvpr.thecvf.com/api/miniconf/users/85690?format=json", "institution": "University of T\u00fcbingen"}], "abstract": "We present MAMMA, a markerless motion-capture pipeline that accurately recovers SMPL-X parameters from multi-view video.Traditional motion-capture systems rely on physical markers. Although they offer high accuracy, their requirements of specialised hardware, manual marker placement, and extensive post-processing make them costly and time-consuming. Recent learning-based methods attempt to overcome these limitations, but most are designed for single-person capture, rely on sparse keypoints, or struggle with occlusions and physical interactions. In this work, we introduce a method that predicts dense 2D surface landmarks conditioned on segmentation masks, enabling person-specific correspondence estimation even under heavy occlusion. We employ a novel architecture that exploits learnable queries for each landmark. We demonstrate that our approach can handle complex person--person interaction and offers greater accuracy than existing methods. To train our network, we construct a large, synthetic multi-view dataset combining human motions from diverse sources, including extreme poses, hand motions, and close interactions. Our dataset yields high-variability synthetic sequences with rich body contact and occlusion, and includes SMPL-X ground-truth annotations with dense 2D landmarks.The result is a system capable of accurately capturing human motion without the need for markers. Our approach offers competitive reconstruction quality compared to commercial marker-based motion-capture solutions, without the extensive manual cleanup. Finally, we address the absence of common benchmarks for dense-landmark prediction and markerless motion capture by introducing two evaluation settings built from real multi-view sequences. We will release our dataset, method, code, and model weights for research purposes.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40350", "url": null, "sourceid": -39983, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38876?format=json"], "related_events_ids": [38876]}, {"id": 39287, "uid": "0eaf0f796b918b4c9c7eec8cc4a9e0cb", "name": "Beyond Duality: A Hybrid Framework of Leveraging Shared and Private Features for RGB-Event Object Detection", "authors": [{"id": 181985, "fullname": "Keyao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181985?format=json", "institution": "Hebei University of Technology"}, {"id": 191770, "fullname": "Shuai Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191770?format=json", "institution": "Hebei University of Technology"}, {"id": 191771, "fullname": "Hengda Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/191771?format=json", "institution": "Hebei University of Technology"}, {"id": 191772, "fullname": "Lukui Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/191772?format=json", "institution": "Hebei University of Technology"}, {"id": 191773, "fullname": "Chenhaiyong Chenhaiyong", "url": "http://cvpr.thecvf.com/api/miniconf/users/191773?format=json", "institution": null}], "abstract": "RGB-Event object detection is able to capture clear and detailed features of the target while maintaining high-speed information collection.It is suitable for high dynamic or harsh environments and has become a research hotspot in recent years. The existing RGB-Event object detectors all struggle to fully utilize the fusion features of two modalities, but ignore the independent role of single features. To fully tap into the potential of single features, we propose a frequency-domain coherence-based Shared and Private Features Decoupling method for RGB-Event object detection method, SPFD network. First, we design a FCFS module to separate shared and private features by exploring the spectral energy distribution differences between dual modalities. Then, we design a TriAdapt Encoder to process the shared and private features, selectively emphasizing texture-rich RGB features in static regions and motion-sensitive event features in dynamic regions, thereby achieving a robust balance between spatial detail and temporal awareness. Finally, a TriInject Decoder is proposed to emphasize the most discriminative modality features dynamically.  Experimental results on the DSEC-Det and PKU-DAVIS-SOD datasets demonstrate that our model achieves competitive performance with state-of-the-arts.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39287", "url": null, "sourceid": 32258, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39073, "uid": "e7231cdd39ec8d177d83ba538cac70ba", "name": "Addressing Exacerbated Attention Sink for Source-Free Cross-Domain Few-Shot Learning", "authors": [{"id": 191306, "fullname": "Shuai Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/191306?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 131641, "fullname": "Yixiong Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/131641?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 131642, "fullname": "Yuhua Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/131642?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 86547, "fullname": "Ruixuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86547?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Vision-language models (VLMs) like CLIP have shown impressive generalization capabilities, yet their potential for Cross-Domain Few-Shot Learning (CDFSL) remains underexplored, where the model needs to transfer source-domain information to target domains with scarce training data. While the attention sink phenomenon has been observed in VLMs for certain tasks, its role in CDFSL scenarios has not been studied. In this paper, we uncover a critical issue overlooked by prior works: standard target-domain few-shot fine-tuning in CDFSL significantly exacerbates the attention sink problem, leading to poor discriminability across classes. To understand this phenomenon, through extensive experiments, we interpret it as the model's shortcut learning for domain adaptation: to overcome the huge domain gap between the source and target domains, the model shows a high tendency to push tokens that are initially closer to target-domain classes (i.e., simple tokens) to be even closer to these classes, exacerbating the attention sink and wasting the capability of learning other discriminative but initially further tokens (i.e., hard tokens). To address this, we propose a novel approach to dynamically re-weight tokens according to their relevance with target-domain classes during the target-domain finetuning, which explicitly suppresses the model's reliance on these simple tokens and enhances the learning of hard tokens, reducing sink tokens and enhancing discriminability. Extensive experiments on four benchmark datasets validate the rationale of our method, demonstrating new state-of-the-art performance. Our codes will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39073", "url": null, "sourceid": 43894, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37364, "uid": "bc55ea9e207158cae19485c14e573efd", "name": "NG-GS: NeRF-guided 3D Gaussian Splatting Segmentation", "authors": [{"id": 159864, "fullname": "Yi He", "url": "http://cvpr.thecvf.com/api/miniconf/users/159864?format=json", "institution": "Beijing Jiaotong University"}, {"id": 153035, "fullname": "Tao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153035?format=json", "institution": "Beijing Jiaotong University"}, {"id": 184577, "fullname": "Yi Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184577?format=json", "institution": "Beijing Jiaotong University"}, {"id": 153034, "fullname": "Congyan Lang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153034?format=json", "institution": "Beijing jiaotong university"}, {"id": 153036, "fullname": "Yidong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/153036?format=json", "institution": "Beijing Jiaotong University"}, {"id": 92055, "fullname": "Haibin Ling", "url": "http://cvpr.thecvf.com/api/miniconf/users/92055?format=json", "institution": "State University of New York, Stony Brook"}], "abstract": "Recent advances in 3D Gaussian Splatting (3DGS) have enabled highly efficient and photorealistic novel view synthesis. However, segmenting objects accurately in 3DGS remains challenging due to the discrete nature of Gaussian representations, which often leads to aliasing and artifacts at object boundaries. In this paper, we introduce NG-GS, a novel framework for high-quality object segmentation in 3DGS that explicitly addresses boundary discretization. Our approach begins by automatically identifying ambiguous Gaussians at object boundaries using mask variance analysis. We then apply radial basis function (RBF) interpolation to construct a spatially continuous feature field, enhanced by multi-resolution hash encoding for efficient multi-scale representation. A joint optimization strategy aligns 3DGS with a lightweight NeRF module through alignment and spatial continuity losses, ensuring smooth and consistent segmentation boundaries. Extensive experiments on NVOS and LERF-OVS benchmarks demonstrate that our method achieves state-of-the-art performance, with significant gains in boundary mIoU.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37364", "url": null, "sourceid": 34360, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36248, "uid": "f5baef61033f3d8d192d865b8ce49faf", "name": "CoLC: Communication-Efficient Collaborative Perception with LiDAR Completion", "authors": [{"id": 180163, "fullname": "Yushan Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/180163?format=json", "institution": "Beijing Jiaotong University"}, {"id": 184576, "fullname": "Hui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184576?format=json", "institution": "Beijing Jiaotong University"}, {"id": 128691, "fullname": "Qiming Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/128691?format=json", "institution": "XMU"}, {"id": 184577, "fullname": "Yi Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184577?format=json", "institution": "Beijing Jiaotong University"}, {"id": 153036, "fullname": "Yidong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/153036?format=json", "institution": "Beijing Jiaotong University"}], "abstract": "Collaborative perception empowers autonomous agents to share complementary information and overcome perception limitations. While early fusion offers more perceptual complementarity and is inherently robust to model heterogeneity, its high communication cost has limited its practical deployment, prompting most existing works to favor intermediate or late fusion. To address this, we propose a communication-efficient early Collaborative perception framework that incorporates LiDAR Completion to restore scene completeness under sparse transmission, dubbed as CoLC. Specifically, the CoLC integrates three complementary designs. First, each neighbor agent applies Foreground-Aware Point Sampling (FAPS) to selectively transmit informative points that retain essential structural and contextual cues under bandwidth constraints. The ego agent then employs Completion-Enhanced Early Fusion (CEEF) to reconstruct dense pillars from the received sparse inputs and adaptively fuse them with its own observations, thereby restoring spatial completeness. Finally, the Dense-Guided Dual Alignment (DGDA) strategy enforces semantic and geometric consistency between the enhanced and dense pillars during training, ensuring consistent and robust feature learning. Experiments on both simulated and real-world datasets demonstrate that CoLC achieves superior perception-communication trade-offs and remains robust under heterogeneous model settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36248", "url": null, "sourceid": 33054, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37256, "uid": "ff4776b449efb88b35fbf6187af9771e", "name": "GS^2: Graph-based Spatial Distribution Optimization for Compact 3D Gaussian Splatting", "authors": [{"id": 180139, "fullname": "Xianben Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180139?format=json", "institution": "Beijing Jiaotong University"}, {"id": 153035, "fullname": "Tao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153035?format=json", "institution": "Beijing Jiaotong University"}, {"id": 187019, "fullname": "Yuxuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187019?format=json", "institution": "Beijing Jiaotong University"}, {"id": 184577, "fullname": "Yi Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184577?format=json", "institution": "Beijing Jiaotong University"}, {"id": 92055, "fullname": "Haibin Ling", "url": "http://cvpr.thecvf.com/api/miniconf/users/92055?format=json", "institution": "State University of New York, Stony Brook"}], "abstract": "3D Gaussian Splatting (3DGS) has demonstrated breakthrough performance in novel view synthesis and real-time rendering.   Nevertheless, its practicality is constrained by the high memory cost due to a huge number of Gaussian points.   Many pruning-based 3DGS variants have been proposed for memory saving, but often compromise spatial consistency and may lead to rendering artifacts. To address this issue, we propose graph-based spatial distribution optimization for compact 3D Gaussian Splatting (GS\\textasciicircum2), which enhances reconstruction quality by optimizing the spatial distribution of Gaussian points. Specifically, we introduce an evidence lower bound (ELBO)-based adaptive densification strategy that automatically controls the densification process. In addition, an opacity-aware progressive pruning strategy is proposed to further reduce memory consumption by dynamically removing low-opacity Gaussian points. Furthermore, we propose a graph-based feature encoding module to adjust the spatial distribution via feature-guided point shifting.  Extensive experiments validate that GS\\textasciicircum2 achieves a compact Gaussian representation while delivering superior rendering quality. Compared with 3DGS, it achieves higher PSNR with only about 12.5\\% Gaussian points. Furthermore, it outperforms all compared baselines in both rendering quality and memory efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37256", "url": null, "sourceid": 37157, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36350, "uid": "acd8d58c7c352df2d1d729701488f54f", "name": "Label What Matters: Modality-Balanced and Difficulty-Aware Multimodal Active Learning", "authors": [{"id": 180304, "fullname": "Yuqiao Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180304?format=json", "institution": "Beijing Jiaotong University"}, {"id": 184845, "fullname": "Xu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184845?format=json", "institution": "Beijing Jiaotong University"}, {"id": 184846, "fullname": "Tengfei Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184846?format=json", "institution": null}, {"id": 184847, "fullname": "Yiqing Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184847?format=json", "institution": "Beijing Jiaotong University"}, {"id": 184577, "fullname": "Yi Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184577?format=json", "institution": "Beijing Jiaotong University"}, {"id": 184848, "fullname": "Hui Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184848?format=json", "institution": "University of Glasgow"}], "abstract": "Multimodal learning integrates complementary information from different modalities such as image, text, and audio to improve model performance, but its success relies on large-scale labeled data, which is costly to obtain. Active learning (AL) mitigates this challenge by selectively annotating informative samples. In multimodal settings, many approaches implicitly assume that modality importance is stable across rounds and keep selection rules fixed at the fusion stage, which leaves them insensitive to the dynamic nature of multimodal learning, where the relative value of modalities and the difficulty of instances shift as training proceeds. To address this issue, we propose \\textbf{RL-MBA}, a reinforcement-learning framework for modality-balanced, difficulty-aware multimodal active learning. RL-MBA models sample selection as a Markov Decision Process, where the policy adapts to modality contributions, uncertainty, and diversity, and the reward encourages accuracy gains and balance. Two key components drive this adaptability: (1) Adaptive Modality Contribution Balancing (AMCB), which dynamically adjusts modality weights via reinforcement feedback, and (2) Evidential Fusion for Difficulty-Aware Policy Adjustment (EFDA), which estimates sample difficulty via uncertainty-based evidential fusion to prioritize informative samples. Experiments on Food101, KineticsSound, and VGGSound demonstrate that RL-MBA consistently outperforms strong baselines, improving both classification accuracy and modality fairness under limited labeling budgets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36350", "url": null, "sourceid": 34173, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39643, "uid": "cfe12d07973eb647d2ba40f76257ce1a", "name": "FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing", "authors": [{"id": 192553, "fullname": "Xijie Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192553?format=json", "institution": "Fudan University"}, {"id": 156408, "fullname": "Chengming Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156408?format=json", "institution": "Tencent"}, {"id": 131219, "fullname": "Donghao Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/131219?format=json", "institution": "Tencent YouTu Lab"}, {"id": 152689, "fullname": "Xiaobin Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152689?format=json", "institution": "Tencent AI Lab"}, {"id": 192554, "fullname": "Peng Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192554?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}, {"id": 156406, "fullname": "Xu Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/156406?format=json", "institution": "Tencent YouTu Lab"}, {"id": 88656, "fullname": "Jiangning Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88656?format=json", "institution": "Tencent Youtu Lab"}, {"id": 86912, "fullname": "Chengjie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86912?format=json", "institution": "Tencent Youtu Lab; Shanghai Jiao Tong University"}, {"id": 73907, "fullname": "Yanwei Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73907?format=json", "institution": "Fudan University"}], "abstract": "First-Frame Propagation (FFP) offers a promising paradigm for controllable video editing, but existing methods are hampered by a reliance on cumbersome run-time guidance. We identify the root cause of this limitation as the inadequacy of current training datasets, which are often too short, low-resolution, and lack the task diversity required to teach robust temporal priors. To address this foundational data gap, we first introduce FFP-300K, a new large-scale dataset comprising 300K high-fidelity video pairs at 720p resolution and 81 frames in length, constructed via a principled two-track pipeline for diverse local and global edits. Building on this dataset, we propose a novel framework designed for true guidance-free FFP that resolves the critical tension between maintaining first-frame appearance and preserving source video motion. Architecturally, we introduce Adaptive Spatio-Temporal RoPE (AST-RoPE), which dynamically remaps positional encodings to disentangle appearance and motion references. At the objective level, we employ a self-distillation strategy where an identity propagation task acts as a powerful regularizer, ensuring long-term temporal stability and preventing semantic drift. Comprehensive experiments on the EditVerseBench benchmark demonstrate that our method significantly outperforming existing academic and commercial models by receiving about 0.2 PickScore and 0.3 VLM score improvement against these competitors.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39643", "url": null, "sourceid": 33883, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39199, "uid": "eba9f1a976a8f27e7e0d3a428571b9bb", "name": "Bidirectional Query-Driven Generation of Parametric CAD Sketch", "authors": [{"id": 179972, "fullname": "Yang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/179972?format=json", "institution": "Beijing Institute of Technology; Nanyang Technological University"}, {"id": 191565, "fullname": "Daxuan Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/191565?format=json", "institution": "Nanyang Technological University"}, {"id": 191566, "fullname": "Yijie Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/191566?format=json", "institution": "Nanyang Technological University"}, {"id": 152258, "fullname": "Jianmin Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/152258?format=json", "institution": "Nanyang Technological University"}, {"id": 191567, "fullname": "Fang Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191567?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "Learning-based CAD modeling shows great promise in automating parametric design, yet existing approaches often overlook the incremental and state-dependent nature of sketch construction. We present CADSketcher, a query-driven bidirectional framework for completing partial parametric sketches by internalizing the non-linear construction logic of interactive CAD processes. At the core of CADSketcher are two key innovations. First, a bidirectional sketch learner recovers both prior and posterior contexts from arbitrary-span partial sketches via a bidirectional query mechanism, enabling exploration of multiple plausible modeling trajectories. Second, a confidence-guided completion pipeline adaptively determines the expansion direction through a confidence gate and ensures executable instruction generation using a validity compiler, while a progressive context updater preserves sketch consistency throughout the evolving sketch state. In addition, a hybrid positional encoding integrates global modeling progression with local geometric semantics, reinforcing structural coherence during both learning and completion. Extensive experiments demonstrate that CADSketcher achieves superior geometric validity and instruction consistency across diverse sketch completion tasks, offering a robust and interpretable framework toward intelligent CAD automation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39199", "url": null, "sourceid": 30874, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36711, "uid": "129b4030c9f6ad08dccd421ada7705fa", "name": "ARC Is a Vision Problem!", "authors": [{"id": 175376, "fullname": "Keya Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175376?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 185702, "fullname": "Ali Cy", "url": "http://cvpr.thecvf.com/api/miniconf/users/185702?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 185703, "fullname": "Linlu Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185703?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 185704, "fullname": "Delores(Xiaoman) Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/185704?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 185705, "fullname": "Runqian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185705?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 185706, "fullname": "Yeyin Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185706?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 185707, "fullname": "Jacob Andreas", "url": "http://cvpr.thecvf.com/api/miniconf/users/185707?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 150920, "fullname": "Kaiming He", "url": "http://cvpr.thecvf.com/api/miniconf/users/150920?format=json", "institution": "Massachusetts Institute of Technology"}], "abstract": "The Abstraction and Reasoning Corpus (ARC) is designed to promote research on abstract reasoning, a fundamental aspect of human intelligence. Common approaches to ARC treat it as a language-oriented problem, addressed by large language models (LLMs) or recurrent reasoning models. However, although the puzzle-like tasks in ARC are inherently visual, existing research has rarely approached the problem from a vision-centric perspective. In this work, we formulate ARC within a vision paradigm, framing it as an image-to-image translation problem. To incorporate visual priors, we represent the inputs on a \u201ccanvas\u201d that can be processed like natural images.It is then straightforward for us to apply standard vision architectures, such as a vanilla Vision Transformer (ViT), to perform image-to-image mapping. Our model is trained from scratch solely on ARC data and generalizes to unseen tasks through test-time training. Our framework, termed Vision ARC (VARC), achieves 60.4% accuracy on the ARC-1 benchmark, substantially outperforming existing methods that are also trained from scratch. Our results are competitive with those of leading LLMs and close the gap to average human performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36711", "url": null, "sourceid": 36006, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36862, "uid": "1dc7eb7c735224b4aeec0c6bb0b36d94", "name": "Beyond Appearance: Camouflaged Object Detection via Geometric Structure", "authors": [{"id": 182433, "fullname": "Jinyu Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/182433?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 72312, "fullname": "changguang wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/72312?format=json", "institution": "Nanjing university of science and technology"}, {"id": 184398, "fullname": "Fuming Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/184398?format=json", "institution": "Dalian Minzu University"}, {"id": 89846, "fullname": "Jinhui Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89846?format=json", "institution": "Nanjing University of Science and Technology"}], "abstract": "Depth priors provide salient geometric structure that benefits camouflaged object detection (COD), but directly using Monocular Depth Estimation (MDE) causes a task misalignment that still fails to identify camouflaged objects.To address this issue, we propose the Depth Segment Anything Model (DepthSAM), a MDE-adapted method specifically designed to mitigate this misalignment.DepthSAM incorporates two core innovations: (1) a Sparse Mixture-of-Experts Adapter (SMEA) that enables DEM to learn semantic information unique to camouflaged scenes, and (2) a Geometric\u2013Semantic Fusion Module (GSFM) that efficiently integrates geometric cues with high-level semantics. With these components, DepthSAM achieves both robust semantic understanding in camouflaged environments and accurate segmentation of camouflaged objects.Extensive experiments show that DepthSAM achieves new SOTA performance on three major benchmarks. For example, on COD10K, its $S_{\\alpha}$ and $F_{\\beta}^{\\omega}$ metrics surpass the best competing methods by 3.0\\% and 4.3\\%, respectively.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36862", "url": null, "sourceid": 37114, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40099, "uid": "5b592a9e37c3474d12c4a05a1ef50598", "name": "Say Cheese! Detail-Preserving Portrait Collection Generation via Natural Language Edits", "authors": [{"id": 181558, "fullname": "Zelong Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/181558?format=json", "institution": "Renmin University of China"}, {"id": 193525, "fullname": "Jiahui Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193525?format=json", "institution": "Renmin University of China"}, {"id": 193526, "fullname": "Ying Ba", "url": "http://cvpr.thecvf.com/api/miniconf/users/193526?format=json", "institution": "Renmin University of China"}, {"id": 193527, "fullname": "Dong Jing", "url": "http://cvpr.thecvf.com/api/miniconf/users/193527?format=json", "institution": "Renmin University of China"}, {"id": 128491, "fullname": "Zhiwu Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128491?format=json", "institution": "Renmin University of China"}], "abstract": "As social media platforms proliferate, users increasingly demand intuitive ways to create diverse, high-quality portrait collections. In this work, we introduce Portrait Collection Generation (PCG), a novel task that generates coherent portrait collections by editing a reference portrait image through natural language instructions. This task poses two unique challenges to existing methods: (1) **complex multi-attribute modifications** such as pose, spatial layout, and camera viewpoint; and (2) **high-fidelity detail preservation** including identity, clothing, and accessories. To address these challenges, we propose **CHEESE**, the first large-scale PCG dataset containing 24K portrait collections and 573K samples with high-quality modification text annotations, constructed through an Large Vison-Language Model-based pipeline with inversion-based verification. We further propose **SCheese**, a framework that combines text-guided generation with hierarchical identity and detail preservation. SCheese employs adaptive feature fusion mechanism to maintain identity consistency, and ConsistencyNet to inject fine-grained features for detail consistency. Comprehensive experiments validate the effectiveness of CHEESE in advancing PCG, with SCheese achieving state-of-the-art performance in handling complex edits with identity and fine-grained details consistency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40099", "url": null, "sourceid": 32565, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36192, "uid": "06725569496062bb79a86ff4bc931950", "name": "Training-Free Open-Vocabulary Camouflaged Object Segmentation via Fine-Grained Object Binding and Adaptive Hybrid Prompt", "authors": [{"id": 180941, "fullname": "Peng Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/180941?format=json", "institution": "Jilin University"}, {"id": 184396, "fullname": "Cheng Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184396?format=json", "institution": "Jilin University"}, {"id": 184397, "fullname": "Chuande Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184397?format=json", "institution": "Jilin University"}, {"id": 184398, "fullname": "Fuming Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/184398?format=json", "institution": "Dalian Minzu University"}, {"id": 184399, "fullname": "Tian Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/184399?format=json", "institution": "Jilin University"}], "abstract": "Vision-Language models (e.g., CLIP) facilitate the development of open-vocabulary camouflaged object segmentation (OVCOS), but existing methods still rely on mask annotations for fully-supervised training. In contrast, the training-free paradigm can rapidly process unseen data, representing a highly promising solution. However, in camouflage scenarios, existing training-free methods utilize sparse textual prompts and ignore the category similarity between visual patches, leading to inadequate object binding capability. To alleviate these issues, we propose a fine-grained object binding and adaptive hybrid prompt framework for training-free OVCOS. The framework first employs multimodal large language models (MLLMs) to explicitly model fine-grained textual descriptions of camouflaged objects and background. Building on this, we construct a semantic probe to decouple object and background features and explicitly model category similarity between visual patches via semantic consistency ranking, thereby achieving accurate object binding. Subsequently, we propose an entropy-guided text embedding adjustment strategy to adjust textual embeddings, aiming to further enhance fine-grained object binding. Finally, we utilize an adaptive hybrid prompt generation strategy to generate hybrid prompts, assisting SAM in accurately segmenting camouflaged objects. Experimental results on the OVCamo benchmark demonstrate that our method achieves excellent performance, significantly surpassing the advanced training-free ResCLIP.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36192", "url": null, "sourceid": 40465, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37948, "uid": "7e2aa3ad63bd5f4025e9162ea8738c7a", "name": "EfficientVPR: Toward Efficient Visual Place Recognition via Scene-Aware Prompt Tuning and Adaptive Feature Enhancement", "authors": [{"id": 188656, "fullname": "Wenjing Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188656?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 130183, "fullname": "Chuanguang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130183?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 130221, "fullname": "Zhulin An", "url": "http://cvpr.thecvf.com/api/miniconf/users/130221?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 185313, "fullname": "Libo Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185313?format=json", "institution": "Institute of Computing Technology"}, {"id": 105398, "fullname": "boyu diao", "url": "http://cvpr.thecvf.com/api/miniconf/users/105398?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences."}, {"id": 105588, "fullname": "Yongjun Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/105588?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}], "abstract": "Visual place recognition (VPR) faces critical challenges in handling extreme environmental variations while meeting the computational constraints of practical applications. Current methods predominantly address these challenges by either scaling up model capacity or employing computationally intensive reranking stages, creating a significant efficiency bottleneck. To overcome this limitation, we propose EfficientVPR, a lightweight one-stage framework that achieves unprecedented speed-accuracy trade-offs through two key innovations: i) a scene-aware visual prompt tuning method which adapts pretrained features with less parameters while dynamically adjusting to sample-specific characteristics, and ii) an instance-dependent key local feature enhancement module that further reinforces discriminative regions. Comprehensive evaluations on Pitts250k, MSLS, Eynsham, AmsterTime and SVOX demonstrate that our method establishes a new SOTA for DINOv2-small models by outperforming all same-scale competitors, and delivers a 73\u00d7 speedup with 60% lower-dimensional features while maintaining competitive (within 2.5% average R@1 gap) against the SOTA DINOv2-large-based two-stage method. The code is available in Supplementary Material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37948", "url": null, "sourceid": 41225, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38824, "uid": "e55fc7f1333cfdc8a8ef3f972613e7ef", "name": "Illustrator\u2019s Depth: Monocular Layer Index Prediction for Image Decomposition", "authors": [{"id": 190767, "fullname": "Nissim Maruani", "url": "http://cvpr.thecvf.com/api/miniconf/users/190767?format=json", "institution": "INRIA"}, {"id": 190768, "fullname": "Peiying Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190768?format=json", "institution": "City University of Hong Kong"}, {"id": 85692, "fullname": "Siddhartha Chaudhuri", "url": "http://cvpr.thecvf.com/api/miniconf/users/85692?format=json", "institution": "Adobe Systems"}, {"id": 85655, "fullname": "Matthew Fisher", "url": "http://cvpr.thecvf.com/api/miniconf/users/85655?format=json", "institution": "Adobe Research"}, {"id": 88458, "fullname": "Nanxuan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88458?format=json", "institution": "Adobe Research"}, {"id": 86499, "fullname": "Vladimir G. Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/86499?format=json", "institution": "Adobe Systems"}, {"id": 128779, "fullname": "Pierre Alliez", "url": "http://cvpr.thecvf.com/api/miniconf/users/128779?format=json", "institution": "INRIA"}, {"id": 128776, "fullname": "Mathieu Desbrun", "url": "http://cvpr.thecvf.com/api/miniconf/users/128776?format=json", "institution": "INRIA"}, {"id": 159462, "fullname": "Wang Yifan", "url": "http://cvpr.thecvf.com/api/miniconf/users/159462?format=json", "institution": "Adobe Systems"}], "abstract": "We introduce Illustrator\u2019s Depth, a novel definition of depth that addresses a key challenge in digital content creation: decomposing flat images into editable, ordered layers. Inspired by an artist\u2019s compositional process, illustrator\u2019s depth infers a layer index to each pixel, forming an interpretable image decomposition through a discrete, globally consistent ordering of elements optimized for editability. We also propose and train a neural network using a curated dataset of layered vector graphics to predict layering directly from raster inputs. Our layer index inference unlocks a range of powerful downstream applications. In particular, it significantly outperforms state-of-the-art baselines for image vectorization while also enabling high-fidelity text-to-vector-graphics generation, automatic 3D relief generation from 2D images, and intuitive depth-aware editing. By reframing depth from a physical quantity to a creative abstraction, illustrator's depth prediction offers a new foundation for editable image decomposition.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38824", "url": null, "sourceid": 34318, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38185, "uid": "263f71c11bcabae5f862a1d8c05a8738", "name": "PGR-Net: Prior-Guided ROI Reasoning Network for Brain Tumor MRI Segmentation", "authors": [{"id": 181186, "fullname": "Jiacheng Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181186?format=json", "institution": "Capital Normal University"}, {"id": 179907, "fullname": "Hui Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/179907?format=json", "institution": "Capital Normal University"}, {"id": 189237, "fullname": "Shiyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189237?format=json", "institution": null}, {"id": 189238, "fullname": "Guoping Huo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189238?format=json", "institution": "China University of Mining Technology - Beijing"}], "abstract": "Brain tumor MRI segmentation is essential for clinical diagnosis and treatment planning, enabling accurate lesion detection and radiotherapy target delineation. However, tumor lesions occupy only a small fraction of the volumetric space, resulting in severe spatial sparsity, while existing segmentation networks often overlook clinically observed spatial priors of tumor occurrence, leading to redundant feature computation over extensive background regions. To address this issue, we propose PGR-Net (Prior-Guided Region Network)\u2014an explicit ROI-aware framework that incorporates a data-driven spatial prior set to capture the distribution and scale characteristics of tumor lesions, providing global guidance for more stable segmentation. Leveraging these priors, PGR-Net introduces a hierarchical Top-K ROI decision mechanism that progressively selects the most confident lesion candidate regions across encoder layers to improve localization precision. We further develop the WinGS-ROI (Windowed Gaussian\u2013Spatial Decay ROI) module, which uses multi-window Gaussian templates with a spatial decay function to produce center-enhanced guidance maps, thus directing feature learning throughout the network. With these ROI features, a windowed RetNet backbone is adopted to enhance localization reliability. Experiments on BraTS 2019, BraTS 2023, and MSD Task01 show that PGR-Net consistently outperforms existing approaches, achieving Dice scores of 89.02%, 91.82%, and 89.67% on the Whole Tumor (WT) region, respectively.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38185", "url": null, "sourceid": 37619, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39819, "uid": "0bf76210b3602105312e1c7f64972acf", "name": "Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World", "authors": [{"id": 157801, "fullname": "Yuzhi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157801?format=json", "institution": "Xiamen University"}, {"id": 156997, "fullname": "Kairun Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/156997?format=json", "institution": "Xiamen University"}, {"id": 192915, "fullname": "Rongxin Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192915?format=json", "institution": "Xiamen University"}, {"id": 192916, "fullname": "Dongxuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192916?format=json", "institution": "Xiamen University"}, {"id": 192917, "fullname": "Yibin Lou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192917?format=json", "institution": "Southern University of Science and Technology"}, {"id": 192918, "fullname": "Jie Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192918?format=json", "institution": "Tsinghua University"}, {"id": 192919, "fullname": "Jing Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192919?format=json", "institution": "The School of Artificial Intelligence and Computer Science"}, {"id": 186074, "fullname": "Jian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186074?format=json", "institution": "Xiamen University"}, {"id": 192920, "fullname": "Zheng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192920?format=json", "institution": "Xiamen University"}, {"id": 144220, "fullname": "yunlong lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/144220?format=json", "institution": "Xiamen university"}, {"id": 152583, "fullname": "Chenxin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152583?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 181092, "fullname": "Panwang Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181092?format=json", "institution": "ByteDance"}, {"id": 192921, "fullname": "Junbin Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192921?format=json", "institution": "University of Washington"}, {"id": 157882, "fullname": "Jingyan Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157882?format=json", "institution": "Shenzhen Technology University"}, {"id": 90595, "fullname": "Xinghao Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/90595?format=json", "institution": "Xiamen University"}, {"id": 90598, "fullname": "Yue Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90598?format=json", "institution": "Xiamen University"}, {"id": 119978, "fullname": "Zhi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/119978?format=json", "institution": "SIGS, Tsinghua University"}], "abstract": "Humans inhabit a physical $\\textbf{4D world}$, where spatial geometry and semantic content evolve over time, forming a dynamic reality. While current Multimodal Large Language Models (MLLMs) demonstrate strong capabilities in understanding static visual inputs, it remains unclear whether they can effectively $\\textbf{\"think in dynamics,\" $\\textit{i.e.}$, perceive, track, and reason}$ about spatio-temporal evolution in complex scenes.To systematically evaluate these abilities, we introduce $\\texttt{Dyn-Bench}$, a large-scale benchmark designed to assess spatio-temporal reasoning and localized dynamics perception. Constructed through multi-stage filtering over massive 2D and 4D data sources, $\\texttt{Dyn-Bench}$ provides a high-quality collection of diverse dynamic scenes, consisting of $\\textbf{1k videos}$, $\\textbf{7k visual question answering (VQA) pairs}$, and $\\textbf{3k dynamic object grounding samples}$.We comprehensively study general-purpose, spatial-aware, and region-level MLLMs to understand how they ``think in dynamics'' from both linguistic and visual perspectives. Our results reveal that existing models struggle to jointly excel in both $\\textbf{spatio-temporal reasoning}$ and $\\textbf{dynamic object grounding}$, often producing inconsistent interpretations of motion and interaction. Conventional prompting strategies $\\textit{i.e.}$, chain-of-thought or caption-based hints) provide only limited improvements.In contrast, structured integration approaches, including $\\textbf{Mask-Guided Fusion}$ and the $\\textbf{Spatio-Temporal Textual Cognitive Map (ST-TCM)}$, substantially enhance MLLMs' dynamic perception and spatio-temporal reasoning in an evolving $\\textbf{4D world}$. These findings underscore the importance of explicit spatio-temporal structural cues to bridge the gap between static perception and dynamic reasoning in MLLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39819", "url": null, "sourceid": 43330, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37439, "uid": "a528cdb26f26b20fecd0d76902990f0b", "name": "Dense Metric Depth Completion from Sparse Direct Time-of-Flight Sensors", "authors": [{"id": 128867, "fullname": "Hakyeong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/128867?format=json", "institution": "Korea Advanced Institute of Science and Technology"}, {"id": 143492, "fullname": "Ruicheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/143492?format=json", "institution": "University of Science and Technology of China"}, {"id": 187454, "fullname": "Chengtang Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187454?format=json", "institution": "Microsoft"}, {"id": 129072, "fullname": "Jiaolong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129072?format=json", "institution": "Microsoft Research"}, {"id": 76517, "fullname": "Min H. Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/76517?format=json", "institution": "KAIST"}], "abstract": "Direct Time-of-Flight (dToF) sensors provide highly accurate metric depth and are more robust than indirect ToF systems in challenging real-world conditions. However, their high manufacturing cost and limited photodiode array size produce depth maps that are extremely sparse, low-resolution, and noisy, making them unsuitable for VR/XR, robotics, and 3D perception tasks that require dense metric depth. Existing monocular and depth completion methods struggle to handle the unique sampling patterns and hardware artifacts of dToF devices, and their performance often deteriorates significantly under severe sparsity or noise. We present a generalizable framework for dense metric depth completion from sparse dToF measurements, capable of operating across diverse sensor types, sparsity levels, and noise conditions. Our model employs a depth-guided dual-branch Vision Transformer encoder that processes RGB images and sparse dToF measurements separately, while a masked joint attention module allows depth tokens to reliably guide image features without being overwritten by them. A lightweight decoder reconstructs dense metric depth efficiently, without diffusion-based or refinement-heavy post-processing. To address the scarcity of paired training data, we introduce a comprehensive dToF simulation pipeline that reproduces the characteristics of flash, sub-VGA flash, and rotating sensors, including hardware-induced degradation, irregular sparsity, and realistic noise distributions. Trained entirely on synthetic data, our model achieves strong zero-shot generalization across 6 datasets and 2 real dToF devices, outperforming state-of-the-art approaches in both accuracy and computational efficiency. This establishes a robust and practical solution for dense metric depth completion from sparse direct ToF sensors.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37439", "url": null, "sourceid": 42168, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40027, "uid": "912527e69aef16ac97324684dda53015", "name": "DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance", "authors": [{"id": 190768, "fullname": "Peiying Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190768?format=json", "institution": "City University of Hong Kong"}, {"id": 88458, "fullname": "Nanxuan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88458?format=json", "institution": "Adobe Research"}, {"id": 85655, "fullname": "Matthew Fisher", "url": "http://cvpr.thecvf.com/api/miniconf/users/85655?format=json", "institution": "Adobe Research"}, {"id": 180251, "fullname": "Yiran Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180251?format=json", "institution": "Adobe Research"}, {"id": 86410, "fullname": "Jing Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86410?format=json", "institution": "City University of Hong Kong"}, {"id": 106443, "fullname": "Difan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/106443?format=json", "institution": "Adobe Research"}], "abstract": "Recent vision\u2013language model (VLM)\u2013based approaches have achieved impressive results on SVG generation. However, because they generate only text and lack visual signals during decoding, they often struggle with complex semantics and fail to produce visually appealing or geometrically coherent SVGs. We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. DuetSVG is trained on both image and SVG datasets. At inference, we apply a novel test-time scaling strategy that leverages the model\u2019s native visual predictions as guidance to improve SVG decoding quality. Extensive experiments show that our method outperforms existing methods, producing visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40027", "url": null, "sourceid": 37351, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38352, "uid": "c590b8e5b4f1a5fe839466462e187182", "name": "CIGMA: Causal Information-Gain Mechanistic Attribution of Attention Heads in Vision Transformers", "authors": [{"id": 189470, "fullname": "Maisha Maliha", "url": "http://cvpr.thecvf.com/api/miniconf/users/189470?format=json", "institution": "University of Oklahoma"}, {"id": 189679, "fullname": "Dean F. Hougen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189679?format=json", "institution": "University of Oklahoma"}], "abstract": "Vision Transformers often rely on spurious background correlations rather than foreground object features. While prior model pruning approaches focus solely on improving accuracy, they lack interpretability and fail to verify whether predictions are actually made by focusing on the main foreground object, providing no causal validation of which components drive spurious behavior.  We introduce CIGMA (Causal Information Gain Mechanistic Attribution), a general framework for explaining the internal computation of Vision Transformers. CIGMA provides a mechanistic, information theoretic explanation by quantifying the importance of each attention head and determining whether it supports the main object or routes spurious background cues. It ranks attention heads by measuring object versus context reliance with Jensen Shannon based information gain computed from the model's full predictive distributions after two complementary edits, removing the object region and removing the surrounding context, which reveals a spurious subnet that carries background signals and a complementary set of evidence aligned heads. Evaluated on CIFAR-10, CIFAR-100, and Tiny-ImageNet across three VLM architectures (InternVL2-26B, LLaVA-1.6, LLaVA-1.5-13B), CIGMA improves accuracy by 7.6-24.8 percentage points over unmodified models while reducing background reliance by 79.5-88.1\\%, substantially outperforming all baselines, demonstrating that causal head-level interventions enable more effective spurious correlation mitigation than token pruning or retraining approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38352", "url": null, "sourceid": 44433, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37661, "uid": "fe443b928b6126aa5df1d7ef91bd091b", "name": "SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model", "authors": [{"id": 168581, "fullname": "Yukai Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/168581?format=json", "institution": "Tsinghua University"}, {"id": 73844, "fullname": "Weiyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/73844?format=json", "institution": "Shandong University"}, {"id": 187970, "fullname": "Zihao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187970?format=json", "institution": "Lightillusions"}, {"id": 87606, "fullname": "Hongyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/87606?format=json", "institution": "South China University of Technology"}, {"id": 91696, "fullname": "Xingyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/91696?format=json", "institution": "Xiaobing.AI"}, {"id": 126273, "fullname": "Ping Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/126273?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 84971, "fullname": "Lei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84971?format=json", "institution": "International Digital Economy Academy (IDEA)"}], "abstract": "We propose a decoupled 3D scene generation framework called SceneMaker in this work. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion and open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation, and enhance it by leveraging image datasets and collected de-occlusion datasets for much more diverse open-set occlusion patterns. Then, we propose a unified pose estimation model that integrates global and local mechanisms for both self-attention and cross-attention to improve accuracy. Besides, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor and open-set scenes. Our codes and datasets will be released if the paper is accepted.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37661", "url": null, "sourceid": 34977, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36460, "uid": "00cada81b307d5db4a1920b952e2137e", "name": "ULF-Loc: Unbiased Landmark Feature for Robust Visual Localization with 3D Gaussian Splatting", "authors": [{"id": 180441, "fullname": "Yingdong Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180441?format=json", "institution": "\u6b66\u6c49\u5927\u5b66"}, {"id": 159219, "fullname": "Shaocheng Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/159219?format=json", "institution": "Wuhan University"}, {"id": 70579, "fullname": "Zhenjun Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/70579?format=json", "institution": "University of Zaragoza"}, {"id": 185106, "fullname": "Yuan Kou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185106?format=json", "institution": null}, {"id": 185107, "fullname": "Jianxin Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185107?format=json", "institution": ""}, {"id": 155281, "fullname": "Pengcheng Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/155281?format=json", "institution": "Wuhan University"}, {"id": 155282, "fullname": "Jiayuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155282?format=json", "institution": "Wuhan University"}], "abstract": "Visual localization is a core technology for augmented reality and autonomous navigation. Recent methods combine the efficient rendering of 3D Gaussian Splatting (3DGS) with feature-based localization. These methods rely on direct matching between 2D query features and the 3D Gaussian feature field, but this often results in mismatches due to an inherent bias in the learned Gaussian feature. We theoretically analyze the feature learning process in 3DGS, revealing that the widely adopted $\\alpha$-blending optimization inherently introduces bias into 3D point features. This bias stems from the entanglement between individual Gaussians and their neighboring Gaussians, making the learned features unsuitable for precise matching tasks. Motivated by these findings, we propose ULF-Loc, an unbiased landmark feature framework that replaces biased feature optimization with geometry-weighted feature fusion. We further introduce keypoint-consensus landmark sampling to select reliable Gaussians and local geometric consistency verification to reject mismatches caused by rendering artifacts. On the Cambridge Landmarks dataset, ULF-Loc reduces the mean median translation error by 17\\% compared to the state-of-the-art, while achieving superior efficiency with only 1/10 the training time and 1/6 the GPU memory of STDLoc.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36460", "url": null, "sourceid": 33598, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38574, "uid": "5e8c84768da5bcb14f66d92aabe75092", "name": "LNEM: Lunar Neural Elevation Model", "authors": [{"id": 184105, "fullname": "SUWAN LEE", "url": "http://cvpr.thecvf.com/api/miniconf/users/184105?format=json", "institution": "KENTECH"}, {"id": 190178, "fullname": "Jo Ryeong Yim", "url": "http://cvpr.thecvf.com/api/miniconf/users/190178?format=json", "institution": "Korea Aerospace Research Institute"}, {"id": 190179, "fullname": "Kibaek Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/190179?format=json", "institution": "Korean Institute of Energy Technology"}, {"id": 190180, "fullname": "Dong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/190180?format=json", "institution": null}, {"id": 190181, "fullname": "Eunhyeuk Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/190181?format=json", "institution": "Korea Aerospace Research Institute"}, {"id": 176148, "fullname": "Minsup Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/176148?format=json", "institution": "Korea Astronomy and Space Science Institute"}, {"id": 190182, "fullname": "Chae Sim", "url": "http://cvpr.thecvf.com/api/miniconf/users/190182?format=json", "institution": "Korea Astronomy and SpaceScience  Institute"}, {"id": 183414, "fullname": "Seokju Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/183414?format=json", "institution": "Korea Institute of Energy Technology (KENTECH)"}], "abstract": "High-resolution and high-precision digital elevation models (DEMs) of the lunar surface are essential for landing site selection and lunar geological research. However, traditional stereo matching techniques provide limited representation of 3D scene, struggling with non-textured regions and extreme light variations. Furthermore, recent lunar neural rendering methods are ill-suited for 3D reconstruction due to their reliance on simple pinhole approximations for pushbroom sensors. These challenges are compounded by inconsistencies introduced during satellite image processing, including geometric misalignment, distributional bias, and labor-intensive handcrafted operations. To address these issues, we introduce the Lunar Neural Elevation Model (LNEM), a volumetric reconstruction method that explicitly incorporates the pushbroom imaging model. A core component of our approach is the Lunar Studio dataset, processed using Rigorous Sensor Models (RSMs) to ensure geometric consistency of multi-orbit Lunar Reconnaissance Orbiter Camera (LROC) Narrow Angle Camera (NAC) and Korea Pathfinder Lunar Orbiter (KPLO) Lunar Terrain Imager (LUTI) images. LNEM integrates this RSM-based pushbroom camera formulation with learned shadow modeling, enabling physically grounded volumetric rendering under challenging lunar illumination. Extensive experiments demonstrate that LNEM achieves geometrically consistent reconstruction and cross-sensor generalization under diverse viewing and lighting conditions, providing a scalable and physically informed alternative to conventional DEM pipelines. To facilitate reproducibility and future lunar research, we release Lunar Studio, its multi-orbit dataset, and the LNEM reconstruction pipeline.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38574", "url": null, "sourceid": 43214, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39807, "uid": "daf9cfcd0f0bb2f78d944a219243f228", "name": "Closed-Form Concept Erasure via Double Projections", "authors": [{"id": 192899, "fullname": "CHI ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/192899?format=json", "institution": "National University of Singapore"}, {"id": 192900, "fullname": "Jingpu Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192900?format=json", "institution": "national university of singaore, National University of Singapore"}, {"id": 192901, "fullname": "Zhixian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192901?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 129361, "fullname": "Ping Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129361?format=json", "institution": "Institute of High Performance Computing, Singapore, A*STAR"}], "abstract": "Modern generative models, including diffusion-based architectures, produce highly creative outputs but also pose safety and ethical risks. These concerns have led to growing interest in concept erasure, the process of removing unwanted concepts from model representations. Existing approaches often achieve strong erasure performance but rely on iterative optimization and may inadvertently distort unrelated concepts. In this work, we present a simple yet principled alternative: a linear transformation framework that achieves concept erasure analytically, without any training. Our method adapts a pretrained model through two sequential, closed-form steps: first, computing a proxy projection of the target concept, and second, applying a constrained transformation within the left null space of known concept directions. This design yields a deterministic and geometrically interpretable procedure for safe, efficient, and theory-grounded concept removal. Across a wide range of experiments, including object and style erasure on multiple Stable Diffusion variants and the flow-matching model (FLUX), our approach matches or surpasses the performance of state-of-the-art methods while preserving non-target concepts more faithfully. Requiring only a few seconds to apply, it offers a lightweight and drop-in tool for controlled model editing, advancing the goal of safer and more responsible generative models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39807", "url": null, "sourceid": 36705, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39967, "uid": "bfc73dd2bd4da1dda729c02c9cfa4dea", "name": "Low-Resolution Editing is All You Need for High-Resolution Editing", "authors": [{"id": 193213, "fullname": "Junsung Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/193213?format=json", "institution": "Seoul National University"}, {"id": 106137, "fullname": "Hyunsoo Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/106137?format=json", "institution": "Seoul National University"}, {"id": 89480, "fullname": "Yong Jae Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/89480?format=json", "institution": "Professor, UW-Madison and Research Scientist, Adobe"}, {"id": 75881, "fullname": "Bohyung Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/75881?format=json", "institution": "Seoul National University"}], "abstract": "High-resolution content creation is rapidly emerging as a central challenge in both the vision and graphics communities. While images serve as the most fundamental modality for visual expression, content generation that aligns with the user intent requires effective, controllable high-resolution image manipulation mechanisms. However, existing approaches remain limited to low-resolution settings, typically supporting only up to 1K resolution. In this work, we introduce the task of high-resolution image editing and propose a test-time optimization framework to address it. Our method performs patch-wise optimization on high-resolution source images, followed by a fine-grained detail transfer module and a novel synchronization strategy to maintain consistency across patches. Extensive experiments show that our method produces high-quality edits, facilitating the way toward high-resolution content creation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39967", "url": null, "sourceid": 32480, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37339, "uid": "4f367a687694d20193a6f04a8d3a2117", "name": "Generalizable Radio-Frequency Radiance Fields for Spatial Spectrum Synthesis", "authors": [{"id": 187198, "fullname": "Kang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187198?format=json", "institution": "University of California, Los Angeles"}, {"id": 187199, "fullname": "Yuning Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187199?format=json", "institution": "Meta"}, {"id": 183831, "fullname": "Wan Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/183831?format=json", "institution": "University of California, Merced"}], "abstract": "We present GRaF, Generalizable Radio-Frequency (RF) Radiance Fields, a framework that models RF signal propagation to synthesize spatial spectra at arbitrary transmitter or receiver locations, where each spectrum measures signal power across all surrounding directions at the receiver. Unlike state-of-the-art methods that adapt vanilla Neural Radiance Fields (NeRF) to the RF domain with scene-specific training, GRaF generalizes across scenes to synthesize spectra. To enable this, we prove an interpolation theory in the RF domain: the spatial spectrum from a transmitter can be approximated using spectra from geographically proximate transmitters. Building on this theory, GRaF comprises two components: (i) a geometry-aware Transformer encoder that captures spatial correlations from neighboring transmitters to learn a scene-independent latent RF radiance field, and (ii) a neural ray tracing algorithm that estimates spectrum reception at the receiver. Experimental results demonstrate that GRaF outperforms existing methods on single-scene benchmarks and achieves state-of-the-art performance on unseen scene layouts.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37339", "url": "https://kangyangg.com/projects/graf/", "sourceid": 34701, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65699, "file": "/media/PosterPDFs/CVPR%202026/37339-thumb.png", "modified": "2026-04-19T17:55:08.529694-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": false, "sortkey": 0, "is_live_content": false, "detailed_kind": "thumb", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65710, "file": "/media/PosterPDFs/CVPR%202026/37339.png", "modified": "2026-04-19T19:00:25.975158-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": true, "sortkey": 0, "is_live_content": false, "detailed_kind": "", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65711, "modified": "2026-04-19T20:20:30.585413-07:00", "display_section": 1, "type": "PDF", "name": "Slides", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "/media/cvpr-2026/Slides/37339.pdf", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39663, "uid": "670f33f3cfb5217bcf008786165f1dc7", "name": "TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning", "authors": [{"id": 76755, "fullname": "Tao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76755?format=json", "institution": "Nanjing University"}, {"id": 192591, "fullname": "Li Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192591?format=json", "institution": null}, {"id": 192592, "fullname": "Gen Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192592?format=json", "institution": "ByteDance Inc."}, {"id": 131821, "fullname": "Yabin ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/131821?format=json", "institution": "Bytedance"}, {"id": 192593, "fullname": "Yiting Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192593?format=json", "institution": "Bytedance"}, {"id": 130240, "fullname": "Junlin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/130240?format=json", "institution": "ByteDance Inc."}, {"id": 192594, "fullname": "Deliang Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192594?format=json", "institution": "ByteDance Inc."}, {"id": 130296, "fullname": "Li zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130296?format=json", "institution": "Bytedance Inc."}, {"id": 86063, "fullname": "Limin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86063?format=json", "institution": "Nanjing University"}], "abstract": "Enhancing the temporal understanding of Multimodal Large Language Models (MLLMs) is essential for advancing long-form video analysis, enabling tasks such as temporal localization, action detection, and time-sensitive question answering. While reinforcement learning (RL) has recently been explored for improving temporal reasoning, existing approaches are often confined to limited task types and data, restricting their generalization across diverse temporal understanding scenarios. To address this challenge, we present TempR1, a temporal-aware multi-task reinforcement learning framework that systematically strengthens MLLMs\u2019 temporal comprehension. We curate a multi-task corpus that exposes the model to diverse temporal structures and semantics, and build upon the Group Relative Policy Optimization (GRPO) algorithm to achieve stable and effective cross-task optimization. Specifically, we categorize temporal tasks into three correspondence types between predicted intervals and ground-truth instances, and design tailored localization rewards for each, enabling TempR1 to capture fine-grained temporal dependencies and adapt to different temporal patterns. Extensive experiments demonstrate that TempR1 attains state-of-the-art performance across multiple benchmarks. Moreover, its joint optimization over complementary tasks yields a strong synergistic effect, enhancing both generalization and single-task performance, establishing a scalable and principled paradigm for temporal reasoning in MLLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39663", "url": null, "sourceid": 34322, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38117, "uid": "8dfb0f5e389f9d19c436d83da7073879", "name": "Spatio-Temporal Conditional Denoising Transformer for Modality-Missing RGBT Tracking", "authors": [{"id": 189089, "fullname": "Andong Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189089?format=json", "institution": "Anhui University"}, {"id": 189090, "fullname": "Ziyi Zha", "url": "http://cvpr.thecvf.com/api/miniconf/users/189090?format=json", "institution": "Anhui University"}, {"id": 189091, "fullname": "Jiandong Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189091?format=json", "institution": "Anhui University"}, {"id": 181116, "fullname": "Shihao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181116?format=json", "institution": "Anhui University"}, {"id": 183391, "fullname": "Chenglong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183391?format=json", "institution": "Anhui University"}, {"id": 126842, "fullname": "Jin Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126842?format=json", "institution": "Anhui University"}, {"id": 188997, "fullname": "Bin Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188997?format=json", "institution": "Anhui University"}], "abstract": "Missing modalities in RGBT tracking often lead to incomplete and unstable multimodal feature representations that greatly degrade the performance. Existing methods typically attempt to recover missing modalities from available ones, but the quality of data generated in challenging scenarios might be unsatisfactory. In addition, current approaches exhibit limited flexibility in processing both missing and complete data. To overcome these limitations, we propose a Spatio-temporal Conditional Denoising Transformer (SCDT), which integrates the spatial cues and the temporal context to adaptively perform information reconstruction of missing modalities and feature enhancement of weak modalities in a unified framework, for robust modality-missing RGBT tracking. In particular, SCDT leverages the short-term temporal cues from recent historical frames to capture the fine-grained temporal correlations and the long-term temporal cues encoding modality evolution to capture the global context. By jointly exploiting long short-term temporal contexts as the conditions, SCDT progressively guides noisy features of  available modalities to learn reliable and temporally consistent multimodal representations. Furthermore, SCDT introduces a noise-modulated adaptation mechanism that dynamically adjusts its behavior according to the modal availability, enabling a single framework to unify feature learning under both modality-missing and complete scenarios without changing the architecture or parameters. Extensive experiments on three public benchmark datasets demonstrate that our method consistently outperforms state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38117", "url": null, "sourceid": 34375, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38417, "uid": "7dccffdd09576456eadf2acaabc8122d", "name": "Progressive Multi-cue Alignment for Unaligned RGBT Tracking", "authors": [{"id": 189091, "fullname": "Jiandong Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189091?format=json", "institution": "Anhui University"}, {"id": 183391, "fullname": "Chenglong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183391?format=json", "institution": "Anhui University"}, {"id": 189830, "fullname": "Hao Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189830?format=json", "institution": "Anhui University"}, {"id": 189089, "fullname": "Andong Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189089?format=json", "institution": "Anhui University"}, {"id": 178236, "fullname": "Lili Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/178236?format=json", "institution": "School of Computer Science and Technology, Anhui University"}, {"id": 126842, "fullname": "Jin Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126842?format=json", "institution": "Anhui University"}], "abstract": "Unaligned RGBT tracking aims to achieve robust target localization across spatially misaligned RGB and thermal infrared (TIR) videos, a crucial challenge for applying RGBT tracking in real-world scenarios.Existing methods often calculate all cross-modal alignment parameters (i.e., spatial shift and scale change) simultaneously, but suffer from two major limitations. 1) They are difficult to adapt to different degrees of unaligned difficulty during tracking. 2) They usually require complex models to handle challenging scenarios, resulting in a large computational burden.To overcome these limitations, we propose a novel Progressive Multi-cue Alignment framework called PMATrack, which disentangles the calculation of cross-modal alignment parameters in a progressive manner and dynamically selects appropriate cues to handle different challenges, thereby enabling robust and efficient unaligned RGBT tracking. In particular, PMATrack divides the cross-modal alignment parameter estimation into three stages to progressively perform center offset computation, scale transformation estimation, and global refinement. At each stage, we design a difficulty-aware router to adaptively select the appropriate alignment expert based on the cross-modal alignment complexity, thereby reducing computational redundancy.In addition, we build a high-quality video benchmark called MUART244 to facilitate the comprehensive evaluation of different unaligned RGBT tracking algorithms. Extensive experiments on our MUART244 and public LasHeR-Unaligned datasets demonstrate the outstanding performance of PMATrack against existing state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38417", "url": null, "sourceid": 42608, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36978, "uid": "d722e24995a776f6fef5c1654eb42596", "name": "PrivSynth: Alternating and Control-Based Optimization for Privacy and Utility in Synthetic Data", "authors": [{"id": 182191, "fullname": "Xinyuan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/182191?format=json", "institution": "Shenzhen International Graduate School, Tsinghua University"}, {"id": 154526, "fullname": "Hanlin Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154526?format=json", "institution": "webank"}, {"id": 186368, "fullname": "Guibao Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/186368?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 151959, "fullname": "Gongxi Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/151959?format=json", "institution": "Tsinghua University"}, {"id": 186369, "fullname": "Yifei Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186369?format=json", "institution": "Shandong University"}, {"id": 154528, "fullname": "Lixin Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/154528?format=json", "institution": "WeBank"}, {"id": 154529, "fullname": "Yuxing Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/154529?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "As publicly available data dwindles, synthetic data generation (SDG) has become a practical solution for privacy-preserving data sharing. By training generative models on private data, SDG creates samples that retain task-relevant features while obfuscating sensitive content. However, recent work shows that synthetic data can still leak private information via membership inference and reconstruction attacks. Existing defenses often degrade downstream utility. To address the privacy-utility trade-off, we formulate SDG as a bi-objective optimization problem. Yet, intractable gradients and expensive subset evaluation pose major challenges. We address this via alternate optimization over the generative model and data selection parameter, and further recast the selection step as a discrete-time optimal control problem, solved using Pontryagin\u2019s Maximum Principle. We propose PrivSynth, a framework that quantifies multiple privacy risks and integrates it into the control objective. Theoretical analysis guarantees convergence, and experiments on benchmark and medical datasets show that PrivSynth achieves better utility and stronger privacy protection than state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36978", "url": null, "sourceid": 33852, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39065, "uid": "fc3a68ae910a64c3c1a7d8ea0c525b9d", "name": "MEMO: Human-like Crisp Edge Detection Using Masked Edge Prediction", "authors": [{"id": 190923, "fullname": "Jiaxin Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190923?format=json", "institution": "University of Macau"}, {"id": 191287, "fullname": "Yue Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191287?format=json", "institution": "Amazon"}, {"id": 86115, "fullname": "Yicong Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/86115?format=json", "institution": "University of Macau"}], "abstract": "Learning-based edge detection models trained with cross-entropy loss often suffer from thick edge predictions, which deviate from the crisp, single-pixel annotations typically provided by humans. While previous approaches to achieving crisp edges have focused on designing specialized loss functions or modifying network architectures, we show that a carefully designed training and inference strategy alone is sufficient to achieve human-like edge quality. In this work, we introduce the Masked Edge Prediction MOdel (MEMO), which produces both accurate and crisp edges using only cross-entropy loss. We first construct a large-scale synthetic edge dataset to pre-train MEMO, enhancing its generalization ability. Subsequent fine-tuning on downstream datasets requires only a lightweight module comprising 1.2\\% additional parameters. During training, MEMO learns to predict edges under varying ratios of input masking. A key insight guiding our inference is that thick edge predictions typically exhibit a confidence gradient: high in the center and lower toward the boundaries. Leveraging this, we propose a novel progressive prediction strategy that sequentially finalizes edge predictions in order of prediction confidence, resulting in thinner and more precise contours. Our method achieves visually appealing, post-processing-free, human-like edge maps and outperforms prior methods on crispness-aware evaluations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39065", "url": null, "sourceid": 46213, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36787, "uid": "1852e1d1aba3354f8ceef1b48261bf5e", "name": "Rethinking Knowledge Transfer in Image Quality Assessment: A Perceptual Preference Structure Alignment Perspective", "authors": [{"id": 185871, "fullname": "Aobo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185871?format=json", "institution": "Xidian University"}, {"id": 90655, "fullname": "Jinjian Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90655?format=json", "institution": "Xidian University"}, {"id": 132011, "fullname": "Yongxu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/132011?format=json", "institution": "Xidian University"}, {"id": 185872, "fullname": "Jupo Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/185872?format=json", "institution": "Xidian University"}, {"id": 89279, "fullname": "Weisheng Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/89279?format=json", "institution": "Xidian University"}], "abstract": "As imaging scenarios diversify rapidly, Image Quality Assessment (IQA) faces a key challenge: how to effectively transfer perceptual knowledge from existing annotated datasets to ensure reliable quality prediction in new scenarios. However, current IQA models struggle to generalize. Direct transfer often leads to severe performance degradation, while multi-dataset joint training rarely yields stable gains and can even harm target performance. We identify the root cause as inconsistent perceptual preference structures across datasets, where models trained on different sources rely on distinct perceptual cues, leading to mismatched conditional distributions $P(Y|X)$ that fundamentally limit transferability.To address this, we propose Perceptual Preference Representation (PPR), which quantifies dataset-specific perceptual preference structures by analyzing correlations between visual features and quality scores. PPR enables training-free assessment of cross-dataset perceptual preference consistency, offering a systematic and interpretable way to analyze transferability.Building on this, we develop Preference-Structure-Aligned Transfer (PreSTA), which iteratively selects samples whose perceptual preferences align with the target domain. Across both cross- and within-domain scenarios, PreSTA achieves superior transfer performance with only a small fraction of data. In the targeted joint transfer setting, PreSTA consistently attains better performance with only a limited portion of the combined data. These results demonstrate that aligning perceptual preference structures, rather than simply increasing dataset size, is the key to effective knowledge transfer in IQA.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36787", "url": null, "sourceid": 46752, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36385, "uid": "cf01a4dac6d72a27e17f23c93ab036d6", "name": "Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images", "authors": [{"id": 180654, "fullname": "Hongyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180654?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 184923, "fullname": "Bochao Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184923?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 155519, "fullname": "Qiankun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155519?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 175911, "fullname": "Haochen Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175911?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 184924, "fullname": "Qi Mei", "url": "http://cvpr.thecvf.com/api/miniconf/users/184924?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 184925, "fullname": "Jianfei Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184925?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 153214, "fullname": "Chen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153214?format=json", "institution": "Li Auto Inc."}, {"id": 184926, "fullname": "Cheng Bi", "url": "http://cvpr.thecvf.com/api/miniconf/users/184926?format=json", "institution": "Li Auto Inc."}, {"id": 184927, "fullname": "Zhao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184927?format=json", "institution": "Li Auto Inc."}, {"id": 150336, "fullname": "Xueyang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150336?format=json", "institution": "Li Auto Inc."}, {"id": 153217, "fullname": "Yifei Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153217?format=json", "institution": "Li Auto Inc."}, {"id": 157057, "fullname": "Jiansheng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/157057?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 157058, "fullname": "Huimin Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/157058?format=json", "institution": "University of Science and Technology Beijing"}], "abstract": "Creating realistic and simulation-ready 3D assets is crucial for autonomous driving research and virtual environment construction. However, existing 3D vehicle generation methods are often trained on synthetic data with significant domain gaps from real-world distributions. The generated models often exhibit arbitrary poses and undefined scales, resulting in poor visual consistency when integrated into driving scenes. In this paper, we present Unposed-to-3D, a novel framework that learns to reconstruct 3D vehicles from real-world driving images using image-only supervision. Our approach consists of two stages. In the first stage, we train a image-to-3D reconstruction network using posed images with known camera parameters. In the second stage, we remove camera supervision and use a camera prediction head that directly estimates the camera parameter from unposed images. The predicted pose is then used for differentiable rendering to provide self-supervised photometric feedback, enabling the model to learn 3D geometry purely from unposed data. To ensure simulation readiness, we further introduce a scale-aware module to predict real-world size information, and a harmonization module that adapts the generated vehicles to the target driving scene with consistent lighting and appearance. Extensive experiments demonstrate that Unposed-to-3D effectively reconstructs realistic, pose-consistent, and harmonized 3D vehicle models from real-world images, providing a scalable path toward creating high-quality assets for driving scene simulation and digital twin environments.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36385", "url": null, "sourceid": 37754, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37992, "uid": "123c12d84c20dd2666df87c9522555d0", "name": "ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation", "authors": [{"id": 180561, "fullname": "Simon Boeder", "url": "http://cvpr.thecvf.com/api/miniconf/users/180561?format=json", "institution": "Bosch"}, {"id": 188770, "fullname": "Fabian Gigengack", "url": "http://cvpr.thecvf.com/api/miniconf/users/188770?format=json", "institution": "Robert Bosch GmbH, Bosch"}, {"id": 188771, "fullname": "Simon Roesler", "url": "http://cvpr.thecvf.com/api/miniconf/users/188771?format=json", "institution": "Robert Bosch GmbH, Bosch"}, {"id": 75575, "fullname": "Holger Caesar", "url": "http://cvpr.thecvf.com/api/miniconf/users/75575?format=json", "institution": "TU Delft"}, {"id": 188772, "fullname": "Benjamin Risse", "url": "http://cvpr.thecvf.com/api/miniconf/users/188772?format=json", "institution": "University of M\u00fcnster"}], "abstract": "Recent progress in self- and weakly supervised occupancy estimation has largely relied on 2D projection or rendering-based supervision, which suffers from geometric inconsistencies and severe depth bleeding.We thus introduce ShelfOcc, a vision-only method that overcomes these limitations without relying on LiDAR.ShelfOcc brings supervision into native 3D space by generating metrically consistent semantic voxel labels from video, enabling true 3D supervision without any additional sensors or manual 3D annotations.While recent vision-based 3D geometry foundation models provide a promising source of prior knowledge, they do not work out of the box as a prediction due to sparse or noisy and inconsistent geometry, especially in dynamic driving scenes.Our method introduces a dedicated framework that mitigates these issues by filtering and accumulating static geometry consistently across frames, handling dynamic content and propagating semantic information into a stable voxel representation.This data-centric shift in supervision for weakly/shelf-supervised occupancy estimation allows the use of essentially any SOTA occupancy model architecture without relying on LiDAR data.We argue that such high-quality supervision is essential for robust occupancy learning and constitutes an important complementary avenue to architectural innovation.On the Occ3D-nuScenes benchmark, ShelfOcc substantially outperforms all previous weakly/shelf-supervised methods (up to a 34\\% relative improvement), establishing a new data-driven direction for LiDAR-free 3D scene understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37992", "url": null, "sourceid": 35150, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39273, "uid": "870327f92efd3be6f7be7279c858eb0d", "name": "CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning", "authors": [{"id": 191739, "fullname": "Junyoung Sung", "url": "http://cvpr.thecvf.com/api/miniconf/users/191739?format=json", "institution": "Korea University"}, {"id": 191740, "fullname": "Seungwoo Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191740?format=json", "institution": "Korea University"}, {"id": 191741, "fullname": "Minjun Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/191741?format=json", "institution": "Korea University"}, {"id": 191742, "fullname": "Sumin An", "url": "http://cvpr.thecvf.com/api/miniconf/users/191742?format=json", "institution": "Korea University"}, {"id": 75518, "fullname": "Arsha Nagrani", "url": "http://cvpr.thecvf.com/api/miniconf/users/75518?format=json", "institution": "Google "}, {"id": 128054, "fullname": "Paul Hongsuck Seo", "url": "http://cvpr.thecvf.com/api/miniconf/users/128054?format=json", "institution": "Korea University"}], "abstract": "Real-world reasoning often requires combining information across modalities, connecting textual context with visual cues in a multi-hop process. Yet, most multimodal benchmarks fail to capture this ability: they typically rely on single images or set of images, where answers can be inferred from a single modality alone. This limitation is mirrored in the training data, where interleaved image\u2013text content rarely enforces complementary, multi-hop reasoning. As a result, Vision-Language Models (VLMs) frequently hallucinate and produce reasoning traces poorly grounded in visual evidence. To address this gap, we introduce CRIT, a new dataset and benchmark built with a graph-based automatic pipeline for generating complex cross-modal reasoning tasks. CRIT consists of diverse domains ranging from natural images, videos, and text-rich sources, and includes a manually verified test set for reliable evaluation. Experiments on this benchmark reveal that even state-of-the-art models struggle on such reasoning tasks. Models trained on CRIT show significant gains in cross-modal multi-hop reasoning, including strong improvements on SPIQA and other multi-image benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39273", "url": null, "sourceid": 45756, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38953, "uid": "8dffe5b204ef21a11e551adb4d2e9e21", "name": "Real-Time Multimodal Fingertip Contact Detection via Depth and Motion Fusion for Vision-Based Human\u2013Computer Interaction", "authors": [{"id": 177372, "fullname": "Mukhiddin Toshpulatov", "url": "http://cvpr.thecvf.com/api/miniconf/users/177372?format=json", "institution": "INHA UNIVERSITY"}, {"id": 191051, "fullname": "Wookey Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/191051?format=json", "institution": "Inha University"}, {"id": 191052, "fullname": "Suan Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/191052?format=json", "institution": "Semyung University"}, {"id": 191053, "fullname": "Geehyuk Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/191053?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}], "abstract": "Precise fingertip contact detection is a fundamental challenge for natural and immersive virtual reality (VR)interaction. However, existing vision-based methods suffer from insufficient accuracy, with typical depth errors(12-25mm) being too large to reliably distinguish between hovering and true contact $(<3mm)$. While commercial motioncapture systems provide sub-millimeter accuracy, their prohibitive cost limits widespread adoption. This paperaddresses this critical gap by developing a highly accurate and cost-effective system for fingertip contact detection.We introduce a novel, specialized dataset of 53,300 RGB-depth pairs capturing millimeter-scale, hand-table typinginteractions. By systematically fine-tuning six state-of-the-art depth estimation architectures on this dataset, wereduce the mean absolute error (MAE) by 68\\%, from 12.3mm to a state-of-the-art 3.8mm. Our complete VR keyboard system,TapBoard-X, achieves 95.96\\% contact detection accuracy and enables typing speeds of 45.6 WPM with a low 3.1% charactererror rate, rivaling physical keyboards. This performance is achieved at over a 90\\% cost reduction compared tocommercial systems, democratizing high-precision hand tracking for the broader research community and paving the wayfor the next generation of tactile VR experiences.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38953", "url": null, "sourceid": 35541, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36976, "uid": "8aa25d565a8891e68d123138a137622b", "name": "Haptic Neural Fields: Bringing Tactile Interactions to 3D Rendered Scenes", "authors": [{"id": 162594, "fullname": "Antonio Luigi Stefani", "url": "http://cvpr.thecvf.com/api/miniconf/users/162594?format=json", "institution": "University of Trento"}, {"id": 162591, "fullname": "Niccol\u00f2 Bisagno", "url": "http://cvpr.thecvf.com/api/miniconf/users/162591?format=json", "institution": "University of Trento"}, {"id": 163032, "fullname": "Nicola Conci", "url": "http://cvpr.thecvf.com/api/miniconf/users/163032?format=json", "institution": "University of Trento"}, {"id": 186363, "fullname": "Eckehard Steinbach", "url": "http://cvpr.thecvf.com/api/miniconf/users/186363?format=json", "institution": "Technical University of Munich"}, {"id": 166786, "fullname": "Francesco De Natale", "url": "http://cvpr.thecvf.com/api/miniconf/users/166786?format=json", "institution": "Universit\u00e0 di Trento"}], "abstract": "We address the problem of making 3D scenes interactive by asking: what would objects feel like if touched in a virtual environment? State-of-the-art 3D rendering methods provide compelling visual realism, but they fall short in modeling physical interactions, such as haptic feedback. We propose a framework that learns the correspondence between user actions and tactile responses, enabling the generation of touch-based signals directly from simulated interactions in 3D scenes. Our approach leverages a neural field representation conditioned on geometry and action to synthesize material-specific tactile signals. Experiments show that the generated signals reliably convey material properties and interaction dynamics. This paves the way toward interactive, touch-aware virtual environments with realistic haptic feedback.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36976", "url": null, "sourceid": 44046, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36629, "uid": "84adc5c5be445d7eae80dcf4c87a6428", "name": "PureProof: Diffusion-Resistant Black-box Targeted Attack on Large Vision-Language Models", "authors": [{"id": 180615, "fullname": "Yiming CAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/180615?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 158841, "fullname": "Dong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158841?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 143875, "fullname": "Xinqi Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/143875?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 158842, "fullname": "Bin Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/158842?format=json", "institution": "The Hong Kong Polytechnic University"}], "abstract": "Large Vision-Language Models (VLMs) exhibit impressive multimodal capabilities and widespread deployment, yet remain vulnerable to targeted adversarial attacks. However, the practical robustness of such attacks often remains unclear with limited evaluation under defenses. Diffusion-based purification (DBP), a widely adopted black-box defense for VLMs, effectively blocks current attacks by removing adversarial perturbations via generative diffusion. Prior DBP evasion methods are designed for white-box image classifiers and are ill-suited for attacking VLMs. Even when adapted, they face high computational costs and potential vanishing/exploding gradient from backpropagating through deep diffusion steps and gradient instability due to diffusion\u2019s stochasticity. To address these challenges, we present PureProof, a black-box targeted attack on VLMs resilient to DBP. PureProof introduces Stochastic Reverse Alignment, using a single-step reverse prediction to efficiently guide adversarial optimization while avoiding costly and unstable full-trajectory backpropagation. To mitigate diffusion stochasticity,  we employ Adaptive Re-noising Augmentation, which re-noises intermediate predictions in a timestep-adaptive manner to smooth the optimization landscape, complemented by Self-Consistency Regularization to promote local temporal coherence.  Extensive experiments on open-source and commercial VLMs show that PureProof consistently outperforms prior attacks against DBP, achieves strong noise resilience, and remains highly effective without defenses, revealing critical vulnerabilities of VLMs and offering implications for future model safety.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36629", "url": null, "sourceid": 44843, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37033, "uid": "7dd63bc018cc1cc492e7d87ac6c3e465", "name": "ImmerIris: A Large-Scale Dataset and Benchmark for Off-Axis and Unconstrained Iris Recognition in Immersive Applications", "authors": [{"id": 129802, "fullname": "Yuxi Mi", "url": "http://cvpr.thecvf.com/api/miniconf/users/129802?format=json", "institution": "Fudan University"}, {"id": 153485, "fullname": "Qiuyang Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153485?format=json", "institution": "Fudan University"}, {"id": 186532, "fullname": "Zhizhou Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186532?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 153486, "fullname": "Xuan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153486?format=json", "institution": "Fudan University"}, {"id": 186533, "fullname": "Jiaogen Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186533?format=json", "institution": null}, {"id": 186534, "fullname": "Fubao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186534?format=json", "institution": "School of Computer Science and Artificial Intelligence, Zhengzhou University of Light Industry"}, {"id": 154292, "fullname": "Jihong Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/154292?format=json", "institution": "Tongji University"}, {"id": 129791, "fullname": "Shuigeng Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/129791?format=json", "institution": "Fudan University"}], "abstract": "Recently, iris recognition is regaining prominence in immersive applications such as extended reality as a means of seamless user identification. This application scenario introduces unique challenges compared to traditional iris recognition under controlled setups, as the ocular images are primarily captured off-axis and less constrained, causing perspective distortion, intra-subject variation, and quality degradation in iris textures. Datasets capturing these challenges remain limited. This paper fills this gap by presenting a large-scale iris dataset collected via head-mounted displays, termed ImmerIris. It contains 499,791 ocular images from 564 subjects, and is, to our knowledge, the largest public iris dataset to date and among the first dedicated to immersive applications. It is accompanied by a comprehensive set of evaluation protocols that benchmark recognition systems under various challenging conditions. This paper also draws attention to a shared obstacle of current recognition methods, the reliance on a pre-processing, normalization stage, which is fallible in off-axis and unconstrained setups. To this end, this paper further proposes a normalization-free paradigm that directly learns from minimally adjusted ocular images. Despite its simplicity, it outperforms normalization-based prior arts, indicating a promising direction for robust iris recognition.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37033", "url": null, "sourceid": 33759, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40280?format=json"], "related_events_ids": [40280]}, {"id": 40280, "uid": "7dd63bc018cc1cc492e7d87ac6c3e465", "name": "ImmerIris: A Large-Scale Dataset and Benchmark for Off-Axis and Unconstrained Iris Recognition in Immersive Applications", "authors": [{"id": 129802, "fullname": "Yuxi Mi", "url": "http://cvpr.thecvf.com/api/miniconf/users/129802?format=json", "institution": "Fudan University"}, {"id": 153485, "fullname": "Qiuyang Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153485?format=json", "institution": "Fudan University"}, {"id": 186532, "fullname": "Zhizhou Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186532?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 153486, "fullname": "Xuan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153486?format=json", "institution": "Fudan University"}, {"id": 186533, "fullname": "Jiaogen Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186533?format=json", "institution": null}, {"id": 186534, "fullname": "Fubao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186534?format=json", "institution": "School of Computer Science and Artificial Intelligence, Zhengzhou University of Light Industry"}, {"id": 154292, "fullname": "Jihong Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/154292?format=json", "institution": "Tongji University"}, {"id": 129791, "fullname": "Shuigeng Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/129791?format=json", "institution": "Fudan University"}], "abstract": "Recently, iris recognition is regaining prominence in immersive applications such as extended reality as a means of seamless user identification. This application scenario introduces unique challenges compared to traditional iris recognition under controlled setups, as the ocular images are primarily captured off-axis and less constrained, causing perspective distortion, intra-subject variation, and quality degradation in iris textures. Datasets capturing these challenges remain limited. This paper fills this gap by presenting a large-scale iris dataset collected via head-mounted displays, termed ImmerIris. It contains 499,791 ocular images from 564 subjects, and is, to our knowledge, the largest public iris dataset to date and among the first dedicated to immersive applications. It is accompanied by a comprehensive set of evaluation protocols that benchmark recognition systems under various challenging conditions. This paper also draws attention to a shared obstacle of current recognition methods, the reliance on a pre-processing, normalization stage, which is fallible in off-axis and unconstrained setups. To this end, this paper further proposes a normalization-free paradigm that directly learns from minimally adjusted ocular images. Despite its simplicity, it outperforms normalization-based prior arts, indicating a promising direction for robust iris recognition.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40280", "url": null, "sourceid": -33759, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37033?format=json"], "related_events_ids": [37033]}, {"id": 37469, "uid": "cf4012623d75c1e1f73eef47b4ec918d", "name": "Discover, Segment, and Select: A Progressive Mechanism for Zero-shot Camouflaged Object Segmentation", "authors": [{"id": 172807, "fullname": "Yilong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/172807?format=json", "institution": "Xiamen University"}, {"id": 187531, "fullname": "Jianxin Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/187531?format=json", "institution": "Xiamen University"}, {"id": 70439, "fullname": "Shengchuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70439?format=json", "institution": "Xiamen University"}, {"id": 88415, "fullname": "Liujuan Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88415?format=json", "institution": "Xiamen University"}], "abstract": "Current zero-shot camouflaged object segmentation methods typically employ a two-stage pipeline (discover-then-segment): using MLLMs to obtain visual prompts, followed by SAM segmentation. However, relying solely on MLLMs for camouflaged object discovery often leads to inaccurate localization, false positives, and missed detections. To address these issues, we propose the Discover-Segment-Select (DSS) mechanism, a three-stage framework that progressively refines the segmentation process. The proposed method contains a Feature-driven Object Discovery (FOD) module that leverages visual features to generate diverse object proposals, a segmentation module that refines these proposals through SAM segmentation, and a Semantic-driven Mask Selection (SMS) module that employs MLLMs to evaluate and select the optimal segmentation mask from multiple candidates. Extensive experiments on four benchmarks demonstrate that our method achieves state-of-the-art performance with lower GPU memory consumption.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37469", "url": null, "sourceid": 37062, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38045, "uid": "7c8b2cb44792a9a6b30a02869a605fd8", "name": "Efficient Training for Human Video Generation with Entropy-Guided Prioritized Progressive Learning", "authors": [{"id": 188793, "fullname": "Changlin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188793?format=json", "institution": "Stanford University"}, {"id": 188909, "fullname": "Jiawei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188909?format=json", "institution": "North China Electric Power University"}, {"id": 188910, "fullname": "Shuhao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188910?format=json", "institution": "NCEPU"}, {"id": 126869, "fullname": "Sihao Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/126869?format=json", "institution": "Royal Melbourne Institute of Technology"}, {"id": 188911, "fullname": "Zeyi Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188911?format=json", "institution": "University of Technology Sydney"}, {"id": 185783, "fullname": "Zhihui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185783?format=json", "institution": "University of Science and Technology of China"}, {"id": 89609, "fullname": "Xiaojun Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89609?format=json", "institution": "University of Technology Sydney"}], "abstract": "Human video generation has advanced rapidly with the development of diffusion models, but the high computational cost and substantial memory consumption associated with training these models on high-resolution, multi-frame data pose significant challenges. In this paper, we propose Entropy-Guided Prioritized Progressive Learning (Ent-Prog), an efficient training framework tailored for diffusion models on human video generation. First, we introduce Conditional Entropy Inflation (CEI) to assess the importance of different model components on the target conditional generation task, enabling prioritized training of the most critical components. Second, we introduce an adaptive progressive schedule that adaptively increases computational complexity during training by measuring the convergence efficiency. Ent-Prog reduces both training time and GPU memory consumption while maintaining model performance. Extensive experiments across three datasets, demonstrate the effectiveness of Ent-Prog, achieving up to 2.2\u00d7 training speedup and 2.4\u00d7 GPU memory reduction without compromising generative performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38045", "url": null, "sourceid": 30595, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37467, "uid": "a4a87e583804c00a14032b615ae88f53", "name": "Easy2Hard: From Partially to Fully Unmatched Modalities as Negative Samples in Contrastive Learning", "authors": [{"id": 184111, "fullname": "Zhicheng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184111?format=json", "institution": "Southern Illinois University-Carbondale"}, {"id": 187529, "fullname": "Yichen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187529?format=json", "institution": "University of Minnesota - Twin Cities"}, {"id": 187530, "fullname": "Chang Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/187530?format=json", "institution": "University of Minnesota - Twin Cities"}, {"id": 184109, "fullname": "Xiaopeng Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184109?format=json", "institution": "Southern Illinois University-Carbondale"}], "abstract": "Contrastive learning is widely used for generating multimodal data representations by aligning embeddings of different modalities of the same data samples. This alignment is achieved through a loss function that treats matched and unmatched modality pairs as positive and negative samples within a data batch. However, when extending contrastive learning to scenarios involving more than two modalities, existing approaches either rely solely on fully unmatched modalities as negative samples, or fail to distinguish between partially and fully unmatched modalities, thereby overlooking the fine-grained contrasting relationships. To address this limitation, we propose Easy2Hard, a novel framework that explicitly separates partially and fully unmatched modalities. To learn from negative samples at improved granularity, Easy2Hard further introduces a sigmoid weighting curriculum that smoothly transitions the learning process from easy (partially unmatched) to hard (fully unmatched) contrasts. Comprehensive evaluations on five multimodal datasets across diverse domains demonstrate the superiority of our approach.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37467", "url": null, "sourceid": 33865, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38399, "uid": "d8b08550cd94134c41e87ac5ab772552", "name": "CycleBEV: Regularizing View Transformation Networks via View Cycle Consistency for Bird\u2019s-Eye-View Semantic Segmentation", "authors": [{"id": 180630, "fullname": "Jeongbin Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/180630?format=json", "institution": "Electronics and Telecommunications Research Institute"}, {"id": 163315, "fullname": "Dooseop Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/163315?format=json", "institution": "ETRI/UST"}, {"id": 189794, "fullname": "Taeg-Hyun An", "url": "http://cvpr.thecvf.com/api/miniconf/users/189794?format=json", "institution": "Electronics and Telecommunications Research Institute"}, {"id": 163503, "fullname": "KYOUNG AN AN", "url": "http://cvpr.thecvf.com/api/miniconf/users/163503?format=json", "institution": "Electronics and Telecommunications Research Institute"}, {"id": 189795, "fullname": "Kyoung-Wook Min", "url": "http://cvpr.thecvf.com/api/miniconf/users/189795?format=json", "institution": "Electronics and Telecommunications Research Institute"}], "abstract": "Transforming image features from perspective view (PV) space to bird's-eye-view (BEV) space remains challenging in autonomous driving due to depth ambiguity and occlusion. Although several view transformation (VT) paradigms have been proposed, the challenge still remains. In this paper, we propose a new regularization framework, dubbed CycleBEV, that enhances existing VT models for BEV semantic segmentation. Inspired by cycle consistency, widely used in image distribution modeling, we devise an inverse view transformation (IVT) network that maps BEV segmentation maps back to PV segmentation maps and use it to regularize VT networks during training through cycle consistency losses, enabling them to capture richer semantic and geometric information from input PV images. To further exploit the capacity of the IVT network, we introduce two novel ideas that extend cycle consistency into geometric and representation spaces. We evaluate CycleBEV on four representative VT models covering three major paradigms using the large-scale nuScenes dataset. Experimental results show consistent improvements---with gains of up to 0.74, 4.86, and 3.74 mIoU for drivable area, vehicle, and pedestrian classes, respectively---without increasing inference complexity, since the IVT network is used only during training.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38399", "url": null, "sourceid": 35487, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39221, "uid": "2c12328aab48072aee9c07cf4442b238", "name": "Multimodal Protein Language Models for Enzyme Kinetic Parameters: From Substrate Recognition to Conformational Adaptation", "authors": [{"id": 107609, "fullname": "Fei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107609?format=json", "institution": "Hefei University of Technology"}, {"id": 191617, "fullname": "Xinye Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191617?format=json", "institution": "Hefei University of Technology"}, {"id": 128709, "fullname": "Kun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/128709?format=json", "institution": "Hefei University of Technology"}, {"id": 182512, "fullname": "Yanyan Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/182512?format=json", "institution": "Hefei University of Technology"}, {"id": 191618, "fullname": "Yuxin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191618?format=json", "institution": "Anhui University"}, {"id": 191619, "fullname": "Ganpeng Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191619?format=json", "institution": "Hefei University of Technology"}, {"id": 191620, "fullname": "Tong Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191620?format=json", "institution": "Hefei University of Technology"}, {"id": 191621, "fullname": "Jingwen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191621?format=json", "institution": "Hefei University of Technology"}], "abstract": "Predicting enzyme kinetic parameters quantifies how efficiently an enzyme catalyzes a specific substrate under defined biochemical conditions. Canonical parameters such as the turnover number ($k_\\text{cat}$), Michaelis constant ($K_\\text{m}$), and inhibition constant ($K_\\text{i}$) depend jointly on the enzyme sequence, the substrate chemistry, and the conformational adaptation of the active site during binding. Many learning pipelines often simplify this process as a static compatibility problem between enzyme and substrate, fusing their representations through shallow operations and regressing a single value. Such formulations overlook the staged nature of catalysis, which involves both substrate recognition and conformational adaptation. In this regard, we reformulate kinetic prediction as a staged multimodal conditional modeling problem and introduce the Enzyme-Reaction Bridging Adapter (ERBA), which injects cross-modal information via fine-tuning into Protein Language Models (PLMs) while preserving their biochemical priors. ERBA performs conditioning in two stages: Molecular Recognition Cross-Attention (MRCA) first injects substrate information into the enzyme representation to capture specificity; Geometry-aware Mixture-of-Experts (G-MoE) then integrates active-site structure and routes samples to pocket-specialized experts to reflect induced fit. To maintain semantic fidelity, Enzyme-Substrate Distribution Alignment (ESDA) enforces distributional consistency within the PLM manifold in a reproducing kernel Hilbert space. Experiments across three kinetic endpoints and multiple PLM backbones, ERBA delivers consistent gains and stronger out-of-distribution performance compared with sequence-only and shallow-fusion baselines, offering a biologically grounded route to scalable kinetic prediction and a foundation for adding cofactors, mutations, and time-resolved structural cues.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39221", "url": null, "sourceid": 32695, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39150, "uid": "c107528e73e8598530e5ca306eb40503", "name": "Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection", "authors": [{"id": 180982, "fullname": "Zhanhe Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/180982?format=json", "institution": "Wuhan University"}, {"id": 86262, "fullname": "Zhongyuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86262?format=json", "institution": "Wuhan University"}, {"id": 151412, "fullname": "Jikang Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/151412?format=json", "institution": "Wuhan University"}, {"id": 76502, "fullname": "Baojin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76502?format=json", "institution": "Huazhong Agricultural University"}, {"id": 191454, "fullname": "Yuhong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191454?format=json", "institution": "Wuhan University"}, {"id": 153101, "fullname": "Zhen Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/153101?format=json", "institution": "Wuhan University"}, {"id": 153102, "fullname": "Chao Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153102?format=json", "institution": "Wuhan University"}, {"id": 84752, "fullname": "Dengpan Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/84752?format=json", "institution": "Wuhan University"}], "abstract": "Standard supervised training for deepfake detection treats all samples with uniform importance, which can be suboptimal for learning robust and generalizable features. In this work, we propose a novel Tutor-Student Reinforcement Learning (TSRL) framework to dynamically optimize the training curriculum. Our method models the training process as a Markov Decision Process where a \"Tutor\" agent learns to guide a \"Student\" (the deepfake detector). The Tutor, implemented as a Proximal Policy Optimization (PPO) agent, observes a rich state representation for each training sample, encapsulating not only its visual features but also its historical learning dynamics, such as EMA loss and forgetting counts. Based on this state, the Tutor takes an action by assigning a continuous weight (0-1) to the sample's loss, thereby dynamically re-weighting the training batch. The Tutor is rewarded based on the Student's immediate performance change, specifically rewarding transitions from incorrect to correct predictions. This strategy encourages the Tutor to learn a curriculum that prioritizes high-value samples, such as hard-but-learnable examples, leading to a more efficient and effective training process. We demonstrate that this adaptive curriculum improves the Student's generalization capabilities against unseen manipulation techniques compared to traditional training methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39150", "url": null, "sourceid": 35361, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37892, "uid": "9d0fcc7311d2f9f58104bdfafc6b41eb", "name": "HySeg: Learning Generative Priors for Structure-Aware Remote Sensing Segmentation", "authors": [{"id": 165371, "fullname": "Jie Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/165371?format=json", "institution": "Fujian Agriculture and Forestry University"}, {"id": 128954, "fullname": "XIN LI", "url": "http://cvpr.thecvf.com/api/miniconf/users/128954?format=json", "institution": "G42"}, {"id": 128933, "fullname": "Fan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128933?format=json", "institution": "AIQ"}, {"id": 188506, "fullname": "Yan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188506?format=json", "institution": "Fujian Agriculture and Forestry University"}, {"id": 188507, "fullname": "Dong Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188507?format=json", "institution": "Beijing Jiaotong University"}, {"id": 188508, "fullname": "Changying Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188508?format=json", "institution": "Fujian Agriculture and Forestry University"}, {"id": 188509, "fullname": "Linwei Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/188509?format=json", "institution": "IFLYTEK CO.LTD."}, {"id": 188510, "fullname": "Yongxiang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188510?format=json", "institution": "Fujian Agriculture and Forestry University"}, {"id": 188511, "fullname": "Youqin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188511?format=json", "institution": "Fujian University of Technology"}, {"id": 188512, "fullname": "Jianzhang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188512?format=json", "institution": "Fujian Agriculture and Forestry University"}], "abstract": "High-resolution remote sensing imagery exhibits complex spatial regularities where topology, continuity, and region adjacency  govern semantic organization. However, existing remote sensing image semantic segmentation (RSISS) networks, being predominantly discriminative, estimate strong posteriors from data while lacking generative priors that encode such structural dependencies. This imbalance leads to fragmented boundaries, texture overfitting, and poor cross-domain generalization. We address this challenge by reformulating RSISS as posterior inference grounded in generative structural priors, introducing {\\bf HySeg}, a hybrid generative\u2013discriminative segmentation paradigm that learns structure-consistent priors through generative modeling and guides posterior inference for remote sensing segmentation. At its core, the MeanStruct module, a MeanFlow-based generative prior learner, models semantic topology as a continuous stochastic field, while the Prior-to-Affinity Projection (P2A) dynamically transforms this field into topology-aware, class-conditional affinities that guide posterior inference in the Dynamic Affinity-driven Segmentation (DAS) head. Our approach is model-agnostic and seamlessly integrates with diverse backbones, consistently improving structural coherence and generalization. Across four challenging RSISS benchmarks, HySeg achieves state-of-the-art performance and advances remote sensing segmentation from appearance-based perception to structural reasoning. All code and models will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37892", "url": null, "sourceid": 33730, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39451, "uid": "db793b3a1b4c40e4977207d285a5ea8a", "name": "Bridging Domain Expertise and Generalization for Performance Estimation", "authors": [{"id": 192102, "fullname": "Shuxuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192102?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 187483, "fullname": "Zhilin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187483?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 154150, "fullname": "Quyu Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/154150?format=json", "institution": "Alibaba Group"}, {"id": 86314, "fullname": "Wei-Shi Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86314?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Performance estimation under distribution shift aims to predict how a model behaves on an unlabeled test set whose distribution differs from the training data, a scenario that requires reliable indicators that can faithfully reflect model behavior without ground-truth labels. Existing approaches rely solely on the outputs of the given model whose biases are amplified once the distribution shifts, weakening the correlation with the true performance. Motivated by this limitation, we propose \\textbf{F}used \\textbf{R}eference \\textbf{A}lignment \\textbf{P}rediction (FRAP), which leverages the complementary strengths of an external foundation model and the base model to construct a more reliable surrogate of the ground-truth labels. FRAP aligns the prediction distribution of the foundation model with that of the base model by applying temperature-scaled calibration that minimizes their divergence. The aligned predictions are fused through confidence-based weighting into a refined reference distribution that integrates robustness from the foundation model and domain-specific expertise from the base model, and performance estimation is obtained by measuring how closely the base model predictions agree with this reference. Extensive experiments across diverse datasets and architectures show that FRAP provides consistent and substantial improvements over representative performance-estimation methods under distribution shift.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39451", "url": null, "sourceid": 40073, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37402, "uid": "3e1d9bee7fa8982cbd0dd0c0aa5ce905", "name": "Goldilocks Test Sets for Face Verification", "authors": [{"id": 187353, "fullname": "Haiyu Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187353?format=json", "institution": "Altos Labs, Inc."}, {"id": 187354, "fullname": "Sicong Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/187354?format=json", "institution": "Indiana University"}, {"id": 88830, "fullname": "Aman Bhatta", "url": "http://cvpr.thecvf.com/api/miniconf/users/88830?format=json", "institution": "University of Notre Dame"}, {"id": 187355, "fullname": "Jacob Gutierrez", "url": "http://cvpr.thecvf.com/api/miniconf/users/187355?format=json", "institution": "University of Notre Dame"}, {"id": 187356, "fullname": "Grace Bezold", "url": "http://cvpr.thecvf.com/api/miniconf/users/187356?format=json", "institution": "Johns Hopkins University Applied Physics Laboratory; Johns Hopkins University; Johns Hopkins University"}, {"id": 178541, "fullname": "Genesis Argueta", "url": "http://cvpr.thecvf.com/api/miniconf/users/178541?format=json", "institution": "University of Notre Dame"}, {"id": 149577, "fullname": "Karl Ricanek", "url": "http://cvpr.thecvf.com/api/miniconf/users/149577?format=json", "institution": "UNC Wilmington"}, {"id": 172534, "fullname": "Michael King", "url": "http://cvpr.thecvf.com/api/miniconf/users/172534?format=json", "institution": "Florida Tech"}, {"id": 88847, "fullname": "Kevin W. Bowyer", "url": "http://cvpr.thecvf.com/api/miniconf/users/88847?format=json", "institution": "University of Notre Dame"}], "abstract": "Reported face verification accuracy has reached a plateau on current well-known test sets. As a result, some difficult test sets have been assembled by reducing the image quality or adding artifacts to the image. However, we argue that test sets can be challenging without artificially reducing the image quality because the face recognition (FR) models suffer from correctly recognizing 1) the pairs from the same identity (i.e., genuine pairs) with a large face attribute difference, 2) the pairs from different identities (i.e., impostor pairs) with a small face attribute difference, and 3) the pairs of similar-looking identities (e.g., twins and relatives). We propose three challenging test sets to reveal important but ignored weaknesses of the existing FR algorithms. To challenge models on variation of facial attributes, we propose Hadrian and Eclipse to address facial hair differences and face exposure differences. The images in both test sets are high-quality and collected in a controlled environment. To challenge FR models on similar-looking persons, we propose twins-IND, which contains images from a dedicated twins dataset. The LFW test protocol is used to structure the proposed test sets. Moreover, we introduce additional rules to assemble \u201cGoldilocks1\" level test sets, including 1) restricted number of occurrence of hard samples, 2) equal chance evaluation across demographic groups, and 3) constrained identity overlap across validation folds. Quantitatively, without further processing the images, the proposed test sets have on-par or higher difficulties than the existing test sets that add artifacts to the images.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37402", "url": null, "sourceid": 42542, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38556, "uid": "fcaae931422688b8a0134e51a7a2fb12", "name": "MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos", "authors": [{"id": 190132, "fullname": "Kehong Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/190132?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 190133, "fullname": "Zhengyu Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190133?format=json", "institution": "Huawei International Pte Ltd"}, {"id": 190134, "fullname": "Weixia He", "url": "http://cvpr.thecvf.com/api/miniconf/users/190134?format=json", "institution": "National University of Singapore"}, {"id": 190135, "fullname": "Xu Mingxi", "url": "http://cvpr.thecvf.com/api/miniconf/users/190135?format=json", "institution": ""}, {"id": 146747, "fullname": "Qi WANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/146747?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 141549, "fullname": "ning Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/141549?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 190136, "fullname": "Zhengyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190136?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 190137, "fullname": "Dongze Lian", "url": "http://cvpr.thecvf.com/api/miniconf/users/190137?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 157941, "fullname": "Wei Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/157941?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 190138, "fullname": "He Xiaoyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190138?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 129461, "fullname": "Mingyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129461?format=json", "institution": "Nanyang Technological University"}], "abstract": "Motion capture now underpins content creation far beyond digital humans, yet most pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation (e.g., BVH) that directly drives the specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware Inverse Kinematics (IK) Fitting. MoCapAnything comprises three learnable modules and a lightweight IK stage: a **Reference Prompt Encoder** that distills per-joint queries from the asset\u2019s skeleton, mesh, and rendered image set; a **Video Feature Extractor** that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the modality gap between RGB tokens and the point-cloud\u2013like joint space; and a **Unified Motion Decoder** that fuses these cues to produce temporally coherent trajectories. We also curate Truebones Zoo with 1,038 motion clips, each providing a standardized skeleton\u2013mesh\u2013rendered-video triad. Experiments on in-domain benchmarks and in-the-wild videos show that \\name delivers high-quality skeletal animations and exhibits non-trivial cross-species retargeting across heterogeneous rigs, offering a scalable path toward prompt-based 3D motion capture for arbitrary assets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38556", "url": null, "sourceid": 32723, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37819, "uid": "b93d43f13fcac0fdca43fda781fd1062", "name": "Easy3E: Feed-Forward 3D Asset Editing via Rectified Voxel Flow", "authors": [{"id": 183691, "fullname": "Shimin Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183691?format=json", "institution": "University of Science and Technology of China"}, {"id": 188342, "fullname": "Yuanyi Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/188342?format=json", "institution": "University of Science and Technology of China"}, {"id": 188343, "fullname": "Fei Zha", "url": "http://cvpr.thecvf.com/api/miniconf/users/188343?format=json", "institution": "University of Science and Technology of China"}, {"id": 133700, "fullname": "Yudong Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/133700?format=json", "institution": "University of Science and Technology of China"}, {"id": 127405, "fullname": "Juyong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127405?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Existing 3D editing methods rely on computationally intensive scene-by-scene iterative optimization and suffer from multi-view inconsistency. We propose an effective and fully feedforward 3D editing framework based on the TRELLIS generative backbone, capable of modifying 3D models from a single editing view. Our framework addresses two key issues: adapting training-free 2D editing to structured 3D representations, and overcoming the bottleneck of appearance fidelity in compressed 3D features. To ensure geometric consistency, we introduce Voxel FlowEdit, an edit-driven flow in the sparse voxel latent space that achieves globally consistent 3D deformation in a single pass. To restore photorealistic details, we develop a normal-guided single to multi-view generation module as an external appearance prior, successfully recovering high-frequency textures. Experiments demonstrate that our method enables fast, globally consistent, and high-fidelity 3D model editing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37819", "url": null, "sourceid": 33010, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36792, "uid": "d69768607bc09f0dbfa0669da46ef6fc", "name": "Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video", "authors": [{"id": 182090, "fullname": "Chanhyuk Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/182090?format=json", "institution": "UNIST"}, {"id": 185883, "fullname": "Taesoo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/185883?format=json", "institution": "Ulsan National Institute of Science and Technology"}, {"id": 185884, "fullname": "Donggyu Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/185884?format=json", "institution": "Ulsan National Institute of Science and Technology"}, {"id": 185885, "fullname": "Siyeol Jung", "url": "http://cvpr.thecvf.com/api/miniconf/users/185885?format=json", "institution": "Ulsan National Institute of Science and Technology"}, {"id": 98614, "fullname": "Taehwan Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/98614?format=json", "institution": "UNIST"}], "abstract": "Talking face generation has gained significant attention as a core application of generative models.To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role.However, existing approaches often limit expressive flexibility and struggle to generate extended emotions.Label-based methods represent emotions with discrete categories, which fail to capture a wide range of emotions.Audio-based methods can leverage emotionally rich speech signals\u2014and even benefit from expressive text-to-speech (TTS) synthesis\u2014but they fail to express the target emotions because emotions and linguistic contents are entangled in emotional speeches.Images-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm).To address these limitations, we propose \\textbf{Cross-Modal Emotion Transfer (C-MET)}, a novel approach that generates facial expressions based on speeches by modeling emotion semantic vectors between speech and visual feature spaces.C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two different emotional embeddings across modalities.Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14\\% over state-of-the-art methods, while generating expressive talking face videos\u2014even for unseen extended emotions. All source code and checkpoint will be released upon acceptance, including video samples.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36792", "url": null, "sourceid": 45941, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36628, "uid": "1b45b0f97ceab36bd24ee1e318824f4e", "name": "PolySLGen: Online Multimodal Speaking\u2013Listening Reaction Generation in Polyadic Interaction", "authors": [{"id": 103034, "fullname": "Zhi-Yi Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/103034?format=json", "institution": "Delft University of Technology (TU Delft)"}, {"id": 185504, "fullname": "Thomas Markhorst", "url": "http://cvpr.thecvf.com/api/miniconf/users/185504?format=json", "institution": "Delft University of Technology"}, {"id": 185505, "fullname": "Jouh Yeong Chew", "url": "http://cvpr.thecvf.com/api/miniconf/users/185505?format=json", "institution": "Honda Research Institution Japan Co., Ltd."}, {"id": 73724, "fullname": "Xucong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73724?format=json", "institution": "Delft University of Technology"}], "abstract": "Human-like multimodal reaction generation is essential for natural group interactions between humans and intelligent embodied AI. However, existing approaches are often limited to single-modality or speaking-only responses in dyadic interactions, making them unsuitable for realistic social scenarios. Many also overlook nonverbal cues and the complex dynamics of polyadic interactions, both critical for engagement and conversational coherence.In this work, we present **PolySLGen**, an online framework for **Poly**adic multimodal **S**peaking and **L**istening reaction **Gen**eration. Given past conversation and motion from all participants, PolySLGen generates a future speaking or listening reaction for a target participant, including speech, body motion, and speaking-state score.To model group interactions effectively, we propose a pose fusion module and a social cue encoder that jointly aggregate motion and social signals from the group. Extensive experiments, along with quantitative and qualitative evaluations, show that PolySLGen produces contextually appropriate and temporally coherent multimodal reactions, outperforming several adapted and state-of-the-art baselines in motion quality, motion-speech alignment, speaking state prediction, and human-perceived realism. Source code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36628", "url": null, "sourceid": 43300, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39231, "uid": "4b51868e506be56701a97d98433379df", "name": "POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs", "authors": [{"id": 157229, "fullname": "Haicheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157229?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 191657, "fullname": "Yuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191657?format=json", "institution": "WeChat AI"}, {"id": 132099, "fullname": "Yikun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/132099?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 191658, "fullname": "Zhemeng Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191658?format=json", "institution": "T\u00e9l\u00e9com Paris; Shanghai Jiao Tong University"}, {"id": 191659, "fullname": "Zhongyin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191659?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 191660, "fullname": "Yangxiu You", "url": "http://cvpr.thecvf.com/api/miniconf/users/191660?format=json", "institution": null}, {"id": 191661, "fullname": "Zilin Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191661?format=json", "institution": "Tencent"}, {"id": 191662, "fullname": "Le Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/191662?format=json", "institution": "Tencent Wechat AI"}, {"id": 191663, "fullname": "Zhou Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191663?format=json", "institution": "Pattern recognition center"}, {"id": 149440, "fullname": "Jie Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/149440?format=json", "institution": "Tencent Inc"}, {"id": 73937, "fullname": "Weidi Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/73937?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 126281, "fullname": "Yanfeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126281?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in cross-modal understanding and generation. However, the rapid growth of visual token sequences\u2014especially in long-video and streaming scenarios\u2014poses a major challenge to their scalability and real-world deployment. Thus, we introduce POINTS-Long, a native dual-mode MLLM featuring dynamic visual token scaling inspired by the human visual system. The model supports two complementary perception modes: focus mode and standby mode, enabling users to dynamically trade off efficiency and accuracy during inference. On fine-grained visual tasks, the focus mode retains the optimal performance, while on long-form general visual understanding, the standby mode retains 97.7-99.7\\% of the original accuracy using only 1/40-1/10th of the visual tokens. Moreover, POINTS-Long natively supports streaming visual understanding via a dynamically detachable KV-cache design, allowing efficient maintenance of ultra-long visual memory. Our work provides new insights into the design of future MLLMs and lays the foundation for adaptive visual reasoning and efficient long-form visual understanding. Model weights will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39231", "url": null, "sourceid": 41403, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36775, "uid": "d6137197f40a5b7916e5e9cbbbab739e", "name": "Visual-RRT: Finding Paths toward Visual-Goals via Differentiable Rendering", "authors": [{"id": 130163, "fullname": "Sebin Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/130163?format=json", "institution": "Korea Advanced Institute of Science and Technology (KAIST)"}, {"id": 107308, "fullname": "Jumin Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/107308?format=json", "institution": "Korea Advanced Institute of Science and Technology"}, {"id": 185849, "fullname": "Taeyeon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/185849?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 107602, "fullname": "Youngju Na", "url": "http://cvpr.thecvf.com/api/miniconf/users/107602?format=json", "institution": "KAIST"}, {"id": 130165, "fullname": "Woobin Im", "url": "http://cvpr.thecvf.com/api/miniconf/users/130165?format=json", "institution": "Korea Advanced Institute of Science and Technology"}, {"id": 89458, "fullname": "Sung-Eui Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/89458?format=json", "institution": "KAIST"}], "abstract": "Rapidly-exploring random trees (RRTs) have been widely adopted for robot motion planning due to their robustness and theoretical guarantees. However, existing RRT-based planners require explicit goal configurations specified as numerical joint angles, while many practical applications provide goal specifications through visual observations such as images or demonstration videos where precise goal configurations are unavailable. In this paper, we propose visual-RRT (vRRT), a motion planner that enables visual-goal planning by unifying gradient-based exploitation from differentiable robot rendering with sampling-based exploration from RRTs. We further introduce (1) a frontier-based exploration-exploitation strategy that adaptively prioritizes visually promising search regions, and (2) inertial gradient tree expansion that inherits optimization states across tree branches for momentum-consistent gradient exploitation. Extensive experiments across various robot manipulators including Franka, UR5e, and Fetch demonstrate that vRRT achieves effective visual-goal planning in both simulated and real-world settings, bridging the gap between sampling-based planning and vision-centric robot applications. Our code will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36775", "url": null, "sourceid": 33340, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39080, "uid": "05b99715f32c973f929cd22735389966", "name": "Breaking Multimodal LLM Safety via Video-Driven Prompting", "authors": [{"id": 158841, "fullname": "Dong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158841?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 179822, "fullname": "XIANGYU HE", "url": "http://cvpr.thecvf.com/api/miniconf/users/179822?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 143875, "fullname": "Xinqi Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/143875?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 158842, "fullname": "Bin Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/158842?format=json", "institution": "The Hong Kong Polytechnic University"}], "abstract": "Multimodal Large Language Models (MLLMs) have achieved remarkable progress in visual reasoning tasks, such as image and video understanding. Recent studies have introduced several effective image-based jailbreak methods. However, these approaches are often mitigated by pre-defined system prompts and overlook vulnerabilities within the video encoder. In this work, we show that video-based attacks are significantly more effective than image-based ones. Specifically, we find that simply repeating a harmful image across multiple frames to construct a video can bypass the safety mechanisms of MLLMs. Our analysis reveals that unsafe videos are embedded more similarly to safe videos in the model\u2019s representation space than individual harmful images, making them harder to detect. Moreover, videos composed of identical frames are processed more like static images and are more likely to trigger safety defenses compared to videos with diverse frames. Motivated by these findings, we propose an algorithm that injects harmful content into typographic videos by interleaving it with diverse, safety-proximal frames, thereby evading MLLM safety alignment. Extensive experiments demonstrate that our approach achieves state-of-the-art jailbreaking performance on several widely-used MLLMs (e.g., VideoLLaMA-2, Qwen2.5-VL, GPT-4.1, and Gemini-2.5) under 16 different safety policies.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39080", "url": null, "sourceid": 44028, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38502, "uid": "f8efc7c14e9be56f00e682929c90beea", "name": "ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors", "authors": [{"id": 178103, "fullname": "Liming Kuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/178103?format=json", "institution": "Technical University of Munich"}, {"id": 184922, "fullname": "Yordanka Velikova", "url": "http://cvpr.thecvf.com/api/miniconf/users/184922?format=json", "institution": "Technical University of Munich"}, {"id": 190005, "fullname": "Mahdi Saleh", "url": "http://cvpr.thecvf.com/api/miniconf/users/190005?format=json", "institution": "Zillow Group, Inc."}, {"id": 186770, "fullname": "Jan-Nico Zaech", "url": "http://cvpr.thecvf.com/api/miniconf/users/186770?format=json", "institution": "Woven by Toyota &amp; INSAIT, Sofia University"}, {"id": 156198, "fullname": "Danda Paudel", "url": "http://cvpr.thecvf.com/api/miniconf/users/156198?format=json", "institution": "INSAIT, Sofia University"}, {"id": 74044, "fullname": "Benjamin Busam", "url": "http://cvpr.thecvf.com/api/miniconf/users/74044?format=json", "institution": "Technical University of Munich"}], "abstract": "Object pose estimation is a fundamental task in computer vision and robotics, yet most methods require extensive, dataset-specific training. Concurrently, large-scale vision language models show remarkable zero-shot capabilities.In this work, we bridge these two worlds by introducing ConceptPose, a framework for object pose estimation that is both training-free and model-free. ConceptPose leverages a vision-language-model (VLM) to create open-vocabulary 3D concept maps, where each point is tagged with a concept vector derived from saliency maps. By establishing robust 3D-3D correspondences across concept maps, our approach allows precise estimation of 6DoF relative pose. Without any object or dataset-specific training, our approach achieves state-of-the-art results on common zero shot relative pose estimation benchmarks, significantly outperforming existing methods by over 62% in ADD(-S) score, including those that utilize extensive dataset-specific training. We will release our code upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38502", "url": null, "sourceid": 35318, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37144, "uid": "f527bb54bbe0ff43068983a3cb2b0749", "name": "SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding", "authors": [{"id": 186767, "fullname": "Nikolay Nikolov", "url": "http://cvpr.thecvf.com/api/miniconf/users/186767?format=json", "institution": "Institute for Computer Science, Artificial Intelligence and Technology"}, {"id": 141973, "fullname": "Giuliano Albanese", "url": "http://cvpr.thecvf.com/api/miniconf/users/141973?format=json", "institution": "INSAIT"}, {"id": 186768, "fullname": "Sombit Dey", "url": "http://cvpr.thecvf.com/api/miniconf/users/186768?format=json", "institution": "Institute for Computer Science, Artificial Intelligence and Technology"}, {"id": 186769, "fullname": "Aleksandar Yanev", "url": "http://cvpr.thecvf.com/api/miniconf/users/186769?format=json", "institution": "Institute for Computer Science, Artificial Intelligence and Technology; Sofia University &quot;St. Kliment Ohridski&quot;"}, {"id": 75489, "fullname": "Luc Van Gool", "url": "http://cvpr.thecvf.com/api/miniconf/users/75489?format=json", "institution": "INSAIT, Sofia Un. St. Kliment Ohridski"}, {"id": 186770, "fullname": "Jan-Nico Zaech", "url": "http://cvpr.thecvf.com/api/miniconf/users/186770?format=json", "institution": "Woven by Toyota &amp; INSAIT, Sofia University"}, {"id": 156198, "fullname": "Danda Paudel", "url": "http://cvpr.thecvf.com/api/miniconf/users/156198?format=json", "institution": "INSAIT, Sofia University"}], "abstract": "Robotic Foundation Models (RFMs) hold great promise as generalist, end-to-end systems for robot control.Yet their ability to generalize across new environments, tasks, and embodiments remains limited.We argue that a major bottleneck lies in their foundations: most RFMs are built by fine-tuning internet-pretrained Vision-Language Models (VLMs).However, these VLMs are trained on 2D image-language tasks and lack the 3D spatial reasoning inherently required for embodied control in the 3D world.Bridging this gap directly with large-scale robotic data is costly and difficult to scale.Instead, we propose to enrich easy-to-collect non-robotic image data with 3D annotations and enhance a pretrained VLM with 3D understanding capabilities.Following this strategy, we train SPEAR-VLM, a 3D-aware VLM that infers object coordinates in 3D space from a single 2D image.Building on SPEAR-VLM, we introduce our main contribution, $~\\textbf{SPEAR-1}$: a robotic foundation model that integrates grounded 3D perception with language-instructed embodied control.Trained on $\\sim$45M frames from 24 Open X-Embodiment datasets, SPEAR-1 outperforms or matches state-of-the-art models such as $\\pi_0$-FAST and $\\pi_{0.5}$, while it uses 20$\\times$ fewer robot demonstrations.This carefully-engineered training strategy unlocks new VLM capabilities and as a consequence boosts the reliability of embodied control beyond what is achievable with only robotic data.We make our model weights and 3D-annotated datasets publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37144", "url": null, "sourceid": 46344, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39816, "uid": "62c8d075dc4cffa2b9e0796f43bc1c2a", "name": "Spectral Super-Resolution via Adversarial Unfolding and Data-Driven Spectrum Regularization: From Multispectral Satellite Data to NASA Hyperspectral Image", "authors": [{"id": 174886, "fullname": "Si-Sheng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174886?format=json", "institution": "Institute of Computer and Communication Engineering"}, {"id": 183897, "fullname": "Chia-Hsiang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/183897?format=json", "institution": "National Cheng Kung University"}], "abstract": "The European Space Agency's Sentinel-2 satellite provides global multispectral coverage for remote sensing (RS) applications. However, limited spectral resolution (12 bands) and non-unified spatial resolution (60/20/10 m) restrict their practicality. In contrast, the high spectral-spatial resolution sensor (e.g., NASA's AVIRIS-NG) covers only the American region due to practical considerations. This raises a fundamental question: ``Can a global hyperspectral coverage be achieved by reconstructing Sentinel-2 data to NASA hyperspectral images?'' This study aims to achieve spectral super-resolution from 12-to-186 and unify the spatial resolution of Sentinel-2 data to 5 m. To enable a reliable and efficient reconstruction, we formulate a novel deep unfolding framework regularized by a data-driven spectrum prior from PriorNet, instead of relying on implicit deep priors as conventional deep unfolding does. Moreover, an adversarial term is integrated into the unfolded architecture, enabling the discriminator to guide the reconstruction in both the training and testing phases; we term this novel concept unfolding adversarial learning (UAL). Experiments show that our UALNet outperforms the next-best Transformer in PSNR, SSIM, and SAM, while requiring only 15\\% MACs and 20 times fewer parameters.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39816", "url": null, "sourceid": 34174, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39866, "uid": "d8e917c6af68b61ef2b3ba045c3436f4", "name": "Love Me, Love My Label: Rethinking the Role of Labels in Prompt Retrieval for Visual In-Context Learning", "authors": [{"id": 193026, "fullname": "Tianci Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/193026?format=json", "institution": "Tsinghua University"}, {"id": 193027, "fullname": "Haohao Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193027?format=json", "institution": "Northeastern University"}, {"id": 181347, "fullname": "Jinpeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181347?format=json", "institution": "Harbin Institue of Technology, Shenzhen"}, {"id": 157915, "fullname": "Niu Lian", "url": "http://cvpr.thecvf.com/api/miniconf/users/157915?format=json", "institution": "Harbin Institute of Technology"}, {"id": 193028, "fullname": "Xinrui Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193028?format=json", "institution": null}, {"id": 87209, "fullname": "Bin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87209?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 87242, "fullname": "Shu-Tao Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/87242?format=json", "institution": "Shenzhen International Graduate School, Tsinghua University"}, {"id": 86975, "fullname": "Chun Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/86975?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Visual in-context learning (VICL) enables visual foundation models to handle multiple tasks by steering them with demonstrative prompts. The choice of such prompts largely influences VICL performance, standing out as a key challenge.Prior work has made substantial progress on prompt retrieval and reranking strategies; however, they focused primarily on prompt images while often overlooking the labels. We reveal that these approaches sometimes get visually similar but label-inconsistent prompts, which potentially degrade VICL performance. On the other hand, higher label consistency between query and prompts preferably indicates stronger VICL results. Motivated by these findings, we develop a framework named LaPR (**L**abel-**a**ware **P**rompt **R**etrieval), which highlights the role of labels in prompt selection. Our framework first designs an image\u2013label joint representation for prompts to incorporate label cues explicitly. Besides, to handle unavailable query labels at test time, we introduce a mixture-of-expert mechanism to the dual encoders with query-adaptive routing. Each expert is expected to capture a specific label mode, while the router infers query-adaptive mixture weights and helps to learn label-aware representation.We carefully design alternative optimization for experts and the router, with a VICL performance-guided contrastive loss and a label-guided contrastive loss, respectively. Extensive experiments show promising and consistent improvement of LaPR on in-context segmentation, detection, and colorization tasks. Moreover, LaPR generalizes well across feature extractors and cross-fold scenarios, suggesting the importance of label utilization in prompt retrieval for VICL. Faithful code will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39866", "url": null, "sourceid": 33971, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38616, "uid": "0789e47353cb91045d7bd8a39bff5d0b", "name": "C$^2$FG: Control Classifier-Free Guidance via Score Discrepancy Analysis", "authors": [{"id": 181717, "fullname": "Jiayang Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/181717?format=json", "institution": null}, {"id": 190307, "fullname": "Tianyi Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190307?format=json", "institution": "vivo Mobile Communication Co., Ltd"}, {"id": 190308, "fullname": "Jiayang Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190308?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 186087, "fullname": "Fengxiang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186087?format=json", "institution": "Vivo"}, {"id": 190309, "fullname": "Shice Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190309?format=json", "institution": "Vivo"}, {"id": 190310, "fullname": "Luyao Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190310?format=json", "institution": null}, {"id": 155870, "fullname": "Zheyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155870?format=json", "institution": "University of Science and Technology of China"}, {"id": 126669, "fullname": "Hao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126669?format=json", "institution": "vivo Mobile Communication \uff08Hangzhou\uff09Co., Ltd"}, {"id": 126675, "fullname": "Jinwei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126675?format=json", "institution": "vivo Mobile Communication Co., Ltd."}, {"id": 126680, "fullname": "Peng-Tao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126680?format=json", "institution": "vivo Mobile Communication (Hangzhou) Co., Ltd."}, {"id": 87210, "fullname": "Bo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/87210?format=json", "institution": "vivo Mobile Communication Co.,Ltd."}, {"id": 185450, "fullname": "Jia Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185450?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Classifier-Free Guidance (CFG) is a cornerstone of modern conditional diffusion models, yet its reliance on the fixed or heuristic dynamic guidance weight is predominantly empirical and overlooks the inherent dynamics of the diffusion process. In this paper, we provide a rigorous theoretical analysis of the Classifier-Free Guidance. Specifically, we establish strict upper bounds on the score discrepancy between conditional and unconditional distributions at different timesteps based on the diffusion process.This finding explains the limitations of fixed-weight strategies and establishes a principled foundation for time-dependent guidance. Motivated by this insight, we introduce **Control Classifier-Free Guidance (C$^2$FG)**, a novel, training-free, and plug-in method that aligns the guidance strength with the diffusion dynamics via an exponential decay control function. Extensive experiments demonstrate that C$^2$FG is effective and broadly applicable across diverse generative tasks, while also exhibiting orthogonality to existing strategies.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38616", "url": null, "sourceid": 32630, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40217, "uid": "af7244bb99debb4a1152fa49a993a05c", "name": "FluoCLIP: Stain-Aware Focus Quality Assessment in Fluorescence Microscopy", "authors": [{"id": 183511, "fullname": "Hyejin Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/183511?format=json", "institution": "Ewha Womans University"}, {"id": 181847, "fullname": "Jiwon Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/181847?format=json", "institution": "Ewha Women&#x27;s University"}, {"id": 193806, "fullname": "Sumin Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/193806?format=json", "institution": "Ewha Womans University"}, {"id": 193807, "fullname": "Suree Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/193807?format=json", "institution": "Daesang"}, {"id": 193808, "fullname": "Sinae Jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193808?format=json", "institution": "Ewha Women&#x27;s University"}, {"id": 193809, "fullname": "Eunsoo Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/193809?format=json", "institution": "Ewha Women&#x27;s University"}, {"id": 193810, "fullname": "Dongmin Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193810?format=json", "institution": "Ewha Women&#x27;s University"}, {"id": 85489, "fullname": "Dongbo Min", "url": "http://cvpr.thecvf.com/api/miniconf/users/85489?format=json", "institution": "Ewha Womans University"}], "abstract": "Accurate focus quality assessment (FQA) in fluorescence microscopy remains challenging, as the stain-dependent optical properties of fluorescent dyes cause abrupt and heterogeneous focus shifts. However, existing datasets and models overlook this variability, treating focus quality as a stain-agnostic problem. In this work, we formulate the task of \\textbf{stain-aware FQA}, emphasizing that focus behavior in fluorescence microscopy must be modeled as a function of staining characteristics. Through quantitative analysis of existing datasets (FocusPath, BBBC006) and our newly curated FluoMix, we demonstrate that focus\u2013rank relationships vary substantially across stains, underscoring the need for stain-aware modeling in fluorescence microscopy. To support this new formulation, we propose \\textbf{FluoMix}, the first dataset for stain-aware FQA that encompasses multiple tissues, fluorescent stains, and focus variations. Building on this dataset, we propose \\textbf{FluoCLIP}, a two-stage vision-language framework that leverages CLIP's alignment capability to interpret focus quality in the context of biological staining. In the \\textbf{stain-grounding phase}, FluoCLIP learns general stain representations by aligning textual stain tokens with visual features, while in the \\textbf{stain-guided ranking phase}, it optimizes stain-specific rank prompts for ordinal focus prediction. Together, our formulation, dataset, and framework establish the first foundation for \\textbf{stain-aware FQA}, and \\textbf{FluoCLIP} achieves strong generalization across diverse fluorescence microscopy conditions.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40217", "url": null, "sourceid": 43203, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37412, "uid": "5a7535cf531bc52d3200f569ed40ca0b", "name": "Adversarial Style Optimization: Enhancing VLM Jailbreaks by GRPO-based Stylistic Triggers Optimization", "authors": [{"id": 187381, "fullname": "Bingjun Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187381?format=json", "institution": "Tencent"}, {"id": 187382, "fullname": "Jialin Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187382?format=json", "institution": "Harbin Engineering University"}, {"id": 76436, "fullname": "Yue Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/76436?format=json", "institution": "Shandong University"}, {"id": 107253, "fullname": "Xinpeng Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/107253?format=json", "institution": "The Hong Kong University of Science and Technology"}], "abstract": "Multimodal Large Language Models (MLLMs) have achieved impressive performance, but their safety alignment remains vulnerable to jailbreak attacks. Existing content-based jailbreaks are often inconsistent and show low attack success rates (ASR) against commercial closed-source MLLMs, failing to exploit non-content-based vulnerabilities. Unlike previous research, we empirically find that MLLMs exhibit a Stylistic Inconsistency between their comprehension ability and safety ability. That is, from the perspective of comprehension, MLLMs can robustly understand content regardless of visual style (e.g., \"pencil sketch\"). However, from the perspective of safety ability, their defense mechanisms can be easily bypassed by these specific stylistic triggers, leading to harmful responses. Based on this finding, we propose Adversarial Style Optimization (ASO), a plug-and-play enhancement module to amplify existing visual jailbreaks. ASO fine-tunes an image-editing model to superimpose an optimized stylistic modification onto a given adversarial image. We apply a Group Relative Policy Optimization (GRPO) agent, guided by a Structurally-Tiered Reward Function. This function uniquely combines a logit-based signal for detecting explicit refusals with a high-fidelity semantic evaluation from a powerful judge model, mapping outcomes to distinct, non-overlapping reward tiers to select the most potent stylistic parameters. Extensive experiments show that ASO significantly enhances the ASR of SOTA attacks. The GRPO agent automatically discovers optimal, non-intuitive parameters, demonstrating that stylistic biases are a scalable and modular vector for red-teaming MLLMs.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37412", "url": null, "sourceid": 40757, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40299?format=json"], "related_events_ids": [40299]}, {"id": 40081, "uid": "8ea72b15523ac50dca6a1370d803eb19", "name": "EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling", "authors": [{"id": 182971, "fullname": "Jiafei Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/182971?format=json", "institution": "Guangdong OPPO Mobile Telecommunications Corp.,Ltd."}, {"id": 193455, "fullname": "Fengwei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/193455?format=json", "institution": "Guangdong OPPO Mobile Telecommunications Corp., Ltd."}, {"id": 193456, "fullname": "Jin Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193456?format=json", "institution": "Guangdong OPPO Mobile Telecommunications Corp.,Ltd."}, {"id": 193457, "fullname": "Wenjin (Jason) Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193457?format=json", "institution": "Guangdong OPPO Mobile Telecommunications Corp.,Ltd."}, {"id": 193458, "fullname": "Tong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193458?format=json", "institution": "Guangdong OPPO Mobile Telecommunications Corp.,Ltd."}, {"id": 193459, "fullname": "Gengjian Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/193459?format=json", "institution": "Guangdong OPPO Mobile Telecommunications Corp.,Ltd."}, {"id": 193460, "fullname": "Zhikang Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193460?format=json", "institution": "Guangdong OPPO Mobile Telecommunications Corp.,Ltd."}, {"id": 193461, "fullname": "Daomin Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/193461?format=json", "institution": "Guangdong OPPO Mobile Telecommunications Corp.,Ltd."}, {"id": 193462, "fullname": "Yichao Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193462?format=json", "institution": null}, {"id": 193463, "fullname": "Bailin Na", "url": "http://cvpr.thecvf.com/api/miniconf/users/193463?format=json", "institution": null}], "abstract": "Recent Multimodal Large Language Models (MLLMs) have demonstrated strong performance on vision-language understanding tasks, yet their inference efficiency is often hampered by the large number of visual tokens, particularly in high-resolution or multi-image scenarios. To address this issue, we propose EvoComp, a visual token compression framework that significantly reduces token count while preserving task accuracy. EvoComp introduces a lightweight encoder-only transformer-based compressor that selects the most informative and non-redundant visual tokens by jointly considering visual and textual contexts. A core challenge lies in providing effective supervision for training the compressor. To this end, we design an evolutionary labeling strategy that searches for token subsets minimizing the MLLM's output loss, while enforcing semantic diversity through vocabulary-based token grouping. We further train the compressor using a tailored loss function combining the GHM loss to mitigate class and difficulty imbalance, and a cosine similarity regularization to encourage semantic separation between retained and discarded tokens. Extensive experiments across multiple vision-language benchmarks show that EvoComp outperforms existing methods based on attention or similarity heuristics. Notably, it retains 99.3\\% of the original accuracy under 3x token compression and delivers up to 1.6x speedup on mobile devices.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40081", "url": null, "sourceid": 30836, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38102, "uid": "adaedfcb189c2913db383791af9b8ec3", "name": "RaUF: Learning the Spatial Uncertainty Field of Radar", "authors": [{"id": 143734, "fullname": "Shengpeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/143734?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 189061, "fullname": "Kuangyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189061?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 189062, "fullname": "Wei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189062?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Millimeter-wave radar offers unique advantages in adverse weather but suffers from low spatial fidelity, severe azimuth ambiguity, and clutter-induced spurious returns. Existing methods mainly focus on improving spatial perception effectiveness via coarse-to-fine cross-modal supervision, yet often overlook the ambiguous feature-to-label mapping, which may lead to ill-posed geometric inference and pose fundamental challenges to downstream perception tasks. In this work, we propose RaUF, a spatial uncertainty field learning framework that models radar measurements through their physically grounded anisotropic properties. To resolve conflicting feature-to-label mapping, we design an anisotropic probabilistic model that learns fine-grained uncertainty. To further enhance reliability, we propose a Bidirectional Domain Attention mechanism that exploits the mutual complementarity between spatial structure and Doppler consistency, effectively suppressing spurious or multipath-induced reflections. Extensive experiments on public benchmarks and real-world datasets demonstrate that RaUF delivers highly reliable spatial detections with well-calibrated uncertainty. Moreover, downstream case studies further validate the enhanced reliability and scalability of RaUF under challenging real-world driving scenarios. Our dataset will be released to the community.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38102", "url": null, "sourceid": 43187, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38862, "uid": "7d214dbbb7364420941d28a3e51a11e4", "name": "One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control", "authors": [{"id": 180271, "fullname": "Haoxiang Rao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180271?format=json", "institution": "Nanjing University"}, {"id": 190858, "fullname": "Zhao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190858?format=json", "institution": "China Mobile Communications Company Limited Research Institute"}, {"id": 89111, "fullname": "Chenyang Si", "url": "http://cvpr.thecvf.com/api/miniconf/users/89111?format=json", "institution": "Nanyang Technological University  Singapore"}, {"id": 190859, "fullname": "Yan LYU", "url": "http://cvpr.thecvf.com/api/miniconf/users/190859?format=json", "institution": "China Mobile Communications Company Limited Research Institute"}, {"id": 190860, "fullname": "Yuanyi Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190860?format=json", "institution": "Shandong University of Science and Technology"}, {"id": 89714, "fullname": "Fang Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/89714?format=json", "institution": "Tencent AI Lab"}, {"id": 153687, "fullname": "Caifeng Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153687?format=json", "institution": "Nanjing University"}], "abstract": "Industrial anomaly detection (AD) is characterized by an abundance of normal images but a scarcity of anomalous ones. Although numerous few-shot anomaly synthesis methods have been proposed to augment anomalous data for downstream AD tasks, most existing approaches require time-consuming training and struggle to learn distributions that are faithful to real anomalies, thereby restricting the efficacy of AD models trained on such data. To address these limitations, we propose a training-free few-shot anomaly generation method, namely O2MAG, which leverages the self-attention in One reference anomalous image to synthesize More realistic anomalies, supporting effective downstream anomaly detection. Specifically, O2MAG manipulates three parallel diffusion processes via self-attention grafting and incorporates the anomaly mask to mitigate foreground-background query confusion, synthesizing text-guided anomalies that closely adhere to real anomalous distributions. To bridge the semantic gap between the encoded anomaly text prompts and the true anomaly semantics, Anomaly-Guided Optimization is further introduced to align the synthesis process with the target anomalous distribution, steering the generation toward realistic and text-consistent anomalies. Moreover, to mitigate faint anomaly synthesis inside anomaly masks, Dual-Attention Enhancement is adopted during generation to reinforce both self- and cross-attention on masked regions. Extensive experiments validate the effectiveness of \\method, demonstrating its superior performance over prior state-of-the-art methods on downstream AD tasks.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38862", "url": null, "sourceid": 41554, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40299, "uid": "5a7535cf531bc52d3200f569ed40ca0b", "name": "Adversarial Style Optimization: Enhancing VLM Jailbreaks by GRPO-based Stylistic Triggers Optimization", "authors": [{"id": 187381, "fullname": "Bingjun Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187381?format=json", "institution": "Tencent"}, {"id": 187382, "fullname": "Jialin Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187382?format=json", "institution": "Harbin Engineering University"}, {"id": 76436, "fullname": "Yue Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/76436?format=json", "institution": "Shandong University"}, {"id": 107253, "fullname": "Xinpeng Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/107253?format=json", "institution": "The Hong Kong University of Science and Technology"}], "abstract": "Multimodal Large Language Models (MLLMs) have achieved impressive performance, but their safety alignment remains vulnerable to jailbreak attacks. Existing content-based jailbreaks are often inconsistent and show low attack success rates (ASR) against commercial closed-source MLLMs, failing to exploit non-content-based vulnerabilities. Unlike previous research, we empirically find that MLLMs exhibit a Stylistic Inconsistency between their comprehension ability and safety ability. That is, from the perspective of comprehension, MLLMs can robustly understand content regardless of visual style (e.g., \"pencil sketch\"). However, from the perspective of safety ability, their defense mechanisms can be easily bypassed by these specific stylistic triggers, leading to harmful responses. Based on this finding, we propose Adversarial Style Optimization (ASO), a plug-and-play enhancement module to amplify existing visual jailbreaks. ASO fine-tunes an image-editing model to superimpose an optimized stylistic modification onto a given adversarial image. We apply a Group Relative Policy Optimization (GRPO) agent, guided by a Structurally-Tiered Reward Function. This function uniquely combines a logit-based signal for detecting explicit refusals with a high-fidelity semantic evaluation from a powerful judge model, mapping outcomes to distinct, non-overlapping reward tiers to select the most potent stylistic parameters. Extensive experiments show that ASO significantly enhances the ASR of SOTA attacks. The GRPO agent automatically discovers optimal, non-intuitive parameters, demonstrating that stylistic biases are a scalable and modular vector for red-teaming MLLMs.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40299", "url": null, "sourceid": -40757, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37412?format=json"], "related_events_ids": [37412]}, {"id": 39536, "uid": "7d35335f47d5d82b093aeee47a5b0a64", "name": "Instance-level Visual Active Tracking with Occlusion-Aware Planning", "authors": [{"id": 192290, "fullname": "Haowei Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/192290?format=json", "institution": "South China University of Technology"}, {"id": 192291, "fullname": "Kai Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192291?format=json", "institution": "South China University of Technology"}, {"id": 192292, "fullname": "Hao Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192292?format=json", "institution": "South China University of Technology"}, {"id": 192293, "fullname": "Shiteng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192293?format=json", "institution": "South China University of Technology"}, {"id": 157345, "fullname": "Jinwu Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157345?format=json", "institution": "South China University of Technology"}, {"id": 192294, "fullname": "Xutao Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192294?format=json", "institution": "South China University of Technology"}, {"id": 87065, "fullname": "Qixiang Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/87065?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 87321, "fullname": "Mingkui Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87321?format=json", "institution": "South China University of Technology"}], "abstract": "Visual Active Tracking (VAT) aims to control cameras to follow a target in 3D space, which is critical for applications like drone navigation and security surveillance. However, it faces two key bottlenecks in real-world deployment: confusion from visually similar distractors caused by insufficient instance-level discrimination and severe failure under occlusions due to the absence of active planning. To address these, we propose OA-VAT, a unified pipeline with three complementary modules. First, a training-free Instance-Aware Offline Prototype Initialization aggregates multi-view augmented features via DINOv3 to construct discriminative instance prototypes, mitigating distractor confusion. Second, an Online Prototype Enhancement Tracker enhances prototypes online and integrates a confidence-aware Kalman filter for stable tracking under appearance and motion changes. Third, an Occlusion-Aware Trajectory Planner, trained on our new Planning-20k dataset, uses conditional diffusion to generate obstacle-avoiding paths for occlusion recovery. Experiments demonstrate OA-VAT achieves 0.93 average SR on UnrealCV (+2.2% vs. SOTA TrackVLA), 90.8% average CAR on real-world datasets (+12.1% vs. SOTA GC-VAT), and 81.6% TSR on a DJI Tello drone. Running at 35 FPS on an RTX 3090, it delivers robust, real-time performance for practical deployment.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39536", "url": null, "sourceid": 42545, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37267, "uid": "3e79c62a8c156c71bc7d1ff24609d223", "name": "CG-Floor: Centroid-Guided Diffusion for Large-Scale Floorplan Generation", "authors": [{"id": 183750, "fullname": "Hongjin Lian", "url": "http://cvpr.thecvf.com/api/miniconf/users/183750?format=json", "institution": "Tianjin University"}, {"id": 187050, "fullname": "Jian Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/187050?format=json", "institution": "Tianjin University"}, {"id": 187051, "fullname": "Hongjie Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187051?format=json", "institution": "Tianjin University"}, {"id": 187052, "fullname": "Jia Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187052?format=json", "institution": "KEDACOM"}, {"id": 87066, "fullname": "Ruizhen Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87066?format=json", "institution": "Shenzhen University"}, {"id": 84906, "fullname": "Yu-Kun Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/84906?format=json", "institution": "Cardiff University"}, {"id": 77013, "fullname": "Kun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/77013?format=json", "institution": "Tianjin University"}], "abstract": "Large-scale floorplan generation is critical for virtual space planning and architectural simulation. Although existing methods have shown success in generating small-scale floorplans with simple room shapes, they struggle to handle the complex room connections and irregular room shapes that arise in large-scale floorplans. In this paper, we propose CG-Floor, a centroid-guided hierarchical framework that explicitly decouples topology and geometry to address these issues. We first introduce the size-aware semantic centroid heatmap, derived from predicted room centroids, which provides a structured representation to precisely guide the effective generation of a coarse-to-fine floorplan generator while ensuring semantic alignment. Additionally, we train a vector quantized codebook of floorplans with complex room shapes to capture the diversity of room shapes and employ a latent diffusion transformer to generate large-scale floorplans featuring non-Manhattan room shapes. CG-Floor achieves state-of-the-art performance on the large-scale MSD dataset, and supports 3D floorplan conversion and editing, demonstrating the practicality of our approach.The code will be publicly available for research purposes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37267", "url": null, "sourceid": 43250, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38459, "uid": "c8c1efaca9730df9826ea5f5bd72d025", "name": "DemoFunGrasp: Universal Dexterous Functional Grasping via Demonstration-Editing Reinforcement Learning", "authors": [{"id": 180073, "fullname": "Chuan Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180073?format=json", "institution": "Peking University"}, {"id": 186907, "fullname": "Haoqi Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186907?format=json", "institution": "Peking University"}, {"id": 189898, "fullname": "Ziye Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189898?format=json", "institution": "Peking University"}, {"id": 189899, "fullname": "Chaoyi Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189899?format=json", "institution": null}, {"id": 189900, "fullname": "Kai Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/189900?format=json", "institution": "Peking University"}, {"id": 87087, "fullname": "Zongqing Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87087?format=json", "institution": "Peking University"}], "abstract": "Reinforcement learning (RL) has achieved great success in dexterous grasping, significantly improving grasp performance and generalization from simulation to the real world. However, fine-grained functional grasping, which is essential for downstream manipulation tasks, remains underexplored and faces several challenges: the complexity of specifying goals and reward functions for functional grasps across diverse objects, the difficulty of multi-task RL exploration, and the challenge of sim-to-real transfer. In this work, we propose DemoFunGrasp for universal dexterous functional grasping. We factorize functional grasping conditions into two complementary components \u2014 grasping style and affordance \u2014 and integrate them into an RL framework that can learn to grasp any object with any functional grasping condition. To address the multi-task optimization challenge, we leverage a single grasping demonstration and reformulate the RL problem as one-step demonstration editing, substantially enhancing sample efficiency and performance.Experimental results in both simulation and the real world show that DemoFunGrasp generalizes to unseen combinations of objects, affordances, and grasping styles, outperforming baselines in both success rate and functional grasping accuracy. In addition to strong sim-to-real capability, by incorporating a vision-language model (VLM) for planning, our system achieves autonomous instruction-following grasp execution.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38459", "url": null, "sourceid": 31541, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38424, "uid": "01eb7e52b5ec845670f477a41d4c00a3", "name": "Hierarchical Enhancement of Semantic Priors for Disentangled Text-Driven Motion Generation", "authors": [{"id": 183264, "fullname": "Wenhan Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/183264?format=json", "institution": "Xiamen University"}, {"id": 143046, "fullname": "Shaopan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/143046?format=json", "institution": "Xiamen University"}, {"id": 178306, "fullname": "Xiangyu Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/178306?format=json", "institution": "Xiamen University"}, {"id": 189841, "fullname": "Tianchu Hang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189841?format=json", "institution": "NC DHHS"}, {"id": 189842, "fullname": "Zhongquan Jian", "url": "http://cvpr.thecvf.com/api/miniconf/users/189842?format=json", "institution": "Minjiang University"}, {"id": 189843, "fullname": "Qingqiang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189843?format=json", "institution": "Xiamen University"}], "abstract": "Text-to-motion generation aims to synthesize realistic and semantically aligned 3D human motions from natural language descriptions. However, existing diffusion-based methods often rely on isotropic latent priors and shallow cross-modal supervision, which lead to semantic entanglement, limited controllability, and poor interpretability.We propose HESP, a unified diffusion framework that hierarchically enhances semantic priors for disentangled text-driven motion generation. At its core, HESP introduces an Adaptive Gaussian Variational Autoencoder (AG-VAE) that structures the latent motion manifold into multiple semantically coherent submanifolds, enabling interpretable and controllable motion representations. To further bridge linguistic and kinematic semantics, we design a Dynamic Cross-Modal Memory (DCMM) module for adaptive semantic fusion and a Hierarchical Cross-Modal Attention (HCA) mechanism to capture multi-level text\u2013motion correspondences.Extensive experiments on HumanML3D and KIT-ML demonstrate that HESP consistently outperforms state-of-the-art baselines such as SALAD, MoMask, and MDM, achieving improvement, while maintaining higher diversity and physical plausibility. Moreover, the structured latent space of HESP provides interpretable clusters that reveal clear semantic boundaries among different motion categories.Our work establishes a new paradigm for text-conditioned human motion generation by integrating hierarchical latent modeling with adaptive cross-modal reasoning, advancing both performance and interpretability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38424", "url": null, "sourceid": 36598, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39741, "uid": "6f942b9867c5426a14f5841ece172b18", "name": "VENI: Variational Encoder for Natural Illumination", "authors": [{"id": 192765, "fullname": "Paul Walker", "url": "http://cvpr.thecvf.com/api/miniconf/users/192765?format=json", "institution": "Friedrich Alexander Universit\u00e4t Erlangen-N\u00fcrnberg"}, {"id": 177440, "fullname": "James Gardner", "url": "http://cvpr.thecvf.com/api/miniconf/users/177440?format=json", "institution": "University of York"}, {"id": 192766, "fullname": "Andreea Ardelean", "url": "http://cvpr.thecvf.com/api/miniconf/users/192766?format=json", "institution": "Friedrich-Alexander-Universit\u00e4t Erlangen-N\u00fcrnberg"}, {"id": 192767, "fullname": "William Smith", "url": "http://cvpr.thecvf.com/api/miniconf/users/192767?format=json", "institution": "University of York"}, {"id": 75545, "fullname": "Bernhard Egger", "url": "http://cvpr.thecvf.com/api/miniconf/users/75545?format=json", "institution": "Massachusetts Institute of Technology"}], "abstract": "Inverse rendering is an ill-posed problem, but priors like illumination priors, can simplify it.Existing work either disregards the spherical and rotation-equivariant nature of illumination environments or does not provide a well-behaved latent space.We propose a rotation-equivariant variational autoencoder that models natural illumination on the sphere without relying on 2D projections.To preserve the SO(2)-equivariance of environment maps, we use a novel Vector Neuron Vision Transformer (VN-ViT) as encoder and a rotation-equivariant conditional neural field as decoder.In the encoder, we reduce the equivariance from SO(3) to SO(2) using a novel SO(2)-equivariant fully connected layer, an extension of Vector Neurons.We show that our SO(2)-equivariant fully connected layer outperforms standard Vector Neurons when used in our SO(2)-equivariant model.Compared to previous methods, our variational autoencoder enables smoother interpolation in latent space and offers a more well-behaved latent space.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39741", "url": null, "sourceid": 31922, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36422, "uid": "5993d62a909320b8c3d62ef50e61b25a", "name": "Learning to See through Illumination Extremes with Event Streaming in Multimodal Large Language Models", "authors": [{"id": 184994, "fullname": "Baoheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184994?format=json", "institution": "Department of Electrical and Electronic Engineering, University of Hong Kong"}, {"id": 76577, "fullname": "Jiahui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76577?format=json", "institution": "Megvii Technology Inc."}, {"id": 184995, "fullname": "Zhao Gui", "url": "http://cvpr.thecvf.com/api/miniconf/users/184995?format=json", "institution": null}, {"id": 184996, "fullname": "Zhang Weizhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184996?format=json", "institution": "University of Hong Kong"}, {"id": 184997, "fullname": "YIXUAN MA", "url": "http://cvpr.thecvf.com/api/miniconf/users/184997?format=json", "institution": "University of Hong Kong"}, {"id": 184998, "fullname": "Jun Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184998?format=json", "institution": "University of Hong Kong"}, {"id": 184999, "fullname": "Yingxian Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184999?format=json", "institution": "University of Hong Kong"}, {"id": 185000, "fullname": "Wilton.W.T. Fok", "url": "http://cvpr.thecvf.com/api/miniconf/users/185000?format=json", "institution": null}, {"id": 88776, "fullname": "Xiaojuan Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/88776?format=json", "institution": "University of Oxford"}, {"id": 185001, "fullname": "Hayden Kwok-Hay So", "url": "http://cvpr.thecvf.com/api/miniconf/users/185001?format=json", "institution": "University of Hong Kong; University of Hong Kong"}], "abstract": "Multimodal Large Language Models (MLLMs) perform strong vision\u2013language reasoning under standard conditions but fail in extreme illumination, where RGB inputs lose irrevocable structure and semantics. We propose Event-MLLM, an event-enhanced model that performs all-light visual reasoning by dynamically fusing event streams with RGB frames. Two key components drive our approach: an Illumination Indicator -- a learnable signal derived from a DINOv2 branch that represents exposure degradation and adaptively modulates event-RGB fusion -- and an Illumination Correction Loss that aligns fused features with non-degraded (normal-light) semantics in the latent space, compensating for information lost in extreme lighting. We curate the first multi-illumination event-instruction corpus for MLLMs, with 2,241 event-RGB samples (around 6 QA pairs each) across diverse scenes and 17 brightness rates (0.05\u00d7 \u2013 20\u00d7), plus an instruct-following benchmark for reasoning, counting, and fine-grained recognition under extreme lighting. Experiments show that Event-MLLM markedly outperforms general-purpose, illumination-adaptive, and event-only baselines, setting a new state of the art in robust multimodal perception and reasoning under challenging illumination.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36422", "url": null, "sourceid": 39662, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37383, "uid": "bc15433f32b2e408dcf5f338935402af", "name": "PaNDaS: Learnable Shape Interpolation Modeling with Localized Control", "authors": [{"id": 180288, "fullname": "Thomas Besnier", "url": "http://cvpr.thecvf.com/api/miniconf/users/180288?format=json", "institution": "University of Copenhagen"}, {"id": 168736, "fullname": "Emery Pierson", "url": "http://cvpr.thecvf.com/api/miniconf/users/168736?format=json", "institution": "Lix, Polytechnique"}, {"id": 187308, "fullname": "Sylvain Arguillere", "url": "http://cvpr.thecvf.com/api/miniconf/users/187308?format=json", "institution": "Universit\u00e9 de Lille"}, {"id": 85376, "fullname": "Maks Ovsjanikov", "url": "http://cvpr.thecvf.com/api/miniconf/users/85376?format=json", "institution": "Ecole Polytechnique, France"}, {"id": 187309, "fullname": "Mohamed Daoudi", "url": "http://cvpr.thecvf.com/api/miniconf/users/187309?format=json", "institution": "IMT NORD EUROPE"}], "abstract": "We present PaNDaS, a novel deep learning framework for Partial Non-Rigid Deformations and interpolations of Surfaces (PaNDaS). PaNDaS learns a per-face feature field on the source mesh and fuses it with a global encoding of the target. A deformation generator predicts a Jacobian field and recovers a smooth displacement, enabling precise regional control, pose mixing, and transferable local edits. Unlike previous approaches, our method can restrict the deformations to specific parts of the shape in a versatile way. Across various human body part datasets, PaNDaS achieves state-of-the-art interpolation accuracy and stronger locality than methods based on global shape codes or handles, while remaining robust to remeshing. We demonstrate several localized shape manipulation tasks and show that our method can generate new shapes by combining different input deformations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37383", "url": null, "sourceid": 43223, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36623, "uid": "4d83f681638e40203e88cb7cb014b74f", "name": "Phrase-grounded APO for Improving Chest X-ray Report Generation", "authors": [{"id": 185494, "fullname": "Raziuddin Mahmood", "url": "http://cvpr.thecvf.com/api/miniconf/users/185494?format=json", "institution": "Rensselaer Polytechnic Institute"}, {"id": 185495, "fullname": "Tanveer Syeda-Mahmood", "url": "http://cvpr.thecvf.com/api/miniconf/users/185495?format=json", "institution": "Stanford University"}], "abstract": "The deployment of automatic radiology report generator (RRG) models in clinical workflows is being hampered by the lack of factual correctness in the produced reports. Existing methods to improve the report generators use alignment approaches that require pairs of ground truth preferred and dis-preferred responses. As these are not available at inference time in clinical workflows, new alignment methods are needed to improve report quality at inference time.  In this paper, we present a new phrase-grounded automatic preference optimization (APO) alignment method which offers such improvement during inference without needing additional ground truth.   Specifically, the method generates surrogate ground truth preference data for alignment automatically from the RRG model response itself though fact-checking and LLM-prompted correction. We also develop a novel APO loss function that combines preference response alignment loss with phrasal grounding loss paying attention to both the description of the finding and its image location.  We show that this method of alignment, on the average, improves the report quality at inference time by 30-40\\% across various SOTA report generators as tested on multi-institutional chest X-ray datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36623", "url": null, "sourceid": 31414, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38735, "uid": "2e428b7bc6471fe638cb2aee110b8032", "name": "LIBERO-Plus: A Progressive Robustness Benchmark for Visual-Language-Action Models", "authors": [{"id": 185992, "fullname": "Senyu Fei", "url": "http://cvpr.thecvf.com/api/miniconf/users/185992?format=json", "institution": "Tongji University &amp; Shanghai Innovation Institute"}, {"id": 185993, "fullname": "Siyin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185993?format=json", "institution": "Fudan University"}, {"id": 190544, "fullname": "Junhao Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/190544?format=json", "institution": "Fudan University"}, {"id": 190545, "fullname": "Zihao Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/190545?format=json", "institution": "Harbin Institute of Technology"}, {"id": 190546, "fullname": "Jikun Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/190546?format=json", "institution": "Fudan University"}, {"id": 183059, "fullname": "Pengfang Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/183059?format=json", "institution": "Fudan University"}, {"id": 185994, "fullname": "Li Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/185994?format=json", "institution": "Fudan University"}, {"id": 190547, "fullname": "Xinzhe He", "url": "http://cvpr.thecvf.com/api/miniconf/users/190547?format=json", "institution": "Fudan University"}, {"id": 185996, "fullname": "Shiduo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185996?format=json", "institution": "Fudan University"}, {"id": 190548, "fullname": "Zhaoye Fei", "url": "http://cvpr.thecvf.com/api/miniconf/users/190548?format=json", "institution": "Fudan University"}, {"id": 154480, "fullname": "Jinlan Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154480?format=json", "institution": "National University of Singapore"}, {"id": 185998, "fullname": "Jingjing Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/185998?format=json", "institution": "Shanghai Innovation Institute"}, {"id": 130875, "fullname": "Xipeng Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130875?format=json", "institution": "Fudan University"}], "abstract": "Visual\u2013Language\u2013Action (VLA) models report impressive success rates exceeding 95\\% on robotic manipulation benchmarks, yet these results may mask fundamental weaknesses in robustness. Current simulation-based robustness evaluations suffer from narrow perturbation coverage, manual design constraints, and coarse-grained analysis that fails to reveal when and how models fail. To address this gap, we propose LIBERO-Plus, a comprehensive, automatic, and fine-grained evaluation framework with controlled perturbations across seven dimensions: object layouts, camera viewpoints, robot initial states, language instructions, lighting conditions, background textures, and sensor noise. Our systematic analysis of ten state-of-the-art models reveals consistent brittleness beneath apparent competence, with performance dropping from 95\\% to below 30\\% under modest perturbations. Our findings challenge the assumption that high benchmark scores equate to true competency and highlight the need for evaluation practices that assess reliability under realistic variation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38735", "url": null, "sourceid": 40237, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39463, "uid": "7bd4fd545ca2f255550a8fd6a4722446", "name": "Teacher-Guided Routing for Sparse Vision Mixture-of-Experts", "authors": [{"id": 183113, "fullname": "Masahiro Kada", "url": "http://cvpr.thecvf.com/api/miniconf/users/183113?format=json", "institution": "Institute of Science Tokyo"}, {"id": 192125, "fullname": "Ryota Yoshihashi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192125?format=json", "institution": "LY Corporation; Tokyo Institute of Technology"}, {"id": 90853, "fullname": "Satoshi Ikehata", "url": "http://cvpr.thecvf.com/api/miniconf/users/90853?format=json", "institution": "NII, Tokyo Institute of Technology"}, {"id": 192126, "fullname": "Rei Kawakami", "url": "http://cvpr.thecvf.com/api/miniconf/users/192126?format=json", "institution": "Institute of Science Tokyo"}, {"id": 192127, "fullname": "Ikuro Sato", "url": "http://cvpr.thecvf.com/api/miniconf/users/192127?format=json", "institution": "Institute of Science Tokyo / Denso IT Laboratory, Inc."}], "abstract": "Recent progress in deep learning has been driven by increasingly large-scale models, but the resulting computational cost has become a critical bottleneck.Sparse Mixture of Experts (MoE) offers an effective solution by activating only a small subset of expert networks for each input, achieving high scalability with limited computation.Although effective, sparse MoE training exhibits characteristic optimization difficulties. Because the router receives gradients only from the experts it selects in each forward pass, its learning signal is highly localized, with little information about the broader expert space.This limited gradient feedback can lead the router toward suboptimal configurations, for example collapsing to only a few experts when no auxiliary losses are used, and it has also been associated with fluctuating expert selections during training. These behaviors suggest that task-driven signals alone do not provide sufficient guidance for learning robust routing behavior in sparse MoE.To address this issue, we propose TGR-MoE: Teacher-Guided Routing for Sparse Vision Mixture-of-Experts, a simple yet effective method that stabilizes router learning using supervision derived from a pretrained dense teacher model.TGR-MoE constructs a teacher router from the teacher's intermediate representations and uses its routing outputs as pseudo-supervision for the student router, suppressing frequent routing fluctuations during training and enabling knowledge-guided expert selection from the early stages of training.Extensive experiments on ImageNet-1K and CIFAR-100 demonstrate that TGR consistently improves both accuracy and routing consistency, while maintaining stable training even under highly sparse configurations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39463", "url": null, "sourceid": 39587, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37788, "uid": "9173003064218a5f9e5a2504b53210ad", "name": "HP-Edit: A Human-Preference Post-Training Framework for Image Editing", "authors": [{"id": 153255, "fullname": "Fan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/153255?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 172152, "fullname": "Chonghuinan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/172152?format=json", "institution": "Harbin Institute of Technology"}, {"id": 188268, "fullname": "Lina Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/188268?format=json", "institution": "Nankai University"}, {"id": 188269, "fullname": "Yuping Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188269?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 76441, "fullname": "Jiaqi Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76441?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 144150, "fullname": "Jiaxiu Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/144150?format=json", "institution": "Harbin Institute of Technology"}, {"id": 144139, "fullname": "Xinran Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/144139?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 188270, "fullname": "Zhikai Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188270?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 88204, "fullname": "Fenglong Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/88204?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 153256, "fullname": "Zhixin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153256?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 84715, "fullname": "Renjing Pei", "url": "http://cvpr.thecvf.com/api/miniconf/users/84715?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 84797, "fullname": "Wangmeng Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/84797?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Common image editing tasks typically adopt powerful generative diffusion models as the leading paradigm for real-world content editing. Meanwhile, although reinforcement learning (RL) methods such as Diffusion-DPO and Flow-GRPO have further improved generation quality, efficiently applying Reinforcement Learning from Human Feedback (RLHF) to diffusion-based editing remains largely unexplored, due to a lack of scalable human-preference datasets and frameworks tailored to diverse editing needs.To fill this gap, we propose HP-Edit, a post-training framework for \\textbf{H}uman \\textbf{P}reference-aligned \\textbf{Edit}ing, and introduce RealPref-50K, a real-world dataset across eight common tasks and balancing common object editing.Specifically, HP-Edit leverages a small amount of human-preference scoring data and a pretrained visual large language model (VLM) to develop HP-Scorer\u2014an automatic, human preference-aligned evaluator. We then use HP-Scorer both to efficiently build a scalable preference dataset and to serve as the reward function for post-training the editing model.We also introduce RealPref-Bench, a benchmark for evaluating real-world editing performance. Extensive experiments demonstrate that our approach significantly enhances models such as Qwen-Image-Edit-2509, aligning their outputs more closely with human-preferred results.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37788", "url": null, "sourceid": 44109, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36386, "uid": "193c41415729c936d40a3a927872bef3", "name": "DynamicTree: Interactive Real Tree Animation via Sparse Voxel Spectrum", "authors": [{"id": 184928, "fullname": "Yaokun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184928?format=json", "institution": "The Chinese University of Hong Kong; SUN YAT-SEN UNIVERSITY"}, {"id": 129030, "fullname": "Lihe Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/129030?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 107390, "fullname": "Xiao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/107390?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 105238, "fullname": "Guang Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/105238?format=json", "institution": "Sun Yat-sen University"}, {"id": 87471, "fullname": "Tianfan Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/87471?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "Generating dynamic and interactive 3D trees has wide applications in virtual reality, games, and world simulation. However, existing methods still face various challenges in generating structurally consistent and realistic 4D motion for complex real trees. In this paper, we propose DynamicTree, the first framework that can generate long-term, interactive 3D motion for 3DGS reconstructions of real trees. Unlike prior optimization-based methods, our approach generates dynamics in a fast feed-forward manner. The key success of our approach is the use of a compact sparse voxel spectrum to represent the tree movement. Given a 3D tree from Gaussian Splatting reconstruction, our pipeline first generates mesh motion using the sparse voxel spectrum and then binds Gaussians to deform the mesh. Additionally, the proposed sparse voxel spectrum can also serve as a basis for fast modal analysis under external forces, allowing real-time interactive responses. To train our model, we also introduce 4DTree, the first large-scale synthetic 4D tree dataset containing 8,786 animated tree meshes with semantic labels and 100-frame motion sequences. Extensive experiments demonstrate that our method achieves realistic and responsive tree animations, significantly outperforming existing approaches in both visual quality and computational efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36386", "url": null, "sourceid": 41760, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39286, "uid": "f1294e4013a5d5f467bdb6cde16adf35", "name": "Resolving the Stability-Plasticity Dilemma in Reinforcement Learning via Complementary Continual Critics", "authors": [{"id": 183095, "fullname": "Bo Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/183095?format=json", "institution": "Sun Yat-Sen University"}, {"id": 90413, "fullname": "Peixi Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90413?format=json", "institution": "Peking University"}, {"id": 105238, "fullname": "Guang Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/105238?format=json", "institution": "Sun Yat-sen University"}, {"id": 106076, "fullname": "Haoran Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/106076?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 184928, "fullname": "Yaokun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184928?format=json", "institution": "The Chinese University of Hong Kong; SUN YAT-SEN UNIVERSITY"}, {"id": 153528, "fullname": "Yiqian Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153528?format=json", "institution": "Harbin Institute of Technology"}, {"id": 191769, "fullname": "Shuaixian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191769?format=json", "institution": "Shenzhen Polytechnic University"}, {"id": 145618, "fullname": "Luntong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/145618?format=json", "institution": "Pengcheng laboratory"}], "abstract": "This paper proposes the Continual Dual-Critic with Cross-Attention (CD-CCA) framework for visual reinforcement learning to address the plasticity-stability conflict. Our method introduces continual learning techniques into the visual RL architecture, constructing two complementary critics using Continual Backpropagation (CBP) and Elastic Weight Consolidation (EWC) -- one for maintaining representational plasticity for rapid environmental adaptation, and the other for preserving knowledge stability to prevent catastrophic forgetting. Furthermore, we design a cross-attention based fusion mechanism that balances the value estimates from the dual critics according to observation characteristics. Experimental results on DeepMind Control and CARLA benchmarks show that CD-CCA effective mitigates issues of representation drift and policy degradation. Compared to existing visual RL methods, our approach exhibits enhanced robustness and adaptability in non-stationary environments and long-horizon decision-making tasks, providing a new architectural paradigm for the advancement of continual reinforcement learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39286", "url": null, "sourceid": 40727, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36448, "uid": "1b9b3786eae526755467c2593d194005", "name": "GraspGen-X: Cross-Embodiment 6-DOF Diffusion-based Grasping", "authors": [{"id": 90327, "fullname": "Beining Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/90327?format=json", "institution": "Department of Computer Science, Princeton University"}, {"id": 76010, "fullname": "Yu-Wei Chao", "url": "http://cvpr.thecvf.com/api/miniconf/users/76010?format=json", "institution": "NVIDIA"}, {"id": 185078, "fullname": "Erwin Coumans", "url": "http://cvpr.thecvf.com/api/miniconf/users/185078?format=json", "institution": "NVIDIA"}, {"id": 185079, "fullname": "Clemens Eppner", "url": "http://cvpr.thecvf.com/api/miniconf/users/185079?format=json", "institution": "NVIDIA"}, {"id": 90297, "fullname": "Jia Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90297?format=json", "institution": "Princeton University"}, {"id": 159437, "fullname": "Stan Birchfield", "url": "http://cvpr.thecvf.com/api/miniconf/users/159437?format=json", "institution": "NVIDIA"}, {"id": 137420, "fullname": "Adithya Murali", "url": "http://cvpr.thecvf.com/api/miniconf/users/137420?format=json", "institution": "NVIDIA"}], "abstract": "We study cross-embodiment 6-DOF robot grasping. Unlike prior works, we require the model not only to generalize to novel objects / scenes but also to novel gripper morphologies and physical grasping processes. Our method extends diffusion model based generative 6-DOF grasping models to condition on the additional gripper's representation. We propose a swept-volume heuristic for encoding the gripper. We train our cross-embodiment model with procedural grippers and a large-scale dataset of 395 Million grasps. In simulation experiments, our model has the best zero-shot generalization to novel real-world grippers and objects over baseline methods. Our model also serves as a good initialization for fine-tuning to adapt to novel grippers. In ablations, we demonstrate the efficiency of our sweep-volume gripper representation and our procedural gripper training dataset. Last, we show zero-shot generalization to real-world novel grippers for 6-DOF grasping, surpassing baselines in cross-embodiment generalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36448", "url": null, "sourceid": 42265, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36594, "uid": "992a6d18b2a148cf20d9014c3524aa11", "name": "One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers", "authors": [{"id": 131383, "fullname": "Moayed Haji Ali", "url": "http://cvpr.thecvf.com/api/miniconf/users/131383?format=json", "institution": "Rice University"}, {"id": 88383, "fullname": "Willi Menapace", "url": "http://cvpr.thecvf.com/api/miniconf/users/88383?format=json", "institution": "University of Trento"}, {"id": 87280, "fullname": "Ivan Skorokhodov", "url": "http://cvpr.thecvf.com/api/miniconf/users/87280?format=json", "institution": "KAUST"}, {"id": 185426, "fullname": "Dogyun Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/185426?format=json", "institution": "Snap Inc.; Korea University"}, {"id": 127013, "fullname": "Anil Kag", "url": "http://cvpr.thecvf.com/api/miniconf/users/127013?format=json", "institution": "Snap Inc."}, {"id": 129079, "fullname": "Michael Vasilkovsky", "url": "http://cvpr.thecvf.com/api/miniconf/users/129079?format=json", "institution": "Snap Inc."}, {"id": 85389, "fullname": "Sergey Tulyakov", "url": "http://cvpr.thecvf.com/api/miniconf/users/85389?format=json", "institution": "Snap Inc."}, {"id": 75466, "fullname": "Vicente Ordonez", "url": "http://cvpr.thecvf.com/api/miniconf/users/75466?format=json", "institution": "Rice University"}, {"id": 184760, "fullname": "Aliaksandr Siarohin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184760?format=json", "institution": "Snap Inc.; Snap Inc."}], "abstract": "Diffusion transformers (DiTs) achieve high generative quality but lock FLOPs to image resolution, limiting principled latency-quality trade-offs, and allocate computation uniformly across input spatial tokens, wasting resource allocation to unimportant regions. We introduce Elastic Latent Interface Transformer (ELIT), a drop-in, DiT-compatible mechanism that decouples input image size from compute. Our approach inserts a latent interface, a learnable variable-length token sequence on which standard transformer blocks can operate. Lightweight Read and Write cross-attention layers move information between spatial tokens and latents and prioritize important input regions. By training with random dropping of tail latents, ELIT learns to produce importance-ordered representations with earlier latents capturing global structure while later ones contain information to refine details. At inference, the number of latents can be dynamically adjusted to match compute constraints. ELIT is deliberately minimal, adding two cross-attention layers while leaving the rectified flow objective and the DiT stack unchanged. Across datasets and architectures (DiT, U-ViT, HDiT, MM-DiT), ELIT delivers consistent gains. On ImageNet-1K 512px, ELIT delivers an average gain of $35.3\\%$ and $39.6\\%$ in FID and FDD scores.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36594", "url": null, "sourceid": 34567, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38905, "uid": "ccb946faf2ff655ccffee7f306a81888", "name": "MOSAIC-GS: Monocular Scene Reconstruction via Advanced Initialization for Complex Dynamic Environments", "authors": [{"id": 180690, "fullname": "Svitlana Morkva", "url": "http://cvpr.thecvf.com/api/miniconf/users/180690?format=json", "institution": "Robotic Systems Lab, ETH Zurich"}, {"id": 126558, "fullname": "Vaishakh Patil", "url": "http://cvpr.thecvf.com/api/miniconf/users/126558?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 87265, "fullname": "Alessio Tonioni", "url": "http://cvpr.thecvf.com/api/miniconf/users/87265?format=json", "institution": "Google"}, {"id": 166991, "fullname": "Michael Oechsle", "url": "http://cvpr.thecvf.com/api/miniconf/users/166991?format=json", "institution": "Google"}, {"id": 190945, "fullname": "Maximum Wilder-Smith", "url": "http://cvpr.thecvf.com/api/miniconf/users/190945?format=json", "institution": "ETH Zurich - Robotic Systems Lab"}, {"id": 88665, "fullname": "Marco Hutter", "url": "http://cvpr.thecvf.com/api/miniconf/users/88665?format=json", "institution": "ETHZ - ETH Zurich"}], "abstract": "We present MOSAIC-GS, a novel, fully explicit, and computationally efficient approach for high-fidelity dynamic scene reconstruction from monocular videos using Gaussian Splatting.Monocular reconstruction is inherently ill-posed due to the lack of sufficient multiview constraints, making accurate recovery of object geometry and temporal coherence particularly challenging. To address this, we leverage multiple geometric cues, such as depth, optical flow, dynamic object segmentation, and point tracking. Combined with rigidity-based motion constraints, these cues allow us to estimate preliminary 3D scene dynamics during an initialization stage.Recovering scene dynamics prior to the photometric optimization reduces reliance on motion inference from visual appearance alone, which is often ambiguous in monocular settings.To enable compact representations, fast training, and real-time rendering while supporting non-rigid deformations, the scene is decomposed into static and dynamic components. Each Gaussian in the dynamic part of the scene is assigned a trajectory represented as time-dependent Poly-Fourier curve for parameter-efficient motion encoding.We demonstrate that MOSAIC-GS achieves substantially faster optimization and rendering compared to existing methods,while maintaining reconstruction quality on par with state-of-the-art approaches across standard monocular dynamic scene benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38905", "url": null, "sourceid": 32931, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38509, "uid": "f5ebe2235651644d7d4c166038960bfa", "name": "RunawayEvil: Jailbreaking the Image-to-Video Generative Models", "authors": [{"id": 100090, "fullname": "yueming lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/100090?format=json", "institution": "nanjing university"}, {"id": 190019, "fullname": "Rufan Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/190019?format=json", "institution": "nanjing university"}, {"id": 190020, "fullname": "Yueming Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190020?format=json", "institution": "Nanjing university"}, {"id": 190021, "fullname": "Qinglong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190021?format=json", "institution": "nanjing university"}, {"id": 190022, "fullname": "Linzhuang Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190022?format=json", "institution": "Nanjing University"}, {"id": 190023, "fullname": "Jie Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190023?format=json", "institution": "Meituan"}, {"id": 180264, "fullname": "Songhua Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180264?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 153687, "fullname": "Caifeng Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153687?format=json", "institution": "Nanjing University"}], "abstract": "Image-to-Video (I2V) generation represents a frontier in content creation, where models synthesize dynamic visual sequences by jointly reasoning from both image and text prompts. This multimodal grounding enables diverse controllability over video attributes. However, it is precisely this capability that introduces a critical security blind spot: by exploiting the interplay between visual and textual cues, attackers can launch multimodal jailbreak attacks that severely compromise output security. Despite the increasing implementation of security mechanisms in real-world I2V systems, such cross-modal threats remain unexplored. Existing attack methods remain confined to single-modal settings, relying solely on isolated text or image perturbations, which severely limits their effectiveness. To bridge this gap, we propose Runaway Evil, the first multimodal jailbreaking framework for I2V models with dynamic evolutionary capability. Built on a Strategy-Tactic-Action paradigm, our framework exhibits self-amplifying attack through three core components: (1) a strategy-aware command unit that enables the attack to self-evolve its strategies through reinforcement learning-driven strategy customization and large language model (LLM)-based strategy exploration; (2) a multimodal tactical planning unit that generates synergistic text jailbreak instructions and image tampering guidelines based on the selected strategies; and (3) an tactical action Unit executes and evaluates the coordinated attacks. This self-evolving architecture allows the framework to continuously adapt and intensify its attack strategies without human intervention. Extensive experiments demonstrate that Runaway Evil achieves state-of-the-art attack success rates on commercial I2V models, such as Open-Sora 2.0 and CogVideoX. This work provides a critical tool for probing and mitigating multimodal vulnerabilities, laying a foundation for building more robust video generation systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38509", "url": null, "sourceid": 38334, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39366, "uid": "bb8e9f780bd9dfae6288d1b15d13ffed", "name": "GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding", "authors": [{"id": 175112, "fullname": "Rong Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/175112?format=json", "institution": "Newcapec Electronic Co., Ltd."}, {"id": 191935, "fullname": "Kaiyan Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191935?format=json", "institution": "Tongji University"}, {"id": 191936, "fullname": "Minghao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191936?format=json", "institution": "Tongji University"}, {"id": 129890, "fullname": "Liuyi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129890?format=json", "institution": "Tongji University"}, {"id": 191937, "fullname": "KAI DAI", "url": "http://cvpr.thecvf.com/api/miniconf/users/191937?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 191938, "fullname": "Zhao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191938?format=json", "institution": "Newcapec Electronics"}], "abstract": "Video Temporal Grounding (VTG) is a critical task in video understanding and a key capability for extending Video Large Language Models (Vid-LLMs) to broader applications. However, existing Vid-LLMs rely on uniform frame sampling to extract video information, resulting in a sparse distribution of key frames and the loss of crucial temporal cues.To address this limitation, we propose Grounded Visual Token Sampling (GroundVTS), a Vid-LLM architecture that focuses on the most informative temporal segments. GroundVTS employs a fine-grained, query-guided mechanism to filter visual tokens before feeding them into the LLM, thereby preserving essential spatio-temporal information and maintaining temporal coherence. Futhermore, we introduce a progressive optimization strategy that enables the LLM to effectively adapt to the non-uniform distribution of visual features, enhancing its ability to model temporal dependencies and achieve precise video localization. We comprehensively evaluate our model on three standard VTG benchmarks, where GroundVTS outperforms state-of-the-art methods, achieving a +7.7\\% mIoU improvement on moment retrieval and +12.0\\% mAP on highlight detection.Code will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39366", "url": null, "sourceid": 43958, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37061, "uid": "f1b325ca803c9e81662f52cfd88dabbe", "name": "MatPedia: A Universal Generative Foundation for High-Fidelity Material Synthesis", "authors": [{"id": 186586, "fullname": "Di Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186586?format=json", "institution": "Nankai University"}, {"id": 186587, "fullname": "Shuhui Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186587?format=json", "institution": "Tencent Hunyuan 3D"}, {"id": 186588, "fullname": "Mingxin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186588?format=json", "institution": "Hunyuan"}, {"id": 145316, "fullname": "Jiawei Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/145316?format=json", "institution": "Nankai University"}, {"id": 186589, "fullname": "Yixuan Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186589?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 90390, "fullname": "Xintong Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/90390?format=json", "institution": "Huya Inc"}, {"id": 155060, "fullname": "Zhuo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/155060?format=json", "institution": null}, {"id": 145863, "fullname": "Beibei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/145863?format=json", "institution": "Nanjing University"}, {"id": 129664, "fullname": "Chunchao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/129664?format=json", "institution": "Tencent"}], "abstract": "Physically-based rendering (PBR) materials are fundamental to photorealistic graphics, yet their creation remains labor-intensive and requires specialized expertise. While generative models have advanced material synthesis, existing methods lack a unified representation bridging natural image appearance and PBR properties, leading to fragmented task-specific pipelines and inability to leverage large-scale RGB image data. We present \\textbf{MatPedia}, a foundation model built upon a novel joint RGB-PBR representation that compactly encodes materials into two interdependent latents: one for RGB appearance and one for the four PBR maps encoding complementary physical properties. By formulating them as a 5-channel sequence and employing video diffusion architectures, MatPedia naturally captures their correlations while transferring visual priors from RGB generation models. This joint representation enables a unified framework handling multiple material tasks\u2014text-to-material generation, image-to-material generation, and intrinsic decomposition\u2014within a single architecture. Trained on MatHybrid-410K, a mixed corpus combining PBR datasets with large-scale RGB images, MatPedia achieves native 1024\u00d71024 synthesis that substantially surpasses existing approaches in both quality and diversity.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37061", "url": null, "sourceid": 37226, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40063, "uid": "603a60c30fc626443b4652fd1f63de04", "name": "QueryMe: Query-Driven Open-Vocabulary 3D Object Affordances Grounding from Multimodal Evidence", "authors": [{"id": 168573, "fullname": "Weiyu Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/168573?format=json", "institution": "Harbin Institute of Technology"}, {"id": 193416, "fullname": "Ru Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193416?format=json", "institution": "Harbin Institute of Technology"}, {"id": 193417, "fullname": "Jiaqi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193417?format=json", "institution": "Harbin Institute of Technology"}, {"id": 193418, "fullname": "Sizhe Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193418?format=json", "institution": "Harbin Institute of Technology"}, {"id": 193419, "fullname": "Qinglin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193419?format=json", "institution": "Harbin Institute of Technology"}, {"id": 127493, "fullname": "Shengping Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127493?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Open-vocabulary 3D object affordance grounding aims to identify functional regions of objects given arbitrary semantic descriptions. However, existing methods often rely on fixed training categories and geometric priors, lacking geometric invariance and analogical reasoning capabilities. Since there exists a significant domain gap when transferring affordance knowledge learned from 2D images to 3D point clouds, existing methods struggle to generalize well to objects with diverse shapes or unseen categories, and fail to perform effective category reasoning.To address these challenges, we propose **QueryMe**, a **Query**-driven framework that learns from **M**ultimodal **e**vidence spaces to achieve open-vocabulary 3D affordance grounding.The proposed approach is to project human-object interaction images into 3D space, employ an Adaptive Spatial Attention module to focus on key interaction regions, and introduce a multimodal query structure to retrieve available geometrically consistent functional parts within the point cloud, effectively fusing visual, linguistic, and geometric cues.Leveraging attention-based query mechanisms, our method adaptively localizes affordance regions and performs analogy reasoning through geometric similarity, thereby exhibiting strong generalization to unseen scenes and objects. Experimental results demonstrate that QueryMe consistently outperforms state-of-the-art approaches, with the AUC improving by 4.19\\% compared to previous work for unseen affordance grounding tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40063", "url": null, "sourceid": 42095, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40121, "uid": "7b453137eec6a1d8f8a1528f8f40916e", "name": "Seeing Both Sides: Towards Bidirectional Semantic Alignment for Open-Vocabulary Camouflaged Object Segmentation", "authors": [{"id": 176877, "fullname": "Guohui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176877?format=json", "institution": "Dalian Minzu University"}, {"id": 184398, "fullname": "Fuming Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/184398?format=json", "institution": "Dalian Minzu University"}, {"id": 193576, "fullname": "Yu Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193576?format=json", "institution": "Dalian Minzu University"}, {"id": 193577, "fullname": "Yuqiu Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/193577?format=json", "institution": "Dalian Minzu University"}, {"id": 193578, "fullname": "Jing Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/193578?format=json", "institution": "Dalian Minzu University"}, {"id": 136820, "fullname": "Ganggang Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/136820?format=json", "institution": "Dalian Minzu University"}], "abstract": "Open-Vocabulary Camouflaged Object Segmentation (OVCOS) aims to segment camouflaged objects from unseen categories under textual guidance precisely. However, existing methods often employ a unidirectional interaction strategy, where textual prompts guide the matching of visual features. Such a design neglects the bidirectional interaction between visual and language modalities, making the model vulnerable to the semantic gap between image-level textual semantics and pixel-level segmentation cues, which in turn leads to severe semantic confusion in complex camouflaged scenarios. To address this challenge, we propose BaCLIP, a novel bidirectional semantic alignment framework for OVCOS. At its core lies the Mutual Refinement and Enhancement Module (MREM), which establishes bidirectional cross-attention between visual and textual features, enabling mutual semantic calibration to resolve ambiguity and strengthen cross-modal alignment. Moreover, we introduce an Adaptive Prompt that transforms refined textual embeddings into semantic-aware prompts for Segment Anything Model (SAM), enabling direct textual guidance and improving mask precision. Experimental results on the OVCamo benchmark demonstrate that BaCLIP consistently achieves state-of-the-art performance with a compact architecture, effectively mitigating semantic confusion and advancing the understanding of cross-modal camouflage perception.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40121", "url": null, "sourceid": 44087, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39443, "uid": "0569715221d9e6147085bca7324be293", "name": "CHAL: Causal-guided Hierarchical Anomaly-aware Learning for Moving Infrared Small Target Detection", "authors": [{"id": 180416, "fullname": "Weiwei Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/180416?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 155495, "fullname": "Luping Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/155495?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 192088, "fullname": "Shipeng Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/192088?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 192089, "fullname": "Sicheng Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192089?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 178407, "fullname": "Jianghong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/178407?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 128551, "fullname": "Mao Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/128551?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Infrared small target detection is one highly special category of object detection, faced with tiny target imaging size and cluttered backgrounds. Currently, almost all existing methods are target-centered, directly learning the target features from backgrounds. However, due to weak target signals, they are often difficult in effectively capturing stable features. Sometimes, they cannot even distinguish real targets from background confounders. To overcome these problems, from an opposite perspective, we propose the first Causal-guided Hierarchical Anomaly-aware Learning (CHAL) framework. Breaking through target-centered paradigm, it focuses on background learning, while the targets are handled as the anomalies in backgrounds. In detail, to fulfill the goal, a spatio-temporal neural field is designed to model the background evolution patterns from generative perspective. Meanwhile, a hierarchical anomaly-aware learning is proposed to decompose anomaly discovery. Furthermore, to block the spurious correlations often caused by background confounders, and enhance true target causality, a causal-guiding mechanism is designed. The experiments on three infrared datasets verify the effectiveness and superiority of our CHAL. Even in visible-light scenarios, it still possesses obvious adaptivity. Source code will be open.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39443", "url": null, "sourceid": 35678, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37473, "uid": "831979cdc82239bb411794d618b7cfbb", "name": "PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training", "authors": [{"id": 187538, "fullname": "Weifu Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187538?format=json", "institution": "Tencent"}, {"id": 130735, "fullname": "Jinyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/130735?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 71543, "fullname": "Bin-Bin Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/71543?format=json", "institution": "Tencent"}, {"id": 126169, "fullname": "Jialin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/126169?format=json", "institution": "Tencent"}, {"id": 126193, "fullname": "Yuhuan Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/126193?format=json", "institution": "Tencent Youtu Lab"}, {"id": 90204, "fullname": "Hanqiu Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90204?format=json", "institution": "University of Alberta"}, {"id": 130746, "fullname": "Wenbing Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/130746?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 88620, "fullname": "Yong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88620?format=json", "institution": "Tencent Youtu Lab"}, {"id": 86912, "fullname": "Chengjie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86912?format=json", "institution": "Tencent Youtu Lab; Shanghai Jiao Tong University"}], "abstract": "Open-Set Object Detection (OSOD) enables recognition of novel categories beyond fixed classes but faces challenges in aligning text representations with complex visual concepts and the scarcity of image-text paired samples for rare categories. This results in suboptimal performance in specialized domains or with complex objects. Recent visual-prompted methods partially address these issues but often involve complex multi-modal designs and multi-stage optimizations, extending the development cycle. Additionally, effective training strategies for data-driven OSOD models remain largely unexplored. To address these challenges, we propose PET-DINO, a universal object detector supporting both text and visual prompts. Our visual prompt generation scheme builds on an advanced text-prompted detector, addressing the limitations of text representation guidance and reducing the development cycle. We introduce two prompt-enriched training strategies: Intra-Batch Parallel Prompting (IBP) at the iteration level and Dynamic Memory-Driven Prompting (DMD) at the overall training level. These strategies enable simultaneous modeling of multiple prompt routes, parallel alignment with diverse real-world usage scenarios, and improved classification. Extensive experiments demonstrate that our visual prompt generation scheme, based on text-prompt-based detection pretraining, achieves a higher performance ceiling compared to using visual prompts alone.Our method achieves significant zero-shot detection performance on COCO, LVIS, and ODinW, and excels across various prompt-based detection protocols. In-domain evaluations also demonstrate robust localization performance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37473", "url": null, "sourceid": 40971, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39184, "uid": "8a6a6209090b90fbc29026916155dec2", "name": "Beyond Ground-Truth: Leveraging Image Quality Priors for Real-World Image Restoration", "authors": [{"id": 181915, "fullname": "Fengyang Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/181915?format=json", "institution": "Duke University"}, {"id": 191525, "fullname": "Peng Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191525?format=json", "institution": null}, {"id": 191526, "fullname": "Lei Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191526?format=json", "institution": "EPFL - EPF Lausanne"}, {"id": 191527, "fullname": "XingE Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/191527?format=json", "institution": "Duke University"}, {"id": 191528, "fullname": "Guanyi Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/191528?format=json", "institution": "National University of Singapore"}, {"id": 191529, "fullname": "Yuqi Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191529?format=json", "institution": "Tsinghua University"}, {"id": 127801, "fullname": "Chengyu Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127801?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 145704, "fullname": "Rihan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/145704?format=json", "institution": "Anhui University"}, {"id": 69456, "fullname": "Chunming He", "url": "http://cvpr.thecvf.com/api/miniconf/users/69456?format=json", "institution": "Duke University"}, {"id": 87009, "fullname": "Sina Farsiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87009?format=json", "institution": "Duke University"}], "abstract": "Real-world image restoration aims to restore high-quality (HQ) images from degraded low-quality (LQ) inputs captured under uncontrolled conditions. Existing methods typically depend on ground-truth (GT) supervision, assuming that GT provides perfect reference quality. However, GT can still contain images with inconsistent perceptual fidelity, causing models to converge to the average quality level of the training data rather than achieving the highest perceptual quality attainable.To address these problems, we propose a novel framework, termed \\textbf{IQPIR}, that introduces an Image Quality Prior (IQP)\u2014extracted from pre-trained No-Reference Image Quality Assessment (NR-IQA) models\u2014to guide the restoration process toward perceptually optimal outputs explicitly. Our approach synergistically integrates IQP with a learned codebook prior through three key mechanisms:(1) a \\textbf{quality-conditioned Transformer}, where NR-IQA-derived scores serve as conditioning signals to steer the predicted representation toward maximal perceptual quality. This design provides a plug-and-play enhancement compatible with existing restoration architectures without structural modification; and(2) a \\textbf{dual-branch codebook structure}, which disentangles common and HQ-specific features, ensuring a comprehensive representation of both generic structural information and quality-sensitive attributes; and (3) {a \\textbf{discrete representation-based quality optimization strategy}, which mitigates over-optimization effects commonly observed in continuous latent spaces.}Extensive experiments on real-world image restoration demonstrate that our method not only surpasses cutting-edge methods but also serves as a generalizable quality-guided enhancement strategy for existing methods. The code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39184", "url": null, "sourceid": 41583, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39605, "uid": "bdb39ff93e597d32c796a93733593fa4", "name": "Towards Cross-Modal Preservation, Consistency and Alignment for Privacy-Preserving Visible-Infrared Person Re-Identification", "authors": [{"id": 192461, "fullname": "Yudi Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/192461?format=json", "institution": "Wuhan University"}, {"id": 191791, "fullname": "Zhongao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191791?format=json", "institution": "Wuhan University"}, {"id": 130757, "fullname": "Bin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130757?format=json", "institution": "Wuhan University"}, {"id": 192462, "fullname": "Zhenghan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192462?format=json", "institution": "Wuhan University"}, {"id": 76422, "fullname": "Mang Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/76422?format=json", "institution": "Wuhan University"}], "abstract": "Privacy-preserving Person Re-Identification (PP-ReID) addresses the core privacy-utility trade-off in Re-ID by retrieving a person across multiple non-overlapping cameras while applying anonymization techniques to protect sensitive information. However, prior PP-ReID studies are confined to single-modality visible scenarios, as 24-hour surveillance systems require robust cross-modal visible-infrared (VI) capabilities. Extending PP-ReID to the cross-modal VI setting is therefore crucial for 24-hour surveillance. Accordingly, we introduce a new task: Privacy-Preserving Visible-Infrared Person Re-Identification (PP-VI-ReID). This task presents two severe challenges: 1) Crude anonymization strategies destroy identity-critical information and disrupt cross-modal alignment 2) The anonymization process creates inconsistent distortions across modalities. It disrupts color-based textures in visible images while obscuring thermal contours in infrared images. This inconsistency with modality gap forms a Mixed Gap. To overcome these challenges, we propose a framework, the Precise Privacy-preserving and Alignment Network (PPA) with two components: 1) A Keypoint-Preserving Regularization (KPR) module leverages human pose as a prior to guide structure-aware anonymization, preserving essential body features. 2) A Differential Consistency-guided Modality Alignment (DCMA) module. It treats anonymization perturbations not as varying noise, but as a stable, learnable offset, facilitating robust alignment between raw and anonymized features across modalities. Experiments on SYSU-MM01 and RegDB validate our framework, establishing a strong baseline for this task. The source code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39605", "url": null, "sourceid": 46150, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37024, "uid": "72e81de94dfc0373b006ca75e9c851a1", "name": "SIGMA: A Physics-Informed Benchmark for Gas Chimney Understanding in Seismic Images", "authors": [{"id": 186512, "fullname": "Bao Truong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186512?format=json", "institution": "FPT"}, {"id": 186513, "fullname": "Quang Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186513?format=json", "institution": "FPT"}, {"id": 127983, "fullname": "Baoru Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127983?format=json", "institution": "University College London, University of London"}, {"id": 186514, "fullname": "Jinpei Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/186514?format=json", "institution": "Imperial College London"}, {"id": 186515, "fullname": "Van Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186515?format=json", "institution": "FPT Software AI Center"}, {"id": 89972, "fullname": "Ngan Le", "url": "http://cvpr.thecvf.com/api/miniconf/users/89972?format=json", "institution": "University of Arkansas, Fayetteville"}, {"id": 186516, "fullname": "Minh-Tan Pham", "url": "http://cvpr.thecvf.com/api/miniconf/users/186516?format=json", "institution": "Universit\u00e9 de Bretagne Sud"}, {"id": 186517, "fullname": "Doan Hien", "url": "http://cvpr.thecvf.com/api/miniconf/users/186517?format=json", "institution": null}, {"id": 87853, "fullname": "Anh Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87853?format=json", "institution": "University of Liverpool"}], "abstract": "Seismic images reconstruct subsurface reflectivity from field recordings, guiding exploration and reservoir monitoring. Gas chimneys are vertical anomalies caused by subsurface fluid migration. Understanding these phenomena is crucial for assessing hydrocarbon potential and avoiding drilling hazards. However, accurate detection is challenging due to strong seismic attenuation and scattering. Traditional physics-based methods are computationally expensive and sensitive to model errors, while deep learning offers efficient alternatives, yet lacks labeled datasets. In this work, we introduce \\textbf{SIGMA}, a new physics-informed dataset for gas chimney understanding in seismic images, featuring (i) pixel-level gas-chimney mask for detection and (ii) paired degraded and ground-truth image for enhancement. We employed physics-based methods that cover a wide range of geological settings and data acquisition conditions. Comprehensive experiments demonstrate that SIGMA serves as a challenging benchmark for gas chimney interpretation and benefits general seismic understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37024", "url": null, "sourceid": 42293, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36563, "uid": "d94101ad400cbaf86b951c78f8172039", "name": "MeshRipple: Structured Autoregressive Generation of Artist-Meshes", "authors": [{"id": 174823, "fullname": "JunKai Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/174823?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 185355, "fullname": "Hang Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/185355?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 185356, "fullname": "Huipeng Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185356?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 185357, "fullname": "Jielei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185357?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 185358, "fullname": "JiaYi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185358?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 185359, "fullname": "Tianle Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185359?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 185360, "fullname": "Yang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185360?format=json", "institution": "Huazhong University of Science and Technology; Beijing Institute of Mathematical Sciences and Applications"}, {"id": 185361, "fullname": "Jianwen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185361?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 185362, "fullname": "Wenxiao ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/185362?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 75895, "fullname": "Matthias Nie\u00dfner", "url": "http://cvpr.thecvf.com/api/miniconf/users/75895?format=json", "institution": "Technical University of Munich"}, {"id": 90421, "fullname": "Wei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90421?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Meshes serve as a primary representation for 3D assets. Autoregressive mesh generators serialize faces into sequences and train on truncated segments with sliding-window inference to cope with memory limits. However, this mismatch breaks long-range geometric dependencies, producing holes and fragmented components. To address this critical limitation, we introduce MeshRipple, which expands a mesh outward from an active generation frontier, akin to a ripple on a surface.MeshRipple rests on three key innovations: a frontier-aware BFS tokenization that aligns the generation order with surface topology; an expansive prediction strategy that maintains coherent, connected surface growth; and a sparse-attention global memory that provides an effectively unbounded receptive field to resolve long-range topological dependencies.This integrated design enables MeshRipple to generate meshes with high surface fidelity and topological completeness, outperforming strong recent baselines.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36563", "url": null, "sourceid": 40904, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38594, "uid": "014cec6ea89e4bfbcb911f6845a96d55", "name": "EmoStyle: Emotion-Driven Image Stylization", "authors": [{"id": 132363, "fullname": "Jingyuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/132363?format=json", "institution": "Shenzhen University"}, {"id": 190239, "fullname": "Zihuan Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/190239?format=json", "institution": "Shenzhen University"}, {"id": 85724, "fullname": "Hui Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85724?format=json", "institution": "Shenzhen University"}], "abstract": "Art has long been a profound medium for expressing emotions.While existing image stylization methods effectively transform visual appearance, they often overlook the emotional impact carried by styles.To bridge this gap, we introduce Affective Image Stylization (AIS), a task that applies artistic styles to evoke specific emotions while preserving content.We present EmoStyle, a framework designed to address key challenges in AIS, including the lack of training data and the emotion\u2013style mapping.First, we construct EmoStyleSet, a content-emotion-stylized image triplet dataset derived from ArtEmis to support AIS.We then propose an Emotion\u2013Content Reasoner that adaptively integrates emotional cues with content to learn coherent style queries.Given the discrete nature of artistic styles, we further develop a Style Quantizer that converts continuous style features into emotion-related codebook entries.Extensive qualitative and quantitative evaluations, including user studies, demonstrate that EmoStyle enhances emotional expressiveness while maintaining content consistency.Moreover, the learned emotion-aware style dictionary is adaptable to other generative tasks, highlighting its potential for broader applications.Our work establishes a foundation for emotion-driven image stylization, expanding the creative potential of AI-generated art.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38594", "url": null, "sourceid": 45329, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37135, "uid": "6249187e66247d5d4542c409af2f0f5c", "name": "DiverseGRPO: Mitigating Mode Collapse in Image Generation via Diversity-Aware GRPO", "authors": [{"id": 180657, "fullname": "Henglin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180657?format=json", "institution": "Tsinghua University"}, {"id": 185264, "fullname": "Huijuan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185264?format=json", "institution": "Kuaishou- \u5feb\u624b\u79d1\u6280"}, {"id": 180176, "fullname": "Jing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180176?format=json", "institution": "Sun Yat-sen University"}, {"id": 89019, "fullname": "Chang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89019?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 90909, "fullname": "Xiu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/90909?format=json", "institution": "Tsinghua University"}, {"id": 89145, "fullname": "Xiangyang Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/89145?format=json", "institution": "Tsinghua University"}], "abstract": "Reinforcement learning (RL), particularly GRPO, improves image generation quality significantly by comparing the relative performance of images generated within the same group. However, in the later stages of training, the model tends to produce homogenized outputs, lacking creativity and visual diversity, restricting the application scenarios of the model.This issue can be analyzed from both reward modeling and generation dynamics perspectives. First, traditional GRPO relies on single-sample quality as the reward signal, driving the model to converge toward a few high-reward generation modes while neglecting distribution-level diversity. Second, conventional GRPO regularization neglects the dominant role of early-stage denoising in preserving diversity, causing a misaligned regularization budget that limits the achievable quality\u2013diversity trade-off.Motivated by these insights, we revisit the diversity degradation problem from both reward modeling and generation dynamics. At the reward level, we propose a distributional creativity bonus based on semantic grouping. Specifically, we construct a distribution-level representation via spectral clustering over samples generated from the same caption, and adaptively allocate exploratory rewards according to group sizes to encourage the discovery of novel visual modes. At the generation level, we introduce a structure-aware regularization, which enforces stronger early-stage constraints to preserve diversity without compromising reward optimization efficiency. Experiments demonstrate that our method achieves an 13\\%$\\sim$18\\% improvement in semantic diversity under matched quality scores, establishing a new Pareto frontier between image quality and diversity for GRPO-based image generation.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37135", "url": null, "sourceid": 39227, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36367, "uid": "625cbb43db2928b8396df846aed9456a", "name": "MapRoute:Precise-Concept Erasing Mappers via Semantic Routing", "authors": [{"id": 184888, "fullname": "Sihao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184888?format=json", "institution": "Harbin Institute of Technology"}, {"id": 159819, "fullname": "Baixi Baixi", "url": "http://cvpr.thecvf.com/api/miniconf/users/159819?format=json", "institution": "Harbin Institute of Technology"}, {"id": 184889, "fullname": "Shuohong Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/184889?format=json", "institution": "Harbin Institute of Technology"}, {"id": 184890, "fullname": "Yunyun Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184890?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}], "abstract": "Contemporary commercial and open-source diffusion models have demonstrated remarkable performance in text-to-image generation, enabling widespread applications in creative design and content creation. However, legitimate requirements\u2014such as copyright protection, privacy compliance, or personalized customization\u2014often necessitate the removal of specific semantic concepts from pretrained models. Existing concept erasure methods suffer from two critical limitations: (1) **Incomplete suppression**, where the model still occasionally generates images containing the target concept; (2) **Poor semantic selectivity**, which degrades the generation quality of unrelated concepts and compromises overall model utility.To address these challenges, we propose **`MapRoute`**, a lightweight, semantics-aware concept erasure framework based on dynamic routing. Our approach introduces a set of modular components\u2014termed *Mappers*\u2014placed after a frozen pretrained text encoder. Each Mapper learns a linear mapping from a target concept to a surrogate concept. During inference, the system dynamically activates the top-$K$ Mappers most relevant to the input prompt, based on cosine similarity between the text embedding and all the target concept embeddings, and applies their transformations sequentially. This input-driven, modular intervention enables precise, on-demand erasure while avoiding unnecessary interference with irrelevant semantics.Extensive experiments demonstrate that **`MapRoute`** effectively suppresses specified concepts while significantly reducing collateral damage to unrelated concept. By operating without full-model fine-tuning, our method entirely avoids parameter drift and concept erosion.  Moreover, **`MapRoute`** outperforms state-of-the-art baselines in terms of generation fidelity, semantic consistency, and scalability to multi-concept erasure scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36367", "url": null, "sourceid": 38333, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38554, "uid": "d037731d1da83efffdb92ce25f124a8c", "name": "VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm", "authors": [{"id": 190121, "fullname": "Zhenkai Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190121?format=json", "institution": "Zhejiang University"}, {"id": 190122, "fullname": "Xiaowen Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/190122?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 183938, "fullname": "ZHENLIANG NI", "url": "http://cvpr.thecvf.com/api/miniconf/users/183938?format=json", "institution": "Huawei"}, {"id": 190123, "fullname": "Dengming Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190123?format=json", "institution": "Zhejiang University"}, {"id": 189422, "fullname": "Han Shu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189422?format=json", "institution": "Huawei Technologies"}, {"id": 190124, "fullname": "Xin Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190124?format=json", "institution": null}, {"id": 154103, "fullname": "Xinghao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/154103?format=json", "institution": "Huawei Noah&#x27;s Ark Lab"}], "abstract": "Vision\u2013language models (VLMs) excel at image understanding tasks, but the large number of visual tokens imposes significant computational costs, hindering deployment on mobile devices. Many pruning methods rely solely on token importance and thus overlook inter-token redundancy, retaining numerous duplicated tokens and wasting capacity. Although some redundancy-aware approaches have been proposed, they often ignore the spatial relationships among visual tokens. This can lead to overly sparse selections of retained tokens that fail to adequately cover the regions of target objects.To address these limitations, we propose VLM-Pruner, a training-free token pruning algorithm that explicitly balances redundancy and spatial sparsity. We introduce a Centrifugal Token Pruning Paradigm (CTPP) to prioritize the preservation of fine-grained object details. Leveraging CTPP, we design a Buffering for Spatial Sparsity (BSS) criterion that defers the selection of spatially distant tokens. We further adopt a parallel greedy strategy to conduct token selection efficiently. To mitigate information loss from pruning, we selectively fuse salient information from the discarded tokens into the retained ones.Comprehensive comparisons demonstrate that VLM-Pruner consistently outperforms strong baselines across five VLMs with an 88.9\\% pruning rate, while also delivering an end-to-end inference speedup. The code is available in Supplementary Material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38554", "url": null, "sourceid": 37504, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36522, "uid": "aafa7c874d0d45e91e3712364dd8e574", "name": "Thinking with Frames: Generative Video Distortion Evaluation via Frame Reward Model", "authors": [{"id": 144454, "fullname": "Yuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/144454?format=json", "institution": "University of Science and Technology of China"}, {"id": 185263, "fullname": "Borui Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185263?format=json", "institution": "Kuaishou"}, {"id": 185264, "fullname": "Huijuan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185264?format=json", "institution": "Kuaishou- \u5feb\u624b\u79d1\u6280"}, {"id": 185265, "fullname": "Jinda Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185265?format=json", "institution": "University of Science and Technology of China"}, {"id": 89355, "fullname": "Ouxiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/89355?format=json", "institution": "University of Science and Technology of China"}, {"id": 156859, "fullname": "Kuien Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156859?format=json", "institution": "Institute of Software Chinese Academy of Sciences"}, {"id": 180129, "fullname": "Meng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180129?format=json", "institution": "Kling AI"}, {"id": 129139, "fullname": "Xiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129139?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Recent advances in video reward models and post-training strategies have improved text-to-video (T2V) generation. While these models typically assess visual quality, motion quality, and text alignment, they often overlook key structural distortions, such as abnormal object appearances and interactions, which can degrade the overall quality of the generative video.To address this gap, we introduce REACT, a frame-level reward model designed specifically for structural distortions evaluation in generative videos. REACT assigns point-wise scores and attribution labels by reasoning over video frames, focusing on recognizing distortions. To support this, we construct a large-scale human preference dataset, annotated based on our proposed taxonomy of structural distortions, and generate additional data using a efficient Chain-of-Thought (CoT) synthesis pipeline. REACT is trained with a two-stage framework: ((1) supervised fine-tuning with masked loss for domain knowledge injection, followed by (2) reinforcement learning with Group Relative Policy Optimization (GRPO) and pairwise rewards to enhance reasoning capability and align output scores with human preferences. During inference, a dynamic sampling mechanism is introduced to focus on frames most likely to exhibit distortion.We also present REACT-Bench, a benchmark for generative video distortion evaluation. Experimental results demonstrate that REACT complements existing reward models in assessing structural distortion, achieving both accurate quantitative evaluations and interpretable attribution analysis.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36522", "url": null, "sourceid": 36727, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37163, "uid": "13d29e6e8cf1bcf427da3e7bd696a73f", "name": "Active Inference for Micro-Gesture Recognition: EFE-Guided Temporal Sampling and Adaptive Learning", "authors": [{"id": 186810, "fullname": "Weijia Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186810?format=json", "institution": "Tianjin Normal University"}, {"id": 174313, "fullname": "Jingyu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174313?format=json", "institution": "Tianjin Normal University"}, {"id": 186811, "fullname": "Ruojia Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186811?format=json", "institution": "Tianjin Normal University"}, {"id": 186812, "fullname": "Fengtao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/186812?format=json", "institution": "Tianjin Normal University"}, {"id": 186813, "fullname": "Qian Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186813?format=json", "institution": "Tianjin Normal University"}, {"id": 186814, "fullname": "Chenyang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186814?format=json", "institution": "Shenzhen University"}, {"id": 144338, "fullname": "tongtong Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/144338?format=json", "institution": "Tianjin Normal University"}, {"id": 186815, "fullname": "Jia Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186815?format=json", "institution": null}, {"id": 186816, "fullname": "Xiaobai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186816?format=json", "institution": "Zhejiang University"}, {"id": 186817, "fullname": "Minglai Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186817?format=json", "institution": "Tianjin University"}], "abstract": "Micro-gestures are subtle and transient movements triggered by unconscious neural and emotional activities, holding great potential for human\u2013computer interaction and clinical monitoring. However, their low amplitude, short duration, and strong inter-subject variability make existing deep models prone to degradation under low-sample, noisy, and cross-subject conditions. This paper presents an active inference\u2013based framework for micro-gesture recognition, featuring Expected Free Energy (EFE)-guided temporal sampling and uncertainty-aware adaptive learning. The model actively selects the most discriminative temporal segments under EFE guidance, enabling dynamic observation and information gain maximization. Meanwhile, sample weighting driven by predictive uncertainty mitigates the effects of label noise and distribution shift. Experiments on the SMG dataset demonstrate the effectiveness of the proposed method, achieving consistent improvements across multiple mainstream backbones. Ablation studies confirm that both the EFE-guided observation and the adaptive learning mechanism are crucial to the performance gains. This work offers an interpretable and scalable paradigm for temporal behavior modeling under low-resource and noisy conditions, with broad applicability to wearable sensing, HCI, and clinical emotion monitoring.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37163", "url": null, "sourceid": 32654, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39488, "uid": "b28d9123c8c2bb408428a90d5598906f", "name": "$\\textbf{FailureAtlas}$: Mapping the Failure Landscape of T2I Models via Active Exploration", "authors": [{"id": 192179, "fullname": "Muxi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192179?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 192180, "fullname": "Zhaohua Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192180?format=json", "institution": "Dalian University of Technology"}, {"id": 192181, "fullname": "Chenchen Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192181?format=json", "institution": "Department of Computer Science and Engineering, The Chinese University of Hong Kong"}, {"id": 85509, "fullname": "Mingyang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/85509?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 192182, "fullname": "Wenyu Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192182?format=json", "institution": "Nanjing University"}, {"id": 192183, "fullname": "Tianwen Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192183?format=json", "institution": "Tencent"}, {"id": 192184, "fullname": "Jianhuan Zhuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192184?format=json", "institution": null}, {"id": 192185, "fullname": "Yutang Yutang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192185?format=json", "institution": "Tencent AI Lab"}, {"id": 192186, "fullname": "Qiuyong Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192186?format=json", "institution": null}, {"id": 192187, "fullname": "Jihong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192187?format=json", "institution": null}, {"id": 89274, "fullname": "Qiang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89274?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "Static benchmark-driven evaluation has provided a valuable foundation for analyzing Text-to-Image (T2I) models.However, the fixed and predetermined prompt sets in benchmarks inherently limit diagnostic depth, making it difficult to uncover the full landscape of models' systematic failures or isolate their root causes.We argue for a complementary paradigm: $\\textbf{active exploration}$, and introduce $\\textbf{FailureAtlas}$, the first framework designed to autonomously explore and map the vast failure landscapes of T2I models at scale.Unlike benchmarks that evaluate a fixed prompt set, $\\textbf{FailureAtlas}$ performs guided exploration in the input space, framing error discovery as a structured search for minimal, failure-inducing concepts. While this is a computationally explosive problem, we make it tractable with novel acceleration techniques. When applied to Stable Diffusion models, our method uncovers hundreds of thousands of previously unknown error slices (e.g., over 247,000 in SD1.5 alone) and provides the first large-scale evidence linking these failures to data scarcity in the training set. By providing a principled and scalable engine for deep model auditing, $\\textbf{FailureAtlas}$ establishes a new, diagnostic-first methodology to guide the development of more robust generative AI.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39488", "url": null, "sourceid": 40410, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36181, "uid": "c5677f71b1968b865a1570e182b7a18e", "name": "RefTON: Person-to-Person Virtual Try-On with Unpaired Visual References", "authors": [{"id": 184349, "fullname": "Liuzhuozheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184349?format=json", "institution": "University of Tokyo"}, {"id": 155922, "fullname": "Yue Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/155922?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 102638, "fullname": "Shanyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/102638?format=json", "institution": "360 AI Institute"}, {"id": 184350, "fullname": "Zanyi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184350?format=json", "institution": "University of California, San Diego"}, {"id": 158676, "fullname": "Dengyang Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158676?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 184351, "fullname": "Liebucha Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184351?format=json", "institution": "Qihoo 360"}, {"id": 184352, "fullname": "Bo Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184352?format=json", "institution": "Qihoo 360"}, {"id": 184353, "fullname": "Yuhang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/184353?format=json", "institution": "Qihoo 360"}, {"id": 180658, "fullname": "Dawei Leng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180658?format=json", "institution": "360 AI Research"}, {"id": 184354, "fullname": "Yuhui Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184354?format=json", "institution": "360 AI Research"}], "abstract": "We introduce RefTON, a flux-based person-to-person virtual try-on framework that enhances garment realism through unpaired visual references. Unlike conventional approaches that rely on complex auxiliary inputs such as body parsing and warped mask or require finely designed extract branches to process various input conditions, RefTON streamlines the process by directly generating try-on results from a source image and a target garment, without the need for structural guidance or auxiliary components to handle diverse inputs. Moreover, inspired by human clothing selection behavior, RefTON leverages additional reference images (the target garment worn on different individuals) to provide powerful guidance for refining texture alignment and maintaining the garment details. To enable this capability, we built a dataset containing unpaired reference images for training. Extensive experiments on public benchmarks demonstrate that RefTON achieves competitive or superior performance compared to state-of-the-art methods, while maintaining a simple and efficient person-to-person design.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36181", "url": null, "sourceid": 32116, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36277, "uid": "15c08959e5e7b1930cb81c503058b4c4", "name": "When Token Pruning is Worse than Random: Understanding  Visual Token Information in VLLMs", "authors": [{"id": 181141, "fullname": "Yahong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181141?format=json", "institution": "Tongji University"}, {"id": 139139, "fullname": "Juncheng Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/139139?format=json", "institution": "Tongji University"}, {"id": 127716, "fullname": "Zhangkai Ni", "url": "http://cvpr.thecvf.com/api/miniconf/users/127716?format=json", "institution": "Tongji University"}, {"id": 184654, "fullname": "Longzhen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184654?format=json", "institution": "Tongji University"}, {"id": 184655, "fullname": "Yihang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184655?format=json", "institution": "Tongji University"}, {"id": 184656, "fullname": "Chengmei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184656?format=json", "institution": "Tongji University\uff1bCity University of Hong Kong"}, {"id": 73065, "fullname": "Ying Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/73065?format=json", "institution": "East China Normal University"}, {"id": 85653, "fullname": "Lianghua He", "url": "http://cvpr.thecvf.com/api/miniconf/users/85653?format=json", "institution": "Tongji University"}, {"id": 184657, "fullname": "Xianfeng Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184657?format=json", "institution": "Amazon"}, {"id": 184658, "fullname": "Hui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184658?format=json", "institution": "Amazon"}, {"id": 75508, "fullname": "Yuyin Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/75508?format=json", "institution": "UC Santa Cruz"}], "abstract": "Vision Large Language Models (VLLMs) incur high computational costs due to their reliance on hundreds of visual tokens to represent images. While token pruning offers a promising solution for accelerating inference, this paper, however, identifies a key observation: in deeper layers (e.g., beyond the 20th), existing training-free pruning methods perform no better than random pruning. We hypothesize that this degradation is caused by **\"vanishing token information''**, where visual tokens progressively lose their salience with increasing network depth. To validate this hypothesis, we quantify a token's information content by measuring the change in the model output probabilities upon its removal. Using this proposed metric, our analysis of the information of visual tokens across layers reveals three key findings: (1) As layers deepen, the information of visual tokens gradually becomes uniform and eventually vanishes at an intermediate layer, which we term as ``information horizon\", beyond which the visual tokens become redundant;(2) The position of this horizon is not static; it extends deeper for visually intensive tasks, such as Optical Character Recognition (OCR), compared to more general tasks like Visual Question Answering (VQA);(3) This horizon is also strongly correlated with model capacity, as stronger VLLMs (e.g., Qwen2.5-VL) employ deeper visual tokens than weaker models (e.g., LLaVA-1.5).Based on our findings, we show that simple random pruning in deep layers efficiently balances performance and efficiency. Moreover, integrating random pruning consistently enhances existing methods. Using DART with random pruning achieves state-of-the-art results, maintaining 93.9\\% of Qwen-2.5-VL-7B performance while pruning 50\\% of visual tokens.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36277", "url": null, "sourceid": 38523, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39539, "uid": "238f96611389c857dd9a1c99bf901eae", "name": "COPO: Causal-Oriented Policy Optimization for Hallucinations of MLLMs", "authors": [{"id": 184253, "fullname": "Peizheng Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/184253?format=json", "institution": "Institute of Software Chinese Academy of Sciences"}, {"id": 192297, "fullname": "Jingyao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192297?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 187189, "fullname": "Wenwen Qiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187189?format=json", "institution": "Institute of Software Chinese Academy of Sciences"}, {"id": 107125, "fullname": "Jiahuan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/107125?format=json", "institution": "Peking University"}, {"id": 192298, "fullname": "Changwen Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192298?format=json", "institution": "Institute of Software, Chinese Academy of Sciences"}, {"id": 85363, "fullname": "Gang Hua", "url": "http://cvpr.thecvf.com/api/miniconf/users/85363?format=json", "institution": "Wormpex AI Research"}], "abstract": "Despite Multimodal Large Language Models (MLLMs) having shown impressive capabilities, they may suffer from hallucinations. Empirically, we find that MLLMs attend disproportionately to task-irrelevant background regions compared with text-only LLMs, implying spurious background-answer correlations. We claim and analyze that (i) outcome-based rewards can be an important factor leading to spurious correlations, and (ii) spurious correlations can be an important factor leading to hallucinations. Based on these results, we propose Causal-Oriented Policy Optimization (COPO) to mitigate these spurious correlations, thus addressing the issue of hallucinations. It imposes token-level sufficiency and necessity constraints to measure each inference token's causal contribution, thus ensuring correct and evidence-grounded output. Specifically, we first evaluate each token's causal contribution via a newly proposed causal completeness reward. This reward is then used to construct a causally informed advantage function within the GRPO optimization framework, encouraging the model to focus on tokens that are causally sufficient and necessary for accurate generation. Experimental results across various benchmarks demonstrate the advantages of COPO.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39539", "url": null, "sourceid": 38789, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37478, "uid": "823e7bde3c29d63fc14b1178cfd8d997", "name": "VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning", "authors": [{"id": 187549, "fullname": "Boyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187549?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 187550, "fullname": "Zikang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187550?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 143191, "fullname": "Zhengrong Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/143191?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 187551, "fullname": "Kainan Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187551?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 187552, "fullname": "Chenyun Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187552?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 187553, "fullname": "Yi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187553?format=json", "institution": null}, {"id": 187554, "fullname": "Zijun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187554?format=json", "institution": "Department of Computer Science and Technology, Tsinghua University"}, {"id": 155443, "fullname": "Yafei Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/155443?format=json", "institution": "vivo"}, {"id": 155444, "fullname": "Xiaoxin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/155444?format=json", "institution": "vivo AI Lab"}, {"id": 128402, "fullname": "Yang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128402?format=json", "institution": "Tsinghua University"}, {"id": 128392, "fullname": "Peng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/128392?format=json", "institution": "Tsinghua University"}, {"id": 86624, "fullname": "Yali Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86624?format=json", "institution": "SIAT, Chinese Academy of Sciences"}], "abstract": "By leveraging tool-augmented Multimodal Large Language Models (MLLMs), multi-agent frameworks are driving progress in video understanding. However, most of them adopt static and non-learnable tool invocation mechanisms, which limit the discovery of diverse clues essential for robust perception and reasoning regarding temporally or spatially complex videos. To address this challenge, we propose a novel Multi-agent system for video understanding, namely VideoChat-M1. Instead of using a single and fixed policy, VideoChat-M1 adopts a distinct Collaborative Policy Planning (CPP) paradigm with multiple policy agents, which comprises three key processes.(1) Policy Generation: Each agent generates its unique tool invocation policy tailored to the user's query; (2) Policy Execution: Each agent sequentially invokes relevant tools to execute its policy and explore the video content; (3) Policy Communication: During the intermediate stages of policy execution, agents interact with one another to update their respective policies. Through this collaborative framework, all agents work in tandem, dynamically refining their preferred policies based on contextual insights from peers to effectively respond to the user\u2019s query. Moreover, we equip our CPP paradigm with a concise Multi-Agent Reinforcement Learning (MARL) method. Consequently, the team of policy agents can be jointly optimized to enhance VideoChat-M1's performance, guided by both the final answer reward and intermediate collaborative process feedback. Extensive experiments demonstrate that VideoChat-M1 achieves SOTA performance across eight benchmarks spanning four tasks. Notably, on LongVideoBench, our method outperforms the top baseline by 3.6% and surpasses GPT-4o by 15.6%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37478", "url": null, "sourceid": 33205, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37790, "uid": "a286be2b8cb6de66943d8025b3fa7e33", "name": "LaMoGen: Language to Motion Generation Through LLM-Guided Symbolic Inference", "authors": [{"id": 179904, "fullname": "Junkun JIANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/179904?format=json", "institution": "Hong Kong Baptist University"}, {"id": 188272, "fullname": "Ho Yin Au", "url": "http://cvpr.thecvf.com/api/miniconf/users/188272?format=json", "institution": "Hong Kong Baptist University"}, {"id": 148843, "fullname": "Jingyu Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/148843?format=json", "institution": "Hong Kong Baptist University"}, {"id": 188273, "fullname": "Jie Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188273?format=json", "institution": "Hong Kong Baptist University"}], "abstract": "Human motion is highly expressive and naturally aligned with language, yet prevailing methods relying heavily on joint text-motion embeddings struggle to synthesize temporally accurate, detailed motions and often lack explainability. To address these limitations, we introduce LabanLite, a motion representation developed by adapting and extending the Labanotation system. Unlike black-box text\u2013motion embeddings, LabanLite encodes each atomic body-part action (e.g., a single left-foot step) as a discrete Laban symbol paired with a textual template. This abstraction decomposes complex motions into interpretable symbol sequences and body-part instructions, establishing a symbolic link between high-level language and low-level motion trajectories. Building on LabanLite, we present LaMoGen, a Text-to-LabanLite-to-Motion Generation framework that enables large language models (LLMs) to compose motion sequences through symbolic reasoning. The LLM interprets motion patterns, relates them to textual descriptions, and recombines symbols into executable plans, producing motions that are both interpretable and linguistically grounded. To support rigorous evaluation, we introduce a Labanotation-based benchmark with structured description\u2013motion pairs and three metrics that jointly measure text\u2013motion alignment across symbolic, temporal, and harmony dimensions. Experiments demonstrate that LaMoGen establishes a new baseline for both interpretability and controllability, outperforming prior methods on our benchmark and two public datasets. These results highlight the advantages of symbolic reasoning and agent-based design for language-driven motion synthesis. All code and data are available at https://github.com/xxx/xxx.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37790", "url": null, "sourceid": 32790, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37178, "uid": "417cce83a9373223e4aae3b833114354", "name": "DBMSolver: A Training-free Diffusion Bridge Sampler for High-Quality Image-to-Image Translation", "authors": [{"id": 183325, "fullname": "SANKARSHANA VENUGOPAL", "url": "http://cvpr.thecvf.com/api/miniconf/users/183325?format=json", "institution": "Seoul National University"}, {"id": 186860, "fullname": "Mohammad Mostafavi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186860?format=json", "institution": "Daegu Gyeongbuk Institute of Science and Technology; Seoul National University"}, {"id": 129634, "fullname": "Jonghyun Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/129634?format=json", "institution": "Seoul National University"}], "abstract": "Diffusion-based image-to-image (I2I) translation excels in high-fidelity generation but suffers from slow sampling in state-of-the-art Diffusion Bridge Models (DBMs), often requiring dozens of function evaluations (NFEs). We introduce \\textbf{DBMSolver}, a training-free sampler that exploits the semi-linear structure of DBM's underlying SDE and ODE via exponential integrators, yielding exact $1^\\text{st}$- and $2^\\text{nd}$-order solutions. This reduces NFEs by up to $5\\times$ while boosting quality (e.g., FID drops $53\\%$ on DIODE at 20 NFEs vs. $2^\\text{nd}$-order baseline). Experiments on inpainting, stylization, and semantics-to-image tasks across resolutions up to 256$\\times$256 show DBMSolver sets new SOTA efficiency-quality tradeoffs, enabling real-world applicability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37178", "url": null, "sourceid": 30783, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37747, "uid": "b4a24c9a34cbb53b49c920b905f38d29", "name": "BluRef: Unsupervised Image Deblurring with Dense-Matching References", "authors": [{"id": 71418, "fullname": "Bang-Dang Pham", "url": "http://cvpr.thecvf.com/api/miniconf/users/71418?format=json", "institution": "VinAI Research"}, {"id": 76979, "fullname": "Anh Tran", "url": "http://cvpr.thecvf.com/api/miniconf/users/76979?format=json", "institution": "VinAI Research"}, {"id": 134650, "fullname": "Cuong Pham", "url": "http://cvpr.thecvf.com/api/miniconf/users/134650?format=json", "institution": "Posts &amp; Telecom. Institute of Technology and VinAI Research"}, {"id": 136690, "fullname": "Minh Nguyen Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/136690?format=json", "institution": "Qualcomm Inc, QualComm; University of Adelaide"}], "abstract": "This paper introduces a novel unsupervised approach for image deblurring that utilizes a simple process for training data collection, thereby enhancing the applicability and effectiveness of deblurring methods. Our technique does not require meticulously paired data of blurred and corresponding sharp images; instead, it uses unpaired blurred and sharp images of similar scenes to generate pseudo-ground truth data by leveraging a dense matching model to identify correspondences between a blurry image and reference sharp images. Thanks to the simplicity of the training data collection process, our approach does not rely on existing paired training data or pre-trained networks, making it more adaptable to various scenarios and suitable for networks of different sizes, including those designed for low-resource devices. We demonstrate that this novel approach achieves state-of-the-art performance, marking a significant advancement in the field of image deblurring.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37747", "url": null, "sourceid": 40807, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39570, "uid": "37d92d59a2b02e0256aa1d2bddcfa50d", "name": "PoseD-Flow: Versatile and Guided Flow Matching Model of Human Pose", "authors": [{"id": 192370, "fullname": "Jebastin Nadar", "url": "http://cvpr.thecvf.com/api/miniconf/users/192370?format=json", "institution": "Imperial College London"}, {"id": 182061, "fullname": "Simone Foti", "url": "http://cvpr.thecvf.com/api/miniconf/users/182061?format=json", "institution": "Imperial College London"}, {"id": 73518, "fullname": "Tolga Birdal", "url": "http://cvpr.thecvf.com/api/miniconf/users/73518?format=json", "institution": "Imperial College London"}], "abstract": "Generative pose priors have recently emerged as a powerful tool for inference under occlusion or noise. Yet today\u2019s strongest generative paradigm, *flow matching*, remains unused for human pose due to two fundamental barriers: the absence of a pre-trained flow prior and the non-Euclidean nature of articulated poses. We overcome both by introducing **PoseD-Flow**, a novel framework to unify Riemannian Flow Matching (RFM) with training-free guidance for 3D human pose recovery. PoseD-Flow is composed of two contributions: (i) **PoseRFM**, the first RFM model of human pose, defined directly on the product manifold of joint rotations, and (ii) **Riemannian D-Flow**, a principled guidance mechanism that, by differentiating through its ODE sampling dynamics, conditions PoseRFM at inference without any task-specific training. Our theoretical analysis shows that the induced dynamics are shaped by data covariance and manifold curvature, yielding a bias toward realistic poses. Across pose completion, denoising, and inverse kinematics, \\MethodName~establishes new state of the art, particularly under noise, occlusion, and partial observations.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39570", "url": null, "sourceid": 39499, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39960, "uid": "eaa9bf7df753a0756ace3d7553bc11ff", "name": "Learning What to Trust: Bayesian Prior-Guided Optimization for Visual Generation", "authors": [{"id": 181187, "fullname": "Ruiying Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181187?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 192070, "fullname": "Yuanzhi Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192070?format=json", "institution": "TeleAI"}, {"id": 87044, "fullname": "Haibin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87044?format=json", "institution": "Kuaishou Technology"}, {"id": 193199, "fullname": "Tianshu Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193199?format=json", "institution": "Chinese University of Hong Kong (Shenzhen)"}, {"id": 152943, "fullname": "Chi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152943?format=json", "institution": "China Telecom"}], "abstract": "Group Relative Policy Optimization (GRPO) has emerged as an effective and lightweight framework for post-training visual generative models. However, its performance is fundamentally limited by the ambiguity of textual\u2013visual correspondence: a single prompt may validly describe diverse visual outputs, and a single image or video may support multiple equally correct interpretations. This many-to-many relationship leads reward models to generate uncertain and weakly discriminative signals, causing GRPO to underutilize reliable feedback and overfit noisy ones. We introduce Bayesian Prior-Guided Optimization (BPGO), a novel extension of GRPO that explicitly models reward uncertainty through a semantic prior anchor. BPGO adaptively modulates optimization trust at two levels: inter-group Bayesian trust allocation emphasizes updates from groups consistent with the prior while down-weighting ambiguous ones, and intra-group prior-anchored renormalization sharpens sample distinctions by expanding confident deviations and compressing uncertain scores.Across both image and video generation tasks, BPGO delivers consistently stronger semantic alignment, enhanced perceptual fidelity, and faster convergence than standard GRPO and recent variants.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39960", "url": null, "sourceid": 45196, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37182, "uid": "4a61f90089bdb5a4965c92b9b825afc5", "name": "Towards Stealthy and Effective Backdoor Attacks on Lane Detection: A Naturalistic Data Poisoning Approach", "authors": [{"id": 186865, "fullname": "YIFAN LIAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/186865?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 186866, "fullname": "Yuxin Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186866?format=json", "institution": "National University of Singapore"}, {"id": 186867, "fullname": "Yedi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186867?format=json", "institution": "National University of Singapore"}, {"id": 186868, "fullname": "Wentao He", "url": "http://cvpr.thecvf.com/api/miniconf/users/186868?format=json", "institution": "Ningbo University"}, {"id": 186869, "fullname": "Yan XIAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/186869?format=json", "institution": "Sun Yat-Sen University"}, {"id": 186870, "fullname": "Xianglong Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/186870?format=json", "institution": "Changan Automobile"}, {"id": 186871, "fullname": "Zhiyong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186871?format=json", "institution": "NUS School of Computing"}, {"id": 186873, "fullname": "Jin Song Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186873?format=json", "institution": "National University of Singapore"}], "abstract": "Deep learning-based lane detection (LD) plays a critical role in autonomous driving and advanced driver assistance systems. However, its vulnerability to backdoor attacks presents a significant security concern. Existing backdoor attack methods on LD often exhibit limited practical utility due to the artificial and conspicuous nature of their triggers. To address this limitation and investigate the impact of more ecologically valid backdoor attacks on lane detection models, we examine the common data poisoning attack and introduce DBALD, a novel diffusion-based data poisoning framework for generating naturalistic backdoor triggers. DBALD comprises two key components: optimal trigger position finding and stealthy trigger generation. Given the insight that attack performance varies depending on the trigger position, we propose a heatmap-based method to identify the optimal trigger location, with gradient analysis to generate attack-specific heatmaps. A region-based editing diffusion process is then applied to synthesize visually plausible triggers within the most susceptible regions identified previously. Furthermore, to ensure scene integrity and a stealthy attack, we introduce two loss strategies: one for preserving lane structure and another for maintaining the consistency of the driving scene. Consequently, compared to existing attack methods, DBALD achieves both a high attack success rate and superior stealthiness. Extensive experiments on 4 mainstream lane detection models show that DBALD exceeds state-of-the-art methods, with an average success rate improvement of +10.87% and significantly enhanced stealthiness. The experimental results highlight significant practical challenges in ensuring model robustness against real-world backdoor threats in lane detection. Our data and demos are available at https://sites.google.com/view/dbald.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37182", "url": null, "sourceid": 33184, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37197, "uid": "59495dda55063c90e74d8761976f3229", "name": "Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding", "authors": [{"id": 180669, "fullname": "Yubo Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180669?format=json", "institution": "Beihang University"}, {"id": 186894, "fullname": "Yitong An", "url": "http://cvpr.thecvf.com/api/miniconf/users/186894?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 186895, "fullname": "Xin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186895?format=json", "institution": "Zhejiang University"}, {"id": 186896, "fullname": "Abudukelimu Wuerkaixi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186896?format=json", "institution": "Tsinghua University, Beijing"}, {"id": 157697, "fullname": "Xuxin Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/157697?format=json", "institution": "Peking University"}, {"id": 186897, "fullname": "Fengying Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/186897?format=json", "institution": "Beihang University"}, {"id": 186898, "fullname": "Zhiguo Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186898?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 186899, "fullname": "Cao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186899?format=json", "institution": "Meituan"}, {"id": 186900, "fullname": "Ke Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186900?format=json", "institution": "Meituan"}, {"id": 186901, "fullname": "Haopeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186901?format=json", "institution": "Beihang University"}], "abstract": "Vision-Language Models (VLMs) are frequently undermined by object hallucination\u2014generating content that contradicts visual reality\u2014due to an over-reliance on linguistic priors. We introduce Positive-and-Negative Decoding (PND), a training-free inference framework that intervenes directly in the decoding process to enforce visual fidelity. PND is motivated by our key finding of a critical attention deficit in VLMs, where visual features are empirically under-weighted. Our framework corrects this via a dual-path contrast: The positive path amplifies salient visual evidence using multi-layer attention to encourage faithful descriptions, directly counteracting the attention deficit. Simultaneously, the negative path identifies and degrades the core object's features to create a strong counterfactual, which penalizes ungrounded, prior-dominant generation. By contrasting the model's outputs from these two perspectives at each step, PND steers generation towards text that is not just linguistically probable, but visually factual. Extensive experiments on benchmarks like POPE, MME, and CHAIR show that PND achieves state-of-the-art performance with up to 6.5\\% accuracy improvement, substantially reducing object hallucination while also enhancing descriptive detail\u2014all without requiring any model retraining. The method generalizes effectively across diverse VLM architectures including LLaVA, InstructBLIP, InternVL, and Qwen-VL.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37197", "url": null, "sourceid": 38670, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39925, "uid": "44ee138c3a477dd10e20cfc5d1402213", "name": "WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning", "authors": [{"id": 156922, "fullname": "Woongyeong Yeo", "url": "http://cvpr.thecvf.com/api/miniconf/users/156922?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 138226, "fullname": "Kangsan Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/138226?format=json", "institution": "KAIST"}, {"id": 187592, "fullname": "Jaehong Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/187592?format=json", "institution": "NTU Singapore"}, {"id": 90666, "fullname": "Sung Ju Hwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90666?format=json", "institution": "Korea Advanced Institute of Science and Technology"}], "abstract": "Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed information about scenes. During inference, an adaptive retrieval agent iteratively selects the most relevant memory source and leverages multiple temporal granularities based on the query, continuing until it determines that sufficient information has been gathered. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks, achieving an average 8.4% performance gain over previous state-of-the-art methods, showing its effectiveness on long video reasoning.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39925", "url": null, "sourceid": 37975, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39620, "uid": "81623160a21568ddbc0a8173ebbf1670", "name": "A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection", "authors": [{"id": 181790, "fullname": "SuYeon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/181790?format=json", "institution": "Kyung Hee University"}, {"id": 192499, "fullname": "Wongyu Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/192499?format=json", "institution": "Kyung Hee University"}, {"id": 187337, "fullname": "MyeongAh Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/187337?format=json", "institution": "Kyung Hee University"}], "abstract": "3D anomaly detection targets the detection and localization of defects in 3D point clouds trained solely on normal data. While a unified model improves scalability by learning across multiple categories, it often suffers from Inter-Category Entanglement (ICE)\u2014where latent features from different categories overlap, causing the model to adopt incorrect semantic priors during reconstruction and ultimately yielding unreliable anomaly scores. To address this issue, we propose the Semantically Disentangled Unified Model for 3D Anomaly Detection, which reconstructs features conditioned on disentangled semantic representations. Our framework consists of three key components: (i) Coarse-to-Fine Global Tokenization for forming instance-level semantic identity, (ii) Category-Conditioned Contrastive Learning for disentangling category semantics, and (iii) a Geometry-Guided Decoder for semantically consistent reconstruction. Extensive experiments on Real3D-AD and Anomaly-ShapeNet demonstrate that our method achieves state-of-the-art for both unified and category-specific models, improving object-level AUROC by 2.8% and 9.1%, respectively, while enhancing the reliability of unified 3D anomaly detection.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39620", "url": "https://visualsciencelab-khu.github.io/SeDiR_project/", "sourceid": 41932, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37101, "uid": "79d6e374e65e80786cc3b0c8ff46daa0", "name": "Your One-Stop Solution for AI-Generated Video Detection", "authors": [{"id": 175207, "fullname": "Long Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/175207?format=json", "institution": "University of Science and Technology of China"}, {"id": 186654, "fullname": "Zihao Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/186654?format=json", "institution": "Huzhou Normal University"}, {"id": 186655, "fullname": "Yan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186655?format=json", "institution": "Alibaba Group"}, {"id": 131427, "fullname": "Zhiyuan Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/131427?format=json", "institution": "Peking University"}, {"id": 186656, "fullname": "Jin Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186656?format=json", "institution": "University of Science and Technology of China"}, {"id": 186657, "fullname": "Xiaorui Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186657?format=json", "institution": "The Hong Kong Polytechnic University; University of Science and Technology of China"}, {"id": 186658, "fullname": "Haiyang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186658?format=json", "institution": "University of Science and Technology of China; Institute of Dataspace, Hefei Comprehensive National Science Center"}, {"id": 186659, "fullname": "Yong Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186659?format=json", "institution": "University of Science and Technology of China; China Academic of Electronics and Information Technology"}, {"id": 186660, "fullname": "Zhen Bi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186660?format=json", "institution": "Huzhou Normal University"}], "abstract": "Recent advances in generative modeling can create remarkably realistic synthetic videos, making it increasingly difficult for humans to distinguish them from real ones and necessitating reliable detection methods. However, two key limitations hinder the development of this field.**From the dataset perspective**, existing datasets are often limited in scale and constructed using outdated or narrowly scoped generative models, making it difficult to capture the diversity and rapid evolution of modern generative techniques. Moreover, the dataset construction process frequently prioritizes quantity over quality, neglecting essential aspects such as semantic diversity, scenario coverage, and technological representativeness. **From the benchmark perspective**, current benchmarks largely remain at the stage of dataset creation, leaving many fundamental issues and in-depth analysis yet to be systematically explored.Addressing this gap, we propose AIGVDBench, a benchmark designed to be comprehensive and representative, covering **31** state-of-the-art generation models and over **440,000** videos. By executing more than **1,500** evaluations on **33** existing detectors belonging to four distinct categories. This work presents **8 in-depth analyses** from multiple perspectives and identifying **4 novel findings** that offer valuable insights for the field. We hope this work provides a solid foundation for advancing the field of AI-generated video detection.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37101", "url": null, "sourceid": 38944, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37501, "uid": "771a920034826064643a2d215e7ae6e1", "name": "Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation", "authors": [{"id": 173674, "fullname": "Taekyung Ki", "url": "http://cvpr.thecvf.com/api/miniconf/users/173674?format=json", "institution": "KAIST"}, {"id": 158955, "fullname": "Sangwon Jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158955?format=json", "institution": "KAIST"}, {"id": 158957, "fullname": "Jaehyeong Jo", "url": "http://cvpr.thecvf.com/api/miniconf/users/158957?format=json", "institution": "KAIST"}, {"id": 187592, "fullname": "Jaehong Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/187592?format=json", "institution": "NTU Singapore"}, {"id": 90666, "fullname": "Sung Ju Hwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90666?format=json", "institution": "Korea Advanced Institute of Science and Technology"}], "abstract": "Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement. We identify two key challenges toward truly interactive avatars: generating motion in real-time under causal constraints and learning expressive, vibrant reactions without additional labeled data. To address these challenges, we propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing. This design allows the avatar to process real-time multimodal inputs, including the user\u2019s audio and motion, with low latency for instant reactions to both verbal and non-verbal cues such as speech, nods, and laughter. Furthermore, we introduce a direct preference optimization method that leverages synthetic losing samples constructed by dropping user conditions, enabling label-free learning of expressive interaction. Experimental results demonstrate that our framework enables real-time interaction with low latency (about 500ms), achieving 6.8x speedup compared to the baseline, and produces reactive and expressive avatar motion, which is preferred over 80% against the baseline.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37501", "url": null, "sourceid": 33670, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39826, "uid": "3544e424ddb90271172feb1b676a9e60", "name": "Single-Round Scalable Analytic Federated Learning", "authors": [{"id": 184130, "fullname": "Alan T. L. Bacellar", "url": "http://cvpr.thecvf.com/api/miniconf/users/184130?format=json", "institution": "The University of Texas at Austin"}, {"id": 95377, "fullname": "Mustafa Munir", "url": "http://cvpr.thecvf.com/api/miniconf/users/95377?format=json", "institution": "The University of Texas at Austin"}, {"id": 192928, "fullname": "Felipe M.G. Fran\u00e7a", "url": "http://cvpr.thecvf.com/api/miniconf/users/192928?format=json", "institution": "Google"}, {"id": 192929, "fullname": "Priscila Machado Vieira Lima", "url": "http://cvpr.thecvf.com/api/miniconf/users/192929?format=json", "institution": "Universidade Federal do Rio de Janeiro"}, {"id": 85871, "fullname": "Radu Marculescu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85871?format=json", "institution": "University of Texas, Austin"}, {"id": 192930, "fullname": "Lizy Kurian John", "url": "http://cvpr.thecvf.com/api/miniconf/users/192930?format=json", "institution": "University of Texas at Austin"}], "abstract": "Federated Learning (FL) is plagued by two key challenges: high communication overhead and performance collapse on heterogeneous (non-IID) data. Analytic FL (AFL) provides a single-round, data distribution invariant solution, but is limited to linear models. Subsequent non-linear approaches, like DeepAFL, regain accuracy but sacrifice the single-round benefit. In this work, we break this trade-off. We propose SAFLe, a framework that achieves scalable non-linear expressivity by introducing a structured head of bucketed features and sparse, grouped embeddings. We prove this non-linear architecture is mathematically equivalent to a high-dimensional linear regression. This key equivalence allows SAFLe to be solved with AFL's single-shot, invariant aggregation law. Empirically, SAFLe establishes a new state-of-the-art for analytic FL, significantly outperforming both linear AFL and multi-round DeepAFL in accuracy across all benchmarks, demonstrating a highly efficient and scalable solution for federated vision.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39826", "url": null, "sourceid": 44436, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36935, "uid": "77c561fed0fbec92643ef18e304de03c", "name": "FAPE-IR: Frequency-Aware Planning and Execution Framework for All-in-One Image Restoration", "authors": [{"id": 181270, "fullname": "Jingren Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181270?format=json", "institution": "Tianjin University"}, {"id": 71668, "fullname": "Shuning Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/71668?format=json", "institution": "University of Macau"}, {"id": 186265, "fullname": "Qirui Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186265?format=json", "institution": "Tianjin University"}, {"id": 149001, "fullname": "WANG Yun", "url": "http://cvpr.thecvf.com/api/miniconf/users/149001?format=json", "institution": "City University of Hong Kong"}, {"id": 91016, "fullname": "Xiangyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/91016?format=json", "institution": "University of Macau"}, {"id": 186266, "fullname": "Zhong Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/186266?format=json", "institution": "Tianjin university"}], "abstract": "All-in-One Image Restoration (AIO-IR) aims to develop a unified model that can handle multiple degradations under complex conditions. However, existing methods often rely on task-specific designs or latent routing strategies, making it hard to adapt to real-world scenarios with various degradations. We propose FAPE-IR, a Frequency-Aware Planning and Execution framework for image restoration. It uses a frozen Multimodal Large Language Model (MLLM) as a planner to analyze degraded images and generate concise, frequency-aware restoration plans. These plans guide a LoRA-based Mixture-of-Experts (LoRA-MoE) module within a diffusion-based executor, which dynamically selects high- or low-frequency experts, complemented by frequency features of the input image. To further improve restoration quality and reduce artifacts, we introduce adversarial training and a frequency regularization loss. By coupling semantic planning with frequency-based restoration, FAPE-IR offers a unified and interpretable solution for all-in-one image restoration. Extensive experiments show that FAPE-IR achieves state-of-the-art performance across seven restoration tasks and exhibits strong zero-shot generalization under mixed degradations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36935", "url": null, "sourceid": 38146, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38670, "uid": "9508e33866145631ca76768f61282f2c", "name": "VideoRealBench: A Chain-of-Thought Realism Evaluation Benchmark for Generated Human-Centric Videos", "authors": [{"id": 107511, "fullname": "Min Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107511?format=json", "institution": "Nanjing University"}, {"id": 190430, "fullname": "Xinwen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190430?format=json", "institution": "Nanjing University"}, {"id": 190431, "fullname": "Jialei Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190431?format=json", "institution": "Nanjing university"}, {"id": 190432, "fullname": "Xin Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190432?format=json", "institution": "nanjing university"}, {"id": 190433, "fullname": "Kehan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190433?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 157906, "fullname": "Zeyi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157906?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 86063, "fullname": "Limin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86063?format=json", "institution": "Nanjing University"}], "abstract": "With the great advancement of video generation models, a growing number of content creators and researchers are leveraging these technologies to produce large volumes of human-centric videos for content creation and customized data generation for specific tasks. Although existing video generation models are capable of producing videos with high visual quality, their inadequate understanding of video realism results in generating unrealistic videos. While various evaluators have emerged to assess the quality of generated videos, they are trained from low-quality generated videos and data annotations, leading to misaligned ratings with human preferences. They also lack interpretability due to the absence of chain-of-thought reasoning. To address these issues, we propose \\textbf{VideoRealBench}, a comprehensive benchmark for evaluating the realism of generated human-centric videos. We leverage a rating scale designed from human preferences to score videos and provide three-step rationales, thereby creating a finely-annotated dataset \\textbf{VideoRealDataset} and proposing an evaluator \\textbf{VideoRealEval} capable of providing reliable scores along with detailed rationales. VideoRealEval achieves a Pearson\u2019s linear correlation coefficient (PLCC) of 57.07\\% and a Spearman\u2019s rank correlation coefficient (SROCC) of 56.78\\% on VideoRealDataset, demonstrating closer alignment with human preferences than existing evaluators.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38670", "url": null, "sourceid": 45163, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39435, "uid": "bb51be7cabcec70c5ebd89a677b5f450", "name": "Seeing What Matters: Visual Preference Policy Optimization for Visual Generation", "authors": [{"id": 180664, "fullname": "Ziqi Ni", "url": "http://cvpr.thecvf.com/api/miniconf/users/180664?format=json", "institution": "Southeast University"}, {"id": 192070, "fullname": "Yuanzhi Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192070?format=json", "institution": "TeleAI"}, {"id": 192071, "fullname": "Rui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192071?format=json", "institution": "University of Science and Technology of China"}, {"id": 185180, "fullname": "Yi Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185180?format=json", "institution": "Southeast University"}, {"id": 87044, "fullname": "Haibin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87044?format=json", "institution": "Kuaishou Technology"}, {"id": 152943, "fullname": "Chi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152943?format=json", "institution": "China Telecom"}, {"id": 185717, "fullname": "Xuelong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185717?format=json", "institution": "China Telecom; Northwestern Polytechnical University"}], "abstract": "Reinforcement learning (RL) has become a powerful tool for post-training visual generative models, with Group Relative Policy Optimization (GRPO) increasingly used to align generators with human preferences. However, existing GRPO pipelines rely on a single scalar reward per sample, treating each image or video as a holistic entity and ignoring the rich spatial and temporal structure of visual content. This coarse supervision hinders the correction of localized artifacts and the modeling of fine-grained perceptual cues. We introduce Visual Preference Policy Optimization (ViPO), a GRPO variant that lifts scalar feedback into structured, pixel-level advantages. ViPO employs a Perceptual Structuring Module that uses pretrained vision backbones to construct spatially and temporally aware advantage maps, redistributing optimization pressure toward perceptually important regions while preserving the stability of standard GRPO. Across both image and video benchmarks, ViPO consistently outperforms vanilla GRPO, improving in-domain alignment with human-preference rewards and enhancing generalization on out-of-domain evaluations. The method is architecture-agnostic, lightweight, and fully compatible with existing GRPO training pipelines, providing a more expressive and informative learning signal for visual generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39435", "url": null, "sourceid": 43831, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39545, "uid": "7206ef1be0a3d44c57fa8214fc74421e", "name": "Unlocking Strong Supervision: A Data-Centric Study of General-Purpose Audio Pre-Training Methods", "authors": [{"id": 177162, "fullname": "Xuanru Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/177162?format=json", "institution": "Zhejiang University"}, {"id": 192313, "fullname": "Yiwen Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192313?format=json", "institution": "Tencent AI Lab"}, {"id": 192314, "fullname": "Wei-Cheng Tseng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192314?format=json", "institution": "University of Texas at Austin"}, {"id": 185776, "fullname": "Dong Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185776?format=json", "institution": "Capital One"}], "abstract": "Current audio pre-training seeks to learn unified representations for broad audio understanding tasks, but it remains fragmented and is fundamentally bottlenecked by its reliance on weak, noisy, and scale-limited labels. Drawing lessons from vision's foundational pre-training blueprint, we argue that the audio field must first establish its own large-scale, strong supervision framework. We introduce a new data-centric pipeline that leverages a high-fidelity captioner to create SOTA-quality captions and the first Unified Tag System (UTS) that bridges speech, music, and environmental sounds. We then conduct a systematic comparative study of different pre-training objectives on these strong source data. Our experiments suggest that data quality and coverage are the primary drivers of performance, while the choice of objective dictates downstream task specialization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39545", "url": null, "sourceid": 43934, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39929, "uid": "e9c550b97a038b9dbe82e0c87ac80988", "name": "Image Guides Images: Consistent Video Amodal Completion with Rectified In-Context Exemplar Guidance", "authors": [{"id": 193132, "fullname": "Xiaoyu Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/193132?format=json", "institution": "Beihang University"}, {"id": 193133, "fullname": "Ketong Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/193133?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 189967, "fullname": "Dongyu She", "url": "http://cvpr.thecvf.com/api/miniconf/users/189967?format=json", "institution": "Zhongguancun Laboratory"}, {"id": 87073, "fullname": "Weiming Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/87073?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 126546, "fullname": "Miao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126546?format=json", "institution": "Beihang University"}], "abstract": "Video amodal completion (VAC) aims to mimic the human brain's ability to implicitly perceive the complete appearance of partially occluded objects, thereby facilitating recognition and understanding. Existing VAC methods finetune video generation models on custom datasets, yet these datasets often have unrealistic distributions and small scales due to the challenges of collecting real amodal data and thus limit their performance and generalization.To address this, we utilize pre-trained image inpainting models for VAC and introduce in-context (IC) learning to enhance inter-frame consistency. However, despite the satisfactory performance of DiT-based IC Learning in generation tasks, task-agnostic global information often utilizes irrelevant scene information, resulting in completion failures when applied to amodal completion task. Additionally, IC Learning faces a cold-start problem with the exemplar construction. To this end, we propose a consistency video amodal completion with rectified in-context exemplar guidance. Specifically, we introduce rectified exemplar-guided completion by adjusting the attention weights of exemplar image relative to the target images for consistent completion, and adopt a dual-frame calibrated exemplar rectification to tackle the cold-start issue.Quantitative and qualitative experiments demonstrate that our method outperforms SOTAs, especially in terms of generalization and robustness on uncommon data and under severe occlusion.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39929", "url": null, "sourceid": 40458, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37554, "uid": "5d2f634bb9a59d39eab86442d2459d64", "name": "Where, What, Why: Toward Explainable 3D-GS Watermarking", "authors": [{"id": 180101, "fullname": "Mingshu Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/180101?format=json", "institution": "Waseda University"}, {"id": 187708, "fullname": "Jiajun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187708?format=json", "institution": "Southeast University"}, {"id": 86025, "fullname": "Osamu Yoshie", "url": "http://cvpr.thecvf.com/api/miniconf/users/86025?format=json", "institution": "Waseda University"}, {"id": 187709, "fullname": "Yuya Ieiri", "url": "http://cvpr.thecvf.com/api/miniconf/users/187709?format=json", "institution": null}, {"id": 187710, "fullname": "Yixuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187710?format=json", "institution": "Nanyang Technological University"}], "abstract": "As 3D Gaussian Splatting becomes the de facto representation for interactive 3D assets, robust yet imperceptible watermarking is critical. We present a representation-native framework that separates where to write from how to preserve quality. A Trio-Experts module operates directly on Gaussian primitives to derive priors for carrier selection, while a Safety and Budget Aware Gate (SBAG) allocates Gaussians to watermark carriers\u2014optimized for bit resilience under perturbation and bitrate budgets\u2014and to visual compensators that are insulated from watermark loss. To maintain fidelity, we introduce a channel-wise group mask that controls gradient propagation for carriers and compensators, thereby limiting Gaussian parameter updates, repairing local artifacts, and preserving high-frequency details without increasing runtime. Our design yields view-consistent watermark persistence and strong robustness against common image distortions such as compression and noise, while achieving a favorable robustness\u2013quality trade-off compared with prior methods. In addition, the decoupled finetuning provides per-Gaussian attributions that reveal where the message is carried and why those carriers are selected, enabling auditable explainability. Compared with state-of-the-art methods, our approach achieves a PSNR improvement of +0.83 dB and a bit-accuracy gain of +1.15\\%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37554", "url": null, "sourceid": 43591, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39317, "uid": "0a2fbc3a7639168c9e918f557d1802ac", "name": "AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models", "authors": [{"id": 129188, "fullname": "Xiaoqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129188?format=json", "institution": "Peking University"}, {"id": 191833, "fullname": "Muhe Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/191833?format=json", "institution": "Peking University"}, {"id": 191834, "fullname": "Jiadong Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191834?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 191835, "fullname": "Juan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191835?format=json", "institution": "PrimeBot"}, {"id": 185934, "fullname": "Hongwei Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185934?format=json", "institution": "Peking University"}, {"id": 129189, "fullname": "Yan Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/129189?format=json", "institution": "Peking University"}, {"id": 99906, "fullname": "Guanghui Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/99906?format=json", "institution": "AgiBot"}, {"id": 76571, "fullname": "Hao Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76571?format=json", "institution": "Peking University"}], "abstract": "Vision-Language-Action (VLA) models have significantly advanced robotic agents capable of executing diverse tasks; however, they remain limited in contact-rich manipulation scenarios that require precise physical interactions. To address this limitation, recent studies have attempted to incorporate tactile signals during downstream tasks, enabling pretrained VLAs to interpret tactile feedback. Nevertheless, introducing new modalities during finetuning, which are rarely present in the pretrain stage, may disrupt the pretrained capabilities of VLAs. In addition, the inherently slow inference speed of VLAs hampers real-time responsiveness and limits the effective utilization of tactile feedback for action adjustment.To overcome these challenges, we propose Adaptive Tactile Vision-Language-Action (AT-VLA), which introduces a novel Adaptive Tactile Injection mechanism. This mechanism dynamically determines the appropriate timing and locations for tactile injection, incorporating only when it significantly contributes to action generation, thereby minimizing interference with pretrained representations.Furthermore, to enable rapid and accurate tactile responses, we propose a Tactile Reaction Dual-Stream mechanism, which decouples sensory processing into a slow visual-language stream for low-frequency perceptual reasoning and a fast tactile control stream for high-frequency physical interaction understanding, achieving real-time close-loop responses within 0.04 s.Real-world experiments thoroughly validate the effectiveness of AT-VLA in contact-rich manipulation tasks.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39317", "url": null, "sourceid": 34758, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40363?format=json"], "related_events_ids": [40363]}, {"id": 39032, "uid": "4b369421ea33047f8fcbaf5d02938c9a", "name": "LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models", "authors": [{"id": 180906, "fullname": "Guolei Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180906?format=json", "institution": "Southeast University"}, {"id": 191209, "fullname": "Qinzhi Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191209?format=json", "institution": "University of California, Santa Cruz"}, {"id": 191210, "fullname": "Gan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191210?format=json", "institution": "Zhejiang University of Technology"}, {"id": 88121, "fullname": "Yao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88121?format=json", "institution": "Beihang University"}, {"id": 191211, "fullname": "Yuxuan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191211?format=json", "institution": ""}, {"id": 191212, "fullname": "Yongjun Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191212?format=json", "institution": "Southeast University"}], "abstract": "As Vision-Language Models (VLMs) move into interactive, multi-turn use, safety concerns intensify for multimodal multi-turn dialogue, which is characterized by concealment of malicious intent, contextual risk accumulation, and cross-modal joint risk. As a result, these characteristics limit the effectiveness of content moderation approaches designed for single-turn or single-modality settings. To address these limitations, we first construct the Multimodal Multi-turn Dialogue Safety (MMDS) dataset, comprising 4,484 annotated dialogues and a comprehensive risk taxonomy with 8 primary and 60 subdimensions. As part of MMDS construction, we introduce Multimodal Multi-turn Red Teaming (MMRT), an automated framework for generating unsafe multimodal multi-turn dialogues. We further propose LLaVAShield, which audits the safety of both user inputs and assistant responses under specified policy dimensions in multimodal multi-turn dialogues. Extensive experiments show that LLaVAShield significantly outperforms state-of-the-art VLMs and existing content moderation tools while demonstrating strong generalization and flexible policy adaptation. Additionally, we analyze vulnerabilities of mainstream VLMs to harmful inputs and evaluate the contribution of key components, advancing understanding of safety mechanisms in multimodal multi-turn dialogues.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39032", "url": null, "sourceid": 46538, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37573, "uid": "11ce4c3ce3498f8b1c49e8adad14eee5", "name": "UniGeoRS: A Unified Benchmark for Tri-view Geo-Localization", "authors": [{"id": 180144, "fullname": "Xiao Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180144?format=json", "institution": "Beijing Institute of Technology"}, {"id": 187752, "fullname": "Huaizhi Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187752?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 187753, "fullname": "Feiyang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187753?format=json", "institution": "Beihang University"}, {"id": 187754, "fullname": "Shiji Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187754?format=json", "institution": "Beijing Institute of Technology"}, {"id": 187755, "fullname": "Chun Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187755?format=json", "institution": "Beijing Institute of Technology"}, {"id": 73017, "fullname": "Dezhi Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/73017?format=json", "institution": "Beijing Institute of Technology"}, {"id": 73020, "fullname": "Kang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/73020?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "Cross-view geo-localization (CVGL) aims to estimate an image\u2019s geographic location by matching it with geo-referenced images from different viewpoints, supporting applications such as autonomous driving, UAV navigation, and visual surveillance. However, due to the high cost of image collection, current CVGL datasets often suffer from limited diversity in both drone and ground imagery, which constrains model generalization. Furthermore, existing methods primarily focus on either ground-to-satellite or drone-to-satellite matching, lacking a unified framework capable of handling image matching across all three platforms: satellite, drone, and ground. To this end, we introduce the Unified Geo-localization dataset with Real-world and Synthetic imagery (UniGeoRS), a comprehensive benchmark featuring satellite, drone, and ground-view images, with a particular emphasis on the richness and diversity of drone and ground perspectives, enabling more realistic and flexible evaluations of CVGL. Additionally, we propose Cross-Attention-based Matching Enhancement (CAME), a unified framework for CVGL. By dynamically aggregating contextual information from top-ranked candidates, CAME refines feature representations and enhances cross-view matching robustness. Experimental results show (1) The Proposed UniGeoRS benchmark is necessary for training and evaluating the CVGL model across all three platforms. (2) UniGeoRS improves model generalization across diverse conditions. (3) CAME consistently boosts performance across state-of-the-art CVGL approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37573", "url": null, "sourceid": 37995, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36445, "uid": "cb9d8280ba7e5c9156cee212f36d01d7", "name": "PlannerRFT: Reinforcing Diffusion Planners through Closed-Loop and Sample-Efficient Fine-Tuning", "authors": [{"id": 185063, "fullname": "Hongchen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185063?format=json", "institution": "Tongji University; Shanghai Innovation Institute"}, {"id": 185064, "fullname": "Tianyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185064?format=json", "institution": "Fudan University"}, {"id": 90131, "fullname": "Jiazhi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90131?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 185065, "fullname": "Mingyang Shang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185065?format=json", "institution": "Li Auto Inc."}, {"id": 185066, "fullname": "Gaoqiang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185066?format=json", "institution": "Li Auto Inc."}, {"id": 185067, "fullname": "Caojun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185067?format=json", "institution": "Tongji University"}, {"id": 142811, "fullname": "Haochen Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/142811?format=json", "institution": "CASIA, Xiaomi EV"}, {"id": 185068, "fullname": "Zengrong Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185068?format=json", "institution": "Li Auto Inc."}, {"id": 185069, "fullname": "Zhihui Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185069?format=json", "institution": "Li Auto Inc."}, {"id": 153220, "fullname": "XianPeng Lang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153220?format=json", "institution": "LiAuto"}, {"id": 185070, "fullname": "Jia Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185070?format=json", "institution": "Tongji University"}, {"id": 69669, "fullname": "Hongyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/69669?format=json", "institution": "The University of Hong Kong"}], "abstract": "Diffusion-based planners have emerged as a promising approach for human-like trajectory generation in autonomous driving. Recent works incorporate reinforcement fine-tuning to enhance the robustness of diffusion planners through reward-oriented optimization in a generation\u2013evaluation loop. However, they struggle to generate multi-modal, scenario-adaptive trajectories, hindering the exploitation efficiency of informative rewards during fine-tuning. To resolve this, we propose PlannerRFT, a sample-efficient reinforcement fine-tuning framework for diffusion-based planners. PlannerRFT adopts a dual-branch optimization that simultaneously refines the trajectory distribution and adaptively guides the denoising process toward more promising exploration, without altering the original inference pipeline. To support parallel learning at scale, we develop nuMax, an optimized simulator that achieves 10 times faster rollout compared to native nuPlan. Extensive experiments shows that PlannerRFT yields state-of-the-art performance with distinct behaviors emerging during the learning process. Code and simulator would be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36445", "url": null, "sourceid": 36841, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36300, "uid": "f0a864c2092c2885179523674f580428", "name": "Fully Decentralized Certified Unlearning", "authors": [{"id": 184731, "fullname": "Hithem Lamri", "url": "http://cvpr.thecvf.com/api/miniconf/users/184731?format=json", "institution": null}, {"id": 184732, "fullname": "Michail Maniatakos", "url": "http://cvpr.thecvf.com/api/miniconf/users/184732?format=json", "institution": "New York University Abu Dhabi"}], "abstract": "Machine unlearning (MU) seeks to remove the influence of specified data from a trained model in response to privacy requests or data poisoning. While certified unlearning has been analyzed in centralized and server-orchestrated federated settings (via guarantees analogous to differential privacy, DP), the decentralized setting\u2014where peers communicate without a coordinator\u2014remains underexplored. We study certified unlearning in decentralized networks with fixed topologies and propose \\methodname, a random-walk procedure that performs one projected gradient ascent step on the forget set at the unlearning client and a geometrically distributed number of projected descent steps on the retained data elsewhere, combined with subsampled Gaussian noise and projection onto a trust region around the original model. We provide (i) convergence guarantees in the convex case and stationarity guarantees in the nonconvex case, (ii) $(\\varepsilon,\\delta)$ network-unlearning certificates on client views via subsampled Gaussian R\\'enyi DP (RDP) with segment-level subsampling, and (iii) deletion-capacity bounds that scale with the forget-to-local data ratio and quantify the effect of decentralization (network mixing and randomized subsampling) on the privacy\u2013utility trade-off. Empirically, on image benchmarks (MNIST, CIFAR-10), \\methodname~ matches a given $(\\varepsilon,\\delta)$ while achieving higher test accuracy than decentralized DP baselines and reducing forget accuracy to random guessing (\\(\\approx 10\\%\\)).", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36300", "url": null, "sourceid": 45923, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37245, "uid": "c53ac11d42c5bd5051bb953b99e01d75", "name": "SegGBC: Justifiable Coarse-to-Fine Granular-Ball Computing for Enhancing Clustering Image Segmentation", "authors": [{"id": 180815, "fullname": "Qianpeng Chong", "url": "http://cvpr.thecvf.com/api/miniconf/users/180815?format=json", "institution": "Beijing Normal University"}, {"id": 186999, "fullname": "Wenyi Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186999?format=json", "institution": "Beijing Normal University"}, {"id": 187000, "fullname": "Xiuxuan Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187000?format=json", "institution": "Xidian University"}, {"id": 187001, "fullname": "Jiajie Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187001?format=json", "institution": "Beijing Normal University"}, {"id": 187002, "fullname": "Qian Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/187002?format=json", "institution": "Beijing Normal University"}, {"id": 176467, "fullname": "Xin Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/176467?format=json", "institution": "Beijing Normal University"}], "abstract": "As an emerging multi-granularity clustering paradigm, granular-ball computing (GBC) hierarchically represents samples through granular-balls (GBs) to capture compact, multi-scale features. Nevertheless, its effective application to clustering-based segmentation methods (CSMs) remains challenging due to two key issues: representing intrinsic uncertainties and defining a justifiable, semantics-aware quality criterion. To address them, the first segmentation framework based on GBC (SegGBC) is proposed to alleviate the single\u2011granularity limitation of existing CSMs. Concretely, we leverage intuitionistic fuzzy sets (IFS) to explicitly quantify image uncertainty: membership and non\u2011membership encode evidence, and the IFS hesitation degree models residual ambiguity. In addition, a semantic compactness metric criterion (SCM_GB) is designed to characterize semantic information by considering the ''stable region'' in conjunction with the overall density of  GBs. The proposal of ''stable region'' ensures robust semantics concurrently with high computational efficiency.  Extensive experiments demonstrate that the proposed SegGBC achieves promising performance  for segmentation. The proposed segmentation GB representation is a plug-and-play front-end, significantly boosting the performance of CSMs by >+3.25\\% SA and >+3.92\\% mIoU on standard image and COCO benchmarks. Code is available at supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37245", "url": null, "sourceid": 36800, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36653, "uid": "ff31526c497a36611f9c89449b1c96ce", "name": "UCMNet: Uncertainty-Aware Context Memory Network for Under-Display Camera Image Restoration", "authors": [{"id": 92089, "fullname": "DAEHYUN KIM", "url": "http://cvpr.thecvf.com/api/miniconf/users/92089?format=json", "institution": "Hanyang University"}, {"id": 185567, "fullname": "Youngmin Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/185567?format=json", "institution": "Hanyang Universty"}, {"id": 185568, "fullname": "Yoon Ju Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/185568?format=json", "institution": "Hanyang Universty"}, {"id": 77245, "fullname": "Tae Hyun Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/77245?format=json", "institution": "Hanyang Univ."}], "abstract": "Under-display cameras (UDCs) allow for full-screen designs by positioning the imaging sensor underneath the display. Nonetheless, light diffraction and scattering through the various display layers result in spatially varying and complex degradations, which significantly reduce high-frequency details. Current PSF-based physical modeling techniques and frequency-separation networks are effective at reconstructing low-frequency structures and maintaining overall color consistency. However, they still face challenges in recovering fine details when dealing with complex, spatially varying degradation. To solve this problem, we propose a lightweight Uncertainty-aware Context-Memory Network (UCMNet), for UDC image restoration. Unlike previous methods that apply uniform restoration, UCMNet performs uncertainty-aware adaptive processing to restore high-frequency details in regions with varying degradations. The estimated uncertainty maps, learned through an uncertainty-driven loss, quantify spatial uncertainty induced by diffraction and scattering, and guide the Memory Bank to retrieve region-adaptive context from the Context Bank. This process enables effective modeling of the non-uniform degradation characteristics inherent to UDC imaging. Leveraging this uncertainty as a prior, UCMNet achieves state-of-the-art performance on multiple benchmarks with 30% fewer parameters than previous models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36653", "url": null, "sourceid": 44471, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39751, "uid": "d41380111f8558d2a5a9da7a095bbd0f", "name": "Learning Multi-View Spatial Reasoning from Cross-View Relations", "authors": [{"id": 192783, "fullname": "Suchae Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/192783?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology; Dongguk University"}, {"id": 180859, "fullname": "Jaehwi Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/180859?format=json", "institution": "Config"}, {"id": 192784, "fullname": "Haeone Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/192784?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 192785, "fullname": "Hanna Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/192785?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 192786, "fullname": "Jian Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/192786?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology; Yonsei University"}, {"id": 192787, "fullname": "Dongjun Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/192787?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 192788, "fullname": "Dong Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192788?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology; Seoul National University"}, {"id": 192789, "fullname": "Changyeon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/192789?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 192790, "fullname": "Dongyoon Hahm", "url": "http://cvpr.thecvf.com/api/miniconf/users/192790?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 192791, "fullname": "Woogyeol Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192791?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology; Korea Advanced Institute of Science &amp; Technology"}, {"id": 170451, "fullname": "Juheon Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/170451?format=json", "institution": "KAIST"}, {"id": 126267, "fullname": "Kimin Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/126267?format=json", "institution": "KAIST"}], "abstract": "Vision-language models (VLMs) have achieved impressive results on single-view vision tasks, but lack the multi-view spatial reasoning capabilities essential for embodied AI systems to understand 3D environments and manipulate objects across different viewpoints. In this work, we introduce Cross-View Relations (XVR), a large-scale dataset designed to teach VLMs spatial reasoning across multiple views. XVR comprises 100K vision-question-answer samples derived from 18K diverse 3D scenes and 70K robotic manipulation trajectories, spanning three fundamental spatial reasoning tasks: Correspondence (matching objects across views), Verification (validating spatial relationships), and Localization (identifying object positions). VLMs fine-tuned on XVR achieve substantial improvements on established multi-view and robotic spatial reasoning benchmarks (MindCube and RoboSpatial). When integrated as backbones in Vision-Language-Action models, XVR-trained representations improve success rates on RoboCasa. Our results demonstrate that explicit training on cross-view spatial relations significantly enhances multi-view reasoning and transfers effectively to real-world robotic manipulation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39751", "url": null, "sourceid": 38842, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37601, "uid": "7983ab09e8bd433374adf5ddfe658161", "name": "Complementary Prototype Mapping for Efficient Multimodal Anomaly Detection", "authors": [{"id": 143012, "fullname": "Yuan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/143012?format=json", "institution": "Dalian University of Technology"}, {"id": 187803, "fullname": "Xiaoqin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187803?format=json", "institution": "Construction Project Management Branch, National Petroleum and Natural Gas Pipeline Network Group Co., Ltd."}, {"id": 87510, "fullname": "Huchuan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87510?format=json", "institution": "Dalian University of Technology"}, {"id": 127233, "fullname": "Lihe Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127233?format=json", "institution": "Dalian University of Technology"}], "abstract": "Multimodal unsupervised anomaly detection has garnered increasing attention for robust defect localization.Recent approaches rely on establishing cross-modal matching relationships under normal conditions without explicit guidance.However, in practice, a single modality may have multiple distinct representations corresponding to another modality, and such unconditional mappings struggle to adaptively capture these variations, resulting in mapping ambiguity and the misclassification of diverse yet normal variations as anomalies.Moreover, existing methods suffer from  slow inference speed and high  memory overhead, hindering their deployment in real-world production lines.To address these issues, we propose an efficient and effective  Complementary Prototype Mapping (\\textbf{CPMAD}) framework, which dynamically extracts consensus and supplementary prototypes to serve as complementary priors, thereby guiding and disambiguating cross-modal mappings.The framework comprises three key components:(1)  Consensus Extraction Module (CEM) learns a dynamic anchor, transforming multimodal  features into anomaly-free consensus prototypes to improve cross-modal consistency and suppress latent anomalies;(2) Supplementary  Query Module (SQM)  employs a Complementary Residual Attention mechanism to capture the discrepancy between the consensus and modality-specific spaces, thereby exploring the most representative and discriminative cues as supplementary prototypes; and(3)  Complementary Mapping Module adaptively integrates both prototypes to perform feature mapping.Extensive experiments demonstrate that CPMAD not only achieves superior performance in both full-data and few-shot settings across diverse industrial and medical scenarios but also maintains faster inference speeds and lower memory consumption compared to existing methods.The code will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37601", "url": null, "sourceid": 38149, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37380, "uid": "dfa7dbe6935687f9e719401f15bf07b7", "name": "Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench", "authors": [{"id": 187298, "fullname": "Fenfen Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/187298?format=json", "institution": "Beijing Academy of Artificial Intelligence; Wuhan University"}, {"id": 182966, "fullname": "Yesheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182966?format=json", "institution": "The Institute of Automation, Chinese Academy of Sciences"}, {"id": 187299, "fullname": "Haiyu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187299?format=json", "institution": "Peking University"}, {"id": 187300, "fullname": "Chen Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/187300?format=json", "institution": "Beijing Academy of Artificial Intelligence; Peking University"}, {"id": 147097, "fullname": "Zheqi He", "url": "http://cvpr.thecvf.com/api/miniconf/users/147097?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 187301, "fullname": "Mingxuan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187301?format=json", "institution": "Peking University"}, {"id": 187302, "fullname": "Miguel Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187302?format=json", "institution": "BAAI"}, {"id": 148615, "fullname": "Jin-Ge Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/148615?format=json", "institution": "BAAI"}, {"id": 147263, "fullname": "Xi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/147263?format=json", "institution": "Beijing Academy of Artificial Intelligence (BAAI)"}], "abstract": "Reading measurement instruments is effortless for humans and requires relatively little domain expertise, yet it remains surprisingly challenging for current vision-language models (VLMs) as we find in preliminary evaluation. In this work, we introduce MeasureBench, a benchmark on visual measurement reading covering both real-world and synthesized images of various types of measurements, along with an extensible pipeline for data synthesis. Our pipeline procedurally generates a specified type of gauge with controllable visual appearance, enabling scalable variation in key details such as pointers, scales, fonts, lighting, and clutter. Evaluation on popular proprietary and open-weight VLMs shows that even the strongest frontier VLMs struggle measurement reading in general. We have also conducted preliminary experiments with reinforcement finetuning (RFT) over synthetic data, and find a significant improvement on both in-domain synthetic subset and real-world images. Our analysis highlights a fundamental limitation of current VLMs in fine-grained spatial grounding. We hope this resource and our code releases can help future advances on visually grounded numeracy and precise spatial perception of VLMs, bridging the gap between recognizing numbers and measuring the world.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37380", "url": null, "sourceid": 44958, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39430, "uid": "8a45da3f21ea12d7948abeaf346c5bd1", "name": "Eliminate Distance Differences Induced by Backdoor Attacks: Layer-Selective Training and Clipping to Mask Backdoor Models", "authors": [{"id": 180905, "fullname": "Xuzeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180905?format=json", "institution": "Beijing Jiaotong University"}, {"id": 192057, "fullname": "Tao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192057?format=json", "institution": "Beijing Jiaotong University"}, {"id": 192058, "fullname": "Xiangyun Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192058?format=json", "institution": "Minzu University of China"}, {"id": 192059, "fullname": "JIACHENG WANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/192059?format=json", "institution": null}, {"id": 192060, "fullname": "Jian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192060?format=json", "institution": "Beijing Jiaotong University"}, {"id": 192061, "fullname": "Jiawen Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192061?format=json", "institution": "Guangdong University of Technology"}, {"id": 192062, "fullname": "Jiqiang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192062?format=json", "institution": "Beijing Jiaotong University"}, {"id": 192063, "fullname": "Zhen Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/192063?format=json", "institution": "Beijing Jiaotong University"}, {"id": 192064, "fullname": "Dusit Niyato", "url": "http://cvpr.thecvf.com/api/miniconf/users/192064?format=json", "institution": "Nanyang Technological University"}, {"id": 192065, "fullname": "Dong In Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/192065?format=json", "institution": "Sungkyunkwan University"}], "abstract": "Federated learning (FL) enables a central server to collaboratively train a global model with multiple clients while preserving data privacy. However, the distributed nature of FL makes the paradigm vulnerable to backdoor attacks, as proved by numerous recent studies. Although existing studies improve the effectiveness of backdoor attacks through optimized triggers, they have two limitations: (1) they ignore the heterogeneous contribution of individual model layers to the success of a backdoor; (2) they induce conspicuous differences between backdoor and clean models in the early stages of poisoning. The limitations cause backdoor models to exhibit significant discrepancies from clean models, making them easily detectable. To fill these gaps, we propose LaySelFL, a novel layer-selective method to eliminate distance differences induced by the backdoor to conceal attacks in FL. Our central insight is that different layers contribute unequally to backdoor attacks, by localizing poisoning to layers that are most sensitive to backdoor objectives, an attacker can reduce the model differences substantially between the backdoor and clean models. Concretely, LaySelFL identifies sensitive layers via both dynamic and static evaluations of parameter differences between backdoor and benign models, and then applies a targeted training protocol and a regularized loss that constrains differences from the global model in each round. Finally, LaySelFL performs clipping on non-poisoning layers to further mask residual differences introduced by the attack. This strategy yields a more covert and resilient backdoor attack. Extensive experiments show that LaySelFL increases the effectiveness of attacks by 25\\% and reduces the effectiveness of defense methods to 4\\%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39430", "url": null, "sourceid": 34666, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40363, "uid": "0a2fbc3a7639168c9e918f557d1802ac", "name": "AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models", "authors": [{"id": 129188, "fullname": "Xiaoqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129188?format=json", "institution": "Peking University"}, {"id": 191833, "fullname": "Muhe Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/191833?format=json", "institution": "Peking University"}, {"id": 191834, "fullname": "Jiadong Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191834?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 191835, "fullname": "Juan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191835?format=json", "institution": "PrimeBot"}, {"id": 185934, "fullname": "Hongwei Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185934?format=json", "institution": "Peking University"}, {"id": 129189, "fullname": "Yan Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/129189?format=json", "institution": "Peking University"}, {"id": 99906, "fullname": "Guanghui Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/99906?format=json", "institution": "AgiBot"}, {"id": 76571, "fullname": "Hao Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76571?format=json", "institution": "Peking University"}], "abstract": "Vision-Language-Action (VLA) models have significantly advanced robotic agents capable of executing diverse tasks; however, they remain limited in contact-rich manipulation scenarios that require precise physical interactions. To address this limitation, recent studies have attempted to incorporate tactile signals during downstream tasks, enabling pretrained VLAs to interpret tactile feedback. Nevertheless, introducing new modalities during finetuning, which are rarely present in the pretrain stage, may disrupt the pretrained capabilities of VLAs. In addition, the inherently slow inference speed of VLAs hampers real-time responsiveness and limits the effective utilization of tactile feedback for action adjustment.To overcome these challenges, we propose Adaptive Tactile Vision-Language-Action (AT-VLA), which introduces a novel Adaptive Tactile Injection mechanism. This mechanism dynamically determines the appropriate timing and locations for tactile injection, incorporating only when it significantly contributes to action generation, thereby minimizing interference with pretrained representations.Furthermore, to enable rapid and accurate tactile responses, we propose a Tactile Reaction Dual-Stream mechanism, which decouples sensory processing into a slow visual-language stream for low-frequency perceptual reasoning and a fast tactile control stream for high-frequency physical interaction understanding, achieving real-time close-loop responses within 0.04 s.Real-world experiments thoroughly validate the effectiveness of AT-VLA in contact-rich manipulation tasks.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40363", "url": null, "sourceid": -34758, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39317?format=json"], "related_events_ids": [39317]}, {"id": 37479, "uid": "3b05af2c48dbaf6656fdf2d2f905b3b6", "name": "How Far Can We Go With Synthetic Data for Audio-Visual Sound Source Localization?", "authors": [{"id": 87269, "fullname": "Arda Senocak", "url": "http://cvpr.thecvf.com/api/miniconf/users/87269?format=json", "institution": "KAIST"}, {"id": 187555, "fullname": "Sooyoung Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/187555?format=json", "institution": "Electronics and Telecommunications Research Institute"}, {"id": 152617, "fullname": "Tae-Hyun Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/152617?format=json", "institution": "KAIST"}, {"id": 126430, "fullname": "Joon Chung", "url": "http://cvpr.thecvf.com/api/miniconf/users/126430?format=json", "institution": "KAIST"}], "abstract": "We present the first scalable framework for training sound source localization (SSL) models using synthetic data from text-to-X models. Although SSL has made notable progress, existing models remain constrained by limited-scale, uncurated real-world datasets that often suffer from semantic misalignment. Furthermore, the introduction of new SSL tasks and benchmarks has increased the need for more generalizable models. To address these challenges, we leverage synthetic data to create synthetic clones of the VGGSound dataset, enabling both fully synthetic and hybrid real\u2013synthetic training. We demonstrate that synthetic data can effectively replace, refine, and scale real training datasets. Extensive experiments across multiple benchmarks show that synthetic data not only matches real data in performance but also enables significant improvements when combined with real samples. Our findings provide the first systematic evidence that synthetic data can serve as a scalable and effective approach for advancing SSL models.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37479", "url": null, "sourceid": 40343, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36446, "uid": "5cf0b0751a223522c722f87bc8a9628d", "name": "FSFSplatter: Geometrically Accurate Reconstruction with Free Sparse-view Images within 2 minutes", "authors": [{"id": 147680, "fullname": "Yibin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/147680?format=json", "institution": "East China University of Science and Technology"}, {"id": 185071, "fullname": "Yihan Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185071?format=json", "institution": "East China University of Science and Technology"}, {"id": 185072, "fullname": "Jun Nan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185072?format=json", "institution": "East China University of Science and Technology"}, {"id": 185073, "fullname": "Liwei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185073?format=json", "institution": "Shanghai xiaoyuan innovation institute"}, {"id": 185074, "fullname": "Jianjun YI", "url": "http://cvpr.thecvf.com/api/miniconf/users/185074?format=json", "institution": "East China University of Science and Technology"}], "abstract": "Gaussian Splatting has become a leading reconstruction technique, known for its high-quality novel view synthesis and detailed reconstruction. However, most existing methods require dense, calibrated views. Reconstruction from free sparse-view images often leads to poor surface due to limited overlap and overfitting.We introduce FSFSplatter for $\\textbf{f}$ast geometrically accurate reconstruction from $\\textbf{f}$ree $\\textbf{s}$parse-view images. Our method integrates end-to-end dense Gaussian scene initialization and geometry-enhanced scene optimization.Specifically, FSFSplatter employs a large transformer to encode multi-view images and generates a dense and geometrically consistent Gaussian scene initialization via a batch based self-splitting Gaussian head. It eliminates local floaters through contribution-based pruning and mitigates overfitting by leveraging depth and multi-view feature supervision, along with differentiable camera parameters within 2 minutes.FSFSplatter outperforms current state-of-the-art methods on widely used DTU, Replica, and BlendedMVS datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36446", "url": null, "sourceid": 35295, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37371, "uid": "fb52538ee970026501864b2272852dc4", "name": "Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients", "authors": [{"id": 147380, "fullname": "Ziwei Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/147380?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 156982, "fullname": "Fanhu Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/156982?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 187271, "fullname": "Hongjian Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187271?format=json", "institution": null}, {"id": 187272, "fullname": "Rui-Qi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187272?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 187273, "fullname": "Renxing Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187273?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 187274, "fullname": "Yanan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187274?format=json", "institution": "Beijing University"}, {"id": 185156, "fullname": "yi chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185156?format=json", "institution": "Institute of automation, Chinese academy of science"}, {"id": 187275, "fullname": "Peipei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187275?format=json", "institution": "Institute of automation, Chinese academy of sciences"}, {"id": 86374, "fullname": "Xu-Yao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86374?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}], "abstract": "Large Vision Language Models (LVLMs) have achieved remarkable success in a wide range of downstream tasks that require multimodal interaction, but their powerful capabilities come with substantial computational and memory overhead, which hinders practical deployment. Among numerous acceleration techniques, post-training quantization is a popular and effective strategy for reducing memory cost and accelerating inference. However, existing LVLM quantization methods typically measure token sensitivity at the modality level, which fails to capture the rich and complex cross-token interactions and falls short in quantitatively measuring the quantization error at the token level. As tokens interact within the model, the distinction between modalities gradually diminishes, suggesting the need for fine-grained calibration. Inspired by axiomatic attribution in mechanistic interpretability, we introduce a fine-grained quantization strategy based on Quantization-aware Integrated Gradients (QIG), which leverages integrated gradients to quantitatively evaluate token sensitivity and push the granularity from modality level to token level, reflecting both inter-modality and intra-modality dynamics. Extensive experiments on multiple LVLMs under both W4A8 and W3A16 settings show that our method consistently improves accuracy across models and benchmarks with negligible latency overhead. For example, under 3-bit weight-only quantization, our method improves the average accuracy of LLaVA-onevision-7B by 1.60\\%, reducing the gap to its full-precision counterpart to only 1.33\\%. The code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37371", "url": null, "sourceid": 34903, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38034, "uid": "1b9f12a814847b47a21871a32ac4349d", "name": "Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning", "authors": [{"id": 182405, "fullname": "Chi-Pin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182405?format=json", "institution": "NVIDIA"}, {"id": 85270, "fullname": "Yunze Man", "url": "http://cvpr.thecvf.com/api/miniconf/users/85270?format=json", "institution": "Department of Computer Science, University of Illinois at Urbana-Champaign"}, {"id": 91930, "fullname": "Zhiding Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/91930?format=json", "institution": "NVIDIA"}, {"id": 127278, "fullname": "Min-Hung Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/127278?format=json", "institution": "NVIDIA"}, {"id": 73960, "fullname": "Jan Kautz", "url": "http://cvpr.thecvf.com/api/miniconf/users/73960?format=json", "institution": "NVIDIA"}, {"id": 89818, "fullname": "Yu-Chiang Frank Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89818?format=json", "institution": "NVIDIA"}, {"id": 140039, "fullname": "Fu-En Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/140039?format=json", "institution": "NVIDIA"}], "abstract": "Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3\\% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38034", "url": null, "sourceid": 36014, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39865, "uid": "239960b9beb0a3e5097ab21c301bee18", "name": "Towards Robust Multi-Modal Semantic Segmentation with Teacher-Student Framework and Hybrid Prototype Distillation", "authors": [{"id": 181448, "fullname": "jiaqi tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181448?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 87543, "fullname": "Xu Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87543?format=json", "institution": "HKUST"}, {"id": 193025, "fullname": "Yang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193025?format=json", "institution": "BUPT"}], "abstract": "Multimodal semantic segmentation (MMSS) faces significant challenges in real-world applications due to incomplete, degraded, or missing sensor data. To address this, we propose RobustSeg, an efficient teacher-student framework that enhances model robustness under missing-modality conditions while maintaining strong performance in full-modality scenarios. RobustSeg adopts a feedback-based self-distillation paradigm consisting of two complementary stages. Firstly, we introduce Hybrid Prototype Distillation (HPD), which enables more reliable knowledge transfer of both cross-modal and modality-specific aspects. Concretely, combined with dominant-modality selection, HPD performs cross-modal semantic distillation with high-level semantic prototypes to reduce modality bias. Meanwhile, HPD conducts intra-class feature variation distillation for modality-specific structural details. Secondly, to enable the teacher model to gradually produce more balanced and robust modality representations, we make the student model provide feedback from the non-dominant modality to the teacher, benefiting the entire distillation process. Experiments on three datasets demonstrate that our method achieves state-of-the-art robustness (e.g., +2.40% missing-modality performance on DeLiVER) while causing almost no degradation in full-modality performance (only -0.1% mIoU). Moreover, evaluations using different backbones (AnySeg and CMNeXt) further validate the generalization ability of RobustSeg.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39865", "url": null, "sourceid": 36929, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37305, "uid": "9a2df0715e604dc5b670a2ffcc67dbc7", "name": "4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation", "authors": [{"id": 187128, "fullname": "Chiao-An Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187128?format=json", "institution": "Purdue University"}, {"id": 86491, "fullname": "Ryo Hachiuma", "url": "http://cvpr.thecvf.com/api/miniconf/users/86491?format=json", "institution": "Konica Minolta, Inc."}, {"id": 76011, "fullname": "Sifei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76011?format=json", "institution": "NVIDIA"}, {"id": 150928, "fullname": "Subhashree Radhakrishnan", "url": "http://cvpr.thecvf.com/api/miniconf/users/150928?format=json", "institution": "NVIDIA"}, {"id": 85283, "fullname": "Raymond A. Yeh", "url": "http://cvpr.thecvf.com/api/miniconf/users/85283?format=json", "institution": "Purdue University"}, {"id": 89818, "fullname": "Yu-Chiang Frank Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89818?format=json", "institution": "NVIDIA"}, {"id": 127278, "fullname": "Min-Hung Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/127278?format=json", "institution": "NVIDIA"}], "abstract": "Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting.We tackle these issues by introducing:(a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception;(b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and(c) \\ourbenchmark, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline.Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37305", "url": null, "sourceid": 33022, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39044, "uid": "1c34b2e86eacc79f7b3c7d6902d1388c", "name": "Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation", "authors": [{"id": 180629, "fullname": "Zhen Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/180629?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 155058, "fullname": "Jian Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155058?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 86712, "fullname": "Biwen Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/86712?format=json", "institution": "Alibaba Group"}, {"id": 191239, "fullname": "Jing Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191239?format=json", "institution": null}, {"id": 107143, "fullname": "Haohan Weng", "url": "http://cvpr.thecvf.com/api/miniconf/users/107143?format=json", "institution": "South China University of Technology"}, {"id": 191240, "fullname": "Yiling Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191240?format=json", "institution": "Tencent hunyuan"}, {"id": 155060, "fullname": "Zhuo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/155060?format=json", "institution": null}, {"id": 191241, "fullname": "Junfeng Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191241?format=json", "institution": null}, {"id": 191242, "fullname": "Yunkai Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/191242?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 191243, "fullname": "Dazhao Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/191243?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 126142, "fullname": "Song Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/126142?format=json", "institution": "Department of Computer Science and Engineering, Hong Kong University of Science and Technology"}, {"id": 191244, "fullname": "Fengshui Jing", "url": "http://cvpr.thecvf.com/api/miniconf/users/191244?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 129664, "fullname": "Chunchao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/129664?format=json", "institution": "Tencent"}], "abstract": "Reinforcement learning (RL) has demonstrated remarkable success in text and image generation, yet its potential in 3D generation remains largely unexplored. Existing attempts typically rely on offline direct preference optimization (DPO) method, which suffers from low training efficiency and limited generalization. In this work, we aim to enhance both the training efficiency and generation quality of RL in 3D mesh generation. Specifically, (1) we design the first asynchronous online RL framework tailored for 3D mesh generation post-training efficiency improvement, which is 3.75$\\times$ faster than synchronous RL. (2) We propose Advantage-guided Ranking Preference Optimization (ARPO), a novel RL algorithm that achieves a better trade-off between training efficiency and generalization than current RL algorithms designed for 3D mesh generation, such as DPO and group relative policy optimization (GRPO). (3) Based on asynchronous ARPO, we propose Mesh-Pro, which additionally introduces a novel diagonal-aware mixed triangular-quadrilateral tokenization for mesh representation and a ray-based reward for geometric integrity. Mesh-Pro achieves state-of-the-art performance on artistic and dense meshes. Code will be available soon.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39044", "url": null, "sourceid": 33145, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38388, "uid": "5527123f5a35a024c0f3bf4689e76e7b", "name": "Iterative Closed-Loop Motion Synthesis for Scaling the Capabilities of Humanoid Control", "authors": [{"id": 180342, "fullname": "Weisheng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180342?format=json", "institution": "Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 189767, "fullname": "Qiwei Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189767?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 189768, "fullname": "Jiaxi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189768?format=json", "institution": "The Hong Kong University of Science and Technology; Guangdong University of Technology"}, {"id": 189769, "fullname": "Jing Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189769?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 189770, "fullname": "Yangfan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189770?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 189771, "fullname": "Yuetong Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189771?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 189772, "fullname": "Jiaqi Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/189772?format=json", "institution": "University of Oxford"}, {"id": 189773, "fullname": "Kai Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189773?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 189774, "fullname": "Rong OU", "url": "http://cvpr.thecvf.com/api/miniconf/users/189774?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 126902, "fullname": "Renjing Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126902?format=json", "institution": "Hong Kong University of Science and Technology (Guangzhou)"}], "abstract": "Physics-based humanoid control relies on training with motion datasets that have diverse data distributions. However, the fixed difficulty distribution of datasets limits the performance ceiling of the trained control policies. Additionally, the method of acquiring high-quality data through professional motion capture systems is constrained by costs, making it difficult to achieve large-scale scalability. To address these issues, we propose a closed-loop automated motion data generation and iterative framework. It can generate high-quality motion data with rich action semantics, including martial arts, dance, combat, sports, gymnastics, and more. Furthermore, our framework enables difficulty iteration of policies and data through physical metrics and objective evaluations, allowing the trained tracker to break through its original difficulty limits. On the PHC single-primitive tracker, using only approximately 1/10 of the AMASS dataset size, the average failure rate on the test set (2201 clips) is reduced by 45% compared to the baseline. Finally, we conduct comprehensive ablation and comparative experiments to highlight the rationality and advantages of our framework.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38388", "url": null, "sourceid": 42099, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39157, "uid": "56fc31d371e2b988e7b0b3a96ea2ae42", "name": "MeshMosaic: Scaling Artist Mesh Generation via Local-to-Global Assembly", "authors": [{"id": 191473, "fullname": "Rui Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191473?format=json", "institution": "University of Hong Kong"}, {"id": 191474, "fullname": "Tianyang Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/191474?format=json", "institution": "University of Hong Kong"}, {"id": 191475, "fullname": "Qiujie Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/191475?format=json", "institution": "Shandong University"}, {"id": 191476, "fullname": "Le Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191476?format=json", "institution": null}, {"id": 153720, "fullname": "Zhe Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153720?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 131523, "fullname": "Peng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/131523?format=json", "institution": "Tsinghua University"}, {"id": 190466, "fullname": "Zhiyang Dou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190466?format=json", "institution": "MIT"}, {"id": 90999, "fullname": "Cheng Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/90999?format=json", "institution": "Tencent"}, {"id": 128311, "fullname": "Shiqing Xin", "url": "http://cvpr.thecvf.com/api/miniconf/users/128311?format=json", "institution": "Shandong University"}, {"id": 77389, "fullname": "Yuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/77389?format=json", "institution": "The University of Hong Kong"}, {"id": 87784, "fullname": "Wenping Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87784?format=json", "institution": "Texas A&amp;M University - College Station"}, {"id": 89090, "fullname": "Taku Komura", "url": "http://cvpr.thecvf.com/api/miniconf/users/89090?format=json", "institution": "the University of Hong Kong, University of Hong Kong"}], "abstract": "Scaling artist-designed meshes to high triangle numbers remains challenging for autoregressive generative models. Existing transformer-based methods suffer from long-sequence bottlenecks and limited quantization resolution, primarily due to the large number of tokens required and constrained quantization granularity. These issues prevent faithful reproduction of fine geometric details and structured density patterns.We introduce MeshMosaic, a novel local-to-global framework for artist mesh generation that scales to over 100K triangles\u2014substantially surpassing prior methods, which typically handle only around 8K faces. MeshMosaic first segments shapes into patches, generating each patch autoregressively and leveraging shared boundary conditions to promote coherence, symmetry, and seamless connectivity between neighboring regions.This strategy enhances scalability to high-resolution meshes by quantizing patches individually, resulting in more symmetrical and organized mesh density and structure.Extensive experiments across multiple public datasets demonstrate that MeshMosaic significantly outperforms state-of-the-art methods in both geometric fidelity and user preference, supporting superior detail representation and practical mesh generation for real-world applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39157", "url": null, "sourceid": 44622, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37640, "uid": "6e9c8746c47446e383b404bc3e5cab85", "name": "Watch and Learn: Learning to Use Computers from Online Videos", "authors": [{"id": 97322, "fullname": "Chan Hee Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/97322?format=json", "institution": "The Ohio State University"}, {"id": 187931, "fullname": "Yiwen Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/187931?format=json", "institution": "Google"}, {"id": 187932, "fullname": "Palash Goyal", "url": "http://cvpr.thecvf.com/api/miniconf/users/187932?format=json", "institution": "Google"}, {"id": 128280, "fullname": "Yu Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/128280?format=json", "institution": "The Ohio State University"}, {"id": 187933, "fullname": "Oriana Riva", "url": "http://cvpr.thecvf.com/api/miniconf/users/187933?format=json", "institution": "Google DeepMind"}, {"id": 92461, "fullname": "Hamid Palangi", "url": "http://cvpr.thecvf.com/api/miniconf/users/92461?format=json", "institution": "Google"}, {"id": 84645, "fullname": "Tomas Pfister", "url": "http://cvpr.thecvf.com/api/miniconf/users/84645?format=json", "institution": "Google"}], "abstract": "Computer-using agents (CUAs) must plan task workflows across diverse and evolving applications, yet progress is limited by the lack of large-scale, high-quality training data. Existing datasets are narrow, static, and costly to annotate, while synthetic data often yields oversimplified or misaligned behaviors. We present Watch & Learn (W&L), a framework that converts readily available Internet videos of human computer use into executable UI trajectories at scale. Instead of directly generating actions or relying on handcrafted heuristics, we cast trajectory annotation as an inverse dynamics problem that predicts user actions from consecutive screen states, which simplifies learning and generalizes across domains. Through a task-aware retrieval and labeling pipeline, W&L yields over 53K high-quality trajectories that enhance CUAs both as in-context exemplars and as supervised training data. On OSWorld, it consistently improves general-purpose and specialized CUAs, while on WindowsAgentArena it achieves state-of-the-art performance among 7B-scale models under the 15-step limit. These results show that web-scale human demonstration videos can serve as a practical and scalable foundation for advancing real-world CUAs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37640", "url": null, "sourceid": 41991, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36520, "uid": "e12c76ac7c27a883f030e9b9807441e5", "name": "Measure The Feature Universe: Topology-based Pseudo Labeling and Gravity Consistency for Source-Free Domain Adaptation", "authors": [{"id": 185260, "fullname": "Jae Yun Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/185260?format=json", "institution": "Sogang University"}, {"id": 185261, "fullname": "Hyeok Nam", "url": "http://cvpr.thecvf.com/api/miniconf/users/185261?format=json", "institution": "Sogang University"}, {"id": 185262, "fullname": "Sung In Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/185262?format=json", "institution": "Sogang University"}], "abstract": "Source-free domain adaptation (SFDA) adapts a pre-trained source model to an unlabeled target domain using only the model itself, typically relying on pseudo labeling augmented with auxiliary knowledge and consistency regularization (CR) mechanisms to alleviate noise in the generated pseudo labels. However, existing approaches overlook the geometric structure of the target embedding manifold when assigning pseudo labels, resulting in unreliable distance measurements and consequently severe mislabeling.Moreover, their CR is applied solely to output logits, making it insensitive to feature-level reliability. To solve these issues, we propose a novel pseudo labeling scheme based on geometry aware-universe feature space and a new gravity CR loss.Our pseudo labeling strategy first models the embedding space with virtual features to make geometry-aware universe feature space. On this space, pseudo labels are generated through feature traversal, which propagates labels only from statistically reliable regions. In addition, the proposed CR jointly encourages logit- and feature-level consistency, aligning predictions for augmented images while preserving the geometric structure of the embedding space. It further modulates the strength of CR for each sample, preventing the confirmation of noisy pseudo labels through a gravity-based force defined between two input embeddings.Experiments on Office-Home, DomainNet-126, and VisDA-C demonstrate consistent improvements over prior SFDA methods, and incorporating gravity CR loss into baselines yields substantial additional gains.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36520", "url": null, "sourceid": 44452, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36341, "uid": "0ff73ca20cd787c5b817aff62e7890da", "name": "Revisiting 3D Reconstruction Kernels as Low-Pass Filters", "authors": [{"id": 126735, "fullname": "Shengjun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126735?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 184817, "fullname": "Min Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184817?format=json", "institution": "Tsinghua University"}, {"id": 184818, "fullname": "Yibo Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/184818?format=json", "institution": ",Wuhan University"}, {"id": 184819, "fullname": "Mingyu Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/184819?format=json", "institution": "Tsinghua University"}, {"id": 76969, "fullname": "Yueqi Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/76969?format=json", "institution": "Tsinghua University"}], "abstract": "3D reconstruction is to recover 3D signals from the sampled discrete 2D pixels, with the goal to converge continuous 3D spaces.In this paper, we revisit 3D reconstruction from the perspective of signal processing, identifying the periodic spectral extension induced by discrete sampling as the fundamental challenge.Previous 3D reconstruction kernels, such as Gaussians, Exponential functions, and Student's t distributions, serve as the low pass filters to isolate the baseband spectrum.However, their unideal low-pass property results in the overlap of high-frequency components with low-frequency components in the discrete-time signal\u2019s spectrum.To this end, we introduce Jinc kernel with an instantaneous drop to zero magnitude exactly at the cutoff frequency, which is corresponding to the ideal low pass filters.As Jinc kernel suffers from low decay speed in the spatial domain, we further propose modulated kernels to strick an effective balance, and achieves superior rendering performance by reconciling spatial efficiency and frequency-domain fidelity.Experimental results have demonstrated the effectiveness of our Jinc and modulated kernels.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36341", "url": null, "sourceid": 45969, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36412, "uid": "d9020d9a5ca6747225528d00572ac845", "name": "SAME: Sparse and Anchored Model Editing for Heterogeneous Incremental Learning under Limited Data", "authors": [{"id": 180870, "fullname": "Zixuan Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/180870?format=json", "institution": "Nanjing University"}, {"id": 184976, "fullname": "Zeyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184976?format=json", "institution": "nanjing university"}, {"id": 184977, "fullname": "Fengyuan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184977?format=json", "institution": "nanjing university"}, {"id": 181212, "fullname": "Shaofeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181212?format=json", "institution": "University of Science and Technology of China"}, {"id": 184978, "fullname": "Wenbin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184978?format=json", "institution": "Nanjing University"}, {"id": 130772, "fullname": "Qi Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/130772?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 86625, "fullname": "Yang Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86625?format=json", "institution": "Nanjing University"}], "abstract": "Existing Incremental Learning (IL) methods are primarily evaluated under either a single-domain class-incremental setting, or a multi-domain task-incremental setting with known task identifiers. However, these assumptions often fail to hold in real-world applications. To bridge this gap, we introduce Heterogeneous Incremental Learning (HIL), a new setting for evaluating IL methods under realistic and challenging conditions, where task boundaries are ambiguous or unknown, class distributions shift dynamically across environments, and training data is limited. Model editing is inherently well-suited for this challenging HIL, as it allows for the efficient integration of new knowledge while preserving model capabilities. Thus, we propose a novel Sparse and Anchored Model Editing (SAME) for addressing HIL. Specifically, SAME sparsely and selectively updates task-relevant model parameters to extract compact, task-specific key\u2013value knowledge pairs from limited data. Using these task knowledge pairs, the model performs knowledge injection for new tasks under double-anchor constraints. The knowledge anchor aligns the updated and original model features, while the parameter anchor constrains parameter magnitudes, ensuring stable and consistent knowledge injection. Our method can efficiently solve HIL using only a few labeled examples, without introducing additional model parameters. Extensive experiments on 11 diverse visual-language datasets across 22 sequential tasks show that our method outperforms existing continual learning approaches by 6.8% in average accuracy, while retaining 95.8% of the oracle model performance, demonstrating strong stability and cross-domain generalization.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36412", "url": null, "sourceid": 45662, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37912, "uid": "3102051b5f75731d90b707b4ce0fd4a8", "name": "MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models", "authors": [{"id": 133107, "fullname": "Sang Yun Chung", "url": "http://cvpr.thecvf.com/api/miniconf/users/133107?format=json", "institution": "KAIST"}, {"id": 188574, "fullname": "Se Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/188574?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 188575, "fullname": "Youngchae Chee", "url": "http://cvpr.thecvf.com/api/miniconf/users/188575?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 84664, "fullname": "Yong Man Ro", "url": "http://cvpr.thecvf.com/api/miniconf/users/84664?format=json", "institution": "Korea Advanced Institute of Science and Technology"}], "abstract": "Multimodal Large Language Models (MLLMs) suffer from cross-modal hallucinations, where one modality inappropriately influences generation about another, leading to fabricated output. This exposes a more fundamental deficiency in modality-interaction control. To address this, we propose Modality-Adaptive Decoding (MAD), a training-free method that adaptively weights modality-specific decoding branches based on task requirements. MAD leverages the model's inherent ability to self-assess modality relevance by querying which modalities are needed for each task. The extracted modality probabilities are then used to adaptively weight contrastive decoding branches, enabling the model to focus on relevant information while suppressing cross-modal interference. Extensive experiments on CMM and AVHBench demonstrate that MAD significantly reduces cross-modal hallucinations across multiple audio-visual language models (7.8\\% and 2.0\\% improvements for VideoLLaMA2-AV, 8.7\\% and 4.7\\% improvements for Qwen2.5-Omni). Our approach demonstrates that explicit modality awareness through self-assessment is crucial for robust multimodal reasoning, offering a principled extension to existing contrastive decoding methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37912", "url": null, "sourceid": 42996, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39141, "uid": "a234673350319edb8a0aa1cd94236d48", "name": "Harmonic Canvas: Inversion-Free Editing for Visually-Guided Music Style Transfer", "authors": [{"id": 191434, "fullname": "Yue Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/191434?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 178632, "fullname": "Siqi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/178632?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 191435, "fullname": "Ting Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/191435?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 182021, "fullname": "Fan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/182021?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Music style transfer (MST) aims to reinterpret existing musical pieces in new stylistic forms while maintaining their melodic coherence. Conventional approaches conditioned on text or audio overlook the profoundly multimodal character of musical style. Visual ambience -- reflected in color, lighting, and composition -- encodes affective attributes that parallel timbre, rhythm, and harmony, which, however, remain underexplored in MST context. We introduce a flow-based, inversion-free framework for multimodal music style transfer that unifies textual and visual guidance. Our approach tackles two challenges: (1) capturing cross-modal semantics beyond language through a dual-encoder fusion module that merges CLIP- and ViT-derived embeddings, and (2) preserving melodic identity using a differentiable normalized chroma constraint that regulates pitch-class consistency along the generative flow. We reorganize and extend the MeLBench and MusicCaps collections into a genre-structured multimodal dataset to support style-aware analysis. Quantitative and perceptual evaluations demonstrate that our approach achieves superior control, structural fidelity, and cross-modal expressiveness, underscoring the role of visual perception in music generation.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39141", "url": null, "sourceid": 38389, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38251, "uid": "64932b5ac0489696a19c899d53a20c3a", "name": "DeepfakeImpact: A Two-Stage Benchmark with Real-World Impact in Deepfake Detection", "authors": [{"id": 189423, "fullname": "Chaoyu Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/189423?format=json", "institution": "Nanyang Technological University"}, {"id": 189424, "fullname": "Han Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189424?format=json", "institution": "University of Washington"}, {"id": 184188, "fullname": "Siqiang Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/184188?format=json", "institution": "Nanyang Technological University"}], "abstract": "A fundamental yet overlooked limitation of current deepfake detection benchmark is the lack of evaluation frameworks that align technical accuracy with real-world impact. We argue that technical metrics may fail to capture models' actual capacity to mitigate real-world harm, as they treat all errors as equally significant. To bridge this gap, we introduce DeepfakeImpact, a two-stage benchmark that moves beyond pure technical evaluation toward societally-aware assessment. In Stage I, we establish standardized technical baselines by evaluating 33 SOTA detection baslines across 12 widely used datasets. In Stage II, we propose a novel metric (Social Misjudgment Impact, SMI) that quantifies the potential social harm of misclassified videos, and construct a SMI-critical dataset containing high-risk samples. By integrating SMI-aware performance metrics, we shift the evaluation focus from \"how accurate'' to \"how socially beneficial'' a detector is. DeepfakeImpact thus provides a more realistic and ethically-grounded foundation for assessing deepfake detectors, urging the community to rethink what truly constitutes progress in this field. All resources will be publicly released at: \\url{https://anonymous.4open.science/r/DeepfakeImpact-Stage1-F5EC}.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38251", "url": null, "sourceid": 41054, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39638, "uid": "7fbaa471e85a13aada114a4b1065215a", "name": "PQDT: Pseudo-Query Dual Transformer for Robust Point Cloud Restoration", "authors": [{"id": 180785, "fullname": "Haoqing Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180785?format=json", "institution": "Mercedes-Benz AG"}, {"id": 192541, "fullname": "Alexa Nawotki", "url": "http://cvpr.thecvf.com/api/miniconf/users/192541?format=json", "institution": "Mercedes-Benz"}, {"id": 192542, "fullname": "Jochen Garcke", "url": "http://cvpr.thecvf.com/api/miniconf/users/192542?format=json", "institution": "University of Bonn and Fraunhofer SCAI"}], "abstract": "Point clouds are a fundamental 3D representation in computer vision, enabling a wide range of perception tasks. However, real-world point clouds often suffer from degradations such as incompleteness, noise, outliers, and irregular density, caused by sensor limitations or occlusions. Recovering clean and detailed shapes from such degraded data is crucial for downstream applications. While existing learning-based methods achieve progress on individual tasks like completion or denoising, they typically rely on global bottleneck features, which lose fine-grained geometry and remain sensitive to varying input quality. We propose a unified 3D restoration network that directly takes point clouds as input and adaptively reconstructs high-quality geometry under diverse degradation scenarios. At the core of our approach is a Pseudo-Query module, implemented within a Transformer backbone, which reformulates geometric translation into two cooperative stages to enhance structural clarity, robustness, and local detail preservation. Extensive experiments on curated benchmarks demonstrate that our approach surpasses state-of-the-art performance in general 3D restoration. It effectively handles complex combinations of completion, deformation, and denoising degradations. With this work, we provide a novel unified, point-only backbone for robust 3D restoration, paving the way for more versatile 3D perception.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39638", "url": null, "sourceid": 30637, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38016, "uid": "df9af533c2c3d8942ada8a91c88f5e59", "name": "BiProLoRA: Bilevel Prompt LoRA for Real Scene Recovery", "authors": [{"id": 188835, "fullname": "Nan An", "url": "http://cvpr.thecvf.com/api/miniconf/users/188835?format=json", "institution": "Dalian University of Technology"}, {"id": 69209, "fullname": "Long Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/69209?format=json", "institution": "Dalian University of Technology"}, {"id": 151421, "fullname": "Tengyu Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/151421?format=json", "institution": "Dalian University of Technology"}, {"id": 152574, "fullname": "Zhu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152574?format=json", "institution": "Dalian University of Technology"}, {"id": 188836, "fullname": "Yingchi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188836?format=json", "institution": "Shenyang Research Institute of Foundrary Co.,Ltd.CAM"}, {"id": 131737, "fullname": "Risheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131737?format=json", "institution": "Dalian University of Technology"}], "abstract": "The emergence of large generative models has substantially advanced learning-based scene recovery in the synthetic domain. However, applying these models directly to real scenarios reveals sub-optimal performance stemming from the significant distribution gap, alongside poor adaptation to complex and unforeseen degradations. Consequently, it is imperative to develop a real scene adaptation strategy that yields faithful restorations with reliable generalizability. To this end, we propose $\\textbf{Bi}$level $\\textbf{Pro}$mpt $\\textbf{LoRA}$, a novel learning paradigm designed to effectively adapt pre-trained generative models for real scene recovery. First, we introduce a self-supervised distribution-fidelity learning scheme to calibrate the autoencoding pathway under task-irrelevant real distributions, thereby recovering high-fidelity textures. Subsequently, a bilevel joint modeling via hyperparameter optimization is further established, empowering robust synthetic-to-real adaptation for both seen and unseen scenes by exploiting the complementary advantages between LoRA and Prompts to foster mutual promotion. Extensive evaluations on diverse real adverse scenarios demonstrate our method's superiority, with comprehensive algorithm analyses proving our effectiveness. The code will be public released upon the acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38016", "url": null, "sourceid": 38957, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39180, "uid": "d15505e6d277a01b10c86b9137f57d69", "name": "SMV-EAR: Bring Spatiotemporal Multi-View Representation Learning into Efficient Event-Based Action Recognition", "authors": [{"id": 191515, "fullname": "Rui Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191515?format=json", "institution": "Xidian University"}, {"id": 191516, "fullname": "Weidong Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191516?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 191517, "fullname": "Juntao Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191517?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 191518, "fullname": "Lai Rui", "url": "http://cvpr.thecvf.com/api/miniconf/users/191518?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 191519, "fullname": "Tong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191519?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 145864, "fullname": "Fanhong Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/145864?format=json", "institution": "Xidian University"}, {"id": 186056, "fullname": "Lin Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186056?format=json", "institution": "Tohoku University"}], "abstract": "Event cameras action recognition (EAR) offers compelling privacy-protecting and efficiency advantages, where temporal motion dynamics is of great importance. Existing spatiotemporal multi-view representation learning (SMVRL) methods for event-based object recognition (EOR) offer promising solutions by projecting $H$-$W$-$T$ events alone spatial axis $H$ and $W$, yet are limited by its translation-variant spatial binning representation and naive early concatenation fusion architecture. This paper reexamines the key SMVRL design stages for EAR and propose: (i) a principled spatiotemporal multi-view representation through translation-invariant dense conversion of sparse events, (ii) a dual-branch, dynamic fusion architecture that models sample-wise complementarity between motion features from different views, and (iii) a bio-inspired temporal warping augmentation that mimics speed variability of real-world human actions. On three challenging EAR datasets of HARDVS, DailyDVS-200 and THU-EACT-50-CHL, we show +7.0\\%, +10.7\\%, and +10.2\\% Top-1 accuracy gains over  existing SMVRL EOR method with surprising 30.1\\% reduced parameters and 35.7\\% lower computations, establishing our framework as a novel and powerful EAR paradigm. Code will be released once accepted.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39180", "url": null, "sourceid": 33601, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37352, "uid": "e66160e598192a253a015cb90ab371dc", "name": "Decoupled Generative Modeling for Human-Object Interaction Synthesis", "authors": [{"id": 156348, "fullname": "Hwanhee Jung", "url": "http://cvpr.thecvf.com/api/miniconf/users/156348?format=json", "institution": "Korea University"}, {"id": 187231, "fullname": "Seunggwan Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/187231?format=json", "institution": "LG CNS; Korea University"}, {"id": 187232, "fullname": "Jeongyoon Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/187232?format=json", "institution": "Korea University"}, {"id": 187233, "fullname": "SeungHyeon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/187233?format=json", "institution": "Korea University"}, {"id": 159460, "fullname": "Giljoo Nam", "url": "http://cvpr.thecvf.com/api/miniconf/users/159460?format=json", "institution": "Meta"}, {"id": 91087, "fullname": "Qixing Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91087?format=json", "institution": "University of Texas at Austin"}, {"id": 130426, "fullname": "Sangpil Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/130426?format=json", "institution": "Korea University"}], "abstract": "Synthesizing realistic human-object interaction (HOI) is essential for 3D computer vision and robotics, underpinning animation and embodied control. Existing approaches often require manually specified intermediate waypoints and place all optimization objectives on a single network, which increases complexity, reduces flexibility, and leads to errors such as unsynchronized human and object motion or penetration. To address these issues, we propose Decoupled Generative Modeling for Human-Object Interaction Synthesis (DecHOI), which separates path planning and action synthesis. A trajectory generator first produces human and object trajectories without prescribed waypoints, and an action generator conditions on these paths to synthesize detailed motions. To further improve contact realism, we employ adversarial training with a discriminator that focuses on the dynamics of distal joints. The framework also models a moving counterpart and supports responsive, long\u2011sequence planning in dynamic scenes, while preserving plan consistency. Across two benchmarks, FullBodyManipulation and 3D-FUTURE, DecHOI surpasses prior methods on most quantitative metrics and qualitative evaluations, and perceptual studies likewise prefer our results.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37352", "url": null, "sourceid": 42674, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37614, "uid": "1f710d07916bb3151c453c764cfaf1ca", "name": "Hierarchical Attacks for Multi\u2011Modal Multi\u2011Agent Reasoning", "authors": [{"id": 180751, "fullname": "Hao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/180751?format=json", "institution": null}, {"id": 187864, "fullname": "Tiru Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187864?format=json", "institution": "JD"}, {"id": 182666, "fullname": "yan jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182666?format=json", "institution": "JD.com"}, {"id": 187865, "fullname": "Wanqi Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/187865?format=json", "institution": "JD.com"}, {"id": 187866, "fullname": "Junxing Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187866?format=json", "institution": "JD.com"}, {"id": 187867, "fullname": "Ai Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/187867?format=json", "institution": "JD.com"}], "abstract": "Multi\u2011modal multi\u2011agent systems (MM\u2011MAS) have gained increasing attention for their capacity to enable complex reasoning and coordination across diverse modalities. As these systems continue to expand in scale and functionality, investigating their potential vulnerabilities has become increasingly important.However, existing studies on adversarial attacks in multi\u2011agent systems primarily focus on isolated agents or unimodal settings, leaving the vulnerabilities of MM\u2011MAS largely underexplored. To bridge this gap, we introduce HAM\\textsuperscript{3}, a Hierarchical Attack framework for multi-modal multi-agent systems that decomposes attacks into three interconnected layers. Specifically, at the perception layer, HAM\\textsuperscript{3}mounts attacks by perturbing visual inputs, textual inputs, and their fused visual\u2013textual representations. At the communication layer, it performs communication-level attacks that corrupt message content and interaction topology, such as manipulating shared context or communication links to distort collective information flow. At the reasoning layer, it conducts reasoning-level attacks that interfere with each agent\u2019s cognitive pipeline, biasing reasoning trajectories and ultimately compromising final decisions. We evaluate HAM\\textsuperscript{3} on the GQA benchmark through multi\u2011agent systems built on distinct reasoning paradigms including ReAct, Plan\u2011and\u2011Solve, and Reflexion. Experiments demonstrate that our framework achieves an Attack Success Rate of up to 78.3\\%, with reasoning\u2011layer attacks being the most effective. More than half of the successful attacks lead multiple agents to produce consistent errors. These findings offer valuable insights for building more robust and interpretable multi\u2011agent intelligence.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37614", "url": null, "sourceid": 32090, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40108, "uid": "ed16bc5bad6025e0777eeedb1e0cd6bc", "name": "WebGym: Scaling Training Environments for Long-Horizon Visual Web Agents with A Million Realistic Tasks", "authors": [{"id": 182267, "fullname": "Hao Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/182267?format=json", "institution": "UIUC, MSR"}, {"id": 193554, "fullname": "Alexey Taymanov", "url": "http://cvpr.thecvf.com/api/miniconf/users/193554?format=json", "institution": "Research, Microsoft"}, {"id": 130359, "fullname": "Tong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130359?format=json", "institution": "UIUC"}, {"id": 193555, "fullname": "Aviral Kumar", "url": "http://cvpr.thecvf.com/api/miniconf/users/193555?format=json", "institution": "Carnegie Mellon University; Google DeepMind"}, {"id": 193556, "fullname": "Spencer Whitehead", "url": "http://cvpr.thecvf.com/api/miniconf/users/193556?format=json", "institution": "Microsoft"}], "abstract": "We present WebGym, the largest open-source environment for training realistic visual web agents to date. Real websites are non-stationary and diverse, making artificial or small-scale task sets insufficient for robust policy learning. WebGym contains nearly 1 million tasks with rubric-based evaluations across diverse, real-world websites and difficulty levels. We train agents with a simple reinforcement learning (RL) algorithm, REINFORCE, which trains on the agent's own interaction traces (rollouts), using task rewards as feedback to guide learning. To speed up sampling rollouts for RL training, we develop a high-throughput asynchronous rollout system, designed specifically for web agents, that achieves a 4-5x rollout speedup compared to naive implementations, enabling us to train at scale on a diverse set of tasks. With this setup, we fine-tune strong vision-language models, such as Qwen-3-VL-8B-Instruct, on the training tasks from WebGym, which results in an improvement in success rate on an out-of-distribution test set from 21.8% to 28.5%, outperforming a \"proprietary\" GPT-4o-based agent and closing the gap to a GPT-5-Thinking agent that achieves 31.8%. This improvement is significant because our test set consists only of tasks on websites never seen during training, demonstrating generalization for web agents. We provide both the task breadth and system throughput for large-scale RL on web agents.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40108", "url": null, "sourceid": 38362, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37772, "uid": "b1bf0038e7a15b5b3dcecf1576af8863", "name": "GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation", "authors": [{"id": 182905, "fullname": "Rang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182905?format=json", "institution": "Peking University"}, {"id": 153681, "fullname": "Lei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/153681?format=json", "institution": "University of Hong Kong"}, {"id": 107456, "fullname": "Shuhuai Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/107456?format=json", "institution": "Peking University"}, {"id": 156956, "fullname": "Hao Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/156956?format=json", "institution": "Sensetime Research Institute"}, {"id": 188227, "fullname": "Shuhao Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188227?format=json", "institution": null}, {"id": 128133, "fullname": "Shicheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/128133?format=json", "institution": "Peking University"}, {"id": 188228, "fullname": "Zihao Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/188228?format=json", "institution": "Renmin University of China"}, {"id": 188229, "fullname": "Yudong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188229?format=json", "institution": "Peking University"}, {"id": 188230, "fullname": "Wenhan Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/188230?format=json", "institution": "Peking University"}, {"id": 188231, "fullname": "Zhe Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188231?format=json", "institution": "Peking University"}, {"id": 188232, "fullname": "Jingyuan Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/188232?format=json", "institution": "Peking University"}, {"id": 156517, "fullname": "Zhifang Sui", "url": "http://cvpr.thecvf.com/api/miniconf/users/156517?format=json", "institution": "Peking University"}, {"id": 188233, "fullname": "Fuli Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188233?format=json", "institution": "Individual Researcher"}], "abstract": "Visual grounding\u2014localizing objects from natural language descriptions\u2014represents a critical bridge between language and vision understanding. While multimodal large language models (MLLMs) achieve impressive scores on existing benchmarks, a fundamental question remains: can MLLMs truly ground language in vision with human-like sophistication, or are they merely pattern-matching on simplified datasets? Current benchmarks fail to capture real-world complexity where humans effortlessly navigate ambiguous references and recognize when grounding is impossible. To rigorously assess MLLMs' true capabilities, we introduce GroundingME, a benchmark that systematically challenges models across four critical dimensions: (1) Discriminative\u2014distinguishing highly similar objects, (2) Spatial\u2014understanding complex relational descriptions, (3) Limited\u2014handling occlusions or tiny objects, and (4) Rejection\u2014recognizing ungroundable queries. Through careful curation combining automated generation with human verification, we create 1,005 challenging examples mirroring real-world complexity. Evaluating 25 state-of-the-art MLLMs reveals a profound capability gap: the best model achieves only 45.1\\% accuracy, while most score 0\\% on rejection tasks\u2014reflexively hallucinating objects rather than acknowledging their absence, raising critical safety concerns for deployment. We explore two strategies for improvements: (1) test-time scaling selects optimal response by thinking trajectory to improve complex grounding by up to 2.9\\%, and (2) data-mixture training teaches models to recognize ungroundable queries, boosting rejection accuracy from 0\\% to 27.9\\%. GroundingME thus serves as both a diagnostic tool revealing current limitations in MLLMs and a roadmap toward human-level visual grounding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37772", "url": null, "sourceid": 33838, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36388, "uid": "0ace14f7dd3d8e29870664c6fa021440", "name": "Real-World Point Tracking with Verifier-Guided Pseudo-Labeling", "authors": [{"id": 184929, "fullname": "G\u00f6rkay Aydemir", "url": "http://cvpr.thecvf.com/api/miniconf/users/184929?format=json", "institution": "Codeway"}, {"id": 69231, "fullname": "Fatma G\u00fcney", "url": "http://cvpr.thecvf.com/api/miniconf/users/69231?format=json", "institution": "Ko\u00e7 University"}, {"id": 73937, "fullname": "Weidi Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/73937?format=json", "institution": "Shanghai Jiaotong University"}], "abstract": "Models for long-term point tracking are typically trained on large synthetic datasets. The performance of these models degrades in real-world videos due todifferent characteristics and the absence of dense ground-truth annotations.Self-training on unlabeled videos has been explored as a practical solution, but the quality of pseudo-labels strongly depends on the reliability of teacher predictions, which vary across frames and scenes.In this paper, we address the problem of real-world fine-tuning and introduce Verifier, a meta-model that learns to assess the reliability of tracker predictions and guide pseudo-label generation.Given candidate trajectories from multiple pretrained trackers, the verifier evaluates them per frame and selects the most trustworthy predictions to construct refined pseudo-label trajectories.When applied during fine-tuning, verifier-guided pseudo-labeling substantially improves the quality of supervision and enables data-efficient adaptation to unlabeled videos.Extensive experiments on four real-world benchmarks demonstrate that our approach achieves state-of-the-art results while requiring less data than prior self-training methods.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36388", "url": null, "sourceid": 41065, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38846, "uid": "ef5b0acc4a9fef53a0b84fea9ea01c69", "name": "Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection", "authors": [{"id": 180349, "fullname": "Xiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180349?format=json", "institution": "University of Science and Technology of China"}, {"id": 190822, "fullname": "Zhangchi Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190822?format=json", "institution": "University of Science and Technology of China"}, {"id": 190823, "fullname": "Xu Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190823?format=json", "institution": "USTC"}, {"id": 190824, "fullname": "Bin Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/190824?format=json", "institution": "Chinese Academy of Sciences"}], "abstract": "Integrating LiDAR and camera inputs into a unified Bird\u2019s-Eye-View (BEV) representation is crucial for enhancing 3D perception capabilities of autonomous vehicles. However, existing methods suffer from spatial misalignment between LiDAR and camera features, which causes inaccurate depth supervision in camera branch and erroneous fusion during cross-modal feature aggregation. The root cause of this misalignment lies in projection errors, stemming from calibration inaccuracies and rolling shutter effect.The key insight of this work is that locations of these projection errors are not random but highly predictable, as they are concentrated at object-background boundaries which 2D detectors can reliably identify. Based on this, our main motivation is to utilize 2D object priors to pre-align cross-modal features before fusion. To address local misalignment, we propose Prior Guided Depth Calibration (PGDC), which leverages 2D priors to alleviate misalignment and preserve correct cross-modal feature pairs. To resolve global misalignment, we introduce Discontinuity Aware Geometric Fusion (DAGF) to suppress residual noise from PGDC and explicitly enhance sharp depth transitions at object-background boundaries, yielding a structurally aware representation. To effectively utilize these aligned representations, we incorporate Structural Guidance Depth Modulator (SGDM), using a gated attention mechanism to efficiently fuse aligned depth and image features. Our method achieves SOTA performance on nuScenes validation dataset, with its mAP and NDS reaching 71.5% and 73.6% respectively. Additionally, on the Argoverse 2 validation set, we achieve a competitive mAP of 41.3%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38846", "url": null, "sourceid": 42583, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37851, "uid": "b2f07227dfe21404a9778b9cc04dd684", "name": "AcTTA : Rethinking Test-Time Adaptation via Dynamic Activation", "authors": [{"id": 72183, "fullname": "Hyeongyu Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/72183?format=json", "institution": "Yonsei University"}, {"id": 188409, "fullname": "GeonHui Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/188409?format=json", "institution": "Yonsei University"}, {"id": 91230, "fullname": "Dosik Hwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91230?format=json", "institution": "Yonsei University"}], "abstract": "Test-time adaptation (TTA) aims to mitigate performance degradation under distribution shifts by updating model parameters during inference. Existing approaches have primarily framed adaptation around affine modulation, focusing on recalibrating normalization layers. This perspective, while effective, overlooks another influential component in representation dynamics: the activation function. We revisit this overlooked space and propose AcTTA, an activation-aware framework that reinterprets conventional activation functions from a learnable perspective and updates them adaptively at test time. AcTTA reformulates conventional activation functions (e.g., ReLU, GELU) into parameterized forms that shift their response threshold and modulate gradient sensitivity, enabling the network to adjust activation behavior under domain shifts. This functional reparameterization enables continuous adjustment of activation behavior without modifying network weights or requiring source data. Despite its simplicity, AcTTA achieves robust and stable adaptation across diverse corruptions. Across CIFAR10-C, CIFAR100-C, and ImageNet-C, AcTTA consistently surpasses normalization-based TTA methods. Our findings highlight activation adaptation as a compact and effective route toward domain-shift\u2013robust test-time learning, broadening the prevailing affine-centric view of adaptation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37851", "url": null, "sourceid": 41165, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38364, "uid": "13a0e2e9d803e8072b0d637d16f5fdb9", "name": "OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation", "authors": [{"id": 129580, "fullname": "Han Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129580?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 189726, "fullname": "Xinyu Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189726?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 76656, "fullname": "Yaoming Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76656?format=json", "institution": "Meituan"}, {"id": 107369, "fullname": "Zelin Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/107369?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 189727, "fullname": "Xin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189727?format=json", "institution": "Meituan"}, {"id": 189728, "fullname": "Rongxiang Weng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189728?format=json", "institution": "Meituan"}, {"id": 189729, "fullname": "Jingang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189729?format=json", "institution": "Meituan"}, {"id": 185049, "fullname": "Xunliang Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/185049?format=json", "institution": "Meituan"}, {"id": 89859, "fullname": "Wenrui Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/89859?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 76584, "fullname": "Hongkai Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76584?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "We introduce OneCAT, a unified multimodal model that seamlessly integrates understanding, generation, and editing within a single decoder-only transformer architecture. OneCAT uniquely eliminates the need for external components such as Vision Transformers (ViT) or vision tokenizer during inference, leading to significant efficiency gains, especially for high-resolution image inputs and outputs. This is achieved through a modality-specific Mixture-of-Experts (MoE) design trained with a unified autoregressive (AR) objective, which also natively supports dynamic resolutions. Furthermore, we pioneer to achieve multi-scale visual autoregressive mechanism within the Large Language Model (LLM) with proposed scale-aware adapter (SAA) that drastically reduces decoding latency compared to diffusion-based methods while maintaining state-of-the-art performance. Our findings demonstrate the powerful potential of pure autoregressive modeling as an elegant foundation for unified multimodal intelligence. As a result, OneCAT outperforms existing unified models across benchmarks for multimodal understanding, generation, and editing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38364", "url": null, "sourceid": 31169, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37384, "uid": "8a0a18ee792712a3d3aa044dbfc581ee", "name": "Zero-Shot Image Denoising via Hybrid Prior-Guided Pseudo Sample Generation", "authors": [{"id": 181185, "fullname": "Xiaole Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/181185?format=json", "institution": "Southwest Jiaotong University"}, {"id": 187310, "fullname": "Qingsong Pang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187310?format=json", "institution": "Southwest Jiaotong University"}, {"id": 187311, "fullname": "Xiaobo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187311?format=json", "institution": null}, {"id": 126862, "fullname": "Xun Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126862?format=json", "institution": "A*STAR"}, {"id": 187312, "fullname": "Xun Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187312?format=json", "institution": "Southwest Jiaotong university"}, {"id": 187313, "fullname": "Yan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187313?format=json", "institution": "Southwest Jiaotong University"}, {"id": 155216, "fullname": "Tianrui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155216?format=json", "institution": "Southwest Jiaotong University"}], "abstract": "Zero-shot image denoising has gained prominence in recent years, as it inherently relies on the intrinsic priors of images rather than learning from external data. Nevertheless, most existing methods either fail to fully exploit global priors, or do not properly preserve the fine-grained details governed by local priors. In this work, we propose a novel framework of pseudo sample generation for zero-shot denoising guided by local and global image priors. Specifically, we propose a well-crafted down-sampler based on gradient merging and grouping within a small window to generate down-sampled samples by exploiting spatial locality. Meanwhile, a global random sampler conditioned on a Gaussian distribution is devised to incorporate the nonlocal self-similarity of natural images. These two samplers build a new paradigm of pseudo sample generation powered by both local and global priors, which is termed as Zero-Shot Hybrid Prior-guided Denoising (ZS-HPD). Considering that noise is more likely to affect high-frequency details, we also present a simple yet effective loss that works in the Fourier domain and applies discriminative weights to distinct spectral bands. Numerous experiments on benchmark datasets have demonstrated the superiority of our ZS-HPD over existing advanced methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37384", "url": null, "sourceid": 35941, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38761, "uid": "02561202cc074e8f025c5fa81209e671", "name": "UniLDiff: Unlocking the Power of Diffusion Priors for All-in-One Image Restoration", "authors": [{"id": 190608, "fullname": "Zihan Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190608?format=json", "institution": "Xiamen University"}, {"id": 190609, "fullname": "Liangtai Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190609?format=json", "institution": "Xiamen University"}, {"id": 190610, "fullname": "Dian Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190610?format=json", "institution": "Xiamen University"}, {"id": 145506, "fullname": "Ni Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/145506?format=json", "institution": "Xiamen University"}, {"id": 87734, "fullname": "Xiaotong Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/87734?format=json", "institution": "XMU"}, {"id": 135330, "fullname": "Yuan Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/135330?format=json", "institution": "East China Normal University"}, {"id": 87721, "fullname": "Yanyun Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87721?format=json", "institution": "Xiamen University"}], "abstract": "All-in-One Image Restoration (AiOIR) has emerged as a promising yet challenging research direction. To address the core challenges of diverse degradation modeling and detail preservation, we propose UniLDiff, a unified framework enhanced with degradation- and detail-aware mechanisms, unlocking the power of diffusion priors for robust image restoration. Specifically, we introduce a Degradation-Aware Feature Fusion (DAFF) to dynamically inject low-quality features into each denoising step via decoupled fusion and adaptive modulation, enabling implicit modeling of diverse and compound degradations. Furthermore, we design a Detail-Aware Expert Module (DAEM) in the decoder to enhance texture and fine-structure recovery through expert routing. Extensive experiments across multi-task and mixed degradation settings demonstrate that our method consistently achieves state-of-the-art performance, highlighting the practical potential of diffusion priors for unified image restoration. Our code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38761", "url": null, "sourceid": 37119, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38038, "uid": "fb3fda882b7b7a67a2aeea29a065b9ac", "name": "Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface", "authors": [{"id": 188892, "fullname": "Yujie Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188892?format=json", "institution": "Chinese Academy of Sciences"}, {"id": 185934, "fullname": "Hongwei Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185934?format=json", "institution": "Peking University"}, {"id": 126713, "fullname": "Di Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126713?format=json", "institution": "Alibaba Group"}, {"id": 188893, "fullname": "Shengcong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188893?format=json", "institution": "Agibot"}, {"id": 188894, "fullname": "Liliang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188894?format=json", "institution": "AgiBot"}, {"id": 129188, "fullname": "Xiaoqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129188?format=json", "institution": "Peking University"}, {"id": 99906, "fullname": "Guanghui Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/99906?format=json", "institution": "AgiBot"}, {"id": 76571, "fullname": "Hao Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76571?format=json", "institution": "Peking University"}], "abstract": "Recent progress in robot learning has been driven by large-scale datasets and powerful visuomotor policy architectures, yet policy robustness remains limited by the substantial cost of collecting diverse demonstrations, particularly for spatial generalization in manipulation tasks. To reduce repetitive data collection, we present Real2Edit2Real, a framework that generates new demonstrations by bridging 3D editability with 2D visual data through a 3D control interface. Our approach first reconstructs scene geometry from multi-view RGB observations with a metric-scale 3D reconstruction model. Based on the reconstructed geometry, we perform depth-reliable 3D editing on point clouds to generate new manipulation trajectories while geometrically correcting the robot poses to recover physically consistent depth, which serves as a reliable condition for synthesizing new demonstrations. Finally, we propose a multi-conditional video generation model guided by depth as the primary control signal, together with action, edge, and ray maps, to synthesize spatially augmented multi-view manipulation videos. Experiments on four real-world manipulation tasks demonstrate that policies trained on data generated from only $1$\u2013$5$ source demonstrations can match or outperform those trained on 50 real-world demonstrations, improving data efficiency by up to $10$-$50\\times$. Moreover, experimental results on height and texture editing demonstrate the framework\u2019s flexibility and extensibility, indicating its potential to serve as a unified data generation framework.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38038", "url": null, "sourceid": 43760, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37574, "uid": "e17710ba7a0f5dc2fac4374b629dce33", "name": "Physical Adversarial Examples through Camera Power Signal Injection", "authors": [{"id": 180325, "fullname": "yanze ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/180325?format=json", "institution": "Zhejiang University"}, {"id": 187756, "fullname": "Mingyuan Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/187756?format=json", "institution": "Zhejiang University"}, {"id": 187757, "fullname": "Qinhong Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187757?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 187758, "fullname": "Yan Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187758?format=json", "institution": null}, {"id": 187759, "fullname": "Chen Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187759?format=json", "institution": "Zhejiang University"}, {"id": 149296, "fullname": "Xiaoyu Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/149296?format=json", "institution": "Zhejiang University"}, {"id": 187760, "fullname": "Wenyuan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187760?format=json", "institution": "Zhejiang University"}], "abstract": "Physical adversarial examples pose a concrete threat to real\u2011world computer vision systems. Existing works mainly generate physical adversarial examples by affixing patches or projecting light onto targets, which are usually visible and can expose the malicious intention. In this work, we reveal a new attack surface that generates invisible adversarial samples by injecting signals into the camera's power supply. We analyze the mechanism of injecting structural stripe patterns into cameras and demonstrate the feasibility of controllable fine-grained injection with signal modulation. We develop a simulation model to emulate the physically injected perturbation, and propose end-to-end optimization methodologies in both white-box and black-box settings to generate the injection signal parameters. We perform a simulated evaluation across seven classification models and carry out physical signal injection experiments with optimized signals. The results show that physical adversarial examples generated through camera power signal injection can disrupt computer vision performance. Our work introduces a new methodology for physical adversarial examples, emphasizing the need for securing computer vision systems in the physical world.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37574", "url": null, "sourceid": 36826, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39556, "uid": "f72a53e19c3310188cedd074226312c9", "name": "PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow", "authors": [{"id": 184971, "fullname": "Xincheng Shuai", "url": "http://cvpr.thecvf.com/api/miniconf/users/184971?format=json", "institution": "Fudan University"}, {"id": 167721, "fullname": "Song Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/167721?format=json", "institution": "Fudan University"}, {"id": 192335, "fullname": "Yutong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192335?format=json", "institution": "Fudan University"}, {"id": 76198, "fullname": "Henghui Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/76198?format=json", "institution": "Fudan University"}, {"id": 73994, "fullname": "Dacheng Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/73994?format=json", "institution": "Nanyang Technological University"}], "abstract": "Graphic design is a creative and innovative process that plays a crucial role in applications such as e-commerce and advertising. However, developing an automated design system that can faithfully translate user intentions into editable design files remains an open challenge. Although recent studies have leveraged powerful text-to-image models and MLLMs to assist graphic design, they typically simplify professional workflows, resulting in limited flexibility and intuitiveness. To address these limitations, we propose ***PSDesigner***, an automated graphic design system that emulates the creative workflow of human designers. Building upon multiple specialized components, ***PSDesigner*** collects theme-related assets based on user instructions, and autonomously infers and executes tool calls to manipulate design files, such as integrating new assets or refining inferior elements. To endow the system with strong tool-use capabilities, we construct a design dataset, ***CreativePSD***, which contains a large amount of high-quality PSD design files annotated with operation traces across a wide range of design scenarios and artistic styles, enabling models to learn expert design procedures. Extensive experiments demonstrate that ***PSDesigner*** outperforms existing methods across diverse graphic design tasks, empowering non-specialists to conveniently create production-quality designs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39556", "url": null, "sourceid": 46635, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40252, "uid": "e428d5580f729d23e6070e43c492e424", "name": "Implicit-Knowledge Visual Question Answering with Structured Reasoning Traces", "authors": [{"id": 193882, "fullname": "Zhihao Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193882?format=json", "institution": "Ant Group"}, {"id": 193883, "fullname": "Wenkang Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/193883?format=json", "institution": "University of Science and Technology of China"}, {"id": 193884, "fullname": "Yuan Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193884?format=json", "institution": "Singapore Management University"}, {"id": 193885, "fullname": "Xingtong Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193885?format=json", "institution": "Singapore Management University"}, {"id": 180234, "fullname": "hui zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180234?format=json", "institution": "university of science and technology of China"}, {"id": 193886, "fullname": "Weicheng Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193886?format=json", "institution": "Alibaba Group"}, {"id": 193887, "fullname": "Xin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193887?format=json", "institution": "Ant Group; Microsoft"}], "abstract": "Knowledge-based Visual Question Answering (KVQA) requires models to ground entities in images and reason over factual knowledge. Recent work has introduced its implicit-knowledge variant, *IK-KVQA*, where a multimodal large language model (MLLM) is the sole knowledge source and answers are produced without external retrieval. Existing IK-KVQA approaches, however, are typically trained with answer-only supervision: reasoning remains implicit, justifications are often weak or inconsistent, and generalization after standard supervised fine-tuning (SFT) can be brittle. We propose **StaR-KVQA**, a framework that equips IK-KVQA with *dual-path structured reasoning traces*\u2014symbolic relation paths over text and vision together with path-grounded natural-language explanations\u2014to provide a stronger inductive bias than generic answer-only supervision. These traces act as modality-aware scaffolds that guide the model toward relevant entities and attributes, offering more structure than generic chain-of-thought supervision while not constraining reasoning to any single fixed path. With a single open-source MLLM, StaR-KVQA constructs and selects traces to build an offline trace-enriched dataset and then performs structure-aware self-distillation; no external retrievers, verifiers, or curated knowledge bases are used, and inference is a single autoregressive pass. Across benchmarks, StaR-KVQA consistently improves both answer accuracy and the transparency of intermediate reasoning, achieving up to **+11.3%** higher answer accuracy on OK-VQA over the strongest baseline.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40252", "url": null, "sourceid": 44340, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37334, "uid": "908c5a804c204a58e06d8f7f0b7e53b7", "name": "SketchRevive: Fine-Grained Pixel-to-Vector Sketch Completion with Diffusion-Prior\u2013Guided Multimodal LLMs", "authors": [{"id": 182875, "fullname": "Ran Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/182875?format=json", "institution": "Communication University of China"}, {"id": 180929, "fullname": "Haoxiang Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180929?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 187187, "fullname": "Chenxi Pei", "url": "http://cvpr.thecvf.com/api/miniconf/users/187187?format=json", "institution": "Communication University of China"}, {"id": 187188, "fullname": "Yanxuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187188?format=json", "institution": "Communication University of China"}, {"id": 187189, "fullname": "Wenwen Qiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187189?format=json", "institution": "Institute of Software Chinese Academy of Sciences"}, {"id": 187190, "fullname": "Fang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187190?format=json", "institution": "Communication University of China"}, {"id": 128482, "fullname": "Xiaoming Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/128482?format=json", "institution": "Institute of Software Chinese Academy of Sciences"}, {"id": 187191, "fullname": "Cuixia Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/187191?format=json", "institution": "institute of software, chinese academy of sciences"}, {"id": 91120, "fullname": "Yong-Jin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/91120?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Transforming sparse, partial pixel sketches from diverse media into complete, editable vector drawings is essential yet underexplored in digital creation. Prior methods either generate from scratch or inpaint local gaps without predicting global structure, leading to coarse contours and limited detail. To address this, we introduce SketchRevive, a two\u2011stage framework for fine\u2011grained pixel\u2011to\u2011vector sketch completion that couples diffusion\u2011based pixel completion with MLLM\u2011driven refinement and vectorization to produce coherent, detail\u2011faithful SVG results. Specifically, we first construct a practical benchmark by augmenting stroke\u2011annotated sketches from paper and whiteboards. Stage I trains a diffusion model with a line\u2011distribution head to predict per\u2011pixel stroke presence, producing structural and appearance consistent completions. Stage II fine-tunes an MLLM for structure\u2011aware SVG vectorization with iterative refinement, optimized by instance\u2011level stroke attribute similarities. To align key clues e.g. spatial structure, appearance details across both stages, we introduce a diffusion-prior aggregated encoding module by injecting multi\u2011scale UNet features from Stage I into the MLLM\u2019s visual embeddings and using line prediction logits for token compression to prioritize informative tokens. Experiments indicate that SketchRevive completes topology\u2011coherent vector outputs with high fidelity and recognizability while preserving user intent, suitable for interactive creation and artistic design.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37334", "url": null, "sourceid": 44817, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39224, "uid": "7c4bbe3b40bfbd75077c9b6fc69c43e8", "name": "Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation", "authors": [{"id": 155013, "fullname": "Shuo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155013?format=json", "institution": "Renmin University of China"}, {"id": 191630, "fullname": "Yucheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191630?format=json", "institution": "Horizon Robotics"}, {"id": 191631, "fullname": "Guoxin Lian", "url": "http://cvpr.thecvf.com/api/miniconf/users/191631?format=json", "institution": "Renmin University of China"}, {"id": 155015, "fullname": "Yongcai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155015?format=json", "institution": "Renmin University of China"}, {"id": 191632, "fullname": "Maiyue Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191632?format=json", "institution": null}, {"id": 191633, "fullname": "kaihui.wang kaihui.wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191633?format=json", "institution": "Horizon Robotics"}, {"id": 191634, "fullname": "Bo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191634?format=json", "institution": "Horizon Robotics"}, {"id": 164010, "fullname": "Zhizhong Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/164010?format=json", "institution": "Horizon Robotics"}, {"id": 191635, "fullname": "Yutian Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191635?format=json", "institution": "Renmin University of China"}, {"id": 155014, "fullname": "Wanting Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155014?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 155017, "fullname": "Deying Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155017?format=json", "institution": "Renmin University of China"}, {"id": 153722, "fullname": "Zhaoxin Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153722?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}], "abstract": "Vision-Language Navigation requires agents to act coherently over long horizons by understanding not only local visual context but also how far they have advanced within a multi-step instruction.However, recent Vision-Language-Action models focus on direct action prediction and earlier progress methods predict numeric achievements; both overlook the monotonic co-progression property of the observation and instruction sequences.Building on this insight, Progress-Think introduces semantic progress reasoning, predicting instruction-style progress from visual observations to enable more accurate navigation.To achieve this without annotations, we propose a three-stage framework.In the initial stage, Self-Aligned Progress Pretraining bootstraps a reasoning module via a novel differentiable alignment between visual history and instruction prefixes.Then, Progress-Guided Policy Pretraining injects learned progress states into the navigation context, guiding the policy toward consistent actions.Finally, Progress-Policy Co-Finetuning jointly optimizes both modules with tailored progress-aware reinforcement objectives.Experiments on R2R-CE and RxR-CE show substantial gains in success, efficiency, and interpretability, demonstrating semantic progress provides a more consistent and generalizable representation of navigation advancement.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39224", "url": null, "sourceid": 36781, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37379, "uid": "c35a6d53e9dd68679cd7a9e1557dd68a", "name": "Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs", "authors": [{"id": 187289, "fullname": "Meng Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187289?format=json", "institution": "Virginia Polytechnic Institute and State University"}, {"id": 187290, "fullname": "Ran Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187290?format=json", "institution": "Google DeepMind"}, {"id": 187291, "fullname": "Yi Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187291?format=json", "institution": "Virginia Polytechnic Institute and State University"}, {"id": 131776, "fullname": "Wenxuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131776?format=json", "institution": "King Abdullah University of Science and Technology"}, {"id": 187292, "fullname": "Yue Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187292?format=json", "institution": "Meta"}, {"id": 187293, "fullname": "Gaurav Srivastava", "url": "http://cvpr.thecvf.com/api/miniconf/users/187293?format=json", "institution": "Virginia Polytechnic Institute and State University"}, {"id": 187294, "fullname": "Yuchen Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187294?format=json", "institution": "Google DeepMind"}, {"id": 85585, "fullname": "Mohamed Elhoseiny", "url": "http://cvpr.thecvf.com/api/miniconf/users/85585?format=json", "institution": "KAUST"}, {"id": 153251, "fullname": "Charles Fleming", "url": "http://cvpr.thecvf.com/api/miniconf/users/153251?format=json", "institution": "Cisco"}, {"id": 153504, "fullname": "Carl Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153504?format=json", "institution": "Emory University"}, {"id": 155027, "fullname": "Zhengzhong Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155027?format=json", "institution": "Texas A&amp;M University - College Station"}, {"id": 187295, "fullname": "Yang Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/187295?format=json", "institution": "UT Southwestern Medical Center"}, {"id": 187296, "fullname": "Guanghua Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187296?format=json", "institution": null}, {"id": 187297, "fullname": "Di Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/187297?format=json", "institution": "Eigen AI"}, {"id": 182989, "fullname": "Wenqi Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/182989?format=json", "institution": "University of Texas Southwestern Medical Center"}, {"id": 180258, "fullname": "Xuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180258?format=json", "institution": "Virginia Polytechnic Institute and State University"}], "abstract": "While recent vision-language models (VLMs) demonstrate strong image understanding, their ability to \u201cthink with images,\u201d i.e., to reason through multi-step visual interactions, remains limited. We introduce VISTA-Gym, a scalable training environment for incentivizing tool-integrated visual reasoning capabilities in VLMs. VISTA-Gym unifies diverse real-world multimodal reasoning tasks (7 tasks from 13 datasets in total) with a standardized interface for visual tools (e.g., grounding, parsing), executable interaction loops, verifiable feedback signals, and efficient trajectory logging, enabling visual agentic reinforcement learning at scale. While recent VLMs exhibit strong text-only reasoning, both proprietary and open-source models still struggle with tool selection, invocation, and coordination. With VISTA-Gym, we train VISTA-R1 to interleave tool-use with agentic reasoning via multi-turn trajectory sampling and end-to-end reinforcement learning. Extensive experiments across 11 public reasoning-intensive VQA benchmarks show that VISTA-R1-8B outperforms state-of-the-art baselines with similar sizes by 9.51%-18.72%, demonstrating VISTA-Gym as an effective training ground to unlock the tool-integrated reasoning capabilities for VLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37379", "url": null, "sourceid": 35204, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39182, "uid": "0898bae5662b8c4a9cd8ea2db1fa7ee4", "name": "VA-$\\boldsymbol{\\pi}$: Variational Policy Alignment for Pixel-Aware Autoregressive Generation", "authors": [{"id": 191522, "fullname": "Xinyao Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191522?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 76416, "fullname": "QIYUAN HE", "url": "http://cvpr.thecvf.com/api/miniconf/users/76416?format=json", "institution": "national university of singaore, National University of Singapore"}, {"id": 128123, "fullname": "Kai Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128123?format=json", "institution": "National University of Singapore"}, {"id": 156078, "fullname": "Xiaoye Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156078?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 129143, "fullname": "Yicong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129143?format=json", "institution": "national university of singaore, National University of Singapore"}, {"id": 130050, "fullname": "Wei Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/130050?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 85773, "fullname": "Angela Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/85773?format=json", "institution": "National University of Singapore"}], "abstract": "Autoregressive (AR) visual generation relies on tokenizers to map images to and from discrete sequences.  However, tokenizers are trained to reconstruct clean images from ground-truth tokens, while AR generators are optimized only for token likelihood. This misalignment leads to generated token sequences that may decode into low-quality images, without direct supervision from the pixel space. We propose \\textbf{VA}-$\\boldsymbol{\\pi}$, a lightweight post-training framework that directly optimizes AR models with a principled pixel-space objective. VA-$\\pi$ formulates the generator\u2013tokenizer alignment as a variational optimization, deriving an evidence lower bound (ELBO) that unifies pixel reconstruction and autoregressive modeling. To optimize under the discrete token space, VA-$\\pi$ introduces a reinforcement-based alignment strategy that treats the AR generator as a policy, uses pixel-space reconstruction quality as its intrinsic reward. The reward is measured by how well the predicted token sequences can reconstruct the original image under teacher forcing, giving the model direct pixel-level guidance without expensive free-running sampling.The regularization term of the ELBO serves as a natural regularizer, maintaining distributional consistency. VA-$\\pi$ enables rapid adaptation of existing AR generators, without neither tokenizer retraining nor external reward models. With only 1\\% ImageNet-1K data and 25 minutes of tuning, it reduces FID from 14.36 to 7.65 and improves IS from 86.55 to 116.70 on LlamaGen-XXL, while also yielding notable gains in the text-to-image task on GenEval for both visual generation model (LlamaGen: from 0.306 to 0.339) and unified multi-modal model (Janus-Pro: from 0.725 to 0.744).", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39182", "url": null, "sourceid": 30713, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38007, "uid": "c6078c876ca28b1c764e7dce3fe86ed5", "name": "Phantom: Physical Object Interactions as Dynamic Triggers for NMS-Exploited Backdoors", "authors": [{"id": 183671, "fullname": "Tianlin Huo", "url": "http://cvpr.thecvf.com/api/miniconf/users/183671?format=json", "institution": null}, {"id": 188813, "fullname": "Dongchuan Ran", "url": "http://cvpr.thecvf.com/api/miniconf/users/188813?format=json", "institution": "Sensetime"}, {"id": 188814, "fullname": "Ranjie Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188814?format=json", "institution": null}, {"id": 90273, "fullname": "Yao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90273?format=json", "institution": "Zhejiang University"}, {"id": 177442, "fullname": "Peilun Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/177442?format=json", "institution": "Institute of Systems Engineering ,Academy of Military Sciences(AMS),People&#x27;s Liberation Army of China(PLA)"}, {"id": 188815, "fullname": "ningbo yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188815?format=json", "institution": null}, {"id": 188816, "fullname": "Huanqian Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188816?format=json", "institution": "Beihang University"}, {"id": 188817, "fullname": "Xu Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/188817?format=json", "institution": "Tsinghua University"}, {"id": 188818, "fullname": "Qiang Yun", "url": "http://cvpr.thecvf.com/api/miniconf/users/188818?format=json", "institution": null}, {"id": 188819, "fullname": "Yuzheng Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188819?format=json", "institution": "Xidian university"}, {"id": 188820, "fullname": "baoyang baoyang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188820?format=json", "institution": null}, {"id": 188821, "fullname": "Yuan He", "url": "http://cvpr.thecvf.com/api/miniconf/users/188821?format=json", "institution": "Tsinghua University"}], "abstract": "Backdoor attacks pose potential threats to object detection models, highlighting the importance of studying their security. However, existing backdoor attacks mainly rely on trigger-specific intrinsic features, which limits their practicality in real-world scenarios. In this paper, we propose a novel backdoor attack that leverages dynamic object interactions in realistic scenarios to activate malicious behavior. By hijacking the Non-Maximum Suppression (NMS) process in object detectors, this attack demonstrates robust effectiveness, including misclassification, mislocalization, and object appearance/disappearance, while maintaining the model\u2019s normal performance on clean inputs. Experimental results demonstrate that our attack exhibits significant attack performance across various object detectors and datasets, and remains effective both in physical environments and under existing defense mechanisms. These findings highlight the urgent need to develop efficient and robust defense strategies against backdoor attacks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38007", "url": null, "sourceid": 40464, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40292, "uid": "50acb95527b5b98d6a41b41cab45fe6c", "name": "ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning", "authors": [{"id": 103328, "fullname": "Wenjie Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/103328?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 185592, "fullname": "Yabin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185592?format=json", "institution": "Stanford University"}, {"id": 132631, "fullname": "Xin Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/132631?format=json", "institution": "Eastern Institute of Technology, Ningbo"}, {"id": 133538, "fullname": "Wenjun Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/133538?format=json", "institution": "Eastern Institute of Technology, Ningbo"}, {"id": 88043, "fullname": "Lei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88043?format=json", "institution": "The Hong Kong Polytechnic University"}], "abstract": "The introduction of negative labels (NLs) has proven effective in enhancing Out-of-Distribution (OOD) detection. However, existing methods often lack an understanding of OOD images, making it difficult to construct an accurate negative space. Furthermore, the absence of negative labels semantically similar to ID labels constrains their capability in near-OOD detection. To address these issues, we propose shaping an Adaptive Negative Textual Space (ANTS) by leveraging the understanding and reasoning capabilities of multimodal large language models (MLLMs). Specifically, we cache images likely to be OOD samples from the historical test images and prompt the MLLM to describe these images, generating expressive negative sentences that precisely characterize the OOD distribution and enhance far-OOD detection. For the near-OOD setting, where OOD samples resemble the in-distribution (ID) subset, we cache the subset of ID classes that are visually similar to historical test images and then leverage MLLM reasoning to generate visually similar negative labels tailored to this subset, effectively reducing false negatives and improving near-OOD detection. To balance these two types of negative textual spaces, we design an adaptive weighted score that enables the method to handle different OOD task settings (near-OOD and far-OOD), making it highly adaptable in open environments. On the ImageNet benchmark, our ANTS significantly reduces the FPR95 by 3.1\\%, establishing a new state-of-the-art. Furthermore, our method is training-free and zero-shot, enabling high scalability.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40292", "url": null, "sourceid": -32732, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37319?format=json"], "related_events_ids": [37319]}, {"id": 40103, "uid": "c7e07c312fd6bafc9f5192b3dfdf3d3f", "name": "Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation", "authors": [{"id": 181089, "fullname": "Tiantian Dang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181089?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 193541, "fullname": "Chao Bi", "url": "http://cvpr.thecvf.com/api/miniconf/users/193541?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 193542, "fullname": "Shufan Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193542?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 193543, "fullname": "Jinzhe Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193543?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 85019, "fullname": "Qingming Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85019?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 85025, "fullname": "Shuhui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85025?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}], "abstract": "Despite the significant advancements in Large Vision-Language Models (LVLMs), their tendency to generate hallucinations undermines reliability and restricts broader practical deployment.Among the hallucination mitigation methods, feature steering emerges as a promising approach that reduces erroneous outputs in LVLMs without increasing inference costs.However, current methods apply uniform feature steering across all layers. This heuristic strategy ignores inter-layer differences, potentially disrupting layers unrelated to hallucinations and ultimately leading to performance degradation on general tasks.In this paper, we propose a plug-and-play framework called **L**ocate-**T**hen-**S**parsify for **F**eature **S**teering (**LTS-FS**), which controls the steering intensity according to the hallucination relevance of each layer.We first construct a synthetic dataset comprising token-level and sentence-level hallucination cases. Based on this dataset, we introduce an attribution method based on causal interventions to quantify the hallucination relevance of each layer. With the attribution scores across layers, we propose a layerwise strategy that converts these scores into feature steering intensities for individual layers, enabling more precise adjustments specifically on hallucination-relevant layers.Extensive experiments across multiple LVLMs and benchmarks demonstrate that our LTS-FS framework effectively mitigates hallucination while preserving strong performance on general LVLM benchmarks. Codes are provided in the supplementary.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40103", "url": null, "sourceid": 32884, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38614, "uid": "06ebaf4306562823d37da3071facc9d0", "name": "Logit-Margin Repulsion for Backdoor Defense", "authors": [{"id": 177510, "fullname": "Zhiguo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177510?format=json", "institution": "University of Science and Technology of China"}, {"id": 177513, "fullname": "Dongsheng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/177513?format=json", "institution": "University of Science and Technology of China"}, {"id": 190303, "fullname": "Ruizhi Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/190303?format=json", "institution": "University of Science and Technology of China"}, {"id": 182536, "fullname": "Jiacheng Pi", "url": "http://cvpr.thecvf.com/api/miniconf/users/182536?format=json", "institution": "University of Science and Technology of China"}, {"id": 190304, "fullname": "Xingxing Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190304?format=json", "institution": "University of Science and Technology of China"}, {"id": 126452, "fullname": "Wenjie Ruan", "url": "http://cvpr.thecvf.com/api/miniconf/users/126452?format=json", "institution": "University of Exeter"}], "abstract": "Backdoor attacks are an increasingly significant threat to deep neural networks. Recent studies have revealed that model compression, such as quantization and pruning, can be exploited to implant conditional backdoors. These backdoors remain dormant in full-precision models but are activated during the compression, making them highly stealthy and difficult to detect. Traditional defense methods are generally ineffective against such attacks, and defenses designed for conditional backdoors struggle to handle traditional ones. Moreover, most existing approaches fail to generalize to Transformer architectures.To address these challenges, we propose $\\textit{$\\textbf{L}$ogit $\\textbf{M}$argin $\\textbf{R}$epulsion}$ (LMR), a universal and architecture-agnostic defense method. LMR uses a small set of clean samples, combining selective cross-entropy with a logit-margin constraint to enlarge the gap between the backdoor class and benign classes. It then applies selective pruning to remove channels associated with backdoor behavior, achieving strong defense without changing the model architecture. Extensive experiments on a wide range of CNNs and Vision Transformers demonstrate that LMR, even with a minimal amount of clean data (0.1\\%), can effectively mitigate both traditional and conditional backdoor attacks across diverse model architectures.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38614", "url": null, "sourceid": 32228, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37319, "uid": "50acb95527b5b98d6a41b41cab45fe6c", "name": "ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning", "authors": [{"id": 103328, "fullname": "Wenjie Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/103328?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 185592, "fullname": "Yabin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185592?format=json", "institution": "Stanford University"}, {"id": 132631, "fullname": "Xin Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/132631?format=json", "institution": "Eastern Institute of Technology, Ningbo"}, {"id": 133538, "fullname": "Wenjun Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/133538?format=json", "institution": "Eastern Institute of Technology, Ningbo"}, {"id": 88043, "fullname": "Lei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88043?format=json", "institution": "The Hong Kong Polytechnic University"}], "abstract": "The introduction of negative labels (NLs) has proven effective in enhancing Out-of-Distribution (OOD) detection. However, existing methods often lack an understanding of OOD images, making it difficult to construct an accurate negative space. Furthermore, the absence of negative labels semantically similar to ID labels constrains their capability in near-OOD detection. To address these issues, we propose shaping an Adaptive Negative Textual Space (ANTS) by leveraging the understanding and reasoning capabilities of multimodal large language models (MLLMs). Specifically, we cache images likely to be OOD samples from the historical test images and prompt the MLLM to describe these images, generating expressive negative sentences that precisely characterize the OOD distribution and enhance far-OOD detection. For the near-OOD setting, where OOD samples resemble the in-distribution (ID) subset, we cache the subset of ID classes that are visually similar to historical test images and then leverage MLLM reasoning to generate visually similar negative labels tailored to this subset, effectively reducing false negatives and improving near-OOD detection. To balance these two types of negative textual spaces, we design an adaptive weighted score that enables the method to handle different OOD task settings (near-OOD and far-OOD), making it highly adaptable in open environments. On the ImageNet benchmark, our ANTS significantly reduces the FPR95 by 3.1\\%, establishing a new state-of-the-art. Furthermore, our method is training-free and zero-shot, enabling high scalability.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37319", "url": null, "sourceid": 32732, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40292?format=json"], "related_events_ids": [40292]}, {"id": 38879, "uid": "ba0909e302db12f30293fed31693f19b", "name": "Jailbreaking Vision-Language Models via Dissonance-Guided Suffix Optimization and Image\u2013Phrase Injection", "authors": [{"id": 182536, "fullname": "Jiacheng Pi", "url": "http://cvpr.thecvf.com/api/miniconf/users/182536?format=json", "institution": "University of Science and Technology of China"}, {"id": 177510, "fullname": "Zhiguo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177510?format=json", "institution": "University of Science and Technology of China"}, {"id": 190304, "fullname": "Xingxing Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190304?format=json", "institution": "University of Science and Technology of China"}, {"id": 177513, "fullname": "Dongsheng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/177513?format=json", "institution": "University of Science and Technology of China"}, {"id": 190303, "fullname": "Ruizhi Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/190303?format=json", "institution": "University of Science and Technology of China"}, {"id": 126452, "fullname": "Wenjie Ruan", "url": "http://cvpr.thecvf.com/api/miniconf/users/126452?format=json", "institution": "University of Exeter"}], "abstract": "The integration of vision and language in Vision-Language Models (VLMs), while enabling multimodal capabilities, inherently expands their attack surface. Among existing white-box jailbreak methods, suffix-optimization-based approaches often rely on gradient approximations over discrete token spaces, yielding insufficient guidance and causing optimization to stagnate in local optima, while image-perturbation-based ones frequently exhibit poor cross-model transferability. In this work, we introduce $\\textbf{DGSIP}$, a $\\textbf{D}$issonance-$\\textbf{G}$uided $\\textbf{S}$uffix Optimization and $\\textbf{I}$mage\u2013$\\textbf{P}$hrase Injection framework. DGSIP leverages predictive dissonance between the target model and an unaligned model to identify tokens suppressed by safety alignment, using them as a more effective signal than gradient-based cues for suffix optimization. It further reinforces the attack by jointly optimizing the content and presentation of phrase embedded in images to leverage VLMs\u2019 cross-modal sensitivity. Our extensive experiments demonstrate that DGSIP outperforms prior baselines across multiple safety benchmarks and a range of open-source VLMs (e.g., MiniGPT-4, InstructBlip and LLaVA). More importantly, compared to baselines, our method exhibits much stronger transferability to commercial black-box VLMs, such as GPT-4o-Mini, Gemini 2.0 Flash and Qwen 2.5-VL. Based upon DGSIP, we empirically reveal critical vulnerabilities in the safeguard mechanisms of current VLMs, highlighting the urgent need for more robust defense strategies.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38879", "url": null, "sourceid": 41195, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38025, "uid": "1fa3b85e26b3058ead1ef2eec9be060e", "name": "The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification", "authors": [{"id": 188853, "fullname": "Dante Wasmuht", "url": "http://cvpr.thecvf.com/api/miniconf/users/188853?format=json", "institution": "Conservation X Labs"}, {"id": 96104, "fullname": "Otto Brookes", "url": "http://cvpr.thecvf.com/api/miniconf/users/96104?format=json", "institution": "University of Bristol"}, {"id": 188854, "fullname": "Maximilian Schall", "url": "http://cvpr.thecvf.com/api/miniconf/users/188854?format=json", "institution": "Hasso Plattner Institute"}, {"id": 188855, "fullname": "Pablo Palencia", "url": "http://cvpr.thecvf.com/api/miniconf/users/188855?format=json", "institution": "Universidad de Oviedo"}, {"id": 188856, "fullname": "Christopher Beirne", "url": "http://cvpr.thecvf.com/api/miniconf/users/188856?format=json", "institution": "University of Exeter; University of Exeter"}, {"id": 91093, "fullname": "Tilo Burghardt", "url": "http://cvpr.thecvf.com/api/miniconf/users/91093?format=json", "institution": "University of Bristol"}, {"id": 91082, "fullname": "Majid Mirmehdi", "url": "http://cvpr.thecvf.com/api/miniconf/users/91082?format=json", "institution": "University of Bristol"}, {"id": 188857, "fullname": "Hjalmar K\u00fchl", "url": "http://cvpr.thecvf.com/api/miniconf/users/188857?format=json", "institution": "Senckenberg Society for Nature Research"}, {"id": 152329, "fullname": "Mimi Arandjelovic", "url": "http://cvpr.thecvf.com/api/miniconf/users/152329?format=json", "institution": "Max Planck Institute for Evolutionary Anthropology"}, {"id": 175063, "fullname": "Sam Pottie", "url": "http://cvpr.thecvf.com/api/miniconf/users/175063?format=json", "institution": "Climate Corridors"}, {"id": 188858, "fullname": "Peter Bermant", "url": "http://cvpr.thecvf.com/api/miniconf/users/188858?format=json", "institution": "Conservation X Labs"}, {"id": 188859, "fullname": "Brandon Asheim", "url": "http://cvpr.thecvf.com/api/miniconf/users/188859?format=json", "institution": "Conservation X Labs"}, {"id": 188860, "fullname": "Yi Toh", "url": "http://cvpr.thecvf.com/api/miniconf/users/188860?format=json", "institution": "Conservation X Labs"}, {"id": 188861, "fullname": "Adam Elzinga", "url": "http://cvpr.thecvf.com/api/miniconf/users/188861?format=json", "institution": "Conservation X Labs"}, {"id": 188862, "fullname": "Jason Allan Holmberg", "url": "http://cvpr.thecvf.com/api/miniconf/users/188862?format=json", "institution": "Conservation X Labs"}, {"id": 188863, "fullname": "Andrew Whitworth", "url": "http://cvpr.thecvf.com/api/miniconf/users/188863?format=json", "institution": null}, {"id": 188864, "fullname": "Eleanor Flatt", "url": "http://cvpr.thecvf.com/api/miniconf/users/188864?format=json", "institution": "Osa Conservation"}, {"id": 188865, "fullname": "Laura Gustafson", "url": "http://cvpr.thecvf.com/api/miniconf/users/188865?format=json", "institution": "Meta"}, {"id": 188866, "fullname": "Chaitanya Ryali", "url": "http://cvpr.thecvf.com/api/miniconf/users/188866?format=json", "institution": "Meta"}, {"id": 188867, "fullname": "Yuan-Ting Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188867?format=json", "institution": "Meta AI"}, {"id": 188868, "fullname": "Baishan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188868?format=json", "institution": "Meta"}, {"id": 188869, "fullname": "Andrew Westbury", "url": "http://cvpr.thecvf.com/api/miniconf/users/188869?format=json", "institution": "Facebook"}, {"id": 69230, "fullname": "Kate Saenko", "url": "http://cvpr.thecvf.com/api/miniconf/users/69230?format=json", "institution": "Meta / Boston University"}, {"id": 89163, "fullname": "D\u00eddac Sur\u00eds", "url": "http://cvpr.thecvf.com/api/miniconf/users/89163?format=json", "institution": "Columbia University"}], "abstract": "Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identification and behavior recognition. However, existing datasets are limited in scale, constrained to a few species, or lack sufficient temporal and geographical diversity -- leaving no suitable benchmark for training general-purpose MAT models applicable across wild animal populations. To address this, we introduce SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera trap videos collected over approximately 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated culminating in $\\sim$46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels. Alongside the task-specific annotations, we publish anonymized camera trap locations for each video. Finally, we present comprehensive benchmarks on SA-FARI using state-of-the-art vision-language models for detection and tracking, including SAM 3, evaluated with both species-specific and generic animal prompts. We also compare against vision-only methods developed specifically for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing generalizable multi-animal tracking in the wild. The dataset is available at [ANONYMIZED]", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38025", "url": null, "sourceid": 41743, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40322?format=json"], "related_events_ids": [40322]}, {"id": 40322, "uid": "1fa3b85e26b3058ead1ef2eec9be060e", "name": "The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification", "authors": [{"id": 188853, "fullname": "Dante Wasmuht", "url": "http://cvpr.thecvf.com/api/miniconf/users/188853?format=json", "institution": "Conservation X Labs"}, {"id": 96104, "fullname": "Otto Brookes", "url": "http://cvpr.thecvf.com/api/miniconf/users/96104?format=json", "institution": "University of Bristol"}, {"id": 188854, "fullname": "Maximilian Schall", "url": "http://cvpr.thecvf.com/api/miniconf/users/188854?format=json", "institution": "Hasso Plattner Institute"}, {"id": 188855, "fullname": "Pablo Palencia", "url": "http://cvpr.thecvf.com/api/miniconf/users/188855?format=json", "institution": "Universidad de Oviedo"}, {"id": 188856, "fullname": "Christopher Beirne", "url": "http://cvpr.thecvf.com/api/miniconf/users/188856?format=json", "institution": "University of Exeter; University of Exeter"}, {"id": 91093, "fullname": "Tilo Burghardt", "url": "http://cvpr.thecvf.com/api/miniconf/users/91093?format=json", "institution": "University of Bristol"}, {"id": 91082, "fullname": "Majid Mirmehdi", "url": "http://cvpr.thecvf.com/api/miniconf/users/91082?format=json", "institution": "University of Bristol"}, {"id": 188857, "fullname": "Hjalmar K\u00fchl", "url": "http://cvpr.thecvf.com/api/miniconf/users/188857?format=json", "institution": "Senckenberg Society for Nature Research"}, {"id": 152329, "fullname": "Mimi Arandjelovic", "url": "http://cvpr.thecvf.com/api/miniconf/users/152329?format=json", "institution": "Max Planck Institute for Evolutionary Anthropology"}, {"id": 175063, "fullname": "Sam Pottie", "url": "http://cvpr.thecvf.com/api/miniconf/users/175063?format=json", "institution": "Climate Corridors"}, {"id": 188858, "fullname": "Peter Bermant", "url": "http://cvpr.thecvf.com/api/miniconf/users/188858?format=json", "institution": "Conservation X Labs"}, {"id": 188859, "fullname": "Brandon Asheim", "url": "http://cvpr.thecvf.com/api/miniconf/users/188859?format=json", "institution": "Conservation X Labs"}, {"id": 188860, "fullname": "Yi Toh", "url": "http://cvpr.thecvf.com/api/miniconf/users/188860?format=json", "institution": "Conservation X Labs"}, {"id": 188861, "fullname": "Adam Elzinga", "url": "http://cvpr.thecvf.com/api/miniconf/users/188861?format=json", "institution": "Conservation X Labs"}, {"id": 188862, "fullname": "Jason Allan Holmberg", "url": "http://cvpr.thecvf.com/api/miniconf/users/188862?format=json", "institution": "Conservation X Labs"}, {"id": 188863, "fullname": "Andrew Whitworth", "url": "http://cvpr.thecvf.com/api/miniconf/users/188863?format=json", "institution": null}, {"id": 188864, "fullname": "Eleanor Flatt", "url": "http://cvpr.thecvf.com/api/miniconf/users/188864?format=json", "institution": "Osa Conservation"}, {"id": 188865, "fullname": "Laura Gustafson", "url": "http://cvpr.thecvf.com/api/miniconf/users/188865?format=json", "institution": "Meta"}, {"id": 188866, "fullname": "Chaitanya Ryali", "url": "http://cvpr.thecvf.com/api/miniconf/users/188866?format=json", "institution": "Meta"}, {"id": 188867, "fullname": "Yuan-Ting Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188867?format=json", "institution": "Meta AI"}, {"id": 188868, "fullname": "Baishan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188868?format=json", "institution": "Meta"}, {"id": 188869, "fullname": "Andrew Westbury", "url": "http://cvpr.thecvf.com/api/miniconf/users/188869?format=json", "institution": "Facebook"}, {"id": 69230, "fullname": "Kate Saenko", "url": "http://cvpr.thecvf.com/api/miniconf/users/69230?format=json", "institution": "Meta / Boston University"}, {"id": 89163, "fullname": "D\u00eddac Sur\u00eds", "url": "http://cvpr.thecvf.com/api/miniconf/users/89163?format=json", "institution": "Columbia University"}], "abstract": "Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identification and behavior recognition. However, existing datasets are limited in scale, constrained to a few species, or lack sufficient temporal and geographical diversity -- leaving no suitable benchmark for training general-purpose MAT models applicable across wild animal populations. To address this, we introduce SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera trap videos collected over approximately 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated culminating in $\\sim$46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels. Alongside the task-specific annotations, we publish anonymized camera trap locations for each video. Finally, we present comprehensive benchmarks on SA-FARI using state-of-the-art vision-language models for detection and tracking, including SAM 3, evaluated with both species-specific and generic animal prompts. We also compare against vision-only methods developed specifically for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing generalizable multi-animal tracking in the wild. The dataset is available at [ANONYMIZED]", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40322", "url": null, "sourceid": -41743, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38025?format=json"], "related_events_ids": [38025]}, {"id": 40048, "uid": "b7444aa33236c190c7e81510bedfc9f4", "name": "Mario: Multimodal Graph Reasoning with Large Language Models", "authors": [{"id": 193375, "fullname": "Yuanfu Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/193375?format=json", "institution": "New York University"}, {"id": 193376, "fullname": "Kang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193376?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 193377, "fullname": "Pengkang Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/193377?format=json", "institution": "EPFL - EPF Lausanne"}, {"id": 187963, "fullname": "Jiajin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187963?format=json", "institution": "New York University"}, {"id": 152657, "fullname": "Qiaoyu Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/152657?format=json", "institution": "New York University Shanghai"}], "abstract": "Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet, most existing methods still rely on pretrained vision\u2013language models (VLMs) to encode image\u2013text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning on multimodal graphs (MMGs), where each node has textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning on such heterogeneous multimodal signals while preserving graph topology introduces two key challenges: resolving weak cross-modal consistency and handling heterogeneous modality preference. To address this, we propose Mario, a unified framework that simultaneously resolves the two above challenges and enables effective LLM-based reasoning over MMGs. Mario consists of two innovative stages. Firstly, a graph-conditioned VLM design that jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology. Secondly, a modality-adaptive graph instruction tuning mechanism that organizes aligned multimodal features into graph-aware instruction views and employs a learnable router to surface, for each node and its neighborhood, the most informative modality configuration to the LLM. Extensive experiments across diverse MMG benchmarks demonstrate that Mario consistently outperforms state-of-the-art graph models in both supervised and zero-shot scenarios for node classification and link prediction. The code is available in supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40048", "url": null, "sourceid": 40243, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38060, "uid": "eb9253d4e3fb42cbb7ae9c6d2b10082a", "name": "SyncMos: Scalable Motion Synchronisation for Multi-Agent Scene Interaction", "authors": [{"id": 188945, "fullname": "Lingxiao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188945?format=json", "institution": "University of Melbourne"}, {"id": 182388, "fullname": "Dongwon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/182388?format=json", "institution": "Hanyang University"}, {"id": 188946, "fullname": "Lingyan Ruan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188946?format=json", "institution": null}, {"id": 188947, "fullname": "Taesoo Kwon", "url": "http://cvpr.thecvf.com/api/miniconf/users/188947?format=json", "institution": "Hanyang University"}, {"id": 188948, "fullname": "Bin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188948?format=json", "institution": "University of Melbourne"}, {"id": 188949, "fullname": "Taehyun Rhee", "url": "http://cvpr.thecvf.com/api/miniconf/users/188949?format=json", "institution": "University of Melbourne"}], "abstract": "Text-guided motion generation in 3D scenes has advanced the synthesis of human\u2013scene interactions, contributing to embodied AI, scene understanding, and virtual agent simulation. While recent studies have begun exploring multi-agent scenarios, achieving temporally synchronised interactions among multiple agents remains an open challenge. Existing methods are often limited in flexibility and scalability when handling diverse interaction contexts.We present a method that enables synchronised multi-agent interaction using a single-agent motion synthesis model through two key components: a text-guided dependency-aware story planner and a temporal synchronisation module. The story planner interprets natural language instructions into structured event sequences with temporal dependencies. Our synchronisation module, built upon time-warping control and diffusion posterior sampling, aligns interaction timing across agents without retraining.Experimental results demonstrate that the proposed framework effectively models temporal dependencies and causal order between events. Evaluations across diverse interaction types show improved temporal alignment and coherent multi-agent motion generation consistent with textual instructions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38060", "url": null, "sourceid": 44308, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37633, "uid": "d9f81c55a64eb1324cbe20c56cf59374", "name": "Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation", "authors": [{"id": 187918, "fullname": "Yutong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187918?format=json", "institution": "Beihang University"}, {"id": 87568, "fullname": "Jiaxin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87568?format=json", "institution": "Beihang University"}, {"id": 187919, "fullname": "Honglin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187919?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 187920, "fullname": "Kaiqi Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187920?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 90587, "fullname": "Shengcai Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/90587?format=json", "institution": "Inception Institute of Artificial Intelligence"}, {"id": 187921, "fullname": "Hanwen Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187921?format=json", "institution": "Beihang University"}, {"id": 107115, "fullname": "Weixin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/107115?format=json", "institution": "Beihang University"}, {"id": 89528, "fullname": "Yunhong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89528?format=json", "institution": "Beihang University"}], "abstract": "Memory-efficient transfer learning (METL) approaches have recently achieved promising performance in adapting pre-trained models to downstream tasks. They avoid applying gradient backpropagation in large backbones, thus significantly reducing the number of trainable parameters and high memory consumption during fine-tuning. However, since they typically employ a lightweight and learnable side network, these methods inevitably introduce additional memory and time overhead during inference, which contradicts the ultimate goal of efficient transfer learning. To address the above issue, we propose a novel approach dubbed Masked Dual Path Distillation (MDPD) to accelerate inference while retaining parameter and memory efficiency in fine-tuning with fading side networks. Specifically, MDPD develops a framework that enhances the performance by mutually distilling the frozen backbones and learnable side networks in fine-tuning, and discard the side network during inference without sacrificing accuracy. Moreover, we design a novel feature-based knowledge distillation method for the encoder structure with multiple layers. Extensive experiments on distinct backbones across vision/language-only and vision-and-language tasks demonstrate that our method not only accelerates inference by at least 25.2\\% while keeping parameter and memory consumption comparable, but also remarkably promotes the accuracy compared to SOTA approaches. The source code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37633", "url": null, "sourceid": 31825, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38220, "uid": "a6c374f805b45c3cb19b3e30270bb5eb", "name": "DreamSAC: Learning Hamiltonian World Models via Symmetry Exploration", "authors": [{"id": 184739, "fullname": "Jinzhou Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184739?format=json", "institution": "University of California, San Diego"}, {"id": 189356, "fullname": "Fan Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189356?format=json", "institution": "University of California, San Diego; MBZUAI"}, {"id": 189357, "fullname": "Minghao Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189357?format=json", "institution": "University of California, San Diego"}, {"id": 189358, "fullname": "Wenjun Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189358?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 189359, "fullname": "Jing Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189359?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 189360, "fullname": "Biwei Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189360?format=json", "institution": "University of California, San Diego"}, {"id": 128912, "fullname": "Keze Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128912?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Learned world models excel at interpolative generalization but fail at extrapolative generalization to novel physical properties. This limitation arises because they learn statistical correlations rather than the environment's underlying generative rules, such as physical invariances and conservation laws. We argue that learning these invariances is key to robust extrapolation. To achieve this, we first introduce \\textbf{Symmetry Exploration}, an unsupervised exploration strategy where an agent is intrinsically motivated by a Hamiltonian-based curiosity bonus to actively probe and challenge its understanding of conservation laws, thereby collecting physically informative data. Second, we design a Hamiltonian-based world model that learns from the collected data, using a novel self-supervised contrastive objective to identify the invariant physical state from raw, view-dependent pixel observations. Our framework, \\textbf{DreamSAC}, trained on this actively curated data, significantly outperforms state-of-the-art baselines in 3D physics simulations on tasks requiring extrapolation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38220", "url": null, "sourceid": 45549, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36992, "uid": "721d2344f7d2bfbe7a90c257f8d961df", "name": "Nonparametric Deep Fine-grained Clustering with Low-Rank Guided Vision-Language Model", "authors": [{"id": 180087, "fullname": "xulun ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/180087?format=json", "institution": "Ningbo University"}, {"id": 186401, "fullname": "Benyu Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186401?format=json", "institution": "Ningbo University"}, {"id": 186402, "fullname": "Jie Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186402?format=json", "institution": "Ningbo University"}, {"id": 186403, "fullname": "Kun Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186403?format=json", "institution": "Shenzhen University"}], "abstract": "The scarcity of labeled fine-grained data presents a significant challenge for deep clustering. Vision-Language Models (VLMs) on existing coarse-grained datasets (characterized by high inter-class and low intra-class variance) struggle to capture the subtle distinctions essential for fine-grained categorization, leading to suboptimal clustering performance. To address this, we propose a novel framework that adapts VLMs for fine-grained clustering without requiring fine-grained labels. Our method steers the model to focus on discriminative fine-grained features by integrating a Bayesian nonparametric process with a tailored representation learning objective, which includes low-rank guidance and orthogonal guidance. This allows our model to dynamically discover clusters that reflect fine-grained categories. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on multiple fine-grained benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36992", "url": null, "sourceid": 34409, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39307, "uid": "96c9f1595495506df084b0966e40432e", "name": "Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration", "authors": [{"id": 188359, "fullname": "Jiaqi Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/188359?format=json", "institution": "Computer Science Department, Stanford University"}, {"id": 191811, "fullname": "Juntong Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/191811?format=json", "institution": "Stanford University"}, {"id": 191812, "fullname": "Puheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191812?format=json", "institution": "Stanford University"}, {"id": 191813, "fullname": "Haotian Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/191813?format=json", "institution": "Stanford University"}, {"id": 130493, "fullname": "Qiushan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/130493?format=json", "institution": "The University of Hong Kong"}, {"id": 85600, "fullname": "Stefano Ermon", "url": "http://cvpr.thecvf.com/api/miniconf/users/85600?format=json", "institution": "Stanford University"}], "abstract": "Diffusion models have become the dominant tool for high-fidelity image and video generation, yet are critically bottlenecked by their inference speed due to the numerous iterative passes of Diffusion Transformers. To reduce the exhaustive compute, recent works resort to the feature caching and reusing scheme that skips network evaluations at selected diffusion steps by using cached features in previous steps. However, their preliminary design solely relies on local approximation, causing errors to grow rapidly with large skips and leading to degraded sample quality at high speedups. In this work, we propose spectral diffusion feature forecaster (*Spectrum*), a training-free approach that enables global, long-range feature reuse with tightly controlled error. In particular, we view the latent features of the denoiser as functions over time and approximate them with Chebyshev polynomials. Specifically, we fit the coefficient for each basis via ridge regression, which is then leveraged to forecast features at multiple future diffusion steps. We theoretically reveal that our approach admits more favorable long-horizon behavior and yields an error bound that does not compound with the step size.Extensive experiments on various state-of-the-art image and video diffusion models consistently verify the superiority of our approach. Notably, we achieve up to 4.79$\\times$ speedup on FLUX.1 and 4.67$\\times$ speedup on Wan2.1-14B, while maintaining much higher sample quality compared with the baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39307", "url": null, "sourceid": 36724, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37901, "uid": "0291fd85422559cef53644838ad97856", "name": "SparseVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration", "authors": [{"id": 181221, "fullname": "Zekun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181221?format=json", "institution": "Beijing Academy of Artificial Intelligence (BAAI)"}, {"id": 188530, "fullname": "wang ning", "url": "http://cvpr.thecvf.com/api/miniconf/users/188530?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 188531, "fullname": "Tongxin Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/188531?format=json", "institution": "Beijing Institute of Artificial Intelligence"}, {"id": 188532, "fullname": "Changwang Mei", "url": "http://cvpr.thecvf.com/api/miniconf/users/188532?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 188533, "fullname": "Peisong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188533?format=json", "institution": "Institute of Automation of\uff0cChinese Academy of Sciences"}, {"id": 188534, "fullname": "Shuang Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188534?format=json", "institution": "City University of Hong Kong"}, {"id": 188535, "fullname": "Jian Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188535?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}], "abstract": "Visual AutoRegressive (VAR) modeling has garnered significant attention for its innovative next-scale prediction paradigm.However, mainstream VAR paradigms attend to all tokens across historical scales at each autoregressive step. As the next scale resolution grows, the computational complexity of attention increases quartically with resolution, causing substantial latency.Prior accelerations often skip high-resolution scales, which speeds up inference but discards high-frequency details and harms image quality.To address these problems, we present \\textbf{SparseVAR}, a training-free acceleration framework that exploits three properties of VAR attention: \\textbf{(i) strong attention sinks}, \\textbf{(ii) cross-scale activation similarity}, and \\textbf{(iii) pronounced locality}.Specifically, we dynamically predict the sparse attention pattern of later high-resolution scales from a sparse decision scale, and construct scale self-similar sparse attention via an efficient index-mapping mechanism, enabling high-efficiency sparse attention computation at large scales.Furthermore, we propose cross-scale local sparse attention and implement an efficient block-wise sparse kernel, which achieves $\\mathbf{> 5\\times}$ faster forward speed than FlashAttention.Extensive experiments demonstrate that the proposed SparseVAR can reduce the generation time of an 8B model producing $1024\\times1024$ high-resolution images to the \\textbf{1s}, \\textbf{without skipping the last scales}.Compared with the VAR baseline accelerated by FlashAttention, our method achieves a $\\mathbf{1.57\\times}$ speed-up while preserving almost all high-frequency details. When combined with existing scale-skipping strategies, SparseVAR attains up to a $\\mathbf{2.28\\times}$ acceleration, while maintaining competitive visual generation quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37901", "url": null, "sourceid": 34491, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38119, "uid": "2c7e55694e865b64eebe041d813cb0d8", "name": "Out of Sight, Out of Track: Adversarial Attacks on Propagation-based Multi-Object Trackers via Query State Manipulation", "authors": [{"id": 182493, "fullname": "Halima Bouzidi", "url": "http://cvpr.thecvf.com/api/miniconf/users/182493?format=json", "institution": "University of California, Irvine"}, {"id": 189092, "fullname": "Haoyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189092?format=json", "institution": "University of California, Irvine"}, {"id": 189093, "fullname": "Yonatan Achamyeleh", "url": "http://cvpr.thecvf.com/api/miniconf/users/189093?format=json", "institution": "University of California, Irvine"}, {"id": 189094, "fullname": "Praneetsai Iddamsetty", "url": "http://cvpr.thecvf.com/api/miniconf/users/189094?format=json", "institution": "University of California, Irvine"}, {"id": 156618, "fullname": "Mohammad Al Faruque", "url": "http://cvpr.thecvf.com/api/miniconf/users/156618?format=json", "institution": "University of California, Irvine, CA, USA"}], "abstract": "Recent Tracking-by-Query-Propagation (TBP) methods have advanced Multi-Object Tracking (MOT) by enabling end-to-end (E2E) pipelines with long-range temporal modeling. However, this reliance on query propagation introduces un-explored architectural vulnerabilities to adversarial attacks. We present FADE, a novel attack framework designed to exploit these specific vulnerabilities. FADE employs two attack strategies targeting core TBP mechanisms: (i) Temporal Query Flooding: Generates spurious temporally-consistent track queries to exhaust the tracker's limited query budget, forcing it to terminate valid tracks. (ii) Temporal Memory Corruption: Directly attacks the query updater's memory by severing temporal links via state de-correlation and erasing the learned feature identity of matched tracks. Furthermore, we introduce a differentiable pipeline to optimize these attacks for the physical world realizability by leveraging differentiable simulations of advanced perception sensor spoofing methods. Experiments on MOT17 and MOT20 demonstrate that FADE is highly effective against state-of-the-art TBP trackers, causing significant identity switches and track terminations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38119", "url": null, "sourceid": 41711, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37307, "uid": "2d879a9dc61db834e7597d21c9f65101", "name": "GR-Gauge: Cost-efficient Training Configuration By Gauging the Gradient Redundancy", "authors": [{"id": 183285, "fullname": "Guanjie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183285?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 187130, "fullname": "Chen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187130?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "The recent success of artificial intelligence motivates many non-professional users to train their own models. Those users often resort to cloud training services, seeking to obtain a sufficiently accurate model at a modest cost, for which properly setting up the learning rate and batch size is crucial. While various Hyper-parameter Optimization (HPO) methods have been proposed in that regard, they largely act based on heavy-weight validation signals, being inefficient in the overall cost. We find that the model training process can be viewed as a two-dimensional voting process---with gradients for different iterations and from different samples; moreover, to attain cost-efficient training is to ensure that the gradient redundancy is within a proper range which is similar across diverse models. We further introduce GR-Gauge, a general method that gauges the gradient redundancy to instruct HPO decisions like configuration searching and trial termination. Extensive experiments demonstrate that GR-Gauge can help attain near-optimal accuracy in much less time than existing methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37307", "url": null, "sourceid": 46428, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39011, "uid": "9545ca5aa0f8ab89c566464be34e091a", "name": "Align While Search: Belief-Guided Exploratory Inference for World-Grounded Embodied Agents", "authors": [{"id": 181856, "fullname": "Seohui Bae", "url": "http://cvpr.thecvf.com/api/miniconf/users/181856?format=json", "institution": "LG AI Research"}, {"id": 191174, "fullname": "Jeonghye Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/191174?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 191175, "fullname": "Youngchul Sung", "url": "http://cvpr.thecvf.com/api/miniconf/users/191175?format=json", "institution": "Korea Advanced Institute of Science and Technology"}, {"id": 191176, "fullname": "Woohyung Lim", "url": "http://cvpr.thecvf.com/api/miniconf/users/191176?format=json", "institution": "LG AI Research"}], "abstract": "In this paper, we propose a test-time adaptive agent that performs exploratory inference through posterior-guided belief refinement without relying on gradient-based updates or additional training for LLM agent operating under partial observability.  Our agent maintains an external structured belief over the environment state, iteratively updates it via action-conditioned observations, and selects actions by maximizing predicted information gain over the belief space. We estimate information gain using a lightweight LLM-based surrogate and assess world alignment through a novel reward that quantifies the consistency between posterior belief and ground-truth environment configuration. Experiments show that our method outperforms inference-time scaling baselines such as prompt-augmented or retrieval-enhanced LLMs, in aligning with latent world states with significantly lower integration overhead.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39011", "url": null, "sourceid": 30857, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36561, "uid": "3f00436e3415eabeafb77ba1fe5b6f87", "name": "World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models", "authors": [{"id": 185341, "fullname": "Eunsu Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/185341?format=json", "institution": "KAIST"}, {"id": 185342, "fullname": "Junyeong Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/185342?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 182201, "fullname": "Na Min An", "url": "http://cvpr.thecvf.com/api/miniconf/users/182201?format=json", "institution": "KAIST"}, {"id": 185343, "fullname": "Jun Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/185343?format=json", "institution": null}, {"id": 185344, "fullname": "Hitesh Patel", "url": "http://cvpr.thecvf.com/api/miniconf/users/185344?format=json", "institution": "Oracle"}, {"id": 185345, "fullname": "Jiho Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185345?format=json", "institution": "Korea Advanced Institute of Science and Technology"}, {"id": 185346, "fullname": "Julia Kruk", "url": "http://cvpr.thecvf.com/api/miniconf/users/185346?format=json", "institution": "Facebook"}, {"id": 179587, "fullname": "Amit Agarwal", "url": "http://cvpr.thecvf.com/api/miniconf/users/179587?format=json", "institution": "Oracle"}, {"id": 185347, "fullname": "Srikant Panda", "url": "http://cvpr.thecvf.com/api/miniconf/users/185347?format=json", "institution": "Lam Research"}, {"id": 185348, "fullname": "Fenal Ilasariya", "url": "http://cvpr.thecvf.com/api/miniconf/users/185348?format=json", "institution": "Stevens Institute of Technology"}, {"id": 144939, "fullname": "Hyunjung Shim", "url": "http://cvpr.thecvf.com/api/miniconf/users/144939?format=json", "institution": "KAIST"}, {"id": 185349, "fullname": "Alice Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/185349?format=json", "institution": "Google; Korea Advanced Institute of Science and Technology"}], "abstract": "In a globalized world, cultural elements from diverse origins frequently appear together within a single visual scene. We refer to these as culture mixing scenarios, yet how Large Vision-Language Models (LVLMs) perceive them remains underexplored. We investigate culture mixing as a critical challenge for LVLMs and examine how current models behave when cultural items from multiple regions appear together. To systematically analyze these behaviors, we construct CultureMix, a food Visual Question Answering (VQA) benchmark with 23k diffusion-generated, human-verified culture mixing images across four subtasks: (1) food-only, (2) food+food, (3) food+background, and (4) food+food+background.Evaluating 10 LVLMs, we find consistent failures to preserve individual cultural identities in mixed settings. Models show strong background reliance, with accuracy dropping 14\\% when cultural backgrounds are added to food-only baselines, and they produce inconsistent predictions for identical foods across different contexts.To address these limitations, we explore three robustness strategies. We find supervised fine-tuning using a diverse culture mixing dataset substantially improve model consistency and reduce background sensitivity. We call for increased attention to culture mixing scenarios as a critical step toward developing LVLMs capable of operating reliably in culturally diverse real-world environments.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36561", "url": null, "sourceid": 40105, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36815, "uid": "170978449a44d129b9548d122cf7f42f", "name": "Extend3D: Town-scale 3D Generation", "authors": [{"id": 185940, "fullname": "Seungwoo Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/185940?format=json", "institution": "Seoul National University"}, {"id": 150473, "fullname": "Jinmo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/150473?format=json", "institution": "Seoul National University"}, {"id": 131443, "fullname": "Jaesik Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/131443?format=json", "institution": "Seoul National University"}], "abstract": "In this paper, we propose Extend3D, a novel training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the limitations of fixed-size latent spaces of object-centric models in representing wide scenes, we extend the latent space in $x$ and $y$ directions. Then, by dividing the extended latent into overlapping patches, we use the object-centric 3D generative model on each patch and couple them at each time step. Since object-centric models are sub-optimal for sub-scene generation, we use the input image and point cloud extracted from a depth estimator as priors to enable this process. Using the point cloud prior, we initialize the scene structure and refine the occluded region iteratively with under-noised SDEdit. Also, both priors are used to optimize the extended latent during the denoising process so that the denoising paths do not deviate from the sub-scene dynamics. We demonstrate that our method produces better results than previous methods, as evidenced by human preferences.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36815", "url": null, "sourceid": 35128, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40331, "uid": "a7572e51d89da5dcfc00ce1c2e20e86c", "name": "Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species", "authors": [{"id": 174862, "fullname": "Jinyu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/174862?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 174874, "fullname": "Tianqi Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/174874?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 181695, "fullname": "Xiaonan Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181695?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 189419, "fullname": "Letian Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/189419?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 189420, "fullname": "Songliang Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189420?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 189421, "fullname": "Meng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189421?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 128704, "fullname": "Hao Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128704?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Visually cataloging and quantifying the natural world requires pushing the boundaries of both detailed visual classification and counting at scale. Despite significant progress, particularly in crowd and traffic analysis, the fine-grained, taxonomy-aware plant counting remains underexplored in vision. In contrast to crowds, plants are complicated by nonrigid morphologies and physical appearance variations across growth stages and environments. Tofill this gap, we present TPC-268, the first plant counting benchmark taking plant taxonomy into account. Our dataset couples instance-level point annotations with complete Linnaean labels (kingdom$\\rightarrow$species) and organ categories, enabling hierarchical reasoning and species-aware evaluation. The datasetfeatures $10,000$ images with $678,090$ point annotations, includes $268$ countable plant categories over $242$ plant species in Plantae and Fungi, and spans observation scales from canopy-level remote sensing imagery to tissue-level microscopy.We follow the problem setting of class-agnostic counting (CAC), provide taxonomy-consistent, scale-aware data splits, and benchmark state-of-the-art regression- and detection-based CAC approaches. By capturing the biodiversity, hierarchical structure, and multi-scale nature of botanical and mycological taxa, TPC-268 provides a biologically grounded testbed to advance fine-grained class-agnostic counting.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40331", "url": "https://tiny-smart.github.io/tpc268-project-page/", "sourceid": -32681, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38249?format=json"], "related_events_ids": [38249]}, {"id": 36205, "uid": "be4dfce0bd450fdd57fda1bd637ad712", "name": "Mixture of Style Experts for Diverse Image Stylization", "authors": [{"id": 182025, "fullname": "Shihao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182025?format=json", "institution": "Nankai University"}, {"id": 154650, "fullname": "Ziheng Ouyang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154650?format=json", "institution": "Nankai University"}, {"id": 184446, "fullname": "Yijia Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184446?format=json", "institution": "Nankai University"}, {"id": 86158, "fullname": "Qilong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86158?format=json", "institution": "university  of tianjin of china"}, {"id": 184447, "fullname": "Mi Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184447?format=json", "institution": "vivo"}, {"id": 87210, "fullname": "Bo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/87210?format=json", "institution": "vivo Mobile Communication Co.,Ltd."}, {"id": 90540, "fullname": "Ming-Ming Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90540?format=json", "institution": "Nankai University, Tsinghua University"}, {"id": 90664, "fullname": "Qibin Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/90664?format=json", "institution": "Nankai University"}], "abstract": "Diffusion-based stylization has advanced significantly, yet existing methods are limited to color-driven transformations, neglecting complex semantics and material details. We introduce StyleExpert, a semantic-aware framework based on Mixture of Experts (MoE).Our framework employs a unified style encoder, trained on our large-scale dataset of content-style-stylized triplets, to embed diverse styles into a consistent latent space. This embedding is then used to condition a similarity-aware gating mechanism, which dynamically routes styles to specialized experts within the MoE architecture. Leveraging this MoE architecture, our method adeptly handles diverse styles spanning multiple semantic levels, from shallow textures to deep semantics. Extensive experiments show that StyleExpert outperforms existing approaches in preserving semantics and material details, while generalizing to unseen styles.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36205", "url": null, "sourceid": 33442, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37354, "uid": "1b34d47c57a081fbfeec93a1065e69bd", "name": "FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance", "authors": [{"id": 182197, "fullname": "Quanhao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182197?format=json", "institution": "Fudan University"}, {"id": 86636, "fullname": "Zhen Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/86636?format=json", "institution": "Fudan University"}, {"id": 77338, "fullname": "Rui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77338?format=json", "institution": "Fudan University"}, {"id": 187239, "fullname": "Haidong Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187239?format=json", "institution": "Fudan University"}, {"id": 85511, "fullname": "Qi Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/85511?format=json", "institution": "Microsoft Research Asia"}, {"id": 187240, "fullname": "Daoguo Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187240?format=json", "institution": "Fudan University"}, {"id": 74132, "fullname": "Zuxuan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/74132?format=json", "institution": "Fudan University"}], "abstract": "Recent advances in trajectory-controllable video generation have achieved remarkable progress. Previous methods mainly use adapter-based architectures for precise motion control along predefined trajectories.However, all these methods rely on a multi-step denoising process, leading to substantial time redundancy and computational overhead.While existing video distillation methods successfully distill multi-step generators into few-step, directly applying these approaches to trajectory-controllable video generation results in noticeable degradation in both video quality and trajectory accuracy.To bridge this gap, we introduce FlashMotion, a novel training framework designed for few-step trajectory-controllable video generation.We first train a trajectory adapter on a multi-step video generator for precise trajectory control.Then, we distill the generator into a few-step version to accelerate video generation.Finally, we finetune the adapter using a hybrid strategy that combines diffusion and adversarial objectives, aligning it with the few-step generator to produce high-quality, trajectory-accurate videos.For evaluation, we introduce FlashBench, a benchmark for long-sequence trajectory-controllable video generation that measures both video quality and trajectory accuracy across varying numbers of foreground objects. Experiments on two adapter architectures show that FlashMotion surpasses existing video distillation methods and previous multi-step models in both visual quality and trajectory consistency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37354", "url": null, "sourceid": 39567, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39675, "uid": "c321c755d0a92e0b884d3336cdf803e1", "name": "Rethinking Two-Stage Referring-by-Tracking in Referring Multi-Object Tracking: Make it Strong Again", "authors": [{"id": 182339, "fullname": "Weize Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182339?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 131581, "fullname": "Yunhao Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/131581?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 192625, "fullname": "Qixiang Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192625?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 131585, "fullname": "Zhicheng Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/131585?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 131551, "fullname": "Fei Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/131551?format=json", "institution": "Beijing University of Posts and Telecommunications"}], "abstract": "Referring Multi-Object Tracking (RMOT) aims to track multiple objects specified by natural language expressions in videos. With the recent significant progress of one-stage methods, the two-stage Referring-by-Tracking (RBT) paradigm has gradually lost its popularity. However, its lower training cost and flexible incremental deployment remain irreplaceable. Rethinking existing two-stage RBT frameworks, we identify two fundamental limitations: the overly heuristic feature construction and fragile correspondence modeling. To address these issues, we propose FlexHook, a novel two-stage RBT framework. In FlexHook, the proposed Conditioning Hook (C-Hook) redefines the feature construction by a sampling-based strategy and language-conditioned cue injection. Then, we introduce a Pairwise Correspondence Decoder (PCD) that replaces CLIP-based similarity matching with active correspondence modeling, yielding a more flexible and robust strategy. Extensive experiments on multiple benchmarks (Refer-KITTI/v2, Refer-Dance, and LaMOT) demonstrate that FlexHook becomes the first two-stage RBT approach to comprehensively outperform current state-of-the-art methods. Code can be found in the Supplementary Materials.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39675", "url": null, "sourceid": 41013, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39335, "uid": "5b9381f1c57ef843a11410fe4482d3e7", "name": "DiT-Distill: Open-Set Fine-Grained Retrieval via Generative Curriculum Knowledge", "authors": [{"id": 190585, "fullname": "Xin Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190585?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 191875, "fullname": "Hao Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191875?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 146724, "fullname": "Meiqi Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/146724?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 158338, "fullname": "Junyao Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/158338?format=json", "institution": "Tongji University"}, {"id": 185898, "fullname": "Fei Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185898?format=json", "institution": "National University of Singapore; Tencent"}, {"id": 181619, "fullname": "Zechao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181619?format=json", "institution": "Nanjing University of Science and Techonolgy"}], "abstract": "Open-set fine-grained retrieval~(OSFR) is a challenging task where models must generalize to unseen subcategories. Existing methods often fail this, as they embed category-specific semantics from closed-set training labels. Recently, diffusion transformers (DiT) have shown promise by encoding \\textit{attribute-centric, generative curriculum knowledge} that is agnostic to these labels. However, the vanilla DiT is not optimized for fine-grained \\textit{visual discrepancies} and its massive size makes \\textit{deployment infeasible}. To solve this, we propose \\textbf{DiT-Distill}, a framework to first refine and then distill this knowledge. We introduce a \\textit{conditional discrepancy refinement} strategy to fine-tune the DiT, forcing it to focus on discrepancy-aware, attribute-centric details rather than holistic context.  Subsequently, a \\textit{generative curriculum distillation} mechanism transfers this refined, hierarchical knowledge from multiple diffusion timesteps of the DiT into a lightweight backbone using a generative infusion module and a curriculum alignment loss. This process results in an efficient retrieval model that enables \\textit{DiT-free inference}. Extensive experiments show DiT-Distill achieves state-of-the-art performance on open-set fine-grained datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39335", "url": null, "sourceid": 45611, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36490, "uid": "0629fccd9ea3789671acab64a17ed21a", "name": "SemiGDA: Generative Dual-distribution Alignment for Semi-Supervised Medical Image Segmentation", "authors": [{"id": 145966, "fullname": "kaiwen Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/145966?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 185180, "fullname": "Yi Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185180?format=json", "institution": "Southeast University"}, {"id": 185181, "fullname": "Yizhe Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185181?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 185182, "fullname": "Jingxiong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185182?format=json", "institution": "Zhejiang University; Westlake University"}, {"id": 184810, "fullname": "Tao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184810?format=json", "institution": "Nanjing University of Science and Technology"}], "abstract": "Semi-supervised learning addresses label scarcity and high annotation costs in medical image segmentation by exploiting the latent information in unlabeled data to enhance model performance. Traditional discriminative segmentation relies on segmentation masks, neglecting feature-level distribution constraints. This limits robust semantic representation learning and adaptive modeling of unlabeled data in scenarios with few labels. To address these limitations, we propose SemiGDA, a novel Generative Dual-distribution Alignment framework for semi-supervised medical image segmentation. Our SemiGDA overcomes the reliance of discriminative methods on large labeled datasets by aligning feature and semantic distributions to boost semantic learning and scene adaptability. Specifically, we propose a Dual-distribution Alignment Module (DAM), which employs two structurally distinct encoders to model image and mask feature distributions. It enforces their alignment in the latent space via distributional constraints, establishing structured feature consistency. Moreover, we design a Consistency-Driven Skip Adapter (CDSA) strategy, which introduces dual skip adapters (Image and Mask) to fuse multi-scale features via skip connections. Using a consistency loss, CDSA enhances cross-branch semantic alignment and reinforces fine-grained semantic consistency.Experimental results on diverse medical datasets show that our method outperforms other state-of-the-art semi-supervised segmentation methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36490", "url": null, "sourceid": 34844, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38751, "uid": "d0c9acab2dcb6aaa94c495fd6f58a2ec", "name": "Seeing Motion Through Polarity for Event-based Action Recognition", "authors": [{"id": 146724, "fullname": "Meiqi Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/146724?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 190584, "fullname": "Jiachao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190584?format=json", "institution": "Nanjing Institute of Technology"}, {"id": 190585, "fullname": "Xin Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190585?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 88417, "fullname": "Rui Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88417?format=json", "institution": "National University of Singapore"}, {"id": 129605, "fullname": "Yazhou Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/129605?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 181619, "fullname": "Zechao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181619?format=json", "institution": "Nanjing University of Science and Techonolgy"}, {"id": 157797, "fullname": "Xiangbo Shu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157797?format=json", "institution": "Nanjing University of Science and Technology"}], "abstract": "Event-based Action Recognition (EAR) provides a promising pathway for understanding dynamic behaviors under challenging conditions. Recent progress in vision-language models has introduced a cross-modal learning paradigm into EAR, enabling models to associate event streams with textual semantics for enhancing conceptual understanding. However, existing methods typically overlook the intrinsic polarity-driven motion cues that are fundamental to event data, leading to suboptimal spatiotemporal representations. To address this limitation, we propose a POlarity Knowledge Enhanced framework (POKER), which explicitly incorporates event polarity-aware motion knowledge across visual and textual modalities.POKER consists of two synergistic components: Polarity Motion Capturer (PMC) and Polarity Motion Reasoner (PMR). Specifically, PMC decouples positive and negative polarities to capture polarity-sensitive motion cues, while PMR semantically analyzes polarity-induced motion dynamics via large language models. Through the polarity alignment, POKER couples semantic reasoning with visual dynamics, achieving more discriminative representations. Extensive experiments on multiple benchmarks demonstrate that POKER enhances performance across diverse event representations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38751", "url": null, "sourceid": 33541, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38359, "uid": "5d395883d13573b189da524ce1401834", "name": "MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing", "authors": [{"id": 144447, "fullname": "Xiaokun Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/144447?format=json", "institution": "Nanjing University"}, {"id": 189034, "fullname": "Zeyu Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/189034?format=json", "institution": "Nanjing University"}, {"id": 77263, "fullname": "Hao Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77263?format=json", "institution": "ETH Zurich and CMU"}, {"id": 107265, "fullname": "Ying Tai", "url": "http://cvpr.thecvf.com/api/miniconf/users/107265?format=json", "institution": "Nanjing University"}, {"id": 152125, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152125?format=json", "institution": "nanjing university"}, {"id": 77492, "fullname": "Zhenyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77492?format=json", "institution": "Nanjing University"}], "abstract": "3D morphing remains challenging due to the difficulty of generating semantically consistent and temporally smooth deformations, especially across categories. We present MorphAny3D, a training-free framework that leverages Structured Latent (SLAT) representations for high-quality 3D morphing. Our key insight is that intelligently blending source and target SLAT features within the attention mechanisms of 3D generators naturally produces plausible morphing sequences. To this end, we introduce Morphing Cross-Attention (MCA), which fuses source and target information for structural coherence, and Temporal-Fused Self-Attention (TFSA), which enhances temporal consistency by incorporating features from preceding frames. An orientation correction strategy further mitigates the pose ambiguity within the morphing steps. Extensive experiments show that our method generates state-of-the-art morphing sequences, even for challenging cross-category cases. MorphAny3D further supports advanced applications such as decoupled morphing and 3D style transfer, and can be generalized to other SLAT-based generative models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38359", "url": null, "sourceid": 46769, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38088, "uid": "5ec8b136da1b014682313777cb7a82ee", "name": "Muses: Designing, Composing, Generating Nonexistent Fantasy 3D Creatures without Training", "authors": [{"id": 174448, "fullname": "Hexiao Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/174448?format=json", "institution": "China Agricultural University"}, {"id": 144447, "fullname": "Xiaokun Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/144447?format=json", "institution": "Nanjing University"}, {"id": 189034, "fullname": "Zeyu Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/189034?format=json", "institution": "Nanjing University"}, {"id": 189035, "fullname": "Hao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189035?format=json", "institution": "China Agricultural University"}, {"id": 107265, "fullname": "Ying Tai", "url": "http://cvpr.thecvf.com/api/miniconf/users/107265?format=json", "institution": "Nanjing University"}, {"id": 152125, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152125?format=json", "institution": "nanjing university"}, {"id": 77492, "fullname": "Zhenyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77492?format=json", "institution": "Nanjing University"}], "abstract": "We present Muses, the first training-free method for fantastic 3D creature generation in a feed-forward paradigm. Previous methods, which rely on part-aware optimization, manual assembly, or 2D image generation, often produce unrealistic or incoherent 3D assets due to the challenges of intricate part-level manipulation and limited out-of-domain generation.In contrast, Muses leverages the 3D skeleton\u2014a fundamental representation of biological forms\u2014to explicitly and rationally compose diverse elements. This skeletal foundation formalizes 3D content creation as a structure-aware pipeline of design, composition, and generation.Muses begins by constructing a creatively composed 3D skeleton with coherent layout and scale through graph-constrained reasoning. This skeleton then guides a voxel-based assembly process within a structured latent space, integrating regions from different objects. Finally, image-guided appearance modeling under skeletal conditions is applied to generate a style-consistent and harmonious texture for the assembled shape.Extensive experiments establish Muses' state-of-the-art performance in terms of visual fidelity and alignment with textual descriptions, and potential on flexible 3D object editing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38088", "url": null, "sourceid": 35324, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36464, "uid": "1b2b9dde67ac01d28e6f13e361008545", "name": "DiP: Taming Diffusion Models in Pixel Space", "authors": [{"id": 185113, "fullname": "Zhennan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185113?format=json", "institution": "nanjing university"}, {"id": 160361, "fullname": "junwei zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/160361?format=json", "institution": "Tencent Youtu Lab"}, {"id": 157272, "fullname": "Xu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/157272?format=json", "institution": "Tencent YouTu Lab"}, {"id": 88656, "fullname": "Jiangning Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88656?format=json", "institution": "Tencent Youtu Lab"}, {"id": 152689, "fullname": "Xiaobin Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152689?format=json", "institution": "Tencent AI Lab"}, {"id": 185114, "fullname": "Hanzhen Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185114?format=json", "institution": "National University of Singapore"}, {"id": 86912, "fullname": "Chengjie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86912?format=json", "institution": "Tencent Youtu Lab; Shanghai Jiao Tong University"}, {"id": 152125, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152125?format=json", "institution": "nanjing university"}, {"id": 107265, "fullname": "Ying Tai", "url": "http://cvpr.thecvf.com/api/miniconf/users/107265?format=json", "institution": "Nanjing University"}], "abstract": "Diffusion models face a fundamental trade-off between generation quality and computational efficiency. Latent Diffusion Models (LDMs) offer an efficient solution but suffer from potential information loss and non-end-to-end training.  In contrast, existing pixel space models bypass VAEs but are computationally prohibitive for high-resolution synthesis. To resolve this dilemma, we propose DiP, an efficient pixel space diffusion framework. DiP decouples generation into a global and a local stage: a Diffusion Transformer (DiT) backbone operates on large patches for efficient global structure construction, while a co-trained lightweight Patch Detailer Head leverages contextual features to restore fine-grained local details. This synergistic design achieves computational efficiency comparable to LDMs without relying on a VAE. DiP is accomplished with up to 10$\\times$ faster inference speeds than previous method while increasing the total number of parameters by only 0.3\\%, and achieves an 1.90 FID score on ImageNet 256$\\times$256.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36464", "url": null, "sourceid": 32398, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38513, "uid": "2f02ebc2e4fc50a2545e0709c5fb526c", "name": "VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset", "authors": [{"id": 152121, "fullname": "Zhizhou Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/152121?format=json", "institution": "nanjing university"}, {"id": 186981, "fullname": "Shanyan Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186981?format=json", "institution": "vivo"}, {"id": 190030, "fullname": "Zhanxin Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190030?format=json", "institution": "nanjing university"}, {"id": 190031, "fullname": "En Ci", "url": "http://cvpr.thecvf.com/api/miniconf/users/190031?format=json", "institution": "nanjing university"}, {"id": 186983, "fullname": "Yanhao Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/186983?format=json", "institution": "Future Imaging Area"}, {"id": 186984, "fullname": "Wei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186984?format=json", "institution": "vivo; Huawei Technologies Ltd."}, {"id": 77492, "fullname": "Zhenyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77492?format=json", "institution": "Nanjing University"}, {"id": 152125, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152125?format=json", "institution": "nanjing university"}, {"id": 107265, "fullname": "Ying Tai", "url": "http://cvpr.thecvf.com/api/miniconf/users/107265?format=json", "institution": "Nanjing University"}], "abstract": "Directly editing ultra-high-resolution (UHR) images is valuable but underexplored, primarily due to the lack of high-quality data and the challenge in modeling high-frequency textual details. We introduce VINS-120K, the first large-scale dataset for instruction-based UHR image editing, comprising 120K carefully curated triplets of instruction, input image, and edited image. Each image exceeds 4K resolution ($\\geq$4096\u00d74096) and is filtered through a rigorous multi-stage pipeline to ensure visual quality, instruction alignment, and aesthetic fidelity. For the second challenge, we propose a high-frequency-aware post-adaptation strategy that allows previous non-high-resolution models to accurately generate fine-grained, high-frequency details. We further present VINS-4KEval, a benchmark covering diverse editing types, to facilitate consistent evaluation in UHR settings. Experiments confirm that our work delivers superior fine-grained detail and texture realism in UHR image editing. The dataset and code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38513", "url": null, "sourceid": 34338, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38440, "uid": "1d9b1a8b18c79139022fa537f4a12fd7", "name": "MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images", "authors": [{"id": 140946, "fullname": "Ankan Deria", "url": "http://cvpr.thecvf.com/api/miniconf/users/140946?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 189867, "fullname": "Komal Kumar", "url": "http://cvpr.thecvf.com/api/miniconf/users/189867?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 189868, "fullname": "Adinath Dukre", "url": "http://cvpr.thecvf.com/api/miniconf/users/189868?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 189869, "fullname": "Eran Segal", "url": "http://cvpr.thecvf.com/api/miniconf/users/189869?format=json", "institution": "Weizmann Institute of Science"}, {"id": 85806, "fullname": "Salman Khan", "url": "http://cvpr.thecvf.com/api/miniconf/users/85806?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 158752, "fullname": "Imran Razzak", "url": "http://cvpr.thecvf.com/api/miniconf/users/158752?format=json", "institution": "MBZUAI, Abu Dhabi"}], "abstract": "Multimodal large language models (MLLMs) have rapidly advanced, yet their adoption in medicine remains limited by gaps in domain coverage, modality alignment, and grounded reasoning. In this work, we introduce \\textbf{MedMO}, a medical foundation model built upon a generalized MLLM architecture and trained exclusively on large-scale, domain-specific data. MedMO follows a multi-stage training recipe: (i) cross-modal pretraining to align heterogeneous visual encoders with a medical language backbone; (ii) instruction tuning on multi-task supervision that spans captioning, VQA, report generation, retrieval, and grounded disease localization with bounding boxes; and (iii) reinforcement learning with verifiable rewards that combine factuality checks with a box-level GIoU reward to strengthen spatial grounding and step-by-step reasoning in complex clinical scenarios. MedMO consistently outperforms strong open-source medical MLLMs across multiple modalities and tasks. On VQA benchmarks, MedMO achieves an average accuracy improvement of \\textbf{+13.8\\%} over the baseline and performs within \\textbf{1.8\\%} of the SOTA Fleming-VL. For text-based QA, it attains \\textbf{+7.0\\%} over the baseline and \\textbf{+14.6\\%} over Fleming-VL. In medical report generation, MedMO delivers significant gains in both semantic and clinical accuracy. Moreover, it exhibits strong grounding capability, achieving an IoU improvement of \\textbf{+40.4} over the baseline and \\textbf{+37.0\\%} over Fleming-VL, underscoring its robust spatial reasoning and localization performance.Evaluations across radiology, ophthalmology, pathology, and emergency care confirm MedMO\u2019s broad cross-modality generalization and reliable spatial reasoning. Our code, data, and models will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38440", "url": null, "sourceid": 36608, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36276, "uid": "d71b5bc5076243ac22b8e30d988b67ae", "name": "DPGF-Net: Dual-Prior Guided Fusion Network for Joint Assessment of Perceptual Quality and Semantic Consistency in AI-Generated Images", "authors": [{"id": 184653, "fullname": "Tao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184653?format=json", "institution": "Chongqing University"}, {"id": 100113, "fullname": "Xingran LIAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/100113?format=json", "institution": "City University of Hong Kong"}, {"id": 152604, "fullname": "Mingliang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/152604?format=json", "institution": "Chongqing University"}], "abstract": "The development of AI-generated technology requires effective image quality assessment (AGIQA) methods to jointly evaluate visual quality and text-content alignment, ensuring that the generated content is both visually appealing and faithful to the user's instructions. Nevertheless, visual degradation and text-content misalignment often coincide, and it is difficult to tell whether a bad subjective evaluation arises from prompt noncompliance or rendering artifacts. As such, disentangling image content and rendering distortions is vital. We propose the dual-prior guided fusion network (DPGF-Net), which leverages image-side priors to disentangle distortions from content and combines them with text-side prompt templates to simulate their interactions, to address this issue. DPGF-Net employs a local text-conditioned aggregation branch to highlight semantically relevant and quality-sensitive regions in conjunction with a global modulation branch that captures holistic perceptual characteristics. Finally, adaptive fusion produces a single score. Experiments on three AGIQA datasets demonstrate that our method is highly correlated with human judgments, with lower prediction error and stable evaluation behavior. The code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36276", "url": null, "sourceid": 37183, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40170, "uid": "127c091ffa59fab0d49ec3d91c6017c4", "name": "Representation-Steered Incremental Adapter-Tuning for Class-Incremental Learning with Pre-Trained Models", "authors": [{"id": 193699, "fullname": "Jiarui Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193699?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 185313, "fullname": "Libo Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185313?format=json", "institution": "Institute of Computing Technology"}, {"id": 193700, "fullname": "Xiangqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193700?format=json", "institution": "Chinese Academy of Sciences"}, {"id": 130221, "fullname": "Zhulin An", "url": "http://cvpr.thecvf.com/api/miniconf/users/130221?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 130183, "fullname": "Chuanguang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130183?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 193701, "fullname": "Yu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193701?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 105398, "fullname": "boyu diao", "url": "http://cvpr.thecvf.com/api/miniconf/users/105398?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences."}, {"id": 105588, "fullname": "Yongjun Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/105588?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}], "abstract": "Class-Incremental Learning (CIL) aims to develop models to continuously learn new classes without forgetting learned old ones. Recent advances combine pre-trained models with parameter-efficient fine-tuning, achieving promising results. However, these approaches typically allocate new trainable parameters for each task, causing the model size to grow linearly with task number. Moreover, they lack explicit mechanisms to structure a coherent and discriminative representation space across tasks. To address these limitations, we propose **Representation-Steered Incremental Adapter Tuning** (**RSIAT**). RSIAT maintains a single shared adapter for all tasks, eliminating parameter growth during incremental learning. In the base task, we introduce a representation-steering loss that enhances discriminative feature learning while facilitating future task adaptation. During incremental tasks, a residual autoencoder\u2013based projector aligns feature distributions between old and new tasks, preserving representation consistency without over-constraining the shared adapter. Extensive experiments on six CIL benchmarks demonstrate that RSIAT significantly outperforms state-of-the-art methods in both performance and parameter efficiency, achieving superior stability\u2013plasticity trade-offs with minimal trainable parameters.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40170", "url": null, "sourceid": 45023, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36794, "uid": "0289fc9e3bcd6db0d9a8dbfe050fa406", "name": "STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction", "authors": [{"id": 185886, "fullname": "Jiankuo Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185886?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 76305, "fullname": "Xiangyu Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76305?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 184974, "fullname": "Zidu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184974?format=json", "institution": "Institute of automation, Chinese Academy of Sciences"}, {"id": 89292, "fullname": "Zhen Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/89292?format=json", "institution": "Institute of Automation,  Chinese Academy of Sciences"}], "abstract": "Reconstructing high-fidelity and animatable 3D head avatars from monocular videos remains a challenging yet essential task. Existing methods based on 3D Gaussian Splatting typically bind Gaussians to mesh triangles and model deformations solely via Linear Blend Skinning, which results in rigid motion and limited expressiveness. Moreover, they struggle to reconstruct frequently occluded regions (e.g., mouth interiors, eyelids). To address these limitations, we propose STAvatar, which consists of two key components: (1) a UV-Adaptive Soft Binding framework that leverages both image- and FLAME-based priors to learn per-Gaussian feature offsets within the UV space. This UV representation supports dynamic resampling, ensuring full compatibility with Adaptive Density Control (ADC) and enhanced adaptability to geometric and textural variations. (2) a Temporal ADC strategy, which first clusters structurally similar frames to facilitate more targeted computation of the densification criterion. It further introduces the fused perceptual error as clone criterion to jointly capture geometric and textural discrepancies, encouraging densification in regions requiring finer details. Extensive experiments on four benchmark datasets demonstrate that STAvatar achieves state-of-the-art reconstruction performance, especially in capturing fine-grained details and reconstructing frequently occluded regions. The code will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36794", "url": null, "sourceid": 34721, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37095, "uid": "e06e7f5b7285aa53f6ea6971313c9b6d", "name": "TokenHand: Discrete Token Representation for Efficient Hand Mesh Reconstruction", "authors": [{"id": 150057, "fullname": "Xinguo He", "url": "http://cvpr.thecvf.com/api/miniconf/users/150057?format=json", "institution": "Technical University of Munich"}, {"id": 186641, "fullname": "Yixin Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186641?format=json", "institution": "Technical University of Munich"}, {"id": 186642, "fullname": "Rahul Chaudhari", "url": "http://cvpr.thecvf.com/api/miniconf/users/186642?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}], "abstract": "Hand mesh reconstruction has attracted growing attention in recent years.Despite significant progress, existing methods often struggle to balance reconstruction quality and inference efficiency.In this work, we propose TokenHand, a novel framework for single-view 3D hand mesh reconstruction that achieves both high accuracy and real-time inference.Our method represents a 3D hand model using $M$ discrete tokens, each describing a specific sub-structure of the hand.This compositional representation enables efficient modeling with minimal reconstruction error.Furthermore, we reformulate hand mesh reconstruction as a classification problem rather than a regression task.Specifically, a classifier predicts the categories of the $M$ tokens from an input image, and a pre-trained decoder network subsequently reconstructs the 3D hand mesh from the predicted tokens without any post-processing.Extensive experiments demonstrate that TokenHand achieves comparable or superior performance to existing methods across standard benchmarks, while maintaining high efficiency in practical scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37095", "url": null, "sourceid": 38459, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38499, "uid": "3097d5dc68379483291132f222606ccf", "name": "Thermally Activated Dual-Modal Adversarial Clothing against AI Surveillance Systems", "authors": [{"id": 181657, "fullname": "Jiahuan Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/181657?format=json", "institution": "Defense Innovation Institute"}, {"id": 189993, "fullname": "Tingsong Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189993?format=json", "institution": "Chinese Academy of Military Science"}, {"id": 189994, "fullname": "Hanqing Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189994?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 86334, "fullname": "Chao Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/86334?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 189995, "fullname": "Weien Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/189995?format=json", "institution": "Defense Innovation Institute, Academy of Military Science"}, {"id": 189996, "fullname": "Yang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189996?format=json", "institution": "Intelligence Game and Decision Laboratory"}, {"id": 189997, "fullname": "Wen Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189997?format=json", "institution": "National University of Defense Technology"}], "abstract": "Adversarial patches have emerged as a popular privacy-preserving approach for resisting AI-driven surveillance systems. However, their conspicuous appearance makes them difficult to deploy in real-world scenarios. In this paper, we propose a thermally activated adversarial wearable designed to ensure adaptability and effectiveness in complex real-world environments. The system integrates thermochromic dyes with flexible heating units to induce visually dynamic adversarial patterns on clothing surfaces. In its default state, the clothing appears as an ordinary black T-shirt. Upon heating via an embedded thermal unit, hidden adversarial patterns on the fabric are activated, allowing the wearer to effectively evade detection across both visible and infrared modalities. Physical experiments demonstrate that the adversarial wearable achieves rapid texture activation within 50 seconds and maintains an adversarial success rate above 80\\% across diverse real-world surveillance environments. This work demonstrates a new pathway toward physically grounded, user-controllable anti-AI systems, highlighting the growing importance of proactive adversarial techniques for privacy protection in the age of ubiquitous AI surveillance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38499", "url": null, "sourceid": 41934, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38585, "uid": "99d403e4b3d5513eb108bac456ad8b1b", "name": "FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation", "authors": [{"id": 190213, "fullname": "Xingyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190213?format=json", "institution": "Sichuan University"}, {"id": 70064, "fullname": "Tao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70064?format=json", "institution": "Sichuan University"}], "abstract": "Test-Time Adaptation (TTA) is essential for enabling deep learning models to handle real-world data distribution shifts. However, current approaches face significant limitations: backpropagation-based methods are not suitable for low-end deployment devices, due to their high computation and memory requirements, as well as their tendency to modify model weights during adaptation; while traditional backpropagation-free techniques exhibit constrained adaptation capabilities.In this work, we propose Forward-Only Zeroth-Order Optimization (FOZO), a novel and practical backpropagation-free paradigm for TTA. FOZO leverages a memory-efficient zeroth-order prompt optimization, which is led by objectives optimizing both intermediate feature statistics and prediction entropy.To ensure efficient and stable adaptation over the out-of-distribution data stream, we introduce a dynamically decaying perturbation scale during zeroth-order gradient estimation and theoretically prove its convergence under the TTA data stream assumption.  Extensive continual adaptation experiments on ImageNet-C, ImageNet-R, and ImageNet-Sketch demonstrate FOZO's superior performance, achieving 59.52\\% Top-1 accuracy on ImageNet-C (5K, level 5) and outperforming main gradient-based methods and SOTA forward-only FOA (58.13\\%).  Furthermore, FOZO exhibits strong generalization on quantized (INT8) models. These findings demonstrate that FOZO is a highly competitive solution for TTA deployment in resource-limited scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38585", "url": null, "sourceid": 41436, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39078, "uid": "4fad4b358c995e79c3b8417c5543cb67", "name": "TACO: Task-Aware Contrastive Learning for Joint LiDAR Localization and 3D Object Detection", "authors": [{"id": 128662, "fullname": "Leyuan Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/128662?format=json", "institution": "Xiamen University"}, {"id": 143033, "fullname": "huanjia zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/143033?format=json", "institution": null}, {"id": 191320, "fullname": "Dongyu Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191320?format=json", "institution": "Xiamen University"}, {"id": 77099, "fullname": "Hai Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/77099?format=json", "institution": "Xiamen University"}, {"id": 128691, "fullname": "Qiming Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/128691?format=json", "institution": "XMU"}, {"id": 191321, "fullname": "Kezheng Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/191321?format=json", "institution": "Xiamen University"}, {"id": 77098, "fullname": "Wen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/77098?format=json", "institution": "schoold of informatics xiamen university"}, {"id": 86653, "fullname": "Chenglu Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/86653?format=json", "institution": "Xiamen University"}, {"id": 86652, "fullname": "Cheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86652?format=json", "institution": "Xiamen University"}], "abstract": "Reliable navigation and decision-making of autonomous vehicles require both accurate localization and object detection. Traditionally, these two tasks are handled separately, leading to redundant computation and limited cross-task knowledge transfer. This paper proposes TACO, the first Task-Aware COntrastive learning framework, which performs joint LiDAR localization and 3D object detection within a single, unified network. TACO leverages contrastive learning to explicitly decouple and align static geographic features for localization and object-centric features for detection. This bidirectional mutual supervision not only enhances localization robustness in dynamic environments by filtering dynamic noise but also boosts detection accuracy via effective spatial context. Additionally, we propose OxfoLD, the first dataset that provides multi-traversal LiDAR localization ground truth with rich 3D object annotations, thereby supporting task validation across various times and weather conditions. Experimental results demonstrate that TACO achieves state-of-the-art localization accuracy while maintaining competitive detection performance. The code and dataset will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39078", "url": null, "sourceid": 35867, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37410, "uid": "21ea0396c61ba3b8f89aa0a75a1a36a7", "name": "OpenT2M: No-frill Motion Generation with Open-source, Large-scale, High-quality Data", "authors": [{"id": 187371, "fullname": "Bin Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187371?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 186908, "fullname": "Sipeng Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186908?format=json", "institution": "BeingBeyond"}, {"id": 180059, "fullname": "Hao Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/180059?format=json", "institution": "Peking University"}, {"id": 155077, "fullname": "Boyuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155077?format=json", "institution": "Renmin University of China"}, {"id": 87818, "fullname": "Jing Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87818?format=json", "institution": "Institute of automation, Chinese academy of science"}, {"id": 87087, "fullname": "Zongqing Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87087?format=json", "institution": "Peking University"}], "abstract": "Text-to-motion (T2M) generation aims to create realistic human movements from text descriptions, with promising applications in animation and robotics. Despite recent progress, current T2M models perform poorly on unseen text descriptions due to the small scale and limited diversity of existing motion datasets. To address this problem, we introduce OpenT2M, a million-level, high-quality, and open-source motion dataset containing over 2800 hours of human motion. Each sequence undergoes rigorous quality control through physical feasibility validation and multi-granularity filtering, with detailed second-wise text annotations. We also develop an automated pipeline for creating long-horizon sequences, enabling complex motion generation. Building upon OpenT2M, we introduce MonoFrill, a pretrained motion model that achieves compelling T2M results without complicated designs or technique tricks as ``frills''. Its core component is 2D-PRQ, a novel motion tokenizer that captures spatiotemporal dependencies by dividing the human body into biology parts. Experiments show that OpenT2M significantly improves generalization of existing T2M models, while 2D-PRQ achieves superior reconstruction and strong zero-shot performance. We expect OpenT2M and MonoFrill will advance the T2M field by addressing longstanding data quality and benchmarking challenges.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37410", "url": null, "sourceid": 37829, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40120, "uid": "8d6b31ed2ba3ecd5a1c7a80773f01ec2", "name": "Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving", "authors": [{"id": 180569, "fullname": "Minhao Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/180569?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 187226, "fullname": "Zichen Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187226?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 192435, "fullname": "Zhuangcheng Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192435?format=json", "institution": "Carnegie Mellon University"}, {"id": 186062, "fullname": "Xuyang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186062?format=json", "institution": "Sichuan University"}, {"id": 187177, "fullname": "Rui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187177?format=json", "institution": "University of Hong Kong"}, {"id": 193574, "fullname": "Hengrui Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193574?format=json", "institution": "SJTU &amp; Shanghai AI Lab"}, {"id": 193575, "fullname": "Jiabing Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193575?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 180358, "fullname": "JUNYUAN ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/180358?format=json", "institution": "University of Hong Kong"}, {"id": 107541, "fullname": "Weijia Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/107541?format=json", "institution": "Sun Yat-sen University"}, {"id": 73990, "fullname": "Conghui He", "url": "http://cvpr.thecvf.com/api/miniconf/users/73990?format=json", "institution": "Shanghai AI Lab"}, {"id": 87643, "fullname": "Linfeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87643?format=json", "institution": ", Tsinghua University"}], "abstract": "Vision-Language Models (VLMs) have emerged as a promising paradigm in autonomous driving (AD), offering a unified framework for perception, reasoning, and decision-making by jointly modeling visual inputs and natural language instructions. However, their real-world deployment is hindered by the significant computational overhead incurred when processing high-resolution, multi-view images\u2014a standard setup in AD systems that utilize six or even more synchronized cameras to perceive the environment comprehensively. This overhead stems from the large number of visual tokens generated during encoding, which significantly increases inference latency and memory consumption when passed to large language models, owing to the quadratic complexity of self-attention. To address these challenges, we propose Prune2Drive, a plug-and-play visual token pruning framework specifically designed for multi-view VLMs in autonomous driving. Prune2Drive introduces two core innovations: (i) a diversity-aware token selection mechanism inspired by farthest point sampling, which prioritizes semantic and spatial coverage across views rather than relying solely on attention scores; and (ii) a view-adaptive pruning controller that automatically learns optimal pruning ratios for each camera view based on their importance to downstream driving tasks. Unlike prior methods, Prune2Drive does not require model retraining or access to attention maps, making it compatible with modern efficient attention implementations. Extensive experiments on two large-scale multi-view driving benchmarks, DriveLM and DriveLMM-o1, demonstrate that Prune2Drive achieves significant speedups and memory savings while maintaining\u2014and in some cases improving\u2014task performance. Our results establish Prune2Drive as a practical and generalizable solution for efficient vision-language reasoning in autonomous driving. When retaining only 10% of the visual tokens, our method achieves a 6.40$\\times$ speedup in the prefilling phase and consumes 13.4% of the original FLOPs, with only a 3% average performance drop compared to the original model on the DriveLM benchmark.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40120", "url": null, "sourceid": 31099, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36919, "uid": "3292a73c68dfe2d1244b14cdfb7fc26a", "name": "Which Concepts to Forget and How to Refuse? Decomposing Concepts for Continual Unlearning in Large Vision-Language Models", "authors": [{"id": 186223, "fullname": "Hyundong Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186223?format=json", "institution": "Chung-Ang University"}, {"id": 90646, "fullname": "Dongyoon Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/90646?format=json", "institution": "NAVER Corp, CLOVA AI."}, {"id": 156661, "fullname": "Eunwoo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/156661?format=json", "institution": "ChungAng University"}], "abstract": "Continual unlearning poses the challenge of enabling large vision-language models to selectively refuse specific image-instruction pairs in response to sequential deletion requests, while preserving general utility. However, sequential unlearning updates distort shared representations, creating spurious associations between vision-language pairs and refusal behaviors that hinder precise identification of refusal targets, resulting in inappropriate refusals. To address this challenge, we propose a novel continual unlearning framework that grounds refusal behavior in fine-grained descriptions of visual and textual concepts decomposed from deletion targets. We first identify which visual-linguistic concept combinations characterize each forget category through a concept modulator, then determine how to generate appropriate refusal responses via a mixture of refusal experts, termed refusers, each specialized for concept-aligned refusal generation. To generate concept-specific refusal responses across sequential tasks, we introduce a multimodal, concept-driven routing scheme that reuses refusers for tasks sharing similar concepts and adapts underutilized ones for novel concepts. Extensive experiments on vision-language benchmarks demonstrate that the proposed framework outperforms existing methods by generating concept-grounded refusal responses and preserving the general utility across unlearning sequences.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36919", "url": null, "sourceid": 38649, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39016, "uid": "403e7163b5aef0323eee42fe413bccc5", "name": "Variation-aware Vision Token Dropping for Faster Large Vision-Language Models", "authors": [{"id": 178228, "fullname": "Chen junjie", "url": "http://cvpr.thecvf.com/api/miniconf/users/178228?format=json", "institution": "College of Electronics and Information Engineering, Sichuan University"}, {"id": 186062, "fullname": "Xuyang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186062?format=json", "institution": "Sichuan University"}, {"id": 187226, "fullname": "Zichen Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187226?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 186061, "fullname": "Yiyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186061?format=json", "institution": "Shanghai University of Science and Technology"}, {"id": 107448, "fullname": "Siteng Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107448?format=json", "institution": "Zhejiang University &amp; Westlake University"}, {"id": 191186, "fullname": "Honggang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191186?format=json", "institution": "Sichuan University"}], "abstract": "Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the increasing demand for high-resolution image and long-video understanding results in substantial token counts, leading to reduced inference efficiency. Token compression offers a direct solution by reducing the number of tokens to be processed, thereby improving computational efficiency. Through extensive analysis, we identify two critical limitations in existing inner-LLM token compression methods: positional bias and incompatibility with efficient operators, which hinder their practical deployment for LVLM acceleration. This paper presents the first approach from a token variation perspective, revealing that visual token variations within LLMs exhibit task-agnostic properties. We propose Variation-aware Vision Token Dropping (\\textit{i.e.}, \\textbf{V$^2$Drop}), which progressively removes visual tokens with minimal variation during LVLM inference, thereby enhancing computational efficiency. Extensive experiments across multiple models and benchmarks demonstrate that our V$^2$Drop is able to maintain \\textbf{94.0\\%} and \\textbf{98.6\\%} of the original performance for image and video understanding tasks respectively, while reducing LLM generation latency by \\textbf{31.5\\%} and \\textbf{74.2\\%}. When combined with efficient operators, V$^2$Drop further reduces GPU peak memory usage. \\textit{Code is available in supplementary materials.}", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39016", "url": null, "sourceid": 41129, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40186, "uid": "2acada6d8bf5d4624dcb1374d25dcfb0", "name": "OmniDocLayout: Towards Diverse Document Layout Generation via Coarse-to-Fine LLM Learning", "authors": [{"id": 193574, "fullname": "Hengrui Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193574?format=json", "institution": "SJTU &amp; Shanghai AI Lab"}, {"id": 192435, "fullname": "Zhuangcheng Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192435?format=json", "institution": "Carnegie Mellon University"}, {"id": 157145, "fullname": "Zhiyuan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/157145?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 187226, "fullname": "Zichen Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187226?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 155031, "fullname": "Bin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155031?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 107541, "fullname": "Weijia Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/107541?format=json", "institution": "Sun Yat-sen University"}, {"id": 73990, "fullname": "Conghui He", "url": "http://cvpr.thecvf.com/api/miniconf/users/73990?format=json", "institution": "Shanghai AI Lab"}], "abstract": "Document AI has advanced rapidly and is attracting increasing attention. Yet, while most efforts have focused on document layout analysis (DLA), its generative counterpart, layout generation, remains underexplored. Distinct from traditional graphic layout design and room layout planning, document layout generation typically involves a larger number of elements per page and exhibits greater structural diversity and complexity. Currently, a major obstacle lies in the scarcity of diverse document layouts: academic papers with Manhattan-style structures dominate existing studies, while open-world genres such as newspapers and magazines remain severely underrepresented. To address this gap, we curate OmniDocLayout-1M, the first million-scale dataset of diverse document layouts, covering six common document types and comprising contemporary layouts collected from multiple sources. Moreover, since existing methods struggle in complex domains and often fail to arrange long sequences coherently, we introduce OmniDocLayout-LLM, a 0.5B model with designed two-stage Coarse-to-Fine learning paradigm: 1) learning universal layout principles from our dataset with coarse category definitions, and 2) transferring the knowledge to a specific domain with few fine-grained annotated samples. Extensive experiments demonstrate that our approach achieves strong performance on multiple domains in M$^6$Doc dataset, substantially surpassing both existing layout generation experts and several latest general-purpose LLMs. Our code, dataset, and models will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40186", "url": null, "sourceid": 44541, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37351, "uid": "5dd808ef0d37652165522bd8b73b795e", "name": "TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition", "authors": [{"id": 180358, "fullname": "JUNYUAN ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/180358?format=json", "institution": "University of Hong Kong"}, {"id": 155031, "fullname": "Bin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155031?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 172570, "fullname": "Qintong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/172570?format=json", "institution": "Peking University"}, {"id": 155032, "fullname": "Fan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155032?format=json", "institution": "Shanghai AI Lab"}, {"id": 187226, "fullname": "Zichen Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187226?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 187227, "fullname": "Jialin Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187227?format=json", "institution": "University of Hong Kong"}, {"id": 145165, "fullname": "Junjie Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/145165?format=json", "institution": "University of Hong Kong"}, {"id": 187228, "fullname": "Ziqi Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187228?format=json", "institution": "University of Hong Kong"}, {"id": 151465, "fullname": "Shuya Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151465?format=json", "institution": "University of Hong Kong"}, {"id": 187229, "fullname": "Ziling Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187229?format=json", "institution": "University of Hong Kong"}, {"id": 146220, "fullname": "Ziyang Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/146220?format=json", "institution": "Beihang University"}, {"id": 129840, "fullname": "Huaping Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/129840?format=json", "institution": "SenseTime"}, {"id": 131138, "fullname": "Yuhang Zang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131138?format=json", "institution": "Nanyang Technological University"}, {"id": 90594, "fullname": "Xiaoyi Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/90594?format=json", "institution": "Microsoft"}, {"id": 187230, "fullname": "Ka-Ho Chow", "url": "http://cvpr.thecvf.com/api/miniconf/users/187230?format=json", "institution": "The University of Hong Kong"}, {"id": 73990, "fullname": "Conghui He", "url": "http://cvpr.thecvf.com/api/miniconf/users/73990?format=json", "institution": "Shanghai AI Lab"}], "abstract": "Table recognition (TR) aims to transform table images into semi-structured representations such as HTML or Markdown.As a core component of document parsing, TR has long relied on supervised learning, with recent efforts dominated by fine-tuning vision-language models (VLMs) using labeled data.While VLMs have brought TR to the next level, pushing performance further demands large-scale labeled data that is costly to obtain.Consequently, although proprietary models have continuously pushed the performance boundary, open-source models, often trained with limited resources and, in practice, the only viable option for many due to privacy regulations, still lag far behind.To bridge this gap, we introduce TRivia, a self-supervised fine-tuning method that enables pretrained VLMs to learn TR directly from unlabeled table images in the wild. Built upon Group Relative Policy Optimization, TRivia automatically identifies unlabeled samples that most effectively facilitate learning and eliminates the need for human annotations through a question-answering-based reward mechanism. An attention-guided module generates diverse questions for each table image, and the ability to interpret the recognition results and answer them correctly provides feedback to optimize the TR model.This closed-loop process allows the TR model to autonomously learn to recognize, structure, and reason over tables without labeled data. Leveraging this pipeline, we present TRivia-3B, an open-sourced, compact, and state-of-the-art TR model that surpasses existing systems (e.g., Gemini 2.5 Pro, MinerU2.5) on three popular benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37351", "url": null, "sourceid": 46073, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38383, "uid": "15af797d5623e076064d023c7f68faf8", "name": "DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference", "authors": [{"id": 189759, "fullname": "Aditya Kumar Singh", "url": "http://cvpr.thecvf.com/api/miniconf/users/189759?format=json", "institution": "Advanced Micro Devices"}, {"id": 189760, "fullname": "Hitesh Kandala", "url": "http://cvpr.thecvf.com/api/miniconf/users/189760?format=json", "institution": "Advanced Micro Devices"}, {"id": 189761, "fullname": "Pratik Brahma", "url": "http://cvpr.thecvf.com/api/miniconf/users/189761?format=json", "institution": "Advanced Micro Devices"}, {"id": 126358, "fullname": "Zicheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126358?format=json", "institution": "Microsoft"}, {"id": 149601, "fullname": "Emad Barsoum", "url": "http://cvpr.thecvf.com/api/miniconf/users/149601?format=json", "institution": "AMD"}], "abstract": "Vision-language models (VLMs) have achieved remarkable multimodal understanding and reasoning capabilities, yet remain computationally expensive due to dense visual tokenization. Existing efficiency approaches either merge redundant visual tokens or drop them progressively in language backbone,often trading accuracy for speed. In this work, we propose *DUET-VLM*, a versatile plug-and-play dual compression framework that consists of (a) vision-only redundancy aware compression of vision encoder's output into information-preserving tokens,followed by(b) layer-wise, salient text-guided dropping of visual tokens within the language backbone to progressively prune less informative tokens.This coordinated token management enables aggressive compression while retaining critical semantics. On *LLaVA-1.5-7B*, our approach maintains over 99\\% of baseline accuracy with 67\\% fewer tokens $\\downarrow$, and still retains $>$97\\% even at 89\\% $\\downarrow$ reduction. With this dual-stage compression during training, it achieves 99.7\\% accuracy at 67\\% $\\downarrow$ and 97.6\\% at 89\\% $\\downarrow$, surpassing prior SoTA visual token reduction methods across multiple benchmarks.When integrated into *Video-LLaVA-7B*, it even surpasses the baseline\u2013achieving $>$*100\\%* $\\uparrow$ accuracy with a substantial 53.1\\% $\\downarrow$ token reduction, and retaining *97.6\\%* accuracy under an extreme *93.4\\%* $\\downarrow$ setting.These results highlight end-to-end training with DUET-VLM, enabling robust adaptation to reduced visual (image/video) input without sacrificing accuracy, producing compact yet semantically rich representations within the same computational budget.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38383", "url": null, "sourceid": 32619, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40296, "uid": "f71d3c3105a6e1374c8abd57e8eeaa15", "name": "VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation", "authors": [{"id": 126697, "fullname": "Yulu Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/126697?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 187267, "fullname": "Bohao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187267?format=json", "institution": "Beihang University"}, {"id": 70931, "fullname": "Zongheng Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70931?format=json", "institution": "Beihang University"}, {"id": 187268, "fullname": "Jitong Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187268?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 187269, "fullname": "wenjun wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187269?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 75839, "fullname": "Si Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75839?format=json", "institution": "Beihang University"}], "abstract": "Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level attention remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT's powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in three stages: mask prompt fusion, coarse point-guided prediction, and iterative mask refinement, effectively translating high-level feature alignment into a precise segmentation mask. Furthermore, we propose a single-image self-supervised training strategy that eliminates the need for paired annotations and enables strong zero-shot generalization. On the challenging Ego\u2013Exo4D benchmark, VGGT-S sets a new state-of-the-art, achieving 67.7% and 68.0% average IoU for Ego\u2192Exo and Exo\u2192Ego tasks, respectively, significantly outperforming prior methods. Notably, our zero-shot model surpasses most fully-supervised baselines, demonstrating the effectiveness and scalability of our approach.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40296", "url": null, "sourceid": -36508, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37369?format=json"], "related_events_ids": [37369]}, {"id": 37369, "uid": "f71d3c3105a6e1374c8abd57e8eeaa15", "name": "VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation", "authors": [{"id": 126697, "fullname": "Yulu Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/126697?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 187267, "fullname": "Bohao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187267?format=json", "institution": "Beihang University"}, {"id": 70931, "fullname": "Zongheng Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70931?format=json", "institution": "Beihang University"}, {"id": 187268, "fullname": "Jitong Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187268?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 187269, "fullname": "wenjun wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187269?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 75839, "fullname": "Si Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75839?format=json", "institution": "Beihang University"}], "abstract": "Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level attention remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT's powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in three stages: mask prompt fusion, coarse point-guided prediction, and iterative mask refinement, effectively translating high-level feature alignment into a precise segmentation mask. Furthermore, we propose a single-image self-supervised training strategy that eliminates the need for paired annotations and enables strong zero-shot generalization. On the challenging Ego\u2013Exo4D benchmark, VGGT-S sets a new state-of-the-art, achieving 67.7% and 68.0% average IoU for Ego\u2192Exo and Exo\u2192Ego tasks, respectively, significantly outperforming prior methods. Notably, our zero-shot model surpasses most fully-supervised baselines, demonstrating the effectiveness and scalability of our approach.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37369", "url": null, "sourceid": 36508, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40296?format=json"], "related_events_ids": [40296]}, {"id": 38080, "uid": "62f0face795f84de82297b4dac2b3359", "name": "Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior", "authors": [{"id": 189015, "fullname": "Haochen Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189015?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 181650, "fullname": "Kanyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181650?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 189016, "fullname": "Shuyu Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189016?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 85327, "fullname": "Qinghai Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/85327?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 189017, "fullname": "Peilin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189017?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 189018, "fullname": "Fei Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189018?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "In real-world robotic manipulation, states typically admit a neighborhood of near-equivalent actions. That is for each state, there exist a feasible action neighborhood (FAN) rather than a single correct action, within which motions yield indistinguishable progress. However, prevalent VLA training methodologies are directly inherited from linguistic settings and does not exploit the FAN property, thus lead to poor generalization and low sample efficiency. To address this limitation, we introduce a FAN-guided regularizer that shapes the model's  output distribution to align with the geometry of FAN. Concretely, we introduce a Gaussian prior that promotes locally smooth and unimodal predictions around the preferred direction and magnitude. In extensive experiments across both reinforced finetuning (RFT) and supervised finetuning (SFT), our method achieves significant improvement in sample efficiency, and success rate in both in-distribution and out-of-distribution (OOD) scenarios. By aligning with the intrinsic action tolerance of physical manipulation, FAN-guided regularization provides a principled and practical method for sample-efficient, and generalizable VLA adaptation. Code is provided in supplemental material.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38080", "url": null, "sourceid": 35590, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37860, "uid": "83de89957f167ec84dc58d7de6e69757", "name": "Benchmarking PhD-Level Coding in 3D Geometric Computer Vision", "authors": [{"id": 159448, "fullname": "Wenyi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/159448?format=json", "institution": "Institute of Software, Chinese Academy of Sciences"}, {"id": 188421, "fullname": "Renkai Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188421?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 188422, "fullname": "Yue Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188422?format=json", "institution": "University of Toronto"}, {"id": 153038, "fullname": "Huan-ang Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153038?format=json", "institution": "Tsinghua University"}, {"id": 153037, "fullname": "Mingju Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153037?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 186105, "fullname": "Li Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186105?format=json", "institution": "Peking University"}, {"id": 126789, "fullname": "Chaoyou Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126789?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 88978, "fullname": "Hao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88978?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "AI-assisted coding has rapidly reshaped software practice and research workflows, yet today\u2019s models still struggle to produce correct code for complex 3D geometric vision. If models could reliably write such code, the research of our community would change substantially. To measure progress toward that goal, we introduce GeoCodeBench, a PhD-level benchmark that evaluates coding for 3D vision. Each problem is a fill-in-the-function implementation task curated from representative papers at recent venues: we first let a tool propose candidate functions from official repositories, then perform careful human screening to select core 3D geometric components. For every target, we generate diverse, edge-case unit tests, enabling fully automatic, reproducible scoring. We evaluate eight representative open- and closed-source models to reflect the current ecosystem. The best model, GPT-5, attains only 28.0\\% pass rate, revealing a large gap between current capabilities and dependable 3D scientific coding. GeoCodeBench organizes tasks into a two-level hierarchy: General 3D capability (geometric transformations and mechanics/optics formulation) and Research capability (novel algorithm implementation and geometric logic routing). Scores are positively correlated across these axes, but research-oriented tasks are markedly harder. Context ablations further show that \u201cmore paper text\u201d is not always better: cutting off at the Method section statistically outperforms full-paper inputs, highlighting unresolved challenges in long-context scientific comprehension. Together, these findings position GeoCodeBench as a rigorous testbed for advancing from generic coding to trustworthy 3D geometric vision coding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37860", "url": null, "sourceid": 42383, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36577, "uid": "a2d9c0fd6720cd8a8684dd52ef32deb8", "name": "PAM: A Pose\u2013Appearance\u2013Motion Engine for Sim-to-Real HOI Video Generation", "authors": [{"id": 153037, "fullname": "Mingju Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153037?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 185389, "fullname": "Kaisen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185389?format=json", "institution": "Tsinghua University"}, {"id": 153038, "fullname": "Huan-ang Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153038?format=json", "institution": "Tsinghua University"}, {"id": 152936, "fullname": "Bohan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152936?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 185390, "fullname": "Ao Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/185390?format=json", "institution": null}, {"id": 159448, "fullname": "Wenyi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/159448?format=json", "institution": "Institute of Software, Chinese Academy of Sciences"}, {"id": 185391, "fullname": "Yangcheng Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185391?format=json", "institution": "Tsinghua University"}, {"id": 185392, "fullname": "Jinkun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185392?format=json", "institution": "Tsinghua University"}, {"id": 185393, "fullname": "Shaocong Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185393?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 185394, "fullname": "Yike Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185394?format=json", "institution": "Tsinghua University"}, {"id": 185395, "fullname": "Haohan Chi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185395?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 181218, "fullname": "Hao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/181218?format=json", "institution": "Hao Chen"}, {"id": 77263, "fullname": "Hao Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77263?format=json", "institution": "ETH Zurich and CMU"}, {"id": 85821, "fullname": "Yu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85821?format=json", "institution": "Nanjing University"}, {"id": 73517, "fullname": "Li Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/73517?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 88978, "fullname": "Hao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88978?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Hand\u2013object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research remains fragmented across three disjoint tracks: (1) pose-only synthesis that predicts MANO trajectories without producing pixels; (2) single-image HOI generation that hallucinates appearance from masks or 2D cues but lacks dynamics; and (3) video generation methods that require both the entire pose sequence and the ground-truth first frame as inputs, preventing true sim-to-real deployment. Inspired by the philosophy of previous work, we think that HOI generation requires a unified engine that brings together pose, appearance, and motion within one coherent framework. Thus we introduce PAM: a Pose\u2013Appearance\u2013Motion Engine for controllable HOI video generation. The performance of our engine is validated by: (1) On DexYCB, we obtain an FVD of 29.13 (vs. 38.83 for InterDyn), and MPJPE of 19.37 mm (vs. 30.05 mm for CosHand), while generating higher-resolution 480\u00d7720 videos compared to 256\u00d7256/256\u00d7384 baselines. (2) On OAKINK2, our full multi-condition model improves FVD from 68.76 \u2192 46.31. (3) An ablation over input conditions on DexYCB shows that combining depth, segmentation, and keypoints consistently yields the best results. (4) For a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) allows a model trained on only 50\\% of the real data plus our synthetic data to match the 100% real baseline.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36577", "url": null, "sourceid": 40626, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36694, "uid": "f1cdeb875d0954c6be872e56ee892e5d", "name": "Batman: Benign Knowledge Alignment Through Malicious Null Space in Federated Backdoor Attack", "authors": [{"id": 144760, "fullname": "Wenwen He", "url": "http://cvpr.thecvf.com/api/miniconf/users/144760?format=json", "institution": "Wuhan University"}, {"id": 184315, "fullname": "Wenke Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184315?format=json", "institution": "Nanyang Technological University"}, {"id": 144193, "fullname": "Yiyang Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/144193?format=json", "institution": "Wuhan University"}, {"id": 185659, "fullname": "Wenjie Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185659?format=json", "institution": "National University of Singapore"}, {"id": 185660, "fullname": "Jiaheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185660?format=json", "institution": "National University of Singapore"}, {"id": 76422, "fullname": "Mang Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/76422?format=json", "institution": "Wuhan University"}], "abstract": "Federated Learning (FL), a distributed learning paradigm that enables local training on user-held data across decentralized devices, is vulnerable to backdoor attacks due to limited visibility into client updates. Exploiting this opacity, adversaries induce targeted misbehavior on trigger inputs without affecting overall performance, thereby compromising the trust and integrity of collaborative training in federated learning systems. Existing federated backdoor attacks mainly concentrate on benign knowledge alignment on trigger-surface design or representation guidance to evade defense mechanisms. However, trigger-surface attacks suffer from insufficient alignment, leaving malicious knowledge distinguishable from benign updates. In contrast, representation-guided attacks attempt to obscure the boundary between benign and malicious behaviors. Nevertheless, excessive incorporation of benign knowledge within a shared parameter space leads to over-alignment, ultimately degrading attack effectiveness. To overcome shared parameter space dilemma in backdoor attack, we propose Batman, a novel backdoor attack that aligns benign knowledge within the malicious null space, which effectively decouples malicious space from shared parameter space and enables benign alignment in an orthogonal direction of this space that does not interfere with the attack effectiveness. To further enhance stealthiness, we combine both clean and global models to guide the alignment perturbation within this null space to evade detection. Experiments on four benchmark datasets demonstrate that Batman consistently achieves strong backdoor performance while remaining stealthy under various defenses.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36694", "url": null, "sourceid": 36646, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37333, "uid": "cd60cded04fede4abccff6e0dea36f6e", "name": "LightRR: A Lightweight Network for Single Image Reflection Removal", "authors": [{"id": 187185, "fullname": "Wenbin Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/187185?format=json", "institution": "East China Normal University"}, {"id": 153308, "fullname": "Junkang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153308?format=json", "institution": "East China Normal University"}, {"id": 187186, "fullname": "Sunzhe Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187186?format=json", "institution": "East China Normal University"}, {"id": 153307, "fullname": "Faming Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153307?format=json", "institution": "East China Normal University"}, {"id": 86004, "fullname": "Guixu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86004?format=json", "institution": "East China Normal University"}], "abstract": "Single-image reflection removal (SIRR) is a highly ill-posed and computationally demanding problem. Existing CNN or Transformer-based methods often rely on large receptive fields and heavy computation, limiting their deployment on resource-constrained devices. To address this, we propose LightRR, a lightweight yet effective reflection removal network that unifies a wavelet-based mechanism and State Space Modeling (SSM).Specifically, we introduce an Asymmetric Frequency Mamba Block (AFM), which leverages the Discrete Wavelet Transform (DWT) to decompose features into low- and high-frequency components. This allows for targeted modeling of frequency-specific dependencies via Mamba-based state space dynamics.This design not only captures long-range context efficiently but also reduces spatial resolution and computation while preserving critical details.Furthermore, a knowledge distillation-enhanced encoder allows the network to inherit the representational power of large pre-trained models during training, enabling lightweight inference.Extensive experiments on multiple real-world benchmarks demonstrate that LightRR achieves performance comparable to state-of-the-art methods, while using only 3.01\\% of the parameters and 5.22\\% of the FLOPs (vs. RDNet), highlighting its superior balance between accuracy and efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37333", "url": null, "sourceid": 42068, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37978, "uid": "5730737405ac64afe17e4cc691ba62a4", "name": "TimeRipples: Accelerating vDiTs by Understanding the Spatio-Temporal Correlations in Latent Space", "authors": [{"id": 188729, "fullname": "Wenxuan Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188729?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 188730, "fullname": "Yulin Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/188730?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 188731, "fullname": "Aiyue Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188731?format=json", "institution": ""}, {"id": 188732, "fullname": "Jing Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188732?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 177344, "fullname": "Yiwu Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/177344?format=json", "institution": "huawei"}, {"id": 188733, "fullname": "Yiming Gan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188733?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 188734, "fullname": "Jieru Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188734?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 188735, "fullname": "Jingwen Leng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188735?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 88218, "fullname": "Minyi Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/88218?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 183787, "fullname": "Yu Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/183787?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "The recent surge in video generation has shown the growing demand for high-quality video synthesis using large vision models. Existing video generation models are predominantly based on the video diffusion transformer (vDiT), however, they suffer from substantial inference delay due to self-attention. While prior studies have focused on reducing redundant computations in self-attention, they often overlook the inherent spatio-temporal correlations in video streams and directly leverage sparsity patterns from large language models to reduce attention computations.In this work, we take a principled approach to accelerate self-attention in vDiTs by leveraging the spatio-temporal correlations in the latent space. We show that the attention patterns within vDiT are primarily due to the dominant spatial and temporal correlations at the token channel level. Based on this insight, we propose a lightweight and adaptive reuse strategy that approximates attention computations by reusing partial attention scores of spatially or temporally correlated tokens along individual channels. We demonstrate that our method achieves significantly higher computational savings (85\\%) compared to state-of-the-art techniques over 4 vDiTs, while preserving almost identical video quality ($<$0.06\\% loss on VBench).", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37978", "url": null, "sourceid": 36877, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38900, "uid": "e10bc88f2a85dbcfd808b001f0bb8d69", "name": "InvAD: Inversion-based Reconstruction-Free Anomaly Detection with Diffusion Models", "authors": [{"id": 180203, "fullname": "Shunsuke Sakai", "url": "http://cvpr.thecvf.com/api/miniconf/users/180203?format=json", "institution": "University of Fukui, Japan"}, {"id": 182607, "fullname": "Xiangteng He", "url": "http://cvpr.thecvf.com/api/miniconf/users/182607?format=json", "institution": "University of British Columbia"}, {"id": 190936, "fullname": "Chunzhi Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190936?format=json", "institution": "University of Fukui"}, {"id": 69182, "fullname": "Leonid Sigal", "url": "http://cvpr.thecvf.com/api/miniconf/users/69182?format=json", "institution": "University Of British Columbia"}, {"id": 190937, "fullname": "Tatsuhito Hasegawa", "url": "http://cvpr.thecvf.com/api/miniconf/users/190937?format=json", "institution": "University of Fukui"}], "abstract": "Despite the remarkable success, recent reconstruction-based anomaly detection (AD) methods via diffusion modeling still involve fine-grained noise-strength tuning and computationally expensive multi-step denoising, leading to a fundamental tension between fidelity and efficiency. In this paper, we propose **InvAD**, a novel **Inv**ersion-based **A**nomaly **D**etection approach \u2014 \u201cdetection via noising in latent space\u201d \u2014 which circumvents explicit reconstruction. Importantly, we contend that the limitations in prior reconstruction-based methods originate from the prevailing \u201cdetection via denoising in RGB space\u201d paradigm. To address this, we model AD under a reconstruction-free formulation, which directly infers the final latent variable corresponding to the input image via DDIM inversion, and then measures the deviation based on the known prior distribution for anomaly scoring. Specifically, in approximating the original probability flow ODE using the Euler method, we enforce only a few inversion steps to noise the clean image to pursue inference efficiency. As the added noise is adaptively derived with the learned diffusion model, the original features for the clean testing image can still be leveraged to yield high detection accuracy. We perform extensive experiments and detailed analyses across four widely used industrial and medical AD benchmarks under the unsupervised unified setting to demonstrate the effectiveness of our model, achieving state-of-the-art AD performance and approximately 2\u00d7 inference-time speedup without diffusion distillation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38900", "url": null, "sourceid": 33286, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38128, "uid": "a552e1db6a1c7dbac243c72a8d3140bb", "name": "AgentDet: A Shared-Blackboard Multi-Agent Framework for Zero-/Few-Shot Object Detection", "authors": [{"id": 180452, "fullname": "Haolin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180452?format=json", "institution": "Tsinghua University"}, {"id": 76507, "fullname": "Yaohua Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76507?format=json", "institution": "Alibaba Group"}, {"id": 189119, "fullname": "Ze Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189119?format=json", "institution": "Tsinghua University"}, {"id": 156862, "fullname": "Lijie Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/156862?format=json", "institution": "Tsinghua University"}, {"id": 189120, "fullname": "Biqing Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189120?format=json", "institution": "Tsinghua University"}], "abstract": "Large multimodal language models have made rapid progress on vision\u2013language tasks, yet their potential for zero-/few-shot object detection (ZSOD/FSOD) under a closed set of target classes remains underexplored. ZSOD/FSOD is hampered by data scarcity and catastrophic forgetting. Although vision\u2013language models (VLMs) report strong numbers on several benchmarks, they typically rely on massive visual pretraining, which is misaligned with FSOD\u2019s goal of testing generalization to novel classes under limited supervision. We introduce AgentDet, a shared-blackboard multi-agent framework that unifies ZSOD and FSOD via pseudo-incremental learning. AgentDet decouples detection into four cooperating roles\u2014Agent-Scout, Agent-Pinner, Agent-Curator, and Agent-Judge\u2014which collaboratively maintain a Shared Blackboard and a Knowledge Base. For efficiency, we train only Agent-Judge\u2014updating its image encoder and LLM-based detection head\u2014yielding a lightweight recipe that encourages generalization to previously unseen categories. On PASCAL VOC and MS COCO ZSOD/FSOD protocols, AgentDet delivers strongly competitive performance with state-of-the-art results in several settings. Ablations confirm the contributions of blackboard collaboration, safe-write policies, and the pseudo-incremental schedule.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38128", "url": null, "sourceid": 40783, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37540, "uid": "d797c16377b1498bb3fe153d37fedc89", "name": "LAOF: Robust Latent Action Learning with Optical Flow Constraints", "authors": [{"id": 180038, "fullname": "Xizhou Bu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180038?format=json", "institution": "Fudan University"}, {"id": 187676, "fullname": "Jiexi Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187676?format=json", "institution": "Fudan University"}, {"id": 187677, "fullname": "Fulei Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/187677?format=json", "institution": "Fudan University"}, {"id": 187678, "fullname": "Ruichen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187678?format=json", "institution": "Fudan University"}, {"id": 187679, "fullname": "Zhiqiang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/187679?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 187680, "fullname": "Wei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187680?format=json", "institution": "Fudan University, China"}], "abstract": "Learning latent actions from large-scale videos is crucial for the pre-training of scalable embodied foundation models, yet existing methods often struggle with action-irrelevant distractors. Although incorporating action supervision can alleviate these distractions, its effectiveness is restricted by the scarcity of available action labels. Optical flow represents pixel-level motion between consecutive frames, naturally suppressing background elements and emphasizing moving objects. Motivated by this, we propose robust Latent Action learning with Optical Flow constraints (LAOF), a pseudo-supervised framework that leverages the agent\u2019s optical flow as an action-driven signal to learn latent action representations robust to distractors. Experimental results show that the latent representations learned by LAOF outperform existing methods on downstream imitation learning and reinforcement learning tasks. This superior performance arises from optical flow constraints, which substantially stabilize training and improve the quality of latent representations under extremely label-scarce conditions, while remaining effective as the proportion of action labels increases to 10%. Importantly, even without action supervision, LAOF matches or surpasses action-supervised methods trained with 1% of action labels.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37540", "url": null, "sourceid": 33511, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38602, "uid": "518a0d4fdd28c9875618b3d7833831e2", "name": "DynamicVGGT: Learning Dynamic Point Maps for 4D Scene Reconstruction in Autonomous Driving", "authors": [{"id": 142795, "fullname": "Zhuolin He", "url": "http://cvpr.thecvf.com/api/miniconf/users/142795?format=json", "institution": "Fudan University"}, {"id": 190268, "fullname": "Jing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190268?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 183004, "fullname": "Guanghao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183004?format=json", "institution": "Fudan University"}, {"id": 189925, "fullname": "Xiaolei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189925?format=json", "institution": "Fudan University"}, {"id": 190014, "fullname": "Jiacheng Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190014?format=json", "institution": "Fudan University"}, {"id": 177237, "fullname": "Siyang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177237?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 190269, "fullname": "ZhouNanJin ZhouNanJin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190269?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 184855, "fullname": "Feipeng Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/184855?format=json", "institution": "Shenzhen Yinwang Intelligent Technology Co., Ltd."}, {"id": 189927, "fullname": "Bin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189927?format=json", "institution": "Fudan University"}, {"id": 76452, "fullname": "Jian Pu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76452?format=json", "institution": "Fudan University"}, {"id": 184858, "fullname": "Jia Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/184858?format=json", "institution": null}, {"id": 89233, "fullname": "Xiangyang Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/89233?format=json", "institution": "Fudan University"}], "abstract": "Dynamic scene reconstruction in autonomous driving remains a fundamental challenge due to significant temporal variations, moving objects, and complex scene dynamics. Existing feed-forward 3D models have demonstrated strong performance in static reconstruction but still struggle to capture dynamic motion. To address these limitations, we propose DynamicVGGT, a unified feed-forward framework that extends VGGT from static 3D perception to dynamic 4D reconstruction. Our goal is to model point motion within feed-forward 3D models in a dynamic and temporally coherent manner. To this end, we jointly predict the current and future point maps within a shared reference coordinate system, allowing the model to implicitly learn dynamic point representations through temporal correspondence. To efficiently capture temporal dependencies, we introduce a Motion-aware Temporal Attention (MTA) module that learns motion continuity. Furthermore, we design a Dynamic 3D Gaussian Splatting Head (DGSH) that explicitly models point motion by predicting Gaussian velocities using learnable motion tokens under scene flow supervision. It refines dynamic geometry through continuous 3D Gaussian optimization. Extensive experiments on autonomous driving datasets demonstrate that DynamicVGGT significantly outperforms existing methods in reconstruction accuracy and temporal consistency, achieving robust feed-forward 4D dynamic scene reconstruction under complex driving scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38602", "url": null, "sourceid": 30864, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37880, "uid": "15992174039ff729f588d6c82cf022c1", "name": "Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding", "authors": [{"id": 188482, "fullname": "Yerim Jeon", "url": "http://cvpr.thecvf.com/api/miniconf/users/188482?format=json", "institution": "Sung Kyun Kwan University"}, {"id": 142503, "fullname": "Miso Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/142503?format=json", "institution": "Sungkyunkwan University"}, {"id": 87385, "fullname": "WonJun Moon", "url": "http://cvpr.thecvf.com/api/miniconf/users/87385?format=json", "institution": "Sungkyunkwan University"}, {"id": 87383, "fullname": "Jae-Pil Heo", "url": "http://cvpr.thecvf.com/api/miniconf/users/87383?format=json", "institution": "Sungkyunkwan University"}], "abstract": "Recent advances in 3D scene-language understanding have leveraged Large Language Models (LLMs) for 3D reasoning by transferring their general reasoning ability to 3D multi-modal contexts. However, existing methods typically adopt standard decoders from language modeling, which rely on a causal attention mask. This design introduces two fundamental conflicts in 3D scene understanding: sequential bias among order-agnostic 3D objects and restricted object-instruction attention, hindering task-specific reasoning. To overcome these limitations, we propose 3D Spatial Language Instruction Mask (3D-SLIM), an effective masking strategy that replaces the causal mask with an adaptive attention mask tailored to the spatial structure of 3D scenes. Our 3D-SLIM introduces two key components: a Geometry-adaptive Mask that constrains attention based on spatial density rather than token order, and an Instruction-aware Mask that enables object tokens to directly access instruction context. This design allows the model to process objects based on their spatial relationships while being guided by the user's task. 3D-SLIM is simple, requires no architectural modifications, and adds no extra parameters, yet it yields substantial performance improvements across diverse 3D scene-language tasks. Extensive experiments across multiple benchmarks and LLM baselines validate its effectiveness and underscore the critical role of decoder design in 3D multi-modal reasoning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37880", "url": null, "sourceid": 41578, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38973, "uid": "c695c406dd17d2fc9dbfe917adaf9e33", "name": "Towards Photorealistic and Efficient Bokeh Rendering via Diffusion Framework", "authors": [{"id": 181077, "fullname": "Linxiao Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/181077?format=json", "institution": "Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences"}, {"id": 191097, "fullname": "Siming Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191097?format=json", "institution": "Vivo"}, {"id": 191098, "fullname": "Zerong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191098?format=json", "institution": "vivo Mobile Communication Co., Ltd."}, {"id": 126669, "fullname": "Hao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126669?format=json", "institution": "vivo Mobile Communication \uff08Hangzhou\uff09Co., Ltd"}, {"id": 126675, "fullname": "Jinwei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126675?format=json", "institution": "vivo Mobile Communication Co., Ltd."}, {"id": 87210, "fullname": "Bo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/87210?format=json", "institution": "vivo Mobile Communication Co.,Ltd."}, {"id": 153840, "fullname": "Shifeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/153840?format=json", "institution": "Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences"}, {"id": 126680, "fullname": "Peng-Tao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126680?format=json", "institution": "vivo Mobile Communication (Hangzhou) Co., Ltd."}], "abstract": "Existing mobile devices are constrained by compact optical designs, such as small apertures, which make it difficult to produce natural, optically realistic bokeh effects. Although recent learning-based methods have shown promising results, they still struggle with photos captured under high digital zoom levels, which often suffer from reduced resolution and loss of fine details. A naive solution is to enhance image quality before applying bokeh rendering, yet this two-stage pipeline reduces efficiency and introduces unnecessary error accumulation. To overcome these limitations, we propose MagicBokeh, a unified diffusion-based framework designed for high-quality and efficient bokeh rendering. Through an alternative training strategy and a focus-aware masked attention mechanism, our method jointly optimizes bokeh rendering and super-resolution, substantially improving both controllability and visual fidelity. Furthermore, we introduce degradation-aware depth module to enable more accurate depth estimation from low-quality inputs. Experimental results demonstrate that MagicBokeh efficiently produces photorealistic bokeh effects, particularly on real-world low-resolution images, paving the way for future advancements in bokeh rendering. The code will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38973", "url": null, "sourceid": 31546, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40356?format=json"], "related_events_ids": [40356]}, {"id": 40356, "uid": "c695c406dd17d2fc9dbfe917adaf9e33", "name": "Towards Photorealistic and Efficient Bokeh Rendering via Diffusion Framework", "authors": [{"id": 181077, "fullname": "Linxiao Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/181077?format=json", "institution": "Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences"}, {"id": 191097, "fullname": "Siming Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191097?format=json", "institution": "Vivo"}, {"id": 191098, "fullname": "Zerong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191098?format=json", "institution": "vivo Mobile Communication Co., Ltd."}, {"id": 126669, "fullname": "Hao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126669?format=json", "institution": "vivo Mobile Communication \uff08Hangzhou\uff09Co., Ltd"}, {"id": 126675, "fullname": "Jinwei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126675?format=json", "institution": "vivo Mobile Communication Co., Ltd."}, {"id": 87210, "fullname": "Bo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/87210?format=json", "institution": "vivo Mobile Communication Co.,Ltd."}, {"id": 153840, "fullname": "Shifeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/153840?format=json", "institution": "Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences"}, {"id": 126680, "fullname": "Peng-Tao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126680?format=json", "institution": "vivo Mobile Communication (Hangzhou) Co., Ltd."}], "abstract": "Existing mobile devices are constrained by compact optical designs, such as small apertures, which make it difficult to produce natural, optically realistic bokeh effects. Although recent learning-based methods have shown promising results, they still struggle with photos captured under high digital zoom levels, which often suffer from reduced resolution and loss of fine details. A naive solution is to enhance image quality before applying bokeh rendering, yet this two-stage pipeline reduces efficiency and introduces unnecessary error accumulation. To overcome these limitations, we propose MagicBokeh, a unified diffusion-based framework designed for high-quality and efficient bokeh rendering. Through an alternative training strategy and a focus-aware masked attention mechanism, our method jointly optimizes bokeh rendering and super-resolution, substantially improving both controllability and visual fidelity. Furthermore, we introduce degradation-aware depth module to enable more accurate depth estimation from low-quality inputs. Experimental results demonstrate that MagicBokeh efficiently produces photorealistic bokeh effects, particularly on real-world low-resolution images, paving the way for future advancements in bokeh rendering. The code will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40356", "url": null, "sourceid": -31546, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38973?format=json"], "related_events_ids": [38973]}, {"id": 38731, "uid": "a9d25b21e4dc63331981e0ad1d91b6ad", "name": "$x^2$-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space", "authors": [{"id": 181337, "fullname": "Ruishan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/181337?format=json", "institution": "Tsinghua University"}, {"id": 190538, "fullname": "Ciyu Ruan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190538?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 190539, "fullname": "Haoyang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190539?format=json", "institution": "Tsinghua University, China"}, {"id": 190540, "fullname": "Zihang GONG", "url": "http://cvpr.thecvf.com/api/miniconf/users/190540?format=json", "institution": "Harbin Institute of Technology"}, {"id": 190541, "fullname": "Jingao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190541?format=json", "institution": "The University of Hong Kong"}, {"id": 149284, "fullname": "Xinlei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/149284?format=json", "institution": "Tsinghua University"}], "abstract": "Estimating dense 2D optical flow and 3D scene flow is essential for dynamic scene understanding. Recent work combines images, LiDAR, and event data to jointly predict 2D and 3D motion, yet most approaches operate in separate heterogeneous feature spaces. Without a shared latent space that all modalities can align to, these systems rely on multiple modality-specific blocks, leaving cross-sensor mismatches unresolved and making fusion unnecessarily complex. Event cameras naturally provide a spatiotemporal edge signal, which we can treat as an intrinsic edge field to anchor a unified latent representation, termed the \\textbf{Event Edge Space}. Building on this idea, we introduce \\textbf{$x^2$-Fusion}, which reframes multimodal fusion as representation unification: event-derived spatiotemporal edges define an edge-centric homogeneous space, and image and LiDAR features are explicitly aligned in this shared representation. Within this space, we perform reliability-aware adaptive fusion to estimate modality reliability and emphasize stable cues under degradation. We further employ cross-dimension contrast learning to tightly couple 2D optical flow with 3D scene flow. Extensive experiments on both synthetic and real benchmarks show that $x^2$-Fusion achieves state-of-the-art accuracy under standard conditions and delivers substantial improvements in challenging scenarios.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38731", "url": null, "sourceid": 33564, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37311, "uid": "aa22b2803b8e7d32e53ac9c29e14845e", "name": "ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering", "authors": [{"id": 182850, "fullname": "Alberto Compagnoni", "url": "http://cvpr.thecvf.com/api/miniconf/users/182850?format=json", "institution": "University of Modena and Reggio Emilia, University of Pisa"}, {"id": 187134, "fullname": "Marco Morini", "url": "http://cvpr.thecvf.com/api/miniconf/users/187134?format=json", "institution": "University of Modena and Reggio Emilia"}, {"id": 86098, "fullname": "Sara Sarto", "url": "http://cvpr.thecvf.com/api/miniconf/users/86098?format=json", "institution": "University of Modena and Reggio Emilia"}, {"id": 158323, "fullname": "Federico Cocchi", "url": "http://cvpr.thecvf.com/api/miniconf/users/158323?format=json", "institution": "University of Modena and Reggio Emilia"}, {"id": 153569, "fullname": "Davide Caffagni", "url": "http://cvpr.thecvf.com/api/miniconf/users/153569?format=json", "institution": "Universit\u00e0 degli Studi di Modena e Reggio Emilia"}, {"id": 86090, "fullname": "Marcella Cornia", "url": "http://cvpr.thecvf.com/api/miniconf/users/86090?format=json", "institution": "University of Modena and Reggio Emilia"}, {"id": 86088, "fullname": "Lorenzo Baraldi", "url": "http://cvpr.thecvf.com/api/miniconf/users/86088?format=json", "institution": "Universit\u00e0 degli Studi di Modena e Reggio Emilia"}, {"id": 85805, "fullname": "Rita Cucchiara", "url": "http://cvpr.thecvf.com/api/miniconf/users/85805?format=json", "institution": "Universit\u00e0 di Modena e Reggio Emilia"}], "abstract": "Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning. To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence. Source code and models will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37311", "url": null, "sourceid": 34175, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37292, "uid": "be7b0d3ad04f5cb997e3031e00861c73", "name": "Detecting AI-Generated Forgeries via Iterative Manifold Deviation Amplification", "authors": [{"id": 180771, "fullname": "Jiangling Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180771?format=json", "institution": "Wuhan University of Technology"}, {"id": 187092, "fullname": "Shuxuan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187092?format=json", "institution": "Wuhan University of Technology"}, {"id": 187093, "fullname": "Bofan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187093?format=json", "institution": "Wuhan University of Technology"}, {"id": 187094, "fullname": "Siqiang Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187094?format=json", "institution": "Wuhan University of Technology"}, {"id": 187095, "fullname": "Jirui Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187095?format=json", "institution": "Wuhan University of Technology"}, {"id": 127706, "fullname": "Yaxiong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/127706?format=json", "institution": "Wuhan University of Technology"}, {"id": 187096, "fullname": "Ziyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187096?format=json", "institution": "Wuhan University of Technology"}], "abstract": "The proliferation of highly realistic AI-generated images poses critical challenges for digital forensics, demanding precise pixel-level localization of manipulated regions. Existing methods predominantly learn discriminative patterns of specific forgeries, struggling with novel manipulations as editing techniques evolve. We propose the Iterative Forgery Amplifier Network (IFA-Net), which shifts from learning \"what is fake\" to modeling \"what is real\". Grounded in the principle that all manipulations deviate from the natural image manifold, IFA-Net leverages a frozen Masked Autoencoder (MAE) pretrained on real images as a universal realness prior. Our framework operates through a two-stage closed-loop process: an initial Dual-Stream Segmentation Network (DSSN) fuses the original image with MAE reconstruction residuals for coarse localization, then a Task-Adaptive Prior Injection (TAPI) module converts this coarse prediction into guiding prompts that steer the MAE decoder to amplify reconstruction failures in suspicious regions, enabling precise refinement. Extensive experiments on four diffusion-based inpainting benchmarks show that IFA-Net achieves an average improvement of 6.5% in IoU and 8.1% in F1-score over the second-best method, while demonstrating strong generalization to traditional manipulation types.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37292", "url": null, "sourceid": 44068, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39304, "uid": "209e8f3be52784221cd84aec455fdcad", "name": "HOPS: Hierarchical Open-vocabulary Part Segmentation with Attention-Aware Filtering and Affinity-Guided Enhancement", "authors": [{"id": 183336, "fullname": "Xinlong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183336?format=json", "institution": "Tianjin University, China"}, {"id": 154517, "fullname": "Di Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/154517?format=json", "institution": "Tianjin University"}, {"id": 161569, "fullname": "Shaoyiyi Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/161569?format=json", "institution": null}, {"id": 174384, "fullname": "Yaxuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/174384?format=json", "institution": "Shanghaijiaotong University"}, {"id": 191804, "fullname": "Jixian He", "url": "http://cvpr.thecvf.com/api/miniconf/users/191804?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 191805, "fullname": "Jiaxin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191805?format=json", "institution": "Southwest University"}, {"id": 186968, "fullname": "Ruonan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186968?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 85967, "fullname": "Qing Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/85967?format=json", "institution": "Institute of High Performance Computing, Singapore, A*STAR"}, {"id": 149357, "fullname": "Kairui Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/149357?format=json", "institution": "Tianjin University"}, {"id": 90857, "fullname": "Wei Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90857?format=json", "institution": "Tianjin University"}], "abstract": "Open-vocabulary part segmentation (OVPS) aims to segment objects into fine-grained parts while generalizing to unseen categories. Existing VLM-based methods face two challenges: (1) object over-segmentation, caused by overly broad semantic activations, and (2) part under-segmentation, resulting from weak fine-grained perception. To address these issues, we propose HOPS, a two-stage framework for hierarchical open-vocabulary part segmentation. HOPS introduces a bidirectional semantic\u2013structural attention fusion mechanism that integrates CLIP\u2019s semantic alignment with DINO\u2019s structural perception. In the object segmentation stage, the Attention-Aware Filtering Module (AFM) refines cross-modal similarity maps via semantic\u2013structural attention to suppress object over-segmentation. In the part segmentation stage, the Affinity-Guided Enhancement Module (AEM) iteratively propagates part responses to progressively expand activation regions, effectively mitigating part under-segmentation. Experiments on Pascal-Part-116, ADE20K-Part-234, and PartImageNet demonstrate that HOPS achieves state-of-the-art performance with superior generalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39304", "url": null, "sourceid": 32054, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37226, "uid": "b82cebdc54932fa31c6e0b83ebc34aa6", "name": "TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration", "authors": [{"id": 145330, "fullname": "Chunxiao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/145330?format=json", "institution": "Beijing Normal University"}, {"id": 156232, "fullname": "Lijun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/156232?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 127187, "fullname": "Jing Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/127187?format=json", "institution": "Shanghai AI Laboratory"}], "abstract": "The rapid advancement of Vision-Language Models (VLMs) has brought their safety vulnerabilities into sharp focus. However, existing red teaming methods are fundamentally constrained by an inherent linear exploration paradigm, confining them to optimizing within a predefined strategy set and preventing the discovery of novel, diverse exploits. To transcend this limitation, we introduce TreeTeaming, an automated red teaming framework that reframes strategy exploration from static testing to a dynamic, evolutionary discovery process. At its core lies a strategic Orchestrator, powered by a Large Language Model (LLM), which autonomously decides whether to evolve promising attack paths or explore diverse strategic branches, thereby dynamically constructing and expanding a strategy tree. A multimodal actuator is then tasked with executing these complex strategies. In the experiments across 12 prominent VLMs, TreeTeaming achieves state-of-the-art attack success rates on 11 models, outperforming existing methods and reaching up to 87.60\\% on GPT-4o. The framework also demonstrates superior strategic diversity over the union of previously public jailbreak strategies. Furthermore, the generated attacks exhibit an average toxicity reduction of 23.09\\%, showcasing their stealth and subtlety. Our work introduces a new paradigm for automated vulnerability discovery, underscoring the necessity of proactive exploration beyond static heuristics to secure frontier AI models.Warning: This paper contains examples of harmful texts and images, and reader discretion is recommended.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37226", "url": null, "sourceid": 37605, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36772, "uid": "4d7061dcae91d9c79ad39c8289bb4e3d", "name": "Beyond Caption-Based Queries in Video Moment Retrieval", "authors": [{"id": 185842, "fullname": "David Pujol-Perich", "url": "http://cvpr.thecvf.com/api/miniconf/users/185842?format=json", "institution": "University of Barcelona"}, {"id": 166503, "fullname": "Albert Clap\u00e9s", "url": "http://cvpr.thecvf.com/api/miniconf/users/166503?format=json", "institution": "Universitat de Barcelona"}, {"id": 73556, "fullname": "Dima Damen", "url": "http://cvpr.thecvf.com/api/miniconf/users/73556?format=json", "institution": "University of Bristol and Google DeepMind"}, {"id": 129850, "fullname": "Sergio Escalera", "url": "http://cvpr.thecvf.com/api/miniconf/users/129850?format=json", "institution": "Computer Vision Center"}, {"id": 75434, "fullname": "Michael Wray", "url": "http://cvpr.thecvf.com/api/miniconf/users/75434?format=json", "institution": "University of Bristol"}], "abstract": "Current Video Moment Retrieval (VMR) models are trained on videos paired with captions, which are written by annotators after watching the videos. These captions are used as textual queries---which we term caption-based queries. This annotation process induces a visual bias, leading to overly descriptive and fine-grained queries, which significantly differ from the more general search queries that users are likely to employ in practice. In this work, we investigate the degradation of existing VMR methods, particularly of DETR architectures, when trained on caption-based queries but evaluated on search queries. For this, we introduce three benchmarks by modifying the textual queries in three public VMR datasets---i.e., HD-EPIC, YouCook2 and ActivityNet-Captions. Our analysis reveals two key generalization challenges: (i) A language gap, arising from the linguistic under-specification of search-queries, and (ii) a multi-moment gap, caused by the shift from single moment to multi-moment queries. We also identify a critical issue in these architectures---an active decoder-query collapse---as a primary cause of the poor generalization to multi-moment instances. We mitigate this issue with architectural modifications that effectively increase the number of active decoder queries. Extensive experiments demonstrate that our approach improves performance on search queries by up to 14.82% $mAP_m$, and up to 21.83% $mAP_m$ on multi-moment search queries.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36772", "url": null, "sourceid": 42359, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38820, "uid": "ab968680ad9fa9909541f1225dcf0711", "name": "Rethinking Box Supervision: Bias-Free Weakly Supervised Medical Segmentation", "authors": [{"id": 190758, "fullname": "Jun Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/190758?format=json", "institution": "Shenzhen University"}, {"id": 85724, "fullname": "Hui Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85724?format=json", "institution": "Shenzhen University"}], "abstract": "Pixel-level annotations for medical image segmentation are costly and labor-intensive, often requiring expert knowledge. Bounding box labels provide a more scalable alternative but introduce strong box-shaped bias that hampers segmentation quality. We propose WeakMed, a general-purpose weakly supervised segmentation framework that removes the dependence on pixel-level masks while overcoming the structural limitations of box supervision. WeakMed introduces two lightweight, plug-and-play training components: (1) a Mask-to-Box (M2B) transformation that aligns predicted masks with box annotations to reduce label mismatch and box-induced bias, and (2) a Scale Consistency (SC) loss that enforces multi-scale self-supervision to address the ambiguity and instability of weak labels. Both modules are used only during training and impose no inference overhead. Across 9 segmentation tasks, 10 datasets, and 6 imaging modalities, WeakMed consistently surpasses existing weakly supervised methods and achieves performance competitive with fully supervised baselines. These results demonstrate its practicality as a low-cost yet high-quality solution for medical image segmentation. Codes will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38820", "url": null, "sourceid": 43121, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38189, "uid": "038e12b8ef41273a9ed83a46a130c0a3", "name": "LoPrune: Efficient Data Pruning for LoRA-based Fine-Tuning of Vision Transformers", "authors": [{"id": 185091, "fullname": "Qiang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/185091?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 189247, "fullname": "Yaozong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189247?format=json", "institution": "Swinburne University of Technology"}, {"id": 189248, "fullname": "KAIBIN WANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/189248?format=json", "institution": "Swinburne University of Technology"}, {"id": 145164, "fullname": "Ziteng Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/145164?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 185093, "fullname": "Feifei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185093?format=json", "institution": "Deakin University"}, {"id": 189249, "fullname": "Caslon Chua", "url": "http://cvpr.thecvf.com/api/miniconf/users/189249?format=json", "institution": "Swinburne University of Technology"}, {"id": 185094, "fullname": "Yun Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185094?format=json", "institution": "Swinburne University of Technology"}], "abstract": "Visual models are deployed on many Internet-of-Things (IoT) devices to power a variety of visual applications at the network edge. These models often need to be fine-tuned on-device continually to adapt to changing operating environments timely. However, the computing and energy overheads incurred are often overwhelming for resource-constrained IoT devices. Existing methods score sample importance based on the entire model via multi-epoch training, incurring overhead that may even exceed the training overhead reduction. To reduce these fine-tuning overheads, this paper presents LoPrune, a novel data pruning method that identifies and removes samples with negligible contributions to model adaptation. The key idea is to evaluate sample importance via a Trainable Subspace Alignment (TSA) Score to align the importance estimation with accurate update directions of the learnable adapter, i.e., Low-Rank Adaptation (LoRA). Specifically, LoPrune projects the influence function onto the LoRA subspace, enforcing consistency between the importance score and the model\u2019s updatable directions while substantially reducing the problem\u2019s dimensionality. It then leverages Kronecker-Factored Approximate Curvature to approximate the change of learnable adapter induced by a sample as its TSA score, retaining higher-scoring samples. Experiments with four representative visual models fine-tuned on three datasets demonstrate that compared with the best state-of-the-art data pruning baselines, LoPrune can reduce fine-tuning overhead by up to 72.9\\%, achieving a $3.69 \\times$ training speedup while improving fine-tuning accuracy by 3.50\\%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38189", "url": null, "sourceid": 31080, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36909, "uid": "e5424ca892fb503e2f87d4dcb2bb8570", "name": "Online3R: Online Learning for Consistent Sequential Reconstruction Based on Geometry Foundation Model", "authors": [{"id": 180181, "fullname": "Shunkai Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/180181?format=json", "institution": "Peking University"}, {"id": 128641, "fullname": "Zike Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/128641?format=json", "institution": "Tsinghua University"}, {"id": 140231, "fullname": "fei xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/140231?format=json", "institution": "University of Cambridge"}, {"id": 107274, "fullname": "Dong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107274?format=json", "institution": "Peking University"}, {"id": 186193, "fullname": "Yuchen Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186193?format=json", "institution": "Southwest University"}, {"id": 128617, "fullname": "Hongbin Zha", "url": "http://cvpr.thecvf.com/api/miniconf/users/128617?format=json", "institution": "Peking University"}], "abstract": "We present Online3R, a new sequential reconstruction framework that is capable of adapting to new scenes through online learning, effectively resolving inconsistency issues. Specifically, we introduce a set of learnable lightweight visual prompts into a pretrained, frozen geometry foundation model to capture the knowledge of new environments while preserving the fundamental capability of the foundation model for geometry prediction. To solve the problems of missing groundtruth and the requirement of high efficiency when updating these visual prompts at test time, we introduce a local-global self-supervised learning strategy by enforcing the local and global consistency constraints on predictions. The local consistency constraints are conducted on intermediate and previously local fused results, enabling the model to be trained with high-quality pseudo groundtruth signals; the global consistency constraints are operated on sparse keyframes spanning long distances rather than per frame, allowing the model to learn from a consistent prediction over a long trajectory in an efficient way. Our experiments demonstrate that Online3R outperforms previous state-of-the-art methods on various benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36909", "url": null, "sourceid": 36318, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36453, "uid": "f3c1261d2a929b173a2536fa606fa704", "name": "NuWa: Deriving Lightweight Class-Specific Vision Transformers for Edge Devices", "authors": [{"id": 145164, "fullname": "Ziteng Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/145164?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 185091, "fullname": "Qiang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/185091?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 185092, "fullname": "Bing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185092?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 185093, "fullname": "Feifei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185093?format=json", "institution": "Deakin University"}, {"id": 84759, "fullname": "Hai Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/84759?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 185094, "fullname": "Yun Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185094?format=json", "institution": "Swinburne University of Technology"}], "abstract": "Vision Transformers (ViTs) often need to be compressed for deployment on resource-constrained edge devices like drones and smart vehicles. However, existing model compression methods ignore that many edge devices only require the knowledge of specific classes for their applications. As a result, the derived all-class ViTs retain redundant knowledge and perform suboptimally on these classes. We discovered that simply replacing the calibration dataset with class-specific data does not suffice to address this issue, as these methods face two fundamental limitations. First, they overlook the existence of class-detrimental weights, which interfere with specialization, while removing them can improve class-specific performance. Second, the diversity of target classes and resource constraints on edge devices demand numerous customized models. Existing methods are time-consuming and computationally expensive, thus unscalable. In this work, we present NuWa, a cost-efficient method that addresses these challenges by deriving small ViTs from base ViTs for edge devices with specific class requirements. NuWa performs self-knowledge purification to prune class-detrimental weights and efficiently derives compact ViTs through closed-form optimization. Without post-pruning retraining, the derived edge ViTs surpass the base ViT in class-specific accuracy and accelerate inference. Comprehensive experiments demonstrate that NuWa outperforms state-of-the-art training-free pruning methods on class-specific tasks by up to 29.00\\% in accuracy. Compared with the best-performing training-dependent pruning method, NuWa achieves a 33.69\u00d7 pruning speedup and reduces pruning cost by up to 99.83\\%, with only a 0.61\\% average accuracy loss.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36453", "url": null, "sourceid": 38177, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40262?format=json"], "related_events_ids": [40262]}, {"id": 36476, "uid": "b7d944793cfd1c1f1b156d5d4ed191fe", "name": "Your Dissimilarities Define You:  Complementary Learning Exploiting Class Diversities", "authors": [{"id": 181740, "fullname": "Dimitrios Katsikas", "url": "http://cvpr.thecvf.com/api/miniconf/users/181740?format=json", "institution": "Aristotle University of Thessaloniki"}, {"id": 185143, "fullname": "Nikolaos Passalis", "url": "http://cvpr.thecvf.com/api/miniconf/users/185143?format=json", "institution": "Aristotle University of Thessaloniki"}, {"id": 185144, "fullname": "Anastasios Tefas", "url": "http://cvpr.thecvf.com/api/miniconf/users/185144?format=json", "institution": "Aristotle University of Thessaloniki"}], "abstract": "In this work, we exploit class dissimilarities in a rather novel way, which provides complementary learning information beyond correct classification, that is not fully utilized in existing learning paradigms. To model these dissimilarities, we introduce the concept of an opposite-class, which consists of everything that is not part of a corresponding class, i.e., all samples from non-target classes or samples from unknown classes. By setting appropriately encoded target distributions over the non-target classes, we explicitly optimize the model\u2019s activation distributions across all non-target classes, which enhances class dissimilarity information and enables better control over the geometry of the learned representations. We analyze the convergence dynamics of our proposed approach, both theoretically and empirically, showing that it naturally pushes the representations towards neural collapse, leading to more discriminative and robust features. Our extensive evaluation across multiple classification settings demonstrates consistent improvements of our method on closed-set, open-set, few-shot classification, and domain generalization benchmarks. Our code is available at: (withheld for review, demo in supplementary material).", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36476", "url": null, "sourceid": 38101, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40262, "uid": "f3c1261d2a929b173a2536fa606fa704", "name": "NuWa: Deriving Lightweight Class-Specific Vision Transformers for Edge Devices", "authors": [{"id": 145164, "fullname": "Ziteng Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/145164?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 185091, "fullname": "Qiang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/185091?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 185092, "fullname": "Bing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185092?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 185093, "fullname": "Feifei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185093?format=json", "institution": "Deakin University"}, {"id": 84759, "fullname": "Hai Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/84759?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 185094, "fullname": "Yun Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185094?format=json", "institution": "Swinburne University of Technology"}], "abstract": "Vision Transformers (ViTs) often need to be compressed for deployment on resource-constrained edge devices like drones and smart vehicles. However, existing model compression methods ignore that many edge devices only require the knowledge of specific classes for their applications. As a result, the derived all-class ViTs retain redundant knowledge and perform suboptimally on these classes. We discovered that simply replacing the calibration dataset with class-specific data does not suffice to address this issue, as these methods face two fundamental limitations. First, they overlook the existence of class-detrimental weights, which interfere with specialization, while removing them can improve class-specific performance. Second, the diversity of target classes and resource constraints on edge devices demand numerous customized models. Existing methods are time-consuming and computationally expensive, thus unscalable. In this work, we present NuWa, a cost-efficient method that addresses these challenges by deriving small ViTs from base ViTs for edge devices with specific class requirements. NuWa performs self-knowledge purification to prune class-detrimental weights and efficiently derives compact ViTs through closed-form optimization. Without post-pruning retraining, the derived edge ViTs surpass the base ViT in class-specific accuracy and accelerate inference. Comprehensive experiments demonstrate that NuWa outperforms state-of-the-art training-free pruning methods on class-specific tasks by up to 29.00\\% in accuracy. Compared with the best-performing training-dependent pruning method, NuWa achieves a 33.69\u00d7 pruning speedup and reduces pruning cost by up to 99.83\\%, with only a 0.61\\% average accuracy loss.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40262", "url": null, "sourceid": -38177, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/36453?format=json"], "related_events_ids": [36453]}, {"id": 36351, "uid": "805ab6f0598e9b790f80d1d1eaebbe9c", "name": "OpenMarcie: Dataset for Multimodal Action Recognition in Industrial Environments", "authors": [{"id": 184849, "fullname": "Hymalai Bello", "url": "http://cvpr.thecvf.com/api/miniconf/users/184849?format=json", "institution": "DFKI Kaiserslautern"}, {"id": 184850, "fullname": "Lala Shakti Swarup Ray", "url": "http://cvpr.thecvf.com/api/miniconf/users/184850?format=json", "institution": "DFKI"}, {"id": 181672, "fullname": "Joanna Sorysz", "url": "http://cvpr.thecvf.com/api/miniconf/users/181672?format=json", "institution": "German Research Center for Artificial Intelligence DFKI"}, {"id": 154070, "fullname": "Sungho Suh", "url": "http://cvpr.thecvf.com/api/miniconf/users/154070?format=json", "institution": "Korea University"}, {"id": 184851, "fullname": "Paul Lukowicz", "url": "http://cvpr.thecvf.com/api/miniconf/users/184851?format=json", "institution": "German Research Center for AI"}], "abstract": "Smart factories use advanced technologies to optimize production and increase efficiency. To this end, the recognition of worker activity allows for accurate quantification of performance metrics, improving efficiency holistically while contributing to worker safety. OpenMarcie is, to the best of our knowledge, the biggest multimodal dataset designed for human action monitoring in manufacturing environments. It includes data from wearables sensing modalities and cameras distributed in the surroundings. The dataset is structured around two experimental settings, involving a total of 36 participants. In the first setting, twelve participants perform a bicycle assembly and disassembly task under semi-realistic conditions without a fixed protocol, promoting divergent and goal-oriented problem-solving. The second experiment involves twenty-five volunteers (24 valid data) engaged in a 3D printer assembly task, with the 3D printer manufacturer's instructions provided to guide the volunteers in acquiring procedural knowledge. This setting also includes sequential collaborative assembly, where participants assess and correct each other's progress, reflecting real-world manufacturing dynamics. OpenMarcie includes over 37 hours of egocentric and exocentric, multimodal, and multipositional data, featuring eight distinct data types and more than 200 independent information channels. The dataset is benchmarked across three human activity recognition tasks: activity classification, open vocabulary captioning, and cross-modal alignment. The dataset and code are available at (Removed for Anonymous CVPR Submission).", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36351", "url": null, "sourceid": 41603, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39576, "uid": "cd9e8a4a222eb86428130d42fc684ca5", "name": "SPREAD: Spatial-Physical Reasoning via gEometry Aware Diffusion", "authors": [{"id": 192387, "fullname": "Minzhang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192387?format=json", "institution": "ShanghaiTech University"}, {"id": 190504, "fullname": "Kuixiang Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190504?format=json", "institution": "ShanghaiTech University"}, {"id": 178287, "fullname": "xuebing li", "url": "http://cvpr.thecvf.com/api/miniconf/users/178287?format=json", "institution": "\u4e0a\u6d77\u79d1\u6280\u5927\u5b66"}, {"id": 190505, "fullname": "Yuyang Jiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190505?format=json", "institution": "ShanghaiTech University"}, {"id": 181395, "fullname": "Yinuo Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/181395?format=json", "institution": "ShanghaiTech University"}, {"id": 192388, "fullname": "Hengan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192388?format=json", "institution": "ShanghaiTech University"}, {"id": 192389, "fullname": "Sixian Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192389?format=json", "institution": "ShanghaiTech University"}, {"id": 154716, "fullname": "Jiayuan Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154716?format=json", "institution": "ShanghaiTech University"}, {"id": 75945, "fullname": "Jingyi Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75945?format=json", "institution": "Shanghai Tech University"}], "abstract": "Automated 3D scene generation is pivotal for applications spanning virtual reality, digital content creation, and Embodied AI. While computer graphics prioritizes aesthetic layouts, vision and robotics demand scenes that mirror real-world complexity which current data-driven methods struggle to achieve due to limited unstructured training data and insufficient spatial and physical modeling. We propose SPREAD, a diffusion-based framework that jointly learns spatial and physical relationships through a graph transformer, explicitly conditioning on posed scene point clouds for geometric awareness. Moreover, our model integrates differentiable guidance for collision avoidance, relational constraint, and gravity, ensuring physically coherent scenes without sacrificing relational context. Our experiments on 3D-FRONT and ProcTHOR datasets demonstrate state-of-the-art performance in spatial-relational reasoning and physical metrics. Moreover, our method significantly outperforms baselines in scene consistency and stability during pre- and post-physics simulation, proving its capability to generate simulation-ready environments for embodied AI agents.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39576", "url": null, "sourceid": 40254, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38234, "uid": "6203f1dde486c7e691c5438115e54e0e", "name": "WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World", "authors": [{"id": 188169, "fullname": "Alan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188169?format=json", "institution": "National University of Singapore; Shenyang Institute of Automation, Chinese Academy of Sciences"}, {"id": 76351, "fullname": "Lingdong Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76351?format=json", "institution": "National University of Singapore"}, {"id": 155275, "fullname": "Tianyi Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/155275?format=json", "institution": "University of Macau"}, {"id": 152938, "fullname": "Hongsi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152938?format=json", "institution": "Eastern Institute of Technology"}, {"id": 189388, "fullname": "Yu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189388?format=json", "institution": "National University of Singapore"}, {"id": 70562, "fullname": "Ziqi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70562?format=json", "institution": "Nanyang Technological University"}, {"id": 88382, "fullname": "Wei Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/88382?format=json", "institution": " Shenzhen DJI Sciences and Technologies Ltd."}, {"id": 107238, "fullname": "Jialong Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/107238?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 151540, "fullname": "Yixuan Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/151540?format=json", "institution": null}, {"id": 89794, "fullname": "Dekai Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89794?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}, {"id": 152822, "fullname": "Dongyue Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152822?format=json", "institution": "National University of Singapore"}, {"id": 142971, "fullname": "Youquan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/142971?format=json", "institution": "Fudan University"}, {"id": 186331, "fullname": "Guangfeng Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186331?format=json", "institution": "University of Science and Technology of China"}, {"id": 188378, "fullname": "Linfeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188378?format=json", "institution": "ByteDance Inc."}, {"id": 139138, "fullname": "Xiangtai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/139138?format=json", "institution": "ByteDance Inc."}, {"id": 189389, "fullname": "Long Zhuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189389?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 188170, "fullname": "Lai Xing Ng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188170?format=json", "institution": "Institute for Infocomm Research (I2R), A*STAR"}, {"id": 95547, "fullname": "Benoit Cottereau", "url": "http://cvpr.thecvf.com/api/miniconf/users/95547?format=json", "institution": "CNRS"}, {"id": 91032, "fullname": "Changxin Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/91032?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 127120, "fullname": "Liang Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/127120?format=json", "institution": "Shanghai AI Lab"}, {"id": 188172, "fullname": "Wei Tsang Ooi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188172?format=json", "institution": "National University of Singapore"}, {"id": 89788, "fullname": "Ziwei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89788?format=json", "institution": "Nanyang Technological University"}], "abstract": "Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce **WorldLens**, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects - Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference - jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct **WorldLens-26K**, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop **WorldLens-Agent**, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity - standardizing how future models are judged not only by how real they look, but by how real they behave.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38234", "url": null, "sourceid": 34901, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40329?format=json"], "related_events_ids": [40329]}, {"id": 37289, "uid": "6109e720dfdee7143b87b597368ef17f", "name": "RecEdit-Drive: 3D Reconstruction-Guided Spatiotemporal Video Editing for Autonomous Driving Scenes", "authors": [{"id": 187088, "fullname": "Yipeng Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187088?format=json", "institution": "Tianjin University"}, {"id": 187089, "fullname": "Xin WANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/187089?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 187090, "fullname": "Chenghan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187090?format=json", "institution": "Tianjin University"}, {"id": 186969, "fullname": "Chong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186969?format=json", "institution": "University of Science and Technology of China"}, {"id": 186970, "fullname": "Dongdong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186970?format=json", "institution": "Tianjin University"}, {"id": 157436, "fullname": "Wanchao Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/157436?format=json", "institution": "Monash University"}, {"id": 87814, "fullname": "Hengshuang Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/87814?format=json", "institution": "The University of Hong Kong"}, {"id": 90857, "fullname": "Wei Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90857?format=json", "institution": "Tianjin University"}, {"id": 149357, "fullname": "Kairui Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/149357?format=json", "institution": "Tianjin University"}, {"id": 154517, "fullname": "Di Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/154517?format=json", "institution": "Tianjin University"}], "abstract": "High-quality video editing and processing are crucial in domains such as filmmaking and autonomous driving, where accurate visual refinement and data preparation are essential. However, it is challenging to achieve precise control over dynamic objects while maintaining spatiotemporal consistency. Current approaches typically utilize text prompts or 2D structural priors for video editing to ensure consistency, yet they struggle to effectively constrain the spatial variations of dynamic 3D objects. In this paper, we introduce $\\textbf{RecEdit-Drive}$, a framework that integrates $\\textbf{Spatial Feature Warping}$ and $\\textbf{Spatiotemporal Collaborative Modeling}$ to effectively control 3D object variations and enhance video consistency. The spatial feature warping enhances precise control over the edited foreground 3D objects, enhancing spatial consistency in the generated videos; and the spatiotemporal collaborative modeling seamlessly integrates edited foreground objects into the background, yielding realistic and consistent edited videos. Besides, we design an inference strategy to reconstruct an accurate background structure through noise manipulation, providing a reliable reference for foreground instance editing at early denoising stages. We perform extensive qualitative and quantitative comparisons regarding general video editing and downstream tasks on the public datasets, demonstrating the state-of-the-art performance of our proposed method.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37289", "url": null, "sourceid": 37032, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40329, "uid": "6203f1dde486c7e691c5438115e54e0e", "name": "WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World", "authors": [{"id": 188169, "fullname": "Alan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188169?format=json", "institution": "National University of Singapore; Shenyang Institute of Automation, Chinese Academy of Sciences"}, {"id": 76351, "fullname": "Lingdong Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76351?format=json", "institution": "National University of Singapore"}, {"id": 155275, "fullname": "Tianyi Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/155275?format=json", "institution": "University of Macau"}, {"id": 152938, "fullname": "Hongsi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152938?format=json", "institution": "Eastern Institute of Technology"}, {"id": 189388, "fullname": "Yu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189388?format=json", "institution": "National University of Singapore"}, {"id": 70562, "fullname": "Ziqi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70562?format=json", "institution": "Nanyang Technological University"}, {"id": 88382, "fullname": "Wei Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/88382?format=json", "institution": " Shenzhen DJI Sciences and Technologies Ltd."}, {"id": 107238, "fullname": "Jialong Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/107238?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 151540, "fullname": "Yixuan Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/151540?format=json", "institution": null}, {"id": 89794, "fullname": "Dekai Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89794?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}, {"id": 152822, "fullname": "Dongyue Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152822?format=json", "institution": "National University of Singapore"}, {"id": 142971, "fullname": "Youquan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/142971?format=json", "institution": "Fudan University"}, {"id": 186331, "fullname": "Guangfeng Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186331?format=json", "institution": "University of Science and Technology of China"}, {"id": 188378, "fullname": "Linfeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188378?format=json", "institution": "ByteDance Inc."}, {"id": 139138, "fullname": "Xiangtai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/139138?format=json", "institution": "ByteDance Inc."}, {"id": 189389, "fullname": "Long Zhuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189389?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 188170, "fullname": "Lai Xing Ng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188170?format=json", "institution": "Institute for Infocomm Research (I2R), A*STAR"}, {"id": 95547, "fullname": "Benoit Cottereau", "url": "http://cvpr.thecvf.com/api/miniconf/users/95547?format=json", "institution": "CNRS"}, {"id": 91032, "fullname": "Changxin Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/91032?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 127120, "fullname": "Liang Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/127120?format=json", "institution": "Shanghai AI Lab"}, {"id": 188172, "fullname": "Wei Tsang Ooi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188172?format=json", "institution": "National University of Singapore"}, {"id": 89788, "fullname": "Ziwei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89788?format=json", "institution": "Nanyang Technological University"}], "abstract": "Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce **WorldLens**, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects - Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference - jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct **WorldLens-26K**, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop **WorldLens-Agent**, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity - standardizing how future models are judged not only by how real they look, but by how real they behave.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40329", "url": null, "sourceid": -34901, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38234?format=json"], "related_events_ids": [38234]}, {"id": 39464, "uid": "2877b49ebb731389a1a583bda03540bd", "name": "BA-GS: Bayesian Adaptive Gaussian Splatting for SFM-Free 3D Reconstruction", "authors": [{"id": 192128, "fullname": "Zhongjie Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/192128?format=json", "institution": "Tianjin University"}, {"id": 154517, "fullname": "Di Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/154517?format=json", "institution": "Tianjin University"}, {"id": 187089, "fullname": "Xin WANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/187089?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 180431, "fullname": "Haotian Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/180431?format=json", "institution": "Tianjin University"}, {"id": 186969, "fullname": "Chong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186969?format=json", "institution": "University of Science and Technology of China"}, {"id": 186970, "fullname": "Dongdong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186970?format=json", "institution": "Tianjin University"}, {"id": 88484, "fullname": "Changqing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88484?format=json", "institution": "Tianjin University"}], "abstract": "3D Gaussian Splatting (3DGS) has demonstrated exceptional performance in reconstruction and novel view synthesis tasks. However, its reliance on Structure-from-Motion preprocessing may lead to degraded performance under sparse-view scenarios. Recent works attempt to address this limitation by leveraging pre-trained image matching models to generate Gaussian primitives but overlook the probabilistic uncertainty embedded in both the initial primitive distribution and iterative position updates. This uncertainty can accumulate and degrade reconstruction fidelity. Hence, we propose BA-GS, a Bayesian framework that models both the global distribution and local uncertainty of Gaussian primitives. At global initialization, a Variational Bayesian Gaussian Mixture Model (VB-GMM) models the latent distribution of primitives, capturing region-wise density and gradient patterns. At local refinement, an Adaptive Kalman Filter refines each primitive\u2019s position by recursively fusing noisy gradient observations with spatial priors, dynamically adjusting its covariance according to local uncertainty.This hierarchical Bayesian formulation effectively bridges probabilistic distribution modeling and uncertainty-aware optimization, resulting in improved reconstruction quality under sparse-view conditions. Experiments across multiple benchmark datasets including Tanks and Temples, MVimgNet, and LLFF demonstrate that our method consistently outperforms existing approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39464", "url": null, "sourceid": 32345, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36315, "uid": "761b59a8e028e110dec4be2114ee567d", "name": "TerraSeg: Self-Supervised LiDAR Foundation Model for Ground Segmentation", "authors": [{"id": 89937, "fullname": "Ted Lentsch", "url": "http://cvpr.thecvf.com/api/miniconf/users/89937?format=json", "institution": "Delft University of Technology"}, {"id": 184756, "fullname": "Santiago Montiel-Mar\u00edn", "url": "http://cvpr.thecvf.com/api/miniconf/users/184756?format=json", "institution": "Universidad de Alcal\u00e1"}, {"id": 75575, "fullname": "Holger Caesar", "url": "http://cvpr.thecvf.com/api/miniconf/users/75575?format=json", "institution": "TU Delft"}, {"id": 85704, "fullname": "Dariu M. Gavrila", "url": "http://cvpr.thecvf.com/api/miniconf/users/85704?format=json", "institution": "Delft University of Technology"}], "abstract": "LiDAR perception is fundamental to robotics, enabling machines to understand their environment in 3D. A crucial task for LiDAR-based scene understanding and navigation is ground segmentation. Existing methods are either handcrafted for specific LiDAR configurations or require costly per-point manual labels, limiting generalization and scalability. We introduce TerraSeg, establishing the first self-supervised LiDAR foundation model for ground segmentation. We train TerraSeg on OmniLiDAR, a unified large-scale dataset that aggregates and standardizes LiDAR data from nine major public benchmarks, spanning over 20 million raw scans and 11 distinct sensor models, providing unprecedented diversity for learning a generalizable ground model. OmniLiDAR is pseudo-labeled by our PseudoLabeler, a novel self-supervised module that generates high-quality ground/non-ground labels through per-scan runtime optimization. Without any manual labels, TerraSeg achieves state-of-the-art results on nuScenes, SemanticKITTI, and Waymo Perception, and delivers close-to-real-time performance. Our code and models will be publicly released upon paper acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36315", "url": null, "sourceid": 32211, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37558, "uid": "e5860ae00103b3869a25d940345bf0fd", "name": "When Anonymity Breaks: Identifying Models Behind Text-to-Image Leaderboards", "authors": [{"id": 187719, "fullname": "Ali Naseh", "url": "http://cvpr.thecvf.com/api/miniconf/users/187719?format=json", "institution": "University of Massachusetts Amherst"}, {"id": 187720, "fullname": "Anshuman Suri", "url": "http://cvpr.thecvf.com/api/miniconf/users/187720?format=json", "institution": "Northeastern University"}, {"id": 187721, "fullname": "Yuefeng Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187721?format=json", "institution": "University of Massachusetts at Amherst"}, {"id": 187722, "fullname": "Harsh Chaudhari", "url": "http://cvpr.thecvf.com/api/miniconf/users/187722?format=json", "institution": "Northeastern University"}, {"id": 187723, "fullname": "Alina Oprea", "url": "http://cvpr.thecvf.com/api/miniconf/users/187723?format=json", "institution": "Northeastern University"}, {"id": 187724, "fullname": "Amir Houmansadr", "url": "http://cvpr.thecvf.com/api/miniconf/users/187724?format=json", "institution": "University of Massachusetts at Amherst"}], "abstract": "Text-to-image (T2I) models are increasingly popular, producing a large share of AI-generated images online. To compare their quality, voting-based leaderboards have become the standard, relying on anonymized model outputs for fairness. In this work, we show that such anonymity can be easily broken. We find that generations from each T2I model form distinctive clusters in the image embedding space, enabling accurate deanonymization without prompt control or training data. Using 22 models and 280 prompts (150K images), our centroid-based method achieves high accuracy and reveals systematic model-specific signatures. We further introduce a prompt-level distinguishability metric and conduct large-scale analyses showing how certain prompts amplify these signatures. Our findings expose fundamental security risks in T2I leaderboards and motivate stronger anonymization defenses.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37558", "url": null, "sourceid": 45650, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39390, "uid": "8eb100008e421f50e5928029cef07c4e", "name": "Remote Sensing Image Super-Resolution for Imbalanced Textures: A Texture-Aware Diffusion Framework", "authors": [{"id": 183438, "fullname": "Enzhuo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183438?format=json", "institution": "Nanjing University"}, {"id": 191978, "fullname": "Sijie Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191978?format=json", "institution": "nanjing university"}, {"id": 191979, "fullname": "Dilxat Muhtar", "url": "http://cvpr.thecvf.com/api/miniconf/users/191979?format=json", "institution": null}, {"id": 191980, "fullname": "Zhenshi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191980?format=json", "institution": "nanjing university"}, {"id": 191981, "fullname": "Xueliang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191981?format=json", "institution": "nanjing university"}, {"id": 191982, "fullname": "Pengfeng Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191982?format=json", "institution": "Nanjing University"}], "abstract": "Generative diffusion priors have recently achieved state-of-the-art performance in Natural Image Super-Resolution, demonstrating a powerful capability to synthesize photorealistic details. However, their direct application to Remote Sensing Image Super-Resolution (RSISR) reveals significant shortcomings. Remote sensing images present a unique challenge: ground objects often exhibit globally stochastic yet locally clustered patterns. This characteristic leads to highly imbalanced texture distributions, posing a significant hurdle to the model's spatial perception. To address this, we propose TexADiff, a novel framework that begins by estimating a Relative Texture Density Map (RTDM) that reflects the underlying texture distribution. TexADiff then leverages this RTDM in three synergistic ways: as an explicit spatial conditioning to guide the diffusion process, as a loss modulation term to prioritize texture-rich regions, and as a dynamic adapter for the sampling schedule. These modifications are designed to endow the model with explicit texture-aware capabilities. Experiments demonstrate that TexADiff achieves superior quantitative metrics. Furthermore, qualitative results show that our model generates faithful high-frequency details while effectively suppressing texture hallucinations. This improved reconstruction quality also results in significant gains in downstream task performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39390", "url": null, "sourceid": 45576, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39243, "uid": "44e97bac42b33fc4d3873d6c3f8cb118", "name": "CoV-Align: Efficient Fine-grained Cross-Modal Alignment with Cohesive Visual Semantics Priority", "authors": [{"id": 191690, "fullname": "Hengqi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191690?format=json", "institution": "Beijing University of Posts and Telecommunications; Li Auto Inc."}, {"id": 191691, "fullname": "Wanting Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191691?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 191692, "fullname": "Longteng Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/191692?format=json", "institution": null}, {"id": 185645, "fullname": "Fangxiang Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185645?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 191693, "fullname": "Lei Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/191693?format=json", "institution": "Li Auto"}, {"id": 187197, "fullname": "Chen Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/187197?format=json", "institution": "Li Auto Inc."}, {"id": 185646, "fullname": "Xiaojie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185646?format=json", "institution": "Beijing University of Post and Telecommunication"}], "abstract": "Cross-modal alignment aims to learn semantically consistent latent representations across diverse modalities. Prevailing methods rely on a text-guided aggregation paradigm to achieve fine-grained alignment, while they suffer from redundant patch-word correlations and high computational costs. To address these issues, we propose CoV-Align, an effective and efficient fine-grained cross-modal alignment framework with cohesive visual semantics priority. Through a semantically convergent attention mechanism, it progressively aggregates meaningful visual patches in a text-free manner. We design a coarse visual semantic feature extractor that integrates deformable attention and consist assign attention to group patches with semantic consistency. A cohesive and discriminative feature optimization is presented to enhance intra-semantic cohesion and inter-semantic discriminability of visual region features, resulting in explicit improvements in cross-modal alignment. Extensive experiments demonstrate that CoV-Align achieves state-of-the-art performance on the Flickr30K and MS-COCO benchmarks. Notably, it delivers a 3\u20135$\\times$ computational speedup compared to pioneer approaches, offering compelling advantages for large-scale multi-modal tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39243", "url": null, "sourceid": 37984, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39414, "uid": "5a55d6ee22db450394f6f4ff698ce7f9", "name": "Electromagnetic Inverse Scattering from a Single Transmitter", "authors": [{"id": 176393, "fullname": "Yizhe Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/176393?format=json", "institution": "Peking University"}, {"id": 192029, "fullname": "Chunxun Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/192029?format=json", "institution": "Peking University"}, {"id": 192030, "fullname": "Haoru Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192030?format=json", "institution": "Peking University"}, {"id": 183889, "fullname": "Wentao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183889?format=json", "institution": "Eastern Institute of Technology, Ningbo"}, {"id": 192031, "fullname": "Xiaoxuan Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/192031?format=json", "institution": "Peking University"}, {"id": 74208, "fullname": "Yizhou Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/74208?format=json", "institution": "Peking University"}], "abstract": "Solving Electromagnetic Inverse Scattering Problems (EISP) is fundamental in applications such as medical imaging, where the goal is to reconstruct the relative permittivity from scattered electromagnetic field. This inverse process is inherently ill-posed and highly nonlinear, making it particularly challenging, especially under sparse transmitter setups, e.g., with only one transmitter. A recent machine learning-based approach, Img-Interiors, shows promising results by leveraging continuous implicit functions. However, it requires time-consuming case-specific optimization and fails under sparse transmitter setups. To address these limitations, we revisit EISP from a data-driven perspective. The scarcity of transmitters leads to an insufficient amount of measured data, which fails to capture adequate physical information for stable inversion. Built on this insight, we propose a fully end-to-end and data-driven framework that predicts the relative permittivity of scatterers from measured fields, leveraging data distribution priors to compensate for the lack of physical information. This design enables data-driven training and feed-forward prediction of relative permittivity while maintaining strong robustness to transmitter sparsity. Extensive experiments show that our method outperforms state-of-the-art approaches in reconstruction accuracy and robustness. Notably, it achieves high-quality results even with a single transmitter, a setting where previous methods consistently fail. This work offers a fundamentally new perspective on electromagnetic inverse scattering and represents a major step toward cost-effective practical solutions for electromagnetic imaging.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39414", "url": null, "sourceid": 34348, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37687, "uid": "11d6b3b01f84d22194dbdcff39ad3b3b", "name": "The Power of Prior: Training-Free Open-Vocabulary Semantic Segmentation with LLaVA", "authors": [{"id": 89336, "fullname": "Bingfeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89336?format=json", "institution": "China University of Petroleum (East China)"}, {"id": 128633, "fullname": "Siyue Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128633?format=json", "institution": "Xi&#x27;an Jiaotong-Liverpool University"}, {"id": 188014, "fullname": "Hui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188014?format=json", "institution": null}, {"id": 188015, "fullname": "Jiahua Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188015?format=json", "institution": "DIGITAL QINGDAO CONSTRUCTION CO.,LTD."}, {"id": 176774, "fullname": "Wenwu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176774?format=json", "institution": "University of Surrey"}, {"id": 89348, "fullname": "Jimin Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/89348?format=json", "institution": "Xi&#x27;an Jiaotong-Liverpool University"}], "abstract": "Multimodal Large Language Models (MLLMs) like LLaVA have demonstrated remarkable capabilities in multi-modal understanding and generation. This success motivates us to investigate whether the inherent prior knowledge embedded within such MLLMs contains sufficient spatial awareness for dense prediction tasks, without requiring any task-specific fine-tuning. Thus, in this paper, we explore the utilization of LLaVA for training-free open-vocabulary semantic segmentation. We discover that certain layers within the LLM part of LLaVA can generate localized features corresponding to given object classes. Building on this intrinsic capability, we design three modules: A question-answer pipeline to identify target classes in the image, a text-visual response module to extract initial reliable pixel-level activations for the target class, and a visual generation module to produce reliable refined prompts, which further serve as guidance for SAM to generate the predictions. Our LLaVA-based approach achieves new state-of-the-art performance on ``Thing\" category datasets, \\eg, PASCAL VOC 2012 and COCO-object. Moreover, our method does not require explicit background class names, demonstrating its exceptional potential for handling open-world scenarios. The code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37687", "url": null, "sourceid": 45411, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36952, "uid": "4674be98075d966f02c2c1021806839a", "name": "Domain-Aware Federated Learning via Fisher-Guided Pruning", "authors": [{"id": 181995, "fullname": "Chenchen Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/181995?format=json", "institution": "Sun Yat-sen University"}, {"id": 186297, "fullname": "Wenhao Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186297?format=json", "institution": "The University of Hong Kong,"}, {"id": 186298, "fullname": "Zhengji Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186298?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 186299, "fullname": "Xuehe Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186299?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Federated Learning (FL) serves as a prominent distributed machine learning paradigm, enabling clients to collaboratively train a shared model. However, clients generally possess data from multiple domains, posing significant challenges to model efficiency and generalization. In this paper, we propose \\texttt{FedFIP}, a domain-sensitive federated pruning framework that preserves domain-invariant structures and domain-specific representations. First, we design the Domain-Sensitive Fisher Pruning (DSFP) module to estimate channel importance per domain via Fisher information, and upload it to the server to obtain a globally shared pruning mask. Due to domain heterogeneity, each client reuses its Fisher information to selectively reactivate domain-specific channels, yielding personalized sparse models that remain structurally aligned yet adapt to local heterogeneity. To further enhance performance, we adopt a Domain-Sensitive Regularization (DSR) module, in which the server builds domain prototypes from important signals and broadcasts them back. Guided by the prototypes, we introduce a structure-contrastive loss to strengthen intra-domain consistency and inter-domain discriminability. Finally, we propose a structure-aware aggregation algorithm to fuse heterogeneous personalized architectures into a domain-generalized global model. Extensive experiments on multi-domain benchmarks demonstrate that \\texttt{FedFIP} surpasses state-of-the-art FL baselines while substantially shrinking model size.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36952", "url": null, "sourceid": 34828, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40202, "uid": "edc11ad645bab544e01f77d9d4e7cff3", "name": "Beyond Success: Refining Elegant Robot Manipulation from Mixed-Quality Data via Just-in-Time Intervention", "authors": [{"id": 174776, "fullname": "Yanbo Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/174776?format=json", "institution": "Jilin University"}, {"id": 88152, "fullname": "Jianlong Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88152?format=json", "institution": "Microsoft"}, {"id": 180925, "fullname": "Ruoxuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180925?format=json", "institution": "Jilin University"}, {"id": 133400, "fullname": "Hongxia Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/133400?format=json", "institution": "Jilin University"}, {"id": 193764, "fullname": "Meibao Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193764?format=json", "institution": "Jilin University"}], "abstract": "Vision-Language-Action (VLA) models have enabled notable progress in general-purpose robotic manipulation, yet their learned policies often exhibit variable execution quality. We attribute this variability to the mixed-quality nature of human demonstrations, where the implicit principles that govern how actions should be carried out are only partially satisfied. To address this challenge, we introduce the LIBERO-Elegant benchmark with explicit criteria for evaluating execution quality. Using these criteria, we develop a decoupled refinement framework that improves execution quality without modifying or retraining the base VLA policy. We formalize Elegant Execution as the satisfaction of Implicit Task Constraints (ITCs) and train an Elegance Critic via offline Calibrated Q-Learning to estimate the expected quality of candidate actions. At inference time, a Just-in-Time Intervention (JITI) mechanism monitors critic confidence and intervenes only at decision-critical moments, providing selective, on-demand refinement. Experiments on LIBERO-Elegant and real-world manipulation tasks show that the learned Elegance Critic substantially improves execution quality, even on unseen tasks. The proposed model enables robotic control that values not only whether tasks succeed, but also how they are performed.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40202", "url": null, "sourceid": 40684, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37937, "uid": "42aa61c7ccfa95dc4db4d894530def8a", "name": "SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence", "authors": [{"id": 107441, "fullname": "Haoning Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107441?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 188633, "fullname": "Xiao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188633?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 188634, "fullname": "Yaohui Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188634?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 76651, "fullname": "Ya Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76651?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 126281, "fullname": "Yanfeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126281?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 73937, "fullname": "Weidi Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/73937?format=json", "institution": "Shanghai Jiaotong University"}], "abstract": "Existing studies on multimodal large language models (MLLMs) in spatial understanding are typically limited by fragmented assessments.This work considers a comprehensive evaluation of the spatial understanding abilities of existing MLLMs. Concretely, we make the following contributions in this paper: (i) we propose **SpatialScore**, the most comprehensive and diverse multimodal spatial intelligence benchmark to date, encompassing various visual data types, input modalities, and QA formats with around 5K manually verified samples across 30 distinct tasks; (ii) we construct **SpatialCorpus**, a large-scale training resource with 331K multimodal QA samples for supervised fine-tuning Qwen3-VL on spatial understanding; (iii) we develop **SpaitalAgent**, a multi-agent system incorporating 12 specialized spatial perception tools, supporting both *Plan-Execute* and *ReAct* reasoning paradigms, enabling to improve spatial reasoning in a training-free manner; and (iv) we conduct extensive evaluations on 40 representative MLLMs, revealing persistent challenges in spatial intelligence while demonstrating the effectiveness of our data-driven and agent-based solutions. All data, code, and models will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37937", "url": null, "sourceid": 31904, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37867, "uid": "2a11da4581b3801a09326efe6cf2af23", "name": "Guardians of the Hair: Rescuing Soft Boundaries in Depth, Stereo, and Novel Views", "authors": [{"id": 181114, "fullname": "Xiang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181114?format=json", "institution": "Department of Computer Science, ETHZ - ETH Zurich"}, {"id": 156730, "fullname": "Yang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156730?format=json", "institution": "Disney Research Studios, The Walt Disney Company"}, {"id": 188442, "fullname": "Lukas Mehl", "url": "http://cvpr.thecvf.com/api/miniconf/users/188442?format=json", "institution": "Disney Research, Disney Research; Universit\u00e4t Stuttgart"}, {"id": 86229, "fullname": "Markus Gross", "url": "http://cvpr.thecvf.com/api/miniconf/users/86229?format=json", "institution": "Disney Research, Disney"}, {"id": 85061, "fullname": "Christopher Schroers", "url": "http://cvpr.thecvf.com/api/miniconf/users/85061?format=json", "institution": "Disney Research|Studios, Disney"}], "abstract": "Soft boundaries, like thin hairs, are commonly observed in natural and computer-generated imagery, but they remain challenging for 3D vision due to the ambiguous mixing of foreground and background cues. This paper introduces Guardians of the Hair (HairGuard), a framework designed to recover fine-grained soft boundary details in 3D vision tasks. Specifically, we first propose a novel data curation pipeline that leverages image matting datasets for training and design a depth fixer network to automatically identify soft boundary regions. With a gated residual module, the depth fixer refines depth precisely around soft boundaries while maintaining global depth quality, allowing plug-and-play integration with state-of-the-art depth models. For view synthesis, we perform depth-based forward warping to retain high-fidelity textures, followed by a generative scene painter that fills disoccluded regions and eliminates redundant background artifacts within soft boundaries. Finally, a color fuser adaptively combines warped and inpainted results to produce novel views with consistent geometry and fine-grained details. Extensive experiments demonstrate that HairGuard achieves state-of-the-art performance across monocular depth estimation, stereo image/video conversion, and novel view synthesis, with significant improvements in soft boundary regions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37867", "url": null, "sourceid": 33220, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39478, "uid": "8ca137da54011ed7d13cdec4cc2a6340", "name": "PR-IQA: Partial-Reference Image Quality Assessment for Diffusion-Based Novel View Synthesis", "authors": [{"id": 181999, "fullname": "Inseong Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/181999?format=json", "institution": "Dongguk University"}, {"id": 192159, "fullname": "Siwoo Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/192159?format=json", "institution": "Dongguk University"}, {"id": 154570, "fullname": "Seung-Hun Nam", "url": "http://cvpr.thecvf.com/api/miniconf/users/154570?format=json", "institution": "NAVER WEBTOON AI"}, {"id": 128394, "fullname": "Soohwan Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/128394?format=json", "institution": "Dongguk University"}], "abstract": "Diffusion models are promising for sparse-view novel view synthesis (NVS), as they can generate pseudo-ground-truth views to aid 3D reconstruction pipelines like 3D Gaussian Splatting (3DGS). However, these synthesized images often contain photometric and geometric inconsistencies, and their direct use for supervision can impair reconstruction. To address this, we propose Partial-Reference Image Quality Assessment (PR-IQA), a framework that evaluates diffusion-generated views using reference images from different poses, eliminating the need for ground truth. PR-IQA first computes a geometrically consistent partial quality map in overlapping regions. It then performs quality completion to inpaint this partial map into a dense, full-image map. This completion is achieved via a cross-attention mechanism that incorporates reference-view context, ensuring cross-view consistency and enabling thorough quality assessment. When integrated into a diffusion-augmented 3DGS pipeline, PR-IQA restricts supervision to high-confidence regions identified by its quality maps. Experiments demonstrate that PR-IQA outperforms existing IQA methods, achieving full-reference-level accuracy without ground-truth supervision. Thus, our quality-aware 3DGS approach more effectively filters inconsistencies, producing superior 3D reconstructions and NVS results.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39478", "url": null, "sourceid": 32548, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37551, "uid": "beb790dc500304731e458a80a44d4299", "name": "FrankenMotion: Part-level Human Motion Generation and Composition", "authors": [{"id": 132652, "fullname": "Chuqiao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/132652?format=json", "institution": "University of Tubingen"}, {"id": 76869, "fullname": "Xianghui Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/76869?format=json", "institution": "University of T\u00fcbingen"}, {"id": 187703, "fullname": "Yong Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187703?format=json", "institution": "Eberhard-Karls-Universit\u00e4t T\u00fcbingen"}, {"id": 69174, "fullname": "Andreas Geiger", "url": "http://cvpr.thecvf.com/api/miniconf/users/69174?format=json", "institution": "University of T\u00fcbingen"}, {"id": 75975, "fullname": "Gerard Pons-Moll", "url": "http://cvpr.thecvf.com/api/miniconf/users/75975?format=json", "institution": "University of T\u00fcbingen"}], "abstract": "Human motion generation from text prompts has made remarkable progress in recent years. However, existing methods primarily rely on either sequence-level or action-level descriptions due to the absence of fine-grained, part-level motion annotations. This limits their controllability over individual body parts.In this work, we construct a high-quality motion captioning dataset with atomic, temporally-aware part-level annotations, leveraging the reasoning capabilities of large language models (LLMs). Unlike prior datasets that either provide synchronized part captions with fixed time segments or rely solely on global sequence labels, our dataset captures asynchronous and semantically distinct part movements at fine temporal resolution.Based on this dataset, we introduce a diffusion-based part-aware motion generation framework, namely FrankenMotion, where each body part is guided by its own temporally-structured textual prompt. This is, to our knowledge, the first work to provide atomic, temporally-aware part-level motion annotations and have a model that allows motion generation with both spatial (body part) and temporal (atomic action) control. Experiments demonstrate that FrankenMotion outperforms all previous baseline models adapted and retrained for our setting, and our model can compose motions unseen during training. Our code and dataset will be publicly available upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37551", "url": null, "sourceid": 44506, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39149, "uid": "6b998835b626af9fd58e2c3521628217", "name": "CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction", "authors": [{"id": 76869, "fullname": "Xianghui Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/76869?format=json", "institution": "University of T\u00fcbingen"}, {"id": 71925, "fullname": "Bowen Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/71925?format=json", "institution": "NVIDIA"}, {"id": 164890, "fullname": "Yan Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/164890?format=json", "institution": "NVIDIA"}, {"id": 191453, "fullname": "Hesam Rabeti", "url": "http://cvpr.thecvf.com/api/miniconf/users/191453?format=json", "institution": "NVIDIA"}, {"id": 152870, "fullname": "Jiefeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152870?format=json", "institution": "NVIDIA"}, {"id": 88866, "fullname": "Ye Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88866?format=json", "institution": "NVIDIA Research"}, {"id": 75975, "fullname": "Gerard Pons-Moll", "url": "http://cvpr.thecvf.com/api/miniconf/users/75975?format=json", "institution": "University of T\u00fcbingen"}, {"id": 159437, "fullname": "Stan Birchfield", "url": "http://cvpr.thecvf.com/api/miniconf/users/159437?format=json", "institution": "NVIDIA"}], "abstract": "Accurate capture of human-object interaction from ubiquitous sensors like RGB cameras is important for applications in human understanding, gaming, and robot learning. However, inferring 4D interactions from a single RGB view is highly challenging due to the unknown object and human information, depth ambiguity, occlusion, and complex motion, which hinder consistent 3D and temporal reconstruction. Previous methods simplify the setup by assuming ground truth object template or constraining to a limited set of object categories. We present CARI4D, the first category-agnostic method that reconstructs spatially and temporarily consistent 4D human-object interaction at metric scale from monocular RGB videos. To this end, we propose  a pose hypothesis selection algorithm that robustly integrates the individual predictions from foundation models, jointly refine them through a learned render-and-compare paradigm to ensure spatial, temporal and pixel alignment,  and finally reasoning about intricate contacts for further refinement satisfying physical constraints. Experiments show that our method outperforms prior art by 38\\% on in-distribution dataset and 36\\% on unseen dataset in terms of reconstruction error. Our model generalizes beyond the training categories and thus can be applied zero-shot to in-the-wild internet videos. Our code and pretrained models will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39149", "url": null, "sourceid": 30742, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37309, "uid": "9aed5f6c4e7930b325c5e4ae3b95c606", "name": "Foundation Encoders are All You Need for Personalized Image Generation", "authors": [{"id": 184056, "fullname": "Hyungjin Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/184056?format=json", "institution": "Inha University"}, {"id": 187132, "fullname": "Seokho Ahn", "url": "http://cvpr.thecvf.com/api/miniconf/users/187132?format=json", "institution": "Inha University"}, {"id": 187133, "fullname": "Young-Duk Seo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187133?format=json", "institution": "Inha University"}], "abstract": "Personalized image generation based on user behaviors reflects individual preferences with minimal user intervention. However, existing studies often rely on inaccurate profiling, high resource costs, and model-specific designs, which jointly restrict creativity, diversity, and generality. To address these limitations, we propose FANG, a novel approach that enables personalization using only foundation encoders, without additional structures. FANG performs tailored profiling to capture user preferences, and reconstructs transformer-based encoders to integrate them while preserving target fidelity. Experiments show that FANG achieves robust, high-quality personalization across various foundation text-to-image models and applications (e.g., CLIP retrieval, unCLIP, vision-language models), seamlessly integrating into diverse encoders without fine-tuning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37309", "url": null, "sourceid": 39654, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36867, "uid": "354fdb0a06857ad757d94eb86ccdbf06", "name": "Your Classifier Can Do More: Towards Bridging the Gaps in Classification, Robustness, and Generation", "authors": [{"id": 175951, "fullname": "kaichao jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/175951?format=json", "institution": "Hefei University of Technology"}, {"id": 107376, "fullname": "He Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107376?format=json", "institution": "University College London, University of London"}, {"id": 156448, "fullname": "Xiaoshuai Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/156448?format=json", "institution": "Beijing Academy of Artificial Intelligence(BAAl)"}, {"id": 186059, "fullname": "Xiulong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186059?format=json", "institution": "Central China Normal University"}, {"id": 75781, "fullname": "Ajian Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75781?format=json", "institution": "NLPR, CASIA"}, {"id": 131736, "fullname": "Qi Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131736?format=json", "institution": "University of Science and Technology of China"}, {"id": 186060, "fullname": "Yunfeng Diao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186060?format=json", "institution": "Hefei University of Technology"}, {"id": 85100, "fullname": "Richang Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/85100?format=json", "institution": "Hefei University of Technology"}], "abstract": "Joint Energy-based Models (JEMs) are well known for their ability to unify classification and generation within a single framework. Despite their promising generative and discriminative performance, their robustness remains far inferior to adversarial training (AT), which, conversely, achieves strong robustness but sacrifices clean accuracy and lacks generative ability. This inherent trilemma\u2014balancing classification accuracy, robustness, and generative capability\u2014raises a fundamental question: \\textit{Can a single model achieve all three simultaneously?} To answer this, we conduct a systematic energy landscape analysis of clean, adversarial, and generated samples across various JEM and AT variants. We observe that AT reduces the energy gap between clean and adversarial samples, while JEMs narrow the gap between clean and synthetic ones. This observation suggests a key insight: if the energy distributions of all three data types can be aligned, we might bridge their performance disparities. Building on this idea,  we propose Energy-based Joint Distribution Adversarial Training (EB-JDAT), a unified generative-discriminative-robust framework that maximizes the joint probability of clean and adversarial distribution. EB-JDAT introduces a novel min\u2013max energy optimization to explicitly aligning energies across clean, adversarial, and generated samples. Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet subsets demonstrate that EB-JDAT achieves state-of-the-art robustness while maintaining near-original accuracy and generation quality of JEMs, effectively resolving the triple trade-off between accuracy, robustness, and generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36867", "url": null, "sourceid": 34138, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37745, "uid": "842e8292086b3d3b35d127ea5b1fb9ca", "name": "Degradation-Consistent Test-Time Adaptation for All-in-One Image Restoration", "authors": [{"id": 145506, "fullname": "Ni Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/145506?format=json", "institution": "Xiamen University"}, {"id": 145806, "fullname": "Shenghao nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/145806?format=json", "institution": "Xiamen University"}, {"id": 87734, "fullname": "Xiaotong Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/87734?format=json", "institution": "XMU"}, {"id": 135330, "fullname": "Yuan Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/135330?format=json", "institution": "East China Normal University"}, {"id": 87721, "fullname": "Yanyun Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87721?format=json", "institution": "Xiamen University"}], "abstract": "All-in-one image restoration (AiOIR) methods have made remarkable progress in handling diverse degradations. However, their performance often deteriorates when the test distribution deviates from the training distribution. Exploring test-time adaptation for AiOIR is therefore crucial. To adapt a pre-trained AiOIR model to unseen degradation distributions without access to source data or retraining, two key challenges must be addressed: designing reliable pseudo-supervision and stabilizing adaptation. Observing that multiple degraded versions of the same scene should map to a consistent clean image, we propose Degradation-Consistent Test-Time Adaptation (DCTTA). DCTTA comprises three core components: (1) test-time redegradation generation, which leverages a diffusion-based generator to construct pseudo degraded\u2013clean pairs for distribution alignment;  (2)  degradation-guided image restoration, which enforces domain adaptation via self-supervised consistency loss; and (3) test-time important parameter selection, which selectively updates degradation-sensitive parameters to ensure stable adaptation while preserving pre-trained knowledge. Extensive experiments across multiple tasks and challenging domain shifts demonstrate that DCTTA consistently outperforms state-of-the-art AiOIR baselines, achieving up to +4.57 dB PSNR improvement on the Rain100H dataset.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37745", "url": null, "sourceid": 39374, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39998, "uid": "7991414f1fb147d65ca8fb117934f94b", "name": "Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views", "authors": [{"id": 174241, "fullname": "Kunwar Maheep Singh", "url": "http://cvpr.thecvf.com/api/miniconf/users/174241?format=json", "institution": "MPI for Informatics"}, {"id": 136775, "fullname": "Jianchun Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/136775?format=json", "institution": "Max Planck Institute for Informatics"}, {"id": 75786, "fullname": "Vladislav Golyanik", "url": "http://cvpr.thecvf.com/api/miniconf/users/75786?format=json", "institution": "MPI for Informatics"}, {"id": 87361, "fullname": "Stephan J. Garbin", "url": "http://cvpr.thecvf.com/api/miniconf/users/87361?format=json", "institution": "Microsoft"}, {"id": 75876, "fullname": "Thabo Beeler", "url": "http://cvpr.thecvf.com/api/miniconf/users/75876?format=json", "institution": "Google"}, {"id": 89511, "fullname": "Rishabh Dabral", "url": "http://cvpr.thecvf.com/api/miniconf/users/89511?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 127234, "fullname": "Marc Habermann", "url": "http://cvpr.thecvf.com/api/miniconf/users/127234?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 75465, "fullname": "Christian Theobalt", "url": "http://cvpr.thecvf.com/api/miniconf/users/75465?format=json", "institution": "MPI Informatik"}], "abstract": "We present _Relightable Holoported Characters_ (RHC), a novel person-specific method for free-view rendering and relighting of full-body and highly dynamic humans solely observed from sparse-view RGB videos at inference. In contrast to classical one-light-at-a-time (OLAT)-based human relighting, our transformer-based RelightNet predicts relit appearance within a single network pass, avoiding costly OLAT-basis capture and generation. For training such a model, we introduce a new capture strategy and dataset recorded in a multi-view lightstage, where we alternate frames lit by random environment maps with uniformly lit tracking frames, simultaneously enabling accurate motion tracking and diverse illumination as well as dynamics coverage. Inspired by the rendering equation, we derive physics-informed features that encode geometry, albedo, shading, and the virtual camera view from a coarse human mesh proxy and the input views. Our RelightNet then takes these features as input and cross-attends them with a novel lighting condition, and regresses the relit appearance in the form of texel-aligned 3D Gaussian splats attached to the coarse mesh proxy. Consequently, our RelightNet implicitly learns to efficiently compute the rendering equation for novel lighting conditions within a single feed-forward pass. Experiments demonstrate our method\u2019s superior visual fidelity and lighting reproduction compared to state-of-the-art approaches.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39998", "url": null, "sourceid": 34670, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40390?format=json"], "related_events_ids": [40390]}, {"id": 37794, "uid": "da3b24471e22500279c0333be551037a", "name": "Probabilistic Discrepancy Learning for Roadside LiDAR Scene Completion", "authors": [{"id": 182287, "fullname": "Xiaogang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182287?format=json", "institution": "Ningxia University"}, {"id": 188283, "fullname": "Jinchao Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188283?format=json", "institution": "Ningxia University"}, {"id": 188284, "fullname": "Zixian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188284?format=json", "institution": "Ningxia University"}, {"id": 188285, "fullname": "Dun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188285?format=json", "institution": "Ningxia University"}, {"id": 188286, "fullname": "BoXiang Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188286?format=json", "institution": "Ningxia University"}, {"id": 188287, "fullname": "Yiqiang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188287?format=json", "institution": null}], "abstract": "We propose a probabilistic discrepancy learning approach for roadside LiDAR scene completion (PDL). Conventional methods focus on object-level completion and scene completion from ego-vehicle viewpoint. These methods struggle to cope with long-term or total occlusions caused by roadside sensors with fixed viewpoints. To address this issue, we compensate for occlusion roadside point clouds by introducing external visual information. Specifically, Our PDL is mainly divided into probabilistic pose discrepancy minimization and scene discrepancy learning. We employ probabilistic pose discrepancy minimization to correct noisy poses from vision-based detectors, while utilizing a diffusion model within scene discrepancy learning for robust full-scene completion.Furthermore, we introduce regional and global sampling discrepancy learning losses to achieve robust and efficient training. We conducted extensive experiments on the V2X-Seq and TUMTraf-V2X roadside datasets. Results demonstrate that DT-VEM achieves state-of-the-art performance, with average reductions of 14.5\\% in chamfer distance (CD) and 6\\% in 3D Jensen Shannon divergence (JSD) compared to existing methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37794", "url": null, "sourceid": 41822, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38490, "uid": "c8045671083f48d8d09d1d2523ea8941", "name": "Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought", "authors": [{"id": 132627, "fullname": "Shin&amp;#x27;ya Yamaguchi", "url": "http://cvpr.thecvf.com/api/miniconf/users/132627?format=json", "institution": "NTT"}, {"id": 189972, "fullname": "Kosuke Nishida", "url": "http://cvpr.thecvf.com/api/miniconf/users/189972?format=json", "institution": "NTT"}, {"id": 131726, "fullname": "Daiki Chijiwa", "url": "http://cvpr.thecvf.com/api/miniconf/users/131726?format=json", "institution": "NTT, The University of Tokyo"}], "abstract": "Large vision-language models (LVLMs) have demonstrated remarkable capabilities by integrating pre-trained vision encoders with large language models (LLMs). Similar to single-modal LLMs, chain-of-thought (CoT) prompting has been adapted for LVLMs to enhance multi-modal reasoning by generating intermediate rationales based on visual and textual inputs.  While CoT is assumed to improve grounding and accuracy in LVLMs, our experiments reveal a key challenge: existing LVLMs often ignore the contents of generated rationales in CoT reasoning. To address this, we re-formulate multi-modal CoT reasoning as a KL-constrained reward maximization focused on rationale-conditional log-likelihood. As the optimal solution, we propose rationale-enhanced decoding (RED), a novel plug-and-play inference-time decoding strategy. RED harmonizes visual and rationale information by multiplying distinct image-conditional and rationale-conditional next token distributions. Extensive experiments show that RED consistently and significantly improves reasoning over standard CoT and other decoding methods across multiple benchmarks and LVLMs. Our work offers a practical and effective approach to improve both the faithfulness and accuracy of CoT reasoning in LVLMs, paving the way for more reliable rationale-grounded multi-modal systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38490", "url": null, "sourceid": 31506, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37975, "uid": "d748a600fe89a6abd01b174e8d2eb35c", "name": "Looking Beyond the Window: Global-Local Aligned CLIP for Training-free Open-Vocabulary Semantic Segmentation", "authors": [{"id": 188722, "fullname": "ByeongCheol Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/188722?format=json", "institution": "Sungkyunkwan University"}, {"id": 87337, "fullname": "Hyun Seok Seong", "url": "http://cvpr.thecvf.com/api/miniconf/users/87337?format=json", "institution": "Sungkyunkwan University"}, {"id": 87830, "fullname": "Sangeek Hyun", "url": "http://cvpr.thecvf.com/api/miniconf/users/87830?format=json", "institution": "Sungkyunkwan University"}, {"id": 188723, "fullname": "Gilhan Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/188723?format=json", "institution": "Sung Kyun Kwan University"}, {"id": 87385, "fullname": "WonJun Moon", "url": "http://cvpr.thecvf.com/api/miniconf/users/87385?format=json", "institution": "Sungkyunkwan University"}, {"id": 87383, "fullname": "Jae-Pil Heo", "url": "http://cvpr.thecvf.com/api/miniconf/users/87383?format=json", "institution": "Sungkyunkwan University"}], "abstract": "A sliding-window inference strategy is commonly adopted in recent training-free open-vocabulary semantic segmentation methods to overcome limitation of the CLIP in processing high-resolution images. However, this approach introduces a new challenge: each window is processed independently, leading to semantic discrepancy across windows. To address this issue, we propose Global-Local Aligned CLIP~(GLA-CLIP), a framework that facilitates comprehensive information exchange across windows. Rather than limiting attention to tokens within individual windows, GLA-CLIP extends key-value tokens to incorporate contextual cues from all windows. Nevertheless, we observe a window bias: outer-window tokens are less likely to be attended, since query features are produced through interactions within the inner window patches, thereby lacking semantic grounding beyond their local context. To mitigate this, we introduce a proxy anchor, constructed by aggregating tokens highly similar to the given query from all windows, which provides a unified semantic reference for measuring similarity across both inner- and outer-window patches. Furthermore, we propose a dynamic normalization scheme that adjusts attention strength according to object scale by dynamically scaling and thresholding the attention map to cope with small-object scenarios. Moreover, GLA-CLIP can be equipped on existing methods and broad their receptive field. Extensive experiments validate the effectiveness of GLA-CLIP in enhancing training-free open-vocabulary semantic segmentation performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37975", "url": null, "sourceid": 34002, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36547, "uid": "10b7aacb4303a22494c755fd942ab5b8", "name": "Stronger Normalization-Free Transformers", "authors": [{"id": 185315, "fullname": "Mingzhi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185315?format=json", "institution": "Southern University of Science and Technology"}, {"id": 185316, "fullname": "Taiming Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185316?format=json", "institution": "Princeton University"}, {"id": 153069, "fullname": "Jiachen Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153069?format=json", "institution": "New York University FAIR, Meta"}, {"id": 185317, "fullname": "Mingjie Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/185317?format=json", "institution": "Thinking Machines Lab"}, {"id": 153070, "fullname": "Zhuang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153070?format=json", "institution": "Princeton University"}], "abstract": "Although normalization layers have long been viewed as essential components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. Acting like a normalization layer, the point-wise function DyT constrains extreme values for stable convergence and reach normalization-level performance; this work seeks further for functions that can surpass it. We first study how the intrinsic properties of point-wise functions shape training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce $\\mathrm{Derf}(x) = \\mathrm{erf}(\\alpha x + s)$  and identify it as the most performant design. \\methodname{} consistently outperforms LayerNorm, RMSNorm, and Dynamic Tanh across a wide range of modalities, tasks, and learning paradigms. Moreover, our findings suggest that the performance gains of \\methodname{} largely stem from its improved generalization rather than stronger fitting capacity. Its simplicity and performance make \\methodname{} a practical choice for normalization-free Transformer design.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36547", "url": null, "sourceid": 37329, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37517, "uid": "3fd377c199d2f76e92903a5f4b3cd771", "name": "PhysHead: Simulation-Ready Gaussian Head Avatars", "authors": [{"id": 187622, "fullname": "Berna Kabadayi", "url": "http://cvpr.thecvf.com/api/miniconf/users/187622?format=json", "institution": "Max Planck Institute for Intelligent Systems"}, {"id": 133591, "fullname": "Vanessa Sklyarova", "url": "http://cvpr.thecvf.com/api/miniconf/users/133591?format=json", "institution": "Max Planck Institute for Intelligent Systems, Tubingen"}, {"id": 187623, "fullname": "Wojciech Zielonka", "url": "http://cvpr.thecvf.com/api/miniconf/users/187623?format=json", "institution": "Technische Universit\u00e4t Darmstadt"}, {"id": 133051, "fullname": "Justus Thies", "url": "http://cvpr.thecvf.com/api/miniconf/users/133051?format=json", "institution": "Technische Universit\u00e4t Darmstadt"}, {"id": 75975, "fullname": "Gerard Pons-Moll", "url": "http://cvpr.thecvf.com/api/miniconf/users/75975?format=json", "institution": "University of T\u00fcbingen"}], "abstract": "Realistic digital avatars require expressive and dynamic hair motion, yet most existing head avatar methods assume rigid hair movement.These methods often fail to disentangle hair from the head, representing it as a simple outer shell and failing to capture its natural volumetric behavior. In this paper, we address these limitations by introducing PhysHead, a hybrid representation for animatable head avatars with realistic hair dynamics learned from multi-view video. Our approach combines a 3D parametric mesh for the head with strand-based hair, which can be directly simulated using physics engines. For the appearance model, we employ Gaussian primitives attached to both the head mesh and hair segments. This representation enables the creation of photorealistic head avatars with dynamic hair behavior, such as wind-blown motion, overcoming the constraints of rigid hair in existing methods. However, these animation capabilities also require new training schemes. Especially, we propose the use of VLM-based models to generate appearance information of regions that are occluded in the dynamic training sequences. In quantitative and qualitative studies, we demonstrate the capabilities of the proposed model and compare it with existing baselines. We show that our method is able to synthesize physically plausible hair motion besides expression and camera control. The code will be released for research purposes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37517", "url": null, "sourceid": 36725, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37157, "uid": "b3e6fdc36d0f16ef605f73a6ad7f89fb", "name": "WiTTA-Bench: Benchmarking Test-Time Adaptation for WiFi Sensing", "authors": [{"id": 186796, "fullname": "Bing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186796?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 186797, "fullname": "Qiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186797?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 186798, "fullname": "JUNDA LU", "url": "http://cvpr.thecvf.com/api/miniconf/users/186798?format=json", "institution": "Western Sydney University"}, {"id": 128584, "fullname": "Le Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128584?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 155616, "fullname": "Yun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155616?format=json", "institution": "Nankai University"}, {"id": 128611, "fullname": "Ce Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128611?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 186799, "fullname": "Wei Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/186799?format=json", "institution": "A*STAR"}], "abstract": "WiFi sensing offers passive and privacy-preserving perception that complements vision-based sensing, but its performance degrades sharply under domain shifts caused by changes in environment, users, or hardware. This challenge is exacerbated in real-world deployments where source data are unavailable, motivating test-time adaptation (TTA) as a practical solution for self-calibration using only unlabeled target samples. We introduce WiTTA-Bench, the first comprehensive benchmark for WiFi TTA, covering 20 representative methods, two adaptation protocols (OTTA and TTDA), and three major physics-induced shifts in WiFi: cross-environment, cross-subject, and cross-device. Furthermore, we contribute a new dataset featuring paired recordings from heterogeneous devices to bridge the cross-device gap. Extensive experiments reveal three key insights unique to WiFi sensing: (i) WiFi domain shifts exhibit a physics-induced hierarchy: environmental changes alter multipath statistics, subject variation perturbs temporal\u2013spectral geometry, and hardware differences reshape the entire feature manifold; (ii) OTTA and TTDA are complementary: lightweight OTTA handles mild statistical drift, while TTDA is necessary to correct deep, hardware-induced structural distortions; and (iii) OTTA is hyperparameter-robust and scales linearly with source quality, whereas TTDA is more sensitive due to recursive self-training. WiTTA-Bench establishes the first systematic foundation for adaptive, robust, and deployable WiFi sensing under realistic wireless conditions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37157", "url": null, "sourceid": 44418, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39139, "uid": "a658d23016fb87f4536849e04851921d", "name": "OVSegDT: Segmenting Transformer for Open-Vocabulary Object Goal Navigation", "authors": [{"id": 191429, "fullname": "Tatiana Zemskova", "url": "http://cvpr.thecvf.com/api/miniconf/users/191429?format=json", "institution": "Cognitive AI Systems Lab; MIRAI"}, {"id": 180648, "fullname": "Aleksei Staroverov", "url": "http://cvpr.thecvf.com/api/miniconf/users/180648?format=json", "institution": "AIRI"}, {"id": 191430, "fullname": "Dmitry Yudin", "url": "http://cvpr.thecvf.com/api/miniconf/users/191430?format=json", "institution": "Cognitive AI Systems Lab; Moscow Independent Research Institute of Artificial Intelligence"}, {"id": 191431, "fullname": "Aleksandr Panov", "url": "http://cvpr.thecvf.com/api/miniconf/users/191431?format=json", "institution": "Moscow Independent Research Institute of Artificial Intelligence; Cognitive AI Systems Lab"}], "abstract": "Open-vocabulary Object Goal Navigation requires an embodied agent to reach objects described by free-form language, including categories never seen during training. Existing end-to-end policies tend to overfit small simulator datasets, achieving high success on training scenes but failing to generalize and often exhibiting unsafe behavior (frequent collisions). In our work, we are the first to show that a high degree of generalization to unseen categories in the open-vocabulary object goal navigation task can be achieved with a lightweight transformer model (130M parameters) using only RGB input. We introduce the OVSegDT approach, which has three key features. First, we add a goal binary mask encoder that grounds the textual goal and provides precise spatial cues. The second component is a proposed Entropy-Adaptive Loss Modulation (EALM) \u2014 a per-sample scheduler that continuously balances imitation and reinforcement signals according to policy entropy, eliminating brittle manual phase switches. EALM reduces the sample complexity of training by 33% and cuts the collision count by 10% compared to the baseline. The final component improves the agent\u2019s navigation quality even under noisy predicted segmentation by combining an auxiliary segmentation loss with a reward function based on the area of the true goal mask during fine-tuning on predicted segmentation. On HM3D-OVON, our model achieves performance on unseen categories comparable to that on seen ones and establishes state-of-the-art results (44.7% SR, 20.6% SPL on val unseen) without using depth, odometry, or large vision\u2013language models", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39139", "url": null, "sourceid": 38261, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39831, "uid": "931cad0d73bc5648212d48966f6fcc21", "name": "SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs", "authors": [{"id": 157702, "fullname": "Mohamad Alansari", "url": "http://cvpr.thecvf.com/api/miniconf/users/157702?format=json", "institution": "Khalifa University of Science, Technology and Research"}, {"id": 192939, "fullname": "Naufal Suryanto", "url": "http://cvpr.thecvf.com/api/miniconf/users/192939?format=json", "institution": "Khalifa University of Science, Technology and Research"}, {"id": 192940, "fullname": "Divya Velayudhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192940?format=json", "institution": "Khalifa University of Science, Technology and Research"}, {"id": 70835, "fullname": "Sajid Javed", "url": "http://cvpr.thecvf.com/api/miniconf/users/70835?format=json", "institution": "Khalifa University of Science and Technology"}, {"id": 129320, "fullname": "Naoufel Werghi", "url": "http://cvpr.thecvf.com/api/miniconf/users/129320?format=json", "institution": "Khalifa University"}, {"id": 73853, "fullname": "Muzammal Naseer", "url": "http://cvpr.thecvf.com/api/miniconf/users/73853?format=json", "institution": "MBZUAI"}], "abstract": "Multimodal large language models (MLLMs) have advanced from image-level reasoning to pixel-level grounding, but extending these capabilities to videos remains challenging as models must achieve spatial precision and temporally consistent reference tracking. Existing video MLLMs often rely on a static segmentation token ([SEG]) for frame-wise grounding, which provides semantics but lacks temporal context, causing spatial drift, identity switches, and unstable initialization when objects move or reappear. We introduce SPARROW, a pixel-grounded video MLLM that unifies spatial accuracy and temporal stability through two key components: (i) Target-Specific Tracked Features (TSF), which inject temporally aligned referent cues during training, and (ii) a dual-prompt design that decodes box ([BOX]) and segmentation ([SEG]) tokens to fuse geometric priors with semantic grounding. SPARROW is supported by a curated referential video dataset of 30,646 videos and 45,231 Q&A pairs and operates end-to-end without external detectors via a class-agnostic SAM2-based proposer. Integrated into three recent open-source video MLLMs (UniPixel, GLUS, and VideoGLaMM), SPARROW delivers consistent gains across six benchmarks, improving up to +8.9 J&F on RVOS, +5 mIoU on visual grounding, and +5.4 CLAIR on GCG. These results demonstrate that SPARROW substantially improves referential stability, spatial precision, and temporal coherence in pixel-grounded video understanding. Code, datasets, and models will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39831", "url": null, "sourceid": 39579, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37971, "uid": "346a524a3ac60a706ba4f27f7c2a6475", "name": "How to Take a Memorable Picture? Empowering Users with Actionable Feedback", "authors": [{"id": 181299, "fullname": "Francesco Laiti", "url": "http://cvpr.thecvf.com/api/miniconf/users/181299?format=json", "institution": "Universit\u00e0 di Pisa, Dipartimento di Informatica"}, {"id": 154366, "fullname": "Davide Talon", "url": "http://cvpr.thecvf.com/api/miniconf/users/154366?format=json", "institution": "Fondazione Bruno Kessler"}, {"id": 188714, "fullname": "Jacopo Staiano", "url": "http://cvpr.thecvf.com/api/miniconf/users/188714?format=json", "institution": "University of Trento"}, {"id": 75841, "fullname": "Elisa Ricci", "url": "http://cvpr.thecvf.com/api/miniconf/users/75841?format=json", "institution": "University of Trento"}], "abstract": "Image memorability, i.e., how likely an image is to be remembered, has traditionally been studied in computer vision either as a passive prediction task, with models regressing a scalar score, or with generative methods altering the visual input to boost the image likelihood of being remembered. Yet, none of these paradigms supports users at capture time, when the crucial question is how to improve a photo memorability. We introduce the task of **Mem**orability **Feed**back (**MemFeed**), where an automated model should provide actionable, human-interpretable guidance to users with the goal to enhance an image future recall. We also present **MemCoach**, the first approach designed to provide concrete suggestions in natural language for memorability improvement (e.g., \u201cemphasize facial expression,\u201d \u201cbring the subject forward\u201d). Our method, based on Multimodal Large Language Models (MLLMs), is training-free and employs a teacher-student steering strategy, aligning the model internal activations toward more memorable patterns learned from a teacher model progressing along least-to-most memorable samples. To enable systematic evaluation on this novel task, we further introduce **MemBench**, a new benchmark featuring sequence-aligned photoshoots with annotated memorability scores. Our experiments, considering multiple MLLMs, demonstrate the effectiveness of MemCoach, showing consistently improved performance over several zero-shot models. The results indicate that memorability can not only be predicted but also taught and instructed, shifting the focus from mere prediction to actionable feedback for human creators. Dataset and code will be publicly released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37971", "url": "https://laitifranz.github.io/MemCoach/", "sourceid": 39004, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40222, "uid": "c35a7212cc6cd96ae5d3ea61164054e8", "name": "HTNav: A Hybrid Navigation Framework with Tiered Structure for Urban Aerial Vision-and-Language Navigation", "authors": [{"id": 180665, "fullname": "Chengjie Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/180665?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 70234, "fullname": "Cong Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/70234?format=json", "institution": "Institute of automation, Chinese Academy of Sciences"}, {"id": 193818, "fullname": "Zijian Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193818?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 152750, "fullname": "Ningzhong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152750?format=json", "institution": "College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics"}, {"id": 85792, "fullname": "Jie Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/85792?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}], "abstract": "Inspired by the general Vision-and-Language Navigation (VLN) task, aerial VLN has drawn widespread attention, owing to its significant application value in areas such as logistics delivery and urban inspection. However, existing methods in complex urban environments face several challenges, including insufficient generalization to unknown scenes, suboptimal performance in long-distance path planning, and inadequate understanding of spatial continuity. To address these challenges, we propose HTNav, a new collaborative navigation framework that blends Imitation Learning (IL) and Reinforcement Learning (RL) into a hybrid IL-RL paradigm. This framework adopts a staged training mechanism to ensure the stability of the basic navigation strategy while enhancing its environmental exploration capability. By integrating a tiered decision-making mechanism, it achieves collaborative interaction between macro-level path planning and fine-grained action control. Furthermore, a map representation learning module is introduced to deepen its understanding of spatial continuity in open domains. On the CityNav benchmark, our method achieves state-of-the-art performance at all levels of scenes and task difficulties. Experimental results demonstrate that this framework significantly improves navigation precision and robustness in complex urban environments.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40222", "url": null, "sourceid": 44144, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40022, "uid": "87c8812aeeb170aa52177d2735ca7042", "name": "Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment", "authors": [{"id": 182960, "fullname": "Tao Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/182960?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 176448, "fullname": "Yilei Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/176448?format=json", "institution": "SJTU"}, {"id": 193320, "fullname": "Yuxin Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/193320?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 193321, "fullname": "Jingjing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193321?format=json", "institution": "University of Michigan - Ann Arbor"}, {"id": 193322, "fullname": "Jiting Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193322?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 193323, "fullname": "Yinxinyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193323?format=json", "institution": null}, {"id": 193324, "fullname": "Encheng Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193324?format=json", "institution": "University of Cambridge"}, {"id": 193325, "fullname": "Ziyan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193325?format=json", "institution": "University of Science and Technology of China"}, {"id": 185279, "fullname": "Hongyi Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/185279?format=json", "institution": "Universiti Malaya - Cai&#x27;s Group"}, {"id": 193326, "fullname": "Yanwen Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/193326?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 193327, "fullname": "Lixing Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/193327?format=json", "institution": "Shanghai Jiaotong University; Donghua University, Shanghai"}, {"id": 193328, "fullname": "Zhaoye Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/193328?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 190740, "fullname": "Gen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190740?format=json", "institution": "Nanyang Technological University"}, {"id": 90479, "fullname": "Bo Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/90479?format=json", "institution": "University of Edinburgh"}], "abstract": "Vision-Language-Action (VLA) models have emerged as a powerful framework that unifies perception, language, and control, enabling robots to perform diverse tasks through multimodal understanding. However, current VLA models typically contain massive parameters and rely heavily on large-scale robot data pretraining, leading to high computational costs during training, as well as limited deployability for real-time inference.Moreover, most training paradigms often degrade the perceptual representations of the Vision\u2013Language backbone, resulting in overfitting and poor generalization to downstream tasks.In this work, we present Evo-1, a lightweight VLA model that reduces computation and improves deployment efficiency, while maintaining strong performance without pretraining on robot data. Evo-1 builds on a native multimodal Vision\u2013Language model (VLM), incorporating a novel cross-modulated diffusion transformer along with an optimized integration module, together forming an effective architecture.We further introduce a two-stage training paradigm that progressively aligns action with perception, preserving the representations of the VLM.Notably, with only 0.77 billion parameters, Evo-1 achieves state-of-the-art results on the Meta-World and RoboTwin suite, surpassing the previous best models by 12.4\\% and 6.9\\%, respectively, and also attains a competitive result of 94.8\\% on LIBERO.In real-world evaluations, Evo-1 attains a 78\\% success rate with high inference frequency and low memory overhead, outperforming all baseline methods.We release code, data, and model weights to facilitate future research on lightweight and efficient VLA models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40022", "url": null, "sourceid": 42759, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38256, "uid": "b025ccf8736e4167de9c5f6366ee5e9e", "name": "HERO: Hierarchical Embedding-Refinement for Open-Vocabulary Temporal Sentence Grounding in Videos", "authors": [{"id": 184362, "fullname": "Tingting Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/184362?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 189438, "fullname": "Xinsong Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189438?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 180426, "fullname": "Yufei Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/180426?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 181771, "fullname": "Min Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181771?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 152220, "fullname": "Sicheng Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/152220?format=json", "institution": "Tsinghua University"}, {"id": 76382, "fullname": "Zhou Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76382?format=json", "institution": "Hangzhou Dianzi University"}], "abstract": "Temporal Sentence Grounding in Videos (TSGV) aims to temporally localize segments of a video that correspond to a given natural language query. Despite recent progress, most existing TSGV approaches operate under closed-vocabulary settings, limiting their ability to generalize to real-world queries involving novel or diverse linguistic expressions. To bridge this critical gap, we introduce the Open-Vocabulary TSGV (OV-TSGV) task and construct the first dedicated benchmarks\u2014Charades-OV and ActivityNet-OV\u2014that simulate realistic vocabulary shifts and paraphrastic variations. These benchmarks facilitate systematic evaluation of model generalization beyond seen training concepts. To tackle OV-TSGV, we propose HERO (Hierarchical Embedding-Refinement for Open-Vocabulary grounding), a unified framework that leverages hierarchical linguistic embeddings and performs parallel cross-modal refinement. HERO jointly models multi-level semantics and enhances video-language alignment via semantic-guided visual filtering and contrastive masked text refinement. Extensive experiments on both standard and open vocabulary benchmarks demonstrate that HERO consistently surpasses state-of-the-art methods, particularly under open-vocabulary scenarios, validating its strong generalization capability and underscoring the significance of OV-TSGV as a new research direction.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38256", "url": null, "sourceid": 43004, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36226, "uid": "a3a72a1e15a03471808f31ba3172e71c", "name": "TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos", "authors": [{"id": 181784, "fullname": "Jinpeng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181784?format=json", "institution": "National University of Singapore"}, {"id": 181783, "fullname": "Yukang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181783?format=json", "institution": "National University of Singapore"}, {"id": 184494, "fullname": "Yutong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184494?format=json", "institution": "National University of Singapore"}, {"id": 184495, "fullname": "Xingyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184495?format=json", "institution": "National University of Singapore"}], "abstract": "Reconstructing humans and their surrounding environments in a globally consistent 4D space is essential for comprehensive perception. However, prior works typically assume single-view inputs or decouple humans, scenes, and cameras, making them unable to recover coherent geometry, stable motion, and physically aligned trajectories. These limitations motivate us to introduce a new task: unified human\u2013scene\u2013camera reconstruction from multi-view videos, which aims to jointly estimate dynamic humans, static scenes, and camera poses in one global coordinate frame. We propose TROPHIES--Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos--a unified framework tailored for this task. TROPHIES features a Human Branch that models human through temporal and spatial reasoning, and a Scene Branch that reconstructs static geometry with human-aware attention. A global alignment and optimization module couples both branches by enforcing scale consistency, contact priors, and cross-view temporal coherence. Experiments on EgoHuman and EgoExo4D demonstrate that TROPHIES achieves globally aligned, physically plausible 4D reconstructions and consistently outperforms existing paradigms in both global fidelity and human\u2013scene consistency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36226", "url": null, "sourceid": 46124, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40001, "uid": "2a3228854c6f47213f364faafb149166", "name": "CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models", "authors": [{"id": 193273, "fullname": "Nan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/193273?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 187139, "fullname": "Huiqun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187139?format=json", "institution": "ZGC Lab"}, {"id": 193274, "fullname": "Yaoyan Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193274?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 87605, "fullname": "Di Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87605?format=json", "institution": "Beihang University"}], "abstract": "Multimodal large language models (MLLMs) achieve remarkable progress in cross-modal perception and reasoning, yet a fundamental question remains unresolved: should the vision encoder be fine-tuned or frozen? Despite the success of models such as LLaVA and Qwen-VL, inconsistent design choices and heterogeneous training setups hinder a unified understanding of visual fine-tuning (VFT) in MLLMs. Through a configuration-aligned benchmark, we find that existing VFT methods fail to consistently outperform the frozen baseline across multimodal tasks. Our analysis suggests that this instability arises from visual preference conflicts, where the context-agnostic nature of vision encoders induces divergent parameter updates under diverse multimodal context. To address this issue, we propose the Context-aware Visual Fine-tuning (CoVFT) framework, which explicitly incorporates multimodal context into visual adaptation. By integrating a Context Vector Extraction (CVE) and a Contextual Mixture-of-Experts (CoMoE) module, CoVFT decomposes conflicting optimization signals and enables stable, context-sensitive visual updates. Extensive experiments on 12 multimodal benchmarks demonstrate that CoVFT achieves state-of-the-art performance with superior stability. Notably, fine-tuning a 7B MLLM with CoVFT surpasses the average performance of its 13B counterpart, revealing substantial untapped potential in visual encoder optimization within MLLMs. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40001", "url": null, "sourceid": 35085, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39696, "uid": "eb035214c9feb365b3751ca518cf752c", "name": "FlashMesh: Faster and Better Autoregressive Mesh Synthesis via Structured Speculation", "authors": [{"id": 192669, "fullname": "Tingrui Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192669?format=json", "institution": "South China University of Technology"}, {"id": 145959, "fullname": "Yiheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/145959?format=json", "institution": "Tencent VISVISE"}, {"id": 192670, "fullname": "Chen Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192670?format=json", "institution": "South China University of Technology"}, {"id": 192671, "fullname": "Chuan Ping", "url": "http://cvpr.thecvf.com/api/miniconf/users/192671?format=json", "institution": "Zhejiang University"}, {"id": 192672, "fullname": "Zixing Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192672?format=json", "institution": "Harbin Institute of Technology"}, {"id": 191476, "fullname": "Le Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191476?format=json", "institution": null}, {"id": 88239, "fullname": "Yuwang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88239?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 86749, "fullname": "Ronggang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86749?format=json", "institution": "Peking University Shenzhen Graduate School"}, {"id": 73862, "fullname": "Shengfeng He", "url": "http://cvpr.thecvf.com/api/miniconf/users/73862?format=json", "institution": "Singapore Management University"}], "abstract": "Autoregressive models can generate high-quality 3D meshes by sequentially producing vertices and faces, but their token-by-token decoding results in slow inference, limiting practical use in interactive and large-scale applications.We present FlashMesh, a fast and high-fidelity mesh generation framework that rethinks autoregressive decoding through a predict-correct-verify paradigm. The key insight is that mesh tokens exhibit strong structural and geometric correlations that enable confident multi-token speculation. FlashMesh leverages this by introducing a speculative decoding scheme tailored to the commonly used hourglass transformer architecture, enabling parallel prediction across face, point, and coordinate levels.Extensive experiments show that FlashMesh achieves up to a 2$\\times$ speedup over standard autoregressive models while also improving generation fidelity. Our results demonstrate that structural priors in mesh data can be systematically harnessed to accelerate and enhance autoregressive generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39696", "url": null, "sourceid": 46690, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37876, "uid": "7d3d99bab1c841a23a2d7b50ebd7b7bb", "name": "SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs", "authors": [{"id": 188466, "fullname": "Koonting Yip", "url": "http://cvpr.thecvf.com/api/miniconf/users/188466?format=json", "institution": "University of Macau"}, {"id": 188467, "fullname": "Qiyan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188467?format=json", "institution": "Institute of automation, Chinese academy of science; Shanghai Jiaotong University"}, {"id": 183064, "fullname": "Wenhao Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183064?format=json", "institution": "University of Science and Technology of China"}, {"id": 188468, "fullname": "Liangyu Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188468?format=json", "institution": "Hefei University of Technology"}, {"id": 188469, "fullname": "Mingkai LI", "url": "http://cvpr.thecvf.com/api/miniconf/users/188469?format=json", "institution": "National University of Singapore"}, {"id": 188470, "fullname": "Xiaofeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188470?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 90089, "fullname": "Jianmin Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/90089?format=json", "institution": "University of Science and Technology of China"}, {"id": 90108, "fullname": "Yanyong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90108?format=json", "institution": "Rutgers University, Newark"}, {"id": 188471, "fullname": "Qing Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188471?format=json", "institution": "School of Civil Engineering, Hefei University of Technology"}, {"id": 188472, "fullname": "Ka-Veng Yuen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188472?format=json", "institution": "University of Macau"}], "abstract": "3D Large Vision-Language Models (3D LVLMs) built upon Large Language Models (LLMs) have achieved remarkable progress across various multimodal tasks. However, their inherited position-dependent modeling mechanism, Rotary Position Embedding (RoPE), remains suboptimal for 3D multimodal understanding. The vanilla RoPE formulation fails to preserve essential three-dimensional spatial structures when encoding 3D tokens, and its relative distance computation overlooks angular dependencies hindering the model\u2019s ability to capture directional variations in visual representations. To overcome these limitations, we introduce Spherical Coordinate\u2013based Positional Embedding (SoPE). Our method maps point-cloud token indices into a 3D spherical coordinate space, enabling unified modeling of spatial locations and directional angles. This formulation preserves the inherent geometric structure of point-cloud data, enhances spatial awareness, and yields more consistent and expressive geometric representations for multimodal learning. In addition, we introduce a multi-scale frequency mixing strategy to fuse feature information across different frequency domains. Experimental results on multiple 3D scene benchmarks validate the effectiveness of our approach, while real-world deployment experiments further demonstrate its strong generalization capability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37876", "url": null, "sourceid": 31529, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37011, "uid": "69d0353e88f579cded20ed0c14345371", "name": "PriVi: Towards a General-Purpose Video Model for Primate Behavior in the Wild", "authors": [{"id": 164688, "fullname": "Felix B Mueller", "url": "http://cvpr.thecvf.com/api/miniconf/users/164688?format=json", "institution": "University of Goettingen"}, {"id": 186467, "fullname": "Jan Frederik Meier", "url": "http://cvpr.thecvf.com/api/miniconf/users/186467?format=json", "institution": "Georg-August Universit\u00e4t G\u00f6ttingen"}, {"id": 186468, "fullname": "Timo L\u00fcddecke", "url": "http://cvpr.thecvf.com/api/miniconf/users/186468?format=json", "institution": "University of G\u00f6ttingen"}, {"id": 186469, "fullname": "Richard Vogg", "url": "http://cvpr.thecvf.com/api/miniconf/users/186469?format=json", "institution": "Georg-August Universit\u00e4t G\u00f6ttingen"}, {"id": 186470, "fullname": "Roger Freixanet", "url": "http://cvpr.thecvf.com/api/miniconf/users/186470?format=json", "institution": "Georg-August Universit\u00e4t G\u00f6ttingen"}, {"id": 186471, "fullname": "Valentin Hassler", "url": "http://cvpr.thecvf.com/api/miniconf/users/186471?format=json", "institution": "Georg-August Universit\u00e4t G\u00f6ttingen"}, {"id": 186472, "fullname": "Tiffany Bosshard", "url": "http://cvpr.thecvf.com/api/miniconf/users/186472?format=json", "institution": null}, {"id": 186473, "fullname": "Elif Karakoc", "url": "http://cvpr.thecvf.com/api/miniconf/users/186473?format=json", "institution": "Deutsches Primatenzentrum (DPZ), Leibnitz-Institut f\u00fcr Primatenforschung"}, {"id": 186474, "fullname": "William O&amp;#x27;Hearn", "url": "http://cvpr.thecvf.com/api/miniconf/users/186474?format=json", "institution": "University of Exeter"}, {"id": 186475, "fullname": "Sofia Pereira", "url": "http://cvpr.thecvf.com/api/miniconf/users/186475?format=json", "institution": "Georg-August Universit\u00e4t G\u00f6ttingen"}, {"id": 186476, "fullname": "Sandro Sehner", "url": "http://cvpr.thecvf.com/api/miniconf/users/186476?format=json", "institution": "Konrad Lorenz Institute of Ethology"}, {"id": 186477, "fullname": "Kaja Wierucka", "url": "http://cvpr.thecvf.com/api/miniconf/users/186477?format=json", "institution": "German Primate Center"}, {"id": 186478, "fullname": "Judith Burkart", "url": "http://cvpr.thecvf.com/api/miniconf/users/186478?format=json", "institution": "Universit\u00e4t Z\u00fcrich"}, {"id": 186479, "fullname": "Claudia Fichtel", "url": "http://cvpr.thecvf.com/api/miniconf/users/186479?format=json", "institution": "Deutsches Primatenzentrum (DPZ), Leibnitz-Institut f\u00fcr Primatenforschung"}, {"id": 186480, "fullname": "Julia Fischer", "url": "http://cvpr.thecvf.com/api/miniconf/users/186480?format=json", "institution": "Georg-August Universit\u00e4t G\u00f6ttingen"}, {"id": 186481, "fullname": "Alexander Gail", "url": "http://cvpr.thecvf.com/api/miniconf/users/186481?format=json", "institution": "Georg-August Universit\u00e4t G\u00f6ttingen"}, {"id": 186482, "fullname": "Catherine Hobaiter", "url": "http://cvpr.thecvf.com/api/miniconf/users/186482?format=json", "institution": "University of St. Andrews"}, {"id": 186483, "fullname": "Julia Ostner", "url": "http://cvpr.thecvf.com/api/miniconf/users/186483?format=json", "institution": "Georg-August Universit\u00e4t G\u00f6ttingen"}, {"id": 186484, "fullname": "Liran Samuni", "url": "http://cvpr.thecvf.com/api/miniconf/users/186484?format=json", "institution": "Deutsches Primatenzentrum (DPZ), Leibnitz-Institut f\u00fcr Primatenforschung"}, {"id": 186485, "fullname": "Oliver Sch\u00fclke", "url": "http://cvpr.thecvf.com/api/miniconf/users/186485?format=json", "institution": "Georg-August Universit\u00e4t G\u00f6ttingen"}, {"id": 186486, "fullname": "Neda Shahidi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186486?format=json", "institution": "Georg-August Universit\u00e4t G\u00f6ttingen"}, {"id": 152323, "fullname": "Erin G. Wessling", "url": "http://cvpr.thecvf.com/api/miniconf/users/152323?format=json", "institution": "German Primate Center"}, {"id": 186487, "fullname": "Alexander Ecker", "url": "http://cvpr.thecvf.com/api/miniconf/users/186487?format=json", "institution": "University of G\u00f6ttingen"}], "abstract": "Non-human primates are our closest living relatives, and analyzing their behavior is central to research in cognition, evolution, and conservation. Computer vision could greatly aid this research, but existing methods often rely on human-centric pretrained models and focus on single datasets, which limits generalization. We address this limitation by shifting from a model-centric to a data-centric approach and introduce PriVi, a large-scale primate-centric video pretraining dataset. PriVi contains 424 hours of curated video, combining 174 hours from behavioral research across 11 settings with 250 hours of diverse web-sourced footage, assembled through a scalable data curation pipeline. We continue pretraining V-JEPA on PriVi to learn primate-specific representations and evaluate it using a lightweight frozen classifier. Across four benchmark datasets \u2014 ChimpACT, PanAf500, BaboonLand, and ChimpBehave \u2014 our approach consistently outperforms prior work, including fully finetuned baselines, and scales favorably with fewer labels. These results demonstrate that primate-centric pretraining substantially improves data efficiency and generalization, making it a promising approach for low-label applications. Code, models, and the majority of the dataset will be made available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37011", "url": null, "sourceid": 43863, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39811, "uid": "94cc372d721433c5dcb08a5c499432cb", "name": "TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction", "authors": [{"id": 180899, "fullname": "Fengyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180899?format=json", "institution": "The University of Queensland"}, {"id": 192906, "fullname": "Tianjun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192906?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 192907, "fullname": "Kasra Khosoussi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192907?format=json", "institution": "The University of Queensland"}, {"id": 128732, "fullname": "Zheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128732?format=json", "institution": "Harbin Institute of Technology"}, {"id": 90777, "fullname": "Zi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90777?format=json", "institution": "University of Queensland"}, {"id": 158034, "fullname": "Yadan Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/158034?format=json", "institution": "The University of Queensland"}], "abstract": "3D vision foundation models have shown strong generalization in reconstructing key 3D attributes from uncalibrated images through a single feed-forward pass. However, when deployed in online settings such as driving scenarios, predictions are made over temporal windows, making it non-trivial to maintain consistency across time. Recent strategies align consecutive predictions by solving global transformation, yet our analysis reveals their fundamental limitations in assumption validity, local alignment scope, and robustness under noisy geometry. In this work, we propose a higher-DOF and long-term alignment framework based on Thin Plate Spline, leveraging globally propagated control points to correct spatially varying inconsistencies. In addition, we adopt a point-agnostic submap registration design that is inherently robust to noisy geometry predictions. The proposed framework is fully plug-and-play, compatible with diverse 3D foundation models and camera configurations (e.g., monocular or surround-view). Extensive experiments demonstrate that our method consistently yields more coherent geometry and lower trajectory errors across multiple datasets, backbone models, and camera setups, highlighting its robustness and generality. Codes are provided in the supplementary material and will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39811", "url": null, "sourceid": 45720, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38994, "uid": "4a2cde3ca35a41a388c8cc1e19edcbdb", "name": "Explore with Long-term Memory: A Benchmark and Multimodal LLM-based Reinforcement Learning Framework for Embodied Exploration", "authors": [{"id": 160771, "fullname": "sen wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/160771?format=json", "institution": "East China Normal University"}, {"id": 163349, "fullname": "Bangwei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/163349?format=json", "institution": "East China Normal University"}, {"id": 191143, "fullname": "Zhenkun Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191143?format=json", "institution": "East China Normal University"}, {"id": 89127, "fullname": "Lizhuang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/89127?format=json", "institution": "Dept. of Computer Sci. &amp; Eng., Shanghai Jiao Tong University"}, {"id": 191144, "fullname": "Xuhong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191144?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 135330, "fullname": "Yuan Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/135330?format=json", "institution": "East China Normal University"}, {"id": 86818, "fullname": "Xin Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/86818?format=json", "institution": "East China Normal University"}], "abstract": "An ideal embodied agent should possess lifelong learning capabilities to handle long-horizon and complex tasks, enabling continuous operation in general environments. This not only requires the agent to accurately accomplish given tasks but also to leverage long-term episodic memory to optimize decision-making. However, existing mainstream one-shot embodied tasks primarily focus on task completion results, neglecting the crucial process of exploration and memory utilization. To address this, we propose Long-term Memory Embodied Exploration (LMEE), which aims to unify the agent\u2019s exploratory cognition and decision-making behaviors to promote lifelong learning. We further construct a corresponding dataset and benchmark, LMEE-Bench, incorporating multi-goal navigation and memory-based question answering to comprehensively evaluate both the process and outcome of embodied exploration. To enhance the agent\u2019s memory recall and proactive exploration capabilities, we propose MemoryExplorer, a novel method that fine-tunes a multimodal large language model through reinforcement learning to encourage active memory querying. By incorporating a multi-task reward function that includes action prediction, frontier selection, and question answering, our model achieves proactive exploration. Extensive experiments against state-of-the-art embodied exploration models demonstrate that our approach achieves significant advantages in long-horizon embodied tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38994", "url": null, "sourceid": 31364, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36286, "uid": "0f1fc687aa65dc3bfee3c053472ce62a", "name": "VKG-QA: Visual Knowledge Graph-based Question Answer for Large  Multimodal Models", "authors": [{"id": 184681, "fullname": "Yuntao Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/184681?format=json", "institution": "Shandong University"}, {"id": 184682, "fullname": "Yiming Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184682?format=json", "institution": "Shandong University"}, {"id": 184683, "fullname": "Renshuo Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184683?format=json", "institution": "Shandong University"}, {"id": 184684, "fullname": "Jincheng Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/184684?format=json", "institution": "Shandong University"}, {"id": 175457, "fullname": "Yijing Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/175457?format=json", "institution": "Shandong University"}, {"id": 184685, "fullname": "Yue Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184685?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 87494, "fullname": "Bo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87494?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 184686, "fullname": "Qian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184686?format=json", "institution": "Shandong University"}, {"id": 184687, "fullname": "Lizhen Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/184687?format=json", "institution": "Shandong University"}], "abstract": "Understanding and reasoning over structured knowledge is a fundamental capability for intelligent systems. While Large Language Models (LLMs) have leveraged textual knowledge graphs for relational reasoning, linearizing graph structures into text often leads to token inefficiency and loss of higher-order relational cues. Inspired by the advances of Large Multimodal Model to capture higher-order relational structures explicitly novel paradigm of \\textit{visualized knowledge representation}, where knowledge graphs are transformed into graphical visualizations that LMMs can directly perceive and reason over. To systematically evaluate this capability, we introduce \\textbf{VKG-QA}, a benchmark for \\textit{Visual Knowledge Graph-based Question Answering}, covering three major categories and fourteen subtasks. VKG-QA is constructed via a semi-automatic pipeline ensuring high-quality, semantically aligned, and visually clear data. We evaluate 19 representative LMMs on VKG-QA and perform extensive quantitative and qualitative analyses. Results reveal that current models struggle with visualized relational understanding, graph-specific comprehension remains challenging, and closed-source models significantly outperform open-source counterparts. VKG-QA thus highlights critical limitations in current LMMs and provides a scalable platform for advancing graph-aware visual reasoning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36286", "url": null, "sourceid": 36070, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40072, "uid": "c556184b3fe2087834850b68fa435cee", "name": "GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing", "authors": [{"id": 190890, "fullname": "Aoran Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190890?format=json", "institution": "Harbin Institute of Technology"}, {"id": 145520, "fullname": "Shihao Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/145520?format=json", "institution": "Wuhan University"}, {"id": 185507, "fullname": "Yonghao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185507?format=json", "institution": "Link\u00f6ping University"}, {"id": 193442, "fullname": "Yexian Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/193442?format=json", "institution": "Nanjing University of Information Science and Technology"}, {"id": 187170, "fullname": "Hongruixuan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187170?format=json", "institution": "The University of Tokyo"}, {"id": 152106, "fullname": "Naoto Yokoya", "url": "http://cvpr.thecvf.com/api/miniconf/users/152106?format=json", "institution": "The University of Tokyo"}], "abstract": "Recent advances in multimodal large language models (MLLMs) have accelerated progress in domain-oriented AI, yet their development in geoscience and remote sensing (RS) remains constrained by distinctive challenges: wide-ranging disciplinary knowledge, heterogeneous sensor modalities, and a fragmented spectrum of tasks. To bridge these gaps, we introduce GeoMMBench, a comprehensive multimodal question-answering benchmark covering diverse RS disciplines, sensors, and tasks, enabling broader and more rigorous evaluation than prior benchmarks. Using GeoMMBench, we assess 36 open-source and proprietary large language models (LLMs), uncovering systematic deficiencies in domain knowledge, perceptual grounding, and reasoning\u2014capabilities essential for expert-level geospatial interpretation. Beyond evaluation, we propose GeoMMAgent, a multi-agent framework that strategically integrates retrieval, perception, and reasoning through domain-specific RS models and tools. Extensive experimental results demonstrate that GeoMMAgent significantly outperforms standalone LLMs, underscoring the importance of tool-augmented agents for dynamically tackling complex geoscience and RS challenges.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40072", "url": null, "sourceid": 45302, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39399, "uid": "46dae4d4627570456b79473173f92f0f", "name": "Humanoid Generative Pre-Training for Zero-Shot Motion Tracker", "authors": [{"id": 107000, "fullname": "Zekun Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/107000?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 145031, "fullname": "Xuchuan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/145031?format=json", "institution": "Beihang University"}, {"id": 158677, "fullname": "Jilong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158677?format=json", "institution": "Galbot Co. Ltd."}, {"id": 191996, "fullname": "Chenghuai Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/191996?format=json", "institution": "Galbot"}, {"id": 191997, "fullname": "Yunrui Lian", "url": "http://cvpr.thecvf.com/api/miniconf/users/191997?format=json", "institution": "Tsinghua University"}, {"id": 186078, "fullname": "Wenyao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186078?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 191998, "fullname": "XinQiang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191998?format=json", "institution": null}, {"id": 74064, "fullname": "He Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/74064?format=json", "institution": "Galbot"}, {"id": 73517, "fullname": "Li Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/73517?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "We introduce Humanoid-GPT, the first GPT-style humanoid motion Transformer trained with causal attention on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility\u2013generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks arbitrary humans executing highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier, demonstrating robust zero-shot generalization to unseen tasks while simultaneously tracking highly dynamic and complex motions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39399", "url": null, "sourceid": 40069, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38871, "uid": "2b9eb12abdcf58e92fcf797a1eb2983d", "name": "MM-OVSeg: Multimodal Optical\u2013SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing", "authors": [{"id": 174921, "fullname": "YIMIN WEI", "url": "http://cvpr.thecvf.com/api/miniconf/users/174921?format=json", "institution": "The University of Tokyo"}, {"id": 190890, "fullname": "Aoran Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190890?format=json", "institution": "Harbin Institute of Technology"}, {"id": 187170, "fullname": "Hongruixuan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187170?format=json", "institution": "The University of Tokyo"}, {"id": 190891, "fullname": "Junshi Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/190891?format=json", "institution": "RIKEN"}, {"id": 152106, "fullname": "Naoto Yokoya", "url": "http://cvpr.thecvf.com/api/miniconf/users/152106?format=json", "institution": "The University of Tokyo"}], "abstract": "Open-vocabulary segmentation enables pixel-level recognition from an open set of textual categories, allowing generalization beyond fixed classes. Despite great potential in remote sensing, progress in this area remains largely limited to clear-sky optical data and struggles under cloudy or haze-contaminated conditions. We present MM-OVSeg, a multimodal Optical\u2013SAR fusion framework for resilient open-vocabulary segmentation under adverse weather conditions. MM-OVSeg leverages the complementary strengths of the two modalities\u2014optical imagery provides rich spectral semantics, while synthetic aperture radar (SAR) offers cloud-penetrating structural cues. To address the cross-modal domain gap and the limited dense prediction capability of current vision\u2013language models, we propose two key designs: a cross-modal unification process for multi-sensor representation alignment, and a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned multimodal segmentation. Extensive experiments demonstrate that MM-OVSeg achieves superior robustness and generalization across diverse cloud conditions. All dataset and code will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38871", "url": null, "sourceid": 44442, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39674, "uid": "4e4dfebee38dd25062b6888505bcca50", "name": "DarkAct: A RGB-Thermal Dataset and Fusion Framework for Multimodal Low-Light Action Recognition", "authors": [{"id": 172731, "fullname": "Yuanjun Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/172731?format=json", "institution": "LIESMARS, Wuhan Unversity"}, {"id": 190890, "fullname": "Aoran Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190890?format=json", "institution": "Harbin Institute of Technology"}, {"id": 192624, "fullname": "Liqian Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192624?format=json", "institution": "Central China Normal University"}, {"id": 89787, "fullname": "Zhigang Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89787?format=json", "institution": "Wuhan University"}], "abstract": "Human action recognition (HAR) in low-light environments remains challenging due to degraded visibility, illumination variance, and loss of appearance cues. We introduce DarkAct, a large-scale and high-quality RGB\u2013thermal video dataset purpose-built for multimodal action recognition under low illumination. DarkAct contains 12,778 paired RGB\u2013thermal videos covering 27 human actions across diverse viewpoints and scenes, offering a novel and comprehensive benchmark for understanding human actions in dark environments. We conduct extensive experiments on DarkAct, systematically benchmarking unimodal HAR models, multimodal fusion frameworks, and vision-language foundation models. Their limited performance on DarkAct underscores the urgent need for more robust perception systems under adverse illumination. To address this, we propose DarkAct-Net, an RGB\u2013thermal fusion framework that enhances human-centric representation and achieves adaptive cross-modal fusion, enabling robust and fine-grained action recognition across diverse lighting conditions. All dataset and code will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39674", "url": null, "sourceid": 39317, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38314, "uid": "8bb136fd832b36830cb250e66c169811", "name": "Parameter-Efficient Adaptation for MLLMs via Implicit Modality Decomposition", "authors": [{"id": 189578, "fullname": "Mingfang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189578?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 89528, "fullname": "Yunhong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89528?format=json", "institution": "Beihang University"}, {"id": 189579, "fullname": "Lu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189579?format=json", "institution": "Beihang University"}, {"id": 87568, "fullname": "Jiaxin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87568?format=json", "institution": "Beihang University"}], "abstract": "Parameter-efficient fine-tuning (PEFT) has become a compelling approach for adapting large language models (LLMs) into multimodal large language models (MLLMs), enabling them to handle diverse modalities with substantially lower memory and computational costs. However, most existing PEFT methods neglect the issue of modality-imbalanced learning, which is characterized by the excessive dominance of text modality in updating parameters, thus incurring insufficient learning of non-text modalities and leading to performance degradation. To address this issue, we propose a novel parameter-efficient adaptation method for MLLMs, namely Implicit Modality Decomposition (IMoD), based on LoRA. It firstly decomposes the learnable parameters into the non-overlapped text-specific, non-text-specific and modality-sharing components, thereby alleviating modality imbalance. To further guide the optimization of these components toward specific modalities, we propose Modality-Specific Decoupling Constraint that suppresses cross-modal interference among modality-specific parameters, and Modality-Agnostic Alignment Constraint that encourages modality-sharing component to capture well-aligned, modality-invariant semantics. Extensive experiments across diverse multimodal settings and LLM architectures demonstrate that our method consistently delivers significant performance gains, particularly achieving an averaged 3.3\\% improvements on the audio-visual-text tasks without sacrificing the parameter and inference efficiency. We will release the source code upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38314", "url": null, "sourceid": 37922, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36506, "uid": "1bdbe1fb986c2b7b9c3e31a1de75d31b", "name": "Making the Classification Explanation Faithful to the Confidence Score", "authors": [{"id": 180728, "fullname": "Jianxun Mi", "url": "http://cvpr.thecvf.com/api/miniconf/users/180728?format=json", "institution": "Chongqing University of Posts and Telecommunications"}, {"id": 185233, "fullname": "Lu Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185233?format=json", "institution": "Chongqing University of Posts and Telecommunications"}, {"id": 86107, "fullname": "Weisheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86107?format=json", "institution": "Chongqing University of Posts and Telecommunications"}], "abstract": "Deep Neural Networks (DNNs) have revolutionized numerous industries, yet their decision-making processes remain largely opaque. Most existing explanation methods visualize the importance of image regions that influence a classifier's decisions, but they predominantly focus on identifying regions with positive contributions, often overlooking those with negative impacts. In this paper, we introduce a novel black-box explanation method, the Metropolis-Hastings Explainer (MHE), designed to provide confidence-faithful explanations. MHE enhances the fidelity of explanations by ensuring that the explained regions closely align with the original confidence score, sampling instances that best match the classifier\u2019s confidence. Furthermore, MHE improves sampling efficiency by utilizing existing valid samples to explore more potential valid ones, reducing computational overhead. To enhance the clarity of explanations, MHE prioritizes valid samples with smaller areas when other factors are equal, thereby reducing the explanation area. Building upon the MHE framework, we propose two extensions: MHE-e, which focuses exclusively on regions with positive contributions, and MHE-pro, which refines explanation quality by integrating multi-scale information. MHE-pro progressively regions, optimizing both sampling efficiency and explanation quality. Experimental results demonstrate that MHE delivers superior and stable explanation quality across various models, including ResNet50, VGG16, ViT DINO, and CLIP, on datasets such as ImageNet, CUB-200-2011, and VOC2012, providing explanations that closely approximate the original classification confidence.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36506", "url": null, "sourceid": 38153, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37470, "uid": "e5921a80ed7efb78f3d10d363639f8d4", "name": "PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models", "authors": [{"id": 187532, "fullname": "Yuanhao Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/187532?format=json", "institution": "Fuzhou University"}, {"id": 181212, "fullname": "Shaofeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181212?format=json", "institution": "University of Science and Technology of China"}, {"id": 75577, "fullname": "Xiaosong Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/75577?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 130772, "fullname": "Qi Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/130772?format=json", "institution": "The Hong Kong University of Science and Technology"}], "abstract": "The development of 3D Vision-Language Models (VLMs), crucial for applications in robotics, autonomous driving, and augmented reality, is severely constrained by the scarcity of paired 3D-text data. Existing methods rely solely on next-token prediction loss, using only language tokens for supervision. This results in inefficient utilization of limited 3D data and leads to a significant degradation and loss of valuable geometric information in intermediate representations.To address these limitations, we propose {\\mname}, a novel feature-level alignment regularization method. {\\mname} explicitly supervises intermediate point cloud representations to preserve fine-grained 3D geometric-semantic information throughout the language modeling process. Specifically, we constrain the intermediate point cloud tokens within the LLM to align with visual input tokens via a consistency loss. By training only a lightweight alignment projector and LoRA adapters, {\\mname} achieves explicit feature-level supervision with minimal computational overhead, effectively preventing geometric degradation.Extensive experiments on ModelNet40 and Objaverse datasets demonstrate that our method achieves \\textbf{2.08} pp improvement on average for classification tasks, with a substantial \\textbf{7.50} pp gain on the challenging open-vocabulary Objaverse classification task and \\textbf{4.88} pp improvement on 3D object captioning evaluated by Qwen2-72B-Instruct, validating the effectiveness of {\\mname}.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37470", "url": null, "sourceid": 38470, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36736, "uid": "889823b59e4946c262348782d54afc70", "name": "HiFICL: High-Fidelity In-Context Learning for Multimodal Tasks", "authors": [{"id": 185754, "fullname": "Xiaoyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185754?format=json", "institution": ""}, {"id": 185755, "fullname": "Yuhang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185755?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 185756, "fullname": "xuanshuo kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185756?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 185757, "fullname": "zheng luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185757?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 185758, "fullname": "Fangqi Lou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185758?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 185759, "fullname": "\u5434\u6653\u534e \u5434\u6653\u534e", "url": "http://cvpr.thecvf.com/api/miniconf/users/185759?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 185760, "fullname": "Zihan Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/185760?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "In-Context Learning (ICL) is a significant paradigm for Large Multimodal Models (LMMs), using a few in-context demonstrations (ICDs) for new task adaptation. However, its performance is sensitive to demonstration configurations and computationally expensive. Mathematically, the influence of these demonstrations can be decomposed into a dynamic mixture of the standard attention output and the context values. Current approximation methods simplify this process by learning a \"shift vector\". Inspired by the exact decomposition, we introduce **Hi**gh-**F**idelity **I**n-**C**ontext **L**earning (**HiFICL**) to more faithfully model the ICL mechanism. HiFICL consists of three key components: 1) a set of \"virtual key-value pairs\" injected into each attention head to act as a learnable context, 2) a low-rank factorization for stable and regularized training, and 3) a simple end-to-end training objective. From another perspective, this mechanism constitutes a form of context-aware Parameter-Efficient Fine-Tuning (PEFT). Extensive experiments show that HiFICL consistently outperforms existing approximation methods on several multimodal benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36736", "url": null, "sourceid": 36174, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36946, "uid": "294630c7fceb95fecbb446b26ea16b4b", "name": "SmokeSVD: Smoke Reconstruction from A Single View via Progressive Novel View Synthesis and Refinement with Diffusion Models", "authors": [{"id": 180243, "fullname": "Chen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180243?format=json", "institution": "Tianjin University of Technology"}, {"id": 186277, "fullname": "Shanshan Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186277?format=json", "institution": "Zhejiang Normal University"}, {"id": 186278, "fullname": "Sheng Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186278?format=json", "institution": "Zhejiang Normal University"}, {"id": 178454, "fullname": "Jianmin Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/178454?format=json", "institution": "Zhejiang Normal University"}, {"id": 186279, "fullname": "Yibo Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186279?format=json", "institution": "Tianjin University of Technology"}, {"id": 186280, "fullname": "Zan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186280?format=json", "institution": "Tianjin University of Technology"}, {"id": 89090, "fullname": "Taku Komura", "url": "http://cvpr.thecvf.com/api/miniconf/users/89090?format=json", "institution": "the University of Hong Kong, University of Hong Kong"}, {"id": 186281, "fullname": "Kemeng Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186281?format=json", "institution": "University of Hong Kong"}], "abstract": "Reconstructing dynamic fluids from sparse views is a long-standing and challenging problem, due to the severe lack of 3D information from insufficient view coverage. While several pioneering approaches have attempted to address this issue using differentiable rendering or novel view synthesis, they are often limited by time-consuming optimization under ill-posed conditions. We propose SmokeSVD, an efficient and effective framework to progressively reconstruct dynamic smoke from a single video by integrating the generative capabilities of diffusion models with physically guided consistency optimization. Specifically, we first propose a physically guided side-view synthesizer based on diffusion models, which explicitly incorporates velocity field constraints to generate spatio-temporally consistent side-view images frame by frame, significantly alleviating the ill-posedness of single-view reconstruction. Subsequently, we iteratively refine novel-view images and reconstruct 3D density fields through a progressive multi-stage process that renders and enhances images from increasing viewing angles, generating high-quality multi-view sequences. Finally, we estimate fine-grained density and velocity fields via differentiable advection by leveraging the Navier-Stokes equations. Our approach supports re-simulation and downstream applications while achieving superior reconstruction quality and computational efficiency compared to state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36946", "url": null, "sourceid": 35014, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40279?format=json"], "related_events_ids": [40279]}, {"id": 39396, "uid": "3bbd8afd89f7b6895b3d201b67f6ff12", "name": "DarkShake-DVS: Event-based Human Action Recognition under Low-light and Shaking Camera Conditions", "authors": [{"id": 184874, "fullname": "Jiaqi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184874?format=json", "institution": "Beijing Institute of Technology"}, {"id": 184872, "fullname": "Qinfu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184872?format=json", "institution": "Beijing Institute of Technology"}, {"id": 88760, "fullname": "Liyuan Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88760?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "Human Action Recognition (HAR) is a fundamental computer vision task with diverse real-world applications. Practical deployments often involve low-light environments and unconstrained 6-DoF camera motion, conditions that degrade visual quality, disrupt temporal coherence, and compromise reliability of existing methods. Event cameras, with high low-light sensitivity and microsecond-level temporal resolution, paired with an inertial measurement unit (IMU), present a promising solution. However, current research faces two key challenges: absence of a benchmark integrating low-light conditions, 6-DoF motion, and synchronized IMU data; and lack of effective motion compensation techniques. To address these, we propose Event\u2013IMU Stabilized HAR (EIS-HAR), with two modules. The first is an EIS module that reduces motion blur via a non-linear warping function to reconstruct a motion-compensated input. The second is a HAR module with a four-stage hybrid architecture to efficiently extract spatiotemporal features for accurate action recognition. To alleviate data scarcity, we introduce DarkShake-DVS, the first large-scale event-based HAR benchmark that includes 18,041 real-world clips captured in low light and intense 6-DoF motion, supplemented by synchronized IMU data. Extensive experiments on three datasets demonstrate consistent superiority of EIS-HAR over state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39396", "url": null, "sourceid": 44323, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36359, "uid": "7e94cbdb257cdd17c06f5ccf9daf9ce0", "name": "EmoThinker: Advancing Visual-Acoustic Emotion Analysis via Structural Token Selection and Chain-of-Thought Reasoning", "authors": [{"id": 184872, "fullname": "Qinfu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184872?format=json", "institution": "Beijing Institute of Technology"}, {"id": 88760, "fullname": "Liyuan Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88760?format=json", "institution": "Beijing Institute of Technology"}, {"id": 183259, "fullname": "Yiwei Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/183259?format=json", "institution": "China University of Petroleum-Beijing at Karamay"}, {"id": 184873, "fullname": "Shaozu Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184873?format=json", "institution": "Meituan"}, {"id": 184874, "fullname": "Jiaqi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184874?format=json", "institution": "Beijing Institute of Technology"}, {"id": 184875, "fullname": "Tianyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184875?format=json", "institution": "Beijing University of Post and Telecommunications"}], "abstract": "Multimodal Emotion Analysis (MEA) is crucial for human-centric AI, yet current methods struggle with two core challenges: the sparse nature of emotional cues across modalities and their inherent temporal asynchrony. Existing approaches, which often rely on implicit fusion, consequently suffer from diluted salient features and entangled representations. To address this issue, we propose EmoThinker, a new framework that advances MEA through explicit, structured reasoning. Our method introduces a structural token selection mechanism to concentrate on pivotal facial regions while refining background context, enhancing visual saliency and efficiency. For audio, an audio evidence extractor aggregates critical paralinguistic features into compact, emotion-rich tokens. More importantly, we enable step-by-step reasoning by constructing a Chain-of-Emotion-Thought dataset, which provides fine-grained annotations for disentangling asynchronous cues and resolving inter-modal conflicts. By decoupling evidence acquisition from reasoning, EmoThinker achieves a more interpretable and robust emotion analysis. Extensive experiments on multiple benchmarks demonstrate that our framework achieves new state-of-the-art performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36359", "url": null, "sourceid": 44126, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40279, "uid": "294630c7fceb95fecbb446b26ea16b4b", "name": "SmokeSVD: Smoke Reconstruction from A Single View via Progressive Novel View Synthesis and Refinement with Diffusion Models", "authors": [{"id": 180243, "fullname": "Chen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180243?format=json", "institution": "Tianjin University of Technology"}, {"id": 186277, "fullname": "Shanshan Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186277?format=json", "institution": "Zhejiang Normal University"}, {"id": 186278, "fullname": "Sheng Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186278?format=json", "institution": "Zhejiang Normal University"}, {"id": 178454, "fullname": "Jianmin Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/178454?format=json", "institution": "Zhejiang Normal University"}, {"id": 186279, "fullname": "Yibo Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186279?format=json", "institution": "Tianjin University of Technology"}, {"id": 186280, "fullname": "Zan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186280?format=json", "institution": "Tianjin University of Technology"}, {"id": 89090, "fullname": "Taku Komura", "url": "http://cvpr.thecvf.com/api/miniconf/users/89090?format=json", "institution": "the University of Hong Kong, University of Hong Kong"}, {"id": 186281, "fullname": "Kemeng Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186281?format=json", "institution": "University of Hong Kong"}], "abstract": "Reconstructing dynamic fluids from sparse views is a long-standing and challenging problem, due to the severe lack of 3D information from insufficient view coverage. While several pioneering approaches have attempted to address this issue using differentiable rendering or novel view synthesis, they are often limited by time-consuming optimization under ill-posed conditions. We propose SmokeSVD, an efficient and effective framework to progressively reconstruct dynamic smoke from a single video by integrating the generative capabilities of diffusion models with physically guided consistency optimization. Specifically, we first propose a physically guided side-view synthesizer based on diffusion models, which explicitly incorporates velocity field constraints to generate spatio-temporally consistent side-view images frame by frame, significantly alleviating the ill-posedness of single-view reconstruction. Subsequently, we iteratively refine novel-view images and reconstruct 3D density fields through a progressive multi-stage process that renders and enhances images from increasing viewing angles, generating high-quality multi-view sequences. Finally, we estimate fine-grained density and velocity fields via differentiable advection by leveraging the Navier-Stokes equations. Our approach supports re-simulation and downstream applications while achieving superior reconstruction quality and computational efficiency compared to state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40279", "url": null, "sourceid": -35014, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/36946?format=json"], "related_events_ids": [36946]}, {"id": 38643, "uid": "da5ef3ca8048cd3799e75b5039802539", "name": "See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles", "authors": [{"id": 181667, "fullname": "Zongru Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181667?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 190372, "fullname": "Rui Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190372?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 190373, "fullname": "Zhiyuan Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/190373?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 190374, "fullname": "Pengzhou Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190374?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 190375, "fullname": "Tianjie Ju", "url": "http://cvpr.thecvf.com/api/miniconf/users/190375?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 190376, "fullname": "Zheng Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190376?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 190377, "fullname": "Lingzhong Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/190377?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 190378, "fullname": "Haiyue Sheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190378?format=json", "institution": "Beijing Institute of Technology"}, {"id": 106927, "fullname": "Zhuosheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/106927?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 178500, "fullname": "Gongshen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/178500?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "The advent of multimodal agents facilitates effective interaction within graphical user interface (GUI), especially in ubiquitous GUI control. However, their inability to reliably execute toggle control instructions remains a key bottleneck. To investigate this, we construct a state control benchmark with binary toggle instructions derived from public datasets. Evaluation results of existing agents demonstrate their notable unreliability, particularly when the current toggle state already matches the desired state. To address the challenge, we propose **St**ate-**a**ware **R**easoning (StaR), a multimodal reasoning method that enables agents to perceive the current toggle state, infer the desired state from the instruction, and act accordingly. Experiments on four multimodal agents demonstrate that StaR can improve toggle instruction execution accuracy by over 30%. Further evaluations on three public agentic benchmarks show that StaR also enhances general agentic task performance. Finally, evaluations on a dynamic environment highlight the potential of StaR for real-world applications. Code, benchmark, and StaR-enhanced agents are available at Anonymous.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38643", "url": null, "sourceid": 36965, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36998, "uid": "def35ca4715a54f7fecd2fb80419572f", "name": "SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding", "authors": [{"id": 186423, "fullname": "Chang-Hsun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186423?format=json", "institution": "National Taiwan University"}, {"id": 157724, "fullname": "Kai-Po Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157724?format=json", "institution": "NVIDIA"}, {"id": 186424, "fullname": "Yu-Yang Sheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186424?format=json", "institution": null}, {"id": 186425, "fullname": "Hung-Kai Chung", "url": "http://cvpr.thecvf.com/api/miniconf/users/186425?format=json", "institution": "National Taiwan University"}, {"id": 186426, "fullname": "Kuei-Chun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186426?format=json", "institution": "National Taiwan University"}, {"id": 89818, "fullname": "Yu-Chiang Frank Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89818?format=json", "institution": "NVIDIA"}], "abstract": "Video Large Language Models (VideoLLMs) have shown remarkable progress in video understanding. However, these models still struggle to effectively perceive and exploit rich temporal information in videos when responding to user queries. Therefore, they often generate descriptions of events that are temporal inconsistent or causally implausible, causing severe hallucination issues. While most prior studies have focused on spatial hallucinations (e.g. object mismatches), temporal reasoning in video understanding remains relatively underexplored. To address this issue, we propose Self-Diagnostic Contrastive Decoding (SEASON), a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. It achieves this by dynamically diagnosing each token's hallucination tendency and applying adaptive contrastive decoding against its corresponding temporal and spatial negatives. Extensive experiments demonstrate that SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks, while further improves VideoLLMs across four general video understanding benchmarks. The code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36998", "url": null, "sourceid": 32270, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39932, "uid": "543d921c3afce2b0608c1de6dfa48ee2", "name": "Diagram2Structure: Unlocking LLMs' Diagram Comprehension through DiagramDiff, an Offline Diagram Structuring Framework", "authors": [{"id": 180929, "fullname": "Haoxiang Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180929?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 193141, "fullname": "Yaokun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193141?format=json", "institution": null}, {"id": 193142, "fullname": "Zeyuan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193142?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 175707, "fullname": "Cangjun Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/175707?format=json", "institution": "Institute of Software Chinese Academy of Sciences"}, {"id": 193143, "fullname": "Qiang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/193143?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 193144, "fullname": "Qingkun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193144?format=json", "institution": "Institute of Software, Chinese Academy of Science"}, {"id": 128482, "fullname": "Xiaoming Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/128482?format=json", "institution": "Institute of Software Chinese Academy of Sciences"}, {"id": 187191, "fullname": "Cuixia Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/187191?format=json", "institution": "institute of software, chinese academy of sciences"}, {"id": 84906, "fullname": "Yu-Kun Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/84906?format=json", "institution": "Cardiff University"}, {"id": 91120, "fullname": "Yong-Jin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/91120?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 193145, "fullname": "Hongan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193145?format=json", "institution": "Institute of software Chinese Academy of Sciences"}], "abstract": "Diagrams are widely used in daily life. However, offline diagrams typically exist in the form of images, lacking structured data representation, which significantly limits their reusability and editability. Current research mainly focuses on supporting basic query tasks for online diagrams and does not meet the semantic understanding and interaction requirements for complex offline diagrams. Although large language models (LLMs) possess powerful reasoning and knowledge integration capabilities, their performance in processing offline diagrams is unsatisfactory due to the inability to accurately understand the structure and content of offline diagrams. To address these issues,we propose DiagramDiff, a framework consisting of a high-precision diagram reconstruction model and an instance-level diagram element recognition model. The framework converts offline diagrams into standardized data structures, enabling LLMs to transition from being unable to understand offline diagrams to becoming intelligent assistants capable of performing tasks such as semantic reasoning, logical validation, and efficient diagram editing. We have constructed a dataset containing diagrams and their corresponding question and answering(Q&A) and editing tasks. Experiments demonstrate that DiagramDiff achieves state-of-the-art performance in diagram reconstruction and recognition tasks, significantly enhancing LLMs' understanding and interaction capabilities with offline diagrams.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39932", "url": null, "sourceid": 41389, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38380, "uid": "f3f64102d3726f3a58c564c77b11117a", "name": "SCE-Depth: A Spherical Compound Eye Framework for Wide FOV Depth Estimation", "authors": [{"id": 180613, "fullname": "Yi Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180613?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 189752, "fullname": "Hao Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/189752?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 189753, "fullname": "Lin Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189753?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 189754, "fullname": "Ranfeng Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189754?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 153349, "fullname": "Qinying Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153349?format=json", "institution": "Shanghai artificial intelligence laboratory"}, {"id": 189755, "fullname": "Leilei Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189755?format=json", "institution": "Shanghai Jiaotong University"}], "abstract": "Accurate depth estimation in wide field is highly desired in applications of autonomous driving, robot vision and drone controls. Biological compound eyes inspire wide Field of View (FOV) depth estimation, yet their artificial implementations face the challenge of modality misalignment. Specifically, the spherical imaging data doesn\u2019t align with the planar neural network, diminishing the learning efficiency. Herein, we propose SCE-Depth, a bio-inspired framework for spherical compound eye depth estimation, which processes spherical images natively on a HEALPix grid using a spherical neural network. This approach achieves a unified $180^\\circ$ FOV while avoiding the errors typically introduced by modality conversion. Additionally, we identify a depth-sensitive gradient feature from the overlapping FOVs of adjacent ommatidia. To exploit it, we introduce a spherical Sobel operator called the Spherical Gradient Feature Extractor (SGFE) and a corresponding Spherical Gradient Loss (SGL), which jointly extract gradient features on the HEALPix grid, enabling gradient-aware depth prediction. Extensive benchmark experiments demonstrate that these strategies enable SCE-Depth to substantially reduce depth estimation error compared to fisheye-based baselines, with particularly large improvements in peripheral accuracy. We also demonstrate the generalization capability of SCE-Depth to other wide FOV data modalities, such as fisheye and panoramic imagery.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38380", "url": null, "sourceid": 39937, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36827, "uid": "7361aba5e3f228181189f62990bdeb73", "name": "FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models", "authors": [{"id": 179906, "fullname": "Xinyuan An", "url": "http://cvpr.thecvf.com/api/miniconf/users/179906?format=json", "institution": "Zhejiang University"}, {"id": 185970, "fullname": "Tao Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185970?format=json", "institution": "Zhejiang University"}, {"id": 185971, "fullname": "gengyun peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185971?format=json", "institution": "Zhejiang University"}, {"id": 185972, "fullname": "Yaobing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185972?format=json", "institution": null}, {"id": 85239, "fullname": "Kui Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/85239?format=json", "institution": "Zhejiang University"}, {"id": 185973, "fullname": "Dongxia Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185973?format=json", "institution": "Zhejiang University"}], "abstract": "Vision-Language-Action (VLA) models are emerging as a cornerstone for robotics, with flow-matching policies like $\\pi_0$ showing great promise in generating smooth, continuous actions. As these models advance, their unique action generation mechanism\u2014the vector field dynamics\u2014presents a critical yet unexplored security vulnerability, particularly backdoor vulnerabilities. Existing backdoor attacks designed for autoregressive discretization VLAs cannot be directly applied to this new continuous dynamics.We introduce FlowHijack, the first backdoor attack framework to systematically target the underlying vector-field dynamics of flow-matching VLAs. Our method combines a novel $\\tau$-conditioned injection strategy, which manipulates the initial phase of the action generation, with a dynamics mimicry regularizer. Experiments demonstrate that FlowHijack achieves high attack success rates using stealthy, context-aware triggers where prior works failed. Crucially, it preserves benign task performance and, by enforcing kinematic similarity, generates malicious actions that are behaviorally indistinguishable from normal actions. Our findings reveal a significant vulnerability in continuous embodied models, highlighting the urgent need for defenses targeting the model's internal generative dynamics.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36827", "url": null, "sourceid": 37417, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39687, "uid": "89b6d1591a3e9a96cee4c4060484c5ca", "name": "Inconsistency-aware Multimodal Schr\u00f6dinger Bridge for Deepfake Localization", "authors": [{"id": 180920, "fullname": "Jiayu Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/180920?format=json", "institution": "Huaqiao University, Xiamen Campus"}, {"id": 192647, "fullname": "Jing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192647?format=json", "institution": "Huaqiao University"}, {"id": 184480, "fullname": "Qi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184480?format=json", "institution": "Tongji University"}, {"id": 192648, "fullname": "Wanlong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192648?format=json", "institution": "Huaqiao University XiaMen"}, {"id": 192649, "fullname": "Jun Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/192649?format=json", "institution": "Wuhan University"}], "abstract": "Audio-visual deepfake localization demands interval-level outputs that serve as temporal evidence. Despite recent progress, symmetric fusion under single-sided or asynchronous forgeries propagates cross-modal noise, degrading high-precision localization. We present IaMSB, an inconsistency-aware multimodal Schr\u00f6dinger Bridge (SB) that jointly estimates cross-modal consistency and performs interval-level localization. Unlike diffusion models, SB minimizes path-distribution discrepancy and yields consistency scores without explicit noise injection or denoising. With the Schr\u00f6dinger Bridge (SB), IaMSB unifies consistency estimation, cross-modal information selection, and bridge-step scheduling in one framework. Specifically, a lightweight coarse bridge first proposes candidate intervals and estimates cross-modal consistency; these statistics select cross-modal witness signals and allocate bridge steps asymmetrically across modalities. A refinement bridge then performs step-tuned fusion and outputs refined, time-aligned intervals. IaMSB anticipates single-sided and asynchronous forgeries and, using bottlenecked cross-modal interaction with step allocation, suppresses noise transfer, avoids unnecessary iterations. Across benchmarks, IaMSB stabilizes strict-IoU boundary precision, raising AP@0.95 by $3\\sim10\\%$, and yields improved high-precision localization, particularly for single-sided forgeries.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39687", "url": null, "sourceid": 37612, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36718, "uid": "35dae000552fd88cce58a6571998738c", "name": "Editprint: General Digital Image Forensics via Editing Fingerprint with Self-Augmentation Training", "authors": [{"id": 185718, "fullname": "Haiwei Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185718?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 179899, "fullname": "Kemou Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/179899?format=json", "institution": "University of Macau"}, {"id": 185719, "fullname": "Yuanman Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185719?format=json", "institution": "Shenzhen University"}, {"id": 91005, "fullname": "Jiantao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/91005?format=json", "institution": "University of Macau"}], "abstract": "Digital image forensics can ensure information credibility in tasks like camera source identification (CSI), synthetic image detection (SID), and social network provenance (SNP). These tasks typically rely on image processing history clues left by in-camera operations, post-capture editing, or synthetic generation. However, most existing forensic methods have obvious limitations: 1) they often only focus on camera-specific traces (e.g., the well-known PRNU), and 2) they demand a substantial amount of annotated training data. To address these constraints, we propose Editprint, a novel general forensic feature that captures highly diverse in- and out-camera processing history clues with minimal unlabeled training data. Ideally, we expect that any images undergoing the same imaging, editing, and transmission processes would yield identical Editprints, and vice versa. To model the in- and out-camera operations, we devise an online editing pool based on self-augmentation strategies. Requiring only minimal (e.g., 10) training data, the editing pool can simulate massive (e.g., 10$^\\text{7}$) editing chains and traces arising from the in-camera processing and the subsequent out-camera operations. To ensure that Editprint exhibits high discriminative capabilities across various editing chains, we propose using textual descriptions of these chains as labels and supervising their Editprints through language-guided contrastive learning. Extensive experiments show Editprint outperforms existing self-supervised forensics, particularly in non-camera applications such as SNP and SID. We hope that Editprint would inspire the forensic community and serve as a novel benchmark for self-supervised forensics.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36718", "url": null, "sourceid": 46471, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36941, "uid": "849dc2aa3a610f5ac8d4080ec9ca7f86", "name": "Forensic-Friendly Image Manipulation via Controllable Latent Diffusion", "authors": [{"id": 173211, "fullname": "Hanyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/173211?format=json", "institution": "Macau University of Science and Technology"}, {"id": 185718, "fullname": "Haiwei Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185718?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 186274, "fullname": "Jinyu Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/186274?format=json", "institution": "Macau University of Science and Technology"}, {"id": 149616, "fullname": "Jianqing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/149616?format=json", "institution": "Macau Univeristy of Science and Technology"}, {"id": 91005, "fullname": "Jiantao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/91005?format=json", "institution": "University of Macau"}], "abstract": "With diffusion models demonstrating superior capabilities in image editing, more users now rely on online servers for content manipulation via textual prompts rather than traditional offline tools. Despite servers attempting to prevent the proliferation of maliciously edited content via active defense like watermarking, this approach is not conducive to passive detection by third-party forensics. To address this limitation, we propose a plug-and-play controllable denoising termed Forensic-Friendly Image Manipulation (FFIM), which simultaneously satisfies user editing requirements while facilitating forensic analysis. Specifically, FFIM comprises three phases: Controllable Projection, Implicit Detection, and Explicit Guidance. Phase I enforces orthogonality between the variance of random noise and image features to ensure clear demarcation between the edited and unedited regions. Phase II implicitly evaluates whether this demarcation meets detection requirements; if not, Phase III explicitly introduces a surrogate detection model and adversarially adjusts the random noise to maximize the feature discrepancy between these regions. Experiments across four datasets demonstrate the superiority of FFIM over baseline methods, achieving up to +6.6\\% F1 in pixel-level localization and +27.3\\% AUC in image-level detection. Importantly, these forensic gains are attained without compromising visual quality, as evidenced by comparable manipulation in both subjective user studies and objective quality assessments. We envision that the proposed method will be widely adopted by generative AI service providers, enabling more comprehensive information authenticity from a passive defensive standpoint.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36941", "url": null, "sourceid": 38134, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39232, "uid": "95831099d5d2171aea50c24de5332f73", "name": "IP-Adapter Is All You Need: Towards Fine-Tuning-Free Diffusion-Based Talking Face Generation", "authors": [{"id": 98628, "fullname": "Hao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/98628?format=json", "institution": "Nanjing University of Information Science and Technology"}, {"id": 191664, "fullname": "Xiangyang Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/191664?format=json", "institution": "Information Engineering University"}, {"id": 191665, "fullname": "Hao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191665?format=json", "institution": "Huai&#x27;an University"}, {"id": 191666, "fullname": "Jiawei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191666?format=json", "institution": "Chongqing University of Post and Telecommunications"}, {"id": 191667, "fullname": "Yi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191667?format=json", "institution": null}, {"id": 191668, "fullname": "Jinwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191668?format=json", "institution": "Nankai University"}], "abstract": "With the rapid advancement of diffusion models, talking face generation has made remarkable progress. However, existing diffusion-based methods still require task-specific fine-tuning and large-scale audiovisual datasets, resulting in high computational costs that hinder accessibility for resource-constrained researchers and small teams. To address this, we propose a fine-tuning-free paradigm that directly performs talking face generation using the pretrained weights of Stable Diffusion and IP-Adapter. This backbone leverages the visual embedding capability of IP-Adapter to mine lip-related semantics from the pretrained Stable Diffusion. To address the challenges of identity drift, synchronization errors, and temporal instability, we also design three trainable-parameter-free components: 1) the Structurist, which explicitly disentangles and reassembles lip and appearance features to mitigate identity drift and appearance distortion; 2) the Structure Controller, which adaptively refines embeddings based on quasi-monotonic motion trends for precise lip synchronization; and 3) the Noise Sensor, which introduces a Gaussian prior to detect and suppress flicker and jitter artifacts and enhance temporal consistency. Experimental results show that our method outperforms existing SOTA approaches in both lip-sync accuracy (at least 0.16 gain in PCLD) and visual fidelity (at least 0.7 improvement in FID), establishing a novel fine-tuning-free diffusion framework for talking face generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39232", "url": null, "sourceid": 39994, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36930, "uid": "efa4e6f5c6359cc2eadc5d731716468e", "name": "Detect Any AI-Counterfeited Text Image", "authors": [{"id": 85280, "fullname": "Chenfan Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85280?format=json", "institution": "South China University of Technology"}, {"id": 89242, "fullname": "Yiwu Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/89242?format=json", "institution": "University of Wisconsin, Madison"}, {"id": 186242, "fullname": "Xuekang Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186242?format=json", "institution": "Sichuan University"}, {"id": 186243, "fullname": "Junchi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186243?format=json", "institution": "Zhejiang University"}, {"id": 186244, "fullname": "Changjiang Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186244?format=json", "institution": "Wuhan University"}, {"id": 131732, "fullname": "Jian liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131732?format=json", "institution": "AntGroup"}, {"id": 85083, "fullname": "Lianwen Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/85083?format=json", "institution": "South China University of Technology"}], "abstract": "The rapid advancement of generative AI enables the creation of highly realistic text images, posing significant security risks from fraud and disinformation. However, research into robust detection is critically hampered by existing datasets that lack scale, diversity, and updated counterfeit techniques, as well as by models that fail to generalize. To address these deficiencies, we introduce DanceText, a large-scale, comprehensive dataset for AI-counterfeited text image detection. Constructed using our novel Creative Proposer pipeline, which automates the generation of diverse and realistic counterfeits, DanceText surpasses previous works by over 100-fold in multiple dimensions. It is the first to include counterfeits from multimodal large models, commercial software, and mobile apps, covering all major paradigms, including full-image generation, regional removal, and editing. Building on this dataset, we propose DS-Net, a novel and effective detection model. It features two key components: a Forensic Decoupling Encoder to extract generator-agnostic artifact features, and a Synergy Denoising Decoder that synergizes image-level classification with instance-level localization. Extensive experiments demonstrate that DS-Net achieves state-of-the-art performance, advancing the field toward robust and unified detection of AI-counterfeited text images in real-world scenarios. Both our code and dataset will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36930", "url": null, "sourceid": 35352, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38806, "uid": "6cfb8bc7dcaae1fe044f03688188c156", "name": "BiOTPrompt: Bidirectional Optimal Transport Guided Prompting for Disease Evolution-aware Radiology Report Generation", "authors": [{"id": 190720, "fullname": "Tengfei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190720?format=json", "institution": "Beijing University of Technology"}, {"id": 190721, "fullname": "Yijian Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190721?format=json", "institution": "University of Technology Sydney"}, {"id": 89711, "fullname": "Boyue Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89711?format=json", "institution": "Beijing University of Technology"}, {"id": 89748, "fullname": "Yongli Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89748?format=json", "institution": "Beijing University of Technology"}, {"id": 91647, "fullname": "Mingjie Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/91647?format=json", "institution": "University of Technology Sydney"}, {"id": 190722, "fullname": "Jinghua Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190722?format=json", "institution": "Beijing University of Technology"}, {"id": 190723, "fullname": "Junbin Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190723?format=json", "institution": "University of Sydney"}, {"id": 89609, "fullname": "Xiaojun Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89609?format=json", "institution": "University of Technology Sydney"}, {"id": 185783, "fullname": "Zhihui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185783?format=json", "institution": "University of Science and Technology of China"}, {"id": 84977, "fullname": "Baocai Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/84977?format=json", "institution": "Beijing University of Technology"}], "abstract": "Radiology report generation (RRG) aims to automatically describe medical images via free-text reports. In clinical practice, comparing current and prior chest X-rays is essential for assessing disease progression, motivating the development of longitudinal RRG methods. However, most existing approaches often struggle to capture fine-grained temporal changes, as they often rely on unidirectional alignments or static reasoning pipelines, overlooking the bidirectional and asymmetric nature of disease evolution. To tackle these challenges, we propose BiOTPrompt, a novel framework for disease evolution-aware radiology report generation, which introduces a Bidirectional Optimal Transport (BiOT) mechanism to explicitly model progression dynamics between historical and current chest X-rays. By analyzing the asymmetry between bidirectional transport plans, BiOTPrompt can identify newly emerged and resolved regions, which are then used to construct dynamic prompts that guide large language models (LLMs) in generating clinically relevant diagnostic reports. Furthermore, we incorporate a vision-language consistency constraint to ensure alignment between visual evidence and textual descriptions, mitigating hallucinations and enhancing factual correctness. Extensive experiments on the Longitudinal-MIMIC dataset demonstrate that BiOTPrompt achieves state-of-the-art performance in both language metrics and clinical relevance, setting a new standard for longitudinal radiology report generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38806", "url": null, "sourceid": 36193, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36369, "uid": "aebecc502251dd3ec108bc7350f7cca8", "name": "AIMDepth: Asymmetric Image-Event Mamba for Monocular Depth Estimation", "authors": [{"id": 180778, "fullname": "Luoxi Jing", "url": "http://cvpr.thecvf.com/api/miniconf/users/180778?format=json", "institution": "Advanced Institute of Big Data"}, {"id": 184894, "fullname": "Dianxi Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/184894?format=json", "institution": "National University of Defense Technology"}, {"id": 145110, "fullname": "YuShe Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/145110?format=json", "institution": "Tsinghua University"}, {"id": 184895, "fullname": "Yuanze Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184895?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 184896, "fullname": "Junze Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184896?format=json", "institution": "National University of Defense Technology"}, {"id": 184897, "fullname": "Yuning Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/184897?format=json", "institution": "Technical University of Munich"}, {"id": 129933, "fullname": "Mengzhu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129933?format=json", "institution": "Hebei University of Technology"}], "abstract": "Monocular depth estimation is critical for applications like autonomous driving and robotics. The complementary properties of event and image modality motivate the fusion-based methods for robust depth estimation. However, existing fusion methods rely on convolutional or attention-based architectures, which either struggle with global dependencies or incur high computational cost, limiting their suitability for long-sequence modeling in depth tasks. Besides, effective image-event fusion remains a key challenge, as most existing methods directly fuse features without addressing the domain gap and differences in representational characteristics between raw events and images, leading to semantic bias and degraded performance. In this work, we propose AIMDepth, an Asymmetric Image-Event Mamba framework for monocular depth estimation, built entirely on state space models to ensure linear computational complexity and accurate prediction. To address input-domain misalignment, we introduce a Spectral Cross-modal Prior Guidance (SCPG) module that performs bidirectional prior injection at the input level. To mitigate representational imbalance between sparse events and dense images, we design an Asymmetric Modal-aware Encoder (AME) that allocates separate encoding paths for each modality and facilitates feature-level alignment tailored to their distinct information densities. To further enhance fusion, we develop a Modality-interactive Local Refinement (ModiLocal) module that enables hierarchical interaction and fine-grained alignment through SSM-based modeling. Extensive experiments on public datasets demonstrate that AIMDepth achieves state-of-the-art performance and strong robustness in complex environments.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36369", "url": null, "sourceid": 38778, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37346, "uid": "f08bc848e028e7f9d65567f2ddc15951", "name": "CHIRP dataset: towards long-term, individual-level, behavioural monitoring of bird populations in the wild", "authors": [{"id": 76600, "fullname": "Alex Hoi Hang Chan", "url": "http://cvpr.thecvf.com/api/miniconf/users/76600?format=json", "institution": "Center for the Advanced Study for Collective Behaviour, University of Konstanz, Germany"}, {"id": 187215, "fullname": "Neha Singhal", "url": "http://cvpr.thecvf.com/api/miniconf/users/187215?format=json", "institution": "Blue yonder"}, {"id": 187216, "fullname": "Onur Kocahan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187216?format=json", "institution": null}, {"id": 187217, "fullname": "Andrea Meltzer", "url": "http://cvpr.thecvf.com/api/miniconf/users/187217?format=json", "institution": "Max-Planck-Institute of Animal Behaviour"}, {"id": 187218, "fullname": "Saverio Lubrano", "url": "http://cvpr.thecvf.com/api/miniconf/users/187218?format=json", "institution": "University of Konstanz"}, {"id": 187219, "fullname": "Miya Warrington", "url": "http://cvpr.thecvf.com/api/miniconf/users/187219?format=json", "institution": "Oxford Brookes University"}, {"id": 187220, "fullname": "Michael Griesser", "url": "http://cvpr.thecvf.com/api/miniconf/users/187220?format=json", "institution": "Universit\u00e4t Konstanz"}, {"id": 91236, "fullname": "Fumihiro Kano", "url": "http://cvpr.thecvf.com/api/miniconf/users/91236?format=json", "institution": "Universit\u00e4t Konstanz"}, {"id": 73850, "fullname": "Hemal Naik", "url": "http://cvpr.thecvf.com/api/miniconf/users/73850?format=json", "institution": "Max Planck Institute for Animal Behavior"}], "abstract": "Long-term behavioural monitoring of individual animals is crucial for studying behavioural changes that occurs over different time scales, especially for conservation and evolutionary biology. Computer vision methods have proven to benefit biodiversity monitoring, but automated behaviour monitoring in wild populations remains challenging. This stems from the lack of datasets that cover a range of computer vision tasks necessary to extract biologically meaningful measurements of individual animals. Here, we introduce such a dataset (CHIRP) with a new method (CORVID) for individual re-identification of wild birds. The CHIRP (Combining beHaviour, Individual Re-identification and Postures) dataset is curated from a long-term population of wild Siberian jays studied in Swedish Lapland, supporting re-identification (re-id), action recognition, 2D keypoint estimation, object detection, and instance segmentation. In addition to traditional task-specific benchmarking, we introduce application-specific benchmarking with biologically relevant metrics (feeding rates, co-occurrence rates) to evaluate the performance of models in real-world use cases. Finally, we present CORVID (COlouR-based Video re-ID), a novel pipeline for individual identification of birds based on the segmentation and classification of coloured leg rings, a widespread approach for visual identification of individual birds. CORVID offers a probability-based id tracking method by matching the detected combination of colour rings with a database. We use application-specific benchmarking to show that CORVID outperforms state of the art re-id methods. We hope this work offers the community a blueprint for curating real-world datasets from ethically approved biological studies to bridge the gap between computer vision research and biological applications.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37346", "url": null, "sourceid": 32395, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37916, "uid": "8365c9a8183d6f230a5125b1f65698b6", "name": "UniComp: Rethinking Video Compression Through Informational Uniqueness", "authors": [{"id": 153401, "fullname": "Chao Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153401?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 188583, "fullname": "Shimin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188583?format=json", "institution": "Meituan"}, {"id": 173518, "fullname": "Minliang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/173518?format=json", "institution": "Meituan"}, {"id": 137779, "fullname": "Limeng Qiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/137779?format=json", "institution": null}, {"id": 188584, "fullname": "Guanglu Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188584?format=json", "institution": "Meituan"}, {"id": 88103, "fullname": "Lin Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/88103?format=json", "institution": "Meituan"}], "abstract": "Distinct from attention-based compression methods, this paper presents an information uniqueness driven video compression framework, termed UniComp, which aims to maximize the information fidelity of video representations under constrained computational budgets. Starting from the information-theoretic perspective, we formulate the vision compression as an optimization problem that minimizes conditional entropy (reconstruction error) between retained and full tokens. To achieve this, we introduce the notion of information uniqueness to measure intrinsic redundancy among tokens to link with reconstruction error. Based on uniqueness, we design three modules\u2014Frame Group Fusion, Token Allocation, and Spatial Dynamic Compression\u2014that progressively perform semantic frame grouping, adaptive resource allocation, and fine-grained spatial compression. Extensive experiments demonstrate that UniComp consistently outperforms existing compression methods in preserving essential visual tokens under limited computational budgets, highlighting the pivotal role of information uniqueness in token compression efficacy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37916", "url": "https://github.com/TimeMarker-LLM/UniComp", "sourceid": 45988, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39535, "uid": "9c3916c3064354b7c58f2ba44a6d6a61", "name": "Imagine Before Concentration: Diffusion-Guided Registers Enhance Partially Relevant Video Retrieval", "authors": [{"id": 157916, "fullname": "Jun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/157916?format=json", "institution": "Harbin Institute of Technology"}, {"id": 192288, "fullname": "Xuhang Lou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192288?format=json", "institution": "Harbin Institute of Technology(Shenzhen)"}, {"id": 181347, "fullname": "Jinpeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181347?format=json", "institution": "Harbin Institue of Technology, Shenzhen"}, {"id": 192289, "fullname": "Yuting Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192289?format=json", "institution": "Aminer"}, {"id": 87840, "fullname": "Yaowei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87840?format=json", "institution": "Pengcheng Laboratory"}, {"id": 87242, "fullname": "Shu-Tao Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/87242?format=json", "institution": "Shenzhen International Graduate School, Tsinghua University"}, {"id": 87209, "fullname": "Bin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87209?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}], "abstract": "Partially Relevant Video Retrieval (PRVR) aims to retrieve untrimmed videos based on text queries that describe only partial events. Existing methods suffer from incomplete global contextual perception, struggling with  query ambiguity and local noise induced by spurious responses.To address these challenges, we propose DreamPRVR, which adopts a coarse-to-fine representation learning paradigm. The model first generates global contextual semantic registers as coarse-grained highlights spanning the entire video and then concentrates on  fine-grained similarity optimization for precise cross-modal matching.Concretely, these registers are generated by  initializing from the video-centric distribution produced by a probabilistic variational sampler and then iteratively refined via a text-supervised truncated diffusion model. During this process, textual semantic structure learning constructs a well-formed textual latent space, enhancing the reliability of global perturbation. The refined registers are then adaptively fused  with video tokens through register-augmented Gaussian attention blocks, enabling context-aware feature learning.Extensive experiments show that DreamPRVR outperforms state-of-the-art methods. The code will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39535", "url": null, "sourceid": 38090, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37472, "uid": "576fe2a764389ae05931dd1f11ab6566", "name": "MER-Tracker: Towards High-Speed 3D Point Tracking via Multi-View Event-RGB Hybrid Cameras", "authors": [{"id": 153528, "fullname": "Yiqian Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153528?format=json", "institution": "Harbin Institute of Technology"}, {"id": 187535, "fullname": "Qinghong Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/187535?format=json", "institution": "Peking University"}, {"id": 106076, "fullname": "Haoran Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/106076?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 152158, "fullname": "Jianing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152158?format=json", "institution": "Peking University"}, {"id": 187536, "fullname": "Dongyang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/187536?format=json", "institution": "Peking University"}, {"id": 187537, "fullname": "Xuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187537?format=json", "institution": "Harbin Institute of Technology,Shenzhen"}, {"id": 73079, "fullname": "Wei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73079?format=json", "institution": "Peng Cheng Laboratory"}, {"id": 86326, "fullname": "Yonghong Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/86326?format=json", "institution": "Peking University"}, {"id": 90413, "fullname": "Peixi Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90413?format=json", "institution": "Peking University"}], "abstract": "This paper proposes the first task for high-speed 3D point tracking using multi-view Event-RGB hybrid cameras. We design a cuboid observation device comprising 4 RGB cameras (30fps) and 2 Event cameras to synchronously capture high-speed motions, and propose MER-Tracker, a high\u2013frame-rate 3D point-tracking network that fuses the complementary strengths of dual modalities. We first respectively extract 2D motion-change features from the RGB and Event modalities, then apply linear interpolation and anchor sampling to fuse the discrete RGB 3D features and continuous Event 3D features after 3D lifting, and finally employ a LoRA-tuned Transformer based on temporal correlationship to predict the high-frame-rate 3D point trajectories over fast motions, accomplishing high-speed 3D point tracking. To verify the effectiveness of our method, we construct both real-world and simulated high-speed motion datasets. Experiments on these datasets show that our method achieves accurate high-speed 3D point tracking at high-frame-rate (150fps), outperforming state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37472", "url": null, "sourceid": 41260, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39797, "uid": "5ef6fac3b859bda32552555b77f147dd", "name": "Dual-branch Distilled Transformer for Efficient Asymmetric UAV Tracking", "authors": [{"id": 191353, "fullname": "Hongtao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191353?format=json", "institution": "Guangxi Normal University"}, {"id": 128975, "fullname": "Bineng Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/128975?format=json", "institution": "Guangxi Normal University"}, {"id": 155925, "fullname": "Qihua Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155925?format=json", "institution": "Guangxi Normal University"}, {"id": 155926, "fullname": "Yaozong Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/155926?format=json", "institution": "Guangxi Normal University"}, {"id": 187654, "fullname": "Xiantao Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187654?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 155928, "fullname": "Yuanliang Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/155928?format=json", "institution": "Xi\u2019an Research Institute of High Technology"}, {"id": 129683, "fullname": "Shuxiang Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/129683?format=json", "institution": "Guangxi Normal University"}], "abstract": "Given the real-time demands of UAV tracking, many methods simplify the backbone to reduce computation, but this often weakens feature representation and degrades performance in complex scenarios. To alleviate this issue, we propose EATrack, an efficient and asymmetric UAV tracking framework centered around a teacher-guided dual-branch distillation strategy that enhances the feature expressiveness of the lightweight student model. Specifically, EATrack investigates two complementary perspectives of knowledge transfer: a spatially focused feature-level distillation that compensates for weakened representations by guiding the student to learn strong target representations, and a prediction-level distillation that enhances spatial localization by learning the teacher\u2019s capability of accurate target localization. Furthermore, to enhance robustness against appearance variations, we introduce a fine-grained target-aware distillation strategy that selectively transfers the teacher\u2019s target modeling capacity to the student. While the asymmetric architecture improves efficiency, it limits temporal adaptability. To mitigate this, a temporal adaptation module is incorporated at inference to enhance robustness over time. Experiments on five UAV benchmarks demonstrate that EATrack achieves a favorable balance between accuracy and speed, with EATrack-DeiT improving average success rate by 1.2\\% over the previous SOTA while running at 241.9 FPS on GPU.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39797", "url": null, "sourceid": 42901, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37695, "uid": "4c64c4b4bd58084a33c95731fea410ee", "name": "Dynamic Black-hole Emission Tomography with Physics-informed Neural Fields", "authors": [{"id": 188033, "fullname": "Berthy Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188033?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 188034, "fullname": "Andrew Chael", "url": "http://cvpr.thecvf.com/api/miniconf/users/188034?format=json", "institution": "Niels Bohr Institute, University of Copenhagen"}, {"id": 172823, "fullname": "David Bromley", "url": "http://cvpr.thecvf.com/api/miniconf/users/172823?format=json", "institution": "University of Toronto"}, {"id": 130722, "fullname": "Aviad Levis", "url": "http://cvpr.thecvf.com/api/miniconf/users/130722?format=json", "institution": "California Institute of Technology"}, {"id": 75516, "fullname": "William Freeman", "url": "http://cvpr.thecvf.com/api/miniconf/users/75516?format=json", "institution": "MIT and Google"}, {"id": 184011, "fullname": "Katie Bouman", "url": "http://cvpr.thecvf.com/api/miniconf/users/184011?format=json", "institution": "California Institute of Technology"}], "abstract": "With the success of static black-hole imaging, the next frontier is the dynamic and 3D imaging of black holes. Recovering the dynamic 3D gas near a black hole would reveal previously-unseen parts of the universe and inform new physics models. However, only sparse radio measurements from a single viewpoint are possible, making the dynamic 3D reconstruction problem significantly ill-posed. Previously, BH-NeRF addressed the ill-posed problem by assuming Keplerian dynamics of the gas, but this assumption breaks down near the black hole, where the strong gravitational pull of the black hole and increased electromagnetic activity complicate fluid dynamics. To overcome the restrictive assumptions of BH-NeRF, we propose *PINeRF*, a physics-informed approach that uses differentiable neural rendering to fit a 4D (time + 3D) emissivity field given EHT measurements. Our approach jointly reconstructs the 3D velocity field with the 4D emissivity field and enforces the velocity as a soft constraint on the dynamics of the estimated emissivity. In experiments on simulated data, we find significantly improved reconstruction accuracy over both BH-NeRF and a totally physics-agnostic approach. We demonstrate how our method can be used to estimate other physics parameters of the black hole, such as its spin.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37695", "url": "https://github.com/berthyf96/pidef", "sourceid": 38441, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38954, "uid": "4e2b3264ead437e99dd16b434ca7dbaf", "name": "Plan, Imagine, then Act: Steering Your VLA with Efficient Visually Grounded Planning", "authors": [{"id": 154298, "fullname": "Zhuoyang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154298?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 154458, "fullname": "Shang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154458?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 151031, "fullname": "Qinghao Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/151031?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 191054, "fullname": "Luke Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191054?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 136937, "fullname": "James Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/136937?format=json", "institution": "Caltech / Hillbot"}, {"id": 185139, "fullname": "Yufei Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/185139?format=json", "institution": "Tsinghua University"}, {"id": 154012, "fullname": "Yao Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154012?format=json", "institution": "NVIDIA"}, {"id": 85763, "fullname": "Song Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/85763?format=json", "institution": "Massachusetts Institute of Technology"}], "abstract": "Vision-Language-Action (VLA) models convert abstract language instructions into concrete, executable actions, a task that is especially challenging in open-world environments. We present \\textit{Visually Grounded Planning}, a general and efficient high-level planner that guides a VLA step-by-step using imagined future observations and subtask descriptions. With an imagined future observation, the VLA can focus on visuomotor inference rather than high-level semantic reasoning, leading to improved accuracy and generalization. Our planner comprises a highly efficient foresight image-generation module that predicts a high-quality 640\u00d7480 future observation from the current visual input and language instruction within only 0.33 s on an H100 GPU, together with a vision\u2013language component that reasons over the task and produces subtask descriptions for both the generator and the VLA. Importantly, state-of-the-art VLAs can integrate our planner seamlessly by simply augmenting their visual inputs, without any architectural modification. The foresight generator is pretrained on approximately 10 million multi-task, cross-embodiment samples, enabling it to learn robust embodied dynamics and achieve strong real-world generalization. We evaluate our framework on a benchmark consists of 11 diverse, multi-step real-world tasks. It achieves an average success rate of 87.4\\%, demonstrating a +40.9\\% absolute improvement over the $\\pi_0$ baseline (46.5\\%) and a +30.3\\% absolute improvement over $\\pi_0$ augmented with textual subtask guidance (57.1\\%).", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38954", "url": null, "sourceid": 45578, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37654, "uid": "7f725650f4fdec0cc8d4099bb7c8b9d4", "name": "Red-teaming Retrieval-Augmented Diffusion Models via Poisoning Knowledge Bases", "authors": [{"id": 143875, "fullname": "Xinqi Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/143875?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 187953, "fullname": "Yihao LIU", "url": "http://cvpr.thecvf.com/api/miniconf/users/187953?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 158841, "fullname": "Dong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158841?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 158842, "fullname": "Bin Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/158842?format=json", "institution": "The Hong Kong Polytechnic University"}], "abstract": "Retrieval-augmented diffusion models (RAG-DMs) have been increasingly deployed across applications, alleviating the data and compute demands of conventional diffusion models. Despite the success, their trustworthiness remains underexplored. Existing backdoor attacks focus on either manipulating the generation phase or the retrieval phase under the white-box setting, which suffer from knowledge conflicts between retrieved images and user prompts. To bridge this gap, we propose a novel red-teaming approach JOB, which is the first jointly optimized backdoor attack tailored to black-box RAG-DMs. Specifically, JOB poisons the knowledge base with a small number of target class images and learns a trigger through multi-objective optimization, steering retrieval toward poisoned images and aligning the generated outputs with the target class, while preserving benign performance. Experiments show that JOB effectively attacks black-box RAG-DMs, achieving high success rates and outperforming state-of-the-art baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37654", "url": null, "sourceid": 42342, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38275, "uid": "cd3b1a524a7c489164eb52735036e4c4", "name": "Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance", "authors": [{"id": 144111, "fullname": "Songze Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/144111?format=json", "institution": "Harbin Institute of Technology"}, {"id": 189476, "fullname": "Mingyu Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189476?format=json", "institution": "Harbin Institute of Technology"}, {"id": 189477, "fullname": "Tonghua Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/189477?format=json", "institution": "Harbin Institute of Technology"}, {"id": 86374, "fullname": "Xu-Yao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86374?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 189478, "fullname": "Zhongjie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189478?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Multimodal continual instruction tuning enables multimodal large language models to sequentially adapt to new tasks while building upon previously acquired knowledge. However, this continual learning paradigm faces the significant challenge of catastrophic forgetting, where learning new tasks leads to performance degradation on previous ones. In this paper, we introduce a novel insight into catastrophic forgetting by conceptualizing it as a problem of missing gradients from old tasks during new task learning. Our approach approximates these missing gradients by leveraging the geometric properties of the parameter space, specifically using the directional vector between current parameters and previously optimal parameters as gradient guidance. This approximated gradient can be further integrated with real gradients from a limited replay buffer and regulated by a Bernoulli sampling strategy that dynamically balances model stability and plasticity. Extensive experiments on multimodal continual instruction tuning datasets demonstrate that our method achieves state-of-the-art performance without model expansion, effectively mitigating catastrophic forgetting while maintaining a compact architecture.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38275", "url": null, "sourceid": 36331, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39596, "uid": "0388e3b48172f10f9ce58d86b3c86f74", "name": "Breaking Smooth-Motion Assumptions: A UAV Benchmark for Multi-Object Tracking in Complex and Adverse Conditions", "authors": [{"id": 192436, "fullname": "Jingtao Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/192436?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 192437, "fullname": "zhang kexin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192437?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 192438, "fullname": "Xunchi Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/192438?format=json", "institution": ""}, {"id": 192439, "fullname": "Johann Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192439?format=json", "institution": null}, {"id": 128834, "fullname": "Guangming Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128834?format=json", "institution": "Xidian University"}, {"id": 192440, "fullname": "Peiyi Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192440?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 192441, "fullname": "Linhua Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192441?format=json", "institution": null}, {"id": 192442, "fullname": "Xiangdong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192442?format=json", "institution": null}, {"id": 90793, "fullname": "Liang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90793?format=json", "institution": "Xidian University"}], "abstract": "The rapid movements and agile maneuvers of unmanned aerial vehicles (UAVs) induce significant observational challenges for multi-object tracking (MOT). However, existing UAV-perspective MOT benchmarks often lack these complexities, featuring predominantly predictable camera dynamics and linear motion patterns. To address this gap, we introduce DynUAV, a new benchmark for dynamic UAV-perspective MOT, characterized by intense ego-motion and the resulting complex apparent trajectories. The benchmark comprises 42 video sequences with over 1.7 million bounding box annotations, covering vehicles, pedestrians, and specialized industrial categories such as excavators, bulldozers and cranes. Compared to existing benchmarks, DynUAV introduces substantial challenges arising from ego-motion, including drastic scale changes and viewpoint changes, as well as motion blur. Comprehensive evaluations of state-of-the-art trackers on DynUAV reveal their limitations, particularly in managing the intertwined challenges of detection and association under such dynamic conditions, thereby establishing DynUAV as a rigorous benchmark. We anticipate that DynUAV will serve as a demanding testbed to spur progress in real-world UAV-perspective MOT, and we will make all resources available at [link].", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39596", "url": null, "sourceid": 45818, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40084, "uid": "a34335a3bede0f17a7af733b697ad848", "name": "OS-Fed: One Snapshot Is All You Need", "authors": [{"id": 193467, "fullname": "Xuwei Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/193467?format=json", "institution": "Southeast University"}, {"id": 183685, "fullname": "Jinghui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183685?format=json", "institution": "Southeast University"}, {"id": 193468, "fullname": "Yuchuan Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193468?format=json", "institution": "Southeast University"}, {"id": 193469, "fullname": "Wenbo Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193469?format=json", "institution": "Institute of Science Tokyo; Southeast University"}, {"id": 193470, "fullname": "Zhen Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193470?format=json", "institution": "Southeast University"}, {"id": 132016, "fullname": "Shen Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/132016?format=json", "institution": "Kunmimg University of Science and Technology"}, {"id": 193471, "fullname": "LiSha Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193471?format=json", "institution": null}, {"id": 193472, "fullname": "Ding Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/193472?format=json", "institution": "Southeast University"}, {"id": 193473, "fullname": "Fang Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/193473?format=json", "institution": "Southeast University"}], "abstract": "Reducing communication overhead in federated learning (FL) is challenging but crucial for large-scale distributed privacy-preserving machine learning. Unfortunately, directly compressing model updates often leads to sub-optimal convergence due to information loss, while increasing local computation can cause model divergence. Hence, this paper proposes a drastically different approach that adheres to the maxim that ``a picture is worth a thousand words''. We observe that the entire gradient information from local training can be effectively reconstructed from a compact, image-like representation. Based on this observation, we propose a novel approach, OS-Fed, which performs One-Shot Federated Learning by transmitting only a single, compact snapshot (comprising an image and a set of learnable labels) per round. To realize this approach, OS-Fed presents new snapshot synthesis techniques to (1) target the accumulated update of a trajectory segment to tackle gradient noise, (2) design a multi-grid snapshot that decouples conflicting gradient directions, and (3) incorporate error compensation to maintain training stability under extreme compression. Extensive experiments on CV and NLP benchmarks show that OS-Fed reduces communication costs by 1.5-16$\\times$ compared to state-of-the-art algorithms , resulting in 18-45\\% faster convergence.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40084", "url": null, "sourceid": 41908, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37608, "uid": "4308acc8c96ae3e22252c9700fb1e2ee", "name": "MOFA-VTON: More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On", "authors": [{"id": 187845, "fullname": "Xiaoyu Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/187845?format=json", "institution": "Harbin Institute of Technology"}, {"id": 128977, "fullname": "Chenyang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128977?format=json", "institution": "Harbin Institute of Technology"}, {"id": 155099, "fullname": "Jing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155099?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 102102, "fullname": "Shunyuan Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/102102?format=json", "institution": "Harbin Institute of Technology"}, {"id": 187846, "fullname": "Quanling Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187846?format=json", "institution": "Harbin Institute of Technology"}, {"id": 127493, "fullname": "Shengping Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127493?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Virtual try-on aims to fit an in-shop clothing image onto a specific human body. An optimal virtual try-on method should provide diverse and flexible dressing options, accurately reflecting the varied wearing styles encountered in real-life scenarios, tailored to individual preferences and fashion aspirations. However, current methods predominantly perform a direct replacement of the original clothing with the target clothing, following the same dressing pattern. This limited control over clothing adaptation may result in fixed and monotonous try-on outputs. To delve into More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On, we propose a novel virtual try-on method, termed MOFA-VTON, which allows adjustment for clothing adaptations in try-on results through simple sketches by users. Specifically, we first design a mask construction strategy that transforms user-drawn curve sketches into a dual-region mask, replacing the traditional clothing-agnostic mask and providing fine-grained layout guidance for the subsequent generation process. Further, we propose layout adjustment blocks that utilize the cross-attention mechanism to independently learn layout correspondences for upper and lower regions of the human body, refining the spatial arrangement of the two regions. With these implementations, our method enables flexible and fine-grained adaptations of target clothing, overcoming the constraints of a fixed layout. Extensive experiments on VITON-HD and DressCode datasets demonstrate that our proposed MOFA-VTON outperforms previous state-of-the-art methods and provides more fashion possibilities for virtual try-on.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37608", "url": null, "sourceid": 34749, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39098, "uid": "39e1057382425c5ceab4d8702ffdf7bd", "name": "Adaptive Depth Lightweight RGB-T Tracking with Holistic Token Routing", "authors": [{"id": 182437, "fullname": "Tian Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/182437?format=json", "institution": "nanjing University"}, {"id": 191353, "fullname": "Hongtao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191353?format=json", "institution": "Guangxi Normal University"}, {"id": 191354, "fullname": "Liangtao Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/191354?format=json", "institution": "Hefei University of Technology"}, {"id": 127403, "fullname": "Jun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/127403?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 187654, "fullname": "Xiantao Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187654?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 152125, "fullname": "Jian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152125?format=json", "institution": "nanjing university"}, {"id": 107265, "fullname": "Ying Tai", "url": "http://cvpr.thecvf.com/api/miniconf/users/107265?format=json", "institution": "Nanjing University"}], "abstract": "fails under night scenes, glare, fog, and partial occlusion. Despite notable accuracy gains, recent architectures emphasize deep fusion and large parameter counts, driving up FLOPs and bandwidth. This computational burden constrains real-time performance and limits scalability beyond high-end GPUs. To balance accuracy and efficiency, we propose Adaptive Early-Exit (AEE): we augment the backbone with anytime heads and pair them with a confidence-calibrated early-exit policy that halts inference at the earliest reliable layer, skipping redundant computation. For cross-modal interaction, we design a Holistic-Token-Guided Interaction (HTGI) module, where each modality is compressed into a compact set of holistic state tokens and injected into the other modality\u2019s modeling stream without layer-wise alignment, enabling targeted information exchange at extremely low cost. On RGB-T benchmarks, the lightweight tracker substantially reduces latency while maintaining competitive accuracy; on LasHeR, it achieves 70.2% precision and 56.3% success, running at 148.3 FPS on GPU, 50.2 FPS on CPU, and 28.7 FPS on an edge device.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39098", "url": null, "sourceid": 32484, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37409, "uid": "16a50d670cfa9afa1ebd0022240fb0f9", "name": "Learning Forgery-Aware Lip Representations Without Forgery Priors", "authors": [{"id": 180827, "fullname": "Bofan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/180827?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 187369, "fullname": "Hongyu Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187369?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 186840, "fullname": "Yi He", "url": "http://cvpr.thecvf.com/api/miniconf/users/186840?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 187370, "fullname": "Sichu Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187370?format=json", "institution": "Southeast University"}, {"id": 186841, "fullname": "Shi-Lin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186841?format=json", "institution": "Shanghai Jiaotong University"}], "abstract": "Visual Speaker Authentication (VSA) verifies identity by analyzing lip dynamics during prompted speech, offering enhanced privacy compared to full-face methods while maintaining discriminability for high-security applications. However, recent advances in talking face generation (TFG) have enabled realistic forgeries that closely mimic lip dynamics in sync with speech, posing severe threats to VSA systems. Prevailing defenses rely heavily on supervised classifiers trained on known forgeries via empirical risk minimization, resulting in poor generalization to unseen attacks, dependency on continuously updated fake data, and complete failure in the absence of effective forgery priors. In this paper, we revisit the design of forgery detectors and argue that over-reliance on fake priors hinders the exploitation of rich authenticity signals inherently present in real videos. We propose a novel detector trained exclusively on authentic data, learning forgery-aware representations through three key components: (1) lightweight modules that capture forgery-indicative statistics from real videos; (2) an asymmetric contrastive objective that compacts real samples while repelling potential forgeries in representation space; and (3) a theoretically grounded regularizer that shapes real representations into a tractable, isotropic Gaussian. To support rigorous evaluation, we introduce a benchmark suite spanning diverse TFG forgeries. Across eight modern forgery attacks and ten state-of-the-art (SOTA) detectors, our method achieves over a 10% reduction in error rates while preserving identity-verification capability with minimal overhead, and demonstrates consistent gains on datasets that better emulate real-world scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37409", "url": null, "sourceid": 35411, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39889, "uid": "0a38436397f65f4dba1b4393d81e9159", "name": "DP-FedAdamW: An Efficient Optimizer for Differentially Private Federated Large Models", "authors": [{"id": 180528, "fullname": "Jin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180528?format=json", "institution": "Xidian University"}, {"id": 180838, "fullname": "Ning Xi", "url": "http://cvpr.thecvf.com/api/miniconf/users/180838?format=json", "institution": "Xidian University"}, {"id": 193061, "fullname": "Yinbin Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193061?format=json", "institution": "Xidian University"}, {"id": 193062, "fullname": "Junkang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193062?format=json", "institution": "Tianjin University"}], "abstract": "Balancing convergence efficiency and robustness under Differential Privacy (DP) is a central challenge in Federated Learning (FL). While AdamW accelerates training and fine-tuning in large-scale models, we find that directly applying it to Differentially Private FL (DPFL) suffers from three major issues: (i) data heterogeneity and privacy noise jointly amplify the variance of second-moment estimator, (ii) DP perturbations bias the second-moment estimator, and (iii) DP amplify AdamW\u2019s sensitivity to local overfitting, worsening client drift. We propose DP-FedAdamW, the first AdamW-based optimizer for DPFL. It restores AdamW under DP by stabilizing second-moment variance, removing DP-induced bias, and aligning local updates to the global descent to curb client drift.Theoretically, we establish an unbiased second-moment estimator and prove a linearly accelerated convergence rate without any heterogeneity assumption, while providing tighter $(\\varepsilon,\\delta)$-DP guarantees.Our empirical results demonstrate the effectiveness of DP-FedAdamW across language and vision Transformers and ResNet-18. On Tiny-ImageNet (Swin-Base, $\\varepsilon=1$), DP-FedAdamW outperforms the state-of-the-art (SOTA) by 5.83\\%. Thecode is available in Appendix.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39889", "url": null, "sourceid": 33292, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36514, "uid": "4bb51d72da028d47b13bbe68fa9d43bf", "name": "DARC: Dual Adjustment Reasoning with Counterfactuals for Trustworthy Chest X-ray Classification", "authors": [{"id": 185246, "fullname": "Zhifang Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185246?format=json", "institution": "Central South University"}, {"id": 180484, "fullname": "Junhao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180484?format=json", "institution": "Central South University"}, {"id": 185247, "fullname": "HaoKang Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/185247?format=json", "institution": "Central South University"}, {"id": 185248, "fullname": "Yucheng Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/185248?format=json", "institution": "Central South University"}], "abstract": "Despite their impressive performance in multi-label classification of chest X-ray images (CXR), deep learning models are widely plagued by two types of spurious correlations: feature confounding arising from pathological co-occurrence and shortcut learning triggered by non-pathological visual confounders. These non-causal dependencies severely undermine the interpretability and robustness of models in real-world clinical settings. To address these challenges, we propose the Dual Adjustment Reasoning with Counterfactuals for Trustworthy Chest X-ray Classification (DARC) framework, the first to synergistically decouple both types of confounding sources from a causal mechanism perspective. At the data level, we construct CheXconf, the first pixel-level annotation dataset of non-pathological visual confounders in CXR, comprising 40,213 annotated instances across 11 categories. This provides a solid foundation for accurately modeling these confounders. At the methodological level, we design a novel dual-stream causal learning architecture. Its Global Stream leverages the back-door adjustment criterion with CheXconf to explicitly block spurious paths from non-pathological confounders. Concurrently, the Local Stream employs counterfactual reasoning, constrained by anatomical priors, to disentangle the visual coupling of co-occurring pathologies. Experiments on large-scale public benchmarks demonstrate that our method achieves significant improvements in task performance, interpretability, and robustness. All codes and datasets will be made publicly available upon the publication of this paper.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36514", "url": null, "sourceid": 46383, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36177, "uid": "5bb0b4dd9a4f5c0de3564b63d74bcf5f", "name": "SHARP: Short-Window Streaming for Accurate and Robust Prediction in Motion Forecasting", "authors": [{"id": 182169, "fullname": "Alexander Prutsch", "url": "http://cvpr.thecvf.com/api/miniconf/users/182169?format=json", "institution": "Graz University of Technology"}, {"id": 184340, "fullname": "Christian Fruhwirth-Reisinger", "url": "http://cvpr.thecvf.com/api/miniconf/users/184340?format=json", "institution": "Technische Universit\u00e4t Graz"}, {"id": 184341, "fullname": "David Schinagl", "url": "http://cvpr.thecvf.com/api/miniconf/users/184341?format=json", "institution": "Technische Universit\u00e4t Graz"}, {"id": 131086, "fullname": "Horst Possegger", "url": "http://cvpr.thecvf.com/api/miniconf/users/131086?format=json", "institution": "Graz University of Technology"}], "abstract": "In dynamic traffic environments, motion forecasting models must be able to accurately estimate future trajectories continuously.Streaming-based methods are a promising solution, but despite recent advances, their performance often degrades when exposed to heterogeneous observation lengths.To address this, we propose a novel streaming-based motion forecasting framework that explicitly focuses on evolving scenes.Our method incrementally processes incoming observation windows and leverages an instance-aware context streaming to maintain and update latent agent representations across inference steps.A dual training objective further enables consistent forecasting accuracy across diverse observation horizons.Extensive experiments on Argoverse 2, nuScenes, and Argoverse 1 demonstrate the robustness of our approach under evolving scene conditions and also the single-agent benchmarks.Moreover, our model achieves state-of-the-art performance in streaming inference on the Argoverse 2 multi-agent benchmark, while maintaining minimal latency, highlighting its suitability for real-world deployment.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36177", "url": null, "sourceid": 46647, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38252, "uid": "dd5ae592370924172fe1ffeb2ebfa577", "name": "CI-VID: A Coherent Interleaved Text-Video Dataset", "authors": [{"id": 189425, "fullname": "Yiming Ju", "url": "http://cvpr.thecvf.com/api/miniconf/users/189425?format=json", "institution": "BAAI"}, {"id": 189426, "fullname": "Jijin Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189426?format=json", "institution": "Alibaba Group"}, {"id": 158473, "fullname": "Zhengxiong Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/158473?format=json", "institution": "BAAI"}, {"id": 189427, "fullname": "Haoge Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189427?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 182229, "fullname": "Hanyu Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/182229?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 189428, "fullname": "Li Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/189428?format=json", "institution": "BAAI"}, {"id": 172969, "fullname": "Wenbo Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/172969?format=json", "institution": "University of New South Wales"}, {"id": 189429, "fullname": "Chengwei Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189429?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 189430, "fullname": "Donglin Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189430?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 89234, "fullname": "Xinlong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89234?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 189431, "fullname": "Tengfei Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189431?format=json", "institution": "BAAI"}], "abstract": "Text-to-video (T2V) generation has recently attracted considerable attention, resulting in the development of numerous high-quality datasets that have propelled progress in this area. However, existing public datasets are primarily composed of isolated text\u2013video (T\u2013V) pairs and thus fail to model inter-clip relationships. To address this limitation, we introduce CI-VID, a dataset that moves beyond isolated T2V generation toward text-and-video-to-video (T&V2V) generation. CI-VID contains over 340,000 samples, each comprising a semantically coherent video sequence with interleaved text captions that capture both clip-level content and inter-clip relationships. To validate its effectiveness, we design a comprehensive, multi-dimensional benchmark incorporating human evaluation, VLM-based assessment, and similarity-based metrics. Experimental results demonstrate that models trained on CI-VID significantly improve both accuracy and content consistency in multi-clip video generation. This enables the creation of story-driven content with smooth transitions and strong semantic coherence, underscoring the value of CI-VID as a foundation for advancing controllable and coherent video generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38252", "url": null, "sourceid": 33778, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38312, "uid": "6d51a99b41cd166309b1f1be618e8bee", "name": "Toward Early Quality Assessment of Text-to-Image Diffusion Models", "authors": [{"id": 189575, "fullname": "Huanlei Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189575?format=json", "institution": "Southern University of Science and Technology"}, {"id": 189576, "fullname": "Hongxin Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/189576?format=json", "institution": "Southern University of Science and Technology"}, {"id": 189577, "fullname": "Bingyi Jing", "url": "http://cvpr.thecvf.com/api/miniconf/users/189577?format=json", "institution": "The Chinese University of Hong Kong; South University of Science and Technology"}], "abstract": "Recent text-to-image (T2I) diffusion models can produce highly realistic images from natural language prompts. In practice, users usually generate multiple candidates and select only a small subset for downstream use, guided by automatic metrics like CLIPScore and ImageReward. However, this post-hoc quality assessment is highly resource-intensive since quality is assessed after dozens to hundreds of denoising steps per image, leading to substantial waste on low-quality samples. To address this issue, we propose \\textbf{Probe-Select}, a plug-in framework for early quality assessment in T2I generation. Our key observation is that certain intermediate features within the denoiser\u2014often as early as 20\\% of the reverse process\u2014already encode stable structural cues (e.g., object layout, spatial composition, and color harmony) that strongly correlate with final image fidelity. Building upon this phenomenon, Probe-Select attaches lightweight probes to these stable activations at an early checkpoint and trains them to align with external evaluators. During inference, the probes forecast image quality on the fly, enabling early pruning of unpromising trajectories so that computation is concentrated on promising ones. Experiments on MS-COCO across multiple generative backbones show that this early assessment mechanism reduces sampling cost by over 60\\% while improving the quality of the generated images, demonstrating that early structural signals can effectively guide efficient text-to-image generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38312", "url": null, "sourceid": 35794, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36179, "uid": "8e354b1788d500aed84ab133c1c05a9b", "name": "VidEoMT: Your ViT is Secretly Also a Video Segmentation Model", "authors": [{"id": 101926, "fullname": "Narges Norouzi", "url": "http://cvpr.thecvf.com/api/miniconf/users/101926?format=json", "institution": "Eindhoven University of Technology"}, {"id": 73544, "fullname": "Idil Esen Zulfikar", "url": "http://cvpr.thecvf.com/api/miniconf/users/73544?format=json", "institution": "RWTH Aachen University"}, {"id": 184345, "fullname": "Niccol\u00f2 Cavagnero", "url": "http://cvpr.thecvf.com/api/miniconf/users/184345?format=json", "institution": "Eindhoven University of Technology"}, {"id": 134194, "fullname": "Tommie Kerssies", "url": "http://cvpr.thecvf.com/api/miniconf/users/134194?format=json", "institution": "Eindhoven University of Technology"}, {"id": 75750, "fullname": "Bastian Leibe", "url": "http://cvpr.thecvf.com/api/miniconf/users/75750?format=json", "institution": "RWTH Aachen University"}, {"id": 89343, "fullname": "Gijs Dubbelman", "url": "http://cvpr.thecvf.com/api/miniconf/users/89343?format=json", "institution": "Eindhoven University of Technology"}, {"id": 89339, "fullname": "Daan de Geus", "url": "http://cvpr.thecvf.com/api/miniconf/users/89339?format=json", "institution": "Eindhoven University of Technology"}], "abstract": "Existing online video segmentation models typically combine a per-frame segmenter with complex specialized tracking modules. While effective, these modules introduce significant architectural complexity and computational overhead. Recent studies suggest that plain Vision Transformer (ViT) encoders, when scaled with sufficient capacity and large-scale pre-training, can conduct accurate image segmentation without requiring specialized modules. Motivated by this observation, we propose the Video Encoder-only Mask Transformer (VidEoMT), a simple encoder-only video segmentation model that eliminates the need for dedicated tracking modules. To enable temporal modeling in an encoder-only ViT, VidEoMT introduces a lightweight query propagation mechanism that carries information across frames by reusing queries from the previous frame. To balance this with adaptability to new content, it employs a query fusion strategy that combines the propagated queries with a set of temporally-agnostic learned queries. As a result, VidEoMT attains the benefits of a tracker without added complexity, achieving competitive accuracy while being 5x-10x faster, running at up to 160 FPS with a ViT-L backbone.Code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36179", "url": null, "sourceid": 46525, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36438, "uid": "4d61996cb5a40f214058d84fc7aba126", "name": "Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views", "authors": [{"id": 144046, "fullname": "Zhangquan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/144046?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 185043, "fullname": "Manyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185043?format=json", "institution": null}, {"id": 185044, "fullname": "Xinlei Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185044?format=json", "institution": "National University of Singapore"}, {"id": 155403, "fullname": "Xufang Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/155403?format=json", "institution": "Microsoft Research"}, {"id": 71203, "fullname": "Mingze Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/71203?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 185045, "fullname": "Zihao Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185045?format=json", "institution": "Alibaba Group; SUN YAT-SEN UNIVERSITY"}, {"id": 185046, "fullname": "Xiang An", "url": "http://cvpr.thecvf.com/api/miniconf/users/185046?format=json", "institution": "DeepGlint"}, {"id": 185047, "fullname": "Yan Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185047?format=json", "institution": "Meituan"}, {"id": 185048, "fullname": "Peng Pei", "url": "http://cvpr.thecvf.com/api/miniconf/users/185048?format=json", "institution": "Meituan"}, {"id": 185049, "fullname": "Xunliang Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/185049?format=json", "institution": "Meituan"}, {"id": 87774, "fullname": "Ruqi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87774?format=json", "institution": "Tsinghua Shenzhen International Graduate School/Tsinghua Berkeley Shenzhen Institute "}], "abstract": "Though recent advances in vision\u2013language models (VLMs) have achieved remarkable progress across a wide range of multimodal tasks, understanding 3D spatial relationships from limited views remains a significant challenge. Previous reasoning methods typically rely on pure text (e.g., topological cognitive maps) or on 2D visual cues. However, their limited representational capacity hinders performance in specific tasks that require 3D spatial imagination. To address this limitation, we propose 3DThinker, a framework that can effectively exploit the rich geometric information embedded within images while reasoning, like humans do. Our framework is the first to enable 3D mentaling during reasoning without any 3D prior input, and it does not rely on explicitly labeled 3D data for training. Specifically, our training consists of two stages. First, we perform supervised training to align the 3D latent generated by VLM while reasoning with that of a 3D foundation model (e.g., VGGT). Then, we optimize the entire reasoning trajectory solely based on outcome signals, thereby refining the underlying 3D mentaling. Extensive experiments across multiple benchmarks show that 3DThinker consistently outperforms strong baselines and offers a new perspective toward unifying 3D representations into multimodal reasoning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36438", "url": null, "sourceid": 32903, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39311, "uid": "4c15fe7e4c24eab4f38fb33aea1ec479", "name": "SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes", "authors": [{"id": 175018, "fullname": "ZhiCheng Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175018?format=json", "institution": "PKU"}, {"id": 156364, "fullname": "Jiarui Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/156364?format=json", "institution": "Peking University"}, {"id": 191817, "fullname": "Tong-an Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/191817?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 191818, "fullname": "Yican Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191818?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 191819, "fullname": "Xuan Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191819?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 191820, "fullname": "Xuanfu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191820?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 190228, "fullname": "Zhan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190228?format=json", "institution": null}], "abstract": "We propose SLARM, a feed-forward model that unifies dynamic scene reconstruction, semantic understanding, and real-time streaming inference. SLARM captures complex, non-uniform motion through higher-order motion modeling, trained solely on differentiable renderings without any flow supervision. Besides, SLARM distills semantic features from LSeg to obtain language-aligned representations. This design enables semantic querying via natural language, and the tight coupling between semantics and geometry further enhances the accuracy and robustness of dynamic reconstruction. Moreover, SLARM processes image sequences using window-based causal attention, achieving stable, low-latency streaming inference without accumulating memory cost. Within this unified framework, SLARM achieves state-of-the-art results in dynamic estimation, rendering quality, and scene parsing, improving motion accuracy by 21%, reconstruction PSNR by 1.6 dB, and segmentation mIoU by 20% over existing methods.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39311", "url": null, "sourceid": 46604, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37229, "uid": "f74d36ac2995966d36a8c7c29c0167fa", "name": "UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes", "authors": [{"id": 174006, "fullname": "Shuo Ni", "url": "http://cvpr.thecvf.com/api/miniconf/users/174006?format=json", "institution": "Beijing Institude of Technology; Zhongguancun Academy"}, {"id": 158612, "fullname": "Di Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158612?format=json", "institution": "Wuhan University"}, {"id": 186958, "fullname": "He Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186958?format=json", "institution": "Beijing Institute of Technology"}, {"id": 186959, "fullname": "Haonan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186959?format=json", "institution": "Wuhan University"}, {"id": 186960, "fullname": "Ning Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186960?format=json", "institution": "Hong Kong Polytechnic University; Beijing Institute of Technology"}, {"id": 91732, "fullname": "Jing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91732?format=json", "institution": "The University of Sydney"}], "abstract": "Instruction-driven segmentation in remote sensing generates masks from guidance, offering great potential for accessible and generalizable applications. However, existing methods suffer from fragmented task formulations and limited instruction data, hindering effective understanding and generalization. To address these issues, we introduce GeoSeg-1M, the first million-scale dataset for remote sensing instruction-driven segmentation, constructed via an automatic mask filtering and instruction generation pipeline that synthesizes referring, interactive, and reasoning segmentation instructions from multiple public datasets. GeoSeg-1M contains 590K images, 117 categories, and 1.1M image\u2013mask\u2013instruction triplets. Building upon this foundation, we further curate GeoSeg-Bench, a challenging benchmark designed to evaluate contextual understanding and reasoning capabilities across diverse instruction-driven tasks and complex geospatial scenes. Furthermore, we present UniGeoSeg, a unified framework that serves as a strong baseline, incorporating task-aware text enhancement, latent knowledge memory, and a progressive training strategy to facilitate multi-task learning. Extensive experiments demonstrate the state-of-the-art performance of UniGeoSeg across GeoSeg-Bench and diverse public benchmarks, while exhibiting strong zero-shot generalization. The datasets and source code will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37229", "url": null, "sourceid": 45127, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36760, "uid": "fa45a0e95dbe32a2a2fad1a5b10683ef", "name": "OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models", "authors": [{"id": 183126, "fullname": "Zhenyu Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183126?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 103784, "fullname": "JingJing Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/103784?format=json", "institution": "Xiamen University"}, {"id": 185812, "fullname": "Zehao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185812?format=json", "institution": "University of Science and Technology of China"}, {"id": 185813, "fullname": "Bowen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185813?format=json", "institution": "University of Science and Technology of China"}, {"id": 185814, "fullname": "Qiushi Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/185814?format=json", "institution": "University of Hong Kong"}, {"id": 158376, "fullname": "Zhaoyang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/158376?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 185815, "fullname": "Zhoumianze Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185815?format=json", "institution": "Fudan University"}, {"id": 86632, "fullname": "Yu Qiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86632?format=json", "institution": "Shanghai Aritifcal Intelligence Laboratory"}, {"id": 95127, "fullname": "Xiangyu Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/95127?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 185816, "fullname": "Zun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185816?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 185817, "fullname": "Zichen Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/185817?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}], "abstract": "The deployment of autonomous agents in Graphical User Interface (GUI) environments confronts significant challenges, notably error accumulation in long-horizon tasks and the severe consequences of irreversible operations. While critic models that provide real-time action assessment offer a promising solution, their effectiveness is hindered by the lack of diverse, high-quality GUI feedback data and public critic benchmarks for computer use.To bridge these gaps,we introduce OS-Oracle that makes three core contributions:(1) a scalable data pipeline for synthesizing cross-platform GUI critic data;(2) a two-stage training paradigm combining supervised fine-tuning (SFT) and consistency-preserving group relative policy optimization (CP-GRPO); (3) OS-Critic Bench, a holistic benchmark for evaluating critic model performance across Mobile, Web, and Desktop platforms.Leveraging this framework, we curate a high-quality dataset containing 310k critic samples. The resulting critic model, OS-Oracle-7B, achieves impressive performance,and further reduces error rates, which improves the capability of GUI agents in dynamic environments. All codes, data and checkpoints will be made public.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36760", "url": null, "sourceid": 38616, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37208, "uid": "f8e64e825cf4fd949b32961ad1b4312a", "name": "HATS : Hardness-Aware Trajectory Synthesis for GUI Agents", "authors": [{"id": 89896, "fullname": "Rui Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/89896?format=json", "institution": "Harbin Institute of Technology"}, {"id": 180603, "fullname": "RUIZE GAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/180603?format=json", "institution": "National University of Singapore"}, {"id": 160438, "fullname": "Bin Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/160438?format=json", "institution": "Harbin Institute of Technology"}, {"id": 186925, "fullname": "Li Yixing", "url": "http://cvpr.thecvf.com/api/miniconf/users/186925?format=json", "institution": "Harbin Institute of Technology"}, {"id": 184580, "fullname": "Kaiwen Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184580?format=json", "institution": "Knowin AI"}, {"id": 186926, "fullname": "Shuai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186926?format=json", "institution": "Yale University"}, {"id": 90282, "fullname": "Weili Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/90282?format=json", "institution": "Monash University"}, {"id": 132952, "fullname": "Gongwei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/132952?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Graphical user interface (GUI) agents powered by large vision\u2013language models (VLMs) have shown remarkable potential in automating digital tasks, highlighting the need for high-quality trajectory data to support effective agent training. Yet existing trajectory synthesis pipelines often yield agents that fail to generalize beyond simple interactions. We identify this limitation as stemming from the neglect of semantic-ambiguous actions\u2014interactions whose meanings are context-dependent, sequentially dependent, or visually ambiguous. Such actions are crucial for real-world robustness but are under-represented and poorly processed in current datasets, leading to semantic misalignment between task instructions and execution. To address these issues, we propose HATS, a Hardness-Aware Trajectory Synthesis framework designed to mitigate the impact of semantic ambiguity. We define hardness as the degree of semantic ambiguity associated with an action and develop two complementary modules: (1) a hardness-driven exploration that guides data collection toward ambiguous yet informative interactions, and (2) an alignment-guided refinement that iteratively validates and repairs instruction\u2013execution consistency. The two modules operate in a closed-loop manner\u2014exploration supplies refinement with challenging trajectories, while refinement feedback updates the hardness signal to guide future exploration. Extensive experiments show that agents trained with HATS consistently outperform state-of-the-art baselines across benchmark GUI environments.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37208", "url": null, "sourceid": 42352, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37055, "uid": "0ce31e6390840b5e823e6129a59a37d3", "name": "MMBench-GUI: A Unified Hierarchical Evaluation Framework for Multi-Platform GUI Agents", "authors": [{"id": 179953, "fullname": "Xuehui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/179953?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 183126, "fullname": "Zhenyu Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183126?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 103784, "fullname": "JingJing Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/103784?format=json", "institution": "Xiamen University"}, {"id": 185817, "fullname": "Zichen Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/185817?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 185813, "fullname": "Bowen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185813?format=json", "institution": "University of Science and Technology of China"}, {"id": 185812, "fullname": "Zehao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185812?format=json", "institution": "University of Science and Technology of China"}, {"id": 158376, "fullname": "Zhaoyang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/158376?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 129643, "fullname": "Qingyun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129643?format=json", "institution": "Harbin Institute of Technology"}, {"id": 156953, "fullname": "Xuan Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/156953?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 88513, "fullname": "Zhe Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88513?format=json", "institution": "Nanjing University"}, {"id": 186575, "fullname": "Weiyun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186575?format=json", "institution": "Fudan University"}, {"id": 154737, "fullname": "Xiangyu Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/154737?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 186576, "fullname": "Jixuan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186576?format=json", "institution": "University of California, San Diego"}, {"id": 153062, "fullname": "Haodong Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153062?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 186577, "fullname": "Tianbao Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/186577?format=json", "institution": "University of Hong Kong"}, {"id": 151861, "fullname": "Chenyu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151861?format=json", "institution": "Tsinghua University"}, {"id": 151898, "fullname": "Shiqian Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/151898?format=json", "institution": "Tsinghua University"}, {"id": 186578, "fullname": "Yue Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186578?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 126958, "fullname": "Yanting Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126958?format=json", "institution": "Donghua University, Shanghai"}, {"id": 95127, "fullname": "Xiangyu Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/95127?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 156954, "fullname": "Weijie Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/156954?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 153325, "fullname": "Xizhou Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153325?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 76466, "fullname": "Wei Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76466?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 87572, "fullname": "Jifeng Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/87572?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 87610, "fullname": "Wenhai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87610?format=json", "institution": "Shanghai AI Laboratory"}], "abstract": "We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web. The benchmark spans four levels: Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering essential skills for GUI agents. To assess both effectiveness and efficiency, we further propose the Efficiency\u2013Quality-Aware (EQA) metric, which measures task success alongside action redundancy. Extensive evaluations reveal that precise visual grounding is the critical determinant of performance, underscoring the advantages of modular designs with specialized grounding modules. Moreover, all agents suffer from substantial inefficiencies, frequently completing tasks with excessive steps despite eventual success. Performance also degrades on complex or cross-application tasks, exposing weaknesses in memory, planning, and adaptive reasoning. By providing broad coverage, standardized protocols, and novel metrics, MMBench-GUI establishes the first comprehensive foundation for advancing GUI agent research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37055", "url": null, "sourceid": 38269, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40192, "uid": "58840eb65da053fbdea5f4d19dd3e00f", "name": "NoOVD: Novel Category Discovery and Embedding for Open-Vocabulary Object Detection", "authors": [{"id": 144773, "fullname": "Yupeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/144773?format=json", "institution": "Tianjin University"}, {"id": 193746, "fullname": "Ruize Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/193746?format=json", "institution": "Shenzhen University of Advanced Technology; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences; City University of Hong Kong"}, {"id": 186764, "fullname": "Zhiwei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186764?format=json", "institution": "Nanchang University; Jiangxi Normal University"}, {"id": 90857, "fullname": "Wei Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90857?format=json", "institution": "Tianjin University"}, {"id": 153902, "fullname": "Liang Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153902?format=json", "institution": "Tianjin University"}], "abstract": "Despite the remarkable progress in open-vocabulary object detection (OVD), a significant gap remains between the training and testing phases. During training, the RPN and RoI heads often misclassify unlabeled novel-category objects as background, causing some proposals to be prematurely filtered out by the RPN while others are further misclassified by the RoI head. During testing, these proposals again receive low scores and are removed in post-processing, leading to a significant drop in recall and ultimately weakening novel-category detection performance.To address these issues, we propose a novel training framework\u2014NoOVD\u2014which innovatively integrates a self-distillation mechanism grounded in the knowledge of frozen vision-language models (VLMs).Specifically, we design K-FPN, which leverages the pretrained knowledge of VLMs to guide the model in discovering novel-category objects and facilitates knowledge distillation\u2014without requiring additional data\u2014thus preventing forced alignment of novel objects with background.Additionally, we introduce R-RPN, which adjusts the confidence scores of proposals during inference to improve the recall of novel-category objects.Cross-dataset evaluations on OV-LVIS, OV-COCO, and Objects365 demonstrate that our approach consistently achieves superior performance across multiple metrics.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40192", "url": null, "sourceid": 35644, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38578, "uid": "be71b27fdddb8a0fded30cb6f268bb02", "name": "CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation", "authors": [{"id": 107454, "fullname": "Mainak Singha", "url": "http://cvpr.thecvf.com/api/miniconf/users/107454?format=json", "institution": "University of Trento, Italy"}, {"id": 155234, "fullname": "Sarthak Mehrotra", "url": "http://cvpr.thecvf.com/api/miniconf/users/155234?format=json", "institution": "Indian institute of Technology Bombay"}, {"id": 190195, "fullname": "Paolo Casari", "url": "http://cvpr.thecvf.com/api/miniconf/users/190195?format=json", "institution": "University of Trento"}, {"id": 158110, "fullname": "Subhasis Chaudhuri", "url": "http://cvpr.thecvf.com/api/miniconf/users/158110?format=json", "institution": "Indian Institute of Technology Bombay"}, {"id": 75841, "fullname": "Elisa Ricci", "url": "http://cvpr.thecvf.com/api/miniconf/users/75841?format=json", "institution": "University of Trento"}, {"id": 71694, "fullname": "Biplab Banerjee", "url": "http://cvpr.thecvf.com/api/miniconf/users/71694?format=json", "institution": "Associate Professor, IIT Bombay, India"}], "abstract": "Recent vision-language models (VLMs) such as CLIP demonstrate impressive cross-modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real-world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce $\\textbf{CLIPoint3D}$, the first framework for $\\textit{few-shot unsupervised 3D point cloud domain adaptation}$ built upon CLIP. Our approach projects 3D samples into multiple depth maps and exploits the frozen CLIP backbone, refined through a knowledge-driven prompt tuning scheme that integrates high-level language priors with geometric cues from a lightweight 3D encoder. To adapt task-specific features effectively, we apply parameter-efficient fine-tuning to CLIP's encoders and design an entropy-guided view sampling strategy for selecting confident projections. Furthermore, an optimal transport-based alignment loss and an uncertainty-aware prototype alignment loss collaboratively bridge source-target distribution gaps while maintaining class separability. Extensive experiments on PointDA-10 and GraspNetPC-10 benchmarks show that $\\textit{CLIPoint3D}$ achieves consistent $\\textit{3-16}$% accuracy gains over both CLIP-based and conventional encoder-based baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38578", "url": null, "sourceid": 44290, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39815, "uid": "41a85427dc4144544b0ca395cf16e423", "name": "Rethinking Token Reduction for Large Vision-Language Models", "authors": [{"id": 175947, "fullname": "Yi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/175947?format=json", "institution": "Zhejiang University"}, {"id": 85428, "fullname": "Haofei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85428?format=json", "institution": "Zhejiang University"}, {"id": 157824, "fullname": "Qihan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157824?format=json", "institution": "Zhejiang University"}, {"id": 192911, "fullname": "Anda Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192911?format=json", "institution": "Zhejiang University"}, {"id": 76886, "fullname": "Gongfan Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76886?format=json", "institution": "National University of Singapore"}, {"id": 192912, "fullname": "Wei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192912?format=json", "institution": "Alibaba Group"}, {"id": 129604, "fullname": "Xuan Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/129604?format=json", "institution": "University of Science and Technology of China"}, {"id": 85433, "fullname": "Jie Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/85433?format=json", "institution": "Zhejiang University"}, {"id": 85446, "fullname": "Mingli Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/85446?format=json", "institution": "Zhejiang University"}, {"id": 87323, "fullname": "Xinchao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87323?format=json", "institution": "National University of Singapore"}], "abstract": "Large Vision-Language Models (LVLMs) excel in visual understanding and reasoning, but the excessive visual tokens lead to high inference costs. Although recent token reduction methods mitigate this issue, they mainly target single-turn Visual Question Answering (VQA), leaving the more practical multi-turn VQA (MT-VQA) scenario largely unexplored. MT-VQA introduces additional challenges, as subsequent questions are unknown beforehand and may refer to arbitrary image regions, making existing reduction strategies ineffective. Specifically, current approaches fall into two categories: prompt-dependent methods, which bias toward the initial text prompt and discard information useful for subsequent turns; prompt-agnostic ones, which, though technically applicable to multi-turn settings, rely on heuristic reduction metrics such as attention scores, leading to suboptimal performance. In this paper, we propose a learning-based prompt-agnostic method, termed MetaCompress, overcoming the limitations of heuristic designs. We begin by formulating token reduction as a learnable compression mapping, unifying existing formats such as pruning and merging into a single learning objective. Upon this formulation, we introduce a data-efficient training paradigm capable of learning optimal compression mappings with limited computational costs. Extensive experiments on MT-VQA benchmarks and across multiple LVLM architectures demonstrate that MetaCompress achieves superior efficiency\u2013accuracy trade-offs while maintaining strong generalization across dialogue turns. Our code will be released to facilitate future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39815", "url": null, "sourceid": 36955, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37060, "uid": "1a4461e3ca15c1b7c5b322f161cdcf0b", "name": "DK-DDIL: Adaptive Knowledge Retention for Dynamic Domain-Incremental Learning in Medical Imaging", "authors": [{"id": 174318, "fullname": "Yuxi Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/174318?format=json", "institution": "Xiamen university"}, {"id": 186583, "fullname": "Sujie Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186583?format=json", "institution": "Xiamen University"}, {"id": 186584, "fullname": "Jing Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186584?format=json", "institution": "Xiamen University"}, {"id": 85408, "fullname": "Jiacheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85408?format=json", "institution": "Xiamen University"}, {"id": 155637, "fullname": "Yiping Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/155637?format=json", "institution": "Sun Yat-Sen University"}, {"id": 186585, "fullname": "Baptiste Magnier", "url": "http://cvpr.thecvf.com/api/miniconf/users/186585?format=json", "institution": "IMT Mines Ales"}, {"id": 85948, "fullname": "Liansheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85948?format=json", "institution": "Xiamen University, Tsinghua University"}], "abstract": "Large-scale foundation models pretrained on massive datasets have demonstrated strong generalization capabilities in medical image analysis. However, they are typically trained on static datasets and struggle to cope with the continuously evolving nature of clinical data, where new imaging devices, institutions, and disease subtypes constantly emerge. While domain-incremental learning (DIL) provides a solution for sequential adaptation without revisiting historical data, existing methods typically assume fixed label spaces and limited domain heterogeneity, restricting their applicability to real-world clinical scenarios. To address these challenges, we propose DK-DDIL, a rehearsal-free framework for dynamic DIL that integrates two synergistic modules: a Dynamic Adaptation Module (DAM) employing dynamic rank selection and adaptive regularization to flexibly allocate model capacity under domain shifts, and a Knowledge Inheritance and Refinement (KIR) module that stabilizes cross-domain knowledge transfer through selective adapter fusion and prototype-level contrastive refinement. Experiments on the Skin Pathology Diagnosis dataset, the Cyst-X 3D MRI cohort, and the OfficeHome benchmark demonstrate that DK-DDIL consistently outperforms state-of-the-art DIL approaches, highlighting its effectiveness and versatility across dynamic 2D medical, 3D medical, and natural image domains.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37060", "url": null, "sourceid": 33137, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40281?format=json"], "related_events_ids": [40281]}, {"id": 40281, "uid": "1a4461e3ca15c1b7c5b322f161cdcf0b", "name": "DK-DDIL: Adaptive Knowledge Retention for Dynamic Domain-Incremental Learning in Medical Imaging", "authors": [{"id": 174318, "fullname": "Yuxi Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/174318?format=json", "institution": "Xiamen university"}, {"id": 186583, "fullname": "Sujie Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186583?format=json", "institution": "Xiamen University"}, {"id": 186584, "fullname": "Jing Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186584?format=json", "institution": "Xiamen University"}, {"id": 85408, "fullname": "Jiacheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85408?format=json", "institution": "Xiamen University"}, {"id": 155637, "fullname": "Yiping Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/155637?format=json", "institution": "Sun Yat-Sen University"}, {"id": 186585, "fullname": "Baptiste Magnier", "url": "http://cvpr.thecvf.com/api/miniconf/users/186585?format=json", "institution": "IMT Mines Ales"}, {"id": 85948, "fullname": "Liansheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85948?format=json", "institution": "Xiamen University, Tsinghua University"}], "abstract": "Large-scale foundation models pretrained on massive datasets have demonstrated strong generalization capabilities in medical image analysis. However, they are typically trained on static datasets and struggle to cope with the continuously evolving nature of clinical data, where new imaging devices, institutions, and disease subtypes constantly emerge. While domain-incremental learning (DIL) provides a solution for sequential adaptation without revisiting historical data, existing methods typically assume fixed label spaces and limited domain heterogeneity, restricting their applicability to real-world clinical scenarios. To address these challenges, we propose DK-DDIL, a rehearsal-free framework for dynamic DIL that integrates two synergistic modules: a Dynamic Adaptation Module (DAM) employing dynamic rank selection and adaptive regularization to flexibly allocate model capacity under domain shifts, and a Knowledge Inheritance and Refinement (KIR) module that stabilizes cross-domain knowledge transfer through selective adapter fusion and prototype-level contrastive refinement. Experiments on the Skin Pathology Diagnosis dataset, the Cyst-X 3D MRI cohort, and the OfficeHome benchmark demonstrate that DK-DDIL consistently outperforms state-of-the-art DIL approaches, highlighting its effectiveness and versatility across dynamic 2D medical, 3D medical, and natural image domains.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40281", "url": null, "sourceid": -33137, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37060?format=json"], "related_events_ids": [37060]}, {"id": 38906, "uid": "82ae2e0a5d3b2ef7662e589e8349de15", "name": "Learning Latent Proxies for Controllable Single-Image Relighting", "authors": [{"id": 157388, "fullname": "Haoze Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/157388?format=json", "institution": "HKUST"}, {"id": 186991, "fullname": "Zihao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186991?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 190946, "fullname": "Xianfeng Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190946?format=json", "institution": "The Hong Kong Polytechnic University, Hong Kong Polytechnic University"}, {"id": 190947, "fullname": "Yajing Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/190947?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 168362, "fullname": "Yexin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/168362?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 190948, "fullname": "Yun LI", "url": "http://cvpr.thecvf.com/api/miniconf/users/190948?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 86109, "fullname": "Xiaogang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86109?format=json", "institution": "Zhejiang Lab"}, {"id": 153060, "fullname": "Harry Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153060?format=json", "institution": "Hong Kong University of Science and Technology"}], "abstract": "Single-image relighting is highly under-constrained: small illumination changes can produce large, nonlinear variations in shading, shadows, and specularities, while geometry and materials remain unobserved. Existing diffusion-based approaches either rely on intrinsic- or G-buffer\u2013based pipelines that require dense and fragile supervision, or operate purely in latent space without physical grounding, making fine-grained control of direction, intensity, and color unreliable. We observe that full intrinsic decomposition is unnecessary for accurate relighting. Instead, sparse but physically meaningful cues\u2014indicating where illumination should change and how materials should respond\u2014are sufficient to guide a diffusion model. Based on this insight, we introduce LightCtrl that integrates minimal physical priors at two levels: a few-shot latent proxy encoder that extracts compact material\u2013geometry cues from limited PBR supervision, and a lighting-aware mask that identifies illumination-sensitive regions and steers the denoiser toward shading-relevant pixels. To compensate for scarce PBR data, we refine the proxy branch using a DPO-based objective that aligns predicted cues with perceptually preferred relighting behavior. We further present ScaLight, a large-scale object-level dataset with systematically varied illumination and complete camera\u2013light metadata, enabling physically consistent and controllable training. Across object- and scene-level benchmarks, our method achieves photometrically faithful relighting with accurate continuous control, surpassing prior diffusion- and intrinsic-based baselines, including gains of up to +2.4 dB PSNR and 35% lower RMSE under controlled lighting shifts.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38906", "url": null, "sourceid": 42263, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37211, "uid": "65c6cd63f9c7a97d36b6648b1795f35e", "name": "Token Warping Helps MLLMs Look from Nearby Viewpoints", "authors": [{"id": 179560, "fullname": "Phillip Y. Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/179560?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 186932, "fullname": "Chanho Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/186932?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 186933, "fullname": "Mingue Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/186933?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 70920, "fullname": "Seungwoo Yoo", "url": "http://cvpr.thecvf.com/api/miniconf/users/70920?format=json", "institution": "Korea Advanced Institute of Science and Technology (KAIST)"}, {"id": 107384, "fullname": "Juil Koo", "url": "http://cvpr.thecvf.com/api/miniconf/users/107384?format=json", "institution": "KAIST"}, {"id": 75799, "fullname": "Minhyuk Sung", "url": "http://cvpr.thecvf.com/api/miniconf/users/75799?format=json", "institution": "KAIST"}], "abstract": "Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from nearby viewpoints? While MLLMs perform well on a single image reasoning, they remain fragile to viewpoint changes because pixel-level warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint warping. We compare two token-level transformation strategies, forward and backward warping, and find that backward token fetching, which selects tokens at target-view grid locations and retrieves their counterparts from the source view, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, while consistently outperforming all baselines, including pixel-warping approaches, MLLMs fine-tuned for spatial reasoning, and a generative warping method.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37211", "url": null, "sourceid": 44176, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40184, "uid": "1441b775217b1c146325d7dfb664c09d", "name": "HTTM: Head-wise Temporal Token Merging for Faster VGGT", "authors": [{"id": 176898, "fullname": "Weitian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176898?format=json", "institution": "Robert Bosch GmbH"}, {"id": 193739, "fullname": "Lukas Meiner", "url": "http://cvpr.thecvf.com/api/miniconf/users/193739?format=json", "institution": "Bosch; University of L\u00fcbeck"}, {"id": 193740, "fullname": "Shubham rai", "url": "http://cvpr.thecvf.com/api/miniconf/users/193740?format=json", "institution": null}, {"id": 184077, "fullname": "Cecilia Parra", "url": "http://cvpr.thecvf.com/api/miniconf/users/184077?format=json", "institution": "Bosch"}, {"id": 193741, "fullname": "Akash Kumar", "url": "http://cvpr.thecvf.com/api/miniconf/users/193741?format=json", "institution": "Ruhr-Universt\u00e4t Bochum"}], "abstract": "The Visual Geometry Grounded Transformer (VGGT) marks a significant leap forward in 3D scene reconstruction, as it is the first model that directly infers all key 3D attributes (camera poses, depths, and dense geometry) jointly in one pass. However, this joint inference mechanism requires global attention layers that perform all-to-all attention computation on tokens from all views. For reconstruction of large scenes with long-sequence inputs, this causes a significant latency bottleneck. In this paper, we propose head-wise temporal merging (HTTM), a training-free 3D token merging method for accelerating VGGT.Existing merging techniques merge tokens uniformly across different attention heads, resulting in identical tokens in the layers' output, which hinders the model's representational ability. HTTM tackles this problem by merging tokens in multi-head granularity, which preserves the uniqueness of feature tokens after head concatenation. Additionally, this enables HTTM to leverage the spatial locality and temporal correspondence observed at the head level to achieve higher merging ratios with lower merging costs compared to existing methods. Thus, HTTM achieves up to $7\\times$ acceleration with negligible performance drops in a GPU-based inference.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40184", "url": null, "sourceid": 33269, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37702, "uid": "9b66d116fd05ff70ce34863f5ccadd6c", "name": "IFCSR: Inference-Free Fidelity-Realism Control for One-Step Diffusion-based Real-World Image Super-Resolution", "authors": [{"id": 188046, "fullname": "Jonghee Back", "url": "http://cvpr.thecvf.com/api/miniconf/users/188046?format=json", "institution": "BLUEDOT Inc."}, {"id": 147768, "fullname": "Jongju Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/147768?format=json", "institution": "BLUEDOT Inc."}, {"id": 188047, "fullname": "Jeong-Uk Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/188047?format=json", "institution": "Bluedot"}, {"id": 188048, "fullname": "Eunjin Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/188048?format=json", "institution": "BLUEDOT Inc."}, {"id": 188049, "fullname": "Minyong Jeon", "url": "http://cvpr.thecvf.com/api/miniconf/users/188049?format=json", "institution": "BLUEDOT Inc."}], "abstract": "Diffusion models have recently achieved remarkable success in real-world image super-resolution (ISR), typically balancing a trade-off between fidelity (i.e., similarity to HR images) and realism (i.e., perceptual naturalness). To better account for subjective preferences in image quality, controllable diffusion-based methods have been explored, allowing personalized adjustment of this trade-off via tunable parameters. While existing controllable methods have shown effective control, they operate in the latent space and require repeated network inference during adjustment, eventually limiting their practicality. In this paper, we propose IFCSR, a simple yet practical approach for one-step diffusion-based real-world ISR that enables inference-free control between fidelity and realism. The key idea behind IFCSR is to design a controllable model that adjusts the fidelity-realism trade-off in the image space, rather than in the latent space. Such an image-space control allows users to seamlessly adjust the trade-off without extra inference after an initial inference of fidelity- and realism-specific images. We further introduce a two-stage training scheme and specialized losses that encourage the controllable space to span a broad spectrum of fidelity and realism. Our method achieves quality competitive with state-of-the-art models while providing a practical advantage through inference-free control.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37702", "url": null, "sourceid": 38988, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38531, "uid": "009523565f11c6c933850f411b71acf6", "name": "Revisiting the Necessity of Full Accuracy: Weakly Supervised Object-Level Offset Correction for Misaligned Building Labels", "authors": [{"id": 190070, "fullname": "Junda Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190070?format=json", "institution": "Beijing Normal University"}, {"id": 190071, "fullname": "Yanmeng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190071?format=json", "institution": "Beijing Normal University"}, {"id": 190072, "fullname": "Xiangqiang Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190072?format=json", "institution": "Beijing Normal University"}, {"id": 190073, "fullname": "Jinrong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190073?format=json", "institution": "Beijing Normal University"}, {"id": 185545, "fullname": "Ying Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185545?format=json", "institution": "Beijing Normal University"}, {"id": 155853, "fullname": "Libao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155853?format=json", "institution": "Beijing Normal University"}], "abstract": "Google Earth imagery, combined with building footprint databases, offers an efficient way to construct localized building datasets. However, the lack of orthorectification in these images leads to spatial misalignments between annotations and their corresponding roof locations. Adopting such misaligned data directly for model training can severely degrade segmentation performance. To address the challenge, we propose an Object-based Multi-stage Alignment Framework (OMAF) that generates high-quality corrected labels with minimal manual intervention. OMAF first employs a prior-regularized self-alignment method to produce high-confidence, object-level offset pseudo-labels, which are then used to train an instance-level offset regression model for label refinement. Experimental results on the challenging Islahiye and Antakya datasets demonstrate that OMAF effectively corrects misalignments and consistently boosts the mIoU of all baseline models by up to $40.6\\%$. The ablation experiments also demonstrated that each module in OMAF effectively improves the final alignment performance. Among them, the self-alignment algorithm contributed $9.22\\%$ to the mIoU metric, demonstrating the strong effectiveness of this unsupervised alignment method.This work provides a practical and cost-effective solution for large-scale dataset construction and domain adaptation.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38531", "url": null, "sourceid": 39178, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36848, "uid": "1cba40e3efa7040ac28c0f065f7cacd1", "name": "Variational Graph-based Normal Integration", "authors": [{"id": 186019, "fullname": "Lixiong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186019?format=json", "institution": "University of Oxford, University of Oxford"}, {"id": 107468, "fullname": "Bohan Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107468?format=json", "institution": "Peking University"}, {"id": 77342, "fullname": "Victor Adrian Prisacariu", "url": "http://cvpr.thecvf.com/api/miniconf/users/77342?format=json", "institution": "Niantic; University of Oxford"}, {"id": 76916, "fullname": "Imari Sato", "url": "http://cvpr.thecvf.com/api/miniconf/users/76916?format=json", "institution": "National Institute of Informatics"}], "abstract": "We present a general optimization-based framework for depth-preserving normal integration. Unlike existing methods that operate on surface orientations defined over regular grids, our approach introduces a unified graph-based formulation capable of integrating semi-differentiable surfaces on unstructured domains. Given a set of points uniformly sampled from a surface, we construct a directed, weighted graph that jointly parameterizes the surface geometry and pairwise point correlations. Surface depth is recovered by minimizing projected point-to-plane distances across the graph, and this objective is optimized through variational inference. In our formulation, estimated surface normals serve as latent variables that encode local geometry via the posterior probabilities of a two-component Gaussian mixture, allowing depth discontinuities to be inferred from sampled triplet configurations. The unknowns are estimated in an alternating fashion, and we provide a geometric interpretation of this inference process by relating it to shape deformation. Experimental results show that the proposed method not only outperforms state-of-the-art techniques on regularly gridded data, but also generalizes effectively to scattered points, which existing approaches do not directly support.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36848", "url": null, "sourceid": 40376, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39361, "uid": "3cece0ea5371c5af680cdf2a0c9b72b4", "name": "AtomicVLA: Unlocking the Potential of Atomic Skill Learning in Robots", "authors": [{"id": 145151, "fullname": "Likui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/145151?format=json", "institution": "Sun Yat-sen University"}, {"id": 71020, "fullname": "Tao Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/71020?format=json", "institution": "SYSU"}, {"id": 191924, "fullname": "Zhihao Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191924?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 176495, "fullname": "xiuwei chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/176495?format=json", "institution": "SYSU"}, {"id": 191925, "fullname": "Zisheng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191925?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 85001, "fullname": "Jianhua Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/85001?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 190862, "fullname": "Jiangtong Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190862?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 178413, "fullname": "Pei Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/178413?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 91873, "fullname": "Hang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/91873?format=json", "institution": "Huawei Noah\u2018s Ark Lab"}, {"id": 191926, "fullname": "Hefeng Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191926?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 75470, "fullname": "Liang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/75470?format=json", "institution": "Sun Yat-sen University"}, {"id": 69930, "fullname": "Xiaodan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/69930?format=json", "institution": "Sun Yat-sen University"}], "abstract": "Recent advances in Visual-Language-Action (VLA) models have shown promising potential for robotic manipulation tasks.However, real-world robotic tasks often involve long-horizon, multi-step problem-solving and require generalization for continual skill acquisition, extending beyond single actions or skills. These challenges present significant barriers for existing VLA models, which use monolithic action decoders trained on aggregated data, resulting in poor scalability.To address these challenges, we propose AtomicVLA, a unified planning-and-execution framework that jointly generates task-level plans, atomic skill abstractions, and fine-grained actions. AtomicVLA constructs a scalable atomic skill library through a Skill-Guided Mixture-of-Experts (SG-MoE), where each expert specializes in mastering generic yet precise atomic skills. Furthermore, we introduce a flexible routing encoder that automatically assigns dedicated atomic experts to new skills, enabling continual learning.We validate our approach through extensive experiments. In simulation, AtomicVLA outperforms $\\pi_{0}$ by 2.4\\% on LIBERO, 10\\% on LIBERO-LONG, and outperforms $\\pi_{0}$ and $\\pi_{0.5}$ by 0.22 and 0.25 in average task length on CALVIN. Additionally, our AtomicVLA consistently surpasses baselines by 18.3\\% and 21\\% in real-world long-horizon tasks and continual learning. These results highlight the effectiveness of atomic skill abstraction and dynamic expert composition for long-horizon and lifelong robotic tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39361", "url": null, "sourceid": 44628, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36695, "uid": "29903d43ded57b63765b195d7c0b4099", "name": "MoRGS: Efficient Per-Gaussian Motion Reasoning for Streamable Dynamic 3D Scenes", "authors": [{"id": 185661, "fullname": "Wonjoon Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/185661?format=json", "institution": "Yonsei University"}, {"id": 185662, "fullname": "Sungmin Woo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185662?format=json", "institution": null}, {"id": 185663, "fullname": "Donghyeong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/185663?format=json", "institution": "Yonsei University"}, {"id": 69658, "fullname": "Jungho Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/69658?format=json", "institution": "Yonsei University"}, {"id": 185664, "fullname": "Sangheon Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/185664?format=json", "institution": "Electronics and Telecommunications Research Institute; Yonsei University"}, {"id": 87957, "fullname": "Sangyoun Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/87957?format=json", "institution": "Yonsei University"}], "abstract": "Online reconstruction of dynamic scenes aims to learn from streaming multi-view inputs under low-latency constraints. The fast training and real-time rendering capabilities of 3D Gaussian Splatting have made on-the-fly reconstruction practically feasible, enabling online 4D reconstruction. However, existing online approaches, despite their efficiency and visual quality, fail to learn per-Gaussian motion that reflects true scene dynamics. Without explicit motion cues, appearance and motion are optimized solely under photometric loss, causing per-Gaussian motion to chase pixel residuals rather than true 3D motion. To address this, we propose MoRGS, an efficient online per-Gaussian motion reasoning framework that treats Gaussian movement as a core modeling object. Specifically, we efficiently leverage optical flow on a sparse  set of key views as a lightweight motion cue to guide per-Gaussian motion toward the scene\u2019s true dynamics. To compensate for the sparsity and view-dependence of flow, we learn a per-Gaussian motion offset field that reconciles discrepancies between projected 3D motion and observed flow across views and time. In addition, we introduce a per-Gaussian motion confidence that separates dynamic from static Gaussians and weights Gaussian attribute residual updates, thereby suppressing redundant motion in static regions for better temporal consistency and accelerating the modeling of large motions. Extensive experiments demonstrate that MoRGS achieves state-of-the-art reconstruction quality and motion fidelity among online methods, while maintaining streamable performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36695", "url": null, "sourceid": 44647, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39615, "uid": "03e3ae4dc11a6abfcd5683caea151a40", "name": "Dual-level Adaptation for Multi-Object Tracking: Building Test-Time Calibration from Experience and Intuition", "authors": [{"id": 192484, "fullname": "Wen Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192484?format=json", "institution": "Shandong Technology and Business Universality"}, {"id": 184234, "fullname": "Pengfei Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184234?format=json", "institution": "Shandong Technology and Business University"}, {"id": 192485, "fullname": "Zongmeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192485?format=json", "institution": "Inner Mongolia University"}, {"id": 152744, "fullname": "Yufan Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152744?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 76372, "fullname": "Junyu Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/76372?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}], "abstract": "Multiple Object Tracking (MOT) has long been a fundamental task in computer vision, with broad applications in various real-world scenarios.However, due to distribution shifts in appearance, motion pattern, and catagory between the training and testing data, model performance degrades considerably during online inference in MOT.Test-Time Adaptation (TTA) has emerged as a promising paradigm to alleviate such distribution shifts.However, existing TTA methods often fail to deliver satisfactory results in MOT, as they primarily focus solely on frame-level adaptation while neglecting temporal consistency and identity association across frames and videos.Inspired by human decision-making process, this paper propose a Test-time Calibration from Experience and Intuition (TCEI) framework.In this framework, the Intuitive system utilizes transient memory to recall recently observed objects for rapid predictions, while the Experiential system leverages the accumulated experience from prior test videos to reassess and calibrate these intuitive predictions.Furthermore, both confident and uncertain objects during online testing are exploited as historical priors and reflective cases, respectively, enabling the model to adapt to the testing environment and alleviate performance degradation.Extensive experiments demonstrate that the proposed TCEI framework consistently achieves superior performance across multiple benchmark datasets and significantly enhances the model's adaptability under distribution shifts.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39615", "url": null, "sourceid": 35491, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39628, "uid": "f7c59863dbcc20bd6fedfa36f2329728", "name": "Bridging Domains through Subspace-Aware Model Merging", "authors": [{"id": 180117, "fullname": "Levy Chaves", "url": "http://cvpr.thecvf.com/api/miniconf/users/180117?format=json", "institution": "Universidade Estadual de Campinas (UNICAMP)"}, {"id": 192520, "fullname": "Chao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192520?format=json", "institution": "CISPA Helmholtz Center for Information Security"}, {"id": 192521, "fullname": "Rebekka Burkholz", "url": "http://cvpr.thecvf.com/api/miniconf/users/192521?format=json", "institution": "CISPA Helmholtz Center for Information Security"}, {"id": 192522, "fullname": "Eduardo Valle", "url": "http://cvpr.thecvf.com/api/miniconf/users/192522?format=json", "institution": "Intercom"}, {"id": 182548, "fullname": "Sandra Avila", "url": "http://cvpr.thecvf.com/api/miniconf/users/182548?format=json", "institution": "UNICAMP"}], "abstract": "Model merging integrates multiple task-specific models into a single consolidated one. Recent research has made progress in improving merging performance for in-distribution or multi-task scenarios, but domain generalization in model merging remains underexplored. We investigate how merging models fine-tuned on distinct domains affects generalization to unseen domains. Through an analysis of parameter competition in the task matrix using singular value decomposition (SVD), we show that merging models trained under different distribution shifts induces stronger conflicts between their subspaces compared to traditional multi-task settings. To mitigate this issue, we propose SCORE (Subspace COnflict-Resolving mErging), a method designed to alleviate such singular subspace conflicts. SCORE finds a shared orthogonal basis by computing the principal components of the concatenated leading singular vectors of all models. It then projects each task matrix into the shared basis, pruning off-diagonal components to remove conflicting singular directions. SCORE consistently outperforms, on average, existing model merging approaches in domain generalization settings across a variety of architectures and model scales, demonstrating its effectiveness and scalability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39628", "url": null, "sourceid": 35785, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40041, "uid": "3548c43c2ed4c969f65ee0d12970fac8", "name": "UnReflectAnything: RGB-Only Highlight Removal by Rendering Synthetic Specular Supervision", "authors": [{"id": 176865, "fullname": "Alberto Rota", "url": "http://cvpr.thecvf.com/api/miniconf/users/176865?format=json", "institution": "Politecnico di Milano"}, {"id": 143907, "fullname": "Mert Kiray", "url": "http://cvpr.thecvf.com/api/miniconf/users/143907?format=json", "institution": "Technical University of Munich"}, {"id": 193366, "fullname": "Mert Asim Karaoglu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193366?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen - ImFusion"}, {"id": 86195, "fullname": "Patrick Ruhkamp", "url": "http://cvpr.thecvf.com/api/miniconf/users/86195?format=json", "institution": "Technical University Munich"}, {"id": 193367, "fullname": "Elena De Momi", "url": "http://cvpr.thecvf.com/api/miniconf/users/193367?format=json", "institution": "Polytechnic Institute of Milan"}, {"id": 86152, "fullname": "Nassir Navab", "url": "http://cvpr.thecvf.com/api/miniconf/users/86152?format=json", "institution": "TU Munich"}, {"id": 74044, "fullname": "Benjamin Busam", "url": "http://cvpr.thecvf.com/api/miniconf/users/74044?format=json", "institution": "Technical University of Munich"}], "abstract": "Specular highlights distort appearance, obscure texture, and hinder geometric reasoning in both natural and surgical imagery. We present UnReflectAnything, an RGB-only framework that removes highlights from a single image by predicting a highlight map together with a reflection-free diffuse reconstruction. The model uses a frozen vision transformer encoder to extract multi-scale features, a lightweight head to localize specular regions, and a token-level inpainting module that restores corrupted feature patches before producing the final diffuse image. To overcome the lack of paired supervision, we introduce a Virtual Highlight Synthesis pipeline that renders physically plausible specularities using monocular geometry, Fresnel-aware shading, and randomized lighting which enables training on arbitrary RGB images with correct geometric structure. UnReflectAnything generalizes across natural and surgical domains where non-Lambertian surfaces and non-uniform lighting create severe highlights and it achieves competitive performance with state-of-the-art results on several benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40041", "url": null, "sourceid": 36285, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40393?format=json"], "related_events_ids": [40393]}, {"id": 40393, "uid": "3548c43c2ed4c969f65ee0d12970fac8", "name": "UnReflectAnything: RGB-Only Highlight Removal by Rendering Synthetic Specular Supervision", "authors": [{"id": 176865, "fullname": "Alberto Rota", "url": "http://cvpr.thecvf.com/api/miniconf/users/176865?format=json", "institution": "Politecnico di Milano"}, {"id": 143907, "fullname": "Mert Kiray", "url": "http://cvpr.thecvf.com/api/miniconf/users/143907?format=json", "institution": "Technical University of Munich"}, {"id": 193366, "fullname": "Mert Asim Karaoglu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193366?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen - ImFusion"}, {"id": 86195, "fullname": "Patrick Ruhkamp", "url": "http://cvpr.thecvf.com/api/miniconf/users/86195?format=json", "institution": "Technical University Munich"}, {"id": 193367, "fullname": "Elena De Momi", "url": "http://cvpr.thecvf.com/api/miniconf/users/193367?format=json", "institution": "Polytechnic Institute of Milan"}, {"id": 86152, "fullname": "Nassir Navab", "url": "http://cvpr.thecvf.com/api/miniconf/users/86152?format=json", "institution": "TU Munich"}, {"id": 74044, "fullname": "Benjamin Busam", "url": "http://cvpr.thecvf.com/api/miniconf/users/74044?format=json", "institution": "Technical University of Munich"}], "abstract": "Specular highlights distort appearance, obscure texture, and hinder geometric reasoning in both natural and surgical imagery. We present UnReflectAnything, an RGB-only framework that removes highlights from a single image by predicting a highlight map together with a reflection-free diffuse reconstruction. The model uses a frozen vision transformer encoder to extract multi-scale features, a lightweight head to localize specular regions, and a token-level inpainting module that restores corrupted feature patches before producing the final diffuse image. To overcome the lack of paired supervision, we introduce a Virtual Highlight Synthesis pipeline that renders physically plausible specularities using monocular geometry, Fresnel-aware shading, and randomized lighting which enables training on arbitrary RGB images with correct geometric structure. UnReflectAnything generalizes across natural and surgical domains where non-Lambertian surfaces and non-uniform lighting create severe highlights and it achieves competitive performance with state-of-the-art results on several benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40393", "url": null, "sourceid": -36285, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40041?format=json"], "related_events_ids": [40041]}, {"id": 39269, "uid": "da4387d24e5cd56f15505ea8df550eeb", "name": "TopoMA: Topology-Guided Multi-Agent Dense RGB 3D Reconstruction via Distributed Inference", "authors": [{"id": 173751, "fullname": "Xuanxuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/173751?format=json", "institution": "Wuhan University"}, {"id": 173467, "fullname": "ShuHui Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/173467?format=json", "institution": "Faculty of Electronic and Information Engineering Xi&#x27;an Jiaotong University, Xi&#x27;an ,China"}, {"id": 180337, "fullname": "Tianxiang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180337?format=json", "institution": "Wuhan University"}, {"id": 182475, "fullname": "Zhetao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/182475?format=json", "institution": "Cloudspace Technology Co., Ltd."}, {"id": 176105, "fullname": "Zixuan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176105?format=json", "institution": "Wuhan University"}, {"id": 191738, "fullname": "You Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191738?format=json", "institution": "Wuhan University"}], "abstract": "Multi-agent 3D reconstruction, as a key technology for large-scale VR/AR, robot swarms, and digital twins, has attracted growing attention. Recent end-to-end 3D reconstruction methods achieve strong performance in single-agent scenarios, but they are difficult to directly extend to multi-agent collaborative settings, where they often suffer from unstable tracking, excessive memory consumption, and frequent loop-closure failures, thus failing to meet real-time and large-scale deployment requirements. To address these issues, we propose TOPOMA, a real-time end-to-end 3D reconstruction framework tailored for multi-agent collaboration. TOPOMA explicitly models the spatial topological structure of the scene and tightly couples it with end-to-end representation learning, thereby jointly solving core challenges such as inter-agent spatial alignment and submap fusion. Concretely, we introduce topology skeleton modeling and optimization, decentralized loop closure, and topology-guided residual transport, and build upon them a fully distributed inference architecture in which each agent can independently store, reconstruct, and incrementally optimize its map while collaborating through lightweight topological information. Extensive experiments demonstrate that, compared with existing methods, TOPOMA achieves consistently higher trajectory accuracy, reconstruction quality, robustness, and topological consistency, showing superior adaptability and scalability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39269", "url": null, "sourceid": 34997, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38030, "uid": "dd0bc433d9a5a097cf08a42aeeb14df2", "name": "Make it SING: Analyzing Semantic Invariants in Classifiers", "authors": [{"id": 181496, "fullname": "Harel Yadid", "url": "http://cvpr.thecvf.com/api/miniconf/users/181496?format=json", "institution": "Technion - Israel Institute of Technology, Technion - Israel Institute of Technology"}, {"id": 186994, "fullname": "Meir Yossef Levi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186994?format=json", "institution": "Technion - Israel Institute of Technology, Technion - Israel Institute of Technology"}, {"id": 180248, "fullname": "Roy Betser", "url": "http://cvpr.thecvf.com/api/miniconf/users/180248?format=json", "institution": "Technion - Israel Institute of Technology, Technion; Fujitsu Research of Europe"}, {"id": 77507, "fullname": "Guy Gilboa", "url": "http://cvpr.thecvf.com/api/miniconf/users/77507?format=json", "institution": "Technion"}], "abstract": "All classifiers, including state-of-the-art vision models, possess invariants, partially rooted in the geometry of their linear mappings. These invariants, which reside in the null-space of the classifier, induce equivalent sets of inputs that map to identical outputs. The semantic content of these invariants remains vague, as existing approaches struggle to provide human-interpretable information. To address this gap, we present Semantic Interpretation of the Null-space Geometry (SING), a method that constructs equivalent images, with respect to the network, and assigns semantic interpretations to the available variations. We use a mapping from network features to multi-modal vision language models. This allows us to obtain natural language descriptions and visual examples of the induced semantic shifts. SING can be applied to a single image, uncovering local invariants, or to sets of images, allowing a breadth of statistical analysis at the class and model levels. For example, our method reveals that ResNet50 leaks relevant semantic attributes to the null space, whereas DINO-ViT, a ViT pretrained with self-supervised DINO, is superior in maintaining class semantics across the invariant space.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38030", "url": null, "sourceid": 44096, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37394, "uid": "7d1fb6b4eec12bee96b88020c7afadb8", "name": "ParTY: Part-Guidance for Expressive Text-to-Motion Synthesis", "authors": [{"id": 187334, "fullname": "KunHo Heo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187334?format=json", "institution": "Kyung Hee University"}, {"id": 181790, "fullname": "SuYeon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/181790?format=json", "institution": "Kyung Hee University"}, {"id": 187335, "fullname": "Yonghyun Gwon", "url": "http://cvpr.thecvf.com/api/miniconf/users/187335?format=json", "institution": "Kyung Hee University"}, {"id": 187336, "fullname": "Youngbin Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/187336?format=json", "institution": "Kyung Hee University"}, {"id": 187337, "fullname": "MyeongAh Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/187337?format=json", "institution": "Kyung Hee University"}], "abstract": "Text-to-motion synthesis aims to generate natural and expressive human motions from textual descriptions. While existing approaches primarily focus on generating holistic motions from text descriptions, they struggle to accurately reflect actions involving specific body parts. Recent part-wise motion generation methods attempt to resolve this but face two critical limitations: (i) they lack explicit mechanisms for aligning textual semantics with individual body parts, and (ii) they often generate incoherent full-body motions due to integrating independently generated part motions. To overcome these issues and resolve the fundamental trade-off in existing methods, we propose ParTY, a novel framework that enhances part expressiveness while generating coherent full-body motions. ParTY comprises: (1) Part-Guided Network, which first generates part motions to obtain part guidance, then uses it to generate holistic motions; (2) Part-aware Text Grounding, which diversely transforms text embeddings and appropriately aligns them with each body part; and (3) Holistic-Part Fusion, which adaptively fuses holistic motions and part motions. Extensive experiments, including part-level and coherence-level evaluations, demonstrate that ParTY achieves substantial improvements over previous methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37394", "url": null, "sourceid": 31688, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38425, "uid": "30c79ab1cb1bb865fe03120da341ee09", "name": "Unleashing Vision-Language Semantics for Video Deepfake Detection", "authors": [{"id": 107179, "fullname": "Jiawen Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107179?format=json", "institution": "Singapore Management University"}, {"id": 153657, "fullname": "Yunqi Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153657?format=json", "institution": "Huawei London Research Center"}, {"id": 187458, "fullname": "Xueyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187458?format=json", "institution": ""}, {"id": 74045, "fullname": "Jiankang Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/74045?format=json", "institution": "Imperial College London"}, {"id": 89120, "fullname": "Guansong Pang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89120?format=json", "institution": "Singapore Management University"}], "abstract": "Recent video Deepfake Detection (DFD) studies have demonstrated that pre-trained Vision-Language Models (VLMs) such as CLIP exhibit strong generalization capabilities in detecting artifacts across different identities. However, existing approaches focus on leveraging visual features only, overlooking their most distinctive strength \u2014 the rich vision-language semantics embedded in the latent space. We proposes VLAForge, a novel DFD framework that unleashes the potential of such cross-modal semantics in enhancing model's discriminability in deepfake detection. This work i) enhances the visual perception of VLM through a ForgePerceiver, which acts as an independent learner to capture subtle and diverse forgery cues both granularly and holistically, while preserving the pretrained Vision\u2013Language Alignment (VLA) knowledge, and ii) provides a complementary discriminative cue \u2014 Identity-aware VLA score, derived by coupling cross-modal semantics with the forgery cues learned by ForgePerceiver.  Notably, the VLA score is augmented by an identity prior-informed text prompting to capture authenticity cues tailored to each identity, thereby enabling more discriminative cross-modal semantics. Comprehensive experiments on video DFD benchmarks, including classical face-swapping forgeries and recent full-face generation forgeries, demonstrate that our VLAForge substantially outperforms state-of-the-art methods at both frame and video levels.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38425", "url": null, "sourceid": 34587, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37993, "uid": "44a1266e4d281aaef1704b6e60524cce", "name": "Graph-Guided Online Concept Erasure for Text-to-Image Diffusion Models", "authors": [{"id": 188773, "fullname": "Ning Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/188773?format=json", "institution": "Xiangtan University"}, {"id": 188774, "fullname": "Zhenyu Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/188774?format=json", "institution": "Xiangtan University"}, {"id": 188775, "fullname": "Feng Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/188775?format=json", "institution": "Fudan University"}, {"id": 179257, "fullname": "Yuhua Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/179257?format=json", "institution": "Xiangtan University"}, {"id": 149486, "fullname": "Chengqing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/149486?format=json", "institution": "Xiangtan University"}, {"id": 89124, "fullname": "Jingjing Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89124?format=json", "institution": "Fudan University"}], "abstract": "Concept erasure aims to remove harmful, inappropriate, or copyrighted content from text-to-image diffusion models while preserving non-target semantics. However, existing methods either rely on costly fine-tuning or apply coarse semantic separation, often degrading unrelated concepts and lacking adaptability to evolving concept sets. To alleviate this issue, we propose Graph-Guided Online Concept Erasure (GrOCE), a training-free framework that performs precise and adaptive concept removal through graph-based semantic reasoning. GrOCE models concepts and their interrelations as a dynamic semantic graph, enabling principled reasoning over dependencies and fine-grained isolation of undesired content. It comprises three components: (1) Dynamic Topological Graph Construction for incremental graph building, (2) Adaptive Cluster Identification for multi-hop traversal with similarity-decay scoring, and (3) Selective Edge Severing for targeted edge removal while preserving global semantics. Extensive experiments demonstrate that GrOCE achieves state-of-the-art performance on Concept Similarity (CS) and Fr\u00e9chet Inception Distance (FID) metrics, offering efficient, accurate, and stable concept erasure without retraining.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37993", "url": null, "sourceid": 43170, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40160, "uid": "5994cf5ebd61f4806932d5f226cb64d0", "name": "Structural Graph Probing of Vision\u2013Language Models", "authors": [{"id": 193668, "fullname": "Haoyu He", "url": "http://cvpr.thecvf.com/api/miniconf/users/193668?format=json", "institution": "Northeastern University"}, {"id": 193669, "fullname": "Yue Zhuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/193669?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 193670, "fullname": "Yu Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193670?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 193671, "fullname": "Qi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193671?format=json", "institution": "Northeastern University"}], "abstract": "The internal organization of vision\u2013language models (VLMs) remains poorly understood, particularly how they distribute and fuse information across layers. We take a topology-first perspective and analyze VLMs through the interaction graphs induced by neuron\u2013neuron correlations, treating each layer as a structured computational network rather than a sequence of token transformations. Operating solely on these graphs, we show that global connectivity patterns are strongly predictive of model behavior across grounded reasoning, counting, and hallucination tasks. Modality-separated graphs reveal that cross-modal fusion strengthens sharply in mid-to-late layers, while contrastive graph alignment exposes how multimodal training reorganizes topology relative to text-only backbones. Targeted interventions on high-degree neurons further demonstrate their causal influence, indicating that VLMs route multimodal reasoning through sparse but structurally critical hubs. These results highlight interaction topology as a powerful, model-agnostic lens for interpreting and comparing multimodal transformers.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40160", "url": null, "sourceid": 31410, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39534, "uid": "ad26e1d38be1e2762f63edfbffa3970b", "name": "ChArtist: Generating Pictorial Charts with Unified Spatial and Subject Control", "authors": [{"id": 156911, "fullname": "Shishi Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/156911?format=json", "institution": "Brown University"}, {"id": 192285, "fullname": "Tongyu Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192285?format=json", "institution": "Adobe Systems"}, {"id": 192286, "fullname": "David H. Laidlaw", "url": "http://cvpr.thecvf.com/api/miniconf/users/192286?format=json", "institution": "Brown University"}, {"id": 192287, "fullname": "Gromit Yeuk-Yin Chan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192287?format=json", "institution": "Adobe Systems"}], "abstract": "A pictorial chart is an effective medium for visual storytelling, seamlessly integrating visual elements with data charts. However, creating such images is challenging because the flexibility of visual elements often conflicts with the rigidity of chart structures. This process thus requires a creative deformation that maintains both data faithfulness and visual aesthetics. Current methods that extract dense structural cues from natural images (e.g., edge or depth maps) are ill-suited as conditioning signals for pictorial chart generation. We present ChArtist, a domain-specific method for generating pictorial charts automatically, offering two distinct types of control: 1) spatial control that aligns well with the chart structure, and 2) subject-driven control that respects the visual characteristics of a reference image. To achieve this, we introduce a skeleton-based spatial control representation. This representation encodes only the data-encoding information of the chart, allowing for the easy incorporation of reference visuals without a rigid outline constraint. We implement our method based on the Diffusion Transformer (DiT) and leverage an adaptive position encoding mechanism to manage these two controls. We further introduce Spatially Gated Attention to modulate the interaction between spatial control and subject control. To support the fine-tuning of pre-trained models for this task, we created a large-scale dataset of 30,000 triplets (skeleton, reference image, pictorial chart). We also propose a unified data accuracy metric to evaluate the data faithfulness of the generated charts. We believe this work demonstrates that current generative models can achieve data-driven visual storytelling by moving beyond general-purpose conditions to task-specific representations. The code and dataset will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39534", "url": null, "sourceid": 34201, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38744, "uid": "5bae7561597d0a52e9b41ca506b33302", "name": "A Difference-in-Difference Approach to Detecting AI-Generated Images", "authors": [{"id": 190562, "fullname": "Xinyi Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/190562?format=json", "institution": "Tsinghua University"}, {"id": 190563, "fullname": "Kai Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/190563?format=json", "institution": "London School of Economics and Political Science"}, {"id": 190564, "fullname": "Chengchun Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/190564?format=json", "institution": "London School of Economics"}, {"id": 190565, "fullname": "Ying Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190565?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 190566, "fullname": "Jin Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190566?format=json", "institution": "University of Birmingham"}, {"id": 190567, "fullname": "Hongyi Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190567?format=json", "institution": "Tsinghua University"}], "abstract": "Diffusion models are able to produce AI-generated images that are almost indistinguishable from real ones, raising concerns about their potential misuse and posing substantial challenges for detecting them. Many existing detectors rely on reconstruction error \u2014 the difference between the input image and its reconstructed version \u2014 as the basis for distinguishing real from fake images. However, these detectors become less effective as modern AI-generated images become increasingly similar to real ones. To address this challenge, we propose a novel difference-in-difference method. Instead of directly using the reconstruction error (a first-order difference), we compute the difference in reconstruction error -- a second-order difference -- for variance reduction and improving detection accuracy. Extensive experiments demonstrate that our method achieves strong generalization performance, enabling reliable detection of AI-generated images in the era of generative AI.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38744", "url": null, "sourceid": 43513, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37502, "uid": "c03dada7934786d94b4c77a67108b3b8", "name": "AutoTraces: Autoregressive Trajectory Forecasting via Multimodal Large Language Models", "authors": [{"id": 155309, "fullname": "Teng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155309?format=json", "institution": "Southeast University"}, {"id": 179975, "fullname": "Yanting Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/179975?format=json", "institution": "Southeast University"}, {"id": 187593, "fullname": "Ruize Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187593?format=json", "institution": "Southeast University"}], "abstract": "We present AutoTraces, an autoregressive vision-language-trajectory model for robot trajectory forecasting in humam-populated environments, which harnesses the inherent reasoning capabilities of large language models (LLMs) to model complex human behaviors. In contrast to prior works that rely solely on textual representations, our key innovation lies in a novel trajectory tokenization scheme, which represents physical waypoints as discrete tokens, seamlessly integrated into the LLM\u2019s space through a lightweight encoder-decoder architecture. This design preserves the LLM\u2019s native autoregressive generation mechanism while extending it to physical coordinate spaces, facilitates modeling of long-term interactions in trajectory data. We further introduce an automated chain-of-thought (CoT) generation mechanism that leverages a multimodal LLM  to infer spatio-temporal relationships from visual observations and trajectory data, eliminating reliance on manual annotation. Through a two-stage training strategy, our AutoTraces achieves SOTA forecasting accuracy, particularly in long-horizon prediction, while exhibiting strong cross-scene generalization and supporting flexible-length forecasting.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37502", "url": null, "sourceid": 34443, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37495, "uid": "63aec506bcfd90818d4d4eeed163edf6", "name": "NI-Tex: Non-isometric Image-based Garment Texture Generation", "authors": [{"id": 182927, "fullname": "Hui Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/182927?format=json", "institution": "Shanghai Innovation Institute"}, {"id": 181954, "fullname": "LI MING", "url": "http://cvpr.thecvf.com/api/miniconf/users/181954?format=json", "institution": "Shanghai Innovation Institute"}, {"id": 135042, "fullname": "Haitao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/135042?format=json", "institution": "The University of Texas at Austin"}, {"id": 175319, "fullname": "Kai Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/175319?format=json", "institution": "Westlake University"}, {"id": 175361, "fullname": "SIzhe Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/175361?format=json", "institution": "Westlake University"}, {"id": 73907, "fullname": "Yanwei Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73907?format=json", "institution": "Fudan University"}, {"id": 185806, "fullname": "Xiangru Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185806?format=json", "institution": "Westlake University"}], "abstract": "Existing industrial 3D garment meshes already cover most real-world clothing geometries, yet their texture diversity remains limited. To acquire more realistic textures, generative methods are often used to extract Physically-based Rendering (PBR) textures and materials from large collections of wild images and project them back onto garment meshes. However, most image-conditioned texture generation approaches require strict topological consistency between the input image and the input 3D mesh, or rely on accurate mesh deformation to match to the image poses, which significantly constrains the texture generation quality and flexibility.To address the challenging problem of non-isometric image-based garment texture generation, we construct 3D Garment Videos, a physically simulated, garment-centric dataset that provides consistent geometry and material supervision across diverse deformations, enabling robust cross-pose texture learning. We further employ Nano Banana for high-quality non-isometric image editing, achieving reliable cross-topology texture generation between non-isometric image-geometry pairs. Finally, we propose an iterative baking method via uncertainty-guided view selection and reweighting that fuses multi-view predictions into seamless, production-ready PBR textures. Through extensive experiments, we demonstrate that our feedforward dual-branch architecture generates versatile and spatially aligned PBR materials suitable for industry-level 3D garment design.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37495", "url": null, "sourceid": 43938, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38489, "uid": "d3cc657a4d53c20a3915b2ab9899ff53", "name": "ReWeaver: Towards Simulation-Ready and Topology-Accurate Garment Reconstruction", "authors": [{"id": 181954, "fullname": "LI MING", "url": "http://cvpr.thecvf.com/api/miniconf/users/181954?format=json", "institution": "Shanghai Innovation Institute"}, {"id": 182927, "fullname": "Hui Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/182927?format=json", "institution": "Shanghai Innovation Institute"}, {"id": 175319, "fullname": "Kai Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/175319?format=json", "institution": "Westlake University"}, {"id": 145808, "fullname": "Chentao Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/145808?format=json", "institution": "zhejiang university"}, {"id": 175455, "fullname": "Siyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175455?format=json", "institution": "Xidian University"}, {"id": 73907, "fullname": "Yanwei Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73907?format=json", "institution": "Fudan University"}, {"id": 189971, "fullname": "Zhen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189971?format=json", "institution": "Adobe Systems"}, {"id": 185806, "fullname": "Xiangru Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185806?format=json", "institution": "Westlake University"}], "abstract": "High-quality 3D garment reconstruction plays a crucial role in mitigating the sim-to-real gap in applications such as digital avatars, virtual try-on and robotic manipulation. However, existing garment reconstruction methods, typically rely on the unstructured representations, such as 3D Gaussian Splats, which struggle to provide accurate reconstructions of garment topology and sewing structures. As a result, the reconstructed outputs are often unsuitable for high-fidelity physical simulation. We propose \\textbf{ReWeaver}, a novel framework for topology-accurate 3D garment and sewing pattern reconstruction from \\textit{sparse} multi-view RGB images. Given as few as four input views, ReWeaver predicts seams and panels as well as their connectivities in both the 2D UV space and the 3D space. The reconstructed seams and panels align precisely with the input images, and can be easily converted into simulation-ready and photorealistic 3D garments suitable for high-fidelity physics-based animation and virtual content creation. To enable effective training, we construct a large-scale dataset \\textbf{GCD-TS}, comprising multi-view RGB images, 3D garment geometries, textured human body meshes and annotated sewing patterns. The dataset contains over 100,000 synthetic samples covering a wide range of complex geometries and topologies. Extensive experiments show that ReWeaver consistently outperforms existing methods in terms of topology accuracy, geometry alignment and seam-panel consistency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38489", "url": null, "sourceid": 44922, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38945, "uid": "803523d233c42a1fc11cc41033a7b76f", "name": "PhaseWin Search Framework Enable Efficient Object-Level Interpretation", "authors": [{"id": 181964, "fullname": "Zihan Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181964?format=json", "institution": "Insititue of Information Engineering, Chinese Academy of Sciences"}, {"id": 157688, "fullname": "Ruoyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/157688?format=json", "institution": "Chinese Academy of Sciences"}, {"id": 191025, "fullname": "Junchi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191025?format=json", "institution": "Fudan University"}, {"id": 191026, "fullname": "Yue Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191026?format=json", "institution": "Institute of Information Engineering ,CAS"}, {"id": 157692, "fullname": "Hua Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157692?format=json", "institution": "Chinese Academy of Sciences"}, {"id": 126239, "fullname": "Xiaochun Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/126239?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Attribution is essential for interpreting object-level foundation models. Recent methods based on submodular subset selection have achieved high faithfulness, but their efficiency limitations hinder practical deployment in real-world scenarios. To address this, we propose PhaseWin, a novel phase-window search algorithm that enables faithful region attribution with near-linear complexity. PhaseWin replaces traditional quadratic-cost greedy selection with a phased coarse-to-fine search, combining adaptive pruning, windowed fine-grained selection, and dynamic supervision mechanisms to closely approximate greedy behavior while dramatically reducing model evaluations. Theoretically, PhaseWin retains near-greedy approximation guarantees under mild monotone submodular assumptions. Empirically, PhaseWin achieves over 95\\% of greedy attribution faithfulness using only 20\\% of the computational budget, and consistently outperforms other attribution baselines across object detection and visual grounding tasks with Grounding DINO and Florence-2. PhaseWin establishes a new state of the art in scalable, high-faithfulness attribution for object-level multimodal models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38945", "url": null, "sourceid": 32783, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37321, "uid": "422cf6c6f212dde0fa96c532de240104", "name": "Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning", "authors": [{"id": 187148, "fullname": "Junho Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/187148?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 187149, "fullname": "Jaemo Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187149?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 187150, "fullname": "Hyunju Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/187150?format=json", "institution": "King&#x27;s College London, University of London"}, {"id": 183510, "fullname": "Dongman Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/183510?format=json", "institution": "Korea Advanced Institute of Science and Technology(KAIST), School of Computing"}], "abstract": "Aligning egocentric video with wearable sensors have shown promise for human action recognition, but face practical limitations in user discomfort, privacy concerns, and scalability. We explore exocentric video with ambient sensors as a non-intrusive, scalable alternative. While prior egocentric-wearable works predominantly adopt Global Alignment by encoding entire sequences into unified representations, this approach fails in exocentric-ambient settings due to two problems: (P1) inability to capture local details such as subtle motions, and (P2) over-reliance on modality-invariant temporal patterns, causing misalignment between actions sharing similar temporal patterns with different spatio-semantic contexts. To resolve these problems, we propose DETACH, a decomposed spatio-temporal framework. This explicit decomposition preserves local details, while our novel sensor-spatial features discovered via online clustering provide semantic grounding for context-aware alignment. To align the decomposed features, our two-stage approach establishes spatial correspondence through mutual supervision, then performs temporal alignment via a spatial-temporal weighted contrastive loss that adaptively handles easy negatives, hard negatives, and false negatives. Comprehensive experiments with downstream tasks on Opportunity++ and HWU-USP datasets demonstrate substantial improvements over adapted egocentric-wearable baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37321", "url": null, "sourceid": 43550, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38992, "uid": "1632af4762431469541ac66e2d6f4b45", "name": "VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation", "authors": [{"id": 155446, "fullname": "Juhye Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/155446?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 191142, "fullname": "Wooju Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/191142?format=json", "institution": "Electronics and Telecommunications Research Institute"}, {"id": 155447, "fullname": "Dasol Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/155447?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 130030, "fullname": "Changki Sung", "url": "http://cvpr.thecvf.com/api/miniconf/users/130030?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 155448, "fullname": "Youngwoo Seo", "url": "http://cvpr.thecvf.com/api/miniconf/users/155448?format=json", "institution": "Hanhwa Aerospace"}, {"id": 155449, "fullname": "DongWan Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155449?format=json", "institution": "Hanwha Aerospace"}, {"id": 130034, "fullname": "Hyun Myung", "url": "http://cvpr.thecvf.com/api/miniconf/users/130034?format=json", "institution": "KAIST"}], "abstract": "Accurate global localization is crucial for autonomous driving and robotics, especially in dense urban environments where GNSS is often unreliable due to occlusion and multipath effects. As an emerging alternative, cross-view pose estimation predicts the 3-DoF camera pose corresponding to a ground-view image with respect to a geo-referenced satellite image. However, existing methods struggle to bridge the significant viewpoint gap between the ground and satellite views mainly due to limited spatial correspondences. To address this challenge, we propose a novel cross-view pose estimation method that constructs view-invariant representations through dual-axis transformation (VIRD). VIRD first applies a polar transformation to the satellite view to establish horizontal correspondence, then uses context-enhanced positional attention on the ground and polar-transformed satellite features to resolve vertical misalignment, explicitly mitigating the viewpoint gap. A view-reconstruction loss is introduced to strengthen the view invariance further, encouraging the derived representations to reconstruct the original and cross-view images. Experiments on the KITTI and VIGOR datasets demonstrate that VIRD outperforms the state-of-the-art methods, reducing median position and orientation errors by 50.7\\% and 76.5\\% on KITTI, and 18.0\\% and 46.8\\% on VIGOR, respectively.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38992", "url": null, "sourceid": 41654, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39010, "uid": "7b41e963c997a06a34c7d3d4957e03a7", "name": "S2D: Sparse to Dense Lifting for 3D Reconstruction with Minimal Inputs", "authors": [{"id": 183138, "fullname": "Yuzhou Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/183138?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 191171, "fullname": "Qijian Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/191171?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 182098, "fullname": "He Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182098?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 191172, "fullname": "Xiaoqi Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191172?format=json", "institution": null}, {"id": 191173, "fullname": "Guangzhi Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191173?format=json", "institution": ""}, {"id": 89127, "fullname": "Lizhuang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/89127?format=json", "institution": "Dept. of Computer Sci. &amp; Eng., Shanghai Jiao Tong University"}, {"id": 135330, "fullname": "Yuan Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/135330?format=json", "institution": "East China Normal University"}, {"id": 86818, "fullname": "Xin Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/86818?format=json", "institution": "East China Normal University"}], "abstract": "Explicit 3D representations have already become an essential medium for 3D simulation and understanding.However, the most commonly used point cloud and 3D Gaussian Splatting (3DGS) each suffer from non-photorealistic rendering and significant degradation under sparse inputs.In this paper, we introduce Sparse to Dense lifting (S2D), a novel pipeline that bridges the two representations and achieves high-quality 3DGS reconstruction with minimal inputs.Specifically, the S2D lifting is two-fold.We first present an efficient one-step diffusion model that lifts sparse point cloud for high-fidelity image artifact fixing.Meanwhile, to reconstruct 3D consistent scenes, we also design a corresponding reconstruction strategy with random sample drop and weighted gradient for robust model fitting from sparse input views to dense novel views.Extensive experiments show that S2D achieves the best consistency in generating novel view guidance and first-tier sparse view reconstruction quality under different input sparsity.By reconstructing stable scenes with the least possible captures among existing methods, S2D enables minimal input requirements for 3DGS applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39010", "url": null, "sourceid": 35533, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37023, "uid": "30999ce1f0a35aeff9a456e4487f9924", "name": "BiEvLight: Bi-level Learning of Task-Aware Event Refinement for Low-Light Image Enhancement", "authors": [{"id": 186507, "fullname": "Zishu Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186507?format=json", "institution": "Fuzhou University"}, {"id": 186508, "fullname": "Xiang-Xiang Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/186508?format=json", "institution": "Fuzhou University"}, {"id": 186509, "fullname": "Shengning Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186509?format=json", "institution": "Shandong Technology and Business University"}, {"id": 155797, "fullname": "Guang-Yong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/155797?format=json", "institution": "Fuzhou University"}, {"id": 186510, "fullname": "Guodong Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186510?format=json", "institution": "Shandong Technology and Business University"}, {"id": 186511, "fullname": "Xing Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186511?format=json", "institution": "Fuzhou University"}], "abstract": "Event cameras, with their high dynamic range, show great promise for Low-light Image Enhancement (LLIE). Existing works primarily focus on designing effective modal fusion strategies. However, a key challenge is the dual degradation from intrinsic background activity (BA) noise in events and low signal-to-noise ratio (SNR) in images, which causes severe noise coupling during modal fusion, creating a critical performance bottleneck. We therefore posit that precise event denoising is the prerequisite to unlocking the full potential of event-based fusion. To this end, we propose BiEvLight, a hierarchical and task-aware framework that collaboratively optimizes enhancement and denoising by exploiting their intrinsic interdependence. Specifically, BiEvLight exploits the strong gradient correlation between images and events to build a gradient-guided event denoising prior that alleviates insufficient denoising in heavily noisy regions. Moreover, instead of treating event denoising as a static pre-processing stage\u2014which inevitably incurs a trade-off between over- and under-denoising and cannot adapt to the requirements of a specific enhancement objective\u2014we recast it as a bilevel optimization problem constrained by the enhancement task. Through cross-task interaction, the upper-level denoising problem learns event representations tailored to the lower-level enhancement objective, thereby substantially improving overall enhancement quality. Extensive experiments on the Real-world noise Dataset SED demonstrate that our method significantly outperforms state-of-the-art (SOTA) approaches, with average improvements of 1.30dB in PSNR , 2.03dB in PSNR* and 0.047 in SSIM, respectively.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37023", "url": null, "sourceid": 33800, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37281, "uid": "e6c5803498e62cb0ef3bfe44c72c57d6", "name": "TeFlow: Enabling Multi-frame Supervision  for Self-Supervised Feed-forward Scene Flow Estimation", "authors": [{"id": 180890, "fullname": "Qingwen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180890?format=json", "institution": "KTH Royal Institute of Technology"}, {"id": 84983, "fullname": "Chenhan Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84983?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 187074, "fullname": "Xiaomeng Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187074?format=json", "institution": null}, {"id": 153657, "fullname": "Yunqi Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153657?format=json", "institution": "Huawei London Research Center"}, {"id": 156921, "fullname": "Yushan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156921?format=json", "institution": "Link\u00f6ping University"}, {"id": 187075, "fullname": "Olov Andersson", "url": "http://cvpr.thecvf.com/api/miniconf/users/187075?format=json", "institution": "KTH Royal Institute of Technology"}, {"id": 131231, "fullname": "Patric Jensfelt", "url": "http://cvpr.thecvf.com/api/miniconf/users/131231?format=json", "institution": "KTH Royal Institute of Technology, Stockholm, Sweden"}], "abstract": "Self-supervised feed-forward methods for scene flow estimation offer real-time efficiency, but their supervision from two-frame point correspondences is unreliable and often breaks down under occlusions. Multi-frame supervision has the potential to provide more stable guidance by incorporating motion cues from past frames, yet naive extensions of two-frame objectives are ineffective because point correspondences vary abruptly across frames, producing inconsistent signals.In the paper, we present TeFlow, enabling multi-frame supervision for feed-forward models by mining temporally consistent supervision. TeFlow introduces a temporal ensembling strategy that forms reliable supervisory signals by aggregating the most temporally consistent motion cues from a candidate pool built across multiple frames.Extensive evaluations demonstrate that TeFlow establishes a new state-of-the-art for self-supervised feed-forward methods, achieving performance gains of **up to 33\\%** on the challenging Argoverse 2 and nuScenes datasets. Our method performs on par with leading optimization-based methods, yet speeds up **150** times.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37281", "url": null, "sourceid": 37845, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37578, "uid": "bc72fcdec4667b67109692744061556f", "name": "Cupid: Generative 3D Reconstruction via Joint Object and Pose Modeling", "authors": [{"id": 157398, "fullname": "Binbin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157398?format=json", "institution": "University of Hong Kong"}, {"id": 187764, "fullname": "Haobin Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187764?format=json", "institution": "University of Hong Kong"}, {"id": 187765, "fullname": "Yiqun Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187765?format=json", "institution": "University of Hong Kong"}, {"id": 186525, "fullname": "Zibo Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186525?format=json", "institution": "Tencent"}, {"id": 106959, "fullname": "Yi Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/106959?format=json", "institution": "UC Berkeley"}, {"id": 153801, "fullname": "Shenghua Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153801?format=json", "institution": "University of Hong Kong"}], "abstract": "We introduce Cupid, a generative 3D reconstruction framework that jointly models the full distribution over both canonical objects and camera poses. Our two-stage flow-based model first generates a coarse 3D structure and 2D-3D correspondences to estimate the camera pose robustly. Conditioned on this pose, a refinement stage injects pixel-aligned image features directly into the generative process, marrying the rich prior of a generative model with the geometric fidelity of reconstruction. This strategy achieves exceptional faithfulness, outperforming state-of-the-art reconstruction methods by over 3 dB PSNR and 10\\% in Chamfer Distance. As a unified generative model that decouples the object and camera pose, Cupid naturally extends to multi-view and scene-level reconstruction tasks without requiring post-hoc optimization or fine-tuning.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37578", "url": null, "sourceid": 44474, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38989, "uid": "2fe2b2d2da35bcee61d38adc72c9877b", "name": "Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis", "authors": [{"id": 191129, "fullname": "Yuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191129?format=json", "institution": "Xiangtan University"}, {"id": 191130, "fullname": "Sihao Dou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191130?format=json", "institution": "Xiangtan University"}, {"id": 181480, "fullname": "Kai Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181480?format=json", "institution": "Xiangtan University"}, {"id": 191131, "fullname": "Shuhua Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191131?format=json", "institution": "Xiangtan University"}, {"id": 191132, "fullname": "Chunhong Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191132?format=json", "institution": "Xiangtan University"}, {"id": 191133, "fullname": "Fen Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191133?format=json", "institution": null}, {"id": 107462, "fullname": "Xieping Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/107462?format=json", "institution": "Hunan Normal University"}], "abstract": "Endoscopic video analysis is crucial for early gastrointestinal screening, but its progress is constrained by limited high-quality annotations. While self-supervised video pre-training shows promise, existing methods designed for natural videos tend to prioritize dense spatio-temporal modeling and exhibit motion bias, neglecting the static, structured semantics that are critical for clinical decision-making. To address this challenge, we propose **F**ocus-to-**P**erceive **R**epresentation **L**earning (***FPRL***), a cognition-inspired hierarchical framework that emulates the clinical examination process of endoscopic videos. ***FPRL*** first focuses on intra-frame lesion-centric regions to learn static semantics, and then perceives their evolution across frames to model contextual semantics. To achieve this, ***FPRL*** employs a hierarchical semantic modeling mechanism that explicitly distinguishes and collaboratively learns both types of semantics. Specifically, it begins by capturing static semantics through the application of teacher-prior adaptive masking (TPAM) combined with multi-view sparse sampling. This approach mitigates redundant temporal dependencies and enables the model to concentrate on lesion-related local semantics. Following this, contextual semantics are derived through cross-view masked feature completion (CVMFC) and attention-guided temporal prediction (AGTP). These processes establish cross-view correspondences and effectively model structured inter-frame evolution, thereby reinforcing temporal semantic continuity while preserving global contextual integrity. Extensive experiments on 11 endoscopic video datasets show that ***FPRL*** achieves state-of-the-art performance across diverse downstream tasks, demonstrating its effectiveness and strong generalization in endoscopic video representation learning.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38989", "url": null, "sourceid": 39247, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38301, "uid": "6b6ed5a249b7e07c8579fe97e165423e", "name": "IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting", "authors": [{"id": 189543, "fullname": "Tao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189543?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences; University of the Chinese Academy of Sciences"}, {"id": 173778, "fullname": "Yuyang Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/173778?format=json", "institution": "Institute of Automation,CAS"}, {"id": 189544, "fullname": "Yang Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/189544?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 189545, "fullname": "Kun Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/189545?format=json", "institution": "Institute of Automation"}, {"id": 107116, "fullname": "Zeyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107116?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 149633, "fullname": "Ying Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/149633?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences; Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 89642, "fullname": "Shiming Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89642?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 189546, "fullname": "Chunhong Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189546?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}], "abstract": "Recent advances in multimodal large language models (MLLMs) have led to impressive progress across various benchmarks. However, their capability in understanding infrared images remains unexplored. To address this gap, we introduce **IF-Bench**, the first high-quality benchmark designed for evaluating multimodal understanding of infrared images. IF-Bench consists of 499 images sourced from 23 infrared datasets and 680 carefully curated visual question-answer pairs, covering 10 essential dimensions of image understanding. Based on this benchmark, we systematically evaluate over 40 open-source and closed-source MLLMs, employing cyclic evaluation, bilingual assessment, and hybrid judgment strategies to enhance the reliability of the results. Our analysis reveals how model scale, architecture, and inference paradigms affect infrared image comprehension, providing valuable insights for this area. Furthermore, we propose a training-free generative visual prompting (**GenViP**) method, which leverages advanced image editing models to translate infrared images into semantically and spatially aligned RGB counterparts, thereby mitigating domain distribution shifts. Extensive experiments demonstrate that our method consistently yields significant performance improvements across a wide range of MLLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38301", "url": null, "sourceid": 43125, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36626, "uid": "de55a46be82a3a33ccd225eb349bcdeb", "name": "FlexAvatar: Flexible Large Reconstruction Model for Animatable Gaussian Head Avatars with Detailed Deformation", "authors": [{"id": 131724, "fullname": "Cheng Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/131724?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 86439, "fullname": "Zhuo Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/86439?format=json", "institution": "ByteDance"}, {"id": 185501, "fullname": "Liao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185501?format=json", "institution": "Bytedance"}, {"id": 185502, "fullname": "Chen Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185502?format=json", "institution": "Tsinghua Shenzhen International Graduate School"}, {"id": 129238, "fullname": "Zhaohu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129238?format=json", "institution": "ByteDance"}, {"id": 86787, "fullname": "Chengjiang Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/86787?format=json", "institution": "ByteDance Inc."}, {"id": 152756, "fullname": "Zheng Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/152756?format=json", "institution": null}, {"id": 76470, "fullname": "Jingxiang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/76470?format=json", "institution": "Tsinghua University"}, {"id": 127988, "fullname": "Chenyangguang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127988?format=json", "institution": "Tsinghua University"}, {"id": 75944, "fullname": "Yebin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75944?format=json", "institution": "Tsinghua University"}], "abstract": "We present FlexAvatar, a flexible large reconstruction model for high-fidelity 3D head avatars with detailed dynamic deformation from single or sparse images, without requiring camera poses or expression labels. It leverages a transformer-based reconstruction model with structured head query tokens as canonical anchor to aggregate flexible input-number-agnostic, camera-pose-free and expression-free inputs into a robust canonical 3D representation.For detailed dynamic deformation, we introduce a lightweight UNet decoder conditioned on UV-space position maps, which can produce detailed expression-dependent deformations in real time. To better capture rare but critical expressions like wrinkles and bared teeth, we also adopt a data distribution adjustment strategy during training to balance the distribution of these expressions in the training set.Moreover, a lightweight 10-second refinement can further enhances identity-specific details in extreme identities without affecting deformation quality.Extensive experiments demonstrate that our FlexAvatar achieves superior 3D consistency, detailed dynamic realism compared with previous methods, providing a practical solution for animatable 3D avatar creation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36626", "url": null, "sourceid": 44882, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38937, "uid": "be149bb539a8c3fcbea5809d74aae951", "name": "DREAM: Document Recognition with Explicit Adaptive Memory", "authors": [{"id": 181880, "fullname": "TIANQI ZHAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/181880?format=json", "institution": "Tsinghua University"}, {"id": 180984, "fullname": "Di Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180984?format=json", "institution": "Tsinghua University"}, {"id": 191005, "fullname": "Liangrui Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191005?format=json", "institution": "Tsinghua University"}, {"id": 191006, "fullname": "Yifan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191006?format=json", "institution": "Tsinghua University"}, {"id": 191007, "fullname": "Kemeng Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191007?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 191008, "fullname": "Shuo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191008?format=json", "institution": "Tsinghua University"}, {"id": 182578, "fullname": "Zhiyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182578?format=json", "institution": "Tsinghua University"}, {"id": 191009, "fullname": "Yizhu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191009?format=json", "institution": "Tsinghua University"}, {"id": 154730, "fullname": "Borui Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154730?format=json", "institution": "Huawei Noah&#x27;s Ark Lab"}, {"id": 191010, "fullname": "Yuyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191010?format=json", "institution": "Huawei Noah&#x27;s Ark Lab"}], "abstract": "Large multimodal models (LMMs) have shown promising performance for various document recognition tasks. However, LLMs adopt implicit modeling, and the parameters lack interpretability. Inspired by recent advances in human memory and learning research, We propose an explicit multiscale prototype memory that augments document recognition models, explicitly modeling recurrent layout and stylistic patterns across different spatial resolutions. A Memory Retrieval Mechanism enables local regions to sparsely attend to a few prototypes (e.g., image borders, tilted text); the retrieved compositional factors are concatenated with visual features and passed to the decoder, providing explicit region-wise structural context. Prototype memory consolidation updates and stabilizes prototypes via attention-weighted exponential moving average (EMA) strategy, while sparsity and anti-collapse regularization promote selective activation and disentanglement. We further adopt hierarchical memory and a scale-adaptive attention module for multi-resolution encoding, trained with a multi-task, entropy-regularized objective. We validate on two tasks including document recognition on the Fox and the self-built DreamDoc dataset, and handwriting recognition on the SCUT-HCCDoc and SCUT-EPT Chinese handwriting datasets. Experimental results show that the proposed method is effective.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38937", "url": null, "sourceid": 41755, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39246, "uid": "c6bc115044e002e815f108534568bb90", "name": "PerformRecast: Expression and Head Pose Disentanglement for Portrait Video Editing", "authors": [{"id": 191696, "fullname": "Jiadong Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191696?format=json", "institution": "alibaba"}, {"id": 191697, "fullname": "Bojun Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/191697?format=json", "institution": "Alibaba Group"}, {"id": 191698, "fullname": "Jie Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/191698?format=json", "institution": null}, {"id": 191699, "fullname": "Hua Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191699?format=json", "institution": null}, {"id": 191700, "fullname": "Xiao Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/191700?format=json", "institution": "Alibaba Group"}, {"id": 191701, "fullname": "Yong Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191701?format=json", "institution": "Alibaba Group"}, {"id": 86009, "fullname": "Huan Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86009?format=json", "institution": "Alibaba Group"}], "abstract": "This paper primarily investigates the task of editing facial expression in an input portrait video based on a driving video, which plays a crucial role in animation and film industries. Most existing research mainly focuses on portrait animation, which aims to animate a static portrait image according to the facial motion from the driving video. As a consequence, it remains challenging for them to disentangle the facial expression from head pose rotation and thus lack the ability to edit facial expression independently. In this paper, we propose PerformRecast, a versatile portrait video expression editing method which is dedicated to recast the performance in existing film and animation. The key insight of our method comes from the characteristics of 3D Morphable Face Model (3DMM), which models the face identity, facial expression and head pose of 3D face mesh with separate parameters. Therefore, we modify the keypoints transformation formula in previous methods to make it more consistent with 3DMM model, which achieves a better disentanglement and provides users with much more fine-grained control. Furthermore, to avoid the misalignment around the boundary of face in generated results, we decouple the facial and non-facial regions of input portrait images and pre-train a teacher model to provide separate supervision for them. Extensive experiments show that our method produces high-quality results which are more faithful to the driving video, outperforming existing methods in both controllability and efficiency. We will release our code and trained models to facilitate future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39246", "url": null, "sourceid": 34133, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38542, "uid": "8fd7f8741825334334a31ad8eb493a9a", "name": "Enabling Supervised Learning of Generative Signatures for Generalized Synthetic Image Detection", "authors": [{"id": 131560, "fullname": "Jianwei Fei", "url": "http://cvpr.thecvf.com/api/miniconf/users/131560?format=json", "institution": "Nanjing University of Information Science and Technology"}, {"id": 100955, "fullname": "Yunshu Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/100955?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 190093, "fullname": "Xiaoyu Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190093?format=json", "institution": "Jinan University"}, {"id": 190094, "fullname": "Zhihua Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/190094?format=json", "institution": "Jinan University"}, {"id": 190095, "fullname": "Alessandro Piva", "url": "http://cvpr.thecvf.com/api/miniconf/users/190095?format=json", "institution": "University of Florence"}], "abstract": "Extracting reliable generative traces in generated images is critical for AI-generated images (AIGIs) detection. However, a fundamental challenge exists: AIGIs inherently contain generative traces with no trace-free counterpart available, making supervised extraction of these artifacts infeasible. In this work, we overcome this through a surrogate supervision framework. We design a dynamic reconstructor that simulates diverse generative traces on real images through stochastically varied architectures and parameters. The reconstruction residuals serve as supervision to train an extractor that learns to isolate traces, \\textit{i.e.}, generative signatures (GenSign). A detector then fuses extracted GenSign with RGB features to distinguish real images from AIGIs. Our key insight is that sufficient architectural diversity in simulation enables effective transfer to real-world generators, resolving the absence of ground truth GenSign. Extensive experiments across four benchmarks demonstrate state-of-the-art generalization, confirming that our simulation-based learning paradigm is capable of extracting general and transferable forensic features.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38542", "url": null, "sourceid": 46130, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36562, "uid": "371d25d024755d6e87bbc2880e592a65", "name": "Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals", "authors": [{"id": 185350, "fullname": "Nate Gillman", "url": "http://cvpr.thecvf.com/api/miniconf/users/185350?format=json", "institution": "Brown University"}, {"id": 182251, "fullname": "Yinghua Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/182251?format=json", "institution": "Brown University"}, {"id": 181967, "fullname": "Zitian Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181967?format=json", "institution": "Brown University"}, {"id": 185351, "fullname": "Evan Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185351?format=json", "institution": "Brown University"}, {"id": 185352, "fullname": "Arjan Chakravarthy", "url": "http://cvpr.thecvf.com/api/miniconf/users/185352?format=json", "institution": "Brown University"}, {"id": 185353, "fullname": "Daksh Aggarwal", "url": "http://cvpr.thecvf.com/api/miniconf/users/185353?format=json", "institution": "Brown University"}, {"id": 185354, "fullname": "Michael Freeman", "url": "http://cvpr.thecvf.com/api/miniconf/users/185354?format=json", "institution": "Cornell University"}, {"id": 89379, "fullname": "Chen Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/89379?format=json", "institution": "Brown University"}], "abstract": "Recent advancements in video generation have enabled the development of ``world models'' capable of simulating potential futures for robotics and planning. However, specifying precise goals for these models remains a challenge; text instructions are often too abstract to capture physical nuances, while target images are frequently infeasible to specify for dynamic tasks. To address this, we introduce Goal Force, a novel framework that allows users to define goals via explicit force vectors and intermediate dynamics, mirroring how humans conceptualize physical tasks. We train a video generation model on a curated dataset of synthetic causal primitives\u2014such as elastic collisions and falling dominos\u2014teaching it to propagate forces through time and space. Despite being trained on simple physics data, our model exhibits remarkable zero-shot generalization to complex, real-world scenarios, including tool manipulation and multi-object causal chains. Our results suggest that by grounding video generation in fundamental physical interactions, models can emerge as implicit neural physics simulators, enabling precise, physics-aware planning without reliance on external engines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36562", "url": null, "sourceid": 44111, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36756, "uid": "c8c0c749da5056804a5bfcfe44b778e4", "name": "SparseOIT: Improving Order-Independent Transparency 3DGS via Active Set Method", "authors": [{"id": 185803, "fullname": "Wentao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185803?format=json", "institution": "Zhejiang University; Westlake University;"}, {"id": 185804, "fullname": "FanZhen KONG", "url": "http://cvpr.thecvf.com/api/miniconf/users/185804?format=json", "institution": null}, {"id": 185805, "fullname": "Zejian Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185805?format=json", "institution": "Westlake University; Zhejiang University"}, {"id": 185806, "fullname": "Xiangru Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185806?format=json", "institution": "Westlake University"}], "abstract": "3D Gaussian Splatting (3DGS) has received tremendous popularity over the past few years due to its photorealistic visual appearance. However, 3DGS uses volumetric rendering that is not suitable for objects with non-lambertian or transparent materials. To remedy this issue, a family of Order-Independent Transparency (OIT) rendering methods propose to remove or modify the depth sorting step in the 3DGS rendering equation. However, the potential of OIT-based method is still underexplored. In this paper, we observe that the OIT modifications to the rendering equation significantly reduce the inter-independence among individual gaussian splats, resulting in very sparse variable dependencies that can be harnessed by specific optimization techniques such as active set method. To this end, we propose \\textbf{SparseOIT}, an OIT-based 3DGS reconstruction algorithm that maintains an active set of gaussian splats and enjoys an acceleration ratio that is proportional to the potential sparsity. SparseOIT is designed by jointly considering the OIT rendering equation, the reconstruction algorithm and the geometric regularization. Through extensive experiments, we demonstrate that SparseOIT outperforms existing methods in the OIT-family by a large margin and also achieves comparable performance to the state-of-the-art 3DGS reconstruction methods based on volumetric rendering.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36756", "url": null, "sourceid": 36240, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39071, "uid": "eb6ee4b36cdf23e8167e10a47e340fed", "name": "PRIMU: Uncertainty Estimation for Novel Views in Gaussian Splatting from Primitive-Based Representations of Error and Coverage", "authors": [{"id": 182434, "fullname": "Thomas Gottwald", "url": "http://cvpr.thecvf.com/api/miniconf/users/182434?format=json", "institution": "Bergische Universit\u00e4t Wuppertal"}, {"id": 191303, "fullname": "Edgar Heinert", "url": "http://cvpr.thecvf.com/api/miniconf/users/191303?format=json", "institution": "Universit\u00e4t Osnabr\u00fcck"}, {"id": 191304, "fullname": "Peter Stehr", "url": "http://cvpr.thecvf.com/api/miniconf/users/191304?format=json", "institution": "Bergische Universit\u00e4t Wuppertal"}, {"id": 126456, "fullname": "Chamuditha Jayanga Galappaththige", "url": "http://cvpr.thecvf.com/api/miniconf/users/126456?format=json", "institution": "QUT Centre for Robotics, Australia."}, {"id": 191305, "fullname": "Matthias Rottmann", "url": "http://cvpr.thecvf.com/api/miniconf/users/191305?format=json", "institution": "Universit\u00e4t Osnabr\u00fcck"}], "abstract": "We introduce Primitive-based Representations of Uncertainty (PRIMU), a post-hoc uncertainty estimation (UE) framework for Gaussian Splatting (GS).Reliable UE is essential for deploying GS in safety-critical domains such as robotics and medicine.Existing approaches typically estimate Gaussian-primitive variances and rely on the rendering process to obtain pixel-wise uncertainties.In contrast, we construct primitive-level representations of error and visibility/coverage from training views, capturing interpretable uncertainty information. These representations are obtained by projecting view-dependent training errors and coverage statistics onto the primitives. Uncertainties for novel views are inferred by rendering these primitive-level representations, producing uncertainty feature maps, which are aggregate through pixel-wise regression on holdout data. We analyze combinations of uncertainty feature maps and regression models to understand how their interactions affect prediction accuracy and generalization.PRIMU also enables an effective active view selection strategy by directly leveraging these uncertainty feature maps.Additionally, we study the effect of separating splatting into foreground and background regions.Our estimates show strong correlations with true errors, outperforming state-of-the-art methods, especially for depth UE and foreground objects.Finally, our regression models show generalization capabilities to unseen scenes, enabling UE without additional holdout data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39071", "url": null, "sourceid": 30605, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37692, "uid": "d9ee56d47bdcadd39f2ec0d61f571cf3", "name": "PointCSP: Cross-Sample Semantic Propagation and Stability Preservation in Self-Supervised Point Cloud Learning", "authors": [{"id": 100635, "fullname": "Xinxing Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/100635?format=json", "institution": "Macau University of Science and Technology"}, {"id": 75781, "fullname": "Ajian Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75781?format=json", "institution": "NLPR, CASIA"}, {"id": 188025, "fullname": "Sunyuan Qiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188025?format=json", "institution": "Macau University of Science and Technology"}, {"id": 188026, "fullname": "Hui Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/188026?format=json", "institution": "Macau University of Science and Technology"}, {"id": 156875, "fullname": "Liying Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156875?format=json", "institution": "Macau University of Science and Technology"}, {"id": 176491, "fullname": "Yuzhong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176491?format=json", "institution": "Macau University Of Science And Technology"}, {"id": 188027, "fullname": "Zhi Rao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188027?format=json", "institution": "Macau University of Science and Technology"}, {"id": 130103, "fullname": "Yanyan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130103?format=json", "institution": "Macau University of Science and Technology"}], "abstract": "Scene-level point cloud self-supervised learning (PC-SSL) has demonstrated potential in enhancing the generalization capability of 3D vision models. Despite the advances achieved in the field through existing methods, the sample-independent modelling paradigm still poses significant limitations in terms of maintaining consistent semantic representations across different scenes. This challenge hinders the construction of a unified and transferable semantic space. To address this issue, we propose a PC-SSL framework based on cross-sample semantic propagation (CSP), in which samples within a batch are serialized into a continuous input and processed by a state-space model to enable semantic state propagation. This mechanism explicitly models the dynamic dependencies across samples in the state space, allowing the network to establish cross-sample semantic consistency in the latent space, and thereby achieve global semantic alignment. Since serialization-based pretraining requires batch-level input organization, we further introduce an asymmetric semantic preservation distillation (SPD) during finetuning to achieve structural alignment of semantic transfer and eliminate inconsistencies caused by batch dependency. The proposed SPD ensures stable transfer of pretrained semantics through a heterogeneous input mechanism and a semantic feature alignment constraint. This enables the model to maintain structured semantic consistency and robustness under single-scene testing conditions. Extensive experiments on multiple benchmark datasets demonstrate that our method consistently outperforms state-of-the-art methods in both performance and semantic consistency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37692", "url": null, "sourceid": 40544, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39613, "uid": "54a96ac32645d07ae686344a55414be1", "name": "Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models", "authors": [{"id": 173073, "fullname": "Chengsheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/173073?format=json", "institution": "University of Science and Technology of China"}, {"id": 192481, "fullname": "Chenghao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/192481?format=json", "institution": "University of Science and Technology of China"}, {"id": 192482, "fullname": "Xinyan Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192482?format=json", "institution": "Shanghai Advanced Research Institute, Chinese Academy of Sciences"}, {"id": 192483, "fullname": "Wei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192483?format=json", "institution": "University of Science and Technology of China"}, {"id": 127501, "fullname": "Xinmei Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/127501?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Large Vision-Language Models (LVLMs) have achieved remarkable progress in visual-textual understanding, yet their reliability is critically undermined by hallucinations, i.e., the generation of factually incorrect or inconsistent responses.While recent studies using steering vectors demonstrated promise in reducing hallucinations, a notable challenge remains: they inadvertently amplify the severity of residual hallucinations. We attribute this to their exclusive focus on the decoding stage, where errors accumulate autoregressively and progressively worsen subsequent hallucinatory outputs.To address this, we propose $\\textbf{P}$refill-$\\textbf{T}$ime $\\textbf{I}$ntervention ($\\textbf{PTI}$), a novel steering paradigm that intervenes only once during the prefill stage, enhancing the initial Key-Value (KV) cache before error accumulation occurs.Specifically, PTI is modality-aware, deriving distinct directions for visual and textual representations. This intervention is decoupled to steer keys toward visually-grounded objects and values to filter background noise, correcting hallucination-prone representations at their source.Extensive experiments demonstrate PTI's significant performance in mitigating hallucinations and its generalizability across diverse decoding strategies, LVLMs, and benchmarks. Moreover, PTI is orthogonal to existing decoding-stage methods, enabling plug-and-play integration and further boosting performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39613", "url": null, "sourceid": 31469, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38884, "uid": "9fca0cde755c3322ddd4e8c702aea6c2", "name": "Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy", "authors": [{"id": 180573, "fullname": "Jiahao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180573?format=json", "institution": "Fujian Normal University"}, {"id": 190909, "fullname": "Fengyan Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190909?format=json", "institution": "Fujian Normal University"}, {"id": 190910, "fullname": "Xuechao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190910?format=json", "institution": "Royal Melbourne Institute of Technology"}, {"id": 190911, "fullname": "Feng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190911?format=json", "institution": "Fuzhou Polytechnic"}, {"id": 190912, "fullname": "Kexin Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190912?format=json", "institution": "Vectolink (Fuzhou, Fujian) Information Technology Co., Ltd."}, {"id": 190913, "fullname": "Xu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190913?format=json", "institution": null}, {"id": 190914, "fullname": "Zhide chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190914?format=json", "institution": "Fujian Normal University"}], "abstract": "The development of affective multimodal language models (MLMs) has long been constrained by a gap between low-level perception and high-level interaction, leading to fragmented affective capabilities and limited generalization. To bridge this gap, we propose a cognitively inspired three-level hierarchy that organizes affective tasks according to their cognitive depth\u2014perception, understanding, and interaction\u2014and provides a unified conceptual foundation for advancing affective modeling. Guided by this hierarchy, we introduce **Nano-EmoX**, a small-scale multitask MLM, and **P2E** (**P**erception-**to**-**E**mpathy), a curriculum-based training framework. Nano-EmoX integrates a suite of omni-modal encoders, including an enhanced facial encoder and a fusion encoder, to capture key multimodal affective cues and improve cross-task transferability. The outputs are projected into a unified language space via heterogeneous adapters, empowering a lightweight language model to tackle diverse affective tasks. Concurrently, P2E progressively cultivates emotional intelligence by aligning rapid perception with chain-of-thought-driven empathy. To the best of our knowledge, Nano-EmoX is the first compact MLM (2.2B) to unify six core affective tasks across all three hierarchy levels, achieving state-of-the-art or highly competitive performance across multiple benchmarks, demonstrating excellent efficiency and generalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38884", "url": null, "sourceid": 37875, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39330, "uid": "0093fe2eedeb098315bf9251da1a5f03", "name": "Diffusion Forcing Planner: History-Annealed Planning with Time-Dependent Guidance for Autonomous Driving", "authors": [{"id": 176733, "fullname": "Zehan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176733?format=json", "institution": "University of Science and Technology of China"}, {"id": 184857, "fullname": "Yaoyi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184857?format=json", "institution": ""}, {"id": 184856, "fullname": "Neng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184856?format=json", "institution": "Research Department, Shenzhen Yinwang Intelligent Technology Co., Ltd."}, {"id": 184858, "fullname": "Jia Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/184858?format=json", "institution": null}], "abstract": "Learning-based motion planners, despite recent progress, often suffer from temporal inconsistency. Small perturbations across frames can accumulate into unstable trajectories, degrading comfort and safety in closed-loop driving. Several methods attempt to inject history as a static conditioning signal to stabilize outputs, only to induce the planner to copy historical patterns instead of adapting to environment contexts. To address this limitation, we propose Diffusion Forcing Planner (DFP), a diffusion-based planning framework driven by history-guided control. Specifically, DFP decomposes the full trajectory into history, current and future segments, and assign independent noise levels to each segment. The model jointly denoises the historical and the future segments, enforcing a heterogeneous joint diffusion process. At inference, classifier-free guidance (CFG) is applied to steer future sampling using annealed history in a controllable manner. Closed-loop evaluation and comprehensive ablations on nuPlan show that DFP achieves competitive performance while producing continuous, stable, and controllable motion plans in complex driving scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39330", "url": null, "sourceid": 43697, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36352, "uid": "22242fffb164c888879c3513a550427b", "name": "DFM-Drive: Parallel Coarse-to-Fine Motion Planning via Discrete Flow Matching for Autonomous Driving", "authors": [{"id": 143928, "fullname": "Yifang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/143928?format=json", "institution": "Nanjing University"}, {"id": 184852, "fullname": "Jiahao Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/184852?format=json", "institution": "Fudan University"}, {"id": 184853, "fullname": "Zhihao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184853?format=json", "institution": "Fudan University"}, {"id": 158004, "fullname": "Hanlin Shang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158004?format=json", "institution": "Fudan University"}, {"id": 184854, "fullname": "Shan Luan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184854?format=json", "institution": "Fudan University"}, {"id": 159542, "fullname": "Mingwang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/159542?format=json", "institution": "Fudan University"}, {"id": 184855, "fullname": "Feipeng Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/184855?format=json", "institution": "Shenzhen Yinwang Intelligent Technology Co., Ltd."}, {"id": 184856, "fullname": "Neng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184856?format=json", "institution": "Research Department, Shenzhen Yinwang Intelligent Technology Co., Ltd."}, {"id": 184857, "fullname": "Yaoyi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184857?format=json", "institution": ""}, {"id": 184858, "fullname": "Jia Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/184858?format=json", "institution": null}, {"id": 154534, "fullname": "Siyu Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154534?format=json", "institution": "Fudan University"}], "abstract": "We introduce DFM-Drive, a vision\u2013language\u2013action (VLA) model that casts ego-trajectory planning as discrete flow matching over a structured token space. In contrast to autoregressive decoders, DFM-Drive performs fully parallel, bidirectional denoising, enabling coarse-to-fine refinement with a tunable compute\u2013accuracy trade-off. Specifically, the approach combines a metric-aligned numerical tokenizer that preserves scalar geometry via triplet-margin learning, a geometry-aware flow objective and a simulator-guided GRPO alignment that integrates safety, ego progress, and comfort rewards while retaining parallel generation. A multi-stage adaptation converts a pre-trained auto-regressive backbone (Janus-1.5B) from causal decoding to non-causal flow model and strengthens road-scene competence through continued multimodal pretraining. Thanks to the inherent nature of consistency model training and parallel decoding inference, DFM-Drive achieves superior closed-loop performance against autoregressive and diffusion-based VLA baselines, with 1-step inference attaining 88.7 PDMS and 5-step inference reaching 90.3 PDMS on NAVSIM v1 benchmark. These results establish discrete flow matching as a new promising paradigm for end-to-end autonomous driving.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36352", "url": null, "sourceid": 44439, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36719, "uid": "4d17ee846b4ff7813f91cf20c8177213", "name": "HAMMER: Harnessing MLLMs via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding", "authors": [{"id": 182517, "fullname": "Lei Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/182517?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 185720, "fullname": "Yong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185720?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 149729, "fullname": "YUEJIAO SU", "url": "http://cvpr.thecvf.com/api/miniconf/users/149729?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 88428, "fullname": "Yi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88428?format=json", "institution": "Nanyang Technological University"}, {"id": 185721, "fullname": "Moyun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185721?format=json", "institution": ""}, {"id": 88437, "fullname": "Lap-Pui Chau", "url": "http://cvpr.thecvf.com/api/miniconf/users/88437?format=json", "institution": "The Hong Kong Polytechnic University"}], "abstract": "Humans commonly reason about object affordance through observed interactions in images or videos, and once formed, such knowledge can be generically generalized to novel objects. Inspired by this principle, we advocate for a novel framework that leverages emerging multimodal large language models (MLLMs) for interaction intention-driven 3D affordance grounding, namely HAMMER. Instead of generating explicit object attribute descriptions or relying on off-the-shelf 2D segmenters, we alternatively aggregate the interaction intention depicted in the reference image into a contact-aware embedding and guide the model to infer textual affordance labels, ensuring it thoroughly excavates object semantics and contextual cues. We further devise a hierarchical cross-modal integration mechanism to fully exploit the complementary information from the MLLM for 3D representation refinement and introduce a multi-granular geometry lifting module that infuses spatial characteristics into the extracted intention embedding, thus facilitating accurate 3D affordance localization. Extensive experiments on public datasets and our newly constructed corrupted benchmark reveal the superiority and robustness of our method in seen and unseen scenarios compared to existing approaches. The code and models will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36719", "url": null, "sourceid": 40388, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36864, "uid": "8ef1bcf08007f16f32e7933e174c7198", "name": "EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence", "authors": [{"id": 144268, "fullname": "Jiaxu Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/144268?format=json", "institution": "BeiHang University"}, {"id": 186045, "fullname": "Xu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186045?format=json", "institution": "Alibaba"}, {"id": 186046, "fullname": "Mengwei Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/186046?format=json", "institution": null}, {"id": 186047, "fullname": "Hang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186047?format=json", "institution": "Alibaba Group"}, {"id": 154906, "fullname": "Mu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154906?format=json", "institution": "Alibaba Group"}, {"id": 186048, "fullname": "Yang Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/186048?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 186049, "fullname": "Ding Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186049?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 186050, "fullname": "Hong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186050?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 186051, "fullname": "Yifan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186051?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}], "abstract": "Recent spatial intelligence approaches typically attach 3D cues to 2D reasoning pipelines or couple MLLMs with black-box reconstruction modules, leading to weak spatial consistency, limited viewpoint diversity, and evidence chains that cannot be traced back to supporting views. Frameworks for \u201cthinking with images\u2019\u2019 (e.g., ChatGPT\u2013o3 and DeepEyes) show that stepwise multimodal reasoning can emerge by interleaving hypothesis formation with active acquisition of visual evidence, but they do not address three key challenges in spatial Chain-of-Thought (CoT): building global space perception under strict token budgets, explicitly associating 3D hypotheses with video frames for verification, and designing spatially grounded rewards for reinforcement learning. To address these issues, we present \\textbf{EagleVision}, a dual-stage framework for progressive spatial cognition through macro perception and micro verification. In the macro perception stage, EagleVision employs a semantics\u2013perspective-fusion determinantal point process (SPF-DPP) to select a compact set of geometry- and semantics-aware keyframes from long videos under a fixed token budget. In the micro verification stage, we formalize spatial CoT as BEV-grounded pose querying: the agent iteratively predicts poses on a BEV plane, retrieves the nearest real frames, and is trained purely by reinforcement learning with a spatial grounding reward that scores the consistency between predicted poses and observed views. On VSI-Bench, EagleVision achieves state-of-the-art performance among open-source vision\u2013language models, demonstrating strong and generalizable spatial understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36864", "url": null, "sourceid": 35821, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37801, "uid": "66f870cbf6b7ef5d6d1c8a3d4671e775", "name": "SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models", "authors": [{"id": 180041, "fullname": "Jiaji Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180041?format=json", "institution": "Zhejiang University"}, {"id": 188297, "fullname": "Ruichao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/188297?format=json", "institution": "Zhejiang University"}, {"id": 155452, "fullname": "Hailiang Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/155452?format=json", "institution": "Zhejiang University"}, {"id": 188298, "fullname": "Jiaju Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188298?format=json", "institution": "Nanyang Technological University"}, {"id": 188299, "fullname": "Peng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188299?format=json", "institution": "Zhejiang University"}, {"id": 188300, "fullname": "Hao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188300?format=json", "institution": "Kuaishou"}, {"id": 188301, "fullname": "Yuying Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188301?format=json", "institution": null}, {"id": 155453, "fullname": "Kingsum Chow", "url": "http://cvpr.thecvf.com/api/miniconf/users/155453?format=json", "institution": "Zhejiang University"}, {"id": 188302, "fullname": "GANG XIONG", "url": "http://cvpr.thecvf.com/api/miniconf/users/188302?format=json", "institution": null}, {"id": 155454, "fullname": "Shuiguang Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/155454?format=json", "institution": "Zhejiang University"}], "abstract": "Diffusion models have demonstrated exceptional generative capabilities but are computationally intensive, posing significant challenges for deployment in resource-constrained or latency-sensitive environments.Quantization offers an effective means to reduce model size and computational cost, with post-training quantization (PTQ) being particularly appealing due to its compatibility with pre-trained models without requiring retraining or training data.However, existing PTQ methods for diffusion models often rely on manual, architecture-specific heuristics that limit their generalizability and hinder integration with industrial deployment pipelines.To address these limitations, we propose SegQuant, a deployment-aware quantization framework that adaptively combines complementary techniques to enhance cross-model versatility.SegQuant consists of a segment-aware, graph-based quantization strategy (SegLinear) that captures structural semantics and spatial heterogeneity, along with a dual-scale quantization scheme (DualScale) that  preserves polarity-asymmetric activations using a hardware-native dual-path computation, avoiding performance penalties from custom implementations, which is crucial for maintaining visual fidelity in generated outputs.SegQuant is broadly applicable beyond Transformer-based diffusion models, achieving strong performance while ensuring seamless compatibility with mainstream deployment tools.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37801", "url": null, "sourceid": 31639, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39603, "uid": "5f1517b532a2dd760f7d865e4d4146c6", "name": "Unstitching the Chimera: Frame-Level Risk and Train-Free Mitigation for Video Hallucination", "authors": [{"id": 181524, "fullname": "Songyuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181524?format=json", "institution": "National University of Defense Technology"}, {"id": 176170, "fullname": "Guijian Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176170?format=json", "institution": "National University of Defense Technology"}, {"id": 192459, "fullname": "Kun Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192459?format=json", "institution": "National University of Defense Technology"}, {"id": 189014, "fullname": "Haotian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189014?format=json", "institution": "National University of Defense Technology"}, {"id": 192460, "fullname": "Shixuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192460?format=json", "institution": "AMS; National University of Defense Technology"}, {"id": 158616, "fullname": "Wenjing Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158616?format=json", "institution": "National University of Defense Technology"}, {"id": 158615, "fullname": "Long Lan", "url": "http://cvpr.thecvf.com/api/miniconf/users/158615?format=json", "institution": "National University of Defense Technology"}, {"id": 184470, "fullname": "Huibin Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184470?format=json", "institution": "National University of Defense Technology"}], "abstract": "Hallucination limits the reliability of multimodal large language models (MLLMs), and it is particularly damaging in video where errors manifest as distorted narratives rather than single-frame mistakes. We introduce a frame-first study of **Chimera Hallucination**: model stitches visual segments that exist in space and time but do not belong to the same event chain, producing a spurious continuous story. We introduce **CH-Risk**, a single-forward, reference-free risk estimate tailored to this failure mode. CH-Risk combines two complementary signals: $SegCoverage@\\alpha (\\mathrm{SCR}@\\alpha\\)$ measures how many event segments are needed to cover most text-to-frame support, exposing long-range stitching; Alignment with Early Temporal Pathway (AETP) measures rank consistency between support and the temporal pathway formed in early\u2013middle layers, exposing stage mismatch. To turn risk into correction, we further propose **CH-M(itigation)**, a train-free two-stage intervention. Segment-aligned Stage-Aligned Frame Routing (sSAFR) re-weights frames before the mid-layer softmax to route attention toward a small set of pathway-aligned segments. Residual Token Calibration (RTC) then stabilizes token usage within selected segments. Extensive experiments across 9 benchmarks and 6 VideoLLMs show that CH-Risk can predict Chimera and that CH-M consistently reduce hallucination and improves task accuracy with negligible overhead (sub-5\\% latency, sub-2.5\\% memory, \\$\\approx$1\\% FLOPs).", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39603", "url": null, "sourceid": 33921, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40178, "uid": "137853082702b6be2d735817ca348b05", "name": "WaTeRFlow: Watermark Temporal Robustness via Flow Consistency", "authors": [{"id": 170334, "fullname": "Utae Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/170334?format=json", "institution": "Korea University, Seoul, Korea"}, {"id": 181834, "fullname": "Sumin In", "url": "http://cvpr.thecvf.com/api/miniconf/users/181834?format=json", "institution": "Korea University"}, {"id": 193727, "fullname": "Hyunju Ryu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193727?format=json", "institution": "Korea University"}, {"id": 193728, "fullname": "Jaewan Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/193728?format=json", "institution": null}, {"id": 89914, "fullname": "Feng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89914?format=json", "institution": "Google Research"}, {"id": 193729, "fullname": "Jongheon Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/193729?format=json", "institution": "Korea University"}, {"id": 153109, "fullname": "Seungryong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/153109?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 130426, "fullname": "Sangpil Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/130426?format=json", "institution": "Korea University"}], "abstract": "Image watermarking supports authenticity and provenance, yet many schemes are still easy to bypass with various distortions and powerful generative edits. Deep learning-based watermarking has improved robustness to diffusion-based image editing, but a gap remains when a watermarked image is converted to video by image-to-video (I2V), in which per-frame watermark detection weakens. I2V has quickly advanced from short, jittery clips to multi-second, temporally coherent scenes, and it now serves not only content creation but also world-modeling and simulation workflows, making cross-modal watermark recovery crucial. We present WaTeRFlow, a framework tailored for robustness under I2V. It consists of (i) FUSE (Flow-guided Unified Synthesis Engine), which exposes the encoder\u2013decoder to realistic distortions via instruction-driven edits and a fast video diffusion proxy during training, (ii) optical-flow warping with a Temporal Consistency Loss (TCL) that stabilizes per-frame predictions, and (iii) a semantic preservation loss that maintains the conditioning signal. Experiments across representative I2V models show accurate watermark recovery from frames, with higher first-frame and per-frame bit accuracy and resilience when various distortions are applied before or after video generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40178", "url": null, "sourceid": 40977, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39647, "uid": "6658b3fcc28fc1b0ac05e2063eeadad2", "name": "Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning", "authors": [{"id": 181524, "fullname": "Songyuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181524?format=json", "institution": "National University of Defense Technology"}, {"id": 131131, "fullname": "Weijiang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131131?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 184469, "fullname": "Ziyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184469?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 176170, "fullname": "Guijian Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176170?format=json", "institution": "National University of Defense Technology"}, {"id": 158616, "fullname": "Wenjing Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158616?format=json", "institution": "National University of Defense Technology"}, {"id": 184470, "fullname": "Huibin Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184470?format=json", "institution": "National University of Defense Technology"}, {"id": 184471, "fullname": "Nong Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184471?format=json", "institution": "Sun yat-sen University"}], "abstract": "When video reasoning requires external knowledge, many systems with large multimodal models (LMMs) adopt retrieval augmentation to supply the missing context. Appending textual or multi-clip evidence, however, forces heterogeneous signals into a single attention space. We observe diluted attention and higher cognitive load even on non-long videos. The bottleneck is not only what to retrieve but how to represent and fuse external knowledge with the video backbone.We present **Graph-to-Frame RAG (G2F-RAG)**, a training free and auditable paradigm that delivers knowledge in the visual space. On the offline stage, an agent builds a problem-agnostic video knowledge graph that integrates entities, events, spatial relations, and linked world knowledge. On the online stage, a hierarchical multi-agent controller decides whether external knowledge is needed, retrieves a minimal sufficient subgraph, and renders it as a single reasoning frame appended to the video. LMMs then perform joint reasoning in a unified visual domain. This design reduces cognitive load and leaves an explicit, inspectable evidence trail.G2F-RAG is plug-and-play across backbones and scales. It yields consistent gains on diverse public benchmarks, with larger improvements in knowledge-intensive settings. Ablations further confirm that knowledge representation and delivery matter. G2F-RAG reframes retrieval as visual space knowledge fusion for robust and interpretable video reasoning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39647", "url": null, "sourceid": 33167, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36215, "uid": "2711c1c38d9a6f2812b8340b30be6ba8", "name": "Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning", "authors": [{"id": 181524, "fullname": "Songyuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181524?format=json", "institution": "National University of Defense Technology"}, {"id": 131131, "fullname": "Weijiang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131131?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 184468, "fullname": "Jilin Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/184468?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 184469, "fullname": "Ziyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184469?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 176170, "fullname": "Guijian Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176170?format=json", "institution": "National University of Defense Technology"}, {"id": 158616, "fullname": "Wenjing Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158616?format=json", "institution": "National University of Defense Technology"}, {"id": 184470, "fullname": "Huibin Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184470?format=json", "institution": "National University of Defense Technology"}, {"id": 184471, "fullname": "Nong Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184471?format=json", "institution": "Sun yat-sen University"}], "abstract": "Video reasoning has advanced with large multimodal models (LMMs), yet their inference is often a single pass that returns an answer without verifying whether the reasoning is evidence-aligned. We introduce **Reinforce to Learn, Elect to Reason (RLER)**, a dual paradigm that decouples learning to produce evidence from obtaining a reliable answer. In **RLER-Training**, we optimize the policy with group-relative reinforcement learning (RL) and 3 novel task-driven rewards: Frame-sensitive reward grounds reasoning on explicit key frames, Think-transparency reward shapes readable and parsable reasoning traces, and Anti-repetition reward boosts information density. These signals teach the model to emit structured, machine-checkable evidence and potentiate reasoning capabilities. In **RLER-Inference**, we apply a train-free orchestrator that generates a small set of diverse candidates, parses their answers and cited frames, scores them by evidence consistency, confidence, transparency, and non-redundancy, and then performs a robust evidence-weighted election. This closes the loop between producing and using evidence, improving reliability and interpretability without enlarging the model. We comprehensively evaluate RLER against various open-source and RL-based LMMs on 8 representative benchmarks. RLER achieves state of the art across all benchmarks and delivers an average improvement of 6.3 \\% over base models, while using on average 3.1 candidates per question, indicating a favorable balance between compute and quality. The results support a simple thesis: making evidence explicit during learning and electing by evidence during inference is a robust path to trustworthy video reasoning.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36215", "url": null, "sourceid": 39526, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39911, "uid": "5c8010125583d79426b73845df9f57f6", "name": "Catch Me if You Can: Active Mapping of Moving 3D Objects", "authors": [{"id": 184303, "fullname": "Davide Allegro", "url": "http://cvpr.thecvf.com/api/miniconf/users/184303?format=json", "institution": "University of Padova"}, {"id": 181208, "fullname": "Shiyao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181208?format=json", "institution": "IP Paris &amp; Inria"}, {"id": 193100, "fullname": "Stefano Ghidoni", "url": "http://cvpr.thecvf.com/api/miniconf/users/193100?format=json", "institution": "University of Padua"}, {"id": 75898, "fullname": "Vincent Lepetit", "url": "http://cvpr.thecvf.com/api/miniconf/users/75898?format=json", "institution": "Ecole des Ponts ParisTech"}], "abstract": "Current 3D mapping pipelines generally assume static environments, which limits their ability to accurately capture and reconstruct moving objects. To address this limitation, we introduce the novel task of active mapping of moving objects, in which a mapping agent must plan its trajectory while compensating for the object's motion. Our approach, Paparazzo, provides a learning-free solution that robustly predicts the target's trajectory and identifies the most informative viewpoints from which to observe it, to plan its own path. We also contribute a comprehensive benchmark designed for this new task. Through extensive experiments, we show that Paparazzo significantly improves 3D reconstruction completeness and accuracy compared to several strong baselines, marking an important step toward dynamic scene understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39911", "url": "https://davidea97.github.io/paparazzo-page/", "sourceid": 31927, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37280, "uid": "68579a2a1b3f242d15011c629a01e14e", "name": "BD-Merging: Bias-Aware Dynamic Model Merging with Evidence-Guided Contrastive Learning", "authors": [{"id": 181699, "fullname": "Yuhan Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/181699?format=json", "institution": "Shanghai University of Finance and Economics"}, {"id": 187073, "fullname": "Chen Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187073?format=json", "institution": "Shanghai University of Finance and Economics"}], "abstract": "Model Merging (MM) has emerged as a scalable paradigm for multi-task learning (MTL), enabling multiple task-specific models to be integrated without revisiting the original training data. Despite recent progress, the reliability of MM under test-time distribution shift remains insufficiently understood. Most existing MM methods typically assume that test data are clean and distributionally aligned with both the training and auxiliary sources. However, this assumption rarely holds in practice, often resulting in biased predictions with degraded generalization.  To address this issue, we present BD-Merging, a bias-aware unsupervised model merging framework that explicitly models uncertainty to achieve adaptive reliability under distribution shift. First, BD-Merging introduces a joint evidential head that learns uncertainty over a unified label space, capturing cross-task semantic dependencies in MM. Second, building upon this evidential foundation, we propose an Adjacency Discrepancy Score (ADS) that quantifies evidential alignment among neighboring samples. Third, guided by ADS, a discrepancy-aware contrastive learning mechanism refines the merged representation by aligning consistent samples and separating conflicting ones. Combined with general unsupervised learning, this process trains a debiased router that adaptively allocates task-specific or layer-specific weights on a per-sample basis, effectively mitigating the adverse effects of distribution shift.   Extensive experiments across diverse tasks demonstrate that BD-Merging achieves superior effectiveness and robustness compared to state-of-the-art MM baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37280", "url": null, "sourceid": 44117, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36769, "uid": "f6c59252b852c52ff41a84ce6ddc8e24", "name": "ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation", "authors": [{"id": 184298, "fullname": "Qing Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184298?format=json", "institution": "Peking University"}, {"id": 153322, "fullname": "Zhipei Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153322?format=json", "institution": "Peking University"}, {"id": 70310, "fullname": "Xuanyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70310?format=json", "institution": "Peking University"}, {"id": 185832, "fullname": "Xiangyu Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185832?format=json", "institution": "South China University of Technology"}, {"id": 76749, "fullname": "Jian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76749?format=json", "institution": "Peking University"}], "abstract": "The rise of AI-generated images (AIGIs) poses growing challenges for digital authenticity, prompting the need for efficient, generalizable image forgery detection systems. Existing methods, whether non-LLM-based or LLM-based, exhibit distinct advantages and limitations. While non-LLM-based models offer efficient low-level artifact detection, they often lack semantic understanding. Conversely, LLM-based methods provide strong semantic reasoning and explainability but are computationally intensive and less sensitive to subtle visual artifacts. Moreover, the true contribution of explanatory reasoning texts to forgery detection performance remains unclear. In this work, we investigate the intrinsic value and potential of LLM-generated reasoning texts, considering it a source of generalization and semantic-error sensitivity. Based on these findings, we propose ReAlign, a novel framework that distills high-quality reasoning texts generated by a GRPO-optimized LLM into a lightweight AIGI detector via contrastive learning. ReAlign effectively inherits the generalization ability and semantic sensitivity capability of reasoning textual representations, while remaining efficient and lightweight for deployment. Moreover, ReAlign adopts a tailored joint optimization strategy that integrates contrastive loss for image-text alignment and classification loss for accurate forgery discrimination. Experimental results on AIGCDetectBenchmark, AIGI-Holmes, and our newly constructed UltraSynth-10k demonstrate that ReAlign consistently outperforms existing state-of-the-art detectors in both accuracy and generalization, particularly when facing complex, high-fidelity forgeries from modern generative models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36769", "url": null, "sourceid": 33275, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38757, "uid": "5d347461fef2060cacfa445eb828b790", "name": "Ego: Embedding-Guided Personalization of Vision-Language Models", "authors": [{"id": 190596, "fullname": "Soroush Seifi", "url": "http://cvpr.thecvf.com/api/miniconf/users/190596?format=json", "institution": "Toyota Motors Europe"}, {"id": 190597, "fullname": "Simon Gardier", "url": "http://cvpr.thecvf.com/api/miniconf/users/190597?format=json", "institution": "Toyota Motor Europe, University of Li\u00e8ge"}, {"id": 190598, "fullname": "Vaggelis Dorovatas", "url": "http://cvpr.thecvf.com/api/miniconf/users/190598?format=json", "institution": "National Technical University of Athens"}, {"id": 190599, "fullname": "Daniel Olmeda Reino", "url": "http://cvpr.thecvf.com/api/miniconf/users/190599?format=json", "institution": "Toyota Motor Europe"}, {"id": 75692, "fullname": "Rahaf Aljundi", "url": "http://cvpr.thecvf.com/api/miniconf/users/75692?format=json", "institution": "Toyota Motor Europe"}], "abstract": "AI assistants that support humans in daily life are becoming increasingly feasible, driven by the rapid advancements in multimodal language models. A key challenge lies in overcoming the generic nature of these models to deliver personalized experiences. Existing approaches to personalizing large vision language models often rely on additional training stages, which limit generality and scalability, or on engineered pipelines with external pre-trained modules, which hinder deployment efficiency. In this work, we propose an efficient personalization method that leverages the model\u2019s inherent ability to capture personalized concepts. Specifically, we extract visual tokens that predominantly represent the target concept by utilizing the model\u2019s internal attention mechanisms. These tokens serve as a memory of that specific concept, enabling the model to recall and describe it when it appears in test images. We conduct a comprehensive and unified evaluation of our approach and  SOTA  methods across various personalization settings including single-concept, multi-concept, and video personalization,  demonstrating strong performance gains with minimal personalization overhead.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38757", "url": null, "sourceid": 43245, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40109, "uid": "820f7970418d010d52a1c1db2d3c1d65", "name": "BEA-GS : BEyond RAdiance Supervision in 3DGS for Precise Object Extraction", "authors": [{"id": 104483, "fullname": "Alessio Mazzucchelli", "url": "http://cvpr.thecvf.com/api/miniconf/users/104483?format=json", "institution": "Arquimea Research Center"}, {"id": 165797, "fullname": "Mar\u00eda Naranjo Almeida", "url": "http://cvpr.thecvf.com/api/miniconf/users/165797?format=json", "institution": null}, {"id": 193557, "fullname": "Jorge Bustos Sanchez", "url": "http://cvpr.thecvf.com/api/miniconf/users/193557?format=json", "institution": "Arquimea Research Center"}, {"id": 193558, "fullname": "Mariella Dimiccoli", "url": "http://cvpr.thecvf.com/api/miniconf/users/193558?format=json", "institution": "Institut de Rob\u00f2tica i Infom\u00e0tica Industrial, CSIC-UPC"}, {"id": 85783, "fullname": "Francesc Moreno-Noguer", "url": "http://cvpr.thecvf.com/api/miniconf/users/85783?format=json", "institution": "Universidad Polit\u00e9cnica de Cataluna"}, {"id": 129233, "fullname": "Jordi Sanchez-Riera", "url": "http://cvpr.thecvf.com/api/miniconf/users/129233?format=json", "institution": "Institut de Rob\u00f2tica i Inform\u00e0tica Industrial CSIC-UPC"}, {"id": 85768, "fullname": "Adrian Penate-Sanchez", "url": "http://cvpr.thecvf.com/api/miniconf/users/85768?format=json", "institution": "Universidad de Las Palmas de Gran Canaria"}], "abstract": "Most Gaussian Splatting techniques that provide a 3D semantic representation of the scene don't optimize the underlying 3D geometry of the scene. This makes object-level editing or asset extraction challenging. Recent methods, like COBGS, Trace3D, and ObjectGS, acknowledge this limitation and propose approaches that modify the geometry of the scene to represent the underlying semantics. We go a step further and propose a novel solution that provides near perfect boundaries in object extraction. We do so by introducing two new losses in the optimization that take care of: 1. Modifying the geometry of visible Gaussians to respect semantic boundaries, and, 2. Modifying the geometry of non-visible Gaussians that appear once the object is extracted. Our first loss propagates gradients directly through the rasterization to allow for seamless integration within the optimization of the Gaussian parameters. Our second loss also propagates gradients to the Gaussian parameters, but does so without passing through the rasterization. This allows it to modify the geometry of the scene, even if not much transmittance arrives to a Gaussian (partial or non-visible). Exhaustive comparisons to 12 state of the art methods over 4 datasets, using six metrics, demonstrate that our approach produces overall the best boundary segmentation to date.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40109", "url": null, "sourceid": 42374, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38363, "uid": "a667239f0490d033961cadd7c7e620d2", "name": "NaTex: Seamless Texture Generation as Latent Color Diffusion", "authors": [{"id": 155059, "fullname": "Zeqiang Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/155059?format=json", "institution": "Tencent"}, {"id": 186526, "fullname": "Yunfei Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186526?format=json", "institution": "Tencent Hunyuan"}, {"id": 186525, "fullname": "Zibo Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186525?format=json", "institution": "Tencent"}, {"id": 77135, "fullname": "Xin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77135?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 86321, "fullname": "Xin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86321?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 127361, "fullname": "Jingwei Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127361?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 95127, "fullname": "Xiangyu Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/95127?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 129664, "fullname": "Chunchao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/129664?format=json", "institution": "Tencent"}], "abstract": "We present NaTex, a native texture generation framework that predicts texture color directly in 3D space. In contrast to previous approaches that rely on baking 2D multi-view images synthesized by geometry-conditioned Multi-View Diffusion models (MVDs), NaTex avoids several inherent limitations of the MVD pipeline. These include difficulties in handling occluded regions that require inpainting, achieving precise mesh-texture alignment along boundaries, and maintaining cross-view consistency and coherence in both content and color intensity. NaTex features a novel paradigm that addresses the aforementioned issues by viewing texture as a dense color point cloud. Driven by this idea, we propose latent color diffusion, which comprises a geometry-awared color point cloud VAE and a multi-control diffusion transformer (DiT), entirely trained from scratch using 3D data, for texture reconstruction and generation. To enable precise alignment, we introduce native geometry control that conditions the DiT on direct 3D spatial information via positional embeddings and geometry latents. We co-design the VAE\u2013DiT architecture, where the geometry latents are extracted via a dedicated geometry branch tightly coupled with the color VAE, providing fine-grained surface guidance that maintains strong correspondence with the texture. With these designs, NaTex demonstrates strong performance, significantly outperforming previous methods in texture coherence and alignment. Moreover, NaTex also exhibits strong generalization capabilities, either training-free or with simple tuning, for various downstream applications, e.g., material generation, texture refinement, and part segmentation and texturing.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38363", "url": null, "sourceid": 41182, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38412, "uid": "f1a235df0a3756dfbbe06983df8de8b0", "name": "Multi-Scale Speculative Decoding", "authors": [{"id": 189825, "fullname": "Elia Peruzzo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189825?format=json", "institution": "Qualcomm Inc, QualComm"}, {"id": 131700, "fullname": "Guillaume Sautiere", "url": "http://cvpr.thecvf.com/api/miniconf/users/131700?format=json", "institution": "Qualcomm Inc, QualComm"}, {"id": 106397, "fullname": "Amirhossein Habibian", "url": "http://cvpr.thecvf.com/api/miniconf/users/106397?format=json", "institution": "Qualcomm AI Research"}], "abstract": "Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative Decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with learned up-samplers to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism, enabling efficient correction of draft errors by focusing on spatial neighborhoods rather than raster-scan resampling after the first rejection. We demonstrate that MuLo-SD achieves substantial speedups --- up to $\\mathbf{1.7\\times}$ --- outperforming strong speculative decoding baselines such as EAGLE-2 and LANTERN in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state-of-the-art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38412", "url": null, "sourceid": 34820, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37029, "uid": "7d49472a0a92a841bb35f0001552032a", "name": "X-Part: High Fidelity And Structure Coherent Shape Decomposition And Completion", "authors": [{"id": 179905, "fullname": "XINHAO YAN", "url": "http://cvpr.thecvf.com/api/miniconf/users/179905?format=json", "institution": "shanghaitech university"}, {"id": 186523, "fullname": "Jiachen Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186523?format=json", "institution": null}, {"id": 181693, "fullname": "Yang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181693?format=json", "institution": "Tencent"}, {"id": 90524, "fullname": "Changfeng Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/90524?format=json", "institution": "Nanjing University"}, {"id": 104366, "fullname": "Yunhan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/104366?format=json", "institution": "The University of Hong Kong"}, {"id": 186524, "fullname": "Chunshi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186524?format=json", "institution": "Zhejiang University; Tencent Hunyuan3D"}, {"id": 186525, "fullname": "Zibo Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186525?format=json", "institution": "Tencent"}, {"id": 155059, "fullname": "Zeqiang Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/155059?format=json", "institution": "Tencent"}, {"id": 186526, "fullname": "Yunfei Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186526?format=json", "institution": "Tencent Hunyuan"}, {"id": 155060, "fullname": "Zhuo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/155060?format=json", "institution": null}, {"id": 129664, "fullname": "Chunchao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/129664?format=json", "institution": "Tencent"}], "abstract": "Generating 3D shapes at part level is pivotal for downstream applications such as mesh retopology, UV mapping, and 3D printing. However, existing part-based generation methods often lack sufficient controllability and suffer from poor semantically meaningful decomposition. To this end, we introduce X-Part, a controllable generative model designed to decompose a holistic 3D object into semantically meaningful and structurally coherent parts with high geometric fidelity. X-Part exploits the bounding box as prompts for the part generation and injects point-wise semantic features for meaningful decomposition. Furthermore, we design an editable pipeline for interactive part generation. Extensive experimental results show that X-Part achieves state-of-the-art performance in part-level shape generation. This work establishes a new paradigm for creating production-ready, editable, and structurally sound 3D assets. Codes will be released for public research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37029", "url": null, "sourceid": 43848, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38253, "uid": "f8c24c1d97f272c8b82b7d1ece3c07b6", "name": "Immunizing Models Against Harmful Long-Horizon Fine-Tuning via Contractive Optimization Dynamics", "authors": [{"id": 134702, "fullname": "Najibul Haque Sarker", "url": "http://cvpr.thecvf.com/api/miniconf/users/134702?format=json", "institution": "Virginia Tech"}, {"id": 189432, "fullname": "Zaber Ibn Abdul Hakim", "url": "http://cvpr.thecvf.com/api/miniconf/users/189432?format=json", "institution": "Virginia Tech"}, {"id": 168009, "fullname": "Ali Asgarov", "url": "http://cvpr.thecvf.com/api/miniconf/users/168009?format=json", "institution": "Virginia Polytechnic Institute and State University"}, {"id": 189433, "fullname": "Chia-Wei Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189433?format=json", "institution": "Virginia Polytechnic Institute and State University"}, {"id": 127012, "fullname": "Alvi Md Ishmam", "url": "http://cvpr.thecvf.com/api/miniconf/users/127012?format=json", "institution": "Virginia Polytechnic Institute and State University"}, {"id": 127058, "fullname": "Chris Thomas", "url": "http://cvpr.thecvf.com/api/miniconf/users/127058?format=json", "institution": "Virginia Polytechnic Institute and State University"}], "abstract": "Fine-tuning has become the default way to adapt powerful foundation models, but this also enables low-cost repurposing for harmful objectives. Existing immunization methods try to optimize local geometry or simulate short attacker horizons, and penalize observed loss drops. However, in practice, downstream tuners run thousands of updates and overcome these short-horizon defenses.In this paper, we propose CLAMP (Contractive Long-horizon Attacker Mitigation via Progress-bounding), an immunization method that traps harmful fine-tuning by shaping the attacker's optimization dynamics rather than only the initial landscape. Our key idea is to make harmful training locally contractive, making each update smaller than the last. This yields a closed-form bound on the attacker's training beyond the attacker's simulated training steps. We also introduce a Hessian-free directional curvature penalty, to create adversarial landscapes along harmful descent directions. Our bi-level objective minimizes the attacker's predicted improvement from train step zero to infinity. Experiments show our method withstands long-horizon fine-tuning across classification, generative, and autoregressive settings, substantially reduces harmful task adaptation, while preserving benign utility and fine-tuneability.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38253", "url": null, "sourceid": 32973, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39987, "uid": "ed8856951cf76b4e53c7850dcd16c2f0", "name": "Content-Adaptive Hierarchical Hyperprior for Neural Video Coding", "authors": [{"id": 193241, "fullname": "Junqi Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193241?format=json", "institution": "University of Science and Technology of China"}, {"id": 193242, "fullname": "Yaojun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193242?format=json", "institution": "ByteDance Inc."}, {"id": 193243, "fullname": "Chaoyi Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/193243?format=json", "institution": "ByteDance Inc."}, {"id": 191148, "fullname": "Zhipin Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191148?format=json", "institution": "Bytedance Inc."}, {"id": 90450, "fullname": "Li Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/90450?format=json", "institution": "University of Science and Technology of China"}, {"id": 87804, "fullname": "Dong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87804?format=json", "institution": "University of Science and Technology of China"}, {"id": 89414, "fullname": "Xiaoyan Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/89414?format=json", "institution": "University of Science and Technology of China"}], "abstract": "While neural video codecs (NVCs) have recently demonstrated superior performance over traditional codecs through end-to-end learning, existing approaches primarily focus on architectural enhancements and coding module design, with limited exploration into optimizing hierarchical structures\u2014specifically, quality and reference configurations. Current hierarchical structure optimization methods face two major limitations: (1) insufficient content-adaptive optimization, and (2) disjointed handling of quality and reference structures. To overcome these challenges, we propose a novel NVC framework that introduces content-adaptive hierarchical structure optimization through a hierarchical hyperprior derived from the current frame. Our NVC integrates two key components: (1) a hierarchical hyperprior extracted from the original frame to enable content-aware adaptation of the hierarchical structure; and (2) an adaptor within the hierarchical hyperprior codec combined with a dual-reference scheme, guided by the hyperprior, to jointly optimize quality and reference structures. By leveraging this content-adaptive hierarchical structure, our NVC achieves state-of-the-art rate-distortion performance, outperforming the previous leading NVC method DCVC-FM with BD-rate reductions of 15.51\\% and 12.20\\% relative to VTM-23.4 low-delay B (LDB) under intra-period settings of -1 and 32, respectively.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39987", "url": null, "sourceid": 42517, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37839, "uid": "b64e021126c832bb29ec9fa988155eaf", "name": "EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing", "authors": [{"id": 174116, "fullname": "Wei Chow", "url": "http://cvpr.thecvf.com/api/miniconf/users/174116?format=json", "institution": "National University of Singapore"}, {"id": 188378, "fullname": "Linfeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188378?format=json", "institution": "ByteDance Inc."}, {"id": 76351, "fullname": "Lingdong Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76351?format=json", "institution": "National University of Singapore"}, {"id": 188379, "fullname": "Zefeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188379?format=json", "institution": "ByteDance Inc."}, {"id": 188380, "fullname": "Qi Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188380?format=json", "institution": "ByteDance Inc."}, {"id": 188381, "fullname": "Hang Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/188381?format=json", "institution": "ByteDance Inc."}, {"id": 152540, "fullname": "Tian Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/152540?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 188382, "fullname": "Xian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188382?format=json", "institution": "ByteDance Inc."}, {"id": 133860, "fullname": "Jinbin Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/133860?format=json", "institution": "NUS"}, {"id": 188383, "fullname": "Shilin Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188383?format=json", "institution": "Peking University"}, {"id": 139138, "fullname": "Xiangtai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/139138?format=json", "institution": "ByteDance Inc."}, {"id": 188384, "fullname": "Junting Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188384?format=json", "institution": "ByteDance Inc."}, {"id": 188385, "fullname": "Shaoteng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188385?format=json", "institution": "ByteDance Inc."}, {"id": 188386, "fullname": "Ran Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/188386?format=json", "institution": null}, {"id": 188387, "fullname": "Tianshu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188387?format=json", "institution": "ByteDance Inc."}, {"id": 180264, "fullname": "Songhua Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180264?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Recent advances in diffusion models (DMs) have achieved exceptional visual quality in image editing tasks. However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we shift our attention beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to tackle this challenge. By predicting multiple masked tokens rather than holistic refinement, MGTs exhibit a localized decoding paradigm that endows them with the inherent capacity to explicitly preserve non-relevant regions during the editing process. Building upon this insight, we introduce the first MGT-based image editing framework, termed EditMGT. We first demonstrate that MGT's cross-attention maps provide informative localization signals for localizing edit-relevant regions and devise a multi-layer attention consolidation scheme that refines these maps to achieve fine-grained and precise localization. On top of these adaptive localization results, we introduce region-hold sampling, which restricts token flipping within low-attention areas to suppress spurious edits, thereby confining modifications to the intended target regions and preserving the integrity of surrounding non-target areas. To train EditMGT, we construct Crisp-2M, a high-resolution (>1024) dataset spanning seven diverse editing categories. Without introducing additional parameters, we adapt a pre-trained text-to-image MGT into an image editing model through attention injection. Extensive experiments across four standard benchmarks demonstrate that, with fewer than 1B parameters, our model achieves state-of-the-art image similarity performance while enabling  faster editing. Moreover, it delivers comparable or superior editing quality, with improvements of 3.6% and 17.6% on style change and style transfer tasks, respectively.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37839", "url": null, "sourceid": 39836, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40013, "uid": "808c180bdf9dbb4aa16f114e795a59e8", "name": "SemVideo: Reconstructs What You Watch from Brain Activity via Hierarchical Semantic Guidance", "authors": [{"id": 193298, "fullname": "Minghan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193298?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 126580, "fullname": "LAN YANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/126580?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 88809, "fullname": "Ke Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88809?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 86988, "fullname": "Honggang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86988?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 88808, "fullname": "Kaiyue Pang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88808?format=json", "institution": "SketchX AI"}, {"id": 76978, "fullname": "Yi-Zhe Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/76978?format=json", "institution": "University of Surrey"}], "abstract": "Reconstructing dynamic visual experiences from brain activity provides a compelling avenue for exploring the neural mechanisms of human visual perception. While recent progress in fMRI-based image reconstruction has been notable, extending this success to video reconstruction remains a significant challenge. Current fMRI-to-video reconstruction approaches consistently encounter two major shortcomings: (i) inconsistent visual representations of salient objects across frames, leading to appearance mismatches; (ii) poor temporal coherence, resulting in motion misalignment or abrupt frame transitions.To address these limitations, we introduce SemVideo, a novel fMRI-to-video reconstruction framework guided by hierarchical semantic information. At the core of SemVideo is SemMiner, a hierarchical guidance module that constructs three levels of semantic cues from the original video stimulus: static anchor descriptions, motion-oriented narratives, and holistic summaries. Leveraging this semantic guidance, SemVideo comprises three key components: a Semantic Alignment Decoder that aligns fMRI signals with CLIP-style embeddings derived from SemMiner, a Motion Adaptation Decoder that reconstructs dynamic motion patterns using a novel tripartite attention fusion architecture, and a Conditional Video Render that leverages hierarchical semantic guidance for video reconstruction. Experiments conducted on the CC2017 and HCP datasets demonstrate that SemVideo achieves superior performance in both semantic alignment and temporal consistency, setting a new state-of-the-art in fMRI-to-video reconstruction.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40013", "url": null, "sourceid": 31087, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40255, "uid": "79b76a2914b0e5d228c7ad1ab7e700c2", "name": "Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction", "authors": [{"id": 193895, "fullname": "No\u00e9 Artru", "url": "http://cvpr.thecvf.com/api/miniconf/users/193895?format=json", "institution": "\u00c9cole de technologie sup\u00e9rieure, Universit\u00e9 du Qu\u00e9bec; Ubisoft"}, {"id": 193896, "fullname": "Rukhshanda Hussain", "url": "http://cvpr.thecvf.com/api/miniconf/users/193896?format=json", "institution": "\u00c9cole de technologie sup\u00e9rieure, Universit\u00e9 du Qu\u00e9bec"}, {"id": 127842, "fullname": "Emeline Got", "url": "http://cvpr.thecvf.com/api/miniconf/users/127842?format=json", "institution": "La Forge - Ubisoft "}, {"id": 193897, "fullname": "Alexandre Messier", "url": "http://cvpr.thecvf.com/api/miniconf/users/193897?format=json", "institution": "Ubisoft"}, {"id": 77223, "fullname": "David B. Lindell", "url": "http://cvpr.thecvf.com/api/miniconf/users/77223?format=json", "institution": "University of Toronto"}, {"id": 127830, "fullname": "Abdallah Dib", "url": "http://cvpr.thecvf.com/api/miniconf/users/127830?format=json", "institution": "Ubisoft"}], "abstract": "Reconstructing high-fidelity 3D head geometry from images is critical for a wide range of applications, yet existing methods face fundamental limitations. Traditional photogrammetry achieves exceptional detail but requires extensive camera arrays (25-200+ views), substantial computation, and manual cleanup in challenging areas like facial hair. Recent alternatives present a fundamental trade-off: foundation models enable efficient single-image reconstruction but lack fine geometric detail, while optimization-based methods achieve higher fidelity but require dense views and expensive computation. We bridge this gap with a hybrid approach that combines the strengths of both paradigms.Our method introduces a multi-view surface normal prediction model that extends monocular foundation models with cross-view attention to produce geometrically consistent normals in a feed-forward pass. We then leverage these predictions as strong geometric priors within an inverse rendering optimization framework to recover high-frequency surface details. Our approach outperforms state-of-the-art single-image and multi-view methods, achieving high-fidelity reconstruction on par with dense-view photogrammetry while reducing camera requirements and computational cost. The code and model will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40255", "url": null, "sourceid": 34467, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39608, "uid": "c03243215c4d4070c7ac452513674761", "name": "Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment in CLIP", "authors": [{"id": 129916, "fullname": "Fatimah Zohra", "url": "http://cvpr.thecvf.com/api/miniconf/users/129916?format=json", "institution": "King Abdullah University of Science and Technology"}, {"id": 76141, "fullname": "Chen Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/76141?format=json", "institution": "King Abdullah University of Science and Technology (KAUST)"}, {"id": 192468, "fullname": "Hani Itani", "url": "http://cvpr.thecvf.com/api/miniconf/users/192468?format=json", "institution": null}, {"id": 75441, "fullname": "Bernard Ghanem", "url": "http://cvpr.thecvf.com/api/miniconf/users/75441?format=json", "institution": "KAUST"}], "abstract": "CLIP achieves strong zero-shot image-text retrieval by aligning global vision and text representations, yet it falls behind on fine-grained tasks even when fine-tuned on long detailed captions. In this work, we propose a multi-granular text-conditioned contrastive learning framework, $\\beta$-CLIP, to achieve hierarchical alignment across multiple textual granularities -- from full captions to sentences and phrases -- and their corresponding visual regions. For each level of textual granularity, $\\beta$-CLIP uses cross-attention to dynamically pool image patches, producing contextualized visual embeddings. A $\\beta$-weighted contrastive objective jointly optimizes multi-granular text\u2013contextualized visual pairs, with both soft cross-entropy and hard binary cross-entropy formulations, enabling controllable intra-image competition and balanced fine-to-coarse alignment. Through extensive experiments on various benchmarks with diverse granularities, we show that $\\beta$-CLIP achieves 30.9\\% on FG-OVD (Hard) and, on long-text retrieval, 63.6\\% I2T R@1 on DCI and 92.2\\% T2I R@1 on Urban1K, reaching the state-of-the-art among methods not trained with Hard Negatives. $\\beta$-CLIP establishes a strong, adaptive baseline for dense vision\u2013language correspondence.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39608", "url": null, "sourceid": 34752, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40153, "uid": "614a4db226bdaf4ffdbf60e37eda9213", "name": "Beyond Reassembly: Fractured Object Recovery with Missing Parts", "authors": [{"id": 182958, "fullname": "Qunce Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182958?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 193649, "fullname": "Jiahui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193649?format=json", "institution": "University of Bath"}, {"id": 84806, "fullname": "Yan-Pei Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84806?format=json", "institution": "Tencent ARC Lab"}, {"id": 89780, "fullname": "Weihao Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/89780?format=json", "institution": "Tencent ARC"}, {"id": 89605, "fullname": "Tai-Jiang Mu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89605?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 84809, "fullname": "Ying Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/84809?format=json", "institution": "Tencent"}, {"id": 174066, "fullname": "Chuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/174066?format=json", "institution": "Lambda, Inc"}, {"id": 193650, "fullname": "Da Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193650?format=json", "institution": "University of Bath"}, {"id": 131170, "fullname": "Yong-Liang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131170?format=json", "institution": "University of Bath"}, {"id": 89603, "fullname": "Shi-Min Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89603?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "We propose a novel learning-based task named fractured object recovery. Unlike previous fractured object reassembly task that only targets aligning existing parts with overlaps, our task aims to not only reassemble irrelevant parts but also predict missing parts, resulting in a complete shape recovery immediately. Our task coincides with practical experiences, where the prior knowledge of similar shapes can be leverage in the reassembly process, such that even non-overlapping parts can be reasoned into adequate locations. We also present the first learning model for the proposed task by correlating features of both existing and missing parts using a transformer, where the latter is naturally represented as missing tokens. Hence, our model can jointly estimate the poses of the existing parts and predict the shapes of the missing parts. To facilitate the task, we introduce a new dataset based on the existing fractured object benchmark by imposing different configurations of missing parts. We perform extensive evaluations to demonstrate the performance of the proposed model over baselines. The results show that joint part reassembly and prediction can be made possible and also have mutual benefits, which we believe can inspire future research and favor real applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40153", "url": null, "sourceid": 46582, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37021, "uid": "c1bdb162f30a000474d29f98d6e03e80", "name": "Text-Driven 3D Hand Motion Generation from Sign Language Data", "authors": [{"id": 135568, "fullname": "L\u00e9ore Bensabath", "url": "http://cvpr.thecvf.com/api/miniconf/users/135568?format=json", "institution": "Ecole Nationale des Ponts et Chauss\u00e9es"}, {"id": 186500, "fullname": "Mathis Petrovich", "url": "http://cvpr.thecvf.com/api/miniconf/users/186500?format=json", "institution": "NVIDIA"}, {"id": 75588, "fullname": "Gul Varol", "url": "http://cvpr.thecvf.com/api/miniconf/users/75588?format=json", "institution": "Ecole des Ponts ParisTech"}], "abstract": "Our goal is to train a generative model of 3D hand motions, conditioned on natural language descriptions specifying motion characteristics such as handshapes, locations, finger/hand/arm movements. To this end, we automatically build pairs of 3D hand motions and their associated textual labels with unprecedented scale. Specifically, we leverage a large-scale sign language video dataset, along with noisy pseudo-annotated sign categories, which we translate into hand motion descriptions via an LLM that utilizes a dictionary of sign attributes, as well as our complementary motion-script cues. This data enables training a text-conditioned hand motion diffusion model (HandMDM), that is robust across domains such as unseen sign categories from the same sign language, but also signs from another sign language and non-sign hand movements. We contribute extensive experimental investigation of these scenarios and will make our trained models and data publicly available to support future research in this relatively new field.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37021", "url": null, "sourceid": 44425, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39228, "uid": "e1508a958e7df0833ae54777d6475b9e", "name": "Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models", "authors": [{"id": 182443, "fullname": "Sijie Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182443?format=json", "institution": "University of Sheffield"}, {"id": 76669, "fullname": "Biao Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/76669?format=json", "institution": "Tsinghua University"}, {"id": 86015, "fullname": "Jungong Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/86015?format=json", "institution": "Aberystwyth University"}], "abstract": "Network pruning is an effective technique for enabling lightweight Large Vision-Language Models (LVLMs), which primarily incorporates both weights and activations into the importance metric. However, existing efforts typically process calibration data from different modalities in a unified manner, overlooking modality-specific behaviors. This raises a critical challenge: how to address the divergent behaviors of textual and visual tokens for accurate pruning of LVLMs.  To this end, we systematically investigate the sensitivity of visual and textual tokens to the pruning operation by decoupling their corresponding weights, revealing that: (i) the textual pathway should be calibrated via text tokens, since it exhibits higher sensitivity than the visual pathway; (ii) the visual pathway exhibits high redundancy, permitting even 50\\% sparsity. Motivated by these insights, we propose a simple yet effective Asymmetric Text-Visual Weight Pruning method for LVLMs, dubbed ATV-Pruning, which establishes the importance metric for accurate weight pruning by selecting the informative tokens from both textual and visual pathways. Specifically, ATV-Pruning integrates two primary innovations: first, a calibration pool is adaptively constructed by drawing on all textual tokens and a subset of visual tokens; second, we devise a layer-adaptive selection strategy to yield important visual tokens. Finally, extensive experiments across standard multimodal benchmarks verify the superiority of our ATV-Pruning over state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39228", "url": null, "sourceid": 42698, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38700, "uid": "bb70a688c0f76326b3e1f5c2e7cfb1f1", "name": "Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner", "authors": [{"id": 151852, "fullname": "Haojie Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/151852?format=json", "institution": "Peking University"}, {"id": 89455, "fullname": "Shuchen Weng", "url": "http://cvpr.thecvf.com/api/miniconf/users/89455?format=json", "institution": "Peking University"}, {"id": 190489, "fullname": "Jingqi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190489?format=json", "institution": "Peking University"}, {"id": 89138, "fullname": "Siqi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89138?format=json", "institution": "Peking University"}, {"id": 76401, "fullname": "Boxin Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/76401?format=json", "institution": "Peking University"}, {"id": 89234, "fullname": "Xinlong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89234?format=json", "institution": "Beijing Academy of Artificial Intelligence"}], "abstract": "Recent advancements in video generation highlight that realistic audio-visual synchronization is crucial for engaging content creation. However, existing video editing methods largely overlook audio-visual synchronization and lack the fine-grained spatial and temporal controllability required for precise instance-level edits. In this paper, we propose AVI-Edit, a framework for audio-sync video instance editing. We propose a granularity-aware mask refiner that iteratively refines coarse user-provided masks into precise instance-level regions.  We further design a self-feedback audio agent to curate high-quality audio guidance, providing fine-grained temporal control. To facilitate this task, we additionally construct a large-scale dataset with instance-centric correspondence and comprehensive annotations. Extensive experiments demonstrate that AVI-Edit outperforms state-of-the-art methods in visual quality, condition following, and audio-visual synchronization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38700", "url": null, "sourceid": 45810, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36568, "uid": "d7779e625d40191cb8ba59adf9280fe2", "name": "FlowFixer: Towards Detail-Preserving Subject-Driven Generation", "authors": [{"id": 107628, "fullname": "Jinyoung Jun", "url": "http://cvpr.thecvf.com/api/miniconf/users/107628?format=json", "institution": "Korea University"}, {"id": 136477, "fullname": "Wondong Jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/136477?format=json", "institution": null}, {"id": 177113, "fullname": "Wenbin Ouyang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177113?format=json", "institution": "Amazon"}, {"id": 169376, "fullname": "Raghudeep Gadde", "url": "http://cvpr.thecvf.com/api/miniconf/users/169376?format=json", "institution": "Amazon"}, {"id": 185369, "fullname": "Jungbeom Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/185369?format=json", "institution": "Korea University"}], "abstract": "In this paper, we present FlowFixer, a refinement framework for subject-driven generation (SDG) that restores fine details lost during scene generation caused by changes in scale and perspective of a subject. FlowFixer proposes direct image-to-image translation from visual references, avoiding ambiguities in language prompts. To enable image-to-image training, we introduce a one-step denoising scheme to generate self-supervised training data, which automatically removes high-frequency details while preserving global structure, effectively simulating real-world SDG errors. We further propose a keypoint matching-based metric to properly assess fidelity in details beyond semantic similarities usually measured by CLIP or DINO. Experimental results demonstrate that FlowFixer outperforms state-of-the-art SDG methods in both qualitative and quantitative evaluations, setting a new benchmark for high-fidelity subject-driven generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36568", "url": null, "sourceid": 39461, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37446, "uid": "0c7105be927a03bb3567bf01feec05a5", "name": "TriLite: Efficient Weakly Supervised Object Localization with Universal Visual Features and Tri-Region Disentanglement", "authors": [{"id": 187467, "fullname": "Arian Sabaghi", "url": "http://cvpr.thecvf.com/api/miniconf/users/187467?format=json", "institution": "Universiteit Antwerpen"}, {"id": 187468, "fullname": "Jose Oramas", "url": "http://cvpr.thecvf.com/api/miniconf/users/187468?format=json", "institution": "University of Antwerp; imec"}], "abstract": "Weakly supervised object localization (WSOL) aims to localize target objects in images using only image-level labels. Despite recent progress, many approaches still rely on multi-stage pipelines or full fine-tuning of large backbones, which increases training cost, while the broader WSOL community continues to face the challenge of partial object coverage. We present TriLite, a single-stage WSOL framework that leverages a frozen Vision Transformer with DINOv2 pre-training in a self-supervised manner, and introduces only a minimal number of trainable parameters (fewer than 800K on ImageNet-1K) for both classification and localization. At its core is the proposed TriHead module, which decomposes patch features into foreground, background, and ambiguous regions, thereby improving object coverage while suppressing spurious activations. By disentangling classification and localization objectives, TriLite effectively exploits the universal representations learned by self-supervised ViTs without requiring expensive end-to-end training. Extensive experiments on CUB-200-2011, ImageNet-1K, and OpenImages demonstrate that TriLite sets a new state of the art, while remaining significantly more parameter-efficient and easier to train than prior methods. The code will be available upon paper acceptance at https://anonymousRepoURL.com.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37446", "url": null, "sourceid": 46475, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36424, "uid": "9ce43b7f6e84b6ae7dae4e7d62483626", "name": "GraPHFormer: a multimodal graph persistent homology transformer for the analysis of neuroscience morphologies", "authors": [{"id": 106741, "fullname": "Uzair Shah", "url": "http://cvpr.thecvf.com/api/miniconf/users/106741?format=json", "institution": "Hamad Bin Khalifa University"}, {"id": 73962, "fullname": "Marco Agus", "url": "http://cvpr.thecvf.com/api/miniconf/users/73962?format=json", "institution": "Hamad Bin Khalifa University"}, {"id": 185002, "fullname": "Mahmoud Gamal", "url": "http://cvpr.thecvf.com/api/miniconf/users/185002?format=json", "institution": "Hamad Bin Khalifa University"}, {"id": 185003, "fullname": "Mahmood Alzubaidi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185003?format=json", "institution": "Hamad Bin Khalifa University"}, {"id": 185004, "fullname": "Corrado Cali", "url": "http://cvpr.thecvf.com/api/miniconf/users/185004?format=json", "institution": "Department of Neuroscience, University of Turin"}, {"id": 185005, "fullname": "PIERRE MAGISTRETTI", "url": "http://cvpr.thecvf.com/api/miniconf/users/185005?format=json", "institution": "EPFL - EPF Lausanne; King Abdullah University of Science and Technology"}, {"id": 185006, "fullname": "Abdesselam Bouzerdoum", "url": "http://cvpr.thecvf.com/api/miniconf/users/185006?format=json", "institution": "Hamad Bin Khalifa University; University of Wollonong"}, {"id": 185007, "fullname": "Mowafa Househ", "url": "http://cvpr.thecvf.com/api/miniconf/users/185007?format=json", "institution": "Hamad Bin Khalifa University"}], "abstract": "Quantitative analysis of neural morphology is central to understanding how circuits develop, compute, and fail. Skeletonized reconstructions of neurons and glia enable systematic study of branching patterns, path lengths, tapering, and spatial organization, with implications for neurodevelopment, learning and memory, and neurodegenerative disease. Current learning pipelines often treat either topology (via persistent homology) or graph structure (via graph neural networks) in isolation. We argue that these views are complementary and introduce \\emph{GraPHFormer}, a multimodal architecture that fuses topological and graph representations for cell morphology analysis. Our vision branch operates on a novel three-channel persistence image derived from the morphological tree: an unweighted TMD-style density, a branch-length channel (persistence), and a branch-radius channel (mean radius along death-to-leaf paths). In parallel, a graph Transformer processes the original skeleton with geometric/radial attributes. We explore lightweight fusion strategies (late fusion and cross-attention) and train under both supervised and contrastive regimes. We extensively assessed GraPHFormer through established morphology benchmarks, and we showcase that it consistently and significantly outperforms strong topology-only, graph-only, and morphometrics baselines. Beyond accuracy, we demonstrate practical relevance by discriminating neuronal and glial morphologies across cortical areas and species, and by detecting signatures associated with developmental trajectories and degenerative conditions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36424", "url": null, "sourceid": 37916, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39598, "uid": "a544abb197a74a4fce50a04e5537c39f", "name": "AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment", "authors": [{"id": 192445, "fullname": "Anna \u0160\u00e1rov\u00e1 Mike\u0161t\u00edkov\u00e1", "url": "http://cvpr.thecvf.com/api/miniconf/users/192445?format=json", "institution": "CIIRC, Czech Technical University"}, {"id": 192446, "fullname": "M\u00e9d\u00e9ric Fourmy", "url": "http://cvpr.thecvf.com/api/miniconf/users/192446?format=json", "institution": "CIIRC, Czech Technical University, Czech Technical University of Prague"}, {"id": 182031, "fullname": "Martin C\u00edfka", "url": "http://cvpr.thecvf.com/api/miniconf/users/182031?format=json", "institution": "Czech Technical University in Prague, CIIRC"}, {"id": 85314, "fullname": "Josef Sivic", "url": "http://cvpr.thecvf.com/api/miniconf/users/85314?format=json", "institution": "Czech Technical University in Prague"}, {"id": 192447, "fullname": "Vladimir Petrik", "url": "http://cvpr.thecvf.com/api/miniconf/users/192447?format=json", "institution": "CIIRC, Czech Technical University, Czech Technical University of Prague"}], "abstract": "Single-view RGB model-based object pose estimation methods achieve strong generalization but are fundamentally limited by depth ambiguity, clutter, and occlusions. Multi-view pose estimation methods have the potential to solve these issues, but existing works rely on precise single-view pose estimates or lack generalization to unseen objects. We address these challenges via the following three contributions.First, we introduce AlignPose, a 6D object pose estimation method that aggregates information from multiple extrinsically calibrated RGB views and does not require any object-specific training or symmetry annotation.Second, the key component of this approach is a new multi-view feature-metric refinement specifically designed for object pose. It optimizes a single, consistent world-frame object pose minimizing the feature discrepancy between on-the-flyrendered object features and observed image features across all views simultaneously.Third, we report extensive experiments on four datasets (YCB-V, T-LESS, ITODD-MV, HouseCat6D) using the BOP benchmark evaluation and show that AlignPose outperforms other published methods, especially on challenging industrial datasets where multiple views are readily available in practice.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39598", "url": null, "sourceid": 34632, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36852, "uid": "3c091577be6349ed5429131fe38513eb", "name": "Beyond Geometry: Artistic Disparity Synthesis for Immersive 2D-to-3D", "authors": [{"id": 73300, "fullname": "Ping Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/73300?format=json", "institution": "China United Network Communications Group Co., Ltd."}, {"id": 186022, "fullname": "Zezhou Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186022?format=json", "institution": "China United Network Communications Group Co., Ltd."}, {"id": 89521, "fullname": "Xingpeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89521?format=json", "institution": "Southwest Petroleum University"}, {"id": 126957, "fullname": "Yanlin Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/126957?format=json", "institution": "Tampere University"}, {"id": 152682, "fullname": "Huan Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152682?format=json", "institution": "China Mobile Communications Group Co.,Ltd"}, {"id": 146120, "fullname": "Xiang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/146120?format=json", "institution": "chinaunicom group"}, {"id": 186023, "fullname": "Zipeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186023?format=json", "institution": "Chongqing University of Post and Telecommunications; chinaunicom group"}, {"id": 186024, "fullname": "Xin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186024?format=json", "institution": "Beijing University of Posts and Telecommunications; China United Network Communications Group Co., Ltd"}, {"id": 152681, "fullname": "Zhaoxiang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152681?format=json", "institution": "Data Science &amp; Artificial Intelligence Research Institute,  China Unicom"}, {"id": 152683, "fullname": "Kai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152683?format=json", "institution": "China Unicom"}, {"id": 149334, "fullname": "Shiguo Lian", "url": "http://cvpr.thecvf.com/api/miniconf/users/149334?format=json", "institution": "AI Innovation Center of China Unicom"}], "abstract": "Current 2D-to-3D conversion methods achieve geometric accuracy but are artistically deficient, failing to replicate the immersive and emotionally resonant experience of professional 3D cinema. This is because \"geometric reconstruction\" paradigms mistake deliberate artistic intent\u2014such as strategic zero-plane shifts for \"pop-out\" effects and local depth sculpting\u2014for data \"noise\" or ambiguity. This paper argues for a new paradigm: \\textbf{Artistic Disparity Synthesis}, shifting the goal from physically accurate disparity estimation to artistically coherent disparity synthesis. We propose \\textbf{Art3D}, a preliminary framework exploring this paradigm. Art3D uses a dual-path architecture to decouple global depth parameters (macro-intent) from local artistic effects (visual brushstrokes) and learns from professional 3D film data via indirect supervision. We also introduce a preliminary evaluation method to quantify cinematic alignment. Experiments show our approach demonstrates potential in replicating key local out-of-screen effects and aligning with the global depth styles of cinematic 3D content, laying the groundwork for a new class of artistically-driven conversion tools.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36852", "url": null, "sourceid": 38400, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39497, "uid": "0c0628d88467dd0051bcd8b2565b619d", "name": "CoIn: Coverage and Informativeness-Guided Token Reduction for Efficient Large Multimodal Models", "authors": [{"id": 180017, "fullname": "Chenxi Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/180017?format=json", "institution": "Tsinghua University"}, {"id": 192199, "fullname": "Yongheng Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192199?format=json", "institution": "Tsinghua University"}, {"id": 192200, "fullname": "Jiani Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192200?format=json", "institution": "Tsinghua University"}, {"id": 192201, "fullname": "Yujia Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192201?format=json", "institution": "Tencent"}, {"id": 190534, "fullname": "Xi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190534?format=json", "institution": null}, {"id": 192202, "fullname": "Ju Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/192202?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Large Multimodal Models (LMMs) have shown remarkable success in visual understanding tasks. LMMs encode visual and textual inputs into tokens, which are then processed by Large Language Models (LLMs). However, the large number of visual tokens poses a major bottleneck for inference efficiency and memory usage. Reducing visual tokens is a promising training-free solution, but existing methods remain limited. Importance-based approaches suffer from poor generalization, are incompatible with kernel-level inference optimizations, and only consider information from a single modality. Diversity-based strategies typically focus on pairwise token redundancy and treat all tokens as equally important. Recent attempts to sequentially combine importance and diversity criteria still fail to address the intrinsic drawbacks of their underlying metrics. To address these limitations, we reformulate visual token reduction as an optimal subset selection problem jointly guided by two complementary objectives: informativeness and coverage. Informativeness is quantified through per-token intrinsic saliency and visual\u2013textual alignment, while coverage is enforced via a volume-based subset selection criterion that ensures global representativeness in the visual feature space.This joint formulation effectively integrates visual saliency, cross-modal alignment, and global coverage in an end-to-end token selection process, yielding a computationally efficient, model-agnostic framework compatible with modern inference accelerators. Extensive experiments demonstrate that CoIn substantially reduces computation and memory cost while maintaining strong task performance. We will release our code once accepted.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39497", "url": null, "sourceid": 42631, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40126, "uid": "b94468282c77ce98c794d894337ae500", "name": "A Bit is All You Need! Efficient Video Capture via Single Bit Imaging", "authors": [{"id": 107291, "fullname": "Kanchana Vaishnavi Gandikota", "url": "http://cvpr.thecvf.com/api/miniconf/users/107291?format=json", "institution": "Universit\u00e4t Siegen"}, {"id": 89505, "fullname": "Michael Moeller", "url": "http://cvpr.thecvf.com/api/miniconf/users/89505?format=json", "institution": "University of Siegen"}, {"id": 147554, "fullname": "Andreas Kolb", "url": "http://cvpr.thecvf.com/api/miniconf/users/147554?format=json", "institution": "Universit\u00e4t Siegen"}, {"id": 193590, "fullname": "Bhaskar Choubey", "url": "http://cvpr.thecvf.com/api/miniconf/users/193590?format=json", "institution": "Universit\u00e4t Siegen"}, {"id": 107290, "fullname": "Paramanand Chandramouli", "url": "http://cvpr.thecvf.com/api/miniconf/users/107290?format=json", "institution": "Independent Researcher"}], "abstract": "We introduce a fundamentally new paradigm in video sensing, 1-bit computational video, that redefines the limits of imaging efficiency and performance. Instead of the conventional high-bit-depth capture, we show that one bit measurements captured by time-varying thresholding can be used to reconstruct full-bit-depth videos, eliminating the need for power-hungry, high-precision analog-to-digital conversion at the sensor as well as reducing the energy consumption in data transmission. We propose thresholding strategies to effectively capture spatiotemporal dependencies in video streams.  Despite the radical data compression at acquisition, we recover full-bit-depth videos with high fidelity through neural video reconstruction using a transformer-based neural network. Our method unlocks significant gains in memory efficiency, power savings, and data throughput reduction at the sensor, making it ideal for imaging systems with ultra-low-power requirements or high-speed video capture. We validate our framework on the task of recovering both standard and high-speed videos  from simulated 1-bit  measurements. Our work redefines the camera pipeline, potentially paving the way for gigapixel, kilohertz imaging systems on low-power sensor hardware.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40126", "url": null, "sourceid": 36821, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38511, "uid": "62a5a1b5c98bf0e46e76dbfe74c403f0", "name": "M\u2074-SAM: Multi-modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection", "authors": [{"id": 190027, "fullname": "Jiyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190027?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 161519, "fullname": "jia lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/161519?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 157584, "fullname": "Xiaofei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/157584?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 88498, "fullname": "Runmin Cong", "url": "http://cvpr.thecvf.com/api/miniconf/users/88498?format=json", "institution": "Shandong University"}, {"id": 190028, "fullname": "Deyang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190028?format=json", "institution": null}, {"id": 190029, "fullname": "Zhi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190029?format=json", "institution": "Shanghai University"}], "abstract": "The Segment Anything Model 2 (SAM2) has emerged as a foundation model for universal segmentation. Owing to its generalizable visual representations, SAM2 has been successfully applied to various downstream tasks. However, extending SAM2 to RGB-D video salient object detection (RGB-D VSOD) task encounters three challenges including limited spatial modeling of linear LoRA, insufficient employment of SAM's multi-scale features, and dependence of initialization on explicit prompts. To address the issues, we present **M**ulti-**M**odal **M**ixture-of-Experts with **M**emory-Augmented **SAM** (**M**$^4$-**SAM**), which equips SAM2 with modality-related PEFT, hierarchical feature fusion, and prompt-free memory initialization. Firstly, we inject **Modality-Aware MoE-LoRA**, which employs convolutional experts to encode local spatial priors and introduces a modality dispatcher for efficient multi-modal fine-tuning, into SAM2's encoder. Secondly, we deploy **Gated Multi-Level Feature Fusion**, which hierarchically aggregates multi-scale encoder features with an adaptive gating mechanism, to balance spatial details and semantic context. Finally, to conduct zero-shot VSOD without manual prompts, we utilize a **Pseudo-Guided Initialization**, where a coarse mask is regarded as a pseudo prior and used to bootstrap the memory bank. Extensive experiments demonstrate that M$^4$-SAM achieves the state-of-the-art performance across all evaluation metrics on three public RGB-D VSOD datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38511", "url": null, "sourceid": 37791, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36282, "uid": "63804679e4088cefae696e44187e1a5d", "name": "TALON: Test-time Adaptive Learning for On-the-Fly Category Discovery", "authors": [{"id": 180813, "fullname": "Yanan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180813?format=json", "institution": "China Agricultural University"}, {"id": 184670, "fullname": "Yuhan Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184670?format=json", "institution": "China Agricultural University"}, {"id": 184671, "fullname": "Tailai Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184671?format=json", "institution": "China Agricultural University"}, {"id": 184672, "fullname": "Zhixiang Chi", "url": "http://cvpr.thecvf.com/api/miniconf/users/184672?format=json", "institution": "University of Toronto"}, {"id": 184673, "fullname": "ZiZhang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184673?format=json", "institution": "Fudan University"}, {"id": 184577, "fullname": "Yi Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184577?format=json", "institution": "Beijing Jiaotong University"}, {"id": 134864, "fullname": "Yang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/134864?format=json", "institution": "Concordia University"}, {"id": 184674, "fullname": "Zhenbo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184674?format=json", "institution": "China Agricultural University"}], "abstract": "On-the-fly category discovery (OCD) aims to recognize known categories while simultaneously discovering novel ones from an unlabeled online stream, using a model trained only on labeled data. Existing approaches freeze the feature extractor trained offline and employ a hash-based framework that quantizes features into binary codes as class prototypes. However, discovering novel categories with a fixed knowledge base is counterintuitive, as the learning potential of incoming data is entirely neglected. In addition, feature quantization introduces information loss, diminishes representational expressiveness, and amplifies intra-class variance. It often results in category explosion, where a single class is fragmented into multiple pseudo-classes. To overcome these limitations, we propose a test-time adaptation framework that enables learning through discovery. It incorporates two complementary strategies: a semantic-aware prototype update and a stable test-time encoder update. The former dynamically refines class prototypes to enhance classification, whereas the latter integrates new information directly into the parameter space. Together, these components allow the model to continuously expand its knowledge base with newly encountered samples. Furthermore, we introduce a margin-aware logit calibration in the offline stage to enlarge inter-class margins and improve intra-class compactness, thereby reserving embedding space for future class discovery. Experiments on standard OCD benchmarks demonstrate that our method substantially outperforms existing hash-based state-of-the-art approaches, yielding notable improvements in novel-class accuracy and effectively mitigating category explosion.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36282", "url": null, "sourceid": 46016, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37008, "uid": "2333118fe60f95fb03220d4263795238", "name": "Real-Time Long Horizon Air Quality Forecasting via Group-Relative Policy Optimization", "authors": [{"id": 158888, "fullname": "Inha Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158888?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 186457, "fullname": "Eunki Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/186457?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 186458, "fullname": "Wonjeong Ryu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186458?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 186459, "fullname": "Jaeyo Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186459?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 186460, "fullname": "Seungjun Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186460?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 186461, "fullname": "Yoon-Hee Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186461?format=json", "institution": "Ajou University"}, {"id": 186462, "fullname": "Seongeun Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186462?format=json", "institution": "Ajou University"}, {"id": 186463, "fullname": "Eunhye Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/186463?format=json", "institution": "Kunsan National University"}, {"id": 186464, "fullname": "Soontae Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/186464?format=json", "institution": "Ajou University"}, {"id": 144939, "fullname": "Hyunjung Shim", "url": "http://cvpr.thecvf.com/api/miniconf/users/144939?format=json", "institution": "KAIST"}], "abstract": "Accurate long horizon forecasting of particulate matter (PM) concentration fields is essential for operational public health decisions. However, achieving reliable forecasts remains challenging in regions with complex terrain and strong atmospheric dynamics such as East Asia. While foundation models such as Aurora offer global generality, they often miss region-specific dynamics and rely on non\u2013real-time inputs, limiting their practical utility for localized warning systems. To address this gap, we construct and release the real-world observations and high-resolution CMAQ-OBS dataset for East Asia, reducing regional error by 59.5% and enabling real-time 48-120 hour forecasts critical for public health alerts. However, standard point-wise objectives cannot reflect asymmetric operational costs, where false alarms deteriorate public trust while missed severe events endanger populations. This cost mismatch causes SFT models to over-predict and yield high False Alarm Rates. We introduce Group-Relative Policy Optimization (GRPO) with class-wise rewards and curriculum rollout to align predictions with operational priorities. Experimental results demonstrate that our framework significantly improves the reliability of the forecast. Compared to the SFT-only baseline, our model reduces the False Alarm Rate by 47.3% while achieving a competitive F1-score, proving its effectiveness for practical, real-world air quality forecasting systems on long lead time scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37008", "url": null, "sourceid": 36529, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40124, "uid": "3ac35a71328200e6980b45dcc8ca36ca", "name": "Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs", "authors": [{"id": 193580, "fullname": "Houston Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193580?format=json", "institution": "Huawei Technologies Ltd.; McMaster University"}, {"id": 193581, "fullname": "TAO ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/193581?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 193582, "fullname": "Baoze Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/193582?format=json", "institution": "McMaster University"}, {"id": 193583, "fullname": "Yuanqi Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/193583?format=json", "institution": "Huawei Technologies Ltd.; McMaster University"}, {"id": 193584, "fullname": "Yincheng Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193584?format=json", "institution": "University of Waterloo; Huawei Technologies Ltd."}, {"id": 85366, "fullname": "Huan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85366?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 155855, "fullname": "Li Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155855?format=json", "institution": "Huawei Canada"}, {"id": 193585, "fullname": "Linfeng Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/193585?format=json", "institution": "University of Toronto, University of Toronto"}, {"id": 193586, "fullname": "Ziqiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193586?format=json", "institution": "Concordia University"}, {"id": 152200, "fullname": "Xinxin Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/152200?format=json", "institution": "Concordia University"}, {"id": 134864, "fullname": "Yang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/134864?format=json", "institution": "Concordia University"}, {"id": 193552, "fullname": "YUANHAO YU", "url": "http://cvpr.thecvf.com/api/miniconf/users/193552?format=json", "institution": ""}, {"id": 184672, "fullname": "Zhixiang Chi", "url": "http://cvpr.thecvf.com/api/miniconf/users/184672?format=json", "institution": "University of Toronto"}], "abstract": "User interface to code (UI2Code) aims to generate executable code that can faithfully reconstruct a given input UI. Prior work focuses largely on web pages and mobile screens, leaving app widgets underexplored. Unlike web or mobile UIs with rich hierarchical context, widgets are compact, context-free micro-interfaces that summarize key information through dense layouts and iconography under strict spatial constraints. Moreover, while (image, code) pairs are widely available for web or mobile UIs, widget designs are proprietary and lack accessible markup. We formalize this setting as the Widget-to-Code (Widget2Code) and introduce an image-only widget benchmark with fine-grained, multi-dimensional evaluation metrics. Benchmarking shows that although generalized multimodal large language models (MLLMs) outperform specialized UI2Code methods, they still produce unreliable and visually inconsistent code. To address these limitations, we develop a baseline that jointly advances perceptual understanding and structured code generation. At the perceptual level, we follow widget design principles to assemble atomic components into complete layouts, equipped with icon retrieval and UI composition modules. At the system level, we design an end-to-end infrastructure, WidgetFactory, which includes a framework-agnostic widget-tailored domain-specific language (WidgetDSL) and a compiler that translates it into multiple front-end implementations (e.g., React, HTML). An adaptive rendering module further refines spatial dimensions to satisfy compactness constraints. Together, these contributions substantially enhance visual fidelity, establishing a strong baseline and unified infrastructure for future Widget2Code research.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40124", "url": null, "sourceid": 32076, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38061, "uid": "426c2c140d842b9f9c538b204ff83a6d", "name": "Fine-VAD: Towards Fine-Grained Video Anomaly Detection via Progressive Cross-Granularity Learning", "authors": [{"id": 126475, "fullname": "Menghao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126475?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 181848, "fullname": "Yiyan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181848?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 126479, "fullname": "Pengfei Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/126479?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 153469, "fullname": "Haifeng Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/153469?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 126480, "fullname": "Qi Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/126480?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 126498, "fullname": "Zirui Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126498?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 188950, "fullname": "Huazheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188950?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 188951, "fullname": "Lei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188951?format=json", "institution": "China Unicom Network Communications Co., Ltd."}, {"id": 126466, "fullname": "Jianxin Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/126466?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 126469, "fullname": "Jingyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126469?format=json", "institution": "Beijing University of Post and Telecommunication, Tsinghua University"}], "abstract": "In this paper, we explore video anomaly detection (VAD) from a fine-grained perspective, which aims not only to detect anomalous events but also to identify their specific categories. Due to the limited number of examples per category, existing methods either fail to handle intra-class variation across diverse contexts or struggle with inter-class confusion caused by shared visual primitives. To address these challenges, we propose a progressive cross-granularity learning paradigm that leverages coarse- and fine-grained labels in a complementary manner to progressively refine representations from generic anomaly patterns to category-specific semantics.Building on this paradigm, we develop Fine-VAD, a progressive alignment framework that aligns video features with supervision signals at multiple granularities. Extensive experiments on two benchmark datasets demonstrate that Fine-VAD achieves up to a 48\\% improvement in fine-grained anomaly classification, while maintaining state-of-the-art performance in coarse-grained anomaly detection. Notably, our paradigm generalizes well across diverse model architectures, offering an adaptable and effective solution for real-world fine-grained VAD.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38061", "url": null, "sourceid": 43936, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39394, "uid": "1f035fdddd3111aba0a4dad801b4e874", "name": "RGB-Event based Pedestrian Attribute Recognition: A Benchmark Dataset and An Asymmetric RWKV Fusion Framework", "authors": [{"id": 71047, "fullname": "Xiao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/71047?format=json", "institution": "Anhui University"}, {"id": 191986, "fullname": "Haiyang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191986?format=json", "institution": "anhui university"}, {"id": 152299, "fullname": "Shiao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152299?format=json", "institution": "Anhui University"}, {"id": 191987, "fullname": "Qiang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191987?format=json", "institution": "Anhui University"}, {"id": 189091, "fullname": "Jiandong Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189091?format=json", "institution": "Anhui University"}, {"id": 191988, "fullname": "Haoyu Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/191988?format=json", "institution": "Anhui University"}, {"id": 190618, "fullname": "Bo Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190618?format=json", "institution": "Anhui University"}, {"id": 183391, "fullname": "Chenglong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183391?format=json", "institution": "Anhui University"}], "abstract": "Existing pedestrian attribute recognition methods are generally developed based on RGB frame cameras. However, these approaches are constrained by the limitations of RGB cameras, such as sensitivity to lighting conditions and motion blur, which hinder their performance. Furthermore, current attribute recognition primarily focuses on analyzing pedestrians' external appearance and clothing, lacking an exploration of emotional dimensions. In this paper, we revisit these issues and propose a novel multi-modal RGB-Event attribute recognition task by drawing inspiration from the advantages of event cameras in low-light, high-speed, and low-power consumption. Specifically, we introduce the first large-scale multi-modal pedestrian attribute recognition dataset, termed EventPAR, comprising 100K paired RGB-Event samples that cover 50 attributes related to both appearance and six human emotions, diverse scenes, and various seasons. By retraining and evaluating mainstream PAR models on this dataset, we establish a comprehensive benchmark and provide a solid foundation for future research in terms of data and algorithmic baselines. In addition, we propose a novel RWKV-based multi-modal pedestrian attribute recognition framework, featuring an RWKV visual encoder and an asymmetric RWKV fusion module. Extensive experiments are conducted on our proposed dataset as well as two simulated datasets (MARS-Attribute and DukeMTMC-VID-Attribute), achieving state-of-the-art results. The source code and dataset will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39394", "url": null, "sourceid": 33622, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40076, "uid": "8155cf9d25c763ee101e424ec5cac948", "name": "QD-PCQA: Quality-Aware Domain Adaptation for Point Cloud Quality Assessment", "authors": [{"id": 175078, "fullname": "Guohua Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/175078?format=json", "institution": "Beijing Jiaotong University"}, {"id": 193449, "fullname": "Jian Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/193449?format=json", "institution": null}, {"id": 91112, "fullname": "Meiqin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/91112?format=json", "institution": "Beijing Jiaotong University"}, {"id": 188926, "fullname": "Chao Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188926?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 86501, "fullname": "Weisi Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/86501?format=json", "institution": "Nanyang Technological University"}], "abstract": "No-Reference Point Cloud Quality Assessment (NR-PCQA) still struggles with generalization, primarily due to the scarcity of annotated point cloud datasets. Since the Human Visual System (HVS) drives perceptual quality assessment independently of media types, prior knowledge on quality learned from images can be repurposed for point clouds. This insight motivates adopting Unsupervised Domain Adaptation (UDA) to transfer quality-relevant priors from labeled images to unlabeled point clouds. However, existing UDA-based PCQA methods often overlook key characteristics of perceptual quality, such as sensitivity to quality ranking and quality-aware feature alignment, thereby limiting their effectiveness. To address these issues, we propose a novel Quality-aware Domain adaptation framework for PCQA, termed QD-PCQA. The framework comprises two main components: i) a Rank-weighted Conditional Alignment (RCA) strategy that aligns features under consistent quality levels and adaptively emphasizes misranked samples to reinforce perceptual quality ranking awareness; and ii) a Quality-guided Feature Augmentation (QFA) strategy, which includes quality-guided style mixup, multi-layer extension, and dual-domain augmentation modules to augment perceptual feature alignment. Extensive cross-domain experiments demonstrate that QD-PCQA significantly improves generalization in NR-PCQA tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40076", "url": null, "sourceid": 45294, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38705, "uid": "ac82ed7da82faebe98f99b087944502c", "name": "What Are You Doing? A Closer Look at Controllable Human Video Generation", "authors": [{"id": 190495, "fullname": "Emanuele Bugliarello", "url": "http://cvpr.thecvf.com/api/miniconf/users/190495?format=json", "institution": "Google DeepMind"}, {"id": 86066, "fullname": "Anurag Arnab", "url": "http://cvpr.thecvf.com/api/miniconf/users/86066?format=json", "institution": "Google"}, {"id": 157982, "fullname": "Roni Paiss", "url": "http://cvpr.thecvf.com/api/miniconf/users/157982?format=json", "institution": "Google DeepMind"}, {"id": 190496, "fullname": "Christy Koh", "url": "http://cvpr.thecvf.com/api/miniconf/users/190496?format=json", "institution": "Google"}, {"id": 190497, "fullname": "Pieter-Jan Kindermans", "url": "http://cvpr.thecvf.com/api/miniconf/users/190497?format=json", "institution": "Google Deepmind"}, {"id": 69185, "fullname": "Cordelia Schmid", "url": "http://cvpr.thecvf.com/api/miniconf/users/69185?format=json", "institution": "Inria / Google"}], "abstract": "High-quality benchmarks are crucial for driving progress in machine learning research. However, despite the growing interest in video generation, there is no comprehensive dataset to evaluate human synthesis. Humans can perform a wide variety of actions and interactions, but existing datasets, like TikTok, TED-Talks, and HumanVid, lack the diversity and complexity to fully capture the capabilities of video generation models. We close this gap by introducing 'What Are You Doing?' (WYD): a new benchmark for fine-grained evaluation of controllable image-to-video generation of humans. WYD consists of 1,544 captioned videos that have been meticulously collected and annotated with fine-grained categories. These allow us to systematically measure performance across 9 aspects of human generation, including actions, interactions and motion. We also propose an evaluation framework, where we adapt existing metrics for better human- and video-level assessment, as shown by human preference. Equipped with our dataset and metrics, we perform in-depth analyses of state-of-the-art open-source models in controllable image-to-video generation, showing how WYD provides novel insights about their capabilities. We release data and code to drive progress in human video generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38705", "url": null, "sourceid": 37956, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36364, "uid": "4e253f59ce72377038ab855694e4903c", "name": "CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing", "authors": [{"id": 183469, "fullname": "Yucheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183469?format=json", "institution": "HKUST"}, {"id": 154483, "fullname": "Zedong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154483?format=json", "institution": "The Hong Kong University of Science and Technology (HKUST)"}, {"id": 184883, "fullname": "Yuetong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184883?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 127301, "fullname": "Yue Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/127301?format=json", "institution": "The Hong Kong University of Science and Technology, Hong Kong"}, {"id": 88296, "fullname": "Dan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88296?format=json", "institution": "CSE, HKUST"}], "abstract": "Unified diffusion editors often rely on a fixed, shared backbone for diverse tasks, suffering from task interference and poor adaptation to heterogeneous demands (e.g., local vs global, semantic vs photometric). In particular, prevalent ControlNet and OmniControl variants combine multiple conditioning signals (e.g., text, mask, reference) via static concatenation or additive adapters which cannot dynamically prioritize or suppress conflicting modalities, thus resulting in artifacts like color bleeding across mask boundaries, identity or style drift, and unpredictable behavior under multi-condition inputs. To address this, we propose Condition-Aware Routing of Experts (CARE-Edit) that aligns model computation with specific editing competencies. At its core, a lightweight latent-attention router assigns encoded diffusion tokens to four specialized experts--Text, Mask, Reference, and Base--based on multi-modal conditions and diffusion timesteps: (i) a Mask Repaint module first refines coarse user-defined masks for precise spatial guidance; (ii) the router applies sparse top-K selection to dynamically allocate computation to the most relevant experts; (iii) a Latent Mixture module subsequently fuses expert outputs, coherently integrating semantic, spatial, and stylistic information to the base images. Experiments validate CARE-Edit's strong performance on contextual editing tasks, including erasure, replacement, text-driven edits, and style transfer. Empirical analysis further reveals task-specific behavior of specialized experts, showcasing the importance of dynamic, condition-aware processing to mitigate multi-condition conflicts. The source code, models, and dataset will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36364", "url": null, "sourceid": 44929, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36449, "uid": "d2d0cd4ce0710569cf620c2a4f598aa8", "name": "DepthFocus: Controllable Depth Estimation for See-Through Scenes", "authors": [{"id": 175347, "fullname": "junhong min", "url": "http://cvpr.thecvf.com/api/miniconf/users/175347?format=json", "institution": "Samsung Electronics"}, {"id": 185080, "fullname": "Jimin Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/185080?format=json", "institution": "Samsung Electronics"}, {"id": 185081, "fullname": "Minwook Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/185081?format=json", "institution": "Samsung Electronics"}, {"id": 185082, "fullname": "Cheol-Hui Min", "url": "http://cvpr.thecvf.com/api/miniconf/users/185082?format=json", "institution": "Samsung"}, {"id": 185083, "fullname": "YOUNGPIL JEON", "url": "http://cvpr.thecvf.com/api/miniconf/users/185083?format=json", "institution": "Samsung"}, {"id": 185084, "fullname": "Minyong Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185084?format=json", "institution": "Samsung Electronics Co., Ltd."}], "abstract": "Depth in the real world is rarely singular. Transmissive materials create layered ambiguities that confound conventional perception systems. Existing models remain passive, attempting to estimate static depth maps anchored to the nearest surface, while humans actively shift focus to perceive a desired depth. We introduce DepthFocus, a steerable Vision Transformer that redefines stereo depth estimation as intent-driven control. Conditioned on a scalar depth preference, the model dynamically adapts its computation to focus on the intended depth, enabling selective perception within complex scenes. The training primarily leverages our newly constructed 500k multi-layered synthetic dataset, designed to capture diverse see-through effects. DepthFocus not only achieves state-of-the-art performance on conventional single-depth benchmarks like BOOSTER, a dataset notably rich in transparent and reflective objects, but also quantitatively demonstrates intent-aligned estimation on our newly proposed real and synthetic multi-depth datasets. Moreover, it exhibits strong generalization capabilities on unseen see-through scenes, underscoring its robustness as a significant step toward active and human-like 3D perception.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36449", "url": null, "sourceid": 40693, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40043, "uid": "e7b437d65b3073c6804fbeb2c9e1d16c", "name": "EthoCLIP: Ontology-Enhanced Video-Language Pretraining for Animal Behavior Understanding", "authors": [{"id": 136473, "fullname": "Yinuo Jing", "url": "http://cvpr.thecvf.com/api/miniconf/users/136473?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 193368, "fullname": "Jinyan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193368?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 193369, "fullname": "Zixi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193369?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 90447, "fullname": "Kongming Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90447?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 75928, "fullname": "Xiatian Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75928?format=json", "institution": "University of Surrey"}, {"id": 90236, "fullname": "Zhanyu Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/90236?format=json", "institution": "Beijing University of Post and Telecommunication"}], "abstract": "Vision-language models (VLMs) have achieved remarkable success across numerous domains, yet they lag significantly in animal behavior understanding due to severe data scarcity. Annotated animal behavior videos are prohibitively expensive and time-consuming to collect, requiring domain expertise and controlled observation conditions. To address this challenge, we leverage structured domain knowledge as an inductive bias from the Neuro Behavior Ontology (NBO), which provides professional annotations, hierarchical behavior structures, and comprehensive semantic coverage. We present EthoCLIP, an ontology-enhanced vision\u2013language contrastive learning framework that explicitly embeds ontology semantics through an ontology-aware graph module to capture hierarchical relationships among behaviors and learn structured semantic dependencies. Incorporating ontological information reduces the burden of learning purely from data, thereby alleviating requirements for large-scale datasets. To enhance EthoCLIP training, we construct AnimalBand, an NBO-consistent dataset integrating 74,671 videos across multiple species and behaviors with semantic standardization and extended knowledge coverage. Extensive experiments validate both our method and dataset. Results demonstrate that EthoCLIP pretrained on AnimalBand substantially improves behavior recognition accuracy and transfer learning performance across diverse benchmarks, confirming that ontology-driven semantic enrichment effectively addresses data scarcity in animal behavior understanding.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40043", "url": null, "sourceid": 33965, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39718, "uid": "b440cc1f36911a5256b3dc30a1d599ba", "name": "Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition", "authors": [{"id": 183492, "fullname": "Minxue Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183492?format=json", "institution": "Google"}, {"id": 192716, "fullname": "Yangyang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192716?format=json", "institution": "Accenture, Center for Advanced AI"}, {"id": 192717, "fullname": "Aolin Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/192717?format=json", "institution": "Accenture"}, {"id": 192718, "fullname": "MAZIYAR BARAN POUYAN", "url": "http://cvpr.thecvf.com/api/miniconf/users/192718?format=json", "institution": "Accenture"}, {"id": 192719, "fullname": "Taha Belkhouja", "url": "http://cvpr.thecvf.com/api/miniconf/users/192719?format=json", "institution": "Accenture"}, {"id": 192720, "fullname": "Yujia Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192720?format=json", "institution": "Accenture"}], "abstract": "Recognizing implicit visual and textual patterns is essential in many real-world applications of modern AI. However, tackling long-tail pattern recognition tasks remains challenging for current pre-trained foundation models such as LLMs and VLMs. While finetuning pre-trained models can improve accuracy in recognizing implicit patterns, it is usually infeasible due to a lack of training data and high computational overhead. In this paper, we propose ADAMAB, an efficient embedding calibration framework for few-shot pattern recognition. To maximally reduce the computational costs, ADAMAB trains embedder-agnostic light-weight calibrators on top of fixed embedding models without accessing their parameters. To mitigate the need for large-scale training data, we introduce an adaptive data augmentation strategy based on the Multi-Armed Bandit (MAB) mechanism. With a modified upper confidence bound algorithm, ADAMAB diminishes the gradient shifting and offers theoretically guaranteed convergence in few-shot training. Our multi-modal experiments justify the superior performance of ADAMAB, with up to 40\\% accuracy improvement when training with less than 5 initial data samples of each class.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39718", "url": null, "sourceid": 40126, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39844, "uid": "75c489760ba27da0e18b0577b21ec30c", "name": "Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning", "authors": [{"id": 192960, "fullname": "Haoji Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192960?format=json", "institution": "Tsinghua University; ByteDance Inc."}, {"id": 71992, "fullname": "Xin Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/71992?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 192961, "fullname": "Jiawen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192961?format=json", "institution": ""}, {"id": 192962, "fullname": "Chixiang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/192962?format=json", "institution": null}, {"id": 127129, "fullname": "Sule Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/127129?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 192963, "fullname": "Chubin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192963?format=json", "institution": "Tsinghua University"}, {"id": 70477, "fullname": "bowen zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70477?format=json", "institution": "Bytedance"}, {"id": 183611, "fullname": "zhichao zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/183611?format=json", "institution": "ByteDance Inc."}, {"id": 87274, "fullname": "Dongliang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/87274?format=json", "institution": "ByteDance Inc. "}, {"id": 69348, "fullname": "Yansong Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/69348?format=json", "institution": "Tsinghua University"}], "abstract": "The video reasoning ability of multimodal large language models (MLLMs) is crucial for downstream tasks like video question answering and temporal grounding. While recent approaches have explored text-based chain-of-thought (CoT) reasoning for MLLMs, these methods often suffer from limited cross-modal interaction and increased hallucination, especially with longer videos or reasoning chains. To address these challenges, we propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework. With a visual toolbox, the model can densely sample new video frames on demand and generate multimodal CoT for precise long video reasoning. We observe that temporal grounding and question answering are mutually beneficial for video understanding tasks. Therefore, we construct two high-quality multi-task video reasoning datasets MTVR-CoT-72k for supervised fine-tuning and MTVR-RL-110k for reinforcement learning. Moreover, we propose a Difficulty-aware Group Relative Policy Optimization algorithm (DGRPO) to mitigate difficulty imbalance in multi-task reinforcement learning. Extensive experiments on eleven challenging video understanding benchmarks demonstrate the advanced reasoning ability of VITAL, outperforming existing methods in video question answering and temporal grounding tasks, especially in long video scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39844", "url": null, "sourceid": 45253, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37800, "uid": "f843b38155cd45cd93df8b66feaf3492", "name": "EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer", "authors": [{"id": 184271, "fullname": "Munish Monga", "url": "http://cvpr.thecvf.com/api/miniconf/users/184271?format=json", "institution": "Sony Research India (Sony AI)"}, {"id": 92699, "fullname": "Vishal Chudasama", "url": "http://cvpr.thecvf.com/api/miniconf/users/92699?format=json", "institution": "Sony Research India   Sony AI"}, {"id": 156395, "fullname": "Pankaj Wasnik", "url": "http://cvpr.thecvf.com/api/miniconf/users/156395?format=json", "institution": "Sony Research India"}, {"id": 75442, "fullname": "C.V. Jawahar", "url": "http://cvpr.thecvf.com/api/miniconf/users/75442?format=json", "institution": "IIIT-Hyderabad"}], "abstract": "Real-world object detection must operate in evolving environments where new classes emerge, domains shift, and unseen objects must be identified as unknown\u2014all without accessing prior data. We introduce Evolving World Object Detection (EWOD), a paradigm coupling incremental learning, domain adaptation, and unknown detection under exemplar-free constraints. To tackle EWOD, we propose EW-DETR framework that augments DETR-based detectors with three synergistic modules: Incremental LoRA Adapters for exemplar-free incremental learning under evolving domains; a Query-Norm Objectness Adapter that decouples objectness-aware features from DETR decoder queries; and Entropy-Aware Unknown Mixing for calibrated unknown detection. This framework generalises across DETR-based detectors, enabling state-of-the-art RF-DETR to operate effectively in evolving-world settings. We also introduce FOGS (Forgetting, Openness, Generalisation Score) to holistically evaluate performance across these dimensions. Extensive experiments on Pascal Series and Diverse Weather benchmarks show EW-DETR outperforms other methods, improving FOGS by 57.24%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37800", "url": null, "sourceid": 35277, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36187, "uid": "42c089a7e6899fe52a40bbbf7148e4e2", "name": "Progressive Neural Architecture Generation", "authors": [{"id": 184374, "fullname": "Caiyang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184374?format=json", "institution": "Wenzhou University"}, {"id": 184375, "fullname": "Chen Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184375?format=json", "institution": "National University of Singapore"}, {"id": 184376, "fullname": "Yun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184376?format=json", "institution": "Sichuan University"}, {"id": 184377, "fullname": "Chenwei Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184377?format=json", "institution": "Sichuan University"}, {"id": 184378, "fullname": "Wei Ju", "url": "http://cvpr.thecvf.com/api/miniconf/users/184378?format=json", "institution": "Sichuan University"}, {"id": 86144, "fullname": "Jiancheng Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/86144?format=json", "institution": "Sichuan University"}], "abstract": "As a representative technique in neural architecture search, neural architecture generation aims to construct high-performance architectures for a given task directly. It is poised to replace the inefficient random exploration components of some search strategies, such as the acquisition strategies in Bayesian optimization. Despite significant research, current architecture generation techniques face problems such as low generation efficiency and insufficient constraints, leading to invalidly generated architectures. To this end, we propose Progressive Neural Architecture Generation (PNAG), which constructs architectures incrementally through coarse-to-fine evolution, enhancing generation efficiency, and incorporates step-wise refinements to ensure the validity of the generated architecture. To achieve this, PNGA involves two modules, multi-scale sub-architecture quantization (MSQ) and step-wise consistency constraint (SCC). Specifically, MSQ constructs sub-architectures using quantization decoding and progressively expands them, transitioning from simple to complex forms. This operation bypasses network inference to enhance efficiency.Complementing MSQ, SCC, implemented through a tailored regularization mechanism, introduces penalties for deviations during sub-architecture generation, guiding the process towards valid target architectures. As such, PNAG establishes a clear generation path, laying the groundwork for generating suitable architectures in downstream tasks. Extensive experiments demonstrate that PNAG not only generates superior architectures for various downstream tasks (+8.43\\%/+5.07\\%, on average) but also significantly improves generation efficiency, reducing the architecture generation time by 1300$\\times$. Furthermore, PNAG demonstrates strong extensibility by successfully generating Transformer-based architectures.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36187", "url": null, "sourceid": 37214, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38040, "uid": "5d8508487ea46d5235bb6d5431c3c756", "name": "Momentum Memory for Knowledge Distillation in Computational Pathology", "authors": [{"id": 183464, "fullname": "yongxin guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/183464?format=json", "institution": "wake forest university"}, {"id": 188900, "fullname": "Hao Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188900?format=json", "institution": "Wake Forest University School of Medicine"}, {"id": 178505, "fullname": "Onur Koyun", "url": "http://cvpr.thecvf.com/api/miniconf/users/178505?format=json", "institution": "Wake Forest University"}, {"id": 188901, "fullname": "Zhengjie Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188901?format=json", "institution": "Wake Forest University"}, {"id": 188902, "fullname": "Muhammet Demir", "url": "http://cvpr.thecvf.com/api/miniconf/users/188902?format=json", "institution": "Wake Forest Baptist Health"}, {"id": 188903, "fullname": "Metin Gurcan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188903?format=json", "institution": "Wake Forest Baptist Health"}], "abstract": "Multimodal learning that integrates genomics and histopathology has shown strong potential in cancer diagnosis, yet its clinical translation is hindered by the limited availability of paired histology\u2013genomics data. Knowledge distillation (KD) offers a practical solution by transferring genomic supervision into histopathology models, enabling accurate inference using histology alone. However, existing KD methods rely on batch-local alignment, which introduces instability due to limited within-batch comparisons and ultimately degrades performance.To address these limitations, we propose Momentum Memory Knowledge Distillation (MoMKD), a cross-modal distillation framework driven by a momentum-updated memory. This memory aggregates genomic and histopathology information across batches, effectively enlarging the supervisory context available to each mini-batch. Furthermore, we decouple the gradients of the genomics and histology branches, preventing genomic signals from dominating histology feature learning during training and eliminating the modality-gap issue at inference time.Extensive experiments on the TCGA-BRCA benchmark (HER2, PR, and ODX classification tasks) and an independent in-house testing dataset demonstrate that MoMKD consistently outperforms state-of-the-art MIL and multimodal KD baselines, delivering strong performance and generalization under histology-only inference. Overall, MoMKD establishes a robust and generalizable knowledge distillation paradigm for computational pathology.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38040", "url": null, "sourceid": 37606, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37671, "uid": "7455352c702492ca6c1490cb27e257a6", "name": "OctoNav: Towards Generalist Embodied Navigation", "authors": [{"id": 89790, "fullname": "Chen Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/89790?format=json", "institution": "Beihang University"}, {"id": 187982, "fullname": "Liankai Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/187982?format=json", "institution": "Beihang University"}, {"id": 89813, "fullname": "Xingyu Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/89813?format=json", "institution": "Beihang University"}, {"id": 76936, "fullname": "Jiazhao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76936?format=json", "institution": "Peking University"}, {"id": 187983, "fullname": "Yue Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187983?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 187984, "fullname": "Annan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187984?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 74064, "fullname": "He Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/74064?format=json", "institution": "Galbot"}, {"id": 75839, "fullname": "Si Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75839?format=json", "institution": "Beihang University"}], "abstract": "Embodied navigation stands as a foundation pillar within the pursuit of embodied intelligence. However, previous navigation research is divided into different tasks/capabilities, e.g., ObjNav, ImgNav and VLN, where they differ in task settings/objectives and modalities, making datasets and methods designed individually. In this work, we take steps toward generalist navigation, which can follow free-form instructions that include arbitrary compounds of modality and capability.To achieve this, we propose a large-scale benchmark and corresponding method, termed OctoNav-Bench and OctoNav-R1. Specifically, OctoNav-Bench is constructed via a designed automatic annotation pipeline. We thoroughly craft instruction-trajectory pairs, where instructions are diverse in free-form with arbitrary modality and capability. Also, we construct a Think-Before-Action (TBA-CoT) dataset within OctoNav-Bench to provide the thinking process behind actions.For OctoNav-R1, we build it upon MLLMs and adapt it to a VLA-type model, which can produce low-level actions solely based on 2D visual observations. Moreover, we design a Hybrid Training Paradigm (HTP) that consists of three stages, i.e., Action-/TBA-SFT, Nav-GRPO, and Online RL stages. Each stage contains designed learning policies and rewards. Specifically, inspired by the OpenAI-o1 and DeepSeek-R1, which show impressive reasoning ability via thinking-before-answer, we design TBA-SFT and Nav-GRPO to achieve thinking-before-action for embodied navigation, improving model's reasoning ability toward generalists.TBA-SFT utilizes the TBA-CoT dataset to fine-tune the model, and then we leverage Nav-GRPO to improve its thinking ability. Finally, OctoNav-R1 shows superior performance compared with the previous methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37671", "url": null, "sourceid": 44945, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40309, "uid": "a508812e5f39db30c00a9baf08b5552c", "name": "Spectrum from Defocus: Fast Spectral Imaging with Chromatic Focal Stack", "authors": [{"id": 148622, "fullname": "M. Kerem Aydin", "url": "http://cvpr.thecvf.com/api/miniconf/users/148622?format=json", "institution": "Northwestern Univeristy"}, {"id": 164854, "fullname": "Yi-Chun Hung", "url": "http://cvpr.thecvf.com/api/miniconf/users/164854?format=json", "institution": "Northwestern University"}, {"id": 187991, "fullname": "Jaclyn Pytlarz", "url": "http://cvpr.thecvf.com/api/miniconf/users/187991?format=json", "institution": "Dolby Laboratories Inc"}, {"id": 126346, "fullname": "Qi Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/126346?format=json", "institution": "Purdue University"}, {"id": 88885, "fullname": "Emma Alexander", "url": "http://cvpr.thecvf.com/api/miniconf/users/88885?format=json", "institution": "Northwestern University"}], "abstract": "Hyperspectral cameras rely on spectral filters, dispersive optics, or coded apertures, which reduce light throughput and increase hardware complexity. These systems face harsh trade-offs between spatial, spectral, and temporal resolution in inherently low-photon conditions. Computational imaging systems break through these trade-offs with compressive sensing, but have typically required complex optics and/or extensive computation. We present Spectrum from Defocus (SfD), a chromatic focal sweep method that achieves state-of-the-art hyperspectral imaging using only two off-the-shelf lenses, a grayscale sensor, and less than one second of reconstruction time. By capturing a chromatically-aberrated focal stack that preserves nearly all incident light, and reconstructing it with a fast physics-based iterative algorithm, SfD delivers sharp, accurate hyperspectral images. The combination of photon efficiency, optical simplicity, and physical interpretability makes SfD a promising solution for fast, compact, and interpretable hyperspectral imaging.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40309", "url": null, "sourceid": -39724, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37676?format=json"], "related_events_ids": [37676]}, {"id": 37682, "uid": "5152b6ca192c7c14bc740c30954cadb9", "name": "SPEGC: Continual Test-Time Adaptation via Semantic-Prompt-Enhanced Graph Clustering for Medical Image Segmentation", "authors": [{"id": 181950, "fullname": "Xiaogang Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/181950?format=json", "institution": "Shaanxi University of Science and Technology"}, {"id": 188005, "fullname": "Jiawei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188005?format=json", "institution": "Shaanxi University of Science &amp; Technology"}, {"id": 188006, "fullname": "Tongfei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188006?format=json", "institution": "Shaanxi University of Science and Technology"}, {"id": 188007, "fullname": "Tao Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/188007?format=json", "institution": "Shaanxi University of Science and Technology"}, {"id": 188008, "fullname": "Yingbo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188008?format=json", "institution": "Shaanxi University of Science and Technology"}], "abstract": "In medical image segmentation tasks, the domain gap caused by the difference in data collection between training and testing data seriously hinders the deployment of pre-trained models in clinical practice. Continual Test-Time Adaptation (CTTA) aims to enable pre-trained models to adapt to continuously changing unlabeled domains, providing an effective approach to solving this problem. However, existing CTTA methods often rely on unreliable supervisory signals, igniting a self-reinforcing cycle of error accumulation that culminates in catastrophic performance degradation. To overcome these challenges, we propose a CTTA via Semantic-Prompt-Enhanced Graph Clustering (SPEGC) for medical image segmentation. First, we design a semantic prompt feature enhancement mechanism that utilizes decoupled commonality and heterogeneity prompt pools to inject global contextual information into local features, alleviating their susceptibility to noise interference under domain shift. Second, based on these enhanced features, we design a differentiable graph clustering solver. This solver reframes graph partitioning as an optimal transport problem, allowing it to distill a raw similarity matrix into a refined and high-order structural representation in an end-to-end manner. Finally, this robust structural representation is used to guide model adaptation, ensuring predictions are consistent at a cluster-level and dynamically adjusting decision boundaries. Extensive experiments demonstrate that SPEGC outperforms other state-of-the-art CTTA methods on two medical image segmentation benchmarks. The source code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37682", "url": null, "sourceid": 42489, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37522, "uid": "0079576419ccacff015989ce74616d69", "name": "HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action models", "authors": [{"id": 187631, "fullname": "Minghui Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/187631?format=json", "institution": "Westlake University"}, {"id": 187632, "fullname": "Pengxiang Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/187632?format=json", "institution": "Zhejiang University; Westlake University"}, {"id": 106322, "fullname": "Shu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/106322?format=json", "institution": null}, {"id": 187633, "fullname": "Zifeng Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187633?format=json", "institution": "Westlake University; Zhejiang University"}, {"id": 187634, "fullname": "Yang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187634?format=json", "institution": "Westlake University"}, {"id": 187635, "fullname": "Xinyang Tong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187635?format=json", "institution": "Westlake University"}, {"id": 158747, "fullname": "Wenxuan Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/158747?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 187636, "fullname": "Shangke Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187636?format=json", "institution": "Nanjing university"}, {"id": 107448, "fullname": "Siteng Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107448?format=json", "institution": "Zhejiang University &amp; Westlake University"}, {"id": 90158, "fullname": "Donglin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90158?format=json", "institution": "Westlake University"}], "abstract": "Vision-Language-Action (VLA) models have recently enabled robotic manipulation by grounding visual and linguistic cues into actions. However, most VLAs assume the Markov property, relying only on the current observation and thus suffering from temporal myopia that degrades long-horizon coherence. Existing attempts to incorporate history by stacking frames are computationally expensive and redundant. We argue that motion provides a more compact and informative representation of temporal context, capturing inter-state dynamics while filtering static noise. Building on this idea, we propose HiF-VLA (Hindsight, Insight, and Foresight for VLAs), a unified framework that leverages motion for bidirectional temporal reasoning. HiF-VLA encodes past dynamics through hindsight priors, anticipates future motion via foresight reasoning, and integrates both through a hindsight-modulated joint expert to enable \u201cthink-while-acting\u201d control. Extensive experiments show that HiF-VLA improves performance from 94.0\\% to 96.4\\% on LIBERO-Long and 4.10 to 4.35 on CALVIN ABC-D, surpassing strong baselines. Furthermore, HiF-VLA achieves substantial improvements in real-world long-horizon manipulation tasks, demonstrating its broad effectiveness in real-world long-horizon settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37522", "url": null, "sourceid": 32141, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37676, "uid": "a508812e5f39db30c00a9baf08b5552c", "name": "Spectrum from Defocus: Fast Spectral Imaging with Chromatic Focal Stack", "authors": [{"id": 148622, "fullname": "M. Kerem Aydin", "url": "http://cvpr.thecvf.com/api/miniconf/users/148622?format=json", "institution": "Northwestern Univeristy"}, {"id": 164854, "fullname": "Yi-Chun Hung", "url": "http://cvpr.thecvf.com/api/miniconf/users/164854?format=json", "institution": "Northwestern University"}, {"id": 187991, "fullname": "Jaclyn Pytlarz", "url": "http://cvpr.thecvf.com/api/miniconf/users/187991?format=json", "institution": "Dolby Laboratories Inc"}, {"id": 126346, "fullname": "Qi Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/126346?format=json", "institution": "Purdue University"}, {"id": 88885, "fullname": "Emma Alexander", "url": "http://cvpr.thecvf.com/api/miniconf/users/88885?format=json", "institution": "Northwestern University"}], "abstract": "Hyperspectral cameras rely on spectral filters, dispersive optics, or coded apertures, which reduce light throughput and increase hardware complexity. These systems face harsh trade-offs between spatial, spectral, and temporal resolution in inherently low-photon conditions. Computational imaging systems break through these trade-offs with compressive sensing, but have typically required complex optics and/or extensive computation. We present Spectrum from Defocus (SfD), a chromatic focal sweep method that achieves state-of-the-art hyperspectral imaging using only two off-the-shelf lenses, a grayscale sensor, and less than one second of reconstruction time. By capturing a chromatically-aberrated focal stack that preserves nearly all incident light, and reconstructing it with a fast physics-based iterative algorithm, SfD delivers sharp, accurate hyperspectral images. The combination of photon efficiency, optical simplicity, and physical interpretability makes SfD a promising solution for fast, compact, and interpretable hyperspectral imaging.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37676", "url": null, "sourceid": 39724, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40309?format=json"], "related_events_ids": [40309]}, {"id": 36899, "uid": "499d9afe5b4f88f2585fbff224d8c0f5", "name": "LOREAL: Mitigating Low-Resolution Challenges in Vision-Language Models with Attribute-driven  Prompt Self-Distillation", "authors": [{"id": 180292, "fullname": "xucong wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180292?format=json", "institution": "University of Science and Technology of China"}, {"id": 186147, "fullname": "Pengkun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186147?format=json", "institution": "University of Science and Technology of China"}, {"id": 186148, "fullname": "Zhe Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186148?format=json", "institution": "University of Science and Technology of China"}, {"id": 186149, "fullname": "Liheng Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186149?format=json", "institution": "University of Science and Technology of China"}, {"id": 186150, "fullname": "Rui Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186150?format=json", "institution": "University of Science and Technology of China"}, {"id": 186151, "fullname": "Yang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186151?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Prompt Learning (PL) has emerged as a parameter-efficient technique for adapting Vision-Language Models (VLMs) to downstream tasks. However, almost all existing PL methods are primarily designed and evaluated on well-curated datasets, overlooking a critical post-deployment phenomenon, i.e., the intrinsic connection between input resolution and storage-memory consumption. Specifically, to satisfy the stringent storage-memory constraints on edge devices, models are often limited to low-resolution inputs (e.g., $\\le$ 224$\\times$224 for CLIP-ViT/B-16) and generate fewer tokens (with the position embedding resized), which poses a unique challenge in performance robustness. To tackle this issue, we propose LOREAL, an efficient prompt self-distillation framework that learns resolution-invariant representations by excavating attribute semantics. At the heart of LOREAL is a dual-student architecture, i.e., two student models fed with inputs at different resolutions synergistically learn from each other. Building upon this, we contextualize the students' prompt with resolution-invariant attributes queried from the LLM, then leverage cross-modality meta-nets to generate attribute semantics. These meta-nets are  bridged between the different encoders of two students, wherein we introduce Low-Level Distillation (LLD) and High-Level Distillation (HLD) to facilitate the learning of more cross-resolution representations.  Extensive experiments show that LOREAL significantly improves VLMs' performance and robustness under varied resolution settings, underscoring significant practical utilities.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36899", "url": null, "sourceid": 44814, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65704, "file": "/media/PosterPDFs/CVPR%202026/36899.png", "modified": "2026-04-17T07:23:54.851729-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": true, "sortkey": 0, "is_live_content": false, "detailed_kind": "", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65705, "file": "/media/PosterPDFs/CVPR%202026/36899-thumb.png", "modified": "2026-04-17T07:23:55.028168-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": false, "sortkey": 0, "is_live_content": false, "detailed_kind": "thumb", "generated_from": null, "resourcetype": "EventmediaImageFile"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36210, "uid": "13019fc8997b04326425e0c525115724", "name": "One Layer\u2019s Trash is Another Layer\u2019s Treasure: Adaptive Layer-wise Visual Token Selection in LVLMs", "authors": [{"id": 184456, "fullname": "Yongru Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184456?format=json", "institution": "Hikvision Research Institute"}, {"id": 184457, "fullname": "Kai Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184457?format=json", "institution": "Hikvision Research Institute; East China Normal University"}, {"id": 184458, "fullname": "Zeliang Zong", "url": "http://cvpr.thecvf.com/api/miniconf/users/184458?format=json", "institution": "Hikvision Research Institute"}, {"id": 184459, "fullname": "Yuchen Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184459?format=json", "institution": "Peking University"}, {"id": 84798, "fullname": "Wenming Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/84798?format=json", "institution": "Hikvision Research Institute"}, {"id": 84808, "fullname": "Ye Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/84808?format=json", "institution": "Zhejiang University"}, {"id": 184460, "fullname": "Jilin Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184460?format=json", "institution": "East China Normal University; Aalborg University"}], "abstract": "Large Vision-Language Models (LVLMs) have achieved remarkable success across diverse multimodal tasks, yet their practical deployment remains constrained by the computational burden arising from lengthy visual tokens. While visual token pruning has emerged as a promising solution, existing methods suffer from a fundamental limitation: once tokens are pruned at a specific layer, they become inaccessible to all subsequent layers, leading to premature information loss that can compromise model performance. Through empirical studies, we observe that different layers exhibit distinct visual region focus, indicating a varying optimal token subset across layers. Motivated by this insight, we propose Adaptive Layer-wise Visual Token Selection (ALVTS), a novel framework that breaks away from the conventional static token pruning paradigm. ALVTS incorporates a lightweight token selector to identify and route important tokens for further processing, while allowing less important tokens to skip the layer, thus minimizing computational redundancy. These two streams of tokens are seamlessly reintegrated before being fed into subsequent layers, facilitating adaptive compression across the entire model. Grounded in our importance consistency constrained low-rank approximation, the proposed token selection module closely emulates the full attention mechanism, effectively capturing its essential patterns without requiring model retraining. Extensive experiments on LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL validate the effectiveness of our method. With an 89% token compression ratio, ALVTS retains 96.7% of the original model's accuracy, achieving a superior efficiency-accuracy trade-off for LVLM inference.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36210", "url": null, "sourceid": 31326, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38974, "uid": "d74e8945a4b4816d630391f86530bf6a", "name": "CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning", "authors": [{"id": 172261, "fullname": "Zhijiang Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/172261?format=json", "institution": "University of Chinese Academy of Science"}, {"id": 191099, "fullname": "Linhua Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191099?format=json", "institution": null}, {"id": 183917, "fullname": "JIAXIN QI", "url": "http://cvpr.thecvf.com/api/miniconf/users/183917?format=json", "institution": "Computer Network Information Center"}, {"id": 191100, "fullname": "Weihao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191100?format=json", "institution": ""}, {"id": 191101, "fullname": "Peng Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191101?format=json", "institution": "Shopee"}, {"id": 191102, "fullname": "Anxiang Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191102?format=json", "institution": "Shopee"}, {"id": 190753, "fullname": "Jianqiang Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190753?format=json", "institution": "Chinese Academy of Sciences"}], "abstract": "Image captioning remains a fundamental task for vision\u2013language understanding, yet ground-truth supervision still relies predominantly on human-annotated references.Because human annotations reflect subjective preferences and expertise, ground-truth captions are often incomplete or even incorrect, which in turn limits caption models.We argue that caption quality should be assessed by two objective aspects: completeness (does the caption cover all salient visual facts?) and correctness (are the descriptions true with respect to the image?).To this end, we introduce CCCaption: a dual-reward reinforcement learning framework with a dedicated fine-tuning corpus that explicitly optimizes these properties to generate \\textbf{C}omplete and \\textbf{C}orrect \\textbf{Captions}.For completeness, we use diverse LVLMs to disentangle the image into a set of visual queries, and reward captions that answer more of these queries, with a dynamic query sampling strategy to improve training efficiency.For correctness, we penalize captions that contain hallucinations by validating the authenticity of sub-caption queries, which are derived from the caption decomposition.Our symmetric dual-reward optimization jointly maximizes completeness and correctness, guiding models toward captions that better satisfy these objective criteria. Extensive experiments across standard captioning benchmarks show consistent improvements, offering a principled path to training caption models beyond human-annotation imitation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38974", "url": null, "sourceid": 34206, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39678, "uid": "a58bf865b185e0e3f665473bf8f3ca6d", "name": "Beyond Single Solution: Multi-Hypothesis Deep Unfolding Network for Image Compressive Sensing", "authors": [{"id": 192628, "fullname": "Wenxue Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/192628?format=json", "institution": "Harbin Institute of Technology"}, {"id": 192629, "fullname": "Hualin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192629?format=json", "institution": "Harbin Institute of Technology"}, {"id": 192630, "fullname": "Yuhang Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192630?format=json", "institution": ""}, {"id": 192631, "fullname": "Yifu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192631?format=json", "institution": "Harbin Institute of Technology"}, {"id": 128724, "fullname": "Xiaopeng Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/128724?format=json", "institution": "Harbin Institute of Technology"}, {"id": 72247, "fullname": "Debin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/72247?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Recent deep unfolding networks (DUNs) have advanced Compressive Sensing (CS) by effectively integrating iterative optimization with deep learning architectures. However, most CS approaches predominantly confine their inference to a single solution space, neglecting the inherent ill-posedness of CS problems that intrinsically permits multiple plausible candidate hypotheses. In this paper, a novel Multi-Hypothesis Collaborative Deep Unfolding CS Network (MHC-DUN) is proposed, which explicitly models and leverages multiple hypotheses by jointly optimizing across diverse solution spaces. Specifically, following the Proximal Gradient Descent algorithm, MHC-DUN jointly performs gradient descent and proximal mapping within this multi-hypothesis paradigm. i) For gradient descent, a well-designed AlphaNet is introduced to dynamically predict spatially varying step sizes for all hypotheses, enabling collaborative gradient updates across multiple solutions. ii) For proximal operator, a sophisticated multi-hypothesis collaborative proximal mapping module is designed, which leverages both intra-hypothesis and inter-hypothesis correlation priors to jointly refine multiple solutions. To enable end-to-end training, a novel composite loss function is designed, which balances measurement fidelity, hypothesis diversity, and reconstruction accuracy, encouraging exploration of complementary solutions while maintaining reconstruction fidelity. Experimental results reveal that the proposed CS method outperforms existing CS networks.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39678", "url": null, "sourceid": 39535, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36857, "uid": "b06ee722e5efe10c6852d6dc07b84616", "name": "Language Models Can Explain Visual Features via Steering", "authors": [{"id": 186031, "fullname": "Javier Ferrando", "url": "http://cvpr.thecvf.com/api/miniconf/users/186031?format=json", "institution": "Amazon"}, {"id": 186032, "fullname": "Enrique Lopez-Cuena", "url": "http://cvpr.thecvf.com/api/miniconf/users/186032?format=json", "institution": "Barcelona Supercomputing Center"}, {"id": 178045, "fullname": "Pablo Agustin Martin-Torres", "url": "http://cvpr.thecvf.com/api/miniconf/users/178045?format=json", "institution": "Barcelona Supercomputing Center"}, {"id": 186033, "fullname": "Daniel Hinjos", "url": "http://cvpr.thecvf.com/api/miniconf/users/186033?format=json", "institution": "Barcelona Supercomputing Center"}, {"id": 94098, "fullname": "Anna Arias Duart", "url": "http://cvpr.thecvf.com/api/miniconf/users/94098?format=json", "institution": "Barcelona Supercomputing Center"}, {"id": 186034, "fullname": "Dario Garcia-Gasulla", "url": "http://cvpr.thecvf.com/api/miniconf/users/186034?format=json", "institution": "Barcelona Supercomputing Center"}], "abstract": "Sparse Autoencoders uncover thousands of features in vision models, yet explaining these features without requiring human intervention remains an open challenge. While previous work has proposed generating correlation-based explanations based on top activating input examples, we present a fundamentally different alternative based on causal interventions. We leverage the structure of Vision-Language Models and steer individual SAE features in the vision encoder after providing an empty image. Then, we prompt the language model to explain what it \"sees\", effectively eliciting the visual concept represented by each feature. Results show that Steering offers an scalable alternative that complements traditional approaches based on input examples, serving as a new axis for automated interpretability in vision models. Moreover, the quality of explanations improves consistently with the scale of the language model, highlighting our method as a promising direction for future research. Finally, we propose Steering-informed Top-k, a hybrid approach that combines the strengths of causal interventions and input-based approaches to achieve state-of-the-art explanation quality without additional computational cost.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36857", "url": null, "sourceid": 35571, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36709, "uid": "95b4738e0867cc64385b42e3cb13d3b6", "name": "DTG-Restore: Training-Free Diffusion Refinement for Generative Video Super-Resolution", "authors": [{"id": 131338, "fullname": "Hidir Yesiltepe", "url": "http://cvpr.thecvf.com/api/miniconf/users/131338?format=json", "institution": "Virginia Polytechnic Institute and State University"}, {"id": 185697, "fullname": "Koutilya PNVR", "url": "http://cvpr.thecvf.com/api/miniconf/users/185697?format=json", "institution": "Adobe Systems"}, {"id": 185698, "fullname": "Gaurav Suresh Pathak", "url": "http://cvpr.thecvf.com/api/miniconf/users/185698?format=json", "institution": "Adobe, Inc"}, {"id": 94179, "fullname": "Navaneeth Bodla", "url": "http://cvpr.thecvf.com/api/miniconf/users/94179?format=json", "institution": "University of Maryland, College Park"}, {"id": 136726, "fullname": "Bharat Singh", "url": "http://cvpr.thecvf.com/api/miniconf/users/136726?format=json", "institution": "Cruise LLC"}, {"id": 133116, "fullname": "Pinar Yanardag", "url": "http://cvpr.thecvf.com/api/miniconf/users/133116?format=json", "institution": "Virginia Tech"}, {"id": 185699, "fullname": "Jinrong Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/185699?format=json", "institution": "Adobe Systems"}], "abstract": "Recent progress in video diffusion models has enabled remarkable generative fidelity, yet leveraging these priors for restoration remains limited by the strong coupling between conditional and unconditional branches in standard classifier-free guidance. We introduce a training-free framework that enhances distorted and low-resolution videos by decoupling these signals in time. Our proposed Decoupled Time Guidance (DTG) evaluates the unconditional branch at a cleaner diffusion timestep, providing a lookahead prior that preserves geometry while suppressing replication of warped content. This temporal bias is annealed throughout sampling, allowing the model to transition from structure correction to detail refinement without retraining. Combined with any off-the-shelf restoration module in a plug-and-play manner, our approach improves perceptual coherence and restores plausible structure in AI-generated and real-world videos alike. To facilitate evaluation, we curate GenWarp480, a benchmark of 4000 distorted 480p videos synthesized from diverse text-to-video models. GenWarp480 focuses on characteristic generative degradations such as warped faces, body misalignments, and spatial artifacts, providing a purpose-built testbed for assessing robustness to generative errors. Extensive experiments demonstrate that our method achieves significant improvements in structural fidelity and temporal stability without any model training.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36709", "url": null, "sourceid": 40371, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40219, "uid": "a9a82b0b868b6252298fb3f261209f90", "name": "Gamba: Mamba-based graph convolutional network with dynamic graph topology learning for action recognition", "authors": [{"id": 182929, "fullname": "Rouyi Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/182929?format=json", "institution": "Shenzhen University; Shenzhen University"}, {"id": 181809, "fullname": "\u6f3e\u4e4b \u5434", "url": "http://cvpr.thecvf.com/api/miniconf/users/181809?format=json", "institution": "Shenzhen University"}, {"id": 146574, "fullname": "Jiajun Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/146574?format=json", "institution": "Shenzhen University"}, {"id": 193813, "fullname": "Can Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193813?format=json", "institution": "Shenzhen University"}, {"id": 193814, "fullname": "Feng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193814?format=json", "institution": "Shenzhen University"}, {"id": 193815, "fullname": "Zhihui Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/193815?format=json", "institution": "Shenzhen University"}, {"id": 76746, "fullname": "Linlin Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76746?format=json", "institution": "Shenzhen University"}], "abstract": "The graph convolutional network has been an important tool for skeleton-based action recognition. However, existing graph models predominantly utilize self-attention mechanisms to model feature correlations between the joints of each sample, which not only neglects dynamic relation dependencies in temporal dimension but also leads to redundant computation as well as to the difficulty in establishing a unified framework for joint relation representation. To address these problems, this paper develops a Mamba-based graph convolution network (Gamba) with dynamic graph topology learning. Specifically, in order to capture local motion patterns through aggregation of intra-class information, a classification-based Mamba module is developed to categorize motion joints into distinct types. To the best of our knowledge, this is the first work to assign motion joints with label information to facilitate correlation learning. To capture the underlying relation of the joints of different categories, the state space model is introduced to the proposed method to process enhanced temporal features, aiming at learning dynamic adjacency matrices for long-range dependencies of the joints across different categories. The proposed framework not only facilitates an adaptive focus on the spatio-temporal feature modeling, but also has less computation complexity than traditional self-attention-based approaches. Extensive experiments on the public NTU RGB+D 60/120 and NW-UCLA benchmark datasets demonstrate the superiority of the proposed model over state-of-the-art methods in recognition accuracy. The proposed framework provides new insights into effective and efficient skeleton-based action recognition and can be potentially applied to a variety of real-world applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40219", "url": null, "sourceid": 35640, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37580, "uid": "b46b855b9bcfd8b80b24ca033b10519e", "name": "Learning from Semantic Dictionaries: Discriminative Codebook Contrastive Learning for Unified Visual Representation and Generation", "authors": [{"id": 187766, "fullname": "Imanol Estepa", "url": "http://cvpr.thecvf.com/api/miniconf/users/187766?format=json", "institution": "Universitat de Barcelona"}, {"id": 140949, "fullname": "Jes\u00fas M Rodr\u00edguez-de-Vera", "url": "http://cvpr.thecvf.com/api/miniconf/users/140949?format=json", "institution": "Universitat de Barcelona"}, {"id": 98479, "fullname": "Bhalaji Nagarajan", "url": "http://cvpr.thecvf.com/api/miniconf/users/98479?format=json", "institution": "Universitat de Barcelona"}, {"id": 187767, "fullname": "Petia Radeva", "url": "http://cvpr.thecvf.com/api/miniconf/users/187767?format=json", "institution": "Universitat de Barcelona"}], "abstract": "Discriminative and generative vision models excel in their respective domains but remain semantically misaligned, hindering progress toward unified visual learning. We introduce LEASE (LEArning from SEmantic Dictionaries), a self-supervised framework that bridges this gap using a paired generative\u2013discriminative codebook design. LEASE operates entirely in a discrete token space produced through a one-time precomputation step, enabling efficient training without data augmentations, teacher models, or online tokenizers.LEASE integrates two complementary objectives: a masked token reconstruction loss that captures fine-grained generative detail, and a codebook contrast loss that aligns encoder features with discriminative semantics via adaptive centroid weighting. This dual supervision yields a unified latent space that supports both high-quality generation and strong representation learning.On ImageNet-1K, LEASE achieves state-of-the-art unified performance, outperforming prior VQGAN-based methods such as MAGE and Sorcen across linear probing (up to +1.7%), unconditional generation (-1.26 FID and +10.19 IS w.r.t MAGE), few-shot learning (+0.56% on average against Sorcen), transfer (+0.75% average improvement against MAGE and Sorcen), and robustness benchmarks (+5.86% and +4.25% average improvement against MAGE and Sorcen, respectively). It also competes favorably with domain-specialized contrastive and generative models while surpassing previous MIM methods. The unsupervised LEASE model can also be extended to conditional generation by building upon its learned representations, proving competitive with specialized baselines.Overall, LEASE provides an efficient and effective step toward general-purpose vision models that jointly understand and generate visual content. Code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37580", "url": null, "sourceid": 44965, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38428, "uid": "e6590d413b51f18d2556c4e620adc5f4", "name": "Meta-Learning In-Context Enables Training-Free Cross Subject Brain Decoding", "authors": [{"id": 181350, "fullname": "Mu Nan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181350?format=json", "institution": "University of Hong Kong"}, {"id": 189845, "fullname": "Muquan Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189845?format=json", "institution": "University of Hong Kong; Chinese University of Hong Kong"}, {"id": 180531, "fullname": "Weijian Mai", "url": "http://cvpr.thecvf.com/api/miniconf/users/180531?format=json", "institution": "The University of Hong Kong"}, {"id": 189846, "fullname": "Jacob S. Prince", "url": "http://cvpr.thecvf.com/api/miniconf/users/189846?format=json", "institution": "Harvard University"}, {"id": 189847, "fullname": "Hossein Adeli", "url": "http://cvpr.thecvf.com/api/miniconf/users/189847?format=json", "institution": "Columbia University"}, {"id": 187177, "fullname": "Rui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187177?format=json", "institution": "University of Hong Kong"}, {"id": 187176, "fullname": "Jiahang Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187176?format=json", "institution": "University of Hong Kong"}, {"id": 189848, "fullname": "Benjamin Becker", "url": "http://cvpr.thecvf.com/api/miniconf/users/189848?format=json", "institution": "University of Hong Kong"}, {"id": 189849, "fullname": "John Pyles", "url": "http://cvpr.thecvf.com/api/miniconf/users/189849?format=json", "institution": "University of Washington"}, {"id": 158394, "fullname": "Margaret Marie Henderson", "url": "http://cvpr.thecvf.com/api/miniconf/users/158394?format=json", "institution": "Carnegie Mellon University"}, {"id": 152969, "fullname": "Chunfeng Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/152969?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 189850, "fullname": "Nikolaus Kriegeskorte", "url": "http://cvpr.thecvf.com/api/miniconf/users/189850?format=json", "institution": "Columbia University"}, {"id": 158395, "fullname": "Michael J. Tarr", "url": "http://cvpr.thecvf.com/api/miniconf/users/158395?format=json", "institution": "Carnegie Mellon University"}, {"id": 189851, "fullname": "Xiaoqing Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189851?format=json", "institution": "University of Hong Kong"}, {"id": 158393, "fullname": "Andrew Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/158393?format=json", "institution": "University of Hong Kong"}], "abstract": "Visual decoding from brain signals is a key challenge at the intersection of computer vision and neuroscience, requiring methods that bridge neural representations and computational models of vision. A field-wide goal is to achieve generalizable, cross-subject models. A major obstacle towards this goal is the substantial variability in neural representations across individuals, which has so far required training bespoke models or fine-tuning separately for each subject. To address this challenge, we introduce a meta-optimized approach for semantic visual decoding from fMRI that **generalizes to novel subjects without any fine-tuning**. By simply conditioning on a small set of image-brain activation examples from the new individual, our model rapidly infers their unique neural encoding patterns to facilitate robust and efficient visual decoding. Our approach is explicitly optimized for in-context learning of the new subject's encoding model and performs decoding by hierarchical inference, inverting the encoder. First, for multiple brain regions, we estimate the per-voxel visual response encoder parameters by constructing a context over multiple stimuli and responses. Second, we construct a context consisting of encoder parameters and response values over multiple voxels to perform aggregated functional inversion. We demonstrate strong cross-subject and cross-scanner generalization across diverse visual backbones without retraining or fine-tuning. Moreover, our approach requires neither anatomical alignment nor stimulus overlap. This work is a critical step towards a generalizable foundation model for non-invasive brain decoding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38428", "url": null, "sourceid": 40170, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36241, "uid": "aa839cfb74f9e5ae2a89572a0093dd58", "name": "SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion", "authors": [{"id": 165529, "fullname": "Xiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/165529?format=json", "institution": "University of electronic science and technology of China"}, {"id": 129509, "fullname": "Heqian Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129509?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 184551, "fullname": "Lanxiao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184551?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 184552, "fullname": "Benliu Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184552?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 85491, "fullname": "Fanman Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/85491?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 130920, "fullname": "Linfeng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130920?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 85496, "fullname": "Hongliang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/85496?format=json", "institution": "University of Electronic Science and Technology of China, Tsinghua University"}], "abstract": "Error detection is crucial in industrial training, healthcare, and assembly quality control. Most existing work assumes a single-view setting and cannot handle the practical case where a third-person (exo) demonstration is used to assess a first-person (ego) imitation. We formalize Ego$\\rightarrow$Exo Imitation Error Detection: given asynchronous, length-mismatched ego and exo videos, the model must localize procedural steps on the ego timeline and decide whether each is erroneous. This setting introduces cross-view domain shift, temporal misalignment, and heavy redundancy. Under a unified protocol, we adapt strong baselines from dense video captioning and temporal action detection and show that they struggle in this cross-view regime. We then propose SAVA-X, an Align\u2013Fuse\u2013Detect framework with (i) view-conditioned adaptive sampling, (ii) scene-adaptive view embeddings, and (iii) bidirectional cross-attention fusion. On the EgoMe benchmark, SAVA-X consistently improves AUPRC and mean tIoU over all baselines, and ablations confirm the complementary benefits of its components.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36241", "url": "https://github.com/jack1ee/SAVAX", "sourceid": 40900, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37847, "uid": "e4c5f4bfe531ef1b58bfddf6b260f666", "name": "FedMPT: Federated Multi-Label Prompt Tuning of Vision-Language Models", "authors": [{"id": 180292, "fullname": "xucong wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180292?format=json", "institution": "University of Science and Technology of China"}, {"id": 186147, "fullname": "Pengkun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186147?format=json", "institution": "University of Science and Technology of China"}, {"id": 186148, "fullname": "Zhe Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186148?format=json", "institution": "University of Science and Technology of China"}, {"id": 186149, "fullname": "Liheng Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186149?format=json", "institution": "University of Science and Technology of China"}, {"id": 188401, "fullname": "Shuangwang Shuangwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188401?format=json", "institution": "University of Science and Technology of China"}, {"id": 186151, "fullname": "Yang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186151?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Multi-Label Recognition (MLR) based on Vision-Language Models (VLMs) aims to leverage their pre-trained knowledge  to better adapt complex recognition scenarios, thereby enhancing model robustness. However, for realistic decentralized applications  requiring federated learning, adapting VLMs to each client that possesses private and heterogeneous data can cause the model to overfit spurious label correlations, consequently triggering irrelevant categories when encountering new samples. To tackle this problem, we reconsider the federated learning for MLR with a causal model, in which we adopt a front-door adjustment and decouple the MLR modeling process by intermediate variables that magnify the oracle label co-occurrence. Guided by our analysis, we propose our FedMPT, the first method specifically designed for federated MLR. The core idea of FedMPT is to leverage generalizable conditions to steer federated MLR to mitigate erroneous label activations. To achieve this, FedMPT introduces an Large Language Model (LLM)-driven pipeline to decipher the underlying conditions that govern label dependencies. Furthermore, we introduce an optimal transport between the condition-enriched prompts and the image patches to uncover multiple region-level semantics. Finally, we generate synergistic predictions from different conditions with a crafted gating mechanism. Experiments on multiple benchmark datasets show that our proposed approach achieves competitive results and outperforms SOTA methods under varied settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37847", "url": null, "sourceid": 40869, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65706, "file": "/media/PosterPDFs/CVPR%202026/37847.png", "modified": "2026-04-17T07:25:05.879086-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": true, "sortkey": 0, "is_live_content": false, "detailed_kind": "", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65707, "file": "/media/PosterPDFs/CVPR%202026/37847-thumb.png", "modified": "2026-04-17T07:25:06.066865-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": false, "sortkey": 0, "is_live_content": false, "detailed_kind": "thumb", "generated_from": null, "resourcetype": "EventmediaImageFile"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36730, "uid": "f065fe9f90c6f176adf5aca0b889d595", "name": "Fingerprinting Diffusion models in the wild", "authors": [{"id": 183789, "fullname": "Junhoo Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/183789?format=json", "institution": "Seoul National University"}, {"id": 185741, "fullname": "Mijin Koo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185741?format=json", "institution": "Seoul National University"}, {"id": 86094, "fullname": "Nojun Kwak", "url": "http://cvpr.thecvf.com/api/miniconf/users/86094?format=json", "institution": "Seoul National University"}], "abstract": "Text-to-image models have become commercially valuable assets distributed under restrictive licenses to prevent unauthorized fine-tuning and redistribution, yet violations are only enforceable when detectable. Existing methods require pre-deployment injection or white-box access to model weights or gray-box access to intermediate activations\u2014capabilities, which are unavailable in commercial API deployments. We present Compositional Semantic Fingerprinting (CSF), the first black-box fingerprinting method that attributes fine-tuned models to their base families using only query access to text-to-image generation APIs. CSF abstracts models as semantic category generators, probing them with compositional underspecified prompts that combine individually common components into exponentially rare compositions. Unlike traditional watermarking, this creates an asymmetric advantage: IP owners can cheaply generate novel prompt compositions at any time post-deployment, while attackers face the intractable challenge of anticipating and removing all possible fingerprints during training. We demonstrate this across 6 model families (FLUX, Kandinsky, SD1.5/2.1/3.0/XL) and 13  variants spanning comprehensive scenario. Our Bayesian attribution framework achieves $>$50\\% posterior mean accuracy with 95\\% credible intervals for all models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36730", "url": null, "sourceid": 43153, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38057, "uid": "99f177d5be3272d4653bef31e6e235ab", "name": "EVA: Efficient Reinforcement Learning for End-to-End Video Agent", "authors": [{"id": 182889, "fullname": "Yaolun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182889?format=json", "institution": "Oregon State University"}, {"id": 184861, "fullname": "Ruohui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184861?format=json", "institution": "Sensetime"}, {"id": 184860, "fullname": "Jiahao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184860?format=json", "institution": "Sensetime"}, {"id": 184863, "fullname": "Yepeng Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184863?format=json", "institution": "Beijing Jiaotong University"}, {"id": 184862, "fullname": "Xuanyu Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184862?format=json", "institution": "Sensetime"}, {"id": 188941, "fullname": "Haonan Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188941?format=json", "institution": "Sensetime"}, {"id": 184859, "fullname": "Hao Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184859?format=json", "institution": "Sensetime; Beijing University of Aeronautics and Astronautics"}, {"id": 90125, "fullname": "Hanming Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90125?format=json", "institution": "Sensetime"}, {"id": 88044, "fullname": "Lewei Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88044?format=json", "institution": "SenseTime"}], "abstract": "Video understanding with multimodal large language models (MLLMs) remains challenging due to the long token sequences of videos, which contain extensive temporal dependencies and redundant frames.Existing approaches typically treat MLLMs as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning.Recent agent-based methods introduce external tools, yet still depend on manually designed workflows and perception-first strategies, resulting in inefficiency on long videos. We present EVA, an Efficient Reinforcement Learning framework for End-to-End Video Agent, which enables planning-before-perception through iterative summary\u2013plan\u2013action\u2013reflection reasoning.EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven and efficient video understanding.To train such agents, we design a simple yet effective three-stage learning pipeline\u2014comprising supervised fine-tuning (SFT), Kahneman\u2013Tversky Optimization (KTO), and Generalized Reward Policy Optimization (GRPO)\u2014that bridges supervised imitation and reinforcement learning.We further construct high-quality datasets for each stage, supporting stable and reproducible training.We evaluate EVA on six video understanding benchmarks, demonstrating its comprehensive capabilities. Compared with existing baselines, EVA achieves a substantial improvement of 6--12\\% over general MLLM baselines and a further 1--3\\% gain over prior adaptive agent methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38057", "url": null, "sourceid": 40407, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36354, "uid": "025c89c8e27d84349ef37fa99e69bc59", "name": "ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Video Understanding", "authors": [{"id": 184859, "fullname": "Hao Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184859?format=json", "institution": "Sensetime; Beijing University of Aeronautics and Astronautics"}, {"id": 184860, "fullname": "Jiahao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184860?format=json", "institution": "Sensetime"}, {"id": 182889, "fullname": "Yaolun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182889?format=json", "institution": "Oregon State University"}, {"id": 184861, "fullname": "Ruohui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184861?format=json", "institution": "Sensetime"}, {"id": 184862, "fullname": "Xuanyu Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184862?format=json", "institution": "Sensetime"}, {"id": 184863, "fullname": "Yepeng Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184863?format=json", "institution": "Beijing Jiaotong University"}, {"id": 84911, "fullname": "Dahua Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/84911?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 88044, "fullname": "Lewei Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88044?format=json", "institution": "SenseTime"}], "abstract": "We revisit video hallucination in multimodal large language models (Video-MLLMs) from a semantic aggregation perspective. While prior work attributes hallucinations to language priors, missing frames, or visual encoder biases, these explanations overlook errors arising during the aggregation of correct frame-level semantics into event-level interpretations. We term this phenomenon Semantic Aggregation Hallucination (SAH), which becomes increasingly prevalent in complex, multi-event video understanding tasks with rich temporal dependencies. To systematically study SAH, we introduce ELV-Halluc, the first benchmark designed for fine-grained evaluation of semantic aggregation errors. Our experiments reveal that SAH correlates with both semantic complexity and rapid semantic transitions. We further propose mitigation strategies: improved positional encoding preserves temporal structure, and reinforcement learning such as DPO enhances the model\u2019s ability to distinguish semantics within and across events. Using a curated 8K adversarial video-text pair dataset, our approach achieves consistent gains across benchmarks, including a 27.7\\% reduction in SAH rate on ELV-Halluc and Video-MME.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36354", "url": null, "sourceid": 37055, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39091, "uid": "25efc250ba7103ebb3aad53283daf077", "name": "LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings", "authors": [{"id": 148200, "fullname": "chengan che", "url": "http://cvpr.thecvf.com/api/miniconf/users/148200?format=json", "institution": "King&#x27;s College London, University of London"}, {"id": 191341, "fullname": "Chao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191341?format=json", "institution": "King&#x27;s College London, University of London"}, {"id": 191342, "fullname": "Tom Vercauteren", "url": "http://cvpr.thecvf.com/api/miniconf/users/191342?format=json", "institution": "King&#x27;s College London; Hypervision Surgical"}, {"id": 191343, "fullname": "Sophia Tsoka", "url": "http://cvpr.thecvf.com/api/miniconf/users/191343?format=json", "institution": "King&#x27;s College London, University of London"}, {"id": 181202, "fullname": "Luis Carlos Garcia Peraza Herrera", "url": "http://cvpr.thecvf.com/api/miniconf/users/181202?format=json", "institution": "King&#x27;s College London"}], "abstract": "Traditional open-access datasets focusing on surgical procedures are often limited by their small size, typically consisting of fewer than 100 videos and less than 30 hours of footage, which leads to poor model generalization. To address this data limitation, a new dataset called LEMON has been compiled using a novel aggregation pipeline that collects high-resolution videos from online sources. Featuring an extensive collection of over 4K surgical videos totaling 938 hours (85 million frames) of high-quality footage across multiple procedure types, LEMON offers a comprehensive resource surpassing existing alternatives in size and scope, including two novel downstream tasks. To demonstrate the effectiveness of this diverse dataset, we introduce LemonFM, a foundation model pretrained on LEMON using a novel self-supervised augmented knowledge distillation approach. LemonFM consistently outperforms existing surgical foundation models across four downstream tasks and six datasets, achieving significant gains in surgical phase recognition (+9.5pp, +9.4pp, and +8.4pp of Jaccard in AutoLaparo, M2CAI16, and Cholec80), surgical action recognition (+4.4pp of mAP in CholecT50), surgical tool presence detection (+5.3pp and +10.2pp of mAP in Cholec80 and GraSP), and surgical semantic segmentation (+8.3pp of mDice in CholecSeg8k). LEMON and LemonFM will serve as foundational resources for the research community and industry, accelerating progress in developing autonomous robotic surgery systems and ultimately contributing to safer and more accessible surgical care worldwide.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39091", "url": null, "sourceid": 39021, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40050, "uid": "df752bc7e93276aa07a87579c2a8e830", "name": "NTK-Guided Implicit Neural Teaching", "authors": [{"id": 180132, "fullname": "Chen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180132?format=json", "institution": "The University of Hong Kong"}, {"id": 193379, "fullname": "Wei Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/193379?format=json", "institution": "The University of Hong Kong"}, {"id": 193380, "fullname": "Bingyang Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193380?format=json", "institution": ", University of Hong Kong"}, {"id": 193381, "fullname": "Yikun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193381?format=json", "institution": "University of Hong Kong"}, {"id": 193382, "fullname": "Wei-Bin Kou", "url": "http://cvpr.thecvf.com/api/miniconf/users/193382?format=json", "institution": null}, {"id": 193383, "fullname": "Yik-Chung WU", "url": "http://cvpr.thecvf.com/api/miniconf/users/193383?format=json", "institution": "University of Hong Kong"}, {"id": 157299, "fullname": "Ngai Wong", "url": "http://cvpr.thecvf.com/api/miniconf/users/157299?format=json", "institution": "The University of Hong Kong"}], "abstract": "Implicit Neural Representations (INRs) parameterize continuous signals via multilayer perceptrons (MLPs), enabling compact, resolution-independent modeling for tasks like image, audio, and 3D reconstruction. However, fitting high-resolution signals demands optimizing over millions of coordinates, incurring prohibitive computational costs. To address it, we propose NTK-Guided Implicit Neural Teaching (NINT), which accelerates training by dynamically selecting coordinates that maximize global functional updates. Leveraging the Neural Tangent Kernel (NTK), NINT scores examples by the norm of their NTK-augmented loss gradients, capturing both fitting errors and heterogeneous leverage (self-influence and cross-coordinate coupling). This dual consideration enables faster convergence compared to existing methods. Through extensive experiments, we demonstrate that NINT significantly reduces training time by nearly half while maintaining or improving representation quality, establishing state-of-the-art acceleration among recent sampling-based strategies.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40050", "url": null, "sourceid": 43776, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36947, "uid": "2b0a35b51e15d03eadb2665109246f6f", "name": "Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation", "authors": [{"id": 180237, "fullname": "Chenxi Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180237?format=json", "institution": "Nankai University"}, {"id": 186282, "fullname": "Chen Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186282?format=json", "institution": "Southeast University"}, {"id": 186283, "fullname": "Xiaokun Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186283?format=json", "institution": "Institute of automation, Chinese academy of science"}, {"id": 186284, "fullname": "Aiming Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186284?format=json", "institution": "Alibaba Group"}, {"id": 186285, "fullname": "Jiashu Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186285?format=json", "institution": "Alibaba Group"}, {"id": 186286, "fullname": "Jiachen Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/186286?format=json", "institution": "Alibaba Group"}, {"id": 186287, "fullname": "Jiahong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186287?format=json", "institution": "Alibaba Group"}, {"id": 88278, "fullname": "Xiangxiang Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88278?format=json", "institution": "MeiTuan"}, {"id": 76168, "fullname": "Jufeng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76168?format=json", "institution": "Nankai University"}], "abstract": "Few-step generation has been a long-standing goal, with recent one-step generation methods exemplified by MeanFlow achieving remarkable results.Existing research on MeanFlow primarily focuses on class-to-image generation. However, an intuitive yet unexplored direction is to extend the condition from fixed class labels to flexible text inputs, enabling richer content creation.Compared to the limited class labels, text conditions pose greater challenges to the model\u2019s understanding capability, necessitating the effective integration of powerful text encoders into the MeanFlow framework. Surprisingly, although incorporating text conditions appears straightforward, we find that integrating powerful LLM-based text encoders using conventional training strategies results in unsatisfactory performance.To uncover the underlying cause, we conduct detailed analyses and reveal that, due to the extremely limited number of refinement steps in the MeanFlow generation, such as only one step, the text feature representations are required to possess sufficiently high discriminability.This also explains why discrete and easily distinguishable class features perform well within the MeanFlow framework. Based on this insight, we propose a novel auxiliary loss design to learn the discriminative text representation space, achieving an effective adaptation of MeanFlow to text-to-image generation for the first time. Furthermore, we validate our approach on the widely used diffusion model, demonstrating significant generation performance improvements. We hope this work provides a general and practical reference for future research on text-conditioned MeanFlow generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36947", "url": null, "sourceid": 35917, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38214, "uid": "33f68c58d409a7d8a1524d062a44b5d8", "name": "SAMTok: Representing Any Mask with Two Words", "authors": [{"id": 148079, "fullname": "yikang zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/148079?format=json", "institution": "Wuhan University"}, {"id": 189337, "fullname": "Tao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189337?format=json", "institution": "Wuhan University"}, {"id": 189338, "fullname": "Dengxian Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/189338?format=json", "institution": "Wuhan University"}, {"id": 189339, "fullname": "Yuanzheng Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189339?format=json", "institution": "Wuhan University"}, {"id": 189340, "fullname": "Ye Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/189340?format=json", "institution": "Peking University"}, {"id": 89673, "fullname": "Haochen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89673?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 130075, "fullname": "Haobo Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/130075?format=json", "institution": "UC Merced / ByteDance"}, {"id": 189341, "fullname": "Jiacong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189341?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 188215, "fullname": "Lu Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188215?format=json", "institution": "Insta360; Wuhan University"}, {"id": 178983, "fullname": "Hao Fei", "url": "http://cvpr.thecvf.com/api/miniconf/users/178983?format=json", "institution": "National University of Singapore"}, {"id": 189342, "fullname": "Shunping Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/189342?format=json", "institution": "Wuhan University"}, {"id": 189343, "fullname": "Anran Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189343?format=json", "institution": "Tiktok"}, {"id": 189344, "fullname": "Zhuochen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189344?format=json", "institution": null}, {"id": 185716, "fullname": "Yujing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185716?format=json", "institution": "ByteDance Inc."}, {"id": 181203, "fullname": "Cheng CHEN", "url": "http://cvpr.thecvf.com/api/miniconf/users/181203?format=json", "institution": "ByteDance"}, {"id": 139138, "fullname": "Xiangtai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/139138?format=json", "institution": "ByteDance Inc."}], "abstract": "Pixel-wise capabilities are essential for building interactive intelligent systems. However pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To solve these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two textual special tokens and reconstructs masks from these tokens with high fidelity. By treating masks as a new language, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications and specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a simple and scalable paradigm for equipping MLLMs with strong pixel-wise capabilities. Code and models will be available.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38214", "url": null, "sourceid": 30650, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37210, "uid": "550a339e477f57d149a2dc62ea703bea", "name": "BiPA: Bilevel Prompt Adaptation for Underwater Instance Segmentation", "authors": [{"id": 69209, "fullname": "Long Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/69209?format=json", "institution": "Dalian University of Technology"}, {"id": 186928, "fullname": "Haoze Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186928?format=json", "institution": "Dalian University of Technology"}, {"id": 186929, "fullname": "Yuhang Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186929?format=json", "institution": "Dalian University of Technology"}, {"id": 152576, "fullname": "Jinyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152576?format=json", "institution": "Dalian University of Technology"}, {"id": 186930, "fullname": "Chengpei Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186930?format=json", "institution": "University of New South Wales"}, {"id": 186931, "fullname": "Xinwei Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/186931?format=json", "institution": "Dalian University of Technology"}, {"id": 88013, "fullname": "Yi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88013?format=json", "institution": "Dalian University of Technology"}, {"id": 91359, "fullname": "Xiangjian He", "url": "http://cvpr.thecvf.com/api/miniconf/users/91359?format=json", "institution": "The University of Nottingham Ningbo China"}, {"id": 130404, "fullname": "Weimin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130404?format=json", "institution": "Dalian University of Techonoly"}], "abstract": "Underwater instance segmentation is essential for fine-grained scene understanding. However, underwater imagery exhibits a strong domain gap from in-air vision due to severe degradation (e.g., turbidity).  Consequently, despite its general segmentation ability, SAM degrades sharply underwater. In this work, we propose BiPA, which effectively adapts SAM to the underwater domain. To be concrete, we construct an underwater SAM with dual prompts and introduce a foreground-attentive injection block to enhance local foreground representation. We formulate dense prompt learning as a bilevel optimization, explicitly capturing the mutual dependency between prompt and model. To make this tractable, we design a two-stage learning strategy. The first stage adapts the dense prompt itself, updating it with Bayesian optimization to learn efficiently. The second stage fine-tunes the model parameters under the frozen optimized prompt, which finally enables effective cross-domain adaptation. Extensive experiments and analyses verify the superiority and efficiency of BiPA. Code will be released if this work can be accepted, fortunately.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37210", "url": null, "sourceid": 37209, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40005, "uid": "a37a3f28f2cd1fdcb28582f28c61adfa", "name": "Bilevel Layer-Positioning LoRA for Real Image Dehazing", "authors": [{"id": 155796, "fullname": "Yan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155796?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 69209, "fullname": "Long Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/69209?format=json", "institution": "Dalian University of Technology"}, {"id": 190048, "fullname": "Yuxin Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190048?format=json", "institution": "Xi&#x27;an University"}, {"id": 193280, "fullname": "Zhe Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193280?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 193281, "fullname": "Fan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/193281?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 155798, "fullname": "Zhuo Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/155798?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Learning-based real image dehazing methods have achieved notable progress, yet they still face adaptation challenges in diverse real haze scenes. These challenges mainly stem from the lack of effective unsupervised mechanisms for unlabeled data and the heavy cost of full model fine-tuning. To address these challenges, we propose the haze-to-clear text-directed loss that leverages CLIP\u2019s cross-modal capabilities to reformulate real image dehazing as a semantic alignment problem in latent space, thereby providing explicit unsupervised cross-modal guidance in the absence of reference images. Furthermore, we introduce the Bilevel Layer-positioning LoRA (BiLaLoRA) strategy, which learns both the LoRA parameters and automatically search the injection layers, enabling targeted adaptation of critical network layers. Extensive experiments demonstrate our superiority against state-of-the-art methods on multiple real-world dehazing benchmarks. The source code will be publicly available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40005", "url": null, "sourceid": 44422, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40367, "uid": "9bb170d5a0adf8a2da13f042b4cffcab", "name": "Streaming Diffusion Model for Fast Infrared and Visible Video Fusion", "authors": [{"id": 152576, "fullname": "Jinyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152576?format=json", "institution": "Dalian University of Technology"}, {"id": 191976, "fullname": "Ludan Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/191976?format=json", "institution": "Dalian University of Technology"}, {"id": 151421, "fullname": "Tengyu Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/151421?format=json", "institution": "Dalian University of Technology"}, {"id": 191977, "fullname": "Chunyan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191977?format=json", "institution": "Dalian University of Technology; Dalian Martime University"}, {"id": 155757, "fullname": "Zhiying Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155757?format=json", "institution": "Dalian Martime University"}, {"id": 69209, "fullname": "Long Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/69209?format=json", "institution": "Dalian University of Technology"}, {"id": 131737, "fullname": "Risheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131737?format=json", "institution": "Dalian University of Technology"}, {"id": 88008, "fullname": "Xin Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88008?format=json", "institution": "Dalian University of Technology"}], "abstract": "Infrared and visible video fusion is pivotal for robust perceptual systems, aiming to synthesize a comprehensive video stream that leverages both thermal resilience and textured details. However, prevailing methods, by treating video as independent frames, inherently introduce temporal incoherence, such as flickering and ghosting artifacts. While diffusion models possess strong generative priors to remedy this, their iterative nature is prohibitively slow for video. To resolve this fundamental dilemma, we propose a streaming diffusion model for efficient infrared and visible video fusion, termed SDMFusion. Our key insight is to distill the generative prior of a pre-trained diffusion model into a one-step sampling framework, while explicitly modeling temporal dynamics. We design a memory-augmented latent pipeline where a temporal aggregation adapter aligns and propagates cross-frame features to ensure coherence, supported by a dedicated temporal consistency loss. This approach effectively decouples the challenge of achieving high fidelity from maintaining temporal stability. Extensive experiments on four benchmarks demonstrate that our method establishes a new state-of-the-art, generating fused videos with exceptional spatio-temporal consistency at a speed suitable for real-time application.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40367", "url": null, "sourceid": -44054, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39389?format=json"], "related_events_ids": [39389]}, {"id": 37988, "uid": "016ddcc4fef41865c385b820f507463e", "name": "Diffusion MRI Transformer with a Diffusion Space Rotary Positional Embedding (D-RoPE)", "authors": [{"id": 183087, "fullname": "Gustavo Chau", "url": "http://cvpr.thecvf.com/api/miniconf/users/183087?format=json", "institution": "Stanford University"}, {"id": 188759, "fullname": "Mohammad H. Abbasi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188759?format=json", "institution": "Stanford University"}, {"id": 188760, "fullname": "Camila Blank", "url": "http://cvpr.thecvf.com/api/miniconf/users/188760?format=json", "institution": "Stanford University"}, {"id": 165062, "fullname": "Juze Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/165062?format=json", "institution": "Stanford University"}, {"id": 188761, "fullname": "Alan Q. Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188761?format=json", "institution": "Stanford University"}, {"id": 188762, "fullname": "Sophie Ostmeier", "url": "http://cvpr.thecvf.com/api/miniconf/users/188762?format=json", "institution": "University Hospital Zurich"}, {"id": 166890, "fullname": "Akshay Chaudhari", "url": "http://cvpr.thecvf.com/api/miniconf/users/166890?format=json", "institution": "Stanford University"}, {"id": 188763, "fullname": "Kilian Pohl", "url": "http://cvpr.thecvf.com/api/miniconf/users/188763?format=json", "institution": "Stanford University"}, {"id": 75810, "fullname": "Ehsan Adeli", "url": "http://cvpr.thecvf.com/api/miniconf/users/75810?format=json", "institution": "Stanford University"}], "abstract": "Diffusion Magnetic Resonance Imaging (dMRI) plays a critical role in studying microstructural changes in the brain. It is, therefore, widely used in clinical practice; yet progress in learning general-purpose representations from dMRI has been limited. A key challenge is that existing deep learning approaches are not well-suited to capture the unique properties of diffusion signals. Brain dMRI is normally composed of several brain volumes, each with different attenuation characteristics dependent on the direction and strength of the diffusion-sensitized gradients. Thus, there is a need to jointly model spatial, diffusion-weighting, and directional dependencies in dMRI. Furthermore, varying acquisition protocols (e.g., differing numbers of directions) further limit traditional models. To address these gaps, we introduce a diffusion space rotatory positional embedding (D-RoPE) plugged into our  dMRI transformer to capture both the spatial structure and directional characteristics of diffusion data, enabling robust and transferable representations across diverse acquisition settings and an arbitrary number of diffusion directions. After self-supervised masked autoencoding pretraining, tests on several downstream tasks show that the learned representations and the pretrained model can provide competitive or superior performance compared to several baselines in these downstream tasks (even compared to a fully trained baseline); the finetuned features from our pretrained encoder resulted in a 6\\% higher accuracy in classifying mild cognitive impairment and a 0.05 increase in the correlation coefficient when predicting cognitive scores.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37988", "url": null, "sourceid": 37532, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37362, "uid": "ab88830c7df1320ff8e0fc8203fbc80c", "name": "Bridging Human Evaluation to Infrared and Visible Image Fusion", "authors": [{"id": 152576, "fullname": "Jinyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152576?format=json", "institution": "Dalian University of Technology"}, {"id": 155755, "fullname": "Xingyuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155755?format=json", "institution": "Dalian University of Technology"}, {"id": 155754, "fullname": "Qingyun Mei", "url": "http://cvpr.thecvf.com/api/miniconf/users/155754?format=json", "institution": "Dalian University of Technology"}, {"id": 187259, "fullname": "HaoYuan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187259?format=json", "institution": null}, {"id": 155757, "fullname": "Zhiying Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155757?format=json", "institution": "Dalian Martime University"}, {"id": 69209, "fullname": "Long Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/69209?format=json", "institution": "Dalian University of Technology"}, {"id": 131737, "fullname": "Risheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131737?format=json", "institution": "Dalian University of Technology"}, {"id": 88008, "fullname": "Xin Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88008?format=json", "institution": "Dalian University of Technology"}], "abstract": "Infrared and visible image fusion (IVIF) integrates complementary modalities to enhance scene perception. Current methods predominantly focus on optimizing handcrafted losses and objective metrics, often resulting in fusion outcomes that do not align with human visual preferences. This challenge is further exacerbated by the ill-posed nature of IVIF, which severely limits its effectiveness in human perceptual environments such as security surveillance and driver assistance systems. To address these limitations, we propose a feedback reinforcement framework that bridges human evaluation to infrared and visible image fusion. To address the lack of human-centric evaluation metrics and data, we introduce the first large-scale human feedback dataset for IVIF, containing multidimensional subjective scores and artifact annotations, and enriched by a fine-tuned large language model with expert review. Based on this dataset, we design a domain-specific reward function and train a reward model to quantify perceptual quality. Guided by this reward, we fine-tune fusion networks through Group Relative Policy Optimization, achieving state-of-the-art performance that better aligns fused images with human aesthetics.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37362", "url": null, "sourceid": 46249, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38518, "uid": "11eed2e8a05bbbb4508ffa229a7b84c0", "name": "Benchmarking Endoscopic Surgical Image Restoration and Beyond", "authors": [{"id": 186673, "fullname": "Jialun Pei", "url": "http://cvpr.thecvf.com/api/miniconf/users/186673?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 132802, "fullname": "Diandian Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/132802?format=json", "institution": "Universit\u00e4t Stuttgart"}, {"id": 190047, "fullname": "Donghui Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190047?format=json", "institution": "Dalian University of Technology"}, {"id": 186675, "fullname": "Zhixi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186675?format=json", "institution": "Southern Medical University"}, {"id": 190048, "fullname": "Yuxin Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190048?format=json", "institution": "Xi&#x27;an University"}, {"id": 69209, "fullname": "Long Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/69209?format=json", "institution": "Dalian University of Technology"}, {"id": 84747, "fullname": "Bo Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/84747?format=json", "institution": "Wuhan University"}, {"id": 87709, "fullname": "Pheng-Ann Heng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87709?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "In endoscopic surgery, a clear and high-quality visual field is critical for surgeons to make accurate intraoperative decisions. However, persistent visual degradation, including smoke generated by energy devices, lens fogging from thermal gradients, and lens contamination due to blood or tissue fluid splashes during surgical procedures, severely impairs visual clarity. These degenerations can seriously hinder surgical workflow and pose risks to patient safety. To systematically investigate and address various forms of surgical scene degradation, we introduce a real-world open-source surgical image restoration dataset covering laparoscopic environments, called SurgClean, which involves multi-type image restoration tasks from two medical sites, i.e., desmoking, defogging, and desplashing. SurgClean comprises 3,113 images with diverse degradation types and corresponding paired reference labels. Based on SurgClean, we establish a standardized evaluation benchmark and provide performance for 22 representative generic task-specific image restoration approaches, including 12 generic and 10 task-specific image restoration approaches. Experimental results reveal substantial performance gaps relative to clinical requirements, highlighting a critical opportunity for algorithm advancements in intelligent surgical restoration. Furthermore, we explore the degradation discrepancies between surgical and natural scenes from structural perception and semantic understanding perspectives, providing fundamental insights for domain-specific image restoration research. Our work aims to empower restoration algorithms and improve the efficiency of clinical procedures. Data and code are available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38518", "url": null, "sourceid": 34410, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39389, "uid": "9bb170d5a0adf8a2da13f042b4cffcab", "name": "Streaming Diffusion Model for Fast Infrared and Visible Video Fusion", "authors": [{"id": 152576, "fullname": "Jinyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152576?format=json", "institution": "Dalian University of Technology"}, {"id": 191976, "fullname": "Ludan Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/191976?format=json", "institution": "Dalian University of Technology"}, {"id": 151421, "fullname": "Tengyu Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/151421?format=json", "institution": "Dalian University of Technology"}, {"id": 191977, "fullname": "Chunyan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191977?format=json", "institution": "Dalian University of Technology; Dalian Martime University"}, {"id": 155757, "fullname": "Zhiying Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155757?format=json", "institution": "Dalian Martime University"}, {"id": 69209, "fullname": "Long Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/69209?format=json", "institution": "Dalian University of Technology"}, {"id": 131737, "fullname": "Risheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131737?format=json", "institution": "Dalian University of Technology"}, {"id": 88008, "fullname": "Xin Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88008?format=json", "institution": "Dalian University of Technology"}], "abstract": "Infrared and visible video fusion is pivotal for robust perceptual systems, aiming to synthesize a comprehensive video stream that leverages both thermal resilience and textured details. However, prevailing methods, by treating video as independent frames, inherently introduce temporal incoherence, such as flickering and ghosting artifacts. While diffusion models possess strong generative priors to remedy this, their iterative nature is prohibitively slow for video. To resolve this fundamental dilemma, we propose a streaming diffusion model for efficient infrared and visible video fusion, termed SDMFusion. Our key insight is to distill the generative prior of a pre-trained diffusion model into a one-step sampling framework, while explicitly modeling temporal dynamics. We design a memory-augmented latent pipeline where a temporal aggregation adapter aligns and propagates cross-frame features to ensure coherence, supported by a dedicated temporal consistency loss. This approach effectively decouples the challenge of achieving high fidelity from maintaining temporal stability. Extensive experiments on four benchmarks demonstrate that our method establishes a new state-of-the-art, generating fused videos with exceptional spatio-temporal consistency at a speed suitable for real-time application.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39389", "url": null, "sourceid": 44054, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40367?format=json"], "related_events_ids": [40367]}, {"id": 37271, "uid": "8dc962351b83a23171a713e55a77365f", "name": "DiffGraph:  An Automated Agent-driven  Model Merging Framework for  In-the-Wild Text-to-Image Generation", "authors": [{"id": 156926, "fullname": "Zhuoling Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/156926?format=json", "institution": "Lancaster University"}, {"id": 73672, "fullname": "Hossein Rahmani", "url": "http://cvpr.thecvf.com/api/miniconf/users/73672?format=json", "institution": "Lancaster University"}, {"id": 187058, "fullname": "Jiarui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187058?format=json", "institution": "Lancaster University"}, {"id": 176292, "fullname": "Yu Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/176292?format=json", "institution": "Lancaster University"}, {"id": 91082, "fullname": "Majid Mirmehdi", "url": "http://cvpr.thecvf.com/api/miniconf/users/91082?format=json", "institution": "University of Bristol"}, {"id": 87401, "fullname": "Jason Kuen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87401?format=json", "institution": "Adobe Research"}, {"id": 187059, "fullname": "Jiuxiang Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187059?format=json", "institution": "Adobe Systems"}, {"id": 153315, "fullname": "Jun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153315?format=json", "institution": "Lancaster University"}], "abstract": "The rapid growth of the text-to-image (T2I) community has fostered a thriving online ecosystem of expert models, which are variants of pretrained diffusion models specialized for diverse generative capabilities. Yet, existing model merging methods remain limited in fully leveraging abundant online expert resources and still struggle to meet diverse in-the-wild user needs. We present DiffGraph, a novel automated agent-driven graph-based model merging framework, which automatically harnesses online experts and flexibly merges them for diverse user needs. Our DiffGraph constructs a scalable graph and organizes ever-expanding online experts within it through node registration and calibration. Then, DiffGraph dynamically activates specific subgraphs based on user needs, enabling flexible combinations of different experts to achieve user-desired generation. Extensive experiments show the efficacy of our method.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37271", "url": null, "sourceid": 36074, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37928, "uid": "22ecc5d03493ba26f7778851c126bee3", "name": "Dynamic Magic: Unleashing Restricted Knowledge for Lifelong Person Re-Identification", "authors": [{"id": 181210, "fullname": "Jinjia Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/181210?format=json", "institution": "Hebei University"}, {"id": 188605, "fullname": "Jican Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188605?format=json", "institution": "Hebei University"}, {"id": 103019, "fullname": "Jiazuo Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/103019?format=json", "institution": "Dalian University of Technology"}, {"id": 188606, "fullname": "Zeze Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188606?format=json", "institution": "Hebei University"}, {"id": 188607, "fullname": "Huibing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188607?format=json", "institution": "Dalian Maritime University"}], "abstract": "Lifelong Person Re-Identification (LReID) aims to adapt to new domains while preserving old knowledge. Existing methods, whether distillation-based or rehearsal-based, attempt to consolidate diverse knowledge within a fixed model architecture. However, the limited adaptability of such architectures often leads to catastrophic forgetting by overwriting previously acquired knowledge. To overcome this limitation, we propose Versatile Incremental Adaptation (VIA), a novel dynamical expansion framework for LReID, which unleashes restricted knowledge during continuous learning by large pre-trained models. Specifically, Unseen-domain person Adapter (UnA) is embedded in VIA, which employs incremental modular learning to capture the domain's specific knowledge, thereby reducing cross-domain interference and releasing task-specific capacity that is previously limited by static parameter sharing. Meanwhile, considering the substantial amount of sharing knowledge across domains in LReID, we design the Domain-aware Dispatch (DAD) module to enable inter-domain collaboration and knowledge reuse through adaptive cooperation among multiple shared adapters. Furthermore, a Holistic Domain Controller (HDC) is designed to dynamically regulate the learning capacity for new domains based on knowledge similarity, thereby effectively unleashing the generalization potential of pre-trained models. Additionally, a lightweight Similarity-Guided Auto-Selector (SGAS) is proposed to assign inputs to relevant adapters during inference automatically. Extensive experiments are conducted to validate the effectiveness of VIA, which surpasses state-of-the-art methods across both seen and unseen domains.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37928", "url": null, "sourceid": 44420, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37117, "uid": "5b0d4c35f0631337b5286a02bfbae6a0", "name": "RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation", "authors": [{"id": 180874, "fullname": "Hao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180874?format=json", "institution": "Army Engineering University of PLA"}, {"id": 129222, "fullname": "Yuhao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129222?format=json", "institution": "Dalian University of Technology"}, {"id": 186702, "fullname": "Wenning Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186702?format=json", "institution": "Army Engineering University of PLA"}, {"id": 128308, "fullname": "Pingping Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128308?format=json", "institution": "Dalian University of Technology"}, {"id": 89488, "fullname": "Dong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89488?format=json", "institution": "Dalian University of Technology"}, {"id": 87510, "fullname": "Huchuan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87510?format=json", "institution": "Dalian University of Technology"}], "abstract": "RGB-Thermal (RGBT) tracking aims to achieve robust object localization across diverse environmental conditions by fusing visible and thermal infrared modalities. However, existing RGBT trackers rely solely on initial-frame visual information for target modeling,  failing to adapt to appearance variations  due to the absence of language guidance. Furthermore, current methods suffer from redundant search regions and heterogeneous modality gaps, causing background distraction. To address these issues, we first introduce textual descriptions into RGBT tracking benchmarks. This is accomplished through a pipeline that leverages Multi-modal Large Language Models (MLLMs) to automatically produce texual annotations. Afterwards, we propose RAGTrack, a novel Retrieval-Augmented Generation framework for robust RGBT tracking. To this end, we introduce a Multi-modal Transformer Encoder (MTE) for unified visual-language modeling. Then, we design an Adaptive Token Fusion (ATF) to select target-relevant tokens and perform channel exchanges based on cross-modal correlations, mitigating search redundancies and modality gaps. Finally, we propose a Context-aware Reasoning Module (CRM) to maintain a dynamic knowledge base and employ a Retrieval-Augmented Generation (RAG) to enable temporal linguistic reasoning for robust target modeling. Extensive experiments on four RGBT benchmarks demonstrate that our framework achieves state-of-the-art performance across various challenging scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37117", "url": null, "sourceid": 32518, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38610, "uid": "287d6438aae8f21fcda18b911c9c23ef", "name": "RiskProp: Collision-Anchored Self-supervised Temporal Constraints for Early Accident Anticipation", "authors": [{"id": 190291, "fullname": "Yiyang Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190291?format=json", "institution": "Wuhan University"}, {"id": 133158, "fullname": "Tianhao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/133158?format=json", "institution": "Wuhan University"}, {"id": 132095, "fullname": "Peilun Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/132095?format=json", "institution": "Didi Research"}, {"id": 190292, "fullname": "Hongyu Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190292?format=json", "institution": "Didi Research"}, {"id": 190293, "fullname": "Longyu Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/190293?format=json", "institution": "Didi Research"}, {"id": 190294, "fullname": "Yuxuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190294?format=json", "institution": "Didi Research"}, {"id": 190295, "fullname": "Liyin Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190295?format=json", "institution": "Didi Research"}, {"id": 190296, "fullname": "Yifeng Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/190296?format=json", "institution": "Didi Research"}, {"id": 190297, "fullname": "Chunbo Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/190297?format=json", "institution": "DiDi Group"}, {"id": 190298, "fullname": "Yutian Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190298?format=json", "institution": "Wuhan University"}, {"id": 185783, "fullname": "Zhihui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185783?format=json", "institution": "University of Science and Technology of China"}, {"id": 76420, "fullname": "Yu Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76420?format=json", "institution": "Wuhan University"}], "abstract": "Accident anticipation aims to predict impending collisions from dashcam videos and trigger early alerts. Existing methods rely on binary supervision with manually annotated \u201canomaly onset\u201d frames, which are subjective and inconsistent, leading to inaccurate risk estimation. In contrast, we propose Risk Propagation (RiskProp), a collision-anchored supervised framework enhanced with self-supervised temporal constraints, which removes the need for anomaly onset annotations by leveraging only the reliably labeled collision frame. RiskProp models temporal risk evolution through two observation-driven losses: first, since future frames contain more definitive evidence of an impending accident, we introduce a future-frame regularization loss that uses the model\u2019s next-frame prediction as a soft target to supervise the current frame, enabling backward propagation of risk signals; second, inspired by the empirical trend of rising risk before accidents, we design an adaptive monotonic constraint to encourage a non-decreasing progression over time. Experiments on CAP-DATA and Nexar demonstrate that RiskProp achieves state-of-the-art performance and produces smoother, more discriminative risk curves, improving both early anticipation and interpretability.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38610", "url": null, "sourceid": 36886, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65713, "modified": "2026-04-20T09:32:26.144928-07:00", "display_section": 1, "type": "PDF", "name": "Slides", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "/media/cvpr-2026/Slides/38610.pdf", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36932, "uid": "8f4a20534264fce22dc1a12a6d8e6158", "name": "HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models", "authors": [{"id": 180704, "fullname": "Qihui Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180704?format=json", "institution": "student"}, {"id": 186249, "fullname": "Tao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186249?format=json", "institution": "University of Science and Technology of China"}, {"id": 161680, "fullname": "yuchen wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/161680?format=json", "institution": "11"}, {"id": 186250, "fullname": "Shuangwu chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186250?format=json", "institution": ""}, {"id": 186251, "fullname": "Xiaobin Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186251?format=json", "institution": "University of Science and Technology of China"}, {"id": 186252, "fullname": "jianyang jianyang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186252?format=json", "institution": "University of Science and Technology of China; University of Science and Technology of China"}, {"id": 186253, "fullname": "liuyang liuyang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186253?format=json", "institution": "ChangXin Memory Technologies"}, {"id": 186254, "fullname": "PanYinfei PanYinfei", "url": "http://cvpr.thecvf.com/api/miniconf/users/186254?format=json", "institution": null}], "abstract": "In multimodal large language models (MLLMs), the surge of visual tokens  significantly increases the inference time and computational overhead, making them impractical for real-time or resource-constrained applications.Visual token pruning is a promising strategy for reducing the cost of MLLM inference by removing redundant visual tokens.Existing researches usually assume that all attention heads contribute equally to the visual interpretation.However, our study reveals that different heads may capture distinct visual semantics and inherently  play distinct roles  in visual processing.In light of this observation, we propose HAWK, a \\textbf{h}ead importance-\\textbf{aw}are visual to\\textbf{k}en pruning method that perceives the varying importance of attention heads in visual tasks to maximize the retention of crucial tokens.By leveraging head importance weights and text-guided attention to assess visual token significance, HAWK effectively retains task-relevant visual tokens while removing redundant ones.The proposed HAWK  is entirely training-free and can be seamlessly applied to various MLLMs.Extensive experiments on multiple mainstream vision-language benchmarks demonstrate that HAWK achieves state-of-the-art accuracy. When applied to Qwen2.5-VL, HAWK retains 96.0\\% of the original accuracy after pruning 80.2\\% of the visual tokens. Additionally, it reduces end-to-end latency to 74.4\\% of the original and further decreases GPU memory usage across the tested models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36932", "url": null, "sourceid": 43065, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36504, "uid": "c5e0299714bda49d29f444c4ac527453", "name": "ReBaPL: Repulsive Bayesian Prompt Learning", "authors": [{"id": 152173, "fullname": "Yassir Bendou", "url": "http://cvpr.thecvf.com/api/miniconf/users/152173?format=json", "institution": "Sigma Nova"}, {"id": 185227, "fullname": "Omar Ezzahir", "url": "http://cvpr.thecvf.com/api/miniconf/users/185227?format=json", "institution": "Sigma Nova"}, {"id": 180456, "fullname": "Eduardo Fernandes Montesuma", "url": "http://cvpr.thecvf.com/api/miniconf/users/180456?format=json", "institution": "Sigma Nova"}, {"id": 185228, "fullname": "Gabriel Mahuas", "url": "http://cvpr.thecvf.com/api/miniconf/users/185228?format=json", "institution": "Sigma Nova"}, {"id": 185229, "fullname": "Victoria Shevchenko", "url": "http://cvpr.thecvf.com/api/miniconf/users/185229?format=json", "institution": "Sigma Nova"}, {"id": 185230, "fullname": "Mike Gartrell", "url": "http://cvpr.thecvf.com/api/miniconf/users/185230?format=json", "institution": "Rhizome Labs"}], "abstract": "Prompt learning has emerged as an effective technique for fine-tuning large-scale foundation models for downstream tasks. However, conventional prompt tuning methods are prone to overfitting and can struggle with out-of-distribution generalization. To address these limitations, Bayesian prompt learning has been proposed, which frames prompt optimization as a Bayesian inference problem to enhance robustness. This paper introduces Repulsive Bayesian Prompt Learning (ReBaPL), a novel method for Bayesian prompt learning, designed to efficiently explore the complex and often multimodal posterior landscape of prompts. Our method integrates a cyclical step-size schedule with a stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm, enabling alternating phases of exploration to discover new modes, and exploitation to refine existing modes. Furthermore, we introduce a repulsive force derived from a potential function over probability metrics (including Maximum Mean Discrepancy and Wasserstein distance) computed on the distributions of representations produced by different prompts. This representation-space repulsion diversifies exploration and prevents premature collapse to a single mode. Our approach allows for a more comprehensive characterization of the prompt posterior distribution, leading to improved generalization. In contrast to prior Bayesian prompt learning methods, our method provides a modular plug-and-play Bayesian extension of any existing prompt learning method based on maximum likelihood estimation. We demonstrate the efficacy of ReBaPL on several benchmark datasets, showing superior performance over state-of-the-art methods for prompt learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36504", "url": null, "sourceid": 37195, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38013, "uid": "6c465ed53d17739467c95fe414d8e056", "name": "Denoise and Align: Towards Source-Free UDA for Robust Panoramic Semantic Segmentation", "authors": [{"id": 188829, "fullname": "Yaowen Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188829?format=json", "institution": "Imperial College London"}, {"id": 164283, "fullname": "Zhen Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/164283?format=json", "institution": "Wuhan University"}, {"id": 87543, "fullname": "Xu Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87543?format=json", "institution": "HKUST"}, {"id": 188830, "fullname": "Xiaoxin Mi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188830?format=json", "institution": "Wuhan University of Technology"}, {"id": 77237, "fullname": "Zhen Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/77237?format=json", "institution": "Wuhan University"}], "abstract": "Panoramic semantic segmentation is pivotal for comprehensive 360\u00b0 scene understanding in critical applications like autonomous driving and virtual reality. However, progress in this domain is constrained by two key challenges: the severe geometric distortions inherent in panoramic projections and the prohibitive cost of dense annotation. While Unsupervised Domain Adaptation (UDA) from label-rich pinhole-camera datasets offers a viable alternative, many real-world tasks impose a stricter source-free (SFUDA) constraint where source data is inaccessible for privacy or proprietary reasons. This constraint significantly amplifies the core problems of domain shift, leading to unreliable pseudo-labels and dramatic performance degradation, particularly for minority classes. To overcome these limitations, we propose the DAPASS framework. DAPASS introduces two synergistic modules to robustly transfer knowledge without source data. First, our Panoramic Confidence-Guided Denoising (PCGD) module generates high-fidelity, class-balanced pseudo-labels by enforcing perturbation consistency and incorporating neighborhood-level confidence to filter noise. Second, a Contextual Resolution Adversarial Module (CRAM) explicitly addresses scale variance and distortion by adversarially aligning fine-grained details from high-resolution crops with global semantics from low-resolution contexts. DAPASS achieves state-of-the-art performances on outdoor (Cityscapes-to-DensePASS) and indoor (Stanford2D3D) benchmarks, yielding 55.04% (+2.05%) and 70.38% (+1.54%) mIoU, respectively.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38013", "url": null, "sourceid": 44611, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38126, "uid": "a2896cb11aed969ffec7bb02a7c772a8", "name": "Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols", "authors": [{"id": 174627, "fullname": "xianchao Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/174627?format=json", "institution": "Beihang  University"}, {"id": 180537, "fullname": "Xinyu Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/180537?format=json", "institution": "Shanghai Innovation Institute"}, {"id": 189108, "fullname": "Youcheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189108?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 189109, "fullname": "Jiayou Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189109?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 189110, "fullname": "Tianle Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189110?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 189111, "fullname": "Liangming Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189111?format=json", "institution": "Southern University for Science and Technology; Southern University of Science and Technology"}, {"id": 189112, "fullname": "Lei Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/189112?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 127551, "fullname": "Yonglu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/127551?format=json", "institution": "Shanghai Jiaotong University"}], "abstract": "Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic manipulation, yet they remain limited in failure diagnosis and learning from failures. Additionally, existing failure datasets are mostly generated programmatically in simulation, which limits their generalization to the real world. In light of these, we introduce ViFailback, a framework designed to diagnose robotic manipulation failures and provide both textual and visual correction guidance. Our framework utilizes explicit visual symbols to enhance annotation efficiency. We further release the ViFailback dataset, a large-scale collection of 58,126 Visual Question Answering (VQA) pairs along with their corresponding 5,202 real-world manipulation trajectories. Based on the dataset, we establish ViFailback-Bench, a benchmark of 11 fine-grained VQA tasks designed to assess the failure diagnosis and correction abilities of Vision-Language Models (VLMs), featuring ViFailback-Bench Lite for closed-ended and ViFailback-Bench Hard for open-ended evaluation. To demonstrate the effectiveness of our framework, we built the ViFailback-8B VLM, which not only achieves significant overall performance improvement on ViFailback-Bench but also generates visual symbols for corrective action guidance. Finally, by integrating ViFailback-8B with a VLA model, we conduct real-world robotic experiments demonstrating its ability to assist the VLA model in recovering from failures. Code and data will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38126", "url": null, "sourceid": 35109, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37509, "uid": "22e1d53aa7304c599e40f712b921a50f", "name": "STARFlow-V: End-to-End Video Generative Modeling with Autoregressive Normalizing Flows", "authors": [{"id": 187603, "fullname": "Jiatao Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187603?format=json", "institution": "University of Pennsylvania; Apple"}, {"id": 169953, "fullname": "Ying Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/169953?format=json", "institution": "University of Illinois Urbana-Champaign"}, {"id": 187604, "fullname": "Tianrong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187604?format=json", "institution": "Apple"}, {"id": 187605, "fullname": "Laurent Dinh", "url": "http://cvpr.thecvf.com/api/miniconf/users/187605?format=json", "institution": "Apple"}, {"id": 187606, "fullname": "Yuyang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187606?format=json", "institution": "Apple"}, {"id": 152428, "fullname": "Miguel \u00c1ngel Bautista", "url": "http://cvpr.thecvf.com/api/miniconf/users/152428?format=json", "institution": "Apple"}, {"id": 187607, "fullname": "David Berthelot", "url": "http://cvpr.thecvf.com/api/miniconf/users/187607?format=json", "institution": "Apple"}, {"id": 148925, "fullname": "Joshua Susskind", "url": "http://cvpr.thecvf.com/api/miniconf/users/148925?format=json", "institution": "Apple"}, {"id": 152427, "fullname": "Shuangfei Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/152427?format=json", "institution": "Apple"}], "abstract": "Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems almost exclusively rely on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow-V operates in the spatiotemporal latent space with a global-local architecture which restricts causal dependencies to a global latent space while preserving rich local within-frame interactions. This eases error accumulation over time, a common pitfall of standard autoregressive diffusion model generation. Additionally, we propose flow-score matching, which equips the model with a light-weight causal denoiser to improve the video generation consistency in an autoregressive fashion. To improve the sampling efficiency, STARFlow-V  employs a video-aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to the invertible structure, the same model can natively support text-to-video, image-to-video as well as video-to-video generation tasks. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37509", "url": null, "sourceid": 37116, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36771, "uid": "3a93d176fad19664783fb3511f4f36ec", "name": "META: Meta Evolution of Tool Trajectory Adaptation for Long-Video Understanding", "authors": [{"id": 128208, "fullname": "Jing Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128208?format=json", "institution": "Zhejiang University"}, {"id": 185838, "fullname": "Luyuan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185838?format=json", "institution": "Zhejiang University"}, {"id": 185839, "fullname": "Zhijie Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185839?format=json", "institution": "University of Michigan - Ann Arbor"}, {"id": 86376, "fullname": "Yadong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86376?format=json", "institution": "Ant Group"}, {"id": 185840, "fullname": "Xingzhong Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185840?format=json", "institution": "Ant Group"}, {"id": 185841, "fullname": "Siye Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185841?format=json", "institution": "Alibaba Group"}, {"id": 89719, "fullname": "Jie Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89719?format=json", "institution": "City University of Hong Kong"}, {"id": 107273, "fullname": "Ming Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/107273?format=json", "institution": "Zhejiang University"}, {"id": 128218, "fullname": "Qiang Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128218?format=json", "institution": "Zhejiang University"}], "abstract": "Long-video understanding remains challenging due to extreme temporal redundancy, sparse yet decisive events, and the instability of long-horizon reasoning in visual\u2013language models (VLMs). Existing agent-based methods invoke external micro-tools but remain static, repeatedly rebuilding long chains of fine-grained operations for each task without acquiring reusable multi-step perceptual skills.We propose META, the first training-free agent capable of self-evolving its tool-augmented reasoning. META operates through dual Solving and Evolving loops: it analyzes its own tool trajectories, abstracts recurring multi-step patterns into reusable macro-tools, and distills failed executions into structured failure priors that refine tool usage. Through symbolic consolidation and pruning, META progressively shortens reasoning paths and acquires more general perceptual and temporal abilities\u2014without any parameter updates. META achieves state-of-the-art performance on long-video benchmarks, demonstrating a scalable, model-agnostic paradigm for long-video understanding that can continually evolve without additional training.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36771", "url": null, "sourceid": 37561, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38403, "uid": "7c678b0576f9b98b734c11d7c4f7683f", "name": "AVION: Aerial Vision\u2013Language Instruction from Offline Teacher to Prompt-Tuned Network", "authors": [{"id": 180487, "fullname": "Yu Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180487?format=json", "institution": "University of British Columbia"}, {"id": 152538, "fullname": "Jianyang Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152538?format=json", "institution": "Zhejiang University"}, {"id": 189800, "fullname": "Hao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189800?format=json", "institution": "University of British Columbia"}, {"id": 189801, "fullname": "Yue Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189801?format=json", "institution": "University of British Columbia"}, {"id": 189802, "fullname": "Jozsef Hamari", "url": "http://cvpr.thecvf.com/api/miniconf/users/189802?format=json", "institution": null}, {"id": 189803, "fullname": "Zheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189803?format=json", "institution": "University of British Columbia"}, {"id": 94684, "fullname": "Mohsen Zardadi", "url": "http://cvpr.thecvf.com/api/miniconf/users/94684?format=json", "institution": "TerraSense Analytics"}], "abstract": "Adapting vision-language models to remote sensing imagery remains challenging due to two key factors: limited semantic coverage in textual representations and insufficient adaptability of visual features. These issues are particularly significant in aerial scenes, which involve various visual appearances and fine-grained object distinctions. We propose AVION, a knowledge distillation framework tailored for remote sensing adaptation of vision-language models. The teacher module constructs semantically rich textual prototypes by collecting descriptions from a large language model and verifying validity using remote sensing image features. The student module integrates lightweight and learnable prompts into both vision and language encoders, guided by the teacher to align embeddings and their cross-modal relationships. Once trained, the student operates independently during inference.Experiments on six optical remote sensing benchmarks show that AVION improves few-shot classification and base-class accuracy without degrading generalization to novel categories. It also enhances mean recall for cross-modal retrieval, with minimal additional trainable parameters.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38403", "url": null, "sourceid": 41570, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39864, "uid": "c262b62c257bb4e97a9618c9bf4bacf5", "name": "Helios: Stable Latent Image Modeling for Multimodal Earth Observation", "authors": [{"id": 193007, "fullname": "Henry Herzog", "url": "http://cvpr.thecvf.com/api/miniconf/users/193007?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 148776, "fullname": "Favyen Bastani", "url": "http://cvpr.thecvf.com/api/miniconf/users/148776?format=json", "institution": "Allen Institute for AI"}, {"id": 193008, "fullname": "Yawen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193008?format=json", "institution": "Allen Institute for AI"}, {"id": 139880, "fullname": "Gabriel Tseng", "url": "http://cvpr.thecvf.com/api/miniconf/users/139880?format=json", "institution": "Allen Institute for Artificial Intelligence; McGill University"}, {"id": 193009, "fullname": "Joseph Redmon", "url": "http://cvpr.thecvf.com/api/miniconf/users/193009?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 193010, "fullname": "Hadrien Sablon", "url": "http://cvpr.thecvf.com/api/miniconf/users/193010?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 193011, "fullname": "Ryan Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/193011?format=json", "institution": "Stanford University"}, {"id": 193012, "fullname": "Jacob Morrison", "url": "http://cvpr.thecvf.com/api/miniconf/users/193012?format=json", "institution": "University of Washington; Allen Institute for Artificial Intelligence"}, {"id": 193013, "fullname": "Alexandra Buraczynski", "url": "http://cvpr.thecvf.com/api/miniconf/users/193013?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 193014, "fullname": "Karen Farley", "url": "http://cvpr.thecvf.com/api/miniconf/users/193014?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 193015, "fullname": "Josh Hansen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193015?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 193016, "fullname": "Andrew Howe", "url": "http://cvpr.thecvf.com/api/miniconf/users/193016?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 178513, "fullname": "Patrick Johnson", "url": "http://cvpr.thecvf.com/api/miniconf/users/178513?format=json", "institution": "The Allen Institute for Artificial Intelligence"}, {"id": 193017, "fullname": "Mark Otterlee", "url": "http://cvpr.thecvf.com/api/miniconf/users/193017?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 193018, "fullname": "Ted Schmitt", "url": "http://cvpr.thecvf.com/api/miniconf/users/193018?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 193019, "fullname": "Hunter Pitelka", "url": "http://cvpr.thecvf.com/api/miniconf/users/193019?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 193020, "fullname": "Stephen Daspit", "url": "http://cvpr.thecvf.com/api/miniconf/users/193020?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 193021, "fullname": "Rachel Ratner", "url": "http://cvpr.thecvf.com/api/miniconf/users/193021?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 193022, "fullname": "Christopher Wilhelm", "url": "http://cvpr.thecvf.com/api/miniconf/users/193022?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 193023, "fullname": "Sebastian Wood", "url": "http://cvpr.thecvf.com/api/miniconf/users/193023?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 193024, "fullname": "Mike Jacobi", "url": "http://cvpr.thecvf.com/api/miniconf/users/193024?format=json", "institution": "Allen Institute for Artificial Intelligence"}, {"id": 75560, "fullname": "Hannah Kerner", "url": "http://cvpr.thecvf.com/api/miniconf/users/75560?format=json", "institution": "Arizona State University"}, {"id": 76142, "fullname": "Evan Shelhamer", "url": "http://cvpr.thecvf.com/api/miniconf/users/76142?format=json", "institution": "UBC / Vector"}, {"id": 88931, "fullname": "Ali Farhadi", "url": "http://cvpr.thecvf.com/api/miniconf/users/88931?format=json", "institution": "University of Washington"}, {"id": 84558, "fullname": "Ranjay Krishna", "url": "http://cvpr.thecvf.com/api/miniconf/users/84558?format=json", "institution": "University of Washington"}, {"id": 144676, "fullname": "Patrick Beukema", "url": "http://cvpr.thecvf.com/api/miniconf/users/144676?format=json", "institution": "Allen Institute for Artificial Intelligence"}], "abstract": "Earth observation data presents a unique challenge: it is spatial like images, sequential like video or text, and highly multimodal. We present Helios: a multimodal, spatio-temporal foundation model that employs a novel self-supervised learning formulation, masking strategy, and loss all designed for the Earth observation domain. Helios achieves state-of-the-art performance compared to 12 other foundation models across a variety of research benchmarks and real-world tasks from external partners. When evaluating embeddings Helios achieves the best performance on 15 out of 24 tasks, and with full fine-tuning it is the best on 19 of 29 tasks. We deploy Helios as the backbone of an end-to-end platform for data collection, labeling, training, and inference of Earth observation models. The Helios platform puts frontier foundation models and powerful data management tools into the hands of non-profits and NGOs working to solve the world's biggest problems. Helios source code, training data, and pre-trained weights are available at REDACTED.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39864", "url": null, "sourceid": 46377, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38809, "uid": "67c34b9630469b3c13b7982316ffe7a1", "name": "YieldSAT: A Multimodal Benchmark Dataset for High-Resolution Crop Yield Prediction", "authors": [{"id": 190726, "fullname": "Miro Miranda", "url": "http://cvpr.thecvf.com/api/miniconf/users/190726?format=json", "institution": "German Research Center for Artificial Intelligence"}, {"id": 190727, "fullname": "Deepak Pathak", "url": "http://cvpr.thecvf.com/api/miniconf/users/190727?format=json", "institution": "German Research Center for AI"}, {"id": 190728, "fullname": "Patrick Helber", "url": "http://cvpr.thecvf.com/api/miniconf/users/190728?format=json", "institution": "Vision Impulse GmbH; German Research Center for AI"}, {"id": 190729, "fullname": "Benjamin Bischke", "url": "http://cvpr.thecvf.com/api/miniconf/users/190729?format=json", "institution": null}, {"id": 190730, "fullname": "Hiba Najjar", "url": "http://cvpr.thecvf.com/api/miniconf/users/190730?format=json", "institution": "Universit\u00e4t Kaiserslautern"}, {"id": 190731, "fullname": "Francisco Mena", "url": "http://cvpr.thecvf.com/api/miniconf/users/190731?format=json", "institution": "GFZ Helmholtz Centre for Geosciences; Rheinland-Pf\u00e4lzische Technische Universit\u00e4t"}, {"id": 190732, "fullname": "Cristhian Sanchez", "url": "http://cvpr.thecvf.com/api/miniconf/users/190732?format=json", "institution": null}, {"id": 190733, "fullname": "Akshay Pai", "url": "http://cvpr.thecvf.com/api/miniconf/users/190733?format=json", "institution": "German Research Center for AI"}, {"id": 190734, "fullname": "Diego Arenas", "url": "http://cvpr.thecvf.com/api/miniconf/users/190734?format=json", "institution": "German Research Center for AI"}, {"id": 95125, "fullname": "Matias Valdenegro", "url": "http://cvpr.thecvf.com/api/miniconf/users/95125?format=json", "institution": "University of Groningen"}, {"id": 190735, "fullname": "Marcela Charfuelan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190735?format=json", "institution": "German Research Center for AI"}, {"id": 190736, "fullname": "Marlon Nuske", "url": "http://cvpr.thecvf.com/api/miniconf/users/190736?format=json", "institution": null}, {"id": 86783, "fullname": "Andreas Dengel", "url": "http://cvpr.thecvf.com/api/miniconf/users/86783?format=json", "institution": "DFKI &amp; RPTU"}], "abstract": "Crop yield prediction requires substantial data to train data-driven models. However, creating yield prediction datasets is constrained by high acquisition costs, heterogeneous data quality, and data privacy regulations. Consequently, existing datasets are scarce, low in quality, or limited to regional levels or single crop types. In this work, we release YieldSAT, a large, high-quality, and multimodal dataset for high-resolution crop yield prediction. YieldSAT spans various climate zones across two continents and multiple countries, including Argentina, Brazil, Uruguay, and Germany. The dataset was collected between 2016 and 2024 and comprises four major crop types\u2014corn, rapeseed, soybeans, and wheat\u2014across 2,176 curated fields. In total, over 12.2 million yield samples are available, each with a spatial resolution of $\\SI{10}{m}$. Each field is paired with multispectral satellite imagery, resulting in 113,630 labeled satellite images, complemented by auxiliary environmental data. We demonstrate the potential of large-scale and high-resolution crop yield prediction as an image regression task by comparing various deep learning models and data fusion architectures. Furthermore, we highlight open challenges arising from severe distribution shifts in the ground truth data. To mitigate this, we explore a domain-informed Deep Ensemble that exhibits greater diversity in the weight space.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38809", "url": null, "sourceid": 40184, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37136, "uid": "ce53c47e9ce09161564a02707c0b409b", "name": "Unique Lives, Shared World: Learning from Single-Life Videos", "authors": [{"id": 89189, "fullname": "Tengda Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/89189?format=json", "institution": "University of Oxford, University of Oxford"}, {"id": 186743, "fullname": "Sayna Ebrahimi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186743?format=json", "institution": "Google DeepMind"}, {"id": 129350, "fullname": "Dilara Gokay", "url": "http://cvpr.thecvf.com/api/miniconf/users/129350?format=json", "institution": "Google DeepMind"}, {"id": 186744, "fullname": "Li Yang Ku", "url": "http://cvpr.thecvf.com/api/miniconf/users/186744?format=json", "institution": "Google"}, {"id": 85376, "fullname": "Maks Ovsjanikov", "url": "http://cvpr.thecvf.com/api/miniconf/users/85376?format=json", "institution": "Ecole Polytechnique, France"}, {"id": 186745, "fullname": "Iva Babukova", "url": "http://cvpr.thecvf.com/api/miniconf/users/186745?format=json", "institution": "Google"}, {"id": 137195, "fullname": "Daniel Zoran", "url": "http://cvpr.thecvf.com/api/miniconf/users/137195?format=json", "institution": "Google DeepMind"}, {"id": 128676, "fullname": "Viorica Patraucean", "url": "http://cvpr.thecvf.com/api/miniconf/users/128676?format=json", "institution": "DeepMind"}, {"id": 75596, "fullname": "Joao Carreira", "url": "http://cvpr.thecvf.com/api/miniconf/users/75596?format=json", "institution": "DeepMind"}, {"id": 75512, "fullname": "Andrew Zisserman", "url": "http://cvpr.thecvf.com/api/miniconf/users/75512?format=json", "institution": "University of Oxford"}, {"id": 73556, "fullname": "Dima Damen", "url": "http://cvpr.thecvf.com/api/miniconf/users/73556?format=json", "institution": "University of Bristol and Google DeepMind"}], "abstract": "We introduce the ``single-life\u201d learning paradigm, where we train a distinct vision model exclusively on egocentric videos captured by one individual. We leverage the multiple viewpoints naturally captured within a single life to learn a visual encoder in a self-supervised manner. Our experiments demonstrate three key findings. First, models trained independently on different lives develop a highly aligned geometric understanding. We demonstrate this by training visual encoders on distinct datasets each capturing a different life, both indoors and outdoors, as well as introducing a novel cross-attention-based metric to quantify the functional alignment of the internal representations developed by different models. Second, we show that single-life models learn generalizable geometric representations that effectively transfer to downstream tasks, such as depth estimation, in unseen environments. Third, we demonstrate that training on up to 30 hours from one week of the same person's life leads to comparable performance to training on 30 hours of diverse web data, highlighting the strength of single-life representation learning.  Overall, our results establish that the shared structure of the world, both leads to consistency in models trained on individual lives, and provides a powerful signal for visual representation learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37136", "url": null, "sourceid": 40230, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36687, "uid": "145bc269a39952b5630edd00cb27824d", "name": "Small Object, Great Challenge: A  Benchmark for Small Object Visual Grounding", "authors": [{"id": 182313, "fullname": "Wenqi Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/182313?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 128948, "fullname": "Ruifan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/128948?format=json", "institution": "Beijing University of Post and Telecommunication"}, {"id": 185644, "fullname": "Pengyue Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185644?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 185645, "fullname": "Fangxiang Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185645?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 90236, "fullname": "Zhanyu Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/90236?format=json", "institution": "Beijing University of Post and Telecommunication"}, {"id": 185646, "fullname": "Xiaojie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185646?format=json", "institution": "Beijing University of Post and Telecommunication"}], "abstract": "The task of visual grounding (i.e., VG)  aims to locate or segment objects in images based on referring expressions. Existing research on VG primarily focuses on large objects. However, these images often contain objects at various scales. Although large objects are usually the visual focus, small objects sometimes carry crucial information. To bridge the gap, we propose a novel benchmark for small object visual grounding, i.e., SoVG. Specifically, we introduce an automatic pipeline using MLLMs to build a benchmark dataset. Our pipeline is built on the popular dataset COCO. Thus, we obtain our RefCOCOs dataset. The visual objects in our RefCOCOs have an average area of 1/50 area of an entire image, whereas that of classic VG datasets is 1/5. Furthermore, we propose SoVG-Net with a hierarchical textual infusion module for the novel SoVG task. Finally, we conduct extensive experiments using classic datasets with our RefCOCOs. The results showcase that our built dataset is useful for advancing VG research, and our proposed SoVG-Net is a strong baseline. Our dataset and code will be made publicly available after review.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36687", "url": null, "sourceid": 43402, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39953, "uid": "10fa97233fb64af4b6ce8316d0bc2eca", "name": "pH-Strips for Selective Forgetting: A Blunt but Fast Diagnostic Baseline for Machine Unlearning", "authors": [{"id": 193186, "fullname": "Chengyao Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/193186?format=json", "institution": "Monash University"}, {"id": 193187, "fullname": "Jing Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193187?format=json", "institution": "Monash University"}, {"id": 128675, "fullname": "Trung Le", "url": "http://cvpr.thecvf.com/api/miniconf/users/128675?format=json", "institution": "Monash University"}, {"id": 128648, "fullname": "Dinh Phung", "url": "http://cvpr.thecvf.com/api/miniconf/users/128648?format=json", "institution": "Monash University"}, {"id": 87144, "fullname": "Mehrtash Harandi", "url": "http://cvpr.thecvf.com/api/miniconf/users/87144?format=json", "institution": "Monash University"}], "abstract": "Machine Unlearning (MU), erasing undesirable content from Artificial Intelligence (AI) models, plays an essential role in developing safe and trustworthy AI systems.Despite notable advances, the baseline MU methods rely on retraining from scratch without the data to be removed, which is computationally expensive and financially prohibitive.To address this challenge, we propose a simple yet efficient \\textbf{training-free} and \\textbf{retain-set-free} MU algorithm designed explicitly as a \\textbf{diagnostic baseline}: Machine Unlearning pH-Test (MUpHT).It is designed to serve as a practical evaluation reference for future MU methods.Our method eliminates the low dimensional subspaces associated with undesirable concepts from the space spanned by the model's weight vectors, thereby rendering the model ``blind\" to these undesirable contents. Additionally, we extend our retain-aware variant to handle entangled features by leveraging a generalized Rayleigh quotient over the undesirable and retain sets, enabling an efficient tradeoff between preserving retained knowledge and suppressing undesirable knowledge.Our method enables evaluation of MU across diverse visual tasks, including concept erasure for classification, image generation, and multimodal applications.By producing an unlearned model instantly from only a few samples, our method serves as a quick litmus test for MU.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39953", "url": null, "sourceid": 43537, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40096, "uid": "37fe435a1d7956df247dde078074254b", "name": "Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining", "authors": [{"id": 127688, "fullname": "Junxuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/127688?format=json", "institution": "Meta Reality Labs"}, {"id": 127197, "fullname": "Rawal Khirodkar", "url": "http://cvpr.thecvf.com/api/miniconf/users/127197?format=json", "institution": "Meta"}, {"id": 87620, "fullname": "Egor Zakharov", "url": "http://cvpr.thecvf.com/api/miniconf/users/87620?format=json", "institution": "Meta Reality Labs"}, {"id": 193498, "fullname": "Jihyun Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/193498?format=json", "institution": "Meta"}, {"id": 133394, "fullname": "Zhaoen Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/133394?format=json", "institution": "Facebook"}, {"id": 96465, "fullname": "Yuan Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/96465?format=json", "institution": "Meta"}, {"id": 95418, "fullname": "Julieta Martinez", "url": "http://cvpr.thecvf.com/api/miniconf/users/95418?format=json", "institution": "Reality Labs Research"}, {"id": 193499, "fullname": "Kai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193499?format=json", "institution": "Meta Platforms, Inc."}, {"id": 168292, "fullname": "Qingyang Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/168292?format=json", "institution": "Facebook"}, {"id": 89936, "fullname": "Takaaki Shiratori", "url": "http://cvpr.thecvf.com/api/miniconf/users/89936?format=json", "institution": "Meta Reality Labs Research"}, {"id": 193500, "fullname": "Matthew Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193500?format=json", "institution": "Facebook"}, {"id": 193501, "fullname": "Peihong Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/193501?format=json", "institution": "Facebook"}, {"id": 193502, "fullname": "Xuhua Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193502?format=json", "institution": "Meta Platforms Inc"}, {"id": 152348, "fullname": "Zhongshi Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152348?format=json", "institution": "Meta"}, {"id": 193503, "fullname": "LINGCHEN YANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/193503?format=json", "institution": "Facebook"}, {"id": 193504, "fullname": "Ariyan Zarei", "url": "http://cvpr.thecvf.com/api/miniconf/users/193504?format=json", "institution": "Facebook"}, {"id": 193505, "fullname": "Marco Pesavento", "url": "http://cvpr.thecvf.com/api/miniconf/users/193505?format=json", "institution": "Facebook"}, {"id": 132004, "fullname": "Yichen Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/132004?format=json", "institution": "Meta platforms inc"}, {"id": 193506, "fullname": "Chengan He", "url": "http://cvpr.thecvf.com/api/miniconf/users/193506?format=json", "institution": "Facebook"}, {"id": 92442, "fullname": "He Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/92442?format=json", "institution": "Meta Platformts, Inc."}, {"id": 159460, "fullname": "Giljoo Nam", "url": "http://cvpr.thecvf.com/api/miniconf/users/159460?format=json", "institution": "Meta"}, {"id": 153790, "fullname": "Teng Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/153790?format=json", "institution": "Meta"}, {"id": 193507, "fullname": "Wyatt Borsos", "url": "http://cvpr.thecvf.com/api/miniconf/users/193507?format=json", "institution": "Meta"}, {"id": 193508, "fullname": "Anjali Thakrar", "url": "http://cvpr.thecvf.com/api/miniconf/users/193508?format=json", "institution": "Meta"}, {"id": 75515, "fullname": "Jean-Charles Bazin", "url": "http://cvpr.thecvf.com/api/miniconf/users/75515?format=json", "institution": "Meta"}, {"id": 193509, "fullname": "Rinat Abdrashitov", "url": "http://cvpr.thecvf.com/api/miniconf/users/193509?format=json", "institution": "Facebook"}, {"id": 193510, "fullname": "Carsten Stoll", "url": "http://cvpr.thecvf.com/api/miniconf/users/193510?format=json", "institution": "Meta"}, {"id": 193511, "fullname": "Gin\u00e9s Hidalgo", "url": "http://cvpr.thecvf.com/api/miniconf/users/193511?format=json", "institution": "Meta"}, {"id": 193512, "fullname": "James Booth", "url": "http://cvpr.thecvf.com/api/miniconf/users/193512?format=json", "institution": "Meta"}, {"id": 193513, "fullname": "Lucy Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193513?format=json", "institution": "Meta"}, {"id": 176478, "fullname": "Xiaowen Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/176478?format=json", "institution": "Meta"}, {"id": 153791, "fullname": "Yu Rong", "url": "http://cvpr.thecvf.com/api/miniconf/users/153791?format=json", "institution": "Meta Reality Labs Research"}, {"id": 193514, "fullname": "Sairanjith Thalanki", "url": "http://cvpr.thecvf.com/api/miniconf/users/193514?format=json", "institution": "Facebook"}, {"id": 89875, "fullname": "Chen Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/89875?format=json", "institution": "Facebook"}, {"id": 193515, "fullname": "Christian H\u00e4ne", "url": "http://cvpr.thecvf.com/api/miniconf/users/193515?format=json", "institution": "Meta"}, {"id": 88930, "fullname": "Abhishek Kar", "url": "http://cvpr.thecvf.com/api/miniconf/users/88930?format=json", "institution": "Google"}, {"id": 175438, "fullname": "Sofien Bouaziz", "url": "http://cvpr.thecvf.com/api/miniconf/users/175438?format=json", "institution": "Meta"}, {"id": 193516, "fullname": "Jason Saragih", "url": "http://cvpr.thecvf.com/api/miniconf/users/193516?format=json", "institution": "Meta"}, {"id": 193517, "fullname": "Yaser Sheikh", "url": "http://cvpr.thecvf.com/api/miniconf/users/193517?format=json", "institution": "Meta"}, {"id": 89935, "fullname": "Shunsuke Saito", "url": "http://cvpr.thecvf.com/api/miniconf/users/89935?format=json", "institution": "Reality Labs Research"}], "abstract": "High-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but it struggles to generalize to real-world data due to limited scale and the domain gap between the studio environment and the real world. On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low-quality due to inherent 3D ambiguities. To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner, enabling efficient inference. Inspired by the success of large language models and vision foundation models, we present, for the first time, a pre/post-training paradigm for 3D avatar modeling at scale: we pretrain on 1M in-the-wild videos to learn broad priors over appearance and geometry, then post-train on high-quality curated data to enhance expressivity and fidelity. LCA generalizes across hair styles, clothing, and demographics while providing precise, fine-grained facial expressions and finger-level articulation control, with strong identity preservation. Notably, we observe emergent generalization to relightability and loose garment support to unconstrained inputs, and zero-shot robustness to stylized imagery, despite the absence of direct supervision.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40096", "url": null, "sourceid": 34646, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36292, "uid": "64bb4061482a669cede062dfe81b88e9", "name": "Kaleidoscopic Scintillation Event Imaging", "authors": [{"id": 183368, "fullname": "Alex Bocchieri", "url": "http://cvpr.thecvf.com/api/miniconf/users/183368?format=json", "institution": "University of Wisconsin, Madison"}, {"id": 154563, "fullname": "John Mamish", "url": "http://cvpr.thecvf.com/api/miniconf/users/154563?format=json", "institution": "Georgia Institute of Technology"}, {"id": 184706, "fullname": "David Appleyard", "url": "http://cvpr.thecvf.com/api/miniconf/users/184706?format=json", "institution": "Ubicept"}, {"id": 184707, "fullname": "Andreas Velten", "url": "http://cvpr.thecvf.com/api/miniconf/users/184707?format=json", "institution": "University of Wisconsin - Madison"}], "abstract": "Scintillators are transparent materials that interact with high-energy particles and emit visible light as a result. They are used in state of the art methods of measuring high-energy particles and radiation sources.Most existing methods use fast single-pixel detectors to detect and time scintillation events.Cameras provide spatial resolution but can only capture an average over many events, making it difficult to image the events associated with an individual particle.Emerging single-photon avalanche diode cameras combine speed and spatial resolution to enable capturing images of individual events.This allows us to use machine vision techniques to analyze events, enabling new types of detectors.The main challenge is the very low brightness of the events.Techniques have to work with a very limited number of photons.We propose a kaleidoscopic scintillator to increase light collection in a single-photon camera while preserving the event's spatial information.The kaleidoscopic geometry creates mirror reflections of the event in known locations for a given event location that are captured by the camera.We introduce theory for imaging an event in a kaleidoscopic scintillatorand an algorithm to estimate the event's 3D position.We find that the kaleidoscopic scintillator design provides sufficient light collection to perform high-resolution event measurements for advanced radiation imaging techniques using a commercial CMOS single-photon camera.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36292", "url": null, "sourceid": 39614, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37238, "uid": "dcdd0d62a00c7ccf110885b9275419cf", "name": "VisionLeaf: Entropy-Guided Leaf-First Reasoning for Efficient and Accurate Think-with-Image", "authors": [{"id": 186988, "fullname": "Haokun GUI", "url": "http://cvpr.thecvf.com/api/miniconf/users/186988?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 158206, "fullname": "Senqiao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158206?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 186989, "fullname": "Mingkang Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186989?format=json", "institution": "Chinese University of Hong Kong"}, {"id": 184074, "fullname": "Meng Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184074?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 128180, "fullname": "WU Sitong", "url": "http://cvpr.thecvf.com/api/miniconf/users/128180?format=json", "institution": "Department of Computer Science and Engineering, The Chinese University of Hong Kong"}, {"id": 186990, "fullname": "Changsheng Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186990?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 186991, "fullname": "Zihao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186991?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 89574, "fullname": "Zhuotao Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/89574?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 154575, "fullname": "Jiaya Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/154575?format=json", "institution": "Department of Computer Science and Engineering, Hong Kong University of Science and Technology"}], "abstract": "The \"think-with-image\u201d paradigm has recently gained traction for complex visual reasoning tasks. However, existing approaches often struggle with inference inefficiency due to a fixed number of redundant reasoning steps, as well as training instability.This challenge primarily arises from the direct use of standard reinforcement learning policies, which do not incorporate improvements for the think-with-image multi-turn conversational scenario.To address this challenge, we propose VisionLeaf, an entropy-guided, tree-based reasoning framework. Unlike conventional GRPO, where all nodes expand from the root and each leaf has only a single branch, our method grows the reasoning tree from the leaf nodes and selects the most valuable nodes based on entropy for thorough rollout exploration. This leaf-first expansion naturally aligns with the hierarchical nature of multi-step image analysis. Without modifying any model or training data, our VisionLeaf achieves a 4.2\\% performance improvement on benchmarks such as VSTAR and HRBench, while reducing the number of inference rounds by nearly half\u2014demonstrating significant gains in both accuracy and speed. All our code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37238", "url": null, "sourceid": 32241, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37525, "uid": "73103ddd969c87aee1a6cfe805b7dbd4", "name": "VisionDirector: Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis", "authors": [{"id": 184074, "fullname": "Meng Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184074?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 158206, "fullname": "Senqiao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158206?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 76190, "fullname": "Haoxuan Che", "url": "http://cvpr.thecvf.com/api/miniconf/users/76190?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 187642, "fullname": "Suiyun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187642?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 187643, "fullname": "Xichen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187643?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 90762, "fullname": "Shaozuo Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90762?format=json", "institution": "Department of Computer Science and Engineering, The Chinese University of Hong Kong"}, {"id": 186988, "fullname": "Haokun GUI", "url": "http://cvpr.thecvf.com/api/miniconf/users/186988?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 187644, "fullname": "Zhefan Rao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187644?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 187645, "fullname": "Dandan Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187645?format=json", "institution": null}, {"id": 187646, "fullname": "Rui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187646?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 154575, "fullname": "Jiaya Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/154575?format=json", "institution": "Department of Computer Science and Engineering, Hong Kong University of Science and Technology"}], "abstract": "Generative models can now produce photorealistic imagery, yet they still struggle with the long, multi-goal prompts that professional designers issue. To expose this gap and better evaluate models\u2019 performance in real-world, we introduce Long Goal Bench(**LGBench**), a 2000-task suite (1000 T2I, 1000 I2I) whose average instruction contains 18---22 tightly coupled goals spanning global layout, local object placement, typography, and logo fidelity. We find even state-of-the-art commercial APIs satisfy fewer than 72\\% of the goals and routinely miss localized edits, confirming the brittleness of current pipelines. To address this, we present **VisionDirector**, a training-free, vision-language supervisor that (i) extracts structured goals from long instructions, (ii) dynamically decides between one-shot generation and staged edits, (iii) runs micro-grid sampling plus semantic verification/rollback after every edit, and (iv) logs goal-level rewards. We further fine-tune the planner with Group Relative Policy Optimization, yielding shorter edit trajectories (3.1 vs.\\ 4.2 steps) and stronger alignment. VisionDirector achieves new state of the art on GenEval (+7\\% overall), and ImgEdit (+0.07 absolute) while producing consistent qualitative improvements on typography, multi-object scenes, and pose editing. The code, benchmark, and evaluation scripts will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37525", "url": null, "sourceid": 46059, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40333, "uid": "9c8c3c20976adbc398d5366a0cd2f817", "name": "MeshSplatting: Differentiable Rendering with Opaque Meshes", "authors": [{"id": 154594, "fullname": "Jan Held", "url": "http://cvpr.thecvf.com/api/miniconf/users/154594?format=json", "institution": "ULiege / KAUST"}, {"id": 189791, "fullname": "Sanghyun Son", "url": "http://cvpr.thecvf.com/api/miniconf/users/189791?format=json", "institution": "University of Maryland, College Park"}, {"id": 154595, "fullname": "Renaud Vandeghen", "url": "http://cvpr.thecvf.com/api/miniconf/users/154595?format=json", "institution": "University of Li\u00e8ge"}, {"id": 132581, "fullname": "Daniel Rebain", "url": "http://cvpr.thecvf.com/api/miniconf/users/132581?format=json", "institution": "Wayve; University of British Columbia"}, {"id": 106544, "fullname": "Matheus Gadelha", "url": "http://cvpr.thecvf.com/api/miniconf/users/106544?format=json", "institution": "Adobe Systems"}, {"id": 88688, "fullname": "Yi Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/88688?format=json", "institution": "Adobe Systems"}, {"id": 92295, "fullname": "Anthony Cioppa", "url": "http://cvpr.thecvf.com/api/miniconf/users/92295?format=json", "institution": "Universit\u00e9 de Li\u00e8ge"}, {"id": 150899, "fullname": "Ming Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/150899?format=json", "institution": "University of Maryland, College Park"}, {"id": 71630, "fullname": "Marc Van Droogenbroeck", "url": "http://cvpr.thecvf.com/api/miniconf/users/71630?format=json", "institution": "University of Li\u00e8ge"}, {"id": 126249, "fullname": "Andrea Tagliasacchi", "url": "http://cvpr.thecvf.com/api/miniconf/users/126249?format=json", "institution": "Simon Fraser University, Google Brain"}], "abstract": "Primitive-based splatting methods like 3D Gaussian Splatting (3DGS) have revolutionized novel view synthesis with real-time rendering.However, their point-based representations remain incompatible with mesh-based pipelines that power AR/VR and game engines. We present MeshSplatting, a mesh-based reconstruction approach that jointly optimizes geometry and appearance through differentiable rendering.By enforcing connectivity via restricted Delaunay triangulation and refining surface consistency, MeshSplatting creates end-to-end smooth, visually high-quality meshes that render efficiently in real-time 3D engines.On Mip-NeRF360, it boosts PSNR by +0.69 dB over the current state-of-the-art MiLo for mesh-based novel view synthesis, while training 2x faster and using 2x less memory, bridging neural rendering and interactive 3D graphics for seamless real-time scene interaction.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40333", "url": null, "sourceid": -36660, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38397?format=json"], "related_events_ids": [38397]}, {"id": 39075, "uid": "f5b49df0ec42774c1c13ef6f93d7865c", "name": "VL-Eraser: Vacuum Distillation for Machine Unlearning in Vision-Language Models", "authors": [{"id": 181451, "fullname": "Yili Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181451?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)."}, {"id": 191315, "fullname": "Lu Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/191315?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 191316, "fullname": "Tairan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191316?format=json", "institution": "Central South University"}, {"id": 191317, "fullname": "Yijie Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191317?format=json", "institution": "Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 187183, "fullname": "Hui Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187183?format=json", "institution": "Hong Kong University of Science and Technology (Guangzhou)"}], "abstract": "Machine unlearning (MU) aims to remove sensitive or undesired content from pre-trained models. Existing MU methods are commonly characterized as gradually degrading model performance on undesired data to realize approximate forgetting. Despite their successes, the effectiveness in multimodal unlearning tasks remains largely unexplored. In this paper, we first conduct an in-depth analysis and reveal that traditional MU methods tend to disrupt cross-modal alignment, leading to incomplete forgetting in multimodal scenarios. To tackle this challenge, we propose VL-Eraser, a novel unlearning paradigm for VLM unlearning. VL-Eraser reformulates unlearning in VLMs as a two-stage process: distillation and deletion. Specifically, VL-Eraser first introduces a vacuum distillation that disentangles undesired knowledge from the intricate parameters of VLMs and transfers it into low-rank adapters (LoRA). After distillation, unlearning is efficiently achieved by deleting the LoRA parameters from the original model. Extensive experiments across multiple benchmarks demonstrate that VL-Eraser achieves superior unlearning performance while preserving utility compared to the state-of-the-art baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39075", "url": null, "sourceid": 34518, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36613, "uid": "effa6c4644b3dcb6a4ab6159e328f99e", "name": "GenTract: Generative Global Tractography", "authors": [{"id": 183975, "fullname": "Alec Sargood", "url": "http://cvpr.thecvf.com/api/miniconf/users/183975?format=json", "institution": "UCL"}, {"id": 185466, "fullname": "Lemuel Puglisi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185466?format=json", "institution": "University of Catania"}, {"id": 185467, "fullname": "Elinor Thompson", "url": "http://cvpr.thecvf.com/api/miniconf/users/185467?format=json", "institution": "Hawkes Institute, University College London"}, {"id": 185468, "fullname": "Mirco Musolesi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185468?format=json", "institution": "University College London, University of London; University of Bologna"}, {"id": 185469, "fullname": "Daniel C. Alexander", "url": "http://cvpr.thecvf.com/api/miniconf/users/185469?format=json", "institution": "University College London"}], "abstract": "Tractography is the process of inferring the trajectories of white-matter pathways in the brain from diffusion magnetic resonance imaging (dMRI). Local tractography methods, which construct streamlines by following local fiber orientation estimates stepwise through an image, are prone to error accumulation and high false positive rates, particularly on noisy or low-resolution data. In contrast, global methods, which attempt to optimize a collection of streamlines to maximize compatibility with underlying fiber orientation estimates, are computationally expensive. To address these challenges, we introduce GenTract, the first generative model for global tractography. We frame tractography as a generative task, learning a direct mapping from dMRI to complete, anatomically plausible streamlines. We compare both diffusion-based and flow matching paradigms and evaluate GenTract\u2019s performance against state-of-the-art baselines. Notably, GenTract achieves precision 2.1$\\times$ higher than the next-best method, TractOracle. This advantage becomes even more pronounced in challenging low-resolution and noisy settings, where it outperforms the closest competitor by an order of magnitude. By producing tractograms with high precision on research-grade data while also maintaining reliability on imperfect, lower-resolution data, GenTract represents a promising solution for global tractography.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36613", "url": null, "sourceid": 38819, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36430, "uid": "4ab232445f9b21b65dfdf6ea5f27f704", "name": "SemLayer: Semantic Generative Segmentation and Layer Reconstruction for Vector Icons", "authors": [{"id": 152992, "fullname": "Haiyang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152992?format=json", "institution": "University of California, San Diego"}, {"id": 157435, "fullname": "Ronghuan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157435?format=json", "institution": "City University of Hong Kong"}, {"id": 185025, "fullname": "Li-Yi Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/185025?format=json", "institution": "Adobe Systems"}, {"id": 88458, "fullname": "Nanxuan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88458?format=json", "institution": "Adobe Research"}, {"id": 185026, "fullname": "Chenxi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185026?format=json", "institution": "University of Toronto"}, {"id": 185027, "fullname": "Cuong Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185027?format=json", "institution": "Adobe Systems"}, {"id": 84901, "fullname": "Zhuowen Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84901?format=json", "institution": "University of California, San Diego"}, {"id": 86829, "fullname": "Zhaowen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86829?format=json", "institution": "Adobe Research"}], "abstract": "Graphic icons are a cornerstone of modern design workflows, yet they are often distributed as flattened single-path or compound-path graphics, where the original semantic layering is lost. This absence of semantic decomposition hinders downstream tasks such as editing, restyling, and animation. We formalize this problem as semantic layer construction for flattened vector art and introduce SemLayer, a visual generation\u2013based pipeline that restores editable layered structures. Given an abstract icon, SemLayer first generates a chromatically differentiated representation in which distinct semantic components become visually separable. To recover the complete geometry of each part, including occluded regions, we then perform a semantic completion step that reconstructs coherent object-level shapes. Finally, the recovered parts are assembled into a layered vector representation with inferred occlusion relationships. Extensive qualitative comparisons and quantitative evaluations demonstrate the effectiveness of SemLayer, enabling editing workflows previously inaccessible for flattened vector graphics and establishing semantic layer reconstruction as a practical and valuable task.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36430", "url": null, "sourceid": 33853, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36249, "uid": "0aa519b9fd9be846fcedee48dbdb44fd", "name": "CVA: Context-aware Video-text Alignment for Video Temporal Grounding", "authors": [{"id": 184578, "fullname": "Sungho Moon", "url": "http://cvpr.thecvf.com/api/miniconf/users/184578?format=json", "institution": "Daegu Gyeongbuk Institute of Science and Technology"}, {"id": 157090, "fullname": "Seunghun Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/157090?format=json", "institution": "Daegu Gyeongbuk Institute of Science and Technology"}, {"id": 184579, "fullname": "Jiwan Seo", "url": "http://cvpr.thecvf.com/api/miniconf/users/184579?format=json", "institution": "Daegu Gyeongbuk Institute of Science and Technology"}, {"id": 86689, "fullname": "Sunghoon Im", "url": "http://cvpr.thecvf.com/api/miniconf/users/86689?format=json", "institution": "DGIST"}], "abstract": "We propose Context-aware Video-text Alignment (CVA), a novel framework to address a significant challenge in video temporal grounding: achieving temporally sensitive video-text alignment that remains robust to irrelevant background context. %CVA integrates complementary data-centric and architectural innovations to enhance contextual robustness. Our framework is built on three key components. First, we propose Query-aware Context Diversification (QCD), a new data augmentation strategy that ensures only semantically unrelated content is mixed in. It builds a video-text similarity-based pool of replacement clips to simulate diverse contexts while preventing the false negative caused by query-agnostic mixing. Second, we introduce the Context-invariant Boundary Discrimination (CBD) loss, a contrastive loss that enforces semantic consistency at challenging temporal boundaries, making their representations robust to contextual shifts and hard negatives. Third, we introduce the Context-enhanced Transformer Encoder (CTE), a hierarchical architecture that combines windowed self-attention and bidirectional cross-attention with learnable queries to capture multi-scale temporal context. Through the synergy of these data-centric and architectural enhancements, CVA achieves state-of-the-art performance on major VTG benchmarks, including QVHighlights and Charades-STA. Notably, our method achieves a significant improvement of approximately 5 points in Recall@1 (R1) scores over state-of-the-art methods, highlighting its effectiveness in mitigating false negatives.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36249", "url": null, "sourceid": 38005, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38270, "uid": "f9639421ad8666783822bcb2e5b8eb8b", "name": "SRGCD: Stability-Driven Region Growth Framework for 3D Change Detection", "authors": [{"id": 131840, "fullname": "Yue Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131840?format=json", "institution": "Xidian University"}, {"id": 189468, "fullname": "Tao Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189468?format=json", "institution": "Xidian University"}, {"id": 131839, "fullname": "Yongzhe Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/131839?format=json", "institution": "Xidian University"}, {"id": 188828, "fullname": "Kaiyuan Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188828?format=json", "institution": ""}, {"id": 188827, "fullname": "Hao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188827?format=json", "institution": "Xidian University; Xidian University"}, {"id": 131847, "fullname": "Maoguo Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/131847?format=json", "institution": "Xidian University"}, {"id": 85385, "fullname": "Qiguang Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/85385?format=json", "institution": "Xidian University"}, {"id": 131849, "fullname": "Wenping Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/131849?format=json", "institution": "Xidian University"}], "abstract": "With the growing accessibility of large-scale 3D point clouds from LiDAR and photogrammetric techniques, 3D change detection (3DCD) has become essential for understanding dynamic scenes. Existing methods typically formulate this as segmentation, treating each point independently for binary classification. This leads to isolated misclassified noise points inside regions. Meanwhile, feature similarity at boundaries causes boundary ambiguity. The more severe class imbalance inherent to change detection further exacerbates this issue. To address these challenges, we propose SRGCD, a Stability-Driven Region Growth Framework that redefines 3DCD as region growing rather than segmentation. Our key insight is that progressively expanding from highly confident seeds avoids pitfalls of point-wise classification while elegantly alleviating class imbalance. Specifically, we first apply strict constraints through Mutual Geometric Consistency Prior to identify minimal highly reliable unchanged seeds. From these seeds, Stability-Guided Controlled Attention modules progressively propagate stability from stable regions to neighboring uncertain points, enabling unchanged regions to grow layer-by-layer from interior cores toward boundaries. This coarse-to-fine growing process naturally forms coherent regions, avoiding isolated noise while achieving compact, well-defined boundaries through progressive expansion. Extensive experiments on the synthetic dataset Urb3DCD and the real-world dataset HKCD demonstrate that SRGCD achieves state-of-the-art performance, significantly improving interior completeness and boundary compactness compared with existing methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38270", "url": null, "sourceid": 44040, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37282, "uid": "497aa134143557a83712fcfa7c503eb0", "name": "Meta-FC: Meta-Learning with Feature Consistency for Robust and Generalizable Watermarking", "authors": [{"id": 181094, "fullname": "Yuheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181094?format=json", "institution": "Yangzhou University"}, {"id": 187076, "fullname": "Weitong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187076?format=json", "institution": "Yangzhou University"}, {"id": 154874, "fullname": "chengcheng zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154874?format=json", "institution": "Yangzhou University"}, {"id": 154875, "fullname": "Jiale Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154875?format=json", "institution": "Yangzhou University"}, {"id": 187077, "fullname": "Chunpeng Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/187077?format=json", "institution": "Shandong University"}, {"id": 184154, "fullname": "Di Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184154?format=json", "institution": "La Trobe University"}, {"id": 153665, "fullname": "Guodong Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/153665?format=json", "institution": "University of Technology Sydney"}], "abstract": "Deep learning-based watermarking has made remarkable progress in recent years. To achieve robustness against various distortions, current methods commonly adopt a training strategy where a \\underline{\\textbf{s}}ingle \\underline{\\textbf{r}}andom \\underline{\\textbf{d}}istortion (SRD) is chosen as the noise layer in each training batch. However, the SRD strategy treats distortions independently within each batch, neglecting the inherent relationships among different types of distortions and causing optimization conflicts across batches.As a result, the robustness and generalizability of the watermarking model are limited. To address this issue, we propose a novel training strategy that enhances robustness and generalization via \\underline{\\textbf{meta}}-learning with \\underline{\\textbf{f}}eature \\underline{\\textbf{c}}onsistency (Meta-FC). Specifically, we randomly sample multiple distortions from the noise pool to construct a meta-training task, while holding out one distortion as a simulated ``unknown'' distortion for the meta-testing phase.Through meta-learning, the model is encouraged to identify and utilize neurons that exhibit stable activations across different types of distortions, mitigating the optimization conflicts caused by the random sampling of diverse distortions in each batch.To further promote the transformation of stable activations into distortion-invariant representations, we introduce a feature consistency loss that constrains the decoded features of the same image subjected to different distortions to remain consistent.Extensive experiments demonstrate that, compared to the SRD training strategy, Meta-FC improves the robustness and generalization of various watermarking models by an average of 1.59\\%, 4.71\\%, and 2.38\\% under high-intensity, combined, and unknown distortions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37282", "url": null, "sourceid": 37573, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39550, "uid": "86800fe41915de0ec2b86fe688662acd", "name": "Dual-Level Hypergraph Generation for Addressing Feature Scarcity in Whole-Slide Image Classification", "authors": [{"id": 192327, "fullname": "Shuilian Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192327?format=json", "institution": "Dalian University of Technology"}, {"id": 130390, "fullname": "Qi Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/130390?format=json", "institution": "Dalian University of Technology"}, {"id": 107617, "fullname": "Qi Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/107617?format=json", "institution": "Dalian University of Technology"}, {"id": 192328, "fullname": "Zhang pengshuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192328?format=json", "institution": "Dalian University of Technology"}, {"id": 192329, "fullname": "Lili Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/192329?format=json", "institution": "liaoning Cancer of China Medical University Shenyang"}, {"id": 130404, "fullname": "Weimin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130404?format=json", "institution": "Dalian University of Techonoly"}, {"id": 192330, "fullname": "Yanmei Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192330?format=json", "institution": "Liaoning cancer of China Medical University Shenyang"}, {"id": 192331, "fullname": "Bo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192331?format=json", "institution": "Facebook"}, {"id": 88008, "fullname": "Xin Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88008?format=json", "institution": "Dalian University of Technology"}], "abstract": "Lymph node metastasis diagnosis in pathological images is a highly challenging four-class classification task, comprising macrometastasis, micrometastasis, isolated tumor cells (ITC), and negative lesions.Unlike conventional classification settings, this four-class scenario simultaneously suffers from inter-class and intra-slide scarcity of minority information.Existing approaches based on CNNs or GNNs primarily emphasize node-level feature learning, making it difficult to capture high-order feature interactions and topological dependencies among cells, while also overlooking the representational insufficiency induced by class scarcity.To address these challenges, we propose a dual-level generative framework that integrates class-prompt priors with high-order structural modeling to enhance the representation capacity of minority classes.At the hypergraph level, we develop a prompt-guided hierarchical hypergraph variational autoencoder (HGVAE) capable of generating diverse and topologically consistent hypergraph representations for minority classes.At the hypernode level, we introduce an anchor-diffusion mixup strategy to enrich the minority node features of high-attention positive anchor nodes.Extensive experiments on the four-class NIMM dataset, as well as TCGA datasets, demonstrate that the proposed framework effectively alleviates feature scarcity and significantly boosts the classification performance of minority classes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39550", "url": null, "sourceid": 39744, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39012, "uid": "97652673df105b7ad2ba940585e53500", "name": "Scaling Dense Event-Stream Pretraining from Visual Foundation Models", "authors": [{"id": 191177, "fullname": "Zhiwen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191177?format=json", "institution": "City University of Hong Kong"}, {"id": 128772, "fullname": "Junhui Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/128772?format=json", "institution": "City University of Hong Kong"}, {"id": 104193, "fullname": "Zhiyu Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/104193?format=json", "institution": "City University of Hong Kong"}, {"id": 90655, "fullname": "Jinjian Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90655?format=json", "institution": "Xidian University"}, {"id": 86458, "fullname": "Guangming Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/86458?format=json", "institution": "Xidian University"}], "abstract": "Learning versatile, fine-grained representations from irregular event streams is pivotal yet nontrivial, primarily due to the heavy annotation that hinders scalability in dataset size, semantic richness, and application scope. To mitigate this dilemma, we launch a novel self-supervised pretraining method that distills visual foundation models (VFMs) to push the boundaries of event representation at scale. Specifically, we curate an extensive synchronized image-event collection to amplify cross-modal alignment. Nevertheless, due to inherent mismatches in sparsity and granularity between image-event domains, existing distillation paradigms are prone to semantic collapse in event representations, particularly at high resolutions. To bridge this gap, we propose to extend the alignment objective to semantic structures provided off-the-shelf by VFMs, indicating a broader receptive field and stronger supervision. The key ingredient of our method is a structure-aware distillation loss that grounds higher-quality image-event correspondences for alignment, optimizing dense event representations. Extensive experiments demonstrate that our approach takes a great leap in downstream benchmarks, significantly surpassing traditional methods and existing pretraining techniques. This breakthrough manifests in enhanced generalization, superior data efficiency and elevated transferability. The source code will be available.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39012", "url": null, "sourceid": 31354, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39657, "uid": "037adb4f3aa1d0bad47958c8bc165985", "name": "Edit-aware RAW reconstruction", "authors": [{"id": 192581, "fullname": "Abhijith Punnappurath", "url": "http://cvpr.thecvf.com/api/miniconf/users/192581?format=json", "institution": "AI Center - Toronto, Samsung Electronics"}, {"id": 93060, "fullname": "Luxi Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/93060?format=json", "institution": "York University"}, {"id": 175284, "fullname": "Ke Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/175284?format=json", "institution": "Samsung AI Center - Toronto"}, {"id": 192582, "fullname": "Hue Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192582?format=json", "institution": "Samsung AI Center - Toronto"}, {"id": 192583, "fullname": "Radek Grzeszczuk", "url": "http://cvpr.thecvf.com/api/miniconf/users/192583?format=json", "institution": "Samsung"}, {"id": 89869, "fullname": "Michael S. Brown", "url": "http://cvpr.thecvf.com/api/miniconf/users/89869?format=json", "institution": "Samsung / York"}], "abstract": "Users frequently edit camera images post-capture to achieve their preferred photofinishing style. While editing in the RAW domain provides greater accuracy and flexibility, most edits are performed on the camera\u2019s display-referred output (e.g., 8-bit sRGB JPEG) since RAW images are rarely stored. Existing RAW reconstruction methods can recover RAW data from sRGB images, but these approaches are typically optimized for pixel-wise RAW reconstruction fidelity and tend to degrade under diverse rendering styles and editing operations.  We introduce a plug-and-play, edit-aware loss function that can be integrated into any existing RAW reconstruction framework to make the recovered RAWs more robust to different rendering styles and edits. Our loss formulation incorporates a modular, differentiable image signal processor (ISP) that simulates realistic photofinishing pipelines with tunable parameters. During training, parameters for each ISP module are randomly sampled from carefully designed distributions that model practical variations in real camera processing. The loss is then computed in sRGB space between ground-truth and reconstructed RAWs rendered through this differentiable ISP. Incorporating our loss improves sRGB reconstruction quality by up to 1.5\u20132 dB PSNR across various editing conditions. Moreover, when applied to metadata-assisted RAW reconstruction methods, our approach enables fine-tuning for target edits, yielding further gains. Since photographic editing is the primary motivation for RAW reconstruction in consumer imaging, our simple yet effective loss function provides a general mechanism for enhancing edit fidelity and rendering flexibility across existing methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39657", "url": null, "sourceid": 42257, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38481, "uid": "87f3c42a65381d7184324c73cf41a400", "name": "Perceptual 3D Simulation With Physical World Modeling", "authors": [{"id": 182471, "fullname": "Wanhee Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/182471?format=json", "institution": "Stanford University"}, {"id": 187521, "fullname": "Klemen Kotar", "url": "http://cvpr.thecvf.com/api/miniconf/users/187521?format=json", "institution": "Stanford University"}, {"id": 187520, "fullname": "Rahul Mysore Venkatesh", "url": "http://cvpr.thecvf.com/api/miniconf/users/187520?format=json", "institution": "Stanford University"}, {"id": 187526, "fullname": "Jared Watrous", "url": "http://cvpr.thecvf.com/api/miniconf/users/187526?format=json", "institution": null}, {"id": 187528, "fullname": "Daniel LK Yamins", "url": "http://cvpr.thecvf.com/api/miniconf/users/187528?format=json", "institution": "Stanford University"}], "abstract": "Predicting how a scene will evolve after a desired 3D transformation from images is a central goal in vision, graphics, and robotics. Yet unlike ideal simulators with full access to 3D geometry and dynamics, real world systems must rely on perceptual inputs and local actions that are inherently partial and incomplete. In this work, we present P3Sim, a physical world modeling system that simulates future scene states under both partial observations and incomplete 3D transformation signals. P3Sim is composed of three interacting components: a learned physical world model, a geometric conditioning module, and a persistent scene memory. The world model interprets perception as probabilistic inference over multimodal scene variables, providing predictions of the distributions of any scene variable conditioned on any combination of others. The geometric conditioning module provides a partial 3D transform signal for conditioning the world model at inference time. The persistent scene memory integrates predictions over time, enabling online updates and consistency under uncertainty. By combining learned inference with explicit geometric structure, P3Sim balances data-driven flexibility with built-in inductive bias. This design yields a flexible perceptual simulator that generalizes across diverse 3D transformation tasks, such as novel view synthesis, object manipulation, and dynamic scene prediction, advancing toward general purpose 3D scene understanding and transformation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38481", "url": null, "sourceid": 39730, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40070, "uid": "81ed4d7130d22722194f33b1ab0f937c", "name": "Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models", "authors": [{"id": 193438, "fullname": "Zhirong Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193438?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 181493, "fullname": "Rui Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181493?format=json", "institution": "UESTC"}, {"id": 186336, "fullname": "Jiacheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186336?format=json", "institution": "Shanghai Jiaotong University; Tencent; Shandong University"}, {"id": 179949, "fullname": "Chang Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/179949?format=json", "institution": "Shanghai Jiao Tong University &amp; Tencent Hunyuan"}, {"id": 180784, "fullname": "Peiliang Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/180784?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 181164, "fullname": "Shikang Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/181164?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 193439, "fullname": "zhengyi shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/193439?format=json", "institution": "Xiamen University"}, {"id": 193440, "fullname": "Liang Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193440?format=json", "institution": "Fudan University"}, {"id": 87643, "fullname": "Linfeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87643?format=json", "institution": ", Tsinghua University"}], "abstract": "Diffusion Transformers (DiTs) have achieved state-of-the-art image and video generation performance, but sampling remains expensive due to repeated transformer forward passes over many timesteps. Feature caching offers a training-free way to accelerate inference by reusing or forecasting hidden representations, yet recent forecasting-based methods derive their coefficients from hand-crafted formulas (e.g., Taylor expansion), which ultimately reduce to fixed linear combinations of a few historical features. Such fixed coefficients are suboptimal and fragile under aggressive skipping. In this paper, we first show that existing forecasting-based caching methods can be unified in a common linear form, and then analyze DiT feature trajectories, finding that for most denoising steps the current feature can be reconstructed from past features with projection fidelity above 0.95, indicating that accurate linear prediction is feasible. Motivated by this, we propose $L^2P$ (Learnable Linear Predictor), a simple data-driven caching framework that replaces hand-designed coefficients with learnable per-timestep weights trained on a small set of cached trajectories using a mean-squared error loss, converging in about 20 seconds on a single GPU. Extensive experiments on state-of-the-art DiTs demonstrate that L2P consistently outperforms existing caching baselines: on FLUX.1-dev, L2P achieves a 4.55$\\times$ FLOPs reduction and 4.15$\\times$ latency speedup with a PSNR of 31.459, and on Qwen-Image and Qwen-Image-Lightning, it maintains high visual fidelity even under up to 7.18$\\times$ acceleration, where prior methods suffer from noticeable quality degradation. These results show that learning linear predictors is a practical and effective alternative to designing increasingly complex forecasting formulas for efficient diffusion model inference.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40070", "url": null, "sourceid": 43682, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37466, "uid": "3983c99d475e80e874512a0bd582dcc9", "name": "Physical Object Understanding with a Physically Controllable World Model", "authors": [{"id": 187520, "fullname": "Rahul Mysore Venkatesh", "url": "http://cvpr.thecvf.com/api/miniconf/users/187520?format=json", "institution": "Stanford University"}, {"id": 187521, "fullname": "Klemen Kotar", "url": "http://cvpr.thecvf.com/api/miniconf/users/187521?format=json", "institution": "Stanford University"}, {"id": 187522, "fullname": "Lilian Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187522?format=json", "institution": "Stanford University"}, {"id": 182471, "fullname": "Wanhee Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/182471?format=json", "institution": "Stanford University"}, {"id": 187523, "fullname": "Gia Ancone", "url": "http://cvpr.thecvf.com/api/miniconf/users/187523?format=json", "institution": "Stanford University"}, {"id": 187524, "fullname": "Seungwoo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/187524?format=json", "institution": "Stanford University"}, {"id": 187525, "fullname": "Luca Wheeler", "url": "http://cvpr.thecvf.com/api/miniconf/users/187525?format=json", "institution": "Stanford University"}, {"id": 187526, "fullname": "Jared Watrous", "url": "http://cvpr.thecvf.com/api/miniconf/users/187526?format=json", "institution": null}, {"id": 187527, "fullname": "Honglin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187527?format=json", "institution": "OpenAI"}, {"id": 177205, "fullname": "Daniel Bear", "url": "http://cvpr.thecvf.com/api/miniconf/users/177205?format=json", "institution": "Noetik, Inc."}, {"id": 90905, "fullname": "Stefan Stojanov", "url": "http://cvpr.thecvf.com/api/miniconf/users/90905?format=json", "institution": "Georgia Institute of Technology"}, {"id": 187528, "fullname": "Daniel LK Yamins", "url": "http://cvpr.thecvf.com/api/miniconf/users/187528?format=json", "institution": "Stanford University"}], "abstract": "A central challenge in visual intelligence is learning the physical structure of scenes from raw videos: how regions form objects and the laws that govern their interactions. Solving these tasks requires world models capable of inferring distributional states of the world from partial observations -- capabilities that current architectures do not provide. We introduce a new class of probabilistic world models that support estimation of the probability of any visual variable, such as appearance and dynamics, conditioned on any other variables. Here, we identify that these models can be trained efficiently with autoregressive sequence modeling, yielding world models from which rich object understanding emerges. First, we demonstrate that our model captures the physical laws governing how objects move by generating multiple plausible future states of the world through sequential inference. Then, by analyzing motion correlations across these futures, we extract coherent physical objects and articulated object subparts, achieving state-of-the-art results on SpelkeBench and DragAMove. Having discovered these objects, our world model can manipulate them in 3D, emerging as the strongest performer on 3DEditBench. Finally, we demonstrate that physical relationships between objects can be computed from the world model, enabling applications such as Visual Jenga.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37466", "url": null, "sourceid": 42363, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39645, "uid": "e86a702029116de126ed5b9341566230", "name": "SHOW3D: Capturing Scenes of 3D Hands and Objects in the Wild", "authors": [{"id": 156899, "fullname": "Patrick Rim", "url": "http://cvpr.thecvf.com/api/miniconf/users/156899?format=json", "institution": "Yale University / Google"}, {"id": 192555, "fullname": "Kevin Harris", "url": "http://cvpr.thecvf.com/api/miniconf/users/192555?format=json", "institution": "Meta"}, {"id": 192556, "fullname": "Braden Copple", "url": "http://cvpr.thecvf.com/api/miniconf/users/192556?format=json", "institution": "Facebook"}, {"id": 156808, "fullname": "Shangchen Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/156808?format=json", "institution": "Meta Inc"}, {"id": 192557, "fullname": "Xu Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/192557?format=json", "institution": "Meta, Inc."}, {"id": 189039, "fullname": "Ivan Shugurov", "url": "http://cvpr.thecvf.com/api/miniconf/users/189039?format=json", "institution": "Facebook"}, {"id": 189038, "fullname": "Sizhe An", "url": "http://cvpr.thecvf.com/api/miniconf/users/189038?format=json", "institution": "Meta Reality Labs"}, {"id": 92442, "fullname": "He Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/92442?format=json", "institution": "Meta Platformts, Inc."}, {"id": 72573, "fullname": "Alex Wong", "url": "http://cvpr.thecvf.com/api/miniconf/users/72573?format=json", "institution": "Yale University"}, {"id": 88562, "fullname": "Tomas Hodan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88562?format=json", "institution": "Meta"}, {"id": 185960, "fullname": "Kun He", "url": "http://cvpr.thecvf.com/api/miniconf/users/185960?format=json", "institution": "Meta"}], "abstract": "Accurate 3D understanding of human hands and objects during manipulation remains a significant challenge for egocentric computer vision. Existing hand-object interaction datasets are predominantly captured in controlled studio settings, which limits both environmental diversity and the ability of models trained on such data to generalize to real-world scenarios. To address this challenge, we introduce a novel marker-less multi-camera system that allows for nearly unconstrained mobility in genuinely in-the-wild conditions, while still having the ability to generate precise 3D annotations of hands and objects. The capture system consists of a lightweight, back-mounted, multi-camera rig that is synchronized and calibrated with a user-worn VR headset. For 3D ground-truth annotation of hands and objects, we develop an ego-exo tracking pipeline and rigorously evaluate its quality. Finally, we present SHOW3D, the first large-scale dataset with 3D annotations that show hands interacting with objects in diverse real-world environments, including outdoor settings. Our approach significantly reduces the fundamental trade-off between environmental realism and accuracy of 3D annotations, which we validate with experiments on several downstream tasks. The dataset will be publicly released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39645", "url": null, "sourceid": 42641, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37989, "uid": "091a0a35b9f57dcb09e0668f4aeb16f8", "name": "NERFIFY: Multi Agent Framework for Turning NeRF Papers into code", "authors": [{"id": 188764, "fullname": "Seemandhar Jain", "url": "http://cvpr.thecvf.com/api/miniconf/users/188764?format=json", "institution": "University of California, San Diego"}, {"id": 188765, "fullname": "Keshav Gupta", "url": "http://cvpr.thecvf.com/api/miniconf/users/188765?format=json", "institution": "University of California, San Diego"}, {"id": 188766, "fullname": "Kunal Gupta", "url": "http://cvpr.thecvf.com/api/miniconf/users/188766?format=json", "institution": "University of California, San Diego"}, {"id": 75820, "fullname": "Manmohan Chandraker", "url": "http://cvpr.thecvf.com/api/miniconf/users/75820?format=json", "institution": "UC San Diego"}], "abstract": "The proliferation of neural radiance field (NeRF) research requires significant efforts to reimplement papers before building upon them. We introduce NERFIFY, a multi-agent framework that reliably converts NeRF research papers into trainable Nerfstudio plugins, in contrast to generic paper-to-code methods and frontier models like GPT-5 that usually fail to produce runnable code. NERFIFY achieves domain-specific executability through six key innovations: (1) Context-free grammar (CFG): LLM synthesis is constrained by Nerfstudio formalized as a CFG, ensuring generated code satisfies architectural invariants. (2) Graph-of-Thought code synthesis: Specialized multi-file-agents generate repositories in topological dependency order, validating contracts and errors at each node. (3) Compositional citation recovery: Agents automatically retrieve and integrate components (samplers, encoders, proposal networks) from citation graphs of references. (4) Visual feedback: Artifacts are diagnosed through PSNR-minima ROI analysis, cross-view geometric validation, and VLM-guided patching to iteratively improve quality. (5) Knowledge enhancement: Beyond reproduction, methods can be improved with novel regularizers or architectural optimizations. (6) Benchmarking: An evaluation framework is designed for NeRF paper-to-code synthesis across 30 diverse papers. On papers without public implementations, NERFIFY achieves visual quality matching expert human code (\u00b10.5 dB PSNR, \u00b10.2 SSIM) while reducing implementation time from weeks to minutes. NERFIFY demonstrates that a domain-aware design enables code translation for complex vision papers, potentiating accelerated and democratized reproducible research. Code, data and implementations will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37989", "url": null, "sourceid": 42666, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39658, "uid": "b31d6f39e5d5dc13ca4a3bcc809a3dcc", "name": "MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition\u2013Perception\u2013Reasoning Guided Text-Image Machine Translation", "authors": [{"id": 145360, "fullname": "Gengluo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/145360?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 157988, "fullname": "Chengquan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157988?format=json", "institution": "Tencent"}, {"id": 192584, "fullname": "Yupu Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192584?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 188367, "fullname": "Huawen Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188367?format=json", "institution": "Institute of Information Engineering"}, {"id": 192585, "fullname": "Yaping Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192585?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 157987, "fullname": "Pengyuan Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157987?format=json", "institution": "Tencent"}, {"id": 186638, "fullname": "Weinong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186638?format=json", "institution": null}, {"id": 188369, "fullname": "Xingyu Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188369?format=json", "institution": "Tencent"}, {"id": 154558, "fullname": "Gangyan Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/154558?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 126300, "fullname": "Han Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126300?format=json", "institution": "Microsft Research Asia"}, {"id": 152556, "fullname": "Can Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/152556?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 152555, "fullname": "Yu ZHOU", "url": "http://cvpr.thecvf.com/api/miniconf/users/152555?format=json", "institution": "Nankai University"}], "abstract": "End-to-end text-image machine translation (TIMT), which directly translates textual content in images across languages, is crucial for real-world multilingual scene understanding. Despite advances in vision\u2013language large models (VLLMs), robustness across diverse visual scenes and low-resource languages remains underexplored due to limited evaluation resources. We present MMTIT-Bench, a human-verified multilingual and multi-scenario benchmark with 1,400 images spanning fourteen non-English and non-Chinese languages and diverse settings such as documents, scenes, and web images, enabling rigorous assessment of end-to-end TIMT. Beyond benchmarking, we study how reasoning-oriented data design improves translation. Although recent VLLMs have begun to incorporate long Chain-of-Thought (CoT) reasoning, effective thinking paradigms for TIMT are still immature: existing designs either cascade parsing and translation in a sequential manner or focus on language-only reasoning, overlooking the visual cognition central to VLLMs. We propose Cognition\u2013Perception\u2013Reasoning for Translation (CPR-Trans), a data paradigm that integrates scene cognition, text perception, and translation reasoning within a unified reasoning process. Using a VLLM-driven data generation pipeline, CPR-Trans provides structured, interpretable supervision that aligns perception with reasoning. Experiments on 3B and 7B models show consistent gains in accuracy and interpretability. We will release MMTIT-Bench to promote the multilingual and multi-scenario TIMT research upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39658", "url": null, "sourceid": 37234, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40140, "uid": "3aadc78106e09c44728f01ead5bcde2b", "name": "PromptLoop: Plug-and-Play Prompt Refinement via Latent Feedback for Diffusion Model Alignment", "authors": [{"id": 193610, "fullname": "Suhyeon Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/193610?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 85079, "fullname": "Jong Chul Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/85079?format=json", "institution": "Korea Advanced Institute of Science and Technology"}], "abstract": "Despite the recent progress, reinforcement learning (RL)-based fine-tuning of diffusion models often struggles with generalization, composability, and robustness against reward hacking. Recent studies have explored prompt refinement as a modular alternative, but most adopt a feed-forward approach that applies a single refined prompt throughout the entire sampling trajectory, thereby failing to fully leverage the sequential nature of reinforcement learning. To address this, here we introduce *PromptLoop*, a plug-and-play RL framework that incorporates latent feedback into step-wise prompt refinement. Rather than modifying diffusion model weights, a multimodal large language model (MLLM) is trained with RL to iteratively update prompts based on intermediate latent states of diffusion models. This design achieves a structural analogy to the Diffusion RL approach, while retaining the flexibility and generality of prompt-based alignment. Extensive experiments across diverse reward functions and diffusion backbones demonstrate that PromptLoop (i) achieves effective reward optimization, (ii) generalizes seamlessly to unseen models, (iii) composes orthogonally with existing alignment methods, and (iv) mitigates over-optimization and reward hacking while introducing only a practically negligible inference overhead.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40140", "url": null, "sourceid": 44883, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37193, "uid": "790eacf1faf6db5e63bb55814f315351", "name": "CGL: Advancing Continual GUI Learning via Reinforcement Fine-Tuning", "authors": [{"id": 186882, "fullname": "Zhenquan Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186882?format=json", "institution": "Harbin Institute of Technology"}, {"id": 181962, "fullname": "Zitong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181962?format=json", "institution": "Harbin Institute of Technology"}, {"id": 99040, "fullname": "yihan zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/99040?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 85001, "fullname": "Jianhua Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/85001?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 91873, "fullname": "Hang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/91873?format=json", "institution": "Huawei Noah\u2018s Ark Lab"}, {"id": 186883, "fullname": "Chun-Mei Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186883?format=json", "institution": "University College Dublin"}, {"id": 186884, "fullname": "Jianwei Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/186884?format=json", "institution": null}, {"id": 84797, "fullname": "Wangmeng Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/84797?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Graphical User Interface (GUI) Agents, benefiting from recent advances in multimodal large language models (MLLM), have achieved significant development. However, due to the frequent updates of GUI applications, adapting to new tasks without forgetting old tasks in GUI continual learning remains an open problem.  Existing works are generally trained on a fixed set of tasks and adapt to new tasks either through supervised fine-tuning (SFT) or reinforcement learning (RL), suffering from catastrophic forgetting and slow adaptation. In this work, we propose a \\textbf{C}ontinual \\textbf{G}UI \\textbf{L}earning (CGL) framework that dynamically balances adaptation efficiency and skill retention by enhancing the synergy between SFT and RL. Specifically, we introduce an SFT proportion adjustment mechanism guided by policy entropy to dynamically control the weight allocation between the SFT and RL training phases. Additionally, we propose gradient surgery and entropy-regulated tuning strategies to enable GUI agents to continuously evolve while maintaining competence across previously learned domains. On top of that, we propose a AndroidControl-CL benchmark, which divides GUI applications into distinct task groups to effectively simulate and evaluate the performance of GUI continual learning. Experimental results demonstrate the effectiveness of our proposed CGL framework under the continual learning setting. The benchmark, code and model will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37193", "url": null, "sourceid": 39217, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36325, "uid": "1e60bf71283dac0b8777b83250813e56", "name": "Event6D: Event-based Novel Object 6D Pose Tracking", "authors": [{"id": 152691, "fullname": "Jae-Young Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152691?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 76908, "fullname": "Hoonhee Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/76908?format=json", "institution": "KAIST"}, {"id": 76892, "fullname": "Taeyeop Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/76892?format=json", "institution": "KAIST"}, {"id": 152436, "fullname": "Minjun Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152436?format=json", "institution": "NVIDIA"}, {"id": 71925, "fullname": "Bowen Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/71925?format=json", "institution": "NVIDIA"}, {"id": 152692, "fullname": "Youngho Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/152692?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 76867, "fullname": "Kuk-Jin Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/76867?format=json", "institution": "KAIST"}], "abstract": "Event cameras provide microsecond latency, making them suitable for 6D object pose tracking in fast, dynamic scenes where conventional RGB and depth pipelines suffer from motion blur and large pixel displacements. We introduce EventTrack6D, an event-depth tracking framework that generalizes to novel objects without object-specific training by reconstructing both intensity and depth at arbitrary timestamps between depth frames. Conditioned on the most recent depth measurement, our dual reconstruction recovers dense photometric and geometric cues from sparse event streams. Our EventTrack6D operates at over 120 FPS and maintains temporal consistency under rapid motion. To support training and evaluation, we introduce a comprehensive benchmark suite: a large-scale synthetic dataset for training and two complementary evaluation sets, including real and simulated event datasets. Trained exclusively on synthetic data, EventTrack6D generalizes effectively to real-world scenarios without fine-tuning, maintaining accurate tracking across diverse objects and motion patterns. Our method and datasets establish event cameras as a viable solution for event-based novel object 6D pose tracking.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36325", "url": "https://chohoonhee.github.io/Event6D/", "sourceid": 35696, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39275, "uid": "fe772ff1261b820e437821342b445539", "name": "VideoWorld 2: Learning Transferable Knowledge from Real-world Videos", "authors": [{"id": 151985, "fullname": "Zhongwei Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/151985?format=json", "institution": "Beijing Jiaotong University"}, {"id": 75837, "fullname": "Yunchao Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/75837?format=json", "institution": "Beijing Jiaotong University"}, {"id": 191747, "fullname": "Xiao Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191747?format=json", "institution": "Beijing Jiaotong University"}, {"id": 184514, "fullname": "Guixun Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/184514?format=json", "institution": "Beijing Jiaotong University"}, {"id": 88385, "fullname": "Yao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88385?format=json", "institution": "Beijing Jiaotong University"}, {"id": 128453, "fullname": "Bingyi Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128453?format=json", "institution": "TikTok"}, {"id": 86968, "fullname": "Jiashi Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86968?format=json", "institution": "ByteDance"}, {"id": 107156, "fullname": "Xiaojie Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/107156?format=json", "institution": "TikTok"}], "abstract": "Learning transferable knowledge from unlabeled video data and applying it in new environments is a hallmark of advanced artificial intelligence. We present VideoWorld 2, which extends VideoWorld and offers the first investigation into learning transferable knowledge directly from raw real-world videos. At its core, VideoWorld 2 introduces a disentangled Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles appearance modeling, enabling the dLDM to learn latent codes that focus on compact and meaningful task-related changes. These latent codes are then modeled autoregressively as a sequence to learn task policies and support long-horizon reasoning. We evaluate VideoWorld 2 on real-world video handcraft making tasks, where prior video generation and latent-dynamics models struggle to operate reliably. VideoWorld 2 achieves over a 70% improvement in task success rate and produces coherent long execution videos. In robotics, we show that VideoWorld 2 can acquire effective manipulation knowledge from the Open-X dataset, which substantially improves task performance on CALVIN. This study reveals the potential of learning transferable world knowledge directly from raw videos, with all code, data, and models to be open-sourced for further research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39275", "url": null, "sourceid": 33552, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38728, "uid": "e5337b6705bcd3099129719cee0d46e4", "name": "M4V: Multimodal Mamba for Efficient Text-to-Video Generation", "authors": [{"id": 190532, "fullname": "Jiancheng Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190532?format=json", "institution": ""}, {"id": 188274, "fullname": "Gengwei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188274?format=json", "institution": "University of North Carolina at Chapel Hill"}, {"id": 88289, "fullname": "Zequn Jie", "url": "http://cvpr.thecvf.com/api/miniconf/users/88289?format=json", "institution": "Meituan"}, {"id": 71415, "fullname": "Siyu Jiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/71415?format=json", "institution": "Beijing Jiaotong University"}, {"id": 190533, "fullname": "Yinlong Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/190533?format=json", "institution": "Meituan"}, {"id": 190278, "fullname": "Ling Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190278?format=json", "institution": "University of Technology Sydney"}, {"id": 75837, "fullname": "Yunchao Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/75837?format=json", "institution": "Beijing Jiaotong University"}, {"id": 88103, "fullname": "Lin Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/88103?format=json", "institution": "Meituan"}], "abstract": "Text-to-video generation has significantly enriched content creation and holds the potential to evolve into powerful world simulators. However, modeling the vast spatiotemporal space remains computationally demanding, particularly when employing Transformers, which incur quadratic complexity in sequence processing and thus limit practical applications. Recent advancements in linear-time sequence modeling, particularly the Mamba architecture, offer a more efficient alternative. Nevertheless, its plain design limits its direct applicability to multimodal and spatiotemporal video generation tasks. To address these challenges, we introduce M4V, a multimodal Mamba framework for efficient text-to-video generation. Specifically, a MultiModal diffusion Mamba (MM-DiM) block is designed within the framework to enable seamless integration of multimodal information and spatiotemporal modeling. In detail, we introduce a novel multimodal token re-composition design, which employs a bidirectional scheme for multimodal information integration through simple token arrangement, along with visual registers to enhance spatial\u2013temporal consistency. As a result, the MM-DiM blocks in M4V reduce FLOPs by 45% compared with the attention-based alternative when generating videos at 768$\\times$1280 resolution. Additionally, several training strategies are explored in this work to provide a better understanding of training text-to-video models using only publicly available datasets. Extensive experiments on text-to-video benchmarks demonstrate M4V's ability to produce high-quality videos while significantly lowering computational costs. Code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38728", "url": null, "sourceid": 36181, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36229, "uid": "a3530a3c4ed7dfb3f7ab7fe787fea790", "name": "StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation", "authors": [{"id": 145754, "fullname": "Ke Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/145754?format=json", "institution": "Beijing Jiaotong University"}, {"id": 181647, "fullname": "longfei li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181647?format=json", "institution": "Bytedance"}, {"id": 184513, "fullname": "Yuyang Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184513?format=json", "institution": "Beijing Jiaotong University"}, {"id": 157154, "fullname": "Hanwen Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157154?format=json", "institution": "University of Toronto"}, {"id": 184514, "fullname": "Guixun Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/184514?format=json", "institution": "Beijing Jiaotong University"}, {"id": 88294, "fullname": "Chen Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88294?format=json", "institution": "Tencent PCG"}, {"id": 88155, "fullname": "Jue Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88155?format=json", "institution": "Tencent AI Lab"}, {"id": 157156, "fullname": "Konstantinos N. Plataniotis", "url": "http://cvpr.thecvf.com/api/miniconf/users/157156?format=json", "institution": "University of Toronto"}, {"id": 107156, "fullname": "Xiaojie Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/107156?format=json", "institution": "TikTok"}, {"id": 88385, "fullname": "Yao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88385?format=json", "institution": "Beijing Jiaotong University"}, {"id": 75837, "fullname": "Yunchao Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/75837?format=json", "institution": "Beijing Jiaotong University"}], "abstract": "The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone.To address this challenge, we present **StereoWorld**, an **end-to-end framework** that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a **geometry-aware regularization** to ensure 3D structural fidelity.A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis.To enable large-scale training and evaluation, we curate a **high-definition stereo video dataset** containing over 11M frames aligned to natural human interpupillary distance (IPD).Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generating stereo videos with superior visual fidelity and geometric consistency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36229", "url": null, "sourceid": 39226, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38435, "uid": "f427810d6c49d16a865d20c29ac11e61", "name": "Physics-Based Decomposition of Reflectance and Shading using a Single Visible-Thermal Image Pair", "authors": [{"id": 189863, "fullname": "Zeqing Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189863?format=json", "institution": "Carnegie Mellon University"}, {"id": 131210, "fullname": "Mani Ramanagopal", "url": "http://cvpr.thecvf.com/api/miniconf/users/131210?format=json", "institution": "Carnegie Mellon University"}, {"id": 90516, "fullname": "Aswin C. Sankaranarayanan", "url": "http://cvpr.thecvf.com/api/miniconf/users/90516?format=json", "institution": "Carnegie Mellon University"}, {"id": 89406, "fullname": "Srinivasa G. Narasimhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/89406?format=json", "institution": "Carnegie Mellon University"}], "abstract": "Decomposing an image into its underlying photometric factors\u2014surface reflectance and shading\u2014is a long-standing challenge due to the lack of extensive ground-truth data for real-world scenes. We introduce a novel physics-based approach for intrinsic image decomposition using a pair of visible and thermal images. We leverage the principle that light not reflected from an opaque surface is absorbed and detected as heat by a thermal camera. This allows us to relate the ordinalities (or relative magnitudes) between visible and thermal image intensities to the ordinalities of shading and reflectance, which enables a dense self-supervision of an optimizing neural network to recover shading and reflectance. We perform quantitative evaluations with known reflectance and shading under natural and artificial lighting, and qualitative experiments across diverse outdoor scenes. The results demonstrate superior performance over both classical physics-based and recent learning-based methods, providing a path toward scalable real-world data curation with supervision.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38435", "url": null, "sourceid": 31554, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38936, "uid": "ef17814ab770a950071cfdd5c635d5de", "name": "ReasonEdit: Towards Reasoning-Enhanced Image Editing Models", "authors": [{"id": 183954, "fullname": "Fukun Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/183954?format=json", "institution": "StepFun"}, {"id": 189461, "fullname": "Shiyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189461?format=json", "institution": "ShanghaiTech University"}, {"id": 190999, "fullname": "Yucheng Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/190999?format=json", "institution": "Nanyang Technological University"}, {"id": 85050, "fullname": "Zhibo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85050?format=json", "institution": "SenseTime Research"}, {"id": 191000, "fullname": "Peng Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/191000?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 191001, "fullname": "Rui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191001?format=json", "institution": "Stepfun"}, {"id": 88826, "fullname": "Wei Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/88826?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 191002, "fullname": "Yingming Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191002?format=json", "institution": null}, {"id": 191003, "fullname": "Aojie Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191003?format=json", "institution": ""}, {"id": 71748, "fullname": "Zixin Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/71748?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 157365, "fullname": "Pengtao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/157365?format=json", "institution": "Fudan University"}, {"id": 126775, "fullname": "Xianfang Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/126775?format=json", "institution": "Tencent PCG"}, {"id": 87502, "fullname": "Gang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87502?format=json", "institution": "Tencent"}, {"id": 191004, "fullname": "Daxin Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191004?format=json", "institution": "StepFun"}], "abstract": "Recent advances in image editing models have shown remarkable progress. A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder, as seen in systems such as Step1X-Edit and Qwen-Image-Edit, where the MLLM encodes both the reference image and the instruction but remains frozen during training.In this work, we demonstrate that unlocking the reasoning capabilities of MLLM can further push the boundaries of editing models. Specifically, we explore two reasoning mechanisms, thinking and reflection, which enhance instruction understanding and editing accuracy.Based on that, our proposed framework enables image editing in a thinking\u2013editing\u2013reflection loop: the thinking mechanism leverages the world knowledge of MLLM to interpret abstract instructions, while the reflection reviews editing results, automatically corrects unintended manipulations, and identifies the stopping round.Extensive experiments demonstrate that our reasoning approach achieves significant performance gains, with improvements of ImgEdit(+4.4\\%), GEdit(+3.1\\%), and Kris(+11.5\\%) when initializing our DiT from the Step1X-Edit(ReasonEdit-S), and also outperforms previous open-source methods on both GEdit and Kris when integrated with Qwen-Image-Edit(ReasonEdit-Q).Code and checkpoints will be open-sourced.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38936", "url": null, "sourceid": 46384, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38709, "uid": "f5a6556225bb4e85ce5be5dda9d05b8a", "name": "MGDHand: Multi-Granularity Prior-to-Inertial Distillation Framework for Sequential 3D Hand Pose Estimation from Sparse IMUs", "authors": [{"id": 190500, "fullname": "Xinyi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190500?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 126479, "fullname": "Pengfei Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/126479?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 183318, "fullname": "HaoYang ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/183318?format=json", "institution": "University College Dublin"}, {"id": 190501, "fullname": "Hanling Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190501?format=json", "institution": "Tianjin University"}, {"id": 190502, "fullname": "Yingxi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190502?format=json", "institution": "Peking University"}, {"id": 149059, "fullname": "Liang Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/149059?format=json", "institution": "National University of Defense Technology, Tsinghua University"}, {"id": 190503, "fullname": "Yue Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190503?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 149083, "fullname": "Erwei Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/149083?format=json", "institution": "Defense Innovation Institute, Academy of Military Sciences (AMS)"}], "abstract": "3D hand pose estimation (HPE) from sparse inertial measurement units (IMUs) has shown great potential in human-computer interaction. However, due to the significant semantic gap between sparse local motion information and structured global pose information, estimating the hand poses from sparse IMU signals is ambiguous and challenging. Knowledge distillation can transfer rich knowledge from the stronger teacher to the student, so that the student enhances performance. Existing approaches distill morphological priors into the IMU-based student model, effectively improving its accuracy in complex scenarios. Nevertheless, overlooking the visual-inertial inherent semantic mismatch and information density difference leads to difficulties for students to learn coupled priors. In this paper, we propose a \\textbf{M}ulti-\\textbf{G}ranularity Prior-to-Inertial \\textbf{D}istillation Framework for Sequential 3D \\textbf{H}PE from Sparse IMUs (\\textbf{MGDHand}). We first pre-train a MANO-IMU fusion model as a teacher to encode static geometric morphology prior, dynamic kinematic prior and temporal motion prior. Then, a \\textbf{M}ulti-\\textbf{G}ranularity Decoupled \\textbf{Distill}ation (\\textbf{MGDistill}) scheme is proposed to bridge the semantic gap. MGDistill includes a \\textbf{Static Shape Distillation} module to transfer time-invariant hand shape priors, and a \\textbf{Dynamic Pose Distillation} module to transfer complex joint kinematics and dense pose priors. Additionally,  a \\textbf{Temporal Motion Distillation} module transfers the fast-changing motion priors (velocity and acceleration).  Extensive experiments on public dataset demonstrate that our method outperforms state-of-the-art approaches under sparse IMU configurations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38709", "url": null, "sourceid": 46627, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37147, "uid": "90da5fb6873f5daa02586c51fec88189", "name": "GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping", "authors": [{"id": 180176, "fullname": "Jing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180176?format=json", "institution": "Sun Yat-sen University"}, {"id": 186773, "fullname": "Jiajun Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186773?format=json", "institution": "Kuaishou"}, {"id": 186774, "fullname": "Jie Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186774?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 180657, "fullname": "Henglin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180657?format=json", "institution": "Tsinghua University"}, {"id": 186775, "fullname": "Gongye Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186775?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 186776, "fullname": "Jun Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186776?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 186777, "fullname": "Wanyuan Pang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186777?format=json", "institution": "Xiaomi Corporation"}, {"id": 184465, "fullname": "Ao Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/184465?format=json", "institution": "JD.com"}, {"id": 89656, "fullname": "Zhenyu Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/89656?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 75722, "fullname": "Xintao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75722?format=json", "institution": "Tencent"}, {"id": 180129, "fullname": "Meng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180129?format=json", "institution": "Kling AI"}, {"id": 134947, "fullname": "Pengfei Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/134947?format=json", "institution": "Kuaishou Technology"}, {"id": 69930, "fullname": "Xiaodan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/69930?format=json", "institution": "Sun Yat-sen University"}], "abstract": "Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on importance-ratio clipping to constrain overconfident positive and negative gradients. However, in practice, we observe a systematic shift in the importance-ratio distribution\u2014its mean falls below 1 and its variance differs substantially across timesteps. This left-shifted and inconsistent distribution prevents positive-advantage samples from entering the clipped region, causing the mechanism to fail in constraining overconfident positive updates. As a result, the policy model inevitably enters an **implicit over-optimization stage**\u2014while the proxy reward continues to increase, essential metrics such as image quality and text\u2013prompt alignment deteriorate sharply, ultimately making the learned policy impractical for real-world use. To address this issue, we introduce **GRPO-Guard**, a simple yet effective enhancement to existing GRPO frameworks. Our method incorporates ratio normalization, which restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates across denoising timesteps. In addition, a gradient reweighting strategy equalizes policy gradients over noise conditions, preventing excessive updates from particular timestep regions. Together, these designs act as a regulated clipping mechanism, stabilizing optimization and substantially mitigating implicit over-optimization without relying on heavy KL regularization. Extensive experiments on multiple diffusion backbones (e.g., SD3.5M, Flux.1-dev) and diverse proxy tasks demonstrate that GRPO-Guard significantly reduces over-optimization while maintaining or even improving generation quality. We provide detailed demonstrations of the over-optimization process and corresponding visualizations in **Supplementary Materials. 5**.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37147", "url": null, "sourceid": 41730, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39486, "uid": "2bf720f77d3874e07949cfcd1f75e91e", "name": "JarvisEvo: Towards a Self-Evolving Photo Editing Agent with Synergistic Editor-Evaluator Optimization", "authors": [{"id": 144220, "fullname": "yunlong lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/144220?format=json", "institution": "Xiamen university"}, {"id": 180828, "fullname": "Linqing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180828?format=json", "institution": "\u817e\u8baf\u79d1\u6280\uff08\u6df1\u5733\uff09\u6709\u9650\u516c\u53f8"}, {"id": 192173, "fullname": "Kunjie Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192173?format=json", "institution": "Xiamen University"}, {"id": 192174, "fullname": "Zixu Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192174?format=json", "institution": "Xiamen University"}, {"id": 107299, "fullname": "Kaixiong Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/107299?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 128851, "fullname": "Wenbo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/128851?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 153157, "fullname": "Bin Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/153157?format=json", "institution": "Peking University"}, {"id": 187696, "fullname": "Zhenxi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187696?format=json", "institution": "Tencent Hunyuan"}, {"id": 181780, "fullname": "Shiyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181780?format=json", "institution": "Tsinghua University"}, {"id": 144441, "fullname": "Yuyang Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/144441?format=json", "institution": "Tsinghua University"}, {"id": 88610, "fullname": "Wenxun Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/88610?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 90595, "fullname": "Xinghao Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/90595?format=json", "institution": "Xiamen University"}, {"id": 186615, "fullname": "Chunyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186615?format=json", "institution": "Tencent Hunyuan"}, {"id": 106459, "fullname": "qinglin lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/106459?format=json", "institution": "Tencent"}], "abstract": "Agent-based editing models have substantially advanced interactive experiences, processing quality, and creative flexibility. However, two critical challenges persist: (1) instruction hallucination\u2014text-only chain-of-thought (CoT) reasoning cannot fully prevent factual errors due to inherent information bottlenecks; (2) reward hacking\u2014dynamic policy optimization against static reward models allows agents to exploit flaws in reward functions. To address these issues, we propose JarvisEvo, a unified image editing agent that emulates an expert human designer by iteratively editing, selecting appropriate tools, evaluating results, and reflecting on its own decisions to refine outcomes. JarvisEvo offers three key advantages: (1) an interleaved multimodal chain-of-thought (iMCoT) reasoning mechanism that enhances instruction following and editing quality; (2) a synergistic editor\u2013evaluator policy optimization (SEPO) framework that enables self-improvement without external rewards, effectively mitigating reward hacking; and (3) support for both preservative and generative editing through seamless integration of Adobe Lightroom and Qwen-Image-Edit tools. On ArtEdit-Bench, JarvisEvo outperforms Nano-Banana by an average of 18.95% on preservative editing metrics, including a substantial 44.96% improvement in pixel-level content fidelity, while maintaining competitive performance in generative editing tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39486", "url": null, "sourceid": 33546, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39833, "uid": "6fcdb2951819a022c6c46c51f89df49a", "name": "OMG-Bench: A New Challenging Benchmark for Skeleton-based Online Micro Hand Gesture Recognition", "authors": [{"id": 145893, "fullname": "Haochen Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/145893?format=json", "institution": "Sun Yat-sen University, SYSU"}, {"id": 126479, "fullname": "Pengfei Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/126479?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 192942, "fullname": "Buyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192942?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 192943, "fullname": "Da Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192943?format=json", "institution": "Nankai University"}, {"id": 190065, "fullname": "Tianhao Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/190065?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 183318, "fullname": "HaoYang ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/183318?format=json", "institution": "University College Dublin"}, {"id": 149059, "fullname": "Liang Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/149059?format=json", "institution": "National University of Defense Technology, Tsinghua University"}, {"id": 186965, "fullname": "Hongbo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186965?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 149083, "fullname": "Erwei Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/149083?format=json", "institution": "Defense Innovation Institute, Academy of Military Sciences (AMS)"}], "abstract": "Online micro gesture recognition from hand skeletons is critical for VR/AR interaction but faces challenges due to limited public datasets and task-specific algorithms. Micro gestures involve subtle motion patterns, which make constructing datasets with precise skeletons and frame-level annotations difficult. To this end, we develop a multi-view self-supervised pipeline to automatically generate skeleton data, complemented by heuristic rules and expert refinement for semi-automatic annotation. Based on this pipeline, we introduce OMG-Bench, the first large-scale public benchmark for skeleton-based online micro gesture recognition. It features 40 fine-grained gesture classes with 13,948 instances across 1,272 sequences, characterized by subtle motions, rapid dynamics, and continuous execution. To tackle these challenges, we propose Hierarchical Memory-Augmented Transformer (HMATr), an end-to-end framework that unifies gesture detection and classification by leveraging hierarchical memory banks which store frame-level details and window-level semantics to preserve historical context. In addition, it employs learnable position-aware queries initialized from the memory to implicitly encode gesture positions and semantics. Experiments show that HMATr outperforms state-of-the-art methods by 7.6\\% in detection rate, establishing a strong baseline for online micro gesture recognition. Our code is available in Suppl. Mat. and dataset will be available later.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39833", "url": null, "sourceid": 43543, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37269, "uid": "4adc2d13b1a41b5e1652d957fc6f2082", "name": "Garments2Look: A Multi-Reference Dataset for High-Fidelity Outfit-Level Virtual Try-On with Clothing and Accessories", "authors": [{"id": 143393, "fullname": "Junyao Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/143393?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 187057, "fullname": "Zhongwei Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187057?format=json", "institution": "Huhu AI Inc."}, {"id": 91398, "fullname": "Waikeung Wong", "url": "http://cvpr.thecvf.com/api/miniconf/users/91398?format=json", "institution": "Laboratory for Artificial Intelligence in Design"}, {"id": 185493, "fullname": "Xingxing Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185493?format=json", "institution": "Hong Kong Polytechnic University"}], "abstract": "Virtual try-on (VTON) has advanced single-garment visualization, yet real-world fashion centers on full outfits with multiple items, layering, fine-grained categories, and diverse styling\u2014beyond current systems. Existing datasets are category-limited and lack outfit diversity. We introduce Garments2Look, the **first** large-scale multimodal dataset for outfit-level VTON, comprising 80K many-garments-to-one-look pairs across 40 major categories and 361+ fine-grained subcategories. Each pair includes 3-12 item images, a model image in a complete outfit, and detailed item and try-on annotations. We further design a synthesis pipeline balancing authenticity and diversity: it maximizes use of raw images for realism, and explicitly injects diverse styles and specific styling techniques during outfit/look synthesis. To probe task difficulty, we adapt SOTA VTON methods and general-purpose image editors to establish baselines. Results show current methods struggle to try on full outfits seamlessly and to infer correct layering, leading to misalignment and artifacts. All data will be open-sourced.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37269", "url": "https://artmesciencelab.github.io/Garments2Look/", "sourceid": 39708, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37954, "uid": "8f5d0e3b6f94ffa323e84b47fb03c260", "name": "Better than Average: Spatially-Aware Aggregation of Segmentation Uncertainty Improves Downstream Performance", "authors": [{"id": 181429, "fullname": "Vanessa Emanuela Guarino", "url": "http://cvpr.thecvf.com/api/miniconf/users/181429?format=json", "institution": "Max Delbr\u00fcck Center for Molecular Medicine in the Helmholtz Association"}, {"id": 188674, "fullname": "Claudia Winklmayr", "url": "http://cvpr.thecvf.com/api/miniconf/users/188674?format=json", "institution": "Max Delbr\u00fcck Center for Molecular Medicine"}, {"id": 181431, "fullname": "Jannik Franzen", "url": "http://cvpr.thecvf.com/api/miniconf/users/181431?format=json", "institution": "MDC Berlin"}, {"id": 130239, "fullname": "Josef Rumberger", "url": "http://cvpr.thecvf.com/api/miniconf/users/130239?format=json", "institution": "Max Delbr\u00fcck Center for Molecular Medicine"}, {"id": 188675, "fullname": "Manuel Pfeuffer", "url": "http://cvpr.thecvf.com/api/miniconf/users/188675?format=json", "institution": "Humboldt Universit\u00e4t zu Berlin"}, {"id": 188676, "fullname": "Sonja Greven", "url": "http://cvpr.thecvf.com/api/miniconf/users/188676?format=json", "institution": "Humboldt Universit\u00e4t Berlin"}, {"id": 85944, "fullname": "Klaus Maier-Hein", "url": "http://cvpr.thecvf.com/api/miniconf/users/85944?format=json", "institution": "German Cancer Research Center"}, {"id": 130272, "fullname": "Dagmar Kainmueller", "url": "http://cvpr.thecvf.com/api/miniconf/users/130272?format=json", "institution": "Universit\u00e4t Potsdam"}, {"id": 181597, "fullname": "Christoph Karg", "url": "http://cvpr.thecvf.com/api/miniconf/users/181597?format=json", "institution": "Max Delbr\u00fcck Center, Robert-R\u00f6ssle-Str. 10, 13092 Berlin, Germany"}, {"id": 188677, "fullname": "Carsten T. L\u00fcth", "url": "http://cvpr.thecvf.com/api/miniconf/users/188677?format=json", "institution": "Ruprecht-Karls-Universit\u00e4t Heidelberg; German Cancer Research Center"}], "abstract": "Uncertainty quantification (UQ) is crucial for ensuring the reliability of automated image segmentations in safety-critical domains like biomedical image analysis or autonomous driving. UQ generates pixel-wise uncertainty maps that must be aggregated into scalar scores for downstream tasks like OoD- or failure-detection.Despite widespread use of aggregation strategies, their properties and impact on downstream task performance have not yet been comprehensively studied.Global Average is the default choice, yet it does not account for spatial and structural features of uncertainty estimates. Alternatives like patch-, class- and threshold-based strategies exist, but lack systematic comparison, leading to inconsistent reporting and unclear best practices.We address this gap by (1) formally analyzing properties, limitations, and pitfalls of common strategies; (2) proposing novel strategies that incorporate spatial uncertainty structure and (3) benchmarking their performance on OoD and failure detection across ten datasets that vary in image geometry and structure.We find that aggregators leveraging spatial structure yield stronger performance in both downstream tasks studied. However, performance of individual aggregators is highly dependent on dataset characteristics, thus we propose a meta aggregator that integrates multiple aggregators and shows robust performance across datasets.To foster reproducibility, we release an open-source Python package for benchmarking uncertainty aggregation methods.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37954", "url": null, "sourceid": 34167, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39712, "uid": "ed45aea8d9710aade017fc1aea4054cf", "name": "Learning Anchor in Dual Orthogonal Space for Fast Multi-view Clustering", "authors": [{"id": 130952, "fullname": "Yalan Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/130952?format=json", "institution": "Shanghai University"}, {"id": 192707, "fullname": "Hanzhou Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192707?format=json", "institution": "Shanghai University"}], "abstract": "Large-scale multi-view clustering aims to explore the complementary and consistent information among different views in efficient manner. Despite the impressive performance gained by the existing methods, they just perform anchor learning in a single space with the orthogonal or some other constraints from the multi-view data, leading to undesired anchors. The anchors can simultaneously occur in more spaces and the complementary information among these spaces is able to be adopted for learning anchors. Meanwhile, the space with basis being the anchored cluster center is neglected to learn anchors by most existing works. In this work, we propose learning anchor in Dual Orthogonal Space for Fast Multi-view Clustering (DOSFMVC). DOSFMVC conducts anchor learning in dual orthogonal space, aiming at utilizing the complementary information among two spaces in producing anchors with high quality. DOSMFVC introduces the consensus anchored cluster center as basis of the extra space and clustering indicator of anchors based on this bais in anchor learning. The anchor learning and partition are integrated into a unified model, where the final cluster assignment can be adopted for clustering results. Extensive experiments confirm the superiority of our method compared with some state-of-the-art methods on several benchmark datasets.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39712", "url": null, "sourceid": 40575, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39592, "uid": "f52e706be0c17df6673391a7f5c7814b", "name": "Masked-Diffusion Autoencoders for 3D Medical Vision Representation Learning", "authors": [{"id": 144209, "fullname": "Jiachen Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/144209?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 192424, "fullname": "Guanghui Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192424?format=json", "institution": "Microsoft"}, {"id": 140494, "fullname": "Theodore Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/140494?format=json", "institution": "Microsoft"}, {"id": 88536, "fullname": "Jeya Maria Jose Valanarasu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88536?format=json", "institution": "Johns Hopkins University"}, {"id": 192425, "fullname": "Sheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192425?format=json", "institution": "Microsoft"}, {"id": 192426, "fullname": "Tristan Naumann", "url": "http://cvpr.thecvf.com/api/miniconf/users/192426?format=json", "institution": "Microsoft Research"}, {"id": 192427, "fullname": "Fan Lam", "url": "http://cvpr.thecvf.com/api/miniconf/users/192427?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 192428, "fullname": "Sheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192428?format=json", "institution": "University of Washington, Seattle"}, {"id": 140668, "fullname": "Hoifung Poon", "url": "http://cvpr.thecvf.com/api/miniconf/users/140668?format=json", "institution": "Microsoft"}], "abstract": "Effective medical image analysis requires representations that capture both global anatomical structure and fine-grained tissue texture. Current self-supervised approaches exhibit limited capacity to address both requirements simultaneously. Invariance-based methods learn through augmentation consistency but face challenges in medical imaging where common augmentations may discard diagnostically relevant intensity patterns. Masked image modeling approaches employ high masking ratios to enforce holistic reasoning, yet inherently limit exposure to fine-grained texture. Recent work in general-domain vision demonstrates that generative and semantic objectives can mutually benefit each other, yet this paradigm remains unexplored for 3D medical imaging. We introduce Masked-Diffusion Autoencoders (MDAE), a self-supervised framework that imposes concurrent spatial masking and diffusion corruption, encouraging the model to learn complementary objectives: masked region reconstruction for structural coherence and visible region denoising for textural characteristics. This dual corruption enables the network to learn structure-texture representations within a unified time-conditioned objective. Evaluated on brain MRI across tumor classification, molecular marker detection, and dense segmentation benchmarks, MDAE consistently outperforms state-of-the-art baselines, with improvements most pronounced in cross-modal generalization tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39592", "url": null, "sourceid": 36463, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39801, "uid": "b07d9c21e2dad4b3527b97c54ccc4603", "name": "Open-Med-Reasoner: Data Recipes for Multimodal Medical Reasoning", "authors": [{"id": 192889, "fullname": "Timothy Ossowski", "url": "http://cvpr.thecvf.com/api/miniconf/users/192889?format=json", "institution": "University of Wisconsin-Madison"}, {"id": 192425, "fullname": "Sheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192425?format=json", "institution": "Microsoft"}, {"id": 87710, "fullname": "Qianchu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87710?format=json", "institution": "University of Cambridge"}, {"id": 192424, "fullname": "Guanghui Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192424?format=json", "institution": "Microsoft"}, {"id": 87888, "fullname": "Reuben Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87888?format=json", "institution": "Boston University"}, {"id": 192426, "fullname": "Tristan Naumann", "url": "http://cvpr.thecvf.com/api/miniconf/users/192426?format=json", "institution": "Microsoft Research"}, {"id": 131545, "fullname": "Junjie Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131545?format=json", "institution": "University of Wisconsin, Madison"}, {"id": 140668, "fullname": "Hoifung Poon", "url": "http://cvpr.thecvf.com/api/miniconf/users/140668?format=json", "institution": "Microsoft"}], "abstract": "High-quality and carefully curated data is a cornerstone of training medical large language models, as it directly impacts both generalization and robustness to unseen clinical tasks. We investigate strategies for training and data curation to develop a robust multimodal reasoning model in the medical domain. Our work focuses on supervised fine-tuning (SFT) and explores data recipes that leverage structured reasoning traces. Using our proposed data recipe, we scale experiments to a dataset of over 8 million examples and 6.8 billion response tokens, achieving state-of-the-art performance among open-source models across diverse out-of-distribution medical benchmark tasks. Our results further indicate that curating a high-quality, diverse training dataset with varying structured reasoning trace lengths enables the fine-tuned model to self-calibrate its reasoning trajectory lengths based on the downstream task, without explicit supervision. We present key insights, describe the data curation strategy, and outline next steps toward developing robust medical vision-language reasoning systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39801", "url": null, "sourceid": 46721, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36234, "uid": "4b71f83d2d3803b8d4cd2e6c53698990", "name": "ProPhy: Progressive Physical Alignment for Dynamic World Simulation", "authors": [{"id": 180067, "fullname": "Zijun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180067?format=json", "institution": "Sun Yat-sen University"}, {"id": 184528, "fullname": "Panwen Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184528?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 180176, "fullname": "Jing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180176?format=json", "institution": "Sun Yat-sen University"}, {"id": 184529, "fullname": "Terry Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184529?format=json", "institution": "Vector Institute"}, {"id": 184530, "fullname": "Yuhao Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184530?format=json", "institution": "Lenovo"}, {"id": 184531, "fullname": "Long Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184531?format=json", "institution": "Lenovo Group Limited"}, {"id": 149155, "fullname": "Yiqiang Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/149155?format=json", "institution": "Lenovo Research"}, {"id": 184532, "fullname": "Zutao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184532?format=json", "institution": "Peng Cheng Laboratory"}, {"id": 184533, "fullname": "Hanhui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184533?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 69930, "fullname": "Xiaodan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/69930?format=json", "institution": "Sun Yat-sen University"}], "abstract": "Recent advances in video generation have shown remarkable potential for constructing world simulators. However, current models still struggle to produce physically consistent results, particularly when handling large-scale or complex dynamics. This limitation arises primarily because existing approaches respond isotropically to physical prompts and neglect the fine-grained alignment between generated content and localized physical cues. To address these challenges, we propose ProPhy, a Progressive Physical Alignment Framework that enables explicit physics-aware conditioning and anisotropic generation. ProPhy employs a two-stage Mixture-of-Physics-Experts (MoPE) mechanism for discriminative physical prior extraction, where Semantic Experts infer semantic-level physical principles from textual descriptions, and Refinement Experts capture token-level physical dynamics. This mechanism allows the model to learn fine-grained, physics-aware video representations that better reflect underlying physical laws. Furthermore, we introduce a physical alignment strategy that transfers the physical reasoning capabilities of vision-language models (VLMs) into the Refinement Experts, facilitating a more accurate representation of dynamic physical phenomena. Extensive experiments on physics-aware video generation benchmarks demonstrate that ProPhy produces more realistic, dynamic, and physically coherent results than existing state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36234", "url": null, "sourceid": 35840, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40256?format=json"], "related_events_ids": [40256]}, {"id": 40256, "uid": "4b71f83d2d3803b8d4cd2e6c53698990", "name": "ProPhy: Progressive Physical Alignment for Dynamic World Simulation", "authors": [{"id": 180067, "fullname": "Zijun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180067?format=json", "institution": "Sun Yat-sen University"}, {"id": 184528, "fullname": "Panwen Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184528?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 180176, "fullname": "Jing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180176?format=json", "institution": "Sun Yat-sen University"}, {"id": 184529, "fullname": "Terry Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184529?format=json", "institution": "Vector Institute"}, {"id": 184530, "fullname": "Yuhao Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184530?format=json", "institution": "Lenovo"}, {"id": 184531, "fullname": "Long Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184531?format=json", "institution": "Lenovo Group Limited"}, {"id": 149155, "fullname": "Yiqiang Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/149155?format=json", "institution": "Lenovo Research"}, {"id": 184532, "fullname": "Zutao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184532?format=json", "institution": "Peng Cheng Laboratory"}, {"id": 184533, "fullname": "Hanhui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184533?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 69930, "fullname": "Xiaodan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/69930?format=json", "institution": "Sun Yat-sen University"}], "abstract": "Recent advances in video generation have shown remarkable potential for constructing world simulators. However, current models still struggle to produce physically consistent results, particularly when handling large-scale or complex dynamics. This limitation arises primarily because existing approaches respond isotropically to physical prompts and neglect the fine-grained alignment between generated content and localized physical cues. To address these challenges, we propose ProPhy, a Progressive Physical Alignment Framework that enables explicit physics-aware conditioning and anisotropic generation. ProPhy employs a two-stage Mixture-of-Physics-Experts (MoPE) mechanism for discriminative physical prior extraction, where Semantic Experts infer semantic-level physical principles from textual descriptions, and Refinement Experts capture token-level physical dynamics. This mechanism allows the model to learn fine-grained, physics-aware video representations that better reflect underlying physical laws. Furthermore, we introduce a physical alignment strategy that transfers the physical reasoning capabilities of vision-language models (VLMs) into the Refinement Experts, facilitating a more accurate representation of dynamic physical phenomena. Extensive experiments on physics-aware video generation benchmarks demonstrate that ProPhy produces more realistic, dynamic, and physically coherent results than existing state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40256", "url": null, "sourceid": -35840, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/36234?format=json"], "related_events_ids": [36234]}, {"id": 36802, "uid": "09d0dcb0061c4ca788d8ae68536c46a1", "name": "RHCNet: Residual-Guided Hierarchical Calibration Network for Robust Underwater Object Detection", "authors": [{"id": 181439, "fullname": "Yueying Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181439?format=json", "institution": "Shanghai University"}, {"id": 185901, "fullname": "Yiteng Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185901?format=json", "institution": "Henan Institute of Science and Technology"}, {"id": 185300, "fullname": "Weidong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185300?format=json", "institution": "Zhengzhou University; Henan Institute of Science and Technology"}, {"id": 76232, "fullname": "Jie Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76232?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 131303, "fullname": "Liquan Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/131303?format=json", "institution": "Shanghai University"}, {"id": 185902, "fullname": "Huaicheng Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185902?format=json", "institution": null}, {"id": 149249, "fullname": "Xin Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/149249?format=json", "institution": "National University of Defense Technology"}], "abstract": "Underwater images commonly suffer from foreground-background ambiguity, loss of structural details, and severely reduced contrast, which collectively make underwater object detection (UOD) an inherently challenging task. To handle this issue, we present a residual-guided hierarchical calibration network (RHCNet) designed to achieve more efficient and robust UOD, which comprises a residual-guided feature enhancement module (RGFE) and a hierarchical feature calibration pyramid module (HFCP). Concretely, RHCNet extends the standard ResNet-50 backbone by embedding the RGFE, which effectively strengthens the representation of edge and texture features in blurry regions by jointly leveraging convolutional operations and attention mechanisms to achieve more discriminative feature extraction for UOD. Subsequently, the HFCP integrates a bottom-up semantic enhancement path and a top-down fine-grained feature compensation path, while a K-means clustering\u2013guided feature calibration module is jointly employed to ensure multi-level cross-scale semantic consistency and accurate alignment of salient region features. Extensive experiments on the DUO and UTDAC benchmark datasets demonstrated that our RHCNet attains the highest AP scores of 70.53% and 53.35%, respectively. Besides, our RHCNet also maintains excellent detection accuracy and strong generalization capability on the COCO dataset for terrestrial scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36802", "url": null, "sourceid": 35703, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36536, "uid": "c90425d6f7d882fb67038702d155e16b", "name": "S2C2Seg: Semantic-Spatial Consistency and Category Optimization for Open-Vocabulary Segmentation", "authors": [{"id": 181413, "fullname": "Yuhao Qing", "url": "http://cvpr.thecvf.com/api/miniconf/users/181413?format=json", "institution": "Shanghai University"}, {"id": 181439, "fullname": "Yueying Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181439?format=json", "institution": "Shanghai University"}, {"id": 185299, "fullname": "Chaoyang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185299?format=json", "institution": "Hunan University of Science and Technology"}, {"id": 185300, "fullname": "Weidong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185300?format=json", "institution": "Zhengzhou University; Henan Institute of Science and Technology"}, {"id": 76232, "fullname": "Jie Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76232?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 149249, "fullname": "Xin Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/149249?format=json", "institution": "National University of Defense Technology"}], "abstract": "Open-vocabulary semantic segmentation extends pixel-level recognition to arbitrary text-described categories. Despite strong global semantic understanding, vision-language models such as CLIP exhibit limited spatial precision and semantic ambiguity across large vocabularies, constraining their effectiveness for dense prediction. We present S2C2Seg, a training-free framework that integrates with existing methods through Category Subset Selection (CSS) and Consistent Semantic Guidance (CSG). CSS employs three complementary scoring functions to filter category candidates: CLIP-based global semantic similarity, spatial presence from dense prediction models, and multi-view consistency via alignment and conditional entropy. This joint exploitation of semantic, spatial, and consistency cues reduces category redundancy and semantic ambiguity. CSG adaptively fuses CLIP global features with local spatial predictions through category-specific confidence weighting, applying stronger regularization to high-similarity categories for correcting prediction biases while preserving spatial precision for low-confidence categories. Extensive experiments across eight benchmarks demonstrate broad applicability: when integrated with SCLIP, ProxyCLIP, and CorrCLIP, S2C2Seg achieves consistent improvements of 3.4 to 9.7 percentage points in mIoU, establishing a new state-of-the-art of 51.2\\% average mIoU.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36536", "url": null, "sourceid": 34665, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38849, "uid": "169c542306442d8ef169c0761d661257", "name": "InterAgent: Physics-based Multi-agent Command Execution via Diffusion on Interaction Graphs", "authors": [{"id": 190828, "fullname": "Bin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190828?format=json", "institution": "ShanghaiTech University"}, {"id": 190829, "fullname": "Ruichi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190829?format=json", "institution": "University of Pennsylvania"}, {"id": 127937, "fullname": "Han Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127937?format=json", "institution": "ShanghaiTech University"}, {"id": 190830, "fullname": "Jingyan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190830?format=json", "institution": "ShanghaiTech University"}, {"id": 165062, "fullname": "Juze Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/165062?format=json", "institution": "Stanford University"}, {"id": 89781, "fullname": "Xin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89781?format=json", "institution": "University of Chinese Academy of Sciences, ShanghaiTech University"}, {"id": 73999, "fullname": "Lan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73999?format=json", "institution": "ShanghaiTech University"}, {"id": 75945, "fullname": "Jingyi Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75945?format=json", "institution": "Shanghai Tech University"}, {"id": 87694, "fullname": "Jingya Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87694?format=json", "institution": "ShanghaiTech University"}], "abstract": "Humanoid agents are expected to emulate the complex coordination inherent in human social behaviors. However, existing methods are largely confined to single-agent scenarios, overlooking the physically plausible interplay essential for multi-agent interactions. To bridge this gap, we propose InterAgent, the first end-to-end framework for text-driven physics-based multi-agent humanoid control. At its core, we introduce an autoregressive diffusion transformer equipped with multi-stream blocks, which decouples proprioception, exteroception, and action to mitigate cross-modal interference while enabling synergistic coordination. We further propose a novel interaction graph exteroception representation that explicitly captures fine-grained joint-to-joint spatial dependencies to facilitate network learning. Additionally, within it we devise a sparse edge-based attention mechanism that dynamically prunes redundant connections and emphasizes critical inter-agent spatial relations, thereby enhancing the robustness of interaction modeling. Extensive experiments demonstrate that InterAgent consistently outperforms multiple strong baselines, achieving state-of-the-art performance. It enables producing coherent, physically plausible, and semantically faithful multi-agent behaviors from only text prompts. Our code and data will be released to facilitate future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38849", "url": null, "sourceid": 39382, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39642, "uid": "45e7200bd1dbaf868c1b69de0dec23b9", "name": "Incentivizing Versatile Video Reasoning in MLLMs via Data-Efficient Reinforcement Learning", "authors": [{"id": 182179, "fullname": "Xiaodong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182179?format=json", "institution": "Peking University"}, {"id": 177963, "fullname": "Zhirong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/177963?format=json", "institution": "Peking University"}, {"id": 192551, "fullname": "Langling Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192551?format=json", "institution": "Peking University"}, {"id": 192552, "fullname": "Yuxi Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192552?format=json", "institution": "Peking University"}, {"id": 90413, "fullname": "Peixi Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90413?format=json", "institution": "Peking University"}], "abstract": "Multimodal Large Language Models (MLLMs) have made great progress in video understanding tasks. However, when it comes to understanding complex or lengthy videos, MLLMs tend to overlook details or produce hallucinations. To alleviate these issues, recent work has attempted to leverage reinforcement learning (RL) to boost models' deep linguistic reasoning of complex videos. But these methods have two main problems: First, the RL framework they used has unstable training, high training costs, and is difficult to train satisfactory video reasoning models; Second, the linguistic reasoning process is difficult to guarantee the reliability of visual information. To alleviate these problems, we propose to use multimodal elements for reasoning, and we design a novel framework to build and enhance versatile video reasoning capabilities on MLLMs. We carefully design a multi-task cold start and multi-task reinforcement learning to improve the model's visual perception and proficiency in multiple capabilities. In the inference phase, we leverage multimodal reasoning and dynamic sampling to further improve the performance. We verified the efficiency of the framework on a base MLLM (Qwen2-VL-7B-Base). Through cold-start with 3k data and reinforcement learning training with 5k data, combined with inference design, our final model significantly outperforms the base model on seven public video benchmarks, even surpassing and approaching the state-of-the-art Instruct Models such as Qwen2.5-VL-7B-Instruct trained with large-scale data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39642", "url": null, "sourceid": 31796, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37779, "uid": "33d81348225fe436802063fc73e6f2c5", "name": "From Manuals to Actions: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation", "authors": [{"id": 147354, "fullname": "Chenyang Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/147354?format=json", "institution": "Peking University"}, {"id": 152418, "fullname": "Jiaming Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152418?format=json", "institution": "Peking University"}, {"id": 144526, "fullname": "Hao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/144526?format=json", "institution": "Department of Computer Science and Engineering, The Chinese University of Hong Kong"}, {"id": 188248, "fullname": "Runzhong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188248?format=json", "institution": "Peking University"}, {"id": 188249, "fullname": "Qingpo Wuwu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188249?format=json", "institution": "Peking University"}, {"id": 129188, "fullname": "Xiaoqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129188?format=json", "institution": "Peking University"}, {"id": 188250, "fullname": "Zhuoyang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188250?format=json", "institution": "Peking University"}, {"id": 188251, "fullname": "Ying Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188251?format=json", "institution": "Peking University"}, {"id": 91572, "fullname": "Renrui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91572?format=json", "institution": "MMLab of CUHK &amp;amp;amp; Shanghai AI Laboratory"}, {"id": 188252, "fullname": "Peng Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/188252?format=json", "institution": "Simplexity Robotics"}, {"id": 87709, "fullname": "Pheng-Ann Heng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87709?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 91956, "fullname": "Shanghang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91956?format=json", "institution": "Peking University"}], "abstract": "Vision\u2013Language\u2013Action (VLA) models have recently emerged, demonstrating strong generalization in robotic scene understanding and manipulation. However, when confronted with long-horizon tasks that require defined goal states, such as LEGO assembly or object rearrangement, existing VLA models still face challenges in coordinating long-horizon planning with precise manipulation.Therefore, we aim to endow a VLA model with the capability to infer the \u201chow\u201d process from the \u201cwhat\u201d outcomes, transforming goal states into executable procedures. In this paper, we introduce ManualVLA, a unified VLA framework built upon a Mixture-of-Transformers (MoT) architecture, enabling coherent collaboration between multimodal manual generation and action execution. Unlike prior VLA models that directly map sensory inputs to actions, we first equip ManualVLA with a planning expert that generates intermediate manuals consisting of images, visual prompts, and textual instructions. Building upon these multimodal manuals, we design a Manual Chain-of-Thought (ManualCoT) reasoning process that feeds them into the action expert, where each manual step provides explicit control conditions, while its latent representation offers implicit guidance for accurate manipulation. To alleviate the burden of data collection, we develop a high-fidelity digital-twin toolkit based on 3D Gaussian Splatting, which automatically generates manual data for planning expert training. ManualVLA demonstrates strong real-world performance, achieving an average success rate 32\\% higher than the previous hierarchical SOTA baseline on LEGO assembly and object rearrangement tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37779", "url": null, "sourceid": 44607, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40327, "uid": "bb4e0dfd6acf0e0ac1279445b598971e", "name": "Memory-Augmented Scene Understanding and Exploration for Open-World Aerial Object-Goal Navigation", "authors": [{"id": 189349, "fullname": "Jiacong Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/189349?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 73555, "fullname": "Jiaxu Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/73555?format=json", "institution": "Zhejiang University"}, {"id": 189350, "fullname": "Yourun Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189350?format=json", "institution": "Harbin Institute of Technology"}, {"id": 186689, "fullname": "Xianyun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186689?format=json", "institution": "Harbin Institute of Technology"}, {"id": 84768, "fullname": "Jun Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84768?format=json", "institution": "Zhejiang University"}, {"id": 88598, "fullname": "Jun Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88598?format=json", "institution": "Hangzhou Dianzi University"}], "abstract": "Aerial object-goal navigation (Aerial ObjectNav) requires  an Unmanned Aerial Vehicle (UAV) to navigate to target objects in large-scale outdoor environments using only visual observations and high-level object descriptions, without detailed step-by-step instructions. Existing approaches rely on local observations or short-term history, lacking comprehensive scene understanding and efficient spatial exploration strategies, which constrains their navigation capability in complex aerial scenarios. To address these challenges, we propose OctMem-Agent, an octree memory-augmented framework for aerial object-goal navigation. Specifically, we introduce an Adaptive Octree Memory that incrementally aggregates RGB-D observations into a hierarchical 3D representation, capturing both explored regions and unexplored frontiers across large-scale aerial environments. We further propose a Instruction-Guided Memory Query module that extracts task-relevant scene and exploration tokens through instruction-modulated queries. By integrating these tokens with visual observations and language instructions, OctoMem-Agent achieves comprehensive scene understanding and effective spatial exploration for target localization. Extensive experiments on the Aerial ObjectNav benchmark UAV-ON demonstrate that our method achieves a significant 7.5\\% improvement in success rate over existing methods, validating the effectiveness of our design.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40327", "url": null, "sourceid": -39677, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38218?format=json"], "related_events_ids": [38218]}, {"id": 36937, "uid": "229fbc1bdb2ca4abe9cc79f1dcb71542", "name": "M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models", "authors": [{"id": 186267, "fullname": "Joongmin Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186267?format=json", "institution": "Korea University"}, {"id": 186268, "fullname": "Jeongbae Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/186268?format=json", "institution": "Korea University"}, {"id": 186269, "fullname": "Jaehyung Seo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186269?format=json", "institution": "Konkuk University"}, {"id": 186270, "fullname": "Heuiseok Lim", "url": "http://cvpr.thecvf.com/api/miniconf/users/186270?format=json", "institution": "Korea University"}], "abstract": "In large-scale industrial documents with scanned images, complex layouts, and multiple pages, the effectiveness of retrieval-augmented generation (RAG) is highly dependent on chunking quality. However, existing text-centric chunkers overlook the visual and structural cues present in real-world documents, leading to redundant or ambiguous chunks that impair retrieval and answer accuracy. To address this problem, we propose \\textbf{\\ours} which integrates (i) SharedDet for normalizing document parsing and OCR outputs into a document-level frame, (ii) Multi-modal block embeddings with boundary-aware SoftROI, (iii) global document-tree reconstruction via biaffine scoring, and (iv) structure-aware dependency chunking that preserves boundaries and reduces redundancy. \\ours achieves consistent gains across both Document Hierarchical Parsing (DHP) and corpus-level RAG evaluations, improving STEDS by +28.5--39.6\\%, retrieval nDCG by +1.1--15.3\\%, and QA ANLS by +4.5--15.3\\%. These results demonstrate that modeling document-level dependencies with Multi-modal, structure-aware chunking improves RAG performance on long, multi-page industrial documents.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36937", "url": null, "sourceid": 40719, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38218, "uid": "bb4e0dfd6acf0e0ac1279445b598971e", "name": "Memory-Augmented Scene Understanding and Exploration for Open-World Aerial Object-Goal Navigation", "authors": [{"id": 189349, "fullname": "Jiacong Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/189349?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 73555, "fullname": "Jiaxu Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/73555?format=json", "institution": "Zhejiang University"}, {"id": 189350, "fullname": "Yourun Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189350?format=json", "institution": "Harbin Institute of Technology"}, {"id": 186689, "fullname": "Xianyun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186689?format=json", "institution": "Harbin Institute of Technology"}, {"id": 84768, "fullname": "Jun Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84768?format=json", "institution": "Zhejiang University"}, {"id": 88598, "fullname": "Jun Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88598?format=json", "institution": "Hangzhou Dianzi University"}], "abstract": "Aerial object-goal navigation (Aerial ObjectNav) requires  an Unmanned Aerial Vehicle (UAV) to navigate to target objects in large-scale outdoor environments using only visual observations and high-level object descriptions, without detailed step-by-step instructions. Existing approaches rely on local observations or short-term history, lacking comprehensive scene understanding and efficient spatial exploration strategies, which constrains their navigation capability in complex aerial scenarios. To address these challenges, we propose OctMem-Agent, an octree memory-augmented framework for aerial object-goal navigation. Specifically, we introduce an Adaptive Octree Memory that incrementally aggregates RGB-D observations into a hierarchical 3D representation, capturing both explored regions and unexplored frontiers across large-scale aerial environments. We further propose a Instruction-Guided Memory Query module that extracts task-relevant scene and exploration tokens through instruction-modulated queries. By integrating these tokens with visual observations and language instructions, OctoMem-Agent achieves comprehensive scene understanding and effective spatial exploration for target localization. Extensive experiments on the Aerial ObjectNav benchmark UAV-ON demonstrate that our method achieves a significant 7.5\\% improvement in success rate over existing methods, validating the effectiveness of our design.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38218", "url": null, "sourceid": 39677, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40327?format=json"], "related_events_ids": [40327]}, {"id": 37477, "uid": "c801fb350a465fe48197dda942b8091a", "name": "When Robots Should Say ''I Don\u2019t Know'': Benchmarking Abstention in Embodied Question Answering", "authors": [{"id": 181836, "fullname": "Tao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181836?format=json", "institution": "Nanyang Technological University"}, {"id": 187544, "fullname": "Chuhao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/187544?format=json", "institution": "Nanyang Technological University"}, {"id": 187545, "fullname": "Guangyu Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187545?format=json", "institution": "Peking University"}, {"id": 187546, "fullname": "Haozhi Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187546?format=json", "institution": "The University of Tokyo, The University of Tokyo"}, {"id": 187547, "fullname": "Yewen Pu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187547?format=json", "institution": "Nanyang Technological University"}, {"id": 187548, "fullname": "Jianfei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187548?format=json", "institution": "Nanyang Technological University"}], "abstract": "Embodied Question Answering (EQA) requires an agent to interpret language, perceive its environment, and navigate within 3D scenes to produce responses. Existing EQA benchmarks assume that every question must be answered, but embodied agents should know when they do not have sufficient information to answer. In this work, we focus on a minimal requirement for EQA agents, abstention: knowing when to withhold an answer. From an initial study of 500 human queries, we find that 32.4\\% contain missing or underspecified context. Drawing on this initial study and cognitive theories of human communication errors, we derive five representative categories requiring abstention: actionability limitation, referential underspecification, preference dependence, information unavailability, and false presupposition. We augment OpenEQA by having annotators transform well-posed questions into ambiguous variants outlined by these categories. The resulting dataset, AbstainEQA, comprises 1,636 annotated abstention cases paired with 1,636 original OpenEQA instances for balanced evaluation. Evaluating on AbstainEQA, we find that even the best frontier model only attains 42.79\\% abstention recall, while humans achieve 91.17\\%. We also find that scaling, prompting, and reasoning only yield marginal gains, and that fine-tuned models overfit to textual cues. Together, these results position abstention as a fundamental prerequisite for reliable interaction in embodied settings and as a necessary basis for effective clarification.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37477", "url": null, "sourceid": 37519, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37907, "uid": "0d21d741c65d5b132fa8db59fcf73abc", "name": "ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding", "authors": [{"id": 182455, "fullname": "Jovana Kondic", "url": "http://cvpr.thecvf.com/api/miniconf/users/182455?format=json", "institution": "MIT"}, {"id": 161354, "fullname": "Pengyuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/161354?format=json", "institution": "MIT-IBM Watson AI Lab"}, {"id": 85838, "fullname": "Dhiraj Joshi", "url": "http://cvpr.thecvf.com/api/miniconf/users/85838?format=json", "institution": "IBM Research"}, {"id": 188548, "fullname": "Isaac Sanchez", "url": "http://cvpr.thecvf.com/api/miniconf/users/188548?format=json", "institution": "International Business Machines; Massachusetts Institute of Technology"}, {"id": 188549, "fullname": "Ben wiesel", "url": "http://cvpr.thecvf.com/api/miniconf/users/188549?format=json", "institution": "Research, International Business Machines"}, {"id": 188550, "fullname": "Shafiq Abedin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188550?format=json", "institution": "International Business Machines"}, {"id": 188551, "fullname": "Amit Alfassy", "url": "http://cvpr.thecvf.com/api/miniconf/users/188551?format=json", "institution": null}, {"id": 134193, "fullname": "Eli Schwartz", "url": "http://cvpr.thecvf.com/api/miniconf/users/134193?format=json", "institution": "International Business Machines"}, {"id": 188552, "fullname": "Daniel Caraballo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188552?format=json", "institution": "International Business Machines"}, {"id": 188553, "fullname": "Yagmur Gizem Cinar", "url": "http://cvpr.thecvf.com/api/miniconf/users/188553?format=json", "institution": "IBM Research"}, {"id": 188554, "fullname": "Florian Scheidegger", "url": "http://cvpr.thecvf.com/api/miniconf/users/188554?format=json", "institution": "International Business Machines"}, {"id": 188555, "fullname": "Steven I Ross", "url": "http://cvpr.thecvf.com/api/miniconf/users/188555?format=json", "institution": "International Business Machines"}, {"id": 188556, "fullname": "Daniel Weidele", "url": "http://cvpr.thecvf.com/api/miniconf/users/188556?format=json", "institution": "International Business Machines"}, {"id": 134539, "fullname": "Hang Hua", "url": "http://cvpr.thecvf.com/api/miniconf/users/134539?format=json", "institution": "University of Rochester"}, {"id": 188557, "fullname": "Ekaterina Arutyunova", "url": "http://cvpr.thecvf.com/api/miniconf/users/188557?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 89687, "fullname": "Roei Herzig", "url": "http://cvpr.thecvf.com/api/miniconf/users/89687?format=json", "institution": "Tel Aviv University"}, {"id": 179636, "fullname": "Zihan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/179636?format=json", "institution": null}, {"id": 188558, "fullname": "Xinyue Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188558?format=json", "institution": "Abaka AI"}, {"id": 168802, "fullname": "Yunfei Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/168802?format=json", "institution": null}, {"id": 188559, "fullname": "Sicong Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188559?format=json", "institution": "McGill University"}, {"id": 157150, "fullname": "Minghao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157150?format=json", "institution": "2077AI"}, {"id": 157144, "fullname": "Qunshu Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/157144?format=json", "institution": "Abaka AI"}, {"id": 188560, "fullname": "Aude Oliva", "url": "http://cvpr.thecvf.com/api/miniconf/users/188560?format=json", "institution": "Massachusetts Institute of Technology; Massachusetts Institute of Technology"}, {"id": 89688, "fullname": "Rogerio Feris", "url": "http://cvpr.thecvf.com/api/miniconf/users/89688?format=json", "institution": "International Business Machines"}], "abstract": "Understanding charts requires models to jointly reason over geometric visual patterns, structured numerical data, and natural language \u2014 a capability where current vision-language models (VLMs) remain limited. We introduce ChartNet, a high-quality, million-scale multimodal dataset designed to advance chart interpretation and reasoning. ChartNet leverages a novel code-guided synthesis pipeline to generate 1.5 million diverse chart samples spanning 24 chart types and 6 plotting libraries. Each sample consists of five aligned components: plotting code, rendered chart image, data table, natural language summary, and question-answering with reasoning, providing fine-grained cross-modal alignment. To capture the full spectrum of chart comprehension, ChartNet additionally includes specialized subsets encompassing human annotated data, real-world data, safety, and grounding. A rigorous quality-filtering pipeline ensures visual fidelity, semantic accuracy, and diversity across chart representations. Fine-tuning on ChartNet consistently improves results across our benchmarks, demonstrating its utility as large-scale supervision for multimodal models. As the largest open-source dataset of its kind, ChartNet aims to support the development of foundation models with robust and generalizable capabilities for data visualization understanding. We will make both ChartNet data and models publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37907", "url": null, "sourceid": 41542, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36657, "uid": "c2a905eb852f356108eaa8082531b993", "name": "Improving Motion in Image-to-Video Models via Adaptive Low-Pass Guidance", "authors": [{"id": 158956, "fullname": "William June Suk Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/158956?format=json", "institution": "KAIST"}, {"id": 129972, "fullname": "Kyungmin Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/129972?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 84720, "fullname": "Sihyun Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84720?format=json", "institution": "KAIST"}, {"id": 153180, "fullname": "Yisol Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/153180?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 84533, "fullname": "Jinwoo Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/84533?format=json", "institution": "Korea Advanced Institute of Science and Technology"}, {"id": 126267, "fullname": "Kimin Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/126267?format=json", "institution": "KAIST"}], "abstract": "Recent text-to-video (T2V) models have demonstrated strong capabilities in producing high-quality, dynamic videos. To improve the visual controllability, recent works have considered fine-tuning pre-trained T2V models to support image-to-video (I2V) generation. However, such adaptation frequently suppresses motion dynamics of generated outputs, resulting in more static videos compared to their T2V counterparts. In this work, we analyze this phenomenon and identify that it stems from the premature exposure to high-frequency details in the input image, which biases the sampling process toward a shortcut trajectory that overfits to the static appearance of the reference image. To address this, we propose adaptive low-pass guidance (ALG), a simple training-free fix to the I2V model sampling procedure to generate more dynamic videos without compromising per-frame image quality. Specifically, ALG adaptively modulates the frequency content of the conditioning image by applying a low-pass filter at the early stage of denoising. Extensive experiments show ALG significantly improves the temporal dynamics of generated videos, while preserving or even improving image fidelity and text alignment. For instance, on the VBench test suite, ALG achieves a 33\\% average improvement across models in dynamic degree while maintaining the original video quality.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36657", "url": "https://choi403.github.io/ALG/", "sourceid": 35852, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37995, "uid": "6dfbdd2796f306866bd7fa91b79f2339", "name": "SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection", "authors": [{"id": 188784, "fullname": "Hao Vo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188784?format=json", "institution": "University of Arkansas"}, {"id": 93171, "fullname": "Khoa Vo", "url": "http://cvpr.thecvf.com/api/miniconf/users/93171?format=json", "institution": "University of Arkansas - Fayetteville"}, {"id": 164512, "fullname": "Tran Phan Phan", "url": "http://cvpr.thecvf.com/api/miniconf/users/164512?format=json", "institution": "University of Arkansas - Fayetteville"}, {"id": 184289, "fullname": "Cuong Ngo", "url": "http://cvpr.thecvf.com/api/miniconf/users/184289?format=json", "institution": "University of Arkansas, Fayetteville"}, {"id": 188785, "fullname": "Gianfranco Doretto", "url": "http://cvpr.thecvf.com/api/miniconf/users/188785?format=json", "institution": "University of Utah"}, {"id": 188786, "fullname": "Hien Van Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188786?format=json", "institution": "University of Houston"}, {"id": 87853, "fullname": "Anh Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87853?format=json", "institution": "University of Liverpool"}, {"id": 89972, "fullname": "Ngan Le", "url": "http://cvpr.thecvf.com/api/miniconf/users/89972?format=json", "institution": "University of Arkansas, Fayetteville"}], "abstract": "Camera-only 3D object detection has emerged as a cost-effective and scalable alternative to LiDAR for autonomous driving, yet existing methods primarily prioritize overall performance while overlooking the severe long-tail imbalance inherent in real-world datasets. In practice, many rare but safety-critical categories such as children, strollers, or emergency vehicles are heavily underrepresented, leading to biased learning and degraded performance. This challenge is further exacerbated by pronounced inter-class ambiguity (e.g., visually similar subclasses) and substantial intra-class diversity (e.g., objects varying widely in appearance, scale, pose, or context), which together hinder reliable long-tail recognition. In this work, we introduce SemLT3D, a Semantic-Guided Expert Distillation framework designed to enrich the representation space for underrepresented classes through semantic priors. SemLT3D consists of: (1) a language-guided mixture-of-experts module that routes 3D queries to specialized experts according to their semantic affinity, enabling the model to better disentangle confusing classes and specialize on tail distributions; and (2) a semantic projection distillation pipeline that aligns 3D queries with CLIP-informed 2D semantics, producing more coherent and discriminative features across diverse visual manifestations. Although motivated by long-tail imbalance, the semantically structured learning in SemLT3D also improves robustness under broader appearance variations and challenging corner cases, offering a principled step toward more reliable camera-only 3D perception.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37995", "url": null, "sourceid": 31335, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38715, "uid": "a0b19d4cf512379d7ee34a5cc006c6c7", "name": "ExpoCM: Exposure-Aware One-Step Generative Single-Image HDR Reconstruction", "authors": [{"id": 174177, "fullname": "Aoyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/174177?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 189711, "fullname": "Zhen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189711?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 190514, "fullname": "Ziyi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190514?format=json", "institution": "Zhejiang University; University of Electronic Science and Technology of China"}, {"id": 190515, "fullname": "Dian Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190515?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 105589, "fullname": "Bing Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/105589?format=json", "institution": "University of Electronic Science and Technology of China,"}, {"id": 93490, "fullname": "Shuaicheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/93490?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Single-image HDR reconstruction aims to recover high dynamic range radiance from a single low dynamic range (LDR) input, but remains highly ill-posed due to detail saturation in over-exposed regions and noise amplification in under-exposed areas. While recent diffusion-based approaches offer powerful generative priors, they often overlook the exposure-dependent nature of the degradation and incur substantial computational costs from iterative sampling. To address these challenges, we propose ExpoCM, a novel one-step generative HDR reconstruction framework that reformulates HDR reconstruction as a Probability Flow ODE (PF-ODE) and constructs exposure-aware consistency trajectories via exposure-dependent perturbations. Specifically, a soft exposure mask is first constructed to separate the LDR image into over-, under-, and well-exposed regions. Based on this partition, region-conditioned consistency trajectories are designed to hallucinate saturated details, suppress noise in dark regions, and preserve reliable structures within a single, distillation-free inference step. To further enhance perceptual quality, we introduce an Exposure-guided Luminance-Chromaticity Loss in the CIE Lab space, which assigns exposure-aware weights to luminance and chromaticity components, effectively mitigating brightness bias and color drift. Extensive experiments on the HDR-REAL, HDR-EYE, and AIM2025 benchmarks demonstrate that ExpoCM achieves state-of-the-art fidelity and perceptual accuracy, while enabling over 400$\\times$ and 20$\\times$ faster inference compared to DDPM (1000 steps) and DDIM (50 steps), respectively. Code will be released to facilitate future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38715", "url": null, "sourceid": 41803, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37187, "uid": "358680ef4f169bc21f0eec123b85119d", "name": "OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery", "authors": [{"id": 186876, "fullname": "Yiwen Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186876?format=json", "institution": "the Robotics Institute, Carnegie Mellon University"}, {"id": 128570, "fullname": "Ce Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/128570?format=json", "institution": "University of Central Florida"}, {"id": 130633, "fullname": "Yufu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130633?format=json", "institution": "University of Pennsylvania"}, {"id": 186877, "fullname": "Hsueh-Han Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186877?format=json", "institution": null}, {"id": 186878, "fullname": "Liting Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186878?format=json", "institution": "Carnegie Mellon University"}, {"id": 76711, "fullname": "Laszlo Jeni", "url": "http://cvpr.thecvf.com/api/miniconf/users/76711?format=json", "institution": "Carnegie Mellon University"}], "abstract": "Human mesh recovery (HMR) models 3D human body from monocular videos, with recent works extending it to world-coordinate human trajectory and motion reconstruction. However, most existing methods remain offline, relying on future frames or global optimization, which limits their applicability in interactive feedback and perception-action loop scenarios such as AR/VR and telepresence. To address this, we propose OnlineHMR, a fully online framework that jointly satisfies four essential criteria of online processing, including system-level causality, efficiency, faithfulness, and temporal consistency. Built upon the TRAM architecture, OnlineHMR enables streaming inference via a causal key\u2013value cache design and a curated sliding-window learning strategy. Meanwhile, a human-centric incremental SLAM provides online world-grounded alignment under physical plausible trajectory correction. Experimental results show that our method achieves performance comparable to existing chunk-based approaches on the standard EMDB benchmark and highly dynamic custom videos, while uniquely supporting online processing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37187", "url": null, "sourceid": 32287, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37069, "uid": "39cea3836e15eae9838354dd7356c960", "name": "ParaUni: Enhance Generation in Unified Multimodal Model with Reinforcement-driven Hierarchical Parallel Information Interaction", "authors": [{"id": 102872, "fullname": "Jiangtong Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/102872?format=json", "institution": "University of Science and Technology of China"}, {"id": 186597, "fullname": "Lin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186597?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 86575, "fullname": "Jie Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86575?format=json", "institution": "University of Science and Technology of China"}, {"id": 89232, "fullname": "Xiaopeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89232?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 88171, "fullname": "Qi Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/88171?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 86207, "fullname": "Feng Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86207?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Unified multimodal models significantly improve visual generation by combining vision-language models (VLMs) with diffusion models. However, existing methods struggle to fully balance sufficient interaction and flexible implementation due to vast representation difference. Considering abundant and hierarchical information in VLM's layers from low-level details to high-level semantics, we propose ParaUni. It extracts features from variants VLM's layers in a Parallel way for comprehensive information interaction and retains a flexible separation architecture to enhance generation in Unified multimodal model. Concretely, visual features from all VLM's layers are fed in parallel into a Layer Integration Module (LIM), which efficiently integrates fine-grained details and semantic abstractions and provides the fused representation as a condition to the diffusion model. To further enhance performance, we reveal that these hierarchical layers respond unequally to different rewards in Reinforcement Learning (RL). Crucially, we design a Layer-wise Dynamic Adjustment Mechanism (LDAM) to facilitate multiple reward improvements that aligns the hierarchical properties of these layers using RL. Extensive experiments show ParaUni leverages complementary multi-layer features to substantially improve generation quality and shows strong potential for multiple reward advances during RL stages.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37069", "url": null, "sourceid": 45754, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37740, "uid": "6a148326712c649e4b738786289e9eec", "name": "Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs", "authors": [{"id": 86548, "fullname": "Rujiao Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/86548?format=json", "institution": "Alibaba Group"}, {"id": 188139, "fullname": "Yang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188139?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 188140, "fullname": "Xingyao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188140?format=json", "institution": "Alibaba Group"}, {"id": 188141, "fullname": "Weixun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188141?format=json", "institution": "Alibaba Group"}, {"id": 188142, "fullname": "Tianqianjin Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188142?format=json", "institution": "Zhejiang University"}, {"id": 188143, "fullname": "Xi Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188143?format=json", "institution": "Alibaba Group"}, {"id": 188144, "fullname": "Yuchi Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188144?format=json", "institution": "Alibaba Group"}, {"id": 188145, "fullname": "Wenbo Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/188145?format=json", "institution": "Alibaba Group"}, {"id": 88195, "fullname": "Junchi Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88195?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 152921, "fullname": "Bo Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/152921?format=json", "institution": "Alibaba Group"}], "abstract": "Exploration capacity shapes both inference\u2011time performance and reinforcement learning (RL) training for large (vision-) language models, as stochastic sampling often yields redundant reasoning paths with little high-level diversity. This paper proposes Reasoning Palette, a novel latent-modulation framework that endows the model with a stochastic latent variable for strategic contextualization, guiding its internal planning prior to token generation. This latent context is inferred from the mean-pooled embedding of a question\u2013answer pair via a variational autoencoder (VAE), where each sampled latent potentially encodes a distinct reasoning context. During inference, a sampled latent is decoded into learnable token prefixes and prepended to the input prompt, modulating the model's internal reasoning trajectory. In this way, the model performs internal sampling over reasoning strategies prior to output generation, which shapes the style and structure of the entire response sequence. A brief supervised fine-tuning (SFT) warm-up phase allows the model to adapt to this latent conditioning. Within RL optimization, Reasoning Palette facilitates structured exploration by enabling on-demand injection for diverse reasoning modes, significantly enhancing exploration efficiency and sustained learning capability. Experiments across multiple reasoning benchmarks demonstrate that our method enables interpretable and controllable control over the (vision-) language model's strategic behavior, thereby achieving consistent performance gains over standard RL methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37740", "url": null, "sourceid": 40984, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39763, "uid": "51e0fff3f3771b9df247ee2f30931fe8", "name": "Revisiting Unknowns: Towards Effective and Efficient Open-Set Active Learning", "authors": [{"id": 151755, "fullname": "Chen-Chen Zong", "url": "http://cvpr.thecvf.com/api/miniconf/users/151755?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 192805, "fullname": "YuQi Chi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192805?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 192806, "fullname": "Xie-Yang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192806?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 192807, "fullname": "Yan Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/192807?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 156156, "fullname": "Sheng-Jun Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156156?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}], "abstract": "Open-set active learning (OSAL) aims to identify informative samples for annotation when unlabeled data may contain previously unseen classes\u2014a common challenge in safety-critical and open-world scenarios. Existing approaches typically rely on separately trained open-set detectors, introducing substantial training overhead and overlooking the supervisory value of labeled unknowns for improving known-class learning.In this paper, we propose E$^2$OAL (Effective and Efficient Open-set Active Learning), a unified and detector-free framework that fully exploits labeled unknowns for both stronger supervision and more reliable querying. E$^2$OAL first uncovers the latent class structure of unknowns through label-guided clustering in a frozen contrastively pre-trained feature space, optimized by a structure-aware F1-product objective. To leverage labeled unknowns, it employs a Dirichlet-calibrated auxiliary head that jointly models known and unknown categories, improving both confidence calibration and known-class discrimination. Building on this, a logit-margin purity score estimates the likelihood of known classes to construct a high-purity candidate pool, while an OSAL-specific informativeness metric prioritizes partially ambiguous yet reliable samples. These components together form a flexible two-stage query strategy with adaptive precision control and minimal hyperparameter sensitivity.Extensive experiments across multiple OSAL benchmarks demonstrate that E$^2$OAL consistently surpasses state-of-the-art methods in accuracy, efficiency, and query precision, highlighting its effectiveness and practicality for real-world applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39763", "url": null, "sourceid": 44880, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37308, "uid": "52bfa38b4f8a5010613ee88b7abbfe72", "name": "CogniEdit: Dense Gradient Flow Optimization for Fine-Grained Image Editing", "authors": [{"id": 187131, "fullname": "Yan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187131?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 186597, "fullname": "Lin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186597?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 89232, "fullname": "Xiaopeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89232?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 131529, "fullname": "Wei Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/131529?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 86370, "fullname": "Wenhan Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/86370?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 131536, "fullname": "Yike Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/131536?format=json", "institution": "Imperial College London"}, {"id": 88171, "fullname": "Qi Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/88171?format=json", "institution": "Huawei Technologies Ltd."}], "abstract": "Instruction-based image editing with diffusion models has achieved impressive results, yet existing methods struggle with fine-grained instructions specifying precise attributes such as colors, positions, and quantities. While recent approaches employ Group Relative Policy Optimization (GRPO) for alignment, they optimize only at individual sampling steps, providing sparse feedback that limits trajectory-level control. We propose a unified framework **CogniEdit**, combining multi-modal reasoning with dense reward optimization that propagates gradients across consecutive denoising steps, enabling trajectory-level gradient flow through the sampling process. Our method comprises three components: (1) Multi-modal Large Language Models for decomposing complex instructions into actionable directives, (2) Dynamic Token Focus Relocation that adaptively emphasizes fine-grained attributes, and (3) Dense GRPO-based Optimization that propagates gradients across consecutive steps for trajectory-level supervision. Extensive experiments on benchmark datasets demonstrate that our CogniEdit achieves state-of-the-art performance in balancing fine-grained instruction following with visual quality and editability preservation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37308", "url": null, "sourceid": 37943, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39571, "uid": "0c541d7108149c1be78194c1115eb42a", "name": "Video2Robo: 3DGS-based Synthetic Data from One Video Enables Scalable Robot Learning", "authors": [{"id": 192371, "fullname": "Yinan Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192371?format=json", "institution": "Beijing Institute of Technology"}, {"id": 192372, "fullname": "Kejia Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192372?format=json", "institution": "Beijing Institute of Technology"}, {"id": 175162, "fullname": "Ye Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/175162?format=json", "institution": "Beijing Institute of Technology"}, {"id": 192373, "fullname": "Jianyu Dou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192373?format=json", "institution": null}, {"id": 192374, "fullname": "Jiahui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192374?format=json", "institution": "Beijing Institute of Technology"}, {"id": 192375, "fullname": "Jingyu Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192375?format=json", "institution": "Beijing Institute of Technology"}, {"id": 192376, "fullname": "Haojia Ao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192376?format=json", "institution": "Beijing Institute of Technology"}, {"id": 155459, "fullname": "Yi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155459?format=json", "institution": "Beijing Institute of Technology"}, {"id": 155458, "fullname": "Yufeng Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/155458?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "Scalable robot learning is hindered by the high cost of acquiring diverse, high-quality embodied data. Existing data generation approaches partially mitigate this issue but typically depend on hard-to-access hardware and labor-intensive manual effort, with limited generalization to diverse scene configurations. To overcome these limitations, we propose Video2Robo, a framework that generates high-quality and diverse robot data directly from a single human demonstration video, enabling seamless deployment on physical robots. At its core, Video2Robo leverages 3D Gaussian Splatting (3DGS) as a powerful scene representation, enabling high-fidelity rendering and explicit 3D scene editing. The framework tracks temporally consistent motion trajectories of task-relevant objects from raw video footage and identifies key task skills, guiding robots to execute tasks kinematically plausibly under novel object arrangements. Furthermore, by augmenting backgrounds, textures, lighting, and camera views, Video2Robo further enhances the diversity of generated data. Extensive evaluations in both simulation and real-world environments demonstrate that policies trained on Video2Robo data achieve superior generalization and transfer performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39571", "url": null, "sourceid": 38578, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39274, "uid": "8782253a604c16e52832f1bbf7dabd5b", "name": "Prototype-based Causal Intervention for Multi-Label Image Classification", "authors": [{"id": 181217, "fullname": "Yanmin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181217?format=json", "institution": "National University of Defense Technology"}, {"id": 191743, "fullname": "Zhilong Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191743?format=json", "institution": "National University of Defense Technology"}, {"id": 191744, "fullname": "Mao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191744?format=json", "institution": "National University of Defense Technology"}, {"id": 191745, "fullname": "Lihua Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191745?format=json", "institution": "National University of Defense Technology"}, {"id": 191746, "fullname": "Jibing Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191746?format=json", "institution": "National University of Defense Technology"}, {"id": 158708, "fullname": "Weidong Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/158708?format=json", "institution": "National University of Defense Technology"}], "abstract": "Modern multi-label image classification models suffer from a critical reliance on spurious correlations, failing to learn the underlying causal mechanisms.Many causality-inspired methods are impractical, demanding box-level supervision that is rarely available in real-world datasets.Others rely on static confounder dictionaries, which are inherently inflexible and fail to capture complex biases or adapt to feature space changes during training.To address this, we present prototype-based causal intervention (ProCI), a novel framework that approximates the backdoor adjustment using only image-level supervision. It models confounders as learnable contextual prototypes which, unlike traditional prototypes designed for discriminative features, are engineered to represent class-wise co-occurring bias.These prototypes are learned dynamically within a stable memory and leveraged to construct sample-specific bias vectors for an adaptive feature adjustment, effectively counteracting spurious correlations.Experiments on MS-COCO, Pascal VOC, and the challenging Sewer-ML dataset validate our approach. ProCI achieves competitive performance on standard benchmarks while setting a new state-of-the-art on the highly-confounded Sewer-ML. It outperforms the previous best model by a remarkable +5.44 points on the primary $F2_{CIW}$ metric. These results demonstrate the effectiveness of our approach in mitigating complex real-world biases using only image-level supervision.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39274", "url": null, "sourceid": 36380, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36668, "uid": "55a9f4ef0702c7d0df11424f4d538ce9", "name": "MacTok: Robust Continuous Tokenization for Image Generation", "authors": [{"id": 181195, "fullname": "Hengyu Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/181195?format=json", "institution": "Fudan University"}, {"id": 185604, "fullname": "Xin Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185604?format=json", "institution": "Fudan University"}, {"id": 183004, "fullname": "Guanghao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183004?format=json", "institution": "Fudan University"}, {"id": 185605, "fullname": "Yuxiang Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185605?format=json", "institution": "Fudan University"}, {"id": 185606, "fullname": "Jiaoyang Ruan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185606?format=json", "institution": "Fudan University"}, {"id": 103309, "fullname": "Ma Junpeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/103309?format=json", "institution": "Peking University"}, {"id": 185607, "fullname": "Haoyu Albert Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185607?format=json", "institution": "Fudan University"}, {"id": 76452, "fullname": "Jian Pu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76452?format=json", "institution": "Fudan University"}], "abstract": "Continuous image tokenizers enable efficient visual generation, and those based on variational frameworks can learn smooth, structured latent representations through KL regularization. Yet this often leads to posterior collapse when using fewer tokens, where the encoder fails to encode informative features into the compressed latent space. To address this, we introduce **MacTok**, a **M**asked **A**ugmenting 1D **C**ontinuous **Tok**enizer that leverages image masking and representation alignment to prevent collapse while learning compact and robust representations. MacTok applies both random masking to regularize latent learning and DINO-guided semantic masking to emphasize informative regions in images, forcing the model to encode robust semantics from incomplete visual evidence. Combined with global and local representation alignment, MacTok preserves rich discriminative information in a highly compressed 1D latent space, requiring only 64 or 128 tokens. On ImageNet, MacTok achieves a competitive gFID of 1.44 at 256$\\times$256 and a state-of-the-art 1.52 at 512$\\times$512 with SiT-XL, while reducing token usage by up to 64$\\times$. These results confirm that masking and semantic guidance together prevent posterior collapse and achieve efficient, high-fidelity tokenization.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36668", "url": null, "sourceid": 40645, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39314, "uid": "6c714999e23482178f39ffb77647f8a8", "name": "Lyapunov Probes for Hallucination Detection in Large Foundation Models", "authors": [{"id": 144564, "fullname": "Bozhi Luan", "url": "http://cvpr.thecvf.com/api/miniconf/users/144564?format=json", "institution": "University of Science and Technology of China"}, {"id": 191826, "fullname": "Gen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191826?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 130952, "fullname": "Yalan Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/130952?format=json", "institution": "Shanghai University"}, {"id": 191827, "fullname": "Jifeng Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/191827?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 191828, "fullname": "Yun Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191828?format=json", "institution": "National University of Defense Technology"}, {"id": 176806, "fullname": "Faguo Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/176806?format=json", "institution": null}, {"id": 191829, "fullname": "Hongwei Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191829?format=json", "institution": "Beijing academy of blockchain and edge computing"}, {"id": 187269, "fullname": "wenjun wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187269?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 153722, "fullname": "Zhaoxin Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153722?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}], "abstract": "We address hallucination detection in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) by framing the problem through the lens of dynamical systems stability theory. Rather than treating hallucination as a straightforward classification task, we conceptualize (M)LLMs as dynamical systems, where factual knowledge is represented by stable equilibrium points within the representation space. Our main insight is that hallucinations tend to arise at the boundaries of knowledge\u2014transition regions separating stable and unstable zones. To capture this phenomenon, we propose Lyapunov Probes: lightweight networks trained with derivative-based stability constraints that enforce a monotonic decay in confidence under input perturbations. By performing systematic perturbation analysis and applying a two-stage training process, these probes reliably distinguish between stable factual regions and unstable, hallucination-prone areas. Experiments on diverse datasets and models demonstrate consistent improvements over existing baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39314", "url": null, "sourceid": 39855, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37802, "uid": "3a2fa8ce176e261768a601fa98f76ef6", "name": "TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation", "authors": [{"id": 95698, "fullname": "Dong-Guw Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/95698?format=json", "institution": "Seoul National University"}, {"id": 188303, "fullname": "Tai Hyoung Rhee", "url": "http://cvpr.thecvf.com/api/miniconf/users/188303?format=json", "institution": "Seoul National University"}, {"id": 188304, "fullname": "Hyunsoo Jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188304?format=json", "institution": "Seoul National University"}, {"id": 188305, "fullname": "Young-Sik Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188305?format=json", "institution": "Kyungpook National University; Korea Institute of Machinery &amp; Materials"}, {"id": 127907, "fullname": "Ukcheol Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/127907?format=json", "institution": "Carnegie Mellon University (CMU)"}, {"id": 107266, "fullname": "Ayoung Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/107266?format=json", "institution": "Seoul National University"}], "abstract": "Despite the inherent advantages of thermal infrared(TIR) imaging, large-scale data collection and annotation remain a major bottleneck for TIR-based perception. A practical alternative is to synthesize pseudo TIR data via image translation; however, most RGB-to-TIR approaches heavily rely on RGB-centric priors that overlook thermal physics, yielding implausible heat distributions. In this paper, we introduce TherA, a controllable RGB-to-TIR translation framework that produces diverse and thermally plausible images at both scene and object level. TherA couples TherA-VLM with a latent-diffusion-based translator. Given a single RGB image and a user-prompted condition pair, TherA-VLM yields a thermal-aware embedding that encodes scene, object, material, and heat-emission context reflecting the input scene-condition pair. Conditioning the diffusion model on this embedding enables realistic TIR synthesis and fine-grained control across time of day, weather, and object state. Compared to other baselines, TherA achieves state-of-the-art translation performance, demonstrating improved zero-shot translation performance up to 33% increase averaged across all metrics.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37802", "url": "https://donkeymouse.github.io/thera_cvpr26/", "sourceid": 32039, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36491, "uid": "ba662e164a309d42867d16592696807f", "name": "Will Multimodal Models Be Dazzled by Multi-Image Visual Puzzles?", "authors": [{"id": 182027, "fullname": "zhi zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182027?format=json", "institution": "nanjing university"}, {"id": 185183, "fullname": "YaoQi Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185183?format=json", "institution": "nanjing university"}, {"id": 88513, "fullname": "Zhe Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88513?format=json", "institution": "Nanjing University"}, {"id": 185184, "fullname": "Yue Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185184?format=json", "institution": "Nanjing University"}, {"id": 185185, "fullname": "Yangzhou Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185185?format=json", "institution": "Nanjing University"}, {"id": 88504, "fullname": "Tong Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88504?format=json", "institution": "Nanjing University"}], "abstract": "The rapid advancement of Multimodal Large Language Models (MLLMs) has revealed the limitations of existing benchmarks in evaluating complex reasoning over multiple images. To address this gap, we introduce $\\textbf{MIRACLE}$, a novel benchmark for Multi-Image complex Reasoning And Comprehension Logic Evaluation, featuring 4,000 questions across diverse reasoning types such as visual comparison, temporal sequencing, and spatial relations, with each question involving an average of seven tightly correlated images. MIRACLE emphasizes strong inter-image dependencies through a systematic data collection process, followed by delicate instance grouping and question design that enforce cross-image reasoning.Evaluation on leading MLLMs shows that even top-performing models like Gemini-2.5-Pro achieve only 55.91\\% points, highlighting the significant challenges of multi-image reasoning. Moreover, in scenarios characterized by high visual information density, such as puzzle tasks and ultra multi-image input conditions, all models exhibit a significant drop in performance, which highlights the limitations of MLLMs in handling complex structural relations and collaborative reasoning, revealing deficiencies in their cognitive capabilities under high-load visual reasoning settings. We hope MIRACLE will inspire the community to push the boundaries of multi-image reasoning. The benchmark shall be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36491", "url": null, "sourceid": 41313, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36580, "uid": "6fc54846861b7f851eb0541206edc578", "name": "SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation", "authors": [{"id": 152596, "fullname": "Sashuai zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/152596?format=json", "institution": "Zhejiang University"}, {"id": 185401, "fullname": "Qiang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185401?format=json", "institution": "Alibaba Group"}, {"id": 103309, "fullname": "Ma Junpeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/103309?format=json", "institution": "Peking University"}, {"id": 185184, "fullname": "Yue Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185184?format=json", "institution": "Nanjing University"}, {"id": 185402, "fullname": "Ruofan Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185402?format=json", "institution": "Zhejiang University"}, {"id": 152599, "fullname": "Ziang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152599?format=json", "institution": "Zhejiang University"}, {"id": 185403, "fullname": "Xiaoda Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185403?format=json", "institution": "Zhejiang University"}, {"id": 86261, "fullname": "Zhibin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86261?format=json", "institution": "Alibaba Group"}, {"id": 152920, "fullname": "Jun Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/152920?format=json", "institution": "Alibaba Group"}, {"id": 185404, "fullname": "YuCheng YuCheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185404?format=json", "institution": null}, {"id": 152921, "fullname": "Bo Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/152921?format=json", "institution": "Alibaba Group"}, {"id": 84818, "fullname": "Zhou Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84818?format=json", "institution": "Zhejiang University, Tsinghua University"}], "abstract": "Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual quality. However, most existing reward models pay limited attention to fine-grained spatial relationships, often producing images that appear plausible overall yet contain inaccuracies in object positioning. In this work, we present SpatialReward, a verifiable reward model explicitly designed to evaluate spatial layouts in generated images. SpatialReward adopts a multi-stage pipeline: a Prompt Decomposer extracts entities, attributes, and spatial metadata from free-form prompts; expert detectors provide accurate visual grounding of object positions and attributes; and a vision-language model applies chain-of-thought reasoning over grounded observations to assess complex spatial relations that are challenging for rule-based methods. To more comprehensively evaluate spatial relationships in generated images, we introduce SpatRelBench, a benchmark covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into RL training consistently improves spatial consistency and overall generation quality, with results aligned more closely to human judgments. These findings indicate that verifiable reward models hold considerable potential for enabling more accurate and controllable optimization in text-to-image generation models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36580", "url": null, "sourceid": 45630, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39919, "uid": "cf80f2ee3e9cb3e92f6bc04b08b4ded3", "name": "Towards Balanced Multi-Modal Learning in 3D Human Pose Estimation", "authors": [{"id": 154782, "fullname": "Mengshi Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/154782?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 193113, "fullname": "Jiaxuan Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193113?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 193114, "fullname": "Xianlin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193114?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 85161, "fullname": "Huadong Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/85161?format=json", "institution": "Beijing University of Post and Telecommunication, Tsinghua University"}], "abstract": "3D human pose estimation (3D HPE) has emerged as a prominent research topic, particularly in the realm of RGB-based methods. However, the use of RGB images is often limited by issues such as occlusion and privacy constraints. Consequently, multi-modal sensing, which leverages non-intrusive sensors, is gaining increasing attention. Nevertheless, multi-modal 3D HPE still faces challenges, including modality imbalance. In this work, we introduce a novel balanced multi-modal learning method for 3D HPE, which harnesses the power of RGB, LiDAR, mmWave, and WiFi. Specifically, we propose a Shapley value-based contribution algorithm to assess the contribution of each modality and detect modality imbalance. To address this imbalance, we design a modality learning regulation strategy that decelerates the learning process during the early stages of training. We conduct extensive experiments on the widely adopted multi-modal dataset, MM-Fi, demonstrating the superiority of our approach in enhancing 3D pose estimation under complex conditions. We will release our codes soon.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39919", "url": null, "sourceid": 35569, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36793, "uid": "d616a8d3d6f93b0347f1faad28124948", "name": "GIFT: Global Irreplaceability Frame Targeting for Efficient Video Understanding", "authors": [{"id": 103309, "fullname": "Ma Junpeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/103309?format=json", "institution": "Peking University"}, {"id": 152596, "fullname": "Sashuai zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/152596?format=json", "institution": "Zhejiang University"}, {"id": 183004, "fullname": "Guanghao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183004?format=json", "institution": "Fudan University"}, {"id": 185604, "fullname": "Xin Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185604?format=json", "institution": "Fudan University"}, {"id": 185184, "fullname": "Yue Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185184?format=json", "institution": "Nanjing University"}, {"id": 181195, "fullname": "Hengyu Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/181195?format=json", "institution": "Fudan University"}, {"id": 185605, "fullname": "Yuxiang Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185605?format=json", "institution": "Fudan University"}, {"id": 86261, "fullname": "Zhibin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86261?format=json", "institution": "Alibaba Group"}, {"id": 152920, "fullname": "Jun Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/152920?format=json", "institution": "Alibaba Group"}, {"id": 152921, "fullname": "Bo Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/152921?format=json", "institution": "Alibaba Group"}, {"id": 91956, "fullname": "Shanghang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91956?format=json", "institution": "Peking University"}, {"id": 76452, "fullname": "Jian Pu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76452?format=json", "institution": "Fudan University"}], "abstract": "Video Large Language Models (VLMs) have achieved remarkable success in video understanding, but the significant computational cost from processing dense frames severely limits their practical application. Existing methods alleviate this by selecting keyframes, but their greedy decision-making, combined with a decoupled evaluation of relevance and diversity, often falls into local optima and results in erroneously selecting irrelevant noise frames. To address these challenges, we propose **GIFT**: **G**lobal **I**rreplaceability **F**rame **T**argeting, a novel training-free framework that selects frames by assessing their intrinsic irreplaceability. Specifically, we first introduce Directed Diversity to quantify a frame's uniqueness conditioned on relevance, which allows us to formulate a unified irreplaceability score. Subsequently, our Budget-Aware Refinement strategy employs a adaptive iterative process that first secures a core set of frames with the highest irreplaceability, and then shifts its priority to building crucial temporal context around these selections as the budget expands. Extensive experiments demonstrate that GIFT achieves a maximum average improvement of **12.5\\%** across long-form video benchmarks on LLaVA-Video-7B compared to uniform sampling. Code will be released soon.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36793", "url": null, "sourceid": 39135, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39456, "uid": "e6483ae9e85b801c582ca117ecd16eb6", "name": "Attribution as Retrieval: Model-Agnostic AI-Generated Image Attribution", "authors": [{"id": 152627, "fullname": "Hongsong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152627?format=json", "institution": "Southeast University"}, {"id": 192107, "fullname": "Renxi Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192107?format=json", "institution": "Southeast University"}, {"id": 192108, "fullname": "Chaolei Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/192108?format=json", "institution": "Southeast University"}, {"id": 131829, "fullname": "Jie Gui", "url": "http://cvpr.thecvf.com/api/miniconf/users/131829?format=json", "institution": "Southeast University"}], "abstract": "With the rapid advancement of AIGC technologies, image forensics will encounter unprecedented challenges. Traditional methods are incapable of dealing with increasingly realistic images generated by rapidly evolving image generation techniques. To facilitate the identification of AI-generated images and the attribution of their source models, generative image watermarking and AI-generated image attribution have emerged as key research focuses in recent years. However, existing methods are model-dependent, requiring access to the generative models and lacking generality and scalability to new and unseen generators. To address these limitations, this work presents a new paradigm for AI-generated image attribution by formulating it as an instance retrieval problem instead of a conventional image classification problem. We propose an efficient model-agnostic framework, called Low-bIt-plane-based Deepfake Attribution (LIDA). The input to LIDA is produced by Low-Bit Fingerprint Generation module, while the training involves Unsupervised Pre-Training followed by subsequent Few-Shot Attribution Adaptation.Comprehensive experiments demonstrate that LIDA achieves state-of-the-art performance for both Deepfake detection and image attribution under zero- and few-shot settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39456", "url": null, "sourceid": 43828, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36218, "uid": "c98f943dad08d85bee443336d0208431", "name": "SDUIE: Semi-Supervised Diffusion for Underwater Image Enhancement with Quant-Text Dual Control", "authors": [{"id": 69585, "fullname": "Xiaofeng Cong", "url": "http://cvpr.thecvf.com/api/miniconf/users/69585?format=json", "institution": "Southeast University"}, {"id": 184475, "fullname": "Yu-Xin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184475?format=json", "institution": "Southeast University"}, {"id": 131826, "fullname": "Hao Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/131826?format=json", "institution": "Hefei University of Technology"}, {"id": 156998, "fullname": "Yeying Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/156998?format=json", "institution": "Tencent"}, {"id": 128548, "fullname": "Junming Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/128548?format=json", "institution": "Southeast University"}, {"id": 131829, "fullname": "Jie Gui", "url": "http://cvpr.thecvf.com/api/miniconf/users/131829?format=json", "institution": "Southeast University"}], "abstract": "Underwater images often exhibit dominant blue-green hues due to wavelength-dependent light attenuation. While existing enhancement methods have achieved promising performance, they typically overlook the subjective nature of visual preferences. To address this gap, we propose SDUIE, a level-aware Semi-supervised Diffusion framework for Underwater Image Enhancement that enables dual control through both quantitative and textual inputs. SDUIE-Quant allows continuous, numerical adjustment of enhancement levels via low-rank adaptation weight merging within a dual-branch diffusion model. This model comprises a supervised branch trained on synthetic underwater-terrestrial pairs and a self-supervised branch designed to preserve the natural hues of real-world underwater scenes. Building on this, SDUIE-Text introduces intuitive, language-guided control by aligning semantic prompts with visual enhancement effects, leveraging the learned fusion weights. This dual-modality design offers both precise control and flexible, user-preferred enhancement. Experimental results demonstrate that SDUIE achieves state-of-the-art results while better preserving the aesthetic qualities often missed by conventional methods. The source code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36218", "url": null, "sourceid": 38137, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38337, "uid": "bb519123d4b75a73b9691c9fd3d9aeb4", "name": "FoleyDirector: Directing Temporal Controllable Video-to-Audio Generation via Fine-Grained Temporal Scripts", "authors": [{"id": 129015, "fullname": "You Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129015?format=json", "institution": "Zhejiang University"}, {"id": 102479, "fullname": "Dewei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/102479?format=json", "institution": "Zhejiang University"}, {"id": 107182, "fullname": "Fan Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/107182?format=json", "institution": "Zhejiang University"}, {"id": 87276, "fullname": "Fu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/87276?format=json", "institution": "Dalian University of Technology"}, {"id": 87274, "fullname": "Dongliang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/87274?format=json", "institution": "ByteDance Inc. "}, {"id": 86325, "fullname": "Yi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86325?format=json", "institution": "Zhejiang University"}], "abstract": "Recent Video-to-Audio (V2A) methods have achieved remarkable progress, enabling the synthesis of realistic, high-quality audio. However, they struggle with fine-grained temporal control in multi-event scenarios or when visual cues are insufficient, such as small regions, off-screen sounds, or occluded/partially visible objects.In this paper, we propose FoleyDirector, a framework that, for the first time, enables precise temporal guidance in DiT-based V2A generation while preserving the base model\u2019s audio quality and allowing seamless switching between V2A generation and temporally controlled synthesis. FoleyDirector introduces Structured Temporal Scripts(STS), a set of captions corresponding to short temporal segments, to provide richer temporal information. These features are integrated via the Script-Guided Temporal Fusion Module, which employs Temporal Script Attention to fuse STS features coherently. To handle complex multi-event scenarios, we further propose Bi-Frame Sound Synthesis, enabling parallel in-frame and out-of-frame audio generation and improving controllability.To support training and evaluation, we construct the DirectorSound dataset and introduce VGGSound-Director and DirectorBench. Experiments demonstrate that FoleyDirector substantially enhances temporal controllability while maintaining high audio fidelity, empowering users to act as Foley directors and advancing V2A toward more expressive and controllable.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38337", "url": null, "sourceid": 39939, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38891, "uid": "35f27037d233966a4a849da1e3123cb9", "name": "Towards Generalized Multimodal Homography Estimation", "authors": [{"id": 190922, "fullname": "Jinkun You", "url": "http://cvpr.thecvf.com/api/miniconf/users/190922?format=json", "institution": "University of Macau"}, {"id": 190923, "fullname": "Jiaxin Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190923?format=json", "institution": "University of Macau"}, {"id": 70453, "fullname": "Jie Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70453?format=json", "institution": "University of Macau"}, {"id": 86115, "fullname": "Yicong Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/86115?format=json", "institution": "University of Macau"}], "abstract": "Supervised and unsupervised homography estimation methods depend on image pairs tailored to specific modalities to achieve high accuracy. However, their performance deteriorates substantially when applied to unseen modalities. To address this issue, we propose a training data synthesis method that generates unaligned image pairs with ground-truth offsets from a single input image. Our approach renders the image pairs with diverse textures and colors while preserving their structural information. These synthetic data empower the trained model to achieve greater robustness and improved generalization across various domains. Additionally, we design a network to fully leverage cross-scale information and decouple color information from feature representations, thus improving estimation accuracy. Extensive experiments show that our training data synthesis method improves generalization performance. The results also confirm the effectiveness of the proposed network.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38891", "url": null, "sourceid": 35412, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37934, "uid": "9812886076e5749e00292cc3c9777ab3", "name": "MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration", "authors": [{"id": 174673, "fullname": "Runxun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174673?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 187008, "fullname": "Yizhou Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187008?format=json", "institution": "Fudan University"}, {"id": 188625, "fullname": "Li Dongrui", "url": "http://cvpr.thecvf.com/api/miniconf/users/188625?format=json", "institution": "Hebei Medical University"}, {"id": 188626, "fullname": "Bo XU", "url": "http://cvpr.thecvf.com/api/miniconf/users/188626?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 181909, "fullname": "Jingwei Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/181909?format=json", "institution": "Chinese Academy of Sciences, Institute of Automation"}], "abstract": "Deformable image registration (DIR) remains a fundamental yet challenging problem in medical image analysis, largely due to the prohibitively high-dimensional deformation space of dense displacement fields and the scarcity of voxel-level supervision. Existing reinforcement learning frameworks often project this space into coarse, low-dimensional representations, limiting their ability to capture spatially variant deformations. We propose MorphSeek, a fine-grained representation-level policy optimization paradigm that reformulates DIR as a spatially continuous optimization process in the latent feature space. MorphSeek introduces a stochastic Gaussian policy head atop the encoder to model a distribution over latent features, facilitating efficient exploration and coarse-to-fine refinement. The framework integrates unsupervised warm-up with weakly supervised fine-tuning through Group Relative Policy Optimization, where multi-trajectory sampling stabilizes training and improves label efficiency. Across three 3D registration benchmarks (OASIS brain MRI, LiTS liver CT, and Abdomen MR\u2013CT), MorphSeek achieves consistent Dice improvements over competitive baselines while maintaining high label efficiency with minimal parameter cost and low step-level latency overhead. Beyond optimizer specifics, MorphSeek advances a representation-level policy learning paradigm that achieves spatially coherent and data-efficient deformation optimization, offering a principled, backbone-agnostic, and optimizer-agnostic solution for scalable visual alignment in high-dimensional settings.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37934", "url": "https://github.com/cai114514/MorphSeek", "sourceid": 45514, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37922, "uid": "f027bfaeb72367e4f6c0522e2cd9f008", "name": "Mitigating Error Amplification in Fast Adversarial Training", "authors": [{"id": 182573, "fullname": "Mengnan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/182573?format=json", "institution": "Dalian University of Technology"}, {"id": 127233, "fullname": "Lihe Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127233?format=json", "institution": "Dalian University of Technology"}, {"id": 188595, "fullname": "BoWang BoWang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188595?format=json", "institution": "Dalian University of Technology"}, {"id": 188596, "fullname": "Tianhang Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188596?format=json", "institution": "The State Key Laboratory of Blockchain and Data Security, Zhejiang University"}, {"id": 188597, "fullname": "Hong Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/188597?format=json", "institution": "Anhui University"}, {"id": 188598, "fullname": "Geyong Min", "url": "http://cvpr.thecvf.com/api/miniconf/users/188598?format=json", "institution": "University of Exeter"}], "abstract": "Fast Adversarial Training (FAT) has proven effective in enhancing model robustness by encouraging networks to learn perturbation-invariant representations.However, FAT often suffers from catastrophic overfitting (CO), where the model overfits to the training attack and fails to generalize to unseen ones. Moreover, robustness-oriented optimization typically leads to notable performance degradation on clean inputs, and such degradation becomes increasingly severe as the perturbation budget grows.In this work, we conduct a comprehensive analysis of how guidance strength affects model performance by modulating perturbation and supervision levels across distinct confidence groups.The findings reveal that low-confidence samples are the primary contributors to CO and the robustness\u2013accuracy trade-off. Building on this insight, we propose a Distribution-aware Dynamic Guidance (DDG) strategy that dynamically adjusts both the perturbation budget and supervision signal. Specifically, DDG scales the perturbation magnitude according to the sample confidence at the ground-truth class, thereby guiding samples toward consistent decision boundaries while mitigating the influence of learning spurious correlations. Simultaneously, it dynamically adjusts the supervision signal based on the prediction state of each sample, preventing overemphasis on incorrect signals. To alleviate potential gradient instability arising from dynamic guidance, we further design a weighted regularization constraint.Extensive experiments on standard benchmarks demonstrate that DDG effectively alleviates both CO and the robustness\u2013accuracy trade-off.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37922", "url": null, "sourceid": 33686, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40237, "uid": "fea6f06e9ea5fe29ebdd6e510b308c1f", "name": "SpeeDiff: Scalable Pixel-Anchored End-to-End Latent Diffusion Model", "authors": [{"id": 138470, "fullname": "Bingliang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/138470?format=json", "institution": "California Institute of Technology"}, {"id": 129530, "fullname": "Wenda Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129530?format=json", "institution": "California Institute of Technology"}, {"id": 131462, "fullname": "Yizhuo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/131462?format=json", "institution": "The University of Hong Kong"}, {"id": 91080, "fullname": "Linjie Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91080?format=json", "institution": "ByteDance Inc."}, {"id": 76018, "fullname": "Yisong Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/76018?format=json", "institution": "Caltech"}, {"id": 184011, "fullname": "Katie Bouman", "url": "http://cvpr.thecvf.com/api/miniconf/users/184011?format=json", "institution": "California Institute of Technology"}, {"id": 154412, "fullname": "Yang Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/154412?format=json", "institution": "OpenAI"}, {"id": 130493, "fullname": "Qiushan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/130493?format=json", "institution": "The University of Hong Kong"}], "abstract": "We present Scalable Pixel-anchored End-to-end Diffusion (SpeeDiff), a latent diffusion method that jointly trains the VAE and the diffusion model from scratch. In principle, joint training allows the diffusion loss gradient to directly guide the VAE encoder, encouraging the formation of a generation-friendly latent space and potentially yielding faster convergence than the conventional two-stage approach with a pretrained frozen VAE. However, a naive end-to-end implementation severely degrades performance, as unrestricted backpropagation of the diffusion loss leads to latent space collapse. Our main technical contribution is a simple yet effective Tweedie Pixel Reconstruction (TPR) loss, which provides additional pixel-level feedback by decoding a predicted clean latent from an intermediate noisy state using Tweedie's formula, thereby alleviating collapse. Furthermore, our method enables jointly scaling a fully transformer-based architecture and enhances representation alignment within the end-to-end framework. Our SpeeDiff-XL model achieves over 140\u00d7 and 61\u00d7 faster training compared to Vanilla SiT and REPA, respectively, while attaining an FID of 1.50 without guidance on ImageNet 256\u00d7256 generation. With a more efficient 32\u00d7 compressed VAE, our model further reaches an FID of 1.53 without guidance on ImageNet 512\u00d7512 generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40237", "url": null, "sourceid": 40670, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38319, "uid": "c175c9d1e6adc300ee406d504e5c34f5", "name": "ENC-Bench: A Benchmark for Evaluating Multimodal Large Language Models in Electronic Navigational Chart Understanding", "authors": [{"id": 189591, "fullname": "Ao Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189591?format=json", "institution": "National University of Defense Technology"}, {"id": 189592, "fullname": "Xingming Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189592?format=json", "institution": "National University of Defense Technology"}, {"id": 189593, "fullname": "Xuanyu Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/189593?format=json", "institution": "National University of Defense Technology"}, {"id": 143537, "fullname": "Xixiang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/143537?format=json", "institution": "National University of Defense Technology"}, {"id": 189594, "fullname": "Qiyao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/189594?format=json", "institution": "National University of Defense Technology"}, {"id": 189595, "fullname": "Chunping Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189595?format=json", "institution": "Songshan Laboratory; Information Engineering University"}, {"id": 189596, "fullname": "Runke Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189596?format=json", "institution": null}, {"id": 189597, "fullname": "Qingyong Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189597?format=json", "institution": null}], "abstract": "Electronic Navigational Charts (ENCs) are the safety-critical backbone of modern maritime navigation, yet it remains unclear whether multimodal large language models (MLLMs) can reliably interpret them. Unlike natural images or conventional charts, ENCs encode regulations, bathymetry, and route constraints via standardized vector symbols, scale-dependent rendering, and precise geometric structure---requiring specialized maritime expertise for interpretation.  We introduce ENC-Bench, the first benchmark dedicated to professional ENC understanding. ENC-Bench contains 20,490 expert-validated samples from 840 authentic National Oceanic and Atmospheric Administration (NOAA) ENCs, organized into a three-level hierarchy: Perception (symbol and feature recognition), Spatial Reasoning (coordinate localization, bearing, distance), and Maritime Decision-Making (route legality, safety assessment, emergency planning under multiple constraints). All samples are generated from raw S-57 data through a calibrated vector-to-image pipeline with automated consistency checks and expert review. We evaluate 10 state-of-the-art MLLMs such as GPT-4o, Gemini 2.5, Qwen3-VL, InternVL-3, and GLM-4.5V, under a unified zero-shot protocol. The best model achieves only 47.88% accuracy, with systematic challenges in symbolic grounding, spatial computation, multi-constraint reasoning, and robustness to lighting and scale variations. By establishing the first rigorous ENC benchmark, we open a new research frontier at the intersection of specialized symbolic reasoning and safety-critical AI, providing essential infrastructure for advancing MLLMs toward professional maritime applications.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38319", "url": null, "sourceid": 45151, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38438, "uid": "1a09e19f5eacbec80640b3286d175afa", "name": "Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure", "authors": [{"id": 153275, "fullname": "Jooyeol Yun", "url": "http://cvpr.thecvf.com/api/miniconf/users/153275?format=json", "institution": "Korea Advanced Institute of Science and Technology"}, {"id": 87936, "fullname": "Jaegul Choo", "url": "http://cvpr.thecvf.com/api/miniconf/users/87936?format=json", "institution": "Korea Advanced Institute of Science and Technology"}], "abstract": "Scalable Vector Graphics (SVG) are central to modern web design, and the demand to animate them continues to grow as web environments become increasingly dynamic.Yet automating the animation of vector graphics remains challenging for vision\u2013language models (VLMs) despite recent progress in code generation and motion planning.VLMs routinely mis-handle SVGs, since visually coherent parts are often fragmented into low-level shapes that offer little guidance of which elements should move together. In this paper, we introduce a framework that recovers the semantic structure required for reliable SVG animation and reveals the missing layer that current VLM systems overlook. This is achieved through a statistical aggregation of multiple weak part predictions, allowing the system to stably infer semantics from noisy predictions.By reorganizing SVGs into semantic groups, our approach enables VLMs to produce animations with far greater coherence. Our experiments demonstrate substantial gains over existing approaches, suggesting that semantic recovery is the key step that unlocks robust SVG animation and supports more interpretable interactions between VLMs and vector graphics.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38438", "url": null, "sourceid": 33750, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39509, "uid": "556dd6e75400f8f61f8e864ebef83147", "name": "MetroGS: Efficient and Stable Reconstruction of Geometrically Accurate High-Fidelity Large-Scale Scenes", "authors": [{"id": 176793, "fullname": "Kehua Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/176793?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 192231, "fullname": "Tianlu Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192231?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 159457, "fullname": "Xinzhu Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/159457?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 192232, "fullname": "Hao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192232?format=json", "institution": "The Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 192233, "fullname": "Zehao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192233?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 192234, "fullname": "Zihan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192234?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 146223, "fullname": "Shuqin Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/146223?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 146371, "fullname": "Honglong Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/146371?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 89586, "fullname": "Feng Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/89586?format=json", "institution": "ICT, Chinese Academy of Sciences"}, {"id": 153000, "fullname": "Yucheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153000?format=json", "institution": "Institute of Computing Technology, CAS"}, {"id": 192235, "fullname": "Zhaoqi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192235?format=json", "institution": ", Chinese Academy of Sciences"}], "abstract": "Recently, 3D Gaussian Splatting and its derivatives have achieved significant breakthroughs in large-scale scene reconstruction. However, how to efficiently and stably achieve high-quality geometric fidelity remains a core challenge. To address this issue, we introduce MetroGS, a novel Gaussian Splatting framework for efficient and robust reconstruction in complex urban environments. Our method is built upon a distributed 2D Gaussian Splatting representation as the core foundation, serving as a unified backbone for subsequent modules. To handle potential sparse regions in complex scenes, we propose a structured dense enhancement scheme that utilizes SfM priors and a pointmap model to achieve a denser initialization, while incorporating a sparsity compensation mechanism to improve reconstruction completeness. Furthermore, we design a progressive hybrid geometric optimization strategy that organically integrates monocular and multi-view optimization to achieve efficient and accurate geometric refinement. Finally, to address the appearance inconsistency commonly observed in large-scale scenes, we introduce a depth-guided appearance modeling approach that learns spatial features with 3D consistency, facilitating effective decoupling between geometry and appearance and further enhancing reconstruction stability. Experiments on large-scale urban datasets demonstrate that MetroGS achieves superior geometric accuracy, rendering quality, offering a unified solution for high-fidelity large-scale scene reconstruction. The code will be publicly released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39509", "url": null, "sourceid": 44007, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37832, "uid": "378271842c95b7894122b598f9874a14", "name": "Selectively Extracting and Injecting Visual Attributes into Text-to-Image Models", "authors": [{"id": 183673, "fullname": "Seunghwan Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/183673?format=json", "institution": "KAIST"}, {"id": 153275, "fullname": "Jooyeol Yun", "url": "http://cvpr.thecvf.com/api/miniconf/users/153275?format=json", "institution": "Korea Advanced Institute of Science and Technology"}, {"id": 188358, "fullname": "Youngdo Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/188358?format=json", "institution": "KAIST"}, {"id": 87936, "fullname": "Jaegul Choo", "url": "http://cvpr.thecvf.com/api/miniconf/users/87936?format=json", "institution": "Korea Advanced Institute of Science and Technology"}], "abstract": "Text-to-image models are increasingly utilized in design workflows, but articulating nuanced design intentions through text remains a challenge. This work proposes a method that extracts a visual attribute from a reference image and injects it directly into the generation pipeline. The method optimizes a text token to exclusively represent the target attribute using a custom training prompt and two novel embeddings: distilled embedding and residual embedding. Through this approach, a wide range of attributes can be extracted, including the shape, material, or color of an object, as well as the camera angle of the image. The method is validated on various target attributes and text prompts drawn from a newly constructed dataset. The results show that it outperforms existing approaches in selectively extracting and applying target attributes across diverse contexts. Ultimately, the proposed method enables intuitive and controllable text-to-image generation, streamlining the design process.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37832", "url": null, "sourceid": 34524, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39609, "uid": "50ca3a3bb3e7bf160f31f82354221313", "name": "Low-Rank Test-Time Training for Pre-Trained Point Cloud Models", "authors": [{"id": 181498, "fullname": "Ouyangzi Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/181498?format=json", "institution": "zhejiang university"}, {"id": 154184, "fullname": "Feifei Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/154184?format=json", "institution": "Zhejiang University"}, {"id": 192469, "fullname": "Kexin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192469?format=json", "institution": null}, {"id": 129256, "fullname": "Yawei Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/129256?format=json", "institution": "Zhejiang University"}, {"id": 152183, "fullname": "Zikai Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/152183?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 129361, "fullname": "Ping Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129361?format=json", "institution": "Institute of High Performance Computing, Singapore, A*STAR"}, {"id": 187064, "fullname": "Fengda Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187064?format=json", "institution": "Nanyang Technological University"}, {"id": 131756, "fullname": "Hongwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131756?format=json", "institution": "Zhejiang University"}, {"id": 84768, "fullname": "Jun Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84768?format=json", "institution": "Zhejiang University"}], "abstract": "Test-time training (TTT) enhances the robustness of pretrained models to out-of-distribution (OOD) data through auxiliary self-supervised tasks, without requiring labeled samples. However, existing TTT methods predominantly rely on decoder-based auxiliary objectives, which suffer from inefficient adaptation and weak coupling with the primary task. To solve these limitations, we revisit the mechanism of test-time training by analyzing masking-based pretrained models to uncover the fundamental source of their OOD robustness. Our investigation reveals that their generalization capability stems from a latent feature-level structural invariance, the consistency of encoded representations under masked perturbations. Building on this insight, we introduce LoTT-PC, a lightweight LoRA-based framework that operationalizes this invariance-preserving principle for 3D point cloud classification. LoTT-PC consists of two main components: (1) low-rank modulation units for parameter-efficient adaptation, and (2) a permutation-invariant alignment mechanism that enforces representation consistency through masked feature alignment. Extensive experiments on multiple benchmarks demonstrate that this unified design enables pretrained point cloud models to self-tune rapidly and reliably across diverse OOD scenarios, outperforming state-of-the-art methods by an average of $2.7\\%$ in accuracy under various corruption types.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39609", "url": null, "sourceid": 38994, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37027, "uid": "89ffff7b8d7d48403685344754816cf3", "name": "Dynamic Token Reweighting for Robust Vision-Language Models", "authors": [{"id": 180136, "fullname": "Tanqiu Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180136?format=json", "institution": "Stony Brook University"}, {"id": 186520, "fullname": "Jiacheng Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186520?format=json", "institution": "State University of New York at Stony Brook"}, {"id": 132383, "fullname": "Rongyi Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/132383?format=json", "institution": "Stony Brook University, State University of New York at Stony Brook"}, {"id": 156079, "fullname": "Jiawei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/156079?format=json", "institution": "Stony Brook University"}, {"id": 186208, "fullname": "Fenglong Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/186208?format=json", "institution": "Pennsylvania State University"}, {"id": 186207, "fullname": "Ting Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186207?format=json", "institution": "State University of New York at Stony Brook"}], "abstract": "Large vision-language models (VLMs) are highly vulnerable to multimodal jailbreak attacks that exploit visual-textual interactions to bypass safety guardrails. In this paper, we present DTR, a novel inference-time defense that mitigates multimodal jailbreak attacks through optimizing the model\u2019s key-value (KV) caches. Rather than relying on curated safety-specific data or costly image-to-text conversion, we introduce a new formulation of the safety-relevant distributional shift induced by the visual modality. This formulation enables DTR to dynamically adjust visual token weights, minimizing the impact of adversarial visual inputs while preserving the model\u2019s general capabilities and inference efficiency. Extensive evaluation across diverse VLMs and attack benchmarks demonstrates that DTR outperforms existing defenses in both attack robustness and benign-task performance, marking the first successful application of KV cache optimization for safety enhancement in multimodal foundation models. (The code for replicating DTR is included in the supplementary materials.)", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37027", "url": null, "sourceid": 37773, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37506, "uid": "a0def054ef84ac2784ea52baee05d95f", "name": "LangField4D: Learning Identity-Adaptive and Spatio-Temporal Continuous 4D Language Fields for Dynamic Scenes", "authors": [{"id": 174298, "fullname": "Yichao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/174298?format=json", "institution": "Zhejiang University"}, {"id": 143366, "fullname": "Qiaowei Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/143366?format=json", "institution": "Zhejiang University"}, {"id": 181906, "fullname": "Jinsheng Quan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181906?format=json", "institution": "Zhejiang University"}, {"id": 90421, "fullname": "Wei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90421?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 185783, "fullname": "Zhihui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185783?format=json", "institution": "University of Science and Technology of China"}, {"id": 129256, "fullname": "Yawei Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/129256?format=json", "institution": "Zhejiang University"}], "abstract": "Constructing a 4D language field that supports open-vocabulary queries is essential for semantic perception and interaction in dynamic environments. Existing 4D Gaussian-based approaches face two major challenges. First, the assumption of a static identity per Gaussian leads to semantic inconsistency, as motion fields warp Gaussians across object boundaries over time, causing oscillating identity assignments. Second, current methods typically model dynamic semantics as a set of discrete, predefined state prototypes, which fail to capture dynamic continuity and delineate fine-grained temporal boundaries. To address these issues, we propose LangField4D, a novel 4D Gaussian framework that jointly models spatio-temporal identity and semantics in a unified and continuous representation. We introduce an Identity-Adaptive Gaussian Grouping module that assigns each Gaussian a learnable adaptation feature to dynamically capture its object affiliation, ensuring consistent semantic tracking across time. Building upon this affiliation structure, we further design a Continuous Spatio-Temporal Semantic Learning mechanism based on a Tetraplane representation, which encodes both time-invariant and time-varying semantics within a continuous latent space. Extensive experiments on dynamic scene benchmarks demonstrate that we achieve state-of-the-art performance on both time-agnostic and time-sensitive open-vocabulary query tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37506", "url": null, "sourceid": 31836, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37247, "uid": "a702207db66fc2b8859a0e7aa413f0b2", "name": "Z-Order Transformer for Feed-Forward Gaussian Splatting", "authors": [{"id": 181441, "fullname": "Can Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181441?format=json", "institution": null}, {"id": 187006, "fullname": "Lei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187006?format=json", "institution": "University of Hong Kong"}, {"id": 92788, "fullname": "Wei Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/92788?format=json", "institution": "Futurewei Technologies"}, {"id": 86789, "fullname": "Dong Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86789?format=json", "institution": "University of Hong Kong"}], "abstract": "Recent advances in 3D Gaussian Splatting (3DGS) have enabled significant progress in photorealistic novel view synthesis. However, traditional 3DGS relies on a slow, iterative optimization process, which limits its use in scenarios demanding real-time results. To overcome this bottleneck, recent feed-forward methods aim to predict Gaussian attributes directly from images, but they often struggle with the redundancy of Gaussian primitives and rendering quality. In this paper, we introduce a transformer-based architecture specifically designed for feed-forward Gaussian Splatting. Our key insight is that spatial and semantic relationships among Gaussians can be effectively captured through a sparse attention mechanism, enabled by a Z-order strategy that organizes the unstructured Gaussian set into a spatially coherent sequence. Furthermore, we incorporate this Z-order strategy to adaptively suppress redundancy while preserving critical structural details.  This allows the transformer to efficiently model context, compress Gaussian primitives, and predict Gaussian attributes in a single forward pass. Comprehensive experiments demonstrate that our method achieves fast and high-quality novel view synthesis with fewer Gaussian primitives.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37247", "url": null, "sourceid": 37361, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40291?format=json"], "related_events_ids": [40291]}, {"id": 40291, "uid": "a702207db66fc2b8859a0e7aa413f0b2", "name": "Z-Order Transformer for Feed-Forward Gaussian Splatting", "authors": [{"id": 181441, "fullname": "Can Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181441?format=json", "institution": null}, {"id": 187006, "fullname": "Lei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187006?format=json", "institution": "University of Hong Kong"}, {"id": 92788, "fullname": "Wei Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/92788?format=json", "institution": "Futurewei Technologies"}, {"id": 86789, "fullname": "Dong Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86789?format=json", "institution": "University of Hong Kong"}], "abstract": "Recent advances in 3D Gaussian Splatting (3DGS) have enabled significant progress in photorealistic novel view synthesis. However, traditional 3DGS relies on a slow, iterative optimization process, which limits its use in scenarios demanding real-time results. To overcome this bottleneck, recent feed-forward methods aim to predict Gaussian attributes directly from images, but they often struggle with the redundancy of Gaussian primitives and rendering quality. In this paper, we introduce a transformer-based architecture specifically designed for feed-forward Gaussian Splatting. Our key insight is that spatial and semantic relationships among Gaussians can be effectively captured through a sparse attention mechanism, enabled by a Z-order strategy that organizes the unstructured Gaussian set into a spatially coherent sequence. Furthermore, we incorporate this Z-order strategy to adaptively suppress redundancy while preserving critical structural details.  This allows the transformer to efficiently model context, compress Gaussian primitives, and predict Gaussian attributes in a single forward pass. Comprehensive experiments demonstrate that our method achieves fast and high-quality novel view synthesis with fewer Gaussian primitives.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40291", "url": null, "sourceid": -37361, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37247?format=json"], "related_events_ids": [37247]}, {"id": 40059, "uid": "8eecb5252905f8dcf307a09d6fb6745f", "name": "ParticleGS: Learning Neural Gaussian Particle Dynamics from Videos for Prior-free Physical Motion Extrapolation", "authors": [{"id": 181906, "fullname": "Jinsheng Quan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181906?format=json", "institution": "Zhejiang University"}, {"id": 143366, "fullname": "Qiaowei Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/143366?format=json", "institution": "Zhejiang University"}, {"id": 174298, "fullname": "Yichao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/174298?format=json", "institution": "Zhejiang University"}, {"id": 193408, "fullname": "Zizhuo Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/193408?format=json", "institution": "Zhejiang University"}, {"id": 193409, "fullname": "Ying Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193409?format=json", "institution": "North China University of Technology"}, {"id": 90421, "fullname": "Wei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90421?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 185783, "fullname": "Zhihui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185783?format=json", "institution": "University of Science and Technology of China"}, {"id": 129256, "fullname": "Yawei Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/129256?format=json", "institution": "Zhejiang University"}], "abstract": "The ability to extrapolate dynamic 3D scenes beyond the observed timeframe is fundamental to advancing physical world understanding and predictive modeling. Existing dynamic 3D reconstruction methods have achieved high-fidelity rendering of temporal interpolation, but typically lack physical consistency in predicting the future. To overcome this issue, we propose ParticleGS, a physics-based framework that reformulates dynamic 3D scenes as physically grounded systems. ParticleGS comprises three key components: 1) an encoder that decomposes the scene into static properties and initial dynamic physical fields; 2) an evolver based on Neural Ordinary Differential Equations (Neural ODEs) that learns continuous-time dynamics for motion extrapolation; and 3) a decoder that reconstructs 3D Gaussians from evolved particle states for rendering. Through this design, ParticleGS integrates physical reasoning into dynamic 3D representations, enabling accurate and consistent prediction of the future. Experiments show that ParticleGS achieves state-of-the-art performance in extrapolation while maintaining rendering quality comparable to leading dynamic 3D reconstruction methods.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40059", "url": null, "sourceid": 36975, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39958, "uid": "486879a776de8b19e7c2b67ba78ecd94", "name": "ParallelVLM: Lossless Video-LLM Acceleration with Visual Alignment Aware Parallel Speculative Decoding", "authors": [{"id": 182019, "fullname": "Quan Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/182019?format=json", "institution": "Zhejiang University"}, {"id": 193192, "fullname": "Yuhao Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193192?format=json", "institution": "Zhejiang University"}, {"id": 193193, "fullname": "Yicheng Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/193193?format=json", "institution": "Zhejiang University"}, {"id": 193194, "fullname": "Huan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193194?format=json", "institution": "Zhejiang University"}, {"id": 156470, "fullname": "Cong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156470?format=json", "institution": "Zhejiang University"}], "abstract": "Although current Video-LLMs achieve impressive performance in video understanding tasks, their autoregressive decoding efficiency remains constrained by the massive number of video tokens. Visual token pruning can partially ease this bottleneck, yet existing approaches still suffer from information loss and yield only modest acceleration in decoding. In this paper, we propose ParallelVLM, a training-free draft-then-verify speculative decoding framework that overcomes both mutual waiting and limited speedup-ratio problems between draft and target models in long-video settings. ParallelVLM features two parallelized stages that maximize hardware utilization and incorporates an Unbiased Verifier-Guided Pruning strategy to better align the draft and target models by eliminating the positional bias in attention\u2011guided pruning. Extensive experiments demonstrate that ParallelVLM effectively expands the draft window by $1.6\\sim1.8\\times$ with high accepted lengths, and accelerates various video understanding benchmarks by 3.36$\\times$ on LLaVA-Onevision-72B and 2.42$\\times$ on Qwen2.5-VL-32B compared with vanilla autoregressive decoding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39958", "url": null, "sourceid": 39127, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39627, "uid": "89f590f6d31fb9ba02239da87497d982", "name": "Spatial Retrieval Augmented Autonomous Driving", "authors": [{"id": 75577, "fullname": "Xiaosong Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/75577?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 192515, "fullname": "Chenhe Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192515?format=json", "institution": "Fudan University"}, {"id": 192516, "fullname": "Yule Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192516?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 192517, "fullname": "Songbur Wong", "url": "http://cvpr.thecvf.com/api/miniconf/users/192517?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 192518, "fullname": "Zhiyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192518?format=json", "institution": "Horizon Robotics; Huazhong University of Science and Technology; NVIDIA; Stanford University; Horizon Robotics; Waymo; Tsinghua University, Tsinghua University; Department of Computer Science, Unive"}, {"id": 144977, "fullname": "chen chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/144977?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 181212, "fullname": "Shaofeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181212?format=json", "institution": "University of Science and Technology of China"}, {"id": 192519, "fullname": "Xuanhe Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192519?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 128276, "fullname": "Xue Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128276?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 88195, "fullname": "Junchi Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88195?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 86826, "fullname": "Yu-Gang Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86826?format=json", "institution": "Fudan University"}], "abstract": "Existing autonomous driving systems rely on onboard sensors (cameras, LiDAR, IMU, etc) for environmental perception. However, this paradigm is limited by the drive-time perception horizon and often fails under limited view scope, occlusion or extreme conditions such as darkness and rain. In contrast, human drivers are able to recall road structure even under poor visibility. To endow models with this \"recall\" ability, we propose the spatial retrieval paradigm, introducing offline retrieved geographic images as an additional input. These images are easy to obtain from offline caches (e.g, Google Maps or stored autonomous driving datasets) without requiring additional sensors, making it a plug-and-play extension for existing AD stacks.For experiments, we first extend the nuScenes dataset with geographic images retrieved via Google Maps APIs and align the new data with ego-vehicle trajectories. We establish baselines across five core autonomous driving tasks: object detection, online mapping, occupancy prediction, end-to-end planning, and generative world modeling.  Extensive experiments show that the extended modality could enhance the performance of certain tasks. We will open-source dataset curation code, data, and benchmarks for further study of this new autonomous driving paradigm.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39627", "url": null, "sourceid": 39634, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38468, "uid": "de4dbaab116c6ce122ca050041e1546f", "name": "GeoGuide: Hierarchical Geometric Guidance for Open-Vocabulary 3D Semantic Segmentation", "authors": [{"id": 189914, "fullname": "Xujing Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189914?format=json", "institution": "University of Science and Technology of China"}, {"id": 183623, "fullname": "Chuxin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183623?format=json", "institution": "University of Science and Technology of China"}, {"id": 189915, "fullname": "Yubo Ai", "url": "http://cvpr.thecvf.com/api/miniconf/users/189915?format=json", "institution": "University of Science and Technology of China"}, {"id": 148100, "fullname": "Zhixin Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/148100?format=json", "institution": "University of Science and Technology of China"}, {"id": 189916, "fullname": "Zhuoyuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189916?format=json", "institution": "University of Science and Technology of China"}, {"id": 189917, "fullname": "Liangsheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189917?format=json", "institution": "University of Science and Technology of China"}, {"id": 185768, "fullname": "Yujia Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185768?format=json", "institution": "University of Science and Technology of China"}, {"id": 104337, "fullname": "Xinjun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/104337?format=json", "institution": "University of Science and Technology of China"}, {"id": 186545, "fullname": "Qiao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186545?format=json", "institution": "University of Science and Technology of China"}, {"id": 88062, "fullname": "Wenfei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88062?format=json", "institution": "University of Science and Technology of China"}, {"id": 85977, "fullname": "Tianzhu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85977?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Open-vocabulary 3D semantic segmentation aims to segment arbitrary categories beyond the training set.Existing methods predominantly rely on distilling knowledge from 2D open-vocabulary models. However, aligning 3D features to the 2D representation space restricts intrinsic 3D geometric learning and inherits errors from 2D predictions. To address these limitations, we propose \\textbf{GeoGuide}, a novel framework that leverages pretrained 3D models to integrate hierarchical geometry-semantic consistency for open-vocabulary 3D segmentation. Specifically, we introduce an \\textbf{Uncertainty-based Superpoint Distillation} module to fuse geometric and semantic features for estimating per-point uncertainty, adaptively weighting 2D features within superpoints to suppress noise while preserving discriminative information  to enhance local semantic consistency. Furthermore, our \\textbf{Instance-level Mask Reconstruction} module leverages geometric priors to enforce semantic consistency within instances by reconstructing complete instance masks. Additionally, our \\textbf{Inter-Instance Relation Consistency} module aligns geometric and semantic similarity matrices to calibrate cross-instance consistency for same-category objects, mitigating viewpoint-induced semantic drift.Extensive experiments on ScanNet v2, Matterport3D, and nuScenes demonstrate the superior performance of GeoGuide.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38468", "url": null, "sourceid": 41626, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40006, "uid": "84fde371d2d8fc288efe6f932d8af208", "name": "Progressive Supernet Training for Efficient Visual Autoregressive Modeling", "authors": [{"id": 180887, "fullname": "Xiaoyue Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/180887?format=json", "institution": "Tsinghua University"}, {"id": 193282, "fullname": "Yuling Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/193282?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 193283, "fullname": "kaiyuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193283?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 193284, "fullname": "Huandong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193284?format=json", "institution": "Tsinghua University"}, {"id": 193285, "fullname": "Yong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193285?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 193286, "fullname": "Xiaodong Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193286?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 149284, "fullname": "Xinlei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/149284?format=json", "institution": "Tsinghua University"}, {"id": 89684, "fullname": "Mingbao Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/89684?format=json", "institution": "Xiamen University"}], "abstract": "Visual Autoregressive (VAR) models have demonstrated competitive performance with diffusion models in image generation by adopting a \"next-scale\" prediction paradigm that significantly reduces inference steps. However, VAR's progressive multi-scale generation leads to severe memory overhead due to KV cache accumulation across all scales, limiting practical deployment. Existing solutions either require training and deploying multiple specialized models or sacrifice generation quality.We observe a critical scale-depth asymmetric dependency in VAR: small scales (low-resolution tokens) are highly sensitive to network depth and require deep layers to capture global semantic information, while large scales (high-resolution tokens) exhibit remarkable robustness to depth reduction.Motivated by this insight, we propose VARiant, a unified supernet framework that enables dynamic depth adjustment within a single model through parameter sharing. Our approach employs an even-spacing layer selection strategy to construct quality-preserving subnetworks, and introduces a dynamic-ratio progressive training strategy that gradually transitions from joint optimization (full network to subnetwork ratio 2:8) to subnetwork optimization (ratio 10:0), effectively resolving the inherent optimization conflicts between full network and subnetworks in supernet training.Extensive experiments on ImageNet demonstrate that our method achieves Pareto-optimal trade-offs between generation quality and inference efficiency: by using full depth (30 layers) for the first 7 scales and a 16-layer subnetwork (50% depth) for subsequent scales, we obtain 50% cache reduction and 1.8\u00d7 inference speedup with minimal quality loss (FID increases by only 0.3).Unlike approaches requiring deployment of multiple models, our single-model solution eliminates deployment complexity, supports zero-cost runtime depth switching, and seamlessly integrates into standard transformer inference frameworks, making it highly practical for resource-constrained scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40006", "url": null, "sourceid": 41858, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38618, "uid": "10d4899fefa97ea9d92fc184f3c63655", "name": "Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight", "authors": [{"id": 182828, "fullname": "Yi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182828?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 190311, "fullname": "Xueqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190311?format=json", "institution": "Southern University of Science and Technology"}, {"id": 190312, "fullname": "Yiyang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190312?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 190313, "fullname": "Jin Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/190313?format=json", "institution": "Nanjing University of Posts and Telecommunications"}, {"id": 190314, "fullname": "Yihan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190314?format=json", "institution": "Fudan University"}, {"id": 190315, "fullname": "Zipeng Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190315?format=json", "institution": null}, {"id": 190316, "fullname": "Jiadi Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/190316?format=json", "institution": "Fudan University"}, {"id": 190317, "fullname": "You Qiaoben", "url": "http://cvpr.thecvf.com/api/miniconf/users/190317?format=json", "institution": null}, {"id": 190318, "fullname": "Pengfei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190318?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 129902, "fullname": "Zhijie Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/129902?format=json", "institution": "Shanghai Jiaotong University"}], "abstract": "Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervisions. However, letting VLA directly predict high-dimensional visual states can distribute model capacity and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably incurs information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities due to the neglect of language supervision. This paper introduces Mantis, a novel framework featuring a Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone with the combination of meta queries and a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective enables the meta queries to automatically capture the latent actions that delineate the visual trajectory, and hence boost the learning of explicit actions. The disentanglement reduces the burden of the VLA backbone, enabling it to maintain comprehension and reasoning capabilities through language supervision. Empirically, pretrained on human manipulation videos, robot demonstrations, and image-text pairs, Mantis achieves a 96.5\\% success rate on LIBERO benchmark after fine-tuning, surpassing powerful baselines while exhibiting high convergence speed. Real-world evaluations show that Mantis outperforms $\\pi_{0.5}$, a leading open-source VLA model, particularly in instruction-following capability, generalization to unseen instructions, and reasoning ability. We also introduce the adaptive temporal ensemble (ATE) strategy to balance computational efficiency and motion stability during inference, yielding the Mantis-ATE variant, which reduces inference counts by 45\\% while maintaining performance. The code and weights will be open-sourced after acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38618", "url": null, "sourceid": 46526, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40164, "uid": "e4f0a7d668a99dcb01354ae6a925eaac", "name": "FedRE: A Representation Entanglement Framework for Model-Heterogeneous Federated Learning", "authors": [{"id": 182717, "fullname": "Yuan Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/182717?format=json", "institution": "Beijing Teleinfo Technology Company Ltd., CAICT"}, {"id": 193678, "fullname": "Lixu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193678?format=json", "institution": "Nanyang Technological University"}, {"id": 193679, "fullname": "Jiaqi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193679?format=json", "institution": "Tsinghua University"}, {"id": 190313, "fullname": "Jin Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/190313?format=json", "institution": "Nanjing University of Posts and Telecommunications"}, {"id": 84721, "fullname": "Simin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/84721?format=json", "institution": "University of Texas at Dallas "}, {"id": 193680, "fullname": "Zehua Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193680?format=json", "institution": "University of British Columbia"}, {"id": 193681, "fullname": "Zijian Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/193681?format=json", "institution": "China University of Mining Technology - Beijing"}, {"id": 193682, "fullname": "Wei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193682?format=json", "institution": "China University of Mining Technology - Xuzhou"}, {"id": 193683, "fullname": "Huixia Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193683?format=json", "institution": "Beijing Jiaotong University"}, {"id": 129041, "fullname": "Xiaoxiao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129041?format=json", "institution": "University of British Columbia"}], "abstract": "Federated learning (FL) enables collaborative training across clients without compromising privacy. While most existing FL methods assume homogeneous model architectures, client heterogeneity in data and resources renders this assumption impractical, motivating model-heterogeneous FL. To address this problem, we propose Federated Representation Entanglement (FedRE), a framework built upon a novel form of client knowledge termed entangled representation. In FedRE, each client aggregates its local representations into a single entangled representation using normalized random weights and applies the same weights to integrate the corresponding one-hot label encodings into the entangled-label encoding. Those are then uploaded to the server to train a global classifier. During training, each entangled representation is supervised across categories via its entangled-label encoding, while random weights are re-sampled each round to introduce diversity, mitigating the global classifier\u2019s overconfidence and promoting smoother decision boundaries. Furthermore, each client uploads a single cross-category entangled representation along with its entangled-label encoding, mitigating the risk of representation inversion attacks and reducing communication overhead. Extensive experiments demonstrate that FedRE achieves an effective trade-off among model performance, privacy protection, and communication overhead.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40164", "url": null, "sourceid": 36969, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36989, "uid": "fc51906f285d57fbd09b59cc1aeb051d", "name": "OntoAug: Rethinking Generative Data Augmentation via Ontology Guidance", "authors": [{"id": 181457, "fullname": "Shuo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181457?format=json", "institution": "Huazhong Agricultural University"}, {"id": 186395, "fullname": "Zhichuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186395?format=json", "institution": "Huazhong Agricultural University"}, {"id": 186396, "fullname": "Jun Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186396?format=json", "institution": "Huazhong Agricultural University"}], "abstract": "Generative data augmentation techniques open new avenues for improving image recognition models. The core of image recognition lies in accurately capturing the ontological features of the subject. However, existing methods often treat the image as a whole during augmentation, ignoring the uneven semantic distribution between foreground and background. This can lead to semantic shifts in generated samples, weakening the model\u2019s ability to represent the subject\u2019s ontology. In human perception, category recognition typically relies on the stable essence of the subject while tolerating variations in background and environment. Inspired by this human perceptual mechanism of \u201cstable subjects, diverse backgrounds, and overall coherence,\u201d we propose OntoAug, a data augmentation framework based on the distinction between ontology and environment that redefines the boundary of ontology-oriented enhancement. OntoAug explicitly separates the foreground subject and background context, guiding diffusion models through structured layout control to generate samples with consistent subjects and diverse backgrounds. Experiments show that OntoAug significantly improves performance in image classification, few-shot learning, weakly supervised object localization (WSOL), and large vision-language model (LVLM) reasoning, demonstrating its advantages in semantic fidelity and sample diversity. It offers a new direction for building visual systems more aligned with human perception. Code will be available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36989", "url": null, "sourceid": 39676, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37672, "uid": "92b0605fa6bbad929f59797ab8db168f", "name": "TANGO: Text-Anchored Guided Optimization for Robust Fine-tuning Vision-Language Models under Label Noise", "authors": [{"id": 187985, "fullname": "Tengfei Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/187985?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 180826, "fullname": "Weiran Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/180826?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 130050, "fullname": "Wei Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/130050?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Fine-tuning large-scale Vision-Language Models (VLMs) is crucial for specialized tasks, but their performance is often undermined by the label noise prevalent in real-world datasets. Traditional approaches to learning with noisy labels typically rely on a self-referential loop, using a model's own predictions to correct errors. While recent VLM-specific methods have begun to leverage cross-modal information to aid in noise detection, we explore an alternative direction: using the text modality not just to identify noise, but to establish a source of ground truth that is fully independent of the training data's potentially corrupt labels.To this end, we propose \\textbf{T}ext-\\textbf{AN}chored \\textbf{G}uided \\textbf{O}ptimization (TANGO), a framework centered on ``semantic anchors\"\u2014a set of pure, immutable reference points generated from diverse text descriptions. Building upon these anchors, our approach reframes two key aspects of learning with noisy labels. First, we replace the conventional linear classifier with a parameter-free Text-Anchored Classifier, making predictions a direct, weighted consensus of the clean anchors. Second, we introduce an Anchor-Guided Refinement mechanism that validates each sample's given label against this external ground truth, providing a more robust signal for sample selection and label correction. Extensive experiments demonstrate that this approach achieves competitive and often state-of-the-art performance. Our code will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37672", "url": null, "sourceid": 36940, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36745, "uid": "4dad26de916a6570f06503163a4ab660", "name": "DyFCLT: Dynamic Frequency-Decoupled Cross-Modal Learning Transformer for Multimodal Tiny Object Detection", "authors": [{"id": 185771, "fullname": "Chaolang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185771?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 154561, "fullname": "Pengwen Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/154561?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 145305, "fullname": "Jingyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/145305?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 185772, "fullname": "Siyuan Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185772?format=json", "institution": "Sun Yat-sen University"}, {"id": 177909, "fullname": "Yuchen Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177909?format=json", "institution": "School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University"}, {"id": 185773, "fullname": "Zhuoran Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185773?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Multimodal tiny object detection  plays a critical role in real-world applications. However, detecting tiny objects remains challenging due to environmental complexities. While recent methods leverage spatial multi-scale representations or frequency-domain enhancements, most focus solely on visible images and overlook complementary multimodal frequency cues. This paper explores how to effectively harness cross-modal frequency information for infrared\u2013visible tiny object detection. Through frequency characteristic analysis, we observe that tiny objects exhibit rich mid- and high-frequency energy across both modalities, motivating the design of a Dynamic Frequency-decoupled Cross-modal Learning Transformer (DyFCLT). Our approach introduces a Dynamic Frequency-Band Decoupled Cross-Modal Attention (DFCA) mechanism to extract and interact frequency components across modalities. To suppress noise while enhancing foreground signals, a Selective Smoothing Enhancement (SSE) strategy is proposed, which smoothes background interference and guides multi-scale feature fusion. DFCA and SSE collaborate to achieve synergistic enrichment and refinement of cross-modal features. Extensive experiments on two tiny-object benchmarks and one general-scale benchmark demonstrate that DyFCLT sets new state-of-the-art results, outperforming prior leading methods by significant margins and exhibiting strong generalization across scales and scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36745", "url": null, "sourceid": 42592, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39937, "uid": "dada30935bcfc75c69e651430ea8815f", "name": "DINO Eats CLIP: Adapting Beyond Knowns for Open-set 3D Object Retrieval", "authors": [{"id": 152909, "fullname": "Xinwei He", "url": "http://cvpr.thecvf.com/api/miniconf/users/152909?format=json", "institution": "Huazhong Agricultural University"}, {"id": 178590, "fullname": "Yansong Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/178590?format=json", "institution": "Huazhong Agricultural University"}, {"id": 193153, "fullname": "Qianru Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/193153?format=json", "institution": ""}, {"id": 186395, "fullname": "Zhichuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186395?format=json", "institution": "Huazhong Agricultural University"}, {"id": 146319, "fullname": "Yuxuan Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/146319?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 76527, "fullname": "Yang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/76527?format=json", "institution": "Shenzhen University"}, {"id": 193154, "fullname": "Jingbo Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/193154?format=json", "institution": ""}, {"id": 131539, "fullname": "Yulong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131539?format=json", "institution": "Huazhong Agricultural University"}, {"id": 193155, "fullname": "Jinhai Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193155?format=json", "institution": null}, {"id": 85817, "fullname": "Xiang Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/85817?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Vision foundation models have shown great promise for open-set 3D object retrieval (3DOR) through efficient adaptation to multi-view images.Leveraging semantically aligned latent space, previous work typically adapts the CLIP encoder to build view-based 3D descriptors. Despite CLIP's strong generalization ability, its lack of fine-grainedness prompted us to explore the potential of a more recent self-supervised encoder\u2014DINO. To address this, we propose DINO Eats CLIP (DEC), a novel framework for dynamic multi-view integration that is regularized by synthesizing data for unseen classes. We first find that simply mean-pooling over view features from a frozen DINO backbone gives decent performance. Yet, further adaptation causes severe overfitting on average view patterns of known classes.To combat it, we then design a module named Chunking and Adapting Module (CAM). It segments multi-view images into chunks and dynamically integrates local view relations, yielding more robust features than the standard pooling strategy. Finally, we propose Virtual Feature Synthesis (VFS) module to mitigate bias towards known categories explicitly. Under the hood, VFS leverages CLIP's broad, pre-aligned vision-language space to synthesize virtual features for unseen classes. By exposing DEC to these virtual features, we greatly enhance its open-set discrimination capacity. Extensive experiments on standard open-set 3DOR benchmarks demonstrate its superior efficacy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39937", "url": null, "sourceid": 41212, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36717, "uid": "c48ec6112a8e66ac6e211fd95210b551", "name": "RecTok: Reconstruction Distillation along Rectified Flow", "authors": [{"id": 148249, "fullname": "Qingyu Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/148249?format=json", "institution": "Peking University"}, {"id": 185714, "fullname": "Size Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185714?format=json", "institution": "Nanyang Technological University"}, {"id": 133860, "fullname": "Jinbin Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/133860?format=json", "institution": "NUS"}, {"id": 185715, "fullname": "Kaidong Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185715?format=json", "institution": "Insitute of Artificial Intelligence, China Telecom"}, {"id": 185716, "fullname": "Yujing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185716?format=json", "institution": "ByteDance Inc."}, {"id": 127629, "fullname": "Yunhai Tong", "url": "http://cvpr.thecvf.com/api/miniconf/users/127629?format=json", "institution": "Peking University"}, {"id": 139138, "fullname": "Xiangtai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/139138?format=json", "institution": "ByteDance Inc."}, {"id": 185717, "fullname": "Xuelong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185717?format=json", "institution": "China Telecom; Northwestern Polytechnical University"}], "abstract": "Visual tokenizers play a crucial role in diffusion models. The dimensionality of latent space governs both reconstruction fidelity and the semantic expressiveness of the latent feature. However, a fundamental trade-off is inherent between dimensionality and generation quality, constraining existing methods to low-dimensional latent spaces. Although recent works have leveraged vision foundation models (VFMs) to enrich the semantics of visual tokenizers and accelerate convergence, high-dimensional tokenizers still underperform their low-dimensional counterparts. In this work, we propose RecTok, which overcomes the limitations of high-dimensional visual tokenizers through two key innovations: flow semantic distillation and reconstruction\u2013alignment distillation. Our key insight is to make the forward flow in flow matching semantically rich, which serves as the training space of diffusion transformers, rather than focusing on the latent space as in previous works. Specifically, our method distill the semantic information in VFMs into the forward flow trajectories in flow matching. And we further enhance the semantics by introducing a masked feature reconstruction loss. Our RecTok achieves superior image reconstruction, generation quality, and discriminative performance. It achieves state-of-the-art results on the gFID-50K under both with and without classifier-free guidance settings, while maintaining a semantically rich latent space structure. Furthermore, as the latent dimensionality increases, we observe consistent improvements. Code and model will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36717", "url": null, "sourceid": 36001, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38763, "uid": "b95e6f83cab58477a9fbac688fc3cc3e", "name": "VisPlay: Self-Evolving Vision-Language Models", "authors": [{"id": 181311, "fullname": "Yicheng He", "url": "http://cvpr.thecvf.com/api/miniconf/users/181311?format=json", "institution": "University of Illinois Urbana-Champaign"}, {"id": 190613, "fullname": "Chengsong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190613?format=json", "institution": "Washington University, Saint Louis"}, {"id": 126703, "fullname": "Zongxia Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/126703?format=json", "institution": "University of Maryland, College Park"}, {"id": 190614, "fullname": "Jiaxin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190614?format=json", "institution": "Washington University, Saint Louis"}, {"id": 190615, "fullname": "Yonghui Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190615?format=json", "institution": "National University of Singapore"}], "abstract": "Reinforcement learning (RL) provides a principled framework for improving vision-language models (VLMs) on complex reasoning tasks. However, existing RL approaches often depend on human-annotated labels or task-specific heuristics to define verifiable rewards\u2014both costly and limited in scalability. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning capabilities from massive unlabeled image data. Starting from a single base VLM, VisPlay assigns the model into two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained using Group Relative Policy Optimization (GRPO), which uses diversity and difficulty rewards to balance the difficulty of generated questions with the quality of silver answers. VisPlay scales efficiently across two model families. Trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks including MM-Vet and MMMU, and establishes a scalable path toward self-evolving multimodal intelligence.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38763", "url": null, "sourceid": 41235, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38854, "uid": "bf38970e96dee81ef0b4b76c2feb1a2a", "name": "PIX-TAB: Efficient PIXel-Precise TABle Structure Recognition Approach with Speculative Decoding and Region-Based Image Segmentation", "authors": [{"id": 183915, "fullname": "Viktor Zaytsev", "url": "http://cvpr.thecvf.com/api/miniconf/users/183915?format=json", "institution": "Samsung R&amp;D Institute Ukraine"}, {"id": 190843, "fullname": "Olena Vynokurova", "url": "http://cvpr.thecvf.com/api/miniconf/users/190843?format=json", "institution": "Samsung R&amp;D Institute Ukraine"}, {"id": 190844, "fullname": "Pavlo Tytarchuk", "url": "http://cvpr.thecvf.com/api/miniconf/users/190844?format=json", "institution": "Samsung"}, {"id": 190845, "fullname": "Dmytro Kozii", "url": "http://cvpr.thecvf.com/api/miniconf/users/190845?format=json", "institution": "Samsung R&amp;D Institute Ukraine"}, {"id": 190846, "fullname": "Vitalii Pohribnyi", "url": "http://cvpr.thecvf.com/api/miniconf/users/190846?format=json", "institution": "Samsung"}, {"id": 190847, "fullname": "Olga Radyvonenko", "url": "http://cvpr.thecvf.com/api/miniconf/users/190847?format=json", "institution": "Samsung Research R&amp;D Institute"}, {"id": 190848, "fullname": "Artem Shcherbina", "url": "http://cvpr.thecvf.com/api/miniconf/users/190848?format=json", "institution": ""}], "abstract": "Table structure recognition in document AI faces significant challenges due to layout inconsistencies, merged cells, and complex nested structures, which is further exacerbated by the scarcity of large, diverse annotated datasets. In this paper, we present PIX-TAB (Efficient PIXel-Precise TABle Structure Recognition Approach) that provides exact, pixel-level structure using a small, lightweight model that can run on-device. The approach is language-agnostic, as it allows adding support for a new languages simply by replacing the Optical Character Recognition (OCR) model without modifying to the core structure recognition model. Key innovations include: position-aware pixel-precise tokens for deterministic cell reconstruction; speculative decoding for faster sequence generation, and training-only box supervision to stabilize spatial grounding; region-based image segmentation. To mitigate data scarcity we propose a pipeline for generating a large synthetic table dataset. Experimental results validate each component. To address the limitations of existing evaluation methods we introduce $TEDS_{struct100}$ and $TEDS_{100}$ metrics. Speculative decoding approach significantly improves recognition speed while maintaining accuracy. Finally, the  combined techniques enable a mobile-optimized model that is more than three times faster than the full-size version.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38854", "url": null, "sourceid": 42131, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39194, "uid": "303fdade7ec2e86c1be79bf8a8f327f5", "name": "Benchmarking Single-Factor Physical Video-to-Audio Generation", "authors": [{"id": 191553, "fullname": "Tingle Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191553?format=json", "institution": "University of California, Berkeley"}, {"id": 191554, "fullname": "Siddharth Gururani", "url": "http://cvpr.thecvf.com/api/miniconf/users/191554?format=json", "institution": "NVIDIA"}, {"id": 133349, "fullname": "Kevin Shih", "url": "http://cvpr.thecvf.com/api/miniconf/users/133349?format=json", "institution": "NVIDIA"}, {"id": 157010, "fullname": "Gantavya Bhatt", "url": "http://cvpr.thecvf.com/api/miniconf/users/157010?format=json", "institution": "University of Washington, Seattle"}, {"id": 177012, "fullname": "Sang-gil Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/177012?format=json", "institution": "NVIDIA"}, {"id": 191555, "fullname": "Zhifeng Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/191555?format=json", "institution": "NVIDIA"}, {"id": 167094, "fullname": "Arushi Goel", "url": "http://cvpr.thecvf.com/api/miniconf/users/167094?format=json", "institution": "NVIDIA"}, {"id": 191556, "fullname": "Gopala Anumanchipalli", "url": "http://cvpr.thecvf.com/api/miniconf/users/191556?format=json", "institution": "University of California, Berkeley"}, {"id": 90941, "fullname": "Ming-Yu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90941?format=json", "institution": "NVIDIA"}], "abstract": "Generative video-to-audio (V2A) models produce highly plausible soundtracks, but it remains unclear whether they capture the underlying physical processes. Existing evaluations emphasize perceptual realism and overlook physical correctness under controlled interventions. In this paper, we introduce FlatSounds, a benchmark that audits the physical reasoning of V2A models through: 1) controlled counterfactual pairs in which a single physical factor is varied, and 2) single-video pattern tests that probe internal consistency and directional trends. These settings test whether generated audio correctly reflects specific physical properties and timings. Our evaluation of state-of-the-art models reveals a consistent trade-off: models rely more on text captions than the visual stream to infer physics and semantics. Captions improve physical and semantic accuracy, but paradoxically degrade temporal alignment. Our results highlight the need to move beyond audio quality toward learning physical processes directly from pixels. Finally, we find that our physics-based metrics correlate strongly with human preference tests on our own data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39194", "url": null, "sourceid": 35680, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36378, "uid": "ba368424b735e1761bc78ea814268bc3", "name": "AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models", "authors": [{"id": 85353, "fullname": "Zheda Mai", "url": "http://cvpr.thecvf.com/api/miniconf/users/85353?format=json", "institution": "Ohio State University"}, {"id": 98230, "fullname": "Arpita Chowdhury", "url": "http://cvpr.thecvf.com/api/miniconf/users/98230?format=json", "institution": "Ohio State University, Columbus"}, {"id": 184909, "fullname": "Zihe Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184909?format=json", "institution": "Pennsylvania State University"}, {"id": 154842, "fullname": "Sooyoung Jeon", "url": "http://cvpr.thecvf.com/api/miniconf/users/154842?format=json", "institution": "Ohio State University, Columbus"}, {"id": 184910, "fullname": "Lemeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184910?format=json", "institution": "Ohio State University, Columbus"}, {"id": 184911, "fullname": "Jiacheng Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184911?format=json", "institution": "Ohio State University, Columbus"}, {"id": 184622, "fullname": "Jihyung Kil", "url": "http://cvpr.thecvf.com/api/miniconf/users/184622?format=json", "institution": "Adobe Research"}, {"id": 84728, "fullname": "Wei-Lun Chao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84728?format=json", "institution": "Ohio State University"}], "abstract": "The rise of vision foundation models (VFMs) calls for systematic evaluation. A common approach pairs VFMs with large language models (LLMs) as general-purpose heads, followed by evaluation on broad Visual Question Answering (VQA) benchmarks. However, this protocol has two key blind spots: (i) Instruction tuning data may not align with VQA test distributions, meaning a wrong prediction can stem from such data mismatch rather than VFMs' visual shortcomings; (ii) VQA benchmarks often require multiple visual abilities in a single question, making it difficult to determine whether errors arise from the lack of all required abilities or just one key ability. To address these gaps, we introduce AVA-Bench, the first benchmark that explicitly disentangles 14 Atomic Visual Abilities (AVAs), foundational skills such as localization, depth estimation, and spatial understanding, which collectively support complex visual reasoning tasks. By decoupling AVAs and matching training and test distributions within each, AVA-Bench pinpoints exactly where a VFM excels or falters. Applying AVA-Bench to leading VFMs thus reveals distinctive \"ability fingerprints,\" turning VFM selection from educated guesswork into principled engineering. Notably, we find that a 0.5B LLM yields similar VFM rankings as a 7B LLM while cutting GPU hours by 8x, enabling more efficient evaluation. By offering a comprehensive and transparent benchmark, we hope AVA-Bench lays the foundation for the next generation of VFMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36378", "url": null, "sourceid": 30726, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40067, "uid": "2fdc51313d62668207cd6a89fa8500f7", "name": "GenMask: Adapting DiT for Segmentation via Direct Mask Generation", "authors": [{"id": 180904, "fullname": "Yang yuhuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/180904?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 154318, "fullname": "Xianwei Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154318?format=json", "institution": "Peking University"}, {"id": 155469, "fullname": "Yuxuan Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/155469?format=json", "institution": "Alibaba"}, {"id": 176145, "fullname": "Chaofan Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/176145?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 191896, "fullname": "Shuai Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/191896?format=json", "institution": "Alibaba Group"}, {"id": 86992, "fullname": "Jiangchao Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86992?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 76651, "fullname": "Ya Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76651?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 185837, "fullname": "Junyang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185837?format=json", "institution": "Alibaba Group"}, {"id": 126281, "fullname": "Yanfeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126281?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Recent approaches for segmentation have leveraged pretrained generative models as feature extractors, treating segmentation as a downstream adaptation task via indirect feature retrieval. This implicit use suffers from a fundamental misalignment in representation.It also depends heavily on indirect feature extraction pipelines, which complicate the workflow and limit adaptation.In this paper, we argue that instead of indirect adaptation, segmentation tasks should be trained directly in a generative manner.We identify a key obstacle to this unified formulation: VAE latents of binary masks are sharply distributed, noise robust, and linearly separable, distinct from natural image latents.To bridge this gap, we introduce timesteps sampling strategy for binary masks that emphasizes extreme noise levels for segmentation and moderate noise for image generation, enabling harmonious joint training.We present GenMask, a DiT trains to generate black-and-white segmentation masks as well as colorful images in RGB space under the original generative objective. GenMask preserves the original DiT architecture while removing the need of feature extraction pipelines tailored for segmentation tasks. Empirically, GenMask attains state-of-the-art performance on referring and reasoning segmentation benchmarks and ablations quantify the contribution of each component.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40067", "url": null, "sourceid": 45655, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36900, "uid": "c66f4400c66259105a6f2cda99cc5f41", "name": "GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation", "authors": [{"id": 172413, "fullname": "Zhenya Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/172413?format=json", "institution": "The University of Hong Kong"}, {"id": 169543, "fullname": "Zhe Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/169543?format=json", "institution": "The University of Hong Kong"}, {"id": 179302, "fullname": "Yuxiang Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/179302?format=json", "institution": "The University of Hong Kong"}, {"id": 186153, "fullname": "Liping Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186153?format=json", "institution": null}, {"id": 186154, "fullname": "Chenxuan Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186154?format=json", "institution": "Zhejiang University"}, {"id": 177219, "fullname": "peng siyi", "url": "http://cvpr.thecvf.com/api/miniconf/users/177219?format=json", "institution": "Huawei Technologies Co., Ltd."}, {"id": 91405, "fullname": "Bailan Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/91405?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 85817, "fullname": "Xiang Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/85817?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 87814, "fullname": "Hengshuang Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/87814?format=json", "institution": "The University of Hong Kong"}], "abstract": "Physics-aware driving world model is essential for drive planning, out-of-distribution data synthesis, and closed-loop evaluation. However, existing methods often rely on a single diffusion model to directly map driving actions to videos, which makes learning difficult and leads to physically inconsistent outputs. To overcome these challenges, we propose GenieDrive, a novel framework designed for physics-aware driving video generation. Our approach starts by generating 4D occupancy, which serves as a physics-informed foundation for subsequent video generation. 4D occupancy contains rich physical information, including high-resolution 3D structures and dynamics. To facilitate effective compression of such high-resolution occupancy, we propose a VAE that encodes occupancy into a latent tri-plane representation, reducing the latent size to only 58% of that used in previous methods. We further introduce Mutual Control Attention (MCA) to accurately model the influence of control on occupancy evolution, and we jointly train the VAE and the subsequent prediction module in an end-to-end manner to maximize forecasting accuracy. Together, these designs yield a 7.2% improvement in forecasting mIoU at an inference speed of 41 FPS, while using only 3.47 M parameters. Additionally, a Normalized Multi-View Attention is introduced in the video generation model to generate multi-view driving videos with guidance from our 4D occupancy, significantly improving video quality with a 20.7% reduction in FVD. Experiments demonstrate that GenieDrive enables highly controllable, multi-view consistent, and physics-aware driving video generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36900", "url": null, "sourceid": 37646, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37084, "uid": "e606763b01171afd48e13b91a9cd1307", "name": "ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering", "authors": [{"id": 186625, "fullname": "Aymen Lassoued", "url": "http://cvpr.thecvf.com/api/miniconf/users/186625?format=json", "institution": "University of Carthage"}, {"id": 186626, "fullname": "Mohamed Ali Souibgui", "url": "http://cvpr.thecvf.com/api/miniconf/users/186626?format=json", "institution": "Openchip and Technologies S.L"}, {"id": 186627, "fullname": "Yousri Kessentini", "url": "http://cvpr.thecvf.com/api/miniconf/users/186627?format=json", "institution": null}], "abstract": "Document Visual Question Answering (DocVQA) remains challenging for existing Vision-Language Models (VLMs), especially under complex reasoning and multi-step workflows. Current approaches struggle to decompose intricate questions into manageable sub-tasks and often fail to leverage specialized processing paths for different document elements. We present ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering, a novel multi-agent framework that addresses these limitations through strategic agent coordination and iterative refinement. ORCA begins with a reasoning agent that decomposes queries into logical steps, followed by a routing mechanism that activates task-specific agents from a specialized agent dock. Our framework leverages a set of specialized AI agents, each dedicated to a distinct modality, enabling fine-grained understanding and collaborative reasoning across diverse document components. To ensure answer reliability, ORCA employs a debate mechanism with stress-testing, and when necessary, a thesis-antithesis adjudication process. This is followed by a sanity checker to ensure format consistency. Extensive experiments on three benchmarks demonstrate that our approach achieves significant improvements over state-of-the-art methods, establishing a new paradigm for collaborative agent systems in vision-language reasoning. The code will be available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37084", "url": null, "sourceid": 33050, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38479, "uid": "26bc286a7c996dc103ada0982493576e", "name": "Lens Component Deletion based on Differentiable Ray Tracing", "authors": [{"id": 180193, "fullname": "Wenguan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180193?format=json", "institution": "Zhejiang University"}, {"id": 189942, "fullname": "Qirun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189942?format=json", "institution": "Zhejiang University"}, {"id": 189943, "fullname": "Tuo Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/189943?format=json", "institution": "Zhejiang University of Technology"}, {"id": 189944, "fullname": "Jiajian He", "url": "http://cvpr.thecvf.com/api/miniconf/users/189944?format=json", "institution": "Zhejiang University"}, {"id": 189945, "fullname": "Jiahui Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189945?format=json", "institution": "Zhejiang University"}, {"id": 189946, "fullname": "Huajun Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189946?format=json", "institution": "Zhejiang University"}, {"id": 178457, "fullname": "Qi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/178457?format=json", "institution": "Zhejiang University"}], "abstract": "To achieve compactness or cost reduction for optical lens systems, designers typically rely on commercial software to design lens systems independently of post-processing algorithms, leading to excessive dependence on designers' expertise and often requiring significant time. Recently, joint optimization approaches utilizing differentiable ray tracing have emerged, demonstrating significant potential in lens design tasks. However, these existing pipelines fail to provide accurate and efficient diffraction modeling for complex refractive systems. In this work, we propose a novel lens component deletion pipeline for miniature optical systems, which automatically deletes the suitable lens component, and then optimizes both the lens system and the post-processing network to achieve joint aberration correction. Additionally, we introduce a novel metric for evaluating the contribution of each lens component within an optical system, aimed at identifying the lens component that has the least impact on the system. We also develop an efficient differentiable point spread function estimation method based on the Rayleigh-Sommerfeld diffraction model, significantly reducing GPU memory consumption. Our proposed pipeline does not rely on human design expertise, achieving lens component deletion while maintaining imaging quality comparable to the original lens system, thereby enabling the compactness or cost-effective optimization of optical systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38479", "url": null, "sourceid": 33977, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38774, "uid": "aa4945c49394155f65b3767a8473219a", "name": "LensWalk: Agentic Video Understanding by Planning How You See in Videos", "authors": [{"id": 181229, "fullname": "Keliang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181229?format=json", "institution": ", Chinese Academy of Sciences"}, {"id": 190637, "fullname": "Yansong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190637?format=json", "institution": "Hunan University"}, {"id": 190638, "fullname": "Hongze Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190638?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 190639, "fullname": "Mengdi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190639?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 86286, "fullname": "Hong Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86286?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 86312, "fullname": "Shiguang Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/86312?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}], "abstract": "The dense, temporal nature of video presents a profound challenge for automated analysis. Despite the use of powerful Vision-Language Models, prevailing methods for video understanding are limited by the inherent disconnect between reasoning and perception: they rely on static, pre-processed information and cannot actively seek raw evidence from video as their understanding evolves. To address this, we introduce LensWalk, a flexible agentic framework that empowers a Large Language Model reasoner to control its own visual observation actively. LensWalk establishes a tight reason-plan-observe loop where the agent dynamically specifies, at each step, thetemporal scope and sampling density of the video it observes. Using a suite of versatile, Vision-Language Model based tools parameterized by these specifications, the agent can perform broad scans for cues, focus on specific segments for fact extraction, and stitch evidence from multiple moments for holistic verification. This design allows for progressive, on-demand evidence gathering that directly serves the agent's evolving chain of thought. Without requiring any model fine-tuning, LensWalk delivers substantial, plug-and-play performance gains on multiple model recipes, boosting their accuracy by over 5\\% on challenging long-video benchmarks like LVBench and Video-MME. Our analysis reveals that enabling an agent to control how it sees is key to unlocking more accurate, robust, and interpretable video reasoning.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38774", "url": null, "sourceid": 38848, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37622, "uid": "c48d7848e187d76c5a72510d5a5ca71e", "name": "TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size", "authors": [{"id": 141329, "fullname": "Stefan Lionar", "url": "http://cvpr.thecvf.com/api/miniconf/users/141329?format=json", "institution": "National University of Singapore"}, {"id": 86340, "fullname": "Gim Hee Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/86340?format=json", "institution": "National University of Singapore"}], "abstract": "Physics-based humanoid control has achieved remarkable progress in enabling realistic and high-performing single-agent behaviors, yet extending these capabilities to cooperative human-object interaction (HOI) remains challenging. We present TeamHOI, a framework that enables a single decentralized policy to handle cooperative HOIs across any number of cooperating agents. Each agent operates using local observations while attending to other teammates through a Transformer-based policy network with teammate tokens, allowing scalable coordination across variable team sizes. To enforce motion realism while addressing the scarcity of cooperative HOI data, we further introduce a masked Adversarial Motion Prior (AMP) strategy that uses single-human reference motions while masking object-interacting body parts during training. The masked regions are then guided through task rewards to produce diverse and physically plausible cooperative behaviors. We evaluate TeamHOI on a challenging cooperative carrying task involving two to eight humanoid agents and varied object geometries. Finally, to promote stable carrying, we design a team-size- and shape-agnostic formation reward. TeamHOI achieves high success rates and demonstrates coherent cooperation across diverse configurations with a single policy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37622", "url": "https://splionar.github.io/TeamHOI/", "sourceid": 45474, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37232, "uid": "d52dfe51e06cd683cad1736ee23f6353", "name": "Multi-modal Frequency Decomposition Network for Semantic Scene Completion", "authors": [{"id": 186966, "fullname": "Die Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186966?format=json", "institution": "Tianjin University"}, {"id": 186967, "fullname": "Lubo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186967?format=json", "institution": null}, {"id": 186968, "fullname": "Ruonan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186968?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 85967, "fullname": "Qing Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/85967?format=json", "institution": "Institute of High Performance Computing, Singapore, A*STAR"}, {"id": 186969, "fullname": "Chong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186969?format=json", "institution": "University of Science and Technology of China"}, {"id": 186970, "fullname": "Dongdong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186970?format=json", "institution": "Tianjin University"}, {"id": 90857, "fullname": "Wei Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90857?format=json", "institution": "Tianjin University"}, {"id": 149357, "fullname": "Kairui Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/149357?format=json", "institution": "Tianjin University"}, {"id": 154517, "fullname": "Di Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/154517?format=json", "institution": "Tianjin University"}], "abstract": "Based on a RGB-D image pair, semantic scene completion (SSC) provides a description for 3D scene understanding by predicting 3D semantic occupancy map. Recent methods extract RGB-D multi-modal features and fuse them in spatial domain, which disregards the misalignment caused by the imperfect raw multi-modal data and the multi-modal feature learning. Moreover, the operations of extracting high-level features they utilized tend to introduce feature smoothing and detail loss, exacerbating the above misalignment. To tackle these problems, this paper introduces MFDNet, a lightweight semantic scene completion network based on a multi-modal frequency decomposition strategy. By integrating frequency processing with limited layers of convolution and downsampling, MFDNet achieves a balance between modalities alignment and detail retainment. The network is equipped with Multi-modal Adaptive Frequency Fusion (MAFF) and Frequency Detail Compensation (FDC). MAFF models the intra-modal multi-bands dependencies and inter-modal relationships from a global perspective, enabling modality-specific calibration while facilitating the aligned fusion of multi-modal features. FDC excavates the high-frequency cues in shallow features to compensate for the missing local details of the fused feature and achieve fine-grained alignment for completion. MAFF and FDC formulate a global-to-local alignment and completion paradigm for multi-modal SSC. Extensive experiments demonstrate that MFDNet reduces parameters by 54.4% while achieving state-of-the-art performance on the NYUv2 and NYUCAD datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37232", "url": null, "sourceid": 34129, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37729, "uid": "8aa168167e983b0cb8b753e7ce8f0307", "name": "Beyond Binary Contrast: Modeling Continuous Skeleton Action Spaces with Transitional Anchors", "authors": [{"id": 188111, "fullname": "Yingjie Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188111?format=json", "institution": "Harbin Institute of Technology(Shenzhen)"}, {"id": 188112, "fullname": "Yi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188112?format=json", "institution": "Central South University"}, {"id": 188113, "fullname": "Jiaze Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188113?format=json", "institution": null}, {"id": 188114, "fullname": "Anfeng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188114?format=json", "institution": "Central South University"}, {"id": 89574, "fullname": "Zhuotao Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/89574?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "Self-supervised contrastive learning has emerged as a powerful paradigm for skeleton-based action recognition by enforcing consistency in the embedding space. However, existing methods rely on binary contrastive objectives that overlook the intrinsic continuity of human motion, resulting in fragmented feature clusters and rigid class boundaries. To address these limitations, we propose TranCLR, a Transitional anchor-based Contrastive Learning framework that captures the continuous geometry of the action space. Specifically, the proposed Action Transitional Anchor Construction (ATAC) explicitly models the geometric structure of transitional states to enhance the model's perception of motion continuity. Building upon these anchors, a Multi-Level Geometric Manifold Calibration (MGMC) mechanism is introduced to adaptively calibrate the action manifold across multiple levels of continuity, yielding a smoother and more discriminative representation space. Extensive experiments on the NTU RGB+D, NTU RGB+D 120 and PKU-MMD datasets demonstrate that TranCLR achieves superior accuracy and calibration performance, effectively learning continuous and uncertainty-aware skeleton representations. Code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37729", "url": null, "sourceid": 32178, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37459, "uid": "119653373afcc8c8f089832cb7eeb57e", "name": "SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models", "authors": [{"id": 187498, "fullname": "Haowen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187498?format=json", "institution": "University of Maryland, College Park"}, {"id": 187499, "fullname": "Shaoxiong Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187499?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 187500, "fullname": "Haonan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187500?format=json", "institution": "Harvard University"}, {"id": 187501, "fullname": "Jiawei Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187501?format=json", "institution": "School of Engineering and Applied Sciences, Harvard University; Tsinghua University, Tsinghua University"}, {"id": 187502, "fullname": "Jiayuan Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187502?format=json", "institution": "Amazon FAR UPenn"}, {"id": 88945, "fullname": "Jia-Bin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88945?format=json", "institution": "University of Maryland, College Park"}, {"id": 85755, "fullname": "Yilun Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/85755?format=json", "institution": "Massachusetts Institute of Technology"}], "abstract": "Vision-Language Models (VLMs) exhibit remarkable common-sense and semantic reasoning capabilities.However, they lack a grounded understanding of physical dynamics. This limitation arises from training VLMs on static internet-scale visual-language data that contain no causal interactions or action-conditioned changes.Consequently, it remains challenging to leverage VLMs for fine-grained robotic manipulation tasks that require physical understanding, reasoning, and corresponding action planning.To overcome this, we present $\\textbf{SIMPACT}$, a test-time, $\\textbf{SIM}$ulation-enabled $\\textbf{ACT}$ion $\\textbf{P}$lanning framework that equips VLMs with physical reasoning through simulation-in-the-loop world modeling, without requiring any additional training.From a single RGB-D observation, SIMPACT efficiently constructs physics simulations, enabling the VLM to propose informed actions, observe simulated rollouts, and iteratively refine its reasoning.By integrating language reasoning with physics prediction, our simulation-enabled VLM can understand contact dynamics and action outcomes in a physically grounded way. Our method demonstrates state-of-the-art performance on five challenging, real-world rigid-body and deformable manipulation tasks that require fine-grained physical reasoning, outperforming existing general-purpose robotic manipulation models. Our results demonstrate that embedding physics understanding via efficient simulation into VLM reasoning at test time offers a promising path towards generalizable embodied intelligence.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37459", "url": null, "sourceid": 35227, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38032, "uid": "181d697ba3986e50233a8dfb70d2d11a", "name": "Fast SceneScript: Accurate and Efficient Structured Language Model via Multi-Token Prediction", "authors": [{"id": 148213, "fullname": "Ruihong Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/148213?format=json", "institution": "University of Amsterdam"}, {"id": 188882, "fullname": "Xuepeng Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188882?format=json", "institution": "Qualcomm"}, {"id": 188883, "fullname": "Oleksandr Bailo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188883?format=json", "institution": "Qualcomm"}, {"id": 188884, "fullname": "Marco Manfredi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188884?format=json", "institution": "Qualcomm Inc, QualComm"}, {"id": 153394, "fullname": "Theo Gevers", "url": "http://cvpr.thecvf.com/api/miniconf/users/153394?format=json", "institution": "University of Amsterdam, University of Amsterdam"}], "abstract": "Recent perception-generalist approaches based on language models have achieved state-of-the-art results across diverse tasks, including 3D scene layout estimation, via unified architecture and interface. However, these approaches rely on autoregressive next-token prediction, which is inherently slow. In this work, we introduce Fast SceneScript, a novel structured language model for accurate and efficient 3D scene layout estimation. Our method employs multi-token prediction (MTP) to reduce the number of autoregressive iterations and significantly accelerate inference. While MTP improves speed, unreliable token predictions can significantly reduce accuracy. To filter out unreliable tokens, we adapt self-speculative decoding (SSD) for structural language models and introduce confidence-guided decoding (CGD) with an improved scoring mechanism for token reliability.Furthermore, we design a parameter-efficient mechanism that reduces the parameter overhead of MTP. Extensive experiments on the ASE and Structured3D benchmarks demonstrate that Fast SceneScript can generate up to 9 tokens per decoder inference step without compromising accuracy, while adding only 7.5\\% additional parameters.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38032", "url": null, "sourceid": 45304, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39830, "uid": "333a55a4dd50a0fafb33f7e2e5b0df03", "name": "CodeMMR: Bridging Natural Language, Code, and Image for Unified Retrieval", "authors": [{"id": 192935, "fullname": "Jiahui Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192935?format=json", "institution": "Link\u00f6ping University/MBZUAI"}, {"id": 192936, "fullname": "Qing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192936?format=json", "institution": "University of Groningen"}, {"id": 192937, "fullname": "Fengyu Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/192937?format=json", "institution": "Technische Universit\u00e4t Darmstadt"}, {"id": 192938, "fullname": "Fakhri Karray", "url": "http://cvpr.thecvf.com/api/miniconf/users/192938?format=json", "institution": "University of Waterloo; Mohamed bin Zayed University of Artificial Intelligence"}], "abstract": "Code search, framed as information retrieval (IR), underpins modern software engineering and increasingly powers retrieval-augmented generation (RAG), improving code discovery, reuse, and the reliability of LLM-based coding.Yet existing code IR models remain largely text-centric and often overlook the visual and structural aspects inherent in programming artifacts such as web interfaces, data visualizations, SVGs, schematic diagrams, and UML.To bridge this gap, we introduce MMCoIR, the first comprehensive benchmark for evaluating multimodal code IR across five visual domains, and show through extensive evaluation the task is challenging.Therefore, we then propose CodeMMR, a unified retrieval model that jointly embeds natural language, code, and images into a shared semantic space through instruction-based multimodal alignment.CodeMMR achieves strong generalization across modalities and languages, outperforming competitive baselines (e.g., UniIR, GME, VLM2Vec) by an average of 10 points on nDCG@10.Moreover, integrating CodeMMR into RAG enhances code generation fidelity and visual grounding on unseen code generation tasks, underscoring the potential of multimodal retrieval as a core enabler for next-generation intelligent programming systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39830", "url": null, "sourceid": 44602, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36656, "uid": "27e34e1093ba7e24075b9f5b25dcf5a7", "name": "ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation", "authors": [{"id": 185576, "fullname": "Zhenyang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185576?format=json", "institution": "Fudan University"}, {"id": 185577, "fullname": "Yongchong Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185577?format=json", "institution": "Fudan University"}, {"id": 76561, "fullname": "Yikai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76561?format=json", "institution": "Nanyang Technological University"}, {"id": 89233, "fullname": "Xiangyang Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/89233?format=json", "institution": "Fudan University"}, {"id": 73907, "fullname": "Yanwei Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73907?format=json", "institution": "Fudan University"}], "abstract": "Recent advances in robot manipulation have leveraged pre-trained vision-language models (VLMs) and explored integrating 3D spatial signals into these models for effective action prediction, giving rise to the promising vision-language-action (VLA) paradigm. However, most existing approaches overlook the importance of active perception: they typically rely on static, wrist-mounted cameras that provide an end-effector-centric viewpoint. As a result, these models are unable to adaptively select optimal viewpoints or resolutions during task execution, which significantly limits their performance in long-horizon tasks and fine-grained manipulation scenarios. To address these limitations, we propose ActiveVLA, a novel vision-language-action framework that empowers robots with active perception capabilities for high-precision, fine-grained manipulation. ActiveVLA adopts a coarse-to-fine paradigm, dividing the process into two stages:(1) Critical region localization. ActiveVLA projects 3D inputs onto multi-view 2D projections, identifies critical 3D regions, and supports dynamic spatial awareness.(2) Active perception optimization. Drawing on the localized critical regions, ActiveVLA uses an active view selection strategy to choose optimal viewpoints. These viewpoints aim to maximize amodal relevance and diversity while minimizing occlusions. Additionally, ActiveVLA applies a 3D zoom-in to improve resolution in key areas. Together, these steps enable finer-grained active perception for precise manipulation.Extensive experiments demonstrate that ActiveVLA achieves precise 3D manipulation and outperforms state-of-the-art baselines on three simulation benchmarks. Moreover, ActiveVLA transfers seamlessly to real-world scenarios, enabling robots to learn high-precision tasks in complex environments.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36656", "url": null, "sourceid": 31722, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37153, "uid": "d3ac43d9713bf1e9d37a453da0385b3b", "name": "Adapting Point Cloud Analysis via Multimodal Bayesian Distribution Learning", "authors": [{"id": 186794, "fullname": "Xingyu Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186794?format=json", "institution": "USTC"}, {"id": 177126, "fullname": "Yi Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177126?format=json", "institution": "University of Science and Technology of China"}, {"id": 76097, "fullname": "Shuo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76097?format=json", "institution": "University of Science and Technology of China"}, {"id": 154106, "fullname": "Wenbo Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154106?format=json", "institution": "Opus AI Research"}, {"id": 151997, "fullname": "Yongliang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/151997?format=json", "institution": "Southeast University"}, {"id": 159049, "fullname": "Beier Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/159049?format=json", "institution": "Nanyang Technological University"}, {"id": 91500, "fullname": "Hanwang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91500?format=json", "institution": "Nanyang Technological University"}], "abstract": "Large multimodal 3D vision--language models show strong generalization across diverse 3D tasks, but their performance still degrades notably under domain shifts. This has motivated recent studies on test-time adaptation (TTA), which enables models to adapt online using test-time data. Among existing TTA methods, cache-based mechanisms are widely adopted for leveraging previously observed samples in online prediction refinement. However, they store only limited historical information, leading to progressive information loss as the test stream evolves. In addition, their prediction logits are fused heuristically, making adaptation unstable.To address these limitations, we propose BayesMM, a Multimodal Bayesian Distribution Learning framework for test-time point cloud analysis. BayesMM models textual priors and streaming visual features of each class as Gaussian distributions: textual parameters are derived from semantic prompts, while visual parameters are updated online with arriving samples. The two modalities are fused via Bayesian model averaging, which automatically adjusts their contributions based on posterior evidence, yielding a unified prediction that adapts continually to evolving test-time data without training.Extensive experiments on multiple point cloud benchmarks demonstrate that BayesMM maintains robustness under distributional shifts, yielding over 4\\% average improvement.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37153", "url": null, "sourceid": 32411, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40242, "uid": "10d3aeb6639675594477989a6098627d", "name": "ThinkGen: Generalized Thinking for Visual Generation", "authors": [{"id": 71415, "fullname": "Siyu Jiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/71415?format=json", "institution": "Beijing Jiaotong University"}, {"id": 193860, "fullname": "Yiheng Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/193860?format=json", "institution": "Beijing Jiaotong University"}, {"id": 88127, "fullname": "Yujie Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/88127?format=json", "institution": "Meituan Inc."}, {"id": 185040, "fullname": "Qi She", "url": "http://cvpr.thecvf.com/api/miniconf/users/185040?format=json", "institution": "Bytedance AI Lab"}, {"id": 144030, "fullname": "Wei zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/144030?format=json", "institution": "Sun Yat-sen University"}, {"id": 193861, "fullname": "Xiaohan Lan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193861?format=json", "institution": "Meituan"}, {"id": 155910, "fullname": "Zilong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155910?format=json", "institution": "Bytedance"}, {"id": 193862, "fullname": "Fei Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193862?format=json", "institution": null}, {"id": 76526, "fullname": "Yingchen Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76526?format=json", "institution": "Nanyang Technological University"}, {"id": 185582, "fullname": "Yunqing Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185582?format=json", "institution": "ByteDance"}, {"id": 88385, "fullname": "Yao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88385?format=json", "institution": "Beijing Jiaotong University"}, {"id": 75837, "fullname": "Yunchao Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/75837?format=json", "institution": "Beijing Jiaotong University"}], "abstract": "Recent progress in Multimodal Large Language Models (MLLMs) demonstrates that Chain-of-Thought (CoT) reasoning enables systematic solutions to complex understanding tasks. However, its extension to generation tasks remains nascent and limited by scenario-specific mechanisms that hinder generalization and adaptation. In this work, we present ThinkGen, the first think-driven visual generation framework that explicitly leverages MLLM's CoT reasoning in various generation scenarios.  ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT), wherein the MLLM generates tailored instructions based on user intent, and DiT produces high-quality images guided by these instructions. We further propose a separable GRPO-based training paradigm (SepGRPO), alternating reinforcement learning between the MLLM and DiT modules. This flexible design enables joint training across diverse datasets, facilitating effective CoT reasoning for a wide range of generative scenarios. Extensive experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40242", "url": null, "sourceid": 33233, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38453, "uid": "fba2373edb0695aa6e4d1962101d336d", "name": "When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance", "authors": [{"id": 145965, "fullname": "Yongli Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/145965?format=json", "institution": "University of Sydney, University of Sydney"}, {"id": 120013, "fullname": "Ziming Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/120013?format=json", "institution": "The University of Sydney"}, {"id": 89655, "fullname": "Zhaoqing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89655?format=json", "institution": "The University of Sydney, University of Sydney"}, {"id": 189887, "fullname": "Xiangyu Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189887?format=json", "institution": "City University of Hong Kong"}, {"id": 84821, "fullname": "Bo Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/84821?format=json", "institution": "HKBU"}, {"id": 84796, "fullname": "Tongliang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84796?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}], "abstract": "Text-to-Image (T2I) diffusion models have demonstrated significant advancements in generating high-quality images, while raising potential safety concerns regarding harmful content generation. Safety-guidance-based methods have been proposed to mitigate harmful outputs by steering generation away from harmful zones, where the zones are averaged across multiple harmful categories based on predefined keywords. However, these approaches fail to capture the complex interplay among different harm categories, leading to \\textit{``harmful conflicts''} where mitigating one type of harm may inadvertently amplify another, thus increasing overall harmful rate. To address this issue, we propose Conflict-aware Adaptive Safety Guidance (CASG), a training-free framework that dynamically identifies and applies the category-aligned safety direction during generation. CASG is composed of two components: (i) Conflict-aware Category Identification (CaCI), which identifies the harmful category most aligned with the model\u2019s evolving generative state, and (ii) Conflict-resolving Guidance Application (CrGA), which applies safety steering solely along the identified category to avoid multi-category interference. CASG can be applied to both latent-space and text-space safeguards. Experiments on T2I safety benchmarks demonstrate CASG's state-of-the-art performance, reducing the harmful rate by up to 15.4\\% compared to existing methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38453", "url": null, "sourceid": 32601, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38475, "uid": "63e9274c3a8aeec2906f656bcd9919bb", "name": "Principled Steering via Null-space Projection for Jailbreak Defense in Vision-Language Models", "authors": [{"id": 186794, "fullname": "Xingyu Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186794?format=json", "institution": "USTC"}, {"id": 159049, "fullname": "Beier Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/159049?format=json", "institution": "Nanyang Technological University"}, {"id": 76097, "fullname": "Shuo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76097?format=json", "institution": "University of Science and Technology of China"}, {"id": 189932, "fullname": "Junfeng Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189932?format=json", "institution": "National University of Singapore"}, {"id": 189933, "fullname": "Kesen Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189933?format=json", "institution": "Nanyang Technological University"}, {"id": 91500, "fullname": "Hanwang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91500?format=json", "institution": "Nanyang Technological University"}, {"id": 89363, "fullname": "Xiangnan He", "url": "http://cvpr.thecvf.com/api/miniconf/users/89363?format=json", "institution": "University of Science and Technology of China"}], "abstract": "As vision-language models (VLMs) are increasingly deployed in open-world scenarios, they can be easily induced by visual jailbreak attacks to generate harmful content, posing serious risks to model safety and trustworthy usage.Recent activation steering methods inject directional vectors into model activations during inference to induce refusal behaviors and have demonstrated effectiveness.However, a steering vector may both enhance refusal ability and cause over-refusal, thereby degrading model performance on benign inputs.Moreover, due to the lack of theoretical interpretability, these methods still suffer from limited robustness and effectiveness.To better balance safety and utility, we propose \\texttt{NullSteer}, a null-space projected activation defense framework.Our method constructs refusal directions within model activations through a linear transformation: it maintains zero perturbation within the benign subspace while dynamically inducing refusal along potentially harmful directions, thereby theoretically achieving safety enhancement without impairing the model\u2019s general capabilities.Extensive experiments show that \\texttt{NullSteer} significantly reduces harmful outputs under various jailbreak attacks (average ASR reduction over 15\\% on MiniGPT-4) while maintaining comparable performance to the original model on general benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38475", "url": null, "sourceid": 42223, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39910, "uid": "2f94d22f5e9da5bd2cf285436cfc3bac", "name": "DiGraphHal-Bench: Evaluating Multimodal Large Language Models on Complex Directed Graphs", "authors": [{"id": 172405, "fullname": "Yixin Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/172405?format=json", "institution": "Fudan University"}, {"id": 178395, "fullname": "He Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/178395?format=json", "institution": "Fudan University"}, {"id": 193093, "fullname": "Yuxin Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/193093?format=json", "institution": "Fudan University"}, {"id": 193094, "fullname": "Changhua Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/193094?format=json", "institution": "Fudan University"}, {"id": 193095, "fullname": "Zihao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193095?format=json", "institution": "Fudan University"}, {"id": 193096, "fullname": "Peng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193096?format=json", "institution": "Fudan University"}, {"id": 193097, "fullname": "Lu ChengLong", "url": "http://cvpr.thecvf.com/api/miniconf/users/193097?format=json", "institution": "Fudan University"}, {"id": 193098, "fullname": "Xu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193098?format=json", "institution": "Microsoft Research; Fudan University; Fudan University"}, {"id": 193099, "fullname": "Wei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193099?format=json", "institution": "Fudan University"}], "abstract": "While prior research on Multimodal Large Language Model (MLLM) hallucinations has primarily examined cross-modal inconsistencies in natural images, hallucination over complex graph structures remains underexplored.Concurrently, there is a lack of robust evaluation for fine-grained reasoning integrating structural, visual, and semantic information.To address these gaps, we present DiGraphHal-Bench, the first large-scale Visual Question Answering (VQA) benchmark for evaluating both hallucination phenomena and fine-grained reasoning of MLLMs on real-world directed graphs. DiGraphHal-Bench comprises high-quality procedural graphs from over six distinct domains and is organized around a taxonomy of four high-level capabilities and twelve fine-grained tasks. To ensure benchmark fidelity, we propose a novel two-stage automatic data curation pipeline that reconciles the trade-off between data scale and quality, thereby guaranteeing reliable evaluation.Experiments reveal that state-of-the-art MLLMs hallucinate notably in fine-grained graph reasoning. Although SFT substantially mitigates these hallucinations and strengthens complex reasoning, performance remains far from optimal. Ablation studies highlight the importance of fundamental capabilities for integrative reasoning, and our benchmark provides a foundation for advancing robust multi-modal graph understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39910", "url": null, "sourceid": 37936, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37748, "uid": "ebcdc4e2a7e73118fab4921b78200b2b", "name": "MotionHiFlow: Text-to-Motion via Hierarchical Flow Matching", "authors": [{"id": 188154, "fullname": "Heng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188154?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 188155, "fullname": "Xiaotong Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188155?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 188156, "fullname": "Ling-An Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188156?format=json", "institution": null}, {"id": 188157, "fullname": "Yulei Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188157?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 157617, "fullname": "Shuai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/157617?format=json", "institution": "Shandong University"}, {"id": 88842, "fullname": "Jian-Fang Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88842?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Text-to-motion generation aims to generate 3D human motions that are tightly aligned with the input text while remaining physically plausible and rich in fine-grained detail. Although recent approaches can produce complex and natural movements, they usually operate at only one temporal scale, which limits both semantic alignment and temporal coherence. Inspired by the fact that complex motions are conceptualized hierarchically rather than at a single temporal scale in the human cognitive system, we propose *MotionHiFlow*, a hierarchical flow matching framework to generate motion progressively by constructing flow path from low to high temporal scales. The flows at lower scales capture high-level semantics and coarse motion structures, while flows at higher scales refine temporal details. To link the flows across scales, we introduce a novel cross-scale transition process, ensuring continuity and preserving noise consistency. Furthermore, by integrating a Text-Motion Diffusion Transformer and a topology-aware Motion VAE, MotionHiFlow explicitly models structural dependencies among joints via joint-aware positional encoding and skeletal topology, enabling precise semantic alignment alongside fine-grained motion details. Extensive experiments on HumanML3D and KIT-ML benchmarks demonstrate state-of-the-art performance, with ablation studies confirming the effectiveness of the hierarchical design and key components.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37748", "url": null, "sourceid": 43481, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37003, "uid": "2d968cb99817ea5e49f4bcb70f7f154a", "name": "IVAAN: Instance-level Vision-Language Alignment via Attribute-Guided Text Prompts Generation for Nuclei Analysis", "authors": [{"id": 146589, "fullname": "Jaehoon Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/146589?format=json", "institution": "LG Corporation; Daegu Gyeongbuk Institute of Science and Technology"}, {"id": 147817, "fullname": "Yi Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/147817?format=json", "institution": "LG AI Research"}, {"id": 186442, "fullname": "Soopil Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/186442?format=json", "institution": "Daegu Gyeongbuk Institute of Science and Technology"}, {"id": 186443, "fullname": "Jongseong Jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186443?format=json", "institution": "LG AI Research"}, {"id": 152638, "fullname": "Soonyoung Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/152638?format=json", "institution": "LG AI Research"}, {"id": 186444, "fullname": "Sang Hyun Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/186444?format=json", "institution": "Daegu Gyeongbuk Institute of Science and Technology"}], "abstract": "Nuclei instance segmentation and classification are fundamental but remain challenging in pathology due to severe class imbalance and organ- and stain-induced variability. While vision\u2013language approaches can inject explicit semantic cues that reduce spurious contextual bias under imbalance, the absence of instance level textual annotations has limited their utility for nucleus-level analysis. We introduce an instance-level vision\u2013language framework that derives attribute-guided textual descriptions from ground-truth masks. We then align visual representations with these semantic text anchors via contrastive learning, coupling morphology with semantics at the instance level. To capture intra-class variations while maintaining organ-consistent class semantics, we learn multiple class-specific tokens that act as prototypes representing diverse submodes within a class, summarizing morphologically similar nuclei. Our approach improves both segmentation and classification without manual text labels, indicating that language-guided instance alignment combined with prototype-based semantic feedback yields more discriminative and generalizable nuclei representations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37003", "url": null, "sourceid": 39959, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36190, "uid": "fde4d164a6f20582854b04bb779f621d", "name": "Ego-1K \u2013 A Large-Scale Multiview Video Dataset for Egocentric Vision", "authors": [{"id": 184386, "fullname": "Jae Yong Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/184386?format=json", "institution": "Facebook"}, {"id": 89478, "fullname": "Daniel Scharstein", "url": "http://cvpr.thecvf.com/api/miniconf/users/89478?format=json", "institution": "Meta"}, {"id": 89430, "fullname": "Akash Bapat", "url": "http://cvpr.thecvf.com/api/miniconf/users/89430?format=json", "institution": "Meta Inc"}, {"id": 184387, "fullname": "Hao Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184387?format=json", "institution": "Meta"}, {"id": 184388, "fullname": "Andrew Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184388?format=json", "institution": null}, {"id": 177277, "fullname": "Haoru Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/177277?format=json", "institution": "Facebook"}, {"id": 184389, "fullname": "Paul Sammut", "url": "http://cvpr.thecvf.com/api/miniconf/users/184389?format=json", "institution": "Meta"}, {"id": 139172, "fullname": "Xiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/139172?format=json", "institution": "Facebook"}, {"id": 184390, "fullname": "Stephen Jeapes", "url": "http://cvpr.thecvf.com/api/miniconf/users/184390?format=json", "institution": "Facebook"}, {"id": 184391, "fullname": "Anik Gupta", "url": "http://cvpr.thecvf.com/api/miniconf/users/184391?format=json", "institution": "Facebook"}, {"id": 184392, "fullname": "Lior David", "url": "http://cvpr.thecvf.com/api/miniconf/users/184392?format=json", "institution": "Meta"}, {"id": 184393, "fullname": "Saketh Madhuvarasu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184393?format=json", "institution": null}, {"id": 184394, "fullname": "JAY JOSHI", "url": "http://cvpr.thecvf.com/api/miniconf/users/184394?format=json", "institution": null}, {"id": 184395, "fullname": "Jason Wither", "url": "http://cvpr.thecvf.com/api/miniconf/users/184395?format=json", "institution": "Facebook"}], "abstract": "We present Ego-1K, a large-scale, time-synchronized collection of egocentric multiview videos designed to advance neural 3D video synthesis, dynamic scene understanding, and embodied perception. The dataset contains nearly 1,000 short egocentric videos taken with a custom rig with 12 synchronous cameras surrounding a VR headset worn by the user.  Scene content focuses on hand motions and hand-object interactions in different settings.  We describe rig design, data processing, and calibration.  Our dataset enables new ways to benchmark egocentric scene reconstruction methods.  We believe this is an important area of research as smart glasses with multiple cameras become omnipresent.  Our experiments demonstrate that our dataset presents unique challenges for existing 3D and 4D novel view synthesis methods due to high disparities and image motion caused by close dynamic objects and rig egomotion.  Our dataset supports future research in this challenging domain, enabling 4D world creation and sharing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36190", "url": null, "sourceid": 39678, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39161, "uid": "988d62aed8989449e54fa33d1879db76", "name": "Improving Vision-language Models with Perception-centric Process Reward Models", "authors": [{"id": 189494, "fullname": "Yingqian Min", "url": "http://cvpr.thecvf.com/api/miniconf/users/189494?format=json", "institution": "Renmin University of China"}, {"id": 189493, "fullname": "Kun Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/189493?format=json", "institution": "University of California, San Diego"}, {"id": 187377, "fullname": "Yifan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187377?format=json", "institution": "Renmin University of China"}, {"id": 191483, "fullname": "Yuhuan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191483?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 191484, "fullname": "Han Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191484?format=json", "institution": "Renmin University of China"}, {"id": 107072, "fullname": "Yifan Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/107072?format=json", "institution": "Renmin University of China"}, {"id": 189496, "fullname": "Xin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189496?format=json", "institution": "Renmin University of China"}, {"id": 191485, "fullname": "Min Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191485?format=json", "institution": "Bytedance"}, {"id": 126218, "fullname": "Ji-Rong Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126218?format=json", "institution": "Renmin University of China"}], "abstract": "Recent advancements in reinforcement learning with verifiable rewards (RLVR) have significantly improved the complex reasoning ability of vision-language models (VLMs). However, its outcome-level supervision is too coarse to diagnose and correct errors within the reasoning chain. To this end, we propose Perceval, a process reward model (PRM) that enables token-level error grounding, which can extract image-related claims from the response and compare them one by one with the visual evidence in the image, ultimately returning claims that contain perceptual errors. Perceval is trained with perception-intensive supervised training data. We then integrate Perceval into the RL training process to train the policy models. Specifically, compared to traditional GRPO, which applies sequence-level advantages, we apply token-level advantages by targeting penalties on hallucinated spans identified by Perceval, thus enabling fine-grained supervision signals. In addition to augmenting the training process, Perceval can also assist VLMs during the inference stage. Using Perceval, we can truncate the erroneous portions of the model\u2019s response, and then either have the model regenerate the response directly or induce the model to reflect on its previous output. This process can be repeated multiple times to achieve test-time scaling. Experiments show significant improvements on benchmarks from various domains across multiple reasoning VLMs trained with RL. For test-time scaling, it also demonstrates consistent performance gains over other strategies, such as major voting. Our code and data will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39161", "url": null, "sourceid": 32011, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38283, "uid": "0f9ef8cb70bb4135133a24a464ad55e1", "name": "Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization", "authors": [{"id": 107072, "fullname": "Yifan Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/107072?format=json", "institution": "Renmin University of China"}, {"id": 189493, "fullname": "Kun Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/189493?format=json", "institution": "University of California, San Diego"}, {"id": 189494, "fullname": "Yingqian Min", "url": "http://cvpr.thecvf.com/api/miniconf/users/189494?format=json", "institution": "Renmin University of China"}, {"id": 189495, "fullname": "Yue Ling", "url": "http://cvpr.thecvf.com/api/miniconf/users/189495?format=json", "institution": "ByteDance Inc."}, {"id": 189496, "fullname": "Xin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189496?format=json", "institution": "Renmin University of China"}, {"id": 189497, "fullname": "Youbin Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189497?format=json", "institution": ""}, {"id": 126218, "fullname": "Ji-Rong Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126218?format=json", "institution": "Renmin University of China"}], "abstract": "We study how different Chain-of-Thought (CoT) designs affect the acquisition of the generalizable visual reasoning ability in vision-language models (VLMs). While CoT data, especially long or visual CoT such as ''think with image'', has been widely used to supervise intermediate reasoning, it remains unclear why specific CoT designs help and which ones truly support generalizable reasoning. \\ignore{However, it is costly to construct or synthesize and may contain a complicated format that increases the risk of incorrect intermediate steps, hurting downstream reinforcement learning (RL).} To systematically evaluate this, we focus on a controlled maze-solving benchmark where reasoning rules are fully visual, difficulty can be tuned by grid size, and all the intermediate steps can be automatically generated. Using Qwen2.5-VL-7B under a standard SFT-then-RL pipeline, we compare three representative CoT formats: Language CoT, Grounding CoT (with spatial coordinate trajectories), and Visual CoT (with image manipulations). Our experiments reveal that visual and longer CoT mainly accelerate convergence but do not lift the final performance ceiling; concise CoT containing only essential grounding steps outperforms longer traces; and, strikingly, CoT retaining only the minimal grounding results generalizes best across different maze sizes. We further validate these insights on other vision-centric tasks. These findings highlight a ``short is long'' effect and provide practical guidance for constructing more generalizable SFT datasets for visual reasoning. Our code and data will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38283", "url": null, "sourceid": 31297, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38805, "uid": "fca770bbc016b9f553eddcbad569617d", "name": "BrickNet: Graph-Backed Generative Brick Assembly", "authors": [{"id": 90583, "fullname": "Peter Kulits", "url": "http://cvpr.thecvf.com/api/miniconf/users/90583?format=json", "institution": "Max Planck Institute for Intelligent Systems, Max-Planck Institute"}, {"id": 69185, "fullname": "Cordelia Schmid", "url": "http://cvpr.thecvf.com/api/miniconf/users/69185?format=json", "institution": "Inria / Google"}], "abstract": "We train a language model to generate LEGO\u00ae-brick build sequences. While prior work has been restricted to discrete, voxel-like towers, we consider a much broader set of pieces, encompassing thousands of part types with diverse connection semantics. To enable this, we first collect a large-scale dataset of over 100,000 human-designed LDraw brick objects and scenes. The complexity of our setting makes it challenging to autoregressively assemble structures that satisfy physical constraints. When predicting block pose directly, build sequences quickly become invalid after a small number of steps. Although pieces are placed in 3D space, it is the spatial relationships of the parts which define the whole. With this in mind, we design a graph-based program representation that parametrizes structure through connectivity, improving the physical grounding of generated sequences. To enable future applications, we make our dataset and models available for research purposes. https://kulits.github.io/BrickNet", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38805", "url": null, "sourceid": 35141, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39460, "uid": "cc0843a977aad2421486ae32ccc2c018", "name": "A Faster Path to Continual Learning", "authors": [{"id": 192115, "fullname": "Wei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192115?format=json", "institution": "Sichuan University"}, {"id": 126537, "fullname": "Hangjie Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/126537?format=json", "institution": "Zhejiang University"}, {"id": 130963, "fullname": "Zixiang Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/130963?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 171245, "fullname": "Borui Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/171245?format=json", "institution": "Nanjing University "}, {"id": 192116, "fullname": "Ziwei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192116?format=json", "institution": "Nanyang Technological University; Sichuan University"}, {"id": 131066, "fullname": "Tao Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/131066?format=json", "institution": "Tsinghua University"}], "abstract": "Continual Learning (CL) aims to train neural networks on a dynamic stream of tasks without forgetting previously learned knowledge. Among optimization-based approaches, C-Flat has emerged as a promising solution due to its plug-and-play nature and its ability to encourage uniformly low-loss regions for both new and old tasks. However, C-Flat requires three additional gradient computations per iteration, imposing substantial overhead on the optimization process. In this work, we propose C-Flat Turbo, a faster yet stronger optimizer that significantly reduces the training cost. We show that the gradients associated with first-order flatness contain direction-invariant components relative to the proxy-model gradients, enabling us to skip redundant gradient computations in the perturbed ascent steps. Moreover, we observe that these flatness-promoting gradients progressively stabilize across tasks, which motivates a linear scheduling strategy with an adaptive trigger to allocate larger turbo steps for later tasks. Experiments demonstrate that C-Flat Turbo accelerates a wide range of CL methods by at least $1\\times$ (and up to $1.25\\times$) compared to C-Flat, while achieving comparable or even improved accuracy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39460", "url": null, "sourceid": 30961, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40012, "uid": "df5bb13a517ca3fb459bb1ef4300f331", "name": "MORE-STEM: Long-Short MemOry REcall and Spatio-TEmporal Consistency Model for Query-Driven 3D/4D Point Cloud Segmentation", "authors": [{"id": 180220, "fullname": "Chade Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180220?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 193296, "fullname": "Haida Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193296?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 193297, "fullname": "Pengju Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193297?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 153789, "fullname": "Yihong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153789?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}], "abstract": "Current query-driven 3D understanding methods are constrained to static point clouds, limiting their ability to reason about dynamic scenes. To bridge this gap, we propose $\\textbf{MORE-STEM}$, a unified framework for Long-Short $\\textbf{M}$em$\\textbf{O}$ry $\\textbf{RE}$call and $\\textbf{S}$patio-$\\textbf{TE}$mporal Consistency $\\textbf{M}$odel in Query-Driven 3D/4D Point Cloud Segmentation. The framework first introduces a Cross-Frame Text-Visual Alignment module that establishes fine-grained, time-aware correspondences between linguistic queries and dynamic 3D features. Building on this, a Spatio-Temporal Consistency Model module enforces motion-aware coherence across consecutive frames, ensuring stable and temporally consistent segmentation. A Long-Short Memory Recall module further enhances cross-scene reasoning through hierarchical memory that balances long-term semantic recall and short-term adaptation. We also construct a new outdoor benchmark for both 3D and 4D instruction segmentation with temporally aligned, motion-centric text annotations. Experiments demonstrate that MORE-STEM achieves state-of-the-art performance across multiple 3D and 4D understanding tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40012", "url": null, "sourceid": 42625, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39106, "uid": "d9d60b7a1937e200df7852bf20ea210d", "name": "Generating Humanless Environment Walkthroughs from Egocentric Walking Tour Videos", "authors": [{"id": 135172, "fullname": "Yujin Ham", "url": "http://cvpr.thecvf.com/api/miniconf/users/135172?format=json", "institution": "Rice University"}, {"id": 191374, "fullname": "Junho Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/191374?format=json", "institution": "Rice University"}, {"id": 158988, "fullname": "Vivek Boominathan", "url": "http://cvpr.thecvf.com/api/miniconf/users/158988?format=json", "institution": "Rice University"}, {"id": 131368, "fullname": "Guha Balakrishnan", "url": "http://cvpr.thecvf.com/api/miniconf/users/131368?format=json", "institution": "Rice University"}], "abstract": "Egocentric ``walking tour'' videos provide a rich source of image data to develop rich and diverse visual models of environments around the world. However, the significant presence of humans in frames of these videos due to crowds and eye-level camera perspectives mitigates their usefulness in environment modeling applications. We focus on addressing this challenge by developing a generative algorithm that can realistically remove (i.e., inpaint) humans and their associated shadow effects from walking tour videos. Key to our approach is the construction of a rich semi-synthetic dataset of video clip pairs to train this generative model. Each pair in the dataset consists of an environment-only background clip, and a composite clip of walking humans with simulated shadows overlaid on the background. We randomly sourced both foreground and background components from real egocentric walking videos around the world to maintain visual diversity. We then used this dataset to fine-tune the state-of-the-art Casper video diffusion model for object and effects inpainting, and demonstrate that the resulting model performs far better than Casper both qualitatively and quantitatively at removing humans from walking tour clips with significant human presence and complex backgrounds. Finally, we show that the resulting generated clips can be used to build successful 3D Gaussian Splatting models of urban locations which was otherwise not possible from the original clips.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39106", "url": null, "sourceid": 42224, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36664, "uid": "d7eb348738f4b7857da42fd209cdc389", "name": "PANDA: Pretraining for vision ANd language with Dense Alignment", "authors": [{"id": 126788, "fullname": "Bingyi Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/126788?format=json", "institution": "Google Research"}, {"id": 185585, "fullname": "Koert Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185585?format=json", "institution": "Google DeepMind"}, {"id": 130350, "fullname": "Kevis-kokitsi Maninis", "url": "http://cvpr.thecvf.com/api/miniconf/users/130350?format=json", "institution": "Google"}, {"id": 127151, "fullname": "Kaifeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/127151?format=json", "institution": "Google"}, {"id": 126787, "fullname": "Arjun Karpur", "url": "http://cvpr.thecvf.com/api/miniconf/users/126787?format=json", "institution": "Google Research"}, {"id": 158236, "fullname": "Ye Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/158236?format=json", "institution": "Google"}, {"id": 185586, "fullname": "Sahil Dua", "url": "http://cvpr.thecvf.com/api/miniconf/users/185586?format=json", "institution": "Google DeepMind"}, {"id": 185587, "fullname": "Tanmaya Shekhar Dabral", "url": "http://cvpr.thecvf.com/api/miniconf/users/185587?format=json", "institution": "Google"}, {"id": 92178, "fullname": "Guangxing Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/92178?format=json", "institution": "Columbia University"}, {"id": 75881, "fullname": "Bohyung Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/75881?format=json", "institution": "Seoul National University"}, {"id": 185588, "fullname": "Joshua Ainslie", "url": "http://cvpr.thecvf.com/api/miniconf/users/185588?format=json", "institution": "Google"}, {"id": 185589, "fullname": "Alex Bewley", "url": "http://cvpr.thecvf.com/api/miniconf/users/185589?format=json", "institution": "DeepMind; Google"}, {"id": 133646, "fullname": "Mithun Jacob", "url": "http://cvpr.thecvf.com/api/miniconf/users/133646?format=json", "institution": "Google"}, {"id": 185590, "fullname": "Ren\u00e9 Wagner", "url": "http://cvpr.thecvf.com/api/miniconf/users/185590?format=json", "institution": "Google"}, {"id": 182948, "fullname": "Washington Ramos", "url": "http://cvpr.thecvf.com/api/miniconf/users/182948?format=json", "institution": "Google"}, {"id": 151089, "fullname": "Krzysztof Choromanski", "url": "http://cvpr.thecvf.com/api/miniconf/users/151089?format=json", "institution": "Google DeepMind Robotics &amp; Columbia University"}, {"id": 185591, "fullname": "Mojtaba Seyedhosseini", "url": "http://cvpr.thecvf.com/api/miniconf/users/185591?format=json", "institution": "Google"}, {"id": 129997, "fullname": "Howard Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/129997?format=json", "institution": "Google Research"}, {"id": 87736, "fullname": "Andr\u00e9 Araujo", "url": "http://cvpr.thecvf.com/api/miniconf/users/87736?format=json", "institution": "Google Research"}], "abstract": "Recent progress in vision-language pretraining has enabled significant improvements to many downstream computer vision applications, such as classification, retrieval, segmentation and depth prediction. However, a fundamental capability that these models still struggle with is aligning dense patch representations with text embeddings of corresponding concepts. In this work, we investigate this critical issue and propose novel techniques to enhance this capability in foundational vision-language models. First, we reveal that a patch-level distillation procedure significantly boosts dense patch-text alignment -- surprisingly, the patch-text alignment of the distilled student model strongly surpasses that of the teacher model. This observation inspires us to consider modifications to pretraining recipes, leading us to propose iBOT++, an upgrade to the commonly-used iBOT masked image objective, where unmasked tokens also contribute directly to the loss. This dramatically enhances patch-text alignment of pretrained models. Additionally, to improve vision-language pretraining efficiency and effectiveness, we modify the exponential moving average setup in the learning recipe, and introduce a caption sampling strategy to benefit from synthetic captions at different granularities. Combining these components, we develop PANDA (Pretraining for vision ANd language with Dense Alignment), a new family of image-text encoder models suitable for a wide range of downstream applications. Through comprehensive experiments on 9 tasks and 20 datasets, we demonstrate strong performance, generally on par with or better than recent vision encoder models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36664", "url": null, "sourceid": 40486, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39240, "uid": "77a8a15e83db795ece719efa7ce127da", "name": "4C4D: 4 Camera 4D Gaussian Splatting", "authors": [{"id": 153057, "fullname": "Junsheng Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/153057?format=json", "institution": "Tsinghua University"}, {"id": 191687, "fullname": "Zhifan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191687?format=json", "institution": "Tsinghua University"}, {"id": 155250, "fullname": "Liang Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/155250?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 154669, "fullname": "Wenyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154669?format=json", "institution": "Software Engineering"}, {"id": 130537, "fullname": "Kanle Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/130537?format=json", "institution": "Kuaishou Technology"}, {"id": 184608, "fullname": "Shenkun Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184608?format=json", "institution": null}, {"id": 76426, "fullname": "Yu-Shen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76426?format=json", "institution": "Tsinghua University"}], "abstract": "This paper tackles the challenge of recovering 4D dynamic scenes from videos captured by as few as four portable cameras. Learning to model scene dynamics for temporally consistent novel-view rendering is a foundational task in computer graphics, where previous works often require dense multi-view captures using camera arrays of dozens or even hundreds of views. We propose 4C4D, a novel framework that enables high-fidelity 4D Gaussian Splatting from video captures of extremely sparse cameras. Our key insight lies that the geometric learning under sparse settings is substantially more difficult than modeling appearance. Driven by this observation, we introduce a Neural Decaying Function on Gaussian opacities for enhancing the geometric modeling capability of 4D Gaussians. This design mitigates the inherent imbalance between geometry and appearance modeling in 4DGS by encouraging the 4DGS gradients to focus more on geometric learning. Extensive experiments across sparse-view datasets with varying camera overlaps show that 4C4D achieves superior performance over prior art.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39240", "url": null, "sourceid": 31132, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36797, "uid": "5669eb0ccfa4c6d23bf6d896ddd051ca", "name": "Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation", "authors": [{"id": 179984, "fullname": "Chuancheng Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/179984?format=json", "institution": "The University of Sydney"}, {"id": 185890, "fullname": "Shangze Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185890?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 185891, "fullname": "Shiming Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185891?format=json", "institution": "University of Sydney, University of Sydney"}, {"id": 185892, "fullname": "Simiao Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/185892?format=json", "institution": "University of Sydney, University of Sydney"}, {"id": 185893, "fullname": "Wenhua Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185893?format=json", "institution": "University of Sydney, University of Sydney"}, {"id": 185894, "fullname": "Jingtong Dou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185894?format=json", "institution": "The University of Sydney"}, {"id": 185895, "fullname": "Chao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185895?format=json", "institution": "Nanjing University of Science and Technology; Northeastern University"}, {"id": 185896, "fullname": "Canran Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185896?format=json", "institution": "Shenzhen Campus of Sun Yat-sen University"}, {"id": 183174, "fullname": "Cong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183174?format=json", "institution": "Nanjing University"}, {"id": 185897, "fullname": "Zifeng Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185897?format=json", "institution": "Nanjing University"}, {"id": 185898, "fullname": "Fei Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185898?format=json", "institution": "National University of Singapore; Tencent"}, {"id": 88927, "fullname": "Tat-seng Chua", "url": "http://cvpr.thecvf.com/api/miniconf/users/88927?format=json", "institution": "National University of Singapore"}], "abstract": "Multilingual text-to-image (T2I) models have advanced rapidly in terms of visual realism and semantic alignment, and are now widely utilised. Yet outputs vary across cultural contexts: because language carries cultural connotations, images synthesized from multilingual prompts should preserve cross-lingual cultural consistency. We conduct a comprehensive analysis showing that current T2I models often produce culturally neutral or English-biased results under multilingual prompts.Analyses of two representative models indicate that the issue stems not from missing cultural knowledge but from insufficient activation of culture-related representations. We propose a probing method that localizes culture-sensitive signals to a small set of neurons in a few fixed layers. Guided by this finding, we introduce two complementary alignment strategies: (1) inference-time cultural activation that amplifies the identified neurons without backbone fine-tuned; and (2) layer-targeted cultural enhancement that updates only culturally relevant layers. Experiments on our CultureBench demonstrate consistent improvements over strong baselines in cultural consistency while preserving fidelity and diversity.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36797", "url": null, "sourceid": 46271, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36262, "uid": "1bbc1eb46ce8d3a516cc1220536fd234", "name": "GaussianGrow: Geometry-aware Gaussian Growing from 3D Point Clouds with Text Guidance", "authors": [{"id": 130530, "fullname": "Weiqi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130530?format=json", "institution": "Tsinghua University"}, {"id": 153057, "fullname": "Junsheng Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/153057?format=json", "institution": "Tsinghua University"}, {"id": 184607, "fullname": "Haotian Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184607?format=json", "institution": "Tsinghua University"}, {"id": 130537, "fullname": "Kanle Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/130537?format=json", "institution": "Kuaishou Technology"}, {"id": 184608, "fullname": "Shenkun Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184608?format=json", "institution": null}, {"id": 90510, "fullname": "Yi Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90510?format=json", "institution": "New York University"}, {"id": 76426, "fullname": "Yu-Shen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76426?format=json", "institution": "Tsinghua University"}], "abstract": "3D Gaussian Splatting has demonstrated superior performance in rendering efficiency and quality, yet the generation of 3D Gaussians still remains a challenge without proper geometric priors. Existing methods have explored to predict point maps as geometric references for inferring Gaussian primitives, while the unreliable estimated geometries may lead to poor generations. In this work, we introduce GaussianGrow, a novel approach that generates 3D Gaussians by learning to grow them from easily accessible 3D point clouds, naturally enforcing geometric accuracy in Gaussian generation. Specifically, we design a text-guided Gaussian growing scheme that leverages a multi-view diffusion model to synthesize consistent appearances from input point clouds for supervision. To mitigate artifacts caused by fusing neighboring views, we constrain on novel views generated at non-preset camera poses identified in overlapping regions across different views. For completing the hard-to-observe regions, we propose to iteratively detect the camera pose observing the largest un-grown regions in point clouds and inpaints them by inpainting the rendered view with a pretrained 2D diffusion model. The process continues until complete Gaussians are generated. We extensively evaluate GaussianGrow on text-guided Gaussian generation from synthetic and even real-scanned point clouds.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36262", "url": null, "sourceid": 36871, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39200, "uid": "f6ec6017f81a1a735257468f5b31d02d", "name": "Retrieving Counterfactuals Improves Visual In-Context Learning", "authors": [{"id": 182692, "fullname": "Guangzhi Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/182692?format=json", "institution": "University of Virginia"}, {"id": 191568, "fullname": "Sanchit Sinha", "url": "http://cvpr.thecvf.com/api/miniconf/users/191568?format=json", "institution": null}, {"id": 191569, "fullname": "Zhenghao He", "url": "http://cvpr.thecvf.com/api/miniconf/users/191569?format=json", "institution": "University of Virginia, Charlottesville"}, {"id": 191570, "fullname": "Aidong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191570?format=json", "institution": "University of Virginia"}], "abstract": "Vision-language models (VLMs) have achieved impressive performance across a wide range of multimodal reasoning tasks, but they often struggle to disentangle fine-grained visual attributes and reason about underlying causal relationships. In-context learning (ICL) offers a promising avenue for VLMs to adapt to new tasks, but its effectiveness critically depends on the selection of demonstration examples. Existing retrieval-augmented approaches typically rely on passive similarity-based retrieval, which tends to select correlated but non-causal examples, amplifying spurious associations and limiting model robustness. We introduce CIRCLES (Composed Image Retrieval for Causal Learning Example Selection), a novel framework that actively constructs demonstration sets by retrieving counterfactual examples through targeted, attribute-guided composed image retrieval. By incorporating counterfactual-style examples, CIRCLES enables VLMs to reason about the causal relations between attributes and outcomes, moving beyond superficial correlations and fostering more robust and causally grounded reasoning. Comprehensive experiments on four diverse datasets demonstrate that CIRCLES consistently outperforms existing methods across multiple architectures, especially on small-scale models, with pronounced gains under information scarcity. Furthermore, CIRCLES retrieves more diverse and causally informative examples, providing qualitative insights into how models leverage in-context demonstrations for improved reasoning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39200", "url": "https://gzxiong.github.io/CIRCLES/", "sourceid": 33822, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65708, "file": "/media/PosterPDFs/CVPR%202026/39200.png", "modified": "2026-04-17T10:34:15.372413-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": true, "sortkey": 0, "is_live_content": false, "detailed_kind": "", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65709, "file": "/media/PosterPDFs/CVPR%202026/39200-thumb.png", "modified": "2026-04-17T10:34:15.565036-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": false, "sortkey": 0, "is_live_content": false, "detailed_kind": "thumb", "generated_from": null, "resourcetype": "EventmediaImageFile"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36431, "uid": "77c67b0f0e59ae3e0e2b4ba9e249b9d8", "name": "Linguistic Priors for Visual Decoupling: Towards Symmetric Vision-Brain Alignment", "authors": [{"id": 185028, "fullname": "Dongjun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185028?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 158820, "fullname": "Weichen Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/158820?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 185029, "fullname": "Jingsheng Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/185029?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 185030, "fullname": "Honggang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185030?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 185031, "fullname": "Hangjie Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185031?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 158824, "fullname": "Wanzeng Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/158824?format=json", "institution": "Hangzhou Dianzi University"}], "abstract": "Brain visual decoding aims to recognize and reconstruct perceptual visual content from neural activity, representing a promising avenue for developing brain-computer interfaces and building brain-inspired artificial intelligence. However, this task faces a fundamental challenge of information asymmetry: while natural images contain complex visual scenes with objects and backgrounds, the corresponding brain signals reflect focused attention on central objects while being contaminated by various neural noise. Previous methods that directly align visual and brain representations often overlook this inherent asymmetry, resulting in suboptimal decoding performance. To address this, we propose linguistic-prior-guided visual decoupling method, which introducing object-oriented textual descriptions as semantic guidance to explicitly decouple foreground objects from complex backgrounds in natural images, thereby establishing symmetric vision-brain alignment. This design enables the model to automatically focus on task-relevant visual concepts while effectively filtering out irrelevant neural noise in brain signals, achieving a transition from asymmetric feature alignment to semantic symmetric alignment. Extensive experiments on the THINGS-EEG and THINGS-MEG datasets demonstrate that our method achieves new state-of-the-art performance in the zero-shot brain-to-image retrieval task.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36431", "url": null, "sourceid": 46022, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38691, "uid": "08f3648792fc7148287e0934cafdd002", "name": "EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation", "authors": [{"id": 190473, "fullname": "Bingxuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190473?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 137993, "fullname": "Yiming Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/137993?format=json", "institution": "ByteDance inc."}, {"id": 190474, "fullname": "Yicheng He", "url": "http://cvpr.thecvf.com/api/miniconf/users/190474?format=json", "institution": "ByteDance"}, {"id": 182615, "fullname": "Yiwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182615?format=json", "institution": "University of California, Merced"}, {"id": 126813, "fullname": "Shu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126813?format=json", "institution": "SalesForce.com"}, {"id": 91813, "fullname": "Longyin Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/91813?format=json", "institution": "Bytedance Inc."}, {"id": 90023, "fullname": "Yulei Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90023?format=json", "institution": "ByteDance"}], "abstract": "Sound effects build an essential layer of multimodal storytelling, shaping the emotional atmosphere and the narrative semantics of videos. Despite recent advancement in video\u2013text\u2013to\u2013audio (VT2A), the current formulation faces three key limitations: (1) an imbalance between visual and textual conditioning that leads to visual dominance; (2) the absence of a concrete definition for fine-grained controllable generation; (3) weak instruction understanding and following, as existing datasets rely on brief categorical tags. To address these limitations, we introduce EchoFoley (Event-Centric Hierarchical cOntrol Foley), a new task designed for video-grounded sound generation with both event-level local control and hierarchical semantic control. Our symbolic representation for sounding events specifies when, what, and how each sound is produced within a video or instruction, enabling fine-grained controls like sound generation, insertion, and editing. To support this task, we construct EchoFoley-6k, a large-scale, expert-curated benchmark containing over 6,000 video\u2013instruction\u2013annotation triplets and 42,000 fine-grained sounding event annotations.Building upon this foundation, we propose EchoVidia, a sounding-event-centric agentic generation framework with slow-fast thinking strategy. Experiments show that EchoVidia surpasses recent VT2A models by 40.7% in controllability and 12.5% in perceptual quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38691", "url": null, "sourceid": 36375, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37396, "uid": "18d30de5260b28e09faee73740f523d2", "name": "Tri-Subspaces Disentanglement for Multimodal Sentiment Analysis", "authors": [{"id": 181530, "fullname": "Chunlei Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/181530?format=json", "institution": "College of Intelligent Robotics and Advanced Manufacturing, Fudan University"}, {"id": 187341, "fullname": "Jiabin Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187341?format=json", "institution": "Peking University"}, {"id": 187342, "fullname": "Zhenglin Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187342?format=json", "institution": "Fudan University"}, {"id": 187343, "fullname": "Zhenyu Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187343?format=json", "institution": "Fudan University"}, {"id": 187344, "fullname": "Rong Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187344?format=json", "institution": "University of Macau"}, {"id": 89907, "fullname": "Zhongxue Gan", "url": "http://cvpr.thecvf.com/api/miniconf/users/89907?format=json", "institution": "Fudan University"}, {"id": 187345, "fullname": "Chun Ouyang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187345?format=json", "institution": "Fudan University"}], "abstract": "Multimodal Sentiment Analysis (MSA) integrates language, visual, and acoustic modalities to infer human sentiment. Most existing methods either focus on globally shared representations or modality-specific features, while overlooking signals that are shared only by certain modality pairs. This limits the expressiveness and discriminative power of multimodal representations. To address this limitation, we propose a Tri-Subspace Disentanglement (TSD) framework that explicitly factorizes features into three complementary subspaces: a common subspace capturing global consistency, submodally-shared subspaces modeling pairwise cross-modal synergies, and private subspaces preserving modality-specific cues. To keep these subspaces pure and independent, we introduce a decoupling supervisor together with structured regularization losses. We further design a Subspace-Aware Cross-Attention (SACA) fusion module that adaptively models and integrates information from the three subspaces to obtain richer and more robust representations. Experiments on CMU-MOSI and CMU-MOSEI demonstrate that TSD achieves state-of-the-art performance across all key metrics, reaching 0.691 MAE on CMU-MOSI and 54.9\\% ACC$_7$ on CMU-MOSEI, and also transfers well to multimodal intent recognition tasks. Ablation studies confirm that tri-subspace disentanglement and SACA jointly enhance the modeling of multi-granular cross-modal sentiment cues.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37396", "url": null, "sourceid": 41186, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39475, "uid": "01cc9bcfcd567d83304a3843b7169ba1", "name": "Learnability-Driven Submodular Optimization for Active Roadside BEV Perception", "authors": [{"id": 184131, "fullname": "Ruiyu Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184131?format=json", "institution": "University of Texas at Dallas"}, {"id": 192154, "fullname": "Baoming Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192154?format=json", "institution": "University of Texas at Dallas"}, {"id": 192155, "fullname": "Nicholas Ruozzi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192155?format=json", "institution": "University of Texas, Dallas"}, {"id": 73934, "fullname": "Yunhui Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/73934?format=json", "institution": "The University of Texas at Dallas"}], "abstract": "Roadside perception datasets are typically constructed via cooperative labeling between synchronized vehicle and roadside frame pairs. However, real deployment often requires annotation of roadside-only data due to hardware and privacy constraints. Even human experts struggle to produce accurate labels without vehicle-side data (image, LIDAR), which not only increases annotation difficulty and cost, but also reveals a fundamental learnability problem: many roadside-only scenes contain distant, blurred, or occluded objects whose 3D properties are ambiguous from a single view and can only be reliably annotated by cross-checking paired vehicle--roadside frames. We refer to such cases as inherently ambiguous samples. To reduce wasted annotation effort on inherently ambiguous samples while still obtaining high-performing models, we turn to active learning. This work focuses on active learning for roadside monocular 3D object detection and proposes a learnability-driven framework that selects scenes which are both informative and reliably labelable, suppressing inherently ambiguous samples while ensuring coverage. Experiments demonstrate that our method, LH3D, achieves 86.06%, 67.32%, and 78.67% of full-performance for vehicles, pedestrians, and cyclists respectively, using only 25% of the annotation budget on DAIR-V2X-I, significantly outperforming uncertainty-based baselines. This confirms that learnability, not uncertainty, matters for roadside 3D perception.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39475", "url": null, "sourceid": 31317, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37555, "uid": "df5e57c1b772e3cdab2b9d868fe9743b", "name": "Beyond Generation: Advancing Image Editing Priors for Depth and Normal Estimation", "authors": [{"id": 182122, "fullname": "jiyuan WANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/182122?format=json", "institution": "Nanyang Technology University"}, {"id": 91034, "fullname": "Chunyu Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/91034?format=json", "institution": "Beijing Jiaotong University"}, {"id": 185770, "fullname": "Lei Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/185770?format=json", "institution": "Alibaba Group"}, {"id": 187711, "fullname": "Rongying Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187711?format=json", "institution": "Beijing Jiaotong University"}, {"id": 187712, "fullname": "Lang Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/187712?format=json", "institution": "Chongqing University of Post and Telecommunications"}, {"id": 187713, "fullname": "Mingxing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187713?format=json", "institution": null}, {"id": 156305, "fullname": "Kang Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/156305?format=json", "institution": "Nanyang Technological University"}, {"id": 88278, "fullname": "Xiangxiang Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88278?format=json", "institution": "MeiTuan"}], "abstract": "Pre-trained text-to-image (T2I) generative priors have shown success in depth and normal prediction. However, dense prediction is inherently an image-to-image task, suggesting that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning.   Motivated by this, we conduct a systematic analysis of the fine-tuning behaviors of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by \"refining\" their innate features, and ultimately achieve higher performance than their generative counterparts.   Based on these findings, we introduce \\textbf{FE2E}, a framework that pioneeringly adapts an advanced editing model based on Diffusion Transformer (DiT) architecture for dense geometry prediction. Specifically, to tailor the editor for this deterministic task, we reformulate the editor's original flow matching loss into the \"consistent velocity\" training objective. And we use logarithmic quantization to resolve the precision conflict between the editor's native BFloat16 format and the high precision demand of our tasks. Additionally, we repurpose the editor's discarded region for a cost-free joint estimation of depth and normals, which improves the inference efficiency.   Without scaling up the training data, FE2E achieves impressive performance improvements in zero-shot monocular depth and normal estimation across multiple datasets. Notably, it achieves over 35\\% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100$\\times$ data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37555", "url": null, "sourceid": 37813, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37962, "uid": "995e869dac6a1eb7a46c430768a04db3", "name": "Omni2Sound: A Fundamental Study on Dataset, Base Model, and Benchmark for Unified Video-Text-to-Audio Generation", "authors": [{"id": 181172, "fullname": "yusheng dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/181172?format=json", "institution": "Monash University"}, {"id": 188692, "fullname": "Zehua Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188692?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 188693, "fullname": "Yuxuan Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188693?format=json", "institution": "Tsinghua University"}, {"id": 75819, "fullname": "Qiuhong Ke", "url": "http://cvpr.thecvf.com/api/miniconf/users/75819?format=json", "institution": "Monash University"}, {"id": 126993, "fullname": "Jianfei Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/126993?format=json", "institution": "Monash University"}, {"id": 86599, "fullname": "Jun Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86599?format=json", "institution": "Tsinghua University"}], "abstract": "Training a unified model for the generation of video-to-audio (V2A), text-to-audio (T2A) and joint video-text-to-audio (VT2A) offers significant flexibility, but is hindered by critical and unexplored challenges. We identify two foundational problems: (1) the scarcity of high-quality audio captions that feature a tight A-V-T alignment, leading to severe semantic conflict in multimodal training data, and (2) cross-task and intra-task competition during joint multi-task training, manifesting as an adverse V2A-T2A performance trade-off and modality bias in the VT2A task. First, to address data scarcity, we introduce **SoundAtlas**, the first large-scale, human-expert-level audio caption dataset, augmenting VGGSound and AudioSet with semantically rich and temporally detailed captions. Powered by a novel, multi-turn agentic annotation pipeline (using advanced foundation models) that operates cost-effectively, SoundAtlas features a tight A-V-T alignment and a much lower hallucination rate than existing datasets. Second, we propose **Omni2Sound**, a diffusion-based unified VT2A model that supports flexible modality combinations. To address cross-task and intra-task competition, we design a three-stage multi-task progressive training schedule that converts cross-task competition into joint optimization and mitigates modality bias in the VT2A task, maintaining both audio-visual alignment and off-screen audio generation faithfulness. Finally, we construct **VGGSound-Omni**, a comprehensive benchmark for unified evaluation of VT2A, V2A and T2A, including challenging off-screen tracks. As a result, with a vanilla DiT backbone, Omni2Sound achieves unified state-of-the-art performance in all three tasks within a single model. It also demonstrates strong generalization across multiple benchmarks with different caption and video styles. Demonstrations are provided in the Appendix.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37962", "url": null, "sourceid": 34361, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38350, "uid": "d8ee5549d305c9d259dbbf870d5bc712", "name": "MMDIR: Multimodal Instruction-Driven Framework for Mixed-Degradation Document Image Restoration", "authors": [{"id": 180933, "fullname": "Heng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180933?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 189674, "fullname": "Xingyuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189674?format=json", "institution": "Harbin Institute of Technology"}, {"id": 189675, "fullname": "Yang Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189675?format=json", "institution": "Harbin Institute of Technology"}, {"id": 189676, "fullname": "Yunan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189676?format=json", "institution": "Harbin Institute of Technology"}, {"id": 187990, "fullname": "Xiangping Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187990?format=json", "institution": "Harbin Institute of Technology"}, {"id": 86681, "fullname": "Qingcai Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/86681?format=json", "institution": "Harbin Institute of Technology (Shenzhen)"}], "abstract": "Restoring degraded document image is essential for both improving visual quality and optimizing performance in downstream document analysis tasks. Although existing methods have demonstrated substantial improvements in restoration outcomes, they primarily address single-type degradation scenarios. Current approaches typically necessitate training multiple specialized models for specific degradation types or rely on explicit prior knowledge of degradation patterns to guide the training process. To overcome these limitations, we propose $\\textit{MMDIR}$, a multimodal instruction-driven framework designed for document image restoration under mixed and uncertain degradation conditions. By leveraging semantically structured instructions, MMDIR dynamically identifies present degradation types (blur, shadow, text watermark, and seal), while enhancing degradation-aware representation learning. Furthermore, we introduce a novel benchmark named $\\textit{MixedDoc}$ comprising complex mixed degradations, where each image contains randomized combinations of the aforementioned types. This benchmark addresses a critical gap in existing datasets, which lack realistic multi-degradation samples and often overlook common obstructions such as seals and text watermarks. The effectiveness of our approach is thoroughly validated across both released public benchmarks and our newly proposed dataset.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38350", "url": null, "sourceid": 43586, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38015, "uid": "bf62dcc9f49c25719a42bd9e2b261f40", "name": "gQIR: Generative Quanta Image Reconstruction", "authors": [{"id": 188834, "fullname": "Aryan Garg", "url": "http://cvpr.thecvf.com/api/miniconf/users/188834?format=json", "institution": "Department of Computer Science, University of Wisconsin - Madison"}, {"id": 94938, "fullname": "Sizhuo Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/94938?format=json", "institution": "Snap Inc."}, {"id": 126898, "fullname": "Mohit Gupta", "url": "http://cvpr.thecvf.com/api/miniconf/users/126898?format=json", "institution": "Department of Computer Sciences, University of Wisconsin - Madison"}], "abstract": "Capturing high-quality images from only a few detected photons is a fundamental challenge in computational imaging. Single-photon avalanche diode (SPAD) sensors promise high-quality imaging in regimes where conventional cameras fail, but raw $\\textit{quanta frames}$ contain only sparse, noisy, binary photon detections. Recovering a coherent image from a burst of such frames requires handling alignment, denoising, and demosaicing (for color) under noise statistics far outside those assumed by standard restoration pipelines or modern generative models. We present an approach that adapts large text-to-image latent diffusion models to the photon-limited domain of quanta burst imaging. Our method leverages the structural and semantic priors of internet-scale diffusion models while introducing mechanisms to handle Bernoulli photon statistics. By integrating latent-space restoration with burst-level spatio-temporal reasoning, our approach produces reconstructions that are both photometrically faithful and perceptually pleasing, even under high-speed motion. We evaluate the method on synthetic benchmarks and new real-world datasets, including the first color SPAD burst dataset and a challenging $\\textit{Deforming (XD)}$ video benchmark. Across all settings, the approach substantially improves perceptual quality over classical and modern learning-based baselines, demonstrating the promise of adapting large generative priors to extreme photon-limited sensing.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38015", "url": null, "sourceid": 35044, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39376, "uid": "e35adc826b021f91f72f183d1bb24773", "name": "Group Editing: Edit Multiple Images in One Go", "authors": [{"id": 127301, "fullname": "Yue Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/127301?format=json", "institution": "The Hong Kong University of Science and Technology, Hong Kong"}, {"id": 181329, "fullname": "Xinyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181329?format=json", "institution": "Tsinghua University"}, {"id": 151500, "fullname": "Qianli Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/151500?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 189855, "fullname": "Qinghe Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189855?format=json", "institution": "Kuaishou; Dalian University of Technology"}, {"id": 191951, "fullname": "Mingzhe Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191951?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 101961, "fullname": "xiangpeng yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/101961?format=json", "institution": "University of Techolodgy Sydney"}, {"id": 191952, "fullname": "Hao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191952?format=json", "institution": "Anhui University of Science and Technology"}, {"id": 191953, "fullname": "Chongbo Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191953?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 191954, "fullname": "Jixuan Ying", "url": "http://cvpr.thecvf.com/api/miniconf/users/191954?format=json", "institution": "Tsinghua University"}, {"id": 153060, "fullname": "Harry Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153060?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 87692, "fullname": "Hongyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87692?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 87711, "fullname": "Qifeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87711?format=json", "institution": "Hong Kong University of Science and Technology"}], "abstract": "In this paper, we tackle the problem of performing consistent and unified modifications across a set of related images. This task is particularly challenging because these images may vary significantly in pose, viewpoint, and spatial layout. Achieving coherent edits requires establishing reliable correspondences across the images, so that modifications can be applied accurately to semantically aligned regions. To address this, we propose GroupEditing, a novel framework that builds both explicit and implicit relationships among images within a group. On the explicit side, we extract geometric correspondences using VGGT, which provides spatial alignment based on visual features. On the implicit side, we reformulate the image group as a pseudo-video and leverage the temporal coherence priors learned by pre-trained video models to capture latent relationships. To effectively fuse these two types of correspondences, we inject the explicit geometric cues from VGGT into the video model through a novel fusion mechanism. To support large-scale training, we construct GroupEditData, a new dataset containing high-quality masks and detailed captions for numerous image groups. Furthermore, to ensure identity preservation during editing, we introduce an alignment-enhanced RoPE module, which improves the model\u2019s ability to maintain consistent appearance across multiple images. Finally, we present GroupEditBench, a dedicated benchmark designed to evaluate the effectiveness of group-level image editing. Extensive experiments demonstrate that GroupEditing significantly outperforms existing methods in terms of visual quality, cross-view consistency, and semantic alignment.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39376", "url": null, "sourceid": 31735, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38501, "uid": "d37a4794d6e3e3136f27b5e6ac12aca1", "name": "HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis", "authors": [{"id": 190001, "fullname": "Mingjin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190001?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 158047, "fullname": "Junhao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/158047?format=json", "institution": "Tsinghua University"}, {"id": 153722, "fullname": "Zhaoxin Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153722?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 190002, "fullname": "Yujian Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/190002?format=json", "institution": "Hong Kong Baptist University"}, {"id": 177299, "fullname": "Zichen Dang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177299?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 190003, "fullname": "Lili Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190003?format=json", "institution": null}, {"id": 190004, "fullname": "Yawen Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/190004?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 88437, "fullname": "Lap-Pui Chau", "url": "http://cvpr.thecvf.com/api/miniconf/users/88437?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 88428, "fullname": "Yi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88428?format=json", "institution": "Nanyang Technological University"}], "abstract": "Recent methods have made notable progress in the visual quality of hand-object interaction video synthesis. However, most approaches rely on 2D control signals that lack spatial expressiveness and limit the utilization of synthetic 3D conditional data. To address these limitations, we propose HVG-3D, a unified framework for 3D-aware hand-object interaction (HOI) video synthesis conditioned on explicit 3D representations.  To achieve   a diffusion-based architecture augmented with a 3D ControlNet, which encodes geometric and motion cues from 3D inputs to enable explicit 3D reasoning during video synthesis, as well as the corresponding training and inference setting. To achieve high-quality synthesis, HVG-3D is designed with two core components: (i) a 3D-aware HOI video generation diffusion architecture that encodes geometric and motion cues from 3D inputs for explicit 3D reasoning; and (ii) a hybrid pipeline for constructing input and condition signals, enabling flexible and precise control during both training and inference. During inference, given a single real image and a 3D control signal from either simulation or real data, HVG-3D generates high-fidelity, temporally consistent videos with precise spatial and temporal control. Experiments on the TASTE-Rob dataset demonstrate that HVG-3D achieves state-of-the-art spatial fidelity, temporal coherence, and controllability, while enabling effective utilization of both real and simulated data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38501", "url": null, "sourceid": 32071, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36415, "uid": "2c5343ac475f639320f822c448fdf93d", "name": "Clone Deterministic 3D Worlds", "authors": [{"id": 184983, "fullname": "Zaishuo Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/184983?format=json", "institution": "University of California, Davis"}, {"id": 184984, "fullname": "Yukuan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184984?format=json", "institution": null}, {"id": 184985, "fullname": "Xinyi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184985?format=json", "institution": "Nanjing University"}, {"id": 184986, "fullname": "Yifan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184986?format=json", "institution": "Coinbase Global, Inc."}, {"id": 129401, "fullname": "Yubei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/129401?format=json", "institution": "New York University"}], "abstract": "A world model is an internal model that simulates how the world evolves. Given past observations and actions, it predicts the future physical state of both the embodied agent and its environment. Accurate world models are essential for enabling agents to think, plan, and reason effectively in complex, dynamic settings. However, existing world models often focus on random generation of open worlds, but neglect the need for high-fidelity modeling of deterministic scenarios (such as fixed-map mazes and static space robot navigation). In this work, we take a step toward building a truly accurate world model by addressing a fundamental yet open problem: constructing a model that can fully clone a deterministic 3D world. 1) Through diagnostic experiment, we quantitatively demonstrate that high-fidelity cloning is feasible and the primary bottleneck for long-horizon fidelity is the geometric structure of the latent representation, not the dynamics model itself. 2) Building on this insight, we show that applying temporal contrastive learning principle as a geometric regularization can effectively curate a latent space that better reflects the underlying physical state manifold, demonstrating that contrastive constraints can serve as a powerful inductive bias for stable world modeling; we call this approach Geometrically-Regularized World Models (GRWM). At its core is a lightweight geometric regularization module that can be seamlessly integrated into standard autoencoders, reshaping their latent space to provide a stable foundation for effective dynamics modeling. By focusing on representation quality, GRWM offers a simple yet powerful pipeline for improving world model fidelity.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36415", "url": null, "sourceid": 42260, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38557, "uid": "9beff913467a3024cbd3d7a92308347b", "name": "Tokenizing Vector Animation for Autoregresive Generation", "authors": [{"id": 158047, "fullname": "Junhao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/158047?format=json", "institution": "Tsinghua University"}, {"id": 190139, "fullname": "Gao Kejun", "url": "http://cvpr.thecvf.com/api/miniconf/users/190139?format=json", "institution": "Tsinghua University"}, {"id": 190140, "fullname": "Yuehan Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/190140?format=json", "institution": "Tianjin University"}, {"id": 71203, "fullname": "Mingze Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/71203?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 190001, "fullname": "Mingjin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190001?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 176015, "fullname": "Shaohui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176015?format=json", "institution": "Tencent"}, {"id": 186119, "fullname": "Xiao-Xiao Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/186119?format=json", "institution": "Nanjing University"}, {"id": 187136, "fullname": "Fei Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/187136?format=json", "institution": "Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), also known as Guangming Laboratory"}, {"id": 88171, "fullname": "Qi Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/88171?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 88978, "fullname": "Hao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88978?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 87774, "fullname": "Ruqi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87774?format=json", "institution": "Tsinghua Shenzhen International Graduate School/Tsinghua Berkeley Shenzhen Institute "}], "abstract": "Despite rapid progress in video generation, existing models are incapable of producing vector animation, a dominant and highly expressive form of multimedia on the Internet. Vector animations offer resolution-independence, compactness, semantic structure, and editable parametric motion representations, yet current generative models operate exclusively in raster space and thus cannot synthesize them. Meanwhile, recent advances in large multimodal models demonstrate strong capabilities in generating structured data such as slides , 3D meshes , LEGO sequences , and indoor layouts , suggesting that native vector animation generation may be achievable. In this work, we present the first framework for tokenizing and autoregressively generating vector animations. We adopt Lottie, a widely deployed JSON-based animation standard, and design a tailored Lottie Tokenizer that encodes layered geometric primitives, transforms, and keyframe-based motion into a compact and semantically aligned token sequence. To support large-scale training, we also construct \\textbf{LottieAnimation-660K}, the largest and most diverse vector animation dataset to date, consisting of 660k real-world Lottie animation and 15M static Lottie image files curated from broad Internet sources. Building upon these components, we finetune Qwen-VL to create \\textbf{LottieGPT}, a native multimodal model capable of generating coherent, editable vector animations directly from natural language or visual prompts. Experiments show that our tokenizer dramatically reduces sequence length while preserving structural fidelity, enabling effective autoregressive learning of dynamic vector content. LottieGPT exhibits strong generalization across diverse animation styles and outperforms previous state-of-the-art models on SVG generation (a special case of single-frame vector animation).", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38557", "url": null, "sourceid": 35631, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37204, "uid": "41b9afd19dbb79a24459202a0a696e9c", "name": "Soul: Breathe Life into Digital Human for High-fidelity Long-term Multimodal Animation", "authors": [{"id": 88656, "fullname": "Jiangning Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88656?format=json", "institution": "Tencent Youtu Lab"}, {"id": 160361, "fullname": "junwei zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/160361?format=json", "institution": "Tencent Youtu Lab"}, {"id": 89708, "fullname": "Zhenye Gan", "url": "http://cvpr.thecvf.com/api/miniconf/users/89708?format=json", "institution": "Tencent Youtu Lab"}, {"id": 131219, "fullname": "Donghao Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/131219?format=json", "institution": "Tencent YouTu Lab"}, {"id": 152688, "fullname": "Chuming Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/152688?format=json", "institution": "Tencent AI Lab"}, {"id": 186914, "fullname": "FeiFan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186914?format=json", "institution": null}, {"id": 156406, "fullname": "Xu Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/156406?format=json", "institution": "Tencent YouTu Lab"}, {"id": 158340, "fullname": "Jianlong Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/158340?format=json", "institution": "Tencent Youtu Lab"}, {"id": 175257, "fullname": "Yuansen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175257?format=json", "institution": "National University of Singapore"}, {"id": 186915, "fullname": "Yijia Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186915?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 89324, "fullname": "Weijian Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/89324?format=json", "institution": "Tencent Youtu Lab"}, {"id": 131512, "fullname": "Han Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/131512?format=json", "institution": "Wuhan University"}, {"id": 157272, "fullname": "Xu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/157272?format=json", "institution": "Tencent YouTu Lab"}, {"id": 175917, "fullname": "Chencan Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175917?format=json", "institution": "Tencent Youtu Lab"}, {"id": 186916, "fullname": "Keke He", "url": "http://cvpr.thecvf.com/api/miniconf/users/186916?format=json", "institution": "Tencent YouTu Lab"}, {"id": 152689, "fullname": "Xiaobin Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152689?format=json", "institution": "Tencent AI Lab"}, {"id": 86912, "fullname": "Chengjie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86912?format=json", "institution": "Tencent Youtu Lab; Shanghai Jiao Tong University"}], "abstract": "We propose a multimodal-driven framework for high-fidelity long-term digital human animation termed Soul, which generates semantically coherent videos from a single-frame portrait image, text prompts, and audio, achieving precise lip synchronization, vivid facial expressions, and robust identity preservation. We construct Soul-1M, containing 1 million finely annotated samples with a precise automated annotation pipeline (covering portrait, upper-body, full-body, and multi-person scenes) to mitigate data scarcity, and we carefully curate Soul-Bench for comprehensive and fair evaluation of audio-/text-guided animation methods. The model is built on the Wan2.2-5B backbone, integrating audio-injection layers and multiple training strategies together with threshold-aware codebook replacement to ensure long-term generation consistency. Meanwhile, step/CFG distillation and a lightweight VAE are used to optimize inference efficiency, achieving an 11.4$\\times$ speedup with negligible quality loss. Extensive experiments show that Soul significantly outperforms current leading open-source and commercial models on video quality, video\u2013text alignment, identity preservation, and lip-synchronization accuracy, demonstrating broad applicability in real-world scenarios such as virtual anchors and film production.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37204", "url": null, "sourceid": 31375, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37994, "uid": "b5f15297073c1d370246af6617cef13a", "name": "Building a Precise Video Language with Human\u2013AI Oversight", "authors": [{"id": 89055, "fullname": "Zhiqiu Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/89055?format=json", "institution": "Carnegie Mellon University"}, {"id": 188776, "fullname": "Siyuan Cen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188776?format=json", "institution": "University of Massachusetts at Amherst"}, {"id": 133187, "fullname": "Chancharik Mitra", "url": "http://cvpr.thecvf.com/api/miniconf/users/133187?format=json", "institution": "University of California, Berkeley"}, {"id": 188777, "fullname": "Isaac Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188777?format=json", "institution": "Carnegie Mellon University"}, {"id": 188778, "fullname": "Yuhan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188778?format=json", "institution": ""}, {"id": 188779, "fullname": "Yu Ling", "url": "http://cvpr.thecvf.com/api/miniconf/users/188779?format=json", "institution": "CMU, Carnegie Mellon University"}, {"id": 188780, "fullname": "Hewei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188780?format=json", "institution": null}, {"id": 188781, "fullname": "Irene Pi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188781?format=json", "institution": "CMU, Carnegie Mellon University"}, {"id": 188782, "fullname": "Shihang Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188782?format=json", "institution": "Nanjing University"}, {"id": 188783, "fullname": "Yili Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/188783?format=json", "institution": "Carnegie Mellon University"}, {"id": 85755, "fullname": "Yilun Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/85755?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 75455, "fullname": "Deva Ramanan", "url": "http://cvpr.thecvf.com/api/miniconf/users/75455?format=json", "institution": "Carnegie Mellon University"}], "abstract": "Video\u2013language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, supported by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce a critique-based human\u2013AI (CHAI) oversight framework, where trained human experts provide correctional critiques to revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre- and post-captions provide rich supervision for fine-tuning, improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through standard SFT, offline RL (DPO), online RL (GSPO), and inference-time scaling. With modest expert supervision, the resulting system outperforms even closed-source models such as Gemini-2.5-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of over 400 words, achieving finer control over camera motion, angle, lens, perspectives, and shot composition. Overall, our results show that precise specification and human\u2013AI oversight are key to achieving professional-level video understanding and generation.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37994", "url": null, "sourceid": 43064, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39338, "uid": "5c1697b2090acdf06d589303b060b78a", "name": "GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction", "authors": [{"id": 129806, "fullname": "Chao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129806?format=json", "institution": "NNCosmos"}, {"id": 191878, "fullname": "Xiaochen Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191878?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 104146, "fullname": "xiang deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/104146?format=json", "institution": "tsinghua university"}, {"id": 76470, "fullname": "Jingxiang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/76470?format=json", "institution": "Tsinghua University"}, {"id": 153284, "fullname": "Donglin Di", "url": "http://cvpr.thecvf.com/api/miniconf/users/153284?format=json", "institution": "Harbin Institute of Technology"}, {"id": 86439, "fullname": "Zhuo Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/86439?format=json", "institution": "ByteDance"}, {"id": 75944, "fullname": "Yebin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75944?format=json", "institution": "Tsinghua University"}], "abstract": "Reconstructing photorealistic and animatable 4D head avatars from a single portrait image remains a fundamental challenge in computer vision. While diffusion models have enabled remarkable progress in image and video generation for avatar reconstruction, existing methods primarily rely on 2D priors and struggle to achieve consistent 3D geometry. We propose a novel framework that leverages geometry-aware diffusion to distill strong geometry priors for high-fidelity head avatar reconstruction. Our approach jointly synthesizes portrait images and corresponding surface normals, while a pose-free expression encoder captures implicit expression representations. Both synthesized images and expression latents are distilled into 3D Gaussian-based avatars, enabling photorealistic rendering with accurate geometry. Extensive experiments demonstrate that our method substantially outperforms state-of-the-art approaches in visual quality, expression fidelity, and cross-identity generalization, while supporting real-time rendering.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39338", "url": null, "sourceid": 39213, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36475, "uid": "f88f8f0d9f5d571b38a7914b74f63a9e", "name": "DeltaQuant: 4-bit Video Diffusion Models with Spatiotemporal Delta Smoothing", "authors": [{"id": 185134, "fullname": "Xingyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185134?format=json", "institution": "Massachusetts Institute of Technology; Shanghai Jiaotong University"}, {"id": 185135, "fullname": "Samuel Tesfai", "url": "http://cvpr.thecvf.com/api/miniconf/users/185135?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 185136, "fullname": "Zhekai Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185136?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 154459, "fullname": "Haocheng Xi", "url": "http://cvpr.thecvf.com/api/miniconf/users/154459?format=json", "institution": "University of California, Berkeley"}, {"id": 185137, "fullname": "Shuo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185137?format=json", "institution": "University of California, Berkeley"}, {"id": 185138, "fullname": "Lvmin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185138?format=json", "institution": "Stanford University"}, {"id": 185139, "fullname": "Yufei Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/185139?format=json", "institution": "Tsinghua University"}, {"id": 185140, "fullname": "Kelly Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185140?format=json", "institution": "Stanford University"}, {"id": 73046, "fullname": "Maneesh Agrawala", "url": "http://cvpr.thecvf.com/api/miniconf/users/73046?format=json", "institution": "Stanford University"}, {"id": 156293, "fullname": "Ion Stoica", "url": "http://cvpr.thecvf.com/api/miniconf/users/156293?format=json", "institution": "University of California, Berkeley"}, {"id": 75618, "fullname": "Kurt Keutzer", "url": "http://cvpr.thecvf.com/api/miniconf/users/75618?format=json", "institution": "EECS, UC Berkeley"}, {"id": 75570, "fullname": "Jun-Yan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75570?format=json", "institution": "Carnegie Mellon University"}, {"id": 85763, "fullname": "Song Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/85763?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 185141, "fullname": "Yujun Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185141?format=json", "institution": "NVIDIA"}, {"id": 185142, "fullname": "Muyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185142?format=json", "institution": "Massachusetts Institute of Technology"}], "abstract": "Video diffusion models have achieved remarkable generative performance, but their substantial computational and memory costs pose significant challenges for deployment, especially on consumer GPUs. As recent advances in attention optimization mitigate previous computational bottlenecks, linear layers now dominate both computational cost and inference memory. In this work, we focus on quantizing both weights and activations to 4 bits to accelerate these layers. Previous methods, such as SVDQuant, overlook the highly dynamic nature of activations across denoising timesteps, where outlier channels and magnitudes vary dramatically. However, video data inherently exhibits strong activation similarity among neighboring tokens in space and time, which we term \\textbf{spatiotemporal activation similarity}, analogous to how video codecs exploit intra- and inter-frame redundancy. Leveraging this property, we introduce \\textbf{DeltaQuant}, which partitions activations into local 3D spatiotemporal cubes and uses each cube's mean token as a \\coretoken, quantizing only the small differences (delta tokens) to 4 bits while keeping core tokens in FP8. This decomposition substantially reduces quantization error with minimal overhead.For weight quantization, DeltaQuant incorporates SVDQuant's low-rank decomposition to further reduce quantization error.We also implement an efficient kernel that translates DeltaQuant's computational benefits into real-world speedups.Extensive experiments on Wan 2.2 I2V, Wan 2.2 T2V, and LTX-Video T2V demonstrate that DeltaQuant maintains high generation fidelity.On Wan 2.2, it compresses model size by 2.9\u00d7 and reduces memory footprint by 2.3\u00d7. DeltaQuant is compatible with efficient attention mechanisms and few-step distillation. When integrated with these techniques, it achieves an additional 3.0\u00d7 acceleration, for a total 111.8\u00d7 end-to-end speedup. Code and models will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36475", "url": null, "sourceid": 31005, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38084, "uid": "49c6013040690fa5a2a112d8deed3ed2", "name": "COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation", "authors": [{"id": 189027, "fullname": "Yuchen Che", "url": "http://cvpr.thecvf.com/api/miniconf/users/189027?format=json", "institution": "Institute of Science Tokyo"}, {"id": 180474, "fullname": "JINGTU WU", "url": "http://cvpr.thecvf.com/api/miniconf/users/180474?format=json", "institution": "Institute of Science Tokyo"}, {"id": 189028, "fullname": "Hao ZHENG", "url": "http://cvpr.thecvf.com/api/miniconf/users/189028?format=json", "institution": "RIKEN"}, {"id": 189029, "fullname": "Asako Kanezaki", "url": "http://cvpr.thecvf.com/api/miniconf/users/189029?format=json", "institution": "Tohoku University"}], "abstract": "Estimating the 6DoF pose of a novel object with a single reference view is challenging due to occlusions, viewpoint changes, and outliers.A core difficulty lies in finding robust cross-view correspondences, as existing methods often rely on discrete one-to-one matching that is non-differentiable and tends to collapse onto sparse keypoints.We propose Confidence-aware Optimal Geometric Correspondence (COG), an unsupervised framework that formulates correspondence estimation as a confidence-aware optimal transport problem.COG produces balanced soft correspondences by predicting point-wise confidences and injecting them as target marginals, naturally suppressing non-overlapping regions.Semantic priors from vision foundation model features further regularize the correspondences, leading to stable pose estimation.This design integrates confidence into the end-to-end correspondence finding and pose estimation pipeline, enabling fully unsupervised learning.Experiments show unsupervised COG achieves comparable performance to supervised methods, while the supervised variant outperforms them.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38084", "url": null, "sourceid": 45730, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38255, "uid": "08bc6a3cf0983489f86e2c1c24719a22", "name": "RetFormer: Multimodal Retrieval for Enhancing Image Recognition", "authors": [{"id": 189435, "fullname": "Tianrui Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189435?format=json", "institution": "Zhejiang University"}, {"id": 189436, "fullname": "Xiubo Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189436?format=json", "institution": "Zhejiang University"}, {"id": 189437, "fullname": "Hongzhi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189437?format=json", "institution": null}], "abstract": "The expansion of Transformers and the collection of high-quality multimodal datasets have propelled deep neural networks to achieve unprecedented performance in vision and language tasks. However, applying these advances is non-trivial in real-world applications. The extensive number of parameters complicates model updates, and real-world data often features a long-tailed distribution along with noisy labels. To address the above issues, we propose to explore the internal structure of the neural network for learning with sample relationships, rather than just increasing the number of model parameters. Specifically, we introduce RetFormer, a model enhanced with a multimodal knowledge base for storing world knowledge, and a retrieval cross-fusion module designed to establish robust multimodal sample relationships by leveraging content from the knowledge base. RetFormer establishes a robust relationship between image and text modalities by integrating information from external knowledge bases into the model's decision-making process, thus overcoming the limitations of traditional approaches on model size and datasets. Our experiments demonstrate the benefits of integrating large-scale image-text datasets into vision tasks and exemplify the importance of modeling the relationship between image and text modalities. We have evaluated our approach on the task of long-tailed recognition and learning with noisy labels and have shown that it achieves state-of-the-art accuracies.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38255", "url": null, "sourceid": 36101, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36822, "uid": "5134cb3d435ee631ad39aaa79f8874d0", "name": "Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves", "authors": [{"id": 185955, "fullname": "Xinyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185955?format=json", "institution": "Rutgers University"}, {"id": 185956, "fullname": "Ziyi Kou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185956?format=json", "institution": "Facebook"}, {"id": 140234, "fullname": "Chuan Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/140234?format=json", "institution": "Facebook"}, {"id": 185957, "fullname": "Mia Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185957?format=json", "institution": "Meta"}, {"id": 185958, "fullname": "Ergys Ristani", "url": "http://cvpr.thecvf.com/api/miniconf/users/185958?format=json", "institution": "Meta"}, {"id": 185959, "fullname": "Ankit Kumar", "url": "http://cvpr.thecvf.com/api/miniconf/users/185959?format=json", "institution": "Facebook; Westcliff University"}, {"id": 133258, "fullname": "Lele Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/133258?format=json", "institution": "Sony America"}, {"id": 185960, "fullname": "Kun He", "url": "http://cvpr.thecvf.com/api/miniconf/users/185960?format=json", "institution": "Meta"}, {"id": 185961, "fullname": "Abdeslam Boularias", "url": "http://cvpr.thecvf.com/api/miniconf/users/185961?format=json", "institution": ", Rutgers University"}, {"id": 75956, "fullname": "Li Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/75956?format=json", "institution": "Meta Reality Labs Research"}], "abstract": "Understanding hand-object interaction (HOI) is fundamental to computer vision, robotics, and AR/VR. However, conventional hand videos often lack essential physical information, such as contact forces and motion dynamics, and are prone to frequent occlusions. To address these challenges, we present Glove2Hand, a framework that translates multi-modal sensing glove data in HOI videos into photorealistic bare-hand representations, while faithfully preserving the underlying physical interaction dynamics. We introduce a novel 3D Gaussian hand model that ensures both temporal and multi-view rendering consistency. The rendered hand is seamlessly integrated into the scene using a diffusion-based hand restorer, which effectively handles complex hand-object interactions and non-rigid deformations. Leveraging Glove2Hand, we introduce HandSense, the first multi-modal HOI dataset featuring multi-view bare-hand videos with synchronized tactile and IMU signals. We demonstrate that HandSense significantly enhances downstream bare-hand applications, including video-based contact estimation and hand tracking under severe occlusion.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36822", "url": null, "sourceid": 35126, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37776, "uid": "2d5c023a11d70ab3e9acdb98d0053fcc", "name": "Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems", "authors": [{"id": 188242, "fullname": "Tolga Dimlioglu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188242?format=json", "institution": "New York University"}, {"id": 150901, "fullname": "Nadine Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150901?format=json", "institution": "NVIDIA"}, {"id": 73955, "fullname": "Maying Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/73955?format=json", "institution": "NVIDIA"}, {"id": 150977, "fullname": "Rafid Mahmood", "url": "http://cvpr.thecvf.com/api/miniconf/users/150977?format=json", "institution": "University of Ottawa / NVIDIA"}, {"id": 73959, "fullname": "Jose M. Alvarez", "url": "http://cvpr.thecvf.com/api/miniconf/users/73959?format=json", "institution": "NVIDIA"}], "abstract": "Large-scale deep learning models for physical AI applications depend on diverse training data collection efforts. These models and correspondingly, the training data, must address the different evaluation criteria necessary for the models to be deployable in real-world environments. Data selection policies can guide the development of the training set, but current frameworks do not account for the ambiguity in how data points affect different metrics. In this work, we propose Mixture Optimization via Scaling-Aware Iterative Collection (MOSAIC), a general data selection framework that operates by: (i) partitioning the dataset into domains; (ii) fitting neural scaling laws from each data domain to the evaluation metrics; and (iii) optimizing a data mixture by iteratively adding data from domains that maximize the change in metrics. We apply MOSAIC to autonomous driving (AD), where an End-to-End (E2E) planner model is evaluated on the Extended Predictive Driver Model Score (EPDMS), an aggregate of driving rule compliance metrics. Here, MOSAIC outperforms a diverse set of baselines on EPDMS with up to 80\\% less data.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37776", "url": null, "sourceid": 43444, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40342, "uid": "a83840c17c2b2a522f05290db1efbb18", "name": "NOWA: Null-space Optical Watermark for Invisible Capture Fingerprinting and Tamper Localization", "authors": [{"id": 190367, "fullname": "Edwin Vargas", "url": "http://cvpr.thecvf.com/api/miniconf/users/190367?format=json", "institution": "Rice University"}, {"id": 130345, "fullname": "Jhon Lopez", "url": "http://cvpr.thecvf.com/api/miniconf/users/130345?format=json", "institution": "Universidad Industrial de Santander"}, {"id": 95406, "fullname": "Henry Arguello", "url": "http://cvpr.thecvf.com/api/miniconf/users/95406?format=json", "institution": "Universidad Industrial de Santander"}, {"id": 85552, "fullname": "Ashok Veeraraghavan", "url": "http://cvpr.thecvf.com/api/miniconf/users/85552?format=json", "institution": "William Marsh Rice University"}], "abstract": "Ensuring the authenticity and ownership of digital images is increasingly challenging as modern editing tools enable highly realistic forgeries. Existing image protection systems mainly rely on digital watermarking, which is susceptible to sophisticated digital attacks. To address this limitation, we propose a hybrid optical-digital framework that incorporates physical authentication cues during image formation and preserves them through a learned reconstruction process. At the optical level, a phase mask in the camera aperture produces a Null-space Optical Watermark (NOWA) that lies in the Null Space of the imaging operator and therefore remains invisible in the captured image. Then, a Null-Space Network (NSN) performs measurement-consistent reconstruction that delivers high-quality protected images while preserving the NOWA signature.The proposed design enables tamper localization by projecting the image onto the camera's null space and detecting pixel-level inconsistencies. Our design preserves perceptual quality, resists common degradations such as compression, and establishes a structural security asymmetry: without access to the optical or NSN parameters, adversaries cannot forge the NOWA signature. Experiments with simulations and a prototype camera demonstrate competitive performance in terms of image quality preservation and tamper localization accuracy compared to state-of-the-art digital watermarking and learning-based authentication methods.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40342", "url": null, "sourceid": -40551, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38638?format=json"], "related_events_ids": [38638]}, {"id": 38089, "uid": "a8a2ab1b683dd9cfe2de7ff7522cdf7a", "name": "DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing", "authors": [{"id": 178380, "fullname": "Gyanendra Das", "url": "http://cvpr.thecvf.com/api/miniconf/users/178380?format=json", "institution": "Sync."}, {"id": 189036, "fullname": "Sai Jena", "url": "http://cvpr.thecvf.com/api/miniconf/users/189036?format=json", "institution": "Zynix AI"}], "abstract": "Model editing aims to update knowledge to add new concepts and change relevant information without retraining. Lifelong editing is a challenging task, prone to disrupting previously learned concepts, especially for Vision Language Models (VLMs), because sequential edits can lead to degraded reasoning and cross-modal misalignment. Existing VLM knowledge editing methods based on gated adapters, activation edits, and parameter merging techniques address catastrophic forgetting seen in full fine-tuning; however, they still operate in the shared representation space of the VLM, where concepts are entangled, so edits interfere with other non-relevant concepts. We hypothesize that this instability persists because current methods algorithmically control edits via optimization rather than structurally separating knowledge. We introduce Dynamic Subspace Concept Alignment (DSCA) which by design mitigates this limitation by decomposing the representation space into a set of orthogonal semantic subspaces and proposing edits only in those transformed spaces. These subspaces are obtained through incremental clustering and PCA on joint vision-language representations. This process structurally isolates concepts, enabling precise, non-interfering edits by turning isolation from a soft training objective into an architectural property. The surgical edits are guided by a multi-term loss function for maintaining task fidelity, edit locality, and cross-modal alignment. With the base model frozen, our method achieves 98% single-edit success, remains over 95% after 1,000 sequential edits, lowers hallucination by 3-5%, and achieves the best backward transfer (BWT) scores on continual instruction-tuning benchmarks. Extensive experiments demonstrate DSCA\u2019s state-of-the-art stability and knowledge retention capability in continual lifelong editing across various datasets and benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38089", "url": null, "sourceid": 39899, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37854, "uid": "35ae540ac24d774598bdf2bcfb3e5421", "name": "Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models", "authors": [{"id": 188411, "fullname": "Tao Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188411?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 188412, "fullname": "Huili Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188412?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 188413, "fullname": "Yuanhong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188413?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 188414, "fullname": "Wendan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188414?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 188415, "fullname": "Lianchao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188415?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 188416, "fullname": "Jinrui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188416?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 188417, "fullname": "Zichen Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188417?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 91196, "fullname": "Shangguang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91196?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 188418, "fullname": "Yongfeng Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188418?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "The rapid advancement of diffusion-based image generation models has raised serious concerns regarding potential copyright and privacy infringements involving human-created data.Membership inference attacks (MIAs) have emerged as a promising tool for identifying unauthorized data usage during model training.Existing methods typically assess the ability of model to denoise perturbed suspect images as an indicator of membership status.However, the discriminative power of such features is highly dependent on the degree of model memorization and deteriorates significantly when applied to less exposed data (e.g., pre-training data).Although several methods attempt to enhance detection by leveraging internal model features, these features are generally inaccessible in mainstream closed-source image generation platforms, limiting their practicality.In this paper, we demonstrate that analyzing how a black-box diffusion model denoises a target image and corresponding perturbed textual instructions can reveal more distinctive membership cues. Based on this insight, we propose a black-box membership inference attack framework (named SD-MIA) that leverages a cross-modal data perturbation mechanism to detect pre-training data in diffusion models.We conduct extensive experiments on both a public benchmark dataset and a newly constructed dataset, each comprising pre-training membership and non-membership samples with identical distributions. Experimental results demonstrate that SD-MIA achieves superior performance compared to existing baselines, including those with the unfair advantage of accessing internal model features.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37854", "url": null, "sourceid": 40135, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40315?format=json"], "related_events_ids": [40315]}, {"id": 38914, "uid": "063eb8aa17714d0c60ef8b2d1e03cdf7", "name": "DROID-SLAM in the Wild", "authors": [{"id": 181928, "fullname": "Moyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181928?format=json", "institution": "ETH Zurich"}, {"id": 126325, "fullname": "Zihan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126325?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 73915, "fullname": "Marc Pollefeys", "url": "http://cvpr.thecvf.com/api/miniconf/users/73915?format=json", "institution": "ETH Zurich / Microsoft"}, {"id": 74200, "fullname": "Daniel Barath", "url": "http://cvpr.thecvf.com/api/miniconf/users/74200?format=json", "institution": "ETHZ - ETH Zurich"}], "abstract": "We present a robust, real-time RGB SLAM system that handles dynamic environments by leveraging differentiable Uncertainty-aware Bundle Adjustment. Traditional SLAM methods typically assume static scenes, leading to tracking failures in the presence of motion. Recent dynamic SLAM approaches attempt to address this challenge using predefined dynamic priors or uncertainty-aware mapping, but they remain limited when confronted with unknown dynamic objects or highly cluttered scenes where geometric mapping becomes unreliable. In contrast, our method estimates per-pixel uncertainty by exploiting multi-view visual feature inconsistency, enabling robust tracking and reconstruction even in real-world environments. The proposed system achieves state-of-the-art camera poses and scene geometry in cluttered dynamic scenarios while running in real time at around 8 FPS. The source code will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38914", "url": "https://moyangli00.github.io/droid-w/", "sourceid": 39541, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37694, "uid": "eadccc9ad3d1c5ce3861fabfbf759493", "name": "Bridging Facial Understanding and Animation via Language Models", "authors": [{"id": 96787, "fullname": "Luchuan Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/96787?format=json", "institution": "University of Rochester"}, {"id": 188032, "fullname": "Pinxin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188032?format=json", "institution": "Independent Researcher"}, {"id": 155112, "fullname": "Haiyang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155112?format=json", "institution": "the university of tokyo"}, {"id": 187244, "fullname": "Zhenchao Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/187244?format=json", "institution": "University of Hong Kong"}, {"id": 152930, "fullname": "Yunlong Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152930?format=json", "institution": "University of Rochester"}, {"id": 176340, "fullname": "Zichong Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/176340?format=json", "institution": "University of Rochester"}, {"id": 127813, "fullname": "Susan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127813?format=json", "institution": "University of Rochester"}, {"id": 149423, "fullname": "Jing Bi", "url": "http://cvpr.thecvf.com/api/miniconf/users/149423?format=json", "institution": "University of Rochester"}, {"id": 95802, "fullname": "Jason Corso", "url": "http://cvpr.thecvf.com/api/miniconf/users/95802?format=json", "institution": "Voxel51; University of Michigan"}, {"id": 90334, "fullname": "Chenliang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90334?format=json", "institution": "University of Rochester"}], "abstract": "Text-guided human body animation has advanced rapidly, yet facial animation lags due to the scarcity of well-annotated, text-paired facial corpora. To close this gap, we leverage foundation generative models to synthesize a large, balanced corpus of facial behavior. We design prompts suite covering emotions and head motions, generate about 80 hours of facial videos with multiple generators, and fit per-frame 3D facial parameters, yielding large-scale (prompt and parameter) pairs for training. Building on this dataset, we probe language models for bidirectional competence over facial motion via two complementary tasks: (1) Motion2Language: given a sequence of 3D facial parameters, the model produces natural-language descriptions capturing content, style, and dynamics; and (2) Language2Motion: given a prompt, the model synthesizes the corresponding sequence of 3D facial parameters via quantized motion tokens for downstream animation. Extensive experiments show that in this setting language models can both interpret and synthesize facial motion with strong generalization. To best of our knowledge, this is the first work to cast facial-parameter modeling as a language problem, establishing a unified path for text-conditioned facial animation and motion understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37694", "url": null, "sourceid": 34097, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38946, "uid": "3d4375c2cc0fd3842600003f183bd34c", "name": "A Closer Look at Cross-Domain Few-Shot Object Detection: Fine-Tuning Matters and Parallel Decoder Helps", "authors": [{"id": 145915, "fullname": "Xuanlong Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/145915?format=json", "institution": "Intellindust"}, {"id": 191027, "fullname": "Youyang Sha", "url": "http://cvpr.thecvf.com/api/miniconf/users/191027?format=json", "institution": "Intellindust"}, {"id": 191028, "fullname": "Longfei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191028?format=json", "institution": "Intellindust"}, {"id": 87658, "fullname": "Xi Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87658?format=json", "institution": "Tencent AI Lab"}, {"id": 189382, "fullname": "Di Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189382?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Few-shot object detection (FSOD) is challenging due to unstable optimization and limited generalization arising from the scarcity of training samples. To address these issues, we propose a hybrid ensemble decoder that enhances generalization during fine-tuning. Inspired by ensemble learning, the decoder comprises a shared hierarchical layer followed by multiple parallel decoder branches, where each branch employs denoising queries either inherited from the shared layer or newly initialized to encourage prediction diversity. This design fully exploits pretrained weights without introducing additional parameters, and the resulting diverse predictions can be effectively ensembled to improve generalization. We further leverage a unified progressive fine-tuning framework with a plateau-aware learning rate schedule, which stabilizes optimization and achieves strong few-shot adaptation without complex data augmentations or extensive hyperparameter tuning. Extensive experiments on CD-FSOD, ODinW-13, and RF100-VL validate the effectiveness of our approach. Notably, on RF100-VL, which includes 100 datasets across diverse domains, our method achieves an average performance of 41.9 in the 10-shot setting, significantly outperforming the recent approach SAM3, which obtains 35.7. We further construct a mixed-domain test set from CD-FSOD to evaluate robustness to out-of-distribution (OOD) samples, showing that our proposed modules lead to clear improvement gains. These results highlight the effectiveness, generalization, and robustness of the proposed method. Our code will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38946", "url": null, "sourceid": 32834, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37489, "uid": "9b316a2cacfb19a4e8dcf917278bd816", "name": "Predict Before You Explore: Predictive Planning with Specialized Memory for Embodied Question Answering", "authors": [{"id": 187573, "fullname": "Bowen Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187573?format=json", "institution": "Nanjing University of Posts and Telecommunications"}, {"id": 76798, "fullname": "Sisi You", "url": "http://cvpr.thecvf.com/api/miniconf/users/76798?format=json", "institution": "Hefei University of Technology"}, {"id": 76806, "fullname": "Bing-Kun Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/76806?format=json", "institution": "Hefei University of Technology"}], "abstract": "Embodied Question Answering (EQA) requires agents to navigate 3D environments, accumulate visual evidence, and reason over partial observations to answer questions. However, current agents struggle to maintain coherent, long-horizon behavior: planning remains reactive, causing inconsistent actions, while monolithic memories entangle all observations, hindering retrieval of the sparse but crucial evidence. We address these issues by reframing EQA through the lens of predictive processing, in which coherent behavior emerges from a prediction\u2013correction loop grounded in stable priors. Guided by this perspective, we propose Predict Before You Explore (Pred-EQA), an architecture that integrates predictive planning with specialized memory. A high-level planner predicts where question-relevant evidence is likely to appear and generates a compact set of actionable exploration branches encoding long-horizon intent. A low-level executor then reduces uncertainty within these branches, revising predictions when they fail. A dual-memory system complements this process by separating slowly evolving structural priors from compact, question-relevant visual evidence, enabling consistent planning and efficient evidence accumulation. Through this prediction-guided exploration, Pred-EQA achieves coherent trajectories under partial observability. Experiments on OpenEQA and Express-Bench show that Pred-EQA achieves state-of-the-art results in both accuracy and exploration efficiency, demonstrating the benefits of prediction-driven embodied reasoning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37489", "url": null, "sourceid": 36715, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39931, "uid": "aa5dfa238719e3c67d4386b0eb1063c8", "name": "More Natural, More Real: Object-aware Gaussian Splatting for 3D Visual Decoding from Human Brain", "authors": [{"id": 153052, "fullname": "Haodong Jing", "url": "http://cvpr.thecvf.com/api/miniconf/users/153052?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 153051, "fullname": "Dongyao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153051?format=json", "institution": "Xi\u2018an Jiaotong University"}, {"id": 193138, "fullname": "Jixin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193138?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 193139, "fullname": "Junhao Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/193139?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 193140, "fullname": "Yanshu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193140?format=json", "institution": "Brown University"}, {"id": 153053, "fullname": "Yongqiang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/153053?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 87581, "fullname": "Nanning Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87581?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "Exploring human visual perception and understanding of the stereoscopic world represents a significant topic in computational neuroscience. Recent studies have provided rich Brain-3D datasets, conducted preliminary explorations into 3D visual reconstruction. However, existing research struggles to capture the differences in dynamic changes of 3D stimulus views, and there remains room for improvement in high-fidelity reconstruction and rendering. 3D Gaussian Splatting (3DGS) has recently achieved significant progress in stereoscopic view synthesis. Inspired by it, we propose BrainGS -- an innovative framework for decoding more realistic 3D objects from the brain. BrainGS incorporates a Fusion Time-Spatial Network to achieve comprehensive encoding of the brain, combined with the Multi-Attribute Controller (MAC), it decouples features using visual, semantic, and color as anchors, effectively learning the feature distribution of Brain-3D and providing initial control for 3DGS. The Multi-View Stabilizer (MVS) overcomes the challenge of capturing multi-view changes of 3D objects, creating more robust viewpoint representations. Comprehensive experiments and discussions on fMRI/EEG show the SOTA performance (2.936 FPD, 0.202 LPIPS) of BrainGS, providing reliable neural interpretations, offering new insights into brain stereovision understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39931", "url": null, "sourceid": 44414, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40017, "uid": "230b28a6218d432fec819aaefbdbed34", "name": "DASH: A Meta-Attack Framework for Synthesizing Effective and Stealthy Adversarial Examples", "authors": [{"id": 175749, "fullname": "Abdullah Al Nomaan Nafi", "url": "http://cvpr.thecvf.com/api/miniconf/users/175749?format=json", "institution": "University of Maine"}, {"id": 178531, "fullname": "Habibur Rahaman", "url": "http://cvpr.thecvf.com/api/miniconf/users/178531?format=json", "institution": "University of Florida"}, {"id": 168082, "fullname": "Zafaryab Haider", "url": "http://cvpr.thecvf.com/api/miniconf/users/168082?format=json", "institution": "University of Maine"}, {"id": 193311, "fullname": "Tanzim Mahfuz", "url": "http://cvpr.thecvf.com/api/miniconf/users/193311?format=json", "institution": "University of Maine"}, {"id": 193312, "fullname": "Fnu Suya", "url": "http://cvpr.thecvf.com/api/miniconf/users/193312?format=json", "institution": "University of Tennessee, Knoxville"}, {"id": 193313, "fullname": "Swarup Bhunia", "url": "http://cvpr.thecvf.com/api/miniconf/users/193313?format=json", "institution": "University of Florida"}, {"id": 183826, "fullname": "Prabuddha Chakraborty", "url": "http://cvpr.thecvf.com/api/miniconf/users/183826?format=json", "institution": "University of Maine"}], "abstract": "Numerous techniques have been proposed for generating adversarial examples under strict $\\ \\ell_p \\$-norm constraints. However, such norm-bounded examples often fail to align well with human perception, and only recently have a few methods begun specifically exploring perceptually aligned adversarial examples. Moreover, it remains unclear whether insights from $\\ \\ell_p \\$-constrained attacks can be effectively leveraged to improve perceptual efficacy. In this paper, we introduce DASH, a differentiable meta-attack framework that generates effective and perceptually aligned adversarial examples by strategically composing existing $\\ \\ell_p \\$-based attack methods. DASH operates in a multi-stage fashion: at each stage, it aggregates candidate adversarial examples from multiple base attacks using learned, adaptive weights and propagates the result to the next stage. A novel meta-loss function guides this process by jointly minimizing misclassification loss and perceptual distortion, enabling the framework to dynamically modulate the contribution of each base attack throughout the stages. We evaluate DASH on adversarially trained robust models across CIFAR-10, CIFAR-100, and ImageNet while considering visual perception metrics (e.g. SSIM, FID, LPIPS) in the perturbation budget (instead of $\\ \\ell_p \\$-norm). Despite relying solely on $\\ \\ell_p \\$-constrained based methods, DASH significantly outperforms state-of-the-art perceptual attacks such as AdvAD, achieving higher attack success rates (e.g., 20.63\\% improvement) and superior visual quality, as measured by SSIM, LPIPS, and FID (improvements of $\\approx11$, 0.015, and 5.7, respectively). DASH generalizes well to unseen defenses and different white-box/black-box scenarios, making it a practical and strong baseline for evaluating robustness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40017", "url": null, "sourceid": 34102, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36310, "uid": "8935ea08a91b9a60357e72c3a415b71f", "name": "FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution", "authors": [{"id": 182955, "fullname": "Seungho Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/182955?format=json", "institution": "Chung-Ang University, CMLab"}, {"id": 184745, "fullname": "Jeahun Sung", "url": "http://cvpr.thecvf.com/api/miniconf/users/184745?format=json", "institution": "Chung-Ang University"}, {"id": 131019, "fullname": "Jihyong Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/131019?format=json", "institution": "Chung-Ang University"}], "abstract": "Real-image super-resolution (Real-ISR) seeks to recover HR images from LR inputs with mixed, unknown degradations. While diffusion models surpass GANs in perceptual quality, they under-reconstruct high-frequency (HF) details due to a low-frequency (LF) bias and a depth-wise \u201clow-first, high-later\u201d hierarchy. We introduce FRAMER, a plug-and-play training scheme that exploits diffusion priors without changing the backbone or inference. At each denoising step, the final-layer feature map teaches all intermediate layers. Teacher and student feature maps are decomposed into LF/HF bands via FFT masks to align supervision with the model\u2019s internal frequency hierarchy. For LF, an Intra Contrastive Loss (IntraCL) stabilizes globally shared structure. For HF, an Inter Contrastive Loss (InterCL) sharpens instance-specific details using random-layer and in-batch negatives. Two adaptive modulators, Frequency-based Adaptive Weight (FAW) and Frequency-based Alignment Modulation (FAM), reweight per-layer LF/HF signals and gate distillation by current similarity. Across U-Net and DiT backbones (e.g., Stable Diffusion 2, 3), FRAMER consistently improves PSNR/SSIM and perceptual metrics (LPIPS, NIQE, MANIQA, MUSIQ). Ablations validate the final-layer teacher and random-layer negatives. The code and project page will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36310", "url": null, "sourceid": 43028, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39879, "uid": "14fbf54632c6eead523a03d1bc002fe4", "name": "PartDiffuser: Part-wise 3D Mesh Generation via Discrete Diffusion", "authors": [{"id": 193041, "fullname": "Yichen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193041?format=json", "institution": "Beihang University"}, {"id": 127767, "fullname": "Hong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/127767?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 193042, "fullname": "Haodong Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193042?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 153445, "fullname": "Linlin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153445?format=json", "institution": "Communication University of China"}, {"id": 105402, "fullname": "guojun lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/105402?format=json", "institution": "Zhejiang University"}, {"id": 90173, "fullname": "Sheng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90173?format=json", "institution": "Beihang University"}, {"id": 87855, "fullname": "Baochang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87855?format=json", "institution": "Beihang University"}], "abstract": "Existing autoregressive (AR) methods for generating artist-designed meshes struggle to balance global structural consistency with high-fidelity local details, and are susceptible to error accumulation. To address this, we propose PartDiffuser, a novel semi-autoregressive diffusion framework for point-cloud-to-mesh generation. The method first performs semantic segmentation on the mesh and then operates in a \"part-wise\" manner: it employs autoregression between parts to ensure global topology, while utilizing a parallel discrete diffusion process within each semantic part to precisely reconstruct high-frequency geometric features. PartDiffuser is based on the DiT architecture and introduces a part-aware cross-attention mechanism, using point clouds as hierarchical geometric conditioning to dynamically control the generation process, thereby effectively decoupling the global and local generation tasks. Experiments demonstrate that this method significantly outperforms state-of-the-art (SOTA) models in generating 3D meshes with rich detail, exhibiting exceptional detail representation suitable for real-world applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39879", "url": null, "sourceid": 36603, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38870, "uid": "5e487b1018c4652d45114ef8477f8f8c", "name": "DualPrim: Compact 3D Reconstruction with Positive and Negative Primitives", "authors": [{"id": 190887, "fullname": "Xiaoxu Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190887?format=json", "institution": "N/A"}, {"id": 190888, "fullname": "Zhongmin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190888?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 190889, "fullname": "Bo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190889?format=json", "institution": "Waymo"}, {"id": 89410, "fullname": "Weikai Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89410?format=json", "institution": "Tencent America"}, {"id": 76458, "fullname": "Weixiao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76458?format=json", "institution": "Johns Hopkins University"}, {"id": 76495, "fullname": "Lin Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/76495?format=json", "institution": "University of Chinese Academy of Sciences"}], "abstract": "We present Compact 3D Reconstruction with Positive and Negative Primitives (DualPrim), a novel approach for reconstructing compact and topologically regular 3D meshes from multi-view images. Unlike traditional methods that rely on implicit representations such as signed distance functions, or explicit formats such as meshes and point clouds, our method models geometry using quadrics-based 3D primitives. Each primitive is defined by a positive-density superquadric that contributes to the shape, and a negative-density superquadric that carves out local volumes, enabling fine-grained geometric control and flexible topology. This dual-primitive representation yields compact, well-regularized, and efficiently parameterized mesh reconstructions. To infer primitive parameters from multi-view images, we design a differentiable rendering pipeline that jointly estimates positive and negative superquadrics under view-consistent supervision. Extensive experiments demonstrate that DualPrim outperforms state-of-the-art methods in reconstruction accuracy while producing more geometrically concise, interpretable, and high-fidelity 3D meshes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38870", "url": null, "sourceid": 36381, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39504, "uid": "c22f68de5bdce96fee03e21ca08d898d", "name": "Can a Second-View Image Be a Language? Geometric and Semantic Cross-Modal Reasoning for X-ray Prohibited Item Detection", "authors": [{"id": 192221, "fullname": "PengChuang PengChuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192221?format=json", "institution": "Beijing Jiaotong University"}, {"id": 153565, "fullname": "Renshuai Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153565?format=json", "institution": "Beijing Jiaotong University"}, {"id": 151985, "fullname": "Zhongwei Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/151985?format=json", "institution": "Beijing Jiaotong University"}, {"id": 75528, "fullname": "Xianglong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75528?format=json", "institution": "BUAA"}, {"id": 75837, "fullname": "Yunchao Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/75837?format=json", "institution": "Beijing Jiaotong University"}], "abstract": "Automatic X-ray prohibited items detection is vital for security inspection and has been widely studied. Traditional methods rely on visual modal, often struggling with complex threats. While recent studies incorporate language to guide single-view images, human inspectors typically use dual-view images in practice. This raises the question: can the second view provide constraints similar to a language modality? In this work, we introduce DualXrayBench, the first comprehensive benchmark for X-ray inspection that includes multiple views and modalities. It supports eight tasks designed to test cross-view reasoning. In DualXrayBench, we introduce a dual-view caption corpus consisting of 45,613 dual-view images across 12 categories with corresponding captions. Building upon these data, we propose the Geometric (cross-view)-Semantic (cross-modality) Reasoner (GSR), a multimodal model that jointly learns correspondences between cross-view geometry and cross-modal semantics, treating the second-view images as a \"language-like modality\". To enale this, we construct the GSXray dataset, with structured Chain-of-Thought sequences: top , side, conclusion. Comprehensive evaluations on DualXrayBench demonstrate that GSR achieves significant improvements across all X-ray tasks, offering a new sight for real-world X-ray inspection.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39504", "url": null, "sourceid": 41470, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37515, "uid": "90a562851e9222030339fcf2960c15e9", "name": "MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping", "authors": [{"id": 179914, "fullname": "yushi Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/179914?format=json", "institution": "SenseTime"}, {"id": 187620, "fullname": "Zining Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187620?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 88574, "fullname": "Zhihang Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88574?format=json", "institution": "Houmo AI"}, {"id": 107333, "fullname": "Yifu Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/107333?format=json", "institution": "Beihang University"}, {"id": 139422, "fullname": "RUIHAO GONG", "url": "http://cvpr.thecvf.com/api/miniconf/users/139422?format=json", "institution": "Beihang University"}, {"id": 86293, "fullname": "Jinyang Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/86293?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 75528, "fullname": "Xianglong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75528?format=json", "institution": "BUAA"}, {"id": 90255, "fullname": "Jun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90255?format=json", "institution": "The Hong Kong University of Science and Technology"}], "abstract": "Mixture-of-Experts (MoE) Multimodal large language models (MLLMs) excel at vision\u2013language tasks, but they suffer from high computational inefficiency. To reduce inference overhead, expert skipping methods have been proposed to deactivate redundant experts based on the current input tokens. However, we find that applying these methods\u2014originally designed for unimodal large language models (LLMs)\u2014to MLLMs results in considerable performance degradation. This is primarily because such methods fail to account for the heterogeneous contributions of experts across MoE layers and modality-specific behaviors of tokens within these layers. Motivated by these findings, we propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. It incorporates a globally-modulated local gating (GMLG) mechanism that integrates global layer-wise importance into local routing probabilities to accurately estimate per-token expert importance. A dual-modality thresholding (DMT) method is then applied, which processes tokens from each modality separately, to derive the skipping schedule. To set the optimal thresholds, we introduce a frontier search algorithm that exploits monotonicity properties, cutting convergence time from several days to a few hours. Extensive experiments for 3 model series across 13 benchmarks demonstrate that MoDES far outperforms previous approaches. For instance, when skipping 88\\% experts for Qwen3-VL-MoE-30B-A3B-Instruct, the performance boost is up to **10.67\\%** (97.33\\% vs. 86.66\\%). Furthermore, MoDES significantly enhances inference speed, improving the prefilling time by **2.16$\\times$** and the decoding time by **1.26$\\times$**.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37515", "url": null, "sourceid": 31468, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36396, "uid": "b223756da60ed5012d1d302bc9f50f6e", "name": "Reconstructing Functional 3D Scenes from Egocentric Interaction Videos", "authors": [{"id": 151059, "fullname": "Alexandros Delitzas", "url": "http://cvpr.thecvf.com/api/miniconf/users/151059?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 127988, "fullname": "Chenyangguang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127988?format=json", "institution": "Tsinghua University"}, {"id": 184943, "fullname": "Alexey Gavryushin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184943?format=json", "institution": "ETHZ - ETH Zurich"}, {"id": 184944, "fullname": "Tommaso Di Mario", "url": "http://cvpr.thecvf.com/api/miniconf/users/184944?format=json", "institution": "ETH Z\u00fcrich"}, {"id": 126295, "fullname": "Boyang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/126295?format=json", "institution": "ETH Zurich"}, {"id": 89511, "fullname": "Rishabh Dabral", "url": "http://cvpr.thecvf.com/api/miniconf/users/89511?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 75804, "fullname": "Leonidas Guibas", "url": "http://cvpr.thecvf.com/api/miniconf/users/75804?format=json", "institution": "Stanford University"}, {"id": 75465, "fullname": "Christian Theobalt", "url": "http://cvpr.thecvf.com/api/miniconf/users/75465?format=json", "institution": "MPI Informatik"}, {"id": 73915, "fullname": "Marc Pollefeys", "url": "http://cvpr.thecvf.com/api/miniconf/users/73915?format=json", "institution": "ETH Zurich / Microsoft"}, {"id": 151062, "fullname": "Francis Engelmann", "url": "http://cvpr.thecvf.com/api/miniconf/users/151062?format=json", "institution": "Computer Science Department, Stanford University"}, {"id": 74200, "fullname": "Daniel Barath", "url": "http://cvpr.thecvf.com/api/miniconf/users/74200?format=json", "institution": "ETHZ - ETH Zurich"}], "abstract": "We present FunREC, a method for reconstructing functional 3D digital twins of indoor scenes directly from egocentric RGB-D interaction videos. Unlike existing methods on articulated reconstruction, which rely on controlled setups, multi-state captures, or CAD priors, FunREC operates directly on in-the-wild human interaction sequences to recover interactable 3D scenes. It automatically discovers articulated parts, estimates their kinematic parameters, tracks their 3D motion, and reconstructs static and moving geometry in canonical space, yielding simulation-compatible meshes. Across new real and simulated benchmarks, FunREC surpasses prior work by a large margin, achieving up to +50 mIoU improvement in part segmentation, 5$-$10$\\times$ lower articulation and pose errors, and significantly higher reconstruction accuracy. We further demonstrate applications on URDF/USD export for simulation, hand-guided affordance mapping and robot-scene interaction. All code and data will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36396", "url": null, "sourceid": 44395, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36905, "uid": "65f5a24d2b690cf3819e992cc20173e6", "name": "Learning 3D Shape Fidelity Metric from Real-world Distortions", "authors": [{"id": 186166, "fullname": "Xuelu Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186166?format=json", "institution": "State University of New York at Buffalo"}, {"id": 88006, "fullname": "Tianyu Luan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88006?format=json", "institution": "State University of New York at Buffalo"}, {"id": 186167, "fullname": "Zixin Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186167?format=json", "institution": "State University of New York at Buffalo"}, {"id": 186168, "fullname": "Akshobhya Sharma", "url": "http://cvpr.thecvf.com/api/miniconf/users/186168?format=json", "institution": "State University of New York at Buffalo"}, {"id": 186169, "fullname": "Phani Nuney", "url": "http://cvpr.thecvf.com/api/miniconf/users/186169?format=json", "institution": null}, {"id": 87284, "fullname": "Junsong Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87284?format=json", "institution": "State University of New York at Buffalo"}, {"id": 186170, "fullname": "Chunming Qiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186170?format=json", "institution": null}], "abstract": "3D generation and reconstruction have become essential in many computer vision applications, where the reconstructed or generated 3D shapes need to appear realistic to human perception. However, traditional metrics like Chamfer Distance to compare two 3D shapes focus primarily on matching accuracy of the shape geometry and fail to capture perceptual fidelity in the shape. While frequency-based metrics attempt to analyze shape details in the spectral domain, they still do not fully encapsulate the complexity of human perception. To address this gap, we propose a human-aligned fidelity metric that leverages local shape connectivity through a local attention mechanism to capture rich, detailed shape information. We also introduce the two-branch Real Shape Fidelity (RSF) dataset, including a main subset and test-only subset. This dataset generates 3D mesh distortions using real-world reconstruction and generation methods and annotated by hundreds of human subjects. Our metric named Local-Connection-based Shape Evaluation (LoCaSE), utilizes a PointNet-based backbone combined with Low-Rank Adaptation (LoRA)-style pretraining and finetuning to reduce model bias, while maintaining translation, rotation, and scale invariance. Experiments demonstrate that our approach achieves superior alignment with human perception compared to previous metrics.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36905", "url": null, "sourceid": 45165, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39030, "uid": "1067186e81115b17ee94a90f1a4c124c", "name": "FVBench: Benchmarking Deepfake Video Detection Capability of Large Multimodal Models", "authors": [{"id": 154577, "fullname": "Wang Jiarui", "url": "http://cvpr.thecvf.com/api/miniconf/users/154577?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 154576, "fullname": "Huiyu Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/154576?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 158205, "fullname": "Juntong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158205?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 89522, "fullname": "Xiongkuo Min", "url": "http://cvpr.thecvf.com/api/miniconf/users/89522?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "As generative models rapidly evolve, the realism of AI-generated videos has reached new levels, posing significant challenges for detecting the authenticity of videos. Existing deepfake detection techniques generally rely on training datasets with limited generation methods and content diversity, which limits their generalization ability on more realistic content, particularly that produced by the latest generative models. Recently, large multimodal models (LMMs) have demonstrated remarkable zero-shot performance across a variety of vision tasks. Yet, their ability to discern deepfake videos remains largely untested. To this end, we propose **FVBench**, a comprehensive deep$\\underline{f}$ake $\\underline{v}$ideo $\\underline{bench}$mark designed to advance video deepfake detection. It includes: **(i)** extensive content diversity, with over 120K videos covering real, AI-edited, and fully AI-generated categories, **(ii)** comprehensive model coverage, with fake videos generated and edited by 42 of the state-of-the-art video synthesis and editing models, and **(iii)** deepfake video detection benchmark for LMMs, which is a comprehensive benchmark for exploring the deepfake video detection capabilities of LMMs. The FVBench dataset and evaluation code will be publicly available upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39030", "url": null, "sourceid": 35399, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39393, "uid": "012fa44d6752ee07c167165cb8c6f11c", "name": "Trust-calibrated Collaborative Learning for Long-Tailed Visual Recognition", "authors": [{"id": 191985, "fullname": "Hao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191985?format=json", "institution": "Naval university of engineering"}, {"id": 157769, "fullname": "Tingjin Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/157769?format=json", "institution": "National University of Defense Technology"}], "abstract": "Real-world visual recognition faces the fundamental challenge of long-tailed distributions. While state-of-the-art methods often employ multi-expert models to address different frequency categories, we find that the mutual knowledge distillation used in these models enhances collaboration at the cost of introducing two critical limitations: indiscriminate knowledge transfer leads to bias propagation, where a single expert's error can spread and contaminate others, and error consolidation, where mutual reinforcement of incorrect predictions solidifies erroneous consensus. To overcome these issues, we propose Trust-calibrated Collaborative Learning (TCL). Our framework introduces the trustworthy knowledge orchestration module, which enables reliable distillation and precise collaboration through a knowledge quality gate that blocks erroneous information and a tail-class compensation mechanism that alleviates knowledge scarcity for tail categories. Furthermore, we design a consensus error calibration module that suppresses consensus high-confidence negative classes to correct collective misjudgments and steer optimization in the right direction. Extensive experiments on five long-tailed benchmarks demonstrate that TCL achieves the best performance, raising Top-1 accuracy on CIFAR100-LT to 58.7\\%, a gain of 2.4\\% over previous SOTA methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39393", "url": null, "sourceid": 43299, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38483, "uid": "5f6eb0809f31e88067e51bfd2bb0c50e", "name": "tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction", "authors": [{"id": 70682, "fullname": "Chen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70682?format=json", "institution": "University of Pennsylvania, University of Pennsylvania"}, {"id": 128744, "fullname": "Hao Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/128744?format=json", "institution": "Adobe Systems"}, {"id": 159462, "fullname": "Wang Yifan", "url": "http://cvpr.thecvf.com/api/miniconf/users/159462?format=json", "institution": "Adobe Systems"}, {"id": 89985, "fullname": "Zhiqin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89985?format=json", "institution": "Simon Fraser University"}, {"id": 147983, "fullname": "Yuheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/147983?format=json", "institution": "University of California, Irvine"}, {"id": 89590, "fullname": "Kalyan Sunkavalli", "url": "http://cvpr.thecvf.com/api/miniconf/users/89590?format=json", "institution": "Adobe Research"}, {"id": 127654, "fullname": "Sai Bi", "url": "http://cvpr.thecvf.com/api/miniconf/users/127654?format=json", "institution": "Adobe Systems"}, {"id": 90047, "fullname": "Lingjie Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90047?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 153091, "fullname": "Yiwei Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153091?format=json", "institution": "Adobe Systems"}], "abstract": "We propose tttLRM, a novel large 3D reconstruction model that leverages a Test-Time Training (TTT) layer to enable long-context, autoregressive 3D reconstruction with linear computational complexity, further scaling the model\u2019s capability. Our framework efficiently compresses multiple image observations into the fast weights of the TTT layer, forming an implicit 3D representation in the latent space that can be decoded into various explicit formats, such as Gaussian Splats (GS) for downstream applications. The online learning variant of our model supports progressive 3D reconstruction and refinement from streaming observations. We demonstrate that pretraining on novel view synthesis tasks effectively transfers to explicit 3D modeling, resulting in improved reconstruction quality and faster convergence. Extensive experiments show that our method achieves superior performance in feedforward 3D Gaussian reconstruction compared to state-of-the-art approaches on both objects and scenes.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38483", "url": "https://cwchenwang.github.io/tttLRM", "sourceid": 32323, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38838, "uid": "8b33ab221257b074d1d967042ad1d9d0", "name": "VITAL: Vision-Encoder-centered Pretraining for LMMs in Visual Quality Assessment", "authors": [{"id": 141834, "fullname": "Ziheng Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/141834?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 185457, "fullname": "Linhan Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185457?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 190804, "fullname": "Jinliang Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/190804?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 89537, "fullname": "Zicheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89537?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 190805, "fullname": "Jiaying Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/190805?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 154577, "fullname": "Wang Jiarui", "url": "http://cvpr.thecvf.com/api/miniconf/users/154577?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 152884, "fullname": "Zijian Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/152884?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 86659, "fullname": "Guangtao Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/86659?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 89522, "fullname": "Xiongkuo Min", "url": "http://cvpr.thecvf.com/api/miniconf/users/89522?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Developing a robust visual quality assessment (VQualA) large multimodal model (LMM) requires achieving **versatility**, **powerfulness**, and **transferability**. However, existing VQualA LMMs typically focus on a single task and rely on full-parameter fine-tuning, which makes them prone to overfitting on specific modalities or task types, thereby limiting their generalization capacity and transferability. To address this, we propose a **vision-encoder-centered generative pre-training** pipeline and develop the **VITAL-Series** LMMs.(1) We adopt a machine-executed annotation\u2013scrutiny paradigm, constructing over $4.5M$ vision\u2013language (VL) pairs\u2014the **largest VQualA training dataset to date**. (2) We employ a multi-task training workflow that simultaneously enhances the model\u2019s quantitative scoring precision and strengthens its capability for quality interpretation across both image and video modalities.  (3) Building upon the vision encoder, we realize the **efficient model zoo extension**: the model zoo exhibits strong zero-shot performance, and each paired decoder requires only a swift warm-up using less than $1/1000$ of the pre-training data to achieve performance comparable to the fully trained counterpart. Overall, our work lays a cornerstone for advancing toward the **foundation LMM for VQualA**.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38838", "url": null, "sourceid": 39905, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36785, "uid": "de3eab45d1d6c29f01c4ff6a33e87d19", "name": "Efficient Frame Selection for Long Video Understanding via Reinforcement Learning", "authors": [{"id": 180655, "fullname": "Yaxuan Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/180655?format=json", "institution": "\u817e\u8baf\u79d1\u6280(\u5317\u4eac)\u6709\u9650\u516c\u53f8"}, {"id": 185865, "fullname": "Hefei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185865?format=json", "institution": "Tencent"}, {"id": 185866, "fullname": "Wenqi Mu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185866?format=json", "institution": null}, {"id": 185867, "fullname": "Yancheng He", "url": "http://cvpr.thecvf.com/api/miniconf/users/185867?format=json", "institution": "Tencent"}], "abstract": "Recent advances in Multimodal Large Language Models (MLLMs) have led to significant progress in video understanding. Due to limited context windows and computational overhead, most MLLMs adopt uniform frame sampling. This approach is at high risk of missing critical visual information and constrains performance especially for long videos. To address this problem, we propose a lightweight frame selection method to identify keyframes and train it via a two-stage strategy. In the pre-training stage, the frame selector learns to model relevance between individual video frames and queries. In the reinforcement learning (RL) stage, we employ a hierarchical reward that evaluates selection quality at combination and frame levels. Through stochastic exploration of frame combinations, the selector learns to identify and retain frames that improve task performance rather than merely maximizing query relevance, which can be misleading. The selected frames serve as input to downstream MLLMs for video understanding and reasoning. Experimental results demonstrate the proposed selector improves performance of diverse downstream MLLMs across benchmarks spanning medium to long videos.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36785", "url": null, "sourceid": 38785, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38195, "uid": "317e30985b2c3bf93a0fe849ddca9888", "name": "Asking like Socrates: Socrates helps VLMs understand remote sensing images", "authors": [{"id": 172720, "fullname": "Run Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/172720?format=json", "institution": "Central South University"}, {"id": 189275, "fullname": "Ziyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189275?format=json", "institution": "Central South University"}, {"id": 189276, "fullname": "Zhaoyang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189276?format=json", "institution": "Central South University"}, {"id": 189277, "fullname": "Linrui Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189277?format=json", "institution": "Central South University"}, {"id": 172552, "fullname": "Xinran He", "url": "http://cvpr.thecvf.com/api/miniconf/users/172552?format=json", "institution": "Baidu Inc."}, {"id": 179897, "fullname": "Hongyuan Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/179897?format=json", "institution": "Central South University"}, {"id": 189278, "fullname": "Bolei He", "url": "http://cvpr.thecvf.com/api/miniconf/users/189278?format=json", "institution": "University of Science and Technology of China"}, {"id": 189279, "fullname": "Yongxing Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/189279?format=json", "institution": "Baidu"}, {"id": 189280, "fullname": "Yan Yiming", "url": "http://cvpr.thecvf.com/api/miniconf/users/189280?format=json", "institution": "Zhejiang University"}, {"id": 189281, "fullname": "Chen Yijun", "url": "http://cvpr.thecvf.com/api/miniconf/users/189281?format=json", "institution": "Zhejiang University"}, {"id": 189282, "fullname": "Wang Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189282?format=json", "institution": "Central South University"}, {"id": 189283, "fullname": "Haifeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189283?format=json", "institution": "Central South University, China"}], "abstract": "Recent multimodal reasoning models, inspired by DeepSeek-R1, have significantly advanced vision\u2013language systems. However, in remote sensing (RS) tasks, we observe widespread pseudo reasoning: models narrate the process of reasoning rather than genuinely reason toward the correct answer based on visual evidence. We attribute this to the Glance Effect, where a single, coarse perception of large-scale RS imagery results in incomplete understanding and reasoning based on linguistic self-consistency instead of visual evidence. To address this, we propose RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. To instill this paradigm, we propose SocraticAgent, a self-play multi-agent system that synthesizes reasoning traces via alternating cycles of reasoning and visual inspection. To enhance and generalize these patterns, we propose a two-stage progressive RL strategy: first, RL on fine-grained Grounding tasks to enhance RS-EoT capabilities, followed by RL on RS VQA to generalize to broader understanding scenarios. Experiments show RS-EoT achieves state-of-the-art performance on multiple RS VQA and grounding benchmarks. Analyses reveal clear iterative cycles of reasoning and evidence seeking, confirming RS-EoT mitigates the Glance Effect and enables genuine evidence-grounded reasoning. Our code, data, and models will be available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38195", "url": null, "sourceid": 32099, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38920, "uid": "0fc55b68b62a7bdc44bfc347f48d4277", "name": "RMIR: A Benchmark Dataset for Reasoning-Intensive Multimodal Image Retrieval", "authors": [{"id": 157787, "fullname": "Yijiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/157787?format=json", "institution": "University of California, San Diego"}, {"id": 190967, "fullname": "Kunal Kotian", "url": "http://cvpr.thecvf.com/api/miniconf/users/190967?format=json", "institution": "Amazon"}, {"id": 190968, "fullname": "Ali Marjaninejad", "url": "http://cvpr.thecvf.com/api/miniconf/users/190968?format=json", "institution": "Amazon"}, {"id": 190969, "fullname": "Meir Friedenberg", "url": "http://cvpr.thecvf.com/api/miniconf/users/190969?format=json", "institution": "Amazon"}, {"id": 190970, "fullname": "Kaushik Pavani", "url": "http://cvpr.thecvf.com/api/miniconf/users/190970?format=json", "institution": "Amazon"}, {"id": 190971, "fullname": "Sunny Dasgupta", "url": "http://cvpr.thecvf.com/api/miniconf/users/190971?format=json", "institution": "Amazon"}], "abstract": "Current multimodal image retrieval benchmarks focus on relatively simple queries where target images are either described directly or by simple composition with an input image. When retrieval requires complex reasoning to determine the target image, the task becomes significantly more challenging, yet standardized benchmarks for this setting do not exist. To fill this gap, we introduce RMIR, a benchmark dataset of $1,634$ queries requiring reasoning across three categories: functional (object affordances), temporal (time-based relationships), and causal (cause-effect reasoning). Each query combines visual and textual inputs that demand robust visual understanding together with logical inference, beyond surface-level matching, to identify correct target images. Evaluation of state-of-the-art models on RMIR reveals significant performance gaps, with the best model achieving only $46.53$\\% recall@$20$ averaged across reasoning categories. Our systematic analysis exposes fundamental limitations in current multimodal retrieval systems and establishes RMIR as a challenging testbed for developing multimodal, reasoning-capable retrieval models.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38920", "url": null, "sourceid": 43784, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37260, "uid": "83fe2aabfbd180be2c8afde7a22b2fc4", "name": "Head-wise Adaptive Rotary Positional Encoding for Fine-Grained Image Generation", "authors": [{"id": 154533, "fullname": "Li jiaye", "url": "http://cvpr.thecvf.com/api/miniconf/users/154533?format=json", "institution": "Fudan University"}, {"id": 187029, "fullname": "Baoyou Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187029?format=json", "institution": "Fudan University"}, {"id": 158003, "fullname": "Hui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/158003?format=json", "institution": "Fudan University"}, {"id": 126294, "fullname": "Zilong Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/126294?format=json", "institution": "Alibaba Group"}, {"id": 126338, "fullname": "Jingdong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126338?format=json", "institution": "Baidu"}, {"id": 154534, "fullname": "Siyu Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154534?format=json", "institution": "Fudan University"}], "abstract": "Transformers rely on explicit positional encoding to model structure in data. WhileRotary Position Embedding (RoPE) excels in 1D domains, its application to im_x0002_age generation reveals significant limitations such as fine-grained spatial relationmodeling, color cues, and object counting. This paper identifies key limitationsof standard multi-dimensional RoPE\u2014rigid frequency allocation, axis-wise inde_x0002_pendence, and uniform head treatment\u2014in capturing the complex structural biasesrequired for fine-grained image generation. We propose HARoPE, a head-wiseadaptive extension that inserts a learnable linear transformation parameterized viasingular value decomposition (SVD) before the rotary mapping. This lightweightmodification enables dynamic frequency reallocation, semantic alignment of rotaryplanes, and head-specific positional receptive fields while rigorously preservingRoPE\u2019s relative-position property. Extensive experiments on class-conditional Ima_x0002_geNet and text-to-image generation (Flux and MMDiT) demonstrate that HARoPEconsistently improves performance over strong RoPE baselines and other exten_x0002_sions. The method serves as an effective drop-in replacement, offering a principledand adaptable solution for enhancing positional awareness in transformer-basedimage generative models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37260", "url": null, "sourceid": 42457, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38995, "uid": "0c2da1c3364eb2e4d2b9d340c246eb96", "name": "GauMVC: Generative Decoupled Gaussian Representation for Human-centric Multi-view Video Compression", "authors": [{"id": 182058, "fullname": "Ruoke Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/182058?format=json", "institution": "Peking University"}, {"id": 191145, "fullname": "Mingjia Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191145?format=json", "institution": "Peking University"}, {"id": 191146, "fullname": "Xinfeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191146?format=json", "institution": "University of the Chinese Academy of Sciences; University of the Chinese Academy of Sciences"}, {"id": 191147, "fullname": "Haocheng Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191147?format=json", "institution": "Peking University"}, {"id": 147139, "fullname": "Qian Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/147139?format=json", "institution": "Peking University"}, {"id": 191148, "fullname": "Zhipin Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191148?format=json", "institution": "Bytedance Inc."}, {"id": 157655, "fullname": "Kai Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157655?format=json", "institution": "ByteDance Inc."}, {"id": 130296, "fullname": "Li zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130296?format=json", "institution": "Bytedance Inc."}, {"id": 90140, "fullname": "Siwei Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/90140?format=json", "institution": "Peking University"}], "abstract": "Human-centric multi-view video has a clear semantic structure: a static background and dynamic human motion. We propose a generative compression framework that explicitly decouples these components. The background is modeled once with 3D Gaussian Splatting, while the human is represented by a personalized Gaussian avatar reconstructed from a sparse set of key views that are transmitted only once and driven by compact per-frame pose parameters from the Skinned Multi-Person Linear (SMPL) model. The encoder sends only three elements: the background, the key views, and the SMPL parameters, enabling high-fidelity multi-viewpoint synthesis at dramatically reduced bitrates. This shifts compression from low-level redundancy removal to semantics-aware generative modeling. Experiments across multiple human-centric datasets demonstrate superior rate\u2013distortion performance, particularly for long and densely captured sequences, and naturally enable semantic editing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38995", "url": null, "sourceid": 45035, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39796, "uid": "4e731a273a4f480aadc865db1cd9c448", "name": "KVSmooth: Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing", "authors": [{"id": 192876, "fullname": "Siyu Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192876?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 175916, "fullname": "Feiyang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/175916?format=json", "institution": "HuaZhong Univercity of Science and Technology"}, {"id": 192877, "fullname": "Xiaojin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192877?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 129383, "fullname": "Kun He", "url": "http://cvpr.thecvf.com/api/miniconf/users/129383?format=json", "institution": "Huazhong University of Sceince and Technology"}], "abstract": "Despite the significant progress of Multi-modal Large Language Models (MLLMs) across diverse tasks, hallucination, which corresponds to the generation of visually inconsistent objects, attributes, or relations, remains a major obstacle to their reliable deployment. Unlike pure language models, MLLMs must ground their generation process in visual inputs; However, existing models often suffer from semantic drift during decoding, causing outputs to diverge from visual facts as the sequence length increases.To address this, we propose KVSmooth, a training-free,  plug-and-play method that mitigates hallucination by performing attention\u2013entropy\u2013guided adaptive smoothing on hidden states. Specifically, KVSmooth applies an exponential moving average (EMA) to both keys and values in the KV-Cache while dynamically quantifying the sink degree of each token through its attention distribution entropy to adaptively adjust the smoothing strength. Unlike computationally expensive retraining or contrastive decoding methods, KVSmooth operates efficiently during inference without additional training or model modification. Extensive experiments demonstrate that KVSmooth significantly reduces hallucination ($\\mathit{CHAIR}_{S}$ from $41.8 \\rightarrow 18.2$) while improving overall performance ($F_1$ score from $77.5 \\rightarrow 79.2$), achieving higher precision and recall simultaneously, whereas prior methods often sacrifice one for the other, thereby validating the effectiveness and generality of our method.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39796", "url": null, "sourceid": 40297, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38158, "uid": "b41f25f2d24fd25be2eb239dfc275266", "name": "ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data", "authors": [{"id": 172090, "fullname": "Yuxing Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/172090?format=json", "institution": "Beijing University of Chemical Technology College of Science: Beijing University of Chemical Technol"}, {"id": 189173, "fullname": "Zheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189173?format=json", "institution": "Beijing University of Chemical Technology"}, {"id": 189174, "fullname": "Huanhuan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189174?format=json", "institution": "Beijing University of Chemical Technology"}, {"id": 189175, "fullname": "Ji Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189175?format=json", "institution": "Southwest University of Nationalities"}, {"id": 189176, "fullname": "Zeyu Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/189176?format=json", "institution": "Institute of Software Chinese Academy of Sciences"}, {"id": 189177, "fullname": "Yong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189177?format=json", "institution": "Beijing University of Chemical Technology"}], "abstract": "Anomaly segmentation seeks to detect and localize unknown or out-of-distribution (OoD) objects that fall outside predefined semantic classes\u2014a capability essential for safe autonomous driving. However, the scarcity and limited diversity of anomaly data severely constrain model generalization in open-world environments. Existing approaches mitigate this issue through synthetic data generation, either by copy-pasting external objects into driving scenes or by leveraging text-to-image diffusion models to inpaint anomalous regions. While these methods improve anomaly diversity, they often lack contextual coherence and physical realism, resulting in domain gaps between synthetic and real data. In this paper, we present ClimaDrive, a semantics-guided image-to-image framework for synthesizing semantically coherent, weather-diverse, and physically plausible OoD driving data. ClimaDrive unifies structure-guided multi-weather generation with prompt-driven anomaly inpainting, enabling the creation of visually realistic training data. Based on this framework, we construct ClimaOoD, a large-scale benchmark spanning six representative driving scenarios under both clear and adverse weather conditions.Extensive experiments on four state-of-the-art methods show that training with ClimaOoD leads to robust improvements in anomaly segmentation. Across all methods, AUROC, AP, and FPR95 show notable gains, with FPR95 dropping from 3.97 to 3.52 for RbA on Fishyscapes LAF. These results demonstrate that ClimaOoD enhances model robustness, offering valuable training data for better generalization in open-world anomaly detection.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38158", "url": null, "sourceid": 41525, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36379, "uid": "abcbe3ee8523b90b416337f0abd94a53", "name": "Real2Sim2Real: RetinalDepth-64K for Depth Estimation in Posterior Segment Ophthalmic Surgery", "authors": [{"id": 175130, "fullname": "Bingwen Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/175130?format=json", "institution": "Southern University of Science and Technology"}, {"id": 184912, "fullname": "Gan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184912?format=json", "institution": "Southern University of Science and Technology"}, {"id": 184913, "fullname": "Xiaoxi Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184913?format=json", "institution": "Southern University of Science and Technology"}, {"id": 172384, "fullname": "Guangcheng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/172384?format=json", "institution": "Southern University of Science and Technology"}, {"id": 184914, "fullname": "Jialu ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/184914?format=json", "institution": "Southern University of Science and Technology"}, {"id": 184915, "fullname": "Yan Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184915?format=json", "institution": "Southern University of Science and Technology"}, {"id": 184916, "fullname": "Xiaoqing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184916?format=json", "institution": "Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences"}, {"id": 184917, "fullname": "Jiang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184917?format=json", "institution": "Southern University of Science and Technology"}], "abstract": "Accurate depth estimation is crucial for 3D reconstruction and precise navigation in ophthalmic fundus surgery. However, acquiring annotated data remains challenging due to the impracticality of depth sensors under surgical microscopes.To overcome this limitation, we introduce RetinalDepth-64K, a novel synthetic dataset comprising 64,000 stereo image pairs across 1,280 diverse scenes, developed through a Real2Sim2Real pipeline that transforms real-world fundus surgery videos into synthetic data and facilitates model deployment in real scenarios. We analyzed key characteristics such as intricate retinal textures from real-world videos to guide the Real-to-Sim phase, enabling realistic data synthesis.To improving dataset fidelity for depth estimation, we created 3D eye models using Blender with ultra-wide-field retinal textures, glass-modeled aqueous humor, and dynamic instrument trajectories, enhanced by post-processing to ensure photorealism.The dataset provides RGB images, depth maps, normal maps, and instrument segmentation masks from binocular view, supporting the training of monocular, binocular, and video-based depth estimation models to enhance robustness. In the Sim-to-Real phase, quantitative and qualitative experiments show that finetuning foundation models with RetinalDepth-64K produces accurate depth predictions for synthetic data. Comparative analysis on results of zeroshot and finetuned models further validates robust generalization to real fundus surgery scenes, offering significant potential to enhance surgical precision and support the training of novice surgeons through reliable depth cues.As the first dataset of its kind for retinal surgery, RetinalDepth-64K offers a vital resource for advancing 3D reconstruction and surgical navigation in ophthalmology.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36379", "url": null, "sourceid": 41675, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37001, "uid": "bf47908c636948e3a12da188a5708334", "name": "SEA: Evaluating Sketch Abstraction Efficiency via Element-level Common-sense Visual Question Answering", "authors": [{"id": 183854, "fullname": "Jiho Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/183854?format=json", "institution": "Dongguk University"}, {"id": 186434, "fullname": "Sieun Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186434?format=json", "institution": "Dongguk University"}, {"id": 186435, "fullname": "Jaeyoon Seo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186435?format=json", "institution": "Dongguk University"}, {"id": 186436, "fullname": "Minho Sohn", "url": "http://cvpr.thecvf.com/api/miniconf/users/186436?format=json", "institution": "Dongguk University"}, {"id": 186437, "fullname": "Yeana Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/186437?format=json", "institution": "Dongguk University"}, {"id": 129456, "fullname": "Jihie Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/129456?format=json", "institution": "Dongguk University"}], "abstract": "A sketch is a distilled form of visual abstraction that conveys core concepts through simplified yet purposeful strokes while omitting extraneous detail. Despite its expressive power, quantifying the efficiency of semantic abstraction in sketches remains challenging. Existing evaluation methods that rely on reference images, low-level visual features, or recognition accuracy do not capture abstraction, the defining property of sketches.To address these limitations, we introduce SEA (Sketch Evaluation metric for Abstraction efficiency), a reference-free metric that assesses how economically a sketch represents class-defining visual elements while preserving semantic recognizability. These elements are derived per class from commonsense knowledge about features typically depicted in sketches. SEA leverages a visual question answering model to determine the presence of each element and returns a quantitative score that reflects semantic retention under simple visual representations.To support this metric, we present CommonSketch, the first semantically annotated sketch dataset, comprising 23,100 human-drawn sketches across 300 classes, each paired with a caption and element-level annotations. Experiments show that SEA aligns closely with human judgments and reliably discriminates levels of abstraction efficiency, while CommonSketch serves as a benchmark providing systematic evaluation of element-level sketch understanding across various vision-language models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37001", "url": null, "sourceid": 34889, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38873, "uid": "83d5aac46c097426773520489cda201c", "name": "A Polarized Reflection and Material Dataset of Real World Objects", "authors": [{"id": 70484, "fullname": "Jing Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70484?format=json", "institution": "USC Institute for Creative Technologies"}, {"id": 190896, "fullname": "Krithika Dharanikota", "url": "http://cvpr.thecvf.com/api/miniconf/users/190896?format=json", "institution": "University of Southern California"}, {"id": 154670, "fullname": "Emily Yue-ting Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/154670?format=json", "institution": "University of Southern California"}, {"id": 128251, "fullname": "Haiwei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/128251?format=json", "institution": "USC-ICT, Vision and Graphics Lab"}, {"id": 128277, "fullname": "Yajie Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/128277?format=json", "institution": "University of Southern California"}], "abstract": "Accurately modeling how real-world materials reflect light remains a core challenge in inverse rendering, largely due to the scarcity of real measured reflectance data. Existing approaches rely heavily on synthetic datasets with simplified illumination and limited material realism, preventing models from generalizing to real-world images. We introduce a large-scale polarized reflection and material dataset of real-world objects, captured with an 8-camera, 346-light Light Stage equipped with cross/parallel polarization. Our dataset spans 218 everyday objects across five acquisition dimensions\u2014multiview, multi-illumination, polarization, reflectance separation, and material attributes\u2014yielding over 1.2M high-resolution images with diffuse\u2013specular separation and analytically derived diffuse albedo, specular albedo, and surface normals. Using this dataset, we train and evaluate state-of-the-art inverse and forward rendering models on intrinsic decomposition, relighting, and sparse-view 3D reconstruction, demonstrating significant improvements in material separation, illumination fidelity, and geometric consistency. We hope that our work can establish a new foundation for physically grounded material understanding and enable real-world generalization beyond synthetic training regimes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38873", "url": null, "sourceid": 46086, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39792, "uid": "91afcd714efa49dd8bd48d8da385fed9", "name": "EvoGraph-R1: Self-Evolving Multimodal Knowledge Hypergraphs for Agentic Retrieval", "authors": [{"id": 192866, "fullname": "Jiashi Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192866?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 192867, "fullname": "Changhong Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192867?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 128015, "fullname": "Xiangru Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/128015?format=json", "institution": "University of Hong Kong; Sun Yat-Sen University"}, {"id": 90982, "fullname": "Ruifei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90982?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 192868, "fullname": "Xinyi Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192868?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 192869, "fullname": "Jiyao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192869?format=json", "institution": "Fudan University"}, {"id": 192870, "fullname": "Cheng Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192870?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 192871, "fullname": "Ye Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/192871?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 192872, "fullname": "Shujian Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192872?format=json", "institution": "Fudan University"}, {"id": 175244, "fullname": "Junzhi Ning", "url": "http://cvpr.thecvf.com/api/miniconf/users/175244?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 191479, "fullname": "Lihao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191479?format=json", "institution": "Amazon"}, {"id": 103506, "fullname": "Ziyan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/103506?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 104600, "fullname": "Tianbin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/104600?format=json", "institution": "ShangHai Artificial Intelligence Laboratory"}, {"id": 152261, "fullname": "Jin Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/152261?format=json", "institution": "Monash University"}, {"id": 86671, "fullname": "Junjun He", "url": "http://cvpr.thecvf.com/api/miniconf/users/86671?format=json", "institution": "Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences"}], "abstract": "Retrieval-augmented generation (RAG) has emerged as a critical paradigm for grounding Multimodal Large Language Models (MLLMs) in external knowledge. Recent GraphRAG methods introduce structured entity-relation graphs to improve retrieval and reasoning. However, they remain limited by treating knowledge graphs as static data structures built offline and queried in a single pass. This static paradigm misaligns with the interactive, iterative nature of knowledge-intensive reasoning, creating three bottlenecks: (i) text-centric fragmentation that impedes cross-modal reasoning, (ii) frozen structures unable to incorporate new evidence or correct errors, and (iii) rigid single-pass retrieval without adaptive refinement. To overcome these limitations, we introduce EvoGraph-R1, a self-evolving GraphRAG framework that reconceptualizes knowledge graphs as dynamic environments shaped through agent interactions.We formulate retrieval as a Markov Decision Process (MDP) where the agent observes the graph state and executes actions to query (GraphRetrieve), expand (WebSearch), refine (GraphEdit), or terminate (Answer) the reasoning. These actions reshape the hypergraph structure and generate feedback signals that guide subsequent evolution.Through this closed loop, the hypergraph evolves by integrating new evidence, correcting errors, and refining structure to support multi-hop reasoning. Experiments on multimodal VQA and text QA benchmarks demonstrate substantial improvements over existing RAG baselines in accuracy, coverage, and traceability, establishing self-evolving knowledge graphs as a fundamental paradigm across modalities.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39792", "url": null, "sourceid": 41477, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37234, "uid": "37db66c660124c61c9516cf488344005", "name": "RoboWheel: A Data Engine from Real-World Human Demonstrations for Cross-Embodiment Robotic Learning", "authors": [{"id": 153242, "fullname": "Yuhong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153242?format=json", "institution": "Johns Hopkins University"}, {"id": 182579, "fullname": "Zihan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/182579?format=json", "institution": "Tsinghua Shenzhen International Graduate School"}, {"id": 186972, "fullname": "Shengpeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186972?format=json", "institution": "Synapath"}, {"id": 153244, "fullname": "Ling-Hao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/153244?format=json", "institution": "Tsinghua University"}, {"id": 186973, "fullname": "Kaisheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186973?format=json", "institution": "Shenzhen University"}, {"id": 186974, "fullname": "Runqing Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186974?format=json", "institution": "Synapath AI"}, {"id": 186975, "fullname": "Xiao Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186975?format=json", "institution": "Synapath Technology Shenzhen Co., Ltd."}, {"id": 186976, "fullname": "Junjia Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186976?format=json", "institution": "Xianova Robotics"}, {"id": 153248, "fullname": "Zhuoheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/153248?format=json", "institution": "University of Hong Kong"}, {"id": 186977, "fullname": "Jingyi Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186977?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 181596, "fullname": "Ziyan He", "url": "http://cvpr.thecvf.com/api/miniconf/users/181596?format=json", "institution": "Shenzhen University"}, {"id": 186978, "fullname": "Jintian Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186978?format=json", "institution": null}, {"id": 186979, "fullname": "Zheyan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186979?format=json", "institution": "Tsinghua University"}, {"id": 186980, "fullname": "Zhifang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186980?format=json", "institution": "Tsinghua University"}, {"id": 86341, "fullname": "Haoqian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86341?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "We introduce Robowheel, a data engine that converts human hand\u2013object interaction (HOI) videos into training-ready supervision for cross-morphology robotic learning. From monocular RGB/RGB-D inputs, we perform high-precision HOI reconstruction and enforce physical plausibility via a reinforcement learning (RL) optimizer that refines hand\u2013object relative poses under contact and penetration constraints. The reconstructed, contact-rich trajectories are then retargeted to cross-embodiments, robot arms with simple end-effectors, dexterous hands, and humanoids, yielding executable actions and rollouts. To scale coverage, we build a simulation-augmented framework on Isaac Sim with diverse domain randomization (embodiments, trajectories, object retrieval, background textures, hand motion mirroring), which enriches the distributions of trajectories and observations while preserving spatial relationships and physical plausibility. The entire data pipeline forms an end-to-end pipeline from video \u2192 reconstruction \u2192 retargeting \u2192 augmentation \u2192 data acquisition.We validate the data on mainstream vision\u2013language\u2013action (VLA) and imitation learning architectures, demonstrating that trajectories produced by our pipeline are as stable as those from teleoperation and yield comparable continual performance gains. To our knowledge, this provides the first quantitative evidence that HOI modalities can serve as effective supervision for robotic learning. Compared with teleoperation, Robowheel is lightweight: a single monocular RGB(D) camera is sufficient to extract a universal, embodiment-agnostic motion representation that could be flexibly retargeted across embodiments. We further assemble a large-scale multimodal dataset combining multi-camera captures, monocular videos, and public HOI corpora for training and evaluating embodied models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37234", "url": null, "sourceid": 32344, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39779, "uid": "30b1a01d3f62bf747f856235cc8bfb57", "name": "SIF: Semantically In-Distribution Fingerprints for Large Vision-Language Models", "authors": [{"id": 180194, "fullname": "Yifei Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180194?format=json", "institution": "University of Central Florida"}, {"id": 192832, "fullname": "Qian Lou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192832?format=json", "institution": "University of Central Florida"}, {"id": 192833, "fullname": "Mengxin Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192833?format=json", "institution": "University of Central Florida"}], "abstract": "The public accessibility of Large Vision\u2013Language Models (LVLMs) raises serious concerns about unauthorized model reuse and intellectual property infringement. Existing ownership verification approaches often rely on semantically abnormal queries or out-of-distribution responses as fingerprints, which are easily recognized and removed by adversaries.We first expose this vulnerability through the Semantic Divergence Attack (SDA), which detects and filters fingerprint checks by measuring semantic divergence between a stolen model and a reference model, showing that existing fingerprints are not semantic-preserving, easy to detect and bypass, and lacking robustness. To address these weaknesses, we propose **SIF** (Semantically In-Distribution Fingerprints), a non-intrusive ownership verification framework requiring no parameter modification. SIF introduces Semantic-Aligned Fingerprint Distillation (SAFD), which distills text-generation watermark signals\u2014originally designed for text ownership protection rather than model protection\u2014into the visual modality, enabling semantically coherent yet fingerprinted responses. Robust-Fingerprint Optimization (RFO) further simulates worst-case representation perturbations, ensuring resilience to perturbations such as fine-tuning and quantization.Extensive experiments on LLaVA-1.5 and Qwen2.5-VL demonstrate that **SIF** achieves superior stealthiness and robustness, providing a practical solution for LVLM copyright protection.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39779", "url": null, "sourceid": 31840, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37345, "uid": "ec747e10b5b7a104415a43bcbd060f2c", "name": "LVLM-Aided Alignment of Task-Specific Vision Models", "authors": [{"id": 187211, "fullname": "Alexander Koebler", "url": "http://cvpr.thecvf.com/api/miniconf/users/187211?format=json", "institution": "Johann Wolfgang Goethe Universit\u00e4t Frankfurt am Main"}, {"id": 187212, "fullname": "Lukas Kuhn", "url": "http://cvpr.thecvf.com/api/miniconf/users/187212?format=json", "institution": "DKFZ Heidelberg &amp; Goethe University Frankfurt"}, {"id": 187213, "fullname": "Ingo Thon", "url": "http://cvpr.thecvf.com/api/miniconf/users/187213?format=json", "institution": "Siemens Corporate Research"}, {"id": 187214, "fullname": "Florian Buettner", "url": "http://cvpr.thecvf.com/api/miniconf/users/187214?format=json", "institution": "Deutsches Krebsforschungszentrum"}], "abstract": "In high-stakes domains, small task-specific vision models are crucial due to their low computational requirements and the availability of numerous methods to explain their results. However, these explanations often reveal that the models do not align well with human domain knowledge, relying instead on spurious correlations. This might result in brittle behavior once deployed in the real-world. To address this issue, we introduce a novel and efficient method for aligning small task-specific vision models with human domain knowledge by leveraging the generalization capabilities of a Large Vision Language Model (LVLM). Our LVLM-Aided Visual Alignment (LVLM-VA) method provides a bidirectional interface that translates model behavior into natural language and maps human class-level specifications to image-level critiques, enabling effective interaction between domain experts and the model. Our method demonstrates substantial improvement in aligning model behavior with human specifications, as validated on both synthetic and real-world datasets. We show that it effectively reduces the model\u2019s dependence on spurious features and on group-specific biases, without requiring fine-grained feedback.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37345", "url": null, "sourceid": 44105, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38299, "uid": "419b27ea74ad9b4e5b0b88a9bf4f0dc1", "name": "Self-guided Semantic Inspection for Zero-Shot Composed Image Retrieval", "authors": [{"id": 180261, "fullname": "Jingjing Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180261?format=json", "institution": "University of Science and Technology of China"}, {"id": 128550, "fullname": "Lei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128550?format=json", "institution": "University of Science and Technology of China"}, {"id": 76277, "fullname": "Zheren Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76277?format=json", "institution": "University of Science and Technology of China"}, {"id": 189541, "fullname": "Bo Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189541?format=json", "institution": "University of Science and Technology of China"}, {"id": 76376, "fullname": "Zhendong Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/76376?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images using a composed query of a reference image and a textual modification, without relying on triplet-based supervision. As the two inputs describe related but semantically unaligned information, the key challenge lies in interpreting their cross-modal discrepancy to infer the user\u2019s intended semantic modification.Existing ZS-CIR methods mainly adopt a consistency-driven paradigm, training on semantically aligned image\u2013text pairs with alignment or reconstruction objectives. This paradigm enforces cross-modal agreement but overlooks the semantic discrepancies between modalities that naturally arise during inference. To address this issue, we propose DiffComp (Differentiate-then-Compose), a difference-driven self-supervised framework that actively induces and exploits cross-modal discrepancies during training. It stimulates the model to perceive and reconcile semantic differences across visual and textual modalities, thereby improving consistency between training and inference. The framework consists of three components: Contextual Semantic Super-patches that provide localized and coherent visual representations for downstream perception and composition; Phrase-guided Masking that selectively removes text-aligned visual cues to induce controlled cross-modal discrepancies; and Difference-aware Composition that adaptively integrates visual and textual features according to their degree of semantic difference. Extensive experiments on four ZS-CIR benchmarks show that DiffComp achieves state-of-the-art performance and strong generalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38299", "url": null, "sourceid": 37601, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65716, "file": "/media/PosterPDFs/CVPR%202026/38299.png", "modified": "2026-04-21T03:57:17.232400-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": true, "sortkey": 0, "is_live_content": false, "detailed_kind": "", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65738, "file": "/media/PosterPDFs/CVPR%202026/38299-thumb.png", "modified": "2026-04-24T07:01:03.691761-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": false, "sortkey": 0, "is_live_content": false, "detailed_kind": "thumb", "generated_from": null, "resourcetype": "EventmediaImageFile"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39585, "uid": "d15ff2db80a89807d24869fd9ffb1700", "name": "FoV-Net: Rotation-Invariant CAD B-rep Learning via Field-of-View Ray Casting", "authors": [{"id": 171191, "fullname": "Matteo Ballegeer", "url": "http://cvpr.thecvf.com/api/miniconf/users/171191?format=json", "institution": "Ghent University"}, {"id": 192412, "fullname": "Dries Benoit", "url": "http://cvpr.thecvf.com/api/miniconf/users/192412?format=json", "institution": "Ghent University"}], "abstract": "Learning directly from boundary representations (B-reps) has significantly advanced 3D CAD analysis. However, state-of-the-art B-rep learning methods rely on absolute coordinates and normals to encode global context, making them highly sensitive to rotations. Our experiments reveal that models achieving over 95% accuracy on aligned benchmarks can collapse to as low as 10% under arbitrary SO(3) rotations. To address this, we introduce FoV-Net, the first B-rep learning framework that captures both local surface geometry and global structural context in a rotation-invariant manner. Each face is represented by a Local Reference Frame (LRF) UV-grid that encodes its local surface geometry, and by Field-of-View (FoV) grids that capture the surrounding 3D context by casting rays and recording intersections with neighboring faces. Lightweight CNNs extract per-face features, which are propagated over the B-rep graph using a graph attention network. FoV-Net achieves state-of-the-art performance on B-rep classification and segmentation benchmarks, demonstrating robustness to arbitrary rotations while also requiring less training data to achieve strong results.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39585", "url": null, "sourceid": 32524, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37972, "uid": "6790fe5f90c88ea86e251454dd2b8855", "name": "TESO: Online Tracking of Essential Matrix by Stochastic Optimization", "authors": [{"id": 176820, "fullname": "Jaroslav Moravec", "url": "http://cvpr.thecvf.com/api/miniconf/users/176820?format=json", "institution": "Czech Technical University in Prague, Faculty of Electrical Engineering"}, {"id": 188715, "fullname": "Radim Sara", "url": "http://cvpr.thecvf.com/api/miniconf/users/188715?format=json", "institution": "Czech Technical University in Prague, Faculty of EE"}, {"id": 77490, "fullname": "Akihiro Sugimoto", "url": "http://cvpr.thecvf.com/api/miniconf/users/77490?format=json", "institution": "NII"}], "abstract": "Reliable perception of autonomous systems relies on fusion of data from multiple sensors, which requires maintaining accurate geometric calibration during operation. This work aims to track the drift of the calibration parameters caused by mechanical stress, thermal effects, or minor accidents. We focus on five parameters of the essential matrix and propose TESO, whose core mechanisms are: 1) a robust loss function based on kernel correlation over tentative correspondences instead of robust matching and estimators, 2) an adaptive online stochastic optimization on the essential manifold. Both contribute to reduced CPU and memory requirements. TESO relies on a few hyperparameters and eliminates the need for data-driven training, enabling use in resource-constrained online perception systems. We evaluated TESO based on the geometric precision of the tracked extrinsic parameters, the rectification quality, and the stereo depth consistency with respect to a 3D LiDAR. In the large-scale MAN TruckScenes dataset, TESO tracks drift with 0.12\u00b0 precision in the rotation around Y, which is critical for stereo accuracy, while the other two rotation angles are tracked with five times better precision. Sequences with simulated drift are tracked with similar precision as the no-drift ones, suggesting that the tracker is unbiased. Applied to the KITTI dataset, TESO reported systematic inconsistencies in extrinsic parameters across all stereo pairs, confirming observations made by other authors. We verify that these errors were partly caused by intrinsic decalibration, which manifested in the contradictory performance of two metrics: The epipolar error and the depth estimation accuracy. With corrected calibration parameters, TESO improved its rotation precision around the hardest Y-axis by approximately twentyfold, reaching 0.025\u00b0. In the depth estimation, there was a fiftyfold improvement. Despite its lightweight nature, we show that the combination of SIFT features and the proposed TESO loss function achieves accuracy comparable to published single-frame methods that rely on neural network models.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37972", "url": null, "sourceid": 40355, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40320?format=json"], "related_events_ids": [40320]}, {"id": 40320, "uid": "6790fe5f90c88ea86e251454dd2b8855", "name": "TESO: Online Tracking of Essential Matrix by Stochastic Optimization", "authors": [{"id": 176820, "fullname": "Jaroslav Moravec", "url": "http://cvpr.thecvf.com/api/miniconf/users/176820?format=json", "institution": "Czech Technical University in Prague, Faculty of Electrical Engineering"}, {"id": 188715, "fullname": "Radim Sara", "url": "http://cvpr.thecvf.com/api/miniconf/users/188715?format=json", "institution": "Czech Technical University in Prague, Faculty of EE"}, {"id": 77490, "fullname": "Akihiro Sugimoto", "url": "http://cvpr.thecvf.com/api/miniconf/users/77490?format=json", "institution": "NII"}], "abstract": "Reliable perception of autonomous systems relies on fusion of data from multiple sensors, which requires maintaining accurate geometric calibration during operation. This work aims to track the drift of the calibration parameters caused by mechanical stress, thermal effects, or minor accidents. We focus on five parameters of the essential matrix and propose TESO, whose core mechanisms are: 1) a robust loss function based on kernel correlation over tentative correspondences instead of robust matching and estimators, 2) an adaptive online stochastic optimization on the essential manifold. Both contribute to reduced CPU and memory requirements. TESO relies on a few hyperparameters and eliminates the need for data-driven training, enabling use in resource-constrained online perception systems. We evaluated TESO based on the geometric precision of the tracked extrinsic parameters, the rectification quality, and the stereo depth consistency with respect to a 3D LiDAR. In the large-scale MAN TruckScenes dataset, TESO tracks drift with 0.12\u00b0 precision in the rotation around Y, which is critical for stereo accuracy, while the other two rotation angles are tracked with five times better precision. Sequences with simulated drift are tracked with similar precision as the no-drift ones, suggesting that the tracker is unbiased. Applied to the KITTI dataset, TESO reported systematic inconsistencies in extrinsic parameters across all stereo pairs, confirming observations made by other authors. We verify that these errors were partly caused by intrinsic decalibration, which manifested in the contradictory performance of two metrics: The epipolar error and the depth estimation accuracy. With corrected calibration parameters, TESO improved its rotation precision around the hardest Y-axis by approximately twentyfold, reaching 0.025\u00b0. In the depth estimation, there was a fiftyfold improvement. Despite its lightweight nature, we show that the combination of SIFT features and the proposed TESO loss function achieves accuracy comparable to published single-frame methods that rely on neural network models.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40320", "url": null, "sourceid": -40355, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37972?format=json"], "related_events_ids": [37972]}, {"id": 36746, "uid": "6fddc5cb7045037050c550acf1b6d183", "name": "MotionEdit: Benchmarking and Learning Motion-Centric Image Editing", "authors": [{"id": 185774, "fullname": "Yixin Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185774?format=json", "institution": "University of California, Los Angeles"}, {"id": 70165, "fullname": "Lei Ke", "url": "http://cvpr.thecvf.com/api/miniconf/users/70165?format=json", "institution": "HKUST &amp; ETH Zurich"}, {"id": 185775, "fullname": "Wenhao Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185775?format=json", "institution": "Tencent AI Lab"}, {"id": 89393, "fullname": "Kai-Wei Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89393?format=json", "institution": "University of California, Los Angeles"}, {"id": 185776, "fullname": "Dong Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185776?format=json", "institution": "Capital One"}], "abstract": "We introduce **MotionEdit**, a novel dataset for motion-centric image editing\u2014the task of modifying subject actions and interactions while preserving identity, structure, and physical plausibility.Unlike existing image editing datasets that focus on static appearance changes or contain only sparse, low-quality motion edits, MotionEdit provides high-fidelity image pairs depicting realistic motion transformations extracted and verified from continuous videos. This new task is not only scientifically challenging but also practically significant, powering downstream applications such as frame-controlled video synthesis and animation.To evaluate model performance on the novel task, we introduce **MotionEdit-Bench**, a benchmark that challenges models on motion-centric edits and measures model performance with generative, discriminative, and preference-based metrics.Benchmark results reveal that motion editing remains highly challenging for existing state-of-the-art diffusion-based editing models.To address this gap, we propose **MotionNFT** (Motion-guided Negative-aware FineTuning), a post-training framework that computes motion alignment rewards based on how well the motion flow between input and model-edited images matches the ground-truth motion, guiding models toward accurate motion transformations.Extensive experiments on FLUX.1 Kontext and Qwen-Image-Edit show that MotionNFT consistently improves editing quality and motion fidelity of both base models on the motion editing task without sacrificing general editing ability, demonstrating its effectiveness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36746", "url": null, "sourceid": 32387, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37032, "uid": "abf71c3ebd03356b9baa2b0d3b77e64c", "name": "Language-Free Generative Editing from One Visual Example", "authors": [{"id": 186531, "fullname": "Omar Elezabi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186531?format=json", "institution": "University of W\u00fcrzburg"}, {"id": 153047, "fullname": "Eduard Zamfir", "url": "http://cvpr.thecvf.com/api/miniconf/users/153047?format=json", "institution": "Bayerische Julius-Maximilians-Universit\u00e4t W\u00fcrzburg"}, {"id": 128643, "fullname": "Zongwei Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128643?format=json", "institution": "University of W\u00fcrzburg Germany"}, {"id": 73509, "fullname": "Radu Timofte", "url": "http://cvpr.thecvf.com/api/miniconf/users/73509?format=json", "institution": "University of W\u00fcrzburg"}], "abstract": "Text-guided diffusion models have advanced image editing by enabling intuitive control through language.However, despite their strong capabilities, we surprisingly find that SOTA methods struggle with simple, everyday transformations such as rain or blur. We attribute this limitation to weak and inconsistent textual supervision during training, which leads to poor alignment between language and vision. Existing solutions often rely on extra finetuning or stronger text conditioning, but suffer from high data and computational requirements. We contend that the capability for diffusion-based editing is not lost but merely hidden from text.The door to cost-efficient visual editing remains open, and the key lies in a vision-centric paradigm that perceives and reasons about visual change as humans do, beyond words.Inspired by this, we introduce Visual Diffusion Conditioning (VDC), a training-free framework that learns conditioning signals directly from visual examples for precise, language-free image editing.Given a paired example\u2014one image with and one without the target effect\u2014VDC derives a visual condition that captures the transformation and steers generation through a novel condition-steering mechanism.An accompanying inversion-correction step mitigates reconstruction errors during DDIM inversion, preserving fine detail and realism.Across diverse tasks, VDC outperforms both training-free and fully fine-tuned text-based editing methods.Code and models will be publicly released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37032", "url": null, "sourceid": 43922, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37215, "uid": "1d332a2954b1ce91e36ee87a34e286b2", "name": "Emergent Extreme-View Geometry in 3D Foundation Models", "authors": [{"id": 186942, "fullname": "Yiwen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186942?format=json", "institution": "Cornell University"}, {"id": 146563, "fullname": "Joseph Tung", "url": "http://cvpr.thecvf.com/api/miniconf/users/146563?format=json", "institution": "New York University"}, {"id": 90264, "fullname": "Ruojin Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/90264?format=json", "institution": "Cornell University"}, {"id": 106416, "fullname": "David Fouhey", "url": "http://cvpr.thecvf.com/api/miniconf/users/106416?format=json", "institution": "New York University"}, {"id": 156128, "fullname": "Hadar Averbuch-Elor", "url": "http://cvpr.thecvf.com/api/miniconf/users/156128?format=json", "institution": "Department of Computer Science, Cornell University"}], "abstract": "3D foundation models (3DFMs) have recently transformed 3D vision, enabling joint prediction of depths, poses, and point maps directly from images. Yet their ability to reason under extreme, non-overlapping views remains largely unexplored. In this work, we study their internal representations and find that 3DFMs exhibit an emergent understanding of extreme-view geometry, despite never being trained for such conditions. To further enhance these capabilities, we introduce a lightweight alignment scheme that refines their internal 3D representation by tuning only a small subset of backbone bias terms, leaving all decoder heads frozen. This targeted adaptation substantially improves relative pose estimation under extreme viewpoints without degrading per-image depth or point quality. Additionally, we contribute MegaUnScene, a new benchmark of Internet scenes unseen by existing 3DFMs, with dedicated test splits for both relative pose estimation and dense 3D reconstruction. All code and data will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37215", "url": null, "sourceid": 31972, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39198, "uid": "92f75eaab29dd31e16b4b88cff13c6fb", "name": "Bidirectional Cross-Modal Prompting for Event-Frame Asymmetric Stereo", "authors": [{"id": 191563, "fullname": "Ninghui Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191563?format=json", "institution": "Southeast University"}, {"id": 75726, "fullname": "Fabio Tosi", "url": "http://cvpr.thecvf.com/api/miniconf/users/75726?format=json", "institution": "University of Bologna"}, {"id": 191564, "fullname": "Lihui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191564?format=json", "institution": "Southeast University"}, {"id": 142332, "fullname": "Jiawei Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/142332?format=json", "institution": "Beijing Institute of Technology"}, {"id": 152346, "fullname": "Luca Bartolomei", "url": "http://cvpr.thecvf.com/api/miniconf/users/152346?format=json", "institution": "University of Bologna"}, {"id": 181972, "fullname": "Zhiting Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/181972?format=json", "institution": "Hohai University"}, {"id": 87171, "fullname": "Matteo Poggi", "url": "http://cvpr.thecvf.com/api/miniconf/users/87171?format=json", "institution": "Universit\u00e0 di Bologna"}, {"id": 87188, "fullname": "Stefano Mattoccia", "url": "http://cvpr.thecvf.com/api/miniconf/users/87188?format=json", "institution": "University of Bologna"}], "abstract": "Conventional frame-based cameras capture rich contextual information but suffer from limited temporal resolution and motion blur in dynamic scenes. Event cameras offer an alternative visual representation with higher dynamic range free from such limitations. The complementary characteristics of the two modalities make event-frame asymmetric stereo promising for reliable 3D perception under fast motion and challenging illumination. However, the modality gap often leads to marginalization of domain-specific cues essential for cross-modal stereo matching. In this paper, we introduce Bi-CMPStereo, a novel bidirectional cross-modal prompting framework that fully exploits semantic and structural features from both domains for robust matching. Our approach learns finely aligned stereo representations within a target canonical space and integrates complementary representations by projecting each modality into both event and frame domains. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods in accuracy and generalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39198", "url": null, "sourceid": 44720, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39327, "uid": "8525d6d5e6d0b3c70790c9d45ec0f7d9", "name": "PAVAS: Physics-Aware Video-to-Audio Synthesis", "authors": [{"id": 77112, "fullname": "Oh Hyun-Bin", "url": "http://cvpr.thecvf.com/api/miniconf/users/77112?format=json", "institution": "POSTECH"}, {"id": 153170, "fullname": "Yuhta Takida", "url": "http://cvpr.thecvf.com/api/miniconf/users/153170?format=json", "institution": "Sony AI"}, {"id": 191860, "fullname": "Toshimitsu Uesaka", "url": "http://cvpr.thecvf.com/api/miniconf/users/191860?format=json", "institution": "Sony AI"}, {"id": 152617, "fullname": "Tae-Hyun Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/152617?format=json", "institution": "KAIST"}, {"id": 153173, "fullname": "Yuki Mitsufuji", "url": "http://cvpr.thecvf.com/api/miniconf/users/153173?format=json", "institution": "Sony AI"}], "abstract": "Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds. We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into a latent diffusion-based V2A generation through the Physics-Driven Audio Adapter (Phy-Adapter). The adapter receives object-level physical parameters estimated by the Physical Parameter Estimator (PPE), which uses a Vision Language Model (VLM) to infer the moving-object mass and a segmentation-based dynamic 3D reconstruction module to recover its motion trajectory for velocity computation. These physical cues enable the model to synthesize sounds that reflect underlying physical factors. To assess physical realism, we curate VGG-Impact, a benchmark focusing on object\u2013object interactions, and introduce Audio-Physics Correlation Coefficient (APCC), an evaluation metric that measures consistency between physical and auditory attributes. Comprehensive experiments show that PAVAS produces physically plausible and perceptually coherent audio, outperforming existing V2A models in both quantitative and qualitative evaluations.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39327", "url": null, "sourceid": 38566, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40364?format=json"], "related_events_ids": [40364]}, {"id": 37635, "uid": "441e62dcb5e64fcae728d47945c83eee", "name": "VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations", "authors": [{"id": 91745, "fullname": "Maitreya Patel", "url": "http://cvpr.thecvf.com/api/miniconf/users/91745?format=json", "institution": "Arizona State University"}, {"id": 152530, "fullname": "Jingtao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152530?format=json", "institution": "Sony AI"}, {"id": 128883, "fullname": "Weiming Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128883?format=json", "institution": "Sony Research"}, {"id": 187922, "fullname": "Yezhou Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187922?format=json", "institution": "Amazon; Arizona State University"}, {"id": 75477, "fullname": "Lingjuan Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75477?format=json", "institution": "Sony AI"}], "abstract": "We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to arbitrary resolutions and aspect ratios, narrowing the gap to diffusion models at scale. At its core is VibeToken, a novel resolution-agnostic 1D Transformer-based image tokenizer that encodes images into a dynamic, user-controllable sequence of 32\u2013256 tokens, achieving a state-of-the-art efficiency and performance trade-off. Building on VibeToken, we present VibeToken-Gen, a class-conditioned AR generator with out-of-the-box support for arbitrary resolutions while requiring significantly fewer compute resources. Notably, VibeToken-Gen synthesizes 1024$\\times$1024 images using only 64 tokens and achieves 3.94 gFID; by comparison, a diffusion-based state-of-the-art alternative requires 1,024 tokens and attains 5.87 gFID. In contrast to fixed-resolution AR models such as LlamaGen\u2014whose inference FLOPs grow quadratically with resolution ($\\approx$11T FLOPs at 1024$\\times$1024)\u2014VibeToken-Gen maintains a constant 179G FLOPs (63.4$\\times$ efficient) independent of resolution. We hope VibeToken can help unlock the wide adoption of AR visual generative models in production use cases.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37635", "url": null, "sourceid": 41722, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39310, "uid": "36c9498bc0750db3eabc4a9936cb2f6d", "name": "SpeeDe3DGS: Speedy Deformable 3D Gaussian Splatting with Temporal Pruning and Motion Grouping", "authors": [{"id": 153559, "fullname": "Allen Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153559?format=json", "institution": "Department of Computer Science             University of Maryland, College Park"}, {"id": 165063, "fullname": "Haiyang Ying", "url": "http://cvpr.thecvf.com/api/miniconf/users/165063?format=json", "institution": "University of Maryland, College Park"}, {"id": 153558, "fullname": "Alex Hanson", "url": "http://cvpr.thecvf.com/api/miniconf/users/153558?format=json", "institution": "University of Maryland College Park"}, {"id": 148858, "fullname": "Yonghan Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/148858?format=json", "institution": "University of Maryland, College Park"}, {"id": 85114, "fullname": "Tom Goldstein", "url": "http://cvpr.thecvf.com/api/miniconf/users/85114?format=json", "institution": "University of Maryland, College Park"}, {"id": 86740, "fullname": "Matthias Zwicker", "url": "http://cvpr.thecvf.com/api/miniconf/users/86740?format=json", "institution": "University of Maryland, College Park"}], "abstract": "Dynamic extensions of 3D Gaussian Splatting (3DGS) achieve high-quality reconstructions through neural motion fields, but per-Gaussian neural inference makes these models computationally expensive. Building on DeformableGS, we introduce Speedy Deformable 3D Gaussian Splatting (SpeeDe3DGS), which bridges this efficiency\u2013fidelity gap through three complementary modules: Temporal Sensitivity Pruning (TSP) removes low-impact Gaussians via temporally aggregated sensitivity analysis, Temporal Sensitivity Sampling (TSS) perturbs timestamps to suppress floaters and improve temporal coherence, and GroupFlow distills the learned deformation field into shared SE(3) transformations for efficient groupwise motion. On the 50 dynamic scenes in MonoDyGauBench, integrating TSP and TSS into DeformableGS accelerates rendering by 6.78$\\times$ on average while maintaining neural-field fidelity and using 10$\\times$ fewer primitives. Adding GroupFlow culminates in 13.71$\\times$ faster rendering and 2.53$\\times$ shorter training, surpassing all baselines in speed while preserving superior image quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39310", "url": null, "sourceid": 30914, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38006, "uid": "ca54ba6dcc222ed22aea916bee8d50ec", "name": "CG-Reasoner: Centroid-Guided Positional Reasoning Segmentation for Medical Imaging with a Robust Visual-Text Consistency Metric", "authors": [{"id": 188811, "fullname": "Lakshmikar Reddy Polamreddy", "url": "http://cvpr.thecvf.com/api/miniconf/users/188811?format=json", "institution": "Yeshiva University"}, {"id": 188812, "fullname": "Ming Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/188812?format=json", "institution": "Yeshiva University"}], "abstract": "Accurate and interpretable medical image segmentation remains a major challenge, as existing deep learning models primarily optimize pixel-level accuracy while overlooking positional reasoning\u2014an essential component for automated report generation and clinical interpretability. We introduce CG-Reasoner, a novel centroid-guided cross-modal framework that jointly performs medical image segmentation and positional reasoning. CG-Reasoner integrates a multimodal large language model (LLM), a newly designed light-weight encoder\u2013decoder architecture, and a Text2Centroid module that predicts lesion centroids from reasoning embeddings\u2014enabling the model to produce both accurate segmentation masks and spatially coherent, clinically meaningful reasoning explanations. Furthermore, we propose PRScore (Positional-Reasoning Score), a robust evaluation metric that jointly measures the spatial and semantic alignment between generated reasoning text and segmentation masks. Experiments on six medical datasets across different imaging modalities demonstrate that CG-Reasoner achieves state-of-the-art performance, offering precise segmentation, spatially coherent reasoning, and clinically interpretable visual-textual explanations within a unified framework. The source code is available at https://github.com/lpmm2025/CG-Reasoner.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38006", "url": null, "sourceid": 38481, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38748, "uid": "8b759a318cd60cb93817f36351384142", "name": "Thermal Diffusion Matters: Infrared Spatial-Temporal Video Super-Resolution through Heat Conduction Priors", "authors": [{"id": 181930, "fullname": "Mingxuan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/181930?format=json", "institution": "Beijing Institute of Technology"}, {"id": 190580, "fullname": "Shuang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190580?format=json", "institution": "Beihang University"}, {"id": 190581, "fullname": "Yutang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190581?format=json", "institution": "Beihang University; NCEPU"}, {"id": 130687, "fullname": "Jing Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/130687?format=json", "institution": "Beijing Institute of Technology"}, {"id": 190582, "fullname": "Yirui Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190582?format=json", "institution": "Beijing Institute of Technology"}, {"id": 183835, "fullname": "Jingxuan Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183835?format=json", "institution": "Imperial College London"}, {"id": 190583, "fullname": "Fuzhen Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190583?format=json", "institution": "Institute of Artificial Intelligence, Beihang University"}, {"id": 90097, "fullname": "Shuigen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90097?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "Infrared video acquisition inherently suffers from low spatial resolution and limited frame rates due to the physical constraints of thermal imaging sensors. These limitations make infrared video enhancement uniquely challenging, as it requires restoring spatial details and temporal continuity from highly undersampled thermal signals. To address this challenge, we propose `THERIS`, a unified **THER**mal-physics inspired framework for **I**nfrared spatial-temporal video **S**uper-resolution. Grounded in the physical principles of thermal diffusion, `THERIS` leverages heat conduction dynamics that govern the spatiotemporal evolution of infrared pixel intensities. Specifically, the proposed Thermal Diffusion Interpolation Module (TDIM) treats temporal feature sequences as one-dimensional heat fields and performs frequency-domain diffusion to synthesize temporally coherent intermediate frames. Building on this foundation, the Thermo-Aware State Space Module (TSSM) refines spatiotemporal representations through learnable spectral filtering and selective state-space modeling, while maintaining consistency guided by the thermodynamic prior inherited from TDIM. Additionally, a Temperature Field Modeling Loss is introduced to enforce adherence to the heat conduction equation, promoting temporal coherence and spatial stability in the generated results. Extensive experiments demonstrate that `THERIS` achieves state-of-the-art performance while producing visually coherent results. To facilitate further research in the infrared video processing domain, we also introduce **IRVAL**, a high-resolution dataset comprising 108,512 video frames at 512$\\times$512 resolution.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38748", "url": null, "sourceid": 37835, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40182, "uid": "9d1e3e755ef2d97e58d14db50656b596", "name": "Multi-Metric Representation Learning Strategy Based on Clustering for Fine-Grained Multimodal Sentiment Analysis", "authors": [{"id": 193735, "fullname": "Yidan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193735?format=json", "institution": "Hebei University"}, {"id": 180856, "fullname": "Zongheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180856?format=json", "institution": "Hebei University"}, {"id": 193736, "fullname": "Hong-Jie Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/193736?format=json", "institution": "Hebei University"}, {"id": 193737, "fullname": "Chun-Guo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193737?format=json", "institution": "Hebei University"}, {"id": 178400, "fullname": "Xiaoxiao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/178400?format=json", "institution": "Hebei University"}], "abstract": "Multimodal sentiment analysis (MSA) aims to identify human emotions through multimodal data. Despite considerable advances in MSA, we find that emotional class centers often overlap when integrating data from different modalities into the same representation space. In this paper, we propose a novel $\\textbf{M}$ulti-$\\textbf{M}$etric $\\textbf{R}$epresentation l$\\textbf{e}$arning $\\textbf{s}$trategy based on clus$\\textbf{t}$ering (MMRest) to alleviate this issue through flexible multi-metric representation learning, enabling the model to learn fine-grained sentiments. Specifically, we first design a module termed Multi-metric Multimodal learning on Clusters (MMC), which minimizes distances within similar sentiment pairs while maximizing dissimilar ones, aiming to learn a global metric and local metrics in each cluster from multimodal data. Afterwards, we develop a Projection and Decision-Level Fusion (PDLF) module, including two parts. One part utilizes the optimal global and local metrics to obtain a projection value. The other part combines the projection value with an intermediate score which is obtained through the fusion of unimodal and multimodal representations to obtain the final sentiment prediction score. Extensive experiments on the CMU-MOSI and CMU-MOSEI datasets demonstrate that our method is significantly superior to state-of-the-art methods on various evaluation indicators and parameter count, by effectively learning fine-grained emotional boundaries. The code will be made open-source if the paper is accepted.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40182", "url": null, "sourceid": 42275, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40364, "uid": "8525d6d5e6d0b3c70790c9d45ec0f7d9", "name": "PAVAS: Physics-Aware Video-to-Audio Synthesis", "authors": [{"id": 77112, "fullname": "Oh Hyun-Bin", "url": "http://cvpr.thecvf.com/api/miniconf/users/77112?format=json", "institution": "POSTECH"}, {"id": 153170, "fullname": "Yuhta Takida", "url": "http://cvpr.thecvf.com/api/miniconf/users/153170?format=json", "institution": "Sony AI"}, {"id": 191860, "fullname": "Toshimitsu Uesaka", "url": "http://cvpr.thecvf.com/api/miniconf/users/191860?format=json", "institution": "Sony AI"}, {"id": 152617, "fullname": "Tae-Hyun Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/152617?format=json", "institution": "KAIST"}, {"id": 153173, "fullname": "Yuki Mitsufuji", "url": "http://cvpr.thecvf.com/api/miniconf/users/153173?format=json", "institution": "Sony AI"}], "abstract": "Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds. We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into a latent diffusion-based V2A generation through the Physics-Driven Audio Adapter (Phy-Adapter). The adapter receives object-level physical parameters estimated by the Physical Parameter Estimator (PPE), which uses a Vision Language Model (VLM) to infer the moving-object mass and a segmentation-based dynamic 3D reconstruction module to recover its motion trajectory for velocity computation. These physical cues enable the model to synthesize sounds that reflect underlying physical factors. To assess physical realism, we curate VGG-Impact, a benchmark focusing on object\u2013object interactions, and introduce Audio-Physics Correlation Coefficient (APCC), an evaluation metric that measures consistency between physical and auditory attributes. Comprehensive experiments show that PAVAS produces physically plausible and perceptually coherent audio, outperforming existing V2A models in both quantitative and qualitative evaluations.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40364", "url": null, "sourceid": -38566, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39327?format=json"], "related_events_ids": [39327]}, {"id": 36621, "uid": "1359d70e095ad4d981176e31e3052170", "name": "ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning", "authors": [{"id": 185488, "fullname": "Shengyuan Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/185488?format=json", "institution": "Fudan University"}, {"id": 185489, "fullname": "Xinyu Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185489?format=json", "institution": "College of Computer Science and Technology, Zhejiang University"}, {"id": 185490, "fullname": "Ziyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185490?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 131138, "fullname": "Yuhang Zang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131138?format=json", "institution": "Nanyang Technological University"}, {"id": 152680, "fullname": "Yuhang Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/152680?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 154737, "fullname": "Xiangyu Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/154737?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 153062, "fullname": "Haodong Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153062?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 90594, "fullname": "Xiaoyi Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/90594?format=json", "institution": "Microsoft"}, {"id": 185491, "fullname": "Jianze Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185491?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 155031, "fullname": "Bin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155031?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 73990, "fullname": "Conghui He", "url": "http://cvpr.thecvf.com/api/miniconf/users/73990?format=json", "institution": "Shanghai AI Lab"}, {"id": 84911, "fullname": "Dahua Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/84911?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 77217, "fullname": "Jiaqi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77217?format=json", "institution": "Shanghai AI Laboratory"}], "abstract": "Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks.We present **ARM-Thinker**, an **A**gentic multimodal **R**eward **M**odel that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring.This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models.We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy.To evaluate agentic reward modeling, we introduce **ARMBench-VL**, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification).ARM-Thinker achieves +16.2\\% average improvement on reward modeling benchmarks, +9.6\\% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks.Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36621", "url": null, "sourceid": 38036, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36600, "uid": "f630930295f2102fb56edc9f88de45fb", "name": "Den-TP: Density-Balanced Data Curation and Evaluation Framework for Trajectory Prediction", "authors": [{"id": 180123, "fullname": "Ruining Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180123?format=json", "institution": "Northeastern University"}, {"id": 90021, "fullname": "Yi Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90021?format=json", "institution": "Robert Bosch LLC"}, {"id": 86434, "fullname": "Yun Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86434?format=json", "institution": "Northeastern University"}, {"id": 185440, "fullname": "Lili Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/185440?format=json", "institution": "Northeastern University"}], "abstract": "Trajectory prediction in autonomous driving has traditionally been studied from a model-centric perspective. However, existing datasets exhibit a strong long-tail distribution in scenario density, the number of agents per scenario, where common low-density cases dominate and safety-critical high-density cases are severely underrepresented. This imbalance limits model robustness and hides failure modes when standard evaluations average errors across all scenarios. We revisit trajectory prediction from a data-centric angle and present Den-TP, a framework for density-aware dataset curation and evaluation. Den-TP first partitions data into density-conditioned regions using agent count as a lightweight, dataset-agnostic proxy for interaction complexity. It then applies gradient-based utilities with a submodular selection objective to choose representative samples within each region while explicitly rebalancing across densities. The resulting subset reduces dataset size by 50\\% yet preserves overall performance and significantly improves robustness in high-density scenarios. We further introduce density-conditioned evaluation protocols that reveal long-tail failure modes overlooked by conventional metrics. Experiments on Argoverse 1 and 2 with state-of-the-art models show that robust trajectory prediction hinges not only on data scale, but also on balancing scenario density.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36600", "url": null, "sourceid": 32856, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37872, "uid": "3e9ac32bbcbe9c84d3c740a92783f692", "name": "MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models", "authors": [{"id": 188451, "fullname": "Yifan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188451?format=json", "institution": "Beihang University"}, {"id": 86271, "fullname": "Chao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86271?format=json", "institution": "Beijing Institute of Technology"}, {"id": 188452, "fullname": "Ruifei Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/188452?format=json", "institution": "Beijing Digital Native Digital City Research Center"}, {"id": 188453, "fullname": "Fei Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188453?format=json", "institution": "Beijing Digital Native Digital City Research Center"}, {"id": 156338, "fullname": "Zhifei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156338?format=json", "institution": "Peking University"}, {"id": 188454, "fullname": "Jiaxing Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188454?format=json", "institution": "Beihang University"}, {"id": 188455, "fullname": "Zhipeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188455?format=json", "institution": "Beijing University of Posts and Telecommunications"}], "abstract": "The new era has witnessed a remarkable capability to extend Vision-Language Models (VLMs) for tackling tasks of video understanding. While current VLMs excel at event- or story-level understanding, their ability to capture fine-grained motion details remains limited, primarily due to their focus on high-level static semantic structures and macro-event logic. In contrast, Video Diffusion Models (VDMs) are adept at modeling dynamic motion patterns, benefiting from large-scale video data and the intrinsic requirement of temporal generation. In this paper, we introduce MotionEnhancer, a novel approach that leverages motion priors distilled from a powerful video diffusion model as auxiliary supervision to enhance the motion understanding capability of a VLM via attention alignment. MotionEnhancer comprises two simple parameter-free modules, Motion-sensitive Head Selection (MHS) and Motion-salient Text Token Identification (MTTI), to directly extract and optimize motion-related attentions from the VDM in a computation-only manner. MotionEnhancer provides a scalable solution for motion understanding without additional training parameters, modifications to existing architectures, or tool calling. Extensive experiments demonstrate that MotionEnhancer can achieve consistent improvements over state-of-the-art VLMs on two motion-level video understanding benchmarks, especially on motion-related metrics.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37872", "url": null, "sourceid": 40306, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40315, "uid": "35ae540ac24d774598bdf2bcfb3e5421", "name": "Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models", "authors": [{"id": 188411, "fullname": "Tao Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188411?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 188412, "fullname": "Huili Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188412?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 188413, "fullname": "Yuanhong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188413?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 188414, "fullname": "Wendan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188414?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 188415, "fullname": "Lianchao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188415?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 188416, "fullname": "Jinrui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188416?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 188417, "fullname": "Zichen Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188417?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 91196, "fullname": "Shangguang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91196?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 188418, "fullname": "Yongfeng Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188418?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "The rapid advancement of diffusion-based image generation models has raised serious concerns regarding potential copyright and privacy infringements involving human-created data.Membership inference attacks (MIAs) have emerged as a promising tool for identifying unauthorized data usage during model training.Existing methods typically assess the ability of model to denoise perturbed suspect images as an indicator of membership status.However, the discriminative power of such features is highly dependent on the degree of model memorization and deteriorates significantly when applied to less exposed data (e.g., pre-training data).Although several methods attempt to enhance detection by leveraging internal model features, these features are generally inaccessible in mainstream closed-source image generation platforms, limiting their practicality.In this paper, we demonstrate that analyzing how a black-box diffusion model denoises a target image and corresponding perturbed textual instructions can reveal more distinctive membership cues. Based on this insight, we propose a black-box membership inference attack framework (named SD-MIA) that leverages a cross-modal data perturbation mechanism to detect pre-training data in diffusion models.We conduct extensive experiments on both a public benchmark dataset and a newly constructed dataset, each comprising pre-training membership and non-membership samples with identical distributions. Experimental results demonstrate that SD-MIA achieves superior performance compared to existing baselines, including those with the unfair advantage of accessing internal model features.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40315", "url": null, "sourceid": -40135, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37854?format=json"], "related_events_ids": [37854]}, {"id": 38104, "uid": "ae0ecbe3e82ed7eb0c1066059d9513df", "name": "Bridging the Modality Gap in Compositional Zero-Shot Learning via Sparse Alignment and Unimodal Memory Bank", "authors": [{"id": 189068, "fullname": "Yang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189068?format=json", "institution": "Beijing Jiaotong University"}, {"id": 184672, "fullname": "Zhixiang Chi", "url": "http://cvpr.thecvf.com/api/miniconf/users/184672?format=json", "institution": "University of Toronto"}, {"id": 189069, "fullname": "Xudong Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189069?format=json", "institution": "Beijing Jiaotong University"}, {"id": 134864, "fullname": "Yang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/134864?format=json", "institution": "Concordia University"}, {"id": 155710, "fullname": "Songhe Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/155710?format=json", "institution": "Beijing Jiaotong University"}], "abstract": "Compositional Zero-Shot Learning (CZSL) aims to recognize unseen attribute-object compositions with learned primitives (attribute and object) knowledge from seen compositions. While previous approaches gain their notable performance through the powerful cross-modal alignment of CLIP, they often overlook the modality gap, an inherent constraint stemming from information-imbalanced training data. In this work, we propose SAM, a novel $\\underline{\\text{S}}$parse $\\underline{\\text{A}}$lignment and Unimoal $\\underline{\\text{M}}$emory Bank to effectively bridging modality gap for CZSL. Specifically, we conduct $\\textbf{\\textit{sparse}}$ $\\textbf{\\textit{alignment}}$ that links textual representations directly to their semantically pertinent visual patches. This direct linking serves to prune redundant visual data and counter the information imbalance in image-text pairs. Subsequently, with the sparsely aligned visual information as its guidance, the $\\textbf{\\textit{visual}}$ $\\textbf{\\textit{adaptive}}$ $\\textbf{\\textit{condensation}}$ module adaptively fuses these critical cues into a unified representation. Finally, we introduce a $\\textbf{\\textit{dynamically}}$ $\\textbf{\\textit{updated}}$  $\\textbf{\\textit{memory}}$ $\\textbf{\\textit{ bank}}$ that stores samples from both seen and unseen compositions. This bank serves a dual purpose: it bypasses the modality gap through visual-only classification and concurrently strengthens generalization to unseen compositions. Experiments on three benchmarks demonstrate that our method gains significant improvements over CLIP-based methods under closed-world and open-world settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38104", "url": null, "sourceid": 45060, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39635, "uid": "20affd577bdf871ee8acfc834622791f", "name": "VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction", "authors": [{"id": 90152, "fullname": "Zhiwen Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/90152?format=json", "institution": "University of Texas, Austin"}, {"id": 186074, "fullname": "Jian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186074?format=json", "institution": "Xiamen University"}, {"id": 192534, "fullname": "Renjie Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192534?format=json", "institution": "Texas A&amp;M University - College Station"}, {"id": 192535, "fullname": "Junge Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192535?format=json", "institution": "University of California, Riverside"}, {"id": 192536, "fullname": "Runjin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192536?format=json", "institution": "Anthropic; University of Texas at Austin"}, {"id": 165078, "fullname": "Hezhen Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/165078?format=json", "institution": "UT Austin"}, {"id": 152584, "fullname": "Kevin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152584?format=json", "institution": "University of Texas at Austin"}, {"id": 90134, "fullname": "Peihao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90134?format=json", "institution": "University of Texas, Austin"}, {"id": 192537, "fullname": "Huaizhi Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192537?format=json", "institution": "Department of Computer Science, University of North Carolina at Chapel Hill"}, {"id": 85881, "fullname": "Shijie Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/85881?format=json", "institution": "Google"}, {"id": 154362, "fullname": "Dilin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154362?format=json", "institution": "Meta"}, {"id": 151654, "fullname": "Zhicheng Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/151654?format=json", "institution": null}, {"id": 156979, "fullname": "Hongyu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156979?format=json", "institution": "Meta                  Reality Labs"}, {"id": 192538, "fullname": "Justin Theiss", "url": "http://cvpr.thecvf.com/api/miniconf/users/192538?format=json", "institution": "Meta Reality Labs"}, {"id": 128595, "fullname": "Tianlong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/128595?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 156898, "fullname": "Jiachen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/156898?format=json", "institution": "University of California, Riverside"}, {"id": 155027, "fullname": "Zhengzhong Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155027?format=json", "institution": "Texas A&amp;M University - College Station"}, {"id": 75636, "fullname": "Zhangyang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75636?format=json", "institution": "University of Texas at Austin"}, {"id": 73496, "fullname": "Rakesh Ranjan", "url": "http://cvpr.thecvf.com/api/miniconf/users/73496?format=json", "institution": "Meta"}], "abstract": "The rapid advancement of Large Multimodal Models (LMMs) for 2D images and videos has sparked interest in extending these models to 3D scenes, with the goal of human-like visual-spatial intelligence. However, achieving deep spatial understanding comparable to human capabilities remains challenging for both model design and data acquisition. Existing methods often rely on external depth sensors for geometry capture or off-the-shelf algorithms for pre-constructing 3D maps, which limits their scalability.In this work, we introduce VLM-3R, a framework for Vision-Language Models that couples 3D reconstructive instruction tuning with scalable training data curation and a new benchmark for temporal reasoning. Specifically, VLM-3R processes monocular video frames with a geometry encoder that derives implicit 3D tokens representing scene context (spatial tokens) and camera motion (view tokens). In parallel, we build a scalable data creation pipeline with over 200K 3D reconstructive instruction-tuning question-answer pairs. To evaluate temporal reasoning, we further introduce the Vision-Spatial-Temporal Intelligence benchmark (VSTI-Bench), which contains over 138.6K question-answer pairs across five distinct tasks focused on evolving spatial relationships. Extensive experiments show that VLM-3R supports robust visual-spatial reasoning and improves the understanding of temporal 3D context changes, enabling monocular 3D spatial assistance and embodied reasoning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39635", "url": null, "sourceid": 32588, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37037, "uid": "d805b3977c018e4f8d30c5e6b0e59b42", "name": "Breaking Spurious Correlations: Uncertainty-Driven Causal Transformers for AU Detection", "authors": [{"id": 186538, "fullname": "Yuru Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186538?format=json", "institution": "Northeast Normal University"}, {"id": 186539, "fullname": "Yue Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186539?format=json", "institution": "Northeast Normal University"}], "abstract": "Facial Action Unit (AU) detection suffers from limited annotated data, severe class imbalance, and label noise, which often result in overfitting and degraded performance. We propose a novel framework that synergizes Uncertainty-aware Transformers with Causal Intervention to address these challenges. By modeling attention weights as Gaussian distributions, our probabilistic Transformer captures inter-AU dependencies and epistemic uncertainty. An uncertainty-guided loss weighting strategy further mitigates data imbalance by adaptively emphasizing reliable predictions. Moreover, a causal intervention module is introduced to eliminate spurious correlations caused by confounders, ensuring that the learned AU relationships reflect true causality. Extensive experiments on BP4D and DISFA demonstrate that our framework achieves state-of-the-art performance with superior robustness and generalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37037", "url": null, "sourceid": 40972, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38680, "uid": "acc44efd90f4fa281caf23e53e7227e6", "name": "Sky2Ground: A Benchmark for Site Modeling under Varying Altitude", "authors": [{"id": 190450, "fullname": "Zengyan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190450?format=json", "institution": "University of Central Florida"}, {"id": 190451, "fullname": "Sirshapan Mitra", "url": "http://cvpr.thecvf.com/api/miniconf/users/190451?format=json", "institution": "University of Central Florida"}, {"id": 141986, "fullname": "rajat modi", "url": "http://cvpr.thecvf.com/api/miniconf/users/141986?format=json", "institution": "Self"}, {"id": 190452, "fullname": "Hui Xian Grace Lim", "url": "http://cvpr.thecvf.com/api/miniconf/users/190452?format=json", "institution": "University of Central Florida"}, {"id": 135103, "fullname": "Yogesh Rawat", "url": "http://cvpr.thecvf.com/api/miniconf/users/135103?format=json", "institution": "University of Central Florida"}], "abstract": "In this work, we propose the problem of localizing cameras and producing renders of a scene, given multiple images captured from ground/aerial/satellite viewpoints. We introduce a dataset called Sky2Ground, which contains synthetic/real images across all 3 viewpoints, along with camera parameters, and dense depth-maps/surface-normals. Recent works have shown that transformer-based nets like VGGT are capable of inferring scene-parameters in a single-forward pass. However, we formally reveal that simply fine-tuning such models reduces performance, and can't be solved simply by bruteforce-scaling. We find the culprit to be satellite images, which inject too much noise during the learning process. Therefore, we propose SkyNet to enable learning using satellite-images. SkyNet is a two-stream neural-net, with one stream explicitly processing satellite, and another processing all modalities together.We propose a restricted-attention mechanism, termed as `Masked-Satellite-Attention' which prevents ground/aerial images from interacting with satellite images. Further, our SkyNet is optimized with strategies inspired from curriculum-learning: sampling cameras which are far-away from each other during training. Extensive experiments on our Sky2Earth dataset reveal that SkyNet outperforms existing methods by $23\\%$ in terms of absolute performance. Our dataset, and code shall be made publicly available on huggingface.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38680", "url": "https://sky2ground2026.github.io/sky2ground/", "sourceid": 40594, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38463, "uid": "7cc2794f14963ceaedb29d19875c950c", "name": "UniDef: Universal Defense Against Unauthorized Image Manipulation", "authors": [{"id": 152222, "fullname": "Mingwen Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/152222?format=json", "institution": "China University of Petroleum"}, {"id": 156225, "fullname": "Lingzhuang Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/156225?format=json", "institution": "China University of Petroleum (East China)"}, {"id": 188849, "fullname": "Xiang Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/188849?format=json", "institution": "China University of Petroleum (East China)"}, {"id": 189906, "fullname": "Mengyao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189906?format=json", "institution": "China University of Petroleum"}, {"id": 188848, "fullname": "Xinyuan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188848?format=json", "institution": "China University of Petroleum"}, {"id": 188847, "fullname": "Qiao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188847?format=json", "institution": "China University of Petroleum(East China)"}, {"id": 188850, "fullname": "Chang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188850?format=json", "institution": "China University of Petroleum"}, {"id": 72006, "fullname": "Yuanjian Qiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/72006?format=json", "institution": "China University of Petroleum (East China)"}, {"id": 75721, "fullname": "Chao Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/75721?format=json", "institution": "SIAT"}], "abstract": "Image protection against unauthorized diffusion-based editing has achieved encouraging progress. However, existing methods face two critical limitations: (1) They only disturb the denoising direction at local step, resulting in generated images still retaining original or edited semantics. (2) Their optimization rely heavily on model-specific gradient, limiting transferable protection across different models and tasks. To address these challenges, we propose a Universal Defense (UniDef) framework for protection against unauthorized image manipulation. Specifically, we first discover that different variants of diffusion models tend to pursue a consistent distribution objective during complete denoising process. Based on this discovery, we design Consistent Distribution Deviation strategy to perturb the diffusion direction at the global denoising, thereby disrupting the overall image semantics. Furthermore, to mitigate model dependency, we devise a Finite Difference-based Jacobian Estimation module to approximate the global gradient in a model-agnostic manner, thus ensuring more transferable protection. Benefiting from the above designs, our method yields generated images no longer preserve the image semantic while possessing excellent generalization. Extensive experiments demonstrate that our UniDef not only outperforms existing methods, but also exhibits universal protection across diverse models and tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38463", "url": null, "sourceid": 41532, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38024, "uid": "78cfc36b921a50fba024eca72d6a458e", "name": "Wavelet-Driven 3D Anomaly Detection under Pose-Agnostic and Sparse-View", "authors": [{"id": 152222, "fullname": "Mingwen Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/152222?format=json", "institution": "China University of Petroleum"}, {"id": 188847, "fullname": "Qiao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188847?format=json", "institution": "China University of Petroleum(East China)"}, {"id": 188848, "fullname": "Xinyuan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188848?format=json", "institution": "China University of Petroleum"}, {"id": 188849, "fullname": "Xiang Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/188849?format=json", "institution": "China University of Petroleum (East China)"}, {"id": 156225, "fullname": "Lingzhuang Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/156225?format=json", "institution": "China University of Petroleum (East China)"}, {"id": 188850, "fullname": "Chang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188850?format=json", "institution": "China University of Petroleum"}, {"id": 188851, "fullname": "Qinglin Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188851?format=json", "institution": "China University of Petroleum (East China)"}, {"id": 188852, "fullname": "Ling Jian", "url": "http://cvpr.thecvf.com/api/miniconf/users/188852?format=json", "institution": "China University of Petroleum"}], "abstract": "Pose-agnostic anomaly detection (PAD) achieves strong performance in localizing anomalies from arbitrary viewpoints when trained on densely sampled normal data. However, under sparse-view conditions, existing methods face two key challenges: (1) sparse observations lead to overfitting and geometric detail loss in 3D reconstruction; (2) limited visual cues lead to inaccurate pose estimation, compromising the reliability of subsequent anomaly localization. To address these challenges, we propose Wave-Pose3D, a wavelet-driven 3D anomaly detection framework tailored for PAD under sparse-view conditions. First, we design a structure-aware and wavelet-optimized Gaussian modeling strategy that dynamically filters unreliable regions via structural priors to mitigate overfitting and leverages high-frequency supervision to restore fine-grained geometric details. Second, to improve pose estimation under sparse views, we develop a wavelet-based pose estimator that integrates low-frequency structural cues and high-frequency details to enhance both initialization and refinement accuracy. Finally, we introduce a wavelet difference-aware anomaly detector that computes frequency-domain anomaly scores, improving localization robustness against pose and geometric variations. By integrating these strategies, Wave-Pose3D achieves robust and accurate anomaly localization under sparse views. Extensive experiments validate that the proposed approach achieves state-of-the-art performance under 10\\% and 20\\% sparse-view configurations.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38024", "url": null, "sourceid": 30816, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39764, "uid": "c4ded2b85cc5be82fa1d2464eba9a7d3", "name": "Dual-Estimator: Decoupling Global and Local Semantic Shift for Drift Compensation in Class-Incremental Learning", "authors": [{"id": 192808, "fullname": "Fankang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192808?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 192809, "fullname": "Lu Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192809?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 128187, "fullname": "Yanpeng Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/128187?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 156039, "fullname": "Shiyu Xuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/156039?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 181619, "fullname": "Zechao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181619?format=json", "institution": "Nanjing University of Science and Techonolgy"}], "abstract": "Continual Learning (CL) provides an effective paradigm for acquiring new knowledge, and the principle of learning without retaining past samples has led to exemplar-free CL that better matches practical conditions. However, a key challenge is the semantic shift, which requires reliable activation of past class representations to align with the current feature space. While drift compensation acts as the activator, it commonly assumes uniform semantic distributions and shifts, which is unrealistic for random data streams. For this, we propose the Dual-Estimator (Dual-E) to decouple global and local semantic shifts, addressing both issues of non-uniformity. Specifically, to address intra-task non-uniform semantic distributions that limit effective compensation for low-frequency semantics, Dual-E incorporates a mixture-of-experts estimator comprising multiple networks that model semantic shifts across diverse local representation spaces. For inter-task non-uniformity in semantic shifts, where uniform full-scale compensation potentially overlooks the varying degrees of semantic change across classes, Dual-E employs a low-rank estimator with an embedded low-rank network that prioritizes global semantic trends for classes exhibiting larger shifts. Dual-E leverages analytical solutions to update within a few epochs, enabling efficient plug-in integration with existing exemplar-free methods. Extensive experiments on diverse datasets demonstrate the advantages of Dual-E over state-of-the-art approaches. The code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39764", "url": null, "sourceid": 45678, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36874, "uid": "c16ff7580f9bf8085fa25dbafbc8c83a", "name": "SpatialStack: Layered Geometry-Semantic Fusion for 3D VLM Spatial Reasoning", "authors": [{"id": 186074, "fullname": "Jian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186074?format=json", "institution": "Xiamen University"}, {"id": 85881, "fullname": "Shijie Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/85881?format=json", "institution": "Google"}, {"id": 174267, "fullname": "Bangya LIU", "url": "http://cvpr.thecvf.com/api/miniconf/users/174267?format=json", "institution": "University of Wisconsin-Madison"}, {"id": 75637, "fullname": "Achuta Kadambi", "url": "http://cvpr.thecvf.com/api/miniconf/users/75637?format=json", "institution": "UCLA"}, {"id": 90152, "fullname": "Zhiwen Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/90152?format=json", "institution": "University of Texas, Austin"}], "abstract": "Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. Moving beyond conventional late-stage vision-geometry fusion, SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, enabling the model to capture both local geometric precision and global contextual semantics. Building upon this framework, we develop VLM-SpatialStack, a model that achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks. Extensive experiments and ablations demonstrate that our multi-level fusion strategy consistently enhances 3D understanding and generalizes robustly across diverse spatial reasoning tasks, establishing SpatialStack as an effective and extensible design paradigm for vision-language-geometry integration in next-generation multimodal physcial AI systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36874", "url": null, "sourceid": 42240, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39063, "uid": "22c64ea90b762e830ec7019dcfe43fd2", "name": "MaskDexGrasp: Generative Masked Modeling for Part-Aware Dexterous Grasp Synthesis", "authors": [{"id": 182804, "fullname": "Binghui Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/182804?format=json", "institution": "Southeast University"}, {"id": 191281, "fullname": "Lin Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191281?format=json", "institution": "Southeast University"}, {"id": 191282, "fullname": "Haoxuan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191282?format=json", "institution": "Southeast University"}, {"id": 191283, "fullname": "Jianan Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191283?format=json", "institution": "Southeast University"}, {"id": 191284, "fullname": "ZhiPeng Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191284?format=json", "institution": "Southeast University"}, {"id": 191285, "fullname": "Zekai Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191285?format=json", "institution": "Southeast University"}, {"id": 90020, "fullname": "Yangang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90020?format=json", "institution": "Southeast University"}], "abstract": "Dexterous grasp generation is a predominant task that enables robots to perform human-level manipulation. However, a dexterous hand always maintains high-dimensional DoF and actuation space, making existing approaches that rely on holistic latent representations difficult to produce high-quality and semantically aligned grasps. In this paper, we propose MaskDexGrasp to address these challenges. We first present a part-aware grasp tokenizer that decomposes dexterous grasps into discrete tokens, facilitating compositional modeling of anatomical dependencies. Building upon this representation, a bidirectional masked grasp transformer is then developed to predict grasp tokens conditioned on object geometry and task description, ensuring coherent grasp generation while allowing fine-grained part-level editing. To facilitate evaluation, we construct a dexterous grasp dataset that comprises 64K grasping instances and 256K richly annotated descriptions covering 11 tasks. Comprehensive experiments demonstrate that our method achieves the state-of-the-art performance. Our code and dataset will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39063", "url": null, "sourceid": 35228, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65717, "file": "/media/PosterPDFs/CVPR%202026/39063-thumb.png", "modified": "2026-04-21T05:53:25.123471-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": false, "sortkey": 0, "is_live_content": false, "detailed_kind": "thumb", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65718, "file": "/media/PosterPDFs/CVPR%202026/39063.png", "modified": "2026-04-21T05:53:52.077360-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": true, "sortkey": 0, "is_live_content": false, "detailed_kind": "", "generated_from": null, "resourcetype": "EventmediaImageFile"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39318, "uid": "30c3d18de32265d9a0510345fd6dac12", "name": "Hierarchical Visual Relocalization with Nearest View Synthesis from Feature Gaussian Splatting", "authors": [{"id": 191836, "fullname": "Huaqi Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191836?format=json", "institution": "Southern University of Science and Technology"}, {"id": 191837, "fullname": "Bingxi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191837?format=json", "institution": "Southern University of Science and Technology; Pengcheng Laboratory"}, {"id": 172384, "fullname": "Guangcheng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/172384?format=json", "institution": "Southern University of Science and Technology"}, {"id": 153787, "fullname": "Fulin Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153787?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 191838, "fullname": "Li He", "url": "http://cvpr.thecvf.com/api/miniconf/users/191838?format=json", "institution": "Southern University of Science and Technology"}, {"id": 73725, "fullname": "Hong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73725?format=json", "institution": "Southern University of Science and Technology"}], "abstract": "Visual relocalization is a fundamental task in the field of 3D computer vision, estimating a camera\u2019s pose when it revisits a previously known scene. While point-based hierarchical localization methods have shown strong scalability and efficiency, they are often limited by sparse image observations and weak feature matching. In this work, we propose SplatHLoc, a novel hierarchical visual relocalization framework that uses Feature Gaussian Splatting as the scene representation. For feature matching, we observe that Gaussian-rendered features and those extracted directly from images exhibit different strengths across the two-stage matching process: the former performs better in the coarse stage, while the latter proves more effective in the fine stage. Therefore, we introduce a hybrid feature matching strategy, enabling more accurate and efficient pose estimation. Extensive experiments on both indoor and outdoor datasets show that SplatHLoc significantly enhances the robustness of visual relocalization, setting a new state-of-the-art. The code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39318", "url": null, "sourceid": 34087, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36260, "uid": "380300f2977be97431f05aacf79c59d8", "name": "Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From  an Information Bottleneck Perspective", "authors": [{"id": 180624, "fullname": "Kaifang Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/180624?format=json", "institution": "Northeastern University"}, {"id": 72418, "fullname": "Lianbo Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/72418?format=json", "institution": "Northeastern University"}, {"id": 184601, "fullname": "Jiaqi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184601?format=json", "institution": "Independent researcher"}, {"id": 184602, "fullname": "liming liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184602?format=json", "institution": "Northeastern University"}, {"id": 184603, "fullname": "Guoyang Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/184603?format=json", "institution": "CATL"}], "abstract": "The quest for incremental unified multimodal anomaly detection seeks to empower a single model with the ability to systematically detect anomalies across all categories and support incremental learning to accommodate emerging objects/categories. Central to this pursuit is resolving the catastrophic forgetting dilemma, which involves acquiring new knowledge while preserving prior learned knowledge. Despite some efforts to address this dilemma, a key oversight persists: ignoring the potential impact of spurious and redundant features on catastrophic forgetting. In this paper, we delve into the negative effect of spurious and redundant features on this dilemma in incremental unified frameworks, and reveal that under similar conditions, the multimodal framework developed by naive aggregation of unimodal architectures is more prone to forgetting. To address this issue, we introduce a novel denoising framework called IB-IUMAD, which exploits the complementary benefits of the Mamba decoder and information bottleneck fusion module:  the former dedicated to disentangle inter-object feature coupling, preventing spurious feature interference between objects; the latter serves to filter out redundant features from the fused features, thus explicitly preserving discriminative information. A series of theoretical analyses and experiments on MVTec 3D-AD and Eyecandies datasets demonstrates the effectiveness and competitive performance of IB-IUMAD. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36260", "url": null, "sourceid": 39066, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40004, "uid": "c2cb1fc76f1bbf73ce680c2b78aa328f", "name": "SeeLe: A Unified Acceleration Framework for Real-Time Gaussian Splatting on Mobile Devices", "authors": [{"id": 182098, "fullname": "He Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182098?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 174817, "fullname": "Xiaotong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174817?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 193278, "fullname": "Zihan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193278?format=json", "institution": null}, {"id": 193279, "fullname": "Weikai Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/193279?format=json", "institution": "University of Rochester"}, {"id": 71043, "fullname": "Xiaohong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/71043?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 158262, "fullname": "Zhezhi He", "url": "http://cvpr.thecvf.com/api/miniconf/users/158262?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 188735, "fullname": "Jingwen Leng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188735?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 88218, "fullname": "Minyi Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/88218?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 183787, "fullname": "Yu Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/183787?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "3D Gaussian Splatting (3DGS) has become a crucial rendering technique for many real-time applications. However, the limited hardware resources on today's mobile platforms hinder these applications, as they struggle to achieve real-time performance. In this paper, we propose Seele, a general framework designed to accelerate the 3DGS pipeline for resource-constrained mobile devices.Specifically, we propose two GPU-oriented techniques: hybrid preprocessing and contribution-aware rasterization.Hybrid preprocessing alleviates the GPU compute and memory pressure by reducing the number of irrelevant Gaussians during rendering.The key is to combine our view-dependent scene representation with online filtering. Meanwhile, contribution-aware rasterization improves the GPU utilization at the rasterization stage by prioritizing Gaussians with high contributions while reducing computations for those with low contributions.Both techniques can be seamlessly integrated into existing 3DGS pipelines with minimal fine-tuning.Collectively, our framework achieves up to 6.3$\\times$ speedup and 39.1\\% model reduction while achieving superior rendering quality compared to existing methods.Our codes will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40004", "url": null, "sourceid": 31329, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39846, "uid": "cf8890bb0b8c1a05eb02afe90d7322ac", "name": "Fusion of Depth and Semantic for Probabilistic Floorplan Localization", "authors": [{"id": 179885, "fullname": "Kecheng Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/179885?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 192964, "fullname": "Mao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192964?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 174773, "fullname": "Xiangkai Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174773?format=json", "institution": "Institute of Automation\uff0cChinese Academy of Sciences"}, {"id": 127662, "fullname": "Xu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127662?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}], "abstract": "Floorplan localization aims to estimate the camera pose of a query image with respect to a 2D floorplan, providing a lightweight and long-term stable alternative to localization based on 3D maps or large image databases for indoor robotics and AR. Recent methods frame the problem as ray-based matching, representing the image as a set of rays annotated with depth or semantic labels and aligning them with the floorplan. However, they still face challenges in addressing the complexity of indoor environments, which can be decomposed into environmental, geometric, and semantic ambiguities.To address these ambiguities, we propose a floorplan-aware probabilistic fusion framework that models both depth and semantic information within a unified architecture. Our framework also combines a distribution-based ray confidence estimator, which down-weights uncertain geometric hypotheses, with a probabilistic semantic matching scheme based on Jensen\u2013Shannon divergence (JSD), which preserves and leverages informative semantic ambiguity instead of collapsing it into hard labels. Experiments on challenging benchmarks demonstrate that our approach significantly outperforms prior methods in both robustness and accuracy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39846", "url": null, "sourceid": 38224, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39791, "uid": "5de89937c146d16084648eea9295aa35", "name": "LDP-Slicing: Local Differential Privacy for Images via Randomized Bit-Plane Slicing", "authors": [{"id": 180249, "fullname": "Yuanming Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180249?format=json", "institution": "McMaster university"}, {"id": 192864, "fullname": "Chengqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192864?format=json", "institution": "McMaster University"}, {"id": 192865, "fullname": "Wenbo He", "url": "http://cvpr.thecvf.com/api/miniconf/users/192865?format=json", "institution": "McMaster University"}], "abstract": "Local Differential Privacy (LDP) is the gold standard trust model for privacy-preserving machine learning by guaranteeing privacy at the data source. However, its application to image data has long been considered impractical due to the high dimensionality of pixel space. Canonical LDP mechanisms are designed for low-dimensional data, resulting in severe utility degradation when applied to high-dimensional pixel spaces. This paper demonstrates that this utility loss is not inherent to LDP, but from its application to an inappropriate data representation. We introduce LDP-Slicing, a lightweight, training-free framework that resolves this domain mismatch. Our key insight is to decompose pixel values into a sequence of binary bit-planes. This transformation allows us to apply the LDP mechanism directly to the bit-level representation. To further strengthen privacy and preserve utility, we integrate a perceptual obfuscation module that mitigates human-perceivable leakage and an optimization-based privacy budget allocation strategy. This pipeline satisfies rigorous pixel-level $\\varepsilon$-LDP while producing images that retain high utility for downstream tasks. Extensive experiments on face recognition and image classification demonstrate that LDP-Slicing outperforms existing DP/LDP baselines under comparable privacy budgets, with negligible computational overhead.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39791", "url": null, "sourceid": 38060, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40383?format=json"], "related_events_ids": [40383]}, {"id": 40383, "uid": "5de89937c146d16084648eea9295aa35", "name": "LDP-Slicing: Local Differential Privacy for Images via Randomized Bit-Plane Slicing", "authors": [{"id": 180249, "fullname": "Yuanming Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180249?format=json", "institution": "McMaster university"}, {"id": 192864, "fullname": "Chengqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192864?format=json", "institution": "McMaster University"}, {"id": 192865, "fullname": "Wenbo He", "url": "http://cvpr.thecvf.com/api/miniconf/users/192865?format=json", "institution": "McMaster University"}], "abstract": "Local Differential Privacy (LDP) is the gold standard trust model for privacy-preserving machine learning by guaranteeing privacy at the data source. However, its application to image data has long been considered impractical due to the high dimensionality of pixel space. Canonical LDP mechanisms are designed for low-dimensional data, resulting in severe utility degradation when applied to high-dimensional pixel spaces. This paper demonstrates that this utility loss is not inherent to LDP, but from its application to an inappropriate data representation. We introduce LDP-Slicing, a lightweight, training-free framework that resolves this domain mismatch. Our key insight is to decompose pixel values into a sequence of binary bit-planes. This transformation allows us to apply the LDP mechanism directly to the bit-level representation. To further strengthen privacy and preserve utility, we integrate a perceptual obfuscation module that mitigates human-perceivable leakage and an optimization-based privacy budget allocation strategy. This pipeline satisfies rigorous pixel-level $\\varepsilon$-LDP while producing images that retain high utility for downstream tasks. Extensive experiments on face recognition and image classification demonstrate that LDP-Slicing outperforms existing DP/LDP baselines under comparable privacy budgets, with negligible computational overhead.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40383", "url": null, "sourceid": -38060, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39791?format=json"], "related_events_ids": [39791]}, {"id": 38752, "uid": "7f945d34e708a2a6a192697de248fd77", "name": "Volumetric Functional Maps", "authors": [{"id": 190586, "fullname": "Filippo Maggioli", "url": "http://cvpr.thecvf.com/api/miniconf/users/190586?format=json", "institution": "Pegaso University"}, {"id": 152433, "fullname": "Simone Melzi", "url": "http://cvpr.thecvf.com/api/miniconf/users/152433?format=json", "institution": "University of Milan - Bicocca"}, {"id": 190587, "fullname": "Marco Livesu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190587?format=json", "institution": "CNR IMATI"}], "abstract": "The computation of volumetric correspondences between 3D shapes is a prominent tool for medical and industrial applications. In this work, we pave the way for spectral volume mapping, extending for the first time the functional maps framework from the surface to the volumetric setting. We show that the eigenfunctions of the volumetric Laplace operator define a functional space that is suitable for high-quality signal transfer. We also experiment with various techniques that edit this functional space, porting them to volume domains. We validate our method on novel volumetric datasets and on tetrahedralizations of well-established surface datasets, also showcasing practical applications involving both discrete and continuous signal mapping, for segmentation transfer, mesh connectivity transfer, and solid texturing. Last but not least, we show that considering the volumetric spectrum greatly improves the accuracy for classical shape matching tasks among surfaces, consistently outperforming existing surface-only spectral methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38752", "url": null, "sourceid": 32350, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39568, "uid": "ed1d02c1bbc76fdf5fefe5d6d14e9821", "name": "GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation", "authors": [{"id": 88627, "fullname": "Jiahao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88627?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 132274, "fullname": "Zihan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/132274?format=json", "institution": "National University of Singapore"}, {"id": 69204, "fullname": "Xiangyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/69204?format=json", "institution": "Institue of Computing Technology, Chinese Academy of Sciences"}, {"id": 186273, "fullname": "Xing Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186273?format=json", "institution": "Ant Group"}, {"id": 88128, "fullname": "Yujun Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88128?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 91631, "fullname": "Yinghao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/91631?format=json", "institution": "Stanford University"}, {"id": 85108, "fullname": "Shuqiang Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85108?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}], "abstract": "Despite significant progress in Vision-Language Navigation (VLN), existing approaches still rely on dense RGB videos that produce excessive patch tokens and lack explicit spatial structure, resulting in substantial computational overhead and limited spatial reasoning. To address these issues, we introduce the Geometry-Aware BEV (GA-BEV) \u2014a compact, 3D-grounded feature representation that integrates both explicit and implicit geometric cues into multimodal large language model (MLLM)\u2013based navigation systems. We construct BEV spatial maps from RGB-D inputs by projecting visual features into 3D space and aggregating them into an agent-centric layout that preserves geometric consistency while reducing token redundancy. To further enrich geometric understanding, we incorporate features from a pretrained 3D foundation model into the BEV space, injecting structural priors learned from large-scale 3D reconstruction tasks. Together, these complementary cues\u2014explicit depth-based projection and implicit learned priors\u2014yield compact yet spatially expressive representations that substantially improve navigation efficiency and performance. Experiments show that our method achieves state-of-the-art results using only navigation data, without DAgger augmentation or mixed VQA training, demonstrating the robustness and data efficiency of the proposed GA-VLN framework.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39568", "url": null, "sourceid": 33108, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36831, "uid": "676a50b58679afc0e573776d7cbb3835", "name": "From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition", "authors": [{"id": 104666, "fullname": "Jingxi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/104666?format=json", "institution": "University of Maryland College Park"}, {"id": 185977, "fullname": "Yixiao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185977?format=json", "institution": "Amazon"}, {"id": 185978, "fullname": "Xiaoye qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/185978?format=json", "institution": null}, {"id": 126703, "fullname": "Zongxia Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/126703?format=json", "institution": "University of Maryland, College Park"}, {"id": 136019, "fullname": "Cornelia Fermuller", "url": "http://cvpr.thecvf.com/api/miniconf/users/136019?format=json", "institution": "University of Maryland, College Park"}, {"id": 185979, "fullname": "Caren Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185979?format=json", "institution": "Amazon"}, {"id": 85403, "fullname": "Yiannis Aloimonos", "url": "http://cvpr.thecvf.com/api/miniconf/users/85403?format=json", "institution": "University of Maryland, College Park"}], "abstract": "Images can be viewed as layered compositions, foreground objects over background, with potential occlusions. This layered representation enables independent editing of elements, offering greater flexibility for content creation. Despite the progress in large generative models, decomposing a single image into layers remains challenging due to limited methods and data. We observe a strong connection between layer decomposition and in/outpainting tasks, and propose adapting a diffusion-based inpainting model for layer decomposition using lightweight finetuning. To further preserve detail in the latent space, we introduce a novel multi-modality context fusion module with linear attention complexity. Our model is trained purely on a synthetic dataset constructed from open-source assets and achieves superior performance in object removal and occlusion recovery, unlocking new possibilities in downstream editing and creative applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36831", "url": null, "sourceid": 44351, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36604, "uid": "f72838a46139d9e8c1d71c9204376a88", "name": "Interpretable Prompts made Edit-Friendly: Token-to-Token Similarity Reduction in dLLMs for Edit-Friendly Hard Prompt Inversion", "authors": [{"id": 185447, "fullname": "Naresh Kumar Devulapally", "url": "http://cvpr.thecvf.com/api/miniconf/users/185447?format=json", "institution": "State University of New York at Buffalo"}, {"id": 92390, "fullname": "Shruti Agarwal", "url": "http://cvpr.thecvf.com/api/miniconf/users/92390?format=json", "institution": "Adobe"}, {"id": 76133, "fullname": "Vishal Asnani", "url": "http://cvpr.thecvf.com/api/miniconf/users/76133?format=json", "institution": "Michigan State University"}, {"id": 138364, "fullname": "Vishnu Lokhande", "url": "http://cvpr.thecvf.com/api/miniconf/users/138364?format=json", "institution": "University at Buffalo, State University of New York"}], "abstract": "Crafting prompts via Prompt Engineering that steer a model\u2019s internal representations toward specific and pre-defined outcomes can be time-consuming, often requiring multiple iterations. Hard Prompt Inversion offers a complementary workflow: start from a reference image and generate a prompt that conditions a text-to-image (T2I) model to reconstruct the reference image. Existing inversion methods either yield incoherent text, or produce prompts that are overly sensitive to downstream token edits. We propose a dLLM-based prompt inversion framework that yield prompts that are (i) more interpretable to humans, (ii) better aligned with the reference image, and (iii) designed for downstream token swap and token append operations (aka edit-friendly prompts). The method is plug-and-play, requiring no finetuning of either the T2I model or the dLLM. Experiments across three datasets show a $\\sim10\\times$ reduction in inversion time relative to existing prompt-inversion baselines, higher interpretability scores, and significantly higher prompt editability, as measured by TIFA, GPT-V preference scoring, and controlled user studies, all while preserving high-fidelity image generation. By coupling diffusion-time sampling with token-similarity control inside a dLLM decoder, our approach extends prompt inversion beyond reconstruction to downstream token-editing tasks, enabling faster, more transferable prompts that generalize across multiple T2I models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36604", "url": null, "sourceid": 39332, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37146, "uid": "7155eaa13b3893b816315471556dae55", "name": "Rethinking Dataset Distillation: Hard Truths about Soft Labels", "authors": [{"id": 186771, "fullname": "Priyam Dey", "url": "http://cvpr.thecvf.com/api/miniconf/users/186771?format=json", "institution": "Indian Institute of Science, Indian institute of science, Bangalore"}, {"id": 186772, "fullname": "Aditya Sahdev", "url": "http://cvpr.thecvf.com/api/miniconf/users/186772?format=json", "institution": "Indian Institute of Science"}, {"id": 132728, "fullname": "Sunny Bhati", "url": "http://cvpr.thecvf.com/api/miniconf/users/132728?format=json", "institution": "University of Maryland Baltimore County"}, {"id": 87943, "fullname": "Konda Reddy Mopuri", "url": "http://cvpr.thecvf.com/api/miniconf/users/87943?format=json", "institution": "Indian Institute of Technology Hyderabad"}, {"id": 76920, "fullname": "R. Venkatesh Babu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76920?format=json", "institution": "Indian Institute of Science"}], "abstract": "Despite the perceived success of large-scale dataset distillation (DD) methods, recent evidence \\cite{qin2024a} finds that simple random image baselines perform on-par with state-of-the-art DD methods like SRe2L \\cite{yin2024squeezerecoverrelabeldataset} due to the use of soft labels during downstream model training. This is in contrast with the findings in coreset literature, where high-quality coresets consistently outperform random subsets in the hard-label (HL) setting. To understand this discrepancy, we perform a detailed scalability analysis to examine the role of data quality under different label regimes, ranging from abundant soft labels (termed as SL+KD regime) to fixed soft labels (SL) and hard labels (HL). Our analysis reveals that high-quality coresets fail to convincingly outperform the random baseline in both SL and SL+KD regimes. In the SL+KD setting, performance further approaches near-optimal levels relative to the full dataset, regardless of subset size or quality, for a given compute budget. This performance saturation calls into question the widespread practice of using soft labels for model evaluation, where unlike the HL setting, subset quality has negligible influence. A subsequent systematic evaluation of five large-scale and four small-scale DD methods in the HL setting reveals that only RDED \\cite{sun2024diversityrealismdistilleddataset} reliably outperforms random baselines on ImageNet-1K, but can still lag behind strong coreset methods due to its over-reliance on easy sample patches. Based on this, we introduce CAD-Prune, a compute-aware pruning metric that efficiently identifies samples of optimal difficulty for a given compute budget, and use it to develop CA2D, a compute-aligned DD method, outperforming current DD methods on ImageNet-1K at various IPC settings. Together, our findings uncover many insights into current DD research and establish useful tools to advance data-efficient learning for both coresets and DD.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37146", "url": null, "sourceid": 40982, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40285?format=json"], "related_events_ids": [40285]}, {"id": 39355, "uid": "12db1dbf2200bc9c5f1d04c69ae04ff0", "name": "LiDAS: Lighting-driven Dynamic Active Sensing for Nighttime Perception", "authors": [{"id": 180472, "fullname": "Simon de Moreau", "url": "http://cvpr.thecvf.com/api/miniconf/users/180472?format=json", "institution": "Axanea"}, {"id": 87148, "fullname": "Andrei Bursuc", "url": "http://cvpr.thecvf.com/api/miniconf/users/87148?format=json", "institution": "valeo.ai"}, {"id": 191912, "fullname": "Hafid EL IDRISSI", "url": "http://cvpr.thecvf.com/api/miniconf/users/191912?format=json", "institution": "Valeo"}, {"id": 191913, "fullname": "Fabien Moutarde", "url": "http://cvpr.thecvf.com/api/miniconf/users/191913?format=json", "institution": "MinesParis PSL"}], "abstract": "Nighttime environments pose significant challenges for camera-based perception, as existing methods passively rely on the scene lighting.We introduce Lighting-driven Dynamic Active Sensing (LiDAS), a closed-loop active illumination system that combines off-the-shelf visual perception models with high\u2011definition headlights. Rather than uniformly brightening the scene, LiDAS dynamically predicts an optimal illumination field that maximizes downstream perception performance, i.e., decreasing light on empty areas to reallocate it on object regions. LiDAS enables zero-shot nighttime generalization of daytime-trained models through adaptive illumination control. Trained on synthetic data and deployed zero\u2011shot in real\u2011world closed\u2011loop driving scenarios, LiDAS enables +18.7% mAP50 and +5.0% mIoU over standard low\u2011beam at equal power. It maintains performances while reducing energy use by 40%. LiDAS complements domain\u2011generalization methods, further strengthening robustness without retraining. By turning readily available headlights into active vision actuators, LiDAS offers a cost\u2011effective solution to robust nighttime perception.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39355", "url": null, "sourceid": 41119, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39288, "uid": "d1fcdf15cb97c47d0ed1e1e10773ae36", "name": "AKCMamba-YOLO: Selective State Space Models For Real-Time Object Detection", "authors": [{"id": 191774, "fullname": "Long Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191774?format=json", "institution": "East China Jiaotong University"}, {"id": 191775, "fullname": "Hui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191775?format=json", "institution": "East China Jiao Tong University"}, {"id": 191776, "fullname": "Man Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191776?format=json", "institution": "Shanghai University of Electric Power"}, {"id": 191777, "fullname": "Zexuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191777?format=json", "institution": "Shanghai University of Electric Power"}, {"id": 184076, "fullname": "Zizhu Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184076?format=json", "institution": "Shanghai University of Electric Power"}], "abstract": "The YOLO (You Only Look Once) series has been a cornerstone in real-time object detection, renowned for its efficient convolutional design and rapid inference. However, its reliance on convolutional operations inherently limits its ability to capture long-range dependencies and rich contextual information, leading to suboptimal performance in complex scenes. Recently, SSM (State Space Models) have emerged as an efficient alternative to attention mechanisms, offering global representation with linear time complexity. In this paper, we propose AKCMamba-YOLO, a novel object detector that incorporates SSM into the YOLO architecture. We introduce 3CAKCMamba and 4CAKCMamba modules to a novel object detection framework, enabling enhanced channel interaction and cross-layer semantic fusion. This design improves multi-scale feature modeling while maintaining computational efficiency. To support safety-critical applications, we provide railway pedestrian Detection datasets with 2,975 annotated images under complex scenarios. Experiments on COCO2017, power tower foreign object detection datasets, and our custom dataset show that AKCMamba-YOLO achieves superior accuracy and speed compared to state-of-the-art baselines, making it well-suited for real-time and resource-constrained environments.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39288", "url": null, "sourceid": 39097, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36519, "uid": "4b97422dd3861c217c68285a92cd89bc", "name": "Representing 3D Faces with Learnable B-Spline Volumes", "authors": [{"id": 185259, "fullname": "Prashanth Chandran", "url": "http://cvpr.thecvf.com/api/miniconf/users/185259?format=json", "institution": "Google"}, {"id": 95402, "fullname": "Daoye Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/95402?format=json", "institution": "Google"}, {"id": 88338, "fullname": "Timo Bolkart", "url": "http://cvpr.thecvf.com/api/miniconf/users/88338?format=json", "institution": "Google"}], "abstract": "We present CUBE (Control-based Unified B-Splinie Encoding), a new geometric representation for digital humans that combines B-Spline volumes with learned features, and demonstrate its use as decoder for 3D scan registration and monocular 3D face reconstruction. Unlike existing B-Spline representations that use 3D control points, CUBE is parametrized by a lattice (e.g., $8 \\times 8 \\times 8$) of high-dimensional control features, increasing the models' expressivity. These control features define a continuous mapping from a 3D parametric domain to 3D Euclidean space through an intermediate feature space, which is evaluated in two stages. First, high-dimensional control features are locally blended using the B-Spline bases, yielding a high-dimensional feature vector, where the first three values are the 3D coordinates of a coarse base mesh. This feature vector is input to a small MLP to predict a residual from the base shape, resulting in refined 3D point coordinates. To reconstruct 3D surfaces in dense semantic correspondence, we query CUBE at 3D coordinates sampled from a fixed template mesh. Crucially, CUBE retains the local support of traditional B-spline representations, enabling us to locally edit the surface by updating individual control features. We demonstrate the strengths of this representation by training two transformer-based encoders to predict CUBE's control features from unstructured point clouds and monocular images, achieving state-of-the-art scan registration results compared to recent geometric and multi-view baselines.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36519", "url": null, "sourceid": 43885, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40285, "uid": "7155eaa13b3893b816315471556dae55", "name": "Rethinking Dataset Distillation: Hard Truths about Soft Labels", "authors": [{"id": 186771, "fullname": "Priyam Dey", "url": "http://cvpr.thecvf.com/api/miniconf/users/186771?format=json", "institution": "Indian Institute of Science, Indian institute of science, Bangalore"}, {"id": 186772, "fullname": "Aditya Sahdev", "url": "http://cvpr.thecvf.com/api/miniconf/users/186772?format=json", "institution": "Indian Institute of Science"}, {"id": 132728, "fullname": "Sunny Bhati", "url": "http://cvpr.thecvf.com/api/miniconf/users/132728?format=json", "institution": "University of Maryland Baltimore County"}, {"id": 87943, "fullname": "Konda Reddy Mopuri", "url": "http://cvpr.thecvf.com/api/miniconf/users/87943?format=json", "institution": "Indian Institute of Technology Hyderabad"}, {"id": 76920, "fullname": "R. Venkatesh Babu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76920?format=json", "institution": "Indian Institute of Science"}], "abstract": "Despite the perceived success of large-scale dataset distillation (DD) methods, recent evidence \\cite{qin2024a} finds that simple random image baselines perform on-par with state-of-the-art DD methods like SRe2L \\cite{yin2024squeezerecoverrelabeldataset} due to the use of soft labels during downstream model training. This is in contrast with the findings in coreset literature, where high-quality coresets consistently outperform random subsets in the hard-label (HL) setting. To understand this discrepancy, we perform a detailed scalability analysis to examine the role of data quality under different label regimes, ranging from abundant soft labels (termed as SL+KD regime) to fixed soft labels (SL) and hard labels (HL). Our analysis reveals that high-quality coresets fail to convincingly outperform the random baseline in both SL and SL+KD regimes. In the SL+KD setting, performance further approaches near-optimal levels relative to the full dataset, regardless of subset size or quality, for a given compute budget. This performance saturation calls into question the widespread practice of using soft labels for model evaluation, where unlike the HL setting, subset quality has negligible influence. A subsequent systematic evaluation of five large-scale and four small-scale DD methods in the HL setting reveals that only RDED \\cite{sun2024diversityrealismdistilleddataset} reliably outperforms random baselines on ImageNet-1K, but can still lag behind strong coreset methods due to its over-reliance on easy sample patches. Based on this, we introduce CAD-Prune, a compute-aware pruning metric that efficiently identifies samples of optimal difficulty for a given compute budget, and use it to develop CA2D, a compute-aligned DD method, outperforming current DD methods on ImageNet-1K at various IPC settings. Together, our findings uncover many insights into current DD research and establish useful tools to advance data-efficient learning for both coresets and DD.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40285", "url": null, "sourceid": -40982, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37146?format=json"], "related_events_ids": [37146]}, {"id": 39586, "uid": "4ecb8876b622f561d9d13161071f518c", "name": "Hist2Style: Histogram-Guided Stylization with Bilateral Grids", "authors": [{"id": 192413, "fullname": "Dekel Galor", "url": "http://cvpr.thecvf.com/api/miniconf/users/192413?format=json", "institution": "UC Berkeley"}, {"id": 153005, "fullname": "Adam Pikielny", "url": "http://cvpr.thecvf.com/api/miniconf/users/153005?format=json", "institution": "Adobe Systems"}, {"id": 129904, "fullname": "Zhoutong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129904?format=json", "institution": "Adobe Systems"}, {"id": 86862, "fullname": "Ke Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86862?format=json", "institution": "Electrical Engineering and Computer Sciences, University of California, Berkeley"}, {"id": 164935, "fullname": "Laura Waller", "url": "http://cvpr.thecvf.com/api/miniconf/users/164935?format=json", "institution": "University of California, Berkeley"}, {"id": 192414, "fullname": "Jiawen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192414?format=json", "institution": "Adobe Inc."}, {"id": 192415, "fullname": "Ilya Chugunov", "url": "http://cvpr.thecvf.com/api/miniconf/users/192415?format=json", "institution": "Adobe Systems"}], "abstract": "Photorealistic style transfer aims to match the color and tone of an input image to that of a style target while preserving the content and details of the original scene. Although existing large image models can facilitate these kinds of appearance edits, their high computational demands, potential for hallucinations, and limited user control make them unsuitable for high-resolution, real-time workflows. We introduce Hist2Style, a bilateral-grid formulation for fast, edge-aware stylization that preserves visual fidelity by constraining operations to locally affine transforms in bilateral space. Our model is trained to reproduce the spatially varying color edits available in larger image editing models. This training paradigm involves generating a large supervised corpus with language and vision-language models and distilling a high-capacity editor into a lightweight model. The model conditions on a histogram-based embedding of the style target, which provides an interpretable interface for adjusting the output style by modifying the target color distribution. Overall, Hist2Style maintains content structure by construction, avoids hallucinations, and supports real-time, high-resolution photorealistic stylization with interactive user-controllable color and tone adjustments.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39586", "url": null, "sourceid": 31755, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38269, "uid": "a8e7bf81e2bc2a1832617ebaa73df373", "name": "Hear What You See: Video-to-Audio Generation with Diffusion Transformer and Semantic-Temporal Alignment-Ranked Direct Preference Optimization", "authors": [{"id": 133684, "fullname": "Kai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/133684?format=json", "institution": "University of Toronto"}, {"id": 184255, "fullname": "Tao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184255?format=json", "institution": "Fudan University"}, {"id": 189464, "fullname": "jiayi lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/189464?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 189465, "fullname": "Jing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189465?format=json", "institution": "Southeast University; CECloud Computing Technology Co., Ltd."}, {"id": 189466, "fullname": "Jinman Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189466?format=json", "institution": "Department of Computer Science, University of Toronto"}, {"id": 136675, "fullname": "Weiguo Pian", "url": "http://cvpr.thecvf.com/api/miniconf/users/136675?format=json", "institution": "The University of Texas at Dallas"}, {"id": 86119, "fullname": "Yuan Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86119?format=json", "institution": "Fudan University"}, {"id": 87904, "fullname": "Yapeng Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/87904?format=json", "institution": "University of Texas at Dallas"}, {"id": 90197, "fullname": "Peng Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/90197?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 76614, "fullname": "Bin Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76614?format=json", "institution": "Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences"}, {"id": 77291, "fullname": "Yihao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/77291?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 189467, "fullname": "Dimitrios Hatzinakos", "url": "http://cvpr.thecvf.com/api/miniconf/users/189467?format=json", "institution": "University of Toronto"}, {"id": 186111, "fullname": "Yuewen Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186111?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}], "abstract": "Generating high-fidelity audio that is both semantically meaningful and temporally synchronized with silent videos remains a challenging problem in video-to-audio generation. Existing approaches often fail to capture fine-grained temporal correspondence between visual events and audio dynamics, leading to unrealistic or desynchronized outputs. To address these limitations, we propose VisioSonic, a Video-Aligned Sound generation framework that unifies flow-matching diffusion and preference-guided alignment. VisioSonic introduces a multimodal conditioning module that jointly leverages video frames and textual cues to provide semantic and frame-level temporal guidance. A co-attention diffusion transformer efficiently fuses visual and audio representations, enabling content-aware sound synthesis with minimal computation costs. To further enhance alignment beyond supervised training, we introduce Semantic-Temporal Alignment Ranked Direct Preference Optimization (STAR-DPO), a novel preference-learning paradigm that automatically generates audio candidates,ranks them based on both semantic and temporal alignment, and subsequently fine-tunes the diffusion model using the derived preference pairs. Extensive experiments on various benchmarks demonstrate that VisioSonic achieves state-of-the-art audio-video synchronization and audio fidelity while using the fewest trainable parameters among competing approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38269", "url": null, "sourceid": 41517, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36673, "uid": "f8cd71b7b469f15bd03947f8493a7259", "name": "Prototype-Guided Concept Erasure in Diffusion Models", "authors": [{"id": 181235, "fullname": "Yuze Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/181235?format=json", "institution": "Fudan University"}, {"id": 133284, "fullname": "Jiahao Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/133284?format=json", "institution": "National University of Singapore"}, {"id": 185616, "fullname": "Hongxiang Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185616?format=json", "institution": "Fudan University"}, {"id": 185617, "fullname": "Yichao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185617?format=json", "institution": "Fudan University"}, {"id": 185618, "fullname": "Hong Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185618?format=json", "institution": "Fudan University"}], "abstract": "Concept erasure is extensively utilized in image generation to prevent text-to-image models from generating undesired content.    Existing methods can effectively erase narrow concepts that are specific and concrete, such as distinct intellectual properties (e.g. Pikachu) or recognizable characters (e.g. Elon Musk).    However, their performance degrades on broad concepts such as ''sexual'' or ''violent'', whose wide scope and multi-faceted nature make them difficult to erase reliably.To overcome this limitation, we exploit the model's intrinsic embedding geometry to identify latent embeddings that encode a given concept. By clustering these embeddings, we derive a set of \\textbf{concept prototypes} that summarize the model's internal representations of the concept, and employ them as negative conditioning signals during inference to achieve precise and reliable erasure.  Extensive experiments across multiple benchmarks show that our approach achieves substantially more reliable removal of broad concepts while preserving overall image quality, marking a step towards safer and more controllable image generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36673", "url": null, "sourceid": 32896, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39917, "uid": "f7d39620fd98ccb30f1c7741fe158210", "name": "AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video", "authors": [{"id": 193110, "fullname": "Yogesh Kulkarni", "url": "http://cvpr.thecvf.com/api/miniconf/users/193110?format=json", "institution": "Arizona State University | Intern (Adobe Research)"}, {"id": 153065, "fullname": "Pooyan Fazli", "url": "http://cvpr.thecvf.com/api/miniconf/users/153065?format=json", "institution": "Arizona State University"}], "abstract": "Multimodal reasoning over long-horizon video is challenging due to the need for precise spatiotemporal fusion and alignment across modalities. While recent methods such as Group Relative Policy Optimization (GRPO) have shown promise in this domain, they suffer from three key limitations: (1) data inefficiency from their on-policy design, (2) a vanishing advantage problem, where identical or near-identical rewards within a group eliminate the learning signal by producing zero-valued advantages, and (3) uniform credit assignment that fails to emphasize critical reasoning steps.We introduce $\\textbf{AVATAR}$ ($\\textbf{A}$udio-$\\textbf{V}$ideo $\\textbf{A}$gen$\\textbf{t}$ for $\\textbf{A}$lignment and $\\textbf{R}$easoning), a framework that addresses these limitations through two core components: (1) an off-policy training architecture that improves sample efficiency and resolves vanishing advantages by reusing past experiences with greater reward diversity, and (2) Temporal Advantage Shaping (TAS), a novel credit assignment strategy that upweights key reasoning phases during learning.$\\textbf{AVATAR}$ achieves strong performance across various benchmarks, outperforming the Qwen2.5-Omni baseline by $\\textbf{+5.4}$ on MMVU, $\\textbf{+4.9}$ on OmniBench, and $\\textbf{+4.5}$ on Video-Holmes, while demonstrating $\\textbf{5$\\times$ sample efficiency}$, requiring $80$% fewer generated completions to reach target performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39917", "url": "https://people-robots.github.io/AVATAR/", "sourceid": 34630, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36479, "uid": "488be271582755444ab15d06f5d49c12", "name": "Region-Aware Instance Consistency Learning for Micro-Expression Recognition", "authors": [{"id": 181251, "fullname": "Yaomin Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/181251?format=json", "institution": "South China University of Technology"}, {"id": 155062, "fullname": "C.L.Philip Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/155062?format=json", "institution": "South China University of Technology"}, {"id": 185149, "fullname": "Shiting Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185149?format=json", "institution": "South China University of Technology"}, {"id": 185150, "fullname": "Haiqi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185150?format=json", "institution": "South China University of Technology"}, {"id": 130660, "fullname": "Tong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130660?format=json", "institution": "South China University of Technology"}], "abstract": "Micro-expression Recognition (MER) is challenging due to the subtle motion. Existing methods heavily rely on the onset/apex pair to capture the most discriminative motion clues. This paradigm struggles with labor-intensive apex annotation and effective utilization of data. In this paper, we propose a novel paradigm for MER that eliminates the need for expensive apex annotations while effectively capturing subtle motion dynamics. Our key insight is that frames within the sequence exhibit spatially consistent and intensity varied motion cues relative to the onset frame. Motivated by this, our method treats each sequence as a set of multiple onset/near-median motion instances. To fully exploit weaker motion information conveyed by these varied instances, our framework introduces an Instance Region Consistency (IRC) module that enforces visual attention consistency on similar facial activation regions across different instances within the same set. Furthermore, we present a Multi-Region Discovery (MRD) module with self-supervised learning to expand attention on more subtle activation regions which are typically neglected. Extensive experiments on four public micro-expression datasets demonstrate that our proposed approach surpasses state-of-the-art methods without using any apex frame annotations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36479", "url": null, "sourceid": 45763, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39524, "uid": "0d8b45d95dc4bd9712dc67f87db79a8e", "name": "TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding", "authors": [{"id": 188095, "fullname": "Boshen Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188095?format=json", "institution": "Renmin University of China"}, {"id": 192260, "fullname": "Zihan Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192260?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 152287, "fullname": "Jiaze Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152287?format=json", "institution": "Zhejiang University"}, {"id": 188097, "fullname": "Jianzhong Ju", "url": "http://cvpr.thecvf.com/api/miniconf/users/188097?format=json", "institution": "Xiaomi Corporation"}, {"id": 188098, "fullname": "Zhenbo Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188098?format=json", "institution": "Xiaomi Corporation"}, {"id": 188099, "fullname": "Jian Luan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188099?format=json", "institution": "Xiaomi Corporation"}, {"id": 76490, "fullname": "Qin Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/76490?format=json", "institution": "Renmin University of China"}], "abstract": "We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding.Processing hour-long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from visual tokens to textual tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending input length.  We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability.  This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures. All code and model weights will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39524", "url": null, "sourceid": 42839, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39991, "uid": "b9e702b9182246d9ad08bef0c2229f6b", "name": "Mind the Hitch: Dynamic Calibration and Articulated Perception for Autonomous Trucks", "authors": [{"id": 166128, "fullname": "morui zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/166128?format=json", "institution": "University of North Texas"}, {"id": 166047, "fullname": "Yongqi Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/166047?format=json", "institution": "University of North Texas"}, {"id": 193255, "fullname": "Song Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193255?format=json", "institution": "University of North Texas, Denton"}, {"id": 138683, "fullname": "Qing Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/138683?format=json", "institution": "University of North Texas, Denton"}], "abstract": "Autonomous trucking poses unique challenges due to articulated tractor\u2013trailer geometry, and time-varying sensor poses caused by the fifth-wheel joint and trailer flex. Existing perception and calibration methods assume static baselines or rely on high-parallax and texture-rich scenes, limiting their reliability under real-world settings. We propose dCAP (dynamic Calibration and Articulated Perception), a vision-based framework that continuously estimates the 6-DoF (degree of freedom) relative pose between tractor and trailer cameras. dCAP employs a transformer with cross-view and temporal attention to robustly aggregate spatial cues while maintaining temporal consistency, enabling accurate perception under rapid articulation and occlusion. Integrated with BEVFormer, dCAP improves 3D object detection by replacing static calibration with dynamically predicted extrinsics. To facilitate evaluation, we introduce STT4AT, a CARLA-based benchmark simulating semi-trailer trucks with synchronized multi-sensor suites and time-varying inter-rig geometry across diverse environments. Experiments demonstrate that dCAP achieves stable, accurate perception while addressing the limitations of static calibration in autonomous trucking. The dataset, development kit, and source code will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39991", "url": null, "sourceid": 31721, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36610, "uid": "0490cd3a67ba409154e4126ecb3329e8", "name": "Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control", "authors": [{"id": 182531, "fullname": "Chenxi Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/182531?format=json", "institution": "Westlake University"}, {"id": 185461, "fullname": "Yanming Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185461?format=json", "institution": "Westlake University"}, {"id": 185462, "fullname": "Tong Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185462?format=json", "institution": "Zhejiang University &amp; Westlake University"}, {"id": 86162, "fullname": "Ruibo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86162?format=json", "institution": "Nanyang Technological University"}, {"id": 153810, "fullname": "Chi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153810?format=json", "institution": "Westlake University"}], "abstract": "Video diffusion models have rich world priors, but their use in spatial tasks is limited by poor control, spatial-temporal inconsistent results, and entangled scene-camera dynamics. Current approaches, such as per-task fine-tuning or post-process warping strategies, are insufficient, often introducing visual artifacts, failing to generalize, or incurring high computational costs. We introduce a novel, training-free framework that operates purely at inference time to resolve these issues. Our method is comprised of three synergistic components. First, an intra-step refinement loop injects fine-grained motion guidance during the denoising process, iteratively correcting the output to ensure strict adherence to the target camera path. Second, an optical flow-based analysis identifies and isolates motion-related channels within the latent space. This allows our framework to selectively apply guidance, thereby decoupling motion from appearance and preserving visual fidelity. Third, a dual-path guidance strategy adaptively corrects for drift by comparing the guided generation against an unguided, reference denoising path, effectively neutralizing artifacts caused by misaligned structural inputs. These components work in concert to inject precise, trajectory-aligned control without any model retraining, achieving both accurate motion guidance and photorealistic synthesis. Our plug-and-play, model-agnostic solution demonstrates broad applicability for 3D/4D tasks. Extensive experiments confirm state-of-the-art performance in trajectory adherence and perceptual quality, outperforming both training-dependent and other inference-only methods.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36610", "url": null, "sourceid": 41635, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36731, "uid": "77a86db24f4897976c3b8879b4f3d47a", "name": "Few-Step Diffusion Sampling Through Instance-Aware Discretizations", "authors": [{"id": 185742, "fullname": "Liangyu Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185742?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 185743, "fullname": "Ruoyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185743?format=json", "institution": "Westlake University"}, {"id": 185462, "fullname": "Tong Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185462?format=json", "institution": "Zhejiang University &amp; Westlake University"}, {"id": 185744, "fullname": "Dingwen Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185744?format=json", "institution": "Westlake University"}, {"id": 147417, "fullname": "Mingkun Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/147417?format=json", "institution": "Westlake University"}, {"id": 159049, "fullname": "Beier Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/159049?format=json", "institution": "Nanyang Technological University"}, {"id": 153810, "fullname": "Chi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153810?format=json", "institution": "Westlake University"}], "abstract": "Diffusion and flow matching models generate high-fidelity data by simulating paths defined by Ordinary or Stochastic Differential Equations (ODEs/SDEs), starting from a tractable prior distribution. The probability flow ODE formulation enables the use of advanced numerical solvers to accelerate sampling. Orthogonal yet vital to solver design is the discretization strategy. While early approaches employed handcrafted heuristics and recent methods adopt optimization-based techniques, most existing strategies enforce a globally shared timestep schedule across all samples. This uniform treatment fails to account for instance-specific complexity in the generative process, potentially limiting performance.Motivated by controlled experiments on synthetic data, which reveals the suboptimality of global schedules under instance-specific dynamics, we propose an instance-aware discretization framework. Our method learns to adapt timestep allocations based on input-dependent priors, extending gradient-based discretization search to the conditional generative setting. Empirical results across diverse settings, including synthetic data, pixel-space diffusion, latent-space images and video flow matching models, demonstrate that our method consistently improves generation quality with marginal tuning cost compared to training and negligible inference overhead.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36731", "url": null, "sourceid": 44404, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39094, "uid": "e3dabf77dd2fc079db2f54e480c00595", "name": "TriSim: Tri-Dimensional Similarity Modeling with Extreme Value Theory for False-Negative Mitigation in Remote Sensing Image-Text Retrieval", "authors": [{"id": 180799, "fullname": "Chengyu Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180799?format=json", "institution": "The University of British Columbia"}, {"id": 191346, "fullname": "Hanzhang Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191346?format=json", "institution": "University of British Columbia"}, {"id": 191347, "fullname": "Jie Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/191347?format=json", "institution": "Ocean University of China"}, {"id": 180800, "fullname": "Shan Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/180800?format=json", "institution": "The University of British Columbia"}], "abstract": "In remote sensing (RS) cross-modal retrieval, most existing methods employ contrastive learning as their primary optimization objective, aligning anchors with positive counterparts and distinguishing them from negative samples. To improve negative sampling, these approaches typically set thresholds on cross-modal similarity scores, designating negatives that exceed the threshold as false negative samples (FNS). However, dependence on a single cross-modal similarity threshold is fragile because it fails to account for the cross-modal semantic overlaps and gaps. To address these challenges, we introduce TriSim, a novel image-text retrieval framework that constructs a tri-dimensional negative similarity space $<$img-img, img-txt, txt-txt$>$ to mitigate the influence of FNS issue. Specifically, considering that FNS appear as anomalies in this space, Extreme Value Theory (EVT) is applied to model the statistical behavior of the tail distribution for FNS selection. Two complementary tail selection strategies are developed: one identifies samples distant from the dense ellipsoidal center, and the other targets upper-right high-similarity extremes. The selected tail samples are regarded as FNS and modeled using a generalized Pareto distribution, with probabilistic weights assigned in the triplet loss. To further refine the selected FNS, intra-modal saliency differences are computed to generate masks that guide the learning of a gain matrix, which amplifies highly discriminative regions and suppresses ambiguous ones. Extensive experiments on two benchmarks demonstrate the superiority of the proposed TriSim framework in mitigating the influence of false negatives in RS image-text retrieval.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39094", "url": null, "sourceid": 43118, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38212, "uid": "b4994d8000b83caf4875f9cf28664194", "name": "AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs", "authors": [{"id": 175317, "fullname": "Shuhan Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/175317?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 157538, "fullname": "Pei Pei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/157538?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 157540, "fullname": "Xuannan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157540?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 189333, "fullname": "Dongsen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189333?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 177266, "fullname": "Xinyu Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/177266?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 189334, "fullname": "Zekun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189334?format=json", "institution": "University of California, Santa Barbara"}], "abstract": "The threat of Audio-Video (AV) forgery is rapidly evolving beyond human-centric deepfakes to include more diverse manipulations across complex natural scenes. However, existing benchmarks are still confined to DeepFake-based forgeries and single-granularity annotations, thus failing to capture the diversity and complexity of real-world forgery scenarios. To address this, we introduce AVFakeBench, the first comprehensive audio-video forgery detection benchmark that spans rich forgery semantics across both human subject and general subject. AVFakeBench comprises 12K carefully curated audio-video questions, covering seven forgery types and four levels of annotations. To ensure high-quality and diverse forgeries, we propose a multi-stage hybrid forgery framework that integrates proprietary models for task planning with expert generative models for precise manipulation. The benchmark establishes a multi-task evaluation framework covering binary judgment, forgery types classification, forgery detail selection, and explanatory reasoning. We evaluate 11 Audio-Video Large Language Models (AV-LMMs) and 2 prevalent detection methods on AVFakeBench, demonstrating the potential of AV-LMMs as emerging forgery detectors while revealing their notable weaknesses in fine-grained perception and reasoning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38212", "url": null, "sourceid": 44639, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40391, "uid": "8d0847366a208968be344b7c3e595291", "name": "ComPose: A Unified Completion-Pose Framework for Robust Category-Level Object Pose Estimation", "authors": [{"id": 88038, "fullname": "Huan Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/88038?format=json", "institution": "University of Science and Technology of China"}, {"id": 144463, "fullname": "Yihan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/144463?format=json", "institution": "University of Science and Technology of China"}, {"id": 183623, "fullname": "Chuxin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183623?format=json", "institution": "University of Science and Technology of China"}, {"id": 178536, "fullname": "Nailong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/178536?format=json", "institution": "Beijing Institute of Control Engineering"}, {"id": 88062, "fullname": "Wenfei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88062?format=json", "institution": "University of Science and Technology of China"}, {"id": 85977, "fullname": "Tianzhu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85977?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Category-level object pose estimation aims to predict the pose and size of arbitrary objects in specific categories. Existing methods struggle with the inherent incompleteness of observed point clouds, which limits their ability to capture complete object shapes for robust pose reasoning. While point cloud completion offers a promising solution, naively treating it as a separate preprocessing step for partial observations introduces compounding errors and additional computational overhead, ultimately hindering both accuracy and efficiency.To address these challenges, we propose ComPose, a novel unified framework that tightly integrates shape completion to provide complete geometric cues for enhanced pose estimation. At the core of ComPose is a keypoint-based progressive completion module, which recovers full shape representations by progressively predicting a sparse set of keypoints and their surrounding dense point sets, empowering the keypoints to capture holistic object geometries. A geometric relation encoding module further enriches keypoint features with both local and global geometric context. In addition, we introduce a novel geometric relation consistency loss to enforce structural alignment between observed keypoints and their predicted NOCS coordinates, ensuring globally coherent coordinate transformations.Extensive experiments on standard benchmarks demonstrate that our method outperforms state-of-the-art approaches without relying on category-level shape priors. Our method pioneers a new direction for future research by effectively and efficiently integrating shape completion into category-level object pose estimation. Code will be open.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40391", "url": null, "sourceid": -40534, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40025?format=json"], "related_events_ids": [40025]}, {"id": 39325, "uid": "65d03e7a8e7e35f398ece831361d7c58", "name": "Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs", "authors": [{"id": 180678, "fullname": "Yurun Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/180678?format=json", "institution": "Zhejiang University"}, {"id": 191851, "fullname": "Xueyu Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191851?format=json", "institution": null}, {"id": 191852, "fullname": "Yuhan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191852?format=json", "institution": "Xiamen University"}, {"id": 191853, "fullname": "Ziqi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191853?format=json", "institution": "Zhejiang University"}, {"id": 191854, "fullname": "Zeyi Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191854?format=json", "institution": "Ohio State University, Columbus"}, {"id": 191855, "fullname": "Lin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191855?format=json", "institution": null}, {"id": 191856, "fullname": "Feng Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/191856?format=json", "institution": null}, {"id": 191857, "fullname": "Yuxi qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/191857?format=json", "institution": "mybank"}, {"id": 191858, "fullname": "Bo Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191858?format=json", "institution": null}, {"id": 191859, "fullname": "Keting Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/191859?format=json", "institution": "Zhejiang University"}, {"id": 84791, "fullname": "Shengyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84791?format=json", "institution": "Zhejiang University"}], "abstract": "As multimodal LLM-driven agents advance in autonomy and generalization, traditional static datasets face inherent scalability limitations and are insufficient for fully assessing their capabilities in increasingly complex and diverse tasks. Existing studies have attempted to generate agent tasks using LLMs, but due to the inherent hallucinations of LLMs and the lack of internal data relationship modeling, these tasks often exhibit semantic inconsistencies and solvability issues. To address these challenges, we introduce Graph2Eval, a knowledge-graph-driven framework for automated, scalable, and semantically grounded agent task generation. At its core, Graph2Eval leverages a knowledge graph built from heterogeneous external data sources as a structured task space, generating multimodal agent tasks through subgraph sampling and task construction guided by task templates and meta-path strategies. To further ensure task reliability, a multi-stage filtering pipeline based on node reachability analysis, LLM scoring, and similarity analysis ensures the diversity and solvability of the generated tasks. By unifying both RAG Agent and Web Agent scenarios, Graph2Eval enables efficient generation of multimodal document understanding tasks and multi-step web interaction tasks. We instantiate the framework with Graph2Eval-Bench, a curated dataset of 1,319 tasks spanning document understanding and web interaction scenarios. Extensive experiments show that, on average, Graph2Eval improves task semantic consistency by 20% and solvability by 17% over baselines, while Graph2Eval-Bench effectively distinguishes agent performance, offering a new perspective on automated agent evaluation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39325", "url": null, "sourceid": 41265, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38952, "uid": "6a6b9ab46b610b6bf661a9c766f195eb", "name": "SIR: Structured Image Representations for Explainable Robot Learning", "authors": [{"id": 181604, "fullname": "Paul Mattes", "url": "http://cvpr.thecvf.com/api/miniconf/users/181604?format=json", "institution": "Karlsruher Institute of Technology"}, {"id": 191045, "fullname": "Jan Schwab", "url": "http://cvpr.thecvf.com/api/miniconf/users/191045?format=json", "institution": "Karlsruher Institut f\u00fcr Technologie"}, {"id": 191046, "fullname": "Jens Bosch", "url": "http://cvpr.thecvf.com/api/miniconf/users/191046?format=json", "institution": ""}, {"id": 183270, "fullname": "Maximilian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183270?format=json", "institution": "Karlsruhe Institut of Technologie"}, {"id": 191047, "fullname": "Nils Blank", "url": "http://cvpr.thecvf.com/api/miniconf/users/191047?format=json", "institution": "Karlsruher Institut f\u00fcr Technologie"}, {"id": 191048, "fullname": "Minh-Trung Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191048?format=json", "institution": "Karlsruher Institut f\u00fcr Technologie"}, {"id": 191049, "fullname": "Moritz Haberland", "url": "http://cvpr.thecvf.com/api/miniconf/users/191049?format=json", "institution": "Karlsruher Institut f\u00fcr Technologie"}, {"id": 191050, "fullname": "Rudolf Lioutikov", "url": "http://cvpr.thecvf.com/api/miniconf/users/191050?format=json", "institution": "Karlsruher Institut f\u00fcr Technologie"}], "abstract": "Existing robot policies based on learned visual embeddings lack explicit structure and are sensitive to visual distractions.Thus, the representations that drive their behaviour are often opaque, making their decision-making process difficult to interpret.To address this, we introduce Structured Image Representation, a method that leverages Scene Graphs as an intermediate representation for robot policy learning.Our approach first constructs a fully connected graph, using 2D or 3D image-derived features as initial node representations. Then, a module learns to sparsify this graph end-to-end, creating a minimal, task-relevant sub-graph that is passed to the action generation model.This process makes our model intrinsically explainable.Evaluations on RoboCasa show that our sparse graph policies outperform image-based baselines on average with 19.5% vs 14.81% success rate.We also demonstrate that our graph-based representations are significantly more robust to distractor objects, showing almost no performance degradation, as opposed to image representations.Most importantly, we show that the learned sparse graphs are a powerful tool for introspection.By analysing when the model's sub-graph deviates from human expectation, such as by including distractor nodes or omitting key objects, we successfully uncover dataset biases, including spurious correlations and positional biases.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38952", "url": null, "sourceid": 34832, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37938, "uid": "3958f53d13f000006c688ac027e34d2a", "name": "Mobile-VTON: High-Fidelity On-Device Virtual Try-On", "authors": [{"id": 180340, "fullname": "Zhenchen Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/180340?format=json", "institution": "University of Sydney"}, {"id": 90400, "fullname": "Ce Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/90400?format=json", "institution": "Computer and Information Engineering, School of Science and Technology, The Chinese University of Hong Kong (Shenzhen)"}, {"id": 188091, "fullname": "Runqi Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188091?format=json", "institution": "University of Sydney"}, {"id": 188635, "fullname": "Jiaxin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188635?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 188636, "fullname": "Tianxi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188636?format=json", "institution": "University of Melbourne"}, {"id": 85824, "fullname": "Yanwu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85824?format=json", "institution": "Boston University, Boston University"}, {"id": 84796, "fullname": "Tongliang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/84796?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 85803, "fullname": "Mingming Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/85803?format=json", "institution": "University of Melbourne"}], "abstract": "Virtual try-on (VTON) has recently achieved impressive visual fidelity, but most existing systems require uploading personal photos to cloud-based GPUs, raising privacy concerns and limiting on-device deployment. To address this, we present \\textsc{Mobile-VTON}, a high-quality, privacy-preserving framework that enables fully offline virtual try-on on commodity mobile devices using only a single user image and a garment image. \\textsc{Mobile-VTON} introduces a modular TeacherNet--GarmentNet--TryonNet (TGT) architecture that integrates knowledge distillation, garment-conditioned generation, and garment alignment into a unified pipeline optimized for on-device efficiency. Within this framework, we propose a Feature-Guided Adversarial (FGA) Distillation strategy that combines teacher supervision with adversarial learning to better match real-world image distributions. GarmentNet is trained with a trajectory-consistency loss to preserve garment semantics across diffusion steps, while TryonNet uses latent concatenation and lightweight cross-modal conditioning to enable robust garment-to-person alignment without large-scale pretraining. By combining these components, \\textsc{Mobile-VTON} achieves high-fidelity generation with low computational overhead. Experiments on VITON-HD and DressCode at $1024{\\times}768$ show that it matches or outperforms strong server-based baselines while running entirely offline. These results demonstrate that high-quality VTON is not only feasible but also practical on-device, offering a secure solution for real-world applications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37938", "url": null, "sourceid": 40577, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38099, "uid": "f32fe8b226a4303632d2c749bb9304cc", "name": "LayoutAD: Exploring Semantic-Geometric Misalignment Reasoning for Scene Layout Anomaly Detection", "authors": [{"id": 189054, "fullname": "Zhichao Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189054?format=json", "institution": "Xidian University"}, {"id": 189055, "fullname": "Jiasheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189055?format=json", "institution": "Xidian University"}, {"id": 189056, "fullname": "Jiyun Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/189056?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 189057, "fullname": "Jiangtao Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/189057?format=json", "institution": "Xi&#x27;an University of Posts and Telecommunications; Xidian University"}, {"id": 151565, "fullname": "Xiaotian Qiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/151565?format=json", "institution": "Xidian University"}], "abstract": "Visual anomaly detection is vital for quality control applications by identifying deviations from normal patterns.Previous structural or logical anomaly detection methods mainly focus on pixel-level deviations like texture defects and reconstruction errors, ignoring the object-level structural and contextual inconsistencies.These overlooked layout anomalies remain critical yet underexplored, e.g., factually defective hallucinations appeared in generative text-to-image models.Based on the above observation, in this paper, we introduce scene layout anomaly detection, a new task that predicts an object-level anomaly map from the input image to reveal the semantic plausibility and geometric consistency of each object in the scene.Specifically, we propose \\textit{LayoutAD}, an unsupervised learning framework that constructs semantic and geometric graphs to jointly reason over semantic-geometric misalignment among objects.Under this formulation, we are able to detect diverse layout deviations, including object attribute implausibilities and relationship mismatches.Extensive experiments show that \\textit{LayoutAD} outperforms baselines qualitatively and quantitatively across scenarios, benefiting scene understanding and generation applications, including self-corrected image generation and video anomaly detection.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38099", "url": null, "sourceid": 33630, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36457, "uid": "44a1a9f5907b7f74666d766251addcdb", "name": "MoEActok: A MoE-based Action Tokenizer for Vision-Language-Action Models", "authors": [{"id": 185098, "fullname": "Chunpu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185098?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 126829, "fullname": "Zhixuan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126829?format=json", "institution": "The University of Hong Kong"}, {"id": 185099, "fullname": "Tianshuo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185099?format=json", "institution": "University of Hong Kong"}, {"id": 185100, "fullname": "Chi-Min Chan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185100?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 185101, "fullname": "Yang Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185101?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 185102, "fullname": "Jessie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185102?format=json", "institution": null}, {"id": 86696, "fullname": "Xiaokang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86696?format=json", "institution": "Shanghai Jiao Tong University, China"}, {"id": 185103, "fullname": "Yao Mu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185103?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Recent works on vision-language-action (VLA) models have made great progress in exploring action tokenizers that convert continuous control signals into discrete tokens to align with LLM/VLM training paradigms.These approaches typically train a single tokenizer over entire manipulation trajectories, which often comprise multiple distinct skills and thus pose a challenging optimization trade-off.To address this issue, we introduce MoEActok, a novel action tokenizer that employs a mixture-of-experts (MoE) quantizer to produce skill-aware discrete representations for VLA models. MoEActok utilizes a clustering-driven MoE VQ-VAE mechanism in which each expert specializes in a particular skill.The key components are: (a) an action-skill decoupling strategy that uses k-means clustering to group action chunks, aligning clusters having similar skills; (b) a skill-aware training paradigm that augments VLA models with skill-conditioned context, improving skill grounding; and (c) an adapter that projects shared encoder representations into skill-specific latent spaces for specialized quantization, and subsequently harmonize the heterogeneous quantized representations back into a unified space for coherent reconstruction by the shared decoder.We evaluate MoEActok-based VLA models against multiple prior action tokenizer baselines in the RoboTwin and Simpler-Env simulators, and further assess zero-shot transfer on three real-world tasks. Across both simulated and real-world settings, MoEActok-based VLA substantially outperforms existing discrete tokenization methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36457", "url": null, "sourceid": 46250, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40177, "uid": "23cf654259bfa488296e1e07d38644bb", "name": "Skyreels-Text: Fine-grained Font-Controllable Text Editing for Poster Design", "authors": [{"id": 183263, "fullname": "Yunjie Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183263?format=json", "institution": "Beijing Skywork Intelligence Technology Co., Ltd."}, {"id": 193723, "fullname": "Jingchen Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193723?format=json", "institution": "Skywork AI"}, {"id": 193724, "fullname": "Junchen Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193724?format=json", "institution": "Skywork AI"}, {"id": 193725, "fullname": "Chunze Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/193725?format=json", "institution": "Kunlun Skywork"}, {"id": 193726, "fullname": "Guibin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193726?format=json", "institution": "inspirai"}], "abstract": "Artistic design such as poster design often demands rapid yet precise modification of textual content while preserving visual harmony and typographic intent, especially across diverse font styles. Although modern image editing models have grown increasingly powerful, they still fall short in fine-grained, font-aware text manipulation, limiting their utility in professional design workflows such as poster editing. To address this issue, we present Skyreels-Text, a novel font-controllable framework for precise poster text editing. Our method supports simultaneous editing of multiple text regions with distinct typographic styles: users simply provide cropped glyph patches from reference images, and our model synthesizes the desired content in a visually matching font, without requiring font labels or fine-tuning. Extensive experiments on multiple datasets, including handwrittent text benchmarks, Skyreels-Text achieves state-of-the-art performance in both text fidelity and visual realism, offering unprecedented control over font families, and stylistic nuances. This work bridges the gap between general-purpose image editing and professional-grade typographic design.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40177", "url": null, "sourceid": 37359, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36554, "uid": "9eca6659a8b78c13386f85bb3fbb70a2", "name": "HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction", "authors": [{"id": 185327, "fullname": "Xi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185327?format=json", "institution": "Amazon, Clemson University"}, {"id": 185328, "fullname": "Weiwei Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/185328?format=json", "institution": "Amazon"}, {"id": 98258, "fullname": "Joe Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/98258?format=json", "institution": "Amazon"}, {"id": 139949, "fullname": "Christopher Broaddus", "url": "http://cvpr.thecvf.com/api/miniconf/users/139949?format=json", "institution": "Amazon"}, {"id": 71707, "fullname": "Siyu Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/71707?format=json", "institution": "Harvard University"}, {"id": 185329, "fullname": "Laurent Guigues", "url": "http://cvpr.thecvf.com/api/miniconf/users/185329?format=json", "institution": "Amazon"}], "abstract": "Diffusion priors have recently demonstrated strong capability in enhancing the quality of sparse-view 3D reconstruction by augmenting training views at novel viewpoints, but they inevitably introduce hallucinated content-- artifacts inconsistent with the input views -- into the final 3D model. To address this challenge, we propose Hallucination-Aware Diffusion prior (HAD), which estimates pixel-wise hallucination score maps for augmented images by leveraging multi-view reasoning capabilities from a feedforward novel view synthesis (NVS) network pre-trained on large-scale 3D data. These hallucination scores enable selective masking of unreliable pixels during the progressive 3D reconstruction procedure, preventing the introduction of non-existent artifacts into the 3D model. To further enhance performance, we create multiple versions of augmented images at each novel view by conditioning the diffusion prior on different input views, which are then fused into a final image that leverages the broader context across all input views. We show that our method substantially reduces hallucination artifacts in diffusion-assisted 3D reconstruction, thereby achieving state-of-the-art performance across multiple benchmarks on novel view synthesis.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36554", "url": null, "sourceid": 46461, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36468, "uid": "a94eef54dd59424283d4c39b3f2248c7", "name": "Spatial-Spectral Residuals Informed Diffusion Neural Operator for Pan-sharpening", "authors": [{"id": 159380, "fullname": "jiahan huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/159380?format=json", "institution": "Southeast University"}, {"id": 159018, "fullname": "Ran Ran", "url": "http://cvpr.thecvf.com/api/miniconf/users/159018?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 128548, "fullname": "Junming Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/128548?format=json", "institution": "Southeast University"}, {"id": 185119, "fullname": "Zihao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185119?format=json", "institution": "Southeast University"}, {"id": 69585, "fullname": "Xiaofeng Cong", "url": "http://cvpr.thecvf.com/api/miniconf/users/69585?format=json", "institution": "Southeast University"}, {"id": 185120, "fullname": "Junling Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185120?format=json", "institution": "Southeast University"}, {"id": 130297, "fullname": "Liang-Jian Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/130297?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Pan-sharpening, a fundamental image preprocessing technique in remote sensing, aims to generate spatially and spectrally enriched multispectral imagery by integrating complementary information from texture-rich panchromatic (PAN) images and paired low-resolution multispectral (LRMS) counterparts. Although recent generative diffusion models have achieved impressive fusion quality, these performance gains often come with substantial computational costs, rendering them impractical for resource-constrained scenarios common in remote sensing applications. This work introduces a function-space diffusion model built upon a neural operator architecture that achieves compelling performance with promising efficiency. Specifically, our framework replaces the standard attention-based denoising backbone with a Galerkin-type neural operator, significantly reducing computational complexity while maintaining excellent representational capacity. Furthermore, by explicitly integrating pixel-wise spatial-spectral consistency residuals into each reverse diffusion step, our method establishes a fine-grained, closed-loop guidance mechanism that dynamically calibrates spatial details and spectral fidelity throughout the generation process. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our approach over state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36468", "url": null, "sourceid": 40467, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39104, "uid": "ecf0f13c45ae9b1787dd211283ee4cb0", "name": "PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting", "authors": [{"id": 168410, "fullname": "Danyal Maqbool", "url": "http://cvpr.thecvf.com/api/miniconf/users/168410?format=json", "institution": "University of Wisconsin-Madison"}, {"id": 191362, "fullname": "Changhee Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/191362?format=json", "institution": "University of Wisconsin - Madison"}, {"id": 191363, "fullname": "Zachary Huemann", "url": "http://cvpr.thecvf.com/api/miniconf/users/191363?format=json", "institution": "SmarterDx"}, {"id": 191364, "fullname": "Samuel Church", "url": "http://cvpr.thecvf.com/api/miniconf/users/191364?format=json", "institution": "University of Wisconsin - Madison"}, {"id": 191365, "fullname": "Matthew Larson", "url": "http://cvpr.thecvf.com/api/miniconf/users/191365?format=json", "institution": "University of Wisconsin - Madison"}, {"id": 191366, "fullname": "Scott Perlman", "url": "http://cvpr.thecvf.com/api/miniconf/users/191366?format=json", "institution": null}, {"id": 191367, "fullname": "Tomas Romero", "url": "http://cvpr.thecvf.com/api/miniconf/users/191367?format=json", "institution": "University of Wisconsin - Madison"}, {"id": 191368, "fullname": "Joshua Warner", "url": "http://cvpr.thecvf.com/api/miniconf/users/191368?format=json", "institution": "University of Wisconsin - Madison"}, {"id": 191369, "fullname": "Meghan Lubner", "url": "http://cvpr.thecvf.com/api/miniconf/users/191369?format=json", "institution": "University of Wisconsin - Madison"}, {"id": 191370, "fullname": "Xin Tie", "url": "http://cvpr.thecvf.com/api/miniconf/users/191370?format=json", "institution": "University of Pennsylvania, University of Pennsylvania"}, {"id": 191371, "fullname": "Jameson Merkow", "url": "http://cvpr.thecvf.com/api/miniconf/users/191371?format=json", "institution": "Microsoft"}, {"id": 131545, "fullname": "Junjie Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131545?format=json", "institution": "University of Wisconsin, Madison"}, {"id": 191372, "fullname": "Steve Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/191372?format=json", "institution": "University of Wisconsin - Madison"}, {"id": 191373, "fullname": "Tyler Bradshaw", "url": "http://cvpr.thecvf.com/api/miniconf/users/191373?format=json", "institution": "University of Wisconsin - Madison"}], "abstract": "Generating automated reports for 3D positron emission tomography (PET) is an important and challenging task in medical imaging. PET plays a vital role in oncology, but automating report generation is difficult due to the complexity of whole-body 3D volumes, the wide range of potential clinical findings, and the limited availability of annotated datasets. To address these challenges, we introduce PETARSeg-11K, the first large-scale, publicly available dataset that provides lesion-level correspondence between 3D PET/CT volumes and free-text radiological findings. It comprises 11,356 lesion descriptions paired with 3D segmentations. Second, we propose PETAR-4B, a 3D vision-language model designed for mask-aware, spatially grounded PET/CT reporting. PETAR-4B jointly encodes PET, CT, and 3D lesion segmentation masks, using a 3D focal prompt to capture fine-grained details of lesions that normally comprise less than 0.1\\% of the volume. Evaluations using automated metrics show PETAR-4B substantially outperforming all 2D and 3D baselines. A human study involving five physicians---the first of its kind for automated PET reporting---confirms the model's clinical utility and establishes correlations between automated metrics and expert judgment. This work provides a foundational dataset and a novel architecture, advancing 3D medical vision-language understanding in PET.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39104", "url": null, "sourceid": 45986, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38591, "uid": "c2f8e1260683ddd569afefcb6558a1a6", "name": "AnchorSplat: Feed-Forward 3D Gaussian Splatting With 3D Geometric Priors", "authors": [{"id": 175640, "fullname": "Xiaoxue Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/175640?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 86372, "fullname": "Xiaoxu Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86372?format=json", "institution": "National University of Singapore"}, {"id": 190227, "fullname": "Yixuan Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190227?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 175643, "fullname": "Tiao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/175643?format=json", "institution": "Huawei Technologies Co., Ltd."}, {"id": 180974, "fullname": "Kaihua Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180974?format=json", "institution": "Tongji University"}, {"id": 86358, "fullname": "Michael Bi Mi", "url": "http://cvpr.thecvf.com/api/miniconf/users/86358?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 190228, "fullname": "Zhan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190228?format=json", "institution": null}, {"id": 130142, "fullname": "Dave Zhenyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/130142?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}], "abstract": "Scene-level 3D reconstruction has attracted increasing attention, and feed-forward 3D Gaussian Splatting (3DGS) has emerged as a promising paradigm for novel view synthesis. However, most existing methods adopt a pixel-aligned formulation that maps each 2D pixel to a 3D Gaussian, making the number of Gaussians tightly coupled with the input images. This leads to several limitations: (i) reconstruction quality is sensitive to the quantity and viewpoint coverage of input images, often causing Gaussians to accumulate more densely in regions with frequent viewpoints; (ii) alignment errors become more pronounced under sparse-view conditions; and (iii) the lack of explicit geometric consistency can degrade depth estimation and downstream 3D tasks. In this paper, we propose AnchorSplat, a novel multi-view feed-forward 3DGS framework for scene-level reconstruction that departs from pixel-aligned prediction and instead represents the scene directly in 3D space. AnchorSplat introduces anchor-aligned Gaussians guided by geometric priors (e.g., sparse point clouds, voxels, or RGB-D point clouds), enabling a more geometry-aware representation that is independent of image resolution and number of views. This design substantially reduces the number of required Gaussians, improving computational efficiency while enhancing reconstruction fidelity. The framework is trained in two stages: a Gaussian decoder first predicts anchor-aligned Gaussians, and a subsequent Gaussian refiner further improves their quality and view consistency. Experiments on the ScanNet benchmark demonstrate that AnchorSplat achieves state-of-the-art performance, producing more view-consistent and plausible 3D Gaussian reconstructions. Code, videos, and pretrained models will be released on the project page.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38591", "url": null, "sourceid": 32520, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38796, "uid": "baead3b2bab81c64c1ea6a6984646567", "name": "Turbo-GS: Accelerating 3D Gaussian Fitting for High-Resolution Radiance Fields", "authors": [{"id": 153178, "fullname": "Ankit Dhiman", "url": "http://cvpr.thecvf.com/api/miniconf/users/153178?format=json", "institution": "Indian Institute of Science, Bangalore"}, {"id": 155693, "fullname": "Tao Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155693?format=json", "institution": "Brown University"}, {"id": 190688, "fullname": "Srinath Ravi", "url": "http://cvpr.thecvf.com/api/miniconf/users/190688?format=json", "institution": "Carnegie Mellon University"}, {"id": 190689, "fullname": "Emre Arslan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190689?format=json", "institution": "Brown University"}, {"id": 127302, "fullname": "Angela Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/127302?format=json", "institution": "Brown University"}, {"id": 70327, "fullname": "Yuanbo Xiangli", "url": "http://cvpr.thecvf.com/api/miniconf/users/70327?format=json", "institution": "Cornell University"}, {"id": 76920, "fullname": "R. Venkatesh Babu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76920?format=json", "institution": "Indian Institute of Science"}, {"id": 76158, "fullname": "Srinath Sridhar", "url": "http://cvpr.thecvf.com/api/miniconf/users/76158?format=json", "institution": "Brown University"}], "abstract": "Novel-view synthesis plays a crucial role in computer vision with applications in 3D reconstruction, mixed reality, and robotics. Recent approaches, such as 3D Gaussian Splatting (3DGS), have emerged as state-of-the-art solutions, offering high-quality novel view synthesis in real time. However, training 3DGS models remains slow, particularly for high-resolution images, often requiring hours to fit a scene with 200 views. In this work, we aim to accelerate the fitting process by reducing computational overhead and improving learning efficiency. Specifically, we introduce a dilated rendering technique that renders only a subset of pixels instead of the full image, significantly reducing computational costs. To enhance learning efficiency, we develop a convergence-aware budget control mechanism that balances the addition of new Gaussians with the optimization of existing ones. Additionally, to improve densification efficiency and prevent gradient vanishing, we incorporate both positional and appearance error to enhance densification effectiveness. With these improvements, we achieve fast 4K-resolution fitting while maintaining, or even improving, novel view rendering quality. Extensive experiments demonstrate that our method achieves significantly faster optimization than existing approaches while preserving high rendering fidelity.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38796", "url": null, "sourceid": 43025, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39660, "uid": "72d2dc616bbb33717dff9126539887cf", "name": "Specificity-aware reinforcement learning for fine-grained open-world classification", "authors": [{"id": 192587, "fullname": "Samuele Angheben", "url": "http://cvpr.thecvf.com/api/miniconf/users/192587?format=json", "institution": null}, {"id": 192588, "fullname": "Davide Berasi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192588?format=json", "institution": "University of Trento"}, {"id": 192589, "fullname": "Alessandro Conti", "url": "http://cvpr.thecvf.com/api/miniconf/users/192589?format=json", "institution": "Tavus"}, {"id": 75841, "fullname": "Elisa Ricci", "url": "http://cvpr.thecvf.com/api/miniconf/users/75841?format=json", "institution": "University of Trento"}, {"id": 106509, "fullname": "Yiming Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/106509?format=json", "institution": "Fondazione Bruno Kessler"}], "abstract": "Classifying fine-grained visual concepts under open-world settings, i.e., without a predefined label set, demands models to be both accurate and specific. Recent reasoning Large Multimodal Models (LMMs) exhibit strong visual understanding capability but tend to produce overly generic predictions when performing fine-grained image classification. Our preliminary analysis reveals that models do possess the intrinsic fine-grained domain knowledge. However, promoting more specific predictions (specificity) without compromising correct ones (correctness) remains a non-trivial and understudied challenge. In this work, we investigate how to steer reasoning LMMs toward predictions that are both correct and specific. We propose a novel specificity-aware reinforcement learning framework, SpeciaRL, to fine-tune reasoning LMMs on fine-grained image classification under the open-world setting. SpeciaRL introduces a dynamic, verifier-based reward signal anchored to the best predictions within online rollouts, promoting specificity while respecting the model's capabilities to prevent incorrect predictions. Our out-of-domain experiments show that SpeciaRL delivers the best trade-off between correctness and specificity across extensive fine-grained benchmarks, surpassing existing methods and advancing open-world fine-grained image classification. We will release both code and model.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39660", "url": null, "sourceid": 34597, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37074, "uid": "5d64ea6bc3bf7929860190fe5159b5b3", "name": "Native and Compact Structured Latents for 3D Generation", "authors": [{"id": 151573, "fullname": "Jianfeng XIANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/151573?format=json", "institution": "Tsinghua University"}, {"id": 89003, "fullname": "Xiaoxue Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89003?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 153304, "fullname": "Sicheng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153304?format=json", "institution": "Microsoft"}, {"id": 143492, "fullname": "Ruicheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/143492?format=json", "institution": "University of Science and Technology of China"}, {"id": 153303, "fullname": "Zelong Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/153303?format=json", "institution": "University of Science and Technology of China"}, {"id": 89100, "fullname": "Yu Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/89100?format=json", "institution": "Xiaobing.ai"}, {"id": 186603, "fullname": "Hongyuan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186603?format=json", "institution": "Microsoft"}, {"id": 186604, "fullname": "Yue Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186604?format=json", "institution": "Microsoft"}, {"id": 88978, "fullname": "Hao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88978?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 88169, "fullname": "Nicholas Jing Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88169?format=json", "institution": "Microsoft"}, {"id": 129072, "fullname": "Jiaolong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129072?format=json", "institution": "Microsoft Research"}], "abstract": "Recent advancements in 3D generative modeling have significantly improved the generation realism, yet the field is still hampered by existing representations, which struggle to capture assets with complex topologies and detailed appearance. This paper present an approach for learning a structured latent representation from native 3D data to address this challenge. At its core is a new sparse voxel structure called O-Voxel, an omni-voxel representation that encodes both geometry and appearance. O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters. Based on O-Voxel, we design a Sparse Compression VAE which provides a high spatial compression rate and a compact latent space. We train large-scale flow-matching models comprising 4B parameters for 3D generation using diverse public 3D asset datasets. Despite their scale, inference remains highly efficient. Meanwhile, the geometry and material quality of our generated assets far exceed those of existing models. We believe our approach offers a significant advancement in 3D generative modeling.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37074", "url": null, "sourceid": 41354, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40283?format=json"], "related_events_ids": [40283]}, {"id": 38808, "uid": "39413c8e064172834b2cca658a731119", "name": "TIACam: Text-Anchored Invariant Feature Learning with Auto-Augmentation for Camera-Robust Zero-Watermarking", "authors": [{"id": 182859, "fullname": "Abdullah All Tanvir", "url": "http://cvpr.thecvf.com/api/miniconf/users/182859?format=json", "institution": "University of Nebraska at Omaha"}, {"id": 190725, "fullname": "Agnibh Dasgupta", "url": "http://cvpr.thecvf.com/api/miniconf/users/190725?format=json", "institution": "University of Nebraska, Omaha"}, {"id": 182256, "fullname": "Xin Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/182256?format=json", "institution": "University of Nebraska Omaha"}], "abstract": "Camera recapture introduces complex optical degradations, such as perspective warping, illumination shifts, and Moir\u00e9 interference, that remain challenging for deep watermarking systems. We present TIACam, a text-anchored invariant feature learning framework with auto-augmentation for camera-robust zero-watermarking. The method integrates three key innovations: (1) a $\\textit{learnable auto-augmentor}$ that discovers camera-like distortions through differentiable geometric, photometric, and Moir\u00e9 operators; (2) a $\\textit{text-anchored invariant feature learner}$ that enforces semantic consistency via cross-modal adversarial alignment between image and text; and (3) a $\\textit{zero-watermarking head}$ that binds binary messages in the invariant feature space without modifying image pixels. This unified formulation jointly optimizes invariance, semantic alignment, and watermark recoverability. Extensive experiments on both synthetic and real-world camera captures demonstrate that TIACam achieves state-of-the-art feature stability and watermark extraction accuracy, establishing a principled bridge between multimodal invariance learning and physically robust zero-watermarking.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38808", "url": null, "sourceid": 38754, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40283, "uid": "5d64ea6bc3bf7929860190fe5159b5b3", "name": "Native and Compact Structured Latents for 3D Generation", "authors": [{"id": 151573, "fullname": "Jianfeng XIANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/151573?format=json", "institution": "Tsinghua University"}, {"id": 89003, "fullname": "Xiaoxue Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89003?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 153304, "fullname": "Sicheng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153304?format=json", "institution": "Microsoft"}, {"id": 143492, "fullname": "Ruicheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/143492?format=json", "institution": "University of Science and Technology of China"}, {"id": 153303, "fullname": "Zelong Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/153303?format=json", "institution": "University of Science and Technology of China"}, {"id": 89100, "fullname": "Yu Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/89100?format=json", "institution": "Xiaobing.ai"}, {"id": 186603, "fullname": "Hongyuan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186603?format=json", "institution": "Microsoft"}, {"id": 186604, "fullname": "Yue Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186604?format=json", "institution": "Microsoft"}, {"id": 88978, "fullname": "Hao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88978?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 88169, "fullname": "Nicholas Jing Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88169?format=json", "institution": "Microsoft"}, {"id": 129072, "fullname": "Jiaolong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129072?format=json", "institution": "Microsoft Research"}], "abstract": "Recent advancements in 3D generative modeling have significantly improved the generation realism, yet the field is still hampered by existing representations, which struggle to capture assets with complex topologies and detailed appearance. This paper present an approach for learning a structured latent representation from native 3D data to address this challenge. At its core is a new sparse voxel structure called O-Voxel, an omni-voxel representation that encodes both geometry and appearance. O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters. Based on O-Voxel, we design a Sparse Compression VAE which provides a high spatial compression rate and a compact latent space. We train large-scale flow-matching models comprising 4B parameters for 3D generation using diverse public 3D asset datasets. Despite their scale, inference remains highly efficient. Meanwhile, the geometry and material quality of our generated assets far exceed those of existing models. We believe our approach offers a significant advancement in 3D generative modeling.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40283", "url": null, "sourceid": -41354, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37074?format=json"], "related_events_ids": [37074]}, {"id": 39154, "uid": "5546524ac938e99381090b48365c0740", "name": "Simple Agents Outperform Experts in Biomedical Imaging Workflow Optimization", "authors": [{"id": 175436, "fullname": "Xuefei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/175436?format=json", "institution": "Caltech"}, {"id": 166960, "fullname": "Kai A. Horstmann", "url": "http://cvpr.thecvf.com/api/miniconf/users/166960?format=json", "institution": "Cornell University"}, {"id": 168006, "fullname": "Ethan Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/168006?format=json", "institution": "Cornell University"}, {"id": 191465, "fullname": "Jonathan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191465?format=json", "institution": "Cornell University"}, {"id": 191466, "fullname": "Alexander Farhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191466?format=json", "institution": "California Institute of Technology"}, {"id": 191467, "fullname": "Sophia Stiles", "url": "http://cvpr.thecvf.com/api/miniconf/users/191467?format=json", "institution": "California Institute of Technology"}, {"id": 156410, "fullname": "Atharva Sehgal", "url": "http://cvpr.thecvf.com/api/miniconf/users/156410?format=json", "institution": "UT Austin Caltech"}, {"id": 191468, "fullname": "Jonathan Light", "url": "http://cvpr.thecvf.com/api/miniconf/users/191468?format=json", "institution": "Rensselaer Polytechnic Institute"}, {"id": 191469, "fullname": "David Valen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191469?format=json", "institution": "California Institute of Technology"}, {"id": 76018, "fullname": "Yisong Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/76018?format=json", "institution": "Caltech"}, {"id": 156412, "fullname": "Jennifer J. Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/156412?format=json", "institution": "Cornell University"}], "abstract": "Adapting production-level computer vision tools to bespoke scientific datasets is a critical ''last mile'' bottleneck. Current solutions are impractical: fine-tuning requires large annotated datasets scientists often lack, while manual code adaptation costs scientists weeks to months of effort. We consider using AI agents to automate this manual coding, and focus on the open question of optimal agent design for this targeted task. We introduce a systematic evaluation framework for agentic code optimization and use it to study three production-level biomedical imaging pipelines. We demonstrate that a simple agent framework consistently generates adaptation code that outperforms human-expert solutions. Our analysis reveals that common, complex agent architectures are not universally beneficial, leading to a practical roadmap for agent design. We open source our framework and validate our approach by deploying agent-generated functions into a production pipeline, demonstrating a clear pathway for real-world impact.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39154", "url": null, "sourceid": 41801, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38892, "uid": "26203c6ab63ff66fef12bbc3b8dad1c0", "name": "Time Blindness: Why Video-Language Models Can\u2019t See What Humans Can?", "authors": [{"id": 180034, "fullname": "Ujjwal Upadhyay", "url": "http://cvpr.thecvf.com/api/miniconf/users/180034?format=json", "institution": "DocPanel"}, {"id": 190924, "fullname": "Mukul Ranjan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190924?format=json", "institution": "MBZUAI"}, {"id": 135097, "fullname": "Zhiqiang Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/135097?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 85585, "fullname": "Mohamed Elhoseiny", "url": "http://cvpr.thecvf.com/api/miniconf/users/85585?format=json", "institution": "KAUST"}], "abstract": "Recent advances in vision\u2013language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce $\\textbf{SpookyBench}$, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98\\% accuracy, state-of-the-art VLMs achieve 0\\% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Overcoming this limitation will require novel architectures or training paradigms that decouple spatial dependencies from temporal processing. Our systematic analysis shows that this issue persists across model scales and architectures. We release SpookyBench to catalyze research in temporal pattern recognition and bridge the gap between human and machine video understanding. Sample dataset and code are available in the supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38892", "url": null, "sourceid": 36496, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40143, "uid": "c366211d15a83c5e4d72f04ab3b1cf9d", "name": "BEV-SLD: Self-Supervised Scene Landmark Detection for Global Localization with LiDAR Bird\u2019s-Eye View Images", "authors": [{"id": 173825, "fullname": "David Skuddis", "url": "http://cvpr.thecvf.com/api/miniconf/users/173825?format=json", "institution": "University of Stuttgart"}, {"id": 193613, "fullname": "Vincent Ress", "url": "http://cvpr.thecvf.com/api/miniconf/users/193613?format=json", "institution": "Universit\u00e4t Stuttgart"}, {"id": 193614, "fullname": "Wei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193614?format=json", "institution": "Universit\u00e4t Stuttgart"}, {"id": 193615, "fullname": "Vincent Ofosu Nyako", "url": "http://cvpr.thecvf.com/api/miniconf/users/193615?format=json", "institution": "Institute for Photogrammetry and Geoinformatics, University of Stuttgart"}, {"id": 193616, "fullname": "Norbert Haala", "url": "http://cvpr.thecvf.com/api/miniconf/users/193616?format=json", "institution": "Universit\u00e4t Stuttgart"}], "abstract": "We present BEV-SLD, a LiDAR global localization method building on the Scene Landmark Detection (SLD) concept. Unlike scene-agnostic pipelines, our new self-supervised approach leverages bird\u2019s-eye-view (BEV) images to discover scene-specific patterns at a prescribed spatial density and treat them as landmarks. A consistency loss aligns a learnable set of global landmark coordinates with per-frame heatmaps, yielding consistent detection and reliable occurrence across the scene. Across campus, industrial, and forest environments, BEV-SLD delivers robust localization and outperforms state-of-the-art methods. Code and trained models will be released after publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40143", "url": null, "sourceid": 31948, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39597, "uid": "f832012aa9c2b51641e64e901024047c", "name": "Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation", "authors": [{"id": 192443, "fullname": "Jiahao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192443?format=json", "institution": "Xiamen University"}, {"id": 90817, "fullname": "Yang Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90817?format=json", "institution": "Xiamen University"}, {"id": 157848, "fullname": "Yachao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157848?format=json", "institution": "Xiamen University"}, {"id": 192444, "fullname": "FangyongWang FangyongWang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192444?format=json", "institution": null}, {"id": 135330, "fullname": "Yuan Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/135330?format=json", "institution": "East China Normal University"}, {"id": 87721, "fullname": "Yanyun Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87721?format=json", "institution": "Xiamen University"}], "abstract": "Open-vocabulary semantic segmentation (OVSS) aims to segment arbitrary category regions in images using open-vocabulary prompts, necessitating that existing methods possess pixel-level vision-language alignment capability. Typically, this capability involves computing the cosine similarity, ie, logits, between visual and linguistic features, and minimizing the distribution discrepancy between the logits and the ground truth (GT) to generate optimal logits that are subsequently used to construct segmentation maps, yet it depends on time-consuming iterative training or model-specific attention modulation. In this work, we propose a more direct approach that eschews the logits-optimization process by directly deriving an analytic solution for the segmentation map. We posit a key hypothesis: the distribution discrepancy encodes semantic information; specifically, this discrepancy exhibits consistency across patches belonging to the same category but inconsistency across different categories. Based on this hypothesis, we directly utilize the analytic solution of this distribution discrepancy as the semantic maps. In other words, we reformulate the optimization of the distribution discrepancy as deriving its analytic solution, thereby eliminating time-consuming iterative training, freeing us from model-specific attention modulation, and achieving state-of-the-art performance on eight benchmark datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39597", "url": null, "sourceid": 37683, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38241, "uid": "da0c03acd1f39210011e050871fa3570", "name": "SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL", "authors": [{"id": 189406, "fullname": "Siyi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189406?format=json", "institution": "University of Michigan - Ann Arbor"}, {"id": 184979, "fullname": "Mikaela Angelina Uy", "url": "http://cvpr.thecvf.com/api/miniconf/users/184979?format=json", "institution": "NVIDIA"}, {"id": 97322, "fullname": "Chan Hee Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/97322?format=json", "institution": "The Ohio State University"}, {"id": 189407, "fullname": "Faisal Ladhak", "url": "http://cvpr.thecvf.com/api/miniconf/users/189407?format=json", "institution": "NVIDIA"}, {"id": 137420, "fullname": "Adithya Murali", "url": "http://cvpr.thecvf.com/api/miniconf/users/137420?format=json", "institution": "NVIDIA"}, {"id": 106956, "fullname": "Qing Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/106956?format=json", "institution": "University of Michigan"}, {"id": 159437, "fullname": "Stan Birchfield", "url": "http://cvpr.thecvf.com/api/miniconf/users/159437?format=json", "institution": "NVIDIA"}, {"id": 87370, "fullname": "Valts Blukis", "url": "http://cvpr.thecvf.com/api/miniconf/users/87370?format=json", "institution": "NVIDIA"}, {"id": 87339, "fullname": "Jonathan Tremblay", "url": "http://cvpr.thecvf.com/api/miniconf/users/87339?format=json", "institution": "NVIDIA"}], "abstract": "Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with metrically precise spatial reasoning required for embodied applications. The agentic paradigm promises that VLMs can use a wide variety of tools that could augment these capabilities, such as depth estimators, segmentation models, and pose estimators. Yet it remains an open challenge how to realize this vision without solely relying on handcrafted prompting strategies or enforcing fixed, predefined tool pipelines that limit VLMs' ability to discover optimal tool-use patterns. Reinforcement Learning could overcome this gap, but has so far been limited to reasoning with a single visual tool due to the large search space in multi-tool reasoning. We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework where VLMs learn to coordinate multiple tools through interactive exploration and feedback. In the teaching phase, we combine demonstrations from a single tool specialist trained via interactive RL with traces from a frontier model using all tools. In the exploration phase, the model further refines multi-tool coordination through continued RL. Our model, SpaceTools, with tool-augmented spatial reasoning ability, achieves state-of-the-art performance on spatial understanding benchmarks (RoboSpatial-Home, BLINK, Internal Benchmark) and demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. DIRL provides substantial improvements over the vanilla SFT (+12% on RoboSpatial) and RL (+16% on RoboSpatial) baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38241", "url": null, "sourceid": 41901, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38331, "uid": "600e75a6bc206bbb4e8f7858285ce7b7", "name": "Fresco: Frequency\u2013Spatial Consistent Optimization for Fine-Grained Head Avatar Modeling", "authors": [{"id": 180707, "fullname": "shikun zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180707?format=json", "institution": "Monash University"}, {"id": 189622, "fullname": "Yong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189622?format=json", "institution": "Chongqing University"}, {"id": 87313, "fullname": "Yiqun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87313?format=json", "institution": "Chongqing University"}, {"id": 75819, "fullname": "Qiuhong Ke", "url": "http://cvpr.thecvf.com/api/miniconf/users/75819?format=json", "institution": "Monash University"}, {"id": 152755, "fullname": "Cunjian Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/152755?format=json", "institution": "Monash University"}], "abstract": "We propose Fresco, a unified optimization paradigm designed to mitigate early over-sharpening, and cross-view drifting in head avatar reconstruction. Fresco combines a Laplacian-pyramid-based frequency curriculum with UV-space consistency regularization to progressively enhance reconstruction quality. The optimization begins by stabilizing low-frequency appearance in the image domain, which suppresses spurious details and promotes reliable convergence. As learning proceeds, consistency across different viewpoints is reinforced through pixel-level alignment on shared UV texture coordinates. Finally, high-frequency components are refined under explicit frequency-band constraints, and seam boundary regularization is applied to preserve local continuity. By optimizing in a frequency- and UV-aligned space, Fresco achieves robust convergence without pseudo high-frequency artifacts and yields consistent, high-fidelity results across views. Experiments on the NeRSemble dataset validate the effectiveness of our design. Our method outperforms previous state-of the-art methods while avoiding additional training overhead through frequency scheduling and UV-bake caching.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38331", "url": null, "sourceid": 41938, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37812, "uid": "8bb7e9b359214147c19df99f9271ade9", "name": "ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation", "authors": [{"id": 188329, "fullname": "Ayush Roy", "url": "http://cvpr.thecvf.com/api/miniconf/users/188329?format=json", "institution": "State University of New York at Buffalo"}, {"id": 188330, "fullname": "Wei-Yang Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/188330?format=json", "institution": null}, {"id": 188331, "fullname": "Rudrasis Chakraborty", "url": "http://cvpr.thecvf.com/api/miniconf/users/188331?format=json", "institution": "Lawrence Livermore National Labs"}, {"id": 138364, "fullname": "Vishnu Lokhande", "url": "http://cvpr.thecvf.com/api/miniconf/users/138364?format=json", "institution": "University at Buffalo, State University of New York"}], "abstract": "In recent times, large datasets hinder efficient model training while also containing redundant concepts. Dataset distillation aims to synthesize compact datasets that preserve the knowledge of large-scale training sets while drastically reducing storage and computation. Recent advances in diffusion models have enabled training-free distillation by leveraging pre-trained generative priors; however, existing guidance strategies remain limited. Current score-based methods either perform unguided denoising or rely on simple mode-based guidance toward instance prototype centroids (IPC centroids), which often are rudimentary and suboptimal. We propose Manifold-Guided Distillation (ManifoldGD), a training-free diffusion-based framework that integrates manifold consistent guidance at every denoising timestep. Our method employs IPCs computed via a hierarchical, divisive clustering of VAE latent features\u2014yielding a multi-scale coreset of IPCs that captures both coarse semantic modes and fine intra-class variability. Using a local neighborhood of the extracted IPC centroids, we create the latent manifold for each diffusion denoising timestep. At each denoising step, we project the mode-alignment vector onto the local tangent space of the estimated latent manifold, thus constraining the generation trajectory to remain manifold-faithful while preserving semantic consistency. This formulation improves representativeness, diversity, and image fidelity without requiring any model retraining. Empirical results demonstrate consistent gains over existing training-free and training-based baselines in terms of FID, $\\ell_2$ distance among real and synthetic dataset embeddings, and classification accuracy, establishing ManifoldGD as the first geometry-aware training-free data distillation framework.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37812", "url": null, "sourceid": 35018, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37786, "uid": "ede8a40c37d1c1b12c8225bc8c672660", "name": "RAVEN: Radar Adaptive Vision Encoders for Efficient Chirp-wise Object Detection and Segmentation", "authors": [{"id": 188265, "fullname": "Anuvab Sen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188265?format=json", "institution": "Georgia Institute of Technology"}, {"id": 188266, "fullname": "Mir Sayeed Mohammad", "url": "http://cvpr.thecvf.com/api/miniconf/users/188266?format=json", "institution": "Georgia Institute of Technology"}, {"id": 188267, "fullname": "Saibal Mukhopadhyay", "url": "http://cvpr.thecvf.com/api/miniconf/users/188267?format=json", "institution": "Georgia Institute of Technology"}], "abstract": "We introduce RAVEN, a deep learning  architecture for processing frequency-modulated continuous-wave (FMCW) radar data that is designed for high computational efficiency. RAVEN reduces computation by using a learnable antenna mixer module on independent receiver state space encoders (SSM) to compress the virtual MIMO array into a compact set of learned features and by performing per-chirp inference with a calibrated early-exit rule, so the model reaches a decision using only a subset of chirps in a radar frame. These design choices yield up to 170\u00d7 lower computation and 4\u00d7 lower end-to-end latency than conventional frame-based radar backbones, while achieving state-of-the-art detection and BEV free-space segmentation performance on automotive radar datasets.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37786", "url": null, "sourceid": 37989, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39857, "uid": "97b9c4da938b99eb6bd4b7663661faf7", "name": "CREward: A Type-Specific Creativity Reward Model", "authors": [{"id": 192994, "fullname": "Jiyeon Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/192994?format=json", "institution": "Simon Fraser University"}, {"id": 98634, "fullname": "Ali Mahdavi Amiri", "url": "http://cvpr.thecvf.com/api/miniconf/users/98634?format=json", "institution": "Simon Fraser University"}, {"id": 88076, "fullname": "Hao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88076?format=json", "institution": "Simon Fraser University"}, {"id": 192995, "fullname": "Haedong Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/192995?format=json", "institution": "Sogang University"}], "abstract": "Creativity is a complex phenomenon. When it comes to representing and assessing creativity, treating it as a single undifferentiated quantity would appear naive and under-whelming. In this work, we learn the first type-specific creativity reward model, coined CREward, which spans three creativity \u201caxes,\u201d geometry, material, and texture, to allow us to view creativity through the lens of the image formation pipeline. To build our reward model, we first conduct a human benchmark evaluation to capture human perception of creativity for each type across various creative images. We then analyze the correlation between human judgments and predictions by large vision-language models (LVLMs), confirming that LVLMs exhibit strong alignment with human perception. Building on this observation, we collect LVLM-generated labels to train our CREward model that is applicable to both evaluation and generation of creative images. We explore three applications of CREward: creativity assessment, explainable creativity, and creative sample acquisition for both human design inspiration and guiding creative generation through low-rank adaptation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39857", "url": null, "sourceid": 38943, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40241, "uid": "cc46f33ee83eb94cea644695f8717c0e", "name": "Functional Mean Flow in Hilbert Space", "authors": [{"id": 193857, "fullname": "Zhiqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193857?format=json", "institution": "Georgia Institute of Technology"}, {"id": 193858, "fullname": "Yuchen Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/193858?format=json", "institution": "Georgia Institute of Technology"}, {"id": 193859, "fullname": "Greg Turk", "url": "http://cvpr.thecvf.com/api/miniconf/users/193859?format=json", "institution": "Georgia Institute of Technology"}, {"id": 159475, "fullname": "Bo Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/159475?format=json", "institution": "Georgia Institute of Technology"}], "abstract": "We present Functional Mean Flow (FMF) as a one-step generative model defined in infinite-dimensional Hilbert space. FMF extends the one-step Mean Flow framework to functional domains by providing a theoretical formulation for Functional Flow Matching and a practical implementation for efficient training and sampling. We also introduce an $x_1$-prediction variant that improves stability over the original $u$-prediction form. The resulting framework is a practical one-step Flow Matching method applicable to a wide range of functional data generation tasks such as time series, images, PDEs, and 3D geometry.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40241", "url": null, "sourceid": 33310, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40395, "uid": "18197405b180a434e40551201cd25c54", "name": "Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding", "authors": [{"id": 99349, "fullname": "Yue Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/99349?format=json", "institution": "University of Amsterdam"}, {"id": 126228, "fullname": "Qi Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/126228?format=json", "institution": "ETH Zurich, INSAIT Sofia"}, {"id": 193605, "fullname": "Runyi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193605?format=json", "institution": "Institute for Computer Science, Artificial Intelligence and Technology (INSAIT)"}, {"id": 193606, "fullname": "Mengjiao Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/193606?format=json", "institution": "Institute for Computer Science, Artificial Intelligence and Technology"}, {"id": 186317, "fullname": "Bin Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/186317?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 193607, "fullname": "Nikola Popovic", "url": "http://cvpr.thecvf.com/api/miniconf/users/193607?format=json", "institution": "Institute for Computer Science, Artificial Intelligence and Technology"}, {"id": 75973, "fullname": "Nicu Sebe", "url": "http://cvpr.thecvf.com/api/miniconf/users/75973?format=json", "institution": "University of Trento"}, {"id": 153394, "fullname": "Theo Gevers", "url": "http://cvpr.thecvf.com/api/miniconf/users/153394?format=json", "institution": "University of Amsterdam, University of Amsterdam"}, {"id": 75489, "fullname": "Luc Van Gool", "url": "http://cvpr.thecvf.com/api/miniconf/users/75489?format=json", "institution": "INSAIT, Sofia Un. St. Kliment Ohridski"}, {"id": 156198, "fullname": "Danda Paudel", "url": "http://cvpr.thecvf.com/api/miniconf/users/156198?format=json", "institution": "INSAIT, Sofia University"}, {"id": 88372, "fullname": "Martin R. Oswald", "url": "http://cvpr.thecvf.com/api/miniconf/users/88372?format=json", "institution": "University of Amsterdam"}], "abstract": "While 3DGS has emerged as a high-fidelity scene representation, encoding rich, general-purpose features directly from its primitives remains under-explored. We address this gap by introducing Chorus, a multi-teacher pretraining framework that learns a holistic feed-forward 3D Gaussian Splatting (3DGS) scene encoder by distilling complementary signals from 2D foundation models.  Chorus employs a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, encouraging a shared embedding space that captures signals from high-level semantics to fine-grained structure.We evaluate Chorus on a wide range of tasks: open-vocabulary semantic and instance segmentation, linear and decoder probing, as well as data-efficient supervision. Besides 3DGS, we also test Chorus on several benchmarks that only support point clouds by pretraining a variant using only Gaussians\u2019 centers, colors, estimated normals as inputs. Interestingly, this encoder shows strong transfer and outperforms the point clouds baseline while using $39.9\\times$ fewer training scenes. Finally, we propose a render-and-distill adaptation that facilitates out-of-domain finetuning. Our code and model will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40395", "url": null, "sourceid": -40319, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40136?format=json"], "related_events_ids": [40136]}, {"id": 38733, "uid": "e7d394a3b9082264ac8c53b6a885df1c", "name": "VecGlypher: Unified Vector Glyph Generation with Language Models", "authors": [{"id": 154250, "fullname": "Xiaoke Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154250?format=json", "institution": "UC Santa Cruz"}, {"id": 190542, "fullname": "Bhavul Gauri", "url": "http://cvpr.thecvf.com/api/miniconf/users/190542?format=json", "institution": "Meta"}, {"id": 157111, "fullname": "Kam Woh Ng", "url": "http://cvpr.thecvf.com/api/miniconf/users/157111?format=json", "institution": "Meta"}, {"id": 190543, "fullname": "Tony Ng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190543?format=json", "institution": "Facebook"}, {"id": 138974, "fullname": "Mengmeng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/138974?format=json", "institution": "Meta"}, {"id": 126968, "fullname": "Zhiheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126968?format=json", "institution": "University of Science and Technology of China"}, {"id": 130315, "fullname": "Weiming Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/130315?format=json", "institution": "University of Waterloo"}, {"id": 113866, "fullname": "Zhaochong An", "url": "http://cvpr.thecvf.com/api/miniconf/users/113866?format=json", "institution": "University of Copenhagen"}, {"id": 103410, "fullname": "Zijian Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/103410?format=json", "institution": "King&#x27;s College London"}, {"id": 155930, "fullname": "Haonan Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155930?format=json", "institution": "Nanyang Technological University"}, {"id": 75508, "fullname": "Yuyin Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/75508?format=json", "institution": "UC Santa Cruz"}, {"id": 127697, "fullname": "Sen He", "url": "http://cvpr.thecvf.com/api/miniconf/users/127697?format=json", "institution": "Meta AI"}, {"id": 169645, "fullname": "Ziheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/169645?format=json", "institution": "Facebook"}, {"id": 85886, "fullname": "Tao Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85886?format=json", "institution": "University of Surrey"}, {"id": 157110, "fullname": "Xiao Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/157110?format=json", "institution": "Meta AI"}], "abstract": "Vector glyphs are the atomic units of digital typography, yet most learning-based pipelines still depend on carefully curated exemplar sheets and raster-to-vector postprocessing, which limits accessibility and editability. We introduce VecGlypher, a single multimodal language model that generates high-fidelity vector glyphs directly from text descriptions or image exemplars. Given a style prompt, optional reference glyph images, and a target character, VecGlypher autoregressively emits SVG path tokens, avoiding raster intermediates and producing editable, watertight outlines in one pass. A typography-aware data and training recipe makes this possible: (i) a large-scale continuation stage on 39K noisy Envato fonts to master SVG syntax and long-horizon geometry, followed by (ii) post-training on 2.5K expert-annotated Google Fonts with descriptive tags and exemplars to align language and imagery with geometry; preprocessing normalizes coordinate frames, canonicalizes paths, de-duplicates families, and quantizes coordinates for stable long-sequence decoding. On cross-family OOD evaluation, VecGlypher substantially outperforms both general-purpose LLMs and specialized vector-font baselines for text-only generation, while image-referenced generation reaches a state-of-the-art performance, with marked gains over DeepVecFont-v2 and DualVector. Ablations show that model scale and the two-stage recipe are critical and that absolute-coordinate serialization yields the best geometry. VecGlypher lowers the barrier to font creation by letting users design with words or exemplars, and provides a scalable foundation for future multimodal design tools.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38733", "url": null, "sourceid": 36018, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39382, "uid": "f8936fbd065f5b9fdb3149eaba9665df", "name": "Dark3R: Learning Structure from Motion in the Dark", "authors": [{"id": 160560, "fullname": "Andrew Y Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/160560?format=json", "institution": "University of Toronto"}, {"id": 160630, "fullname": "Anagh Malik", "url": "http://cvpr.thecvf.com/api/miniconf/users/160630?format=json", "institution": "University of Toronto"}, {"id": 77551, "fullname": "SaiKiran Tedla", "url": "http://cvpr.thecvf.com/api/miniconf/users/77551?format=json", "institution": "York University"}, {"id": 191961, "fullname": "Yutong Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/191961?format=json", "institution": "Sony"}, {"id": 191962, "fullname": "Yiqian Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/191962?format=json", "institution": "PRE"}, {"id": 191963, "fullname": "Zach Salehe", "url": "http://cvpr.thecvf.com/api/miniconf/users/191963?format=json", "institution": "Harvard University"}, {"id": 191964, "fullname": "Benjamin Attal", "url": "http://cvpr.thecvf.com/api/miniconf/users/191964?format=json", "institution": "University of Toronto"}, {"id": 156815, "fullname": "Sotiris Nousias", "url": "http://cvpr.thecvf.com/api/miniconf/users/156815?format=json", "institution": "University of Toronto"}, {"id": 93592, "fullname": "Kiriakos Kutulakos", "url": "http://cvpr.thecvf.com/api/miniconf/users/93592?format=json", "institution": "University of Toronto"}, {"id": 77223, "fullname": "David B. Lindell", "url": "http://cvpr.thecvf.com/api/miniconf/users/77223?format=json", "institution": "University of Toronto"}], "abstract": "We introduce Dark3R, a framework for structure from motion in the dark that operates directly on raw images with signal-to-noise ratios (SNRs) below $-4$ dB\u2014a regime where conventional feature- and learning-based methods break down. Our key insight is to adapt large-scale 3D foundation models to extreme low-light conditions through a teacher\u2013student distillation process, enabling robust feature matching and camera pose estimation in low light. Dark3R requires no 3D supervision; it is trained solely on noisy--clean raw image pairs, which can be either captured directly or synthesized using a simple Poisson\u2013Gaussian noise model applied to well-exposed raw images.To train and evaluate our approach, we introduce a new, exposure-bracketed dataset that includes $\\sim$42,000 multi-view raw images with ground-truth 3D annotations, and we demonstrate that Dark3R achieves state-of-the-art structure from motion in the low-SNR regime. Further, we demonstrate state-of-the-art novel view synthesis in the dark using Dark3R's predicted poses and a coarse-to-fine radiance field optimization procedure.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39382", "url": null, "sourceid": 30873, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37290, "uid": "0183e4f6ecf3efd66438a27cb4ec2d68", "name": "PhyCritic: Multimodal Critic Models for Physical AI", "authors": [{"id": 153748, "fullname": "Tianyi Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/153748?format=json", "institution": "University of Maryland, College Park"}, {"id": 157791, "fullname": "Shihao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157791?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 153845, "fullname": "Guilin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153845?format=json", "institution": "NVIDIA"}, {"id": 187091, "fullname": "Yi Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187091?format=json", "institution": "NVIDIA"}, {"id": 91542, "fullname": "Ming Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/91542?format=json", "institution": "The University of Tokyo"}, {"id": 84846, "fullname": "Heng Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84846?format=json", "institution": "University of Pittsburgh"}, {"id": 73960, "fullname": "Jan Kautz", "url": "http://cvpr.thecvf.com/api/miniconf/users/73960?format=json", "institution": "NVIDIA"}, {"id": 91930, "fullname": "Zhiding Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/91930?format=json", "institution": "NVIDIA"}], "abstract": "With the rapid development of large multimodal models, reliable judge and critic models have become essential for open-ended evaluation and preference alignment, providing pairwise preferences, numerical scores, and explanatory justifications for assessing model-generated responses. However, existing critics are primarily trained in general visual domains such as captioning or image question answering, leaving physical AI tasks involving perception, causal reasoning, and planning largely underexplored. We introduce PhyCritic, a multimodal critic model optimized for physical AI through a two-stage RLVR pipeline: a physical skill warmup stage that enhances physically oriented perception and reasoning, followed by self-referential critic finetuning, where the critic generates its own prediction as an internal reference before judging candidate responses, improving judgment stability and physical correctness. Across both physical and general-purpose multimodal judge benchmarks, PhyCritic achieves strong performance gains over open-source baselines and, when applied as a policy model, further improves perception and reasoning in physically grounded tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37290", "url": null, "sourceid": 32310, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37655, "uid": "8f7389a164b5eb59f92579ca0bbe6da5", "name": "RewardFlow: Generate Images by Optimizing What You Reward", "authors": [{"id": 180408, "fullname": "Onkar Susladkar", "url": "http://cvpr.thecvf.com/api/miniconf/users/180408?format=json", "institution": "UIUC"}, {"id": 187954, "fullname": "Dong-Hwan Jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187954?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 187955, "fullname": "Tushar Prakash", "url": "http://cvpr.thecvf.com/api/miniconf/users/187955?format=json", "institution": "Sony Research India"}, {"id": 152648, "fullname": "Adheesh Juvekar", "url": "http://cvpr.thecvf.com/api/miniconf/users/152648?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 187956, "fullname": "Vedant Shah", "url": "http://cvpr.thecvf.com/api/miniconf/users/187956?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 187957, "fullname": "Ayush Barik", "url": "http://cvpr.thecvf.com/api/miniconf/users/187957?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 187958, "fullname": "Nabeel Bashir", "url": "http://cvpr.thecvf.com/api/miniconf/users/187958?format=json", "institution": null}, {"id": 152650, "fullname": "Muntasir Wahed", "url": "http://cvpr.thecvf.com/api/miniconf/users/152650?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 187959, "fullname": "Ritish Shrirao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187959?format=json", "institution": "International Institute of Information Technology, Bangalore"}, {"id": 152651, "fullname": "Ismini Lourentzou", "url": "http://cvpr.thecvf.com/api/miniconf/users/152651?format=json", "institution": "University of Illinois Urbana - Champaign"}], "abstract": "RewardFlow is a zero-shot, training-free framework for text-guided image editing and generation based on reward-guided Langevin dynamics. We steer pretrained diffusion and flow-matching models at inference time using a diverse set of differentiable rewards, and control their influence with a prompt-aware adaptive policy that parses the text instruction, infers edit intent, and dynamically adjusts update steps. Our design includes a differentiable VQA-based reward for fine-grained semantic supervision and a SAM-guided reward for precise, localized edits with minimal leakage. Across standard image editing and compositional generation benchmarks, RewardFlow achieves state-of-the-art zero-shot edit fidelity and compositional alignment. We will release the code and an open-source demo upon acceptance of the paper.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37655", "url": "https://plan-lab.github.io/projects/rewardflow/", "sourceid": 35607, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38023, "uid": "f4556722628af5df7e753c1256c54bf3", "name": "VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding", "authors": [{"id": 157791, "fullname": "Shihao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157791?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 126529, "fullname": "Guo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126529?format=json", "institution": "Nanjing University"}, {"id": 188846, "fullname": "De-An Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188846?format=json", "institution": "NVIDIA"}, {"id": 132597, "fullname": "Zhiqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/132597?format=json", "institution": "NVIDIA"}, {"id": 77015, "fullname": "Minghan LI", "url": "http://cvpr.thecvf.com/api/miniconf/users/77015?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 153845, "fullname": "Guilin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153845?format=json", "institution": "NVIDIA"}, {"id": 73960, "fullname": "Jan Kautz", "url": "http://cvpr.thecvf.com/api/miniconf/users/73960?format=json", "institution": "NVIDIA"}, {"id": 73959, "fullname": "Jose M. Alvarez", "url": "http://cvpr.thecvf.com/api/miniconf/users/73959?format=json", "institution": "NVIDIA"}, {"id": 88043, "fullname": "Lei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88043?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 91930, "fullname": "Zhiding Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/91930?format=json", "institution": "NVIDIA"}], "abstract": "While Video Large Language Models (Video-LLMs) have shown significant potential in multimodal understanding and reasoning tasks, how to efficiently select the most informative frames from videos remains a critical challenge. Existing methods attempt to optimize frame sampling by reducing inter-frame redundancy or employing unsupervised event localization. However, these approaches often fall short in handling complex instruction-following tasks and scenarios that demand precise temporal modeling, resulting in limited performance in both semantic alignment and temporal reasoning. To address the above challenges, we introduce Instructed Temporal Grounding for Videos (VideoITG), a framework aiming to adaptively customize frame sampling strategies based on user instructions. Specifically, we design the VidThinker pipeline, which automates annotation by generating instruction-conditioned captions, retrieving relevant video segments, and selecting key frames to enable efficient supervision. Using VidThinker, we build the VideoITG-40K dataset with 40K videos and 500K temporal grounding annotations. Our plug-and-play VideoITG model leverages Video-LLMs\u2019 visual-language alignment and reasoning for discriminative frame selection. VideoITG consistently boosts the performance on multiple multimodal video understanding benchmarks, demonstrating its effectiveness and potential.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38023", "url": null, "sourceid": 30740, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39004, "uid": "9d6e19ad9c92e6d9e0579b33d7970c5f", "name": "Scaling Parallel Sequence Models to Vision Foundation Models", "authors": [{"id": 72288, "fullname": "Yitong Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/72288?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 177774, "fullname": "Collin McCarthy", "url": "http://cvpr.thecvf.com/api/miniconf/users/177774?format=json", "institution": "NVIDIA"}, {"id": 84689, "fullname": "Hongjun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84689?format=json", "institution": "University of Hong Kong"}, {"id": 191164, "fullname": "Hanrong Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/191164?format=json", "institution": "NVIDIA"}, {"id": 75510, "fullname": "Qi Dou", "url": "http://cvpr.thecvf.com/api/miniconf/users/75510?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 87471, "fullname": "Tianfan Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/87471?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 75943, "fullname": "Jinwei Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75943?format=json", "institution": "NVIDIA"}, {"id": 73960, "fullname": "Jan Kautz", "url": "http://cvpr.thecvf.com/api/miniconf/users/73960?format=json", "institution": "NVIDIA"}, {"id": 73956, "fullname": "Danny Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/73956?format=json", "institution": "NVIDIA"}, {"id": 73958, "fullname": "Pavlo Molchanov", "url": "http://cvpr.thecvf.com/api/miniconf/users/73958?format=json", "institution": "NVIDIA"}, {"id": 76011, "fullname": "Sifei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76011?format=json", "institution": "NVIDIA"}], "abstract": "Scaling vision foundation models is constrained by the quadratic complexity of self-attention. Although subquadratic attention alternatives like linear attention variants and state-space models successfully reduce the model complexity, they typically serialize images into 1D token sequences, compromising spatial coherence and efficiency. Generalized Spatial Propagation Networks (GSPN) offer a linear-time alternative that propagates context directly on the 2D grid via line-scan propagation and removes positional embeddings, yet the original design hits GPU-scaling limits: growing batch/channels saturate SM concurrency, serializing scans, and spiking latency. We introduce Compact GSPN (C-GSPN), a ViT block that compresses the propagation space to preserve accuracy while cutting propagation latency by nearly 10\u00d7. We further improve efficiency with lightweight projections and fused CUDA kernels. To enable large-scale pretraining, we adopta two-stage cross-operator distillation strategy that combines layer-wise supervision with end-to-end alignment. In a representative 1K configuration (batch 32, C=1152), C-GSPN achieves up to 2\u00d7 speedup, maintains competitive zero-shot accuracy, and improves segmentation by +2.1%. Extensive experiments and ablations show that the proposed compression and two-stage distillation are criticalfor strong transfer while substantially reducing compute, enabling the first extension of a subquadratic operator to foundation-scale (CLIP-style) vision pretraining.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39004", "url": null, "sourceid": 45510, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39333, "uid": "fd1e1971f70705af958fce3debc479dd", "name": "NitroGen: An Open Foundation Model for Generalist Gaming Agents", "authors": [{"id": 191868, "fullname": "Lo\u00efc Magne", "url": "http://cvpr.thecvf.com/api/miniconf/users/191868?format=json", "institution": "NVIDIA"}, {"id": 191869, "fullname": "Anas Awadalla", "url": "http://cvpr.thecvf.com/api/miniconf/users/191869?format=json", "institution": "Department of Computer Science, University of Washington"}, {"id": 191870, "fullname": "Guanzhi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191870?format=json", "institution": "California Institute of Technology"}, {"id": 86771, "fullname": "Yinzhen Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86771?format=json", "institution": "University of California, San Diego"}, {"id": 191871, "fullname": "Joshua Belofsky", "url": "http://cvpr.thecvf.com/api/miniconf/users/191871?format=json", "institution": "General Trajectory"}, {"id": 191872, "fullname": "Fengyuan Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191872?format=json", "institution": "NVIDIA"}, {"id": 191873, "fullname": "Joohwan Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/191873?format=json", "institution": "NVIDIA"}, {"id": 85320, "fullname": "Ludwig Schmidt", "url": "http://cvpr.thecvf.com/api/miniconf/users/85320?format=json", "institution": "University of Washington"}, {"id": 86982, "fullname": "Georgia Gkioxari", "url": "http://cvpr.thecvf.com/api/miniconf/users/86982?format=json", "institution": "California Institute of Technology"}, {"id": 73960, "fullname": "Jan Kautz", "url": "http://cvpr.thecvf.com/api/miniconf/users/73960?format=json", "institution": "NVIDIA"}, {"id": 76018, "fullname": "Yisong Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/76018?format=json", "institution": "Caltech"}, {"id": 187497, "fullname": "Yejin Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/187497?format=json", "institution": "Stanford University"}, {"id": 75460, "fullname": "Yuke Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75460?format=json", "institution": "University of Texas - Austin"}, {"id": 169493, "fullname": "Linxi Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/169493?format=json", "institution": "NVIDIA"}], "abstract": "We introduce NitroGen, a video-action foundation model for generalist gaming agents, trained on 40,000 hours of gameplay videos across more than 1000 games. We incorporate three key ingredients: 1) an internet-scale video-action dataset constructed by automatically extracting player actions from publicly available gameplay videos, 2) a multi-game benchmark environment that can measure cross-game generalization, and 3) a unified vision-action model trained with large-scale behavior cloning. NitroGen exhibits strong competence across diverse domains, including combat encounters in 3D action games, high-precision control in 2D platformers, and exploration in procedurally generated worlds. It transfers effectively to unseen games, achieving up to 52% relative improvement in success rates over models trained from scratch. We release the dataset, benchmark, and model weights to advance research on generalist embodied agents.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39333", "url": null, "sourceid": 43643, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40365?format=json"], "related_events_ids": [40365]}, {"id": 39950, "uid": "868329656ab8bac9417b7536c7a65916", "name": "Revisiting Monocular SLAM with Spatio-Temporal Scene Modeling", "authors": [{"id": 183577, "fullname": "Valter Piedade", "url": "http://cvpr.thecvf.com/api/miniconf/users/183577?format=json", "institution": "Instituto Superior T\u00e9cnico"}, {"id": 182269, "fullname": "Lalit Manam", "url": "http://cvpr.thecvf.com/api/miniconf/users/182269?format=json", "institution": "Mitsubishi Electric Research Labs (MERL)"}, {"id": 193179, "fullname": "Masashi Yamazaki", "url": "http://cvpr.thecvf.com/api/miniconf/users/193179?format=json", "institution": "Mitsubishi Electric Corporation"}, {"id": 97201, "fullname": "Pedro Miraldo", "url": "http://cvpr.thecvf.com/api/miniconf/users/97201?format=json", "institution": "Mitsubishi Electric Research Laboratories"}], "abstract": "Visual SLAM is one of the most fundamental problems in computer vision, with direct applications to real-time localization tasks such as AR/VR, robotics, and 3D scene reconstruction. Although significant progress has been made in both sparse and dense approaches, real-time monocular SLAM remains challenging\u2014particularly in the uncalibrated setting, where existing methods are often inefficient and lack modularity. In this paper, we present a new visual SLAM pipeline implemented from scratch in C++ that explicitly leverages the spatio-temporal structure of the scene for improved localization, and is designed to be modular so that off-the-shelf components can be easily integrated. We introduce a temporal representation based on a buffer of recent keyframes that preserves short-term scene continuity. To complement this, we incorporate a spatial representation based on a 3D cell-based scene model, enabling efficient retrieval of relevant 3D points from previously reconstructed regions. Leveraging recent feed-forward geometry estimators, our hybrid design combines sparse keypoint-based localization with a dense anchor-point\u2013driven spatial representation. This integration allows us to achieve real-time performance (exceeding 80 FPS) and a substantial efficiency improvement compared to existing uncalibrated monocular SLAM pipelines, while maintaining or improving localization accuracy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39950", "url": null, "sourceid": 34806, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39359, "uid": "4a86eb6a5ec9c86d6c4b8e1cfa765c62", "name": "FedSDR: Federated Graph Learning with Structural Noise Detection and Reconstruction", "authors": [{"id": 181345, "fullname": "Jiaqi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181345?format=json", "institution": "Wuhan University"}, {"id": 144990, "fullname": "Zihan Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/144990?format=json", "institution": "Wuhan University"}, {"id": 190990, "fullname": "Guancheng Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190990?format=json", "institution": "Wuhan University"}, {"id": 184315, "fullname": "Wenke Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184315?format=json", "institution": "Nanyang Technological University"}, {"id": 87442, "fullname": "He Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/87442?format=json", "institution": "Wuhan University"}, {"id": 76422, "fullname": "Mang Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/76422?format=json", "institution": "Wuhan University"}], "abstract": "Federated Graph Learning (FGL) has emerged as a principled framework for decentralized training of Graph Neural Networks (GNNs) while preserving data privacy. In subgraph-FL scenarios, however, structural noise arising from data collection and storage can damage the GNN message-passing scheme of clients, leading to conflicts in collaboration. Existing approaches exhibit two critical limitations: 1) Globally, they fail to identify corrupted clients, causing destructive knowledge inconsistencies. 2) Locally, the global GNN performs poorly on these clients due to structural noise, limiting their ability to benefit from federated collaboration. To address these challenges, we propose $\\textbf{FedSDR}$,  a spectra-based FGL framework against high-structural-noise scenarios. Specifically, Structural Noise-Aware Aggregation (SNAA) introduces a noise evaluation metric to detect corrupted clients and reduce their contributions, thereby mitigating the impact of noise on the global GNN. Furthermore, Robust Local Structure Reconstruction (RLSR) leverages the knowledge from the healthy global model to repair locally corrupted graph structures. Extensive experiments demonstrate that FedSDR outperforms state-of-the-art methods across various scenarios under structural noise.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39359", "url": null, "sourceid": 42950, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39421, "uid": "56f63261671fdfe71cd4f4e0f16f90c4", "name": "Orthogonal Spatial-Aware Multi-View Anchor Graph Clustering for Incomplete Remote Sensing Data", "authors": [{"id": 91292, "fullname": "Yongshan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91292?format=json", "institution": "China University of Geosciences Wuhan"}, {"id": 192044, "fullname": "Xiaohuan Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192044?format=json", "institution": "China University of Geosciences Wuhan"}, {"id": 187250, "fullname": "Lefei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187250?format=json", "institution": "Wuhan University"}, {"id": 192045, "fullname": "Zhihua Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/192045?format=json", "institution": null}], "abstract": "Multi-view clustering for remote sensing data has received increasing attention by leveraging diverse data representations to enhance Earth observation. Existing methods are primarily developed under the assumption that each pixel is fully observed across all views. No prior work has investigated the more practical yet challenging scenario where some views suffer from partially missing data. To bridge this gap, this paper presents the first study on clustering incomplete remote sensing data, termed orthogonal spatial-aware multi-view anchor graph clustering (OSMAGC). Specifically, spatial-aware anchors and multi-scale anchor graphs are initially constructed by exploiting the superpixel-based texture characteristics of each view. Based on these, multi-scale anchor graph learning is performed through view weighting and matrix factorization on incomplete data. Structure-aligned consensus feature learning is achieved by aligning the multi-scale graph structures within a shared latent space. To ensure spatial continuity and smoothness, orthogonal spatial-aware regularization is imposed in both horizontal and vertical directions. These three modules are jointly optimized through a well-designed optimization algorithm in a mutually reinforcing manner. Extensive experiments on four benchmark datasets validate the effectiveness and efficiency of our proposed method over the state-of-the-art competitors.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39421", "url": null, "sourceid": 43823, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40157, "uid": "86311dbe35f1b6c5166365165602f54d", "name": "FUSAR-GPT: A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery", "authors": [{"id": 175822, "fullname": "Xiaokun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/175822?format=json", "institution": "Fudan University"}, {"id": 193656, "fullname": "Yi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193656?format=json", "institution": "Fudan University"}, {"id": 193657, "fullname": "Ziqi Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/193657?format=json", "institution": "Fudan University"}, {"id": 193658, "fullname": "Baiyun Baiyun", "url": "http://cvpr.thecvf.com/api/miniconf/users/193658?format=json", "institution": "Fudan Univercity"}, {"id": 193659, "fullname": "Xiaorong Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/193659?format=json", "institution": "Fudan University"}, {"id": 193660, "fullname": "Qingchen Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193660?format=json", "institution": "Fudan University"}, {"id": 100893, "fullname": "Ry Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/100893?format=json", "institution": "Fudan University"}, {"id": 193661, "fullname": "Xinpeng Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/193661?format=json", "institution": "FuDan university"}, {"id": 193662, "fullname": "Haipeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193662?format=json", "institution": "Fudan University"}], "abstract": "Research on the intelligent interpretation of all-weather, all-time Synthetic Aperture Radar (SAR) is crucial for advancing remote sensing applications. In recent years, although Visual Language Models (VLMs) have demonstrated strong open-world understanding capabilities on RGB images, their performance is severely limited when directly applied to the SAR field due to the complexity of the imaging mechanism, sensitivity to scattering features, and the scarcity of high-quality text corpora. To systematically address this issue, we constructed the inaugural SAR Image-Text-AlphaEarth feature triplet dataset and developed FUSAR-GPT, a VLM specifically for SAR. FUSAR-GPT innovatively introduces a geospatial baseline model as a 'world knowledge' prior and embeds multi-source remote-sensing temporal features into the model's visual backbone via 'spatiotemporal anchors', enabling dynamic compensation for the sparse representation of targets in SAR images. Furthermore, we designed a two-stage SFT strategy to decouple the knowledge injection and task execution of large models. The spatiotemporal feature embedding and the two-stage decoupling paradigm enable FUSAR-GPT to achieve state-of-the-art performance across several typical remote sensing visual-language benchmark tests, significantly outperforming mainstream baseline models by over 12%.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40157", "url": null, "sourceid": 33268, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38169, "uid": "2b49fbedae1b24d9a8953ab9263fb782", "name": "Boosting Visual Reprogramming for Vision-Language Models with Dual Granularity Alignment", "authors": [{"id": 189198, "fullname": "Jiayang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189198?format=json", "institution": "Harbin Institute of Technology"}, {"id": 189199, "fullname": "Xinyang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189199?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 189200, "fullname": "Ke Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/189200?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 90282, "fullname": "Weili Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/90282?format=json", "institution": "Monash University"}], "abstract": "Model reprogramming adapts pretrained models to downstream tasks by modifying their input and output spaces. Visual reprogramming (VR), a prominent instance, introduces learnable input transformations (e.g., visual prompts) to repurpose vision-language models like CLIP for downstream visual tasks. Existing VR methods primarily focus on single-level alignment between prompted images and text descriptions, overlooking inherent structural information in data that facilitates alignment: semantic granularity (label hierarchies) and visual granularity (multi-scale representations). To address this gap, we propose Dual Granularity Alignment (DGA): First, for visual granularity, we generate multi-scale images and propose Uncertainty-calibrated Prediction Fusion (UPF) to capture hierarchical spatial information within images. Second, for semantic granularity, we propose Prototype-guided Label Hierarchization (PLF) to construct category hierarchies from visual semantic similarities and propose Hierarchical Knowledge Propagation (HKP) to achieve top-down superclass-to-subclass guidance for coherent multi-level visual prompts alignment. Our DGA collaboratively integrate both granularities to enhance alignment effectiveness. Experiments across 12 downstream datasets demonstrate DGA's superiority over baselines on both ViT-based and ResNet-based CLIP architectures. Specifically, DGA achieves a 4.5% improvement over the previous state-of-the-art method on ViT-16-based CLIP. By explicitly modeling structural granularities, DGA establishes a new paradigm for visual reprogramming.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38169", "url": null, "sourceid": 41897, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37338, "uid": "07d9152e686ddb50c5330f7f9c1c58bc", "name": "PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis", "authors": [{"id": 174266, "fullname": "chunji lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/174266?format=json", "institution": "Beijing Institute of Technology"}, {"id": 187194, "fullname": "Zequn Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187194?format=json", "institution": "Li Auto Inc."}, {"id": 153284, "fullname": "Donglin Di", "url": "http://cvpr.thecvf.com/api/miniconf/users/153284?format=json", "institution": "Harbin Institute of Technology"}, {"id": 187195, "fullname": "Weinan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187195?format=json", "institution": "Harbin Institute of Technology"}, {"id": 187196, "fullname": "Hao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187196?format=json", "institution": "Li Auto Inc."}, {"id": 187197, "fullname": "Chen Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/187197?format=json", "institution": "Li Auto Inc."}, {"id": 76644, "fullname": "Yinjie Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/76644?format=json", "institution": "Sichuan University"}, {"id": 105376, "fullname": "Changsheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/105376?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "Despite advances in physics-based 3D motion synthesis, current methods face key limitations: reliance on pre-reconstructed 3D Gaussian Splatting (3DGS) built from dense multi-view images with time-consuming per-scene optimization; physics integration via either inflexible, hand-specified attributes or unstable, optimization-heavy guidance from video models using Score Distillation Sampling (SDS); and na\u00efve concatenation of prebuilt 3DGS with physics modules, which ignores physical information embedded in appearance and yields suboptimal performance. To address these issues, we propose PhysGM, a feed-forward framework that jointly predicts 3D Gaussian representation and physical properties from a single image, enabling immediate simulation and high-fidelity 4D rendering. Unlike slow appearance-agnostic optimization methods, we first pre-train a physics-aware reconstruction model that directly infers both Gaussian and physical parameters.  We further refine the model with Direct Preference Optimization (DPO), aligning simulations with the physically plausible reference videos and avoiding the high-cost SDS optimization. To address the absence of a supporting dataset for this task, we propose PhysAssets, a dataset of 50K+ 3D assets annotated with physical properties and corresponding reference videos. Experiments show that PhysGM produces high-fidelity 4D simulations from a single image in one minute, achieving a significant speedup over prior work while delivering realistic renderings.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37338", "url": null, "sourceid": 43364, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36478, "uid": "da2412d2fc99067e7fa748d3b747c15a", "name": "When Transformers Meet Mamba: A Hybrid Transformer-Mamba Network for Video Object Detection", "authors": [{"id": 180753, "fullname": "Qiang Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/180753?format=json", "institution": "Qingdao University of Science and Technology"}, {"id": 185146, "fullname": "Xiao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185146?format=json", "institution": "Qingdao University of Science and Technology"}, {"id": 185147, "fullname": "Zongyuan Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/185147?format=json", "institution": "Qingdao University of Science and Technology"}, {"id": 185148, "fullname": "Yu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185148?format=json", "institution": "Qingdao University of Science and Technology"}], "abstract": "Video object detection has gained notable progress with the advent of transformers. While transformers excel at modeling long-range contextual dependencies, the quadratic complexity limits their efficiency in long-sequence processing. In contrast, Mamba offers greater efficiency in modeling long sequences but tends to exhibit relatively limited contextual learning capability compared with transformers, and its application to video object detection remains unexplored. To harness the complementary strengths of transformers and Mamba, we propose a hybrid Transformer-Mamba network for video object detection (TMambaDet), a pioneering framework in this domain that combines the long-range modeling power of transformers with the efficient long-sequence processing capability of Mamba. Our TMambaDet is characterized by three core components: 1) a spatial adaptive deformable transformer encoder to effectively model the long-range dependencies within each frame, enabling intra-frame feature aggregation that substantially improves the spatial feature representations of objects; 2) a temporal cascaded bidirectional Mamba encoder to efficiently capture the long-range dependencies across frames in video sequences with linear complexity, enabling inter-frame feature aggregation that effectively enhances the temporal feature representations of objects; 3) a Mamba entangled transformer decoder to fully explore the interactions between object queries and spatial-temporal features, enabling fine-grained query-feature alignment that effectively enriches the instance-level representations of object queries. We conduct experiments on the ImageNet VID and EPIC-KITCHENS-55 datasets, showing that TMambaDet achieves state-of-the-art results. Codes will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36478", "url": null, "sourceid": 39027, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36595, "uid": "9eb87b710e10e833e9259aab3276c2a7", "name": "APEX: A Decoupled Memory-based Explorer for Asynchronous Aerial Object Goal Navigation", "authors": [{"id": 185427, "fullname": "Daoxuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185427?format=json", "institution": "Harbin Institute of Technology, ShenZhen"}, {"id": 185428, "fullname": "Ping Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185428?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 84803, "fullname": "Xiaobo Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/84803?format=json", "institution": "The University of Sydney"}, {"id": 156647, "fullname": "Xiu Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/156647?format=json", "institution": "Central South University"}, {"id": 185429, "fullname": "Ruichen Zhen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185429?format=json", "institution": null}, {"id": 185430, "fullname": "Jianqiang Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185430?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 185431, "fullname": "Shuo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185431?format=json", "institution": "Harbin Institute of Technology (Shenzhen)"}], "abstract": "The Aerial Object Goal Navigation, a challenging frontier in Embodied AI, requires an Unmanned Aerial Vehicle (UAV) agent to autonomously explore, reason, and identify a specific target using only visual perception and language description. However, existing methods struggle with the memorization of complex spatial representations in aerial environments, reliable and interpretable action decision-making, and inefficient exploration and information gathering. To address these challenges, we introduce **APEX** (Aerial Parallel Explorer), a novel hierarchical agent designed for efficient exploration and target acquisition in complex aerial settings. APEX is built upon a modular, three-part architecture: 1) Dynamic Spatio-Semantic Mapping Memory, which leverages the zero-shot capability of a Vision-Language Model (VLM) to dynamically construct high-resolution 3D Attraction, Exploration, and Obstacle maps, serving as an interpretable memory mechanism. 2) Action Decision Module, trained with reinforcement learning, which translates this rich spatial understanding into a fine-grained and robust control policy. 3) Target Grounding Module, which employs an open-vocabulary detector to achieve definitive and generalizable target identification. All these components are integrated into a hierarchical, asynchronous, and parallel framework, effectively bypassing the VLM's inference latency and boosting the agent's proactivity in exploration. Extensive experiments show that APEX outperforms the previous state of the art by +4.2\\% SR and +2.8\\% SPL on challenging UAV-ON benchmarks, demonstrating its superior efficiency and the effectiveness of its hierarchical asynchronous design.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36595", "url": null, "sourceid": 46262, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37429, "uid": "746ac26956df2d6be2b2c66c26b62fda", "name": "HiconAgent: History Context-aware Policy Optimization for GUI Agents", "authors": [{"id": 183552, "fullname": "Xurui Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/183552?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 132952, "fullname": "Gongwei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/132952?format=json", "institution": "Harbin Institute of Technology"}, {"id": 156167, "fullname": "Yuquan Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/156167?format=json", "institution": "Harbin Institute of Technology"}, {"id": 156166, "fullname": "Zaijing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/156166?format=json", "institution": "Harbin Institute of Technology"}, {"id": 184580, "fullname": "Kaiwen Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184580?format=json", "institution": "Knowin AI"}, {"id": 186926, "fullname": "Shuai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186926?format=json", "institution": "Yale University"}, {"id": 185431, "fullname": "Shuo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185431?format=json", "institution": "Harbin Institute of Technology (Shenzhen)"}, {"id": 89574, "fullname": "Zhuotao Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/89574?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 89896, "fullname": "Rui Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/89896?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Graphical User Interface (GUI) agents require effective utilization of historical context to perform sequential navigation tasks. While incorporating past actions and observations can significantly improve decision-making, naively using full history leads to excessive computational overhead and potential distraction from irrelevant information. In this work, we introduce ****HiconAgent****, a GUI agent trained with ****History Context-aware Policy Optimization (HCPO)**** for effective and efficient utilization of historical information. HCPO explicitly optimizes history usage in both sampling and policy updates by integrating two complementary components: ****(1) Dynamic Context Sampling (DCS)**** presents the agent with variable-length histories during sampling, enabling adaptive use of the most relevant historical context to improve sequential decision quality; ****(2) Anchor-guided History Compression (AHC)**** refines the policy update phase via a dual-branch optimization strategy, where the compressed branch drops history observations while keeping history actions as information flow anchors. The compressed and uncompressed branches are coupled through a history-enhanced alignment loss to enforce consistent history usage, achieving efficiency with minimal performance degradation. Extensive experiments on mainstream GUI navigation benchmarks demonstrate the strong performance of our model. Despite its smaller size, HiconAgent-3B outperforms GUI-R1-7B by ****+8.46****% grounding and ****+11.32****% step successful rate on GUI-Odyssey, while achieving comparable results on AndroidControl and AITW, with up to ****2.47\u00d7 computational speedup**** and ****60% FLOPs reduction****.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37429", "url": null, "sourceid": 45283, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38853, "uid": "3abb61db6353550532091da125a41c32", "name": "DRAMA: Next-Gen Dynamic Orchestration for Resilient Multi-Agent Ecosystems in Flux", "authors": [{"id": 149199, "fullname": "Xinkui Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/149199?format=json", "institution": "Zhejiang University"}, {"id": 190836, "fullname": "Yifan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190836?format=json", "institution": "Zhejiang University"}, {"id": 190837, "fullname": "Sai Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190837?format=json", "institution": "Zhejiang University"}, {"id": 190838, "fullname": "Naibo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190838?format=json", "institution": null}, {"id": 190839, "fullname": "Guanjie Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190839?format=json", "institution": "Zhejiang University"}, {"id": 190840, "fullname": "Yueshen Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190840?format=json", "institution": "Xidian University"}, {"id": 190841, "fullname": "Chang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190841?format=json", "institution": "Zhejiang University"}, {"id": 155454, "fullname": "Shuiguang Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/155454?format=json", "institution": "Zhejiang University"}, {"id": 190842, "fullname": "Jianwei Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190842?format=json", "institution": "Zhejiang University"}], "abstract": "Embodied Multi-Agent Systems have proven highly effective in addressing complex tasks through coordinated collaboration among heterogeneous agents. However, real-world environments and task specifications are inherently dynamic, exhibiting frequent changes, uncertainty, and variability. Despite these characteristics, most existing frameworks employ static architectures with fixed agent capabilities and rigid task allocation strategies, which substantially constrain their adaptability to evolving conditions. This inflexibility presents significant challenges to maintaining robust and efficient multi-agent cooperation in dynamic and unpredictable settings.To address these limitations, we propose DRAMA, short for Dynamic Orchestration for Resilient Multi-Agent Ecosystems, tailored for rapidly changing environments. DRAMA adopts a multilayer architecture that incorporates three principal mechanisms: adaptive scheduling through an affinity-driven mechanism, fault-tolerant continuity via hierarchical trust-chain task takeover, and collective spatial intelligence that consolidates distributed observations for predictive reasoning. Together, these components enable event-triggered rescheduling and decentralized fault recovery, ensuring uninterrupted task execution amid agent arrivals, dropouts, or recoveries. Extensive experiments in the embodied VirtualHome-Social environment demonstrate that DRAMA achieves a 7% improvement in runtime efficiency and a 10% increase in throughput compared with state-of-the-art baselines, while maintaining superior stability and robustness under dynamic agent populations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38853", "url": null, "sourceid": 44353, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39208, "uid": "9ee01a4fa4d78d75be794baa1ca45906", "name": "Localizing, Structuring, and Rendering: Bridging 3D and 2D Vision-Language-Action Models for Robotic Manipulation", "authors": [{"id": 191590, "fullname": "Yunlong Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191590?format=json", "institution": "Central South University"}, {"id": 191591, "fullname": "Xiaoheng Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191591?format=json", "institution": "Central South University"}, {"id": 187446, "fullname": "Yichao Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187446?format=json", "institution": "Central South University"}, {"id": 191592, "fullname": "Yi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191592?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 91359, "fullname": "Xiangjian He", "url": "http://cvpr.thecvf.com/api/miniconf/users/91359?format=json", "institution": "The University of Nottingham Ningbo China"}, {"id": 187391, "fullname": "Shan You", "url": "http://cvpr.thecvf.com/api/miniconf/users/187391?format=json", "institution": "SenseTime Research"}, {"id": 185431, "fullname": "Shuo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185431?format=json", "institution": "Harbin Institute of Technology (Shenzhen)"}, {"id": 152449, "fullname": "Lei Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/152449?format=json", "institution": "University of New South Wales"}, {"id": 149694, "fullname": "Fei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/149694?format=json", "institution": "University of Science and Technology of China"}, {"id": 156647, "fullname": "Xiu Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/156647?format=json", "institution": "Central South University"}], "abstract": "Robotic manipulation in complex 3D environments requires unifying spatial reasoning with intuitive visual perception, which is a capability that current Vision-Language-Action paradigms address separately. While 3D VLAs excel in geometric and physical reasoning, they lack intuitive, image-level understanding and dense visual semantics; conversely, 2D VLAs (even with depth image) provide rich visual intuition and semantic continuity but miss explicit spatial global grounding. We introduce DiffRender-VLA, a differentiable rendering\u2013based framework that bridges 3D and 2D Vision-Language-Action models through gradient-consistent visual mediation. It generates differentiable images by localizing the next end-effector target with a world-aligned cube marker, differentiably structuring surrounding geometry whose color encodes spatial relations to the marker, and rendering adaptive viewpoints optimized to reveal the target\u2013environment spatial relationships. These differentiable images serve as visual bridges, embedding spatial semantics while allowing gradients from 2D VLAs to backpropagate into 3D representations, thereby coupling geometric reasoning with visual perception. This closed differentiable loop unifies reasoning and perception, substantially improving performance under occlusion, clutter, and complex spatial manipulation tasks, achieving average improvements of +12.1% over state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39208", "url": null, "sourceid": 31949, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36204, "uid": "c1eb1949c87d3bb139951cb0ae9ddeee", "name": "ContourVertex: Bridging Semantics and Geometry for Referring Remote Sensing Interpretation", "authors": [{"id": 184442, "fullname": "Jinming Chai", "url": "http://cvpr.thecvf.com/api/miniconf/users/184442?format=json", "institution": "Xidian University"}, {"id": 71561, "fullname": "Lingling Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/71561?format=json", "institution": "Xidian University"}, {"id": 90795, "fullname": "Licheng Jiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/90795?format=json", "institution": "Xidian University"}, {"id": 184443, "fullname": "Xiaoqiang Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184443?format=json", "institution": "Xidian University"}, {"id": 184444, "fullname": "Long Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/184444?format=json", "institution": "Xi&#x27;an University"}, {"id": 90815, "fullname": "Xu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90815?format=json", "institution": "Xidian University"}, {"id": 131849, "fullname": "Wenping Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/131849?format=json", "institution": "Xidian University"}, {"id": 184445, "fullname": "Weibin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184445?format=json", "institution": "Xidian University"}], "abstract": "Referring expression comprehension and segmentation (RECS) task plays a vital role in remote sensing due to its high efficiency in multi-tasking. However, RECS has reached a performance bottleneck rooted in representational insufficiency, primarily due to cross-task representational fragmentation in multi-task interpretation. In this paper, we propose RECS4R, a unified multi-task framework to upgrade RECS performance. At representation level, we introduce language-guided unified contour decoding paradigm (LCUDP) that takes language-conditioned contour as the intermediate carrier to decode REC and RIS synchronously, structurally preserving geometric and semantic consistency and enabling lightweight, efficient decoding. At refinement level, we introduce residual coarse-to-fine encoding (RCE), shifting fine stage from learning-from-scratch to error correction. At reaggregation level, we design channel isolated multi-scale fusion (CIMF) to achieve lossless feature fusion. At regularization level, we employ gradient consistency loss (GCL) to enhance LCUDP and improve boundary boundary adherence. Moreover, we validate RECS4R on remote-sensing and natural datasets, including RefDIOR, RRSIS-D, OPT-RSVG, RefCOCO, RefCOCO+, and RefCOCOg, and verify the image encoder under CNN, Transformer, and Mamba backbones, achieving advanced performance. The code will be coming soon.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36204", "url": null, "sourceid": 31047, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39680, "uid": "87159df0dcd6d3b95e9f8219091c28d2", "name": "Time-Specialized Event-Image Alignment for Blur-to-Video Decomposition", "authors": [{"id": 185458, "fullname": "Zhijing Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/185458?format=json", "institution": "University of Science and Technology of China"}, {"id": 133327, "fullname": "Senyan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/133327?format=json", "institution": "University of Science and Technology of China"}, {"id": 185460, "fullname": "Ruixuan Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185460?format=json", "institution": "University of Science and Technology of China"}, {"id": 185459, "fullname": "Kean Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185459?format=json", "institution": "University of Science and Technology of China"}, {"id": 192635, "fullname": "Runze Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/192635?format=json", "institution": "University of Science and Technology of China"}, {"id": 76679, "fullname": "Xueyang Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76679?format=json", "institution": "University of Science and Technology of China"}, {"id": 86637, "fullname": "Zheng-Jun Zha", "url": "http://cvpr.thecvf.com/api/miniconf/users/86637?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Motion blur is a common degradation in dynamic imaging. Recent studies have moved beyond restoring a single sharp image from a blurred input and instead target blur decomposition: recovering a temporally continuous sharp video sequence from one motion-blurred image. Event cameras, with their microsecond temporal resolution, can effectively alleviate motion ambiguity.However, existing event-based methods often fail to explicitly model time-aligned event\u2013image features. How to accurately exploit event data to reconstruct frames at different time instants remains largely underexplored. In this paper, we propose TSANet, an event-based blur-to-video decomposition method that time-specializes both event features and image features for alignment. Specifically, we introduce a Relative Time-Encoded Attention module that steers event features toward motion information relevant to a given target time, and a Timesurface Dynamic Warping Module that warps image features into the spatial configuration corresponding to that time. With time-specialized motion features and image features that are explicitly aligned at arbitrary query times, our framework can decompose a single blurred image into a high\u2013frame-rate sharp video sequence. In addition, we collect a new dataset containing real events and high-quality color video, and synthesize blurred inputs by averaging sharp frames to evaluate our method. Experiments on multiple datasets with both synthetic and real events demonstrate that our approach consistently outperforms previous state-of-the-art methods on the blur decomposition task.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39680", "url": null, "sourceid": 42982, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36609, "uid": "f9195e75b0784070189f10f3aa96e026", "name": "Event-Illumination Collaborative Low-light Image Enhancement with a High-resolution Real-world Dataset", "authors": [{"id": 133327, "fullname": "Senyan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/133327?format=json", "institution": "University of Science and Technology of China"}, {"id": 185458, "fullname": "Zhijing Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/185458?format=json", "institution": "University of Science and Technology of China"}, {"id": 185459, "fullname": "Kean Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185459?format=json", "institution": "University of Science and Technology of China"}, {"id": 134245, "fullname": "Xin Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/134245?format=json", "institution": "University of Science and Technology of China"}, {"id": 185460, "fullname": "Ruixuan Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185460?format=json", "institution": "University of Science and Technology of China"}, {"id": 76679, "fullname": "Xueyang Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76679?format=json", "institution": "University of Science and Technology of China"}, {"id": 86637, "fullname": "Zheng-Jun Zha", "url": "http://cvpr.thecvf.com/api/miniconf/users/86637?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Event-based low-light image enhancement (LIE) methods mainly focus on incorporating high dynamic range (HDR) information from events while overlooking the essential global illumination in images and the inherent noise sensitivity of event signals in real-world scenarios. To address these issues, we propose EIC-LIE, an event-illumination collaborative LIE framework. Concretely, we first design an Event-Illumination Collaborative Interaction (EICI) module, which contains two key processes: forward gathering, which gathers HDR features across varying lighting conditions, and backward injection, which provides complementary content for illumination and event representations.Next, we introduce an Illumination-aware Event Filter (IAEF) that dynamically reduces event noise based on brightness statistics derived from images. Additionally, we build a beam-splitter-based hybrid imaging system to collect high-quality event-image pairs with temporal synchronization from dynamic scenes, providing the first high-resolution, real-world event-based LIE dataset.Extensive experiments show that our EIC-LIE outperforms state-of-the-art methods on five real-world and synthetic datasets, significantly surpassing previous methods with improvements of up to 1.24dB in PSNR and 0.069 in SSIM. The code and dataset will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36609", "url": null, "sourceid": 42433, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38188, "uid": "2a44fc01550de95485464916cfcfdbf1", "name": "EventGait: Towards Robust Gait Recognition with Event Streams", "authors": [{"id": 133327, "fullname": "Senyan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/133327?format=json", "institution": "University of Science and Technology of China"}, {"id": 189246, "fullname": "Shuai Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189246?format=json", "institution": "University of Science and Technology of China"}, {"id": 152066, "fullname": "Chuanfu Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/152066?format=json", "institution": "SIAS, UESTC"}, {"id": 185459, "fullname": "Kean Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185459?format=json", "institution": "University of Science and Technology of China"}, {"id": 185458, "fullname": "Zhijing Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/185458?format=json", "institution": "University of Science and Technology of China"}, {"id": 88083, "fullname": "Chengzhi Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88083?format=json", "institution": "University of Science and Technology of China"}, {"id": 76679, "fullname": "Xueyang Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76679?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Gait recognition enables non-intrusive, privacy-preserving identification but suffers in uncontrolled environments due to illumination and motion sensitivity in conventional cameras. In this work, we explore gait recognition using event cameras, which offer microsecond temporal resolution and high dynamic range, naturally capturing robust dynamic cues and suppressing static noise. Existing event-based approaches typically aggregate event streams into event images over long time windows, thereby discarding fine-grained motion dynamics critical for gait recognition.Therefore, we propose \\textbf{EventGait}, an end-to-end dual-stream framework that separately models motion and shape while preserving the advantages of events. Our dynamic stream leverages a Mixture of Spiking Experts (MoSE) with diverse neuron constants for robust dynamic perception across complex motion and illumination scenes, while the static stream learns dense shape representations via Cross-modal Structural Alignment (CroSA) with large vision foundation models. To address the absence of large-scale event-based gait datasets, we introduce a synthesis pipeline and release two new benchmarks: SUSTech1K-E and CCGR-Mini-E. Extensive experiments have shown that event-based gait recognition not only achieves results comparable to camera-based gait recognition under normal conditions but also significantly outperforms it in low-light scenarios. Our approach sets a new state of the art on both synthesized and real-world event-based gait benchmarks, highlighting the robustness and potential of event-driven gait analysis. The code and datasets will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38188", "url": null, "sourceid": 32286, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39762, "uid": "2fb795a41461f9664f41efb0c07e9461", "name": "Personalized Federated Training of Diffusion Models with Privacy Guarantees", "authors": [{"id": 192801, "fullname": "Kumar Kshitij Patel", "url": "http://cvpr.thecvf.com/api/miniconf/users/192801?format=json", "institution": "Yale University"}, {"id": 192802, "fullname": "Bingqing Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192802?format=json", "institution": "the University of Hong Kong"}, {"id": 180743, "fullname": "A F M Mahfuzul Kabir", "url": "http://cvpr.thecvf.com/api/miniconf/users/180743?format=json", "institution": "New Jersey Institute of Technology"}, {"id": 192803, "fullname": "Weitong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192803?format=json", "institution": "University of North Carolina at Chapel Hill"}, {"id": 157174, "fullname": "Difan Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/157174?format=json", "institution": "University of Hong Kong"}, {"id": 192804, "fullname": "Lingxiao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192804?format=json", "institution": "New Jersey Institute of Technology"}], "abstract": "We propose a federated framework for training diffusion models on decentralized and private datasets. The method learns a shared generative model together with personalized client models, which allows clients to benefit from cross-client structure while ensuring that the shared model cannot reproduce any client\u2019s data on its own. We provide formal differential privacy guarantees for each client and establish utility bounds for conditional generation under a Gaussian mixture model, showing that collaboration improves sample quality relative to private non-collaborative training. Experiments on CIFAR-10, Colorized MNIST, and CelebA support these results: the method generates high-fidelity samples, improves performance on minority and underrepresented classes, and maintains strong protection against membership inference, memorization, and reconstruction attacks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39762", "url": null, "sourceid": 32605, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37969, "uid": "b2420b697849a300ece7982b58557094", "name": "Illumination-Consistent Human-Scene Reconstruction from Monocular Video", "authors": [{"id": 182051, "fullname": "Rongbin Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/182051?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 103958, "fullname": "Wensheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/103958?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 188708, "fullname": "Lingzhe Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188708?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 188709, "fullname": "Dongwang Dongwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188709?format=json", "institution": null}, {"id": 89068, "fullname": "Chengying Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/89068?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Reconstructing 3D humans and scenes from monocular videos is a challenging task, particularly due to human motion, varying illumination, and dynamic scene shadows. While recent works have explored scene disentanglement by jointly modeling humans and their surrounding scenes, they often overlook illumination and shadow effects\u2014resulting in inconsistent human appearance and degraded scene realism. To address this gap, we propose a photometrically consistent integration of human and scene reconstruction based on 3D Gaussian Splatting, with a key focus on modeling spatially-varying illumination and shadows. Central to our method is a learnable light volume that provides localized lighting cues to human Gaussians, enabling more realistic and consistent appearance synthesis. To further ensure accurate human geometry and alignment, we adopt a two-stage reconstruction strategy: we first optimize a human mesh and then anchor Gaussians to the refined surface. In addition, we introduce an implicit shadow estimation module that disentangles cast shadows from the scene, thus supporting plausible human shadow synthesis. Our framework also facilitates human relighting and compositing into novel scenes with contextually appropriate lighting. Quantitative and qualitative results demonstrate that our method achieves state-of-the-art performance, producing consistent appearances, realistic illumination, and enhanced overall scene realism.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37969", "url": null, "sourceid": 36346, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39425, "uid": "e281f682fab884aafadb53f9711eaffb", "name": "IR-HGP: Physically-Aware Gaussian Inverse Rendering for High-Illumination Scenes via Generative Priors", "authors": [{"id": 174899, "fullname": "Qingan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174899?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 103958, "fullname": "Wensheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/103958?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 89068, "fullname": "Chengying Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/89068?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Applying 3D Gaussian Splatting to inverse rendering, especially for relightable assets under high-illuminance conditions, remains challenging. Strong specular highlights and complex reflections complicate material-light disentanglement, often baking in shadows and losing specular detail. To address this, we introduce IR-HGP, a framework that achieves robust disentanglement using three synergistic modules: First, a Hybrid Visibility Decomposition module ensures physical visibility consistency. Second, a  Generative Illumination Field Prior module infers detailed and high-dynamic range environmental lighting. Finally, a  Physics-Aware Radiance Correction module stabilizes optimization and mitigates illumination artifacts. Our framework achieves SOTA material recovery and relighting performance, outperforming existing methods under challenging illumination conditions. It reconstructs the view-dependent \u201cshiny\u201d appearance of reflective surfaces in real time, surpassing the limits of prior 3DGS-based inverse rendering methods.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39425", "url": null, "sourceid": 39384, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38011, "uid": "11ae9d2fb5ee5579a3a0db7674dff168", "name": "MHopReg: Efficient Hierarchical Multi-Hop Graph Search for Point Cloud Registration", "authors": [{"id": 131840, "fullname": "Yue Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131840?format=json", "institution": "Xidian University"}, {"id": 188826, "fullname": "Feng Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188826?format=json", "institution": "Xidian University"}, {"id": 131839, "fullname": "Yongzhe Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/131839?format=json", "institution": "Xidian University"}, {"id": 188827, "fullname": "Hao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188827?format=json", "institution": "Xidian University; Xidian University"}, {"id": 188828, "fullname": "Kaiyuan Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188828?format=json", "institution": ""}, {"id": 131847, "fullname": "Maoguo Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/131847?format=json", "institution": "Xidian University"}, {"id": 85385, "fullname": "Qiguang Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/85385?format=json", "institution": "Xidian University"}, {"id": 131849, "fullname": "Wenping Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/131849?format=json", "institution": "Xidian University"}], "abstract": "Outlier rejection for correspondence-based point cloud registration confronts two fundamental challenges in real-world scenarios. First, low-overlap regions yield sparse and fragmented inlier distributions that are difficult to discover using conventional one-step global search strategies. Second, large-scale scenes present dense correspondence inputs that impose stringent requirements on the accuracy-efficiency trade-off of search algorithms. To this end, we propose a hierarchical multi-hop graph search framework that progressively refines correspondences to address these challenges. Our method constructs a compatibility graph with transformation-invariant embeddings to predict correspondence confidence, establishing the foundation for cluster-balanced seed sampling that ensures comprehensive coverage across fragmented regions. These strategically selected seeds subsequently drive hierarchical multi-hop expansion, progressively discovering inliers through multi-resolution graph layers while circumventing the high complexity of exhaustive global search. Finally, distribution-aware ranking jointly evaluates geometric consistency and spatial coverage to select well-distributed transformations from multiple hypotheses. Experiments on 3DMatch, 3DLoMatch, and KITTI demonstrate that our method significantly outperforms state-of-the-art methods in both low-overlap and large-scale scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38011", "url": null, "sourceid": 42392, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36660, "uid": "b01b2f6715785729f0a278f4674a9733", "name": "InterPhys: Physics-aware Human Motion Synthesis in a Dynamic Scene", "authors": [{"id": 185583, "fullname": "Chaoyue Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/185583?format=json", "institution": "Australian National University"}, {"id": 154413, "fullname": "Wei Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/154413?format=json", "institution": "Meta"}, {"id": 87488, "fullname": "Miaomiao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87488?format=json", "institution": "Australian National University"}], "abstract": "This paper tackles the problem of physics-aware human motion synthesis in a dynamic scene. Unlike existing works which mainly tend to generate physically unrealistic motions due to limited contact modeling, typically restricted to hands, in this paper, we introduce a physics-aware human motion generation framework that explicitly models the full spectrum of human-related forces, including human-object, human-scene, and internal body dynamics. Our method imposes soft physical constraints to maintain force and torque balance, ensuring physically grounded motion synthesis. We further propose a novel continuous distance-based force model that generalizes contact modeling to arbitrary surfaces, capturing interactions not only with static environments but also with dynamic, moving objects. Extensive experiments show that our approach significantly improves physical plausibility and generalizes well to complex scenes, setting a new benchmark for physically consistent human motion generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36660", "url": null, "sourceid": 33870, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36399, "uid": "2bb054c14409adcb28cb8d922e10a383", "name": "ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval", "authors": [{"id": 162442, "fullname": "tianyu yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/162442?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 184950, "fullname": "ChenWei He", "url": "http://cvpr.thecvf.com/api/miniconf/users/184950?format=json", "institution": "Southeast University"}, {"id": 145196, "fullname": "xiangzhao hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/145196?format=json", "institution": "CASIA"}, {"id": 172799, "fullname": "Tianyue Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/172799?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 184951, "fullname": "Jiarui Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/184951?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 85427, "fullname": "Haiyun Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/85427?format=json", "institution": "Institute of automation, Chinese Academy of Sciences"}, {"id": 129356, "fullname": "Leigang Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129356?format=json", "institution": "National University of Singapore"}, {"id": 85436, "fullname": "Jinqiao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85436?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 88927, "fullname": "Tat-seng Chua", "url": "http://cvpr.thecvf.com/api/miniconf/users/88927?format=json", "institution": "National University of Singapore"}], "abstract": "Composed Image Retrieval (CIR) aims to retrieve target images based on a hybrid query comprising a reference image and a modification text. Early dual-tower Vision\u2013Language Models (VLMs) struggle with cross-modality compositional reasoning required for this task. Recently, adapting generative Multimodal Large Language Models (MLLMs) for retrieval offers a promising direction. However, we identify that this adaptation strategy overlooks a fundamental issue: adapting a generative MLLM into a single-embedding discriminative retriever triggers a paradigm conflict, which leads to **Capability Degradation**\u2014the deterioration of native fine-grained reasoning after retrieval adaptation. To address this challenge, we propose **ReCALL** (Recalibrating Capability Degradation), a model-agnostic framework that follows a *diagnose\u2013generate\u2013refine* pipeline: Firstly, we diagnose cognitive blind spots of the retriever via self-guided informative instance mining. Next, we generate corrective instructions and triplets by CoT prompting the foundation MLLM and conduct quality control with VQA-based consistency filtering. Finally, we refine the retriever through continual training on these triplets with a grouped contrastive scheme, thereby internalizing fine-grained visual\u2013semantic distinctions and realigning the discriminative embedding space of retriever with intrinsic compositional reasoning within the MLLM. Extensive experiments on CIRR and FashionIQ show that ReCALL consistently recalibrates degraded capabilities and achieves state-of-the-art performance. Code will be released soon.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36399", "url": null, "sourceid": 34550, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36429, "uid": "8da484a39300fd0463c0d9cf5cb13032", "name": "AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions", "authors": [{"id": 181951, "fullname": "Zonghao Ying", "url": "http://cvpr.thecvf.com/api/miniconf/users/181951?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 185022, "fullname": "Le Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185022?format=json", "institution": "Beihang University"}, {"id": 185023, "fullname": "Yisong Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185023?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 75521, "fullname": "Jiakai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75521?format=json", "institution": "Zhongguancun Laboratory"}, {"id": 86327, "fullname": "Yuqing Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/86327?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 86293, "fullname": "Jinyang Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/86293?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 179641, "fullname": "Zhenfei Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/179641?format=json", "institution": "University of Oxford; University of Sydney"}, {"id": 185024, "fullname": "Mingchuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185024?format=json", "institution": null}, {"id": 73541, "fullname": "Aishan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73541?format=json", "institution": "Beihang University"}, {"id": 75528, "fullname": "Xianglong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75528?format=json", "institution": "BUAA"}], "abstract": "The integration of vision-language models (VLMs) is driving a new generation of embodied agents capable of operating in human-centered environments. However, as deployment expands, these systems face growing safety risks, particularly when executing hazardous instructions. Current safety evaluation benchmarks remain limited: they cover only narrow scopes of hazards and focus primarily on final outcomes, neglecting the agent's full perception-planning-execution process and thereby obscuring critical failure modes. Therefore, we present SAFE, a benchmark for systematically assessing the safety of embodied VLM agents on hazardous instructions. SAFE comprises three components: SAFE-THOR, an extensible adversarial simulation sandbox with a universal adapter that maps high-level VLM outputs to low-level embodied controls, supporting diverse agent workflow integration; SAFE-VERSE, a risk-aware task suite inspired by Asimov's Three Laws of Robotics, comprising 45 adversarial scenarios, 1,350 hazardous tasks, and 9,900 instructions that span risks to humans, environments, and agents; and SAFE-DIAGNOSE, a multi-level and fine-grained evaluation protocol measuring agent performance across perception, planning, and execution. Applying SAFE to nine state-of-the-art VLMs and two embodied agent workflows, we uncover systematic failures in translating hazard recognition into safe planning and execution. Our findings reveal fundamental limitations in current safety alignment and demonstrate the necessity of a comprehensive, multi-stage evaluation for developing safer embodied intelligence.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36429", "url": null, "sourceid": 44634, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39225, "uid": "bbd36655e64de9758eb68ccc741176f4", "name": "UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking", "authors": [{"id": 156179, "fullname": "Xuangeng Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156179?format=json", "institution": "The University of Tokyo"}, {"id": 107613, "fullname": "Ruicong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107613?format=json", "institution": "The University of Tokyo"}, {"id": 187251, "fullname": "Yifei Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187251?format=json", "institution": "The University of Tokyo"}, {"id": 191636, "fullname": "Yun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191636?format=json", "institution": "NII"}, {"id": 159466, "fullname": "YICHEN PENG", "url": "http://cvpr.thecvf.com/api/miniconf/users/159466?format=json", "institution": "Japan Advanced Institute of Science and Technology, Tokyo Institute of Technology"}, {"id": 186325, "fullname": "Bo Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186325?format=json", "institution": "Shanda Group Corp."}], "abstract": "Generating lifelike conversational avatars requires modeling not just isolated speakers, but the dynamic, reciprocal interaction of speaking and listening.However, modeling the listener is exceptionally challenging: direct audio-driven training fails, producing stiff, static listening motions. This failure stems from a fundamental imbalance: the speaker's motion is strongly driven by speech audio, while the listener's motion primarily follows an internal motion prior and is only loosely guided by external speech. This challenge has led most methods to focus on speak-only generation. The only prior attempt at joint generation relies on extra speaker's motion to produce the listener.This design is not end-to-end, thereby hindering the real-time applicability.To address this limitation, we present UniLS, the first end-to-end framework for generating unified speak-listen expressions, driven by only dual-track audio.Our method introduces a novel two-stage training paradigm.Stage 1 first learns the internal motion prior by training an audio-free autoregressive generator, capturing the spontaneous dynamics of natural facial motion. Stage 2 then introduces the dual-track audio, fine-tuning the generator to modulate the learned motion prior based on external speech cues.Extensive evaluations show UniLS achieves state-of-the-art speaking accuracy. More importantly, it delivers up to 44.1% improvement in listening metrics, generating significantly more diverse and natural listening expressions. This effectively mitigates the stiffness problem and provides a practical, high-fidelity audio-driven solution for interactive digital humans.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39225", "url": null, "sourceid": 33761, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38990, "uid": "01949fe85ea9c64f1a9ee2dee805ae50", "name": "DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving", "authors": [{"id": 181898, "fullname": "Zhenjie Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181898?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 191134, "fullname": "Yilin Chai", "url": "http://cvpr.thecvf.com/api/miniconf/users/191134?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 75577, "fullname": "Xiaosong Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/75577?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 191135, "fullname": "Qifeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191135?format=json", "institution": "Institute for AI Industry Research (AIR), Tsinghua University; Huazhong University of Science and Technology; Tsinghua University; Zhongguancun Academy; University of Science and Technology of China"}, {"id": 191136, "fullname": "Yuqian Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191136?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 191137, "fullname": "Xuekai Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191137?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 154167, "fullname": "Haisheng Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/154167?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 88195, "fullname": "Junchi Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88195?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensor data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. The recent success of the Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that expert specialization enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. First, we introduce Drive-\u03c00, a Vision-Language-Action (VLA) baseline adapted from Embodied AI for autonomous driving, which serves as the foundation model for DriveMoE. Building on this, we strengthen perception through a carefully designed Vision MoE, where a router adaptively selects context-relevant camera views. This mechanism is inspired by human driving cognition, in which attention is directed to key visual cues rather than to all sensory inputs simultaneously. Beyond perception, we introduce an Action MoE that augments the framework by training a router to activate specialized expert modules tailored to distinct driving behaviors. Within the Action MoE, we implement two distinct styles(Token-level Router and Trajectory-level Router) and extensively explore their applicability in autonomous driving. In Bench2Drive closed-loop evaluations, DriveMoE demonstrates robust performance across diverse driving scenarios, alleviates the mode-averaging effect that limits existing models, and achieves state-of-the-art results with significant improvements over Drive-\u03c00. We will release our code and models of DriveMoE and Drive-\u03c00.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38990", "url": null, "sourceid": 34661, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39384, "uid": "36179605b136215afcba7b1344c136a8", "name": "DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization", "authors": [{"id": 156636, "fullname": "Zhengxian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156636?format=json", "institution": "Department of Automation, Tsinghua University"}, {"id": 182444, "fullname": "Fei Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/182444?format=json", "institution": "Tsinghua University"}, {"id": 191965, "fullname": "Xutao Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/191965?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 191966, "fullname": "Rui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191966?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 191967, "fullname": "Taicheng Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191967?format=json", "institution": "JD.com"}, {"id": 191968, "fullname": "Yang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191968?format=json", "institution": "JD.com"}, {"id": 191969, "fullname": "Mengqi Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/191969?format=json", "institution": null}, {"id": 86367, "fullname": "Tao Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86367?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "3D Gaussian Splatting (3DGS) has enabled efficient 3D scene reconstruction from everyday images with real-time, high-fidelity rendering, greatly advancing VR/AR applications. Fisheye cameras, with their wider field of view (FOV), promise high-quality reconstructions from fewer inputs and have recently attracted much attention. However, since 3DGS relies on rasterization, most subsequent works involving fisheye camera inputs first undistort images before training, which introduces two problems: 1) Black borders at image edges cause information loss and negate the fisheye\u2019s large FOV advantage; 2) Undistortion\u2019s stretch\u2010and\u2010interpolate resampling spreads each pixel\u2019s value over a larger area, diluting detail density\u2014 causes 3DGS overfitting these low\u2010frequency zones, producing blur and floating artifacts.In this work, we integrate fisheye camera model into the original 3DGS framework, enabling native fisheye image input for training without preprocessing. Despite correct modeling, we observed that the reconstructed scenes still exhibit floaters at image edges: Distortion increases toward the periphery, and 3DGS's original per-iteration random-selecting-view optimization ignores the cross-view correlations of a Gaussian, leading to extreme shapes (e.g., oversized or elongated) that degrade reconstruction quality. To address this, we introduce a feature-overlap\u2013driven cross-view joint optimization strategy that establishes consistent geometric and photometric constraints across views\u2014a technique equally applicable to existing pinhole-camera-based pipelines. Our DirectFisheye-GS matches or surpasses state-of-the-art performance on public datasets.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39384", "url": "https://yzxqh.github.io/DirectFisheye-GS/", "sourceid": 35062, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36646, "uid": "176d8075d8edfcba778b54b6749fe43c", "name": "Hyperbolic Gramian Volumes for Multimodal Alignment", "authors": [{"id": 185549, "fullname": "Saiyang Na", "url": "http://cvpr.thecvf.com/api/miniconf/users/185549?format=json", "institution": "University of Texas at Arlington"}, {"id": 185550, "fullname": "Feng Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185550?format=json", "institution": "University of Texas at Arlington"}, {"id": 185551, "fullname": "Qifeng Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185551?format=json", "institution": "University of Texas at Arlington"}, {"id": 185552, "fullname": "Wenliang Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/185552?format=json", "institution": "University of Texas at Arlington"}, {"id": 185553, "fullname": "Thao Dang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185553?format=json", "institution": "University of Texas at Arlington"}, {"id": 185554, "fullname": "Yuzhi Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185554?format=json", "institution": "University of Texas at Arlington"}, {"id": 185555, "fullname": "Hehuan Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/185555?format=json", "institution": "University of Texas at Arlington"}, {"id": 185556, "fullname": "Chunyuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185556?format=json", "institution": "University of Texas at Arlington"}, {"id": 185557, "fullname": "Weizhi An", "url": "http://cvpr.thecvf.com/api/miniconf/users/185557?format=json", "institution": null}, {"id": 156403, "fullname": "Junzhou Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156403?format=json", "institution": "University of Texas, Arlington"}], "abstract": "Multimodal contrastive learning typically relies on pairwise similarities for alignment, but recent work has shown that Gramian volumes can capture higher-order correlations across modalities.However, Euclidean Gramian volumes suffer from volume collapse under L2 normalization, concentrating near unity with minimal discriminative variance.Hyperbolic geometry's exponential volume growth naturally addresses this via variance preservation, motivating us to extend Gramian alignment to hyperbolic space.Yet preliminary experiments reveal that pure hyperbolic geometry alone is insufficient: while it preserves variance, it underperforms Euclidean baselines on cross-category discrimination.We introduce HyperGRAM, a hybrid geometry framework that combines Euclidean discriminative stability with hyperbolic semantic variance through learnable mixing.Using the numerically stable Lorentz model, HyperGRAM enables volumes to serve dual roles: discriminating matched from mismatched triplets while preservingsemantic sensitivity within matched pairs that reflects interpretation spaces (the set of valid multimodal realizations).Evaluation across four video-text benchmarks demonstrates that hybrid geometry consistently outperforms both pure Euclidean and pure hyperbolic variants, achievingsignificant zero-shot improvements with cross-dataset semantic sensitivity exhibiting contrasting correlation patterns.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36646", "url": null, "sourceid": 35780, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37623, "uid": "f8c7555d2671b87529f7fb9b43fd0b71", "name": "Evidential Deep Partial Label Learning to Quantify Disambiguation Uncertainty", "authors": [{"id": 187883, "fullname": "Jinfu Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187883?format=json", "institution": "Qingdao University"}, {"id": 187884, "fullname": "Jiangnan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187884?format=json", "institution": "Qingdao University"}, {"id": 187885, "fullname": "Xiaohui Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187885?format=json", "institution": "Qingdao University"}, {"id": 187886, "fullname": "Kangrui Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/187886?format=json", "institution": "Tongji University"}, {"id": 187887, "fullname": "Zhencun Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187887?format=json", "institution": "Tongji University"}, {"id": 187888, "fullname": "\u798f\u5efa\u8bdd \u8d63\u65b9\u8a00", "url": "http://cvpr.thecvf.com/api/miniconf/users/187888?format=json", "institution": "\u9752\u5c9b\u5927\u5b66"}, {"id": 187889, "fullname": "Tianhao Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187889?format=json", "institution": "Qingdao University"}, {"id": 187890, "fullname": "Linqing Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187890?format=json", "institution": "Shanghai Jiaotong University"}], "abstract": "Partial label learning (PLL) is a weakly supervised learning, where each instance is assigned a set of candidate labels and only one is true. However, due to potentially inaccurate annotations,  existing PLL algorithms disambiguate labeling by minimizing the prediction loss, which leaves the model unaware of its prediction credibility. To address this issue, this paper proposes the evidential deep partial label learning (ED-PLL) to quantify disambiguation uncertainty, aiming to achieve candidate label disambiguation and reliability prediction. Firstly, we extend the evidence modeling mechanism to PLL, treating the candidate label set as the source of evidence for the label hypothesis, and using belief and credibility to model classification uncertainty, thereby guiding a more reliable disambiguation process. Meanwhile, we propose the expectation calculation under the Dirichlet distribution of non-candidate labels, which suppresses the output of non-candidate labels by using consistency regularization to further improve the accuracy of disambiguation. Furthermore, a conflict-aware regularization is proposed to evaluate the degree of conflict, which measures the consistency between instances within the class by combining the differences in the distribution of prediction results and model uncertainty, and thus improves the robustness of the model. In addition, this paper theoretically analyzes our method from the perspective of the Expectation-Maximization (EM) algorithm, and the ED-PLL is compatible with any deep network or stochastic optimizer. Experiments on benchmark and real datasets verify the effectiveness of the proposed algorithm.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37623", "url": null, "sourceid": 41598, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39215, "uid": "282f81a0f102e8a41ac1e5b476779847", "name": "IMS3: Breaking Distributional Aggregation in Diffusion-Based Dataset Distillation", "authors": [{"id": 191610, "fullname": "Chenru Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191610?format=json", "institution": "Westlake University"}, {"id": 191611, "fullname": "Yunyi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191611?format=json", "institution": "Eindhoven University of Technology; EPFL - EPF Lausanne"}, {"id": 191612, "fullname": "Zijun Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191612?format=json", "institution": "Jinan University"}, {"id": 89174, "fullname": "Joey Tianyi Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/89174?format=json", "institution": "National University of Singapore "}, {"id": 153810, "fullname": "Chi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153810?format=json", "institution": "Westlake University"}], "abstract": "Dataset Distillation aims to synthesize compact datasets that can approximate the training efficacy of large-scale real datasets, offering an efficient solution to the increasing computational demands of modern deep learning. Recently, diffusion-based dataset distillation methods have shown great promise by leveraging the strong generative capacity of diffusion models to produce diverse and structurally consistent samples. However, a fundamental goal misalignment persists: diffusion models are optimized for generative likelihood rather than discriminative utility, resulting in over-concentration in high-density regions and inadequate coverage of boundary samples crucial for classification. To address this issue, we propose two complementary strategies. Inversion-Matching (IM) introduces an inversion-guided fine-tuning process that aligns denoising trajectories with their inversion counterparts, broadening distributional coverage and enhancing diversity. Selective Subgroup Sampling( S$^3$ ) is a training-free sampling mechanism that improves inter-class separability by selecting synthetic subsets that are both representative and distinctive. Extensive experiments demonstrate that our approach significantly enhances the discriminative quality and generalization of distilled datasets, achieving state-of-the-art performance among diffusion-based methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39215", "url": null, "sourceid": 45820, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38095, "uid": "38d278dbb9c22bb5190e920ce2ed5fff", "name": "No Way To Steal My Face: Proactive Defense Against Identity-Preserving Personalized Generation", "authors": [{"id": 189044, "fullname": "Lizhi Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/189044?format=json", "institution": "Nanjing University of Information Science and Technology"}, {"id": 182829, "fullname": "Jun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182829?format=json", "institution": "Nanjing University of Information Science and Technology"}, {"id": 187426, "fullname": "Ziqiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187426?format=json", "institution": "Nanjing University of Information Science and Technology"}, {"id": 189045, "fullname": "Weiwei Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189045?format=json", "institution": "Nanjing University of Information Science and Technology"}, {"id": 189046, "fullname": "Zhangjie Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189046?format=json", "institution": "Nanjing University of Information Science and Technology"}], "abstract": "Recent advances in diffusion models have enabled high-fidelity, identity-preserving image generation for personalized applications such as digital avatars and virtual try-on systems. However, their reliance on sensitive facial reference images raises growing privacy concerns. Existing defense mechanisms are primarily designed for training-based personalization and struggle to generalize to emerging training-free approaches, due to fundamental differences in their identity integration paradigms. To bridge this gap, we propose $\\textbf{IDGuardian}$\u2014the first generalizable and model-agnostic identity protection framework capable of defending against both training-based and training-free personalization methods. IDGuardian abstracts the personalization process into two critical stages: identity extraction and identity injection. It then introduces crafted adversarial perturbations to simultaneously disrupt both stages. Specifically, it degrades the identity features extracted by external encoders and establishes an adversarial conceptual bridge that misdirects the generative trajectory away from the target identity. Extensive experiments show that IDGuardian effectively protects identity across various personalization pipelines and model architectures, while remaining robust to post-processing, adaptive attacks, and cross-dataset generalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38095", "url": null, "sourceid": 36599, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39026, "uid": "27d342e98cf600eb931a28036f48b691", "name": "Beyond Text Prompts: Precise Concept Erasure through Text\u2013Image Collaboration", "authors": [{"id": 182829, "fullname": "Jun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182829?format=json", "institution": "Nanjing University of Information Science and Technology"}, {"id": 189044, "fullname": "Lizhi Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/189044?format=json", "institution": "Nanjing University of Information Science and Technology"}, {"id": 187426, "fullname": "Ziqiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187426?format=json", "institution": "Nanjing University of Information Science and Technology"}, {"id": 189045, "fullname": "Weiwei Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189045?format=json", "institution": "Nanjing University of Information Science and Technology"}, {"id": 189046, "fullname": "Zhangjie Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189046?format=json", "institution": "Nanjing University of Information Science and Technology"}, {"id": 191197, "fullname": "Yong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191197?format=json", "institution": "Southeast University"}, {"id": 90405, "fullname": "Guo-Sen Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/90405?format=json", "institution": "inception institute of artificial intelligence (iiai)"}], "abstract": "Text-to-image generative models have achieved impressive fidelity and diversity, but can inadvertently produce unsafe or undesirable content due to implicit biases embedded in large-scale training datasets.Existing concept erasure methods, whether text-only or image-assisted, face trade-offs: textual approaches often fail to fully suppress concepts, while naive image-guided methods risk over-erasing unrelated content. We propose $\\textbf{TICoE}$, a Text\u2013Image Collaborative Erasing framework that achieves precise and faithful concept removal through a continuous convex concept manifold and hierarchical visual representation learning. TICoE precisely removes target concepts while preserving unrelated semantic and visual content. To objectively assess the quality of erasure, we further introduce a fidelity-oriented evaluation strategy that measures post-erasure usability. Experiments on multiple benchmarks show that TICoE surpasses prior methods in concept removal precision and content fidelity, enabling safer, more controllable text-to-image generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39026", "url": null, "sourceid": 36369, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36910, "uid": "35d97603af0a97686362d4a429e1ef28", "name": "GeoMotion: Rethinking Motion Segmentation via Latent 4D Geometry", "authors": [{"id": 181070, "fullname": "Xiankang He", "url": "http://cvpr.thecvf.com/api/miniconf/users/181070?format=json", "institution": "ZJUT"}, {"id": 186194, "fullname": "Peile Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186194?format=json", "institution": "Zhejiang University of Technology"}, {"id": 186195, "fullname": "Ying Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/186195?format=json", "institution": "Zhejiang University of Technology"}, {"id": 182416, "fullname": "Dongyan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/182416?format=json", "institution": "Zhejiang University of Technology"}, {"id": 86450, "fullname": "Chunhua Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/86450?format=json", "institution": "Zhejiang University"}, {"id": 89095, "fullname": "Xiaoqin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89095?format=json", "institution": "Wenzhou University"}], "abstract": "Motion segmentation in dynamic scenes is highly challenging, as conventional methods heavily rely on estimating camera poses and point correspondences from inherently noisy motion cues. Existing statistical inference or iterative optimization techniques that struggle to mitigate the cumulative errors in multi-stage pipelines often lead to limited performance or high computational cost. In contrast, we propose a fully learning-based approach that directly infers moving objects from latent feature representations via attention mechanisms, thus enabling end-to-end feed-forward motion segmentation. Our key insight is to bypass explicit correspondence estimation and instead let the model learn to implicitly disentangle object and camera motion. Supported by recent advances in 4D scene geometry reconstruction (e.g., $\\pi^3$), the proposed method leverages reliable camera poses and rich spatial-temporal priors, which ensure stable training and robust inference for the model. Extensive experiments demonstrate that by eliminating complex pre-processing and iterative refinement, our approach achieves state-of-the-art motion segmentation performance with high efficiency. Code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36910", "url": null, "sourceid": 44713, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38182, "uid": "ea8df8c9242601dda142b2f4741bd1cf", "name": "Adapting a Pre-trained Single-Cell Foundation Model to Spatial Gene Expression Generation from Histology Images", "authors": [{"id": 189231, "fullname": "Donghai Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189231?format=json", "institution": "SUN YAT-SEN UNIVERSITY; YUNNAN UNIVERSITY"}, {"id": 189232, "fullname": "Yongheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189232?format=json", "institution": "Yunnan University"}, {"id": 189233, "fullname": "Zhen WANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/189233?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 175949, "fullname": "Yuansong Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/175949?format=json", "institution": "Chongqing University"}, {"id": 188891, "fullname": "Wenwen Min", "url": "http://cvpr.thecvf.com/api/miniconf/users/188891?format=json", "institution": "Yunnan University"}], "abstract": "Spatial transcriptomics (ST) enables spot-level in situ expression profiling, but its high cost and limited throughput motivate predicting expression directly from H&E-stained histology. Recent advances explore using score- or flow-based generative models to estimate the conditional distribution of gene expression from histology, offering a flexible alternative to deterministic regression approaches. However, most existing generative approaches omit explicit modeling of gene\u2013gene dependencies, undermining biological coherence. Single-cell foundation models (sc-FMs), pre-trained across diverse cell populations, capture these critical gene relationships that histology alone cannot reveal. Yet, applying expression-only sc-FMs to histology-conditioned expression modeling is nontrivial due to the absence of a visual pathway, a mismatch between their pre-training and conditional ST objectives, and the scarcity of mixed-cell ST supervision. To address these challenges, we propose **HINGE** (**HI**stology-co**N**ditioned **GE**neration), which retrofits a pre-trained sc-FM into a conditional expression generator while mostly preserving its learned gene relationships. We achieve this by introducing **SoftAdaLN**, a lightweight, identity-initialized modulation that injects layer-wise visual context into the backbone, coupled with an expression-space **masked diffusion** objective and a warm-start curriculum to ensure objective alignment and training stability. Evaluated on three ST datasets, HINGE outperforms state-of-the-art baselines on mean Pearson correlation and yields more accurate spatial marker expression patterns and higher pairwise co-expression consistency, establishing a practical route to adapt pre-trained sc-FMs for histology-conditioned spatial expression generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38182", "url": null, "sourceid": 42791, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38521, "uid": "a1a03f215c558abe3813931ff93465e1", "name": "LoFA: Learning to Predict Personalized Prior for Fast Adaptation of Visual Generative Models", "authors": [{"id": 158894, "fullname": "Yiming Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/158894?format=json", "institution": "The Chinese University of Hong Kong (Shenzhen)"}, {"id": 76392, "fullname": "Mutian Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76392?format=json", "institution": "The Chinese University of Hong Kong (Shenzhen)"}, {"id": 90336, "fullname": "Chongjie Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/90336?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 190055, "fullname": "Jie Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190055?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 150956, "fullname": "Shunlin Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/150956?format=json", "institution": "The Chinese University of HongKong, ShenZhen"}, {"id": 90989, "fullname": "Yipeng Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/90989?format=json", "institution": "Cardiff University"}, {"id": 88683, "fullname": "Xiaoguang Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/88683?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}], "abstract": "Personalizing visual generative models to meet specific user needs has gained increasing attention, yet current methods like Low-Rank Adaptation (LoRA) remain impractical due to their demand for task-specific data and lengthy optimization. While a few hypernetwork-based approaches attempt to predict adaptation weights directly, they struggle to map fine-grained user prompts to complex LoRA distributions, limiting their practical applicability.To bridge this gap, we propose LoFA, a general framework that efficiently predicts personalized priors for fast model adaptation. We first identify a key property of LoRA: structured distribution patterns emerge in the relative changes between LoRA and base model parameters. Building on this, we design a two-stage hypernetwork: first predicting sparse response maps that capture key adaptation regions, then using these to guide final LoRA weight prediction. Extensive experiments demonstrate that our method consistently predicts high-quality personalized priors within seconds, across multiple tasks and user prompts, even outperforming conventional LoRA that requires hours of processing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38521", "url": null, "sourceid": 46615, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39886, "uid": "edfa4753bb5c89bacdd59680e4f94f44", "name": "OrionEdit: Bridging Reference and Source Images for Generalized Cross-Image Editing", "authors": [{"id": 173494, "fullname": "Zeyu Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/173494?format=json", "institution": "City University of Hong Kong"}, {"id": 186394, "fullname": "Lai Man Po", "url": "http://cvpr.thecvf.com/api/miniconf/users/186394?format=json", "institution": null}, {"id": 193051, "fullname": "XUYUAN XU", "url": "http://cvpr.thecvf.com/api/miniconf/users/193051?format=json", "institution": "Magic Light Inc."}, {"id": 193052, "fullname": "Yexin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193052?format=json", "institution": "Magiclight"}, {"id": 193053, "fullname": "Guoping Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/193053?format=json", "institution": "Tencent video"}, {"id": 193054, "fullname": "Haoxuan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193054?format=json", "institution": "City University of Hong Kong"}, {"id": 193055, "fullname": "Chenbo Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193055?format=json", "institution": "City University of Hong Kong"}, {"id": 193056, "fullname": "Kun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193056?format=json", "institution": "City University of Hong Kong"}, {"id": 193057, "fullname": "Yuyang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193057?format=json", "institution": "City University of Hong Kong"}], "abstract": "Multimodal image synthesis has achieved remarkable progress in producing visually coherent results, yet most editing methods still rely on semantic instructions, which is less direct than using visual guidance.Recently, a new paradigm has emerged that focuses on \"editing one image from another\", enabling more direct and interpretable manipulation through reference exemplars. In this work, we formalize this paradigm as cross-image editing, which modifies a source image under the guidance of one or more references, encompassing subject replacement, style transfer, image completion, and other reference-to-source tasks. To address this, we introduce OrionEdit, a unified framework that regulates visual attribute transfer through two key mechanisms: (1) A symmetric orthogonal subspace update that partitions image features into branch-specific subspaces, mitigating feature entanglement and preserving subject identity; and (2) a reverse-causal attention mechanism with an information-flow mask that enforces unidirectional dependencies in the latent space. Built on standard diffusion backbones, OrionEdit enables zero-shot editing with multiple references and yields consistent gains over open-source baselines, rivaling proprietary models in fidelity and disentanglement.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39886", "url": null, "sourceid": 37539, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39888, "uid": "16bf14ffd95ad00d59803e8f2bd8292e", "name": "NeuroRule: Bridging Vision and Logic with Differentiable Rule Induction", "authors": [{"id": 179917, "fullname": "Muhammad Zarar", "url": "http://cvpr.thecvf.com/api/miniconf/users/179917?format=json", "institution": "Tianjin University"}, {"id": 193058, "fullname": "Mingzheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193058?format=json", "institution": "Tianjin University"}, {"id": 193059, "fullname": "Xiaowang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193059?format=json", "institution": "Tianjin University, China"}, {"id": 193060, "fullname": "Zhiyong Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193060?format=json", "institution": "Tianjin University"}], "abstract": "Scene Graph Generation (SGG) aims to structurally represent visual scenes by detecting objects and their pairwise relationships. Despite significant progress, current models encode visual knowledge with ambiguous visual context and logically inferred implicit relations due to their purely neural, pipeline-based nature. This limitation underscores the need to advance beyond identifying what relations exist to explaining why they exist and how they can be compositionally reasoned about through logical rule chaining. To address these challenges, we introduce NeuroRule, the first \\textbf{Neurally-Guided Rule Induction Network} that integrates Mask2Former pixel-precise visual understanding with a differentiable rule induction engine. Our proposed method enables automatic learning of compositional logical rules directly from visual data while providing transparent explanations for relational predictions. NeuroRule introduces three key innovations: (1) a neural-symbolic bridge that maps visual features to probabilistic symbolic representations; (2) a differentiable rule-learning mechanism that automatically discovers interpretable first-order logic rules without manual engineering; and (3) a compositional chain rule system that enables complex inference while propagating confidence scores through an end-to-end trainable pipeline. Extensive experiments on the benchmark datasets, including Visual Genome (VG), Panoptic Scene Graph (PSG), and OpenPSG, demonstrate that NeuroRule achieves state-of-the-art performance. Our method significantly improves few-shot relation extraction while maintaining full interpretability in its rule-based explanations. To ensure reproducibility, we will release the code after publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39888", "url": null, "sourceid": 40609, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39320, "uid": "bc59fd834d4cd8b4cab854019ade6209", "name": "VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging", "authors": [{"id": 191842, "fullname": "Ming Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/191842?format=json", "institution": "Zhejiang University"}, {"id": 180268, "fullname": "Yuanlei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180268?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 191843, "fullname": "Liuzhou Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191843?format=json", "institution": "Peking University; Central China Normal University"}, {"id": 152591, "fullname": "Ruichuan An", "url": "http://cvpr.thecvf.com/api/miniconf/users/152591?format=json", "institution": "Peking University"}, {"id": 91572, "fullname": "Renrui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91572?format=json", "institution": "MMLab of CUHK &amp;amp;amp; Shanghai AI Laboratory"}, {"id": 182580, "fullname": "Hao Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182580?format=json", "institution": "Peking University"}, {"id": 89661, "fullname": "Ming Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89661?format=json", "institution": "Intel Labs China"}, {"id": 158001, "fullname": "Ying Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/158001?format=json", "institution": "SUN YAT-SEN UNIVERSITY, Tsinghua University"}, {"id": 156267, "fullname": "Wentao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156267?format=json", "institution": "Peking University"}], "abstract": "While Multimodal Large Language Models (MLLMs) excel on benchmarks, their processing paradigm differs from the human ability to integrate visual information. Unlike humans who naturally bridge details and high-level concepts, models tend to treat these elements in isolation. Prevailing evaluation protocols often decouple low-level perception from high-level reasoning, overlooking their semantic and causal dependencies, which yields non-diagnostic results and obscures performance bottlenecks. We present $\\textbf{\\textit{VCU-Bridge}}$, a framework that operationalizes a human-like hierarchy of visual connotation understanding: multi-level reasoning that advances from foundational perception through semantic bridging to abstract connotation, with an explicit evidence-to-inference trace from concrete cues to abstract conclusions. Building on this framework, we construct $\\textbf{\\textit{HVCU-Bench}}$, a benchmark for hierarchical visual connotation understanding with explicit, level-wise diagnostics. Comprehensive experiments demonstrate a consistent decline in performance as reasoning progresses to higher levels. We further develop a data generation pipeline for instruction tuning guided by Monte Carlo Tree Search (MCTS) and show that strengthening low-level capabilities yields measurable gains at higher levels. Interestingly, it not only improves on \\bench{} but also brings benefits on general benchmarks (average +2.53\\%), especially with substantial gains on MMStar (+7.26\\%), demonstrating the significance of the hierarchical thinking pattern and its effectiveness in enhancing MLLM capabilities.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39320", "url": null, "sourceid": 36777, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39594, "uid": "51a3f1d4522af45e0e2b5f3d86f8e1c2", "name": "Cross-Subject EEG-to-Video Reconstruction and Beyond", "authors": [{"id": 180004, "fullname": "Runduo Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/180004?format=json", "institution": "Dalian University of Technology"}, {"id": 192434, "fullname": "Hongchen Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192434?format=json", "institution": "Dalian University of Technology"}], "abstract": "Reconstructing video content from EEG (electroencephalogram)  is a research task of significant scientific importance. However, due to inter-subject differences in physiological states and variations in signal acquisition configurations, this task faces the challenge of inconsistent cross-subject generation.To address this, we  propose a Subject Adversarial and Mapping Network (SAM-Net). In SAM-Net, we first introduce a Hybrid Region-Temporal (HRT) Encoder to conduct inter-channel semantic interactions guided by brain regions and aggregate temporal semantics across different time scales. Secondly, we propose a Centered-progressive Subject Adversarial (C-SA) Mechanism to gradually narrow the metric distance between different subjects, thereby obtaining a unified and stable semantic representation. Thirdly, we design a New2Source Mapper to align the EEG distribution of new  subjects with that of multiple known subjects. Finally, we adopt a keyframe-guided continuous semantic generation paradigm to drive the production of coherent and high-quality videos. Extensive experiments validate the competitive performance of our SAM-Net in cross-subject EEG-to-Video generation tasks, as well as its excellent performance in generation tasks involving new subjects.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39594", "url": null, "sourceid": 37733, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39250, "uid": "bd8f44067769dbfb91bd4b4c13c967c2", "name": "GEM-TFL: Bridging Weak and Full Supervision for Forgery Localization through EM-Guided Decomposition and Temporal Refinement", "authors": [{"id": 191703, "fullname": "Xiaodong Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191703?format=json", "institution": "Wuhan University"}, {"id": 191704, "fullname": "Yuanming Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191704?format=json", "institution": "Wuhan University"}, {"id": 191705, "fullname": "Suting Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191705?format=json", "institution": null}, {"id": 191706, "fullname": "Junqi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191706?format=json", "institution": "Wuhan University"}, {"id": 191454, "fullname": "Yuhong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191454?format=json", "institution": "Wuhan University"}, {"id": 191707, "fullname": "Weiping Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191707?format=json", "institution": "Wuhan University"}, {"id": 86262, "fullname": "Zhongyuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86262?format=json", "institution": "Wuhan University"}], "abstract": "Temporal Forgery Localization (TFL) aims to precisely identify manipulated segments within videos or audio streams, providing interpretable evidence for multimedia forensics and security. While most existing TFL methods rely on dense frame-level labels in a fully supervised manner, Weakly Supervised TFL (WS-TFL) reduces labeling cost by learning only from binary video-level labels. However, current WS-TFL approaches suffer from mismatched training and inference objectives, limited supervision from binary labels, gradient blockage caused by non-differentiable top-$k$ aggregation, and the absence of explicit modeling of inter-proposal relationships. To address these issues, we propose GEM-TFL (Graph-based EM-powered Temporal Forgery Localization), a two-phase classification\u2013regression framework that effectively bridges the supervision gap between training and inference. Built upon this foundation, (1) we enhance weak supervision by reformulating binary labels into multi-dimensional latent attributes through an EM-based optimization process; (2) we introduce a training-free temporal consistency refinement that realigns frame-level predictions for smoother temporal dynamics; and (3) we design a graph-based proposal refinement module that models temporal-semantic relationships among proposals for globally consistent confidence estimation. Extensive experiments on benchmark datasets demonstrate that GEM-TFL achieves more accurate and robust temporal forgery localization, substantially narrowing the gap with fully supervised methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39250", "url": null, "sourceid": 40202, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36828, "uid": "127f36d53863c8ef53c6426e948bfdf6", "name": "MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation", "authors": [{"id": 100873, "fullname": "Md Maklachur Rahman", "url": "http://cvpr.thecvf.com/api/miniconf/users/100873?format=json", "institution": "Texas A&amp;M University - College Station"}, {"id": 158813, "fullname": "Soon Ki Jung", "url": "http://cvpr.thecvf.com/api/miniconf/users/158813?format=json", "institution": "Kyungpook National University"}, {"id": 185974, "fullname": "Tracy Hammond", "url": "http://cvpr.thecvf.com/api/miniconf/users/185974?format=json", "institution": "Texas A&amp;M University"}], "abstract": "Recent segmentation models have demonstrated promising efficiency by aggressively reducing parameter counts and computational complexity. However, these models often struggle to accurately delineate fine lesion boundaries and texture patterns essential for early skin cancer diagnosis and treatment planning. In this paper, we propose MambaLiteUNet, a compact yet robust segmentation framework that integrates Mamba state space modeling into a U-Net architecture, along with three key modules: Adaptive Multi-Branch Mamba Feature Fusion (AMF), Local Global Feature Mixing (LGFM), and Cross-Gated Attention (CGA). These modules are designed to enhance local\u2013global feature interaction, preserve spatial details, and improve the quality of skip connections. MambaLiteUNet achieves an average IoU of 87.12\\% and average Dice score of 93.09\\% across ISIC2017, ISIC2018, HAM10000, and PH2 benchmarks, outperforming state-of-the-art models.Compared to U-Net, our model improves average IoU and Dice by 7.72 and 4.61 points, respectively, while reducing parameters by 93.6\\% and GFLOPs by 97.6\\%. Additionally, in domain generalization with six unseen lesion categories, MambaLiteUNet achieves 77.61\\% IoU and 87.23\\% Dice, performing best among all evaluated models. Our extensive experiments demonstrate that MambaLiteUNet achieves a strong balance between accuracy and efficiency, making it a competitive and practical solution for dermatological image segmentation.Code is available at https://github.com/abcdef987412365/MambaLiteUNet.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36828", "url": null, "sourceid": 42423, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36232, "uid": "3841efb99d3cff44808cbc5e20853a54", "name": "Dynamic Exposure Burst Image Restoration", "authors": [{"id": 103510, "fullname": "Woohyeok Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/103510?format=json", "institution": "POSTECH"}, {"id": 88365, "fullname": "Jaesung Rim", "url": "http://cvpr.thecvf.com/api/miniconf/users/88365?format=json", "institution": "POSTECH"}, {"id": 184523, "fullname": "Daeyeon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/184523?format=json", "institution": "Pohang University of Science and Technology"}, {"id": 88380, "fullname": "Sunghyun Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/88380?format=json", "institution": "POSTECH"}], "abstract": "Burst image restoration aims to reconstruct a high-quality image from burst images, which are typically captured using manually designed exposure settings.Although these exposure settings significantly influence the final restoration performance, the problem of finding optimal exposure settings has been overlooked.In this paper, we present Dynamic Exposure Burst Image Restoration (DEBIR), a novel burst image restoration pipeline that enhances restoration quality by dynamically predicting exposure times tailored to the shooting environment.In our pipeline, Burst Auto-Exposure Network (BAENet) estimates the optimal exposure time for each burst image based on a preview image, as well as motion magnitude and gain.Subsequently, a burst image restoration network reconstructs a high-quality image from burst images captured using these optimal exposure times.For training, we introduce a differentiable burst simulator and a three-stage training strategy. Our experiments demonstrate that our pipeline achieves state-of-the-art restoration quality. Furthermore, we validate the effectiveness of our approach on a real-world camera system, demonstrating its practicality. The code will be made publicly available on our project page.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36232", "url": null, "sourceid": 44699, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39626, "uid": "245e57794c369c8617378e285ee9755c", "name": "MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention", "authors": [{"id": 192511, "fullname": "Zilong Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192511?format=json", "institution": "Shandong University"}, {"id": 75472, "fullname": "Zhengming Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/75472?format=json", "institution": "Tulane University"}, {"id": 192512, "fullname": "Pei Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192512?format=json", "institution": "Shandong University"}, {"id": 192513, "fullname": "Wenhao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/192513?format=json", "institution": "Shandong University"}, {"id": 192514, "fullname": "Feng Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192514?format=json", "institution": "Shandong University"}], "abstract": "Feature encoders play a key role in pixel-level crack segmentation by shaping the representation of fine textures and thin structures. Existing CNN-, Transformer-, and Mamba-based models each capture only part of the required spatial or structural information, leaving clear gaps in modeling complex crack patterns. To address this, we present MixerCSeg, a mixer architecture designed like a coordinated team of specialists, where CNN-like pathways focus on local textures, Transformer-style paths capture global dependencies, and Mamba-inspired flows model sequential context within a single encoder. At the core of MixerCSeg is the TransMixer, which explores Mamba\u2019s latent attention behavior while establishing dedicated pathways that naturally express both locality and global awareness. To further enhance structural fidelity, we introduce a multi-view processing strategy and a Direction-guided Edge Gated Convolution (DEGConv) that strengthens edge sensitivity under irregular crack geometries with minimal computational overhead. A Spatial Refinement Multi-Level Fusion (SRF) module is then employed to refine multi-scale details without increasing complexity. Extensive experiments on multiple crack segmentation benchmarks show that MixerCSeg achieves state-of-the-art performance with only 2.05 GFLOPs and 2.54 M parameters, demonstrating both efficiency and strong representational capability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39626", "url": null, "sourceid": 44318, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36608, "uid": "eebf196d08b0b670b3e2d4e66952f7a4", "name": "Adapter Shield: A Unified Framework with Built-in Authentication for Preventing Unauthorized Zero-Shot Image-to-Image Generation", "authors": [{"id": 127773, "fullname": "Jun Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/127773?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 185454, "fullname": "Hongyi Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185454?format=json", "institution": "Shandong University"}, {"id": 152885, "fullname": "Yingjie Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/152885?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 185455, "fullname": "Wangqiu Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185455?format=json", "institution": "Hefei University of Technology"}, {"id": 185456, "fullname": "Jianbo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185456?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 185457, "fullname": "Linhan Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185457?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 156245, "fullname": "Dandan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156245?format=json", "institution": "East China Normal University"}, {"id": 127227, "fullname": "Hua Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127227?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 89522, "fullname": "Xiongkuo Min", "url": "http://cvpr.thecvf.com/api/miniconf/users/89522?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 184429, "fullname": "Wei Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/184429?format=json", "institution": "East China Normal University"}, {"id": 86659, "fullname": "Guangtao Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/86659?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "With the rapid progress in diffusion models, image synthesis has advanced to the stage of zero-shot image-to-image generation, where high-fidelity replication of facial identities or artistic styles can be achieved using just one portrait or artwork, without modifying any model weights. Although these techniques significantly enhance creative possibilities, they also pose substantial risks related to intellectual property violations, including unauthorized identity cloning and stylistic imitation. To counter such threats, this work presents Adapter Shield, the first universal and authentication-integrated solution aimed at defending personal images from misuse in zero-shot generation scenarios. We first investigate how current zero-shot methods employ image encoders to extract embeddings from input images, which are subsequently fed into the UNet of diffusion models through cross-attention layers. Inspired by this mechanism, we construct a reversible encryption system that maps original embeddings into distinct encrypted representations according to different secret keys. The authorized users can restore the authentic embeddings via a decryption module and the correct key, enabling normal usage for authorized generation tasks. For protection purposes, we design a multi-target adversarial perturbation method that actively shifts the original embeddings toward designated encrypted patterns. Consequently, protected images are embedded with a defensive layer that ensures unauthorized users can only produce distorted or encrypted outputs. Extensive evaluations demonstrate that our method surpasses existing state-of-the-art defenses in blocking unauthorized zero-shot image synthesis, while supporting flexible and secure access control for verified users.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36608", "url": null, "sourceid": 37808, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39806, "uid": "683789c77bf37c8dea098b73af2b52e6", "name": "HyCal: A Training-Free Prototype Calibration Method for Cross-Discipline Few-Shot Class-Incremental Learning", "authors": [{"id": 192892, "fullname": "Eunju Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/192892?format=json", "institution": "Chung-Ang University"}, {"id": 192893, "fullname": "MiHyeon Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/192893?format=json", "institution": "Korea Telecom Research"}, {"id": 192894, "fullname": "Junehyoung Kwon", "url": "http://cvpr.thecvf.com/api/miniconf/users/192894?format=json", "institution": "Chung-Ang University"}, {"id": 192895, "fullname": "Yoonji Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/192895?format=json", "institution": "Chung-Ang University"}, {"id": 192896, "fullname": "JiHyun Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/192896?format=json", "institution": "Dentium"}, {"id": 192897, "fullname": "Soojin Jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192897?format=json", "institution": "Electronics and Telecommunications Research Institute"}, {"id": 192898, "fullname": "YoungBin Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/192898?format=json", "institution": "Chung-Ang University"}], "abstract": "Pretrained Vision\u2013Language Models (VLMs) like CLIP show promise in continual learning, but existing Few-Shot Class-Incremental Learning (FSCIL) methods assume homogeneous domains and balanced data distributions, limiting real-world applicability where data arises from heterogeneous disciplines with imbalanced sample availability and varying visual complexity. We identify Domain Gravity, a representational asymmetry where data imbalance across heterogeneous domains causes overrepresented or low-entropy domains to disproportionately influence the embedding space, leading to prototype drift and degraded performance on underrepresented or high-entropy domains. To address this, we introduce Cross-Discipline Variable Few-Shot Class-Incremental Learning (XD-VSCIL), a benchmark capturing real-world heterogeneity and imbalance where Domain Gravity naturally intensifies. We propose Hybrid Prototype Calibration (HyCal), a training-free method combining cosine similarity and Mahalanobis distance to capture complementary geometric properties\u2014directional alignment and covariance-aware magnitude\u2014yielding stable prototypes under imbalanced heterogeneous conditions. Operating on frozen CLIP embeddings, HyCal achieves consistent retention\u2013adaptation improvements while maintaining efficiency. Experiments show HyCal effectively mitigates Domain Gravity and outperforms existing methods in imbalanced cross-domain incremental learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39806", "url": null, "sourceid": 36132, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38663, "uid": "5a56d275949eb284b4884f30ed88a045", "name": "STAR: Test-Time Adaptation Can Enhance Universal Prompt Learning for Vision-Language Models", "authors": [{"id": 190416, "fullname": "Yiwei Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190416?format=json", "institution": "Peking University"}, {"id": 190417, "fullname": "Hui Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190417?format=json", "institution": "Yale University"}, {"id": 190418, "fullname": "Xiao Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/190418?format=json", "institution": "University of Wisconsin - Madison"}, {"id": 190419, "fullname": "Minghua Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190419?format=json", "institution": "Peking University"}], "abstract": "This paper studies the problem of universal test-time prompt learning for vision-language models (VLMs) which aims to enhance prompt learning for a pre-trained VLM via unlabeled target data containing out-of-distribution (OOD) samples. However, existing test-time adaptation approaches often overlook class-specific diversity in the target domain and rely on unreliable pseudo-labels due to inadequate uncertainty estimation, which may result in additional adaptation bias during test time. Towards this end, we propose a novel framework named Separability-aware Conjugate Optimization with Prototypical Retrieval (STAR) for universal test-time prompt learning of VLMs. The core of our STAR is to incorporate a separability-aware gating mechanism into conjugate optimization for reliable pseudo-learning with OOD samples. In particular, we first compute the Fisher score to quantify the separability between in-distribution (ID) and OOD samples, which guides our soft gating mechanism for divided training. Then, we employ conjugate optimization to derive reliable pseudo-labels of unlabeled data for test-time adaptation. To further mitigate biases in OOD detection, we maintain a dynamic memory bank which stores high-confidence samples to build class-wise prototypes, which would serve as queries for prototypical retrieval to calibrate OOD detection. Extensive experiments on multiple benchmarks demonstrate that STAR consistently outperforms competing baseline methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38663", "url": null, "sourceid": 35943, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40343, "uid": "31fa042538088bc9a6cf8de213b5181b", "name": "CineBrain: A Large-Scale Multi-Modal Brain Dataset During Naturalistic Audiovisual Narrative Processing", "authors": [{"id": 149525, "fullname": "Jianxiong Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/149525?format=json", "institution": "Fudan University"}, {"id": 190392, "fullname": "Yichang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190392?format=json", "institution": "Fudan University"}, {"id": 190393, "fullname": "baofeng yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190393?format=json", "institution": "Fudan University"}, {"id": 190394, "fullname": "Jianfeng Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190394?format=json", "institution": "Fudan University; The University of Warwick"}, {"id": 73907, "fullname": "Yanwei Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73907?format=json", "institution": "Fudan University"}], "abstract": "Most research decoding brain signals into images, often using them as priors for generative models, has focused only on visual content. This overlooks the brain's natural ability to integrate auditory and visual information, for instance, sound strongly influences how we perceive visual scenes. To investigate this,we propose a new task of reconstructing continuous video stimuli from multimodal brain signals recorded during audiovisual stimulation. To enable this, we introduce CineBrain, the first large-scale dataset that synchronizes fMRI and EEG during audiovisual viewing, featuring six hours of The Big Bang Theory episodes for cross-modal alignment. We also conduct the first systematic exploration of combining fMRI and EEG for video reconstruction and present CineSync, a framework for reconstructing dynamic video using a Multi-Modal Fusion Encoder and a Neural Latent Decoder. CineSync achieves state-of-the-art performance in dynamic reconstruction, leveraging the complementary strengths of fMRI and EEG to improve visual fidelity. Our analysis shows that auditory cortical activations enhance decoding accuracy, highlighting the role of auditory input in visual perception.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40343", "url": null, "sourceid": -44510, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38652?format=json"], "related_events_ids": [38652]}, {"id": 39922, "uid": "a2409913883bc192c1608d76c6a47596", "name": "GDFA: Geometry-Driven Federated Unlearning with Directional Task Vector Alignment", "authors": [{"id": 180105, "fullname": "Xiuting Weng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180105?format=json", "institution": "Yunnan University"}, {"id": 193116, "fullname": "Ruizhi Pu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193116?format=json", "institution": null}, {"id": 193117, "fullname": "Yuanhang Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193117?format=json", "institution": "Southeast University"}, {"id": 193118, "fullname": "Kun Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/193118?format=json", "institution": "Yunnan University"}, {"id": 193119, "fullname": "Zhiwen Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193119?format=json", "institution": "Yunnan University"}, {"id": 193120, "fullname": "Lixing Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193120?format=json", "institution": "Yunnan University"}], "abstract": "Federated Learning (FL) is a decentralized framework that not only enables collaborative training with different clients but also ensures their local data privacy. However, when deletion requests arise under privacy regulations, efficiently removing specific client data contributions from target clients can be challenging. Existing unlearning methods face significant limitations under Non-IID (Non-Independent and Identically Distributed) data distributions when attempting to unlearn specific target clients in FL. Models in sharp optimization regions can suffer catastrophic knowledge loss from minor parameter changes, exacerbating this forgetting due to conflicting parameter updates across clients caused by Non-IID data distributions in FL. Empirically, we observe that conflicting updates under Non-IID settings generate misaligned task vectors that fail to isolate target knowledge. Therefore, we exploit the loss landscape geometry in unlearning specific target clients. We demonstrate that migrating models to flat regions can enhance unlearning robustness in Non-IID FL. Correspondingly, we introduce GDFA, a framework that initially transitions the global model to a flat loss domain. Subsequently, relevant clients generate unlearning task vectors, which GDFA filters to retain only directionally consistent components. This process isolates shared knowledge attributes before precise removal through reverse vector aggregation, maximizing knowledge retention. Extensive experiments demonstrate that GDFA outperforms state-of-the-art methods in unlearning efficacy and efficiency across diverse datasets and architectures, with minimal accuracy loss on retained tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39922", "url": null, "sourceid": 40147, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38318, "uid": "71cd82385557b7d41c4c18e1a40e0979", "name": "Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization", "authors": [{"id": 173841, "fullname": "Chenwei Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/173841?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 189586, "fullname": "Baoting Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189586?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 189587, "fullname": "Xuchong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189587?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 189588, "fullname": "Mingzhuo Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/189588?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 189589, "fullname": "Bochen Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189589?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 189590, "fullname": "Hongbin Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/189590?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "Post-Training Quantization (PTQ) has emerged as an effective technique for alleviating the substantial computational and memory overheads of Vision-Language Models (VLMs) by compressing both weights and activations without retraining the full model. Existing PTQ methods primarily rely on static identification and global compensation of sensitive or outlier channels, yet they often overlook the distributional differences of these important channels across inputs, leading to unsatisfactory quantization. In this work, we observe that the distributions and occurrence frequencies of important channels vary significantly both across modalities and among tokens, even within the same modality. Accordingly, we propose Quant Experts (QE), a token-aware adaptive error compensation with mixture-of-experts for VLMs quantization. QE divides the important channels into token-independent and token-dependent groups. For the former, a shared expert is designed for most tokens to compensate for global quantization error using a low-rank adapter. For the latter, routed experts including multiple routed low-rank adapters are elaborated to compensate for local quantization error related to specific tokens. Extensive experiments demonstrate that QE consistently enhances task accuracy across various quantization settings and model scales, ranging from 2B to 70B parameters, while maintaining performance comparable to full-precision models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38318", "url": null, "sourceid": 33709, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38652, "uid": "31fa042538088bc9a6cf8de213b5181b", "name": "CineBrain: A Large-Scale Multi-Modal Brain Dataset During Naturalistic Audiovisual Narrative Processing", "authors": [{"id": 149525, "fullname": "Jianxiong Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/149525?format=json", "institution": "Fudan University"}, {"id": 190392, "fullname": "Yichang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190392?format=json", "institution": "Fudan University"}, {"id": 190393, "fullname": "baofeng yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190393?format=json", "institution": "Fudan University"}, {"id": 190394, "fullname": "Jianfeng Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190394?format=json", "institution": "Fudan University; The University of Warwick"}, {"id": 73907, "fullname": "Yanwei Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73907?format=json", "institution": "Fudan University"}], "abstract": "Most research decoding brain signals into images, often using them as priors for generative models, has focused only on visual content. This overlooks the brain's natural ability to integrate auditory and visual information, for instance, sound strongly influences how we perceive visual scenes. To investigate this,we propose a new task of reconstructing continuous video stimuli from multimodal brain signals recorded during audiovisual stimulation. To enable this, we introduce CineBrain, the first large-scale dataset that synchronizes fMRI and EEG during audiovisual viewing, featuring six hours of The Big Bang Theory episodes for cross-modal alignment. We also conduct the first systematic exploration of combining fMRI and EEG for video reconstruction and present CineSync, a framework for reconstructing dynamic video using a Multi-Modal Fusion Encoder and a Neural Latent Decoder. CineSync achieves state-of-the-art performance in dynamic reconstruction, leveraging the complementary strengths of fMRI and EEG to improve visual fidelity. Our analysis shows that auditory cortical activations enhance decoding accuracy, highlighting the role of auditory input in visual perception.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38652", "url": null, "sourceid": 44510, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40343?format=json"], "related_events_ids": [40343]}, {"id": 39058, "uid": "c669ac8af779712a937d11cd9561ece8", "name": "Annotation-Efficient Coreset Selection for Context-dependent  Segmentation", "authors": [{"id": 181583, "fullname": "jin zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181583?format=json", "institution": "Beijing Institute of Technology"}, {"id": 191270, "fullname": "Zhe Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191270?format=json", "institution": "Beijing Institute of Technology"}, {"id": 191271, "fullname": "Biwen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191271?format=json", "institution": "Beijing Institute of Technology"}, {"id": 191272, "fullname": "Ruiheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191272?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "Context-dependent (CD) tasks demand the model to have advanced visual understanding ability, such as recognizing camouflaged objects and medical lesions. Current CD methods rely heavily on pixel-level annotated training sets, neglecting issues from redundant samples and the high annotation costs. In this paper, we address the pruning needs of CD datasets, focusing on selecting the most valuable samples for labeling and training using weak annotations. To achieve this, we decompose CD coreset selection into two steps: sample evaluation and coreset selection, proposing corresponding solutions: points-based optimal transport and a maximum distance entropy strategy. Specifically, we formulate sample evaluation as an optimal transport problem between foreground and background distributions, designing a foreground destruction-reconstruction process based on points to compute transport costs and score samples. For samples of varying importance, our selection strategy balances coreset coverage and diversity. We validate our method on six CD tasks, achieving 1\\% accuracy loss relative to full training under a 40\\% pruning rate.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39058", "url": null, "sourceid": 36779, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39878, "uid": "51d3a6f35b8dfe611ff24214c8ef79d1", "name": "CryoKRAQEN: Kernel-Regularized Annealing for Quantized Embedding Networks in Cryo-EM Heterogeneous Reconstruction", "authors": [{"id": 182904, "fullname": "Wenyuan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/182904?format=json", "institution": "ShanghaiTech University"}, {"id": 193040, "fullname": "Yutan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193040?format=json", "institution": "ShanghaiTech University"}, {"id": 190520, "fullname": "Xuming He", "url": "http://cvpr.thecvf.com/api/miniconf/users/190520?format=json", "institution": "ShanghaiTech University"}], "abstract": "Heterogeneous reconstruction in cryo-electron microscopy (Cryo-EM) is fundamental for understanding macromolecular structural diversity, yet remains challenging due to extreme noise, continuous conformational changes, and ambiguous image-to-structure mappings. Existing neural approaches often rely on encoder--decoder pipelines or fixed codebooks, which can be computationally demanding or struggle with complex heterogeneity. We propose \\textbf{CryoKRAQEN}, a decoder-only framework that integrates triplane implicit representations with kernel-guided latent assignment and quantized embeddings to improve stability and structural discrimination. The method avoids encoder dependencies and mitigates collapse during training, enabling accurate modeling of both conformational and compositional variations. Across diverse Cryo-EM benchmarks, CryoKRAQEN delivers competitive performance, robust reconstructions, and interpretable latent organization compared to state-of-the-art neural and classical methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39878", "url": null, "sourceid": 36379, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36631, "uid": "089c710c918b50da402b96d0601ee1a0", "name": "Scaling Spatial Intelligence with Multimodal Foundation Models", "authors": [{"id": 128967, "fullname": "Zhongang Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/128967?format=json", "institution": "Nanyang Technological University"}, {"id": 129473, "fullname": "Wang Ruisi", "url": "http://cvpr.thecvf.com/api/miniconf/users/129473?format=json", "institution": "SenseTime International"}, {"id": 185508, "fullname": "Chenyang Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185508?format=json", "institution": "Sensetime"}, {"id": 185509, "fullname": "Fanyi Pu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185509?format=json", "institution": "Nanyang Technological University"}, {"id": 185510, "fullname": "Junxiang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185510?format=json", "institution": null}, {"id": 185511, "fullname": "YUBO WANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/185511?format=json", "institution": "Sensetime"}, {"id": 129474, "fullname": "Wanqi Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/129474?format=json", "institution": "SenseTime Research "}, {"id": 129490, "fullname": "Zhitao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129490?format=json", "institution": "SenseTime Co Ltd."}, {"id": 129495, "fullname": "Chen Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/129495?format=json", "institution": "SenseTime International PTE. LTD."}, {"id": 185512, "fullname": "Tongxi Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185512?format=json", "institution": "Sensetime"}, {"id": 131677, "fullname": "Qingping SUN", "url": "http://cvpr.thecvf.com/api/miniconf/users/131677?format=json", "institution": "City University of Hong Kong"}, {"id": 146051, "fullname": "Hui En Pang", "url": "http://cvpr.thecvf.com/api/miniconf/users/146051?format=json", "institution": "Nanyang Technological University"}, {"id": 129455, "fullname": "Jiaqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129455?format=json", "institution": "SenseTime"}, {"id": 185513, "fullname": "Oscar Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/185513?format=json", "institution": "Nanyang Technological University"}, {"id": 157773, "fullname": "Zhiqian Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/157773?format=json", "institution": "Sensetime"}, {"id": 172845, "fullname": "Xuanke Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/172845?format=json", "institution": null}, {"id": 185514, "fullname": "Kewang Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185514?format=json", "institution": null}, {"id": 181569, "fullname": "Xiaoyang Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/181569?format=json", "institution": "SenseTime"}, {"id": 177959, "fullname": "Zukai Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/177959?format=json", "institution": "SenseTime"}, {"id": 129476, "fullname": "Xiangyu Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/129476?format=json", "institution": "Chinese University of Hong Kong"}, {"id": 90125, "fullname": "Hanming Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90125?format=json", "institution": "Sensetime"}, {"id": 88044, "fullname": "Lewei Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88044?format=json", "institution": "SenseTime"}, {"id": 127120, "fullname": "Liang Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/127120?format=json", "institution": "Shanghai AI Lab"}, {"id": 90785, "fullname": "Bo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/90785?format=json", "institution": "Nanyang Technological University"}, {"id": 89788, "fullname": "Ziwei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89788?format=json", "institution": "Nanyang Technological University"}, {"id": 185515, "fullname": "Quan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185515?format=json", "institution": "SenseTime Group Limited, SenseTime Group Limited"}, {"id": 84911, "fullname": "Dahua Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/84911?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 89773, "fullname": "Lei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89773?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SSI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SSI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SSI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, and 54.6% on ViewSpatial, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training, analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate the potential downstream application. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36631", "url": "https://github.com/OpenSenseNova/SenseNova-SI", "sourceid": 39413, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37613, "uid": "073754d13ca65ee6903491686ed7c141", "name": "Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors", "authors": [{"id": 152347, "fullname": "Rong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152347?format=json", "institution": "Australian National University"}, {"id": 187861, "fullname": "Ruyi Zha", "url": "http://cvpr.thecvf.com/api/miniconf/users/187861?format=json", "institution": "Australian National University"}, {"id": 187862, "fullname": "Ziang Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187862?format=json", "institution": "Amazon"}, {"id": 187863, "fullname": "Jiayu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187863?format=json", "institution": "Amazon"}, {"id": 74171, "fullname": "Pulak Purkait", "url": "http://cvpr.thecvf.com/api/miniconf/users/74171?format=json", "institution": "Amazon"}, {"id": 92749, "fullname": "Hongdong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/92749?format=json", "institution": "Australian National University"}], "abstract": "We present a novel method for generating geometrically realistic and consistent orbital videos from a single image of an object. Existing video generation works mostly rely on pixel-wise attention to enforce view consistency across frames. However, such mechanism does not impose sufficient constraints for long-range extrapolation, e.g. rear-view synthesis, in which pixel correspondences to the input image are limited. Consequently, these works often fail to produce results with a plausible and coherent structure. To tackle this issue, we propose to leverage rich shape priors from a 3D foundational generative model as an auxiliary constraint, motivated by its capability of modeling realistic object shape distributions learned from large 3D asset corpora. Specifically, we prompt the video generation with two scales of latent features encoded by the 3D foundation model: (i) a denoised global latent vector as an overall structural guidance, and (ii) a set of latent images projected from volumetric features to provide view-dependent and fine-grained geometry details. In contrast to commonly used 2.5D representations such as depth or normal maps, these compact features can model complete object shapes, and help to improve inference efficiency by avoiding explicit mesh extraction. To achieve effective shape conditioning, we introduce a multi-scale 3D adapter to inject feature tokens to the base video model via cross-attention, which retains its capabilities from general video pretraining and enables a simple and model-agonistic fine-tuning process. Extensive experiments on multiple benchmarks show that our method achieves superior visual quality, shape realism and multi-view consistency compared to state-of-the-art methods, and robustly generalizes to complex camera trajectories and in-the-wild images.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37613", "url": null, "sourceid": 38062, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36203, "uid": "9e30387be1bdc568dfca37927852dc94", "name": "Confidence-Guided Multi-Scale Aggregation for Sparse-View High-Resolution 3D Gaussian Splatting", "authors": [{"id": 184267, "fullname": "Qinzheng Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184267?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 184440, "fullname": "Zaychik Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184440?format=json", "institution": "South China University of Technology"}, {"id": 184441, "fullname": "Lijing Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184441?format=json", "institution": "Institute of Remote Sensing and Geographic Information System, School of Earth and Space SciencesPeking University"}, {"id": 154801, "fullname": "Zhihang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/154801?format=json", "institution": "Institute of automation, Chinese academy of science"}], "abstract": "Sparse-view 3D Gaussian Splatting (3DGS) reconstructs scenes using 3D Gaussians from sparse input views. Yet, this method is prone to overfitting, which is exacerbated at higher resolutions as the expanded dimensionality amplifies floating artifacts and reconstruction ambiguities. In this paper, we present a systematic study of 3DGS under sparse-view conditions and varying input resolutions. While prior work has overlooked resolution as a key factor in sparse-view performance, we identify and quantify a trade-off: lower-resolution inputs facilitate stable global geometry reconstruction, whereas higher-resolution inputs enable finer detail recovery but introduce high-frequency artifacts and instability. Building on this insight, we further propose **CAGS**, a Confidence-Guided Multi-Scale Aggregation that reconstructs scenes through a coarse-to-fine hierarchical optimization process\u200c. Our approach employs a matching-based weighting aggregation strategy to anchor high-resolution reconstructions to stabilize structural priors and filtering noise through cross-scale consistency, and a multi-scale pseudo-view regularization to refine local details without amplifying noise. Extensive experiments on the LLFF and Mip-NeRF360 datasets demonstrate that CAGS significantly outperforms existing methods, particularly under demanding high-resolution conditions. \u200cMoreover, our paradigm can be seamlessly integrated into other 3DGS-based pipelines, thereby extending the field from low-resolution reconstructions to high-fidelity outputs under real-world sparse-view constraints.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36203", "url": null, "sourceid": 46431, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39542, "uid": "7e7be978afdba0dd76f42113acf0abd8", "name": "WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition", "authors": [{"id": 76563, "fullname": "Shan Ning", "url": "http://cvpr.thecvf.com/api/miniconf/users/76563?format=json", "institution": "Shanghaitech University"}, {"id": 76309, "fullname": "Longtian Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76309?format=json", "institution": "ShanghaiTech University"}, {"id": 192308, "fullname": "Jiaxuan Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/192308?format=json", "institution": "ShanghaiTech University"}, {"id": 190520, "fullname": "Xuming He", "url": "http://cvpr.thecvf.com/api/miniconf/users/190520?format=json", "institution": "ShanghaiTech University"}], "abstract": "Open-domain visual entity recognition (VER) seeks to associate images with entities in encyclopedic knowledge bases such as Wikipedia.Recent generative methods tailored for VER demonstrate strong performance but incur high computational costs, limiting their scalability and practical deployment.In this work, we revisit the contrastive paradigm for VER and introduce WikiCLIP, a simple yet effective framework that establishes a strong and efficient baseline for open-domain VER.WikiCLIP leverages large language model embeddings as knowledge-rich entity representations and enhances them with a Vision-Guided Knowledge Adaptor (VGKA) that aligns textual semantics with visual cues at the patch level. To further encourage fine-grained discrimination, a Hard Negative Synthesis Mechanism generates visually similar but semantically distinct negatives during training.Experimental results on popular open-domain VER benchmarks, such as OVEN, demonstrate that WikiCLIP significantly outperforms strong baselines. Specifically, WikiCLIP achieves a 16\\% improvement on the challenging OVEN unseen set, while reducing inference latency by nearly 100 times compared with the leading generative model, AutoVER.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39542", "url": null, "sourceid": 39218, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38717, "uid": "15577bd89f95fbe74ff708dd9d3c49a8", "name": "AffordGrasp: Cross-Modal Diffusion for Affordance-Aware Grasp Synthesis", "authors": [{"id": 190519, "fullname": "Xiaofei Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190519?format=json", "institution": "ShanghaiTech University"}, {"id": 176101, "fullname": "Yi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176101?format=json", "institution": "ShanghaiTech University"}, {"id": 156215, "fullname": "Yumeng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156215?format=json", "institution": "The University of Hong Kong"}, {"id": 132350, "fullname": "Yuexin Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/132350?format=json", "institution": "ShanghaiTech University"}, {"id": 133585, "fullname": "Yujiao Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/133585?format=json", "institution": "ShanghaiTech University"}, {"id": 190520, "fullname": "Xuming He", "url": "http://cvpr.thecvf.com/api/miniconf/users/190520?format=json", "institution": "ShanghaiTech University"}], "abstract": "Generating human grasping poses that accurately reflect both object geometry and user-specified interaction semantics is essential for natural hand\u2013object interactions in AR/VR and embodied AI. However, existing semantic grasping approaches struggle with the large modality gap between 3D object representations and textual instructions, and often lack explicit spatial or semantic constraints, leading to physically invalid or semantically inconsistent grasps. In this work, we present AffordGrasp, a diffusion-based framework that produces physically stable and semantically faithful human grasps with high precision. We first introduce a scalable annotation pipeline that automatically enriches hand\u2013object interaction datasets with fine-grained structured language labels capturing interaction intent. Building upon these annotations, AffordGrasp integrates an affordance-aware latent representation of hand poses with a dual-conditioning diffusion process, enabling the model to jointly reason over object geometry, spatial affordances, and instruction semantics. A distribution adjustment module further enforces physical contact consistency and semantic alignment. We evaluate AffordGrasp across four instruction-augmented benchmarks derived from HO-3D, OakInk, GRAB, and AffordPose, and observe substantial improvements over state-of-the-art methods in grasp quality, semantic accuracy, and diversity.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38717", "url": null, "sourceid": 33301, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40015, "uid": "3df77411dbeaa2e3c6b9ab2efdacb395", "name": "GH-NAF: Grid-Adaptive Hash-Level\u2013Attended Neural Attenuation Fields for Discrepancy-Aware CBCT", "authors": [{"id": 177865, "fullname": "Seong Je Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/177865?format=json", "institution": "Seoul National University"}, {"id": 193305, "fullname": "Ju Hwan Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/193305?format=json", "institution": "Seoul National University"}, {"id": 193306, "fullname": "Chae Yeon Lim", "url": "http://cvpr.thecvf.com/api/miniconf/users/193306?format=json", "institution": "Seoul National University"}, {"id": 193307, "fullname": "Donghwan Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/193307?format=json", "institution": "Seoul National University"}, {"id": 193308, "fullname": "Myung Jin Chung", "url": "http://cvpr.thecvf.com/api/miniconf/users/193308?format=json", "institution": "Sung Kyun Kwan University School of Medicine"}, {"id": 105628, "fullname": "Kyungsu Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/105628?format=json", "institution": "Harvard Medical School and Massachusetts General Hospital"}], "abstract": "The advent of hash encodings has evolved neural radiance fields (NeRF)-based methods into fast and efficient 3D reconstruction techniques. In medical imaging, this framework has been extended to CT/CBCT reconstruction through neural attenuation fields (NAF), which directly model attenuation properties from projection data. Existing NeRF-based attenuation fields typically assume an idealized monoenergetic CBCT setting and therefore fail to model real-world projection inconsistencies such as scatter and noise contamination. Moreover, uniformly concatenating multi-resolution hash-grid features blends heterogeneous frequency components and noise into a single representation, causing artifacts: homogeneous regions acquire spurious high-frequency patterns, structural boundaries become blurred, and projection-induced bias propagates throughout the learned field. Given these limitations, we introduce the Grid-Adaptive Hash-Level\u2013Attended Neural Attenuation Field (GH-NAF). Instead of collapsing noise-corrupted projection signals into a single feature space, GH-NAF trained each hash-grid level independently, guided by uncertainty-based confidence scores. This enables stable low-frequency modeling in homogeneous tissues while selectively preserving high-frequency detail around structural boundaries. Experiments on synthetic and real CBCT datasets demonstrate that GH-NAF reliably preserves intra-material contrast and achieves superior reconstruction quality compared with state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40015", "url": null, "sourceid": 44636, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38988, "uid": "90d2301e323b1809708324229782def2", "name": "RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward", "authors": [{"id": 85175, "fullname": "Qiucheng Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85175?format=json", "institution": "University of California, Santa Barbara"}, {"id": 126843, "fullname": "Jing Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/126843?format=json", "institution": "Adobe Systems"}, {"id": 87960, "fullname": "Simon Jenni", "url": "http://cvpr.thecvf.com/api/miniconf/users/87960?format=json", "institution": "Adobe Systems"}, {"id": 86864, "fullname": "Kushal Kafle", "url": "http://cvpr.thecvf.com/api/miniconf/users/86864?format=json", "institution": "Adobe Systems"}, {"id": 150606, "fullname": "Tianyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150606?format=json", "institution": "Adobe Research"}, {"id": 85178, "fullname": "Shiyu Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85178?format=json", "institution": "UC Santa Barbara"}, {"id": 85204, "fullname": "Handong Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/85204?format=json", "institution": "Adobe Systems"}], "abstract": "Recent advances in multimodal large language models (MLLMs) have shown great potential for extending vision-language reasoning to professional tool-based image editing, enabling intuitive and creative editing. A promising direction is to use reinforcement learning (RL) to enable MLLMs to reason about and execute optimal tool-use plans within professional image-editing software. However, training remains challenging due to the lack of reliable, verifiable reward signals that can reflect the inherently subjective nature of creative editing. In this work, we introduce RetouchIQ, a framework that performs instruction-based executable image editing through MLLM agents guided by a generalist reward model. RetouchIQ interprets user-specified editing intentions and generates corresponding, executable Lightroom adjustments, bridging high-level aesthetic goals with precise parameter control. To move beyond conventional, rule-based rewards that compute similarity against a fixed reference image using handcrafted metrics, we propose a generalist reward model, an RL fine-tuned MLLM that evaluates edited results through a set of generated metrics on a case-by-case basis. Then, the reward model provides scalar feedback through multimodal reasoning, enabling reinforcement learning with high-quality, instruction-consistent gradients. We curate an extended dataset with 190k instruction-reasoning pairs and establish a new benchmark for instruction-based image editing. Experiments show that RetouchIQ substantially improves both semantic consistency and perceptual quality over previous MLLM-based and diffusion-based editing systems. Our findings demonstrate the potential of generalist reward-driven MLLM agents as flexible, explainable, and executable assistants for professional image editing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38988", "url": null, "sourceid": 39603, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39569, "uid": "d890f33e2b71506322bde37f3abafb71", "name": "Pano360: Perspective to Panoramic Vision with Geometric Consistency", "authors": [{"id": 192368, "fullname": "Zhengdong Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192368?format=json", "institution": "South China University of Technology"}, {"id": 130979, "fullname": "Weiyi Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/130979?format=json", "institution": "Tongji University"}, {"id": 192369, "fullname": "Zuyuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192369?format=json", "institution": "Guangdong University of Technology"}, {"id": 188388, "fullname": "Wenlve Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/188388?format=json", "institution": "South China University of Technology"}, {"id": 188389, "fullname": "Zhiheng Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/188389?format=json", "institution": null}], "abstract": "Prior panorama stitching approaches heavily rely on pairwise feature correspondences and are unable to leverage geometric consistency across multiple views. This leads to severe distortion and misalignment, especially in challenging scenes with weak textures, large parallax, and repetitive patterns.Given that multi-view geometric correspondences can be directly constructed in 3D space, making them more accurate and globally consistent, we extend the 2D alignment task to the 3D photogrammetric space. We adopt a novel transformer-based architecture to achieve 3D awareness and aggregate global information across all views. It directly utilizes camera poses to guide image warping for global alignment in 3D space and employs a multi-feature joint optimization strategy to compute the seams.Additionally, to establish an evaluation benchmark and train our network, we collected a large-scale dataset of real-world scenes. Extensive experiments show that our method significantly outperforms existing alternatives in alignment accuracy and perceptual quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39569", "url": null, "sourceid": 43108, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39667, "uid": "abad2c9bdb3db40975a1e0b6973940be", "name": "ORSATR-X: A Foundation Model based on Differential-and-Excitation Networks for Optical Remote Sensing Object Recognition", "authors": [{"id": 173924, "fullname": "Canyu Mo", "url": "http://cvpr.thecvf.com/api/miniconf/users/173924?format=json", "institution": "National University of Defense Technology"}, {"id": 132148, "fullname": "Yongxiang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/132148?format=json", "institution": "National University of Defense Technology"}, {"id": 191607, "fullname": "Jiehua Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191607?format=json", "institution": null}, {"id": 192602, "fullname": "Zilong Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192602?format=json", "institution": "National University of Defense Technology"}, {"id": 131425, "fullname": "Zhen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131425?format=json", "institution": "National University of Defense Technology"}, {"id": 192603, "fullname": "Tianpeng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192603?format=json", "institution": "National University of Defense Technology"}, {"id": 191606, "fullname": "Li Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191606?format=json", "institution": "National University of Defense Technology"}], "abstract": "Recent advances in Remote Sensing Foundation Models (RSFMs) have demonstrated considerable potential for Earth Observation (EO) tasks. While adopting natural image foundation models (e.g., DINO) provides a data-efficient strategy for building RSFMs, their strong generalization capability does not fully transfer to complex remote sensing (RS) scenarios due to severe background interference, notably in perceiving challenging targets like low-contrast objects. To this end, we propose ORSATR-X, a novel RSFM that effectively integrates the generalizable representations of DINOv3 with a dedicated mechanism for exciting local contrast information. ORSATR-X comprises two core components: (1) a DINOv3 encoder, which provides rich feature representation under limited RS pretraining data, and (2) a carefully designed side network incorporating a Weber Local Adapter (WLA) and a Multi-scale Aggregation Module (MSAM). The WLA enhances discriminability of low-contrast boundaries in complex scenes through center-surround contrast and directional gradient information enhancement, while the MSAM handles  inherent object scale variations in RS imagery by adaptive aggregation of features across multiple scales. Furthermore, we pretrain the side network using an efficient self-supervised distillation strategy. Extensive experiments on scene classification, object detection, and semantic segmentation demonstrate that ORSATR-X achieves state-of-the-art performance among existing RSFMs, demonstrating the effectiveness of our design.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39667", "url": null, "sourceid": 34634, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38837, "uid": "35d823b496846c48c1d98c954aeb0964", "name": "MV-TAP: Tracking Any Point in Multi-View Videos", "authors": [{"id": 186750, "fullname": "Jahyeok Koo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186750?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 156301, "fullname": "In\u00e8s Hyeonsu Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/156301?format=json", "institution": "KAIST AI"}, {"id": 188158, "fullname": "Mungyeom Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/188158?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 186751, "fullname": "Junghyun Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/186751?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology (KAIST)"}, {"id": 190802, "fullname": "Seohyeon Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/190802?format=json", "institution": "Korea University"}, {"id": 190803, "fullname": "Jaeyeong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/190803?format=json", "institution": "KAIST"}, {"id": 147321, "fullname": "Jung Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/147321?format=json", "institution": "KAIST"}, {"id": 92999, "fullname": "Seokju Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/92999?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 153109, "fullname": "Seungryong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/153109?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}], "abstract": "Multi-view camera systems enable rich observations of complex real-world scenes, and understanding dynamic objects in multi-view settings has become central to many applications. Point tracking serves as a key mechanism for capturing dynamic motion; however, conventional single-view approaches often fail due to the limited geometric information available in monocular video, which becomes a critical bottleneck for multi-view scenarios. In this work, we present \\ours, a robust point tracker that tracks query points across multi-view videos of dynamic scenes by leveraging cross-view information.\\ours utilizes camera geometry and cross-view attention to aggregate spatio-temporal information across views, enabling more complete and reliable trajectory estimation in multi-view videos. To support this task, we construct a large-scale synthetic training dataset and real-world evaluation sets tailored for multi-view tracking. Extensive experiments demonstrate that \\ours outperforms existing point-tracking methods on challenging benchmarks, establishing an effective baseline for advancing research in multi-view point tracking.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38837", "url": null, "sourceid": 35521, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38534, "uid": "eb97f5fdf3701ebf7ce10f9600fb0ade", "name": "FastEventDGS: Deformable Gaussian Splatting for Fast Dynamic Scenes from a Single Event Camera", "authors": [{"id": 148303, "fullname": "Zijia Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/148303?format=json", "institution": "ShanghaiTech University"}, {"id": 87145, "fullname": "Nico Messikommer", "url": "http://cvpr.thecvf.com/api/miniconf/users/87145?format=json", "institution": "University of Zurich"}, {"id": 190080, "fullname": "Rong Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190080?format=json", "institution": "ETH AI Center; UZH"}, {"id": 190081, "fullname": "Nikola Zubic", "url": "http://cvpr.thecvf.com/api/miniconf/users/190081?format=json", "institution": "Department of Informatics, University of Zurich and ETH Zurich"}, {"id": 85631, "fullname": "Davide Scaramuzza", "url": "http://cvpr.thecvf.com/api/miniconf/users/85631?format=json", "institution": "University of Zurich"}, {"id": 130042, "fullname": "Laurent Kneip", "url": "http://cvpr.thecvf.com/api/miniconf/users/130042?format=json", "institution": "ShanghaiTech University"}], "abstract": "The demand for dynamic 3D assets in AR/VR has recently popularized Deformable Gaussian Splatting. However, traditional RGB cameras are limited in their ability to reconstruct high-speed scenes due to motion blur and low temporal resolution. While event cameras offer a promising alternative, reconstructing a complete scene from their sparse and noisy output is a significant challenge. Existing event-based methods rely on an auxiliary sensor, such as a frame camera, thereby inducing tedious hardware and calibration challenges.We introduce FastEventDGS, a novel Deformable Gaussian Splatting-based framework that leverages a single event camera for high-fidelity 4D reconstruction. Our method utilizes a continuous camera trajectory parametrization and integrates two event generation models to provide both photometric and geometric constraints. We further propose a local patch event motion loss to constrain object motion, effectively mitigating overfitting. To ensure robust reconstruction, we employ an off-the-shelf model for depth correction and apply noise regularization terms in the final stage. We demonstrate robust results on both new synthetic and real-world datasets, highlighting our framework's ability to provide a simplified, event-only solution for high-fidelity 4D reconstruction in dynamic scenes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38534", "url": null, "sourceid": 40422, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38213, "uid": "aca704179243b072a128766eb67ad4ab", "name": "PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling", "authors": [{"id": 180354, "fullname": "Bowen Ping", "url": "http://cvpr.thecvf.com/api/miniconf/users/180354?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 151722, "fullname": "Chengyou Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/151722?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 76356, "fullname": "Minnan Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/76356?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 157091, "fullname": "Changliang Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/157091?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 189335, "fullname": "Xin Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189335?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 157092, "fullname": "Zhuohang Dang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157092?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 189336, "fullname": "Hangwei Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/189336?format=json", "institution": "A*STAR, Singapore"}], "abstract": "Consistent image generation requires faithfully preserving identities, styles, and logical coherence across multiple images,which is essential for applications such as storytelling and character design.Supervised training approaches struggle with this task due to the lack of large-scale datasets capturing visual consistency and the complexity of modeling human perceptual preferences.In this paper, we argue that reinforcement learning (RL) offers a promising alternative by enabling models to learn complex and subjective visual criteria in a data-free manner.To achieve this, we introduce PaCo-RL, a comprehensive framework that combines a specialized consistency reward model with an efficient RL algorithm.The first component, PaCo-Reward, is a pairwise consistency evaluator trained on a large-scale dataset constructed via automated sub-figure pairing.It evaluates consistency through a generative, autoregressive scoring mechanism enhanced by task-aware instructions and CoT reasons.The second component, PaCo-GRPO,leverages a novel resolution-decoupled optimization strategy to substantially reduce RL cost,alongside a log-tamed multi-reward aggregation mechanism that ensures balanced and stable reward optimization.Extensive experiments across the two representative subtasks show that PaCo-Reward significantly improves alignment with human perceptions of visual consistency,and PaCo-GRPO achieves state-of-the-art consistency performance with improved training efficiency and stability.Together, these results highlight the promise of PaCo-RL as a practical and scalable solution for consistent image generation.We will release the code, dataset, and models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38213", "url": null, "sourceid": 46369, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36639, "uid": "34005deb3eddd8787abaa56e1febced6", "name": "Beyond Layer-Wise Merging: Chain-of-Merging for Vision-Language Models", "authors": [{"id": 101426, "fullname": "Xinyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/101426?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 185528, "fullname": "Yuxuan Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/185528?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 129237, "fullname": "Lingling Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129237?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 151722, "fullname": "Chengyou Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/151722?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 157092, "fullname": "Zhuohang Dang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157092?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 185529, "fullname": "YiXing Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185529?format=json", "institution": "Boston University, Boston University"}, {"id": 185530, "fullname": "Yaqiang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185530?format=json", "institution": "Lenovo"}, {"id": 185531, "fullname": "Basura Fernando", "url": "http://cvpr.thecvf.com/api/miniconf/users/185531?format=json", "institution": "Nanyang Technological University; A*STAR"}, {"id": 129235, "fullname": "Jun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129235?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "While model merging has demonstrated remarkable success across diverse domains for large language models (LLMs), its application to vision-language models (VLMs) remains largely underexplored. Recent methods attempt to enhance VLM reasoning capabilities by integrating specialized LLM parameters through layer-wise merging. However, existing paradigms suffer from two critical limitations: (1) strict positional correspondence, which enforces rigid one-to-one layer alignment, and (2) uniform merging weights applied indiscriminately across all layers. These constraints fail to account for substantial functional disparities between corresponding layers in VLMs and LLMs, potentially misaligning incompatible layers and leading to detrimental parameter combinations.To address these, we propose Chain-of-Merging (CoM) framework that adaptively adjusts merging plans for different images and questions, including two key stages: (1) Adaptive Layer Matching, which identifies optimal layer pairings based on structural and semantic matching scores while filtering incompatible pairings, and (2) Dynamic Weight Merging, which determines layer-specific merging weights based on matching scores and employs spherical linear interpolation to minimize memory overhead.Extensive experiments demonstrate that CoM achieves substantial performance improvements, with Qwen2.5-VL-7B + Qwen2.5-Math-7B attaining a 4.4\\% average improvement on mathematical reasoning benchmarks while enhancing general visual understanding, significantly outperforming existing training-free methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36639", "url": null, "sourceid": 35933, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36847, "uid": "9ead219c61072cd000fa5fcec473e314", "name": "StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation", "authors": [{"id": 155975, "fullname": "Mingyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155975?format=json", "institution": "Zhejiang University"}, {"id": 186016, "fullname": "Jiuhe Shu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186016?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 186017, "fullname": "Hui Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186017?format=json", "institution": "College of Computer Science and Technology, Zhejiang University"}, {"id": 186018, "fullname": "Zeju Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186018?format=json", "institution": "Zhejiang University"}, {"id": 127931, "fullname": "Canyu Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/127931?format=json", "institution": "Zhejiang University"}, {"id": 156839, "fullname": "Jiange Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156839?format=json", "institution": "Nanjing University"}, {"id": 77244, "fullname": "Shenyuan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/77244?format=json", "institution": "HKUST"}, {"id": 185384, "fullname": "Hao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185384?format=json", "institution": "Zhejiang University"}, {"id": 86450, "fullname": "Chunhua Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/86450?format=json", "institution": "Zhejiang University"}], "abstract": "A fundamental challenge in embodied intelligence is developing expressive and compact state representations for efficient world modeling and decision making. However, existing methods often fail to achieve this balance, yielding representations that are either overly redundant or lacking in task-critical information. We propose an unsupervised approach that learns a highly compressed two-token state representation using a lightweight encoder and a pre-trained Diffusion Transformer (DiT) decoder, capitalizing on its strong generative prior. Our representation is efficient, interpretable, and integrates seamlessly into existing VLA-based models, improving performance by 11.6% on LIBERO and 31% in real-world task success with minimal inference overhead. More importantly, we find that the difference between these tokens, obtained via latent interpolation, naturally serves as a highly effective latent action, which can be further decoded into executable robot actions. This emergent capability reveals that our representation captures structured dynamics without explicit supervision. We name our method StaMo for its ability to learn generalizable robotic Motion from compact State representation, which is encoded from static images, challenging the prevalent dependence to learning latent action on complex architectures and video data. The resulting latent actions also enhance policy co-training, outperforming prior methods by 10.4% with improved interpretability. Moreover, our approach scales effectively across diverse data sources, including real-world robot data, simulation, and human egocentric video.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36847", "url": null, "sourceid": 41891, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36420, "uid": "76e91437c7f841334f01db0c85a8ec00", "name": "Condensed Test-Time Adaptation of VLMs for Action Recognition", "authors": [{"id": 184991, "fullname": "Wenxuan Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/184991?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 106615, "fullname": "Qu Hongyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/106615?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 88417, "fullname": "Rui Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88417?format=json", "institution": "National University of Singapore"}, {"id": 90405, "fullname": "Guo-Sen Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/90405?format=json", "institution": "inception institute of artificial intelligence (iiai)"}, {"id": 129605, "fullname": "Yazhou Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/129605?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 157797, "fullname": "Xiangbo Shu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157797?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 89846, "fullname": "Jinhui Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89846?format=json", "institution": "Nanjing University of Science and Technology"}], "abstract": "Test-time adaptation for video understanding, which enables vision-language models (VLMs) to generalize to downstream tasks such as action recognition, has demonstrated substantial value in real-world applications. Existing memory-based methods typically build a visual cache from high-confidence test videos and perform inference via a two-step modality mapping chain, i.e., vision-vision and vision-text. However, due to the asymmetry of the two mappings, the chain exhibits non-transitivity, hindering the generalization of VLMs. To this end, we propose a novel training-free Condensed Dynamic Adapter ConDA for action recognition, which leverages vision-text alignment to guide vision-vision alignment. It first selects semantic patches based on the semantic activation probability obtained from the vision-text alignment (Probability-based Semantic Patch Selection, PSPS), and then adaptively constructs spatial-temporal video tubes based on patch-level visual similarity (Adaptive Tube Construction, ATC). We conduct extensive experiments on seven benchmarks with different backbones and baselines. The quantitative results demonstrate that ConDA is compatible with arbitrary VLM and generalizes well across complex scenarios, such as long-term and egocentric scenarios. In addition, qualitative analyses showcase the interpretability of ConDA in terms of capturing semantic cues.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36420", "url": null, "sourceid": 39962, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36722, "uid": "e8792705b5cea2f03388f86885f645a3", "name": "BAMI: Training-Free Bias Mitigation in GUI Grounding", "authors": [{"id": 130670, "fullname": "Borui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130670?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 178377, "fullname": "Bo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/178377?format=json", "institution": "Tsinghua University"}, {"id": 185726, "fullname": "Bo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185726?format=json", "institution": "Tsinghua University"}, {"id": 130710, "fullname": "Wenzhao Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/130710?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 184530, "fullname": "Yuhao Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184530?format=json", "institution": "Lenovo"}, {"id": 185727, "fullname": "Liang Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185727?format=json", "institution": null}, {"id": 149155, "fullname": "Yiqiang Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/149155?format=json", "institution": "Lenovo Research"}, {"id": 77142, "fullname": "Jie Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/77142?format=json", "institution": "Tsinghua University"}, {"id": 88597, "fullname": "Jiwen Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88597?format=json", "institution": "Tsinghua University"}], "abstract": "GUI grounding is a critical capability for enabling GUI agents to execute tasks such as clicking and dragging. However, in complex scenarios like the ScreenSpot-Pro benchmark, existing models often suffer from suboptimal performance. Utilizing the proposed \\textbf{Masked Prediction Distribution (MPD)} attribution method, we identify that the primary sources of errors are twofold: high image resolution (leading to precision bias) and intricate interface elements (resulting in ambiguity bias). To address these challenges, we introduce \\textbf{Bias-Aware Manipulation Inference (BAMI)}, which incorporates two key manipulations, coarse-to-fine focus and candidate selection, to effectively mitigate these biases. Our extensive experimental results demonstrate that BAMI significantly enhances the accuracy of various GUI grounding models in a training-free setting. For instance, applying our method to the TianXi-Action-7B model boosts its accuracy on the ScreenSpot-Pro benchmark from 51.9\\% to 57.8\\%. Furthermore, ablation studies confirm the robustness of the BAMI approach across diverse parameter configurations, highlighting its stability and effectiveness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36722", "url": null, "sourceid": 32089, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38395, "uid": "7ffc7d5178d6731f7f2ce2f1bb408f5e", "name": "VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization", "authors": [{"id": 181515, "fullname": "Xiaoyan Cong", "url": "http://cvpr.thecvf.com/api/miniconf/users/181515?format=json", "institution": "Brown University"}, {"id": 155045, "fullname": "Haotian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155045?format=json", "institution": "ByteDance"}, {"id": 156910, "fullname": "Angtian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156910?format=json", "institution": "ByteDance Inc."}, {"id": 88066, "fullname": "Yizhi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88066?format=json", "institution": "Simon Fraser University"}, {"id": 90227, "fullname": "Yiding Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90227?format=json", "institution": "ByteDance"}, {"id": 128829, "fullname": "Canyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128829?format=json", "institution": "ByteDance"}, {"id": 87038, "fullname": "Chongyang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/87038?format=json", "institution": "ByteDance"}], "abstract": "Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data of simple editing operations, which fundamentally limits their ability to generalize to diverse and complex, real-world instructions. To address this generalization gap, we propose VIVA, a scalable framework for instruction-based video editing that leverages VLM-guided encoding and reward optimization. First, we introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually-grounded instruction representations, providing fine-grained spatial and semantic context for the diffusion transformer backbone. Second, we propose a post-training stage, Edit-GRPO, which adapts Group Relative Policy Optimization to the domain of video editing, directly optimizing the model for instruction-faithful, content-preserving, and aesthetically pleasing edits using relative rewards. Furthermore, we propose a data construction pipeline designed to synthetically generate diverse, high-fidelity paired video\u2013instruction data of basic editing operations. Extensive experiments show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38395", "url": null, "sourceid": 45007, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38830, "uid": "a88fd022dcf42b0d75b10eb57886ab04", "name": "TGT: Text-Grounded Trajectories for Locally Controlled Video Generation", "authors": [{"id": 182368, "fullname": "Guofeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182368?format=json", "institution": "Johns Hopkins University"}, {"id": 156910, "fullname": "Angtian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156910?format=json", "institution": "ByteDance Inc."}, {"id": 135024, "fullname": "Jacob Fang Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/135024?format=json", "institution": "ByteDance Inc."}, {"id": 90091, "fullname": "Liming Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90091?format=json", "institution": "Nanyang Technological University"}, {"id": 155045, "fullname": "Haotian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155045?format=json", "institution": "ByteDance"}, {"id": 190786, "fullname": "Bo Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190786?format=json", "institution": "ByteDance Inc."}, {"id": 90227, "fullname": "Yiding Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90227?format=json", "institution": "ByteDance"}, {"id": 190787, "fullname": "Guang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190787?format=json", "institution": "ByteDance Inc."}, {"id": 91813, "fullname": "Longyin Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/91813?format=json", "institution": "Bytedance Inc."}, {"id": 84745, "fullname": "Alan L. Yuille", "url": "http://cvpr.thecvf.com/api/miniconf/users/84745?format=json", "institution": "Johns Hopkins University"}, {"id": 87038, "fullname": "Chongyang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/87038?format=json", "institution": "ByteDance"}], "abstract": "Text-to-video generation has advanced rapidly in visual fidelity, whereas standard methods still have limited ability to control the subject composition of generated scenes. Prior work shows that adding localized text control signals, such as bounding boxes or segmentation masks, can help. However, these methods struggle in complex scenarios and degrade in multi-object settings, offering limited precision and lacking a clear correspondence between individual trajectories and visual entities as the number of controllable objects increases. We introduce Text-Grounded Trajectories (TGT), a framework that conditions video generation on trajectories paired with localized text descriptions. We propose $\\textit{Location-Aware Cross-Attention}$ (LACA) to integrate these signals and adopt a dual-CFG scheme to separately modulate local and global text guidance. In addition, we develop a data processing pipeline that produces trajectories with localized descriptions of tracked entities, and we annotate two million high quality video clips to train TGT. Together, these components enable TGT to use point trajectories as intuitive motion handles, pairing each trajectory with text to control both appearance and motion. Extensive experiments show that TGT achieves higher visual quality, more accurate text alignment, and improved motion controllability compared with prior approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38830", "url": null, "sourceid": 36825, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37110, "uid": "b0d75402a0d338ccfec035407bf4410c", "name": "SVAgent: Storyline-guided Long Video Understanding via Cross-modal Multi-agent Collaboration", "authors": [{"id": 180483, "fullname": "zhongyu yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180483?format=json", "institution": "Lanzhou University"}, {"id": 186684, "fullname": "Zuhao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186684?format=json", "institution": "Nanyang Technological University"}, {"id": 186685, "fullname": "SHUO ZHAN", "url": "http://cvpr.thecvf.com/api/miniconf/users/186685?format=json", "institution": null}, {"id": 186686, "fullname": "Tan Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/186686?format=json", "institution": "Peking University"}, {"id": 186687, "fullname": "Wei Pang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186687?format=json", "institution": "Heriot-Watt University"}, {"id": 186688, "fullname": "Yingfang Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186688?format=json", "institution": "Heriot-Watt University"}], "abstract": "Video question answering (VideoQA) is a challenging task that requires integrating spatial, temporal, and semantic information to capture the complex dynamics of video sequences. Although recent advances have introduced various approaches for video understanding, most existing methods still rely on locating relevant frames to answer questions rather than reasoning through the evolving storyline as humans do. Humans naturally interpret videos through coherent storylines, an ability that is crucial for making robust and contextually grounded predictions. To address this gap, we propose SVAgent, a storyline-guided cross-modal multi-agent framework for VideoQA. The storyline agent progressively constructs a narrative representation based on frames suggested by a refinement suggestion agent that analyzes historical failures. In addition, cross-modal decision agents independently predict answers from visual and textual modalities under the guidance of the evolving storyline. Their outputs are then evaluated by a meta-agent to align cross-modal predictions and enhance reasoning robustness and answer consistency. Experimental results demonstrate that SVAgent achieves superior performance and interpretability by emulating human-like storyline reasoning in video understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37110", "url": null, "sourceid": 36237, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39398, "uid": "188ce673b936b2249a2a042ae39ef19b", "name": "SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting", "authors": [{"id": 191990, "fullname": "Xiang Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191990?format=json", "institution": "ShanghaiTech University"}, {"id": 191991, "fullname": "Xiangbo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191991?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 191992, "fullname": "Tieshi Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/191992?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 191993, "fullname": "Chengkai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191993?format=json", "institution": null}, {"id": 191994, "fullname": "Yiting Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191994?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 186107, "fullname": "Tianxiang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186107?format=json", "institution": "IEEE"}, {"id": 129656, "fullname": "Zhenzhong Kuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129656?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 190187, "fullname": "Feiwei Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190187?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 162424, "fullname": "Xuefei Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/162424?format=json", "institution": "Griffith University"}, {"id": 191995, "fullname": "Yanming Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191995?format=json", "institution": "Griffith University"}], "abstract": "3D super-resolution (3DSR) aims to reconstruct high-resolution (HR) 3D scenes from low-resolution (LR) multi-view images. Existing methods rely on dense LR inputs and per-scene optimization, which restricts the high-frequency priors for constructing HR 3D Gaussian Splatting (3DGS) to those inherited from pretrained 2D super-resolution (2DSR) models. This severely limits reconstruction fidelity, cross-scene generalization, and real-time usability. We propose to reformulate 3DSR as a direct feed-forward mapping from sparse LR views to HR 3DGS representations, enabling the model to autonomously learn 3D-specific high-frequency geometry and appearance from large-scale, multi-scene data. This fundamentally changes how 3DSR acquires high-frequency knowledge and enables robust generalization to unseen scenes. Specifically, we introduce \\textbf{SR3R}, a feed-forward framework that directly predicts HR 3DGS representations from sparse LR views via the learned mapping network. To further enhance reconstruction fidelity, we introduce Gaussian offset learning and feature refinement, which stabilize reconstruction and sharpen high-frequency details. SR3R is plug-and-play and can be paired with any feed-forward 3DGS reconstruction backbone: the backbone provides an LR 3DGS scaffold, and SR3R upscales it to an HR 3DGS. Extensive experiments across three 3D benchmarks demonstrate that SR3R surpasses state-of-the-art (SOTA) 3DSR methods and achieves strong zero-shot generalization, even outperforming SOTA per-scene optimization methods on unseen scenes. Codes will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39398", "url": null, "sourceid": 43095, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36811, "uid": "c0e52e8868a002b463ebdd2864690629", "name": "A Training-Free Style-Personalization via SVD-Based Feature Decomposition", "authors": [{"id": 157089, "fullname": "Kyoungmin Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/157089?format=json", "institution": "Daegu Gyeongbuk Institute of Science and Technology"}, {"id": 157087, "fullname": "Jihun Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/157087?format=json", "institution": "Daegu Gyeongbuk Institute of Science and Technology"}, {"id": 157088, "fullname": "Jongmin Gim", "url": "http://cvpr.thecvf.com/api/miniconf/users/157088?format=json", "institution": "Korea Electronics Technology Institute"}, {"id": 86694, "fullname": "Wonhyeok Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/86694?format=json", "institution": "Daegu Gyeongbuk Institute of Science and Technology"}, {"id": 185937, "fullname": "Kyumin Hwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185937?format=json", "institution": "Daegu Gyeongbuk Institute of Science and Technology"}, {"id": 185938, "fullname": "Jaeyeul Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/185938?format=json", "institution": "Daegu Gyeongbuk Institute of Science and Technology"}, {"id": 86689, "fullname": "Sunghoon Im", "url": "http://cvpr.thecvf.com/api/miniconf/users/86689?format=json", "institution": "DGIST"}], "abstract": "We present a training-free framework for style-personalized image generation that operates during inference using a scale-wise autoregressive model. Our method generates a stylized image guided by a single reference style while preserving semantic consistency and mitigating content leakage. Through a detailed step-wise analysis of the generation process, we identify a pivotal step where the dominant singular values of the internal feature encode style-related components. Building upon this insight, we introduce two lightweight control modules: Principal Feature Blending, which enables precise modulation of style through SVD-based feature reconstruction, and Structural Attention Correction, which stabilizes structural consistency by leveraging content-guided attention correction across fine stages. Without any additional training, extensive experiments demonstrate that our method achieves competitive style fidelity and prompt fidelity compared to fine-tuned baselines, while offering faster inference and greater deployment flexibility.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36811", "url": null, "sourceid": 42407, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38766, "uid": "23ec211d0365be0665abf1354689014d", "name": "ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks", "authors": [{"id": 158705, "fullname": "Ruixun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/158705?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 190621, "fullname": "Bowen Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190621?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 190622, "fullname": "Jiayi Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/190622?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 160589, "fullname": "Kaiyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/160589?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 189402, "fullname": "Wanchen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189402?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 190623, "fullname": "Lanxuan Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/190623?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 189404, "fullname": "Hui Qiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189404?format=json", "institution": "China Telecom"}, {"id": 189405, "fullname": "Weizhan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189405?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 84835, "fullname": "Deyu Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/84835?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 71122, "fullname": "Xiangyong Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/71122?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "Ultra-high-resolution (UHR) remote sensing (RS) images offer rich fine-grained information but also present challenges in effective processing. Existing dynamic resolution and token pruning methods are constrained by a passive perception paradigm, suffering from increased redundancy when obtaining finer visual inputs. In this work, we explore a new active perception paradigm that enables models to revisit information-rich regions. First, we present LRS-GRO, a large-scale benchmark dataset tailored for active perception in UHR RS processing, encompassing 17 question types across global, region, and object levels, annotated via a semi-automatic pipeline. Building on LRS-GRO, we propose ZoomEarth, an adaptive cropping\u2013zooming framework with a novel Region-Guided reward that provides fine-grained guidance. Trained via supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), ZoomEarth achieves state-of-the-art performance on LRS-GRO and, in the zero-shot setting, on three public UHR remote sensing benchmarks. Furthermore, ZoomEarth can be seamlessly integrated with downstream models for tasks  such as cloud removal, denoising, segmentation, and image editing through simple tool interfaces, demonstrating strong versatility and extensibility.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38766", "url": null, "sourceid": 32543, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37753, "uid": "4d28e74f86094725e098c6b7d10b449c", "name": "EventDrive: Event Cameras for Vision\u2013Language Driving Intelligence", "authors": [{"id": 152822, "fullname": "Dongyue Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152822?format=json", "institution": "National University of Singapore"}, {"id": 153433, "fullname": "Rong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/153433?format=json", "institution": "HKUST(GZ)"}, {"id": 188169, "fullname": "Alan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188169?format=json", "institution": "National University of Singapore; Shenyang Institute of Automation, Chinese Academy of Sciences"}, {"id": 76351, "fullname": "Lingdong Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76351?format=json", "institution": "National University of Singapore"}, {"id": 88382, "fullname": "Wei Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/88382?format=json", "institution": " Shenzhen DJI Sciences and Technologies Ltd."}, {"id": 188170, "fullname": "Lai Xing Ng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188170?format=json", "institution": "Institute for Infocomm Research (I2R), A*STAR"}, {"id": 95547, "fullname": "Benoit Cottereau", "url": "http://cvpr.thecvf.com/api/miniconf/users/95547?format=json", "institution": "CNRS"}, {"id": 188171, "fullname": "Camille Chane", "url": "http://cvpr.thecvf.com/api/miniconf/users/188171?format=json", "institution": "Ecole Nationale Sup\u00e9rieure de l&#x27;Electronique et de ses Applications"}, {"id": 188172, "fullname": "Wei Tsang Ooi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188172?format=json", "institution": "National University of Singapore"}], "abstract": "Event cameras sense the world through asynchronous brightness changes with microsecond latency and high dynamic range, offering motion fidelity far beyond frame-based sensors and capturing temporal structure that conventional exposures often miss. These properties make events a powerful complement to RGB in autonomous driving, especially under blur, glare, and rapid motion where frame-based perception can become unreliable. However, existing event-aware vision\u2013language models remain limited to generic perception and do not reveal how event sensing contributes to reasoning and decision-making across the full driving loop. We present EventDrive, a large-scale benchmark and model suite that unifies event streams, RGB frames, and language supervision across four core dimensions: Perception, Understanding, Prediction, and Planning, covering captions, structured QA, grounding, motion-state recognition, trajectory forecasting, and planning tasks. Building on this foundation, EventDrive-VLM introduces a multi-horizon event pyramid and a modality-routing mixture-of-experts to adaptively encode and fuse asynchronous and frame-based information for downstream reasoning. Comprehensive evaluation across diverse tasks shows that event streams provide substantial gains in temporal precision, motion awareness, and robustness, bringing event sensing into the center of driving intelligence.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37753", "url": null, "sourceid": 34176, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37175, "uid": "22a993565308ccff8a1db375e32fdaee", "name": "Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images", "authors": [{"id": 180794, "fullname": "Kazuya Nishimura", "url": "http://cvpr.thecvf.com/api/miniconf/users/180794?format=json", "institution": "Osaka University, Tokyo Institute of Technology"}, {"id": 156004, "fullname": "Ryoma Bise", "url": "http://cvpr.thecvf.com/api/miniconf/users/156004?format=json", "institution": "Kyushu University,  Faculty of Information Science and Electrical Engineering"}, {"id": 156002, "fullname": "Shinnosuke Matsuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/156002?format=json", "institution": "Kyushu University"}, {"id": 186848, "fullname": "Haruka Hirose", "url": "http://cvpr.thecvf.com/api/miniconf/users/186848?format=json", "institution": "National Cancer Center Japan"}, {"id": 186849, "fullname": "Yasuhiro Kojima", "url": "http://cvpr.thecvf.com/api/miniconf/users/186849?format=json", "institution": "National Cancer Center Japan"}], "abstract": "Estimating slide- and patch-level gene expression profiles from pathology images enables rapid and low-cost molecular analysis with broad clinical impact. Despite strong results, existing approaches treat gene expression as a mere slide- or spot-level signal and do not incorporate the fact that the measured expression arises from the aggregation of underlying cell-level expression.To explicitly introduce this missing cell-resolved guidance, we propose a Cell-type Prototype-informed Neural Network (CPNN) that leverages publicly available single-cell RNA-sequencing datasets. Since single-cell measurements are noisy and not paired with histology images, we first estimate cell-type prototypes\u2014mean expression profiles that capture stable gene\u2013gene co-variation patterns. CPNN then learns cell-type compositional weights directly from images and models the relationship between prototypes and observed bulk or spatial expression, providing a biologically grounded and structurally regularized prediction framework.We evaluate CPNN on three slide-level datasets and three patch-level spatial transcriptomics datasets. Across all settings, CPNN achieves the highest performance in terms of Spearman correlation. Moreover, by visualizing the inferred compositional weights, our framework provides interpretable insights into which cell types drive the predicted expression.The code will be made publicly available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37175", "url": null, "sourceid": 46361, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36410, "uid": "ebbef3b7bea693ae9aa25f2885a828cc", "name": "ReMoE: Region-Mixture Experts for Adversarially-Robust Vision Transformers", "authors": [{"id": 184973, "fullname": "Qinghao Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/184973?format=json", "institution": "Beijing Institute of Technology"}, {"id": 157969, "fullname": "Bingzhi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/157969?format=json", "institution": "Beijing Institute of Technology"}, {"id": 157975, "fullname": "Yishu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157975?format=json", "institution": "Harbin Institute of Technology, Shnezhen"}, {"id": 157974, "fullname": "Minhua Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157974?format=json", "institution": "Shenzhen University"}, {"id": 130758, "fullname": "Guangming Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130758?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}], "abstract": "Vision Transformers (ViTs) achieve state-of-the-art performance on a wide range of vision tasks, yet they remain highly vulnerable to adversarial perturbations due to the lack of explicit region-level semantic modeling. Adversarial perturbations are typically local and spatially structured, whereas the globally coupled self-attention and spatially uniform feed-forward networks in ViTs propagate local corruptions across the whole image without enforcing consistency within semantically coherent regions. To mitigate this mismatch, we propose Region-aware Mixture-of-Experts, namely \"ReMoE\", a plug-and-play module that replaces the standard feed-forward network (FFN) with a region-aware expert layer. Specifically, our ReMoE strategically introduces multi-granularity experts (i.e., global, center, and regional) and couples them with an attention-guided routing mechanism that operates on patch-to-region (P2R) and region-to-patch (R2P) transformations. This mechanism adaptively activates the most relevant experts for each spatial location according to its attention profile, enabling the model to capture region-level semantics and local context while preserving global consistency, thereby providing a stronger inductive bias for adversarially robust ViT representations. Extensive experiments demonstrate that our ReMoE substantially improves the adversarial robustness of ViTs with only marginal additional computational cost.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36410", "url": null, "sourceid": 33365, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37051, "uid": "796a13649f7504836cc88fd7b385a278", "name": "Delving Aleatoric Uncertainty in Medical Image Segmentation via Vision Foundation Models", "authors": [{"id": 144162, "fullname": "Ruiyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/144162?format=json", "institution": "Xi&#x27;an University of artificial intelligence"}, {"id": 90787, "fullname": "Fang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90787?format=json", "institution": "Xidian University"}, {"id": 90795, "fullname": "Licheng Jiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/90795?format=json", "institution": "Xidian University"}, {"id": 176822, "fullname": "Xinglin Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/176822?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 144531, "fullname": "Jiayao Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/144531?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology; Xi&#x27;an University of Electronic Science and Technology"}, {"id": 152373, "fullname": "Shuo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152373?format=json", "institution": "Xidian University"}, {"id": 90815, "fullname": "Xu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90815?format=json", "institution": "Xidian University"}, {"id": 186570, "fullname": "Jingyi yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186570?format=json", "institution": "Xidian University"}, {"id": 71561, "fullname": "Lingling Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/71561?format=json", "institution": "Xidian University"}, {"id": 72376, "fullname": "Puhua Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/72376?format=json", "institution": "School of Artificial Intelligence , Xidian University"}, {"id": 131849, "fullname": "Wenping Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/131849?format=json", "institution": "Xidian University"}], "abstract": "Medical image segmentation provides critical support for clinical workflows by precisely delineating anatomical structures and lesions. However, medical image datasets are often affected by acquisition noise and annotation ambiguity, leading to pervasive data uncertainty that substantially undermines model robustness. The existing study focuses primarily on model architectural improvements and predictive reliability estimation, while systematic exploration of the intrinsic data uncertainty remains insufficient. To address this gap, this work proposes leveraging the universal representation capabilities of visual foundation models to estimate inherent data uncertainty. Specifically, we analyze the feature diversity of the model's decoded representations and quantify their singular value energy to define the semantic perception scale for each class, thereby measuring sample difficulty and aleatoric uncertainty. Based on this foundation, we design two uncertainty-driven application strategies: (1) the aleatoric uncertainty-aware data filtering mechanism to eliminate potentially noisy samples and enhance model learning quality; (2) the dynamic uncertainty-aware optimization strategy that adaptively adjusts class-specific loss weights during training based on the semantic perception scale, combined with a label denoising mechanism to improve training stability. Experimental results on five public datasets encompassing CT and MRI modalities and involving multi-organ and tumor segmentation tasks demonstrate that our method achieves significant and robust performance improvements across various mainstream network architectures (CNN, Transformer, and Mamba), revealing the broad application potential of aleatoric uncertainty in medical image understanding and segmentation tasks. The code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37051", "url": null, "sourceid": 40688, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38371, "uid": "1a551b7323fefa14d9b4ac09bd73ee49", "name": "ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation", "authors": [{"id": 145424, "fullname": "Wei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/145424?format=json", "institution": "Harbin Institute of Technology"}, {"id": 189737, "fullname": "Jizhihui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189737?format=json", "institution": "Harbin Institute of Technology"}, {"id": 186925, "fullname": "Li Yixing", "url": "http://cvpr.thecvf.com/api/miniconf/users/186925?format=json", "institution": "Harbin Institute of Technology"}, {"id": 189738, "fullname": "Junwen Tong", "url": "http://cvpr.thecvf.com/api/miniconf/users/189738?format=json", "institution": "State Key Laboratory of Mobile Network and Mobile Multimedia Technology\uff1bZTE Co., Ltd\uff1b"}, {"id": 89896, "fullname": "Rui Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/89896?format=json", "institution": "Harbin Institute of Technology"}, {"id": 84777, "fullname": "Liqiang Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/84777?format=json", "institution": "Harbin Institute of Technology (Shenzhen)"}], "abstract": "Current Vision-Language-Action (VLA) models primarily focus on mapping 2D observations to actions but exhibit notable limitations in spatiotemporal perception and reasoning:  1) spatial representations often rely on additional sensors, introducing substantial computational overhead;  2) visual reasoning is typically limited to future-frame prediction, lacking alignment with the instruction-grounded scene and thus compromising spatiotemporal consistency.  To address these challenges, we propose **ConsisVLA-4D**, a unified and efficient framework that enhances spatiotemporal consistency in 3D-Perception and 4D-Reasoning. Specifically, we design:  **1) CV-Aligner**, which ensures **C**ross-**V**iew object semantic consistency via filtering instruction-relevant regions and aligning object identities across multiple viewpoints;  **2) CO-Fuser**, which guarantees **C**ross-**O**bject spatial geometric consistency by eliminating spatial relation ambiguities between objects across views using compact latent representations.  Building upon these, we introduce  **3) CS-Thinker** to achieve **C**ross-**S**cene spatiotemporal consistency as actions unfold. It learns implicit knowledge of local dynamics from object-semantic tokens of CV-Aligner and global depth from geometric tokens of CO-Fuser, thereby enhancing efficient visual reasoning under scene variations.  Extensive experiments demonstrate that, benefiting from its efficient spatiotemporal consistency design, ConsisVLA-4D achieves **21.6%** and **41.5%** performance improvements, along with **2.3\u00d7** and **2.4\u00d7** inference speedups compared to OpenVLA on the LIBERO benchmark and real-world platforms, respectively.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38371", "url": null, "sourceid": 33193, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39546, "uid": "52f01a9665e3be7c05d2fb6cf8bb8082", "name": "Generalized and Personalized Federated Learning with Black-Box Foundation Models via Orthogonal Transformations", "authors": [{"id": 192315, "fullname": "EunGyung Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/192315?format=json", "institution": "Seoul National University"}, {"id": 182177, "fullname": "Jewon Yeom", "url": "http://cvpr.thecvf.com/api/miniconf/users/182177?format=json", "institution": "Seoul National University"}, {"id": 176982, "fullname": "Yonghoon Jeon", "url": "http://cvpr.thecvf.com/api/miniconf/users/176982?format=json", "institution": "Kakao Healthcare Corp"}, {"id": 85394, "fullname": "Taesup Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/85394?format=json", "institution": "Seoul National University"}], "abstract": "Federated Learning (FL) facilitates decentralized model training while preserving data privacy. However, achieving both robust generalization and effective personalization simultaneously in heterogeneous (non-IID) environments remains a formidable challenge. Furthermore, the widespread adoption of proprietary Foundation Models (FMs) introduces a critical requirement for dual privacy: (a) protecting sensitive client data and (b) securing the server's valuable intellectual property. This mandates strictly black-box access to the FM. To address these multifaceted challenges, we introduce FedOT, a novel FL framework optimized for black-box FMs. FedOT employs a shared global task-dependent classifier while facilitating local adaptation through client-specific orthogonal transformations applied externally to the FM embeddings. This architecture inherently guarantees that the FM's internal parameters remain inaccessible and unmodified. By enforcing orthogonality, FedOT effectively mitigates gradient conflicts across diverse clients, which is theoretically bounded, preserves the semantic integrity of the FM representations, and achieves robust performance under significant data heterogeneity. The synergy of global and local parameters optimally balances generalization and personalization, markedly outperforming baseline FL methods across diverse benchmarks. Extensive empirical analysis, including rigorous multi-seed validation and scalability assessments, substantiates the robustness, efficiency, and superior performance of FedOT.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39546", "url": null, "sourceid": 38829, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39442, "uid": "49b2e2112f4e7f769bdc0e6f9cc64b22", "name": "Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models", "authors": [{"id": 172872, "fullname": "Ruiying Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/172872?format=json", "institution": "Tsinghua University"}, {"id": 192085, "fullname": "Xueyu Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192085?format=json", "institution": null}, {"id": 192086, "fullname": "Jing Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/192086?format=json", "institution": null}, {"id": 128148, "fullname": "Lu Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/128148?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 192087, "fullname": "Yuanzheng Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/192087?format=json", "institution": "National University of Defense Technology"}, {"id": 155950, "fullname": "Xiao-Hui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155950?format=json", "institution": "Huawei Technologies Ltd."}], "abstract": "Multimodal large language models (MLLMs) often suffer from perceptual impairments under extended reasoning modes, particularly in visual question answering (VQA) tasks. We identify attention dispersion as the underlying cause: during multi-step reasoning, model's visual attention becomes scattered and drifts away from question-relevant regions, effectively \u201closing focus\u201d on the visual input. To better understand this phenomenon, we analyze the attention maps of MLLMs and observe that reasoning prompts significantly reduce attention to regions critical for answering the question. We further find a strong correlation between model\u2019s overall attention on image tokens and the spatial dispersiveness of model\u2019s attention within the image. Leveraging this insight, we propose a training-free Visual Region-Guided Attention (VRGA) framework that selects visual heads based on an entropy\u2013focus criterion and reweights their attention, effectively guiding the model to focus on question-relevant regions during reasoning.  Extensive experiments on vision-language benchmarks demonstrate that our method effectively alleviates perceptual degradation, leading to  improvements in visual grounding and reasoning accuracy, while offering interpretable insights into how MLLMs process visual information.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39442", "url": null, "sourceid": 40429, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37780, "uid": "975511e6319614045b0a883b9d5c7f2e", "name": "EV-CGNet: Co-visible Focused 3D-guided 2D Event Keypoint Detection Network", "authors": [{"id": 71942, "fullname": "Yuan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/71942?format=json", "institution": "University of Science and Technology of China"}, {"id": 185297, "fullname": "Tianle Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/185297?format=json", "institution": "University of Science and Technology of China"}, {"id": 129265, "fullname": "Yuqing Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129265?format=json", "institution": "University of Science and Technology of China"}, {"id": 85977, "fullname": "Tianzhu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85977?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Event keypoint detection has garnered significant attention due to its crucial role in extracting spatial relationships between matched keypoints, which are fundamental for various computer vision tasks. However, achieving robust event keypoint detection remains challenging the difficulty in balancing the exploitation of event information and compatibility with established algorithms. Moreover, the limited use of co-visible information often results in excessive keypoint detection in non-matching regions, leading to incorrect matches. To address these challenges, we propose a novel Co-visible Focused 3D-guided 2D Event Keypoint Detection Network (EV-CGNet), which mainly consists of a 3D-guided 2D feature prototype learning (G2PL) module and a co-visible region-focused detector and descriptor learning (CDDL) module. The proposed method enjoys several merits. First, the proposed G2PL module can enhance event frame feature prototypes by recovering motion information with guidance from event points. Second, the proposed CDDL module can direct keypoint detection toward co-visible regions and ensure accurate matches. Comprehensive experimental evaluations on six challenging benchmarks show that our method surpasses state-of-the-art event keypoint detection method significantly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37780", "url": null, "sourceid": 41680, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39298, "uid": "4600fac20e73cc30e734c8201ae46d5c", "name": "SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models", "authors": [{"id": 191792, "fullname": "Jiesong Lian", "url": "http://cvpr.thecvf.com/api/miniconf/users/191792?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 191793, "fullname": "Ruizhe Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/191793?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 127553, "fullname": "Zixiang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/127553?format=json", "institution": "xiaobing.ai"}, {"id": 100092, "fullname": "Xiaoyue Mi", "url": "http://cvpr.thecvf.com/api/miniconf/users/100092?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 130940, "fullname": "Long Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130940?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 155471, "fullname": "Yuan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/155471?format=json", "institution": "Microsoft Research"}, {"id": 106459, "fullname": "qinglin lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/106459?format=json", "institution": "Tencent"}, {"id": 130965, "fullname": "yixue Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/130965?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 88195, "fullname": "Junchi Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88195?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Post-training alignment of video generation models with human preferences is a critical goal. Developing effective Reward Models (RMs) for this process faces significant methodological hurdles.Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise. Concurrently, the architectural design of VLM-based RMs, particularly their output mechanisms, remains underexplored. Furthermore, RM is susceptible to reward hacking in post-training. To mitigate these limitations, we propose SoliReward, a systematic framework for video RM training. Our framework first sources high-quality, cost-efficient data via single-item binary annotations, then constructs preference pairs using a cross-prompt pairing strategy. Architecturally, we employ a Hierarchical Progressive Query Attention mechanism to enhance feature aggregation. Finally, we introduce a modified BT loss that explicitly accommodates win-tie scenarios. This approach regularizes the RM's score distribution for positive samples, providing more nuanced preference signals to alleviate over-focus on a small number of top-scoring samples. Our approach is validated on benchmarks evaluating physical plausibility, subject deformity, and semantic alignment, demonstrating improvements in direct RM evaluation metrics and in the efficacy of post-training on video generation models. Code and benchmark will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39298", "url": null, "sourceid": 32046, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37138, "uid": "a4de67584a6420f7c296da3dee64f99b", "name": "AnthroTAP: Learning Point Tracking with Real-World Motion", "authors": [{"id": 156301, "fullname": "In\u00e8s Hyeonsu Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/156301?format=json", "institution": "KAIST AI"}, {"id": 92999, "fullname": "Seokju Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/92999?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 186750, "fullname": "Jahyeok Koo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186750?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 186751, "fullname": "Junghyun Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/186751?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology (KAIST)"}, {"id": 91712, "fullname": "Gabriel Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91712?format=json", "institution": "Adobe Research"}, {"id": 135184, "fullname": "Honglak Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/135184?format=json", "institution": "LG AI Research"}, {"id": 180187, "fullname": "Joon-Young Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/180187?format=json", "institution": "Adobe Research"}, {"id": 153109, "fullname": "Seungryong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/153109?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}], "abstract": "Point tracking models often struggle to generalize to real-world videos because large-scale training data is predominantly synthetic--the only source currently feasible to produce at   scale. Collecting real-world annotations, however, is prohibitively expensive, as it requires tracking hundreds of points across frames. We introduce AnthroTAP, an automated pipeline that generates large-scale pseudo-labeled point tracking data from real human motion videos. Leveraging the structured complexity of human movement-non-rigid deformations, articulated motion, and frequent occlusions\u2014AnthroTAP fits Skinned Multi-Person Linear (SMPL) models to detected humans, projects mesh vertices onto image planes, resolves occlusions via ray-casting, and filters unreliable tracks using optical flow consistency. A model trained on the AnthroTAP dataset achieves state-of-the-art performance on TAP-Vid, outperforming recent self-supervised teacher-student models trained on vastly larger real datasets, while requiring only one day of training on 4 GPUs. AnthroTAP shows that structured human motion offers a scalable and effective source of real-world supervision for point tracking. Code and datasets will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37138", "url": null, "sourceid": 37014, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37463, "uid": "4c51b2742b5de999681962dcea2c0639", "name": "SpatialVID: A Large-Scale Video Dataset with Spatial Annotations", "authors": [{"id": 174235, "fullname": "Jiahao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174235?format=json", "institution": "Nanjing University"}, {"id": 187507, "fullname": "Yufeng Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187507?format=json", "institution": "Nanjing University"}, {"id": 187508, "fullname": "Rujie Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187508?format=json", "institution": "nanjing university"}, {"id": 98719, "fullname": "Youtian Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/98719?format=json", "institution": "Nanjing university"}, {"id": 143799, "fullname": "Jian Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/143799?format=json", "institution": "Nanjing University"}, {"id": 187509, "fullname": "Lin-Zhuo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187509?format=json", "institution": "Nanjing university"}, {"id": 176513, "fullname": "Bao Yajie", "url": "http://cvpr.thecvf.com/api/miniconf/users/176513?format=json", "institution": "Department of Intelligent Science and Technology, nanjing university"}, {"id": 187510, "fullname": "Chang Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187510?format=json", "institution": "nanjing university"}, {"id": 187511, "fullname": "Yanxi Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/187511?format=json", "institution": "nanjing university"}, {"id": 186119, "fullname": "Xiao-Xiao Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/186119?format=json", "institution": "Nanjing University"}, {"id": 153839, "fullname": "Hao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153839?format=json", "institution": "Nanjing University"}, {"id": 88072, "fullname": "Zhaoxiang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88072?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 85035, "fullname": "Xun Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/85035?format=json", "institution": "Nanjing University"}, {"id": 128100, "fullname": "Yao Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/128100?format=json", "institution": "Nanjing University"}], "abstract": "Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion.To this end, we collect **SpatialVID**, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes, camera movements and dense 3D annotations such as per-frame camera poses, depth, and motion instructions.Specifically, we collect more than 21,000 hours of raw video, and process them into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions.Analysis of SpatialVID's data statistics reveals a richness and diversity that directly foster improved model generalization and performance, establishing it as a key asset for the video and 3D vision research community.Through extensive validation experiments, we demonstrate SpatialVID\u2019s effectiveness across tasks such as controllable video generation, world simulation and geometric reconstruction, providing a strong foundation for spatial intelligence research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37463", "url": null, "sourceid": 42540, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36209, "uid": "a8833cd4d827a1bd3e3a22f2709aa7d9", "name": "Predicting Spatial Transcriptomics from Histology Images via High-Order Multi-Cell Interaction Modeling", "authors": [{"id": 184454, "fullname": "Youhan Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/184454?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 158278, "fullname": "Jiahua Rao", "url": "http://cvpr.thecvf.com/api/miniconf/users/158278?format=json", "institution": "Sun Yat-Sen University"}, {"id": 184455, "fullname": "Kangrui Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/184455?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 158280, "fullname": "Jiancong Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/158280?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 131106, "fullname": "Yuedong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131106?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Spatial transcriptomics (ST) links gene expression to tissue architecture and enables predicting spatial expression from H\\&E-stained whole-slide images (WSIs). However, existing spot- or slide-level predictors focus on single-spot features or pairwise relations, failing to capture high-order, many-to-many cross-cell interactions. As a result, they miss synergistic and antagonistic effects among multiple neighboring cells. Here, we introduce MCToGene, a scalable and accurate framework that explicitly models multi-cell interactions via many-body attention with hierarchical coupling to predict spatial gene expression. MCToGene employs a many-body attention module to encode high-order, many-to-many cross-cell dependencies, enabling context-aware microenvironment modeling. To mitigate the combinatorial burden of many-body modeling, we design a hierarchical interaction module that couples pairwise and many-body representations for feature aggregation and prediction, preserving many-body expressiveness while controlling computation and memory. On HEST-1k and STImage-1K4M, MCToGene surpasses state-of-the-art baselines with 7.85% relative improvement. Ablations confirm that explicit high-order, many-to-many modeling drives these gains, and visualizations demonstrate that multi-cell interactions is essential for biologically coherent spatial predictions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36209", "url": null, "sourceid": 34349, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38828, "uid": "b240984b4cc0f84615308bf708df6516", "name": "The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition", "authors": [{"id": 164508, "fullname": "Yuwen Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/164508?format=json", "institution": "Boston University"}, {"id": 190780, "fullname": "Yuan Qing", "url": "http://cvpr.thecvf.com/api/miniconf/users/190780?format=json", "institution": "Boston University"}, {"id": 88081, "fullname": "Boqing Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/88081?format=json", "institution": "Google"}], "abstract": "This paper reveals that many open-source large language models (LLMs) lack hierarchical knowledge about our visual world, unaware of even well-established biology taxonomies. This shortcoming makes LLMs a bottleneck for vision LLMs' hierarchical visual recognition (e.g., recognizing Anemone Fish but not Vertebrate). We arrive at these findings using about one million four-choice visual question answering (VQA) tasks constructed from six taxonomies and four image datasets. Interestingly, finetuning a vision LLM using our VQA tasks reaffirms LLMs' bottleneck effect because the VQA tasks improve the LLMs' hierarchical consistency more than the vision LLMs'. We conjecture that one cannot make open-source vision LLMs understand visual concepts hierarchically until LLMs possess corresponding taxonomy knowledge.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38828", "url": null, "sourceid": 42515, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39876, "uid": "ac3485ecf4fcd5d56323ed2373fd7e32", "name": "Deciphering Genotype-Phenotype Mechanisms from High-Content Profiling via Knowledge-Guided Multi-modal Graph Learning", "authors": [{"id": 143883, "fullname": "Hanjing Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/143883?format=json", "institution": "Sun Yat-sen University"}, {"id": 158278, "fullname": "Jiahua Rao", "url": "http://cvpr.thecvf.com/api/miniconf/users/158278?format=json", "institution": "Sun Yat-Sen University"}, {"id": 184454, "fullname": "Youhan Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/184454?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 158280, "fullname": "Jiancong Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/158280?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 131106, "fullname": "Yuedong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131106?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Understanding genotype\u2013phenotype relationships is pivotal for advancing biomedical research, drug discovery, and precision medicine. With the rise of high-throughput cellular imaging, it is essential to tightly integrate high-content cellular morphology with structured biological knowledge to extract cellular-scale evidence for genotype-to-phenotype mapping.However, integrating high-dimensional, heterogeneous, and noisy phenotypes with structured knowledge remains challenging. Prior approaches typically treat phenotypes as node features, overlooking that phenotypes primarily convey cellular-scale relational signals about how perturbations reshape interactions. We present KERNEL, a knowledge-guided multimodal graph learning framework that integrates cellular imaging phenotypes into a unified knowledge graph to predict genotype-phenotype interactions, including GRN inference, drug-target interaction prediction, and subtype-specific subnetwork discovery. KERNEL dynamically augments task-relevant edges from noisy phenotypic signals, explicitly learns per-edge confidence and marginal utility, and uses knowledge gating to align graph topology with mechanistic pathways. Across large-scale imaging and single-cell datasets, KERNEL consistently outperforms state-of-the-art baselines, e.g., up to 38.1\\% AUPR improvement for GRN inference, while delivering more accurate and interpretable DTI and subtype subnetwork discovery, demonstrating robust mechanism learning from richer, harder-to-denoise phenotypes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39876", "url": null, "sourceid": 35339, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38940, "uid": "b2483c130839641db1e7badbfbe9240b", "name": "Common Inpainted Objects In-N-Out of Context", "authors": [{"id": 163439, "fullname": "Tianze Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/163439?format=json", "institution": "University of Georgia"}, {"id": 191012, "fullname": "Tyson Jordan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191012?format=json", "institution": null}, {"id": 191013, "fullname": "Ruitong Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/191013?format=json", "institution": "University of Georgia"}, {"id": 191014, "fullname": "Ninghao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191014?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 163311, "fullname": "Jin Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/163311?format=json", "institution": "University of Georgia"}], "abstract": "We present Common Inpainted Objects In-N-Out of Context (COinCO), a novel dataset addressing the scarcity of out-of-context examples in existing vision datasets. By systematically replacing objects in COCO images through diffusion-based inpainting, we create 97,722 unique images featuring both contextually coherent and inconsistent scenes, enabling effective context learning. Each inpainted object is meticulously verified and categorized as in- or out-of-context through Large Vision Language Model assessments. Our analysis reveals significant patterns in semantic priors that influence inpainting success across object categories. We demonstrate three key tasks enabled by COinCO: (1) developing a fine-grained context reasoning approach that classifies objects as in- or out-of-context based on three criteria; (2) a novel Objects-from-Context prediction task that determines which new objects naturally belong in given scenes at both instance and clique levels, and (3) context-enhanced fake detection on state-of-the-art methods without fine-tuning. COinCO provides a controlled testbed with contextual variations, establishing a foundation for advancing context-aware visual understanding in computer vision and image forensics.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38940", "url": null, "sourceid": 32450, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37905, "uid": "eecb123d97c7bcdacb57163d01b01055", "name": "LoG3D: Ultra-High-Resolution 3D Shape Modeling via Local-to-Global Partitioning", "authors": [{"id": 152506, "fullname": "Xinran Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152506?format=json", "institution": "Nanjing University"}, {"id": 182103, "fullname": "Shuichang Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/182103?format=json", "institution": "Alibaba Group"}, {"id": 104646, "fullname": "Jiangjing Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/104646?format=json", "institution": "Alibaba Group"}, {"id": 188546, "fullname": "Hongjie Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188546?format=json", "institution": "Alibaba Group"}, {"id": 188547, "fullname": "Bowen Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188547?format=json", "institution": "Alibaba Group"}, {"id": 152507, "fullname": "Yuanqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152507?format=json", "institution": "Nanjing University"}, {"id": 188326, "fullname": "Jie Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188326?format=json", "institution": "Nanjing university"}, {"id": 188327, "fullname": "Zhengkang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/188327?format=json", "institution": "Nanjing Urban Construction Tunnel&amp;Bridge Intelligent Management Co.,Ltd."}, {"id": 90556, "fullname": "Yanwen Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/90556?format=json", "institution": "Nanjing University"}], "abstract": "Generating high-fidelity 3D contents remains a fundamental challenge due to the complexity of representing arbitrary topologies\u2014such as open surfaces and intricate internal structures\u2014while preserving geometric details. Prevailing methods based on signed distance fields (SDFs) are hampered by costly watertight preprocessing and struggle with non-manifold geometries, while point-cloud representations often suffer from sampling artifacts and surface discontinuities. To overcome these limitations, we propose a novel 3D variational autoencoder (VAE) framework built upon unsigned distance fields (UDFs)\u2014a more robust and computationally efficient representation that naturally handles complex and incomplete shapes. Our core innovation is a local-to-global (LoG) architecture that processes the UDF by partitioning it into uniform subvolumes, termed UBlocks. This architecture couples 3D convolutions for capturing local detail with sparse transformers for enforcing global coherence. A Pad-Average strategy further ensures smooth transitions at subvolume boundaries during reconstruction. This modular design enables seamless scaling to ultra-high resolutions up to $2048^3$-a regime previously unattainable for 3D VAEs. Experiments demonstrate state-of-the-art performance in both reconstruction accuracy and generative quality, yielding superior surface smoothness and geometric flexibility.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37905", "url": null, "sourceid": 39557, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36197, "uid": "8362dd8989632ca283cfac1a9b2e3b14", "name": "Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval", "authors": [{"id": 184417, "fullname": "Zhiheng Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184417?format=json", "institution": "Shandong University"}, {"id": 157298, "fullname": "Yupeng Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157298?format=json", "institution": "Shandong University"}, {"id": 184418, "fullname": "Qianyun Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184418?format=json", "institution": "Shandong University"}, {"id": 184419, "fullname": "Shiqi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184419?format=json", "institution": "Shandong University"}, {"id": 184420, "fullname": "Zhiwei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184420?format=json", "institution": "Shandong University"}, {"id": 184421, "fullname": "Zixu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184421?format=json", "institution": "Shandong University"}], "abstract": "Composed Image Retrieval (CIR) has attracted significant attention due to its flexible multimodal query method, yet its development is severely constrained by the Noisy Triplet Correspondence (NTC) problem. Most existing robust learning methods rely on the \"small loss hypothesis\", but the unique semantic ambiguity in NTC, such as \"partial matching\", invalidates this assumption, leading to unreliable noise identification. This entraps the model in a self dependent vicious cycle where the learner is intertwined with the arbiter, ultimately causing catastrophic \"representation pollution\". To address this critical challenge, we propose a novel \"Expert-Proxy-Diversion\" decoupling paradigm, named Air-Know (ArbIteR calibrated Knowledge iNternalizing rObust netWork). Air-Know incorporates three core modules: (1) The External Prior Arbitration (EPA) module, which utilizes Multimodal Large Language Models (MLLMs) as an offline expert to construct a high precision anchor dataset; (2) Expert Knowledge Internalization (EKI) module, which efficiently guides a lightweight proxy \"arbiter\" to internalize the expert's discriminative logic; (3) Dual Stream Reconciliation (DSR) module, which leverages the EKI's matching confidence to divert the training data, achieving a clean alignment stream and a representation feedback reconciliation stream. Extensive experiments on multiple CIR benchmark datasets demonstrate that Air-Know significantly outperforms existing SOTA methods under the NTC setting, while also showing strong competitiveness in traditional CIR.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36197", "url": null, "sourceid": 38368, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39086, "uid": "b42c757f1e5172e5937125a809c22547", "name": "Residual Decoding: Mitigating Hallucinations in Large Vision-Language Models via History-Aware Residual Guidance", "authors": [{"id": 191331, "fullname": "Xinrong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191331?format=json", "institution": "Peking University"}, {"id": 191332, "fullname": "Xu Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191332?format=json", "institution": "Peking University"}, {"id": 191333, "fullname": "Yingmin Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191333?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 191334, "fullname": "Hengyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191334?format=json", "institution": "The University of Hong Kong"}, {"id": 191335, "fullname": "Jing Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/191335?format=json", "institution": "University of Hong Kong"}, {"id": 89135, "fullname": "Shiyu Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89135?format=json", "institution": "Beihang University"}, {"id": 191337, "fullname": "Shuai Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191337?format=json", "institution": null}, {"id": 191338, "fullname": "Shaokang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191338?format=json", "institution": null}, {"id": 85999, "fullname": "Cheng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85999?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 185001, "fullname": "Hayden Kwok-Hay So", "url": "http://cvpr.thecvf.com/api/miniconf/users/185001?format=json", "institution": "University of Hong Kong; University of Hong Kong"}, {"id": 157299, "fullname": "Ngai Wong", "url": "http://cvpr.thecvf.com/api/miniconf/users/157299?format=json", "institution": "The University of Hong Kong"}], "abstract": "Large Vision-Language Models (LVLMs) can reason effectively from image-text inputs and perform well in various multimodal tasks. Despite this success, they are affected by language priors and often produce hallucinations. Hallucinations denote generated content that is grammatically and syntactically coherent, yet bears no match or direct relevance to actual visual input.To address this problem, we propose Residual Decoding (\\textit{ResDec}). It is a novel training-free method that uses historical information to aid decoding. The method relies on the internal implicit reasoning mechanism and token logits evolution mechanism of LVLMs to correct biases. Extensive experiments demonstrate that \\textit{ResDec} effectively suppresses hallucinations induced by language priors, significantly improves visual grounding, and reduces object hallucinations. In addition to mitigating hallucinations, \\textit{ResDec} also performs exceptionally well on comprehensive LVLM benchmarks, highlighting its broad applicability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39086", "url": null, "sourceid": 37676, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39572, "uid": "8f414eeae19bc5ccd69f544fce81f5a6", "name": "SPOT: Spatiotemporal Prompt Optimization for Motion-Stabilized MLLM-Guided Video Segmentation", "authors": [{"id": 192377, "fullname": "Jiayi Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192377?format=json", "institution": "Shandong University"}, {"id": 192378, "fullname": "Zheyun Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192378?format=json", "institution": "Shandong University"}, {"id": 85258, "fullname": "Xiaoming Xi", "url": "http://cvpr.thecvf.com/api/miniconf/users/85258?format=json", "institution": "Shandong University"}, {"id": 192379, "fullname": "Xiushan Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/192379?format=json", "institution": "Shandong Jianzhu University"}, {"id": 84891, "fullname": "Yilong Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/84891?format=json", "institution": "Shandong University"}], "abstract": "The synergistic framework of multimodal large language models (MLLMs) and vision foundation models demonstrates exceptional performance in image understanding tasks, yet encounters severe temporal inconsistency challenges in video segmentation scenarios. Existing methods predominantly rely on MLLMs trained on static images to generate per-frame segmentation prompts, neglecting the physical continuity of video motion. This paper posits that performance limitations in video understanding tasks from inadequate constraints on model output behavior. Consequently, we propose a spatiotemporal co-optimization mechanism that achieves temporally consistent video segmentation solely by constraining MLLM output behavior, eliminating the need for large-scale video pretraining or complex architectural modifications. Our method features two complementary mechanisms: a Brownian bridge loss that models object trajectories as endpoint-constrained Gaussian processes to ensure temporal smoothness, and a geometry-aware prompt quality loss that enforces spatial consistency with target structures. Experiments on referring expression video segmentation and reasoning video segmentation tasks demonstrate that our method significantly surpasses state-of-the-art techniques on the Ref-YouTube-VOS, Ref-DAVIS-2017, MeVIS, A2d-Sentences, JHMDB-Sentences and ReVOS benchmarks. This work establishes that explicit modeling of physical world constraints can unlock the full potential of statically trained foundation models in dynamic visual understanding tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39572", "url": null, "sourceid": 32596, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37109, "uid": "dd39fad8b6d919682beca3fe8af879d3", "name": "Reconstructing Spiking Neural Networks Using a Single Neuron with Autapses", "authors": [{"id": 180966, "fullname": "Wuque Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/180966?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 186676, "fullname": "Hongze Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/186676?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 186677, "fullname": "Quan Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186677?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 186678, "fullname": "Shifeng Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186678?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 186679, "fullname": "Zhenxing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186679?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 186680, "fullname": "Jiayi He", "url": "http://cvpr.thecvf.com/api/miniconf/users/186680?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 186681, "fullname": "Duo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186681?format=json", "institution": "University of Electronic Science and Technology of China; Chongqing University of Education"}, {"id": 186682, "fullname": "Dezhong Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186682?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 186683, "fullname": "Daqing Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186683?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Spiking neural networks (SNNs) are well known for their high energy efficiency and strong temporal processing capabilities. However, the multilayer architectures of SNNs often incur substantial costs of communication, computation, and storage capacity. Inspired by biological autapses, we develop a simple yet effective framework for reconstructing spiking neural networks using a single neuron with time-delayed autapses (TDA-SNN), integrating a dedicated prototype learning-based optimization method. This design allows a single spiking neuron to dynamically reconfigure its internal temporal states, effectively emulating large-scale architectures such as reservoirs, multilayer perceptrons, and convolutional layers while maintaining efficient learning. Extensive experiments on sequential, event-stream, and image datasets demonstrate that TDA-SNN achieves performance comparable to deep SNNs, while significantly reducing computational overhead and enhancing internal information storage capacity. These results highlight the potential of single-neuron models as compact and efficient computational units, offering new insights into the development of biologically inspired neuromorphic systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37109", "url": null, "sourceid": 32314, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37170, "uid": "0f775ddc899a9c4d090d70905075dd66", "name": "Beyond Missing Modalities: Hypergraph Conditioned Diffusion for Uncertainty-Aware Multimodal Emotion Recognition", "authors": [{"id": 186828, "fullname": "Xihang Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186828?format=json", "institution": "BIT"}, {"id": 186829, "fullname": "Yuhao Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186829?format=json", "institution": "Shenzhen MSU-BIT University"}, {"id": 186830, "fullname": "Qing Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186830?format=json", "institution": "Moscow State University, Lomonosov Moscow State University; Shenzhen MSU-BIT University"}, {"id": 186831, "fullname": "Bin Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/186831?format=json", "institution": "Beijing Institute of Technology"}, {"id": 186832, "fullname": "Jialong Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186832?format=json", "institution": "Beijing Institute of Technology"}, {"id": 186833, "fullname": "Wanpeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186833?format=json", "institution": "Beijing Institute of Technology"}, {"id": 149451, "fullname": "Yao Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/149451?format=json", "institution": "ShenZhen MSU-BIT University"}, {"id": 186834, "fullname": "Ye Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186834?format=json", "institution": "Beijing Institute of Technology"}, {"id": 104276, "fullname": "Chun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/104276?format=json", "institution": "Shenzhen MSU-BIT University"}], "abstract": "Multimodal Emotion Recognition in Conversations (MERC) aims to understand emotions expressed in each utterance by effectively integrating audio, text, and visual modalities. However, in real-world scenarios, unavoidable missing modalities often degrade multimodal interpretation performance. To address this, we propose \\textbf{Hypergraph Diffusion and Evidence Fusion based Emotion Recognition (HyperEF)}, a novel framework designed to mitigate challenges arising from incomplete modalities in MERC. Specifically, to mitigate performance degradation caused by modality absence, we propose Masked Hypergraph Attention (MHGAT) conditioned diffusion model to recover latent features of missing modalities in the latent space. To ensure semantic consistency between recovered and available modalities within the same utterance, we introduce MHGAT that captures high-order semantic information from available modalities to guide the diffusion model\u2019s denoising process. Furthermore, to disentangle and model the complex uncertainties inherent in MERC, we propose Dual Channel Evidence Fusion (DCEF), which estimates uncertainty at both feature source level and discriminative level, thereby achieving adaptive evidence fusion. Extensive comparative experiments and interpretability demonstrate the superior performance of our model in emotion recognition, as well as the contribution of each module within the model.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37170", "url": null, "sourceid": 43879, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37653, "uid": "5dcf603ef0e2567f5b99bd3f58f22a40", "name": "Learning from Oblivion: Predicting Knowledge-Overflowed Weights via Retrodiction of Forgetting", "authors": [{"id": 182641, "fullname": "Jinhyeok Jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182641?format=json", "institution": "ETRI"}, {"id": 187952, "fullname": "Jaehong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/187952?format=json", "institution": "ETRI"}, {"id": 129674, "fullname": "Jung Uk Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/129674?format=json", "institution": "Kyung Hee University"}], "abstract": "Pre-trained weights have become a cornerstone of modern deep learning, enabling efficient knowledge transfer and improving downstream task performance, especially in data-scarce scenarios. However, a fundamental question remains: how can we obtain better pre-trained weights that encapsulate more knowledge beyond the given dataset? In this work, we introduce KNowledge-Overflowed Weights (KNOW) prediction, a novel strategy that leverages structured forgetting and its inversion to synthesize knowledge-enriched weights. Our key insight is that sequential fine-tuning on progressively downsized datasets induces a structured forgetting process, which can be modeled and reversed to recover knowledge as if trained on a larger dataset. We construct a dataset of weight transitions governed by this controlled forgetting and employ meta-learning to model weight prediction effectively. Specifically, our KNowledge-Overflowed Weights Nowcaster (KNOWN) acts as a hyper-model that learns the general evolution of weights and predicts enhanced weights with improved generalization. Extensive experiments across diverse datasets and architectures demonstrate that KNOW prediction consistently outperforms Na\\\"ive fine-tuning and simple weight prediction, leading to superior downstream performance. Our work provides a new perspective on reinterpreting forgetting dynamics to push the limits of knowledge transfer.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37653", "url": null, "sourceid": 37356, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36223, "uid": "311c033ee425d5a913e01b0add7d7760", "name": "PhaSR: Generalized Image Shadow Removal with Physically Aligned Priors", "authors": [{"id": 134186, "fullname": "Chia-Ming Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/134186?format=json", "institution": "Institute of Data Science, National Cheng Kung University"}, {"id": 162435, "fullname": "Yu-Fan Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/162435?format=json", "institution": "National Cheng Kung University"}, {"id": 184485, "fullname": "Yu-Jou Hsiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184485?format=json", "institution": "National Cheng Kung University"}, {"id": 184486, "fullname": "Jin-Hui Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184486?format=json", "institution": "National Yang Ming Chiao Tung University"}, {"id": 127300, "fullname": "Yu-Lun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127300?format=json", "institution": "National Yang Ming Chiao Tung University"}, {"id": 184487, "fullname": "Chih-Chung Hsu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184487?format=json", "institution": "National Yang Ming Chiao Tung University"}], "abstract": "Shadow removal under diverse lighting conditions requires disentangling illumination from intrinsic reflectance\u2014a challenge compounded when physical priors are not properly aligned. We propose PhaSR (Physically Aligned Shadow Removal), addressing this through dual-level prior alignment to enable robust performance from single-light shadows to multi-source ambient lighting. First, Physically Aligned Normalization (PAN) performs closed-form illumination correction via Gray-world normalization, log-domain Retinex decomposition, and dynamic range recombination, suppressing chromatic bias. Second, Geometric-Semantic Rectification Attention (GSRA) extends differential attention to cross-modal alignment, harmonizing depth-derived geometry with DINO-v2 semantic embeddings to resolve modal conflicts under varying illumination. Experiments show competitive performance in shadow removal with lower complexity and generalization to ambient lighting where traditional methods fail under multi-source illumination.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36223", "url": null, "sourceid": 33957, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36691, "uid": "07aa69970a284c5948c3a34ed5e4e0b1", "name": "ReflexSplit: Single Image Reflection Separation via Layer Fusion-Separation", "authors": [{"id": 134186, "fullname": "Chia-Ming Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/134186?format=json", "institution": "Institute of Data Science, National Cheng Kung University"}, {"id": 162435, "fullname": "Yu-Fan Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/162435?format=json", "institution": "National Cheng Kung University"}, {"id": 184486, "fullname": "Jin-Hui Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184486?format=json", "institution": "National Yang Ming Chiao Tung University"}, {"id": 184485, "fullname": "Yu-Jou Hsiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184485?format=json", "institution": "National Cheng Kung University"}, {"id": 184487, "fullname": "Chih-Chung Hsu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184487?format=json", "institution": "National Yang Ming Chiao Tung University"}, {"id": 127300, "fullname": "Yu-Lun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127300?format=json", "institution": "National Yang Ming Chiao Tung University"}], "abstract": "Single Image Reflection Separation (SIRS) disentangles mixed images into transmission and reflection layers. Existing methods suffer from transmission-reflection confusion under nonlinear mixing, particularly in deep decoder layers, due to implicit fusion mechanisms and inadequate multi-scale coordination. We propose ReflexSplit, a dual-stream framework with three key innovations.(1) Cross-scale Gated Fusion (CrGF) adaptively aggregates semantic priors, texture details, and decoder context across hierarchical depths, stabilizing gradient flow and maintaining feature consistency. (2) Layer Fusion-Separation Blocks (LFSB) alternate between fusion for shared structure extraction and differential separation for layer-specific disentanglement. Inspired by Differential Transformer, we extend attention cancellation to dual-stream separation via cross-stream subtraction. (3) Curriculum training progressively strengthens differential separation through depth-dependent initialization and epoch-wise warmup.Extensive experiments on synthetic and real-world benchmarks demonstrate state-of-the-art performance with superior perceptual quality and robust generalization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36691", "url": null, "sourceid": 43789, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36250, "uid": "a2ff20730c919c3c30bcfa4aac8b4314", "name": "A Unified Perspective on Adversarial Membership Manipulation in Vision Models", "authors": [{"id": 180603, "fullname": "RUIZE GAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/180603?format=json", "institution": "National University of Singapore"}, {"id": 184580, "fullname": "Kaiwen Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184580?format=json", "institution": "Knowin AI"}, {"id": 184581, "fullname": "Yongqiang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184581?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence; Carnegie Mellon University"}, {"id": 153389, "fullname": "Feng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153389?format=json", "institution": "University of Melbourne"}], "abstract": "Membership inference attacks (MIAs) aim to determine whether a specific data point was part of a model\u2019s training set, serving as effective tools for evaluating privacy leakage of vision models. However, existing MIAs implicitly assume honest query inputs, and their adversarial robustness remains unexplored. We show that MIAs for vision models expose a previously overlooked adversarial surface: adversarial membership manipulation, where imperceptible perturbations can reliably push non-member images into the \u201cmember\u2019\u2019 region of state-of-the-art MIAs. In this paper, we provide the first unified perspective on this phenomenon by analyzing its mechanism and implications. We begin by demonstrating that adversarial membership fabrication is consistently effective across diverse architectures and datasets. We then reveal a distinctive geometric signature\u2014a characteristic gradient-norm collapse trajectory\u2014that reliably separates fabricated from true members despite their nearly identical semantic representations. Building on this insight, we introduce a principled detection strategy grounded in gradient-geometry signals and develop a robust inference framework that substantially mitigates adversarial manipulation. Extensive experiments show that fabrication is broadly effective, while our detection and robust inference strategies significantly enhance resilience. This work establishes the first comprehensive framework for adversarial membership manipulation in vision models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36250", "url": null, "sourceid": 34697, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38997, "uid": "0ded6348389f35483e961d04d7babcc8", "name": "3D-Object Perception Transformer (3PT)", "authors": [{"id": 128371, "fullname": "Agastya Kalra", "url": "http://cvpr.thecvf.com/api/miniconf/users/128371?format=json", "institution": "Google"}, {"id": 191149, "fullname": "Tim Salzmann", "url": "http://cvpr.thecvf.com/api/miniconf/users/191149?format=json", "institution": "Microsoft"}, {"id": 128344, "fullname": "Guy Stoppi", "url": "http://cvpr.thecvf.com/api/miniconf/users/128344?format=json", "institution": "Intrinsic"}, {"id": 97015, "fullname": "Dmitrii Marin", "url": "http://cvpr.thecvf.com/api/miniconf/users/97015?format=json", "institution": "Intrinsic"}, {"id": 191150, "fullname": "Rishav Agarwal", "url": "http://cvpr.thecvf.com/api/miniconf/users/191150?format=json", "institution": "Intrinsic"}, {"id": 128358, "fullname": "Vage Taamazyan", "url": "http://cvpr.thecvf.com/api/miniconf/users/128358?format=json", "institution": "Intrinsic"}, {"id": 191151, "fullname": "Martin Bokeloh", "url": "http://cvpr.thecvf.com/api/miniconf/users/191151?format=json", "institution": "Intrinsic"}, {"id": 191152, "fullname": "Stefan Hinterstoisser", "url": "http://cvpr.thecvf.com/api/miniconf/users/191152?format=json", "institution": "Google"}, {"id": 128333, "fullname": "Anton Boykov", "url": "http://cvpr.thecvf.com/api/miniconf/users/128333?format=json", "institution": "University of Waterloo"}, {"id": 191153, "fullname": "Alberto Dall&amp;#x27;Olio", "url": "http://cvpr.thecvf.com/api/miniconf/users/191153?format=json", "institution": "Intrinsic"}, {"id": 191154, "fullname": "Pravin Dangol", "url": "http://cvpr.thecvf.com/api/miniconf/users/191154?format=json", "institution": "Intrinsic"}, {"id": 191155, "fullname": "Kartik Venkataraman", "url": "http://cvpr.thecvf.com/api/miniconf/users/191155?format=json", "institution": "Intrinsic Innovation LLC"}, {"id": 151104, "fullname": "Huaijin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/151104?format=json", "institution": "University of Hawaii at Manoa"}], "abstract": "Current approaches to zero-shot 3D-object perception typically rely on ensembles of frozen foundation models.This limits deep object understanding and cross-domain generalization, making performance inadequate for real-world deployment. The 3D-Object Perception Transformer (3PT) addresses this limitation by unifying detection, segmentation, and 6DoF pose estimation in a single framework, directly trained for 3D-object perception. Based on two large-scale trained Transformers that specialize in 2D and 3D object-centric scene understanding respectively, 3PT continuously refines its object representations without depth input, enhancing 3D understanding by incorporating multi-view information. 3PT surpasses task-specialized models for detection and pose estimation, often achieving double-digit percentage improvements on the diverse BOP-benchmarks. Achieving high accuracy and robustness, \\algshort{} is well-suited for practical industrial robotics applications such as bin picking and precise insertion.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38997", "url": null, "sourceid": 46111, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37680, "uid": "2b6319cb1c507418504e28bf285072ee", "name": "InvCoSS: Inversion-driven Continual Self-supervised Learning in Medical Multi-modal Image Pre-training", "authors": [{"id": 181795, "fullname": "Zihao Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/181795?format=json", "institution": "Shanghai Innovation Institute"}, {"id": 156940, "fullname": "Shaohao Rui", "url": "http://cvpr.thecvf.com/api/miniconf/users/156940?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 156942, "fullname": "Zhenyu Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156942?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 187997, "fullname": "Guotai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187997?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 187998, "fullname": "Xiaosong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187998?format=json", "institution": "Shanghai AI Laboratory"}], "abstract": "Continual self-supervised learning (CSSL) in medical imaging trains a foundation model sequentially, alleviating the need for collecting multi-modal images for joint training and offering promising improvements in downstream performance while preserving data privacy. However, most existing methods still rely on replaying data from previous stages to prevent catastrophic forgetting, which compromises privacy and limits their applicability in real-world scenarios where data transfer across sites is often restricted. In this work, we propose InvCoSS, an inversion-driven continual self-supervised learning framework for medical multi-modal image pre-training. Specifically, after training on a previous task, InvCoSS inverts the pre-trained self-supervised model to generate synthetic images that approximate the original training distribution. These synthetic images are then combined with data from the new task for joint optimization, which effectively mitigates catastrophic forgetting while strictly adhering to the constraint of no access to previous real data. Furthermore, to improve the fidelity of synthetic images, we introduce a novel InvUNet with a multi-scale fusion architecture to restore both high- and low-frequency components of the inverted images. To enhance diversity and prevent mode collapse, we design a repulsive representation-learning mechanism that encourages a diverse feature space for synthetic images without class guidance. Extensive experiments across nine downstream tasks validate the effectiveness of InvCoSS, achieving performance comparable to or even superior to prior data-replay methods while significantly reducing storage requirements and eliminating data privacy constraints.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37680", "url": null, "sourceid": 41338, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38514, "uid": "2d801467e65a20df2ad5dd175526c3e3", "name": "Shedding Light on VLN Robustness: A Black-box Framework for Indoor Lighting-based Adversarial Attack", "authors": [{"id": 180803, "fullname": "Chenyang LI", "url": "http://cvpr.thecvf.com/api/miniconf/users/180803?format=json", "institution": "Nanyang Technological University"}, {"id": 190032, "fullname": "Wenbing Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190032?format=json", "institution": "Northwest A&amp;F University"}, {"id": 89536, "fullname": "Yihao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89536?format=json", "institution": "Nanyang Technological University"}, {"id": 190033, "fullname": "Simon Sinong Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190033?format=json", "institution": "Northwestern University"}, {"id": 190034, "fullname": "Ming Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190034?format=json", "institution": "Singapore Management University"}, {"id": 130874, "fullname": "Xiaojun Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/130874?format=json", "institution": ", Chinese Academy of Sciences"}, {"id": 127378, "fullname": "Yang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127378?format=json", "institution": "Nanyang Technological University"}], "abstract": "Vision-and-Language Navigation (VLN) agents have made remarkable progress, but their robustness remains insufficiently studied. Existing adversarial evaluations often rely on perturbations that manifest as unusual textures rarely encountered in everyday indoor environments. Errors under such contrived conditions have limited practical relevance, as real-world agents are unlikely to encounter such artificial patterns. In this work, we focus on indoor lighting, an intrinsic yet largely overlooked scene attribute that strongly influences navigation. We propose Indoor Lighting-based Adversarial Attack (ILA), a black-box framework that manipulates global illumination to disrupt VLN agents. Motivated by typical household lighting usage, we design two attack modes: Static Indoor Lighting-based Attack (SILA), where the lighting intensity remains constant throughout an episode, and Dynamic Indoor Lighting-based Attack (DILA), where lights are switched on or off at critical moments to induce abrupt illumination changes. We evaluate ILA on two state-of-the-art VLN models across three navigation tasks. Results show that ILA significantly increases failure rates while reducing trajectory efficiency, revealing previously unrecognized vulnerabilities of VLN agents to realistic indoor lighting variations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38514", "url": null, "sourceid": 46317, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36184, "uid": "b57a35f9dda9f73ab2c04e1a6963c932", "name": "FedAFD: Multimodal Federated Learning via Adversarial Fusion and Distillation", "authors": [{"id": 181771, "fullname": "Min Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181771?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 184359, "fullname": "Junchao Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/184359?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 184360, "fullname": "Yinfu FENG", "url": "http://cvpr.thecvf.com/api/miniconf/users/184360?format=json", "institution": null}, {"id": 104052, "fullname": "Jiajun Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/104052?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 184361, "fullname": "Wenwen Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184361?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 184362, "fullname": "Tingting Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/184362?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 87673, "fullname": "Qian Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87673?format=json", "institution": "Zhejiang University"}, {"id": 129656, "fullname": "Zhenzhong Kuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129656?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 76382, "fullname": "Zhou Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76382?format=json", "institution": "Hangzhou Dianzi University"}], "abstract": "Multimodal Federated Learning (MFL) enables clients with heterogeneous data modalities to collaboratively train models without sharing raw data, offering a privacy-preserving framework that leverages complementary cross-modal information. However, existing methods often overlook personalized client performance and struggle with modality/task discrepancies, as well as model heterogeneity. To address these challenges, we propose FedAFD, a unified MFL framework that enhances client and server learning. On the client side, we introduce a bi-level adversarial alignment strategy to align local and global representations within and across modalities, mitigating modality and task gaps. We further design a granularity-aware fusion module to integrate global knowledge into the personalized features adaptively. On the server side, to handle model heterogeneity, we propose a similarity-guided ensemble distillation mechanism that aggregates client representations on shared public data based on feature similarity and distills the fused knowledge into the global model. Extensive experiments conducted under both IID and non-IID settings demonstrate that FedAFD achieves superior performance and efficiency for both the client and the server.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36184", "url": null, "sourceid": 31746, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36908, "uid": "afb7f6f9b716939fedbce5b91d21c905", "name": "Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR", "authors": [{"id": 186184, "fullname": "Yufeng Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186184?format=json", "institution": "Meituan"}, {"id": 186185, "fullname": "Lei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186185?format=json", "institution": "Meituan"}, {"id": 186186, "fullname": "Zhixiong Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186186?format=json", "institution": "Meituan"}, {"id": 186187, "fullname": "Xuanle Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186187?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 186188, "fullname": "Deyang Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186188?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 186189, "fullname": "Liming Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186189?format=json", "institution": "Meituan"}, {"id": 186190, "fullname": "Jing Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186190?format=json", "institution": "Meituan"}, {"id": 166719, "fullname": "Haibo Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/166719?format=json", "institution": "Meituan"}, {"id": 186191, "fullname": "Peng Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186191?format=json", "institution": "Meituan"}, {"id": 186192, "fullname": "Siqi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186192?format=json", "institution": null}, {"id": 88103, "fullname": "Lin Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/88103?format=json", "institution": "Meituan"}], "abstract": "Reading text from images or scanned documents via OCR models has been a longstanding focus of researchers. Intuitively, text reading is perceived as a straightforward perceptual task, and existing work primarily focuses on constructing enriched data engineering to enhance SFT capabilities. In this work, we observe that even advanced OCR models exhibit significantly higher entropy in formatted text (\\emph{e.g.}, formula, table, etc.) compared to plain text, often by an order of magnitude. These statistical patterns reveal that advanced OCR models struggle with high output uncertainty when dealing with format sensitive document, suggesting that reasoning over diverse reading pathways may improve OCR performance.To address this, we propose format decoupled reinforcement learning (FD-RL), which leverages high-entropy patterns for targeted optimization. Our approach employs entropy-based data filtration strategy to identify format-intensive instances, and adopt format decoupled rewards tailored to different format types, enabling format-level validation rather than token-level memorization. FD-RL achieves an average score of 90.41 on OmniDocBench, setting a new record for end-to-end models on this highly popular benchmark.More importantly, we conduct comprehensive ablation studies over data, training, filtering, and rewarding strategies, thoroughly validating their effectiveness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36908", "url": null, "sourceid": 39933, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39416, "uid": "1b80bd6703c274cdb50d8d1fd2a020ab", "name": "DPAR: Dynamic Patchification for Efficient Autoregressive Visual Generation", "authors": [{"id": 171724, "fullname": "Divyansh Srivastava", "url": "http://cvpr.thecvf.com/api/miniconf/users/171724?format=json", "institution": "UCSD"}, {"id": 192035, "fullname": "Akshay Mehra", "url": "http://cvpr.thecvf.com/api/miniconf/users/192035?format=json", "institution": "Dolby Labs"}, {"id": 186314, "fullname": "Pranav Maneriker", "url": "http://cvpr.thecvf.com/api/miniconf/users/186314?format=json", "institution": "Dolby Laboratories"}, {"id": 192036, "fullname": "Debopam Sanyal", "url": "http://cvpr.thecvf.com/api/miniconf/users/192036?format=json", "institution": "Georgia Institute of Technology"}, {"id": 192037, "fullname": "Vishnu Raj", "url": "http://cvpr.thecvf.com/api/miniconf/users/192037?format=json", "institution": "Dolby laboratories"}, {"id": 192038, "fullname": "Vijay Kamarshi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192038?format=json", "institution": "Dolby"}, {"id": 186315, "fullname": "Fan Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/186315?format=json", "institution": "Dolby"}, {"id": 192039, "fullname": "Joshua Kimball", "url": "http://cvpr.thecvf.com/api/miniconf/users/192039?format=json", "institution": "Dolby"}], "abstract": "Decoder-only autoregressive image generation typically relies on fixed-length tokenization schemes whose token counts grow quadratically with resolution, substantially increasing the computational and memory demands of attention. We present DPAR, a novel decoder-only autoregressive model that dynamically aggregates image tokens into a variable number of patches for efficient image generation. Our work is the first to demonstrate that next-token prediction entropy from a lightweight and unsupervised autoregressive model provides a reliable criterion for merging tokens into larger patches based on information content. DPAR makes minimal modifications to the standard decoder architecture, ensuring compatibility with multimodal generation frameworks and allocating more compute to generation of high-information image regions. Further, we demonstrate that training with dynamically sized patches yields representations that are robust to patch boundaries, allowing DPAR to scale to larger patch sizes at inference. DPAR reduces token count by 1.81x and 2.06x on Imagenet 256 and 384 generation resolution respectively, leading to a reduction of up to 40\\% FLOPs in training costs. Further, our method exhibits faster convergence and improves FID by up to 27.1\\% relative to baseline models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39416", "url": null, "sourceid": 30848, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38912, "uid": "16f86429ff6ff5a9ea9b2bd590744243", "name": "IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding", "authors": [{"id": 167442, "fullname": "Junxian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/167442?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 188804, "fullname": "Beining Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188804?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 84721, "fullname": "Simin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/84721?format=json", "institution": "University of Texas at Dallas "}, {"id": 154209, "fullname": "Jiatong LI", "url": "http://cvpr.thecvf.com/api/miniconf/users/154209?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 154204, "fullname": "Jingdi Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/154204?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 190951, "fullname": "Haodong Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190951?format=json", "institution": "Shanghai Jiaotong University; National University of Singapore"}, {"id": 154203, "fullname": "Di Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154203?format=json", "institution": "Fudan University"}], "abstract": "Recent advances in vision-language models (VLMs) have significantly enhanced the visual grounding task, which involves locating objects in an image based on natural language queries. Despite these advancements, the security of VLM-based grounding systems has not been thoroughly investigated. This paper reveals a novel and realistic vulnerability: the **first** multi-target backdoor attack on VLM-based visual grounding. Unlike prior attacks that rely on static triggers or fixed targets, we propose IAG, a method that dynamically generates input-aware, text-guided triggers conditioned on any specified target object description to execute the attack. This is achieved through a text-conditioned UNet that embeds imperceptible target semantic cues into visual inputs while preserving normal grounding performance on benign samples. We further develop a joint training objective that balances language capability with perceptual reconstruction to ensure imperceptibility, effectiveness, and stealth. Extensive experiments on multiple VLMs (e.g., LLaVA, InternVL, Ferret) and benchmarks (RefCOCO, RefCOCO+, RefCOCOg, Flickr30k Entities, and ShowUI) demonstrate that IAG achieves the **best** ASRs compared with other baselines on almost all settings without compromising clean accuracy, maintaining robustness against existing defenses, and exhibiting transferability across datasets and models. These findings underscore critical security risks in grounding-capable VLMs and highlight the need for further research on trustworthy multimodal understanding. Code is in the supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38912", "url": "https://github.com/lijunxian111/IAG", "sourceid": 31461, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65719, "file": "/media/PosterPDFs/CVPR%202026/38912.png", "modified": "2026-04-22T00:49:50.148768-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": true, "sortkey": 0, "is_live_content": false, "detailed_kind": "", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65720, "file": "/media/PosterPDFs/CVPR%202026/38912-thumb.png", "modified": "2026-04-22T00:49:50.331118-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": false, "sortkey": 0, "is_live_content": false, "detailed_kind": "thumb", "generated_from": null, "resourcetype": "EventmediaImageFile"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37296, "uid": "7c04ffdf63fedddcb42474caf8c06540", "name": "CADFS: A Big CAD Program Dataset and Framework for Computer-Aided Design with Large Language Models", "authors": [{"id": 187105, "fullname": "Vladislav Pyatov", "url": "http://cvpr.thecvf.com/api/miniconf/users/187105?format=json", "institution": "Applied AI Institute"}, {"id": 187106, "fullname": "Gleb Bobrovskikh", "url": "http://cvpr.thecvf.com/api/miniconf/users/187106?format=json", "institution": "Applied AI Institute"}, {"id": 187107, "fullname": "Saveliy Galochkin", "url": "http://cvpr.thecvf.com/api/miniconf/users/187107?format=json", "institution": "Applied AI Institute"}, {"id": 187108, "fullname": "Nikita Boldyrev", "url": "http://cvpr.thecvf.com/api/miniconf/users/187108?format=json", "institution": "Higher School of Economics"}, {"id": 86933, "fullname": "Oleg Voynov", "url": "http://cvpr.thecvf.com/api/miniconf/users/86933?format=json", "institution": "Applied AI Institute"}, {"id": 187109, "fullname": "Alexander Filippov", "url": "http://cvpr.thecvf.com/api/miniconf/users/187109?format=json", "institution": "Huawei Noah&#x27;s Ark Lab"}, {"id": 187110, "fullname": "Gonzalo Ferrer", "url": "http://cvpr.thecvf.com/api/miniconf/users/187110?format=json", "institution": "Applied AI Institute"}, {"id": 75925, "fullname": "Peter Wonka", "url": "http://cvpr.thecvf.com/api/miniconf/users/75925?format=json", "institution": "KAUST"}, {"id": 187041, "fullname": "Evgeny Burnaev", "url": "http://cvpr.thecvf.com/api/miniconf/users/187041?format=json", "institution": "Applied AI Institute"}], "abstract": "We introduce CADFS, a data-centric framework that enables large vision-language models to generate complex CAD design histories. Existing generative CAD systems are restricted to sketch-and-extrude operations due to simplified representations and limited datasets. We address this by introducing a FeatureScript-based representation and constructing a dataset of 450k real-world CAD models spanning 15 modeling operations, obtained via a new pipeline that reconstructs clean, executable FeatureScript programs and provides multimodal annotations. Fine-tuning a VLM on this representation yields state-of-the-art results in text-conditioned CAD generation and image-based reconstruction, producing more accurate, diverse, and feature-rich designs than prior frameworks. Ablations show that FeatureScript, the expanded operation set, and representation-aligned textual descriptions all significantly improve performance. Our framework substantially broadens the complexity and realism achievable in generative CAD.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37296", "url": null, "sourceid": 40889, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36555, "uid": "47928638e0167f68b16389775b44aebd", "name": "Complet4R: Geometric Complete 4D Reconstruction", "authors": [{"id": 180723, "fullname": "Weibang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180723?format=json", "institution": "Tsinghua University"}, {"id": 176567, "fullname": "Kenan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/176567?format=json", "institution": "Tsinghua University"}, {"id": 176566, "fullname": "Zhuoguang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/176566?format=json", "institution": "Tsinghua University"}, {"id": 185330, "fullname": "Yijun Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185330?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 75520, "fullname": "Hang Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/75520?format=json", "institution": "Tsinghua University"}], "abstract": "We introduce Complet4R, a novel end-to-end framework for Geometric Complete 4D Reconstruction, which aims to recover temporally coherent and geometrically complete reconstruction for dynamic scenes. Our method formalizes the task of Geometric Complete 4D Reconstruction as a unified framework of reconstruction and completion, by directly accumulating full contexts onto each frame. Unlike previous approaches that rely on pairwise reconstruction or local motion estimation, Complet4R utilizes a decoder-only transformer to operate all context globally directly from sequential video input, reconstructing a complete geometry for every single time step, including occluded regions visible in other frames. Our method demonstrates the state-of-the-art performance on our proposed benchmark for Geometric Complete 4D Reconstruction and the 3D point tracking task. Code will be released to support future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36555", "url": null, "sourceid": 35208, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39300, "uid": "8287ab1732631d0908d6ec134bd2592b", "name": "EMAD: Evidence-Centric Grounded Multimodal Diagnosis for Alzheimer\u2019s Disease", "authors": [{"id": 191796, "fullname": "Qiuhui Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191796?format=json", "institution": "East China University of Science and Technology"}, {"id": 174994, "fullname": "Xuancheng Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/174994?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 191797, "fullname": "Zhenglei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191797?format=json", "institution": "Tencent"}, {"id": 191798, "fullname": "Xinyue Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191798?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 191799, "fullname": "Yi Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/191799?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Deep learning models for medical image analysis often act as \u201cblack boxes,\u201d seldom aligning with clinical guidelines or explicitly linking decisions to supporting evidence. This is especially critical in Alzheimer's disease (AD), where predictions should be grounded in both anatomical and clinical findings. We present EMAD, a vision\u2013language framework that generates structured AD diagnostic reports with each claim explicitly grounded in multimodal evidence. EMAD uses a hierarchical Sentence\u2013Evidence\u2013Anatomy (SEA) Grounding mechanism: (i) sentence-to-evidence grounding links generated sentences to clinical evidence phrases, and (ii) evidence-to-anatomy grounding localizes corresponding structures on 3D brain MRI. To reduce dense annotation requirements, we propose GTX-Distill, which transfers grounding behavior from a teacher trained with limited supervision to a student operating on model-generated reports. We further introduce Executable-Rule GRPO, a reinforcement fine-tuning scheme with verifiable rewards that enforces clinical consistency, protocol adherence, and reasoning\u2013diagnosis coherence. On the AD-MultiSense dataset, EMAD achieves state-of-the-art diagnostic accuracy and produces more transparent, anatomically faithful reports than existing methods; we will release code and grounding annotations to support future research in trustworthy medical vision\u2013language models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39300", "url": null, "sourceid": 34108, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39885, "uid": "688026f6edb29cc7c96b287b186b03c9", "name": "MagicFuse: Single Image Fusion for Visual and Semantic Reinforcement", "authors": [{"id": 129156, "fullname": "HAO ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/129156?format=json", "institution": "Wuhan University"}, {"id": 193050, "fullname": "Yanping Zha", "url": "http://cvpr.thecvf.com/api/miniconf/users/193050?format=json", "institution": "Wuhan University Electronic Information School"}, {"id": 129146, "fullname": "Zizhuo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129146?format=json", "institution": "Wuhan University"}, {"id": 191495, "fullname": "Meiqi Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/191495?format=json", "institution": "Wuhan University"}, {"id": 86222, "fullname": "Jiayi Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/86222?format=json", "institution": "Wuhan University"}], "abstract": "This paper focuses on a highly practical scenario: how to continue benefiting from the advantages of multi-modal image fusion under harsh conditions when only visible imaging sensors are available. To achieve this goal, we propose a novel concept of single-image fusion, which extends conventional data-level fusion to the knowledge level. Specifically, we develop MagicFuse, a novel single image fusion framework capable of deriving a comprehensive cross-spectral scene representation from a single low-quality visible image. MagicFuse first introduces an intra-spectral knowledge reinforcement branch and a cross-spectral knowledge generation branch based on the diffusion models. They mine scene information obscured in the visible spectrum and learn thermal radiation distribution patterns transferred to the infrared spectrum, respectively. Building on them, we design a multi-domain knowledge fusion branch that integrates the probabilistic noise from the diffusion streams of these two branches, from which a cross-spectral scene representation can be obtained through successive sampling. Then, we impose both visual and semantic constraints to ensure that this scene representation can satisfy human observation while supporting downstream semantic decision-making. Extensive experiments show that our MagicFuse achieves visual and semantic representation performance comparable to or even better than state-of-the-art fusion methods with multi-modal inputs, despite relying solely on a single degraded visible image.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39885", "url": null, "sourceid": 41814, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37427, "uid": "b89c980ec9a984a6eeb5c78108a497bd", "name": "Local Precise Refinement: A Dual-Gated Mixture-of-Experts for Enhancing Foundation Model Generalization against Spectral Shifts", "authors": [{"id": 187429, "fullname": "Xi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187429?format=json", "institution": "National University of Defense Technology"}, {"id": 86480, "fullname": "Maojun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86480?format=json", "institution": "National University of Defense Technology"}, {"id": 86507, "fullname": "Yu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86507?format=json", "institution": "National University of Defense Technology"}, {"id": 86493, "fullname": "Shen Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/86493?format=json", "institution": "National University of Defense Technology"}], "abstract": "Domain Generalization Semantic Segmentation (DGSS) in spectral remote sensing is severely challenged by spectral shifts across diverse acquisition conditions, which cause significant performance degradation for models deployed in unseen domains. While Parameter-Efficient Fine-Tuning (PEFT) on foundation models is a promising direction, existing methods employ global, homogeneous adjustments. This \"one-size-fits-all\" tuning struggles with the spatial heterogeneity of land cover, causing semantic confusion. We argue that the key to robust DGSS lies not in a single global adaptation, but in performing fine-grained, spatially-adaptive refinement of a foundation model's features. To achieve this, we propose SpectralMoE, a novel PEFT framework for DGSS. It operationalizes this principle by utilizing a Mixture-of-Experts (MoE) architecture to perform \\textbf{local precise refinement} on the foundation model's features, incorporating depth features estimated from selected RGB bands of the spectral remote sensing imagery to guide the fine-tuning process. Specifically, SpectralMoE employs a dual-gated MoE architecture that independently routes visual and depth features to top-k selected experts for specialized refinement, enabling modality-specific adjustments. A subsequent cross-attention mechanism then judiciously fuses the refined structural cues into the visual stream, mitigating semantic ambiguities caused by spectral variations. Extensive experiments show that SpectralMoE sets a new state-of-the-art on multiple DGSS benchmarks across hyperspectral, multispectral, and RGB remote sensing imagery.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37427", "url": null, "sourceid": 38002, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38903, "uid": "1bf502b835ee007957e558cbb1959ecb", "name": "D$^2$-FOSA: Dual-Diffusion Guided EEG-to-Image Reconstruction with Frequency-Oriented Semantic Alignment", "authors": [{"id": 180279, "fullname": "Yu Chenglong", "url": "http://cvpr.thecvf.com/api/miniconf/users/180279?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 190941, "fullname": "Shuai Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190941?format=json", "institution": "Nanyang Technological University"}, {"id": 190942, "fullname": "Xiangsheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190942?format=json", "institution": "Institute of Medical Information and Library, Chinese Academy of Medical Sciences and Peking Union Medical College"}, {"id": 190943, "fullname": "Yang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190943?format=json", "institution": "Beihang University"}], "abstract": "Reconstructing visual semantics from Electroencephalography (EEG) signals enables a deeper understanding of human visual cognition and supports next-generation brain\u2013computer interface (BCI) applications.Despite notable advances in recent years, most existing EEG encoders still struggle to capture the frequency-specific neural dynamics that reflect perceptual and cognitive rhythms. Moreover, the cross-modal alignment between EEG and visual content remains insufficiently tackled, leading to limited semantic consistency and visual fidelity. To address these issues, we propose D$^2$-FOSA, a unified dual-diffusion guided framework with frequency-oriented semantic alignment, which strengthens the frequency-aware EEG representation for more semantically aligned image reconstruction.Specifically, we design a Frequency-Spatio-Temporal Dynamics Encoder (FSTDE) based on the Frequency-Oriented Mamba (FOMamba) to explicitly model oscillatory patterns and long-range dependencies in EEG signals. The extracted features are then pulled into the CLIP-aligned visual semantic space via contrastive learning.Meanwhile, a Dual Diffusion Latent Generator (DDLG) with bidirectional EEG\u2013image conditioning is designed to enforce cross-modal alignment and promote cycle-consistent generation.Extensive experiments on four challenging datasets demonstrate that our proposed D$^2$-FOSA significantly outperforms existing methods in both retrieval and reconstruction tasks. Particularly, our D$^2$-FOSA surpasses the contemporary MB2C method by over 20 FID in the reconstruction task on THINGS-EEG, indicating a substantial improvement in perceptual fidelity. The source code is in the supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38903", "url": null, "sourceid": 38946, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36589, "uid": "b53e7dc2bf929b9f76ab8117eb16cdb2", "name": "LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment", "authors": [{"id": 185415, "fullname": "Shuaibang Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185415?format=json", "institution": "National University of Defense Technology"}, {"id": 185416, "fullname": "Juelin Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185416?format=json", "institution": "National University of Defense Technology"}, {"id": 185417, "fullname": "Xia Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185417?format=json", "institution": "National University of Defense Technology"}, {"id": 144716, "fullname": "Kun Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/144716?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 86507, "fullname": "Yu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86507?format=json", "institution": "National University of Defense Technology"}, {"id": 86480, "fullname": "Maojun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86480?format=json", "institution": "National University of Defense Technology"}, {"id": 86493, "fullname": "Shen Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/86493?format=json", "institution": "National University of Defense Technology"}], "abstract": "We present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban environments. While prior work LoD-Loc v2 [89] achieves localization through semantic building silhouette alignment with low-detail city models, it suffers from two key limitations: poor cross-scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces $\\textbf{InsLoD-Loc}$ - the largest instance segmentation dataset for aerial imagery to date, comprising 100k images with precise instance-level building annotations. This enables trained models to exhibit remarkable zero-shot generalization capability. Second, we reformulate the localization paradigm by shifting from semantic to instance-level silhouette alignment, which significantly reduces pose estimation ambiguity in dense scenes. Extensive experiments demonstrate that LoD-Loc v3 outperforms existing state-of-the-art (SOTA) baselines, achieving superior performance in both cross-scene and dense urban scenarios with a large margin.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36589", "url": null, "sourceid": 34737, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37000, "uid": "5a7acc9324aeef65925024a66800c015", "name": "PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization", "authors": [{"id": 186430, "fullname": "Xiaoya Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186430?format=json", "institution": "National University of Defense Technology"}, {"id": 86479, "fullname": "Long Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86479?format=json", "institution": "Sensetime"}, {"id": 186431, "fullname": "Yan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186431?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 186432, "fullname": "Xinyi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186432?format=json", "institution": "National University of Defense Technology"}, {"id": 186433, "fullname": "Hanlin Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186433?format=json", "institution": "National University of Defense Technology"}, {"id": 86507, "fullname": "Yu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86507?format=json", "institution": "National University of Defense Technology"}, {"id": 86480, "fullname": "Maojun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86480?format=json", "institution": "National University of Defense Technology"}, {"id": 86493, "fullname": "Shen Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/86493?format=json", "institution": "National University of Defense Technology"}], "abstract": "We present PiLoT, a unified framework that tackles UAV-based ego and target geo-localization. Conventional approaches rely on decoupled pipelines that fuse GNSS and Visual-Inertial Odometry (VIO) for ego-pose estimation, and active sensors like laser rangefinders for target localization. However, these methods are susceptible to failure in GNSS-denied environments and incur substantial hardware costs and complexity.PiLoT breaks this paradigm by directly registering live video stream against a geo-referenced 3D map. To achieve robust, accurate, and real-time performance, we introduce three key contributions: 1) a Dual-Thread Engine that decouples map rendering from core localization thread, ensuring both low latency while maintaining drift-free accuracy; 2) a large-scale synthetic dataset with precise geometric annotations (camera pose, depth maps). This dataset enables the training of a lightweight network that generalizes in a zero-shot manner from simulation to real data; and 3) a Joint Neural-Guided Stochastic-Gradient Optimizer (JNGO) that achieves robust convergence even under aggressive motion.Evaluations on a comprehensive set of public and newly collected benchmarks show that PiLoT outperforms state-of-the-art methods while running over 25 FPS on NVIDIA Jetson Orin platform. Our code and dataset will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37000", "url": null, "sourceid": 30718, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38471, "uid": "947d64ec18585d216e88d3eef72267d2", "name": "Envision, Attend, Then Respond: Counterfactual Hallucination Mitigation in Large Vision-Language Models", "authors": [{"id": 180553, "fullname": "Yuxuan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180553?format=json", "institution": "fudan"}, {"id": 181360, "fullname": "Fan Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/181360?format=json", "institution": "Fudan University"}, {"id": 189923, "fullname": "Rui Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189923?format=json", "institution": "Fudan University"}, {"id": 189924, "fullname": "Xu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189924?format=json", "institution": "Fudan University"}, {"id": 189925, "fullname": "Xiaolei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189925?format=json", "institution": "Fudan University"}, {"id": 189926, "fullname": "Zhe Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189926?format=json", "institution": "Fudan University"}, {"id": 189927, "fullname": "Bin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189927?format=json", "institution": "Fudan University"}, {"id": 89233, "fullname": "Xiangyang Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/89233?format=json", "institution": "Fudan University"}], "abstract": "Large Vision-Language Models (LVLMs) often hallucinate when visual evidence conflicts with world knowledge, i.e., in counterfactual scenarios. We propose **Envision-Attend-Respond (EnAR)**, a training-free framework that leverages visual priors to steer the model's attention toward counterfactual elements in the image. The **Envision** stage constructs a visual impression by invoking a diffusion prior to perform latent perturbations, yielding a prior-consistent counterpart of the input image. The **Attend** stage processes the original image and its visual impression through the LVLM's vision encoder to localize counterfactual elements, forming a corresponding padded input. The **Respond** stage performs contrastive decoding between the original and padded inputs to suppress bias and enhance visual understanding. Empirically, EnAR consistently mitigates hallucinations and improves response fidelity, achieving a 10.82\\% gain on VLMBias and an average 6.9\\% improvement on POPE, demonstrating robustness across both counterfactual and general hallucination settings. Moreover, the framework remains effective across heterogeneous LVLM architectures, offering a new perspective for hallucination governance in multimodal reasoning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38471", "url": null, "sourceid": 41700, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39982, "uid": "b7574793f15c6db2537254b5732da315", "name": "Improving Diffusion Generalization with Weak-to-Strong Segmented Guidance", "authors": [{"id": 185742, "fullname": "Liangyu Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185742?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 146625, "fullname": "Yufei Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/146625?format=json", "institution": "Zhejiang University"}, {"id": 147417, "fullname": "Mingkun Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/147417?format=json", "institution": "Westlake University"}, {"id": 185462, "fullname": "Tong Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185462?format=json", "institution": "Zhejiang University &amp; Westlake University"}, {"id": 185743, "fullname": "Ruoyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185743?format=json", "institution": "Westlake University"}, {"id": 158267, "fullname": "Chi Changxi", "url": "http://cvpr.thecvf.com/api/miniconf/users/158267?format=json", "institution": "Zhejiang University"}, {"id": 182615, "fullname": "Yiwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182615?format=json", "institution": "University of California, Merced"}, {"id": 153810, "fullname": "Chi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153810?format=json", "institution": "Westlake University"}], "abstract": "Diffusion models generate synthetic images through an iterative refinement process. However, the misalignment between the simulation-free objective and the iterative process often causes accumulated gradient error along the sampling trajectory, which leads to unsatisfactory results and a failure to generalize. Guidance techniques like Classifier Free Guidance (CFG) and AutoGuidance (AG) alleviate this by extrapolating between the main and inferior signal for stronger generalization. Despite empirical success, the effective operational regimes of prevalent guidance methods are still under-explored, leading to ambiguity when selecting the appropriate guidance method given a precondition. In this work, we first conduct synthetic comparisons to isolate and demonstrate the effective regime of guidance methods represented by CFG and AG from the perspective of weak-to-strong principle. Based on this, we propose a hybrid instantiation called SEG under the principle, taking the benefits of both. Furthermore, we demonstrate that the W2S principle along with SEG can be migrated into the training objective, improving the generalization ability of unguided diffusion models. We validate our approach with comprehensive experiments. At inference time, evaluations on SD3 and SD3.5 confirm that SEG outperforms existing training-free guidance variants. Training-time experiments on transformer architectures demonstrate the effective migration and performance gains in both conditional and unconditional settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39982", "url": null, "sourceid": 34888, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36821, "uid": "8513d71f50e57e470cd1b333f514cc3d", "name": "Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration", "authors": [{"id": 185952, "fullname": "Mengyu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185952?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 185461, "fullname": "Yanming Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185461?format=json", "institution": "Westlake University"}, {"id": 185953, "fullname": "Chenyi Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185953?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 182531, "fullname": "Chenxi Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/182531?format=json", "institution": "Westlake University"}, {"id": 185954, "fullname": "Yufan Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185954?format=json", "institution": "Westlake University; Renmin University of China"}, {"id": 185462, "fullname": "Tong Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185462?format=json", "institution": "Zhejiang University &amp; Westlake University"}, {"id": 86162, "fullname": "Ruibo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86162?format=json", "institution": "Nanyang Technological University"}, {"id": 153810, "fullname": "Chi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153810?format=json", "institution": "Westlake University"}], "abstract": "Diffusion models have achieved impressive generative quality across modalities like 2D images, videos, and 3D shapes, but their inference remains computationally expensive due to the iterative denoising process. While recent caching-based methods effectively reuse redundant computations to speed up 2D and video generation, directly applying these techniques to 3D diffusion models can severely disrupt geometric consistency. In 3D synthesis, even minor numerical errors in cached latent features accumulate, causing structural artifacts and topological inconsistencies. To overcome this limitation, we propose Fast3Dcache, a training-free geometry-aware caching framework that accelerates 3D diffusion inference while preserving geometric fidelity. Our method introduces a Predictive Caching Scheduler Constraint (PCSC) to dynamically determine cache quotas according to voxel stabilization patterns and a Spatiotemporal Stability Criterion (SSC) to select stable features for reuse based on velocity magnitude and acceleration criterion. Comprehensive experiments show that Fast3Dcache accelerates inference significantly, achieving up to a 27.12\\% speed-up and a 54.8\\% reduction in FLOPs, with minimal degradation in geometric quality as measured by Chamfer Distance (2.48\\%) and F-Score (1.95\\%).", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36821", "url": null, "sourceid": 46761, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37650, "uid": "8d65294979cf7c59fa43f91f993fb5c2", "name": "Intra-class Distribution-guided Generative Hashing with Neighbor Refinement for Cross-modal Retrieval", "authors": [{"id": 180863, "fullname": "Hao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/180863?format=json", "institution": "Weifang University"}, {"id": 187946, "fullname": "Yadong Huo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187946?format=json", "institution": "Ocean University of China"}, {"id": 187947, "fullname": "Qibing Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/187947?format=json", "institution": "Weifang University"}, {"id": 187948, "fullname": "Wenfeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187948?format=json", "institution": null}, {"id": 187949, "fullname": "Lei Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187949?format=json", "institution": "Ocean University of China"}], "abstract": "Recent cross-modal hashing methods have introduced sample generation strategies to enrich training signals. Despite these advances, sample generation-driven hashing still faces two major challenges: (1) Interpolation-based methods adopt deterministic and class-independent generation that restricts synthetic samples to a small region around the original data. Consequently, intra-class diversity is limited, which weakens the model\u2019s ability to learn discriminative binary codes. (2) Generation network-based methods, which leverage a complex generative model to produce synthetic samples, leading to extra model complexity. To address these issues, we propose a novel Intra-class Distribution-guided Generative Hashing (IDGH) that adaptively generates synthetic samples directly from estimated intra-class distributions. Specifically, we suggest an Intra-class Distribution Estimation (IDE) scheme to model the characteristic distribution of each class, providing essential support for adaptive sample generation. Meanwhile, by utilizing the distribution information from neighboring classes, we design a Neighbor-guided Distribution Refinement (NDR) mechanism to correct flawed estimations for classes. With refined intra-class distributions, we propose a Distribution-aware Adaptive Generation (DAG) strategy that synthesizes informative training samples by shifting features along diverse directions guided by intra-class distribution patterns. The proposed approach is plug-and-play and can be seamlessly integrated into various objective functions, providing semantically diverse training samples, thus enhancing similarity learning. Extensive experiments on benchmark datasets demonstrate that IDGH outperforms existing methods.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37650", "url": null, "sourceid": 42737, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37734, "uid": "6ac221eca41609c3aa7ea3aceab71cd8", "name": "Multimodal Causality-Driven Representation Learning for Generalizable Medical Image Segmentation", "authors": [{"id": 182070, "fullname": "XUSHENG LIANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/182070?format=json", "institution": "City University of Hong Kong"}, {"id": 143939, "fullname": "Lihua Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/143939?format=json", "institution": "Centre for Artificial Intelligence and Robotics Hong Kong Institute of Science &amp; Innovation, Chinese Academy of Sciences"}, {"id": 155494, "fullname": "Nianxin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155494?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 182210, "fullname": "miao xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182210?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 188121, "fullname": "Ziyang Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/188121?format=json", "institution": null}, {"id": 188122, "fullname": "Dong Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188122?format=json", "institution": "Centre for Artificial Intelligence and Robotics Hong Kong Institute of Science &amp; Innovation, Chinese Academy of Sciences"}, {"id": 184329, "fullname": "Jinlin Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184329?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 76503, "fullname": "Jiawei Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/76503?format=json", "institution": "City University of Hong Kong"}, {"id": 188123, "fullname": "Hongbin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188123?format=json", "institution": "Centre for Artificial Intelligence and Robotics Hong Kong Institute of Science &amp; Innovation, Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences"}, {"id": 89292, "fullname": "Zhen Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/89292?format=json", "institution": "Institute of Automation,  Chinese Academy of Sciences"}, {"id": 85765, "fullname": "Jiebo Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/85765?format=json", "institution": "University of Rochester"}], "abstract": "Vision-Language Models (VLMs), such as CLIP, have demonstrated remarkable zero-shot capabilities in various computer vision tasks. However, their application to medical imaging remains challenging due to the high variability and complexity of medical data. Specifically, medical images often exhibit significant domain shifts caused by various confounders, including equipment differences, procedure artifacts, and imaging modes, which can lead to poor generalization when models are applied to unseen domains. To address this limitation, we propose Multimodal Causal-Driven Representation Learning (MCDRL), a novel framework that integrates causal inference with the VLM to tackle domain generalization in medical image segmentation. MCDRL is implemented in two steps: first, it leverages CLIP's cross-modal capabilities to identify candidate lesion regions and construct a confounder dictionary through text prompts, specifically designed to represent domain-specific variations; second, it trains a causal intervention network that utilizes this dictionary to identify and eliminate the influence of these domain-specific variations while preserving the anatomical structural information critical for segmentation tasks. Extensive experiments demonstrate that MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37734", "url": null, "sourceid": 42059, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36891, "uid": "57a4fde3e6bc9c60fa095cc3e82e00eb", "name": "Joint Spectral Image Reconstruction and Semantic Segmentation with Cooperative Unfolding", "authors": [{"id": 186126, "fullname": "Zijun He", "url": "http://cvpr.thecvf.com/api/miniconf/users/186126?format=json", "institution": "Westlake University"}, {"id": 127898, "fullname": "Ping Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127898?format=json", "institution": "Westlake University"}, {"id": 104061, "fullname": "Xiaodong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/104061?format=json", "institution": "Zhejiang University"}, {"id": 186127, "fullname": "ChangChen ChangChen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186127?format=json", "institution": "Westlake University"}, {"id": 88684, "fullname": "Xin Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88684?format=json", "institution": "Westlake University"}], "abstract": "Coded Aperture Snapshot Spectral Imaging (CASSI) is an emerging hyperspectral image (HSI) acquisition technique for downstream semantic segmentation. Due to the ill-posedness nature of CASSI systems, typical solutions are compelled to conduct a two-stage reconstruction-then-segmentation pipeline, namely viewing them as two separate tasks. However, we observe that such two tasks are interrelated and mutually reinforcing for representation learning, and thus separating them limits the overall accuracy and efficiency. To this end, we propose the first \\textbf{C}ooperative \\textbf{R}econstruction-\\textbf{S}egmentation \\textbf{D}eep \\textbf{U}nfolding \\textbf{N}etwork (\\textbf{CRSDUN}) to solve the reconstruction and segmentation tasks in parallel. To make the two mutually reinforcing, we introduce the Cross-Aggregated Super-Token Attention (CASTA) mechanism to enhance the representation interactions between HSI reconstruction and semantic segmentation. Extensive experiments on both synthetic and real-world HSI reconstruction-segmentation datasets demonstrate that our method achieves state-of-the-art in both spectral reconstruction and semantic segmentation. The code and models will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36891", "url": null, "sourceid": 44389, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37022, "uid": "2f41e7aed7a2bbd8650634585c71f49a", "name": "Attack for Defense: Adversarial Agents for Point Prompt Optimization Empowering Segment Anything Model", "authors": [{"id": 180843, "fullname": "Xueyu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180843?format=json", "institution": "Taiyuan University of Technology"}, {"id": 186501, "fullname": "Xiaoyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186501?format=json", "institution": "Taiyuan University of Technology"}, {"id": 186502, "fullname": "Meilin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186502?format=json", "institution": "Taiyuan University of Technology"}, {"id": 180850, "fullname": "Guangze Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/180850?format=json", "institution": "Taiyuan University of Technology"}, {"id": 159493, "fullname": "Jia Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/159493?format=json", "institution": "Taiyuan University of Technology"}, {"id": 186503, "fullname": "Yujie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186503?format=json", "institution": "Taiyuan University of Technology"}, {"id": 186504, "fullname": "Cai Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186504?format=json", "institution": "Taiyuan University of Technology"}, {"id": 186505, "fullname": "Ziyuan He", "url": "http://cvpr.thecvf.com/api/miniconf/users/186505?format=json", "institution": null}, {"id": 155887, "fullname": "Yongfei Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155887?format=json", "institution": "Taiyuan University of Technology"}, {"id": 85378, "fullname": "Mingqiang Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/85378?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 186506, "fullname": "Yongle Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186506?format=json", "institution": "Taiyuan University of Technology"}], "abstract": "Prompt quality plays a critical role in the performance of the Segment Anything Model (SAM), yet existing approaches often rely on heuristic or manually crafted prompts, limiting scalability and generalization. In this paper, we propose Point Prompt Defender, an adversarial reinforcement learning framework that adopts an attack-for-defense paradigm to automatically optimize point prompts. We construct a task-agnostic point prompt environment by representing image patches as nodes in a dual-space graph, where edges encode both physical and semantic distances. Within this environment, an attacker agent learns to activate a subset of prompts that maximally degrade SAM\u2019s segmentation performance, while a defender agent learns to suppress these disruptive prompts and restore accuracy. Both agents are trained using Deep Q-Networks with a reward signal based on segmentation quality variation. During inference, only the defender is deployed to refine arbitrary coarse prompt sets, enabling enhanced SAM segmentation performance across diverse tasks without retraining. Extensive experiments show that Point Prompt Defender effectively improves SAM\u2019s robustness and generalization, establishing a flexible, interpretable, and plug-and-play framework for prompt-based segmentation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37022", "url": null, "sourceid": 41193, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36850, "uid": "8cdd10cbcfa2f43dc4c88c0f75ad5967", "name": "C-LaV: Conditional Latent Velocity Field Denoising for Weather-Robust LiDAR Place Recognition", "authors": [{"id": 180915, "fullname": "Xuewei Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180915?format=json", "institution": "University of Science and Technology of China"}, {"id": 172803, "fullname": "Jiayue Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/172803?format=json", "institution": "University of Science and Technology of China"}, {"id": 186020, "fullname": "Zhiwen Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186020?format=json", "institution": "Chongqing University"}, {"id": 90108, "fullname": "Yanyong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90108?format=json", "institution": "Rutgers University, Newark"}, {"id": 173197, "fullname": "Yan Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/173197?format=json", "institution": "Technical University of Munich"}], "abstract": "LiDAR-based place recognition is highly sensitive to rain, snow, and fog, where scattering and attenuation distort geometric structure and intensity. We tackle this problem with Conditional Latent Velocity Field (C-LaV) denoising, which restores weather-robust representations before retrieval. Single-sweep point clouds are projected into three-channel bird\u2019s-eye-view (BEV) images and encoded with a frozen DINOv2-based BEV transformer to obtain a semantically anchored latent space shared across weather conditions. On this manifold, a conditional Flow Matching model learns a velocity field whose probability-flow ordinary differential equation (ODE) deterministically transports noisy latents toward their clear-weather counterparts. From the denoised manifold, a Sinkhorn Aggregation of Local Descriptors (SALAD) head produces compact global descriptors optimized with a truncated Smooth-AP loss. We also establish a unified adverse-weather benchmark with 3 m frame spacing and shared evaluation thresholds across KITTI, NCLT, and Boreas datasets. Under this protocol, C-LaV improves Recall@1 by \\textbf{17.5\\%} on NCLT snow and \\textbf{21.5\\%} on Boreas, achieving state-of-the-art weather robustness. Our dataset and code will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36850", "url": null, "sourceid": 39906, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38712, "uid": "f297b6d8f7d2fc4548282b09b8a0294a", "name": "A Sanity Check for Multi-In-Domain Face Forgery Detection in the Real World", "authors": [{"id": 151412, "fullname": "Jikang Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/151412?format=json", "institution": "Wuhan University"}, {"id": 190507, "fullname": "Renye Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190507?format=json", "institution": "Peking University"}, {"id": 131427, "fullname": "Zhiyuan Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/131427?format=json", "institution": "Peking University"}, {"id": 190508, "fullname": "Yaozhong Gan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190508?format=json", "institution": "Qiyuan Laboratory"}, {"id": 187458, "fullname": "Xueyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187458?format=json", "institution": ""}, {"id": 89608, "fullname": "Wei Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/89608?format=json", "institution": "Stanford University"}, {"id": 86262, "fullname": "Zhongyuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86262?format=json", "institution": "Wuhan University"}, {"id": 153028, "fullname": "Ling Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153028?format=json", "institution": "Peking University"}], "abstract": "Existing methods for deepfake detection aim to develop generalizable detectors. Although ``generalizable'' could be the ultimate target once and for all, with limited training forgeries and domains, it appears idealistic to expect generalization that covers entirely unseen variations, especially given the diversity, advancement, and vast volume of real-world deepfakes. Therefore, introducing large-scale multi-domain data for training can be feasible and important for real-world applications.However,  within such a multi-domain scenario, the differences between multiple domains, rather than the subtle real/fake distinctions, dominate the feature space. As a result, despite detectors being able to \\textbf{relatively} separate real and fake within each domain (i.e., high AUC), they struggle with single-image real/fake judgments in domain-unspecified conditions (i.e., low ACC).In this paper, we first define a new research paradigm named \\textbf{Multi-In-Domain Face Forgery Detection (MID-FFD)}, which includes sufficient volumes of real-fake domains for training. Then, the detector should provide definitive real-fake judgments to the domain-unspecified inputs, which simulate the frame-by-frame independent detection scenario in the real world. Meanwhile, to address the domain-dominant issue, we propose a two-stage, model-agnostic framework termed DevDet (\\underline{Dev}eloper for \\underline{Det}ector) to amplify real/fake differences and make them dominant in the feature space. DevDet consists of a Face Forgery Developer (FFDev) and a Dose-Adaptive detector Fine-Tuning strategy (DAFT). Experiments demonstrate our superiority in effectively predicting real-fake under the MID-FFD scenario \\textbf{while} maintaining original generalization ability to unseen data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38712", "url": null, "sourceid": 42027, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38397, "uid": "9c8c3c20976adbc398d5366a0cd2f817", "name": "MeshSplatting: Differentiable Rendering with Opaque Meshes", "authors": [{"id": 154594, "fullname": "Jan Held", "url": "http://cvpr.thecvf.com/api/miniconf/users/154594?format=json", "institution": "ULiege / KAUST"}, {"id": 189791, "fullname": "Sanghyun Son", "url": "http://cvpr.thecvf.com/api/miniconf/users/189791?format=json", "institution": "University of Maryland, College Park"}, {"id": 154595, "fullname": "Renaud Vandeghen", "url": "http://cvpr.thecvf.com/api/miniconf/users/154595?format=json", "institution": "University of Li\u00e8ge"}, {"id": 132581, "fullname": "Daniel Rebain", "url": "http://cvpr.thecvf.com/api/miniconf/users/132581?format=json", "institution": "Wayve; University of British Columbia"}, {"id": 106544, "fullname": "Matheus Gadelha", "url": "http://cvpr.thecvf.com/api/miniconf/users/106544?format=json", "institution": "Adobe Systems"}, {"id": 88688, "fullname": "Yi Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/88688?format=json", "institution": "Adobe Systems"}, {"id": 92295, "fullname": "Anthony Cioppa", "url": "http://cvpr.thecvf.com/api/miniconf/users/92295?format=json", "institution": "Universit\u00e9 de Li\u00e8ge"}, {"id": 150899, "fullname": "Ming Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/150899?format=json", "institution": "University of Maryland, College Park"}, {"id": 71630, "fullname": "Marc Van Droogenbroeck", "url": "http://cvpr.thecvf.com/api/miniconf/users/71630?format=json", "institution": "University of Li\u00e8ge"}, {"id": 126249, "fullname": "Andrea Tagliasacchi", "url": "http://cvpr.thecvf.com/api/miniconf/users/126249?format=json", "institution": "Simon Fraser University, Google Brain"}], "abstract": "Primitive-based splatting methods like 3D Gaussian Splatting (3DGS) have revolutionized novel view synthesis with real-time rendering.However, their point-based representations remain incompatible with mesh-based pipelines that power AR/VR and game engines. We present MeshSplatting, a mesh-based reconstruction approach that jointly optimizes geometry and appearance through differentiable rendering.By enforcing connectivity via restricted Delaunay triangulation and refining surface consistency, MeshSplatting creates end-to-end smooth, visually high-quality meshes that render efficiently in real-time 3D engines.On Mip-NeRF360, it boosts PSNR by +0.69 dB over the current state-of-the-art MiLo for mesh-based novel view synthesis, while training 2x faster and using 2x less memory, bridging neural rendering and interactive 3D graphics for seamless real-time scene interaction.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38397", "url": null, "sourceid": 36660, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40333?format=json"], "related_events_ids": [40333]}, {"id": 40129, "uid": "4585ad1e2cbe41891c011a3e0e73e1d4", "name": "MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations", "authors": [{"id": 193597, "fullname": "Changlu Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/193597?format=json", "institution": "Technical University of Denmark"}, {"id": 193598, "fullname": "Anders Nymark Christensen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193598?format=json", "institution": "Technical University of Denmark"}, {"id": 93991, "fullname": "Anders Dahl", "url": "http://cvpr.thecvf.com/api/miniconf/users/93991?format=json", "institution": "DTU Compute"}, {"id": 94900, "fullname": "Morten Hannemose", "url": "http://cvpr.thecvf.com/api/miniconf/users/94900?format=json", "institution": "Technical University of Denmark"}], "abstract": "Visual counterfactual explanations aim to reveal the minimal semantic modifications that can alter a model\u2019s prediction, providing causal and interpretable insights into deep neural networks. However, existing diffusion-based counterfactual generation methods are often computationally expensive, slow to sample, and imprecise in localizing the modified regions. To address these limitations, we propose MaskDiME, a simple, fast, yet effective diffusion framework that unifies semantic consistency and spatial precision through localized sampling. Our approach adaptively focuses on decision-relevant regions to achieve localized and semantically consistent counterfactual generation while preserving high image fidelity. Our training-free framework, MaskDiME, performs inference over \\textbf{30\u00d7 faster} than the baseline and achieves comparable or state-of-the-art performance across five benchmark datasets spanning diverse visual domains, establishing a practical and generalizable solution for efficient counterfactual explanation. Our code will be publicly available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40129", "url": null, "sourceid": 31859, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36492, "uid": "2451de50d2dddd51f68c32c2a564b300", "name": "EEGiT: Teaching Vision Transformers to Understand the EEG signal", "authors": [{"id": 185186, "fullname": "Jiahao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185186?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 185187, "fullname": "Chenghao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185187?format=json", "institution": "Hohai University"}, {"id": 185188, "fullname": "Wei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185188?format=json", "institution": null}, {"id": 70925, "fullname": "Erkun Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/70925?format=json", "institution": "Xidian University"}, {"id": 88245, "fullname": "Cheng Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/88245?format=json", "institution": "Xidian University"}], "abstract": "Decoding visual stimuli from electroencephalography (EEG) signals is a crucial step toward practical brain\u2013computer interfaces (BCIs). However, this task requires large-scale and high-quality EEG\u2013image paired datasets. Compared with abundant image data, the limited EEG recordings restrict the decoding models\u2019 performance. To address this challenge, we propose EEGiT, a framework that converts sequential EEG signals into image-like EEG patches and enables the direct use of a pretrained Vision Transformer (ViT) as the EEG encoder. To preserve the spatial topology of brain regions and minimize distributional differences across channels, we group EEG electrodes according to anatomical structures and apply linear interpolation along the spatial dimension. We then resample the EEG signals to align the structure of EEG patches with that of image patches in ViT. This design encourages effective transfer of visual priors learned from large-scale image datasets to EEG representation learning. Experiments on the THINGS-EEG and EEG-3D datasets show that fine-tuning pretrained ViTs improves EEG-to-image retrieval and EEG-based visual classification, while maintaining robustness and strong cross-subject generalization. These results demonstrate a promising direction for leveraging powerful vision models to mitigate data scarcity in EEG decoding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36492", "url": null, "sourceid": 33703, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39494, "uid": "e0a4c69b207126411a8a5e1049bfdfe5", "name": "CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning", "authors": [{"id": 181530, "fullname": "Chunlei Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/181530?format=json", "institution": "College of Intelligent Robotics and Advanced Manufacturing, Fudan University"}, {"id": 192196, "fullname": "Guanhong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192196?format=json", "institution": "Shantou University"}, {"id": 187344, "fullname": "Rong Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187344?format=json", "institution": "University of Macau"}, {"id": 192197, "fullname": "Runmin.JIAN Runmin.JIAN", "url": "http://cvpr.thecvf.com/api/miniconf/users/192197?format=json", "institution": "Guangzhou Huashang College"}, {"id": 89907, "fullname": "Zhongxue Gan", "url": "http://cvpr.thecvf.com/api/miniconf/users/89907?format=json", "institution": "Fudan University"}, {"id": 187345, "fullname": "Chun Ouyang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187345?format=json", "institution": "Fudan University"}], "abstract": "information from multiple modalities. However, existing methods that project all modalities into a single latent space for fusion often overlook the asynchronous, multi-level semantic structure of multimodal data. This oversight induces semantic misalignment and error propagation, thereby degrading representation quality. To address this issue, we propose Cross-Level Co-Representation (CLCR), which explicitly organizes each modality's features into a three-level semantic hierarchy and specifies level-wise constraints for cross-modal interactions. First, a semantic hierarchy encoder aligns shallow, mid, and deep features across modalities, establishing a common basis for interaction. And then, at each level, an Intra-Level Co-Exchange Domain (IntraCED) factorizes features into shared and private subspaces and restricts cross-modal attention to the shared subspace via a learnable token budget. This design ensures that only shared semantics are exchanged and prevents leakage from private channels. To integrate information across levels, the Inter-Level Co-Aggregation Domain (InterCAD) synchronizes semantic scales using learned anchors, selectively fuses the shared representations, and gates private cues to form a compact task representation. We further introduce regularization terms to enforce separation of shared and private features and to minimize cross-level interference. Experiments on six benchmarks spanning emotion recognition, event localization, sentiment analysis, and action recognition show that CLCR achieves strong performance and generalizes well across tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39494", "url": null, "sourceid": 40018, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38782, "uid": "13c6af08ae7a22f54b558adc098b01ab", "name": "IGen: Scalable Data Generation for Robot Learning from Open-World Images", "authors": [{"id": 144036, "fullname": "Chenghao Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/144036?format=json", "institution": "Tsinghua University"}, {"id": 190664, "fullname": "Haolan Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190664?format=json", "institution": "The University of Hong Kong"}, {"id": 190665, "fullname": "Junchao Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190665?format=json", "institution": "Beijing University of Chemical Technology"}, {"id": 190666, "fullname": "Jinghe Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190666?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 190667, "fullname": "Duo Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190667?format=json", "institution": "Tsinghua University"}, {"id": 129851, "fullname": "Shuzhao Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/129851?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 145672, "fullname": "Fanding Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/145672?format=json", "institution": "Tsinghua University"}, {"id": 190668, "fullname": "Junchen Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/190668?format=json", "institution": "Tsinghua University"}, {"id": 184641, "fullname": "Ziyang Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/184641?format=json", "institution": "Shanghai Artificial Intelligence Laboratory; SUN YAT-SEN UNIVERSITY"}, {"id": 190669, "fullname": "Letian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190669?format=json", "institution": "Tsinghua University"}, {"id": 190670, "fullname": "Hongying Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190670?format=json", "institution": null}, {"id": 190671, "fullname": "Changwei Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/190671?format=json", "institution": null}, {"id": 119978, "fullname": "Zhi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/119978?format=json", "institution": "SIGS, Tsinghua University"}], "abstract": "The rise of generalist robotic policies has created an exponential demand for large-scale training data. However, on-robot data collection is labor-intensive and often limited to specific environments. In contrast, open-world images capture a vast diversity of real-world scenes that naturally align with robotic manipulation tasks, offering a promising avenue for low-cost, large-scale robot data acquisition. Despite this potential, the lack of associated robot actions hinders the practical use of open-world images for robot learning, leaving this rich visual resource largely unexploited. To bridge this gap, we propose IGen, a framework that scalably generates realistic visual observations and executable actions from open-world images. IGen first converts unstructured 2D pixels into structured 3D scene representations suitable for scene understanding and manipulation. It then leverages the reasoning capabilities of vision-language models to transform scene-specific task instructions into high-level plans and generate low-level actions as $SE(3)$ end-effector pose sequences.  From these poses, it synthesizes dynamic scene evolution and renders temporally coherent visual observations. Experiments validate the high quality of visuomotor data generated by IGen, and show that policies trained solely on IGen-synthesized data achieve performance comparable to those trained on real-world data. This highlights the potential of IGen to support scalable data generation from open-world images for generalist robotic policy training. Code for IGen will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38782", "url": null, "sourceid": 34369, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37031, "uid": "01df78d65072e655cdc1a6f1c46432ac", "name": "MixFlow Training: Alleviating Exposure Bias with Slowed Interpolation Mixture", "authors": [{"id": 158003, "fullname": "Hui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/158003?format=json", "institution": "Fudan University"}, {"id": 186530, "fullname": "Jiayue Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186530?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 133160, "fullname": "Fu-Yun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/133160?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 158005, "fullname": "Kaihui Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/158005?format=json", "institution": "Fudan University"}, {"id": 154534, "fullname": "Siyu Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154534?format=json", "institution": "Fudan University"}, {"id": 126338, "fullname": "Jingdong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126338?format=json", "institution": "Baidu"}], "abstract": "This paper studies the training-testing discrepancy (a.k.a. exposure bias) problem for improving the diffusion models. During training, the input of a prediction network at the training timestep is the corresponding ground-truth noisy data that is an interpolation of the noise and the data, and during testing, the input is the generated noisy data. We present a novel training approach, named MixFlow, for improving the training performance. Our approach is motivated by the Slow Flow phenomenon: the ground-truth interpolation that is the nearest to the generated noisy data at a given sampling timestep is observed to correspond to a higher-noise timestep (termed slowed timestep), i.e., the corresponding ground-truth timestep is slower than the sampling timestep. MixFlow leverages the interpolations at the slowed timesteps, named slowed interpolation mixture, for post-training the prediction network at each training timestep. Experiments over class-conditional image generation (including SiT, REPA, and RAE) and text-to-image generation, validate the effectiveness of our approach. Our approach MixFlow over the RAE models achieve strong generation results on ImageNet: $1.43$ FID (without guidance) and $1.10$ (with guidance) at $256 \\times 256$, and $1.55$ FID (without guidance) and $1.10$ (with guidance) at $512 \\times 512$.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37031", "url": null, "sourceid": 36814, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39564, "uid": "311fc1c7f3f28bd69e2cd124a4da3215", "name": "Spatial-SAM: Spatially Consistent 3D Electron Microscopy Segmentation with SDF Memory and Semi-Supervised Learning", "authors": [{"id": 177295, "fullname": "Yikai Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177295?format=json", "institution": "University of Science and Technology of China"}, {"id": 187588, "fullname": "Renmin Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/187588?format=json", "institution": "Shandong University"}, {"id": 177806, "fullname": "Yuxuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177806?format=json", "institution": "University of Science and Technology of China"}, {"id": 188410, "fullname": "Youcheng Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/188410?format=json", "institution": "University of Science and Technology of China"}, {"id": 85084, "fullname": "Ligang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85084?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Segment Anything Model (SAM)-based approaches have demonstrated remarkable potential for biomedical image segmentation. However, these methods often struggle to maintain spatial consistency in 3D electron microscopy (3D-EM) data and require extensive manual annotations. To this end, we propose Spatial-SAM, a spatially consistent and annotation-efficient framework that achieves high precision on 3D-EM data. Our method introduces two key innovations. First, we incorporate a 3D Signed Distance Field (SDF) memory mechanism that replaces the original memory in SAM2 with SDF representations precomputed by a 3D U-Net, providing richer geometric information and improving spatial consistency. Second, by combining the few-shot capability of SAM2 with a dual-track pseudo-label iterative optimization strategy, Spatial-SAM efficiently learns to segment large-scale 3D-EM datasets from minimal annotations. Experiments show that Spatial-SAM significantly outperforms existing semi-supervised methods and achieves performance comparable to state-of-the-art fully supervised approaches on multiple 3D-EM benchmarks, reducing annotation costs while preserving spatial consistency. The code will be publicly released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39564", "url": null, "sourceid": 43347, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37214, "uid": "ca586d5ec745e3d9f7b31e948c8f5f4a", "name": "PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference", "authors": [{"id": 185473, "fullname": "Denis Korzhenkov", "url": "http://cvpr.thecvf.com/api/miniconf/users/185473?format=json", "institution": "Qualcomm Inc, QualComm"}, {"id": 186940, "fullname": "Adil Karjauv", "url": "http://cvpr.thecvf.com/api/miniconf/users/186940?format=json", "institution": "Qualcomm AI Research"}, {"id": 186941, "fullname": "Animesh Karnewar", "url": "http://cvpr.thecvf.com/api/miniconf/users/186941?format=json", "institution": "Qualcomm AI Research"}, {"id": 152559, "fullname": "Mohsen Ghafoorian", "url": "http://cvpr.thecvf.com/api/miniconf/users/152559?format=json", "institution": "Qualcomm"}, {"id": 106397, "fullname": "Amirhossein Habibian", "url": "http://cvpr.thecvf.com/api/miniconf/users/106397?format=json", "institution": "Qualcomm AI Research"}], "abstract": "Recently proposed pyramidal models decompose the conventional forward and backward diffusion processes into multiple stages operating at varying resolutions. These models handle inputs with higher noise levels at lower resolutions, while less noisy inputs are processed at higher resolutions. This hierarchical approach significantly reduces the computational cost of inference in multi-step denoising models. However, existing open-source pyramidal video models have been trained from scratch and tend to underperform compared to state-of-the-art systems in terms of visual plausibility.In this work, we present a pipeline that converts a pretrained diffusion model into a pyramidal one through low-cost finetuning, achieving this transformation without degradation in quality of output videos.Furthermore, we investigate and compare various strategies for step distillation within pyramidal models, aiming to further enhance the inference efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37214", "url": null, "sourceid": 39574, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38398, "uid": "aa5b738616facb0e2b0409c547b338d7", "name": "Residual Decoder Adapter: ID-Preserving Tokenizer Adaption for Autoregressive Text Rendering", "authors": [{"id": 127637, "fullname": "Dongxing Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/127637?format=json", "institution": "SUTD"}, {"id": 77429, "fullname": "Jinpeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77429?format=json", "institution": "Central South University"}, {"id": 189792, "fullname": "Jiahao Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189792?format=json", "institution": "Central South University"}, {"id": 88436, "fullname": "Kevin Qinghong Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/88436?format=json", "institution": "national university of singaore, National University of Singapore"}, {"id": 91788, "fullname": "Linjie Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/91788?format=json", "institution": "Microsoft"}, {"id": 77198, "fullname": "Zhengyuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77198?format=json", "institution": "Microsoft"}, {"id": 87290, "fullname": "Lijuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87290?format=json", "institution": "Microsoft"}, {"id": 189793, "fullname": "Min Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189793?format=json", "institution": "Central South University"}, {"id": 127523, "fullname": "Jingru Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/127523?format=json", "institution": "Central South University"}], "abstract": "Visual Autoregressive (AR) models generate images by predicting discrete tokens that are decoded by a visual tokenizer.Despite demonstrating strong overall image generation ability, they still underperform on text rendering with blur strokes and disrupt letter shapes. In this work, we trace this limitation to the visual tokenizer, which struggles to reconstruct fine-grained detail.Improving the tokenizer is straightforward but expensive, as it necessitates retraining both the tokenizer and the AR model.Can we improve text rendering performance of AR models without retraining the existing tokenizer and AR model? To achieve this, we propose the Residual Decoder Adapter(\\method) that upgrades an existing tokenizer post-hoc without changing its token space. Specifically, it refines the decoder output of the visual tokenizer by introducing two novel components: (i) a paired codebook that shares the token distribution with the original one; (ii) a parallel branch to learn the tiny differences (residual) between the reconstructed image and the ground-truth images in the pixel space. This residual design allows us to enhance the tokenizer non-invasively while preserving compatibility with prior AR models. \\method substantially improves text rendering significantly by a large margin. For instance, we boost finetuned Janus-Pro OCR accuracy rises from 24.52\\% to 58.26\\% (TextVisionBlend), from 12.75\\% to 36.81\\% (StyledTextSynth) on competitive TextAtlas benchmark. Codes will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38398", "url": null, "sourceid": 37741, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38843, "uid": "f469d652f14a20c7e37428beecda9654", "name": "NanoSD: Edge Efficient Foundation Model for Real Time Image Restoration", "authors": [{"id": 190811, "fullname": "Subhajit Sanyal", "url": "http://cvpr.thecvf.com/api/miniconf/users/190811?format=json", "institution": "Samsung Research Bangalore"}, {"id": 190812, "fullname": "Srinivas Soumitri Miriyala", "url": "http://cvpr.thecvf.com/api/miniconf/users/190812?format=json", "institution": "Qualcomm India Pvt Ltd, Bangalore"}, {"id": 190813, "fullname": "Akshay Bankar", "url": "http://cvpr.thecvf.com/api/miniconf/users/190813?format=json", "institution": "Samsung"}, {"id": 190814, "fullname": "Manjunath Arveti", "url": "http://cvpr.thecvf.com/api/miniconf/users/190814?format=json", "institution": null}, {"id": 190815, "fullname": "Sowmya Vajrala", "url": "http://cvpr.thecvf.com/api/miniconf/users/190815?format=json", "institution": "Samsung Research Bangalore"}, {"id": 190816, "fullname": "Shreyas Pandith", "url": "http://cvpr.thecvf.com/api/miniconf/users/190816?format=json", "institution": "Samsung"}, {"id": 165245, "fullname": "Sravanth Kodavanti", "url": "http://cvpr.thecvf.com/api/miniconf/users/165245?format=json", "institution": "Samsung Research"}, {"id": 190817, "fullname": "Abhishek Ameta", "url": "http://cvpr.thecvf.com/api/miniconf/users/190817?format=json", "institution": "Samsung"}, {"id": 190818, "fullname": "Harshit Harshit", "url": "http://cvpr.thecvf.com/api/miniconf/users/190818?format=json", "institution": "Samsung"}, {"id": 131547, "fullname": "Amit Unde", "url": "http://cvpr.thecvf.com/api/miniconf/users/131547?format=json", "institution": "SRIB Bangalore"}], "abstract": "Latent diffusion models such as Stable Diffusion 1.5 offer strong generative priors that are highly valuable for image restoration, yet their full pipelines remain too computationally heavy for deployment on edge devices. Existing lightweight variants predominantly compress the denoising U-Net or reduce the diffusion trajectory, which disrupts the underlying latent manifold and limits generalization beyond a single task. We introduce NanoSD, a family of Pareto-optimal diffusion foundation models distilled from Stable Diffusion 1.5 through network surgery, feature-wise generative distillation, and structured architectural scaling jointly applied to the U-Net and the VAE encoder\u2013decoder. This full-pipeline co-design preserves the generative prior while producing models that occupy distinct operating points along the accuracy\u2013latency\u2013size frontier (e.g., 130M\u2013315M parameters, achieving real-time inference down to 20ms on mobile-class NPUs). We show that parameter reduction alone does not correlate with hardware efficiency, and we provide an analysis revealing how architectural balance, feature routing, and latent-space preservation jointly shape true on-device latency. When used as a drop-in backbone, NanoSD enables state-of-the-art performance across image super-resolution, image deblurring, face restoration, and monocular depth estimation, outperforming prior lightweight diffusion models in both perceptual quality and practical deployability. NanoSD establishes a general-purpose diffusion foundation model family suitable for real-time visual generation and restoration on edge devices.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38843", "url": null, "sourceid": 34026, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38153, "uid": "913f7da49fa19c66a96f14d2c9c60185", "name": "Exploring 6D Object Pose Estimation with Deformation", "authors": [{"id": 182558, "fullname": "Zhiqiang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182558?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 90075, "fullname": "Rui Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/90075?format=json", "institution": "Xidian University"}, {"id": 189167, "fullname": "Chuanqi DuanMu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189167?format=json", "institution": "Xidian University"}, {"id": 90041, "fullname": "Jiaojiao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/90041?format=json", "institution": "Xidian University"}, {"id": 149413, "fullname": "David Ferstl", "url": "http://cvpr.thecvf.com/api/miniconf/users/149413?format=json", "institution": "Magic Leap"}, {"id": 85473, "fullname": "Yinlin Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85473?format=json", "institution": "Magic Leap"}], "abstract": "We present DeSOPE, a large-scale dataset designed for Deformed Six-DoF Object Pose Estimation. Most existing 6D object pose approaches assume rigid or articulated objects, leaving deformed daily objects largely unexplored. This gap limits the realism and robustness of current pose estimation methods, which often fail when objects deviate from their canonical shapes due to wear, collision, or deformation. To address this issue, we present DeSOPE, a large-scale real-world dataset specifically designed for deformed object pose estimation. DeSOPE contains two major components: (1) a collection of high-fidelity 3D scans of 26 common object categories, each captured in one canonical and three deformed states using a non-rigid alignment framework; and (2) a real-scene RGB-D dataset comprising 133K frames and 665K pose annotations across 104 deformed instances, recorded in both static and dynamic scenarios. The varying degrees of deformation introduce substantial geometric and textural changes, presenting new challenges for existing methods. We benchmark several state-of-the-art algorithms on DeSOPE and demonstrate significant performance degradation as deformation increases, highlighting the limitations of current pose estimators. As the first large-scale dataset designed for systematic study of deformed object pose estimation, DeSOPE lays the groundwork for developing 6D pose estimators capable of handling real-world deformation and variability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38153", "url": null, "sourceid": 33760, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39676, "uid": "0decbbe8e7f4bbda5ecf7be75866985e", "name": "LaS-Comp: Zero-shot 3D Completion with Latent\u2013Spatial Consistency", "authors": [{"id": 144052, "fullname": "Weilong Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/144052?format=json", "institution": "National University of Singapore"}, {"id": 107329, "fullname": "Li Haipeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/107329?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 74042, "fullname": "Hao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/74042?format=json", "institution": "CUHK"}, {"id": 192626, "fullname": "Nianjin Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/192626?format=json", "institution": "Changhong intelligent robot"}, {"id": 180926, "fullname": "Yihao Ai", "url": "http://cvpr.thecvf.com/api/miniconf/users/180926?format=json", "institution": "NUS (Ai Yihao)"}, {"id": 93490, "fullname": "Shuaicheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/93490?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 192627, "fullname": "Jingyu Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192627?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "This paper introduces LaS-Comp, a zero-shot and category-agnostic approach that leverages the rich geometric priors of 3D foundation models to enable 3D shape completion across diverse types of partial observations. Our contributions are threefold: First, LaS-Comp harnesses these powerful generative priors for completion through a complementary two-stage design: (i) an explicit replacement stage that preserves the partial observation geometry to ensure faithful completion; and (ii) an implicit refinement stage ensures seamless boundaries between the observed and synthesized regions. Second, our framework is training-free and compatible with different 3D foundation models. Third, we introduce Omni-Comp, a comprehensive benchmark combining real-world and synthetic data with diverse and challenging partial patterns, enabling a more thorough and realistic evaluation.  Both quantitative and qualitative experiments demonstrate that our approach outperforms previous state-of-the-art approaches. Code and data will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39676", "url": null, "sourceid": 35655, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36667, "uid": "1da76f4e60189995aa60cc1d19993ae9", "name": "Correspondence-Attention Alignment for Multi-view Diffusion Models", "authors": [{"id": 185598, "fullname": "Minkyung Kwon", "url": "http://cvpr.thecvf.com/api/miniconf/users/185598?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 185599, "fullname": "Jinhyeok Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185599?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 185600, "fullname": "Jiho Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/185600?format=json", "institution": "KAIST (Korea Advanced Institute of Science &amp; Technology)"}, {"id": 143190, "fullname": "Seonghu Jeon", "url": "http://cvpr.thecvf.com/api/miniconf/users/143190?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology; Korea University"}, {"id": 185601, "fullname": "Jinhyuk Jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185601?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 185602, "fullname": "Junyoung Seo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185602?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 185603, "fullname": "Min-Seop Kwak", "url": "http://cvpr.thecvf.com/api/miniconf/users/185603?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 86994, "fullname": "Jin-Hwa Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/86994?format=json", "institution": "NAVER AI Lab"}, {"id": 153109, "fullname": "Seungryong Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/153109?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}], "abstract": "Multi-view diffusion models have recently emerged as a powerful paradigm for novel view synthesis, yet the underlying mechanism that enables their view consistency remains unclear. In this work, we first verify that the attention maps of these models acquire geometric correspondence throughout training, attending to the geometrically corresponding regions across reference and target views for view-consistent generation. However, this correspondence signal remains incomplete, with its accuracy degrading under large viewpoint changes. Building on these findings, we introduce CAMEO, a simple yet effective training technique that directly supervises attention maps using geometric correspondence to enhance both the training efficiency and generation quality of multi-view diffusion models. Notably, supervising a single attention layer is sufficient to guide the model toward learning precise correspondences, thereby preserving the geometry and structure of reference images, accelerating convergence, and improving novel view synthesis performance. CAMEO reduces the number of training iterations required for convergence by half while achieving superior performance at the same iteration counts. We further demonstrate that CAMEO is model-agnostic and can be applied to any multi-view diffusion model. Code will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36667", "url": null, "sourceid": 30883, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40066, "uid": "e5e63e1b118dabdbd6b5d2e7cf43b72c", "name": "FeatureFool: Zero-Query Fooling of Video Models via Feature Map", "authors": [{"id": 173947, "fullname": "Duoxun Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/173947?format=json", "institution": "Tsinghua University"}, {"id": 193428, "fullname": "Xi Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193428?format=json", "institution": "Shenzhen International Graduate School, Tsinghua University"}, {"id": 193429, "fullname": "Guangwu Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193429?format=json", "institution": null}, {"id": 193430, "fullname": "Kangkang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/193430?format=json", "institution": "Harbin Institute of Technology"}, {"id": 193431, "fullname": "Xiao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193431?format=json", "institution": "Sichuan Agricultural University; The Hong Kong University of Science and Technology"}, {"id": 193432, "fullname": "Dongyang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193432?format=json", "institution": "Tsinghua University"}, {"id": 193433, "fullname": "Qing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193433?format=json", "institution": "Pengcheng Laboratory; Pengcheng Laboratory"}, {"id": 193434, "fullname": "Yong-jie Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/193434?format=json", "institution": null}, {"id": 193435, "fullname": "Jiyao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193435?format=json", "institution": "McGill University"}], "abstract": "The vulnerability of deep neural networks (DNNs) has been preliminarily verified. Existing black-box adversarial attacks usually require multi-round interaction with the model and consume numerous queries, which is impractical in the real-world and hard to scale to recently emerged Video-LLMs. Moreover, no attack in the video domain directly leverages feature maps to shift the clean-video feature space. We therefore propose FeatureFool, a stealthy, video-domain, zero-query black-box attack that utilizes information extracted from a DNN to alter the feature space of clean videos. Unlike query-based methods that rely on iterative interaction, FeatureFool performs a zero-query attack by directly exploiting DNN-extracted information. This efficient approach is unprecedented in the video domain. Experiments show that FeatureFool achieves an attack success rate above 70\\% against traditional video classifiers without any queries. Benefiting from the transferability of the feature map, it can also craft harmful content and bypass Video-LLM recognition. Additionally, adversarial videos generated by FeatureFool exhibit high quality in terms of SSIM, PSNR, and Temporal-Inconsistency, making the attack barely perceptible. This paper may contain violent or explicit content.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40066", "url": null, "sourceid": 42053, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65757, "file": "/media/PosterPDFs/CVPR%202026/40066.png", "modified": "2026-04-30T05:47:53.274802-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": true, "sortkey": 0, "is_live_content": false, "detailed_kind": "", "generated_from": null, "resourcetype": "EventmediaImageFile"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36228, "uid": "245142a8282a24362c6a1762f55dab27", "name": "Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles", "authors": [{"id": 180832, "fullname": "Jiawei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180832?format=json", "institution": "Jilin University"}, {"id": 184508, "fullname": "Xun Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/184508?format=json", "institution": "Jilin University, China"}, {"id": 184509, "fullname": "Fen Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184509?format=json", "institution": null}, {"id": 88220, "fullname": "Muli Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88220?format=json", "institution": "A*STAR"}, {"id": 184510, "fullname": "Bohao Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184510?format=json", "institution": "A*STAR"}, {"id": 184511, "fullname": "Yunfeng hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184511?format=json", "institution": "Jilin University"}, {"id": 184512, "fullname": "Hong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184512?format=json", "institution": "Tongji University"}, {"id": 107391, "fullname": "Xulei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107391?format=json", "institution": "Institute for Infocomm Research (I2R), A*STAR"}, {"id": 85967, "fullname": "Qing Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/85967?format=json", "institution": "Institute of High Performance Computing, Singapore, A*STAR"}], "abstract": "Most Human-Machine Interaction (HMI) research overlooks the maneuvering needs of passengers in autonomous driving (AD). Natural language offers an intuitive interface, yet translating passenger open-ended instructions into control signals\u2014without sacrificing interpretability and traceability\u2014remains a challenge. This study proposes an instruction-realization framework that leverages a large language model (LLM) to interpret instructions, generates executable scripts that schedule multiple model predictive control (MPC)-based motion planners based on real-time feedback, and converts planned trajectories into control signals. This scheduling-centric design decouples semantic reasoning from vehicle control at different timescales, establishing a transparent, traceable decision-making chain from high-level instructions to low-level actions. Due to the absence of high-fidelity evaluation tools, this study introduces a benchmark for open-ended instruction realization in a closed-loop setting. Comprehensive experiments reveal that the framework significantly improves task-completion rates over instruction-realization baselines, reduces LLM query costs, achieves safety and compliance on par with specialized AD approaches, and exhibits considerable tolerance to LLM inference latency. For more qualitative illustrations and a clearer understanding, please refer to the videos in the Supplementary Material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36228", "url": null, "sourceid": 44977, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38971, "uid": "2151a9acfad2cad72725a1859f8ca776", "name": "Towards Dynamic Modality Alignment in Multimodal Continual Learning", "authors": [{"id": 191092, "fullname": "Jiayao Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191092?format=json", "institution": "Tianjin University"}, {"id": 187560, "fullname": "Fan Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187560?format=json", "institution": "Computer Vision Center, Universitat Aut\u00f3noma de Barcelona"}, {"id": 191093, "fullname": "Tianle Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191093?format=json", "institution": "Suzhou University of Science and Technology"}, {"id": 157891, "fullname": "Fuyuan Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157891?format=json", "institution": "Suzhou University of Science and Technology"}, {"id": 90857, "fullname": "Wei Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90857?format=json", "institution": "Tianjin University"}], "abstract": "Multimodal Continual Learning (MMCL) aims to enable models to continuously accumulate knowledge across multiple tasks and modalities without forgetting prior information. MMCL presents more challenges than single-modal continual learning, as it requires effective cooperation and complementarity between modalities. Existing methods often treat modality alignment as a static process, assuming once alignment is established, it remains fixed. However, we argue that modality alignment is inherently dynamic, evolving with task learning and feature propagation across layers. To address this, we introduce Dynamic Alignment Graph Regularization (DAGR), a novel approach that explicitly models the evolving alignment across layers. By incorporating multi-level graph regularization, our method stabilizes the alignment process and mitigates catastrophic forgetting. Extensive experiments on benchmarks, such as MTIL, show that DAGR outperforms static alignment-based methods and other continual learning techniques, achieving superior stability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38971", "url": null, "sourceid": 35572, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39788, "uid": "d955b97b28c2966211c9de2fe22fefbd", "name": "SMAP: Semantic Route Planning with Map-Grounded Multimodal Alignment", "authors": [{"id": 180157, "fullname": "Wenjie Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180157?format=json", "institution": "Xidian university"}, {"id": 192853, "fullname": "Chen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192853?format=json", "institution": "Alibaba Group"}, {"id": 192854, "fullname": "Xin Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192854?format=json", "institution": null}, {"id": 192855, "fullname": "Zhen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192855?format=json", "institution": null}, {"id": 192856, "fullname": "Yue Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192856?format=json", "institution": "Alibaba Group"}, {"id": 192857, "fullname": "Bobo Xi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192857?format=json", "institution": "School of Telecommunications Engineering, XIDIAN University"}, {"id": 192858, "fullname": "Pengbo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192858?format=json", "institution": "Alibaba Group"}], "abstract": "Semantic route planning involves generating itineraries that align with user intent while respecting real-world spatial constraints. However, text-only large language models (LLMs) often hallucinate geographically implausible routes due to poor spatial grounding. Inspired by how humans use maps for route planning, we propose the SMAP, which is the first multimodal framework combining user queries, POI metadata, and map tiles to produce spatially coherent, preference-aware routes. To enhance the spatial consistency, the SMAP features a two-stage anti-hallucination mechanism: (1) a map-grounded self-editing pipeline where a multimodal LLM (MLLM) drafts routes and a second MLLM verifies and refines them using geographic evidence; and (2) hallucination-penalized Direct Preference Optimization (HDPO) that steers the route generator toward spatially plausible routes by using verified routes as accepted responses and hallucinated drafts as rejected ones. Additionally, we introduce MM-Route, the first multimodal dataset for semantic route planning, with 3,000 diverse queries annotated with POI metadata and map tiles, covering a broad spectrum of geographic granularities and user intents. Experimental results demonstrate that SMAP significantly reduces geographical hallucinations and outperforms strong baselines in spatial plausibility and user alignment. The code and dataset will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39788", "url": null, "sourceid": 39327, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36579, "uid": "116491a3880b172b226b7282823549f3", "name": "Learning to Focus and Precise Cropping\uff1aA Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs", "authors": [{"id": 185398, "fullname": "Xuanpu Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185398?format=json", "institution": "University of Science and Technology of China"}, {"id": 70742, "fullname": "Zhentao Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/70742?format=json", "institution": "Alibaba DAMO Academy; University of Science and Technology of China"}, {"id": 69935, "fullname": "Dianmo Sheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/69935?format=json", "institution": "University of Science and Technology of China"}, {"id": 185399, "fullname": "Tianxiang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185399?format=json", "institution": "University of Science and Technology of China"}, {"id": 185400, "fullname": "Yao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185400?format=json", "institution": "Alibaba Group"}, {"id": 126248, "fullname": "Yue Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126248?format=json", "institution": "Alibaba Group"}, {"id": 131741, "fullname": "Tao Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/131741?format=json", "institution": "University of Science and Technology of China"}, {"id": 131736, "fullname": "Qi Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131736?format=json", "institution": "University of Science and Technology of China"}, {"id": 90580, "fullname": "Nenghai Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90580?format=json", "institution": "University of Science and Technology of China"}], "abstract": "To enhance the perception and reasoning capabilities of multimodal large models (MLLMs) in complex visual scenes, recent research has introduced agent-based workflows. In these works, MLLMs autonomously utilize image cropping to analyze regions of interest for question answering. While existing training strategies, such as those employing supervised fine-tuning (SFT) and reinforcement learning (RL), have made significant progress, our empirical analysis reveals a key limitation. By adding random noise to the cropped images, we find that they still maintain most of the performance, especially for models using only reinforcement learning, indicating a heavy reliance on the global input and a weak dependence on details within the cropped region. To address this issue, we propose a novel two-stage reinforcement learning framework that does not require trajectory supervision. In the first stage, we introduce the \"Information Gap\" mechanism by adjusting the granularity of the global image. This mechanism trains the model to answer questions by focusing on cropped key regions, driven by the information gain these regions provide. The second stage further enhances cropping precision by incorporating a grounding loss, using a small number of bounding box annotations. Experiments show that our method significantly enhances the model's attention to clipped regions, enabling it to achieve state-of-the-art performance on high-resolution visual question-answering benchmarks. Our method provides a more efficient approach for perceiving and reasoning fine-grained details in MLLMs.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36579", "url": null, "sourceid": 44875, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37213, "uid": "2a250dac511b301faaf82502eedbb198", "name": "FedBPrompt: Federated Domain Generalization Person Re-Identification via Body Distribution Aware Visual Prompts", "authors": [{"id": 182893, "fullname": "Xin Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182893?format=json", "institution": "Wuhan University of Science and Technology"}, {"id": 186936, "fullname": "Weilong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186936?format=json", "institution": "Wuhan University of Science and Technology"}, {"id": 186937, "fullname": "Wei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186937?format=json", "institution": "Wuhan University of Science and Technology"}, {"id": 184315, "fullname": "Wenke Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184315?format=json", "institution": "Nanyang Technological University"}, {"id": 186938, "fullname": "Zhixi Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186938?format=json", "institution": "Wuhan University of Science and Technology"}, {"id": 130757, "fullname": "Bin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130757?format=json", "institution": "Wuhan University"}, {"id": 186939, "fullname": "Xiaoying Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186939?format=json", "institution": null}, {"id": 152254, "fullname": "Kui Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152254?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Federated Domain Generalization for Person Re-Identification (FedDG-ReID) aims to learn domain-invariant representations from decentralized data. Although Vision Transformers (ViTs) are widely adopted, their global attention often fails to distinguish pedestrians from high similarity backgrounds or diverse viewpoints---a challenge further amplified by cross-client distribution shifts in FedDG-ReID. To address this, we propose Federated Body Distribution Aware Visual Prompt (FedBPrompt), which introduces learnable visual prompts to explicitly guide Transformer attention toward pedestrian-centric regions. FedBPrompt employs a Body Distribution Aware Visual Prompts Mechanism (BAPM) that divides prompts into two groups: Holistic Full Body Prompts suppress cross-client background noise, while Body Part Alignment Prompts capture fine-grained details robust to pose and viewpoint variations. To mitigate the high communication cost of large Transformer models, we further design a Prompt-based Fine-Tuning Strategy (PFTS) that freezes the ViT backbone and updates only lightweight prompts, significantly reducing communication overhead while maintaining adaptability. Extensive experiments demonstrate that BAPM effectively enhances feature discrimination and cross-domain generalization, while PFTS}achieves notable performance gains within only a few aggregation rounds. Moreover, both BAPM and PFTS can be easily integrated into existing ViT-based FedDG-ReID frameworks, making FedBPrompt a flexible and effective solution for federated person re-identification. The code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37213", "url": null, "sourceid": 33903, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38285, "uid": "d7d71ade2798157258a02f85221648e3", "name": "When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models", "authors": [{"id": 189498, "fullname": "Zhengyang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/189498?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 189499, "fullname": "Yu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189499?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 129828, "fullname": "Xin Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/129828?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 157000, "fullname": "Xiaofan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/157000?format=json", "institution": "BAIDU Inc,"}, {"id": 159446, "fullname": "Xiwu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/159446?format=json", "institution": "mach-drive"}, {"id": 76439, "fullname": "Dingkang Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76439?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 85817, "fullname": "Xiang Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/85817?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt\u2013layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the introduced CountBench, NUMINA improves counting accuracy by up to 7.4\\% on Wan2.1-1.3B, and by 4.9\\% and 5.5\\% on 5B and 14B models, respectively. Furthermore, CLIP alignment is improved while maintaining temporal consistency. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code will be made available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38285", "url": "https://h-embodvis.github.io/NUMINA/", "sourceid": 33769, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38592, "uid": "39a1defee57f7841d42d24a41918815b", "name": "PointTPA: Test-Time Parameter Adaptation for 3D Scene Understanding", "authors": [{"id": 190229, "fullname": "Siyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190229?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 190230, "fullname": "Chaoqun Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190230?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 129828, "fullname": "Xin Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/129828?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 190231, "fullname": "Tianrui Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190231?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 76439, "fullname": "Dingkang Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76439?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 85817, "fullname": "Xiang Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/85817?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Scene-level point cloud understanding remains challenging due to diverse geometries, imbalanced categories, and highly varied spatial layouts. Existing methods improve object-level performance but rely on static parameters during inference, limiting their adaptability to dynamic scene data. We propose Test-time Parameter Adaptation for Point Cloud Scene Perception (PointTPA), a test-time dynamic adaptation framework that constructs input-aware parameters for scene-level point clouds. PointTPA uses a Serialization-based Neighborhood Grouping (SNG) to form locally coherent patches and a Dynamic Parameter Projector (DPP) to produce patch-wise adaptive weights, enabling the backbone to adjust its behavior according to scene-specific variations while keeping parameter cost low. Integrated into PTv3, PointTPA reduces trainable parameters by over 95% and achieves competitive or superior performance to full fine-tuning. It achieves 74.9% mIoU on S3DIS and consistently surpasses existing PEFT baselines across multiple benchmarks, highlighting the efficacy of test-time dynamic parameter generation in enhancing robust 3D scene understanding. The code will be available soon.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38592", "url": null, "sourceid": 37511, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36948, "uid": "59dde4945f938af0224dc8f323348f35", "name": "VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA", "authors": [{"id": 162134, "fullname": "Youngrok Jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/162134?format=json", "institution": "LG AI Research"}, {"id": 186288, "fullname": "Hyesoo Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186288?format=json", "institution": "LG Corporation"}, {"id": 186289, "fullname": "Kyunghwan An", "url": "http://cvpr.thecvf.com/api/miniconf/users/186289?format=json", "institution": "LG Corporation"}, {"id": 186290, "fullname": "Jae Huh", "url": "http://cvpr.thecvf.com/api/miniconf/users/186290?format=json", "institution": "LG Corporation"}, {"id": 186291, "fullname": "Gyeonghun KIM", "url": "http://cvpr.thecvf.com/api/miniconf/users/186291?format=json", "institution": "LG AI Research."}, {"id": 186292, "fullname": "Stanley Jungkyu Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186292?format=json", "institution": "Language Lab, LG AI Research"}], "abstract": "Real-world documents combine text with tables, charts, photographs, and diagrams arranged in diverse layouts, yet existing research on multimodal large language models (MLLMs) for document QA predominantly produces text-only responses, underutilizing these visual elements. We introduce VinQA, a dataset designed for long-form answer generation where cited visual elements are explicitly interleaved with their supporting text and grounded in relevant document pages. To support this task, we study two encoding methods for feeding raw document page images into an MLLM, along with their visual-element citation mechanisms: (1) Page Encoding, which directly encodes full-page images with bounding boxes of visual elements and treats these boxed regions as citable units; and (2) Modality Encoding, which parses each page to extract text and crop visual elements, encodes them separately, and uses these cropped elements as citable units. In our experiments, we propose M-GroSE, a multimodal evaluation framework extending GroUSE to assess such answers along four dimensions: completeness, answer relevancy, faithfulness, and unanswerability. We additionally report Visual Source F1 to directly measure visual citation accuracy. Although proprietary frontier models still achieve the best overall scores on the VinQA test split, fine-tuning open Qwen2.5-VL models on the VinQA training split substantially improves their performance and markedly narrows this gap. Modality Encoding is initially more robust than Page Encoding for complex documents with long text, many visual elements, and diverse visual citation requirements. After training on VinQA, however, Page Encoding reaches a comparable performance level, showing that it can compete effectively even without the explicit parsing used in Modality Encoding. Finally, Visual G-Eval, an MLLM-based judge, confirms that fine-tuned models insert visual elements at semantically appropriate positions with faithful supporting text.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36948", "url": null, "sourceid": 36136, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38092, "uid": "4c600743a60302ab380fb726cd7c49a8", "name": "Scalable Feature Matching via State Space Modeling and Sparse Correlation", "authors": [{"id": 189041, "fullname": "Choo Sin Wai", "url": "http://cvpr.thecvf.com/api/miniconf/users/189041?format=json", "institution": "Tsinghua University"}, {"id": 76063, "fullname": "Bo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/76063?format=json", "institution": "Northwestern Polytechnical University Xi&#x27;an"}], "abstract": "Efficient and robust feature matching is crucial for latency-sensitive and resource-constrained applications. While current semi-dense feature matching approaches commonly suffer from quadratic complexity in spatial resolution due to transformer-based long-range context modeling or redundant full correlation computations. To overcome these limitations, we present a novel scalable feature matching method that delivers reliable correspondences with low memory footprint and latency, especially at high resolutions. Our approach introduces three key innovations: (1) a hybrid Conv-Mamba backbone for efficient cross-scale and cross-view feature extraction with linear complexity, (2) a training-free norm-based feature filtering mechanism, enabling sparse correlation that significantly reduces computation overhead during inference, and (3) a lightweight recurrent coordinate refinement that surpasses expectation-based regression in subpixel accuracy. Experimental results demonstrate our method's superior accuracy and efficiency performance over state-of-the-art (SOTA) approaches on both indoor and outdoor datasets. Notably, in resolution scaling tests, our method achieves 45\\% lower memory usage and 2.4$\\times$ faster inference than JamMa, while also outperforming Efficient LoFTR with 57\\% memory reduction and 1.8$\\times$ speedup at high resolution. These results demonstrate the strong scalability and practical efficiency of our method.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38092", "url": null, "sourceid": 40873, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37939, "uid": "9c8001602e98208ef4e8d1bbf79fee65", "name": "Grounded Chain-of-Thought for Multimodal Large Language Models", "authors": [{"id": 180846, "fullname": "Qiong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180846?format=json", "institution": "Xiamen University"}, {"id": 188637, "fullname": "Xiangcong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188637?format=json", "institution": "Xiamen University"}, {"id": 88619, "fullname": "Yiyi Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/88619?format=json", "institution": "Xiamen University"}, {"id": 188638, "fullname": "Chenxin Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188638?format=json", "institution": "Xiamen University"}, {"id": 188639, "fullname": "Baiyang Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/188639?format=json", "institution": "Xiamen University"}, {"id": 76395, "fullname": "Xiaoshuai Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/76395?format=json", "institution": "Xiamen University"}, {"id": 86308, "fullname": "Rongrong Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/86308?format=json", "institution": "Xiamen University"}], "abstract": "Despite great progress, existing multimodal large language models (MLLMs) are still inferior in visual-spatial reasoning, which greatly impedes their trustworthy applications in scenarios such as  Embodied AI. To facilitate the research, we propose a new MLLM task in this paper, called Grounded Chain-of-Thought (GCoT). Different from recent visual CoT studies, which focus more on visual knowledge reasoning, GCoT aims to improve the visual-spatial reasoning capabilities of MLLMs via recognizing and grounding the relevant visual cues step by step, which are also supported by step-vise grounding coordinates as the intuitive basis. To facilitate this task, we also carefully design and construct a benchmark called multimodal grounded chain-of-thought (MM-GCoT). Besides, a comprehensive consistency evaluation system is also introduced, including  the metrics of answer accuracy, grounding accuracy and answer-grounding consistency. We further design and conduct a bunch of experiments on 12 advanced MLLMs, and reveal some notable findings: i. most MLLMs performs poorly on the consistency evaluation, indicating obvious visual hallucination; ii. visual hallucination is not directly related to the parameter size and general multimodal performance; iii. a larger and stronger MLLM is not less affected by this issue.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37939", "url": null, "sourceid": 30688, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39295, "uid": "3998f40ab605512d599d8b96550e084d", "name": "FedMOP: Achieving Enhanced Privacy and Performance in Federated Learning via Momentum Orthogonal Projection", "authors": [{"id": 191590, "fullname": "Yunlong Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191590?format=json", "institution": "Central South University"}, {"id": 191591, "fullname": "Xiaoheng Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191591?format=json", "institution": "Central South University"}, {"id": 191787, "fullname": "Hongyan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191787?format=json", "institution": "Central South University"}, {"id": 191788, "fullname": "Zhuohua Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191788?format=json", "institution": "Shenzhen University"}, {"id": 191789, "fullname": "Xiaowen Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191789?format=json", "institution": "Central South University"}, {"id": 187391, "fullname": "Shan You", "url": "http://cvpr.thecvf.com/api/miniconf/users/187391?format=json", "institution": "SenseTime Research"}, {"id": 191592, "fullname": "Yi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191592?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 87489, "fullname": "Chang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87489?format=json", "institution": "University of Sydney"}, {"id": 156647, "fullname": "Xiu Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/156647?format=json", "institution": "Central South University"}], "abstract": "Federated Learning (FL) faces a fundamental dilemma: existing defenses against gradient leakage attacks (GLAs) invariably sacrifice model performance for privacy protection through noise injection or gradient clip. We introduce Federated Learning with Momentum-Based Orthogonal Projection (FedMOP), a method that simultaneously achieves strong privacy guarantees and superior model performance. The key insight is to leverage initialization-based offset mechanisms that operate on orthogonal dimensions. For performance enhancement, FedMOP employs gradient orthogonal projection to counteract local drift, effectively offsetting each client's round-training initial model using global statistical context. For privacy protection, it introduces momentum-based trajectory offset hiding, which makes the offset vector inherently unrecoverable by constructing information barriers through private initialization and randomized evolution. These two mechanisms are synergistic rather than antagonistic. Theoretically, we prove convergence preservation and characterize the computationally infeasible inverse problem faced by attackers. Extensive experiments on CIFAR-10/100 and Tiny-ImageNet demonstrate that FedMOP not only defends effectively against state-of-the-art GLAs but also surpasses existing FL methods in both accuracy and convergence speed, validating its ability to jointly enhance privacy and performance in FL.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39295", "url": null, "sourceid": 45405, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38778, "uid": "2e283777e7e2191d23fab7566e28beca", "name": "RFDM: Residual Flow Diffusion Models for Video Editing", "authors": [{"id": 189860, "fullname": "Mohammadreza Salehi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189860?format=json", "institution": "Samsung; University of Amsterdam"}, {"id": 154328, "fullname": "Mehdi Noroozi", "url": "http://cvpr.thecvf.com/api/miniconf/users/154328?format=json", "institution": "Samsung"}, {"id": 190642, "fullname": "Luca Morreale", "url": "http://cvpr.thecvf.com/api/miniconf/users/190642?format=json", "institution": "Samsung"}, {"id": 190643, "fullname": "Ruchika Chavhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190643?format=json", "institution": "Samsung"}, {"id": 190644, "fullname": "Malcolm Chadwick", "url": "http://cvpr.thecvf.com/api/miniconf/users/190644?format=json", "institution": "Samsung"}, {"id": 190645, "fullname": "Alberto Gil Couto Pimentel Ramos", "url": "http://cvpr.thecvf.com/api/miniconf/users/190645?format=json", "institution": "Samsung"}, {"id": 190646, "fullname": "Abhinav Mehrotra", "url": "http://cvpr.thecvf.com/api/miniconf/users/190646?format=json", "institution": "Samsung AI Center"}], "abstract": "Autoregressive video generative methods have recently become popular due to their flexibility for variable-length video generation and computational efficiency. However, their deployment in video editing remains relatively unexplored. This paper introduces an efficient causal video editing model that edits a video frame-by-frame.  Specifically, we adapt an image-to-image (I2I) model to video-to-video (V2V) where editing at time frame $t$ is conditioned on the model prediction on $t-1$. To make use of the past predictions more effectively, we condition the sampling noise on the past prediction during the diffusion forward process. Our forward process guides the model to explicitly compute the residual between the target and the previous prediction during denoising; we denote this formulation as the Residual-Flow Diffusion Model, RFDM. We initialize RFDM with text-to-image SD1.5 model, and train on the Se\u00f1orita dataset for global style transfer, local style transfer, and object removal. RFDM achieves competitive results with computationally heavy counterparts while being significantly more efficient.  The latency of our method scales linearly with the number of frames, making it the most efficient diffusion-based video editing framework.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38778", "url": null, "sourceid": 40406, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38309, "uid": "0dc5daa36f3c5390b86b31be0e3bee9f", "name": "Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset", "authors": [{"id": 189568, "fullname": "TsaiChing Ni", "url": "http://cvpr.thecvf.com/api/miniconf/users/189568?format=json", "institution": "National Yang Ming Chiao Tung University"}, {"id": 175423, "fullname": "ZhenQi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/175423?format=json", "institution": "National Yang Ming Chiao Tung University"}, {"id": 71986, "fullname": "YuanFu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/71986?format=json", "institution": "National Yang Ming Chiao Tung University"}], "abstract": "We present IMDD-1M, the first large-scale Industrial Multimodal Defect Dataset comprising 1,000,000 aligned image-text pairs, designed to advance multimodal learning for manufacturing and quality inspection. IMDD-1M contains high-resolution real-world defects spanning over 60 material categories and more than 400 defect types, each accompanied by expert-verified annotations and fine-grained textual descriptions detailing defect location, severity, and contextual attributes. This dataset enables a wide spectrum of applications, including classification, segmentation, retrieval, captioning, and generative modeling. Building upon IMDD-1M, we train a diffusion-based vision-language foundation model from scratch, specifically tailored for industrial scenarios. The model serves as a generalizable foundation that can be efficiently adapted to specialized domains through lightweight fine-tuning. With less than 5% of the task-specific data required by dedicated expert models, it achieves comparable performance, highlighting the potential of data-efficient foundation model adaptation for industrial inspection and generation, paving the way for scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38309", "url": "https://ninaneon.github.io/projectpage/", "sourceid": 45050, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36890, "uid": "c5204334289d6a51e794d56aea6ebdf4", "name": "Hyperbolic Relational Prompts for Intersectional Fairness in Medical VLMs", "authors": [{"id": 180300, "fullname": "Qian Jiayu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180300?format=json", "institution": "City University of Hong Kong (Dongguan)"}, {"id": 186120, "fullname": "Zongxian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186120?format=json", "institution": "City University of Hong Kong(Dongguan)"}, {"id": 186121, "fullname": "Guanxing Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186121?format=json", "institution": "City University of Hong Kong (Dongguan)"}, {"id": 177901, "fullname": "Pengwei Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/177901?format=json", "institution": "Chinese Academy of Sciences"}, {"id": 186122, "fullname": "KC Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186122?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 186123, "fullname": "Yan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186123?format=json", "institution": "Sichuan University"}, {"id": 186124, "fullname": "Yu-An Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186124?format=json", "institution": null}, {"id": 186125, "fullname": "Zhi-An Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186125?format=json", "institution": "City University of Hong Kong (Dongguan)"}], "abstract": "Ensuring fairness in medical vision-language models (VLMs) is essential for equitable healthcare, yet existing models amplify biases across demographic subgroups such as race and gender. Traditional fairness mitigation approaches relying on broad distribution alignment, fall short in addressing these nuanced intersectional disparities. We propose fairness-aware relational prompting (FRP), a novel framework that reformulates prompt generation as a dynamic, fairness-aware reasoning process. FRP constructs a relational graph to capture fine-grained, sample-level similarities and employs a hyperbolic graph layer to explicitly model the hierarchical structure of intersectional identities. Leveraging hyperbolic geometry enables reasoning over complex attribute combinations, effectively reducing entrenched biases. Evaluations on the FairVLMed and Harvard-GF datasets demonstrate that FRP achieves state-of-the-art diagnostic performance, with an area under the curve of 77.50\\% and 85.94\\% respectively, while substantially improving the demographic parity difference and equalized odds difference. This work moves beyond post-hoc bias correction toward inherently fair VLM architectures, offering a scalable solution for high-stakes medical applications.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36890", "url": null, "sourceid": 40138, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38348, "uid": "ec04b828851594e2a860ce40bc363855", "name": "Towards Reliable Evaluation of Adversarial Robustness for Spiking Neural Networks", "authors": [{"id": 189670, "fullname": "Jihang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189670?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 189671, "fullname": "Dongcheng Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189671?format=json", "institution": "CAS Center for Excellence in Brain Science and Intelligence Technology"}, {"id": 189672, "fullname": "Ruolin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189672?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 189673, "fullname": "Qian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189673?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 127463, "fullname": "Yi Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/127463?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}], "abstract": "Spiking Neural Networks (SNNs) utilize spike-based activations to mimic the brain's energy-efficient information processing. However, the binary and discontinuous nature of spike activations causes vanishing gradients, making adversarial robustness evaluation via gradient descent unreliable. While improved surrogate gradient methods have been proposed, their effectiveness under strong adversarial attacks remains unclear. We propose a more reliable framework for evaluating SNN adversarial robustness. We theoretically analyze the degree of gradient vanishing in surrogate gradients and introduce the Adaptive Sharpness Surrogate Gradient (ASSG), which adaptively evolves the shape of the surrogate function according to the input distribution during attack iterations, thereby enhancing gradient accuracy while mitigating gradient vanishing. In addition, we design an adversarial attack with adaptive step size under the $L_\\infty$ constraint\u2014Stable Adaptive Projected Gradient Descent (SA-PGD), achieving faster and more stable convergence under imprecise gradients. Extensive experiments show that our approach substantially increases attack success rates across diverse adversarial training schemes, SNN architectures and neuron models, providing a more generalized and reliable evaluation of SNN adversarial robustness. The experimental results further reveal that the robustness of current SNNs has been significantly overestimated and highlighting the need for more dependable adversarial training methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38348", "url": null, "sourceid": 38974, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39602, "uid": "62dc354827cad0720c72e0d4e147826b", "name": "Tavatar: Topology-Aware Gaussian Attribute Derivation for Animatable Human Avatars", "authors": [{"id": 192456, "fullname": "Hailin Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192456?format=json", "institution": "South China University of Technology"}, {"id": 131243, "fullname": "Yifan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131243?format=json", "institution": "South China University of Technology"}, {"id": 192457, "fullname": "Jiazhi Shu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192457?format=json", "institution": "South China University of Technology"}, {"id": 192458, "fullname": "Zixiong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192458?format=json", "institution": "South China University of Technology"}, {"id": 129517, "fullname": "Qi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/129517?format=json", "institution": "The University of Adelaide"}, {"id": 84870, "fullname": "Qing Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/84870?format=json", "institution": "South China University of Technology"}, {"id": 87321, "fullname": "Mingkui Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87321?format=json", "institution": "South China University of Technology"}], "abstract": "Reconstructing high-fidelity, animatable human avatars from monocular videos remains a critical challenge. Existing 3DGS-based human animation methods constrain Gaussian parameters but exclude scale, which we argue is crucial for adapting human poses to challenging out-of-distribution poses. To achieve robust animation under unseen poses, we propose Tavatar, which derives key parameters such as scale, rotation, and other geometric attributes directly from the local mesh geometry, instead of learning them through unconstrained optimization. This paradigm shift enforces topological consistency by design, as each Gaussian is analytically anchored to the local mesh geometry, inheriting its spatial structure and deformation behavior. Specifically, we bind Gaussians to mesh faces and vertices, deriving their scales and orientations from triangle properties and local edge lengths to ensure coherent surface coverage. To ensure the stability of this analytical mapping, we introduce a crucial equilateral regularization term that preserves mesh integrity. Extensive experiments demonstrate that Tavatar achieves superior animation robustness on challenging out-of-distribution poses, reducing normal error by 13.8\\% on X-Avatar and 17.9\\% on PeopleSnapshot against the best baseline, while maintaining competitive rendering quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39602", "url": null, "sourceid": 40794, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39022, "uid": "3af7a3caf8d2fe05aaf020bdb06f833c", "name": "RevINN: An End-to-End Invertible Neural Network for Reversible Adversarial Examples Generation", "authors": [{"id": 191193, "fullname": "Jielun Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191193?format=json", "institution": "University of Macau"}, {"id": 87613, "fullname": "Chi-Man Pun", "url": "http://cvpr.thecvf.com/api/miniconf/users/87613?format=json", "institution": "University of Macau"}, {"id": 188354, "fullname": "Guoheng Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188354?format=json", "institution": "Guangdong University of Technology"}], "abstract": "Recent studies have shown that Reversible Adversarial Examples (RAE) can mislead unauthorized deep neural networks while remaining usable for authorized users, effectively preventing image data leakage. Existing RAE methods rely on reversibly embedding perturbation information into the original adversarial examples to enable restoration. However, this two-stage process often results in RAEs with inferior attack effectiveness and visual quality compared to the original versions.  To solve these challenges, we propose a novel end-to-end Invertible Neural Network for Reversible Adversarial Examples Generation (RevINN), which directly generates RAEs in one stage by scrambling the intrinsic frequency information of images. Specifically, our RevINN consists of the Cross-Frequency Modulation Attack (CFMA) module and the High-Frequency Perturbation Enhancement (HFPE) module. CFMA selectively exchanges discriminative information between low- and high-frequency wavelet components to achieve adversariality. To fully alter high-frequency semantics, HFPE innovatively employs a tri-branch structure for fine-grained modulation among high-frequency subbands, enhancing perturbation strength. Finally, the modified components are recomposed into RAEs via the inverse wavelet transform. Our RevINN is optimized with adversarial, perceptual, and invertible losses, and can restore images based on the reversibility of the wavelet operations and network modules. Extensive experiments demonstrate that our RevINN achieves state-of-the-art RAE generation quality. The code will be released to the public.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39022", "url": null, "sourceid": 45171, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36917, "uid": "41e915bdc1287ca29865bcfb89f7d9ac", "name": "Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals", "authors": [{"id": 129476, "fullname": "Xiangyu Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/129476?format=json", "institution": "Chinese University of Hong Kong"}, {"id": 177655, "fullname": "Zesong Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/177655?format=json", "institution": "Sensetime"}, {"id": 154673, "fullname": "Zhuguanyu Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154673?format=json", "institution": "Beihang University"}, {"id": 186222, "fullname": "Fanzhou Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186222?format=json", "institution": "SenseTime"}, {"id": 157773, "fullname": "Zhiqian Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/157773?format=json", "institution": "Sensetime"}, {"id": 129470, "fullname": "Tianxiang Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/129470?format=json", "institution": "Sensetime"}, {"id": 84911, "fullname": "Dahua Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/84911?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 139422, "fullname": "RUIHAO GONG", "url": "http://cvpr.thecvf.com/api/miniconf/users/139422?format=json", "institution": "Beihang University"}, {"id": 89773, "fullname": "Lei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89773?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "Distribution Matching Distillation (DMD) distills score-based generative models into efficient one-step generators,without requiring a one-to-one correspondence with the sampling trajectories of their teachers.Yet, the limited capacity of one-step distilled models compromises generative diversity and degrades performance in complex generative tasks, e.g., generating intricate object motions in text-to-video task.Directly extending DMD to multi-step distillation increases memory usage and computational depth, leading to instability and reduced efficiency. While prior works propose stochastic gradient truncation as a potential solution,we observe that it substantially reduces the generative diversity in text-to-image generation and slows motion dynamics in video generation, reducing performance to the level of one-step models.To address these limitations, we propose Phased DMD, a multi-step distillation framework that bridges the idea of phase-wise distillation with Mixture-of-Experts (MoE), reducing learning difficulty while enhancing model capacity.Phased DMD incorporates two key ideas: progressive distribution matching and score matching within subintervals.First, our model divides the SNR range into subintervals, progressively refining the model to higher SNR levels, to better capture complex distributions.Next, to ensure accurate training within each subinterval, we derive rigorous mathematical formulations for the objective.We validate Phased DMD by distilling state-of-the-art image and video generation models, including Qwen-Image-20B and Wan2.2-28B.Experiments demonstrate that Phased DMDenhances motion dynamics, improves visual fidelity in video generation, and increases output diversity in image generation.We will release our code and models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36917", "url": null, "sourceid": 38795, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37064, "uid": "da032cb86f307ca2b97dd05275947650", "name": "CaReFlow: Cyclic Adaptive Rectified Flow for Multimodal Fusion", "authors": [{"id": 186593, "fullname": "Sijie Mai", "url": "http://cvpr.thecvf.com/api/miniconf/users/186593?format=json", "institution": "South China Normal University"}, {"id": 182369, "fullname": "Shiqin Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/182369?format=json", "institution": "South China Normal University"}], "abstract": "Modality gap significantly restricts the effectiveness of multimodal fusion. Previous methods often use techniques such as diffusion models and adversarial learning to reduce the modality gap, but they typically focus on one-to-one alignment without exposing the data points of the source modality to the global distribution information of the target modality. To this end, leveraging the characteristic of rectified flow that can map one distribution to another via a straight trajectory, we extend rectified flow for modality distribution mapping. Specifically, we leverage the 'one-to-many mapping' strategy in rectified flow that allows each data point of source modality to observe the overall target distribution. This also alleviates the issue of insufficient paired data within each sample, enabling a more robust distribution transformation. Moreover, to achieve more accurate distribution mapping and address the ambiguous flow directions in one-to-many mapping, we design 'adaptive relaxed alignment', enforcing stricter alignment for modality pairs belonging to the same sample, while applying relaxed mapping for pairs not belonging to the same sample or category. Additionally, to prevent information loss during distribution mapping, we introduce 'cyclic rectified flow' to ensure the transferred features can be translated back to the original features, allowing multimodal representations to learn sufficient modality-specific information. After distribution alignment, our approach achieves very competitive results on multiple tasks of multimodal affective computing even with a simple fusion method, and visualizations verify that it can effectively reduce modality gap.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37064", "url": null, "sourceid": 39930, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36592, "uid": "696ea2c7271f939ccafc511902e85604", "name": "Breaking the Continuum: Discrete Distribution Learning for Structural MRI Reconstruction", "authors": [{"id": 172797, "fullname": "Tianle Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/172797?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 185420, "fullname": "Mengjingcheng Mo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185420?format=json", "institution": "Chongqing University of Post and Telecommunications"}, {"id": 185421, "fullname": "Ting Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185421?format=json", "institution": "Sichuan University"}, {"id": 185422, "fullname": "Zhen Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/185422?format=json", "institution": "Pengcheng Laboratory"}, {"id": 185423, "fullname": "Zinan Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/185423?format=json", "institution": "Pengcheng Laboratory"}, {"id": 185424, "fullname": "Yanjie Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185424?format=json", "institution": null}], "abstract": "Anatomical structures in MRI exhibit strong spatial priors, including well-defined boundaries, low inter-subject variability, and consistent topology. These properties naturally induce clustered patterns in the latent space, which are difficult to capture using conventional continuous generative priors that assume smooth manifold distributions. To address this limitation, we propose DiCoS (Discrete\u2013Continuous Synthesis), a generative reconstruction framework that integrates discrete structural reasoning with continuous refinement. DiCoS models an anatomy-aware discrete distribution and generates diverse reconstructions in one coarse-to-fine pass through a Discrete Prior Network (DPN). A Dual-domain Balanced Scoring (DBS) mechanism adaptively evaluates candidates using both image-domain fidelity and k-space consistency.  To further enhance realism, Micro Diffusion Cycles (MDC) perform efficient score-guided refinement to enhance texture realism without disturbing global topology. Experiments on the fastMRI knee and brain datasets demonstrate that DiCoS achieves state-of-the-art reconstruction quality with sharper boundaries and improved anatomical consistency. Beyond pixel metrics, segmentation-based evaluations further confirm superior structural overlap and semantic alignment, highlighting DiCoS's advantages in anatomy-aware reconstruction. Code and models will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36592", "url": null, "sourceid": 39130, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39848, "uid": "2367587380fb9491e1d4ce3d8b6463d2", "name": "Focus, Don\u2019t Prune: Identifying Instruction-Relevant Regions for Information-Rich Image Understanding", "authors": [{"id": 168335, "fullname": "Mincheol Kwon", "url": "http://cvpr.thecvf.com/api/miniconf/users/168335?format=json", "institution": "Korea University"}, {"id": 168334, "fullname": "MINSEUNG LEE", "url": "http://cvpr.thecvf.com/api/miniconf/users/168334?format=json", "institution": "Korea University"}, {"id": 192970, "fullname": "Seonga Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192970?format=json", "institution": "Korea University"}, {"id": 192971, "fullname": "Miso Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192971?format=json", "institution": "Korea University"}, {"id": 192972, "fullname": "Kyeong-Jin Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/192972?format=json", "institution": null}, {"id": 192973, "fullname": "Hyunyoung Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/192973?format=json", "institution": "KT Corporation"}, {"id": 192974, "fullname": "Park Cheonyoung", "url": "http://cvpr.thecvf.com/api/miniconf/users/192974?format=json", "institution": "Korea Telecom Research"}, {"id": 192975, "fullname": "Yongho Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/192975?format=json", "institution": "Korea Telecom Research"}, {"id": 69203, "fullname": "Seunghyun Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/69203?format=json", "institution": "NAVER Cloud"}, {"id": 192976, "fullname": "Jinkyu Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/192976?format=json", "institution": "Korea University; Kakao Mobility"}], "abstract": "Large Vision-Language Models (LVLMs) have shown strong performance across various multimodal tasks by leveraging the reasoning capabilities of Large Language Models (LLMs). However, processing visually complex and information-rich images, such as infographics or document layouts, requires these models to generate a large number of visual tokens, leading to significant computational overhead. To address this, we propose PinPoint, a novel two-stage framework that first identifies instruction-relevant image regions and then refines them to extract fine-grained visual features for improved reasoning and efficiency. Central to our approach is the Instruction-Region Alignment, which localizes relevant regions using both visual input and textual instructions. We further introduce new annotations that provide richer ground-truth supervision for instruction-relevant regions across challenging VQA benchmarks: InfographicVQA, MultiPageDocVQA, and SinglePageDocVQA. Experimental results show that PinPoint not only achieves superior accuracy compared to existing methods but also reduces computational overhead by minimizing irrelevant visual tokens. Our code and datasets will be publicly released to facilitate future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39848", "url": null, "sourceid": 37547, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38858, "uid": "0b668d973688aeb13be05aab06902066", "name": "Degradation-Robust Fusion: An Efficient Degradation-Aware Diffusion Framework for Multimodal Image Fusion in Arbitrary Degradation Scenarios", "authors": [{"id": 176746, "fullname": "Yu Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/176746?format=json", "institution": "Hefei University of Technology"}, {"id": 187013, "fullname": "Yu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187013?format=json", "institution": "Hefei University of Technology"}, {"id": 190850, "fullname": "Zhong-Cheng Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190850?format=json", "institution": "Hefei University of Technology"}, {"id": 187262, "fullname": "Juan Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187262?format=json", "institution": "Hefei University of Technology"}, {"id": 127563, "fullname": "Huafeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/127563?format=json", "institution": "Kunmimg University of Science and Technology"}, {"id": 190851, "fullname": "Xun Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190851?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Complex degradations like noise, blur, and low resolution are typical challenges in real-world image fusion tasks, limiting the performance and practicality of existing methods. End-to-end neural network\u2013based approaches are generally simple to design and highly efficient in inference, but their black-box nature leads to limited interpretability. Diffusion-based methods alleviate this to some extent by providing powerful generative priors and a more structured inference process. However, they are trained to learn a single-domain target distribution, whereas fusion lacks natural fused data and relies on modeling complementary information from multiple sources, making diffusion hard to apply directly in practice. To address these challenges, this paper proposes an efficient degradation-aware diffusion framework for image fusion under arbitrary degradation scenarios. Specifically, instead of explicitly predicting noise as in conventional diffusion models, our method performs implicit denoising by directly regressing the fused image, enabling flexible adaptation to diverse fusion tasks under complex degradations with limited steps. Moreover, we design a joint observation model correction mechanism that simultaneously imposes degradation and fusion constraints during sampling to ensure high reconstruction accuracy. Experiments on diverse fusion tasks and degradation configurations demonstrate the superiority of the proposed method under complex degradation scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38858", "url": null, "sourceid": 30800, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40047, "uid": "b70f3694fb81b9a36d7584574de5c73e", "name": "LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight", "authors": [{"id": 85270, "fullname": "Yunze Man", "url": "http://cvpr.thecvf.com/api/miniconf/users/85270?format=json", "institution": "Department of Computer Science, University of Illinois at Urbana-Champaign"}, {"id": 157791, "fullname": "Shihao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157791?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 191034, "fullname": "Guowen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191034?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 84926, "fullname": "Johan Bjorck", "url": "http://cvpr.thecvf.com/api/miniconf/users/84926?format=json", "institution": "Microsoft"}, {"id": 91538, "fullname": "Liangyan Gui", "url": "http://cvpr.thecvf.com/api/miniconf/users/91538?format=json", "institution": "UIUC"}, {"id": 169493, "fullname": "Linxi Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/169493?format=json", "institution": "NVIDIA"}, {"id": 73960, "fullname": "Jan Kautz", "url": "http://cvpr.thecvf.com/api/miniconf/users/73960?format=json", "institution": "NVIDIA"}, {"id": 73909, "fullname": "Yu-Xiong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73909?format=json", "institution": "School of Computer Science, Carnegie Mellon University"}, {"id": 91930, "fullname": "Zhiding Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/91930?format=json", "institution": "NVIDIA"}], "abstract": "To act in the world, a model must name what it sees and know where it is in 3D. Today's vision-language models excel at open-ended 2D description and grounding, yet multi-object 3D detection remains largely missing from the VLM toolbox. We present LocateAnything3D, a VLM-native recipe that casts 3D detection as a next-token prediction problem. The key is a short, explicit Chain-of-Sight (CoS) sequence that mirrors how people reason from images: find an object in 2D, then infer its distance, size, and pose. The decoder first emits 2D detections as a visual chain-of-thought, then predicts 3D boxes under an easy-to-hard curriculum: across objects, a near-to-far order reduces early ambiguity and matches ego-centric utility; within each object, a center-from-camera, dimensions, and rotation factorization ranks information by stability and learnability. This VLM-native interface preserves open-vocabulary and visual-prompting capability without specialized heads. On the challenging Omni3D benchmark, our model achieves state-of-the-art results, with 49.89 AP, surpassing the previous best by +15.51 absolute improvement even when baseline is given ground-truth 2D boxes. It also generalizes zero-shot to held-out categories with strong calibration and robustness. By turning 3D detection into a disciplined next-token problem, LocateAnything3D offers a practical foundation for models to perceive in 3D.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40047", "url": null, "sourceid": 34988, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38949, "uid": "f8ddf6d7c12e62d733cb1ebc55cb5511", "name": "Omni-3DEdit: Generalized Versatile 3D Editing in One-Pass", "authors": [{"id": 71098, "fullname": "Liyi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/71098?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 76302, "fullname": "Pengfei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76302?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 191034, "fullname": "Guowen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191034?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 76238, "fullname": "Zhiyuan Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/76238?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 88043, "fullname": "Lei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88043?format=json", "institution": "The Hong Kong Polytechnic University"}], "abstract": "Most instruction-driven 3D editing methods rely on 2D models to guide the explicit and iterative optimization of 3D representations. This paradigm, however, suffers from two primary drawbacks. First, it lacks a universal design of different 3D editing tasks because the explicit manipulation of 3D geometry necessitates task-dependent rules, e.g., 3D appearance editing demands inherent source 3D geometry, while 3D removal alters source geometry. Second, the iterative optimization process is highly time-consuming, often requiring thousands of invocations of 2D/3D updating. We present Omni-3DEdit, a unified, learning-based model that generalizes various 3D editing tasks implicitly. One key challenge to achieve our goal is the scarcity of paired source-edited multi-view assets for training. To address this issue, we construct a data pipeline, synthesizing a relatively rich number of high-quality paired multi-view editing samples. Subsequently, we adapt the pre-trained generative model SEVA as our backbone by concatenating source view latents along with conditional tokens in sequence space. A dual-stream LoRA module is proposed to disentangle different view cues, largely enhancing our model's representational learning capability. As a learning-based model, our model is free of the time-consuming online optimization, and it can complete various 3D editing tasks in one forward pass, reducing the inference time from tens of minutes to approximately two minutes. Extensive experiments demonstrate the effectiveness and efficiency of Omni-3DEdit.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38949", "url": null, "sourceid": 31673, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39746, "uid": "47570c0e20eec98528284fe7c80c5f13", "name": "Progressive Guessing to Fixed Point: Rethinking Human Motion Prediction with Deep Equilibrium Models", "authors": [{"id": 90483, "fullname": "Dong Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/90483?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 90485, "fullname": "Huaijiang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/90485?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 190983, "fullname": "Fan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190983?format=json", "institution": "Hohai University"}, {"id": 192773, "fullname": "Yuhui Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192773?format=json", "institution": "Qinghai Normal University"}], "abstract": "Many recent human motion prediction methods adopt a multi-stage refinement framework, where each stage produces an initial guess of future poses for the next stage. These guesses are progressively refined towards the target prediction through a sequence of spatial-temporal reasoning stages.However, such a cascaded design incurs large computation and memory overheads that grow at least linearly with network depth, and lack an explicit stopping criteria.In this paper, we propose MotionDEQ, a deep equilibrium motion predictor that reformulates progressive guessing paradigm as a fixed point problem within an implicit layer. This formulation is conceptually equivalent to performing infinitely many refinement steps, but requires only O(1) training memory and can be solved efficiently through any black-box solvers. We carefully design this implicit refinement process by integrating Euclidean geometric transformations into equilibrium learning, allowing the entire network to be equivariant. We also find DEQs naturally fit the real-world scenario where motion data comes streamingly: the converged fixed point can be reused as a warm initial guess, to help recycle the redundant inference computation when making subsequent predictions.Our experiments demonstrate that MotionDEQ achieves the state-of-the-art prediction performances with superior memory efficiency, using fewer than 300K parameters with 55.3mm prediction error at 400ms on the Human3.6M dataset.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39746", "url": null, "sourceid": 36867, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36221, "uid": "e8609464e1813ad2494416cb12676159", "name": "Markovian Scale Prediction: A New Era of Visual Autoregressive Generation", "authors": [{"id": 180239, "fullname": "Yu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180239?format=json", "institution": "Tongji University"}, {"id": 184478, "fullname": "Jingyi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184478?format=json", "institution": "Tongji University"}, {"id": 184479, "fullname": "Yiwei Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/184479?format=json", "institution": "University of Bristol"}, {"id": 184480, "fullname": "Qi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184480?format=json", "institution": "Tongji University"}, {"id": 184481, "fullname": "Duoqian Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184481?format=json", "institution": "Tongji University"}, {"id": 158850, "fullname": "Changwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158850?format=json", "institution": "Qilu University of Technology (Shandong Academy of Sciences)"}, {"id": 184482, "fullname": "Longbing Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184482?format=json", "institution": "Macquarie University"}], "abstract": "Visual AutoRegressive modeling (VAR) based on next-scale prediction has revitalized autoregressive visual generation. Although its full-context dependency, i.e., modeling all previous scales for next-scale prediction, facilitates more stable and comprehensive representation learning by leveraging complete information flow, the resulting computational inefficiency and substantial overhead severely hinder VAR's practicality and scalability. This motivates us to develop a new VAR model with better performance and efficiency without full-context dependency. To address this, we reformulate VAR as a non-full-context Markov process, proposing Markov-VAR. It is achieved via Markovian Scale Prediction: we treat each scale as a Markov state and introduce a sliding window that compresses certain previous scales into a compact history vector to compensate for historical information loss owing to non-full-context dependency. Integrating the history vector with the Markov state yields a representative dynamic state that evolves under a Markov process.  Extensive experiments demonstrate that Markov-VAR is extremely simple yet highly effective: Compared to VAR on ImageNet, Markov-VAR reduces FID by 10.5\\% (256\u00d7256) and decreases peak memory consumption by 83.8\\% (1024\u00d71024). We believe that Markov-VAR can serve as a foundation for future research on visual autoregressive generation and other downstream tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36221", "url": null, "sourceid": 34668, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37532, "uid": "fbee03df75b555bb2da82ebf8d4496dc", "name": "DialogueVPR: Towards Conversational Visual Place Recognition", "authors": [{"id": 101082, "fullname": "yukun Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/101082?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 158850, "fullname": "Changwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158850?format=json", "institution": "Qilu University of Technology (Shandong Academy of Sciences)"}, {"id": 187657, "fullname": "xingtianPei xingtianPei", "url": "http://cvpr.thecvf.com/api/miniconf/users/187657?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 126348, "fullname": "Shibiao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126348?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 187658, "fullname": "Wenhao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187658?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 187659, "fullname": "Shunpeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187659?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 180239, "fullname": "Yu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180239?format=json", "institution": "Tongji University"}, {"id": 187660, "fullname": "Ke Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187660?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 187661, "fullname": "Rongtao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187661?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 187662, "fullname": "Xuxiang Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187662?format=json", "institution": "University of Macau; Aerospace Information Research Institute"}, {"id": 187663, "fullname": "Pengyang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187663?format=json", "institution": "University of Macau"}], "abstract": "Inspired by how humans communicate spatial information, language-guided geo-localization has gained significant traction for its intuitive and practical value. Despite this progress, most methods still rely on a static, one-shot retrieval paradigm, which fails to handle the ambiguity and incompleteness inherent in real-world natural language descriptions. We propose a paradigm shift to reasoning retrieval and introduce Dialogue Place Recognition (DlgPR), which casts localization as an interactive, dialogue-driven reasoning process. To support this new task, we present DlgQuest-Cities, the first large-scale dialogue-based benchmark for place recognition, and a unified reasoning framework that couples a cross-modal multi-level retriever with an intelligent questioner, DQ-pilot. DQ-pilot is trained in a curriculum: supervised fine-tuning on a curated DQ-cities-20k subset  followed by reinforcement refinement on a harder DQ-cities-10k split via GRPO. Two task-aligned metrics guide learning: a Discriminative Difficulty Index (DDI) for curriculum sampling and a Positional Retrieval Gain (PRG) reward that directly measures retrieval improvement induced by a question. Experiments show this reasoning-based approach significantly outperforms baselines. The code will be made publicly available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37532", "url": null, "sourceid": 34499, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38929, "uid": "990c47be825204a64cfe2c32db046120", "name": "Distilling Unsigned Distance Function for Surface Reconstruction from 3D Gaussian Splatting", "authors": [{"id": 181782, "fullname": "Qian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181782?format=json", "institution": "Hohai University"}, {"id": 152257, "fullname": "Rao Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152257?format=json", "institution": "Nanyang Technological University"}, {"id": 178032, "fullname": "Jiangtao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/178032?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 190983, "fullname": "Fan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190983?format=json", "institution": "Hohai University"}], "abstract": "Unsigned distance fields (UDFs) are well suited for representing open surfaces, but learning them from multi-view images is challenging because ground-truth surfaces are unavailable for supervision in most cases and the gradient of a UDF is undefined on the underlying surface. Prior methods optimize UDFs with global objectives and apply gradient-based priors ignoring the non-differentiability for queries on the target surface, which leads to unstable training and over-smoothing on fine details. We address these issues by distilling a patch-based UDF prior, trained on synthetic ground truth algebraic surfaces with closed form expressions, into a lightweight student UDF inside Gaussian optimization process.  We design band-limited knowledge distillation strategy that leverages a pretrained patch-based UDF predictor to provide reliable near-surface UDF supervision, enabling stable student training and the recovery of high-frequency geometric details. In addition, we introduce a visibility- and geometry-aware confidence weighting that modulates teacher influence, further steering the student toward accurate surfaces in ambiguous or weakly constrained regions. Extensive experiments on various datasets demonstrate that our approach consistently improves reconstruction accuracy while maintaining competitive efficiency compared to existing UDF- and SDF-based methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38929", "url": null, "sourceid": 45813, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37643, "uid": "360136214356045a328d50619083ac42", "name": "QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer", "authors": [{"id": 187936, "fullname": "Zhizhen Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187936?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 177740, "fullname": "Hesong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/177740?format=json", "institution": "Westlake University"}, {"id": 87566, "fullname": "Huan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87566?format=json", "institution": "Northeastern University"}], "abstract": "Estimating 3D attributes directly from images has advanced rapidly with the Visual Geometry Grounded Transformer (VGGT), which predicts camera parameters, depth maps, and point clouds in a single forward pass. However, its 1.2B-parameter scale severely limits deployment on resource-constrained platforms such as UAVs and mobile AR devices. To address this limitation, we introduce QVGGT, a tailored quantization framework designed to compress VGGT. Our approach starts from the observation that transformer blocks within VGGT exhibit heterogeneous sensitivity to quantization. We thus analyze per-block quantization sensitivity and propose a selective mixed-precision strategy that allocates higher precision to the most fragile transformer blocks. To address the amplification of quantization error caused by high-variance camera and register tokens, we further introduce token filtering with camera information compensation, which removes these outliers from activation calibration and restores their geometric cues using a PCA-derived global compensation token. Finally, we develop a task-aware scale search mechanism that evaluates candidate quantization scales not only through layer reconstruction but also through multi-head supervision and cross-head geometric consistency among camera poses, depth maps, and point maps.Extensive experiments on multiple geometry perception benchmarks demonstrate that QVGGT achieves near-lossless W4A16 quantization, preserving the accuracy of all 3D prediction heads while delivering 3$\\sim$4.9$\\times$ memory reduction and up to 2.8$\\times$ real hardware speedup over FP32.Our approach makes high-fidelity 3D perception feasible on edge devices, enabling practical deployment of feed-forward 3D reconstruction models in real-world constrained environments.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37643", "url": null, "sourceid": 32007, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38474, "uid": "383b440962aa48e9b4bb8b51701bb76e", "name": "Teaching DINOv3 About Partial 3D Geometry: A Self-Supervised Geometry-Aware Approach", "authors": [{"id": 128314, "fullname": "Viktoria Ehm", "url": "http://cvpr.thecvf.com/api/miniconf/users/128314?format=json", "institution": "Technical University of Munich"}, {"id": 73681, "fullname": "Dongliang Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/73681?format=json", "institution": "University of Bonn"}, {"id": 91088, "fullname": "Riccardo Marin", "url": "http://cvpr.thecvf.com/api/miniconf/users/91088?format=json", "institution": "Eberhard-Karls-Universit\u00e4t T\u00fcbingen"}, {"id": 189931, "fullname": "Daniel Scholz", "url": "http://cvpr.thecvf.com/api/miniconf/users/189931?format=json", "institution": "Technical University of Munich"}, {"id": 127499, "fullname": "Weikang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127499?format=json", "institution": "Rheinische Friedrich-Wilhelms Universit\u00e4t Bonn"}, {"id": 88205, "fullname": "Florian Bernard", "url": "http://cvpr.thecvf.com/api/miniconf/users/88205?format=json", "institution": "University of Bonn"}, {"id": 84985, "fullname": "Daniel Cremers", "url": "http://cvpr.thecvf.com/api/miniconf/users/84985?format=json", "institution": "Technical University Munich"}], "abstract": "Partial shape matching is a crucial yet underexplored problem in 3D vision, with significant relevance to real-world scenarios where shapes are often only partially observed. Existing feature descriptors face difficulties in this setting, as traditional representations either struggle with the boundaries of partial shapes or heavily depend on the shape's spatial position. While existing approaches have employed DINO features for partial shape matching, these features are not inherently suited for handling partial observations. In this work, we propose a method to refine DINO features using LoRA-based self-supervised learning, enabling the generation of feature descriptors that are robust to partiality. Our features substantially improve performance on partial shape matching compared to traditional or vision foundation features. Additionally, when integrated into existing partial shape matching pipelines, we achieve state-of-the-art results on partial shape matching and left-right prediction benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38474", "url": null, "sourceid": 40861, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36735, "uid": "95d40515d78b92d75f485224d51a7ea6", "name": "FedDAP: Domain-Aware Prototype Learning for Federated Learning under Domain Shift", "authors": [{"id": 181700, "fullname": "Huy Quang Le", "url": "http://cvpr.thecvf.com/api/miniconf/users/181700?format=json", "institution": "Kyung Hee University"}, {"id": 185750, "fullname": "Loc Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185750?format=json", "institution": "Kyunghee University"}, {"id": 185751, "fullname": "Yu Qiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185751?format=json", "institution": "Kyung Hee University"}, {"id": 88332, "fullname": "Seong Tae Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/88332?format=json", "institution": "Kyung Hee University"}, {"id": 185752, "fullname": "Eui-Nam Huh", "url": "http://cvpr.thecvf.com/api/miniconf/users/185752?format=json", "institution": "Kyung Hee University"}, {"id": 185753, "fullname": "Choong Seon Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/185753?format=json", "institution": "Kyung Hee University"}], "abstract": "Federated Learning (FL) enables decentralized model training across multiple clients without exposing private data, making it ideal for privacy-sensitive applications. However, in real-world FL scenarios, clients often hold data from distinct domains, leading to severe domain shift and degraded global model performance. To address this, prototype learning has been emerged as a promising solution, which leverages class-wise feature representations. Yet, existing methods face two key limitations: (1) Existing prototype-based FL methods typically construct a $\\textit{single global prototype}$ per class by aggregating local prototypes from all clients without preserving domain information.  (2) Current feature-prototype alignment is $\\textit{domain-agnostic}$, forcing clients to align with global prototypes regardless of domain origin. To address these challenges, we propose Federated Domain-Aware Prototypes  (FedDAP) to construct domain-specific global prototypes by aggregating local client prototypes within the same domain using a similarity-weighted fusion mechanism. These global domain-specific prototypes are then used to guide local training by aligning local features with prototypes from the same domain, while encouraging separation from prototypes of different domains. This dual alignment enhances domain-specific learning at the local level and enables the global model to generalize across diverse domains. Finally, we conduct extensive experiments on three different datasets: DomainNet, Office-10, and PACS to demonstrate the effectiveness of our proposed framework to address the domain shift challenges.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36735", "url": null, "sourceid": 30704, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37904, "uid": "292ba5ccb0af067365526de09c211af7", "name": "High-Fidelity Virtual Try-On beyond Paired Data Scarcity via Diffusion-based Cycle-Consistent Learning", "authors": [{"id": 181087, "fullname": "Jia Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181087?format=json", "institution": "Alibaba International Digital Commerce Group"}, {"id": 188541, "fullname": "Yijing Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/188541?format=json", "institution": "Harbin Institute of Technology"}, {"id": 126590, "fullname": "Tingfeng Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/126590?format=json", "institution": "South China University of Technology"}, {"id": 188542, "fullname": "Meiling Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188542?format=json", "institution": "Alibaba Group"}, {"id": 188543, "fullname": "Tao Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188543?format=json", "institution": "Alibaba Group"}, {"id": 188544, "fullname": "Jian Dong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188544?format=json", "institution": "Alibaba Group"}, {"id": 130758, "fullname": "Guangming Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130758?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 188545, "fullname": "Xiaoyi Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188545?format=json", "institution": null}], "abstract": "Diffusion-based virtual try-on methods rely on vast high-quality garment-person pairs, which are scarce in practice due to the high cost of data collection and preprocessing, limiting their performance in real-world scenarios.To overcome this bottleneck, we propose Cycle-Consistent Virtual Try-On (CCVTON), a diffusion-based approach that enables effective training using massive in-the-wild person images. Specifically, CCVTON introduces a Cycle-Consistent Learning (CCL) strategy that just employs a single unified generative model to disentangle a garment from a person image (try-off branch) and transfer it to the same individual (try-on branch), forming a reconstruction cycle. To this end, we first warm up a Unified Diffusion Transformer (UDiT) on open-source paired data to acquire basic try-on and try-off capabilities. When adapting UDiT to in-the-wild person images, we employ a Multi-Criteria Filtering Operation to select high-quality garments disentangled from person images by the pretrained UDiT. These filtered garments are not used as inputs for CCL, but serve as soft constraints for a perceptual regularization loss, preventing the try-off branch from collapsing to trivial copying. In addition, we propose a garment-aware mask generation with a two-stage refinement process to suppress garment leakage while maintaining person consistency.Extensive experiments show that CCVTON achieves state-of-the-art results.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37904", "url": null, "sourceid": 44329, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37244, "uid": "9a0ee07cccaf34531e452c4775245b1f", "name": "Physically-Grounded Turbulence Mitigation with Frame-Shared Degradation Parameters", "authors": [{"id": 186998, "fullname": "Dongxin Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/186998?format=json", "institution": "South China University of Technology"}, {"id": 182954, "fullname": "Yan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182954?format=json", "institution": "South China University of Technology"}, {"id": 126780, "fullname": "Yong Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126780?format=json", "institution": "Peng Cheng Laboratory"}, {"id": 86218, "fullname": "Hui Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/86218?format=json", "institution": "National University of Singapore"}], "abstract": "Atmospheric turbulence severely degrades long-range images with distortions and blur, hindering downstream applications. While supervised methods rely on synthetic data with limited real-world generalization, existing unsupervised approaches often ignore the underlying physics, leading to suboptimal restoration. We propose TMFS, an optimization-based and physically-grounded approach for unsupervised turbulence mitigation. The method operates by optimizing an imaging model with frame-shared degradation parameters under physically-motivated regularization. Inspired by sampling procedures in physical simulators, the degradation parameters are further decomposed into a frame-shared correlation function and per-frame noise maps. TMFS gains a strong inductive bias that improves generalization and mitigates overfitting. In extensive experiments, TMFS achieves state-of-the-art results among unsupervised methods.  In contrast, supervised methods show a significant domain gap on real data, thereby validating the advantage of our physics-aware, unsupervised approach.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37244", "url": null, "sourceid": 43339, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38206, "uid": "13f4d2015796000f8216b3d4c488ea9f", "name": "MOSAIC3D:Modular Scene Assembly for Real-Time 3D Segment Anything", "authors": [{"id": 189306, "fullname": "Peng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189306?format=json", "institution": "Renmin University of China"}, {"id": 155015, "fullname": "Yongcai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155015?format=json", "institution": "Renmin University of China"}, {"id": 189307, "fullname": "Wang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189307?format=json", "institution": "Renmin University of China"}, {"id": 189308, "fullname": "Hualong Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189308?format=json", "institution": "Renmin University of China"}, {"id": 189309, "fullname": "Kang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189309?format=json", "institution": "Renmin University of China"}, {"id": 189310, "fullname": "Chunxu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189310?format=json", "institution": "Renmin University of China"}, {"id": 189311, "fullname": "Wen Jie", "url": "http://cvpr.thecvf.com/api/miniconf/users/189311?format=json", "institution": null}, {"id": 155017, "fullname": "Deying Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155017?format=json", "institution": "Renmin University of China"}], "abstract": "Online 3D instance segmentation is a critical capability for embodied agents navigating in dynamic environments. However, a fundamental challenge remains in adapting powerful 2D foundation models, like SAM, to 3D online segmentation. Naively lifting SAM's 2D masks to 3D results in severe spatial fragmentation, where a single object is shattered into multiple disconnected parts, especially under occlusion. Subsequent attempts to link these fragments over time via conventional 3D IoU-based tracking prove highly fragile: they struggle to handle occlusions or topological changes, ultimately causing catastrophic identity drift. Departing from such post-processing approaches, we reframe online segmentation as a learnable composition problem. We introduce MOSAIC3D, a differentiable framework that treats SAM-derived masks as \"mosaic tiles\" and learns to assemble them into temporally consistent 3D instances. MOSAIC3D comprises two key components: Fragment-to-Instance Adaptive Assembly that aggregates fragments through soft-gated attention, and Instance-to-Scene Online Merging that employs cascaded semantic-geometric matching to preserve object identities\u2014replacing rigid IoU thresholds with learnable association guided by observation maturity. Evaluations on ScanNet, ScanNet200, SceneNN and 3RScan datasets demonstrate state-of-the-art performance and zero-shot cross-dataset generalization. Extensive ablation studies validate the effectiveness of the designed modules. The code will be available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38206", "url": null, "sourceid": 44770, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37877, "uid": "d6c27078830ccbe122a310ec6d3e52b9", "name": "Inference-time Physics Alignment of Video Generative Models with Latent World Models", "authors": [{"id": 181452, "fullname": "Jianhao Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/181452?format=json", "institution": "University of Oxford / Meta"}, {"id": 188473, "fullname": "Zhang Xiaofeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188473?format=json", "institution": "Facebook; Mila - Quebec Artificial Intelligence Institute; Universit\u00e9 de Montr\u00e9al"}, {"id": 188474, "fullname": "Felix Friedrich", "url": "http://cvpr.thecvf.com/api/miniconf/users/188474?format=json", "institution": "Meta AI"}, {"id": 188475, "fullname": "Nicolas Beltran-Velez", "url": "http://cvpr.thecvf.com/api/miniconf/users/188475?format=json", "institution": "Columbia University; Facebook"}, {"id": 150972, "fullname": "Melissa Hall", "url": "http://cvpr.thecvf.com/api/miniconf/users/150972?format=json", "institution": "FAIR (Meta)"}, {"id": 137286, "fullname": "Reyhane Askari", "url": "http://cvpr.thecvf.com/api/miniconf/users/137286?format=json", "institution": "FAIR"}, {"id": 188476, "fullname": "Xiaochuang Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/188476?format=json", "institution": "Meta FAIR"}, {"id": 84547, "fullname": "Nicolas Ballas", "url": "http://cvpr.thecvf.com/api/miniconf/users/84547?format=json", "institution": "Facebook"}, {"id": 150973, "fullname": "Michal Drozdzal", "url": "http://cvpr.thecvf.com/api/miniconf/users/150973?format=json", "institution": "Meta"}, {"id": 139063, "fullname": "Adriana Romero-Soriano", "url": "http://cvpr.thecvf.com/api/miniconf/users/139063?format=json", "institution": "Mila; McGill University; Meta"}], "abstract": "State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search and steer multiple candidate denoising trajectories, enabling scaling test-time compute for better generation performance. Empirically, our approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, with validation from human preference study. Notably, on the challenging PhysicsIQ benchmark we achieve 62.00% final score, outperforming previous state of the art by 6.78%. Our work demonstrates the viability of using latent world models to improve physical plausibility of video generation, beyond this specific instantiation or parameterization.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37877", "url": null, "sourceid": 35878, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39019, "uid": "d0f38389c3df6e16b175a6a31b9702d4", "name": "Edge-Focused Super-Resolution for Omnidirectional Images with Spherical Geometric Augmentation", "authors": [{"id": 191188, "fullname": "Shaolin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191188?format=json", "institution": "Southwest University"}, {"id": 191189, "fullname": "Yuying Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191189?format=json", "institution": "Southwest University"}, {"id": 153509, "fullname": "Lei Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/153509?format=json", "institution": "University of Edinburgh"}, {"id": 191190, "fullname": "Shigang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191190?format=json", "institution": "Hiroshima City University"}, {"id": 133500, "fullname": "Jianfeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/133500?format=json", "institution": "Southwest University"}], "abstract": "Omnidirectional image super-resolution (ODISR) remains challenging due to extreme magnification factors (e.g., 8\u00d7, 16\u00d7) and projection-specific distortions, which degrade edge integrity and limit model performance. This paper proposes an edge-focused framework combined with spherical geometric augmentation to address these issues. Our approach includes an Edge Focused Block (EFB) that integrates spatial-channel attention via Edge Enhanced and Refined Blocks, strengthening edge feature capture and optimization. We also design an Edge-Aware Multi-Scale (EAM) pipeline, leveraging shallow convolutions for initial feature extraction, local modules for deep mining, and a Global Integration Block for multi-scale aggregation, ensuring coherent edge reconstruction in distorted regions. To mitigate data scarcity, we introduce a rotation-translation augmentation strategy based on spherical projections, expanding datasets while preserving scene continuity. Extensive experiments show our method outperforms state-of-the-art approaches on public datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39019", "url": null, "sourceid": 44989, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36280, "uid": "827b816da4b81bd040387b27f62a434c", "name": "Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning", "authors": [{"id": 135840, "fullname": "Wentao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/135840?format=json", "institution": "State University of New York at Stony Brook"}, {"id": 184664, "fullname": "Weimin Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184664?format=json", "institution": null}, {"id": 178218, "fullname": "Peiliang Lou", "url": "http://cvpr.thecvf.com/api/miniconf/users/178218?format=json", "institution": "Mayo Clinic"}, {"id": 184665, "fullname": "Qingqiao Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184665?format=json", "institution": "State University of New York at Stony Brook"}, {"id": 155324, "fullname": "Xiaoling Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155324?format=json", "institution": "MGH and Harvard Medical School"}, {"id": 134616, "fullname": "Shahira Abousamra", "url": "http://cvpr.thecvf.com/api/miniconf/users/134616?format=json", "institution": "Stanford University"}, {"id": 184666, "fullname": "Wenchao Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/184666?format=json", "institution": "Mayo Clinic"}, {"id": 184667, "fullname": "Ruifeng Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/184667?format=json", "institution": "Mayo Clinic"}, {"id": 156079, "fullname": "Jiawei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/156079?format=json", "institution": "Stony Brook University"}, {"id": 86841, "fullname": "Chao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/86841?format=json", "institution": "State University of New York, Stony Brook"}, {"id": 184668, "fullname": "Chen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184668?format=json", "institution": "Mayo Clinic"}], "abstract": "Computational pathology has advanced rapidly in recent years, driven by domain-specific image encoders and growing interest in using vision\u2013language models to answer natural-language questions about diseases. Yet, the core problem behind pathology question-answering remains unsolved, considering that a gigapixel slide contains far more information than necessary for a given question. Pathologists naturally navigate tissue and morphology complexity by scanning broadly, and zooming in selectively according to the clinical questions. Current models, in contrast, rely on uniform patch sampling or broad attention maps, often attending equally to irrelevant regions while overlooking key visual evidence. In this work, we try to bring models closer to how humans actually examine slides. We propose a question-guided, tissue-aware, and coarse-to-fine retrieval framework, HistoSelect, that consists of two key components: a group sampler that identifies question-relevant tissue regions, followed by a patch selector that retrieves the most informative patches within those regions. By selecting only the most informative patches, our method becomes significantly more efficient: reducing visual token usage by 70% on average, while improving accuracy across three pathology QA tasks. Evaluated on 356,000 question\u2013answer pairs, our approach outperforms existing methods and produces answers grounded in interpretable, pathologist-consistent regions. Our results suggest that bringing human-like search and attention patterns into WSI reasoning is a promising direction for building practical and reliable pathology VLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36280", "url": null, "sourceid": 46175, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39066, "uid": "c5475a756076c9675123cee9448923f5", "name": "Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs", "authors": [{"id": 191288, "fullname": "Zhikang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191288?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 156164, "fullname": "Qianqian Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156164?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 185856, "fullname": "Zitai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185856?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 191289, "fullname": "Cong Hua", "url": "http://cvpr.thecvf.com/api/miniconf/users/191289?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 191290, "fullname": "Sicong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191290?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 185534, "fullname": "Zhiyong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185534?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 85019, "fullname": "Qingming Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85019?format=json", "institution": "University of Chinese Academy of Sciences"}], "abstract": "Out-of-distribution (OOD) detection seeks to identify samples from unknown classes, a critical capability for deploying machine learning models in open-world scenarios. Recent research has demonstrated that Vision-Language Models (VLMs) can effectively leverage their multi-modal representations for OOD detection. However, current methods often incorporate \\textbf{intra-modal} distance during OOD detection, such as comparing negative texts with ID labels or comparing test images with image proxies. This design paradigm creates an inherent inconsistency against the \\textbf{inter-modal} distance that CLIP-like VLMs are optimized for, potentially leading to suboptimal performance. To address this limitation, we propose InterNeg, a simple yet effective framework that systematically utilizes consistent inter-modal distance enhancement from textual and visual perspectives. From the textual perspective, we devise an inter-modal criterion for selecting negative texts. From the visual perspective, we dynamically identify high-confidence OOD images and invert them into the textual space, generating extra negative text embeddings guided by inter-modal distance. Extensive experiments across multiple benchmarks demonstrate the superiority of our approach. Notably, our InterNeg achieves state-of-the-art performance compared to existing works, with a 3.47\\% reduction in FPR95 on the large-scale ImageNet benchmark and a 5.50\\% improvement in AUROC on the challenging Near-OOD benchmark.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39066", "url": null, "sourceid": 42992, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38787, "uid": "5ec11c9c51904e1a37a76d1acc14aa0f", "name": "LoST: Level of Semantics Tokenization for 3D Shapes", "authors": [{"id": 127412, "fullname": "Niladri Shekhar Dutt", "url": "http://cvpr.thecvf.com/api/miniconf/users/127412?format=json", "institution": "University College London; Ready Player Me"}, {"id": 76625, "fullname": "Zifan Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/76625?format=json", "institution": "HKUST"}, {"id": 85663, "fullname": "Paul Guerrero", "url": "http://cvpr.thecvf.com/api/miniconf/users/85663?format=json", "institution": "Adobe Systems"}, {"id": 86072, "fullname": "Chun-Hao P. Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86072?format=json", "institution": "Adobe Systems"}, {"id": 86074, "fullname": "Duygu Ceylan", "url": "http://cvpr.thecvf.com/api/miniconf/users/86074?format=json", "institution": "Adobe Systems"}, {"id": 85661, "fullname": "Niloy J. Mitra", "url": "http://cvpr.thecvf.com/api/miniconf/users/85661?format=json", "institution": "University College London"}, {"id": 88780, "fullname": "Xuelin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88780?format=json", "institution": "Adobe Research"}], "abstract": "Tokenization is a fundamental technique in the generative modeling of various modalities. In particular, it plays a critical role in autoregressive (AR) models, which have recently emerged as a compelling option for 3D generation.However, optimal tokenization of 3D shapes remains an open question. State-of-the-art (SOTA) methods primarily rely on geometric level-of-detail (LoD) hierarchies, originally designed for rendering and compression. These spatial hierarchies are often token-inefficient and lack semantic coherence for AR modeling. We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience, such that early prefixes decode into complete, plausible shapes that possess principal semantics, while subsequent tokens refine instance-specific geometric and semantic details. To train LoST, we introduce Relational Inter-Distance Alignment (RIDA), a novel 3D semantic alignment loss, inspired by relational knowledge distillation, that aligns the relational structure of the 3D shape latent space with that of the semantic DINO feature space. Experiments show that LoST achieves SOTA reconstruction, surpassing previous LoD-based 3D shape tokenizers by large margins on both geometric and semantic reconstruction metrics. Moreover, LoST achieves efficient, high-quality AR 3D generation and enables downstream tasks like semantic retrieval, while using only 0.1%\u201310% of the tokens needed by prior 3D AR models. Code will be released to facilitate future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38787", "url": null, "sourceid": 31731, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39695, "uid": "dbc47b8c1f31c559de985ed898f644de", "name": "UCAN: Unified Convolutional Attention Network for Expansive Receptive Fields in Lightweight Super-Resolution", "authors": [{"id": 179935, "fullname": "Thien Tan Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/179935?format=json", "institution": "Ho Chi Minh City University of Technology"}, {"id": 192665, "fullname": "Phan Thi Thu Trang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192665?format=json", "institution": "Hanoi University of Science and Technology"}, {"id": 192666, "fullname": "Duc N. Do", "url": "http://cvpr.thecvf.com/api/miniconf/users/192666?format=json", "institution": "University of Manitoba"}, {"id": 192667, "fullname": "Ho Anh", "url": "http://cvpr.thecvf.com/api/miniconf/users/192667?format=json", "institution": null}, {"id": 180363, "fullname": "Nguyen Duc Dung", "url": "http://cvpr.thecvf.com/api/miniconf/users/180363?format=json", "institution": null}, {"id": 192668, "fullname": "Duc Dung Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192668?format=json", "institution": "Bach Khoa University (HCMUT)"}], "abstract": "Hybrid CNN-Transformer architectures achieve strong results in image super-resolution, but scaling attention windows or convolution kernels significantly increases computational cost, limiting deployment on resource-constrained devices. We present UCAN, a lightweight network that unifies convolution and attention to expand the effective receptive field efficiently. UCAN combines window-based spatial attention with a Hedgehog Attention mechanism to model both local texture and long-range dependencies, and introduces a distillation-based large-kernel module to preserve high-frequency structure without heavy computation. In addition, we employ cross-layer parameter sharing to further reduce complexity. On Manga109 ($4\\times$), UCAN-L achieves 31.63 dB PSNR with only 48.4G MACs, surpassing recent lightweight models. On BSDS100, UCAN attains 27.79 dB, outperforming methods with significantly larger models. Extensive experiments show that UCAN achieves a superior trade-off between accuracy, efficiency, and scalability, making it well-suited for practical high-resolution image restoration. The code will be published later.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39695", "url": null, "sourceid": 40037, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39213, "uid": "c56aa2102f060ad7471fbefe5e296c92", "name": "DriveCTR: Benchmarking Compositional Traffic Rule Reasoning in Autonomous Driving", "authors": [{"id": 180547, "fullname": "Enhui Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/180547?format=json", "institution": "Westlake University"}, {"id": 191600, "fullname": "Jiahuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191600?format=json", "institution": "Tianjin University"}, {"id": 191601, "fullname": "Guantian Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191601?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 71020, "fullname": "Tao Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/71020?format=json", "institution": "SYSU"}, {"id": 191602, "fullname": "Shengbo Eben Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191602?format=json", "institution": "Tsinghua University"}, {"id": 191603, "fullname": "Yuhang Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191603?format=json", "institution": "The University of Hong Kong"}, {"id": 191604, "fullname": "xia zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191604?format=json", "institution": "Li Auto Inc."}, {"id": 150336, "fullname": "Xueyang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150336?format=json", "institution": "Li Auto Inc."}, {"id": 153217, "fullname": "Yifei Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153217?format=json", "institution": "Li Auto Inc."}, {"id": 153218, "fullname": "Kun Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153218?format=json", "institution": "LiAuto"}, {"id": 185069, "fullname": "Zhihui Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185069?format=json", "institution": "Li Auto Inc."}, {"id": 153220, "fullname": "XianPeng Lang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153220?format=json", "institution": "LiAuto"}, {"id": 86349, "fullname": "Kaicheng Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86349?format=json", "institution": "Alibaba Group"}], "abstract": "Multimodal Large Language Models (MLLMs) are rapidly becoming the intelligence brain of end-to-end autonomous driving systems. A key challenge is to assess whether MLLMs can truly understand and follow complex real-world traffic rules. However, existing benchmarks mainly focus on single-rule scenarios like traffic sign recognition, neglecting the complexity of multi-rule concurrency and conflicts in real driving. Consequently, models perform well on simple tasks but often fail or violate rules in real world complex situations. To bridge this gap, we propose **DriveCTR**, a text and vision-based benchmark for compositional traffic rule reasoning. Inspired by human drivers\u2019 cognitive development, we propose a systematic **Five-Level Cognitive Ladder** that evaluates reasoning from single-rule understanding to multi-rule integration and conflict resolution, enabling quantitative assessment across cognitive stages.  We further propose a **Rule2Scene Agent** that maps language-based traffic rules to dynamic driving scenes through rule crafting and scene generation, enabling scene-level traffic rule visual reasoning. Evaluations of 14 mainstream MLLMs reveal a pronounced decline in performance as task difficulty increases, especially in the rule conflict resolution task. After splitting the dataset and fine-tuning on the training set, we further observe substantial improvements in both traffic rule reasoning and downstream planning capabilities. These results highlight the effectiveness of DriveCTR in advancing compliant and intelligent autonomous driving systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39213", "url": null, "sourceid": 30676, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39388, "uid": "8bee20f309dc54c893b46f6325a2d2df", "name": "Bi-Bridge: Bidirectional Diffusion Bridges for Low-Light Image Enhancement", "authors": [{"id": 174840, "fullname": "Zeyu Hua", "url": "http://cvpr.thecvf.com/api/miniconf/users/174840?format=json", "institution": "Northeast Normal University"}, {"id": 181571, "fullname": "HUI LI", "url": "http://cvpr.thecvf.com/api/miniconf/users/181571?format=json", "institution": "HONOR Device Co., Ltd."}, {"id": 153457, "fullname": "Yu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153457?format=json", "institution": "Honor"}, {"id": 152897, "fullname": "Song Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152897?format=json", "institution": "Shenzhen University of Advanced Technology"}, {"id": 152896, "fullname": "Congchao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152896?format=json", "institution": "Honor"}, {"id": 191975, "fullname": "Caixia Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191975?format=json", "institution": "Northeast Normal University"}], "abstract": "Low-Light Image Enhancement (LLIE) is a challenging task, as severe information loss means a single input can correspond to multiple plausible restorations. This inherent ambiguity causes conventional regression-based models to produce overly-smooth results that lack detail. While recent generative models can create richer details, their common unidirectional design often compromises content fidelity by distorting original structures. We introduce Bi-Bridge, a unified framework that models both enhancement and its inverse degradation within a single symmetric diffusion bridge. By compelling the network to preserve essential content structures across both transformations, this bidirectional learning acts as a powerful constraint, leading to significantly more faithful and realistic restorations. Extensive experiments show that Bi-Bridge outperforms state-of-the-art (SOTA) methods across multiple benchmarks, establishing a new standard for fidelity and perceptual quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39388", "url": null, "sourceid": 41935, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39153, "uid": "cbfad8cfc479906fe118a2696d34ee4f", "name": "OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models", "authors": [{"id": 175962, "fullname": "tengjin Weng", "url": "http://cvpr.thecvf.com/api/miniconf/users/175962?format=json", "institution": "Shenzhen university"}, {"id": 191462, "fullname": "Wenhao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191462?format=json", "institution": "Guangming Laboratory"}, {"id": 191463, "fullname": "Jingyi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191463?format=json", "institution": "Tsinghua University"}, {"id": 152147, "fullname": "Ming Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152147?format=json", "institution": "Guangming Laboratory"}, {"id": 88103, "fullname": "Lin Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/88103?format=json", "institution": "Meituan"}, {"id": 191464, "fullname": "Zhong Ming", "url": "http://cvpr.thecvf.com/api/miniconf/users/191464?format=json", "institution": "Shenzhen University"}], "abstract": "Multimodal large language models (MLLMs) have achieved remarkable performance across a wide range of vision\u2013language tasks. However, their ability in low-level visual perception, particularly in detecting fine-grained visual discrepancies, remains underexplored and lacks systematic analysis.In this work, we introduce OddGridBench, a controllable benchmark for evaluating the visual discrepancy sensitivity of MLLMs. OddGridBench comprises over 1,400 grid-based images, where a single element differs from all others by one or multiple visual attributes such as color, size, rotation, or position.Experiments reveal that all evaluated MLLMs, including open-source families such as Qwen3-VL and InternVL3.5, and proprietary systems like Gemini-2.5-Pro and GPT-5, perform far below human levels in visual discrepancy detection. We further propose OddGrid-GRPO, a reinforcement learning framework that integrates curriculum learning and distance-aware reward. By progressively controlling the difficulty of training samples and incorporating spatial proximity constraints into the reward design, OddGrid-GRPO significantly enhances the model\u2019s fine-grained visual discrimination ability.We hope OddGridBench and OddGrid-GRPO will lay the groundwork for advancing perceptual grounding and visual discrepancy sensitivity in multimodal intelligence.All resources will be publicly released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39153", "url": "https://wwwtttjjj.github.io/OddGridBench/", "sourceid": 46019, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65723, "file": "/media/PosterPDFs/CVPR%202026/39153-thumb.png", "modified": "2026-04-27T06:17:40.688848-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": false, "sortkey": 0, "is_live_content": false, "detailed_kind": "thumb", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65722, "file": "/media/PosterPDFs/CVPR%202026/39153.png", "modified": "2026-04-27T06:17:40.498932-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": true, "sortkey": 0, "is_live_content": false, "detailed_kind": "", "generated_from": null, "resourcetype": "EventmediaImageFile"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40150, "uid": "1042ee016ad60422f90a4a82ddd9fa40", "name": "Unsupervised Monocular 3D Keypoint Discovery from Multi-View Diffusion Priors", "authors": [{"id": 193642, "fullname": "Subin Jeon", "url": "http://cvpr.thecvf.com/api/miniconf/users/193642?format=json", "institution": "Yonsei University"}, {"id": 193643, "fullname": "In Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/193643?format=json", "institution": "Yonsei University"}, {"id": 193644, "fullname": "Junyoung Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/193644?format=json", "institution": "Yonsei University"}, {"id": 193645, "fullname": "Woong Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/193645?format=json", "institution": "Yonsei University"}, {"id": 89061, "fullname": "Seon Joo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/89061?format=json", "institution": "Yonsei University"}], "abstract": "Most existing 3D keypoint estimation methods rely on manual annotations or calibrated multi-view images, both of which are expensive to collect.This paper introduces KeyDiff3D, a framework that can accurately predict 3D keypoints from a single image, thus eliminating the need for such expensive data acquisitions.To achieve this, we leverage powerful geometric priors embedded in a pretrained multi-view diffusion model.In our framework, the diffusion model generates multi-view images from a single image, serving as supervision signals to provide 3D geometric cues to our model.We also introduce a 3D feature extractor that transforms implicit 3D priors embedded in the diffusion features into explicit 3D feature volumes.Beyond accurate keypoint estimation, we further introduce a pipeline that enables manipulation of 3D objects generated by the diffusion model.Experimental results on diverse datasets, including Human3.6M, CUB-200-2011, Stanford Dogs, and several in-the-wild and out-of-domain inputs, highlight the effectiveness of our method in terms of accuracy, generalization, and its ability to enable manipulation of 3D objects generated by the diffusion model from a single image.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40150", "url": null, "sourceid": 38291, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38297, "uid": "831fc8fc3df63a054cda8cc5b32a95f7", "name": "The Universal Normal Embedding", "authors": [{"id": 189539, "fullname": "Chen Tasker", "url": "http://cvpr.thecvf.com/api/miniconf/users/189539?format=json", "institution": "Technion - Israel Institute of Technology, Technion"}, {"id": 180248, "fullname": "Roy Betser", "url": "http://cvpr.thecvf.com/api/miniconf/users/180248?format=json", "institution": "Technion - Israel Institute of Technology, Technion; Fujitsu Research of Europe"}, {"id": 189540, "fullname": "Eyal Gofer", "url": "http://cvpr.thecvf.com/api/miniconf/users/189540?format=json", "institution": "Electrical Engineering Department, Technion \u2013 Israel Institute of Technology,"}, {"id": 186994, "fullname": "Meir Yossef Levi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186994?format=json", "institution": "Technion - Israel Institute of Technology, Technion - Israel Institute of Technology"}, {"id": 77507, "fullname": "Guy Gilboa", "url": "http://cvpr.thecvf.com/api/miniconf/users/77507?format=json", "institution": "Technion"}], "abstract": "Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles. Yet, they share a fundamental property: latent space Gaussianity. Generative models map Gaussian noise to images, while encoders map images to semantic embeddings whose coordinates empirically behave as Gaussian.We hypothesize that both are views of a shared latent source, the Universal Normal Embedding (UNE): an approximately Gaussian latent space from which encoder embeddings and DDIM-inverted noise arise as noisy linear projections.To test our hypothesis, we introduce NoiseZoo, a dataset of per-image latents comprising DDIM-inverted diffusion noise and matching encoder representations (CLIP, DINO).On CelebA, linear probes in both spaces yield strong, aligned attribute predictions, indicating that generative noise encodes meaningful semantics along linear directions.These directions further enable faithful, controllable edits (e.g., smile, gender, age) without architectural changes, where simple orthogonalization mitigates spurious entanglements.Taken together, our results provide empirical support for the UNE hypothesis and reveal a shared Gaussian-like latent geometry that concretely links encoding and generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38297", "url": null, "sourceid": 34981, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37048, "uid": "1d6f9ad9c60c2f65d59a06c1f56e75bd", "name": "NeuroSeg Meets DINOv3: Transferring 2D Self-Supervised Visual Priors to 3D Neuron Segmentation via DINOv3 Initialization", "authors": [{"id": 186563, "fullname": "Yik Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186563?format=json", "institution": "University of Sydney, University of Sydney"}, {"id": 137241, "fullname": "Runkai Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/137241?format=json", "institution": null}, {"id": 90101, "fullname": "Weidong Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/90101?format=json", "institution": "The University of Sydney"}], "abstract": "2D visual foundation models, such as DINOv3, a self-supervised model trained on large-scale natural images, have demonstrated strong zero-shot generalization, capturing rich global context and fine-grained structural cues. However, an analogous 3D foundation model for downstream volumetric neuroimaging remains lacking, largely due to the challenges of 3D image acquisition and the scarcity of high-quality annotations. To address this gap, we propose to adapt the 2D visual representations learned by DINOv3 to a 3D biomedical segmentation model, enabling more data-efficient and morphologically faithful neuronal reconstruction. Specifically, we design an inflation-based adaptation strategy that inflates 2D filters into 3D operators, preserving semantic priors from DINOv3 while adapting to 3D neuronal volume patches. In addition, we introduce a topology-aware skeleton loss to explicitly enforce structural fidelity of graph-based neuronal arbor reconstruction. Extensive experiments on four neuronal imaging datasets, including two from BigNeuron and two public datasets, NeuroFly and CWMBS, demonstrate consistent improvements in reconstruction accuracy over SoTA methods, with average gains of 2.9% in Entire Structure Average, 2.8% in Different Structure Average, and 3.8% in Percentage of Different Structure.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37048", "url": null, "sourceid": 39918, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36212, "uid": "e221f76a7e0db31148d3229797c8e3c3", "name": "Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark", "authors": [{"id": 180361, "fullname": "Ke Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180361?format=json", "institution": "University of Science and Technology of China"}, {"id": 184463, "fullname": "Xuanhua He", "url": "http://cvpr.thecvf.com/api/miniconf/users/184463?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 183010, "fullname": "Xueheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183010?format=json", "institution": "University of Science and Technology of China"}, {"id": 76217, "fullname": "Lingting Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76217?format=json", "institution": "The University of Hong Kong"}, {"id": 184464, "fullname": "Yingying Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184464?format=json", "institution": "Xiamen University"}, {"id": 184465, "fullname": "Ao Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/184465?format=json", "institution": "JD.com"}, {"id": 184466, "fullname": "Zhanjie Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184466?format=json", "institution": "Zhejiang University"}, {"id": 86574, "fullname": "Man Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/86574?format=json", "institution": "University of Science and Technology of China"}, {"id": 88254, "fullname": "Chengjun Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/88254?format=json", "institution": "University of Science and Technology of China"}, {"id": 184467, "fullname": "Jie Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184467?format=json", "institution": null}], "abstract": "Pansharpening aims to generate high-resolution multi-spectral images by fusing the spatial detail of panchromatic images with the spectral richness of low-resolution MS data. However, most existing methods are evaluated under limited, low-resolution settings, limiting their generalization to real-world, high-resolution scenarios. To bridge this gap, we systematically investigate the data, algorithmic, and computational challenges of cross-scale pansharpening. We first introduce PanScale, the first large-scale, cross-scale pansharpening dataset, accompanied by PanScale-Bench, a comprehensive benchmark for evaluating generalization across varying resolutions and scales. To realize scale generalization, we propose ScaleFormer, a novel architecture designed for multi-scale pansharpening. ScaleFormer reframes generalization across image resolutions as generalization across sequence lengths: it tokenizes images into patch sequences of the same resolution but variable length proportional to image scale. A Scale-Aware Patchify module enables training for such variations from fixed-size crops. ScaleFormer then decouples intra-patch spatial feature learning from inter-patch sequential dependency modeling, incorporating Rotary Positional Encoding to enhance extrapolation to unseen scales. Extensive experiments show that our approach outperforms SOTA methods in fusion quality and cross-scale generalization. The datasets and source code are available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36212", "url": null, "sourceid": 32180, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38476, "uid": "e2d4866ea6828d3806235a88e5cd0a46", "name": "Stabilizing Feature Geometry in Noisy Pretrained Models for Robust Downstream Tasks", "authors": [{"id": 181910, "fullname": "Quanyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181910?format=json", "institution": "Shandong University"}, {"id": 189934, "fullname": "Zhongyi Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/189934?format=json", "institution": "Shandong University"}, {"id": 189935, "fullname": "Hao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/189935?format=json", "institution": "Shandong University"}, {"id": 130609, "fullname": "Yongshun Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/130609?format=json", "institution": "Shandong University"}, {"id": 189936, "fullname": "Xiaoyan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189936?format=json", "institution": null}, {"id": 84891, "fullname": "Yilong Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/84891?format=json", "institution": "Shandong University"}, {"id": 90856, "fullname": "Shuo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/90856?format=json", "institution": "Case Western Reserve University"}], "abstract": "Pretraining on large-scale data followed by fine-tuning has become a standard paradigm for visual models. However, noise in the pretraining data can be absorbed by the model and carried into downstream tasks, causing catastrophic inheritance, where inherited pretraining noise reduces downstream generalization. Prior studies mainly link this issue to changes in the feature spectrum, arguing that noise reduces the strength of key feature components. Following this view, they aim to improve transferability by amplifying these components. However, these approaches focus only on spectral energy and implicitly assume that the feature directions remain fixed, which does not hold in practice. In this work, we revisit this view and reveal an overlooked effect: even mild pretraining noise can cause a clear rotation of the dominant feature subspace, despite negligible spectral energy degradation. To quantitatively characterize this phenomenon, we propose using the Principal Directional Angle (PDA) to measure the directional shift between the clean and noisy models. Building on this observation, we introduce the Feature Geometry Stabilization (FGS) framework, which aims to counteract the subspace rotation revealed by PDA by enhancing the geometric stability of the feature space through the synergistic interaction of perturbation consistency, variance-activation regularization, and feature consistency distillation. Experiments across multiple visual benchmarks demonstrate the effectiveness of FGS and verify the importance of stabilizing feature geometry to mitigate catastrophic inheritance.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38476", "url": null, "sourceid": 41530, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38478, "uid": "d4d6f757cd2582da7e0f91c5cf066321", "name": "Multigrain-aware Semantic Prototype Scanning and Tri-token Prompt Learning embraced High-order RWKV for Pan-sharpening", "authors": [{"id": 189940, "fullname": "Junfeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189940?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 189941, "fullname": "Wenyang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/189941?format=json", "institution": "Xidian University"}, {"id": 183010, "fullname": "Xueheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183010?format=json", "institution": "University of Science and Technology of China"}, {"id": 184463, "fullname": "Xuanhua He", "url": "http://cvpr.thecvf.com/api/miniconf/users/184463?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 156614, "fullname": "Jianhou Gan", "url": "http://cvpr.thecvf.com/api/miniconf/users/156614?format=json", "institution": "Yunnan Normal University"}, {"id": 75641, "fullname": "Wenqi Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/75641?format=json", "institution": "Sun Yat-Sen University"}], "abstract": "In this work, we propose a Multigrain-aware Semantic Prototype Scanning paradigm for pan-sharpening, built upon a KV-sharing RWKV architecture for efficient global modeling, coupled with a novel tri-token prompting mechanism derived from semantic clustering to steer the fusion process adhering to the following principles: 1) Multigrain-aware Semantic Prototype Scanning.  While the RWKV model offers an efficient linear alternative, its recurrent scanning mechanism often introduces positional bias and lacks semantic guidance. To address this, we introduce a semantic-driven scanning strategy. Local hashing is first employed to generate semantic prototypes via clustering, segmenting the image into coherent regions. Our scanning mechanism is then explicitly aware of multi-grain semantic structures, allowing the model to focus on contextually relevant regions during fusion, thereby enhancing spectral integrity and spatial coherence beyond sequence-agnostic approaches. 2) Tri-token Prompt Learning. The core of our framework is a tri-token prompting mechanism: (i) a globally-sourced token to encapsulate the holistic image context, (ii) cluster-derived prototype tokens to represent distinct semantic regions, and (iii)  learnable token register that acts as a dynamic buffer to explicitly identify and eliminate feature noisy artifacts that commonly arise from standard global modeling. The global and prototype tokens are broadcast as semantic prompts to guide RWKV's processing, while the register continuously refines the intermediate features. 3) Invertible Q-Shift. To counteract spatial detail, we tailor two key designs: apply a center difference convolution on  value pathway within the RWKV block, actively injecting high-frequency information to preserve fine textures and moving beyond parameter-heavy receptive field expansion via invertible neural network empowered  multi-scale Q-shift operation. This module performs efficient, lossless feature transformation and shifting across split channels, significantly enriching feature representation.  Experimental results demonstrate superiority of our method.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38478", "url": null, "sourceid": 42382, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37405, "uid": "59d49e1ef1cc5496dfde1a1cc0c74004", "name": "SenseSearch: Empowering Vision-Language Models with High-Resolution Agentic Search-Reasoning via Reinforcement Learning", "authors": [{"id": 107132, "fullname": "Yong Xien Chng", "url": "http://cvpr.thecvf.com/api/miniconf/users/107132?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 187360, "fullname": "Tao Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187360?format=json", "institution": "University of Science and Technology of China"}, {"id": 187361, "fullname": "Wenwen Tong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187361?format=json", "institution": "SenseTime"}, {"id": 183010, "fullname": "Xueheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183010?format=json", "institution": "University of Science and Technology of China"}, {"id": 187362, "fullname": "Jiandong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187362?format=json", "institution": "Sensetime"}, {"id": 187363, "fullname": "Haojia Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187363?format=json", "institution": ""}, {"id": 176271, "fullname": "Jiefan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/176271?format=json", "institution": "Sensetime"}, {"id": 187364, "fullname": "Hewei Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187364?format=json", "institution": "SenseTime Research"}, {"id": 90125, "fullname": "Hanming Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90125?format=json", "institution": "Sensetime"}, {"id": 88254, "fullname": "Chengjun Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/88254?format=json", "institution": "University of Science and Technology of China"}, {"id": 93664, "fullname": "Gao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/93664?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 88044, "fullname": "Lewei Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88044?format=json", "institution": "SenseTime"}], "abstract": "Vision-Language Models (VLMs) are limited by static knowledge and insufficient fine-grained visual analysis, hindering their performance on knowledge-intensive and visually complex tasks. While recent research has explored VLMs that employ external tools like search or cropping to enhance model performance, they typically employ tools in isolation and lack the ability to coordinate multiple tools effectively. To address this gap, we propose SenseSearch, the first agentic VLM for search-reasoning that supports adaptive multi-tool coordination via reinforcement learning (RL). Specifically, SenseSearch dynamically integrates the image search, text search, and image crop tools to tackle fine-grained and knowledge-intensive visual understanding challenges. We first construct a high-quality cold-start dataset to instill basic tool-usage behaviors. In the subsequent RL stage, we introduce Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to enhance the tool invocation and reasoning ability. To comprehensively evaluate the agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions. Experiments demonstrate that SenseSearch achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks, outperforming baselines by 19.18% on HR-MMSearch. SenseSearch provides a promising path toward agentic VLMs with effective and robust tool invocation capabilities. All code and data will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37405", "url": null, "sourceid": 44273, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38063, "uid": "0d2d3e6ccfde2699ef5e675f23448bca", "name": "Image-Guided Geometric Stylization of 3D Meshes", "authors": [{"id": 76124, "fullname": "Changwoon Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/76124?format=json", "institution": "Seoul National University"}, {"id": 106137, "fullname": "Hyunsoo Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/106137?format=json", "institution": "Seoul National University"}, {"id": 188954, "fullname": "Cl\u00e9ment Jambon", "url": "http://cvpr.thecvf.com/api/miniconf/users/188954?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 135805, "fullname": "Yael Vinker", "url": "http://cvpr.thecvf.com/api/miniconf/users/135805?format=json", "institution": "Tel Aviv University"}, {"id": 90067, "fullname": "Young Min Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/90067?format=json", "institution": "Seoul National University"}], "abstract": "Recent generative models can create visually plausible 3D representations of objects. However, the generation process often allows for implicit control signals, such as contextual descriptions, and rarely supports bold geometric distortions beyond existing data distributions. We propose a geometric stylization framework that deforms a 3D mesh, allowing it to express the style of an image. While style is inherently ambiguous, we utilize pre-trained diffusion models to extract an abstract representation of the provided image. Our coarse-to-fine stylization pipeline can drastically deform the input 3D model to express a diverse range of geometric variations while retaining the valid topology of the original mesh and part-level semantics. We also propose an approximate VAE encoder that provides efficient and reliable gradients from mesh renderings. Extensive experiments demonstrate that our method can create stylized 3D meshes that reflect unique geometric features of the pictured assets, such as expressive poses and silhouettes, thereby supporting the creation of distinctive artistic 3D creations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38063", "url": null, "sourceid": 37439, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36983, "uid": "e027ffbf4c1d80802a64ca25f39caa8d", "name": "One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework", "authors": [{"id": 127367, "fullname": "Lorenzo Bianchi", "url": "http://cvpr.thecvf.com/api/miniconf/users/127367?format=json", "institution": "CNR-ISTI"}, {"id": 186377, "fullname": "Giacomo Pacini", "url": "http://cvpr.thecvf.com/api/miniconf/users/186377?format=json", "institution": "ISTI - CNR"}, {"id": 127354, "fullname": "Fabio Carrara", "url": "http://cvpr.thecvf.com/api/miniconf/users/127354?format=json", "institution": "CNR-ISTI"}, {"id": 127371, "fullname": "Nicola Messina", "url": "http://cvpr.thecvf.com/api/miniconf/users/127371?format=json", "institution": "Institute of Information Science and Technologies - National Research Council (ISTI-CNR)"}, {"id": 186378, "fullname": "Giuseppe Amato", "url": "http://cvpr.thecvf.com/api/miniconf/users/186378?format=json", "institution": "CNR"}, {"id": 127355, "fullname": "Fabrizio Falchi", "url": "http://cvpr.thecvf.com/api/miniconf/users/127355?format=json", "institution": "CNR"}], "abstract": "Zero-shot captioners are recently proposed models that utilize common-space vision-language representations to caption images without relying on paired image-text data. To caption an image, they proceed by textually decoding a text-aligned image feature, but they limit their scope to global representations and whole-image captions. We present a unified framework for zero-shot captioning that shifts from an image-centric to a patch-centric paradigm, enabling the captioning of arbitrary regions without the need of region-level supervision. Instead of relying on global image representations, we treat individual patches as atomic captioning units and aggregate them to describe arbitrary regions, from single patches to non-contiguous areas and entire images. We analyze the key ingredients that enable current latent captioners to work in our novel proposed framework. Experiments demonstrate that backbones producing meaningful, dense visual features, such as DINO, are key to achieving state-of-the-art performance in multiple region-based captioning tasks. Compared to other baselines and state-of-the-art competitors, our models achieve better performance on zero-shot dense captioning and region-set captioning. We also introduce a new trace captioning task that further demonstrates the effectiveness of patch-wise semantic representations for flexible caption generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36983", "url": null, "sourceid": 31622, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37685, "uid": "a3118fa8703d93a510ea9bd337a4d144", "name": "CogDriver: Integrating Cognitive Inertia for Temporally Coherent Planning in Autonomous Driving", "authors": [{"id": 148712, "fullname": "Pei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/148712?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 178288, "fullname": "Qingtian Ning", "url": "http://cvpr.thecvf.com/api/miniconf/users/178288?format=json", "institution": "Li Auto"}, {"id": 148260, "fullname": "Xinyan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/148260?format=json", "institution": "Li Auto Inc."}, {"id": 188009, "fullname": "Haipeng LIU", "url": "http://cvpr.thecvf.com/api/miniconf/users/188009?format=json", "institution": "Li Auto Inc."}, {"id": 188010, "fullname": "Weiliang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/188010?format=json", "institution": "Li Auto Inc."}, {"id": 188011, "fullname": "Dangen She", "url": "http://cvpr.thecvf.com/api/miniconf/users/188011?format=json", "institution": "Li Auto Inc."}, {"id": 153220, "fullname": "XianPeng Lang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153220?format=json", "institution": "LiAuto"}, {"id": 153506, "fullname": "Jun Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/153506?format=json", "institution": "Hong Kong University of Science and Technology"}], "abstract": "The pursuit of autonomous agents with predictive cognitive world models is hindered by a fundamental flaw in current vision-language models (VLMs): they lack cognitive inertia. Operating on isolated snapshots, these models cannot form a temporally coherent world view, leading to erratic decision jitter and a failure to execute complex, multi-step maneuvers. To remedy this, we introduce CogDriver, a framework designed to build a coherent world model by instilling this crucial cognitive property. Our work makes two key contributions: (1) We present CogDriver-Data, a large-scale vision-language-action dataset whose narrative annotations provide the supervisory signal for learning the temporal dynamics of a world model. (2) We develop the CogDriver-Agent, an architecture featuring a sparse temporalmemory to maintain a stable internal state, the foundation of a world model. This is enabled by a spatiotemporal knowledge distillation approach that explicitly teaches decision coherence. Comprehensive experiments validate our paradigm: CogDriver-Agent achieves a 22\\% increase in the closed-loop Driving Score on Bench2Drive and a 21\\% reduction in mean L2 error on nuScenes, establishing a new state-of-the-art. These significant gains in both long-term decision-making and imitation accuracy provide strong evidence that our agent is developing a more stable internal world model.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37685", "url": null, "sourceid": 44986, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37091, "uid": "cb92ee0067e7481c88954b7d8abd8f4a", "name": "ReCoFuse: Ultra-Robust Image Fusion via Restorative Multi-Modal Diffusion Reciprocal Coupling", "authors": [{"id": 129156, "fullname": "HAO ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/129156?format=json", "institution": "Wuhan University"}, {"id": 186634, "fullname": "Shuhan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186634?format=json", "institution": "Wuhan University"}, {"id": 129161, "fullname": "Linfeng Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129161?format=json", "institution": "Wuhan University"}, {"id": 104468, "fullname": "Xunpeng Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/104468?format=json", "institution": "Wuhan University"}, {"id": 86222, "fullname": "Jiayi Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/86222?format=json", "institution": "Wuhan University"}], "abstract": "Existing methods following the integrated hard-regression or decoupling optimization paradigms exhibit limited fusion performance under complex degradations. To address these paradigm-level shortcomings, we propose ReCoFuse, an ultra-robust image fusion framework based on restorative multi-modal diffusion reciprocal coupling. ReCoFuse redefines the relationship between information restoration and integration, deriving a novel reciprocal coupling optimization paradigm through their mutual reinforcement. It first constructs two restoration branches using diffusion modules (DiM) to capture modality-specific restoration priors. Then, time-aware cross-modal integration modules (TIM) are introduced as a bridge to couple restoration and integration, embedded at each DiM sampling timestep to aggregate multi-modal information. The aggregated variable not only feeds back to each restoration branch to enhance degradation removal via cross-modal complementarity, but also generates high-quality fused images that comprehensively represent the scene. Moreover, an alternating regularization mechanism is designed to iteratively optimize DiM and TIM along the gradient path, ensuring effective collaboration between restoration and integration. Extensive experiments show that ReCoFuse achieves state-of-the-art performance under challenging degradations such as low light, haze, noise, low contrast, and stripes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37091", "url": null, "sourceid": 33653, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40009, "uid": "14b14c86550a0d4c618b4764e11d49db", "name": "Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from a Single Image", "authors": [{"id": 144926, "fullname": "Joohyun Kwon", "url": "http://cvpr.thecvf.com/api/miniconf/users/144926?format=json", "institution": "Daegu Gyeongbuk Institute of Science &amp; Technology"}, {"id": 193288, "fullname": "Geonhee Sim", "url": "http://cvpr.thecvf.com/api/miniconf/users/193288?format=json", "institution": "Korea University"}, {"id": 77281, "fullname": "Gyeongsik Moon", "url": "http://cvpr.thecvf.com/api/miniconf/users/77281?format=json", "institution": "Korea University"}], "abstract": "Existing single-image 3D human avatar methods primarily rely on rigid joint transformations, limiting their ability to model realistic cloth dynamics. We present DynaAvatar, a zero-shot framework that reconstructs animatable 3D human avatars with motion-dependent cloth dynamics from a single image. Trained on large-scale multi-person motion datasets, DynaAvatar employs a Transformer-based feed-forward architecture that directly predicts dynamic 3D Gaussian deformations without subject-specific optimization. To overcome the scarcity of dynamic captures, we introduce a static-to-dynamic knowledge transfer strategy: a Transformer pretrained on large-scale static captures provides strong geometric and appearance priors, which are efficiently adapted to motion-dependent deformations through lightweight LoRA fine-tuning on dynamic captures. We further propose the DynaFlow loss, an optical flow\u2013guided objective that provides reliable motion-direction geometric cues for cloth dynamics in rendered space. Finally, we reannotate the missing or noisy SMPL-X fittings in existing dynamic capture datasets, as most public dynamic capture datasets contain incomplete or unreliable fittings that are unsuitable for training high-quality 3D avatar reconstruction models. Experiments demonstrate that DynaAvatar produces visually rich and generalizable animations, outperforming prior methods. Code, pretrained models, and reannotations will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40009", "url": null, "sourceid": 31897, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39181, "uid": "15afbdffe28ab61640b4e837d80567d2", "name": "Linking Perception, Confidence and Accuracy in MLLMs", "authors": [{"id": 147823, "fullname": "Yuetian Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/147823?format=json", "institution": "Zhejiang University"}, {"id": 191520, "fullname": "Yucheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191520?format=json", "institution": "Zhejiang University"}, {"id": 191521, "fullname": "Rongyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191521?format=json", "institution": "Zhejiang University"}, {"id": 185839, "fullname": "Zhijie Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185839?format=json", "institution": "University of Michigan - Ann Arbor"}, {"id": 177775, "fullname": "BOYU YANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/177775?format=json", "institution": "Alibaba Group Holding Limited"}, {"id": 107273, "fullname": "Ming Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/107273?format=json", "institution": "Zhejiang University"}, {"id": 89719, "fullname": "Jie Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89719?format=json", "institution": "City University of Hong Kong"}, {"id": 128218, "fullname": "Qiang Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128218?format=json", "institution": "Zhejiang University"}], "abstract": "Recent advances in Multi-modal Large Language Models (MLLMs) have predominantly focused on enhancing visual perception to improve accuracy. However, a critical question remains unexplored: Do models know when they do not know? Through a probing experiment, we reveal a severe confidence miscalibration problem in MLLMs. To address this, we propose Confidence-Driven Reinforcement Learning (CDRL), which uses original-noise image pairs and a novel confidence-based reward to enhance perceptual sensitivity and robustly calibrate the model's confidence. Beyond training benefits, calibrated confidence enables more effective test-time scaling as a free lunch. We further propose Confidence-Aware Test-Time Scaling (CA-TTS), which dynamically coordinates Self-Consistency, Self-Reflection, and Visual Self-Check modules guided by confidence signals. An Expert Model acts in multiple roles (e.g., Planner, Critic, Voter) to schedule these modules and provide external verification. Our integrated framework establishes new state-of-the-art results with consistent 8.8\\% gains across four benchmarks. More ablation studies demonstrate the effectiveness of each module and scaling superiority. Our code will be released after the acception.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39181", "url": null, "sourceid": 34549, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36975, "uid": "64f951635b90805e592ee138ed09c9b1", "name": "Tracking through Severe Occlusion via Event-Derived Transient Cues", "authors": [{"id": 186360, "fullname": "Hao Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/186360?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 186361, "fullname": "Yujin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186361?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 102342, "fullname": "Haoyue Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/102342?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 186362, "fullname": "Zhenyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186362?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 128519, "fullname": "Shihan Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/128519?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 128026, "fullname": "Zhiwei Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/128026?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 90470, "fullname": "Yi Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90470?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 90454, "fullname": "Luxin Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/90454?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Tracking targets with high-speed and nonlinear motion under occlusion remains challenging due to spatial appearance deprivation and temporal trajectory fragmentation caused by missing visual cues. Existing methods typically either dynamically update templates to maintain appearance similarity or employ autoregressive models to predict targets from historical trajectories. However, these methods are ineffective under severe occlusion owing to template contamination and limited frame rates for complex motion.    In this work, we observe that occlusion inherently degrades the spatial matching mechanism, highlighting the importance of temporal cues. Meanwhile, event cameras with microsecond-level temporal resolution provide transient dynamic cues that facilitate modeling nonlinear motion.    In light of this, we propose \\textbf{EvoTrack}, an occlusion-robust tracking framework via event-derived transient evolution, which comprises event-based motion autoregression and target-aware appearance matching. Specifically, for motion autoregression, the fine-grained timestamps of events naturally encode the target's direction and speed, motivating a bidirectional motion consistency that constrains inter-frame displacement prediction under nonlinear motion. For appearance matching, we adopt a Gaussian masking strategy to simulate occlusion degradation, guiding the model to focus on target regions and learn invariant representations. Furthermore, we build a pixel-aligned Frame-Event tracking dataset with higher spatial resolution and explicit occlusion labels. Extensive experiments demonstrate the effectiveness of EvoTrack in challenging occlusion scenes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36975", "url": null, "sourceid": 35900, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36927, "uid": "f24f96d246285d951fe203d06473ff09", "name": "Parallel Jacobi Decoding for Fast Autoregressive Image Generation", "authors": [{"id": 186236, "fullname": "Boya Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186236?format=json", "institution": "Westlake University"}, {"id": 186237, "fullname": "Ying Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186237?format=json", "institution": "Westlake University"}, {"id": 186238, "fullname": "Siyong Jian", "url": "http://cvpr.thecvf.com/api/miniconf/users/186238?format=json", "institution": null}, {"id": 87566, "fullname": "Huan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87566?format=json", "institution": "Northeastern University"}], "abstract": "Autoregressive (AR) models have demonstrated remarkable performance in generating high-fidelity images.However, their inherently sequential next-token prediction leads to significantly slows inference.Recent studies have introduced Jacobi-style decoding to accelerate autoregressive image generation.Extending the draft sequence initially improves efficiency, yet the acceleration quickly saturates as error propagation in the one-dimensional sequence hinders convergence.Observing that images exhibit strong local spatial correlations, we propose Parallel Jacobi Decoding (PJD), a training-free decoding approach that expands draft tokens in the two-dimensional spatial domain to enable efficient spatially parallel refinement.PJD adjusts the attention mask to mitigate error accumulation and improve convergence stability.Extensive experiments on diverse datasets show that PJD achieves **4.8\u00d7\u20136.4\u00d7** acceleration across multiple autoregressive image generation models while preserving image quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36927", "url": null, "sourceid": 37589, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36569, "uid": "43a905649d52ad653f605560b971f996", "name": "MPL: Match-guided Prototype Learning for Few-shot Action Recognition", "authors": [{"id": 183489, "fullname": "Feng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183489?format=json", "institution": "Chongqing University of Posts and Telecommunications"}, {"id": 183686, "fullname": "Jie Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/183686?format=json", "institution": "Chongqing University of Posts and Telecommunications"}, {"id": 185370, "fullname": "Fulin Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185370?format=json", "institution": "Chongqing University"}, {"id": 185371, "fullname": "Anyong Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185371?format=json", "institution": "Chongqing University of Post and Telecommunications"}, {"id": 185372, "fullname": "Tiecheng Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/185372?format=json", "institution": "Chongqing University of Post and Telecommunications"}, {"id": 185373, "fullname": "Yue Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185373?format=json", "institution": "Chongqing University of Post and Telecommunications"}, {"id": 185374, "fullname": "CHENQIANG GAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/185374?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 86299, "fullname": "Junwei Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/86299?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}], "abstract": "Current few-shot action recognition methods achieve impressive performance by learning representative prototypes and designing diverse video matching strategies. However, these approaches typically face two critical limitations: i) prototypes learned through implicit sample interactions lack clear semantic correspondence between query-support pairs, limiting their class representativeness; ii) the independent design of prototype learning and matching mechanisms creates a potential incompatibility between prototype representations and matching strategies. To address these limitations, we propose a Match-guided Prototype Learning (MPL) method comprising two key components: enhanced match (E-Match) and key-frame extraction match (K-Match). E-Match explicitly enhances prototype learning in class-specific embeddings by incorporating the matched semantics of query samples, while K-Match further refines the prototype representation through key-frame matching at the fine-grained frame level. Additionally, we propose a Cross-Shot Attention Aggregator (CSA-Aggregator) that dynamically aggregates adjacent frames across support samples, thereby obtaining a prototype representation that captures intra-class shared action patterns. In this way, the proposed MPL effectively mines coarse-to-fine, match-guided semantic information from query-support pairs to generate discriminative class prototypes, and improve the compatibility of prototype representation with the match mechanism. Extensive evaluations on four public datasets confirm that MPL achieves superior performance over leading few-shot action recognition techniques.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36569", "url": null, "sourceid": 34397, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40142, "uid": "430e0eac74786ccb1ae4d1c1d29ce9af", "name": "BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment", "authors": [{"id": 193612, "fullname": "Risa Shinoda", "url": "http://cvpr.thecvf.com/api/miniconf/users/193612?format=json", "institution": "The University of Tokyo"}, {"id": 96749, "fullname": "Kaede Shiohara", "url": "http://cvpr.thecvf.com/api/miniconf/users/96749?format=json", "institution": "The University of Tokyo"}, {"id": 90313, "fullname": "Nakamasa Inoue", "url": "http://cvpr.thecvf.com/api/miniconf/users/90313?format=json", "institution": "Tokyo Institute of Technology"}, {"id": 190413, "fullname": "Kuniaki Saito", "url": "http://cvpr.thecvf.com/api/miniconf/users/190413?format=json", "institution": "OMRON SINICX"}, {"id": 90746, "fullname": "Hiroaki Santo", "url": "http://cvpr.thecvf.com/api/miniconf/users/90746?format=json", "institution": "Osaka University"}, {"id": 90743, "fullname": "Fumio Okura", "url": "http://cvpr.thecvf.com/api/miniconf/users/90743?format=json", "institution": "Osaka University"}], "abstract": "Understanding animal species from multimodal data poses an emerging challenge at the intersection of computer vision and ecology.While recent biological models, such as BioCLIP, have demonstrated strong alignment between images and textual taxonomic information for species identification, the integration of the audio modality remains an open problem. We propose BioVITA, a novel visual-textual-acoustic alignment framework for biological applications. BioVITA involves (i) a training dataset, (ii) a representation model, and (iii) a retrieval benchmark. First, we construct a large-scale training dataset comprising 1.3 million audio clips and 2.3 million images, covering 14,133 species annotated with 34 ecological trait labels. Second, building upon BioCLIP2, we introduce a two-stage training framework to effectively align audio representations with visual and textual representations. Third, we develop a cross-modal retrieval benchmark that covers all possible directional retrieval across the three modalities (i.e., image-to-audio, audio-to-text, text-to-image, and their reverse directions), with three taxonomic levels: Family, Genus, and Species. Extensive experiments demonstrate that our model learns a unified representation space that captures species-level semantics beyond taxonomy, advancing multimodal biodiversity understanding.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40142", "url": null, "sourceid": 33238, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39745, "uid": "eaa8c9c3ac9ad5a1d863d20da580a0a3", "name": "ShapeAR: Generating Editable Shape Layers via Autoregressive Diffusion", "authors": [{"id": 188514, "fullname": "Souymodip Chakraborty", "url": "http://cvpr.thecvf.com/api/miniconf/users/188514?format=json", "institution": "Adobe Inc"}, {"id": 192772, "fullname": "Ankur Singh", "url": "http://cvpr.thecvf.com/api/miniconf/users/192772?format=json", "institution": "Adobe Systems"}, {"id": 179955, "fullname": "Amit Vikram Singh", "url": "http://cvpr.thecvf.com/api/miniconf/users/179955?format=json", "institution": "Adobe"}, {"id": 188515, "fullname": "Vineet Batra", "url": "http://cvpr.thecvf.com/api/miniconf/users/188515?format=json", "institution": "Adobe Systems"}, {"id": 171312, "fullname": "Ankit Phogat", "url": "http://cvpr.thecvf.com/api/miniconf/users/171312?format=json", "institution": "Adobe Systems"}], "abstract": "We present ShapeAR, a novel autoregressive latent diffusion framework that decomposes raster images into editable, artist-like vector shape layers. Unlike conventional raster-to-SVG methods that rely on boundary tracing or joint path optimization, ShapeAR generates non-overlapping RGBA shape layers directly in latent space via flow-matching diffusion. To scale generation to complex scenes with many shapes, we formulate the process autoregressively, conditioning each step on both the input image (global context) and the partial composition of previously generated layers (local context). In addition, we propose geometry-aware evaluation metrics that quantify the aesthetic and structural quality of the generated shapes, enabling more rigorous assessment beyond pixel-level reconstruction. ShapeAR achieves cleaner decompositions and more coherent vector layers.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39745", "url": null, "sourceid": 31515, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37313, "uid": "2eb6c7a030d336aa15486f40154049a4", "name": "FastGaMer: Efficient GainMap Learning for Practical Inverse Tone Mapping", "authors": [{"id": 181198, "fullname": "YUANSHEN GUAN", "url": "http://cvpr.thecvf.com/api/miniconf/users/181198?format=json", "institution": "University of Science and Technology of China"}, {"id": 76716, "fullname": "Ruikang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76716?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 88201, "fullname": "Chang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88201?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 187137, "fullname": "Yinuo Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187137?format=json", "institution": "University of Science and Technology of China"}, {"id": 87157, "fullname": "Dehua Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/87157?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 88204, "fullname": "Fenglong Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/88204?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 76559, "fullname": "Zhiwei Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76559?format=json", "institution": "USTC"}], "abstract": "Inverse tone mapping (ITM) becomes significantly harder when the SDR input is produced by local tone mapping, which jointly applies global radiometric compression and spatially varying adaptations that distort dynamic range, contrast, and channel-wise color ratios. Existing ITM methods ignore this degradation structure and either regress HDR values directly or rely on a single-channel gain map, which scale luminance only and cannot restore the compressed dynamic range and wide color gamut.We introduce FastGaMer, a structured and resolution-agnostic ITM framework that explicitly mirrors this degradation process. Instead of regressing HDR values, we reconstruct a color gain map, which preserves per-channel amplification, simplifies learning, and enables proper gamut extension. Local and global degradations are inverted separately using dynamic bilateral grids and learnable 3D LUTs, followed by a lightweight neural modulator for global refinement and coherence. All high-resolution operations are network-free, yielding exceptional efficiency.To support color-GM supervision under realistic local TMO degradations, we create a dataset of over 8,000 4K SDR\u2013GM pairs with an additional real-captured test set. FastGaMer outperforms prior lightweight ITM methods by +1.4 dB PQ-PSNR, reduces runtime by 70\\%, and processes 4K images in only 6.2 ms, achieving both high accuracy and real-time performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37313", "url": null, "sourceid": 42184, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39757, "uid": "0e1542eb9501f06a8c684bc836568a84", "name": "Adaptive Spatial-Temporal Window: Unlocking the Potential of Event Cameras in Heterogeneous Velocity Scenarios", "authors": [{"id": 181088, "fullname": "Zhipeng Sui", "url": "http://cvpr.thecvf.com/api/miniconf/users/181088?format=json", "institution": "Tsinghua University"}, {"id": 192796, "fullname": "Haiqing Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192796?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 192797, "fullname": "Weihua He", "url": "http://cvpr.thecvf.com/api/miniconf/users/192797?format=json", "institution": null}, {"id": 192798, "fullname": "Seng-Hong Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/192798?format=json", "institution": "Tsinghua University"}, {"id": 192799, "fullname": "Wenhui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192799?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Most event-based algorithms typically split the event stream into fixed groups (e.g., fixed time or fixed count) for downstream processing, lacking adaptivity to scene dynamics. Several adaptive partitioning strategies have been proposed, but they are unable to cope well with heterogeneous velocity scenarios (HVS) involving both fast- and slow-moving objects. To address this issue, we propose Adaptive Spatial-Temporal Window (ASTW) strategy, which simultaneously achieves temporal adaptivity and spatial locality in event partitioning. Based on the principle of maximum entropy, we derive a patch-level time window determination criterion and efficiently implement it based on event density and vectorized calculations. Experiments on publicly available event-based object detection and tracking datasets demonstrate that ASTW significantly outperforms existing state-of-the-art partitioning strategies. We also construct HetVel, the first RGB-event dual-modality dataset for HVS, and further highlight the advantages of ASTW on this challenging benchmark. We believe that our ASTW strategy and the constructed HetVel dataset will advance the field of neuromorphic vision.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39757", "url": null, "sourceid": 40035, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37893, "uid": "66401bf02a6a57e0b5ad01232e78a470", "name": "VectorArk: Learning Practical Image Vectorization with Rounded Polygon Representation", "authors": [{"id": 183693, "fullname": "Tarun Gehlaut", "url": "http://cvpr.thecvf.com/api/miniconf/users/183693?format=json", "institution": "Adobe"}, {"id": 106443, "fullname": "Difan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/106443?format=json", "institution": "Adobe Research"}, {"id": 180221, "fullname": "Charu Bansal", "url": "http://cvpr.thecvf.com/api/miniconf/users/180221?format=json", "institution": "Adobe Systems"}, {"id": 188513, "fullname": "Krutik Malani", "url": "http://cvpr.thecvf.com/api/miniconf/users/188513?format=json", "institution": "Adobe Systems"}, {"id": 188514, "fullname": "Souymodip Chakraborty", "url": "http://cvpr.thecvf.com/api/miniconf/users/188514?format=json", "institution": "Adobe Inc"}, {"id": 171312, "fullname": "Ankit Phogat", "url": "http://cvpr.thecvf.com/api/miniconf/users/171312?format=json", "institution": "Adobe Systems"}, {"id": 85655, "fullname": "Matthew Fisher", "url": "http://cvpr.thecvf.com/api/miniconf/users/85655?format=json", "institution": "Adobe Research"}, {"id": 188515, "fullname": "Vineet Batra", "url": "http://cvpr.thecvf.com/api/miniconf/users/188515?format=json", "institution": "Adobe Systems"}], "abstract": "Recent vision-language model (VLM)\u2013based approaches have achieved impressive results on image vectorization tasks. However, they are typically evaluated on synthetic benchmarks, where clean SVGs are rasterized at high resolution and then re-vectorized. As a result, these methods generalize poorly to real-world scenarios, such as images with unknown rasterization methods or those generated by text-to-image models.We introduce VectorArk, a new VLM-based model designed for robust and practical image vectorization. VectorArk employs a novel rounded polygon representation that simplifies the learning process while naturally producing smooth, visually appealing primitives. We also propose a degradation model that enhances robustness across diverse and imperfect inputs.Our experiments show that, in contrast to previous methods, VectorArk achieves superior geometric completeness and artifact suppression across multiple datasets, with comprehensive ablations validating the contribution of each component.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37893", "url": null, "sourceid": 40212, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40077, "uid": "b9a5bb0159ab00fa62f31e5c956eac14", "name": "FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution", "authors": [{"id": 145111, "fullname": "Yidi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/145111?format=json", "institution": "USTC"}, {"id": 160417, "fullname": "Zihao Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/160417?format=json", "institution": "University of Science and Technology of China"}, {"id": 86575, "fullname": "Jie Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86575?format=json", "institution": "University of Science and Technology of China"}, {"id": 88802, "fullname": "Jie Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88802?format=json", "institution": "University of Science and Technology of China"}, {"id": 90682, "fullname": "Dong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/90682?format=json", "institution": "University of Science and Technology of China"}, {"id": 193450, "fullname": "Wenlong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193450?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 87059, "fullname": "Lei Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/87059?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 76679, "fullname": "Xueyang Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76679?format=json", "institution": "University of Science and Technology of China"}, {"id": 86637, "fullname": "Zheng-Jun Zha", "url": "http://cvpr.thecvf.com/api/miniconf/users/86637?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Reinforcement Learning with Human Feedback (RLHF) has proven effective in image generation field guided by reward models to align human preferences. Motivated by this, adapting RLHF for Image Super-Resolution (ISR) tasks has shown promise in optimizing perceptual quality with Image Quality Assessment (IQA) model as reward models. However, the traditional IQA model usually output a single global score, which are exceptionally insensitive to local and fine-grained distortions.This insensitivity allows ISR models to produce perceptually undesirable artifacts that yield spurious high scores, misaligning optimization objectives with perceptual quality and results in reward hacking. To address this, we propose a Fine-grained Perceptual Reward Model (FinPercep-RM) based on an Encoder-Decoder architecture. While providing a global quality score, it also generates a Perceptual Degradation Map that spatially localizes and quantifies local defects. We specifically introduce the FGR-30k dataset to train this model, consisting of diverse and subtle distortions from real-world super-resolution models. Despite the success of the FinPercep-RM model, its complexity introduces significant challenges in generator policy learning, leading to training instability. To address this, we propose a Co-evolutionary Curriculum Learning (CCL) mechanism, where both the reward model and the ISR model undergo synchronized curricula. The reward model progressively increases in complexity, while the ISR model starts with a simpler global reward for rapid convergence, gradually transitioning to the more complex model outputs. This easy-to-hard strategy enables stable training while suppressing reward hacking.Experiments validates the effectiveness of our method across ISR models in both global quality and local realism on RLHF methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40077", "url": null, "sourceid": 36594, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38276, "uid": "35791a616f27612f46106ad5f152c3a3", "name": "Plug-and-Play Incomplete Multi-View Clustering via  Janus-Faced Affinity Learning with Topology Harmonization", "authors": [{"id": 189479, "fullname": "Shengju Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189479?format=json", "institution": "National University of Defense Technology"}, {"id": 129430, "fullname": "Suyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129430?format=json", "institution": "National University of Defense Technology"}, {"id": 189480, "fullname": "Wenhao SHAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/189480?format=json", "institution": "Guangxi University"}, {"id": 90222, "fullname": "Siwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90222?format=json", "institution": "Academy of Military Sciences"}, {"id": 129418, "fullname": "KE LIANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/129418?format=json", "institution": "National University of Defense Technology"}, {"id": 129434, "fullname": "Xihong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129434?format=json", "institution": "National University of Defense Technology"}, {"id": 189481, "fullname": "Tiejun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189481?format=json", "institution": "National University of Defense Technology; National University of Defense Technology"}, {"id": 90226, "fullname": "Xinwang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90226?format=json", "institution": "National University of Defense Technology"}], "abstract": "Prevailing incomplete multi-view clustering (IMVC) approaches typically fail to account for the interference of view-exclusive artifacts when learning view-consensus representations, which could compromise the fidelity of the resulting similarity measure. Moreover, inconsistencies in anchor order across views may distort the graph structure, impairing the clustering performance. The reliance on carefully-tuned regularization hyper-parameters also usually undermines the model's practical utility. To alleviate these issues, we propose a plug-and-play IMVC framework named PJFTH that incorporates Janus-faced affinity learning with topology harmonization. It explicitly models the exclusive-to-consensus interplay, derives a view-private graph from each view, and adaptively integrates them into a global consensus affinity according to the respective view's intrinsic characteristics. Furthermore, a permutation transformation with unary encoding constraints is applied to anchor matrix, realigning  anchor topology while preserving the values. This process synchronizes anchor order prior to similarity integration and maintains original anchor properties.  Notably, all components are coupled seamlessly and optimized in a joint manner. Also, the provable overall linear complexity further enlarges its scalability and practicality. Experimental results confirm that PJFTH receives competitive performance compared to several leading methods.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38276", "url": null, "sourceid": 34854, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39207, "uid": "70bcd280925c9104e1647cd668e98c94", "name": "DART: Dynamic ModAlity-balanced Multimodal RepresenTation Learning for E-commerce Product Understanding", "authors": [{"id": 181630, "fullname": "Zhanheng Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/181630?format=json", "institution": "Alibaba Group"}, {"id": 191584, "fullname": "ChengHanFu ChengHanFu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191584?format=json", "institution": null}, {"id": 191585, "fullname": "Daoze Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191585?format=json", "institution": "Alibaba Cloud"}, {"id": 191586, "fullname": "Junxian Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191586?format=json", "institution": "Zhejiang University"}, {"id": 191587, "fullname": "Wanxian Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191587?format=json", "institution": null}, {"id": 191588, "fullname": "Pengjie Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191588?format=json", "institution": "Alibaba Group"}, {"id": 191589, "fullname": "Jian Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191589?format=json", "institution": "Alibaba Group"}, {"id": 152921, "fullname": "Bo Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/152921?format=json", "institution": "Alibaba Group"}], "abstract": "The rapid growth of e-commerce calls for multimodal models that comprehend rich visual and textual product information. Although recent multimodal large language models (MLLMs) for product understanding exhibit strong capability in representation learning for e-commerce, they still face three challenges: (i) the modality imbalance induced by modality mixed training; (ii) underutilization of the intrinsic alignment relationships among visual and textual information within a product; and (iii) limited handling of noise in e-commerce multimodal data. To address these, we propose DART, a Dynamic modAlity-balanced multimodal RepresenTation learning framework for e-commerce product understanding. DART comprises: (1) a Modality-driven Mixture-of-Experts (MoE) module that adaptively processes input samples by their modality composition, enabling Multimodal Joint Learning to mitigate the modality imbalance; (2) a Dual-level Alignment method to better leverage semantic alignment properties inside individual products; and (3) an MLLM-based Image-text Co-augmentation strategy that integrates textual enrichment with visual expansion, coupled with Dynamic Sample Filtering to improve training data quality. We further introduce CoIN, a Co-augmented multImodal represeNtation benchmark for e-commerce representation learning and evaluation. Experiments show that DART delivers state-of-the-art zero-shot performance on CoIN and multiple public datasets. Furthermore, attention-based heatmap visualization provides qualitative evidence of improved multimodal alignment of DART.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39207", "url": null, "sourceid": 42294, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37837, "uid": "f4c96584752db3229177fd72775ec66c", "name": "HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models", "authors": [{"id": 188371, "fullname": "MD Khalequzzaman Chowdhury Sayem", "url": "http://cvpr.thecvf.com/api/miniconf/users/188371?format=json", "institution": "Ulsan National Institute of Science and Technology"}, {"id": 188372, "fullname": "Mubarrat Chowdhury", "url": "http://cvpr.thecvf.com/api/miniconf/users/188372?format=json", "institution": "Ulsan National Institute of Science and Technology"}, {"id": 188373, "fullname": "Yihalem Tiruneh", "url": "http://cvpr.thecvf.com/api/miniconf/users/188373?format=json", "institution": "Ulsan National Institute of Science and Technology"}, {"id": 163151, "fullname": "Muneeb Ahmed Khan", "url": "http://cvpr.thecvf.com/api/miniconf/users/163151?format=json", "institution": "Sangmyung University"}, {"id": 188374, "fullname": "Muhammad Salman Ali", "url": "http://cvpr.thecvf.com/api/miniconf/users/188374?format=json", "institution": "Ulsan National Institute of Science and Technology"}, {"id": 85910, "fullname": "Binod Bhattarai", "url": "http://cvpr.thecvf.com/api/miniconf/users/85910?format=json", "institution": "Fogsphere (Redev.AI Ltd, UK)/ University of Aberdeen, UK"}, {"id": 70503, "fullname": "Seungryul Baek", "url": "http://cvpr.thecvf.com/api/miniconf/users/70503?format=json", "institution": "UNIST"}], "abstract": "Understanding the fine-grained articulation of human hands is critical in high-stakes settings such as robot-assisted surgery, chip manufacturing, and AR/VR-based human\u2013AI interaction. Despite achieving near-human performance on general vision-language benchmarks, current vision-language models (VLMs) struggle with fine-grained spatial reasoning\u2014especially in interpreting complex, articulated hand poses. We introduce HandVQA, a large-scale diagnostic benchmark designed to evaluate VLMs' understanding of detailed hand anatomy through visual question answering. Built upon high-quality 3D hand datasets (FreiHAND, InterHand2.6M, FPHA), our benchmark includes over 1.6M controlled multiple-choice questions that probe spatial relationships between hand joints, such as angles, distances, and relative positions. We evaluate several state-of-the-art VLMs (LLaVA, DeepSeek and Qwen-VL) in both base and fine-tuned settings, using lightweight fine-tuning via LoRA. Our findings reveal systematic limitations in current models, including hallucinated finger parts, incorrect geometric interpretations, and poor generalization. HandVQA not only exposes these critical reasoning gaps but provides a validated path to improvement. We demonstrate that the 3D-grounded spatial knowledge learned from our benchmark transfers in a zero-shot setting, significantly improving accuracy of model on novel downstream tasks like hand gesture recognition (+10.33\\%) and hand-object interaction (+2.63\\%). Code and dataset will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37837", "url": "https://kcsayem.github.io/handvqa/", "sourceid": 36233, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36348, "uid": "ec345c4ea8758243dc3d1424ff9f86b1", "name": "AffordMatcher: Affordance Learning in 3D Scenes from Visual Signifiers", "authors": [{"id": 184836, "fullname": "Nghia Huu Vu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184836?format=json", "institution": "AIOZ Network"}, {"id": 87898, "fullname": "Tuong Do", "url": "http://cvpr.thecvf.com/api/miniconf/users/87898?format=json", "institution": "AIOZ"}, {"id": 184837, "fullname": "Khang Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184837?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 127983, "fullname": "Baoru Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127983?format=json", "institution": "University College London, University of London"}, {"id": 184838, "fullname": "Nhat Le", "url": "http://cvpr.thecvf.com/api/miniconf/users/184838?format=json", "institution": "University of Western Australia; AIOZ"}, {"id": 184839, "fullname": "Binh Xuan Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184839?format=json", "institution": "AIOZ"}, {"id": 87863, "fullname": "Erman Tjiputra", "url": "http://cvpr.thecvf.com/api/miniconf/users/87863?format=json", "institution": "AIOZ"}, {"id": 87886, "fullname": "Quang D. Tran", "url": "http://cvpr.thecvf.com/api/miniconf/users/87886?format=json", "institution": "AIOZ"}, {"id": 184840, "fullname": "Ravi Prakash", "url": "http://cvpr.thecvf.com/api/miniconf/users/184840?format=json", "institution": "Indian Institute of Science, Indian institute of science, Bangalore"}, {"id": 184841, "fullname": "Te-Chuan Chiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184841?format=json", "institution": "National Tsing Hua University"}, {"id": 87853, "fullname": "Anh Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87853?format=json", "institution": "University of Liverpool"}], "abstract": "Affordance learning is a complex challenge in many applications, where existing approaches primarily focus on the geometric structures, visual knowledge, and affordance labels of objects to determine interactable regions. However, extending this learning capability to a scene is significantly more complicated, as incorporating object- and scene-level semantics is not straightforward; for example, 3D instance identification often struggles with small, interactable, functional parts (i.e., knobs, handles, etc.). In this work, we introduce AffordBridge, a large-scale dataset with 291,637 functional interaction annotations across 685 high-resolution indoor scenes in the form of point clouds. Our affordance annotations are complemented by RGB images that are linked to the same instances within scenes. Building upon our dataset, we propose AffordMatcher, an affordance learning method that establishes coherent semantic correspondences between image-based and point cloud-based instances for keypoint matching, enabling a more precise identification of affordance regions based on cues, so-called visual signifiers. Experimental results on our dataset demonstrate the effectiveness of our approach against other methods. Our code and dataset will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36348", "url": null, "sourceid": 42566, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39480, "uid": "3d8aa6ef287b95625bf19d431e946ae7", "name": "PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories", "authors": [{"id": 192161, "fullname": "Gemma Canet Tarr\u00e9s", "url": "http://cvpr.thecvf.com/api/miniconf/users/192161?format=json", "institution": "Amazon"}, {"id": 192162, "fullname": "Manel Baradad", "url": "http://cvpr.thecvf.com/api/miniconf/users/192162?format=json", "institution": "Amazon"}, {"id": 85783, "fullname": "Francesc Moreno-Noguer", "url": "http://cvpr.thecvf.com/api/miniconf/users/85783?format=json", "institution": "Universidad Polit\u00e9cnica de Cataluna"}, {"id": 192163, "fullname": "Yumeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192163?format=json", "institution": "Amazon"}], "abstract": "Recent advances in generative AI have dramatically improved photorealistic image synthesis, yet they fall short for studio-level multi-object compositing. This task demands simultaneous (i) near\u2011perfect preservation of each item\u2019s identity, (ii) precise background and color fidelity, (iii) layout and design elements control, and (iv) complete, appealing displays showcasing all objects. However, current state-of-the-art models often alter object details, omit or duplicate objects, and produce layouts with incorrect relative sizing or inconsistent item presentations. To bridge this gap, we introduce PLACID, a framework that transforms a collection of object images into an appealing multi-object composite. Our approach makes two main contributions. First, we leverage a pretrained image-to-video (I2V) diffusion model with text control to preserve objects consistency, identities and background details by exploiting temporal priors from videos. Second, we propose a novel data curation strategy that generates synthetic sequences where randomly placed objects smoothly move to their target positions. This synthetic data aligns with the video model\u2019s temporal priors during training. At inference, objects initialized at random positions consistently converge into coherent layouts guided by text, with the final frame serving as the composite image. Extensive quantitative evaluations and user studies demonstrate that PLACID surpasses state-of-the-art methods in multi-object compositing, achieving superior identity, background, and color preservation, with less omitted objects and visually appealing results.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39480", "url": null, "sourceid": 37905, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39612, "uid": "f32fabfd1b93d08674ebf1784ea8f3fa", "name": "Rare-E2E: Rare Events Dataset for End-to-End Driving in Challenging Long-tail Scenarios", "authors": [{"id": 86684, "fullname": "Runsheng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86684?format=json", "institution": "University of California, Los Angeles"}, {"id": 134907, "fullname": "Hubert Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/134907?format=json", "institution": "Waymo"}, {"id": 192475, "fullname": "Wonseok Jeon", "url": "http://cvpr.thecvf.com/api/miniconf/users/192475?format=json", "institution": "Waymo LLC"}, {"id": 192476, "fullname": "Hao Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192476?format=json", "institution": "Waymo"}, {"id": 133883, "fullname": "Yuliang Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/133883?format=json", "institution": "Waymo"}, {"id": 192477, "fullname": "Liting Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/192477?format=json", "institution": "Waymo"}, {"id": 192478, "fullname": "John Gorman", "url": "http://cvpr.thecvf.com/api/miniconf/users/192478?format=json", "institution": "Waymo, LLC"}, {"id": 150998, "fullname": "Ekaterina Tolstaya", "url": "http://cvpr.thecvf.com/api/miniconf/users/150998?format=json", "institution": "Waymo"}, {"id": 192479, "fullname": "Sarah Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192479?format=json", "institution": "Waymo"}, {"id": 192480, "fullname": "Brandyn White", "url": "http://cvpr.thecvf.com/api/miniconf/users/192480?format=json", "institution": "Waymo"}, {"id": 85279, "fullname": "Benjamin Sapp", "url": "http://cvpr.thecvf.com/api/miniconf/users/85279?format=json", "institution": "Waymo"}, {"id": 138724, "fullname": "Mingxing Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/138724?format=json", "institution": "Waymo"}, {"id": 152765, "fullname": "Jyh-Jing Hwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152765?format=json", "institution": "Waymo"}, {"id": 85225, "fullname": "Dragomir Anguelov", "url": "http://cvpr.thecvf.com/api/miniconf/users/85225?format=json", "institution": "Waymo"}], "abstract": "Vision-based end-to-end (E2E) driving has garnered interest in the research community due to its scalability and synergy with multimodal large language models (MLLMs). However, current E2E driving benchmarks primarily feature nominal scenarios paired with existing open-loop evaluation metrics that fall short in capturing the multimodal nature of driving or effectively evaluating performance in long-tail scenarios. To address these gaps, we introduce the Rare Events Dataset for End-to-End Driving (Rare-E2E).  Rare-E2E contains 4,021 driving segments (approximately 12 hours), specifically curated for challenging long-tail scenarios that  that are rare in daily life with an occurring frequency of less than 0.03%. Each segment in Rare-E2E includes the high-level routing information, ego states, and 360-degree camera views from 8 surrounding cameras. To evaluate E2E driving performance on these long-tail situations, we propose a novel open-loop evaluation metric: Rater Feedback Score (RFS). Unlike conventional distance-based metrics, RFS measures how closely a predicted trajectory matches rater-annotated trajectory preference labels. Rare-E2E includes rater preference labels for validation, and a separate held out test set is used for the 2025  Rare-E2E benchmark leaderboard.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39612", "url": null, "sourceid": 33236, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38053, "uid": "2f838cade4a6012a6cb1016d1d8d95ed", "name": "FAITHFUL CONTOURING: NEAR-LOSSLESS 3D VOXEL REPRESENTATION FREE FROM ISO-SURFACE", "authors": [{"id": 183591, "fullname": "Yihao Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/183591?format=json", "institution": "Imperial College London"}, {"id": 188934, "fullname": "Xianglong He", "url": "http://cvpr.thecvf.com/api/miniconf/users/188934?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 188935, "fullname": "Chuanyu Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188935?format=json", "institution": null}, {"id": 87723, "fullname": "Yiwen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87723?format=json", "institution": "Nanyang Technological University"}, {"id": 188936, "fullname": "Jiaqi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188936?format=json", "institution": "Math Magic"}, {"id": 126956, "fullname": "Yangguang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/126956?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 151075, "fullname": "Wanli Ouyang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151075?format=json", "institution": "Shanghai AI Lab"}, {"id": 188937, "fullname": "Yuanming Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188937?format=json", "institution": "Meshy AI"}, {"id": 149352, "fullname": "Guang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/149352?format=json", "institution": "Imperial College London"}, {"id": 188938, "fullname": "Choon Hwai Yap", "url": "http://cvpr.thecvf.com/api/miniconf/users/188938?format=json", "institution": "Imperial College London"}], "abstract": "Accurate and efficient voxelized representations of 3D meshes are the foundation of 3D reconstruction and generation. However, existing representations based on iso-surface heavily rely on water-tightening or rendering optimization, which inevitably compromise geometric fidelity. We propose Faithful Contouring, a sparse voxelized representation that supports 2048+ resolutions for arbitrary meshes, requiring neither converting meshes to field functions nor extracting the isosurface during remeshing. It achieves near-lossless fidelity by preserving sharpness and internal structures, even for challenging cases with complex geometry and topology. The proposed method also shows flexibility for texturing, manipulation, and editing. Beyond representation, we design a dual-mode autoencoder for Faithful Contouring, enabling scalable and detail-preserving shape reconstruction. Extensive experiments show that Faithful Contouring surpasses existing methods in accuracy and efficiency for both representation and reconstruction. For direct representation, it achieves distance errors at the $10^{-5}$ level; for mesh reconstruction, it yields a 93\\% reduction in Chamfer Distance and a 35\\% improvement in F-score over strong baselines, confirming superior fidelity as a representation for 3D learning tasks.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38053", "url": null, "sourceid": 36354, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40325?format=json"], "related_events_ids": [40325]}, {"id": 37255, "uid": "479f751686a3b7f1783994dd56619558", "name": "ELITE: Efficient Gaussian Head Avatar from a Monocular Video via Learned Initialization and Test-time Generative Adaptation", "authors": [{"id": 131186, "fullname": "Kim Youwang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131186?format=json", "institution": "Pohang University of Science and Technology"}, {"id": 184992, "fullname": "Lee Hyoseok", "url": "http://cvpr.thecvf.com/api/miniconf/users/184992?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 187018, "fullname": "Park Subin", "url": "http://cvpr.thecvf.com/api/miniconf/users/187018?format=json", "institution": "Ulsan National Institute of Science and Technology"}, {"id": 75975, "fullname": "Gerard Pons-Moll", "url": "http://cvpr.thecvf.com/api/miniconf/users/75975?format=json", "institution": "University of T\u00fcbingen"}, {"id": 152617, "fullname": "Tae-Hyun Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/152617?format=json", "institution": "KAIST"}], "abstract": "We introduce ELITE, an Efficient Gaussian head avatar synthesis from a monocular video via Learned Initialization and TEst-time generative adaptation. Prior works rely either on a 3D data prior or a 2D generative prior to compensate for missing visual cues in monocular videos. However, 3D data prior methods often struggle to generalize in-the-wild,  while 2D generative prior methods are computationally heavy and prone to identity hallucination. We identify a complementary synergy between these two priors and design an efficient system that achieves high-fidelity animatable avatar synthesis with strong in-the-wild generalization. Specifically, we introduce a feed-forward Mesh2Gaussian Prior Model (MGPM) that enables fast initialization of a Gaussian avatar. To further bridge the domain gap at test time, we design a test-time generative adaptation stage, leveraging both real and synthetic images as supervision. Unlike previous full diffusion denoising strategies that are slow and hallucination-prone, we propose a rendering-guided single-step diffusion enhancer that restores missing visual details, grounded on Gaussian avatar renderings. Our experiments demonstrate that ELITE produces visually superior avatars to prior works, even for challenging expressions, while achieving 60x faster synthesis than the 2D generative prior method.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37255", "url": null, "sourceid": 45527, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39168, "uid": "58881db4173c7dc955b15305b4d60599", "name": "Relational Visual Similarity", "authors": [{"id": 106344, "fullname": "Thao Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/106344?format=json", "institution": "UW-Madison \ud83e\udda1"}, {"id": 128532, "fullname": "Sicheng Mo", "url": "http://cvpr.thecvf.com/api/miniconf/users/128532?format=json", "institution": "University of California, Los Angeles"}, {"id": 88736, "fullname": "Krishna Kumar Singh", "url": "http://cvpr.thecvf.com/api/miniconf/users/88736?format=json", "institution": "Adobe Systems"}, {"id": 95788, "fullname": "Yilin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/95788?format=json", "institution": "Adobe Systems"}, {"id": 126843, "fullname": "Jing Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/126843?format=json", "institution": "Adobe Systems"}, {"id": 126940, "fullname": "Nicholas Kolkin", "url": "http://cvpr.thecvf.com/api/miniconf/users/126940?format=json", "institution": "Adobe Systems"}, {"id": 75717, "fullname": "Eli Shechtman", "url": "http://cvpr.thecvf.com/api/miniconf/users/75717?format=json", "institution": "Adobe Research, US"}, {"id": 89480, "fullname": "Yong Jae Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/89480?format=json", "institution": "Professor, UW-Madison and Research Scientist, Adobe"}, {"id": 155628, "fullname": "Yuheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155628?format=json", "institution": "Adobe Systems"}], "abstract": "Humans do not just see attribute similarity---we also see relational similarity. An apple is like a peach because both are reddish fruit, but the Earth is also like a peach: its crust, mantle, and core correspond to the peach\u2019s skin, flesh, and pit. This ability to perceive and recognize relational similarity, is arguable by cognitive scientist to be what distinguishes humans from other species. %intelligentYet, all widely used visual similarity metrics today (e.g., LPIPS, CLIP, DINO) focus solely on perceptual attribute similarity and fail to capture the rich, often surprising relational similarities that humans perceive.How can we go beyond the visible content of an image to capture its relational properties? How can we bring images with the same relational logic closer together in representation space?To answer these questions, we first formulate relational image similarity as a measurable problem: two images are relationally similar when their internal relations or functions among visual elements correspond, even if their visual attributes differ.We then curate 114k image\u2013caption dataset in which the captions are anonymized---describing the underlying relational logic of the scene rather than its surface content. Using this dataset, we finetune a Vision Language Model to measure the relational similarity between images. This model serves as the first step toward connecting images by their underlying relational structure rather than their visible appearance.Our study shows that while relational similarity has a lot of real-world applications, existing image similarity models fail to capture it---revealing a critical gap in visual computing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39168", "url": null, "sourceid": 38038, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39748, "uid": "f5c317aa2c2a1fd626033cc086a2471a", "name": "Group Diffusion: Enhancing Image Generation by Unlocking Cross-Sample Collaboration", "authors": [{"id": 128532, "fullname": "Sicheng Mo", "url": "http://cvpr.thecvf.com/api/miniconf/users/128532?format=json", "institution": "University of California, Los Angeles"}, {"id": 106344, "fullname": "Thao Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/106344?format=json", "institution": "UW-Madison \ud83e\udda1"}, {"id": 89223, "fullname": "Richard Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89223?format=json", "institution": "Adobe Systems"}, {"id": 126940, "fullname": "Nicholas Kolkin", "url": "http://cvpr.thecvf.com/api/miniconf/users/126940?format=json", "institution": "Adobe Systems"}, {"id": 192780, "fullname": "Siddharth Iyer", "url": "http://cvpr.thecvf.com/api/miniconf/users/192780?format=json", "institution": "Adobe Systems"}, {"id": 75717, "fullname": "Eli Shechtman", "url": "http://cvpr.thecvf.com/api/miniconf/users/75717?format=json", "institution": "Adobe Research, US"}, {"id": 88736, "fullname": "Krishna Kumar Singh", "url": "http://cvpr.thecvf.com/api/miniconf/users/88736?format=json", "institution": "Adobe Systems"}, {"id": 89480, "fullname": "Yong Jae Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/89480?format=json", "institution": "Professor, UW-Madison and Research Scientist, Adobe"}, {"id": 89955, "fullname": "Bolei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/89955?format=json", "institution": "University of California, Los Angeles"}, {"id": 155628, "fullname": "Yuheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155628?format=json", "institution": "Adobe Systems"}], "abstract": "In this work, we explore an untapped signal in diffusion model inference. While all previous methods generate images independently at inference, we instead ask if samples can be generated collaboratively. We propose Group Diffusion, unlocking the attention mechanism to be shared across images, rather than limited to just the patches within an image. This enables images to be jointly denoised at inference time, learning both intra and inter-image correspondence. We observe a clear scaling effect \u2014 larger group sizes yield stronger cross-sample attention and better generation quality. Furthermore, we introduce a qualitative measure to capture this behavior and show that its strength closely correlates with FID. Built on standard diffusion transformers, our GroupDiff achieves up to 32.2% FID improvement on ImageNet-256\u00d7256. Our work reveals cross-sample inference as an effective, previously unexplored mechanism for generative modeling.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39748", "url": null, "sourceid": 36327, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37020, "uid": "3bedcf261a8ed93b45aee274277861b2", "name": "MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry", "authors": [{"id": 181758, "fullname": "Leo Kaixuan Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/181758?format=json", "institution": "University of Toronto"}, {"id": 186498, "fullname": "Abdus Shaikh", "url": "http://cvpr.thecvf.com/api/miniconf/users/186498?format=json", "institution": "University of Toronto"}, {"id": 95045, "fullname": "Ruofan Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/95045?format=json", "institution": "University of Toronto"}, {"id": 77269, "fullname": "Zhijie Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/77269?format=json", "institution": "University of Toronto"}, {"id": 186499, "fullname": "Yushi Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186499?format=json", "institution": "University of Toronto"}, {"id": 158164, "fullname": "Nandita Vijaykumar", "url": "http://cvpr.thecvf.com/api/miniconf/users/158164?format=json", "institution": "University of Toronto"}], "abstract": "Recent advancements in neural visual geometry, including transformer-based models such as VGGT and Pi3, have achieved impressive accuracy on 3D reconstruction tasks. However, their reliance on full attention makes them fundamentally limited by GPU memory capacity, preventing them from scaling to large, unordered image collections. We introduce MERG3R, a training-free divide-and-conquer framework that enables geometric foundation models to operate far beyond their native memory limits. MERG3R first reorders and partitions unordered images into overlapping, geometrically diverse subsets that can be reconstructed independently. It then merges the resulting local reconstructions through an efficient global alignment and confidence-weighted bundle adjustment procedure, producing a globally consistent 3D model. Our framework is model-agnostic and can be paired with existing neural geometry models. Across large-scale datasets\u2014including 7-Scenes, NRGBD, Tanks \\& Temples, and Cambridge Landmarks\u2014MERG3R consistently improves reconstruction accuracy, memory efficiency, and scalability, enabling high-quality reconstruction when the dataset exceeds memory capacity limits.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37020", "url": null, "sourceid": 39775, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38563, "uid": "80a777588c741a17fd957b8b989dd4c7", "name": "PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and Fuzz Optimization", "authors": [{"id": 182361, "fullname": "Mingzhe Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182361?format=json", "institution": "University of Massachusetts Amherst"}, {"id": 190154, "fullname": "Renhao &amp;#x27;Norman&amp;#x27; Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190154?format=json", "institution": "University of Massachusetts at Amherst"}, {"id": 190155, "fullname": "Zhiyang Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190155?format=json", "institution": "University of Massachusetts at Amherst"}, {"id": 190156, "fullname": "Siqi Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190156?format=json", "institution": "Dolby"}, {"id": 176791, "fullname": "Bruno da Silva", "url": "http://cvpr.thecvf.com/api/miniconf/users/176791?format=json", "institution": "University of Massachusetts"}, {"id": 128512, "fullname": "Juan Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/128512?format=json", "institution": "University of Massachusetts at Amherst"}, {"id": 128520, "fullname": "Shiqing Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/128520?format=json", "institution": "University of Massachusetts at Amherst"}], "abstract": "Text-to-image (T2I) generative models such as Stable Diffusion and FLUX can synthesize realistic, high-quality images directly from textual prompts. The resulting image quality depends critically on well-crafted prompts that specify both subjects and stylistic modifiers, which have become valuable digital assets. However, the rising value and ubiquity of high-quality prompts expose them to security and intellectual-property risks. One key threat is the prompt stealing attack, i.e., the task of recovering the textual prompt that generated a given image. Prompt stealing enables unauthorized extraction and reuse of carefully engineered prompts, yet it can also support beneficial applications such as data attribution, model provenance analysis, and watermarking validation. Existing approaches often assume white-box gradient access, require large-scale labeled datasets for supervised training, or rely solely on captioning without explicit optimization, limiting their practicality and adaptability. To address these challenges, we propose PROMPTMINER, a black-box prompt stealing framework that decouples the task into two phases: (1) a reinforcement learning\u2013based optimization phase to reconstruct the primary subject, and (2) a fuzzing-driven search phase to recover stylistic modifiers. Experiments across multiple datasets and diffusion backbones demonstrate that PROMPTMINER achieves superior results, with CLIP similarity up to 0.958 and textual alignment with SBERT up to 0.751, surpassing all baselines. Even when applied to in-the-wild images with unknown generators, it outperforms the strongest baseline by 7.5% in CLIP similarity, demonstrating better generalization. Finally, PROMPTMINER maintains strong performance under defensive perturbations, highlighting remarkable robustness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38563", "url": null, "sourceid": 40122, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39133, "uid": "c736b91eecdcfc795549afee33c96ce4", "name": "Exposing and Evaluating Hallucinations for GUI Grounding", "authors": [{"id": 89537, "fullname": "Zicheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89537?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 191417, "fullname": "Hongyi Jing", "url": "http://cvpr.thecvf.com/api/miniconf/users/191417?format=json", "institution": "Alibaba Group"}, {"id": 191418, "fullname": "Rui Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/191418?format=json", "institution": "Alibaba Group"}, {"id": 191419, "fullname": "Shuo Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191419?format=json", "institution": null}, {"id": 191420, "fullname": "Shiai Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191420?format=json", "institution": null}, {"id": 144761, "fullname": "Junying Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/144761?format=json", "institution": "Northwestern Polytechnical University, Xi\u2018an"}, {"id": 103856, "fullname": "Chunyi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/103856?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 71043, "fullname": "Xiaohong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/71043?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 156101, "fullname": "Chenguang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/156101?format=json", "institution": "Ant Group"}, {"id": 86659, "fullname": "Guangtao Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/86659?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Existing GUI benchmarks primarily focus on evaluating models\u2019 comprehensive capabilities but largely overlook hallucination phenomena in grounding tasks, which are crucial to the reliability of GUI understanding. In this work, we expose two major types of hallucinations in GUI grounding: 1) Confusion Hallucination, where distractor elements are mistakenly selected, and 2) Fabricated Hallucination, where nonexistent elements are hallucinated with plausible coordinates. To systematically investigate their origins, we introduce GUI-HalluBench, a benchmark comprising two complementary subsets: a parsing subset for assessing structural representation of GUI elements and a hallucination subset for measuring grounding robustness under challenging conditions. This design allows us to associate hallucination patterns with deficiencies in prerequisite abilities: parsing errors are closely tied to both fabricated and confusion hallucinations. Experiments on state-of-the-art models confirm these connections, offering new insights into the root causes of hallucinations and guiding the development of more reliable GUI understanding tools.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39133", "url": null, "sourceid": 34880, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37054, "uid": "27f2d2e6436617e42e2f251113eeb791", "name": "Seeing Beyond 8bits: Subjective and Objective Quality Assessment of HDR-UGC Videos", "authors": [{"id": 147748, "fullname": "SHRESHTH SAINI", "url": "http://cvpr.thecvf.com/api/miniconf/users/147748?format=json", "institution": "The University of Texas at Austin"}, {"id": 181925, "fullname": "Bowen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/181925?format=json", "institution": "The University of Texas at Austin"}, {"id": 185869, "fullname": "Yilin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185869?format=json", "institution": "Google"}, {"id": 186574, "fullname": "Neil Birkbeck", "url": "http://cvpr.thecvf.com/api/miniconf/users/186574?format=json", "institution": "Google"}, {"id": 185870, "fullname": "Balu Adsumilli", "url": "http://cvpr.thecvf.com/api/miniconf/users/185870?format=json", "institution": "YouTube"}, {"id": 183579, "fullname": "Alan Bovik", "url": "http://cvpr.thecvf.com/api/miniconf/users/183579?format=json", "institution": "University if Colorado Boulder"}], "abstract": "High Dynamic Range (HDR) user-generated (UGC) videos are rapidly proliferating across social platforms, yet most perceptual video quality assessment (VQA) systems remain tailored to Standard Dynamic Range (SDR). HDR\u2019s higher bit depth, wide color gamut, and elevated luminance range expose distortions such as near-black crushing, highlight clipping, banding, and exposure flicker that amplify UGC artifacts and challenge SDR models. To catalyze progress, we curate \\textbf{HDR-UGC-44K}, a large-scale subjective dataset of $\\sim$44K videos from 6.5K sources with >1.5M crowd ratings, spanning diverse scenes, capture conditions, and compression settings. We further introduce \\textbf{HDR-Q}, the first Multimodal Large Language Model (MLLM) for HDR-UGC VQA. We propose (i) a novel HDR-aware vision encoder to produce HDR-sensitive embeddings, and (ii) HDR-Aware Policy Optimization (HAPO), an RL finetuning framework that anchors reasoning to HDR cues. HAPO augments GRPO via an HDR\u2013SDR contrastive KL that encourages token reliance on HDR inputs and a gaussian weighted regression reward for fine-grained MOS calibration. Across HDR-UGC-44K and public HDR-VQA benchmarks, HDR-Q delivers state-of-the-art performance. The dataset and code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37054", "url": "https://shreshthsaini.github.io/Beyond8Bits", "sourceid": 34677, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39424, "uid": "7e4fbd0ee181747b48a8626835bbfae9", "name": "BiPreManip: Learning Affordance-Based Bimanual Pre-Manipulation through Anticipatory Collaboration", "authors": [{"id": 129189, "fullname": "Yan Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/129189?format=json", "institution": "Peking University"}, {"id": 192050, "fullname": "Feng Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192050?format=json", "institution": "Peking University"}, {"id": 192051, "fullname": "Zichen He", "url": "http://cvpr.thecvf.com/api/miniconf/users/192051?format=json", "institution": null}, {"id": 129188, "fullname": "Xiaoqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129188?format=json", "institution": "Peking University"}, {"id": 192052, "fullname": "Yuchen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192052?format=json", "institution": "Peking University"}, {"id": 192053, "fullname": "Zhiyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192053?format=json", "institution": "Peking University"}, {"id": 129288, "fullname": "Ruihai Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129288?format=json", "institution": "Peking University"}, {"id": 76571, "fullname": "Hao Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76571?format=json", "institution": "Peking University"}], "abstract": "Many everyday objects are difficult to directly grasp (e.g., a flat iPad) or manipulate functionally (e.g., opening the cap of a pen lying on a desk). Such tasks require sequential, asymmetric coordination between two arms, where one arm performs preparatory manipulation that enables the other\u2019s goal-directed action\u2014for instance, pushing the iPad to the table\u2019s edge before picking it up, or lifting the pen body to allow the other hand to remove its cap. In this work, we introduce Collaborative Preparatory Manipulation, a class of bimanual manipulation tasks that demand understanding object semantics and geometry, anticipating spatial relationships, and planning long-horizon coordinated actions between the two arms. To tackle this challenge, we propose a visual affordance-based framework that first envisions the final goal-directed action and then guides one arm to perform a sequence of preparatory manipulations that facilitate the other arm\u2019s subsequent operation. This affordance-centric representation enables anticipatory inter-arm reasoning and coordination, generalizing effectively across various objects spanning diverse categories. Extensive experiments in both simulation and the real world demonstrate that our approach substantially improves task success rates and generalization compared to competitive baselines.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39424", "url": null, "sourceid": 45177, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39391, "uid": "0189eee3401cf07ec0bb15b54859d25e", "name": "Adaptive Sparsity for Efficient Long-Video Understanding", "authors": [{"id": 146076, "fullname": "Handong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/146076?format=json", "institution": "UCAS"}, {"id": 189224, "fullname": "Zikang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189224?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 129588, "fullname": "Longteng Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/129588?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 129596, "fullname": "Tongtian Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/129596?format=json", "institution": ", Institute of automation, Chinese academy of science"}, {"id": 184863, "fullname": "Yepeng Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184863?format=json", "institution": "Beijing Jiaotong University"}, {"id": 87838, "fullname": "Xinxin Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87838?format=json", "institution": ", Institute of automation, Chinese academy of science"}, {"id": 191983, "fullname": "Chuanyang Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191983?format=json", "institution": "Alibaba Group"}, {"id": 189617, "fullname": "Ziming Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189617?format=json", "institution": null}, {"id": 86261, "fullname": "Zhibin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86261?format=json", "institution": "Alibaba Group"}, {"id": 152920, "fullname": "Jun Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/152920?format=json", "institution": "Alibaba Group"}, {"id": 185404, "fullname": "YuCheng YuCheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185404?format=json", "institution": null}, {"id": 152921, "fullname": "Bo Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/152921?format=json", "institution": "Alibaba Group"}, {"id": 87818, "fullname": "Jing Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87818?format=json", "institution": "Institute of automation, Chinese academy of science"}], "abstract": "Processing long-form videos with Video Large Language Models (Video-LLMs) is computationally prohibitive. Current efficiency methods often compromise fine-grained perception through irreversible information disposal or inhibit long-range temporal modeling via rigid, predefined sparse patterns. This paper introduces AdaSpark, an adaptive sparsity framework designed to address these limitations. AdaSpark first partitions video inputs into 3D spatio-temporal cubes. It then employs two co-designed, context-aware components: (1) Adaptive Cube-Selective Attention (AdaS-Attn), which dynamically selects a subset of relevant video cubes to attend for each query token, and (2) Adaptive Token-Selective FFN (AdaS-FFN), which selectively processes only the most salient tokens within each cube. An entropy-based (Top-p) selection mechanism adaptively allocates computational resources based on input complexity. Experiments demonstrate that AdaSpark significantly reduces computational load by up to 57% FLOPs while maintaining comparable performance to dense models and preserving fine-grained, long-range dependencies, as validated on challenging hour-scale video benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39391", "url": null, "sourceid": 33407, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38857, "uid": "f487c050e3c1f3c31f85b9f818ccc0c2", "name": "Scaling View Synthesis Transformers", "authors": [{"id": 190849, "fullname": "Evan Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/190849?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 133098, "fullname": "Hyunwoo Ryu", "url": "http://cvpr.thecvf.com/api/miniconf/users/133098?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 129942, "fullname": "Thomas W. Mitchel", "url": "http://cvpr.thecvf.com/api/miniconf/users/129942?format=json", "institution": "PlayStation"}, {"id": 75463, "fullname": "Vincent Sitzmann", "url": "http://cvpr.thecvf.com/api/miniconf/users/75463?format=json", "institution": "MIT"}], "abstract": "Recently, geometry-free view synthesis transformers have achieved state-of-the-art results in Novel View Synthesis (NVS), outperforming traditional approaches that rely on explicit geometry modeling. However, the specific factors that govern how their performance scales with compute remain poorly understood. In this work, we conduct a rigorous analysis of the scaling laws for view synthesis transformers and elucidate a series of design choices for training compute-optimal NVS models. Most significantly, we find that an encoder\u2013decoder architecture, which was previously found to be less scalable, can in fact be compute-optimal. We attribute the previously inferior performance of previous encoder\u2013decoder methods to certain architectural choices and inconsistent training compute across comparisons. Across several compute levels, we demonstrate that our encoder\u2013decoder architecture, which we call the Scalable View Synthesis Model (SVSM), scales as effectively as decoder-only models, achieves a superior performance\u2013compute Pareto frontier, and outperforms the previous state-of-the-art on real-world NVS benchmarks with substantially reduced training compute.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38857", "url": null, "sourceid": 44786, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38178, "uid": "d8577a104f9962cb3533946068f61dae", "name": "ROSE: Rotate Your Large Language Model to See", "authors": [{"id": 129596, "fullname": "Tongtian Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/129596?format=json", "institution": ", Institute of automation, Chinese academy of science"}, {"id": 189223, "fullname": "Xuange Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189223?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 129588, "fullname": "Longteng Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/129588?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 129617, "fullname": "Zijia Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/129617?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 189224, "fullname": "Zikang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189224?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 189225, "fullname": "Jie Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189225?format=json", "institution": ", Institute of automation, Chinese academy of science"}, {"id": 127562, "fullname": "Hua Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127562?format=json", "institution": "Beijing Normal University"}, {"id": 87818, "fullname": "Jing Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87818?format=json", "institution": "Institute of automation, Chinese academy of science"}], "abstract": "Recent advances in multimodal large language models (MLLMs) have shown impressive progress in integrating visual and linguistic understanding. However, most existing MLLMs inject visual information into the input space of large language models (LLMs), which substantially increases context length and computational overhead, while often disrupting pretrained linguistic priors by forcing the LLM to optimize on vision-dominant multimodal sequences. In this work, we propose a rotation-based vision injection paradigm that aligns visual information with the parameter space of LLMs. Visual semantics are encoded as rotation matrices and applied directly to the pretrained parameters. This parameter-space injection eliminates the need for long input sequences, thus avoiding the quadratic computational overhead inherent in input-space injection. Besides, it preserves the linguistic competence of the LLM by maintaining the intrinsic geometric structure of the pretrained parameters. Building upon this paradigm, we develop ROSE, a 7B MLLM that achieves fine-grained vision\u2013language alignment with remarkable computational efficiency. Extensive experiments across 12 multimodal benchmarks show that ROSE delivers superior or competitive performance compared with leading models.At comparable accuracy, ROSE reduces FLOPs by 80.7% and inference latency by 56.4% relative to Qwen2.5-VL-7B, demonstrating the effectiveness and scalability. All training code, model weights and data will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38178", "url": null, "sourceid": 35450, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36650, "uid": "e2ae76e18d4d1fd029953dc12efab959", "name": "BarbieGait: An Identity-Consistent Synthetic Human Dataset with Versatile Cloth-Changing for Gait Recognition", "authors": [{"id": 185560, "fullname": "Qingyuan Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/185560?format=json", "institution": "Beijing Normal University"}, {"id": 185561, "fullname": "Saihui Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185561?format=json", "institution": "Beijing Normal University"}, {"id": 185562, "fullname": "Xuecai Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185562?format=json", "institution": "AMAP, Alibaba Group"}, {"id": 88386, "fullname": "Yongzhen Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88386?format=json", "institution": "Beijing Normal University"}], "abstract": "Gait recognition, as a reliable biometric technology, has seen rapid development in recent years while it faces significant challenges caused by diverse clothing styles in the real world. This paper introduces BarbieGait, a synthetic gait dataset where real-world subjects are uniquely mapped into a virtual engine to simulate extensive clothing changes while preserving their gait identity information. As a pioneering work, BarbieGait provides a controllable gait data generation method, enabling the production of large datasets to validate cross-clothing issues that are difficult to verify with real-world data. However, the diversity of clothing increases intra-class variance and makes one of the biggest challenges to learning cloth-invariant features under varying clothing conditions. Therefore, we propose GaitCLIF (Gait-oriented CLoth-Invariant Feature) as a robust baseline model for cross-clothing gait recognition. Through extensive experiments, we validate that our method significantly improves cross-clothing performance on BarbieGait and the existing popular gait benchmarks. We believe that BarbieGait, with its extensive cross-clothing gait data, will further advance the capabilities of gait recognition in cross-clothing scenarios and promote progress in related research. The dataset, code, and models will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36650", "url": null, "sourceid": 30611, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39723, "uid": "42f2d2234188bfdd8fab51ddcadcee37", "name": "LiDAR-to-4DRadar Diffusion Bridge via Cross-Modal Alignment and Translation in Latent Space", "authors": [{"id": 192724, "fullname": "Dazhong Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192724?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 192725, "fullname": "Jingjing Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192725?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 192726, "fullname": "Qiang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192726?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 192727, "fullname": "Meng Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192727?format=json", "institution": "Nanjing University of Aeronautics and Astronautics"}, {"id": 192728, "fullname": "Ying Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/192728?format=json", "institution": "Hong Kong University of Science and Technology (Guangzhou)"}], "abstract": "Millimeter-wave radar\u2019s all-weather capability makes it increasingly vital for autonomous perception. However, the high cost of radar data collection drives the need for data generation to augment radar datasets. Existing works mainly target partial radar representations, e.g., 2D or 3D slices, leading to information loss and limited downstream performance. To overcome these issues, we introduce the novel task of LiDAR-to-4DRadar translation, which generates complete 4D radar tensors, with three spatial and one Doppler axes, guided by LiDAR data that preserve spatial and semantic consistency. We propose a novel diffusion bridge model in an aligned LiDAR-4DRadar latent space, namely \\textbf{L2RLDB}, to tackle this task. Specifically, first, a key-voxel-aware VAE compresses high-dimensional, noisy radar tensors into a compact latent space, while enabling precise numerical reconstruction and key-voxel identification. Second, to bridge the cross-modal gap between sparse 3D LiDAR and dense 4D radar, we develop a patch-wise contrastive learning module to align LiDAR latents with radar semantically and spatially. Finally, we formulate the translation as a diffusion bridge process between LiDAR and radar latents, enabling the synthesis of full radar tensors from Doppler-lacking LiDAR inputs. Experiments verify that L2RLDB achieves high-fidelity 4D radar generation and significantly improves downstream detection through data augmentation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39723", "url": null, "sourceid": 35268, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36550, "uid": "e6aebd5be1ba81557dbcc5f6f57bbe5c", "name": "Faster-GS: Analyzing and Improving Gaussian Splatting Optimization", "authors": [{"id": 172813, "fullname": "Florian Hahlbohm", "url": "http://cvpr.thecvf.com/api/miniconf/users/172813?format=json", "institution": "TU Braunschweig"}, {"id": 185323, "fullname": "Linus Franke", "url": "http://cvpr.thecvf.com/api/miniconf/users/185323?format=json", "institution": "Julius-Maximilians-Universit\u00e4t W\u00fcrzburg"}, {"id": 185324, "fullname": "Martin Eisemann", "url": "http://cvpr.thecvf.com/api/miniconf/users/185324?format=json", "institution": "Technische Universit\u00e4t Carolo-Wilhelmina Braunschweig"}, {"id": 185325, "fullname": "Marcus Magnor", "url": "http://cvpr.thecvf.com/api/miniconf/users/185325?format=json", "institution": "Institute for Computer Graphics, TU Braunschweig"}], "abstract": "Recent advances in 3D Gaussian Splatting (3DGS) have focused on accelerating optimization while preserving reconstruction quality. However, many proposed methods entangle implementation-level improvements with fundamental algorithmic modifications or trade performance for fidelity, leading to a fragmented research landscape that complicates fair comparison.In this work, we consolidate and evaluate the most effective and broadly applicable strategies from prior 3DGS research and augment them with several novel optimizations. We further investigate underexplored aspects of the framework, including numerical stability, Gaussian truncation, and gradient approximation.The resulting system, Faster-GS, provides a rigorously optimized algorithm that we evaluate across a comprehensive suite of benchmarks. Our experiments demonstrate that Faster-GS achieves up to 5$\\times$ faster training while maintaining visual quality, establishing a new cost-effective and resource efficient baseline for 3DGS optimization. Furthermore, we demonstrate that optimizations can be applied to 4D Gaussian reconstruction, leading to efficient non-rigid scene optimization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36550", "url": null, "sourceid": 38182, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39144, "uid": "04eab5b5d416d6cab3f80348dc31591d", "name": "WRIVINDER: Towards Spatial Intelligence for Geo-locating Ground Images onto Satellite Imagery", "authors": [{"id": 127679, "fullname": "Chandrakanth Gudavalli", "url": "http://cvpr.thecvf.com/api/miniconf/users/127679?format=json", "institution": "University of California, Santa Barbara"}, {"id": 191443, "fullname": "Tajuddin Manhar Mohammed", "url": "http://cvpr.thecvf.com/api/miniconf/users/191443?format=json", "institution": "Mayachitra, Inc."}, {"id": 147209, "fullname": "Abhay Yadav", "url": "http://cvpr.thecvf.com/api/miniconf/users/147209?format=json", "institution": "Johns Hopkins University"}, {"id": 191444, "fullname": "Ananth Bhaskar", "url": "http://cvpr.thecvf.com/api/miniconf/users/191444?format=json", "institution": "Mayachitra Inc"}, {"id": 191445, "fullname": "Hardik Prajapati", "url": "http://cvpr.thecvf.com/api/miniconf/users/191445?format=json", "institution": "University of California, Santa Barbara"}, {"id": 187937, "fullname": "Cheng Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187937?format=json", "institution": "School of Data Science, University of Virginia; Mathematical Institute for Data Science (MINDS) at JHU"}, {"id": 75499, "fullname": "Rama Chellappa", "url": "http://cvpr.thecvf.com/api/miniconf/users/75499?format=json", "institution": "Johns Hopkins University"}, {"id": 191446, "fullname": "Shivkumar Chandrasekaran", "url": "http://cvpr.thecvf.com/api/miniconf/users/191446?format=json", "institution": "University of California, Santa Barbara"}, {"id": 191447, "fullname": "B.S. Manjunath", "url": "http://cvpr.thecvf.com/api/miniconf/users/191447?format=json", "institution": "Mayachitra, Inc.; University of California, Santa Barbara"}], "abstract": "Aligning ground-level imagery with geo-registered satellite maps is crucial for mapping, navigation, and situational awareness, yet remains challenging under large viewpoint gaps or when GPS is unreliable. We introduce Wrivinder, a zero-shot, geometry-driven framework that aggregates multiple ground photographs to reconstruct a consistent 3D scene and align it with overhead satellite imagery. Wrivinder combines SfM reconstruction, 3D Gaussian Splatting, semantic grounding, and monocular depth\u2013based metric cues to produce a stable zenith-view rendering that can be directly matched to satellite context for metrically accurate camera geo-localization. To support systematic evaluation of this task\u2014which lacks suitable benchmarks\u2014we also release MC-Sat, a curated dataset linking multi-view ground imagery with geo-registered satellite tiles across diverse outdoor environments. Together, Wrivinder and MC-Sat provide a first comprehensive baseline and testbed for studying geometry-centered cross-view alignment without paired supervision. In zero-shot experiments, Wrivinder achieves sub-30m geolocation accuracy across both dense and large-area scenes, highlighting the promise of geometry-based aggregation for robust ground-to-satellite localization. The MC-Sat dataset and Wrivinder codebase will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39144", "url": null, "sourceid": 35832, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36246, "uid": "f15f63c20752f53718ca375b02b47153", "name": "Boundary-Responsive Differentiable Gating for Superpixel-Based Segmentation", "authors": [{"id": 146657, "fullname": "Fatimaelzahraa Ahmed", "url": "http://cvpr.thecvf.com/api/miniconf/users/146657?format=json", "institution": "Hamad Medical Corporation"}, {"id": 77203, "fullname": "Zhihe Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/77203?format=json", "institution": "Hamad Bin Khalifa University"}, {"id": 184566, "fullname": "Gianni Di Caro", "url": "http://cvpr.thecvf.com/api/miniconf/users/184566?format=json", "institution": "Carnegie Mellon University in Qatar"}, {"id": 184567, "fullname": "Diram Tabaa", "url": "http://cvpr.thecvf.com/api/miniconf/users/184567?format=json", "institution": "Carnegie Mellon University"}, {"id": 184568, "fullname": "Mohamed Hamdy", "url": "http://cvpr.thecvf.com/api/miniconf/users/184568?format=json", "institution": "University of Qatar; University of Qatar"}, {"id": 184569, "fullname": "Muraam Abdel-Ghani", "url": "http://cvpr.thecvf.com/api/miniconf/users/184569?format=json", "institution": "Hamad Medical Corporation; Hamad Bin Khalifa University"}, {"id": 184570, "fullname": "Abdulaziz Al-Ali", "url": "http://cvpr.thecvf.com/api/miniconf/users/184570?format=json", "institution": "University of Qatar"}, {"id": 146981, "fullname": "Muhammad Arsalan", "url": "http://cvpr.thecvf.com/api/miniconf/users/146981?format=json", "institution": "Qatar University"}, {"id": 184571, "fullname": "Shidin Balakrishnan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184571?format=json", "institution": "Hamad Medical Corporation"}], "abstract": "We present BRDG, a boundary-responsive differentiable gating superpixel framework designed to resolve the trade-off between computational efficiency and segmentation precision in surgical scenes. At its core, the architecture is organized into three cooperative agents within a fully differentiable backbone. The Region Creator agent converts dense features into learnable superpixel tokens, jointly learning region descriptors and dense context. The Boundary Detector agent acts as a gating mechanism, estimating boundary confidence from region features to predict where refinement is needed. The refinement agent uses this gate to selectively fuse efficient coarse predictions with a high-fidelity refinement path that restores pixel-level details. To further enhance distinctiveness, an adjacency-boosted contrastive loss mines hard negatives across neighboring regions to improve boundary separation. We evaluate BRDG on three surgical tasks requiring high-precision EndoVis18-parts, EndoVis18-tools, EndoVis17-tools, as well as general domain benchmarks. Our model improves mIoU by substantial margins 4.5-7.0 over strong pixel-wise baselines while raising Boundary-F1 scores by over 10 points. Under the same hardware (RTX-A6000 Pro), it reaches 150.25 FPS with only 24M parameters. This makes it 10 times faster and 3.5 smaller than current state-of-the-art models, effectively resolving the critical accuracy\u2013efficiency trade-off in real-time segmentation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36246", "url": null, "sourceid": 46459, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40023, "uid": "c9e8bebd1f410934b634288c41048dc9", "name": "Hi-Lo Prune: Look at What You'll Lose before Pruning with Hierarchical Token Selection", "authors": [{"id": 193329, "fullname": "Zixun Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/193329?format=json", "institution": "Zhejiang University"}, {"id": 193330, "fullname": "Yubo Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/193330?format=json", "institution": "Zhejiang University"}, {"id": 163978, "fullname": "Hehe Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/163978?format=json", "institution": "Zhejiang University"}, {"id": 86325, "fullname": "Yi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86325?format=json", "institution": "Zhejiang University"}], "abstract": "Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision\u2013language understanding, yet processing long visual token sequences remains computationally expensive. Existing approaches mitigate this cost by reducing image tokens, either by discarding them after the visual encoder or by pruning them in the early Transformer layers of the LLM. While these strategies improve efficiency, they inevitably discard informative visual content and risk degrading downstream reasoning performance. To address this challenge, we introduce Hi-Lo Prune, a training-free pruning strategy that is built on a simple principle: look at what you will lose before you remove it. Instead of directly dropping tokens, Hi-Lo Prune first identifies which tokens are safe to prune through a coarse-to-fine selection process, and then encourages the model to absorb their information before pruning occurs. Specifically, the framework consists of three stages: (1) Hierarchical Pruning Token Selection. After visual encoding, we apply a coarse-to-fine process that identifies tokens to retain and selects a critical set of pruning candidates from redundant ones. (2) Attention-Guided Candidate Token Merge. Before removing selected tokens, an attention mechanism is applied to the early LLM layers, which explicitly transfers information from these candidates to the retained tokens. (3) Low-informative Candidate Token Removal. At a designated Transformer layer, the pruned tokens are removed, reducing computation for all subsequent layers. This design enables aggressive early-layer pruning while preserving critical visual cues. Experiments on Qwen2-VL, Qwen2.5-VL, and Qwen3-VL demonstrate that Hi-Lo Prune consistently outperforms existing pruning methods across multiple benchmarks, achieving strong performance even under high pruning ratios without any fine-tuning. The code has been submitted as supplementary material and will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40023", "url": null, "sourceid": 45679, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36707, "uid": "5e1982541fb01c50b6509e8ad3b5221c", "name": "PRISM: Video Dataset Condensation with Progressive Refinement and Insertion for Sparse Motion", "authors": [{"id": 185693, "fullname": "Jaehyun Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185693?format=json", "institution": "Korea AI Safety Institute (AISI), ETRI"}, {"id": 185694, "fullname": "Jiwan Hur", "url": "http://cvpr.thecvf.com/api/miniconf/users/185694?format=json", "institution": "KAIST"}, {"id": 76494, "fullname": "Gyojin Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/76494?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 185695, "fullname": "Jaemyung Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185695?format=json", "institution": "NAVER AI Lab"}, {"id": 85168, "fullname": "Junmo Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/85168?format=json", "institution": "Korea Advanced Institute of Science and Technology"}], "abstract": "Video dataset condensation aims to mitigate the immense computational cost of video processing, but faces the unique challenge of preserving the complex interplay between spatial content and temporal dynamics. Prior work often unnaturally disentangles these elements, overlooking their essential interdependence. We introduce Progressive Refinement and Insertion for Sparse Motion (PRISM), a novel approach that preserves this critical coupling. PRISM begins with a minimal set of key frames and dynamically synthesizes new ones by identifying moments of high motion complexity, where simple interpolation fails, through gradient misalignments. This adaptive process allocates new frames only where such complexity exists, creating highly efficient and temporally coherent synthetic datasets. Extensive experiments show PRISM achieves highly competitive performance on standard action recognition benchmarks, often matching or exceeding prior methods, while creating powerful representations with significantly less storage", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36707", "url": null, "sourceid": 46297, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36395, "uid": "742f435c46a0e01916b8e324cb36af29", "name": "AREA3D: Active Reconstruction Agent with Unified Feed-Forward 3D Perception and Vision-Language Guidance", "authors": [{"id": 178647, "fullname": "Tianling Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/178647?format=json", "institution": "Southern University of Science and Technology"}, {"id": 180674, "fullname": "Shengzhe GAN", "url": "http://cvpr.thecvf.com/api/miniconf/users/180674?format=json", "institution": "Southern University of Science and Technology"}, {"id": 183561, "fullname": "Leslie Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/183561?format=json", "institution": "Harvard University"}, {"id": 184942, "fullname": "Yuelei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184942?format=json", "institution": "University of California, San Diego; California Institute of Technology"}, {"id": 179596, "fullname": "Fangneng Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/179596?format=json", "institution": "Harvard University &amp; MIT"}, {"id": 89796, "fullname": "Hanspeter Pfister", "url": "http://cvpr.thecvf.com/api/miniconf/users/89796?format=json", "institution": "Harvard University"}], "abstract": "Active 3D reconstruction enables an agent to autonomously select viewpoints to build accurate and complete scene geometry efficiently, rather than passively reconstructing scenes from pre-collected images. Existing active reconstruction methods often rely on geometric heuristics, which may result in redundant observations without improving reconstruction quality. To address this, we propose \\textbf{AREA3D}, an active reconstruction agent for 3D reconstruction by leveraging feed-forward 3D models and vision-language guidance. The framework decouples view uncertainty modeling from feed-forward reconstruction, enabling precise uncertainty estimation without online optimization. Moreover, the integrated Vision-Language Model provides high-level semantic guidance that guides exploration beyond purely geometric cues. Extensive experiments on both scene-level and object-level benchmarks (Replica and OmniObject3D) demonstrate that AREA3D achieves state-of-the-art reconstruction accuracy, especially in sparse views.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36395", "url": null, "sourceid": 45918, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36877, "uid": "04221815070349c0923cd85e6957bfb3", "name": "SANER: Switchable Adapter with Non-parametric Enhanced Routing for Person De-Reidentification", "authors": [{"id": 186086, "fullname": "Yimin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186086?format=json", "institution": "Hefei University of Technology"}, {"id": 180552, "fullname": "Nan Pu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180552?format=json", "institution": "Hefei University of Technology"}, {"id": 186087, "fullname": "Fengxiang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186087?format=json", "institution": "Vivo"}, {"id": 130948, "fullname": "Wenjing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/130948?format=json", "institution": "University of Nottingham"}, {"id": 185783, "fullname": "Zhihui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185783?format=json", "institution": "University of Science and Technology of China"}, {"id": 77038, "fullname": "Zhun Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/77038?format=json", "institution": "University of Nottingham"}], "abstract": "Person De-Reidentification (De-ReID) is an emerging and safety-critical task that aims to selectively forget specific individuals in surveillance systems while preserving the recognition capability for others. Existing methods typically learn both forgetting and retaining objectives within a unified feature space, which leads to conflicting optimization goals and may cause unexpected performance degradation on novel retaining identities. Although decoupling pre-trained feature space for forgetting or retaining purpose is a promising solution, discriminating which feature space should be used for the given novel query remain unsolved. To alleviate these challenges, we propose SANER, advancing De-ReID with switchable adapter (SA) and test-time non-parametric enhanced routing (NER) algorithm. SA decouples the pre-trained feature space into two task-specific subspaces with forgetting adapter (F-Adapter) and retaining adapter (R-Adapter). The former suppresses identity-specific semantics for de-identification, while the latter preserves discriminative cues for accurate re-ID. Moreover, SA is further enhanced with NER to adaptively analyze optimal feature space routing for the given novel query at test-time. Specifically, NER compares queries with pre-computed prototypes in the original feature space, mitigating the potential training\u2013testing gap and thus ensures accurate routing for De-ReID. Extensive experiments on multiple De-ReID benchmarks demonstrate SANER's efficacy, providing a new perspective for privacy-preserving visual perception.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36877", "url": null, "sourceid": 33676, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36271, "uid": "788d4162bb33b3dd36f14cb9fdc14905", "name": "HierAmp: Coarse-to-Fine Autoregressive Amplification for Generative Dataset Distillation", "authors": [{"id": 184637, "fullname": "Lin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184637?format=json", "institution": "Northeastern University"}, {"id": 184638, "fullname": "Xinru Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184638?format=json", "institution": "Northeastern University"}, {"id": 169031, "fullname": "Xi Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/169031?format=json", "institution": "Oak Ridge National Laboratory; University of Alabama at Birmingham"}, {"id": 184639, "fullname": "Qihui Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184639?format=json", "institution": "Northeastern University"}, {"id": 184640, "fullname": "Lei Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184640?format=json", "institution": "Northeastern University"}, {"id": 84736, "fullname": "Yanzhi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84736?format=json", "institution": "Northeastern University"}, {"id": 73927, "fullname": "Xue Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/73927?format=json", "institution": "Northeastern University"}, {"id": 182980, "fullname": "OCTAVIA CAMPS", "url": "http://cvpr.thecvf.com/api/miniconf/users/182980?format=json", "institution": "Northeastern University"}, {"id": 84711, "fullname": "Pu Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84711?format=json", "institution": "Northeastern University"}, {"id": 152538, "fullname": "Jianyang Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152538?format=json", "institution": "Zhejiang University"}], "abstract": "Dataset distillation often prioritizes global semantic proximity when creating small surrogate datasets for original large-scale ones. However, object semantics are inherently hierarchical. For example, the position and appearance of a bird's eyes are constrained by the outline of its head. Global proximity alone fails to capture how object-relevant structures at different levels support recognition. In this work, we investigate the contributions of hierarchical semantics to effective distilled data. We leverage the vision autoregressive (VAR) model whose coarse-to-fine generation mirrors this hierarchy and propose HierAmp to amplify semantics at different levels. At each VAR scale, we inject class tokens that dynamically identify salient regions and use their induced maps to guide amplification at that scale. This adds only marginal inference cost while steering synthesis toward discriminative parts and structures. Empirically, we find that semantic amplification leads to more diverse token choices in constructing coarse-scale object layouts. Conversely, at fine scales, the amplification concentrates token usage, increasing focus on object-related details. Across popular dataset distillation benchmarks, HierAmp consistently improves validation performance without explicitly optimizing global proximity, demonstrating the importance of semantic amplification for effective dataset distillation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36271", "url": null, "sourceid": 34945, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37763, "uid": "5e70fb4a015f91554fd3480becd9aca1", "name": "BabyVLM v2: Toward Developmentally Grounded Vision\u2013Language Models with Real Infant-View Data and Cognitive Evaluation Benchmarks", "authors": [{"id": 182485, "fullname": "Shengao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182485?format=json", "institution": "Boston University"}, {"id": 188188, "fullname": "Wenqi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188188?format=json", "institution": "Boston University, Boston University"}, {"id": 188189, "fullname": "Zecheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188189?format=json", "institution": "Boston University, Boston University"}, {"id": 188190, "fullname": "Max Whitton", "url": "http://cvpr.thecvf.com/api/miniconf/users/188190?format=json", "institution": "Boston University, Boston University"}, {"id": 178245, "fullname": "Michael Wakeham", "url": "http://cvpr.thecvf.com/api/miniconf/users/178245?format=json", "institution": "Boston University"}, {"id": 188191, "fullname": "Arjun Chandra", "url": "http://cvpr.thecvf.com/api/miniconf/users/188191?format=json", "institution": "Boston University"}, {"id": 188192, "fullname": "Joey Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188192?format=json", "institution": "Boston University, Boston University"}, {"id": 188193, "fullname": "Pengyue Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188193?format=json", "institution": "Boston University"}, {"id": 188194, "fullname": "Helen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188194?format=json", "institution": null}, {"id": 188195, "fullname": "David Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188195?format=json", "institution": "Boston University, Boston University; University of California, Irvine"}, {"id": 188196, "fullname": "Jeffrey Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188196?format=json", "institution": "Boston University, Boston University"}, {"id": 188197, "fullname": "Shawn Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188197?format=json", "institution": ""}, {"id": 188198, "fullname": "Andrew Zagula", "url": "http://cvpr.thecvf.com/api/miniconf/users/188198?format=json", "institution": "California Institute of Technology"}, {"id": 188199, "fullname": "Amy Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188199?format=json", "institution": null}, {"id": 188200, "fullname": "Andrew Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188200?format=json", "institution": null}, {"id": 188201, "fullname": "Sayaka Nakamura", "url": "http://cvpr.thecvf.com/api/miniconf/users/188201?format=json", "institution": "Sony Group Corporation"}, {"id": 144042, "fullname": "Yuki Yamamoto", "url": "http://cvpr.thecvf.com/api/miniconf/users/144042?format=json", "institution": "Sony Group Corporation"}, {"id": 188202, "fullname": "Jerry Yokono", "url": "http://cvpr.thecvf.com/api/miniconf/users/188202?format=json", "institution": "Sony Group Corporation"}, {"id": 188203, "fullname": "Aaron Mueller", "url": "http://cvpr.thecvf.com/api/miniconf/users/188203?format=json", "institution": "Boston University"}, {"id": 77007, "fullname": "Bryan A. Plummer", "url": "http://cvpr.thecvf.com/api/miniconf/users/77007?format=json", "institution": "Boston University"}, {"id": 69230, "fullname": "Kate Saenko", "url": "http://cvpr.thecvf.com/api/miniconf/users/69230?format=json", "institution": "Meta / Boston University"}, {"id": 158633, "fullname": "Venkatesh Saligrama", "url": "http://cvpr.thecvf.com/api/miniconf/users/158633?format=json", "institution": "Boston University"}, {"id": 88081, "fullname": "Boqing Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/88081?format=json", "institution": "Google"}], "abstract": "Early children's developmental trajectories set up a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision\u2013language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with early children's capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research in developmentally plausible pretraining of vision foundation models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37763", "url": null, "sourceid": 37051, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39230, "uid": "34571ad4ab328f2e87f24657505a6a3e", "name": "Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation", "authors": [{"id": 180007, "fullname": "Haodong Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/180007?format=json", "institution": "HKUST (GZ)"}, {"id": 191650, "fullname": "Hang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191650?format=json", "institution": null}, {"id": 191651, "fullname": "Zhide Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/191651?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 191652, "fullname": "Weilin Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191652?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 191653, "fullname": "Xin Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/191653?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 191654, "fullname": "Zehang Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/191654?format=json", "institution": "China University of Geoscience"}, {"id": 191655, "fullname": "Chengxi Heyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191655?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 191656, "fullname": "Junfeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191656?format=json", "institution": "The Hong Kong University of Science and Technology (GuangZhou)"}, {"id": 158747, "fullname": "Wenxuan Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/158747?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 132017, "fullname": "Shunbo Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/132017?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 87753, "fullname": "Haoang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/87753?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}], "abstract": "Generating realistic hand-object interactions (HOI) videos is a significant challenge due to the difficulty of modeling physical constraints (e.g., contact and occlusion between hands and manipulated objects). Current methods utilize HOI representation as an auxiliary generative objective to guide video synthesis. However, there is a dilemma between 2D and 3D representations that cannot simultaneously guarantee scalability and interaction fidelity. To address this limitation, we propose a structure and contact-aware representation that captures hand-object contact, hand-object occlusion, and holistic structure context without 3D annotations. This interaction-oriented and scalable supervision signal enables the model to learn fine-grained interaction physics and generalize to open-world scenarios. To fully exploit the proposed representation, we introduce a joint-generation paradigm with a share-and-specialization strategy that generates interaction-oriented representations and videos. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on two real-world datasets in generating physics-realistic and temporally coherent HOI videos. Furthermore, our approach exhibits strong generalization to challenging open-world scenarios, highlighting the benefit of our scalable design.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39230", "url": null, "sourceid": 33844, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37253, "uid": "767474c706888300885c4662c26fc30c", "name": "EVObject: Learning Evolving Object-centric Representations for 3D Instance Segmentation without Scene Supervision", "authors": [{"id": 181081, "fullname": "Jiahao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/181081?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 76680, "fullname": "Zihui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76680?format=json", "institution": "The Hong Kong Polytechnic University, Hong Kong Polytechnic University"}, {"id": 183886, "fullname": "Yafei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183886?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 158682, "fullname": "Jinxi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/158682?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 187014, "fullname": "Shenxing Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/187014?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 187015, "fullname": "Zhixuan Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/187015?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 85913, "fullname": "Bo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85913?format=json", "institution": "The Hong Kong Polytechnic University"}], "abstract": "We introduce EVObject for unsupervised 3D instance segmentation that bridges the geometric domain gap between synthetic pretraining data and real-world point clouds. Current methods suffer from structural discrepancies when transferring object priors from synthetic datasets (e.g., ShapeNet) to real scans (e.g., ScanNet), particularly due to morphological variations and occlusion artifacts. To address this, EVObject integrates two innovative modules: (1) An object discerning module that dynamically refines object candidates, enabling continuous adaptation of object priors to target domains; and (2) An object completion module that reconstructs partial geometries before discovering object. We conduct extensive experiments on two real-world datasets and one synthetic dataset, demonstrating superior 3D object segmentation performance over all baselines while achieving state-of-the-art results.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37253", "url": null, "sourceid": 34950, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40098, "uid": "f72a5ab7c0c772a4ed63c40329a3bfcf", "name": "SDTrack: A Baseline for Event-based Tracking via Spiking Neural Networks", "authors": [{"id": 155892, "fullname": "Yimeng Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/155892?format=json", "institution": "Liaoning Technical University"}, {"id": 193519, "fullname": "Zhenbang Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/193519?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 193520, "fullname": "Haodi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193520?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 155891, "fullname": "Wenjie Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/155891?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 193521, "fullname": "Rui-Jie Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193521?format=json", "institution": "University of California, Santa Cruz"}, {"id": 155889, "fullname": "Shuai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155889?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 155890, "fullname": "Dehao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155890?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 151866, "fullname": "Yichen Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/151866?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 180305, "fullname": "Jieyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180305?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 181694, "fullname": "Kexin Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/181694?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 193522, "fullname": "Jingzhinan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193522?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 193523, "fullname": "Jason Eshraghian", "url": "http://cvpr.thecvf.com/api/miniconf/users/193523?format=json", "institution": "University of California, Santa Cruz"}, {"id": 193524, "fullname": "Haicheng Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193524?format=json", "institution": "Liaoning Technical University"}, {"id": 85579, "fullname": "Malu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85579?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Event cameras provide superior temporal resolution, dynamic range, energy efficiency, and pixel bandwidth. Spiking Neural Networks (SNNs) naturally complement event data through discrete spike signals, making them ideal for event-based tracking. However, current approaches combining Artificial Neural Networks (ANNs) and SNNs suffer from suboptimal architectures that compromise energy efficiency and limit tracking performance. To address these limitations, we propose the first Transformer-based \\textbf{S}pike-\\textbf{D}riven \\textbf{T}racking (SDTrack) pipeline. It incorporates a novel event frame aggregation method called Global Trajectory Prompt (GTP) and a Transformer-based tracker. The GTP method effectively captures global trajectory information and aggregates it with event streams into event frames to enhance spatiotemporal representation. The Transformer-based tracker comprises a fully spike-driven SNN backbone and a simple tracking head. The SDTrack pipeline operates end-to-end without data augmentation or post-processing. Extensive experiments demonstrate that our SDTrack-Tiny pipeline achieves competitive accuracy with only 19.61$M$ parameters and 8.16$mJ$ energy consumption, while our Base version achieves state-of-the-art accuracy across three datasets. Our work establishes a solid foundation for future neuromorphic vision research.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40098", "url": null, "sourceid": 45356, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40394?format=json"], "related_events_ids": [40394]}, {"id": 37791, "uid": "f2dfb1d3d8632126dab09d09f52c70b4", "name": "Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models", "authors": [{"id": 188274, "fullname": "Gengwei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188274?format=json", "institution": "University of North Carolina at Chapel Hill"}, {"id": 188275, "fullname": "Jie Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188275?format=json", "institution": "University of Science and Technology of China"}, {"id": 136311, "fullname": "Zhen Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/136311?format=json", "institution": null}, {"id": 188276, "fullname": "Mufan Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188276?format=json", "institution": "Department of Computer Science, University of North Carolina at Chapel Hill"}, {"id": 188277, "fullname": "Hossein Nourkhiz Mahjoub", "url": "http://cvpr.thecvf.com/api/miniconf/users/188277?format=json", "institution": "Honda Research Institute US"}, {"id": 182479, "fullname": "Vaishnav Tadiparthi", "url": "http://cvpr.thecvf.com/api/miniconf/users/182479?format=json", "institution": "Honda Research Institute USA, Inc."}, {"id": 93484, "fullname": "Kwonjoon Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/93484?format=json", "institution": "Honda Research Institute USA"}, {"id": 90108, "fullname": "Yanyong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90108?format=json", "institution": "Rutgers University, Newark"}, {"id": 128595, "fullname": "Tianlong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/128595?format=json", "institution": "Massachusetts Institute of Technology"}], "abstract": "The recent success of reinforcement learning (RL) in large reasoning models has inspired the growing adoption of RL for post-training Multimodal Large Language Models (MLLMs) to enhance their visual reasoning capabilities. Although many studies have reported improved performance, it remains unclear whether RL training truly enables models to learn from and reason with visual information. In this work, we propose the *Hallucination-as-Cue Framework*, an analytical framework designed to investigate the effects of RL-based post-training on multimodal reasoning models from the perspective of model hallucination. Specifically, we introduce hallucination-inductive, modality-specific corruptions that remove or replace essential information required to derive correct answers, thereby forcing the model to reason by hallucination. By applying these corruptions during both training and evaluation, our framework provides a unique perspective for diagnosing RL training dynamics and understanding the intrinsic properties of reasoning datasets. Through extensive experiments and analyses across multiple multimodal reasoning benchmarks, we reveal that the role of model hallucination is more significant than previously recognized. For instance, we find that RL post-training under purely hallucination-inductive settings can achieve comparable or even better performance to standard training. These findings challenge prevailing assumptions about MLLM reasoning and motivate the development of more modality-aware RL-based post-training strategies.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37791", "url": null, "sourceid": 45290, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40394, "uid": "f72a5ab7c0c772a4ed63c40329a3bfcf", "name": "SDTrack: A Baseline for Event-based Tracking via Spiking Neural Networks", "authors": [{"id": 155892, "fullname": "Yimeng Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/155892?format=json", "institution": "Liaoning Technical University"}, {"id": 193519, "fullname": "Zhenbang Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/193519?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 193520, "fullname": "Haodi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193520?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 155891, "fullname": "Wenjie Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/155891?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 193521, "fullname": "Rui-Jie Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193521?format=json", "institution": "University of California, Santa Cruz"}, {"id": 155889, "fullname": "Shuai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155889?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 155890, "fullname": "Dehao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155890?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 151866, "fullname": "Yichen Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/151866?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 180305, "fullname": "Jieyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180305?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 181694, "fullname": "Kexin Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/181694?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 193522, "fullname": "Jingzhinan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193522?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 193523, "fullname": "Jason Eshraghian", "url": "http://cvpr.thecvf.com/api/miniconf/users/193523?format=json", "institution": "University of California, Santa Cruz"}, {"id": 193524, "fullname": "Haicheng Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193524?format=json", "institution": "Liaoning Technical University"}, {"id": 85579, "fullname": "Malu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85579?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Event cameras provide superior temporal resolution, dynamic range, energy efficiency, and pixel bandwidth. Spiking Neural Networks (SNNs) naturally complement event data through discrete spike signals, making them ideal for event-based tracking. However, current approaches combining Artificial Neural Networks (ANNs) and SNNs suffer from suboptimal architectures that compromise energy efficiency and limit tracking performance. To address these limitations, we propose the first Transformer-based \\textbf{S}pike-\\textbf{D}riven \\textbf{T}racking (SDTrack) pipeline. It incorporates a novel event frame aggregation method called Global Trajectory Prompt (GTP) and a Transformer-based tracker. The GTP method effectively captures global trajectory information and aggregates it with event streams into event frames to enhance spatiotemporal representation. The Transformer-based tracker comprises a fully spike-driven SNN backbone and a simple tracking head. The SDTrack pipeline operates end-to-end without data augmentation or post-processing. Extensive experiments demonstrate that our SDTrack-Tiny pipeline achieves competitive accuracy with only 19.61$M$ parameters and 8.16$mJ$ energy consumption, while our Base version achieves state-of-the-art accuracy across three datasets. Our work establishes a solid foundation for future neuromorphic vision research.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40394", "url": null, "sourceid": -45356, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40098?format=json"], "related_events_ids": [40098]}, {"id": 37513, "uid": "295515298179c116880965e483d2ad07", "name": "LookasideVLN: Direction-Aware Aerial Vision-and-Language Navigation", "authors": [{"id": 187619, "fullname": "Yuwei Ning", "url": "http://cvpr.thecvf.com/api/miniconf/users/187619?format=json", "institution": "Sun Yat-sen University"}, {"id": 90968, "fullname": "Ganlong Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/90968?format=json", "institution": "University of Hong Kong"}, {"id": 90989, "fullname": "Yipeng Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/90989?format=json", "institution": "Cardiff University"}, {"id": 75839, "fullname": "Si Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75839?format=json", "institution": "Beihang University"}, {"id": 153470, "fullname": "Yang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153470?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 75470, "fullname": "Liang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/75470?format=json", "institution": "Sun Yat-sen University"}, {"id": 74074, "fullname": "Guanbin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/74074?format=json", "institution": "Sun Yat-sen University"}], "abstract": "Aerial Vision-and-Language Navigation (Aerial VLN) enables unmanned aerial vehicles (UAVs) to follow natural language instructions and navigate complex urban environments.While recent advances have achieved progress through large-scale memory graphs and lookahead path planning, they remain limited by shallow instruction understanding and high computational cost. In particular, existing methods rely primarily on landmark descriptions, overlooking directional cues\u2014a key source of spatial context in human navigation.In this work, we propose LookasideVLN, a new paradigm that exploits directional cues in natural language to achieve both more accurate spatial reasoning and greater computational efficiency. LookasideVLN comprises three core components: (1) an Egocentric Lookaside Graph (ELG) that dynamically encodes instruction-relevant landmarks and their directional relationships, (2) a Spatial Landmark Knowledge Base (SLKB) that provides lightweight memory retrieval from prior navigation experiences, and (3) a Lookaside MLLM Navigation Agent that aligns multimodal information from user instructions, visual observations, and landmark-direction information from ELG for path planning.Extensive experiments show that LookasideVLN significantly outperforms the state-of-the-art CityNavAgent, even with a single-level lookahead, demonstrating that leveraging directional cues is a powerful yet efficient strategy for Aerial VLN.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37513", "url": null, "sourceid": 36944, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39346, "uid": "5b9e3317e97a8e48fd98477d8d0f22e4", "name": "Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos", "authors": [{"id": 181993, "fullname": "Sontao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181993?format=json", "institution": "Zhejiang University"}, {"id": 86554, "fullname": "Sibo Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/86554?format=json", "institution": "Alibaba Group"}, {"id": 191895, "fullname": "Chenyi Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191895?format=json", "institution": "Zhejiang University"}, {"id": 191715, "fullname": "Yuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191715?format=json", "institution": "Zhejiang University"}, {"id": 185835, "fullname": "Ruizhe Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185835?format=json", "institution": "Zhejiang University"}, {"id": 90267, "fullname": "Tongkun Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/90267?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 185834, "fullname": "Ruilin Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185834?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 191716, "fullname": "Yan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191716?format=json", "institution": "ByteDance Inc."}, {"id": 191718, "fullname": "Zhihang Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191718?format=json", "institution": "Tianjin University"}, {"id": 99871, "fullname": "Yuchong Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/99871?format=json", "institution": null}, {"id": 126166, "fullname": "Hang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126166?format=json", "institution": "Sichuan University"}, {"id": 86577, "fullname": "Zhibo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86577?format=json", "institution": "Alibaba Group"}, {"id": 191896, "fullname": "Shuai Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/191896?format=json", "institution": "Alibaba Group"}, {"id": 185837, "fullname": "Junyang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185837?format=json", "institution": "Alibaba Group"}, {"id": 191720, "fullname": "Zuozhu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191720?format=json", "institution": "Zhejiang University"}], "abstract": "The transition from image to video understanding requires vision-language models (VLMs) to shift from recognizing static patterns to reasoning over temporal dynamics such as motion trajectories, speed changes, and state transitions. Yet current post-training methods fall short due to two critical limitations: (1) existing datasets often lack temporal-centricity, where answers can be inferred from isolated keyframes rather than requiring holistic temporal integration; (2) training data generated by proprietary models contains systematic errors in fundamental temporal perception, such as confusing motion directions or misjudging speeds. We introduce \\textbf{SynRL}, a post-training framework that teaches models \\textit{temporal primitives}, the fundamental building blocks of temporal understanding including direction, speed, and state tracking. Our key insight is that these abstract primitives, learned from programmatically generated synthetic videos, transfer effectively to real-world scenarios. We decompose temporal understanding into short-term perceptual primitives (speed, direction) and long-term cognitive primitives (state tracking, retrodictive inference), constructing 7.7K CoT and 7K RL samples with ground-truth frame-level annotations through code-based video generation. Despite training on simple geometric shapes, SynRL achieves substantial improvements across 15 benchmarks spanning temporal grounding, complex reasoning, and general video understanding. Remarkably, our 7.7K synthetic CoT samples outperform Video-R1's 165K real-world samples. We attribute this to fundamental temporal skills\u2014such as tracking frame-by-frame changes and comparing velocity\u2014that transfer effectively from abstract synthetic patterns to complex real-world scenarios. This establishes a new paradigm for video post-training: systematic video temporal learning through carefully designed synthetic data provides a more cost-efficient scaling path.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39346", "url": null, "sourceid": 33499, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38014, "uid": "4e6809f26fea04711177d3d55880f25c", "name": "Long-Term Personalized Multimodal LLMs", "authors": [{"id": 188831, "fullname": "Chang Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/188831?format=json", "institution": "Nanjing University"}, {"id": 126789, "fullname": "Chaoyou Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126789?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 188832, "fullname": "YiFan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188832?format=json", "institution": "Institute of automation, Chinese academy of science"}, {"id": 188833, "fullname": "HaiHuaYang HaiHuaYang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188833?format=json", "institution": null}, {"id": 153687, "fullname": "Caifeng Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153687?format=json", "institution": "Nanjing University"}], "abstract": "Multimodal Large Language Models (MLLMs) serve as daily assistants for millions. However, their ability to generate responses aligned with individual preferences remains limited.Prior approaches enable only static, single-turn personalization through input augmentation or output alignment, and thus fail to capture users\u2019 evolving preferences and personality over time (see Fig.1).In this paper, we introduce Pal-R3, an innovative personalized multimodal agent framework designed for long-term personalization.Pal-R3 transforms a general-purpose MLLM into a personalized assistant by integrating three key capabilities:(a) Remembering: It proactively extracts and summarizes chronological multimodal memories from interactions, consolidating them into a personalized database.(b) Reasoning: It conducts multi-turn reasoning by retrieving and integrating relevant memories from the database. (c) Response Alignment: It infers the user's evolving personality throughout long-term interactions to ensure outputs remain aligned with their unique characteristics.For evaluation, we establish MME-P, a comprehensive benchmark comprising over 2,000 curated interaction cases, designed to assess long-term MLLM personalization across seven key aspects and 14 fine-grained tasks.Extensive experiments validate our method's effectiveness, improving the baseline by 22.4% (MME-P) and 9.8% (PERSONAMEM) under a 128k context, while outperforming GPT-4o by 5.2% and 2.0%, respectively.Our code is available in the supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38014", "url": null, "sourceid": 44849, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38812, "uid": "e2e48e38bc4d2b93fe84305377f2a649", "name": "FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-and-Language Navigation", "authors": [{"id": 162094, "fullname": "Jing Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/162094?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 190748, "fullname": "Lingzhou Mu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190748?format=json", "institution": "Tsinghua University"}, {"id": 190749, "fullname": "Fan Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190749?format=json", "institution": "Alibaba Group"}, {"id": 190750, "fullname": "Chengcheng Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/190750?format=json", "institution": "Alibaba Group"}, {"id": 154906, "fullname": "Mu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154906?format=json", "institution": "Alibaba Group"}, {"id": 86631, "fullname": "Yonggang Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/86631?format=json", "institution": "Beijing University of Posts and Telecommunications"}], "abstract": "Achieving human-level performance in Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context while reasoning over long action sequences. Recent works, such as NavCoT and NavGPT-2, demonstrate the potential of Chain-of-Thought (CoT) reasoning for improving interpretability and long-horizon planning. Moreover, multimodal extensions like OctoNav-R1 and CoT-VLA further validate CoT as a promising pathway toward human-like navigation reasoning. However, existing approaches face critical drawbacks: purely textual CoTs lack spatial grounding and easily overfit to sparse annotated reasoning steps, while multimodal CoTs incur severe token inflation by generating imagined visual observations, making real-time navigation impractical. In this work, we propose *FantasyVLN*, a unified implicit reasoning framework that preserves the benefits of CoT reasoning without explicit token overhead. Specifically, imagined visual tokens are encoded into a compact latent space using a pretrained Visual AutoRegressor (VAR) during CoT reasoning training, and the model jointly learns from textual, visual, and multimodal CoT modes under a unified multi-CoT strategy. At inference, our model performs direct instruction-to-action mapping while still enjoying reasoning-aware representations. Extensive experiments on LH-VLN show that our approach achieves reasoning-aware yet real-time navigation, improving success rates and efficiency while reducing inference latency by an order of magnitude compared to explicit CoT methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38812", "url": null, "sourceid": 41293, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36779, "uid": "8b42718d99b26b5eac54bc18260b8cd4", "name": "Making Training-Free Diffusion Segmentors Scale with the Generative Power", "authors": [{"id": 182418, "fullname": "Benyuan Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/182418?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 156164, "fullname": "Qianqian Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156164?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 185856, "fullname": "Zitai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185856?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 126239, "fullname": "Xiaochun Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/126239?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 157341, "fullname": "Longtao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157341?format=json", "institution": "Alibaba Group"}, {"id": 85019, "fullname": "Qingming Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85019?format=json", "institution": "University of Chinese Academy of Sciences"}], "abstract": "As powerful generative models, text-to-image diffusion models have recently been explored for discriminative tasks as well. A line of research focuses on adapting a pre-trained diffusion model to semantic segmentation without any further training, leading to what we call training-free diffusion segmentors. These methods typically rely on cross-attention maps from the model\u2019s attention layers, which are assumed to capture semantic relationships between image pixels and text tokens. Ideally, such approaches should benefit from more powerful diffusion models, \\textit{i.e.}, stronger generative capability should lead to better segmentation. However, we observe that existing methods often fail to scale accordingly, and in some cases, segmentation performance even degrades when using more powerful models. To understand this issue, we identify two underlying gaps: (i) cross-attention is computed across multiple heads and layers, but there exists a discrepancy between these individual attention maps and a unified global representation. (ii) Even when a global map is available, it does not directly translate to accurate semantic correlation for segmentation, due to score imbalances among different text tokens. To bridge these gaps, we propose two techniques: auto aggregation and per-pixel rescaling, which together enable training-free segmentation to better leverage model capability. We extensively evaluate our approach on standard semantic segmentation benchmarks and further integrate it into an advanced generative framework, demonstrating both its broad applicability and improved performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36779", "url": null, "sourceid": 30785, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38966, "uid": "5f31aac2b5c1e3234a9eb40029ea0f11", "name": "Neurodynamics-Driven Coupled Neural P Systems for Multi-Focus Image Fusion", "authors": [{"id": 181734, "fullname": "Bo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181734?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 191077, "fullname": "Yunkuo Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/191077?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 191078, "fullname": "Tingting Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191078?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 191079, "fullname": "Hang Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/191079?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 191080, "fullname": "Yaxian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191080?format=json", "institution": "Chang&#x27;an University"}, {"id": 191081, "fullname": "Weiping Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191081?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 129237, "fullname": "Lingling Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129237?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 129235, "fullname": "Jun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129235?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "Multi-focus image fusion (MFIF) is a crucial technique in image processing, with a key challenge being the generation of decision maps with precise boundaries. However, traditional methods based on heuristic rules and deep learning methods with black-box networks are difficult to generate high-quality decision maps.To overcome this challenge, we introduce neurodynamics-driven coupled neural P (CNP) systems, which are biological neural computation models inspired by spiking mechanisms, to enhance the accuracy of decision maps.Specifically, we first conduct an in-depth analysis of the model's neurodynamics to identify the constraints between the network parameters and the input signals. This solid analysis avoids abnormal continuous firing of neurons and ensures the model accurately distinguishes between focused and unfocused regions, generating high-quality decision maps for MFIF.Based on this analysis, we propose a Neurodynamics-Driven CNP Fusion model (ND-CNPFuse) tailored for the challenging MFIF task.Unlike current ideas of decision map generation, ND-CNPFuse distinguishes between focused and unfocused regions by mapping the source image into interpretable spike matrices. By comparing the number of spikes, an accurate decision map can be generated directly without any post-processing. Extensive experimental results show that ND-CNPFuse achieves new state-of-the-art performance on four classical MFIF datasets, including Lytro, MFFW, MFI-WHU, and Real-MFF.Code is provided in the supplementary material and will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38966", "url": null, "sourceid": 35925, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39787, "uid": "ed1fc02e7c8dd71f71cded7ca5f29345", "name": "Quantum-Gated Task-interaction Knowledge Distillation for Pre-trained Model-based Class-Incremental Learning", "authors": [{"id": 192849, "fullname": "Linjie Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192849?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 184212, "fullname": "HUIYU XIAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/184212?format=json", "institution": "BUPT"}, {"id": 192850, "fullname": "Jiarui Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192850?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 192851, "fullname": "Zhenyu Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192851?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 192852, "fullname": "JI Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192852?format=json", "institution": null}], "abstract": "Class-incremental learning (CIL) aims to continuously accumulate knowledge from a stream of tasks and construct a unified classifier over all previously seen classes. A key challenge of CIL lies in the discrepancy between clear task boundaries during training and blurred boundaries during inference, where samples from different tasks often occupy overlapping subspaces. Although pretrained models (PTMs) have shown promising performance in CIL, they still struggle with the entanglement of multi-task subspaces, leading to catastrophic forgetting when task routing parameters are poorly calibrated or task-level representations are rigidly fixed. To address this issue, we propose a novel Quantum-Gated Task-interaction Knowledge Distillation (QKD) framework that leverages quantum gating to guide inter-task knowledge transfer. Specifically, we introduce a quantum-gated task modulation gating mechanism to model the relational dependencies among task embedding, dynamically capturing the sample-to-task relevance for both joint training and inference across streaming tasks. Furthermore, we employ lightweight adapters to adapt PTMs to downstream tasks while freezing previously learned adapters. Guided by the quantum gating outputs, we perform task-interaction knowledge distillation guided by these task-embedding-level correlation weights from old to new adapters, enabling the model to bridge the representation gaps between independent task subspaces and jointly calibrate the unified classifier. Extensive experiments on five benchmark datasets demonstrate that QKD effectively mitigates catastrophic forgetting and achieves state-of-the-art performance in class-incremental settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39787", "url": null, "sourceid": 45083, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40378, "uid": "55a2471ff102d6ea6bd23c81aafdfd43", "name": "Understanding Task Transfer in Vision-Language Models", "authors": [{"id": 192703, "fullname": "Bhuvan Sachdeva", "url": "http://cvpr.thecvf.com/api/miniconf/users/192703?format=json", "institution": "Microsoft Research"}, {"id": 148817, "fullname": "Karan Uppal", "url": "http://cvpr.thecvf.com/api/miniconf/users/148817?format=json", "institution": "Microsoft Research India"}, {"id": 92900, "fullname": "Abhinav Java", "url": "http://cvpr.thecvf.com/api/miniconf/users/92900?format=json", "institution": "Microsoft"}, {"id": 153478, "fullname": "Vineeth Balasubramanian", "url": "http://cvpr.thecvf.com/api/miniconf/users/153478?format=json", "institution": "Microsoft Research and IIT-Hyderabad"}], "abstract": "Vision\u2013Language Models (VLMs) perform well on multimodal benchmarks but lag behind humans and specialized models on visual perception tasks like depth estimation or object counting. Finetuning on one task can unpredictably affect performance on others, making task-specific finetuning challenging. In this paper, we address this challenge through a systematic study of task transferability. We examine how finetuning a VLM on one perception task affects its zero-shot performance on others. To quantify these effects, we introduce Perfection Gap Factor (PGF), a metric that captures both the breadth and magnitude of transfer. Using three open-weight VLMs evaluated across 13 perception tasks, we construct a task-transfer graph that reveals previously unobserved relationships among perception tasks. Our analysis uncovers patterns of positive and negative transfer, identifies groups of tasks that mutually influence each other, organizes tasks into personas based on their transfer behavior and demonstrates how PGF can guide data selection for more efficient training. These findings highlight both opportunities for positive transfer \\& risks of negative interference, offering actionable guidance for advancing VLMs.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40378", "url": "https://bhuvan-21.github.io/task-transfer-vlms/", "sourceid": -36087, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/39710?format=json"], "related_events_ids": [39710]}, {"id": 39017, "uid": "cab8c37623b52f4be02f2b1a70e0cc46", "name": "Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning", "authors": [{"id": 165342, "fullname": "Linge Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/165342?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 157888, "fullname": "Yingying Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/157888?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 157886, "fullname": "Bingke Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157886?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 191187, "fullname": "Lu Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191187?format=json", "institution": "Institute of automation, Chinese academy of sciences"}, {"id": 85436, "fullname": "Jinqiao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85436?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}], "abstract": "Recent advances in audio\u2013visual representation learning have shown the value of combining contrastive alignment with masked reconstruction. However, when these objectives are jointly optimized within a single representation space, the contrastive branch is forced to rely on randomly visible patches that often lack semantic relevance. This coupling injects semantic noise into global tokens and creates interference between generative and discriminative objectives, ultimately weakening fine-grained cross-modal alignment.We revisit this formulation and propose TG-DP, a Teacher-Guided Dual-Path framework that separates reconstruction and alignment into independent optimization paths while injecting stable semantic structure into the contrastive branch. A teacher model provides holistic, unmasked semantic targets that guide the student\u2019s token selection, allowing the alignment pathway to focus on consistently meaningful regions without being constrained by reconstruction dynamics.TG-DP yields substantial improvements in zero-shot retrieval, increasing R@1 from 35.2\\% to 37.4\\% (Vision\u2192Audio) and 27.9\\% to 37.1\\% (Audio\u2192Vision) on AudioSet, and from 27.9\\% to 31.3\\% and 23.2\\% to 30.3\\% on VGGSound. Despite prioritizing alignment fidelity, the learned representations remain semantically robust, achieving state-of-the-art linear-probe performance on AS20K and VGGSound.Taken together, our findings show that decoupling multimodal objectives while imposing teacher-guided semantic structure provides a simple yet powerful principle for advancing large-scale audio\u2013visual pretraining.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39017", "url": null, "sourceid": 39565, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36202, "uid": "6e61e57c1a5ea6fa42bf249242440e91", "name": "XSeg: A Large-scale X-ray Contraband Segmentation Benchmark For Real-World Security Screening", "authors": [{"id": 184435, "fullname": "Hongxia Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184435?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 184436, "fullname": "Yixin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184436?format=json", "institution": "South China University of Technology"}, {"id": 184437, "fullname": "Jiali Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184437?format=json", "institution": "South China University of Technology, South China University of Technology"}, {"id": 184438, "fullname": "Litao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184438?format=json", "institution": "South China University of Technology"}, {"id": 175961, "fullname": "Qianyun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175961?format=json", "institution": "South China University of Technology"}, {"id": 184439, "fullname": "Kaijie Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184439?format=json", "institution": "South China University of Technology"}], "abstract": "X-ray contraband detection is critical for public safety. However, current methods primarily rely on bounding box annotations, which limit model generalization and performance due to the lack of pixel-level supervision and real-world data. To address these limitations, we introduce XSeg. To the best of our knowledge, XSeg is the largest X-ray contraband segmentation dataset to date, including 98,644 images and 295,932 instance masks, and contains the latest 30 common contraband categories. The images are sourced from public datasets and our synthesized data, filtered through a custom data cleaning pipeline to remove low-quality samples. To enable accurate and efficient annotation and reduce manual labeling effort, we propose Adaptive Point SAM (APSAM), a specialized mask annotation model built upon the Segment Anything Model (SAM). We address SAM\u2019s poor cross-domain generalization and limited capability in detecting stacked objects by introducing an Energy-Aware Encoder that enhances the initialization of the mask decoder, significantly improving sensitivity to overlapping items. Additionally, we design an Adaptive Point Generator that allows users to obtain precise mask labels with only a single coarse point prompt. Extensive experiments on XSeg demonstrate the superior performance of APSAM.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36202", "url": null, "sourceid": 44802, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39541, "uid": "57d7aa461a3d06574ccd1df4a2de2301", "name": "Exposing Functional Fusion: A New Class of Strategic Backdoor in Dynamic Prompt Architectures", "authors": [{"id": 180433, "fullname": "Zeyao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180433?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 192305, "fullname": "Zhendong Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192305?format=json", "institution": null}, {"id": 187881, "fullname": "Xiaojun Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187881?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 144070, "fullname": "Xin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/144070?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 192306, "fullname": "Yuexin Xuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192306?format=json", "institution": null}, {"id": 192307, "fullname": "XIAOSHUANG JI", "url": "http://cvpr.thecvf.com/api/miniconf/users/192307?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}], "abstract": "Existing ViT backdoor attacks based on backbone-overwriting full-tuning are computationally expensive and inflict performance degradation. This has forced adversaries towards the Visual Parameter-Efficient Fine-Tuning (PEFT) paradigm, dominated by adapter-based (e.g., LoRA) and prompt-based (e.g., VPT) approaches. While adapter security has seen initial study, the risks of the burgeoning prompt-based ecosystem remain critically unexplored. We fill this critical gap, exposing how the evolution of VPT towards dynamic, context-aware architectures innately creates a far more dangerous, emergent threat. This vulnerability arises even though these dynamic modules unlock superior benign performance. We propose VIPER, an attack framework built on a lightweight, dynamic Visual Prompt Generator (VPG) that demonstrates this vulnerability. Critically, this dynamic architecture enables Functional Fusion: an emergent phenomenon where malicious logic and benign task utility are inseparably fused into the same sparse, high-magnitude parameter core. This fusion creates an unsolvable ``hostage\" dilemma, as pruning the attack necessarily destroys the benign performance.Comprehensive evaluations show VIPER resolves the attacker's trilemma: VIPER not only achieves state-of-the-art performance on clean data, but also maintains near-100\\% ASR even under 90\\% VPG-module pruning (where LoRA attacks collapse), while adding only an imperceptible 0.06ms (1.16\\%) of inference latency.  VIPER's results, driven by Functional Fusion, expose a new, paradigm-level risk in dynamic prompt architectures.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39541", "url": null, "sourceid": 46255, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39260, "uid": "388f856bab37b4524fbd2e9a77c344c9", "name": "Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning", "authors": [{"id": 155028, "fullname": "Yifei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155028?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 130710, "fullname": "Wenzhao Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/130710?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 146684, "fullname": "Yanran Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/146684?format=json", "institution": "Tsinghua University"}, {"id": 191732, "fullname": "Runze Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/191732?format=json", "institution": "Tsinghua University"}, {"id": 184826, "fullname": "Yu Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184826?format=json", "institution": "Tsinghua University"}, {"id": 157160, "fullname": "Lei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/157160?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 77142, "fullname": "Jie Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/77142?format=json", "institution": "Tsinghua University"}, {"id": 88597, "fullname": "Jiwen Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88597?format=json", "institution": "Tsinghua University"}], "abstract": "The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present $\\textbf{Skyra}$, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct $\\textbf{ViF-CoT-4K}$ for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations.We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. To comprehensively evaluate Skyra, we introduce $\\textbf{ViF-Bench}$, a benchmark comprising 3K high-quality samples generated by over ten state-of-the-art video generators. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, while our evaluation yields valuable insights for advancing explainable AI-generated video detection. Our code, models, and datasets will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39260", "url": null, "sourceid": 39800, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36511, "uid": "587fa921165411090038e3250be05577", "name": "Deformation-based In-Context Learning for Point Cloud Understanding", "authors": [{"id": 185239, "fullname": "Chengxing Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185239?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 153186, "fullname": "Jinhong Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/153186?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 76644, "fullname": "Yinjie Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/76644?format=json", "institution": "Sichuan University"}, {"id": 88470, "fullname": "Wen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88470?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Recent advances in point cloud In-Context Learning (ICL) have demonstrated strong multitask capabilities. Existing approaches typically adopt a Masked Point Modeling (MPM)-based paradigm for point cloud ICL. However, MPM-based methods directly predict the target point cloud from masked tokens without leveraging geometric priors, requiring the model to infer spatial structure and geometric details solely from token-level correlations via transformers. Additionally, these methods suffer from a training\u2013inference objective mismatch, as the model learns to predict the target point cloud using target-side information that is unavailable at inference time. To address these challenges, we propose DeformPIC, a deformation-based framework for point cloud ICL. Unlike existing approaches that rely on masked reconstruction, DeformPIC learns to deform the query point cloud under task-specific guidance from prompts, enabling explicit geometric reasoning and consistent objectives. Extensive experiments demonstrate that DeformPIC consistently outperforms previous state-of-the-art methods, achieving reductions of 1.6, 1.8, and 4.7 points in average Chamfer Distance on reconstruction, denoising, and registration tasks, respectively. Furthermore, we introduce a new out-of-domain benchmark to evaluate generalization across unseen data distributions, where DeformPIC achieves state-of-the-art performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36511", "url": null, "sourceid": 33594, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37738, "uid": "c1080323abd122db8e79376998d29915", "name": "HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models", "authors": [{"id": 180550, "fullname": "Yangguang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/180550?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 188134, "fullname": "Quan Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188134?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 188135, "fullname": "Yufei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188135?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 188136, "fullname": "Jiachen Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/188136?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 76372, "fullname": "Junyu Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/76372?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 90407, "fullname": "Jitao Sang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90407?format=json", "institution": "Beijing Jiaotong University"}], "abstract": "Object hallucination in Large Vision-Language Models (LVLMs) significantly hinders their reliable deployment. Existing methods struggle to balance efficiency and accuracy: they often require expensive reference models and multiple forward passes, or apply static edits that risk suppressing genuine visual evidence. To address this, we introduce HulluEdit, a single-pass, reference-free intervention framework. Our core innovation is orthogonal subspace editing: we decompose the hidden states of the model into orthogonal subspaces\u2014visual evidence, conflicting priors, and residual uncertainty\u2014enabling selective suppression of hallucinatory patterns without interfering with visual grounding. This approach mathematically guarantees that edits applied to the prior subspace leave the visual component entirely unaffected. Extensive experiments show that HulluEdit achieves state-of-the-art hallucination reduction on benchmarks including POPE and CHAIR across diverse architectures, while preserving general capabilities on MME and maintaining efficient inference. Our method consistently outperforms contrastive decoding and static subspace editing baselines, offering a new pathway toward more trustworthy LVLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37738", "url": null, "sourceid": 46778, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36684, "uid": "775ced8a4d77641ed809d8a33917e6f0", "name": "EvoID: Reinforced Evolution for Identity-Preserving Video Generation", "authors": [{"id": 85537, "fullname": "Yiheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85537?format=json", "institution": "HiDream.ai Inc."}, {"id": 85198, "fullname": "Zhaofan Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85198?format=json", "institution": "University of Science and Technology of China"}, {"id": 185640, "fullname": "Zunxu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185640?format=json", "institution": "University of Science and Technology of China"}, {"id": 77084, "fullname": "Yingwei Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/77084?format=json", "institution": "HiDream.ai"}, {"id": 85029, "fullname": "Ting Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/85029?format=json", "institution": "JD AI Research"}, {"id": 85027, "fullname": "Tao Mei", "url": "http://cvpr.thecvf.com/api/miniconf/users/85027?format=json", "institution": "JD Explore Academy"}], "abstract": "We present EvoID, a novel framework that reformulates Identity-Preserving Video Generation as a self-evolving process through Reinforcement Learning. Moving beyond the static paradigm of imitation learning, EvoID enables a generative model to actively learn and optimize the complex trade-offs between identity fidelity, motion naturalness, and temporal coherence. At the heart of our EvoID is a dynamic, dual-path reward mechanism, which acts as an intrinsic critic by adaptively combining objective metric indicators and MLLM-based holistic quality assessment. This allows the model to \"evolve\" its generation strategy, focusing on different aspects of quality at different stages of training. To ensure stable and coherent evolution, we anchor the exploring Student model with a frozen Teacher, preserving robust world priors while allowing for creative refinement when generating videos. Extensive experiments demonstrate the superiority of our proposal, and EvoID achieves the total score of 0.687 on the Human-Domain of OpenS2V-Eval dataset, surpassing 0.658 of the open-source VACE and 0.653 of the commercial Hailuo. Moreover, EvoID also obtains a new record of 0.718 on our newly minted MLLM-based metric, prioritizing human perception and more comprehensively reflecting video quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36684", "url": null, "sourceid": 39314, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39800, "uid": "7f61d36c9a87ccb4776231dd48d45a79", "name": "Streaming Video Crime Anticipation with Spatio-Temporal Causal Reasoning", "authors": [{"id": 192884, "fullname": "Yusong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192884?format=json", "institution": "TeleAI"}, {"id": 192885, "fullname": "Zheyuan Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192885?format=json", "institution": "Peking University"}, {"id": 192886, "fullname": "MAO KEYU", "url": "http://cvpr.thecvf.com/api/miniconf/users/192886?format=json", "institution": "Tokyo Institute of Technology, Tokyo Institute of Technology"}, {"id": 192887, "fullname": "Minghao Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192887?format=json", "institution": "New York University"}, {"id": 154776, "fullname": "Mingkun Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154776?format=json", "institution": "Guangdong Institute of Intelligence Science and Technology"}, {"id": 153425, "fullname": "Prayag Tiwari", "url": "http://cvpr.thecvf.com/api/miniconf/users/153425?format=json", "institution": "Halmstad University"}, {"id": 192888, "fullname": "Jiawei Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192888?format=json", "institution": "China Telecom"}, {"id": 181785, "fullname": "qingsong zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/181785?format=json", "institution": "Institute of Artificial Intelligence (TeleAI), China Telecom"}], "abstract": "Crime anticipation enables proactive public safety interventions, yet existing video security systems remain largely reactive, unable to detect precursors of crime. While current visual language models (VLM)-based video understanding methods show promise in high-level reasoning, they are not designed to explicitly model the spatio-temporal causal relationships essential for anticipating crimes.We address this limitation by two causal-driven components. First, we develop the Spatio-Temporal Causal Reasoning Crime (STCRC) dataset, a hierarchical dataset comprising 73K samples across five progressive causal reasoning tasks, facilitating criminal precursors learning. Second, we propose the Spatio-Temporal Causal Hypergraph (STCH), a streaming module that transforms implicit entity dynamics into explicit causal structures to enhance causal reasoning for crime in VLMs. By combining these two components, our framework advances real-time crime anticipation, achieving improvements in anticipatory tasks: a 70.7% relative improvement in crime classification, a 10.1% in crime detection, and a 3.7% reduction in temporal prediction error.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39800", "url": null, "sourceid": 33512, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39499, "uid": "a9136794e5b084c915b1470232654802", "name": "Sparsely Timing the Change: A Spiking Temporal Framework for Remote Sensing Interpretation", "authors": [{"id": 192204, "fullname": "Shilong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192204?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 147842, "fullname": "Xiurui Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/147842?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 182377, "fullname": "Qiugang Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/182377?format=json", "institution": "Southwestern University of Finance and Economics"}, {"id": 192205, "fullname": "Luochao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192205?format=json", "institution": "Southwest University of Finance and Economics"}, {"id": 192206, "fullname": "Yong Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192206?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 192207, "fullname": "Guisong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192207?format=json", "institution": "Southwestern University of Finance and Economics; University of Electronic Science and Technology of China"}], "abstract": "The temporal evolution patterns of surface spatial structures constitute a central concern within the field of intelligent remote sensing interpretation.However, constrained by the availability of only two temporal phases, modeling sparse spatio-temporal change processes to effectively interpret surface alterations remains a core challenge in intelligent remote sensing analysis. To address this, this paper proposes SpikeAdapter, a lightweight enhancement framework. This framework comprises Geo-Spike Interpolation (GSI-P), an spiking neural network (SNN) feature extractor, and the spatio-temporal fusion module STSpikeFuse. Inspired by the brain\u2019s perceptual response to new and fading stimuli, the core GSI-P module transforms bi-temporal radiometric differences into sparse spike sequences with time-to-first-spike characteristics.Then we use a feature extractor of SNN to capture dynamic variations of land-surface targets. The STSpkeFuse module employs a learnable temporal decay mechanism to adaptively fuse the SNN features with the semantic representations. This representations are generated by a traditional artificial neural network (ANN) backbone.Extensive experiments on change detection datasets demonstrate that SpikeAdapter effectively enhances temporal awareness and interpretability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39499", "url": null, "sourceid": 42998, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37500, "uid": "080261e4427a081fc6e637b654f590ee", "name": "A supervised multi-task framework for joint cryo-ET restoration enabled by generative physical simulation", "authors": [{"id": 135974, "fullname": "Xinsheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/135974?format=json", "institution": "Beijing Institute of Technology"}, {"id": 187586, "fullname": "Zhidong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187586?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 187587, "fullname": "Xiaohua Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187587?format=json", "institution": "Beijing Institute of Technology"}, {"id": 187588, "fullname": "Renmin Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/187588?format=json", "institution": "Shandong University"}, {"id": 187589, "fullname": "Shuai Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187589?format=json", "institution": "Beijing Institute of Technology"}, {"id": 174100, "fullname": "Hao Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/174100?format=json", "institution": "School of Medical Technology, Beijing Institute of Technology ;Key Laboratory of Brain Health Intelligent Evaluation and Intervention(Beijing Institute of Technology), Ministry of Education"}, {"id": 187590, "fullname": "Fa Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187590?format=json", "institution": "Beijing Institute of Technology"}, {"id": 187591, "fullname": "Bin Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187591?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "Cryo-electron tomography (cryo-ET) enables in-situ visualization of cellular structures at near-native state, yet its practical utility is often hampered by extremely low signal-to-noise ratio (SNR) and severe missing wedge artifacts resulting from dose limitations and restricted tilt angles. While several computational methods have been proposed for reconstructing high-quality tomograms, the performance is still limited by the absence of accurate noise modeling and reliable ground truth data. To address this challenge, we propose cryoDeRec, a multi-task learning framework to jointly address denoising and missing wedge reconstruction in fully supervised manner. The main contribution of cryoDeRec is a dual-objective training strategy incorporated with synthetically corrupted tomogram and raw noisy tomogram, enabling simultaneous restoration of structural fidelity and reconstruction of missing information. The model is trained on a physically synthetic dataset generated by a novel imaging simulation pipeline that incorporates authentic noise distributions and isotropic structural priors. We evaluate cryoDeRec on four realistic cryo-ET datasets and two simulated datasets with extremely low SNR, all reconstructed using Weighted Back Projection (WBP). Extensive experimental results demonstrate that our method achieves high-quality restoration directly from raw tomograms without any pre-processing, outperforming existing state-of-the-art methods. Our findings show that training on a comprehensive simulated dataset, which captures realistic noise and structural diversity, enables models to generalize effectively to real cryo-ET tomograms. The code and datasets will be available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37500", "url": null, "sourceid": 43303, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36940, "uid": "dd31b8ef014fa2630dc87d559037be7c", "name": "SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations", "authors": [{"id": 184258, "fullname": "Yunnan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184258?format=json", "institution": "Ant Group; Shanghai Jiao Tong University; Eastern Institute of Technology"}, {"id": 90363, "fullname": "Kecheng Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90363?format=json", "institution": "Ant Group"}, {"id": 160682, "fullname": "Jianyuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/160682?format=json", "institution": "Oxford VGG"}, {"id": 128085, "fullname": "Minghao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/128085?format=json", "institution": "University of Oxford"}, {"id": 152120, "fullname": "David Novotny", "url": "http://cvpr.thecvf.com/api/miniconf/users/152120?format=json", "institution": "Meta"}, {"id": 129663, "fullname": "Christian Rupprecht", "url": "http://cvpr.thecvf.com/api/miniconf/users/129663?format=json", "institution": "University of Oxford"}, {"id": 91631, "fullname": "Yinghao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/91631?format=json", "institution": "Stanford University"}, {"id": 186273, "fullname": "Xing Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186273?format=json", "institution": "Ant Group"}, {"id": 133538, "fullname": "Wenjun Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/133538?format=json", "institution": "Eastern Institute of Technology, Ningbo"}, {"id": 132631, "fullname": "Xin Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/132631?format=json", "institution": "Eastern Institute of Technology, Ningbo"}, {"id": 88128, "fullname": "Yujun Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88128?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "The convergence of 3D geometric perception and video synthesis has created an unprecedented demand for large-scale video data that is rich in both semantic and spatio-temporal information. While existing datasets have advanced either 3D understanding or video generation, a significant gap remains in providing a unified resource that supports both domains at scale. To bridge this chasm, we introduce SceneScribe-1M, a new large-scale, multi-modal video dataset. It comprises one million in-the-wild videos, each meticulously annotated with detailed textual descriptions, precise camera parameters, dense depth maps, and consistent 3D point tracks. We demonstrate the versatility and value of SceneScribe-1M by establishing benchmarks across a wide array of downstream tasks, including monocular depth estimation, scene reconstruction, and dynamic point tracking, as well as generative tasks such as text-to-video synthesis, with or without camera control. By open-sourcing SceneScribe-1M, we aim to provide a comprehensive benchmark and a catalyst for research, fostering the development of models that can both perceive the dynamic 3D world and generate controllable, realistic video content.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36940", "url": null, "sourceid": 44367, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38769, "uid": "e2899436599560e1d5c1ac76f324cd85", "name": "MedKCO: Medical Vision-Language Pretraining via Knowledge-Driven Cognitive Orchestration", "authors": [{"id": 181201, "fullname": "Chenran Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181201?format=json", "institution": "Southeast University"}, {"id": 190628, "fullname": "Ruiqi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190628?format=json", "institution": "Southeast University"}, {"id": 184810, "fullname": "Tao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184810?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 185180, "fullname": "Yi Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185180?format=json", "institution": "Southeast University"}], "abstract": "Medical vision-language pretraining (VLP) models have recently been investigated for their generalization to diverse downstream tasks. However, current medical VLP methods typically force the model to learn simple and complex concepts simultaneously. This anti-cognitive process leads to suboptimal feature representations, especially under distribution shift. To address this limitation, we propose a Knowledge-driven Cognitive Orchestration for Medical VLP (MedKCO) that involves both the ordering of the pretraining data and the learning objective of vision-language contrast. Specifically, we design a two level curriculum by incorporating diagnostic sensitivity and intra-class sample representativeness for the ordering of the pretraining data. Moreover, considering the inter-class similarity of medical images, we introduce a self-paced asymmetric contrastive loss to dynamically adjust the participation of the pretraining objective. We evaluate the proposed pretraining method on three medical imaging scenarios in multiple vision-language downstream tasks, and compare it with several curriculum learning methods. Extensive experiments show that our method significantly surpasses all baselines. The source codes will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38769", "url": null, "sourceid": 43544, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39817, "uid": "b5c4df4f16237128934c0b239a2e49b3", "name": "Echoes of Ownership: Adversarial-Guided Dual Injection for Copyright Protection in MLLMs", "authors": [{"id": 192913, "fullname": "Chengwei Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/192913?format=json", "institution": "Lanzhou University"}, {"id": 107182, "fullname": "Fan Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/107182?format=json", "institution": "Zhejiang University"}, {"id": 88905, "fullname": "Ruijie Quan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88905?format=json", "institution": "Zhejiang University"}, {"id": 192914, "fullname": "Yunqiu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192914?format=json", "institution": "National University of Singapore"}, {"id": 190270, "fullname": "Kun Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190270?format=json", "institution": "Lanzhou University"}, {"id": 86325, "fullname": "Yi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86325?format=json", "institution": "Zhejiang University"}], "abstract": "With the rapid deployment and widespread adoption of multimodal large language models (MLLMs), disputes regarding model version attribution and ownership have become increasingly frequent, raising significant concerns about intellectual property protection. In this paper, we propose a framework for generating copyright triggers for MLLMs, enabling model publishers to embed verifiable ownership information into the model. The goal is to construct trigger images that elicit ownership-related textual responses exclusively in fine-tuned derivatives of the original model, while remaining inert in other non-derivative models. Our method constructs a tracking trigger image by treating the image as a learnable tensor, performing adversarial optimization with dual-injection of ownership-relevant semantic information. The first injection is achieved by enforcing textual consistency between the output of an auxiliary MLLM and a predefined ownership-relevant target text; the consistency loss is backpropagated to inject this ownership-related information into the image. The second injection is performed at the semantic-level by minimizing the distance between the CLIP features of the image and those of the target text. Furthermore, we introduce an additional adversarial training stage involving the auxiliary model derived from the original model itself. This model is trained to resist generating ownership-relevant target text, thereby enhancing robustness in heavily fine-tuned derivative models. Extensive experiments demonstrate the effectiveness of our dual-injection approach in tracking model lineage under various fine-tuning and domain-shift scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39817", "url": null, "sourceid": 41535, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39159, "uid": "e9db2508797f0fd542868299c00e8091", "name": "HyperST: Hierarchical Hyperbolic Learning for Spatial Transcriptomics Prediction", "authors": [{"id": 180529, "fullname": "Chen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180529?format=json", "institution": "Xiamen University"}, {"id": 191477, "fullname": "Yilu An", "url": "http://cvpr.thecvf.com/api/miniconf/users/191477?format=json", "institution": "Xiamen University"}, {"id": 127979, "fullname": "Ying Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/127979?format=json", "institution": "Xiamen University"}, {"id": 191478, "fullname": "Hao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191478?format=json", "institution": "Xiamen University"}, {"id": 143860, "fullname": "Xitong Ling", "url": "http://cvpr.thecvf.com/api/miniconf/users/143860?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 191479, "fullname": "Lihao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191479?format=json", "institution": "Amazon"}, {"id": 86671, "fullname": "Junjun He", "url": "http://cvpr.thecvf.com/api/miniconf/users/86671?format=json", "institution": "Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences"}, {"id": 191480, "fullname": "Yuxiang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/191480?format=json", "institution": "Xiamen University"}, {"id": 156462, "fullname": "Zihui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156462?format=json", "institution": "Peng Cheng Laboratory"}, {"id": 127954, "fullname": "Rongshan Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127954?format=json", "institution": "Xiamen University"}], "abstract": "Spatial Transcriptomics (ST) merges the benefits of pathology images and gene expression, linking molecular profiles with tissue structure to analyze spot-level function comprehensively. Predicting gene expression from histology images is a cost-effective alternative to expensive ST technologies. However, existing methods mainly focus on spot-level image-to-gene matching but fail to leverage the full hierarchical structure of ST data, especially on the gene expression side, leading to incomplete image-gene alignment. Moreover, a challenge arises from the inherent information asymmetry: gene expression profiles contain more molecular details that may lack salient visual correlates in histological images, demanding a sophisticated representation learning approach to bridge this modality gap. We propose HyperST, a framework for ST prediction that learns multi-level image-gene representations by modeling the data's inherent hierarchy within hyperbolic space, a natural geometric setting for such structures. First, we design a Multi-Level Representation Extractors to capture both spot-level and niche-level representations from each modality, providing context-aware information beyond individual spot-level image-gene pairs. Second, a Hierarchical Hyperbolic Alignment module is introduced to unify these representations, performing spatial alignment while hierarchically structuring image and gene embeddings. This alignment strategy enriches the image representations with molecular semantics, significantly improving cross-modal prediction. HyperST achieves state-of-the-art performance on four public datasets from different tissues, paving the way for more scalable and accurate spatial transcriptomics prediction.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39159", "url": null, "sourceid": 41526, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39064, "uid": "e787c80d4fae910c1c6d0a14eb2d5db5", "name": "DRM: Diffusion-based Reward Model With Step-wise Guidance", "authors": [{"id": 191286, "fullname": "Jaxon Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191286?format=json", "institution": "Peking University"}, {"id": 89377, "fullname": "Binxin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89377?format=json", "institution": "University of Science and Technology of China"}, {"id": 131667, "fullname": "Hubery Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/131667?format=json", "institution": "Tencent"}, {"id": 86641, "fullname": "Chen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86641?format=json", "institution": "WeChat, Tencent"}, {"id": 185164, "fullname": "Jing LYU", "url": "http://cvpr.thecvf.com/api/miniconf/users/185164?format=json", "institution": "Wechat Team"}], "abstract": "Current mainstream methods of aligning diffusion models with human preferences typically employ VLM-based reward models.However, these reward models, pre-trained for semantic alignment, struggle to capture the essential perceptual qualities\u2014such as aesthetics, composition, and visual harmony.In this work, we argue that a model capable of high-fidelity generation must possess a profound understanding of these visual attributes.Based on this insight, we introduce the Diffusion-based Reward Model (DRM), a novel paradigm that use the pre-trained diffusion model as a powerful evaluative backbone.A key advantage of the DRM is its unique ability to assess not only the final image but also the noisy intermediate latents at any stage of the generative process. We leverage this step-wise evaluative capacity in two ways.First, we propose Step-wise GRPO, a reinforcement learning algorithm that provides dense, per-step rewards to resolve the imprecise credit assignment problem in GRPO algorithm, leading to more stable and effective alignment.Second, we introduce Step-wise Sampling, a novel inference strategy that employs the DRM as a dynamic guide to evaluate multiple generation paths at each step, steering the process towards higher-quality outcomes.Extensive experiments confirm that our approach significantly enhances the final quality of generated images.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39064", "url": null, "sourceid": 41027, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39710, "uid": "55a2471ff102d6ea6bd23c81aafdfd43", "name": "Understanding Task Transfer in Vision-Language Models", "authors": [{"id": 192703, "fullname": "Bhuvan Sachdeva", "url": "http://cvpr.thecvf.com/api/miniconf/users/192703?format=json", "institution": "Microsoft Research"}, {"id": 148817, "fullname": "Karan Uppal", "url": "http://cvpr.thecvf.com/api/miniconf/users/148817?format=json", "institution": "Microsoft Research India"}, {"id": 92900, "fullname": "Abhinav Java", "url": "http://cvpr.thecvf.com/api/miniconf/users/92900?format=json", "institution": "Microsoft"}, {"id": 153478, "fullname": "Vineeth Balasubramanian", "url": "http://cvpr.thecvf.com/api/miniconf/users/153478?format=json", "institution": "Microsoft Research and IIT-Hyderabad"}], "abstract": "Vision\u2013Language Models (VLMs) perform well on multimodal benchmarks but lag behind humans and specialized models on visual perception tasks like depth estimation or object counting. Finetuning on one task can unpredictably affect performance on others, making task-specific finetuning challenging. In this paper, we address this challenge through a systematic study of task transferability. We examine how finetuning a VLM on one perception task affects its zero-shot performance on others. To quantify these effects, we introduce Perfection Gap Factor (PGF), a metric that captures both the breadth and magnitude of transfer. Using three open-weight VLMs evaluated across 13 perception tasks, we construct a task-transfer graph that reveals previously unobserved relationships among perception tasks. Our analysis uncovers patterns of positive and negative transfer, identifies groups of tasks that mutually influence each other, organizes tasks into personas based on their transfer behavior and demonstrates how PGF can guide data selection for more efficient training. These findings highlight both opportunities for positive transfer \\& risks of negative interference, offering actionable guidance for advancing VLMs.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39710", "url": "https://bhuvan-21.github.io/task-transfer-vlms/", "sourceid": 36087, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40378?format=json"], "related_events_ids": [40378]}, {"id": 38955, "uid": "44c7f5d8df325a8dfa1140bbd5825d89", "name": "Learning to See Through a Baby\u2019s Eyes: Early Visual Diets Enable Robust Visual Intelligence in Humans and Machines", "authors": [{"id": 182236, "fullname": "Yusen Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/182236?format=json", "institution": "Nanyang Technological University"}, {"id": 191055, "fullname": "Qing Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/191055?format=json", "institution": ""}, {"id": 191056, "fullname": "BHARGAVA SATYA NUNNA", "url": "http://cvpr.thecvf.com/api/miniconf/users/191056?format=json", "institution": "Indian Institute of Technology, Madras"}, {"id": 151070, "fullname": "Mengmi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151070?format=json", "institution": "Nanyang Technological University, Singapore"}], "abstract": "Newborns perceive the world with low-acuity, color-degraded, and temporally continuous vision, which gradually sharpens as infants develop. To explore the ecological advantages of such staged ``visual diets\", we train self-supervised learning (SSL) models on object-centric videos under constraints that simulate infant vision: grayscale-to-color (C), blur-to-sharp (A), and preserved temporal continuity (T)\u2014collectively termed CATDiet. For evaluation, we establish a comprehensive benchmark across ten datasets, covering clean and corrupted image recognition, texture\u2013shape cue conflict tests, silhouette recognition, depth-order classification, and the visual cliff paradigm.All CATDiet variants demonstrate enhanced robustness in object recognition, despite being trained solely on object-centric videos. Remarkably, models also exhibit biologically aligned developmental patterns, including neural plasticity changes mirroring synaptic density in macaque V1 and behaviors resembling infants\u2019 visual cliff responses. Building on these insights, CombDiet initializes SSL with CATDiet before standard training while preserving temporal continuity. Trained on object-centric or head-mounted infant videos, CombDiet outperforms standard SSL on both in-domain and out-of-domain object recognition and depth perception. Together, these results suggest that the developmental progression of early infant visual experience offers a powerful reverse-engineering framework for understanding the emergence of robust visual intelligence in machines. All code, data, and models will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38955", "url": null, "sourceid": 43294, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40114, "uid": "bd76216ab6325a6355d8caa07e5cbfec", "name": "MaskFocus: Focusing Policy Optimization on Critical Steps for Masked Image Generation", "authors": [{"id": 143857, "fullname": "Guohui Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/143857?format=json", "institution": "University of Science and Technology of China"}, {"id": 86528, "fullname": "Hu Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86528?format=json", "institution": "University of Science and Technology of China"}, {"id": 126408, "fullname": "Xiaoxiao Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/126408?format=json", "institution": "University of Science and Technology of China"}, {"id": 193568, "fullname": "Yaning Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193568?format=json", "institution": "Fudan University"}, {"id": 144943, "fullname": "Hang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/144943?format=json", "institution": "University of Science and Technology of China"}, {"id": 86575, "fullname": "Jie Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86575?format=json", "institution": "University of Science and Technology of China"}, {"id": 86207, "fullname": "Feng Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86207?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Reinforcement learning (RL) has demonstrated significant potential for post-training language models and autoregressive visual generative models, but adapting RL to masked generative models remains challenging.The core factor is that policy optimization requires accounting for the probability likelihood of each step due to its multi-step and iterative refinement process. This reliance on entire sampling trajectories introduces high computational cost, whereas natively optimizing random steps often yields suboptimal results.In this paper, we present $\\textbf{MaskFocus}$, a novel RL framework that achieves effective policy optimization for masked generative models by focusing on critical steps.Specifically, we determine the step-level information gain by measuring the similarity between the intermediate images at each sampling step and the final generated image.Crucially, we leverage this to identify the most critical and valuable steps and execute focused policy optimization on them.Furthermore, we design a dynamic routing sampling mechanism based on entropy to encourage the model to explore more valuable masking strategies for samples with low entropy.Extensive experiments on multiple Text-to-Image benchmarks validate the effectiveness of our method.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40114", "url": null, "sourceid": 43896, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38543, "uid": "66744053f818d4032f5ba881340db020", "name": "Reconstructing CLIP for Open-Vocabulary Dense Perception", "authors": [{"id": 190096, "fullname": "Yajie Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190096?format=json", "institution": "Beijing University"}, {"id": 120011, "fullname": "Jinjin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/120011?format=json", "institution": "Beihang University"}, {"id": 77156, "fullname": "Qingjie Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/77156?format=json", "institution": "Beihang University"}, {"id": 87605, "fullname": "Di Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87605?format=json", "institution": "Beihang University"}], "abstract": "Large-scale vision\u2013language models (VLMs) such as CLIP have excelled in zero-shot image classification, yet they struggle to achieve the dense cross-modal alignment required by open-vocabulary dense perception (OVDP). While recent self-distillation methods address this by aligning dense features with the generalizable global semantics, a key question remains: how should such dense features be constructed to achieve optimal alignment? To address this, we propose DenseRC, a principled $\\textbf{Dense}$ $\\textbf{R}$epresentations $\\textbf{C}$onstruction framework that reconstructs CLIP for OVDP based on two key insights.First, by analyzing the internal semantics encoded in the global $\\textit{cls}$ token, we identify that multi-layer value embeddings serve as an informative basis for dense features. Second, we reveal that spatial aggregation tends to amplify semantic misalignment.  Motivated by this, we design a lightweight Head-Selective Gating (HSG) module that adaptively reweights feature heads according to their intrinsic heterogeneity, enabling discriminative and alignment-friendly dense representations construction. Extensive experiments demonstrate that DenseRC delivers consistent and substantial gains across OVDP tasks including object detection and semantic segmentation, setting new state-of-the-art performance on multiple benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38543", "url": null, "sourceid": 41691, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37231, "uid": "d25814a309be54d1d2279a4ff921ea28", "name": "Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs", "authors": [{"id": 186962, "fullname": "Jingze Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186962?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 182900, "fullname": "Quan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182900?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 186963, "fullname": "Hongfei Suo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186963?format=json", "institution": "Sun Yat-sen University"}, {"id": 186964, "fullname": "Zeqiang Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/186964?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 186965, "fullname": "Hongbo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186965?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Although reinforcement learning (RL) has significantly advanced reasoning capabilities in large multimodal language models (MLLMs), its efficacy remains limited for lightweight models essential for edge deployments.To address this issue, we leverage causal analysis and experiment to reveal the underlying phenomenon of perceptual bias, demonstrating that RL-based fine-tuning compels lightweight models to preferentially adopt perceptual shortcuts induced by data biases, rather than developing genuine reasoning abilities.Motivated by this insight, we propose VideoThinker, a causal-inspired framework that cultivates robust reasoning in lightweight models through a two-stage debiasing process. First, the Bias Aware Training stage forges a dedicated \"bias model\" to embody these shortcut behaviors. Then, the Causal Debiasing Policy Optimization (CDPO) algorithm fine-tunes the primary model, employing an innovative repulsive objective to actively push it away from the bias model's flawed logic while simultaneously pulling it toward correct, generalizable solutions.Our model, VideoThinker-R1, establishes a new state-of-the-art in video reasoning efficiency. For same-scale comparison, requiring no Supervised Fine-Tuning (SFT) and using only 1 of the training data for RL, it surpasses VideoRFT-3B with a 3.2% average gain on widely-used benchmarks and a 7% lead on VideoMME. For cross-scale comparison, it outperforms the larger Video-UTR-7B model on multiple benchmarks, including a 2.1% gain on MVBench and a 3.8% gain on TempCompass. Code will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37231", "url": "https://github.com/falonss703/VideoThinker", "sourceid": 42561, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40029, "uid": "f03c011dbcfc842f49e035316a45c6fc", "name": "View-Aware Semantic Alignment for Aerial-Ground Person Re-Identification", "authors": [{"id": 182900, "fullname": "Quan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182900?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 186964, "fullname": "Zeqiang Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/186964?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 193338, "fullname": "Peiming Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193338?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 186962, "fullname": "Jingze Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186962?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 193339, "fullname": "Cailun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193339?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 186965, "fullname": "Hongbo Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186965?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 87632, "fullname": "Jianhuang Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/87632?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Aerial-Ground Person Re-Identification (AGPReID) remains highly challenging due to drastic viewpoint variations between drones and fixed cameras. Existing methods typically follow a view-invariant paradigm, aligning shared features across views to achieve robustness. However, view-invariant inherently enforces part-level alignment, which ignores view-specific cues and discriminative identity information. To this end, this work proposes ViSA (View-aware Semantic Alignment), a view-aware framework that achieves cross-view semantic consistency containing an Expert-driven Token Generation Module (ETGM) and a Dual-branch Local Fusion Module (DLFM). Technically, the former constructs a set of view-aware experts to generate adaptive semantic queries that perceive viewpoint-specific patterns, while the latter leverages graph reasoning to extract and align local regions responsive to different experts under varying viewpoints. Extensive experiments on three large-scale AGPReID benchmarks including AG-ReID.v2, CARGO and LAGPeR demonstrate that ViSA consistently achieves superior performance, with a notable 10.06\\% mAP improvement on the challenging CARGO cross-view protocol. The code will be available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40029", "url": null, "sourceid": 37064, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39954, "uid": "14ef00902e6cd5feb4cb182b173545d9", "name": "Consistency Beyond Contrast: Enhancing Open-Vocabulary Object Detection Robustness via Contextual Consistency Learning", "authors": [{"id": 182148, "fullname": "bozhao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182148?format=json", "institution": "Harbin Institute of Technology (shenzhen)"}, {"id": 193188, "fullname": "Shaocong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193188?format=json", "institution": "PengCheng Laboratory"}, {"id": 156705, "fullname": "Tong Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/156705?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 158206, "fullname": "Senqiao Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158206?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 193189, "fullname": "Qiben Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193189?format=json", "institution": "Peng Cheng Laboratory"}, {"id": 89574, "fullname": "Zhuotao Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/89574?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 89550, "fullname": "Jingyong Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/89550?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Recent advances in open-vocabulary object detection focus primarily on two aspects: scaling up datasets and leveraging contrastive learning to align language and vision modalities. However, these approaches often neglect internal consistency within a single modality, particularly when background or environmental changes occur.  This lack of consistency leads to a performance drop because the model struggles to detect the same object in different scenes, which reveals a robustness gap. To address this issue, we introduce Contextual Consistency Learning (CCL), a novel framework that integrates two key strategies: Contextual Bootstrapped Data Generation (CBDG) and Contextual Consistency Loss (CCLoss). CBDG functions as a data generation mechanism, producing images that contain the same objects across diverse backgrounds. This is essential because existing datasets alone do not support our CCL framework. The CCLoss further enforces the invariance of object features despite environmental changes, thereby improving the model's robustness in different scenes. These strategies collectively form a unified framework for ensuring contextual consistency within the same modality. Our method achieves state-of-the-art performance, surpassing previous approaches by +16.3 AP on OmniLabel and +14.9 AP on $D^3$. These results demonstrate the importance of enforcing intra-modal consistency, significantly enhancing model generalization in diverse environments. Data, code and models will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39954", "url": null, "sourceid": 33805, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36824, "uid": "13250c4462cc3e960d24f5fc585c7b87", "name": "Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache", "authors": [{"id": 185963, "fullname": "Bowen Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/185963?format=json", "institution": "Alibaba Group"}, {"id": 185964, "fullname": "Yuanbin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185964?format=json", "institution": "Alibaba Group"}, {"id": 185965, "fullname": "Huajiang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185965?format=json", "institution": "Alibaba Group"}, {"id": 89789, "fullname": "Biaolong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/89789?format=json", "institution": "Alibaba Group"}, {"id": 185966, "fullname": "Aixi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185966?format=json", "institution": "Alibaba Group"}, {"id": 87163, "fullname": "Hao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87163?format=json", "institution": "Eindhoven University of Technology"}, {"id": 185967, "fullname": "Zhengzheng Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185967?format=json", "institution": "Alibaba Group"}, {"id": 175938, "fullname": "Xu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175938?format=json", "institution": "Alibaba Group"}, {"id": 185968, "fullname": "Pipei Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185968?format=json", "institution": null}], "abstract": "Diffusion models have demonstrated remarkable success in image and video generation, yet their practical deployment remains hindered by the substantial computational overhead of multi-step iterative sampling.Among acceleration strategies, caching-based methods offer a training-free and effective solution by reusing or predicting features across timesteps. However, existing approaches rely on fixed or locally adaptive schedules without considering the global structure of the denoising trajectory, often leading to error accumulation and visual artifacts. To overcome this limitation, we propose DPCache, a novel training-free acceleration framework that formulates diffusion sampling acceleration as a global path planning problem. DPCache constructs a Path-Aware Cost Tensor from a small calibration set to quantify the path-dependent error of skipping timesteps conditioned on the preceding key timestep.Leveraging this tensor, DPCache employs dynamic programming to select an optimal sequence of key timesteps that minimizes the total path cost while preserving trajectory fidelity.During inference, the model performs full computations only at these key timesteps, while intermediate outputs are efficiently predicted using cached features. Extensive experiments on DiT, FLUX, and HunyuanVideo demonstrate that DPCache achieves strong acceleration with minimal quality loss, outperforming prior acceleration methods by $+$0.031 ImageReward at 4.87$\\times$ speedup and even surpassing the full-step baseline by $+$0.028 ImageReward at 3.54$\\times$ speedup on FLUX, validating the effectiveness of our path-aware global scheduling framework.Code is provided in supplementary materials and will be released upon acceptance, with support for mainstream models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36824", "url": null, "sourceid": 38357, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37556, "uid": "634b9747aeb852449da30a415fb60aa9", "name": "FlowPortal: Residual-Corrected Flow for Training-Free Video Relighting and Background Replacement", "authors": [{"id": 187714, "fullname": "Wenshuo Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187714?format=json", "institution": "Peking University"}, {"id": 187715, "fullname": "Junyi Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187715?format=json", "institution": "Peking University"}, {"id": 187716, "fullname": "Jiangyue Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187716?format=json", "institution": "Peking University"}, {"id": 90417, "fullname": "Shuai Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90417?format=json", "institution": "Nanyang Technological University"}], "abstract": "Video relighting with background replacement is a challenging task critical for applications in film production and creative media. Existing methods struggle to balance temporal consistency, spatial fidelity, and illumination naturalness. To address these issues, we introduce FlowPortal, a novel training-free flow-based video relighting framework. Our core innovation is a Residual-Corrected Flow mechanism that transforms a standard flow-based model into an editing model, guaranteeing perfect reconstruction when input conditions are identical and enabling faithful relighting when they differ, resulting in high structural consistency. This is further enhanced by a Decoupled Condition Design for precise lighting control and a High-Frequency Transfer mechanism for detail preservation. Additionally, a masking strategy isolates foreground relighting from background pure generation process. Experiments demonstrate that FlowPortal achieves superior performance in temporal coherence, structural preservation, and lighting realism, while maintaining high efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37556", "url": null, "sourceid": 44073, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37689, "uid": "c8e1d3b1704b30d49e255de822a921cf", "name": "ProcessMaker: A Generalized Process Visualization Framework with Adaptive Sequence Steps on Diffusion Transformers", "authors": [{"id": 188018, "fullname": "Mengling Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188018?format=json", "institution": "Nanjing University of Posts and Telecommunications"}, {"id": 76798, "fullname": "Sisi You", "url": "http://cvpr.thecvf.com/api/miniconf/users/76798?format=json", "institution": "Hefei University of Technology"}, {"id": 175882, "fullname": "Li Yaning", "url": "http://cvpr.thecvf.com/api/miniconf/users/175882?format=json", "institution": "Nanjing University of Posts and Telecommunications"}, {"id": 76806, "fullname": "Bing-Kun Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/76806?format=json", "institution": "Hefei University of Technology"}], "abstract": "Procedural sequence generation aims to create intermediate images through multi-step processes, which is applied in industrial design, educational tutorial, and creative content inspiration. However, existing methods often focus on a specific domain or initialize several expert networks for different domains, which face three challenges. First, the poor generalization to unseen domains. Second, the parameter redundancy due to multiple expert networks.Third, the difficulty in adaptively determining the number of generation steps for different processes.To address these challenges, we propose ProcessMaker, a novel framework that harnesses the inherent generalization capabilities in Diffusion Transformers (DiTs) for procedural sequence generation. Concretely, we introduce three key innovations: (1) Self-supervised Representation Alignment to explore the generalized ability for unseen processes. (2) Sparse Masks for different domains without additional expert networks. (3) A sliding window strategy, which dynamically accommodates the generation steps based on the process complexity. Extensive experiments validate that our ProcessMaker achieves procedural sequence generation with generalization ability and adaptive steps, while using only 7.3% trainable parameters compared with the state-of-the-art method.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37689", "url": null, "sourceid": 32428, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40234, "uid": "9f31d6fe1de940d081e72ce1778c5661", "name": "Learning Scene Coordinate Reconstruction from Unposed Images via Pose Graph Optimization", "authors": [{"id": 189098, "fullname": "Tze Ho Elden Tse", "url": "http://cvpr.thecvf.com/api/miniconf/users/189098?format=json", "institution": "National University of Singapore"}, {"id": 148359, "fullname": "Jizong Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/148359?format=json", "institution": "dConstruct Robotics"}, {"id": 85773, "fullname": "Angela Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/85773?format=json", "institution": "National University of Singapore"}], "abstract": "Learning-based structure-from-motion methods such as ACE-Zero have demonstrated strong performance in estimating camera poses and scene coordinates from unordered image collections without requiring ground truth supervision. However, the lack of global and multi-view consistency constraints in ACE-Zero can lead to pose drift and misalignment, particularly in complex or ambiguous scenes. In this work, we propose a hybrid framework that integrates pose graph optimization (PGO) into ACE-Zero to refine camera poses and suppress incorrect refinements. We construct pose graphs directly from ACE-Zero outputs by extracting relative pose constraints from predicted scene coordinates. Furthermore, we introduce an uncertainty-aware optimization strategy by estimating confidence scores using geometric priors, including epipolar and optical flow consistencies across views. Our approach improves the robustness and accuracy of pose estimation, demonstrating that global geometric reasoning can effectively complement learning-based inference in structure-from-motion.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40234", "url": null, "sourceid": 38004, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37287, "uid": "4ef352d82c0b3506a0df2ed486d44b17", "name": "CRAFT: Aligning Diffusion Models with Fine-Tuning Is Easier Than You Think", "authors": [{"id": 187085, "fullname": "Zening Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/187085?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 187086, "fullname": "Zhengpeng Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/187086?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 187087, "fullname": "Lichen Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/187087?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 131008, "fullname": "Shitong Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/131008?format=json", "institution": "Southeast University"}, {"id": 185431, "fullname": "Shuo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185431?format=json", "institution": "Harbin Institute of Technology (Shenzhen)"}, {"id": 73868, "fullname": "Zeke Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/73868?format=json", "institution": "Baidu Research"}], "abstract": "Aligning Diffusion models has achieved remarkable breakthroughs in generating high-quality, human preference-aligned images. Existing techniques, such as supervised fine-tuning (SFT) and DPO-style preference optimization, have become principled tools for fine-tuning diffusion models. However, SFT relies on high-quality images that are costly to obtain, while DPO-style methods depend on large-scale preference datasets, which are often inconsistent in quality. Beyond data dependency, these methods are further constrained by computational inefficiency. To address these two challenges, we propose Composite Reward Assisted Fine-Tuning (CRAFT), a lightweight yet powerful fine-tuning paradigm that requires significantly reduced training data while maintaining computational efficiency. It first leverages a Composite Reward Filtering (CRF) technique to construct a high-quality and consistent training dataset and then perform an enhanced variant of SFT. We also theoretically prove that CRAFT actually optimizes the lower bound of group-based reinforcement learning, establishing a principled connection between SFT with selected data and reinforcement learning. Our extensive empirical results demonstrate that CRAFT with only 100 samples can easily outperform recent SOTA preference optimization methods with thousands of preference-paired samples. Moreover, CRAFT can even achieve 11-220$\\times$ faster convergences than the baseline preference optimization methods, highlighting its extremely high efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37287", "url": null, "sourceid": 39325, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39753, "uid": "5fb48cf4ff1cf43f1c2085bfab38f18b", "name": "STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval", "authors": [{"id": 181278, "fullname": "Miaoge Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181278?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 88630, "fullname": "Dongsheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88630?format=json", "institution": "Xidian University"}, {"id": 187085, "fullname": "Zening Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/187085?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 174502, "fullname": "Jinsen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174502?format=json", "institution": "Shenzhen university"}, {"id": 86370, "fullname": "Wenhan Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/86370?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 85166, "fullname": "Jingcai Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/85166?format=json", "institution": "The Hong Kong Polytechnic University"}], "abstract": "Training-free zero-shot composed image retrieval models are recently gaining increasing research interest due to their generalizability and flexibility in unseen multimodal retrieval. Recent LLM-based advances focus on generating the expected target caption by exploring the compositional ability behind the LLMs. Although efficient, we find that 1) the generated captions tend to introduce unexpected features from the reference image due to the semantic gap between the input image and text modification, where the image contains much more details than the text; 2) the point-to-point alignment during the retrieval stage fails to capture diverse compositions.To address these challenges, we introduce a novel Semantic Transition and Transportation in Collaboration framework for training-free zero-shot CIR tasks. Specifically, given the composed caption inferred by an LLM, we aim to refine it through a transition vector in the embedding space and make it closer to the target image. Combining LLMs with user instruction, the refined caption concentrates more on the core modification intent and thus filters out unnecessary noise. Moreover, to explore diverse alignment during the retrieval stage, we model the caption and image as discrete distributions and reformulate the retrieval task as a set-to-set alignment task. Finally, a bidirectional transportation distance is developed to consider fine-grained alignments across modalities and calculate the retrieval score.Extensive experimentsdemonstrate that our method can be general, effective, and beneficial for many CIR tasks.The code is attached in the supplementary material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39753", "url": null, "sourceid": 36557, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36493, "uid": "890e62e5aedf7ff639cf8a1e7a17d743", "name": "High-Quality and Efficient Turbulence Mitigation with Events", "authors": [{"id": 185189, "fullname": "Xiaoran Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185189?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 185190, "fullname": "Jian Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/185190?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 101464, "fullname": "Yuxing Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/101464?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 102342, "fullname": "Haoyue Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/102342?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 156851, "fullname": "Gang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/156851?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 90470, "fullname": "Yi Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90470?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 90454, "fullname": "Luxin Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/90454?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Turbulence mitigation (TM) is highly ill-posed due to the stochastic nature of atmospheric turbulence. Most methods rely on multiple frames recorded by conventional cameras to capture stable patterns in natural scenarios. However, they inevitably suffer from a trade-off between accuracy and efficiency: more frames enhance restoration at the cost of higher system latency and larger data overhead. Event cameras, equipped with microsecond temporal resolution and efficient sensing of dynamic changes, offer an opportunity to break the bottleneck. In this work, we present EHETM, a high-quality and efficient TM method inspired by the superiority of events to model motions in continuous sequences. We discover two key phenomena: (1) turbulence-induced events exhibit distinct polarity alternation correlated with sharp image gradients, providing structural cues for restoring scenes; and (2) dynamic objects form spatiotemporally coherent \"event tubes'' in contrast to irregular patterns within turbulent events, providing motion priors for disentangling objects from turbulence. Based on these insights, we design two complementary modules that respectively leverage polarity-weighted gradients for scene refinement and event-tube constraints for motion decoupling, achieving high-quality restoration with few frames. Furthermore, we construct two real-world event-frame turbulence datasets covering atmospheric and thermal cases. Extensive experiments show that EHETM outperforms SOTA methods, especially under scenes with dynamic objects, while reducing data overhead and system latency by approximately 77.3% and 89.5%, respectively.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36493", "url": null, "sourceid": 38697, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39197, "uid": "47ebca2644fd2a35105cb3ab82a1d297", "name": "SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls", "authors": [{"id": 191562, "fullname": "Qianxun Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191562?format=json", "institution": "Duke Kunshan University"}, {"id": 182531, "fullname": "Chenxi Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/182531?format=json", "institution": "Westlake University"}, {"id": 153126, "fullname": "Yujun Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/153126?format=json", "institution": "The University of Queensland"}, {"id": 153810, "fullname": "Chi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153810?format=json", "institution": "Westlake University"}], "abstract": "Recent advances in text-to-video diffusion models have enabled high-fidelity and temporally coherent videos synthesis. However, current models are predominantly optimized for single-event generation. When handling multi-event prompts, without explicit temporal grounding, such models often produce blended or collapsed scenes that break the intended narrative. To address this limitation, we present SwitchCraft, a training-free framework for multi-event video generation. Our key insight is that uniform prompt injection across time ignores the correspondence between events and frames. To this end, we introduce Event-Aligned Query Steering (EAQS), which steers frame-level attention to align with relevant event prompts. Furthermore, we propose Auto-Balance Strength Solver (ABSS), which adaptively balances steering strength to preserve temporal consistency and visual fidelity. Extensive experiments demonstrate that SwitchCraft substantially improves prompt alignment, event clarity, and scene consistency compared with existing baselines, offering a simple yet effective solution for multi-event video generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39197", "url": null, "sourceid": 46285, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39229, "uid": "d7bf523a314c5d650e9b49cd3788cc82", "name": "Exploring Spatial Intelligence from a Generative Perspective", "authors": [{"id": 129165, "fullname": "Muzhi Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129165?format=json", "institution": "Zhejiang University"}, {"id": 191647, "fullname": "Shunyao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191647?format=json", "institution": "Zhejiang University"}, {"id": 187146, "fullname": "Huanyi Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/187146?format=json", "institution": "Zhejiang University"}, {"id": 180126, "fullname": "Zekai Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/180126?format=json", "institution": "Zhejiang University"}, {"id": 180084, "fullname": "Hao Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/180084?format=json", "institution": "Zhejiang University"}, {"id": 185379, "fullname": "Anzhou Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185379?format=json", "institution": "Zhejiang University"}, {"id": 191648, "fullname": "Kaijun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191648?format=json", "institution": null}, {"id": 191649, "fullname": "Jintao Rong", "url": "http://cvpr.thecvf.com/api/miniconf/users/191649?format=json", "institution": "Zhejiang University of Technology"}, {"id": 129178, "fullname": "Yang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129178?format=json", "institution": "Zhejiang University"}, {"id": 185384, "fullname": "Hao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185384?format=json", "institution": "Zhejiang University"}, {"id": 158625, "fullname": "Tao Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/158625?format=json", "institution": "Westlake University"}, {"id": 86450, "fullname": "Chunhua Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/86450?format=json", "institution": "Zhejiang University"}], "abstract": "Spatial intelligence is essential for multimodal large language models, yet current benchmarks largely assess it only from an understanding perspective. We ask whether modern generative or unified multimodal models also possess generative spatial intelligence (GSI)\u2014the ability to respect and manipulate 3D spatial constraints during image generation\u2014and whether such capability can be measured or improved. We introduce GSI-Bench, the first benchmark designed to quantify GSI through spatially grounded image editing. It consists of two complementary components: GSI-Real, a high-quality real-world dataset built via a 3D-prior-guided generation and filtering pipeline, and GSI-Syn, a large-scale synthetic benchmark with controllable spatial operations and fully automated labeling.Together with a unified evaluation protocol, GSI-Bench enables scalable, model-agnostic assessment of spatial compliance and editing fidelity. Experiments show that fine-tuning unified multimodal models on GSI-Syn yields substantial gains on both synthetic and real tasks and, strikingly, also improves downstream spatial understanding. This provides the first clear evidence that generative training can tangibly strengthen spatial reasoning\u2014establishing a new pathway for advancing spatial intelligence in multimodal models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39229", "url": null, "sourceid": 37061, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36928, "uid": "039cc47f960921e3c330b64673d59ae1", "name": "Don\u2019t Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs", "authors": [{"id": 186239, "fullname": "Muhammad Kamran Janjua", "url": "http://cvpr.thecvf.com/api/miniconf/users/186239?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 186240, "fullname": "Hugo Luis Andrade Silva", "url": "http://cvpr.thecvf.com/api/miniconf/users/186240?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 88141, "fullname": "Di Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88141?format=json", "institution": "University of Alberta"}, {"id": 185411, "fullname": "Bahador Rashidi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185411?format=json", "institution": "Huawei Canada"}], "abstract": "Multimodal language models (MLLMs) are increasingly paired with vision tools (e.g., depth, flow, correspondence) to enhance visual reasoning. However, despite access to these tool-generated visual cues, MLLMs often fail to benefit from them. Existing approaches typically feed raw tool outputs into the model, but these dense, pixel-level representations are misaligned with the language-native reasoning strengths of LLMs, leading to weak perception and reliance on language priors. We argue that, in problems where vision tools can provide the necessary visual cues, the bottleneck is not more tool calls or larger MLLMs, it is how tool outputs are represented. We introduce Perception Programs (P^2), a training-free, model-agnostic method that rewrites tool outputs into compact, structured, language-native summaries that MLLMs can directly parse and reason over. Across six perception-centric tasks in BLINK, P^2 consistently yields large improvements over base models and raw tool-augmented baselines. With GPT-5 Mini as the base model, P^2 raises its accuracy from 41.35% to 86.47% on multi-view reasoning, from 52.42% to 81.45% on relative depth, and achieves a 22% average gain across tasks, setting new state-of-the-art results. Even on smaller MLLMs, e.g., InternVL3.5-4B and Qwen3VL-4B, we observe 15\u201340% absolute gains from P^2, surpassing prior agentic, supervised, and RL-based tool-use methods\u2014without any training or model modifications.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36928", "url": "https://github.com/AISmartPerception/perception-programs", "sourceid": 38521, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38622, "uid": "4c5e8027a2dd42ca4ee1b0fcb4cd4fc5", "name": "SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation", "authors": [{"id": 190325, "fullname": "Zhenyu Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190325?format=json", "institution": "University of the Chinese Academy of Sciences; Fudan University; Nanjing University"}, {"id": 190326, "fullname": "Liupeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190326?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 181347, "fullname": "Jinpeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181347?format=json", "institution": "Harbin Institue of Technology, Shenzhen"}, {"id": 190327, "fullname": "Haoqian Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190327?format=json", "institution": "Jilin University"}, {"id": 185047, "fullname": "Yan Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185047?format=json", "institution": "Meituan"}, {"id": 154602, "fullname": "Ke Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/154602?format=json", "institution": "Peng Cheng Laboratory"}, {"id": 87840, "fullname": "Yaowei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87840?format=json", "institution": "Pengcheng Laboratory"}], "abstract": "While large language models provide strong compositional reasoning, existing reasoning segmentation pipelines fail to transparently connect this reasoning to visual perception.Current methods, such as latent query alignment, are end-to-end yet opaque \"black boxes\".Conversely, textual localization readout is merely readable, not truly interpretable, often functioning as an unconstrained post-hoc step.To bridge this interpretability gap, we propose SegCompass, an end-to-end model that leverages a Sparse Autoencoder (SAE) to forge an explicit, interpretable, and differentiable alignment pathway.Given an image-instruction pair, SegCompass first generates a chain-of-thought (CoT) trace.The core of our method is an SAE that maps both the CoT and visual tokens into a shared, high-dimensional sparse concept space.A query codebook selects salient concepts from this space, which are then spatially grounded by a slot mapper into a multi-slot heatmap that guides the final mask decoder.The entire model is trained jointly, unifying reinforcement learning for the reasoning path with standard segmentation supervision.This SAE-driven interface provides a \"white-box\" connection that is significantly more traceable than latent queries and more coherent than textual readouts.Extensive experiments on five challenging benchmarks demonstrate that SegCompass matches or surpasses state-of-the-art performance.Crucially, our visual and quantitative analyses show a strong correlation between the quality of the learned sparse concepts and final mask accuracy, confirming that SegCompass achieves superior results through its enhanced and inspectable alignment.Faithful code will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38622", "url": null, "sourceid": 44613, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36239, "uid": "3a1d312205a596bf188df151fd9c30e6", "name": "From Infusion to Assimilation Distillation for Medical Image Segmentation", "authors": [{"id": 181618, "fullname": "Jiankang Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/181618?format=json", "institution": "Tongji University"}, {"id": 184539, "fullname": "Ye Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/184539?format=json", "institution": "Tongji University"}, {"id": 184540, "fullname": "Yinan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184540?format=json", "institution": "Tongji University"}, {"id": 87284, "fullname": "Junsong Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87284?format=json", "institution": "State University of New York at Buffalo"}], "abstract": "Although foundation models (e.g. SAM) perform remarkably in medical image segmentation, its high computational complexity limits deployment. Knowledge distillation (KD) allows lightweight models to inherit the representational capabilities of large models, thereby mitigating this issue. Existing KD methods enhance student performance, but due to teacher-student different feature advantages, they neglect to internalize and integrate student's semantic information adaptively after knowledge transfer, causing poor knowledge assimilation and limiting gains and generalization. To address this limitation, we propose a novel medical image segmentation framework, which is injection to assimilation distillation (IAD). In Knowledge Injection Stage (KIS), to semantically align teacher-student prediction distributions, soft-label distillation is combined with class-weighted prototype alignment strategy. In Knowledge Assimilation Stage (KAS), to promote adaptively semantic assimilation, a contrastive semantic self-optimization strategy refines student predictions through positive and negative sample pairs and imposes reverse constraints on encoder features to enhance semantic consistency. IAD achieves DICE gains of 4.32\\% on Synapse, 1.85\\% on ACDC, and 2.42\\% on Polyp datasets, and delivers an average 4.16\\% generalization gain on ISIC2018, PH2, BUSI, and STU datasets, outperforming mainstream KD methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36239", "url": null, "sourceid": 38928, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38224, "uid": "18c80ce8cbae9be00f9abc5375be7aa8", "name": "GS-ASM: 2DGS-Supervised Active Stereo Matching", "authors": [{"id": 176555, "fullname": "Zhengling Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/176555?format=json", "institution": "Shandong University"}, {"id": 189365, "fullname": "Rongfeng Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189365?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 189366, "fullname": "Quan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189366?format=json", "institution": "Jiaxing University"}, {"id": 189367, "fullname": "Longjian Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189367?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 89661, "fullname": "Ming Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89661?format=json", "institution": "Intel Labs China"}, {"id": 185862, "fullname": "Yaoqi Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/185862?format=json", "institution": "Lishui University"}, {"id": 189368, "fullname": "Yahong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189368?format=json", "institution": "Lishui University"}, {"id": 189369, "fullname": "Baofeng Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/189369?format=json", "institution": null}, {"id": 89584, "fullname": "Chenggang Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/89584?format=json", "institution": "Hangzhou Dianzi University, Tsinghua University"}], "abstract": "Due to the lack of ground truth, existing methods of active stereo matching generally employ fully self-supervised learning to produce precise depth estimates. Although they can achieve promising results, their performance still has a noticeable gap compared with supervised models. To fill this gap, we propose a novel framework that synthesizes proxy labels to enable supervised training of deep active stereo networks without requiring any ground-truth depth. To expand the training data and generate disparity proxy labels, we develop an active 2D Gaussian Splatting (2DGS)-based synthesis method that explicitly models the scene geometry and the projected active pattern. Furthermore, to balance the varying contributions of different supervisions during training, we design a hybrid supervision regularization strategy that dynamically adjusts the loss weights to achieve stable optimization. We also contribute a real-world dataset captured by a handheld RealSense camera, along with our active 2DGS model, which facilitates future research on active stereo matching. Extensive qualitative and quantitative experiments demonstrate that our method achieves state-of-the-art performance on active stereo matching task. The code and dataset will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38224", "url": null, "sourceid": 34713, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37728, "uid": "67614aacd469da7f9d611c9be60462f1", "name": "TRIDENT: A Trimodal Cascade Generative Framework for Drug and RNA-Conditioned Cellular Morphology Synthesis", "authors": [{"id": 180684, "fullname": "Rui Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180684?format=json", "institution": "Peking University"}, {"id": 188108, "fullname": "Ziru Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188108?format=json", "institution": "Peking University"}, {"id": 188109, "fullname": "Lingyuan Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/188109?format=json", "institution": "Tsinghua University"}, {"id": 188110, "fullname": "Yuxing Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188110?format=json", "institution": "Georgia Institute of Technology Peking University"}, {"id": 76401, "fullname": "Boxin Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/76401?format=json", "institution": "Peking University"}, {"id": 155008, "fullname": "Jinzhuo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155008?format=json", "institution": "Peking University"}], "abstract": "Accurately modeling the relationship between perturbations, transcriptional responses, and phenotypic changes is essential for building an AI Virtual Cell (AIVC). However, existing methods typically constrained to modeling direct associations, such as *Perturbation $\\rightarrow$ RNA* or *Perturbation $\\rightarrow$ Morphology*, overlook the crucial causal link from RNA to morphology. To bridge this gap, we propose TRIDENT, a cascade generative framework that synthesizes realistic cellular morphology by conditioning on both the perturbation and the corresponding gene expression profile. To train and evaluate this task, we construct MorphoGene, a new dataset pairing L1000 gene expression with Cell Painting images for 98 compounds. TRIDENT significantly outperforms state-of-the-art approaches, achieving up to 7-fold improvement with strong generalization to unseen compounds. In a case study on docetaxel, we validate that RNA-guided synthesis accurately produces the corresponding phenotype. An ablation study further confirms that this RNA conditioning is essential for the model's high fidelity. By explicitly modeling transcriptome\u2013phenome mapping, TRIDENT provides a powerful in silico tool and moves us closer to a predictive virtual cell.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37728", "url": null, "sourceid": 32167, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36335, "uid": "5824aa83d1ae73b56a67273714b57843", "name": "RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding", "authors": [{"id": 181984, "fullname": "Hanqing Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181984?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 184803, "fullname": "Mingjie Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184803?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 184804, "fullname": "Luoping Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/184804?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 184805, "fullname": "Endian Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184805?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 184806, "fullname": "Donghong Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184806?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 184324, "fullname": "Chuang Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184324?format=json", "institution": "Beijing University of Posts and Telecommunications"}], "abstract": "Conventional vision-language models (VLMs) struggle to interpret scenes captured under adverse conditions (e.g., low light, high dynamic range, or fast motion) because standard RGB images degrade in such environments. Event cameras provide a complementary modality: they asynchronously record per-pixel brightness changes with high temporal resolution and wide dynamic range, preserving motion cues where frames fail. We propose RE-VLM, the first dual-stream vision-language model that jointly leverages RGB images and event streams for robust scene understanding across both normal and challenging conditions. RE-VLM employs parallel RGB and event encoders together with a progressive training strategy that aligns heterogeneous visual features with language. To address the scarcity of RGB-Event-Text supervision, we further propose a graph-driven pipeline that converts synchronized RGB-Event streams into verifiable scene graphs, from which we synthesize captions and question\u2013answer (QA) pairs. To develop and evaluate RE-VLM, we construct two datasets: PEOD-Chat, targeting illumination-challenged scenes, and RGBE-Chat, covering diverse scenarios. On captioning and VQA benchmarks, RE-VLM consistently outperforms state-of-the-art RGB-only and event-only models with comparable parameter counts, with particularly large gains under challenging conditions. These results demonstrate the effectiveness of event-augmented VLMs in achieving robust vision-language understanding across a wide range of real-world environments. Code and datasets will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36335", "url": null, "sourceid": 34429, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36343, "uid": "249aebbe9551fa31430ae6d1f5802749", "name": "SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark", "authors": [{"id": 162321, "fullname": "Gui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/162321?format=json", "institution": "Shenzhen University"}, {"id": 184822, "fullname": "YongSong Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184822?format=json", "institution": "Shenzhen University"}, {"id": 184823, "fullname": "Kaijun Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184823?format=json", "institution": "Shenzhen University"}, {"id": 184824, "fullname": "Wooi Ping Cheah", "url": "http://cvpr.thecvf.com/api/miniconf/users/184824?format=json", "institution": "The University of Nottingham Ningbo China"}, {"id": 158558, "fullname": "Rong Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/158558?format=json", "institution": "University of Nottingham"}, {"id": 158556, "fullname": "Jianfeng Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/158556?format=json", "institution": "University of Nottingham Ningbo China"}, {"id": 76746, "fullname": "Linlin Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76746?format=json", "institution": "Shenzhen University"}], "abstract": "Fine-grained spatiotemporal reasoning on surgical videos is critical, yet the capabilities of Multi-modal Large Language Models (MLLMs) in this domain remain largely unexplored. To bridge this gap, we introduce **SurgCoT,** a unified benchmark for evaluating chain-of-thought (CoT) reasoning in MLLMs across **7 surgical specialties** and **35 diverse procedures**. SurgCoT assesses five core reasoning dimensions: Causal Action Ordering, Cue\u2013Action Alignment, Affordance Mapping, Micro\u2011Transition Localization, and Anomaly Onset Tracking, through a structured CoT framework with an intensive annotation protocol (*Question \u2192 Option \u2192 Knowledge \u2192 Clue \u2192 Answer*), where the *Knowledge* field provides essential background context and *Clue* provides definitive spatiotemporal evidence. Evaluation of 10 leading MLLMs shows: 1) commercial models outperform open-source and medical-specialized variants; 2) significant gaps exist in surgical CoT reasoning; 3) SurgCoT enables effective evaluation and enhances progressive spatiotemporal reasoning. SurgCoT provides a reproducible testbed to narrow the gap between MLLM capabilities and clinical reasoning demands. Code and data will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36343", "url": null, "sourceid": 40209, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38227, "uid": "523d6984fa5aa355c0f4b63b564ce892", "name": "X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis", "authors": [{"id": 162321, "fullname": "Gui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/162321?format=json", "institution": "Shenzhen University"}, {"id": 189378, "fullname": "Zehao Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/189378?format=json", "institution": "Shenzhen University"}, {"id": 184822, "fullname": "YongSong Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184822?format=json", "institution": "Shenzhen University"}, {"id": 189379, "fullname": "Yudong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189379?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 189380, "fullname": "Ende Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189380?format=json", "institution": null}, {"id": 184824, "fullname": "Wooi Ping Cheah", "url": "http://cvpr.thecvf.com/api/miniconf/users/184824?format=json", "institution": "The University of Nottingham Ningbo China"}, {"id": 158558, "fullname": "Rong Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/158558?format=json", "institution": "University of Nottingham"}, {"id": 158556, "fullname": "Jianfeng Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/158556?format=json", "institution": "University of Nottingham Ningbo China"}, {"id": 76746, "fullname": "Linlin Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76746?format=json", "institution": "Shenzhen University"}], "abstract": "Despite significant progress in Multi-modal Large Language Models (MLLMs), their clinical reasoning capacity in complex multi-modal diagnostic scenarios remains largely unexamined. Current benchmarks, predominantly limited to single-modality data, lack the capacity to evaluate progressive reasoning and cross-modal integration essential for clinical practice. To bridge this gap, we introduce **Cross-Modality Progressive Clinical Reasoning** (**X-PCR**) benchmark, the first comprehensive evaluation framework for MLLMs spanning the complete ophthalmology diagnostic workflow. X-PCR incorporates two core reasoning tasks: 1) a **six-stage progressive reasoning chain** spanning image quality assessment to clinical decision-making, and 2) A **cross-modality reasoning task** integrating six ophthalmic imaging modalities. The benchmark comprises **26,415 images** and **177,868 expert-verified VQA pairs** curated from 51 public datasets, covering 52 ophthalmic diseases. Our evaluation of **21 leading MLLMs** reveals critical gaps in progressive reasoning and cross-modal integration. X-PCR establishes a unified benchmark to advance MLLMs from task-specific performance to comprehensive diagnostic capability through aligned multi-modal clinical data. Dataset and code will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38227", "url": null, "sourceid": 32917, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37889, "uid": "60d179bc263ce0fe8e342c2ea1e67fe6", "name": "SO(3)-Equivariant ViT-Adapter for Data-Efficient Zero-Shot Sim-to-Real Indoor Panoramic Depth Estimation", "authors": [{"id": 181596, "fullname": "Ziyan He", "url": "http://cvpr.thecvf.com/api/miniconf/users/181596?format=json", "institution": "Shenzhen University"}, {"id": 188494, "fullname": "Qiudan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188494?format=json", "institution": "Shenzhen University"}, {"id": 88103, "fullname": "Lin Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/88103?format=json", "institution": "Meituan"}, {"id": 188495, "fullname": "Xu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188495?format=json", "institution": "Shenzhen University"}], "abstract": "Panoramic depth estimation enables a complete $360^\\circ$ understanding of 3D environments but faces significant challenges in generalizing to real-world scenes. While recent zero-shot depth models like Depth Anything achieve remarkable generalization on perspective images, their performance sharply degrades on panoramas due to projection distortions and the lack of spherical geometric awareness. Moreover, collecting large-scale panoramic RGB-D data is costly, hindering the large-scale training of panoramic foundation models. To address these issues, we propose an SO(3)-Equivariant ViT-Adapter, which transfers the powerful zero-shot capability of the perspective pre-trained ViT to panoramic depth estimation by explicitly incorporating a rotation-equivariant inductive bias. Our adapter introduces an SO(3) deformable cross-attention mechanism to effectively align SO(3)-equivariant features with perspective features, enhancing rotational consistency without modifying the ViT backbone. Trained solely on synthetic panoramas, our framework achieves robust zero-shot sim-to-real performance on real indoor benchmarks, including Matterport3D and Stanford2D3D, demonstrating both data efficiency and strong generalization for panoramic depth estimation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37889", "url": null, "sourceid": 41017, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39759, "uid": "381376c0b4781c5f03024f6f49d4e10a", "name": "NEC-Diff: Noise-Robust Event\u2013RAW Complementary Diffusion for Seeing Motion in Extreme Darkness", "authors": [{"id": 102342, "fullname": "Haoyue Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/102342?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 153632, "fullname": "Jinghan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153632?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 192800, "fullname": "Luxin Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192800?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 153633, "fullname": "Hanyu Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/153633?format=json", "institution": "National University of Singapore"}, {"id": 153634, "fullname": "Haozhi Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153634?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 90470, "fullname": "Yi Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90470?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 90454, "fullname": "Luxin Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/90454?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "High-quality imaging of dynamic scenes in extremely low-light conditions is highly challenging. Photon scarcity induces severe noise and texture loss, causing significant image degradation. Event cameras, featuring a high dynamic range (120 dB) and high sensitivity to motion, serve as powerful complements to conventional cameras by offering crucial cues for preserving subtle textures. However, most existing approaches emphasize texture recovery from events, while paying little attention to image noise or the intrinsic noise of events themselves, which ultimately hinders accurate pixel reconstruction under photon-starved conditions. In this work, we propose NEC-Diff, a novel diffusion-based event\u2013RAW hybrid imaging framework that extracts reliable information from heavily noisy signals to reconstruct fine scene structures. The framework is driven by two key insights: (1) combining the linear light-response property of RAW images with the brightness-change nature of events to establish a physics-driven constraint for robust dual-modal denoising; and (2) dynamically estimating the SNR of both modalities based on denoising results to guide adaptive feature fusion, thereby injecting reliable cues into the diffusion process for high-fidelity visual reconstruction. Furthermore, we construct the REAL (Raw and Event Acquired in Low-light) dataset which provides 47,800 pixel-aligned low-light RAW images, events, and high-quality references under 0.001\u20130.8 lux illumination. Extensive experiments demonstrate the superiority of NEC-Diff under extreme darkness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39759", "url": null, "sourceid": 36616, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40035, "uid": "81fc53c51059936bda7ac43bdcb32449", "name": "MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding", "authors": [{"id": 132496, "fullname": "Yuhao Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/132496?format=json", "institution": "Northeastern University"}, {"id": 192762, "fullname": "Anwesa Choudhuri", "url": "http://cvpr.thecvf.com/api/miniconf/users/192762?format=json", "institution": "United Imaging Intelligence"}, {"id": 96696, "fullname": "Zhongpai Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/96696?format=json", "institution": "United Imaging Intelligence"}, {"id": 126118, "fullname": "Benjamin Planche", "url": "http://cvpr.thecvf.com/api/miniconf/users/126118?format=json", "institution": "United Imaging Intelligence"}, {"id": 132524, "fullname": "Van Nguyen Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/132524?format=json", "institution": null}, {"id": 158293, "fullname": "Meng Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/158293?format=json", "institution": "UII America, Inc."}, {"id": 126822, "fullname": "Yuhan Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126822?format=json", "institution": "Northeastern University"}, {"id": 193353, "fullname": "Arun Innanje", "url": "http://cvpr.thecvf.com/api/miniconf/users/193353?format=json", "institution": null}, {"id": 158295, "fullname": "Terrence Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/158295?format=json", "institution": "United Imaging Intelligence"}, {"id": 92940, "fullname": "Ehsan Elhamifar", "url": "http://cvpr.thecvf.com/api/miniconf/users/92940?format=json", "institution": "Northeastern University"}, {"id": 192764, "fullname": "Ziyan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192764?format=json", "institution": "United Imaging Intelligence"}], "abstract": "Large vision-language models struggle with medical video understanding, where spatial precision, temporal reasoning, and clinical semantics are critical. To address this, we first introduce \\textbf{MedVidBench}, a large-scale benchmark of 531,850 video-instruction pairs across 8 medical sources spanning video, segment, and frame-level tasks, curated through a rigorous quality assurance pipeline with expert-guided prompting and dual-model validation. While supervised fine-tuning on MedVidBench yields noticeable gains, standard Reinforcement Learning (RL) fails due to imbalanced reward scales across datasets, which destabilizes optimization and leads to training collapse. To overcome this, we introduce \\textbf{MedGRPO}, a novel RL framework for balanced multi-dataset training with two key innovations: (1) \\emph{cross-dataset reward normalization} that maps each dataset's median performance to a common reward value, ensuring fair optimization regardless of difficulty, and (2) a \\emph{medical LLM judge} that evaluates caption quality on five clinical dimensions through comparative similarity scoring. Supervised fine-tuning Qwen2.5-VL-7B on MedVidBench substantially outperforms GPT-4.1 and Gemini-2.5-Flash across all tasks, demonstrating MedVidBench's efficacy, while our MedGRPO framework further improves upon the SFT baseline across grounding and captioning tasks. Our work establishes a foundational benchmark and robust training methodology for advancing vision-language models in medical domains.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40035", "url": null, "sourceid": 32923, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39740, "uid": "a63cc7c9ad9178c5b1dd3cbca0ce4ea7", "name": "Consistent Instance Field for Dynamic Scene Understanding", "authors": [{"id": 180502, "fullname": "Junyi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180502?format=json", "institution": "University of Illinois at Chicago"}, {"id": 132524, "fullname": "Van Nguyen Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/132524?format=json", "institution": null}, {"id": 126118, "fullname": "Benjamin Planche", "url": "http://cvpr.thecvf.com/api/miniconf/users/126118?format=json", "institution": "United Imaging Intelligence"}, {"id": 192761, "fullname": "Jiachen Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192761?format=json", "institution": "University of Illinois at Chicago"}, {"id": 153250, "fullname": "Changchang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/153250?format=json", "institution": "University of Illinois at Chicago"}, {"id": 96696, "fullname": "Zhongpai Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/96696?format=json", "institution": "United Imaging Intelligence"}, {"id": 152141, "fullname": "Zhenghao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/152141?format=json", "institution": "University of Illinois Chicago"}, {"id": 192762, "fullname": "Anwesa Choudhuri", "url": "http://cvpr.thecvf.com/api/miniconf/users/192762?format=json", "institution": "United Imaging Intelligence"}, {"id": 192763, "fullname": "Gengyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192763?format=json", "institution": "University of Illinois Chicago"}, {"id": 158293, "fullname": "Meng Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/158293?format=json", "institution": "UII America, Inc."}, {"id": 100296, "fullname": "Feiran Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/100296?format=json", "institution": "Illinois Institute of Technology"}, {"id": 158295, "fullname": "Terrence Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/158295?format=json", "institution": "United Imaging Intelligence"}, {"id": 150984, "fullname": "Yan Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/150984?format=json", "institution": "University of Illinois Chicago"}, {"id": 192764, "fullname": "Ziyan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192764?format=json", "institution": "United Imaging Intelligence"}], "abstract": "We introduce Consistent Instance Field, a continuous and probabilistic spatio-temporal representation for dynamic scene understanding.Unlike prior methods that rely on discrete tracking or view-dependent features, our approach disentangles visibility from persistent object identity by modeling each space\u2013time point with an occupancy probability and a conditional instance distribution. To realize this, we introduce a novel instance-embedded representation based on deformable 3D Gaussians, which jointly encode radiance and semantic information and are learned directly from input RGB images and instance masks through differentiable rasterization.Furthermore, we introduce new mechanisms to calibrate per-Gaussian identities and resample Gaussians toward semantically active regions, ensuring consistent instance representations across space and time. Experiments on HyperNeRF and Neu3D datasets demonstrate that our method significantly outperforms state-of-the-art methods on novel-view panoptic segmentation and open-vocabulary 4D querying tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39740", "url": null, "sourceid": 41740, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36263, "uid": "6fca7a9b9ac98ccc83aebb9fa27a2149", "name": "SAMIX: Reinforcing SAM2 with Semantic Adapter and Reference Selecting Policy for Mix-Supervised Segmentation", "authors": [{"id": 184609, "fullname": "Qiang Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184609?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 184610, "fullname": "Jiajie Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/184610?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 184611, "fullname": "Zhenyu Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/184611?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 184612, "fullname": "Zhifen Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184612?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 184613, "fullname": "Yingjie Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/184613?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 184614, "fullname": "Hongkuan Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/184614?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 184615, "fullname": "Ge-Peng Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/184615?format=json", "institution": "Australian National University"}, {"id": 184616, "fullname": "Qiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184616?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 184617, "fullname": "Zhiwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184617?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Mix-supervised image segmentation aims to effectively leverage heterogeneous annotations. Recent prompt-based advances utilize foundation models such as Segment Anything Model (SAM) to generate pseudo-masks by treating weak labels as spatial prompts. However, these methods rely heavily on sparse spatial priors, leading to suboptimal performance in ambiguous regions and overlooking the potential of unlabeled data due to the absence of promptable cues. In this paper, we propose SAMIX, a novel framework that adapts SAM2 into a semantic-aware pseudo-label generator SA-SAM2 by incorporating a lightweight semantic adapter. Beyond being guided by sparse spatial prompts, SA-SAM2 facilitates dense contextual prompts provided by valuable image\u2013mask reference pairs with shared semantics. This design allows SAMIX to produce high-quality pseudo-masks even for ambiguous objects with sparse or no annotations. Another core component of SAMIX is the Selecting Policy Network (SPNet), which auto-regressively retrieves relevant and complementary reference samples for each query image. Unlike rule-based selections, SPNet is trained via reinforcement learning to actively explore reference combinations that maximize pseudo-label quality. Guided by customized and verifiable rewards associated with mask quality, the selection toward semantically informative and diverse contexts. We conduct extensive experiments on two general datasets (PASCAL VOC 2012 and Cityscapes) and two challenging specific datasets with ambiguous boundaries (camouflaged object detection and image polyp segmentation). Across diverse mix-supervision settings, SAMIX consistently achieves state-of-the-art performance, effectively leveraging both weakly labeled and unlabeled data. Codes will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36263", "url": null, "sourceid": 33314, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36443, "uid": "e1b6c6fb7a203ca62d0108642cc5767c", "name": "AviaSafe: A Physics-Informed Data-Driven Model for Aviation Safety\u2013Critical Cloud Forecasts", "authors": [{"id": 185058, "fullname": "ZIJIAN ZHU", "url": "http://cvpr.thecvf.com/api/miniconf/users/185058?format=json", "institution": "Fudan University"}, {"id": 180780, "fullname": "Huang Qiusheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180780?format=json", "institution": "Fudan University"}, {"id": 185059, "fullname": "AnboyuGuo AnboyuGuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185059?format=json", "institution": "Fudan University"}, {"id": 185060, "fullname": "Xiaohui Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/185060?format=json", "institution": "Fudan University"}, {"id": 99574, "fullname": "Hao li", "url": "http://cvpr.thecvf.com/api/miniconf/users/99574?format=json", "institution": "Fudan University"}], "abstract": "Current AI weather forecasting models predict conventional atmospheric variables but cannot distinguish between cloud microphysical species critical for aviation safety. We introduce AviaSafe, a hierarchical, physics-informed neural forecaster that produces global, six-hourly predictions of these four hydrometeor species for lead times up to 7 days. Our approach addresses the unique challenges of cloud prediction: extreme sparsity, discontinuous distributions, and complex microphysical interactions between species. We integrate the Icing Condition (IC) index from aviation meteorology as a physics-based constraint that identifies regions where supercooled water fuels explosive ice crystal growth. The model employs a hierarchical architecture that first predicts cloud spatial distribution through masked attention, then quantifies species concentrations within identified regions. Training on ERA5 reanalysis data, our model achieves lower RMSE for cloud species compared to baseline and outperforms operational numerical models on certain key variables at 7-day lead times.The ability to forecast individual cloud species enables new applications in aviation route optimization where distinguishing between ice and liquid water determines engine icing risk.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36443", "url": null, "sourceid": 42127, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38839, "uid": "b8a7a43c42b54ca12d847eaff8f7c9e8", "name": "Hunting Normality from Query Sample via Residual Learning for Generalist  AnomalyDetection", "authors": [{"id": 153442, "fullname": "Xiaolei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153442?format=json", "institution": "University of Liverpool"}, {"id": 190806, "fullname": "Yuexin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190806?format=json", "institution": "University of Liverpool"}, {"id": 157613, "fullname": "Tianhong Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/157613?format=json", "institution": "Kashmir Intelligence"}, {"id": 89883, "fullname": "Huihui Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/89883?format=json", "institution": "Beijing jiaotong university"}, {"id": 88385, "fullname": "Yao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88385?format=json", "institution": "Beijing Jiaotong University"}, {"id": 89348, "fullname": "Jimin Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/89348?format=json", "institution": "Xi&#x27;an Jiaotong-Liverpool University"}], "abstract": "Generalist Anomaly Detection (GAD) seeks to overcome the domain-specific limitations of traditional anomaly detection by training a unified model that can generalize to unseen classes. A promising GAD strategy involves using residual features to create a class-invariant space. However, existing methods that directly model the distribution of residuals face unpredictable risks: there is inconsistency between residual and instance features, i.e., subtle defects may yield small residuals (false negatives), or normal feature residuals could be large due to the diversity of normality (false positives). To address these limitations, we propose a novel residual-based learning framework that re-purposes residuals as a guide to learn instance-level normality, rather than modeling their distribution directly. Our framework features two new attention-based modules: Residual Feature Learning (RFL), which uses learnable proxies to capture diverse patterns from the residual features, and Normality Learning from Support (NLS), which leverages these residual proxies to aggregate query-related normality proxies from the support instance features. These dynamically generated normality proxies are then used to hunt for normality within the query patch features, enabling accurate anomaly localization. Extensive experiments on GAD benchmarks demonstrate the effectiveness of our method. The code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38839", "url": null, "sourceid": 32563, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36642, "uid": "e52e11bcaceb09318392c684df299da3", "name": "CoLoGen: Progressive Learning of Concept\u2013Localization Duality for Unified Image Generation", "authors": [{"id": 153774, "fullname": "YuXin Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/153774?format=json", "institution": "Baidu"}, {"id": 157350, "fullname": "Yu Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157350?format=json", "institution": "Zhejiang University"}, {"id": 185536, "fullname": "Haoyuan Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/185536?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 185537, "fullname": "Huanjin Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185537?format=json", "institution": "ByteDance Inc."}, {"id": 185538, "fullname": "Fanglong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185538?format=json", "institution": "ByteDance"}, {"id": 75735, "fullname": "Yifan Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/75735?format=json", "institution": "Baidu Research"}, {"id": 90074, "fullname": "Haocheng Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/90074?format=json", "institution": "Baidu"}, {"id": 90530, "fullname": "Hang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/90530?format=json", "institution": "Baidu"}, {"id": 126338, "fullname": "Jingdong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126338?format=json", "institution": "Baidu"}], "abstract": "Unified conditional image generation remains difficult because different tasks depend on fundamentally different internal representations. Some require conceptual understanding for semantic synthesis, while others rely on localization cues for spatial precision. Forcing these heterogeneous tasks to share a single representation leads to concept-localization representational conflict.To address this issue, we propose CoLoGen, a unified diffusion framework that progressively learns and reconciles this concept-localization duality. CoLoGen uses a staged curriculum that first builds core conceptual and localization abilities, then adapts them to diverse visual conditions, and finally refines their synergy for complex instruction-driven tasks. Central to this process is the Progressive Representation Weaving (PRW) module, which dynamically routes features to specialized experts and stably integrates their outputs across stages.Experiments on editing, controllable generation, and customized generation show that CoLoGen achieves competitive or superior performance, offering a principled representational perspective for unified image generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36642", "url": null, "sourceid": 45173, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37484, "uid": "c16a2e5d3f5c70f954488189c3b3fa44", "name": "Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation", "authors": [{"id": 181585, "fullname": "Boyu Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/181585?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 156164, "fullname": "Qianqian Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156164?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 185533, "fullname": "Shilong Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185533?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 185534, "fullname": "Zhiyong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185534?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 187567, "fullname": "Ruochen Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/187567?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 185535, "fullname": "Xilin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185535?format=json", "institution": "Beijing Institute of Technology"}, {"id": 85019, "fullname": "Qingming Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85019?format=json", "institution": "University of Chinese Academy of Sciences"}], "abstract": "The limited understanding capacity of the visual encoder in Contrastive Language-Image Pre-training (CLIP) has become a key bottleneck for downstream performance. This capacity includes both Discriminative Ability (D-Ability), which reflects class separability, and Detail Perceptual Ability (P-Ability), which focuses on fine-grained visual cues. Recent solutions use diffusion models to enhance representations by conditioning image reconstruction on CLIP visual tokens. We argue that such paradigms may compromise D-Ability and therefore fail to effectively address CLIP's representation limitations. To address this, we integrate contrastive signals into diffusion-based reconstruction to pursue more comprehensive visual representations. We begin with a straightforward design that augments the diffusion process with contrastive learning on input images. However, empirical results show that the naive combination suffers from gradient conflict and yields suboptimal performance. To balance the optimization, we introduce the Diffusion Contrastive Reconstruction (DCR), which unifies the learning objective. The key idea is to inject contrastive signals derived from each reconstructed image, rather than from the original input, into the diffusion process. Our theoretical analysis shows that the DCR loss can jointly optimize D-Ability and P-Ability. Extensive experiments across various benchmarks and multi-modal large language models validate the effectiveness of our method. In particular, on OpenAICLIP, DCR boosts D-Ability by 5% while preserving the gains in P-Ability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37484", "url": null, "sourceid": 35220, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36641, "uid": "0878794aeceb5eb52bdbe43c6bfb3009", "name": "BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation", "authors": [{"id": 182882, "fullname": "Feiran Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182882?format=json", "institution": "Institute of Information Engineering, CAS"}, {"id": 156164, "fullname": "Qianqian Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156164?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 185533, "fullname": "Shilong Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185533?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 185534, "fullname": "Zhiyong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185534?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 185535, "fullname": "Xilin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185535?format=json", "institution": "Beijing Institute of Technology"}, {"id": 126239, "fullname": "Xiaochun Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/126239?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 85019, "fullname": "Qingming Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85019?format=json", "institution": "University of Chinese Academy of Sciences"}], "abstract": "This paper investigates the challenging task of detecting backdoored text-to-image generative models under black-box settings and introduces a novel detection framework **BlackMirror**. Existing approaches typically rely on analyzing image-level similarity, under the assumption that backdoor-triggered generations exhibit more significant cross-sample consistency than those from clean ones. Despite their success, such **global signals struggle to** generalize to recently emerging backdoor attacks, where backdoored generations can also appear visually diverse. Our BlackMirror is motivated by an insightful observation: across a wide range of backdoor attacks, **only partial semantic patterns** within the generated image are steadily manipulated, while the rest of the content remains diverse or benign. Accordingly, BlackMirror consists of two core components: **MirrorMatch**, which aligns extracted visual patterns with the corresponding instructions to detect semantic deviations; and **MirrorVerify**, which evaluates the stability of these deviations across varied prompts to distinguish true backdoor behavior from benign responses. Note that BlackMirror is a general, training-free framework that can be deployed as a plug-and-play module for detecting backdoor risks in real-world Model-as-a-Service (MaaS) applications. Comprehensive experiments demonstrate that BlackMirror achieves accurate detection across a wide range of existing attacks. It surpasses prior methods by over $15\\%$ in detection performance and reduces false positives by more than $30\\%$.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36641", "url": null, "sourceid": 40893, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38515, "uid": "9cb88dd759fcae2f3cb5907b9280bcaa", "name": "Texvent: Asynchronous Event Data Simulation via Text Prompt", "authors": [{"id": 190035, "fullname": "Ruofei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190035?format=json", "institution": "Hong Kong Baptist University"}, {"id": 70351, "fullname": "Peiqi Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/70351?format=json", "institution": "Peking University"}, {"id": 88097, "fullname": "Ka Chun Cheung", "url": "http://cvpr.thecvf.com/api/miniconf/users/88097?format=json", "institution": "NVIDIA"}, {"id": 88114, "fullname": "Simon See", "url": "http://cvpr.thecvf.com/api/miniconf/users/88114?format=json", "institution": "NVIDIA"}, {"id": 76401, "fullname": "Boxin Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/76401?format=json", "institution": "Peking University"}, {"id": 76273, "fullname": "Renjie Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/76273?format=json", "institution": "Hong Kong Baptist University"}], "abstract": "Current event simulation methods focus on employing videos to synthesize new event data, suffering from costly video capture and limited scalability across viewpoints, motions, and lighting. To this end, we propose a Text-to-event simulation framework (Texvent) that can directly generate asynchronous event data from simple text prompts. Texvent first renders prompt-driven videos via multimodal large language models and subsequently applies a new physical simulator to generate event streams. Specifically, an adaptive brightness-aware frame interpolation approach is proposed to enhance the temporal resolution of the rendered videos. A balanced logarithmic intensity comparison strategy and a cache\u2013based voltage refreshment mechanism are introduced into the simulator to generate event data.To narrow the sim-to-real gap, we also introduce background activity noise injection and dense time stamp reconstruction operations. Extensive experiments demonstrate Texvent\u2019s superior computational efficiency and its ability to generate more realistic event data than existing simulators.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38515", "url": null, "sourceid": 31056, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40338?format=json"], "related_events_ids": [40338]}, {"id": 40338, "uid": "9cb88dd759fcae2f3cb5907b9280bcaa", "name": "Texvent: Asynchronous Event Data Simulation via Text Prompt", "authors": [{"id": 190035, "fullname": "Ruofei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190035?format=json", "institution": "Hong Kong Baptist University"}, {"id": 70351, "fullname": "Peiqi Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/70351?format=json", "institution": "Peking University"}, {"id": 88097, "fullname": "Ka Chun Cheung", "url": "http://cvpr.thecvf.com/api/miniconf/users/88097?format=json", "institution": "NVIDIA"}, {"id": 88114, "fullname": "Simon See", "url": "http://cvpr.thecvf.com/api/miniconf/users/88114?format=json", "institution": "NVIDIA"}, {"id": 76401, "fullname": "Boxin Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/76401?format=json", "institution": "Peking University"}, {"id": 76273, "fullname": "Renjie Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/76273?format=json", "institution": "Hong Kong Baptist University"}], "abstract": "Current event simulation methods focus on employing videos to synthesize new event data, suffering from costly video capture and limited scalability across viewpoints, motions, and lighting. To this end, we propose a Text-to-event simulation framework (Texvent) that can directly generate asynchronous event data from simple text prompts. Texvent first renders prompt-driven videos via multimodal large language models and subsequently applies a new physical simulator to generate event streams. Specifically, an adaptive brightness-aware frame interpolation approach is proposed to enhance the temporal resolution of the rendered videos. A balanced logarithmic intensity comparison strategy and a cache\u2013based voltage refreshment mechanism are introduced into the simulator to generate event data.To narrow the sim-to-real gap, we also introduce background activity noise injection and dense time stamp reconstruction operations. Extensive experiments demonstrate Texvent\u2019s superior computational efficiency and its ability to generate more realistic event data than existing simulators.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40338", "url": null, "sourceid": -31056, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38515?format=json"], "related_events_ids": [38515]}, {"id": 36676, "uid": "08d32c7b011f031f15cf135dc6360f20", "name": "Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis", "authors": [{"id": 185621, "fullname": "Yinuo Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185621?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 185622, "fullname": "Jun Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185622?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 151587, "fullname": "Yiran Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151587?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 185623, "fullname": "Cheng Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185623?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Neural Radiance Fields (NeRF) have shown remarkable success in image novel view synthesis (NVS), inspiring extensions to LiDAR NVS. However, most methods heavily rely on accurate camera poses for scene reconstruction. The sparsity and textureless nature of LiDAR data also present distinct challenges, leading to geometric holes and discontinuous surfaces. To address these issues, we propose SG-NLF, a pose-free LiDAR NeRF framework that integrates spectral information with geometric consistency. Specifically, we design a hybrid representation based on spectral priors to reconstruct smooth geometry. For pose optimization, we construct a confidence-aware graph based on feature compatibility to achieve global alignment. In addition, an adversarial learning strategy is introduced to enforce cross-frame consistency, thereby enhancing reconstruction quality. Comprehensive experiments demonstrate the effectiveness of our framework, especially in challenging low-frequency scenarios. Compared to previous state-of-the-art methods, SG-NLF improves reconstruction quality and pose accuracy by over 35.8% and 68.8%. Our work can provide a novel perspective for LiDAR view synthesis.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36676", "url": null, "sourceid": 32438, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37285, "uid": "eb54c84cfe7e3adc6f6db7d7975fa3d3", "name": "Think 360\u00b0: Evaluating the Width-centric Reasoning Capability of MLLMs Beyond Depth", "authors": [{"id": 128304, "fullname": "Mingrui Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/128304?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences (CASIA)"}, {"id": 187080, "fullname": "Hexiong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187080?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 187081, "fullname": "Haogeng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187081?format=json", "institution": "TikTok"}, {"id": 76200, "fullname": "Huaibo Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76200?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 76640, "fullname": "Ran He", "url": "http://cvpr.thecvf.com/api/miniconf/users/76640?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}], "abstract": "In this paper, we present a holistic multimodal benchmark that evaluates the reasoning capabilities of MLLMs with an explicit focus on reasoning **`width`**, a complementary dimension to the more commonly studied reasoning **`depth`**.Specifically, reasoning depth measures the model\u2019s ability to carry out long-chain, sequential reasoning in which each step is tightly and rigorously linked to the next.Reasoning width tends to focus more on the model\u2019s capacity for broad trial-and-error search or multi-constrained optimization: it must systematically traverse many possible and parallelized reasoning paths, apply diverse constraints to prune unpromising branches, and identify valid solution routes for efficient iteration or backtracking.To achieve it, we carefully curate 1200+ high-quality multimodal cases spanning heterogeneous domains, and propose a fine-grained tree-of-thought evaluation protocol that jointly quantifies reasoning *width* and *depth*.We evaluate **12** major model families (over **30** advanced MLLMs) across difficulty tiers, question types, and required skills.Results show that while current models exhibit strong performance on general or common-sense VQA tasks, they still struggle to combine deep sequential thought chains with wide exploratory search to perform genuine insight-based reasoning. Finally, we analyze characteristic failure modes to provide possible directions for building MLLMs that reason not only *deeper* but also *wider*.Our code is available in **supplementary materials**.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37285", "url": null, "sourceid": 46427, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40326, "uid": "273a634ff7ef3901439b31356661d9a2", "name": "PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs", "authors": [{"id": 189194, "fullname": "Bowen Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/189194?format=json", "institution": "Tsinghua University"}, {"id": 153126, "fullname": "Yujun Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/153126?format=json", "institution": "The University of Queensland"}, {"id": 75715, "fullname": "Ming-Hsuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75715?format=json", "institution": "University of California at Merced"}, {"id": 189195, "fullname": "Hang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189195?format=json", "institution": "University of California, Merced"}, {"id": 182615, "fullname": "Yiwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182615?format=json", "institution": "University of California, Merced"}], "abstract": "Video LLMs suffer from temporal inconsistency: small shifts in frame timing can flip attention and suppress relevant frames. We trace this instability to the common extension of Rotary Position Embeddings to video through multimodal RoPE. The induced inverse Fourier time kernel exhibits frame-scale ripples that multiply adjacent frames by different factors, which perturbs attention that should otherwise be governed by the raw query key inner product. We present Phase Aggregated Smoothing (PAS), a simple, training-free mechanism that applies small opposed phase offsets across heads and then aggregates their outputs. PAS preserves the per-head spectrum magnitude, while the aggregation effectively smooths the temporal kernel and reduces phase sensitivity without changing the positional encoding structure. Our analysis shows that the RoPE rotated logit can be approximated as a content dot product scaled by a time kernel; smoothing this kernel yields Lipschitz stability of attention to small temporal shifts; multi phase averaging attenuates high frequency ripples while preserving per-head spectra under Nyquist-valid sampling. Experiments on multiple video understanding benchmarks under matched token budgets show consistent improvements with negligible computational overhead. PAS provides a plug and play upgrade for robust temporal encoding in Video LLMs.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40326", "url": null, "sourceid": -31899, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38165?format=json"], "related_events_ids": [38165]}, {"id": 38165, "uid": "273a634ff7ef3901439b31356661d9a2", "name": "PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs", "authors": [{"id": 189194, "fullname": "Bowen Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/189194?format=json", "institution": "Tsinghua University"}, {"id": 153126, "fullname": "Yujun Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/153126?format=json", "institution": "The University of Queensland"}, {"id": 75715, "fullname": "Ming-Hsuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75715?format=json", "institution": "University of California at Merced"}, {"id": 189195, "fullname": "Hang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189195?format=json", "institution": "University of California, Merced"}, {"id": 182615, "fullname": "Yiwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182615?format=json", "institution": "University of California, Merced"}], "abstract": "Video LLMs suffer from temporal inconsistency: small shifts in frame timing can flip attention and suppress relevant frames. We trace this instability to the common extension of Rotary Position Embeddings to video through multimodal RoPE. The induced inverse Fourier time kernel exhibits frame-scale ripples that multiply adjacent frames by different factors, which perturbs attention that should otherwise be governed by the raw query key inner product. We present Phase Aggregated Smoothing (PAS), a simple, training-free mechanism that applies small opposed phase offsets across heads and then aggregates their outputs. PAS preserves the per-head spectrum magnitude, while the aggregation effectively smooths the temporal kernel and reduces phase sensitivity without changing the positional encoding structure. Our analysis shows that the RoPE rotated logit can be approximated as a content dot product scaled by a time kernel; smoothing this kernel yields Lipschitz stability of attention to small temporal shifts; multi phase averaging attenuates high frequency ripples while preserving per-head spectra under Nyquist-valid sampling. Experiments on multiple video understanding benchmarks under matched token budgets show consistent improvements with negligible computational overhead. PAS provides a plug and play upgrade for robust temporal encoding in Video LLMs.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38165", "url": null, "sourceid": 31899, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40326?format=json"], "related_events_ids": [40326]}, {"id": 36182, "uid": "23270334cb68c628783066181ece864b", "name": "UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders", "authors": [{"id": 90897, "fullname": "Matthew Walmer", "url": "http://cvpr.thecvf.com/api/miniconf/users/90897?format=json", "institution": "University of Maryland, College Park"}, {"id": 156800, "fullname": "Saksham Suri", "url": "http://cvpr.thecvf.com/api/miniconf/users/156800?format=json", "institution": "Facebook"}, {"id": 184355, "fullname": "Anirud Aggarwal", "url": "http://cvpr.thecvf.com/api/miniconf/users/184355?format=json", "institution": "Department of Computer Science, University of Maryland, College Park"}, {"id": 98052, "fullname": "Abhinav Shrivastava", "url": "http://cvpr.thecvf.com/api/miniconf/users/98052?format=json", "institution": "University of Maryland"}], "abstract": "The space of task-agnostic feature upsampling has  emerged as a promising area of research to efficiently create denser features from pre-trained visual backbones. These methods act as a shortcut to achieve dense features for a fraction of the cost by learning to map low-resolution features to high-resolution versions. While early works in this space used iterative upsampling approaches, more recent works have switched to cross-attention-based methods, which risk falling into the same efficiency scaling problems of the backbones they are upsampling. In this work, we demonstrate that iterative upsampling methods can still compete with cross-attention-based methods; moreover, they can achieve state-of-the-art performance with lower inference costs. We propose UPLiFT, an architecture for Universal Pixel-dense Lightweight Feature Transforms. We also propose an efficient Local Attender operator to  overcome the limitations of prior iterative feature upsampling methods. This operator uses an alternative attentional pooling formulation defined fully locally. We show that our Local Attender allows UPLiFT to maintain stable features throughout upsampling, enabling state-of-the-art performance with lower inference costs than existing pixel-dense feature upsamplers. In addition, we apply UPLiFT to generative downstream tasks and show that it achieves competitive performance with state-of-the-art Coupled Flow Matching models for VAE feature upsampling. Altogether, UPLiFT offers a versatile and efficient approach to creating denser features.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36182", "url": null, "sourceid": 34743, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37840, "uid": "395fb64a55a3a57fdd9de78e425b9852", "name": "Masked Representation Modeling for Domain-Adaptive Segmentation", "authors": [{"id": 188388, "fullname": "Wenlve Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/188388?format=json", "institution": "South China University of Technology"}, {"id": 188389, "fullname": "Zhiheng Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/188389?format=json", "institution": null}, {"id": 188390, "fullname": "Tiantao Xian", "url": "http://cvpr.thecvf.com/api/miniconf/users/188390?format=json", "institution": "South China University of Technology"}, {"id": 188391, "fullname": "Yikui Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/188391?format=json", "institution": "Wuyi University"}, {"id": 188392, "fullname": "Weibin Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188392?format=json", "institution": null}, {"id": 188393, "fullname": "Biyun MA", "url": "http://cvpr.thecvf.com/api/miniconf/users/188393?format=json", "institution": "South China University of Technology"}], "abstract": "Unsupervised domain adaptation (UDA) for semantic segmentation seeks to transfer models from a labeled source domain to an unlabeled target domain. While auxiliary self-supervised tasks such as contrastive learning have enhanced feature discriminability, masked modeling remains underexplored due to architectural constraints and misaligned objectives. We propose Masked Representation Modeling (MRM), an auxiliary task that performs representation masking and reconstruction directly in the latent space. Unlike prior masked modeling methods that reconstruct low-level signals (e.g., pixels or visual tokens), MRM targets high-level semantic features, aligning its objective with segmentation and integrating seamlessly into standard architectures like DeepLab and DAFormer. To support efficient reconstruction, we design a lightweight auxiliary module, Rebuilder, which is jointly trained with the segmentation network but removed during inference, introducing zero test-time overhead. Extensive experiments demonstrate that MRM consistently improves segmentation performance across diverse architectures and UDA benchmarks. When integrated with four representative baselines, MRM achieves an average gain of +2.3 mIoU on GTA $\\rightarrow$ Cityscapes and +2.8 mIoU on Cityscapes $\\rightarrow$ Synthia, establishing it as a simple, effective, and generalizable strategy for unsupervised domain-adaptive semantic segmentation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37840", "url": null, "sourceid": 38672, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39633, "uid": "4abece55286a5b88ab94dc570e9663c5", "name": "TC-Pad\u00e9: Trajectory-Consistent Pad\u00e9 Approximation for Diffusion Acceleration", "authors": [{"id": 152597, "fullname": "Shaoxuan He", "url": "http://cvpr.thecvf.com/api/miniconf/users/152597?format=json", "institution": "College of Computer Science and Technology, Zhejiang University"}, {"id": 156578, "fullname": "Benlei Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/156578?format=json", "institution": "Alibaba Group"}, {"id": 181137, "fullname": "Bukun Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181137?format=json", "institution": "Zhejiang Gongshang University"}, {"id": 185818, "fullname": "Zhizeng Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/185818?format=json", "institution": "Zhejiang Gongshang University"}, {"id": 192531, "fullname": "Yunyun Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/192531?format=json", "institution": "Alibaba Group"}, {"id": 157341, "fullname": "Longtao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157341?format=json", "institution": "Alibaba Group"}, {"id": 90252, "fullname": "Hui Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/90252?format=json", "institution": "Zhejiang University, Tsinghua University"}, {"id": 192532, "fullname": "Yang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192532?format=json", "institution": "Alibaba Group"}, {"id": 129579, "fullname": "Haiwen Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/129579?format=json", "institution": "Alibaba Group"}, {"id": 126844, "fullname": "Jingqun Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126844?format=json", "institution": "Bytedance"}, {"id": 84818, "fullname": "Zhou Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84818?format=json", "institution": "Zhejiang University, Tsinghua University"}], "abstract": "Despite achieving state-of-the-art generation quality, diffusion models are hindered by the substantial computational burden of their iterative sampling process. While feature caching techniques achieve effective acceleration at higher step counts (e.g., 50 steps), they exhibit critical limitations in the practical low-step regime of 20-30 steps. As the interval between steps increases, polynomial-based extrapolators like TaylorSeer suffer from error accumulation and trajectory drift. Meanwhile, conventional caching strategies often overlook the distinct dynamical properties of different denoising phases.   To address these challenges, we propose Trajectory-Consistent Pad\u00e9($\\textbf{TC-Pad\u00e9}$) approximation, a feature prediction framework grounded in Pad\u00e9 approximation. By modeling feature evolution through rational functions, our approach captures asymptotic and transitional behaviors more accurately than Taylor-based methods. To enable stable and trajectory-consistent sampling under reduced step counts, TC-Pad\u00e9 incorporates (1) adaptive coefficient modulation that leverages historical cached residuals to detect subtle trajectory transitions, and (2) step-aware prediction strategies tailored to the distinct dynamics of early, mid, and late sampling stages. Extensive experiments on DiT-XL/2, FLUX.1-dev, and Wan2.1 across both image and video generation demonstrate the effectiveness of TC-Pad\u00e9. For instance, TC-Pad\u00e9 achieves 2.88$\\times$ acceleration on FLUX.1-dev and 1.72$\\times$ on Wan2.1 while maintaining high quality across FID, CLIP, Aesthetic, and VBench-2.0 metrics, substantially outperforming existing feature caching methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39633", "url": null, "sourceid": 37799, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36652, "uid": "2e12fe5bc3f8c2719adf7e96acdd5bc5", "name": "Unifying Perception and Action: A Hybrid-Modality Pipeline with Implicit Visual Chain-of-Thought for Robotic Action Generation", "authors": [{"id": 181708, "fullname": "Xiangkai Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/181708?format=json", "institution": "State Key Laboratory for Novel Software Technology, Nanjing University"}, {"id": 185565, "fullname": "Lekai Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/185565?format=json", "institution": "nanjing university"}, {"id": 185566, "fullname": "Han Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185566?format=json", "institution": "Nanjing University"}, {"id": 76378, "fullname": "Wenzhong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/76378?format=json", "institution": "Nanjing University"}, {"id": 72976, "fullname": "Sanglu Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/72976?format=json", "institution": "Nanjing University"}], "abstract": "Vision-Language-Action (VLA) models built upon Chain-of-Thought (CoT) have achieved remarkable success in advancing general-purpose robotic agents, owing to its significant perceptual comprehension. Recently, since text-only CoT struggles to adequately capture scene details in complex spatial environments, a highly promising strategy involves leveraging visual priors to guide robotic action generation. Nevertheless, these strategies face two inherent challenges: (i) a modality gap between visual observations and low-level actions, and (ii) unstable training due to competing objectives between visual prediction and action generation. To address these challenges, we propose a Vision-Integrated Trajectory Alignment (VITA) framework that learns a shared discrete latent space for vision and action, enabling joint modeling of perception and motor control. VITA introduces a implicit visual CoT: autoregressively generated tokens is simultaneously decoded into future frames predictions and robot actions, thereby internalizing visual dynamics as an inductive bias for motion planning. Extensive experiments on simulated and real-world environments demonstrate state-of-the-art performance. VITA improves 14.5\\%, 9.6\\% and 12.1\\% over existing baselines on CALVIN, LIBERO and SimplerEnv. Furthermore, VITA attains an average success rate of 80.5\\% across six real-world tasks, demonstrating its potential as a generalist robotic manipulation model.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36652", "url": null, "sourceid": 39375, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37490, "uid": "de1e7f6da2c60b9bb6768ba10c8ebc28", "name": "See Through the Noise: Improving Domain Generalization in Gaze Estimation", "authors": [{"id": 180986, "fullname": "Yanming Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180986?format=json", "institution": "Beijing Jiaotong University"}, {"id": 187574, "fullname": "Shijing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187574?format=json", "institution": null}, {"id": 87019, "fullname": "Yaping Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87019?format=json", "institution": "Beijing Jiaotong University"}, {"id": 187575, "fullname": "Yi Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/187575?format=json", "institution": "Beijing Jiaotong University"}], "abstract": "Generalizable gaze estimation methods have garnered increasing attention due to its critical importance in real-world applications and achieved significant progress. However, they often overlook the effect of label noise, arising from the inherent difficulty of acquiring precise gaze annotations, on model generalization performance. In this paper, we are the first to comprehensively investigate the negative effects for the generalization of gaze estimation. Further, we propose a novel solution, called See-Through-Noise (SeeTN) framework, which improves generalization from a novel perspective of mitigating label noise. Specifically, we propose to construct a semantic embedding space via a prototype-based transformation to preserve a consistent topological structure between gaze features and continuous labels, mitigating the effects of label noise. We then measure feature-label affinity consistency to distinguish noisy from clean samples, and introduce a novel affinity regularization in the semantic manifold to transfer gaze-related information from clean to noisy samples. Our proposed SeeTN promotes semantic structure alignment and enforces domain-invariant gaze relationships, thereby enhancing robustness against both label noise and domain shifts. Extensive experiments demonstrate that our SeeTN effectively mitigates the adverse impact of source-domain noise, leading to superior cross-domain generalization without compromising the source-domain accuracy, and highlighting the importance of explicitly handling noise in generalized gaze estimation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37490", "url": null, "sourceid": 34512, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37707, "uid": "49024dbba391e7b6d937dfe7b3354b45", "name": "Language-Guided One-Step Diffusion Model for Nighttime Flare Removal", "authors": [{"id": 180214, "fullname": "Aoxiang Ning", "url": "http://cvpr.thecvf.com/api/miniconf/users/180214?format=json", "institution": "Chongqing University of Technology"}, {"id": 147362, "fullname": "Kailong Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/147362?format=json", "institution": "Beijing Institute of Technology"}, {"id": 188058, "fullname": "Minglong Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/188058?format=json", "institution": null}, {"id": 88760, "fullname": "Liyuan Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88760?format=json", "institution": "Beijing Institute of Technology"}, {"id": 188059, "fullname": "Jinhong He", "url": "http://cvpr.thecvf.com/api/miniconf/users/188059?format=json", "institution": "Chongqing University of Technology"}, {"id": 173797, "fullname": "Wenchao Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/173797?format=json", "institution": "chongqiong university of technology"}, {"id": 152604, "fullname": "Mingliang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/152604?format=json", "institution": "Chongqing University"}, {"id": 187987, "fullname": "Yirui Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187987?format=json", "institution": "hohai University"}], "abstract": "Nighttime photography is susceptible to flare caused by strong light sources, which degrades visual quality and disrupts structural information required by downstream vision tasks. Existing nighttime flare removal methods generally lack semantic priors for flare-occluded regions and thus tend to introduce artifacts and lose details under severe degradation. To address this problem, we propose a language-guided one-step diffusion framework that explicitly aligns flare-occluded regions with the underlying scene content at the semantic level. Specifically, we develop the first flare-specific vision\u2013language model, Flare-VLM, which extracts fine-grained textual descriptions to guide one-step diffusion for high-quality restoration of severely damaged areas. Then, we propose semantics-aware distribution distillation to constrain the noise distribution with high-level semantics, suppressing redundant perturbations on clean backgrounds and improving the stability of distillation. In addition, we design an instruction-driven data synthesis pipeline to generate geometrically and semantically aligned nighttime flare samples, narrowing the gap to real degradations. Experimental results demonstrate that the proposed method achieves better restoration and enhances the performance of downstream vision tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37707", "url": null, "sourceid": 44043, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38668, "uid": "2ac4bbf0c1236555cc7e9e5211465ed8", "name": "Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction", "authors": [{"id": 158025, "fullname": "Jiaxin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158025?format=json", "institution": "Zhejiang University"}, {"id": 155573, "fullname": "Yuanbo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155573?format=json", "institution": "Zhejiang University"}, {"id": 88696, "fullname": "Bangbang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88696?format=json", "institution": "ByteDance Inc"}, {"id": 190427, "fullname": "Lin Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/190427?format=json", "institution": null}, {"id": 190428, "fullname": "Yuewen Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/190428?format=json", "institution": "ByteDance"}, {"id": 88190, "fullname": "Yiyi Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88190?format=json", "institution": "Zhejiang University"}], "abstract": "We present Gen3R, a method that bridges the strong priors of foundational reconstruction models and video diffusion models for scene-level 3D generation. We repurpose the VGGT reconstruction model to produce geometric latents by training an adapter on its tokens, which are regularized to align with the appearance latents of pre-trained video diffusion models. By jointly generating these disentangled yet aligned latents, \\method{} produces both RGB videos and corresponding 3D geometry, including camera poses, depth maps, and global point clouds. Experiments demonstrate that our approach achieves state-of-the-art results in single- and multi-image conditioned 3D scene generation. Additionally, our method can enhance the robustness of reconstruction by leveraging generative priors, demonstrating the mutual benefit of tightly coupling reconstruction and generative models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38668", "url": null, "sourceid": 34017, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36275, "uid": "187df28d558f25a18507ba287ce90f5d", "name": "Reading Your Actions: Learning Generalizable Action Representations via Pre-training AEMG", "authors": [{"id": 181857, "fullname": "Zhenghao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181857?format=json", "institution": "South China University of Technology"}, {"id": 184650, "fullname": "Kaikai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184650?format=json", "institution": "South China University of Technology"}, {"id": 184651, "fullname": "HUILIN YAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/184651?format=json", "institution": "South China University of Technology"}, {"id": 184652, "fullname": "Lin Shu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184652?format=json", "institution": "South China University of Technology"}], "abstract": "Electromyography (EMG) is crucial for decoding human motor intentions and achieving natural human-computer interaction, but its generalization ability across subjects, devices, and tasks has long been limited by data heterogeneity, scarce annotations, and the lack of a unified representation paradigm. In this work, we introduce a novel perspective on EMG signals, treating muscle contractions as words and activation sequences as sentences. Based on this perspective, we design a Neuromuscular Contraction Tokenizer (NCT) that generates semantically consistent EMG sentences from raw signals. Building on this, we propose the first large-scale pre-training framework for EMG\u2014Any Electromyography (AEMG), a general EMG representation learning framework based on self-supervised pre-training. Furthermore, we construct the largest cross-device EMG vocabulary to date, which supports seamless transfer across arbitrary channel topologies and sampling rates. Extensive experiments demonstrate that AEMG outperforms state-of-the-art baselines by 5.79\u20139.25% in zero-shot leave-one-subject-out accuracy, and achieves over 90% few-shot adaptation performance with only 5% of the target user\u2019s data. Our work has proposed the concept of electromyography signals as a cross-device physiological language, learned their grammar from massive amounts of data, and laid the groundwork for a single-training, universally applicable EMG foundation model.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36275", "url": null, "sourceid": 39368, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38679, "uid": "3870e23479b42ad046256d65834e987f", "name": "Knowing Thyself: Ego-Grounding for Personalized Question-Answering in Egocentric Videos", "authors": [{"id": 190448, "fullname": "Junbin Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190448?format=json", "institution": "National University of Singapore"}, {"id": 174327, "fullname": "Shenglang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174327?format=json", "institution": "University of Science and Technology of China"}, {"id": 190449, "fullname": "Pengxiang Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190449?format=json", "institution": "National University of Singapore"}, {"id": 85773, "fullname": "Angela Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/85773?format=json", "institution": "National University of Singapore"}], "abstract": "We present the first systematic analysis of multimodal large language models (MLLMs) in personalized question-answering requiring ego-grounding - the ability to understand the camera-wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs' ability to understand, remember, and reason about the camera wearer. MyEgo comprises 541 videos and 5K personalized questions asking about \"my things\", \"my activities\", and \"my past\". Benchmarking reveals that competitive MLLMs across variants, including open-source vs. proprietary, thinking vs. non-thinking, small vs. large scales all struggle on MyEgo. Top models such as GPT-5 achieve only 46% accuracy, trailing human performance (85%) by a whopping 39%. Surprisingly, neither explicit reasoning nor model scaling yield consistent improvements. Models improve when relevant evidence is explicitly provided, but gains drop over time, indicating limitations in tracking and remembering \"me\" and \"my past\". These findings collectively highlight the crucial role of ego-grounding and long-range memory in enabling personalized QA in egocentric videos. We hope MyEgo and our analyses catalyze further progress in these areas for egocentric personalized assistance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38679", "url": null, "sourceid": 39147, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38527, "uid": "2bed3c1ab71f19f8456e4366203e5562", "name": "UST-Hand: An Uncertainty-aware Spatiotemporal Point Cloud Interaction Network for 3D Self-supervised Hand Pose Estimation", "authors": [{"id": 190065, "fullname": "Tianhao Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/190065?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 183318, "fullname": "HaoYang ZHANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/183318?format=json", "institution": "University College Dublin"}, {"id": 149059, "fullname": "Liang Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/149059?format=json", "institution": "National University of Defense Technology, Tsinghua University"}, {"id": 145893, "fullname": "Haochen Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/145893?format=json", "institution": "Sun Yat-sen University, SYSU"}, {"id": 161543, "fullname": "Kun Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/161543?format=json", "institution": "Peking University"}, {"id": 190066, "fullname": "Yuan Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190066?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 126479, "fullname": "Pengfei Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/126479?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 149083, "fullname": "Erwei Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/149083?format=json", "institution": "Defense Innovation Institute, Academy of Military Sciences (AMS)"}], "abstract": "Manually annotating accurate 3D hand poses is extremely time-consuming and labor-intensive. Existing self-supervised hand pose estimation methods leverage the discrepancy between input images and rendered outputs, or multiview consistency constraints, as the driving force to optimize networks and progressively refine pose accuracy. However, these methods are highly susceptible to noisy pseudo-labels and overlook the importance of fully exploiting fine-grained spatial correlations, which undermines the stability of model training. To address these issues, we propose UST-Hand, a self-supervised learning framework that estimates uncertainty distribution of hand pose and constructs a probabilistic point cloud feature space, which enables the complex spatiotemporal relationship modeling. UST-Hand employs a conditional normalizing flow model to capture hand pose distributions and samples diversity hypotheses, facilitating robust learning under noisy pseudo-labels supervision with enhanced stability. These multi-hypothesis are mapped to a unified probabilistic 3D point cloud space for multiview and temporal feature interaction, comprehensively exploring hand motion patterns and fine-grained spatial correlations. Extensive experiments on three challenging datasets demonstrate that UST-Hand achieves state-of-the-art performance, outperforming existing self-supervised methods by up to 37.8\\% in Mean Per Vertex Position Error (MPVPE).", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38527", "url": null, "sourceid": 31112, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37165, "uid": "413f2dc2067072c4339624db3d9fa537", "name": "Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models", "authors": [{"id": 180862, "fullname": "SIQI LIU", "url": "http://cvpr.thecvf.com/api/miniconf/users/180862?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 186819, "fullname": "Xinyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186819?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 184923, "fullname": "Bochao Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184923?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 157056, "fullname": "Junbao Zhuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/157056?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 157058, "fullname": "Huimin Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/157058?format=json", "institution": "University of Science and Technology Beijing"}, {"id": 157057, "fullname": "Jiansheng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/157057?format=json", "institution": "University of Science and Technology Beijing"}], "abstract": "As large language models (LLMs) continue to advance, there is increasing interest in their ability to infer human mental states and demonstrate a human-like Theory of Mind (ToM). Most existing ToM evaluations, however, are centered on text-based inputs, while scenarios relying solely on visual information receive far less attention. This leaves a gap, since real-world human\u2013AI interaction typically requires multimodal understanding. In addition, many current methods regard the model as a black box and rarely probe how its internal attention behaves in multiple-choice question answering (QA). The impact of LLM hallucinations on such tasks is also underexplored from an interpretability perspective. To address these issues, we introduce VisionToM, a vision-oriented intervention framework designed to strengthen task-aware reasoning. The core idea is to compute intervention vectors that align visual representations with the correct semantic targets, thereby steering the model's attention through different layers of visual features. This guidance reduces the model's reliance on spurious linguistic priors, leading to more reliable multimodal language model (MLLM) outputs and better QA performance. Experiments on the EgoToM benchmark\u2014an egocentric, real-world video dataset for ToM with three multiple-choice QA settings\u2014demonstrate that our method substantially improves the ToM abilities of MLLMs. Furthermore, results on an additional open-ended generation task show that VisionToM enables MLLMs to produce free-form explanations that more accurately capture agents' mental states, pushing machine\u2013human collaboration toward greater alignment.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37165", "url": null, "sourceid": 31612, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38696, "uid": "323799e616cdbe8f6d89d6a253e8f0fc", "name": "AHS: Adaptive Head Synthesis via Synthetic Data Augmentations", "authors": [{"id": 156200, "fullname": "Taewoong Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156200?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 156201, "fullname": "Hyojin Jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156201?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 159499, "fullname": "Sohyun Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/159499?format=json", "institution": "Korea Advanced Institue of Science and Technology"}, {"id": 190482, "fullname": "Seunggi Moon", "url": "http://cvpr.thecvf.com/api/miniconf/users/190482?format=json", "institution": "Korea University"}, {"id": 190483, "fullname": "Gihwi Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/190483?format=json", "institution": "Kyung Hee University; Fliption Korea"}, {"id": 190484, "fullname": "Hoon Jung", "url": "http://cvpr.thecvf.com/api/miniconf/users/190484?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 87936, "fullname": "Jaegul Choo", "url": "http://cvpr.thecvf.com/api/miniconf/users/87936?format=json", "institution": "Korea Advanced Institute of Science and Technology"}], "abstract": "Recent digital media advancements have created increasing demands for sophisticated portrait manipulation techniques, particularly head swapping\u2014where one image's head is seamlessly integrated onto another's body. Current approaches predominantly rely on face-centered cropped data with limited view angles, significantly restricting their real-world applicability. These methods struggle with diverse head expressions, varying hairstyles, and natural blending beyond facial regions. To address these limitations, we propose Adaptive Head Synthesis (AHS), which effectively handles full upper-body images with varied head poses and expressions. AHS incorporates a novel head reenacted synthetic data augmentation strategy to overcome self-supervised training constraints, enhancing generalization across diverse facial expressions and orientations without requiring paired training data. Comprehensive experiments demonstrate that our approach achieves superior performance in challenging real-world scenarios, producing visually coherent results that preserve both identity and expression fidelity across various head orientations and hairstyles. Notably, our method shows exceptional robustness in maintaining facial identity while drastic expression changes and faithfully preserving accessories while significant head pose variations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38696", "url": null, "sourceid": 42536, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39873, "uid": "74eccde1710c32f8ed1a5d6b3c26ab44", "name": "MultiAnimate: Pose-Guided Image Animation Made Extensible", "authors": [{"id": 143241, "fullname": "Yingcheng Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/143241?format=json", "institution": "shanghaiTech university"}, {"id": 193037, "fullname": "Haowen Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/193037?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 130183, "fullname": "Chuanguang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130183?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 130221, "fullname": "Zhulin An", "url": "http://cvpr.thecvf.com/api/miniconf/users/130221?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 105588, "fullname": "Yongjun Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/105588?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 180264, "fullname": "Songhua Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180264?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Pose-guided human image animation aims to synthesize realistic videos of a reference character driven by a sequence of poses. While diffusion-based methods have achieved remarkable success, most existing approaches are limited to single-character animation. We observe that naively extending these methods to multi-character scenarios often leads to identity confusion and implausible occlusions between characters. To address these challenges, in this paper, we propose an extensible multi-character image animation framework built upon modern Diffusion Transformers (DiTs) for video generation. At its core, our framework introduces two novel components\u2014Identifier Assigner and Identifier Adapter\u2014which collaboratively capture per-person positional cues and inter-person spatial relationships. This mask-driven scheme, along with a scalable training strategy, not only enhances flexibility but also enables generalization to scenarios with more characters than those seen during training. Remarkably, trained on only a two-character dataset, our model generalizes to multi-character animation while maintaining compatibility with single-character cases. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in multi-character image animation, surpassing existing diffusion-based baselines. Codes will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39873", "url": null, "sourceid": 41202, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37924, "uid": "82dd5a905e59accb6f62295917517991", "name": "TTL: Test-time Textual Learning for OOD Detection with Pretrained Vision-Language Models", "authors": [{"id": 180642, "fullname": "Jinlun Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/180642?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 188600, "fullname": "Jiang Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188600?format=json", "institution": "China Mobile Communications Group Co.,Ltd"}, {"id": 188601, "fullname": "Runhe Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/188601?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 188602, "fullname": "Xinhua Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188602?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 141087, "fullname": "Jia-Xin ZHUANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/141087?format=json", "institution": "Department of Computer Science and Engineering, Hong Kong University of Science and Technology"}, {"id": 188603, "fullname": "Zhiyong Gan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188603?format=json", "institution": "China United Network Communications Corporation  Limited Guangdong  Branch"}, {"id": 181359, "fullname": "Ruixuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181359?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Vision-language models (VLMs) such as CLIP exhibit strong Out-of-distribution (OOD) detection capabilities by aligning visual and textual representations. Recent CLIP-based test-time adaptation methods further improve detection performance by incorporating external OOD labels.However, such labels are finite and fixed, while the real OOD semantic space is inherently open-ended. Consequently, fixed labels fail to represent the diverse and evolving OOD semantics encountered in test streams.To address this limitation, we introduce **T**est-time **T**extual **L**earning (**TTL**), a framework that dynamically learns OOD textual semantics from unlabeled test streams, without relying on external OOD labels.TTL updates learnable prompts using pseudo-labeled test samples to capture emerging OOD knowledge. To suppress noise introduced by pseudo-labels, we introduce an OOD knowledge purification strategy that selects reliable OOD samples for adaptation while suppressing noise.In addition, TTL maintains an OOD Textual Knowledge Bank that stores high-quality textual features, providing stable score calibration across batches.Extensive experiments on two standard benchmarks with nine OOD datasets demonstrate that TTL consistently achieves state-of-the-art performance, highlighting the value of textual adaptation for robust test-time OOD detection. Code will be released publicly upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37924", "url": null, "sourceid": 40119, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39619, "uid": "607e79272c337728322fe13f83866c17", "name": "FedCART: Tackling Long-Tailed Distributions in Federated Adversarial Training via Classifier Refinement", "authors": [{"id": 182045, "fullname": "Yuchen Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/182045?format=json", "institution": "DALIAN UNIVERSITY OF TECHNOLOGY"}, {"id": 192496, "fullname": "Yizhi Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/192496?format=json", "institution": "Dalian Ocean University"}, {"id": 192497, "fullname": "Junxiao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192497?format=json", "institution": "Guangzhou University"}, {"id": 186341, "fullname": "Xin Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/186341?format=json", "institution": "Tianjin University"}, {"id": 192498, "fullname": "Heng QI", "url": "http://cvpr.thecvf.com/api/miniconf/users/192498?format=json", "institution": "Dalian University of Technology"}], "abstract": "Growing privacy and security demands in the real world have spurred interest in adversarially robust Federated Learning (FL). While Adversarial Training (AT) is a well-established defense in centralized learning, its extension to the federated setting, known as Federated Adversarial Training (FAT), faces significant challenges due to data heterogeneity across clients. Existing FAT methods have made significant contributions, but they typically assume a balanced global data distribution, an assumption that rarely holds true in practice due to the prevalence of long-tailed distributions. This work first identifies and diagnoses the severe performance degradation of FAT under long-tailed data, attributing it to skewed feature representations and impaired classifier discriminability. To address this, we propose FedCART, a novel FAT framework that decouples the model into a shared feature extractor and a dual-classifier structure. On the client side, a representation-alignment loss enhances adversarial robustness, while gradient-based class prototypes are extracted for classifier calibration. On the server side, models and prototype sets are aggregated to synthesize balanced virtual features, enabling the re-training of an auxiliary classifier to mitigate long-tailed bias. Extensive experiments demonstrate that FedCART significantly improves both accuracy and robustness, outperforming state-of-the-art FAT methods. To the best of our knowledge, this is the first work to systematically investigate and address FAT under long-tailed distributions, representing a significant step toward practical adversarial robustness in FL. Our code will be publicly available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39619", "url": null, "sourceid": 33132, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39461, "uid": "f7afdb11c40f9e80df4ba34919af5618", "name": "Rethinking Cross-Modal Anchor Alignment for Mitigating Error Accumulation", "authors": [{"id": 192117, "fullname": "Bin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192117?format=json", "institution": "Northwest A&amp;F University"}, {"id": 181042, "fullname": "Bin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181042?format=json", "institution": "College of Information Engineering, Northwest A&amp;F University"}, {"id": 192118, "fullname": "Qianqian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192118?format=json", "institution": null}, {"id": 192119, "fullname": "Wei Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192119?format=json", "institution": "Northwest A&amp;F University"}, {"id": 192120, "fullname": "yijiechen yijiechen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192120?format=json", "institution": "NORTHWEST A&amp;F UNIVERSITY"}, {"id": 192121, "fullname": "Haixi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192121?format=json", "institution": "Northwest A&amp;F University"}], "abstract": "Mitigating noisy correspondence in cross-model matching poses a serious challenge due to the problem of error accumulation. Existing methods primarily attribute this accumulation to errors caused by noisy sample pairs. However, a novel source of error from clean sample pairs (also termed anchor pairs) is discovered in this paper. Such error accumulation is considered to arise from modality-inconsistent correlations. To address this issue, a novel method termed Geometric-Semantic Learning (GSL) is proposed. Firstly, GSL leverages the Fourier transform to emphasize semantic representations and reduce cross-modal inconsistencies caused by perturbations in non-critical fine-grained features, thereby alleviating the error accumulation problem. After that, a Geometry-Aware Label Correction (GALC) method is introduced to re-estimate soft correspondence labels by leveraging angular consistency between noisy sample pairs and anchor pairs across different modalities. Finally, a semantically constrained triplet loss is employed to regulate sample distances using semantic information, enabling robust separation of clean and noisy pairs during the training process. Extensive experiments on three benchmark datasets demonstrate that GSL consistently outperforms existing methods in retrieval accuracy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39461", "url": null, "sourceid": 39001, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39407, "uid": "bc85a3f74ae51634b2ff56120742a8bf", "name": "HyperGait: Unleashing the Power of Parsing for Gait Recognition in the Wild via Hypergraph", "authors": [{"id": 192013, "fullname": "Jinkai Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192013?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 177296, "fullname": "jiaqing wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/177296?format=json", "institution": "Hangzhou dianzi University"}, {"id": 192014, "fullname": "Xinxiang Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192014?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 185862, "fullname": "Yaoqi Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/185862?format=json", "institution": "Lishui University"}, {"id": 158795, "fullname": "Xichun Sheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/158795?format=json", "institution": "Macao Polytechnic University"}, {"id": 152147, "fullname": "Ming Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152147?format=json", "institution": "Guangming Laboratory"}, {"id": 99759, "fullname": "Liangqiong Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/99759?format=json", "institution": "The University of Hong Kong"}, {"id": 93330, "fullname": "Xinchen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/93330?format=json", "institution": "JD.com"}, {"id": 76244, "fullname": "Wu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76244?format=json", "institution": "University of Science and Technology of China"}], "abstract": "In recent years, the gait parsing sequence has become increasingly popular due to its higher information entropy than the binary silhouette and the keypoint-based skeleton. However, existing parsing-based gait recognition methods have not fully explored the complex, non-linear relationships between features at different positions, semantic, and temporal dynamics levels, i.e., higher-order correlations. To unleash the power of parsing between human body parts and temporal dynamics, this paper proposes a novel hypergraph-based gait recognition framework, named HyperGait. The HyperGait contains a global head and two elaborately-designed modules. In particular, the Spatial Hypergraph Convolutional Module (SHCM) and the Temporal Hypergraph Convolutional Module (THCM) are designed to explore the high-order spatial-level and temporal-level features, respectively.The SHCM extracts fine-grained relationships between human body parts through the hypergraph.The THCM performs the high-order temporal information between temporally related human body parts.Comprehensive experiments on two large-scale gait datasets, i.e., Gait3D and SUSTech1K, show the superior performance of our proposed HyperGait.In highly challenging real-world scenarios, with only parsing as input, our HyperGait achieves the Rank-1 accuracy of 80.5\\% on the Gait3D dataset.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39407", "url": null, "sourceid": 42119, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37105, "uid": "bbaca7efb81d47580e6d5d2744fdeae3", "name": "Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning", "authors": [{"id": 186665, "fullname": "Tingyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186665?format=json", "institution": "Shanghai Jiaotong University; Shanghai Artificial Intelligence Laboratory"}, {"id": 176211, "fullname": "Zheng Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/176211?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 143847, "fullname": "Jingxuan Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/143847?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 73990, "fullname": "Conghui He", "url": "http://cvpr.thecvf.com/api/miniconf/users/73990?format=json", "institution": "Shanghai AI Lab"}, {"id": 186666, "fullname": "Lijun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186666?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 133148, "fullname": "Cheng Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/133148?format=json", "institution": "Zhejiang University &amp; Westlake University"}], "abstract": "Recent vision-language models (VLMs) achieve remarkable reasoning through reinforcement learning (RL), which provides a feasible solution for realizing continuous self-evolving large vision-language models (LVLMs) in the era of experience. However, RL for VLMs requires abundant high-quality multimodal data\u2014especially challenging in specialized domains like chemistry, earth sciences, and multimodal mathematics. Existing strategies such as synthetic data and self-rewarding mechanisms suffer from limited distributions and alignment difficulties, ultimately causing reward hacking: models exploit high-reward patterns, collapsing policy entropy and destabilizing training. We propose \\textbf{DoGe} (Decouple to Generalize), a dual-decoupling framework that guides models to first learn from context rather than problem solving by refocusing on the problem context scenarios overlooked by synthetic data methods. By decoupling learning process into dual components (Thinker and Solver), we reasonably quantify the reward signals of this process and propose a two-stage RL post-training approach from freely exploring context to practically solving tasks. Second, to increase the diversity of training data, DoGe constructs an evolving curriculum learning pipeline: an expanded native domain knowledge corpus and an iteratively evolving seed problems pool. Experiments show that our method consistently outperforms the baseline across various benchmarks, providing a scalable pathway for realizing self-evolving LVLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37105", "url": null, "sourceid": 30949, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39648, "uid": "58d2090ece6d4b4a64b9af4763e6661b", "name": "Geoint-R1: Formalizing Multimodal Geometric Reasoning with Dynamic Auxiliary Constructions", "authors": [{"id": 143847, "fullname": "Jingxuan Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/143847?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 192560, "fullname": "Caijun Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/192560?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 153763, "fullname": "Qi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/153763?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 192561, "fullname": "Honghao He", "url": "http://cvpr.thecvf.com/api/miniconf/users/192561?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 153765, "fullname": "Linzhuang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/153765?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 73990, "fullname": "Conghui He", "url": "http://cvpr.thecvf.com/api/miniconf/users/73990?format=json", "institution": "Shanghai AI Lab"}, {"id": 186666, "fullname": "Lijun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186666?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 153766, "fullname": "Bihui Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153766?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 133148, "fullname": "Cheng Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/133148?format=json", "institution": "Zhejiang University &amp; Westlake University"}], "abstract": "Mathematical geometric reasoning is essential for scientific discovery and educational development, requiring precise logic and rigorous formal verification. While recent advances in Multimodal Large Language Models (MLLMs) have improved reasoning tasks, existing models typically struggle with formal geometric reasoning, particularly when dynamically constructing and verifying auxiliary geometric elements. To address these challenges, we introduce Geoint-R1, a multimodal reasoning framework designed to generate formally verifiable geometric solutions from textual descriptions and visual diagrams. Geoint-R1 uniquely integrates auxiliary elements construction, formal reasoning represented via Lean4, and interactive visualization. To systematically evaluate and advance formal geometric reasoning, we propose the Geoint benchmark, comprising 1,885 rigorously annotated geometry problems across diverse topics such as plane, spatial, and solid geometry. Each problem includes structured textual annotations, precise Lean4 code for auxiliary constructions, and detailed solution steps verified by experts. Extensive experiments demonstrate that Geoint-R1 significantly surpasses existing multimodal and math-specific reasoning models, particularly on challenging problems requiring explicit auxiliary element constructions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39648", "url": null, "sourceid": 45280, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40119, "uid": "fa7a4229d85b7747fc1f885a4c46828d", "name": "GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models", "authors": [{"id": 143847, "fullname": "Jingxuan Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/143847?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 192560, "fullname": "Caijun Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/192560?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 193572, "fullname": "Xi Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/193572?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 193573, "fullname": "Xinglong Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193573?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 86615, "fullname": "Siyuan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86615?format=json", "institution": "Westlake University, Zhejiang University"}, {"id": 153765, "fullname": "Linzhuang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/153765?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 153766, "fullname": "Bihui Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153766?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 73990, "fullname": "Conghui He", "url": "http://cvpr.thecvf.com/api/miniconf/users/73990?format=json", "institution": "Shanghai AI Lab"}, {"id": 186666, "fullname": "Lijun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186666?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 133148, "fullname": "Cheng Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/133148?format=json", "institution": "Zhejiang University &amp; Westlake University"}], "abstract": "Unified Multimodal Models (UMMs) are redefining the landscape of artificial intelligence by coupling perception and generation across language, vision, and structured reasoning. Yet, despite their growing sophistication, a critical gap persists in evaluation: existing benchmarks largely measure discriminative understanding or unconstrained generation in isolation, overlooking the integrated generative reasoning required for genuine multimodal intelligence. To address this, we introduce GGBench, the benchmark explicitly designed to evaluate geometric generative reasoning\u2014the ability of a model to understand, reason about, and construct a solution within a unified framework. Each instance in GGBench contains precisely aligned natural-language instructions, executable GeoGebra code, and rendered diagrams, enabling deterministic and interpretable verification of a model\u2019s reasoning and constructive fidelity. The benchmark comprises 1,411 rigorously curated problems covering eight categories and multiple difficulty levels, resulting in over 7,000 aligned visualizations. We propose a comprehensive tri-modal evaluation protocol that jointly assesses textual planning quality, code executability, and geometric accuracy of generated diagrams through both automated and human-in-the-loop judging. Extensive experiments on both state-of-the-art UMMs and general Large Language Models (LLMs) reveal a large performance gap between end-to-end generation and reasoning-grounded construction. GGBench establishes a new standard for testing multimodal systems that must not only understand but also build, marking a crucial step toward grounded, verifiable generative intelligence.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40119", "url": null, "sourceid": 45478, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38047, "uid": "2fe19a9f65a57f9717e619f339217837", "name": "RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning", "authors": [{"id": 172246, "fullname": "Jiahe Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/172246?format=json", "institution": "Shanghai AI Lab"}, {"id": 126876, "fullname": "Chuang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126876?format=json", "institution": "Beihang University"}, {"id": 188914, "fullname": "Bowen Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188914?format=json", "institution": "Peking University"}, {"id": 188915, "fullname": "Yinfan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188915?format=json", "institution": null}, {"id": 188916, "fullname": "Hao Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188916?format=json", "institution": "South China Normal University"}, {"id": 188917, "fullname": "Xingjian Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/188917?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 188918, "fullname": "Chengjin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188918?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 188919, "fullname": "Rui Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/188919?format=json", "institution": "Beihang University"}, {"id": 188920, "fullname": "Junyuan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188920?format=json", "institution": "Shanghai AI Lab"}, {"id": 172761, "fullname": "Jiaxing Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/172761?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 188921, "fullname": "Yubin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188921?format=json", "institution": null}, {"id": 186666, "fullname": "Lijun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186666?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 188922, "fullname": "Zhenhua Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188922?format=json", "institution": "South China Normal University"}, {"id": 188923, "fullname": "Jiang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188923?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 126901, "fullname": "Qian Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126901?format=json", "institution": "Beihang University"}, {"id": 73990, "fullname": "Conghui He", "url": "http://cvpr.thecvf.com/api/miniconf/users/73990?format=json", "institution": "Shanghai AI Lab"}], "abstract": "Large-scale chemical reaction datasets are crucial for AI research in chemistry. However, existing chemical reaction data often exist as images within papers, making them not machine-readable and unusable for training machine learning models. In response to this challenge, we propose the \\textbf{RxnCaption} framework for the task of chemical Reaction Diagram Parsing (RxnDP). Our framework reformulates the traditional coordinate prediction driven parsing process into an image captioning problem, which Large Vision Language Models (LVLMs) handle naturally. We introduce a strategy termed ``\\emph{BBox and Index as Visual Prompt}'' (BIVP), which uses our state-of-the-art molecular detector, MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the input image. This turns the downstream parsing into a natural-language description problem. Extensive experiments show that the BIVP strategy significantly improves structural extraction quality while simplifying model design. We further construct the \\texttt{RxnCaption-15k} dataset, an order of magnitude larger than prior real-world literature benchmarks, with a balanced test subset across four layout archetypes. Experiments demonstrate that RxnCaption-VL achieves state-of-the-art performance on multiple metrics.We believe our method, dataset, and models will advance structured information extraction from chemical literature and catalyze broader AI applications in chemistry. We will release data, models, and code on GitHub.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38047", "url": null, "sourceid": 39464, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36704, "uid": "53b79303779db833f34a053df5a6c111", "name": "FedAdamom: Adaptive Momentum for Improved Generalization in Federatedd Optimization", "authors": [{"id": 175663, "fullname": "Wenjie Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/175663?format=json", "institution": "Beihang University"}, {"id": 185682, "fullname": "Tianxiang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185682?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 185683, "fullname": "Feng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185683?format=json", "institution": "Beihang University"}, {"id": 185684, "fullname": "Tiantong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185684?format=json", "institution": "Nanyang Technological University; Nanyang Technological University"}, {"id": 185685, "fullname": "Zhiming Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185685?format=json", "institution": "Beihang University"}, {"id": 185686, "fullname": "Shaoting Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185686?format=json", "institution": "Beihang University"}, {"id": 158712, "fullname": "Wei Yang Bryan Lim", "url": "http://cvpr.thecvf.com/api/miniconf/users/158712?format=json", "institution": "Nanyang Technological University"}], "abstract": "Federated learning (FL) has emerged as a widely adopted training paradigm for privacy-preserving machine learning. Despite the past success of SGD-based methods, they still suffer from severe data heterogeneity and the lack of adaptivity in practical applications. While several adaptive federated optimization methods (such as FedAdam) have been proposed and demonstrated to achieve faster convergence, they fail to show significant improvements in generalization performance under highly heterogeneous data distributions, and their optimization and generalization mechanisms remain insufficiently understood. To fill this gap, we introduce diffusion theory into the adaptive federated optimization framework and analyze the distinct effects of adaptive learning rate and global momentum from the perspectives of saddle-point escaping and flat-minima selection. Theoretical results show that although FedAdam outperforms FedAvg/FedAvgM in escaping saddle points, the latter escapes sharp minima more efficiently. The root cause lies in that adaptive learning rates, while enhancing saddle-point escape, weaken the preference for flat minima. Motivated by these insights, we propose FedAdamom, a new adaptive federated optimization algorithm that adapts the momentum hyperparameter rather than the learning rate. FedAdamom maintains strong saddle-point escaping capability while enhancing flat-minima selection. We further establish its convergence guarantees under non-convex objectives. Extensive experiments demonstrate that FedAdamom significantly outperforms existing adaptive federated optimization methods in terms of convergence speed, generalization performance, and preference for flat minima.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36704", "url": null, "sourceid": 35625, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40272?format=json"], "related_events_ids": [40272]}, {"id": 40272, "uid": "53b79303779db833f34a053df5a6c111", "name": "FedAdamom: Adaptive Momentum for Improved Generalization in Federatedd Optimization", "authors": [{"id": 175663, "fullname": "Wenjie Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/175663?format=json", "institution": "Beihang University"}, {"id": 185682, "fullname": "Tianxiang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185682?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 185683, "fullname": "Feng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185683?format=json", "institution": "Beihang University"}, {"id": 185684, "fullname": "Tiantong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185684?format=json", "institution": "Nanyang Technological University; Nanyang Technological University"}, {"id": 185685, "fullname": "Zhiming Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185685?format=json", "institution": "Beihang University"}, {"id": 185686, "fullname": "Shaoting Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185686?format=json", "institution": "Beihang University"}, {"id": 158712, "fullname": "Wei Yang Bryan Lim", "url": "http://cvpr.thecvf.com/api/miniconf/users/158712?format=json", "institution": "Nanyang Technological University"}], "abstract": "Federated learning (FL) has emerged as a widely adopted training paradigm for privacy-preserving machine learning. Despite the past success of SGD-based methods, they still suffer from severe data heterogeneity and the lack of adaptivity in practical applications. While several adaptive federated optimization methods (such as FedAdam) have been proposed and demonstrated to achieve faster convergence, they fail to show significant improvements in generalization performance under highly heterogeneous data distributions, and their optimization and generalization mechanisms remain insufficiently understood. To fill this gap, we introduce diffusion theory into the adaptive federated optimization framework and analyze the distinct effects of adaptive learning rate and global momentum from the perspectives of saddle-point escaping and flat-minima selection. Theoretical results show that although FedAdam outperforms FedAvg/FedAvgM in escaping saddle points, the latter escapes sharp minima more efficiently. The root cause lies in that adaptive learning rates, while enhancing saddle-point escape, weaken the preference for flat minima. Motivated by these insights, we propose FedAdamom, a new adaptive federated optimization algorithm that adapts the momentum hyperparameter rather than the learning rate. FedAdamom maintains strong saddle-point escaping capability while enhancing flat-minima selection. We further establish its convergence guarantees under non-convex objectives. Extensive experiments demonstrate that FedAdamom significantly outperforms existing adaptive federated optimization methods in terms of convergence speed, generalization performance, and preference for flat minima.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40272", "url": null, "sourceid": -35625, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/36704?format=json"], "related_events_ids": [36704]}, {"id": 36906, "uid": "c193e32491586a336b2db8341da77a15", "name": "E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving", "authors": [{"id": 184167, "fullname": "Yihong Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184167?format=json", "institution": "ServiceNow Inc; McGill University, McGill University"}, {"id": 186171, "fullname": "Haicheng Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186171?format=json", "institution": "University of Macau"}, {"id": 186172, "fullname": "Tong Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/186172?format=json", "institution": "Hong Kong Polytechnic University; Tongji University"}, {"id": 186173, "fullname": "Junlin He", "url": "http://cvpr.thecvf.com/api/miniconf/users/186173?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 186174, "fullname": "Ao Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186174?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 186175, "fullname": "Kehua Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186175?format=json", "institution": "University of Washington"}, {"id": 186176, "fullname": "Wei Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/186176?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 186177, "fullname": "Zhenning Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186177?format=json", "institution": "University of Macao"}, {"id": 186178, "fullname": "Lijun Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/186178?format=json", "institution": "McGill University"}, {"id": 86400, "fullname": "Cheng-Zhong Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86400?format=json", "institution": "University of Macau"}], "abstract": "End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they ignore the passenger\u2019s emotional state, which is central to comfort and AD acceptance. We introduce Open-Domain End-to-End (OD-E2E) AD, where an autonomous vehicle must interpret free-form natural-language commands, infer the emotion, and plan a physically feasible trajectory. We propose E3AD, an emotion-aware VLA framework that augments semantic understanding with two cognitively inspired components: a continuous Valence-Arousal-Dominance (VAD) emotion model that captures tone and urgency from language, and a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for human-like spatial cognition. A consistency-oriented training scheme, combining modality pretraining with preference-based alignment, further enforces coherence between emotional intent and driving actions. Across real-world datasets, E3AD improves visual grounding and waypoint planning and achieves state-of-the-art (SOTA) VAD correlation for emotion estimation. These results show that injecting emotion into VLA-style driving yields more human-aligned grounding, planning, and feedback.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36906", "url": null, "sourceid": 37101, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39851, "uid": "87db7aea88a3ae9304944ae954c6a420", "name": "IrisFP: Adversarial-Example-based Model Fingerprinting with Enhanced Uniqueness and Robustness", "authors": [{"id": 192980, "fullname": "Ziye Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192980?format=json", "institution": "University of Houston"}, {"id": 192981, "fullname": "Guang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192981?format=json", "institution": "Virginia Commonwealth University"}, {"id": 192982, "fullname": "Yihang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192982?format=json", "institution": "University of Houston"}, {"id": 192983, "fullname": "Changqing Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192983?format=json", "institution": "University of Houston"}], "abstract": "We propose IrisFP, a novel adversarial-example-based model fingerprinting framework that enhances both uniqueness and robustness by leveraging multi-boundary characteristics, multi-sample behaviors, and fingerprint discriminative power assessment to generate composite-sample fingerprints. Three key innovations make IrisFP outstanding: 1) It positions fingerprints near the intersection of all decision boundaries\u2014unlike prior methods that target a single boundary\u2014thus increasing the prediction margin without placing fingerprints deep inside target class regions, enhancing both robustness and uniqueness; 2) It constructs composite-sample fingerprints, each comprising multiple samples close to the multi-boundary intersection, to exploit collective behavior patterns and further boost uniqueness; and 3) It assesses the discriminative power of generated fingerprints using statistical separability metrics developed based on two reference model sets respectively for pirated and independently-trained models, and assigns fingerprint-specific thresholds to retained fingerprints. Extensive experiments show that IrisFP consistently outperforms state-of-the-art methods, achieving reliable ownership verification by enhancing both robustness and uniqueness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39851", "url": null, "sourceid": 36338, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39957, "uid": "b4866aabd0aa02ee10cfc72af8eb195e", "name": "SAQN: Semantic-based Adaptive Query Network for 3D Referring Expression Segmentation", "authors": [{"id": 147245, "fullname": "Jiale Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/147245?format=json", "institution": null}, {"id": 193191, "fullname": "Shangfei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193191?format=json", "institution": "University of Science and Technology of China"}], "abstract": "3D Referring Expression Segmentation (3D-RES) aims to segment objects in point clouds according to language descriptions. Unlike common practices in 2D that utilize learnable query embeddings, recent 3D-RES methods typically generate queries directly from 3D points. However, this direct coupling of queries to raw point clouds introduces new challenges: an impractically large number of queries derived from massive point cloud data and a reliance on non-deterministic sampling algorithms. In this paper, we propose a Semantic-based Adaptive Query Network (SAQN), which introduces a novel query strategy for 3D-RES. Instead of generating queries from points, SAQN employs a learnable query vector for each semantic class. This approach drastically reduces the number of queries while maintaining the advantage of avoiding Hungarian matching through implicit class alignment. Additionally, to address potential cross-object ambiguity within semantic classes, we introduce supplementary queries that are adaptively fused with each class query to disambiguate and enrich representations. Comprehensive experiments show that SAQN achieves state-of-the-art performance while reducing the number of queries.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39957", "url": null, "sourceid": 30808, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39340, "uid": "e61aeeec4e3e400d0e9e052cdde2d170", "name": "UAVLight: A Benchmark for Illumination-Robust 3D Reconstruction in Unmanned Aerial Vehicle (UAV) Scenes", "authors": [{"id": 191879, "fullname": "Kang DU", "url": "http://cvpr.thecvf.com/api/miniconf/users/191879?format=json", "institution": "Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 160046, "fullname": "\u96ea \u5ed6", "url": "http://cvpr.thecvf.com/api/miniconf/users/160046?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 191880, "fullname": "Junpeng Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/191880?format=json", "institution": ""}, {"id": 191881, "fullname": "Chaozheng Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/191881?format=json", "institution": "Meituan"}, {"id": 155313, "fullname": "Yi Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/155313?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 145286, "fullname": "Yirui Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/145286?format=json", "institution": "Tencent"}, {"id": 127991, "fullname": "Duotun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127991?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 191882, "fullname": "ShengHuang ShengHuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191882?format=json", "institution": null}, {"id": 128022, "fullname": "Zeyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128022?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}], "abstract": "Illumination inconsistency is a fundamental challenge in multi-view 3D reconstruction. Variations in sunlight direction, cloud cover, and shadows break the constant-lighting assumption underlying both classical multi-view stereo (MVS) and structure from motion (SfM) pipelines and recent neural rendering methods, leading to geometry drift, color inconsistency, and shadow imprinting. This issue is especially critical in UAV-based reconstruction, where long flight durations and outdoor environments make lighting changes unavoidable.However, existing datasets either restrict capture to short time windows, thus lacking meaningful illumination diversity, or span months and seasons, where geometric and semantic changes confound the isolated study of lighting robustness.We introduce UAVLight, a controlled-yet-real benchmark for illumination-robust 3D reconstruction. Each scene is captured along repeatable, geo-referenced flight paths at multiple fixed times of day, producing natural lighting variation under consistent geometry, calibration, and viewpoints. With standardized evaluation protocols across lighting conditions, UAVLight provides a reliable foundation for developing and benchmarking reconstruction methods that are consistent, faithful, and relightable in real outdoor environments.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39340", "url": null, "sourceid": 36602, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39824, "uid": "862690c612c4c41a2ddceb9bf7a5c848", "name": "TTP: Test-Time Padding for Adversarial Detection and Robust Adaptation on Vision-Language Models", "authors": [{"id": 143712, "fullname": "Zhiwei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/143712?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 173716, "fullname": "Yitian Pang", "url": "http://cvpr.thecvf.com/api/miniconf/users/173716?format=json", "institution": "Tsinghua University"}, {"id": 87834, "fullname": "Weining Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87834?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 129049, "fullname": "Zhenan Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/129049?format=json", "institution": "Institute of automation, Chinese Academy of Sciences"}, {"id": 192927, "fullname": "Qi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192927?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}], "abstract": "Vision-Language Models (VLMs), such as CLIP, have achieved impressive zero-shot recognition performance but remain highly susceptible to adversarial perturbations, posing significant risks in safety-critical scenarios. Previous training-time defenses rely on adversarial fine-tuning, which requires labeled data and costly retraining, while existing test-time strategies fail to reliably distinguish between clean and adversarial inputs, thereby preventing both adversarial robustness and clean accuracy from reaching their optimum. To address these limitations, we propose Test-Time Padding (TTP), a lightweight defense framework that performs adversarial detection followed by targeted adaptation at inference. TTP identifies adversarial inputs via the cosine similarity shift between CLIP feature embeddings computed before and after spatial padding, yielding a universal threshold for reliable detection across architectures and datasets. For detected adversarial cases, TTP employs trainable padding to restore disrupted attention patterns, coupled with a similarity-aware ensemble strategy for a more robust final prediction. For clean inputs, TTP leaves them unchanged by default or optionally integrates existing test-time adaptation techniques for further accuracy gains. Comprehensive experiments on diverse CLIP backbones and fine-grained benchmarks show that TTP consistently surpasses state-of-the-art test-time defenses, delivering substantial improvements in adversarial robustness without compromising clean accuracy. The code for this paper will be released soon.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39824", "url": null, "sourceid": 41911, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37902, "uid": "f1a90e1c055459c26e3280c607f8fe5e", "name": "FlowPalm: Optical Flow Driven Non-Rigid Deformation for Geometrically Diverse Palmprint Generation", "authors": [{"id": 184181, "fullname": "yuchen zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184181?format=json", "institution": "Xian Jiaotng University"}, {"id": 188536, "fullname": "Huikai Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188536?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 188537, "fullname": "Lihuang Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188537?format=json", "institution": "Southern University of Science and Technology"}, {"id": 188538, "fullname": "Zhipeng Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/188538?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 188539, "fullname": "Dexing Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/188539?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "Recently, synthetic palmprints have been increasingly used as substitutes for real data to train recognition models. To be effective, such synthetic data must reflect the diversity of real palmprints, including both style variation and geometric variation. However, existing palmprint generation methods mainly focus on style translation, while geometric variation is either ignored or approximated by simple handcrafted augmentations. In this work, we propose FlowPalm, an optical-flow-driven palmprint generation framework capable of simulating the complex non-rigid deformations observed in real palms. Specifically, FlowPalm estimates optical flows between real palmprint pairs to capture the statistical patterns of geometric deformations. Building on these priors, we design a progressive sampling process that gradually introduces the geometric deformations during diffusion while maintaining identity consistency. Extensive experiments on six benchmark datasets demonstrate that FlowPalm significantly outperforms state-of-the-art palmprint generation approaches in downstream recognition tasks. Notably, FlowPalm achieves a higher TAR at FAR=1e-4 than the best generative model does at FAR=1e-3.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37902", "url": null, "sourceid": 30776, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36329, "uid": "f4967b7a3d977a1f0f4997e1aa838516", "name": "Submodel Extraction for Efficient and Personalized Federated Learning via Optimal Transport", "authors": [{"id": 180783, "fullname": "Zheng Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180783?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 184784, "fullname": "Nan He", "url": "http://cvpr.thecvf.com/api/miniconf/users/184784?format=json", "institution": "Beijing University of Technology"}, {"id": 100598, "fullname": "Yiming Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/100598?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 184785, "fullname": "Lifeng Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/184785?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Federated Learning (FL) enables collaborative model training while preserving data privacy, but its practical deployment is hampered by system and statistical heterogeneity. While federated network pruning offers a path to mitigate these issues, existing methods face a critical dilemma: server-side pruning lacks personalization, whereas client-side pruning is computationally prohibitive for resource-constrained devices. Furthermore, the pruning process itself induces significant parametric divergence among heterogeneous submodels, destabilizing training and hindering global convergence. To address these challenges, we propose SubFLOT, a novel framework for server-side personalized federated pruning. SubFLOT introduces an Optimal Transport-enhanced Pruning (OTP) module that treats historical client models as proxies for local data distributions, formulating the pruning task as a Wasserstein distance minimization problem to generate customized submodels without accessing raw data. Concurrently, to counteract parametric divergence, our Scaling-based Adaptive Regularization (SAR) module adaptively penalizes a submodel's deviation from the global model, with the penalty's strength scaled by the client's pruning rate. Comprehensive experiments demonstrate that SubFLOT consistently and substantially outperforms state-of-the-art methods, underscoring its potential for deploying efficient and personalized models on resource-constrained edge devices.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36329", "url": null, "sourceid": 33300, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37771, "uid": "1823d84d3a1658907dac3f1c08d9b4f0", "name": "MERLIN: Building Low-SNR Robust Multimodal LLMs for Electromagnetic Signals", "authors": [{"id": 183310, "fullname": "Junyu Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/183310?format=json", "institution": "Tsinghua University"}, {"id": 174393, "fullname": "Zhendong She", "url": "http://cvpr.thecvf.com/api/miniconf/users/174393?format=json", "institution": "Tianjin University"}, {"id": 188218, "fullname": "Chenghanyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188218?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 188219, "fullname": "Yuchuang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/188219?format=json", "institution": "computer science, Tsinghua University"}, {"id": 188220, "fullname": "Luqing Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188220?format=json", "institution": "Institute of Microelectronics of the Chinese Academy of Sciences"}, {"id": 188221, "fullname": "Dingwei Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188221?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 76730, "fullname": "Zonghao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/76730?format=json", "institution": "Tsinghua University"}, {"id": 188222, "fullname": "Bo Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188222?format=json", "institution": "Tsinghua University"}, {"id": 188223, "fullname": "Zehua Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/188223?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 188224, "fullname": "Wupeng Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/188224?format=json", "institution": "Artificial Intelligence Institute, China Electronics Technology Group Corporation"}, {"id": 188225, "fullname": "Yaxin Mu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188225?format=json", "institution": "Beijing Information Science and Technology University"}, {"id": 188226, "fullname": "Peng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188226?format=json", "institution": "Tianjin University"}, {"id": 157538, "fullname": "Pei Pei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/157538?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 158611, "fullname": "Fengxiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158611?format=json", "institution": "National University of Defense Technology"}, {"id": 176670, "fullname": "Yangang Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/176670?format=json", "institution": "TsingHua University"}, {"id": 131381, "fullname": "Maosong Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/131381?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "The paradigm of Multimodal Large Language Models (MLLMs) offers a promising blueprint for advancing the electromagnetic (EM) domain. However, prevailing approaches often deviate from the native MLLM paradigm, instead using task-specific or pipelined architectures that lead to fundamental limitations in model performance and generalization. Fully realizing the MLLM potential in EM domain requires overcoming three main challenges: (1) \\textbf{Data.} The scarcity of high-quality datasets with paired EM signals and descriptive text annotations used for MLLMs pre-training; (2) \\textbf{Benchmark.} The absence of comprehensive benchmarks to systematically evaluate and compare the performance of models on EM signal-to-text tasks; (3) \\textbf{Model.} A critical fragility in low Signal-to-Noise Ratio (SNR) environments, where critical signal features can be obscured, leading to significant performance degradation. To address these challenges, we introduce a tripartite contribution to establish a foundation for MLLMs in the EM domain. First, to overcome data scarcity, we construct and release EM-100k, a large-scale dataset comprising over 100,000 EM signal-text pairs. Second, to enable rigorous and standardized evaluation, we propose EM-Bench, the most comprehensive benchmark featuring diverse downstream tasks spanning from perception to reasoning. Finally, to tackle the core modeling challenge, we present MERLIN, a novel training framework designed not only to align low-level signal representations with high-level semantic text, but also to explicitly enhance model robustness and performance in challenging low-SNR environments. Comprehensive experiments validate our method, showing that MERLIN is state-of-the-art in the EM-Bench and exhibits remarkable robustness in low-SNR settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37771", "url": null, "sourceid": 41924, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38519, "uid": "3e3e6f02b06d73c121422793ab3b91f1", "name": "CompBench: Benchmarking Complex Instruction-guided Image Editing", "authors": [{"id": 179958, "fullname": "Bohan Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/179958?format=json", "institution": "East China Normal University(Tax ID: 12100000425006133D)"}, {"id": 102419, "fullname": "Wenxuan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/102419?format=json", "institution": "East China Normal University"}, {"id": 175615, "fullname": "Yuntian Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/175615?format=json", "institution": "East China Normal University"}, {"id": 190049, "fullname": "Junbo Qiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190049?format=json", "institution": "East China Normal University"}, {"id": 190050, "fullname": "Jincheng Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190050?format=json", "institution": "East China Normal University"}, {"id": 149629, "fullname": "Shaosheng Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/149629?format=json", "institution": "Xiaohongshu"}, {"id": 177703, "fullname": "Fei Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/177703?format=json", "institution": "Xiaohongshu"}, {"id": 190051, "fullname": "Zhaopeng Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190051?format=json", "institution": "Zhejiang University"}, {"id": 190052, "fullname": "Zhouhong Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190052?format=json", "institution": "Xiaohongshu"}, {"id": 179641, "fullname": "Zhenfei Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/179641?format=json", "institution": "University of Oxford; University of Sydney"}, {"id": 87059, "fullname": "Lei Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/87059?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 151075, "fullname": "Wanli Ouyang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151075?format=json", "institution": "Shanghai AI Lab"}, {"id": 70822, "fullname": "Lin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/70822?format=json", "institution": "University of Science and Technology of China"}, {"id": 190053, "fullname": "Fei Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190053?format=json", "institution": null}, {"id": 153682, "fullname": "Zihan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153682?format=json", "institution": "East China Normal University"}, {"id": 135330, "fullname": "Yuan Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/135330?format=json", "institution": "East China Normal University"}, {"id": 87891, "fullname": "Shaohui Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/87891?format=json", "institution": "East China Normal University"}], "abstract": "While real-world applications increasingly demand intricate scene manipulation, existing instruction-guided image editing benchmarks often oversimplify task complexity and lack comprehensive, fine-grained instructions. To bridge this gap, we introduce CompBench, a large-scale benchmark specifically designed for complex instruction-guided image editing. CompBench features challenging editing scenarios that incorporate fine-grained instruction following, spatial and contextual reasoning, thereby enabling comprehensive evaluation of image editing models' precise manipulation capabilities. To construct CompBench, We propose an MLLM-human collaborative framework with tailored task pipelines. Furthermore, we propose an instruction decoupling strategy that disentangles editing intents into four key dimensions: location, appearance, dynamics, and objects, ensuring closer alignment between instructions and complex editing requirements. Extensive evaluations reveal that CompBench exposes fundamental limitations of current image editing models and provides critical insights for the development of next-generation instruction-guided image editing systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38519", "url": null, "sourceid": 34078, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40088, "uid": "33abbac390f933b4d29d1ccae857ea98", "name": "Unleashing VLA Potentials in Autonomous Driving via Explicit Learning from Failures", "authors": [{"id": 180120, "fullname": "Yuechen Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/180120?format=json", "institution": "Tsinghua University"}, {"id": 187287, "fullname": "Fang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187287?format=json", "institution": "University of Macau"}, {"id": 193485, "fullname": "Qimao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/193485?format=json", "institution": "Tsinghua University"}, {"id": 144260, "fullname": "Shaoqing Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/144260?format=json", "institution": "University of Macau"}, {"id": 193486, "fullname": "Jiaxin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193486?format=json", "institution": "Tsinghua University"}, {"id": 151958, "fullname": "Ziying Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/151958?format=json", "institution": "Beijing Jiaotong University"}, {"id": 149023, "fullname": "Zhixin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/149023?format=json", "institution": "University of Macau"}, {"id": 154370, "fullname": "Fuxi Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/154370?format=json", "institution": "Tsinghua University"}], "abstract": "Vision-Language-Action (VLA) models for autonomous driving often hit a performance plateau during Reinforcement Learning (RL) optimization. This stagnation arises from exploration capabilities constrained by previous Supervised Fine-Tuning (SFT), leading to \"persistent failures\" in long-tail scenarios. In these critical situations, all explored actions yield a zero-value driving score. This information-sparse reward signals a failure, yet fails to identify its root cause\u2014whether it is due to incorrect planning, flawed reasoning, or poor trajectory execution. To address this limitation,  we propose **VLA** with **E**xplicit **L**earning from **F**ailures (**ELF-VLA**), a framework that augments RL with structured diagnostic feedback. Instead of relying on a vague scalar reward, our method produces detailed, interpretable reports that identify the specific failure mode. The VLA policy then leverages this explicit feedback to generate a **Feedback-Guided Refinement**. By injecting these corrected, high-reward samples back into the RL training batch, our approach provides a targeted gradient, which enables the policy to solve critical scenarios that unguided exploration cannot. Extensive experiments demonstrate that our method unlocks the latent capabilities of VLA models, achieving state-of-the-art (SOTA) performance on the public Navsim benchmark for overall PDMS, EPDMS score and high-level planning accuracy.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40088", "url": null, "sourceid": 40802, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39282, "uid": "193864498a92b6aede7589f5d4826e12", "name": "Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs", "authors": [{"id": 151021, "fullname": "Dmitry Demidov", "url": "http://cvpr.thecvf.com/api/miniconf/users/151021?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 127235, "fullname": "Muhammad Zaigham Zaheer", "url": "http://cvpr.thecvf.com/api/miniconf/users/127235?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 191762, "fullname": "Zongyan Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/191762?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 107318, "fullname": "Omkar Thawakar", "url": "http://cvpr.thecvf.com/api/miniconf/users/107318?format=json", "institution": "MBZUAI"}, {"id": 128310, "fullname": "Rao Anwer", "url": "http://cvpr.thecvf.com/api/miniconf/users/128310?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}], "abstract": "Vocabulary-free fine-grained image recognition aims to distinguish visually similar categories within a meta-class without a fixed, human-defined label set. Existing solutions for this problem remain limited by either the usage of a large and rigid list of vocabularies or by the dependency on complex pipelines with fragile heuristics where errors propagate across stages. Meanwhile, the ability of recent large multi-modal models (LMMs) equipped with explicit or implicit reasoning to comprehend visual-language data, decompose problems, retrieve latent knowledge, and self-correct suggests a more principled and effective alternative. Building on these capabilities, we propose FiNDR (Fine-grained Name Discovery via Reasoning), the first reasoning-augmented LMM-based framework for vocabulary-free fine-grained recognition. The system oper- ates in three automated steps: (i) a reasoning-enabled LMM generates descriptive candidate labels for each image; (ii) a vision-language model filters and ranks these candidates to form a coherent class set; and (iii) the verified names instantiate a lightweight multi-modal classifier used at inference time. Extensive experiments on popular fine-grained classification benchmarks demonstrate state-of-the-art per- formance under the vocabulary-free setting, with a significant relative margin of up to 18.8% over previous ap- proaches. Remarkably, the proposed method surpasses zero-shot baselines that exploit pre-defined ground-truth names, challenging the assumption that human-curated vo- cabularies define an upper bound. Ablations further confirm that advanced prompting techniques and built-in rea- soning mechanisms significantly enhance naming quality. Additionally, we show that carefully engineered prompts enable open-source LMMs to match proprietary counter- parts. These findings establish reasoning-augmented LMMs as an effective foundation for scalable, fully automated, open-world fine-grained visual recognition. The source code and relevant prompting guidelines will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39282", "url": null, "sourceid": 41118, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38420, "uid": "e4a6bedbfb9b7230f8a46e0ab7e15ded", "name": "Rethinking Asymmetric Quantization: Hidden Symmetry in Vision Model Weights", "authors": [{"id": 189836, "fullname": "Masafumi Mori", "url": "http://cvpr.thecvf.com/api/miniconf/users/189836?format=json", "institution": "DENSO Corporation"}, {"id": 189837, "fullname": "Shinya Gongyo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189837?format=json", "institution": "DENSO IT Laboratory"}, {"id": 189838, "fullname": "Mitsuru Ambai", "url": "http://cvpr.thecvf.com/api/miniconf/users/189838?format=json", "institution": "Denso IT Laboratory, Inc."}], "abstract": "Post-training quantization (PTQ) enables rapid deployment of deep pretrained models. In the low-bit regime, recent PTQ methods for vision models adopt asymmetric quantization (AsymQ), introducing zero-point offsets to mitigate quantization errors. However, these offsets impose substantial hardware overhead and fail to fully capture the non-symmetric structure of pretrained weight distributions, leaving many quantization levels unused.In this paper, we reveal a hidden symmetry in the pretrained weights: after removing a few sparse outliers, the distribution becomes nearly symmetric.Accordingly, we propose Dense and Additive Sparse Quantization (DASQ), which decomposes the weights into dense and sparse matrices.The dense component captures the symmetric structure around zero, while the sparse component models the removed outliers, and both can be processed in parallel and can be implemented with efficient zero-point-free computation.Experiments on image classification, object detection, and instance segmentation show that DASQ surpasses state-of-the-art PTQ methods with lower BOPs. On an FPGA, DASQ also demonstrates higher accuracy and lower power consumption than AsymQ at comparable throughput.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38420", "url": null, "sourceid": 40652, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36474, "uid": "0e9d48fe4fca69a47f5353d0a62333c2", "name": "Interactive Tracking: A Human-in-the-Loop Paradigm with Memory-Augmented Adaptation", "authors": [{"id": 128144, "fullname": "Yuqing Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128144?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 185132, "fullname": "Guotian Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185132?format=json", "institution": "South China University of Technology; Peng Cheng Laboratory"}, {"id": 185133, "fullname": "Zhenqiao Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185133?format=json", "institution": "Harbin Institute of Technology; Peng Cheng Laboratory"}, {"id": 89529, "fullname": "Zhenyu He", "url": "http://cvpr.thecvf.com/api/miniconf/users/89529?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 128139, "fullname": "Xin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/128139?format=json", "institution": "Peng Cheng Laboratory"}, {"id": 87840, "fullname": "Yaowei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87840?format=json", "institution": "Pengcheng Laboratory"}, {"id": 75715, "fullname": "Ming-Hsuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75715?format=json", "institution": "University of California at Merced"}], "abstract": "Existing visual trackers mainly operate in a non-interactive, fire-and-forget manner, making them impractical for real-world scenarios that require human-in-the-loop adaptation. To overcome this limitation, we introduce Interactive Tracking, a new paradigm that allows users to guide the tracker at any time using natural language commands. To support research in this direction, we make three main contributions. First, we present InteractTrack, the first large-scale benchmark for interactive tracking, containing 150 videos with dense bounding box annotations and timestamped language instructions. Second, we propose a comprehensive evaluation protocol and evaluate 25 representative trackers, showing that state-of-the-art methods fail in interactive scenarios\u2014strong performance on conventional benchmarks does not transfer. Third, we introduce Interactive Memory-Augmented Tracking (IMAT), a new baseline that employs a dynamic memory mechanism to learn from user feedback and update tracking behavior accordingly. Our benchmark, protocol, and baseline establish a foundation for developing more intelligent, adaptive, and collaborative tracking systems, bridging the gap between automated perception and human guidance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36474", "url": null, "sourceid": 40352, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40074, "uid": "42759774827b8680ddb8a6587761480f", "name": "Steering Where to Diffuse: Generative Modeling of Phenotypic Response Simulation with Steered Diffusion Bridge", "authors": [{"id": 181163, "fullname": "Rongchao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181163?format=json", "institution": "Peking University"}, {"id": 193444, "fullname": "Chengxin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193444?format=json", "institution": "Beijing Institute of Technology"}, {"id": 193445, "fullname": "Yiwei Lou", "url": "http://cvpr.thecvf.com/api/miniconf/users/193445?format=json", "institution": "Peking University"}, {"id": 193282, "fullname": "Yuling Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/193282?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 158816, "fullname": "Hanpin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158816?format=json", "institution": "Peking University"}, {"id": 149685, "fullname": "Yu Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/149685?format=json", "institution": "Peking University"}], "abstract": "Phenotypic Response Simulation (PRS) has long been a fundamental task in quantitative biology and high-throughput screening, with the potential to accelerate therapeutic development and elucidate disease mechanisms beyond empirical clinical practice. However, the vast perturbation space poses challenges to the discriminative formulation, and existing generative approaches tend to concentrate on the same trajectory subspace, making their generated paths prone to drift. To fill these gaps, we propose a novel Steered Diffusion Bridge approach and named SimuSDB to define deterministic probabilistic trajectories between two distinct state domains for cell response generation. SimuSDB consists of two iterative processes: i) extending the diffusion bridge paradigm to maintain stochasticity and diversity in interpolation trajectories by introducing Brownian bridges and ii) generating cell morphologies that comply with phenotypic constraints, while allowing the latter to explicitly guide the generative process. For the challenging second stage, which involves incorporating diverse morphological constraints and phenotype rules, we formalize the rule-guided sample generation task as an optimal control problem within a stochastic dynamical system. This way, the generative model can achieve analytically tractable optimal control strategies and steered generation without collapsing toward the trajectory of the same data subspace. Comprehensive experiments demonstrate the superior performance of SimuSDB.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40074", "url": null, "sourceid": 45959, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38548, "uid": "e019c9de73f6441a1e1d8b26404fdb6f", "name": "Coordinate Denoising for Non\u2011Equilibrium Molecular Representation Learning", "authors": [{"id": 190110, "fullname": "Qianwei Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190110?format=json", "institution": "Nanjing University"}, {"id": 190111, "fullname": "Baile Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190111?format=json", "institution": "nanjing university"}, {"id": 190112, "fullname": "Jian Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190112?format=json", "institution": "Nanjing University"}, {"id": 190113, "fullname": "Furao Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190113?format=json", "institution": "Nanjing University"}], "abstract": "Three-dimensional molecular representation learning has shown great promise in modeling chemical structures and their properties. However, most existing approaches implicitly assume molecules are at or near equilibrium states. This assumption breaks down for non-equilibrium structures\u2014ubiquitous in molecular dynamics (MD) trajectories\u2014where standard coordinate denoising techniques fail because the direct equivalence between denoising scores and atomic forces no longer holds. To bridge this gap, we propose Node Denoising on non-Equilibrium Molecules (NDeM), a novel auxiliary task grounded in a second-order finite difference approximation of the potential energy surface. By explicitly accounting for the non-zero inherent forces in non-equilibrium states, NDeM provides a theoretically sound denoising objective applicable to arbitrary molecular conformations. Crucially, our method is designed as a lightweight, architecture-agnostic plugin that requires no pre-training and can be seamlessly integrated into various supervised learning pipelines. Extensive experiments across diverse benchmarks, including MD17, QM9, and the large-scale OC20 dataset, demonstrate that NDeM consistently improves baseline models, yielding highly competitive performance and validating its robustness across both equilibrium and non-equilibrium regimes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38548", "url": null, "sourceid": 35093, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39493, "uid": "7314a789d9c965494dc1b1c371d120e9", "name": "TrajRAG: Retrieving Geometric-Semantic Experience for Zero-Shot Object Navigation", "authors": [{"id": 186294, "fullname": "Yiyao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186294?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 76545, "fullname": "Sixian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76545?format=json", "institution": "None"}, {"id": 186295, "fullname": "Keming Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186295?format=json", "institution": ", Chinese Academy of Sciences"}, {"id": 77199, "fullname": "Xinhang Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/77199?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 192195, "fullname": "Songjie Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/192195?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 85108, "fullname": "Shuqiang Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85108?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}], "abstract": "Existing zero-shot Object Goal Navigation (ObjectNav) methods often exploit commonsense knowledge from large language or vision-language models to guide navigation. However, such knowledge arises from internet-scale text rather than embodied 3D experience, and episodic observations collected during navigation are typically discarded, preventing the accumulation of lifelong experience.To this end, we propose Trajectory RAG (TrajRAG), a retrieval-augmented generation framework that enhances large-model reasoning by retrieving geometric\u2013semantic experiences.  TrajRAG incrementally accumulates episodic observations from past navigation episodes. To structure these observations, we propose a topological-polar (topo-polar) trajectory representation that compactly encodes spatial layouts and semantic contexts, effectively removing redundancies in raw episodic observations. A hierarchical chunking structure further organizes similar topo-polar trajectories into unified summaries, enabling coarse-to-fine retrieval. During navigation, candidate frontiers generate multiple trajectory hypotheses that query TrajRAG for similar past trajectories, guiding large-model reasoning for waypoint selection. New experiences are continually consolidated into TrajRAG, enabling the accumulation of lifelong navigation experience.Experiments on MP3D, HM3D-v1, and HM3D-v2 show that TrajRAG effectively retrieves relevant geometric\u2013semantic experiences and improves zero-shot ObjectNav performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39493", "url": null, "sourceid": 34418, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36951, "uid": "fb8fb2ff998229aca1a25c0b94b94cc0", "name": "Multi-Scale Gaussian-Language Map for Embodied Navigation and Reasoning", "authors": [{"id": 76545, "fullname": "Sixian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76545?format=json", "institution": "None"}, {"id": 186294, "fullname": "Yiyao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186294?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 77199, "fullname": "Xinhang Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/77199?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 186295, "fullname": "Keming Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186295?format=json", "institution": ", Chinese Academy of Sciences"}, {"id": 186296, "fullname": "Zijian Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186296?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 85108, "fullname": "Shuqiang Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85108?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}], "abstract": "Understanding the geometric and semantic structure of environments is essential for embodied agents. Existing semantic mapping methods trade off between explicit geometry and multi-scale semantics,and lack a native interface for large models, thus requiring additional training of feature projection for semantic alignment. To this end, we propose the multi-scale Gaussian-Language Map (GLMap), which introduces three key designs: (1) explicit geometry, (2) multi-scale semantics covering both instance and region level concepts, and (3) a dual-modality interface where each semantic unit jointly stores a natural language description and a 3D Gaussian representation. The 3D Gaussians enable compact storage and fast rendering of task-relevant images via Gaussian splatting. To enable efficient incremental construction, we further propose a Gaussian Estimator that analytically derives Gaussian parameters from dense point clouds without gradient-based optimization. Experiments on ObjectNav, InstNav, and SQA tasks show that GLMap effectively enhances target localization and contextual reasoning, while remaining compatible with large-model-based methods in a zero-shot manner.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36951", "url": null, "sourceid": 35001, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38333, "uid": "3e89ac165a1a75a582fa8305bae74fcd", "name": "See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis", "authors": [{"id": 180498, "fullname": "Jaehyun Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/180498?format=json", "institution": "KAIST"}, {"id": 189626, "fullname": "Minyoung Ahn", "url": "http://cvpr.thecvf.com/api/miniconf/users/189626?format=json", "institution": "KRAFTON; Seoul National University"}, {"id": 129975, "fullname": "Minkyu Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/129975?format=json", "institution": "KRAFTON, Inc."}, {"id": 189627, "fullname": "Jonghyun Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/189627?format=json", "institution": "KRAFTON AI"}, {"id": 131183, "fullname": "Jae-Gil Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/131183?format=json", "institution": "Korea Advanced Institute of Science and Technology"}, {"id": 189628, "fullname": "Dongmin Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/189628?format=json", "institution": "KRAFTON"}], "abstract": "Despite recent advances in diffusion models, AI generated images still often contain visual artifacts that compromise realism. Although more thorough pre-training and bigger models might reduce artifacts, there is no assurance that they can be completely eliminated, which makes artifact mitigation a highly crucial area of study. Previous artifact-aware methodologies depend on human-labeled artifact datasets, which are costly and difficult to scale, underscoring the need for an automated approach to reliably acquire artifact-annotated datasets. In this paper, we propose **ArtiAgent**, which efficiently creates pairs of real and artifact-injected images. It comprises three agents: a perception agent that recognizes and grounds entities and subentities from real images, a synthesis agent that introduces artifacts via artifact injection tools through novel patch-wise embedding manipulation within a diffusion transformer, and a curation agent that filters the synthesized artifacts and generates both local and global explanations for each instance. Using our pipeline, we collect 100K pairwise instances and demonstrate both efficacy and versatility across diverse applications. Code is available at https://anonymous.4open.science/r/ArtiAgent.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38333", "url": null, "sourceid": 32893, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36532, "uid": "0dd989c64c054921672af61a0c7a5e95", "name": "Machine Unlearning via Adaptive Gradient Reweighting and Multi-stage Objective Optimization", "authors": [{"id": 175587, "fullname": "Juxin Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/175587?format=json", "institution": "Inner Mongolia University"}, {"id": 143205, "fullname": "Haoyu Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/143205?format=json", "institution": "Inner Mongolia University"}, {"id": 185294, "fullname": "Mengyao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185294?format=json", "institution": "Inner Mongolia University"}, {"id": 185295, "fullname": "Huaiwen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185295?format=json", "institution": "Inner Mongolia University"}], "abstract": "Machine Unlearning (MU) focuses on removing the influence of training samples from pre-trained models without retraining the model entirely. Existing MU methods have made several efforts to enable complete forgetting while preserving the model\u2019s performance on remaining data. However, they typically apply equal weights across different data, overlooking the ambiguous decision boundaries between similar samples or approximate classes. This leads to unnecessary consumption of shallowly memorized samples and significant performance degradation for approximate retention classes. Additionally, the inherent inconsistency between forgetting and retention objectives results in gradient conflict and domination problems during training, hindering model convergence and degrading overall performance. To address these, we introduce a novel adaptive gradient reweighting that assigns importance weights to individual forget samples or vulnerable retention classes, thereby enabling more efficient unlearning and preserving the performance of approximate classes. Subsequently, we propose a multi-stage objective optimization strategy, which comprises three optimization stages: Direction Rectification, Temporal Stabilization, and Adaptive Objective Combination. This strategy rectifies the direction of conflicting gradients and prevents one task (forgetting or retention) from dominating the model update. Comprehensive analyses and extensive experiments on multiple public datasets demonstrate that our method achieves considerable performance improvements in various tasks and scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36532", "url": null, "sourceid": 41002, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38039, "uid": "6d8a1d94651a63cd8cb385dc3e96a3a0", "name": "A More Word-like Image Tokenization for MLLMs", "authors": [{"id": 180716, "fullname": "Hyun Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/180716?format=json", "institution": "Seoul National University"}, {"id": 188895, "fullname": "Hyemin Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/188895?format=json", "institution": "Seoul National University"}, {"id": 188896, "fullname": "Yejin Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/188896?format=json", "institution": "Seoul National University"}, {"id": 188897, "fullname": "Hyungwook Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188897?format=json", "institution": "Seoul National University"}, {"id": 188898, "fullname": "Hyunsoo Cho", "url": "http://cvpr.thecvf.com/api/miniconf/users/188898?format=json", "institution": "Ewha Women&#x27;s University"}, {"id": 188899, "fullname": "Soo Kyung Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/188899?format=json", "institution": "Ewha Women&#x27;s University"}, {"id": 153548, "fullname": "Joonseok Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/153548?format=json", "institution": "Seoul National University"}], "abstract": "Modern multimodal large language models (MLLMs) typically keep the language model fixed and train a visual projector that maps the pixels into a sequence of tokens in its embedding space, so that images can be presented in essentially the same form as text. However, the language model has been optimized to operate on discrete, semantically meaningful tokens, while prevailing visual projectors transform an images into a long stream of continuous and highly correlated embeddings. This causes the visual tokens to behave differently from the word-like units that LLMs are originally trained to understand. We propose a novel Disentangled Visual Tokenization (DiVT) that clusters patch embeddings into coherent semantic units, so each token corresponds to a distinct visual concept instead of a rigid grid cell. DiVT further adapts its token budget to image complexity, providing an explicit accuracy compute trade-off modifying neither the vision encoder nor the language model. Across diverse multimodal benchmarks, DiVT matches or surpasses standard LLaVA 1.5 using only about 11% of the original visual tokens, substantially reducing memory and latency and making visual inputs more compatible with LLM-based reasoning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38039", "url": null, "sourceid": 40601, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37158, "uid": "2ef80c94a699fe337acb4f8236ace1cc", "name": "Fixed Anchors Are Not Enough: Dynamic Retrieval and Persistent Homology for Dataset Distillation", "authors": [{"id": 186800, "fullname": "Muquan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186800?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 186801, "fullname": "Hang Gou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186801?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 186802, "fullname": "Yingyi Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/186802?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 186803, "fullname": "Rongzheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186803?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 186804, "fullname": "Ke Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186804?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 156250, "fullname": "Tao He", "url": "http://cvpr.thecvf.com/api/miniconf/users/156250?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Decoupled dataset distillation (DD) compresses large corpora into a few synthetic images by matching a frozen teacher\u2019s statistics. However, current residual-matching pipelines rely on static real patches, creating a fit-complexity gap and a pull-to-anchor effect that reduce intra-class diversity and hurt generalization. To address these issues,  we introduce RETA--a Retrieval and Topology Alignment framework for decoupled DD. First, Dynamic Retrieval Connection (DRC) selects a real patch from a prebuilt pool by minimizing a fit-complexity score in teacher feature space; the chosen patch is injected via a residual connection to tighten feature fit while controlling injected complexity. Second, Persistent Topology Alignment (PTA) regularizes synthesis with persistent homology: we build a mutual k-NN feature graph, compute persistence images of components and loops, and penalize topology discrepancies between real and synthetic sets, mitigating pull-to-anchor effect. Across CIFAR-100, Tiny-ImageNet, ImageNet-1K, and multiple ImageNet subsets, RETA consistently outperforms various baselines under comparable time and memory, especially reaching 64.3\\% top-1 accuracy on ImageNet-1K with ResNet-18 at 50 images per class, +3.1\\% over the best prior.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37158", "url": null, "sourceid": 37947, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36309, "uid": "5f7a2dac2bedb3237ab8062d75b542f7", "name": "Beyond Static Frames: Temporal Aggregate-and-Restore Vision Transformer for Human Pose Estimation", "authors": [{"id": 184742, "fullname": "Hongwei Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184742?format=json", "institution": "Zhejiang Gongshang University"}, {"id": 184743, "fullname": "Jiahang Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/184743?format=json", "institution": "Zhejiang Gongshang University"}, {"id": 184744, "fullname": "Xun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184744?format=json", "institution": "Zhejiang Gongshang University"}, {"id": 129754, "fullname": "Wenwu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129754?format=json", "institution": "Zhejiang Gongshang University"}], "abstract": "Vision Transformers (ViTs) have recently achieved state-of-the-art performance in 2D human pose estimation due to their strong global modeling capability. However, existing ViT-based pose estimators are designed for static images and process each frame independently, thereby ignoring the temporal coherence that exists in video sequences. This limitation often results in unstable predictions, especially in challenging scenes involving motion blur, occlusion, or defocus. In this paper, we propose TAR-ViTPose, a novel Temporal Aggregate-and-Restore Vision Transformer tailored for video-based 2D human pose estimation. TAR-ViTPose enhances static ViT representations by aggregating temporal cues across frames in a plug-and-play manner, leading to more robust and accurate pose estimation. To effectively aggregate joint-specific features that are temporally aligned across frames, we introduce a joint-centric temporal aggregation (JTA) that assigns each joint a learnable query token to selectively attend to its corresponding regions from neighboring frames. Furthermore, we develop a global restoring attention (GRA) to restore the aggregated temporal features back into the token sequence of the current frame, enriching its pose representation while fully preserving global context for precise keypoint localization. Extensive experiments demonstrate that TAR-ViTPose substantially improves upon the single-frame baseline ViTPose, achieving a +2.3 mAP gain on the PoseTrack2017 benchmark. Moreover, our approach outperforms existing state-of-the-art video-based methods, while also achieving a noticeably higher runtime frame rate in real-world applications. Source code will be released for research purposes.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36309", "url": null, "sourceid": 38318, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39101, "uid": "e67a27ba8bb6ba92dc274342c874d373", "name": "AD-GBC: Anisotropic Granular-Ball Skip-Connection Refiner for UNet-Based Medical Image Segmentation", "authors": [{"id": 191357, "fullname": "Xiya Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191357?format=json", "institution": "Macau University of Science and Technology; Zhanjiang Ocean University"}, {"id": 191358, "fullname": "Qinglin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191358?format=json", "institution": "Macau University of Science and Technology"}, {"id": 191359, "fullname": "Li Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191359?format=json", "institution": null}], "abstract": "Prototype or region-attention modules have recently improved medical image segmentation but still suffer from two fundamental limitations: 1) they represent each semantic concept as a point or isotropic region, failing to capture the inherently anisotropic geometry of real feature distributions; and 2) many rely on non-differentiable clustering or one-way kernel weighting, which restricts their ability to form coherent region-level representations. We address these issues with the Anisotropic Differentiable Granular-Ball (AD-GBC) module, which generalizes prototypes into learnable geometric regions parameterized by a center and an anisotropic vector scale. AD-GBC aggregates local features into region-level semantics and redistributes the refined representation back to pixels in a fully differentiable manner, enabling geometry-aware refinement within modern UNet-style architectures.  Two geometric regularizers, a Wasserstein-based diversity loss and a radius\u2013dispersion consistency loss, prevent center collapse and encourage stable, well-formed region geometry.AD-GBC yields consistent improvements across four widely used medical segmentation benchmarks (BUSI, GlaS, CVC-ClinicDB, ISIC17) when integrated into two strong backbones (Rolling-UNet and U-KAN), demonstrating that the proposed geometric region formulation generalizes well across different imaging conditions.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39101", "url": null, "sourceid": 33570, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38330, "uid": "b442e83a5556fa523514fdf47e8e4e1b", "name": "Temporal Interaction in Spiking Transformers with Multi-Delay Mixer", "authors": [{"id": 181694, "fullname": "Kexin Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/181694?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 189619, "fullname": "Hanwen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189619?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 189620, "fullname": "Zeyang Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/189620?format=json", "institution": null}, {"id": 143723, "fullname": "Yang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/143723?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 180305, "fullname": "Jieyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180305?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 155889, "fullname": "Shuai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155889?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 189621, "fullname": "Jibin Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189621?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 85579, "fullname": "Malu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85579?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 127337, "fullname": "Yang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127337?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Spiking Neural Networks (SNNs) have gained significant attention due to their event-driven computational paradigm, making them promising for neuromorphic computing. In recent years, the integration of SNNs and Transformer architectures has made remarkable progress in various tasks. However, existing spiking self-attention mechanisms predominantly focus on spatial information while neglecting explicit temporal modelling, leading to suboptimal performance. In this paper, we introduce the Temporal Interaction Coefficient (TIC) to analyze temporal dependency patterns in these spatial-only attention mechanisms, revealing their limited temporal interactions and restricted pattern diversity. To overcome this issue, we propose the \\textbf{M}ulti-\\textbf{D}elay \\textbf{Mixer} (\\textbf{MD-Mixer}), drawing inspiration from time delay mechanisms in the nervous system. Specifically, MD-Mixer introduces multiple temporal delays to perform effective time mixing and facilitate temporally enriched spatial attention. In addition, it can be integrated seamlessly into existing Spiking Transformers as a drop-in replacement while maintaining energy efficiency. Extensive evaluations on static and neuromorphic benchmarks demonstrate that MD-Mixer substantially improves the performance of Spiking Transformers, outperforming existing state-of-the-art (SOTA) methods. This work establishes MD-Mixer as an effective and general solution for temporal modelling in event-driven architectures.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38330", "url": null, "sourceid": 44625, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36727, "uid": "9554b1e815bd9412c45cf4cc459288bc", "name": "Sparse Task Vector Mixup with Hypernetworks for Efficient Knowledge Transfer in Whole-Slide Image Prognosis", "authors": [{"id": 181234, "fullname": "Pei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181234?format=json", "institution": "Hunan University"}, {"id": 185733, "fullname": "xiangxiang Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185733?format=json", "institution": "Hunan University"}, {"id": 185734, "fullname": "Tengfei Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/185734?format=json", "institution": "Hunan University"}, {"id": 185735, "fullname": "Yucheng Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/185735?format=json", "institution": "National University of Singapore"}, {"id": 185736, "fullname": "Xuanbai Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/185736?format=json", "institution": "Hunan University"}, {"id": 185737, "fullname": "Yiping Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185737?format=json", "institution": "Hunan University"}], "abstract": "Whole-Slide Images (WSIs) are widely used for estimating the prognosis of cancer patients. Current studies generally follow a cancer-specific learning paradigm. However, the available training samples for one cancer type are usually scarce in pathology.Consequently, the model often struggles to learn generalizable knowledge, thus performing worse on the tumor samples with inherent high heterogeneity. Although multi-cancer joint learning and knowledge transfer approaches have been explored recently to address it, they either rely on large-scale joint training or extensive inference across multiple models, posing new challenges in computational efficiency. To this end, this paper proposes a new scheme, **S**parse **T**ask V**e**ctor Mixu**p** with **H**ypernetworks ($\\text{STEPH}$). Unlike previous ones, it efficiently absorbs generalizable knowledge from other cancers for the target via model merging: i) applying task vector mixup to each source-target pair and then ii) sparsely aggregating task vector mixtures to obtain an improved target model, driven by hypernetworks. Extensive experiments on 13 cancer datasets show that $\\text{STEPH}$ improves over cancer-specific learning and an existing knowledge transfer baseline by 5.14% and 2.01%, respectively. Moreover, it is a more efficient solution for learning prognostic knowledge from other cancers, without requiring large-scale joint training or extensive multi-model inference.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36727", "url": null, "sourceid": 30912, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38375, "uid": "dbe40ccd11b1fb869099e58e00076027", "name": "SCIEval: Evaluating and Benchmarking the Faithfulness of Scientific Image Generation and Interpretation with Large Multimodal Models", "authors": [{"id": 180965, "fullname": "Guanghui Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/180965?format=json", "institution": "Hunan University"}, {"id": 189741, "fullname": "Huan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189741?format=json", "institution": "Hunan University"}, {"id": 189742, "fullname": "Zhixue Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189742?format=json", "institution": "University of Sheffield, University of Sheffield"}, {"id": 185734, "fullname": "Tengfei Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/185734?format=json", "institution": "Hunan University"}, {"id": 189743, "fullname": "Kehan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189743?format=json", "institution": "Hunan University"}, {"id": 189744, "fullname": "Steffen Eger", "url": "http://cvpr.thecvf.com/api/miniconf/users/189744?format=json", "institution": "University of Technology Nuremberg"}, {"id": 168336, "fullname": "Zhihua Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/168336?format=json", "institution": "Jinan University"}], "abstract": "Scientific images often require accurate numerical representations and correct object attributes, making them differ significantly from real-life images. However, existing faithfulness metrics for image generation or interpretation with large multimodal models (LMMs) focus mostly on real-life images, which makes them ill-suited for scientific image evaluations. For this, we propose **SCIEval**, a novel and unified faithfulness metric specifically designed for **SC**ientific **I**mage **Eval**uations. First, to fully capture faithfulness, we introduce three key aspects: (i) Relevance (R) which measures the overall text-image correspondence, (ii) Accuracy (A) which examines the details of scientific objects, and (iii) Explainability (E) which reveals unfaithful elements in the generated content. Consequently, we generate aspect-aware scientific  text-image data to train three sub evaluators (SCIEval-R/A/E). Specifically, to train SCIEval-R and SCIEval-A, we propose a new SciCLIP framework, where we improve the scientific image perception of CLIP text and visual encoders via intra- and cross-modal contrastive learning. Meanwhile, to train SCIEval-E, we finetune a strong LMM using supervised rationale samples. Moreover, we present SCIEval-Bench, a human-annotated evaluation benchmark across 8 scientific domains, consisting of 3,000 scientific text-to-image samples from 4 LMMs (for image generation) and 3,000 scientific image captioning samples from 4 LMMs (for image interpretation). Experiments on SCIEval-Bench demonstrate that our SCIEval is more reliable and better correlated with human ratings compared to 24 competitors, including GPT-4o.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38375", "url": null, "sourceid": 32977, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36658, "uid": "698936639e27b2bc038e0d7b4ea464b2", "name": "Topology-aware Feature Propagation for Unsupervised Non-rigid Point Cloud Correspondence", "authors": [{"id": 138558, "fullname": "Haozhe Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/138558?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 76304, "fullname": "Rui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/76304?format=json", "institution": "King Abdullah University of Science and Technology"}, {"id": 134937, "fullname": "\u6b63\u5b9d \u738b", "url": "http://cvpr.thecvf.com/api/miniconf/users/134937?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 185578, "fullname": "Xinhao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185578?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 178280, "fullname": "Linjie Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/178280?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 144154, "fullname": "Tianyu Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/144154?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 185579, "fullname": "Xuan Ouyang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185579?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 89263, "fullname": "Jiaqi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89263?format=json", "institution": "Northwestern Polytechnical University"}], "abstract": "Unsupervised non-rigid point cloud correspondence aims to predict point-to-point correspondences without annotations. Existing methods leverage the spatial-relation-based feature propagation strategy that includes non-physical connections, which are sensitive to non-rigid deformation. To address this issue, we advocate to learn shape topology robust to non-rigid deformation, and propose the topology-aware feature propagation module integrated into a coarse-to-fine propagation and optimization pipeline. To extract point features robust to non-rigid deformation, we estimate keypoints as superpoints and encode superpoint features with topology weights, which learns reasonable topologies under non-rigid deformation. The vector quantization codebook is leveraged to enhance the original superpoint features with stored representative features across the dataset, improving feature robustness against shape variance. Robust point-wise correspondence is yielded after coarse-to-fine feature fusion and efficient test-time optimization. Extensive experiments on multiple benchmarks demonstrate the state-of-the-art performance of our method.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36658", "url": null, "sourceid": 33724, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38827, "uid": "a60a9cff8f17d6bde744c438a7c6bb9f", "name": "Personalized Longitudinal Medical Report Generation via Temporally-Aware Federated Adaptation", "authors": [{"id": 180839, "fullname": "He Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180839?format=json", "institution": "Hokkaido University"}, {"id": 180840, "fullname": "Ren Togo", "url": "http://cvpr.thecvf.com/api/miniconf/users/180840?format=json", "institution": "Hokkaido University"}, {"id": 190771, "fullname": "Takahiro Ogawa", "url": "http://cvpr.thecvf.com/api/miniconf/users/190771?format=json", "institution": "Hokkaido University"}, {"id": 190772, "fullname": "Kenji Hirata", "url": "http://cvpr.thecvf.com/api/miniconf/users/190772?format=json", "institution": "Hokkaido University Faculty of Medicine"}, {"id": 190773, "fullname": "Minghui Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190773?format=json", "institution": "Hokkaido University"}, {"id": 190774, "fullname": "Takaaki Yoshimura", "url": "http://cvpr.thecvf.com/api/miniconf/users/190774?format=json", "institution": "Hokkaido University"}, {"id": 190775, "fullname": "Hiroyuki Sugimori", "url": "http://cvpr.thecvf.com/api/miniconf/users/190775?format=json", "institution": "Hokkaido University"}, {"id": 190776, "fullname": "Noriko Nishioka", "url": "http://cvpr.thecvf.com/api/miniconf/users/190776?format=json", "institution": "Hokkaido University Faculty of Medicine"}, {"id": 190777, "fullname": "Yukie Shimizu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190777?format=json", "institution": "Hokkaido University"}, {"id": 190778, "fullname": "Kohsuke Kudo", "url": "http://cvpr.thecvf.com/api/miniconf/users/190778?format=json", "institution": "Hokkaido University, Tokyo Institute of Technology"}, {"id": 190779, "fullname": "Miki Haseyama", "url": "http://cvpr.thecvf.com/api/miniconf/users/190779?format=json", "institution": "Hokkaido University"}], "abstract": "Automatic medical report generation from multimodal longitudinal imaging is crucial for clinical diagnosis but remains challenging due to privacy constraints and evolving disease dynamics. While federated learning (FL) enables decentralized model training without data sharing, its extension to longitudinal medical modeling remains underexplored. Existing FL approaches overlook temporal non-stationarity across visits and patient-specific heterogeneity, causing unstable optimization and degraded report quality.We introduce Federated Temporal Adaptation (FTA), a new FL setting for longitudinal medical report generation, and propose FedTAR, a framework combining parameter-efficient personalization and meta-learned temporal aggregation. FedTAR employs a metadata-conditioned LoRA module that generates patient-specific adapters from Gaussian-mixture embeddings and a residual temporal aggregation scheme that adaptively weights client updates via first-order MAML, ensuring stable and efficient optimization under temporal heterogeneity.Experiments on J-MID (1M exams) and MIMIC-CXR demonstrate consistent improvements in linguistic accuracy, temporal coherence, and cross-site generalization, establishing FedTAR as a robust, privacy-preserving paradigm for federated multimodal longitudinal modeling.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38827", "url": null, "sourceid": 38693, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37537, "uid": "8de95d6a6ba0ca6c1eec90297345e0a6", "name": "Learning Latent Transmission and Glare Maps for Lens Veiling Glare Removal", "authors": [{"id": 187668, "fullname": "Xiaolong Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/187668?format=json", "institution": "Zhejiang University"}, {"id": 71455, "fullname": "Qi Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/71455?format=json", "institution": "Zhejiang University"}, {"id": 76465, "fullname": "Lei Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/76465?format=json", "institution": "Zhejiang University"}, {"id": 187669, "fullname": "Zongxi Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187669?format=json", "institution": "Zhejiang University"}, {"id": 89634, "fullname": "Kailun Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89634?format=json", "institution": "Karlsruher Institut f\u00fcr Technologie"}, {"id": 187670, "fullname": "Peixuan Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187670?format=json", "institution": "Zhejiang University"}, {"id": 187671, "fullname": "Jiacheng Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/187671?format=json", "institution": "Zhejiang University"}, {"id": 187672, "fullname": "Yao Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187672?format=json", "institution": "Zhejiang University"}, {"id": 187673, "fullname": "Yaoguang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/187673?format=json", "institution": "Zhejiang University"}, {"id": 75715, "fullname": "Ming-Hsuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75715?format=json", "institution": "University of California at Merced"}, {"id": 89635, "fullname": "Kaiwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89635?format=json", "institution": "Zhejiang University"}], "abstract": "Beyond the commonly recognized optical aberrations, the imaging performance of compact optical systems\u2014including single-lens and metalens designs\u2014is often further degraded by veiling glare caused by stray-light scattering from non-ideal optical surfaces and coatings, particularly in complex real-world environments.This compound degradation undermines traditional lens aberration correction yet remains underexplored. A major challenge is that conventional scattering models (e.g., for dehazing) fail to fit veiling glare due to its spatial-varying and depth-independent nature.Consequently, paired high-quality data are difficult to prepare via simulation, hindering application of data-driven veiling glare removal models.To this end, we propose VeilGen, a generative model that learns to simulate veiling glare by estimating its underlying optical transmission and glare maps in an unsupervised manner from target images, regularized by Stable Diffusion (SD)-based priors.VeilGen enables paired dataset generation with realistic compound degradation of optical aberrations and veiling glare, while also providing the estimated latent optical transmission and glare maps to guide the veiling glare removal process.We further introduce DeVeiler, a restoration network trained with a reversibility constraint, which utilizes the predicted latent maps to guide an inverse process of the learned scattering model.Extensive experiments on challenging compact optical systems demonstrate that our approach delivers superior restoration quality and physical fidelity compared with existing methods.These suggest that VeilGen reliably synthesizes realistic veiling glare, and its learned latent maps effectively guide the restoration process in DeVeiler.All code and datasets will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37537", "url": null, "sourceid": 39277, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39404, "uid": "7c7994618ab9ec08e3e913145fcbab5e", "name": "Universal Computational Aberration Correction: A Comprehensive Benchmark Analysis", "authors": [{"id": 187668, "fullname": "Xiaolong Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/187668?format=json", "institution": "Zhejiang University"}, {"id": 71455, "fullname": "Qi Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/71455?format=json", "institution": "Zhejiang University"}, {"id": 187672, "fullname": "Yao Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187672?format=json", "institution": "Zhejiang University"}, {"id": 76465, "fullname": "Lei Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/76465?format=json", "institution": "Zhejiang University"}, {"id": 192005, "fullname": "Zhonghua Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192005?format=json", "institution": "Zhejiang University"}, {"id": 89634, "fullname": "Kailun Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89634?format=json", "institution": "Karlsruher Institut f\u00fcr Technologie"}, {"id": 75489, "fullname": "Luc Van Gool", "url": "http://cvpr.thecvf.com/api/miniconf/users/75489?format=json", "institution": "INSAIT, Sofia Un. St. Kliment Ohridski"}, {"id": 89635, "fullname": "Kaiwei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89635?format=json", "institution": "Zhejiang University"}], "abstract": "Prevalent Computational Aberration Correction (CAC) methods are typically tailored to specific optical systems, leading to poor generalization and labor-intensive re-training for new lenses.Universal CAC paradigms trained on datasets encompassing diverse aberrations offer a promising solution to these challenges.However, efforts to develop universal CAC are still in their early stages due to the lack of a \\textit{comprehensive} benchmark that encompasses a sufficiently wide range of optical aberrations. Furthermore, it remains unclear \\textit{which} specific factors influence existing CAC methods and \\textit{how} these factors affect their performance.In this paper, we present comprehensive experiments and evaluations involving 24 image restoration and CAC algorithms, utilizing our newly proposed \\ourdataset, a large-scale benchmark constructed via automatic optical design. The Optical Degradation Evaluator (ODE) is introduced as a novel framework to objectively assess the difficulty of CAC tasks, offering credible quantification of optical aberrations and enabling reliable evaluation.Drawing on our comparative analysis, we identify three key factors -- \\textit{prior utilization}, \\textit{network architecture}, and \\textit{training strategy} -- that most significantly influence CAC performance, and further investigate their respective effects.We believe that our benchmark, dataset, and observations contribute foundational insights to related areas and lay the groundwork for future investigations.Benchmarks, codes, and \\textit{Zemax} files will be available upon acceptance of the paper.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39404", "url": null, "sourceid": 31580, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36507, "uid": "660bc121513a9d5442e97c5cef85786e", "name": "MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectioanl Blending with Hierarchical Densification", "authors": [{"id": 144702, "fullname": "Sangwoon Kwak", "url": "http://cvpr.thecvf.com/api/miniconf/users/144702?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 182953, "fullname": "WEEYOUN KWON", "url": "http://cvpr.thecvf.com/api/miniconf/users/182953?format=json", "institution": "Chung-Ang University"}, {"id": 156307, "fullname": "Jun Young Jeong", "url": "http://cvpr.thecvf.com/api/miniconf/users/156307?format=json", "institution": "Electronics and Telecommunications Research Institute"}, {"id": 185234, "fullname": "Geonho Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/185234?format=json", "institution": "Chung-Ang University"}, {"id": 126652, "fullname": "Won-Sik Cheong", "url": "http://cvpr.thecvf.com/api/miniconf/users/126652?format=json", "institution": "Electronics and Telecommunications Research Institute"}, {"id": 131019, "fullname": "Jihyong Oh", "url": "http://cvpr.thecvf.com/api/miniconf/users/131019?format=json", "institution": "Chung-Ang University"}], "abstract": "Recent advances in 4D Gaussian Splatting (4DGS) have extended the high-speed rendering capability of 3D Gaussian Splatting (3DGS) into the temporal domain, enabling real-time rendering of dynamic scenes.However, one of the major remaining challenges lies in modeling long-range motion-contained dynamic videos, where a na\u00efve extension of existing methods leads to severe memory explosion, temporal flickering, and failure to handle appearing or disappearing occlusions over time. To address these challenges, we propose a novel 4DGS framework characterized by an Anchor Relay-based Bidirectional Blending (ARBB) mechanism, named MoRel, which enables temporally consistent and memory-efficient modeling of long-range dynamic scenes.Our method progressively constructs locally canonical anchor spaces at key-frame time index and models inter-frame deformations at the anchor level, enhancing temporal coherence. By learning bidirectional deformations between KfA and adaptively blending them through learnable opacity control, our approach mitigates temporal discontinuities and flickering artifacts.We further introduce a Feature-variance-guided Hierarchical Densification (FHD) scheme that effectively densifies KfA's while keeping rendering quality, based on an assigned level of feature-variance.To effectively evaluate our model's capability to handle real-world long-range 4D motion, we newly compose long-range 4D motion-contained dataset, called SelfCap$_{\\text{LR}}$. It has larger average dynamic motion magnitude, captured at spatially wider spaces, compared to previous dynamic video datasets.Overall, our MoRel achieves temporally coherent and flicker-free long-range 4D reconstruction while maintaining bounded memory usage, demonstrating both scalability and efficiency in dynamic Gaussian-based representations. The code and project page will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36507", "url": null, "sourceid": 31782, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37657, "uid": "45050a298f1b880fcadb9085073c6b3f", "name": "GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning", "authors": [{"id": 187963, "fullname": "Jiajin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187963?format=json", "institution": "New York University"}, {"id": 187964, "fullname": "Dongzhe Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187964?format=json", "institution": "New York University"}, {"id": 187965, "fullname": "Chuanhao Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/187965?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 187966, "fullname": "Daochen Zha", "url": "http://cvpr.thecvf.com/api/miniconf/users/187966?format=json", "institution": "Airbnb"}, {"id": 152657, "fullname": "Qiaoyu Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/152657?format=json", "institution": "New York University Shanghai"}], "abstract": "Vision-Language Models (VLMs) have demonstrated remarkable capabilities in aligning and understanding multimodal signals, yet their potential to reason over structured data, where multimodal entities are connected through explicit relational graphs, remains largely underexplored. Unlocking this capability is crucial for real-world applications such as social networks, recommendation systems, and scientific discovery, where multimodal information is inherently structured.To bridge this gap, we present GraphVLM, a systematic benchmark designed to evaluate and harness the capabilities of VLMs for multimodal graph learning (MMGL). GraphVLM investigates three complementary paradigms for integrating VLMs with graph reasoning: (1) VLM-as-Encoder, which enriches graph neural networks through multimodal feature fusion; (2) VLM-as-Aligner, which bridges modalities in latent or linguistic space to facilitate LLM-based structured reasoning; and (3) VLM-as-Predictor, which directly employs VLMs as multimodal backbones for graph learning tasks.Extensive experiments across six datasets from diverse domains demonstrate that VLMs enhance multimodal graph learning via all three roles. Among these paradigms, VLM-as-Predictor achieves the most substantial and consistent performance gains, revealing the untapped potential of vision\u2013language models as a new foundation for multimodal graph learning.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37657", "url": null, "sourceid": 46653, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40232, "uid": "decf0953ee783d807db519f9cc4bb27f", "name": "Bootstrap Dynamic-Aware 3D Visual Representation for Scalable Robot Learning", "authors": [{"id": 183119, "fullname": "Qiwei Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183119?format=json", "institution": "Shenzhen University"}, {"id": 184956, "fullname": "Boyang Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/184956?format=json", "institution": "Shenzhen University"}, {"id": 193833, "fullname": "Minghao Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/193833?format=json", "institution": "Shenzhen University"}, {"id": 153443, "fullname": "Sitong Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153443?format=json", "institution": "Shenzhen University"}, {"id": 193834, "fullname": "Tao Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/193834?format=json", "institution": "Kunmimg University of Science and Technology"}, {"id": 193835, "fullname": "Yan Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/193835?format=json", "institution": null}, {"id": 193836, "fullname": "Yixuan Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/193836?format=json", "institution": "Central South University"}, {"id": 129884, "fullname": "Jiaming Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129884?format=json", "institution": "Shenzhen University"}, {"id": 126902, "fullname": "Renjing Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126902?format=json", "institution": "Hong Kong University of Science and Technology (Guangzhou)"}], "abstract": "Despite strong results on recognition and segmentation, current 3D visual pre-training methods often underperform on robotic manipulation. We attribute this gap to two factors: the lack of state\u2013action\u2013state dynamics modeling and the unnecessary redundancy of explicit geometric reconstruction. We introduce AFRO, a scalable self-supervised framework that learns dynamics-aware 3D representations directly from point clouds without action or label supervision. AFRO casts state prediction as a generative diffusion process and jointly models forward and inverse dynamics in a shared latent space to capture causal transition structure. To prevent feature leakage in action learning, we employ feature differencing and inverse-consistency supervision, improving the quality and stability of visual features. When combined with Diffusion Policy for control, AFRO substantially increases manipulation success rates across 16 simulated and 4 real-world tasks, outperforming existing pre-training approaches. The framework also scales favorably with data volume and task complexity. Qualitative visualizations indicate that AFRO learns semantically rich, discriminative features, offering an effective pre-training solution for dynamics-aware 3D representation learning in robotics.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40232", "url": null, "sourceid": 37030, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38697, "uid": "ffae6555097116f27108b748d65c71d1", "name": "CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization", "authors": [{"id": 180812, "fullname": "Weilin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/180812?format=json", "institution": "Xiamen University"}, {"id": 190485, "fullname": "Jiahao Rao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190485?format=json", "institution": "Xiamen University"}, {"id": 190486, "fullname": "Wenhao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190486?format=json", "institution": "Xiamen University"}, {"id": 155574, "fullname": "Xinyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155574?format=json", "institution": "Xiamen University"}, {"id": 127589, "fullname": "Xuan Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/127589?format=json", "institution": "Xiamen University"}, {"id": 88415, "fullname": "Liujuan Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88415?format=json", "institution": "Xiamen University"}], "abstract": "The creation of high-fidelity, customizable 3D indoor scene textures remains a significant challenge. While text-driven methods offer flexibility, they lack the precision for fine-grained, instance-level control, and often produce textures with insufficient quality, artifacts, and baked-in shading. To overcome these limitations, we introduce CustomTex, a novel framework for instance-level, high-fidelity scene texturing driven by reference images. CustomTex takes an untextured 3D scene and a set of reference images specifying the desired appearance for each object instance, and generates a unified, high-resolution texture map. The core of our method is a dual-distillation approach that separates semantic control from pixel-level enhancement. We employ semantic-level distillation, equipped with an instances cross attention, to ensure semantic plausibility and \"reference-instance\" alignment, and pixel-level distillation to enforce high visual fidelity. Both are unified within a Variational Score Distillation optimization framework. Experiments demonstrate that CustomTex achieves precise instance-level consistency with reference images and produces textures with superior sharpness, reduced artifacts, and minimal baked-in shading compared to state-of-the-art methods. Our work establishes a more direct and user-friendly path to high-quality, customizable 3D scene appearance editing.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38697", "url": null, "sourceid": 40596, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38390, "uid": "5abe018d038a5f636f45328b8950cc50", "name": "Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset", "authors": [{"id": 180650, "fullname": "Geon Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/180650?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 126122, "fullname": "Hangyul Yoon", "url": "http://cvpr.thecvf.com/api/miniconf/users/126122?format=json", "institution": "Korea Advanced Institute of Science and Technology (KAIST)"}, {"id": 189781, "fullname": "Hyunju Shin", "url": "http://cvpr.thecvf.com/api/miniconf/users/189781?format=json", "institution": null}, {"id": 189782, "fullname": "Hyunki Park", "url": "http://cvpr.thecvf.com/api/miniconf/users/189782?format=json", "institution": "Samsung Medical Center"}, {"id": 189783, "fullname": "Sang Seo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189783?format=json", "institution": "Samsung Medical Center"}, {"id": 85226, "fullname": "Eunho Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85226?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 189784, "fullname": "Edward Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189784?format=json", "institution": "Korea Advanced Institute of Science and Technology"}], "abstract": "The applicability of current lesion segmentation models for chest X-rays (CXRs) has been limited both by a small number of target labels and the reliance on long, detailed expert-level text inputs, creating a barrier to practical use. To address these limitations, we introduce a new paradigm: instruction-guided lesion segmentation (ILS), which is designed to segment diverse lesion types based on simple, user-friendly instructions. Under this paradigm, we construct MIMIC-ILS, the first large-scale instruction-answer dataset for CXR lesion segmentation, using our fully automated multimodal pipeline that generates annotations from chest x-ray images and their corresponding reports.MIMIC-ILS contains 1.1M instruction-answer pairs derived from 192K images and 91K unique segmentation masks, covering seven major lesion types. To empirically demonstrate its utility, we introduce ROSALIA, a vision-language model fine-tuned on MIMIC-ILS.ROSALIA can segment diverse lesions and provide textual explanations in response to user instructions. The model achieves high segmentation and textual accuracy in our newly proposed task, highlighting the effectiveness of our pipeline and the value of MIMIC-ILS as a foundational resource for pixel-level CXR lesion grounding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38390", "url": "https://github.com/checkoneee/ROSALIA", "sourceid": 45833, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38575, "uid": "c2e4a009b0acaea484f4384134824c69", "name": "HiFi-Brep: High-Fidelity B-Rep Latent Representation and Robust Generation", "authors": [{"id": 182352, "fullname": "Junhao Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/182352?format=json", "institution": "Zhejiang University"}, {"id": 190183, "fullname": "Chenqi Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/190183?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 190184, "fullname": "PuFan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190184?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 190185, "fullname": "Jiaying Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190185?format=json", "institution": null}, {"id": 190186, "fullname": "Yusheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190186?format=json", "institution": "College of Computer Science and Technology, Zhejiang University"}, {"id": 190187, "fullname": "Feiwei Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190187?format=json", "institution": "Hangzhou Dianzi University"}, {"id": 190188, "fullname": "Meie Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190188?format=json", "institution": "Guangzhou University"}, {"id": 85731, "fullname": "Kun Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/85731?format=json", "institution": "Zhejiang University"}], "abstract": "Boundary representation (B-rep) generation is a fundamental task in Computer-Aided Design (CAD), enabling automated modeling of 3D geometries. However, the direct synthesis of valid and high-quality B-reps remains a major challenge.Existing deep generative methods suffer from brittle representation and generation paradigms, due to: (1) representation noise from padding variable-length sequences and feature contamination between distant primitives, and (2) fragile generation pipelines marked by cascaded decoding error propagation and a train-inference mismatch from deferred validity enforcement.To address this, we propose HiFi-Brep. Our core insight is that robust, high-validity generation requires: first, building upon a compact and high-fidelity latent representation; and second, reformulating validity constraints as differentiable inductive biases within a single-stage generation process, enabling mutual guidance between geometry and topology.We implement this through a topology-aware encoder that yields a high-fidelity latent representation by eliminating padding noise via query-based pooling and preventing feature contamination with topology-guided attention. Our single-stage decoder then jointly generates geometry and topology, embedding core manifold constraints into a differentiable learning objective to ensure topological validity and sidestep cascaded errors. The resulting latent space supports both unconditional synthesis and conditional generation from various inputs, such as class labels, point clouds, or images.Experiments demonstrate that HiFi-Brep outperforms state-of-the-art approaches in both model validity and geometric quality.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38575", "url": null, "sourceid": 32864, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36299, "uid": "b9e8886808531456cd2f4bb4e718e22b", "name": "Prototype-as-Prompt:  Multimodal Sentiment Prototypes Endowing Large Language Models the Capability to Perform Multimodal Sentiment Analysis", "authors": [{"id": 184728, "fullname": "Xianbing Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184728?format=json", "institution": "Jiangnan University"}, {"id": 178551, "fullname": "Lan Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/178551?format=json", "institution": "Jiangnan University"}, {"id": 184729, "fullname": "Heng-yang Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184729?format=json", "institution": "Jiangnan University"}, {"id": 184730, "fullname": "Buzhou Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184730?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Multimodal Sentiment Analysis (MSA) aims to integrate textual, acoustic, and visual information to predict sentiment polarity. With the emergence of Large Language Models (LLMs), existing studies commonly employ learnable queries to compress audio\u2013visual representations and feed them as soft prompts into LLMs for MSA. However, due to the implicit learning mechanism of the learnable queries, these learnable queries lack explicit guidance regarding how each query encodes sentiment semantics. To address this issue, we propose a prototype-as-prompt framework that maps audio\u2013visual representations into a fixed set of multimodal sentiment prototypes. These prototypes are then used as soft prompts to guide the LLM in performing MSA. Concretely, we first compress both textual and non-textual features into multimodal prototypes using a resampling-based strategy. We further introduce a sentiment-aware prototype learning that explicitly binds multimodal prototypes with sentiment semantics. To ensure both cross-modal consistency and intra-modal diversity of multimodal sentiment prototypes, we design a cross-modal prototype alignment constraint and a distance-weighted prototype diversity constraint. Extensive experiments across three LLMs and four benchmark datasets show that PaP achieves superior performance with only 0.09\\%\u20130.26\\% of trainable parameters, highlighting its effectiveness and parameter efficiency.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36299", "url": null, "sourceid": 36394, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38141, "uid": "1bf50b1aa5574ec9c56a8f2af4039ed7", "name": "Life-IQA: Boosting Blind Image Quality Assessment through GCN-enhanced Layer Interaction and MoE-based Feature Decoupling", "authors": [{"id": 180508, "fullname": "Tang Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/180508?format=json", "institution": "Shang Hai Jiao Tong University"}, {"id": 154576, "fullname": "Huiyu Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/154576?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 189147, "fullname": "Guoquan Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189147?format=json", "institution": "Beijing University of Chemical Technology"}, {"id": 185456, "fullname": "Jianbo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185456?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 189148, "fullname": "Jie Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189148?format=json", "institution": "Beijing University of Chemical Technology"}, {"id": 189149, "fullname": "Liang Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189149?format=json", "institution": null}], "abstract": "Blind image quality assessment (BIQA) plays a crucial role in evaluating and optimizing visual experience. Most existing BIQA approaches fuse shallow and deep features extracted from backbone networks, while overlooking the unequal contributions to quality prediction. Moreover, while various vision encoder backbones are widely adopted in BIQA, the effective quality decoding architectures remain underexplored. To address these limitations, this paper investigates the contributions of shallow and deep features to BIQA, and proposes a effective quality feature decoding framework via GCN-enhanced \\underline{l}ayer\\underline{i}nteraction and MoE-based \\underline{f}eature d\\underline{e}coupling, termed \\textbf{(Life-IQA)}. Specifically, the GCN-enhanced layer interaction module utilizes the GCN-enhanced deepest-layer features as query and the penultimate-layer features as key, value, then performs cross-attention to achieve feature interaction. Moreover, a MoE-based feature decoupling module is proposed to decouple fused representations though different experts specialized for specific distortion types or quality dimensions. Extensive experiments demonstrate that Life-IQA shows more favorable balance between accuracy and cost than a vanilla Transformer decoder and achieves state-of-the-art performance on multiple BIQA benchmarks. The code will be released upon the publication.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38141", "url": null, "sourceid": 38742, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36459, "uid": "bf69bbe59f06be26b96d7efa1b0f0ffc", "name": "DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Models", "authors": [{"id": 181097, "fullname": "Zhou Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/181097?format=json", "institution": "University of Science and Technology of China"}, {"id": 185104, "fullname": "Shida Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185104?format=json", "institution": "University of Science and Technology of China"}, {"id": 185105, "fullname": "YongXiang Hua", "url": "http://cvpr.thecvf.com/api/miniconf/users/185105?format=json", "institution": "University of Science and Technology of China"}, {"id": 127183, "fullname": "Haoyu Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/127183?format=json", "institution": "Tencent Youtu Lab"}, {"id": 128045, "fullname": "Linli Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128045?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Multimodal Large Language Models have achieved impressive performance on a variety of vision-language tasks, yet their fine-grained visual perception and precise spatial reasoning remain limited. In this work, we introduce DiG (Differential Grounding), a novel proxy task framework where MLLMs learn fine-grained perception by identifying and localizing all differences between similar image pairs without prior knowledge of their number. To support scalable training, we develop an automated 3D rendering-based data generation pipeline that produces high-quality paired images with fully controllable discrepancies. To address the sparsity of difference signals, we further employ curriculum learning that progressively increases complexity from single to multiple differences, enabling stable optimization. Extensive experiments demonstrate that DiG significantly improves model performance across a variety of visual perception benchmarks and that the learned fine-grained perception skills transfer effectively to standard downstream tasks, including RefCOCO, RefCOCO+, RefCOCOg, and general multimodal perception benchmarks. Our results highlight differential grounding as a scalable and robust approach for advancing fine-grained visual reasoning in MLLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36459", "url": null, "sourceid": 37498, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37306, "uid": "2c563887fb511ed0c1d6aeacf603c2de", "name": "TableMix: Enhancing Multimodal Table Reasoning in MLLMs from a Data-Centric Perspective", "authors": [{"id": 107615, "fullname": "Chaohu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/107615?format=json", "institution": "University of Science and Technology of China"}, {"id": 185104, "fullname": "Shida Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185104?format=json", "institution": "University of Science and Technology of China"}, {"id": 187129, "fullname": "Yubo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187129?format=json", "institution": "University of Science and Technology of China"}, {"id": 128045, "fullname": "Linli Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128045?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Recent advances in Multimodal Large Language Models (MLLMs) have enabled promising progress in table reasoning from visual table inputs. Despite their ability to capture rich visual cues such as color and layout, MLLMs still underperform compared to text-only models.We argue that a major limitation lies in the pre-training process, which inadvertently weakens the model\u2019s intrinsic reasoning ability and consequently hinders the effectiveness of reinforcement fine-tuning on table reasoning tasks.In this paper, we introduce TableMix, a novel framework that tackles this challenge from a data-centric perspective. At the core of TableMix is a principled data mixing strategy. Specifically, TableMix constructs a hybrid dataset that combines: (1) multimodal table reasoning data to improve task-specific reasoning, (2) text-only mathematical reasoning data to revive the model\u2019s logical competence, and (3) simple multimodal perception data to preserve visual grounding.Recognizing the non-uniform difficulty of mixed data, we further propose a Difficulty-Aware Reward Shaping (DRS) mechanism, which enables the Group Relative Policy Optimization (GRPO) algorithm to adaptively reward concise reasoning for easy problems while encouraging more elaborate reasoning for complex ones, thereby reducing redundant computation and errors.Extensive experiments show that TableMix markedly enhances the reasoning ability of MLLMs, outperforming strong multimodal baselines and even rivaling state-of-the-art text-only models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37306", "url": null, "sourceid": 40048, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36943, "uid": "ba71bc595c120fca125dce3352c9ea5b", "name": "FlowMotion: Training-Free Flow Guidance for Video Motion Transfer", "authors": [{"id": 180301, "fullname": "Zhen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180301?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 186275, "fullname": "Youcan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186275?format=json", "institution": "Zhejiang University"}, {"id": 84768, "fullname": "Jun Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/84768?format=json", "institution": "Zhejiang University"}, {"id": 90895, "fullname": "Long Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/90895?format=json", "institution": "HKUST"}], "abstract": "Video motion transfer aims to generate a target video that inherits motion patterns from a source video while rendering new scenes. Existing training-free approaches focus on constructing motion guidance based on the intermediate outputs of pre-trained T2V models, which results in heavy computational overhead and limited flexibility. In this paper, we present FlowMotion, a novel training-free framework that enables efficient and flexible motion transfer by directly leveraging the predicted outputs of flow-based T2V models. Our key insight is that early latent predictions inherently encode rich temporal information. Motivated by this, we propose flow guidance, which extracts motion representations based on latent predictions to align motion patterns between source and generated videos. We further introduce a velocity regularization strategy to stabilize optimization and ensure smooth motion evolution. By operating purely on model predictions, FlowMotion achieves superior time and resource efficiency as well as competitive performance compared with state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36943", "url": null, "sourceid": 46625, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38632, "uid": "59c8e2db45e3dc2a4aa5f460c6a308f3", "name": "Dropping Anchor and Spherical Harmonics for Gaussian Splatting", "authors": [{"id": 190347, "fullname": "Shuangkang Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190347?format=json", "institution": "Beihang University"}, {"id": 190348, "fullname": "I-Chao Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190348?format=json", "institution": "The University of Tokyo"}, {"id": 86059, "fullname": "Xuanyang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86059?format=json", "institution": "Megvii Technology Inc."}, {"id": 190349, "fullname": "Zesheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190349?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 190350, "fullname": "Yufeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190350?format=json", "institution": "Beihang University"}, {"id": 190351, "fullname": "Wenrui Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/190351?format=json", "institution": "Beihang University"}, {"id": 87502, "fullname": "Gang Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87502?format=json", "institution": "Tencent"}, {"id": 190352, "fullname": "Takeo Igarashi", "url": "http://cvpr.thecvf.com/api/miniconf/users/190352?format=json", "institution": "The University of Tokyo, Tokyo Institute of Technology"}], "abstract": "Recent 3D Gaussian Splatting (3DGS) dropout methods address overfitting under sparse-view conditions by randomly nullifying Gaussian opacities. However, we identify a neighbor compensation effect in these approaches: dropped Gaussians are often compensated by their neighbors, weakening the intended regularization. Moreover, these methods overlook the contribution of high-degree spherical harmonic coefficients (SH) to overfitting. To address these issues, we propose DropAnSH-GS, a novel anchor-based dropout strategy. Rather than dropping Gaussians independently, our method randomly selects certain Gaussians as anchors and simultaneously removes their spatial neighbors. This effectively disrupts local redundancies and encourages the model to learn more robust, globally informed representations. Furthermore, we extend the dropout to color attributes by randomly dropping higher-degree SH coefficients to concentrate appearance information in lower-degree SH. This strategy further mitigates overfitting and enables flexible post-training model compression via SH truncation. Experimental results demonstrate that DropAnSH-GS substantially outperforms existing dropout methods with negligible computational overhead and can be readily integrated into various 3DGS variants to enhance their performances.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38632", "url": null, "sourceid": 44979, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36546, "uid": "6499e19d47d7cbd3302a26fdb40d0b41", "name": "Parameterized Prompt for Incremental Object Detection", "authors": [{"id": 180627, "fullname": "Zijia An", "url": "http://cvpr.thecvf.com/api/miniconf/users/180627?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 105398, "fullname": "boyu diao", "url": "http://cvpr.thecvf.com/api/miniconf/users/105398?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences."}, {"id": 185312, "fullname": "RuiQi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185312?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 185313, "fullname": "Libo Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185313?format=json", "institution": "Institute of Computing Technology"}, {"id": 130183, "fullname": "Chuanguang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130183?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 185314, "fullname": "Fei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185314?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 130221, "fullname": "Zhulin An", "url": "http://cvpr.thecvf.com/api/miniconf/users/130221?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 105588, "fullname": "Yongjun Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/105588?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}], "abstract": "Recent studies have demonstrated that incorporating trainable prompts into pretrained models enables effective incremental learning. However, the application of prompts in incremental object detection (IOD) remains underexplored. Our study reveals that existing prompts-pool-based approaches assume disjoint class sets across incremental tasks, which are unsuitable for IOD as they overlook the inherent co-occurrence phenomenon in detection. In co-occurring scenarios, unlabeled objects from previous tasks may appear in current task images, leading to confusion in prompts pool. In this paper, we hold that prompt structures should exhibit adaptive consolidation properties across tasks, with constrained updates to prevent confusion and catastrophic forgetting. Motivated by this, we introduce Parameterized Prompts for Incremental Object Detection (P$^2$IOD). Leveraging neural networks global evolution properties, P$^2$IOD employs networks as the parameterized prompts to adaptively consolidate knowledge across tasks. To constrain prompts structure updates, P$^2$IOD further engages a parameterized prompts fusion strategy. Extensive experiments on PASCAL VOC2007 and MS COCO datasets demonstrate that P$^2$IOD's effectiveness in IOD and achieves the state-of-the-art performance among existing baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36546", "url": null, "sourceid": 32769, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38124, "uid": "24a0f25cbb04c099f7ad15dfb0447ba1", "name": "LiteSense: Lifting Lightweight ToF with RGB for High-Resolution Metric Depth Estimation", "authors": [{"id": 181763, "fullname": "Yusheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181763?format=json", "institution": "Tongji University"}, {"id": 189099, "fullname": "Lizhi LOU", "url": "http://cvpr.thecvf.com/api/miniconf/users/189099?format=json", "institution": "Tongji University"}, {"id": 189100, "fullname": "Yan Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189100?format=json", "institution": "Tongji University"}, {"id": 189101, "fullname": "Zekai Miao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189101?format=json", "institution": "Tongji University"}, {"id": 189102, "fullname": "shaoming zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189102?format=json", "institution": null}, {"id": 189103, "fullname": "Jianmei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189103?format=json", "institution": "Tongji University"}], "abstract": "Metric depth estimation aims to recover depth maps with absolute scale, high resolution, and cross-scene consistency from visual observations. Existing approaches either rely on large-scale models or costly sensors to preserve metric accuracy and generalization, both ill-suited to resource-constrained deployment. In this paper, we propose **LiteSense**, a lightweight RGB-ToF fusion framework that leverages compact normalized histogram (CNH) signals together with RGB cues to achieve efficient and reliable metric depth estimation. Specifically, LiteSense leverages a U-Net-style encoder-decoder that forms an RGB-D input by concatenating RGB with upsampled ToF depth, providing explicit metric priors. To address resolution disparity and recover fine details, we introduce the **Patch-wise CNH Spatial Injection (PCSI)** module, which leverages zone-wise histogram measurements via cross-attention to guide high-level feature fusion. Extensively evaluated on NYUv2 and SUN RGB-D, LiteSense consistently outperforms monocular baselines and DELTAR with substantially lower computational cost, and demonstrates promising zero-shot generalization. We further introduce **THDR3K**, the first indoor RGB-ToF-CNH dataset, where LiteSense achieves real-world accuracy comparable to\u2014and in challenging cases surpassing\u2014Intel RealSense. All the relevant source codes and the collected dataset will be released.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38124", "url": null, "sourceid": 46577, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37869, "uid": "fdd877d9c3ca4eaf91d22d2b2ba54f52", "name": "Anchoring the Mind of Multimodal Reasoners: Cognitive Bias as a Vector for Jailbreak Attacks", "authors": [{"id": 181554, "fullname": "Linhua Cong", "url": "http://cvpr.thecvf.com/api/miniconf/users/181554?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 188446, "fullname": "Bingrui Sima", "url": "http://cvpr.thecvf.com/api/miniconf/users/188446?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 129383, "fullname": "Kun He", "url": "http://cvpr.thecvf.com/api/miniconf/users/129383?format=json", "institution": "Huazhong University of Sceince and Technology"}], "abstract": "Multimodal Large Reasoning Models (MLRMs) exhibit remarkable performance on complex tasks by incorporating explicit multi-step reasoning. However, this capability also introduces new security vulnerabilities. Existing jailbreak studies largely overlook Cognitive-level weaknesses embedded in the reasoning process itself. In this work, we uncover a critical cognitive bias in MLRMs: the anchoring effect, where safety judgments are disproportionately influenced by the first piece of information received\u2014the anchor. Building on this finding, we propose the Reasoning-chain Anchoring Attack (RA-Attack), a novel jailbreak framework that fully exploits this vulnerability. RA-Attack employs a cross-modal safe anchor, whose core component is a structured visual mind map. This structured format provides the model with a pre-established,  safety-biased reasoning chain that subtly induces it to rationalize and execute subsequent harmful intent. Extensive experiments across seven leading closed- and open-source MLRMs demonstrate the effectiveness of RA-Attack, achieving state-of-the-art jailbreak success rates\u201492% on Gemini-2.5-Pro and 82% on GPT-4o. Our findings reveal that cognitive biases can be systematically exploited to manipulate multimodal reasoning chains, establishing cognitive security as a critical and underexplored frontier in AI safety research. Warning: This paper contains unsafe examples.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37869", "url": null, "sourceid": 33369, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39143, "uid": "317d5338c2dd1182bd094370a1121ee4", "name": "Improving Adversarial Transferability with Local Perturbation Augmentation", "authors": [{"id": 180728, "fullname": "Jianxun Mi", "url": "http://cvpr.thecvf.com/api/miniconf/users/180728?format=json", "institution": "Chongqing University of Posts and Telecommunications"}, {"id": 191442, "fullname": "Xuanhui Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/191442?format=json", "institution": "Chongqing University of Post and Telecommunications"}, {"id": 86107, "fullname": "Weisheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86107?format=json", "institution": "Chongqing University of Posts and Telecommunications"}], "abstract": "Adversarial examples expose fundamental vulnerabilities within deep neural networks, and their transferability highlights shared weaknesses across diverse models. Existing mainstream attack methods often rely on iterative processes with various strategies to improve transferability, but the limited knowledge of the target model restricts the success of these approaches. In this paper, we reveal that the iterative optimization process tends to over-specialize adversarial perturbations to the local gradient characteristics of the surrogate model, thereby hindering their transferability to other models. To address this limitation, we propose a novel attack method called Local Perturbation Augmentation Attack (LPAA). The key innovation of our approach lies in constructing multiple augmented local subspaces during each iteration, which steers perturbation updates towards a more generalizable direction, effectively reducing over-reliance on the surrogate model. Additionally, to improve the initial performance and overcome sensitivity to initial perturbation, we introduce a dedicated perturbation initialization strategy that ensures the optimization process starts from a direction with greater ability for transferability. Compared with existing random neighborhood sampling strategies, the LPAA serves as an effective approach that leverages the directional characteristics of perturbations to overcome their limitations. Extensive experiments on CNNs and ViTs demonstrate that LPAA consistently generates highly transferable adversarial examples, significantly surpassing the performance of state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39143", "url": null, "sourceid": 33607, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39312, "uid": "d0cefa43d15f7e4b74ae21595155ec91", "name": "Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation", "authors": [{"id": 173563, "fullname": "Shihan Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/173563?format=json", "institution": "Vanderbilt University"}, {"id": 77238, "fullname": "Nilesh Kulkarni", "url": "http://cvpr.thecvf.com/api/miniconf/users/77238?format=json", "institution": "Netflix"}, {"id": 191821, "fullname": "David Hyde", "url": "http://cvpr.thecvf.com/api/miniconf/users/191821?format=json", "institution": "Vanderbilt University"}, {"id": 191822, "fullname": "Dmitriy Smirnov", "url": "http://cvpr.thecvf.com/api/miniconf/users/191822?format=json", "institution": "Reve"}], "abstract": "Fine-tuning large-scale text-to-video diffusion models to add new generative controls, such as those over physical camera parameters (e.g., shutter speed or aperture), typically requires vast, high-fidelity datasets that are difficult to acquire.In this work, we propose a data-efficient fine-tuning strategy that learns these controls from sparse, low-quality synthetic data. We show that not only does fine-tuning on such simple data enable the desired controls, it actually yields superior results to models fine-tuned on photorealistic \"real\" data. Beyond demonstrating these results, we provide a framework that justifies this phenomenon both intuitively and quantitatively.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39312", "url": null, "sourceid": 31331, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36702, "uid": "0ec2ae1d04af7898b86235766021c19f", "name": "Local Motion Matters: A Deconstruct\u2013Recompose Paradigm for Reinforcement Learning Pre-training from Videos", "authors": [{"id": 180601, "fullname": "Jinwen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180601?format=json", "institution": "Beijing Jiaotong University"}, {"id": 185678, "fullname": "Youfang Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185678?format=json", "institution": "Beijing Jiaotong University"}, {"id": 185679, "fullname": "Xiaobo Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185679?format=json", "institution": "Beijing Jiaotong University"}, {"id": 185680, "fullname": "Shuo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185680?format=json", "institution": "Beijing Jiaotong University"}, {"id": 185681, "fullname": "Kai Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/185681?format=json", "institution": "Beijing Jiaotong University"}], "abstract": "Pre-training on large-scale videos to improve reinforcement learning efficiency is promising yet remains challenging. Existing methods typically treat the agent as an indivisible entity, modeling motion patterns globally. Such global modeling is tightly coupled with the morphology, hindering transfer across domains. In contrast, despite the vast disparity in global motions, the local components exhibit similar motion patterns across different agents. Building on this insight, we propose a novel Deconstruct\u2013Recompose Paradigm (DRP) for learning transferable local motion representations. Specifically, in the Deconstruct phase, we identify multiple local points and track their frame-wise motions, defining each as an Atomic Action. We introduce a Dual-Attention Encoder (DAE) to learn local motion representations from these Atomic Actions, capturing their spatiotemporal relationships. In the Recompose phase, we compose local motion representations with a learnable Motion Aggregation Token '[MAT]' via latent dynamics model learning. Additionally, an adapter bridges local motion and downstream action-specific dynamics to accelerate policy learning. Extensive experiments demonstrate that our method effectively transfers to diverse robotic control and manipulation tasks, significantly improving sample efficiency and performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36702", "url": null, "sourceid": 31198, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39859, "uid": "5bffb1cbed6ed79589cf475923419d26", "name": "PECCVAI : Overcoming the Brittleness of AI Image Watermarking Under Visual Paraphrasing Attacks", "authors": [{"id": 192998, "fullname": "Shreyas Dixit", "url": "http://cvpr.thecvf.com/api/miniconf/users/192998?format=json", "institution": "Vishwakarma Institute of Information Technology"}, {"id": 147293, "fullname": "Ashhar Aziz", "url": "http://cvpr.thecvf.com/api/miniconf/users/147293?format=json", "institution": "Indraprastha Institute of Information Technology, Delhi"}, {"id": 181421, "fullname": "Shashwat Bajpai", "url": "http://cvpr.thecvf.com/api/miniconf/users/181421?format=json", "institution": "BITS Pilani, Birla Institute of Technology and Science"}, {"id": 131356, "fullname": "Vasu Sharma", "url": "http://cvpr.thecvf.com/api/miniconf/users/131356?format=json", "institution": "Meta AI/ CMU"}, {"id": 154985, "fullname": "Aman Chadha", "url": "http://cvpr.thecvf.com/api/miniconf/users/154985?format=json", "institution": "Amazon Web Services"}, {"id": 192999, "fullname": "Vinija Jain", "url": "http://cvpr.thecvf.com/api/miniconf/users/192999?format=json", "institution": "Facebook"}, {"id": 193000, "fullname": "Amitava Das", "url": "http://cvpr.thecvf.com/api/miniconf/users/193000?format=json", "institution": "University of of South Carolina"}], "abstract": "By 2026, up to 90% of online content may be synthetically generated, raising serious concerns about the spread of AI-generated disinformation. Policymakers and companies alike are responding California\u2019s Bill AB 321 mandates watermarking of AI-generated media, while firms like Meta and Google are deploying watermarking systems to curb the misuse of generative models. Yet, watermarking techniques remain fragile. In this work, we introduce and analyze a novel vulnerability: the visual paraphrase attack, a generative method capable of stripping both visible and invisible watermarks from AI-generated images. The attack operates in two steps: first, a caption is generated for an image. Then, the image and its caption are passed to a diffusion-based text-to-image system, producing a visually similar but watermark free image. Our empirical evaluation demonstrates that visual paraphrasing reliably removes watermarks while preserving the original image\u2019s semantic content, revealing a fundamental weakness in current watermarking systems. To address this, we introduce PECCAVI, the first watermarking method explicitly designed to withstand visual paraphrase attacks. PECCAVI embeds robust, distortion-free watermarks within semantically stable regions of the image, which we term Non-Melting Points (NMPs). The method uses multi-channel frequency domain watermarking and incorporates noisy burnishing to obfuscate watermark locations and resist reverse engineering. PECCAVI is model-agnostic and significantly more durable than existing approaches. We release the first visual paraphrase benchmark dataset and open-source all code and resources1, offering a foundation for future work on robust watermarking in the age of generative AI.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39859", "url": null, "sourceid": 37050, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39703, "uid": "7666534473231043db00bea461f55d33", "name": "NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing", "authors": [{"id": 192683, "fullname": "Tianlin Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192683?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 192684, "fullname": "Jiayi Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/192684?format=json", "institution": "nanjing university"}, {"id": 192685, "fullname": "Chenpu Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192685?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 130504, "fullname": "Zhengyao Lv", "url": "http://cvpr.thecvf.com/api/miniconf/users/130504?format=json", "institution": "University of Hong Kong"}, {"id": 89377, "fullname": "Binxin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89377?format=json", "institution": "University of Science and Technology of China"}, {"id": 131667, "fullname": "Hubery Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/131667?format=json", "institution": "Tencent"}, {"id": 86641, "fullname": "Chen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86641?format=json", "institution": "WeChat, Tencent"}, {"id": 185164, "fullname": "Jing LYU", "url": "http://cvpr.thecvf.com/api/miniconf/users/185164?format=json", "institution": "Wechat Team"}, {"id": 153687, "fullname": "Caifeng Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153687?format=json", "institution": "Nanjing University"}, {"id": 89111, "fullname": "Chenyang Si", "url": "http://cvpr.thecvf.com/api/miniconf/users/89111?format=json", "institution": "Nanyang Technological University  Singapore"}], "abstract": "Recent video editing models have achieved impressive results, but most still require large-scale paired datasets. Collecting such naturally aligned pairs at scale remains highly challenging and constitutes a critical bottleneck, especially for local video editing data. Existing workarounds transfer image editing to video through global motion control for pair-free video editing, but such designs struggle with background and temporal consistency. In this paper, we propose NOVA: Sparse Control \\& Dense Synthesis, a new framework for video editing. Specifically, the sparse branch provides semantic guidance through user-edited keyframes distributed across the video, and the dense branch continuously incorporates motion and texture information from the original video to maintain high fidelity and coherence. Moreover, we introduce a degradation-simulation training strategy that enables the model to learn motion reconstruction and temporal consistency by training on artificially degraded videos, thus eliminating the need for paired data. Our extensive experiments demonstrate that NOVA outperforms existing approaches in edit fidelity, motion preservation, and temporal coherence.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39703", "url": null, "sourceid": 32307, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39433, "uid": "d02505b02ba35ec882c23008023343f9", "name": "PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation", "authors": [{"id": 180408, "fullname": "Onkar Susladkar", "url": "http://cvpr.thecvf.com/api/miniconf/users/180408?format=json", "institution": "UIUC"}, {"id": 187955, "fullname": "Tushar Prakash", "url": "http://cvpr.thecvf.com/api/miniconf/users/187955?format=json", "institution": "Sony Research India"}, {"id": 152648, "fullname": "Adheesh Juvekar", "url": "http://cvpr.thecvf.com/api/miniconf/users/152648?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 152647, "fullname": "Kiet A. Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/152647?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 187954, "fullname": "Dong-Hwan Jang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187954?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 192068, "fullname": "Inderjit S Dhillon", "url": "http://cvpr.thecvf.com/api/miniconf/users/192068?format=json", "institution": "Google; University of Texas, Austin"}, {"id": 152651, "fullname": "Ismini Lourentzou", "url": "http://cvpr.thecvf.com/api/miniconf/users/152651?format=json", "institution": "University of Illinois Urbana - Champaign"}], "abstract": "Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatial-temporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video instance segmentation, temporal action localization, and video understanding, scaling robustly to up to 4K/8K resolutions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39433", "url": "https://plan-lab.github.io/projects/pyratok/", "sourceid": 36207, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37236, "uid": "2546165141cc6ca7f363a38c5f1c382b", "name": "Dynamic Visual SLAM using a General 3D Prior", "authors": [{"id": 130711, "fullname": "Xingguang Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/130711?format=json", "institution": "Rheinische Friedrich-Wilhelms Universit\u00e4t Bonn"}, {"id": 186985, "fullname": "Liren Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186985?format=json", "institution": "University of Bonn"}, {"id": 186986, "fullname": "Marija Popovic", "url": "http://cvpr.thecvf.com/api/miniconf/users/186986?format=json", "institution": "Delft University of Technology"}, {"id": 70118, "fullname": "Jens Behley", "url": "http://cvpr.thecvf.com/api/miniconf/users/70118?format=json", "institution": "University of Bonn"}, {"id": 70838, "fullname": "Cyrill Stachniss", "url": "http://cvpr.thecvf.com/api/miniconf/users/70838?format=json", "institution": "University of Bonn"}], "abstract": "Reliable incremental estimation of camera poses and 3D reconstruction is key to enable various applications including robotics, interactive visualization, and augmented reality. However, this task is particularly challenging in dynamic natural environments, where scene dynamics can severely deteriorate camera pose estimation accuracy. In this work, we propose a novel monocular visual SLAM system that can robustly estimate camera poses in dynamic scenes. To this end, we leverage the complementary strengths of geometric patch-based online bundle adjustment and recent feed-forward reconstruction models. Specifically, we propose a feed-forward reconstruction model to precisely filter out dynamic regions, while also utilizing its depth prediction to enhance the robustness of the patch-based visual SLAM. By aligning depth prediction with estimated patches from bundle adjustment, we robustly handle the inherent scale ambiguities of the batch-wise application of the feed-forward reconstruction model. Extensive experiments on multiple tasks show the superior performance of our proposed method compared to state-of-the-art approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37236", "url": null, "sourceid": 46036, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39055, "uid": "f3498e568e0bb45515779d6bd47e20f4", "name": "FreqEdit: Preserving High-Frequency Features for Robust Multi-Turn Image Editing", "authors": [{"id": 191265, "fullname": "Yucheng Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191265?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 191266, "fullname": "Jiajun Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191266?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 181789, "fullname": "Kaiqian Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/181789?format=json", "institution": "Sun Yat-sen University"}, {"id": 191267, "fullname": "Baoquan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191267?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 131555, "fullname": "Haoran Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/131555?format=json", "institution": "Lingnan University"}, {"id": 86172, "fullname": "Wei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86172?format=json", "institution": "Tencent AI Lab"}, {"id": 131569, "fullname": "Qing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/131569?format=json", "institution": "The Hong Kong Polytechnic University, Hong Kong Polytechnic University"}, {"id": 107502, "fullname": "Xudong Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/107502?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Instruction-based image editing through natural language has emerged as a powerful paradigm for intuitive visual manipulation. While recent models achieve impressive results on single edits, they suffer from severe quality degradation under multi-turn editing. Through systematic analysis, we identify progressive loss of high-frequency information as the primary cause of this quality degradation. We present FreqEdit, a training-free framework that enables stable editing across 10+ consecutive iterations. Our approach comprises three synergistic components: (1) high-frequency feature injection from reference velocity fields to preserve fine-grained details, (2) an adaptive injection strategy that spatially modulates injection strength for precise region-specific control, and (3) a path compensation mechanism that periodically recalibrates the editing trajectory to prevent over-constraint.  Extensive experiments demonstrate that FreqEdit achieves superior performance in both identity preservation and instruction following compared to seven state-of-the-art baselines. Code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39055", "url": null, "sourceid": 34074, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38859, "uid": "0c73a58dcab181744b8a520a6f80f998", "name": "From 3D Pose to Prose: Biomechanics-Grounded Vision\u2013Language Coaching", "authors": [{"id": 182345, "fullname": "Yuyang Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/182345?format=json", "institution": "Drexel University"}, {"id": 190852, "fullname": "Yixuan Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190852?format=json", "institution": "Drexel University"}, {"id": 86365, "fullname": "Shengjie Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86365?format=json", "institution": "Michigan State University"}, {"id": 76130, "fullname": "Yu Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76130?format=json", "institution": "Michigan State University"}, {"id": 154086, "fullname": "Feng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154086?format=json", "institution": "Drexel University"}], "abstract": "We present BioCoach, a biomechanics-grounded vision\u2013language framework for fitness coaching from streaming video. BioCoach fuses two signals, visual appearance and 3D skeletal kinematics, through a novel three-stage pipeline: an exercise-specific degree-of-freedom selector that focuses analysis on salient joints; a structured biomechanical context that pairs individualized morphometrics with cycle and constraint analysis; and a vision\u2013biomechanics conditioned feedback module that applies cross-attention to generate precise, actionable text. Using parameter-efficient training that freezes the vision and language backbones, BioCoach yields transparent, personalized reasoning rather than pattern matching. To enable learning and fair evaluation, we augment QEVD-fit-coach with biomechanics-oriented feedback to create QEVD-bio-fit-coach, and we introduce a biomechanics-aware LLM judge metric. BioCoach delivers clear gains on QEVD-bio-fit-coach across lexical and judgment metrics while maintaining temporal triggering; on the original QEVD-fit-coach, it improves text quality and correctness with near-parity timing, demonstrating that explicit kinematics and constraints are key to accurate, phase-aware coaching.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38859", "url": null, "sourceid": 37746, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36390, "uid": "a81dfd623ba30e99dfbd6022899a66f0", "name": "Chain of Event-Centric Causal Thought for Physically Plausible Video Generation", "authors": [{"id": 158836, "fullname": "Zixuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158836?format=json", "institution": "Sichuan University"}, {"id": 184937, "fullname": "Yixin Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184937?format=json", "institution": "Sichuan University"}, {"id": 184938, "fullname": "Haolan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184938?format=json", "institution": "Sichuan University"}, {"id": 158837, "fullname": "Feng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/158837?format=json", "institution": "The University of Adelaide"}, {"id": 184939, "fullname": "Yan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184939?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 88470, "fullname": "Wen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/88470?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 76644, "fullname": "Yinjie Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/76644?format=json", "institution": "Sichuan University"}], "abstract": "Physically Plausible Video Generation (PPVG) has emerged as a promising avenue for modeling real-world physical phenomena. PPVG requires an understanding of commonsense knowledge, which remains a challenge for video diffusion models. Current approaches leverage commonsense reasoning capability of large language models to embed physical concepts into prompts. However, generation models often render physical phenomena as a single moment defined by prompts, due to the lack of conditioning mechanisms for modeling causal progression. In this paper, we view PPVG as generating a sequence of causally connected and dynamically evolving events. To realize this paradigm, we design two key modules: (1) Physics-driven Event Chain Reasoning. This module decomposes the physical phenomena described in prompts into multiple elementary event units, leveraging chain-of-thought reasoning. To mitigate causal ambiguity, we embed physical formulas as constraints to impose deterministic causal dependencies during reasoning. (2) Transition-aware Cross-modal Prompting (TCP). To maintain continuity between events, this module transforms causal event units into temporally aligned vision-language prompts. It summarizes discrete event descriptions to obtain causally consistent narratives, while progressively synthesizing visual keyframes of individual events by interactive editing. Comprehensive experiments on PhyGenBench and VideoPhy benchmarks demonstrate that our framework achieves superior performance in generating physically plausible videos across diverse physical domains.  \\textcolor[RGB]{237,0, 140}{Our code will be released soon}.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36390", "url": null, "sourceid": 43112, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38869, "uid": "659b7cf906b8fd348ff333c167d8386d", "name": "Optical Flow Matching: Reframing Optical Flow as Continuous Transport Dynamics", "authors": [{"id": 154155, "fullname": "Ao Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/154155?format=json", "institution": "Southwest Jiaotong University"}, {"id": 128954, "fullname": "XIN LI", "url": "http://cvpr.thecvf.com/api/miniconf/users/128954?format=json", "institution": "G42"}, {"id": 128933, "fullname": "Fan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128933?format=json", "institution": "AIQ"}, {"id": 154154, "fullname": "Yuezun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/154154?format=json", "institution": "Ocean University of China"}, {"id": 190884, "fullname": "Zhaoquan Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190884?format=json", "institution": "School of Computing and Artificial Intelligence, Southwest Jiaotong University"}, {"id": 190885, "fullname": "SHAN ZHAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/190885?format=json", "institution": "Jiigan Technology"}, {"id": 190886, "fullname": "Bing Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/190886?format=json", "institution": "Jiigan Technology"}, {"id": 106393, "fullname": "Xiao WU", "url": "http://cvpr.thecvf.com/api/miniconf/users/106393?format=json", "institution": "Southwest Jiaotong University"}], "abstract": "Modern optical flow estimation, though empowered by deep neural architectures, remains rooted in the discrete correspondence paradigm inherited from classical vision. Most networks infer frame-to-frame displacements or correlation volumes, capturing where pixels move but not how motion evolves continuously through time. Yet physical motion in the real world follows smooth dynamics governed by underlying velocity fields, as long established in fluid mechanics and transport theory. To bridge this gap, we introduce Optical Flow Matching (OFM), a continuous formulation that learns a time-dependent velocity field to transport pixel coordinates along motion distribution coherent trajectories. A key component of our OFM is Triangle Velocities Synergy (TVS), a lightweight geometric mechanism that provides a stable and physically meaningful velocity construction, ensuring that continuous transport remains well-defined. Combined with an Euler-based ODE solver, OFM yields flow fields that are temporally smooth, geometrically consistent, and process-interpretable. Experiments on Sintel, KITTI, and Spring demonstrate that OFM achieves state-of-the-art accuracy, enhanced temporal stability, and notably stronger cross-dataset generalization, advancing optical flow estimation from correspondence inference to continuous dynamical reasoning. All code and trained models will be released upon acceptance to facilitate further research.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38869", "url": null, "sourceid": 36016, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40348?format=json"], "related_events_ids": [40348]}, {"id": 40348, "uid": "659b7cf906b8fd348ff333c167d8386d", "name": "Optical Flow Matching: Reframing Optical Flow as Continuous Transport Dynamics", "authors": [{"id": 154155, "fullname": "Ao Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/154155?format=json", "institution": "Southwest Jiaotong University"}, {"id": 128954, "fullname": "XIN LI", "url": "http://cvpr.thecvf.com/api/miniconf/users/128954?format=json", "institution": "G42"}, {"id": 128933, "fullname": "Fan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128933?format=json", "institution": "AIQ"}, {"id": 154154, "fullname": "Yuezun Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/154154?format=json", "institution": "Ocean University of China"}, {"id": 190884, "fullname": "Zhaoquan Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190884?format=json", "institution": "School of Computing and Artificial Intelligence, Southwest Jiaotong University"}, {"id": 190885, "fullname": "SHAN ZHAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/190885?format=json", "institution": "Jiigan Technology"}, {"id": 190886, "fullname": "Bing Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/190886?format=json", "institution": "Jiigan Technology"}, {"id": 106393, "fullname": "Xiao WU", "url": "http://cvpr.thecvf.com/api/miniconf/users/106393?format=json", "institution": "Southwest Jiaotong University"}], "abstract": "Modern optical flow estimation, though empowered by deep neural architectures, remains rooted in the discrete correspondence paradigm inherited from classical vision. Most networks infer frame-to-frame displacements or correlation volumes, capturing where pixels move but not how motion evolves continuously through time. Yet physical motion in the real world follows smooth dynamics governed by underlying velocity fields, as long established in fluid mechanics and transport theory. To bridge this gap, we introduce Optical Flow Matching (OFM), a continuous formulation that learns a time-dependent velocity field to transport pixel coordinates along motion distribution coherent trajectories. A key component of our OFM is Triangle Velocities Synergy (TVS), a lightweight geometric mechanism that provides a stable and physically meaningful velocity construction, ensuring that continuous transport remains well-defined. Combined with an Euler-based ODE solver, OFM yields flow fields that are temporally smooth, geometrically consistent, and process-interpretable. Experiments on Sintel, KITTI, and Spring demonstrate that OFM achieves state-of-the-art accuracy, enhanced temporal stability, and notably stronger cross-dataset generalization, advancing optical flow estimation from correspondence inference to continuous dynamical reasoning. All code and trained models will be released upon acceptance to facilitate further research.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40348", "url": null, "sourceid": -36016, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38869?format=json"], "related_events_ids": [38869]}, {"id": 36988, "uid": "214b755638073f8c465646edaad3c6ca", "name": "From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training", "authors": [{"id": 186384, "fullname": "Donglai Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186384?format=json", "institution": "Independent Researcher"}, {"id": 186385, "fullname": "Hongzheng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186385?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 186386, "fullname": "Yuzhi Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186386?format=json", "institution": "City University of Hong Kong"}, {"id": 186387, "fullname": "Pingping Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186387?format=json", "institution": "City University of Hong Kong"}, {"id": 186388, "fullname": "Jinpeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186388?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 186389, "fullname": "Wenao Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/186389?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 186390, "fullname": "Zhijian Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186390?format=json", "institution": "city university of hong kong"}, {"id": 186391, "fullname": "Mengyang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186391?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 178160, "fullname": "Xiaolei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/178160?format=json", "institution": "HKUST"}, {"id": 180693, "fullname": "Senkang Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180693?format=json", "institution": "City University of Hong Kong"}, {"id": 186392, "fullname": "Ziyi Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186392?format=json", "institution": null}, {"id": 186393, "fullname": "Jason Chun Lok Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186393?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 186394, "fullname": "Lai Man Po", "url": "http://cvpr.thecvf.com/api/miniconf/users/186394?format=json", "institution": null}], "abstract": "Reinforcement Learning with Verifiable Rewards (RLVR) for Multimodal Large Language Models (MLLMs) is highly dependent on high-quality labeled data, which is often scarce and prone to substantial annotation noise in real-world scenarios. Existing unsupervised RLVR methods, including pure entropy minimization, can overfit to incorrect labels and limit the crucial reward ranking signal for Group-Relative Policy Optimization (GRPO). To address these challenges and enhance noise tolerance, we propose a novel two-stage, token-level entropy optimization method for RLVR. This approach dynamically guides the model from exploration to exploitation during training. In the initial exploration phase, token-level entropy maximization promotes diverse and stochastic output generation, serving as a strong regularizer that prevents premature convergence to noisy labels and ensures sufficient intra-group variation\u2014enabling more reliable reward gradient estimation in GRPO. As training progresses, the method transitions into the exploitation phase, where token-level entropy minimization encourages the model to produce confident and deterministic outputs, thereby consolidating acquired knowledge and refining prediction accuracy. Empirically, across three MLLM backbones-Qwen2-VL-2B, Qwen2-VL-7B, and Qwen2.5-VL-3B\u2014spanning diverse noise settings and multiple tasks, our phased strategy consistently outperforms prior approaches by unifying and enhancing external, internal, and entropy-based methods, delivering robust and superior performance across the board.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36988", "url": null, "sourceid": 33974, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39263, "uid": "853d6d4dc67f67e50fa039df7ecf3e7b", "name": "UETrack: A Unified and Efficient Framework for Single Object Tracking", "authors": [{"id": 180503, "fullname": "Ben Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180503?format=json", "institution": "Dalian University of Technology"}, {"id": 188695, "fullname": "Jie Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188695?format=json", "institution": "Dalian University of Technology"}, {"id": 159501, "fullname": "Xin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/159501?format=json", "institution": "Dalian University of Technology"}, {"id": 188694, "fullname": "Wanting Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188694?format=json", "institution": "Dalian University of Technology"}, {"id": 191733, "fullname": "Bin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191733?format=json", "institution": "Dalian University of Technology"}, {"id": 130941, "fullname": "Lu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130941?format=json", "institution": "Dalian University of Technology"}, {"id": 89488, "fullname": "Dong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89488?format=json", "institution": "Dalian University of Technology"}, {"id": 87510, "fullname": "Huchuan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87510?format=json", "institution": "Dalian University of Technology"}], "abstract": "With growing real-world demands, efficient tracking has received increasing attention. However, most existing methods are limited to RGB inputs and struggle in multi-modal scenarios. Moreover, current multi-modal tracking approaches typically use complex designs, making them too heavy and slow for resource-constrained deployment. To tackle these limitations, we propose UETrack, a unified and efficient framework for single object tracking. UETrack demonstrates high practicality and versatility, efficiently handling multiple modalities including RGB, Depth, Thermal, Event, and Language, and addresses the gap in efficient multi-modal tracking. It introduces two key components: a Token-Pooling-based Mixture-of-Experts mechanism that enhances modeling capacity through feature aggregation and expert specialization, and a Target-aware Adaptive Distillation strategy that selectively performs distillation based on sample characteristics, reducing redundant supervision and improving performance. Extensive experiments on 12 benchmarks across 3 hardware platforms show that UETrack achieves a superior speed\u2013accuracy trade-off compared to pervious methods. For instance, UETrack-B achieves 69.2% AUC on LaSOT and runs at 163/56/60 FPS on GPU/CPU/AGX, demonstrating strong practicality and versatility. Code will be made available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39263", "url": null, "sourceid": 37556, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40173, "uid": "50ccab9fbc343b7cf1a1f149bf4facfd", "name": "Omni IIE Bench: Benchmarking the Practical Capabilities of Image Editing Models", "authors": [{"id": 180572, "fullname": "Yujia Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180572?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 193709, "fullname": "Yuanxiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193709?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 193710, "fullname": "Zhenyu Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193710?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 193711, "fullname": "Tiankun Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193711?format=json", "institution": null}, {"id": 193712, "fullname": "Chenxi Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193712?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 193713, "fullname": "Haopeng Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/193713?format=json", "institution": "Tencent Csig; Beijing University of Post and Telecommunications"}, {"id": 193714, "fullname": "Jinwen Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/193714?format=json", "institution": null}, {"id": 193715, "fullname": "Xinyu Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/193715?format=json", "institution": "Tencent"}, {"id": 193716, "fullname": "Lisheng Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/193716?format=json", "institution": "tencent"}, {"id": 193717, "fullname": "Haijin Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193717?format=json", "institution": "Tencent"}, {"id": 193718, "fullname": "Jin Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/193718?format=json", "institution": "Tencent Search"}, {"id": 193719, "fullname": "Xinming Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193719?format=json", "institution": "Zhongguancun Academy; Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 193720, "fullname": "RuiwenTao RuiwenTao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193720?format=json", "institution": ""}, {"id": 181690, "fullname": "Hongzhu Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/181690?format=json", "institution": "University of the Chinese Academy of Sciences"}], "abstract": "While Instruction-based Image Editing (IIE) has achieved significant progress, existing benchmarks pursue task breadth via mixed evaluations. This paradigm obscures a critical failure mode crucial in professional applications: the inconsistent performance of models across tasks of varying semantic scales. To address this gap, we introduce Omni IIE Bench, a high-quality, human-annotated benchmark specifically designed to diagnose the editing consistency of IIE models in practical application scenarios. Omni IIE Bench features an innovative dual-track diagnostic design: (1) Single-turn Consistency, comprising shared-context task pairs of attribute modification and entity replacement; and (2) Multi-turn Coordination, involving continuous dialogue tasks that traverse semantic scales. The benchmark is constructed via an exceptionally rigorous multi-stage human filtering process, incorporating a quality standard enforced by computer vision graduate students and an industry relevance review conducted by professional designers. We perform a comprehensive evaluation of 8 mainstream IIE models using Omni IIE Bench. Our analysis quantifies, for the first time, a prevalent performance gap: nearly all models exhibit a significant performance degradation when transitioning from low-semantic-scale to high-semantic-scale tasks. Omni IIE Bench provides critical diagnostic tools and insights for the development of next-generation, more reliable, and stable IIE models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40173", "url": null, "sourceid": 39596, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37315, "uid": "cc95aa39a060194468cf78fa7dc1cb99", "name": "High Resolution Neural Video Coding with Bi-directional Confidence-Guided Reference Information Modeling", "authors": [{"id": 180571, "fullname": "Feng Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/180571?format=json", "institution": "Peking University"}, {"id": 157655, "fullname": "Kai Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157655?format=json", "institution": "ByteDance Inc."}, {"id": 130296, "fullname": "Li zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130296?format=json", "institution": "Bytedance Inc."}, {"id": 185377, "fullname": "Chuanmin Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/185377?format=json", "institution": "Peking University"}], "abstract": "Exploiting bi-directional context prediction has long been recognized as a key direction for improving compression efficiency in neural video coding. However, existing neural B-frame codecs still exhibit limited performance gains, particularly in high-resolution videos with large motion, where optical flow estimation becomes unreliable and balanced prediction fusion introduces distortions. To address these challenges, we present the first High-Resolution bi-directional neural video coding method, termed as HR-NVC, which non-uniformly integrates confidence-guided predictive cues from both temporal directions to achieve more reliable and efficient compression. Specifically, we propose Spatio-Temporal Anchored Motion Estimation, which introduces virtual anchor frames and low-resolution priors to significantly improve estimation robustness under large displacements. We further design a Hierarchical Motion Representation that converges multi-scale motion with temporal references, enabling compact and adaptive modeling of motion reliability across resolutions. Finally, a Bi-Contextual Asymmetric Harmonization module performs confidence-guided fusion of bidirectional references, effectively suppressing unreliable contexts and restoring structural consistency near occlusion and scene transition regions. Notably, our model is the first end-to-end-optimized video codec evaluated on 4K-resolution videos, establishing a new benchmark for higher-resolution NVC and achieving state-of-the-art performance among neural B-frame codecs.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37315", "url": null, "sourceid": 32381, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36570, "uid": "0141f065e310a9e40011628269e71ded", "name": "Discovering Adaptive Task Dependencies for Efficient Multi-Task Representation Compression", "authors": [{"id": 180273, "fullname": "Zhimeng Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180273?format=json", "institution": "Peking University"}, {"id": 185375, "fullname": "Rongao Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185375?format=json", "institution": "Peking University"}, {"id": 184649, "fullname": "Junlong Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184649?format=json", "institution": "Xiamen University"}, {"id": 181193, "fullname": "Qi Mao", "url": "http://cvpr.thecvf.com/api/miniconf/users/181193?format=json", "institution": "Communication University of China"}, {"id": 90140, "fullname": "Siwei Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/90140?format=json", "institution": "Peking University"}, {"id": 185376, "fullname": "Wen Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185376?format=json", "institution": "Peking University"}, {"id": 185377, "fullname": "Chuanmin Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/185377?format=json", "institution": "Peking University"}], "abstract": "Traditional image compression prioritizes pixel fidelity but often preserves details irrelevant to downstream vision tasks. Compressing task-specific representations instead better aligns with task semantics, yet redundant information persists across correlated tasks. Existing multi-task compression methods typically rely on  static dependency structures, leading to redundant bit allocation across correlated tasks and suboptimal rate-distortion performance. We present Adaptive Task Dependency Compression (ATDC), a framework that models per-image task relationships and encodes representations following an adaptive directed acyclic graph (DAG). ATDC infers pairwise task predictability via a learned correlation matrix, constructs a dynamic DAG to determine the optimal compression order, and encodes each task conditionally on its predecessors, achieving predictive redundancy removal and asymmetric information sharing across tasks. Experiments on the Taskonomy dataset demonstrate consistent gains in rate\u2013distortion efficiency and task accuracy over both human-oriented codecs and state-of-the-art multi-task compression methods.The learned DAGs reveal interpretable, content-dependent task hierarchies, establishing adaptive dependency modeling as a principled paradigm for multi-task representation compression.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36570", "url": null, "sourceid": 38486, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37353, "uid": "fa14e26aa100e41821a2050bf4edfb67", "name": "Understanding Temporal Logic Consistency in Video-Language Models through Cross-Modal Attention Discriminability", "authors": [{"id": 181969, "fullname": "Chengzhi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181969?format=json", "institution": "\u5317\u4eac\u7406\u5de5\u5927\u5b66"}, {"id": 187234, "fullname": "Heyan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187234?format=json", "institution": "Beijing Institute of Technology"}, {"id": 187235, "fullname": "Ping Jian", "url": "http://cvpr.thecvf.com/api/miniconf/users/187235?format=json", "institution": "Beijing Institute of Technology"}, {"id": 187236, "fullname": "Zhen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187236?format=json", "institution": "Beijing Institute of Technology"}, {"id": 187237, "fullname": "Yaning Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/187237?format=json", "institution": "Beijing Institute of Technology"}, {"id": 187238, "fullname": "Zhongbin Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187238?format=json", "institution": "Beijing Institute of Technology; Bytedance"}], "abstract": "Large language models (LLMs) often generate self-contradictory outputs, which severely impacts their reliability and hinders their adoption in practical applications. In video-language models (Video-LLMs), this phenomenon recently draws the attention of researchers. Specifically, these models fail to provide logically consistent responses to rephrased questions based on their grounding outputs. However, the underlying causes of this phenomenon remain underexplored. In this work, we adopt an interpretability-driven approach to analyze, statistically summarize, and intervention the potential factors of the phenomenon. We find that one of the primary reasons for the inconsistency in responses lies in the inability of cross-modal attention heads to effectively distinguish video tokens across different timestamps. To address this, we propose an attention enhancement method called **T**emporally **C**onditioned **A**ttention **S**harpening (**TCAS**), which constructs an enhancement objective based on attention distinctions to enhance the model's temporal resolution capability, thereby improving its temporal understanding logic consistency. Experimental results demonstrate that our method significantly enhances the temporal logic consistency of Video-LLMs. Further analyses reveal that our method indeed improves the temporal discriminability of attention heads, validating our conclusions. Additionally, our method even achieves performance improvements in general video temporal grounding tasks, suggesting that temporal logic consistency is an important factor in temporal understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37353", "url": null, "sourceid": 44676, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40118, "uid": "833dc3b765e96b8141cad34f51dfef35", "name": "Selective, Regularized, and Calibrated: Harnessing Vision Foundation Models for Cross-Domain Few-Shot Semantic Segmentation", "authors": [{"id": 147376, "fullname": "junyuan ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/147376?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 186318, "fullname": "Xunzhi Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186318?format=json", "institution": "Nanjing University"}, {"id": 184978, "fullname": "Wenbin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184978?format=json", "institution": "Nanjing University"}, {"id": 130772, "fullname": "Qi Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/130772?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 86625, "fullname": "Yang Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86625?format=json", "institution": "Nanjing University"}], "abstract": "Vision foundation models (VFMs) have achieved strong performance across various vision tasks. However, it still remains challenging to apply VFMs for cross-domain few-shot segmentation (CD-FSS), which segments objects of novel classes under domain shifts using only a few labeled exemplars. The challenge is mainly driven by two factors: (1) limited labeled exemplars per novel class relative to the scale of VFM pre-training, causing overfitting prone under retraining, and (2) target-domain shifts underrepresented during pre-training, inducing cross-domain inconsistency and layerwise sensitivity. To address these issues, we propose Hierarchical Exemplar Representation Adaptation (HERA), a three-stage select-regularize-calibrate VFM-based segmentation framework that learns effectively from limited labels and adapts to novel domains without source-data retraining. We first design Hierarchical Layer Selection (HLS) to adaptively identify the most informative VFM layer using a data-dependent Exemplar Transfer Risk (ETR) computed for each candidate layer. Then,  Prior-Guided Regularization (PGR) regularizes interactions on the selected representation, yielding well-structured local signals for the subsequent stage. Furthermore, Pixelwise Adaptive Calibration (PAC) combines the selected representation with the refined interaction maps to calibrate pixelwise predictions, producing consistent masks. Together, these stages form a hierarchical select\u2013regularize\u2013calibrate pipeline that guides frozen VFM features in new domains while fine-tuning less than 2.7% of parameters at test time. Extensive experiments show that HERA surpasses the state-of-the-art by more than 4.1 mIoU across multiple CD-FSS benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40118", "url": null, "sourceid": 42848, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39858, "uid": "b952172111c8ee9611f15f35632f4c6a", "name": "Understanding and Enforcing Weight Disentanglement in Task Arithmetic", "authors": [{"id": 192996, "fullname": "Shangge Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192996?format=json", "institution": "Nanjing University"}, {"id": 192997, "fullname": "Yuehan Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192997?format=json", "institution": "nanjing university"}, {"id": 85317, "fullname": "Lei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85317?format=json", "institution": "University of Wollonong"}, {"id": 130772, "fullname": "Qi Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/130772?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 86630, "fullname": "Yinghuan Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/86630?format=json", "institution": "Nanjing University"}, {"id": 184978, "fullname": "Wenbin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184978?format=json", "institution": "Nanjing University"}, {"id": 86625, "fullname": "Yang Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86625?format=json", "institution": "Nanjing University"}, {"id": 73994, "fullname": "Dacheng Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/73994?format=json", "institution": "Nanyang Technological University"}], "abstract": "Task arithmetic provides an efficient, training-free way to edit pre-trained models, yet lacks a fundamental theoretical explanation for its success. The existing concept of ``weight disentanglement\" describes the ideal outcome of non-interfering task composition but does not reveal its underlying cause. Crucially, what intrinsic properties of the pre-trained model ($\\theta_0$) or the task vectors ($\\tau_t$) enable this disentanglement remains underexplored. In this paper, we introduce Task-Feature Specialization (TFS), a model's ability to allocate distinct internal features to different tasks, as the fundamental principle. We first prove that TFS is a sufficient condition for weight disentanglement. More importantly, we find that TFS also gives rise to an observable geometric consequence: weight vector orthogonality. This positions TFS as the common cause for both the desired functional outcome (disentanglement) and a measurable geometric property (orthogonality). This relationship provides the key insight for our method: since the abstract TFS property is intractable to enforce directly, we can instead promote weight disentanglement by shaping its concrete geometric consequence, orthogonality. Therefore, we propose OrthoReg, a simple and effective regularization method that actively enforces an internal orthogonal structure on weight updates ($\\Delta W$) that constitute $\\tau_t$ during fine-tuning. And we theoretically prove that OrthoReg promotes disentanglement. Extensive experiments demonstrate that OrthoReg consistently and significantly enhances the performance of various task arithmetic methods.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39858", "url": null, "sourceid": 46711, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40387?format=json"], "related_events_ids": [40387]}, {"id": 37465, "uid": "070659cee3540cd84a4ca2eabd2a694c", "name": "Gaze Target Estimation with Concepts", "authors": [{"id": 135921, "fullname": "Xu Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/135921?format=json", "institution": "University of Illinois Urbana-Champaign"}, {"id": 182785, "fullname": "Houze Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182785?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 187515, "fullname": "Vipin Gunda", "url": "http://cvpr.thecvf.com/api/miniconf/users/187515?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 187516, "fullname": "Zhongyi Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/187516?format=json", "institution": "Google"}, {"id": 187517, "fullname": "Tianyu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187517?format=json", "institution": "University of California, Berkeley; Google DeepMind"}, {"id": 187518, "fullname": "Adarsh Kowdle", "url": "http://cvpr.thecvf.com/api/miniconf/users/187518?format=json", "institution": "Google"}, {"id": 187519, "fullname": "Inki Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/187519?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 95904, "fullname": "James Rehg", "url": "http://cvpr.thecvf.com/api/miniconf/users/95904?format=json", "institution": "University of Illinois at Urbana-Champaign"}], "abstract": "Estimating human gaze targets from images in-the-wild is an important and formidable task. Existing approaches primarily employ brittle, multi-stage pipelines that require explicit inputs, like head bounding boxes and human pose, in order to identify the subject of gaze analysis. As a result, detection errors can cascade and lead to failure. Moreover, these prior works lack the flexibility of specifying the gaze analysis task via natural language prompting, an approach which has been shown to have significant benefits in convenience and scalability for other image analysis tasks. To overcome these liimtations, we introduce the **Promptable Gaze Target Estimation (PGE)** task, a new end-to-end, concept-driven paradigm for gaze analysis. PGE conditions gaze prediction on flexible user text or visual prompts (e.g., \"the boy in the red shirt\" or \"person in point [0.52, 0.48]\") to identify a specific subject for gaze analysis. This approach integrates subject localization with gaze estimation, and eliminates the rigid dependency on intermediate analysis stages. We develop a scalable data engine to generate **Gaze-Co** (Gaze Estimation with Concepts), a dataset and benchmark of 120K high-quality, prompt-annotated image pairs. We also propose **AnyGaze**, the first model designed for PGE. AnyGaze uses a transformer-based detector to fuse features from frozen encoders and simultaneously solves subject localization, in/out-of-frame presence, and gaze target heatmap estimation. AnyGaze achieves state-of-the-art performance on multiple PGE benchmarks, setting a strong baseline for this new problem even on a difficult out-of-domain, real-world clinical dataset. We will open-source the AnyGaze model and the Gaze-Co benchmark.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37465", "url": null, "sourceid": 39598, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38157, "uid": "94ae76abfd4a0a692a11c54041f5c0b1", "name": "SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras", "authors": [{"id": 155237, "fullname": "Weihong Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/155237?format=json", "institution": "Zhejiang University"}, {"id": 143497, "fullname": "XiaoYu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/143497?format=json", "institution": "InSpatio Intelligent Technology"}, {"id": 189170, "fullname": "Zhuang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189170?format=json", "institution": "Vivo"}, {"id": 157562, "fullname": "Zhichao Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/157562?format=json", "institution": "Sensetime"}, {"id": 189171, "fullname": "Nan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189171?format=json", "institution": "inspatio"}, {"id": 189172, "fullname": "Haomin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189172?format=json", "institution": "InSpatio"}, {"id": 84995, "fullname": "Guofeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/84995?format=json", "institution": "Zhejiang University"}], "abstract": "High-quality 4D reconstruction enables photorealistic and immersive rendering of the dynamic real world. However, unlike static scenes that can be fully captured with a single camera, high-quality dynamic scenes typically require dense arrays of tens or even hundreds of synchronized cameras. Dependence on such costly lab setups severely limits practical scalability. The reliance on such costly lab setups severely limits practical scalability. To this end, we propose a sparse-camera dynamic reconstruction framework that exploits abundant yet inconsistent generative observations. Our key innovation is the Spatio-Temporal Distortion Field, which provides a unified mechanism for modeling inconsistencies in generative observations across both spatial and temporal dimensions. Building on this, we develop a complete pipeline that enables 4D reconstruction from sparse and uncalibrated camera inputs. We evaluate our method on multi-camera dynamic scene benchmarks, achieving spatio-temporally consistent high-fidelity renderings and significantly outperforming existing approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38157", "url": null, "sourceid": 40118, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40196, "uid": "1c4196d0ff7fe4e94bdca98fb251bc25", "name": "Egocentric Visibility-Aware Human Pose Estimation", "authors": [{"id": 102271, "fullname": "Peng Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/102271?format=json", "institution": "Bytedance Inc."}, {"id": 193753, "fullname": "Yu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193753?format=json", "institution": "ByteDance Inc."}, {"id": 193754, "fullname": "Feng Yiqiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193754?format=json", "institution": null}, {"id": 128731, "fullname": "ZhenFan Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/128731?format=json", "institution": "Bytedance"}, {"id": 102272, "fullname": "Yang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/102272?format=json", "institution": "ByteDance"}], "abstract": "Egocentric human pose estimation (HPE) using a head-mounted device is crucial for various VR and AR applications, but it faces significant challenges due to keypoint invisibility. Nevertheless, none of the existing egocentric HPE datasets provide keypoint visibility annotations, and the existing methods often overlook the invisibility problem, treating visible and invisible keypoints indiscriminately during estimation. As a result, their capacity to accurately predict visible keypoints is compromised. In this paper, we first present Eva-3M, a large-scale egocentric visibility-aware HPE dataset comprising over 3.0M frames, with 435K of them annotated with keypoint visibility labels. Additionally, we augment the existing EMHI dataset with keypoint visibility annotations to further facilitate the research in this direction. Furthermore, we propose EvaPose, a novel egocentric visibility-aware HPE method that explicitly incorporates visibility information to enhance pose estimation accuracy. Extensive experiments validate the significant value of ground-truth visibility labels in egocentric HPE settings, and demonstrate that our EvaPose achieves state-of-the-art performance in both Eva-3M and EMHI datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40196", "url": null, "sourceid": 46433, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37879, "uid": "de5d8722acb069c17055849b7aca5047", "name": "Neural Collapse in Test-Time Adaptation", "authors": [{"id": 155703, "fullname": "Xiao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/155703?format=json", "institution": "Sichuan University"}, {"id": 188479, "fullname": "Zhongjing Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/188479?format=json", "institution": "Peking University"}, {"id": 188480, "fullname": "Jiazhen Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188480?format=json", "institution": "Tsinghua University"}, {"id": 101375, "fullname": "Jiang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/101375?format=json", "institution": "Peking University"}, {"id": 188481, "fullname": "Li Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188481?format=json", "institution": "Sichuan University"}, {"id": 157882, "fullname": "Jingyan Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157882?format=json", "institution": "Shenzhen Technology University"}, {"id": 119978, "fullname": "Zhi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/119978?format=json", "institution": "SIGS, Tsinghua University"}], "abstract": "Test-Time Adaptation (TTA) enhances model robustness to out-of-distribution (OOD) data by updating the model online during inference, yet existing methods lack theoretical insights into the fundamental causes of performance degradation under domain shifts. Recently, Neural Collapse (NC) has been proposed as an emergent geometric property of deep neural networks (DNNs), providing valuable insights for TTA. In this work, we extend NC to the sample-wise level and discover a novel phenomenon termed Sample-wise Alignment Collapse (NC3+), demonstrating that a sample\u2019s feature embedding, obtained by a trained model, aligns closely with the corresponding classifier weight. Building on NC3+, we identify that the performance degradation stems from sample-wise misalignment in adaptation which exacerbates under larger distribution shifts. This indicates the necessity of realigning the feature embeddings with their corresponding classifier weights. However, the misalignment makes pseudo-labels unreliable under domain shifts. To address this challenge, we propose NCTTA, a novel feature-classifier alignment method with hybrid targets to mitigate the impact of unreliable pseudo-labels, which blends geometric proximity with predictive confidence. Extensive experiments demonstrate the effectiveness of NCTTA in enhancing robustness to domain shifts. For example, NCTTA outperforms Tent by 14.52\\% on ImageNet-C.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37879", "url": null, "sourceid": 40945, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39789, "uid": "4a59a8ea539613e156495fe26f8e64cd", "name": "Hierarchical Codec Diffusion for Video-to-Speech Generation", "authors": [{"id": 192859, "fullname": "Jiaxin Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/192859?format=json", "institution": "Fudan University"}, {"id": 153912, "fullname": "Gaoxiang Cong", "url": "http://cvpr.thecvf.com/api/miniconf/users/153912?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 105464, "fullname": "Chenhui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/105464?format=json", "institution": "Meituan; Fudan University"}, {"id": 192860, "fullname": "Xin-Cheng Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192860?format=json", "institution": "Harbin Institute of Technology"}, {"id": 192861, "fullname": "Zhaoyang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192861?format=json", "institution": "Fudan University"}, {"id": 192862, "fullname": "Boyuan Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192862?format=json", "institution": "Fudan University"}, {"id": 76406, "fullname": "Hongming Shan", "url": "http://cvpr.thecvf.com/api/miniconf/users/76406?format=json", "institution": "Fudan University"}], "abstract": "Video-to-Speech (VTS) generation aims to synthesize speech solely from a silent video without auditory signals, and holds substantial promise for applications such as film dubbing and voice restoration for individuals with aphonia. However, existing VTS methods disregard the hierarchical nature of speech, which spans coarse speaker-aware semantics to fine-grained prosodic details. This oversight hinders direct alignment between visual and speech features at specific hierarchical levels during property matching. In this paper, leveraging the distinctive hierarchical structure of Residual Vector Quantization (RVQ)-based codecs, we propose $\\textbf{HiCoDiT}$, a novel $\\textbf{Hi}$erarchical $\\textbf{Co}$dec $\\textbf{Di}$ffusion $\\textbf{T}$ransformer that exploits the inherent hierarchy of discrete speech tokens to achieve efficient alignment. Specifically, since lower-level tokens encode coarse speaker-aware content and higher-level tokens capture fine-grained prosody, \\methodname employs separate low-level and high-level blocks to generate tokens at corresponding codec layers. The low-level blocks condition on lip-synchronized motion and facial identity to capture speaker-aware content modeling, while the high-level blocks use facial expression to modulate prosodic dynamics. Finally, to enable more effective coarse-to-fine conditioning, we propose a dual-scale Adaptive Instance Layer Normalization (AdaLN) that jointly captures global vocal style through channel-wise normalization and local prosody dynamics through temporal-wise normalization. Extensive experiments demonstrate that \\methodname outperforms state-of-the-art baselines in fidelity, semantic consistency, and expressiveness, highlighting the effectiveness of integrating speech hierarchy for VTS generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39789", "url": null, "sourceid": 33329, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40104, "uid": "8dc6d6ed130ea73142c6de011fc26dbb", "name": "Improving Sparse Autoencoder with Dynamic Attention", "authors": [{"id": 88630, "fullname": "Dongsheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88630?format=json", "institution": "Xidian University"}, {"id": 174502, "fullname": "Jinsen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174502?format=json", "institution": "Shenzhen university"}, {"id": 193544, "fullname": "Dawei Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/193544?format=json", "institution": "Shenzhen University"}, {"id": 85724, "fullname": "Hui Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85724?format=json", "institution": "Shenzhen University"}], "abstract": "Recently, sparse autoencoders (SAEs) have emerged as a promising technique for interpreting activations in foundation models by disentangling features into a sparse set of concepts. However, identifying the optimal level of sparsity for each neuron remains challenging in practice: excessive sparsity can lead to poor reconstruction, whereas insufficient sparsity may harm interpretability. While existing activation functions such as ReLU and TopK provide certain sparsity guarantees, they typically require additional sparsity regularization or cherry-picked hyperparameters. We show in this paper that dynamically sparse attention mechanisms using sparsemax can bridge this trade-off, due to their ability to determine the activation numbers in a data-dependent manner.  Specifically, we first explore a new class of SAEs based on the cross-attention architecture with the latent features as queries and the learnable dictionary as the key and value matrices. To encourage sparse pattern learning, we employ a sparsemax-based attention strategy that automatically infers a sparse set of elements according to the complexity of each neuron, resulting in a more flexible and general activation function. Through comprehensive evaluation and visualization, we show that our approach successfully achieves lower reconstruction loss while producing high-quality concepts, particularly in top-n classification tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40104", "url": null, "sourceid": 39016, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38608, "uid": "dca7e2d34a20f3a5b128c69253599a1b", "name": "RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding", "authors": [{"id": 89650, "fullname": "Xiyan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89650?format=json", "institution": "Baidu"}, {"id": 190281, "fullname": "Han Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190281?format=json", "institution": "Baidu"}, {"id": 190282, "fullname": "Yuhu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190282?format=json", "institution": "Baidu Inc."}, {"id": 190283, "fullname": "JUNJIE CAI", "url": "http://cvpr.thecvf.com/api/miniconf/users/190283?format=json", "institution": "Baidu"}, {"id": 190284, "fullname": "Zhe Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190284?format=json", "institution": "Baidu"}, {"id": 190285, "fullname": "Yangjianzhong Yangjianzhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/190285?format=json", "institution": null}, {"id": 190286, "fullname": "Zhen Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190286?format=json", "institution": "Baidu"}], "abstract": "Understanding mid-level road semantics\u2014the structural and contextual cues that bridge low-level perception and high-level planning\u2014is essential for reliable autonomous driving and digital map construction. However, existing benchmarks primarily target perception tasks such as detection or segmentation, overlooking the reasoning capabilities required to infer road topology and dynamic scene structure. To address this gap, we present RoadSceneBench, a lightweight yet information-rich benchmark designed to evaluate and advance visual reasoning in complex road environments. Unlike large-scale perception datasets, RoadSceneBench emphasizes relational understanding and structural consistency, encouraging models to capture the underlying logic of real-world road scenes. Furthermore, to enhance reasoning reliability, we propose Hierarchical Relational Reward Propagation with Temporal Consistency (HRRP-T)\u2014a training framework for Vision-Language Models (VLMs) in which reward signals adaptively promote spatial coherence and semantic alignment throughout the reasoning process. This paradigm enables models to move beyond static recognition toward geometry-aware and temporally consistent reasoning. Extensive experiments demonstrate that our method achieves state-of-the-art performance across diverse road configurations. RoadSceneBench thus provides a compact yet powerful foundation for studying mid-level road semantics and fostering structure-aware autonomous perception. Our dataset will be made publicly available in the final version.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38608", "url": null, "sourceid": 32855, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39693, "uid": "87c9f6bccdb537339ee6fbaa0771aaaa", "name": "Medical Video Diagnosis via Counterfactual Reasoning", "authors": [{"id": 192658, "fullname": "Jianzhe Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192658?format=json", "institution": "Zhejiang University"}, {"id": 192659, "fullname": "Churan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192659?format=json", "institution": null}, {"id": 192660, "fullname": "Weiyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192660?format=json", "institution": "Chinese People&#x27;s Liberation Army General Hospital"}, {"id": 192661, "fullname": "Jianghua Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192661?format=json", "institution": "Chinese People&#x27;s Liberation Army General Hospital"}, {"id": 192662, "fullname": "Lian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192662?format=json", "institution": "Chinese People&#x27;s Liberation Army General Hospital"}, {"id": 86315, "fullname": "Wenguan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86315?format=json", "institution": "Zhejiang University"}, {"id": 88123, "fullname": "Yixin Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88123?format=json", "institution": "Peking University"}, {"id": 74208, "fullname": "Yizhou Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/74208?format=json", "institution": "Peking University"}], "abstract": "Medical video diagnosis involves inferring clinical decisions from dynamic tissue responses throughout examination processes. Existing methods rely on an end-to-end learning paradigm that i) focuses on appearance rather than pathology, ii) lacks clinical priors, and iii) reasons solely from observations without counterfactual comparison. This work introduces MedVCR, a counterfactual reasoning framework that mimics clinical diagnostic thinking. MedVCR comprises three components: a Counterfactual Generator that synthesizes tissue evolution under specified pathological states via a diffusion-based manner; a Counterfactual Representation Learning module that encodes diagnostic knowledge through clinical rules (i.e., temporal consistency, pathological separability, and counterfactual alignment); and a Dual Diagnostic Prediction strategy that integrates video-level assessment with frame-level counterfactual analysis. MedVCR is evaluated under both fully supervised (e.g., colposcopy) and weakly supervised (e.g., colonoscopy) video diagnosis settings, yielding 2.6%-10.2% performance gains compared with leading baselines. Comprehensive ablation studies further validate the effectiveness of each component. The code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39693", "url": null, "sourceid": 37423, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36644, "uid": "b1a3fd1d7a2cf89118a0a9b440757634", "name": "An Instance-Centric Panoptic Occupancy Prediction Benchmark for Autonomous Driving", "authors": [{"id": 180667, "fullname": "Yi Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180667?format=json", "institution": "Tongji University"}, {"id": 175026, "fullname": "Junwu E", "url": "http://cvpr.thecvf.com/api/miniconf/users/175026?format=json", "institution": "Tongji University"}, {"id": 185542, "fullname": "Zizhan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185542?format=json", "institution": "Tongji University"}, {"id": 185543, "fullname": "Yu Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/185543?format=json", "institution": "Tongji University"}, {"id": 127684, "fullname": "Hanli Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127684?format=json", "institution": "Tongji University"}, {"id": 185544, "fullname": "Rui Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185544?format=json", "institution": "Tongji University"}], "abstract": "Panoptic occupancy prediction aims to jointly infer voxel-wise semantics and instance identities within a unified 3D scene representation. Nevertheless, progress in this field remains constrained by the absence of high-quality 3D mesh resources, instance-level annotations, and physically consistent occupancy datasets. Existing benchmarks typically provide incomplete and low-resolution geometry without instance-level annotations, limiting the development of models capable of achieving precise geometric reconstruction, reliable occlusion reasoning, and holistic 3D understanding. To address these challenges, this paper presents an instance-centric benchmark for the 3D panoptic occupancy prediction task. Specifically, we introduce ADMesh, the first unified 3D mesh library tailored for autonomous driving, which integrates over 15K high-quality 3D models with diverse textures and rich semantic annotations. Building upon ADMesh, we further construct CarlaOcc, a large-scale, physically consistent panoptic occupancy dataset generated using the CARLA simulator. This dataset contains 100K frames with fine-grained, instance-level occupancy ground truth at voxel resolutions as fine as 0.05 m. Furthermore, standardized evaluation metrics are introduced to quantify the quality of existing occupancy datasets. Finally, a systematic benchmark of representative models is established on the proposed dataset, which provides a unified platform for fair comparison and reproducible research in the field of 3D panoptic perception. Code and dataset will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36644", "url": null, "sourceid": 44423, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40231, "uid": "e1523f05b5135f8c5b464a15c9d3567d", "name": "The Midas Touch for Metric Depth", "authors": [{"id": 185543, "fullname": "Yu Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/185543?format=json", "institution": "Tongji University"}, {"id": 185542, "fullname": "Zizhan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185542?format=json", "institution": "Tongji University"}, {"id": 193831, "fullname": "Zuyi Xiong", "url": "http://cvpr.thecvf.com/api/miniconf/users/193831?format=json", "institution": "Tongji University"}, {"id": 193832, "fullname": "Haoran Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193832?format=json", "institution": "Tongji University"}, {"id": 180667, "fullname": "Yi Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180667?format=json", "institution": "Tongji University"}, {"id": 147356, "fullname": "Hongbo Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/147356?format=json", "institution": "Tongji University"}, {"id": 127684, "fullname": "Hanli Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127684?format=json", "institution": "Tongji University"}, {"id": 185544, "fullname": "Rui Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185544?format=json", "institution": "Tongji University"}], "abstract": "Recent advances have markedly improved the cross-scene generalization of relative depth estimation, yet its practical applicability remains limited by the absence of metric scale, local inconsistencies, and low computational efficiency. To address these issues, we present Midas Touch for Depth (MTD), a mathematically interpretable approach that converts relative depth into metric depth using only extremely sparse 3D data. To eliminate local scale inconsistencies, it applies a segment-wise recovery strategy via sparse graph optimization, followed by a pixel-wise refinement strategy using a discontinuity-aware geodesic cost. MTD exhibits strong generalization and achieves substantial accuracy improvements over previous depth completion and depth estimation methods. Moreover, its lightweight, plug-and-play design facilitates deployment and integration on diverse downstream 3D modeling tasks.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40231", "url": null, "sourceid": 46481, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38473, "uid": "69b15a532728b765d71499d899009fda", "name": "Virtual Full-stack Scanning of Brain MRI via Imputing Any Quantised Code", "authors": [{"id": 179605, "fullname": "Yicheng Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/179605?format=json", "institution": "Imperial College London"}, {"id": 189928, "fullname": "Tao Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/189928?format=json", "institution": "Fudan University"}, {"id": 189929, "fullname": "Zhonghua Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189929?format=json", "institution": "SenseTime"}, {"id": 152261, "fullname": "Jin Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/152261?format=json", "institution": "Monash University"}, {"id": 185612, "fullname": "Zongyuan Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/185612?format=json", "institution": "Monash University"}, {"id": 149780, "fullname": "Wenjia Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/149780?format=json", "institution": "Imperial College London"}, {"id": 189930, "fullname": "Zhaolin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189930?format=json", "institution": "Monash University"}, {"id": 126993, "fullname": "Jianfei Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/126993?format=json", "institution": "Monash University"}], "abstract": "Magnetic resonance imaging (MRI) is a powerful and versatile imaging technique, offering a wide spectrum of information about the anatomy by employing different acquisition modalities. However, in the clinical workflow, it is impractical to collect all relevant modalities due to the scan time and cost constraints. Virtual full-stack scanning aims to impute missing MRI modalities from available but incomplete acquisitions, offering a cost-efficient solution to enhance data completeness and clinical usability. Existing imputation methods often depend on global conditioning or modality-specific designs, which limit their generalisability across patient cohorts and imaging protocols. To address these limitations, we propose CodeBrain, a unified framework that reformulates various ``any-to-any'' imputation tasks as a region-level full-stack code prediction problem. CodeBrain adopts a two-stage pipeline: (1) it learns the compact representation of a complete MRI modality set by encoding it into scalar-quantised codes at the region level, enabling high-fidelity image reconstruction after decoding these codes along with modality-agnostic common features; (2) it trains a projection encoder to predict the full-stack code map from incomplete modalities via a grading-based design for diverse imputation scenarios. Extensive experiments on two public brain MRI datasets, i.e., IXI and BraTS 2023, demonstrate that CodeBrain consistently outperforms state-of-the-art methods, establishing a new benchmark for unified brain MRI imputation and enabling virtual full-stack scanning.Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38473", "url": null, "sourceid": 33892, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37550, "uid": "807a284ada14944ff3095585658a03ae", "name": "Bringing Your Portrait to 3D Presence", "authors": [{"id": 154161, "fullname": "Jiawei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154161?format=json", "institution": "Nanjing University"}, {"id": 90376, "fullname": "Lei Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90376?format=json", "institution": "Microsoft Research Asia"}, {"id": 132590, "fullname": "Jiahao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/132590?format=json", "institution": "Microsoft Research Asia"}, {"id": 187701, "fullname": "Zhenyu Zang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187701?format=json", "institution": "Research, Microsoft"}, {"id": 187702, "fullname": "Chong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187702?format=json", "institution": "Microsoft"}, {"id": 87718, "fullname": "Xiao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/87718?format=json", "institution": "Microsoft Research Asia"}, {"id": 85035, "fullname": "Xun Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/85035?format=json", "institution": "Nanjing University"}, {"id": 153839, "fullname": "Hao Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153839?format=json", "institution": "Nanjing University"}, {"id": 87597, "fullname": "Yan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87597?format=json", "institution": "Microsoft Research Asia"}], "abstract": "We present a unified framework for reconstructing animatable 3D human avatars from a single portrait across head, half-body, and full-body inputs. Our method tackles three bottlenecks: pose- and framing-sensitive feature representations, limited scalable data, and unreliable proxy-mesh estimation.We introduce a Dual-UV representation that maps image features to a canonical UV space via Core-UV and Shell-UV branches, eliminating pose- and framing-induced token shifts.We also build a factorized synthetic data manifold combining 2D generative diversity with geometry-consistent 3D renderings, supported by a training scheme that improves realism and identity consistency.A robust proxy-mesh tracker maintains stability under partial visibility. Together, these components enable strong in-the-wild generalization. Trained only on half-body synthetic data, our model achieves state-of-the-art head and upper-body reconstruction and competitive full-body results.Extensive experiments and analyses further validate the effectiveness of our approach. Code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37550", "url": null, "sourceid": 37244, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37553, "uid": "caff288f3dbd050e5c0e76097e333ff9", "name": "Mirai: Autoregressive Visual Generation Needs Foresight", "authors": [{"id": 187705, "fullname": "Yonghao Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187705?format=json", "institution": "The university of Tokyo"}, {"id": 187706, "fullname": "Lang Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187706?format=json", "institution": "National Institute of Informatics; Sakana AI"}, {"id": 187707, "fullname": "Zerun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187707?format=json", "institution": null}, {"id": 129644, "fullname": "Runyi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129644?format=json", "institution": "Peking University"}, {"id": 92654, "fullname": "Toshihiko Yamasaki", "url": "http://cvpr.thecvf.com/api/miniconf/users/92654?format=json", "institution": "The University of Tokyo"}], "abstract": "Autoregressive (AR) visual generators model images as sequences of discrete tokens and are trained with next token likelihood. This strict causality supervision optimizes each step only by its immediate next token, which diminishes global coherence and slows convergence. We ask whether foresight, training signals that originate from later tokens, can help AR visual generation. We conduct a series of controlled diagnostics along the injection level, foresight layout, and foresight source axes, unveiling a key insight: aligning foresight to AR models' internal representation on the 2D image grids improves causality modeling.We formulate this insight with Mirai (meaning \"future\" in Japanese), a general framework that injects future information into AR training with no architecture change and no extra inference overhead: Mirai-E uses explicit foresight from multiple future positions of unidirectional representations, whereas Mirai-I leverages implicit foresight from matched bidirectional representations.Extensive experiments show that Mirai significantly accelerates convergence and improves generation quality. For instance, Mirai can speed up LlamaGen-B's convergence by up to 10$\\times$ and reduce the generation FID from 5.34 to 4.34 on the ImageNet class-condition image generation benchmark.Our study highlights that visual autoregressive models need foresight.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37553", "url": null, "sourceid": 37085, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40069, "uid": "8983dcdc28c8c330a0fcd03bdc2198df", "name": "DreamOmni2: Multimodal Instruction-based Generation and Editing", "authors": [{"id": 87856, "fullname": "Bin Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/87856?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 71414, "fullname": "Bohao Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/71414?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 132567, "fullname": "Yuechen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/132567?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 189284, "fullname": "Junjia Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189284?format=json", "institution": "Sun yat-sen University"}, {"id": 189286, "fullname": "JiyangLiu JiyangLiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189286?format=json", "institution": "ByteDance Inc."}, {"id": 77019, "fullname": "Jingyao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/77019?format=json", "institution": "Chinese University of Hong Kong"}, {"id": 128162, "fullname": "Haoru Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/128162?format=json", "institution": "HKU"}, {"id": 128180, "fullname": "WU Sitong", "url": "http://cvpr.thecvf.com/api/miniconf/users/128180?format=json", "institution": "Department of Computer Science and Engineering, The Chinese University of Hong Kong"}, {"id": 193437, "fullname": "Chengyao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193437?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 87905, "fullname": "Yitong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87905?format=json", "institution": "ByteDance Inc"}, {"id": 129111, "fullname": "Bei Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129111?format=json", "institution": "Department of Computer Science and Engineering, The Chinese University of Hong Kong"}, {"id": 154575, "fullname": "Jiaya Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/154575?format=json", "institution": "Department of Computer Science and Engineering, Hong Kong University of Science and Technology"}], "abstract": "Recent advancements in instruction-based image editing and subject-driven generation have garnered significant attention, yet both tasks still face limitations in meeting practical user needs. Instruction-based editing relies solely on language instructions, which often fail to capture specific editing details, making reference images necessary. Meanwhile, subject-driven generation is limited to combining concrete objects or people, overlooking broader, abstract concepts. To address these challenges, we propose two novel tasks: multimodal instruction-based editing and generation. These tasks support both text and image instructions and extend the scope to include both concrete and abstract concepts, greatly enhancing their practical applications. We introduce DreamOmni2, tackling two primary challenges: data creation and model framework design. Our data synthesis pipeline consists of three steps: (1) using a feature mixing method to create extraction data for both abstract and concrete concepts, (2) generating multimodal instruction-based editing training data using the editing and extraction models, and (3) further applying the extraction model to create training data for multimodal instruction-based editing. For the framework, to handle multi-image input, we propose an index encoding and position encoding shift scheme, which helps the model distinguish images and avoid pixel confusion. Additionally, we introduce joint training with the VLM and our generation/editing model to better process complex instructions. In addition, we have proposed comprehensive benchmarks for these two new tasks to drive their development. Experiments show that DreamOmni2 has achieved impressive results. Models and codes will be released.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40069", "url": null, "sourceid": 38922, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37173, "uid": "5cd26024db3b7ce0cf1bf304b6b023ee", "name": "Enhancing the Security of Visual Speaker Authentication Based on Dynamic Lip-Print Analysis", "authors": [{"id": 186840, "fullname": "Yi He", "url": "http://cvpr.thecvf.com/api/miniconf/users/186840?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 182194, "fullname": "Lei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182194?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 180827, "fullname": "Bofan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/180827?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 186841, "fullname": "Shi-Lin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186841?format=json", "institution": "Shanghai Jiaotong University"}], "abstract": "In recent years, face-based authentication methods are gradually replacing traditional methods across various applications, offering enhanced security and user convenience. However, these methods are threatened by the continuously evolving DeepFake techniques. In this paper, a novel Visual Speaker Authentication (VSA) approach based on dynamic lip-prints is proposed to improve system security against diverse attacks. The lip-prints are discriminative viseme segments that capture user's localized speaking habits. By leveraging these dynamic lip-prints, this approach can expand the prompt set without requiring additional user-recordings or model retraining, thereby strengthening resilience to replay attacks. Moreover, a Multi-Layer Dynamic-Enhanced Encoder is introduced to model fine-grained lip dynamics, addressing data scarcity challenges and ensuring robust feature extraction even in scenarios with short temporal spans and limited enrollment data. We have carried out extensive experiments on several datasets and the results have demonstrated the effectiveness of the proposed method in both security enhancement and prompt set scalability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37173", "url": null, "sourceid": 38530, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36488, "uid": "878d5255f246ac543eb4d7104a8abb78", "name": "Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning", "authors": [{"id": 185176, "fullname": "Yuhong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185176?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 102035, "fullname": "Beichen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/102035?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 131138, "fullname": "Yuhang Zang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131138?format=json", "institution": "Nanyang Technological University"}, {"id": 152680, "fullname": "Yuhang Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/152680?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 155815, "fullname": "Long Xing", "url": "http://cvpr.thecvf.com/api/miniconf/users/155815?format=json", "institution": "University of Science and Technology of China"}, {"id": 90594, "fullname": "Xiaoyi Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/90594?format=json", "institution": "Microsoft"}, {"id": 153062, "fullname": "Haodong Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153062?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 84911, "fullname": "Dahua Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/84911?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 77217, "fullname": "Jiaqi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/77217?format=json", "institution": "Shanghai AI Laboratory"}], "abstract": "Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks in both image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36488", "url": null, "sourceid": 44561, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37348, "uid": "9a3fffd26a576a089927f9c34f60c7e6", "name": "PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery", "authors": [{"id": 187224, "fullname": "Elkhan Ismayilzada", "url": "http://cvpr.thecvf.com/api/miniconf/users/187224?format=json", "institution": "Michigan State University"}, {"id": 169435, "fullname": "Yufei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/169435?format=json", "institution": "ByteDance Inc."}, {"id": 164423, "fullname": "Zijun Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/164423?format=json", "institution": "Michigan State University"}], "abstract": "Significant advancements made in reconstructing hands from images have delivered accurate single-frame estimates, yet they often lack physics consistency and provide no notion of how confidently the motion satisfies physics. In this paper, we propose a novel physics-aware conditional diffusion framework that refines noisy pose sequences into physically plausible hand motion while estimating the physics variance in motion estimates. Building on a MeshCNN\u2013Transformer backbone, we formulate Euler\u2013Lagrange dynamics for articulated hands. Unlike prior works that enforce zero residuals, we treat the resulting dynamic residuals as virtual observables to more effectively integrate physics. Through a last-layer Laplace approximation, our method produces per-joint, per-time variances that measure physics consistency and offers interpretable variance maps indicating where physical consistency weakens. Experiments on two well-known hand datasets show consistent gains over strong image-based initializations and competitive video-based methods. Qualitative results confirm that our variance estimations are aligned with the physical plausibility of the motion in image-based estimates.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37348", "url": "https://elkhanzada.github.io/pad-hand", "sourceid": 37165, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38867, "uid": "0b0ace8b6bdf7426f3fe628939e917d1", "name": "Unified Spherical Frontend: Learning Rotation-Equivariant Representations of Any Camera", "authors": [{"id": 190880, "fullname": "Mukai Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190880?format=json", "institution": "Carnegie Mellon University, Robotics Institute"}, {"id": 70318, "fullname": "Mosam Dabhi", "url": "http://cvpr.thecvf.com/api/miniconf/users/70318?format=json", "institution": "Carnegie Mellon University"}, {"id": 190881, "fullname": "Liuyue Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/190881?format=json", "institution": "Amazon"}, {"id": 73029, "fullname": "Sebastian Scherer", "url": "http://cvpr.thecvf.com/api/miniconf/users/73029?format=json", "institution": "Carnegie Mellon University"}, {"id": 76711, "fullname": "Laszlo Jeni", "url": "http://cvpr.thecvf.com/api/miniconf/users/76711?format=json", "institution": "Carnegie Mellon University"}], "abstract": "Modern perception increasingly relies on fisheye, panoramic, and other wide-FoV cameras, yet most pipelines still apply planar CNNs designed for pinhole imagery on 2D grids, where image-space neighborhoods misrepresent physical adjacency and models are sensitive to global rotations. Frequency-domain spherical CNNs partially address this mismatch but require costly spherical harmonic transforms that constrain resolution and efficiency. We introduce the Unified Spherical Frontend (USF), a lens-agnostic framework that lifts images from any calibrated camera to a unit-sphere representation via ray-direction correspondences and performs spherical resampling, convolution, and pooling directly in the spatial domain. USF is modular: projection, location sampling, value interpolation, and output resolution controls are decoupled. Its distance-only spherical kernels provide configurable rotation-equivariance by design (mirroring the translation-equivariance of planar CNNs), while avoiding harmonic transforms. We compare standard planar backbones with their spherical counterparts across classification, detection, and segmentation on synthetic (Spherical MNIST) and real-world datasets (PANDORA, Stanford 2D3DS), and stress-test robustness to extreme lens distortions, varying FoV, and arbitrary rotations. USF processes high-resolution spherical imagery efficiently and maintains \\emph{less than 1\\%} performance drop under random test-time rotations, even without rotational augmentation, while zero\u2011shot generalizing from one lens type to previously unseen wide-FoV lenses with minimal degradation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38867", "url": "https://tomnotch.com/USF", "sourceid": 43348, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38958, "uid": "27c8efa32c0738c9d83b37d1882d97ea", "name": "Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion", "authors": [{"id": 153229, "fullname": "Yueming Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153229?format=json", "institution": "Institute of Artificial Intelligence and Robotics, Xi\u2019an Jiaotong University"}, {"id": 71593, "fullname": "Ruoyu Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/71593?format=json", "institution": "University of Science and Technology of China"}, {"id": 85511, "fullname": "Qi Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/85511?format=json", "institution": "Microsoft Research Asia"}, {"id": 191058, "fullname": "Yuqi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191058?format=json", "institution": "ByteDance Inc."}, {"id": 191059, "fullname": "Wenfeng LIN", "url": "http://cvpr.thecvf.com/api/miniconf/users/191059?format=json", "institution": null}, {"id": 167563, "fullname": "MINGYU GUO", "url": "http://cvpr.thecvf.com/api/miniconf/users/167563?format=json", "institution": "ByteDance Inc."}, {"id": 86583, "fullname": "Chong Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/86583?format=json", "institution": "Microsoft Research Asia"}, {"id": 87581, "fullname": "Nanning Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87581?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "Latent Diffusion Models (LDMs) inherently follow a coarse-to-fine generation process, where high-level semantic structure is generated slightly earlier than fine-grained texture. This indicates the preceding semantics potentially benefit the texture generation by providing a semantic anchor. Recent advances have integrated semantic priors from pretrained visual encoders to further enhance LDMs, yet they still denoise semantic and VAE-encoded texture synchronously, neglecting such ordering. Observing these, we propose Semantic-First Diffusion (SFD), a latent diffusion paradigm that explicitly prioritizes semantic formation. SFD first constructs composite latents by combining the compact semantic latent, which is extracted from pretrained visual encoder via a dedicated Semantic VAE, with the texture latent. The core of SFD is to denoise the semantic and texture latents asynchronously using separate noise schedules: semantics precede textures by a temporal offset, providing clearer high-level guidance for texture refinement and enabling natural coarse-to-fine generation. On ImageNet 256$\\times$256 with guidance, SFD achieves FID 1.06  (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL), while achieving up to 100$\\times$ faster convergence than original DiT without guidance. SFD also improves existing methods like ReDi and VA-VAE, demonstrating the effectiveness of asynchronous, semantics-led modeling.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38958", "url": null, "sourceid": 31009, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38672, "uid": "a2a52743471fc9d71744e35fa3625217", "name": "Gravitation-Driven Semantic Alignment for Text Video Retrieval", "authors": [{"id": 190435, "fullname": "Yi YANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/190435?format=json", "institution": "Tongji University"}, {"id": 87727, "fullname": "Zheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87727?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 130166, "fullname": "Xing Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130166?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 89312, "fullname": "Jingkuan Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/89312?format=json", "institution": "University of Electronic Science and Technology of China,"}, {"id": 84847, "fullname": "Heng Tao Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/84847?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "The inherent semantic ambiguity of \u201cmany-to-many\u201d, where one video matches multiple texts and vice versa, aggravates the difficulty in text-video retrieval. The dominant deterministic embeddings only struggle to capture the mean semantics, while existing probabilistic methods fail to distinguish hard negatives for their imposing rigid uncertainty priors or ignoring the interaction between similarity and uncertainty. To this end, we propose a novel physics-inspired framework (GraviAlign) that decomposes the alignment of cross-modal semantic distributions into two orthogonal factors inspired by the Gravitational Force: (1) Semantic Attraction measuring gravitational alignment between distribution centers via uncertainty-derived \u201csemantic mass\u201d and \u201csemantic distance\u201d; (2) Geometric Overlap quantifying distribution intersection. Each factor has independent veto power to reject those matches with misalignment or poor overlap. Additionally, GraviAlign offers an efficient (O(D)), theoretically grounded alternative to intractable joint integrals. Extensive experiments on DiDeMo, MSR-VTT, and ActivityNet demonstrate our effectiveness and superiority, and solid ablation studies confirm the indispensability of two novel components.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38672", "url": null, "sourceid": 33652, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39724, "uid": "307bd34c343595bb92124627df76d18e", "name": "TagSplat: Topology-Aware Gaussian Splatting for Dynamic Mesh Modeling and Tracking", "authors": [{"id": 144125, "fullname": "Hanzhi Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/144125?format=json", "institution": "Beijing Institute of Technology"}, {"id": 184635, "fullname": "dongdong weng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184635?format=json", "institution": null}, {"id": 192729, "fullname": "SUMO SUMO", "url": "http://cvpr.thecvf.com/api/miniconf/users/192729?format=json", "institution": null}, {"id": 192730, "fullname": "Yixiao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192730?format=json", "institution": "Beijing Institute of Technology"}, {"id": 192731, "fullname": "Xiaonuo Dongye", "url": "http://cvpr.thecvf.com/api/miniconf/users/192731?format=json", "institution": "Beijing Institute of Technology"}, {"id": 192732, "fullname": "Chenyu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192732?format=json", "institution": "Digital Xusheng (Beijing) Technology Co., Ltd"}], "abstract": "Topology-consistent dynamic model sequences are essential for applications such as animation and model editing. However, existing 4D reconstruction methods face challenges in generating high-quality topology-consistent meshes. To address this, we propose a topology-aware dynamic reconstruction framework based on Gaussian Splatting. We introduce a Gaussian topological structure that explicitly encodes spatial connectivity. This structure enables topology-aware densification and pruning, preserving the manifold consistency of the Gaussian representation. Temporal regularization terms further ensure topological coherence over time, while differentiable mesh rasterization improves mesh quality. Experimental results demonstrate that our method reconstructs topology-consistent mesh sequences with significantly higher accuracy than existing approaches. Moreover, the resulting meshes enable precise 3D keypoint tracking.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39724", "url": null, "sourceid": 40156, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36622, "uid": "54fe5a851b42e219fba334ed340defac", "name": "PosterIQ: A Design Perspective Benchmark for Poster Understanding and Generation", "authors": [{"id": 151645, "fullname": "Yuheng Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/151645?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 185492, "fullname": "Wen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185492?format=json", "institution": "Snapchat"}, {"id": 153062, "fullname": "Haodong Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153062?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 185493, "fullname": "Xingxing Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185493?format=json", "institution": "Hong Kong Polytechnic University"}], "abstract": "We present PosterIQ, a design-driven benchmark for poster understanding and generation, annotated across composition structure, typographic hierarchy, and semantic intent. It includes 7,765 image\u2013annotation instances and 822 generation prompts spanning real, professional, and synthetic cases. To bridge visual design cognition and generative modeling, we define tasks for layout parsing, text\u2013image correspondence, typography/readability and font perception, design quality assessment, and controllable, composition-aware generation with metaphor. We evaluate state-of-the-art MLLMs and diffusion-based generators, finding persistent gaps in visual hierarchy, typographic semantics, saliency control, and intention communication; commercial models lead on high-level reasoning but act as insensitive automatic raters, while generators render text well yet struggle with composition-aware synthesis. Extensive analyses show PosterIQ is both a quantitative benchmark and a diagnostic tool for design reasoning, offering reproducible, task-specific metrics. We aim to catalyze models' creativity and integrate human-centred design principles into generative vision\u2013language systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36622", "url": null, "sourceid": 35284, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37152, "uid": "24b23168fdaa0d98e9119081cd4fdd3e", "name": "CoCoVideo: The High-Quality Commercial-Model-Based Contrastive Benchmark for AI-Generated Video Detection", "authors": [{"id": 175496, "fullname": "Huidong Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/175496?format=json", "institution": "Xiamen University"}, {"id": 186788, "fullname": "Wentao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186788?format=json", "institution": "China Acedemy of Information and Communication Technology"}, {"id": 186789, "fullname": "Jie Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186789?format=json", "institution": null}, {"id": 186790, "fullname": "Xinqi Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/186790?format=json", "institution": "Xiamen University"}, {"id": 186791, "fullname": "Ruolong Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/186791?format=json", "institution": "China Academy of Informaton and Communications Technology"}, {"id": 186792, "fullname": "Yinglin Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186792?format=json", "institution": "Xiamen University"}, {"id": 186793, "fullname": "Yuxin Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186793?format=json", "institution": "Xiamen University"}, {"id": 91083, "fullname": "Ming Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/91083?format=json", "institution": "Xiamen University"}], "abstract": "With the rapid advancement of artificial intelligence generated content (AIGC) technologies, video forgery has become increasingly prevalent, posing new challenges to public discourse and societal security. Despite remarkable progress in existing deepfake detection methods, AIGC forgery detection remains challenging, as existing datasets mainly rely on open-source video generation models with quality far below commercial AIGC systems. Even datasets containing a few commercial samples often retain visible watermarks, compromising authenticity and hindering model generalization to high-fidelity AIGC videos. To address these issues, we introduce $CoCoVideo-26K$, a contrastive, commercial-model-based AIGC video dataset covering 13 mainstream commercial generators and providing semantically aligned real\u2013fake video pairs. This dataset enables deeper exploration of the differences between authentic and high-quality synthetic videos, establishes a new benchmark for highly realistic video forgery detection. Building on this dataset, we propose $CoCoDetect$, a detection framework integrating contrastive learning with confidence-gated multimodal large language model (MLLM) inference. An R3D-18 backbone extracts spatio-temporal representations, while a confidence gate routes uncertain cases to an MLLM for reasoning about physical plausibility and scene consistency. Extensive experiments on CoCoVideo-26K and public benchmarks demonstrate state-of-the-art performance, validating the framework\u2019s robustness and generality. Dataset and code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37152", "url": null, "sourceid": 46176, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37443, "uid": "6f58f3b172bef95307def60e8be52b00", "name": "PAUL: Uncertainty-Guided Partition and Augmentation for Robust Cross-View Geo-Localization under Noisy Correspondence", "authors": [{"id": 182133, "fullname": "Zheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182133?format=json", "institution": "National University of Defense Technology"}, {"id": 187458, "fullname": "Xueyi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187458?format=json", "institution": ""}, {"id": 186551, "fullname": "Yanming Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186551?format=json", "institution": "National University of Defense Technology"}, {"id": 187459, "fullname": "Yuxiang Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/187459?format=json", "institution": null}, {"id": 187460, "fullname": "Ding Zhaoyun", "url": "http://cvpr.thecvf.com/api/miniconf/users/187460?format=json", "institution": "nudt"}, {"id": 187461, "fullname": "Siqi Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/187461?format=json", "institution": "Harbin Institute of Technology"}, {"id": 85561, "fullname": "Haizhou Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/85561?format=json", "institution": "The Chinese University of Hong Kong (Shenzhen); National University of Singapore"}, {"id": 187462, "fullname": "Mingrui Lao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187462?format=json", "institution": "National University of Defense Technology"}], "abstract": "Cross-view geo-localization is a critical task for UAV navigation, event detection, and aerial surveying, which establish correspondence between drone-captured and satellite imagery. Most existing approaches embed cross-view data into a joint feature space to maximize similarity between paired images. However, these methods typically assume perfect alignment of image pairs in training data, an assumption that rarely holds in practical scenarios. In real-world conditions, factors such as urban canyon effects, electromagnetic interference, and adverse weather frequently induce GPS drift, resulting in systematic **alignment shifts** where only partial correspondences exist between image pairs. Despite its prevalence, this source of noisy correspondence has received limited attention in current research.To our best knowledge, this work presents the first systematic investigation of the **Noisy Correspondence in Cross-View Geo-Localization (NC-CVGL)** problem, specifically addressing the practical scenario where a significant portion of training pairs exhibit spatial misalignment due to GPS inaccuracies. To this end, we propose **PAUL** (**P**artition and **A**ugmentation by **U**ncertainty **L**earning), a framework that achieves noise-robust learning through three coordinated mechanisms: **Co-partition** separates noisy from clean samples using data uncertainty and loss patterns; **Co-augmentation** enhances high-confidence regions via local assessment; and **Co-training** refines learning through mutual supervision between dual networks.Unlike conventional noise-handling methods that filter or relabel noisy samples, PAUL effectively utilizes noisy data through this triple collaborative mechanism. Comprehensive experiments validate the effectiveness of individual components in PAUL, which consistently achieves superior performance over other competitive noisy-correspondence-driven methods in various noise ratios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37443", "url": null, "sourceid": 44881, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36791, "uid": "a59e3dbde6c1420a4940b07ed8c96c47", "name": "Hierarchical Process Reward Models are Symbolic Vision Learners", "authors": [{"id": 181336, "fullname": "Shan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181336?format=json", "institution": "Australian Institute for Machine Learning (AIML)"}, {"id": 185881, "fullname": "Aotian Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185881?format=json", "institution": "Ohio State University"}, {"id": 185882, "fullname": "Kai Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/185882?format=json", "institution": "University of Science and Technology of China"}, {"id": 89311, "fullname": "Jindong Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89311?format=json", "institution": "University of Oxford &amp; Google Research"}, {"id": 153067, "fullname": "Yuan Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/153067?format=json", "institution": "Ohio State University, Columbus"}, {"id": 88134, "fullname": "Anton van den Hengel", "url": "http://cvpr.thecvf.com/api/miniconf/users/88134?format=json", "institution": "University of Adelaide"}], "abstract": "Symbolic computer vision represents diagrams through explicit logical rules and structured representations, enabling interpretable understanding in machine vision. This requires fundamentally different learning paradigms from pixel-based visual models. Symbolic visual learners parse diagrams into geometric primitives\u2014points, lines, and shapes\u2014whereas pixel-based learners operate on textures and colors. We propose a novel self-supervised symbolic auto-encoder that encodes diagrams into structured primitives and their interrelationships within the latent space, and decodes them through our executable engine to reconstruct the input diagrams. Central to this architecture is SymHPR (Symbolic Hierarchical Process Reward Modeling), which applies hierarchical step-level parsing rewards to enforce point-on-line, line-on-shape, and shape-on-relation consistency. Since vanilla reinforcement learning exhibits poor exploration in the policy space during diagram reconstruction; we thus introduce stabilization mechanisms to balance exploration and exploitation. We fine-tune our symbolic encoder on downstream tasks, developing a neuro-symbolic system that integrates the reasoning capabilities of neural networks with the interpretability of symbolic models through reasoning-grounded visual rewards. Evaluations across reconstruction, perception, and reasoning tasks demonstrate the effectiveness of our approach: achieving a 98.2\\% reduction in MSE for geometric diagram reconstruction, surpassing GPT-4o by 0.6\\% with a 7B model on chart reconstruction, and improving by +13\\% on the MathGlance perception benchmark, and by +3\\% on MathVerse and GeoQA reasoning benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36791", "url": null, "sourceid": 37701, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36690, "uid": "1846568d5a791a321a2b27eb5734dded", "name": "ADSeeker: A Knowledge-Grounded Reasoning Framework for Industry Anomaly Detection and Reasoning", "authors": [{"id": 175402, "fullname": "Kai Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/175402?format=json", "institution": "Shandong University"}, {"id": 185648, "fullname": "Zekai Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185648?format=json", "institution": "Shandong University"}, {"id": 185649, "fullname": "Xihe Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/185649?format=json", "institution": "Shandong University"}, {"id": 185650, "fullname": "Anpeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185650?format=json", "institution": "Shandong University"}, {"id": 185651, "fullname": "Jingmeng Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/185651?format=json", "institution": "Shandong University"}, {"id": 185652, "fullname": "Qinghui Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185652?format=json", "institution": "Shandong University; Laoshan Laboratory"}, {"id": 185653, "fullname": "Han Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185653?format=json", "institution": "Shandong University"}, {"id": 91041, "fullname": "Jianyuan Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/91041?format=json", "institution": "University of Sydney"}, {"id": 157621, "fullname": "jinglin zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157621?format=json", "institution": "Shandong University"}], "abstract": "Automatic vision inspection holds significant importance in industry inspection. While multimodal large language models (MLLMs) exhibit strong language understanding capabilities and hold promise for this task, their performance remains significantly inferior to that of human experts. In this context, we identify two key challenges: (i) insufficient integration of anomaly detection (AD) knowledge during pre-training and (ii) the lack of technically precise and context-aware language generation for anomaly reasoning. To address these issues, we propose ADSeeker, an anomaly task assistant designed to enhance inspection performance through knowledge-grounded reasoning. ADSeeker first leverages a curated visual document knowledge base, SEEK-MVTec\\&VisA (SEEK-M\\&V), which we construct to address the limitations of existing resources that rely solely on unstructured text. SEEK-M\\&V includes semantic-rich descriptions and image-document pairs, enabling more comprehensive anomaly understanding. To effectively retrieve and utilize this knowledge, we introduce the Query Image-Knowledge Retrieval-Augmented Generation (Q2K RAG) framework. To further enhance the performance in zero-shot anomaly detection (ZSAD), ADSeeker leverages the Hierarchical Sparse Prompt mechanism and type-level features to efficiently extract anomaly patterns. Furthermore, to tackle the challenge of limited industry anomaly detection (IAD) data, we introduce the largest-scale AD dataset, Multi-type Anomaly (MulA), encompassing 72 multi-scale defect types across 26 categories. Extensive experiments show that our plug-and-play framework, ADSeeker, achieves state-of-the-art zero-shot performance on several benchmark datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36690", "url": null, "sourceid": 31674, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38355, "uid": "f888782cbdc8c8cad62044b2150782dc", "name": "Adapting Lightweight Image-based Counting Models for Video Crowd Counting", "authors": [{"id": 189703, "fullname": "Weibo Shu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189703?format=json", "institution": "City University of Hong Kong"}, {"id": 88738, "fullname": "Antoni B. Chan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88738?format=json", "institution": "City University of Hong Kong"}], "abstract": "Video crowd counting aims to predict the people count in each frame of a video. It requires effectively leveraging spatio-temporal (ST) information in videos while satisfying real-time constraints. However, most existing methods use ST information from neighboring frames through auxiliary extraction and fusion modules---resulting in large computational cost and the need to buffer multiple frames during inference. Such designs limit their practicality in real-world applications with limited computational resources or stringent real-time requirements. To address these issues, we revisit video crowd counting from the perspective of lightweight image-based counting models that enable real-time deployment under limited resources. We analytically define ST information in a model-independent and statistically interpretable manner, and incorporate it into training via a statistical regularizer that effectively enhances model performance without adding modules or inference overhead. Most framework hyperparameters are further formulated as statistical inference problems, allowing automatic estimation from data and consequently efficient adaptation to new scenarios.Our framework unifies video crowd counting and image-based counting models under a compact, principled formulation that is lightweight, portable, and efficient. We also establish theoretical foundations for adapting image-based counting models to video crowd counting and achieve state-of-the-art accuracy and efficiency across six benchmarks, including challenging DRONECROWD and VSCROWD.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38355", "url": null, "sourceid": 45123, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36295, "uid": "c6734b5d94fdf74db73175ba5429ec11", "name": "RoSAMDepth: Robust Self-supervised Depth Estimation Leveraging Segment Anything Model", "authors": [{"id": 143641, "fullname": "Xuanang Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/143641?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 150494, "fullname": "Ning Zhiwei", "url": "http://cvpr.thecvf.com/api/miniconf/users/150494?format=json", "institution": "shanghai jiaotong university"}, {"id": 184710, "fullname": "Gengming Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184710?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 160086, "fullname": "Jiaxi Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/160086?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 184711, "fullname": "Runze Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184711?format=json", "institution": "Macquarie University; Shanghai Jiaotong University"}, {"id": 178830, "fullname": "Zhonglong Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/178830?format=json", "institution": "Zhejiang Normal University"}, {"id": 184712, "fullname": "JIE YANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/184712?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 184713, "fullname": "Rong Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184713?format=json", "institution": null}, {"id": 184714, "fullname": "Wei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184714?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Robust depth estimation aims to maintain high-quality depths across diverse conditions. However, most existing methods estimate depth without taking into account the object-level information. As a result, the predicted depth may easily deviate within objects and become blurred under adverse conditions. To overcome this weakness, we propose RoSAMDepth, a novel framework that can assist robust self-supervised depth estimation in leveraging rich and diverse object-level priors from the Segment Anything Model (SAM). We focus on incorporating object-level information across three key aspects: a segment-guided representation contrasting method that injects object-level awareness into the feature representation space; an adaptive regional outlier masking strategy combined with a regional Gaussian likelihood loss that enforces regional depth smoothness; and an object-level reliability estimation strategy that mitigates the influence of unreliable supervision. Extensive experiments across multiple datasets and diverse weather conditions demonstrate that our method produces sharper, more accurate depth predictions, consistently outperforming state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36295", "url": null, "sourceid": 33034, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38968, "uid": "4b52ddb7a0b6ac8553536d004c895b59", "name": "Portable Active Learning for Object Detection", "authors": [{"id": 183230, "fullname": "Rashi Sharma", "url": "http://cvpr.thecvf.com/api/miniconf/users/183230?format=json", "institution": "Panasonic R&amp;D Center Singapore"}, {"id": 191087, "fullname": "Justin Timothy C. Bersamin", "url": "http://cvpr.thecvf.com/api/miniconf/users/191087?format=json", "institution": "Nanyang Technological University"}, {"id": 191088, "fullname": "Karthikk Subramanian", "url": "http://cvpr.thecvf.com/api/miniconf/users/191088?format=json", "institution": "Panasonic R&amp;D Center Singapore"}], "abstract": "Annotating bounding boxes is costly and limits the scalability of object detection. This challenge is compounded by the need to preserve high accuracy while minimizing manual effort in real-world applications. Prior active learning (AL) methods often depend on model features or modify detector internals and training schedules, increasing integration overhead. Moreover, they rarely jointly exploit the benefits of image-level signals, class-imbalance cues, and instance-level uncertainty for comprehensive selection. We present Portable Active Learning (PAL), a detector-agnostic, easily portable framework that operates solely on inference outputs. PAL combines class-wise instance uncertainty with image-level diversity to guide data selection. At each round, PAL trains lightweight class-specific logistic classifiers to distinguish true from false positives, producing entropy-based uncertainty scores for proposals. Candidate images are then refined using global image entropy, class diversity, and image similarity, yielding batches that are both informative and diverse. PAL requires no changes to model internals or training pipelines, ensuring broad compatibility across detectors. Extensive experiments on COCO, PASCAL VOC, and BDD100K demonstrate that PAL consistently improves label efficiency and detection accuracy compared to existing active learning baselines, making it a practical solution for scalable and cost-effective deployment of object detection in real-world settings.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38968", "url": null, "sourceid": 46765, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39471, "uid": "040c46d8f724cba41e33a18b108df7bd", "name": "CraftMesh: High-Fidelity Generative Mesh Manipulation via Poisson Seamless Fusion", "authors": [{"id": 192138, "fullname": "James Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192138?format=json", "institution": "Independent Researcher; Hefei Thomas School"}, {"id": 192139, "fullname": "Yuxiao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192139?format=json", "institution": "University of Science and Technology of China"}, {"id": 188410, "fullname": "Youcheng Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/188410?format=json", "institution": "University of Science and Technology of China"}, {"id": 85084, "fullname": "Ligang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85084?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Controllable, high-fidelity mesh editing remains a significant challenge in the domain of 3D content creation. Existing generative methods often struggle with complex geometries and fail to preserve fine-scale details. We propose CraftMesh, a novel framework for high-fidelity generative mesh manipulation based on Poisson Seamless Fusion. Our key insight is to decompose mesh editing into a pipeline that leverages the strengths of 2D image editing and 3D generative modeling: we first edit a 2D reference image, then generate a 3D mesh corresponding to the edited region, and fuse it seamlessly into the original mesh through a Joint Geometry and Appearance Fusion framework built on a hybrid SDF/Mesh representation to enable Poisson Geometry Blending and Poisson Texture Harmonization. Experimental results demonstrate that CraftMesh outperforms state-of-the-art methods, delivering improved structural consistency, richer local geometric and appearance details in challenging editing scenarios. The implementation will be released publicly upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39471", "url": null, "sourceid": 37449, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38673, "uid": "8e658a7376fbef31c3bb99e4dc2c7d6d", "name": "Global-Graph Guided and Local-Graph Weighted Contrastive Learning for Unified Clustering on Incomplete and Noise Multi-View Data", "authors": [{"id": 142863, "fullname": "Hongqing He", "url": "http://cvpr.thecvf.com/api/miniconf/users/142863?format=json", "institution": "Guangxi Normal University"}, {"id": 128706, "fullname": "Jie Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/128706?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 190436, "fullname": "Wenyuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190436?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 190437, "fullname": "Yonghua Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190437?format=json", "institution": "Singapore University of Technology and Design"}, {"id": 190438, "fullname": "Guoqiu Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190438?format=json", "institution": "Guangxi Normal University"}, {"id": 126913, "fullname": "Xiaofeng Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126913?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Recently, contrastive learning (CL) plays an important role in exploring complementary information for multi-view clustering (MVC) and has attracted increasing attention. Nevertheless, real-world multi-view data suffer from data incompleteness or noise, resulting in rare-paired samples or mis-paired samples which significantly challenges the effectiveness of CL-based MVC. That is, rare-paired issue prevents MVC from extracting sufficient multi-view complementary information, and mis-paired issue causes contrastive learning to optimize the model in the wrong direction. To address these issues, we propose a unified CL-based MVC framework for enhancing clustering effectiveness on incomplete and noise multi-view data. First, to overcome the rare-paired issue, we design a global-graph guided contrastive learning, where all view samples construct a global-view affinity graph to form new sample pairs for fully exploring complementary information. Second, to mitigate the mis-paired issue, we propose a local-graph weighted contrastive learning, which leverages local neighbors to generate pair-wise weights to adaptively strength or weaken the pair-wise contrastive learning. Our method is imputation-free and can be integrated into a unified global-local graph-guided contrastive learning framework. Extensive experiments on both incomplete and noise settings of multi-view data demonstrate that our method achieves superior performance compared with state-of-the-art approaches.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38673", "url": null, "sourceid": 40999, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38470, "uid": "18208b5c8aefff325ade802e6985a0c1", "name": "MoCoDiff: A Controllable Autoregressive Diffusion Model for Expressive Motion Generation", "authors": [{"id": 129753, "fullname": "Wenfeng Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/129753?format=json", "institution": "Beijing Information Science and Technology University"}, {"id": 183963, "fullname": "Xuehan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183963?format=json", "institution": "Beijing Information Science and Technology University"}, {"id": 129752, "fullname": "Shuai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/129752?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 189920, "fullname": "Yi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/189920?format=json", "institution": "Beijing Technology and Business University"}, {"id": 189921, "fullname": "Yuting Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/189921?format=json", "institution": "Beijing Information Science and Technology University"}, {"id": 189922, "fullname": "Zhenyu Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189922?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 129758, "fullname": "Xingliang Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/129758?format=json", "institution": "Beijing information science and technology university"}, {"id": 126925, "fullname": "Chenglizhao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/126925?format=json", "institution": "China University of Petroleum"}, {"id": 131397, "fullname": "Fei Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/131397?format=json", "institution": "Institute of Software, Chinese Academy of Sciences"}, {"id": 102436, "fullname": "Hongyu Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/102436?format=json", "institution": "Beihang University"}, {"id": 103538, "fullname": "Aimin Hao", "url": "http://cvpr.thecvf.com/api/miniconf/users/103538?format=json", "institution": "Beihang University"}], "abstract": "Diffusion-based motion generation has advanced rapidly, but current methods still struggle with long-horizon consistency, style control, and multi-condition guidance. A major reason is the fused-conditioning design, where semantic, stylistic, and temporal signals share a single pathway, causing interference and limiting controllability.We propose MoCoDiff, a controlable autoregressive diffusion framework that introduces Injection Modulation Controllers (IMC). IMC is a lightweight, modality-specific linear modulation modules that inject text, style, and history signals through separate conditioning paths. IMC preserves the simplicity of a frozen backbone while avoiding the entanglement inherent to fused conditioning, enabling more stable and interpretable multi-condition control.To further enhance long-range synthesis, we develop a controllable autoregressive diffusion model equipped with Temporal IMC (TIMC), which applies history as a timestep-dependent corrective signal. This controllable formulation actively suppresses drift, enforces smooth transitions across motion segments, and significantly improves temporal coherence over extended sequences.Experiments show that MoCoDiff achieves state-of-the-art style fidelity, transition quality, and efficiency, while supporting flexible and interpretable multi-condition motion synthesis without retraining.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38470", "url": null, "sourceid": 30955, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37783, "uid": "f52b5a5a83b60359562d12342f8e5950", "name": "RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting", "authors": [{"id": 188256, "fullname": "Xuezhen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188256?format=json", "institution": "HKUST"}, {"id": 187731, "fullname": "Li Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/187731?format=json", "institution": "Scanline VFX"}, {"id": 149071, "fullname": "Yulin Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/149071?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 128022, "fullname": "Zeyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128022?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 90261, "fullname": "Pedro V. Sander", "url": "http://cvpr.thecvf.com/api/miniconf/users/90261?format=json", "institution": "Hong Kong University of Science and Technology"}], "abstract": "Temporal retiming, the ability to reconstruct and render dynamic scenes at arbitrary timestamps, is crucial for applications such as slow-motion playback, temporal editing, and post-production. However, most existing 4D Gaussian Splatting (4DGS) methods overfit at discrete frame indices but struggle to represent continuous-time frames, leading to ghosting artifacts when interpolating between timestamps. We identify this limitation as a form of temporal aliasing and propose RetimeGS, a simple yet effective 4DGS representation that explicitly defines the temporal behavior of the 3D Gaussian and mitigates temporal aliasing. To achieve smooth and consistent interpolation, we incorporate optical flow\u2013guided initialization and supervision, triple-rendering supervision, and other targeted strategies. Together, these components enable ghost-free, temporally coherent rendering even under large motions. Experiments on datasets featuring fast motion, non-rigid deformation, and severe occlusions demonstrate that RetimeGS achieves superior quality and coherence over state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37783", "url": null, "sourceid": 41437, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40312?format=json"], "related_events_ids": [40312]}, {"id": 38847, "uid": "89ba5a9ed3d76c448fed83a18275b451", "name": "Accelerating Diffusion Model Training under Minimal Budgets: A  Condensation-Based Perspective", "authors": [{"id": 181493, "fullname": "Rui Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181493?format=json", "institution": "UESTC"}, {"id": 131008, "fullname": "Shitong Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/131008?format=json", "institution": "Southeast University"}, {"id": 190825, "fullname": "zikai zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190825?format=json", "institution": "Hong Kong University of Science and Technology(Guangzhou)"}, {"id": 190826, "fullname": "Pukun Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190826?format=json", "institution": "Guangdong University of Finance &amp; Economics"}, {"id": 190827, "fullname": "Hangyu Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/190827?format=json", "institution": "Alibaba Group"}, {"id": 152540, "fullname": "Tian Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/152540?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 187087, "fullname": "Lichen Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/187087?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 185431, "fullname": "Shuo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185431?format=json", "institution": "Harbin Institute of Technology (Shenzhen)"}, {"id": 73868, "fullname": "Zeke Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/73868?format=json", "institution": "Baidu Research"}], "abstract": "Diffusion models have achieved remarkable performance on a wide range of generative tasks, yet training them from scratch is notoriously resource-intensive, typically requiring millions of training images and many GPU days. Motivated by a data-centric view of this bottleneck, we adopt a condensation-based perspective: given a large training set, the goal is to construct a much smaller condensed dataset that still supports training strong diffusion models under minimal data and compute budgets. To operationalize this perspective, we introduce Diffusion Dataset Condensation ($D^2C$), a two-phase framework comprising Select and Attach. In the Select phase, a diffusion difficulty score combined with interval sampling is used to identify a compact, informative training subset from the original data. Building on this subset, the Attach phase further strengthens the conditional signals by augmenting each selected image with rich semantic and visual representations. To our knowledge, $D^2C$ is the first framework that systematically investigates dataset condensation for diffusion models, whereas prior condensation methods have mainly targeted discriminative architectures. Extensive experiments across data budgets (0.8%\u20138% of ImageNet), model architectures, and image resolutions demonstrate that $D^2C$ dramatically accelerates diffusion model training while preserving high generative quality. On ImageNet $256^2$ with SiT-XL/2, $D^2C$ attains a FID of 4.3 in just 40k steps using only 0.8% of the training images, corresponding to about 233x and 100x faster training than vanilla SiT-XL/2 and SiT-XL/2 + REPA, respectively.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38847", "url": null, "sourceid": 36500, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39641, "uid": "7f4844039fc3e8da54ac036380e801f3", "name": "Seeing Depth Through Frequency and Motion: A Progressive Training Paradigm for Monocular Depth Estimation", "authors": [{"id": 182355, "fullname": "Ke Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/182355?format=json", "institution": "Dalian Maritime University"}, {"id": 192549, "fullname": "Bolin Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/192549?format=json", "institution": "Dalian Martime University"}, {"id": 192550, "fullname": "Hongbo Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192550?format=json", "institution": "Dalian Maritime University"}], "abstract": "Self-supervised monocular depth estimation has achieved remarkable progress in recent years, yet frequency aliasing and the lack of fine-grained cross-frame motion modeling still lead to blurred depth boundaries and suboptimal camera motion estimation.To address these challenges, we propose a progressive self-supervised framework that integrates a Frequency-Guided Depth Network (FGDepth) and a PoseQuery Network (PQNet). FGDepth incorporates a plug-and-play Frequency-Guided Sampling module that explicitly enhances high-frequency details and suppresses aliasing artifacts, producing depth maps with sharper boundaries. PQNet employs channel-aligned attention to model fine-grained cross-frame motion features, enabling more accurate and robust camera motion estimation. Furthermore, we design a progressive three-stage decoupled training strategy that effectively leverages the complementarity between depth and pose estimation, further improving overall performance.Extensive experiments on the KITTI benchmark demonstrate state-of-the-art performance, achieving a 4.1% reduction in Sq Rel over strong baselines, and our method also exhibits excellent cross-dataset generalization on Make3D. Ablation studies further validate the effectiveness of each proposed component.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39641", "url": null, "sourceid": 33713, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40312, "uid": "f52b5a5a83b60359562d12342f8e5950", "name": "RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting", "authors": [{"id": 188256, "fullname": "Xuezhen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188256?format=json", "institution": "HKUST"}, {"id": 187731, "fullname": "Li Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/187731?format=json", "institution": "Scanline VFX"}, {"id": 149071, "fullname": "Yulin Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/149071?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 128022, "fullname": "Zeyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128022?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 90261, "fullname": "Pedro V. Sander", "url": "http://cvpr.thecvf.com/api/miniconf/users/90261?format=json", "institution": "Hong Kong University of Science and Technology"}], "abstract": "Temporal retiming, the ability to reconstruct and render dynamic scenes at arbitrary timestamps, is crucial for applications such as slow-motion playback, temporal editing, and post-production. However, most existing 4D Gaussian Splatting (4DGS) methods overfit at discrete frame indices but struggle to represent continuous-time frames, leading to ghosting artifacts when interpolating between timestamps. We identify this limitation as a form of temporal aliasing and propose RetimeGS, a simple yet effective 4DGS representation that explicitly defines the temporal behavior of the 3D Gaussian and mitigates temporal aliasing. To achieve smooth and consistent interpolation, we incorporate optical flow\u2013guided initialization and supervision, triple-rendering supervision, and other targeted strategies. Together, these components enable ghost-free, temporally coherent rendering even under large motions. Experiments on datasets featuring fast motion, non-rigid deformation, and severe occlusions demonstrate that RetimeGS achieves superior quality and coherence over state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40312", "url": null, "sourceid": -41437, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/37783?format=json"], "related_events_ids": [37783]}, {"id": 37382, "uid": "f5995c90359dcd1defe22a90d936fc78", "name": "VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction", "authors": [{"id": 187304, "fullname": "SiNan Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/187304?format=json", "institution": "Tsinghua University"}, {"id": 187305, "fullname": "JiaHao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187305?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 187306, "fullname": "Bo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187306?format=json", "institution": "Kuaishou- \u5feb\u624b\u79d1\u6280"}, {"id": 185628, "fullname": "Shuhao Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/185628?format=json", "institution": "Kuaishou Technology"}, {"id": 90192, "fullname": "Zhengzhuo Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90192?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 187307, "fullname": "Yifu Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187307?format=json", "institution": "Tsinghua University"}, {"id": 132236, "fullname": "Yongxian Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/132236?format=json", "institution": "Tsinghua University"}, {"id": 156268, "fullname": "Kun Gai", "url": "http://cvpr.thecvf.com/api/miniconf/users/156268?format=json", "institution": "Kuaishou- \u5feb\u624b\u79d1\u6280"}, {"id": 86179, "fullname": "Xinggang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86179?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 185630, "fullname": "Kai Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185630?format=json", "institution": null}, {"id": 86975, "fullname": "Chun Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/86975?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Unifying multimodal understanding, generation and reconstruction representation in a single tokenizer remains a key challenge in building unified models. Previous research predominantly attempts to address this in a dual encoder paradigm, e.g., utilizing the separate encoders for understanding and generation respectively or balancing semantic representations and low-level features with contrastive loss.  In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, which pioneers the first exploration in unified representation to produce Continuous semantic features for image understanding and Discrete tokens for visual generation within a unified tokenizer. Specifically, we build upon pretrained vision foundation models with a symmetric ViT decoder and adopt a two-stage training strategy: first, it freezes the encoder and learns a high-dimensional semantic VQ codebook with pixel reconstruction objective; then jointly optimizes the encoder with self-distillation constraints.  This design enables negligible semantic information for maintaining the ability of multimodal understanding, discrete tokens that are compatible for generation and fine-grained reconstruction.  Besides, we identify the intriguing property in quantizing semantic encoders that rely on high-dimensional codebook in contrast to the previous common practice of low-dimensional codebook in image reconstruction. The semantic VQ codebook can achieve a 100% utilization ratio at a dimension of 1536. VQRAE presents competitive performance on several benchmarks of visual understanding, generation and reconstruction with promising scaling property in the autoregressive paradigm for its discrete merits.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37382", "url": null, "sourceid": 35218, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36679, "uid": "e94378d015c20a82b0585c1011f97d7e", "name": "A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space", "authors": [{"id": 185627, "fullname": "Huijie Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185627?format=json", "institution": "Beihang University"}, {"id": 185628, "fullname": "Shuhao Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/185628?format=json", "institution": "Kuaishou Technology"}, {"id": 185629, "fullname": "Haoxiang Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185629?format=json", "institution": "South China Normal University"}, {"id": 183810, "fullname": "Shuai Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/183810?format=json", "institution": "Beihang University"}, {"id": 185630, "fullname": "Kai Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185630?format=json", "institution": null}, {"id": 87865, "fullname": "Guoliang Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87865?format=json", "institution": "Beihang University"}], "abstract": "Innovative visual stylization is a cornerstone of artistic creation, yet generating novel and consistent visual styles remains a significant challenge. Existing generative approaches typically rely on lengthy textual prompts, reference images, or parameter-efficient fine-tuning to guide style-aware image generation, but often struggle with style consistency, limited creativity, and complex style representations. In this paper, we consider the code-to-style image generation task, which aims to produce images with novel and consistent visual styles specified by only a numerical code. To date, this field has only been primarily explored by the industry (e.g., Midjourney), with no open-source research from the academic community. To fill this gap, we propose CoTyle, the first open-source method for this task. Specifically, we first train a discrete style codebook from a collection of images to extract style embeddings. These embeddings serve as conditions for a text-to-image diffusion model (T2I-DM) to generate stylistic images. Subsequently, we train an autoregressive style generator on the discrete style embeddings to model their distribution, allowing the synthesis of novel style embeddings. During inference, a numerical style code is mapped to a unique style embedding by the style generator, and this embedding guides the T2I-DM to generate images in the corresponding style. Extensive experiments validate that CoTyle effectively converts a numerical code into a style controller, demonstrating a style is worth one code. Compared to existing methods, the stylized images generated by our method are more diverse and consistent, unlocking a vast space of reproducible styles from minimal input.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36679", "url": null, "sourceid": 32702, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40271?format=json"], "related_events_ids": [40271]}, {"id": 40271, "uid": "e94378d015c20a82b0585c1011f97d7e", "name": "A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space", "authors": [{"id": 185627, "fullname": "Huijie Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185627?format=json", "institution": "Beihang University"}, {"id": 185628, "fullname": "Shuhao Cui", "url": "http://cvpr.thecvf.com/api/miniconf/users/185628?format=json", "institution": "Kuaishou Technology"}, {"id": 185629, "fullname": "Haoxiang Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185629?format=json", "institution": "South China Normal University"}, {"id": 183810, "fullname": "Shuai Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/183810?format=json", "institution": "Beihang University"}, {"id": 185630, "fullname": "Kai Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185630?format=json", "institution": null}, {"id": 87865, "fullname": "Guoliang Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87865?format=json", "institution": "Beihang University"}], "abstract": "Innovative visual stylization is a cornerstone of artistic creation, yet generating novel and consistent visual styles remains a significant challenge. Existing generative approaches typically rely on lengthy textual prompts, reference images, or parameter-efficient fine-tuning to guide style-aware image generation, but often struggle with style consistency, limited creativity, and complex style representations. In this paper, we consider the code-to-style image generation task, which aims to produce images with novel and consistent visual styles specified by only a numerical code. To date, this field has only been primarily explored by the industry (e.g., Midjourney), with no open-source research from the academic community. To fill this gap, we propose CoTyle, the first open-source method for this task. Specifically, we first train a discrete style codebook from a collection of images to extract style embeddings. These embeddings serve as conditions for a text-to-image diffusion model (T2I-DM) to generate stylistic images. Subsequently, we train an autoregressive style generator on the discrete style embeddings to model their distribution, allowing the synthesis of novel style embeddings. During inference, a numerical style code is mapped to a unique style embedding by the style generator, and this embedding guides the T2I-DM to generate images in the corresponding style. Extensive experiments validate that CoTyle effectively converts a numerical code into a style controller, demonstrating a style is worth one code. Compared to existing methods, the stylized images generated by our method are more diverse and consistent, unlocking a vast space of reproducible styles from minimal input.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40271", "url": null, "sourceid": -32702, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/36679?format=json"], "related_events_ids": [36679]}, {"id": 39780, "uid": "85bcd4523877c541065a40d7e1563269", "name": "Generalizable Video Quality Assessment via Weak-to-Strong Learning", "authors": [{"id": 185457, "fullname": "Linhan Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185457?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 184429, "fullname": "Wei Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/184429?format=json", "institution": "East China Normal University"}, {"id": 185780, "fullname": "Xiangyang Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185780?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 156244, "fullname": "Kaiwei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156244?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 127773, "fullname": "Jun Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/127773?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 192834, "fullname": "Yicong Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192834?format=json", "institution": null}, {"id": 156245, "fullname": "Dandan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156245?format=json", "institution": "East China Normal University"}, {"id": 86659, "fullname": "Guangtao Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/86659?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 89522, "fullname": "Xiongkuo Min", "url": "http://cvpr.thecvf.com/api/miniconf/users/89522?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Video quality assessment (VQA) seeks to predict the perceptual quality of a video in alignment with human visual perception, serving as a fundamental tool for quantifying quality degradation across video processing workflows. The dominant VQA paradigm relies on supervised training with human-labeled datasets, which, despite substantial progress, still suffers from poor generalization to unseen video content. In this work, we explore \\textbf{weak-to-strong (W2S) learning} as a new paradigm for advancing VQA without reliance on human-labeled datasets. We first provide empirical evidence that a straightforward W2S strategy allows a strong student model to not only match its weak teacher on in-domain benchmarks but also surpass it on out-of-distribution (OOD) benchmarks, revealing a \\textbf{distinct weak-to-strong effect in VQA}. Building on this insight, we propose a novel framework that enhances W2S learning from two aspects: (1) \\textbf{integrating homogeneous and heterogeneous supervision signals} from diverse VQA teachers---including off-the-shelf VQA models and synthetic distortion simulators---via a learn-to-rank formulation, and (2) \\textbf{iterative W2S training}, where each strong student is recycled as the teacher in subsequent cycles, progressively focusing on challenging cases. Extensive experiments show that our method achieves state-of-the-art results across both in-domain and OOD benchmarks, with especially strong gains in OOD scenarios. Our findings highlight W2S learning as a principled route to break annotation barriers and achieve scalable generalization in video quality assessment.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39780", "url": null, "sourceid": 30759, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36748, "uid": "db8b1247beaa42158ebfe5c489ca36db", "name": "A\u00b3: Towards Advertising Aesthetic Assessment", "authors": [{"id": 152304, "fullname": "Kaiyuan Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/152304?format=json", "institution": "East China Normal University"}, {"id": 185778, "fullname": "Yixuan Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185778?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 185779, "fullname": "Lu Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/185779?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 147291, "fullname": "Yushuo Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/147291?format=json", "institution": "Shang Hai Jiaotong University"}, {"id": 152884, "fullname": "Zijian Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/152884?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 185456, "fullname": "Jianbo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185456?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 185780, "fullname": "Xiangyang Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185780?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 152151, "fullname": "Yuan Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/152151?format=json", "institution": "Shanghai AI Lab"}, {"id": 89537, "fullname": "Zicheng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89537?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 86659, "fullname": "Guangtao Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/86659?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Advertising images significantly impact commercial conversion rates and brand equity, yet current evaluation methods rely on subjective judgments, lacking scalability, standardized criteria, and interpretability. To address these challenges, we present **A\u00b3 (Advertising Aesthetic Assessment)**, a comprehensive framework encompassing four components: a paradigm (**A\u00b3-Law**), a dataset (**A\u00b3-Dataset**), a multimodal large language model (**A\u00b3-Align**), and a benchmark (**A\u00b3-Bench**). Central to A\u00b3 is a theory-driven paradigm, A\u00b3-Law, comprising three hierarchical stages: (1) Perceptual Attention, evaluating perceptual image signals for their ability to attract attention; (2) Formal Interest, assessing formal composition of image color and spatial layout in evoking interest; and (3) Desire Impact, measuring desire evocation from images and their persuasive impact. Building on A\u00b3-Law, we construct A\u00b3-Dataset with 120K instruction-response pairs from 30K advertising images, each richly annotated with multi-dimensional labels and Chain-of-Thought (CoT) rationales. We further develop A\u00b3-Align, trained under A\u00b3-Law with CoT-guided learning on A\u00b3-Dataset. Extensive experiments on A\u00b3-Bench demonstrate that A\u00b3-Align achieves superior alignment with A\u00b3-Law compared to existing models, and this alignment generalizes well to quality advertisement selection and prescriptive advertisement critique, indicating its potential for broader deployment.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36748", "url": null, "sourceid": 33230, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38497, "uid": "4a643e03578577526149a599ac6f332b", "name": "Active Perceptual Inference: A Corticothalamic-Inspired Dynamic Nested Recurrent Network for Multimodal Sentiment Analysis with Incomplete Data", "authors": [{"id": 189985, "fullname": "Yujuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189985?format=json", "institution": "Beijing Institute of Technology"}, {"id": 183716, "fullname": "Qing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183716?format=json", "institution": "Beijing Institute of Technology"}, {"id": 189986, "fullname": "Ziyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189986?format=json", "institution": "Beijing Institute of Technology"}, {"id": 189987, "fullname": "Xiuxing Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189987?format=json", "institution": "Beijing Institute of Technology"}, {"id": 189988, "fullname": "Zhuo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189988?format=json", "institution": "Beijing Institute of Technology"}, {"id": 189989, "fullname": "Mengrui Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189989?format=json", "institution": "Beijing Institute of Technology"}, {"id": 189990, "fullname": "Xia Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189990?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "Random frame-level data missing is a critical challenge in multimodal sentiment analysis. Existing methods are largely limited to passive completion via single-pass feedforward connections and static cross-modal fusion, which struggle to generate high-quality completed features. However, the brain is not a passive recipient of external information but a dynamic system for active perceptual inference. Its core lies in the dynamic nested recurrents formed by intra-cortical recurrent completion mechanisms and corticothalamic circuits, which iteratively perform perceptual inference. Inspired by this, we propose the Dynamic Nested Recurrent Network (DNRNet). It is the first to introduce recurrent inference into the data completion task, achieving a paradigm shift from passive completion to active perceptual inference. Its local recurrent loop simulates intra-cortical recurrent pattern completion to perform perceptual inference and generate local correction features. The global recurrent loop simulates the modulatory function of the thalamus, calculating modality confidence to dynamically weight and integrate cross-modal information, generating global correction features. The local and global correction features are fused to obtain the completion signal, which is then combined with the input features of the current iteration to serve as the input for the next iteration. Experiments on the MOSI, MOSEI, and SIMS datasets demonstrate that DNRNet achieves an average accuracy improvement of 1.5%\u20132.0% over baseline models across all missing rates, validating the superiority of the brain-inspired approach in complex missing data scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38497", "url": null, "sourceid": 42461, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39186, "uid": "d5a76ec267ddb7fb3538a6489a342a66", "name": "Parallax to Align Them All: An OmniParallax Attention Mechanism for Distributed Multi-View Image Compression", "authors": [{"id": 180515, "fullname": "Haotian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180515?format=json", "institution": "Peking University"}, {"id": 191534, "fullname": "Feiyue Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/191534?format=json", "institution": "Jilin University"}, {"id": 191535, "fullname": "Yixin Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191535?format=json", "institution": "Jilin University"}, {"id": 191536, "fullname": "Jian Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/191536?format=json", "institution": "Jilin University"}, {"id": 191147, "fullname": "Haocheng Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191147?format=json", "institution": "Peking University"}, {"id": 130709, "fullname": "Tongda Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130709?format=json", "institution": "Tsinghua University"}, {"id": 191537, "fullname": "Zhenning Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/191537?format=json", "institution": "Tsinghua University"}, {"id": 130713, "fullname": "Yan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130713?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 90140, "fullname": "Siwei Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/90140?format=json", "institution": "Peking University"}, {"id": 150005, "fullname": "Jiaqi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150005?format=json", "institution": "Peking University"}], "abstract": "Multi-view image compression (MIC) aims to achieve high compression efficiency by exploiting inter-image correlations, playing a crucial role in 3D applications. As a subfield of MIC, distributed multi-view image compression (DMIC) offers performance comparable to MIC while eliminating the need for inter-view information at the encoder side.However, existing methods in DMIC typically treat all images equally, overlooking the varying degrees of correlation between different views during decoding, which leads to suboptimal coding performance. To address this limitation, we propose a novel $\\textbf{OmniParallax Attention Mechanism}$ (OPAM), which is a general mechanism for explicitly modeling correlations and aligned features between arbitrary pairs of information sources.Building upon OPAM, we propose a Parallax Multi Information Fusion Module (PMIFM) to adaptively integrate information from different sources. PMIFM is incorporated into both the joint decoder and the entropy module to construct our end-to-end DMIC framework, $\\textbf{ParaHydra}$.Extensive experiments demonstrate that $\\textbf{ParaHydra}$ is $\\textbf{the first DMIC method}$ to significantly surpass state-of-the-art MIC codecs, while maintaining low computational overhead. Performance gains become more pronounced as the number of input views increases. Compared with LDMIC, $\\textbf{ParaHydra}$ achieves bitrate savings of $\\textbf{19.72\\\\%}$ on WildTrack(3) and up to $\\textbf{24.18\\\\%}$ on WildTrack(6), while significantly improving coding efficiency (as much as $\\textbf{65}\\times$ in decoding and $\\textbf{34}\\times$ in encoding).", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39186", "url": null, "sourceid": 32792, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38897, "uid": "0bb3498deb34e18fa94d2c9d6a443c07", "name": "Beyond the Static World: Continual Category Discovery under Visual Drift", "authors": [{"id": 130131, "fullname": "Wei Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/130131?format=json", "institution": "Monash University"}, {"id": 189484, "fullname": "Yiwen Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189484?format=json", "institution": "Monash University"}, {"id": 158746, "fullname": "Sijin Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/158746?format=json", "institution": "Monash University"}, {"id": 185612, "fullname": "Zongyuan Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/185612?format=json", "institution": "Monash University"}], "abstract": "Generalized Category Discovery (GCD) aims to identify both known and novel classes from unlabeled data with the aid of labeled examples. While promising, most existing GCD methods rely on simultaneous access to labeled and unlabeled datasets\u2014an assumption often impractical in real-world deployments. Continual Category Discovery (CCD) relaxes this requirement by adapting a pre-trained model to streaming unlabeled data, yet it typically assumes domain-consistent data distributions. This places a strong limitation on its applicability. In this work, we study Open Continual Category Discovery (OCCD), where the model must robustly discover previously unseen concepts from real-world data streams that may originate from heterogeneous and shifting domains. To address this, we propose an adaptive framework built on three key ideas. First, we propose a weight-aware separation module, which leverages partial unbalanced optimal transport for instance probability modeling and employs binary response spectrum quantization to generate cues for distinguishing known and unknown categories, enabling automatic sample separation. Second, for known categories, we introduce a cross-domain semantic alignment module that incorporates adversarial learning to perform adaptive prototype matching, thereby enhancing robustness against domain shifts. Finally, for unknown categories, we design a category topology consistency constraint that preserves semantic relationships between known and novel classes during distribution shifts. Experiments show our approach excels at discovering new categories while maintaining strong performance on known ones in evolving domains.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38897", "url": null, "sourceid": 36790, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37050, "uid": "74bec1b62ad81455d42aee584dffe766", "name": "Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion", "authors": [{"id": 186567, "fullname": "Keyang Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186567?format=json", "institution": "Peking University"}, {"id": 155931, "fullname": "Sifan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/155931?format=json", "institution": "Southeast University"}, {"id": 186568, "fullname": "Hongbin Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186568?format=json", "institution": "ByteDance Inc."}, {"id": 186569, "fullname": "Gang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186569?format=json", "institution": "Guangming Laboratory"}, {"id": 156338, "fullname": "Zhifei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156338?format=json", "institution": "Peking University"}, {"id": 184477, "fullname": "Yikai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184477?format=json", "institution": "Tsinghua University"}, {"id": 88573, "fullname": "Zhen Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88573?format=json", "institution": "Peking University"}, {"id": 128621, "fullname": "Jieyi Long", "url": "http://cvpr.thecvf.com/api/miniconf/users/128621?format=json", "institution": "Theta Labs, Inc."}, {"id": 152147, "fullname": "Ming Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152147?format=json", "institution": "Guangming Laboratory"}], "abstract": "Realistic 3D city generation is fundamental to a wide range of applications, including virtual reality and digital twins. However, most existing methods rely on training a single diffusion model, which limits their ability to generate personalized and boundless city-scale scenes. In this paper, we present Yo'City, a novel agentic framework that enables user-customized and infinitely expandable 3D city generation by leveraging the reasoning and compositional capabilities of off-the-shelf large models. Specifically, Yo'City first conceptualize the city through a top-down planning strategy that defines a hierarchical \u201cCity\u2013District\u2013Grid\u201d structure. The Global Planner determines the overall layout and potential functional districts, while the Local Designer further refines each district with detailed grid-level descriptions. Subsequently, the grid-level 3D generation is achieved through a produce\u2013refine\u2013evaluate isometric image synthesis loop, followed by image-to-3D generation. To simulate continuous city evolution, Yo'City further introduces a user-interactive, relationship-guided expansion mechanism, which performs scene graph\u2013based distance- and semantics-aware layout optimization, ensuring spatially coherent city growth. To comprehensively evaluate our method, we construct a diverse benchmark dataset and design six multi-dimensional metrics that assess generation quality from the perspectives of semantics, geometry, texture, and layout. Extensive experiments demonstrate that Yo'City consistently outperforms existing state-of-the-art methods across all evaluation aspects.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37050", "url": null, "sourceid": 37522, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39054, "uid": "b385131592e7af5c1d249376c2f7d0bc", "name": "MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification", "authors": [{"id": 191261, "fullname": "Jiahao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191261?format=json", "institution": "Chongqing University"}, {"id": 131525, "fullname": "Sheng Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131525?format=json", "institution": "Chongqing University"}, {"id": 191262, "fullname": "Xin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191262?format=json", "institution": "Chongqing University"}, {"id": 155208, "fullname": "Zhixiong Nan", "url": "http://cvpr.thecvf.com/api/miniconf/users/155208?format=json", "institution": "Chongqing University"}, {"id": 191263, "fullname": "Jiajun Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/191263?format=json", "institution": "Chongqing University"}, {"id": 191264, "fullname": "Nankun Mu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191264?format=json", "institution": "Chongqing University"}], "abstract": "In computational pathology, few-shot whole slide image classification is primarily driven by the extreme scarcity of expert-labeled slides. Recent vision-language methods incorporate textual semantics generated by large language models, but treat these descriptions as static class-level priors that are shared across all samples and lack sample-wise refinement. This limits both the diversity and precision of visual-semantic alignment, hindering generalization under limited supervision. To overcome this, we propose the stochastic MUlti-view Semantic Enhancement (MUSE), a framework that first refines semantic precision via sample-wise adaptation and then enhances semantic richness through retrieval-augmented multi-view generation. Specifically, MUSE introduces Sample-wise Fine-grained Semantic Enhancement (SFSE), which yields a fine-grained semantic prior for each sample through MoE-based adaptive visual-semantic interaction. Guided by this prior, Stochastic Multi-view Model Optimization (SMMO) constructs an LLM-generated knowledge base of diverse pathological descriptions per class, then retrieves and stochastically integrates multiple matched textual views during training. These dynamically selected texts serve as enriched semantic supervisions to stochastically optimize the vision-language model, promoting robustness and mitigating overfitting. Experiments on three benchmark WSI datasets show that MUSE consistently outperforms existing vision-language baselines in few-shot settings, demonstrating that effective few-shot pathology learning requires not only richer semantic sources but also their active and sample-aware semantic optimization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39054", "url": null, "sourceid": 42900, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39371, "uid": "8fae1017385f209b727c45cda5e956aa", "name": "Cross-domain Dual-stream Feature Disentanglement for Brain Disorder Prediction with Sparsely Labeled PET", "authors": [{"id": 191941, "fullname": "Huabin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191941?format=json", "institution": "Anhui University"}, {"id": 183730, "fullname": "Xinyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/183730?format=json", "institution": "Anhui University"}, {"id": 191942, "fullname": "Yuan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191942?format=json", "institution": "Anhui University"}, {"id": 191943, "fullname": "Fei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191943?format=json", "institution": "Monash University, Malaysia Campus"}], "abstract": "Positron Emission Tomography (PET) can be used for the early diagnosis of various brain disorders. However, the annotation of PET scans requires the involvement of specialized nuclear medicine experts, making accurately annotated PET data extremely scarce. MRI-based cross-modal domain adaptation methods can improve the brain disorder classification accuracy with sparsely labeled PET data. However, existing methods fail to balance the core requirements of domain discrepancy elimination and modality-specific discriminative information retention in cross-modal tasks. Forced alignment often undermines the core pathological discriminative features of both modalities, making it difficult to meet the collaborative optimization demands of cross-modal brain disorder classification. To address this, we propose a Dual-Stream feature Disentanglement and Alignment (DSDA) framework designed for collaborative optimization of cross-modal domain adaptation and brain disorder classification. This framework first dynamically evaluates and explicitly decouples the critical brain regions relevant to the classification task from the non-critical regions that preserve brain structural integrity. It then applies differential processing to the two types of brain regions: topology-weighted feature alignment for non-critical regions and high-confidence feature fusion for critical regions. This differential processing ensures that the model effectively aligns features while preserving key discriminative information. Extensive experimental results on various datasets (e.g., ADNI, AIBL, and PPMI) demonstrate the effectiveness of DSDA which helps achieve the state-of-the-art performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39371", "url": null, "sourceid": 45999, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36178, "uid": "57768606c3164d03a9f2fd033a9895ee", "name": "Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization", "authors": [{"id": 184342, "fullname": "Weijian Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/184342?format=json", "institution": "Dalian University of Technology"}, {"id": 171974, "fullname": "Songqian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/171974?format=json", "institution": "Dalian University of Technology; Dalian University of Technology"}, {"id": 181678, "fullname": "Yuqi Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/181678?format=json", "institution": "Dalian University of Technology"}, {"id": 184343, "fullname": "Jian Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184343?format=json", "institution": "Dalian University of Technology"}, {"id": 184344, "fullname": "Yongdong Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184344?format=json", "institution": "North Minzu University"}, {"id": 153990, "fullname": "Qiang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153990?format=json", "institution": "Dalian University of Technology"}], "abstract": "As a key technique in multi-modal processing, infrared and visible image fusion (IVIF) plays a crucial role in integrating complementary spectral information for visual enhancement and downstream vision tasks. Despite remarkable progress, existing methods struggle to flexibly accommodate heterogeneous demands. Achieving adaptive fusion that aligns with various preferences from both human and machine vision remains an open and challenging problem. To address this challenge, we propose DPOFusion, a direct preference optimization (DPO) framework integrating the property-aligned latent diffusion model (PALDM) and the preference-controllable latent diffusion model (PCLDM), enabling task-guided, preference-adaptive IVIF for both human and machine vision. The PALDM leverages a latent fusion prior and a joint conditional loss to generate diverse candidate fusion results with various properties. PCLDM is subsequently fine-tuned via instance direct preference optimization (IDPO), enabling direct control of the final fusion results with heterogeneous preference signals.Experimental results demonstrate that our framework not only attains precise preference alignment among humans, vision-language models, and task-driven networks, but also sets a new benchmark for adaptive fusion quality and task-oriented transferability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36178", "url": null, "sourceid": 44649, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37883, "uid": "99fb2c00a7924d2152227c0d60112947", "name": "Learning Mutual View Information Graph for Adaptive Adversarial Collaborative Perception", "authors": [{"id": 188489, "fullname": "Yihang Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188489?format=json", "institution": "City University of Hong Kong"}, {"id": 180693, "fullname": "Senkang Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180693?format=json", "institution": "City University of Hong Kong"}, {"id": 151710, "fullname": "Haonan An", "url": "http://cvpr.thecvf.com/api/miniconf/users/151710?format=json", "institution": "City University of Hong Kong"}, {"id": 151730, "fullname": "Zhengru Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/151730?format=json", "institution": "City University of Hong Kong"}, {"id": 188490, "fullname": "Hangcheng Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188490?format=json", "institution": "City University of Hong Kong"}, {"id": 158876, "fullname": "Yuguang Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158876?format=json", "institution": "City University of Hong Kong"}], "abstract": "Collaborative perception (CP) enables data sharing among connected and autonomous vehicles (CAVs) to enhance driving safety. However, CP systems are vulnerable to adversarial attacks where malicious agents forge false objects via feature-level perturbations. Current defensive systems use threshold-based consensus verification by comparing collaborative and ego detection results. Yet, these defenses remain vulnerable to more sophisticated attack strategies that could exploit two critical weaknesses: (i) lack of robustness against attacks with systematic timing and target region optimization, and (ii) inadvertent disclosure of vulnerability knowledge through implicit confidence information in shared collaboration data. In this paper, we propose MVIG attack, a novel adaptive adversarial CP framework learning to capture vulnerability knowledge disclosed by different defensive CP systems from a unified mutual view information graph (MVIG) representation. Our approach combines MVIG representation with temporal graph learning to generate evolving fabrication risk maps and employs entropy-aware vulnerability search to optimize attack location, timing and persistence, enabling adaptive attacks with generalizability across various defensive configurations. Extensive evaluations on OPV2V and Adv-OPV2V datasets demonstrate that MVIG attack reduces defense success rates by up to 62\\% against state-of-the-art defenses while achieving 47\\% lower detection for persistent attacks at 29.9 FPS, exposing critical security gaps in CP systems.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37883", "url": null, "sourceid": 43068, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39103, "uid": "8290207904a74d28e7c0afdd6a974745", "name": "RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces", "authors": [{"id": 151710, "fullname": "Haonan An", "url": "http://cvpr.thecvf.com/api/miniconf/users/151710?format=json", "institution": "City University of Hong Kong"}, {"id": 191361, "fullname": "Xiaohui Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/191361?format=json", "institution": "South China University of Technology"}, {"id": 87750, "fullname": "Guang Hua", "url": "http://cvpr.thecvf.com/api/miniconf/users/87750?format=json", "institution": "Institute for Infocomm Research, A*STAR"}, {"id": 188489, "fullname": "Yihang Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188489?format=json", "institution": "City University of Hong Kong"}, {"id": 188490, "fullname": "Hangcheng Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188490?format=json", "institution": "City University of Hong Kong"}, {"id": 185832, "fullname": "Xiangyu Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185832?format=json", "institution": "South China University of Technology"}, {"id": 158876, "fullname": "Yuguang Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158876?format=json", "institution": "City University of Hong Kong"}], "abstract": "The proliferation of AI-generated content (AIGC) has facilitated sophisticated face manipulation, severely undermining visual integrity and posing unprecedented challenges to intellectual property (IP). In response, a common proactive defense leverages fragile watermarks to detect, localize, or even recover manipulated regions. However, these methods always assume an adversary unaware of the embedded watermark, overlooking their inherent vulnerability to watermark removal attacks. Furthermore, this fragility is exacerbated in the commonly used dual-watermark strategy that adds a robust watermark for image ownership verification, where mutual interference and limited embedding capacity reduce the fragile watermark's effectiveness.To address the gap, we propose RecoverMark, a watermarking framework that achieves robust manipulation localization, content recovery, and ownership verification simultaneously. Our key insight is twofold. First, we exploit a critical real-world constraint: an adversary must preserve the background's semantic consistency to avoid visual detection, even if they apply global, imperceptible watermark removal attacks. Second, using the image's own content (face, in this paper) as the watermark enhances extraction robustness. Based on these insights, RecoverMark treats the protected face content itself as the watermark and embeds it into the surrounding background. By designing a robust two-stage training paradigm with carefully crafted distortion layers that simulate comprehensive potential attacks and a progressive training strategy, RecoverMark achieves a robust watermark embedding in no fragile manner for image manipulation localization, recovery, and image IP protection simultaneously. Extensive experiments demonstrate the proposed RecoverMark's robustness against both seen and unseen attacks and its generalizability to in-distribution (ID) and out-of-distribution (OOD) data. Code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39103", "url": null, "sourceid": 44092, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36268, "uid": "5392c2f4a7ba34fdf47a1c5208f09640", "name": "When Lines Meet Textures: Spatial-Frequency Aligned Diffusion Features for Cross-Sparsity Correspondence", "authors": [{"id": 184627, "fullname": "Mingrui Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184627?format=json", "institution": "Xidian University"}, {"id": 184628, "fullname": "Fengzhi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184628?format=json", "institution": "Xidian University"}, {"id": 184629, "fullname": "Xin Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/184629?format=json", "institution": "Xidian University"}, {"id": 184630, "fullname": "Jun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184630?format=json", "institution": "Xidian University"}, {"id": 86416, "fullname": "Nannan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86416?format=json", "institution": "Xidian University"}, {"id": 88813, "fullname": "Xinbo Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88813?format=json", "institution": "Xidian University"}], "abstract": "Establishing accurate correspondence between sparse line representations and rich textured imagery remains a formidable challenge. While diffusion features excel in semantic correspondence, they struggle to bridge the fundamental gap between abstract sketches and texture-rich photographs. We identify two critical disparities: spatial domain misalignment from structural abstraction differences, and frequency domain inconsistencies from texture density variations. Based on this analysis, we propose SFA-DIFT, a novel approach that learns spatial-frequency aligned diffusion features for robust cross-modal correspondence. Unlike previous methods focusing solely on spatial alignment, our key innovation performs dual-domain alignment by learning unified clean diffusion features while strategically aggregating low-frequency components in the frequency domain. This comprehensive spatial-frequency alignment enables equitable understanding between sparse abstractions and rich textures. To validate our approach, we extend the existing sketch-photo correspondence dataset (PSC6K) by generating multi-style textured imagery, creating MS-PSC6K, a comprehensive correspondence benchmark. Extensive experiments demonstrate that SFA-DIFT achieves state-of-the-art performance, delivering substantial improvements with an average of 0.87\\% on PCK@1, 2.20\\% on PCK@5, and 0.95\\% on PCK@10 over previous best methods, validating the effectiveness and robustness of our dual-domain alignment approach.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36268", "url": null, "sourceid": 40248, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40168, "uid": "f01b7be7d43d51e505a307a223676662", "name": "ESAM++: Efficient Online 3D Perception on the Edge", "authors": [{"id": 193694, "fullname": "Qin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193694?format=json", "institution": "Stanford University"}, {"id": 132059, "fullname": "Lavisha Aggarwal", "url": "http://cvpr.thecvf.com/api/miniconf/users/132059?format=json", "institution": "Amazon"}, {"id": 193695, "fullname": "Saptarashmi Bandyopadhyay", "url": "http://cvpr.thecvf.com/api/miniconf/users/193695?format=json", "institution": "University of Maryland, College Park"}, {"id": 193696, "fullname": "Vikas Bahirwani", "url": "http://cvpr.thecvf.com/api/miniconf/users/193696?format=json", "institution": "Meta"}, {"id": 152548, "fullname": "Marc Niethammer", "url": "http://cvpr.thecvf.com/api/miniconf/users/152548?format=json", "institution": "University of California, San Diego"}, {"id": 75810, "fullname": "Ehsan Adeli", "url": "http://cvpr.thecvf.com/api/miniconf/users/75810?format=json", "institution": "Stanford University"}, {"id": 132025, "fullname": "Andrea Colaco", "url": "http://cvpr.thecvf.com/api/miniconf/users/132025?format=json", "institution": "Google"}], "abstract": "Online 3D scene perception in real time is essential for robotics, AR/VR, and autonomous systems, particularly in edge computing scenarios where computational resources are limited and privacy is crucial. Recent state-of-the-art methods like EmbodiedSAM (ESAM) demonstrate the promise of online 3D perception by leveraging the Segment Anything Model (SAM) for real-time, fine-grained, and gen- eralized 3D instance segmentation. However, ESAM still relies on a computationally expensive 3D sparse UNet for point cloud feature extraction, which accounts for the majority of the 3D inference time, hindering its practicality on resource-constrained devices. In this paper, we propose ESAM++, a lightweight and scalable alternative for online 3D scene perception tailored to edge devices without GPU acceleration. Our method introduces a 3D Sparse Feature Pyramid Network (SFPN) that efficiently captures multi-scale geometric features from streaming 3D point clouds while significantly reducing computational overhead and model size. We evaluate our approach on four challenging segmentation benchmarks, namely ScanNet, ScanNet200, SceneNN, and 3RScan, demonstrating that our model achieves competitive accuracy with up to 3\u00d7 faster inference with a 2\u00d7 smaller model size compared to ESAM, enabling practical deployment in real-world edge scenarios. Code and models will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40168", "url": null, "sourceid": 44618, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38794, "uid": "7bfa49a75dfad532ea8fe8b32b12c516", "name": "Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction", "authors": [{"id": 103147, "fullname": "Jiazhen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/103147?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 190684, "fullname": "Mingkuan Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190684?format=json", "institution": "Tsinghua University"}, {"id": 90895, "fullname": "Long Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/90895?format=json", "institution": "HKUST"}], "abstract": "Integrating segmentation into Multimodal Large Language Models (MLLMs) presents a core trilemma: simultaneously preserving dialogue ability, achieving high segmentation performance, and ensuring fast inference. Prevailing paradigms are forced into a compromise. Embedding prediction methods introduce a conflicting pixel-level objective that degrades the MLLM's general dialogue abilities. The alternative, next-token prediction, reframes segmentation as an autoregressive task, which preserves dialogue but forces a trade-off between poor segmentation performance with sparse outputs or prohibitive inference speeds with rich ones. We resolve this trilemma with **all-mask prediction**, a novel paradigm that decouples autoregressive dialogue generation from non-autoregressive mask prediction. We present *STAMP*: **S**imultaneous **T**extual **A**ll-**M**ask **P**rediction, an MLLM that embodies this paradigm. After generating a textual response, *STAMP* predicts an entire segmentation mask in a single forward pass by treating it as a parallel \"fill-in-the-blank\" task over image patches. This design maintains the MLLM's dialogue ability by avoiding conflicting objectives, enables high segmentation performance by leveraging rich, bidirectional spatial context for all mask tokens, and achieves exceptional speed. Extensive experiments show that *STAMP* significantly outperforms state-of-the-art methods across multiple segmentation benchmarks, providing a solution that excels in dialogue, segmentation, and speed without compromise.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38794", "url": null, "sourceid": 38282, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36225, "uid": "7b18b584738d4f52b3d4dfce78c1827e", "name": "Generalizable Co-Salient Object Detection via Mixed Content-Style Modulation", "authors": [{"id": 184491, "fullname": "Guanting Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/184491?format=json", "institution": "Nanjing University of Information Science and Technology"}, {"id": 177092, "fullname": "Shenglong Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/177092?format=json", "institution": "Hohai University"}, {"id": 85939, "fullname": "Kaihua Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85939?format=json", "institution": "NUIST"}, {"id": 184492, "fullname": "Guangcan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184492?format=json", "institution": "Southeast University"}, {"id": 184493, "fullname": "Min Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/184493?format=json", "institution": null}], "abstract": "This paper presents a generalizable CoSOD framework via mixed content-style modulation, termed CoMCS, to enhance the robustness of the model to unseen domains. The CoMCS, consisting of a mixed content modulator (MCM), a mixed style modulator (MSM), and a collaborative semantic contrast module (SCM), effectively extracts scene structure priors as well as augments the source domain styles to bridge the domain gap between the source and the unseen domains. Specifically, the CoMCS first utilizes the CLIP model to extract conceptual knowledge associated with the semantic classes in the whole scene, resulting in multi-class semantic embeddings that are domain-invariant. Subsequently, the MCM models the semantic relationships between the prototypes of co-salient objects and the multi-class semantic embeddings through the cross-attention mechanism, effectively capturing domain-invariant scene structure priors that aid in reducing scene distribution shift in unseen domains. Meanwhile, to alleviate domain perturbations encountered during testing, the MSM addresses the uncertainty associated with domain shifts by synthesizing feature statistics, such as mean and standard deviation, during training to simulate new stylistic characteristics, thus achieving data augmentation within the source domain. Finally, to reduce the ambiguity of the co-salient object representations within test data from unseen domains, the SCM employs a uniform loss function to ensure that the learned prototypes are uniformly distributed within the hyperspherical space, further enhancing the domain generalization capabilities of the framework. Besides, to further verify the generalization ability of the CoMCS to unseen domains, we construct an unseen-domain benchmark dataset (UND) that selects a variety of image groups with unseen classes from CoCA, CoSOD3k, CoSal2015. Extensive evaluations on the four benchmark datasets demonstrate favorable performance of our CoMCS to a variety of state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36225", "url": null, "sourceid": 46513, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37417, "uid": "7a0c99ef914f596a9d745df32a9c84dd", "name": "UNICBench: UNIfied Counting Benchmark for MLLM", "authors": [{"id": 187396, "fullname": "Chenggang Rong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187396?format=json", "institution": "Northwest Polytechnical University"}, {"id": 126145, "fullname": "Tao Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/126145?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 186351, "fullname": "Zhiyuan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186351?format=json", "institution": "China Telecom"}, {"id": 187397, "fullname": "Yaowu Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187397?format=json", "institution": "Sun Yat-sen University"}, {"id": 187398, "fullname": "Jia Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187398?format=json", "institution": "Harbin Institute of Technology"}, {"id": 126142, "fullname": "Song Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/126142?format=json", "institution": "Department of Computer Science and Engineering, Hong Kong University of Science and Technology"}, {"id": 153280, "fullname": "Yuan Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153280?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 155750, "fullname": "Junyu Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/155750?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}], "abstract": "Counting is a core capability for multimodal large language models (MLLMs), yet there is no unified counting dataset to rigorously evaluate this ability across image, text, and audio. We present UNICBench, a unified multimodal, multi\u2011level counting benchmark and evaluation toolkit with accurate ground truth, deterministic numeric parsing, and stratified reporting. The corpus comprises 5,300 images (5,508 QA), 872 documents (5,888 QA), and 2,069 audio clips (2,905 QA), annotated with a three\u2011level capability taxonomy and difficulty tags. Under a standardized protocol with fixed splits/prompts/seeds and modality\u2011specific matching rules, we evaluate 45 state\u2011of\u2011the\u2011art MLLMs across modalities. Results show strong performance on some basic counting tasks but significant gaps on reasoning and the hardest partitions, highlighting long\u2011tail errors and substantial headroom for improving general counting. UNICBench offers a rigorous and comparable basis for measurement and a public toolkit to accelerate progress.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37417", "url": null, "sourceid": 42660, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37042, "uid": "1037133936e623fd56c5c23a101d65cf", "name": "Generalizable Structure-Aware Keypoint Correspondence for Category-Unified 3D Single Object Tracking", "authors": [{"id": 186541, "fullname": "Jie Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186541?format=json", "institution": "University of Science and Technology of China"}, {"id": 186542, "fullname": "Yinchao Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/186542?format=json", "institution": "University of Science and Technology of China"}, {"id": 186543, "fullname": "Yuyang Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186543?format=json", "institution": "University of Science and Technology of China"}, {"id": 186544, "fullname": "Dengqing Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186544?format=json", "institution": "University of Science and Technology of China"}, {"id": 146642, "fullname": "Jianpeng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/146642?format=json", "institution": "University of Science and Technology of China"}, {"id": 153636, "fullname": "Xu Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/153636?format=json", "institution": "Sangfor Technologies Inc."}, {"id": 186545, "fullname": "Qiao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186545?format=json", "institution": "University of Science and Technology of China"}, {"id": 88062, "fullname": "Wenfei Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88062?format=json", "institution": "University of Science and Technology of China"}, {"id": 85977, "fullname": "Tianzhu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85977?format=json", "institution": "University of Science and Technology of China"}], "abstract": "3D single object tracking (SOT) in point clouds is a fundamental component of autonomous perception but remains challenging due to sparse observations, irregular geometry, and frequent occlusion. Most prior methods adopt a category-specific paradigm, requiring individual models for different object types. This design hinders scalability and generalization, as object categories in the real world exhibit vast variations in scale and structure. In this work, we present UniKPT, a category-unified and structure-aware framework that performs robust 3D tracking across diverse object classes without relying on category priors. UniKPT introduces three key innovations: (1) an adaptive structural keypoint extractor that identifies scale-consistent and semantically meaningful points; (2) a progressive correspondence aligner that enforces hierarchical geometric consistency across frames; and (3) a confidence-aware localization module that adaptively refines tracking by suppressing uncertain correspondences and exploiting inter-keypoint structural relations. Experiments on the nuScenes and KITTI benchmarks demonstrate that a single UniKPT model not only generalizes across categories but also outperforms state-of-the-art category-specific trackers, achieving gains of +4.37% in Success and +5.16% in Precision on nuScenes.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37042", "url": null, "sourceid": 43874, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39031, "uid": "89d7b51cbf6fdf18a3b0586c9e538b6d", "name": "Partial Weakly-Supervised Oriented Object Detection", "authors": [{"id": 154724, "fullname": "Mingxin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154724?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 154723, "fullname": "Peiyuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154723?format=json", "institution": "Wuhan University"}, {"id": 191204, "fullname": "Yuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191204?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 191205, "fullname": "Wei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191205?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 191206, "fullname": "Yue Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/191206?format=json", "institution": "Nanyang Technological University"}, {"id": 186950, "fullname": "Ning Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186950?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 184641, "fullname": "Ziyang Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/184641?format=json", "institution": "Shanghai Artificial Intelligence Laboratory; SUN YAT-SEN UNIVERSITY"}, {"id": 102617, "fullname": "Junwei Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/102617?format=json", "institution": "Wuhan University"}, {"id": 191207, "fullname": "Zhirui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191207?format=json", "institution": "Aerospace Information Research Institute, Chinese Academy of Sciences"}, {"id": 191208, "fullname": "Yi Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191208?format=json", "institution": "The Ohio State University"}, {"id": 128276, "fullname": "Xue Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128276?format=json", "institution": "Shanghai AI Laboratory"}], "abstract": "The growing demand for oriented object detection (OOD) across various domains has driven significant research in this area. However, the high cost of dataset annotation remains a major concern. Current mainstream OOD algorithms can be mainly categorized into three types: (1) fully supervised methods using complete oriented bounding box (OBB) annotations, (2) semi-supervised methods using partial OBB annotations, and (3) weakly supervised methods using weak annotations such as horizontal boxes or points. However, these algorithms inevitably increase the cost of models in terms of annotation speed or annotation cost. To address this issue, we propose: (1) the first Partial Weakly-Supervised Oriented Object Detection (PWOOD) framework based on partially weak annotations (horizontal boxes or single points), which can efficiently leverage large amounts of unlabeled data, significantly outperforming weakly supervised algorithms trained with partially weak annotations, also offers a lower cost solution; (2) Orientation-and-Scale-aware Student (OS-Student) model capable of learning orientation and scale information with only a small amount of orientation-agnostic or scale-agnostic weak annotations; and (3) Class-Agnostic Pseudo-Label Filtering strategy (CPF) to reduce the model's sensitivity to static filtering thresholds. Comprehensive experiments on DOTA-v1.0/v1.5/v2.0 and DIOR datasets demonstrate that our PWOOD framework performs comparably to, or even surpasses traditional semi-supervised algorithms. Our code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39031", "url": null, "sourceid": 41057, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37533, "uid": "82b05ac4ebef496529bb27f3c6786782", "name": "From Static to Dynamic: Exploring Self-supervised Image-to-Video Representation Transfer Learning", "authors": [{"id": 157965, "fullname": "Yang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157965?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 156164, "fullname": "Qianqian Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156164?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 187664, "fullname": "Peisong Wen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187664?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 157967, "fullname": "Siran Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/157967?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 185535, "fullname": "Xilin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185535?format=json", "institution": "Beijing Institute of Technology"}, {"id": 85019, "fullname": "Qingming Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85019?format=json", "institution": "University of Chinese Academy of Sciences"}], "abstract": "Recent studies have made notable progress in video representation learning by transferring image-pretrained models to video tasks. This process typically introduces complex temporal processing modules with fine-tuning on video data. However, fine-tuning heavy modules may compromise inter-video semantic separability, i.e., the essential ability to distinguish objects across videos. While reducing the tunable parameters conversely hinders their intra-video temporal consistency, which is required to produce stable representations for the same object in a video. This dilemma indicates a potential trade-off between the intra-video temporal consistency and inter-video semantic separability during image-to-video transfer. To this end, we propose the Consistency-Separability Trade-off Transfer Learning (Co-Settle) framework, which applies a lightweight projection layer on top of the frozen image-pretrained encoder to adjust representation space with a temporal cycle consistency objective and a semantic separability constraint. We further provide a theoretical support showing that the optimized projection yields a better trade-off between the two properties under appropriate conditions. Experiments on eight image-pretrained models demonstrate consistent performance improvements across multiple levels of video tasks with only five epochs of self-supervised training. The code is available in the Supplemental Material.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37533", "url": null, "sourceid": 35919, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38647, "uid": "490e2819b183440ff0baf248f186d608", "name": "PGA: Prior-free Generative Attack for Practical No-box Scenario", "authors": [{"id": 172443, "fullname": "hongyu peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/172443?format=json", "institution": "NWPU"}, {"id": 190381, "fullname": "Xiang Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190381?format=json", "institution": "Northwestern Polytechnical University"}, {"id": 190382, "fullname": "Gong Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/190382?format=json", "institution": "Northwestern Polytechnical University"}], "abstract": "The unrealistic reliance on abundant prior information in traditional transferable attacks has spurred the Practical No-box Scenario (PNS), where attackers can access only limited unlabeled images. However, existing methods rely on iterative optimization to produce adversarial examples with inherently limited inference speed and transferability. Conversely, faster generative attacks fundamentally conflict with the PNS due to their critical dependence on abundant prior information that is explicitly absent in this scenario. To bridge this gap, we propose Prior-free Generative Attack (PGA), the first generative attack tailored for the PNS. Specifically, we introduce the Curriculum-Guided Micro-Robust Optimization that progressively incorporates more challenging discriminative tasks to mitigate the degenerate solutions common in self-supervised learning with limited data, yielding robust and transferable surrogates for downstream attacks. Furthermore, the Region-Aware Consistent Perturbation Learning guides the generator to produce fine-grained and spatially coherent perturbations, mitigating the common pitfall of generative attacks falling into local optima under insufficient supervision. Extensive experiments demonstrate that our PGA achieves remarkable transferability across various settings with high inference speed. This work provides a more practical benchmark for future research on transferable attacks, revealing the great potential of generative attacks under the PNS.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38647", "url": null, "sourceid": 46313, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36886, "uid": "1389f372d9685b20d2e3477c47ed568f", "name": "dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models", "authors": [{"id": 181428, "fullname": "Yi Xin", "url": "http://cvpr.thecvf.com/api/miniconf/users/181428?format=json", "institution": "Nanjing University"}, {"id": 186106, "fullname": "Siqi Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186106?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 186107, "fullname": "Tianxiang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186107?format=json", "institution": "IEEE"}, {"id": 186108, "fullname": "Qi Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186108?format=json", "institution": "The University of Sydney"}, {"id": 151862, "fullname": "Haoxing Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/151862?format=json", "institution": "Nanjing University"}, {"id": 186109, "fullname": "Kaiwen Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186109?format=json", "institution": "Shanghai Jiaotong University; Shanghai Artificial Intelligence Laboratory"}, {"id": 186110, "fullname": "Zhiwei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186110?format=json", "institution": "Pennsylvania State University"}, {"id": 154242, "fullname": "Yangfan He", "url": "http://cvpr.thecvf.com/api/miniconf/users/154242?format=json", "institution": "University of Minnesota - Twin Cities"}, {"id": 181163, "fullname": "Rongchao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181163?format=json", "institution": "Peking University"}, {"id": 133860, "fullname": "Jinbin Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/133860?format=json", "institution": "NUS"}, {"id": 144726, "fullname": "Shuo Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/144726?format=json", "institution": "University of Science and Technology of China, Shanghai AI laboratary"}, {"id": 76614, "fullname": "Bin Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76614?format=json", "institution": "Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences"}, {"id": 86671, "fullname": "Junjun He", "url": "http://cvpr.thecvf.com/api/miniconf/users/86671?format=json", "institution": "Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences"}, {"id": 77291, "fullname": "Yihao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/77291?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 186111, "fullname": "Yuewen Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186111?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 71043, "fullname": "Xiaohong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/71043?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Diffusion Multi-modal Large Language Models (dMLLMs) have recently emerged as a novel architecture unifying image generation and understanding. However, developing effective and efficient Test-Time Scaling (TTS) methods to unlock their full generative potential remains an underexplored challenge. To address this, we propose dMLLM-TTS, a novel framework operating on two complementary scaling axes: (1) trajectory exploration scaling to enhance the diversity of generated hypotheses, and (2) iterative refinement scaling for stable generation. Conventional TTS approaches typically perform linear search across these two dimensions, incurring substantial computational costs of O(NT) and requiring an external verifier for best-of-N selection. To overcome these limitations, we propose two innovations. First, we design an efficient hierarchical search algorithm with O(N+T) complexity that adaptively expands and prunes sampling trajectories. Second, we introduce a self-verified feedback mechanism that leverages the dMLLMs' intrinsic image understanding capabilities to assess text\u2013image alignment, eliminating the need for external verifier. Extensive experiments on the GenEval benchmark across three representative dMLLMs (e.g., Lumina-DiMOO, MMaDA, Muddit) show that our framework substantially improves generation quality while achieving up to 6x greater efficiency than linear search.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36886", "url": null, "sourceid": 44231, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37562, "uid": "298fbf7365715ccb63b71724722f7a21", "name": "ArtiMuse: Fine-Grained Image Aesthetics Assessment with Joint Scoring and Expert-Level Understanding", "authors": [{"id": 144726, "fullname": "Shuo Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/144726?format=json", "institution": "University of Science and Technology of China, Shanghai AI laboratary"}, {"id": 187732, "fullname": "Nan Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/187732?format=json", "institution": "china academy of art"}, {"id": 187733, "fullname": "Jiayang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187733?format=json", "institution": "Peking University"}, {"id": 170633, "fullname": "Xiaohui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/170633?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 187734, "fullname": "Lihao Shao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187734?format=json", "institution": "China Academy of Art"}, {"id": 186109, "fullname": "Kaiwen Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186109?format=json", "institution": "Shanghai Jiaotong University; Shanghai Artificial Intelligence Laboratory"}, {"id": 156114, "fullname": "Yu Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/156114?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 103898, "fullname": "Yuandong Pu", "url": "http://cvpr.thecvf.com/api/miniconf/users/103898?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 130080, "fullname": "Jiarui Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130080?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 187735, "fullname": "Jiaquan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187735?format=json", "institution": null}, {"id": 187736, "fullname": "Bo.Qu Bo.Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187736?format=json", "institution": null}, {"id": 87610, "fullname": "Wenhai Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87610?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 86632, "fullname": "Yu Qiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86632?format=json", "institution": "Shanghai Aritifcal Intelligence Laboratory"}, {"id": 187737, "fullname": "Dajuin Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187737?format=json", "institution": "China Academy of Art; China Academy of Art"}, {"id": 77291, "fullname": "Yihao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/77291?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}], "abstract": "The rapid advancement of educational applications, artistic creation, and AI-generated content (AIGC) technologies has substantially increased practical requirements for comprehensive Image Aesthetics Assessment (IAA), particularly demanding methods capable of delivering both quantitative scoring and professional understanding. Multimodal Large Language Model (MLLM)-based IAA methods demonstrate stronger perceptual and generalization capabilities compared to traditional approaches, yet they suffer from modality bias (score-only or text-only) and lack fine-grained attribute decomposition, thereby failing to support further aesthetic assessment. In this paper, we present:  (1) ArtiMuse, an innovative MLLM-based IAA model with Joint Scoring and Expert-Level Understanding capabilities; (2) ArtiMuse-10K, the first expert-curated IAA dataset comprising 10,000 images spanning 5 main categories and 15 subcategories, each annotated by professional experts with 8-dimensional attributes analysis and a holistic score. Both the model and dataset will be made public.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37562", "url": null, "sourceid": 40057, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39172, "uid": "a0a9b321a2b35675be63a0603e77f0f2", "name": "SEA-Flow3D: Simplified, Efficient, and Accurate Scene Flow via Spatial Vector Sampling and Multi-scale Refinement", "authors": [{"id": 71377, "fullname": "Han Ling", "url": "http://cvpr.thecvf.com/api/miniconf/users/71377?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 88464, "fullname": "Quansen Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/88464?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 191501, "fullname": "Yinghua Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191501?format=json", "institution": "Agency for Science, Technology and Research (A*STAR)"}, {"id": 191502, "fullname": "Ivor Tsang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191502?format=json", "institution": "A*STAR"}, {"id": 191503, "fullname": "Yinghui Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/191503?format=json", "institution": "Southeast University"}], "abstract": "Although depth-assisted scene flow estimation has advanced rapidly, mainstream dense frameworks (e.g., RAFT-3D) still rely primarily on 2D feature correlations to optimize 3D motion fields, which hinders their ability to exploit 3D structural priors effectively and consequently limits robustness in complex scenes. We present SEA-Flow3D, a simple, efficient, and accurate framework for dense scene flow estimation.At its core lies a Spatial Vector Sampling (SVS) module that jointly samples 3D coordinates and correlation volumes within the local neighborhood of matched points, producing a direction-aware correlation representation with explicit spatial vectors and providing strong geometric guidance for subsequent optimization. Following the simplicity-and-efficiency principle, SEA-Flow3D adopts a RAFT-style multi-scale recurrent refinement architecture, integrating an RNN-based optimizer with context-guided upsampling to achieve higher accuracy with fewer iterations. Extensive experiments on KITTI and Sintel demonstrate that SEA-Flow3D achieves state-of-the-art performance while maintaining remarkable efficiency and a lightweight design.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39172", "url": null, "sourceid": 43453, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39522, "uid": "13c5761fbc4d10bc361221c281f84190", "name": "CASPA: Graph-Structured Concept Anchors for Modality-Agnostic Adaptation in Vision\u2013Language Models", "authors": [{"id": 150071, "fullname": "Abhiroop Chatterjee", "url": "http://cvpr.thecvf.com/api/miniconf/users/150071?format=json", "institution": "Jadavpur University"}, {"id": 167682, "fullname": "Susmita Ghosh", "url": "http://cvpr.thecvf.com/api/miniconf/users/167682?format=json", "institution": "Jadavpur University"}, {"id": 192257, "fullname": "Ashish Ghosh", "url": "http://cvpr.thecvf.com/api/miniconf/users/192257?format=json", "institution": ""}, {"id": 184358, "fullname": "Emmett Ientilucci", "url": "http://cvpr.thecvf.com/api/miniconf/users/184358?format=json", "institution": "Rochester Institute of Technology"}], "abstract": "Recent advances in vision\u2013language models (VLMs) have revealed both the promise and the rigidity of large-scale pretraining. Despite their impressive zero-shot generalization, existing adaptation paradigms\u2014whether prompt tuning, adapter injection, or fine-tuning\u2014remain class-specific, modality-biased, and structure-agnostic. However, these design choices limit reasoning-level transfer across tasks. To this end, we rethink adaptation as a shared conceptual structure rather than a per-class specialization. We propose $\\textbf{CASPA}$ (Concept-Anchored Semantic Prompt Adapter), a dual-anchor semantic adapter that jointly learns shared text and image anchors as a bidirectional conceptual interface between modalities. Each class learns a soft association distribution over these anchors, producing compositional representations that enable parameter sharing and semantic reuse. To further align visual and textual reasoning spaces, CASPA employs $\\textbf{Semantic Cross-Consistency Regularization (S-XCR)}$, enforcing geometric and semantic agreement between text- and image-conditioned anchor mixtures. To the best of our knowledge, this is the first work to jointly model graph-structured semantic adaptation and cross-modal regularization for unified, reasoning-level vision\u2013language alignment. CASPA is evaluated across four adaptation regimes\u2014base-to-novel generalization, few-shot learning under data scarcity, cross-data transfer, and backbone-agnostic few-shot evaluation. Evaluated on eleven diverse visual recognition datasets, it matches or outperforms several state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39522", "url": null, "sourceid": 41078, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38726, "uid": "437d2c4de406d5e46cb45e9802811f31", "name": "Property-Informed Diffusion-Based Text-to-Microstructure Generation", "authors": [{"id": 190531, "fullname": "Bingxuan Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/190531?format=json", "institution": "Southeast Community College Area"}, {"id": 152627, "fullname": "Hongsong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152627?format=json", "institution": "Southeast University"}, {"id": 131829, "fullname": "Jie Gui", "url": "http://cvpr.thecvf.com/api/miniconf/users/131829?format=json", "institution": "Southeast University"}], "abstract": "Designing 3D metamaterial microstructures that meet the intended functions remains a major challenge, as it typically requires domain expertise, iterative simulations, and extensive manual tuning. Existing work on inverse design that automatically generates microstructures based on desired target properties often suffers from limited design diversity and faces challenges in ensuring the physical feasibility of the generated structures. To address this issue, a property-informed diffusion-based network is proposed that enables the generation of 3D microstructures directly from textual descriptions. Unlike traditional property conditioning methods, our approach leverages rich guidance in terms of semantics and physical properties in the text input to support diverse structure synthesis. To enforce consistency between the generated structures and the target textual prompts, a dual alignment strategy is adopted, including contrastive text-structure alignment and test-time reward-guided alignment. Experimental results show that the model is capable of generating semantically meaningful and physically plausible structures across a wide range of material categories. The proposed framework has good potential for interactive microstructure design and opens up new directions for combining language-based interfaces with inverse material discovery.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38726", "url": null, "sourceid": 39911, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38789, "uid": "013c0727c2f3b90ec8545f5062f75360", "name": "Beyond Text: Visual Description Assembly by Probabilistic Model for CLIP-based Weakly Supervised Semantic Segmentation", "authors": [{"id": 186809, "fullname": "Xianglin Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186809?format=json", "institution": "Xi&#x27;an Jiaotong-Liverpool University; University of Liverpool; iHorry"}, {"id": 157612, "fullname": "Jian Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157612?format=json", "institution": "Xi&#x27;an Jiaotong-Liverpool University"}, {"id": 153442, "fullname": "Xiaolei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153442?format=json", "institution": "University of Liverpool"}, {"id": 190679, "fullname": "Zhen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190679?format=json", "institution": null}, {"id": 89348, "fullname": "Jimin Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/89348?format=json", "institution": "Xi&#x27;an Jiaotong-Liverpool University"}], "abstract": "Contrastive Language-Image Pre-training (CLIP) offers a new paradigm for Weakly Supervised Semantic Segmentation (WSSS) by generating Class Activation Maps (CAMs) from text-image alignment. Existing methods primarily rely on hand-crafted templates or general attribute descriptions generated by a large language model to construct text prototypes for querying visual features. However, these strategies faces two major limitations: the inherent modality gap in CLIP prevents text prototypes achieving tight alignment with visual features; and their static text prototypes cannot adaptively respond to target instances that exhibit diverse visual attributes. To address these challenges, our key insight is to directly construct instance-specific visual description prototype as query, thereby bypassing the suboptimal static text description optimization. To this end, we propose the Visual Description Assembly (VDA) framework. It employs a probabilistic model to map complex CLIP visual features into a structured latent space. This latent space allows us to explicitly disentangle and aggregate varied visual attributes, and then dynamically assemble them into instance-specific visual prototypes. Furthermore, to enhance the robustness of this prototype, we adaptively incorporate the semantically stable text prototype into it as the final query for generating superior CAMs. Experimental results show our method outperforms existing baselines, achieving state-of-the-art performance on WSSS benchmarks. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38789", "url": null, "sourceid": 36295, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38037, "uid": "5dddaf9d765767a1a9fbce4362325e89", "name": "Cross-Slice Knowledge Transfer via Masked Multi-Modal Heterogeneous Graph Contrastive Learning for Spatial Gene Expression Inference", "authors": [{"id": 188888, "fullname": "Zhiceng Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188888?format=json", "institution": "Yunnan University"}, {"id": 188889, "fullname": "Changmiao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188889?format=json", "institution": "Shenzhen Research Institute of Big Data"}, {"id": 188890, "fullname": "Jun Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188890?format=json", "institution": "Zhongnan University of Economics and Law; Nanyang Technological University"}, {"id": 188891, "fullname": "Wenwen Min", "url": "http://cvpr.thecvf.com/api/miniconf/users/188891?format=json", "institution": "Yunnan University"}], "abstract": "While spatial transcriptomics (ST) has advanced understanding of gene expression within tissue context, its high experimental cost limits large-scale application. Predicting ST from pathology images offers a promising, cost-effective alternative, yet existing methods often struggle to capture the complex spatial relationships across slides. To address the challenge, we propose SpaHGC, a multi-modal heterogeneous graph-based model that captures both intra-slice and inter-slice spot-spot relationships from histology images. It integrates local spatial context within the target slide and cross-slide similarities computed from image embeddings extracted by a pathology foundation model. These embeddings enable inter-slice knowledge transfer across slides. Additionally, SpaHGC incorporates Masked Graph Contrastive Learning to enhance feature representation and transfer spatial gene expression knowledge from reference to target slides, enabling it to model complex spatial dependencies and significantly improve prediction accuracy. We conducted comprehensive benchmarking on seven matched histology-ST datasets from different platforms, tissues, and cancer subtypes. The results demonstrate that SpaHGC significantly outperforms the existing nine state-of-the-art methods across all evaluation metrics. Moreover, the model\u2019s predicted ST profiles closely align with the ground truth data and accurately correspond to tumor regions. Additionally, the predictions are significantly enriched in multiple cancer-related pathways, thereby highlighting its strong biological relevance and application potential. Code availability and reproducibility details are in the Supplementary Materials.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38037", "url": null, "sourceid": 33830, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37162, "uid": "994252186323cee9c2a1f1b607ec4a91", "name": "Frequency-Aware Affinity for Weakly Supervised Semantic Segmentation", "authors": [{"id": 107428, "fullname": "Ziqian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/107428?format=json", "institution": "Xi&#x27;an Jiaotong-Liverpool University"}, {"id": 186809, "fullname": "Xianglin Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186809?format=json", "institution": "Xi&#x27;an Jiaotong-Liverpool University; University of Liverpool; iHorry"}, {"id": 107328, "fullname": "Xinqiao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/107328?format=json", "institution": "Xi\u2019an Jiaotong-Liverpool University"}, {"id": 153442, "fullname": "Xiaolei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153442?format=json", "institution": "University of Liverpool"}, {"id": 149721, "fullname": "Quan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/149721?format=json", "institution": "Xi&#x27;an Jiaotong-Liverpool University"}, {"id": 89348, "fullname": "Jimin Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/89348?format=json", "institution": "Xi&#x27;an Jiaotong-Liverpool University"}], "abstract": "Weakly Supervised Semantic Segmentation (WSSS) typically utilizes Class Activation Maps (CAMs) to provide the pixel-wise localization. However, CAMs tend to activate only the most discriminative regions, leading to suboptimal WSSS performance. Although existing CAM refinement methods leverage pair-wise relations in affinity to expand the activation regions, these affinities derived from Vision Transformer (ViTs) exhibit a smoothing property, neglecting crucial high-frequency relations and failing to accurately refine object boundaries. In this work, we propose the Dual Frequency-Aware framework (DFA) to address this limitation. Specifically, the Low-Frequency-Aware Alignment (LFAA) generates low-frequency-aware affinity that captures salient semantic relations to enhance object interior semantic consistency on CAMs, while the High-Frequency-Aware Rectification (HFAR) module produces high-frequency-aware affinity that models precise relations to preserve object boundary structure on CAMs. By effectively integrating these two complementary affinities, we design a novel Frequency-Guided (FG) CAM Generation based on Optimal Transport theory, which significantly omits the complex refinement process. Extensive experiments demonstrate that our DFA framework achieves state-of-the-art performance on both PASCAL VOC and MS COCO benchmarks. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37162", "url": null, "sourceid": 34620, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39683, "uid": "3e43df7f86a57ed8ce8ce3b26a8a08de", "name": "CoopDiff: A Diffusion-Guided Approach for Cooperation under Corruptions", "authors": [{"id": 146701, "fullname": "Gong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/146701?format=json", "institution": "tianjin university"}, {"id": 148789, "fullname": "Chaokun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/148789?format=json", "institution": "College of Intelligence and Computing, Tianjin University"}, {"id": 186339, "fullname": "Pengcheng Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186339?format=json", "institution": "Tianjin University"}], "abstract": "Cooperative perception lets agents share information to expand coverage and improve scene understanding. However, in real-world scenarios, diverse and unpredictable corruptions undermine its robustness and generalization. To address these challenges, we introduce CoopDiff, a diffusion-based cooperative perception framework that mitigates corruptions via a denoising mechanism. CoopDiff adopts a teacher-student paradigm: the Quality-Aware Teacher performs voxel-level early fusion with Quality of Interest weighting and semantic guidance, then produces clean supervision features via a diffusion denoiser. The Dual-Branch Diffusion Student first separates ego and cooperative streams in encoding to reconstruct the teacher's clean targets. And then, an Ego-Guided Cross-Attention mechanism facilitates balanced decoding under degradation by adaptively integrating ego and cooperative features. We evaluate CoopDiff on two constructed multi-degradation benchmarks, OPV2Vn and DAIR-V2Xn, each incorporating six corruption types, including environmental and sensor-level distortions. Benefiting from the inherent denoising properties of diffusion, CoopDiff consistently outperforms prior methods across all degradation types and lowers the relative corruption error. Furthermore, it offers a tunable balance between precision and inference efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39683", "url": null, "sourceid": 45281, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38797, "uid": "7125a10499ff1d99aa7cfbbc91cf2278", "name": "Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding", "authors": [{"id": 158743, "fullname": "Zhongxing Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/158743?format=json", "institution": "Weill Cornell Medicine, Cornell University"}, {"id": 189486, "fullname": "Zhonghua Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189486?format=json", "institution": "Monash University"}, {"id": 190690, "fullname": "Zhe Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/190690?format=json", "institution": "Monash University"}, {"id": 190691, "fullname": "Dachuan Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/190691?format=json", "institution": "Georgia Tech"}, {"id": 132272, "fullname": "feilong tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/132272?format=json", "institution": "Monash University"}, {"id": 153129, "fullname": "Ming Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153129?format=json", "institution": "Monash University"}, {"id": 190692, "fullname": "Shiyan Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/190692?format=json", "institution": "Monash University"}, {"id": 190693, "fullname": "Xiaocheng Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190693?format=json", "institution": "Northeastern University"}, {"id": 130131, "fullname": "Wei Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/130131?format=json", "institution": "Monash University"}, {"id": 86979, "fullname": "Dwarikanath Mahapatra", "url": "http://cvpr.thecvf.com/api/miniconf/users/86979?format=json", "institution": "Inception Institute of AI, UAE"}, {"id": 158750, "fullname": "Yifan Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/158750?format=json", "institution": "Weill Cornell Medicine, Cornell University"}, {"id": 136624, "fullname": "Mingquan Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/136624?format=json", "institution": "University of Minnesota - Twin Cities"}, {"id": 185612, "fullname": "Zongyuan Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/185612?format=json", "institution": "Monash University"}], "abstract": "Recent advancements in multimodal large reasoning models (MLRMs) have significantly improved performance in visual question answering. However, we observe that transition words (e.g., because, however, and wait) are closely associated with hallucinations and tend to exhibit high-entropy states. We argue that adequate contextual reasoning information can be directly extracted from the token probability distribution. Inspired by superposed representation theory, we propose leveraging latent superposed reasoning to integrate multiple candidate semantics and maintain latent reasoning trajectories. The hypothesis is that reliance on discrete textual inputs may drive the model toward sequential explicit reasoning, underutilizing dense contextual cues during high-entropy reasoning stages. Therefore, we propose constructing rich semantic representations from the token probability distributions to enhance in-context reasoning. With this goal, we present  Latent Entropy-Aware Decoding (LEAD), an efficient plug-and-play decoding strategy that leverages semantic context to achieve reliable reasoning. The heart of our method lies in entropy-aware reasoning mode switching. The model employs probability-weighted continuous embeddings under high-entropy states and transitions back to discrete token embeddings as entropy decreases. Moreover, we propose a prior-guided visual anchor injection strategy that encourages the model to focus on visual information. Extensive experiments show that LEAD effectively mitigates hallucinations across various MLRMs on multiple benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38797", "url": null, "sourceid": 45898, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36969, "uid": "38b12ca2da2197746dd5ae6549648310", "name": "CATNet: Collaborative Alignment and Transformation Network for Cooperative Perception", "authors": [{"id": 146701, "fullname": "Gong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/146701?format=json", "institution": "tianjin university"}, {"id": 148789, "fullname": "Chaokun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/148789?format=json", "institution": "College of Intelligence and Computing, Tianjin University"}, {"id": 186338, "fullname": "Tao Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186338?format=json", "institution": "Tianjin University"}, {"id": 186339, "fullname": "Pengcheng Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186339?format=json", "institution": "Tianjin University"}, {"id": 186340, "fullname": "Feng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186340?format=json", "institution": "Tianjin University"}, {"id": 186341, "fullname": "Xin Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/186341?format=json", "institution": "Tianjin University"}], "abstract": "Cooperative perception significantly enhances scene understanding by integrating complementary information from diverse agents. However, existing research often overlooks critical challenges inherent in real-world multi-source data integration, specifically high temporal latency and multi-source noise. To address these practical limitations, we propose Collaborative Alignment and Transformation Network (CATNet), an adaptive compensation framework that resolves temporal latency and noise interference in multi-agent systems. Our key innovations can be summarized in three aspects. First, we introduce a Spatio-Temporal Recurrent Synchronization (STSync) that aligns asynchronous feature streams via adjacent-frame differential modeling, establishing a temporal-spatially unified representation space. Second, we design a Dual-Branch Wavelet Enhanced Denoiser (WTDen) that suppresses global noise and reconstructs localized feature distortions within aligned representations. Third, we construct an Adaptive Feature Selector (AdpSel) that dynamically focuses on critical perceptual features for robust fusion. Extensive experiments on multiple datasets demonstrate that CATNet consistently outperforms existing methods under complex traffic conditions, proving its superior robustness and adaptability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36969", "url": null, "sourceid": 31652, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38745, "uid": "f43764367fa4b73ba947fae71b0223a4", "name": "EmoDiffTalk\uff1aEmotion-aware Diffusion for Editable 3D Gaussian Talking Head", "authors": [{"id": 174262, "fullname": "Chang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/174262?format=json", "institution": "Beijing Normal University"}, {"id": 190568, "fullname": "Tianjiao Jing", "url": "http://cvpr.thecvf.com/api/miniconf/users/190568?format=json", "institution": "Beijing Normal University"}, {"id": 190569, "fullname": "Chengcheng Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/190569?format=json", "institution": "Beijing Normal University"}, {"id": 190570, "fullname": "Xuanqi Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/190570?format=json", "institution": "Beijing Normal University"}, {"id": 190571, "fullname": "Zhengxuan Lian", "url": "http://cvpr.thecvf.com/api/miniconf/users/190571?format=json", "institution": "Beijing Normal University"}, {"id": 76490, "fullname": "Qin Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/76490?format=json", "institution": "Renmin University of China"}, {"id": 190572, "fullname": "Hongliang Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190572?format=json", "institution": "Xiaomi Corporation"}, {"id": 190573, "fullname": "Shi-Sheng Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190573?format=json", "institution": "Beijing Normal University"}], "abstract": "Recent photo-realistic 3D talking head via 3D Gaussian Splatting still has significant shortcoming in emotional expression manipulation, especially for fine-grained and expansive dynamics emotional editing using multi-modal control. This paper introduces a new editable 3D Gaussian talking head, i.e. EmoDiffTalk. Our key idea is a novel Emotion-aware Gaussian Diffusion, which includes an action unit (AU) prompt Gaussian diffusion process for fine-grained facial animator, and moreover an accurate text-to-AU emotion controller to provide accurate and expansive dynamic emotional editing using text input. Experiments on public EmoTalk3D and RenderMe-360 datasets demonstrate superior emotional subtlety, lip-sync fidelity, and controllability of our EmoDiffTalk over previous works, establishing a principled pathway toward high-quality, diffusion-driven, multimodal editable 3D talking-head synthesis. To our best knowledge, our EmoDiffTalk is one of the first few 3D Gaussian Splatting talking-head generation framework, especially supporting continuous, multimodal emotional editing within the AU-based expression space.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38745", "url": null, "sourceid": 32768, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37411, "uid": "06f38e7909709a72b521a4a9d1c05841", "name": "GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking", "authors": [{"id": 180606, "fullname": "Yufei Zhan", "url": "http://cvpr.thecvf.com/api/miniconf/users/180606?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"id": 187372, "fullname": "Ziheng Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187372?format=json", "institution": null}, {"id": 187373, "fullname": "Yousong Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187373?format=json", "institution": "China University of Mining Technology - Beijing"}, {"id": 187374, "fullname": "Rongkun Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/187374?format=json", "institution": "The Chinese University of Hong Kong; Tsinghua University; Peking University; ByteDance Inc.; Xi&#x27;an Jiaotong University"}, {"id": 187375, "fullname": "Guanghao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/187375?format=json", "institution": "East China Normal University"}, {"id": 187376, "fullname": "Ruipu Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187376?format=json", "institution": null}, {"id": 180713, "fullname": "Zhenghao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/180713?format=json", "institution": "Beihang University"}, {"id": 90881, "fullname": "Can Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90881?format=json", "institution": "Peking University"}, {"id": 187377, "fullname": "Yifan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/187377?format=json", "institution": "Renmin University of China"}, {"id": 187378, "fullname": "Zhentao he", "url": "http://cvpr.thecvf.com/api/miniconf/users/187378?format=json", "institution": null}, {"id": 187379, "fullname": "Zheming Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187379?format=json", "institution": "ByteDance Inc."}, {"id": 85445, "fullname": "Ming Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85445?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 187380, "fullname": "Minghui Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187380?format=json", "institution": "ByteDance"}, {"id": 85436, "fullname": "Jinqiao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85436?format=json", "institution": "Institute of Automation, Chinese Academy of Sciences"}], "abstract": "Despite recent advances in multimodal reasoning, Multimodal Large Language Models (MLLMs) still struggle on complex tasks where initial visual perceptions can be misleading. This performance gap stems from a critical reasoning flaw we term Visual Inertia: while MLLMs excel at iterative reflection in textual contexts, they tend to uncritically commit to their initial visual interpretations and rarely revise them. To overcome this limitation, we introduce GThinker, an MLLM equipped with a novel adaptive visual rethinking capability. GThinker leverages Cue-Rethinking, a flexible reasoning pattern that not only grounds reasoning in visual cues but also strategically triggers a re-examination of these cues to resolve inconsistencies. To instill this capability, we introduce a novel two-stage training framework. It begins with a pattern-guided cold start, enhanced by a judge-guided selective mechanism to learn from failure cases, followed by incentive reinforcement learning. We further curate the  GThinker-11k dataset to power the training with an iterative multimodal annotation pipeline. Extensive experiments demonstrate that GThinker significantly mitigates visual inertia during reasoning, achieving a leading 81.5\\% on the M3CoT benchmark, which is rich in such challenges, surpassing the powerful O4-mini model. Furthermore, GThinker shows consistent improvements across a range of multimodal reasoning benchmarks with an average gain of 2.1\\%, showcasing the broad benefits of equipping MLLMs with the ability to rethink both what they see and how they think.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37411", "url": null, "sourceid": 40941, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38866, "uid": "c1b952b6948f085d619846108cec1b8b", "name": "Synthetic Knowledge-Guided Learning via Target-Region Gradients", "authors": [{"id": 190877, "fullname": "Koshiro Nagano", "url": "http://cvpr.thecvf.com/api/miniconf/users/190877?format=json", "institution": "Keio University"}, {"id": 190878, "fullname": "Ryo Fujii", "url": "http://cvpr.thecvf.com/api/miniconf/users/190878?format=json", "institution": "CMU, Carnegie Mellon University; Keio University"}, {"id": 86491, "fullname": "Ryo Hachiuma", "url": "http://cvpr.thecvf.com/api/miniconf/users/86491?format=json", "institution": "Konica Minolta, Inc."}, {"id": 190879, "fullname": "Fumiaki Sato", "url": "http://cvpr.thecvf.com/api/miniconf/users/190879?format=json", "institution": "CyberAgent, Inc."}, {"id": 86495, "fullname": "Taiki Sekii", "url": "http://cvpr.thecvf.com/api/miniconf/users/86495?format=json", "institution": "Konica Minolta, Inc."}, {"id": 93489, "fullname": "HIDEO SAITO", "url": "http://cvpr.thecvf.com/api/miniconf/users/93489?format=json", "institution": "Keio University"}], "abstract": "Training with synthetic data has become a standard strategy for improving robustness to distribution shifts. However, most existing approaches exploit synthetic samples only indirectly---for example, by enriching backgrounds, contexts, or negative examples---while providing no explicit signal about where the true target content resides.As a result, models can continue to rely on spurious correlations, which ultimately limit their robustness. In this work, we convert a basic but under-utilized provenance of synthetic data into explicit supervision: during synthesis, we know which pixels or elements originate from which source instances. We formalize this provenance as synthetic knowledge and propose a Synthetic Knowledge-Guided (SKG) training framework that uses it to shape gradients toward target regions and away from irrelevant ones via a Gradient Guide Loss. Our framework is generic and can be seamlessly integrated into diverse synthesis pipelines, including mixing-based synthesis and generative editing-based synthesis, without additional human annotations. Experiments on image classification, weakly supervised object localization, and weakly supervised spatio-temporal action localization show consistent gains over strong baseline methods. These results demonstrate that making provenance in synthetic data is an effective and widely applicable mechanism for mitigating shortcut learning and enhancing robustness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38866", "url": null, "sourceid": 38298, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36289, "uid": "24699bced4aeb7cf8d33b0319c4a5c98", "name": "Survive the 1001$^{st}$ Night: Interactive Physical Reasoning", "authors": [{"id": 179934, "fullname": "Mingyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/179934?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 180385, "fullname": "lifeng zhuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/180385?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 184697, "fullname": "Tianxi Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184697?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 184698, "fullname": "Guocan Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/184698?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 184699, "fullname": "Xian Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/184699?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 184700, "fullname": "Yan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184700?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 184701, "fullname": "Renjie Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184701?format=json", "institution": "Fudan University"}, {"id": 184702, "fullname": "Zizhu He", "url": "http://cvpr.thecvf.com/api/miniconf/users/184702?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 184703, "fullname": "Ziyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184703?format=json", "institution": null}, {"id": 184704, "fullname": "Jiting Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/184704?format=json", "institution": "Carnegie Mellon University"}, {"id": 127551, "fullname": "Yonglu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/127551?format=json", "institution": "Shanghai Jiaotong University"}], "abstract": "Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we aim to ask whether an agent can similarly acquire human-like reasoning from interaction and keep improving with more experience.We study this in a Game-to-Unseen (G2U) setting, curating 1{,}000+ heterogeneous games with diverse physical and causal mechanisms, and evaluate at three human-like levels: Survival, Curiosity, Utility, from primitive intuition to goal-driven reasoning.Our analysis reveals complementary failures: VLM/VLA agents reason but lack look-ahead in interactive settings, while world models imagine but imitate visual patterns rather than analyze physics and causality.We therefore propose \\textbf{IPR} (\\textbf{Interactive Physical Reasoning}), using world-model rollouts to score and reinforce a VLM\u2019s policy, and introduce \\textbf{PhysCode}, a physics-centric action code aligning semantic intent with dynamics to provide a shared action space for prediction and reasoning.Pretrained on 1,000+ games, our IPR performs robustly on three levels, matches GPT-5 overall, and surpasses it on Curiosity. We find that performance improves with more training games and interaction steps, and that the model also zero-shot transfers to unseen games.These results support physics-centric interaction as a path to steadily improving physical reasoning. \\textbf{Our code will be publicly available.}", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36289", "url": null, "sourceid": 30925, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36700, "uid": "a70935fb4ff7a4e569a4f573cb7eca56", "name": "An Optimal Transport-driven Approach for Cultivating Latent Space in Online Incremental Learning", "authors": [{"id": 185673, "fullname": "Quyen Tran", "url": "http://cvpr.thecvf.com/api/miniconf/users/185673?format=json", "institution": "Qualcomm Inc, QualComm"}, {"id": 185674, "fullname": "Ngoc-Hai Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185674?format=json", "institution": "QualComm; Tufts University"}, {"id": 85387, "fullname": "Minh Quan Dao", "url": "http://cvpr.thecvf.com/api/miniconf/users/85387?format=json", "institution": "Rutgers University"}, {"id": 157991, "fullname": "Hoang Phan", "url": "http://cvpr.thecvf.com/api/miniconf/users/157991?format=json", "institution": "New York University"}, {"id": 185675, "fullname": "Linh Ngo Van", "url": "http://cvpr.thecvf.com/api/miniconf/users/185675?format=json", "institution": "Hanoi University of Science and Technology"}, {"id": 185676, "fullname": "Khoat Than", "url": "http://cvpr.thecvf.com/api/miniconf/users/185676?format=json", "institution": "Hanoi University of Science and Technology"}, {"id": 128648, "fullname": "Dinh Phung", "url": "http://cvpr.thecvf.com/api/miniconf/users/128648?format=json", "institution": "Monash University"}, {"id": 75668, "fullname": "Dimitris N. Metaxas", "url": "http://cvpr.thecvf.com/api/miniconf/users/75668?format=json", "institution": "Rutgers University"}, {"id": 128675, "fullname": "Trung Le", "url": "http://cvpr.thecvf.com/api/miniconf/users/128675?format=json", "institution": "Monash University"}], "abstract": "In online incremental learning, data continuously arrives with substantial shifts in distribution, creating a significant challenge since previous samples cannot be revisited. Prior research has typically relied on either a single adaptive centroid or fixed multiple centroids to represent each class in the latent space. However, such methods struggle when class data streams are inherently multimodal and require continual centroid updates. To overcome this, we introduce an online Mixture Model learning framework grounded in Optimal Transport theory (MMOT), where centroids evolve incrementally with new data. This approach offers two main advantages: (i) it provides a more precise characterization of complex data streams, and (ii) it enables improved class similarity estimation for unseen samples during inference through MMOT-derived centroids. Furthermore, to strengthen representation learning and mitigate catastrophic forgetting, we design a Dynamic Preservation strategy that regulates the latent space and maintains class separability over time. Experimental evaluations on benchmark datasets confirm the superior effectiveness of our proposed method.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36700", "url": null, "sourceid": 42512, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39216, "uid": "279a8a4af46de7caf29071434c2aa9d9", "name": "UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register", "authors": [{"id": 127884, "fullname": "Congpei Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/127884?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 191613, "fullname": "Zhaoyu Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191613?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 86951, "fullname": "Wei Ke", "url": "http://cvpr.thecvf.com/api/miniconf/users/86951?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 89574, "fullname": "Zhuotao Tian", "url": "http://cvpr.thecvf.com/api/miniconf/users/89574?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 86955, "fullname": "Yanhao Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86955?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 85748, "fullname": "Tong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85748?format=json", "institution": "EPFL / University of Chinese Academy of Sciences"}], "abstract": "Representation learning with Vision Transformers (ViTs) has advanced rapidly, yet the utility of large-scale models in spatially sensitive tasks is hindered by spurious tokens. Prior efforts to mitigate this have been limited, often defining these artifacts narrowly, for example, as simple high-norm outliers. We argue that this scope is insufficient. For dense prediction tasks, we posit that any token failing to encode location-aligned semantics should be treated as a spurious artifact. This broader definition reveals a more complex problem, leading us to systematically categorize and characterize three fundamental types of spurious tokens that corrupt spatial representations. Based on this comprehensive diagnosis, we propose UniRefiner, a universal refinement framework that teaches pre-trained ViTs to self-dispose of these artifacts. UniRefiner uses contrastive registers to explicitly isolate and redistribute spurious tokens via a dual objective: (i) it aligns image tokens with filtered regular tokens to preserve semantics, and (ii) it aligns register tokens with detected spurious tokens to capture the spurious signals. Our method requires only a few epochs of fine-tuning on ~5k images to refine diverse ViTs, including massive models like EVA-CLIP-8B and InternViT-6B. Experiments demonstrate consistent and significant improvements: notably, the refined EVA-CLIP-8B achieves 51.9\\% mIoU on ADE20K (+9.4\\%), surpassing specialized vision models like DINOv2 (49.1\\%), while zero-shot segmentation accuracy improves by up to 22\\%. UniRefiner unlocks the latent spatial potential of existing large-scale foundation models, paving the way for their broader application.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39216", "url": "https://congpeiqiu.github.io/UniRefiner/", "sourceid": 44042, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39502, "uid": "d6dee97af59164b2d3a186e03d8ebdf0", "name": "Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation", "authors": [{"id": 182172, "fullname": "Tairan He", "url": "http://cvpr.thecvf.com/api/miniconf/users/182172?format=json", "institution": "NVIDIA, CMU"}, {"id": 165949, "fullname": "Zi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/165949?format=json", "institution": "NVIDIA"}, {"id": 182161, "fullname": "Haoru Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/182161?format=json", "institution": "UC Berkeley"}, {"id": 181826, "fullname": "Qingwei Ben", "url": "http://cvpr.thecvf.com/api/miniconf/users/181826?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 96336, "fullname": "Zhengyi Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/96336?format=json", "institution": "Carnegie Mellon University"}, {"id": 192208, "fullname": "Wenli Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192208?format=json", "institution": null}, {"id": 88866, "fullname": "Ye Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/88866?format=json", "institution": "NVIDIA Research"}, {"id": 192209, "fullname": "Xingye Da", "url": "http://cvpr.thecvf.com/api/miniconf/users/192209?format=json", "institution": null}, {"id": 192210, "fullname": "Fernando Casta\u00f1eda", "url": "http://cvpr.thecvf.com/api/miniconf/users/192210?format=json", "institution": "NVIDIA"}, {"id": 149610, "fullname": "Shankar Sastry", "url": "http://cvpr.thecvf.com/api/miniconf/users/149610?format=json", "institution": "University of California Berkeley"}, {"id": 192215, "fullname": "Changliu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192215?format=json", "institution": "Carnegie Mellon University"}, {"id": 141900, "fullname": "Guanya Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/141900?format=json", "institution": "CMU, Carnegie Mellon University"}, {"id": 169493, "fullname": "Linxi Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/169493?format=json", "institution": "NVIDIA"}, {"id": 75460, "fullname": "Yuke Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75460?format=json", "institution": "University of Texas - Austin"}], "abstract": "A core barrier to the real-world productivity of humanoid robots is the lack of autonomous loco-manipulation skills. We introduce VIRAL, a visual sim-to-real framework that learns humanoid loco-manipulation entirely in simulation and deploys it zero-shot to real hardware. VIRAL follows a teacher-student design: a privileged RL teacher, operating on full state, learns long-horizon loco-manipulation using a delta action space and reference state initialization. A vision-based student policy is then distilled from the teacher via large-scale simulation with tiled rendering, trained with a mixture of online DAgger and behavior cloning. We find that compute scale is critical: scaling simulation to tens of GPUs (up to 64) makes both teacher and student training reliable, while low-compute regimes often fail. To bridge the sim-to-real gap, VIRAL combines large-scale visual domain randomization\u2014over lighting, materials, camera parameters, image quality, and sensor delays\u2014with real-to-sim alignment of the dexterous hand and cameras. Deployed on a Unitree G1 humanoid, the resulting RGB-based policy performs continuous loco-manipulation for up to 54 cycles, generalizing to diverse spatial and appearance variations without any real-world fine-tuning, and approaching expert-level teleoperation performance. Extensive ablations dissect the key design choices required to make RGB-based humanoid loco-manipulation work in practice.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39502", "url": null, "sourceid": 38320, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38261, "uid": "2e904afa80a8ca949f187f64ff2d15b2", "name": "Layer Consistency Matters: Elegant Latent Transition Discrepancy for Generalizable Synthetic Image Detection", "authors": [{"id": 189444, "fullname": "Yawen Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189444?format=json", "institution": "Hefei University of Technology"}, {"id": 89880, "fullname": "Feng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/89880?format=json", "institution": "Hefei University of Technology"}, {"id": 189445, "fullname": "Shuqi Kong", "url": "http://cvpr.thecvf.com/api/miniconf/users/189445?format=json", "institution": "Hefei University of Technology"}, {"id": 186060, "fullname": "Yunfeng Diao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186060?format=json", "institution": "Hefei University of Technology"}, {"id": 189446, "fullname": "Xinjian Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189446?format=json", "institution": "Hefei University of Technology"}, {"id": 189447, "fullname": "Zenglin Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/189447?format=json", "institution": "Hefei University of Technology"}, {"id": 85089, "fullname": "Meng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85089?format=json", "institution": "Hefei University of Technology"}], "abstract": "Recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images. While enabling various innovative applications, the unprecedented realism of these synthetics makes them increasingly indistinguishable from authentic photographs, posing serious security risks, such as media credibility and content manipulation. Although extensive efforts have been dedicated to detecting synthetic images, most existing approaches suffer from poor generalization to unseen data due to their reliance on model-specific artifacts or low-level statistical cues. In this work, we identify a previously unexplored distinction that real images maintain consistent semantic attention and structural coherence in their latent representations, exhibiting more stable feature transitions across network layers, whereas synthetic ones present discernible distinct patterns. Therefore, we propose a novel approach termed latent transition discrepancy (LTD), which captures the inter-layer consistency differences of real and synthetic images. LTD adaptively identifies the most discriminative layers and assesses the transition discrepancies across layers. Benefiting from the proposed inter-layer discriminative modeling, our approach exceeds the base model by 14.35\\% in mean Acc across three datasets containing diverse GANs and DMs. Extensive experiments demonstrate that LTD outperforms recent state-of-the-art methods, achieving superior detection accuracy, generalizability, and robustness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38261", "url": null, "sourceid": 30672, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39938, "uid": "ab3e363159c8a7f02c774f0d6bc7c922", "name": "SciEducator: Scientific Video Understanding and Educating via Deming-Cycle Multi-Agent System", "authors": [{"id": 193156, "fullname": "Zhiyu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193156?format=json", "institution": "Jinan University"}, {"id": 144052, "fullname": "Weilong Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/144052?format=json", "institution": "National University of Singapore"}, {"id": 193157, "fullname": "YUFEI SHI", "url": "http://cvpr.thecvf.com/api/miniconf/users/193157?format=json", "institution": "Nanyang Technological University"}, {"id": 152161, "fullname": "Xin Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/152161?format=json", "institution": "NVIDIA"}, {"id": 156250, "fullname": "Tao He", "url": "http://cvpr.thecvf.com/api/miniconf/users/156250?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 86005, "fullname": "Huiping Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86005?format=json", "institution": "South China University of Technology"}, {"id": 152147, "fullname": "Ming Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152147?format=json", "institution": "Guangming Laboratory"}, {"id": 163978, "fullname": "Hehe Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/163978?format=json", "institution": "Zhejiang University"}], "abstract": "Recent advancements in multimodal large language models (MLLMs) and video agent systems have significantly improved general video understanding. However, when applied to scientific video understanding and educating\u2014a domain that demands external professional knowledge integration and rigorous step-wise reasoning\u2014existing approaches often struggle. To bridge this gap, we propose SciEducator, an iterative self-evolving multi-agent system for scientific video comprehension and education. Rooted in the classical Deming Cycle from management science, our design reformulates its Plan\u2013Do\u2013Study\u2013Act philosophy into a self-evolving reasoning and feedback mechanism, which facilitates the interpretation of intricate scientific activities in videos. Moreover, SciEducator can produce multimodal educational content tailored to specific scientific processes, including textual instructions, visual guides, audio narrations, and interactive references. To support evaluation, we construct SciVBench, a benchmark consisting of 500 expert-verified and literature-grounded science QA pairs across five categories, covering physical, chemical, and everyday phenomena. Extensive experiments demonstrate that SciEducator substantially outperforms leading closed-source MLLMs (e.g., Gemini, GPT-4o) and state-of-the-art video agents on the benchmark, establishing a new paradigm for the community.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39938", "url": null, "sourceid": 37086, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39732, "uid": "0cc0573dc3c00da89c04e5a8259ef832", "name": "$\\oslash$ Source Models Leak What They Shouldn\u2019t $\\nrightarrow$ : Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization", "authors": [{"id": 192739, "fullname": "Arnav Devalapally", "url": "http://cvpr.thecvf.com/api/miniconf/users/192739?format=json", "institution": "University of Michigan"}, {"id": 181659, "fullname": "Poornima Jain", "url": "http://cvpr.thecvf.com/api/miniconf/users/181659?format=json", "institution": "Indian Institute of Technology, Hyderabad"}, {"id": 192740, "fullname": "Kartik Srinivas", "url": "http://cvpr.thecvf.com/api/miniconf/users/192740?format=json", "institution": null}, {"id": 153478, "fullname": "Vineeth Balasubramanian", "url": "http://cvpr.thecvf.com/api/miniconf/users/153478?format=json", "institution": "Microsoft Research and IIT-Hyderabad"}], "abstract": "The increasing adaptation of vision models across domains, such as satellite imagery and medical scans, has raised an emerging privacy risk: models may inadvertently retain and leak sensitive source-domain specific information in the target domain. This creates a compelling use case for machine unlearning to protect the privacy of sensitive source-domain data. Among adaptation techniques, source-free domain adaptation (SFDA) calls for an urgent need for machine unlearning (MU), where the source data itself is protected, yet the source model exposed during adaptation encodes its influence. Our experiments reveal that existing SFDA methods exhibit strong zero-shot performance on source-exclusive classes in the target domain, indicating they inadvertently leak knowledge of these classes into the target domain, even when they are not represented in the target data. We identify and address this risk by proposing an MU setting called *SCADA-UL*: **U**n**l**earning **S**ource-exclusive **C**l**A**sses in **D**omain **A**daptation. Existing MU methods do not address this setting as they are not designed to handle data distribution shifts. We propose a new unlearning method, where an adversarially generated forget class sample is unlearned by the model during the domain adaptation process using a novel rescaled labeling strategy and adversarial optimization.We also extend our study to two variants: a continual version of this problem setting and to one where the specific source classes to be forgotten may be unknown.Alongside theoretical interpretations, our comprehensive empirical results show that our method consistently outperforms baselines in the proposed setting while achieving retraining-level unlearning performance on benchmark datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39732", "url": null, "sourceid": 39505, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39768, "uid": "df5aec07cb149d32c29c536d694bce24", "name": "GeoDexGrasp: Geometry-aware Generation for Data-efficient and Physics-plausible Dexterous Grasping", "authors": [{"id": 143462, "fullname": "Bing Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/143462?format=json", "institution": "Xi &#x27;an Jiaotong University"}, {"id": 180538, "fullname": "Weiyuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180538?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 192816, "fullname": "changlong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192816?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 192817, "fullname": "Chenxi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192817?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 192818, "fullname": "Zhibin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192818?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 192819, "fullname": "Zhi Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/192819?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "Achieving dexterous grasping remains a key challenge in robotics. Recent generative approaches enable diverse grasps through large-scale data-driven training, yet they often neglect geometric priors of objects, which leads to low data efficiency and poor physical plausibility. We propose GeoDexGrasp, a geometry-aware generation framework for dexterous grasping built upon object-centric geometric representations. We introduce a SIM(3)-equivariant network equipped with a self-supervised disentanglement strategy to extract interpretable and transferable geometric features, including shape, size, pose, and interaction direction.The overall generation process is then decomposed into two stages: first, root rotation generation conditioned on pose and interaction direction; second, hand grasp generation guided by shape and size. By leveraging geometric representations, GeoDexGrasp achieves SOTA physical plausibility (reducing 40\\% penetration depth) across five datasets, and exhibits improved data efficiency. Additionally, GeoDexGrasp is also lightweight (using less than 20\\% of the parameters of the previous SOTA method) and attains a comparable grasp success rate.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39768", "url": null, "sourceid": 36398, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37835, "uid": "2b01c70df55e88336d156e20f583a90e", "name": "Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training", "authors": [{"id": 145360, "fullname": "Gengluo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/145360?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 157987, "fullname": "Pengyuan Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157987?format=json", "institution": "Tencent"}, {"id": 157988, "fullname": "Chengquan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157988?format=json", "institution": "Tencent"}, {"id": 188367, "fullname": "Huawen Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188367?format=json", "institution": "Institute of Information Engineering"}, {"id": 188368, "fullname": "Liang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188368?format=json", "institution": "Tencent"}, {"id": 188369, "fullname": "Xingyu Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188369?format=json", "institution": "Tencent"}, {"id": 154558, "fullname": "Gangyan Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/154558?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 126300, "fullname": "Han Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/126300?format=json", "institution": "Microsft Research Asia"}, {"id": 152556, "fullname": "Can Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/152556?format=json", "institution": "Institute of Information Engineering, Chinese Academy of Sciences"}, {"id": 152555, "fullname": "Yu ZHOU", "url": "http://cvpr.thecvf.com/api/miniconf/users/152555?format=json", "institution": "Nankai University"}], "abstract": "Document parsing has recently advanced with multimodal large language models (MLLMs) that directly map document images to structured outputs. Traditional cascaded pipelines depend on precise layout analysis and often fail under casually captured or non-standard conditions. Although end-to-end approaches mitigate this dependency, they still exhibit repetitive, hallucinated, and structurally inconsistent predictions\u2014primarily due to the scarcity of large-scale, high-quality full-page (document-level) end-to-end parsing data and the lack of structure-aware training strategies. To address these challenges, we propose a data\u2013training co-design framework for robust end-to-end document parsing. A Realistic Scene Synthesis strategy constructs large-scale, structurally diverse full-page end-to-end supervision by composing layout templates with rich document elements, while a Document-Aware Training Recipe introduces progressive learning and structure-token optimization to enhance structural fidelity and decoding stability. We further build Wild-OmniDocBench, a benchmark derived from real-world captured documents for robustness evaluation. Integrated into a 1B-parameter MLLM, our method achieves superior accuracy and robustness across both scanned/digital and real-world captured scenarios. All models, data synthesis pipelines, and benchmarks will be publicly released to advance future research in document understanding.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37835", "url": null, "sourceid": 45822, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38100, "uid": "b7acd6cfcbb355845b1c5164ceb8a846", "name": "TINA: Text-Free Inversion Attack for Unlearned Text-to-Image Diffusion Models", "authors": [{"id": 154331, "fullname": "Qianlong Xiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154331?format=json", "institution": "Harbin Institute of Technology"}, {"id": 154332, "fullname": "Miao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154332?format=json", "institution": "Harbin Institute of Technology (Shenzhen)"}, {"id": 154768, "fullname": "Haoyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/154768?format=json", "institution": "Harbin Institute of Technology"}, {"id": 189058, "fullname": "Kun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189058?format=json", "institution": "Shandong University"}, {"id": 128772, "fullname": "Junhui Hou", "url": "http://cvpr.thecvf.com/api/miniconf/users/128772?format=json", "institution": "City University of Hong Kong"}, {"id": 84777, "fullname": "Liqiang Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/84777?format=json", "institution": "Harbin Institute of Technology (Shenzhen)"}], "abstract": "Although text-to-image diffusion models exhibit remarkable generative power, concept erasure techniques are essential for their safe deployment to prevent the creation of harmful content.This has fostered a dynamic interplay between the development of erasure defenses and the adversarial probes designed to bypass them, and this co-evolution has progressively enhanced the efficacy of erasure methods.However, this adversarial co-evolution has converged on a narrow, text-centric paradigm that equates erasure with severing the text-to-image mapping, ignoring that the underlying visual knowledge related to undesired concepts still persist.To substantiate this claim, we investigate from a visual perspective, leveraging DDIM inversion to probe whether a generative pathway for the erased concept can still be found.However, identifying such a visual generative pathway is challenging because standard text-guided DDIM inversion is actively resisted by text-centric defenses within the erased model.To address this, we introduce TINA, a novel Text-free INversion Attack, which enforces this visual-only probe by operating under a null-text condition, thereby avoiding existing text-centric defenses.Moreover, TINA integrates an optimization procedure to overcome the accumulating approximation errors that arise when standard inversion operates without its usual textual guidance.Our experiments demonstrate that TINA successfully regenerates erased concepts from models treated with state-of-the-art unlearning.The success of TINA proves that current methods merely obscure concepts, highlighting an urgent need for paradigms that operate directly on internal visual knowledge.Code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38100", "url": null, "sourceid": 42591, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39581, "uid": "9a7c22ed48340ab6cd2a273912d51767", "name": "IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation", "authors": [{"id": 130102, "fullname": "Yankai Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130102?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 192404, "fullname": "Qiaoru Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192404?format=json", "institution": "Meituan; Zhejiang University; Northeastern University"}, {"id": 176164, "fullname": "BinLu Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/176164?format=json", "institution": "Zhejiang University"}, {"id": 192405, "fullname": "Haoran Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/192405?format=json", "institution": "Fudan University"}, {"id": 192406, "fullname": "Chao Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/192406?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 70502, "fullname": "Junting Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/70502?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 183663, "fullname": "Yuxiang Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/183663?format=json", "institution": "Zhejiang University"}, {"id": 192407, "fullname": "Xuhong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192407?format=json", "institution": "Zhejiang University"}, {"id": 190842, "fullname": "Jianwei Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/190842?format=json", "institution": "Zhejiang University"}], "abstract": "Recent research on medical MLLMs has gradually shifted its focus from image-level understanding to fine-grained, pixel-level comprehension. Although segmentation serves as the foundation for pixel-level understanding, existing approaches face two major challenges. First, they introduce implicit segmentation tokens and require simultaneous fine-tuning of both the MLLM and external pixel decoders, which increases the risk of catastrophic forgetting and limits generalization to out-of-domain scenarios. Second, most methods rely on single-pass reasoning and lack the capability to iteratively refine segmentation results, leading to suboptimal performance. To overcome these limitations, we propose IBISAgent\u2014a novel agentic MLLM that reformulates segmentation as a vision-centric, multi-step decision-making process. IBISAgent enables MLLMs to generate interleaved reasoning and text-based click actions, invoke segmentation tools, and produce high-quality masks without architectural modifications. By iteratively performing multi-step visual reasoning on masked image features, IBISAgent naturally supports mask refinement and promotes the development of pixel-level visual reasoning capabilities. We further design a two-stage training framework consisting of cold-start supervised fine-tuning and agentic reinforcement learning with tailored, fine-grained rewards, enhancing the model\u2019s robustness in complex medical referring and reasoning segmentation tasks. Extensive experiments demonstrate that IBISAgent consistently outperforms both closed-source and open-source SOTA methods. All datasets, code, and trained models will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39581", "url": null, "sourceid": 32805, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37327, "uid": "7aeb49ed1f0520808e3d0be990604367", "name": "ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Senisng", "authors": [{"id": 187168, "fullname": "Zhenghui Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187168?format=json", "institution": "Wuhan University"}, {"id": 187169, "fullname": "Chen Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187169?format=json", "institution": "Wuhan University"}, {"id": 71122, "fullname": "Xiangyong Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/71122?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 158612, "fullname": "Di Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/158612?format=json", "institution": "Wuhan University"}, {"id": 187170, "fullname": "Hongruixuan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/187170?format=json", "institution": "The University of Tokyo"}, {"id": 145621, "fullname": "Datao Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/145621?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 187171, "fullname": "Liangpei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187171?format=json", "institution": "Wuhan University"}, {"id": 131621, "fullname": "Zhuo Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/131621?format=json", "institution": "Stanford University"}], "abstract": "Spatiotemporal image generation is a highly meaningful task, which can generate future scenes conditioned on given observations. However, existing change generation methods can only handle event-driven changes (e.g., new buildings) and fail to model cross-temporal variations (e.g., seasonal shifts). In this work, we propose ChangeBridge, a conditional spatiotemporal image generation model for remote sensing. Given pre-event images and multimodal event controls, ChangeBridge generates post-event scenes that are both spatially and temporally coherent. The core idea is a drift-asynchronous diffusion bridge. Specifically, it consists of three main modules: a) Composed bridge initialization, which replaces noise initialization. It starts the diffusion from a composed pre-event state, modeling a diffusion bridge process. b) Asynchronous Drift Diffusion, which uses a pixel-wise drift map, assigning different drift magnitudes to event and temporal evolution. This enables differentiated generation during the pre-to-post transition. c) Drift-Aware Denoising, which embeds the drift map into the denoising network, guiding drift-aware reconstruction.Experiments show that ChangeBridge can generate better cross-spatiotemporal aligned scenarios compared to state-of-the-art methods. Additionally, ChangeBridge shows great potential for land-use planning and as a data generation engine for a series of change detection tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37327", "url": null, "sourceid": 37707, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39930, "uid": "1fb0a167e68e4e697611819f9f970c8e", "name": "Bootstrap Your Own AV-Proxies: Adaptive Contrastive and Prototype Learning for Audio-Visual Segmentation", "authors": [{"id": 176708, "fullname": "Junbo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/176708?format=json", "institution": "School of Computer Science, Wuhan University"}, {"id": 193134, "fullname": "hangsu hangsu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193134?format=json", "institution": "Wuhan University"}, {"id": 193135, "fullname": "Zhaofan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193135?format=json", "institution": "Wuhan University"}, {"id": 193136, "fullname": "Hang Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/193136?format=json", "institution": "Wuhan University"}, {"id": 193137, "fullname": "Chao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/193137?format=json", "institution": "Wuhan University"}], "abstract": "Audio-visual segmentation (AVS) aims to accurately segment sounding objects in video frames by leveraging audio-visual correspondence cues. However, it remains challenging due to the intrinsic semantic incompleteness within a single modality and the semantic gap between audio and visual representations. Traditional feature-fusion-based decoding approaches struggle to suppress fusion noise effectively, while recent methods that incorporate data-dependent priors often increase the complexity of modeling audio-visual correlations, leading to poor cross-domain generalization. To address these issues, we propose a novel adaptive contrastive and prototype learning framework, BYOAVP, for AVS. Specifically, we design a Self-Supervised Audio Enhancement (SSAE) module that introduces contrastive learning to adaptively align audio representations with gradient-blocked visual embeddings, thus narrowing the semantic gap between modalities. Furthermore, a Dynamic Prototype Constraint (DPC) module is developed to refine pixel-wise category perception via momentum-based prototype updating, while enhancing the localization of sounding regions through cross-modal interaction. Extensive experiments demonstrate that our method achieves state-of-the-art performance across two AVS benchmarks and six sub-tasks, exhibiting strong robustness and generalization ability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39930", "url": null, "sourceid": 42420, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37151, "uid": "6b7f18f1f939115a1bd761ec64bd2c55", "name": "When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models", "authors": [{"id": 180686, "fullname": "Hui Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180686?format=json", "institution": "Nanyang Technological University"}, {"id": 76212, "fullname": "Yi Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/76212?format=json", "institution": "Nanyang Technological University, Singapore"}, {"id": 186784, "fullname": "Yiming Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186784?format=json", "institution": "Nanyang Technological University"}, {"id": 186785, "fullname": "Chenyu Yi", "url": "http://cvpr.thecvf.com/api/miniconf/users/186785?format=json", "institution": "Nanyang Technological University"}, {"id": 186786, "fullname": "Qixin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186786?format=json", "institution": "Nanyang Technological University"}, {"id": 186787, "fullname": "Bingquan Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186787?format=json", "institution": "National University of Singapore; DSO National Labs"}, {"id": 87266, "fullname": "Alex C. Kot", "url": "http://cvpr.thecvf.com/api/miniconf/users/87266?format=json", "institution": "Nanyang Technological University"}, {"id": 87664, "fullname": "Xudong Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87664?format=json", "institution": "Nanyang Technological University"}], "abstract": "Vision-Language-Action (VLA) models are vulnerable to adversarial attacks, yet universal and transferable attacks remain underexplored, as most existing patches overfit to a single model and fail in black-box settings. To address this gap, we present a systematic study of **universal, transferable adversarial patches** against VLA-driven robots under unknown architectures, finetuned variants, and sim-to-real shifts. We introduce **UPA-RFAS** (Universal Patch Attack via Robust Feature, Attention, and Semantics), a unified framework that learns a single physical patch in a shared feature space while promoting cross-model transfer. UPA-RFAS combines (i) a feature-space objective with an $\\ell_1$ deviation prior and repulsive InfoNCE loss to induce transferable representation shifts, (ii) a robustness-augmented two-phase min-max procedure where an inner loop learns invisible sample-wise perturbations and an outer loop optimizes the universal patch against this hardened neighborhood, and (iii) two VLA-specific losses: Patch Attention Dominance to hijack text$\\to$vision attention and Patch Semantic Misalignment to induce image-text mismatch without labels. Experiments across diverse VLA models, manipulation suites, and physical executions show that UPA-RFAS consistently transfers across models, tasks, and viewpoints, exposing a practical patch-based attack surface and establishing a strong baseline for future defenses.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37151", "url": null, "sourceid": 32325, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37294, "uid": "8ae8e407864cd3b5e1298bdf1e6787f5", "name": "ActivePolicy: Active Gaussian Reconstruction and Optimization Strategy Based on Global-Local Information Gain", "authors": [{"id": 183027, "fullname": "Yingzhao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183027?format=json", "institution": "Yangtze River Delta HlT Robot Technology Research Institute"}, {"id": 187101, "fullname": "Yanjie Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187101?format=json", "institution": null}, {"id": 187102, "fullname": "lijun zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187102?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Active 3D Gaussian reconstruction achieves superior completeness and rendering quality by intelligently selecting viewpoints. However, existing methods suffer from two critical limitations: information gain metrics that prioritize geometric coverage while ignoring rendering quality, and overfitting to sparse view configurations that degrades novel view synthesis. We introduce ActivePolicy, a novel framework addressing both challenges through principled NBV selection and regularization. We propose \\textbf{GL-Graph}, a graph-theoretic strategy that unifies geometric consistency, rendering quality, and observation redundancy into a single stability criterion. To counteract overfitting, we introduce \\textbf{4D-Reg}, which identifies floaters through manifold discrepancies among three depth types (R-Depth, $\\alpha$-Depth, C-Depth) and suppresses them via adaptive dropout. Extensive experiments demonstrate state-of-the-art reconstruction completeness and rendering fidelity on standard benchmarks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37294", "url": null, "sourceid": 45496, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38756, "uid": "1509c5822b3f60050ebd79dbc1343b06", "name": "GraspALL: Adaptive Structural Compensation from Luminance Variation for Robotic Garment Grasping in Any Low-Light Conditions", "authors": [{"id": 190591, "fullname": "Haifeng Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/190591?format=json", "institution": "Jilin University"}, {"id": 175853, "fullname": "Wenshuo Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/175853?format=json", "institution": "Jilin University"}, {"id": 190592, "fullname": "Zhouyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190592?format=json", "institution": "Jilin University"}, {"id": 71666, "fullname": "Runyang Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/71666?format=json", "institution": "Jilin University"}, {"id": 87043, "fullname": "Fan Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87043?format=json", "institution": "Institute of Computing Technology, CAS"}, {"id": 131773, "fullname": "Tong-yee Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/131773?format=json", "institution": "National Cheng Kung University"}, {"id": 190593, "fullname": "zipei fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190593?format=json", "institution": "Jilin University"}, {"id": 129288, "fullname": "Ruihai Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129288?format=json", "institution": "Peking University"}, {"id": 190594, "fullname": "Yuran Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190594?format=json", "institution": "Peking University"}, {"id": 76571, "fullname": "Hao Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76571?format=json", "institution": "Peking University"}, {"id": 190595, "fullname": "Hechang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190595?format=json", "institution": "Jilin University"}, {"id": 73483, "fullname": "Hyung Jin Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73483?format=json", "institution": "University of Birmingham"}, {"id": 76807, "fullname": "Yixing Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/76807?format=json", "institution": "Jilin University"}], "abstract": "Achieving accurate garment grasping under dynamically changing illumination is crucial for all-day operation of service robots. However, the reduced illumination in low-light scenes severely degrades garment structural features, leading to a significant drop in grasping robustness. Existing methods typically enhance RGB features by exploiting the illumination-invariant properties of non-RGB modalities, yet they overlook the varying dependence on non-RGB features under varying lighting conditions, which can introduce misaligned non-RGB cues and thereby weaken the model\u2019s adaptability to illumination changes. To address this problem, we propose GraspALL, an illumination-structure interactive compensation model. The innovation of GraspALL lies in encoding continuous illumination changes into quantitative references to guide adaptive feature compensation between RGB and non-RGB modalities, thereby generating illumination-consistent grasping representations. Experiments on the self-built multimodal garment grasping (MIGG) dataset demonstrate that GraspALL improves grasping accuracy by 32-44% over baseline methods under diverse illumination conditions.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38756", "url": null, "sourceid": 38248, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38003, "uid": "b8b687939c2ff82c6cc395c7de783262", "name": "S$^2$-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance", "authors": [{"id": 188804, "fullname": "Beining Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188804?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 101309, "fullname": "Siting Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/101309?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 188805, "fullname": "Zhao Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188805?format=json", "institution": "Nanyang Technological University"}, {"id": 167442, "fullname": "Junxian Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/167442?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 129735, "fullname": "Hesheng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/129735?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "3D Visual Grounding (3DVG) focuses on locating objects in 3D scenes based on natural language descriptions, serving as a fundamental task for embodied AI and robotics. Recent advances in Multi-modal Large Language Models (MLLMs) have motivated research into extending them to 3DVG. However, MLLMs primarily process 2D visual inputs and struggle with understanding 3D spatial structure of scenes solely from these limited perspectives. Existing methods mainly utilize viewpoint-dependent rendering of reconstructed point clouds to provide explicit structural guidance for MLLMs in 3DVG tasks, leading to inefficiency and limited spatial reasoning. To address this issue, we propose S$^2$-MLLM, an efficient framework that enhances spatial reasoning in MLLMs through implicit spatial reasoning. We introduce a spatial guidance strategy that leverages the structure awareness of feed-forward 3D reconstruction. By acquiring 3D structural understanding during training, our model can implicitly reason about 3D scenes without relying on inefficient point cloud reconstruction. Moreover, we propose a structure-enhanced module (SE), which first employs intra-view and inter-view attention mechanisms to capture dependencies within views and correspondences across views. The module further integrates multi-level position encoding to associate visual representations with spatial positions and viewpoint information, enabling more accurate structural understanding. Extensive experiments demonstrate that S$^2$-MLLM unifies superior performance, generalization, and efficiency, achieving significant performance over existing methods across the ScanRefer, Nr3D, and Sr3D datasets. Code will be available upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38003", "url": null, "sourceid": 41069, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37831, "uid": "4fbdc7de3e31de33862586c8db456f53", "name": "Intrinsic Concept Extraction Based on Compositional Interpretability", "authors": [{"id": 174289, "fullname": "Hanyu Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/174289?format=json", "institution": "Guangdong University of Technology; VIPSHOP"}, {"id": 188353, "fullname": "Hong Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/188353?format=json", "institution": "VIPSHOP"}, {"id": 188354, "fullname": "Guoheng Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188354?format=json", "institution": "Guangdong University of Technology"}, {"id": 188355, "fullname": "Jianbin Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188355?format=json", "institution": "VIPSHOP"}, {"id": 188356, "fullname": "Xuhang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188356?format=json", "institution": "Huizhou University"}, {"id": 87613, "fullname": "Chi-Man Pun", "url": "http://cvpr.thecvf.com/api/miniconf/users/87613?format=json", "institution": "University of Macau"}, {"id": 150524, "fullname": "Shanhu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150524?format=json", "institution": "VIPSHOP"}, {"id": 188357, "fullname": "Pan Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188357?format=json", "institution": "VIPSHOP"}], "abstract": "Unsupervised Concept Extraction aims to extract concepts from a single image, yet existing methods suffer from the inability to extract composable intrinsic concepts. To address this, this paper introduces a new task called Compositional and Interpretable Intrinsic Concept Extraction (CI-ICE). The CI-ICE task aims to leverage diffusion-based text-to-image models to extract composable object-level and attribute-level concepts from a single image, such that the original concept can be reconstructed through the combination of these concepts. To achieve this goal, we propose a method called HyperExpress, which addresses the CI-ICE task through two core aspects. Specifically, first, we propose a concept learning approach that leverages the inherent hierarchical modeling capability of hyperbolic space to achieve accurate concept disentanglement while preserving the hierarchical structure and relational dependencies among concepts; second, we introduce a concept-wise optimization method that maps the concept embedding space to maintain complex inter-concept relationships while ensuring concept composability. Our method demonstrates outstanding performance in extracting compositionally interpretable intrinsic concepts from a single image.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37831", "url": null, "sourceid": 36360, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37545, "uid": "c909eca59c9bf353a358f03ca9658916", "name": "SOTA: Self-adaptive Optimal Transport for Zero-Shot Classification with Multiple Foundation Models", "authors": [{"id": 187688, "fullname": "Zhanxuan Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187688?format=json", "institution": "Yunnan Normal University"}, {"id": 165925, "fullname": "Xu Qiyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/165925?format=json", "institution": "Yunnan Normal University"}, {"id": 187689, "fullname": "Yu Duan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187689?format=json", "institution": "Xidian University"}, {"id": 187690, "fullname": "Yonghang Tai", "url": "http://cvpr.thecvf.com/api/miniconf/users/187690?format=json", "institution": null}, {"id": 127563, "fullname": "Huafeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/127563?format=json", "institution": "Kunmimg University of Science and Technology"}], "abstract": "Foundation models have attracted widespread attention across domains due to their powerful zero-shot classification capabilities. This work is motivated by two key observations: (1) \\textit{Vision-Language Models} (VLMs), such as CLIP, often over-rely on class-level textual priors and struggle to capture fine-grained visual cues, whereas \\textit{Vision-only Foundation Models} (VFMs), such as DINO, provide rich and discriminative visual features but lack semantic alignment; (2) the performance of different VLMs varies considerably across datasets owing to differences in pre-training. To address these challenges, we propose \\textbf{SOTA} (\\textit{Self-adaptive Optimal TrAnsport}), a \\textit{training-free} ensemble framework that integrates the outputs of multiple foundation models~(VFMs or VLMs) by learning a self-adaptive transport plan. Notably, \\textbf{SOTA} requires no hyperparameter tuning and automatically balances model contributions. Extensive experiments across diverse domains, including natural images, medical pathology, and remote sensing, validate the generalizability of \\textbf{SOTA}. The results consistently show that it effectively leverages the complementary strengths of different foundation models and achieves substantial improvements over individual models. All codes are provided in the supplementary materials and will be released upon the acceptance of this paper.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37545", "url": null, "sourceid": 39641, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38358, "uid": "f71a344159282c7b94cb0e465f0fa610", "name": "ZeroIDIR: Zero-Reference Illumination Degradation Image Restoration with Perturbed Consistency Diffusion Models", "authors": [{"id": 189710, "fullname": "Hai Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189710?format=json", "institution": "Sichuan University"}, {"id": 189711, "fullname": "Zhen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189711?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 76644, "fullname": "Yinjie Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/76644?format=json", "institution": "Sichuan University"}, {"id": 189712, "fullname": "Songchen Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/189712?format=json", "institution": "Sichuan University"}, {"id": 105589, "fullname": "Bing Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/105589?format=json", "institution": "University of Electronic Science and Technology of China,"}, {"id": 93490, "fullname": "Shuaicheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/93490?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "In this paper, we propose a zero-reference diffusion-based framework, named ZeroIDIR, for illumination degradation image restoration, which decouples the restoration process into adaptive illumination correction and diffusion-based reconstruction while being trained solely on low-quality degraded images. Specifically, we design an adaptive gamma correction module that performs spatially varying exposure correction to generate illumination-corrected only representations to mitigate exposure bias and serve as reliable inputs for subsequent diffusion processes, where a histogram-guided illumination correction loss is introduced to regularize the corrected illumination distribution toward that of natural scenes. Subsequently, the illumination-corrected image is treated as an intermediate noisy state for the proposed perturbed consistency diffusion model to reconstruct details and suppress noise. Moreover, a perturbed diffusion consistency loss is proposed to constrain the forward diffusion trajectory of the final restored image to remain consistent with the perturbed state, thus improving restoration fidelity and stability in the absence of supervision. Extensive experiments on publicly available benchmarks show that the proposed method outperforms state-of-the-art unsupervised competitors and is comparable to supervised methods while being more generalizable to various scenes. Code will be released to facilitate future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38358", "url": null, "sourceid": 35978, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39344, "uid": "51e99940fd54d7566cb8e00b9e029bb9", "name": "Select, Hypothesize and Verify: Towards Verified Neuron Concept Interpretation", "authors": [{"id": 191889, "fullname": "ZeBin Ji", "url": "http://cvpr.thecvf.com/api/miniconf/users/191889?format=json", "institution": "Chongqing University of Post and Telecommunications"}, {"id": 191890, "fullname": "Yang Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191890?format=json", "institution": "Beihang University"}, {"id": 77315, "fullname": "Xiuli Bi", "url": "http://cvpr.thecvf.com/api/miniconf/users/77315?format=json", "institution": "Chongqing University of Posts and Telecommunications"}, {"id": 90522, "fullname": "Bo Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90522?format=json", "institution": "Chongqing University of Posts and Telecommunications"}, {"id": 86069, "fullname": "Bin Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86069?format=json", "institution": "Chongqing University of Posts and Tel."}], "abstract": "It is essential for understanding neural network decisions to interpret the functionality (also known as concepts) of neurons. Existing approaches describe neuron concepts by generating natural language descriptions, thereby advancing the understanding of the neural network's decision-making mechanism. However, these approaches assume that each neuron has well-defined functions and provides discriminative features for neural network decision-making. In fact, some neurons may be redundant or may offer misleading concepts. Thus, the descriptions for such neurons may cause misinterpretations of the factors driving the neural network\u2019s decisions. To address the issue, we introduce a verification of neuron functions, which checks whether the generated concept highly activates the corresponding neuron. Furthermore, we propose a Select\u2013Hypothesize\u2013Verify framework for interpreting neuron functionality. This framework consists of: 1) selecting activation samples that best capture a neuron\u2019s well-defined functional behavior through activation-distribution analysis; 2) forming hypotheses about concepts for the selected neurons; and 3) verifying whether the generated concepts accurately reflect the functionality of the neuron. Extensive experiments show that our method produces more accurate neuron concepts. Our generated concepts activate the corresponding neurons with a probability approximately 1.5 times that of the current state-of-the-art method.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39344", "url": null, "sourceid": 38630, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38664, "uid": "6cb5686162b6b655404343ae77bd9348", "name": "UAST: Unified Active Search and Tracking for Arbitrary Targets with UAVs", "authors": [{"id": 145953, "fullname": "Liang Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/145953?format=json", "institution": "University of Science and Technology of China"}, {"id": 90344, "fullname": "Min Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90344?format=json", "institution": "Institute of Artificial Intelligence, Hefei Comprehensive National Science Center"}, {"id": 190420, "fullname": "Xingyu Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190420?format=json", "institution": "University of Science and Technology of China"}, {"id": 190421, "fullname": "Aowen Qiu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190421?format=json", "institution": "University of Science and Technology of China"}, {"id": 85751, "fullname": "Wengang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/85751?format=json", "institution": "University of Science and Technology of China"}, {"id": 89074, "fullname": "Houqiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/89074?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Active search and tracking of arbitrary targets by Unmanned Aerial Vehicles (UAVs) in cluttered environments remains a highly challenging problem. Existing methods either construct complex modular pipelines, leading to substantial computational costs, or adopt end-to-end controllers that often fail to generalize across different targets and scenes. Moreover, search and tracking are typically treated separately despite their strong interdependence.In this paper, we present UAST, a simple yet effective mapping-free framework that unifies active search and persistent tracking using only RGB-D observations. The proposed system couples a dual-branch perception module with a Rule-Based Point Search Policy that adaptively switches between tracking and search-based recovery. A lightweight control network generates dynamically feasible trajectories directly from fused perception and UAV states. Furthermore, we introduce a training strategy with an elaborated tracking-aware visibility loss and a tailored data construction.Extensive experiments in both simulated and real-world environments show that our approach achieves higher success rates, more stable long-term tracking, and faster target search compared with existing methods, while maintaining high efficiency. The code will be released upon publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38664", "url": null, "sourceid": 36905, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36544, "uid": "f9eca5038949eae460da07906408a092", "name": "ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving", "authors": [{"id": 151700, "fullname": "Qihang Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/151700?format=json", "institution": "Tsinghua University"}, {"id": 153276, "fullname": "Xuesong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/153276?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 185311, "fullname": "Chenye Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185311?format=json", "institution": "Tsinghua University"}, {"id": 86239, "fullname": "Shaoshuai Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/86239?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 86526, "fullname": "Hongsheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/86526?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "Autonomous driving requires generating safe and reliable trajectories from complex multimodal inputs. Traditional modular pipelines separate perception, prediction, and planning, while recent end-to-end (E2E) systems learn them jointly. Vision\u2013language models (VLMs) further enrich this paradigm by introducing cross-modal priors and commonsense reasoning, yet current VLM-based planners face three key challenges: (i) a mismatch between discrete text reasoning and continuous control, (ii) high latency from autoregressive chain-of-thought decoding, and (iii) inefficient or non-causal planners that limit real-time deployment. We propose ColaVLA, a unified vision\u2013language\u2013action framework that transfers reasoning from text to a unified latent space and couples it with a hierarchical, parallel trajectory decoder. The Cognitive Latent Reasoner compresses scene understanding into compact, decision-oriented meta-action embeddings through ego-adaptive selection and only two VLM forward passes. The Hierarchical Parallel Planner then generates multi-scale, causality-consistent trajectories in a single forward pass. Together, these components preserve the generalization and interpretability of VLMs while enabling efficient, accurate and safe trajectory generation. Experiments on the nuScenes benchmark show that ColaVLA achieves state-of-the-art performance in both open-loop and closed-loop settings with favorable efficiency and robustness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36544", "url": null, "sourceid": 34019, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36572, "uid": "f593b9ead8801922f74f0a5329e31486", "name": "Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching", "authors": [{"id": 180084, "fullname": "Hao Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/180084?format=json", "institution": "Zhejiang University"}, {"id": 129165, "fullname": "Muzhi Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129165?format=json", "institution": "Zhejiang University"}, {"id": 183228, "fullname": "Shenyan Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/183228?format=json", "institution": "Zhejiang University"}, {"id": 185379, "fullname": "Anzhou Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185379?format=json", "institution": "Zhejiang University"}, {"id": 185380, "fullname": "Cong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185380?format=json", "institution": "Zhejiang University"}, {"id": 185381, "fullname": "Hua Geng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185381?format=json", "institution": "Zhejiang University"}, {"id": 185382, "fullname": "Duochao Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/185382?format=json", "institution": "Zhejiang University"}, {"id": 185383, "fullname": "Wentao Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/185383?format=json", "institution": "Zhejiang University"}, {"id": 158625, "fullname": "Tao Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/158625?format=json", "institution": "Westlake University"}, {"id": 185384, "fullname": "Hao Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/185384?format=json", "institution": "Zhejiang University"}, {"id": 86450, "fullname": "Chunhua Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/86450?format=json", "institution": "Zhejiang University"}], "abstract": "Employing multimodal large language models (MLLMs) in 3D physical environments demands complex spatial reasoning capabilities that integrate geometric understanding, viewpoint synthesis, fine-grained perception, and robust depth estimation. However, current MLLMs lack systematic evaluation and training frameworks for these capabilities. We address this gap through the lens of wide-baseline matching (WBM)---determining whether two views with large viewpoint changes, appearance shifts, and occlusions depict the same scene element. We introduce ReasonMatch-Bench, a comprehensive benchmark stratified by viewpoint displacement and matching granularity across indoor, outdoor, and object-centric scenarios. Our evaluation reveals substantial gaps between human performance and state-of-the-art MLLMs, particularly for smaller models, highlighting critical deficiencies in spatial reasoning. To bridge this gap, we propose a scalable data generation pipeline that automatically extracts wide-baseline view pairs from large-scale video-3D corpora (RGB-D videos and SfM reconstructions), providing diverse, verifiable supervision. Leveraging verifiable matching accuracy as rewards, we introduce Dynamic Correspondence Reinforcement Learning (DCRL), combining Image-Level Viewpoint Progression and Point-Level Correspondence Curriculum to enable progressive acquisition of sophisticated spatial reasoning without explicit supervision. Extensive experiments demonstrate that our approach significantly enhances MLLMs' spatial reasoning capabilities, narrowing the gap with human performance on complex 3D understanding tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36572", "url": null, "sourceid": 40374, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40251, "uid": "af63a657e0ceef55ddb2751bc5c0294d", "name": "Towards Streaming Referring Video Segmentation via Large Language Model", "authors": [{"id": 156488, "fullname": "Wenkang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156488?format=json", "institution": "Southeast University"}, {"id": 193880, "fullname": "Kaicheng Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193880?format=json", "institution": "DeepGlint"}, {"id": 185046, "fullname": "Xiang An", "url": "http://cvpr.thecvf.com/api/miniconf/users/185046?format=json", "institution": "DeepGlint"}, {"id": 193881, "fullname": "Qiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193881?format=json", "institution": "Alibaba Group"}, {"id": 185574, "fullname": "Ziyong Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185574?format=json", "institution": "DeepGlint"}, {"id": 191957, "fullname": "Wankou Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191957?format=json", "institution": "Southeast University"}, {"id": 74045, "fullname": "Jiankang Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/74045?format=json", "institution": "Imperial College London"}], "abstract": "Current referring video segmentation methods typically operate in an offline manner, where sparse frames are first selected for image-level referring segmentation, and the resulting masks are then propagated across the video. Although video sampling captures global context, its isolated processing steps not only complicate optimization but also restrict applicability to real-world streaming scenarios. In this paper, we propose a simple but efficient MLLM-based framework StreamingRVOS, which can extend image-level segmentation to video-level via a streaming pipeline without introducing extra parameters. Specifically, we employ a Semantic Embedding Recycling (SER) method to propagate temporal context across frames, enabling the model to perceive semantic representation in the video. Then, we propose an Online Mask Consistency Perception (OMCP) strategy to adaptively invoke the MLLM to re-perceive the current scene and regenerate the semantic embedding. We conduct extensive experiments on multiple downstream datasets to prove the effectiveness of StreamingRVOS. Compared to previous methods, our method achieves excellent performance in referring video segmentation (1B variant improves upon Sa2VA by 19.2 on the MeViS dataset), while operating at an average speed of 7 FPS under streaming inference on 1 $\\times$ A800 GPU.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40251", "url": null, "sourceid": 34395, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36655, "uid": "e17fdb12d20b8b9cc8a1aae0255da728", "name": "Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining", "authors": [{"id": 182590, "fullname": "Zhumei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182590?format=json", "institution": "Beijing Institute of Technology"}, {"id": 185572, "fullname": "zechenhu zechenhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185572?format=json", "institution": null}, {"id": 185573, "fullname": "Ruoxi Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/185573?format=json", "institution": "Zhejiang University"}, {"id": 184432, "fullname": "Huaijin Pi", "url": "http://cvpr.thecvf.com/api/miniconf/users/184432?format=json", "institution": "The University of Hong Kong"}, {"id": 185574, "fullname": "Ziyong Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185574?format=json", "institution": "DeepGlint"}, {"id": 185575, "fullname": "Liang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185575?format=json", "institution": null}, {"id": 87173, "fullname": "Mingtao Pei", "url": "http://cvpr.thecvf.com/api/miniconf/users/87173?format=json", "institution": "Beijing Institute of Technology"}, {"id": 75767, "fullname": "Siyuan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75767?format=json", "institution": "Beijing Institute of General Artificial Intelligence"}], "abstract": "Human motion recovery for real-world interaction demands both precise action details and metric-scale trajectories. Recovering absolute human pose from monocular input presents a viable solution, but faces two main challenges: (1) models' reliance on 3D training data from constrained environments limits their out-of-distribution generalization; and (2) the inherent difficulty of estimating metric-scale poses from monocular observations. This paper introduces Mocap-2-to-3, a novel framework that differs from prior HMR methods by recovering absolute poses from monocular input and leveraging abundant 2D data to enhance 3D motion recovery. To effectively utilize the action priors and diversity in large-scale 2D datasets, we reformulate 3D motion as a multi-view synthesis process and divide the training into two stages: a single-view diffusion model is first pre-trained on extensive 2D data, followed by multi-view fine-tuning on public 3D data, thus achieving a combination of strong priors and geometric constraints. Furthermore, to recover absolute poses, we introduce a novel human motion representation that decouples the learning of local pose and global movements, while encoding ground geometric priors to accelerate convergence, thereby yielding more precise positioning in the physical world. Experiments on in-the-wild benchmarks show that our method outperforms state-of-the-art approaches in both camera-space motion realism and world-grounded human positioning, while exhibiting strong generalization capability. Our code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36655", "url": null, "sourceid": 38886, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39162, "uid": "ae91b05cf9c865b54cf57adf35831853", "name": "ORV: 4D Occupancy-centric Robot Video Generation", "authors": [{"id": 162862, "fullname": "Xiuyu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/162862?format=json", "institution": "Tsinghua University"}, {"id": 152936, "fullname": "Bohan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152936?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 185393, "fullname": "Shaocong Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185393?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 186923, "fullname": "Nan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186923?format=json", "institution": "Beijing Academy of Artificial Intelligence"}, {"id": 90336, "fullname": "Chongjie Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/90336?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 90325, "fullname": "Zhaoxi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/90325?format=json", "institution": "MMLab@NTU, Nanyang Technological University"}, {"id": 132748, "fullname": "Minghan Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/132748?format=json", "institution": "Tsinghua University"}, {"id": 191486, "fullname": "Yikang Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/191486?format=json", "institution": "Kuaishou- \u5feb\u624b\u79d1\u6280"}, {"id": 75853, "fullname": "Zheng Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75853?format=json", "institution": "Tsinghua University"}, {"id": 132631, "fullname": "Xin Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/132631?format=json", "institution": "Eastern Institute of Technology, Ningbo"}, {"id": 75520, "fullname": "Hang Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/75520?format=json", "institution": "Tsinghua University"}, {"id": 88978, "fullname": "Hao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/88978?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Recent embodied intelligence suffers from data scarcity, while conventional simulators lack visual realism. Controllable video generation is emerging as a promising data engine, yet current action-conditioned methods still fall short: generated videos are limited in fidelity and temporal consistency, poorly aligned with controls, and often constrained to singleview settings. We attribute these issues to the representational gap between sparse control inputs and dense pixel outputs. Thus, we introduce ORV, a 4D occupancy-centric framework for robot video generation that couples action priors with occupancy-derived visual priors. Concretely, we align chunked 7-DoF actions with video latents via an Action-Expert AdaLN modulation, and inject 2D renderings of 4D semantic occupancy into the generation process as soft guidance. Meanwhile, a central obstacle is the lack of occupancy data for embodied scenarios; we therefore curate ORV-Data, a large-scale, high-quality 4D semantic occupancy dataset of robot manipulation. Across BridgeV2, DROID, and RT-1, ORV improves video generation quality and controllability, achieving 18.8\\% lower FVD than state of the art, +3.5\\% success rate on visual planning, and +6.4\\% success rate on policy learning. Beyond singleview generation, ORV natively supports multiview consistent synthesis and enables simulation-to-real transfer despite significant domain gaps. Code, models, and data will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39162", "url": null, "sourceid": 32133, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37625, "uid": "43e9cf78aafcad8469415b24f9e13fea", "name": "All in One: Unifying Deepfake Detection, Tampering Localization, and Source Tracing with a Robust Landmark-Identity Watermark", "authors": [{"id": 182322, "fullname": "Junjiang Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182322?format=json", "institution": "Xinjiang University"}, {"id": 187897, "fullname": "Liejun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187897?format=json", "institution": "Xinjiang University"}, {"id": 187898, "fullname": "Zhiqing Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/187898?format=json", "institution": "Xinjiang University"}], "abstract": "With the rapid advancement of deepfake technology, malicious face manipulations pose a significant threat to personal privacy and social security. However, existing proactive forensics methods typically treat deepfake detection, tampering localization, and source tracing as independent tasks, lacking a unified framework to address them jointly. To bridge this gap, we propose a unified proactive forensics framework that jointly addresses these three core tasks. Our core framework adopts an innovative 152-dimensional landmark-identity watermark termed LIDMark, which structurally interweaves facial landmarks with a unique source identifier. To robustly extract the LIDMark, we design a novel Factorized-Head Decoder (FHD). Its architecture factorizes the shared backbone features into two specialized heads (i.e., regression and classification), robustly reconstructing the embedded landmarks and identifier, respectively, even when subjected to severe distortion or tampering. This design realizes an \"all-in-one\" trifunctional forensic solution: the regression head underlies an \"intrinsic-extrinsic\" consistency check for detection and localization, while the classification head robustly decodes the source identifier for tracing. Extensive experiments demonstrate that the proposed LIDMark framework provides a unified, robust, and imperceptible solution for the detection, localization, and tracing of deepfake content.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37625", "url": null, "sourceid": 33070, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39206, "uid": "fb1dc1367f429a50b497eb473bb0d23e", "name": "RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment", "authors": [{"id": 191581, "fullname": "Liyao Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191581?format=json", "institution": "University of Alberta"}, {"id": 191582, "fullname": "Ruichen Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191582?format=json", "institution": "University of Alberta"}, {"id": 191583, "fullname": "Chao Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/191583?format=json", "institution": "Huawei Technologies Canada"}, {"id": 88141, "fullname": "Di Niu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88141?format=json", "institution": "University of Alberta"}], "abstract": "Recent text-to-image (T2I) diffusion models achieve remarkable realism, yet faithful prompt\u2013image alignment remains challenging, particularly for complex prompts with multiple objects, relations, and fine-grained attributes. Existing training-free inference-time scaling methods rely on fixed iteration budgets that cannot adapt to prompt difficulty, while reflection-tuned models require carefully curated reflection datasets and extensive joint fine-tuning of diffusion and vision\u2013language models, often overfitting to reflection paths data and lacking transferability across models. We introduce RAISE (Requirement-Adaptive Self-Improving Evolution), a training-free, requirement-driven evolutionary framework for adaptive T2I generation. RAISE formulates image generation as a requirement-driven adaptive scaling process, evolving a population of candidates at inference time through a diverse set of refinement actions\u2014including prompt rewriting, noise resampling, and instructional editing. Each generation is verified against a structured checklist of requirements, enabling the system to dynamically identify unsatisfied items and allocate further computation only where needed. This achieves adaptive test-time scaling that aligns computational effort with semantic query complexity. On GenEval and DrawBench, RAISE attains state-of-the-art alignment (0.94 overall GenEval) while incurring fewer generated samples (reduced by 30-40%) and VLM calls (reduced by 80%) than prior scaling and reflection-tuned baselines, demonstrating efficient, generalizable, and model-agnostic multi-round self-improvement.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39206", "url": "https://github.com/LiyaoJiang1998/RAISE", "sourceid": 34121, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39704, "uid": "6508deb7d1b89059767d5af370618046", "name": "ApET: Approximation-Error Guided Token Compression for Efficient VLMs", "authors": [{"id": 192686, "fullname": "Qiankun Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/192686?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 192687, "fullname": "Ziyao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192687?format=json", "institution": "Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences"}, {"id": 192688, "fullname": "Haofei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192688?format=json", "institution": "Peng Cheng Laboratory"}, {"id": 185422, "fullname": "Zhen Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/185422?format=json", "institution": "Pengcheng Laboratory"}, {"id": 85075, "fullname": "Jie Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/85075?format=json", "institution": "Peking University"}, {"id": 192689, "fullname": "Hairong Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/192689?format=json", "institution": null}], "abstract": "Recent Vision-Language Models (VLMs) have demonstrated remarkable multimodal understanding capabilities, yet the redundant visual tokens incur prohibitive computational overhead and degrade inference efficiency. Prior studies typically relies on [CLS] attention or text-vision cross-attention to identify and discard redundant visual tokens. Despite promising results, such solutions are prone to introduce positional bias and, more critically, are incompatible with efficient attention kernels such as FlashAttention, limiting their practical deployment for VLM acceleration. In this paper, we step away from attention dependencies and revisit visual token compression from an information-theoretic perspective, aiming to maximally preserve visual information without any attention involvement. We present ApET, an \\textbf{Ap}proximation-\\textbf{E}rror \\textbf{T}oken compression framework. ApET first reconstructs the original visual tokens with a small set of basis tokens via linear approximation, then leverages the approximation error to identify and drop the least informative tokens. Extensive experiments across multiple VLMs and benchmarks demonstrate that ApET retains 95.2\\% of the original performance on image-understanding tasks and even attains 100.4\\% on video-understanding tasks, while compressing the token budgets by 88.9\\% and 87.5\\%, respectively. Thanks to its attention-free design, ApET seamlessly integrates with FlashAttention, enabling further inference acceleration and making VLM deployment more practical.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39704", "url": "https://github.com/Maqkccx/ApET", "sourceid": 32671, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36199, "uid": "e335123b12e0f5cec402ab04e2ea9870", "name": "LifeEval: A Multimodal Benchmark for Assistive AI in Egocentric Daily Life Tasks", "authors": [{"id": 184423, "fullname": "Hengjian Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184423?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 156244, "fullname": "Kaiwei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/156244?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 184424, "fullname": "Shibo Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184424?format=json", "institution": "Jilin University"}, {"id": 184425, "fullname": "Mingjie Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184425?format=json", "institution": "Shanghai University of Electric Power"}, {"id": 184426, "fullname": "Caoqihang Caoqihang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184426?format=json", "institution": "Shanghai University of Electric Power"}, {"id": 184427, "fullname": "Xianfeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184427?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 184428, "fullname": "Yucheng Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184428?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 89522, "fullname": "Xiongkuo Min", "url": "http://cvpr.thecvf.com/api/miniconf/users/89522?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 184429, "fullname": "Wei Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/184429?format=json", "institution": "East China Normal University"}, {"id": 156245, "fullname": "Dandan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156245?format=json", "institution": "East China Normal University"}, {"id": 86659, "fullname": "Guangtao Zhai", "url": "http://cvpr.thecvf.com/api/miniconf/users/86659?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "The rapid progress of Multimodal Large Language Models (MLLMs) marks a significant step toward artificial general intelligence, offering great potential for augmenting human capabilities. However, their ability to provide effective assistance in dynamic, real-world environments remains largely underexplored. Existing video benchmarks predominantly assess passive understanding through retrospective analysis or isolated perception tasks, failing to capture the interactive and adaptive nature of real-time user assistance. To bridge this gap, we introduce LifeEval, a multimodal benchmark designed to evaluate real-time, task-oriented human\u2013AI collaboration in daily life from an egocentric perspective. LifeEval emphasizes three key aspects: task-oriented holistic evaluation, egocentric real-time perception from continuous first-person streams, and human\u2013assistant collaborative interaction through natural dialogues. Constructed via a rigorous annotation pipeline, the benchmark comprises 4,075 high-quality question\u2013answer pairs across 6 core capability dimensions. Extensive evaluations of 26 state-of-the-art MLLMs on LifeEval reveal substantial challenges in achieving timely, effective and adaptive interaction, highlighting essential directions for advancing human-centered interactive intelligence.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36199", "url": null, "sourceid": 30971, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39503, "uid": "4a8d4ae1b5bdd7ae8b35ed010c81520e", "name": "Scaling4D: Pushing the Frontier of Video Novel View Synthesis through Large-Scale Monocular Videos", "authors": [{"id": 192216, "fullname": "Hongrui Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/192216?format=json", "institution": "ByteDance"}, {"id": 192217, "fullname": "Junjie Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192217?format=json", "institution": "ByteDance Inc."}, {"id": 192218, "fullname": "Zhihong Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192218?format=json", "institution": "ByteDance"}, {"id": 192219, "fullname": "Shengnan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192219?format=json", "institution": null}, {"id": 192220, "fullname": "Wenjiawei Wenjiawei", "url": "http://cvpr.thecvf.com/api/miniconf/users/192220?format=json", "institution": null}, {"id": 154492, "fullname": "Wanquan Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/154492?format=json", "institution": "ByteDance"}, {"id": 154495, "fullname": "Songtao Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/154495?format=json", "institution": "ByteDance"}, {"id": 127533, "fullname": "Qian HE", "url": "http://cvpr.thecvf.com/api/miniconf/users/127533?format=json", "institution": "Institute of Remote Sensing Application, Chinese Academic of Sciences"}], "abstract": "Video Novel View Synthesis (VNVS) aims to render arbitrary novel viewpoints of dynamic scenes from a single-view video, but its algorithmic training faces a major challenge: the lack of large-scale multi-view video datasets. Prior methods often train on monocular data by framing it as an inpainting task, which typically leads to a train-inference gap and visual artifacts. While synthetic multi-view data can partially alleviate the data scarcity issue, its high acquisition costs and limited diversity restrict scalability. To address these problems, we propose Scaling4D, a novel strategy that theoretically avoids the train-inference gap while leveraging large-scale monocular videos for training. Specifically, we take a higher-level perspective on the problem, reformulating VNVS into a general correspondence-guided generation task. Furthermore, in conjunction with extensive real-world data, we establish a synthetic data pipeline integrated with our training strategy to enhance precision. Qualitative and quantitative results demonstrate a positive correlation between performance and training data volume, confirming our scalability.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39503", "url": null, "sourceid": 40409, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36347, "uid": "5e7f2e8ff45b2e7c879e010041cc0d29", "name": "PureCC: Pure Learning for Text-to-Image Concept Customization", "authors": [{"id": 156952, "fullname": "Zhichao Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/156952?format=json", "institution": "Tsinghua University"}, {"id": 184830, "fullname": "Xiaole Xian", "url": "http://cvpr.thecvf.com/api/miniconf/users/184830?format=json", "institution": "Shenzhen University"}, {"id": 184831, "fullname": "Qingyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184831?format=json", "institution": "Kuaishou- \u5feb\u624b\u79d1\u6280"}, {"id": 184832, "fullname": "Wenyu Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184832?format=json", "institution": "Kuaishou- \u5feb\u624b\u79d1\u6280"}, {"id": 180129, "fullname": "Meng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180129?format=json", "institution": "Kling AI"}, {"id": 127865, "fullname": "Weicheng Xie", "url": "http://cvpr.thecvf.com/api/miniconf/users/127865?format=json", "institution": "Shenzhen University"}, {"id": 184833, "fullname": "Siyang Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/184833?format=json", "institution": "University of Exeter; University of Cambridge"}, {"id": 184834, "fullname": "Pingfa Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/184834?format=json", "institution": "Tsinghua University"}, {"id": 184835, "fullname": "Long ZENG", "url": "http://cvpr.thecvf.com/api/miniconf/users/184835?format=json", "institution": "Shenzhen International Graduate School, Tsinghua University"}, {"id": 127120, "fullname": "Liang Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/127120?format=json", "institution": "Shanghai AI Lab"}], "abstract": "Existing concept customization methods have achieved remarkable outcomes in high-fidelity and multi-concept customization.However, they often neglect the influence on the original model's behavior and capabilities when learning new personalized concepts.To address this issue, we propose PureCC. PureCC novelly introduces a decoupled learning objective for concept customization, which combines the implicit guidance of the target concept with the original conditional prediction. This separated form enables PureCC to substantially focus on the original model during training. Moreover, based on this objective, PureCC designs a dual-branch training pipeline that includes a frozen extractor providing purified target concept representation as implicit guidance and a trainable flow model producing the original conditional prediction, jointly achieving pure learning for personalized concept. Furthermore, PureCC introduces an novel adaptive guidance scale $\\lambda^\\star$ to dynamically adjust the guidance strength of the target concept, balancing between customization fidelity and model preservation. Extensive experiments show that PureCC achieves state-of-the-art performance in preserving the original behavior and capabilities while enabling high-fidelity concept customization.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36347", "url": null, "sourceid": 32603, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37053, "uid": "2db351ad40df79806c20fc4038ae1f38", "name": "Exemplar-Free Class Incremental Learning via Preserving Class-Discriminative Structure", "authors": [{"id": 181182, "fullname": "Xin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181182?format=json", "institution": "Shanxi University"}, {"id": 186571, "fullname": "Liang Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/186571?format=json", "institution": "Shanxi university"}, {"id": 186572, "fullname": "Guanchao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186572?format=json", "institution": "Shanxi University"}, {"id": 186573, "fullname": "Xian Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186573?format=json", "institution": "University of Manchester"}], "abstract": "Exemplar-Free Class Incremental Learning (EFCIL) aims to enable models to learn new classes sequentially without retaining samples from previous tasks. While recent approaches leverage pre-trained models with parameter-efficient tuning to mitigate forgetting, they often overlook a crucial cause of forgetting: the collapse of the class-discriminative structure. This structure comprises two interdependent components: intra-class structure, which characterizes the shape of individual classes, and inter-class structure, which characterizes the global geometric relationships among class prototypes. We reveal that catastrophic forgetting stems from the simultaneous deterioration of both intra-class and inter-class structures. To address this, we propose a unified framework that preserves the class-discriminative structure. It preserves the intra-class structure by reshaping class means and covariances to preserve each class\u2019s shape during migration, and maintains inter-class structure by stabilizing angular relationships between samples and old prototypes. Extensive experiments demonstrate that our framework outperforms existing leading methods on multiple EFCIL benchmarks, validating that preserving the class-discriminative structure is crucial for mitigating catastrophic forgetting.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37053", "url": null, "sourceid": 35121, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 65730, "file": "/media/PosterPDFs/CVPR%202026/37053.png", "modified": "2026-04-23T05:10:31.916733-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": true, "sortkey": 0, "is_live_content": false, "detailed_kind": "", "generated_from": null, "resourcetype": "EventmediaImageFile"}, {"id": 65731, "file": "/media/PosterPDFs/CVPR%202026/37053-thumb.png", "modified": "2026-04-23T05:10:32.094704-07:00", "display_section": 1, "type": "Poster", "name": "Poster", "visible": false, "sortkey": 0, "is_live_content": false, "detailed_kind": "thumb", "generated_from": null, "resourcetype": "EventmediaImageFile"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37980, "uid": "a02248212328cdea5940d1c050f6c6e1", "name": "Cluster-Wise Spatio-Temporal Masking for Efficient Video-Language Pretraining", "authors": [{"id": 188737, "fullname": "Weijun Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188737?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 128144, "fullname": "Yuqing Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128144?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 148028, "fullname": "Weikang Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/148028?format=json", "institution": "Harbin Institute of Technology\uff0cShenzhen"}, {"id": 128139, "fullname": "Xin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/128139?format=json", "institution": "Peng Cheng Laboratory"}, {"id": 88416, "fullname": "Ming Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88416?format=json", "institution": "Harbin Institute of Technology"}, {"id": 154060, "fullname": "Xiaopeng Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/154060?format=json", "institution": "Harbin Institute of Technology"}, {"id": 87840, "fullname": "Yaowei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87840?format=json", "institution": "Pengcheng Laboratory"}, {"id": 84797, "fullname": "Wangmeng Zuo", "url": "http://cvpr.thecvf.com/api/miniconf/users/84797?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Large-scale video-language pretraining enables strong generalization across multimodal tasks but often incurs prohibitive computational costs. Although recent advances in masked visual modeling help mitigate this issue, they still suffer from two fundamental limitations: severe visual information loss under high masking ratios and temporal information leakage caused by inter-frame correlations. To address these challenges, we propose ClusterSTM, a Cluster-Wise Spatio-Temporal Masking strategy for efficient video-language pretraining. ClusterSTM first performs intra-frame clustering to partition visual tokens into multiple semantically independent clusters, then conducts cluster-wise masking by retaining the token with the highest temporal density within each cluster. Our masking strategy ensure that the retained tokens capture holistic video content while exhibit strong temporal correlation. Additionally, we introduce a video\u2013text relevance reconstruction objective that aligns high-level multimodal semantics beyond conventional visual reconstruction. Extensive experiments across multiple benchmarks demonstrate that ClusterSTM achieves superior performance on video-text retrieval, video question answering, and video captioning tasks, establishing a new state-of-the-art among efficient video-language models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37980", "url": null, "sourceid": 46659, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39505, "uid": "69a4241986721658eb6237d56c1b8abc", "name": "DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis", "authors": [{"id": 192222, "fullname": "Xinglong Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192222?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 154155, "fullname": "Ao Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/154155?format=json", "institution": "Southwest Jiaotong University"}, {"id": 130988, "fullname": "Zhengning Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/130988?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 192223, "fullname": "Yueqi Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192223?format=json", "institution": null}, {"id": 88923, "fullname": "Chaoyu Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/88923?format=json", "institution": "Dalian University of Technology"}, {"id": 88411, "fullname": "Lei Lei", "url": "http://cvpr.thecvf.com/api/miniconf/users/88411?format=json", "institution": "Xiaomi"}, {"id": 105589, "fullname": "Bing Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/105589?format=json", "institution": "University of Electronic Science and Technology of China,"}, {"id": 93490, "fullname": "Shuaicheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/93490?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Image alignment is a fundamental task in computer vision with broad applications. Existing methods predominantly employ optical flow-based image warping. However, this technique is susceptible to common challenges such as occlusions and illumination variations, leading to degraded alignment visual quality and compromised accuracy in downstream tasks. In this paper, we present DMAligner, a diffusion-based framework for image alignment through alignment-oriented view synthesis. DMAligner is crafted to tackle the challenges in image alignment from a new perspective, employing a generation-based solution that showcases strong capabilities and avoids the problems associated with flow-based image warping. Specifically, we propose a Dynamics-aware Diffusion Training approach for learning conditional image generation, synthesizing a novel view for image alignment. This incorporates a Dynamics-aware Mask Producing (DMP) module to adaptively distinguish dynamic foreground regions from static backgrounds, enabling the diffusion model to more effectively handle challenges that classical methods struggle to solve.Furthermore, we develop the Dynamic Scene Image Alignment (DSIA) dataset using Blender, which includes 1,033 indoor and outdoor scenes with over 30K image pairs tailored for image alignment.Extensive experimental results demonstrate the superiority of the proposed approach on DSIA benchmarks, as well as on a series of widely-used video datasets for qualitative comparisons. Code and dataset will be released.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39505", "url": null, "sourceid": 41440, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36301, "uid": "8f1fbc45b8d10f3bc00d82ad450c8a6c", "name": "RaPA: Enhancing Transferable Targeted Attacks via Random Parameter Pruning", "authors": [{"id": 145057, "fullname": "Tongrui Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/145057?format=json", "institution": "Chinese Academy of Sciences"}, {"id": 184733, "fullname": "Qingbin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184733?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 184734, "fullname": "Shengyu Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184734?format=json", "institution": "Institute of Computing Technology, Chinese Academy of Sciences"}, {"id": 184735, "fullname": "Wei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184735?format=json", "institution": "Chinese Academy of Sciences"}, {"id": 129330, "fullname": "Xueqi Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/129330?format=json", "institution": ", Chinese Academy of Sciences"}], "abstract": "Compared to untargeted attacks, targeted transfer-based attack is still suffering from much lower Attack Success Rates (ASRs),  although significant improvements have been achieved by kinds of methods, such as diversifying input, stabilizing the gradient, and re-training surrogate models. In this paper, we find that  adversarial examples generated by existing methods rely heavily on a small subset of surrogate model parameters, which in turn limits their transferability to unseen target models. Inspired by this, we propose the Random Parameter Pruning Attack (RaPA), which introduces parameter-level randomization during the attack process. At each optimization step, RaPA randomly prunes model parameters to generate diverse yet semantically consistent surrogate variants.We show this parameter-level randomization is equivalent to adding an importance-equalization regularizer, thereby alleviating the over-reliance issue. Extensive experiments across both CNN and Transformer architectures demonstrate that RaPA substantially enhances transferability. In the challenging case of transferring from CNN-based to Transformer-based models, RaPA achieves up to 11.7\\% higher average ASRs than state-of-the-art baselines(with 33.3\\% ASRs), while being training-free, cross-architecture efficient, and easily integrated into existing attack frameworks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36301", "url": null, "sourceid": 34626, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38764, "uid": "6ec6bb3422418bd6a33bbfe1df28450f", "name": "Disentangle-then-Align: Non-Iterative Hybrid Multimodal Image Registration via Cross-Scale Feature Disentanglement", "authors": [{"id": 183231, "fullname": "Chunlei Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/183231?format=json", "institution": "University of Technology Sydney"}, {"id": 190616, "fullname": "Jiahao Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/190616?format=json", "institution": "University of Technology Sydney"}, {"id": 190617, "fullname": "Yun Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190617?format=json", "institution": null}, {"id": 190618, "fullname": "Bo Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190618?format=json", "institution": "Anhui University"}, {"id": 90486, "fullname": "Jian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90486?format=json", "institution": "University of Technology Sydney"}], "abstract": "Multimodal image registration is a fundamental task for multimodal imagery and a prerequisite for downstream cross-modal analysis. Despite recent progress with shared feature extraction and multi-scale architectures, two key limitations remain. First, some methods use disentanglement to learn shared features but mainly regularize the shared part, so modality-private cues can still leak into the shared space. Second, most multi-scale frameworks support only one transformation type, which limits their applicability in real-world scenarios where global misalignment and local deformation coexist.To address these issues, we view hybrid multimodal registration as jointly constructing a stable shared feature space and a unified hybrid transformation within that space. Building on this perspective, we introduce HRNet, a Hybrid Registration Network that couples representation disentanglement with hybrid parameter prediction. A shared backbone with Modality-Specific Batch Normalization (MSBN) produces multi-scale features, while a Cross-scale Disentanglement and Adaptive Projection (CDAP) module suppresses modality-private cues across scales and projects the shared component into a stable subspace suited for correspondence. On top of this shared space, a Hybrid Parameter Prediction Module (HPPM) performs non-iterative, coarse-to-fine estimation of both global rigid parameters and multi-scale fine-grained deformation fields, which are fused into a single coherent deformation field. Extensive experiments on four multimodal datasets demonstrate state-of-the-art performance on both rigid and non-rigid registration tasks. Code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38764", "url": null, "sourceid": 34607, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38439, "uid": "0b97e22ca7c95fdf86d97f10abd7a6c3", "name": "Active Intelligence in Video Avatars via Closed-loop World Modeling", "authors": [{"id": 184463, "fullname": "Xuanhua He", "url": "http://cvpr.thecvf.com/api/miniconf/users/184463?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 88773, "fullname": "Tianyu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/88773?format=json", "institution": "IDEA"}, {"id": 180361, "fullname": "Ke Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/180361?format=json", "institution": "University of Science and Technology of China"}, {"id": 71446, "fullname": "Rui-Qi Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/71446?format=json", "institution": "Nankai University"}, {"id": 189865, "fullname": "Meng Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/189865?format=json", "institution": "Meituan"}, {"id": 85474, "fullname": "Yong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85474?format=json", "institution": "Tencent AI Lab"}, {"id": 189866, "fullname": "Zhuoliang Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189866?format=json", "institution": null}, {"id": 84905, "fullname": "Xiaoming Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/84905?format=json", "institution": "Meituan"}, {"id": 87711, "fullname": "Qifeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87711?format=json", "institution": "Hong Kong University of Science and Technology"}], "abstract": "Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency\u2014they cannot autonomously pursue long-term goals through adaptive environmental interaction. We address this by introducing L-IVA (Long-horizon Interactive Visual Avatar), a task and benchmark for evaluating goal-directed planning in stochastic generative environments, and ORCA (Online Reasoning and Cognitive Architecture), the first framework enabling active intelligence in video avatars. ORCA embodies Internal World Model (IWM) capabilities through two key innovations: (1) a closed-loop OTAR cycle (Observe-Think-Act-Reflect) that maintains robust state tracking under generative uncertainty by continuously verifying predicted outcomes against actual generations, and (2) a hierarchical dual-system architecture where System 2 performs strategic reasoning with state prediction while System 1 translates abstract plans into precise, model-specific action captions. By formulating avatar control as a POMDP and implementing continuous belief updating with outcome verification, ORCA enables autonomous multi-step task completion in open-domain scenarios. Extensive experiments demonstrate that ORCA significantly outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, validating our IWM-inspired design for advancing video avatar intelligence from passive animation to active, goal-oriented behavior.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38439", "url": null, "sourceid": 43670, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38338, "uid": "0e9a9605327d6cb01c3e322a7e421294", "name": "Disentanglement-wise Image Dehazing through Cross-Domain Manifold Consensus", "authors": [{"id": 174841, "fullname": "Tianyi Lyu", "url": "http://cvpr.thecvf.com/api/miniconf/users/174841?format=json", "institution": "Nanjing Univ of Post and Telecommunications"}, {"id": 189636, "fullname": "Mingye Ju", "url": "http://cvpr.thecvf.com/api/miniconf/users/189636?format=json", "institution": "School of Artificial Intelligence, Nanjing University of Posts and Telecommunications"}, {"id": 90572, "fullname": "Kai-Kuang Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/90572?format=json", "institution": "Nanyang Technological University"}], "abstract": "Current dehazing methods face two intertwined challenges: (1): the misidentification of haze-related features due to domain-specific interference in both single-domain and empirically integrated multi-domain approaches, and (2): severe chromatic distortion caused by haze-induced inherent color entanglement. To overcome these limitations, we propose a unified framework centered on a $\\textbf{Cross-domain Invariant Manifold}$ (CIM), which constructs a consistent latent representation space by aligning multi-domain features through shared scattering semantics. The manifold is optimized via $\\textbf{consensus density-driven contrastive learning}$, effectively enhancing cross-domain consistency while eliminating domain-specific biases. Building upon this structured foundation, we further introduce a disentanglement-wise architecture, i.e.the $\\textbf{Physics-Guided HSV Decomposition Network}$, that explicitly separates entangled color components to ensure robust color fidelity. Comprehensive experiments demonstrate that our CIM-D framework achieves state-of-the-art performance, effectively eliminating haze-induced color shifts and restoring natural scene appearance. The code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38338", "url": null, "sourceid": 34065, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39884, "uid": "bc5248bec66e472f11f85710936fe03c", "name": "VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-view Indoor 3D Object Detection", "authors": [{"id": 128726, "fullname": "Yang Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/128726?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 193049, "fullname": "Feize Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193049?format=json", "institution": "Sun Yat-Sen University"}, {"id": 130142, "fullname": "Dave Zhenyu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/130142?format=json", "institution": "Technische Universit\u00e4t M\u00fcnchen"}, {"id": 72743, "fullname": "Yingji Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/72743?format=json", "institution": "Department of Computer Science and Engineering, Hong Kong University of Science and Technology"}, {"id": 87060, "fullname": "Lanqing Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/87060?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 88296, "fullname": "Dan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88296?format=json", "institution": "CSE, HKUST"}], "abstract": "Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain\u2014i.e., precisely calibrated multi-view camera poses\u2014to fuse multi-view information into a global scene representation, limiting deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where no sensor-provided geometric inputs (multi-view poses or depth). Recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates VGGT encoder into a transformer-based pipeline. To effectively leverage both the semantic and geometric priors from inside VGGT, we introduce two novel key components: (i) Attention-Guided Query Generation (AG): exploits VGGT attention maps as semantic priors to initialize object queries, improving localization by focusing on object regions while preserving global spatial structure. (ii) Query-Driven Feature Aggregation (QD): a learnable See-Query interacts with object queries to \u2018see\u2019 what they need, then dynamically aggregates multi-level geometric features across VGGT layers that progressively lift 2D features into 3D. Experiments show that VGGT-Det significantly surpasses SG-Free baselines by 4.4 and 8.6 mAP@0.25 on ScanNet and ARKitScenes, respectively. Ablation study shows that VGGT\u2019s internally learned semantic and geometric priors can be effectively leveraged by our AG and QD. Code and pretrained models will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39884", "url": null, "sourceid": 34215, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39881, "uid": "0b32d6a87c09c32b3cd90dfd5ef5699f", "name": "COT-FM: Cluster-wise Optimal Transport Flow Matching", "authors": [{"id": 181180, "fullname": "Chiensheng Chiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/181180?format=json", "institution": "National Taiwan University"}, {"id": 181179, "fullname": "Kuan-Hsun Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181179?format=json", "institution": "National Taiwan University"}, {"id": 193044, "fullname": "Jia-Wei Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193044?format=json", "institution": "National Taiwan University"}, {"id": 193045, "fullname": "Cheng-Fu Chou", "url": "http://cvpr.thecvf.com/api/miniconf/users/193045?format=json", "institution": "Department of computer science and informational engineering, National Taiwan University"}, {"id": 180829, "fullname": "Tsung-Wei Ke", "url": "http://cvpr.thecvf.com/api/miniconf/users/180829?format=json", "institution": "Department of computer science and informational engineering, National Taiwan University"}], "abstract": "We introduce COT-FM, a general framework that reshapes the probability path in Flow Matching (FM) to achieve faster and more reliable generation. FM models often produce curved trajectories due to random or batch-wise couplings, which increase discretization error and reduce sample quality. COT-FM fixes this by clustering target samples and assigning each cluster a dedicated source distribution obtained by reversing pretrained FM models. This divide-and-conquer strategy yields more accurate local transport and significantly straighter vector fields, all without changing the model architecture. As a plug-and-play approach, COT-FM consistently accelerates sampling and improves generation quality across 2D datasets, image benchmarks, and robotic manipulation tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39881", "url": null, "sourceid": 45643, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39681, "uid": "8fad6d7e58542a344408f83a0a73e11b", "name": "Routing on Demand: DSNet for Efficient Progressive Point Cloud Denoising", "authors": [{"id": 183635, "fullname": "Xiaoqian Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/183635?format=json", "institution": "University of Science and Technology of China"}, {"id": 192636, "fullname": "Dong Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192636?format=json", "institution": "University of Science and Technology of China"}, {"id": 192637, "fullname": "Husen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192637?format=json", "institution": "University of Science and Technology of China"}, {"id": 91259, "fullname": "Zheng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/91259?format=json", "institution": "China University of Geosciences (Wuhan)"}, {"id": 152483, "fullname": "Renjie Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/152483?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Point cloud denoising is a critical preprocessing step for enhancing the reliability and accuracy of 3D perception systems. Most existing progressive denoising methods rely on fixed iterative pipelines that process all regions uniformly, resulting in redundant computation and over-smoothing of geometric details when handling point clouds with non-uniform noise distributions. To overcome these limitations, we introduce Dynamic Skip Net (DSNet), a novel progressive denoising framework that adaptively determines the optimal denoising path for each local patch based on its noise characteristics. DSNet incorporates a noise discriminator that quantifies local noise intensity by analyzing normal similarity, and a reverse monotonic decision function that maps this measure to an appropriate denoising module. Furthermore, we propose a Path-Selective Iteration mechanism that dynamically re-evaluates the restoration state and re-plans the denoising route at each stage, enabling cross-stage skipping to minimize unnecessary computation. Extensive experiments on multiple benchmarks demonstrate that DSNet achieves state-of-the-art performance in noise suppression, geometric fidelity, and computational efficiency. Our code and models will be made publicly available at github.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39681", "url": null, "sourceid": 34495, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39165, "uid": "762d77fa312b52c109f63a9fa0b1edbe", "name": "Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention", "authors": [{"id": 191490, "fullname": "Junhao Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/191490?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 179926, "fullname": "XUE JIALONG", "url": "http://cvpr.thecvf.com/api/miniconf/users/179926?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 191491, "fullname": "Anqi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191491?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 191492, "fullname": "Jincheng Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/191492?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 130678, "fullname": "Guo Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/130678?format=json", "institution": "Shanghai Jiaotong University"}], "abstract": "Video large language models (Video-LLMs) face high computational costs due to large volumes of visual tokens. Existing token compression methods typically adopt a two-stage spatiotemporal compression strategy, relying on stage-specific metrics and an implicit assumption of spatiotemporal separability. Under extremely low retention ratios, however, such approaches often result in unbalanced allocation and loss of visual evidence essential for question answering. We reformulate token compression as a spatiotemporal allocation task within a global token retention pool. We propose a unified selection mechanism that integrates attention weights and semantic similarity to globally select tokens with high contribution and low redundancy. Unselected tokens are merged via clustering and refilled, preserving information integrity. Inside the LLM, we further introduce text-aware merging to perform secondary compression based on query relevance. Without requiring retraining, our method serves as a plug-and-play module compatible with existing Video-LLMs. Experiments show that retaining only about 2\\% of visual tokens preserves 90.1\\% of baseline performance across multiple benchmarks, while reducing FLOPs to roughly 2.6\\%. These benefits generalize across diverse backbones, decreasing end-to-end inference latency and memory consumption. Our unified spatiotemporal token compression strategy establishes the state-of-the-art in video understanding under ultra-low token retention.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39165", "url": null, "sourceid": 35511, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36781, "uid": "2240136c6caf3b4075577449deda3728", "name": "Rethinking Diffusion Model-Based Video Super-Resolution: Leveraging Dense Guidance from Aligned Features", "authors": [{"id": 180640, "fullname": "Jingyi Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180640?format=json", "institution": "The School of Electronic and Information Engineering, Beihang University"}, {"id": 126868, "fullname": "Meisong Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/126868?format=json", "institution": "Alibaba Group"}, {"id": 157235, "fullname": "Ying Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/157235?format=json", "institution": "Taobao, Alibaba Group"}, {"id": 185860, "fullname": "Minglang Qiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185860?format=json", "institution": "Beihang University"}, {"id": 126860, "fullname": "Xin Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/126860?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 88543, "fullname": "Mai Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88543?format=json", "institution": "Beihang University, Tsinghua University"}], "abstract": "Diffusion model (DM) based Video Super-Resolution (VSR) approaches achieve impressive perceptual quality. However, they suffer from error accumulation, spatial artifacts, and a trade-off between perceptual quality and fidelity, primarily caused by inaccurate alignment and insufficient compensation between video frames. In this paper, within the DM-based VSR pipeline, we revisit the role of alignment and compensation between adjacent video frames and reveal two crucial observations: (a) the feature domain is better suited than the pixel domain for information compensation due to its stronger spatial and temporal correlations, and (b) warping at an upscaled resolution better preserves high-frequency information, but this benefit is not necessarily monotonic. Therefore, we propose a novel Densely Guided diffusion model with Aligned Features for Video Super-Resolution (DGAF-VSR), with an Optical Guided Warping Module (OGWM) to maintain high-frequency details in the aligned features and a Feature-wise Temporal Condition Module (FTCM) to deliver dense guidance in the feature domain. Extensive experiments on synthetic and real-world datasets demonstrate that DGAF-VSR surpasses state-of-the-art methods in key aspects of VSR, including perceptual quality (35.82\\% DISTS reduction), fidelity (0.20 dB PSNR gain), and temporal consistency (30.37\\% tLPIPS reduction).", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36781", "url": null, "sourceid": 32926, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38326, "uid": "eef43b55de33c2d530ea4a9669b81062", "name": "ViT$^3$: Unlocking Test-Time Training in Vision", "authors": [{"id": 131365, "fullname": "Dongchen Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/131365?format=json", "institution": "Tsinghua University"}, {"id": 189614, "fullname": "Yining Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189614?format=json", "institution": "Tsinghua University"}, {"id": 189615, "fullname": "Tianyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189615?format=json", "institution": "Tsinghua University"}, {"id": 189616, "fullname": "Zixuan Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189616?format=json", "institution": "Tsinghua University"}, {"id": 189617, "fullname": "Ziming Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189617?format=json", "institution": null}, {"id": 152920, "fullname": "Jun Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/152920?format=json", "institution": "Alibaba Group"}, {"id": 185404, "fullname": "YuCheng YuCheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185404?format=json", "institution": null}, {"id": 152921, "fullname": "Bo Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/152921?format=json", "institution": "Alibaba Group"}, {"id": 93664, "fullname": "Gao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/93664?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Test-Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates attention operation as an online learning problem, constructing a compact inner model from key\u2013value pairs at test time. This reformulation opens a rich and flexible design space while achieving linear computational complexity. However, crafting a powerful visual TTT design remains challenging: fundamental choices for the inner module and inner training lack comprehensive understanding and practical guidelines. To bridge this critical gap, in this paper, we present a systematic empirical study of TTT designs for visual sequence modeling. From a series of experiments and analyses, we distill six practical insights that establish design principles for effective visual TTT and illuminate paths for future improvement. These findings culminate in the Vision Test-Time Training (ViT$^3$) model, a pure TTT architecture that achieves linear complexity and parallelizable computation. We evaluate ViT$^3$ across diverse visual tasks, including image classification, image generation, object detection, and semantic segmentation. Results show that ViT$^3$ consistently matches or outperforms advanced linear-complexity models (e.g., Mamba and linear attention variants) and effectively narrows the gap to highly optimized vision Transformers. We hope this study and the ViT$^3$ baseline can facilitate future work on visual TTT models. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38326", "url": null, "sourceid": 39957, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40332?format=json"], "related_events_ids": [40332]}, {"id": 40332, "uid": "eef43b55de33c2d530ea4a9669b81062", "name": "ViT$^3$: Unlocking Test-Time Training in Vision", "authors": [{"id": 131365, "fullname": "Dongchen Han", "url": "http://cvpr.thecvf.com/api/miniconf/users/131365?format=json", "institution": "Tsinghua University"}, {"id": 189614, "fullname": "Yining Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189614?format=json", "institution": "Tsinghua University"}, {"id": 189615, "fullname": "Tianyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189615?format=json", "institution": "Tsinghua University"}, {"id": 189616, "fullname": "Zixuan Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/189616?format=json", "institution": "Tsinghua University"}, {"id": 189617, "fullname": "Ziming Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189617?format=json", "institution": null}, {"id": 152920, "fullname": "Jun Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/152920?format=json", "institution": "Alibaba Group"}, {"id": 185404, "fullname": "YuCheng YuCheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/185404?format=json", "institution": null}, {"id": 152921, "fullname": "Bo Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/152921?format=json", "institution": "Alibaba Group"}, {"id": 93664, "fullname": "Gao Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/93664?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Test-Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates attention operation as an online learning problem, constructing a compact inner model from key\u2013value pairs at test time. This reformulation opens a rich and flexible design space while achieving linear computational complexity. However, crafting a powerful visual TTT design remains challenging: fundamental choices for the inner module and inner training lack comprehensive understanding and practical guidelines. To bridge this critical gap, in this paper, we present a systematic empirical study of TTT designs for visual sequence modeling. From a series of experiments and analyses, we distill six practical insights that establish design principles for effective visual TTT and illuminate paths for future improvement. These findings culminate in the Vision Test-Time Training (ViT$^3$) model, a pure TTT architecture that achieves linear complexity and parallelizable computation. We evaluate ViT$^3$ across diverse visual tasks, including image classification, image generation, object detection, and semantic segmentation. Results show that ViT$^3$ consistently matches or outperforms advanced linear-complexity models (e.g., Mamba and linear attention variants) and effectively narrows the gap to highly optimized vision Transformers. We hope this study and the ViT$^3$ baseline can facilitate future work on visual TTT models. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40332", "url": null, "sourceid": -39957, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/38326?format=json"], "related_events_ids": [38326]}, {"id": 36173, "uid": "0849659c26c77c8aac41bed0133fa88c", "name": "VLM4RSDet: Collaborative Optimization with Vision-Language Model for Enhancing Remote Sensing Object Detection", "authors": [{"id": 184330, "fullname": "Shuohao Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/184330?format=json", "institution": "National University of Defense Technology"}, {"id": 184331, "fullname": "Qiang Fang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184331?format=json", "institution": "National University of Defense Technology"}, {"id": 149249, "fullname": "Xin Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/149249?format=json", "institution": "National University of Defense Technology"}], "abstract": "Closed-set object detection in remote sensing imagery has made significant progress, but achieving high detection accuracy remains challenging. Vision-Language Models (VLMs), which possess rich prior knowledge, offer a promising solution to this challenge. However, most existing VLMs are designed for open-vocabulary tasks and exhibit inherent limitations when directly applied to closed-set scenarios, such as notable accuracy degradation and high deployment costs. To address these issues, we propose VLM4RSDet, a novel collaborative training framework that leverages vision-language model to enhance the performance of conventional closed-set remote sensing object detectors. Notably, during inference, VLM4RSDet only retains the standard object detection architecture, thus avoiding any additional deployment overhead. Furthermore, we introduce a Global\u2013Local Cross-Attention (GLCA) module and a Learnable Hierarchical Prediction Strategy (LHPS) to further improve collaborative training performance. Extensive experiments on five benchmark datasets demonstrate the effectiveness and robustness of our approach. In particular, our method outperforms the state-of-the-art by 7.5\\% in mAP$_{0.5:0.95}$ on the VisDrone2019 dataset. Our code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36173", "url": null, "sourceid": 40737, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39046, "uid": "f7c5213a8ce1cfc32b697f9e70e1b3b7", "name": "Learning to Identify Out-of-Distribution Objects for 3D LiDAR Anomaly Segmentation", "authors": [{"id": 184304, "fullname": "Simone Mosco", "url": "http://cvpr.thecvf.com/api/miniconf/users/184304?format=json", "institution": "University of Padova"}, {"id": 191248, "fullname": "Daniel Fusaro", "url": "http://cvpr.thecvf.com/api/miniconf/users/191248?format=json", "institution": "University of Padua; University of Padua"}, {"id": 191249, "fullname": "Alberto Pretto", "url": "http://cvpr.thecvf.com/api/miniconf/users/191249?format=json", "institution": "University of Padua"}], "abstract": "Understanding the surrounding environment is fundamental in autonomous driving and robotic perception. Distinguishing between known classes and previously unseen objects is crucial in real-world environments, as done in Anomaly Segmentation. However, research in the 3D field remains limited, with most existing approaches applying post-processing techniques from 2D vision. To cover this lack, we propose a new efficient approach that directly operates in the feature space, modeling the feature distribution of inlier classes to constrain anomalous samples. Moreover, the only publicly available 3D LiDAR anomaly segmentation dataset contains simple scenarios, with few anomaly instances, and exhibits a severe domain gap due to its sensor resolution. To bridge this gap, we introduce a set of mixed real-synthetic datasets for 3D LiDAR anomaly segmentation, built upon established semantic segmentation benchmarks, with multiple out-of-distribution objects and diverse, complex environments. Extensive experiments demonstrate that our approach achieves state-of-the-art and competitive results on the existing real-world dataset and the newly introduced mixed datasets, respectively, validating the effectiveness of our method and the utility of the proposed datasets. Code and datasets are available at [REMOVED DUE TO ANONYMOUS SUBMISSION].", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39046", "url": "https://simom0.github.io/lido-page/", "sourceid": 33815, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36168, "uid": "ff7c6b322d982194ef32e74820a3fff4", "name": "BALM: A Model-Agnostic Framework for Balanced Multimodal Learning under Imbalanced Missing Rates", "authors": [{"id": 184318, "fullname": "Phuong-Anh Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/184318?format=json", "institution": "Vietnam National University Hanoi"}, {"id": 184319, "fullname": "Tien Anh Pham", "url": "http://cvpr.thecvf.com/api/miniconf/users/184319?format=json", "institution": "Vietnam National University Hanoi"}, {"id": 184320, "fullname": "Duc-Trong Le", "url": "http://cvpr.thecvf.com/api/miniconf/users/184320?format=json", "institution": "Vietnam National University, Hanoi"}, {"id": 182184, "fullname": "Van Nguyen", "url": "http://cvpr.thecvf.com/api/miniconf/users/182184?format=json", "institution": "VNU University of Engineering and Technology"}], "abstract": "Learning from multiple modalities often suffers from imbalance, where information-rich modalities dominate optimization while weaker or partially missing modalities contribute less. This imbalance becomes severe in realistic settings with imbalanced missing modalities (IMR), where each modality is absent with different probabilities, distorting representation learning and gradient dynamics. We revisit this issue from a training-process perspective and propose BALM, a model-agnostic plug-in framework to achieve balanced multimodal learning under IMR. The framework consists of two complementary modules. The Feature Calibration Module (FCM) operates at the representation level, recalibrating unimodal features through global contextual information to build a shared representation basis across heterogeneous missing patterns. The Gradient Rebalancing Module (GRM) works at the optimization level, equalizing learning dynamics across modalities by modulating gradient magnitudes and directions from distributional and spatial perspectives. BALM can be seamlessly integrated into diverse backbones, including multimodal emotion recognition (MER) models, without altering their architectures. Experimental results across multiple MER benchmarks confirm that BALM consistently enhances robustness and improves performance under diverse missing and imbalance settings.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36168", "url": null, "sourceid": 33604, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38246, "uid": "109d211aa9d2aeb2bd9db5c6561ed68d", "name": "GMT: Effective Global Framework for Multi-Target Multi-Camera Tracking", "authors": [{"id": 158756, "fullname": "Yihao Zhen", "url": "http://cvpr.thecvf.com/api/miniconf/users/158756?format=json", "institution": "Shenyang Institute of Automation Chinese Academy of Sciences"}, {"id": 189416, "fullname": "Mingyue Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189416?format=json", "institution": "Shenyang Institute of Automation, Chinese Academy of Sciences"}, {"id": 189121, "fullname": "Qiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189121?format=json", "institution": "Shenyang University"}, {"id": 129300, "fullname": "Baojie Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/129300?format=json", "institution": "Nanjing University of Posts and Telecommunications"}, {"id": 181696, "fullname": "Jiahua Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/181696?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 158757, "fullname": "Tinghui Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/158757?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 105106, "fullname": "Huijie Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/105106?format=json", "institution": "Shenyang Institute of Automation, Chinese Academy of Sciences"}], "abstract": "A frequently cited advantage of Multi-Camera Multi-Target (MCMT) Tracking is that the introduction of multiple views provides rich discriminative visual representations for each target. Existing MCMT models typically adopt a two-stage framework, involving single-camera tracking followed by inter-camera tracking. However, in this paradigm, the use of multiple views is confined to recovering missed matches in the first stage, providing a limited contribution to overall tracking. To address this issue, we propose a novel global MCMT tracking framework termed GMT, which effectively leverages the advantage of multi-view by performing global-level trajectory-target matching. Specifically, instead of assigning trajectories independently for each view, we propose a Cross-View Feature Consistency Enhancement(CFCE) module to reduce the feature discrepancies across different views, and encode the same historical targets across different views as global trajectories. The Global Trajectory Associate (GTA) module is then introduced to associate new targets to global trajectories, allowing the model to jointly exploit both intra-view and inter-view cues during tracking. Compared with the two-stage framework, the GMT achieves significant improvements on existing datasets, with gains of up to 13.1\\% in CVMA in and 19.2\\% in CVIDF1. Moreover, we present VisionTrack, a high-quality, large-scale MCMT dataset encompassing diverse scenes with varying illumination and target distributions, providing significantly greater diversity than existing datasets. Our code and dataset will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38246", "url": null, "sourceid": 44702, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38129, "uid": "d33fc27e8d0ecbfc11e16d20e723991d", "name": "STUR3D: Spatio-Temporal Unified Representation Learning for 3D Object Detection", "authors": [{"id": 105106, "fullname": "Huijie Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/105106?format=json", "institution": "Shenyang Institute of Automation, Chinese Academy of Sciences"}, {"id": 172237, "fullname": "Pengrui huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/172237?format=json", "institution": "SYUCT/SIA"}, {"id": 189121, "fullname": "Qiang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189121?format=json", "institution": "Shenyang University"}, {"id": 129300, "fullname": "Baojie Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/129300?format=json", "institution": "Nanjing University of Posts and Telecommunications"}, {"id": 181696, "fullname": "Jiahua Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/181696?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 99759, "fullname": "Liangqiong Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/99759?format=json", "institution": "The University of Hong Kong"}], "abstract": "Surrounding-view 3D object detection is a fundamental task in autonomous driving, which aims to locate 3D objects from multiple camera views. Existing methods predominantly followed a 2D-to-3D pipeline, leveraging 2D detectors to enhance 3D detection performance. However, these methods ignored the inherent disparities in both temporal and feature dimensional representations between 2D and 3D detection, resulting in the positional deviations in 3D space. Furthermore, the absence of temporal information in 2D detection leads to object omission in occluded scenarios. To address these limitations, we propose STUR3D, a unified framework that builds spatio-temporal alignment between 2D and 3D perception. First, we project historical 3D detection features onto the 2D image plane, guiding the 2D detector to distill the requisite representations for 3D detection, thereby harmonizing feature representations across different dimensional spaces. Second, we integrate temporal information into 2D detection to establish temporal coherence to unify spatio-temporal reasoning across both paradigms, which yields more robust and accurate 3D detection in dynamic scenes. Additionally, we integrate depth cues into feature encoding to guide the lifting of 2D detections into 3D queries, suppressing their inherent biases. Extensive experiments on the nuScenes benchmark demonstrate the effectiveness of our framework, and STUR3D achieves state-of-the-art results of 57.9\\% mAP and 64.6\\% NDS on the nuScenes \\test set.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38129", "url": null, "sourceid": 38515, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38649, "uid": "1654b4d90dd364d2796bdbd590189e7d", "name": "Beyond the Golden Data: Resolving the Motion-Vision Quality Dilemma via Timestep Selective Training", "authors": [{"id": 183662, "fullname": "Xiangyang Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/183662?format=json", "institution": "Tsinghua University"}, {"id": 184831, "fullname": "Qingyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184831?format=json", "institution": "Kuaishou- \u5feb\u624b\u79d1\u6280"}, {"id": 190384, "fullname": "Yuming Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190384?format=json", "institution": "Peking University"}, {"id": 179013, "fullname": "Guanbo Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/179013?format=json", "institution": "Tsinghua University"}, {"id": 89148, "fullname": "Yongjie Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89148?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 184832, "fullname": "Wenyu Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/184832?format=json", "institution": "Kuaishou- \u5feb\u624b\u79d1\u6280"}, {"id": 180129, "fullname": "Meng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180129?format=json", "institution": "Kling AI"}, {"id": 134947, "fullname": "Pengfei Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/134947?format=json", "institution": "Kuaishou Technology"}, {"id": 190385, "fullname": "Shao-Lun Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190385?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Recent advances in video generation models have achieved impressive results. However, these models heavily rely on the use of high-quality data that combines both high visual quality and high motion quality. In this paper, we identify a key challenge in video data curation: the Motion-Vision Quality Dilemma. We discovered that visual quality and motion quality inherently exhibit a negative correlation, making it hard to obtain golden data that excels in both aspects. To address this challenge, we first examine the hierarchical learning dynamics of video diffusion models and conduct gradient-based analysis on quality-degraded samples. We discover that quality-imbalanced data can produce gradients similar to golden data at appropriate timesteps. Based on this, we introduce the novel concept of Timestep selection in Training Process. We propose Timestep-aware Quality Decoupling (TQD), which modifies the data sampling distribution to better match the model\u2019s learning process. For certain types of data, the sampling distribution is skewed toward higher timesteps for high motion quality data, while high visual quality data is more likely to be sampled during lower timesteps. Through extensive experiments, we demonstrate that TQD enables training exclusively on separated imbalanced data to achieve performance surpassing conventional training with better data, challenging the necessity of perfect data in video generation. Moreover, our method also boosts model performance when trained on high-quality data, showcasing its effectiveness across different data scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38649", "url": null, "sourceid": 44830, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40171, "uid": "aa8dca2cf6b330e41c6da43a5beea921", "name": "VideoSSR: Video Self-Supervised Reinforcement Learning", "authors": [{"id": 193702, "fullname": "Zefeng He", "url": "http://cvpr.thecvf.com/api/miniconf/users/193702?format=json", "institution": "nanjing university"}, {"id": 156078, "fullname": "Xiaoye Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156078?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 193703, "fullname": "Yafu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/193703?format=json", "institution": null}, {"id": 193704, "fullname": "Siyuan Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193704?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 186043, "fullname": "Daizong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186043?format=json", "institution": "Wuhan University"}, {"id": 74051, "fullname": "Yu Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/74051?format=json", "institution": "The Chinese University of Hong Kong"}], "abstract": "Reinforcement Learning with Verifiable Reward (RLVR) has substantially advanced the video understanding capabilities of Multimodal Large Language Models (MLLMs). However, the rapid progress of MLLMs is outpacing the complexity of existing video datasets, while the manual annotation of new, high-quality data remains prohibitively expensive.This work investigates a pivotal question: Can the rich, intrinsic information within videos be harnessed to self-generate high-quality, verifiable training data?To investigate this problem, we first introduce three self-supervised pretext tasks for video understanding: Anomaly Grounding, Object Counting, and Temporal Jigsaw. To validate the difficulty of these tasks, we construct the Video Intrinsic Understanding Benchmark (VIUBench), revealing that current state-of-the-art MLLMs struggle significantly on these tasks. Building upon these pretext tasks, we develop the VideoSSR-30K dataset and propose VideoSSR, a novel video self-supervised reinforcement learning framework for RLVR. Extensive experiments across 17 benchmarks, spanning four major video domains (General Video QA, Long Video QA, Temporal Grounding, and Complex Reasoning), demonstrate that our VideoSSR consistently enhances model performance, yielding an average improvement of over 5\\%. These results establish VideoSSR as a potent foundational framework for developing more advanced video understanding in MLLMs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40171", "url": null, "sourceid": 43795, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36360, "uid": "e4edfb4697cc2be24e31c23ff181d185", "name": "AutoRegressive Generation with B-rep Holistic Token Sequence Representation", "authors": [{"id": 184876, "fullname": "Jiahao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184876?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 158725, "fullname": "Yunpeng Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/158725?format=json", "institution": "National University of Singapore"}, {"id": 184877, "fullname": "Yongkang Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/184877?format=json", "institution": "Northwestern Polytechnical University, Northwest Polytechnical University Xi&#x27;an"}, {"id": 158723, "fullname": "Hao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/158723?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 87165, "fullname": "Hongping Gan", "url": "http://cvpr.thecvf.com/api/miniconf/users/87165?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 158726, "fullname": "Yilei Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/158726?format=json", "institution": "the Northwestern Polytechnical University"}], "abstract": "Previous representation and generation approaches for the B-rep relied on graph-based representations that disentangle geometric and topological features through decoupled computational pipelines, thereby precluding the application of sequence-based generative frameworks, such as transformer architectures that have demonstrated remarkable performance. In this paper, we propose BrepARG, the first attempt to encode B-rep's geometry and topology into a holistic token sequence representation, enabling sequence-based B-rep generation with an autoregressive architecture. Specifically, BrepARG encodes B-rep into 3 types of tokens: geometry and position tokens representing geometric features, and face index tokens representing topology. Then the holistic token sequence is constructed hierarchically, starting with constructing the geometry blocks (i.e., faces and edges) using the above tokens, followed by geometry block sequencing. Finally, we assemble the holistic sequence representation for the entire B-rep. We also construct a transformer-based autoregressive model that learns the distribution over holistic token sequences via next-token prediction, using a multi-layer decoder-only architecture with causal masking. Experiments demonstrate that BrepARG achieves state-of-the-art (SOTA) performance. BrepARG validates the feasibility of representing B-rep as holistic token sequences, opening new directions for B-rep generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36360", "url": null, "sourceid": 45286, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39117, "uid": "fa68a633df6169bb2bf730da98faff59", "name": "BrepVGAE: Variational Graph Autoencoder with Unified Latent Representation for B-rep", "authors": [{"id": 158723, "fullname": "Hao Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/158723?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 191389, "fullname": "Liyuan Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/191389?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 184877, "fullname": "Yongkang Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/184877?format=json", "institution": "Northwestern Polytechnical University, Northwest Polytechnical University Xi&#x27;an"}, {"id": 144686, "fullname": "RuohanWang ruohan", "url": "http://cvpr.thecvf.com/api/miniconf/users/144686?format=json", "institution": "Northwestern Polytechinical University"}, {"id": 184876, "fullname": "Jiahao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/184876?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 158725, "fullname": "Yunpeng Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/158725?format=json", "institution": "National University of Singapore"}, {"id": 158726, "fullname": "Yilei Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/158726?format=json", "institution": "the Northwestern Polytechnical University"}], "abstract": "Due to the heterogeneity of faces and edges in B-rep, conventional graph-based representations is incapable of establishing a unified formulation for faces and edges, thereby constraining the capabilities of B-rep generative models. We propose a B-rep Variational Graph Auto Encoding (BrepVGAE), the first variational graph autoencoder framework capable of holistically encoding and decoding boundary representations of B-rep models.Firstly, we novelly represent both geometry faces and edges as nodes in a graph representation. We then design a sparse graph autoencoder to aggregate the complete B-rep structure into a compact global latent vector. We then construct a decoder that employs set-based generation, which uses bilinear layers to reconstruct adjacency relationships, i.e., topology, with a single latent vector. Afterwards, the same decoder generates node features for all faces and edges through learnable query vectors and cross-attention mechanisms. Finally, a two-stage training strategy ensures effective coupling of geometry and topology throughout. Comprehensive experiments demonstrate that BrepVGAE significantly outperforms existing methods in reconstruction accuracy, topological validity, and generative diversity. This validates the feasibility and efficacy of decoding complete CAD geometric-topological distributions from a unified latent representation, while also offering novel insights for CAD part retrieval and feature recognition domains.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39117", "url": null, "sourceid": 31265, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39025, "uid": "2585eb8dd46328ec14bce03201541e9d", "name": "Distribution-Aligned Multimodal Fusion for Robust Object Detection", "authors": [{"id": 183183, "fullname": "XIAOHUI HAO", "url": "http://cvpr.thecvf.com/api/miniconf/users/183183?format=json", "institution": "Beihang University"}, {"id": 191195, "fullname": "Yanglin Pu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191195?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 191196, "fullname": "Yongjun Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191196?format=json", "institution": null}, {"id": 153448, "fullname": "Rui She", "url": "http://cvpr.thecvf.com/api/miniconf/users/153448?format=json", "institution": "Beihang University"}], "abstract": "Cross-degradation generalization remains a critical challenge for RGB-infrared multimodal object detection, especially when training data covers limited degradation types. This paper presents a distribution alignment framework with a key insight: aligning fused features to the pretrained distribution where the frozen detector performs optimally, rather than adapting to training-specific degradations. By freezing the pretrained detector and training only a lightweight fusion module (15\\% of total parameters), our approach leverages complementary infrared information to reduce distribution shift while maintaining computational efficiency. The method achieves state-of-the-art results on three benchmarks (LLVIP, FLIR, DroneVehicle) with 4\u00d7 faster training. Critically, we demonstrate that aligning to the pretrained distribution substantially outperforms aligning to training degradations when generalizing to unseen scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39025", "url": null, "sourceid": 40389, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40174, "uid": "1be7371e7797ca4a5f6f651aa9c6eb43", "name": "Universal 3D Shape Matching via Coarse-to-Fine Language Guidance", "authors": [{"id": 181099, "fullname": "Qinfeng Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/181099?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 105964, "fullname": "Guofeng Mei", "url": "http://cvpr.thecvf.com/api/miniconf/users/105964?format=json", "institution": "Fondazione Bruno Kessler"}, {"id": 85913, "fullname": "Bo Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/85913?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 193721, "fullname": "Zhang Liying", "url": "http://cvpr.thecvf.com/api/miniconf/users/193721?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 90486, "fullname": "Jian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90486?format=json", "institution": "University of Technology Sydney"}, {"id": 193722, "fullname": "YICK Kit-lun", "url": "http://cvpr.thecvf.com/api/miniconf/users/193722?format=json", "institution": "Hong Kong Polytechnic University"}], "abstract": "Establishing dense correspondences between shapes is a crucial task in computer vision and graphics, while prior approaches depend on near-isometric assumptions and homogeneous subject types (i.e., only operate for human shapes). However, building semantic correspondences for cross-category objects remains challenging and has received relatively little attention. To achieve this, we propose UniMatch, a semantic-aware, coarse-to-fine framework for constructing dense semantic correspondences between strongly non-isometric shapes without restricting object categories. The key insight is to lift \"coarse\" semantic cues into \"fine\" correspondence, which is achieved through two stages. In the \"coarse\" stage, we perform class-agnostic 3D segmentation to obtain non-overlapping semantic parts and prompt multimodal large language models (MLLMs) to identify part names. Then, we employ pretrained vision language models (VLMs) to extract text embeddings, enabling the construction of matched semantic parts. In the  \"fine\" stage, we leverage these coarse correspondences to guide the learning of dense correspondences through a dedicated rank-based contrastive scheme. Thanks to class-agnostic segmentation, language guiding, and rank-based contrastive learning, our method is versatile for universal object categories and requires no predefined part proposals, enabling universal matching for inter-class and non-isometric shapes. Extensive experiments demonstrate UniMatch consistently outperforms competing methods in various challenging scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40174", "url": null, "sourceid": 35753, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37258, "uid": "00dc5000951c268cc8655294daf67b1b", "name": "Hybrid Robust Collaborative Perception with LiDAR-4D Radar Fusion under Adverse Weather Conditions", "authors": [{"id": 180335, "fullname": "Yuquan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180335?format=json", "institution": "University of Science and Technology of China"}, {"id": 180234, "fullname": "hui zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/180234?format=json", "institution": "university of science and technology of China"}, {"id": 187020, "fullname": "Wenyu Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187020?format=json", "institution": "University of Science and Technology of China"}, {"id": 187021, "fullname": "Ziyin Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187021?format=json", "institution": "University of Science and Technology of China"}, {"id": 187022, "fullname": "Chuanming Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187022?format=json", "institution": "University of Science and Technology of China"}, {"id": 187023, "fullname": "Xiaohua Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187023?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Current collaborative perception systems have significantly improved 3D object detection performance.However, widely used LiDAR and camera systems often suffer performance degradation under adverse weather conditions.The weather-robust 4D radar provides a promising solution to address this challenge.Nevertheless, the effective fusion of sparse 4D radar measurements with degraded LiDAR data remains a significant challenge due to cross-modal corruption and information loss. In this work, we propose a novel hybrid robust collaborative perception framework (HRCP), designed to improve the collaborative perception performance under adverse weather conditions through LiDAR-4D radar fusion.Specifically, we introduce a hybrid collaboration strategy that considers their distinct physical properties and differently processes them during information transmission.Additionally, we propose a bidirectional cross-modal gating (BCMG) module that enables LiDAR and 4D radar to mutually validate feature reliability, ensuring consistent cross-modal representation, and an adaptive feature enhancement (AFE) module that enables comprehensive refinement of degraded and suppressed regions to mitigate information loss.Extensive experimental results demonstrate that our method outperforms previous state-of-the-art approaches under adverse weather conditions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37258", "url": null, "sourceid": 44473, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39305, "uid": "f63d6b3e439d2ef66cf77266122a88bf", "name": "Occlusion-Aware SORT: Observing Occlusion for Robust Multi-Object Tracking", "authors": [{"id": 180348, "fullname": "Chunjiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/180348?format=json", "institution": "Sichuan University"}, {"id": 191806, "fullname": "Jianbo Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/191806?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 191807, "fullname": "Li Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191807?format=json", "institution": "Sichuan University"}, {"id": 191808, "fullname": "Yanru Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191808?format=json", "institution": "Sichuan University"}, {"id": 90046, "fullname": "Liangyin Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/90046?format=json", "institution": "Sichuan University"}], "abstract": "Multi-object tracking (MOT) in computer vision involves analyzing object trajectories and counting the number of objects in video sequences. However, 2D MOT faces challenges due to cost confusion arising from partial occlusion. To address this issue, we present the Occlusion-Aware SORT (OA-SORT) framework, which introduces three innovations: the Occlusion-Aware Module (OAM), the Occlusion-Aware Offset (OAO), and the Bias-Aware Momentum (BAM). First, OAM assesses the occlusion status (\\ie, occlusion severity) of objects and introduces a Gaussian Map (GM) to reduce background influence. Two plug-and-play, training-free components\u2014OAO and BAM\u2014are further proposed. Specifically, OAO leverages the OAM-derived bias from the Kalman Filter's position estimations to compensate positional cost, thereby mitigating confusion. Next, BAM utilizes the OAM-derived bias from the latest trajectory observations to optimize the Kalman Filter\u2019s motion parameters, suppressing estimation fluctuations. Comprehensive evaluations on the DanceTrack, SportsMOT, and MOT17 datasets demonstrate the importance of occlusion handling in MOT. On the DanceTrack test set, OA-SORT achieves 63.1\\% and 64.2\\% in HOTA and IDF1, respectively. Furthermore, integrating the Occlusion-Aware framework into the four additional trackers improves HOTA and IDF1 by an average of 2.08\\% and 3.05\\% on DanceTrack, demonstrating the reusability of the occlusion-aware framework and its components. Ablation studies further validate the effectiveness of the three components, highlighting the key role of the Gaussian Map.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39305", "url": null, "sourceid": 45245, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38583, "uid": "d768001e841bc0bd089acc6f816d0699", "name": "Mask to Align, Weight to Disambiguate: Reliable Unsupervised Cross-Modal Hashing with Masked-Weight Contrast", "authors": [{"id": 190205, "fullname": "Fan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190205?format=json", "institution": "Nanjing University of Finance and Economics"}, {"id": 183666, "fullname": "Yuanzhi Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/183666?format=json", "institution": "Nanjing University of Finance and Economics"}, {"id": 107521, "fullname": "Haimei Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/107521?format=json", "institution": "The University of Sydney"}, {"id": 190206, "fullname": "Yudong Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/190206?format=json", "institution": "Nanjing University of Finance and Economics"}, {"id": 190207, "fullname": "Haikun Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190207?format=json", "institution": "Nanjing University of Finance and Economics"}], "abstract": "In unsupervised cross-modal hashing, real-world data often exhibit partial alignment and semantic mismatch: dominant modalities tend to overrule fusion, fine-grained complementary cues are overlooked, and mini-batch \u201cnegative samples\u201d are contaminated by semantically related items, yielding frequent false negatives. Treating all pairs equally in contrastive learning thus makes training noise-prone and ill-suited to partially aligned data. To mitigate these pains, we present Unsupervised Weighted Masked Contrastive Hashing (UWMCH), whose core is: (i) random masked fusion deliberately suppresses part of modality evidence during feature interaction, forcing the model to learn complementary semantics under diverse \u201cpartial interactions,\u201d avoiding reliance on a single modality and explicitly exposing hard cases; (ii) pairwise weighting no longer treats masked and unmasked pairs as equivalent but adaptively assigns a weight to each cross-modal pair by combining instance-level semantic consistency with a K-means induced cluster-consensus prior, injecting the weight into the contrastive objective to suppress suspected false negatives and amplify more informative masked positives. To stabilize the global structure, we further introduce two constraints: Cluster-Centroid Agreement (CCA) forms global semantic anchors at the prototype level in synergy with UWMCH; Semantic Structure Regularization (SSR) builds higher-order semantic structure and aligns it with cross-modal similarity, maintaining intra-modal compactness and inter-modal separability under masking. Extensive benchmark experiments show that UWMCH achieves better retrieval accuracy and convergence stability across multiple datasets. The code will be released.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38583", "url": null, "sourceid": 40472, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38070, "uid": "3f6ef69d17c6561909c684b8189fe349", "name": "Learning from Noisy Supervision: A Denoising\u2013Debiasing Framework for Weakly Supervised Video Anomaly Detection", "authors": [{"id": 182335, "fullname": "Yaxin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/182335?format=json", "institution": "Nankai University"}, {"id": 188979, "fullname": "Yang Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188979?format=json", "institution": null}, {"id": 188980, "fullname": "Wenya Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/188980?format=json", "institution": "Nankai University"}, {"id": 188981, "fullname": "Sihan Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188981?format=json", "institution": "Nankai University"}, {"id": 188982, "fullname": "Xiangrui Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/188982?format=json", "institution": "Nankai University"}, {"id": 188983, "fullname": "Xi Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188983?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 188984, "fullname": "Ying Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188984?format=json", "institution": "Nankai University"}, {"id": 188985, "fullname": "Xiaojie Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188985?format=json", "institution": "Nankai University"}], "abstract": "Weakly supervised video anomaly detection (WS-VAD) aims to localize frame-level anomalies using only video-level labels. This task is typically formulated within a multiple instance learning (MIL) paradigm, where each video is treated as a bag of snippets, achieving robust performance without requiring additional information.However, existing methods often struggle with noisy supervision signals. Normal snippets within abnormal bags are frequently misclassified as anomalies due to inaccurate anomaly scores. These misclassified instances act as noisy samples, introducing false supervision that hinders the learning of true anomaly patterns.In this work, we introduce $D^{2}MIL$, a Denoising\u2013Debiasing framework within the Multiple Instance Learning paradigm designed to suppress noise and improve anomaly discrimination. Our approach integrates two key components:(1) Denoising Module: We introduce a dynamic drop rate to adaptively filter out suspected noisy samples during training, based on the observation that noisy samples incur higher training losses. (2) Debiasing Module: We leverage a vision-language model to re-evaluate the discarded samples. This recovers potentially valuable abnormal instances that were mistakenly removed, as they are similar to noisy samples but difficult for the model to recognize. $D^{2}MIL$ is a general purpose denoising strategy that can be integrated into any MIL-based method. Our extensive experiments on the three benchmark datasets (ShanghaiTech, UCF-Crime, and MSAD) demonstrate that $D^{2}MIL$ is compatible with diverse MIL frameworks and consistently enhances their performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38070", "url": null, "sourceid": 31549, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39837, "uid": "ded6980d594427e86a9f172d44a24b0d", "name": "Tackling Model Bias via Game-theoretic Multi-agent Collaboration Framework for Hateful Meme Classification", "authors": [{"id": 183259, "fullname": "Yiwei Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/183259?format=json", "institution": "China University of Petroleum-Beijing at Karamay"}, {"id": 192949, "fullname": "Zhengliang Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192949?format=json", "institution": "China University of Petroleum-Beijing at Karama"}, {"id": 184873, "fullname": "Shaozu Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/184873?format=json", "institution": "Meituan"}, {"id": 192950, "fullname": "Chengyin Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192950?format=json", "institution": "China University of Petroleum-Beijing at Karamay"}, {"id": 192951, "fullname": "Zhiyang Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/192951?format=json", "institution": "China University of Petroleum-Beijing at Karamay"}, {"id": 192952, "fullname": "Jiujiang Guo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192952?format=json", "institution": "Tianjin University"}, {"id": 192953, "fullname": "Meng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/192953?format=json", "institution": "Oracle AI"}, {"id": 192954, "fullname": "Peiying Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192954?format=json", "institution": "Meituan"}, {"id": 192955, "fullname": "Longbiao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192955?format=json", "institution": "Tianjin University"}], "abstract": "Hateful meme classification aims to identify memes containing hateful content and has become increasingly important in the era of social media dominance. Large multimodal models (LMMs) have significantly enhanced the understanding of multimodal content, advancing this field. However, cognitive biases in LMMs can impede effective collaboration among models. To address this issue, we introduce \\textbf{GECO}, a \\textbf{G}ame-theoretic multi-ag\\textbf{E}nt \\textbf{C}ollaboration framew\\textbf{O}rk that organizes multiple LMMs into interacting agents and employs game-theoretic principles to guide them toward an optimal cooperative equilibrium. GECO integrates a mixed bonus scheme, incorporating both individual accuracy and cross-model agreement to promote convergence toward a consistent cooperative solution. In addition, we implement efficient policy learning and introduce a penalty coefficient to optimize the framework effectively and ensure training stability.  Extensive experiments on five publicly available benchmarks demonstrate that our framework achieves new state-of-the-art performance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39837", "url": null, "sourceid": 44468, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39790, "uid": "d6b940b9220d02a4fe5072a08dd2f490", "name": "Prompt Yourself: Awakening Textual Semantics in 1D Visual Tokenizers", "authors": [{"id": 71048, "fullname": "hualiang wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/71048?format=json", "institution": "HKUST"}, {"id": 157828, "fullname": "Siming Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/157828?format=json", "institution": "alibabagroup"}, {"id": 151427, "fullname": "Weinan Jia", "url": "http://cvpr.thecvf.com/api/miniconf/users/151427?format=json", "institution": "University of Science and Technology of China"}, {"id": 85619, "fullname": "Yuning Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85619?format=json", "institution": "University of Science and Technology of China"}, {"id": 189583, "fullname": "Mu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189583?format=json", "institution": "ByteDance Inc."}, {"id": 192863, "fullname": "Jidong Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192863?format=json", "institution": "Bytedance"}, {"id": 127166, "fullname": "Xiaomeng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/127166?format=json", "institution": "The Hong Kong University of Science and Technology"}], "abstract": "One-dimensional (1D) visual tokenizers offer notable semantic compactness by discarding local spatial priors, and have become increasingly popular for image reconstruction and generation tasks. However, such global and sequential representations struggle to preserve fine-grained visual content; simply increasing network size or token count offers only superficial mitigation. To address this, we introduce \\textit{\\textbf{VLTok}}, a novel 1D hybrid tokenizer that unifies \\textit{\\textbf{V}}isual and \\textit{\\textbf{L}}anguage representations in a shared \\textit{\\textbf{Tok}}en space through a \\textit{\\textbf{self-prompted}} training paradigm. During training, VLTok simultaneously generates 1D visual and textual tokens from images, aligning the textual tokens with embeddings from a pre-trained language model. This cross-modal alignment infuses implicit linguistic cues into the tokenizer, enhancing fine-grained image encoding. At inference, the self-prompted paradigm eliminates the need for external text, maintaining the simplicity of the image-only framework while benefiting from multi-modal guidance. Extensive experiments on the ImageNet benchmark demonstrate that VLTok achieves state-of-the-art performance in both image reconstruction and image generation. For example, under the same model parameter budget, our method yields relative reduction of 11.1% in rFID and 18.7% in gFID compared to GigaTok.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39790", "url": null, "sourceid": 45295, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37205, "uid": "6baf0c68d744f7699a2fca1addf6a105", "name": "Next-Scale Autoregressive Models for Text-to-Motion Generation", "authors": [{"id": 179977, "fullname": "Zhiwei Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/179977?format=json", "institution": "University of Pennsylvania"}, {"id": 186917, "fullname": "Shibo Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/186917?format=json", "institution": "University of Pennsylvania, University of Pennsylvania"}, {"id": 90047, "fullname": "Lingjie Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/90047?format=json", "institution": "Saarland Informatics Campus, Max-Planck Institute"}, {"id": 186918, "fullname": "Mingmin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186918?format=json", "institution": "University of Pennsylvania"}], "abstract": "Autoregressive (AR) models offer stable and efficient training, but standard next-token prediction is not well aligned with the temporal structure required for text-conditioned motion generation. We introduce MoScale, a next-scale AR framework that generates motion hierarchically from coarse to fine temporal resolutions. By providing global semantics at the coarsest scale and refining them progressively, MoScale establishes a causal hierarchy better suited for long-range motion structure. To improve robustness under limited text\u2013motion data, we further incorporate cross-scale hierarchical refinement for improving coarse-scale predictions and in-scale temporal refinement for selective bidirectional re-prediction. MoScale achieves state-of-the-art text-to-motion performance with high training efficiency, scales effectively with model size, and generalizes zero-shot to diverse motion generation and editing tasks.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37205", "url": null, "sourceid": 41605, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36971, "uid": "c3a5d96180f4811f93e797bb2bca08ee", "name": "Exploring the Underwater World Segmentation without  Extra Training", "authors": [{"id": 186348, "fullname": "Bingyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186348?format=json", "institution": "University of Science and Technology of China"}, {"id": 186349, "fullname": "Tao Huo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186349?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 186350, "fullname": "Da Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186350?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 186351, "fullname": "Zhiyuan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186351?format=json", "institution": "China Telecom"}, {"id": 155750, "fullname": "Junyu Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/155750?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 185717, "fullname": "Xuelong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185717?format=json", "institution": "China Telecom; Northwestern Polytechnical University"}], "abstract": "Accurate segmentation of marine organisms is vital for biodiversity monitoring and ecological assessment, yet existing datasets and models remain largely limited to terrestrial scenes. To bridge this gap, we introduce **AquaOV255**, the first large-scale and fine-grained underwater segmentation dataset containing 255 categories and over 20K images, covering diverse marine organisms and man-made objects for open-vocabulary evaluation. Furthermore, we establish the first underwater open-vocabulary segmentation benchmark, **UOVSBench**, by integrating AquaOV255 with five additional underwater datasets to enable comprehensive cross-domain evaluation. Alongside, we present **Earth2Ocean**, a training-free open-vocabulary segmentation framework that transfers terrestrial vision\u2013language models (VLMs) to underwater domains without any additional underwater training. Earth2Ocean consists of two core components: a Geometric-guided Visual Mask Generator (**GMG**) that refines visual features via self-similarity geometric priors for local structure perception, and a Category-visual Semantic Alignment(**CSA**) module that enhances text embeddings through multimodal large language model reasoning and scene-aware template construction. Extensive experiments on the UOVSBench benchmark demonstrate that Earth2Ocean achieves over 6+ mIoU improvement on average while maintaining efficient inference.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36971", "url": null, "sourceid": 36134, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38200, "uid": "9ffeb5b764e884cc28fef0104e64b5bc", "name": "Dance Across Shifts: Forward-Facilitation Continual Test-Time Adaptation through Dynamic Style Bridging", "authors": [{"id": 189291, "fullname": "Zhilin Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189291?format=json", "institution": "Harbin Institute of Technology"}, {"id": 162893, "fullname": "Yabin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/162893?format=json", "institution": "Harbin Institute of Technology"}, {"id": 158940, "fullname": "Zhiheng Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/158940?format=json", "institution": "Shenzhen University of Advanced Technology"}, {"id": 126240, "fullname": "Yaguang Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/126240?format=json", "institution": "Peng Cheng Laboratory"}, {"id": 87840, "fullname": "Yaowei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87840?format=json", "institution": "Pengcheng Laboratory"}, {"id": 154060, "fullname": "Xiaopeng Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/154060?format=json", "institution": "Harbin Institute of Technology"}], "abstract": "Continual Test-Time Adaptation (CTTA) aims to empower perception systems to handle real-world dynamic distribution shifts after deployment. However, its efficacy is limited by the scarcity and unreliability of supervision signals, leading to error accumulation and catastrophic forgetting. While existing methods predominantly follow a backward-alignment paradigm, constructing weak supervisory surrogates derived from prior knowledge, they struggle with unreliable supervision and evolving distribution shifts. To overcome this, we propose a novel forward-facilitation paradigm through a dynamic style bridging framework. Specifically, we first construct a compact, offline-generated knowledge base of semantically pure class exemplars to provide reliable content. Subsequently, to mitigate generative bias and handle evolving distribution shifts, we propose a multi-level style bridging mechanism. It dynamically transfers current target domain styles to synthetic proxies at the input, statistics, and representation levels. This process yields on-the-fly proxies that are both semantically reliable and stylistically faithful to the target data, which are then used to construct on-demand supervisory signals, effectively enabling stable and discriminative adaptation under continual shifts. Extensive experiments across standard CTTA benchmarks demonstrate consistent and substantial improvements over recent state-of-the-art methods.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38200", "url": null, "sourceid": 31226, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36993, "uid": "7f95080e8eda3a6ca81ca314500535f8", "name": "MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning", "authors": [{"id": 186404, "fullname": "Chenglong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186404?format=json", "institution": "Northeastern University"}, {"id": 186405, "fullname": "Yifu Huo", "url": "http://cvpr.thecvf.com/api/miniconf/users/186405?format=json", "institution": "Northeastern University"}, {"id": 186406, "fullname": "Yang Gan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186406?format=json", "institution": "Northeastern University"}, {"id": 155377, "fullname": "Qiaozhi He", "url": "http://cvpr.thecvf.com/api/miniconf/users/155377?format=json", "institution": "None"}, {"id": 186407, "fullname": "Qi Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186407?format=json", "institution": null}, {"id": 186408, "fullname": "Bei Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186408?format=json", "institution": "Meituan"}, {"id": 186409, "fullname": "Yan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186409?format=json", "institution": null}, {"id": 186410, "fullname": "Junfu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186410?format=json", "institution": null}, {"id": 186411, "fullname": "Tian Hua Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/186411?format=json", "institution": null}, {"id": 186412, "fullname": "JingBo Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186412?format=json", "institution": "Northeastern University"}, {"id": 186413, "fullname": "Tong Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186413?format=json", "institution": "Northeastern University"}], "abstract": "Recent advances in multimodal reward modeling have been largely driven by a paradigm shift from discriminative to generative approaches. Building on this progress, recent studies have further employed reinforcement learning with verifiable rewards (RLVR) to enhance multimodal reward models (MRMs). Despite their success, RLVR-based training typically depends on labeled multimodal preference data, which are costly and labor-intensive to obtain, making it difficult to scale the training of MRMs. To overcome this limitation, we propose a Multi-Stage Reinforcement Learning (MSRL) approach, which can achieve scalable reinforcement learning for MRMs with limited multimodal data. MSRL redefines the conventional RLVR-based training paradigm by first learning a generalizable reward reasoning capability from large-scale textual preference data and then progressively transferring this capability to multimodal tasks through caption-based and fully multimodal reinforcement learning stages. Furthermore, we introduce a cross-modal knowledge distillation approach to improve preference generalization within MSRL. Extensive experiments demonstrate that MSRL effectively scales the RLVR-based training of generative MRMs and substantially improves their performance across both visual understanding and visual generation tasks (e.g., 68.5\\%$\\rightarrow$74.8\\% on VLReward Bench, 69.2\\%$\\rightarrow$75.4\\% on GenAI-Bench), without requiring additional multimodal preference annotations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36993", "url": null, "sourceid": 32867, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39501, "uid": "ce21433ca7c640e8847df344c1d50ab2", "name": "PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning", "authors": [{"id": 176018, "fullname": "Xinyong Cai", "url": "http://cvpr.thecvf.com/api/miniconf/users/176018?format=json", "institution": "SiChuan University"}, {"id": 192211, "fullname": "Changbin Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/192211?format=json", "institution": "Sichuan University"}, {"id": 192212, "fullname": "Yong Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192212?format=json", "institution": "University of Hong Kong"}, {"id": 192213, "fullname": "Hongyu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192213?format=json", "institution": "Sichuan University"}, {"id": 192214, "fullname": "Yuankai Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192214?format=json", "institution": "Sichuan University"}], "abstract": "Spatiotemporal predictive learning (STPL) aims to forecast future frames from past observations and is essential across a wide range of applications. Compared with recurrent or hybrid architectures, pure convolutional models offer superior efficiency and full parallelism, yet their fixed receptive fields limit their ability to adaptively capture spatially varying motion patterns. Inspired by biological center\u2013surround organization and frequency-selective signal processing, we propose PFGNet, a fully convolutional framework that dynamically modulates receptive fields through pixel-wise frequency-guided gating. The core Peripheral Frequency Gating (PFG) block extracts localized spectral cues and adaptively fuses multi-scale large-kernel peripheral responses with learnable center suppression, effectively forming spatially adaptive band-pass filters. To maintain efficiency, all large kernels are decomposed into separable 1D convolutions ($1\\times k$ followed by $k\\times1$), reducing per-channel computational cost from $\\mathcal{O}(k^2)$ to $\\mathcal{O}(2k)$. PFGNet enables structure-aware spatiotemporal modeling without recurrence or attention. Experiments on Moving MNIST, TaxiBJ, Human3.6M, and KTH show that PFGNet delivers SOTA or near-SOTA forecasting performance with substantially fewer parameters and FLOPs.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39501", "url": null, "sourceid": 36734, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37363, "uid": "60e592a5e2c587d823db52392ff1591f", "name": "Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models", "authors": [{"id": 179898, "fullname": "Keuntae Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/179898?format=json", "institution": "Hanyang University"}, {"id": 187260, "fullname": "Mingyu Kang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187260?format=json", "institution": "Hanyang University"}, {"id": 180074, "fullname": "Yong Suk Choi", "url": "http://cvpr.thecvf.com/api/miniconf/users/180074?format=json", "institution": "Hanyang University"}], "abstract": "Diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive (AR) LLMs. Recently, this paradigm has been extended to multimodal tasks, leading to the development of diffusion multimodal large language models (dMLLMs). These models are expected to retain the reasoning capabilities of LLMs while enabling faster inference through parallel generation. However, when combined with Chain-of-Thought (CoT) reasoning, dMLLMs exhibit two critical issues.First, we observe that dMLLMs often generate the final answer token at a very early timestep. This trend indicates that the model determines the answer before sufficient reasoning, leading to degraded reasoning performance. Second, during the initial timesteps, dMLLMs show minimal dependency on visual prompts, exhibiting a fundamentally different pattern of visual information utilization compared to AR vision-language models. In summary, these findings indicate that dMLLMs tend to generate premature final answers without sufficiently grounding on visual inputs.To address these limitations, we propose Position & Step Penalty (PSP) and Visual Reasoning Guidance (VRG). PSP penalizes tokens in later positions during early timesteps, delaying premature answer generation and encouraging progressive reasoning across timesteps. VRG, inspired by the classifier-free guidance, amplifies visual grounding signals to enhance the model\u2019s alignment with visual evidence.Extensive experiments across various dMLLMs demonstrate that our method achieves up to 7.5% higher accuracy while delivering more than 3\u00d7 speedup compared to reasoning with four times more diffusion steps. Our code will be released after publication.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37363", "url": null, "sourceid": 46280, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38029, "uid": "3dbf996cda195acfa83b7f12ca698ae0", "name": "TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering", "authors": [{"id": 181258, "fullname": "Hanshen Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/181258?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 85267, "fullname": "Yuliang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/85267?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 153724, "fullname": "Xuecheng Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153724?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 188879, "fullname": "An-Lan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188879?format=json", "institution": ""}, {"id": 188880, "fullname": "Hao Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188880?format=json", "institution": null}, {"id": 91003, "fullname": "Dingkang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91003?format=json", "institution": "Fudan University"}, {"id": 188881, "fullname": "ChaoFeng ChaoFeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/188881?format=json", "institution": "ByteDance Inc."}, {"id": 126836, "fullname": "Can Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126836?format=json", "institution": "Bytedance"}, {"id": 126844, "fullname": "Jingqun Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126844?format=json", "institution": "Bytedance"}, {"id": 85817, "fullname": "Xiang Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/85817?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Visual Text Rendering (VTR) remains a critical challenge in text\u2011to\u2011image generation, where even advanced models frequently produce text with structural anomalies such as distortion, blurriness, and misalignment.% We identify a key bottleneck across both VTR evaluation and Reinforcement Learning (RL) processes: current evaluators and reward models lack the ability for fine-grained structural perception. However, we find that leading MLLMs and specialist OCR models largely fail to perceive these structural anomalies, creating a critical bottleneck for both VTR evaluation and RL\u2011based optimization.   As a result, even state\u2011of\u2011the\u2011art generators (e.g., SeedDream4.0, Qwen\u2011Image) still struggle to render structurally faithful text.To address this, we propose TextPecker,a plug-and-play structural anomaly perceptive RL strategy that mitigates noisy reward signals and works with any text-to-image generator. To enable this capability, we construct a recognition dataset with character\u2011level structural\u2011anomaly annotations and develop a stroke\u2011editing synthesis engine to expand structural\u2011error coverage. Experiments show that TextPecker consistently improves diverse text\u2011to\u2011image models; even on the well\u2011optimized Qwen\u2011Image, it significantly yields average gains of 4\\% in structural fidelity and 8.7\\% in semantic alignment for Chinese text rendering, establishing a new state-of-the-art in high-fidelity VTR.Our work fills a gap in VTR optimization, providing a foundational step towards  reliable and structural faithful visual text generation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38029", "url": null, "sourceid": 37229, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39254, "uid": "f4d6f152ec84baf228eb36bf424a6aa1", "name": "UniVBench: Towards Unified Evaluation for Video Foundation Models", "authors": [{"id": 182289, "fullname": "Jianhui Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/182289?format=json", "institution": "Zhejiang University"}, {"id": 191713, "fullname": "Xiaotian Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191713?format=json", "institution": "Zhejiang University"}, {"id": 191714, "fullname": "Yichen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/191714?format=json", "institution": "Zhejiang University"}, {"id": 191715, "fullname": "Yuan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191715?format=json", "institution": "Zhejiang University"}, {"id": 191716, "fullname": "Yan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191716?format=json", "institution": "ByteDance Inc."}, {"id": 191717, "fullname": "Ziyi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/191717?format=json", "institution": null}, {"id": 191718, "fullname": "Zhihang Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191718?format=json", "institution": "Tianjin University"}, {"id": 191719, "fullname": "Wei Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191719?format=json", "institution": "Bytedance"}, {"id": 191720, "fullname": "Zuozhu Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191720?format=json", "institution": "Zhejiang University"}], "abstract": "Video foundation models aim to integrate video understanding, generation, editing, and instruction following within a single framework, making them a central direction for next-generation multimodal systems. However, existing evaluation benchmarks remain fragmented and limited in scope, as they each target a single task, rely on task-specific metrics, and typically use short or simple video clips. As a result, they do not capture the unified  capabilities that these models are designed to deliver. To address this gap, we introduce UniVBench, a  benchmark purpose-built for evaluating  video foundation models across four core abilities: video understanding, video generation, video editing, and a newly proposed task, video reconstruction, which assesses how faithfully a model can reproduce video content it has  encountered. Our benchmark substantially expands the complexity of evaluation by incorporating 200 high-quality, diverse and multi-shot videos, each paired with detailed captions, multi-format editing instructions, and reference images. All videos are entirely human-created and carefully validated, offering richer cinematic information than prior benchmarks. In addition, we develop a unified agentic evaluation system (UniV-Eval) that standardizes prompting, instruction parsing, and scoring across all tasks, enabling fair, scalable, and reproducible comparisons of unified video models. By grounding evaluation in instruction-based multi-shot video tasks, UniVBench provides the first framework for measuring the integrated capabilities that video foundation models aim to achieve. Extensive human annotations ensure our evaluation aligns with human judgment, enabling rigorous assessment and accelerating progress toward robust video intelligence.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39254", "url": null, "sourceid": 42620, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39713, "uid": "b91e57514fa3018b0a8238eaba5337b7", "name": "Refracting Reality: Generating Images with Realistic Transparent Objects", "authors": [{"id": 145783, "fullname": "Yue Yin", "url": "http://cvpr.thecvf.com/api/miniconf/users/145783?format=json", "institution": "The Australian National University"}, {"id": 192708, "fullname": "Enze Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192708?format=json", "institution": "Australian National University"}, {"id": 107093, "fullname": "Dylan Campbell", "url": "http://cvpr.thecvf.com/api/miniconf/users/107093?format=json", "institution": "Australian National University"}], "abstract": "Generative image models can produce convincingly real images, with plausible shapes, textures, layouts and lighting. However, one domain in which they perform notably poorly is in the synthesis of transparent objects, which exhibit refraction, reflection, absorption and scattering. Refraction is a particular challenge, because refracted pixel rays often intersect with surfaces observed in other parts of the image, providing a constraint on the color. It is clear from inspection that generative models have not distilled the laws of optics sufficiently well to accurately render refractive objects. In this work, we consider the problem of generating images with accurate refraction, given a text prompt. We synchronize the pixels within the object's boundary with those outside by warping and merging the pixels using Snell's Law of Refraction, at each step of the generation trajectory. For those surfaces that are not directly observed in the image, but are visible via refraction or reflection, we recover their appearance by synchronizing the image with a second generated image---a panorama centered at the object---using the same warping and merging procedure. We demonstrate that our approach generates much more optically-plausible images that respect the physical constraints.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39713", "url": null, "sourceid": 38221, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39591, "uid": "9e4745a6b2bdebae79fa8fbe2c83b7ca", "name": "MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis", "authors": [{"id": 153982, "fullname": "Yuting Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153982?format=json", "institution": "Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 153984, "fullname": "Kaishen Yuan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153984?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 70254, "fullname": "Hao Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/70254?format=json", "institution": "HKUST (GZ)"}, {"id": 192423, "fullname": "Yutao Yue", "url": "http://cvpr.thecvf.com/api/miniconf/users/192423?format=json", "institution": "Hong Kong University of Science and Technology (Guangzhou); Institute of Deep Perception Technology, JITRI"}, {"id": 157913, "fullname": "Jintai Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/157913?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 153986, "fullname": "Kaishun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153986?format=json", "institution": "HKUST(GZ)"}], "abstract": "Accurate and interpretable multi-disease diagnosis remains a critical challenge in medical research, particularly when leveraging heterogeneous multimodal medical data. Current approaches often rely on single-modal data, limiting their ability to comprehensively understand complex diseases. To address this, we propose MedTVT-R1, a novel Multimodal Large Language Model (MLLM) framework designed to integrate clinical multimodal data for reasoning and diagnosing multiple diseases. We construct MedTVT-QA, a curated instruction dataset that provides question-answer pairs for physiological-level interpretations and disease-level diagnoses with a Chain of Evidence approach. MedTVT-R1 incorporates a modality perception layer to capture inter-modal dependencies and adaptively weight modality contributions. Additionally, we employ Group Relative Policy Optimization (GRPO)-based Reinforcement Fine-Tuning with a Jaccard Reward function to enhance diagnostic reasoning. Experimental results demonstrate MedTVT-R1's superiority in multimodal feature utilization and multi-disease diagnosis, offering significant potential for clinical applications such as diagnostic report generation and comorbidity reasoning. The dataset and code will be available on GitHub.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39591", "url": null, "sourceid": 33569, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39589, "uid": "4c721b2017d5fb477659ba48277446de", "name": "Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction", "authors": [{"id": 181084, "fullname": "Hao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/181084?format=json", "institution": "Great Bay University"}, {"id": 188215, "fullname": "Lu Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188215?format=json", "institution": "Insta360; Wuhan University"}, {"id": 139138, "fullname": "Xiangtai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/139138?format=json", "institution": "ByteDance Inc."}, {"id": 192419, "fullname": "Jie Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192419?format=json", "institution": "Great Bay University"}, {"id": 192420, "fullname": "Yi Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192420?format=json", "institution": "Donghua University, Shanghai"}, {"id": 127662, "fullname": "Xu Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127662?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}, {"id": 192421, "fullname": "Mingyu Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/192421?format=json", "institution": "Donghua University"}, {"id": 192422, "fullname": "Fei Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/192422?format=json", "institution": "Great Bay University"}], "abstract": "Trajectory prediction is critical for autonomous driving, enabling safe and efficient planning in dense, dynamic traffic. Most existing methods optimize prediction accuracy under fixed-length observations. However, real-world driving often yields variable-length, incomplete observations, posing a challenge to these methods. A common strategy is to directly map features from incomplete observations to those from complete ones. This one-shot mapping, however, struggles to learn accurate representations for short trajectories due to significant information gaps. To address this issue, we propose a $\\textbf{P}$rogressive $\\textbf{R}$etrospective $\\textbf{F}$ramework (PRF), which gradually aligns features from incomplete observations with those from complete ones via a cascade of retrospective units. Each unit consists of a Retrospective Distillation Module (RDM) and a Retrospective Prediction Module (RPM), where RDM distills features and RPM recovers previous timesteps using the distilled features.  Moreover, we propose a Rolling-Start Training Strategy (RSTS) that enhances data efficiency during PRF training. PRF is plug-and-play with existing methods. Extensive experiments on datasets Argoverse 2 and Argoverse 1 demonstrate the effectiveness of PRF. Code will be released.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39589", "url": null, "sourceid": 42074, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37768, "uid": "8000692a2ab9fd3dc006a3cdc9869978", "name": "STRNet: Visual Navigation with Spatio-Temporal Representation through Dynamic Graph Aggregation", "authors": [{"id": 158124, "fullname": "Hao Ren", "url": "http://cvpr.thecvf.com/api/miniconf/users/158124?format=json", "institution": "Sun Yat-Sen University"}, {"id": 158126, "fullname": "Zetong Bi", "url": "http://cvpr.thecvf.com/api/miniconf/users/158126?format=json", "institution": "Sun Yat-Sen University"}, {"id": 158125, "fullname": "Yiming Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/158125?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 188214, "fullname": "Zhaoliang Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188214?format=json", "institution": "Insta360"}, {"id": 188215, "fullname": "Lu Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188215?format=json", "institution": "Insta360; Wuhan University"}, {"id": 127643, "fullname": "Hui Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/127643?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Visual navigation requires the robot to reach a specified goal such as an image, based on a sequence of first-person visual observations. While recent learning-based approaches have made significant progress, they often focus on improving policy heads or decision strategies while relying on simplistic feature encoders and temporal pooling to represent visual input. This leads to the loss of fine-grained spatial and temporal structure, ultimately limiting accurate action prediction and progress estimation. In this paper, we propose a unified spatio-temporal representation framework that enhances visual encoding for robotic navigation. Our approach extracts features from both image sequences and goal observations, and fuses them using the designed spatio-temporal fusion module. This module performs spatial graph reasoning within each frame and models temporal dynamics using a hybrid temporal shift module combined with multi-resolution difference-aware convolution. Experimental results demonstrate that our approach consistently improves navigation performance and offers a generalizable visual backbone for goal-conditioned control. The code will be released to the public.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37768", "url": null, "sourceid": 33629, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38258, "uid": "b9f7c99a62433ab681f7e97cdc4bd107", "name": "DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training", "authors": [{"id": 172265, "fullname": "Haoran Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/172265?format=json", "institution": "Tsinghua University"}, {"id": 189052, "fullname": "Dizhe Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189052?format=json", "institution": "insta360"}, {"id": 139138, "fullname": "Xiangtai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/139138?format=json", "institution": "ByteDance Inc."}, {"id": 84747, "fullname": "Bo Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/84747?format=json", "institution": "Wuhan University"}, {"id": 188215, "fullname": "Lu Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188215?format=json", "institution": "Insta360; Wuhan University"}], "abstract": "In this work, we propose DiT360, a DiT-based framework that performs hybrid training on perspective and panoramic data for panoramic image generation. We attribute the main challenges in preserving geometric fidelity and photorealism to the scarcity of large-scale, high-quality real-world panoramic data, in contrast to prior methods that emphasize model design. Basically, DiT360 has several key modules for inter-domain transformation and intra-domain augmentation, applied at both the pre-VAE image level and the post-VAE token level. At the image level, we incorporate cross-domain knowledge through perspective image guidance and panoramic refinement, which enhance perceptual quality while regularizing diversity and photorealism. At the token level, hybrid supervision is applied across multiple modules, which include circular padding for boundary continuity, yaw loss for rotational robustness, and cube loss for distortion awareness. Extensive experiments on text-to-panorama, inpainting, and outpainting tasks demonstrate that our method achieves better boundary consistency and image fidelity across eleven quantitative metrics. Our code will be released publicly.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38258", "url": null, "sourceid": 34886, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38096, "uid": "2551f901c615b9e8c58d169fc1a560db", "name": "AirSim360: A Panoramic Simulation Platform within Drone View", "authors": [{"id": 189047, "fullname": "Xian Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/189047?format=json", "institution": "Insta360"}, {"id": 189048, "fullname": "Yuling Pan", "url": "http://cvpr.thecvf.com/api/miniconf/users/189048?format=json", "institution": "Shenzhen University"}, {"id": 189049, "fullname": "Yuhang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189049?format=json", "institution": "Insta360"}, {"id": 189050, "fullname": "Xiang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/189050?format=json", "institution": "Northwestern University"}, {"id": 189051, "fullname": "Weijun Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189051?format=json", "institution": "Insta360 Research"}, {"id": 189052, "fullname": "Dizhe Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189052?format=json", "institution": "insta360"}, {"id": 188214, "fullname": "Zhaoliang Wan", "url": "http://cvpr.thecvf.com/api/miniconf/users/188214?format=json", "institution": "Insta360"}, {"id": 182231, "fullname": "Xin Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/182231?format=json", "institution": "University of California, San Diego"}, {"id": 174773, "fullname": "Xiangkai Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/174773?format=json", "institution": "Institute of Automation\uff0cChinese Academy of Sciences"}, {"id": 189053, "fullname": "Juntao Liang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189053?format=json", "institution": null}, {"id": 139138, "fullname": "Xiangtai Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/139138?format=json", "institution": "ByteDance Inc."}, {"id": 152590, "fullname": "jerett Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/152590?format=json", "institution": "insta360"}, {"id": 84747, "fullname": "Bo Du", "url": "http://cvpr.thecvf.com/api/miniconf/users/84747?format=json", "institution": "Wuhan University"}, {"id": 75715, "fullname": "Ming-Hsuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/75715?format=json", "institution": "University of California at Merced"}, {"id": 188215, "fullname": "Lu Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/188215?format=json", "institution": "Insta360; Wuhan University"}], "abstract": "The field of 360-degree omnidirectional understanding has been receiving increasing attention for advancing spatial intelligence. However, the lack of large-scale and diverse data remains a major limitation. In this work, we propose AirSim360, a simulation platform for omnidirectional data from aerial viewpoints, enabling wide-ranging scene sampling with drones. Specifically, AirSim360 focuses on three key aspects: a render-aligned data and labeling paradigm for pixel-level geometric, semantic, and instance-level understanding; an interactive pedestrian-aware system for modeling human behavior; and an automated trajectory generation paradigm to support navigation tasks. Furthermore, we collect more than 60K panoramic samples and conduct extensive experiments across various tasks to demonstrate the effectiveness of our simulator. Unlike existing simulators, our work is the first to systematically model the 4D real world under an omnidirectional setting. The entire platform, including the toolkit, plugins, and collected datasets, will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38096", "url": null, "sourceid": 38646, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37596, "uid": "1e542de86c0aa1cdcbc7177e0829e95d", "name": "OmniZip: Learning a Unified and Lightweight Lossless Compressor for Multi-Modal Data", "authors": [{"id": 146943, "fullname": "Yan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/146943?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 154172, "fullname": "Zhengxue Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/154172?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 187788, "fullname": "Junxuan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187788?format=json", "institution": "Ant Group"}, {"id": 154440, "fullname": "Dajiang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/154440?format=json", "institution": "Ant Group"}, {"id": 187789, "fullname": "Qunshan Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187789?format=json", "institution": "Alibaba Group"}, {"id": 187790, "fullname": "Qi Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187790?format=json", "institution": "Ant Group"}, {"id": 85857, "fullname": "Li Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/85857?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Lossless compression is essential for efficient data storage and transmission. Although learning-based lossless compressors achieve strong results, most of them are designed for a single modality, leading to redundant compressor deployments in multi-modal settings. Designing a unified multi-modal compressor is critical yet challenging, as different data types vary largely in format, dimension, and statistics. Multi-modal large language models offer a promising resolution but remain too complex for practical use. Thus, we propose \\textbf{OmniZip}, \\textbf{a unified and lightweight lossless compressor for multi-modal data (like image, text, speech, tactile, database, and gene sequence)}. Built on a lightweight backbone, OmniZip incorporates three key components to enable efficient multi-modal lossless compression: a modality-unified tokenizer that reversibly transforms diverse data into tokens, a modality-routing context learning mechanism that enables flexible multi-modal context modeling, and a modality-routing feedforward design that further enhances the model's nonlinear representation flexibility. A reparameterization training strategy is used to enhance model capacity. It outperforms or matches other state-of-the-art compressors on multiple modalities, achieving 42\\%, 57\\%, 62\\% and 42\\%, 53\\% higher compression efficiency than gzip on CLIC-M, TouchandGo, enwik9, LibriSpeech, and WikiSQL datasets, respectively. It also supports near real-time inference on resource-constrained edge devices, reaching up to 1MB/s on MacBook CPUs and iPhone NPUs. Our code will be released upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37596", "url": null, "sourceid": 37523, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36502, "uid": "8c6ff9e1b53e7e1f532048b6b1435d7b", "name": "IncreFA: Breaking the Static Wall of Generative Model Attribution", "authors": [{"id": 185221, "fullname": "Haotian Qin", "url": "http://cvpr.thecvf.com/api/miniconf/users/185221?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 131116, "fullname": "Dongliang Chang", "url": "http://cvpr.thecvf.com/api/miniconf/users/131116?format=json", "institution": "Tsinghua University"}, {"id": 185222, "fullname": "Yueying Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185222?format=json", "institution": "Beijing University of Posts and Telecommunications"}, {"id": 185223, "fullname": "Yuexuan Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185223?format=json", "institution": "Beijing University of Post and Telecommunications"}, {"id": 157160, "fullname": "Lei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/157160?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 90236, "fullname": "Zhanyu Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/90236?format=json", "institution": "Beijing University of Post and Telecommunication"}], "abstract": "As AI generative models evolve at unprecedented speed, image attribution has become a moving target. New diffusion, adversarial and autoregressive generators appear almost monthly, making existing watermark, classifier and inversion methods obsolete upon release. The core problem lies not in model recognition, but in the inability to adapt attribution itself.We introduce IncreFA, a framework that redefines attribution as a structured incremental learning problem, allowing the system to learn continuously as new generative models emerge. IncreFA departs from conventional incremental learning by exploiting the hierarchical relationships among generative architectures and coupling them with continual adaptation. It integrates two mutually reinforcing mechanisms: (1) Hierarchical Constraints, which encode architectural hierarchies through learnable orthogonal priors to disentangle family-level invariants from model-specific idiosyncrasies; and (2) a Latent Memory Bank, which replays compact latent exemplars and mixes them to generate pseudo-unseen samples, stabilising representation drift and enhancing open-set awareness.On the newly constructed Incremental Attribution Benchmark (IABench) covering 28 generative models released between 2022 and 2025, IncreFA achieves state-of-the-art attribution accuracy and 98.9% unseen detection under a temporally ordered open-set protocol.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36502", "url": null, "sourceid": 41222, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38934, "uid": "c5b432382d5978b94676426a32725dff", "name": "GaussianVision: Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting", "authors": [{"id": 165006, "fullname": "Yasmine Omri", "url": "http://cvpr.thecvf.com/api/miniconf/users/165006?format=json", "institution": "Stanford University"}, {"id": 190994, "fullname": "Connor Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/190994?format=json", "institution": "Stanford University"}, {"id": 190995, "fullname": "Tsachy Weissman", "url": "http://cvpr.thecvf.com/api/miniconf/users/190995?format=json", "institution": "Stanford University"}, {"id": 190996, "fullname": "Thierry Tambe", "url": "http://cvpr.thecvf.com/api/miniconf/users/190996?format=json", "institution": "Stanford University"}], "abstract": "Modern vision\u2013language pipelines are driven by RGB vision encoders trained on massive image\u2013text corpora. While these pipelines have enabled impressive zero-shot capabilities and strong transfer across tasks, they still inherit two structural inefficiencies from the pixel domain: (i) transmitting dense RGB images from edge devices to the cloud is energy-intensive and costly, and (ii) patch-based tokenization explodes sequence length, stressing attention budgets and context limits. We explore 2D Gaussian Splatting (2DGS) as an alternative visual substrate for alignment: a compact, spatially adaptive representation that parameterizes images by a set of colored anisotropic Gaussians. We develop a scalable 2DGS pipeline with structured initialization, luminance-aware pruning, and batched CUDA kernels, achieving over $90\\times$ faster fitting and $\\sim97$% GPU utilization compared to prior implementations. We further adapt contrastive language-image pre-training (CLIP) to 2DGS by reusing a frozen RGB-based transformer backbone with a lightweight splat-aware input stem and a perceiver resampler, training only $\\sim9.7 - 13.8$% of the total parameters.On a 12.8M dataset from DataComp, GS encoders yield competitive zero-shot performance on 38 datasets from the CLIP benchmark while compressing inputs $3$\u2013$23.5\\times$ relative to pixels. Our results establish 2DGS as a viable multimodal substrate, pinpoint architectural bottlenecks, and open a path toward representations that are both semantically powerful and transmission-efficient for edge\u2013cloud learning.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38934", "url": null, "sourceid": 33235, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36901, "uid": "6e4f47d6a3e00616a565c69757bf59c5", "name": "More than the Sum: Panorama-Language Models for Adverse Omni-Scenes", "authors": [{"id": 179973, "fullname": "Weijia Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/179973?format=json", "institution": "Karlsruher Institut f\u00fcr Technologie; Shenzhen University"}, {"id": 106502, "fullname": "Ruiping Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/106502?format=json", "institution": "Karlsruher Institut f\u00fcr Technologie"}, {"id": 186155, "fullname": "Jiale Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/186155?format=json", "institution": "Karlsruher Institut f\u00fcr Technologie"}, {"id": 113798, "fullname": "Yufan Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/113798?format=json", "institution": "Karlsruhe Institute of Technology (KIT)"}, {"id": 92182, "fullname": "Junwei Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/92182?format=json", "institution": "Karlsruhe Institute of Technology"}, {"id": 186156, "fullname": "Zichao Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186156?format=json", "institution": "University College London, University of London; Karlsruher Institut f\u00fcr Technologie"}, {"id": 186157, "fullname": "Jiaming Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186157?format=json", "institution": "Hunan University"}, {"id": 186158, "fullname": "Qiufu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186158?format=json", "institution": "Shenzhen University"}, {"id": 76746, "fullname": "Linlin Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/76746?format=json", "institution": "Shenzhen University"}, {"id": 75808, "fullname": "Rainer Stiefelhagen", "url": "http://cvpr.thecvf.com/api/miniconf/users/75808?format=json", "institution": "Karlsruhe Institute of Technology"}], "abstract": "Most existing vision-language models (VLMs) are tailored for pinhole imagery, stitching multiple narrow field-of-view inputs to piece together a complete omni-scene understanding. Yet, such multi-view perception overlooks the holistic spatial and contextual relationships that a single panorama inherently preserves. In this work, we propose that panoramic vision-language understanding is more than the sum of its pinhole counterparts. We introduce Panorama-Language Modeling (PLM), a unified 360\u00b0 visual-language reasoning. Besides, we present PanoVQA, a large-scale panoramic VQA dataset that integrates diverse and adverse omni-scenes, enabling comprehensive reasoning under occlusion, accidents, and challenging conditions. To establish a foundation for PLM, we develop a plug-and-play panoramic adaptation module that allows existing pinhole-based VLMs to process equirectangular panoramas without retraining. Extensive experiments demonstrate that our PLM achieves superior robustness and holistic reasoning under adverse omni-scenes, revealing that a full panorama yields understanding greater than the sum of its parts. All datasets and code will be publicly released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36901", "url": null, "sourceid": 43546, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38201, "uid": "eabe8e1fe6add3ddf6d65b6df954b376", "name": "IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator\u2013Critic Framework", "authors": [{"id": 189292, "fullname": "Feiyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189292?format=json", "institution": "Fudan University"}, {"id": 189293, "fullname": "Jiayuan Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189293?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 186351, "fullname": "Zhiyuan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186351?format=json", "institution": "China Telecom"}, {"id": 186350, "fullname": "Da Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186350?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 186348, "fullname": "Bingyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186348?format=json", "institution": "University of Science and Technology of China"}, {"id": 189294, "fullname": "Peng Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189294?format=json", "institution": "Fudan University"}, {"id": 155750, "fullname": "Junyu Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/155750?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}], "abstract": "Scalable Vector Graphics (SVG) are central to digital design due to their inherent scalability and editability. Despite significant advancements in content generation enabled by Visual Language Models (VLMs), existing text-to-SVG generation methods are limited by a core challenge: the autoregressive training process does not incorporate visual perception of the final rendered image, which fundamentally constrains generation quality. To address this limitation, we propose an Introspective SVG Generation Framework (IntroSVG). At its core, the framework instantiates a unified VLM that operates in a closed loop, assuming dual roles of both generator and critic. Specifically, through Supervised Fine-Tuning (SFT), the model learns to draft SVGs and to provide feedback on their rendered outputs; moreover, we systematically convert early-stage failures into high-quality error-correction training data, thereby enhancing model robustness. Subsequently, we leverage a high-capacity teacher VLM to construct a preference dataset and further align the generator's policy through Direct Preference Optimization (DPO). During inference, the optimized generator and critic operate collaboratively in an iterative \"generate-review-refine\" cycle, starting from imperfect intermediate drafts to autonomously improve output quality. Experimental results demonstrate that our method achieves state-of-the-art performance across several key evaluation metrics, generating SVGs with more complex structures, stronger semantic alignment, and greater editability. These results corroborate the effectiveness of incorporating explicit visual feedback into the generation loop.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38201", "url": null, "sourceid": 44288, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37637, "uid": "15ba4a985b497db630021097082b4ce2", "name": "Refa\u00e7ade: Editing Object with Given Reference Texture", "authors": [{"id": 173374, "fullname": "Youze Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/173374?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 179967, "fullname": "Penghui Ruan", "url": "http://cvpr.thecvf.com/api/miniconf/users/179967?format=json", "institution": "The Hong Kong Polytechnic University"}, {"id": 187924, "fullname": "Bojia Zi", "url": "http://cvpr.thecvf.com/api/miniconf/users/187924?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 85981, "fullname": "Xianbiao Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/85981?format=json", "institution": "International Digital Economy Academy"}, {"id": 76162, "fullname": "Jianan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76162?format=json", "institution": "Astribot Inc"}, {"id": 184713, "fullname": "Rong Xiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/184713?format=json", "institution": null}], "abstract": "Recent advances in diffusion models have brought remarkable progress in image and video editing, yet some tasks remain underexplored. In this paper, we introduce a new task, *Object Retexture*, which transfers local textures from a reference object to a target object in images or videos. To perform this task, a straightforward solution is to use ControlNet conditioned on the source structure and the reference texture. However, this approach suffers from limited controllability due to two reasons: conditioning on the raw reference image introduces unwanted structural information, and this method fails to disentangle visual texture and structure information of the source. To address this problem, we proposed a method, namely **Refa\u00e7ade**, that consists of two key designs to achieve precise and controllable texture transfer in both images and videos. First, we employ a texture remover trained on paired textured/untextured 3D mesh renderings to remove appearance information while preserving geometry and motion of source videos. Second, we disrupt the reference\u2019s global layout using a jigsaw permutation, encouraging the model to focus on local texture statistics rather than global layout of object. Extensive experiments demonstrate superior visual quality, precise editing, and controllability, outperforming strong baselines in both quantitative and human evaluations.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37637", "url": null, "sourceid": 35106, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39444, "uid": "065940d7612634b4e81f593f6ce19418", "name": "EnergyAction: Unimanual to Bimanual Composition with Energy-Based Models", "authors": [{"id": 192090, "fullname": "Mingchen Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/192090?format=json", "institution": "Harbin Institute of Technology"}, {"id": 131456, "fullname": "Xiang Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/131456?format=json", "institution": "Harbin Institute of Technology (Shenzhen)"}, {"id": 174567, "fullname": "wei jie", "url": "http://cvpr.thecvf.com/api/miniconf/users/174567?format=json", "institution": "Harbin Institute of Technology"}, {"id": 128613, "fullname": "Dongmei Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128613?format=json", "institution": "Peng Cheng Laboratory"}, {"id": 84777, "fullname": "Liqiang Nie", "url": "http://cvpr.thecvf.com/api/miniconf/users/84777?format=json", "institution": "Harbin Institute of Technology (Shenzhen)"}, {"id": 90282, "fullname": "Weili Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/90282?format=json", "institution": "Monash University"}], "abstract": "Recent advances in unimanual manipulation policies have achieved remarkable success across diverse robotic tasks through abundant training data and well-established model architectures. However, extending these capabilities to bimanual manipulation remains challenging due to the lack of bimanual demonstration data and the complexity of coordinating dual-arm actions. Existing approaches either rely on extensive bimanual datasets or fail to effectively leverage pre-trained unimanual policies. To address this limitation, we propose EnergyAction, a novel framework that compositionally transfers unimanual manipulation policies to bimanual tasks through the Energy-Based Models (EBMs). Specifically, our method incorporates three key innovations. First, we model individual unimanual policies as EBMs and leverage their compositional properties to compose left and right arm actions, enabling the fusion of unimanual policies into a bimanual policy. Second, we introduce an energy-based temporal-spatial coordination mechanism through energy constraints, ensuring the generated bimanual actions are both temporal coherence and spatial feasibility. Third, we propose two different energy-aware denoising strategies that dynamically adapt denoising steps based on action quality assessment. These strategies ensure the generation of high-quality actions while maintaining superior computational efficiency compared to fixed-step denoising approaches. Experimental results demonstrate that EnergyAction effectively transfers unimanual knowledge to bimanual tasks, achieving superior performance on both simulated and real-world tasks with minimal bimanual data.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39444", "url": null, "sourceid": 40780, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40135, "uid": "622947a06b2eedeb1bf9282d4c61b7f1", "name": "MARIS: Marine Open-Vocabulary Instance Segmentation", "authors": [{"id": 186348, "fullname": "Bingyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186348?format=json", "institution": "University of Science and Technology of China"}, {"id": 189292, "fullname": "Feiyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189292?format=json", "institution": "Fudan University"}, {"id": 186350, "fullname": "Da Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186350?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 186351, "fullname": "Zhiyuan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186351?format=json", "institution": "China Telecom"}, {"id": 155750, "fullname": "Junyu Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/155750?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 185717, "fullname": "Xuelong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185717?format=json", "institution": "China Telecom; Northwestern Polytechnical University"}], "abstract": "Most existing underwater instance segmentation approaches are constrained by close-vocabulary prediction, limiting their ability to recognize novel marine categories.  To support evaluation, we introduce **MARIS** (_Marine Open-Vocabulary Instance Segmentation_), the first large-scale fine-grained benchmark for underwater Open-Vocabulary (OV) Instance segmentation (UOVIS), featuring a limited set of seen categories and diverse unseen categories.  Although OV instance segmentation has shown promise on natural images, our analysis reveals that transfer to underwater scenes suffers from severe visual degradation (e.g., color attenuation) and semantic misalignment caused by lack underwater class definitions.  To address these issues, we propose a unified framework with two complementary components. The Geometric Prior Enhancement Module (**GPEM**) leverages stable part-level and structural cues to maintain object consistency under degraded visual conditions. The Semantic Alignment Injection Mechanism (**SAIM**) enriches language embeddings with domain-specific priors, mitigating semantic ambiguity and improving recognition of unseen categories.  Experiments show that our framework consistently outperforms existing OV baselines both In-Domain and Cross-Domain setting on MARIS, establishing a strong foundation for future underwater perception research. The code of this paper can be found in the supplementary materials.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40135", "url": null, "sourceid": 37994, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38998, "uid": "5d746dc8f27fa915c65b235c63aedf89", "name": "Boosting Quantitive and Spatial Awareness for Zero-Shot Object Counting", "authors": [{"id": 186350, "fullname": "Da Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186350?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}, {"id": 186348, "fullname": "Bingyu Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/186348?format=json", "institution": "University of Science and Technology of China"}, {"id": 189292, "fullname": "Feiyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189292?format=json", "institution": "Fudan University"}, {"id": 186351, "fullname": "Zhiyuan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186351?format=json", "institution": "China Telecom"}, {"id": 155750, "fullname": "Junyu Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/155750?format=json", "institution": "Northwest Polytechnical University Xi&#x27;an"}], "abstract": "Zero-shot object counting (ZSOC) aims to enumerate objects of arbitrary categories specified by text descriptions without requiring visual exemplars. However, existing methods often treat counting as a coarse retrieval task, suffering from a lack of fine-grained quantity awareness. Furthermore, they frequently exhibit spatial insensitivity and degraded generalization due to feature space distortion during model adaptation. To address these challenges, we present **QICA**, a novel framework that synergizes quantity perception with robust spatial cast aggregation. Specifically, we introduce a Synergistic Prompting Strategy (**SPS**) that adapts vision and language encoders through numerically conditioned prompts, bridging the gap between semantic recognition and quantitative reasoning. To mitigate feature distortion, we propose a Cost Aggregation Decoder (**CAD**) that operates directly on vision-text similarity maps. By refining these maps through spatial aggregation, CAD prevents overfitting while preserving zero-shot transferability. Additionally, a multi-level quantity alignment loss ($\\mathcal{L}_{MQA}$) is employed to enforce numerical consistency across the entire pipeline. Extensive experiments on FSC-147 demonstrate competitive performance, while zero-shot evaluation on CARPK and ShanghaiTech-A validates superior generalization to unseen domains. Code is provided in the appendix.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38998", "url": null, "sourceid": 36064, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38172, "uid": "4b4d2998331ccb9a1abe9eae8d46e242", "name": "Self-Paced and Self-Corrective Masked Prediction for Movie Trailer Generation", "authors": [{"id": 189204, "fullname": "Sidan Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189204?format=json", "institution": "Beijing Institute of Technology"}, {"id": 86287, "fullname": "Hongteng Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/86287?format=json", "institution": "Renmin University of China"}, {"id": 126459, "fullname": "Dixin Luo", "url": "http://cvpr.thecvf.com/api/miniconf/users/126459?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "As a challenging video editing task, movie trailer generation involves selecting and reorganizing movie shots to create engaging trailers. Currently, most existing automatic trailer generation methods employ a \"selection-then-ranking\" paradigm (i.e., first selecting key shots and then ranking them), which suffers from inevitable error propagation and limits the quality of the generated trailers. Beyond this paradigm, we propose a new self-paced and self-corrective masked prediction method called SSMP, which achieves state-of-the-art results in automatic trailer generation via bi-directional contextual modeling and progressive self-correction. In particular, SSMP trains a Transformer encoder that takes the movie shot sequences as prompts and generates corresponding trailer shot sequences accordingly. The model is trained via masked prediction, reconstructing each trailer shot sequence from its randomly masked counterpart. The mask ratio is self-paced, allowing the task difficulty to adapt to the model and thereby improving model performance.When generating a movie trailer, the model fills the shot positions with high confidence at each step and re-masks the remaining positions for the next prediction, forming a progressive self-correction mechanism that is analogous to how human editors work. Both quantitative results and user studies demonstrate the superiority of SSMP in comparison to existing automatic movie trailer generation methods.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38172", "url": null, "sourceid": 44893, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40258, "uid": "b55d7ce2adb9449fc4dae6115cbbe30f", "name": "SegMoTE: Token-Level Mixture of Experts for Medical Image Segmentation", "authors": [{"id": 180871, "fullname": "Yujie Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180871?format=json", "institution": "Sichuan University"}, {"id": 152266, "fullname": "Jingwen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152266?format=json", "institution": "Xinjiang University"}, {"id": 184941, "fullname": "Sibo Ju", "url": "http://cvpr.thecvf.com/api/miniconf/users/184941?format=json", "institution": "Fuzhou University"}, {"id": 152267, "fullname": "Yanzhou Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/152267?format=json", "institution": "Alibaba Group"}, {"id": 103350, "fullname": "He Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/103350?format=json", "institution": "Guizhou University"}, {"id": 173464, "fullname": "Yisong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/173464?format=json", "institution": "Sichuan University"}, {"id": 152268, "fullname": "Min Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152268?format=json", "institution": "Sichuan University"}, {"id": 152260, "fullname": "Junlong Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/152260?format=json", "institution": "Sichuan University"}], "abstract": "Medical image segmentation is vital for clinical diagnosis and quantitative analysis, yet remains challenging due to the heterogeneity of imaging modalities and the high cost of pixel-level annotations. Although general interactive segmentation models like SAM have achieved remarkable progress, their transfer to medical imaging still faces two key bottlenecks: (i) the lack of adaptive mechanisms for modality- and anatomy-specific tasks, which limits generalization in out-of-distribution medical scenarios; and (ii) current medical adaptation methods fine-tune on large, heterogeneous datasets without selection, leading to noisy supervision, higher cost, and negative transfer. To address these issues, we propose SegMoTE, an efficient and adaptive framework for medical image segmentation. SegMoTE preserves SAM\u2019s original prompt interface, efficient inference, and zero-shot generalization while introducing only a small number of learnable parameters to dynamically adapt across modalities and tasks. In addition, we design a progressive prompt tokenization mechanism that enables fully automatic segmentation, significantly reducing annotation dependence. Trained on MedSeg-HQ, a curated dataset less than 1% of existing large-scale datasets, SegMoTE achieves SOTA performance across diverse imaging modalities and anatomical tasks. It represents the first efficient, robust, and scalable adaptation of general segmentation models to the medical domain under extremely low annotation cost, advancing the practical deployment of foundation vision models in clinical applications.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40258", "url": null, "sourceid": -41258, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/36392?format=json"], "related_events_ids": [36392]}, {"id": 36392, "uid": "b55d7ce2adb9449fc4dae6115cbbe30f", "name": "SegMoTE: Token-Level Mixture of Experts for Medical Image Segmentation", "authors": [{"id": 180871, "fullname": "Yujie Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180871?format=json", "institution": "Sichuan University"}, {"id": 152266, "fullname": "Jingwen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/152266?format=json", "institution": "Xinjiang University"}, {"id": 184941, "fullname": "Sibo Ju", "url": "http://cvpr.thecvf.com/api/miniconf/users/184941?format=json", "institution": "Fuzhou University"}, {"id": 152267, "fullname": "Yanzhou Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/152267?format=json", "institution": "Alibaba Group"}, {"id": 103350, "fullname": "He Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/103350?format=json", "institution": "Guizhou University"}, {"id": 173464, "fullname": "Yisong Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/173464?format=json", "institution": "Sichuan University"}, {"id": 152268, "fullname": "Min Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/152268?format=json", "institution": "Sichuan University"}, {"id": 152260, "fullname": "Junlong Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/152260?format=json", "institution": "Sichuan University"}], "abstract": "Medical image segmentation is vital for clinical diagnosis and quantitative analysis, yet remains challenging due to the heterogeneity of imaging modalities and the high cost of pixel-level annotations. Although general interactive segmentation models like SAM have achieved remarkable progress, their transfer to medical imaging still faces two key bottlenecks: (i) the lack of adaptive mechanisms for modality- and anatomy-specific tasks, which limits generalization in out-of-distribution medical scenarios; and (ii) current medical adaptation methods fine-tune on large, heterogeneous datasets without selection, leading to noisy supervision, higher cost, and negative transfer. To address these issues, we propose SegMoTE, an efficient and adaptive framework for medical image segmentation. SegMoTE preserves SAM\u2019s original prompt interface, efficient inference, and zero-shot generalization while introducing only a small number of learnable parameters to dynamically adapt across modalities and tasks. In addition, we design a progressive prompt tokenization mechanism that enables fully automatic segmentation, significantly reducing annotation dependence. Trained on MedSeg-HQ, a curated dataset less than 1% of existing large-scale datasets, SegMoTE achieves SOTA performance across diverse imaging modalities and anatomical tasks. It represents the first efficient, robust, and scalable adaptation of general segmentation models to the medical domain under extremely low annotation cost, advancing the practical deployment of foundation vision models in clinical applications.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36392", "url": null, "sourceid": 41258, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40258?format=json"], "related_events_ids": [40258]}, {"id": 39544, "uid": "a792113bb3e7dbb10067989ea36f38cf", "name": "Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection", "authors": [{"id": 192309, "fullname": "Dacheng Qi", "url": "http://cvpr.thecvf.com/api/miniconf/users/192309?format=json", "institution": "Beijing University of Aeronautics and Astronautics; TranscEngram"}, {"id": 192310, "fullname": "Chenyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192310?format=json", "institution": "University of Hong Kong"}, {"id": 192311, "fullname": "Jingwei Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192311?format=json", "institution": "ShanghaiTech University"}, {"id": 192312, "fullname": "Tianzhe Chu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192312?format=json", "institution": "University of Hong Kong"}, {"id": 186525, "fullname": "Zibo Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186525?format=json", "institution": "Tencent"}, {"id": 91015, "fullname": "Wen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/91015?format=json", "institution": "Tencent PCG"}, {"id": 190351, "fullname": "Wenrui Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/190351?format=json", "institution": "Beihang University"}, {"id": 106959, "fullname": "Yi Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/106959?format=json", "institution": "UC Berkeley"}, {"id": 153801, "fullname": "Shenghua Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/153801?format=json", "institution": "University of Hong Kong"}], "abstract": "Constructing computer-aided design (CAD) models is labor-intensive but essential for engineering and manufacturing. Recent advances in Large Language Models (LLMs) have inspired the LLM-based CAD generation by representing CAD as command sequences. But these methods struggle in practical scenarios because command sequence representation does not support entity selection (e.g. faces or edges), limiting its ability to support complex editing operations such as chamfer or fillet. Further, the discretization of a continuous variable during sketch and extrude operations may result in topological errors. To address these limitations, we present Pointer-CAD, a novel LLM-based CAD generation framework that leverages a pointer-based command sequence representation to explicitly incorporate the geometric information of B-rep models into sequential modeling. In particular, Pointer-CAD decomposes CAD model generation into steps, conditioning the generation of each subsequent step on both the textual description and the B-rep generated from previous steps. Whenever an operation requires the selection of a specific geometric entity, the LLM predicts a Pointer that selects the most feature-consistent candidate from the available set. Such a selection operation also reduces the quantization error in the command sequence-based representation. To support the training of Pointer-CAD, we develop a data annotation pipeline that produces expert-level natural language descriptions and apply it to build a dataset of approximately 575K CAD models. Extensive experimental results demonstrate that Pointer-CAD effectively supports the generation of complex geometric structures and reduces segmentation error to an extremely low level, achieving a significant improvement over prior command sequence methods, thereby significantly mitigating the topological inaccuracies introduced by quantization error.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39544", "url": null, "sourceid": 32507, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39650, "uid": "80a694e39b3619bc4ca3d38b851ef8d6", "name": "DuetMerging: Synergizing Dynamic and Static Strategies for Mitigating Task Interference in Model Merging", "authors": [{"id": 155510, "fullname": "Yan Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/155510?format=json", "institution": "Pengcheng Laboratory"}, {"id": 192564, "fullname": "Guiping Cao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192564?format=json", "institution": "Pengcheng Laboratory"}, {"id": 126240, "fullname": "Yaguang Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/126240?format=json", "institution": "Peng Cheng Laboratory"}, {"id": 76805, "fullname": "Ming Tao", "url": "http://cvpr.thecvf.com/api/miniconf/users/76805?format=json", "institution": "Pengcheng Laboratory"}, {"id": 192565, "fullname": "Haoran Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/192565?format=json", "institution": "Harbin Institute of Technology"}, {"id": 192566, "fullname": "Jun-Hui Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/192566?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 87840, "fullname": "Yaowei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87840?format=json", "institution": "Pengcheng Laboratory"}, {"id": 128613, "fullname": "Dongmei Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/128613?format=json", "institution": "Peng Cheng Laboratory"}], "abstract": "Model merging offers a promising paradigm for consolidating multiple expert models into a single multitask architecture. However, its effectiveness is often hindered by task interference, where conflicting parameter updates from different tasks degrade performance. While dynamic, Mixture-of-Experts based methods have improved adaptability, they are fundamentally limited by constructing their expert pools from task vectors in isolation, failing to resolve underlying structural conflicts across tasks. In this paper, we introduce DuetMerging, a novel framework that synergistically mitigates task interference from both dynamic and static perspectives. Dynamically, we apply Tucker decomposition to a unified tensor of task vectors, creating a harmonized expert pool derived from a shared core tensor that structurally enhances synergies and suppresses conflicts. Statically, we introduce a neuron-based sparsification technique that leverages task-specific neuron activation patterns to create a precise mask. This allows us to selectively preserve critical information from the decomposition's residual while suppressing functionally irrelevant or conflicting parameters. Comprehensive experiments demonstrate that DuetMerging outperforms existing methods, establishing a new state-of-the-art in both task performance and parameter efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39650", "url": null, "sourceid": 33628, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38923, "uid": "705fc1d67ee7b0ad2a52828285ec34c8", "name": "Rejection Mixing: Fast Semantic Propagation of Mask Tokens for Efficient DLLM Inference", "authors": [{"id": 175015, "fullname": "Yushi Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/175015?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 190972, "fullname": "Feng Hong", "url": "http://cvpr.thecvf.com/api/miniconf/users/190972?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 168310, "fullname": "Huangjie Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/168310?format=json", "institution": "Apple"}, {"id": 190973, "fullname": "Xu Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190973?format=json", "institution": "Alibaba Group"}, {"id": 190974, "fullname": "Zhiyong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190974?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 126281, "fullname": "Yanfeng Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126281?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 86992, "fullname": "Jiangchao Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/86992?format=json", "institution": "Shanghai Jiaotong University"}], "abstract": "Diffusion Large Language Models (DLLMs) promise fast non-autoregressive inference but suffer a severe quality and speed tradeoff in parallel decoding. This stems from the \"combinatorial contradiction\" phenomenon, where parallel tokens form semantically inconsistent combinations. We address this by integrating continuous representations into the discrete decoding process, as they preserve rich inter-position dependency. We propose ReMix (Rejection Mixing), a framework that introduces a novel Continuous Mixing State as an intermediate between the initial masked state and the final decoded token state. This intermediate state allows a token's representation to be iteratively refined in a continuous space, resolving mutual conflicts with other tokens before collapsing into a final discrete sample. Furthermore, a rejection rule reverts uncertain representations from the continuous state back to the masked state for reprocessing, ensuring stability and preventing error propagation.ReMix thus mitigates combinatorial contradictions by enabling continuous-space refinement during discrete diffusion decoding. Extensive experiments demonstrate that ReMix, as a training-free method, achieves a 2-8$\\times$ inference speedup without any quality degradation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38923", "url": null, "sourceid": 37521, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38960, "uid": "dfd31d6afb2c290d77300ccfea5fe766", "name": "From Remember to Transfer: Interpretable Open-World Reasoning in MLLMs", "authors": [{"id": 183401, "fullname": "Chenghao Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/183401?format=json", "institution": "UESTC"}, {"id": 191060, "fullname": "Jun Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/191060?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 191061, "fullname": "Songbo Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191061?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 191062, "fullname": "HuaDong Jian", "url": "http://cvpr.thecvf.com/api/miniconf/users/191062?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 191063, "fullname": "Hao Ni", "url": "http://cvpr.thecvf.com/api/miniconf/users/191063?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 191064, "fullname": "LIK-HANG LEE", "url": "http://cvpr.thecvf.com/api/miniconf/users/191064?format=json", "institution": "Hong Kong Polytechnic University"}, {"id": 165635, "fullname": "SUNG BAE BAE", "url": "http://cvpr.thecvf.com/api/miniconf/users/165635?format=json", "institution": "Kyung Hee University"}, {"id": 191065, "fullname": "Guoqing Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191065?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 127337, "fullname": "Yang Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127337?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 191066, "fullname": "Chaoning Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/191066?format=json", "institution": "University of Electronic Science and Technology of China"}], "abstract": "Multimodal agents, such as JARVIS-1, are rapidly advancing in open-world environments. Their core workflow typically follows a perception\u2013reasoning\u2013action\u2013memory cycle. Existing studies primarily emphasize improving memory representations and storage formats, treating memory mainly as an information repository. However, distilling transferable knowledge from stored experiences remains an important yet underexplored challenge.In real-world settings, structures and patterns tend to recur. If an agent can capture and reuse these latent patterns, it can infer new actionable knowledge from prior experience, enabling more efficient and flexible task execution. To explore this capability, we propose Echo. Echo decomposes knowledge into five explicit dimensions of transferability: structure, attribute, process, function, and interaction. Based on this formulation, Echo leverages In-Context Analogy Learning (ICAL) to effectively retrieve past experiences and generalize them to new tasks.Experiments show that, under a from-scratch learning setting, Echo achieves a 1.3\u00d7\u20131.7\u00d7 speed-up in object-unlocking tasks. Moreover, Echo exhibits a burst-like chain-unlocking phenomenon, rapidly unlocking multiple similar items within a short time interval. These results demonstrate that robust knowledge transfer, driven by effective utilization of contextual examples, is a highly promising direction for advancing open-world multimodal agents.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38960", "url": null, "sourceid": 37840, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38714, "uid": "751973642cd1d3b56fa7a37e71ffde50", "name": "DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching", "authors": [{"id": 179949, "fullname": "Chang Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/179949?format=json", "institution": "Shanghai Jiao Tong University &amp; Tencent Hunyuan"}, {"id": 128929, "fullname": "Changlin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/128929?format=json", "institution": "SeeKoo"}, {"id": 190511, "fullname": "Songtao Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190511?format=json", "institution": "Tencent AI Lab"}, {"id": 190512, "fullname": "Zhao Zhong", "url": "http://cvpr.thecvf.com/api/miniconf/users/190512?format=json", "institution": "Tencent"}, {"id": 190513, "fullname": "Kailin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190513?format=json", "institution": "Wuhan University"}, {"id": 87643, "fullname": "Linfeng Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87643?format=json", "institution": ", Tsinghua University"}], "abstract": "While diffusion models have achieved great success in the field of video generation, this progress is accompanied by a rapidly escalating computational burden.Among the existing acceleration methods, Feature Caching is popular due to its training-free property and considerable speedup performance,but it inevitably faces semantic and detail drop with further compression. Another widely adopted method, training-aware step-distillation, though successful in image generation, also faces drastic degradation in video generation with a few steps. Furthermore, the quality loss becomes more severe when simply applying training-free feature caching to the step-distilled models, due to the sparser sampling steps. This paper novelly introduces a distillation-compatible learnable feature caching mechanism for the first time. We employ a lightweight learnable neural predictor instead of traditional training-free heuristics for diffusion models, enabling a more accurate capture of the high-dimensional feature evolution process. Furthermore, we explore the challenges of highly compressed distillation on large-scale video models and propose a conservative Restricted MeanFlow approach to achieve more stable and lossless distillation. By undertaking these initiatives, we further push the acceleration boundaries to $11.8\\times$ while preserving generation quality. Extensive experiments demonstrate the effectiveness of our method. The code is in the supplementary materials and will be publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38714", "url": null, "sourceid": 32101, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36789, "uid": "bd8b4dc6f31f0c04377e23f09e764426", "name": "Modeling Cross-vision Synergy for Unified Large Vision Model", "authors": [{"id": 131572, "fullname": "Shengqiong Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/131572?format=json", "institution": "National University of Singapore"}, {"id": 185877, "fullname": "Lanhu Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185877?format=json", "institution": "National University of Singapore"}, {"id": 185878, "fullname": "Mingyang Bao", "url": "http://cvpr.thecvf.com/api/miniconf/users/185878?format=json", "institution": "National University of Singapore"}, {"id": 185879, "fullname": "Wenhao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185879?format=json", "institution": "National University of Singapore"}, {"id": 91500, "fullname": "Hanwang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/91500?format=json", "institution": "Nanyang Technological University"}, {"id": 106922, "fullname": "Shuicheng Yan", "url": "http://cvpr.thecvf.com/api/miniconf/users/106922?format=json", "institution": "National University of Singapore, Department of Electrical and Computer Engineering"}, {"id": 178983, "fullname": "Hao Fei", "url": "http://cvpr.thecvf.com/api/miniconf/users/178983?format=json", "institution": "National University of Singapore"}, {"id": 88927, "fullname": "Tat-seng Chua", "url": "http://cvpr.thecvf.com/api/miniconf/users/88927?format=json", "institution": "National University of Singapore"}], "abstract": "Recent advances in large vision models (LVMs) have shifted from modality-specific designs toward unified architectures that jointly process images, videos, and 3D data. However, existing unified LVMs primarily pursue functional integration, while overlooking the deeper goal of cross-vision synergy: the ability to reason over complementary priors across visual modalities. To address this, we present PolyV, a unified LVM that achieves cross-vision synergy at both the architectural and training levels. Architecturally, PolyV adopts a sparse Mixture-of-Experts LVM coordinated by a dynamic modality router, allowing each expert to specialize in modality-specific priors while enabling bidirectional interaction and mutual refinement across modalities. Training-wise, a synergy-aware paradigm combines modality-specific pretraining with coarse-to-fine synergy tuning via knowledge distillation and object-/relation-level alignment. Extensive experiments on ten benchmarks spanning image, video, and 3D understanding, including synergy-focused datasets requiring spatial or temporal priors, demonstrate that PolyV consistently outperforms existing models, achieving over 10\\% average improvement over its backbone. Overall, PolyV establishes a unified framework for synesthetic visual reasoning, advancing toward truly synergistic large vision models.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36789", "url": null, "sourceid": 39343, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37674, "uid": "68c7db132a6299fb17d7ff6bc384a52f", "name": "Bayesian Decomposition and Semantic Completion for Few-shot Semantic Segmentation", "authors": [{"id": 180848, "fullname": "Guangchen Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/180848?format=json", "institution": "Nanjing University"}, {"id": 187987, "fullname": "Yirui Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/187987?format=json", "institution": "hohai University"}, {"id": 187988, "fullname": "Zhu Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/187988?format=json", "institution": null}, {"id": 187989, "fullname": "Tao Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187989?format=json", "institution": "vivo Mobile Communication Co., Ltd."}, {"id": 126669, "fullname": "Hao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/126669?format=json", "institution": "vivo Mobile Communication \uff08Hangzhou\uff09Co., Ltd"}, {"id": 87210, "fullname": "Bo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/87210?format=json", "institution": "vivo Mobile Communication Co.,Ltd."}, {"id": 88504, "fullname": "Tong Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/88504?format=json", "institution": "Nanjing University"}], "abstract": "Few-shot Semantic Segmentation (FSS) aims to segment objects of novel categories given only a handful of labeled examples. However, existing methods often rely on complex  category-specific modeling, resulting in high computational cost and limited generalization under low-data regimes. To address these challenges, we propose a Bayesian Probabilistic Network (BPNet) that reformulates FSS as a composition of three interpretable components: a prior, a likelihood, and a class-consistency term. Specifically, an efficient Segment Anything Model (SAM) is employed to generate fragmented prior regions for the query image, while both the likelihood and the consistency terms are estimated by a lightweight Class-Agnostic Localization Model (CALM). CALM simultaneously predicts the class consistency between support-query pairs through a binary classification head and estimates the likelihood by localizing the target region in the support image. By evaluating SAM-generated regions in parallel, CALM can efficiently identify the core region, thereby transforming the segmentation problem into a simple binary classification task. Furthermore, to mitigate the semantic incompleteness of SAM proposals, we introduce an attention-based Semantic Completion Module (SCM), which leverages local and global context cues to integrate fragmented regions into semantically complete masks. Extensive experiments demonstrate that BPNet achieves state-of-the-art performance while maintaining high efficiency.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37674", "url": null, "sourceid": 31190, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36810, "uid": "116c87e094144d0002cff857e56a322f", "name": "FreeArtGS: Articulated Gaussian Splatting Under Free-moving Scenario", "authors": [{"id": 180156, "fullname": "Hang Dai", "url": "http://cvpr.thecvf.com/api/miniconf/users/180156?format=json", "institution": "Center on Frontiers of Computing Studies, Peking University"}, {"id": 185934, "fullname": "Hongwei Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/185934?format=json", "institution": "Peking University"}, {"id": 185935, "fullname": "Han Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/185935?format=json", "institution": "Zhejiang University"}, {"id": 185936, "fullname": "Duojin Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185936?format=json", "institution": "Peking University"}, {"id": 155831, "fullname": "Jiyao Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/155831?format=json", "institution": "Peking University"}, {"id": 76571, "fullname": "Hao Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/76571?format=json", "institution": "Peking University"}], "abstract": "The increasing need for augmented reality and robotics is urging for articulated object reconstruction with high scalability. However, the existing settings of reconstructing from discrete articulation states or casual monocular video need non-trivial axes alignment or suffer from insufficient coverage, limiting the applications. In this paper, we introduce FreeArtGS, a novel method for reconstructing articulated objects under free-moving scenario, a new setting with a simpler setup and high scalability. FreeArtGS combines free-moving part segmentation with joint estimation and end-to-end optimization, taking only a monocular RGB-D video as input. By optimizing with the priors from off-the-shelf point-tracking and feature models, free-moving part segmentation discovers rigid parts from relative motion in unconstrained capture. The joint estimation module proposes a noise-resistant approach to recover joint type and axis robustly from part segmentation. Finally, 3DGS-based end-to-end optimization is implemented to jointly reconstruct visual textures, geometry and joint angles of the articulated object. We perform experiments on two benchmarks and real-world free-moving articulated objects. Experiments show that FreeArtGS consistently outperforms prior methods in free-moving articulated object reconstruction and remains competitive in the similar previous setting, underscoring the potential of FreeArtGS to serve as an engine for realistic articulated asset building. Code and data will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36810", "url": null, "sourceid": 39281, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39279, "uid": "aa827be8f6b291a77a8bf45f2bdbac78", "name": "Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision", "authors": [{"id": 126260, "fullname": "Hyunsoo Cha", "url": "http://cvpr.thecvf.com/api/miniconf/users/126260?format=json", "institution": "Seoul National University"}, {"id": 191755, "fullname": "Wonjung Woo", "url": "http://cvpr.thecvf.com/api/miniconf/users/191755?format=json", "institution": "Seoul National University"}, {"id": 132210, "fullname": "Byungjun Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/132210?format=json", "institution": "Seoul National University"}, {"id": 98572, "fullname": "Hanbyul Joo", "url": "http://cvpr.thecvf.com/api/miniconf/users/98572?format=json", "institution": "Seoul National University"}], "abstract": "We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which often results in identity drift, garment distortion, and front\u2013back inconsistency. Our model addresses these issues by performing the entire process in a single unified step to achieve coherent synthesis. To enable this setting, we construct large-scale triplet supervision.  Our data generation pipeline includes generating identity-preserving human images in alternative outfits that differ from garment catalog images, capturing full upper and lower garment triplets to overcome the single-garment\u2013posed video pair limitation, and assembling diverse in-the-wild triplets without requiring garment catalog images. We further introduce a Dual Module architecture for video diffusion transformers to stabilize training, preserve pretrained generative quality, and improve garment accuracy, pose adherence, and identity preservation while supporting zero-shot garment interpolation. Together, these contributions allow Vanast to produce high-fidelity, identity-consistent animation across a wide range of garment types.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39279", "url": null, "sourceid": 32502, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36851, "uid": "aa2ecd855649906462f4ed8ccc6cd84b", "name": "Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset", "authors": [{"id": 70941, "fullname": "Qingyan Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/70941?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 93859, "fullname": "Qiuyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/93859?format=json", "institution": "Ant Group"}, {"id": 89797, "fullname": "Hao Ouyang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89797?format=json", "institution": "Department of Computer Science and Engineering, Hong Kong University of Science and Technology"}, {"id": 156838, "fullname": "Yue Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156838?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 186021, "fullname": "Hanlin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186021?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 127962, "fullname": "Wen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127962?format=json", "institution": "Zhejiang University"}, {"id": 91740, "fullname": "Ka Leong Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/91740?format=json", "institution": "HKUST"}, {"id": 69405, "fullname": "Shuailei Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/69405?format=json", "institution": "Northeastern University"}, {"id": 94115, "fullname": "Yanhong Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/94115?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 151683, "fullname": "Zichen Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/151683?format=json", "institution": "HKUST"}, {"id": 91631, "fullname": "Yinghao Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/91631?format=json", "institution": "Stanford University"}, {"id": 88128, "fullname": "Yujun Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88128?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 87711, "fullname": "Qifeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/87711?format=json", "institution": "Hong Kong University of Science and Technology"}], "abstract": "Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence.  Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy.  The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing. We will release our dataset and models for reproducibility.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36851", "url": null, "sourceid": 39599, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38771, "uid": "d9f53a87e571214d24718690a39f67e9", "name": "HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives", "authors": [{"id": 151629, "fullname": "Yihao Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/151629?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 89797, "fullname": "Hao Ouyang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89797?format=json", "institution": "Department of Computer Science and Engineering, Hong Kong University of Science and Technology"}, {"id": 156838, "fullname": "Yue Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/156838?format=json", "institution": "Hong Kong University of Science and Technology"}, {"id": 93859, "fullname": "Qiuyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/93859?format=json", "institution": "Ant Group"}, {"id": 127962, "fullname": "Wen Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127962?format=json", "institution": "Zhejiang University"}, {"id": 91740, "fullname": "Ka Leong Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/91740?format=json", "institution": "HKUST"}, {"id": 186021, "fullname": "Hanlin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/186021?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 69405, "fullname": "Shuailei Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/69405?format=json", "institution": "Northeastern University"}, {"id": 71750, "fullname": "Yixuan LI", "url": "http://cvpr.thecvf.com/api/miniconf/users/71750?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 127834, "fullname": "Chen Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/127834?format=json", "institution": "Nanyang Technological University"}, {"id": 94115, "fullname": "Yanhong Zeng", "url": "http://cvpr.thecvf.com/api/miniconf/users/94115?format=json", "institution": "Shanghai AI Laboratory"}, {"id": 186273, "fullname": "Xing Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186273?format=json", "institution": "Ant Group"}, {"id": 88128, "fullname": "Yujun Shen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88128?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 149628, "fullname": "Huamin Qu", "url": "http://cvpr.thecvf.com/api/miniconf/users/149628?format=json", "institution": "Hong Kong University of Science and Technology"}], "abstract": "State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating the coherent, multi-shot narratives, which are the essence of storytelling. We bridge this \"narrative gap\" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) ensures the efficiency required for minute-scale generation. Beyond setting a new state-of-the-art in narrative coherence, HoloCine develops remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38771", "url": null, "sourceid": 32523, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40106, "uid": "def0f1c91a4533cc9c0d5262ba754644", "name": "Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion", "authors": [{"id": 193546, "fullname": "Yu Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/193546?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 193547, "fullname": "Longjun Gao", "url": "http://cvpr.thecvf.com/api/miniconf/users/193547?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 193548, "fullname": "Yuanqi Su", "url": "http://cvpr.thecvf.com/api/miniconf/users/193548?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 193549, "fullname": "HaoAng Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193549?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 193550, "fullname": "Xiaoning Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/193550?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "Monocular Semantic Scene Completion (SSC) aims to reconstruct complete 3D semantic scenes from a single RGB image, offering a cost-effective solution for autonomous driving and robotics. However, the inherently imbalanced nature of voxel distributions\u2014where over 93% of voxels are empty and foreground classes are rare\u2014poses significant challenges. Existing methods often suffer from redundant emphasis on uninformative voxels and poor generalization to long-tailed categories. To address these issues, we propose VoxSAMNet (Voxel Sparsity-Aware Modulation Network), a unified framework that explicitly models voxel sparsity and semantic imbalance. Our approach introduces: (1) a Dummy Shortcut for Feature Refinement (DSFR) module that bypasses empty voxels via a shared dummy node while refining occupied ones with deformable attention; (2) a Foreground Modulation Strategy combining Foreground Dropout (FD) and Text-Guided Image Filter (TGIF) to alleviate overfitting and enhance class-relevant features. Extensive experiments on the public benchmarks SemanticKITTI and SSCBench-KITTI-360 demonstrate that VoxSAMNet achieves state-of-the-art performance, surpassing prior monocular and stereo baselines with mIoU scores of 18.2% and 20.2%, respectively. Our results highlight the importance of sparsity-aware and semantics-guided design for efficient and accurate 3D scene completion, offering a promising direction for future research.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40106", "url": null, "sourceid": 37263, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38803, "uid": "f19346a09e58f197a731f3c062aeef4b", "name": "$\\alpha$Matte4K & $\\mu$Matting: Dataset and Model for Ultra-Micro Precision Alpha Video Matting", "authors": [{"id": 190715, "fullname": "Xinyi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/190715?format=json", "institution": "East China Normal University"}, {"id": 190716, "fullname": "Hang Dong", "url": "http://cvpr.thecvf.com/api/miniconf/users/190716?format=json", "institution": "Kuaishou Technology"}, {"id": 89300, "fullname": "Baowei Jiang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89300?format=json", "institution": "QiYuan Lab"}, {"id": 184608, "fullname": "Shenkun Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184608?format=json", "institution": null}, {"id": 190717, "fullname": "Youqi Guan", "url": "http://cvpr.thecvf.com/api/miniconf/users/190717?format=json", "institution": "Beijing Normal University"}, {"id": 130537, "fullname": "Kanle Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/130537?format=json", "institution": "Kuaishou Technology"}, {"id": 156268, "fullname": "Kun Gai", "url": "http://cvpr.thecvf.com/api/miniconf/users/156268?format=json", "institution": "Kuaishou- \u5feb\u624b\u79d1\u6280"}, {"id": 152615, "fullname": "Haichuan Song", "url": "http://cvpr.thecvf.com/api/miniconf/users/152615?format=json", "institution": "East China Normal University"}], "abstract": "High-resolution human video matting aims to predict accurate alpha mattes for semi-transparent regions while ensuring temporal consistency across frames.Despite notable progress, existing research remains limited by the insufficient quality of datasets, including (1) inaccurate alpha fractional values resulting from imperfect annotation, and (2) visual inconsistencies arising from arbitrary foreground-background compositions that lack natural coherence.In this paper, we introduce $\\alpha$Matte4K, a large-scale 4K-resolution human video matting dataset, which achieves accurate annotations and physical consistency through physically based rendering (PBR).From model perspective, constrained by computational costs, current methods often up-sample alpha outputs to meet target resolutions that unavoidably diminishes precision.To overcome this critical limitation, we introduce $\\mu$Matting, a innovative resolution-agnostic two-stage matting framework for video matting: (1) coarse matte localization using a portrait-aware masked autoencoder; (2) refinement of critical regions via sparse 3D convolution, augmented by a temporal modulator that injects global spatio-temporal cues for enhanced consistency and contextual awareness. Extensive experiments show that $\\alpha$Matte4K boosts baseline performance, while $\\mu$Matting surpasses state-of-the-art methods in accuracy and spatio-temporal consistency, driving applications in real-world scenarios.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38803", "url": null, "sourceid": 41500, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36670, "uid": "e88ea88793f3891e2587415cb90c27d6", "name": "LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs", "authors": [{"id": 185611, "fullname": "Behzad Bozorgtabar", "url": "http://cvpr.thecvf.com/api/miniconf/users/185611?format=json", "institution": "Aarhus University ; Swiss Federal Institute of Technology Lausanne; University of Southern Denmark - SDU"}, {"id": 86979, "fullname": "Dwarikanath Mahapatra", "url": "http://cvpr.thecvf.com/api/miniconf/users/86979?format=json", "institution": "Inception Institute of AI, UAE"}, {"id": 72355, "fullname": "Sudipta Roy", "url": "http://cvpr.thecvf.com/api/miniconf/users/72355?format=json", "institution": "Jio Institute"}, {"id": 73853, "fullname": "Muzammal Naseer", "url": "http://cvpr.thecvf.com/api/miniconf/users/73853?format=json", "institution": "MBZUAI"}, {"id": 158752, "fullname": "Imran Razzak", "url": "http://cvpr.thecvf.com/api/miniconf/users/158752?format=json", "institution": "MBZUAI, Abu Dhabi"}, {"id": 185612, "fullname": "Zongyuan Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/185612?format=json", "institution": "Monash University"}], "abstract": "Medical vision-language models (VLMs) are strong zero-shot recognizers for medical imaging, but their reliability under domain shift hinges on calibrated uncertainty with guarantees. Split conformal prediction (SCP) offers finite-sample coverage, yet prediction sets often become large (low efficiency) and class-wise coverage unbalanced\u2014high class-conditioned coverage gap (CCV), especially in few-shot, imbalanced regimes; moreover, naively adapting to calibration labels breaks exchangeability and voids guarantees. We propose $\\texttt{\\textbf{LATA}}$ (Laplacian-Assisted Transductive Adaptation), a $\\textit{training- and label-free}$ refinement that operates on the joint calibration and test pool by smoothing zero-shot probabilities over an image\u2013image $k$NN graph using a small number of CCCP mean-field updates, preserving SCP validity via a deterministic transform. We further introduce a $\\textit{failure-aware}$ conformal score that plugs into the vision-language uncertainty (ViLU) framework, providing instance-level difficulty and label plausibility to improve prediction set efficiency and class-wise balance at fixed coverage. $\\texttt{\\textbf{LATA}}$ is black-box (no VLM updates), compute-light (windowed transduction, no backprop), and includes an optional prior knob that can run strictly label-free or, if desired, in a label-informed variant using calibration marginals once. Across $\\textbf{three}$ medical VLMs and $\\textbf{nine}$ downstream tasks, $\\texttt{\\textbf{LATA}}$ consistently reduces set size and CCV while matching or tightening target coverage, outperforming prior transductive baselines and narrowing the gap to label-using methods, while using far less compute. Comprehensive ablations and qualitative analyses show that $\\texttt{\\textbf{LATA}}$ sharpens zero-shot predictions without compromising exchangeability.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36670", "url": null, "sourceid": 43224, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/40270?format=json"], "related_events_ids": [40270]}, {"id": 40270, "uid": "e88ea88793f3891e2587415cb90c27d6", "name": "LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs", "authors": [{"id": 185611, "fullname": "Behzad Bozorgtabar", "url": "http://cvpr.thecvf.com/api/miniconf/users/185611?format=json", "institution": "Aarhus University ; Swiss Federal Institute of Technology Lausanne; University of Southern Denmark - SDU"}, {"id": 86979, "fullname": "Dwarikanath Mahapatra", "url": "http://cvpr.thecvf.com/api/miniconf/users/86979?format=json", "institution": "Inception Institute of AI, UAE"}, {"id": 72355, "fullname": "Sudipta Roy", "url": "http://cvpr.thecvf.com/api/miniconf/users/72355?format=json", "institution": "Jio Institute"}, {"id": 73853, "fullname": "Muzammal Naseer", "url": "http://cvpr.thecvf.com/api/miniconf/users/73853?format=json", "institution": "MBZUAI"}, {"id": 158752, "fullname": "Imran Razzak", "url": "http://cvpr.thecvf.com/api/miniconf/users/158752?format=json", "institution": "MBZUAI, Abu Dhabi"}, {"id": 185612, "fullname": "Zongyuan Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/185612?format=json", "institution": "Monash University"}], "abstract": "Medical vision-language models (VLMs) are strong zero-shot recognizers for medical imaging, but their reliability under domain shift hinges on calibrated uncertainty with guarantees. Split conformal prediction (SCP) offers finite-sample coverage, yet prediction sets often become large (low efficiency) and class-wise coverage unbalanced\u2014high class-conditioned coverage gap (CCV), especially in few-shot, imbalanced regimes; moreover, naively adapting to calibration labels breaks exchangeability and voids guarantees. We propose $\\texttt{\\textbf{LATA}}$ (Laplacian-Assisted Transductive Adaptation), a $\\textit{training- and label-free}$ refinement that operates on the joint calibration and test pool by smoothing zero-shot probabilities over an image\u2013image $k$NN graph using a small number of CCCP mean-field updates, preserving SCP validity via a deterministic transform. We further introduce a $\\textit{failure-aware}$ conformal score that plugs into the vision-language uncertainty (ViLU) framework, providing instance-level difficulty and label plausibility to improve prediction set efficiency and class-wise balance at fixed coverage. $\\texttt{\\textbf{LATA}}$ is black-box (no VLM updates), compute-light (windowed transduction, no backprop), and includes an optional prior knob that can run strictly label-free or, if desired, in a label-informed variant using calibration marginals once. Across $\\textbf{three}$ medical VLMs and $\\textbf{nine}$ downstream tasks, $\\texttt{\\textbf{LATA}}$ consistently reduces set size and CCV while matching or tightening target coverage, outperforming prior transductive baselines and narrowing the gap to label-using methods, while using far less compute. Comprehensive ablations and qualitative analyses show that $\\texttt{\\textbf{LATA}}$ sharpens zero-shot predictions without compromising exchangeability.", "topic": null, "keywords": [], "decision": "Accept (Oral)", "session": "", "eventtype": "Oral", "event_type": "Oral", "room_name": null, "virtualsite_url": "/virtual/2026/oral/40270", "url": null, "sourceid": -43224, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": ["http://cvpr.thecvf.com/api/miniconf/events/36670?format=json"], "related_events_ids": [36670]}, {"id": 38223, "uid": "c52376a1820e868235b1851b87492a39", "name": "Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data", "authors": [{"id": 188962, "fullname": "Xinlin Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188962?format=json", "institution": "East China Normal University; Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 132272, "fullname": "feilong tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/132272?format=json", "institution": "Monash University"}, {"id": 188964, "fullname": "Haolin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188964?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 188963, "fullname": "Xiwei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188963?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 153129, "fullname": "Ming Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/153129?format=json", "institution": "Monash University"}, {"id": 148627, "fullname": "Huifa Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/148627?format=json", "institution": "East China Normal University"}, {"id": 158745, "fullname": "Haochen Xue", "url": "http://cvpr.thecvf.com/api/miniconf/users/158745?format=json", "institution": "University of Liverpool"}, {"id": 86671, "fullname": "Junjun He", "url": "http://cvpr.thecvf.com/api/miniconf/users/86671?format=json", "institution": "Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences"}, {"id": 185612, "fullname": "Zongyuan Ge", "url": "http://cvpr.thecvf.com/api/miniconf/users/185612?format=json", "institution": "Monash University"}, {"id": 130667, "fullname": "Yichen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/130667?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 188971, "fullname": "Ying Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/188971?format=json", "institution": "East China Normal University"}, {"id": 158752, "fullname": "Imran Razzak", "url": "http://cvpr.thecvf.com/api/miniconf/users/158752?format=json", "institution": "MBZUAI, Abu Dhabi"}], "abstract": "Supervised Fine-Tuning (SFT) plays a pivotal role in adapting Large Language Models (LLMs) to specialized domains such as medical reasoning. However, existing SFT practices often rely on unfiltered datasets that contain redundant and low-quality samples, leading to substantial computational costs and suboptimal performance. Although existing methods attempt to alleviate this problem by selecting data based on sample difficulty, defined by \\textit{knowledge} and \\textit{reasoning} complexity, they overlook each sample's optimization utility reflected in its gradient. Interestingly, we find that gradient-based influence alone favors easy-to-optimize samples that cause large parameter shifts but lack deep reasoning chains, while difficulty alone selects noisy or overly complex cases that fail to guide stable optimization. Based on this observation, we propose a data selection strategy, \\textit{\\textbf{D}ifficulty-\\textbf{I}nfluence \\textbf{Q}uadrant} \\textbf{(DIQ)}, which prioritizes samples in the \u201chigh-difficulty\u2013high-influence\u201d quadrant to balance complex clinical reasoning with substantial gradient influence, enabling efficient medical reasoning with minimal fine-tuning data. Furthermore, Human and LLM-as-a-judge evaluations show that DIQ-selected subsets demonstrate higher data quality and generate clinical reasoning that is more aligned with expert practices in \\textit{differential diagnosis}, \\textit{safety check}, and \\textit{evidence citation}, as DIQ emphasizes samples that foster expert-like reasoning patterns. Extensive experiments on medical reasoning benchmarks demonstrate that DIQ enables models fine-tuned on only \\textbf{1\\%} of selected data to match full-dataset performance, while using \\textbf{10\\%} consistently outperforms baseline methods, highlighting the superiority of principled data selection over brute-force scaling.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38223", "url": null, "sourceid": 32297, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38066, "uid": "017326cd529b9126139a294e682c0495", "name": "CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection", "authors": [{"id": 188962, "fullname": "Xinlin Zhuang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188962?format=json", "institution": "East China Normal University; Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 130667, "fullname": "Yichen Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/130667?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 188963, "fullname": "Xiwei Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188963?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 188964, "fullname": "Haolin Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188964?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 188965, "fullname": "Yifan Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188965?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 188966, "fullname": "Ziyun Zou", "url": "http://cvpr.thecvf.com/api/miniconf/users/188966?format=json", "institution": null}, {"id": 188967, "fullname": "Yulong Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/188967?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 148627, "fullname": "Huifa Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/148627?format=json", "institution": "East China Normal University"}, {"id": 188968, "fullname": "Dongliang Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188968?format=json", "institution": "East China Normal University"}, {"id": 188969, "fullname": "Qinglei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/188969?format=json", "institution": null}, {"id": 188970, "fullname": "Weiyang Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188970?format=json", "institution": "Department of Computer Science and Engineering, The Chinese University of Hong Kong"}, {"id": 188971, "fullname": "Ying Qian", "url": "http://cvpr.thecvf.com/api/miniconf/users/188971?format=json", "institution": "East China Normal University"}, {"id": 142668, "fullname": "Jiangming Shi", "url": "http://cvpr.thecvf.com/api/miniconf/users/142668?format=json", "institution": "Xiamen University"}, {"id": 158752, "fullname": "Imran Razzak", "url": "http://cvpr.thecvf.com/api/miniconf/users/158752?format=json", "institution": "MBZUAI, Abu Dhabi"}], "abstract": "Adapting CLIP to vertical domains is typically approached by novel fine-tuning strategies or by scaling up domain-specific datasets.  We revisit this task from a data-centric perspective: {Can effective data selection substitute for large-scale datasets in continual pre-training (CPT)?} We introduce \\textbf{CHIPS} (\\textbf{C}urvature-aware \\textbf{H}ybrid \\textbf{I}nfluence in \\textbf{P}rojection \\textbf{S}ubspace), which assigns each image\u2013text pair a utility that integrates three complementary factors aligned with three goals: \\textit{faithfulness} via a curvature-aware, Newton-style alignment computed in CLIP\u2019s end-point subspace; \\textit{scalability} via an InfoNCE-aware curvature estimator with Johnson\u2013Lindenstrauss (JL) sketching; and \\textit{retention} via a selection-aware relevance weight combined with learnability to balance target adaptation against general-domain preservation. We justify this design theoretically by proving a lower\u2011bound guarantee on the proxy\u2019s correlation with full\u2011parameter alignment and by characterizing the bias\u2013variance trade\u2011offs introduced by curvature mixing and JL sketching.We evaluate CHIPS empirically across various settings: 1) CHIPS attains state-of-the-art performance among selection baselines on 17 \\textbf{medical benchmarks}, matches full-dataset CPT with 30\\% of the data, and outperforms half-dataset CPT using only 10\\%; 2) on 31 \\textbf{general-domain benchmarks}, CHIPS yields the smallest performance drop under 10--30\\% data-retention budgets. Code, data, and model checkpoints will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38066", "url": null, "sourceid": 42688, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36607, "uid": "2367a2216a3ec74c8c6dd02123836612", "name": "DualReg: Dual-Space Filtering and Reinforcement for Rigid Registration", "authors": [{"id": 185452, "fullname": "Jiayi Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/185452?format=json", "institution": "University of Science and Technology of China"}, {"id": 152046, "fullname": "Yuxin Yao", "url": "http://cvpr.thecvf.com/api/miniconf/users/152046?format=json", "institution": "City University of Hong Kong"}, {"id": 185453, "fullname": "Qiuhang Lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/185453?format=json", "institution": "Chinese Academy of Sciences; University of Chinese Academy of Sciences"}, {"id": 127405, "fullname": "Juyong Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/127405?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Noisy, partially overlapping data and the need for real-time processing pose major challenges for rigid registration. Considering that feature-based matching can handle large transformation differences but suffers from limited accuracy, while local geometry-based matching can achieve fine-grained local alignment but relies heavily on a good initial transformation, we propose a novel dual-space paradigm to fully leverage the strengths of both approaches. First, we introduce an efficient filtering mechanism that incorporates a computationally lightweight single-point RANSAC algorithm followed by a refinement module to eliminate unreliable feature-based correspondences. Subsequently, we treat filtered correspondences as anchor points, extract geometric proxies, and formulate an effective objective function with a tailored solver to estimate the transformation. Experiments verify our method's effectiveness, as shown by achieving up to a 32x CPU-time speedup over MAC on KITTI with comparable accuracy. The code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36607", "url": null, "sourceid": 32454, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37057, "uid": "be1f3b9abdc31feeeda082b2501c65f1", "name": "Seeing Beyond: Extrapolative Domain Adaptive Panoramic Segmentation", "authors": [{"id": 186579, "fullname": "Yuanfan Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/186579?format=json", "institution": "Hunan University"}, {"id": 70993, "fullname": "Kunyu Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/70993?format=json", "institution": "KIT"}, {"id": 87543, "fullname": "Xu Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87543?format=json", "institution": "HKUST"}, {"id": 89634, "fullname": "Kailun Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/89634?format=json", "institution": "Karlsruher Institut f\u00fcr Technologie"}], "abstract": "Cross-domain panoramic semantic segmentation has attracted growing interest as it enables comprehensive 360\u00b0 scene understanding for real-world applications. However, it remains particularly challenging due to severe geometric Field of View (FoV) distortions and inconsistent open-set semantics across domains. In this work, we formulate an open-set domain adaptation setting and propose the Extrapolative Domain Adaptive Panoramic Segmentation (EDA-PSeg) framework that trains on local perspective views and tests on full 360\u00b0 panoramic images, explicitly tackling both geometric FoV shifts across domains and semantic uncertainty arising from previously unseen classes. To this end, we propose the Euler-Margin Attention (EMA), which introduces an angular margin in the Euler rotation space, thereby enhancing viewpoint-invariant semantic representation for panoramic geometry and improving generalization across novel viewpoints. Additionally, we design the Graph Matching Adapter (GMA), which builds high-order graph relations to align shared semantics across FoV shifts while effectively separating novel categories through structural adaptation. Extensive experiments on four benchmark datasets under camera-shift, weather-condition, and open-set scenarios demonstrate that EDA-PSeg achieves state-of-the-art performance, robust generalization to diverse viewing geometries, and resilience under varying environmental conditions. The source code will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37057", "url": null, "sourceid": 45866, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37416, "uid": "7ab86630748a3c51b0a65481820bca38", "name": "StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars", "authors": [{"id": 181425, "fullname": "Zhiyao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/181425?format=json", "institution": "Tsinghua University"}, {"id": 101234, "fullname": "Ziqiao Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/101234?format=json", "institution": "Renmin University of China"}, {"id": 187394, "fullname": "Yifeng Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/187394?format=json", "institution": "Tencent"}, {"id": 155777, "fullname": "Yi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/155777?format=json", "institution": "Tencent"}, {"id": 175795, "fullname": "zhengguang zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/175795?format=json", "institution": "tencent"}, {"id": 127553, "fullname": "Zixiang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/127553?format=json", "institution": "xiaobing.ai"}, {"id": 71092, "fullname": "Guozhen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/71092?format=json", "institution": "Nanjing University"}, {"id": 187395, "fullname": "Youliang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187395?format=json", "institution": "Tsinghua University"}, {"id": 155471, "fullname": "Yuan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/155471?format=json", "institution": "Microsoft Research"}, {"id": 106459, "fullname": "qinglin lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/106459?format=json", "institution": "Tencent"}, {"id": 91120, "fullname": "Yong-Jin Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/91120?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Real-time, streaming interactive avatars represent a critical yet challenging goal in digital human research. Although diffusion-based human avatar generation methods achieve remarkable success, their non-causal architecture and high computational costs make them unsuitable for streaming. Moreover, existing interactive approaches are typically limited to head-and-shoulder region, limiting their ability to produce gestures and body motions. To address these challenges, we propose a two-stage autoregressive adaptation and acceleration framework that applies autoregressive distillation and adversarial refinement to adapt a high-fidelity human video diffusion model for real-time, interactive streaming. To ensure long-term stability and consistency, we introduce three key components: a Reference Sink, a Reference-Anchored Positional Re-encoding (RAPR) strategy, and a Consistency-Aware Discriminator. Building on this framework, we develop a one-shot, interactive, human avatar model capable of generating both natural talking and listening behaviors with coherent gestures. Extensive experiments demonstrate that our method achieves state-of-the-art performance, surpassing existing approaches in generation quality, real-time efficiency, and interaction naturalness.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37416", "url": null, "sourceid": 45629, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38922, "uid": "44d199164a5fd86d5e5cb29a5f8181f8", "name": "ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars", "authors": [{"id": 101234, "fullname": "Ziqiao Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/101234?format=json", "institution": "Renmin University of China"}, {"id": 155777, "fullname": "Yi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/155777?format=json", "institution": "Tencent"}, {"id": 187394, "fullname": "Yifeng Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/187394?format=json", "institution": "Tencent"}, {"id": 71092, "fullname": "Guozhen Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/71092?format=json", "institution": "Nanjing University"}, {"id": 181425, "fullname": "Zhiyao Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/181425?format=json", "institution": "Tsinghua University"}, {"id": 127553, "fullname": "Zixiang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/127553?format=json", "institution": "xiaobing.ai"}, {"id": 187395, "fullname": "Youliang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187395?format=json", "institution": "Tsinghua University"}, {"id": 175795, "fullname": "zhengguang zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/175795?format=json", "institution": "tencent"}, {"id": 153722, "fullname": "Zhaoxin Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153722?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}, {"id": 159540, "fullname": "Hongyan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/159540?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 155471, "fullname": "Yuan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/155471?format=json", "institution": "Microsoft Research"}, {"id": 106459, "fullname": "qinglin lu", "url": "http://cvpr.thecvf.com/api/miniconf/users/106459?format=json", "institution": "Tencent"}, {"id": 105706, "fullname": "Jun He", "url": "http://cvpr.thecvf.com/api/miniconf/users/105706?format=json", "institution": "Renmin University of China"}], "abstract": "Despite significant advances in talking avatar generation, existing methods face critical challenges: insufficient text-following capability for diverse actions, lack of temporal alignment between actions and audio content, and dependency on additional control signals such as pose skeletons. We present ActAvatar, a framework that achieves phase-level precision in action control through textual guidance by capturing both action semantics and temporal context. Our approach introduces three core innovations: (1) Phase-Aware Cross-Attention (PACA), which decomposes prompts into a global base block and temporally-anchored phase blocks, enabling the model to concentrate on phase-relevant tokens for precise temporal-semantic alignment; (2) Progressive Audio-Visual Alignment, which aligns modality influence with the hierarchical feature learning process\u2014early layers prioritize text for establishing action structure while deeper layers emphasize audio for refining lip movements, preventing modality interference; (3) A two-stage training strategy that first establishes robust audio-visual correspondence on diverse data, then injects action control through fine-tuning on structured annotations, maintaining both audio-visual alignment and the model\u2019s text-following capabilities. Extensive experiments demonstrate that ActAvatar significantly outperforms state-of-the-art methods in both action control and visual quality. The code will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38922", "url": null, "sourceid": 39008, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37710, "uid": "9717b5c8bd4b8dc15925b7d42a7a9c0d", "name": "Perceiving the Near, Reasoning the Distant: Coherent Long-Horizon Trajectory Prediction for Autonomous Driving", "authors": [{"id": 180525, "fullname": "Hua Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/180525?format=json", "institution": "City University of Hong Kong"}, {"id": 85278, "fullname": "Zikang Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/85278?format=json", "institution": "City University of Hong Kong"}, {"id": 188063, "fullname": "Qian Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/188063?format=json", "institution": "City University of Hong Kong"}, {"id": 157985, "fullname": "Zihao WEN", "url": "http://cvpr.thecvf.com/api/miniconf/users/157985?format=json", "institution": "City University of Hong Kong"}, {"id": 188064, "fullname": "Junjie Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/188064?format=json", "institution": "Jiangsu University"}, {"id": 188065, "fullname": "Xinhong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/188065?format=json", "institution": "City University of Hong Kong"}, {"id": 188066, "fullname": "Zhengmin JIANG", "url": "http://cvpr.thecvf.com/api/miniconf/users/188066?format=json", "institution": "City University of Hong Kong"}, {"id": 85269, "fullname": "Yung-Hui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/85269?format=json", "institution": "Hon Hai Research Institute"}, {"id": 72103, "fullname": "Jianping Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/72103?format=json", "institution": "City University of Hong Kong"}], "abstract": "Reliable long-horizon trajectory prediction requires both high positional accuracy and physically plausible temporal motion consistency. However, existing methods suffer from two fundamental limitations. First, they overlook the inherent difference in prediction logic: near-future trajectories are primarily governed by historical dynamics, whereas distant-future behaviors are driven by high-level semantic context. Yet, most methods employ a unified decoding pathway that blurs the temporal distinction.Second, although the near future is relatively easier to predict, existing methods lack mechanisms for coherent trajectory propagation across time horizons, often resulting in kinematically implausible predictions with inconsistent heading evolution and degraded long-horizon performance. To address these challenges, we propose NDPNet, a dual-stage architecture that decouples near- and distant-horizon modeling into specialized pathways, with a dedicated transition module ensuring smooth temporal bridging. Furthermore, we introduce a novel motion-aware coherence loss that explicitly embeds kinematic priors to enforce trajectory consistency. Extensive experiments show that NDPNet achieves SOTA performance on Argoverse 2 and WOMD. Notably, on WOMD, it ranks 1$^{\\text{st}}$ in both minFDE${}_6$ and minADE${}_6$ across all standard horizons (3s, 5s, 8s) without ensemble learning or NMS post-processing, and is the first to achieve sub-1.75 minFDE${}_6$ for 8s prediction, surpassing prior methods by a large margin. The code will be released subsequently.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37710", "url": null, "sourceid": 33506, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36980, "uid": "a719fa25b2a1d8f332ce512e145d4e83", "name": "Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models", "authors": [{"id": 181881, "fullname": "Hulingxiao He", "url": "http://cvpr.thecvf.com/api/miniconf/users/181881?format=json", "institution": "Peking University"}, {"id": 186371, "fullname": "Zhi Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/186371?format=json", "institution": "Peking University Shenzhen Graduate School"}, {"id": 87023, "fullname": "Yuxin Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/87023?format=json", "institution": "Peking University"}], "abstract": "A high-performing, general-purpose visual understanding model should map visual inputs to a taxonomic tree of labels, identify novel categories beyond the training set for which few or no publicly available images exist. Large Multimodal Models (LMMs) have achieved remarkable progress in fine-grained visual recognition (FGVR) for known categories. However, they remain limited in hierarchical visual recognition (HVR) that aims at predicting consistent label paths from coarse to fine categories, especially for novel categories. To tackle these challenges, we propose Taxonomy-Aware Representation Alignment (TARA), a simple yet effective strategy to inject taxonomic knowledge into LMMs. TARA leverages representations from biology foundation models (BFMs) that encode rich biological relationships through hierarchical contrastive learning. By aligning the intermediate representations of visual features with those of BFMs, LMMs are encouraged to extract discriminative visual cues well structured in the taxonomy tree. Additionally, we align the representations of the first answer token with the ground-truth label, flexibly bridging the gap between contextualized visual features and categories of varying granularity according to user intent. Experiments demonstrate that \\ours~consistently enhances LMMs\u2019 hierarchical consistency and leaf node accuracy, enabling reliable recognition of both known and novel categories within complex biological taxonomies.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36980", "url": null, "sourceid": 46316, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36337, "uid": "5124ec804c4633ad5e127f3f9543bc10", "name": "Multi-Paradigm Collaborative Adversarial Attack Against Multimodal Large Language Models", "authors": [{"id": 181295, "fullname": "Yuanbo Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/181295?format=json", "institution": "Jiangnan University"}, {"id": 75854, "fullname": "Tianyang Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/75854?format=json", "institution": "Jiangnan University"}, {"id": 184809, "fullname": "Cong Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/184809?format=json", "institution": "Jiangnan University"}, {"id": 184810, "fullname": "Tao Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/184810?format=json", "institution": "Nanjing University of Science and Technology"}, {"id": 129533, "fullname": "Xiaojun Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/129533?format=json", "institution": "Jiangnan University"}, {"id": 154654, "fullname": "Josef Kittler", "url": "http://cvpr.thecvf.com/api/miniconf/users/154654?format=json", "institution": "University of Surrey"}], "abstract": "The rapid progress of Multi-Modal Large Language Models (MLLMs) has significantly advanced downstream applications.However, transferable adversarial threats are thus exposed in the public community.In general, existing adversarial attacks against MLLMs typically rely on surrogate models trained within a single learning paradigm and perform independent optimisation in their respective feature spaces.This straightforward setting naturally restricts the richness of feature representations, delivering limits on the search space and thus impeding the diversity of adversarial perturbations.To address this, we propose a novel Multi-Paradigm Collaborative Attack (MPCAttack) framework to boost the transferability of adversarial examples against MLLMs.In principle, MPCAttack aggregates semantic representations, from both visual images and language texts, to facilitate joint adversarial optimisation on the aggregated features through a Multi-Paradigm Collaborative Optimisation (MPCO) strategy.By performing contrastive matching on multi-paradigm features, MPCO adaptively balances the importance of different paradigm representations and guides the global perturbation optimisation, effectively alleviating the representation bias.Extensive experimental results on multiple benchmarks demonstrate the superiority of MPCAttack, indicating that our solution consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs.The code will be released here.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36337", "url": null, "sourceid": 35520, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40117, "uid": "c845225d5c3e731efe343997f03eee08", "name": "Critical Patch-Aware Sparse Prompting with Decoupled Training for Continual Learning on the Edge", "authors": [{"id": 193569, "fullname": "Wonseon Lim", "url": "http://cvpr.thecvf.com/api/miniconf/users/193569?format=json", "institution": "Chung-Ang University"}, {"id": 193570, "fullname": "Jaesung Lee", "url": "http://cvpr.thecvf.com/api/miniconf/users/193570?format=json", "institution": "Chung-Ang University"}, {"id": 193571, "fullname": "Dae-Won Kim", "url": "http://cvpr.thecvf.com/api/miniconf/users/193571?format=json", "institution": "Chung-Ang University"}], "abstract": "Continual learning (CL) on edge devices requires not only high accuracy but also training-time efficiency to support on-device model adaptation under limited memory and compute resources. While prompt-based continual learning (PCL) achieves strong performance with few learnable parameters, existing studies primarily optimize accuracy or inference efficiency, overlooking the cost of on-device training. In this paper, we propose CPS-Prompt, a critical patch-aware sparse prompting framework that enhances training efficiency with minimal accuracy loss by combining Critical Patch Sampling (CPS) for task-aware token selection and Decoupled Prompt\u2013Classifier Training (DPCT) for representation alignment. Extensive experiments across three public datasets demonstrate that CPS-Prompt reduces peak memory usage and training time by 36\\% and 35\\%, respectively, while maintaining accuracy within 2\\% of the state-of-the-art method, C-Prompt, and matching the balanced CODA-Prompt baseline.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40117", "url": null, "sourceid": 37783, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39883, "uid": "5a32bed7d2d1228c472c5f751ac20c0e", "name": "Beyond Myopic Alignment: Lookahead Optimization for Online Class-Incremental Learning", "authors": [{"id": 182105, "fullname": "Song Lai", "url": "http://cvpr.thecvf.com/api/miniconf/users/182105?format=json", "institution": "City University of Hong Kong"}, {"id": 186148, "fullname": "Zhe Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/186148?format=json", "institution": "University of Science and Technology of China"}, {"id": 193047, "fullname": "Fei Zhu", "url": "http://cvpr.thecvf.com/api/miniconf/users/193047?format=json", "institution": "Centre for Artificial Intelligence and Robotics Hong Kong Institute of Science &amp; Innovation, Chinese Academy of Sciences"}, {"id": 193048, "fullname": "Ji Cheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/193048?format=json", "institution": "City University of Hong Kong"}, {"id": 188885, "fullname": "Xi Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/188885?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 73898, "fullname": "Qingfu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/73898?format=json", "institution": "City University of Hong Kong"}, {"id": 89628, "fullname": "Gaofeng Meng", "url": "http://cvpr.thecvf.com/api/miniconf/users/89628?format=json", "institution": "Institute of automation, Chinese academy of science, Chinese Academy of Sciences"}], "abstract": "Rehearsal-based methods are the cornerstone of modern online class-incremental learning (OCIL), yet they face a fundamental challenge: the gradient of the current task often conflicts with that of the rehearsal data from the memory buffer, leading to catastrophic forgetting. Recent works have implicitly addressed this by using hypergradients, but the underlying mechanism has remained poorly understood. In this paper, we first provide a formal analysis revealing that hypergradients mitigate forgetting by aligning task-specific gradients towards a common meta-objective, thereby reducing their conflict. However, we argue that this conflict-reducing alignment is inherently myopic\u2014it only considers the immediate gradient directions, failing to account for the loss landscape geometry just one step ahead. To overcome this limitation, we introduce a novel framework: Lookahead Optimization for Rehearsal (LOR). Instead of committing to a single update, LOR first explores a set of potential future model states by taking lookahead steps along different directions that balance plasticity and stability. To ensure the final update is robust, we formulate the optimization as a min-max problem, seeking parameters that perform well even under the worst-case lookahead scenario. This objective is made tractable by a smooth Log-Sum-Exp approximation, enabling efficient end-to-end training. Theoretical analysis from both optimization and statistical perspectives corroborates the robustness of our approach. Extensive experiments on Seq-CIFAR10, Seq-CIFAR100, and Seq-TinyImageNet demonstrate that LOR significantly outperforms state-of-the-art methods, establishing a new and more robust paradigm for rehearsal-based OCIL.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39883", "url": null, "sourceid": 43590, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37403, "uid": "66c3f86ac905496e4ab21a2bc5fb33f1", "name": "BuildAnyPoint: 3D Building Structured Abstraction from Diverse Point Clouds", "authors": [{"id": 103612, "fullname": "Tongyan Hua", "url": "http://cvpr.thecvf.com/api/miniconf/users/103612?format=json", "institution": "HKUST(GZ)"}, {"id": 187333, "fullname": "Haoran Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187333?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 77389, "fullname": "Yuan Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/77389?format=json", "institution": "The University of Hong Kong"}, {"id": 182570, "fullname": "Di Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182570?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 88283, "fullname": "Ying-Cong Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88283?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 178398, "fullname": "Wufan Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/178398?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}], "abstract": "We introduce BuildAnyPoint, a novel generative framework for structured 3D building reconstruction from point clouds with diverse distributions, such as those captured by airborne LiDAR and Structure-from-Motion.To recover artist-created building abstraction in this highly underconstrained setting, we capitalize on the role of explicit 3D generative priors in autoregressive mesh generation.Specifically, we design a Loosely Cascaded Diffusion Transformer (Loca-DiT) that initially recovers the underlying distribution from noisy or sparse points, followed by autoregressively encapsulating them into compact meshes.We first formulate distribution recovery as a conditional generation task by training latent diffusion models conditioned on input point clouds, and then tailor a decoder-only transformer for conditional autoregressive mesh generation based on the recovered point clouds.Our method delivers substantial qualitative and quantitative improvements over prior building abstraction methods. Furthermore, the effectiveness of our approach is evidenced by the strong performance of its recovered point clouds on building point cloud completion benchmarks, which exhibit improved surface accuracy and distribution uniformity.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37403", "url": null, "sourceid": 35130, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39599, "uid": "3fd35dfcc8e0ce7cca31532a92d6701c", "name": "AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation", "authors": [{"id": 180011, "fullname": "Haoyue Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/180011?format=json", "institution": "University of Science and Technology of China"}, {"id": 192448, "fullname": "Shengnan Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192448?format=json", "institution": null}, {"id": 192449, "fullname": "Yulin Qiao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192449?format=json", "institution": "University of Macau"}, {"id": 178067, "fullname": "juncheng zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/178067?format=json", "institution": "Chinese University of Hong Kong"}, {"id": 192450, "fullname": "Youhui Bai", "url": "http://cvpr.thecvf.com/api/miniconf/users/192450?format=json", "institution": "University of Science and Technology of China"}, {"id": 192451, "fullname": "Ping Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/192451?format=json", "institution": "University of Science and Technology of China"}, {"id": 192452, "fullname": "Zewen Jin", "url": "http://cvpr.thecvf.com/api/miniconf/users/192452?format=json", "institution": "University of Science and Technology of China"}, {"id": 192453, "fullname": "Cheng Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/192453?format=json", "institution": null}], "abstract": "Video diffusion transformers (DiTs) suffer from prohibitive inference latency due to quadratic attention complexity. Existing sparse attention methods either overlook semantic similarity, or fail to adapt to heterogeneous token distributions across layers, leading to model performance degradation. We propose AdaCluster, a training\u2011free adaptive clustering framework that accelerates the generation of DiTs while preserving accuracy. AdaCluster applies an angle-similarity preserving clustering method to query vectors for higher compression, and designs a euclidean-similarity preserving clustering method for keys, covering cluster number assignment, threshold-wise adaptive clustering, and efficient critical cluster selection. Experiments on CogVideoX\u20112B, HunyuanVideo, and Wan\u20112.1 via one A40 GPU demonstrate up to 1.67x-4.31x speedup with negligible quality degradation.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39599", "url": null, "sourceid": 46101, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37393, "uid": "98ff5000b5b118e27d8a4d8ea171d6dd", "name": "CompetitorFormer: Mitigating Query Conflicts for 3D Instance Segmentation via Competitive Strategy", "authors": [{"id": 144805, "fullname": "wang duanchu", "url": "http://cvpr.thecvf.com/api/miniconf/users/144805?format=json", "institution": "Xi&#x27;an University of Electronic Science and Technology"}, {"id": 187332, "fullname": "Junjie Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187332?format=json", "institution": "University of Chinese Academy of Sciences"}, {"id": 187333, "fullname": "Haoran Gong", "url": "http://cvpr.thecvf.com/api/miniconf/users/187333?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 142874, "fullname": "Jing Liu", "url": "http://cvpr.thecvf.com/api/miniconf/users/142874?format=json", "institution": "Xi&#x27;an Jiaotong University"}, {"id": 182570, "fullname": "Di Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182570?format=json", "institution": "Xi&#x27;an Jiaotong University"}], "abstract": "Transformer-based approaches have recently become the dominant paradigm for 3D instance segmentation. These methods typically employ a multi-layer decoder that iteratively refines a set of learnable queries into instance mask predictions. However, we observe that multiple queries often target the same instance simultaneously, leading to fragmented masks for a single object.  We define this phenomenon as \\emph{inter-query competition}, which slows convergence and limits segmentation accuracy. To address this problem, we present \\textbf{CompetitorFormer}, a novel framework designed for Transformer-based methods. Our method mitigates inter-query competition by explicitly modeling the competitive relationships among queries. Specifically, we introduce a \\emph{Query Competition Layer} before each decoder stage to construct a dynamic competitive landscape, allowing each query to perceive its relative importance. In addition, the proposed \\emph{Relative Relationship Encoding} and \\emph{Rank Cross-Attention} modules enhance both self-attention and cross-attention by prioritizing dominant queries. Extensive experiments show that our approach converges faster and achieves superior performance on the ScanNetV2, ScanNet++V2, ScanNet200, and S3DIS datasets.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37393", "url": null, "sourceid": 42915, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38799, "uid": "1f034ade6c58fc442a66e4b2b71abbf8", "name": "Seeing What Matters: A Training-Free Self-Guided Framework for Multimodal Detail Perception and Reasoning", "authors": [{"id": 180014, "fullname": "Mingjie Ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/180014?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 144149, "fullname": "yichao ma", "url": "http://cvpr.thecvf.com/api/miniconf/users/144149?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 190696, "fullname": "Zhong Yang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190696?format=json", "institution": "Huazhong University of Science and Technology"}, {"id": 190697, "fullname": "Guohui Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190697?format=json", "institution": "Huazhong University of Science and Technology"}], "abstract": "Multimodal large language models (MLLMs) have achieved remarkable success on diverse visual-language tasks. However, fixed-resolution models face challenges in perceiving fine-grained visual details, particularly due to *distracted attention* and *blurry vision*. To address these issues, we propose **SLoFo**, a training-free and self-guided inference framework that mimics the human \"**S**can-**Lo**cate-**Fo**cus\" process. SLoFo first adopts a dual-branch mechanism to identify critical image regions: the Semantic branch constructs a gradient-based semantic relevance map, and the Structure branch estimates visual token uniqueness offering complementary and robust evidence. By combining both branches, SLoFo perceives and explicitly crop critical regions. During inference, with additional cropped sub-image, SLoFo applies a progressive visual token pruning strategy to improve attention focus on key areas while reducing computational overhead. Experiments on detail-sensitive and general-purpose benchmarks show that SLoFo consistently improves accuracy (+4.79% on TextVQA, +2.62% on GQA) and robustness (+4.60% on POPE-MSCOCO adversarial) without training or external modules.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38799", "url": null, "sourceid": 31686, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 36962, "uid": "8c10f4923dd9a1bb3ea9d8b41c6f40e3", "name": "Hint2Gen: Bridging Understanding and Generation via Code-structured Hints", "authors": [{"id": 182144, "fullname": "Yuanpeng Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182144?format=json", "institution": "Linkin"}, {"id": 186320, "fullname": "Yunpeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186320?format=json", "institution": "ByteDance Inc."}, {"id": 88507, "fullname": "Xi Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/88507?format=json", "institution": "the University of Hong Kong, University of Hong Kong"}, {"id": 106809, "fullname": "Liang Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/106809?format=json", "institution": null}, {"id": 87814, "fullname": "Hengshuang Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/87814?format=json", "institution": "The University of Hong Kong"}], "abstract": "Recent unified models have made remarkable strides in generating high-quality images, yet they consistently fail on reasoning-intensive tasks, i.e., solving mazes, assembling tangrams. Intriguingly, we find that vision-language models (VLMs) and large language models (LLMs) can accurately solve these tasks, but cannot generate the corresponding images because they lack a structured visual output interface. This reveals that the core bottleneck is not reasoning capacity, but the lack of a structured interface to translate high-level reasoning into precise visual output. To bridge this gap, we propose using code-structured visual hints (i.e., SVG/HTML) overlays that explicitly encode reasoning steps directly on the image plane. Accordingly, we develop an automatic data construction pipeline that can generate high-quality code-structured hints for existing datasets and train a unified model called Hint2Gen based on FLUX.1 Kontext to condition its generation on such hints. Furthermore, to comprehensively evaluate the effectiveness of our approach, we introduce Reason2Gen, a benchmark comprising 4,000 samples spanning 20 categories across 7 core dimensions, including path connectivity, spatial assembly, etc. Extensive experiments demonstrate that even simply providing such hints as extra inputs\u2014without any retraining\u2014boosts their performance. And our model significantly outperforms all leading open-source/closed-source methods on reasoning-aware generation and editing across all the dimensions.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/36962", "url": null, "sourceid": 37822, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39823, "uid": "c3c80d6099f861d4199ca0372b9f96c3", "name": "Temporal Equilibrium MeanFlow: Bridging the Scale Gap for One-Step Generation", "authors": [{"id": 182144, "fullname": "Yuanpeng Tu", "url": "http://cvpr.thecvf.com/api/miniconf/users/182144?format=json", "institution": "Linkin"}, {"id": 186320, "fullname": "Yunpeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186320?format=json", "institution": "ByteDance Inc."}, {"id": 192925, "fullname": "Xinyu Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/192925?format=json", "institution": null}, {"id": 192926, "fullname": "Chao Liao", "url": "http://cvpr.thecvf.com/api/miniconf/users/192926?format=json", "institution": "ByteDance Inc."}, {"id": 87814, "fullname": "Hengshuang Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/87814?format=json", "institution": "The University of Hong Kong"}], "abstract": "MeanFlow is a powerful few-step generative framework that can be trained from scratch, but its performance degrades significantly when the one-step loss uses a large portion of training data. This stems from a temporal scale imbalance: gradients from different stages of generation contribute unevenly, leading to unstable optimization\u2014evident in blurry samples and high FID scores. The core issue is a conflict between two opposing forces: terms that amplify variance over long time spans and strong constraints needed near the start of generation, which a fixed sampling strategy cannot reconcile. To resolve this, we propose Temporal Equilibrium MeanFlow (TEMF), which balances these competing demands through two simple yet effective components: (1) a temporal equilibrium weighting function that equalizes gradient influence across all time scales, and (2) a dynamic boundary scheduler that gradually shifts training focus\u2014from stabilizing early steps to refining the full trajectory as training progresses. Without changing the model architecture, TEMF retains true one-step generation with classifier-free guidance, achieving a state-of-the-art FID of 2.62 on ImageNet 256\u00d7256\u2014achieving the best results among diffusion- and flow-based one-step methods.", "topic": null, "keywords": [], "decision": "Accept (Highlight)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39823", "url": null, "sourceid": 46777, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 37452, "uid": "c4ba6775e8258a19bc5b83859a26d6f3", "name": "Beyond Mimicry: Learning Whole-Body Human-Humanoid Interaction from Human-Human Demonstrations", "authors": [{"id": 157192, "fullname": "Wei-Jin Huang", "url": "http://cvpr.thecvf.com/api/miniconf/users/157192?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 187481, "fullname": "Yue-Yi Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/187481?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 129725, "fullname": "Yi-Lin Wei", "url": "http://cvpr.thecvf.com/api/miniconf/users/129725?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 157193, "fullname": "Zhi-Wei Xia", "url": "http://cvpr.thecvf.com/api/miniconf/users/157193?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 187482, "fullname": "Juantao Tan", "url": "http://cvpr.thecvf.com/api/miniconf/users/187482?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 150559, "fullname": "Yuan-Ming Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/150559?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 187483, "fullname": "Zhilin Zhao", "url": "http://cvpr.thecvf.com/api/miniconf/users/187483?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}, {"id": 86314, "fullname": "Wei-Shi Zheng", "url": "http://cvpr.thecvf.com/api/miniconf/users/86314?format=json", "institution": "SUN YAT-SEN UNIVERSITY"}], "abstract": "Enabling humanoid robots to physically interact with humans is a critical frontier, but progress is hindered by the scarcity of high-quality Human-Humanoid Interaction (HHoI) data. While leveraging abundant Human-Human Interaction (HHI) data presents a scalable alternative, we first demonstrate that standard retargeting fails by breaking the essential contacts. We address this with PAIR (Physics-Aware Interaction Retargeting), a contact-centric, two-stage pipeline that preserves contact semantics across morphology differences to generate physically consistent HHoI data.This high-quality data, however, exposes a second failure: conventional imitation learning policies merely mimic trajectories and lack interactive understanding. We therefore introduce D-STAR (Decoupled Spatio-Temporal Action Reasoner), a hierarchical policy that disentangles when to act from where to act. In D-STAR, Phase Attention (when) and a Multi-Scale Spatial module (where) are fused by the diffusion head to produce synchronized whole-body behaviors beyond mimicry.By decoupling these reasoning streams, our model learns robust temporal phases without being distracted by spatial noise, leading to responsive, synchronized collaboration. We validate our framework through extensive and rigorous simulations, demonstrating significant performance gains over baseline approaches and a complete, effective pipeline for learning complex whole-body interactions from HHI data.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/37452", "url": null, "sourceid": 46028, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38161, "uid": "fb628a9f2005fb3e8152ca8a66ca4515", "name": "CUBic: Coordinated Unified Bimanual Perception and Control Framework", "authors": [{"id": 189184, "fullname": "Xingyu Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/189184?format=json", "institution": "Institute of automation, Chinese academy of science"}, {"id": 187632, "fullname": "Pengxiang Ding", "url": "http://cvpr.thecvf.com/api/miniconf/users/187632?format=json", "institution": "Zhejiang University; Westlake University"}, {"id": 189185, "fullname": "Jingkai Xu", "url": "http://cvpr.thecvf.com/api/miniconf/users/189185?format=json", "institution": "Peking University"}, {"id": 90158, "fullname": "Donglin Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/90158?format=json", "institution": "Westlake University"}, {"id": 153722, "fullname": "Zhaoxin Fan", "url": "http://cvpr.thecvf.com/api/miniconf/users/153722?format=json", "institution": "Beijing University of Aeronautics and Astronautics"}], "abstract": "Recent advances in visuomotor policy learning have enabled robots to perform control directly from visual inputs. Yet, extending such end-to-end learning from single-arm to bimanual manipulation remains challenging due to the need for both independent perception and coordinated interaction between arms. Existing methods typically favor one side\u2014either decoupling the two arms to avoid interference or enforcing strong cross-arm coupling for coordination\u2014thus lacking a unified treatment. We propose CUBic, a Coordinated and Unified framework for Bimanual perception and control} that reformulates bimanual coordination as a unified perceptual modeling problem. CUBic learns a shared tokenized representation bridging perception and control, where independence and coordination emerge intrinsically from structure rather than from hand-crafted coupling. Our approach integrates three components: unidirectional perception aggregation, bidirectional perception coordination through two codebooks with shared mapping, and a unified perception-to-control diffusion policy. Extensive experiments on the RoboTwin benchmark show that CUBic consistently surpasses standard baselines, achieving marked improvements in coordination accuracy and task success rates over state-of-the-art visuomotor baselines.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38161", "url": null, "sourceid": 39491, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39734, "uid": "cb836769232f3d822643080cba2b2425", "name": "S$^{2}$FT: Parameter-Efficient Fine-Tuning in Sparse Spectrum Domain", "authors": [{"id": 86068, "fullname": "Baoquan Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/86068?format=json", "institution": ", Harbin Institute of Technology (shenzhen)"}, {"id": 154661, "fullname": "Zhehao Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/154661?format=json", "institution": "Harbin Institute of Technology"}, {"id": 76919, "fullname": "Lisai Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/76919?format=json", "institution": "Computer Science"}, {"id": 153110, "fullname": "Kenghong Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/153110?format=json", "institution": "Harbin Institute of Technology"}, {"id": 154659, "fullname": "Tianran Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/154659?format=json", "institution": "Harbin Institute of Technology"}, {"id": 192743, "fullname": "Yuxi Sun", "url": "http://cvpr.thecvf.com/api/miniconf/users/192743?format=json", "institution": "Shenzhen University"}, {"id": 127047, "fullname": "Yunming Ye", "url": "http://cvpr.thecvf.com/api/miniconf/users/127047?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 105501, "fullname": "Yao He", "url": "http://cvpr.thecvf.com/api/miniconf/users/105501?format=json", "institution": "Dalian University of Technology"}], "abstract": "Parameter Efficient Fine-Tuning (PEFT) is a key technique for adapting a large pretrained model to downstream tasks by fine-tuning only a small number of parameters. Recent methods based on Fourier transforms have further reduced the fine-tuned parameters scale by only fine-tuning a few spectral coefficients. Its basic assumption is that the weight change $\\Delta W$ is a spatial-domain matrix with a sparse spectrum. However, in this paper, we observe that the spectrum of weight change is not sparse, but instead distributed like power-uniform. This fact implies that fine-tuning only a few spectral coefficients is insufficient to accurately model the weight change $\\Delta W$ with uniform spectrum.To address this issue, we propose to seek an invertible transformation that can transform a latent spatial-domain matrix with sparse spectrum to the weight change, and then perform PEFT on such sparse spectrum domain with few spectral coefficients, called $\\text{S}^2\\text{FT}$. To seek such transformation, we first pre-estimate a coarse weight change as a prior. Then, inspired by that sparse spectrum often correspond to locally smooth spatial structures, we regard this transformation as a row and column rearrangement operation on the pre-estimated weight change that smooth spatial structures while keep the structure information of neurons.Finally, we propose to solve the rearrangement search problem in a simple nearest neighbor search manner, thereby obtaining the invertible transformation. Extensive results show our $\\text{S}^2\\text{FT}$ achieves superior performance by only using 0.08% training parameters.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39734", "url": null, "sourceid": 44154, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 40216, "uid": "11ccf58f042c0bfac410befd3513abcf", "name": "Fourier Angle Alignment for Oriented Object Detection in Remote Sensing", "authors": [{"id": 176075, "fullname": "Changyu Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/176075?format=json", "institution": "Beijing Institute of Technology"}, {"id": 129586, "fullname": "Linwei Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/129586?format=json", "institution": "Beijing Institute of Technology"}, {"id": 186056, "fullname": "Lin Gu", "url": "http://cvpr.thecvf.com/api/miniconf/users/186056?format=json", "institution": "Tohoku University"}, {"id": 73966, "fullname": "Ying Fu", "url": "http://cvpr.thecvf.com/api/miniconf/users/73966?format=json", "institution": "Beijing Institute of Technology"}], "abstract": "In remote sensing rotated object detection, mainstream methods suffer from two bottlenecks, directional incoherence at detector neck and task conflict at detecting head. Ulitising fourier rotation equivariance,  we introduce **Fourier Angle Alignment**, which analyses angle information through frequency spectrum and aligns the main direction to a certain orientation. Then we propose two plug and play modules : **FAAFusion** and **FAA Head**. FAAFusion works at the detector neck, aligning the main direction of higher-level features to the lower-level features and then fusing them. FAA Head serves as a new detection head, which pre-aligns RoI features to a canonical angle and adds them to the original features before classification and regression. Experiments on DOTA-v1.0, DOTA-v1.5 and HRSC2016 show that our method can greatly improve previous work. Particularly, our method achieves new state-of-the-art results of 78.72% mAP on DOTA-v1.0 and 72.28% mAP on DOTA-v1.5 datasets with single scale training and testing, validating the efficacy of our approach in remote sensing object detection. The code will be public upon acceptance.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/40216", "url": null, "sourceid": 42644, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 38730, "uid": "49da248b6bc129a75aa8019f4352e943", "name": "InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields", "authors": [{"id": 190535, "fullname": "Hao Yu", "url": "http://cvpr.thecvf.com/api/miniconf/users/190535?format=json", "institution": "Zhejiang University"}, {"id": 76399, "fullname": "Haotong Lin", "url": "http://cvpr.thecvf.com/api/miniconf/users/76399?format=json", "institution": "Zhejiang University"}, {"id": 190536, "fullname": "Jiawei Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/190536?format=json", "institution": "University of Electronic Science and Technology of China"}, {"id": 190537, "fullname": "Jiaxin Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190537?format=json", "institution": "College of Computer Science and Technology, Zhejiang University"}, {"id": 153216, "fullname": "Yida Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/153216?format=json", "institution": "Li Auto Inc."}, {"id": 150336, "fullname": "Xueyang Zhang", "url": "http://cvpr.thecvf.com/api/miniconf/users/150336?format=json", "institution": "Li Auto Inc."}, {"id": 184777, "fullname": "Yue Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/184777?format=json", "institution": "Zhejiang University"}, {"id": 76363, "fullname": "Xiaowei Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/76363?format=json", "institution": "Zhejiang University"}, {"id": 87066, "fullname": "Ruizhen Hu", "url": "http://cvpr.thecvf.com/api/miniconf/users/87066?format=json", "institution": "Shenzhen University"}, {"id": 76570, "fullname": "Sida Peng", "url": "http://cvpr.thecvf.com/api/miniconf/users/76570?format=json", "institution": "Zhejiang University"}], "abstract": "Existing depth estimation methods are fundamentally limited to predicting depth on discrete image grids. Such representations restrict their scalability to arbitrary output resolutions and hinder the geometric detail recovery. This paper introduces InfiniDepth, which represents depth as neural implicit fields. Through a simple yet effective local implicit decoder, we can query depth at continuous 2D coordinates, enabling arbitrary-resolution and fine-grained depth estimation. To better assess our method's capabilities, we curate a high-quality 4K synthetic benchmark from five different games, spanning diverse scenes with rich geometric and appearance details. Extensive experiments demonstrate that InfiniDepth achieves state-of-the-art performance on both synthetic and real-world benchmarks across relative and metric depth estimation tasks, particularly excelling in fine-detail regions. It also benefits the task of novel view synthesis under large viewpoint shifts, producing high-quality results with fewer holes and artifacts. Code and data will be made publicly available.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/38730", "url": null, "sourceid": 39811, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 39893, "uid": "a376ac5ee112e4c8553fc20504bdbc3c", "name": "Enhancing Spatial Understanding in Image Generation via Reward Modeling", "authors": [{"id": 182317, "fullname": "Zhenyu Tang", "url": "http://cvpr.thecvf.com/api/miniconf/users/182317?format=json", "institution": "Peking University"}, {"id": 180418, "fullname": "Chaoran Feng", "url": "http://cvpr.thecvf.com/api/miniconf/users/180418?format=json", "institution": "Peking University"}, {"id": 153805, "fullname": "Yufan Deng", "url": "http://cvpr.thecvf.com/api/miniconf/users/153805?format=json", "institution": "Peking University"}, {"id": 89921, "fullname": "Jie Wu", "url": "http://cvpr.thecvf.com/api/miniconf/users/89921?format=json", "institution": "ByteDance Inc."}, {"id": 190075, "fullname": "Xiaojie Li", "url": "http://cvpr.thecvf.com/api/miniconf/users/190075?format=json", "institution": "Tiktok"}, {"id": 87116, "fullname": "Rui Wang", "url": "http://cvpr.thecvf.com/api/miniconf/users/87116?format=json", "institution": "TikTok"}, {"id": 186320, "fullname": "Yunpeng Chen", "url": "http://cvpr.thecvf.com/api/miniconf/users/186320?format=json", "institution": "ByteDance Inc."}, {"id": 90195, "fullname": "Daquan Zhou", "url": "http://cvpr.thecvf.com/api/miniconf/users/90195?format=json", "institution": "National University of Singapore"}], "abstract": "Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also imposed higher demands on prompt complexity\u2014particularly in encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts.To address this challenge, we introduce a novel method that strengthens the spatial understanding of current image generation models. We first construct the SpatialReward-Dataset with over 80k preference pairs. Building on this dataset, we build SpatialScore, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation, achieving performance that even surpasses leading proprietary models on spatial evaluation.We further demonstrate that this reward model effectively enables online reinforcement learning for the complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in spatial understanding for image generation. All models and datasets will be released.", "topic": null, "keywords": [], "decision": "Accept (Poster)", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/39893", "url": null, "sourceid": 38491, "sourceurl": "https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}]}